Oracle RAC on Azure
Published Feb 20 2020 12:20 PM
Microsoft

*Updated 03/10/2022

This is a question I receive often, and although Oracle doesn't support RAC in any third-party cloud, it's an important topic as more workloads lift and shift to Azure, and there are real reasons both for and against making Real Application Clusters (RAC) part of them.  The only current option for RAC on Azure is FlashGrid, a third-party solution supported by its vendor rather than by Oracle.

The goal of this post is to push past the idea that a lift and shift should always be a 1:1 move.  When moving to the cloud, it's important to use the right tools, not just the tools you've always used, and that lesson matters a great deal when it comes to Oracle Real Application Clusters (RAC).

I'm going to repeat this once more: Oracle doesn't support RAC on Azure.  FlashGrid does have excellent support for their RAC solution in Azure, but that support comes through FlashGrid.  So let's start discussing why architecting for the cloud may be very different from architecting on-prem by dispelling some myths.

RAC is Not High Availability

This may be an unpopular opinion with many in the Oracle world, but RAC fails to meet several of the requirements for HA.  If a solution misses even one of them, do you really have a solution?

  • RAC does have rolling patches that eliminate some of the downtime for patching, but not ALL patches are delivered in this manner.  Patches are built by different teams at Oracle, and when a team didn't build a rolling patch for RAC, or didn't have enough time to, there will be downtime.
  • One of the biggest flaws in RAC for meeting HA requirements is that, by default, all nodes reside in one datacenter unless an extended distance cluster has been deployed.  In the cloud, that means all nodes sit in a single Availability Zone (AZ): if the AZ goes down, so do all the nodes and the shared storage, and you still need to fail over to a Data Guard standby for High Availability.
  • RAC has only one database, shared by multiple nodes, unlike Always On Availability Groups, which maintain multiple copies of the database.  RAC therefore doesn't protect against data corruption; that protection relies on Data Guard.

RAC was architected for scalability and instance resiliency, which it does very well, but in the default deployment a datacenter failure results in the loss of all nodes and the database, which fails HA requirements.
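The single-zone point can be illustrated with a little arithmetic. The sketch below is only a simplified model, assuming node failures are independent and that the zone is a hard dependency for every node and the shared storage; the availability figures and function names are hypothetical and are not taken from any Oracle or Azure SLA.

```python
def cluster_availability(node_availability: float, nodes: int) -> float:
    """Probability that at least one of `nodes` independent instances is up."""
    return 1.0 - (1.0 - node_availability) ** nodes


def single_zone_rac_availability(
    zone_availability: float, node_availability: float, nodes: int
) -> float:
    """All nodes and the shared storage live in one zone, so the zone is a
    series dependency: the cluster can never be more available than the zone."""
    return zone_availability * cluster_availability(node_availability, nodes)


# Hypothetical figures, chosen only for illustration.
ZONE_AVAILABILITY = 0.9995
NODE_AVAILABILITY = 0.99

for node_count in (1, 2, 4):
    result = single_zone_rac_availability(
        ZONE_AVAILABILITY, NODE_AVAILABILITY, node_count
    )
    print(f"{node_count} node(s): {result:.6f}")
```

In this model, adding nodes moves the result toward the zone's own availability but never past it, which is why a standby outside the zone is still needed.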

  • Another consistent issue is that most applications still connected to RAC databases aren't "RAC aware".  This results in outages when a failover or patch occurs, which is another challenge to the HA guarantee.  All tiers of the stack must be included in an HA solution.
  • RAC environments have a number of additional components that add complexity and can cause failures, both during failover and outside of it.

There have been a significant number of times where I've reviewed an AWR report and thought the best thing for an environment was to UnRAC it.  The code and database design simply weren't built to run effectively or efficiently on RAC, resulting in high global cache waits, etc.
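As a rough illustration of that kind of AWR review, the sketch below adds up the time spent on global cache ("gc") wait events and expresses it as a share of DB time. The wait event names are real Oracle events, but the numbers, the function name and the 20% review threshold are assumptions made up for this example, not values from any report or any Oracle guidance.

```python
from typing import Mapping


def global_cache_share(
    wait_seconds: Mapping[str, float], db_time_seconds: float
) -> float:
    """Return the fraction of DB time spent on events whose name starts with 'gc '."""
    if db_time_seconds <= 0:
        raise ValueError("db_time_seconds must be positive")
    gc_seconds = sum(
        seconds for event, seconds in wait_seconds.items() if event.startswith("gc ")
    )
    return gc_seconds / db_time_seconds


# Hypothetical figures, as if copied by hand from an AWR report.
sample_waits = {
    "gc buffer busy acquire": 1200.0,
    "gc cr block busy": 800.0,
    "db file sequential read": 1500.0,
    "log file sync": 300.0,
}
SAMPLE_DB_TIME = 6000.0
REVIEW_THRESHOLD = 0.20  # assumption: an arbitrary cut-off for this sketch

share = global_cache_share(sample_waits, SAMPLE_DB_TIME)
print(f"global cache share of DB time: {share:.1%}")
if share > REVIEW_THRESHOLD:
    print("worth reviewing whether this workload suits RAC")
```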

 

RAC, in many experts' view, is for scalability, and in my experience that scalability is as likely to be needed because a team lacks the resources or experience to build and manage solutions that handle the workload as it is to be needed for growth.

Only one project out of 100 that I've worked with at Microsoft has had a real need for RAC and yes, I work mostly with multi-terabyte workloads, so an assumption should not be made that it was just small databases.

Deploying RAC in Azure

For RAC to work on Azure today (not counting the new private preview), a third-party service is required to act as a communication center for the software cluster.

FlashGrid works with IaaS VM images and is supported through FlashGrid if any issues arise.  A high-level architecture diagram looks like the following:

[Figure: FlashGrid RAC on Azure high-level architecture (ora_flashgrid.png)]

If the customer is running anything more than a simple Oracle RAC environment (they've deployed a complex data model, or complex code or a complex application layer), or they don't meet any of the requirements for scaling with RAC, I'm going to try to convince them to architect for the cloud instead.

Architecting for the Cloud

An Azure datacenter is built on a globally distributed infrastructure, which contains numerous layers of redundancy and a resilient interconnected network.  This is far superior to an on-prem datacenter because it HAS TO BE.  Geo-regions are fault tolerant in case of complete regional datacenter failure, which means the way we architect for the cloud is often different than the way we architect on-prem.

The Keep It Simple Silly (KISS) principle comes in handy here, as complexity only adds to management, deployment and licensing costs for our Azure lift and shift projects.  Best practice for Oracle on Azure, which is also stated in the docs, is:

  1. With the scalability of Azure cloud VMs, the database to deploy should be a single instance Oracle database, (or Oracle supported product).
  2. To help scale, consider using Oracle Active Data Guard to leverage one, two or more secondary databases for reporting, feeding ELT/ETL or backups.
  3. If deploying a secondary Data Guard database to another region, consider using an Oracle Far Sync instance to assist in keeping it up to date.
  4. Also use Oracle Active Data Guard with automatic failover configured for DR purposes, designating sequential failover steps as required.
  5. Use Azure Site Recovery, (ASR) to take snapshots of the Oracle VM(s) and create new copies that can be used to quickly do a final recovery to a consistent state vs. cloning or recovering from a full backup.
  6. Use RMAN to take backups and save backups to Azure Blob storage.
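The "sequential failover steps" in item 4 come down to an ordered preference list of standbys. The sketch below only models that ordering decision in plain Python, assuming a hypothetical health check has already produced the set of reachable standbys; it does not call Data Guard or its broker, and the database names are invented for illustration.

```python
from typing import Optional, Sequence, Set


def choose_failover_target(
    priority_order: Sequence[str], healthy_standbys: Set[str]
) -> Optional[str]:
    """Return the first standby in priority order that is currently healthy,
    or None when no standby is available."""
    for standby in priority_order:
        if standby in healthy_standbys:
            return standby
    return None


# Hypothetical standby names: a nearby standby first, a remote-region one second.
FAILOVER_ORDER = ["standby_second_zone", "standby_far_region"]

# Normal case: the nearby standby is healthy and is chosen first.
print(choose_failover_target(FAILOVER_ORDER, {"standby_second_zone", "standby_far_region"}))

# Regional problem: only the remote standby is left, so it becomes the target.
print(choose_failover_target(FAILOVER_ORDER, {"standby_far_region"}))
```

Writing the order down explicitly, in whatever tool manages the failover, keeps the decision out of the hands of whoever is on call at 3 AM.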

[Figure: Oracle on Azure High Level Diagram]

If the database needs more resources, it is easy to scale the VM(s) up as necessary.  I spend a larger amount of time calculating IO to make sure the disk IO has room to grow over time.  Disk is separate from the VMs in Azure and is important to Oracle, but I'd like to save that for another post, so I will leave you with this:

  • Have the discussion about what the Azure cloud, or any cloud, is and how it is architected differently than an on-prem datacenter.
  • Ask the customer why they are using RAC, and then ask them whether their RAC environment actually meets the HA or scalability needs it was deployed for.
  • Seriously consider how Oracle Data Guard, either passive or active, can play a role in a strong HA and DR story for the customer.  The product is incredibly robust and is well suited to the cloud needs of customers' Oracle databases.  As Data Guard costs less than RAC, it can save the customer a considerable amount of money on licensing costs, too.
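The IO calculation mentioned before this list can be sketched as a simple compound-growth check. Everything below is an assumption for illustration: the peak IOPS figure, the growth rate, the safety margin and the limit are hypothetical, and in practice the limit would be the lower of the VM-level and disk-level caps for the configuration being sized.

```python
def projected_peak(current_peak: float, annual_growth: float, years: int) -> float:
    """Project a peak IO figure forward with compound annual growth."""
    return current_peak * (1.0 + annual_growth) ** years


def has_headroom(
    current_peak: float,
    annual_growth: float,
    years: int,
    limit: float,
    safety_margin: float = 0.2,
) -> bool:
    """True when the projected peak plus a safety margin still fits under the limit."""
    needed = projected_peak(current_peak, annual_growth, years) * (1.0 + safety_margin)
    return needed <= limit


# Hypothetical workload: 12,000 peak IOPS today, growing 25% a year for 3 years.
PEAK_IOPS = 12_000.0
GROWTH = 0.25
YEARS = 3

print(f"projected peak IOPS: {projected_peak(PEAK_IOPS, GROWTH, YEARS):,.1f}")
for candidate_limit in (20_000.0, 30_000.0):  # hypothetical IOPS limits
    fits = has_headroom(PEAK_IOPS, GROWTH, YEARS, candidate_limit)
    print(f"limit {candidate_limit:,.0f}: {'fits' if fits else 'too small'}")
```

The same check can be repeated for throughput in MB/s, since a workload can run out of either first.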

RAC Skills

It's alright if the DBA simply wants to keep their RAC skills up-to-date by having it. I understand; I've got two decades under my belt as a DBA.  The thing is, there are so many cool new tools and products to learn, like Azure CLI, Azure services and automation with DevOps, that they'll acquire plenty of new skills that will make them more valuable than just knowing Oracle RAC.

7 Comments
Copper Contributor

@DBAKevlar According to the comment on the Shared Disks announcement page, Shared Disks do not support Oracle RAC.

Microsoft

That is correct...they won't support Oracle RAC, but that doesn't mean you can't hack it to build it.  I have customers already pushing for the solution.  I also have customers pushing for me to further HVR for RAC, too.  It's a huge topic, but I mentioned it because there are a ton of people asking about it and the topic should be broached.

Thank you,

Kellyn Gorman

Copper Contributor

@DBAKevlar The "RAC is Not High Availability" statement seems to be very misleading. Multiple RAC users use it for HA specifically, including FlashGrid SkyCluster customers on Azure. Can it provide 100% uptime guarantee? Of course not, nothing can. But 99.99% or even 99.999% is achievable (using multi-AZ, which is the default in SkyCluster). So, the question is whether RAC is the best HA tool for Oracle database. My answer is yes, in most cases. All major release updates and most patches can be applied in rolling fashion. You don't have to do a manual failover in the middle of the night when a failure happens. And in case of RAC on SkyCluster, all "additional components" are packaged and pre-integrated together, so it just works. On the other hand, DataGuard is a great tool for disaster recovery, and our customers use it on top of RAC for doing async replication to a different region, in accordance with Oracle's Maximum Availability Architecture recommendations.

Microsoft

Dear Art,

Thank you for sharing your experience with RAC with FlashGrid, but you may have wanted to preface your answer with a disclaimer that you're the CEO and CTO for FlashGrid.

https://www.linkedin.com/in/art-danielov-66668a1

Have a great weekend!

Kellyn Gorman

Copper Contributor

You are absolutely correct in promoting the cloud-architecting of RAC instead of force-bending the rules to mimic the existing on-prem setup.
Most people (about 90%) are running on-prem using 3rd party storage and don't realise how fragile ASM is (even at the very latest patch levels) when working directly with the disks.
There are exceptional cases when even the X8 full rack Exa is barely capable of running a single report, but there are very, very few of those and the rest can easily be broken into chunk-sized single-VM workloads.

Microsoft

ASM does have its limits and we see this very often when attempting to mirror disk layouts and use more advanced LVM choices that we simply don't have in ASM.  As for RAC, it's not supported by Oracle in any third party cloud and the customer must make this decision: to go unsupported by Oracle or have the support go through a vendor such as FlashGrid.  As someone who wants long-term satisfaction for the customer on Azure, my most common solution, especially with OLAP RAC or single instance or Exadata migrations to Azure IaaS VMs, is to look at advanced processing services like Azure Synapse to scale part of that workload.  Evolving their Oracle database by incorporating a solution which will allow for MPP to process the data and present it in the final format for best reporting is an excellent solution vs. trying to build so much of it during the querying process.  It is why so many have moved from ETL to ELT, which also allows the customer to build a future-view data lake solution that can promote machine learning solutions and other newer technologies, rather than just trying to do new things with older RDBMS technologies.

Copper Contributor

Great blog. Thanks for the explanation.

 

My comments: RAC is designed primarily to address instance failures, as well as H/W failures. Datacenter failures do require a DR strategy.

And ASM will have redundancy of storage to avoid storage failures. This is not a main concern in the Cloud world.

I am currently working on a project (POC) to utilize fast failover of database services for an ERP system.

Last update: Mar 10 2022 06:30 PM