Conversations about bringing Oracle workloads to Azure almost always turn to the topic of Real Application Clusters (RAC). As it’s been quite some time since I’ve covered this topic, I wanted to update my previous post, as with the cloud and technology, change is constant.
One thing that hasn’t changed is my belief that RAC is A solution for Oracle for a specific use case and not THE solution for Oracle. The small detail that Oracle won’t support RAC in any third-party cloud is less important than the lack of need for RAC in most cases for those migrating to an enterprise-level cloud such as Azure.
Not So Lift and Shift
Whenever we are working on migrating Oracle workloads to Azure, it is important to focus on how to architect most effectively for the Azure cloud and not just lift and shift what exists onprem. A common challenge during cloud migrations is the attempt to duplicate everything from onprem in the cloud, or to treat the cloud like just another data center, without realizing how much high availability is built into Azure that isn’t in the onprem data center.
We often see customers implementing redundancy through additional products and features and, at the same time, introducing duplication and sometimes their own failures. The most common abuses are hypervisors on top of hypervisors, mirroring and storage copies with storage management tripping all over itself and, the topic of this post, high availability products.
In Data Guard We Trust
Because this built-in high availability is such a transparent and important part of the Azure cloud, the most common Oracle architecture we deploy is a single-instance Oracle database (often migrated from onprem RAC) with Oracle Data Guard standbys to support both disaster recovery and high availability, using features that, surprisingly, fewer technologists are aware of than we’d like.
When we’ve completed an Oracle sizing and architecture assessment here at Microsoft with a customer, the diagrams look very similar to the following:
As there are several topics around why we so rarely use RAC in Azure, I’m going to take each of these on separately and hopefully cover the important details.
High Availability Cluster
If we were to build out a RAC cluster in the Azure cloud, all the nodes for RAC would be deployed to a single Availability Zone, unlike an Always On availability group. From a high availability design perspective, this architecture fails basic HA requirements, because the loss of that one zone takes every node with it.
Oracle Data Guard is very similar in design to an Always On availability group and is an essential part of any disaster recovery architecture design. Notice in the diagram, Figure 3, that if the RAC in Availability Zone 1 goes down, there will be an outage unless there is an Oracle Data Guard standby available in another Availability Zone to fail over to.
With Oracle Data Guard, we can configure several features to build out a full-blown, highly available architecture that can support 8-9’s of uptime. With Oracle Data Guard configured with Fast-Start Failover (FSFO), if the primary database becomes unavailable or goes down for any reason, the secondary automatically becomes the primary and takes over. A notification can be set up in Oracle Enterprise Manager to alert those responsible, but when configured correctly in Azure the failover itself happens in seconds, allowing for transparent failover to a standby in a second Availability Zone.
To take this a step further, you can configure the Data Guard Broker and set up the Oracle observer in secondary Availability Zones (with full redundancy) to fail over applications that are failover compatible. When FSFO comes into play and fails over the database, new sessions fail over transparently to the secondary.
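To put "nines" of uptime and a failover measured in seconds into concrete terms, the arithmetic can be sketched as follows. This is only an illustration: the function names and the 30-second failover duration are assumptions made for the example, not measured Azure or Oracle figures.

```python
# Rough arithmetic only: converts "N nines" of availability into a yearly
# downtime budget. The function names and the 30-second failover figure
# used below are illustrative assumptions, not measured values.

SECONDS_PER_YEAR = 365.25 * 24 * 60 * 60


def downtime_seconds_per_year(nines: int) -> float:
    """Allowed downtime per year for N nines (3 -> 99.9%, 5 -> 99.999%)."""
    if nines < 1:
        raise ValueError("nines must be at least 1")
    unavailability = 10.0 ** -nines
    return SECONDS_PER_YEAR * unavailability


def failovers_within_budget(nines: int, failover_seconds: float) -> int:
    """How many failovers of a given length fit in the yearly budget."""
    if failover_seconds <= 0:
        raise ValueError("failover_seconds must be positive")
    return int(downtime_seconds_per_year(nines) // failover_seconds)


if __name__ == "__main__":
    for n in (3, 4, 5):
        budget = downtime_seconds_per_year(n)
        count = failovers_within_budget(n, 30.0)
        print(f"{n} nines: {budget:.1f} s/year, room for {count} 30-second failovers")
```

The point of the exercise is that each extra nine cuts the yearly budget by a factor of ten, so the duration of an automatic failover quickly becomes the dominant term.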
We can deploy the primary in one Availability Zone and a standby in a second Availability Zone, creating a fully redundant and automatic failover solution. As Oracle Data Guard can support numerous standby databases, these HA and DR copies can be deployed in multiple Availability Zones and secondary regions to meet the customer’s RPO/RTO, no matter how complex the SLA.
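The zone placement argument can be checked mechanically. The sketch below is a toy model, not an Azure or Oracle API: the `DbNode` type, the zone labels and the node names are all hypothetical, and the only question it answers is whether at least one database node survives the loss of any single Availability Zone.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class DbNode:
    """Hypothetical description of one database node in a deployment."""
    name: str
    role: str  # e.g. "rac-node", "primary" or "standby"
    zone: str  # illustrative Availability Zone label


def survivors_after_zone_loss(nodes: List[DbNode], lost_zone: str) -> List[str]:
    """Names of the nodes still running if one zone goes down."""
    return [n.name for n in nodes if n.zone != lost_zone]


def tolerates_single_zone_loss(nodes: List[DbNode]) -> bool:
    """True if every possible single-zone failure leaves a node running."""
    if not nodes:
        return False
    zones = {n.zone for n in nodes}
    return all(survivors_after_zone_loss(nodes, z) for z in zones)


if __name__ == "__main__":
    rac_only = [DbNode("rac1", "rac-node", "1"), DbNode("rac2", "rac-node", "1")]
    with_standby = [DbNode("db1", "primary", "1"), DbNode("db2", "standby", "2")]
    print(tolerates_single_zone_loss(rac_only))      # all nodes share zone 1
    print(tolerates_single_zone_loss(with_standby))  # nodes span zones 1 and 2
```

A two-node cluster confined to one zone fails the check, while a single-instance primary with a standby in a second zone passes it, which is the design point made above.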
For Oracle’s Maximum Availability Architecture (MAA) to reach a gold standard, Oracle Data Guard must be part of the deployed solution. As customers often move from RMAN to storage snapshots for backups, having Data Guard features such as DBVERIFY and ANALYZE to perform logical checks for intra-block and inter-object consistency offers added benefits. Data Guard also provides in-memory intra-block checks and shadow lost write protection if there is an interruption in service at the storage layer beneath the database.
For the additional licensing cost of Active Data Guard, the standby can be used as an RMAN backup target to offload demand from the primary database, and the intra-block logical checks can likewise be offloaded to the standby in its active read-only mode.
We can also use a separate Far Sync instance to guarantee zero data loss by offloading compressed transport of the redo to the Active standby database. This also offers the ability to perform continuous Oracle validation on the standby and adds encryption to secure business data.
High Availability via rolling patches and upgrades
As RAC isn’t supported in any third-party cloud, Azure specialists look to the solutions that do provide what is required, and Oracle Data Guard is very compatible with the HA built into Azure cloud infrastructure. Another nice feature that many aren’t aware of is that with Oracle Active Data Guard (active/active, where the secondary is a read-only active standby) you can do switchovers and, using the DBMS_ROLLING package provided with Oracle 19c, rolling patches and upgrades. This delivers one of the features DBAs love most about Oracle RAC, yet it is very little known in Data Guard.
With DBMS_ROLLING and Active Data Guard, database and application downtime can be decreased to seconds with a fault tolerant, resumable and rollback capable solution.
Scalability is the best reason to use RAC and, for many, the least common reason we’ve seen businesses choose it. For an OLTP or hybrid database workload that requires significant CPU and memory, where the database design has been optimized for RAC, considerably demanding workloads can be handled with the product. The point at which we reach for RAC over OS-level clustering with a load balancer, or over a larger VM that can handle the workload, comes down to per-VM limits we can’t work our way around. There are significant challenges with concurrency, initial transaction locking, GC waits or shipping between nodes that are outside of this discussion, but you do realize the benefit that could be brought to the table with RAC… but not in Azure or any other third-party cloud if you want it supported by Oracle. It’s not the shared storage or even the multicast network that’s the problem, it’s simply about supportability by Oracle.
Although we’re not able to use RAC for scalability, for heavy read-only workloads we can use Oracle’s Active Data Guard standbys in a read-only active mode to take on those application-compatible workloads, leaving the primary to process only the transactional workload.
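At the application tier, that read/write split amounts to a routing decision. The following is a minimal sketch, assuming the application can tell which of its workloads are read-only; the `WorkloadRouter` class and the service names are hypothetical and are not part of any Oracle client library.

```python
from itertools import cycle
from typing import Iterable


class WorkloadRouter:
    """Toy router: read-only work goes to standbys, the rest to the primary.

    The service names passed in are placeholders for whatever connect
    identifiers a real deployment would use.
    """

    def __init__(self, primary: str, read_only_standbys: Iterable[str]):
        self._primary = primary
        standbys = list(read_only_standbys)
        # Rotate across standbys so read-only work is spread evenly.
        self._standbys = cycle(standbys) if standbys else None

    def target_for(self, read_only: bool) -> str:
        """Pick the service a workload should connect to."""
        if read_only and self._standbys is not None:
            return next(self._standbys)
        # Transactional work, or no standby available, uses the primary.
        return self._primary


if __name__ == "__main__":
    router = WorkloadRouter("orcl_primary", ["orcl_standby_a", "orcl_standby_b"])
    print(router.target_for(read_only=False))
    print(router.target_for(read_only=True))
    print(router.target_for(read_only=True))
```

Falling back to the primary when no standby is configured keeps the application working, at the cost of putting the read-only load back on the transactional database.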
Oracle Sharding offers another option for scalability, spreading the database across multiple databases/hosts in a shared-nothing architecture using shard keys. Sharding is a horizontal partitioning of data across numerous databases, with each shard holding a subset of the total data rather than all of it being housed in a single database. As RAC isn’t supported in any third-party cloud, this is deployed in the Azure cloud without a RAC clusterware backbone but is able to use Oracle’s multitenant feature with the additional licensing.
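The idea of a shard key deciding where a row lives can be shown with a small sketch. Oracle Sharding manages its own data distribution; the modulo hash below is a simplified stand-in chosen for readability, and the function name, key format and shard count are assumptions made for the example.

```python
import hashlib
from collections import Counter


def shard_for(shard_key: str, shard_count: int) -> int:
    """Map a shard key to a shard number in the range 0..shard_count-1.

    A stable hash (SHA-256) is used instead of Python's built-in hash(),
    which is randomized per process for strings.
    """
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    digest = hashlib.sha256(shard_key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % shard_count


if __name__ == "__main__":
    # Hypothetical customer IDs acting as the shard key.
    keys = [f"customer-{i}" for i in range(1000)]
    placement = Counter(shard_for(k, 4) for k in keys)
    for shard, rows in sorted(placement.items()):
        print(f"shard {shard}: {rows} keys")
```

One limitation of this simplified scheme is that changing the shard count remaps most keys, which is one reason a real sharded database handles distribution and rebalancing itself.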
For those workloads that absolutely need a RAC solution, we leverage OS-level clustering in Azure VMs using Pacemaker, and for customers who can adopt a co-location, we recommend the Azure BareMetal RAC offering, which is in a gated GA status.
Bare Metal consists of dedicated machines in a co-location configuration in proximity to the Azure cloud. It offers a RAC solution for customers that absolutely must have it: the infrastructure is supported by Azure, and everything above that is managed and supported by the customer.
Latency between the Bare Metal solution and other Azure services and VMs is minimized, along with additional high availability built into the offering to support what would be missing from most onprem data center deployments.
Azure Bare Metal can support RAC One Node and standard RAC deployments with an HA storage configuration to support the demands of 2-node and 4-node RAC configurations in a highly scaled, enterprise cloud solution.
A customer may use FlashGrid on Azure with the understanding that all support for this RAC solution running in Azure must go through FlashGrid. Neither Oracle nor Microsoft can support RAC running inside the Azure cloud, but FlashGrid has shown a history of offering solid support for their customers.
Although we initially hoped to use Azure’s shared storage as an option to run Oracle RAC in our cloud, we’ve backtracked from this due to support constraints, and the same goes for networking advancements in Azure. It’s not that we can’t run RAC in Azure, it’s just that it isn’t supported, and our main goal is long-term customer satisfaction and supportability in Azure. No matter your feelings on RAC, the goal for this post was to discuss which features are best suited to a deployment in Azure that makes Oracle highly available, easy to manage and most likely to receive vendor support.