Architectural Decisions to maximize ANF investment in HANA N+M Scale-Out Architecture - Part 3
Published Mar 23 2021

Introduction

This blog is a continuation of the blog series on this topic. If you haven’t read Part 1 and Part 2 of the series, here is the link to help you catch up. By the end of Part 2, we finished designing the architecture up until the backup and recovery of HANA N+M scale-out on ANF. Now, we are ready to design and build the high-availability architecture, followed by the disaster recovery architecture using the native storage replication feature of Azure NetApp Files. Let’s get to it!

Note: The intention throughout the blog series is to cover a solution architect’s view of designing and building the solution in a thought-provoking manner, and to complement the already available, well-crafted MS Docs and NetApp documentation on this topic. I will cover key highlights from that documentation and, where necessary, add detail in this blog to cover more surface area.

Final Design

 

[Diagram: DR Design – final architecture]

 

Before going through the entire design process, it is helpful to see what the end result looks like; the diagram above serves that purpose. It may contain components you are not yet familiar with, and that is okay. Once you have gone through the design process below and come back to this diagram, it will make a lot more sense.

Design and Build – High-Availability

Design Considerations

 

I will keep the high-availability section a bit leaner than the other sections, primarily because the technology we will use to achieve high availability is well documented in Azure and SAP docs, not to mention dozens of reputable blogs. For this particular technical scenario, high availability for the N+M scale-out nodes is provided natively by SAP’s ‘host auto-failover’ mechanism. Until recently it had not been available on a public cloud; it is now, thanks to Azure NetApp Files and its sub-millisecond NFSv4.1 shared storage capability. With an N+M architecture, the additional stand-by node provides high availability through host auto-failover on top of the shared ANF storage. Good news - no OS cluster is required for this. You just provision one or more additional nodes, depending on the number of scale-out nodes, and then follow the standard stand-by node installation process. The stand-by node(s) will have the same capacity and the same OS configuration as the worker nodes. Keep in mind that this is a cold stand-by setup, so you will need to test and assess whether the failover time meets your overall RTO.

To provide high availability from an infrastructure perspective, ensure that all the HANA nodes, including the stand-by node(s), are part of the same availability set.

Build Considerations

This is a straightforward build scenario, and Azure Docs has all the details to set it up (links at the end). Here are a few helpful tips:

  • Create separate directories for each node’s /usr/sap mount point and the common /hana/shared directory under the same ANF volume (see the sketch after this list). This is a cost-optimized option, as these mount points do not generate enough workload to require dedicated throughput.
  • Ensure password-less SSH among all the nodes, and also with the central management server if you plan to orchestrate the azcopy offloading from the central management server.
  • Test all the failover scenarios, including deallocating the virtual machine, creating a kernel panic, and killing the HDB process. The steps are explained in the Azure doc on the N+M architecture with ANF on RHEL (linked below).
  • Remember to tweak the kernel.panic settings to improve the failover time.
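
To make the first tip concrete, here is a minimal sketch, in Python purely for illustration, that renders the /etc/fstab entries a node would need when the common /hana/shared directory and the per-node /usr/sap directories are hosted as sub-directories of one ANF NFSv4.1 volume. The ANF mount IP, volume name, SID, node names and mount options are assumptions; take the authoritative mount options from the Microsoft scale-out guide linked in the References.

```python
# Illustrative only: render per-node /etc/fstab entries for a single shared ANF
# NFSv4.1 volume hosting /hana/shared plus one usr-sap-<node> directory per node.
# The IP address, volume name, SID and node names below are hypothetical.

ANF_ENDPOINT = "10.1.2.4"            # assumed mount IP from the ANF delegated subnet
SHARED_VOLUME = "HN1-shared"         # assumed creation token of the shared volume
SID = "HN1"
NODES = ["hana-s1-db1", "hana-s1-db2", "hana-s1-db3"]  # N+M nodes incl. the stand-by

# Typical NFSv4.1 options for ANF; verify against the current MS Docs guidance.
OPTS = "rw,vers=4.1,hard,timeo=600,rsize=262144,wsize=262144,noatime,lock,_netdev,sec=sys"

def fstab_entries(node: str) -> str:
    """Return the two fstab lines a given node would need (each node mounts only
    its own usr-sap-<node> sub-directory, plus the common /hana/shared)."""
    return "\n".join([
        f"{ANF_ENDPOINT}:/{SHARED_VOLUME}/shared /hana/shared nfs {OPTS} 0 0",
        f"{ANF_ENDPOINT}:/{SHARED_VOLUME}/usr-sap-{node} /usr/sap/{SID} nfs {OPTS} 0 0",
    ])

if __name__ == "__main__":
    for node in NODES:
        print(f"# {node}\n{fstab_entries(node)}\n")
```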

Design – Disaster Recovery

Following are some of the design considerations for architecting disaster recovery for the HANA scale-out on Azure VMs running on ANF:

  • DR requirements: Do you have the major requirements sorted out, such as RPO and RTO? How much are you willing to spend on DR? How much autonomy and automation do you expect?
  • Choosing the data replication mechanism: Probably the biggest technical decision after settling on the requirements is the mechanism used to replicate the data over to the DR site. You have two (technically three) options, compared below:

Option 1 – HANA System Replication (HSR)

  • Application-managed (asynchronous) replication
  • Faster HANA uptime after the takeover
  • Shorter memory-loading duration (log replay option)
  • Higher network consumption compared to block-level storage replication
  • Requires an active compute infrastructure to receive the replication
  • The replicated data cannot be leveraged for lower-environment refreshes at the DR site unless you break the replication
  • RPO: lowest among the options; plan and size the infrastructure to meet the required network throughput
  • Flexibility in choosing the DR region of choice

Option 2 – ANF Cross Region Replication (CRR)

  • Managed at the PaaS layer with Azure APIs
  • Requires the HANA recovery process before the system is functional
  • Longer memory-loading duration post recovery
  • Lower network consumption, since only the block-level changes are replicated
  • Does not require an active compute unit, though having one would reduce the RTO
  • The application-consistent snapshots can be used to create new cloned volumes for lower-environment refreshes without breaking the replication
  • RPO: low, though not as low as HSR; requires tuning the frequency of the data and log backup snapshots
  • Available only for the defined replication region pairs (see the reference link below for the complete list of pairs)

Option 3 – Replicate backup files and restore at the DR site

  • Included for completeness: you can replicate file-level backups across regions and restore them as a tertiary DR mechanism. This probably won’t be your first choice for critical applications.
  • RPO: the worst among the options
  • Flexibility in choosing the DR region of choice

 

  • Region of choice: This is another important decision when it comes to DR. Though this should have already been identified with the base infrastructure design (Part 1 of the blog), it is worth repeating here: finding all the required Azure services to support the SAP landscape in the predefined Azure DR region pairs may not always be possible, and you may end up selecting a non-paired DR region.

For this blog, I have chosen ANF Cross Region Replication as the replication mechanism. Let’s go over some of the key design aspects of this DR solution:

  • Single Subscription and CRR: If you remember from the previous blog, we chose a single subscription spanning both the primary and the DR region. This was done precisely to be able to leverage ANF CRR as the DR replication mechanism; at the time of writing, CRR cannot be set up without meeting the single-subscription requirement.
  • What to replicate: the HANA data and log backup volumes. The snapshots on the data volumes serve as the snapshot recovery points, and the snapshots on the log backups volume enable point-in-time recovery at the DR site, reducing the RPO even further. Keep in mind that, in addition to taking snapshots on the log backups volume, you will need to adjust the HANA log backup timeout parameter [log_backup_timeout_s] in order to meet the RPO requirement.
  • How frequently to replicate: The following example meets a 30-minute replication RPO:

  • ANF CRR replicating the HANA data and log backups volumes – every 10 min for all the volumes (which gives a maximum replication RPO of 20 min)
  • ANF snapshots on the log backups volume – every 10 min, with log_backup_timeout_s = 300. (The DR requirement now changes this frequency from the 15 min we chose during the backup/recovery setup in blog part 2 to 10 min. Also plan to stagger the snapshot schedule so that azacsnap doesn’t take a snapshot at the same time as the CRR internal replication.)
  • ANF snapshots on the data volumes – every 4 hours (see the comment below)

 

  • Other factors driving the above frequencies: The amount of log generated in the system affects the point-in-time recovery at the DR site. To shorten the time it takes to apply the logs during point-in-time recovery, you can take more frequent data snapshots. On the other hand, consider the impact of more frequent data snapshots on the space occupied by snapshots on the source volume and on the amount of egress traffic to the DR site.
  • DR ANF account, pools and volumes: Since an ANF account doesn’t span regions, you will need a separate account in the DR region for the CRR target volumes. For the capacity pool, you can choose a lower tier for the target volumes; keep in mind that at DR time you will need to move the replicated volumes to higher-tier pools to meet the volume throughput requirements.
  • Staging HANA installation: If you are planning to set up a dual-purpose DR system, you can use smaller, lower-tier volumes to stage the production HANA instance installation ahead of time. As part of the DR, you will then swap these smaller volumes with the ones replicated via CRR. If you are not planning a dual-purpose system, you don’t need any additional volumes during normal operations, but at the time of DR you will need to provision the volumes that don’t come from replication, such as the HANA shared binaries (/hana/shared), the log volumes (/hana/log) and a volume for data backups.

Build – Disaster Recovery

Following are the high-level steps to set up disaster recovery using ANF cross-region replication:

  1. Ensure that you have the necessary landing zone ready in the DR region for this setup.
  2. Create an ANF account, a delegated ANF subnet and a target storage pool for the replicated volumes. (Refer to part 1 of the blog series)
  3. [At Primary Region] Locate the volume ID of each of the following source volumes. In our example, we have a total of three volumes to replicate:
    a. Data volumes: /hana/data/mnt00001, /hana/data/mnt00002 …
    b. Log backups volume: /backup/log

[Screenshot: source volume properties showing the Resource ID]

/subscriptions/<YourSubID>/resourceGroups/<ANFGroupName>/providers/Microsoft.NetApp/netAppAccounts/<ANFAccountName>/capacityPools/<SourceCapacityPool>/volumes/<ANFVolumeName>
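
If you prefer scripting to the portal, something along the lines of the sketch below (using the azure-mgmt-netapp Python SDK) lists the volumes in the source capacity pool and prints their resource IDs. The resource group, account and pool names are placeholders, and operation names can vary slightly between SDK versions, so treat this as a starting point rather than a reference.

```python
# Sketch: print the resource IDs of the source ANF volumes (data + log backups)
# so they can be pasted into the "Add data replication" workflow or an API call.
# Names below are placeholders; verify against your azure-mgmt-netapp version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.netapp import NetAppManagementClient

SUBSCRIPTION_ID = "<YourSubID>"
RESOURCE_GROUP = "<ANFGroupName>"
ACCOUNT = "<ANFAccountName>"
SOURCE_POOL = "<SourceCapacityPool>"

client = NetAppManagementClient(DefaultAzureCredential(), SUBSCRIPTION_ID)

for volume in client.volumes.list(RESOURCE_GROUP, ACCOUNT, SOURCE_POOL):
    # volume.id is the full resource ID needed when creating the destination volume
    print(f"{volume.name}\n  {volume.id}\n")
```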

  4. [At DR Region] Create the respective data replication volumes by choosing the “Add data replication” option when creating the volume. For better mapping and identification, you can name the replicated volume the same as the source volume, but with a unique suffix, such as <sameSourceVolumeName>-crr-<Year>. You will also need to provide the source volume ID (copied above) in the “Replication” tab. For the “Replication Schedule”, you can use 10 minutes to follow this example. Don’t forget to choose NFSv4.1 as the protocol. Check out the link at the end to perform the same operation using the REST APIs instead; an SDK-based sketch also follows the screenshots below.

[Screenshots: creating the destination volume with “Add data replication” and the Replication tab]
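
For those who would rather script this step, a hedged sketch using the Python SDK is below. The subnet ID, size and names are placeholders, and the model fields (Volume, ReplicationObject and the schedule value) should be confirmed against your installed azure-mgmt-netapp version; the REST API link at the end remains the authoritative reference.

```python
# Sketch: create a CRR destination (data-protection) volume in the DR region.
# All names, sizes and the subnet ID are placeholders; confirm model and field
# names against your installed azure-mgmt-netapp version before relying on this.
from azure.identity import DefaultAzureCredential
from azure.mgmt.netapp import NetAppManagementClient
from azure.mgmt.netapp.models import (
    Volume, VolumePropertiesDataProtection, ReplicationObject
)

client = NetAppManagementClient(DefaultAzureCredential(), "<YourSubID>")

source_volume_id = (
    "/subscriptions/<YourSubID>/resourceGroups/<ANFGroupName>/providers/"
    "Microsoft.NetApp/netAppAccounts/<ANFAccountName>/capacityPools/"
    "<SourceCapacityPool>/volumes/<ANFVolumeName>"
)

dst_volume = Volume(
    location="<DRRegion>",
    creation_token="<ANFVolumeName>-crr-2021",      # export/mount path name
    service_level="Standard",                       # lower tier for the target pool
    usage_threshold=500 * 1024**3,                  # 500 GiB; match the source size
    subnet_id="<DRDelegatedSubnetResourceID>",
    protocol_types=["NFSv4.1"],                     # must match the source protocol
    volume_type="DataProtection",                   # marks this as a CRR destination
    data_protection=VolumePropertiesDataProtection(
        replication=ReplicationObject(
            endpoint_type="dst",
            remote_volume_resource_id=source_volume_id,
            remote_volume_region="<PrimaryRegion>",
            replication_schedule="_10minutely",     # the 10-minute example schedule
        )
    ),
)

poller = client.volumes.begin_create_or_update(
    "<DRGroupName>", "<DRANFAccountName>", "<DRCapacityPool>",
    "<ANFVolumeName>-crr-2021", dst_volume
)
print(poller.result().provisioning_state)
```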

  5. [At DR Region] Capture the destination volume ID for the next step.
  6. [At Primary Region] Go back to each source volume and choose the “Authorize” option under the Replication section to pair and activate the replication (a scripted sketch follows the screenshots below).

[Screenshots: the Authorize option on the source volume’s Replication blade]
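
The authorization can also be scripted; the sketch below assumes placeholder names and the begin_authorize_replication operation found in recent azure-mgmt-netapp versions, so verify it against your SDK before using it.

```python
# Sketch: authorize the replication from the SOURCE volume side by passing the
# destination volume's resource ID. Placeholder names throughout; verify the
# operation name (begin_authorize_replication) against your SDK version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.netapp import NetAppManagementClient
from azure.mgmt.netapp.models import AuthorizeRequest

client = NetAppManagementClient(DefaultAzureCredential(), "<YourSubID>")

destination_volume_id = "<DRVolumeResourceID>"   # captured in the previous step

client.volumes.begin_authorize_replication(
    "<ANFGroupName>", "<ANFAccountName>", "<SourceCapacityPool>", "<ANFVolumeName>",
    AuthorizeRequest(remote_volume_resource_id=destination_volume_id),
).result()
print("Replication authorized for <ANFVolumeName>")
```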

  7. [At Primary Region] That’s all you need to initiate the replication. The initial status would look like this: Uninitialized

[Screenshot: replication status showing Uninitialized]

  8. [At DR Region] With the 10-minute replication schedule, I started seeing telemetry around the 10-minute mark. You can monitor it from either the destination or the source volume > Replication section – see the example metrics below. Also note that the Replication Schedule field is only displayed on the destination volume side.

[Screenshot: example CRR replication metrics]

  9. So, now what? From a DR setup perspective, that is it. You should now have all the data volumes (all the mounts) and the log backups volume being replicated every 10 minutes. You will also see that all the snapshots on the source volumes are replicated over to the destination volumes; these snapshots are critical for a consistent-state recovery of the HANA DB.
  10. You will also see one or more system-generated snapshots with a prefix of snapmirror.<longstring>. These serve as internal reference points for the storage replication. We will not be using them for recovery, as they do not capture an application-consistent state of the database.

[Screenshot: replicated snapshots on the destination volume, including the internal snapmirror snapshot]

  11. In addition, to replicate configuration changes, you can also replicate the /hana/shared volume and leverage it when setting up the production HANA instance at the DR site. The replication frequency for this volume doesn’t need to be aggressive, given how infrequently its contents change.

Validate – Disaster Recovery

To validate that you can use this replicated volume set for a successful recovery and meet your RPO and RTO objectives, follow the high-level steps below:

 

  1. [At DR Region] In this example, I will be using a dual-purpose DR system, where a base installation of the production instance is already staged using a smaller HANA volume set. This DR production HANA instance is currently down while the other instance (say, a QA system) is up and running. Let’s also assume that this production instance is patched regularly with the same HANA application patches as the primary production instance, and that the profile files are kept up to date as well. In addition, you also have a backup/snapshot of these smaller volumes to revert the DR production instance back to its pre-DR condition once the DR test is done.
  2. [At DR Region] Stop the QA instance in preparation for the DR.
  3. [At DR Region] If you have restricted the memory allocation for the production instance at the DR site, update it and cycle (start/stop) the system. You need the DR production instance to be down for the recovery, but the sapstartsrv service must be running so you can connect to the instance from a client such as HANA Studio for the recovery process.
  4. Ensure that the snapshots are being propagated to the respective destination volumes.
  5. [At DR Region] Note down the latest data snapshot available on the destination volumes. A data snapshot that is a couple of hours old makes for a realistic scenario, where you will need to apply log backups for a roll-forward recovery on top of the data snapshot.
  6. [At Primary Region] Make a change to serve as a marker for your point-in-time recovery. It could be as simple as creating a security user in HANA. Note down the completion timestamp.
  7. Wait for the log backup snapshot to be replicated. The example timeline looks like this:

  • Test transaction time – 0:00
  • Log backup generated – 0:05
  • Snapshot executed (at the 5th, 15th, 25th, 35th, 45th and 55th minute of the hour), including the transaction – 0:05, or worst case 0:15
  • CRR replication cycle – 0:10, or worst case 0:20
  • Adding the total CRR RPO of 20 min – 0:20, or worst case 0:30 (excessive block-level changes and the distance between the regions can have an impact on the overall replication RPO)
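
To make this easier to adapt, here is a tiny sketch that simply restates the example timeline above as offsets in minutes; the figures are the ones from this example (log_backup_timeout_s = 300, 10-minute snapshots, 10-minute CRR schedule) and should be swapped for your own frequencies.

```python
# Restate the example worst-case timeline above as offsets (minutes after the
# test transaction). The values mirror this blog's example frequencies only.
worst_case_timeline = [
    (0,  "test transaction committed at the primary site"),
    (5,  "log backup written (log_backup_timeout_s = 300)"),
    (15, "log-backup snapshot taken (azacsnap runs at :05, :15, :25, ...)"),
    (30, "snapshot available at the DR site (within the ~20 min CRR RPO window)"),
]
for minute, event in worst_case_timeline:
    print(f"+{minute:02d} min  {event}")

print(f"\nWorst-case data-loss window: ~{worst_case_timeline[-1][0]} minutes "
      "(within the 30 min replication RPO target)")
```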

  8. [At Primary Region] Simulate a disaster – stop/deallocate the HANA VMs at the primary site.
  9. [At DR Region] Make sure the replication state of every replicated volume is Healthy, Mirrored, and Idle before you break the replication (a scripted check is sketched below).

[Screenshot: replication state showing Healthy, Mirrored, Idle]
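
A scripted pre-check along these lines can save you from breaking a replication that is still transferring. The volume names are placeholders, and the replication_status operation name may differ slightly between azure-mgmt-netapp versions.

```python
# Sketch: confirm every destination volume reports Healthy / Mirrored / Idle
# before breaking the replication. Placeholder names; verify the operation name
# (replication_status) against your azure-mgmt-netapp version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.netapp import NetAppManagementClient

client = NetAppManagementClient(DefaultAzureCredential(), "<YourSubID>")

DR_RG, DR_ACCOUNT, DR_POOL = "<DRGroupName>", "<DRANFAccountName>", "<DRCapacityPool>"
DR_VOLUMES = ["hanadata-mnt00001-crr-2021", "hanadata-mnt00002-crr-2021", "logbackup-crr-2021"]

for name in DR_VOLUMES:
    status = client.volumes.replication_status(DR_RG, DR_ACCOUNT, DR_POOL, name)
    print(name, status.healthy, status.mirror_state, status.relationship_status)
    if not (status.healthy and status.mirror_state == "Mirrored"
            and status.relationship_status == "Idle"):
        raise RuntimeError(f"{name} is not ready for a replication break")
```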

  10. Disaster is declared and DR is initiated.
  11. To use the replicated volumes for disaster recovery, you have the following options:
    a. Create a new volume from one of the replicated snapshots – Restore to new volume
    b. Revert the replicated volume to an earlier snapshot – Revert volume
    c. Use the volume as-is

Option “c” doesn’t guarantee an application-consistent state, so we cannot use it. The other two are valid options for a HANA restore. Option “a” creates extra volumes but can serve as additional contingency, since the replicated volumes remain intact. Also, only option “a” is available while the replication is still active, so you can use it if you want to run a DR drill without impacting the DR replication.

  12. For option “b”, once the CRR mirror is broken, you also have to delete the replication configuration by choosing the Delete option before the Revert volume option becomes available. The advantage of only breaking the mirror and not deleting it is the resync capability, without having to re-establish CRR from scratch.
  13. Now, combining the above two considerations about breaking versus deleting the replication with the need for an application-consistent recovery: do we gain anything by only breaking the mirror and not deleting it? Not much. We will need to go back to an application-consistent snapshot for a successful HANA recovery, and using either option “a” or “b” essentially nullifies the value of preserving the replication configuration. Therefore, we will simply break and delete the replication, and recreate it when we are done with the DR test.
  14. [At DR Region] Break and delete the replication for all the volumes (a scripted sketch follows the screenshots below).

[Screenshots: breaking and deleting the replication on the destination volumes]
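
The break-and-delete sequence can be scripted per volume as sketched below; again, the names are placeholders and the long-running-operation method names should be verified against your azure-mgmt-netapp version.

```python
# Sketch: break, then delete, the replication on each destination volume.
# Placeholder names; confirm begin_break_replication / begin_delete_replication
# against your azure-mgmt-netapp version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.netapp import NetAppManagementClient

client = NetAppManagementClient(DefaultAzureCredential(), "<YourSubID>")

DR_RG, DR_ACCOUNT, DR_POOL = "<DRGroupName>", "<DRANFAccountName>", "<DRCapacityPool>"
DR_VOLUMES = ["hanadata-mnt00001-crr-2021", "hanadata-mnt00002-crr-2021", "logbackup-crr-2021"]

for name in DR_VOLUMES:
    client.volumes.begin_break_replication(DR_RG, DR_ACCOUNT, DR_POOL, name).result()
    # Deleting the replication makes "Revert volume" available, at the cost of
    # losing the resync relationship (as discussed in the step above).
    client.volumes.begin_delete_replication(DR_RG, DR_ACCOUNT, DR_POOL, name).result()
    print(f"Replication broken and deleted for {name}")
```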

 

  15. [At DR Region] Revert each of the volumes to the respective snapshots: the latest data snapshot, and the latest log backups snapshot that includes our test change. You could also create new volumes from these snapshots if you prefer.

[Screenshot: reverting the volume to the selected snapshot]
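
A scripted variant of the revert step might look like the sketch below. It is an illustration only: the volume name is a placeholder, and whether the revert body expects the snapshot’s full resource ID or its snapshot UUID should be confirmed against the REST API reference linked at the end.

```python
# Sketch: revert a destination volume to the chosen application-consistent
# snapshot (option "b"). Placeholder names; confirm begin_revert / VolumeRevert
# and the snapshot field names against your azure-mgmt-netapp version.
from azure.identity import DefaultAzureCredential
from azure.mgmt.netapp import NetAppManagementClient
from azure.mgmt.netapp.models import VolumeRevert

client = NetAppManagementClient(DefaultAzureCredential(), "<YourSubID>")

DR_RG, DR_ACCOUNT, DR_POOL = "<DRGroupName>", "<DRANFAccountName>", "<DRCapacityPool>"
VOLUME = "hanadata-mnt00001-crr-2021"

# Pick the newest azacsnap snapshot, skipping the internal snapmirror.* ones.
snapshots = [
    s for s in client.snapshots.list(DR_RG, DR_ACCOUNT, DR_POOL, VOLUME)
    if not s.name.split("/")[-1].startswith("snapmirror")
]
latest = max(snapshots, key=lambda s: s.created)  # 'created' timestamp; check your SDK

client.volumes.begin_revert(
    DR_RG, DR_ACCOUNT, DR_POOL, VOLUME,
    VolumeRevert(snapshot_id=latest.id),  # confirm: resource ID vs. snapshot UUID
).result()
print(f"{VOLUME} reverted to {latest.name}")
```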

  16. [At DR Region] Once the volumes are reverted/created, assign them to the appropriate pools by choosing the “Change pool” option to meet the required performance throughput. Resize the volumes as necessary.

[Screenshot: the “Change pool” option on the destination volume]
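
The pool change and resize can likewise be scripted; the sketch below uses placeholder pool names and sizes, and the begin_pool_change and begin_update operations should be checked against your azure-mgmt-netapp version.

```python
# Sketch: move a reverted volume to a higher-performance capacity pool
# ("Change pool") and resize it. Placeholder names and sizes throughout.
from azure.identity import DefaultAzureCredential
from azure.mgmt.netapp import NetAppManagementClient
from azure.mgmt.netapp.models import PoolChangeRequest, VolumePatch

client = NetAppManagementClient(DefaultAzureCredential(), "<YourSubID>")

DR_RG, DR_ACCOUNT = "<DRGroupName>", "<DRANFAccountName>"
OLD_POOL, NEW_POOL = "<DRCapacityPool-Standard>", "<DRCapacityPool-Ultra>"
VOLUME = "hanadata-mnt00001-crr-2021"

new_pool_id = (
    f"/subscriptions/<YourSubID>/resourceGroups/{DR_RG}/providers/Microsoft.NetApp/"
    f"netAppAccounts/{DR_ACCOUNT}/capacityPools/{NEW_POOL}"
)
client.volumes.begin_pool_change(
    DR_RG, DR_ACCOUNT, OLD_POOL, VOLUME,
    PoolChangeRequest(new_pool_resource_id=new_pool_id),
).result()

# Resize to meet the production throughput/size requirement (e.g. 4 TiB).
client.volumes.begin_update(
    DR_RG, DR_ACCOUNT, NEW_POOL, VOLUME, VolumePatch(usage_threshold=4 * 1024**4)
).result()
print(f"{VOLUME} moved to {NEW_POOL} and resized")
```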

  17. [At DR Region] On the dual-purpose system, swap out the data and log backups volumes with these new volumes, and adjust the tier and size of the existing volumes (shared, logs) as needed. You will need to unmount the smaller old volumes and mount the new ones after updating the fstab.
  18. Start the HANA recovery process for the SYSTEM database.
  19. After a successful SYSTEM database recovery, start the HANA recovery process for each of the tenants. Key screenshots from the tenant recovery are shown below:

[Screenshots: tenant database recovery steps in HANA Studio]

  20. [At DR Region] Validate the point-in-time recovery by checking that the last known transaction exists in the DR system. In our case, that was the security user, which had been successfully replicated.
  21. A note on failing back: for the DR drill, you can simply discard these DR volumes, put back the smaller volumes you had before, and recover your HANA target instance using the snapshot you had taken on those smaller volumes. Then create new CRR replication for the data and log backups volumes from the primary region again, since the replication relationship and metadata were lost when we deleted the replication pairing for recovery purposes. I have included a link to the documentation with more details on the ANF CRR steps.

Conclusion

What a ride! I have had fun writing this series, and I hope you have had fun reading it. We went through quite a lot in this blog series and covered the major architectural domains for HANA scale-out with ANF on Azure. From the base infrastructure to setting up and testing DR, we now have a functional HANA scale-out architecture running on the most flexible NFSv4.1 storage on Azure, ANF.

As much as I want to dive into the monitoring and operations aspects of ANF, I will defer that for now. I will, however, note the upcoming hard quota change taking effect on April 1st, and point out that the Azure Logic App based solution Sean Luce from NetApp has created is worth a look. Kudos to him for creating an easy-to-deploy, easy-to-manage solution for monitoring and auto-growing volumes and pools using native Azure services. Be sure to check out the links in the Reference section.

Good Luck with your ANF explorations!

Reference

Part 3 – Links – Comments

SAP HANA - Host Auto-Failover – SAP’s documentation

Host Auto-Failover - SAP Help Portal – SAP’s documentation

SAP HANA scale-out with standby with Azure NetApp Files on RHEL - Azure Virtual Machines | Microsoft...

Availability options for Azure Virtual Machines - Azure Virtual Machines | Microsoft Docs

SAP HANA availability within one Azure region - Azure Virtual Machines | Microsoft Docs

Supported cross-region Replication Pairs - ANF CRR regions

Requirements and considerations for using Azure NetApp Files volume cross-region replication | Micro...

Manage disaster recovery using Azure NetApp Files cross-region replication | Microsoft Docs

Create volume replication for Azure NetApp Files | Microsoft Docs

Metrics for Azure NetApp Files | Microsoft Docs – CRR metrics

Volumes (Azure NetApp Files) | Microsoft Docs – REST API reference for operations on ANF Volumes

What changing to volume hard quota means for your Azure NetApp Files service | Microsoft Docs – IMPORTANT!

GitHub - ANFTechTeam/ANFCapacityManager: An Azure Logic App that manages capacity based alert rules ... – By Sean Luce - NetApp

 

Part 2 – Links – Comments

Service levels for Azure NetApp Files | Microsoft Docs – Ultra, Premium and Standard throughput

Azure NetApp Files – Cross Region Replication pricing (microsoft.com) – ANF pricing

Get started with Azure Application Consistent Snapshot tool for Azure NetApp Files | Microsoft Docs - azacsnap

Access tiers for Azure Blob Storage - hot, cool, and archive | Microsoft Docs

Immutable blob storage - Azure Storage | Microsoft Docs

2843934 - How to Check the Consistency of the Persistence - SAP ONE Support Launchpad - hdbpersdiag

Use private endpoints - Azure Storage | Microsoft Docs

Tutorial: Connect to a storage account using an Azure Private endpoint - Azure Private Link | Micros...

Authorize access to blobs with AzCopy & Azure Active Directory | Microsoft Docs – managed ID, azcopy and offloading

Copy or move data to Azure Storage by using AzCopy v10 | Microsoft Docs – Download and install azcopy

moazmirza/aztools (github.com) – aztools github link

SAP Applications on Microsoft Azure Using Azure NetApp Files | TR-4746 | NetApp – NetApp SAP on Azure doc

 

Part 1 – Links – Comments

Azure NetApp Files HANA certification and new region availability | Azure updates | Microsoft Azure - Published: Nov 2019

Cross-region replication of Azure NetApp Files volumes | Microsoft Docs - Supported region pairs

Guidelines for Azure NetApp Files network planning | Microsoft Docs – Delegated Subnet requirements

Delegate a subnet to Azure NetApp Files | Microsoft Docs - Steps

What is subnet delegation in Azure virtual network? | Microsoft Docs

Azure proximity placement groups for SAP applications - Azure Virtual Machines | Microsoft Docs

What's new in Azure NetApp Files | Microsoft Docs – Latest and greatest features updated monthly

Storage-Whitepaper-2-54.pdf (saphana.com) – SAP HANA Storage requirements

SAP HANA Azure virtual machine ANF configuration - Azure Virtual Machines | Microsoft Docs

SAP HANA scale-out with standby with Azure NetApp Files on RHEL - Azure Virtual Machines | Microsoft... – Build Guide

1928533 - SAP Applications on Azure: Supported Products and Azure VM types - SAP ONE Support Launchp...

SAP on Azure - Video Podcast Episode #14 - The One with Azure NetApp Files - YouTube – Ralf Klahr

SAP on Azure Video Podcast - Episode #15 - The One with Azure NetApp Files - Backup - YouTube - Ralf Klahr

SAP on Azure Video Podcast - Episode #16 - The One with Azure NetApp Files - Recovery & Cloning - Yo... - Ralf Klahr

 
 