Architectural Decisions to maximize ANF investment in HANA N+M Scale-Out Architecture - Part 1
Introduction
Do you have a large performance-sensitive SAP HANA footprint in Microsoft Azure? Are you looking to leverage SAP’s native resilience solution? Is operational flexibility at the top of your mind? If so, then look no further. This blog answers these questions using Azure NetApp Files (ANF), the PaaS storage service Microsoft has offered in partnership with NetApp since May 2019. While providing bare-metal performance with sub-millisecond latency and unparalleled operational flexibility, the service also lets customers use SAP HANA’s native high-availability feature for N+M scale-out architecture, called ‘host auto-failover’, thanks to NFS v4.1 support on ANF.
In this blog series, I will focus on the technical execution aspects of running SAP HANA scale-out N+M on Azure NetApp Files from a Solution Architect’s point of view. More specifically, I will focus on how a solution architect would design, build, and operate the infrastructure, backup/recovery, and HA/DR components for a 2+1 scale-out scenario running on RHEL. I will not focus on the application tier, nor will I focus on other cloud foundational elements. This blog series should also interest Operations Architects, Database Administrators, and the like. The series is broken up into multiple publications to cover the journey step by step. This part, Part 1, covers the overall solution overview and the base infrastructure design and build. Part 2 will cover backup/recovery, and Part 3 will cover the HA/DR component. Let’s get to it.
Solution Overview
The key architectural components of this scenario include multiple SAP HANA nodes that are set up as one scale-out system. The required shared storage for this setup would come from Azure NetApp Files, with NFS v4.1 selected as the protocol of choice. There will be an additional virtual machine to provide a centralized automatic backup management capability utilizing a new-to-ANF tool called “azacsnap” (in preview). A pseudo diagram to capture this high-level solution would look something like this:
We will now take this initial diagram and transform it through three phases of solution design and build across the solution components (base infrastructure, backup, and HA/DR). These phases are:
- Assess/Plan
- Design
- Build/Operate
Note: The intention throughout the blog is to present a solution architect’s view of designing and building the solution in a more thought-provoking manner, and to complement the well-crafted MS Docs and NetApp documentation already available on this topic. I will cover key highlights from these documents and, where necessary, add detail in this blog to cover more surface area.
Assess/Plan – Infrastructure, Backup/Recovery and HA/DR
The criticality of assessing and planning for a major technical solution cannot be overstated. While the Cloud is synonymous with flexibility, remediating certain elements of the architecture to accommodate requirements that fell through the cracks during discovery can be costly. Therefore, take your time in making decisions around foundational elements of the SAP landing zone such as subscription design, network design, HANA scaling architecture, etc. It is much harder to retrofit them, especially when you have an active infrastructure. Taking time in planning and assessing goes a long way. This solution is no exception. Let’s look at some of the key points of planning:
- SAP landing zone/cloud foundation planning: The planning for the cloud foundational elements is not in scope for this blog. Let’s assume standard SAP landing zone components are already in place.
- SAP HANA architecture planning: The architecture planning for SAP HANA is also assumed to have been completed, alongside the decision to select ANF as the shared storage choice for the architecture.
- Single versus multiple subscriptions: The high-level Disaster Recovery (DR) solution for this setup could drive the subscription requirement for the SAP landing zone. As of now, ANF Cross Region Replication (CRR) only supports replication within the same subscription. If you intend to use CRR as the DR solution instead of SAP HANA System Replication, then you must use one subscription. The ANF team regularly expands the feature set and relaxes existing limitations, so be sure to validate this at the time of planning.
- Region selection: ANF service availability, ANF CRR region pairing and SAP HANA VM SKU selection are among the key factors in deciding the regions of choice for primary operations and disaster recovery.
- Number of ANF accounts: An ANF account is an administrative layer with a regional scope that holds one or more storage pools. Creating one account per region will give you the most flexibility for system restores and environment refreshes. You will not be able to create a copy of an ANF volume outside its associated storage pool.
- Networking: ANF requires a delegated subnet per VNET. A /28 with 11 usable IP addresses is sufficient. Also, you cannot apply NSGs or UDRs on this subnet.
- High-Availability: ANF is a PaaS service and the resilience is already baked in. However, we will need high availability for the SAP HANA DB. Since we are covering the N+M scenario, high availability is provided by the native HA feature called Host Auto-failover. This means an additional HANA VM of the same size is needed as a stand-by, along with a distinct storage configuration to support this scenario. Also, plan on placing the VMs in an availability set (AS) associated with a proximity placement group (PPG). Why? See the next point:
- Minimizing latency: In addition to leveraging PPG/AS to bring App and DB closer, we will also use PPG/AS to pin the ANF resources closer to the HANA databases. This is a manual backend process and needs to be coordinated with the Microsoft support team.
- Backup Offloading: Will you be using ANF snapshots for HANA backups? If so, do you have a requirement to provide additional redundancy and protection to ANF snapshots? This is surely not a must due to already available redundancy in ANF and the security layers you could leverage at VNET, AAD/RBAC and OS levels etc. However, should you have such a requirement, you will need to invest in a secure storage account for offloading backups and for keeping redundant full copies.
- Security: For this scenario, the recommendation is to use NFS v4.1 as the protocol for all SAP HANA volumes. Besides the restriction on the delegated subnet, your VNET overall can have the desired enterprise-scale security protocols. You can also apply all the recommended OS-level security for Linux as well as application-level security for the HANA DB. You can control access to the ANF PaaS service via RBAC. Keep in mind that if you are planning to automate scheduling of the database snapshots, you will need to allow outbound access to publicly available Azure APIs. This is needed for the automation and orchestration software, azacsnap, to be able to access Azure AD via a service principal to interact with the ANF service.
- ANF Storage Capacity Planning: We will need the sizing estimate for the SAP HANA DB. In addition, a high-level backup policy (backup scope, frequency, retention) and the offloading/offsite requirements are also required. Finally, we will need to leverage the different performance tiers (Ultra, Premium, and Standard) to balance the cost and performance needs of each volume.
- Available ANF Features: The ANF product team has a great customer feedback cycle, and as a result, the ANF service is refreshed regularly with feedback-driven features. Be sure to check what’s new in the ANF world and assess whether you would like to add any of the latest features to your design.
Next, we will take this assessment and the key findings into the design phase. We will design the base infrastructure first, and then will design the backup/restore and HA/DR components on top of it.
Design - Base Infrastructure
Network
- This design includes a single VNET for both Prod and Non-Prod. However, other options include:
  - You could create two or more VNETs for segregation, with each VNET holding volumes associated with a separate ANF account, but that would restrict environment refreshes across VNETs.
  - Another option would be to place ANF volumes in one VNET and provision them to clients in all the peered VNETs. Account for minor to negligible latency here due to peering. Also keep in mind the transit routing restriction with VNET peering.
  - As a slight variation of the above, you could create an ANF delegated subnet per VNET, create volumes in each of the subnets, and provision the volumes to the clients in the respective VNETs. It is still all under the same ANF account.
- Each VNET receiving the ANF volumes must have one and only one delegated subnet dedicated for ANF. If you try to deploy to another delegated ANF subnet within the same VNET, you will get an error like this:
- A /28 delegated subnet is sufficient for most cases. For a very large deployment, say 12+ scale-out nodes running on very large Mv2s, you may need a bigger subnet to maximize performance with concurrent unique network paths to ANF.
- The management VM will authenticate with a service principal (SPN) against the Azure public APIs, so it needs outbound access to those APIs, as stated earlier.
- Attach three NICs per HANA node for segregating the client, inter-node and ANF storage communications. This will not triple the network performance on the VM.
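The /28 figure used for the delegated subnet can be sanity-checked with a few lines of Python. This is only a sketch: it relies on the fact that Azure reserves five addresses in every subnet, and the 10.0.1.0 address ranges are hypothetical values chosen for illustration.

```python
import ipaddress

# Azure reserves five addresses in every subnet: the network address,
# the broadcast address, and three addresses for internal Azure services.
AZURE_RESERVED_PER_SUBNET = 5


def usable_azure_ips(cidr: str) -> int:
    """Return how many addresses in the subnet can be assigned to resources."""
    subnet = ipaddress.ip_network(cidr, strict=True)
    return max(subnet.num_addresses - AZURE_RESERVED_PER_SUBNET, 0)


if __name__ == "__main__":
    # Hypothetical address ranges, used only to compare prefix lengths.
    for cidr in ("10.0.1.0/28", "10.0.1.0/27", "10.0.1.0/26"):
        print(cidr, usable_azure_ips(cidr))
```

A /28 yields the 11 usable addresses mentioned in the planning section; moving to a /27 or /26 is the kind of step you might consider for the very large deployments described above.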
Compute
- Two HANA scale-out nodes plus the stand-by node in the DB or equivalent subnet. You can increase the count per your requirement. The VM SKUs, chosen to meet the HANA sizing requirement, must be supported by SAP. Check SAP Note 1928533 for more information.
- Associate HANA DB VMs with an availability set tied to a PPG.
- A Backup Management VM in a separate subnet for isolation. A D2s/D4s would do just fine. Also, we don’t need a data disk on this VM for this scenario.
Storage – Azure NetApp Files
- An ANF Account per region: A single account in the region will streamline the environment refresh for lower environments in the region.
- An Ultra storage pool and a Premium storage pool: The Ultra pool would hold the HANA data and log volumes, while the Premium pool would hold the rest. If you are concerned about reaching the 500 TB maximum size limit of the pool, then create a pair of Ultra and Premium pools for each of the SAP landscapes. For example:
BW: Pool 1: Ultra Pool, Pool 2: Premium Pool
CAR: Pool 3: Ultra Pool, Pool 4: Premium Pool
And so on
This setup would enable both the needed scalability and the ability to perform lower environment system refreshes.
- Volume size as a function of HANA memory: Follow the typical sizing approach for SAP HANA, but stay above the minimum sizes listed below to meet the minimum performance guidelines. The details are in MS Docs: HANA storage configurations with ANF.
| Volume | Est. Sizing | Storage Pool Service Level | Min. throughput requirement from SAP | NFS protocol |
| --- | --- | --- | --- | --- |
| /hana/data | 1.2 x Net Disk Space AND >3.2TB Ultra OR >6.3TB Premium | Ultra/Premium | Read activity of at least 400 MB/s for /hana/data for 16-MB and 64-MB I/O sizes. Write activity of at least 250 MB/s for /hana/data with 16-MB and 64-MB I/O sizes. | v4.1 |
| /hana/log | 1 x Memory AND >2TB Ultra OR >4TB Premium | Ultra/Premium | Read-write on /hana/log of 250 MB/s with 1-MB I/O sizes. | v4.1 |
| /hana/shared | 1 x Memory every 4 nodes | Premium | | v4.1 (or v3) |
| /usr/sap | 50 GB | Premium | | v4.1 (or v3) |
| /backup/log | For log backups (change default location) | Premium | | v4.1 (or v3) |
| /backup/data | Optional for file-level native backups | Premium | | v4.1 (or v3) |
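The minimum sizes in the table follow from how ANF ties throughput to provisioned capacity. The sketch below assumes auto QoS pools and the published per-TiB rates for each service level (Ultra 128, Premium 64, Standard 16 MiB/s per TiB); it also treats MB and MiB loosely, as the table does. The function names are my own and are not part of any ANF tooling.

```python
# Throughput each ANF service level provides per provisioned TiB, in MiB/s.
# Assumption: auto QoS capacity pools; re-check the rates at planning time.
MIBPS_PER_TIB = {"Ultra": 128, "Premium": 64, "Standard": 16}


def volume_throughput_mibps(size_tib: float, level: str) -> float:
    """Throughput a volume of the given size receives at a service level."""
    return size_tib * MIBPS_PER_TIB[level]


def min_size_tib(required_mibps: float, level: str) -> float:
    """Smallest volume size that reaches the required throughput."""
    return required_mibps / MIBPS_PER_TIB[level]


if __name__ == "__main__":
    # SAP minimums from the table: 400 MB/s for data reads, 250 MB/s for log.
    for volume, required in (("/hana/data", 400), ("/hana/log", 250)):
        for level in ("Ultra", "Premium"):
            print(volume, level, round(min_size_tib(required, level), 2), "TiB")
```

For example, 400 MB/s works out to 3.125 TiB on Ultra and 6.25 TiB on Premium, which lines up with the >3.2TB and >6.3TB minimums for /hana/data. The 250 MB/s log requirement similarly lands just under 2 TiB on Ultra and 4 TiB on Premium.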
As an example, the volume sizes for 2+1 M128s (2TB) would look like this:
| Volume | Est. Sizing | Service Level | NFS protocol |
| --- | --- | --- | --- |
| /hana/data/<SID>/mnt00001 | 3.2 TB | Ultra | v4.1 |
| /hana/data/<SID>/mnt00002 | 3.2 TB | Ultra | v4.1 |
| /hana/log/<SID>/mnt00001 | 2 TB | Ultra | v4.1 |
| /hana/log/<SID>/mnt00002 | 2 TB | Ultra | v4.1 |
| /hana/shared | 2 TB | Premium | v4.1 |
| /usr/sap for Node 1 | 50 GB | Premium | v4.1 |
| /usr/sap for Node 2 | 50 GB | Premium | v4.1 |
| /usr/sap for Node 3 | 50 GB | Premium | v4.1 |
| /backup/log | 2 TB (varies per log retention and performance requirements) | Premium/Standard* | v4.1 |
| /backup/data | 6 TB (varies per backup size and performance requirements) | Premium/Standard* | v4.1 |
*Note:
- I encourage you to run benchmark tests to validate the throughput experienced at the OS layer and adjust the volume throughput per the application’s requirements.
- Due to the throughput restriction of 1.2 to 1.4 GB/s for a LIF in a single TCP session, the throughput experienced at the OS layer hits a ceiling at around 15 TB for an Ultra SKU volume and 40 TB for a Premium SKU volume. Consider using a lower tier in this situation.
- The flexibility of dynamically changing the service level enables you to switch these volumes from one tier to another. Changes in size requirement or performance could result in exercising this option.
- You could choose NFS v3 for volumes other than data and log, but we will keep v4.1 in this scenario for consistency.
- OS Managed Disk: We will use a Premium managed disk for the OS disk. There is no additional data disk requirement for this scenario.
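To see how the 2+1 example translates into pool capacity, the volume sizes from the example table can be added up per service level. This is a rough sketch under a few assumptions of mine: sizes are taken in decimal GB, the two backup volumes are placed in the Premium pool (the table allows Standard as well), and the 500 TB pool limit is the figure quoted in this design.

```python
from collections import defaultdict

# Pool size limit quoted in this design, expressed in GB (decimal units assumed).
POOL_LIMIT_GB = 500_000

# (volume, size in GB, service level) taken from the 2+1 M128s example table.
# Assumption: the backup volumes live in the Premium pool.
VOLUMES = [
    ("/hana/data/<SID>/mnt00001", 3200, "Ultra"),
    ("/hana/data/<SID>/mnt00002", 3200, "Ultra"),
    ("/hana/log/<SID>/mnt00001", 2000, "Ultra"),
    ("/hana/log/<SID>/mnt00002", 2000, "Ultra"),
    ("/hana/shared", 2000, "Premium"),
    ("/usr/sap (node 1)", 50, "Premium"),
    ("/usr/sap (node 2)", 50, "Premium"),
    ("/usr/sap (node 3)", 50, "Premium"),
    ("/backup/log", 2000, "Premium"),
    ("/backup/data", 6000, "Premium"),
]


def pool_totals_gb(volumes):
    """Sum the provisioned volume sizes per service level."""
    totals = defaultdict(int)
    for _name, size_gb, level in volumes:
        totals[level] += size_gb
    return dict(totals)


if __name__ == "__main__":
    for level, total in pool_totals_gb(VOLUMES).items():
        headroom = POOL_LIMIT_GB - total
        print(f"{level}: {total} GB provisioned, {headroom} GB below the pool limit")
```

A single system of this size sits far below the pool limit, which is why the per-landscape pool pairs described above only become relevant once many systems share the same pools.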
Build - Base infrastructure
The steps to build a similar architecture are well laid out in MS Docs, so I will not repeat them here. With that MS Docs guide, the details in the design section, and the anecdotal comments below, you have enough ammunition for the build phase:
Network
- The delegated subnet is needed when we are ready to create the volumes.
- On a restrictive subnet, the outbound internet access to management URLs can be achieved by configuring an outbound rule on a dedicated public load balancer, or by configuring the outbound connectivity in Azure Firewall or in a third-party firewall.
Compute
- The anchor VM, usually the DB VM, goes first in the deployment to pin the infrastructure.
- You may run into a problem when mounting an Azure Files share on the RHEL system and get an error like this:
Install the cifs-utils package (“yum install cifs-utils”) and then try again.
- When installing SAP HANA, you may face installation errors pointing to missing modules. If so, install the following and try again: yum install libtool-ltdl
Storage
- For complex workloads, to ensure sub-millisecond latency to the storage, you may need Microsoft’s assistance in performing a manual backend pinning so that the ANF storage is laid out closer to the compute units. To do that, the key information you will need to provide to the support team is the empty Availability Set, the PPG, and the empty storage pool. Once the pinning is done, you are good to proceed with the infrastructure provisioning.
- For large VMs in a scale-out, depending on the environment size, performance requirement, and environment priority, you may also need to use a dedicated logical interface (LIF) for each of the data and log volumes. This is also done manually today with the help of the support team. For this 2+1 scale-out scenario, an example LIF pinning would look like this:
IP1 → data mnt 1, IP2 → log mnt 1
IP2 → data mnt 2, IP1 → log mnt 2
This helps enable two unique paths from each of the nodes down to the ANF volumes.
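The alternating pattern in the LIF example can be written down as a small helper, which may be handy when preparing the request for the support team. This is purely illustrative: the function name, the labels, and the generalization beyond two nodes are my own assumptions, and the actual pinning is still performed manually by Microsoft.

```python
def assign_lifs(node_count: int, lif_ips: list[str]) -> dict[str, str]:
    """Alternate data and log mounts across LIF IPs.

    Each node's data and log volumes land on different IPs, and the
    starting IP rotates from node to node.
    """
    if len(lif_ips) < 2:
        raise ValueError("at least two LIF IPs are needed for distinct paths")
    mapping = {}
    for node in range(1, node_count + 1):
        data_ip = lif_ips[(node - 1) % len(lif_ips)]
        log_ip = lif_ips[node % len(lif_ips)]
        mapping[f"data mnt {node}"] = data_ip
        mapping[f"log mnt {node}"] = log_ip
    return mapping


if __name__ == "__main__":
    # Two worker nodes with volumes; the stand-by node has none of its own.
    for mount, ip in assign_lifs(2, ["IP1", "IP2"]).items():
        print(f"{ip} -> {mount}")
```

With two worker nodes and two IPs, this reproduces the mapping shown above: IP1 serves data mount 1 and log mount 2, while IP2 serves log mount 1 and data mount 2.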
Conclusion and What’s next
This concludes the first part of the blog series. Now that we have the base infrastructure, we will focus on backup/recovery and HA/DR in the upcoming parts of this series. Stay tuned.
Reference