Overview
On Azure, high availability for SAP workloads running on Linux can be achieved by setting up a pacemaker cluster. In a running system, the cluster may detect that one of the nodes is misbehaving and needs to be removed. It then fences that node, which is commonly done with a STONITH resource. In Azure, you have two options for setting up STONITH in the pacemaker cluster: an Azure fence agent based STONITH, which restarts a failed node via Azure APIs, or a disk based STONITH (SBD) device. Depending on the Linux distribution (SLES or RHEL), you need to choose the appropriate STONITH option for Azure. Refer to the Setting up Pacemaker on RHEL in Azure or Setting up Pacemaker on SLES in Azure guides for more details.
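For reference, on SLES an SBD-based STONITH resource in the cluster typically looks like the minimal crmsh sketch below. The timing values are assumptions taken from common guidance; follow the linked Pacemaker setup guide for the authoritative configuration.

# Minimal sketch of an SBD-based STONITH resource on SLES (crmsh syntax).
# Assumes the SBD devices are already listed in /etc/sysconfig/sbd on all nodes.
sudo crm configure primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max="15" \
    op monitor interval="600" timeout="15"
sudo crm configure property stonith-enabled="true"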
Setting up high availability for an SAP workload on Azure protects the application from infrastructure maintenance or failure within a region, but it doesn't provide protection from a widespread regional disaster. For DR, the application running on Azure VMs can be protected by replicating its components to another Azure region using Azure Site Recovery (ASR). In this article, we discuss how to achieve a highly available configuration in the DR region using ASR, when your SAP ASCS/ERS pacemaker cluster with SBD devices (using iSCSI target servers) is configured in the production region.
IMPORTANT NOTE
- The example shown in this article was exercised on the OS versions below, with the "classic" file system structure.
- SAP ASCS/ERS OS version: SUSE Linux Enterprise Server for SAP Applications 15 SP3
- iSCSI target server OS version: SUSE Linux Enterprise Server 15 SP2
- As Red Hat doesn't support SBD on cloud platforms, the steps are applicable only to applications running on SLES.
- Depending on the type of shared storage used for the SAP workload, you have to adopt an appropriate method to replicate the storage data to the DR site. In this example, NFS on Azure NetApp Files (ANF) is used, and its DR setup can be achieved using cross-region replication.
- Failover of other dependent services like DNS or Active Directory is not covered in this blog.
- To replicate VMs using ASR for DR setup, review supported regions.
- Azure NetApp Files is not available in all regions. Refer to Azure Products by Region to see if ANF is available in your DR region. You also need to make sure ANF volume replication is supported for your DR region; for cross-region replication of Azure NetApp Files, review the supported replication pairs.
- ASR doesn't replicate the Azure load balancer that provides the virtual IP for the SAP ASCS/ERS cluster configuration in the source site. In the DR site, you need to create the load balancer manually, either beforehand or at the time of the failover event.
- The procedure described here has not been validated against different OS releases or with the respective OS providers, so it might not work completely with your specific implementation or with future OS releases. Make sure you test and document the entire procedure thoroughly in your environment.
- If you have configured SAP ASCS/ERS pacemaker cluster with Azure fence agent in your production region, and want to achieve highly available configuration on DR region using ASR, refer to the blog SAP ASCS HA Cluster (in Linux OS) failover to DR region using Azure Site Recovery - Microsoft Tech Community
SAP ASCS/ERS Disaster Recovery Architecture
In the figure below, the SAP ASCS/ERS high availability cluster is configured in the primary region. The cluster uses SBD devices for STONITH, which are discovered by the HA cluster nodes from three different iSCSI target servers. To establish DR for this setup, Azure Site Recovery (ASR) is used to replicate the SAP ASCS/ERS and iSCSI target server VMs to the DR site. The NFS on Azure NetApp Files (ANF) volumes used by SAP ASCS/ERS are replicated to the DR site using cross-region replication.
NOTE: You can also leverage NFS on Azure Files for the SAP ASCS/ERS cluster, but NFS on Azure Files is not covered in this blog.
As described in the example, to achieve the HA setup for SAP ASCS/ERS in the DR site, we need to make sure that all components that are part of the solution are replicated.
Components | DR setup |
SAP ASCS/ERS VMs | Replicate VMs using Azure site recovery |
iSCSI target server VMs | Replicate VMs using Azure site recovery |
Azure NetApp Files (ANF) | Cross region replication |
Disaster Recovery (DR) site preparation
To achieve a similar highly available setup of SAP ASCS/ERS in the DR site, you need to make sure that all the components are replicated.
Configure ASR for SAP ASCS/ERS and iSCSI target servers
- Deploy the resource group, virtual network, subnet and Recovery Services vault in the secondary region. For more information on networking in Azure VM disaster recovery, refer to prepare networking for Azure VM disaster recovery. As this example uses Azure NetApp Files, you also need to create a separate subnet delegated to the NetApp service (see the CLI sketch after this list).
- Follow the instructions in the Tutorial to set up disaster recovery for Azure VMs document to configure ASR for SAP ASCS/ERS and all iSCSI target servers.
- On enabling Azure Site Recovery for a VM to set up DR, the OS and local data disks that are attached to the VM get replicated to the DR site. During replication, the VM disk writes are sent to a cache storage account in the source region. Data is sent from there to the target region, and recovery points are generated from the data. When you fail over a VM during DR, a recovery point is used to restore the VM in the target region.
- After the VMs are replicated, the status will turn to "Protected" and the replication health will be "Healthy".
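As a reference for the network preparation in the first step above, the Azure CLI sketch below creates a virtual network with a workload subnet and an ANF-delegated subnet in the DR region. All resource names, address ranges and the region are placeholders, not values from this setup.

# Sketch only: prepare DR networking with a subnet delegated to Azure NetApp Files.
# Names, address spaces and region are assumptions - adjust to your environment.
az group create --name rg-sap-dr --location westus2

az network vnet create --resource-group rg-sap-dr --name vnet-sap-dr \
    --address-prefix 10.29.0.0/16 \
    --subnet-name subnet-sap --subnet-prefix 10.29.0.0/24

# Delegated subnet required for Azure NetApp Files volumes in the DR region
az network vnet subnet create --resource-group rg-sap-dr --vnet-name vnet-sap-dr \
    --name subnet-anf --address-prefix 10.29.1.0/24 \
    --delegations Microsoft.NetApp/volumes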
Replicate volumes in DR site
Configure cross region replication for Azure NetApp Files
- Before you configure cross region replication for ANF, refer to Requirements and considerations for Azure NetApp Files cross-region replication.
- Create a NetApp account, capacity pool and delegated subnet for ANF in the DR region.
NOTE: The destination ANF account must be in a different region from the source volume's region.
- Follow the steps in the Create volume replication for Azure NetApp Files document to create a replication peering between the source and destination ANF volumes.
- Create a data replication volume in the DR site for each volume in the source ANF. After you authorize the replication from the source volume, you can see the Mirror state of the volume change from "Uninitialized" to "Mirrored". Refer to Display health status of Azure NetApp Files replication relationship to understand the different health statuses. A CLI sketch for checking the replication status follows below.
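If you prefer the command line over the portal, the replication health can also be queried with the Azure CLI as sketched below. The resource names are placeholders, and parameter names may differ slightly between CLI versions, so treat this as an illustration rather than the exact commands used in this setup.

# Sketch only: check the cross-region replication status of the destination ANF volume.
# Account, pool and volume names are placeholders.
az netappfiles volume replication status \
    --resource-group rg-sap-dr \
    --account-name anf-account-dr \
    --pool-name anf-pool-dr \
    --name app153-sapmntqas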
ANF volume in DR site; ANF volume health status
- Maintain the NFS on ANF DR mount volume entries in /etc/fstab on the SAP system VMs running in the primary region. The DR mount volume entries must be commented out, as shown below.
app1531:~ # more /etc/fstab
# Primary region NFS on ANF mounts
10.0.3.5:/app153-sapmntQAS /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
10.0.3.5:/app153-usrsapQASsys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# DR site NFS on ANF mounts (REMOVE HASH IN THE EVENT OF A DR FAILOVER)
# 10.29.1.4:/app153-sapmntqas /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# 10.29.1.4:/app153-usrsapqassys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
NOTE: The /usr/sap/QAS/ASCS00 and /usr/sap/QAS/ERS01 file systems are managed by the pacemaker cluster, so there is no need to maintain them in /etc/fstab.
Configure Standard Load Balancer for SAP ASCS/ERS in DR site
Deploy a load balancer in the DR site as described in this article. If you create the load balancer beforehand in the DR site, you won't be able to assign VMs to the backend pool yet: you can create the backend pool empty, which allows you to define the load balancing rules, but the DR VMs can only be added to the backend pool after failover. A CLI sketch follows the table below. Also, keep the following points in mind:
- Keep the probe ports in the DR load balancer the same as in the primary region.
- When VMs without public IP addresses are placed in the backend pool of an internal (no public IP address) Standard Azure load balancer, there is no outbound internet connectivity unless additional configuration is performed to allow routing to public endpoints. For details on how to achieve outbound connectivity, see Public endpoint connectivity for Virtual Machines using Azure Standard Load Balancer in SAP high-availability scenarios.
Site | Frontend IP |
Primary region | ASCS: 10.0.1.12, ERS: 10.0.1.13 |
DR region | ASCS: 10.29.0.10, ERS: 10.29.0.11 |
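The Azure CLI sketch below shows one way to pre-create the internal Standard load balancer in the DR site with an empty backend pool, health probes and HA-ports rules. The resource names, subnet and probe ports are assumptions for illustration (the probe ports must match the primary site), and parameter names may vary between Azure CLI versions.

# Sketch only: internal Standard load balancer for SAP ASCS/ERS in the DR site.
# Names, subnet and probe ports are placeholders; keep the probe ports identical to the primary site.
az network lb create --resource-group rg-sap-dr --name lb-qas-dr --sku Standard \
    --vnet-name vnet-sap-dr --subnet subnet-sap \
    --frontend-ip-name frontend-ascs --private-ip-address 10.29.0.10 \
    --backend-pool-name bp-qas

# Second frontend IP for ERS
az network lb frontend-ip create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name frontend-ers --vnet-name vnet-sap-dr --subnet subnet-sap \
    --private-ip-address 10.29.0.11

# Health probes (ports assumed; must match the azure-lb resources in the cluster)
az network lb probe create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name probe-ascs --protocol Tcp --port 62000
az network lb probe create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name probe-ers --protocol Tcp --port 62101

# HA-ports rules with floating IP enabled
az network lb rule create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name rule-ascs --protocol All --frontend-port 0 --backend-port 0 \
    --frontend-ip-name frontend-ascs --backend-pool-name bp-qas \
    --probe-name probe-ascs --floating-ip true --idle-timeout 30
az network lb rule create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name rule-ers --protocol All --frontend-port 0 --backend-port 0 \
    --frontend-ip-name frontend-ers --backend-pool-name bp-qas \
    --probe-name probe-ers --floating-ip true --idle-timeout 30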
Disaster Recovery (DR) failover
In case of a DR event, the following procedure needs to be followed for SAP ASCS/ERS. If you are using different Azure services with your SAP system, you may need to tweak your procedure accordingly.
- Perform the failover of the SAP ASCS/ERS cluster and iSCSI target VMs to the DR region using ASR. For more details on how to fail over, refer to the Tutorial to fail over Azure VMs to a secondary region for disaster recovery with Azure Site Recovery document.
- After the failover is completed, the status of the replicated items in the Recovery Services vault will look like the following:
Failover status of VMs in Recovery Services vault
- Update the IP addresses of all the VMs (in this example, SAP ASCS/ERS and all iSCSI target servers) in AD/DNS or in the host files.
- Fail over the source ANF volumes to the destination volumes. To activate a destination volume (for example, when you want to fail over to the destination region), you need to break the replication peering and then mount the destination volume. Follow the instructions in the Manage disaster recovery using cross-region replication document for failover; a CLI sketch is shown below.
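For reference, breaking the replication peering can also be done with the Azure CLI as sketched below, run against the destination volume in the DR region. The resource names are placeholders, and parameter names may vary slightly between CLI versions.

# Sketch only: break the replication peering to activate the destination ANF volume.
# Run for each destination volume; names are placeholders.
az netappfiles volume replication break \
    --resource-group rg-sap-dr \
    --account-name anf-account-dr \
    --pool-name anf-pool-dr \
    --name app153-sapmntqas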
ANF volumes broken peering status
- As we have maintained the entries for the DR volumes (/sapmnt/QAS and /usr/sap/QAS/SYS) in /etc/fstab in the primary region, we only need to remove the comments to mount the volumes after failover. Conversely, you should comment out the entries of the former primary region file systems.
NOTE: The /usr/sap/QAS/ASCS00 and /usr/sap/QAS/ERS01 file systems are managed by the pacemaker cluster, so there is no need to maintain them in /etc/fstab.

app1531:~ # more /etc/fstab
# Former primary site NFS on ANF mounts (REMOVE HASH IN THE EVENT OF FAILBACK)
# 10.0.3.5:/app153-sapmntQAS /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# 10.0.3.5:/app153-usrsapQASascs /usr/sap/QAS/ASCS00 nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# 10.0.3.5:/app153-usrsapQASsys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# DR site NFS on ANF mounts
10.29.1.4:/app153-sapmntqas /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
10.29.1.4:/app153-usrsapqassys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0

app1531:~ # mount -a
app1531:~ # df -h | grep -i qas
Filesystem                       Size  Used Avail Use% Mounted on
10.29.1.4:/app153-sapmntqas      512G  6.8G  506G   2% /sapmnt/QAS
10.29.1.4:/app153-usrsapqassys   512G  1.5M  512G   1% /usr/sap/QAS/SYS
- Add the SAP ASCS/ERS VMs to the backend pool of the Standard load balancer (a CLI sketch is shown below the screenshot).
Standard Load Balancer backend pool in DR site
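One way to add the failed-over VMs to the backend pool is via the Azure CLI, as sketched below. The NIC, IP configuration, load balancer and pool names are placeholders and must match the resources you created in the DR site.

# Sketch only: add each SAP ASCS/ERS VM's NIC to the load balancer backend pool.
# NIC names, ipconfig name, pool and load balancer names are placeholders.
az network nic ip-config address-pool add \
    --resource-group rg-sap-dr \
    --nic-name app1531-nic \
    --ip-config-name ipconfig1 \
    --lb-name lb-qas-dr \
    --address-pool bp-qas
az network nic ip-config address-pool add \
    --resource-group rg-sap-dr \
    --nic-name app1532-nic \
    --ip-config-name ipconfig1 \
    --lb-name lb-qas-dr \
    --address-pool bp-qas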
NOTE: When VMs without public IP addresses are placed in the backend pool of an internal (no public IP address) Standard Azure load balancer, there is no outbound internet connectivity unless additional configuration is performed to allow routing to public endpoints. For details on how to achieve outbound connectivity, see Public endpoint connectivity for Virtual Machines using Azure Standard Load Balancer in SAP high-availability scenarios.
- Update the VMs' physical IP addresses in the below section of /etc/corosync/corosync.conf:
nodelist {
    node {
        ring0_addr: 10.29.0.4
        nodeid: 1
    }
    node {
        ring0_addr: 10.29.0.6
        nodeid: 2
    }
}
- After failover, the iSCSI disks are not discovered automatically because the IP addresses of the iSCSI target servers have changed. So you need to connect to the iSCSI devices using the iscsiadm command, as shown below.
The SBD device IDs remain the same after ASR failover, so you do not have to make changes in the /etc/sysconfig/sbd file.

app1531:~ # lsscsi
[0:0:0:0]    disk    Msft    Virtual Disk    1.0    /dev/sda
[3:0:1:0]    disk    Msft    Virtual Disk    1.0    /dev/sdb

# Execute the commands below as root, changing the IP addresses to those of your iSCSI target servers.
# Also make sure you use the correct IQN name to discover the iSCSI devices.
sudo iscsiadm -m discovery --type=st --portal=10.29.0.7:3260
sudo iscsiadm -m node -T iqn.2006-04.ascsnw1.local:ascsnw1 --login --portal=10.29.0.7:3260
sudo iscsiadm -m node -p 10.29.0.7:3260 -T iqn.2006-04.ascsnw1.local:ascsnw1 --op=update --name=node.startup --value=automatic

sudo iscsiadm -m discovery --type=st --portal=10.29.0.8:3260
sudo iscsiadm -m node -T iqn.2006-04.ascsnw1.local:ascsnw1 --login --portal=10.29.0.8:3260
sudo iscsiadm -m node -p 10.29.0.8:3260 -T iqn.2006-04.ascsnw1.local:ascsnw1 --op=update --name=node.startup --value=automatic

sudo iscsiadm -m discovery --type=st --portal=10.29.0.9:3260
sudo iscsiadm -m node -T iqn.2006-04.ascsnw1.local:ascsnw1 --login --portal=10.29.0.9:3260
sudo iscsiadm -m node -p 10.29.0.9:3260 -T iqn.2006-04.ascsnw1.local:ascsnw1 --op=update --name=node.startup --value=automatic

app1531:~ # lsscsi
[0:0:0:0]    disk    Msft       Virtual Disk    1.0    /dev/sda
[1:0:1:0]    disk    Msft       Virtual Disk    1.0    /dev/sdb
[6:0:0:0]    disk    LIO-ORG    sbdascsnw1      4.0    /dev/sdc
[7:0:0:0]    disk    LIO-ORG    sbdascsnw1      4.0    /dev/sdd
[8:0:0:0]    disk    LIO-ORG    sbdascsnw1      4.0    /dev/sde

app1531:~ # ls -l /dev/disk/by-id/scsi-* | grep sdc
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-1LIO-ORG_sbdascsnw1:d778427c-479f-4264-a510-4fecedfd044e -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-SLIO-ORG_sbdascsnw1_d778427c-479f-4264-a510-4fecedfd044e -> ../../sdc

app1531:~ # ls -l /dev/disk/by-id/scsi-* | grep sdd
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-1LIO-ORG_sbdascsnw1:a15a2398-9090-4610-b3b1-185ae2385d3b -> ../../sdd
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-36001405a15a239890904610b3b1185ae -> ../../sdd
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-SLIO-ORG_sbdascsnw1_a15a2398-9090-4610-b3b1-185ae2385d3b -> ../../sdd

app1531:~ # ls -l /dev/disk/by-id/scsi-* | grep sde
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-1LIO-ORG_sbdascsnw1:4287e788-620c-4e0c-93f4-293c8169c86d -> ../../sde
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-360014054287e788620c4e0c93f4293c8 -> ../../sde
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-SLIO-ORG_sbdascsnw1_4287e788-620c-4e0c-93f4-293c8169c86d -> ../../sde
app1531:~ # more /etc/sysconfig/sbd | grep -i SBD_DEVICE
# SBD_DEVICE specifies the devices to use for exchanging sbd messages
SBD_DEVICE="/dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece;/dev/disk/by-id/scsi-36001405a15a239890904610b3b1185ae;/dev/disk/by-id/scsi-360014054287e788620c4e0c93f4293c8"
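Optionally, before starting the cluster you can verify that the SBD devices listed in /etc/sysconfig/sbd are readable from both nodes. This is a suggested check, not part of the original procedure; the device path below is one of the IDs from this example.

# Optional sanity check (sketch): dump the SBD header and list the message slots on a configured device.
sudo sbd -d /dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece dump
sudo sbd -d /dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece list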
- Start the pacemaker cluster on both the SAP ASCS and SAP ERS nodes and place it in maintenance mode.
# Execute the below command on both nodes to start the cluster services
sudo crm cluster start

# Place the cluster in maintenance mode. Execute from any one of the nodes
sudo crm configure property maintenance-mode=true
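Once the cluster services are running, you can optionally confirm that corosync has picked up the new ring addresses from the updated corosync.conf. This verification step is an addition to the original procedure.

# Optional check (sketch): show the local corosync ring status and the known cluster nodes.
sudo corosync-cfgtool -s
sudo crm_node -l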
- Update the configuration of the file system resources (fs_QAS_*) and virtual IP resources (vip_QAS_*) in the pacemaker cluster and save the changes.
NOTE: If you have changed the probe port to something different from the primary site, then you need to edit the nc_QAS_ASCS and nc_QAS_ERS resources as well with the new port.

# On executing the below commands, the resource configuration opens in the vi editor. Make the changes and save.

sudo crm configure edit fs_QAS_ASCS
# Change the device to the DR volume
primitive fs_QAS_ASCS Filesystem \
    params device="10.29.1.4:/app153-usrsapqasascs" directory="/usr/sap/QAS/ASCS00" fstype=nfs options="sec=sys,vers=4.1" \
    op start timeout=60s interval=0 \
    op stop timeout=60s interval=0 \
    op monitor interval=20s timeout=40s

sudo crm configure edit vip_QAS_ASCS
# Change the virtual IP address to the frontend IP address of the load balancer in the DR region
primitive vip_QAS_ASCS IPaddr2 \
    params ip=10.29.0.10 cidr_netmask=24 \
    op monitor interval=10 timeout=20

sudo crm configure edit fs_QAS_ERS
# Change the device to the DR volume
primitive fs_QAS_ERS Filesystem \
    params device="10.29.1.4:/app153-usrsapqasers" directory="/usr/sap/QAS/ERS01" fstype=nfs options="sec=sys,vers=4.1" \
    op start timeout=60s interval=0 \
    op stop timeout=60s interval=0 \
    op monitor interval=20s timeout=40s

sudo crm configure edit vip_QAS_ERS
# Change the virtual IP address to the frontend IP address of the load balancer in the DR region
primitive vip_QAS_ERS IPaddr2 \
    params ip=10.29.0.11 cidr_netmask=24 \
    op monitor interval=10 timeout=20
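After editing, it can be useful to double-check that the new values were saved before taking the cluster out of maintenance mode. This quick check is a suggested addition, not part of the original post.

# Optional check (sketch): display the edited resources and confirm the DR device paths and frontend IPs.
sudo crm configure show fs_QAS_ASCS fs_QAS_ERS vip_QAS_ASCS vip_QAS_ERS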
- After editing the resources, clear any existing error messages in the cluster and take the cluster out of maintenance mode.
sudo crm resource cleanup
sudo crm configure property maintenance-mode=false
sudo crm status

# Full List of Resources:
#   * stonith-sbd (stonith:external/sbd): Started app1531
#   * Clone Set: cln_azure-events [rsc_azure-events]:
#     * Started: [ app1531 app1532 ]
#   * Resource Group: g-QAS_ASCS:
#     * fs_QAS_ASCS (ocf::heartbeat:Filesystem): Started app1532
#     * nc_QAS_ASCS (ocf::heartbeat:azure-lb): Started app1532
#     * vip_QAS_ASCS (ocf::heartbeat:IPaddr2): Started app1532
#     * rsc_sap_QAS_ASCS00 (ocf::heartbeat:SAPInstance): Started app1532
#   * Resource Group: g-QAS_ERS:
#     * fs_QAS_ERS (ocf::heartbeat:Filesystem): Started app1531
#     * nc_QAS_ERS (ocf::heartbeat:azure-lb): Started app1531
#     * vip_QAS_ERS (ocf::heartbeat:IPaddr2): Started app1531
#     * rsc_sap_QAS_ERS01 (ocf::heartbeat:SAPInstance): Started app1531
- Perform validation and cluster testing in the DR environment, for example a resource migration test as sketched below.
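As one example of such a test, you could migrate the ASCS resource group between the two DR nodes and then remove the migration constraint. This is a generic illustration using the resource names from this example, not a prescribed test plan.

# Sketch of a simple failover test: move the ASCS group to the other node, observe, then clear the constraint.
sudo crm resource move g-QAS_ASCS app1531
sudo crm status
# Remove the location constraint created by the move once the test is complete
# (on older crmsh versions, use "crm resource unmigrate" instead of "clear")
sudo crm resource clear g-QAS_ASCS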