Overview
On Azure, high availability for SAP workloads running on Linux can be achieved by setting up a pacemaker cluster. In a running system, the cluster may detect that one of the nodes is misbehaving and needs to be removed. It then fences that node, which is commonly done with a STONITH resource. In Azure, you have two options for setting up STONITH in the pacemaker cluster: an Azure fence agent based STONITH, which restarts a failed node via Azure APIs, or a disk based STONITH (SBD) device. Depending on the Linux distribution (SLES or RHEL), you need to choose the appropriate STONITH option for Azure. Refer to the Setting up Pacemaker on RHEL in Azure or Setting up Pacemaker on SLES in Azure guides for more details.
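For reference, on SLES an SBD-based STONITH resource in the cluster typically looks like the minimal crmsh sketch below. The timing values are assumptions taken from common guidance; follow the linked Pacemaker setup guide for the authoritative configuration.

# Minimal sketch of an SBD-based STONITH resource on SLES (crmsh syntax).
# Assumes the SBD devices are already listed in /etc/sysconfig/sbd on all nodes.
sudo crm configure primitive stonith-sbd stonith:external/sbd \
    params pcmk_delay_max="15" \
    op monitor interval="600" timeout="15"
sudo crm configure property stonith-enabled="true"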
Setting up high availability for an SAP workload on Azure protects the application from infrastructure maintenance or failure within a region, but it doesn't provide protection from a widespread regional disaster. For DR, the application running on Azure VMs can be protected by replicating its components to another Azure region using Azure Site Recovery (ASR). In this article, we discuss how to achieve a highly available configuration in the DR region using ASR, when your SAP ASCS/ERS pacemaker cluster with SBD devices (using iSCSI target servers) is configured in the production region.
IMPORTANT NOTE
- The example shown in this article was exercised on the OS versions below, with the "classic" file system structure.
- SAP ASCS/ERS OS version: SUSE Linux Enterprise Server for SAP Applications 15 SP3
- iSCSI target server OS version: SUSE Linux Enterprise Server 15 SP2
- As Red Hat doesn't support SBD on cloud platforms, the steps are applicable only to applications running on SLES.
- Depending on the type of shared storage used for the SAP workload, you have to adopt an appropriate method to replicate the storage data to the DR site. In this example, NFS on Azure NetApp Files (ANF) is used, and its DR setup can be achieved using cross-region replication.
- Failover of other dependent services like DNS or Active Directory is not covered in this blog.
- To replicate VMs using ASR for DR setup, review supported regions.
- Azure NetApp Files is not available in all regions. Refer to Azure Products by Region to see if ANF is available in your DR region. You also need to make sure ANF volume replication is supported for your DR region; for cross-region replication of Azure NetApp Files, review the supported replication pairs.
- ASR doesn't replicate the Azure load balancer that provides the virtual IP for the SAP ASCS/ERS cluster configuration in the source site. In the DR site, you need to create the load balancer manually, either beforehand or at the time of the failover event.
- The procedure described here has not been validated against different OS releases or with the respective OS providers, so it might not work completely with your specific implementation or with future OS releases. Make sure you test and document the entire procedure thoroughly in your environment.
- If you have configured SAP ASCS/ERS pacemaker cluster with Azure fence agent in your production region, and want to achieve highly available configuration on DR region using ASR, refer to the blog SAP ASCS HA Cluster (in Linux OS) failover to DR region using Azure Site Recovery - Microsoft Tech Community
SAP ASCS/ERS Disaster Recovery Architecture
In the figure below, the SAP ASCS/ERS high availability cluster is configured in the primary region. The cluster uses SBD devices for STONITH, which are discovered by the HA cluster nodes from three different iSCSI target servers. To establish DR for this setup, Azure Site Recovery (ASR) is used to replicate the SAP ASCS/ERS and iSCSI target server VMs to the DR site. The NFS on Azure NetApp Files (ANF) volumes used by SAP ASCS/ERS are replicated to the DR site using cross-region replication.
NOTE: You can also leverage NFS on Azure Files for the SAP ASCS/ERS cluster, but NFS on Azure Files is not covered in this blog.
As described in the example, to achieve the HA setup for SAP ASCS/ERS in the DR site, we need to make sure that all components that are part of the solution are replicated.
Components | DR setup |
SAP ASCS/ERS VMs | Replicate VMs using Azure site recovery |
iSCSI target server VMs | Replicate VMs using Azure site recovery |
Azure NetApp Files (ANF) | Cross region replication |
Disaster Recovery (DR) site preparation
To achieve a similar highly available setup of SAP ASCS/ERS in the DR site, you need to make sure that all the components are replicated.
Configure ASR for SAP ASCS/ERS and iSCSI target servers
- Deploy the resource group, virtual network, subnet and Recovery Services vault in the secondary region. For more information on networking in Azure VM disaster recovery, refer to prepare networking for Azure VM disaster recovery. As this example uses Azure NetApp Files, you also need to create a separate subnet delegated to the NetApp service (see the CLI sketch after this list).
- Follow the instructions in the Tutorial to set up disaster recovery for Azure VMs document to configure ASR for SAP ASCS/ERS and all iSCSI target servers.
- On enabling Azure Site Recovery for a VM to set up DR, the OS and local data disks that are attached to the VM get replicated to the DR site. During replication, the VM disk writes are sent to a cache storage account in the source region. Data is sent from there to the target region, and recovery points are generated from the data. When you fail over a VM during DR, a recovery point is used to restore the VM in the target region.
- After the VMs are replicated, the status will turn to "Protected" and the replication health will be "Healthy".
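As a reference for the network preparation in the first step above, the Azure CLI sketch below creates a virtual network with a workload subnet and an ANF-delegated subnet in the DR region. All resource names, address ranges and the region are placeholders, not values from this setup.

# Sketch only: prepare DR networking with a subnet delegated to Azure NetApp Files.
# Names, address spaces and region are assumptions - adjust to your environment.
az group create --name rg-sap-dr --location westus2

az network vnet create --resource-group rg-sap-dr --name vnet-sap-dr \
    --address-prefix 10.29.0.0/16 \
    --subnet-name subnet-sap --subnet-prefix 10.29.0.0/24

# Delegated subnet required for Azure NetApp Files volumes in the DR region
az network vnet subnet create --resource-group rg-sap-dr --vnet-name vnet-sap-dr \
    --name subnet-anf --address-prefix 10.29.1.0/24 \
    --delegations Microsoft.NetApp/volumes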
Replicate volumes in DR site
Configure cross region replication for Azure NetApp Files
- Before you configure cross region replication for ANF, refer to Requirements and considerations for Azure NetApp Files cross-region replication.
- Create a NetApp account, capacity pool and delegated subnet for ANF in the DR region.
NOTE: The destination ANF account must be in a different region from the source volume's region.
- Follow the steps in the Create volume replication for Azure NetApp Files document to create a replication peering between the source and destination ANF volumes.
- Create a data replication volume in the DR site for each volume in the source ANF. After you authorize the replication from the source volume, you can see the Mirror state of the volume change from "Uninitialized" to "Mirrored". Refer to Display health status of Azure NetApp Files replication relationship to understand the different health statuses. A CLI sketch for checking the replication status follows below.
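If you prefer the command line over the portal, the replication health can also be queried with the Azure CLI as sketched below. The resource names are placeholders, and parameter names may differ slightly between CLI versions, so treat this as an illustration rather than the exact commands used in this setup.

# Sketch only: check the cross-region replication status of the destination ANF volume.
# Account, pool and volume names are placeholders.
az netappfiles volume replication status \
    --resource-group rg-sap-dr \
    --account-name anf-account-dr \
    --pool-name anf-pool-dr \
    --name app153-sapmntqas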
ANF volume in DR site; ANF volume health status
- Maintain the NFS on ANF DR mount volume entries in /etc/fstab on the SAP system VMs running in the primary region. The DR mount volume entries must be commented out, as shown below.
app1531:~ # more /etc/fstab
# Primary region NFS on ANF mounts
10.0.3.5:/app153-sapmntQAS /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
10.0.3.5:/app153-usrsapQASsys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# DR site NFS on ANF mounts (REMOVE HASH IN THE EVENT OF A DR FAILOVER)
# 10.29.1.4:/app153-sapmntqas /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# 10.29.1.4:/app153-usrsapqassys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
NOTE: The /usr/sap/QAS/ASCS00 and /usr/sap/QAS/ERS01 file systems are managed by the pacemaker cluster, so there is no need to maintain them in /etc/fstab.
Configure Standard Load Balancer for SAP ASCS/ERS in DR site
Deploy a load balancer in the DR site as described in this article. If you create the load balancer beforehand in the DR site, you won't be able to assign VMs to the backend pool yet: you can create the backend pool empty, which allows you to define the load balancing rules, but the DR VMs can only be added to the backend pool after failover. A CLI sketch follows the table below. Also, keep the following points in mind:
- Keep the probe ports in the DR load balancer the same as in the primary region.
- When VMs without public IP addresses are placed in the backend pool of an internal (no public IP address) Standard Azure load balancer, there is no outbound internet connectivity unless additional configuration is performed to allow routing to public endpoints. For details on how to achieve outbound connectivity, see Public endpoint connectivity for Virtual Machines using Azure Standard Load Balancer in SAP high-availability scenarios.
Site | Frontend IP |
Primary region | ASCS: 10.0.1.12, ERS: 10.0.1.13 |
DR region | ASCS: 10.29.0.10, ERS: 10.29.0.11 |
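The Azure CLI sketch below shows one way to pre-create the internal Standard load balancer in the DR site with an empty backend pool, health probes and HA-ports rules. The resource names, subnet and probe ports are assumptions for illustration (the probe ports must match the primary site), and parameter names may vary between Azure CLI versions.

# Sketch only: internal Standard load balancer for SAP ASCS/ERS in the DR site.
# Names, subnet and probe ports are placeholders; keep the probe ports identical to the primary site.
az network lb create --resource-group rg-sap-dr --name lb-qas-dr --sku Standard \
    --vnet-name vnet-sap-dr --subnet subnet-sap \
    --frontend-ip-name frontend-ascs --private-ip-address 10.29.0.10 \
    --backend-pool-name bp-qas

# Second frontend IP for ERS
az network lb frontend-ip create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name frontend-ers --vnet-name vnet-sap-dr --subnet subnet-sap \
    --private-ip-address 10.29.0.11

# Health probes (ports assumed; must match the azure-lb resources in the cluster)
az network lb probe create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name probe-ascs --protocol Tcp --port 62000
az network lb probe create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name probe-ers --protocol Tcp --port 62101

# HA-ports rules with floating IP enabled
az network lb rule create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name rule-ascs --protocol All --frontend-port 0 --backend-port 0 \
    --frontend-ip-name frontend-ascs --backend-pool-name bp-qas \
    --probe-name probe-ascs --floating-ip true --idle-timeout 30
az network lb rule create --resource-group rg-sap-dr --lb-name lb-qas-dr \
    --name rule-ers --protocol All --frontend-port 0 --backend-port 0 \
    --frontend-ip-name frontend-ers --backend-pool-name bp-qas \
    --probe-name probe-ers --floating-ip true --idle-timeout 30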
Disaster Recovery (DR) failover
In case of a DR event, the following procedure needs to be followed for SAP ASCS/ERS. If you are using different Azure services with your SAP system, you may need to tweak your procedure accordingly.
- Perform the failover of the SAP ASCS/ERS cluster and iSCSI target VMs to the DR region using ASR. For more details on how to fail over, refer to the Tutorial to fail over Azure VMs to a secondary region for disaster recovery with Azure Site Recovery document.
- After the failover is completed, the status of the replicated items in the Recovery Services vault will look like the following:
Failover status of VMs in Recovery Services vault
- Update the IP addresses of all the VMs (in this example, SAP ASCS/ERS and all iSCSI target servers) in AD/DNS or in the host files.
- Fail over the source ANF volumes to the destination volumes. To activate a destination volume (for example, when you want to fail over to the destination region), you need to break the replication peering and then mount the destination volume. Follow the instructions in the Manage disaster recovery using cross-region replication document for failover; a CLI sketch is shown below.
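For reference, breaking the replication peering can also be done with the Azure CLI as sketched below, run against the destination volume in the DR region. The resource names are placeholders, and parameter names may vary slightly between CLI versions.

# Sketch only: break the replication peering to activate the destination ANF volume.
# Run for each destination volume; names are placeholders.
az netappfiles volume replication break \
    --resource-group rg-sap-dr \
    --account-name anf-account-dr \
    --pool-name anf-pool-dr \
    --name app153-sapmntqas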
ANF volumes broken peering status
- As we have maintained the entries for the DR volumes (/sapmnt/QAS and /usr/sap/QAS/SYS) in /etc/fstab in the primary region, we only need to remove the comments to mount the volumes after failover. Conversely, you should comment out the entries of the former primary region file systems.
NOTE: The /usr/sap/QAS/ASCS00 and /usr/sap/QAS/ERS01 file systems are managed by the pacemaker cluster, so there is no need to maintain them in /etc/fstab.

app1531:~ # more /etc/fstab
# Former primary site NFS on ANF mounts (REMOVE HASH IN THE EVENT OF FAILBACK)
# 10.0.3.5:/app153-sapmntQAS /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# 10.0.3.5:/app153-usrsapQASascs /usr/sap/QAS/ASCS00 nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# 10.0.3.5:/app153-usrsapQASsys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
# DR site NFS on ANF mounts
10.29.1.4:/app153-sapmntqas /sapmnt/QAS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0
10.29.1.4:/app153-usrsapqassys /usr/sap/QAS/SYS nfs rw,vers=4,minorversion=1,hard,timeo=600,rsize=262144,wsize=262144,intr,noatime,lock,_netdev,sec=sys 0 0

app1531:~ # mount -a
app1531:~ # df -h | grep -i qas
Filesystem                       Size  Used Avail Use% Mounted on
10.29.1.4:/app153-sapmntqas      512G  6.8G  506G   2% /sapmnt/QAS
10.29.1.4:/app153-usrsapqassys   512G  1.5M  512G   1% /usr/sap/QAS/SYS
- Add the SAP ASCS/ERS VMs to the backend pool of the Standard load balancer (a CLI sketch is shown below the screenshot).
Standard Load Balancer backend pool in DR site
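One way to add the failed-over VMs to the backend pool is via the Azure CLI, as sketched below. The NIC, IP configuration, load balancer and pool names are placeholders and must match the resources you created in the DR site.

# Sketch only: add each SAP ASCS/ERS VM's NIC to the load balancer backend pool.
# NIC names, ipconfig name, pool and load balancer names are placeholders.
az network nic ip-config address-pool add \
    --resource-group rg-sap-dr \
    --nic-name app1531-nic \
    --ip-config-name ipconfig1 \
    --lb-name lb-qas-dr \
    --address-pool bp-qas
az network nic ip-config address-pool add \
    --resource-group rg-sap-dr \
    --nic-name app1532-nic \
    --ip-config-name ipconfig1 \
    --lb-name lb-qas-dr \
    --address-pool bp-qas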
NOTE: When VMs without public IP addresses are placed in the backend pool of an internal (no public IP address) Standard Azure load balancer, there is no outbound internet connectivity unless additional configuration is performed to allow routing to public endpoints. For details on how to achieve outbound connectivity, see Public endpoint connectivity for Virtual Machines using Azure Standard Load Balancer in SAP high-availability scenarios.
- Update the VMs' physical IP addresses in the below section of /etc/corosync/corosync.conf:
nodelist {
    node {
        ring0_addr: 10.29.0.4
        nodeid: 1
    }
    node {
        ring0_addr: 10.29.0.6
        nodeid: 2
    }
}
- After failover, the iSCSI disks are not discovered automatically because the IP addresses of the iSCSI target servers have changed. So you need to connect to the iSCSI devices using the iscsiadm command, as shown below.
The SBD device IDs remain the same after ASR failover, so you do not have to make changes in the /etc/sysconfig/sbd file.

app1531:~ # lsscsi
[0:0:0:0]    disk    Msft    Virtual Disk    1.0    /dev/sda
[3:0:1:0]    disk    Msft    Virtual Disk    1.0    /dev/sdb

# Execute the commands below as root, changing the IP addresses to those of your iSCSI target servers.
# Also make sure you use the correct IQN name to discover the iSCSI devices.
sudo iscsiadm -m discovery --type=st --portal=10.29.0.7:3260
sudo iscsiadm -m node -T iqn.2006-04.ascsnw1.local:ascsnw1 --login --portal=10.29.0.7:3260
sudo iscsiadm -m node -p 10.29.0.7:3260 -T iqn.2006-04.ascsnw1.local:ascsnw1 --op=update --name=node.startup --value=automatic

sudo iscsiadm -m discovery --type=st --portal=10.29.0.8:3260
sudo iscsiadm -m node -T iqn.2006-04.ascsnw1.local:ascsnw1 --login --portal=10.29.0.8:3260
sudo iscsiadm -m node -p 10.29.0.8:3260 -T iqn.2006-04.ascsnw1.local:ascsnw1 --op=update --name=node.startup --value=automatic

sudo iscsiadm -m discovery --type=st --portal=10.29.0.9:3260
sudo iscsiadm -m node -T iqn.2006-04.ascsnw1.local:ascsnw1 --login --portal=10.29.0.9:3260
sudo iscsiadm -m node -p 10.29.0.9:3260 -T iqn.2006-04.ascsnw1.local:ascsnw1 --op=update --name=node.startup --value=automatic

app1531:~ # lsscsi
[0:0:0:0]    disk    Msft       Virtual Disk    1.0    /dev/sda
[1:0:1:0]    disk    Msft       Virtual Disk    1.0    /dev/sdb
[6:0:0:0]    disk    LIO-ORG    sbdascsnw1      4.0    /dev/sdc
[7:0:0:0]    disk    LIO-ORG    sbdascsnw1      4.0    /dev/sdd
[8:0:0:0]    disk    LIO-ORG    sbdascsnw1      4.0    /dev/sde

app1531:~ # ls -l /dev/disk/by-id/scsi-* | grep sdc
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-1LIO-ORG_sbdascsnw1:d778427c-479f-4264-a510-4fecedfd044e -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece -> ../../sdc
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-SLIO-ORG_sbdascsnw1_d778427c-479f-4264-a510-4fecedfd044e -> ../../sdc

app1531:~ # ls -l /dev/disk/by-id/scsi-* | grep sdd
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-1LIO-ORG_sbdascsnw1:a15a2398-9090-4610-b3b1-185ae2385d3b -> ../../sdd
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-36001405a15a239890904610b3b1185ae -> ../../sdd
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-SLIO-ORG_sbdascsnw1_a15a2398-9090-4610-b3b1-185ae2385d3b -> ../../sdd

app1531:~ # ls -l /dev/disk/by-id/scsi-* | grep sde
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-1LIO-ORG_sbdascsnw1:4287e788-620c-4e0c-93f4-293c8169c86d -> ../../sde
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-360014054287e788620c4e0c93f4293c8 -> ../../sde
lrwxrwxrwx 1 root root 9 Jul 19 22:47 /dev/disk/by-id/scsi-SLIO-ORG_sbdascsnw1_4287e788-620c-4e0c-93f4-293c8169c86d -> ../../sde
app1531:~ # more /etc/sysconfig/sbd | grep -i SBD_DEVICE
# SBD_DEVICE specifies the devices to use for exchanging sbd messages
SBD_DEVICE="/dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece;/dev/disk/by-id/scsi-36001405a15a239890904610b3b1185ae;/dev/disk/by-id/scsi-360014054287e788620c4e0c93f4293c8"
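Optionally, before starting the cluster you can verify that the SBD devices listed in /etc/sysconfig/sbd are readable from both nodes. This is a suggested check, not part of the original procedure; the device path below is one of the IDs from this example.

# Optional sanity check (sketch): dump the SBD header and list the message slots on a configured device.
sudo sbd -d /dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece dump
sudo sbd -d /dev/disk/by-id/scsi-36001405d778427c479f4264a5104fece list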
- Start the pacemaker cluster on both the SAP ASCS and SAP ERS nodes and place it in maintenance mode.
# Execute the below command on both nodes to start the cluster services
sudo crm cluster start

# Place the cluster in maintenance mode. Execute from any one of the nodes
sudo crm configure property maintenance-mode=true
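Once the cluster services are running, you can optionally confirm that corosync has picked up the new ring addresses from the updated corosync.conf. This verification step is an addition to the original procedure.

# Optional check (sketch): show the local corosync ring status and the known cluster nodes.
sudo corosync-cfgtool -s
sudo crm_node -l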
- Update the configuration of the file system resources (fs_QAS_*) and virtual IP resources (vip_QAS_*) in the pacemaker cluster and save the changes.
NOTE: If you have changed the probe port to something different from the primary site, then you need to edit the nc_QAS_ASCS and nc_QAS_ERS resources as well with the new port.

# On executing the below commands, the resource configuration opens in the vi editor. Make the changes and save.

sudo crm configure edit fs_QAS_ASCS
# Change the device to the DR volume
primitive fs_QAS_ASCS Filesystem \
    params device="10.29.1.4:/app153-usrsapqasascs" directory="/usr/sap/QAS/ASCS00" fstype=nfs options="sec=sys,vers=4.1" \
    op start timeout=60s interval=0 \
    op stop timeout=60s interval=0 \
    op monitor interval=20s timeout=40s

sudo crm configure edit vip_QAS_ASCS
# Change the virtual IP address to the frontend IP address of the load balancer in the DR region
primitive vip_QAS_ASCS IPaddr2 \
    params ip=10.29.0.10 cidr_netmask=24 \
    op monitor interval=10 timeout=20

sudo crm configure edit fs_QAS_ERS
# Change the device to the DR volume
primitive fs_QAS_ERS Filesystem \
    params device="10.29.1.4:/app153-usrsapqasers" directory="/usr/sap/QAS/ERS01" fstype=nfs options="sec=sys,vers=4.1" \
    op start timeout=60s interval=0 \
    op stop timeout=60s interval=0 \
    op monitor interval=20s timeout=40s

sudo crm configure edit vip_QAS_ERS
# Change the virtual IP address to the frontend IP address of the load balancer in the DR region
primitive vip_QAS_ERS IPaddr2 \
    params ip=10.29.0.11 cidr_netmask=24 \
    op monitor interval=10 timeout=20
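After editing, it can be useful to double-check that the new values were saved before taking the cluster out of maintenance mode. This quick check is a suggested addition, not part of the original post.

# Optional check (sketch): display the edited resources and confirm the DR device paths and frontend IPs.
sudo crm configure show fs_QAS_ASCS fs_QAS_ERS vip_QAS_ASCS vip_QAS_ERS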
- After editing the resources, clear any existing error messages in the cluster and take the cluster out of maintenance mode.
sudo crm resource cleanup
sudo crm configure property maintenance-mode=false
sudo crm status

# Full List of Resources:
#   * stonith-sbd (stonith:external/sbd): Started app1531
#   * Clone Set: cln_azure-events [rsc_azure-events]:
#     * Started: [ app1531 app1532 ]
#   * Resource Group: g-QAS_ASCS:
#     * fs_QAS_ASCS (ocf::heartbeat:Filesystem): Started app1532
#     * nc_QAS_ASCS (ocf::heartbeat:azure-lb): Started app1532
#     * vip_QAS_ASCS (ocf::heartbeat:IPaddr2): Started app1532
#     * rsc_sap_QAS_ASCS00 (ocf::heartbeat:SAPInstance): Started app1532
#   * Resource Group: g-QAS_ERS:
#     * fs_QAS_ERS (ocf::heartbeat:Filesystem): Started app1531
#     * nc_QAS_ERS (ocf::heartbeat:azure-lb): Started app1531
#     * vip_QAS_ERS (ocf::heartbeat:IPaddr2): Started app1531
#     * rsc_sap_QAS_ERS01 (ocf::heartbeat:SAPInstance): Started app1531
- Perform validation and cluster testing in the DR environment, for example a resource migration test as sketched below.
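As one example of such a test, you could migrate the ASCS resource group between the two DR nodes and then remove the migration constraint. This is a generic illustration using the resource names from this example, not a prescribed test plan.

# Sketch of a simple failover test: move the ASCS group to the other node, observe, then clear the constraint.
sudo crm resource move g-QAS_ASCS app1531
sudo crm status
# Remove the location constraint created by the move once the test is complete
# (on older crmsh versions, use "crm resource unmigrate" instead of "clear")
sudo crm resource clear g-QAS_ASCS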