Running SAP Applications on the Microsoft Platform

SAP on Azure high availability – change from SPN to MSI for Pacemaker clusters using Azure fencing

RobertBiro
Microsoft
Sep 06, 2022

When building a high-availability architecture for SAP systems in Azure with Pacemaker, several options exist for implementing the fencing mechanism. On SUSE Linux Enterprise Server (SLES) only, SBD device(s) can be used for the STONITH implementation. On both SLES and Red Hat Enterprise Linux (RHEL), the Azure fence agent can be configured.

 

Both OS distributions now support managed identities for Azure resources (MSI) in the Pacemaker cluster, and Microsoft has documented this in the respective high-availability documentation for SAP (SLES and RHEL). Because many customers run existing SAP clusters with a service principal (SPN), the purpose of this blog post is to provide the steps to change these clusters from SPN to MSI. It does not describe the creation of a new SAP cluster with MSI; it assumes you have an existing cluster running with an SPN that you want to modify to use managed identities instead.

 

NOTE: If you have a SLES cluster using SBD fencing, this guide does not apply; it covers only Pacemaker clusters using the Azure fence agent. SBD fencing does not use an SPN or MSI for cluster operation.

 

Benefits of MSI over SPN for SAP clusters

Current Pacemaker clusters use an SPN by providing a username (the SPN's client ID in AAD), a password (the SPN secret) and the AAD tenant ID. These values are set in the stonith resource within Pacemaker. Every time the cluster performs monitoring or another action, it authenticates with this user ID/password/tenant against Azure Active Directory to perform management API requests.

 

The SPN password/secret has an expiration date. Since April 2021, secrets can no longer be set to 'never expire', and 24 months is the longest allowed value, whether set in the portal or via API/CLI. Your company likely also has a security policy requiring secrets to be rotated much sooner, for example every 6 months. You need to keep track of expiration dates and update your Pacemaker configuration with a new SPN secret in time. With an expired secret, Pacemaker can perform no status monitoring and no fencing action, crippling your SAP cluster.
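
To check when the current SPN secret expires, you can query AAD with the Azure CLI. A minimal sketch, assuming a Graph-based CLI version (older versions name the field endDate); <spn-app-id> is a placeholder for your SPN's application ID:

# List the expiration dates of the secrets on the SPN's app registration
az ad app credential list --id <spn-app-id> --query "[].endDateTime" -o tsv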

 

Managed identities solve the problem of expiring credentials and the renewal cycle; there are no keys or secrets to manage. To look under the covers, the documentation here shows how managed identities for Azure resources work with Azure virtual machines.

 

Creating managed identity for SAP clusters

The Azure high-availability documentation uses a system-assigned identity. User-assigned managed identities are not documented and should not be used for this purpose.

 

If a VM does not yet have a managed identity, navigate to one of the VMs in the Pacemaker cluster, select Identity in the left blade and set the status for the system-assigned identity to 'On'. Save and repeat for all VMs in the Pacemaker cluster. If a system-assigned identity already exists, no changes are needed, and this MSI will be used in the next step when assigning permissions.

 

Figure 1 - Enabling system assigned identity to VM
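
If you prefer scripting over the portal, the same can be done with the Azure CLI. A minimal sketch with placeholder names; running the command without an --identities argument enables the system-assigned identity:

# Enable the system-assigned identity on a cluster VM; repeat per VM
az vm identity assign --resource-group <resourceGroupName> --name <vmName>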

 

A VM can have a single system-assigned identity, which can hold one or multiple role assignments for authorization. For cluster operation, the Azure documentation recommends using a custom role for the fence agent. Details on how to create such a role are described here. This blog post assumes the role already exists and is assigned to the SPN that currently provides fencing services for the cluster.
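
For reference, such a custom role could be created with the Azure CLI along these lines. This is a sketch modeled on the role in the linked documentation; adjust the name and assignable scopes to your environment:

# fence-agent-role.json - minimal permissions to read, power off and start VMs
{
  "Name": "Linux Fence Agent Role",
  "Description": "Allows to power off and start virtual machines",
  "Actions": [
    "Microsoft.Compute/*/read",
    "Microsoft.Compute/virtualMachines/powerOff/action",
    "Microsoft.Compute/virtualMachines/start/action"
  ],
  "NotActions": [],
  "AssignableScopes": ["/subscriptions/<subscription-id>"]
}

az role definition create --role-definition @fence-agent-role.json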

 

Next, we assign this custom cluster role to the managed identities. In the Azure portal, navigate to the virtual machines (VMs) of the cluster. Select the first VM, open the Access control (IAM) page and add a new role assignment.

 

NOTE: Instead of assigning the custom role on the cluster VMs only, an alternative is to assign it on the whole SAP system's resource group or even the Azure subscription. This lowers the implementation effort but is less secure, as every VM could then read and power off all VMs in the assigned resource group or subscription. This blog post follows the principle of least privilege and assigns roles on the VMs only.

 

Figure 2 - Assigning custom role for cluster on VM to MSI - step 1/2

 

Select the cluster role you created, or the one you used previously for the SPN in the cluster. Using the same role as the SPN ensures continuity, since the managed identity receives the same permissions. Assign all virtual machine MSIs of the cluster to this role assignment: for an (A)SCS cluster, the two VM identities; for a DB cluster, the respective DB cluster VMs.

 

The picture below shows the first cluster VM – in the picture, VM mi2scsvm1 – granting access to perform the actions defined in the custom role AzFenceAgent to two MSIs, belonging to VMs mi2scsvm1 and mi2scsvm2.

 

Figure 3 - Assigning custom role for cluster on VM to MSI - step 2/2

 

Repeat the same steps on the second and any other cluster VM, again granting the cluster role to all MSIs of the SAP cluster.
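
The portal clicks above can also be scripted. A sketch with the Azure CLI, assuming the example VM names from this post and a placeholder role name; it grants every cluster VM's MSI the role on every cluster VM:

RG=<resourceGroupName>
VMS="mi2scsvm1 mi2scsvm2"
for TARGET in $VMS; do
  SCOPE=$(az vm show -g $RG -n $TARGET --query id -o tsv)
  for SOURCE in $VMS; do
    MSI=$(az vm identity show -g $RG -n $SOURCE --query principalId -o tsv)
    az role assignment create --role "<cluster-custom-role>" \
      --assignee-object-id $MSI --assignee-principal-type ServicePrincipal \
      --scope $SCOPE
  done
done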

 

Verify the role has been assigned at the right scope. Navigate to either of the cluster VMs and select Identity in the left blade. For the system-assigned identity, click on Azure role assignments to view them.

 

Figure 4 - Verifying assigned roles to VM's MSI - step 1/2

 

You should see at least two assignments: the cluster role on both cluster VMs, assigned to the identity of the VM you are looking at, and vice versa on the second cluster VM. You might also see a Reader role on the resource group the VM is in, assigned to your system-assigned identity. This likely comes from the Azure Monitoring Extension for SAP, which uses it to access the VM's resources. You might also see other role assignments, for example from your company policies.

 

Figure 5 - Verifying assigned roles to VM's MSI - step 2/2
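
The same verification is possible from the command line. A sketch with placeholder names:

# Show all role assignments of the VM's system-assigned identity
MSI=$(az vm identity show -g <resourceGroupName> -n <vmName> --query principalId -o tsv)
az role assignment list --assignee $MSI --all -o table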

 

Note: Consider the limitations of using managed identities for authorization.
Because the authorization tokens received by Pacemaker are cached by the underlying infrastructure, any changes to the managed identity's roles can take significant time to take effect – both adding and removing roles can be delayed.

Plan your cluster role and role assignments well in advance.

 

 

Verify MSI and prepare cluster VMs

With the authorizations assigned, you are ready to start working at the OS level. This verification will also confirm that the role assignments are effective.

 

First, verify and, if needed, update the necessary fence agent packages. If the installed versions are lower than the minimums listed below, install the latest version of the listed packages; check commands follow after the list.

 

Minimum version for MSI support:

For RHEL 8.4 - fence-agents-4.2.1-54.el8

For RHEL 8.2 - fence-agents-4.2.1-41.el8_2.4

For RHEL 8.1 - fence-agents-4.2.1-30.el8_1.4

For RHEL 7.9 - fence-agents-4.2.1-41.el7_9.4

For SLES 15 SP1 and newer - fence-agents 4.5.2+git.1592573838.1eee0863 or higher.

For SLES 12 SP5 - fence-agents 4.9.0+git.1624456340.8d746be9-3.35.2 or higher. 

Older releases of SLES and RHEL have not been verified, even though the necessary packages might become available now or over time.
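
To check the currently installed versions, you can for example run the commands below (a sketch; note that on newer RHEL releases the Azure agent ships in the separate package fence-agents-azure-arm):

# RHEL - list installed fence agent packages and update if needed
rpm -qa | grep fence-agents
sudo yum update 'fence-agents*'

# SLES - show the installed version and update if needed
zypper info fence-agents
sudo zypper update fence-agents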

 

After the required packages are updated on the cluster VMs, you can verify the operation with MSI using the command below. This can be done during normal operation, as it only performs a list action. The output is a listing of the VMs that can be read in the provided Azure subscription and resource group. The authorization assigned to the VM's MSI determines which VMs the command sees.

 

sudo fence_azure_arm --msi --action=list --subscriptionId=<azureSubscriptionId> --resourceGroup=<resourceGroupName>

 

Example output when the MSI is authorized to read only the cluster VMs, with no other roles assigned to the system-assigned MSI:

azureuser@msiscsvm1:~> sudo fence_azure_arm --msi --action=list --subscriptionId=xxxxx-xxxx-xxx --resourceGroup=sap-prod-cluster-w2us-rg
msiscsvm1,
msiscsvm2,

 

Example output when the cluster VMs run the Azure Monitoring Extension for SAP, which currently uses a Reader role on the entire resource group. All 6 VMs in the resource group are listed, not just the cluster VMs:

azureuser@msiscsvm1:~> sudo fence_azure_arm --msi --action=list --subscriptionId=xxxxx-xxxx-xxx --resourceGroup=sap-prod-cluster-w2us-rg
msiappvm1,
msiappvm2,
msidbvm1,
msidbvm2,
msiscsvm1,
msiscsvm2,

 

Repeat the verification step on ALL VMs of the cluster! Any errors – syntax errors, insufficient Azure authorizations or others – need to be resolved before proceeding. The command must complete without errors and show all VMs in the cluster.

 

For further testing of MSI, the action element can be changed from list to on|off|reboot|status, with the additional parameter '--plug <vmName>' specifying which VM should be fenced and stopped/restarted.

NOTE: Any value for the action parameter other than list or status WILL IMPACT the cluster, causing downtime of the VM and the resources running on it. Any such testing should be limited to SAP cluster and application downtime windows, or to test systems.
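
A non-destructive variation is the status action, which only queries the power state of a single VM. A sketch with placeholders:

sudo fence_azure_arm --msi --action=status --plug=<vmName> --subscriptionId=<azureSubscriptionId> --resourceGroup=<resourceGroupName>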

 

If the test list command completes successfully, the cluster is ready for the update of the stonith resource. Using the same custom role as the one assigned to the current SPN ensures continuity.

 

Change from SPN to MSI

With the MSI created, role authorizations granted and the MSI tested on the cluster VMs, the change from the current SPN setup to MSI can begin.

 

NOTE: While the actions described can be executed on a running SAP cluster, any failure of the stonith resource can lead to a cluster package move and unexpected downtime, even if thoroughly tested on non-productive or sandbox environments.
The author does not take any responsibility for possible downtime should unforeseen circumstances occur during an online change. SAP and/or database downtime for such a cluster change is recommended. This way, the cluster configuration is changed and its operation verified before releasing the SAP system back to users.

 

All steps in this chapter immediately affect the whole cluster and should be executed only once, on any one active cluster VM. See the commands for the respective OS used.

 

RHEL 8.x

The current STONITH settings use the configured SPN; verify the configuration

[azureuser@msiscsvm1 ~]$ sudo pcs stonith config
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: password=<SPN-secret> pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id> tenantId=<AAD-tenant-ID> username=<SPN-name>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)

 

Set cluster to maintenance mode and disable stonith

sudo pcs property set maintenance-mode=true
sudo pcs property set stonith-enabled=false

 

Enable use of MSI and unset previous values for username, password and AAD tenant ID, which are no longer needed.

sudo pcs stonith update rsc_st_azure fence_azure_arm msi=true password= username= tenantId=

 

Verify that the configuration no longer contains the username, password or tenant ID and that MSI is enabled

[azureuser@msiscsvm1 ~]$ sudo pcs stonith config
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: msi=true pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)

 

Disable maintenance mode and re-enable stonith

sudo pcs property set stonith-enabled=true
sudo pcs property set maintenance-mode=false

 

Check the cluster and verify the stonith resource is started on one cluster host

sudo pcs status

 

RHEL 7.x

The current STONITH settings use the configured SPN; verify the configuration

[azureuser@msiscsvm1 ~]$ sudo pcs stonith show --full
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: login=<SPN-name> passwd=<SPN-secret> pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id> tenantId=<AAD-tenant-ID>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)

 

Set cluster to maintenance mode and disable stonith

sudo pcs property set maintenance-mode=true
sudo pcs property set stonith-enabled=false

 

Enable use of MSI and unset previous values for username, password and AAD tenant ID, which are no longer needed.

sudo pcs stonith update rsc_st_azure fence_azure_arm msi=true passwd= login= tenantId=

 

Verify that the configuration no longer contains the username, password or tenant ID and that MSI is enabled

[azureuser@msiscsvm1 ~]$ sudo pcs stonith show --full
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: msi=true pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)

 

Disable maintenance mode and re-enable stonith

sudo pcs property set stonith-enabled=true
sudo pcs property set maintenance-mode=false

 

Check the cluster and verify the stonith resource is started on one cluster host

sudo pcs status

 

SLES 12 SP5, SLES 15 SP1 and newer

The current STONITH settings use the configured SPN; verify the configuration

azureuser@msiscsvm1:~> sudo crm configure show rsc_st_azure
primitive rsc_st_azure stonith:fence_azure_arm \
        params subscriptionId=<subscription-id> resourceGroup=<SAP-resource-group> tenantId=<AAD-tenant-ID> login=<SPN-name> passwd="******" pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=300 \
        op monitor interval=3600 timeout=120

 

Set cluster to maintenance mode and disable stonith

sudo crm configure property maintenance-mode=true
sudo crm configure property stonith-enabled=false

 

Enable the use of MSI and unset the previous values for username, password and AAD tenant ID, which are no longer needed. Edit the cluster resource interactively

sudo crm configure edit rsc_st_azure

 

Add msi=true as a new first parameter.

Remove the parameters tenantId, login and passwd entirely from the resource.

Save and exit the editor with the vi-style command (:wq).

An example of the edited resource is below

primitive rsc_st_azure stonith:fence_azure_arm \
        params msi=true subscriptionId=<subscription-id> resourceGroup=<SAP-resource-group> pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=300 \
        op monitor interval=3600 timeout=120

 

Verify that the configuration no longer contains the username, password or tenant ID and that MSI is enabled

azureuser@msiscsvm1:~> sudo crm configure show rsc_st_azure
primitive rsc_st_azure stonith:fence_azure_arm \
        params msi=true subscriptionId=<subscription-id> resourceGroup=<SAP-resource-group> pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=300 \
        op monitor interval=3600 timeout=120

 

Disable maintenance mode and re-enable stonith

sudo crm configure property stonith-enabled=true
sudo crm configure property maintenance-mode=false

 

Check the cluster and verify the stonith resource is started on one cluster host

sudo crm status

 

All OS versions – testing the Azure fence agent with MSI after the change

Verify that the managed identity is used for the cluster action, for example on a test system. While the stonith resource configuration can be displayed, a true test of a fencing action shows the managed identity in Azure, confirming that the role and scope assignment was done correctly. This section illustrates how SPN and MSI usage appear in the Azure activity logs.

 

For the test scenario, assume an (A)SCS cluster with 2 VMs, mi2scsvm1 and mi2scsvm2, initially set up with SPN and changed a few hours later to MSI for the Azure fence agent. VM mi2scsvm2 is stopped and powered back on by the Pacemaker cluster, as part of the testing documented here (RHEL and SLES versions).

 

In the Azure portal, looking at the activity log of the VM that was fenced – mi2scsvm2 – our example shows the fence action (yellow box) initially performed by the SPN (sap-cluster-prod-spn) some hours earlier. This was the original cluster setup with SPN. After the change to MSI, the same cluster test repeated (green box) shows the same VM start and powerOff, but this time performed by the MSI of mi2scsvm1: the Azure fence agent running on mi2scsvm1 detected a hanging VM and restarted it.

 

Figure 6 - Activity log of cluster testing with both SPN and MSI, showing differences.
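
The activity log can also be queried from the Azure CLI. A sketch with a placeholder resource group; the caller column shows whether the SPN or the MSI performed the action:

az monitor activity-log list --resource-group <resourceGroupName> --offset 1d \
  --query "[].{time:eventTimestamp, operation:operationName.localizedValue, caller:caller}" -o table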

 

Remove SPN

Once the cluster operates with MSI, the previously used SPN can be deleted. Ensure the SPN is not used by other clusters or applications. Remove the role assignment from the scope at which it was granted to the SPN – VMs, resource group and/or subscription.
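
A sketch of this cleanup with the Azure CLI, using placeholders for the SPN's application ID and the role name; double-check for other usage before deleting:

# Remove the SPN's role assignments, then delete the service principal itself
az role assignment delete --assignee <spn-app-id> --role "<cluster-custom-role>"
az ad sp delete --id <spn-app-id>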

 

Final words

The benefits of increased security and no expiring secrets make managed identities for Azure resources the recommended way of assigning authorization to applications. With the Pacemaker high-availability solution now supporting this method, it is the recommended way to use the Azure fence agent.

 

Further reading and links

 

Azure Docs | How managed identities for Azure resources work with Azure virtual machines

M365 dev blogs | Client Secret expiration now limited to a maximum of two years

Azure Docs | Assign Azure roles to a managed identity

 

Updated Sep 23, 2022
Version 2.0
  • wesleypaes (Copper Contributor)

    Hi RobertBiro 

    Have you had experiences running msi=true "fence" when the nodes are in different resource groups?

    Thanks.

  • All tests were conducted with all cluster members/VMs in the same resource group. The man page for fence_azure_arm also lists only a singular value for the --resourceGroup parameter, for both SPN and MSI.

  • exodus-35 (Copper Contributor)

    I added the Linux Fence Agent Role to the VM and the fence_azure_arm command works from the command line.

    Note that you do not need to add the subscription ID or any other info on the command line. So, when you add the role to allow fencing to the VM, it simplifies the usage. This is on a RHEL 9.3 OS image.

     

    AZ_GRP_ID=`az group show -n ${MYGRP} --query id --output tsv 2>/dev/null`

    VM_MSI_ID=`az vm identity show --name ${VM_HOSTNAME} -g ${MYGRP} --query 'principalId' --output tsv 2>/dev/null`

    az role assignment create --role "Linux Fence Agent Role" --assignee-object-id "${VM_MSI_ID}" --scope ${AZ_GRP_ID}

     

    [root@r9p2clazpn1 azureuser]# fence_azure_arm --resourceGroup MYGRP --msi -n r9p2clazpn1 -o list
    r9p2clazpn1,
    r9p2clazpn2,
    [root@r9p2clazpn1 azureuser]# fence_azure_arm --resourceGroup MYGRP --msi -n r9p2clazpn2 -o list
    r9p2clazpn1,
    r9p2clazpn2,

    [azureuser@r9p2clazpn1 ~]$ sudo pcs stonith config vmfence1
    Resource: vmfence1 (class=stonith type=fence_azure_arm)
      Attributes: vmfence1-instance_attributes
        msi=true
        pcmk_action_limit=3
        pcmk_delay_max=15
        pcmk_host_list=r9p2clazpn1
        pcmk_host_map=r9p2clazpn1:
        pcmk_monitor_retries=4
        pcmk_monitor_timeout=120
        pcmk_reboot_timeout=900
        power_timeout=240
        resourceGroup=MYGRP
      Operations:
        monitor: vmfence1-monitor-interval-3600
          interval=3600

    [azureuser@r9p2clazpn1 ~]$ sudo pcs status
    Cluster name: r9p2clazp
    Cluster Summary:
    * Stack: corosync (Pacemaker is running)
    * Current DC: r9p2clazpn1 (version 2.1.6-9.el9-6fdc9deea29) - partition with quorum
    * Last updated: Fri Dec 8 07:06:59 2023 on r9p2clazpn1
    * Last change: Fri Dec 8 07:04:55 2023 by root via cibadmin on r9p2clazpn1
    * 2 nodes configured
    * 13 resource instances configured

    Node List:
    * Online: [ r9p2clazpn1 r9p2clazpn2 ]

    Full List of Resources:
    * vmfence1 (stonith:fence_azure_arm): Started r9p2clazpn1
    * vmfence2 (stonith:fence_azure_arm): Started r9p2clazpn2
    * Clone Set: locking-clone [locking]:
    * Started: [ r9p2clazpn1 r9p2clazpn2 ]
    * Resource Group: mydisk:
    * myvolLVM (ocf:heartbeat:LVM-activate): Started r9p2clazpn1
    * myvolFS (ocf:heartbeat:Filesystem): Started r9p2clazpn1
    * Resource Group: MYGRP:
    * myvip (ocf:heartbeat:IPaddr2): Started r9p2clazpn1

    Daemon Status:
    corosync: active/enabled
    pacemaker: active/enabled
    pcsd: active/enabled

     

    [azureuser@r9p2clazpn1 ~]$ rpm -qa|grep fence|sort
    fence-agents-azure-arm-4.10.0-55.el9.x86_64
    fence-agents-common-4.10.0-55.el9.noarch

  • exodus-35 Maybe your post was cut off by length on this platform; I'm guessing you are wondering about the errors in syslog. See the correct syntax for pcmk_host_map and pcmk_host_list. You only need pcmk_host_map if the OS $HOSTNAME doesn't match the Azure VM name.

    Syntax is here https://clusterlabs.org/pacemaker/doc/2.1/Pacemaker_Explained/singlehtml/#special-instance-attributes-for-fencing-resources

    Azure specific parameters: https://learn.microsoft.com/en-us/azure/sap/workloads/high-availability-guide-suse-pacemaker#create-a-fencing-device-on-the-pacemaker-cluster. Specify the mapping in the format hostname:vm-name.

    pcmk_host_list is likely not needed either.

  • exodus-35 (Copper Contributor)

    I did one last test and I updated my original comment reflecting success.  I ran my shell cloudinit scripts and retested the build of the 2 node cluster all the way up to adding the VIP.   Go figure, the fence came up and stayed up.  Any suggestions for making the addition of the role to the vm more secure would be appreciated, or other comments.  Regards.

  • exodus-35 I can only think of assigning the fence role to the individual VMs participating in the cluster, not to the subscription/resource group/etc. With the MSI then, only the assigned VMs can power the cluster VMs on/off. Note: if you have the Azure extended monitoring for SAP extension installed, it provides read privilege to the whole resource group by default, meaning you see all VMs but cannot act on them (fence); this might explain why many VMs show up in the list command.