SAP on Azure high availability – change from SPN to MSI for Pacemaker clusters using Azure fencing
Published Sep 06 2022 09:00 AM
Microsoft

For high-availability architectures of SAP systems on Azure using Pacemaker, there are several options for implementing the fencing mechanism. On SUSE Linux Enterprise Server (SLES) only, SBD devices can be used for the STONITH implementation. On both SLES and Red Hat Enterprise Linux (RHEL), the Azure fence agent can be configured.

 

OS distributions now support managed identities for Azure resources (MSI) in the Pacemaker cluster, and Microsoft has documented this in the respective high availability for SAP documentation (SLES and RHEL). As many customers are running existing SAP clusters with a service principal (SPN), the purpose of this blog post is to provide the steps to change these clusters from SPN to MSI. It does not describe the creation of a new SAP cluster with MSI. It assumes you have an existing cluster running with SPN and want to modify it to use managed identities instead.

 

NOTE: If you have a SLES cluster using SBD fencing, this guide does not apply. It applies only to Pacemaker clusters using Azure fencing. SBD fencing does not use any SPN or MSI for cluster operation.

 

Benefits of MSI over SPN for SAP clusters

Current Pacemaker clusters use an SPN by providing a username (the SPN’s client ID in AAD), a password (the SPN secret) and the AAD tenant. These values are set in the stonith resource within Pacemaker. Every time the cluster performs a monitoring or other action, it authenticates with this user ID, password and tenant against Azure Active Directory to perform management API requests.

 

The SPN password/secret has an expiration date. Since April 2021, secrets can no longer be set to ‘never expire’; 24 months is the longest allowed value, whether set in the portal or through API/CLI. Your company likely also has a security policy requiring credentials to be renewed much sooner, for example every 6 months. You need to keep track of expiration dates and update your Pacemaker configuration with a new SPN secret in time. An expired secret means Pacemaker can perform neither status monitoring nor fencing actions, crippling your SAP cluster.

 

Managed identities solve the problem of expiring credentials and renewal cycles. There are no keys or secrets to manage. For a look under the covers, the Azure documentation describes how managed identities for Azure resources work with Azure virtual machines (see Further reading and links).

 

Creating managed identity for SAP clusters

The Azure high-availability documentation uses a system-assigned identity. User-assigned managed identities are not documented and should not be used for this purpose.

 

If a VM does not yet have a managed identity, navigate to one of the VMs in the Pacemaker cluster, select Identity on the left blade and set the status to ‘on’ for the system-assigned identity. Save and repeat for all VMs in the Pacemaker cluster. If a system-assigned identity already exists, no changes are needed, and this MSI will be used in the next step when assigning permissions.
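The same portal step can also be scripted. The following is a minimal sketch, assuming Azure CLI is installed and logged in on an admin workstation; the resource group and VM names are placeholders borrowed from the examples in this post. By default (DRY_RUN=1) the commands are only printed, so they can be reviewed before running them with DRY_RUN=0.

```shell
#!/bin/sh
# Placeholders (assumptions): replace with your own resource group and cluster VM names.
RESOURCE_GROUP="sap-prod-cluster-w2us-rg"
CLUSTER_VMS="msiscsvm1 msiscsvm2"

# Print the command when DRY_RUN=1 (default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Enable the system-assigned managed identity on every cluster VM.
# Called without --identities, 'az vm identity assign' enables the
# system-assigned identity of the VM.
enable_system_identity() {
    for vm in $CLUSTER_VMS; do
        run az vm identity assign --resource-group "$RESOURCE_GROUP" --name "$vm"
    done
}

enable_system_identity
```

If a VM already has a system-assigned identity, the command leaves it in place, which matches the portal behavior described above.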

 

Figure 1 - Enabling system assigned identity to VM

 

A VM can have a single system-assigned identity, which can hold one or more role assignments for authorization. For cluster operation, the Azure documentation recommends using a custom role for the fence agent. Details on how to create such a role are described in the Azure documentation. This blog post assumes a role already exists and is assigned to the SPN used to provide fencing services for the cluster.

 

Next, we assign this custom cluster role to the managed identities. In the Azure portal, navigate to the virtual machines (VMs) of the cluster. Select the first VM, navigate to the Access control (IAM) page and add a new role assignment.

 

NOTE: Instead of assigning the custom role on the cluster VMs only, an alternative is to assign it on the whole resource group of the SAP system or even on the Azure subscription. This lowers the implementation effort but is less secure, as every cluster VM can then read and powerOff all VMs in the assigned resource group or subscription. This blog post follows the principle of least privilege and assigns roles on VMs only.

 

Figure 2 - Assigning custom role for cluster on VM to MSI - step 1/2

 

Select the cluster role you created or previously used for the SPN in the cluster. Using the same role as the SPN ensures continuity, as the managed identity receives the same permissions. Assign the managed identities (MIs) of all virtual machines in the cluster to this role. For an (A)SCS cluster, these are the two (A)SCS VM identities; for a DB cluster, the identities of the respective DB cluster VMs.

 

The picture below shows the first cluster VM, mi2scsvm1, granting access to perform the actions defined in the custom role AzFenceAgent to two MIs, belonging to VMs mi2scsvm1 and mi2scsvm2.

 

Figure 3 - Assigning custom role for cluster on VM to MSI - step 2/2

 

Repeat the same steps for the second and any other cluster VM, again granting the cluster role to all MSIs of the SAP cluster.
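The resulting assignments form a small matrix: every cluster VM is a scope, and on each scope the role is granted to the identity of every cluster VM. A minimal Azure CLI sketch of that matrix follows. The subscription, resource group, VM names and role name are placeholders (the role name AzFenceAgent is taken from the figure above), and in the default dry-run mode the principal IDs are shown as placeholders as well, so nothing is changed in Azure.

```shell
#!/bin/sh
# Placeholders (assumptions): adjust to your environment.
SUBSCRIPTION_ID="<azureSubscriptionId>"
RESOURCE_GROUP="<resourceGroupName>"
CLUSTER_VMS="mi2scsvm1 mi2scsvm2"
ROLE_NAME="AzFenceAgent"

# Print the command when DRY_RUN=1 (default), execute it otherwise.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Object (principal) ID of a VM's system-assigned identity.
# In dry-run mode a placeholder is returned so that no Azure call is made.
principal_id() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "<principalId-of-$1>"
    else
        az vm show --resource-group "$RESOURCE_GROUP" --name "$1" \
            --query identity.principalId --output tsv
    fi
}

# Azure resource ID of a VM, used as the scope of the role assignment.
vm_scope() {
    echo "/subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP/providers/Microsoft.Compute/virtualMachines/$1"
}

# Grant the cluster role on every cluster VM (scope) to every cluster VM identity.
assign_cluster_role() {
    for scope_vm in $CLUSTER_VMS; do
        for identity_vm in $CLUSTER_VMS; do
            run az role assignment create \
                --assignee-object-id "$(principal_id "$identity_vm")" \
                --assignee-principal-type ServicePrincipal \
                --role "$ROLE_NAME" \
                --scope "$(vm_scope "$scope_vm")"
        done
    done
}

assign_cluster_role
```

For a two-node cluster this yields four assignments, which is what the verification step on the Identity blade should later show as two assignments per VM identity.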

 

Verify the role has been assigned to the right scope. Navigate to either of the cluster VMs and select Identity on the left blade. For the system-assigned identity, click on Azure role assignments to view them.

 

Figure 4 - Verifying assigned roles to VM's MSI - step 1/2

 

You should see at least two assignments: the cluster role scoped to each of the two cluster VMs, both assigned to the identity of the VM you are looking at. The same applies, vice versa, on the second cluster VM. You might also see a Reader role on the resource group the VM is in, assigned to your system-assigned identity. This is likely from the Azure Monitoring Extension for SAP, which uses it to access the VM’s resources. You might see other role assignments as well, for example from your company policies.

 

Figure 5 - Verifying assigned roles to VM's MSI - step 2/2

 

Note: Consider the limitations of using managed identities for authorization.
Since authorization tokens received by Pacemaker are cached by the underlying infrastructure, any change to the managed identity’s roles can take significant time to take effect. Both adding and removing roles can be delayed.

Plan your cluster role and role assignment well in advance.

 

 

Verify MSI and prepare cluster VMs

With the authorizations assigned, you are ready to start working at OS level. This verification will also confirm that the role assignments are effective.

 

First verify the installed fence agent packages and update them if needed. If the installed versions are lower than the minimum versions listed below, install the latest version of the listed packages.

 

Minimum version for MSI support:

For RHEL 8.4 - fence-agents-4.2.1-54.el8

For RHEL 8.2 - fence-agents-4.2.1-41.el8_2.4

For RHEL 8.1 - fence-agents-4.2.1-30.el8_1.4

For RHEL 7.9 - fence-agents-4.2.1-41.el7_9.4

For SLES 15 SP1 and newer - fence-agents 4.5.2+git.1592573838.1eee0863 or higher.

For SLES 12 SP5 - fence-agents 4.9.0+git.1624456340.8d746be9-3.35.2 or higher. 

Older releases of either SLES or RHEL have not been verified, even if the necessary packages might be available now or become available over time.
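A quick way to compare an installed version against the minimum is sketched below. It relies on GNU `sort -V`, which only approximates RPM's own version comparison, so treat the result as a hint and not as proof. The minimum shown is the RHEL 8.4 value from the list above; the installed version would come from a query such as `rpm -q --qf '%{VERSION}-%{RELEASE}' <package>`, where the package name depends on your distribution. The two sample versions at the end are made up for illustration.

```shell
#!/bin/sh
# version_ge A B: succeeds when version string A sorts at or after B.
# Uses GNU 'sort -V', an approximation of RPM version ordering.
version_ge() {
    [ "$(printf '%s\n%s\n' "$2" "$1" | sort -V | head -n 1)" = "$2" ]
}

# Assumption: minimum version for your OS release, taken from the list above
# (here: RHEL 8.4).
MINIMUM="4.2.1-54.el8"

# check_fence_agents INSTALLED_VERSION
check_fence_agents() {
    if version_ge "$1" "$MINIMUM"; then
        echo "OK: $1 >= $MINIMUM"
    else
        echo "UPDATE NEEDED: $1 < $MINIMUM"
    fi
}

# Illustrative (hypothetical) installed versions.
check_fence_agents "4.2.1-65.el8"
check_fence_agents "4.2.1-30.el8"
```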

 

After the required packages are updated on the cluster VMs, you can verify the operation with MSI using the command below. This can be done during normal operation, as it only performs a list action. The output is the list of VMs which can be read in the provided Azure subscription and resource group. The authorization assigned to the VM’s MSI determines which VMs are seen by the command.

 

sudo fence_azure_arm --msi --action=list --subscriptionId=<azureSubscriptionId> --resourceGroup=<resourceGroupName>

 

Example output if the MSI is authorized to read only the cluster VMs, with no other roles assigned to the system-assigned MSI:

azureuser@msiscsvm1:~> sudo fence_azure_arm --msi --action=list --subscriptionId=xxxxx-xxxx-xxx --resourceGroup=sap-prod-cluster-w2us-rg
msiscsvm1,
msiscsvm2,

 

Example output if the cluster VMs have the Azure monitoring extension for SAP, which currently uses the Reader role on the entire resource group. All 6 VMs in the same resource group are listed, not just the cluster VMs:

azureuser@msiscsvm1:~> sudo fence_azure_arm --msi --action=list --subscriptionId=xxxxx-xxxx-xxx --resourceGroup=sap-prod-cluster-w2us-rg
msiappvm1,
msiappvm2,
msidbvm1,
msidbvm2,
msiscsvm1,
msiscsvm2,

 

Repeat the verification step on ALL VMs of the cluster! Any errors - syntax, Azure authorizations insufficient, other errors - need to be resolved before proceeding. The command must complete without errors and show all VMs in the cluster. 
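To make the "shows all VMs in the cluster" check less error-prone, the list output can be piped through a small filter. The following is a minimal sketch, assuming the output format shown in the examples above (one VM name per line, followed by a comma); the VM names are placeholders.

```shell
#!/bin/sh
# Placeholder (assumption): names of the VMs that form the cluster.
CLUSTER_VMS="msiscsvm1 msiscsvm2"

# Reads 'fence_azure_arm --action=list' output on stdin and reports every
# cluster VM that is missing from it. Returns non-zero if any VM is missing.
check_list_output() {
    # Strip the trailing comma and surrounding whitespace from each line.
    listed=$(sed -e 's/,[[:space:]]*$//' -e 's/^[[:space:]]*//' -e 's/[[:space:]]*$//')
    missing=0
    for vm in $CLUSTER_VMS; do
        if ! printf '%s\n' "$listed" | grep -qxF -e "$vm"; then
            echo "MISSING: $vm"
            missing=1
        fi
    done
    [ "$missing" -eq 0 ] && echo "OK: all cluster VMs visible"
    return "$missing"
}

# Demonstration with the sample output from this post. On a cluster VM, pipe
# the real command instead:
#   sudo fence_azure_arm --msi --action=list --subscriptionId=<azureSubscriptionId> --resourceGroup=<resourceGroupName> | check_list_output
printf 'msiscsvm1,\nmsiscsvm2,\n' | check_list_output
```

Additional VMs in the output, for example from a Reader role on the resource group, do not cause a failure; only missing cluster VMs do.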

 

For further testing of MSI, the action can be changed from list to on|off|reboot|status, with the additional parameter ‘--plug <vmName>’ specifying which VM should be queried, fenced, stopped or restarted.

NOTE: Any value for the action parameter other than list or status WILL IMPACT the cluster, causing downtime of the VM and the resources running on it. Any such testing should be limited to SAP cluster and application downtime windows, or to test systems.
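One way to guard against accidentally running a disruptive action is a small wrapper that only builds the test command and refuses on, off and reboot unless explicitly allowed. This is a hypothetical helper for illustration, not part of the fence agent. The subscription and resource group are placeholders, and the command is printed, not executed.

```shell
#!/bin/sh
# Placeholders (assumptions): adjust to your environment.
SUBSCRIPTION_ID="<azureSubscriptionId>"
RESOURCE_GROUP="<resourceGroupName>"

# fence_test_cmd ACTION [VM]
# Prints (does not run) a fence_azure_arm test command. Actions that change
# the power state of a VM are refused unless ALLOW_DISRUPTIVE=1 is set.
fence_test_cmd() {
    action=$1
    vm=${2:-}
    base="sudo fence_azure_arm --msi --subscriptionId=$SUBSCRIPTION_ID --resourceGroup=$RESOURCE_GROUP"
    if [ "$action" != "list" ] && [ -z "$vm" ]; then
        echo "a VM name is required for action '$action'" >&2
        return 2
    fi
    case "$action" in
        list)
            echo "$base --action=list"
            ;;
        status)
            echo "$base --action=status --plug=$vm"
            ;;
        on|off|reboot)
            if [ "${ALLOW_DISRUPTIVE:-0}" != "1" ]; then
                echo "refusing '$action': it changes the power state of $vm" >&2
                return 1
            fi
            echo "$base --action=$action --plug=$vm"
            ;;
        *)
            echo "unknown action: $action" >&2
            return 2
            ;;
    esac
}

fence_test_cmd list
fence_test_cmd status msiscsvm2
```

The status action is a safe second test after list, as it only queries the power state of the given VM.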

 

If the test list command completes successfully, the cluster is ready for update of the stonith resource. Using same custom role as assigned to the current SPN ensures continuity.

 

Change from SPN to MSI

With MSI created, role authorizations granted and MSI tested on cluster VMs, the step to change from current SPN setup to MSI can start.

 

NOTE: While the actions described can be executed on a running SAP cluster, any failure of the stonith resource can lead to cluster resources moving and to unexpected downtime, even if the change was thoroughly tested on non-productive or sandbox environments.
The author does not take any responsibility for any possible downtime if unforeseen circumstances occur during online change. SAP and/or database downtime for such cluster change is recommended. This way cluster configuration gets changed and operation is verified before releasing the SAP system back to users.

 

All steps in this chapter immediately affect the whole cluster and should be executed only once, on any active cluster VM. See commands for the respective OS used.

 

RHEL 8.x

The current STONITH settings use the configured SPN. Verify the configuration:

[azureuser@msiscsvm1 ~]$ sudo pcs stonith config
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: password=<SPN-secret> pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id> tenantId=<AAD-tenant-ID> username=<SPN-name>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)

 

Set cluster to maintenance mode and disable stonith

sudo pcs property set maintenance-mode=true
sudo pcs property set stonith-enabled=false

 

Enable use of MSI and unset previous values for username, password and AAD tenant ID, which are no longer needed.

sudo pcs stonith update rsc_st_azure fence_azure_arm msi=true password= username= tenantId=

 

Verify configuration does not contain username, password or tenant ID, with MSI enabled

[azureuser@msiscsvm1 ~]$ sudo pcs stonith config
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: msi=true pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)
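The visual check of the configuration can be backed by a small filter that looks for leftover SPN attributes. The following is a minimal sketch; it only inspects text, so it can be fed from `sudo pcs stonith config`, and the attribute names it looks for (username/password, login/passwd, tenantId) are the ones appearing in this post. The sample line at the end is shortened from the output above.

```shell
#!/bin/sh
# Reads a stonith resource configuration on stdin and checks that msi=true is
# set and that no SPN credential attributes remain. Returns non-zero otherwise.
check_msi_config() {
    config=$(cat)
    rc=0
    if ! printf '%s\n' "$config" | grep -Eiq '(^|[[:space:]])msi=true([[:space:]]|$)'; then
        echo "msi=true not found"
        rc=1
    fi
    for attr in username password login passwd tenantId; do
        if printf '%s\n' "$config" | grep -Eiq "(^|[[:space:]])${attr}="; then
            echo "leftover SPN attribute: $attr"
            rc=1
        fi
    done
    [ "$rc" -eq 0 ] && echo "OK: MSI enabled, no SPN attributes left"
    return "$rc"
}

# Demonstration with a shortened sample line. On a cluster VM use for example:
#   sudo pcs stonith config | check_msi_config
printf '%s\n' '  Attributes: msi=true pcmk_action_limit=3 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id>' | check_msi_config
```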

 

Disable maintenance mode and re-enable stonith

sudo pcs property set stonith-enabled=true
sudo pcs property set maintenance-mode=false

 

Check cluster, verify stonith resource is started on one cluster host

sudo pcs status

 

RHEL 7.x

The current STONITH settings use the configured SPN. Verify the configuration:

[azureuser@msiscsvm1 ~]$ sudo pcs stonith show --full
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: login=<SPN-name> passwd=<SPN-secret> pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id> tenantId=<AAD-tenant-ID>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)

 

Set cluster to maintenance mode and disable stonith

sudo pcs property set maintenance-mode=true
sudo pcs property set stonith-enabled=false

 

Enable use of MSI and unset previous values for username, password and AAD tenant ID, which are no longer needed.

sudo pcs stonith update rsc_st_azure fence_azure_arm msi=true passwd= login= tenantId=

 

Verify configuration does not contain username, password or tenant ID, with MSI enabled

[azureuser@msiscsvm1 ~]$ sudo pcs stonith show --full
 Resource: rsc_st_azure (class=stonith type=fence_azure_arm)
  Attributes: msi=true pcmk_action_limit=3 pcmk_delay_max=15 pcmk_monitor_retries=4 pcmk_monitor_timeout=120 pcmk_reboot_timeout=900 power_timeout=240 resourceGroup=<SAP-resource-group> subscriptionId=<subscription-id>
  Operations: monitor interval=3600 (rsc_st_azure-monitor-interval-3600)

 

Disable maintenance mode and re-enable stonith

sudo pcs property set stonith-enabled=true
sudo pcs property set maintenance-mode=false

 

Check cluster, verify stonith resource is started on one cluster host

sudo pcs status

 

SLES 12 SP5, SLES 15 SP1 and newer

The current STONITH settings use the configured SPN. Verify the configuration:

azureuser@msiscsvm1:~> sudo crm configure show rsc_st_azure
primitive rsc_st_azure stonith:fence_azure_arm \
        params subscriptionId=<subscription-id> resourceGroup=<SAP-resource-group> tenantId=<AAD-tenant-ID> login=<SPN-name> passwd="******" pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=300 \
        op monitor interval=3600 timeout=120

 

Set cluster to maintenance mode and disable stonith

sudo crm configure property maintenance-mode=true
sudo crm configure property stonith-enabled=false

 

Enable use of MSI and unset the previous values for username, password and AAD tenant ID, which are no longer needed. Edit the cluster resource interactively:

sudo crm configure edit rsc_st_azure

 

Add new first parameter msi=true.

Remove the parameters tenantId, login and passwd entirely from the resource.

Save with vi-style command (:wq) to exit edit mode.

Example of the edited resource:

primitive rsc_st_azure stonith:fence_azure_arm \
        params msi=true subscriptionId=<subscription-id> resourceGroup=<SAP-resource-group> pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=300 \
        op monitor interval=3600 timeout=120

 

Verify configuration does not contain username, password or tenant ID, with MSI enabled

azureuser@msiscsvm1:~> sudo crm configure show rsc_st_azure
primitive rsc_st_azure stonith:fence_azure_arm \
        params msi=true subscriptionId=<subscription-id> resourceGroup=<SAP-resource-group> pcmk_monitor_retries=4 pcmk_action_limit=3 power_timeout=240 pcmk_reboot_timeout=300 \
        op monitor interval=3600 timeout=120

 

Disable maintenance mode and re-enable stonith

sudo crm configure property stonith-enabled=true
sudo crm configure property maintenance-mode=false

 

Check cluster, verify stonith resource is started on one cluster host

sudo crm status

 

All OS versions – testing Azure fence agent with MSI after change

Verify that the managed identity is used for cluster actions, for example on a test system. While the stonith resource configuration can be displayed, only a real fencing action shows the managed identity in Azure, confirming that the role and scope assignment was done correctly. This section illustrates how SPN and MSI usage appear in the Azure activity log.

 

For the test scenario, assume an (A)SCS cluster with two VMs, mi2scsvm1 and mi2scsvm2, initially set up with SPN and changed a few hours later to MSI for the Azure fence agent. VM mi2scsvm2 is stopped and powered back on by the Pacemaker cluster as part of the cluster testing described in the Azure high availability documentation (RHEL and SLES versions).

 

In the Azure portal, the activity log of the fenced VM, mi2scsvm2, shows in our example the fence action (yellow box) initially performed by the SPN (sap-cluster-prod-spn) some hours ago. This was the original cluster setup with SPN. After the change to MSI, the same cluster test is repeated (green box) and shows the same VM start and powerOff, but this time performed by the MSI of mi2scsvm1. The Azure fence agent running on mi2scsvm1 detected a hanging VM and restarted it.

 

Figure 6 - Activity log of cluster testing with both SPN and MSI, showing differences.
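The same information can be pulled with Azure CLI instead of the portal. The sketch below only builds and prints the query command for the fenced VM; subscription and resource group are placeholders, and the selected columns are one possible choice. Note that the caller field of an activity log entry typically holds an ID and not the display name shown in the portal, so expect to match it against the SPN or the MSI principal ID.

```shell
#!/bin/sh
# Placeholders (assumptions): adjust to your environment.
SUBSCRIPTION_ID="<azureSubscriptionId>"
RESOURCE_GROUP="<resourceGroupName>"

# Azure resource ID of a VM in the resource group above.
vm_resource_id() {
    printf '/subscriptions/%s/resourceGroups/%s/providers/Microsoft.Compute/virtualMachines/%s\n' \
        "$SUBSCRIPTION_ID" "$RESOURCE_GROUP" "$1"
}

# Print (not run) an Azure CLI query for the last day of activity on a VM,
# showing time, operation and caller of each event.
activity_log_cmd() {
    printf 'az monitor activity-log list --resource-id "%s" --offset 1d --query "[].{time:eventTimestamp, operation:operationName.value, caller:caller}" --output table\n' \
        "$(vm_resource_id "$1")"
}

activity_log_cmd mi2scsvm2
```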

 

Remove SPN

Once the cluster operates with MSI, the previously used SPN can be deleted. First ensure the SPN is not used in other clusters or applications. Also remove the SPN's role assignment on every scope where it was granted: VMs, resource group and/or subscription.

 

Final words

The benefits of increased security and no expiring secrets make managed identities for Azure resources the recommended way of assigning authorization for applications. With the Pacemaker high-availability solution now supporting this method, it is the recommended way to use the Azure fence agent.

 

Further reading and links

 

Azure Docs | How managed identities for Azure resources work with Azure virtual machines

M365 dev blogs | Client Secret expiration now limited to a maximum of two years

Azure Docs | Assign Azure roles to a managed identity

 

Last update: Sep 23 2022 05:57 AM