Introduction:
SAP HANA system replication (HSR) uses one primary node and at least one secondary node. Changes made to the data on the primary node are replicated synchronously to the secondary node, so the secondary always holds a consistent, up-to-date copy. This is crucial for maintaining the integrity and availability of the data.
Problem Description:
An Azure VM was in a degraded state, and the SAP cluster was unable to start, which caused a major outage. The node health score (-1000000) did not reset automatically after the VM was redeployed and remained in place until it was cleared manually.
If your cluster nodes run SLES 12 or later, consider the configuration below. Note that a promotable clone is not supported for this resource; use the multi-state (ms) type shown here.
Replace <placeholders> with your instance number and HANA system ID.
sudo crm configure primitive rsc_SAPHana_<HANA SID>_HDB<instance number> ocf:suse:SAPHana \
  operations \$id="rsc_sap_<HANA SID>_HDB<instance number>-operations" \
  op start interval="0" timeout="3600" \
  op stop interval="0" timeout="3600" \
  op promote interval="0" timeout="3600" \
  op monitor interval="60" role="Master" timeout="700" \
  op monitor interval="61" role="Slave" timeout="700" \
  params SID="<HANA SID>" InstanceNumber="<instance number>" PREFER_SITE_TAKEOVER="true" \
  DUPLICATE_PRIMARY_TIMEOUT="7200" AUTOMATED_REGISTER="false"

sudo crm configure ms msl_SAPHana_<HANA SID>_HDB<instance number> rsc_SAPHana_<HANA SID>_HDB<instance number> \
  meta notify="true" clone-max="2" clone-node-max="1" \
  target-role="Started" interleave="true"

sudo crm resource meta msl_SAPHana_<HANA SID>_HDB<instance number> set priority 100
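The resource names in the commands above follow a fixed pattern built from the HANA SID and the instance number. As a minimal sketch, the helper below derives both names and rejects obviously malformed input. The function name hana_resource_names and the example values HN1 and 03 are made up for illustration; they are not part of any SAP or SUSE tool.

```shell
#!/bin/sh
# Hypothetical helper: build the primitive and ms resource names used in
# this article from a HANA SID and an instance number.
LC_ALL=C; export LC_ALL   # keep [A-Z] ranges limited to ASCII uppercase

hana_resource_names() {
    sid="$1"
    inst="$2"
    # A SID is three characters: an uppercase letter, then letters or digits.
    case "$sid" in
        [A-Z][A-Z0-9][A-Z0-9]) ;;
        *) echo "invalid SID: $sid" >&2; return 1 ;;
    esac
    # The instance number is written with two digits, for example 03.
    case "$inst" in
        [0-9][0-9]) ;;
        *) echo "invalid instance number: $inst" >&2; return 1 ;;
    esac
    printf 'primitive=rsc_SAPHana_%s_HDB%s\n' "$sid" "$inst"
    printf 'ms=msl_SAPHana_%s_HDB%s\n' "$sid" "$inst"
}

# Example with made-up values.
hana_resource_names HN1 03
```

Generating the names once and reusing them avoids the most common mistake in this procedure, which is a missing underscore or a mismatched name between the primitive and its wrapper.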
Cutover steps: The cutover consists of pre-steps, execution steps, post-validation steps, and a rollback plan.
The pre-steps are the preparations and checks to complete before the main execution; they confirm that the cluster is ready for the change. The execution steps are the core actions of the change and must be followed exactly and in order to avoid issues. The post-validation steps verify the results and confirm that everything works as expected. The rollback plan restores the previous configuration if the change has to be reverted.
Pre-Steps:
Check cluster status:
- crm status
- crm configure show
- SAPHanaSR-showAttr
Ensure no pending operations or failed resources:
- crm_mon -1
Confirm replication is healthy:
- hdbnsutil -sr_state
- SAPHanaSR-showAttr
Backup current configuration:
- crm configure show > /root/cluster_config_backup.txt
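The pre-steps above can be collected into one script. The sketch below is a dry run by default: it only prints the commands, and it executes them only when DRY_RUN=0 is set. The run and backup_file helpers and the timestamped file name are assumptions added for illustration; the article itself uses the fixed path /root/cluster_config_backup.txt.

```shell
#!/bin/sh
# Sketch of the pre-steps. Prints commands unless DRY_RUN=0 is set.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Hypothetical naming scheme: one backup file per timestamp.
backup_file() {
    printf '/root/cluster_config_backup_%s.txt\n' "$1"
}

prechecks() {
    stamp="$1"
    run crm status
    run crm configure show
    run SAPHanaSR-showAttr
    run crm_mon -1
    run hdbnsutil -sr_state   # usually run as the <sid>adm user
    target="$(backup_file "$stamp")"
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ crm configure show > %s\n' "$target"
    else
        crm configure show > "$target"
    fi
}

prechecks "$(date +%Y%m%d_%H%M%S)"
```

A timestamp in the file name keeps earlier backups from being overwritten when the pre-steps are repeated.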
Execution Steps:
Enable maintenance mode:
- sudo crm configure property maintenance-mode=true
Delete the incorrect clone resource:
- crm configure delete msl_SAPHana_<SID>_HDB<instance>
Recreate it as a multi-state (ms) resource:
- sudo crm configure ms msl_SAPHana_<SID>_HDB<instance> rsc_SAPHana_<SID>_HDB<instance> meta notify="true" clone-max="2" clone-node-max="1" target-role="Started" interleave="true" maintenance="true"
- sudo crm resource meta msl_SAPHana_<SID>_HDB<instance> set priority 100
Disable maintenance mode:
- crm configure property maintenance-mode=false
Refresh the resource and take it out of maintenance:
- sudo crm resource refresh msl_SAPHana_<SID>_HDB<instance>
- Wait 10 seconds
- Check that the HSR status matches across SAPHanaSR-showAttr, crm_mon -A -1, and hdbnsutil -sr_state
- sudo crm resource maintenance msl_SAPHana_<SID>_HDB<instance> off
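The execution steps above can be sketched as two phases with a manual checkpoint between them, so the resource is released only after the HSR status has been checked. This is an illustration under stated assumptions, not a tested change script: the function names are invented, the example names use a made-up SID HN1 and instance 03, and the script only prints the commands unless DRY_RUN=0 is set.

```shell
#!/bin/sh
# Sketch only: prints the commands unless DRY_RUN=0 is set explicitly.
DRY_RUN="${DRY_RUN:-1}"

run() {
    if [ "$DRY_RUN" = "1" ]; then
        printf '+ %s\n' "$*"
    else
        "$@"
    fi
}

# Phase 1: replace the clone wrapper with an ms wrapper while the
# cluster is in maintenance. Stops at the first failing command.
# $1 = ms resource name, $2 = primitive name
convert_to_ms() {
    run crm configure property maintenance-mode=true &&
    run crm configure delete "$1" &&
    run crm configure ms "$1" "$2" meta notify=true clone-max=2 \
        clone-node-max=1 target-role=Started interleave=true maintenance=true &&
    run crm resource meta "$1" set priority 100 &&
    run crm configure property maintenance-mode=false &&
    run crm resource refresh "$1" &&
    run sleep 10
}

# Phase 2: run only after the HSR status has been verified by hand.
release_resource() {
    run crm resource maintenance "$1" off
}

# Example with made-up names.
convert_to_ms msl_SAPHana_HN1_HDB03 rsc_SAPHana_HN1_HDB03
echo "# verify HSR status before continuing"
release_resource msl_SAPHana_HN1_HDB03
```

Keeping the release in a separate function makes the human verification an explicit step instead of a comment that is easy to skip.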
Post Validation steps:
- crm status
- crm configure show
- SAPHanaSR-showAttr
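Part of the validation is confirming that each node reports the replication mode you expect. The sketch below reads hdbnsutil -sr_state output on standard input and compares its "mode:" line with an expected value. The function names are invented, and the sample text in the here-document is an illustrative assumption about the output layout, not captured from a real system; check it against your own output before relying on it.

```shell
#!/bin/sh
# Print the value of the first "mode:" line found on stdin.
# The line "operation mode:" does not match because the pattern is anchored.
sr_mode() {
    sed -n 's/^[[:space:]]*mode:[[:space:]]*//p' | head -n 1
}

# Compare the reported mode with the expected one ($1).
check_mode() {
    expected="$1"
    actual="$(sr_mode)"
    if [ "$actual" = "$expected" ]; then
        echo "OK: mode is $actual"
    else
        echo "MISMATCH: expected $expected, got ${actual:-nothing}"
        return 1
    fi
}

# Illustrative sample (assumed layout, not real output).
check_mode primary <<'EOF'
online: true

mode: primary
operation mode: primary
site id: 1
site name: SITE1
EOF
```

On a live node the same check would be fed from the command itself, for example hdbnsutil -sr_state | check_mode primary, run as the <sid>adm user.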
Rollback Plan:
Enable maintenance mode:
- crm configure property maintenance-mode=true
- sudo crm resource maintenance msl_SAPHana_<SID>_HDB<instance> on
Restore configuration from backup:
- crm configure load update /root/cluster_config_backup.txt
Recreate the previous clone configuration if needed:
- crm configure clone msl_SAPHana_<SID>_HDB<instance> rsc_SAPHana_<SID>_HDB<instance> meta notify=true clone-max=2 clone-node-max=1 target-role=Started interleave=true promotable=true
Disable maintenance and refresh resources:
- crm configure property maintenance-mode=false
- sudo crm resource refresh msl_SAPHana_<SID>_HDB<instance>
- Wait 10 seconds
- sudo crm resource maintenance msl_SAPHana_<SID>_HDB<instance> off
Perform the steps below during the actual execution:
| Task Description | Team |
| --- | --- |
| Pre Step: Submit a CAB request for approval | Basis |
| Perform Pre-checks | |
| · Check cluster status: | Basis |
| Execution | |
| Get Go ahead from Leadership team | Basis |
| Step 0 – Put cluster into maintenance mode | Basis |
| crm resource maintenance g_ip_SID_HD on | Basis |
| #Backup current configuration: when cluster, msl, and g_ip are in maintenance | Basis |
| Step 1 – (If not already done) clear Node 1 health and ensure topology/azure-events are running on both nodes (this avoids scheduler surprises when we re-manage) | Basis |
| #Execute on m1vms* (ideally it can be executed on any node) | SOPS |
| crm resource cleanup health-azure-events-cln | Basis |
| #Backup current configuration: when the health correction is complete and the msl correction is remaining | Basis |
| Step 2 – Convert the wrapper inside a single atomic transaction | Basis |
| # Remove the promotable clone wrapper (keeps the rsc_SAPHana_SID_HD primitive intact) | Basis |
| # Recreate as multi-state (ms) for classic agents | Basis |
| sudo crm resource meta msl_SAPHana_SID_HD set priority 100 | Basis |
| Step 3 – Re-enable cluster management of IP and HANA | Basis |
| Prechecks by MSFT, SUSE Teams | MSFT/SUSE |
| Precheck by BASIS Team | Basis |
| crm configure property maintenance-mode=false | Basis |
| Validation | Basis |
| crm_mon -R1 -Af -1 | Basis |
| Rollback Plan | |
| Enable maintenance mode: | Basis |
| crm configure property maintenance-mode=true | Basis |
| Restore configuration from backup: decide which state to revert to and use the corresponding backup | Basis |
| crm configure load update /hana/shared/SID/dbcluster_backup_prechange/prehealth/premsl.txt | Basis |
| Recreate the previous clone configuration if needed: | Basis |
| crm configure clone msl_SAPHana_SID_HD rsc_SAPHana_SID_HD meta notify=true clone-max=2 clone-node-max=1 target-role=Started interleave=true promotable=true maintenance="true" | Basis |
| Disable maintenance and refresh resources: | Basis |
| crm configure property maintenance-mode=false | Basis |
Important Points:
1. Are there known version-specific considerations when migrating from clone to ms?
If you are using SAPHanaSR, use 'ms'. If you are using SAPHanaSR-angi, use 'clone'.
2. Does this change apply across the board on SUSE OS and/or Pacemaker versions?
There are three different sets of HANA resource agents and SRHook scripts: two older ones and one newer one.
The packages for the older ones are:
- SAPHanaSR, which is for scale-up HANA clusters.
- SAPHanaSR-ScaleOut, which is for scale-out HANA clusters.
The package for the new one is:
- SAPHanaSR-angi, which is for both scale-up and scale-out clusters ("angi" stands for "advanced next generation interface").
When using the older SAPHanaSR or SAPHanaSR-ScaleOut resource agents and SRHook scripts, SUSE only supports the multi-state (ms) clone type for the SAPHana (scale-up) or SAPHanaController (scale-out) resource. The older resource agents and scripts are supported on all Service Packs of SLES for SAP 12 and 15.
When using the newer SAPHanaSR-angi resource agents and scripts, SUSE only supports the regular clone type for the SAPHanaController resource (scale-up and scale-out) with the "promotable=true" meta-attribute set on the clone. The newer "angi" resource agents and scripts are supported on SLES for SAP 15 SP5 and higher, and on SLES for SAP 16 when it is released later this year.
So, with SLES for SAP 15 SP5 and higher, you can use either the older or the newer resource agents and scripts. For all Service Packs of SLES for SAP 12, and for Service Packs of SLES for SAP 15 prior to SP5, you must use the older resource agents and scripts. Starting with SLES for SAP 16, you must use the new angi resource agents and scripts.
Installing the new SAPHanaSR-angi package automatically uninstalls the older SAPHanaSR or SAPHanaSR-ScaleOut package if it is already installed. SUSE has published a blog post on how to migrate from the older resource agents and scripts to the newer ones.
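The support rules above reduce to two small lookups: which wrapper type goes with which package, and which agent generations a given SLES for SAP release allows. The sketch below encodes only what this section states. The function names and the labels "classic" and "angi" are invented for illustration.

```shell
#!/bin/sh
# Wrapper type supported for each resource-agent package.
wrapper_for_package() {
    case "$1" in
        SAPHanaSR|SAPHanaSR-ScaleOut) echo "ms" ;;
        SAPHanaSR-angi) echo "clone promotable=true" ;;
        *) echo "unknown package: $1" >&2; return 1 ;;
    esac
}

# Agent generations supported on a SLES for SAP release.
# $1 = major version (12, 15, 16), $2 = service pack (optional, default 0)
supported_agents() {
    major="$1"
    sp="${2:-0}"
    case "$major" in
        12) echo "classic" ;;
        15)
            if [ "$sp" -ge 5 ]; then
                echo "classic angi"
            else
                echo "classic"
            fi
            ;;
        16) echo "angi" ;;
        *) echo "unknown release: $major" >&2; return 1 ;;
    esac
}

wrapper_for_package SAPHanaSR
supported_agents 15 5
```

To find out which package a node actually has, a query such as rpm -q SAPHanaSR SAPHanaSR-ScaleOut SAPHanaSR-angi can be used, and the installed name passed to wrapper_for_package.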
Conclusion:
Set up system replication and verify that it is active; this is crucial to avoid business disruptions during critical operational hours. The steps in this article correct the cluster configuration and improve the resilience of the systems. Together with a working replication setup, they strengthen business continuity and leave operations better prepared for future demands.
Reference MS links:
High availability for SAP HANA on Azure VMs on SLES | Microsoft Learn