SAP on Azure

MSL correction from clone to multi-state HANA DB cluster SUSE activation
Introduction: SAP HANA system replication involves configuring one primary node and at least one secondary node. Any changes made to the data on the primary node are replicated to the secondary node synchronously, which ensures a consistent and up-to-date copy of the data and is crucial for maintaining its integrity and availability.

Problem Description: An Azure VM was in a degraded state, causing a major outage because the SAP cluster was unable to start. The node health score (-1000000) did not reset automatically after redeploying the VM and remained until manual intervention.

Consider the configuration below if your cluster nodes are running SLES 12 or later. Please note that the promotable clone type is not supported with the classic SAPHanaSR agent; a multi-state (ms) resource must be used instead. Replace the <placeholders> with your HANA system ID and instance number.

sudo crm configure primitive rsc_SAPHana_<HANA SID>_HDB<instance number> ocf:suse:SAPHana operations $id="rsc_sap_<HANA SID>_HDB<instance number>-operations" op start interval="0" timeout="3600" op stop interval="0" timeout="3600" op promote interval="0" timeout="3600" op monitor interval="60" role="Master" timeout="700" op monitor interval="61" role="Slave" timeout="700" params SID="<HANA SID>" InstanceNumber="<instance number>" PREFER_SITE_TAKEOVER="true" DUPLICATE_PRIMARY_TIMEOUT="7200" AUTOMATED_REGISTER="false"

sudo crm configure ms msl_SAPHana_<HANA SID>_HDB<instance number> rsc_SAPHana_<HANA SID>_HDB<instance number> meta notify="true" clone-max="2" clone-node-max="1" target-role="Started" interleave="true"

sudo crm resource meta msl_SAPHana_<HANA SID>_HDB<instance number> set priority 100

Cutover steps: These encompass pre-steps, execution steps, post-validation steps, and the rollback plan. First come the pre-steps: the preparations and checks that must be completed before the main execution, ensuring everything is in order and ready for the next phase. Next, we move on to the execution steps.
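The placeholder-heavy resource commands above are easy to mistype (the underscores in rsc_SAPHana_<SID>_HDB<nr> in particular). A minimal sketch that fills the placeholders from variables and prints the resulting commands for review before they are run; the SID HN1 and instance number 03 are hypothetical example values:

```shell
#!/usr/bin/env bash
# Fill the placeholders once, in variables, so every resource name stays
# consistent. SID/INR are hypothetical example values.
SID="HN1"
INR="03"
RSC="rsc_SAPHana_${SID}_HDB${INR}"
MSL="msl_SAPHana_${SID}_HDB${INR}"

# Print the commands for review; paste them into crm once verified.
cat <<EOF
sudo crm configure primitive ${RSC} ocf:suse:SAPHana \\
  op start interval="0" timeout="3600" \\
  op stop interval="0" timeout="3600" \\
  op promote interval="0" timeout="3600" \\
  op monitor interval="60" role="Master" timeout="700" \\
  op monitor interval="61" role="Slave" timeout="700" \\
  params SID="${SID}" InstanceNumber="${INR}" PREFER_SITE_TAKEOVER="true" \\
    DUPLICATE_PRIMARY_TIMEOUT="7200" AUTOMATED_REGISTER="false"
sudo crm configure ms ${MSL} ${RSC} \\
  meta notify="true" clone-max="2" clone-node-max="1" \\
  target-role="Started" interleave="true"
sudo crm resource meta ${MSL} set priority 100
EOF
```

Reviewing generated commands first is a cheap guard against exactly the naming inconsistencies this article corrects.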
These are the core actions that must be carried out to complete the change accurately and efficiently; follow them meticulously to avoid issues. Post-validation steps come after the execution and verify that everything works as expected.

Pre-Steps:
1. Check cluster status: crm status, crm configure show, SAPHanaSR-showAttr
2. Ensure there are no pending operations or failed resources: crm_mon -1
3. Confirm replication is healthy: hdbnsutil -sr_state, SAPHanaSR-showAttr
4. Back up the current configuration: crm configure show > /root/cluster_config_backup.txt

Execution Steps:
1. Enable maintenance mode: sudo crm configure property maintenance-mode=true
2. Delete the incorrect clone resource: crm configure delete msl_SAPHana_<SID>_HDB<instance>
3. Recreate it as a multi-state (ms) resource: sudo crm configure ms msl_SAPHana_<SID>_HDB<instance> rsc_SAPHana_<SID>_HDB<instance> meta notify="true" clone-max="2" clone-node-max="1" target-role="Started" interleave="true" maintenance="true"
4. Set the resource priority: sudo crm resource meta msl_SAPHana_<SID>_HDB<instance> set priority 100
5. Disable maintenance mode: crm configure property maintenance-mode=false
6. Refresh the resource and take it out of maintenance: sudo crm resource refresh msl_SAPHana_<SID>_HDB<instance>, wait 10 seconds, check that the HSR status matches in SAPHanaSR-showAttr, crm_mon -A -1 and hdbnsutil -sr_state, then run sudo crm resource maintenance msl_SAPHana_<SID>_HDB<instance> off

Post-Validation Steps: crm status, crm configure show, SAPHanaSR-showAttr

Rollback Plan:
1. Enable maintenance mode: crm configure property maintenance-mode=true, then sudo crm resource maintenance msl_SAPHana_<SID>_HDB<instance> on
2. Restore the configuration from backup: crm configure load update /root/cluster_config_backup.txt
3. Recreate the previous clone configuration if needed: crm configure clone msl_SAPHana_<SID>_HDB<instance> rsc_SAPHana_<SID>_HDB<instance> meta notify=true clone-max=2 clone-node-max=1 target-role=Started interleave=true promotable=true
4. Disable maintenance and
refresh resources: crm configure property maintenance-mode=false, sudo crm resource refresh msl_SAPHana_<SID>_HDB<instance>, wait 10 seconds, sudo crm resource maintenance msl_SAPHana_<SID>_HDB<instance> off

Perform the steps below during the actual execution (the responsible team is noted for each task):

Pre-Steps (Basis):
· Submit a CAB request for approval.
· Perform pre-checks. Check cluster status (SBD, Pacemaker and Corosync services, SBD messages, iSCSI, constraints):
crm status
crm configure show
SAPHanaSR-showAttr
· Ensure there are no pending operations or failed resources:
crm_mon -R1 -Af -1
· Confirm replication is healthy:
hdbnsutil -sr_state
· Back up the current (pre-change) configuration:
crm configure show > /hana/shared/SID/dbcluster_backup_prechange.txt
crm configure show | sed -n '/primitive rsc_SAPHana_SID_HD/,/^$/p'
crm configure show | sed -n '/clone msl_SAPHana_SID_HD/,/^$/p'

Execution:
· Get the go-ahead from the leadership team. (Basis)
· Step 0 – Put the cluster into maintenance mode: (Basis)
crm resource maintenance g_ip_SID_HD on
Back up the current configuration while the cluster, msl and g_ip resources are in maintenance:
crm configure show > /hana/shared/SID/dbcluster_backup_prehealth.txt
· Step 1 – (If not already done) clear the Node 1 health attributes and ensure the topology and azure-events resources are running on both nodes; this avoids scheduler surprises when the resources are re-managed. (Basis/SOPS)
Execute on m1vms* (ideally it can be executed on any node):
crm_attribute -N vm** -n '#health-azure' -v 0
crm_attribute --node vm** --delete --name "azure-events-az_curNodeState"
crm_attribute --node vm** --delete --name "azure-events-az_pendingEventIDs"
crm resource cleanup health-azure-events-cln
crm resource cleanup cln_SAPHanaTopology_SID_HD
· Back up the current configuration once the health correction is complete and only the msl correction remains: (Basis)
crm configure show > /hana/shared/SID/dbcluster_backup_premsl.txt
· Step 2 – Convert the wrapper inside a single atomic transaction. Delete the promotable clone wrapper only (not the primitive), then create the ms wrapper with the same name, msl_SAPHana_SID_HD, so existing colocation/order constraints that reference that name keep working. (Basis)
# Remove the promotable clone wrapper (keeps the rsc_SAPHana_SID_HD primitive intact)
crm configure delete msl_SAPHana_SID_HD
# Recreate as multi-state (ms) for the classic agents
sudo crm configure ms msl_SAPHana_SID_HD rsc_SAPHana_SID_HD meta notify="true" clone-max="2" clone-node-max="1" target-role="Started" interleave="true" maintenance="true"
sudo crm resource meta msl_SAPHana_SID_HD set priority 100
· Step 3 – Re-enable cluster management of the IP and HANA resources, after pre-checks by the MSFT, SUSE and Basis teams:
crm configure property maintenance-mode=false
crm resource refresh msl_SAPHana_SID_HD
wait 10 seconds
crm resource maintenance msl_SAPHana_SID_HD off
crm resource maintenance g_ip_SID_HD off

Validation (Basis):
crm_mon -R1 -Af -1
crm status
crm configure show
SAPHanaSR-showAttr

Rollback Plan (Basis):
· Enable maintenance mode:
crm configure property maintenance-mode=true
crm resource maintenance msl_SAPHana_SID_HD on
crm resource maintenance g_ip_SID_HD on
· Restore the configuration from backup. Decide which state to revert to and use the corresponding backup file:
crm configure load update /hana/shared/SID/dbcluster_backup_prechange/prehealth/premsl.txt
· Recreate the previous clone configuration if needed:
crm configure clone msl_SAPHana_SID_HD rsc_SAPHana_SID_HD meta notify=true clone-max=2 clone-node-max=1 target-role=Started interleave=true promotable=true maintenance="true"
· Disable maintenance and refresh the resources:
crm configure property maintenance-mode=false
crm resource refresh msl_SAPHana_SID_HD
wait 10 seconds
crm resource maintenance
msl_SAPHana_SID_HD off
crm resource maintenance g_ip_SID_HD off

Important Points:
1. Are there known version-specific considerations when migrating from clone to ms? If you are using SAPHanaSR, ensure you use 'ms'. If you are working with SAPHanaSR-angi, use 'clone'. There are three different sets of HANA resource agents and SRHook scripts: two older ones and one newer one.
2. Does this change apply across the board on SUSE OS and/or Pacemaker versions? The packages for the older sets are SAPHanaSR, for scale-up HANA clusters, and SAPHanaSR-ScaleOut, for scale-out HANA clusters. The package for the new set is SAPHanaSR-angi, which covers both scale-up and scale-out clusters (angi stands for "advanced next generation interface"). When using the older SAPHanaSR or SAPHanaSR-ScaleOut resource agents and SRHook scripts, SUSE only supports the multi-state (ms) clone type for the SAPHana (scale-up) or SAPHanaController (scale-out) resource. The older resource agents and scripts are supported on all service packs of SLES for SAP 12 and 15. When using the newer SAPHanaSR-angi resource agents and scripts, SUSE only supports the regular clone type for the SAPHanaController resource (scale-up and scale-out) with the "promotable=true" meta-attribute set on the clone. The newer "angi" resource agents and scripts are supported on SLES for SAP 15 SP5 and higher, and on SLES for SAP 16 when it is released later this year. So with SLES for SAP 15 SP5 and higher you can use either the older or the newer resource agents and scripts. For all service packs of SLES for SAP 12, and for service packs of SLES for SAP 15 prior to SP5, you must use the older resource agents and scripts. Starting with SLES for SAP 16, you must use the new angi resource agents and scripts. Installing the new SAPHanaSR-angi package automatically uninstalls the older SAPHanaSR or SAPHanaSR-ScaleOut packages if they are already installed.
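The package-to-wrapper rule above can be encoded so runbooks do not have to rely on memory. A small sketch; the mapping follows the support statement in the text, and wrapper_type is an illustrative helper, not a SUSE tool:

```shell
# Map the installed SAPHanaSR package family to the supported wrapper
# type, per the SUSE support statement above. Illustrative helper only.
wrapper_type() {
  case "$1" in
    SAPHanaSR|SAPHanaSR-ScaleOut) echo "ms" ;;              # classic agents
    SAPHanaSR-angi) echo "clone (meta promotable=true)" ;;  # angi agents
    *) echo "unknown package: $1" >&2; return 1 ;;
  esac
}

# On a cluster node you might detect the package first, e.g.:
#   pkg=$(rpm -qa 'SAPHanaSR*' --queryformat '%{NAME}\n' | head -n1)
wrapper_type "SAPHanaSR"
```

Wiring a check like this into the pre-steps catches a clone/ms mismatch before the cutover window opens.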
SUSE has published a blog on how to migrate from the older resource agents and scripts to the newer ones; see the SUSE link in the references below.

Conclusion: Let us set up system replication and ensure it stays active. This is crucial to avoid business disruptions during critical operational hours. By taking these steps, we can seamlessly enhance the cluster architecture and resilience of our systems. Implementing these replication strategies will not only bolster business continuity measures but also significantly improve overall resilience, so operations run more smoothly and efficiently and can handle future demands with ease.

Reference links:
High availability for SAP HANA on Azure VMs on SLES | Microsoft Learn
https://www.suse.com/c/how-to-upgrade-to-saphanasr-angi/

Gen1 to Gen2 Azure VM Upgrade in Rolling Fashion
Introduction: Azure offers Trusted Launch, a seamless solution designed to significantly enhance the security of Generation 2 virtual machines (VMs) and provide robust protection against advanced and persistent attack techniques. Trusted Launch is composed of several coordinated infrastructure technologies, each of which can be enabled independently. Together they create multiple layers of defense, ensuring virtual machines remain secure against sophisticated threats and letting us improve our security posture with confidence. Upgrading Azure VMs from Generation 1 (Gen1) to Generation 2 (Gen2) involves several steps to ensure a smooth transition without data loss or disruption.

Rolling-fashion upgrade process: First and foremost, it is crucial to have a complete backup of the virtual machines before starting the upgrade. This protects valuable data against any unforeseen issues during the process and ensures the data is safe and secure. It is equally important to perform any new process or implementation in pre-production systems first, so potential issues can be identified and resolved before moving to the production environment, maintaining the integrity and stability of the systems. Please run the pre-validation steps before you bring down the VM.

SSH into the VM: connect to the Gen1 Linux VM.
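A sketch of how such a pre-validation gate can be scripted: the hard requirement for Gen2/UEFI boot is a GPT partition table, so the pass/fail logic is factored into a function. check_pttype is an illustrative name; on a real VM it would be fed the blkid output:

```shell
# Gen2 requires UEFI boot from a GPT-partitioned disk; fail fast if the
# partition-table type is anything else. Illustrative sketch.
check_pttype() {
  if [ "$1" = "gpt" ]; then
    echo "OK: GPT layout"
  else
    echo "FAIL: PTTYPE=$1 (a GPT layout is required for Gen2)"
    return 1
  fi
}

# Real usage on the VM (requires sudo):
#   bootDevice="/dev/$(lsblk -no pkname "$(df /boot | awk 'NR==2 {print $1}')")"
#   check_pttype "$(sudo blkid "$bootDevice" -o value -s PTTYPE)"
check_pttype "gpt"
```

A non-zero exit here should stop the runbook before the VM is ever deallocated.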
Identify the boot device (with sudo):
bootDevice=$(echo "/dev/$(lsblk -no pkname $(df /boot | awk 'NR==2 {print $1}'))")
Check the partition type (must return 'gpt'):
sudo blkid $bootDevice -o value -s PTTYPE
Validate the EFI system partition (e.g., /dev/sda2 or /dev/sda3):
sudo fdisk -l $bootDevice | grep EFI | awk '{print $1}'
Check the EFI mountpoint (/boot/efi must be in /etc/fstab):
sudo grep -qs '/boot/efi' /etc/fstab && echo '/boot/efi present in /etc/fstab' || echo '/boot/efi missing from /etc/fstab'

Once the complete backup is in place and the pre-validation steps are completed, the SAP Basis team proceeds with stopping the application. As part of the planned procedure, once the application has been taken down, the Unix team shuts down the operating system on the ERS servers. The Azure team then follows the steps below to perform the generation upgrade on the selected, approved servers.

Start the VM: Start-AzVM -ResourceGroupName myResourceGroup -Name myVm (or start it from the Azure Portal). Log in to the Azure Portal to check that the VM generation has successfully changed to V2. Then:
· Unix team to validate the OS on the approved servers.
· SAP Basis team to generate a new license key based on the new hardware, apply it and start the application.
· Unix team to perform a failover of the ASCS cluster.
· SAP Basis team to stop the application server.
· Unix team to shut down the OS on ERS for the selected VMs and validate the OS.
· SAP Basis team to apply the new hardware key and start the application.
· Unix team to perform a failover of the ASCS cluster.
· Azure team to work on capacity analysis to find the path forward for hosting Mv2 VMs in the same PPG.
· Once successfully completed, test a rollback on at least one app server for rollback planning.

Here are other methods to achieve this:

Method 1: Using Trusted Launch direct upgrade
Prerequisites check: ensure your subscription is onboarded to the preview feature Gen1ToTLMigrationPreview under the Microsoft.
Compute namespace. The VM should be configured with a Trusted Launch-supported size family and OS version, and a successful backup should be in place.
Update the guest OS volume: update the guest OS volume to a GPT disk layout with an EFI system partition. Use the PowerShell-based orchestration script for MBR2GPT validation and conversion.
Enable Trusted Launch: deallocate the VM using Stop-AzVM, then enable Trusted Launch by setting -SecurityType to TrustedLaunch with Update-AzVM:
Stop-AzVM -ResourceGroupName myResourceGroup -Name myVm
Update-AzVM -ResourceGroupName myResourceGroup -VMName myVm -SecurityType TrustedLaunch -EnableSecureBoot $true -EnableVtpm $true
Validate and start the VM: validate the security profile in the updated VM configuration, then start the VM and verify that you can sign in using RDP or SSH.

Method 2: Using Azure Backup
Verify backup data: ensure you have valid and up-to-date backups of your Gen1 VMs, including both OS disks and data disks, and verify that the backups completed successfully and can be restored.
Create Gen2 VMs: create new Gen2 VMs with the desired specifications and configuration. There is no need to start them initially; just have them created and ready for when they are needed.
Restore VM backups: in the Azure Portal, go to the Azure Backup service. Select "Recovery Services vaults" from the left-hand menu, then select the existing backup vault that contains the backups of the Gen1 VMs. Inside the Recovery Services vault, go to the "Backup Items" section, select the VM you want to restore and initiate a restore operation. During the restore process, choose the target resource group and the target VM (the newly created Gen2 VM).
Restore the OS disk: choose to restore the OS disk of the Gen1 VM to the newly created Gen2 VM. Azure Backup restores the OS disk to the new VM, effectively migrating it to Generation 2.
Restore the data disks: once the OS disk is restored and the Gen2 VM is operational, proceed to restore the data disks. Repeat the restore process for each data disk, attaching them to the Gen2 VM as needed.
Verify and test: verify that the Gen2 VM is functioning correctly and all data is intact. Test thoroughly to ensure all applications and services are running as expected.
Decommission Gen1 VMs (optional): once the migration is successful and you have verified that the Gen2 VMs are working correctly, decommission the original Gen1 VMs.

Important notes: before proceeding with any production migration, thoroughly test this process in a non-production environment to ensure its success and identify any potential issues. Make sure you have a backup of critical data and configurations before attempting any migration. While this approach focuses on using Azure Backup to restore the VMs, other migration strategies are available that may better suit your specific scenario; evaluate them against your requirements and constraints. Remember, migrating VMs between generations changes the underlying virtual hardware, so thorough testing and planning are essential to ensure a smooth transition without data loss or disruption.

Why is a Generation 2 upgrade without Trusted Launch not supported? Trusted Launch provides foundational compute security for VMs at no additional cost, so we can enhance our security posture without incurring extra expense. Moreover, Trusted Launch VMs are largely on par with Generation 2 VMs in terms of features and performance, which means that upgrading to Generation 2 without enabling Trusted Launch provides no added benefit.
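Method 1 above uses Az PowerShell; for teams that script with the Azure CLI, a hedged equivalent is sketched below. It assumes the --security-type parameter of az vm update, prints the commands by default rather than running them, and the resource group and VM names are hypothetical:

```shell
# Dry-run wrapper: with DRY_RUN=1 (default) commands are only printed
# for review; set DRY_RUN=0 on a prepared subscription to execute.
DRY_RUN="${DRY_RUN:-1}"
RG="myResourceGroup"   # hypothetical resource group
VM="myVm"              # hypothetical VM name

az_run() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

az_run az vm deallocate -g "$RG" -n "$VM"
az_run az vm update -g "$RG" -n "$VM" --security-type TrustedLaunch \
  --enable-secure-boot true --enable-vtpm true
az_run az vm start -g "$RG" -n "$VM"
```

Keeping the sequence in a script (rather than typing it ad hoc) makes the deallocate/update/start ordering, which the upgrade requires, hard to get wrong.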
Unsupported Gen1 VM configurations: the Gen1 to Trusted Launch VM upgrade is NOT supported if the Gen1 VM is configured with any of the following:
Operating system: Windows Server 2016, Azure Linux, Debian, or any other operating system not listed among the Trusted Launch supported operating system (OS) versions. Ref link: Trusted Launch for Azure VMs - Azure Virtual Machines | Microsoft Learn
VM size: a Gen1 VM configured with a VM size not listed among the Trusted Launch supported size families. Ref link: Trusted Launch for Azure VMs - Azure Virtual Machines | Microsoft Learn
Azure Backup: a Gen1 VM configured with Azure Backup using the Standard policy. As a workaround, migrate the Gen1 VM backups from the Standard to the Enhanced policy. Ref link: Move VM backup - standard to enhanced policy in Azure Backup - Azure Backup | Microsoft Learn

Conclusion: We will enhance Azure virtual machines by transitioning from Gen1 to Gen2. By implementing these approaches, we can seamlessly unlock improved security and performance. This transition will not only bolster our security measures but also significantly enhance overall performance, ensuring operations run more smoothly and efficiently. Let us make this upgrade to ensure the virtual machines are more robust and capable of handling future demands.

Ref links:
Upgrade Gen1 VMs to Trusted launch - Azure Virtual Machines | Microsoft Learn
GitHub - Azure/Gen1-Trustedlaunch: aka.ms/Gen1ToTLUpgrade
Enable Trusted launch on existing Gen2 VMs - Azure Virtual Machines | Microsoft Learn

Deep dive into Pacemaker cluster for Azure SAP systems optimization
Introduction: Pacemaker is the cluster resource manager that provides high availability for SAP systems on Azure. Keeping Pacemaker clusters correctly configured and monitored calls for a centralized approach that streamlines monitoring and maintenance: with automated alerts and comprehensive management tooling, organizations gain efficiency and the peace of mind of knowing their clusters are managed optimally, and it becomes easier to keep track of cluster state and respond promptly to any issues that arise.

Current customer challenges:
Configuration: common misconfigurations occur when customers do not follow up-to-date HA setup guidance from learn.microsoft.com, leading to failover issues.
Testing: manual testing causes untested failover scenarios and configuration drift, and limited expertise in HA tools complicates troubleshooting.

Key use cases for SAP HA testing automation: several areas of testing and validation should be automated to maintain the highest standards. First, validation on new OS versions should be automated. This ensures the Pacemaker cluster configuration remains up to date and functions smoothly with the latest OS releases, so compatibility issues can be addressed promptly. Next, loop tests should run on a regular cadence. These tests catch regressions early and keep customer systems robust and reliable over time; continuous monitoring is essential for maintaining optimal performance. Furthermore, high availability (HA) configurations must be validated against the documented SAP on Azure best practices. This ensures effective failover and quick recovery, minimizes downtime and maximizes system uptime.
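The loop-test idea above can be sketched as a small script. Everything here is illustrative (the node and resource names are hypothetical, and the crm calls print by default instead of executing), but it shows the shape of a cadence test: trigger a takeover, wait, clear the constraint, verify:

```shell
# Illustrative loop-test skeleton. DRY_RUN=1 (default) prints commands;
# run with DRY_RUN=0 only on a non-production test cluster.
DRY_RUN="${DRY_RUN:-1}"
MSL="msl_SAPHana_HN1_HDB03"   # hypothetical HANA multi-state resource

step() { if [ "$DRY_RUN" = "1" ]; then echo "$*"; else "$@"; fi; }

loop_test() {
  step SAPHanaSR-showAttr              # baseline replication state
  step crm resource move "$MSL" "$1"   # trigger takeover to target node
  step sleep 120                       # allow the promotion to settle
  step crm resource clear "$MSL"       # drop the move constraint again
  step crm_mon -1                      # confirm the cluster is clean
}

loop_test "hana-vm2"                   # hypothetical secondary node
```

Depending on the crmsh version, the equivalent of clear may be unmigrate; scheduling this via cron or a pipeline gives the regular cadence the text calls for.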
Adhering to these best practices will significantly enhance HA capabilities.

SAP Testing Automation Framework (Public Preview): The recommended approach for validating Pacemaker configurations in SAP HANA clusters is the SAP Deployment Automation Framework (SDAF) and its High Availability Testing Framework. This framework includes a comprehensive set of automated test cases designed to validate cluster behavior under various scenarios such as primary node crashes, manual resource migrations, and service failures. It also rigorously checks OS versions, Azure roles for fencing, SAP parameters, and Pacemaker/Corosync configurations to ensure everything is set up correctly. Low-level administrative commands are used to validate the captured values against best practices, with particular focus on constraints and meta-attributes. This thorough validation ensures that clusters are reliable, resilient, and aligned with industry standards.

SAP System High Availability on Azure: SAP HANA scale-up and SAP Central Services.

Support matrix (Linux distributions and supported releases):
· SUSE Linux Enterprise Server (SLES): 15 SP4, 15 SP5, 15 SP6
· Red Hat Enterprise Linux (RHEL): 8.8, 8.10, 9.2, 9.4

High availability configuration patterns (component and type, cluster type, storage):
· SAP Central Services (ENSA1 or ENSA2), Azure fencing agent, Azure Files or ANF
· SAP Central Services (ENSA1 or ENSA2), iSCSI (SBD device), Azure Files or ANF
· SAP HANA scale-up, Azure fencing agent, Azure Managed Disk or ANF
· SAP HANA scale-up, iSCSI (SBD device), Azure Managed Disk or ANF

High availability test scenarios (by test type):
· Configuration checks: database tier (HANA): HA resource parameter validation, Azure Load Balancer configuration; central services: HA resource parameter validation, SAPControl, Azure Load Balancer configuration
· Failover tests: database tier: HANA resource migration, primary node crash; central services: ASCS resource migration, ASCS node crash
· Process and services: database tier: index server crash, node kill, kill SBD service; central services: message server, enqueue server, enqueue replication server, SAPStartSRV process
· Network tests: block network (both tiers)
· Infrastructure: database tier: virtual machine crash, freeze file system (storage); central services: manual restart, HA failover to node

Reference links:
SLES: Set up Pacemaker on SUSE Linux Enterprise Server (SLES) in Azure | Microsoft Learn; Troubleshoot startup issues in a SUSE Pacemaker cluster - Virtual Machines | Microsoft Learn
RHEL: Set up Pacemaker on RHEL in Azure | Microsoft Learn; Troubleshoot Azure fence agent issues in an RHEL Pacemaker cluster - Virtual Machines | Microsoft Learn
STAF: GitHub - Azure/sap-automation-qa: the repository supporting quality assurance for SAP systems running on Azure.

Conclusion: This tool is designed to significantly streamline and enhance the high availability deployment of SAP systems on Azure by reducing potential misconfigurations and minimizing manual effort. Please note that because the framework performs multiple failovers sequentially to validate cluster behavior, it is not recommended to run it directly on production systems. It is intended for new high availability deployments that are not yet live, or for non-business-critical systems.