Deploy Always On Availability Groups (AG) on SQL Server RHEL based Azure VMs - The Ansible Way!
Published Sep 02 2022 08:43 AM

One comment I've heard repeatedly at conferences and in customer interactions is: "Setting up Always On availability groups (AGs) on Pacemaker takes a long time and involves many steps and commands, so it can be error prone. Can you folks make it easier and automate the AG setup for SQL Server on Linux?"

 

Well, we collaborated with the RHEL engineering team to create a SQL Server role capable of creating the Pacemaker cluster, configuring the SQL Server Always On availability group (AG), adding databases to the AG, and creating the Pacemaker resources with fencing agents. To top it all off, once the necessary infrastructure was in place, I was able to deploy the SQL Server Always On availability group, the Pacemaker cluster, and the AG-specific resources all in under 8 minutes! Yes, eight minutes :). If you want to give it a shot, keep reading!

 

To learn more about the Ansible-based SQL Server roles, please refer to this blog post from the Red Hat team: Preview the new system role support for Microsoft SQL Server Always On availability groups (redhat.c...

 

RHEL 8.5 added support for the Microsoft SQL Server system role, which automates the installation, configuration, and tuning of Microsoft SQL Server on RHEL. Please see the RHEL documentation for more information.

 

For today’s setup, I am going to use the SQL Server role published on Ansible Galaxy.

 

This role is currently in preview and allows you to configure and deploy Always On availability groups (AG). It is provided as a sample and for reference only, and is not supported by Microsoft or Red Hat. However, we welcome your feedback and suggestions for improving the system role here: Issues · linux-system-roles/mssql (github.com).

 

Setting up the infrastructure

Okay, let's get started. Here are the steps to create the infrastructure required to run the Ansible role that configures the AG for SQL Server on RHEL-based VMs.

  1. Ensure that you have an Azure subscription.
  2. For this demo, I'm using three RHEL-based Azure VMs: one primary and two secondary replicas. I'll use the SQL Server on RHEL HA Azure Marketplace image to create the VMs, which comes with SQL Server pre-installed. Here are the commands I used to create the three VMs.
    # To list all the SQL Server on RHEL image offers, you can run the following command
    az vm image list --all --offer "sql2019-rhel8"
    
    # Run the following commands to create the VMs, using the SQL Server PAYG
    # (pay-as-you-go) images pre-configured on the RHEL 8 base OS image, thus avoiding
    # the steps to install and configure SQL Server. We are using Standard_B4ms as an example.
    # Note: the admin password is in single quotes so that the shell does not expand the $$ sequence.
    
    az vm create --resource-group testgroup --name sqlrhel1 --size "Standard_B4ms" --location "west india" --image "MicrosoftSQLServer:sql2019-rhel8:sqldev:15.0.220712" --vnet-name amvindomain-vnet --subnet default --admin-username "adminuser" --admin-password 'StrongPa$$w0rd' --authentication-type all --generate-ssh-keys
    
    az vm create --resource-group testgroup --name sqlrhel2 --size "Standard_B4ms" --location "west india" --image "MicrosoftSQLServer:sql2019-rhel8:sqldev:15.0.220712" --vnet-name amvindomain-vnet --subnet default --admin-username "adminuser" --admin-password 'StrongPa$$w0rd' --authentication-type all --generate-ssh-keys
    
    az vm create --resource-group testgroup --name sqlrhel3 --size "Standard_B4ms" --location "west india" --image "MicrosoftSQLServer:sql2019-rhel8:sqldev:15.0.220712" --vnet-name amvindomain-vnet --subnet default --admin-username "adminuser" --admin-password 'StrongPa$$w0rd' --authentication-type all --generate-ssh-keys
  3. Register a new application in Azure Active Directory, create a custom role for the fence agent, and assign the custom role to the service principal. As you might have guessed, this is required to configure the STONITH devices. If you already have an application and custom role in your subscription, you can reuse them.
  4. Lastly, create and configure the Azure load balancer, which is required to set up the listener service. A hedged CLI sketch covering steps 3 and 4 appears right after this list.
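If you prefer to script steps 3 and 4 as well, the following Azure CLI sketch shows one way to do it. Treat it as a minimal, hedged example: the names (ag-fence-app, fence-agent-role.json, sqlrhel-lb, ag-frontend, ag-backend, ag-probe) are assumptions for this walkthrough, and the frontend IP 172.22.0.22 and probe port 59999 simply match the values used later in the playbook. Substitute your own values and compare against the official fencing and load balancer documentation before relying on it.

# -- Step 3: app registration + custom role for the fence agent (names are illustrative) --

# Register an application and create a client secret (record the appId and the secret value)
az ad app create --display-name "ag-fence-app"
az ad sp create --id <application id>
az ad app credential reset --id <application id>

# Custom role definition for the fence agent; save it as fence-agent-role.json first:
# {
#   "Name": "Linux Fence Agent Role",
#   "Description": "Allows to power-off and start virtual machines",
#   "Actions": ["Microsoft.Compute/*/read",
#               "Microsoft.Compute/virtualMachines/powerOff/action",
#               "Microsoft.Compute/virtualMachines/start/action"],
#   "AssignableScopes": ["/subscriptions/<subscriptionID>"]
# }
az role definition create --role-definition fence-agent-role.json

# Assign the custom role to the service principal, scoped to the resource group
az role assignment create --assignee <application id> \
  --role "Linux Fence Agent Role" \
  --scope "/subscriptions/<subscriptionID>/resourceGroups/testgroup"

# -- Step 4: internal Standard load balancer with a health probe and a floating-IP rule --
az network lb create --resource-group testgroup --name sqlrhel-lb --sku Standard \
  --vnet-name amvindomain-vnet --subnet default \
  --frontend-ip-name ag-frontend --private-ip-address 172.22.0.22 \
  --backend-pool-name ag-backend

az network lb probe create --resource-group testgroup --lb-name sqlrhel-lb \
  --name ag-probe --protocol tcp --port 59999

az network lb rule create --resource-group testgroup --lb-name sqlrhel-lb \
  --name ag-listener-rule --protocol tcp --frontend-port 1433 --backend-port 1433 \
  --frontend-ip-name ag-frontend --backend-pool-name ag-backend \
  --probe-name ag-probe --floating-ip true --idle-timeout 4

Remember to add the three VMs' network interfaces to the load balancer's backend pool (ag-backend above) so that the health probe can reach them.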

Preparing the VMs and Scripts

Before I get into the specifics of the Ansible role, here's a quick rundown of the steps:

  • On the controller node, install ansible-core and the microsoft.sql collection from Ansible Galaxy.
  • After that, configure passwordless SSH between the controller node and all target nodes.
  • Create the inventory and playbook files required for the deployment.

 Setting up the Controller node

  1. Once the VMs have been created, update them to the most recent packages. The commands I used are included below; alternatively, you can log in to each VM and run the same commands manually:
    # My controller node is sqlrhel1, which is why I am installing the ansible-core
    # package on it in addition to updating RHEL to the latest version. Note that
    # my controller node is also a target node.
    az vm run-command invoke -g testgroup -n sqlrhel1 --command-id RunShellScript --scripts "sudo yum update -y && sudo yum install ansible-core -y"
    
    # On my other target nodes, sqlrhel2 and sqlrhel3, I am just updating RHEL to the
    # latest version
    az vm run-command invoke -g testgroup -n sqlrhel2 --command-id RunShellScript --scripts "sudo yum update -y"
    az vm run-command invoke -g testgroup -n sqlrhel3 --command-id RunShellScript --scripts "sudo yum update -y"
  2. As you can see, I chose the sqlrhel1 VM as the controller node and installed the ansible-core package on it. I will also install the microsoft.sql collection on this node using the following command (an optional note on a related collection follows this list):
    [root@sqlrhel1 sql]# ansible-galaxy collection install microsoft.sql
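The example playbook later in this post also references the fedora.linux_system_roles.firewall role. In my understanding that collection is pulled in as a dependency of microsoft.sql, but if it turns out to be missing on your controller node you can check for it and install it explicitly; this is an optional, hedged step, not part of the original walkthrough:

# Only needed if the fedora.linux_system_roles collection was not installed as a
# dependency of microsoft.sql
[root@sqlrhel1 sql]# ansible-galaxy collection list | grep -E 'microsoft.sql|fedora.linux_system_roles'
[root@sqlrhel1 sql]# ansible-galaxy collection install fedora.linux_system_roles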

Enabling passwordless SSH access between the controller node and target nodes

To configure passwordless SSH between the controller and all target nodes, see the section "Setup passwordless ssh access between the Control node and the managed nodes" in the blog post Deploy SQL Server – The Ansible way! - Microsoft Tech Community. For more information on troubleshooting SSH access errors, please see this Red Hat article: https://access.redhat.com/solutions/9194. In short, the flow looks like the sketch below.
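For reference, here is a minimal sketch of what that setup typically looks like when run on the controller node (sqlrhel1); the hostnames match this walkthrough, and the exact steps in the linked blog post take precedence:

# Generate an SSH key pair on the controller node (accept the defaults)
[root@sqlrhel1 ~]# ssh-keygen -t rsa -b 4096

# Copy the public key to every target node, including the controller itself,
# since sqlrhel1 is also a managed node in this setup
[root@sqlrhel1 ~]# for host in sqlrhel1 sqlrhel2 sqlrhel3; do ssh-copy-id root@$host; done

# Verify that the logins no longer prompt for a password
[root@sqlrhel1 ~]# ssh root@sqlrhel2 hostname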

 

Creating the inventory & playbook

 
Creating the inventory

In Ansible, the inventory file defines the target hosts that you want to configure with the Ansible playbook; in this case, my controller node (sqlrhel1) is also a target node. Here is an example of how to define the inventory file for the microsoft.sql role to set up the Pacemaker cluster and assign a role to each of the availability group replicas.

 

 

 

all:
  hosts:
    # host1 is defined by a short name
    # There is no need to specify ha_cluster names explicitly
    host1:
      mssql_ha_replica_type: primary
    # host2 is defined by FQDN
    # You must define ha_cluster names to be in the short name format
    host2.example.com:
      mssql_ha_replica_type: synchronous
      ha_cluster:
        node_name: host2
        pcs_address: host2
    # host3 is defined by an ip address
    # You must define ha_cluster names to be in the short name format
    # In the case where the default host's IP address differs from the IP
    # address that Pacemaker must use to set up cluster, you must define
    # ha_cluster corosync_addresses
    192.XXX.XXX.333:
      mssql_ha_replica_type: witness
      ha_cluster:
        node_name: host3
        pcs_address: host3
        corosync_addresses:
          - 10.XXX.XXX.333

 

 

 

In my case, because I'm using short names, I don't have to specify the ha_cluster names explicitly, so I use the following inventory. To create it, use your favorite text editor and copy the content below; I ran a command like ‘nano inventory’ and pasted the content into the file.

 

 

 

all:
  hosts:
    sqlrhel1:
      mssql_ha_replica_type: primary
    sqlrhel2:
      mssql_ha_replica_type: synchronous
    sqlrhel3:
      mssql_ha_replica_type: synchronous

 

 

 

If I had decided to define the inventory by IP address, I would have used an inventory as shown below:

 

 

 

all:
  hosts:
    172.22.0.14:
      ansible_user: root
      mssql_ha_replica_type: primary
      ha_cluster:
        node_name: sqlrhel1
        pcs_address: sqlrhel1
        corosync_addresses:
          - 172.22.0.14
    172.22.0.15:
      ansible_user: root
      mssql_ha_replica_type: synchronous
      ha_cluster:
        node_name: sqlrhel2
        pcs_address: sqlrhel2
        corosync_addresses:
          - 172.22.0.15
    172.22.0.16:
      ansible_user: root
      mssql_ha_replica_type: synchronous
      ha_cluster:
        node_name: sqlrhel3
        pcs_address: sqlrhel3
        corosync_addresses:
          - 172.22.0.16

 

 

 

Creating the playbook

Here is an example playbook that you can create. It performs the following tasks, not necessarily in the order listed:

  • Verifies that SQL Server and the tools are properly configured.
  • Creates the necessary endpoints and certificates, and copies the certificates across replicas, for Always On availability group (AG) endpoint authentication.
  • Creates the Pacemaker login in all SQL Server replicas.
  • Installs the fence-agents-azure-arm package on all nodes, which provides the Azure fencing agent used for STONITH.
  • Creates the cluster resources rsc_st_azure, ag_cluster (cloned as ag_cluster-clone), virtualip, and azure_load_balancer; the latter two are also added to the virtualip_group resource group.
  • Opens the firewall ports for the health check probe (59999 in this case, as defined in the Azure load balancer configuration), the AG endpoint (5022), and SQL Server (1433).
  • Finally, creates the listener service for the AG.

To create the playbook.yaml file, I ran the command ‘nano playbook.yaml’ and pasted in the content that follows the notes below:

  • Remember to replace <application id>, <ServicePrincipalPassword>, <resourcegroupname>, <tenantID>, and <subscriptionID> with the values from your Azure subscription, which you can look up in the Azure portal.
  • Ensure that you replace the probe port (59999, used below) and mssql_ha_virtual_ip (172.22.0.22, used below) with the probe port number and frontend IP that you configured in the Azure portal when creating the load balancer.
  • On the primary instance, I pre-created the 'test' user database that I intend to add to the AG. You could leave out the mssql_ha_db_names variable, as it is optional, or use the mssql_post_input_sql_file variable to supply a T-SQL script that creates the database; you can read more about it here: mssql/README.md at master · linux-system-roles/mssql (github.com)
  • To understand more about each of these variables, refer to the documentation here: mssql/README.md at master · linux-system-roles/mssql (github.com)
    - hosts: all
      vars:
        mssql_accept_microsoft_odbc_driver_17_for_sql_server_eula: true
        mssql_accept_microsoft_cli_utilities_for_sql_server_eula: true
        mssql_accept_microsoft_sql_server_standard_eula: true
        mssql_manage_firewall: true
        mssql_password: "SQLp@55w0rD1"
        mssql_edition: Developer
        mssql_ha_configure: true
        mssql_ha_listener_port: 5022
        mssql_ha_cert_name: ExampleCert
        mssql_ha_master_key_password: "p@55w0rD1"
        mssql_ha_private_key_password: "p@55w0rD2"
        mssql_ha_reset_cert: true
        mssql_ha_endpoint_name: ag_endpoint
        mssql_ha_ag_name: ag_test
        mssql_ha_db_names:
          - test
        mssql_ha_login: pacemakerLogin
        mssql_ha_login_password: "Pacemakerp@55w0rD1"
        # Set mssql_ha_virtual_ip to the frontend IP address configured in the Azure
        # load balancer
        mssql_ha_virtual_ip: 172.22.0.22
        mssql_ha_cluster_run_role: true
        ha_cluster_cluster_name: "{{ mssql_ha_ag_name }}"
        ha_cluster_hacluster_password: "p@55w0rD4"
        ha_cluster_extra_packages:
          - fence-agents-azure-arm
        ha_cluster_cluster_properties:
          - attrs:
              - name: cluster-recheck-interval
                value: 2min
              - name: start-failure-is-fatal
                value: true
              - name: stonith-enabled
                value: true
              - name: stonith-timeout
                value: 900
        ha_cluster_resource_primitives:
          - id: rsc_st_azure
            agent: stonith:fence_azure_arm
            instance_attrs:
              - attrs:
                  # Application (client) ID of your app registration in Azure
                  - name: login
                    value: <ApplicationID>
                  # Client secret value of the service principal
                  - name: passwd
                    value: <ServiceprincipalPassword>
                  # Resource group, tenant, and subscription that host the VMs
                  - name: resourceGroup
                    value: <resourcegroupname>
                  - name: tenantId
                    value: <tenantID>
                  - name: subscriptionId
                    value: <subscriptionID>
                  - name: power_timeout
                    value: 240
                  - name: pcmk_reboot_timeout
                    value: 900
          - id: azure_load_balancer
            agent: azure-lb
            instance_attrs:
              - attrs:
                # probe port configured in Azure
                - name: port
                  value: 59999
          - id: ag_cluster
            agent: ocf:mssql:ag
            instance_attrs:
              - attrs:
                - name: ag_name
                  value: "{{ mssql_ha_ag_name }}"
            meta_attrs:
              - attrs:
                - name: failure-timeout
                  value: 60s
          - id: virtualip
            agent: ocf:heartbeat:IPaddr2
            instance_attrs:
              - attrs:
                - name: ip
                  value: "{{ mssql_ha_virtual_ip }}"
            operations:
              - action: monitor
                attrs:
                  - name: interval
                    value: 30s
        ha_cluster_resource_groups:
          - id: virtualip_group
            resource_ids:
              - azure_load_balancer
              - virtualip
        ha_cluster_resource_clones:
          - resource_id: ag_cluster
            promotable: yes
            meta_attrs:
              - attrs:
                - name: notify
                  value: true
        ha_cluster_constraints_colocation:
          - resource_leader:
              id: ag_cluster-clone
              role: Promoted
            resource_follower:
              id: azure_load_balancer
            options:
              - name: score
                value: INFINITY
        ha_cluster_constraints_order:
          - resource_first:
              id: ag_cluster-clone
              action: promote
            resource_then:
              id: azure_load_balancer
              action: start
        # Variables to open the probe port configured in Azure in firewall
        firewall:
          - port: 59999/tcp
            state: enabled
            permanent: true
            runtime: true
      roles:
        - fedora.linux_system_roles.firewall
        - microsoft.sql.server
    When you list the contents of the directory “/root/.ansible/collections/ansible_collections/microsoft/sql”, you should see the inventory and playbook.yaml files that you created, as shown below:
    [root@sqlrhel1 sql]# ll
    total 48
    -rw-r--r-- 1 root root  6431 Aug 29 21:30 CHANGELOG.md
    -rw-r--r-- 1 root root 17218 Aug 29 21:30 FILES.json
    -rw-r--r-- 1 root root   175 Aug 29 21:31 inventory
    -rw-r--r-- 1 root root  1053 Aug 29 21:30 LICENSE-server
    -rw-r--r-- 1 root root   892 Aug 29 21:30 MANIFEST.json
    drwxr-xr-x 2 root root    25 Aug 29 21:30 meta
    -rw-r--r-- 1 root root  3839 Aug 29 21:34 playbook.yaml
    -rw-r--r-- 1 root root  1278 Aug 29 21:30 README.md
    drwxr-xr-x 3 root root    20 Aug 29 21:30 roles
    drwxr-xr-x 3 root root    20 Aug 29 21:30 tests
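Before launching the full deployment, it can help to confirm that the controller can reach every target node and that the playbook parses cleanly. This is an optional sanity check rather than part of the original walkthrough:

# Confirm Ansible can reach every host in the inventory
[root@sqlrhel1 sql]# ansible -i inventory all -m ping

# Confirm the playbook has no syntax errors before running it
[root@sqlrhel1 sql]# ansible-playbook -i inventory playbook.yaml --syntax-check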

Running the playbook and completing the setup

To run the playbook, I executed the following command on the controller node (sqlrhel1):

 

[root@sqlrhel1 sql]# ansible-playbook -i inventory playbook.yaml

 

This should start the deployment, and you should have the entire setup completed in about 8 minutes or less. When the setup is finished, you should see the sample output shown below.

 

 

amvin87_1-1662125355722.png

To view the cluster resources that have been created, use the command:

 

[root@sqlrhel1 sql]# sudo pcs status
Cluster name: ag_test1
Cluster Summary:
  * Stack: corosync
  * Current DC: sqlrhel2 (version 2.1.2-4.el8_6.2-ada5c3b36e2) - partition with quorum
  * Last updated: Mon Aug 29 21:36:10 2022
  * Last change:  Mon Aug 29 20:50:36 2022 by root via cibadmin on sqlrhel1
  * 3 nodes configured
  * 6 resource instances configured
 
Node List:
  * Online: [ sqlrhel1 sqlrhel2 sqlrhel3 ]
 
Full List of Resources:
  * rsc_st_azure        (stonith:fence_azure_arm):       Started sqlrhel1
  * Resource Group: virtualip_group:
    * azure_load_balancer       (ocf::heartbeat:azure-lb):       Started sqlrhel1
    * virtualip (ocf::heartbeat:IPaddr2):        Started sqlrhel1
  * Clone Set: ag_cluster-clone [ag_cluster] (promotable):
    * Masters: [ sqlrhel1 ]
    * Slaves: [ sqlrhel2 sqlrhel3 ]
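If you also want to double-check the firewall configuration that the role applied, listing the open ports on each node should show the probe, AG endpoint, and SQL Server ports (59999, 5022, and 1433). This is an optional verification step:

# List the ports opened by firewalld on this node; repeat on sqlrhel2 and sqlrhel3
[root@sqlrhel1 sql]# sudo firewall-cmd --list-ports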

 

You can view the AG and its dashboard in SQL Server Management Studio (SSMS), as shown below:

 

 

 

 

amvin87_2-1662125472023.png

amvin87_3-1662125480512.png

You can connect using the listener IP address, which in this case is 172.22.0.22 (mssql_ha_virtual_ip), and test failover as suggested in this article. You can also test fencing as per the steps documented here. A quick connectivity check is sketched below.
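As a quick sanity check, you can connect through the listener IP with sqlcmd and confirm the role of each replica. This is a minimal sketch; connecting as sa with the mssql_password value from the playbook is an assumption, so use whatever login you normally connect with:

# Connect through the load balancer frontend IP (the listener IP) and list the
# AG replicas together with their current role and synchronization health
/opt/mssql-tools/bin/sqlcmd -S 172.22.0.22 -U sa -P 'SQLp@55w0rD1' -Q "
SELECT ar.replica_server_name,
       ars.role_desc,
       ars.synchronization_health_desc
FROM sys.dm_hadr_availability_replica_states AS ars
JOIN sys.availability_replicas AS ar
  ON ars.replica_id = ar.replica_id;"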

 

Please let me know if you have questions on any section of this article and I’ll be more than happy to explain this further!

Happy Learning!

 
