One common comment I've heard at conferences and in customer interactions is: "Setting up Always On availability groups (AGs) on Pacemaker takes a long time and involves multiple steps and commands, so it can be error prone. Can you folks make it easier and automate the AG setup for SQL Server on Linux?"
Well, we went back and collaborated with the RHEL Engineering team to create a SQL Server role capable of creating the Pacemaker cluster, configuring the SQL Server Always On availability group (AG), adding databases to the AG, and creating the Pacemaker resources with fencing agents. To top it all off, once the necessary infrastructure was in place, I was able to deploy a SQL Server Always On availability group, with the Pacemaker cluster and AG-specific resources, all in under 8 minutes! Yes, eight minutes. :) If you want to give it a shot, keep reading!
To learn more about the Ansible-based SQL Server roles, refer to the latest blog from the Red Hat team: Preview the new system role support for Microsoft SQL Server Always On availability groups (redhat.com).
RHEL 8.5 added support for the Microsoft SQL Server system role, which automates the installation, configuration, and tuning of Microsoft SQL Server on RHEL. See the RHEL documentation for more information.
For today's setup, I am going to use the SQL Server role listed here: Ansible Galaxy.
This role is currently in preview and allows you to configure and deploy Always On availability groups (AGs). It is provided as a sample and for reference only; it is not supported by Microsoft or Red Hat. However, we welcome your feedback and suggestions for improving the system role here: Issues · linux-system-roles/mssql (github.com).
Setting up the infrastructure:
Okay, let's get started. Here are the steps to create the infrastructure required to run the Ansible role that configures the AG for SQL Server on RHEL-based VMs.
- Ensure that you have an Azure subscription.
- For this demo, I'm using three Azure RHEL-based VMs: one primary and two secondary replicas. I'll use the SQL Server on RHEL HA Azure Marketplace image to create the VMs; this image comes with SQL Server already installed. Here are the commands I used to create the three RHEL-based Azure virtual machines:
# To list all the SQL Server based RHEL offers, you can run the following command
az vm image list --all --offer "sql2019-rhel8"

# Run the following commands to create the VMs, using the SQL Server PAYG (pay-as-you-go)
# images pre-configured on the RHEL 8 base OS image, thus avoiding the steps to install and
# configure SQL Server. We are using Standard_B4ms as an example.
az vm create --resource-group testgroup --name sqlrhel1 --size "Standard_B4ms" --location "west india" --image "MicrosoftSQLServer:sql2019-rhel8:sqldev:15.0.220712" --vnet-name amvindomain-vnet --subnet default --admin-username "adminuser" --admin-password "StrongPa$$w0rd" --authentication-type all --generate-ssh-keys

az vm create --resource-group testgroup --name sqlrhel2 --size "Standard_B4ms" --location "west india" --image "MicrosoftSQLServer:sql2019-rhel8:sqldev:15.0.220712" --vnet-name amvindomain-vnet --subnet default --admin-username "adminuser" --admin-password "StrongPa$$w0rd" --authentication-type all --generate-ssh-keys

az vm create --resource-group testgroup --name sqlrhel3 --size "Standard_B4ms" --location "west india" --image "MicrosoftSQLServer:sql2019-rhel8:sqldev:15.0.220712" --vnet-name amvindomain-vnet --subnet default --admin-username "adminuser" --admin-password "StrongPa$$w0rd" --authentication-type all --generate-ssh-keys
- Register a new application in Azure Active Directory, create a custom role for the fence agent, and assign the custom role to the service principal. As you may have guessed, this is required to configure the STONITH devices. Note that if you already have an application and a custom role in your subscription, you can reuse them.
- Lastly, create and configure the Azure load balancer, which is required to set up the listener service. A rough az CLI sketch of this and the previous step follows this list.
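For reference, here is a rough az CLI sketch of those two steps. All of the names used here (sql-fence-agent-sp, fence-agent-role.json, Linux Fence Agent Role, sqlaglb, sqlag-frontend, sqlag-backend, sqlag-probe, sqlag-rule) are placeholders I made up for illustration; follow the Azure fencing and load balancer documentation for the authoritative steps, the exact role definition, and the exact parameters:

# Create the app registration/service principal for the fence agent. Note the
# appId, password, and tenant values in the output; they map to the login,
# passwd, and tenantId attributes of the rsc_st_azure resource in the playbook.
az ad sp create-for-rbac --name "sql-fence-agent-sp"

# Create the custom fencing role from a JSON definition and assign it to the
# service principal, scoped to the resource group that holds the VMs.
az role definition create --role-definition @fence-agent-role.json
az role assignment create --assignee "<ApplicationID>" --role "Linux Fence Agent Role" --scope "/subscriptions/<subscriptionID>/resourceGroups/testgroup"

# Create an internal Standard load balancer whose frontend IP will become the
# AG listener IP (mssql_ha_virtual_ip), plus the health probe on port 59999 and
# an HA-ports rule with floating IP enabled. Afterwards, add the NICs of all
# three VMs to the backend pool (not shown here).
az network lb create --resource-group testgroup --name sqlaglb --sku Standard --vnet-name amvindomain-vnet --subnet default --frontend-ip-name sqlag-frontend --backend-pool-name sqlag-backend --private-ip-address 172.22.0.22
az network lb probe create --resource-group testgroup --lb-name sqlaglb --name sqlag-probe --protocol tcp --port 59999
az network lb rule create --resource-group testgroup --lb-name sqlaglb --name sqlag-rule --protocol All --frontend-port 0 --backend-port 0 --frontend-ip-name sqlag-frontend --backend-pool-name sqlag-backend --probe-name sqlag-probe --floating-ip true --idle-timeout 30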
Preparing the VMs and Scripts
Before I get into the specifics of the Ansible role, here's a quick rundown of the steps:
- On the controller node, install ansible-core and the microsoft.sql collection from Ansible Galaxy.
- After that, configure passwordless SSH from the controller node to all target nodes.
- Create the inventory and playbook files required for the deployment.
Setting up the Controller node
- Once the VMs have been created, they must be updated to the most recent packages. The commands I used are included below; you can also log in to the VMs and run the same commands manually:
# My controller node is sqlrhel1, that is why on this node I am installing the
# ansible-core package and also updating RHEL to the latest version. Please note that
# my controller node is also a target node.
az vm run-command invoke -g testgroup -n sqlrhel1 --command-id RunShellScript --scripts "sudo yum update -y && sudo yum install ansible-core -y"

# On my other target nodes, sqlrhel2 and sqlrhel3, I am just updating RHEL to the latest
# version
az vm run-command invoke -g testgroup -n sqlrhel2 --command-id RunShellScript --scripts "sudo yum update -y"
az vm run-command invoke -g testgroup -n sqlrhel3 --command-id RunShellScript --scripts "sudo yum update -y"
- As you can see, I chose the sqlrhel1 VM as the controller node and installed the ansible-core package on it. I will also install the microsoft.sql collection on this node using the following command:
[root@sqlrhel1 sql]# ansible-galaxy collection install microsoft.sql
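If you want to double-check that the collection was installed and see where it landed, you can list it as shown below (the reported path may differ on your system):

# Confirm the microsoft.sql collection is installed and note its location
[root@sqlrhel1 ~]# ansible-galaxy collection list microsoft.sql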
Enabling passwordless SSH access between the controller node and the target nodes:
To configure passwordless SSH between the controller and all target nodes, see the section “Setup passwordless ssh access between the Control node and the managed nodes” in the blog post Deploy SQL Server – The Ansible way! - Microsoft Tech Community. For more information on troubleshooting SSH access errors, see this Red Hat article: https://access.redhat.com/solutions/9194.
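If you just want the gist of it, passwordless SSH boils down to generating a key pair on the controller and distributing the public key. A minimal sketch using this demo's host names (run as root on sqlrhel1) looks like this:

# Generate an SSH key pair on the controller node (accept the defaults)
ssh-keygen -t rsa

# Copy the public key to every target node; sqlrhel1 is included because the
# controller is also a managed node in this setup
ssh-copy-id root@sqlrhel1
ssh-copy-id root@sqlrhel2
ssh-copy-id root@sqlrhel3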
Creating the inventory & playbook
Creating the inventory
In Ansible, the inventory file defines the target hosts that you want to configure with the Ansible playbook; in this case, my controller node (sqlrhel1) is also a target node. Here is an example of how to define the inventory file for the microsoft.sql role to set up the Pacemaker cluster and assign the appropriate role to each availability group replica.
all:
  hosts:
    # host1 is defined by a short name
    # There is no need to specify ha_cluster names explicitly
    host1:
      mssql_ha_replica_type: primary
    # host2 is defined by FQDN
    # You must define ha_cluster names to be in the short name format
    host2.example.com:
      mssql_ha_replica_type: synchronous
      ha_cluster:
        node_name: host2
        pcs_address: host2
    # host3 is defined by an ip address
    # You must define ha_cluster names to be in the short name format
    # In the case where the default host's IP address differs from the IP
    # address that Pacemaker must use to set up cluster, you must define
    # ha_cluster corosync_addresses
    192.XXX.XXX.333:
      mssql_ha_replica_type: witness
      ha_cluster:
        node_name: host3
        pcs_address: host3
        corosync_addresses:
          - 10.XXX.XXX.333
In my case, because I'm using short names, I don't have to specify the ha_cluster names explicitly, so I use the following inventory. To create it, use your favorite text editor and copy in the content below; I ran a command like 'nano inventory' and pasted the content into the file.
all:
  hosts:
    sqlrhel1:
      mssql_ha_replica_type: primary
    sqlrhel2:
      mssql_ha_replica_type: synchronous
    sqlrhel3:
      mssql_ha_replica_type: synchronous
If I had decided to define the inventory by IP address, I would have used an inventory as shown below:
all:
  hosts:
    172.22.0.14:
      ansible_user: root
      mssql_ha_replica_type: primary
      ha_cluster:
        node_name: sqlrhel1
        pcs_address: sqlrhel1
        corosync_addresses:
          - 172.22.0.14
    172.22.0.15:
      ansible_user: root
      mssql_ha_replica_type: synchronous
      ha_cluster:
        node_name: sqlrhel2
        pcs_address: sqlrhel2
        corosync_addresses:
          - 172.22.0.15
    172.22.0.16:
      ansible_user: root
      mssql_ha_replica_type: synchronous
      ha_cluster:
        node_name: sqlrhel3
        pcs_address: sqlrhel3
        corosync_addresses:
          - 172.22.0.16
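Whichever form of the inventory you use, it's worth confirming that the controller can reach every target before running the role. A quick ad-hoc check from the controller node looks like this:

# Ping all hosts in the inventory over SSH using the built-in ping module
[root@sqlrhel1 sql]# ansible -i inventory all -m ping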
Creating the playbook
Here is an example playbook that you can create. It performs the following tasks, which are not necessarily executed in the order listed below:
- Verifies that SQL Server and its tools are properly configured.
- Creates the necessary endpoints and certificates, and copies the certificates across replicas, for Always On availability group (AG) endpoint authentication.
- Creates the Pacemaker login in all SQL Server replicas.
- Installs the fence-agents-azure-arm package on all nodes, which provides the STONITH fencing agent.
- Creates the cluster resources rsc_st_azure, ag_cluster-clone, virtualip, and azure_load_balancer; the latter two are also added to the virtualip_group resource group.
- Opens the firewall ports for the health check probe (59999 in this case, as defined in the Azure load balancer configuration), the AG endpoint (5022), and SQL Server (1433).
- Finally, creates the listener for the AG.
To create the playbook.yaml file, I ran the command 'nano playbook.yaml' and pasted in the content shown below. A few things to keep in mind:
- Remember to replace <ApplicationID>, <ServicePrincipalPassword>, <resourcegroupname>, <tenantID>, and <subscriptionID> with the values you derive from the Azure portal.
- Ensure that you replace the probe port (59999, used below) and mssql_ha_virtual_ip (172.22.0.22, used below) with the probe port and frontend IP that you configured in the Azure portal when creating the load balancer.
- On the primary instance, I pre-created the 'test' user database that I intend to add to the AG. You could leave out the mssql_ha_db_names variable, as it is optional, or use another variable, mssql_post_input_sql_file, to supply a T-SQL script that creates the database (see the sketch after this list); you can read more about it here: mssql/README.md at master · linux-system-roles/mssql (github.com).
- To understand more about each of these variables, refer to the documentation here: mssql/README.md at master · linux-system-roles/mssql (github.com).
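As a quick illustration of that alternative, the relevant variable could look like the fragment below; the file path and its contents are hypothetical, and the variable itself is described in the role's README:

# Instead of pre-creating the database on the primary, have the role run a
# post-setup T-SQL file. /root/create_test_db.sql is a hypothetical script
# containing, for example, a CREATE DATABASE [test] statement.
mssql_post_input_sql_file: /root/create_test_db.sql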
Here is the playbook.yaml content:

- hosts: all
  vars:
    mssql_accept_microsoft_odbc_driver_17_for_sql_server_eula: true
    mssql_accept_microsoft_cli_utilities_for_sql_server_eula: true
    mssql_accept_microsoft_sql_server_standard_eula: true
    mssql_manage_firewall: true
    mssql_password: "SQLp@55w0rD1"
    mssql_edition: Developer
    mssql_ha_configure: true
    mssql_ha_listener_port: 5022
    mssql_ha_cert_name: ExampleCert
    mssql_ha_master_key_password: "p@55w0rD1"
    mssql_ha_private_key_password: "p@55w0rD2"
    mssql_ha_reset_cert: true
    mssql_ha_endpoint_name: ag_endpoint
    mssql_ha_ag_name: ag_test
    mssql_ha_db_names:
      - test
    mssql_ha_login: pacemakerLogin
    mssql_ha_login_password: "Pacemakerp@55w0rD1"
    # Set mssql_ha_virtual_ip to the frontend IP address configured in the Azure load balancer
    mssql_ha_virtual_ip: 172.22.0.22
    mssql_ha_cluster_run_role: true
    ha_cluster_cluster_name: "{{ mssql_ha_ag_name }}"
    ha_cluster_hacluster_password: "p@55w0rD4"
    ha_cluster_extra_packages:
      - fence-agents-azure-arm
    ha_cluster_cluster_properties:
      - attrs:
          - name: cluster-recheck-interval
            value: 2min
          - name: start-failure-is-fatal
            value: true
          - name: stonith-enabled
            value: true
          - name: stonith-timeout
            value: 900
    ha_cluster_resource_primitives:
      - id: rsc_st_azure
        agent: stonith:fence_azure_arm
        instance_attrs:
          - attrs:
              - name: login
                value: <ApplicationID>            # for your application registration in Azure
              - name: passwd
                value: <ServicePrincipalPassword> # value from the client secret
              - name: resourceGroup
                value: <resourcegroupname>        # in Azure
              - name: tenantId
                value: <tenantID>                 # in Azure
              - name: subscriptionId
                value: <subscriptionID>           # in Azure
              - name: power_timeout
                value: 240
              - name: pcmk_reboot_timeout
                value: 900
      - id: azure_load_balancer
        agent: azure-lb
        instance_attrs:
          - attrs:
              # probe port configured in Azure
              - name: port
                value: 59999
      - id: ag_cluster
        agent: ocf:mssql:ag
        instance_attrs:
          - attrs:
              - name: ag_name
                value: "{{ mssql_ha_ag_name }}"
        meta_attrs:
          - attrs:
              - name: failure-timeout
                value: 60s
      - id: virtualip
        agent: ocf:heartbeat:IPaddr2
        instance_attrs:
          - attrs:
              - name: ip
                value: "{{ mssql_ha_virtual_ip }}"
        operations:
          - action: monitor
            attrs:
              - name: interval
                value: 30s
    ha_cluster_resource_groups:
      - id: virtualip_group
        resource_ids:
          - azure_load_balancer
          - virtualip
    ha_cluster_resource_clones:
      - resource_id: ag_cluster
        promotable: yes
        meta_attrs:
          - attrs:
              - name: notify
                value: true
    ha_cluster_constraints_colocation:
      - resource_leader:
          id: ag_cluster-clone
          role: Promoted
        resource_follower:
          id: azure_load_balancer
        options:
          - name: score
            value: INFINITY
    ha_cluster_constraints_order:
      - resource_first:
          id: ag_cluster-clone
          action: promote
        resource_then:
          id: azure_load_balancer
          action: start
    # Variables to open the probe port configured in Azure in the firewall
    firewall:
      - port: 59999/tcp
        state: enabled
        permanent: true
        runtime: true
  roles:
    - fedora.linux_system_roles.firewall
    - microsoft.sql.server

When you list the contents of the directory /root/.ansible/collections/ansible_collections/microsoft/sql, you should see the inventory and the playbook.yaml that were created, as shown below:
[root@sqlrhel1 sql]# ll
total 48
-rw-r--r-- 1 root root  6431 Aug 29 21:30 CHANGELOG.md
-rw-r--r-- 1 root root 17218 Aug 29 21:30 FILES.json
-rw-r--r-- 1 root root   175 Aug 29 21:31 inventory
-rw-r--r-- 1 root root  1053 Aug 29 21:30 LICENSE-server
-rw-r--r-- 1 root root   892 Aug 29 21:30 MANIFEST.json
drwxr-xr-x 2 root root    25 Aug 29 21:30 meta
-rw-r--r-- 1 root root  3839 Aug 29 21:34 playbook.yaml
-rw-r--r-- 1 root root  1278 Aug 29 21:30 README.md
drwxr-xr-x 3 root root    20 Aug 29 21:30 roles
drwxr-xr-x 3 root root    20 Aug 29 21:30 tests
Running the playbook and completing the setup
To run the playbook, I executed the following command on the controller node (sqlrhel1):
[root@sqlrhel1 sql]# ansible-playbook -i inventory playbook.yaml
This starts the deployment, and the entire setup should complete in about eight minutes or less.
To view the cluster resources that have been created, use the command:
[root@sqlrhel1 sql]# sudo pcs status
Cluster name: ag_test
Cluster Summary:
  * Stack: corosync
  * Current DC: sqlrhel2 (version 2.1.2-4.el8_6.2-ada5c3b36e2) - partition with quorum
  * Last updated: Mon Aug 29 21:36:10 2022
  * Last change: Mon Aug 29 20:50:36 2022 by root via cibadmin on sqlrhel1
  * 3 nodes configured
  * 6 resource instances configured

Node List:
  * Online: [ sqlrhel1 sqlrhel2 sqlrhel3 ]

Full List of Resources:
  * rsc_st_azure (stonith:fence_azure_arm): Started sqlrhel1
  * Resource Group: virtualip_group:
    * azure_load_balancer (ocf::heartbeat:azure-lb): Started sqlrhel1
    * virtualip (ocf::heartbeat:IPaddr2): Started sqlrhel1
  * Clone Set: ag_cluster-clone [ag_cluster] (promotable):
    * Masters: [ sqlrhel1 ]
    * Slaves: [ sqlrhel2 sqlrhel3 ]
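You can also confirm the replica roles from SQL Server itself. Here's a minimal sqlcmd sketch, assuming the mssql-tools sqlcmd is on your PATH and using the sa password set in the playbook:

# List each replica's role and synchronization health for the AG
sqlcmd -S sqlrhel1 -U sa -P 'SQLp@55w0rD1' -Q "SELECT ag.name, ar.replica_server_name, rs.role_desc, rs.synchronization_health_desc FROM sys.dm_hadr_availability_replica_states rs JOIN sys.availability_groups ag ON rs.group_id = ag.group_id JOIN sys.availability_replicas ar ON rs.replica_id = ar.replica_id;"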
You can view the AG and its dashboard in SQL Server Management Studio (SSMS).
You can connect using the listener IP address, which in this case is 172.22.0.22 (mssql_ha_virtual_ip), and test failover as suggested in this article. You can also test fencing by following the steps documented here; a rough sketch of both tests follows.
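Here is what those tests typically look like with pcs, using the resource and node names from this setup; double-check the exact options against the linked documentation for your pcs version:

# Manual failover: move the promoted (primary) copy of the AG clone to sqlrhel2
sudo pcs resource move ag_cluster-clone sqlrhel2 --master

# After the failover completes, remove the location constraint the move created
sudo pcs resource clear ag_cluster-clone

# Fencing test: fence one of the secondary nodes and confirm it is restarted and rejoins the cluster
sudo pcs stonith fence sqlrhel3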
Please let me know if you have questions on any section of this article and I’ll be more than happy to explain this further!
Happy Learning!