Commands we can use to manage SQL Server availability group on Linux Pacemaker cluster

JayaprakashJothiraman · ‎Aug 19 2020

Thanks to Jayantha Das, who shared some of the common commands we use with SQL Server on a Linux Pacemaker cluster. This blog covers the Pacemaker settings that we can modify. Unless explicitly required, we recommend not modifying these values, as doing so can cause problems with the functionality of availability groups.

We can run “sudo pcs config” to list the critical cluster configuration settings:

 Master: ag_cluster-master
  Meta Attrs: failure-timeout=30s notify=true
  Resource: ag_cluster (class=ocf provider=mssql type=ag)
   Attributes: ag_name=ag1
   Operations: start interval=0s timeout=60 (ag_cluster-start-interval-0s)
               stop interval=0s timeout=10 (ag_cluster-stop-interval-0s)
               promote interval=0s timeout=60 (ag_cluster-promote-interval-0s)
               demote interval=0s timeout=10 (ag_cluster-demote-interval-0s)
               monitor interval=10 timeout=60 (ag_cluster-monitor-interval-10)
               monitor interval=11 role=Master timeout=60 (ag_cluster-monitor-interval-11)
               monitor interval=12 role=Slave timeout=60 (ag_cluster-monitor-interval-12)
 Resource: virtualip (class=ocf provider=heartbeat type=IPaddr2)
  Attributes: ip=xx.xx.xx.xx
  Operations: start interval=0s timeout=20s (virtualip-start-interval-0s)
              stop interval=0s timeout=20s (virtualip-stop-interval-0s)
              monitor interval=10s timeout=20s (virtualip-monitor-interval-10s)
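
To read the operation settings at a glance, the output above can be condensed with a small filter. This is only a sketch: the function name list_ops is our own, and it assumes the text layout shown above (newer pcs versions may print the configuration differently).

```shell
#!/bin/sh
# list_ops (hypothetical helper): summarize operations from `pcs config`
# text read on stdin. Assumes the output layout shown in this post.
# Prints one line per operation: <operation-id> <action> <role> <interval> <timeout>
# A role of "-" means the operation has no role= setting.
list_ops() {
    awk '
        /interval=/ && /timeout=/ {
            action = ""; interval = ""; timeout = ""; role = "-"; id = ""
            for (i = 1; i <= NF; i++) {
                if ($i == "Operations:") continue
                if ($i ~ /^interval=/)      { interval = substr($i, 10) }
                else if ($i ~ /^timeout=/)  { timeout = substr($i, 9) }
                else if ($i ~ /^role=/)     { role = substr($i, 6) }
                else if ($i ~ /^\(.*\)$/)   { id = substr($i, 2, length($i) - 2) }
                else if (action == "")      { action = $i }
            }
            print id, action, role, interval, timeout
        }
    '
}

# Usage on a cluster node:
#   sudo pcs config | list_ops
```

Each output line carries the operation id, so a value that looks wrong can be matched back to the entry in the full configuration.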

Interval

Interval for the above operations means how frequently (in seconds) to perform the operation. A value of 0 means never. A positive value defines a recurring action, which is typically used with monitor.

Timeout

Timeout for the above operations means how long to wait before declaring that the action has failed.

Role

Role for the above operations (like the one in the monitor) means that the operation runs only on the node(s) that the cluster thinks should be in the specified role. This only makes sense for recurring monitor operations. Allowed (case-sensitive) values: Stopped, Started, and in the case of multi-state resources, Slave and Master.

The usual monitor actions are insufficient to monitor a multi-state resource, because pacemaker needs to verify not only that the resource is active, but also that its actual role matches its intended one.

For example, define two monitoring actions: the usual one covers the Slave role, and an additional one with role=Master covers the Master role. In the case of availability groups, Master indicates the primary node and Slave indicates the secondary node.

It is crucial that every monitor operation has a different interval! Pacemaker currently differentiates between operations only by resource and interval, so if (for example) a master/slave resource had the same monitor interval for both roles, Pacemaker would ignore the role when checking the status. That would cause unexpected return codes, and therefore unnecessary complications.
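
A quick way to check the rule about distinct intervals is to scan the “pcs config” output for monitor intervals that repeat within one resource. The sketch below is an illustration under assumptions: the function name is our own, it expects the output layout shown earlier, and it only treats a trailing "s" as equivalent to a bare number of seconds (so 10 and 10s count as the same interval).

```shell
#!/bin/sh
# find_duplicate_monitor_intervals (hypothetical helper): read `pcs config`
# text on stdin and print "<resource> <interval>" for every monitor interval
# that is used more than once by the same resource.
find_duplicate_monitor_intervals() {
    awk '
        $1 == "Resource:" { resource = $2 }    # remember the current resource
        /monitor interval=/ {
            for (i = 1; i <= NF; i++) {
                if ($i ~ /^interval=/) {
                    interval = substr($i, 10)
                    sub(/s$/, "", interval)    # treat "10s" and "10" as equal
                    count[resource " " interval]++
                }
            }
        }
        END {
            for (key in count) if (count[key] > 1) print key
        }
    ' | sort
}

# Usage on a cluster node (no output means every interval is distinct):
#   sudo pcs config | find_duplicate_monitor_intervals
```

With the configuration shown in this post (intervals 10, 11 and 12 for ag_cluster), the check prints nothing.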

The connection timeout can be found in the corosync logs:

monitor: 2019/09/26 13:48:55 ag-helper invoked with hostname [localhost]; port [1433]; ag-name [ag1]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-ag_cluster-monitor]; connection-timeout [30];

The SQL query timeout is the same as the connection timeout.
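
To pull just the timeout value out of such log lines, a one-line sed filter is enough. This is a sketch that assumes the ag-helper line format shown above; the function name is our own, and the log path in the usage comment is an assumption that depends on the distribution.

```shell
#!/bin/sh
# extract_connection_timeout (hypothetical helper): read log lines on stdin
# and print the number inside "connection-timeout [N]" for each ag-helper line.
extract_connection_timeout() {
    sed -n 's/.*ag-helper invoked.*connection-timeout \[\([0-9][0-9]*\)].*/\1/p'
}

# Usage (log path is an assumption; adjust it for your system):
#   sudo cat /var/log/cluster/corosync.log | extract_connection_timeout | sort -u
```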

The following table lists some of the critical parameters, with a brief description of each and how to use it.

Name	Description	How to use it
cluster-recheck-interval	Polling interval for time-based changes to options, resource parameters and constraints. A Pacemaker cluster is an event-driven system. As such, it won’t recalculate the best place for resources to run unless something (like a resource failure or configuration change) happens. If time-based rules are needed, the cluster-recheck-interval cluster option (which defaults to 15 minutes) is essential. This tells the cluster to periodically recalculate the ideal state of the cluster, for example to start a resource during a certain period of time.	sudo pcs property set cluster-recheck-interval=2min
corosync totem token timeout	This makes sure corosync is resilient against intermittent network glitches. The default is 1000 ms (1 second) for a 2-node cluster, increasing by 650 ms for each additional member. To use a value other than the default, add or edit the token line in /etc/corosync/corosync.conf, e.g. totem { token: 5000 }	cat /etc/corosync/corosync.conf totem { version: 2 cluster_name: pcmk secauth: off transport: udpu token: 10000 }
start-failure-is-fatal	Indicates whether a failure to start a resource on a particular node prevents further start attempts on that node. When set to false, the cluster will decide whether to try starting on the same node again based on the resource's current failure count and migration threshold.	sudo pcs property set start-failure-is-fatal=true
failure-timeout	cluster-recheck-interval indicates the polling interval at which the cluster checks for changes in the resource parameters, constraints or other cluster options. If a replica goes down, the cluster tries to restart the replica at an interval that is bound by the failure-timeout value and the cluster-recheck-interval value. For example, if failure-timeout is set to 60 seconds and cluster-recheck-interval is set to 120 seconds, the restart is tried at an interval that is greater than 60 seconds but less than 120 seconds. We recommend that you set failure-timeout to 60s and cluster-recheck-interval to a value that is greater than 60 seconds. Setting cluster-recheck-interval to a small value is not recommended.	sudo pcs resource update resource_name meta failure-timeout=60s
monitor_policy	Monitoring policy options are: 1) SERVER_UNRESPONSIVE_OR_DOWN: Fail if the SQL Server instance is unresponsive (unable to establish a connection) or down (the process is not running) 2) SERVER_CRITICAL_ERROR: Fail if sp_server_diagnostics detects a critical system error 3) SERVER_MODERATE_ERROR: Fail if sp_server_diagnostics detects a critical system or resource error 5) SERVER_ANY_QUALIFIED_ERROR: Fail if sp_server_diagnostics detects any qualified error. For example: pcs resource update C2PODDAG monitor_policy=1	sudo pcs resource update resource_name monitor_policy=2 This shows up as health-threshold [2]; in the corosync logs.
connection_timeout	This is the timeout of the connection request that Pacemaker makes to SQL Server. The timeout can be set using the pcs command.	sudo pcs resource update resource_name connection_timeout=90
demote action timeout (s)	Demote action demotes relevant resources that are running in master mode to slave mode. The timeout can be set using the pcs command.	sudo pcs resource update resource_name op demote timeout=20
promote action timeout (s)	Promote action promotes relevant resource to the master mode. The timeout can be set using the pcs command.	sudo pcs resource update resource_name op promote timeout=60
start action timeout (s)	Start action starts the resource. The timeout can be set using the pcs command.	sudo pcs resource update resource_name op start timeout=60
stop action timeout (s)	Stop action stops the resource. The timeout can be set using the pcs command.	sudo pcs resource update resource_name op stop timeout=60
monitor action timeout (s)	Monitor action checks the resource’s state. The timeout can be set using the pcs command.	sudo pcs resource update resource_name op monitor interval=11 role=Master timeout=11 Note: the monitor timeout should be greater than the connection timeout.
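
Two of the rules in the table are simple arithmetic and can be checked before changing anything on the cluster. The sketch below encodes them exactly as stated in the table: the default token timeout of 1000 ms for two nodes plus 650 ms per additional member, and the requirement that the monitor timeout exceed the connection timeout. Both function names are our own and are not part of pcs or corosync.

```shell
#!/bin/sh
# default_token_ms (hypothetical helper): default corosync token timeout in
# milliseconds for a given node count, using the rule stated in the table
# (1000 ms for a 2-node cluster, plus 650 ms for each additional member).
default_token_ms() {
    nodes=$1
    if [ "$nodes" -lt 2 ]; then
        echo "need at least 2 nodes" >&2
        return 1
    fi
    echo $(( 1000 + 650 * (nodes - 2) ))
}

# monitor_timeout_ok (hypothetical helper): succeeds only when the monitor
# timeout is greater than the connection timeout (both in seconds).
monitor_timeout_ok() {
    connection_timeout=$1
    monitor_timeout=$2
    [ "$monitor_timeout" -gt "$connection_timeout" ]
}

# Usage:
#   default_token_ms 3
#   monitor_timeout_ok 30 60 && echo "monitor timeout is large enough"
```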
