Commands we can use to manage SQL Server availability group on Linux Pacemaker cluster
Published Aug 19 2020 04:55 PM

 

Thanks to Jayantha Das, who shared some of the common commands we use with SQL Server on a Linux Pacemaker cluster. This blog covers the Pacemaker settings that we can modify. Unless explicitly required, we recommend not modifying these values, as doing so can cause problems with the functionality of availability groups.

 

We can run “sudo pcs config” to list critical cluster configuration settings

 

Master: ag_cluster-master

  Meta Attrs: failure-timeout=30s notify=true

  Resource: ag_cluster (class=ocf provider=mssql type=ag)

   Attributes: ag_name=ag1

   Operations: start interval=0s timeout=60 (ag_cluster-start-interval-0s)

               stop interval=0s timeout=10 (ag_cluster-stop-interval-0s)

               promote interval=0s timeout=60 (ag_cluster-promote-interval-0s)

               demote interval=0s timeout=10 (ag_cluster-demote-interval-0s)

               monitor interval=10 timeout=60 (ag_cluster-monitor-interval-10)

               monitor interval=11 role=Master timeout=60 (ag_cluster-monitor-interval-11)

               monitor interval=12 role=Slave timeout=60 (ag_cluster-monitor-interval-12)

Resource: virtualip (class=ocf provider=heartbeat type=IPaddr2)

  Attributes: ip=xx.xx.xx.xx

  Operations: start interval=0s timeout=20s (virtualip-start-interval-0s)

              stop interval=0s timeout=20s (virtualip-stop-interval-0s)

              monitor interval=10s timeout=20s (virtualip-monitor-interval-10s)

 

 

Interval

The interval of an operation is how frequently (in seconds) to perform it. A value of 0 means the operation is never repeated on a schedule. A positive value defines a recurring action, which is typically used with monitor.

 

Timeout

The timeout of an operation is how long to wait before declaring that the action has failed.

 

Role

The role of an operation (as in the monitor operations above) means the operation runs only on node(s) that the cluster thinks should be in the specified role. This only makes sense for recurring monitor operations. Allowed (case-sensitive) values: Stopped, Started, and in the case of multi-state resources, Slave and Master.

The usual monitor actions are insufficient to monitor a multi-state resource, because pacemaker needs to verify not only that the resource is active, but also that its actual role matches its intended one.

 

For example, define two monitoring actions: the usual one covers the slave role, and an additional one with role="Master" covers the master role. For availability groups, Master is the primary replica and Slave is a secondary replica.

 

<op id="public-ip-slave-check" name="monitor" interval="60"/>
<op id="public-ip-master-check" name="monitor" interval="61" role="Master"/>

 

It is crucial that every monitor operation has a different interval! Pacemaker currently differentiates between operations only by resource and interval, so if (for example) a master/slave resource had the same monitor interval for both roles, Pacemaker would ignore the role when checking the status. That would cause unexpected return codes, and therefore unnecessary complications.
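The distinct-interval rule can be checked mechanically. The sketch below is only an illustration: the function name duplicate_monitor_intervals is hypothetical, and it assumes the operation lines of a single resource (in the format printed by “sudo pcs config” above) are supplied on standard input. It only compares plain numbers with an optional "s" suffix, so values such as "1min" are not normalized.

```shell
# Print every monitor interval that occurs more than once in the
# "pcs config"-style operation lines read from stdin.
# No output means all monitor intervals are distinct.
# Assumption: stdin holds the operations of ONE resource only.
duplicate_monitor_intervals() {
    grep -E '(^|[[:space:]])monitor[[:space:]]+interval=' \
        | sed -E 's/.*monitor[[:space:]]+interval=([0-9]+)s?.*/\1/' \
        | sort -n \
        | uniq -d
}
```

For the ag_cluster operations listed earlier (intervals 10, 11 and 12) this prints nothing. If two monitor operations shared interval=10, it would print 10.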

 

 

The connection timeout can be found in the corosync logs:

 

monitor: 2019/09/26 13:48:55 ag-helper invoked with hostname [localhost]; port [1433]; ag-name [ag1]; credentials-file [/var/opt/mssql/secrets/passwd]; application-name [monitor-ag_cluster-monitor]; connection-timeout [30];

The SQL query timeout is the same as the connection timeout.
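To pull the value out of such a log line without reading it by eye, something like the following could be used. This is a minimal sketch: the function name connection_timeout_from_log is hypothetical, and it assumes the ag-helper lines keep the "connection-timeout [N];" format shown above. The location of the log file depends on the installation, so the function simply reads from standard input.

```shell
# Print the number inside "connection-timeout [N]" for each
# matching log line read from stdin; other lines are ignored.
connection_timeout_from_log() {
    sed -n -E 's/.*connection-timeout \[([0-9]+)\].*/\1/p'
}
```

Feeding it the log line above prints 30.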

 

The following table lists some of the critical parameters, with a brief description of each and how to set it.

 

 

NAME

Description

How to use it

cluster-recheck-interval

Polling interval for time-based changes to options, resource parameters and constraints.

A Pacemaker cluster is an event-driven system. As such, it won’t recalculate the best place for resources to run unless something (like a resource failure or configuration change) happens. If time-based rules are needed, the cluster-recheck-interval cluster option (which defaults to 15 minutes) is essential. This tells the cluster to periodically recalculate the ideal state of the cluster. An example of a time-based rule is starting a resource only during a certain period of time.

sudo pcs property set cluster-recheck-interval=2min

corosync totem token timeout

This makes sure corosync is resilient against intermittent network glitches. The default is 1000 ms (1 second) for a 2-node cluster, increasing by 650 ms for each additional member. To use a value other than the default, add or edit the token line in /etc/corosync/corosync.conf, e.g. totem { token: 5000 }

cat /etc/corosync/corosync.conf

totem {

    version: 2

    cluster_name: pcmk

    secauth: off

    transport: udpu

    token: 10000   

}
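As a worked example of the default quoted above (1000 ms for two nodes plus 650 ms per additional member), the arithmetic can be sketched as follows. The function name default_token_ms is hypothetical, and the figures are taken from the description in this blog rather than read from a running cluster. An explicit token line, such as token: 10000 above, replaces this default.

```shell
# Default token timeout in milliseconds for a cluster of $1 nodes,
# using the figures quoted in this blog:
# 1000 ms for two nodes, plus 650 ms for each additional node.
default_token_ms() {
    nodes=$1
    if [ "$nodes" -lt 2 ]; then
        # Assumption: fewer than two nodes still uses the 1000 ms base.
        echo 1000
    else
        echo $((1000 + 650 * (nodes - 2)))
    fi
}
```

With these figures, a 3-node cluster gets 1000 + 650 = 1650 ms and a 5-node cluster gets 1000 + 3 * 650 = 2950 ms.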

 

start-failure-is-fatal

Indicates whether a failure to start a resource on a particular node prevents further start attempts on that node. When set to false, the cluster will decide whether to try starting on the same node again based on the resource's current failure count and migration threshold.

sudo pcs property set start-failure-is-fatal=true

failure-timeout

cluster-recheck-interval indicates the polling interval at which the cluster checks for changes in the resource parameters, constraints or other cluster options. If a replica goes down, the cluster tries to restart the replica at an interval that is bound by the failure-timeout value and the cluster-recheck-interval value. For example, if failure-timeout is set to 60 seconds and cluster-recheck-interval is set to 120 seconds, the restart is tried at an interval that is greater than 60 seconds but less than 120 seconds. We recommend that you set failure-timeout to 60s and cluster-recheck-interval to a value that is greater than 60 seconds. Setting cluster-recheck-interval to a small value is not recommended.

sudo pcs resource update resource_name meta failure-timeout=60s

monitor_policy

Monitoring policy options are:

 

1) SERVER_UNRESPONSIVE_OR_DOWN: Fail if the SQL Server instance is unresponsive (unable to establish a connection) or down (the process is not running)
2) SERVER_CRITICAL_ERROR: Fail if sp_server_diagnostics detects a critical system error
3) SERVER_MODERATE_ERROR: Fail if sp_server_diagnostics detects a critical system or resource error
5) SERVER_ANY_QUALIFIED_ERROR: Fail if sp_server_diagnostics detects any qualified error 
pcs resource update C2PODDAG monitor_policy=1

 

sudo pcs resource update resource_name monitor_policy=2

 

This shows up as health-threshold [2]; in corosync logs

connection_timeout

This is the timeout for connection requests that Pacemaker makes to SQL Server. The timeout can be set using the pcs command.

sudo pcs resource update resource_name connection_timeout=90

 

demote action timeout (s)

Demote action demotes relevant resources that are running in master mode to slave mode. The timeout can be set using the pcs command.

sudo pcs resource update resource_name op demote timeout=20

 

promote action timeout (s)

Promote action promotes relevant resource to the master mode. The timeout can be set using the pcs command.

sudo pcs resource update resource_name op promote timeout=60

start action timeout (s)

Start action starts the resource. The timeout can be set using the pcs command.

sudo pcs resource update resource_name op start timeout=60

stop action timeout (s)

Stop action stops the resource. The timeout can be set using the pcs command.

sudo pcs resource update resource_name op stop timeout=60

monitor action timeout (s)

Monitor action checks the resource’s state.

sudo pcs resource update resource_name op monitor interval=11 role=Master timeout=11

 

Monitor timeout should be greater than the connection timeout.
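Before applying new values, the relationship between the two timeouts can be sanity-checked with a small helper. This is only a sketch: the function name monitor_timeout_ok is hypothetical, and both arguments are assumed to be plain numbers of seconds.

```shell
# Succeed (exit status 0) only when the monitor timeout ($1) is
# strictly greater than the connection timeout ($2), both in seconds.
monitor_timeout_ok() {
    monitor_timeout=$1
    connection_timeout=$2
    [ "$monitor_timeout" -gt "$connection_timeout" ]
}
```

With the connection-timeout [30] seen in the log line earlier, monitor_timeout_ok 60 30 succeeds, while monitor_timeout_ok 11 30 fails.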

 

 
