What is causing the Always On AG issue? Is it cluster, AD, DNS, or SQL?

Dharshana

Microsoft

Apr 23, 2023

Do any of these issues sound familiar?

- My AG fails to failover

- Why did my AG failover?

- Why did my AG fail?

- My AG is missing!

- Why am I seeing stale data on secondary?

- What is with the sudden log growth alerts?

You are either hearing from end users that they are seeing stale data, or that they are losing connections to availability database or listener, or you have monitoring solutions in place to catch these proactively and mitigate but are looking for a root cause. In either scenario, it helps to understand the high-level architecture of Availability Groups and the troubleshooting resources available for these components.

1.0 Overview of Availability Group

A picture is worth a thousand words. The above picture depicts the different high-level components that are at play with Always On Availability Groups. This article attempts to give a high-level conceptual overview of the different components, a map to common failure troubleshooting articles, and components causing failure that will help you choose the right support topic when calling Microsoft support for further assistance. This article is not an all-encompassing map of all the issues that one could run into with AG, as there could be various issues based on specific environment and workloads. However, this is an attempt to include the common areas of failures and the logs to look at for narrowing the issue and component down.

2.0 Why WSFC with AG?

As you can see from the picture, in a Windows environment, when AG is deployed for HA, a Windows Server Failover Cluster (WSFC) is required. WSFC monitors the health of the AG cluster role and also controls operations such as offline/online of the AG cluster role.

AG is a cluster role on Windows Server failover Cluster (WSFC).
Cluster service has a Resource Control Manager (RCM) component that is responsible for bringing the cluster resource group to a persistent state. RCM also negotiates with all the RCM instances on all the cluster nodes to ensure that the AG cluster role is online on at most one cluster node at a time. RCM is responsible for bringing the resources in the cluster resource group online in the order of dependencies.
In order to keep the cluster process stable and not load the custom resource DLLs into the cluster process, a child process RHS (Resource Host Subsystem), also called a Resource Monitor, hosts the DLLs for cluster resources. SQL Server Always On Availability group has a resource DLL (hadrres.dll) that is loaded into the RHS process.
If you are interested in reading more about failover cluster software components, here is a good article - Creating a Cluster Resource DLL (Part 1).

WSFC component performs mainly the following functions:

Heartbeat communication between cluster nodes

Heartbeat messages are sent between cluster nodes to ensure that the cluster is in a healthy and consistent state. IPv6 is the default protocol that Failover Clustering will use for its heartbeats. The heartbeat itself is a UDP unicast network packet that communicates over Port 3343. Delay defines the frequency in seconds at which cluster heartbeats are sent between nodes (default = 1 second in same subnet and cross subnet). Threshold defines the number of heartbeats which are missed before the cluster takes recover action. (default = 10 heartbeats in same subnet, 20 in cross subnet). For more information, refer to: Tuning Failover Cluster Network Thresholds - Microsoft Community Hub

So, with a default delay of 1 second and default threshold of 10 heartbeats on a single subnet WSFC environment, what do you expect to see when a certain cluster node does not respond to heartbeats for 10 seconds? This node is removed from active failover cluster membership. You will see a 1135 event on your system log with a message that goes like" Cluster node 'N1' was removed from the active failover cluster membership... " For more details on how to troubleshoot this issue, refer to A problem with deleting a node | Microsoft Learn

Cluster Quorum

Quorum mechanism is used to prevent split brain and also ensure a consistent and healthy state of WSFC. Cluster goes offline in the event of quorum loss. When cluster service shuts down due to quorum loss, Event ID 1177 is logged in event viewer under Microsoft-Windows-FailoverClustering with a message that goes like "The Cluster service is shutting down because quorum was lost. ... " For more information on troubleshooting this, please refer to : Event ID 1177 — Quorum and Connectivity Needed for Quorum | Microsoft Learn

Lease (Looks-Alive)

This is a continuous handshake between SQL Server instance and SQL Server resource DLL to ensure that the AG resource looks alive. A lease timeout may result in the following errors:

2016-02-16 11:37:03.05 Server      Error: 19419, Severity: 16, State: 1.
2016-02-16 11:37:03.05 Server      Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group 'ag' within the lease timeout period.
2016-02-16 11:37:03.07 Server      Error: 19407, Severity: 16, State: 1.
2016-02-16 11:37:03.07 Server      The lease between availability group 'ag' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

Other errors with lease timeout: Improved Always On Availability Group Lease Timeout Diagnostics - Microsoft Community Hub

How to diagnose lease timeout? Check this video for an end-to-end demo of troubleshooting a lease timeout scenario.

Health Check (IsAlive) -

Two values control the Always On health check: FailureConditionLevel and HealthCheckTimeout. The FailureConditionLevel indicates the tolerance level to specific failure conditions reported by sp_server_diagnosticsand the HealthCheckTimeout configures the time the resource DLL can go without receiving an update fromsp_server_diagnostics. The update interval forsp_server_diagnosticsis always HealthCheckTimeout / 3. If SQL Server does not respond with the results from executing sp_server_diagnostic within the HEALTH_CHECK_TIMEOUT (default is 30 sec), then the availability group will transition to RESOLVING state and failover if configured to do so. (Reference/Credit: Diagnose Unexpected Failover or Availability Group in RESOLVING State - Microsoft Community Hub)

Excellent Table to refer to on the healthcheckTimeout and lease timeout, the causes of these timeouts, and outcome of these timeouts - Availability group lease health check timeout - SQL Server Always On | Microsoft Learn

3.0 How is AD/DNS involved?

When a cluster is created, a CNO (Cluster Name Object) is created in Active Directory. CNO has IP address(es) as dependencies and the CNO and IP are used for management and cluster operations. The listener also referred to as a Virtual Network Name (VNN) has a network name and IP address(es) and listener is also created in Active Directory. If the account that is used to create the cluster or listener on AD, does not have full permissions to perform the operation, CNO and listener can be pre-staged on AD.

Both the CNO network name and listener VNN network name are also present on DNS for mapping between the network name and IP address(es). Reference on DNS: DNS Registration with the Network Name Resource - Microsoft Community Hub

Some common issues on AD/DNS is permissions related -

Listener creation failure with error 19471 or 19476

Msg 19471, Level 16, State 0, Line 2
The WSFC cluster could not bring the Network Name resource with DNS name '<DNS name>' online. The DNS name may have been taken or have a conflict with existing name services, or the WSFC cluster service may not be running or may be inaccessible. Use a different DNS name to resolve name conflicts, or check the WSFC cluster log for more information.

Msg 19476, Level 16, State 4, Line 2
The attempt to create the network name and IP address for the listener failed. The WSFC service may not be running or may be inaccessible in its current state, or the values provided for the network name and IP address may be incorrect. Check the state of the WSFC cluster and validate the network name and IP address with the network administrator.

Troubleshoot article: Create Listener Fails with Message 'The WSFC cluster could not bring the Network Name resource online' - Microsoft Community Hub

Connection timeouts in multi-subnet availability group

Connection Timeouts in Multi-subnet Availability Group - Microsoft Community Hub

Connect to an availability group listener - SQL Server Always On | Microsoft Learn

Driver and Client support

Driver and client connectivity support for availability groups - SQL Server Always On | Microsoft Learn (This table shows the driver versions that have support for Always On parameters like Application Intent, multi-subnet failover, read-only routing etc.)

4.0 SQL Server and AG data movement

Data movement from the primary replica to the secondary replica(s) is done through log block transmission and redo process. Here is a high-level diagram of the synchronization process and the events that occur during synchronization. Latency in any step here could cause overall latencies with synchronization.

In order to learn more about how to troubleshoot latency issues, please refer to : Troubleshooting data movement latency between synchronous-commit Always On Availability Groups - Microsoft Community Hub

SQL Server Performance issues could also lead to AG failover, and synchronization latencies. Some common issues are:

5.0 My AG has an issue. Where do I look?

Issue	Result	Logs to look at	Common Error numbers	Causing Component	Articles
Windows Cluster Issues (CNO, CNO IP, Listener, Listener IP, cluster service crash)	AG resource offline-online, failover	cluster.log, system log, Microsoft-Windows-FailoverClustering log, SQL Server errorlog	1069 1207 1135	WSFC	Troubleshooting resource for WSFC resource such as IP, disk failed/offline issues: Can't bring a clustered resource online troubleshooting guidance - Windows Server \| Microsoft Learn Troubleshooting resource for 1135 Troubleshooting cluster issue with Event ID 1135 \| Microsoft Learn Cluster System Log Event IDs and explanation Failover Clustering system log events \| Microsoft Learn
Listener Creation Issues	Listener unavailable	cluster.log, system log, SQL Server errorlog	Msg 19471, Level 16, State 0, Line 2 Msg 19476, Level 16, State 4, Line 2	WSFC and AD/DNS	CNO of WSFC cluster must have Crate Compter Objects Permissions. Configure availability group listener - SQL Server Always On \| Microsoft Learn Troubleshooting listener creation failures and common issues KB2829783 - Troubleshooting Always On availability group listener creation in SQL Server 2012 - Microsoft Support
Listener Connectivity Issues	Client connection failures, timeouts and errors	Client connection logs, network traces, SQL Server errorlog, cluster.log, system log	0x80131904 0x80004005 "Login timeout expired."	Driver multisubnet parameter support, RegisterAllProviders cluster setting, DNS	Listener connection times out - SQL Server \| Microsoft Learn Troubleshoot SQL Server connectivity issues Troubleshoot connectivity issues in SQL Server - SQL Server \| Microsoft Learn Connectivity Issues caused due to SQL Server Connectivity Problems Caused by Issues in SQL Server - Microsoft Community Hub Connection timeouts in multi-subnet AG Connection Timeouts in Multi-subnet Availability Group - Microsoft Community Hub
Driver and client connectivity issues with AG	Client connection failures, timeouts and errors	Client side connection logs, driver logs, network traces, SQL Server errorlogs		Driver, SQL Server configuration, Networking	Driver and client connectivity support for availability groups - SQL Server Always On \| Microsoft Learn
Lease Timeout	AG resource offline-online, failover	Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder	19407 19419 19420 19421 19422 19423 19424	OS not responding, low virtual memory, working set paging, SQL generating dump, pegged CPU, WSFC down (loss of quorum)	How It Works: SQL Server Always On Lease Timeout - Microsoft Community Hub Lease timeout common errors and corrective actions Improved Always On Availability Group Lease Timeout Diagnostics - Microsoft Community Hub
Cluster Health Check TImeout	AG resource Offline-online or Failover, FCI restart/failover	Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder	On cluster.log [RES] SQL Server Availability Group <ag>: [hadrag] Resource Alive result 0	Failure conditions met, OS not responding, low virtual memory, working set trim, SQL generating dump, WSFC (loss of quorum), scheduler issues (dead locked schedulers)	Diagnose Unexpected Failover or Availability Group in RESOLVING State - Microsoft Community Hub
Cluster Quorum Loss	AG in resolving state	Cluster log, System Log, Network traces	1177 The Cluster service is shutting down because quorum was lost.	Network connectivity issues, Node instability issues, OS not responding	WSFC Quorum Modes and Voting Configuration (SQL Server) WSFC quorum modes & voting configuration - SQL Server Always On \| Microsoft Learn Troubleshooting Event ID: 1177 Event ID 1177 — Quorum and Connectivity Needed for Quorum \| Microsoft Learn
Session timeout	Secondary-disconnected	Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder	35206 35201 35267	Network communication, Issues on secondary - down, OS not responding, resource contention	Reasons for connectivity failures between availability replicas - SQL Server Always On \| Microsoft Learn Troubleshooting resource: Troubleshooting intermittent connection time-outs between availability group replicas - SQL Server \| Microsoft Learn MSSQLSERVER_35267 - SQL Server \| Microsoft Learn
Cluster service permissions on SQL Server	AG creation failure	SQL Server error log	41131	SQL Server Permissions	Error 41131 when creating availability group - SQL Server \| Microsoft Learn
Data synchronization latency due to log send queuing	Log file growth, low disk space alerts, Stale data in secondary replicas	Perfmon, sys.dm_hadr_database_replica_state, Always On dashboard	9002		Troubleshooting long send queueing in an Always On availability group - SQL Server \| Microsoft Learn Error 9002 when transaction log is large - SQL Server \| Microsoft Learn
Data synchronization latency due to recovery queuing	Stale data	Perfmon, sys.dm_hadr_database_replica_state, Always On dashboard			Troubleshooting recovery queueing in an Always On availability group - SQL Server \| Microsoft Learn
Availability group removed or AG replica removed					Issue: Replica Unexpectedly Dropped in Availability Group \| Microsoft Learn
Create Availability Group Fails With Error 35250 'Failed to join the database'	Endpoint not created or started, Endpoint connection issues, Name resolution issues, Endpoint permissions	DMVs, Telnet, Test-NetConnection	35250		MSSQLSERVER_35250 - SQL Server \| Microsoft Learn