What is causing the Always On AG issue? Is it cluster, AD, DNS, or SQL?
Published Apr 23 2023 02:29 PM 18.5K Views
Microsoft

Do any of these issues sound familiar?

 

- My AG fails to failover

- Why did my AG failover? 

- Why did my AG fail?

- My AG is missing!

- Why am I seeing stale data on secondary? 

- What is with the sudden log growth alerts?

 

You are either hearing from end users that they are seeing stale data, or that they are losing connections to availability database or listener, or you have monitoring solutions in place to catch these proactively and mitigate but are looking for a root cause. In either scenario, it helps to understand the high-level architecture of Availability Groups and the troubleshooting resources available for these components.

 

1.0 Overview of Availability Group

 

AG Overview.png

 

A picture is worth a thousand words. The above picture depicts the different high-level components that are at play with Always On Availability Groups.  This article attempts to give a high-level conceptual overview of the different components, a map to common failure troubleshooting articles, and components causing failure that will help you choose the right support topic when calling Microsoft support for further assistance. This article is not an all-encompassing map of all the issues that one could run into with AG, as there could be various issues based on specific environment and workloads. However, this is an attempt to include the common areas of failures and the logs to look at for narrowing the issue and component down.

 

2.0 Why WSFC with AG? 

 

As you can see from the picture, in a Windows environment, when AG is deployed for HA, a Windows Server Failover Cluster (WSFC) is required. WSFC monitors the health of the AG cluster role and also controls operations such as offline/online of the AG cluster role.

 

  • AG is a cluster role on Windows Server failover Cluster (WSFC).
  • Cluster service has a Resource Control Manager (RCM) component that is responsible for bringing the cluster resource group to a persistent state. RCM also negotiates with all the RCM instances on all the cluster nodes to ensure that the AG cluster role is online on at most one cluster node at a time. RCM is responsible for bringing the resources in the cluster resource group online in the order of dependencies.
  • In order to keep the cluster process stable and not load the custom resource DLLs into the cluster process, a child process RHS (Resource Host Subsystem), also called a Resource Monitor, hosts the DLLs for cluster resources.  SQL Server Always On Availability group has a resource DLL (hadrres.dll) that is loaded into the RHS process. 
  • If you are interested in reading more about failover cluster software components, here is a good article - Creating a Cluster Resource DLL (Part 1).

 

WSFC component performs mainly the following functions:

 

  • Heartbeat communication between cluster nodes

Heartbeat messages are sent between cluster nodes to ensure that the cluster is in a healthy and consistent state. IPv6 is the default protocol that Failover Clustering will use for its heartbeats. The heartbeat itself is a UDP unicast network packet that communicates over Port 3343. Delay defines the frequency in seconds at which cluster heartbeats are sent between nodes (default = 1 second in same subnet and cross subnet). Threshold defines the number of heartbeats which are missed before the cluster takes recover action. (default = 10 heartbeats in same subnet, 20 in cross subnet). For more information, refer to: Tuning Failover Cluster Network Thresholds - Microsoft Community Hub

 

So, with a default delay of 1 second and default threshold of 10 heartbeats on a single subnet WSFC environment, what do you expect to see when a certain cluster node does not respond to heartbeats for 10 seconds? This node is removed from active failover cluster membership. You will see a 1135 event on your system log with a message that goes like" Cluster node 'N1' was removed from the active failover cluster membership... " For more details on how to troubleshoot this issue, refer to A problem with deleting a node | Microsoft Learn

 

  • Cluster Quorum

Quorum mechanism is used to prevent split brain and also ensure a consistent and healthy state of WSFC. Cluster goes offline in the event of quorum loss. When cluster service shuts down due to quorum loss, Event ID 1177 is logged in event viewer under Microsoft-Windows-FailoverClustering with a message that goes like "The Cluster service is shutting down because quorum was lost. ... " For more information on troubleshooting this, please refer to : Event ID 1177 — Quorum and Connectivity Needed for Quorum | Microsoft Learn

 

  • Lease (Looks-Alive) 

This is a continuous handshake between SQL Server instance and SQL Server resource DLL to ensure that the AG resource looks alive.  A lease timeout may result in the following errors:

 

2016-02-16 11:37:03.05 Server      Error: 19419, Severity: 16, State: 1.
2016-02-16 11:37:03.05 Server      Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group 'ag' within the lease timeout period.
2016-02-16 11:37:03.07 Server      Error: 19407, Severity: 16, State: 1.
2016-02-16 11:37:03.07 Server      The lease between availability group 'ag' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.

 

Other errors with lease timeout: Improved Always On Availability Group Lease Timeout Diagnostics - Microsoft Community Hub

 

How to diagnose lease timeout? Check this video for an end-to-end demo of troubleshooting a lease timeout scenario. 

 

  • Health Check (IsAlive) - 

Two values control the Always On health check: FailureConditionLevel and HealthCheckTimeout. The FailureConditionLevel indicates the tolerance level to specific failure conditions reported by sp_server_diagnosticsand the HealthCheckTimeout configures the time the resource DLL can go without receiving an update fromsp_server_diagnostics. The update interval forsp_server_diagnosticsis always HealthCheckTimeout / 3. If SQL Server does not respond with the results from executing sp_server_diagnostic within the HEALTH_CHECK_TIMEOUT (default is 30 sec), then the availability group will transition to RESOLVING state and failover if configured to do so. (Reference/CreditDiagnose Unexpected Failover or Availability Group in RESOLVING State - Microsoft Community Hub)

 

3.0 How is AD/DNS involved?

 

When a cluster is created, a CNO (Cluster Name Object) is created in Active Directory. CNO has IP address(es) as dependencies and the CNO and IP are used for management and cluster operations. The listener also referred to as a Virtual Network Name (VNN) has a network name and IP address(es) and listener is also created in Active Directory. If the account that is used to create the cluster or listener on AD, does not have full permissions to perform the operation, CNO and listener can be pre-staged on AD.

 

Both the CNO network name and listener VNN network name are also present on DNS for mapping between the network name and IP address(es). Reference on DNS: DNS Registration with the Network Name Resource - Microsoft Community Hub

 

Some common issues on AD/DNS is permissions related - 

 

  • Listener creation failure with error 19471 or 19476

Msg 19471, Level 16, State 0, Line 2
The WSFC cluster could not bring the Network Name resource with DNS name '<DNS name>' online. The DNS name may have been taken or have a conflict with existing name services, or the WSFC cluster service may not be running or may be inaccessible. Use a different DNS name to resolve name conflicts, or check the WSFC cluster log for more information.

 

Msg 19476, Level 16, State 4, Line 2
The attempt to create the network name and IP address for the listener failed. The WSFC service may not be running or may be inaccessible in its current state, or the values provided for the network name and IP address may be incorrect. Check the state of the WSFC cluster and validate the network name and IP address with the network administrator.

 

Troubleshoot article: Create Listener Fails with Message 'The WSFC cluster could not bring the Network Name resource onlin...

 

  • Connection timeouts in multi-subnet availability group

Connection Timeouts in Multi-subnet Availability Group - Microsoft Community Hub

Connect to an availability group listener - SQL Server Always On | Microsoft Learn

 

  • Driver and Client support 

Driver and client connectivity support for availability groups - SQL Server Always On | Microsoft Le... (This table shows the driver versions that have support for Always On parameters like Application Intent, multi-subnet failover, read-only routing etc.)

 

4.0 SQL Server and AG data movement

 

Data movement from the primary replica to the secondary replica(s) is done through log block transmission and redo process. Here is a high-level diagram of the synchronization process and the events that occur during synchronization. Latency in any step here could cause overall latencies with synchronization. 

 

AG Data Flow.png

In order to learn more about how to troubleshoot latency issues, please refer to : Troubleshooting data movement latency between synchronous-commit Always On Availability Groups - Mic... 

 

SQL Server Performance issues could also lead to AG failover, and synchronization latencies. Some common issues are:

 

 

5.0 My AG has an issue. Where do I look? 

 

Issue Result Logs to look at Common Error numbers Causing Component  Articles
Windows Cluster Issues (CNO, CNO IP, Listener, Listener IP, cluster service crash) AG resource offline-online, failover cluster.log, system log, Microsoft-Windows-FailoverClustering log, SQL Server errorlog

1069

1207

1135

WSFC

Troubleshooting resource for WSFC resource such as IP, disk  failed/offline issues:

Can't bring a clustered resource online troubleshooting guidance - Windows Server | Microsoft Learn

 

Troubleshooting resource for 1135

Troubleshooting cluster issue with Event ID 1135 | Microsoft Learn

 

Cluster System Log Event IDs and explanation

Failover Clustering system log events | Microsoft Learn

 

 

Listener Creation Issues Listener unavailable cluster.log, system log, SQL Server errorlog

Msg 19471, Level 16, State 0, Line 2

 

Msg 19476, Level 16, State 4, Line 2

WSFC and AD/DNS

CNO of WSFC cluster must have Crate Compter Objects Permissions.

Configure availability group listener - SQL Server Always On | Microsoft Learn

 

Troubleshooting listener creation failures and common issues

KB2829783 - Troubleshooting Always On availability group listener creation in SQL Server 2012 - Micr...

Listener Connectivity Issues  Client connection failures, timeouts and errors Client connection logs, network traces, SQL Server errorlog, cluster.log, system log

0x80131904

 

0x80004005

 

"Login timeout expired."

 

 

Driver multisubnet parameter support, RegisterAllProviders cluster setting, DNS 

Listener connection times out - SQL Server | Microsoft Learn

 

Troubleshoot SQL Server connectivity issues

Troubleshoot connectivity issues in SQL Server - SQL Server | Microsoft Learn

 

Connectivity Issues caused due to SQL Server

Connectivity Problems Caused by Issues in SQL Server - Microsoft Community Hub

 

Connection timeouts in multi-subnet AG

Connection Timeouts in Multi-subnet Availability Group - Microsoft Community Hub

 

Driver and client connectivity issues with AG  Client connection failures, timeouts and errors Client side connection logs, driver logs, network traces, SQL Server errorlogs   Driver, SQL Server configuration, Networking Driver and client connectivity support for availability groups - SQL Server Always On | Microsoft Le...
Lease Timeout AG resource offline-online, failover Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder

19407

19419

19420

19421

19422

19423

19424

 

OS not responding, low virtual memory, working set paging, SQL generating dump, pegged CPU, WSFC down (loss of quorum)

How It Works: SQL Server Always On Lease Timeout - Microsoft Community Hub

 

Lease timeout common errors and corrective actions

Improved Always On Availability Group Lease Timeout Diagnostics - Microsoft Community Hub

Cluster Health Check TImeout AG resource Offline-online or Failover, FCI restart/failover

Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder

On cluster.log

[RES] SQL Server Availability Group <ag>: [hadrag] Resource Alive result 0

Failure conditions met, OS not responding, low virtual memory, working set trim, SQL generating dump, WSFC (loss of quorum), scheduler issues (dead locked schedulers) Diagnose Unexpected Failover or Availability Group in RESOLVING State - Microsoft Community Hub
Cluster Quorum Loss AG in resolving state Cluster log, System Log, Network traces

1177

The Cluster service is shutting down because quorum was lost. 

Network connectivity issues, Node instability issues, OS not responding 

WSFC Quorum Modes and Voting Configuration (SQL Server)

WSFC quorum modes & voting configuration - SQL Server Always On | Microsoft Learn

 

Troubleshooting Event ID: 1177 

Event ID 1177 — Quorum and Connectivity Needed for Quorum | Microsoft Learn

Session timeout Secondary-disconnected Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder

35206

35201

35267

 

Network communication,
Issues on secondary - down, OS not responding, resource contention

Reasons for connectivity failures between availability replicas - SQL Server Always On | Microsoft L...

 

Troubleshooting resource:

Troubleshooting intermittent connection time-outs between availability group replicas - SQL Server |...

 

MSSQLSERVER_35267 - SQL Server | Microsoft Learn

 

Cluster service permissions on SQL Server AG creation failure SQL Server error log

41131

SQL Server Permissions

Error 41131 when creating availability group - SQL Server | Microsoft Learn

Data synchronization latency due to log send queuing Log file growth, low disk space alerts, Stale data in secondary replicas Perfmon, sys.dm_hadr_database_replica_state, Always On dashboard

9002

 

Troubleshooting long send queueing in an Always On availability group - SQL Server | Microsoft Learn

 

Error 9002 when transaction log is large - SQL Server | Microsoft Learn

Data synchronization latency due to recovery queuing Stale data Perfmon, sys.dm_hadr_database_replica_state, Always On dashboard

 

 

Troubleshooting recovery queueing in an Always On availability group - SQL Server | Microsoft Learn

Availability group removed or AG replica removed    

 

 

Issue: Replica Unexpectedly Dropped in Availability Group | Microsoft Learn

Create Availability Group Fails With Error 35250 'Failed to join the database'

Endpoint not created or started, Endpoint connection issues, Name resolution issues, Endpoint permissions

 

DMVs, Telnet, Test-NetConnection

35250

 

MSSQLSERVER_35250 - SQL Server | Microsoft Learn

 

 

This is not an all-encompassing list of issues, but includes some common issues observed with Always On. Hope this helps!

 

Thanks to @Joseph Pilov  and Kathleen Carter for reviewing this article. 

 

Till next time,

Dharshana

2 Comments
Co-Authors
Version history
Last update:
‎Apr 27 2023 10:23 AM
Updated by: