Do any of these issues sound familiar?
- My AG fails to failover
- Why did my AG failover?
- Why did my AG fail?
- My AG is missing!
- Why am I seeing stale data on secondary?
- What is with the sudden log growth alerts?
You are either hearing from end users that they are seeing stale data, or that they are losing connections to availability database or listener, or you have monitoring solutions in place to catch these proactively and mitigate but are looking for a root cause. In either scenario, it helps to understand the high-level architecture of Availability Groups and the troubleshooting resources available for these components.
1.0 Overview of Availability Group
A picture is worth a thousand words. The above picture depicts the different high-level components that are at play with Always On Availability Groups. This article attempts to give a high-level conceptual overview of the different components, a map to common failure troubleshooting articles, and components causing failure that will help you choose the right support topic when calling Microsoft support for further assistance. This article is not an all-encompassing map of all the issues that one could run into with AG, as there could be various issues based on specific environment and workloads. However, this is an attempt to include the common areas of failures and the logs to look at for narrowing the issue and component down.
2.0 Why WSFC with AG?
As you can see from the picture, in a Windows environment, when AG is deployed for HA, a Windows Server Failover Cluster (WSFC) is required. WSFC monitors the health of the AG cluster role and also controls operations such as offline/online of the AG cluster role.
WSFC component performs mainly the following functions:
Heartbeat messages are sent between cluster nodes to ensure that the cluster is in a healthy and consistent state. IPv6 is the default protocol that Failover Clustering will use for its heartbeats. The heartbeat itself is a UDP unicast network packet that communicates over Port 3343. Delay defines the frequency in seconds at which cluster heartbeats are sent between nodes (default = 1 second in same subnet and cross subnet). Threshold defines the number of heartbeats which are missed before the cluster takes recover action. (default = 10 heartbeats in same subnet, 20 in cross subnet). For more information, refer to: Tuning Failover Cluster Network Thresholds - Microsoft Community Hub
So, with a default delay of 1 second and default threshold of 10 heartbeats on a single subnet WSFC environment, what do you expect to see when a certain cluster node does not respond to heartbeats for 10 seconds? This node is removed from active failover cluster membership. You will see a 1135 event on your system log with a message that goes like" Cluster node 'N1' was removed from the active failover cluster membership... " For more details on how to troubleshoot this issue, refer to A problem with deleting a node | Microsoft Learn
Quorum mechanism is used to prevent split brain and also ensure a consistent and healthy state of WSFC. Cluster goes offline in the event of quorum loss. When cluster service shuts down due to quorum loss, Event ID 1177 is logged in event viewer under Microsoft-Windows-FailoverClustering with a message that goes like "The Cluster service is shutting down because quorum was lost. ... " For more information on troubleshooting this, please refer to : Event ID 1177 — Quorum and Connectivity Needed for Quorum | Microsoft Learn
This is a continuous handshake between SQL Server instance and SQL Server resource DLL to ensure that the AG resource looks alive. A lease timeout may result in the following errors:
2016-02-16 11:37:03.05 Server Error: 19419, Severity: 16, State: 1.
2016-02-16 11:37:03.05 Server Windows Server Failover Cluster did not receive a process event signal from SQL Server hosting availability group 'ag' within the lease timeout period.
2016-02-16 11:37:03.07 Server Error: 19407, Severity: 16, State: 1.
2016-02-16 11:37:03.07 Server The lease between availability group 'ag' and the Windows Server Failover Cluster has expired. A connectivity issue occurred between the instance of SQL Server and the Windows Server Failover Cluster. To determine whether the availability group is failing over correctly, check the corresponding availability group resource in the Windows Server Failover Cluster.
Other errors with lease timeout: Improved Always On Availability Group Lease Timeout Diagnostics - Microsoft Community Hub
How to diagnose lease timeout? Check this video for an end-to-end demo of troubleshooting a lease timeout scenario.
Two values control the Always On health check: FailureConditionLevel and HealthCheckTimeout. The FailureConditionLevel indicates the tolerance level to specific failure conditions reported by sp_server_diagnostics
and the HealthCheckTimeout configures the time the resource DLL can go without receiving an update fromsp_server_diagnostics
. The update interval forsp_server_diagnostics
is always HealthCheckTimeout / 3. If SQL Server does not respond with the results from executing sp_server_diagnostic within the HEALTH_CHECK_TIMEOUT (default is 30 sec), then the availability group will transition to RESOLVING state and failover if configured to do so. (Reference/Credit: Diagnose Unexpected Failover or Availability Group in RESOLVING State - Microsoft Community Hub)
3.0 How is AD/DNS involved?
When a cluster is created, a CNO (Cluster Name Object) is created in Active Directory. CNO has IP address(es) as dependencies and the CNO and IP are used for management and cluster operations. The listener also referred to as a Virtual Network Name (VNN) has a network name and IP address(es) and listener is also created in Active Directory. If the account that is used to create the cluster or listener on AD, does not have full permissions to perform the operation, CNO and listener can be pre-staged on AD.
Both the CNO network name and listener VNN network name are also present on DNS for mapping between the network name and IP address(es). Reference on DNS: DNS Registration with the Network Name Resource - Microsoft Community Hub
Some common issues on AD/DNS is permissions related -
Msg 19471, Level 16, State 0, Line 2
The WSFC cluster could not bring the Network Name resource with DNS name '<DNS name>' online. The DNS name may have been taken or have a conflict with existing name services, or the WSFC cluster service may not be running or may be inaccessible. Use a different DNS name to resolve name conflicts, or check the WSFC cluster log for more information.
Troubleshoot article: Create Listener Fails with Message 'The WSFC cluster could not bring the Network Name resource onlin...
Connection Timeouts in Multi-subnet Availability Group - Microsoft Community Hub
Connect to an availability group listener - SQL Server Always On | Microsoft Learn
Driver and client connectivity support for availability groups - SQL Server Always On | Microsoft Le... (This table shows the driver versions that have support for Always On parameters like Application Intent, multi-subnet failover, read-only routing etc.)
4.0 SQL Server and AG data movement
Data movement from the primary replica to the secondary replica(s) is done through log block transmission and redo process. Here is a high-level diagram of the synchronization process and the events that occur during synchronization. Latency in any step here could cause overall latencies with synchronization.
In order to learn more about how to troubleshoot latency issues, please refer to : Troubleshooting data movement latency between synchronous-commit Always On Availability Groups - Mic...
SQL Server Performance issues could also lead to AG failover, and synchronization latencies. Some common issues are:
5.0 My AG has an issue. Where do I look?
Issue | Result | Logs to look at | Common Error numbers | Causing Component | Articles |
Windows Cluster Issues (CNO, CNO IP, Listener, Listener IP, cluster service crash) | AG resource offline-online, failover | cluster.log, system log, Microsoft-Windows-FailoverClustering log, SQL Server errorlog |
1069 1207 1135 |
WSFC |
Troubleshooting resource for WSFC resource such as IP, disk failed/offline issues: Can't bring a clustered resource online troubleshooting guidance - Windows Server | Microsoft Learn
Troubleshooting resource for 1135 Troubleshooting cluster issue with Event ID 1135 | Microsoft Learn
Cluster System Log Event IDs and explanation Failover Clustering system log events | Microsoft Learn
|
Listener Creation Issues | Listener unavailable | cluster.log, system log, SQL Server errorlog |
Msg 19471, Level 16, State 0, Line 2
Msg 19476, Level 16, State 4, Line 2 |
WSFC and AD/DNS |
CNO of WSFC cluster must have Crate Compter Objects Permissions. Configure availability group listener - SQL Server Always On | Microsoft Learn
Troubleshooting listener creation failures and common issues |
Listener Connectivity Issues | Client connection failures, timeouts and errors | Client connection logs, network traces, SQL Server errorlog, cluster.log, system log |
0x80131904
0x80004005
"Login timeout expired."
|
Driver multisubnet parameter support, RegisterAllProviders cluster setting, DNS |
Listener connection times out - SQL Server | Microsoft Learn
Troubleshoot SQL Server connectivity issues Troubleshoot connectivity issues in SQL Server - SQL Server | Microsoft Learn
Connectivity Issues caused due to SQL Server Connectivity Problems Caused by Issues in SQL Server - Microsoft Community Hub
Connection timeouts in multi-subnet AG Connection Timeouts in Multi-subnet Availability Group - Microsoft Community Hub
|
Driver and client connectivity issues with AG | Client connection failures, timeouts and errors | Client side connection logs, driver logs, network traces, SQL Server errorlogs | Driver, SQL Server configuration, Networking | Driver and client connectivity support for availability groups - SQL Server Always On | Microsoft Le... | |
Lease Timeout | AG resource offline-online, failover | Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder |
19420 19422 19423 19424
|
OS not responding, low virtual memory, working set paging, SQL generating dump, pegged CPU, WSFC down (loss of quorum) |
How It Works: SQL Server Always On Lease Timeout - Microsoft Community Hub
Lease timeout common errors and corrective actions Improved Always On Availability Group Lease Timeout Diagnostics - Microsoft Community Hub |
Cluster Health Check TImeout | AG resource Offline-online or Failover, FCI restart/failover |
Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder |
On cluster.log [RES] SQL Server Availability Group <ag>: [hadrag] Resource Alive result 0 |
Failure conditions met, OS not responding, low virtual memory, working set trim, SQL generating dump, WSFC (loss of quorum), scheduler issues (dead locked schedulers) | Diagnose Unexpected Failover or Availability Group in RESOLVING State - Microsoft Community Hub |
Cluster Quorum Loss | AG in resolving state | Cluster log, System Log, Network traces |
1177 The Cluster service is shutting down because quorum was lost. |
Network connectivity issues, Node instability issues, OS not responding |
WSFC Quorum Modes and Voting Configuration (SQL Server) WSFC quorum modes & voting configuration - SQL Server Always On | Microsoft Learn
Troubleshooting Event ID: 1177 Event ID 1177 — Quorum and Connectivity Needed for Quorum | Microsoft Learn |
Session timeout | Secondary-disconnected | Cluster log, System Log, System Health events, Always On Health events, SQL Server errorlog, any dumps on SQL log folder |
35206 35201 35267
|
Network communication, Issues on secondary - down, OS not responding, resource contention |
Troubleshooting resource:
MSSQLSERVER_35267 - SQL Server | Microsoft Learn
|
Cluster service permissions on SQL Server | AG creation failure | SQL Server error log |
41131 |
SQL Server Permissions |
Error 41131 when creating availability group - SQL Server | Microsoft Learn |
Data synchronization latency due to log send queuing | Log file growth, low disk space alerts, Stale data in secondary replicas | Perfmon, sys.dm_hadr_database_replica_state, Always On dashboard |
9002 |
Troubleshooting long send queueing in an Always On availability group - SQL Server | Microsoft Learn
Error 9002 when transaction log is large - SQL Server | Microsoft Learn |
|
Data synchronization latency due to recovery queuing | Stale data | Perfmon, sys.dm_hadr_database_replica_state, Always On dashboard |
|
Troubleshooting recovery queueing in an Always On availability group - SQL Server | Microsoft Learn |
|
Availability group removed or AG replica removed |
|
Issue: Replica Unexpectedly Dropped in Availability Group | Microsoft Learn |
|||
Create Availability Group Fails With Error 35250 'Failed to join the database' |
Endpoint not created or started, Endpoint connection issues, Name resolution issues, Endpoint permissions
|
DMVs, Telnet, Test-NetConnection |
35250 |
MSSQLSERVER_35250 - SQL Server | Microsoft Learn
|
This is not an all-encompassing list of issues, but includes some common issues observed with Always On. Hope this helps!
Thanks to @Joseph_Pilov and Kathleen Carter for reviewing this article.
Till next time,
Dharshana
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.