Forum Discussion
Availability Group failover issue
Hello,
We ran into an issue recently that caused a three-node availability group to fail. We're running SQL Server 2019 (RTM-CU32-GDR) on Windows Server 2019 Standard (version 1809). We're trying to understand why the availability group wasn't able to start when the cluster automatically failed it over to another node.
First, here is the state/configuration of the availability group just before the failure:
- Node A - primary replica; synchronization state = synchronized; availability mode = synchronous commit; failover mode = automatic
- Node B - secondary replica; synchronization state = synchronized; availability mode = synchronous commit; failover mode = automatic
- Node C - secondary replica; synchronization state = synchronized; availability mode = synchronous commit; failover mode = manual
Here's the abbreviated summary of events:
- An I/O error occurred on node A. Specifically, it was a checksum mismatch when reading one of the tempdb files. This caused the availability group cluster resource to fail.
- The cluster tried unsuccessfully to restart the AG on node A, so it failed the AG over to node B. The AG failed to come online on node B.
- The cluster failed back the AG to node A and left it in a failed state.
- We manually failed over the AG to node B a few minutes later and it was successful.
When the cluster tried to automatically fail over the AG to node B, the only error we found that indicates why the AG couldn't start is: "SQL Server Availability Group <AG1>: [hadrag] ODBC Error: [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The availability replica for availability group 'AG1' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss)."
Since all nodes were in synchronous-commit mode, the above error implies that the secondary databases on node B were either not joined or not synchronized. This is concerning, and we're worried that auto-failover is not as reliable as we expected. Does anyone have suggestions for how to ensure the secondary node is ready and able to become primary when the AG fails on the original primary node? Or any settings/configuration we should change? The only corrective action we have taken so far is to increase "maximum failures in the specified period" from the default of 2 to 6 (per the recommendation at https://learn.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/hadr-cluster-best-practices?view=azuresql&tabs=windows2012#relaxed-monitoring) and to decrease the failover period from the default of 6 hours to 2 hours.
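For reference, here's the PowerShell equivalent of the change we made in Failover Cluster Manager (a rough sketch; "AG1" is our clustered AG role name, so adjust for your environment):

# Get the clustered role that hosts the availability group
$ag = Get-ClusterGroup -Name "AG1"

# Allow up to 6 restart/failover attempts in the monitoring window (was 2)
$ag.FailoverThreshold = 6

# Shrink the monitoring window from 6 hours to 2 (value is in hours)
$ag.FailoverPeriod = 2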
Any suggestions or insights would be greatly appreciated.
6 Replies
- lyradaven (Copper Contributor)
This is a solid breakdown of the issue, and you’ve already done some of the right troubleshooting steps. The error message during the attempted failover strongly suggests that, despite both replicas showing as synchronized beforehand, at the moment of failover one or more databases on Node B were not in a clean, synchronized state or hadn’t fully joined the AG. This can occasionally happen when synchronization health looks fine on the surface but the latest log records haven’t fully hardened on the secondary.
A few things to consider:
✅ Regularly validate database join/sync state – Get-DbaAgDatabase (from dbatools) or the sys.dm_hadr_database_replica_states DMV can give a more accurate picture of database readiness than the AG dashboard alone (see the query sketch after this list).
🛠️ Check for tempdb / I/O anomalies – Since the trigger was an I/O error in tempdb, it’s worth validating your storage subsystem for intermittent corruption or latency, which can ripple into the AG’s stability.
🔄 Health detection & failover sensitivity – The cluster may try to fail over too quickly before the secondary is fully ready. Adjusting failure conditions, as you’ve already started with “maximum failures” and failover period, can reduce false or premature failovers.
🕵️ Verbose logging during failover attempts – Enable the detailed HADR trace flags or review the AlwaysOn_health extended events session to catch the precise database state at failover time.
🧩 Node C configuration – Since Node C is manual failover only, ensure quorum and witness settings are optimal. Sometimes failback behavior is influenced by how the cluster perceives node votes.
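To make the first point concrete, here's a minimal T-SQL sketch of the readiness check (run it from any replica; sys.dm_hadr_database_replica_states only reports remote replicas from the primary, hence the LEFT JOIN):

-- Joined / failover-ready state per database per replica (WSFC perspective),
-- plus the synchronization state where the local instance can see it
SELECT ar.replica_server_name,
       drcs.database_name,
       drcs.is_database_joined,
       drcs.is_failover_ready,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drcs.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
    ON  drs.replica_id = drcs.replica_id
    AND drs.group_database_id = drcs.group_database_id;

Any row with is_failover_ready = 0 for Node B at the moment of failover would explain exactly the error you saw.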
In short, the issue doesn’t mean auto-failover is unreliable, but rather that the secondary replica wasn’t 100% failover-ready in that moment. Proactive health checks, storage validation, and cluster tuning should reduce the chance of this happening again.
AGs are quite reliable, but this kind of thing can happen. Is there an error log that points to why the cluster failed back from B to A and why the AG didn't start there?
My suggestion would be to take a maintenance window and try switching roles through all the nodes. Also run the cluster validation process to see whether it raises any warnings.
- rj452tm (Copper Contributor)
I think the failback from B to A occurred because the AG failed to come up on B. At the time, we were using the default value of the "maximum failures in the specified period" setting in Failover Cluster Manager. We have since increased it from 2 to 6 so that, after a future failure, the cluster will make additional attempts to bring the AG online on the failover partner.
We have also manually failed over the AG between all three nodes without issue, and the Cluster Validation Report shows all tests passed.
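For completeness, our monthly test is just the planned failover statement run on whichever secondary should take over (a sketch with our AG name):

-- Run on the target secondary; this is a planned, no-data-loss failover,
-- and it errors out if the local replica isn't synchronized/failover-ready
ALTER AVAILABILITY GROUP AG1 FAILOVER;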
- bandaruajeyudu (Brass Contributor)
It looks like Node B wasn’t fully ready when the auto-failover happened, even though it showed as synchronized. SQL Server requires all databases to be failover-ready, and if even one lags or isn’t joined, the switch won’t happen. The best checks are is_failover_ready, the flexible failover policy (FAILURE_CONDITION_LEVEL / HEALTH_CHECK_TIMEOUT), and regular manual failover tests to confirm the secondaries can take over.
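A quick sketch for inspecting and, if needed, adjusting that policy (run the ALTER statements on the primary; AG1 is the group name from the original post, and the values shown are the documented defaults):

-- Current flexible failover policy per availability group
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Level 3 (the default) fails over on critical server errors;
-- HEALTH_CHECK_TIMEOUT is in milliseconds (default 30000)
ALTER AVAILABILITY GROUP AG1 SET (FAILURE_CONDITION_LEVEL = 3);
ALTER AVAILABILITY GROUP AG1 SET (HEALTH_CHECK_TIMEOUT = 30000);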
MS Article: https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/configure-flexible-automatic-failover-policy?view=sql-server-ver17
- rj452tm (Copper Contributor)
I think you're on the right track. We perform a failover at least once a month and have never had this issue. Plus the manual failover was successful after the auto failover failed. So this seems to be specific to auto failover.
I think that after the AG initially failed on node A, the cluster tried to restart it; it briefly came back up on node A, failed again, and the cluster then attempted the auto-failover to node B. During that short window when the AG was back up on node A, I suspect the secondary replica on node B wasn't able to reconnect and fully resynchronize before the AG failed again. So when the cluster tried to fail it over to node B, node B wasn't synchronized. Is that possible? I'm not entirely confident in this theory because it doesn't explain how node B eventually became failover-ready for the successful manual failover.
In any case, I'm still left with little confidence that an auto-failover in the event of a similar issue will work correctly. The safety we were hoping to achieve by keeping all nodes in synchronous commit mode didn't pay off and I don't know of anything to change to ensure it will work in the future.
- SivertSolem (Iron Contributor)
Also verify that all of the databases in the availability group actually exist on node B.
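A quick sketch of that check, run on node B (sys.availability_databases_cluster lists every database that belongs to an AG on the cluster):

-- AG databases with no matching local database on this instance
SELECT adc.database_name
FROM sys.availability_databases_cluster AS adc
LEFT JOIN sys.databases AS d
    ON d.name = adc.database_name
WHERE d.name IS NULL;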