Forum Discussion
Availability Group failover issue
Hello,
We ran into an issue recently that caused a three-node availability group to fail. We're running SQL Server 2019 (RTM-CU32-GDR) on Windows Server 2019 Standard (version 1809). We're trying to understand why the availability group wasn't able to start when the cluster automatically failed it over to another node.
First, here is the state/configuration of the availability group just before the failure:
- Node A - primary replica; synchronization state = synchronized; availability mode = synchronous commit; failover mode = automatic
- Node B - secondary replica; synchronization state = synchronized; availability mode = synchronous commit; failover mode = automatic
- Node C - secondary replica; synchronization state = synchronized; availability mode = synchronous commit; failover mode = manual
Here's the abbreviated summary of events:
- An I/O error occurred on node A. Specifically, it was a checksum mismatch when reading one of the tempdb files. This caused the availability group cluster resource to fail.
- The cluster tried unsuccessfully to restart the AG on node A, so it failed the AG over to node B. The AG failed to come online on node B.
- The cluster failed back the AG to node A and left it in a failed state.
- We manually failed over the AG to node B a few minutes later and it was successful.
When the cluster tried to automatically fail over the AG to node B, the only error we found that indicates why the AG couldn't start is: "SQL Server Availability Group <AG1>: [hadrag] ODBC Error: [42000] [Microsoft][SQL Server Native Client 11.0][SQL Server]The availability replica for availability group 'AG1' on this instance of SQL Server cannot become the primary replica. One or more databases are not synchronized or have not joined the availability group. If the availability replica uses the asynchronous-commit mode, consider performing a forced manual failover (with possible data loss). Otherwise, once all local secondary databases are joined and synchronized, you can perform a planned manual failover to this secondary replica (without data loss)."
Since all nodes were in synchronous-commit mode, the above error implies that the secondary databases on node B were either not joined or not synchronized. This is concerning, and we're worried that auto-failover is not as reliable as we expected. Does anyone have suggestions for how to ensure the secondary node is ready and able to become primary when the AG fails on the original primary node? Or any settings/configuration we should change? The only corrective action we have taken so far is to increase "maximum failures in the specified period" from the default of 2 to 6 (per the recommendation at https://learn.microsoft.com/en-us/azure/azure-sql/virtual-machines/windows/hadr-cluster-best-practices?view=azuresql&tabs=windows2012#relaxed-monitoring) and to decrease the failover period from the default of 6 hours to 2 hours.
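For reference, here's the PowerShell equivalent of the change we made in Failover Cluster Manager (a rough sketch; "AG1" is our clustered AG role name, so adjust for your environment):

# Get the clustered role that hosts the availability group
$ag = Get-ClusterGroup -Name "AG1"

# Allow up to 6 restart/failover attempts in the monitoring window (was 2)
$ag.FailoverThreshold = 6

# Shrink the monitoring window from 6 hours to 2 (value is in hours)
$ag.FailoverPeriod = 2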
Any suggestions or insights would be greatly appreciated.
6 Replies
- lyradaven (Copper Contributor)
This is a solid breakdown of the issue, and you’ve already done some of the right troubleshooting steps. The error message during the attempted failover strongly suggests that, despite both replicas showing as synchronized beforehand, at the moment of failover one or more databases on Node B were not in a clean, synchronized state or hadn’t fully joined the AG. This can occasionally happen when synchronization health looks fine on the surface but the latest log records haven’t fully hardened on the secondary.
A few things to consider:
✅ Regularly validate database join/sync state – Get-DbaAgDatabase (from dbatools) or the sys.dm_hadr_database_replica_states DMV can give a more accurate picture of database readiness than the AG dashboard alone (see the query sketch after this list).
🛠️ Check for tempdb / I/O anomalies – Since the trigger was an I/O error in tempdb, it’s worth validating your storage subsystem for intermittent corruption or latency, which can ripple into the AG’s stability.
🔄 Health detection & failover sensitivity – The cluster may try to fail over too quickly before the secondary is fully ready. Adjusting failure conditions, as you’ve already started with “maximum failures” and failover period, can reduce false or premature failovers.
🕵️ Verbose logging during failover attempts – Enable the detailed HADR trace flags or review the AlwaysOn_health extended events session to catch the precise database state at failover time.
🧩 Node C configuration – Since Node C is manual failover only, ensure quorum and witness settings are optimal. Sometimes failback behavior is influenced by how the cluster perceives node votes.
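To make the first point concrete, here's a minimal T-SQL sketch of the readiness check (run it from any replica; sys.dm_hadr_database_replica_states only reports remote replicas from the primary, hence the LEFT JOIN):

-- Joined / failover-ready state per database per replica (WSFC perspective),
-- plus the synchronization state where the local instance can see it
SELECT ar.replica_server_name,
       drcs.database_name,
       drcs.is_database_joined,
       drcs.is_failover_ready,
       drs.synchronization_state_desc
FROM sys.dm_hadr_database_replica_cluster_states AS drcs
JOIN sys.availability_replicas AS ar
    ON ar.replica_id = drcs.replica_id
LEFT JOIN sys.dm_hadr_database_replica_states AS drs
    ON  drs.replica_id = drcs.replica_id
    AND drs.group_database_id = drcs.group_database_id;

Any row with is_failover_ready = 0 for Node B at the moment of failover would explain exactly the error you saw.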
In short, the issue doesn’t mean auto-failover is unreliable, but rather that the secondary replica wasn’t 100% failover-ready in that moment. Proactive health checks, storage validation, and cluster tuning should reduce the chance of this happening again.
AGs are quite reliable, but this kind of thing can happen. Is there an error log that points to why the cluster failed back from B to A and why the AG didn't start there?
My suggestion would be to take a maintenance window and try switching roles through all the nodes. Also run the cluster validation process to see whether it raises any warnings.
- rj452tm (Copper Contributor)
I think the failback from B to A occurred because the AG failed to come up on B. At the time, we were using the default value of the "maximum failures in the specified period" setting in Failover Cluster Manager. We have since increased it from 2 to 6 so that, after a future failure, the cluster will make additional attempts to bring the AG online on the failover partner.
We have also manually failed over the AG between all three nodes without issue, and the Cluster Validation Report shows all tests passed.
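For completeness, our monthly test is just the planned failover statement run on whichever secondary should take over (a sketch with our AG name):

-- Run on the target secondary; this is a planned, no-data-loss failover,
-- and it errors out if the local replica isn't synchronized/failover-ready
ALTER AVAILABILITY GROUP AG1 FAILOVER;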
- bandaruajeyudu (Brass Contributor)
It looks like Node B wasn’t fully ready when the auto-failover happened, even though it showed as synchronized. SQL Server requires all databases to be failover-ready, and if even one lags or isn’t joined, the switch won’t happen. The best checks are is_failover_ready, the flexible failover policy (FAILURE_CONDITION_LEVEL / HEALTH_CHECK_TIMEOUT), and regular manual failover tests to confirm the secondaries can take over.
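A quick sketch for inspecting and, if needed, adjusting that policy (run the ALTER statements on the primary; AG1 is the group name from the original post, and the values shown are the documented defaults):

-- Current flexible failover policy per availability group
SELECT name, failure_condition_level, health_check_timeout
FROM sys.availability_groups;

-- Level 3 (the default) fails over on critical server errors;
-- HEALTH_CHECK_TIMEOUT is in milliseconds (default 30000)
ALTER AVAILABILITY GROUP AG1 SET (FAILURE_CONDITION_LEVEL = 3);
ALTER AVAILABILITY GROUP AG1 SET (HEALTH_CHECK_TIMEOUT = 30000);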
MS Article: https://learn.microsoft.com/en-us/sql/database-engine/availability-groups/windows/configure-flexible-automatic-failover-policy?view=sql-server-ver17
- rj452tm (Copper Contributor)
I think you're on the right track. We perform a failover at least once a month and have never had this issue. Plus the manual failover was successful after the auto failover failed. So this seems to be specific to auto failover.
I think that after the AG initially failed on node A, the cluster tried to restart it; it briefly came back up on node A, failed again, and the cluster then attempted the auto-failover to node B. During that short window when the AG was back up on node A, I suspect the secondary replica on node B wasn't able to reconnect and fully resynchronize before the AG failed again. So when the cluster tried to fail it over to node B, node B wasn't synchronized. Is that possible? I'm not entirely confident in this theory because it doesn't explain how node B eventually became failover-ready for the successful manual failover.
In any case, I'm still left with little confidence that an auto-failover in the event of a similar issue will work correctly. The safety we were hoping to achieve by keeping all nodes in synchronous commit mode didn't pay off and I don't know of anything to change to ensure it will work in the future.
- SivertSolem (Iron Contributor)
Also verify that all of the databases in the availability group actually exist on node B.
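A quick sketch of that check, run on node B (sys.availability_databases_cluster lists every database that belongs to an AG on the cluster):

-- AG databases with no matching local database on this instance
SELECT adc.database_name
FROM sys.availability_databases_cluster AS adc
LEFT JOIN sys.databases AS d
    ON d.name = adc.database_name
WHERE d.name IS NULL;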