How to configure the WSFC AG failover policy when the database level health detection enabled?

Microsoft

Aug 12, 2020

The database level health detection failover option introduced on this article

https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/sql-server-always-on-database-health-detection-failover-option?view=sql-server-ver15

In addition to the existing checks, the new implementation has the following additional checks.

The new implementation stores and uses a historical snapshot of the database state information to decide if the AG needs to be marked in error state or not. The health check routine caches the database state and associated error information, for the last three executions, which is then compared with the state information from the current execution of the health detection routine. If the same error condition (for the below mentioned error codes) exists in the four consecutive runs of the health detection routine, a failover is initiated. So for example if during the first run, let's say at 22:00:00 there is an error 823, and the same error conditions exists at the subsequent runs at 22:00:10, 22:00:20 and 22:00:30 then the AG is marked in an error state and the control is passed to the cluster. This implementation is intended to provide safeguards against transient errors and issues which can be fixed by the auto page repair capabilities of the availability groups.
The new implementation checks for following additional errors. Majority of these errors are indicative of a hardware issues on the server. Please note, that this is not an exhaustive list of errors which could impact the database availability. There is an outstanding item to include error 824 to this list.

Error	Cause	Documentation
605	Page or allocation corruption.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-605-database-engine-error?view=sql-server-2017
823	Checkpoint failures.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-823-database-engine-error?view=sql-server-2017
829	Disk corruption.
832	Hardware or memory corruption.
1101	No disk space available in a filegroup.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1101-database-engine-error?view=sql-server-2017
1105	No disk space available in a filegroup.	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1105-database-engine-error?view=sql-server-2017
5102	Missing filegroup ID requests.
5180	Wrong file ID requests.
5515		https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-5515-database-engine-error?view=sql-server-2017
5534	Log corruption due to FILESTREAM operation log record.
5535	FILESTREAM data container corruption.
9004	Log Corruption	https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-9004-database-engine-error?view=sql-server-2017

If we enable this feature, to make sure AG can failover successfully, we need to change the default failover policy.

The default “max restarts in the specified Period =1 in 1 hour

The default “max failure in the specified Period” =1 in 6 hours

Based on this settings, if the 823 error reported but this error could not be repaired from the secondary replica:

Detected 823 in 3 generation(30s)
WSFC got the error state from the AG controller, restarted the AG resource. –DB level error won’t prevent AG resource restart, the restart always works fine.
The failure count +1, the restart count+1
If continue detect 823 again—>AG offline. No failover because it reach to “max failure in the specified Period” =1. no more action took.

Recommend setting : “max failure in the specified Period” >=“max restarts in the specified Period+1 at least. Then all restart attempt finished but the issue still is detected, next time failover will trigger.

Updated Aug 24, 2020

Version 2.0

Microsoft

Joined March 29, 2019

View Profile

SQL Server Support Blog

Follow this blog board to get notified when there's new activity