How to configure the WSFC AG failover policy when the database level health detection enabled?
Published Aug 12 2020 06:33 AM 2,567 Views
Microsoft

The database level health detection failover option introduced on this article

https://docs.microsoft.com/en-us/sql/database-engine/availability-groups/windows/sql-server-always-o...

 

In addition to the existing checks, the new implementation has the following additional checks.

  1. The new implementation stores and uses a historical snapshot of the database state information to decide if the AG needs to be marked in error state or not. The health check routine caches the database state and associated error information, for the last three executions, which is then compared with the state information from the current execution of the health detection routine. If the same error condition (for the below mentioned error codes) exists in the four consecutive runs of the health detection routine, a failover is initiated. So for example if during the first run, let's say at 22:00:00 there is an error 823, and the same error conditions exists at the subsequent runs at 22:00:10, 22:00:20 and 22:00:30 then the AG is marked in an error state and the control is passed to the cluster. This implementation is intended to provide safeguards against transient errors and issues which can be fixed by the auto page repair capabilities of the availability groups.
  2. The new implementation checks for following additional errors. Majority of these errors are indicative of a hardware issues on the server. Please note, that this is not an exhaustive list of errors which could impact the database availability. There is an outstanding item to include error 824 to this list.

Error

Cause 

Documentation

605

Page or allocation corruption. 

https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-605-database-eng...

823

Checkpoint failures. 

https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-823-database-eng...

829

Disk corruption. 

 

832

Hardware or memory corruption. 

 

1101

No disk space available in a filegroup. 

https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1101-database-en...

1105

No disk space available in a filegroup. 

https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-1105-database-en...

5102

Missing filegroup ID requests. 

 

5180

Wrong file ID requests. 

 

5515

 

https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-5515-database-en...

5534

Log corruption due to FILESTREAM operation log record. 

 

5535

FILESTREAM data container corruption. 

 

9004

 Log Corruption

https://docs.microsoft.com/en-us/sql/relational-databases/errors-events/mssqlserver-9004-database-en...

 

If we enable this feature, to make sure AG can failover successfully, we need to change the default failover policy.

 

The default “max restarts in the specified Period =1 in 1 hour

The default “max failure in the specified Period” =1 in 6  hours

 
 
 

Based on this settings, if the 823 error reported but this error could not be repaired from the secondary replica:

 

  1. Detected 823 in 3 generation(30s)
  2. WSFC got the error state from the AG controller, restarted the AG resource.  –DB level error won’t prevent AG resource restart, the restart always works fine.
  3. The failure count +1, the restart count+1
  4. If continue detect 823 again—>AG offline. No failover because it reach to  “max failure in the specified Period” =1. no more action took.

 

Recommend setting : “max failure in the specified Period” >=“max restarts in the specified Period+1 at least. Then all restart attempt finished but the issue still is detected, next time failover will trigger.

 

Version history
Last update:
‎Aug 23 2020 09:25 PM
Updated by: