How It Works: Failover Cluster/Availability Group XEL Logging Frequency
Published Jan 15 2019 02:20 PM 976 Views
Microsoft
First published on MSDN on Apr 04, 2012

I had a great question from my post on sp_server_diagnostics ( http://blogs.msdn.com/b/psssql/archive/2012/03/08/sql-server-2012-true-black-box-recorder.aspx ... ).  As ' luck ' would have it, Bob Ward, was working on some training and had a similar question.  As I investigated I uncovered a few details that can help us all.

Connection Scope

As soon as the Failover Cluster Instance (FCI) or Availability Group is signaled by the cluster manager to come 'ONLINE', a connection is established to SQL Server.  The connection is persistent, just reading the next results set, one after the other, from the sp_server_diagnostics output stream.

Result Set Interval Response

The sp_server_diagnostics interval parameter is controlled by the HealthCheckTimeout value.

Formula: Interval = HealthCheckTimeout / 3;                           //  1/3 of the Health check timeout for the result set response interval

XEL Logging Interval

Here is what confused me the first time I looked at the .\LOG*SQLDIAG*.xel trace for the component_health_result events.

If I establish a performance monitor trace, capturing every 10 seconds, and found a gap of 100 seconds I would troubleshoot the issue as an overall system, responsiveness problem.   This is not the case for gaps in this XEL data.

The messages don't always show up at the 1/3 interval boundaries.   Take the example shown above, the results are returning on 10 second intervals.  Failure to return the result set, on the intervals, could trigger the Server Hang failover level and you may see additional INFO messages logged.

The logic is such that it tries to conserve space in the XEL file .  For example, if the state is CLEAN and the last state was CLEAN the logging may choose to skip output of the duplicate event data.   In the case of our example, the specific code I am studying today could skip the output ~10 times (100 sec) before recording the CLEAN state in the XEL file again.  All it means is the state didn't change, it was CLEAN the entire time.   If any state changes the events are always output.

Don't make the assumption that a gap in the events directly indicates a level of system instability.

Changing Interval

When you change the properties the resource dll (hadrres.dll) establishes a second connection and starts executing sp_server_diagnostics as the new interval.   Once the connection has been successfully established and the query started the previous connection is disconnected.  This insures we never have a monitoring gap.

If the second connection can't be established the original connection continues monitoring with the prior settings.  This makes sure your server is always being monitored for failover conditions.

Bob Dorr - Principal SQL Server Escalation Engineer

Version history
Last update:
‎Jan 15 2019 02:21 PM
Updated by: