Common Causes and Troubleshooting Solutions for SQL AG Data Synchronization Latency

Microsoft

Nov 14, 2021

This article summarizes the common causes , solutions and troubleshooting mechanism for SQL Availability Group (AG) data synchronization latency between primary and secondary for both synchronous-commit and asynchronous-commit mode.

The latency commonly happens at log harden phase or log redo phase. And sometimes, both. If it happens at log harden phase, you will likely to see HADR_SYNC_COMMIT wait type in your primary that waiting for your synchronous-commit secondary replica's acknowledgement. Also, your primary is likely to encounter latency on committing transactions sent from application because of this wait. If latency happens purely at log redo phase, you are likely to see a relatively low redo rate in your issued secondary replica with high value of redo queue.

The key point is to first narrow down the actual phase of workflow (attached at the bottom of this article as also mentioned in a few other blogs thanks to great work of Simon Su )in data synchronization that induces the latency.

For experienced DBA, if you have already narrowed down the latency to be caused by extensive HADR_SYNC_COMMIT wait type in your primary for your synchronous-commit secondary replica, you can refer to this article for resolving this bottleneck : https://techcommunity.microsoft.com/t5/sql-server/troubleshooting-high-hadr-sync-commit-wait-type-with-always-on/ba-p/385369

If you wish to capture detailed traces to narrow down the bottleneck, except for performance monitor counters where you can check log send and redo efficiency, blocking chain and its block header on primary replica, SQL AG dedicated X-events in both primary and issued secondary are also required to see the data synchronization workflow latency. We have an automatic tool "AGLatency Report Tool" to help you analyze the AG X-event data thanks to the great work of Simon Su . You can find this tool in this link with its user manual and the AG X-event script in this link

Below are the common causes and its solution/troubleshooting mechanism for SQL AG Data Synchronization Latency:

Long-running active transactions in read-only secondary replica ---- > When there is a long-running transaction in readable secondary replica, there can be blocking for the local redo thread in secondary replica when accessing the same data.
High network latency / low network throughput ----> When the network transmission between primary and the issued secondary is slow or instable, the log send rate will be slow and impact the amount of data being transmitted to secondary(refer doc https://docs.microsoft.com/en-us/previous-versions/sql/sql-server-2012/dn135335(v=sql.110)?redirectedfrom=MSDN)

Flow control induced high log send queue size and low log send rate ---- > When your SQL AG's performance triggered flow control gate for specific SQL AG databases or at SQL AG replica level, there would be temporary termination and limitation of sending log data from primary to secondary. Flow control can be commonly caused by server overloading or slow network. We need to resolve the conditions in your environment that cause the flow control if we have constantly observed flow control gate being entered by your AG. You can check below performance monitor counters to see if your SQL AG Database or SQL AG replica has ever triggered flow control gate:

AG Replica level

Availability Replica : Flow Control Time ,

Availability Replica : Flow Control/sec ,

AG Database Level

Database Replica: Database Flow Control Delay
Database Replica: Database Flow Controls/Sec

(For Flow Control Gate details as discussed below, please refer this link Monitor Performance for AlwaysOn Availability Groups | Microsoft Docs)
AlwaysOn Availability Groups is designed with flow control gates on the primary replica to avoid excessive resource consumption, such as network and memory resources, on all availability replicas. These flow control gates do not affect the synchronization health state of the availability replicas, but they can affect the overall performance of your availability databases, including RPO.

After the logs have been captured on the primary replica, they are subject to two levels of flow controls, as shown in the table below.

Level

Number of Gates

Number of messages

Useful Metrics

Transport

1 per availabiltiy replica

8192

Extended event database_transport_flow_control_action

Database

1 per availability database

11200 (x64)

1600 (x86)

DBMIRROR_SEND

Extended event hadron_database_flow_control_action

FLOW CONTROL GATES

Once the message threshold of either gate is reached, log messages are no longer sent to a specific replica or for a specific database. These can be sent once acknowledgement messages are received for the sent messages to bring the number of sent messages below the threshold. In addition to the flow control gates, there is another factor that can prevent the log messages from being sent. The synchronization of replicas ensures that the messages are sent and applied in the order of the log sequence numbers (LSN). Before a log message is sent, its LSN also checked against the lowest acknowledged LSN number to make sure that it is less than one of thresholds (depending on the message type). If the gap between the two LSN numbers is larger than the threshold, the messages are not sent. Once the gap is below the threshold again, the messages are sent.

Blocked REDO thread ---- When there are other issues or performance bottlenecks in secondary replica, the REDO thread can be blocked and cause latency of data synchronization. Refer document https://techcommunity.microsoft.com/t5/sql-server/alwayson-minimizing-blocking-of-redo-thread-when-running/ba-p/383963
Shared REDO Target ----- This issue only happens when you have multiple secondary replicas. When one secondary replica is very slow in finishing the redo process, it will limit the other secondary replicas in proceeding further on the maximum allowable LSN with the redo process. Trace flag 9559 can be used to mitigate the issue. (refer doc https://docs.microsoft.com/en-us/archive/blogs/alwaysonpro/recovery-on-secondary-lagging-shared-redo-target)
Running out of parallel redo threads in secondary replica --- When secondary replica is running out of parallel redo threads, you can use TF3478 to allow maximal number of parallel redo thread to increase with total number of CPUs. By default, A SQL Server instance uses up to 100 threads for parallel redo for secondary replicas. Each database uses up to one-half of the total number of CPU cores, but not more than 16 threads per database. If the total number of required threads for a single instance exceeds 100, SQL Server uses a single redo thread for every remaining database.
Disk or I/O subsystem latency -- > when the disk or I/O subsystem in the secondary replica has critical bottleneck, the redo rate will be impacted and quite small in secondary replica. We are less likely to see it in your secondary before it hit on primary if your secondary replica is not readable and has comparable hardware for SQL database files' storage as that in primary and dedicated for this SQL AG's usage. However, if your secondary replica has high volume of read-only workload, it can bring extra IO overhead to your IO subsystem. Frequent database backup and VM snapshot can be another potential cause of extra I/O overhead.