This article summarizes the common causes , solutions and troubleshooting mechanism for SQL Availability Group (AG) data synchronization latency between primary and secondary for both synchronous-commit and asynchronous-commit mode.
The latency commonly happens at log harden phase or log redo phase. And sometimes, both. If it happens at log harden phase, you will likely to see HADR_SYNC_COMMIT wait type in your primary that waiting for your synchronous-commit secondary replica's acknowledgement. Also, your primary is likely to encounter latency on committing transactions sent from application because of this wait. If latency happens purely at log redo phase, you are likely to see a relatively low redo rate in your issued secondary replica with high value of redo queue.
The key point is to first narrow down the actual phase of workflow (attached at the bottom of this article as also mentioned in a few other blogs thanks to great work of @Simon Su )in data synchronization that induces the latency.
If you wish to capture detailed traces to narrow down the bottleneck, except for performance monitor counters where you can check log send and redo efficiency, blocking chain and its block header on primary replica, SQL AG dedicated X-events in both primary and issued secondary are also required to see the data synchronization workflow latency. We have an automatic tool "AGLatency Report Tool" to help you analyze the AG X-event data thanks to the great work of @Simon Su . You can find this tool in this link with its user manual and the AG X-event script in this link
Below are the common causes and its solution/troubleshooting mechanism for SQL AG Data Synchronization Latency:
Long-running active transactions in read-only secondary replica ---- > When there is a long-running transaction in readable secondary replica, there can be blocking for the local redo thread in secondary replica when accessing the same data.
Flow control induced high log send queue size and low log send rate ---- >When your SQL AG's performance triggered flow control gate for specific SQL AG databases or at SQL AG replica level, there would be temporary termination and limitation of sending log data from primary to secondary. Flow control can be commonly caused by server overloading or slow network. We need to resolve the conditions in your environment that cause the flow control if we have constantly observed flow control gate being entered by your AG. You can check below performance monitor counters to see if your SQL AG Database or SQL AG replica has ever triggered flow control gate:
(For Flow Control Gate details as discussed below, please refer this link Monitor Performance for AlwaysOn Availability Groups | Microsoft Docs) AlwaysOn Availability Groups is designed with flow control gates on the primary replica to avoid excessive resource consumption, such as network and memory resources, on all availability replicas. These flow control gates do not affect the synchronization health state of the availability replicas, but they can affect the overall performance of your availability databases, including RPO.
After the logs have been captured on the primary replica, they are subject to two levels of flow controls, as shown in the table below.
Once the message threshold of either gate is reached, log messages are no longer sent to a specific replica or for a specific database. These can be sent once acknowledgement messages are received for the sent messages to bring the number of sent messages below the threshold. In addition to the flow control gates, there is another factor that can prevent the log messages from being sent. The synchronization of replicas ensures that the messages are sent and applied in the order of the log sequence numbers (LSN). Before a log message is sent, its LSN also checked against the lowest acknowledged LSN number to make sure that it is less than one of thresholds (depending on the message type). If the gap between the two LSN numbers is larger than the threshold, the messages are not sent. Once the gap is below the threshold again, the messages are sent.
Running out of parallel redo threads in secondary replica --- When secondary replica is running out of parallel redo threads, you can use TF3478 to allow maximal number of parallel redo thread to increase with total number of CPUs. By default, A SQL Server instance uses up to 100 threads for parallel redo for secondary replicas. Each database uses up to one-half of the total number of CPU cores, but not more than 16 threads per database. If the total number of required threads for a single instance exceeds 100, SQL Server uses a single redo thread for every remaining database.
Disk or I/O subsystem latency -- > when the disk or I/O subsystem in the secondary replica has critical bottleneck, the redo rate will be impacted and quite small in secondary replica. We are less likely to see it in your secondary before it hit on primary if your secondary replica is not readable and has comparable hardware for SQL database files' storage as that in primary and dedicated for this SQL AG's usage. However, if your secondary replica has high volume of read-only workload, it can bring extra IO overhead to your IO subsystem. Frequent database backup and VM snapshot can be another potential cause of extra I/O overhead.