How to isolate and troubleshoot synchronize latency issue between master and replica for MySQL

Microsoft

Apr 22, 2020

In most cases, latency between master and replicas is caused by performance issues. Therefore, when investigating a synchronization lag, start by checking if the workload on the master has increased. In this article, you'll learn how to investigate and isolate the possible causes of replication latency.

Before starting, it's critical to understand how replication in MySQL works. While MySQL supports different types of data synchronization, Azure Database for MySQL only supports asynchronous replication (applies to both data-in replication and read replicas). With asynchronous replication, one server acts as a master and one or more other servers act as replicas (a max of 5 replicas is supported).

When replication starts, a master server writes replicating events into the binary log (which only records the committed transactions). This is why enabling the binary log is required when configuring replication in the first place.

On replica servers, there are two threads per replica, one called the IO thread and the other called the SQL thread.

The IO thread connects to the master server and requests updated binary logs. After this thread receives the binary log updates, they are saved on a replica server, in a local log called the relay log.
The SQL thread reads the relay log and apply the data change(s) on replica servers.

This process is shown in the following diagram:

You can monitor the replication status from the Azure portal metrics blade. When latency occurs, an output of [SHOW SLAVE STATUS;] gives you enough information to understand what may cause the latency.

A sample output is shown below:

*************************** 1. row ***************************
Slave_IO_State: Waiting for master to send event
Master_Host: testserver.mysql.database.azure.com
Master_User: replicationuser@testserver
Master_Port: 3306
Connect_Retry: 60
Master_Log_File: mysql-bin.000191
Read_Master_Log_Pos: 103978138
Relay_Log_File: relay_bin.000568
Relay_Log_Pos: 13469095
Relay_Master_Log_File: mysql-bin.000191
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Replicate_Do_DB: 
Replicate_Ignore_DB: 
Replicate_Do_Table: 
Replicate_Ignore_Table: mysql.plugin
Replicate_Wild_Do_Table: 
Replicate_Wild_Ignore_Table: mysql.\\_\\_%
Last_Errno: 0
Last_Error: 
Skip_Counter: 0
Exec_Master_Log_Pos: 13468882
Relay_Log_Space: 103978599
Until_Condition: None
Until_Log_File: 
Until_Log_Pos: 0
Master_SSL_Allowed: Yes
Master_SSL_CA_File: c:\\work\\azure_mysqlservice.pem
Master_SSL_CA_Path: 
Master_SSL_Cert: c:\\work\\azure_mysqlclient_cert.pem
Master_SSL_Cipher: 
Master_SSL_Key: c:\\work\\azure_mysqlclient_key.pem
Seconds_Behind_Master: 648
Master_SSL_Verify_Server_Cert: No
Last_IO_Errno: 0
Last_IO_Error: 
Last_SQL_Errno: 0
Last_SQL_Error: 
Replicate_Ignore_Server_Ids: 
Master_Server_Id: 943553508
Master_UUID: eba67879-7232-11e9-b113-6c4d3f52df68
Master_Info_File: mysql.slave_master_info
SQL_Delay: 0
SQL_Remaining_Delay: NULL
Slave_SQL_Running_State: Waiting for dependent transaction to commit
Master_Retry_Count: 86400
Master_Bind: 
Last_IO_Error_Timestamp: 
Last_SQL_Error_Timestamp: 
Master_SSL_Crl: 
Master_SSL_Crlpath: 
Retrieved_Gtid_Set: 
Executed_Gtid_Set: 
Auto_Position: 0
Replicate_Rewrite_DB: 
Channel_Name: 
Master_TLS_Version: 
1 row in set (0.00 sec)

The output contains a lot of information, but normally it's only important to focus on the following columns:

Slave_IO_State: This tells you the current status of the IO thread. Normally, the status is "Waiting for master to send event" if it is synchronizing. However, if you see a status such as "Connecting to master", then the replica has lost the connection to the master server. Please check if the master is running or if a firewall is blocking the connection.
Master_Log_File: This tells you the binary log file to which the master is writing.
Read_Master_Log_Pos: This represents the position in the above binary log file in which the master is writing.
Relay_Master_Log_File: This represents the binary log file that the replica server is reading from the master.
Slave_IO_Running: This indicates whether the IO thread is running. It should be "Yes". If "NO", then most likely the replication is broken.
Slave_SQL_Running: This indicates whether the SQL thread is running. It should be "Yes". If "NO", then most likely the replication is broken.
Exec_Master_Log_Pos: This tells you which position of the above Relay_Master_Log_File the replica is applying. If there is latency, this position sequence should be smaller than Read_Master_Log_Pos.
Relay_Log_Space: The upper limit of relay log size. You can check the size by querying [show global variables like "relay_log_space_limit";].
Seconds_Behind_Master: This shows the replication latency, in seconds.
Last_IO_Errno: This shows the IO thread error code, if any. For more information about these codes, see the MySQL docs: https://dev.mysql.com/doc/refman/5.7/en/server-error-reference.html.
Last_IO_Error: This shows the IO thread error message, if any.
Last_SQL_Errno: This shows the SQL thread error code, if any. For more information about these codes, see the MySQL docs: https://dev.mysql.com/doc/refman/5.7/en/server-error-reference.html.
Last_SQL_Error: This shows the SQL thread error message, if any.
Slave_SQL_Running_State: This indicates the current SQL thread status. Please note that "System lock" shown in this state is a normal behavior. As the example above shows, "Waiting for dependent transaction to commit", which means that it is waiting for the master to update committed transactions, is also normal.

Now let's see how to use the above information to isolate the latency cause.

If the below scenario met, then it is most likely that the cause is network latency. As seen, the IO thread is running and waiting on the master, and the master has already written to binary log file #20 while replica can only get file #10. Since the IO thread connects to master via TCP/IP, the only impact factor here is the network speed.

Slave_IO_State: Waiting for master to send event
Master_Log_File: the binary file sequence is larger then Relay_Master_Log_File, e.g. mysql-bin.00020
Relay_Master_Log_File: the file sequence is smaller then above, e.g. mysql-bin.00010

If the below scenario met, then it is most likely that the workload on the master is too heavy and is burdening the replica. For the case below, though the replica can retrieve the binary log behind the master, the replica IO thread indicates that the relay log space is full already. So network speed isn't causing the delay, because the replica has already been trying to catch up as much as it can. Instead, the updated binary log size exceeds the upper limit of the relay log space.

Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space
Master_Log_File: the binary file sequence is larger then Relay_Master_Log_File, e.g. mysql-bin.00020
Relay_Master_Log_File: the file sequence is smaller then above, e.g. mysql-bin.00010

If the below scenario met, then it is most likely that the replica somehow is getting slower. For below cases, both the IO and SQL threads are running well and the replica is reading the same binary log file as the master writes. However, some latency still occurs. Normally, this could be caused by the slave server itself; check to see if the CPU consumption or IOPS on the replica server is high.

Slave_IO_State: Waiting for master to send event
Master_Log_File: The binary log file sequence equals to Relay_Master_Log_File, e.g. mysql-bin.000191
Read_Master_Log_Pos: The position of master server written to the above file is larger than Relay_Log_Pos, e.g. 103978138
Relay_Master_Log_File: mysql-bin.000191
Slave_IO_Running: Yes
Slave_SQL_Running: Yes
Exec_Master_Log_Pos: The position of slave reads from master binary log file is smaller than Read_Master_Log_Pos, e.g. 13468882
Seconds_Behind_Master: There is latency and the value here is greater than 0

Note that there is one exception for the scenario above. In Azure Database for MySQL, replication is optimized for high concurrency environments with the parameter binlog_group_commit_sync_delay is configured. This parameter controls how many microseconds the binary log commit waits before synchronizing the binary log file. The benefit is that instead of immediately applying every transaction committed, the master send the binary log updates in bulk. This reduces IO on the replica and helps to improve performance. However, for some edge cases that are not in a high-concurrent scenario, e.g. one connection only runs one transaction and then the connection closes, setting binlog_group_commit_sync_delay can cause serious latency because the IO thread is waiting for bulk binary log updates while only one transaction is committed. This means that the waiting effort is in vain.

In most cases, the latency root cause falls into one of the scenarios covered in this blog. I hope the above information helps. Please feel free to post comments or reach out to me for a discussion.

Thank you!

Shawn Xiao

Technical Support Engineer

Updated Jun 16, 2020

Version 8.0

ShawnX

Microsoft

Joined March 29, 2019

View Profile

Azure Database for MySQL Blog

Follow this blog board to get notified when there's new activity