In most cases, latency between master and replicas is caused by performance issues. Therefore, when investigating a synchronization lag, start by checking if the workload on the master has increased. In this article, you'll learn how to investigate and isolate the possible causes of replication latency.
Before starting, it's critical to understand how replication in MySQL works. While MySQL supports different types of data synchronization, Azure Database for MySQL only supports asynchronous replication (applies to both data-in replication and read replicas). With asynchronous replication, one server acts as a master and one or more other servers act as replicas (a max of 5 replicas is supported).
When replication starts, a master server writes replicating events into the binary log (which only records the committed transactions). This is why enabling the binary log is required when configuring replication in the first place.
On replica servers, there are two threads per replica, one called the IO thread and the other called the SQL thread.
This process is shown in the following diagram:
You can monitor the replication status from the Azure portal metrics blade. When latency occurs, an output of [SHOW SLAVE STATUS;] gives you enough information to understand what may cause the latency.
A sample output is shown below:
*************************** 1. row *************************** Slave_IO_State: Waiting for master to send event Master_Host: testserver.mysql.database.azure.com Master_User: replicationuser@testserver Master_Port: 3306 Connect_Retry: 60 Master_Log_File: mysql-bin.000191 Read_Master_Log_Pos: 103978138 Relay_Log_File: relay_bin.000568 Relay_Log_Pos: 13469095 Relay_Master_Log_File: mysql-bin.000191 Slave_IO_Running: Yes Slave_SQL_Running: Yes Replicate_Do_DB: Replicate_Ignore_DB: Replicate_Do_Table: Replicate_Ignore_Table: mysql.plugin Replicate_Wild_Do_Table: Replicate_Wild_Ignore_Table: mysql.\\_\\_% Last_Errno: 0 Last_Error: Skip_Counter: 0 Exec_Master_Log_Pos: 13468882 Relay_Log_Space: 103978599 Until_Condition: None Until_Log_File: Until_Log_Pos: 0 Master_SSL_Allowed: Yes Master_SSL_CA_File: c:\\work\\azure_mysqlservice.pem Master_SSL_CA_Path: Master_SSL_Cert: c:\\work\\azure_mysqlclient_cert.pem Master_SSL_Cipher: Master_SSL_Key: c:\\work\\azure_mysqlclient_key.pem Seconds_Behind_Master: 648 Master_SSL_Verify_Server_Cert: No Last_IO_Errno: 0 Last_IO_Error: Last_SQL_Errno: 0 Last_SQL_Error: Replicate_Ignore_Server_Ids: Master_Server_Id: 943553508 Master_UUID: eba67879-7232-11e9-b113-6c4d3f52df68 Master_Info_File: mysql.slave_master_info SQL_Delay: 0 SQL_Remaining_Delay: NULL Slave_SQL_Running_State: Waiting for dependent transaction to commit Master_Retry_Count: 86400 Master_Bind: Last_IO_Error_Timestamp: Last_SQL_Error_Timestamp: Master_SSL_Crl: Master_SSL_Crlpath: Retrieved_Gtid_Set: Executed_Gtid_Set: Auto_Position: 0 Replicate_Rewrite_DB: Channel_Name: Master_TLS_Version: 1 row in set (0.00 sec)
The output contains a lot of information, but normally it's only important to focus on the following columns:
Now let's see how to use the above information to isolate the latency cause.
If the below scenario met, then it is most likely that the cause is network latency. As seen, the IO thread is running and waiting on the master, and the master has already written to binary log file #20 while replica can only get file #10. Since the IO thread connects to master via TCP/IP, the only impact factor here is the network speed.
Slave_IO_State: Waiting for master to send event Master_Log_File: the binary file sequence is larger then Relay_Master_Log_File, e.g. mysql-bin.00020 Relay_Master_Log_File: the file sequence is smaller then above, e.g. mysql-bin.00010
If the below scenario met, then it is most likely that the workload on the master is too heavy and is burdening the replica. For the case below, though the replica can retrieve the binary log behind the master, the replica IO thread indicates that the relay log space is full already. So network speed isn't causing the delay, because the replica has already been trying to catch up as much as it can. Instead, the updated binary log size exceeds the upper limit of the relay log space.
Slave_IO_State: Waiting for the slave SQL thread to free enough relay log space Master_Log_File: the binary file sequence is larger then Relay_Master_Log_File, e.g. mysql-bin.00020 Relay_Master_Log_File: the file sequence is smaller then above, e.g. mysql-bin.00010
If the below scenario met, then it is most likely that the replica somehow is getting slower. For below cases, both the IO and SQL threads are running well and the replica is reading the same binary log file as the master writes. However, some latency still occurs. Normally, this could be caused by the slave server itself; check to see if the CPU consumption or IOPS on the replica server is high.
Slave_IO_State: Waiting for master to send event Master_Log_File: The binary log file sequence equals to Relay_Master_Log_File, e.g. mysql-bin.000191 Read_Master_Log_Pos: The position of master server written to the above file is larger than Relay_Log_Pos, e.g. 103978138 Relay_Master_Log_File: mysql-bin.000191 Slave_IO_Running: Yes Slave_SQL_Running: Yes Exec_Master_Log_Pos: The position of slave reads from master binary log file is smaller than Read_Master_Log_Pos, e.g. 13468882 Seconds_Behind_Master: There is latency and the value here is greater than 0
Note that there is one exception for the scenario above. In Azure Database for MySQL, replication is optimized for high concurrency environments with the parameter binlog_group_commit_sync_delay is configured. This parameter controls how many microseconds the binary log commit waits before synchronizing the binary log file. The benefit is that instead of immediately applying every transaction committed, the master send the binary log updates in bulk. This reduces IO on the replica and helps to improve performance. However, for some edge cases that are not in a high-concurrent scenario, e.g. one connection only runs one transaction and then the connection closes, setting binlog_group_commit_sync_delay can cause serious latency because the IO thread is waiting for bulk binary log updates while only one transaction is committed. This means that the waiting effort is in vain.
In most cases, the latency root cause falls into one of the scenarios covered in this blog. I hope the above information helps. Please feel free to post comments or reach out to me for a discussion.
Technical Support Engineer
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.