Monitoring and creating alerts for Data-in replication with Azure Database for MySQL-Flexible Server

Microsoft

Feb 08, 2022

You can use Data-in replication to synchronize data from an external MySQL server with data in an Azure Database for MySQL flexible server. The external server can be hosted on-premises, in virtual machines, in an Azure Database for MySQL single server, or in a database service offered by other cloud providers.

Data-in replication is based on the binary log (binlog) file position-based method. For more information, see Binary Log File Position Based Replication Configuration Overview.

This post provides details about how to monitor and create alerts for Data-in replication when the replication lag is high.

Prerequisites

Before you begin to work through the process outlined in this post, be sure that you have:

An instance of Azure Database for MySQL - Flexible Server.
Data-in replication configured and running as expected between your primary and replica servers. For more information, see How to configure Azure Database for MySQL Flexible Server Data-in replication.

Monitor Data-in replication

The value of the Seconds_Behind_Master parameter, as displayed by the SHOW SLAVE STATUS command, is commonly used as an indication of the current replication lag of the replica. In Azure Database for MySQL - Flexible Server, the Replication Lag in Seconds metric tracks this value.
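
As a rough illustration of how this value can be read programmatically, the following sketch interprets one row of SHOW SLAVE STATUS output that has already been fetched into a Python dictionary. The helper name describe_replication_lag and the sample rows are hypothetical. The column names (Seconds_Behind_Master, Slave_IO_Running, Slave_SQL_Running) are the ones MySQL reports. A NULL value (None in Python) generally means that replication isn't running, so the sketch doesn't treat it as zero lag.

```python
from typing import Optional


def describe_replication_lag(status_row: dict) -> str:
    """Summarize one SHOW SLAVE STATUS row (hypothetical helper)."""
    lag: Optional[int] = status_row.get("Seconds_Behind_Master")
    io_running = status_row.get("Slave_IO_Running") == "Yes"
    sql_running = status_row.get("Slave_SQL_Running") == "Yes"

    # NULL lag or a stopped thread means the lag cannot be trusted.
    if lag is None or not (io_running and sql_running):
        return "replication is not running; lag is unknown"
    if lag == 0:
        return "replica is caught up"
    return f"replica is {lag} seconds behind the source"


# Sample rows shaped like SHOW SLAVE STATUS output (values are made up).
healthy = {"Slave_IO_Running": "Yes", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": 12}
stopped = {"Slave_IO_Running": "No", "Slave_SQL_Running": "Yes", "Seconds_Behind_Master": None}

print(describe_replication_lag(healthy))
print(describe_replication_lag(stopped))
```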

To monitor the replication lag, perform the following steps:

In the Azure portal, select the replica Azure Database for MySQL flexible server that you want to monitor.
In the resource menu on the left, under Monitoring, select Metrics.
From the Metric drop-down list, select Replication Lag in Seconds, and then, in the Aggregation drop-down list, ensure that Max is selected.

A graph showing the replication lag over time appears.

You can use this information to monitor the replication lag between your primary and replica servers.
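
To show why the Max aggregation is a reasonable choice for this metric, the following sketch groups made-up lag samples into fixed 5-minute time grains and compares the maximum with the average. The function name aggregate_by_grain and the sample data are assumptions for illustration only; this isn't how the portal computes the chart internally.

```python
from collections import defaultdict
from statistics import mean


def aggregate_by_grain(samples, grain_seconds=300):
    """Group (timestamp_seconds, lag_seconds) samples into fixed time grains.

    Returns {grain_start: (max_lag, avg_lag)} so the two aggregations can be compared.
    """
    buckets = defaultdict(list)
    for timestamp, lag in samples:
        grain_start = timestamp - (timestamp % grain_seconds)
        buckets[grain_start].append(lag)
    return {start: (max(lags), mean(lags)) for start, lags in sorted(buckets.items())}


# Made-up samples, one per minute, with a short spike at t=120 in the first grain.
samples = [(0, 2), (60, 3), (120, 95), (180, 4), (240, 1), (300, 2), (360, 3)]

for start, (max_lag, avg_lag) in aggregate_by_grain(samples).items():
    print(f"grain starting at {start}s: max={max_lag}s avg={avg_lag:.1f}s")
```

In the first grain, the maximum is 95 seconds while the average is only 21 seconds, so an average-based view would understate the spike.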

Create an alert for when replication lag is high

If you notice that the replication lag is substantial, you can create an alert to ensure that you’re notified if the replication lag exceeds a threshold that you set.

On the Metrics blade, select New alert rule.
On the Create an alert rule page, on the Condition tab, under Condition name, select Whenever the maximum replication_lag is greater than <logic undefined>.
In the Configure signal logic dialog box, under Alert logic, in the Threshold value text box, enter 60.
To refine the condition, in the Aggregation granularity (Period) drop-down list, select the interval over which data points are grouped using the aggregation type function (for example, 5 minutes).
In the Frequency of evaluation drop-down list, select how frequently you want the status evaluated (for example, Every 5 Minutes), and then select Done.
Select Next: Action, and then on the Actions tab, select Add action group to add an existing action group, or select Create action group to create a new one.

If you create a new action group, on the Create an action group page, under Instance details, specify the action group name and the display name. The display name is used in place of a full action group name when notifications are sent using this group.
Select Next: Notifications, and then, on the Notifications tab, define a list of notifications to send when an alert is triggered. For each notification, specify the:

Notification type: The type of notification that you want to send. Use Email/SMS message/Push/Voice to send these types of notifications to specific recipients.
Name: A unique name for the notification.
Email: The email address to which you want the alert sent (specify in the Email/SMS message/Push/Voice dialog box).
Select OK, and then select Review + create.
After the action group is created, on the Create an alert rule page, select Next: Details, and then specify the:

Severity: Critical
Alert Rule Name: Data-in-replication-Lag
Alert rule description: The replication lag is high.
Select Review + create.
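
To make the effect of these settings more concrete, the following sketch models the rule as a small Python class. It uses the values from the steps above: a threshold of 60, a 5-minute aggregation period, and a 5-minute evaluation frequency. The class LagAlertRule, the function state_changes, and the sample data are hypothetical, and the logic is a simplified approximation of a static-threshold metric alert, not the service's actual implementation.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LagAlertRule:
    """Hypothetical stand-in for the alert rule settings chosen in the portal."""
    name: str = "Data-in-replication-Lag"
    threshold_seconds: int = 60    # Threshold value
    period_seconds: int = 300      # Aggregation granularity (Period): 5 minutes
    frequency_seconds: int = 300   # Frequency of evaluation: every 5 minutes

    def is_breached(self, samples, evaluation_time):
        """True if the maximum lag in the look-back period is greater than the threshold."""
        window_start = evaluation_time - self.period_seconds
        in_window = [lag for ts, lag in samples if window_start <= ts < evaluation_time]
        return bool(in_window) and max(in_window) > self.threshold_seconds

    def evaluate(self, samples, start, end):
        """Evaluate the condition at each scheduled time from start to end (inclusive)."""
        times = range(start, end + 1, self.frequency_seconds)
        return [(t, self.is_breached(samples, t)) for t in times]


def state_changes(results):
    """Report only the evaluations where the alert fires or resolves."""
    events, firing = [], False
    for t, breached in results:
        if breached != firing:
            events.append((t, "fired" if breached else "resolved"))
            firing = breached
    return events


rule = LagAlertRule()
# Made-up (timestamp_seconds, lag_seconds) samples, one per minute.
samples = [(0, 5), (60, 8), (120, 75), (180, 40), (240, 10), (300, 6), (360, 4)]

results = rule.evaluate(samples, start=300, end=600)
print(results)
print(state_changes(results))
```

With this sample data, the first evaluation sees a maximum of 75 seconds and breaches the threshold, and the second evaluation sees a maximum of 6 seconds, so the modeled alert fires once and then resolves.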

Now, when the replication lag exceeds the threshold you specified, you’ll see alerts in the portal under Monitoring > Alerts.

You’ll also receive an email notification at the email address you configured in the action group. When the replication lag falls below the threshold value, you’ll receive another email, as the alert will have been resolved.

Conclusion

It’s important to monitor the replication lag and take necessary actions if the lag exceeds the threshold to ensure your replica is in close sync with your primary. The replication lag impacting replica servers depends on several factors, including but not limited to:

Network latency.
Transaction volume on the source server.
Compute tier of the source server and replica server.
Queries running on the source server and replica server.

Monitoring the replication lag and taking timely action to reduce it are necessary to ensure that applications connected to a read replica server do not get inconsistent data.

Note: To troubleshoot and resolve replication lag issues, see the following resources:

If you have any feedback or questions, please leave a comment below or email us at AskAzureDBforMySQL@service.microsoft.com.

Thank you!

Updated Jan 23, 2023
