Monitor replication lag for Auto-Failover Groups in SQL Managed Instance

Published 03-03-2021 10:38 AM 1,549 Views
Microsoft

The auto-failover groups feature allows you to manage the replication and failover of a group of databases on a server or all databases in a managed instance to another region. It is a declarative abstraction on top of the existing active geo-replication feature, designed to simplify deployment and management of geo-replicated databases at scale.

 

To monitor the replication lag between primary and secondary instances, and last replication time of the secondary databases, we can use the DMV sys.dm_geo_replication_link_status. The view is created in all databases, including the logical master. However, querying this view in the logical master returns an empty set.

 

Run the below query on your primary instance:

 

USE [Database_Name]

SELECT   

     link_guid  

   , partner_server 

   , partner_database

   , last_replication  

   , replication_lag_sec   

 

FROM sys.dm_geo_replication_link_status;

 

The columns to check are:

 

  • replication_lag_sec : Shows the time difference in seconds between the last_replication value and the timestamp of that transaction's commit on the primary based on the primary database clock. This value is available on the primary database only.
  • last_replication : Shows the timestamp of the last transaction's acknowledgement by the secondary based on the primary database clock. This value is available on the primary database only.

 

 

Moreover, both primary and secondary instances are required to have the same service tier. If the primary database is experiencing a heavy write workload, a secondary with lower compute size may not be able to keep up with it. That will cause redo lag on the secondary, and potential unavailability of the secondary. To mitigate these risks, active geo-replication will throttle the primary's transaction log rate if necessary, to allow its secondaries to catch up.

 

We can monitor the replication lag over time in SQL Managed Instance by creating a scheduled agent job to run and capture the lag into a table every ex: 5 min for a day or two, to check how the lag time can be different along with the workload.

 

Short steps can be:

1)      Create a table for example named monitor_lag

2)      Create a job with the below step and set it to be run every 5 minutes.

INSERT INTO monitor_lag

SELECT    

partner_database, last_replication,last_commit,replication_lag_sec

FROM sys.dm_geo_replication_link_status

3)      Review the lag time over time.

 

Enjoy!

%3CLINGO-SUB%20id%3D%22lingo-sub-2182480%22%20slang%3D%22en-US%22%3EMonitor%20replication%20lag%20for%20Auto-Failover%20Groups%20in%20SQL%20Managed%20Instance%3C%2FLINGO-SUB%3E%3CLINGO-BODY%20id%3D%22lingo-body-2182480%22%20slang%3D%22en-US%22%3E%3CP%3EThe%20%3CA%20title%3D%22auto-failover%20groups%22%20href%3D%22https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fazure-sql%2Fdatabase%2Fauto-failover-group-overview%3Ftabs%3Dazure-powershell%23best-practices-for-sql-managed-instance%22%20target%3D%22_blank%22%20rel%3D%22noopener%20noreferrer%22%3Eauto-failover%20groups%3C%2FA%3E%20feature%20allows%20you%20to%20manage%20the%20replication%20and%20failover%20of%20a%20group%20of%20databases%20on%20a%20server%20or%20all%20databases%20in%20a%20managed%20instance%20to%20another%20region.%20It%20is%20a%20declarative%20abstraction%20on%20top%20of%20the%20existing%20active%20geo-replication%20feature%2C%20designed%20to%20simplify%20deployment%20and%20management%20of%20geo-replicated%20databases%20at%20scale.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3ETo%20monitor%20the%20replication%20lag%20between%20primary%20and%20secondary%20instances%2C%20and%20last%20replication%20time%20of%20the%20secondary%20databases%2C%20we%20can%20use%20the%20DMV%20%3CA%20title%3D%22sys.dm_geo_replication_link_status%22%20href%3D%22https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fsql%2Frelational-databases%2Fsystem-dynamic-management-views%2Fsys-dm-geo-replication-link-status-azure-sql-database%3Fview%3Dazuresqldb-current%22%20target%3D%22_blank%22%20rel%3D%22noopener%20noreferrer%22%3Esys.dm_geo_replication_link_status%3C%2FA%3E.%26nbsp%3BThe%20view%20is%20created%20in%20all%20databases%2C%20including%20the%20logical%20master.%20However%2C%20querying%20this%20view%20in%20the%20logical%20master%20returns%20an%20empty%20set.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3ERun%20the%20below%20query%20on%20your%20primary%20instance%3A%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EUSE%20%5BDatabase_Name%5D%3C%2FP%3E%0A%3CP%3ESELECT%26nbsp%3B%26nbsp%3B%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%20link_guid%26nbsp%3B%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%26nbsp%3B%20%2C%20partner_server%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%26nbsp%3B%20%2C%20partner_database%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%26nbsp%3B%20%2C%20last_replication%26nbsp%3B%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%26nbsp%3B%20%2C%20replication_lag_sec%26nbsp%3B%26nbsp%3B%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EFROM%20sys.dm_geo_replication_link_status%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EThe%20columns%20to%20check%20are%3A%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CUL%3E%0A%3CLI%3E%3CSTRONG%3Ereplication_lag_sec%26nbsp%3B%3A%3C%2FSTRONG%3E%20Shows%20the%20time%20difference%20in%20seconds%20between%20the%20last_replication%20value%20and%20the%20timestamp%20of%20that%20transaction's%20commit%20on%20the%20primary%20based%20on%20the%20primary%20database%20clock.%20This%20value%20is%20available%20on%20the%20primary%20database%20only.%3C%2FLI%3E%0A%3CLI%3E%3CSTRONG%3Elast_replication%20%3A%3C%2FSTRONG%3E%26nbsp%3BShows%20the%20timestamp%20of%20the%20last%20transaction's%20acknowledgement%20by%20the%20secondary%20based%20on%20the%20primary%20database%20clock.%20This%20value%20is%20available%20on%20the%20primary%20database%20only.%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EMoreover%2C%20both%20primary%20and%20secondary%20instances%20are%20required%20to%20have%20the%20%3CU%3Esame%20service%20tier%3C%2FU%3E.%20If%20the%20primary%20database%20is%20experiencing%20a%20heavy%20write%20workload%2C%20a%20secondary%20with%20lower%20compute%20size%20may%20not%20be%20able%20to%20keep%20up%20with%20it.%20That%20will%20cause%20redo%20lag%20on%20the%20secondary%2C%20and%20potential%20unavailability%20of%20the%20secondary.%20To%20mitigate%20these%20risks%2C%20active%20geo-replication%20will%20throttle%20the%20primary's%20transaction%20log%20rate%20if%20necessary%2C%20to%20allow%20its%20secondaries%20to%20catch%20up.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EWe%20can%20monitor%20the%20replication%20lag%20over%20time%20in%20SQL%20Managed%20Instance%20by%20creating%20a%20scheduled%20agent%20job%20to%20run%20and%20capture%20the%20lag%20into%20a%20table%20every%20ex%3A%205%20min%20for%20a%20day%20or%20two%2C%20to%20check%20how%20the%20lag%20time%20can%20be%20different%20along%20with%20the%20workload.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EShort%20steps%20can%20be%3A%3C%2FP%3E%0A%3CTABLE%3E%0A%3CTBODY%3E%0A%3CTR%3E%0A%3CTD%20width%3D%22623%22%3E%3CP%3E1)%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%20Create%20a%20table%20for%20example%20named%20monitor_lag%3C%2FP%3E%0A%3CP%3E2)%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%20%3CA%20title%3D%22Create%20a%20Transact-SQL%20Job%20Step%22%20href%3D%22https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fsql%2Fssms%2Fagent%2Fcreate-a-transact-sql-job-step%3Fview%3Dsql-server-ver15%22%20target%3D%22_blank%22%20rel%3D%22noopener%20noreferrer%22%3ECreate%20a%20job%3C%2FA%3E%20with%20the%20below%20step%20and%20set%20it%20to%20be%20run%20every%205%20minutes.%3C%2FP%3E%0A%3CP%3EINSERT%20INTO%20monitor_lag%3C%2FP%3E%0A%3CP%3ESELECT%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%3C%2FP%3E%0A%3CP%3Epartner_database%2C%20last_replication%2Clast_commit%2Creplication_lag_sec%3C%2FP%3E%0A%3CP%3EFROM%20sys.dm_geo_replication_link_status%3C%2FP%3E%0A%3CP%3E3)%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%26nbsp%3B%20Review%20the%20lag%20time%20over%20time.%3C%2FP%3E%0A%3C%2FTD%3E%0A%3C%2FTR%3E%0A%3C%2FTBODY%3E%0A%3C%2FTABLE%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EEnjoy!%3C%2FP%3E%3C%2FLINGO-BODY%3E%3CLINGO-TEASER%20id%3D%22lingo-teaser-2182480%22%20slang%3D%22en-US%22%3E%3CP%3E%3CSPAN%3EThis%20article%20will%20explain%20how%20to%20monitor%26nbsp%3Breplication%20lag%20for%20Auto-Failover%20Groups%20in%20SQL%20Managed%20Instance%20over%20time.%3C%2FSPAN%3E%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%3C%2FLINGO-TEASER%3E%3CLINGO-LABS%20id%3D%22lingo-labs-2182480%22%20slang%3D%22en-US%22%3E%3CLINGO-LABEL%3EAzure%20SQL%20Managed%20Instance%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3Efailover%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3EHigh%20Availability%3C%2FLINGO-LABEL%3E%3C%2FLINGO-LABS%3E
Co-Authors
Version history
Last update:
‎Mar 30 2021 03:31 AM
Updated by: