Monitoring and Troubleshooting Using Public DMVs
Azure SQL Database provides built‑in high availability and geo‑replication capabilities to ensure database resilience and business continuity. While replication is fully managed by the platform, customers may occasionally observe a delay between the primary and geo‑replicated secondary database, especially during periods of heavy write activity.
This article provides a public, supported approach to monitoring and understanding geo‑replication delay caused by redo lag, using official Dynamic Management Views (DMVs) and T‑SQL only, without relying on internal tools.
Scenario Overview
Customers may report that changes committed on the primary database are not immediately visible on the geo‑replicated secondary, sometimes for several minutes.
Typical symptoms include:
- Reporting queries on the geo‑secondary showing stale data
- Increased redo catch‑up time on the secondary
- No performance degradation observed on the primary
- Replication eventually catches up without manual intervention
This behavior is commonly associated with redo lag, where the secondary has already received the log records but requires additional time to replay them into data pages.
Understanding Geo‑Replication Redo Lag
In Azure SQL Database geo‑replication:
- Transactions are sent from the primary to the secondary
- Log records are hardened on the secondary
- The secondary applies these records asynchronously to its data files (redo)
If the secondary experiences temporary pressure (for example, due to schema changes or intensive operations), redo may fall behind, causing the secondary to lag—even though data durability remains intact.
Step 1: Check Geo‑Replication Status and Lag
The primary DMV for monitoring geo‑replication is:
📘 sys.dm_geo_replication_link_status
Public documentation:
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database
Sample T‑SQL Query
SELECT
database_id,
partner_server,
partner_database,
replication_state_desc,
replication_lag_sec,
last_replication_time
FROM sys.dm_geo_replication_link_status;
Key Columns Explained
| Column | Description |
|---|---|
| replication_state_desc | Current replication state |
| replication_lag_sec | Estimated lag (in seconds) |
| last_replication_time | Last successful replication timestamp (UTC) |
| partner_server | Geo‑replica logical server |
Interpretation
- Healthy replication: replication_lag_sec is at or near 0 and replication_state_desc reports CATCH_UP (the steady state for an active geo‑replication link)
- Transient delay: lag increases temporarily but later returns to zero
- Sustained delay: lag remains elevated for an extended period and should be investigated further
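The three interpretation cases above can be expressed as a small client-side check. The following Python sketch is illustrative only (the `classify_lag` helper and its thresholds are assumptions, not part of any Azure SDK): it takes a time-ordered series of replication_lag_sec samples and decides whether lag looks healthy, transient, or sustained.

```python
def classify_lag(samples, threshold_sec=60, sustained_samples=5):
    """Classify a time-ordered list of replication_lag_sec readings.

    healthy   : latest sample is at or near zero, no earlier spike
    transient : lag exceeded the threshold but has already recovered
    sustained : the last `sustained_samples` readings all exceed the threshold
    """
    if not samples:
        return "unknown"
    latest = samples[-1]
    if latest <= 1:
        # Spikes earlier in the window that recovered count as transient.
        return "transient" if max(samples) > threshold_sec else "healthy"
    tail = samples[-sustained_samples:]
    if len(tail) == sustained_samples and all(s > threshold_sec for s in tail):
        return "sustained"
    return "transient"
```

Only a sustained classification should trigger further investigation; the other two states match expected platform behavior.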
Step 2: Monitor Local Replica Redo Health
To understand whether lag is related to redo activity on the secondary, you can query:
📘 sys.dm_database_replica_states
Public documentation:
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database
Sample T‑SQL Query
SELECT
database_id,
role_desc,
synchronization_state_desc,
synchronization_health_desc,
redo_queue_size,
redo_rate,
last_commit_time
FROM sys.dm_database_replica_states;
Key Insights
- redo_queue_size: how much log data (in KB) is pending replay on the secondary
- redo_rate: how quickly (in KB/sec) redo is being applied
- last_commit_time: helps estimate data freshness on the secondary (UTC)
Interpretation
| Scenario | Observation |
|---|---|
| Normal operation | redo_queue_size = 0 |
| Write burst | redo_queue_size increases temporarily |
| Catch‑up in progress | redo_rate remains positive while redo_queue_size drains |
| Healthy state | synchronization_health_desc = HEALTHY |
Short‑lived spikes are expected platform behavior and usually resolve automatically.
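A practical use of these two columns together is a rough catch-up estimate: dividing the pending redo queue by the current redo rate approximates how long the secondary needs to drain its backlog. The sketch below assumes the units documented for the analogous SQL Server replica-state DMV (redo_queue_size in KB, redo_rate in KB/sec); the function name is illustrative, not an official API.

```python
def estimated_catchup_seconds(redo_queue_size_kb, redo_rate_kb_per_sec):
    """Rough time for the secondary to replay its pending log.

    Returns None when redo_rate is zero or negative, since there is
    no progress to extrapolate from.
    """
    if redo_rate_kb_per_sec <= 0:
        return None
    return redo_queue_size_kb / redo_rate_kb_per_sec
```

For example, a 10 MB redo queue draining at 512 KB/sec suggests roughly 20 seconds of catch-up. Treat this as an order-of-magnitude signal, not a guarantee, because both values fluctuate from sample to sample.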
Practical Monitoring Considerations
Reporting Workloads
If applications read from the geo‑secondary for reporting:
- Expect near‑real‑time, not guaranteed real‑time visibility
- Design reports to tolerate small delays
- Route latency‑sensitive reads to the primary if required
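The routing guidance above can be made concrete with a small decision helper. This is a hypothetical sketch (the function name and parameters are not from any Azure SDK): it compares the latest observed lag against the staleness a given query can tolerate.

```python
def choose_read_endpoint(replication_lag_sec, max_staleness_sec):
    """Route latency-sensitive reads to the primary, others to the secondary.

    replication_lag_sec : latest value sampled from
                          sys.dm_geo_replication_link_status (or None)
    max_staleness_sec   : how much staleness this query can tolerate
    """
    if replication_lag_sec is None:
        # No link status available; fail safe to the primary.
        return "primary"
    return "secondary" if replication_lag_sec <= max_staleness_sec else "primary"
```

A reporting query tolerating five minutes of staleness can stay on the secondary through most redo spikes, while a read-your-writes scenario would pass a small `max_staleness_sec` and land on the primary.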
Transaction Patterns
Redo lag is more likely during:
- Large batch updates
- Index maintenance operations
- Schema modification commands
- Bursty write workloads
Keeping transactions short and efficient reduces replication pressure.
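One common way to keep transactions short during large batch updates is to split the affected key set into fixed-size chunks and commit each chunk in its own transaction. The helper below is a generic Python sketch (the chunk size is an illustrative default; how each chunk is executed against SQL depends on your data-access layer):

```python
def chunked(items, chunk_size=1000):
    """Yield fixed-size slices so each slice can be one short transaction."""
    for start in range(0, len(items), chunk_size):
        yield items[start:start + chunk_size]
```

Committing after each chunk bounds how much log the secondary must redo at once, which keeps redo_queue_size spikes small and short-lived.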
Best Practices
- Use UTC timestamps consistently when correlating events
- Monitor replication_lag_sec and redo_queue_size together
- Implement retry logic in applications for transient conditions
- Avoid assuming read replicas are always perfectly synchronized
- Do not take manual actions during short‑lived redo spikes unless the lag persists
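The retry recommendation above can be sketched as a simple exponential-backoff loop. This is a generic illustration, not an Azure SDK API; `attempt_read` and the backoff parameters are placeholders for your own data-access call.

```python
import time

def read_with_retry(attempt_read, retries=3, base_delay_sec=0.5):
    """Retry a read that may transiently fail while a replica catches up.

    attempt_read : zero-argument callable; raises on transient failure
    """
    for attempt in range(retries):
        try:
            return attempt_read()
        except Exception:
            if attempt == retries - 1:
                raise  # out of retries; surface the error
            # Exponential backoff: base, 2x base, 4x base, ...
            time.sleep(base_delay_sec * (2 ** attempt))
```

In production code you would catch only the transient error types your driver raises rather than bare `Exception`, and cap the total wait time.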
Summary
Redo lag in Azure SQL Database geo‑replication is a normal and self‑healing behavior during certain workload patterns. By using supported public DMVs and T‑SQL, customers can:
- Monitor replication health safely
- Understand replica freshness
- Make informed application routing decisions
- Avoid unnecessary intervention
Azure SQL Database automatically stabilizes replication once redo pressure subsides, ensuring durability and availability without manual management.
References
- Azure SQL Database High Availability
  https://learn.microsoft.com/azure/azure-sql/database/high-availability-overview
- sys.dm_geo_replication_link_status
  https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database
- sys.dm_database_replica_states
  https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database