
Azure Database Support Blog
3 MIN READ

Geo‑Replication Redo Lag in Azure SQL Database

Mohamed_Baioumy_MSFT
Jan 02, 2026

Monitoring and Troubleshooting Using Public DMVs

Azure SQL Database provides built‑in high availability and geo‑replication capabilities to ensure database resilience and business continuity. While replication is fully managed by the platform, customers may occasionally observe a delay before changes committed on the primary become visible on the geo‑replicated secondary database, especially during periods of heavy write activity.

This article provides a public, supported approach to monitoring and understanding geo‑replication delay caused by redo lag, using official Dynamic Management Views (DMVs) and T‑SQL only, without relying on internal tools.

Scenario Overview

Customers may report that changes committed on the primary database are not immediately visible on the geo‑replicated secondary, sometimes for several minutes.

Typical symptoms include:

  • Reporting queries on the geo‑secondary showing stale data
  • Increased redo catch‑up time on the secondary
  • No performance degradation observed on the primary
  • Replication eventually catches up without manual intervention

This behavior is commonly associated with redo lag, where the secondary has already received the log records but requires additional time to replay them into data pages.

Understanding Geo‑Replication Redo Lag

In Azure SQL Database geo‑replication:

  • Transactions are sent from the primary to the secondary
  • Log records are hardened on the secondary
  • The secondary applies these records asynchronously to its data files (redo)

If the secondary experiences temporary pressure (for example, due to schema changes or intensive operations), redo may fall behind, causing the secondary to lag—even though data durability remains intact.

Step 1: Check Geo‑Replication Status and Lag

The primary DMV for monitoring geo‑replication is:

📘 sys.dm_geo_replication_link_status
Public documentation:
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-geo-replication-link-status-azure-sql-database 

Sample T‑SQL Query


SELECT
    link_guid,
    partner_server,
    partner_database,
    replication_state_desc,
    replication_lag_sec,
    last_replication
FROM sys.dm_geo_replication_link_status;

Key Columns Explained

Column                    Description
replication_state_desc    Current replication state
replication_lag_sec       Estimated replication lag, in seconds
last_replication          Timestamp of the last successful replication (UTC)
partner_server            Logical server hosting the geo‑replica

Interpretation

  • Healthy replication:
    replication_lag_sec is at or near 0 and replication_state_desc reports CATCH_UP
  • Transient delay:
    Lag increases temporarily but later returns to zero
  • Sustained delay:
    Lag remains elevated for an extended period and should be investigated further (a threshold check is sketched below)
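
If you need to separate a transient spike from a sustained delay in a monitoring job, a simple threshold check against the same DMV can help. The sketch below assumes a 300‑second tolerance, which is only an example value; pick a threshold that matches your own reporting or recovery objectives.

-- Minimal sketch: list geo-replication links whose estimated lag exceeds
-- an example threshold of 300 seconds.
SELECT
    partner_server,
    partner_database,
    replication_state_desc,
    replication_lag_sec
FROM sys.dm_geo_replication_link_status
WHERE replication_lag_sec > 300;

Running this check on a schedule makes it easier to alert only when lag persists rather than on every short spike.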

Step 2: Monitor Local Replica Redo Health

To understand whether lag is related to redo activity on the secondary, you can query:

📘 sys.dm_database_replica_states
Public documentation:
https://learn.microsoft.com/en-us/sql/relational-databases/system-dynamic-management-views/sys-dm-database-replica-states-azure-sql-database 

Sample T‑SQL Query

SELECT
    database_id,
    role_desc,
    synchronization_state_desc,
    synchronization_health_desc,
    redo_queue_size,
    redo_rate,
    last_commit_time
FROM sys.dm_database_replica_states;

Key Insights

  • redo_queue_size
    Indicates how much log data is pending replay on the secondary
  • redo_rate
    Shows how quickly redo is being applied
  • last_commit_time
    Helps estimate data freshness on the secondary (UTC); a catch‑up estimate is sketched after this list
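
Because redo_queue_size and redo_rate are exposed together, you can derive a rough catch‑up estimate. The sketch below assumes redo_queue_size is reported in KB and redo_rate in KB per second, and treats both results as approximations rather than guarantees.

-- Minimal sketch: estimate how long the pending redo queue would take to
-- drain at the current redo rate, plus an approximate data age based on
-- last_commit_time (UTC).
SELECT
    database_id,
    redo_queue_size,
    redo_rate,
    CASE
        WHEN redo_rate > 0
            THEN CAST(redo_queue_size AS decimal(18, 2)) / redo_rate
        ELSE NULL
    END AS estimated_redo_catchup_sec,
    DATEDIFF(SECOND, last_commit_time, SYSUTCDATETIME()) AS approx_data_age_sec
FROM sys.dm_database_replica_states;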

Interpretation

Scenario                  Observation
Normal operation          redo_queue_size = 0
Write burst               redo_queue_size increases temporarily
Recovery in progress      redo_rate remains positive
Healthy state             synchronization_health_desc = HEALTHY

Short‑lived spikes are expected platform behavior and usually resolve automatically.
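
One way to confirm that a spike is short‑lived is to sample the redo counters a few times over a short window. This is a minimal sketch; the five samples and the 30‑second interval are example values.

-- Minimal sketch: capture a few redo samples into a temp table so a spike
-- that drains quickly can be told apart from lag that keeps growing.
IF OBJECT_ID('tempdb..#redo_samples') IS NOT NULL
    DROP TABLE #redo_samples;

CREATE TABLE #redo_samples
(
    sample_time_utc datetime2 NOT NULL,
    redo_queue_size bigint    NULL,
    redo_rate       bigint    NULL
);

DECLARE @i int = 0;

WHILE @i < 5
BEGIN
    INSERT INTO #redo_samples (sample_time_utc, redo_queue_size, redo_rate)
    SELECT SYSUTCDATETIME(), redo_queue_size, redo_rate
    FROM sys.dm_database_replica_states;

    SET @i += 1;

    IF @i < 5
        WAITFOR DELAY '00:00:30';   -- example sampling interval
END;

SELECT *
FROM #redo_samples
ORDER BY sample_time_utc;

If redo_queue_size shrinks between samples, the platform is already catching up; if it keeps growing across samples, treat it as a sustained delay.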

Practical Monitoring Considerations

Reporting Workloads

If applications read from the geo‑secondary for reporting:

  • Expect near‑real‑time visibility, not guaranteed real‑time visibility
  • Design reports to tolerate small delays
  • Route latency‑sensitive reads to the primary if required (a freshness gate is sketched below)
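
One practical pattern for the last point is to gate latency‑sensitive reports on the geo‑secondary with a freshness check. This is a minimal sketch: the 120‑second tolerance is an example value, and production code would typically fall back to the primary rather than simply raise an error.

-- Minimal sketch: run on the geo-secondary before a latency-sensitive report.
-- Raises an error when the most recently committed data is older than an
-- example tolerance of 120 seconds.
DECLARE @age_sec int;

SELECT @age_sec = DATEDIFF(SECOND, MAX(last_commit_time), SYSUTCDATETIME())
FROM sys.dm_database_replica_states;

IF @age_sec IS NULL OR @age_sec > 120
    RAISERROR('Secondary data is older than the reporting tolerance; route this read to the primary.', 16, 1);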

Transaction Patterns

Redo lag is more likely during:

  • Large batch updates
  • Index maintenance operations
  • Schema modification commands
  • Bursty write workloads

Keeping transactions short and efficient reduces replication pressure.
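
A common way to keep transactions short is to break large maintenance updates into batches, so each transaction generates a bounded amount of log for the secondary to redo. This is a minimal sketch; the table, columns, and batch size (dbo.Orders, IsArchived, OrderDate, 5,000 rows) are illustrative assumptions, not part of any specific workload.

-- Minimal sketch: batch a large UPDATE so each transaction stays small.
-- All object names and the batch size are illustrative.
DECLARE @rows int = 1;

WHILE @rows > 0
BEGIN
    UPDATE TOP (5000) dbo.Orders
    SET IsArchived = 1
    WHERE IsArchived = 0
      AND OrderDate < DATEADD(YEAR, -1, SYSUTCDATETIME());

    SET @rows = @@ROWCOUNT;
END;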

Best Practices

  • Use UTC timestamps consistently when correlating events
  • Monitor replication_lag_sec and redo_queue_size together
  • Implement retry logic in applications for transient conditions
  • Avoid assuming read replicas are always perfectly synchronized
  • Do not take manual actions during short‑lived redo spikes unless the lag persists

Summary

Redo lag in Azure SQL Database geo‑replication is a normal and self‑healing behavior during certain workload patterns. By using supported public DMVs and T‑SQL, customers can:

  • Monitor replication health safely
  • Understand replica freshness
  • Make informed application routing decisions
  • Avoid unnecessary intervention

Azure SQL Database automatically stabilizes replication once redo pressure subsides, ensuring durability and availability without manual management.


Published Jan 02, 2026
Version 1.0