Written By: Shashikant Shakya, Ashutosh Bapat, and Guangnan Shi
The Problem
Picture this: your CDC pipeline is running smoothly, streaming changes from PostgreSQL to Kafka. Then a bulk update hits millions of rows. Suddenly, Kafka queues pile up, downstream systems lag, and dashboards go stale.
Why does replication lag spike during heavy operations? And what can you do about it?
Why This Matters
Change Data Capture (CDC) powers real-time integrations, pushing row-level changes from OLTP systems into event streams, data lakes, caches, and microservices. Debezium is a leading open-source CDC engine for PostgreSQL, and many teams successfully run Debezium against Azure Database for PostgreSQL to keep downstream systems synchronized.
However, during large DML operations (bulk updates, deletes) or schema changes (DDL), replication lag can occur because:
- Debezium consumes WAL slower than the database produces it
- Kafka throughput dips
- Consumers fall behind
Grounded in logical decoding internals, this article explains why lag happens, shows how to diagnose it quickly, and covers what to tune across the database, Azure, and connector layers to keep pipelines healthy under heavy load.
CDC Basics
CDC streams incremental changes (INSERT/UPDATE/DELETE) from your source database to downstream systems in near real-time. In PostgreSQL, CDC is typically implemented using logical decoding and logical replication:
- PostgreSQL records every change in the Write-Ahead Log (WAL)
- WALSender reads WAL and decodes it into change events
- The pgoutput plugin formats those changes, while Debezium subscribes and publishes them to Kafka topics (a minimal setup sketch follows this list)
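For orientation, the sketch below shows the two PostgreSQL objects this pipeline rests on: a publication for pgoutput and a logical replication slot. It assumes sufficient privileges, the names (dbz_publication, debezium_slot) are illustrative, and in practice the Debezium connector can create both for you.

-- Publication that pgoutput uses to decide which tables to stream (name is illustrative)
CREATE PUBLICATION dbz_publication FOR ALL TABLES;

-- Logical replication slot decoded with the pgoutput plugin (slot name is illustrative)
SELECT * FROM pg_create_logical_replication_slot('debezium_slot', 'pgoutput');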
Benefits of CDC:
- Low latency
- Lower source overhead than periodic full extracts
- Preserves transactional ordering for consumers
The Internals: Why Lag Happens
Replication lag during heavy operations isn’t random; it’s rooted in how PostgreSQL handles logical decoding. To understand why, let’s look at the components that process changes and what happens when they hit resource limits.
Logical Decoding & ReorderBuffer
Logical decoding reconstructs transaction-level changes so they can be delivered in commit order. The core component enabling this is the ReorderBuffer.
What ReorderBuffer does:
- Reads WAL and groups changes per transaction, keeping them in memory until commit
- If memory exceeds logical_decoding_work_mem, PostgreSQL spills decoded changes to disk in per-slot spill files (a tuning snippet follows this list)
- On commit, it reads back spilled data and emits changes to the client (via pgoutput → Debezium)
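As a rough illustration, the statements below check and raise the per-slot decoding memory budget; the 256MB figure is an assumption to adapt, not a recommendation. ALTER SYSTEM needs elevated privileges and is generally not available on Azure Database for PostgreSQL, where the same parameter is changed through the server parameters page or the Azure CLI.

-- Check the current limit (the PostgreSQL default is 64MB)
SHOW logical_decoding_work_mem;

-- Self-managed PostgreSQL; on Azure, change the server parameter via portal/CLI instead
ALTER SYSTEM SET logical_decoding_work_mem = '256MB';  -- illustrative value
SELECT pg_reload_conf();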
Disk Spill Mechanics (Deep Dive)
When a transaction is too large for memory:
- PostgreSQL writes decoded changes to spill files under pg_replslot/<slot_name>/
- Wait events like ReorderBufferWrite and ReorderBufferRead dominate during heavy load
- Spills to disk increase latency because disk I/O is far slower than memory access
Analogy: Think of ReorderBuffer as a warehouse staging floor:
- Small shipments move quickly in memory
- A huge shipment forces workers to move boxes offsite (spill-to-disk), then bring them back later, slowing everything down
Why Multiple Slots Amplify the Impact
- The WAL is shared by all slots
- Each slot decodes the entire WAL stream because filtering happens after decoding
- Result: A single large transaction affects every slot, multiplying replication lag
Recommendation:
- Minimize the number of slots/connectors (an audit query follows below)
- Remember: logical_decoding_work_mem applies per slot, not globally
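To audit slot sprawl, a query along these lines lists each logical slot, whether a consumer is attached, and how much WAL it forces the server to retain; slots that stay inactive are candidates for removal with pg_drop_replication_slot.

SELECT slot_name,
       active,
       pg_size_pretty(pg_current_wal_lsn() - restart_lsn) AS retained_wal
FROM pg_replication_slots
WHERE slot_type = 'logical'
ORDER BY (pg_current_wal_lsn() - restart_lsn) DESC;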
Impact Snapshot:
| Scenario | Spill Size | I/O Impact |
|---|---|---|
| 1 Slot | 1 GB | 1× I/O |
| 5 Slots | 1 GB × 5 | 5× I/O |
Lifecycle: WAL → ReorderBuffer → Memory → Spill to Disk → Read Back → Send to Client
How to Detect Spills and Lag
Detection should be quick and repeatable. Start by confirming slot activity and LSN distance (how far producers are ahead of consumers), then check walsender wait events to see if decoding is stalling, and finally inspect per-slot spill metrics to quantify memory overflow to disk.
1. Active slots and lag
Use this to measure how far each logical slot is behind the current WAL. A large lsn_distance indicates backlog. If restart_lsn is far behind, the server must retain more WAL on disk, increasing storage pressure.
SELECT slot_name,
active_pid,
confirmed_flush_lsn,
restart_lsn,
pg_current_wal_lsn(),
pg_size_pretty((pg_current_wal_lsn() - confirmed_flush_lsn)) AS lsn_distance
FROM pg_replication_slots;
Interpretation: Focus on slots with the largest lsn_distance. If active_pid is NULL, the slot isn’t currently consuming; investigate connector health or connectivity.
2. Wait events for walsender
Check whether the WAL sender backends are stalled on decoding or I/O. ReorderBuffer-related waits typically point to spill-to-disk conditions or slow downstream consumption.
SELECT pid,
backend_type,
application_name,
wait_event
FROM pg_stat_activity
WHERE backend_type = 'walsender'
ORDER BY backend_start;
Interpretation: Frequent ReorderBufferWrite/ReorderBufferRead suggests large transactions are spilling.
3. Spill stats
Quantify how often and how much each slot spills from memory to disk. Rising spill_bytes and spill_count during heavy DML are strong signals to increase logical_decoding_work_mem, reduce transaction size, or tune connector throughput.
SELECT slot_name,
spill_txns,
spill_count,
pg_size_pretty(spill_bytes) AS spill_bytes,
total_txns,
pg_size_pretty(total_bytes) AS total_bytes,
stats_reset
FROM pg_stat_replication_slots;
Interpretation: Compare spill_bytes across slots; if many slots spill simultaneously, aggregate I/O multiplies. Consider reducing the number of active slots or batching large DML.
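On PostgreSQL 14 and later you can also zero these counters before and after a load test to attribute spills to a specific workload; the slot name below is a placeholder.

-- Reset spill statistics for one slot (pass NULL to reset all slots)
SELECT pg_stat_reset_replication_slot('debezium_slot');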
Fixing the Lag: Practical Strategies
Once you’ve identified replication lag and its root causes, the next step is mitigation. Solutions span across the database configuration, Azure infrastructure, and the Debezium connector layer. These strategies aim to reduce I/O overhead, optimize memory usage, and ensure smooth data flow under heavy workloads.
Database & Azure Layer
At the database and infrastructure level, focus on reducing unnecessary overhead and ensuring resources are scaled for peak demand. Here’s what you can do:
- Avoid REPLICA IDENTITY FULL: prefer the PRIMARY KEY, or add a unique index and set REPLICA IDENTITY USING INDEX
- Use appropriately scaled, IO-capable storage / the right SKU for higher IOPS
- Right-size logical_decoding_work_mem, considering that it applies per slot
- Break up large DML: batch updates/deletes (10k–50k rows per commit); a batching sketch follows this list
- Schedule/throttle maintenance: stagger VACUUM/REINDEX/DDL
- Network placement: use Private Endpoint and co-locate Debezium/Kafka within the same region/VNet
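As a concrete sketch of the batching advice above, the block below deletes in roughly 10k-row chunks so each commit stays small enough to decode in memory. The table, column, and predicate (orders, created_at) are hypothetical, and the in-block COMMIT requires the DO command to run at the top level (PostgreSQL 11+). The same pattern applies to bulk UPDATEs.

DO $$
DECLARE
    rows_deleted bigint;
BEGIN
    LOOP
        -- Delete one small batch; ctid keeps the inner lookup cheap
        DELETE FROM orders
        WHERE ctid IN (
            SELECT ctid
            FROM orders
            WHERE created_at < now() - interval '1 year'
            LIMIT 10000
        );
        GET DIAGNOSTICS rows_deleted = ROW_COUNT;
        EXIT WHEN rows_deleted = 0;
        COMMIT;  -- each batch becomes its own small transaction in the WAL
    END LOOP;
END $$;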
Debezium Connector Layer
Connector-level tuning ensures that Debezium can keep pace with PostgreSQL WAL generation and Kafka throughput. Key adjustments include:
- Tune throughput & buffering: increase max.batch.size and max.queue.size, reduce poll.interval.ms (a sample configuration follows this list)
- Offset flush tuning: reduce offset.flush.interval.ms
- Heartbeats: introduce heartbeat events to detect staleness and prevent WAL buildup
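For reference, a connector configuration touching these knobs might look like the sketch below (Kafka Connect standalone .properties style). The name and values are illustrative starting points to tune against your own workload, and offset.flush.interval.ms is set in the Kafka Connect worker properties rather than on the connector itself.

# Illustrative Debezium PostgreSQL connector settings; values are assumptions, not recommendations
name=pg-cdc-connector
connector.class=io.debezium.connector.postgresql.PostgresConnector
plugin.name=pgoutput
max.batch.size=4096
max.queue.size=16384
poll.interval.ms=100
# Periodic heartbeats keep confirmed_flush_lsn advancing even when captured tables are quiet
heartbeat.interval.ms=10000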
Conclusion
Managing replication lag in Azure Database for PostgreSQL with Debezium isn’t just about tweaking parameters; it’s about understanding logical decoding internals, anticipating workload patterns, and applying proactive strategies across the entire solution.
Key Takeaways:
- Monitor early, act fast: Use diagnostic queries to track lag, wait events, and spill activity
- Minimize complexity: Fewer replication slots and well-tuned connectors reduce redundant work
- Plan for scale: Batch large DML operations, right-size memory settings
- Leverage Azure capabilities: Optimize IOPS tiers and network placement for predictable performance
By combining these best practices with continuous monitoring and operational discipline, you can keep your CDC pipelines healthy, even under heavy load, while ensuring downstream systems stay in sync with minimal latency.