Managing the ODS layer becomes crucial, especially when source systems lack physical primary keys. An Operational Data Store (ODS) helps offload OLTP workloads by shifting certain queries to an alternative database. Customers typically use secondary servers to reduce the load on their primary servers, and database replication operating in an active-active mode is an effective strategy for keeping those secondaries current and operationally efficient.
Microsoft Fabric offers a mirroring option into OneLake to optimize resource utilization when moving data from OLTP databases such as SQL Server. Mirroring replicates data in near real time, keeping it continuously available for analytics.
The need for such a feature is just as evident in PostgreSQL environments. For instance, customers running multiple PostgreSQL OLTP systems want to enable their ODS and Data Warehouse (DW) on Fabric in both real time and near real time. This approach not only reduces the load on the PostgreSQL OLTP systems but also feeds downstream analytics.
However, it's important to note that this functionality is currently available for PostgreSQL in Private Preview only. This article explores the options available until the feature reaches general availability.
Identifying Change Data Capture (CDC) on PostgreSQL
- WAL2JSON Utility: This method uses PostgreSQL's Write-Ahead Log (WAL) with logical decoding to capture data changes in JSON format (a minimal sketch follows this list). It provides flexibility for custom processing and error handling, but it requires manual intervention for WAL file cleanup and operates in near real time.
- RTI (Debezium) CDC Connector: A Fabric-integrated solution that automates change capture from PostgreSQL into Eventhouse via Eventstream. It ensures efficient data movement by storing changes in a structured payload that includes metadata such as the operation type and transaction details. However, it requires PostgreSQL administrators to manage replication (WAL) slots so that WAL files do not accumulate.
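For reference, a minimal sketch of the WAL2JSON flow using PostgreSQL's built-in logical decoding functions; the slot name `cdc_slot` is illustrative, and the wal2json output plugin must be installed on the server:

```sql
-- Create a logical replication slot that decodes WAL changes as JSON.
SELECT * FROM pg_create_logical_replication_slot('cdc_slot', 'wal2json');

-- Peek at pending changes without consuming them (useful while testing).
SELECT data FROM pg_logical_slot_peek_changes('cdc_slot', NULL, NULL);

-- Read and consume pending changes so the retained WAL can be recycled.
SELECT data FROM pg_logical_slot_get_changes('cdc_slot', NULL, NULL);

-- Drop the slot when it is no longer needed.
SELECT pg_drop_replication_slot('cdc_slot');
```

An unconsumed or abandoned slot forces PostgreSQL to retain WAL indefinitely, which is why both options above call out slot and WAL cleanup as an operational responsibility.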
Comparison of CDC Options
| CDC Method | Description | Pros | Cons |
| --- | --- | --- | --- |
| RTI CDC Connector | Efficiently moves data from PostgreSQL to Fabric (Eventstream/Eventhouse). Stores data in a payload format, capturing database type, operation, and the changed record. | Automated, integrates well with Fabric, captures full change history | Requires the PostgreSQL team to manage WAL slot cleanup |
| WAL2JSON Method | Provides programmatic control for error tracking and WAL file cleanup. Operates in near real time. | More direct control, flexible error handling | Requires manual intervention, slightly delayed processing |
Approaches for Data Management
Option 1: Truncate and Load (Direct Source Connection)
- Establish a direct connection to the source tables in PostgreSQL for efficient data extraction.
- Truncate and reload the tables in the Silver layer to maintain data integrity and consistency (a sketch of one load cycle follows this list).
- Implement a logging mechanism in the Landing table so that modifications are accurately recorded and traceable.
- The source PostgreSQL team initially resisted this approach because it adds query load to the OLTP systems; that added load was accepted as necessary to meet the project's goals.
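A minimal sketch of one truncate-and-load cycle under this option; the schema and table names (`src.customer`, `silver.customer`, `etl_log`) are illustrative:

```sql
BEGIN;

-- Rebuild the Silver table from the source inside one transaction.
TRUNCATE TABLE silver.customer;

INSERT INTO silver.customer
SELECT * FROM src.customer;  -- reads directly from the PostgreSQL source

-- Log the load so the run is recorded and traceable.
INSERT INTO etl_log (table_name, row_count, load_ts)
SELECT 'silver.customer', count(*), now()
FROM silver.customer;

COMMIT;
```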
Option 2: Adding a Row ID Column in the Source Table
- A row ID column uniquely identifies records and makes targeted deletions and insertions possible.
- However, the source system owners have not approved this modification, which rules the approach out for now (the required DDL is sketched below).
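Had the change been approved, adding a surrogate row ID in PostgreSQL is a single DDL statement; the table name `orders` is hypothetical:

```sql
-- Hypothetical change (not approved by the source system owners):
-- add an auto-generated surrogate key to a source table.
ALTER TABLE orders
    ADD COLUMN row_id bigint GENERATED ALWAYS AS IDENTITY;
```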
Option 3: Using the Transaction ID in PostgreSQL
- PostgreSQL transaction IDs (exposed, for example, through the xmin system column) are 32-bit counters shared across the whole database, so values recur after wraparound and carry no per-table ordering guarantee.
- Because of this, they cannot be used reliably for tracking changes (see the example below).
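As an illustration, the per-row transaction ID is easy to read but unsafe as an incremental watermark; the table name `orders` is hypothetical:

```sql
-- xmin holds the ID of the transaction that last inserted or updated each row.
SELECT xmin, * FROM orders LIMIT 5;

-- A filter such as "WHERE xmin > :last_seen" is not safe: the xid type has no
-- ordering operator, and the underlying 32-bit counter wraps around over time.
```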
Option 4: Truncate and Load Using the Landing Table
- Select the distinct records with the latest timestamp from the Landing table and truncate/load them into the Silver layer (see the sketch below).
- Implementation Strategy: Develop two separate procedures, one for tables with a primary key and another for tables without one. Move non-primary-key tables to the primary-key procedure once key columns become available, as determined by the respective table owners.
- Performance Consideration: Large tables may take significant time to process.
- Current Outcome: This approach has not yielded the expected results.
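A minimal sketch of the landing-to-Silver load under this option, assuming each landing row carries a timestamp column; all table and column names are illustrative:

```sql
TRUNCATE TABLE silver.orders;

-- Keep only the latest version of each record from the Landing table.
INSERT INTO silver.orders (order_id, customer_id, amount, updated_at)
SELECT order_id, customer_id, amount, updated_at
FROM (
    SELECT l.*,
           row_number() OVER (PARTITION BY order_id
                              ORDER BY updated_at DESC) AS rn
    FROM landing.orders AS l
) AS latest
WHERE rn = 1;
```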
Option 5: Replica Identity FULL Approach (Non-Primary-Key Tables Only)
- REPLICA IDENTITY FULL is enabled for tables without a primary key in PostgreSQL. With this setting, the WAL record carries the values of all columns, whether or not they changed.

Sample WAL record:
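A minimal illustration, assuming the wal2json output format; the table, columns, and values are hypothetical. Enabling the setting is a single DDL statement:

```sql
-- Enable full before-images in the WAL for a table without a primary key.
ALTER TABLE orders REPLICA IDENTITY FULL;
```

A decoded UPDATE then carries the complete old row, not just key columns:

```json
{
  "change": [
    {
      "kind": "update",
      "schema": "public",
      "table": "orders",
      "columnnames": ["order_id", "customer_id", "amount"],
      "columntypes": ["integer", "integer", "numeric"],
      "columnvalues": [42, 7, 99.50],
      "oldkeys": {
        "keynames": ["order_id", "customer_id", "amount"],
        "keytypes": ["integer", "integer", "numeric"],
        "keyvalues": [42, 7, 75.00]
      }
    }
  ]
}
```

Because `oldkeys` now contains every column of the old row, the delete-then-insert steps below can match rows even without a primary key.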
- Retrieve the old record's details from the before-values JSON, use them to delete the existing row, and insert the updated record into the table.
- When key columns are unavailable, generate a hash key over the before-values and use it to identify and delete the corresponding records in the target table (see the sketch after this list).
- To optimize performance, an additional hash key column can be materialized in the target table, reducing delete operation times.
- Primary-key tables will not use hash key columns; non-primary-key tables will incorporate them.
- It is assumed that the application layer sends accurate updates, preventing duplicate records.
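A minimal sketch of the hash-key delete-then-insert step, reusing the before/after values from the sample record above; md5 over a delimited concatenation is one simple hashing scheme, and all names are illustrative:

```sql
-- Delete the old row version by hashing its before-values
-- (42, 7, 75.00 in the sample WAL record).
DELETE FROM silver.orders
WHERE hash_key = md5(concat_ws('|', '42', '7', '75.00'));

-- Insert the new row version together with its own hash key.
INSERT INTO silver.orders (order_id, customer_id, amount, hash_key)
VALUES (42, 7, 99.50, md5(concat_ws('|', '42', '7', '99.50')));
```

If the target platform is not PostgreSQL, the same pattern applies with that platform's hash function.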
Considerations:
- Logging Overhead and Performance Impact: Setting REPLICA IDENTITY FULL increases Write-Ahead Log (WAL) generation and size, adding logging overhead that can affect database performance and storage.
- ETL Recommendation: The Product Team recommends an Extract, Transform, Load (ETL) approach as the preferred solution.
- Bronze-to-Silver Loading: Re-evaluate the strategy once the Mirroring feature becomes available, as it will be essential for this implementation.
- Data Integrity: The ETL process must ensure accurate change tracking to maintain data integrity.
- Testing: Test thoroughly in a staging environment to assess the performance implications and validate the replication process.
Conclusion
Adding a Row ID column (Option 2) is the recommended approach, as it allows precise identification of records. However, if modifying source tables is not feasible, using REPLICA IDENTITY FULL (Option 5) is a viable alternative despite its impact on performance and storage.
Establishing a Standard Operating Procedure (SOP) with the source system team to identify and use primary keys or unique keys whenever they are available will significantly improve the overall efficiency and accuracy of the process.