Native change data capture (CDC) - CosmosDB change feed is supported in ADF now

Published Dec 12 2021 11:58 PM
Microsoft

Changed data extraction from Azure CosmosDB in ADF just works!

 

In a data integration solution, incrementally loading data (delta loading) after an initial full load is a widely used pattern, because it avoids the time spent reprocessing the entire dataset on every run. To determine and track which rows changed, changed data extraction typically relies on a timestamp column or an incrementing key in the schema, which adds the complexity of building custom logic and creates a dependency on the user's schema. Now, with the native change data capture capability enabled in ADF, changed data extraction from Azure CosmosDB (CosmosDB SQL API, via the Azure Cosmos DB change feed) just works without any manual steps. After the changed data is automatically extracted from CosmosDB, you can apply any transformations in the data flow before loading the transformed data into the destination datasets of your choice. This can considerably accelerate your data integration journey in many use cases, including data replication and ETL.
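
For contrast, the custom logic that a timestamp-based approach requires might look roughly like the sketch below. Everything in it is hypothetical and for illustration only: the `modified_at` column, the row shape, and the `extract_changes` helper are assumptions, not part of ADF or CosmosDB.

```python
from datetime import datetime, timezone


def extract_changes(rows, last_watermark):
    """Return rows modified after last_watermark, plus the new watermark.

    Assumes every row carries a hypothetical `modified_at` timestamp column.
    """
    changed = [r for r in rows if r["modified_at"] > last_watermark]
    # Advance the watermark only if something changed; otherwise keep the old one.
    new_watermark = max((r["modified_at"] for r in changed), default=last_watermark)
    return changed, new_watermark


rows = [
    {"id": 1, "modified_at": datetime(2021, 12, 1, tzinfo=timezone.utc)},
    {"id": 2, "modified_at": datetime(2021, 12, 10, tzinfo=timezone.utc)},
]
watermark = datetime(2021, 12, 5, tzinfo=timezone.utc)
changed, watermark = extract_changes(rows, watermark)
```

The watermark has to be stored and updated by your own pipeline between runs, and the approach only works if the schema has a reliable timestamp or key column. That is the bookkeeping the native change feed support removes.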

 

How to use this feature:

 

1. After creating a data flow in ADF, add a source transformation and point it to a CosmosDB dataset.

 

[Image: CosmosDB change feed0.png]

 

2. Check "Change feed" and "Start from beginning".

 

[Image: CosmosDB change feed.png]

 

With these two properties checked, you get the changed data and can apply any transformations before loading the transformed data into the destination datasets of your choice.

  • Change feed (Preview): If checked, ADF automatically reads the changes since the last run from the Azure Cosmos DB change feed, which is a persistent record of changes to a container in the order they occur.
  • Start from beginning (Preview): If checked, the first run performs an initial load of the full snapshot data, and subsequent runs capture only changed data. If not checked, the first run skips the initial load, and subsequent runs capture only changed data. This setting corresponds to the setting of the same name in the Cosmos DB reference.
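
The interaction of the two settings can be pictured with a small toy model. This is only a sketch of the behavior described above, not ADF's implementation: the `run_source` function, the list standing in for the change feed, and the integer checkpoint are all assumptions made for illustration.

```python
def run_source(feed, checkpoint, start_from_beginning):
    """Simulate one data flow run with "Change feed" checked.

    feed: ordered list of changes (toy stand-in for the container's change feed)
    checkpoint: position after the last change read by the previous run,
                or None on the first run
    Returns (rows_read, new_checkpoint).
    """
    if checkpoint is None:
        # First run: load the full snapshot, or skip the initial load.
        start = 0 if start_from_beginning else len(feed)
    else:
        # Later runs: continue from where the previous run stopped.
        start = checkpoint
    return feed[start:], len(feed)


feed = ["doc1", "doc2"]
first_full, cp_full = run_source(feed, None, start_from_beginning=True)
first_skip, cp_skip = run_source(feed, None, start_from_beginning=False)

feed.append("doc3")  # a change arrives between runs
next_full, _ = run_source(feed, cp_full, start_from_beginning=True)
next_skip, _ = run_source(feed, cp_skip, start_from_beginning=False)
```

In this model the two variants differ only in the first run: one reads `doc1` and `doc2`, the other reads nothing. Both read just `doc3` in the next run.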

 

Please note: 

Keep the pipeline name and activity name unchanged so that ADF can record the checkpoint and automatically get the changed data since the last run. If you change the pipeline name or activity name, the checkpoint is reset, and the next run either starts from the beginning or captures changes from that point on.
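
One way to picture why a rename resets the checkpoint is to imagine the checkpoint being looked up by pipeline name and activity name. The sketch below assumes exactly that. How ADF stores checkpoints internally is not described here, so the `CheckpointStore` class and the names `CopyOrders` and `CosmosToLake` are purely hypothetical.

```python
class CheckpointStore:
    """Toy checkpoint store keyed by (pipeline name, activity name)."""

    def __init__(self):
        self._positions = {}

    def get(self, pipeline_name, activity_name):
        # Returns None when no checkpoint exists for this name pair.
        return self._positions.get((pipeline_name, activity_name))

    def save(self, pipeline_name, activity_name, position):
        self._positions[(pipeline_name, activity_name)] = position


store = CheckpointStore()
store.save("CopyOrders", "CosmosToLake", 42)

same_name = store.get("CopyOrders", "CosmosToLake")  # checkpoint found
renamed = store.get("CopyOrdersV2", "CosmosToLake")  # no checkpoint under the new name
```

Under this assumption, the renamed pipeline finds no checkpoint and behaves like a first run.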

 

This feature works the same way when you debug the pipeline. Be aware that the checkpoint is reset when you refresh your browser during a debug run. Once you are satisfied with the result of the debug run, you can publish and trigger the pipeline. The first time you trigger the published pipeline, it automatically restarts from the beginning or captures changes from that point on.

 

In the monitoring section, you can always rerun a pipeline. When you do so, the changed data is always captured from the previous checkpoint of the selected pipeline run.

 

Get more details in Copy and transform data in Azure Cosmos DB (SQL API) - Azure Data Factory & Azure Synapse | Microsof...

Last update: Jan 12 2022 04:27 PM