Azure Data Factory Blog

Native change data capture (CDC) - CosmosDB change feed is supported in ADF now

Ye Xu
Microsoft
Dec 13, 2021

Changed data extraction from Azure Cosmos DB in ADF just works!

 

In a data integration solution, incrementally loading data (delta loads) after an initial full load is a widely used pattern because it avoids reprocessing the entire dataset on every run. Traditionally, determining and tracking the changed rows in your tables requires a timestamp column or an incrementing key in the schema, which brings the complexity of building custom logic as well as a dependency on the user's schema. Now, with the native change data capture capability enabled in ADF, changed data extraction from Azure Cosmos DB (SQL API, via the Azure Cosmos DB change feed) just works without any manual steps. After the changed data is automatically extracted from Cosmos DB, you can directly apply any transformations in a data flow before loading the transformed data into destination datasets of your choice. This tremendously accelerates your data integration journey in many use cases, including data replication and ETL.

 

How to use this feature:

 

1. After creating a data flow in ADF, add a source transformation and reference your Cosmos DB dataset.

 

 

2. Check "Change feed" and "Start from beginning".

 

 

With these two properties checked, you can retrieve changes and apply any transformations before loading the transformed data into destination datasets of your choice.

  • Change feed (Preview): If checked, data is read from the Azure Cosmos DB change feed, a persistent record of changes to a container in the order they occur, picking up automatically from the last run.
  • Start from beginning (Preview): If checked, the first run performs an initial load of the full snapshot data, and subsequent runs capture changed data. If not checked, the initial load is skipped in the first run, and subsequent runs capture changed data only. This setting matches the setting of the same name in the Cosmos DB reference documentation.
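As a rough illustration, the source configured in steps 1 and 2 corresponds to a data flow script fragment along these lines. This is a sketch only: the exact property names the ADF UI generates may differ, and the stream name `CosmosChangeFeedSource` is made up for this example.

```
source(
    allowSchemaDrift: true,
    validateSchema: false,
    format: 'document',
    // Read from the Cosmos DB change feed instead of scanning the full container
    changeFeed: true,
    // First run loads the full snapshot; later runs pick up changes only
    startFromTheBeginning: true) ~> CosmosChangeFeedSource
```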

 

Please note: 

Make sure you keep the pipeline and activity names unchanged, so that ADF can record the checkpoint and automatically pick up changed data from the last run. If you change your pipeline name or activity name, the checkpoint is reset, and the next run starts from the beginning or captures changes from now on.
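If you need the checkpoint to survive beyond the default name-based tracking, the data flow activity in the pipeline JSON can carry an explicit checkpoint key under `continuationSettings`. The fragment below is a hypothetical sketch (the activity name, key value, and omitted sibling properties are illustrative, not taken from this post):

```json
{
    "name": "ExtractCosmosChanges",
    "type": "ExecuteDataFlow",
    "typeProperties": {
        "continuationSettings": {
            "customizedCheckpointKey": "cosmos-profile-users-cdc"
        }
    }
}
```

With a stable key like this, checkpoint tracking is tied to the key value rather than to whatever name the pipeline or activity happens to have at deployment time.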

 

This feature works the same way when you debug the pipeline. Be aware that the checkpoint is reset when you refresh your browser during a debug run. After you are satisfied with the results of the debug run, you can go ahead and publish and trigger the pipeline. The first time you trigger the published pipeline, it automatically restarts from the beginning or captures changes from now on.

 

In the monitoring section, you always have the chance to rerun a pipeline. When you do so, the changed data is always captured from the checkpoint of the pipeline run you selected.

 

Get more details in Copy and transform data in Azure Cosmos DB (SQL API) - Azure Data Factory & Azure Synapse | Microsoft Docs

Updated Jan 13, 2022
Version 4.0

3 Comments

  • jasperdefesche
    Copper Contributor

    Great feature but we run into a strange issue.

     

    Steps to reproduce:

    - Change a document in Cosmos DB.
    - Run the pipeline that contains the change feed-enabled data flow.
    - Witness that the monitoring reports that one document is processed.
    - Change a document in Cosmos DB.
    - Change something in development ADF.
    - Publish towards the adf_publish branch, which in our case triggers a release towards acceptance.
    - Once deployment to acceptance is complete, run the pipeline again.
    - The run reports that no changes exist, where one document is expected.

     

    Dataflow source configuration:

     

    After analysis we noticed that the ARM template creation always changes the pipeline under the hood:


    "continuationSettings": {
    "customizedCheckpointKey": "9b29c1e3-aaca-4dee-a29f-c464b67b2f34"
    }

    The customizedCheckpointKey always contains a new/different GUID, which is the suspected reason the change feed no longer works after a release. The documentation suggests that keeping the pipeline name and data flow activity name the same is the only requirement to avoid breaking the change feed, but in our case this is not true.

  • When attempting to configure an ADF data flow as you show, I get this error when attempting a data preview from an existing and functioning (in an ADF pipeline) Cosmos data source:

     

    DF-SRC-002 at Source 'SourceCosmosProfileUsers': 'masterKey' (Master Key) is required

     

    What's this about?

     

     

  • mischmuc
    Copper Contributor

    CDC preview was also available for delta files, but now it is gone. Is there any reason for this?