By @Mark Kromer and @revinchalil
Making it super-easy to create efficient and fast ETL processing the cloud, Azure Data Factory has invested heavily in change data capture features. Today, we are super-excited to announce that Azure Cosmos DB analytics store now supports Change Data Capture (CDC), for Azure Cosmos DB API for NoSQL, and Azure Cosmos DB API for Mongo DB in public preview!
This capability, available in public preview, allows you to efficiently consume a continuous and (inserted, updated, and deleted) data from the analytical store. CDC is seamlessly integrated with Azure Synapse Analytics and Azure Data Factory, a scalable no-code experience for high data volume. As CDC is based on the analytical store, it does not consume provisioned RUs, does not affect the performance of your transactional workloads, provides lower latency, and has lower TCO.
Change Data Capture (CDC) with Analytical store. Click here for supported sink types on Mapping Data Flow.
You can consider using analytical store CDC, if you are currently using or planning to use below:
Note that analytical store has up to 2 min latency to sync transactional store data
Operations on Cosmos DB analytical store do not consume the provisioned RUs and so do not impact your transactional workloads. CDC with analytical store also has lower latency and lower TCO, compared to using ChangeFeed on transactional store. The lower latency is attributed to analytical store enabling better parallelism for data processing and reduces the overall TCO enabling you to drive cost efficiencies.
The seamless and native integration of analytical store CDC with Azure Synapse and Azure Data Factory provides the no-code, low-touch experience.
Change data capture capability enables an end-to-end analytical story providing the flexibility to write Cosmos DB data to any of the supported sink types. It also enables you to bring Cosmos DB data into a centralized data lake where you can federate your data from diverse data sources. You can flatten the data, partition it, and apply more transformations either in Azure Synapse Analytics or Azure Data Factory.
You can consume incremental analytical store data from a Cosmos DB container using either Azure Synapse Analytics or Azure Data Factory, once the Cosmos DB account has been enabled for Synapse Link and analytical store has been enabled on a new container or an existing container.
On the Azure Synapse Data flow or on the Azure Data Factory Mapping Data flow, choose the Inline dataset type as “Azure Cosmos DB for NoSQL” and Store type as “Analytical”, as seen below.
In addition to providing incremental data feed from analytical store to diverse targets, CDC supports the following capabilities.
You can optionally use a source query to specify filter(s), projection(s), and transformation(s) which would all be pushed down to the analytical store. Below is a sample source-query that would only capture incremental records with Category = 'Urban', project only a subset of fields and apply a simple transformation.
Select ProductId, Product, Segment, concat(Manufacturer, ‘-‘, Category) as ManufacturerCategory
from c
where Category = ‘Urban’
Analytical store CDC captures deleted records and intermediate updates. The captured deletes and updates can be applied on sinks that support delete and update operations. The {_rid} value uniquely identifies the records and so by specifying {_rid} as key column on the sink side, the updates and deletes would be reflected on the sink.
You can filter the CDC feed for a specific type of operation. For example, you have the option to selectively capture the Insert and update operations only, thereby ignoring the user-delete and TTL-delete operations.
In addition to specifying filters, projections, and transformations via source query, you can also perform advanced schema operations such as flattening, applying advanced row modifier operations and dynamically partitioning the data based on the given key.
Each change in Cosmos DB container appears exactly once in the CDC feed, and the checkpoints are managed internally for you. This helps to address the below disadvantages of the common pattern of using custom checkpoints based on the “_ts” value:
With CDC, there’s no limitation around the fixed data retention period for which changes are available. Multiple change feeds on the same container can be consumed simultaneously. Changes can be synchronized from “the Beginning” or “from a given timestamp” or “from now”.
Please note that the linked service interface for Azure Cosmos DB for MongoDB API is not available on Dataflow yet. However, you would be able to use your account’s document endpoint with the “Azure Cosmos DB for NoSQL” linked service interface as a workaround until the Mongo linked service is supported.
Eg: ON a NoSQL linked service, choose “Enter Manually” to provide the Cosmos DB account info and use the account’s document endpoint (eg: https://<accturi>.documents.azure.com:443/) instead of the Mongo endpoint (eg: mongodb://<accturi>.mongo.cosmos.azure.com:10255/)
To learn more, please see our documentation on Change Data Capture in Azure Cosmos DB analytical store.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.