This blog is not intended to cover Cosmos Db in detail, we will glance at some of its main features, then present a use case of integrating it into an existing SQL application. If the reader is already familiar with Cosmos you may skip the intro section.
Document databases gained huge popularity in the past decade. Their flexible and semi structured nature makes it easier for developers to persist and query data, when compared to traditional relational engines. Their hierarchical format facilitates binding database fields to object-model or user interface attributes, especially when dealing with complex entities with multiple collections and nested attributes.
Microsoft Cosmos Db is one of the main engines in this space, along with MongoDb Atlas, Google Firestore and Amazon DocumentDB.
Cosmos DB also offers distinct features:
These features are highly desirable for applications with high throughput or 24/7 uptime, and are difficult to achieve with classic relational engines. Other interesting Cosmos DB features include:
Pricing: https://azure.microsoft.com/en-us/pricing/details/cosmos-db/
Understanding your bill: Understanding your Azure Cosmos DB bill | Microsoft Learn
Reservation: Reserved capacity in Azure Cosmos DB to Optimize cost | Microsoft Learn
Storage: Optimize storage cost in Azure Cosmos DB | Microsoft Learn
Throughput: Optimizing throughput cost in Azure Cosmos DB | Microsoft Learn
RUs: Optimizing the cost of your requests in Azure Cosmos DB | Microsoft Learn
Indexing: Azure Cosmos DB indexing policies | Microsoft Learn
Rate Limiting: Optimize your Azure Cosmos DB application using rate limiting | Microsoft LearnDocument databases also have downsides:
Migrating applications from SQL to Cosmos is challenging because the engines have different paradigms. The rise of service-oriented architectures increased data independence across applications. Instead of writing your PC, Web, Mobile and Integration apps all pointing to the same database, many companies have been opting for splitting (or duplicating) data across different database engines, allowing for greater scalability, and leveraging the best features of each engine. A common trend is to keep your transactional workload in a SQL backend, and analytical workload in a Distributed SQL or Spark environment, as an example Synapse covers both use cases, and the latest versions of SQL offer a link to replicate data into Synapse Dedicated Pools. Many companies also have been using document dbs (instead of SQL) for their transactional systems. To improve reportability of your data, Cosmos DB offers the option to turn on the analytical store and link to Synapse. The analytical store flattens your data, converting nested json attributes into a tabular format. The Synapse Link replicates your data into ADLS blob storage in parquet format, allowing you to query with Spark or Serverless SQL Pools. The latency is usually a few seconds.
Our use case came from a project with a major health insurance provider in the United States. They used a SQL backend for their benefits, claims, web portal and customer service systems. They then created a mobile app over the same SQL backend, which increased their traffic significantly, also increasing latency and concurrency issues with SQL. Another challenge was the fact that their database was highly normalized, and when a customer opened their mobile app, they pulled data from several tables. They tried scaling SQL up, enabling read replicas, tuning indexes, but as an SMP engine SQL can only scale so far. They then considered creating a separate database for their mobile app and wanted to store data in a way that facilitated pulling all information about a customer, as a “high performance cache”. This concept has long been implemented with Redis. However, Redis have some limitations the customer was concerned about:
We then opted to use Cosmos Db as the cache database. The document model was ideal to store all information about a customer in a few collections and allowed for consistently fast page loads without any impact on SQL. The automatic indexing feature allowed for fast searches. Essentially all the read operations interacted with Cosmos DB. The write operations (which were rare) invoked the original APIs over the SQL backend and triggered our custom “SQL To Cosmos Replication”. During the first phase all changes were still made in SQL. When users called their customer service to make a change, or changed their coverage during open enrollment, the updates would be done to SQL, leveraging mature routines and constraints, and our custom replication would update Cosmos within a few seconds. This architecture allowed to remove indexes in SQL and to eliminate read replicas, reducing the overall cost while improving end user experience.
This is an example of how to track changes in SQL and replicate to Cosmos. Our sample code should be used as a reference so the reader can understand the concept, then implement and test their own version.
Link to repo: silvam-microsoft/sql-to-cosmos-replication (github.com)
The solution composed of 2 parts:
1. The SQL script creates the following objects under cosmos schema:
Type |
Name |
Purpuse |
Table |
ImportLog |
Stores history of execution. Each run uses “high watermark” method with range DateFrom and DateTo values, both as datetime. The DateFrom takes the last successful DateTo, and DateTo gets the current system time. Other columns track execution time and how many documents were processed.
|
Table |
Collections |
Defines which collections in Cosmos need to be updated, each collection is implemented by a custom stored procedure. |
Procedure |
spStart |
To start the sync process |
Procedure |
spEnd |
To finalize the sync process |
Procedure |
spGetPersonsToRefresh |
This procedure gets the Persons that changed since last execution. Change tracking is based on column [ModifiedDate], and we are tracking tables Person, PersonPhone and EmailAddress. We assume that no hard deletes are executed, and that all processed that modify Person update the ModifiedDate column.
|
Procedure |
spGetPersonJSON |
This proc converts a person record to json, including their phones and emails. We leverage the FOR JSON native function.
|
Function |
fnGetPersonJSON |
This function does the same thing as the prior procedure, but allows for inline execution. |
Procedure |
spGetPersonsToRefreshV2 |
This function returns all the persons which changed within a date range, and their JSONs. This proc is redundant but serves to illustrate the use in multiple ways. Notice the execution is very fast. In our dev environment we got 10k person jsons in 1 second. |
Our example uses the AdventureWorks2019 on SQL 2019 Developer, which can be downloaded on the links below.
Our sample should work in any supported version of SQL, including Azure SQL and Managed Instance.
https://docs.microsoft.com/en-us/sql/samples/adventureworks-install-configure?view=sql-server-ver16&...
SQL Server Downloads | Microsoft
2. This C# console app gets results from spGetPersonsToRefreshv2 and saves to Cosmos using the UpsertItemStreamAsync method. It logs execution in the log table using spStart and spEnd.
During the first execution there is no history, so it picks up all persons. On later runs it picks up only the people that had change since last run.
This process can be scheduled with powershell. Or you can opt to process each person when they get updated using spGetPersonJSON.
You must be a registered user to add a comment. If you've already registered, sign in. Otherwise, register and sign in.