Synapse Link addresses an important scenario for Cosmos DB: running analytical workloads efficiently, without risking the performance of the transactional applications supplying the data.
To put this feature to the test, we used an accelerometer dataset for activity recognition (you can check it out here: https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer). Our goal is to perform analytical workloads - in this case, training an ML classification model - over data stored in Cosmos DB.
The problem
Suppose we would like to create an application that can predict, in real time, what an end user is doing based on accelerometer data. This has many applications, such as detecting when an elderly person requires help, or generating analytics for a fitness app. Whatever the use case, our objective is to expose an endpoint that receives acceleration data as input and outputs the user's current activity - standing, sitting, walking, and so on.
Solution architecture
We propose the following architecture as a solution to this problem:
We can break down the diagram into the following steps:
- Some initial labeled data is loaded into Cosmos DB.
This data may look something like this:
| Timestamp | ParticipantId | AccelerationX | AccelerationY | AccelerationZ | Activity |
| --- | --- | --- | --- | --- | --- |
| 00:00:00.0000 | 1 | 1222 | 1402 | 2037 | Walking |
| 00:00:00.0010 | 1 | 1245 | 1426 | 1956 | Walking |
| 00:00:00.0020 | 1 | 1142 | 1363 | 1986 | Walking |
| 00:00:00.0030 | 1 | … | … | … | Standing |
| … | … | … | … | … | … |
We call this labeled data because it contains labels: the actual activity being performed, as reported by the user. Since we know these labels to be correct, we also refer to this data as our ground truth.
- Because Synapse Link is enabled on the Cosmos DB account, this data is replicated in near real time to the analytical store - a column store that behaves much like a storage account.
- Spark notebooks in Synapse use the analytical store to perform data science workloads. This includes aggregations, plots, joins and other operations which would be either too costly or too slow on regular Cosmos DB. That’s because Cosmos DB is optimized for transactional - not analytical - workloads.
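As a rough sketch, Synapse Spark reads the analytical store through the `cosmos.olap` connector. The linked service and container names below are placeholders, not the actual names used in our repo:

```python
# Reader options for the Synapse Link (cosmos.olap) connector.
# Both argument values are assumptions for illustration.
def analytical_store_options(linked_service: str, container: str) -> dict:
    return {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }

# Inside a Synapse Spark notebook, where `spark` is the session:
# df = (spark.read
#           .format("cosmos.olap")
#           .options(**analytical_store_options("CosmosDbLinkedService", "accelerometer"))
#           .load())
```

Note that this reads from the analytical store, so it never consumes request units from the transactional side.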
- Using the Azure ML SDK, the model is deployed to an Azure Machine Learning workspace. This makes it available for real-time predictions.
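Real-time deployments in Azure ML are driven by an entry (scoring) script with two functions: `init()`, which loads the model once, and `run()`, which handles each request. Here is a self-contained sketch of that contract; the threshold "model" and field names are placeholders, not our notebook's actual classifier:

```python
import json

def init():
    # In a real deployment this would deserialize the registered model
    # (e.g. with joblib); a trivial threshold rule stands in for it here.
    global model
    model = lambda x, y, z: "Walking" if z > 2000 else "Standing"

def run(raw_data):
    # raw_data is the JSON body posted to the endpoint.
    records = json.loads(raw_data)["data"]
    return [model(r["AccelerationX"], r["AccelerationY"], r["AccelerationZ"])
            for r in records]
```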
- Now, our application can send unlabeled data to the endpoint in AzureML and generate predictions - or labeled data.
Here’s what an application might send to the endpoint:
| Timestamp | AccelerationX | AccelerationY | AccelerationZ |
| --- | --- | --- | --- |
| 00:00:00.0000 | 1222 | 1402 | 2037 |
| 00:00:00.0010 | 1245 | 1426 | 1956 |
| 00:00:00.0020 | 1142 | 1363 | 1986 |
| 00:00:00.0030 | … | … | … |
| … | … | … | … |
And here’s a possible output:
| Timestamp | Activity |
| --- | --- |
| 00:00:00.0000 | Running |
| 00:00:00.0010 | Running |
| 00:00:00.0020 | Going up stairs |
| 00:00:00.0030 | Going up stairs |
| … | … |
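To get predictions like these, the application posts a JSON payload to the endpoint. A minimal sketch, assuming the field names from the tables above (the actual request schema depends on the deployed scoring script, and the scoring URI is a placeholder):

```python
import json

def build_scoring_payload(readings):
    """Serialize (timestamp, x, y, z) tuples into a JSON request body."""
    return json.dumps({"data": [
        {"Timestamp": t, "AccelerationX": x, "AccelerationY": y, "AccelerationZ": z}
        for (t, x, y, z) in readings
    ]})

# Sending it to the deployed endpoint:
# import requests
# response = requests.post(scoring_uri,
#                          data=build_scoring_payload(readings),
#                          headers={"Content-Type": "application/json"})
```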
- This labeled data allows for interesting insights: does the user need assistance? What does their day look like?
- Optionally, we may now close the loop: by having a human review the predictions, we generate more ground truth, which in turn improves our training.
Procedure
The first thing we did was spin up the infrastructure. This includes:
- Cosmos DB
- Synapse Analytics workspace
- Key Vault - for storing secrets for Synapse
- Storage Account - Synapse’s default storage
- Azure Machine Learning
You can check out the ARM templates in the IaC/ directory of our repo, linked at the end of the article.
Next, we needed to load our ground truth into Cosmos DB. We used a Synapse Pipeline for this:
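The core of such a pipeline is a Copy activity that writes the source file into Cosmos DB. An illustrative (not exact) activity definition might look like the following - the dataset names are placeholders:

```json
{
    "name": "LoadGroundTruth",
    "type": "Copy",
    "inputs":  [ { "referenceName": "AccelerometerCsv", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "CosmosDbCollection", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "CosmosDbSqlApiSink", "writeBehavior": "upsert" }
    }
}
```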
Once the data is in Cosmos DB, we can use Synapse Spark to load it into a DataFrame and train our model.
Check out the notebook for more instructions on how this is done.
The notebook will perform exploratory data analysis, train and evaluate a model, and deploy it to Azure Machine Learning. Once it’s there, you can test your deployment with some sample data to verify that it’s working:
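As a condensed illustration of the training step, here is a self-contained sketch using scikit-learn on synthetic stand-in data - the real notebook trains on the labeled data loaded from the analytical store, and its feature engineering and model choice may differ:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled accelerometer data:
# three acceleration features per row, one Activity label each.
rng = np.random.default_rng(42)
X = rng.integers(900, 2100, size=(200, 3))           # AccelerationX/Y/Z
y = rng.choice(["Walking", "Standing"], size=200)    # Activity labels

# Hold out a test split, train a classifier, and predict activities.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)  # one predicted Activity per held-out row
```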
Conclusion
As a result, we could perform predictive analytics on a dataset in Cosmos DB, without having to create any additional storage or ETL processes. Everything is taken care of and optimized by Microsoft through Synapse Link. But don’t take my word for it - go check out our repository with instructions on how to run this solution in your own subscription at https://github.com/MarcoABCardoso/2b1-luti-marco!
Next steps
Want to know more about how all of this works? Stay tuned for more posts as we discuss the configuration and best practices for each component in this architecture!