Synapse Link addresses an important scenario for Cosmos DB: running analytical workloads efficiently, without risking the performance of the transactional applications supplying the data.
To put this feature to the test, we used an accelerometer dataset for activity recognition (you can check it out here: https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer). Our goal is to perform analytical workloads - in this case, training an ML classification model - over data stored in Cosmos DB.
The problem
Suppose we would like to create an application that can predict, in real time, what an end user is doing based on accelerometer data. This has many applications, such as detecting when an elderly person requires help, or generating analytics for a fitness app. Whatever the use case, our objective is to expose an endpoint that receives acceleration data as input and outputs the user's current activity - standing, sitting, walking, and so on.
Solution architecture
We propose the following architecture as a solution to this problem:
We can break down the diagram into the following steps:
- Some initial labeled data is loaded into Cosmos DB.
This data may look something like this:
| Timestamp | ParticipantId | AccelerationX | AccelerationY | AccelerationZ | Activity |
| --- | --- | --- | --- | --- | --- |
| 00:00:00.0000 | 1 | 1222 | 1402 | 2037 | Walking |
| 00:00:00.0010 | 1 | 1245 | 1426 | 1956 | Walking |
| 00:00:00.0020 | 1 | 1142 | 1363 | 1986 | Walking |
| 00:00:00.0030 | 1 | … | … | … | Standing |
| … | … | … | … | … | … |
We call this labeled data because it contains labels: the actual activity being performed, as reported by the user. Since we know these labels to be correct, we also refer to this data as our ground truth.
- Because Synapse Link is enabled on the Cosmos DB account, this data is replicated in near real time to the analytical store - a column store that behaves much like a storage account.
- Spark notebooks in Synapse use the analytical store to perform data science workloads. This includes aggregations, plots, joins and other operations which would be either too costly or too slow on regular Cosmos DB. That’s because Cosmos DB is optimized for transactional - not analytical - workloads.
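As a rough sketch, Synapse Spark reads the analytical store through the `cosmos.olap` connector. The linked service and container names below are placeholders, not the actual names used in our repo:

```python
# Reader options for the Synapse Link (cosmos.olap) connector.
# Both argument values are assumptions for illustration.
def analytical_store_options(linked_service: str, container: str) -> dict:
    return {
        "spark.synapse.linkedService": linked_service,
        "spark.cosmos.container": container,
    }

# Inside a Synapse Spark notebook, where `spark` is the session:
# df = (spark.read
#           .format("cosmos.olap")
#           .options(**analytical_store_options("CosmosDbLinkedService", "accelerometer"))
#           .load())
```

Note that this reads from the analytical store, so it never consumes request units from the transactional side.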
- Using the Azure ML SDK, the model is deployed to an Azure Machine Learning workspace. This makes it available for real-time predictions.
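Real-time deployments in Azure ML are driven by an entry (scoring) script with two functions: `init()`, which loads the model once, and `run()`, which handles each request. Here is a self-contained sketch of that contract; the threshold "model" and field names are placeholders, not our notebook's actual classifier:

```python
import json

def init():
    # In a real deployment this would deserialize the registered model
    # (e.g. with joblib); a trivial threshold rule stands in for it here.
    global model
    model = lambda x, y, z: "Walking" if z > 2000 else "Standing"

def run(raw_data):
    # raw_data is the JSON body posted to the endpoint.
    records = json.loads(raw_data)["data"]
    return [model(r["AccelerationX"], r["AccelerationY"], r["AccelerationZ"])
            for r in records]
```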
- Now, our application can send unlabeled data to the endpoint in AzureML and generate predictions - or labeled data.
Here’s what an application might send to the endpoint:
| Timestamp | AccelerationX | AccelerationY | AccelerationZ |
| --- | --- | --- | --- |
| 00:00:00.0000 | 1222 | 1402 | 2037 |
| 00:00:00.0010 | 1245 | 1426 | 1956 |
| 00:00:00.0020 | 1142 | 1363 | 1986 |
| 00:00:00.0030 | … | … | … |
| … | … | … | … |
And here’s a possible output:
| Timestamp | Activity |
| --- | --- |
| 00:00:00.0000 | Running |
| 00:00:00.0010 | Running |
| 00:00:00.0020 | Going up stairs |
| 00:00:00.0030 | Going up stairs |
| … | … |
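To get predictions like these, the application posts a JSON payload to the endpoint. A minimal sketch, assuming the field names from the tables above (the actual request schema depends on the deployed scoring script, and the scoring URI is a placeholder):

```python
import json

def build_scoring_payload(readings):
    """Serialize (timestamp, x, y, z) tuples into a JSON request body."""
    return json.dumps({"data": [
        {"Timestamp": t, "AccelerationX": x, "AccelerationY": y, "AccelerationZ": z}
        for (t, x, y, z) in readings
    ]})

# Sending it to the deployed endpoint:
# import requests
# response = requests.post(scoring_uri,
#                          data=build_scoring_payload(readings),
#                          headers={"Content-Type": "application/json"})
```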
- This labeled data allows for interesting insights: does the user need assistance? What does their day look like?
- Optionally, we may now close the loop: by having a human review the predictions, we generate more ground truth, which in turn improves our training.
Procedure
The first thing we did was spin up the infrastructure. This includes:
- Cosmos DB
- Synapse Analytics workspace
- Key Vault - for storing secrets for Synapse
- Storage Account - Synapse’s default storage
- Azure Machine Learning
You can check out the ARM templates in the IaC/ directory of our repo, linked at the end of the article.
Next, we needed to load our ground truth into Cosmos DB. We used a Synapse Pipeline for this:
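The core of such a pipeline is a Copy activity that writes the source file into Cosmos DB. An illustrative (not exact) activity definition might look like the following - the dataset names are placeholders:

```json
{
    "name": "LoadGroundTruth",
    "type": "Copy",
    "inputs":  [ { "referenceName": "AccelerometerCsv", "type": "DatasetReference" } ],
    "outputs": [ { "referenceName": "CosmosDbCollection", "type": "DatasetReference" } ],
    "typeProperties": {
        "source": { "type": "DelimitedTextSource" },
        "sink": { "type": "CosmosDbSqlApiSink", "writeBehavior": "upsert" }
    }
}
```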
Once the data is in Cosmos DB, we can use Synapse Spark to load it into a DataFrame and train our model.
Check out the notebook for more instructions on how this is done.
The notebook will perform exploratory data analysis, train and evaluate a model, and deploy it to Azure Machine Learning. Once it’s there, you can test your deployment with some sample data to verify that it’s working:
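As a condensed illustration of the training step, here is a self-contained sketch using scikit-learn on synthetic stand-in data - the real notebook trains on the labeled data loaded from the analytical store, and its feature engineering and model choice may differ:

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the labeled accelerometer data:
# three acceleration features per row, one Activity label each.
rng = np.random.default_rng(42)
X = rng.integers(900, 2100, size=(200, 3))           # AccelerationX/Y/Z
y = rng.choice(["Walking", "Standing"], size=200)    # Activity labels

# Hold out a test split, train a classifier, and predict activities.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = RandomForestClassifier(n_estimators=50, random_state=0).fit(X_train, y_train)
predictions = clf.predict(X_test)  # one predicted Activity per held-out row
```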
Conclusion
As a result, we could perform predictive analytics on a dataset in Cosmos DB, without having to create any additional storage or ETL processes. Everything is taken care of and optimized by Microsoft through Synapse Link. But don’t take my word for it - go check out our repository with instructions on how to run this solution in your own subscription at https://github.com/MarcoABCardoso/2b1-luti-marco!
Next steps
Want to know more about how all of this works? Stay tuned for more posts as we discuss the configuration and best practices for each component in this architecture!