Synapse Link addresses an important scenario when working with Cosmos DB: processing analytical workloads efficiently, without putting the transactional applications that supply the data at risk.
To put this feature to the test, we used an accelerometer dataset for activity recognition (you can check it out here: https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer). Our goal is to perform analytical workloads (in this case, training an ML classification model) over data stored in Cosmos DB.
The problem
Suppose we would like to create an application that can predict, in real time, what an end user is doing based on accelerometer data. This has many applications, such as detecting when an elderly person requires help, or generating analytics for a fitness app. Whatever the reason behind it, our objective is to expose an endpoint that receives acceleration data as input and outputs the user's current activity: standing, sitting, walking, and so on.
Solution architecture
We propose the following architecture as a solution to this problem:
We can break down the diagram into the following steps:
The accelerometer data may look something like this:
| Timestamp | ParticipantId | AccelerationX | AccelerationY | AccelerationZ | Activity |
| --- | --- | --- | --- | --- | --- |
| 00:00:00.0000 | 1 | 1222 | 1402 | 2037 | Walking |
| 00:00:00.0010 | 1 | 1245 | 1426 | 1956 | Walking |
| 00:00:00.0020 | 1 | 1142 | 1363 | 1986 | Walking |
| 00:00:00.0030 | 1 | … | … | … | Standing |
| … | … | … | … | … | … |
We call this labeled data because it contains labels - that is, the actual activity being performed, as reported by the user. Since we know these activities to be correct, we also refer to this data as our ground truth.
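To make this concrete, here is a minimal sketch of how the raw UCI files could be loaded into this shape with pandas. The file name, column names, sampling rate, and activity-code mapping below are assumptions based on the dataset's documentation, not code from our repo:

```python
import pandas as pd

# Illustrative subset of the dataset's numeric activity codes; check the
# UCI documentation for the full mapping before relying on it.
ACTIVITY_NAMES = {3: "Standing", 4: "Walking", 5: "Going up stairs"}

# Each participant's file (e.g. 1.csv) is a headerless CSV of
# sample number, x/y/z acceleration, and an activity code.
df = pd.read_csv(
    "1.csv",
    header=None,
    names=["Sample", "AccelerationX", "AccelerationY", "AccelerationZ", "ActivityCode"],
)
df["ParticipantId"] = 1
df["Activity"] = df["ActivityCode"].map(ACTIVITY_NAMES)

# The accelerometer samples at 52 Hz, so a timestamp can be derived
# from the sequential sample number.
df["Timestamp"] = pd.to_timedelta(df["Sample"] / 52, unit="s")
print(df.head())
```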
Here’s what an application might send to the endpoint:
| Timestamp | AccelerationX | AccelerationY | AccelerationZ |
| --- | --- | --- | --- |
| 00:00:00.0000 | 1222 | 1402 | 2037 |
| 00:00:00.0010 | 1245 | 1426 | 1956 |
| 00:00:00.0020 | 1142 | 1363 | 1986 |
| 00:00:00.0030 | … | … | … |
| … | … | … | … |
And here’s a possible output:
| Timestamp | Activity |
| --- | --- |
| 00:00:00.0000 | Running |
| 00:00:00.0010 | Running |
| 00:00:00.0020 | Going up stairs |
| 00:00:00.0030 | Going up stairs |
| … | … |
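As a rough illustration, a client application could call such an endpoint as shown below. The URL, key, and payload shape are placeholders for whatever your deployed endpoint actually expects, not values from our solution:

```python
import requests

# Placeholder values: substitute your deployed endpoint's URL and key.
SCORING_URL = "https://<your-endpoint>.azureml.net/score"
API_KEY = "<your-api-key>"

payload = {
    "data": [
        # Timestamp, AccelerationX, AccelerationY, AccelerationZ
        ["00:00:00.0000", 1222, 1402, 2037],
        ["00:00:00.0010", 1245, 1426, 1956],
    ]
}

response = requests.post(
    SCORING_URL,
    json=payload,
    headers={"Authorization": f"Bearer {API_KEY}"},
)
response.raise_for_status()
print(response.json())  # e.g. one predicted activity per input row
```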
Procedure
The first thing we did was spin up the infrastructure. You can check out the ARM templates for every resource involved in the IaC/ directory of our repo, linked at the end of the article.
Next, we needed to load our ground truth into Cosmos DB. We used a Synapse Pipeline for this.
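For reference, the same load could be sketched with the azure-cosmos Python SDK rather than a pipeline. The account URL, key, and database and container names below are placeholders, and in practice each row of the dataset would become one document:

```python
from azure.cosmos import CosmosClient

# Placeholder connection details: substitute your own account's values.
client = CosmosClient(
    "https://<your-account>.documents.azure.com:443/",
    credential="<your-key>",
)
container = client.get_database_client("<database>").get_container_client("<container>")

# Each accelerometer reading becomes one document; "id" is required by Cosmos DB.
reading = {
    "id": "participant-1-sample-0",
    "ParticipantId": 1,
    "Timestamp": "00:00:00.0000",
    "AccelerationX": 1222,
    "AccelerationY": 1402,
    "AccelerationZ": 2037,
    "Activity": "Walking",
}
container.upsert_item(reading)
```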
Once the data is in Cosmos DB, we can use Synapse Spark to load it into a DataFrame and train our model.
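In a Synapse notebook, reading from the Cosmos DB analytical store uses the cosmos.olap format. A minimal sketch, assuming a linked service named CosmosDbLink and a container named SensorData (your names will differ):

```python
# Runs inside a Synapse notebook, where a `spark` session is predefined.
# Reads go against the analytical store, so they consume no request units
# from the transactional store.
df = (
    spark.read.format("cosmos.olap")
    .option("spark.synapse.linkedService", "CosmosDbLink")
    .option("spark.cosmos.container", "SensorData")
    .load()
)
df.printSchema()
```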
Check out the notebook for more instructions on how this is done.
The notebook will perform exploratory data analysis, train and evaluate a model, and deploy it to Azure Machine Learning. Once it's there, you can test your deployment with some sample data to verify that it's working.
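While the full workflow lives in the notebook, the train-and-evaluate step could look something like the scikit-learn sketch below. The feature and label columns mirror the tables above, but the model type and parameters here are illustrative assumptions, not necessarily what the notebook uses:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

# pdf is a pandas DataFrame with the columns shown earlier,
# e.g. pdf = df.toPandas() after the Spark read above.
X = pdf[["AccelerationX", "AccelerationY", "AccelerationZ"]]
y = pdf["Activity"]

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

model = RandomForestClassifier(n_estimators=100, random_state=42)
model.fit(X_train, y_train)
print("Accuracy:", accuracy_score(y_test, model.predict(X_test)))
```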
Conclusion
As a result, we were able to perform predictive analytics on a dataset in Cosmos DB without having to create any additional storage or ETL processes; everything is handled and optimized by Microsoft through Synapse Link. But don't take my word for it: check out our repository, with instructions for running this solution in your own subscription, at https://github.com/MarcoABCardoso/2b1-luti-marco!
Next steps
Want to know more about how all of this works? Stay tuned for more posts as we discuss the configuration and best practices for each component in this architecture!