Machine Learning with Cosmos DB and Synapse Link
Published Aug 18 2022 03:44 AM
Microsoft

Synapse Link addresses an important scenario when working with Cosmos DB: processing analytical workloads efficiently, without affecting the transactional applications that supply the data.

 

To put this feature to the test, we used an accelerometer dataset for activity recognition (available at https://archive.ics.uci.edu/ml/datasets/Activity+Recognition+from+Single+Chest-Mounted+Accelerometer). Our goal is to run an analytical workload, in this case training an ML classification model, over data stored in Cosmos DB.

 

The problem

 

Suppose we want to build an application that can predict, in real time, what an end user is doing based on accelerometer data. This has many applications, like detecting when an elderly person requires help, or generating analytics for a fitness app. Whatever the reason behind it, our objective is to expose an endpoint that receives acceleration data as input and outputs the user’s current activity - that is, standing, sitting, walking, etc.

 

Solution architecture

 

We propose the following architecture as a solution to this problem:

 

[Figure: solution architecture diagram]

 

 

We can break down the diagram into the following steps:

 

  • Some initial labeled data is loaded into Cosmos DB. 

This data may look something like this:

Timestamp      ParticipantId  AccelerationX  AccelerationY  AccelerationZ  Activity
00:00:00.0000  1              1222           1402           2037           Walking
00:00:00.0010  1              1245           1426           1956           Walking
00:00:00.0020  1              1142           1363           1986           Walking
00:00:00.0030  1              ...            ...            ...            Standing
...

We call this labeled data because each row carries a label: the actual activity being performed, as reported by the user. Since we know this activity to be correct, we also refer to this data as our ground truth.
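
As a minimal sketch of how one such row might be represented, the snippet below turns a labeled reading into a dictionary that could be stored as a Cosmos DB item, and separates the model inputs from the label. The `id` scheme and the helper names `to_item` and `split_features_label` are assumptions made for illustration; they are not taken from the repository.

```python
import json

FEATURE_COLUMNS = ["AccelerationX", "AccelerationY", "AccelerationZ"]
LABEL_COLUMN = "Activity"


def to_item(timestamp, participant_id, x, y, z, activity):
    """Build a JSON-serializable item for one labeled reading.

    Cosmos DB items carry a string `id`; combining participant and
    timestamp is a hypothetical choice made for this sketch.
    """
    return {
        "id": f"{participant_id}-{timestamp}",
        "Timestamp": timestamp,
        "ParticipantId": participant_id,
        "AccelerationX": x,
        "AccelerationY": y,
        "AccelerationZ": z,
        "Activity": activity,
    }


def split_features_label(item):
    """Return (features, label): the model inputs and the ground truth."""
    features = [item[name] for name in FEATURE_COLUMNS]
    return features, item[LABEL_COLUMN]


item = to_item("00:00:00.0000", 1, 1222, 1402, 2037, "Walking")
features, label = split_features_label(item)
print(json.dumps(item))
print(features, label)
```

Unlabeled data, as sent to the endpoint later on, would have the same shape minus the `Activity` field.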

 

  • Because the Cosmos DB container has Synapse Link enabled, this data is automatically synced in near real time to the analytical store, a column store that Synapse can read much like files in a storage account.
  • Spark notebooks in Synapse use the analytical store to perform data science workloads, including training the model. These involve aggregations, plots, joins and other operations which would be either too costly or too slow on the Cosmos DB transactional store, because it is optimized for transactional, not analytical, workloads.
  • Using the Azure ML SDK, the trained model is deployed to an Azure Machine Learning workspace. This makes it available for real-time predictions.
  • Now, our application can send unlabeled data to the endpoint in Azure Machine Learning and generate predictions, which gives us newly labeled data.

 

Here’s what an application might send to the endpoint:

 

Timestamp      AccelerationX  AccelerationY  AccelerationZ
00:00:00.0000  1222           1402           2037
00:00:00.0010  1245           1426           1956
00:00:00.0020  1142           1363           1986
00:00:00.0030  ...

 

And here’s a possible output:

 

Timestamp      Activity
00:00:00.0000  Running
00:00:00.0010  Running
00:00:00.0020  Going up stairs
00:00:00.0030  Going up stairs
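
To make the request side concrete, here is a rough sketch of how an application could package unlabeled readings and post them to a scoring endpoint using only the Python standard library. The `{"data": [...]}` envelope, the bearer-token header, and the names `build_payload` and `score` are assumptions for this sketch; the actual schema depends on the scoring script deployed with the model.

```python
import json
import urllib.request


def build_payload(readings):
    """Serialize unlabeled readings into a JSON request body.

    The {"data": [...]} envelope is an assumed schema for this sketch.
    """
    return json.dumps({"data": readings}).encode("utf-8")


def score(endpoint_url, api_key, readings):
    """POST readings to a scoring endpoint and return the parsed JSON reply."""
    request = urllib.request.Request(
        endpoint_url,
        data=build_payload(readings),
        headers={
            "Content-Type": "application/json",
            # Assumes key-based authentication on the endpoint.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return json.loads(response.read().decode("utf-8"))


readings = [
    {"Timestamp": "00:00:00.0000", "AccelerationX": 1222,
     "AccelerationY": 1402, "AccelerationZ": 2037},
    {"Timestamp": "00:00:00.0010", "AccelerationX": 1245,
     "AccelerationY": 1426, "AccelerationZ": 1956},
]
payload = build_payload(readings)
print(payload.decode("utf-8"))
```

Calling `score(url, key, readings)` with your own endpoint URL and key would then return the predictions; no network call is made in the snippet itself.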

 

  • This labeled data allows for interesting insights: does the user need assistance? What does their day look like?
  • Optionally, we may now close the loop: by having a human review the predictions, we generate more ground truth, which in turn improves our training.
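
As one example of such an insight, the sketch below estimates how a user's time is split across activities from a list of predictions. It assumes readings are evenly spaced, so the share of rows approximates the share of elapsed time; the function name `activity_share` is hypothetical.

```python
from collections import Counter


def activity_share(predictions):
    """Return the fraction of predictions assigned to each activity.

    Assumes readings are evenly spaced in time, so a share of rows
    approximates a share of elapsed time.
    """
    counts = Counter(row["Activity"] for row in predictions)
    total = sum(counts.values())
    if total == 0:
        return {}
    return {activity: count / total for activity, count in counts.items()}


predictions = [
    {"Timestamp": "00:00:00.0000", "Activity": "Running"},
    {"Timestamp": "00:00:00.0010", "Activity": "Running"},
    {"Timestamp": "00:00:00.0020", "Activity": "Going up stairs"},
    {"Timestamp": "00:00:00.0030", "Activity": "Going up stairs"},
]
shares = activity_share(predictions)
print(shares)
```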

 

Procedure

 

The first thing we did was spin up the infrastructure. This includes:

 

  • Cosmos DB
  • Synapse Analytics workspace
  • Key Vault - for storing secrets for Synapse
  • Storage Account - Synapse’s default storage
  • Azure Machine Learning

 

You can check out the ARM templates in the IaC/ directory of our repo, linked at the end of the article.

 

Next, we needed to load our ground truth into Cosmos DB. We used a Synapse Pipeline for this:

 

[Figure: Synapse Pipeline that loads the ground truth into Cosmos DB]

 

 

Once the data is in Cosmos DB, we can use Synapse Spark to load it into a DataFrame and train our model.

Check out the notebook for more instructions on how this is done.
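
For orientation, reading the analytical store from a Synapse Spark notebook generally looks like the sketch below, which uses the `cosmos.olap` format. The wrapper function `read_analytical_store` and the linked service and container names in the comments are placeholders for this sketch and may differ from what the notebook in the repository does.

```python
def read_analytical_store(spark, linked_service, container):
    """Load a Cosmos DB container's analytical store into a Spark DataFrame.

    `spark` is the SparkSession that Synapse notebooks provide.
    `linked_service` is the name of the Cosmos DB linked service in Synapse.
    """
    return (
        spark.read.format("cosmos.olap")
        .option("spark.synapse.linkedService", linked_service)
        .option("spark.cosmos.container", container)
        .load()
    )


# In a Synapse notebook (both names below are placeholders):
# df = read_analytical_store(spark, "CosmosDbLinkedService", "readings")
# df.groupBy("Activity").count().show()
```

Because this reads from the analytical store rather than the transactional store, queries like the `groupBy` above do not compete with the application for transactional throughput.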

 

[Figure: Synapse Spark notebook]

 

 

The notebook will perform exploratory data analysis, train and evaluate a model, and deploy it to Azure Machine Learning. Once it’s there, you can test your deployment with some sample data to verify that it’s working:
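
Deployments to Azure Machine Learning are typically driven by an entry script that defines `init()` and `run()`. The sketch below shows that shape with a trivial stand-in so it can be run locally. The `StandInModel` class, its constant prediction, and the `{"data": [...]}` input schema are all hypothetical; this is not the model or script from the notebook.

```python
import json

model = None


class StandInModel:
    """Hypothetical placeholder; a real script would load the trained model."""

    def predict(self, rows):
        # Always answers "Walking": for illustration only, NOT a classifier.
        return ["Walking" for _ in rows]


def init():
    """Called once when the service starts; loads the model into memory."""
    global model
    model = StandInModel()


def run(raw_data):
    """Called per request; parses JSON readings and returns predictions."""
    readings = json.loads(raw_data)["data"]
    rows = [
        [r["AccelerationX"], r["AccelerationY"], r["AccelerationZ"]]
        for r in readings
    ]
    labels = model.predict(rows)
    return [
        {"Timestamp": r["Timestamp"], "Activity": label}
        for r, label in zip(readings, labels)
    ]


# Local smoke test with one sample reading.
init()
sample = json.dumps({"data": [
    {"Timestamp": "00:00:00.0000", "AccelerationX": 1222,
     "AccelerationY": 1402, "AccelerationZ": 2037},
]})
result = run(sample)
print(result)
```

Running the script locally like this before deploying is a cheap way to catch schema mismatches between what the application sends and what the model expects.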

 

[Figure: testing the deployment with sample data]

 

 

 

Conclusion

 

As a result, we could perform predictive analytics on a dataset in Cosmos DB without having to create any additional storage or ETL processes. Everything is taken care of and optimized by Microsoft through Synapse Link. But don’t take my word for it - go check out our repository, which includes instructions on how to run this solution in your own subscription: https://github.com/MarcoABCardoso/2b1-luti-marco

 

Next steps

 

Want to know more about how all of this works? Stay tuned for more posts as we discuss the configuration and best practices for each component in this architecture!

 

 
