At Microsoft Build 2021 we announced the public preview of Azure Machine Learning managed endpoints. In this post, we’ll walk you through some of the capabilities of managed endpoints. But first, a quick recap: managed endpoints are designed to help our customers deploy their models in a turnkey manner across powerful CPU and GPU machines in Azure in a scalable, fully managed way. They take care of serving, scaling, securing, and monitoring your ML models, freeing you from the overhead of setting up and managing the underlying infrastructure.
Currently, when customers want to deploy models for online/real-time inference in a production environment with Azure ML, they have to create and manage the underlying cluster infrastructure themselves, and they have shared a number of challenges with that approach.
Similarly, when customers want to run batch inference with Azure ML, they need to learn a different set of concepts. At Build 2020, we released the parallel run step, a new step in Azure Machine Learning pipelines designed for embarrassingly parallel machine learning workloads. Nestlé uses it to perform batch inference and flag phishing emails. AGL uses it to build parallel at-scale training and batch inference. While customers are happy with the experience, performance, and scale the parallel run step provides, the feedback has been that there is a steep learning curve to using it for the first time: they must construct a pipeline with a parallel run step, prepare an environment, write a scoring script, create a dataset, run the pipeline, and then publish the pipeline to re-use it or run it from external platforms. Essentially, customers want the ability to run batch inference seamlessly, without any additional steps, once their models are registered in Azure ML.
This is what managed endpoints in Azure ML are designed to address. Let’s look at them in more detail.
Managed online endpoints are a new capability for online/real-time scoring of your models, giving you a turnkey way to deploy, secure, scale, and monitor them.
Here’s a quick 3-minute walkthrough of the experience.
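To make the flow concrete, here is a minimal sketch of creating a managed online endpoint and a single deployment with the Azure ML Python SDK (azure-ai-ml). The workspace details, endpoint name, model folder, scoring script, and environment are placeholders I chose for illustration, and the exact SDK surface may differ from the preview release; treat this as a sketch rather than a canonical recipe.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import (
    CodeConfiguration,
    Environment,
    ManagedOnlineDeployment,
    ManagedOnlineEndpoint,
    Model,
)
from azure.identity import DefaultAzureCredential

# Connect to an Azure ML workspace (placeholders below).
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# 1) Create the endpoint: a stable HTTPS entry point with key-based auth.
endpoint = ManagedOnlineEndpoint(name="my-endpoint", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# 2) Create a deployment under it: model + scoring code + environment + managed compute.
deployment = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=Model(path="./model"),  # local model folder to upload (hypothetical)
    code_configuration=CodeConfiguration(code="./src", scoring_script="score.py"),
    environment=Environment(
        conda_file="./env/conda.yml",
        image="mcr.microsoft.com/azureml/openmpi4.1.0-ubuntu20.04:latest",
    ),
    instance_type="Standard_F2s_v2",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# 3) Route all traffic to the new deployment and score a sample request.
endpoint.traffic = {"blue": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
print(ml_client.online_endpoints.invoke(endpoint_name="my-endpoint",
                                        request_file="sample-request.json"))
```

While iterating, the deployment’s container logs can be pulled with ml_client.online_deployments.get_logs(name="blue", endpoint_name="my-endpoint", lines=50), and the same endpoint can later host additional deployments for safe rollout (more on that below).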
We are simplifying the batch inference experience through managed batch endpoints, which help our customers speed up model deployment in a turnkey manner.
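As a sketch of this simplified flow (again using the azure-ai-ml Python SDK, with hypothetical names, a registered MLflow model so that no scoring script has to be written, and an existing compute cluster called cpu-cluster), creating a batch endpoint and deployment looks roughly like this:

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import BatchDeployment, BatchEndpoint
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# 1) Create the batch endpoint: a durable entry point for batch scoring jobs.
endpoint = BatchEndpoint(name="my-batch-endpoint", description="Batch scoring for my model")
ml_client.batch_endpoints.begin_create_or_update(endpoint).result()

# 2) Create a deployment: which registered model to run, on which cluster,
#    and how to parallelize the work across nodes and workers.
deployment = BatchDeployment(
    name="my-batch-deployment",
    endpoint_name="my-batch-endpoint",
    model=ml_client.models.get("my-model", version="1"),  # registered model (hypothetical)
    compute="cpu-cluster",            # existing Azure ML compute cluster (hypothetical name)
    instance_count=2,                 # nodes used per scoring job
    max_concurrency_per_instance=2,   # parallel workers per node
    mini_batch_size=10,               # input files handed to each worker call
    output_file_name="predictions.csv",
)
ml_client.batch_deployments.begin_create_or_update(deployment).result()
```

Because the model is already registered, there is no pipeline to author or publish; a scoring job is started simply by invoking the endpoint (a sketch of that appears further down).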
We are introducing the concepts of “endpoint” and “deployment”. Using these, users can host multiple versions of a model under a single endpoint and perform a safe rollout to newer versions.
An endpoint is an HTTPS endpoint that clients can invoke to get the inference output of models. It provides a stable scoring URI, authentication, and traffic allocation across the deployments behind it.
A deployment is a set of compute resources hosting the model and performing inference. Users can configure the model, the scoring code and environment, the VM instance type, and the instance count and scale settings.
As an example, consider a managed online endpoint with a traffic split of 90% and 10% between blue and green deployments, respectively (these names are for illustration purposes; you can use any names). The blue deployment runs model version 1 on three CPU nodes (F2s VMs), and the green deployment runs model version 2 on three GPU nodes (NC6v2 VMs).
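A rough sketch of what those two deployments could look like with the Python SDK follows. It assumes the registered model versions are MLflow models (so no scoring script or environment needs to be specified), and it takes Standard_F2s_v2 and Standard_NC6s_v2 as concrete VM sizes standing in for the F2s and NC6v2 machines mentioned above; the model and endpoint names are hypothetical.

```python
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(),
                     "<subscription-id>", "<resource-group>", "<workspace>")

# "blue": model version 1 on three CPU nodes.
blue = ManagedOnlineDeployment(
    name="blue",
    endpoint_name="my-endpoint",
    model=ml_client.models.get("my-model", version="1"),
    instance_type="Standard_F2s_v2",   # assumed CPU VM size for "F2s"
    instance_count=3,
)

# "green": model version 2 on three GPU nodes.
green = ManagedOnlineDeployment(
    name="green",
    endpoint_name="my-endpoint",
    model=ml_client.models.get("my-model", version="2"),
    instance_type="Standard_NC6s_v2",  # assumed GPU VM size for "NC6v2"
    instance_count=3,
)

for d in (blue, green):
    ml_client.online_deployments.begin_create_or_update(d).result()
```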
With support for multiple deployments and traffic splitting, users can perform a safe rollout of a new model by gradually migrating traffic (in this case from blue to green) and monitoring metrics at every stage to ensure the rollout is successful.
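The traffic shift itself is just an update on the endpoint object. A staged rollout might look like the following sketch, where the endpoint name, the percentages, and the validation between stages are all illustrative:

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(),
                     "<subscription-id>", "<resource-group>", "<workspace>")

endpoint = ml_client.online_endpoints.get(name="my-endpoint")

# Shift traffic from blue to green in stages, validating metrics between steps.
for split in ({"blue": 90, "green": 10},
              {"blue": 50, "green": 50},
              {"blue": 0, "green": 100}):
    endpoint.traffic = split
    ml_client.online_endpoints.begin_create_or_update(endpoint).result()
    # ...check latency, error rate, and deployment logs here before continuing...
```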
Endpoints and deployments apply to batch endpoints as well, with a few exceptions; most notably, a batch scoring job runs against a single deployment (the one you name, or the endpoint’s default) rather than splitting live traffic across deployments.
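For instance, a scoring job is started by invoking the batch endpoint with an input path or dataset. A sketch with hypothetical endpoint, deployment, and datastore names:

```python
from azure.ai.ml import Input, MLClient
from azure.ai.ml.constants import AssetTypes
from azure.identity import DefaultAzureCredential

ml_client = MLClient(DefaultAzureCredential(),
                     "<subscription-id>", "<resource-group>", "<workspace>")

# Start a batch scoring job against the deployment created earlier.
job = ml_client.batch_endpoints.invoke(
    endpoint_name="my-batch-endpoint",
    deployment_name="my-batch-deployment",  # omit to use the endpoint's default deployment
    input=Input(
        type=AssetTypes.URI_FOLDER,
        path="azureml://datastores/workspaceblobstore/paths/data-to-score/",
    ),
)

# Follow the job; predictions land in the output file configured on the deployment.
ml_client.jobs.stream(job.name)
```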
In summary, managed endpoints help ML teams focus on the business problem rather than the underlying infrastructure. They provide a simple developer interface to deploy and score models and help with the operational aspects of model deployment, including safely rolling out models, debugging issues faster, and monitoring SLAs. Please give them a spin and share your feedback with us.