Announcing managed endpoints in Azure Machine Learning for simplified model deployment

Microsoft

May 25, 2021

At Microsoft Build 2021, we announced the public preview of Azure Machine Learning managed endpoints. In this post, we’ll walk you through some of the capabilities of managed endpoints. But first, a quick recap: managed endpoints are designed to help our customers deploy their models in a turnkey manner across powerful CPU and GPU machines in Azure in a scalable, fully managed way. They take care of serving, scaling, securing, and monitoring your ML models, freeing you from the overhead of setting up and managing the underlying infrastructure.

Customer challenges

Currently, when customers want to deploy models for online/real-time inference in a production environment with Azure ML, they create and manage the underlying cluster infrastructure by themselves. These are some of the challenges we have heard from customers:

Need the ability to swiftly launch a new endpoint backed by a large set of instances.
Need a better experience to debug issues locally before deploying to Azure.
Need the ability to set up SLA monitoring on endpoint metrics.
It is challenging to maintain custom infrastructure like Kubernetes: performing version updates, security hardening, scaling clusters, and getting internal security approval all require specialized expertise.

Similarly, when customers want to run batch inference with Azure ML, they need to learn a different set of concepts. At Build 2020, we released the parallel run step, a new step in the Azure Machine Learning pipeline designed for embarrassingly parallel machine learning workloads. Nestlé uses it to perform batch inference and flag phishing emails. AGL uses it to build parallel at-scale training and batch inference. While customers are happy with the experience, performance, and scale that the parallel run step provides, the feedback was that there’s a steep learning curve when using it for the first time. They must construct a pipeline with a parallel run step, prepare an environment, write a scoring script, create a dataset, run the pipeline, and publish the pipeline to re-use it or run it from external platforms. Essentially, customers want the ability to run batch inference seamlessly, without any additional steps, once the models are registered in Azure ML.

This is what managed endpoints in Azure ML are designed to address. Let’s look at them in more detail.

Managed online endpoints

This is a new capability for the online/real-time scoring of your models. Following is a summary of features and benefits:

Managed infrastructure:
- Users specify the VM instance type (SKU) and scale settings and the system takes care of provisioning the compute and hosting the
- The system handles updating and patching of the underlying host OS images.
- The system handles node recovery in case of system failure.
Safe rollout: Roll out a model safely by using native support for blue/green deployment. When rolling out a new version of a model, you can create a new deployment under the same endpoint and gradually divert traffic to it while validating that there are no errors or disruptions.

Monitor SLA: Monitor endpoint metrics like latency and throughput, and resource metrics like CPU/GPU utilization using out-of-the-box integration with Azure Monitor. You can also set alerts for threshold breaches.

Debug in a local Docker environment using Local endpoints. You will be able to use the same CLI command and configuration that you will use for cloud deployment, with just an additional flag.

Enable Log Analytics integration to analyze issues and identify trends. Analyze performance by enabling integration with App Insights.

View costs at the endpoint and deployment level using Azure cost analysis.

Security: Endpoints support key and Azure ML token auth. To access secured resources, both user-assigned and system-assigned managed identities are supported.

Build MLOps pipelines using our new CLI and REST/ARM interfaces. The YAML file in the CLI enables GitOps support with full auditability and repeatability by declaring how you want production to look.

Users can also use Azure ML Studio to create and manage endpoints.

Here’s a quick 3-minute walkthrough of the experience:

Managed batch endpoints

We are simplifying the batch inference experience through managed batch endpoints. They help our customers speed up model deployment in a turnkey manner, with all the following capabilities:

No-code model deployment for MLflow: With batch endpoints, we eliminate numerous steps such as creating pipelines, setting up a parallel run step, writing the scoring script, preparing environments, and setting up automation. Now, for MLflow registered models, customers only need to provide the model and a compute target and run one command, and a batch endpoint is ready to use. The scoring script and environment are generated automatically.

Flexible input data sources and configurable output location: Customers can run batch inference through a managed batch endpoint using an Azure ML registered dataset, other datasets in the cloud, or datasets stored locally. The output location can also be set to any data store.

Managed cost with autoscaling compute: Invoking a batch endpoint triggers an asynchronous batch inference job. Compute resources are automatically provisioned when the job starts and automatically de-allocated when the job completes. Customers only pay for compute when they use it. They can override compute resource settings (like instance count) and advanced settings (like mini-batch size, error threshold, and so on) for each individual batch inference job to speed up execution and reduce cost.
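To make the mini-batch size and error threshold settings more concrete, here is a minimal, purely local sketch of how such settings could shape a batch job. It does not call Azure ML. The function names, the `toy_score` stand-in, and the convention that a negative threshold means "ignore all failures" are assumptions made for illustration, not the service's actual implementation.

```python
from typing import Callable, List, Sequence, Tuple


def split_into_mini_batches(items: Sequence[str], mini_batch_size: int) -> List[List[str]]:
    """Group input items into consecutive mini-batches of at most mini_batch_size."""
    if mini_batch_size <= 0:
        raise ValueError("mini_batch_size must be positive")
    return [list(items[i:i + mini_batch_size])
            for i in range(0, len(items), mini_batch_size)]


def run_batch_job(
    items: Sequence[str],
    score: Callable[[str], int],
    mini_batch_size: int,
    error_threshold: int,
) -> Tuple[List[int], int]:
    """Score every item, aborting once failures exceed error_threshold.

    Assumption for this sketch: a negative error_threshold disables the check.
    """
    results: List[int] = []
    failures = 0
    for mini_batch in split_into_mini_batches(items, mini_batch_size):
        for item in mini_batch:
            try:
                results.append(score(item))
            except Exception:
                failures += 1
                if 0 <= error_threshold < failures:
                    raise RuntimeError(
                        f"aborting: {failures} failures exceed threshold {error_threshold}"
                    )
    return results, failures


def toy_score(path: str) -> int:
    """Hypothetical scoring function: fails on '.bad' files, else returns the name length."""
    if path.endswith(".bad"):
        raise ValueError(f"cannot score {path}")
    return len(path)


if __name__ == "__main__":
    files = ["a.csv", "b.bad", "cc.csv"]
    print(split_into_mini_batches(files, 2))
    print(run_batch_job(files, toy_score, mini_batch_size=2, error_threshold=1))
```

In this sketch, a smaller mini-batch size produces more, smaller units of work, and a stricter error threshold makes the job fail fast instead of spending compute on a bad input set.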

How managed endpoints work

We are introducing the concepts of “endpoint” and “deployment”. Using these, users can create multiple versions of models under a single endpoint and perform a safe rollout to newer versions.

Endpoint

An HTTPS endpoint that clients can invoke to get the inference output of models. It provides:

Stable URI: my-endpoint.region.inference.ml.azure.com
Authentication: Key & Token-based auth
SSL Termination
Traffic split between deployments
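As a rough illustration of what invoking such an endpoint can look like from a client, the sketch below builds (but does not send) an HTTPS POST request with key-based auth using only the Python standard library. The `/score` path, the `Bearer` header format, the placeholder key, and the payload shape are assumptions for this example; use the scoring URI and key shown for your own endpoint.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri: str, api_key: str, payload: dict) -> urllib.request.Request:
    """Build a JSON POST request for an endpoint that uses key or token auth."""
    if not scoring_uri.startswith("https://"):
        raise ValueError("expected an HTTPS scoring URI")
    body = json.dumps(payload).encode("utf-8")
    return urllib.request.Request(
        scoring_uri,
        data=body,
        headers={
            "Content-Type": "application/json",
            # Assumed header format: the key (or token) is passed as a bearer credential.
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


if __name__ == "__main__":
    request = build_scoring_request(
        # Host name pattern from this post; the "/score" path is an assumption.
        "https://my-endpoint.region.inference.ml.azure.com/score",
        "PLACEHOLDER_KEY",
        {"data": [[1, 2, 3]]},  # hypothetical input; the real shape depends on your scoring script
    )
    print(request.get_method(), request.full_url)
    # To actually send it: urllib.request.urlopen(request, timeout=10)
```

Because the URI is stable, client code like this should not need to change when deployments behind the endpoint are added, removed, or re-weighted.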

Deployment

A set of compute resources hosting the model and performing inference. Users can configure:

Model details (code, model, environment)
Resource and scale settings
Advanced settings like request and probe settings
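One way to picture these settings is as a single configuration object per deployment. The sketch below is a hypothetical data model, not the Azure ML schema: the class name, field names, default values, and example values (such as `my-model:1`) are all assumptions chosen to mirror the three groups of settings listed above.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class DeploymentConfig:
    """Hypothetical container for the settings a deployment exposes."""

    name: str
    model: str                       # model details: which registered model to serve
    code: str                        # model details: location of the scoring code
    environment: str                 # model details: environment the model runs in
    instance_type: str               # resource setting: VM SKU
    instance_count: int              # scale setting: number of nodes
    request_timeout_ms: int = 5000   # advanced request setting (assumed name and default)
    probe_period_s: int = 10         # advanced probe setting (assumed name and default)

    def problems(self) -> List[str]:
        """Return human-readable validation problems; an empty list means none were found."""
        found: List[str] = []
        if not self.name:
            found.append("name must not be empty")
        if self.instance_count < 1:
            found.append("instance_count must be at least 1")
        if self.request_timeout_ms <= 0:
            found.append("request_timeout_ms must be positive")
        if self.probe_period_s <= 0:
            found.append("probe_period_s must be positive")
        return found


if __name__ == "__main__":
    blue = DeploymentConfig(
        name="blue",
        model="my-model:1",          # hypothetical model reference
        code="./src/score.py",       # hypothetical path
        environment="my-env:1",      # hypothetical environment reference
        instance_type="F2S",
        instance_count=3,
    )
    print(blue.problems())
```

Keeping all of a deployment's settings in one declarative object is what makes it practical to store them in source control and review changes before they reach production.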

The above picture shows a managed online endpoint with a traffic split of 90% and 10% between blue and green deployments, respectively (these names are for illustration purposes; you can use any names). The blue deployment is running model version 1 on three CPU nodes (F2S VMs), and the green deployment is running model version 2 on three GPU nodes (NC6v2 VMs).

With multiple deployment support and traffic split capability, users can perform a safe rollout of new models by gradually migrating traffic (in this case, from blue to green) and monitoring metrics at every stage to ensure the rollout has been successful.
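The traffic split and gradual rollout described above can be sketched as a small local simulation. This is not how the service routes requests internally; the function names, the step size, and the `is_healthy` callback (standing in for "monitor metrics at every stage") are assumptions for illustration.

```python
import random
from typing import Callable, Dict, List


def pick_deployment(traffic: Dict[str, int], rng: random.Random) -> str:
    """Pick a deployment with probability proportional to its traffic percentage."""
    names = list(traffic)
    weights = [traffic[name] for name in names]
    return rng.choices(names, weights=weights, k=1)[0]


def shift_traffic(traffic: Dict[str, int], source: str, target: str, step: int) -> Dict[str, int]:
    """Return a new split with up to `step` percentage points moved from source to target."""
    moved = min(step, traffic[source])
    updated = dict(traffic)
    updated[source] -= moved
    updated[target] += moved
    return updated


def safe_rollout(
    traffic: Dict[str, int],
    source: str,
    target: str,
    step: int,
    is_healthy: Callable[[Dict[str, int]], bool],
) -> List[Dict[str, int]]:
    """Shift traffic step by step; restore the starting split if a stage looks unhealthy."""
    if step <= 0:
        raise ValueError("step must be positive")
    history = [dict(traffic)]
    current = dict(traffic)
    while current[source] > 0:
        candidate = shift_traffic(current, source, target, step)
        if not is_healthy(candidate):
            history.append(dict(traffic))  # roll back to the starting split
            return history
        current = candidate
        history.append(current)
    return history


if __name__ == "__main__":
    start = {"blue": 90, "green": 10}
    rng = random.Random(0)
    sample = [pick_deployment(start, rng) for _ in range(1000)]
    print("share of requests sent to blue:", sample.count("blue") / len(sample))
    for stage in safe_rollout(start, "blue", "green", step=30, is_healthy=lambda split: True):
        print(stage)
```

Because every stage is a complete traffic split, rolling back is just a matter of restoring an earlier split, which is what makes this style of rollout low-risk.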

Endpoints and deployments also apply to batch endpoints, with the following exceptions:

Only AAD token auth is supported
The concepts of traffic split and safe rollout are not applicable (there can be only one active deployment, with 100% of the traffic)
For deployments, the advanced settings are batch-specific (e.g. mini_batch_size, error_threshold, etc.)

Summary

In summary, managed endpoints help ML teams focus more on the business problem than on the underlying infrastructure. They provide a simple developer interface to deploy and score models, and they help with the operational aspects of model deployment, including safely rolling out models, debugging issues faster, and monitoring SLA. Please give these a spin using the following assets and do share your feedback with us.