Creating managed online endpoints in Azure ML
Published Dec 16 2021 11:09 AM


Suppose you’ve trained a machine learning model to accomplish some task, and you’d now like to provide that model’s inference capabilities as a service. Maybe you’re writing an application of your own that will rely on this service, or perhaps you want to make the service available to others. This is the purpose of endpoints — they provide a simple web-based API for feeding data to your model and getting back inference results.

Azure ML currently supports three types of endpoints: batch endpoints, Kubernetes online endpoints, and managed online endpoints. I’m going to focus on managed online endpoints in this post, but let me start by explaining how the three types differ.

Diagram showing an overview of the types of endpoints.

Batch endpoints are designed to handle large requests, working asynchronously and generating results that are held in blob storage. Because compute resources are only provisioned when the job starts, the latency of the response is higher than using online endpoints. However, that can result in substantially lower costs. Online endpoints, on the other hand, are designed to quickly process smaller requests and provide near-immediate responses. Compute resources are provisioned at the time of deployment and are always up and running, which, depending on your scenario, may mean higher costs than batch endpoints. However, you get real-time responses, which is critical to many scenarios. If you want to deploy an online endpoint, you have two options: Kubernetes online endpoints allow you to manage your own compute resources using Kubernetes, while managed online endpoints rely on Azure to manage compute resources, OS updates, scaling, and security. For more information about the different endpoint types and which one is right for you, check out the documentation.

In this post, I’ll show you how to work with managed online endpoints. We’ll start by getting familiar with our PyTorch model. We’ll then write a scoring function that loads the model and performs predictions based on user input. After that, we’ll explore several different options for creating managed online endpoints that call our scoring function. And finally, I’ll demonstrate a couple of ways to invoke our endpoints.

The code for this project can be found on GitHub.

Throughout this post, I’ll assume you’re familiar with machine learning concepts like training and prediction, but I won’t assume familiarity with Azure.

Azure ML setup

Here’s how you can set up Azure ML to follow the steps in this post.

  • You need to have an Azure subscription. You can get a free subscription to try it out.
  • Create a resource group.
  • Create a new machine learning workspace by following the “Create the workspace” section of the documentation. Keep in mind that you’ll be creating a “machine learning workspace” Azure resource, not a “workspace” Azure resource, which is entirely different!
  • If you have access to GitHub Codespaces, click on the “Code” button in this GitHub repo, select the “Codespaces” tab, and then click on “New codespace.”
  • Alternatively, if you plan to use your local machine:
    • Install the Azure CLI by following the instructions in the documentation.
    • Install the ML extension to the Azure CLI by following the “Installation” section of the documentation.
  • In a terminal window, log in to Azure by executing az login --use-device-code.
  • Set your default subscription by executing az account set -s "<YOUR_SUBSCRIPTION_NAME_OR_ID>". You can verify your default subscription by executing az account show, or by looking at ~/.azure/azureProfile.json.
  • Set your default resource group and workspace by executing az configure --defaults group="<YOUR_RESOURCE_GROUP>" workspace="<YOUR_WORKSPACE>". You can verify your defaults by executing az configure --list-defaults or by looking at ~/.azure/config.
  • You can now open the Azure Machine Learning studio, where you’ll be able to see and manage all the machine learning resources we’ll be creating.
  • Although not essential to run the code in this post, I highly recommend installing the Azure Machine Learning extension for VS Code.

You’re now ready to start working with Azure ML!

Training and saving the models

To keep this post simple and focused on endpoints, I provide the already trained model in the GitHub project, under model. This way you can go straight to learning Azure ML endpoints without having to run any code.

If you want to re-create the models provided, you first need to create and activate the conda environment. If you’re running this project on Codespaces, there’s nothing to do — the conda environment is created and activated automatically when the container is created. If you’re running the code locally, you’ll need to execute the following commands from the root of the GitHub repo:


conda env create -f environment.yml
conda activate aml-managed-endpoint


You can then run src/, which saves the model using the following code:


    ..., path)


For a full explanation of the PyTorch training code, check out my PyTorch blog post.

If you’d like to train on Azure, you can look at the documentation on how to do that. I also intend to cover this topic in future posts.

Creating the models on Azure

Before we can use our ML models in the cloud, we need to create Azure ML resources that know about them. There are a few different ways to create these resources — my preferred way is to use YAML files, so this is the method I’ll show in this post.

Let’s start by looking at the YAML file for our model, cloud/model.yml. You can see that this file starts by specifying a schema, which is super helpful because it enables VS Code to make suggestions and highlight any mistakes we make. The attributes in this file make it clear that an Azure ML model consists of a name, a version, and a path to the location where we saved the trained model files locally:


name: model-managed
version: 1
path: "../model/weights.pth"


How will you select the correct schema when creating a new resource? You can always copy the schemas from my blog or from the documentation, but the easiest way is to use the Azure Machine Learning extension for VS Code. If you have it installed, you can select the Azure icon in VS Code’s left navigation pane, log into Azure, expand your subscription and ML workspace, select “Models”, and click the “+” button to create a YAML file with the correct model schema and attributes.

Screenshot showing how to create a new model using the Azure Machine Learning extension for VS Code.

Now that you have the YAML file containing the model specification, you can create the model resource in the cloud. If you have the Azure ML extension installed, you can do so by right-clicking anywhere on the open YAML file, and selecting “Azure ML: Execute YAML.” Alternatively, you can run the following CLI command in the terminal:


az ml model create -f cloud/model.yml


If you go to the Azure ML studio, and use the left navigation to go to the “Models” page, you’ll see your newly created model listed there.

In order to deploy our model as an Azure ML endpoint, we’ll use deployment and endpoint YAML files to specify the details of the configuration. I’ll show bits and pieces of these YAML files throughout the rest of this post as I present each setting. We’ll create four endpoints with different configurations to help you understand the range of alternatives available to you. If you look at the deployment YAML file for endpoint 1, for example, you’ll notice that it refers to the model resource we just created:


model: azureml:model-managed:1


As you can see, the model name is preceded by “azureml:” and followed by a colon and the version number we specified in the model’s YAML file.
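
To make the reference format concrete, here’s a minimal Python sketch that splits a reference of the form azureml:<name>:<version> into its parts. The parse_asset_reference function and the AssetReference type are hypothetical helpers I’m introducing for illustration; they aren’t part of the Azure ML CLI or SDK.

```python
from typing import NamedTuple


class AssetReference(NamedTuple):
    """Name and version of an Azure ML asset (hypothetical helper type)."""
    name: str
    version: str


def parse_asset_reference(reference: str) -> AssetReference:
    """Split 'azureml:<name>:<version>' into its name and version parts."""
    prefix = 'azureml:'
    if not reference.startswith(prefix):
        raise ValueError(f'Expected a reference starting with {prefix!r}: {reference!r}')
    # Split on the last colon, so the name itself is kept in one piece.
    name, separator, version = reference[len(prefix):].rpartition(':')
    if not separator or not name or not version:
        raise ValueError(f'Expected azureml:<name>:<version>, got {reference!r}')
    return AssetReference(name, version)


model_ref = parse_asset_reference('azureml:model-managed:1')
print(model_ref.name, model_ref.version)
```

The same pattern applies to the environment references you’ll see later in the deployment files.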

Creating the scoring files

When invoked, an endpoint will call a scoring file, which we need to provide. This scoring file needs to follow a prescribed structure: it needs to contain an init() function that will be called when the endpoint is created or updated, and a run(...) function that will be called every time the endpoint is invoked. Let’s look at these in more detail.

First we’ll take a look at the init() function:


import json
import logging
import os

import numpy as np
import torch
from torch import Tensor, nn

from neural_network import NeuralNetwork

model = None
device = None

def init():
    logging.info('Init started')

    global model
    global device

    device = 'cuda' if torch.cuda.is_available() else 'cpu'
    logging.info('Device: %s', device)

    model_path = os.path.join(os.getenv('AZUREML_MODEL_DIR'), 'weights.pth')

    model = NeuralNetwork().to(device)
    model.load_state_dict(torch.load(model_path, map_location=device))
    model.eval()

    logging.info('Init completed')


In our simple scenario, the init() function’s main task is to load the model. Because we saved just the weights, we need to instantiate a new version of the NeuralNetwork class before we can load the saved weights into it. Notice the use of the AZUREML_MODEL_DIR environment variable, which gives us the path to the model root folder on Azure. Notice also that since we’re using PyTorch, we need to ensure that both the loaded weights and the neural network we instantiate are on the same device (GPU or CPU).
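
If you ever run the scoring file outside of Azure, AZUREML_MODEL_DIR won’t be set, and os.path.join will fail with a rather cryptic error. Here’s a small sketch of one way to guard against that. The resolve_model_path helper is hypothetical (it’s not in the accompanying project), and I’m assuming the local model folder is named model, as in the GitHub repo.

```python
import os
from pathlib import Path


def resolve_model_path(filename: str = 'weights.pth') -> Path:
    """Return the path of a model file under AZUREML_MODEL_DIR.

    Raises a descriptive error when the variable is unset, which is the
    usual situation when the scoring file runs outside of Azure.
    """
    model_dir = os.getenv('AZUREML_MODEL_DIR')
    if model_dir is None:
        raise RuntimeError(
            'AZUREML_MODEL_DIR is not set; point it at your local model folder.')
    return Path(model_dir) / filename


# Local usage: mimic the variable that Azure sets for the deployment.
os.environ['AZUREML_MODEL_DIR'] = 'model'
print(resolve_model_path())
```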

I find it useful to add logging calls at the beginning and end of the function to make sure that it’s being called as expected. When we cover invoking the endpoint, I’ll show you where to look for the logs. I also like to add a logging call that tells me whether the code is running on GPU or CPU, as a sanity check.

Now let’s look at the run(...) function:


labels_map = {
    0: 'T-Shirt',
    1: 'Trouser',
    2: 'Pullover',
    3: 'Dress',
    4: 'Coat',
    5: 'Sandal',
    6: 'Shirt',
    7: 'Sneaker',
    8: 'Bag',
    9: 'Ankle Boot',
}

def predict(trained_model: nn.Module, x: Tensor) -> torch.Tensor:
    with torch.no_grad():
        y_prime = trained_model(x)
        probabilities = nn.functional.softmax(y_prime, dim=1)
        predicted_indices = probabilities.argmax(1)
    return predicted_indices

def run(raw_data):
    logging.info('Run started')

    x = json.loads(raw_data)['data']
    x = np.array(x).reshape((1, 1, 28, 28))
    x = torch.from_numpy(x).float().to(device)

    predicted_index = predict(model, x).item()
    predicted_name = labels_map[predicted_index]

    logging.info('Predicted name: %s', predicted_name)
    logging.info('Run completed')
    return predicted_name


Notice that run(...) takes a raw_data parameter as input, which contains the data we specify when invoking the endpoint. In our scenario, we’ll be passing in a JSON dictionary with a data key corresponding to a 28 × 28 matrix containing an image with float pixel values between 0.0 and 1.0. Our run(...) function loads the JSON, transforms it into a tensor of the format that our predict(...) function expects, calls the predict(...) function, converts the predicted int into a human-readable name, and returns that name.
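
The run(...) function above trusts its input. If you’d like to reject malformed requests before they reach the model, something along these lines could work. This is just a sketch using the standard library: parse_image_payload is a hypothetical helper, not part of the project, and the checks simply mirror the payload format described above (a data key holding a 28 × 28 matrix of values between 0.0 and 1.0).

```python
import json
from typing import List

IMAGE_SIZE = 28  # Fashion MNIST images are 28 x 28 pixels.


def parse_image_payload(raw_data: str) -> List[List[float]]:
    """Parse the request body and check that it holds a 28 x 28 image."""
    payload = json.loads(raw_data)
    if not isinstance(payload, dict) or 'data' not in payload:
        raise ValueError('Expected a JSON dictionary with a "data" key.')
    rows = payload['data']
    if not isinstance(rows, list) or len(rows) != IMAGE_SIZE or any(
            not isinstance(row, list) or len(row) != IMAGE_SIZE for row in rows):
        raise ValueError(f'Expected a {IMAGE_SIZE} x {IMAGE_SIZE} matrix.')
    for row in rows:
        for pixel in row:
            # Reject non-numeric entries and values outside [0.0, 1.0].
            if not isinstance(pixel, (int, float)) or not 0.0 <= pixel <= 1.0:
                raise ValueError(f'Invalid pixel value: {pixel!r}')
    return rows


blank_image = [[0.0] * IMAGE_SIZE for _ in range(IMAGE_SIZE)]
image = parse_image_payload(json.dumps({'data': blank_image}))
print(len(image), len(image[0]))
```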

If you look at the deployment YAML files, you’ll see that they all refer to a Python scoring file:


  code: ../../src/


Creating the environments

An Azure Machine Learning environment specifies the runtime where we can run training and prediction code on Azure, along with any additional configuration. In our scenario we’re only running prediction in the cloud, so we’ll focus on inference environments. Azure supports three different types of environments:

  1. Environments created from prebuilt Docker images for inference

    These prebuilt Docker images are provided by Microsoft, and they’re the easiest to get started with. In addition to Ubuntu and optional GPU support, they include different versions of TensorFlow and PyTorch, as well as many other popular frameworks and packages. I prefer to use prebuilt Docker images over the other two types — they deploy quickly and their pre-installed packages cover most of my needs.

    The full list of prebuilt Docker images available for inference can be found in the documentation. The docs show which packages are pre-installed in each Docker image, and two ways of referring to each image: an “MCR path” and a “curated environment.” I use a curated environment in this sample because I’m using an image that contains all the packages I need. You would want to use the “MCR path” if you need to extend the image with a conda file — I’ll come back to this later.

    Once I’ve selected a curated environment that has the packages I need, I just need to refer to it in my deployment YAML file. Here are the relevant lines from the first endpoint in our scenario:
    environment: azureml:AzureML-pytorch-1.7-ubuntu18.04-py37-cpu-inference:11

    To determine the version number for a particular curated environment, you can look in Azure ML studio under “Environments” then “Curated environments”:

    Screenshot showing how to get a list of curated environments and their version.

    Or you can use the Azure ML extension for VS Code — click on the Azure icon in the left navigation pane, expand your subscription and ML workspace, then expand “Environments” and “Azure ML Curated Environments.” Right-click on a curated environment and select “View Environment” to see the version number.

    For the scenario in this post, we’re able to use curated environments that include all the packages we need to run our code. If your scenario requires additional packages, then you’ll need to extend the environment, which you can do in one of three ways: by specifying the MCR path together with a conda file in the deployment file (as described under the next environment type), using dynamic installation, or with pre-installed Python packages.

  2. Environments created from base images

    These are Docker images provided by Microsoft that contain just the basics: Ubuntu, and optionally CUDA and cuDNN. Keep in mind that these don’t contain Python or any machine learning package you may need, so when using these environments, we typically include an additional conda file. A full list of available base images can be found in this GitHub repo.

    I use a base image in endpoint 2. Because it doesn’t contain Python or PyTorch, I had to extend it using a conda file. Note that I also added the azureml-defaults package, which is required for inference on Azure. Let’s take a look at the conda file:
    name: managed-endpoint-score
    channels:
      - pytorch
      - conda-forge
      - defaults
    dependencies:
      - python==3.9.5
      - pytorch==1.8.1
      - pip
      - pip:
        - azureml-defaults

    Now we can specify the base image and conda file in the YAML deployment file. In the curated environments section, I chose a CPU environment. Here I’m choosing a GPU base image, so that you see the range of options available to you.
      conda_file: score-conda.yml

  3. User-managed environments

    You can also create your own container and use it as an inference environment. I won’t go into detail on this topic, but you can take a look at the documentation.

Choosing the instance type

Now we’ll choose the machine where we’ll be deploying the environments and inference code for our endpoints. You can find the list of all VMs (virtual machines) supported for inference in the documentation.

Endpoint 1 of this project relies on a curated environment that runs on the CPU, so there’s no point in paying for a VM with a GPU. For this endpoint, I chose a “Standard_DS3_v2” VM because a small size is enough for my purposes. Endpoint 2 relies on a base image environment that requires GPU support, so we’ll pair it with a GPU VM — I chose a “Standard_NC6s_v3” VM, which is also small. Our scenario doesn’t require a GPU for scoring, but I decided to show both options here because your scenario might be different.


instance_type: Standard_DS3_v2


instance_type: Standard_NC6s_v3


You should have no problem using a “Standard_DS3_v2” CPU machine, but your subscription may not have enough quota for a “Standard_NC6s_v3” GPU machine. If that’s the case, you’ll see a helpful error message when you try to create the endpoint. In order to increase your quota and get access to machines with GPUs, you’ll need to submit a support request, as is explained in the documentation. For this particular type of machine, you’ll need to ask for an increase in the quota for the “NCSv3” series, as shown in the screenshot below:

Screenshot showing how to ask for a quota increase for NCSv3 machines.

The support request also asks how many vCPUs you want access to. The NCSv3 family of machines comes in three flavors: small (Standard_NC6s_v3) which uses 6 vCPUs, medium (Standard_NC12s_v3) which uses 12 vCPUs, and large (Standard_NC24s_v3) which uses 24 vCPUs.
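
As a rough guide for the number to put in the support request, you can multiply the vCPUs of the size you chose by the number of instances you plan to run. The sketch below does just that, using the vCPU counts listed above. The vcpus_to_request helper is hypothetical, and it assumes quota is consumed as vCPUs per instance times instance count; it ignores any extra capacity Azure may reserve, so treat the result as a lower bound.

```python
# vCPU counts for the NCSv3 sizes, as listed in the text above.
NCSV3_VCPUS = {
    'Standard_NC6s_v3': 6,
    'Standard_NC12s_v3': 12,
    'Standard_NC24s_v3': 24,
}


def vcpus_to_request(instance_type: str, instance_count: int) -> int:
    """Return the vCPUs needed to run instance_count machines of one size."""
    if instance_type not in NCSV3_VCPUS:
        raise ValueError(f'Unknown NCSv3 size: {instance_type!r}')
    if instance_count < 1:
        raise ValueError('instance_count must be at least 1.')
    return NCSV3_VCPUS[instance_type] * instance_count


# One small GPU machine, as used by endpoint 2 in this post.
print(vcpus_to_request('Standard_NC6s_v3', 1))
```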

Choosing the instance count

As the name implies, the instance_count setting determines how many machines you want running at deployment. Since this is just a demo, we’ll set it to one for all endpoints.


instance_count: 1


You might want to set it to a higher number in your production code. You can also configure auto-scaling to adjust the instance count in real time, based on traffic.

Choosing the authentication mode

There are two authentication modes you can choose from: key authentication never expires, while aml_token authentication expires after an hour. The project for this post uses key authentication for all of its endpoints except for endpoint 3, which demonstrates how to use aml_token. The authentication mode can be set in the endpoint YAML in the following way:


auth_mode: key


auth_mode: aml_token


The difference between key and aml_token will become clear when we invoke the endpoints.

Notice that this setting affects all deployments in the endpoint, therefore it’s set in the endpoint.yml file, not in the deployment.yml file. The section on “Ensuring a safe rollout” later on in this post explains the differences between a deployment and an endpoint, in the practical sense.

Creating the endpoints

At this point, you’ve learned about every single line of YAML code in all endpoint and deployment specification files of the accompanying project. Let’s look at the endpoint and deployment files for our first endpoint:


name: endpoint-managed-1
auth_mode: key


name: blue
endpoint_name: endpoint-managed-1
model: azureml:model-managed:1
  code: ../../src/
environment: azureml:AzureML-pytorch-1.7-ubuntu18.04-py37-cpu-inference:11
instance_type: Standard_DS3_v2
instance_count: 1


The name of an endpoint needs to be unique within a region. You can change the name of your endpoints in the YAML specification files, or you can pass a unique name to the CLI command at creation time, as shown below. You can create endpoints 1, 2 and 3 using the following CLI commands:


az ml online-endpoint create -f cloud/endpoint-X/endpoint.yml --name <ENDPOINTX>
az ml online-deployment create -f cloud/endpoint-X/deployment.yml --all-traffic --endpoint-name <ENDPOINTX>


You can now go to the Azure ML studio, click on “Endpoints,” and in the “Real-time endpoints” page, you’ll see the list of endpoints you created.
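
Since endpoint names need to be unique within a region, you may want to generate the value you pass to --name rather than inventing one by hand. Here’s a minimal sketch that appends a random suffix to a base name. The unique_endpoint_name helper is hypothetical, and it doesn’t check Azure’s naming rules (such as length limits), so keep the base name short.

```python
import uuid


def unique_endpoint_name(base_name: str, suffix_length: int = 8) -> str:
    """Append a random hexadecimal suffix to base_name."""
    if not 1 <= suffix_length <= 32:
        raise ValueError('suffix_length must be between 1 and 32.')
    # uuid4().hex is a 32-character random hexadecimal string.
    suffix = uuid.uuid4().hex[:suffix_length]
    return f'{base_name}-{suffix}'


print(unique_endpoint_name('endpoint-managed-1'))
```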

Ensuring a safe rollout

Let’s imagine a scenario where we used a managed online endpoint to deploy our PyTorch model using a machine with a CPU, but our team now decides that we need to use a GPU instead. We change the deployment to use a GPU, and that works fine in our internal testing. But this endpoint is already in use by clients, and we don’t want to disrupt the service. Opening it up to all clients is a risky move that may reveal issues and cause instability.

That’s where Azure ML’s safe rollout feature comes in. Instead of making an abrupt switch, we can use a “blue-green” deployment approach, where we roll out the new version of the code to a small subset of clients, and tune the size of that subset as we go. After ensuring that the clients calling the new version of the code encounter no issues for a while, we can increase the percentage of clients, until we’ve completed the switch.

Endpoint 4 in the accompanying project will enable this scenario by specifying two deployments:


name: endpoint-managed-4
auth_mode: key


name: blue
endpoint_name: endpoint-managed-4
model: azureml:model-managed:1
  code: ../../src/
environment: azureml:AzureML-pytorch-1.7-ubuntu18.04-py37-cpu-inference:11
instance_type: Standard_DS3_v2
instance_count: 1


name: green
endpoint_name: endpoint-managed-4
model: azureml:model-managed:1
  code: ../../src/
  conda_file: score-conda.yml
instance_type: Standard_NC6s_v3
instance_count: 1


You can create the endpoint and deployments for endpoint 4 using CLI commands similar to the ones in the previous section. When you’re ready to adjust their traffic allocation, you can do that with an additional command, as shown below:


az ml online-endpoint create -f cloud/endpoint-4/endpoint.yml --name <ENDPOINT4>
az ml online-deployment create -f cloud/endpoint-4/deployment-blue.yml --all-traffic --endpoint-name <ENDPOINT4>
az ml online-deployment create -f cloud/endpoint-4/deployment-green.yml --endpoint-name <ENDPOINT4>
az ml online-endpoint update --name <ENDPOINT4> --traffic "blue=90 green=10"


For more information about safe rollout, check out the documentation.
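
To get a feel for what a split like “blue=90 green=10” means in practice, here’s a small local simulation. It parses a traffic string in the same format as the --traffic argument above and then distributes simulated requests by weighted random choice. Both functions are hypothetical helpers, the requirement that percentages add up to 100 is a simplifying assumption of this sketch, and the simulation says nothing about how Azure actually routes requests internally.

```python
import random
from collections import Counter
from typing import Dict


def parse_traffic(traffic: str) -> Dict[str, int]:
    """Parse a string such as 'blue=90 green=10' into percentages."""
    allocation = {}
    for item in traffic.split():
        name, _, percent = item.partition('=')
        allocation[name] = int(percent)
    if sum(allocation.values()) != 100:
        raise ValueError(f'Percentages must add up to 100: {traffic!r}')
    return allocation


def simulate_routing(allocation: Dict[str, int], request_count: int,
                     seed: int = 0) -> Counter:
    """Assign each simulated request to a deployment, weighted by traffic."""
    rng = random.Random(seed)
    names = list(allocation)
    weights = [allocation[name] for name in names]
    return Counter(rng.choices(names, weights=weights, k=request_count))


allocation = parse_traffic('blue=90 green=10')
print(simulate_routing(allocation, request_count=1000))
```

With a split like this one, only about one request in ten reaches the green deployment, which limits the impact of any issue in the new version.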

Creating the sample request

Before we can invoke the endpoints, we need to create a file containing input data for our prediction code. Recall that in our scenario, the run(...) function takes in the JSON representation of a single image encoded as a 28 × 28 matrix, and returns the class that the image belongs to as a string, such as “Shirt.”

We can easily get an image file from our dataset for testing, but we still need to convert it into JSON. You can find code to create a JSON sample request in src/ This code loads Fashion MNIST data, gets an image from the dataset, creates a matrix of shape 28 × 28 containing the image’s pixel values, and adds it to a JSON dictionary with key data.


import json
import os
from pathlib import Path

from train import _get_data

DATA_PATH = 'aml-managed-endpoint/data'
SAMPLE_REQUEST = 'aml-managed-endpoint/sample-request'

def create_sample_request() -> None:
    """Creates a sample request to be used in prediction."""
    batch_size = 64
    (_, test_dataloader) = _get_data(batch_size)

    (x_batch, _) = next(iter(test_dataloader))
    x = x_batch[0, 0, :, :].cpu().numpy().tolist()

    os.makedirs(name=SAMPLE_REQUEST, exist_ok=True)
    with open(Path(SAMPLE_REQUEST, 'sample_request.json'),
              'w',
              encoding='utf-8') as file:
        json.dump({'data': x}, file)

def main() -> None:
    create_sample_request()

if __name__ == '__main__':
    main()

Here’s a bit of the generated sample_request.json file:


{"data": [[0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.42352941632270813, 0.43921568989753723, 0.46666666865348816, 0.3921568691730499, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.16


I’ve checked in the sample request JSON, so you only need to run this code if you want to re-generate it.

Invoking the endpoints using the CLI

We’re ready to invoke the endpoints!

Let’s first invoke them using the CLI. The only two pieces of information we need to pass to the invocation are the name of the endpoint and the request file, as you can see below. (Replace <ENDPOINTX> with the name of the endpoint you’d like to invoke.)


az ml online-endpoint invoke --request-file sample-request/sample_request.json -n <ENDPOINTX> 


Let’s take a look at the logs for this endpoint, by going to the Azure ML studio, clicking on the endpoint name, and then “Deployment logs.”

Screenshot of the logs for an endpoint invocation.

If you scroll down a bit, you’ll find the logging we added to the init() function of the scoring file. I invoked the endpoint twice, so I can also see the logging of the run(...) function printed twice.

Screenshot of the logs for an endpoint invocation showing our custom logging.

Invoking the endpoints using REST

We can also invoke the endpoint using the REST (representational state transfer) protocol. Let’s now come back to the two different authentication modes, key and aml_token, and see how we can invoke endpoints created with each of these alternatives.

Let’s first consider the key authentication mode, which we used for endpoint 1. To find the REST scoring URI for this endpoint and its authentication key, we go to the Azure ML studio, select “Endpoints,” click on the name of the endpoint, and then select the “Consume” tab.

Screenshot showing the REST scoring URI and key for the endpoint created using key authentication.

The bearer token used in the request can be found in the same panel, under “Authentication.” In key authentication mode, our key never expires, so we don’t need to worry about refreshing it. We can execute the following curl command to do a POST that invokes the endpoint:


curl --location \
     --request POST https://<ENDPOINT1> \
     --header "Authorization: Bearer NXdYObRnl2KhCE7ldFzgIUAevDupm6ZB" \
     --header "Content-Type: application/json" \
     --data @sample-request/sample_request.json


Make sure you replace <ENDPOINT1> with the name of your endpoint.

Similar to the CLI invocation, we get a string back, such as “Shirt”.
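
If your client is written in Python, you can build the same POST request with the standard library. The sketch below only constructs the request, so it runs without a live endpoint. The build_scoring_request helper is hypothetical, and the URI and credential are placeholders that you’d replace with the values from the “Consume” tab.

```python
import json
import urllib.request


def build_scoring_request(scoring_uri: str, credential: str,
                          payload: dict) -> urllib.request.Request:
    """Build a POST request equivalent to the curl command above."""
    body = json.dumps(payload).encode('utf-8')
    headers = {
        'Authorization': f'Bearer {credential}',
        'Content-Type': 'application/json',
    }
    return urllib.request.Request(scoring_uri, data=body, headers=headers,
                                  method='POST')


# Placeholder values: use your own scoring URI, key or token, and image data.
request = build_scoring_request('https://example.invalid/score',
                                '<KEY_OR_TOKEN>', {'data': [[0.0]]})
print(request.get_method(), request.full_url)

# To send the request for real, you could call:
# urllib.request.urlopen(request).read().decode('utf-8')
```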

Now let’s consider endpoint 3, which was created using aml_token authentication mode.

Screenshot showing the REST scoring URI for the endpoint created using aml_token authentication.

As you can see, just like in the previous endpoint, the Azure ML studio gives us a REST scoring URI. And even though it doesn’t give us a token, it tells us what we need to do to get one. Let’s follow the instructions and execute the following command:


az ml online-endpoint get-credentials --name <ENDPOINT3>


You’ll get a JSON dictionary with key accessToken and a long string value, which we’ll abbreviate as <TOKEN>. We can now use it to invoke the endpoint:


curl --location --request POST https://<ENDPOINT3> \
     --header "Authorization: Bearer <TOKEN>" \
     --header "Content-Type: application/json" \
     --data @sample-request/sample_request.json


Tokens expire after one hour, and you can refresh them by executing the same get-credentials call I show above.
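
In a long-running client, you probably don’t want to fetch a new token on every call, nor discover that it expired in the middle of a request. One option is a small cache that refreshes the token shortly before the one-hour limit. This is a sketch under stated assumptions: TokenCache is a hypothetical class, fetch_token is whatever function you write to obtain a token (for example, by wrapping the get-credentials command above), and the five-minute safety margin is an arbitrary choice.

```python
import time
from typing import Callable, Optional


class TokenCache:
    """Caches a token and refreshes it shortly before it expires."""

    def __init__(self, fetch_token: Callable[[], str],
                 lifetime_seconds: float = 3600.0,
                 margin_seconds: float = 300.0,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._fetch_token = fetch_token
        self._lifetime = lifetime_seconds
        self._margin = margin_seconds
        self._clock = clock
        self._token: Optional[str] = None
        self._expires_at = 0.0

    def get(self) -> str:
        """Return a cached token, fetching a new one when needed."""
        now = self._clock()
        if self._token is None or now >= self._expires_at:
            self._token = self._fetch_token()
            # Refresh a little early, to avoid using a token about to expire.
            self._expires_at = now + self._lifetime - self._margin
        return self._token


# Placeholder fetch function; a real one would return an actual token.
cache = TokenCache(lambda: '<TOKEN>')
print(cache.get())
```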

The GitHub project for this post contains shell executable files that you can use to invoke these two endpoints. Feel free to reuse them in your project — just make sure to change the endpoint name and location of the file to score. Here are the contents of these files:



SCORING_URI=$(az ml online-endpoint show --name $ENDPOINT_NAME --query scoring_uri -o tsv)

PRIMARY_KEY=$(az ml online-endpoint get-credentials --name $ENDPOINT_NAME --query primaryKey -o tsv)

OUTPUT=$(curl --location \
     --request POST $SCORING_URI \
     --header "Authorization: Bearer $PRIMARY_KEY" \
     --header "Content-Type: application/json" \
     --data @sample-request/sample_request.json)



SCORING_URI=$(az ml online-endpoint show --name $ENDPOINT_NAME --query scoring_uri -o tsv)

ACCESS_TOKEN=$(az ml online-endpoint get-credentials --name $ENDPOINT_NAME --query accessToken -o tsv)

OUTPUT=$(curl --location \
     --request POST $SCORING_URI \
     --header "Authorization: Bearer $ACCESS_TOKEN" \
     --header "Content-Type: application/json" \
     --data @sample-request/sample_request.json)


You can now invoke the endpoints by simply running these scripts:





In this post, you’ve seen how to create and invoke managed online endpoints using Azure ML. There are many methods for creating Azure ML resources. Here I showed how to use YAML files to specify the details for each resource, and how to use VS Code or the CLI to create them in the cloud. I then presented the main concepts you need to know to make the right choices when creating these YAML files. And finally, I showed different ways to invoke an endpoint. I hope that you learned something new, and that you’ll try these features on your own!

The complete code for this post can be found on GitHub.

Thank you to Sethu Raman and Shivani Sambare from the Azure ML team at Microsoft for reviewing the content in this post.

About the author

Bea Stollnitz is a principal developer advocate at Microsoft, focusing on Azure ML. See her blog for more in-depth articles about Azure ML and other machine learning topics.

Last update: May 11 2022 03:25 AM