Return to Part 1: Introduction to ML Service
In this section, we will explore some of the options and best practices for deploying your ML model service on Azure, especially using Azure Machine Learning.
We will cover how to choose the appropriate Azure SKU for your ML service, as well as some of the settings and limits of Azure ML that you should be aware of.
After optimizing the model and framework utilization, the next step in saving costs is selecting the best VM SKU. Choosing the right VM can improve performance and latency, but keep in mind that the end goal is to reduce the cost of inference, not to achieve the best latency. For example, if a SKU A setup can run 20% faster than SKU B but is 40% more expensive, SKU A may not be the best option.
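To make this concrete, here is a minimal sketch of the comparison, using hypothetical prices and throughputs (not real Azure numbers) that match the 20%-faster / 40%-more-expensive example above:

```python
# Illustrative cost comparison (hypothetical prices and throughputs,
# not real Azure numbers): SKU A is 20% faster but 40% more expensive.
def cost_per_1k_requests(price_per_hour: float, requests_per_second: float) -> float:
    """Dollars spent per 1,000 served requests at full utilization."""
    requests_per_hour = requests_per_second * 3600
    return price_per_hour / requests_per_hour * 1000

sku_b = cost_per_1k_requests(price_per_hour=1.00, requests_per_second=100)  # baseline
sku_a = cost_per_1k_requests(price_per_hour=1.40, requests_per_second=120)  # faster, pricier

print(f"SKU B: ${sku_b:.5f} per 1k requests")
print(f"SKU A: ${sku_a:.5f} per 1k requests")
# Despite the lower latency, SKU A costs ~17% more per request here.
```

The takeaway: always normalize to cost per request (or per 1k requests), not raw latency.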
First, we must understand the VM description. You must pay close attention to the following fields:
The price of different SKUs varies. Please refer to the Virtual Machine series | Microsoft Azure for the most up-to-date VM SKU list, available regions, and pricing. Some SKUs may be available in certain regions but not others.
If the client of a model service is a web service hosted in Azure, we preferably want the model service and the client service in the same region to minimize network latency. If cross-region access can't be avoided, you can find more details about network latency between Azure regions here.
In order to decide which SKU is the best fit, there are various profiling tools you could leverage. More on this in the later parts of this series.
Additionally, here are the SKUs supported by managed online endpoints: Managed online endpoints VM SKU list - Azure Machine Learning | Microsoft Learn
The AzureML settings and limits related to model service throughput and latency fall into two categories: network stack settings and container settings.
Here is what the AzureML network stack request flow looks like:
Azure Machine Learning inference HTTP server - Azure Machine Learning | Microsoft Learn
These limits are either hardcoded or related to your deployment, such as the number of cores. For the resource limits, refer to Manage resources and quotas - Azure Machine Learning | Microsoft Learn
There is one deployment configuration you need to pay attention to:
max_concurrent_requests_per_instance
If you are using the AzureML container image or an AzureML pre-built inference image, this number needs to be set to the same value as WORKER_COUNT (discussed below). If you are using an image you built yourself, then you need to set it to an appropriate number.
This setting defines the concurrency level at load-balancing time. Usually, the higher this number, the higher the throughput. However, if this number is set higher than what the model and machine learning framework can handle, requests will wait in the queue, eventually leading to longer end-to-end latency.
If the incoming requests per second exceed (max_concurrent_requests_per_instance * number_of_instances), the client will receive HTTP status code 429 (Too Many Requests).
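As a sketch, the admission rule implied above can be expressed as follows (simplified: the actual gating is on in-flight concurrency, which at steady state equals requests per second times latency):

```python
# Simplified sketch of the capacity check behind the 429 behavior.
def exceeds_capacity(in_flight_requests: int,
                     max_concurrent_requests_per_instance: int,
                     number_of_instances: int) -> bool:
    """True when a new request would be rejected with HTTP 429."""
    capacity = max_concurrent_requests_per_instance * number_of_instances
    return in_flight_requests >= capacity
```

For example, with 2 instances each allowing 5 concurrent requests, an 11th simultaneous request (arriving while 10 are in flight) would be turned away with a 429.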
For the default values of the request settings, refer to CLI (v2) managed online deployment YAML schema - Azure Machine Learning | Microsoft Learn
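As an illustration, this setting lives under request_settings in the managed online deployment YAML; the fragment below is a sketch where the names, SKU, and counts are placeholders:

```yaml
# deployment.yml (fragment; names, SKU, and counts are placeholders)
name: blue
endpoint_name: my-endpoint
instance_type: Standard_DS3_v2
instance_count: 2
request_settings:
  # For AzureML images, keep this equal to WORKER_COUNT (discussed below).
  max_concurrent_requests_per_instance: 4
```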
If you are using a Docker image built by yourself, please make sure it can accept environment variables to tune the setup. Then during deployment, make sure the environment variables are set properly.
Here is an example:
Assume "mymodelserver" can read an environment variable "MY_THREAD_COUNT" at runtime. Here is an example of your Dockerfile:
FROM <base image>
...
# Default value; can be overridden at deployment time.
ENV MY_THREAD_COUNT=1
...
ENTRYPOINT ["mymodelserver", "param1", "param2"]
At deployment time, you can set "MY_THREAD_COUNT" to an appropriate number to choose a different parallelism level for each SKU.
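For example, with Azure ML managed online endpoints you can pass such variables through the environment_variables field of the deployment YAML (the value below is a placeholder to tune per SKU):

```yaml
# deployment.yml (fragment; the value is a placeholder to tune per SKU)
environment_variables:
  MY_THREAD_COUNT: "4"
```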
If you are using the AzureML container image or AzureML prebuilt inference image, then WORKER_COUNT is one of the most important environment variables you need to set properly.
In AzureML-provided images, the Python HTTP server can run multiple worker processes to serve concurrent HTTP requests. Each worker process loads its own model instance and processes requests independently. WORKER_COUNT is an integer that defines how many worker processes to run, and its default value is one (1). This means that if you do not set this environment variable to an appropriate number, then even if you choose a SKU with multiple CPU cores, the container will still only process one request at a time!
This value is best determined iteratively. You can use the following process to determine the value of WORKER_COUNT.
Then, WORKER_COUNT = floor(result_1 / result_2). Make sure you reserve some cores and memory for system components on the same VM.
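As a sketch, assuming result_1 is your measured resource budget (e.g. usable CPU cores) and result_2 is the per-worker requirement from your profiling, the calculation with a small reservation for system components might look like this (the helper and its parameters are illustrative, not part of AzureML):

```python
import math

def worker_count(vm_cpu_cores: int, cores_per_worker: float,
                 reserved_cores: int = 1) -> int:
    """Hypothetical helper: floor(result_1 / result_2), keeping a small
    reservation of cores for system components on the same VM."""
    usable_cores = max(vm_cpu_cores - reserved_cores, 1)  # result_1
    # result_2 is cores_per_worker; run at least one worker.
    return max(math.floor(usable_cores / cores_per_worker), 1)

# e.g. on an 8-core SKU where each worker saturates ~2 cores:
print(worker_count(vm_cpu_cores=8, cores_per_worker=2))  # -> 3
```

Repeat the profiling after each change, since the optimal value shifts with the SKU and the model's threading behavior.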