This blog series will teach you how to optimize your machine learning model service and save money. Tuning the performance of your Azure Machine Learning endpoint will help you make better use of your Azure Machine Learning resources, reduce costs, and increase throughput.
Know the Business
Before we dive in, you should first understand the ML model service(s) you are about to create:
Number of models you want to run online
Number of services you want to maintain
Estimated RPS (Requests Per Second), on average and at peak, for each model or service
Pattern of request traffic: weekday-weekend, constant traffic, “Black Friday”, etc.
Request and response data type and size: image, text, stream, structured data, etc.
Client capabilities: high performance services, IoT devices, desktop app, mobile app, etc.
Budget: the cost of the compute instances you plan to use is also an important consideration.
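To turn these answers into an initial sizing, a rough capacity estimate helps. The sketch below is a minimal back-of-the-envelope calculation based on Little's law; the peak RPS, latency, and per-instance concurrency numbers are hypothetical placeholders, not Azure recommendations.

```python
import math

def estimate_instance_count(peak_rps: float, latency_s: float,
                            concurrency_per_instance: int,
                            headroom: float = 0.2) -> int:
    """Little's law: requests in flight = arrival rate * latency.
    Divide by per-instance concurrency, then add headroom for spikes."""
    in_flight = peak_rps * latency_s
    instances = in_flight / concurrency_per_instance
    return max(1, math.ceil(instances * (1 + headroom)))

# Hypothetical numbers: 200 RPS at peak, 50 ms latency,
# 4 concurrent requests handled per instance.
print(estimate_instance_count(200, 0.05, 4))  # -> 3
```

Treat the result as a starting point for the first deployment; real traffic data from Azure Monitor should drive later adjustments.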
Note: The more answers you have to the above-mentioned questions, the more efficiently you can optimize the ML model(s)!
Note: Estimating the RPS and traffic pattern (items 3 and 4 above) for a new model can be tedious. Typically, you should begin by examining the model service's client usage pattern. Then, with Azure Monitor enabled, optimize and deploy the first version. The actual RPS and traffic pattern will become clearer after a few days of use. Adjust your deployment as needed by updating the model configuration through a model service update. You'll find the optimum fit after a few iterations!
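As a sketch of what one such iteration might look like, an Azure Machine Learning managed online deployment is described in YAML, and `instance_type` / `instance_count` can be changed between updates. The names below (`blue`, `my-endpoint`, the model reference) are placeholders, not values from this article.

```yaml
# Hypothetical managed online deployment; adjust instance_count as
# observed RPS and latency data come in from Azure Monitor.
$schema: https://azuremlschemas.azureedge.net/latest/managedOnlineDeployment.schema.json
name: blue
endpoint_name: my-endpoint
model: azureml:my-model:1
instance_type: Standard_DS3_v2
instance_count: 2
```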
Know Your Model
The engineer or data scientist who deploys a model is often not the one who created it. Even so, to achieve lower latency you may need to optimize the model architecture itself. Here are some examples:
Optimize the model so that its operators can run in parallel. A purely sequential model graph (a "pearl chain") cannot be made to run faster through parallelism the way a directed acyclic graph (DAG) with independent branches can.
If the model relies heavily on specific operators, such as convolution, you can explore special compute SKUs whose CPUs accelerate those operators to reduce execution latency.
Use a custom model format to take advantage of specialized hardware, such as TensorRT models that can take advantage of INT8 capability on T4 GPUs.
If you intend to use a specific model server, the model's inputs and outputs must adhere to that server's definition. For example, for a Triton model, inputs and outputs must be tensors only.
Modify the model to accept batched input; this is frequently required to use a GPU efficiently, for example when serving an NLP model.
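To illustrate why batched input helps, here is a stdlib-only sketch of the cost model behind batching: each model invocation pays a fixed overhead (dispatch, kernel launch), so processing requests together amortizes it. `process_batch` and its costs are hypothetical stand-ins for a real model call, not Azure or Triton APIs.

```python
def process_batch(inputs):
    """Hypothetical model call: fixed overhead is paid once per batch."""
    FIXED_OVERHEAD_MS = 10  # e.g. framework dispatch / kernel launch
    PER_ITEM_MS = 1
    cost_ms = FIXED_OVERHEAD_MS + PER_ITEM_MS * len(inputs)
    predictions = [x * 2 for x in inputs]  # dummy "inference"
    return predictions, cost_ms

requests = list(range(8))

# One request at a time: the fixed overhead is paid 8 times.
unbatched_cost = sum(process_batch([r])[1] for r in requests)

# All requests in one batch: the fixed overhead is paid once.
_, batched_cost = process_batch(requests)

print(unbatched_cost, batched_cost)  # -> 88 18
```

The same intuition applies on real hardware: the larger the fixed per-call cost relative to per-item cost, the more batching pays off.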
Know your Framework and Libraries
Tune the ML Framework
Every machine learning framework has its own set of instructions, tools, or libraries for optimizing model execution parallelism and resource usage, which can increase throughput and decrease latency. For example, TensorFlow, PyTorch, and ONNX Runtime each expose settings for intra-op and inter-op parallelism.
Some machine learning framework libraries can also be optimized using environment variables or initialization settings. As an example:
"OMP_NUM_THREADS" is an environment variable read by OpenMP. Its value indicates how many parallel threads will be used to execute each model instance.
MKL has a similar option, "MKL_NUM_THREADS". If you use this library, consider tuning this option to optimize execution.
Note: All these variables must account for the number of model instances running in the same container. Setting "*_NUM_THREADS" to a large value is not recommended, as it will result in many context switches and CPU L3 cache misses.
Note: In general, "*_NUM_THREADS" multiplied by the number of model instances in the same container should not exceed the number of cores assigned to that container. For example, with OMP_NUM_THREADS = 3 and instance_count = 3, a container with fewer than 9 cores is not recommended.
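The rule of thumb above can be expressed as a small helper that derives a per-instance thread count from the container's core budget. The function name is illustrative; only `OMP_NUM_THREADS` itself is a real OpenMP variable, and it must be set before the ML runtime is loaded.

```python
import os

def threads_per_instance(container_cores: int, instance_count: int) -> int:
    """Keep threads * instances <= cores; always allow at least 1 thread."""
    return max(1, container_cores // instance_count)

cores, instances = 8, 3          # hypothetical container sizing
n = threads_per_instance(cores, instances)
os.environ["OMP_NUM_THREADS"] = str(n)  # set before importing the framework
print(n, n * instances <= cores)  # -> 2 True
```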
Model quantization often causes some degradation in model quality, but the benefit is reduced latency. Quantization should be a must-try step whenever a model is executed on a CPU: use the quantized model as long as accuracy remains within your threshold.
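To make the accuracy/latency trade-off concrete, here is a stdlib-only sketch of affine INT8 quantization: floats are mapped to 8-bit integers with a scale and zero point, trading a small reconstruction error for a 4x smaller representation. Real toolchains (ONNX Runtime, TensorRT, PyTorch) apply this per tensor or per channel; the helper functions below are illustrative, not a library API.

```python
def quantize_int8(values):
    """Affine quantization: q = round(x / scale) + zero_point, clamped to [-128, 127]."""
    lo, hi = min(values), max(values)
    scale = (hi - lo) / 255 or 1.0
    zero_point = round(-128 - lo / scale)
    q = [max(-128, min(127, round(v / scale) + zero_point)) for v in values]
    return q, scale, zero_point

def dequantize_int8(q, scale, zero_point):
    return [(qi - zero_point) * scale for qi in q]

weights = [-1.0, -0.25, 0.0, 0.5, 1.0]
q, scale, zp = quantize_int8(weights)
restored = dequantize_int8(q, scale, zp)
max_err = max(abs(w - r) for w, r in zip(weights, restored))
print(max_err <= scale)  # reconstruction error bounded by one quantization step
```

Measuring `max_err` against your accuracy threshold, as in the last line, is exactly the "stay within the threshold" check described above.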