This blog series will show you how to optimize your machine learning model service and save money. Tuning the performance of your Azure Machine Learning endpoint helps you make better use of your Azure Machine Learning resources, reduce costs, and increase throughput.
Before we dive in, we should first understand the ML model service(s) you are about to create:
Note: The more of the above questions you can answer, the more effectively you can optimize the ML model(s)!
Note: Estimating #3 and #4 for a new model can be tedious. Typically, you should begin by examining the model service's client usage pattern, then, with Azure Monitor enabled, optimize and deploy the first version. The actual RPS and traffic pattern will become clearer after a few days of use. Adjust your deployment by updating the model configuration as needed via a model service update; a sketch of such an update follows below. You'll find the optimal fit after a few iterations!
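To make that iteration loop concrete, here is a minimal sketch of updating an existing deployment with the Azure ML Python SDK v2. The subscription, endpoint, and deployment names are placeholders, and the specific setting being changed (instance count) is just one example of a tunable you might revise after reviewing Azure Monitor metrics.

```python
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# Placeholder workspace details; substitute your own.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Fetch the current deployment, adjust capacity based on the observed
# RPS and traffic pattern, and push the update in place.
deployment = ml_client.online_deployments.get(
    name="blue", endpoint_name="my-endpoint"
)
deployment.instance_count = 2  # tune after reviewing Azure Monitor metrics
ml_client.online_deployments.begin_create_or_update(deployment).result()
```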
The engineer or data scientist who deploys a model is not always the one who created it. Nevertheless, achieving lower latency may require optimizing the model architecture itself. Here are some examples:
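One common architecture-level optimization is serving a smaller distilled variant of a large model. The sketch below uses the Hugging Face transformers library, and the model choice is purely illustrative, not a recommendation from this series.

```python
# Illustrative only: replacing a large model with a distilled variant
# trades a small amount of accuracy for significantly lower latency.
from transformers import AutoModel, AutoTokenizer

# DistilBERT has ~40% fewer parameters than BERT-base and runs roughly
# 60% faster at inference while retaining most of its accuracy.
tokenizer = AutoTokenizer.from_pretrained("distilbert-base-uncased")
model = AutoModel.from_pretrained("distilbert-base-uncased")
model.eval()  # switch to inference mode
```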
Every machine learning framework has its own set of instructions, tools, or libraries for optimizing model execution parallelism and resource usage, which yields higher throughput and lower latency. ML framework examples:
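For instance, with PyTorch (one such framework; this sketch assumes a CPU deployment), you can control both kinds of parallelism directly from code:

```python
import torch

# Intra-op parallelism: threads used within a single op (e.g., a matmul).
torch.set_num_threads(4)
# Inter-op parallelism: threads used to run independent ops concurrently.
# Must be called before any inter-op parallel work starts.
torch.set_num_interop_threads(1)
```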
Some machine learning framework libraries can also be tuned through environment variables or initialization settings. For example:
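A few of the commonly used thread-count environment variables are shown below; which ones take effect depends on the OpenMP/BLAS runtime your framework links against, and they must be set before the framework is imported.

```python
import os

# Set these before importing the ML framework so the underlying
# OpenMP / MKL / OpenBLAS runtimes pick them up.
os.environ["OMP_NUM_THREADS"] = "4"        # OpenMP
os.environ["MKL_NUM_THREADS"] = "4"        # Intel MKL
os.environ["OPENBLAS_NUM_THREADS"] = "4"   # OpenBLAS

import torch  # imported only after the variables are set
```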
Note: All of these variables must take into account the number of model instances running in the same container. Setting "*_NUM_THREADS" to a large value is not recommended, as it will cause excessive context switches and CPU L3 cache misses.
Note: In general, "*_NUM_THREADS" multiplied by the number of model instances in the same container should not exceed the number of cores assigned to that container. For example, with OMP_NUM_THREADS = 3 and instance_count = 3, using a container with fewer than 9 cores is not recommended.
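As a hedged sketch of that rule, you can derive a safe per-instance thread count from the cores actually available to the container (the instance count here is a placeholder):

```python
import os

# Cores actually available to this container (honors CPU affinity on
# Linux); fall back to os.cpu_count() where sched_getaffinity is absent.
try:
    cores = len(os.sched_getaffinity(0))
except AttributeError:
    cores = os.cpu_count() or 1

instance_count = 3  # model instances sharing this container
# Keep threads_per_instance * instance_count <= cores.
threads_per_instance = max(1, cores // instance_count)
os.environ["OMP_NUM_THREADS"] = str(threads_per_instance)
```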
Model quantization reduces latency, though it often comes with some degradation in model quality. If a model runs on a CPU, quantization should be a must-try step: as long as accuracy stays within your threshold, the quantized model is the clear choice.
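For a CPU deployment, PyTorch's post-training dynamic quantization is one of the cheapest things to try. This is a minimal sketch: the stand-in model and layer choice are illustrative, and you should re-validate accuracy against your threshold afterwards.

```python
import torch
import torch.nn as nn

# A stand-in model; substitute your trained float32 model here.
model = nn.Sequential(nn.Linear(256, 128), nn.ReLU(), nn.Linear(128, 10))
model.eval()

# Quantize Linear-layer weights to int8; activations stay float and are
# quantized dynamically at runtime. This typically shrinks the model and
# speeds up CPU inference, at the cost of a small accuracy drop.
quantized = torch.quantization.quantize_dynamic(
    model, {nn.Linear}, dtype=torch.qint8
)

# Re-check accuracy against your threshold before promoting this model.
```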
Go to Part 2: How To Pick Azure VM SKU and AzureML Settings