The source code for this article:
https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git
Please refer to my repo for more AI resources, and feel free to star it:
https://github.com/xinyuwei-david/david-share.git
Note:
This repository is designed to test the performance of open-source models from the Azure Machine Learning Model Catalog on Managed Compute. I tested nearly 20 AI models in my repository. Due to space limitations, this article shows the testing of only two models, to help readers understand how to use my scripts. For more details, please refer to https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git
Model Deployment Methods
https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/deployments-overview
Name | Azure OpenAI service | Azure AI model inference | Serverless API | Managed compute |
---|---|---|---|---|
Which models can be deployed? | Azure OpenAI models | Azure OpenAI models and Models as a Service | Models as a Service | Open and custom models |
Deployment resource | Azure OpenAI resource | Azure AI services resource | AI project resource | AI project resource |
Best suited when | You are planning to use only OpenAI models | You are planning to take advantage of the flagship models in Azure AI catalog, including OpenAI. | You are planning to use a single model from a specific provider (excluding OpenAI). | If you plan to use open models and you have enough compute quota available in your subscription. |
Billing bases | Token usage & PTU | Token usage | Token usage | Compute core hours |
Deployment instructions | Deploy to Azure OpenAI Service | Deploy to Azure AI model inference | Deploy to Serverless API | Deploy to Managed compute |
Currently, an increasing number of new flagship models in the Azure AI Foundry model catalog, including OpenAI models, are deployed using the Azure AI model inference method. Models deployed this way can be accessed via the AI Inference SDK (which now supports stream mode: https://learn.microsoft.com/en-us/python/api/overview/azure/ai-inference-readme?view=azure-python-preview). Open-source models include DeepSeek R1, V3, Phi, Mistral, and more. For a detailed list of models, please refer to:
https://learn.microsoft.com/en-us/azure/ai-foundry/model-inference/concepts/models
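For illustration, here is a minimal sketch of calling such a deployment in stream mode with the azure-ai-inference SDK; the endpoint URL, key, and deployment name are placeholders you must supply:

```python
# Minimal streaming call against an Azure AI model inference deployment.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="<azure-ai-endpoint-url>",               # placeholder
    credential=AzureKeyCredential("<azure-ai-key>"),  # placeholder
)

# stream=True yields incremental updates instead of one full response.
for update in client.complete(
    messages=[UserMessage(content="Briefly introduce DeepSeek R1.")],
    model="<deployment-name>",                        # placeholder
    stream=True,
):
    if update.choices and update.choices[0].delta.content:
        print(update.choices[0].delta.content, end="")
```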
If you care about the performance data for this method, please skip to the last section of this article.
If you want the deployed model to have more dedicated performance and lower latency, you can use the Managed Compute mode.
Performance test of AI models in Azure Machine Learning
In this section, we focus on models deployed via Managed Compute from the Model Catalog in Azure Machine Learning (AML).
Next, we will use a Python script to automate the deployment of the model, and a second program to evaluate the model's performance.
Fast Deploy AI Model on AML Model Catalog via Azure GPU VM
The model names on AML tested in this repo, their full names on Hugging Face (used as the tokenizer name), and the Azure GPU VM SKUs on which they can be deployed in AML are as follows.
Model Name on AML | Model on HF (tokenizers name) | Azure GPU VM SKU Support in AML |
---|---|---|
Phi-4 | microsoft/phi-4 | NC24/48/96 A100 |
Phi-3.5-vision-instruct | microsoft/Phi-3.5-vision-instruct | NC24/48/96 A100 |
financial-reports-analysis | | NC24/48/96 A100 |
Llama-3.2-11B-Vision-Instruct | meta-llama/Llama-3.2-11B-Vision-Instruct | NC24/48/96 A100 |
Phi-3-small-8k-instruct | microsoft/Phi-3-small-8k-instruct | NC24/48/96 A100 |
Phi-3-vision-128k-instruct | microsoft/Phi-3-vision-128k-instruct | NC48 A100 or NC96 A100 |
microsoft-swinv2-base-patch4-window12-192-22k | microsoft/swinv2-base-patch4-window12-192-22k | NC24/48/96 A100 |
mistralai-Mixtral-8x7B-Instruct-v01 | mistralai/Mixtral-8x7B-Instruct-v0.1 | NC96 A100 |
Muse | microsoft/wham | NC24/48/96 A100 |
openai-whisper-large | openai/whisper-large | NC48 A100 or NC96 A100 |
snowflake-arctic-base | Snowflake/snowflake-arctic-base | ND H100V5 |
Nemotron-3-8B-Chat-4k-SteerLM | nvidia/nemotron-3-8b-chat-4k-steerlm | NC24/48/96 A100 |
stabilityai-stable-diffusion-xl-refiner-1-0 | stabilityai/stable-diffusion-xl-refiner-1.0 | Standard_ND96amsr_A100_v4 or Standard_ND96asr_v4 |
microsoft-Orca-2-7b | microsoft/Orca-2-7b | NC24/48/96 A100 |
This repository primarily focuses on the inference performance of the aforementioned models on 1 x NC24 A100, 2 x NC24 A100, 1 x NC48 A100, 1 x NC40 H100, and 1 x NC80 H100. However, these models currently do not support deployment on H100. Therefore, as of March 2025, all validations are conducted on A100 GPU VMs.
Clone code and prepare shell environment
First, you need to create an Azure Machine Learning workspace in the Azure portal. When selecting the region for the workspace, choose a region that has GPU VM quota available under the AML category of your subscription quota.
Next, find a shell environment where you can execute az login to log in to your Azure subscription.
#git clone https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git
#conda create -n aml_env python=3.9 -y
#conda activate aml_env
#cd AI-Foundry-Model-Performance
#pip install -r requirements.txt
Log in to Azure:
#curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
#az login --use-device-code
Deploy the Model Automatically
Next, you need to execute a script for end-to-end model deployment. This script will:
- Help you check the GPU VM quota for AML under your subscription.
- Prompt you to select the model you want to deploy.
- Let you specify the Azure GPU VM SKU and the number of instances to use for the deployment.
- Provide you with the endpoint and key of the successfully deployed model, so you can proceed with performance testing.
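Under the hood, the script drives the azure-ai-ml SDK. Below is a minimal sketch of the core deployment flow, with placeholder values; the real script additionally performs quota checks, model search, and logging:

```python
# Core deployment flow (simplified sketch; placeholders must be replaced).
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Model ID from the AzureML registry, e.g. Phi-4 version 7.
model_id = "azureml://registries/AzureML/models/Phi-4/versions/7"

endpoint = ManagedOnlineEndpoint(name="custom-endpoint-demo", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model=model_id,
    instance_type="Standard_NC24ads_A100_v4",  # pick a SKU from the table above
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route all traffic to the new deployment, then print connection info.
endpoint.traffic = {"default": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
keys = ml_client.online_endpoints.get_keys(endpoint.name)
print(ml_client.online_endpoints.get(endpoint.name).scoring_uri, keys.primary_key)
```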
Before running the script, check the table above to confirm which Azure GPU VM types the AI model you plan to deploy supports.
#python deploymodels-linux.py
If you run the test in PowerShell, use:
#python deploymodels-powershell.py
The deployment process:
========== Enter Basic Information ==========
Subscription ID: 53039473-9bbd-499d-90d7-d046d4fa63b6
Resource Group: AIrg1
Workspace Name: aml-david-1
========== Model Name Examples ==========
- Phi-4
- Phi-3.5-vision-instruct
- financial-reports-analysis
- databricks-dbrx-instruct
- Llama-3.2-11B-Vision-Instruct
- Phi-3-small-8k-instruct
- Phi-3-vision-128k-instruct
- microsoft-swinv2-base-patch4-window12-192-22k
- mistralai-Mixtral-8x7B-Instruct-v01
- Muse
- openai-whisper-large
- snowflake-arctic-base
- Nemotron-3-8B-Chat-4k-SteerLM
- stabilityai-stable-diffusion-xl-refiner-1-0
- microsoft-Orca-2-7b
==========================================
Enter the model name to search (e.g., 'Phi-4'): Phi-4
========== Matching Models ==========
Name Description Latest version
------------------------- ------------- ----------------
Phi-4-multimodal-instruct 1
Phi-4-mini-instruct 1
Phi-4 7
Note: The above table is for reference only. Enter the exact model name below:
Enter full model name (case-sensitive): Phi-4
Enter model version (e.g., 7): 7
2025-03-13 15:42:02,438 - INFO - User-specified model: name='Phi-4', version='7'
========== GPU Quota (Limit > 1) ==========
Region,ResourceName,LocalizedValue,Usage,Limit
westeurope,standardNCADSH100v5Family,,0,100
polandcentral,standardNCADSA100v4Family,,0,100
========== A100 / H100 SKU Information ==========
SKU Name GPU Count GPU Memory (VRAM) CPU Cores
----------------------------------- ---------- -------------------- ----------
Standard_NC24ads_A100_v4 1 80 GB 24
Standard_NC48ads_A100_v4 2 160 GB (2x80 GB) 48
Standard_NC96ads_A100_v4 4 320 GB (4x80 GB) 96
Standard_NC40ads_H100_v5 1 80 GB 40
Standard_NC80ads_H100_v5 2 160 GB (2x80 GB) 80
Available SKUs:
- Standard_NC24ads_A100_v4
- Standard_NC48ads_A100_v4
- Standard_NC96ads_A100_v4
- Standard_NC40ads_H100_v5
- Standard_NC80ads_H100_v5
Enter the SKU to use: Standard_NC24ads_A100_v4
Enter the number of instances (integer): 1
2025-03-13 15:52:42,333 - INFO - Model ID: azureml://registries/AzureML/models/Phi-4/versions/7
2025-03-13 15:52:42,333 - INFO - No environment configuration found.
2025-03-13 15:52:42,366 - INFO - ManagedIdentityCredential will use IMDS
2025-03-13 15:52:42,379 - INFO - Creating Endpoint: custom-endpoint-1741852362
2025-03-13 15:52:43,008 - INFO - Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
After 3-5 minutes, you will get the final results:
----- Deployment Information -----
ENDPOINT_NAME=custom-endpoint-1741863106
SCORING_URI=https://custom-endpoint-1741863106.polandcentral.inference.ml.azure.com/score
PRIMARY_KEY=DRxHMd1jbbSdNoXiYOaWRQ66erYZfejzKhdyDVRuh58v2hXILOcYJQQJ99BCAAAAAAAAAAAAINFRAZML3m1v
SECONDARY_KEY=4dhy3og6WfVzkIijMU7FFUDLpz4WIWEYgIlXMGYUzgwafsW6GPrMJQQJ99BCAAAAAAAAAAAAINFRAZMLxOpO
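With the scoring URI and key in hand, you can sanity-check the endpoint with a single request before benchmarking. A minimal sketch follows; the chat-style input_data payload shape is an assumption for Phi-family models on managed compute, and the exact schema varies by model:

```python
import requests

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
key = "<PRIMARY_KEY>"  # placeholder

# Payload shape assumed for chat models deployed from the AzureML registry.
payload = {
    "input_data": {
        "input_string": [{"role": "user", "content": "Say hello in one sentence."}],
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    }
}
resp = requests.post(
    scoring_uri,
    headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
print(resp.status_code, resp.json())
```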
Fast delete endpoint
GPU VMs are relatively expensive. Therefore, after completing performance testing, you should use the script below to delete the endpoint and avoid incurring unnecessary costs.
#python deplete-endpoint-20250327.py
Delete process:
Please enter your Azure Subscription ID: aaaaaaaaaaaaaaaa
Please enter your Azure Resource Group name: A100VM_group
Please enter your Azure ML Workspace name: aml-westus
Retrieving the list of online Endpoints in the Workspace...
List of online Endpoints:
1. aml-westus-takfp
2. aml-westus-aflqs
Enter the numbers of the Endpoints you want to delete (e.g., 1, 3, 4). Press Enter to skip: 1, 2
Deleting Endpoint: aml-westus-takfp...
...Endpoint aml-westus-takfp deleted successfully.
Deleting Endpoint: aml-westus-aflqs...
...Endpoint aml-westus-aflqs deleted successfully.
The deletion process for all specified Endpoints has been completed. Exiting the script.
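For reference, the deletion boils down to a couple of azure-ai-ml SDK calls; below is a simplified sketch of what the interactive script does (placeholders assumed):

```python
# List and delete online endpoints (simplified sketch).
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Show all online endpoints in the workspace.
for ep in ml_client.online_endpoints.list():
    print(ep.name)

# Delete a selected endpoint; this removes its deployments and frees the GPU VM.
ml_client.online_endpoints.begin_delete(name="<endpoint-name>").result()
```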
Fast Performance Test AI Model on AML Model Catalog
Note:
- The test results in this section are for reference only. You need to use my script to conduct tests in your actual environment.
- My performance testing script has timeout and retry mechanisms configured. Specifically, if a request does not complete within the timeout period (30 seconds by default), it is marked as failed. Additionally, if a request receives a 429 error during execution, a backoff mechanism is triggered; after three consecutive 429 errors, the request is marked as failed. When running tests, adjust these parameters to the requirements of your business scenario (a simplified sketch of this behavior follows this list).
- When analyzing the test results, consider multiple metrics together, including request success rate, TTFT (Time to First Token), and tokens/s. Do not focus solely on a single indicator.
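Here is the simplified sketch of the timeout-and-backoff behavior described above (illustrative only; the actual logic lives in the test scripts in the repo):

```python
import time
import requests

REQUEST_TIMEOUT = 30   # seconds; a request exceeding this is marked as failed
MAX_RETRIES_429 = 3    # three consecutive 429s -> marked as failed

def call_with_backoff(scoring_uri, headers, payload):
    """POST once, retrying on HTTP 429 with exponential backoff."""
    backoff = 1.0
    for _ in range(MAX_RETRIES_429):
        try:
            resp = requests.post(scoring_uri, headers=headers, json=payload,
                                 timeout=REQUEST_TIMEOUT)
        except requests.Timeout:
            return None            # timed out -> failed
        if resp.status_code != 429:
            return resp            # success or a non-throttling error
        time.sleep(backoff)        # throttled -> back off and retry
        backoff *= 2
    return None                    # three consecutive 429s -> failed
```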
The primary goal of performance testing is to measure tokens/s and TTFT during inference. To better simulate real-world scenarios, I have set up several common LLM/SLM use cases in the test script. Additionally, to compute tokens/s accurately, the test script loads the corresponding model's tokenizer during execution (refer to the tokenizer names in the table above).
Before officially starting the test, you need to log in to Hugging Face from your terminal:
#huggingface-cli login
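For illustration, here is a minimal sketch of how tokens/s can be derived by timing one request and counting output tokens with the HF tokenizer. The response field name is an assumption; the real script parses the actual payload returned by each model:

```python
import time
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")  # requires HF login

def measure_tokens_per_second(scoring_uri: str, key: str, prompt: str) -> float:
    """Send one request and estimate output tokens/s with the HF tokenizer."""
    payload = {"input_data": {"input_string": [{"role": "user", "content": prompt}],
                              "parameters": {"max_new_tokens": 512}}}
    start = time.time()
    resp = requests.post(scoring_uri,
                         headers={"Authorization": f"Bearer {key}"},
                         json=payload, timeout=30)
    elapsed = time.time() - start
    resp.raise_for_status()
    output_text = resp.json()["output"]          # field name assumed
    return len(tokenizer.encode(output_text)) / elapsed
```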
Phi Text2Text Series (Phi-4/Phi-3-small-8k-instruct)
Run the test script:
(aml_env) root@pythonvm:~/AIFperformance# python press-phi4-0314.py
Please enter the API service URL: https://david-workspace-westeurop-ldvdq.westeurope.inference.ml.azure.com/score
Please enter the API Key: Ef9DFpATsXs4NiWyoVhEXeR4PWPvFy17xcws5ySCvV2H8uOUfgV4JQQJ99BCAAAAAAAAAAAAINFRAZML3eIO
Please enter the full name of the HuggingFace model for tokenizer loading: microsoft/phi-4
Tokenizer loaded successfully: microsoft/phi-4
Test result analysis:
microsoft/phi-4
Concurrency = 1
Scenario | VM 1 (1-nc48) TTFT (s) | VM 2 (2-nc24) TTFT (s) | VM 3 (1-nc24) TTFT (s) | VM 1 (1-nc48) tokens/s | VM 2 (2-nc24) tokens/s | VM 3 (1-nc24) tokens/s |
---|---|---|---|---|---|---|
Text Generation | 12.473 | 19.546 | 19.497 | 68.07 | 44.66 | 44.78 |
Question Answering | 11.914 | 15.552 | 15.943 | 72.10 | 44.56 | 46.04 |
Translation | 2.499 | 3.241 | 3.411 | 47.62 | 33.32 | 34.59 |
Text Summarization | 2.811 | 4.630 | 3.369 | 50.16 | 37.36 | 33.84 |
Code Generation | 20.441 | 27.685 | 26.504 | 83.12 | 51.58 | 52.26 |
Chatbot | 5.035 | 9.349 | 8.366 | 64.55 | 43.96 | 41.24 |
Sentiment Analysis | 1.009 | 1.235 | 1.241 | 5.95 | 12.96 | 12.89 |
Multi-turn Reasoning | 13.148 | 20.184 | 19.793 | 76.44 | 47.12 | 47.29 |
Concurrency = 2
Scenario | VM 1 (1-nc48) Total TTFT (s) | VM 2 (2-nc24) Total TTFT (s) | VM 3 (1-nc24) Total TTFT (s) | VM 1 (1-nc48) Total tokens/s | VM 2 (2-nc24) Total tokens/s | VM 3 (1-nc24) Total tokens/s |
---|---|---|---|---|---|---|
Text Generation | 19.291 | 19.978 | 24.576 | 110.94 | 90.13 | 79.26 |
Question Answering | 14.165 | 15.906 | 21.774 | 109.94 | 90.87 | 66.67 |
Translation | 3.341 | 4.513 | 10.924 | 76.45 | 53.95 | 68.54 |
Text Summarization | 3.494 | 3.664 | 6.317 | 77.38 | 69.60 | 59.45 |
Code Generation | 16.693 | 26.310 | 27.772 | 162.72 | 104.37 | 53.22 |
Chatbot | 8.688 | 9.537 | 12.064 | 100.09 | 87.67 | 67.23 |
Sentiment Analysis | 1.251 | 1.157 | 1.229 | 19.99 | 20.09 | 16.60 |
Multi-turn Reasoning | 20.233 | 23.655 | 22.880 | 110.84 | 94.47 | 88.79 |
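The concurrency = 2 numbers above come from firing requests in parallel. Conceptually, the script does something like the following thread-pool sketch (payload shape assumed as before; the real script also tracks TTFT and token counts per request):

```python
# Drive N concurrent requests against the /score endpoint (simplified sketch).
from concurrent.futures import ThreadPoolExecutor
import requests

def score_once(scoring_uri: str, key: str, prompt: str) -> str:
    payload = {"input_data": {"input_string": [{"role": "user", "content": prompt}],
                              "parameters": {"max_new_tokens": 256}}}
    resp = requests.post(scoring_uri,
                         headers={"Authorization": f"Bearer {key}"},
                         json=payload, timeout=30)
    resp.raise_for_status()
    return resp.text

prompts = ["Summarize the benefits of A100 GPUs.",
           "Write a haiku about cloud computing."]
with ThreadPoolExecutor(max_workers=2) as pool:   # concurrency = 2
    results = list(pool.map(
        lambda p: score_once("<scoring-uri>", "<key>", p), prompts))
```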
Performance test on Azure AI model inference
Azure AI model inference has default quotas. If you find the quota for a model insufficient, you can apply for an increase separately.
Limit name | Applies to | Limit value |
---|---|---|
Tokens per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
Requests per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
Tokens per minute | DeepSeek models | 5,000,000 |
Requests per minute | DeepSeek models | 5,000 |
Concurrent requests | DeepSeek models | 300 |
Tokens per minute | Rest of models | 200,000 |
Requests per minute | Rest of models | 1,000 |
Concurrent requests | Rest of models | 300 |
After you have deployed models on Azure AI model inference, you can check their invocation methods:
Prepare test env:
#conda create -n AImodelinference python=3.11 -y
#conda activate AImodelinference
#pip install azure-ai-inference
Run the test script; after you enter the following three variables, the stress test will begin:
#python callaiinference.py
Please enter the Azure AI key:
Please enter the Azure AI endpoint URL:
Please enter the deployment name:
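For reference, here is a minimal sketch of what a single streamed measurement looks like with the azure-ai-inference SDK, recording TTFT and a rough throughput figure (the stress-test script adds concurrency and aggregation on top of this):

```python
import time
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="<azure-ai-endpoint-url>",               # placeholder
    credential=AzureKeyCredential("<azure-ai-key>"),  # placeholder
)

start = time.time()
ttft = None
chunks = 0
for update in client.complete(
    messages=[UserMessage(content="Explain KV caching in one paragraph.")],
    model="<deployment-name>",                        # placeholder
    stream=True,
):
    if update.choices and update.choices[0].delta.content:
        if ttft is None:
            ttft = time.time() - start  # time to first token
        chunks += 1
elapsed = time.time() - start
print(f"TTFT: {ttft:.3f}s, streamed chunks/s: {chunks / elapsed:.2f}")
```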
Performance on DS 671B
I will use the test results of DeepSeek R1 on Azure AI model inference as an example:
Max performance:
- When the concurrency is 300 and the prompt length is 1024, TPS = 2110.77, TTFT = 2.201 s.
- When the concurrency is 300 and the prompt length is 2048, TPS = 1330.94, TTFT = 1.861 s.
Overall performance:
The overall throughput averages 735.12 tokens/s, with a P90 of 1184.06 tokens/s. The full test results are as follows:
Concurrency | Prompt Length | Total Requests | Success Count | Fail Count | Average latency (s) | Average TTFT (s) | Average token throughput (tokens/s) | Overall throughput (tokens/s) |
---|---|---|---|---|---|---|---|---|
300 | 1024 | 110 | 110 | 0 | 75.579 | 2.580 | 22.54 | 806.84 |
300 | 1024 | 110 | 110 | 0 | 71.378 | 71.378 | 24.53 | 1028.82 |
300 | 1024 | 110 | 110 | 0 | 76.622 | 2.507 | 23.24 | 979.97 |
300 | 1024 | 120 | 120 | 0 | 68.750 | 68.750 | 24.91 | 540.66 |
300 | 1024 | 120 | 120 | 0 | 72.164 | 2.389 | 22.71 | 1094.90 |
300 | 1024 | 130 | 130 | 0 | 72.245 | 72.245 | 23.68 | 1859.91 |
300 | 1024 | 130 | 130 | 0 | 82.714 | 2.003 | 20.18 | 552.08 |
300 | 1024 | 140 | 140 | 0 | 71.458 | 71.458 | 23.79 | 642.92 |
300 | 1024 | 140 | 140 | 0 | 71.565 | 2.400 | 22.93 | 488.49 |
300 | 1024 | 150 | 150 | 0 | 71.958 | 71.958 | 24.21 | 1269.10 |
300 | 1024 | 150 | 150 | 0 | 73.712 | 2.201 | 22.35 | 2110.77 |
300 | 2048 | 10 | 10 | 0 | 68.811 | 68.811 | 24.24 | 196.78 |
300 | 2048 | 10 | 10 | 0 | 70.189 | 1.021 | 23.18 | 172.92 |
300 | 2048 | 20 | 20 | 0 | 73.138 | 73.138 | 24.14 | 390.96 |
300 | 2048 | 20 | 20 | 0 | 69.649 | 1.150 | 24.22 | 351.31 |
300 | 2048 | 30 | 30 | 0 | 66.883 | 66.883 | 26.13 | 556.12 |
300 | 2048 | 30 | 30 | 0 | 68.918 | 1.660 | 23.46 | 571.63 |
300 | 2048 | 40 | 40 | 0 | 72.485 | 72.485 | 23.85 | 716.53 |
300 | 2048 | 40 | 40 | 0 | 65.228 | 1.484 | 24.87 | 625.16 |
300 | 2048 | 50 | 50 | 0 | 68.223 | 68.223 | 25.12 | 887.64 |
300 | 2048 | 50 | 50 | 0 | 66.288 | 1.815 | 24.38 | 976.17 |
300 | 2048 | 60 | 60 | 0 | 66.736 | 66.736 | 25.85 | 547.70 |
300 | 2048 | 60 | 60 | 0 | 69.355 | 2.261 | 23.94 | 615.81 |
300 | 2048 | 70 | 70 | 0 | 66.689 | 66.689 | 25.66 | 329.90 |
300 | 2048 | 70 | 70 | 0 | 67.061 | 2.128 | 23.89 | 1373.11 |
300 | 2048 | 80 | 80 | 0 | 68.091 | 68.091 | 25.68 | 1516.27 |
300 | 2048 | 80 | 80 | 0 | 67.413 | 1.861 | 24.01 | 1330.94 |
300 | 2048 | 90 | 90 | 0 | 66.603 | 66.603 | 25.51 | 418.81 |
300 | 2048 | 90 | 90 | 0 | 70.072 | 2.346 | 23.41 | 1047.53 |
300 | 2048 | 100 | 100 | 0 | 70.516 | 70.516 | 24.29 | 456.66 |
300 | 2048 | 100 | 100 | 0 | 86.862 | 2.802 | 20.03 | 899.38 |
300 | 2048 | 110 | 110 | 0 | 84.602 | 84.602 | 21.16 | 905.59 |
300 | 2048 | 110 | 110 | 0 | 77.883 | 2.179 | 21.17 | 803.93 |
300 | 2048 | 120 | 120 | 0 | 73.814 | 73.814 | 23.73 | 541.03 |
300 | 2048 | 120 | 120 | 0 | 86.787 | 4.413 | 20.32 | 650.57 |
300 | 2048 | 130 | 130 | 0 | 78.222 | 78.222 | 22.61 | 613.27 |
300 | 2048 | 130 | 130 | 0 | 83.670 | 2.131 | 20.16 | 1463.81 |
300 | 2048 | 140 | 140 | 0 | 77.429 | 77.429 | 22.74 | 1184.06 |
300 | 2048 | 140 | 140 | 0 | 77.234 | 3.891 | 21.90 | 821.34 |
300 | 2048 | 150 | 150 | 0 | 72.753 | 72.753 | 23.69 | 698.50 |
300 | 2048 | 150 | 150 | 0 | 73.674 | 2.425 | 22.74 | 1012.25 |
300 | 4096 | 10 | 10 | 0 | 83.003 | 83.003 | 25.52 | 221.28 |
300 | 4096 | 10 | 10 | 0 | 89.713 | 1.084 | 24.70 | 189.29 |
300 | 4096 | 20 | 20 | 0 | 82.342 | 82.342 | 26.65 | 337.85 |
300 | 4096 | 20 | 20 | 0 | 84.526 | 1.450 | 24.81 | 376.17 |
300 | 4096 | 30 | 30 | 0 | 87.979 | 87.979 | 24.46 | 322.62 |
300 | 4096 | 30 | 30 | 0 | 84.767 | 1.595 | 24.28 | 503.01 |
300 | 4096 | 40 | 40 | 0 | 85.231 | 85.231 | 26.03 | 733.50 |
300 | 4096 | 40 | 40 | 0 | 81.514 | 1.740 | 24.17 | 710.79 |
300 | 4096 | 50 | 50 | 0 | 91.253 | 91.253 | 24.53 | 279.55 |