The source code for this article:
https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git
Please refer to my repo for more AI resources, and feel free to star it:
https://github.com/xinyuwei-david/david-share.git
Note:
This repository is designed to test the performance of open-source models from the Azure Machine Learning Model Catalog on Managed Compute. I tested nearly 20 AI models in my repository. Due to space limitations, this article shows the testing of only two models, to help readers understand how to use my scripts. For more details, please refer to https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git
Model Deployment Methods
https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/deployments-overview
Name | Azure OpenAI service | Azure AI model inference | Serverless API | Managed compute |
---|---|---|---|---|
Which models can be deployed? | Azure OpenAI models | Azure OpenAI models and Models as a Service | Models as a Service | Open and custom models |
Deployment resource | Azure OpenAI resource | Azure AI services resource | AI project resource | AI project resource |
Best suited when | You are planning to use only OpenAI models | You are planning to take advantage of the flagship models in Azure AI catalog, including OpenAI. | You are planning to use a single model from a specific provider (excluding OpenAI). | If you plan to use open models and you have enough compute quota available in your subscription. |
Billing bases | Token usage & PTU | Token usage | Token usage | Compute core hours |
Deployment instructions | Deploy to Azure OpenAI Service | Deploy to Azure AI model inference | Deploy to Serverless API | Deploy to Managed compute |
Currently, an increasing number of new flagship models in the Azure AI Foundry model catalog, including OpenAI models, are deployed using the Azure AI model inference method. Models deployed this way can be accessed via the AI Inference SDK (which now supports stream mode: https://learn.microsoft.com/en-us/python/api/overview/azure/ai-inference-readme?view=azure-python-preview). Open-source models include DeepSeek R1, V3, Phi, Mistral, and more. For a detailed list of models, please refer to:
https://learn.microsoft.com/en-us/azure/ai-foundry/model-inference/concepts/models
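For illustration, here is a minimal sketch of calling such a deployment in stream mode with the azure-ai-inference SDK; the endpoint URL, key, and deployment name are placeholders you must supply:

```python
# Minimal streaming call against an Azure AI model inference deployment.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="<azure-ai-endpoint-url>",               # placeholder
    credential=AzureKeyCredential("<azure-ai-key>"),  # placeholder
)

# stream=True yields incremental updates instead of one full response.
for update in client.complete(
    messages=[UserMessage(content="Briefly introduce DeepSeek R1.")],
    model="<deployment-name>",                        # placeholder
    stream=True,
):
    if update.choices and update.choices[0].delta.content:
        print(update.choices[0].delta.content, end="")
```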
If you care about the performance data for this method, please skip to the last section of this article.
If you want the deployed model to have more dedicated performance and lower latency, you can use the Managed Compute mode.
Performance test of AI models in Azure Machine Learning
In this section, we focus on models deployed via Managed Compute from the Model Catalog in Azure Machine Learning (AML).
Next, we will use a Python script to automate the deployment of the model, and a second program to evaluate the model's performance.
Fast Deploy AI Model on AML Model Catalog via Azure GPU VM
The model names on AML tested in this repo, their full names on Hugging Face (used as the tokenizer name), and the Azure GPU VM SKUs on which they can be deployed in AML are as follows.
Model Name on AML | Model on HF (tokenizers name) | Azure GPU VM SKU Support in AML |
---|---|---|
Phi-4 | microsoft/phi-4 | NC24/48/96 A100 |
Phi-3.5-vision-instruct | microsoft/Phi-3.5-vision-instruct | NC24/48/96 A100 |
financial-reports-analysis | | NC24/48/96 A100 |
Llama-3.2-11B-Vision-Instruct | meta-llama/Llama-3.2-11B-Vision-Instruct | NC24/48/96 A100 |
Phi-3-small-8k-instruct | microsoft/Phi-3-small-8k-instruct | NC24/48/96 A100 |
Phi-3-vision-128k-instruct | microsoft/Phi-3-vision-128k-instruct | NC48 A100 or NC96 A100 |
microsoft-swinv2-base-patch4-window12-192-22k | microsoft/swinv2-base-patch4-window12-192-22k | NC24/48/96 A100 |
mistralai-Mixtral-8x7B-Instruct-v01 | mistralai/Mixtral-8x7B-Instruct-v0.1 | NC96 A100 |
Muse | microsoft/wham | NC24/48/96 A100 |
openai-whisper-large | openai/whisper-large | NC48 A100 or NC96 A100 |
snowflake-arctic-base | Snowflake/snowflake-arctic-base | ND H100V5 |
Nemotron-3-8B-Chat-4k-SteerLM | nvidia/nemotron-3-8b-chat-4k-steerlm | NC24/48/96 A100 |
stabilityai-stable-diffusion-xl-refiner-1-0 | stabilityai/stable-diffusion-xl-refiner-1.0 | Standard_ND96amsr_A100_v4 or Standard_ND96asr_v4 |
microsoft-Orca-2-7b | microsoft/Orca-2-7b | NC24/48/96 A100 |
This repository primarily focuses on the inference performance of the aforementioned models on 1 x NC24 A100, 2 x NC24 A100, 1 x NC48 A100, 1 x NC40 H100, and 1 x NC80 H100. However, these models currently do not support deployment on H100. Therefore, as of March 2025, all validations are conducted on A100 GPU VMs.
Clone code and prepare shell environment
First, you need to create an Azure Machine Learning workspace in the Azure portal. When selecting the region for the workspace, choose a region that has GPU VM quota available under the AML category of your subscription quota.
Next, find a shell environment where you can execute az login to log in to your Azure subscription.
#git clone https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git
#conda create -n aml_env python=3.9 -y
#conda activate aml_env
#cd AI-Foundry-Model-Performance
#pip install -r requirements.txt
Log in to Azure:
#curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash
#az login --use-device-code
Deploy the Model Automatically
Next, you need to execute a script for end-to-end model deployment. This script will:
- Help you check the GPU VM quota for AML under your subscription.
- Prompt you to select the model you want to deploy.
- Let you specify the Azure GPU VM SKU and the number of instances to use for the deployment.
- Provide you with the endpoint and key of the successfully deployed model, so you can proceed with performance testing.
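Under the hood, the script drives the azure-ai-ml SDK. Below is a minimal sketch of the core deployment flow, with placeholder values; the real script additionally performs quota checks, model search, and logging:

```python
# Core deployment flow (simplified sketch; placeholders must be replaced).
from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Model ID from the AzureML registry, e.g. Phi-4 version 7.
model_id = "azureml://registries/AzureML/models/Phi-4/versions/7"

endpoint = ManagedOnlineEndpoint(name="custom-endpoint-demo", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model=model_id,
    instance_type="Standard_NC24ads_A100_v4",  # pick a SKU from the table above
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route all traffic to the new deployment, then print connection info.
endpoint.traffic = {"default": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()
keys = ml_client.online_endpoints.get_keys(endpoint.name)
print(ml_client.online_endpoints.get(endpoint.name).scoring_uri, keys.primary_key)
```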
Before running the script, check the table above to confirm which Azure GPU VM types the AI model you plan to deploy supports.
#python deploymodels-linux.py
If you run the test in PowerShell, use:
#python deploymodels-powershell.py
The deployment process:
========== Enter Basic Information ==========
Subscription ID: 53039473-9bbd-499d-90d7-d046d4fa63b6
Resource Group: AIrg1
Workspace Name: aml-david-1
========== Model Name Examples ==========
- Phi-4
- Phi-3.5-vision-instruct
- financial-reports-analysis
- databricks-dbrx-instruct
- Llama-3.2-11B-Vision-Instruct
- Phi-3-small-8k-instruct
- Phi-3-vision-128k-instruct
- microsoft-swinv2-base-patch4-window12-192-22k
- mistralai-Mixtral-8x7B-Instruct-v01
- Muse
- openai-whisper-large
- snowflake-arctic-base
- Nemotron-3-8B-Chat-4k-SteerLM
- stabilityai-stable-diffusion-xl-refiner-1-0
- microsoft-Orca-2-7b
==========================================
Enter the model name to search (e.g., 'Phi-4'): Phi-4
========== Matching Models ==========
Name Description Latest version
------------------------- ------------- ----------------
Phi-4-multimodal-instruct 1
Phi-4-mini-instruct 1
Phi-4 7
Note: The above table is for reference only. Enter the exact model name below:
Enter full model name (case-sensitive): Phi-4
Enter model version (e.g., 7): 7
2025-03-13 15:42:02,438 - INFO - User-specified model: name='Phi-4', version='7'
========== GPU Quota (Limit > 1) ==========
Region,ResourceName,LocalizedValue,Usage,Limit
westeurope,standardNCADSH100v5Family,,0,100
polandcentral,standardNCADSA100v4Family,,0,100
========== A100 / H100 SKU Information ==========
SKU Name GPU Count GPU Memory (VRAM) CPU Cores
----------------------------------- ---------- -------------------- ----------
Standard_NC24ads_A100_v4 1 80 GB 24
Standard_NC48ads_A100_v4 2 160 GB (2x80 GB) 48
Standard_NC96ads_A100_v4 4 320 GB (4x80 GB) 96
Standard_NC40ads_H100_v5 1 80 GB 40
Standard_NC80ads_H100_v5 2 160 GB (2x80 GB) 80
Available SKUs:
- Standard_NC24ads_A100_v4
- Standard_NC48ads_A100_v4
- Standard_NC96ads_A100_v4
- Standard_NC40ads_H100_v5
- Standard_NC80ads_H100_v5
Enter the SKU to use: Standard_NC24ads_A100_v4
Enter the number of instances (integer): 1
2025-03-13 15:52:42,333 - INFO - Model ID: azureml://registries/AzureML/models/Phi-4/versions/7
2025-03-13 15:52:42,333 - INFO - No environment configuration found.
2025-03-13 15:52:42,366 - INFO - ManagedIdentityCredential will use IMDS
2025-03-13 15:52:42,379 - INFO - Creating Endpoint: custom-endpoint-1741852362
2025-03-13 15:52:43,008 - INFO - Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'
After 3-5 minutes, you will get the final results:
----- Deployment Information -----
ENDPOINT_NAME=custom-endpoint-1741863106
SCORING_URI=https://custom-endpoint-1741863106.polandcentral.inference.ml.azure.com/score
PRIMARY_KEY=DRxHMd1jbbSdNoXiYOaWRQ66erYZfejzKhdyDVRuh58v2hXILOcYJQQJ99BCAAAAAAAAAAAAINFRAZML3m1v
SECONDARY_KEY=4dhy3og6WfVzkIijMU7FFUDLpz4WIWEYgIlXMGYUzgwafsW6GPrMJQQJ99BCAAAAAAAAAAAAINFRAZMLxOpO
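With the scoring URI and key in hand, you can sanity-check the endpoint with a single request before benchmarking. A minimal sketch follows; the chat-style input_data payload shape is an assumption for Phi-family models on managed compute, and the exact schema varies by model:

```python
import requests

scoring_uri = "https://<endpoint-name>.<region>.inference.ml.azure.com/score"  # placeholder
key = "<PRIMARY_KEY>"  # placeholder

# Payload shape assumed for chat models deployed from the AzureML registry.
payload = {
    "input_data": {
        "input_string": [{"role": "user", "content": "Say hello in one sentence."}],
        "parameters": {"max_new_tokens": 64, "temperature": 0.7},
    }
}
resp = requests.post(
    scoring_uri,
    headers={"Authorization": f"Bearer {key}", "Content-Type": "application/json"},
    json=payload,
    timeout=30,
)
print(resp.status_code, resp.json())
```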
Fast delete endpoint
GPU VMs are relatively expensive. Therefore, after completing performance testing, you should use the script below to delete the endpoint and avoid incurring unnecessary costs.
#python deplete-endpoint-20250327.py
Delete process:
Please enter your Azure Subscription ID: aaaaaaaaaaaaaaaa
Please enter your Azure Resource Group name: A100VM_group
Please enter your Azure ML Workspace name: aml-westus
Retrieving the list of online Endpoints in the Workspace...
List of online Endpoints:
1. aml-westus-takfp
2. aml-westus-aflqs
Enter the numbers of the Endpoints you want to delete (e.g., 1, 3, 4). Press Enter to skip: 1, 2
Deleting Endpoint: aml-westus-takfp...
...Endpoint aml-westus-takfp deleted successfully.
Deleting Endpoint: aml-westus-aflqs...
...Endpoint aml-westus-aflqs deleted successfully.
The deletion process for all specified Endpoints has been completed. Exiting the script.
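For reference, the deletion boils down to a couple of azure-ai-ml SDK calls; below is a simplified sketch of what the interactive script does (placeholders assumed):

```python
# List and delete online endpoints (simplified sketch).
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Show all online endpoints in the workspace.
for ep in ml_client.online_endpoints.list():
    print(ep.name)

# Delete a selected endpoint; this removes its deployments and frees the GPU VM.
ml_client.online_endpoints.begin_delete(name="<endpoint-name>").result()
```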
Fast Performance Test AI Model on AML Model Catalog
Note:
- The test results in this section are for reference only. You need to use my script to conduct tests in your actual environment.
- My performance testing script has timeout and retry mechanisms configured. Specifically, if a request does not complete within the timeout period (30 seconds by default), it is marked as failed. Additionally, if a request receives a 429 error during execution, a backoff mechanism is triggered; after three consecutive 429 errors, the request is marked as failed. When running tests, adjust these parameters to the requirements of your business scenario (a simplified sketch of this behavior follows this list).
- When analyzing the test results, consider multiple metrics together, including request success rate, TTFT (Time to First Token), and tokens/s. Do not focus solely on a single indicator.
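Here is the simplified sketch of the timeout-and-backoff behavior described above (illustrative only; the actual logic lives in the test scripts in the repo):

```python
import time
import requests

REQUEST_TIMEOUT = 30   # seconds; a request exceeding this is marked as failed
MAX_RETRIES_429 = 3    # three consecutive 429s -> marked as failed

def call_with_backoff(scoring_uri, headers, payload):
    """POST once, retrying on HTTP 429 with exponential backoff."""
    backoff = 1.0
    for _ in range(MAX_RETRIES_429):
        try:
            resp = requests.post(scoring_uri, headers=headers, json=payload,
                                 timeout=REQUEST_TIMEOUT)
        except requests.Timeout:
            return None            # timed out -> failed
        if resp.status_code != 429:
            return resp            # success or a non-throttling error
        time.sleep(backoff)        # throttled -> back off and retry
        backoff *= 2
    return None                    # three consecutive 429s -> failed
```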
The primary goal of performance testing is to measure tokens/s and TTFT during inference. To better simulate real-world scenarios, I have set up several common LLM/SLM use cases in the test script. Additionally, to compute tokens/s accurately, the test script loads the corresponding model's tokenizer during execution (refer to the tokenizer names in the table above).
Before officially starting the test, you need to log in to Hugging Face from your terminal:
#huggingface-cli login
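For illustration, here is a minimal sketch of how tokens/s can be derived by timing one request and counting output tokens with the HF tokenizer. The response field name is an assumption; the real script parses the actual payload returned by each model:

```python
import time
import requests
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")  # requires HF login

def measure_tokens_per_second(scoring_uri: str, key: str, prompt: str) -> float:
    """Send one request and estimate output tokens/s with the HF tokenizer."""
    payload = {"input_data": {"input_string": [{"role": "user", "content": prompt}],
                              "parameters": {"max_new_tokens": 512}}}
    start = time.time()
    resp = requests.post(scoring_uri,
                         headers={"Authorization": f"Bearer {key}"},
                         json=payload, timeout=30)
    elapsed = time.time() - start
    resp.raise_for_status()
    output_text = resp.json()["output"]          # field name assumed
    return len(tokenizer.encode(output_text)) / elapsed
```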
Phi Text2Text Series (Phi-4/Phi-3-small-8k-instruct)
Run the test script:
(aml_env) root@pythonvm:~/AIFperformance# python press-phi4-0314.py
Please enter the API service URL: https://david-workspace-westeurop-ldvdq.westeurope.inference.ml.azure.com/score
Please enter the API Key: Ef9DFpATsXs4NiWyoVhEXeR4PWPvFy17xcws5ySCvV2H8uOUfgV4JQQJ99BCAAAAAAAAAAAAINFRAZML3eIO
Please enter the full name of the HuggingFace model for tokenizer loading: microsoft/phi-4
Tokenizer loaded successfully: microsoft/phi-4
Test result analysis:
microsoft/phi-4
Concurrency = 1
Scenario | VM 1 (1-nc48) TTFT (s) | VM 2 (2-nc24) TTFT (s) | VM 3 (1-nc24) TTFT (s) | VM 1 (1-nc48) tokens/s | VM 2 (2-nc24) tokens/s | VM 3 (1-nc24) tokens/s |
---|---|---|---|---|---|---|
Text Generation | 12.473 | 19.546 | 19.497 | 68.07 | 44.66 | 44.78 |
Question Answering | 11.914 | 15.552 | 15.943 | 72.10 | 44.56 | 46.04 |
Translation | 2.499 | 3.241 | 3.411 | 47.62 | 33.32 | 34.59 |
Text Summarization | 2.811 | 4.630 | 3.369 | 50.16 | 37.36 | 33.84 |
Code Generation | 20.441 | 27.685 | 26.504 | 83.12 | 51.58 | 52.26 |
Chatbot | 5.035 | 9.349 | 8.366 | 64.55 | 43.96 | 41.24 |
Sentiment Analysis | 1.009 | 1.235 | 1.241 | 5.95 | 12.96 | 12.89 |
Multi-turn Reasoning | 13.148 | 20.184 | 19.793 | 76.44 | 47.12 | 47.29 |
Concurrency = 2
Scenario | VM 1 (1-nc48) Total TTFT (s) | VM 2 (2-nc24) Total TTFT (s) | VM 3 (1-nc24) Total TTFT (s) | VM 1 (1-nc48) Total tokens/s | VM 2 (2-nc24) Total tokens/s | VM 3 (1-nc24) Total tokens/s |
---|---|---|---|---|---|---|
Text Generation | 19.291 | 19.978 | 24.576 | 110.94 | 90.13 | 79.26 |
Question Answering | 14.165 | 15.906 | 21.774 | 109.94 | 90.87 | 66.67 |
Translation | 3.341 | 4.513 | 10.924 | 76.45 | 53.95 | 68.54 |
Text Summarization | 3.494 | 3.664 | 6.317 | 77.38 | 69.60 | 59.45 |
Code Generation | 16.693 | 26.310 | 27.772 | 162.72 | 104.37 | 53.22 |
Chatbot | 8.688 | 9.537 | 12.064 | 100.09 | 87.67 | 67.23 |
Sentiment Analysis | 1.251 | 1.157 | 1.229 | 19.99 | 20.09 | 16.60 |
Multi-turn Reasoning | 20.233 | 23.655 | 22.880 | 110.84 | 94.47 | 88.79 |
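The concurrency = 2 numbers above come from firing requests in parallel. Conceptually, the script does something like the following thread-pool sketch (payload shape assumed as before; the real script also tracks TTFT and token counts per request):

```python
# Drive N concurrent requests against the /score endpoint (simplified sketch).
from concurrent.futures import ThreadPoolExecutor
import requests

def score_once(scoring_uri: str, key: str, prompt: str) -> str:
    payload = {"input_data": {"input_string": [{"role": "user", "content": prompt}],
                              "parameters": {"max_new_tokens": 256}}}
    resp = requests.post(scoring_uri,
                         headers={"Authorization": f"Bearer {key}"},
                         json=payload, timeout=30)
    resp.raise_for_status()
    return resp.text

prompts = ["Summarize the benefits of A100 GPUs.",
           "Write a haiku about cloud computing."]
with ThreadPoolExecutor(max_workers=2) as pool:   # concurrency = 2
    results = list(pool.map(
        lambda p: score_once("<scoring-uri>", "<key>", p), prompts))
```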
Performance test on Azure AI model inference
Azure AI model inference has default quotas. If you find the quota for a model insufficient, you can apply for an increase separately.
Limit name | Applies to | Limit value |
---|---|---|
Tokens per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
Requests per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
Tokens per minute | DeepSeek models | 5,000,000 |
Requests per minute | DeepSeek models | 5,000 |
Concurrent requests | DeepSeek models | 300 |
Tokens per minute | Rest of models | 200,000 |
Requests per minute | Rest of models | 1,000 |
Concurrent requests | Rest of models | 300 |
After you have deployed models on Azure AI model inference, you can check their invocation methods:
Prepare test env:
#conda create -n AImodelinference python=3.11 -y
#conda activate AImodelinference
#pip install azure-ai-inference
Run the test script; after you enter the following three variables, the stress test will begin:
#python callaiinference.py
Please enter the Azure AI key:
Please enter the Azure AI endpoint URL:
Please enter the deployment name:
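For reference, here is a minimal sketch of what a single streamed measurement looks like with the azure-ai-inference SDK, recording TTFT and a rough throughput figure (the stress-test script adds concurrency and aggregation on top of this):

```python
import time
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="<azure-ai-endpoint-url>",               # placeholder
    credential=AzureKeyCredential("<azure-ai-key>"),  # placeholder
)

start = time.time()
ttft = None
chunks = 0
for update in client.complete(
    messages=[UserMessage(content="Explain KV caching in one paragraph.")],
    model="<deployment-name>",                        # placeholder
    stream=True,
):
    if update.choices and update.choices[0].delta.content:
        if ttft is None:
            ttft = time.time() - start  # time to first token
        chunks += 1
elapsed = time.time() - start
print(f"TTFT: {ttft:.3f}s, streamed chunks/s: {chunks / elapsed:.2f}")
```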
Performance on DS 671B
I will use the test results of DeepSeek R1 on Azure AI model inference as an example:
Max performance:
- When the concurrency is 300 and the prompt length is 1024, TPS = 2110.77, TTFT = 2.201 s.
- When the concurrency is 300 and the prompt length is 2048, TPS = 1330.94, TTFT = 1.861 s.
Overall performance:
The overall throughput averages 735.12 tokens/s, with a P90 of 1184.06 tokens/s. The full test results are as follows:
Concurrency | Prompt Length | Total Requests | Success Count | Fail Count | Average latency (s) | Average TTFT (s) | Average token throughput (tokens/s) | Overall throughput (tokens/s) |
---|---|---|---|---|---|---|---|---|
300 | 1024 | 110 | 110 | 0 | 75.579 | 2.580 | 22.54 | 806.84 |
300 | 1024 | 110 | 110 | 0 | 71.378 | 71.378 | 24.53 | 1028.82 |
300 | 1024 | 110 | 110 | 0 | 76.622 | 2.507 | 23.24 | 979.97 |
300 | 1024 | 120 | 120 | 0 | 68.750 | 68.750 | 24.91 | 540.66 |
300 | 1024 | 120 | 120 | 0 | 72.164 | 2.389 | 22.71 | 1094.90 |
300 | 1024 | 130 | 130 | 0 | 72.245 | 72.245 | 23.68 | 1859.91 |
300 | 1024 | 130 | 130 | 0 | 82.714 | 2.003 | 20.18 | 552.08 |
300 | 1024 | 140 | 140 | 0 | 71.458 | 71.458 | 23.79 | 642.92 |
300 | 1024 | 140 | 140 | 0 | 71.565 | 2.400 | 22.93 | 488.49 |
300 | 1024 | 150 | 150 | 0 | 71.958 | 71.958 | 24.21 | 1269.10 |
300 | 1024 | 150 | 150 | 0 | 73.712 | 2.201 | 22.35 | 2110.77 |
300 | 2048 | 10 | 10 | 0 | 68.811 | 68.811 | 24.24 | 196.78 |
300 | 2048 | 10 | 10 | 0 | 70.189 | 1.021 | 23.18 | 172.92 |
300 | 2048 | 20 | 20 | 0 | 73.138 | 73.138 | 24.14 | 390.96 |
300 | 2048 | 20 | 20 | 0 | 69.649 | 1.150 | 24.22 | 351.31 |
300 | 2048 | 30 | 30 | 0 | 66.883 | 66.883 | 26.13 | 556.12 |
300 | 2048 | 30 | 30 | 0 | 68.918 | 1.660 | 23.46 | 571.63 |
300 | 2048 | 40 | 40 | 0 | 72.485 | 72.485 | 23.85 | 716.53 |
300 | 2048 | 40 | 40 | 0 | 65.228 | 1.484 | 24.87 | 625.16 |
300 | 2048 | 50 | 50 | 0 | 68.223 | 68.223 | 25.12 | 887.64 |
300 | 2048 | 50 | 50 | 0 | 66.288 | 1.815 | 24.38 | 976.17 |
300 | 2048 | 60 | 60 | 0 | 66.736 | 66.736 | 25.85 | 547.70 |
300 | 2048 | 60 | 60 | 0 | 69.355 | 2.261 | 23.94 | 615.81 |
300 | 2048 | 70 | 70 | 0 | 66.689 | 66.689 | 25.66 | 329.90 |
300 | 2048 | 70 | 70 | 0 | 67.061 | 2.128 | 23.89 | 1373.11 |
300 | 2048 | 80 | 80 | 0 | 68.091 | 68.091 | 25.68 | 1516.27 |
300 | 2048 | 80 | 80 | 0 | 67.413 | 1.861 | 24.01 | 1330.94 |
300 | 2048 | 90 | 90 | 0 | 66.603 | 66.603 | 25.51 | 418.81 |
300 | 2048 | 90 | 90 | 0 | 70.072 | 2.346 | 23.41 | 1047.53 |
300 | 2048 | 100 | 100 | 0 | 70.516 | 70.516 | 24.29 | 456.66 |
300 | 2048 | 100 | 100 | 0 | 86.862 | 2.802 | 20.03 | 899.38 |
300 | 2048 | 110 | 110 | 0 | 84.602 | 84.602 | 21.16 | 905.59 |
300 | 2048 | 110 | 110 | 0 | 77.883 | 2.179 | 21.17 | 803.93 |
300 | 2048 | 120 | 120 | 0 | 73.814 | 73.814 | 23.73 | 541.03 |
300 | 2048 | 120 | 120 | 0 | 86.787 | 4.413 | 20.32 | 650.57 |
300 | 2048 | 130 | 130 | 0 | 78.222 | 78.222 | 22.61 | 613.27 |
300 | 2048 | 130 | 130 | 0 | 83.670 | 2.131 | 20.16 | 1463.81 |
300 | 2048 | 140 | 140 | 0 | 77.429 | 77.429 | 22.74 | 1184.06 |
300 | 2048 | 140 | 140 | 0 | 77.234 | 3.891 | 21.90 | 821.34 |
300 | 2048 | 150 | 150 | 0 | 72.753 | 72.753 | 23.69 | 698.50 |
300 | 2048 | 150 | 150 | 0 | 73.674 | 2.425 | 22.74 | 1012.25 |
300 | 4096 | 10 | 10 | 0 | 83.003 | 83.003 | 25.52 | 221.28 |
300 | 4096 | 10 | 10 | 0 | 89.713 | 1.084 | 24.70 | 189.29 |
300 | 4096 | 20 | 20 | 0 | 82.342 | 82.342 | 26.65 | 337.85 |
300 | 4096 | 20 | 20 | 0 | 84.526 | 1.450 | 24.81 | 376.17 |
300 | 4096 | 30 | 30 | 0 | 87.979 | 87.979 | 24.46 | 322.62 |
300 | 4096 | 30 | 30 | 0 | 84.767 | 1.595 | 24.28 | 503.01 |
300 | 4096 | 40 | 40 | 0 | 85.231 | 85.231 | 26.03 | 733.50 |
300 | 4096 | 40 | 40 | 0 | 81.514 | 1.740 | 24.17 | 710.79 |
300 | 4096 | 50 | 50 | 0 | 91.253 | 91.253 | 24.53 | 279.55 |