Azure AI Foundry Blog

Fast deploy and evaluate AI model performance on AML/AI Foundry

xinyuwei (Microsoft)
Apr 02, 2025

The source code for this article:

https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git

Please refer to my repo for more AI resources; you are welcome to star it:

https://github.com/xinyuwei-david/david-share.git 

 

Note:

This repository is designed to test the performance of open-source models from the Azure Machine Learning Model Catalog on Managed Compute. I tested nearly 20 AI models in my repository. Due to space limitations, this article only shows the testing of two models, to help readers understand how to use my scripts. For more details, please refer to https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git

Model deployment methods

https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/deployments-overview

| Name | Azure OpenAI service | Azure AI model inference | Serverless API | Managed compute |
| Which models can be deployed? | Azure OpenAI models | Azure OpenAI models and Models as a Service | Models as a Service | Open and custom models |
| Deployment resource | Azure OpenAI resource | Azure AI services resource | AI project resource | AI project resource |
| Best suited when | You are planning to use only OpenAI models | You are planning to take advantage of the flagship models in Azure AI catalog, including OpenAI. | You are planning to use a single model from a specific provider (excluding OpenAI). | If you plan to use open models and you have enough compute quota available in your subscription. |
| Billing bases | Token usage & PTU | Token usage | Token usage | Compute core hours |
| Deployment instructions | Deploy to Azure OpenAI Service | Deploy to Azure AI model inference | Deploy to Serverless API | Deploy to Managed compute |

Currently, an increasing number of new flagship models in the Azure AI Foundry model catalog, including OpenAI models, are deployed using the Azure AI model inference method. Models deployed in this way can be accessed via the AI Inference SDK (which now supports stream mode: https://learn.microsoft.com/en-us/python/api/overview/azure/ai-inference-readme?view=azure-python-preview). Open-source models include DeepSeek R1, V3, Phi, Mistral, and more. For a detailed list of models, please refer to:

https://learn.microsoft.com/en-us/azure/ai-foundry/model-inference/concepts/models

If you care about the performance data of this method, please skip to the last section of this article.

If you want the deployed model to have dedicated performance and lower latency, you can use the Managed Compute mode.

Performance test of AI models in Azure Machine Learning

In this section, we focus on the models deployed on Managed Compute from the Model Catalog in AML.

 

Next, we will use a Python script to automate the deployment of the model and use another program to evaluate the model's performance.

Fast Deploy AI Model on AML Model Catalog via Azure GPU VM

The model names on AML tested in this repo, their full names on Hugging Face, and the Azure GPU VM SKUs they can be deployed on in AML are as follows.

| Model Name on AML | Model on HF (tokenizers name) | Azure GPU VM SKU Support in AML |
| Phi-4 | microsoft/phi-4 | NC24/48/96 A100 |
| Phi-3.5-vision-instruct | microsoft/Phi-3.5-vision-instruct | NC24/48/96 A100 |
| financial-reports-analysis |  | NC24/48/96 A100 |
| Llama-3.2-11B-Vision-Instruct | meta-llama/Llama-3.2-11B-Vision-Instruct | NC24/48/96 A100 |
| Phi-3-small-8k-instruct | microsoft/Phi-3-small-8k-instruct | NC24/48/96 A100 |
| Phi-3-vision-128k-instruct | microsoft/Phi-3-vision-128k-instruct | NC48 A100 or NC96 A100 |
| microsoft-swinv2-base-patch4-window12-192-22k | microsoft/swinv2-base-patch4-window12-192-22k | NC24/48/96 A100 |
| mistralai-Mixtral-8x7B-Instruct-v01 | mistralai/Mixtral-8x7B-Instruct-v0.1 | NC96 A100 |
| Muse | microsoft/wham | NC24/48/96 A100 |
| openai-whisper-large | openai/whisper-large | NC48 A100 or NC96 A100 |
| snowflake-arctic-base | Snowflake/snowflake-arctic-base | ND H100 v5 |
| Nemotron-3-8B-Chat-4k-SteerLM | nvidia/nemotron-3-8b-chat-4k-steerlm | NC24/48/96 A100 |
| stabilityai-stable-diffusion-xl-refiner-1-0 | stabilityai/stable-diffusion-xl-refiner-1.0 | Standard_ND96amsr_A100_v4 or Standard_ND96asr_v4 |
| microsoft-Orca-2-7b | microsoft/Orca-2-7b | NC24/48/96 A100 |

This repository primarily focuses on the inference performance of the aforementioned models on 1 x NC24 A100, 2 x NC24 A100, 1 x NC48 A100, 1 x NC40 H100, and 1 x NC80 H100. However, these models currently do not support deployment on H100. Therefore, as of March 2025, all validations are conducted on A100 GPU VMs.

Clone code and prepare shell environment

First, you need to create an Azure Machine Learning service in the Azure Portal. When selecting the region for the service, you should choose a region under the AML category in your subscription quota that has a GPU VM quota available.

Next, find a shell environment where you can execute az login to log in to your Azure subscription.

#git clone https://github.com/xinyuwei-david/AI-Foundry-Model-Performance.git
#conda create -n aml_env python=3.9 -y
#conda activate aml_env
#cd AI-Foundry-Model-Performance
#pip install -r requirements.txt  

Log in to Azure.

#curl -sL https://aka.ms/InstallAzureCLIDeb | sudo bash  
#az login --use-device-code

Deploy the model automatically

Next, you need to execute a script for end-to-end model deployment. This script will:

  • Help you check the GPU VM quota for AML under your subscription
  • Prompt you to select the model you want to deploy
  • Prompt you to specify the Azure GPU VM SKU and the number of instances to use for the deployment
  • Provide you with the endpoint and key of the successfully deployed model, allowing you to proceed with performance testing

Before running the script, you need to check the table above to confirm the types of Azure GPU VMs supported by the AI model you plan to deploy.
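
For readers who want to see what the script automates, here is a minimal sketch using the Azure ML Python SDK, assuming the azure-ai-ml package is available in your environment. The endpoint name, model ID, SKU, and workspace values below are illustrative placeholders, not the exact values the script uses.

from azure.ai.ml import MLClient
from azure.ai.ml.entities import ManagedOnlineEndpoint, ManagedOnlineDeployment
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# Model from the AzureML registry, same ID format the script logs (e.g. Phi-4 version 7)
model_id = "azureml://registries/AzureML/models/Phi-4/versions/7"

# Create a managed online endpoint with key-based auth
endpoint = ManagedOnlineEndpoint(name="custom-endpoint-demo", auth_mode="key")
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Deploy the model on the chosen GPU VM SKU and instance count
deployment = ManagedOnlineDeployment(
    name="default",
    endpoint_name=endpoint.name,
    model=model_id,
    instance_type="Standard_NC24ads_A100_v4",
    instance_count=1,
)
ml_client.online_deployments.begin_create_or_update(deployment).result()

# Route all traffic to the new deployment
endpoint.traffic = {"default": 100}
ml_client.online_endpoints.begin_create_or_update(endpoint).result()

# Print the scoring URI and key, which the script reports as SCORING_URI / PRIMARY_KEY
print(ml_client.online_endpoints.get(endpoint.name).scoring_uri)
print(ml_client.online_endpoints.get_keys(endpoint.name).primary_key)

The deployment script wraps calls like these together with the quota check and the interactive prompts shown below.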

#python deploymodels-linux.py

If you run the test in PowerShell, you should use:

#python deploymodels-powershell.py

The deployment process:

========== Enter Basic Information ==========
Subscription ID: 53039473-9bbd-499d-90d7-d046d4fa63b6
Resource Group: AIrg1
Workspace Name: aml-david-1

========== Model Name Examples ==========
 - Phi-4
 - Phi-3.5-vision-instruct
 - financial-reports-analysis
 - databricks-dbrx-instruct
 - Llama-3.2-11B-Vision-Instruct
 - Phi-3-small-8k-instruct
 - Phi-3-vision-128k-instruct
 - microsoft-swinv2-base-patch4-window12-192-22k
 - mistralai-Mixtral-8x7B-Instruct-v01
 - Muse
 - openai-whisper-large
 - snowflake-arctic-base
 - Nemotron-3-8B-Chat-4k-SteerLM
 - stabilityai-stable-diffusion-xl-refiner-1-0
 - microsoft-Orca-2-7b
==========================================

Enter the model name to search (e.g., 'Phi-4'): Phi-4

========== Matching Models ==========
Name                       Description    Latest version
-------------------------  -------------  ----------------
Phi-4-multimodal-instruct                 1
Phi-4-mini-instruct                       1
Phi-4                                     7

Note: The above table is for reference only. Enter the exact model name below:
Enter full model name (case-sensitive): Phi-4
Enter model version (e.g., 7): 7
2025-03-13 15:42:02,438 - INFO - User-specified model: name='Phi-4', version='7'

========== GPU Quota (Limit > 1) ==========
Region,ResourceName,LocalizedValue,Usage,Limit
westeurope,standardNCADSH100v5Family,,0,100
polandcentral,standardNCADSA100v4Family,,0,100

========== A100 / H100 SKU Information ==========
SKU Name                            GPU Count  GPU Memory (VRAM)    CPU Cores
----------------------------------- ---------- -------------------- ----------
Standard_NC24ads_A100_v4            1          80 GB                24
Standard_NC48ads_A100_v4            2          160 GB (2x80 GB)     48
Standard_NC96ads_A100_v4            4          320 GB (4x80 GB)     96
Standard_NC40ads_H100_v5            1          80 GB                40
Standard_NC80ads_H100_v5            2          160 GB (2x80 GB)     80

Available SKUs:
 - Standard_NC24ads_A100_v4
 - Standard_NC48ads_A100_v4
 - Standard_NC96ads_A100_v4
 - Standard_NC40ads_H100_v5
 - Standard_NC80ads_H100_v5

Enter the SKU to use: Standard_NC24ads_A100_v4
Enter the number of instances (integer): 1
2025-03-13 15:52:42,333 - INFO - Model ID: azureml://registries/AzureML/models/Phi-4/versions/7
2025-03-13 15:52:42,333 - INFO - No environment configuration found.
2025-03-13 15:52:42,366 - INFO - ManagedIdentityCredential will use IMDS
2025-03-13 15:52:42,379 - INFO - Creating Endpoint: custom-endpoint-1741852362
2025-03-13 15:52:43,008 - INFO - Request URL: 'http://169.254.169.254/metadata/identity/oauth2/token?api-version=REDACTED&resource=REDACTED'

After 3-5 minutes, you will get the final results:

----- Deployment Information -----
ENDPOINT_NAME=custom-endpoint-1741863106
SCORING_URI=https://custom-endpoint-1741863106.polandcentral.inference.ml.azure.com/score
PRIMARY_KEY=DRxHMd1jbbSdNoXiYOaWRQ66erYZfejzKhdyDVRuh58v2hXILOcYJQQJ99BCAAAAAAAAAAAAINFRAZML3m1v
SECONDARY_KEY=4dhy3og6WfVzkIijMU7FFUDLpz4WIWEYgIlXMGYUzgwafsW6GPrMJQQJ99BCAAAAAAAAAAAAINFRAZMLxOpO
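
With the SCORING_URI and PRIMARY_KEY above you can send a quick smoke-test request before running the full benchmark. The sketch below assumes a typical chat-style payload for these catalog models; the exact request schema varies per model, so check the endpoint's Consume/Test tab if yours differs.

import requests

scoring_uri = "https://custom-endpoint-1741863106.polandcentral.inference.ml.azure.com/score"
api_key = "<PRIMARY_KEY>"

# Assumed payload shape for a chat-completion style catalog model (adjust per model)
payload = {
    "input_data": {
        "input_string": [
            {"role": "user", "content": "Give me a one-sentence summary of Azure Machine Learning."}
        ],
        "parameters": {"max_new_tokens": 128, "temperature": 0.7},
    }
}

resp = requests.post(
    scoring_uri,
    json=payload,
    headers={"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"},
    timeout=60,
)
print(resp.status_code)
print(resp.json())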

Fast delete endpoint

We know that GPU VMs are relatively expensive. Therefore, after completing performance testing, you should use the script below to delete the endpoint and avoid incurring unnecessary costs.

#python deplete-endpoint-20250327.py

Delete process:

Please enter your Azure Subscription ID: aaaaaaaaaaaaaaaa
Please enter your Azure Resource Group name: A100VM_group
Please enter your Azure ML Workspace name: aml-westus

Retrieving the list of online Endpoints in the Workspace...

List of online Endpoints:
1. aml-westus-takfp
2. aml-westus-aflqs

Enter the numbers of the Endpoints you want to delete (e.g., 1, 3, 4). Press Enter to skip: 1, 2

Deleting Endpoint: aml-westus-takfp...
...Endpoint aml-westus-takfp deleted successfully.

Deleting Endpoint: aml-westus-aflqs...
...Endpoint aml-westus-aflqs deleted successfully.

The deletion process for all specified Endpoints has been completed. Exiting the script.
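
For reference, the core of such a cleanup script is a list-then-delete loop with the Azure ML Python SDK. The sketch below is illustrative only; the workspace values and the endpoint name are placeholders, and the real script prompts for them interactively.

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace-name>",
)

# List all online endpoints in the workspace
for ep in ml_client.online_endpoints.list():
    print("Found endpoint:", ep.name)

# Delete a selected endpoint; this also removes its deployments and releases the GPU VM
ml_client.online_endpoints.begin_delete(name="aml-westus-takfp").result()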

Fast Performance Test AI Model on AML Model Catalog

Note:

  • The test results in this section are for reference only. You need to use my script to conduct tests in your actual environment.
  • In my performance testing script, timeout and retry mechanisms are configured. Specifically, if a task fails to complete within the timeout period (default is 30 seconds), it will be marked as failed. Additionally, if a request encounters a 429 error during execution, it will trigger a backoff mechanism; if the 429 error occurs three consecutive times, the request will be marked as failed. When performing tests, you should adjust these parameters according to the requirements of your business scenario.
  • When analyzing the test results, you need to consider multiple metrics, including request success rate, TTFT (Time to First Token), and tokens/s. You should not focus solely on a single indicator.

The primary goal of performance testing is to verify tokens/s and TTFT during the inference process. To better simulate real-world scenarios, I have set up several common LLM/SLM use cases in the test script. Additionally, to ensure accurate tokens/s measurement, the test script needs to load the corresponding model's tokenizer during execution (refer to the tokenizer names in the table above).
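
To make the timeout, 429 backoff, and tokens/s logic described above concrete, here is a minimal sketch of a single measured request. It assumes the endpoint returns the generated text in a JSON field named "output" (a placeholder; the real field depends on the model's response schema). The actual press-phi4-0314.py script adds concurrency and the per-scenario prompts on top of logic like this.

import time
import requests
from transformers import AutoTokenizer

TIMEOUT_S = 30        # requests not finishing within this window are marked as failed
MAX_429_RETRIES = 3   # three consecutive 429 responses mark the request as failed

tokenizer = AutoTokenizer.from_pretrained("microsoft/phi-4")  # tokenizer name from the table above

def measure_once(scoring_uri, api_key, payload):
    headers = {"Authorization": f"Bearer {api_key}", "Content-Type": "application/json"}
    backoff = 1.0
    for _ in range(MAX_429_RETRIES):
        start = time.time()
        resp = requests.post(scoring_uri, json=payload, headers=headers, timeout=TIMEOUT_S)
        if resp.status_code == 429:      # throttled: wait and retry with exponential backoff
            time.sleep(backoff)
            backoff *= 2
            continue
        resp.raise_for_status()
        latency = time.time() - start    # without streaming, TTFT is effectively the full latency
        completion = resp.json()["output"]           # hypothetical field name, adjust per model
        tokens = len(tokenizer.encode(completion))
        return {"ttft_s": latency, "tokens_per_s": tokens / latency}
    return None  # marked as failed after three consecutive 429s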

Before officially starting the test, you need to log in to HF on your terminal.

#huggingface-cli login

Phi Text2Text Series (Phi-4/Phi-3-small-8k-instruct)

Run the test script:

(aml_env) root@pythonvm:~/AIFperformance# python press-phi4-0314.py
Please enter the API service URL: https://david-workspace-westeurop-ldvdq.westeurope.inference.ml.azure.com/score
Please enter the API Key: Ef9DFpATsXs4NiWyoVhEXeR4PWPvFy17xcws5ySCvV2H8uOUfgV4JQQJ99BCAAAAAAAAAAAAINFRAZML3eIO
Please enter the full name of the HuggingFace model for tokenizer loading: microsoft/phi-4
Tokenizer loaded successfully: microsoft/phi-4

Test result analysis:

microsoft/phi-4

Concurrency = 1

| Scenario | VM 1 (1-nc48) TTFT (s) | VM 2 (2-nc24) TTFT (s) | VM 3 (1-nc24) TTFT (s) | VM 1 (1-nc48) tokens/s | VM 2 (2-nc24) tokens/s | VM 3 (1-nc24) tokens/s |
| Text Generation | 12.473 | 19.546 | 19.497 | 68.07 | 44.66 | 44.78 |
| Question Answering | 11.914 | 15.552 | 15.943 | 72.10 | 44.56 | 46.04 |
| Translation | 2.499 | 3.241 | 3.411 | 47.62 | 33.32 | 34.59 |
| Text Summarization | 2.811 | 4.630 | 3.369 | 50.16 | 37.36 | 33.84 |
| Code Generation | 20.441 | 27.685 | 26.504 | 83.12 | 51.58 | 52.26 |
| Chatbot | 5.035 | 9.349 | 8.366 | 64.55 | 43.96 | 41.24 |
| Sentiment Analysis | 1.009 | 1.235 | 1.241 | 5.95 | 12.96 | 12.89 |
| Multi-turn Reasoning | 13.148 | 20.184 | 19.793 | 76.44 | 47.12 | 47.29 |

Concurrency = 2

| Scenario | VM 1 (1-nc48) Total TTFT (s) | VM 2 (2-nc24) Total TTFT (s) | VM 3 (1-nc24) Total TTFT (s) | VM 1 (1-nc48) Total tokens/s | VM 2 (2-nc24) Total tokens/s | VM 3 (1-nc24) Total tokens/s |
| Text Generation | 19.291 | 19.978 | 24.576 | 110.94 | 90.13 | 79.26 |
| Question Answering | 14.165 | 15.906 | 21.774 | 109.94 | 90.87 | 66.67 |
| Translation | 3.341 | 4.513 | 10.924 | 76.45 | 53.95 | 68.54 |
| Text Summarization | 3.494 | 3.664 | 6.317 | 77.38 | 69.60 | 59.45 |
| Code Generation | 16.693 | 26.310 | 27.772 | 162.72 | 104.37 | 53.22 |
| Chatbot | 8.688 | 9.537 | 12.064 | 100.09 | 87.67 | 67.23 |
| Sentiment Analysis | 1.251 | 1.157 | 1.229 | 19.99 | 20.09 | 16.60 |
| Multi-turn Reasoning | 20.233 | 23.655 | 22.880 | 110.84 | 94.47 | 88.79 |

Performance test on Azure AI model inference

Azure AI model inference has a default quota. If you feel that the quota for the model is insufficient, you can apply for an increase separately.

| Limit name | Applies to | Limit value |
| Tokens per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
| Requests per minute | Azure OpenAI models | Varies per model and SKU. See limits for Azure OpenAI. |
| Tokens per minute | DeepSeek models | 5,000,000 |
| Requests per minute | DeepSeek models | 5,000 |
| Concurrent requests | DeepSeek models | 300 |
| Tokens per minute | Rest of models | 200,000 |
| Requests per minute | Rest of models | 1,000 |
| Concurrent requests | Rest of models | 300 |

After you have deployed models on Azure AI model inference, you can check how to invoke them.

Prepare test env:

#conda create -n AImodelinference python=3.11 -y
#conda activate AImodelinference
#pip install azure-ai-inference

Run the test script; after you enter the following three variables, the stress test will begin:

#python callaiinference.py
Please enter the Azure AI key: 
Please enter the Azure AI endpoint URL:
Please enter the deployment name:
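
For context, a single call with the azure-ai-inference SDK looks roughly like the sketch below. The endpoint URL, key, and deployment name are placeholders corresponding to the three prompts above, and callaiinference.py layers concurrency and timing on top of calls like this. Streaming is enabled so the time to first token can be observed directly.

from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",
    credential=AzureKeyCredential("<azure-ai-key>"),
)

# Stream the response so the time to first token can be measured
response = client.complete(
    model="DeepSeek-R1",   # the deployment name you entered above
    messages=[UserMessage(content="Explain the difference between TTFT and tokens/s.")],
    stream=True,
)
for update in response:
    if update.choices and update.choices[0].delta.content:
        print(update.choices[0].delta.content, end="", flush=True)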

Performance on DeepSeek R1 671B

I will use the test results of DeepSeek R1 on Azure AI model inference as an example:

Max performance:

  • When the concurrency is 300 and the prompt length is 1024, TPS = 2110.77, TTFT = 2.201 s.
  • When the concurrency is 300 and the prompt length is 2048, TPS = 1330.94, TTFT = 1.861 s.

Overall performance:

The overall throughput averages 735.12 tokens/s, with a P90 of 1184.06 tokens/s. The full test results are as follows:

| Concurrency | Prompt Length | Total Requests | Success Count | Fail Count | Average latency (s) | Average TTFT (s) | Average token throughput (tokens/s) | Overall throughput (tokens/s) |
| 300 | 1024 | 110 | 110 | 0 | 75.579 | 2.580 | 22.54 | 806.84 |
| 300 | 1024 | 110 | 110 | 0 | 71.378 | 71.378 | 24.53 | 1028.82 |
| 300 | 1024 | 110 | 110 | 0 | 76.622 | 2.507 | 23.24 | 979.97 |
| 300 | 1024 | 120 | 120 | 0 | 68.750 | 68.750 | 24.91 | 540.66 |
| 300 | 1024 | 120 | 120 | 0 | 72.164 | 2.389 | 22.71 | 1094.90 |
| 300 | 1024 | 130 | 130 | 0 | 72.245 | 72.245 | 23.68 | 1859.91 |
| 300 | 1024 | 130 | 130 | 0 | 82.714 | 2.003 | 20.18 | 552.08 |
| 300 | 1024 | 140 | 140 | 0 | 71.458 | 71.458 | 23.79 | 642.92 |
| 300 | 1024 | 140 | 140 | 0 | 71.565 | 2.400 | 22.93 | 488.49 |
| 300 | 1024 | 150 | 150 | 0 | 71.958 | 71.958 | 24.21 | 1269.10 |
| 300 | 1024 | 150 | 150 | 0 | 73.712 | 2.201 | 22.35 | 2110.77 |
| 300 | 2048 | 10 | 10 | 0 | 68.811 | 68.811 | 24.24 | 196.78 |
| 300 | 2048 | 10 | 10 | 0 | 70.189 | 1.021 | 23.18 | 172.92 |
| 300 | 2048 | 20 | 20 | 0 | 73.138 | 73.138 | 24.14 | 390.96 |
| 300 | 2048 | 20 | 20 | 0 | 69.649 | 1.150 | 24.22 | 351.31 |
| 300 | 2048 | 30 | 30 | 0 | 66.883 | 66.883 | 26.13 | 556.12 |
| 300 | 2048 | 30 | 30 | 0 | 68.918 | 1.660 | 23.46 | 571.63 |
| 300 | 2048 | 40 | 40 | 0 | 72.485 | 72.485 | 23.85 | 716.53 |
| 300 | 2048 | 40 | 40 | 0 | 65.228 | 1.484 | 24.87 | 625.16 |
| 300 | 2048 | 50 | 50 | 0 | 68.223 | 68.223 | 25.12 | 887.64 |
| 300 | 2048 | 50 | 50 | 0 | 66.288 | 1.815 | 24.38 | 976.17 |
| 300 | 2048 | 60 | 60 | 0 | 66.736 | 66.736 | 25.85 | 547.70 |
| 300 | 2048 | 60 | 60 | 0 | 69.355 | 2.261 | 23.94 | 615.81 |
| 300 | 2048 | 70 | 70 | 0 | 66.689 | 66.689 | 25.66 | 329.90 |
| 300 | 2048 | 70 | 70 | 0 | 67.061 | 2.128 | 23.89 | 1373.11 |
| 300 | 2048 | 80 | 80 | 0 | 68.091 | 68.091 | 25.68 | 1516.27 |
| 300 | 2048 | 80 | 80 | 0 | 67.413 | 1.861 | 24.01 | 1330.94 |
| 300 | 2048 | 90 | 90 | 0 | 66.603 | 66.603 | 25.51 | 418.81 |
| 300 | 2048 | 90 | 90 | 0 | 70.072 | 2.346 | 23.41 | 1047.53 |
| 300 | 2048 | 100 | 100 | 0 | 70.516 | 70.516 | 24.29 | 456.66 |
| 300 | 2048 | 100 | 100 | 0 | 86.862 | 2.802 | 20.03 | 899.38 |
| 300 | 2048 | 110 | 110 | 0 | 84.602 | 84.602 | 21.16 | 905.59 |
| 300 | 2048 | 110 | 110 | 0 | 77.883 | 2.179 | 21.17 | 803.93 |
| 300 | 2048 | 120 | 120 | 0 | 73.814 | 73.814 | 23.73 | 541.03 |
| 300 | 2048 | 120 | 120 | 0 | 86.787 | 4.413 | 20.32 | 650.57 |
| 300 | 2048 | 130 | 130 | 0 | 78.222 | 78.222 | 22.61 | 613.27 |
| 300 | 2048 | 130 | 130 | 0 | 83.670 | 2.131 | 20.16 | 1463.81 |
| 300 | 2048 | 140 | 140 | 0 | 77.429 | 77.429 | 22.74 | 1184.06 |
| 300 | 2048 | 140 | 140 | 0 | 77.234 | 3.891 | 21.90 | 821.34 |
| 300 | 2048 | 150 | 150 | 0 | 72.753 | 72.753 | 23.69 | 698.50 |
| 300 | 2048 | 150 | 150 | 0 | 73.674 | 2.425 | 22.74 | 1012.25 |
| 300 | 4096 | 10 | 10 | 0 | 83.003 | 83.003 | 25.52 | 221.28 |
| 300 | 4096 | 10 | 10 | 0 | 89.713 | 1.084 | 24.70 | 189.29 |
| 300 | 4096 | 20 | 20 | 0 | 82.342 | 82.342 | 26.65 | 337.85 |
| 300 | 4096 | 20 | 20 | 0 | 84.526 | 1.450 | 24.81 | 376.17 |
| 300 | 4096 | 30 | 30 | 0 | 87.979 | 87.979 | 24.46 | 322.62 |
| 300 | 4096 | 30 | 30 | 0 | 84.767 | 1.595 | 24.28 | 503.01 |
| 300 | 4096 | 40 | 40 | 0 | 85.231 | 85.231 | 26.03 | 733.50 |
| 300 | 4096 | 40 | 40 | 0 | 81.514 | 1.740 | 24.17 | 710.79 |
| 300 | 4096 | 50 | 50 | 0 | 91.253 | 91.253 | 24.53 | 279.55 |
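
Summary statistics such as the average and P90 quoted above can be reproduced from the "Overall throughput" column of this table, for example (only the first few rows are listed here as placeholders):

import numpy as np

# Overall throughput (tokens/s) per run, taken from the table above (first rows shown only)
overall_tps = np.array([806.84, 1028.82, 979.97, 540.66, 1094.90])  # ...extend with all rows

print("mean tokens/s:", round(float(overall_tps.mean()), 2))
print("P90 tokens/s:", round(float(np.percentile(overall_tps, 90)), 2))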