Azure AI Foundry Blog
5 MIN READ

Fast Stress Test of DeepSeek 671B on Azure AMD MI300X

xinyuwei, Microsoft
Apr 09, 2025

 

Azure GPU VM Environment Preparation
Quickly create a Spot VM with password-based authentication:

az vm create --name <VMNAME> --resource-group <RESOURCE_GROUP_NAME> --location <REGION>  --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --priority Spot --max-price -1 --eviction-policy Deallocate --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --authentication-type password --admin-password <YOUR_PASSWORD>

The CLI command I used to create the VM:

xinyu [ ~ ]$ az vm create --name mi300x-xinyu --resource-group amdrg --location westus --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --priority Spot --max-price -1 --eviction-policy Deallocate --os-disk-size-gb 512 --os-disk-delete-option Delete --admin-username azureadmin --authentication-type password --admin-password azureadmin@123

VM deployment output:

Argument '--max-price' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Consider upgrading security for your workloads using Azure Trusted Launch VMs. To know more about Trusted Launch, please visit https://aka.ms/TrustedLaunch.
{
  "fqdns": "",
  "id": "/subscriptions/***/resourceGroups/amdrg/providers/Microsoft.Compute/virtualMachines/mi300x-xinyu",
  "location": "westus",
  "macAddress": "60-45-BD-01-4B-AF",
  "powerState": "VM running",
  "privateIpAddress": "10.0.0.4",
  "publicIpAddress": "13.64.8.207",
  "resourceGroup": "amdrg",
  "zones": ""
}

After the system deployment succeeds, open port 22 on the VM's NSG.
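Opening the port can also be done from the CLI; a sketch, assuming the default NSG name that `az vm create` typically generates (`<VMNAME>NSG`):

```shell
# Allow inbound SSH (port 22) on the VM's NSG.
# NSG name is an assumption: `az vm create` typically names it <VMNAME>NSG.
az network nsg rule create \
  --resource-group amdrg \
  --nsg-name mi300x-xinyuNSG \
  --name Allow-SSH \
  --priority 1000 \
  --direction Inbound \
  --access Allow \
  --protocol Tcp \
  --destination-port-ranges 22
```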

Then SSH into the VM and perform the following environment configurations:

sudo mkdir -p /mnt/resource_nvme/
sudo mdadm --create /dev/md128 -f --run --level 0 --raid-devices 8 $(ls /dev/nvme*n1)
sudo mkfs.xfs -f /dev/md128
sudo mount /dev/md128 /mnt/resource_nvme
sudo chmod 1777 /mnt/resource_nvme

During testing, we use the local NVMe temporary disk as the Docker runtime environment. Please note that any data stored on the temporary disk will be lost upon VM reboot. This approach is acceptable for quick and cost-effective testing purposes; however, in a production environment, a persistent file system should be used instead.
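Before pointing anything at the array, it is worth confirming that the RAID0 device assembled and mounted as expected; a quick check:

```shell
# Confirm the RAID0 array is assembled and mounted
cat /proc/mdstat                  # should list md128 as an active raid0 device
df -h /mnt/resource_nvme          # should show the combined NVMe capacity
sudo mdadm --detail /dev/md128    # per-device state of the 8 NVMe members
```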

Next, create a Hugging Face cache directory on the RAID0 volume and point HF_HOME at it:

mkdir -p /mnt/resource_nvme/hf_cache
export HF_HOME=/mnt/resource_nvme/hf_cache
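The `export` above only lasts for the current shell; if you reconnect before the model download finishes, the cache location is lost. One way to persist it (a sketch, appending to `~/.bashrc`):

```shell
# Persist HF_HOME across SSH sessions
echo 'export HF_HOME=/mnt/resource_nvme/hf_cache' >> ~/.bashrc
source ~/.bashrc
```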

Then create a Docker data directory on the RAID0 volume and configure Docker to use it as its data root:

mkdir -p /mnt/resource_nvme/docker 
sudo tee /etc/docker/daemon.json > /dev/null <<EOF 
{ 
    "data-root": "/mnt/resource_nvme/docker" 
} 
EOF 
sudo chmod 0644 /etc/docker/daemon.json 
sudo systemctl restart docker 
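You can confirm Docker picked up the new data root before pulling the (large) image:

```shell
# Verify Docker now writes image layers to the NVMe RAID0 volume
docker info --format '{{ .DockerRootDir }}'   # expected: /mnt/resource_nvme/docker
```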

Pull the Docker image:

docker pull rocm/sgl-dev:upstream_20250312_v1

Start the container; DeepSeek-R1 671B takes approximately 5 minutes to load.

docker run \
  --device=/dev/kfd \
  --device=/dev/dri \
  --security-opt seccomp=unconfined \
  --cap-add=SYS_PTRACE \
  --group-add video \
  --privileged \
  --shm-size 128g \
  --ipc=host \
  -p 30000:30000 \
  -v /mnt/resource_nvme:/mnt/resource_nvme \
  -e HF_HOME=/mnt/resource_nvme/hf_cache \
  -e HSA_NO_SCRATCH_RECLAIM=1 \
  -e GPU_FORCE_BLIT_COPY_SIZE=64 \
  -e DEBUG_HIP_BLOCK_SYN=1024 \
  rocm/sgl-dev:upstream_20250312_v1 \
  python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --chunked-prefill-size 131072 --enable-torch-compile --torch-compile-max-bs 256 --host 0.0.0.0

Once you see output similar to the following, it indicates that the container has successfully started:

[2025-04-01 03:42:11 DP7 TP7] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-01 03:42:15] INFO:     127.0.0.1:37762 - "POST /generate HTTP/1.1" 200 OK
[2025-04-01 03:42:15] The server is fired up and ready to roll!
[2025-04-01 04:00:11] INFO:     172.17.0.1:55994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-04-01 04:00:11 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0, 
[2025-04-01 04:00:43] INFO:     172.17.0.1:41068 - "POST /v1/chat/completions HTTP/1.1" 200 OK

Ensure that the DS 671B container can be accessed locally:

curl http://localhost:30000/get_model_info 
{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true} 
curl http://localhost:30000/generate -H "Content-Type: application/json" -d '{ "text": "Once upon a time,", "sampling_params": { "max_new_tokens": 16, "temperature": 0.6 } }'

Next, open port 30000 in the Azure NSG to enable remote access for testing.
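With port 30000 open, the same OpenAI-compatible endpoint seen in the server logs can be exercised remotely; a sketch using the VM's public DNS name:

```shell
# Remote smoke test against the OpenAI-compatible chat endpoint
curl http://mi300x-xinyu.westus.cloudapp.azure.com:30000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "model": "deepseek-ai/DeepSeek-R1",
        "messages": [{"role": "user", "content": "Say hello in one sentence."}],
        "max_tokens": 32,
        "temperature": 0.6
      }'
```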

Log in to the Linux stress test client, and run the following CLI commands to install the evalscope stress testing tool:

pip install "evalscope[perf]" -U
pip install gradio

Then, perform stress testing using evalscope. This tool allows you to specify concurrency levels, total request count, number of input and output tokens, as well as test datasets.

If your goal is to maximize overall throughput, increase concurrency and decrease the number of input tokens per request; for example, 100 concurrent requests, each with 100 input tokens. In this scenario, monitor total throughput in tokens per second (tokens/s).

If you want to measure single-request performance, reduce the concurrency and increase the number of input tokens. In this case, pay attention to Time-To-First-Token (TTFT) and tokens/s for an individual request.
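As a sketch of the throughput-oriented scenario described above (the concurrency, request count, and token limits are illustrative values, not measured settings):

```shell
# Throughput-oriented run: high concurrency, short prompts, watch total tokens/s
evalscope perf \
  --url http://mi300x-xinyu.westus.cloudapp.azure.com:30000/v1/chat/completions \
  --model "deepseek-ai/DeepSeek-R1" \
  --parallel 100 \
  --number 400 \
  --api openai \
  --dataset "longalpaca" \
  --max-tokens 512 \
  --stream
```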

In my testing, I used a relatively extreme scenario by setting input length to 10,000 tokens.

evalscope perf --url http://mi300x-xinyu.westus.cloudapp.azure.com:30000/v1/chat/completions --model "deepseek-ai/DeepSeek-R1" --parallel 1 --number 20 --api openai --min-prompt-length 10000 --dataset "longalpaca" --max-tokens 2048 --min-tokens 2048 --stream 

Next, I list testing results for several scenarios with different concurrency settings and request counts.

Single concurrent request:

5 concurrent requests:

10 concurrent requests:

Additional performance parameters:

  • --enable-torch-compile
    This parameter is currently not supported in an AMD MI300X environment.
  • --enable-dp-attention
    This parameter is supported in the AMD environment. However, setting it at low concurrency levels does not improve performance significantly. Its effectiveness at higher concurrency levels remains to be evaluated.
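For reference, enabling it only changes the sglang launch line inside the same `docker run` invocation shown earlier (a sketch; keep the container flags as above):

```shell
# Same container as before; only the sglang launch flags change
python3 -m sglang.launch_server \
  --model deepseek-ai/DeepSeek-R1 \
  --tp 8 \
  --trust-remote-code \
  --chunked-prefill-size 131072 \
  --enable-dp-attention \
  --host 0.0.0.0
```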

 

Please refer to my repo for more AI resources; you are welcome to star it:

https://github.com/xinyuwei-david/david-share.git 

Updated Apr 11, 2025
Version 2.0