This article references the following article; you are welcome to read it:
Azure GPU VM Environment Preparation
Quickly create a Spot VM using password-based authentication:
az vm create --name <VMNAME> --resource-group <RESOURCE_GROUP_NAME> --location <REGION> --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --priority Spot --max-price -1 --eviction-policy Deallocate --os-disk-size-gb 256 --os-disk-delete-option Delete --admin-username azureadmin --authentication-type password --admin-password <YOUR_PASSWORD>
The CLI command I used to create the VM:
xinyu [ ~ ]$ az vm create --name mi300x-xinyu --resource-group amdrg --location westus --image microsoft-dsvm:ubuntu-hpc:2204-rocm:22.04.2025030701 --size Standard_ND96isr_MI300X_v5 --security-type Standard --priority Spot --max-price -1 --eviction-policy Deallocate --os-disk-size-gb 512 --os-disk-delete-option Delete --admin-username azureadmin --authentication-type password --admin-password azureadmin@123
VM Deployment progress:
Argument '--max-price' is in preview and under development. Reference and support levels: https://aka.ms/CLI_refstatus
Consider upgrading security for your workloads using Azure Trusted Launch VMs. To know more about Trusted Launch, please visit https://aka.ms/TrustedLaunch.
{
"fqdns": "",
"id": "/subscriptions/***/resourceGroups/amdrg/providers/Microsoft.Compute/virtualMachines/mi300x-xinyu",
"location": "westus",
"macAddress": "60-45-BD-01-4B-AF",
"powerState": "VM running",
"privateIpAddress": "10.0.0.4",
"publicIpAddress": "13.64.8.207",
"resourceGroup": "amdrg",
"zones": ""
}
After the system deployment succeeds, open port 22 on the VM's NSG.
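If you prefer the Azure CLI to the portal, opening port 22 can be done with a command like the following (using the VM and resource group names from the example above):
az vm open-port --resource-group amdrg --name mi300x-xinyu --port 22 --priority 1000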
Then SSH into the VM and perform the following environment configurations:
sudo mkdir -p /mnt/resource_nvme/
sudo mdadm --create /dev/md128 -f --run --level 0 --raid-devices 8 $(ls /dev/nvme*n1)
sudo mkfs.xfs -f /dev/md128
sudo mount /dev/md128 /mnt/resource_nvme
sudo chmod 1777 /mnt/resource_nvme
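Optionally, verify that the RAID0 array was assembled and mounted as expected:
cat /proc/mdstat
df -h /mnt/resource_nvme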
During testing, we use the local NVMe temporary disk as the Docker runtime environment. Please note that any data stored on the temporary disk will be lost upon VM reboot. This approach is acceptable for quick and cost-effective testing purposes; however, in a production environment, a persistent file system should be used instead.
Next, create a Hugging Face cache directory on the RAID0 volume and point HF_HOME to it:
mkdir -p /mnt/resource_nvme/hf_cache
export HF_HOME=/mnt/resource_nvme/hf_cache
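Note that the export above only applies to the current shell session. If you want HF_HOME to persist across logins, you can append it to your shell profile, for example:
echo 'export HF_HOME=/mnt/resource_nvme/hf_cache' >> ~/.bashrc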
Then configure Docker to use the RAID0 volume as its data root:
mkdir -p /mnt/resource_nvme/docker
sudo tee /etc/docker/daemon.json > /dev/null <<EOF
{
"data-root": "/mnt/resource_nvme/docker"
}
EOF
sudo chmod 0644 /etc/docker/daemon.json
sudo systemctl restart docker
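You can confirm that Docker picked up the new data root:
docker info | grep "Docker Root Dir"
# should print: Docker Root Dir: /mnt/resource_nvme/docker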
Pull the Docker image:
docker pull rocm/sgl-dev:upstream_20250312_v1
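You can verify that the image is available locally before launching:
docker images rocm/sgl-dev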
Start the container; it takes approximately 5 minutes for DS 671B to start up:
docker run \
--device=/dev/kfd \
--device=/dev/dri \
--security-opt seccomp=unconfined \
--cap-add=SYS_PTRACE \
--group-add video \
--privileged \
--shm-size 128g \
--ipc=host \
-p 30000:30000 \
-v /mnt/resource_nvme:/mnt/resource_nvme \
-e HF_HOME=/mnt/resource_nvme/hf_cache \
-e HSA_NO_SCRATCH_RECLAIM=1 \
-e GPU_FORCE_BLIT_COPY_SIZE=64 \
-e DEBUG_HIP_BLOCK_SYN=1024 \
rocm/sgl-dev:upstream_20250312_v1 \
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --trust-remote-code --chunked-prefill-size 131072 --enable-torch-compile --torch-compile-max-bs 256 --host 0.0.0.0
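Startup takes several minutes, so if you run the container from a tmux/screen session (or re-launch it with -d and --name), you can follow its progress from another shell, for example:
docker ps --filter ancestor=rocm/sgl-dev:upstream_20250312_v1
docker logs -f <CONTAINER_ID_OR_NAME>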
Once you see output similar to the following, it indicates that the container has successfully started:
[2025-04-01 03:42:11 DP7 TP7] Prefill batch. #new-seq: 1, #new-token: 7, #cached-token: 0, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-01 03:42:15] INFO: 127.0.0.1:37762 - "POST /generate HTTP/1.1" 200 OK
[2025-04-01 03:42:15] The server is fired up and ready to roll!
[2025-04-01 04:00:11] INFO: 172.17.0.1:55994 - "POST /v1/chat/completions HTTP/1.1" 200 OK
[2025-04-01 04:00:11 DP0 TP0] Prefill batch. #new-seq: 1, #new-token: 5, #cached-token: 1, token usage: 0.00, #running-req: 0, #queue-req: 0,
[2025-04-01 04:00:43] INFO: 172.17.0.1:41068 - "POST /v1/chat/completions HTTP/1.1" 200 OK
Ensure that the DS 671B container can be accessed locally:
curl http://localhost:30000/get_model_info
{"model_path":"deepseek-ai/DeepSeek-R1","tokenizer_path":"deepseek-ai/DeepSeek-R1","is_generation":true}
curl http://localhost:30000/generate -H "Content-Type: application/json" -d '{ "text": "Once upon a time,", "sampling_params": { "max_new_tokens": 16, "temperature": 0.6 } }'
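The log lines above also show requests hitting the OpenAI-compatible endpoint; a quick local test of /v1/chat/completions looks roughly like this:
curl http://localhost:30000/v1/chat/completions -H "Content-Type: application/json" -d '{ "model": "deepseek-ai/DeepSeek-R1", "messages": [{"role": "user", "content": "Hello!"}], "max_tokens": 64 }'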
Next, open port 30000 in the Azure NSG to enable remote access for testing.
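As with port 22, this can also be done from the CLI, for example (choosing a rule priority that does not clash with the SSH rule):
az vm open-port --resource-group amdrg --name mi300x-xinyu --port 30000 --priority 1010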
Log in to the Linux stress test client, and run the following CLI commands to install the evalscope stress testing tool:
pip install evalscope[perf] -U
pip install gradio
Then, perform stress testing using evalscope. This tool allows you to specify concurrency levels, total request count, number of input and output tokens, as well as test datasets.
If your goal is to maximize overall throughput, increase concurrency and decrease the number of input tokens, for example 100 concurrent requests with 100 input tokens each. In this scenario, you should monitor total throughput in tokens per second (tokens/s).
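As a rough sketch of such a throughput-oriented run, you can take the command shown further below and mainly change the concurrency and token settings, for example (values here are illustrative, not ones I tested; in practice you would also pick a short-prompt dataset or cap the prompt length):
evalscope perf --url http://mi300x-xinyu.westus.cloudapp.azure.com:30000/v1/chat/completions --model "deepseek-ai/DeepSeek-R1" --parallel 100 --number 1000 --api openai --dataset "longalpaca" --max-tokens 128 --min-tokens 128 --stream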
If you want to measure single-request performance, reduce the concurrency and increase the number of input tokens. In this case, pay attention to Time-To-First-Token (TTFT) and tokens/s for an individual request.
In my testing, I used a relatively extreme scenario by setting input length to 10,000 tokens.
evalscope perf --url http://mi300x-xinyu.westus.cloudapp.azure.com:30000/v1/chat/completions --model "deepseek-ai/DeepSeek-R1" --parallel 1 --number 20 --api openai --min-prompt-length 10000 --dataset "longalpaca" --max-tokens 2048 --min-tokens 2048 --stream
Next, I list testing results for several scenarios with different concurrency settings and request counts.
Single concurrent request:
5 concurrent requests:
10 concurrent requests:
Additional performance parameters:
- --enable-torch-compile
  This parameter is currently not supported in an AMD MI300X environment.
- --enable-dp-attention
  This parameter is supported in the AMD environment. However, setting it at low concurrency levels does not improve performance significantly. Its effectiveness at higher concurrency levels remains to be evaluated.
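For reference, trying --enable-dp-attention is just a matter of adding it to the server launch command inside the container. A sketch based on the command used earlier (I am assuming the usual pairing with a data-parallel size via --dp, which you may need to adjust for your setup):
python3 -m sglang.launch_server --model deepseek-ai/DeepSeek-R1 --tp 8 --dp 8 --enable-dp-attention --trust-remote-code --chunked-prefill-size 131072 --host 0.0.0.0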
Please refer to my repo for more AI resources; you are welcome to star it: