Accelerating DeepSeek Inference with AMD MI300: A Collaborative Breakthrough
Over the past few months, we’ve been collaborating closely with AMD to deliver a new level of performance for large-scale inference—starting with the DeepSeek-R1 and DeepSeek-V3 models on Azure AI Foundry.
Through steady, day-by-day improvements to inference frameworks and major kernels, backed by shared engineering investment, we’ve significantly accelerated inference on AMD MI300 hardware, reaching performance competitive with traditional NVIDIA alternatives. The result? Faster output and more flexibility for Models-as-a-Service (MaaS) customers using DeepSeek models.
Why AMD MI300?
While many enterprise workloads are optimized for NVIDIA GPUs, AMD’s MI300 architecture has proven to be a strong contender—especially for larger models like DeepSeek. With high VRAM capacity, bandwidth, and a growing ecosystem of tooling (like SGLang), MI300 offered us the opportunity to scale faster while keeping infrastructure costs optimized.
We began testing DeepSeek on MI300s with a single VM and were pleasantly surprised: early results were already comparable to NVIDIA H200s. With further tuning, including AMD’s custom kernel library (AITER) and optimizations from Microsoft Bing teams, we’ve exceeded the performance of H200s even without Multi-Token Prediction (MTP), making MI300 highly viable for production-grade inference.
What We Optimized
Our work with AMD focused on:
- SGLang kernel tuning for DeepSeek, with progress day-by-day
- Design and implementation of more advanced optimizations like MTP and disaggregated prefill/decode
- Internal Bing contributions to optimize shared inference kernels
This wasn’t just a one-off tuning exercise; it’s an ongoing partnership. We are aiming for even greater improvements on current and future DeepSeek models, as well as many other models.
Benchmarks
AMD’s recent improvements are reproducible within Microsoft
Microsoft has worked on separate optimizations that yield very similar performance gains. This chart includes some early results from enabling Multi-Token Prediction. The `sglang0.4.4.post1` build is based on AMD’s `rocm/sgl-dev:upstream_20250312_v1` image.
Not all kernel optimizations made for DeepSeek-R1 by Microsoft have been contributed back to SGLang yet. However, there is no intention to withhold them, and we are committed to collaborating with the SGLang project and AMD to get them upstreamed.
We’re very excited to continue working with AMD to combine our optimizations to achieve maximum throughput while prioritizing low latency. 
Scaling Globally with Cost-Efficient Inference
One of the biggest wins? Hardware availability.
Because MI300s are more readily available in regions like East US and Germany Central, we were able to rapidly scale DeepSeek inference capacity—faster than if we’d waited for scarce high-end NVIDIA hardware. This flexibility allowed us to meet customer demand without compromising on performance or budget.
How to Reproduce the Benchmark
Now let’s reproduce the same performance boost on your system and apply the same techniques to your application on the MI300X GPUs.
The following instructions assume that you have already downloaded the model (DeepSeek-R1 in the commands below).
Note: The image provided for replicating the MI300X benchmark is a pre-upstream staging version which isn’t the same as the one shown above with Microsoft’s internal changes. The optimizations and performance enhancements in this release are expected to be included in the upcoming lmsysorg upstream production release.
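Both sets of instructions mount $MODEL_DIR into the container, so a wrong path only shows up after the image has started. A small pre-flight check can catch that earlier. This is a sketch under assumptions, not part of the published procedure: the function name check_model_dir is hypothetical, and the config.json test assumes a Hugging Face style checkpoint directory, which the instructions here do not state.

```shell
# Hypothetical helper: sanity-check the model directory before `docker run`.
check_model_dir() {
    local dir="$1"
    if [ -z "$dir" ]; then
        echo "MODEL_DIR is empty or unset" >&2
        return 1
    fi
    if [ ! -d "$dir" ]; then
        echo "Not a directory: $dir" >&2
        return 1
    fi
    # Assumption: Hugging Face style checkpoints keep a config.json at the top level.
    if [ ! -f "$dir/config.json" ]; then
        echo "No config.json found in $dir" >&2
        return 1
    fi
    echo "Model directory looks usable: $dir"
}
```

Calling `check_model_dir "$MODEL_DIR"` before launching the container returns a non-zero status if the path is unusable, so it can guard the `docker run` step with `&&`.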
NVIDIA H200 GPU with SGLang
1. Set relevant environment variables and launch the NVIDIA SGLang container.
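# Pull the NVIDIA (CUDA) build of SGLang
docker pull lmsysorg/sglang:v0.4.4.post1-cu125
# Replace the placeholder with the path to your downloaded DeepSeek-R1 checkpoint
export MODEL_DIR=<DeepSeek-R1 saved_path>
# Start an interactive container with all GPUs visible and the model mounted at /model
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --shm-size 32G \
    --gpus all \
    -v "$MODEL_DIR":/model \
    lmsysorg/sglang:v0.4.4.post1-cu125
2. Start the SGLang server.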
# Inside the container: launch the server in the background with tensor parallelism across 8 GPUs
export SGL_ENABLE_JIT_DEEPGEMM=1
python3 -m sglang.launch_server \
    --model /model \
    --trust-remote-code \
    --tp 8 \
    --mem-fraction-static 0.9 \
    --enable-torch-compile \
    --torch-compile-max-bs 256 \
    --chunked-prefill-size 131072 \
    --enable-flashinfer-mla &
3. Run the SGLang benchmark serving script with your chosen concurrency values and parameters.
# Run after the server logs “The server is fired up and ready to roll!”
concurrency_values=(128 64 32 16 8 4 2 1)
for concurrency in "${concurrency_values[@]}"; do
    python3 -m sglang.bench_serving \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 500 \
        --random-input 3200 \
        --random-output 800 \
        --max-concurrency "${concurrency}"
done
AMD Instinct MI300X GPU with SGLang
Note: These instructions apply only to the rocm/sgl-dev:upstream_20250312_v1 image, as Microsoft’s optimizations are not yet public.
1. Set relevant environment variables and launch the AMD SGLang container.
# Pull the AMD (ROCm) staging build of SGLang
docker pull rocm/sgl-dev:upstream_20250312_v1
# Replace the placeholder with the path to your downloaded DeepSeek-R1 checkpoint
export MODEL_DIR=<DeepSeek-R1 saved_path>
# Start an interactive container with the GPU devices exposed and the model mounted at /model
docker run -it \
    --ipc=host \
    --network=host \
    --privileged \
    --shm-size 32G \
    --cap-add=CAP_SYS_ADMIN \
    --device=/dev/kfd \
    --device=/dev/dri \
    --group-add video \
    --group-add render \
    --cap-add=SYS_PTRACE \
    --security-opt seccomp=unconfined \
    --security-opt apparmor=unconfined \
    -v "$MODEL_DIR":/model \
    rocm/sgl-dev:upstream_20250312_v1
2. Start the SGLang server.
python3 -m sglang.launch_server \
    --model /model \
    --tp 8 \
    --trust-remote-code \
    --chunked-prefill-size 131072 \
    --enable-torch-compile \
    --torch-compile-max-bs 256 &
3. Run the SGLang benchmark serving script with your chosen concurrency values and parameters.
# Run after the server logs “The server is fired up and ready to roll!”
concurrency_values=(128 64 32 16 8 4 2 1)
for concurrency in "${concurrency_values[@]}"; do
    python3 -m sglang.bench_serving \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 500 \
        --random-input 3200 \
        --random-output 800 \
        --max-concurrency "${concurrency}"
done
Note: Enabling torch compile lengthens graph compilation and therefore server launch time.
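Both benchmark loops need the server to be ready before they start, and the results are easier to compare across GPUs when each run is saved to its own file. Here is a minimal sketch of one way to automate that; it is not part of the published benchmark procedure. It assumes the server listens on SGLang's default port 30000 and answers on a /health endpoint. The helper names wait_for_server, bench_one, and run_sweep, and the log file layout, are hypothetical. The benchmark flags are copied unchanged from the commands above.

```shell
# Hypothetical helper: poll a URL until it responds or the retry budget runs out.
# Usage: wait_for_server URL [max_tries] [delay_seconds]
wait_for_server() {
    local url="$1"
    local max_tries="${2:-120}"
    local delay="${3:-10}"
    local i=0
    while [ "$i" -lt "$max_tries" ]; do
        # -s: quiet, -f: treat HTTP errors as failure, -o: discard the body
        if curl -sf -o /dev/null "$url"; then
            return 0
        fi
        i=$((i + 1))
        sleep "$delay"
    done
    echo "Server did not become ready: $url" >&2
    return 1
}

# One benchmark run at a given concurrency, with the same flags as above.
bench_one() {
    python3 -m sglang.bench_serving \
        --dataset-name random \
        --random-range-ratio 1 \
        --num-prompt 500 \
        --random-input 3200 \
        --random-output 800 \
        --max-concurrency "$1"
}

# Hypothetical helper: run bench_one for each concurrency and keep one log per run.
# Usage: run_sweep LOG_DIR CONCURRENCY...
run_sweep() {
    local log_dir="$1"
    shift
    mkdir -p "$log_dir"
    local c
    for c in "$@"; do
        bench_one "$c" 2>&1 | tee "$log_dir/concurrency_${c}.log"
    done
}
```

For example, inside the container: `wait_for_server http://127.0.0.1:30000/health && run_sweep ./bench_logs 128 64 32 16 8 4 2 1`. With the defaults the poll gives up after 120 attempts spaced 10 seconds apart, which you may need to raise when torch compile is enabled.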
What's Next?
We see this as the beginning of a longer-term investment in heterogeneous, cost-efficient hardware for model serving. While we’re committed to supporting a wide range of models and GPUs, the MI300 work with DeepSeek has proven that smart optimization can unlock new infrastructure choices. Further enhancements such as disaggregated decode + prefill are in the pipeline!
With continued collaboration, we plan to bring this level of performance to future models, including the newly released MAI model, which will also run on the MI300 pool. Explore the Azure AI Foundry model catalog today.