Summary
We partnered closely with NVIDIA to unlock high-performance single-node inference for DeepSeek-V3.2 on NVIDIA Blackwell. By leveraging NVIDIA’s new NVFP4 checkpoint for DeepSeek-V3.2 combined with NVIDIA TensorRT LLM on NVIDIA Blackwell, we achieved breakthrough inference performance. These experiments were performed on a single node (2 Grace Blackwell superchips) of the NVIDIA GB200 NVL72 platform (hereafter referred to as an NVIDIA GB200 node), similar to the Standard_ND128isr_NDR_GB200_v6 VM available on Azure.
Using an aligned apples-to-apples benchmark methodology, single-node inference using NVIDIA GB200 nodes with NVFP4 and TensorRT LLM delivers up to 2.5x lower per-user latency than similar inference configurations with NVIDIA H200 GPUs.
Beyond lower latency, NVIDIA GB200 nodes with NVFP4 dramatically increase the number of users that can be served from the same GPU footprint as an H200 deployment. Holding a consistent latency target across both GB200 and H200 deployments, our experiments demonstrated that single-node deployments of DeepSeek-V3.2 can serve up to 16 times as many users per GPU on NVIDIA GB200 nodes as on NVIDIA H200 nodes.
The results came from end-to-end co-optimization across three layers:
- Hardware: Our experiments are performed on individual nodes of the NVIDIA GB200 NVL72, with each node consisting of 2 Grace CPUs and 4 Blackwell GPUs.
- Model weights: NVIDIA's NVFP4-quantized weights for DeepSeek-V3.2 are optimized to deliver high inference efficiency while preserving model quality.
- Inference runtime: TensorRT LLM serves as the production-grade serving and execution engine, configured to optimize DeepSeek-V3.2 on Blackwell GPUs.
This post details how we achieved these results, including our serving and benchmarking setup.
This configuration is now used to serve DeepSeek-V3.2 on Microsoft Foundry. See here for information about DeepSeek-V3.2 on Microsoft Foundry.
Unlocking Blackwell Performance
DeepSeek-V3.2 offers strong reasoning performance and broad task coverage, making it well-suited for real-world production workloads. However, the sheer scale of this 690-billion-parameter Mixture-of-Experts (MoE) model presents an inherent challenge. Achieving efficient, cost-effective inference at this magnitude demands careful, end-to-end optimization across the entire stack—from model representation to runtime and system configuration—all while preserving output quality and predictable latency.
The NVIDIA GB200 NVL72 platform, integrated with NVFP4 quantization and the TensorRT LLM inference engine, offers a powerful solution for delivering high-performance, cost-effective inference for DeepSeek-V3.2.
NVIDIA GB200 NVL72
The NVIDIA GB200 NVL72 is a rack-scale solution that leverages 72 NVIDIA Blackwell GPUs with 36 NVIDIA Grace CPUs to deliver high performance inference. Each of the 72 Blackwell GPUs contains 186 GB of high-bandwidth HBM3e memory, a 32% increase in per-GPU memory compared to NVIDIA H200 GPUs. NVIDIA Blackwell’s second-generation transformer engines provide 10 PFLOPS of dense NVFP4 performance per GPU, a 5x increase over 2 PFLOPS for dense FP8 on H200.
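The memory and compute ratios above work out as follows (H200's 141 GB of HBM3e is taken from public specs, not stated in this post):

```python
# Per-GPU memory: GB200's Blackwell GPU vs. H200 (186 GB is from the text;
# 141 GB for H200 is an assumed public-spec figure).
gb200_mem_gb, h200_mem_gb = 186, 141
memory_gain_pct = round((gb200_mem_gb / h200_mem_gb - 1) * 100)
print(memory_gain_pct)  # 32 (% more memory per GPU)

# Dense compute: NVFP4 on Blackwell vs. FP8 on H200 (PFLOPS from the text).
nvfp4_pflops, h200_fp8_pflops = 10, 2
print(nvfp4_pflops // h200_fp8_pflops)  # 5 (x higher dense throughput)
```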
For MoE models like DeepSeek-V3.2, Blackwell's superior memory capacity and NVFP4 compute throughput enable higher inference performance.
NVFP4 Floating Point Precision
NVFP4 is an innovative 4-bit floating point format introduced with the NVIDIA Blackwell architecture. By encoding quantized blocks with non-power-of-two scaling factors, NVFP4 simultaneously enables higher performance, reduced memory footprint, and preserved model accuracy when compared with FP8 and FP16 floating point formats. NVFP4 is optimized to take advantage of NVIDIA Blackwell’s native Tensor Core support.
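To make the block-scaling idea concrete, here is a toy quantizer in the spirit of NVFP4: each block of values shares one scale, chosen so the block maximum maps onto the largest FP4 (E2M1) magnitude. This is an illustrative sketch, not the hardware format; real NVFP4 uses 16-element blocks with FP8 (E4M3) block scales plus a per-tensor scale.

```python
# Positive values representable in FP4 E2M1 (the 4-bit grid NVFP4 builds on).
E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_block(block):
    """Quantize one block to the E2M1 grid with a shared, non-power-of-two scale."""
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax else 1.0  # map block max onto the largest FP4 magnitude
    def q(x):
        # Snap |x|/scale to the nearest representable magnitude, keep the sign.
        mag = min(E2M1, key=lambda v: abs(abs(x) / scale - v))
        return (mag if x >= 0 else -mag) * scale
    return [q(x) for x in block], scale

deq, scale = quantize_block([0.11, -0.42, 0.08, 0.95])
```

Because the scale is a general floating-point value rather than a power of two, the block maximum is reproduced exactly and the rest of the block lands close to the original values.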
For DeepSeek-V3.2, NVIDIA’s NVFP4 quantization reduced the memory footprint of the model by 1.7x compared to the model’s original FP8 format (415 GB vs. 690 GB), leading to significant boosts in throughput and cost savings. NVIDIA has published comprehensive quality benchmarking results for the DeepSeek-V3.2 NVFP4 model, showing that the quantized weights maintain accuracy closely aligned with the original FP8 model across a broad set of industry-standard benchmarks.
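The 1.7x figure is roughly what block-scaled 4-bit storage predicts. A back-of-envelope estimate, assuming 4-bit values in 16-element blocks with one 1-byte (FP8) scale per block (block size and scale width are assumptions, not stated above):

```python
fp8_checkpoint_gb = 690           # FP8 is ~1 byte/param, so this also tracks param count
bytes_per_param = 0.5 + 1 / 16    # 4-bit value + amortized 1-byte block scale
est_nvfp4_gb = fp8_checkpoint_gb * bytes_per_param
print(round(est_nvfp4_gb))                          # 388
print(round(fp8_checkpoint_gb / est_nvfp4_gb, 2))   # 1.78
```

The published 415 GB is somewhat larger than this idealized estimate, consistent with some tensors remaining at higher precision; 690 GB vs. 415 GB still lands at the quoted ~1.7x.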
| Precision | MMLU Pro | GPQA Diamond | LiveCodeBench V6 | SciCode | AIME 2025 |
|---|---|---|---|---|---|
| FP8 | 0.802 | 0.849 | 0.756 | 0.391 | 0.934 |
| NVFP4 | 0.799 | 0.835 | 0.756 | 0.401 | 0.923 |
Table 1: DeepSeek-V3.2 Accuracy Benchmark Results Across FP8 and NVFP4 Checkpoints
Across reasoning, coding, and scientific benchmarks, NVFP4 delivers near-parity results relative to FP8, validating its suitability for production inference where memory efficiency and throughput are critical.
For detailed model quality metrics and benchmark comparisons, see the NVIDIA DeepSeek-V3.2-NVFP4 model card.
TensorRT LLM
TensorRT LLM is an open-source library for optimizing LLM inference. It provides high-performance optimizations for NVIDIA GPUs, including low-precision serving, in-flight batching, and custom attention kernels. TensorRT LLM's optimized support for sparse attention and large context windows enables DeepSeek-V3.2 to achieve breakthrough performance.
Benchmark Methodology
We utilized a fair and practical benchmark methodology, reflecting production-style inference patterns. Although multi-node inference is anticipated to deliver higher per-GPU performance by leveraging features like disaggregated serving and Wide EP, we focused on isolating our experiments to single-node performance to ensure a clear comparison between the two platforms.
| Parameter | Value |
|---|---|
| Input length | 2,400 tokens |
| Output length | 1,600 tokens |
| Concurrent requests | 1, 2, 4, 8, 16, 32, 64 |
| Dataset | ShareGPT |
| Target metrics | Output throughput (tokens/sec), end-to-end latency (ms) |
Table 2: Load Profile for Performance Benchmarks
As real-world inference performance varies with the quantity of active requests, our experiments measured system performance under different loads, as shown in the "Concurrent Requests" parameter in Table 2. This is sometimes referred to as “concurrency”.
Our two target metrics are output throughput and end-to-end latency. Output throughput measures the cumulative number of tokens produced by the model per second across all concurrent requests. End-to-end latency measures the total time the model takes to complete a request, from the moment the request is sent to the moment the response's final token is generated.
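Both metrics can be computed directly from per-request timestamps. A minimal sketch (the field names are our own, not sglang.bench_serving's):

```python
from statistics import median

def summarize(requests):
    """Each request: dict with send_s, last_token_s (Unix seconds), output_tokens."""
    # End-to-end latency: request sent -> final token generated, in ms.
    e2e_ms = [(r["last_token_s"] - r["send_s"]) * 1000 for r in requests]
    # Output throughput: total generated tokens over the whole benchmark window.
    window_s = max(r["last_token_s"] for r in requests) - min(r["send_s"] for r in requests)
    tput = sum(r["output_tokens"] for r in requests) / window_s
    return tput, median(e2e_ms)

reqs = [
    {"send_s": 0.0, "last_token_s": 5.0, "output_tokens": 1600},
    {"send_s": 0.0, "last_token_s": 8.0, "output_tokens": 1600},
]
tput, med_ms = summarize(reqs)  # 400.0 tokens/s, 6500.0 ms
```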
The following script was used to benchmark performance at multiple request concurrencies using SGLang’s sglang.bench_serving tool. The script sends prompt requests to a provided inference endpoint, then gathers performance data such as throughput, time-to-first-token, and end-to-end latency. Read more about the sglang.bench_serving tool here.
#!/usr/bin/env bash
set -euo pipefail

CONCURRENCY_LIST=(1 2 4 8 16 32 64)

for max_concurrency in "${CONCURRENCY_LIST[@]}"; do
  echo "==> Running benchmark with --max-concurrency ${max_concurrency}"
  python3 -m sglang.bench_serving \
    --backend sglang-oai \
    --model ./DeepSeek-V3.2-NVFP4/ \
    --num-prompts 500 \
    --max-concurrency "${max_concurrency}" \
    --tokenizer ./DeepSeek-V3.2-NVFP4/ \
    --dataset-path ./ShareGPT_V3_unfiltered_cleaned_split.json \
    --sharegpt-output-len 1600 \
    --sharegpt-context-len 2400
done
Results: NVIDIA GB200 Node with NVFP4 Achieves Best Performance
Figure 1: Single-Node Inference: Throughput (output tokens per second) vs Median End-to-End Latency (ms)
Figure 1 plots throughput (output tokens per second) against median end-to-end latency (ms). The annotations on the graph denoted “cX:Y” outline the request concurrency at which the data point was gathered, where X is the request concurrency, and Y is the throughput observed at that request concurrency, measured in tokens per second. Higher and further left indicates better efficiency. Across all concurrencies, the configuration using GB200, NVFP4, and TensorRT LLM consistently achieves the best efficiency.
| Configuration | Concurrency | Throughput (tok/s) | Median E2E Latency (ms) | Throughput/Latency |
|---|---|---|---|---|
| GB200 with NVFP4 | 1 | 272 | 5,801 | 0.047 |
| GB200 with FP8 | 1 | 228 | 7,015 | 0.033 |
| H200 with FP8 | 1 | 109 | 14,716 | 0.007 |
Table 3: Single-concurrency requests
Key findings:
- Up to 2.5x lower end-to-end latency: A GB200 node with NVFP4 delivers 5801 ms median latency vs. 14716 ms median latency on an H200 node at concurrency 1, a 2.5x improvement.
- Best efficiency curve: Figure 1 shows that at each concurrency, a GB200 node with NVFP4 has the highest throughput and lowest latency compared to both a GB200 node with FP8 and an H200 node with FP8.
- Serve up to 16x more users per GPU: Given an end-to-end latency target of 15,000 milliseconds, single-node inference for DeepSeek-V3.2 with NVFP4 on an NVIDIA GB200 node yields 8x the throughput and can serve up to 8 concurrent users, while an NVIDIA H200 node can serve only 1 user. Since our GB200 NVL72 nodes contain 4 GPUs, and our H200 nodes contain 8 GPUs, this translates to 16x higher performance per GPU.
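The per-GPU arithmetic behind the 16x claim, using the node sizes and user counts above:

```python
# Concurrent users served within the 15,000 ms latency target (from the text).
gb200_users, gb200_gpus_per_node = 8, 4
h200_users, h200_gpus_per_node = 1, 8

users_per_gpu_gb200 = gb200_users / gb200_gpus_per_node  # 2.0
users_per_gpu_h200 = h200_users / h200_gpus_per_node     # 0.125
ratio = users_per_gpu_gb200 / users_per_gpu_h200
print(ratio)  # 16.0
```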
Serving Configuration Reference
This section shows how we served DeepSeek-V3.2 in our experiments. Results may vary depending on the hardware and software configuration used.
| Parameter | Blackwell NVFP4 | Blackwell FP8 | Hopper FP8 |
|---|---|---|---|
| GPUs | 4x GB200 | 4x GB200 | 8x H200 |
| Nodes | 1 | 1 | 1 |
| Tensor Parallelism | 4 | 4 | 8 |
| Max Batch Size | 64 | 64 | 64 |
| MTP Enabled | Yes | Yes | Yes |
| Inference Engine | TensorRT LLM | TensorRT LLM | TensorRT LLM |
| Model Checkpoint | DeepSeek-V3.2-NVFP4 | DeepSeek-V3.2 (FP8) | DeepSeek-V3.2 (FP8) |
Table 4: Serving Configuration Parameters
Config File
cuda_graph_config:
  enable_padding: true
  batch_sizes: [1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,32,64]
kv_cache_config:
  free_gpu_memory_fraction: 0.8
  dtype: fp8
moe_config:
  backend: TRTLLM
speculative_config:
  decoding_type: MTP
  num_nextn_predict_layers: 3
Note: While model weights use NVFP4, the KV cache remains FP8 to balance memory efficiency with numerical stability during decode.
Serving the Model
The following command launches the TensorRT LLM inference server for the DeepSeek-V3.2 model, using the preceding configuration file and exposing the service on port 30000 for high-throughput serving.
trtllm-serve serve ./DeepSeek-V3.2-NVFP4/ \
--tp_size 4 \
--max_batch_size 64 \
--trust_remote_code \
--extra_llm_api_options ./config.yaml \
--host 0.0.0.0 \
--port 30000
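Once the server is up, it exposes an OpenAI-compatible API on port 30000. A minimal client sketch using only the standard library (the endpoint path and payload shape follow the OpenAI chat-completions convention; adjust if your TensorRT LLM version differs):

```python
import json
import urllib.request

URL = "http://localhost:30000/v1/chat/completions"

def build_payload(prompt: str, max_tokens: int = 1600) -> dict:
    # The model path from the trtllm-serve command doubles as the model id.
    return {
        "model": "./DeepSeek-V3.2-NVFP4/",
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }

def chat(prompt: str) -> str:
    # Requires the trtllm-serve instance above to be running.
    req = urllib.request.Request(
        URL,
        data=json.dumps(build_payload(prompt)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.loads(resp.read())
    return body["choices"][0]["message"]["content"]
```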
What’s Next
Rack-Scale Inference
This blog post focuses on single-node inference (2 Grace CPUs and 4 Blackwell GPUs per GB200 node; 8 GPUs per H200 node). We anticipate even greater performance improvements with multi-node serving configurations, including those leveraging disaggregated serving, TensorRT LLM's Wide EP capabilities, and all 72 GPUs of the NVIDIA GB200 NVL72 rack system.
Apply Approach to New Models
We plan to apply the same approach, combining Blackwell hardware, NVFP4 quantization, TensorRT LLM, and kernel tuning, to additional model families.
Acknowledgements
This work was enabled by close collaboration between engineering teams from Microsoft and NVIDIA. Key contributors include Xiaoran Li, Tao Wang, and Vivek Ramaswamy from Microsoft; and Stephen McCullough, Anurag Mukkara, and Nikhar Maheshwari from NVIDIA.