Azure ND GB300 v6 virtual machines, built on NVIDIA GB300 NVL72 rack-scale systems, achieve an unprecedented 1,100,000 tokens/s on Llama2 70B inference, beating the previous Azure ND GB200 v6 record of 865,000 tokens/s by 27%.
By Mark Gitau (Software Engineer) and Hugo Affaticati (Senior Cloud Infrastructure Engineer)
The new Azure ND GB300 v6 virtual machines build on the NVIDIA Blackwell architecture introduced with the ND GB200 v6 and are further optimized for inference workloads, with 50% more GPU memory and a 16% higher TDP (Thermal Design Power).
To demonstrate the performance gains the ND GB300 v6 virtual machines deliver on customer workloads, we ran the Llama2 70B benchmark from MLPerf Inference v5.1 on each of the 18 ND GB300 v6 virtual machines in one NVIDIA GB300 NVL72 domain. The Llama2 70B model is a widely adopted industry standard for large-scale AI deployments, making it a good proxy for production inference workloads. One NVL72 rack of Azure ND GB300 v6 achieved an aggregate 1,100,000 tokens/s in an unverified MLPerf Inference v5.1 submission [1], detailed in Table 1 and observed by Signal65. This is a new record in AI inference, beating our own previous record of 865,000 tokens/s on one NVIDIA GB200 NVL72 rack of ND GB200 v6 VMs.
| Metric | Performance (Tokens/Second) |
|---|---|
| Total Aggregated Throughput | 1,100,948.3 |
| Maximum Single-Node Throughput | 62,803.9 |
| Minimum Single-Node Throughput | 57,599.1 |
| Average Single-Node Throughput | 61,163.8 |
| Median Single-Node Throughput | 61,759.1 |
Table 1: Data compiled from 18 parallel test runs, observed by Signal65.
This translates to 15,200 tokens/s per NVIDIA Blackwell Ultra GPU (+/- 5%), a 27% speedup over the 12,022 tokens/s per NVIDIA Blackwell GPU of the previous ND GB200 v6 record. For comparison, earlier MLPerf Inference v4.1 results show that the NVIDIA DGX H100 system processed 24,525 tokens per second across 8 GPUs, or 3,066 tokens per second per NVIDIA H100 GPU [2]. This means Azure ND GB300 v6 VMs deliver 5× higher throughput per GPU than the previous-generation ND H100 v5 virtual machines.
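As a quick sanity check, the per-GPU figures follow directly from the rack totals and the 72 GPUs in an NVL72 domain; a back-of-the-envelope calculation (not part of the benchmark harness):
awk 'BEGIN {
  gb300 = 1100948.3   # aggregate tokens/s, ND GB300 v6 rack (Table 1)
  gb200 = 865000      # previous ND GB200 v6 rack record (rounded)
  h100  = 24525 / 8   # tokens/s per H100 GPU, MLPerf v4.1 [2]
  gpus  = 72          # GPUs per NVL72 rack
  printf "GB300 per GPU: %.0f tokens/s\n", gb300 / gpus        # ~15,291
  printf "GB200 per GPU: %.0f tokens/s\n", gb200 / gpus        # ~12,014 (12,022 with the unrounded total)
  printf "rack speedup: %.0f%%\n", (gb300 / gb200 - 1) * 100   # ~27%
  printf "per-GPU vs H100: %.1fx\n", (gb300 / gpus) / h100     # ~5x
}'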
This milestone was observed by Signal65, an independent third party. They concluded that this record "of over 1.1 million tokens per second on a single Azure rack is more than a benchmark record; it is a definitive proof point that the performance required for large-scale, transformative AI is now available as a reliable, efficient, and resilient utility".
The Azure ND GB300 v6 virtual machine configuration (Table 2) benefits from performance gains across key hardware components, such as GEMM efficiency, high-bandwidth memory (HBM) throughput, NVLink connectivity, and NCCL communication. Our benchmarks show that the ND GB300 v6 achieves 2.5x the GEMM TFLOPS per GPU of the ND H100 v5. Additionally, we measured 7.37 TB/s of HBM bandwidth (92% efficiency) and 4x faster CPU-to-GPU transfers thanks to NVLink-C2C.
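For reference, the 92% HBM efficiency is simply the measured bandwidth divided by the GPU's peak; assuming the ~8 TB/s peak HBM3e bandwidth of a Blackwell Ultra GPU:
awk 'BEGIN { printf "HBM efficiency: %.1f%%\n", 7.37 / 8.0 * 100 }'   # ~92%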
| Component | Specification |
|---|---|
| Cloud Platform | Microsoft Azure |
| VM Instance SKU | ND_GB300_v6 |
| System Configuration | 18 x NDv6 VM instances in a single NVL72 rack |
| GPU | 4 x NVIDIA GB300 per VM (72 total) |
| GPU Memory | 189,471 MiB per GPU |
| GPU Power Limit | 1,400 W per GPU |
| Storage | 14 TB Local NVMe RAID per VM |
| LLM Inference Engine | NVIDIA TensorRT-LLM |
| Benchmark Harness | MLCommons MLPerf Inference v5.1 |
| Benchmark Scenario | Offline |
| Model | Llama2-70B |
| Precision | FP4 |
Table 2: ND GB300 v6 configuration for the MLPerf Inference test.
The model was run using FP4 precision, a form of quantization that significantly accelerates inference speed while maintaining high accuracy. This was implemented via the NVIDIA TensorRT-LLM library, a highly optimized, production-ready software stack for LLM inference.
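To see why FP4 matters at this scale, here is a rough weight-memory estimate for a 70B-parameter model at different precisions (weights only; KV cache and activations excluded):
awk 'BEGIN {
  p = 70e9   # Llama2 70B parameter count
  printf "FP16 weights: %.0f GB\n", p * 2.0 / 1e9   # 2 bytes per parameter
  printf "FP8  weights: %.0f GB\n", p * 1.0 / 1e9   # 1 byte per parameter
  printf "FP4  weights: %.0f GB\n", p * 0.5 / 1e9   # 4 bits per parameter
}'
Quartering the weight footprint relative to FP16 frees HBM for KV cache and reduces the memory traffic required per generated token, which is what drives the inference speedup.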
Azure is once again raising the bar for enterprise-scale AI inference with the ND GB300 v6 virtual machines.
How to replicate the results on a single virtual machine in Azure
Clone the repository and enter the working directory:
git clone https://github.com/Azure/AI-benchmarking-guide.git && cd AI-benchmarking-guide/Azure_Results
Download the models & datasets
- Create the models, data, and preprocessed_data directories in the working directory (see the sketch after this list).
- Download the Llama 2 70B model inside the models directory.
- Download the datasets inside the data directory.
- Prepare the datasets.
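A minimal sketch of that directory setup, assuming the working directory is AI-benchmarking-guide/Azure_Results (the model and dataset downloads themselves depend on your access to the MLCommons artifacts, so they are not reproduced here):
mkdir -p models data preprocessed_data
# Place the Llama 2 70B checkpoint under ./models and the raw datasets
# under ./data; the preprocessing step then populates ./preprocessed_data.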
Setup container
- Inside the working directory:
mkdir build && cd build
git clone https://github.com/NVIDIA/TensorRT-LLM.git TRTLLM
cd TRTLLM
- Edit TRTLLM/docker/Makefile lines 135 and 136:
SOURCE_DIR ?= AI-benchmarking-guide/Azure_Results (make sure it is an absolute path to the working directory)
CODE_DIR ?= /work
- Build & launch the container:
make -C docker build
make -C docker run
- Once inside the container, install TensorRT-LLM:
cd 1M_ND_GB300_v6_Inference/build/TRTLLM
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "103-real" --no-venv --clean
pip install build/tensorrt_llm-1.1.0rc6-cp312-cp312-linux_aarch64.whl
- Inside the 1M_ND_GB300_v6_Inference directory, install MLPerf dependencies:
make clone_loadgen && make build_loadgen
git clone https://github.com/NVIDIA/mitten.git ./build/mitten && pip install build/mitten
pip install -r docker/common/requirements/requirements.llm.txt
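Before moving on, a quick sanity check (our addition, not part of the official steps) confirms the TensorRT-LLM wheel installed correctly:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Expect the version of the wheel built above, e.g. 1.1.0rc6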
Setup and run benchmark
- Export env variables and link model & data directories:
export MLPERF_SCRATCH_PATH=/work
export SYSTEM_NAME=ND_GB300_v6
make link_dirs
- Run the offline benchmark (wait a few minutes after the run_llm_server command for the server to start):
make run_llm_server RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
make run_harness RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
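The rack-level record is the sum of the offline throughputs reported by the 18 VMs. A minimal aggregation sketch, assuming each per-VM MLPerf summary log contains a "Tokens per second" line (field names can vary between harness versions) and using hypothetical file names:
grep -h "Tokens per second" mlperf_log_summary_vm*.txt |
  awk '{ total += $NF } END { printf "aggregate: %.1f tokens/s\n", total }'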
All 18 log files from our run can be found here.
[1] Unverified MLPerf® v5.1 Inference Closed Llama 2 70B offline. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf specification for verified results. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information. Results obtained using NVIDIA MLPerf v5.1 code with NVIDIA TensorRT-LLM 1.1.0rc1.
[2] Verified result with ID 4.1-0043.