Azure ND GB300 v6 virtual machines, built on NVIDIA GB300 NVL72 rack-scale systems, achieve an unprecedented 1,100,000 tokens/s on Llama2 70B inference, beating the previous Azure ND GB200 v6 record of 865,000 tokens/s by 27%.
By Mark Gitau (Software Engineer) and Hugo Affaticati (Senior Cloud Infrastructure Engineer)
The new Azure ND GB300 v6 virtual machines build on the NVIDIA Blackwell architecture introduced with the ND GB200 v6 and are further optimized for inference workloads, with 50% more GPU memory and a 16% higher TDP (Thermal Design Power).
To demonstrate the performance gains the ND GB300 v6 virtual machines deliver on customer workloads, we ran the Llama2 70B benchmark from MLPerf Inference v5.1 on each of the 18 ND GB300 v6 virtual machines in one NVIDIA GB300 NVL72 domain. The Llama2 70B model is a widely adopted industry standard for large-scale AI deployments, making it a good proxy for production inference workloads. One NVL72 rack of Azure ND GB300 v6 achieved an aggregate 1,100,000 tokens/s in an unverified MLPerf Inference v5.1 submission [1], detailed in Table 1 and observed by Signal65. This is a new record in AI inference, beating our own previous record of 865,000 tokens/s on one NVIDIA GB200 NVL72 rack of ND GB200 v6 VMs.
| Metric | Performance (Tokens/Second) |
|---|---|
| Total Aggregated Throughput | 1,100,948.3 |
| Maximum Single-Node Throughput | 62,803.9 |
| Minimum Single-Node Throughput | 57,599.1 |
| Average Single-Node Throughput | 61,163.8 |
| Median Single-Node Throughput | 61,759.1 |
Table 1: Data compiled from 18 parallel test runs, observed by Signal65.
This translates to 15,200 tokens/s per NVIDIA Blackwell Ultra GPU (+/- 5%), a 27% speedup over the 12,022 tokens/s per NVIDIA Blackwell GPU of the previous ND GB200 v6 record. For comparison, earlier MLPerf Inference v4.1 results show that the NVIDIA DGX H100 system processed 24,525 tokens per second across 8 GPUs, or 3,066 tokens per second per NVIDIA H100 GPU [2]. This means Azure ND GB300 v6 VMs deliver 5× higher throughput per GPU than the previous-generation ND H100 v5 virtual machines.
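As a quick sanity check, the per-GPU figures follow directly from the rack totals and the 72 GPUs in an NVL72 domain; a back-of-the-envelope calculation (not part of the benchmark harness):
awk 'BEGIN {
  gb300 = 1100948.3   # aggregate tokens/s, ND GB300 v6 rack (Table 1)
  gb200 = 865000      # previous ND GB200 v6 rack record (rounded)
  h100  = 24525 / 8   # tokens/s per H100 GPU, MLPerf v4.1 [2]
  gpus  = 72          # GPUs per NVL72 rack
  printf "GB300 per GPU: %.0f tokens/s\n", gb300 / gpus        # ~15,291
  printf "GB200 per GPU: %.0f tokens/s\n", gb200 / gpus        # ~12,014 (12,022 with the unrounded total)
  printf "rack speedup: %.0f%%\n", (gb300 / gb200 - 1) * 100   # ~27%
  printf "per-GPU vs H100: %.1fx\n", (gb300 / gpus) / h100     # ~5x
}'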
This milestone was observed by Signal65, an independent third party. They concluded that this record "of over 1.1 million tokens per second on a single Azure rack is more than a benchmark record; it is a definitive proof point that the performance required for large-scale, transformative AI is now available as a reliable, efficient, and resilient utility".
The Azure ND GB300 v6 virtual machine configuration (Table 2) benefits from performance gains across key hardware components, such as GEMM efficiency, high-bandwidth memory (HBM) throughput, NVLink connectivity, and NCCL communication. Our benchmarks show that the ND GB300 v6 achieves 2.5x the GEMM TFLOPS per GPU of the ND H100 v5. Additionally, we measured 7.37 TB/s of HBM bandwidth (92% efficiency) and 4x faster CPU-to-GPU transfers thanks to NVLink-C2C.
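For reference, the 92% HBM efficiency is simply the measured bandwidth divided by the GPU's peak; assuming the ~8 TB/s peak HBM3e bandwidth of a Blackwell Ultra GPU:
awk 'BEGIN { printf "HBM efficiency: %.1f%%\n", 7.37 / 8.0 * 100 }'   # ~92%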
| Component | Specification |
|---|---|
| Cloud Platform | Microsoft Azure |
| VM Instance SKU | ND_GB300_v6 |
| System Configuration | 18 x NDv6 VM instances in a single NVL72 rack |
| GPU | 4 x NVIDIA GB300 per VM (72 total) |
| GPU Memory | 189,471 MiB per GPU |
| GPU Power Limit | 1,400 W per GPU |
| Storage | 14 TB Local NVMe RAID per VM |
| LLM Inference Engine | NVIDIA TensorRT-LLM |
| Benchmark Harness | MLCommons MLPerf Inference v5.1 |
| Benchmark Scenario | Offline |
| Model | Llama2-70B |
| Precision | FP4 |
Table 2: ND GB300 v6 configuration for the MLPerf Inference test.
The model was run using FP4 precision, a form of quantization that significantly accelerates inference speed while maintaining high accuracy. This was implemented via the NVIDIA TensorRT-LLM library, a highly optimized, production-ready software stack for LLM inference.
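To see why FP4 matters at this scale, here is a rough weight-memory estimate for a 70B-parameter model at different precisions (weights only; KV cache and activations excluded):
awk 'BEGIN {
  p = 70e9   # Llama2 70B parameter count
  printf "FP16 weights: %.0f GB\n", p * 2.0 / 1e9   # 2 bytes per parameter
  printf "FP8  weights: %.0f GB\n", p * 1.0 / 1e9   # 1 byte per parameter
  printf "FP4  weights: %.0f GB\n", p * 0.5 / 1e9   # 4 bits per parameter
}'
Quartering the weight footprint relative to FP16 frees HBM for KV cache and reduces the memory traffic required per generated token, which is what drives the inference speedup.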
Azure is once again raising the bar for enterprise-scale AI inference with the ND GB300 v6 virtual machines.
How to replicate the results on a single virtual machine in Azure
Clone the repository and enter the working directory:
git clone https://github.com/Azure/AI-benchmarking-guide.git && cd AI-benchmarking-guide/Azure_Results
Download the models & datasets
- Create the models, data, and preprocessed_data directories in the working directory (see the sketch after this list).
- Download the Llama 2 70B model inside the models directory.
- Download the datasets inside the data directory.
- Prepare the datasets.
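A minimal sketch of that directory setup, assuming the working directory is AI-benchmarking-guide/Azure_Results (the model and dataset downloads themselves depend on your access to the MLCommons artifacts, so they are not reproduced here):
mkdir -p models data preprocessed_data
# Place the Llama 2 70B checkpoint under ./models and the raw datasets
# under ./data; the preprocessing step then populates ./preprocessed_data.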
Setup container
- Inside the working directory:
mkdir build && cd build
git clone https://github.com/NVIDIA/TensorRT-LLM.git TRTLLM
cd TRTLLM
- Edit TRTLLM/docker/Makefile lines 135 and 136:
SOURCE_DIR ?= AI-benchmarking-guide/Azure_Results (make sure it is an absolute path to the working directory)
CODE_DIR ?= /work
- Build & launch the container:
make -C docker build
make -C docker run
- Once inside the container, install TensorRT-LLM:
cd 1M_ND_GB300_v6_Inference/build/TRTLLM
python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "103-real" --no-venv --clean
pip install build/tensorrt_llm-1.1.0rc6-cp312-cp312-linux_aarch64.whl
- Inside the 1M_ND_GB300_v6_Inference directory, install MLPerf dependencies:
make clone_loadgen && make build_loadgen
git clone https://github.com/NVIDIA/mitten.git ./build/mitten && pip install build/mitten
pip install -r docker/common/requirements/requirements.llm.txt
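Before moving on, a quick sanity check (our addition, not part of the official steps) confirms the TensorRT-LLM wheel installed correctly:
python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"
# Expect the version of the wheel built above, e.g. 1.1.0rc6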
Setup and run benchmark
- Export env variables and link model & data directories:
export MLPERF_SCRATCH_PATH=/work
export SYSTEM_NAME=ND_GB300_v6
make link_dirs
- Run the offline benchmark (wait a few minutes after the run_llm_server command for the server to start):
make run_llm_server RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
make run_harness RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
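The rack-level record is the sum of the offline throughputs reported by the 18 VMs. A minimal aggregation sketch, assuming each per-VM MLPerf summary log contains a "Tokens per second" line (field names can vary between harness versions) and using hypothetical file names:
grep -h "Tokens per second" mlperf_log_summary_vm*.txt |
  awk '{ total += $NF } END { printf "aggregate: %.1f tokens/s\n", total }'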
All 18 log files from our run can be found here.
[1] Unverified MLPerf® v5.1 Inference Closed Llama 2 70B offline. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf specification for verified results. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information. Results obtained using NVIDIA MLPerf v5.1 code with NVIDIA TensorRT-LLM 1.1.0rc1.
[2] Verified result with ID 4.1-0043.