
Azure High Performance Computing (HPC) Blog

Breaking the Million-Token Barrier: The Technical Achievement of Azure ND GB300 v6

Nov 03, 2025

Azure ND GB300 v6 Virtual Machines with NVIDIA GB300 NVL72 rack-scale systems achieve unprecedented performance of 1,100,000 tokens/s on Llama2 70B inference, beating the previous Azure ND GB200 v6 record of 865,000 tokens/s by 27%.

By Mark Gitau (Software Engineer) and Hugo Affaticati (Senior Cloud Infrastructure Engineer)

The new Azure ND GB300 v6 virtual machines, built on the cutting-edge NVIDIA Blackwell architecture introduced with the ND GB200 v6, are further optimized for inference workloads, with 50% more GPU memory and a 16% higher TDP (Thermal Design Power).

To demonstrate the performance gains customers can expect from the ND GB300 v6 virtual machines on their own workloads, we ran the Llama2 70B model from MLPerf Inference v5.1 on each of the 18 ND GB300 v6 virtual machines in a single NVIDIA GB300 NVL72 domain. The Llama2 70B model is a widely adopted industry standard for large-scale AI deployments, making it a good representation of production inference workloads. One NVL72 rack of Azure ND GB300 v6 achieved an aggregate 1,100,000 tokens/s in an unverified MLPerf Inference v5.1 submission [1], detailed in Table 1 and observed by Signal65. This is a new record in AI inference, beating our own previous record of 865,000 tokens/s set on one NVIDIA GB200 NVL72 rack with ND GB200 v6 VMs.

 

Metric                             Performance (Tokens/Second)
Total Aggregated Throughput        1,100,948.3
Maximum Single-Node Throughput     62,803.9
Minimum Single-Node Throughput     57,599.1
Average Single-Node Throughput     61,163.8
Median Single-Node Throughput      61,759.1

Table 1: Data compiled from 18 parallel test runs, observed by Signal65.

 

This translates to 15,200 tokens/s per NVIDIA Blackwell Ultra GPU (+/- 5%), a 27% speedup over the 12,022 tokens/s per NVIDIA Blackwell GPU achieved on ND GB200 v6. For comparison, earlier MLPerf Inference v4.1 results show that the NVIDIA DGX H100 system processed 24,525 tokens per second across 8 GPUs, or 3,066 tokens per second per NVIDIA H100 GPU [2]. This means Azure ND GB300 v6 VMs deliver roughly 5x higher throughput per GPU than the previous-generation ND H100 v5 virtual machines.
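The per-GPU figures follow from dividing each rack's aggregate throughput across its 72 GPUs (8 GPUs for the DGX H100); a quick check of the arithmetic:

awk 'BEGIN { printf "GB300: %.1f tokens/s/GPU\n", 1100948.3 / 72 }'   # ~15,291
awk 'BEGIN { printf "GB200: %.1f tokens/s/GPU\n", 865000 / 72 }'      # ~12,014 (from the rounded aggregate)
awk 'BEGIN { printf "H100:  %.1f tokens/s/GPU\n", 24525 / 8 }'        # ~3,066
awk 'BEGIN { printf "Ratio: %.1fx\n", (1100948.3/72) / (24525/8) }'   # ~5.0x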

This milestone was observed by the independent third party Signal65, which concluded that this record "of over 1.1 million tokens per second on a single Azure rack is more than a benchmark record; it is a definitive proof point that the performance required for large-scale, transformative AI is now available as a reliable, efficient, and resilient utility".

The Azure ND GB300 v6 virtual machine configuration (Table 2) benefits from performance gains across key hardware components, such as GEMM efficiency, high-bandwidth memory (HBM) throughput, NVLink connectivity, and NCCL communication. Our benchmarks show that ND GB300 v6 achieves 2.5x higher GEMM TFLOPS per GPU than ND H100 v5. We also measured 7.37 TB/s of HBM bandwidth (92% efficiency) and 4x faster CPU-to-GPU transfers thanks to NVLink C2C.
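On a running ND GB300 v6 VM, the per-GPU memory and power limit listed in Table 2 can be confirmed directly, assuming the NVIDIA driver and nvidia-smi are present (the comment shows what Table 2 leads us to expect, not captured output):

nvidia-smi --query-gpu=name,memory.total,power.limit --format=csv
# Per Table 2, each of the 4 GPUs should report ~189471 MiB and a 1400 W power limit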

 

Component               Specification
Cloud Platform          Microsoft Azure
VM Instance SKU         ND_GB300_v6
System Configuration    18 x NDv6 VM instances in a single NVL72 rack
GPU                     4 x NVIDIA GB300 per VM (72 total)
GPU Memory              189,471 MiB per GPU
GPU Power Limit         1,400 Watts
Storage                 14 TB Local NVMe RAID per VM
LLM Inference Engine    NVIDIA TensorRT-LLM
Benchmark Harness       MLCommons MLPerf Inference v5.1
Benchmark Scenario      Offline
Model                   Llama2-70B
Precision               FP4

Table 2: ND GB300 v6 configuration for the MLPerf Inference test

The model was run in FP4 precision, a form of quantization that significantly accelerates inference while maintaining high accuracy. This was implemented via the NVIDIA TensorRT-LLM library, a highly optimized, production-ready software stack for LLM inference.
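As a sketch of how an FP4 checkpoint is typically produced, TensorRT-LLM ships a quantization example script; the script path and the nvfp4 format flag below are assumptions that vary across TensorRT-LLM releases, not the exact command behind this submission:

# Illustrative sketch only: quantize a Llama2-70B checkpoint to NVFP4
# (script path and flags are assumptions; check your TensorRT-LLM release)
python3 examples/quantization/quantize.py \
    --model_dir ./llama2-70b-hf \
    --qformat nvfp4 \
    --output_dir ./llama2-70b-nvfp4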

Azure is once again raising the bar for enterprise-scale AI inference with the ND GB300 v6 virtual machines.

How to replicate the results on a single virtual machine in Azure

Clone the repository and enter the working directory:

git clone https://github.com/Azure/AI-benchmarking-guide.git && cd AI-benchmarking-guide/Azure_Results

Download the model and datasets (follow the MLCommons MLPerf Inference instructions for Llama2-70B to obtain the model checkpoint and the OpenOrca dataset)

Set up the container

  • Inside the working directory:
     mkdir build && cd build
     git clone https://github.com/NVIDIA/TensorRT-LLM.git TRTLLM
     cd TRTLLM
  • Edit TRTLLM/docker/Makefile lines 135 and 136:
     SOURCE_DIR ?= AI-benchmarking-guide/Azure_Results (make sure it is an absolute path to the working directory)
     CODE_DIR ?= /work
  • Build & launch the container:
     make -C docker build
     make -C docker run
  • Once inside the container, install TensorRT-LLM (the import check after this list verifies the install):
     cd 1M_ND_GB300_v6_Inference/build/TRTLLM
     python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "103-real" --no-venv --clean
     pip install build/tensorrt_llm-1.1.0rc6-cp312-cp312-linux_aarch64.whl
  • Inside the 1M_ND_GB300_v6_Inference directory, install MLPerf dependencies:
     make clone_loadgen && make build_loadgen
     git clone https://github.com/NVIDIA/mitten.git ./build/mitten && pip install build/mitten
     pip install -r docker/common/requirements/requirements.llm.txt
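
Before moving on, a quick sanity check that the wheel installed correctly (the version printed should match the wheel built above):

     python3 -c "import tensorrt_llm; print(tensorrt_llm.__version__)"   # expect 1.1.0rc6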

Set up and run the benchmark

  • Export env variables and link model & data directories:
     export MLPERF_SCRATCH_PATH=/work
     export SYSTEM_NAME=ND_GB300_v6
     make link_dirs
  • Run the Offline benchmark (after the run_llm_server command, wait a few minutes for the server to start before launching the harness; the snippet after this list shows where the result is reported):
     make run_llm_server RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
     make run_harness RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"
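
Once the harness completes, loadgen writes a per-run summary file with the measured throughput. The directory layout below is an assumption based on the harness defaults; search your run's logs if it differs:

     # Locate the measured throughput in the loadgen summary (path may vary by run)
     grep -R "Tokens per second" build/logs --include=mlperf_log_summary.txt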

Find all 18 log files of our run here.

 

[1] Unverified MLPerf® v5.1 Inference Closed Llama 2 70B offline. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf specification for verified results. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information. Results obtained using NVIDIA MLPerf v5.1 code with NVIDIA TensorRT-LLM 1.1.0rc1.

[2] Verified result with ID 4.1-0043.

 

Updated Nov 04, 2025
Version 4.0