Achieving peak AI performance requires both cutting-edge hardware and a finely optimized infrastructure. Azure’s ND GB200 v6 Virtual Machines, accelerated by NVIDIA GB200 Blackwell GPUs, have already demonstrated record inference performance of 865,000 tokens/s on the industry-standard LLAMA 2 70B model.
In our previous deep dive on the performance of the ND GB200 v6 virtual machines, we explored the architectural improvements of the NVIDIA GB200 NVL72 one component at a time. Azure’s ND GB200 v6 virtual machines demonstrate performance gains across key hardware metrics such as GEMM efficiency, high-bandwidth memory (HBM) throughput, NVLink connectivity, and NCCL communication. Key findings from our benchmarks showed how the ND GB200 v6’s 2.5x theoretical performance increase over the ND H100 v5 translated into a sustained 2,744 TFLOPS of FP8 throughput. Additionally, we measured 7.35 TB/s of HBM bandwidth utilization (92% efficiency) and 4x faster CPU-to-GPU transfer speeds thanks to NVLink C2C. Based on those early results, we projected a “3x speedup on average for end-to-end training and inference workloads.”
Building on that analysis, we are now excited to share a new world record in AI inference using the LLAMA 2 70B model from MLPerf Inference v4.1 on Azure. LLAMA 2 70B has emerged as an industry-standard model for large-scale AI deployments, widely adopted for its balance of capability, efficiency, and open availability, making it a key benchmark for measuring real-world inference performance. Representative of enterprise inference workloads, this benchmark showcases how the ND GB200 v6 delivers state-of-the-art throughput: 865,000 tokens/sec measured on one NVIDIA GB200 NVL72 on Azure as an unverified MLPerf v4.1 submission [1].
To better reflect real-world customer use cases, we deployed the LLAMA 2 70B model in parallel on the 18 ND GB200 v6 virtual machines of one NVL72 rack, simulating human interactions with an AI system. Our benchmarking runs achieved an average throughput of 48,088 tokens per second (+/- 2%) per VM, translating to 12,022 tokens per second per NVIDIA GB200 Blackwell GPU.
For comparison, the latest MLPerf Inference v4.1 results show that the NVIDIA DGX H100 system processed 24,525 tokens per second across 8 GPUs, or 3,066 tokens per second per NVIDIA H100 GPU [2]. This means that Azure’s ND GB200 v6 delivers 3.9× higher throughput than the previous-generation ND H100 v5 virtual machines for this workload, setting a new standard for enterprise-scale AI inference.
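The figures above can be cross-checked with a bit of arithmetic, using only the numbers quoted in this post (a quick sanity check, not part of the benchmark harness):

# per-GPU throughput: 48,088 tokens/s per VM, 4 GB200 GPUs per ND GB200 v6 VM
python3 -c "print(48088 / 4)"                      # ~12,022 tokens/s per GPU
# rack-level throughput: 18 VMs per NVL72 rack
python3 -c "print(48088 * 18)"                     # ~865,584 tokens/s per rack
# per-GPU speedup vs. ND H100 v5 (24,525 tokens/s across 8 H100 GPUs)
python3 -c "print(round(12022 / (24525 / 8), 1))"  # ~3.9x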
With the GB200 NVL72, the new computing unit introduced by NVIDIA is the rack: 18 ND GB200 v6 virtual machines operating as one NVL72 entity. At that scale, the ND GB200 v6 virtual machines demonstrate 9x the performance of one rack of ND H100 v5.
How to replicate the results in Azure
# obtain the container
The results were run with the triton_trtllm_v0.18.0dev_repro-v4.1 container for MLPerf Inference, made available by NVIDIA on NVIDIA NGC.
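If the image is not already present locally, it can be pulled from NGC first. The registry path and tag below are placeholders; replace them with the exact values shown on the container’s NGC catalog page for your organization:

# pull the MLPerf reproduction image from NGC (placeholder path/tag)
docker pull <registry>/<repository>:<tag>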
# start the container
docker run -it --rm --gpus all -v <host dir>:<container dir> <container image>
# build the engine
python3 -m tensorrt_llm.commands.build \
--workers=1 \
--max_batch_size=2048 \
--max_beam_width=1 \
--kv_cache_type=paged \
--remove_input_padding=enable \
--multiple_profiles=enable \
--use_fused_mlp=enable \
--context_fmha=enable \
--use_fp8_context_fmha=enable \
--use_paged_context_fmha=enable \
--max_num_tokens=3584 \
--max_input_len=1024 \
--max_seq_len=2048 \
--tokens_per_block=32 \
--gemm_plugin=nvfp4 \
--checkpoint_dir=/workspace/mlperf/llama2-70b-chat-hf-tp1pp1-fp4 \
--output_dir=/workspace/mlperf/llama2-70b-trt-norm-quant-engine
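Once the build completes, it is worth confirming that the engine artifacts were written before wiring up Triton (a quick check, assuming the typical TensorRT-LLM output layout of a config.json plus per-rank engine files):

# verify the engine artifacts were produced
ls /workspace/mlperf/llama2-70b-trt-norm-quant-engine
# typical TensorRT-LLM layout: config.json and rank0.engine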
# switch to the launch directory for the Triton servers
cd /workspace/mlperf/inference_results_v4.1/closed/NVIDIA/gb200_repro
# edit the config.pbtxt file to point to the TensorRT engine built above
vim /workspace/mlperf/inference_results_v4.1/closed/NVIDIA/triton_repos/llama2_offline_nvl4_single/model-0/config.pbtxt
# change line 101 to the following
string_value: "/workspace/mlperf/llama2-70b-trt-norm-quant-engine"
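For a scripted, non-interactive alternative to the vim edit, the same change can be applied with sed (assuming line 101 still holds the engine-path string_value in your copy of the config):

# replace the engine path on line 101 of config.pbtxt in place
sed -i '101s|string_value: ".*"|string_value: "/workspace/mlperf/llama2-70b-trt-norm-quant-engine"|' \
  /workspace/mlperf/inference_results_v4.1/closed/NVIDIA/triton_repos/llama2_offline_nvl4_single/model-0/config.pbtxt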
# start triton servers
bash run_offline_nvl4.sh
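Before launching the harness, a quick readiness probe can confirm that the servers came up. This sketch assumes at least one Triton instance is listening on the default HTTP port 8000; the ports used by run_offline_nvl4.sh may differ:

# check that a Triton server instance reports ready on its HTTP endpoint
curl -sf http://localhost:8000/v2/health/ready && echo "Triton is ready"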
# run inference
python3 /workspace/mlperf/inference_results_v4.1/closed/NVIDIA/code/harness/harness_triton_llm/main.py \
--logfile_outdir=/workspace/mlperf/temp_logs_compute000014 \
--logfile_prefix=mlperf_log_ \
--tensor_path=/workspace/open_orca/input_ids_padded.npy,/workspace/open_orca/input_lens.npy \
--llm_gen_config_path=code/llama2-70b/tensorrt/generation_config.json \
--use_token_latencies=true \
--mlperf_conf_path=gb200_conf_files/Offline/mlperf.conf \
--user_conf_path=gb200_conf_files/Offline/user_x1.conf \
--scenario Offline \
--model llama2-70b \
--num_gpus 4 \
--skip_server_spawn \
--num_clients_per_gpu 4 \
--num_frontends_per_gpu 8
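After the run finishes, the LoadGen output lands in the directory passed via --logfile_outdir; the measured throughput can be read from the summary file (assuming the standard LoadGen file naming with the mlperf_log_ prefix used above):

# inspect the LoadGen summary for the measured throughput
grep -i "per second" /workspace/mlperf/temp_logs_compute000014/mlperf_log_summary.txt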
[1] Unverified MLPerf® v4.1 Inference Closed Llama 2 70B Offline. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf specification for verified results. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information. Results obtained using NVIDIA MLPerf v4.1 code with TensorRT-LLM 0.18.0.dev.
[2] Verified result with ID 4.1-0043.