Azure High Performance Computing (HPC) Blog

A Quick Guide to Benchmarking AI models on Azure: Llama 405B and 70B with MLPerf Inference v5.1

Sep 09, 2025

by Mark Gitau (Software Engineer)

Introduction 

For the MLPerf Inference v5.1 submission, Azure shared performance results on the new ND GB200 v6 virtual machines. A single ND GB200 v6 VM on Azure is powered by two NVIDIA Grace CPUs and four NVIDIA Blackwell B200 GPUs. 

This document highlights Azure’s MLPerf Inference v5.1 results and outlines the steps to run these benchmarks on Azure. These MLPerf™ benchmark results demonstrate Azure’s commitment to providing our customers with the latest GPU offerings of the highest quality. 

Highlights from MLPerf Inference v5.1 benchmark results include: 

  • Azure had the highest Llama 2 70B Offline submission, at 52,000 tokens/s on a single ND GB200 v6 virtual machine. This is an 8% increase in single-node performance over our previous record and, extrapolated to the 72 GPUs of a full NVL72 rack (18 times the four GPUs in one VM), corresponds to 937,098 tokens/s. 
  • Azure's results for Llama 3.1 405B are on par with the best submissions (within 1%), cloud and on-premises alike, at 847 tokens/s. 

How to replicate the results in Azure 

Prerequisites: 
  • ND GB200 v6-series (single node): Deploy and set up a virtual machine on Azure (an example deployment sketch follows) 
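    As an illustration only, such a VM can be created with the Azure CLI. The resource group, VM name, region, image, and size below are placeholders or assumptions, not values taken from this post; confirm the exact ND GB200 v6 size name and a supported HPC image in the Azure portal before using them:
      # Hypothetical values - replace the region, image, and size with ones valid for your subscription
      az group create --name mlperf-rg --location <region-with-ND-GB200-v6>
      az vm create \
        --resource-group mlperf-rg \
        --name mlperf-gb200 \
        --image microsoft-dsvm:ubuntu-hpc:2204:latest \
        --size <ND_GB200_v6_size_name> \
        --admin-username azureuser \
        --generate-ssh-keys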
Set up the environment 
  • First, we need to export the path to the directory where we will perform the benchmarks. 
  • For ND GB200 v6-series (single node), create a directory called mlperf in /mnt/nvme: 
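      For example (assuming the local NVMe disk is mounted at /mnt/nvme, as referenced above):
      mkdir -p /mnt/nvme/mlperf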
  • Set the MLPerf scratch space:

      export MLPERF_SCRATCH_PATH=/mnt/nvme/mlperf
  • Clone the MLPerf repository inside the scratch path: 
      git clone https://github.com/mlcommons/inference_results_v5.1.git
  • Then create empty directories in your scratch space to house the data: 
      mkdir $MLPERF_SCRATCH_PATH/data $MLPERF_SCRATCH_PATH/models $MLPERF_SCRATCH_PATH/preprocessed_data 
Download the models & datasets 
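    This post does not list the download commands themselves. As a rough, unverified sketch: Azure's closed-division submission builds on NVIDIA's harness, whose Makefile in closed/Azure has in previous MLPerf Inference rounds provided download and preprocessing targets. The target names, the BENCHMARKS variable, and the benchmark identifiers below are assumptions; follow the README in closed/Azure of the cloned repository for the authoritative steps (these targets are typically run from inside the container built in the next step):
      cd $MLPERF_SCRATCH_PATH/inference_results_v5.1/closed/Azure
      # Hypothetical target and benchmark names - verify against the closed/Azure README
      make download_model BENCHMARKS="llama2-70b llama3.1-405b"
      make download_data BENCHMARKS="llama2-70b llama3.1-405b"
      make preprocess_data BENCHMARKS="llama2-70b llama3.1-405b"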
Build & launch MLPerf container 
  • Export the Submitter and System name: 
     export SUBMITTER=Azure SYSTEM_NAME=ND_GB200_v6 
  • Enter the container by changing into the closed/Azure directory of the repository and running: 
     make prebuild 
  • Inside the container, run  
     make build 
Build engines & run benchmarks 
  • Make sure you are still in the closed/Azure directory of the MLPerf repository 
  • To build the engines for both Llama 3.1 405B and Llama 2 70B: 
     make generate_engines RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server" 
  • To run the benchmarks for both Llama 3.1 405B and Llama 2 70B: 
     make run_harness RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server" 
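    The command above runs the performance measurement. If you also need an accuracy run, NVIDIA's harness has in previous rounds accepted a test-mode argument; the flag below is an assumption, so confirm it in the closed/Azure documentation:
      make run_harness RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server --test_mode=AccuracyOnly"
    Results and logs are typically written under the build/logs directory inside the container.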

MLPerf from MLCommons® 

MLCommons® is an open engineering consortium of AI leaders from academia, research, and industry whose mission is to “build fair and useful benchmarks” that provide unbiased evaluations of training and inference performance for hardware, software, and services, all conducted under predetermined conditions. MLPerf™ Inference benchmarks consist of compute-intensive AI workloads that simulate realistic system usage, making the results highly influential in technology buying decisions. 
