Azure High Performance Computing (HPC) Blog

A Quick Guide to Benchmarking AI models on Azure: Llama 405B and 70B with MLPerf Inference v5.1

Sep 09, 2025

by Mark Gitau (Software Engineer)

Introduction 

For the MLPerf Inference v5.1 submission, Azure shared performance results on the new ND GB200 v6 virtual machines. A single ND GB200 v6 VM on Azure is powered by two NVIDIA Grace CPUs and four NVIDIA Blackwell B200 GPUs. 

This document highlights Azure’s MLPerf Inference v5.1 results and outlines the steps to run these benchmarks on Azure. These MLPerf™ benchmark results demonstrate Azure’s commitment to providing our customers with the latest GPU offerings of the highest quality. 

Highlights from MLPerf Inference v5.1 benchmark results include: 

  • Azure had the highest Llama 2 70B Offline submission, at 52,000 tokens/s on a single ND GB200 v6 virtual machine. This is an 8% increase in single-node performance over our previous record and, extrapolated to the 72 GPUs of a full NVL72 rack (18 times the four GPUs in one VM), corresponds to 937,098 tokens/s. 
  • Azure's results for Llama 3.1 405B are on par with the best submissions (within 1%), cloud and on-premises alike, at 847 tokens/s. 

How to replicate the results in Azure 

Prerequisites: 
  • ND GB200 v6-series (single node): Deploy and set up a virtual machine on Azure (an example deployment sketch follows) 
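    As an illustration only, such a VM can be created with the Azure CLI. The resource group, VM name, region, image, and size below are placeholders or assumptions, not values taken from this post; confirm the exact ND GB200 v6 size name and a supported HPC image in the Azure portal before using them:
      # Hypothetical values - replace the region, image, and size with ones valid for your subscription
      az group create --name mlperf-rg --location <region-with-ND-GB200-v6>
      az vm create \
        --resource-group mlperf-rg \
        --name mlperf-gb200 \
        --image microsoft-dsvm:ubuntu-hpc:2204:latest \
        --size <ND_GB200_v6_size_name> \
        --admin-username azureuser \
        --generate-ssh-keys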
Set up the environment 
  • First, we need to export the path to the directory where we will perform the benchmarks. 
  • For ND GB200 v6-series (single node), create a directory called mlperf in /mnt/nvme: 
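      For example (assuming the local NVMe disk is mounted at /mnt/nvme, as referenced above):
      mkdir -p /mnt/nvme/mlperf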
  • Set the MLPerf scratch space:

      export MLPERF_SCRATCH_PATH=/mnt/nvme/mlperf
  • Clone the MLPerf repository inside the scratch path: 
      git clone https://github.com/mlcommons/inference_results_v5.1.git
  • Then create empty directories in your scratch space to house the data: 
      mkdir $MLPERF_SCRATCH_PATH/data $MLPERF_SCRATCH_PATH/models $MLPERF_SCRATCH_PATH/preprocessed_data 
Download the models & datasets 
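    This post does not list the download commands themselves. As a rough, unverified sketch: Azure's closed-division submission builds on NVIDIA's harness, whose Makefile in closed/Azure has in previous MLPerf Inference rounds provided download and preprocessing targets. The target names, the BENCHMARKS variable, and the benchmark identifiers below are assumptions; follow the README in closed/Azure of the cloned repository for the authoritative steps (these targets are typically run from inside the container built in the next step):
      cd $MLPERF_SCRATCH_PATH/inference_results_v5.1/closed/Azure
      # Hypothetical target and benchmark names - verify against the closed/Azure README
      make download_model BENCHMARKS="llama2-70b llama3.1-405b"
      make download_data BENCHMARKS="llama2-70b llama3.1-405b"
      make preprocess_data BENCHMARKS="llama2-70b llama3.1-405b"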
Build & launch MLPerf container 
  • Export the Submitter and System name: 
     export SUBMITTER=Azure SYSTEM_NAME=ND_GB200_v6 
  • Enter the container by changing into the closed/Azure directory of the repository and running: 
     make prebuild 
  • Inside the container, run  
     make build 
Build engines & run benchmarks 
  • Make sure you are still in the closed/Azure directory of the MLPerf repository 
  • To build the engines for both Llama 3.1 405B and Llama 2 70B: 
     make generate_engines RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server" 
  • To run the benchmarks for both Llama 3.1 405B and Llama 2 70B: 
     make run_harness RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server" 
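    The command above runs the performance measurement. If you also need an accuracy run, NVIDIA's harness has in previous rounds accepted a test-mode argument; the flag below is an assumption, so confirm it in the closed/Azure documentation:
      make run_harness RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server --test_mode=AccuracyOnly"
    Results and logs are typically written under the build/logs directory inside the container.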

MLPerf from MLCommons® 

MLCommons® is an open engineering consortium of AI leaders from academia, research, and industry whose mission is to “build fair and useful benchmarks” that provide unbiased evaluations of training and inference performance for hardware, software, and services, all conducted under predetermined conditions. MLPerf™ Inference benchmarks consist of compute-intensive AI workloads that simulate realistic system usage, making the results highly influential in technology buying decisions. 
