Azure High Performance Computing (HPC) Blog

DGX Cloud Benchmarking on Azure

JingchaoZhang, Microsoft
May 05, 2025

NVIDIA DGX Cloud benchmarking provides a standardized framework for evaluating the performance of large-scale AI workloads using containerized recipes. Each recipe targets a specific workload and supports flexible configuration across cluster scales and numerical precisions. Benchmarks are compared against NVIDIA’s Reference Architecture, offering a reliable baseline for performance evaluation in real-world settings.

In this white paper, we present our DGX Cloud benchmarking results conducted on Azure, scaling from 8 to 1024 NDv5 H100 (Standard_ND96isr_H100_v5) GPUs. Our results show large language model (LLM) training performance that is comparable to NVIDIA’s published benchmarks, demonstrating the exceptional scalability and efficiency of Azure’s cloud infrastructure for LLM workloads.

This white paper is organized to guide readers through the complete benchmarking workflow on Azure. We begin by detailing the provisioning of a Slurm cluster integrated with Azure Managed Lustre File System (AMLFS) using CycleCloud Workspace for Slurm (CCWS). Following cluster deployment, we describe a series of validation steps performed to ensure system readiness for large-scale training, including single-node health checks (NHC), NCCL all-reduce bandwidth tests, and GPU thermal screening. We then present key optimization strategies for configuring the training environment on Azure to achieve high performance and stability. Finally, we share our LLM benchmark results, highlighting both scaling efficiency and throughput, and comparing them with NVIDIA's DGX Cloud reference baselines.

Slurm cluster provisioning

The DGX Cloud Performance Recipes currently require a Slurm-based environment to orchestrate benchmarking workloads. Specifically, for DGX benchmarking suite 25.02, the cluster must satisfy several prerequisites to ensure compatibility and correctness of the benchmarking process. These requirements include: SLURM version 22.x or newer, Bash 4.2+, Python 3.10.12+, CUDA 12.3+, NVIDIA driver 535.129.03 or newer, OFED 5.9+, NCCL 2.19.4+, Enroot, and access to the NVIDIA GPU Cloud (NGC) registry. Once the environment is prepared, users can download a model-specific recipe from NGC.
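
Before downloading a recipe, it is worth confirming that a login or compute node actually meets these minimums. Below is a minimal sketch of such a check; the exact commands available will depend on how your image was built.

#!/bin/bash
# Quick prerequisite sanity check for the DGX benchmarking suite.
# Compare each reported version against the minimums listed above.
sinfo --version                                       # Slurm 22.x or newer
bash --version | head -n 1                            # Bash 4.2+
python3 --version                                     # Python 3.10.12+
nvcc --version | grep release                         # CUDA 12.3+
nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n 1   # driver 535.129.03+
ofed_info -s                                          # OFED 5.9+
ldconfig -p | grep libnccl                            # NCCL 2.19.4+ (check the linked library)
enroot version                                        # Enroot available for containerized runs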

To simplify the provisioning of this complex software stack and its dependencies, we used Azure CycleCloud Workspace for Slurm (CCWS) - an Azure Marketplace solution template that automates the deployment of a pre-configured Slurm cluster tailored for AI/HPC workloads. CCWS greatly reduces the operational overhead by automating infrastructure setup tasks such as virtual network creation, node pool definition, file system integration, and Slurm scheduler configuration. The resulting environment comes pre-configured with PMIx v4, Pyxis, and Enroot to support container-based workflows, aligning with the requirements of NVIDIA’s containerized benchmarking recipes.

For our benchmarking, we selected the Standard_ND96isr_H100_v5 VM SKU integrated with Azure Managed Lustre File System (AMLFS) as the high-performance shared storage backend. AMLFS provides the necessary I/O throughput required to feed data into multi-GPU training jobs at scale. It is important to note that AMLFS must be available in the target region to enable this setup. We leveraged the Azure HPC Ubuntu-HPC 22.04 image (microsoft-dsvm:ubuntu-hpc:2204:latest), which comes pre-installed with optimized GPU and networking drivers. These images are maintained by Microsoft’s Azure HPC/AI team and include a curated set of libraries and tools for high-performance and AI workloads, reducing the time and complexity of preparing nodes for production use.

By combining CCWS with AMLFS and the Azure HPC image, we were able to stand up a Slurm cluster fully compliant with NVIDIA’s benchmarking prerequisites in a repeatable and automated fashion.

Node and cluster validations

Ensuring cluster readiness is essential prior to executing large-scale LLM training benchmarks. Our validation process on Azure covered three areas: node health checks, NCCL all-reduce performance testing, and GPU thermal screening. This systematic approach enabled early detection of hardware or software inconsistencies and ensured a consistent performance baseline. To obtain more consistent results, we recommend enabling persistence mode and locking the GPU clock frequency on all H100 nodes:

sudo nvidia-smi -pm 1 
sudo nvidia-smi -lgc 1980 --mode 1
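
On a cluster, the same two commands can be pushed to every compute node in one step, for example with a short srun sweep. This is a sketch only: the partition name and node count are assumptions, and passwordless sudo is assumed on the compute nodes.

# Apply persistence mode and the clock lock across all nodes of the 'gpu' partition.
srun -p gpu -N <number_of_nodes> --ntasks-per-node=1 bash -c \
  'sudo nvidia-smi -pm 1 && sudo nvidia-smi -lgc 1980 --mode 1'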

Node Health Checks

We used AzureHPC Node Health Checks (AzNHC) to validate node-level functionality. This solution builds on the well-established LBNL NHC framework and adds Azure-specific hardware validation for various HPC and AI VM SKUs, including the NDv5 H100-series used in this benchmark. AzNHC runs inside a Docker container and can be invoked directly via a wrapper script:

sudo /opt/azurehpc/test/azurehpc-health-checks/run-health-checks.sh

AzNHC provides targeted tests per SKU, including checks for GPU presence, NVLink integrity, ECC memory errors, GPU bandwidth (device-to-host/host-to-device), InfiniBand throughput (GDR and non-GDR), topology, and NCCL intra-node all-reduce performance. For our validation, we used a distributed Slurm job to execute health checks across all compute nodes:

#!/bin/bash 
#SBATCH --job-name=health_check 
#SBATCH -p gpu 
#SBATCH --mem=0 
#SBATCH --ntasks-per-node=1 
#SBATCH --output=job_%J.out 
#SBATCH --error=job_%J.err 

srun --ntasks-per-node=1 --exclusive bash -c '
  LOG_FILE="$(hostname).log"
  sudo bash /opt/azurehpc/test/azurehpc-health-checks/run-health-checks.sh > "$LOG_FILE" 2>&1
  echo "### InfiniBand Firmware Version ###" >> "$LOG_FILE"
  cat /sys/class/infiniband/mlx5_ib*/fw_ver >> "$LOG_FILE" 2>&1
'

The firmware version check ensures consistent InfiniBand HCA firmware across nodes, which is critical for stable NCCL performance. Any failing nodes were drained and replaced prior to proceeding.
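
As an illustration, the per-node logs produced by the job above can be scanned for failed checks or firmware mismatches, and the affected nodes drained in Slurm. The sketch below assumes the log naming from the script above and uses a hypothetical reference firmware string.

#!/bin/bash
# Flag nodes whose health check reported errors or whose HCA firmware deviates
# from the fleet-wide reference, then drain them in Slurm for replacement.
EXPECTED_FW="28.39.1002"   # hypothetical reference firmware version for the fleet
for log in *.log; do
  node="${log%.log}"
  if grep -qi "error" "$log" || ! grep -q "$EXPECTED_FW" "$log"; then
    echo "Draining $node"
    sudo scontrol update nodename="$node" state=DRAIN reason="health/firmware check failed"
  fi
done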

NCCL All-Reduce Validation

Following node-level validation, we verified inter-node GPU communication performance using NCCL all-reduce tests. The Azure HPC image includes a prebuilt version of the NCCL test suite under /opt/nccl-tests/build/. The all-reduce test was run at full scale using MPI, testing collective bandwidth across all GPUs:

mpirun -np $(( SCALE * DEVICES )) \ 
--map-by ppr:8:node \ 
-hostfile $HOSTFILE \ 
-mca plm_rsh_no_tree_spawn 1 \ 
-mca plm_rsh_num_concurrent 800 \ 
-mca coll_hcoll_enable 0 \ 
-x LD_LIBRARY_PATH \ 
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \ 
-x UCX_TLS=rc \ 
-x UCX_NET_DEVICES=mlx5_ib0:1 \ 
-x NCCL_SOCKET_IFNAME=eth0 \ 
-x NCCL_DEBUG=WARN \ 
-x NCCL_MIN_NCHANNELS=32 \ 
-x NCCL_IB_QPS_PER_CONNECTION=4 \ 
-x NCCL_P2P_NET_CHUNKSIZE=$((512*1024)) \ 
-x NCCL_PXN_DISABLE=1 \ 
-x NCCL_TOPO_FILE=/opt/microsoft/ndv5-topo.xml \ 
-x NCCL_IGNORE_CPU_AFFINITY=1 \ 
/opt/nccl-tests/build/all_reduce_perf -i 0 -g 1 -t 1 -b 8G -e 8G -f 0 -R 1

We configured the NCCL environment for optimal collective performance, including CollNet/NVLS, GDR, and relaxed PCI ordering. If aggregate bandwidth deviated from the expected baselines, we performed a binary search over the node list together with pairwise NCCL tests to isolate underperforming nodes. This method quickly identifies outliers that degrade collective performance.
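
When the full-scale bandwidth comes in low, a practical follow-up is to split the host list and rerun the test on each half (the binary search), then sweep two-node pairs over the suspect subset. The sketch below shows the pairwise pass; the partition name, hostfile, and acceptance threshold are assumptions, and it relies on the MPI-enabled NCCL tests shipped in the Azure HPC image.

#!/bin/bash
# Run a two-node NCCL all-reduce between consecutive node pairs and flag any
# pair whose average bus bandwidth falls below a chosen threshold.
THRESHOLD_GBPS=400                  # hypothetical acceptance threshold per pair
mapfile -t NODES < hostfile.txt     # one hostname per line

for ((i = 0; i + 1 < ${#NODES[@]}; i += 2)); do
  pair="${NODES[i]},${NODES[i+1]}"
  bw=$(srun -p gpu -w "$pair" -N 2 --ntasks-per-node=8 --gpus-per-node=8 --mpi=pmix \
         /opt/nccl-tests/build/all_reduce_perf -b 8G -e 8G -f 0 -g 1 \
       | awk '/Avg bus bandwidth/ {print $NF}')
  echo "$pair : ${bw} GB/s"
  if awk -v b="$bw" -v t="$THRESHOLD_GBPS" 'BEGIN { exit (b >= t) }'; then
    echo "  -> below threshold, investigate this pair"
  fi
done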

GPU Thermal Screening

Finally, to mitigate the risk of thermal throttling during extended training runs, we conducted GPU thermal screening using synthetic GEMM workloads. We used the dcgmproftester12 tool to stress GPUs and monitored thermal behavior using nvidia-smi:

dcgmproftester12 --no-dcgm-validation --max-processes=0 -t 1004 -d 900 
nvidia-smi -q -d PERFORMANCE

We verified that all GPUs remained below their thermal thresholds and no Throttle or TLimit events occurred. Any nodes failing thermal screening were marked as DRAIN in Slurm and replaced to maintain thermal headroom during full-scale training.
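
A rough per-node sketch of how this screening can be scripted is shown below; the grep patterns follow the field names reported by nvidia-smi -q -d PERFORMANCE and may need adjusting for a given driver version.

#!/bin/bash
# Stress the GPUs with the Tensor Core GEMM test in the background and poll
# nvidia-smi for active thermal-slowdown events while the load is applied.
dcgmproftester12 --no-dcgm-validation --max-processes=0 -t 1004 -d 900 > gemm_stress.log 2>&1 &
STRESS_PID=$!

THROTTLED=0
while kill -0 "$STRESS_PID" 2>/dev/null; do
  if nvidia-smi -q -d PERFORMANCE | grep "Thermal Slowdown" | grep -q ": Active"; then
    THROTTLED=1
  fi
  sleep 10
done

if [ "$THROTTLED" -eq 1 ]; then
  echo "$(hostname): thermal throttling detected, draining node"
  sudo scontrol update nodename="$(hostname -s)" state=DRAIN reason="thermal screening failure"
else
  echo "$(hostname): thermal screening passed"
fi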

Optimizing the DGX benchmarks on Azure

Because Azure differs from NVIDIA's reference environment in virtualization, network fabric, and topology awareness, the benchmarks require tailored optimizations at both the system and the model level.

System-Level Optimizations

The key system-level optimization is ensuring proper CPU, GPU, and NIC affinity using an NCCL topology file, which is included as part of the Azure HPC VM image. It defines the mapping between NUMA nodes, GPUs, and network interfaces, allowing the NCCL library to assign communication threads to the correct CPU cores.

This file must be explicitly mounted into the container and passed to the job via the NCCL_TOPO_FILE environment variable. Additionally, setting NCCL_IGNORE_CPU_AFFINITY=1 ensures that NCCL ignores MPI’s default CPU binding and relies solely on the topology file for affinity decisions. This configuration is crucial for low-latency communication using NCCL’s LL (Low-Latency) protocol, which transfers small and medium messages via pinned CPU buffers. Without proper CPU-GPU affinity, inter-NUMA communication introduces significant performance degradation.

Further NCCL tuning on Azure includes the following recommended settings:

Variable | Value | Description
NCCL_TOPO_FILE | /opt/microsoft/ndv5-topo.xml | Ensures NUMA-aware GPU/NIC/CPU mapping
NCCL_P2P_NET_CHUNKSIZE | 2097152 (2 MB) | Increases P2P transfer granularity
NCCL_MIN_NCHANNELS | 32 | Improves throughput for collectives such as ReduceScatter
NCCL_IB_QPS_PER_CONNECTION | 4 | Improves InfiniBand queue performance
NCCL_PXN_DISABLE | 1 | Enables the zero-copy design for NCCL P2P
NCCL_IGNORE_CPU_AFFINITY | 1 | Ensures NCCL bindings override MPI affinity

The Slurm srun command must also include --cpu-bind=mask_cpu:"..." to specify optimal per-rank CPU binding based on the topology file. A complete job submission example is shown below:

export NCCL_TOPO_FILE=/opt/microsoft/ndv5-topo.xml 
export NCCL_P2P_NET_CHUNKSIZE=2097152 

srun --container-image ${IMAGE} \ 
--container-writable \ 
--container-mounts ${NCCL_TOPO_FILE},${DATA_DIR}:/datasets/,${RESULT_DIR},$INDEX_MAPPING_DIR,${STAGE_PATH}/cfg:/cfg/ \ 
--container-env=NCCL_TOPO_FILE,NCCL_P2P_NET_CHUNKSIZE \ 
--cpu-bind=mask_cpu:"fff,fff000,fff000000,fff000000000,..." \ 
--no-container-mount-home \ 
<launcher-script>

 

Model-Level Parameter Tuning

In addition to system configuration, optimizing the model parallelism parameters was essential for certain LLMs. Specifically, reducing the virtual_pipeline_model_parallel_size in NeMo Megatron 175B from 12 to 2 significantly reduced the number of ncclSendRecv operations, improving communication overlap with computation. This reduced CPU and GPU time spent in key CUDA kernels (ncclSendRecv, kuserbuffers_pushrecv, etc.), resulting in a step time improvement.

Similarly, for Llama 3.1 70B, the context_model_parallel_size was reduced from 2 to 1. This change eliminated context-parallel all-gather operations, reducing skew-induced delays in the downstream tensor-parallel reduce-scatter phase. The effective data_parallel_size was also increased, enabling more efficient batch processing.
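
For illustration, with the Hydra-style NeMo launchers used by the recipes, such changes are typically expressed as configuration overrides at launch time. The exact key names and override mechanism vary by recipe and NeMo version, so the lines below are assumptions rather than verbatim recipe changes.

# Hypothetical Hydra-style overrides appended to the recipes' training commands;
# the exact config keys depend on the recipe and NeMo version in use.

# NeMo Megatron 175B: fewer virtual pipeline stages to reduce ncclSendRecv volume
TRAINING_OVERRIDES_175B="model.virtual_pipeline_model_parallel_size=2"

# Llama 3.1 70B: drop context parallelism so the CP all-gather disappears and
# the effective data-parallel size grows accordingly
TRAINING_OVERRIDES_LLAMA70B="model.context_parallel_size=1"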

Benchmark results

Table 1. Nemotron 15B fp8

Number of GPUs | 16 | 32 | 64 | 128 | 256 | 512 | 1024
Azure | 2.09 | 2.11 | 2.15 | 2.11 | 2.13 | 2.18 | 2.2

Table 2. Nemotron 15B bf16

Number of GPUs | 16 | 32 | 64 | 128 | 256 | 512 | 1024
Azure | 2.9 | 2.91 | 2.92 | 2.95 | 2.94 | 2.98 | 3.03

Table 3. Nemotron 340B fp8

Number of GPUs | 128 | 256 | 512 | 1024
Azure | 3.23 | 3.29 | 3.36 | 3.44

Table 4. Nemotron 340B bf16

Number of GPUs | 128 | 256 | 512 | 1024
Azure | 4.71 | 4.77 | 4.81 | 4.87

Table 5. Llama3.1 8B fp8

Number of GPUs | 8 | 16 | 32 | 64 | 128
Azure | 9.95 | 10.1 | 10.1 | 10.1 | 10.2

Table 6. Llama3.1 8B bf16

Number of GPUs | 8 | 16 | 32 | 64 | 128
Azure | 13.3 | 13.2 | 13.3 | 13.3 | 13.5

Table 7. Llama3.1 70B fp8

Number of GPUs | 64 | 128 | 256 | 512 | 1024
Azure | 10.6 | 10.6 | 10.7 | 10.7 | 10.8

Table 8. Llama3.1 70B bf16

Number of GPUs | 64 | 128 | 256 | 512 | 1024
Azure | 14.9 | 15 | 15.2 | 15.3 | 15.3

Table 9. Llama3.1 405B fp8

Number of GPUs | 32 | 96 | 192 | 576 | 1152
Azure | 5.97 | 5.92 | 11.7 | 11.8 | 12

Table 10. Llama3.1 405B bf16

Number of GPUs | 32 | 96 | 192 | 576 | 1152
Azure | 8.99 | 9.01 | 17.9 | 18 | 18.1

Table 11. GPT3 175B fp8

Number of GPUs | 128 | 256 | 512 | 1024
Azure | 6.1 | 6.04 | 6.08 | 6.15

Table 12. GPT3 175B bf16

Number of GPUs | 128 | 256 | 512 | 1024
Azure | 9.29 | 9.31 | 9.44 | 9.36

Table 13. GROK 314B fp8

Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024
Azure | 19.4 | 17 | 8.39 | 16.6 | 17 | 16.8 | 17.9 | 18.4

Table 14. GROK 314B bf16

Number of GPUs | 8 | 16 | 32 | 64 | 128 | 256 | 512 | 1024
Azure | 24.1 | 21.9 | 10.7 | 21 | 21.2 | 20.8 | 22 | 22.4

Table 15. Maxtext Llama2 70B fp8

Number of GPUs | 64 | 128 | 256 | 512 | 1024
Azure | 4.361 | 4.505 | 4.881 | 4.769 | 4.867

Table 16. Maxtext Llama2 70B bf16

Number of GPUs | 64 | 128 | 256 | 512 | 1024
Azure | 7.056 | 6.986 | 6.987 | 7.029 | 6.957

Summary

In this white paper, we presented a comprehensive evaluation of LLM training performance on Azure using NVIDIA DGX Cloud benchmarking recipes. We detailed the full benchmarking workflow, including Slurm cluster provisioning with CCWS, integration of AMLFS, system validation steps (NHC, NCCL all-reduce, thermal screening), and targeted tuning strategies to optimize performance. Benchmark results span multiple LLMs across FP8 and BF16 precisions, scaling from 8 to 1024 H100 GPUs.

Across eight models and two precisions, Azure's ND H100 v5 platform achieved performance comparable to NVIDIA's published DGX Cloud benchmarks. These results confirm that, with correct provisioning and tuning, Azure can deliver world-class LLM training performance while offering cloud flexibility, managed Lustre storage, and repeatable automated deployment, making it a practical platform for large-scale GPU computing.

Acknowledgement

Great work takes a village. This project would not have been possible without the support and expertise of the following colleagues from Azure and NVIDIA:

  • Azure HPC/AI: Jer-Ming Chia, Xavier Pillons, Ben Watrous, Ryan Hamel, Aditi Gaur, Andy Howard, Ojasvi Bhalerao, Hugo Affaticati, Mike Ringenburg, Kanchan Mehrotra
  • Azure Storage: Brian Lepore, Dilip Sundarraj (now at Pure Storage)
  • Azure IB Interconnect: Remi Lemarchand, Jeff Yang, Jithin Jose, Austin Farmer, Chloe Gura, Rafael Salas, Jie Zhang, Nrupal Jani
  • NVIDIA SMEs: Alex Filby, Mark Arnold, Miro Enev, Zachary Newell, Nishanth Dandapanthula, Sowmyan Soman, Brian Carpenter, Emily Potyraj, Nishant Sharma