benchmarking

61 Topics

Azure Sets a New Performance Record for LLM Training Benchmark at Extreme Scale
Azure achieved the most performant MLPerf Training v6.0 result to date for Llama 3.1 405B, with a time-to-train of just over seven minutes according to MLCommons. In this benchmark, communication overhead and system stability dominate training performance, which is where Azure’s full-stack, end-to-end advantages shine. Azure scaled to 2,048 NVIDIA GB200 NVL72 compute tray nodes spanning 128 racks (8,192 GPUs), assembling the largest reported GB200 NVL72 cluster to date in MLPerf Training.

As model sizes continue to grow into hundreds of billions of parameters, achieving this level of performance consistently requires not only massive compute capacity, but also a highly efficient communication infrastructure. To meet the demands of these massive training workloads, Azure’s Fairwater AI supercomputing infrastructure offers high-performance GPU scale-up domains with resilient scale-out networking. Fairwater is optimized for frontier-scale distributed AI training, where communication overhead and synchronization latency can become major scaling bottlenecks at multi-thousand GPU scale. This result was made possible by fifth-generation NVIDIA NVLink at 1,800 GB/s per GPU for intra-rack communication and, for inter-rack communication, Azure’s 100 GB/s MRC (Multipath Reliable Connection) fabric, accelerated by NVIDIA DOCA and built on NVIDIA ConnectX-8 SuperNICs and NVIDIA Spectrum-X Ethernet switches.

What Enabled Scaling Llama 405B to 8,192 GPUs on Azure

Achieving efficient training at this scale requires more than raw compute. Three architectural ingredients came together to make this result possible:

High operational efficiency of NVLink scale-up domains, providing fast intra-rack GPU-to-GPU communication.
Resiliency and stability of Azure’s MRC scale-out networking fabric, built on NVIDIA ConnectX-8 SuperNICs and Spectrum-X Ethernet switches, delivering 100 GB/s per GPU across racks.
Topology-aware workload mapping, aligning parallelism strategies with the underlying network structure.

Large-scale LLM training is fundamentally synchronous. At every training step, GPUs perform forward and backward passes followed by gradient synchronization across ranks. Because all GPUs must stay in sync, training progress is ultimately limited by the slowest rank: any time spent waiting on the network directly increases overall step time.

For the Llama 405B benchmark, the workload is distributed across four dimensions of parallelism: Tensor, Context, Pipeline, and Data parallelism. Each generates different communication patterns, and some of that communication sits directly on the critical training path. If left unmanaged, communication overhead can quickly become a major bottleneck at scale.

The key insight is that not all communication is equally latency-sensitive. Some communication must complete before computation can continue, while other communication can overlap with compute. Tensor-Parallel and Context-Parallel communication sit directly on the critical compute path, meaning each layer must wait for its collective operation to complete before execution can proceed. Pipeline-Parallel communication is also on the critical path, but across stages rather than within a layer (i.e., a stage cannot begin processing its data until the neighboring stage has transferred its activations in the forward pass or its gradients in the backward pass). To minimize latency, all three are placed on the high-bandwidth NVLink domain, which provides up to 1,800 GB/s per GPU. Data-Parallel communication, on the other hand, can overlap with backward compute because gradient reduction runs alongside the backward pass and does not immediately block the next operation. This traffic is carried over the MRC scale-out network.

The impact of this mapping is clear at scale. Our profiling shows that cross-rack MRC communication contributes just ~20 ms of exposed time on the critical path of a ~1.27 s training step (≈1.6% of step time).
This communication (primarily gradient synchronization across all ranks) is well overlapped by compute and NVLink traffic, making the scale-out network nearly invisible to training performance.

Stable Step Time as the Cluster Grows

To evaluate execution stability as the cluster scales, we compared our GB200 NVL72 128-rack submission against a 112-rack configuration used for this experiment. In the Llama 405B workload, the critical path of each training step is dominated by compute (forward and backward GEMMs) together with latency-sensitive Tensor-Parallel and Context-Parallel communication over NVLink. Importantly, neither of these changes as the cluster scales out. Cross-node communication, such as Data-Parallel collectives and Pipeline-Parallel transfers over the MRC network, runs concurrently with backward compute and remains largely hidden behind computation. This overlap is sustained only if cross-node communication remains fast and predictable. Any network instability, congestion, or synchronization jitter would reduce this overlap and expose communication on the critical path, directly increasing overall step time.

Figure 1. Step-time vs cluster size at scale

This is exactly what we observe in practice under a weak scaling model, where we maintain a constant workload per GPU while scaling total cluster capacity. As shown in Figure 1, step time remains nearly identical at scale, measuring 1.2734 seconds at 112 GB200 NVL72 racks (7,168 GPUs) and 1.2712 seconds at 128 racks (8,192 GPUs), a difference of just 2 ms. This corresponds to a near-perfect 99.8% weak scaling efficiency as we expanded the cluster by an additional 1,024 GPUs. Step-time variance also remains extremely low (±0.04–0.05%), confirming stable and predictable execution at extreme scale. This directly translates to maximized hardware utilization and an accelerated overall time-to-train, demonstrating the resiliency of Azure’s MRC scale-out network.
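The headline numbers above can be cross-checked with a few lines of arithmetic. The following is a minimal sketch, not part of the original submission: the variable names are hypothetical, and the final quantity assumes that the quoted 99.8% corresponds to one minus the relative step-time difference, which the post does not state explicitly.

```python
# Figures quoted in the post.
NODES = 2048                 # GB200 NVL72 compute tray nodes
RACKS_FULL, GPUS_FULL = 128, 8192
RACKS_SMALL, GPUS_SMALL = 112, 7168
STEP_S_FULL = 1.2712         # step time at 128 racks
STEP_S_SMALL = 1.2734        # step time at 112 racks
EXPOSED_COMM_S = 0.020       # ~20 ms of exposed cross-rack communication
STEP_S_APPROX = 1.27         # ~1.27 s training step

# Derived layout: GPUs per node and GPUs used per rack in each configuration.
gpus_per_node = GPUS_FULL // NODES
gpus_per_rack_full = GPUS_FULL // RACKS_FULL
gpus_per_rack_small = GPUS_SMALL // RACKS_SMALL

# Share of the step spent waiting on the scale-out network.
exposed_fraction = EXPOSED_COMM_S / STEP_S_APPROX

# Step-time difference between the two cluster sizes.
step_delta_ms = (STEP_S_SMALL - STEP_S_FULL) * 1000
relative_delta = abs(STEP_S_SMALL - STEP_S_FULL) / STEP_S_SMALL

# ASSUMPTION: one reading of the quoted efficiency is 1 - relative difference.
agreement = 1 - relative_delta

print(f"GPUs per node: {gpus_per_node}")
print(f"GPUs per rack: {gpus_per_rack_full} (128 racks), {gpus_per_rack_small} (112 racks)")
print(f"Exposed communication: {exposed_fraction:.1%} of step time")
print(f"Step-time difference: {step_delta_ms:.1f} ms ({relative_delta:.2%})")
print(f"Step-time agreement: {agreement:.1%}")
```

Both configurations work out to the same number of GPUs per rack, which is consistent with the constant per-GPU workload that the weak scaling comparison relies on.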
Acknowledgement

This work was made possible through strong collaboration across multiple teams. We would like to especially recognize Shantanu Patankar for his contributions to execution and performance analysis and Mark Gitau for his contributions to data preparation and experiment support as core contributors, and extend our thanks to Amirreza Rastegari, Ojasvi Bhalerao, Sai Kovouri, Adam Hough, Nandini Ramanathan, Manasa Govindu, Sanian Gaffar, Ekrem Aksoy, Bhupender Thakur, Matthew Kappel, John Rankin, Girish Bhatia, Sreevatsa Anantharamu, Scott Moe, Yang Wang, and Jithin Jose, as well as many others across Azure and NVIDIA who supported this effort.
azinheidarshenas
Jun 16, 2026 Place Azure High Performance Computing (HPC) Blog
6.1K Views
6 likes
0 Comments
Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce job is executed on the NDm_v4 AKS cluster to verify that it is deployed and configured correctly.
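As a rough illustration of the "8 GPUs and 8 InfiniBand devices" check, the sketch below counts devices on a single Linux node. It is not taken from the post: the function names are hypothetical, it assumes GPUs are listed by `nvidia-smi -L` (one `GPU n: ...` line each) and that InfiniBand devices appear under `/sys/class/infiniband`, and the sample text is made up for the demonstration.

```python
import os

EXPECTED_GPUS = 8        # per NDm_v4 VM, as stated in the post
EXPECTED_IB_DEVICES = 8  # per NDm_v4 VM, as stated in the post


def count_gpus(nvidia_smi_list_output: str) -> int:
    """Count GPUs in text captured from `nvidia-smi -L`."""
    return sum(
        1
        for line in nvidia_smi_list_output.splitlines()
        if line.strip().startswith("GPU ")
    )


def count_ib_devices(sysfs_dir: str = "/sys/class/infiniband") -> int:
    """Count entries in the sysfs InfiniBand class directory (0 if absent)."""
    if not os.path.isdir(sysfs_dir):
        return 0
    return len(os.listdir(sysfs_dir))


def node_is_healthy(gpu_count: int, ib_count: int) -> bool:
    """True when the node exposes the expected number of both device types."""
    return gpu_count == EXPECTED_GPUS and ib_count == EXPECTED_IB_DEVICES


# Made-up sample standing in for real `nvidia-smi -L` output.
sample = "\n".join(
    f"GPU {i}: NVIDIA A100-SXM4-80GB (UUID: GPU-example-{i})" for i in range(8)
)
print("GPUs found in sample:", count_gpus(sample))
print("InfiniBand devices on this host:", count_ib_devices())
```

A device count only confirms that the hardware is visible. The multi-node NCCL allreduce run described in the post is still what verifies that the devices deliver the expected bandwidth.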
CormacGarvey
Jun 02, 2023 Place Azure High Performance Computing (HPC) Blog
10K Views
5 likes
2 Comments
Azure Managed Lustre - Benchmarking our new Azure storage solution
Tens or Hundreds of GB/s of storage throughput in Azure? How to benchmark Azure Managed Lustre
brianlepore
Feb 22, 2023 Place Azure High Performance Computing (HPC) Blog
7.6K Views
5 likes
0 Comments
Accelerating the Intelligence Age with Azure AI Infrastructure and the GA of ND GB200 v6
Today we are thrilled to announce the General Availability of Azure's latest AI infrastructure Virtual Machines, the ND GB200 v6.
LockyAinley
Mar 18, 2025 Place Azure High Performance Computing (HPC) Blog
4.7K Views
4 likes
0 Comments
Performance and Scalability of Azure HBv5-series Virtual Machines
Azure HBv5-series virtual machines (VMs) for CPU-based high performance computing (HPC) are now Generally Available. This blog provides in-depth information about the technical underpinnings, performance, cost, and management implications of these HPC-optimized VMs.

Azure HBv5 VMs bring leadership levels of performance, cost optimization, and server (VM) consolidation for a variety of workloads driven by memory performance, such as computational fluid dynamics, weather simulation, geoscience simulations, and finite element analysis. For these applications, and compared to HBv4 VMs (previously the highest performance offering for these workloads), HBv5 provides up to:

5x higher performance for CFD workloads with 43% lower costs
3.2x higher performance for weather simulation with 16% lower costs
2.8x higher performance for geoscience workloads at the same costs

HBv5-series Technical Overview & VM Sizes

Each HBv5 VM features several new technologies for HPC customers, including:

Up to 6.6 TB/s of memory bandwidth (STREAM TRIAD) and 432 GB memory capacity
Up to 368 physical cores per VM (user configurable) with custom AMD EPYC CPUs, Zen4 microarchitecture (SMT disabled)
Base clock of 3.5 GHz (~1 GHz higher than other 96-core EPYC CPUs), and Boost clock of 4 GHz across all cores
800 Gb/s NVIDIA Quantum-2 InfiniBand (4 x 200 Gb/s CX-7) (~2x higher than HBv4 VMs)
180 Gb/s Azure Accelerated Networking (~2.2x higher than HBv4 VMs)
15 TB local NVMe SSD with up to 50 GB/s (read) and 30 GB/s (write) of bandwidth (~4x higher than HBv4 VMs)

The highlight feature of HBv5 VMs is their use of high-bandwidth memory (HBM). HBv5 VMs utilize a custom AMD CPU that increases memory bandwidth by ~9x vs. dual-socket 4th Gen EPYC (Zen4, “Genoa”) server platforms, and ~7x vs. dual-socket EPYC (Zen5, “Turin”) server platforms, respectively. HBv5 delivers similar levels of memory bandwidth improvement compared to the highest end alternatives from the Intel Xeon and ARM CPU ecosystems.
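For readers unfamiliar with the STREAM TRIAD figure quoted above, the sketch below shows what the kernel computes and how a bandwidth number is derived from it. It is a pure-Python illustration under stated assumptions, not the `amd_zen_stream` binary used in the post: the function name and array size are hypothetical, and interpreter overhead means the printed rate says nothing about real memory bandwidth.

```python
import time
from array import array


def stream_triad(n: int, scalar: float = 3.0):
    """Run the TRIAD kernel a[i] = b[i] + scalar * c[i] over n doubles.

    Returns the result array, the bytes counted by the STREAM convention
    (two reads and one write per element), and bytes per second.
    """
    b = array("d", [1.0]) * n
    c = array("d", [2.0]) * n
    a = array("d", [0.0]) * n

    start = time.perf_counter()
    for i in range(n):
        a[i] = b[i] + scalar * c[i]
    elapsed = time.perf_counter() - start

    bytes_moved = 3 * n * a.itemsize  # read b, read c, write a
    return a, bytes_moved, bytes_moved / elapsed


# Hypothetical, deliberately small array size for a quick demonstration.
N = 200_000
result, bytes_moved, bytes_per_s = stream_triad(N)
print(f"Moved {bytes_moved / 1e6:.1f} MB at {bytes_per_s / 1e6:.1f} MB/s (interpreter-bound)")
```

A real STREAM run uses compiled, multi-threaded code with arrays far larger than the caches, so that the measurement reflects sustained memory bandwidth rather than cache or interpreter speed.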
HBv5-series VMs are available in the following sizes with specifications as shown below. Just like existing H-series VMs, HBv5-series includes constrained cores VM sizes, enabling customers to optimize their VM dimensions for a variety of scenarios:

ISV licensing constraining a job to a targeted number of cores
Maximum performance per VM or maximum performance per core
Minimum RAM/core (1.2 GB, suitable for strong scaling workloads) to maximum memory per core (9 GB, suitable for large datasets and weak scaling workloads)

Table 1: Technical specifications of HBv5-series VMs

Note: Maximum clock frequencies (FMAX) are based on product specifications of the AMD EPYC 9V64H processor. Clock frequencies experienced by a customer are a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application.

For more information, see the official documentation for HBv5-series VMs.

Microbenchmark Performance

This section focuses on microbenchmarks that characterize performance of the memory subsystem, compute capabilities, and InfiniBand network of HBv5 VMs.

Memory & Compute Performance

To capture synthetic performance, we ran the following industry standard benchmarks:

STREAM – memory bandwidth
High Performance Conjugate Gradient (HPCG) – sparse linear algebra
High Performance Linpack (HPL) – dense linear algebra

Absolute results and comparisons to HBv4 VMs are shown in Table 2, below:

Table 2: Results of HBv5 running the STREAM, HPCG, and HPL benchmarks.

Note: STREAM was run with the following CLI parameters:

OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./amd_zen_stream

STREAM data size: 2621440000 bytes

InfiniBand Networking Performance

Each HBv5-series VM is equipped with four NVIDIA Quantum-2 network interface cards (NICs), each operating at 200 Gb/s for an aggregate bandwidth of 800 Gb/s per VM (node).
We ran the industry standard IB perftests, based on OSU benchmarks, across two (2) HBv5-series VMs, with results shown in Figures 1-3, below:

Note: all results below are for a single 200 Gb/s (uni-directional) link only. At a VM level, all bandwidth results below are 4x higher as there are four (4) InfiniBand links per HBv5 server.

Unidirectional bandwidth:

numactl -c 0 ib_send_bw -aF -q 2

Figure 1: results showing 99% achieved uni-directional bandwidth vs. theoretical peak.

Bi-directional bandwidth:

numactl -c 0 ib_send_bw -aF -q 2 -b

Figure 2: results showing 99% achieved bi-directional bandwidth vs. theoretical peak.

Latency:

Figure 3: results measuring latencies as low as 1.25 microseconds among HBv5 VMs. Latencies experienced by users will depend on message sizes employed by applications.

Application Performance, Cost/Performance, and Server (VM) Consolidation

This section focuses on characterizing HBv5-series VMs when running common, real-world HPC applications, with an emphasis on those known to be meaningfully bound by memory performance, as that is the focus of the HB-series family.
We characterize HBv5 below in three (3) ways of high relevance to customer interests:

Performance (“how much faster can it do the work”)
Cost/Performance (“how much can it reduce the costs to complete the work”)
Fleet consolidation (“how much can a customer simplify the size and scale of compute fleet management while still being able to do the work”)

Where possible, we have included comparisons to other Azure HPC VMs, including:

Azure HBv4/HX series with 176 physical cores of 4th Gen AMD EPYC CPUs with 3D V-Cache (“Genoa-X”) (HBv4 specifications, HX specifications)
Azure HBv3 with 120 physical cores of 3rd Gen AMD EPYC CPUs with 3D V-Cache (“Milan-X”) (HBv3 specifications)
Azure HBv2 with 120 physical cores of 2nd Gen AMD EPYC (“Rome”) processors (full specifications)

Unless otherwise noted, all tests shown below were performed with:

Alma Linux 8.10 (image URN: almalinux:almalinux-hpc:8_10-hpc-gen2:latest) for scaling (image URN: almalinux:almalinux-hpc:8_6-hpc-gen2:latest)
NVIDIA HPC-X MPI

Further, all Cost/Performance comparisons use list-price, Pay-As-You-Go (PAYG) rates found on Azure Linux Virtual Machines Pricing. Absolute costs will be a function of a customer’s workload, model, and consumption approach (PAYG vs. Reserved Instance, etc.). That said, the relative cost/performance comparisons illustrated below should hold for the workload and model combinations shown, regardless of the consumption approach.

Computational Fluid Dynamics (CFD)

OpenFOAM – version 2306 with 100M Cell Motorbike case

Figure 4: HBv5 vs. HBv4 on OpenFOAM with the Motorbike 100M cell case. HBv5 VMs provide a 4.8x performance increase over HBv4 VMs.

Figure 5: The cost to complete the OpenFOAM Motorbike 100M case on HBv5 is just 57% of what it costs to complete the same case on HBv4.
Above, we can see that organizations running OpenFOAM cases similar in size and complexity to the 100M cell Motorbike problem can consolidate their server (VM) deployments by approximately a factor of five (5).

Palabos – version 1.01 with 3D Cavity, 1001 x 1001 x 1001 cells case

Figure 6: On Palabos, a Lattice Boltzmann solver using a streaming memory access pattern, HBv5 VMs provide a 4.4x performance increase over HBv4 VMs.

Figure 7: The cost to complete the Palabos 3D Cavity case on HBv5 is just 62% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running Palabos with cases similar in size and complexity to the 1001 x 1001 x 1001 cell 3D Cavity problem can consolidate their server (VM) deployments by a factor of ~4.5.

Ansys Fluent – version 2025 R2 with F1 Racecar 140M case

Figure 8: On Ansys Fluent, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

Figure 9: The cost to complete the Ansys Fluent F1 Racecar 140M case on HBv5 is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running Ansys Fluent with cases similar in size and complexity to the 140M cell F1 Racecar problem can consolidate their server (VM) deployments by a factor of ~3.5.

Siemens Star-CCM+ – version 17.04.005 with AeroSUV Steady Coupled 106M case

Figure 10: On Star-CCM+, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

Figure 11: The cost to complete the Siemens Star-CCM+ AeroSUV Steady Coupled 106M case on HBv5 is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running Star-CCM+ with cases similar in size and complexity to the 106M cell AeroSUV Steady Coupled problem can consolidate their server (VM) deployments by a factor of ~3.5.
Weather Modeling

WRF – version 4.2.2 with CONUS 2.5KM case

Figure 12: On WRF, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.

Figure 13: The cost to complete the WRF CONUS 2.5KM case on HBv5 is just 84% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running WRF with cases similar in size and complexity to the 2.5KM CONUS case can consolidate their server (VM) deployments by a factor of ~3.

Energy Research

Devito – version 4.8.7 with Acoustic Forward case

Figure 14: On Devito, HBv5 VMs provide a 2.8x performance increase over HBv4 VMs.

Figure 15: The cost to complete the Devito Acoustic Forward OP case on HBv5 is equivalent to what it costs to complete the same case on HBv4.

Above, we can see that organizations running Devito with cases similar in size and complexity to the Acoustic Forward OP case can consolidate their server (VM) deployments by a factor of ~3.

Molecular Dynamics

NAMD – version 2.15a2 with STMV 20M case

Figure 16: On NAMD, HBv5 VMs provide a 2.18x performance increase over HBv4 VMs.

Figure 17: The cost to complete the NAMD STMV 20M case is 26% higher on HBv5 than on HBv4.

Above, we can see that organizations running NAMD with cases similar in size and complexity to the STMV 20M case can consolidate their server (VM) deployments by a factor of ~2.

Notably, NAMD is a compute-bound case rather than a memory-performance-bound one. We include it here to illustrate that not every workload is a good fit for HBv5. HBv5 is the fastest VM for this workload on the Microsoft Cloud, but NAMD does not benefit substantially from HBv5’s premium levels of memory bandwidth. NAMD would instead run more cost efficiently on a CPU that supports AVX512 instructions natively or, much better still, a modern GPU.
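The speedup and cost figures in this section are linked by simple arithmetic. The sketch below is illustrative only and rests on an assumption the post does not state: that cost per job equals the hourly VM price multiplied by runtime on the same number of VMs. Under that assumption, multiplying each quoted speedup by its quoted relative cost gives the implied HBv5-to-HBv4 price ratio. The function and variable names are hypothetical.

```python
# (speedup vs. HBv4, HBv5 job cost relative to HBv4) from the figure captions.
RESULTS = {
    "OpenFOAM Motorbike 100M": (4.8, 0.57),
    "Palabos 3D Cavity": (4.4, 0.62),
    "Ansys Fluent F1 Racecar 140M": (3.4, 0.81),
    "WRF CONUS 2.5KM": (3.27, 0.84),
    "NAMD STMV 20M": (2.18, 1.26),
}


def implied_price_ratio(speedup: float, relative_cost: float) -> float:
    """ASSUMPTION: cost = price x runtime, so price ratio = relative cost x speedup."""
    return relative_cost * speedup


def relative_job_cost(speedup: float, price_ratio: float) -> float:
    """Inverse relation: job cost relative to the baseline VM."""
    return price_ratio / speedup


ratios = {name: implied_price_ratio(s, c) for name, (s, c) in RESULTS.items()}
for name, ratio in ratios.items():
    print(f"{name}: implied price ratio {ratio:.2f}")

mean_ratio = sum(ratios.values()) / len(ratios)
print(f"Mean implied price ratio: {mean_ratio:.2f}")
```

Read this way, the implied ratio also acts as a break-even speedup: a workload that speeds up by less than that factor, as NAMD does, costs more per job on HBv5, which matches the NAMD result above.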
Scalability of HBv5-series VMs

Weak Scaling

Weak scaling measures how well a parallel application or system performs when both the number of processing elements and the problem size increase proportionally, so that the workload per processor remains constant. Weak scaling cases are often employed when time-to-solution is fixed (e.g. it is acceptable to solve a problem within a specified period) but a user desires a simulation of higher fidelity or resolution. A common example is operational weather forecasting.

To illustrate weak scaling on HBv5 VMs, we ran Palabos with the same 3D cavity problem as shown earlier:

Figure 18: On Palabos with the 3D Cavity model, HBv5 scales linearly as the 3D cavity size is proportionately increased.

Strong Scaling

Strong scaling is characterized by the efficiency with which execution time is reduced as the number of processor elements (CPUs, GPUs, etc.) is increased while the problem size is kept constant. Strong scaling cases are often employed when the fidelity or resolution of the simulation is acceptable, but a user requires faster time to completion. A common example is product engineering validation, when an organization wants to bring a product to market faster but must complete a broad range of validation and verification scenarios before doing so.

To illustrate strong scaling on HBv5 VMs, we ran NAMD with two different problems, each intended to illustrate how expectations for strong scaling efficiency change depending on problem size and the ordering of computation vs. communication in distributed memory workloads.

First, let us examine NAMD with the 20M STMV benchmark.

Figure 19: Strong scaling on HBv5 with NAMD STMV 20M cell case

As illustrated above, for strong scaling cases in which the compute time is continuously reduced (by leveraging more and more processor elements) but communication time remains constant, scaling efficiency will only stay high for so long.
That principle is well-represented by the STMV 20M case, for which parallel efficiency remains linear (i.e. cost/job remains flat) at two (2) nodes but degrades after that. This is because while compute is being sped up, the MPI time remains relatively flat. As such, the relatively static MPI time comes to dominate end-to-end wall clock time as VM scaling increases. Said another way, HBv5 features so much compute performance that even for a moderate-sized problem like STMV 20M, scaling the infrastructure can only take performance so far, and cost/job will begin to increase.

If we examine HBv5 against the 210M cell case, however, with 10.5x as many elements to compute as its 20M case sibling, the scaling efficiency story changes significantly.

Figure 20: On NAMD with the STMV 210M cell case, HBv5 scales linearly out to 32 VMs (or more than 11,000 CPU cores).

As illustrated above, larger cases with significant compute requirements will continue to scale efficiently with larger amounts of HBv5 infrastructure. While MPI time remains relatively flat for this case (as it does for the smaller STMV 20M case), the compute demands remain the dominant fraction of end-to-end wall clock time. As such, HBv5 scales these problems with very high levels of efficiency, and job costs to the user remain flat despite up to 8x as many VMs being leveraged compared to the four (4) VM baseline.

The key takeaways for strong scaling scenarios are two-fold. First, users should run scaling tests with their applications and models to find a sweet spot of faster performance with constant job costs. This will depend heavily on model size. Second, as new and very high end compute platforms like HBv5 emerge that accelerate compute time, application developers will need to find ways to reduce wall clock time that bottlenecks on communication (MPI) time.
Recommended approaches include using fewer MPI processes and, ideally, restructuring applications to overlap communication with compute phases.
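The strong scaling behavior described above can be reproduced with a toy model in which compute time divides evenly across VMs while MPI time stays constant. This is a minimal sketch with hypothetical inputs, not measured NAMD data: the timings, function names, and the 10.5x ratio applied to compute time alone are all assumptions chosen for illustration.

```python
def wall_time(compute_s: float, mpi_s: float, vms: int) -> float:
    """Toy model: compute time divides across VMs, MPI time stays constant."""
    return compute_s / vms + mpi_s


def parallel_efficiency(compute_s: float, mpi_s: float, vms: int, baseline_vms: int = 1) -> float:
    """Achieved speedup divided by ideal speedup relative to the baseline."""
    speedup = wall_time(compute_s, mpi_s, baseline_vms) / wall_time(compute_s, mpi_s, vms)
    return speedup / (vms / baseline_vms)


def relative_job_cost(compute_s: float, mpi_s: float, vms: int, baseline_vms: int = 1) -> float:
    """VM-seconds per job relative to the baseline (equals 1 / efficiency)."""
    baseline = baseline_vms * wall_time(compute_s, mpi_s, baseline_vms)
    return vms * wall_time(compute_s, mpi_s, vms) / baseline


# Hypothetical inputs: a smaller problem and one with 10.5x the compute work.
MPI_S = 5.0
SMALL_COMPUTE_S = 100.0
LARGE_COMPUTE_S = 10.5 * SMALL_COMPUTE_S

for vms in (1, 2, 4, 8, 16, 32):
    small = parallel_efficiency(SMALL_COMPUTE_S, MPI_S, vms)
    large = parallel_efficiency(LARGE_COMPUTE_S, MPI_S, vms)
    print(f"{vms:>2} VMs: small-problem efficiency {small:.2f}, large-problem efficiency {large:.2f}")
```

In this model the smaller problem loses efficiency quickly because the fixed MPI term dominates once compute time shrinks, while the larger problem stays close to linear for longer. Reducing the MPI term, for example by overlapping communication with compute, raises both curves.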
jvenkatesh
Nov 04, 2025 Place Azure High Performance Computing (HPC) Blog
1.9K Views
3 likes
1 Comment
Performance at Scale: The Role of Interconnects in Azure HPC & AI Infrastructure
Microsoft Azure’s high-performance computing (HPC) & AI infrastructure is designed from the ground up to support the world’s most demanding workloads. High-performance AI workloads are bandwidth-hungry and latency-sensitive. As models scale in size and complexity, the efficiency of the interconnect fabric (how CPUs, GPUs, and storage communicate) becomes a critical factor in overall system performance. Even with the fastest GPUs, poor interconnect design can lead to bottlenecks, underutilized hardware, and extended time-to-results. In this blog post, we will highlight one of the key enabling features for running large-scale distributed workloads on Azure: a highly tuned HPC-class interconnect. Azure has invested years of system-level engineering in the InfiniBand interconnect, delivering ready-to-use configurations to customers on Azure’s HB-series and N-series virtual machines (VMs).
HugoAffaticati
Jun 25, 2025 Place Azure High Performance Computing (HPC) Blog
4.4K Views
3 likes
1 Comment
Azure’s ND GB200 v6 Delivers Record Performance for Inference Workloads
Achieving peak AI performance requires both cutting-edge hardware and a finely optimized infrastructure. Azure’s ND GB200 v6 Virtual Machines, accelerated by the NVIDIA GB200 Blackwell GPUs, have already demonstrated world record performance of 865,000 tokens/s for inferencing on the industry standard LLAMA2 70B
HugoAffaticati
Mar 31, 2025 Place Azure High Performance Computing (HPC) Blog
2.1K Views
3 likes
0 Comments
Announcing Azure HBv5 Virtual Machines: A Breakthrough in Memory Bandwidth for HPC
Discover the new Azure HBv5 Virtual Machines, unveiled at Microsoft Ignite, designed for high-performance computing applications. With up to 7 TB/s of memory bandwidth and custom 4th Generation EPYC processors, these VMs are optimized for the most memory-intensive HPC workloads. Sign up for the preview starting in the first half of 2025 and see them in action at Supercomputing 2024 in Atlanta
Fernando_Aznar
Nov 19, 2024 Place Azure High Performance Computing (HPC) Blog
17K Views
3 likes
1 Comment
Performance & Scalability of HBv4 and HX-Series VMs with Genoa-X CPUs
Azure has announced the general availability of Azure HBv4-series and HX-series virtual machines (VMs) for high performance computing (HPC). This blog provides in-depth technical and performance information about these HPC-optimized VMs.
RachelPruitt
Jun 13, 2023 Place Azure High Performance Computing (HPC) Blog
12K Views
3 likes
1 Comment
Training large AI models on Azure using CycleCloud + Slurm
Here we demonstrate, and provide a template for, deploying a computing environment optimized to train a transformer-based large language model on Azure. The environment uses CycleCloud, a tool to orchestrate and manage HPC environments, to provision a cluster of A100 or H100 nodes managed by Slurm. Such environments have been deployed to train foundational models with tens to hundreds of billions of parameters on terabytes of data.
jesselopez
Mar 28, 2023 Place Azure High Performance Computing (HPC) Blog
8.3K Views
3 likes
0 Comments