Azure achieved the fastest MLPerf Training v6.0 result to date for Llama 3.1 405B, with a time-to-train of just over seven minutes, according to MLCommons. This benchmark stresses the regime where communication overhead and system stability dominate training performance, which is where Azure’s full-stack, end-to-end advantages show. Azure scaled to 2,048 NVIDIA GB200 NVL72 compute-tray nodes spanning 128 racks (8,192 GPUs), the largest GB200 NVL72 cluster reported to date in MLPerf Training. As model sizes continue to grow into the hundreds of billions of parameters, achieving this level of performance consistently requires not only massive compute capacity but also a highly efficient communication infrastructure.
To meet the demands of massive training workloads, Azure’s Fairwater AI supercomputing infrastructure combines high-performance GPU scale-up domains with resilient scale-out networking. Fairwater is optimized for frontier-scale distributed AI training, where communication overhead and synchronization latency can become major scaling bottlenecks at multi-thousand-GPU scale. This result was made possible by fifth-generation NVIDIA NVLink at 1,800 GB/s per GPU for intra-rack communication and, for inter-rack communication, Azure’s MRC (Multipath Reliable Connection) fabric at 100 GB/s per GPU, accelerated by NVIDIA DOCA and built on NVIDIA ConnectX-8 SuperNICs and NVIDIA Spectrum-X Ethernet switches.
What Enabled Scaling Llama 405B to 8,192 GPUs on Azure
Achieving efficient training at this scale requires more than raw compute. Three architectural ingredients came together to make this result possible:
- High operational efficiency of NVLink scale-up domains, providing fast intra-rack GPU-to-GPU communication.
- Resiliency and stability of Azure’s MRC scale-out networking fabric, built on NVIDIA ConnectX-8 SuperNICs and Spectrum-X Ethernet switches, delivering 100 GB/s per GPU across racks.
- Topology-aware workload mapping, aligning parallelism strategies with the underlying network structure.
Large-scale LLM training is fundamentally synchronous. At every training step, GPUs perform forward and backward passes followed by gradient synchronization across ranks. Because all GPUs must stay in sync, training progress is ultimately limited by the slowest rank — any time spent waiting on the network directly increases overall step time.
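The "slowest rank" effect can be illustrated with a minimal sketch. The function name and all timing values below are hypothetical and are not measurements from this benchmark; the sketch only assumes that a synchronous step finishes when the last rank finishes.

```python
def synchronous_step_time(compute_s, network_wait_s):
    """Step time of a synchronous job: every rank waits for the slowest one.

    compute_s and network_wait_s are per-rank lists of seconds
    (hypothetical inputs, not profiled values).
    """
    if len(compute_s) != len(network_wait_s):
        raise ValueError("need one compute and one wait entry per rank")
    return max(c + w for c, w in zip(compute_s, network_wait_s))


# Illustrative values only: 8 ranks with identical compute time.
compute = [1.0] * 8
healthy = [0.0] * 8                  # no rank waits on the network
one_straggler = [0.0] * 7 + [0.05]   # a single rank stalls for 50 ms

print(synchronous_step_time(compute, healthy))
print(synchronous_step_time(compute, one_straggler))
```

In this toy setup a single delayed rank lengthens the step for all eight ranks by the full 50 ms, even though the other seven ranks were ready on time.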
For the Llama 405B benchmark, the workload is distributed across four dimensions of parallelism: Tensor, Context, Pipeline, and Data parallelism. Each generates different communication patterns, with some communication sitting directly on the critical training path. If left unmanaged, communication overhead can quickly become a major bottleneck at scale.
The key insight is that not all communication is equally latency-sensitive. Some communication must be completed before computation can continue, while other communication can overlap with compute.
Tensor-Parallel and Context-Parallel communication sit directly on the critical compute path, meaning each layer must wait for its collective operation to complete before execution can proceed. Pipeline-Parallel communication is also on the critical path, but across stages rather than within a layer: a stage cannot begin processing its data until the neighboring stage has transferred the activations (forward pass) or gradients (backward pass) it depends on. To minimize latency, all three are placed on the high-bandwidth NVLink domain, which provides up to 1,800 GB/s per GPU. Data-Parallel communication, on the other hand, can overlap with backward compute because gradient reduction runs alongside the backward pass and does not immediately block the next operation. This traffic is carried over the MRC scale-out network.
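As a rough illustration of this placement, the sketch below maps global ranks to parallel groups and checks which fabric each group would use. The parallel sizes (TP=4, CP=2, PP=8) and the rank ordering are assumptions chosen only so that one Tensor x Context x Pipeline block fills the 64 GPUs per rack implied by 8,192 GPUs across 128 racks; the actual submission's sizes and ordering are not stated here, and the helper names are hypothetical.

```python
WORLD_SIZE = 8192                        # from the article
NUM_RACKS = 128                          # from the article
GPUS_PER_RACK = WORLD_SIZE // NUM_RACKS  # 64 GPUs used per rack

# ASSUMPTION: illustrative parallel sizes, not the submitted configuration.
TP, CP, PP = 4, 2, 8
DP = WORLD_SIZE // (TP * CP * PP)        # 128 data-parallel replicas

SIZES = {"dp": DP, "pp": PP, "cp": CP, "tp": TP}
INDEX = {"dp": 0, "pp": 1, "cp": 2, "tp": 3}


def coords(rank):
    """Decompose a global rank into (dp, pp, cp, tp); tp varies fastest."""
    tp = rank % TP
    cp = (rank // TP) % CP
    pp = (rank // (TP * CP)) % PP
    dp = rank // (TP * CP * PP)
    return dp, pp, cp, tp


def to_rank(dp, pp, cp, tp):
    """Inverse of coords()."""
    return ((dp * PP + pp) * CP + cp) * TP + tp


def rack_of(rank):
    """ASSUMPTION: consecutive ranks fill one rack before moving to the next."""
    return rank // GPUS_PER_RACK


def group_members(rank, dim):
    """Ranks that match `rank` in every coordinate except along `dim`."""
    base = list(coords(rank))
    members = []
    for value in range(SIZES[dim]):
        c = list(base)
        c[INDEX[dim]] = value
        members.append(to_rank(*c))
    return members


def fabric_for(dim, rank=0):
    """Report whether a group stays inside one rack or spans several."""
    racks = {rack_of(r) for r in group_members(rank, dim)}
    return "NVLink (intra-rack)" if len(racks) == 1 else "MRC (inter-rack)"


for dim in ("tp", "cp", "pp", "dp"):
    print(dim, "->", fabric_for(dim))
```

Under these assumptions the Tensor-, Context-, and Pipeline-Parallel groups each stay within a single rack, while every Data-Parallel group has one member in each of the 128 racks. A different ordering, for example one that makes the pipeline dimension vary slowest, would push pipeline traffic across racks instead, which is why the mapping has to follow the topology.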
The impact of such mapping is clear at scale. Our profiling shows that cross-rack MRC communication contributes just ~20 ms of exposed time on the critical path of a ~1.27 s training step (≈1.6% of step time). This communication — primarily gradient synchronization across all ranks — is well overlapped by compute and NVLink traffic, making the scale-out network nearly invisible to training performance.
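The percentage quoted above follows directly from the two profiled figures. A small check, using the rounded values from the text (the function name is illustrative):

```python
def exposed_fraction(exposed_comm_s, step_time_s):
    """Share of a training step spent on communication that compute does not hide."""
    if step_time_s <= 0:
        raise ValueError("step time must be positive")
    return exposed_comm_s / step_time_s


# Rounded values quoted in the text: ~20 ms exposed in a ~1.27 s step.
frac = exposed_fraction(0.020, 1.27)
print(f"exposed: {frac:.1%}")                 # exposed: 1.6%
print(f"everything else: {1 - frac:.1%}")     # everything else: 98.4%
```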
Stable Step Time as the Cluster Grows
To evaluate execution stability as the cluster scales, we compared our GB200 NVL72 128-rack submission against a 112-rack configuration used for this experiment. In the Llama 405B workload, the critical path of each training step is dominated by compute (forward and backward GEMMs) together with latency-sensitive Tensor-Parallel and Context-Parallel communication over NVLink. Importantly, neither of these changes as the cluster scales out.
Cross-node communication, such as Data-Parallel collectives and Pipeline-Parallel transfers over the MRC network, runs concurrently with backward compute and remains largely hidden behind computation. This overlap is sustained only if cross-node communication remains fast and predictable. Any network instability, congestion, or synchronization jitter would reduce this overlap and expose communication on the critical path, directly increasing overall step time.
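One simple way to reason about this sensitivity is a toy overlap model: communication is free as long as it fits inside the compute window it overlaps with, and any excess lands on the critical path. The model and every number below are hypothetical; real overlap depends on bucketing, scheduling, and contention that this sketch ignores.

```python
def exposed_comm(comm_s, overlap_window_s):
    """Communication time that is not hidden behind overlapping compute."""
    return max(0.0, comm_s - overlap_window_s)


def step_time(critical_path_s, comm_s, overlap_window_s):
    """Critical-path time plus whatever communication spills out of the window."""
    return critical_path_s + exposed_comm(comm_s, overlap_window_s)


# Hypothetical numbers, chosen only to show the shape of the effect.
CRITICAL_PATH = 1.25    # seconds of compute plus latency-sensitive collectives
OVERLAP_WINDOW = 0.60   # seconds of backward compute available for overlap

stable = step_time(CRITICAL_PATH, comm_s=0.55, overlap_window_s=OVERLAP_WINDOW)
congested = step_time(CRITICAL_PATH, comm_s=0.70, overlap_window_s=OVERLAP_WINDOW)

print(f"stable network:    {stable:.2f} s")
print(f"congested network: {congested:.2f} s")
```

In this model the step time is flat while communication fits in the window, then grows one-for-one with every additional second of communication once the window is exceeded.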
Figure 1. Step-time vs cluster size at scale
This is exactly what we observe in practice under a weak scaling model, where we hold the workload per GPU constant while scaling total cluster capacity. As shown in Figure 1, step time remains nearly identical at scale, measuring 1.2734 seconds at 112 GB200 NVL72 racks (7,168 GPUs) and 1.2712 seconds at 128 racks (8,192 GPUs), a difference of about 2.2 ms. The two step times agree to within 0.2%, which corresponds to the near-perfect 99.8% weak scaling efficiency we report as the cluster expanded by an additional 1,024 GPUs. Step-time variance also remains extremely low (±0.04–0.05%), confirming stable and predictable execution at extreme scale. Stable step time translates directly into high hardware utilization and a shorter overall time-to-train, and it demonstrates the resiliency of Azure’s MRC scale-out network.
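The arithmetic behind these figures can be reproduced from the two reported step times. The text does not state the exact formula used for the 99.8% figure, so the sketch below shows two readings: the relative agreement between the two step times, and the conventional weak scaling ratio (smaller-cluster step time divided by larger-cluster step time). Variable names are illustrative.

```python
# Reported step times and GPU counts from the text.
T_112, GPUS_112 = 1.2734, 7168   # 112 racks
T_128, GPUS_128 = 1.2712, 8192   # 128 racks

delta_ms = abs(T_112 - T_128) * 1000.0
relative_gap = abs(T_112 - T_128) / T_112

# Reading 1: how closely the two step times agree.
agreement = 1.0 - relative_gap

# Reading 2: conventional weak scaling efficiency, T(small) / T(large).
# Values above 1 simply mean the larger run was marginally faster.
weak_scaling_ratio = T_112 / T_128

# With constant work per GPU, aggregate throughput scales as GPUs / step time.
throughput_gain = (GPUS_128 / T_128) / (GPUS_112 / T_112)

print(f"step-time difference: {delta_ms:.1f} ms")       # 2.2 ms
print(f"agreement:            {agreement:.1%}")          # 99.8%
print(f"T(112) / T(128):      {weak_scaling_ratio:.4f}")
print(f"throughput gain:      {throughput_gain:.3f}x for {GPUS_128 / GPUS_112:.3f}x GPUs")
```

Either way, the step times are indistinguishable at the 0.2% level, so aggregate throughput grows essentially in proportion to the number of GPUs added.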
Acknowledgement
This work was made possible through strong collaboration across multiple teams. We would like to especially recognize Shantanu Patankar for his contributions to execution and performance analysis and Mark Gitau for his contributions to data preparation and experiment support as core contributors, and extend our thanks to Amirreza Rastegari, Ojasvi Bhalerao, Sai Kovouri, Adam Hough, Nandini Ramanathan, Manasa Govindu, Sanian Gaffar, Ekrem Aksoy, Bhupender Thakur, Matthew Kappel, John Rankin, Girish Bhatia, Sreevatsa Anantharamu, Scott Moe, Yang Wang, and Jithin Jose, as well as many others across Azure and NVIDIA who supported this effort.