As AI workloads continue to push the boundaries of computing, the NVIDIA GB200 NVL72 stands at the forefront of high-performance acceleration. In this deep dive, we analyze the performance of Azure’s ND GB200 v6 Virtual Machines one component at a time using Azure microbenchmarks, providing insight into how Azure’s infrastructure supports AI training and inference workloads. Whether you’re fine-tuning deep learning models or scaling enterprise workloads, this analysis will help you understand what to expect from the NVIDIA GB200 Superchip on Azure.
For a comprehensive understanding of our benchmarking methodologies and detailed performance results, please refer to our benchmarking guide available on the official Azure GitHub repository: Azure AI Benchmarking Guide.
Breakdown of Benchmark Tests
GEMM Performance
General Matrix Multiply (GEMM) operations form the backbone of AI models. Our measurements show that more than 60% of the time spent training or inferencing an AI model goes to matrix multiplication, so measuring GEMM speed is key to understanding the performance of a GPU-based virtual machine. Our Azure benchmark assesses matrix-to-matrix multiplication efficiency using NVIDIA’s cuBLASLt library with FP8 precision, ensuring results reflect enterprise AI workloads. We measured the peak theoretical performance of the NVIDIA GB200 Blackwell GPU to be 4,856 TFLOPS, a 2.45x increase over the peak theoretical 1,979 TFLOPS of the NVIDIA H100 GPU. This finding is in line with NVIDIA’s announcement of a 2.5x performance increase at GTC 2024.
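The TFLOPS figures above follow from a standard accounting: an m×k by k×n matrix multiply performs 2·m·n·k floating-point operations. A minimal sketch of that arithmetic (the matrix size and timing below are hypothetical, chosen only to illustrate the scale of the measured numbers):

```python
def gemm_tflops(m, n, k, elapsed_s):
    """TFLOPS achieved by an m x k @ k x n GEMM: 2*m*n*k FLOPs over elapsed_s seconds."""
    return (2 * m * n * k) / elapsed_s / 1e12

# Hypothetical timing: a 16384^3 FP8 GEMM finishing in ~1.81 ms lands
# near the 4,856 TFLOPS peak cited above.
print(f"{gemm_tflops(16384, 16384, 16384, 1.81e-3):,.0f} TFLOPS")

# The generational speedup quoted above: 4,856 / 1,979 peak TFLOPS.
print(f"{4856 / 1979:.2f}x over H100")
```

The same formula is what GEMM benchmarks report internally; only the timing source (GPU events vs. wall clock) differs.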
The true performance gain of the NVIDIA GB200 Blackwell GPU over its predecessors emerges in real-life conditions. For example, using 10,000 warm-up iterations and randomly initialized matrices, we measured a sustained 2,744 TFLOPS for FP8 workloads, which, while expectedly lower than the theoretical peak, is still double that of the H100. Based on our early results, these improvements translate to up to a 3x speedup for end-to-end training and inference workloads.
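Comparing the sustained figure against the peak gives a useful efficiency ratio to track across GPU generations. A small sketch using the numbers reported above:

```python
def sustained_efficiency(sustained_tflops, peak_tflops):
    """Fraction of theoretical peak retained under sustained, randomized load."""
    return sustained_tflops / peak_tflops

# Numbers from the measurements above: 2,744 sustained vs. 4,856 peak FP8 TFLOPS.
print(f"{sustained_efficiency(2744, 4856):.1%} of peak sustained")
```

Sustained efficiency well below 100% is expected: randomly initialized matrices defeat the power and clock behavior that ideal synthetic inputs can exploit, so this ratio better reflects production workloads.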
High-Bandwidth Memory (HBM) Bandwidth
Memory bandwidth governs data movement. Our benchmarks showed a peak memory bandwidth of 7.35 TB/s, achieving 92% of the theoretical peak of 7.9 TB/s. This efficiency mirrors that of the H100, which also operated close to its theoretical maximum, while delivering 2.5x faster data transfers. This speedup ensures that data-intensive tasks, such as training large-scale neural networks, are executed efficiently.
NVBandwidth
The ND GB200 v6 architecture significantly enhances AI workload performance with NVLink C2C, enabling a direct, high-speed connection between the GPU and host system. This design reduces latency and improves data transfer efficiency, making AI workloads faster and more scalable.
Our NVBandwidth tests measured CPU-to-GPU and GPU-to-CPU transfer rates to be nearly 4x faster than the ND H100 v5. This improvement minimizes bottlenecks in data-intensive applications and optimizes data movement efficiency over previous GPU-powered virtual machines. In addition, the C2C link gives the GPU fast access to additional host memory when needed.
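To make the host-memory access point concrete, consider the time to stage a large block of offloaded state onto the GPU. The sketch below uses NVIDIA's stated 900 GB/s NVLink-C2C rate and an assumed ~64 GB/s for a PCIe Gen5 x16 link; both figures are illustrative, not our measured numbers:

```python
def transfer_time_s(n_bytes, bandwidth_gbs):
    """Seconds to move n_bytes at a given link bandwidth (GB/s)."""
    return n_bytes / (bandwidth_gbs * 1e9)

offloaded = 100e9  # hypothetical 100 GB of offloaded optimizer state
print(f"NVLink-C2C (~900 GB/s): {transfer_time_s(offloaded, 900):.2f} s")
print(f"PCIe Gen5 x16 (~64 GB/s): {transfer_time_s(offloaded, 64):.2f} s")
```

The order-of-magnitude gap is why C2C-attached host memory is practical as spillover capacity for workloads that exceed HBM.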
NCCL Bandwidth
NVIDIA’s Collective Communications Library (NCCL) enables high-speed communication between GPUs within and across nodes. High-speed communication is instrumental, as most enterprise workloads involve large-scale distributed models. We built our tests to measure the speed of communication between GPUs over NVLink within one virtual machine. The ND GB200 v6’s NVLink achieved a bandwidth of approximately 680 GB/s, aligning with NVIDIA’s projections.
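Headline NCCL numbers are usually the "bus bandwidth" that nccl-tests derives from the measured algorithm bandwidth; for all-reduce the conversion is busbw = algbw · 2(n−1)/n. A small sketch of that formula (the 4-GPU count and the algorithm-bandwidth value below are illustrative assumptions, not our measurements):

```python
def allreduce_busbw_gbs(algbw_gbs, n_gpus):
    """nccl-tests bus-bandwidth formula for all-reduce: busbw = algbw * 2(n-1)/n."""
    return algbw_gbs * 2 * (n_gpus - 1) / n_gpus

# Illustrative: with 4 GPUs in a VM, an algorithm bandwidth of ~453 GB/s
# corresponds to ~680 GB/s of bus bandwidth.
print(f"{allreduce_busbw_gbs(453.3, 4):.0f} GB/s bus bandwidth")
```

Knowing which of the two numbers a result reports is essential when comparing NCCL figures across publications, since they differ by up to 2x.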
Conclusion
The ND GB200 v6 virtual machine, powered by NVIDIA GB200 Blackwell GPUs, showcases substantial advancements in computational performance, memory bandwidth, and data transfer speeds over previous generations of virtual machines. These improvements are pivotal for efficiently managing the growing demands of AI workloads such as generative and agentic use cases. Following our Benchmarking Guide will provide early access to performance reviews of the innovations announced at GTC 2025, helping customers drive the next wave of AI on Azure’s purpose-built AI infrastructure.