Azure High Performance Computing (HPC) Blog

Optimizing Large-Scale AI Performance with Pretraining Validation on a Single Azure ND GB200 v6

Aug 18, 2025

Small performance gaps on a single virtual machine lead to large and costly performance losses at scale. Running small-scale pretraining jobs enables single-VM validation and allows for fine-grained identification of issues such as performance degradation, hardware bottlenecks, or software inefficiencies ahead of large-scale runs. In this post, we present a practical methodology for benchmarking the ND GB200 v6 virtual machines (VMs). A single ND GB200 v6 VM on Azure is powered by two NVIDIA Grace CPUs and four NVIDIA Blackwell GPUs. To ensure reliability in production workloads, we used automated pretraining of lightweight Llama models with the NVIDIA NeMo framework. By systematically exploring and tuning key performance parameters and rigorously cross-checking results with detailed telemetry, we identified the conditions that most significantly stress the GPUs. You can reproduce and reuse these pretraining workloads from our fully automated Azure AI Benchmarking guide.

by Mishty Dhekial (Software Engineer Intern) and Hugo Affaticati (Cloud Infrastructure Engineer)

Why Llama?

The Llama 3 8B model was selected as the focus of this analysis due to its relevance as a modern, open-weight large language model (LLM) architecture. Llama models are widely used in both research and industry. Their transformer-based design, featuring multi-head self-attention and deep stacking of layers, reflects the current state of the art in natural language processing and is now part of our Azure AI benchmarking guide.

Model Architecture

The Llama model’s architecture is designed for scalability and efficiency. It incorporates optimizations such as rotary positional embeddings and grouped-query attention, which improve both training speed and inference quality. The open availability of the Llama models and pretraining recipes from the NVIDIA NeMo framework also makes it accessible for experimentation and benchmarking, allowing for reproducible and transparent performance evaluation.

The 8B parameter size strikes a practical balance between model capacity and hardware requirements. Llama 3 70B is too large to fit onto the previous generations of virtual machines covered in our Azure AI Benchmarking guide, while the 3B model doesn’t fully stress the ND GB200 v6 VM. 8B is compact enough to fit on a single virtual machine while still presenting a demanding workload that fully utilizes the ND GB200 v6 VM’s four GPUs. This approach is ideal for conducting controlled, fine-grained validation and troubleshooting on a single VM as an initial step, before progressing to multi-node distributed training.

Performance Parameters Overview

Efficiently training large language models depends on how the model is distributed across your virtual machine. This section explains the parallelism parameters (tensor, pipeline, context, and data) and the micro batch size used to maximize the Azure virtual machine’s performance and training speed.

Tensor Parallelism (TP)

Tensor parallelism splits individual layers of a model across multiple GPUs, allowing large models to be trained by distributing computations.

Source: NVIDIA, Accessed 8/5/2025
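
To make the splitting concrete, here is a minimal CPU-only PyTorch sketch (our illustration, not code from NeMo) that shards a linear layer’s weight column-wise across two simulated ranks and shows that gathering the partial outputs reproduces the full result:

# Toy tensor parallelism: shard a linear layer's weight column-wise
# across two simulated ranks and rebuild the full output (CPU only).
import torch

torch.manual_seed(0)
x = torch.randn(8, 1024)            # a batch of activations
weight = torch.randn(1024, 4096)    # full weight of one linear layer

w_rank0, w_rank1 = weight.chunk(2, dim=1)   # each "rank" owns half the output columns

y_rank0 = x @ w_rank0               # partial output computed independently on rank 0
y_rank1 = x @ w_rank1               # partial output computed independently on rank 1

# On real GPUs this concat is an all-gather over NVLink -- the communication
# overhead that shows up in the TP results below.
y_parallel = torch.cat([y_rank0, y_rank1], dim=1)
assert torch.allclose(y_parallel, x @ weight, atol=1e-5)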

Pipeline Parallelism (PP)

Pipeline parallelism divides a model into sequential stages assigned to different GPUs.

Source: NVIDIA, Accessed 8/5/2025
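
As a rough illustration (again plain PyTorch on CPU, not the NeMo implementation), the sketch below splits a small stack of layers into two sequential stages and streams micro-batches through them; on real hardware each stage would live on a different GPU:

# Toy pipeline parallelism: two sequential stages, each owning half of the
# layers, processing a stream of micro-batches one after another.
import torch
import torch.nn as nn

layers = [nn.Linear(256, 256) for _ in range(8)]
stage0 = nn.Sequential(*layers[:4])   # would sit on GPU 0
stage1 = nn.Sequential(*layers[4:])   # would sit on GPU 1

micro_batches = [torch.randn(4, 256) for _ in range(4)]

outputs = []
for mb in micro_batches:
    hidden = stage0(mb)               # stage 0 finishes this micro-batch...
    outputs.append(stage1(hidden))    # ...then hands it off to stage 1

# While stage 1 works on micro-batch i, stage 0 could already start i+1;
# the time a stage spends waiting is the "pipeline bubble" that hurts
# GPU utilization in the PP results below.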

Context Parallelism (CP)

Context parallelism partitions long input sequences across multiple GPUs, reducing peak activation memory and enabling efficient training of large-context models.
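
A small sketch of the memory effect (our illustration, with arbitrary sizes): chunking a long sequence across ranks shrinks the activations each rank must hold, at the cost of extra communication whenever attention needs tokens owned by another rank.

# Toy context parallelism: shard a long sequence across ranks and compare
# the activation memory each rank has to keep.
import torch

batch, seq_len, hidden = 1, 8192, 4096
activations = torch.randn(batch, seq_len, hidden)

cp_size = 4                                   # number of context-parallel ranks
shards = activations.chunk(cp_size, dim=1)    # each rank keeps seq_len / cp_size tokens

full_mib = activations.numel() * activations.element_size() / 2**20
shard_mib = shards[0].numel() * shards[0].element_size() / 2**20
print(f"activation per rank: {shard_mib:.0f} MiB with CP={cp_size} vs {full_mib:.0f} MiB without")
# Attention still needs keys/values held by other ranks, so every step adds
# communication -- the overhead visible in the CP results below.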

Data Parallelism (DP)

Data parallelism replicates the model across multiple devices, each processing a different subset of the input data and synchronizing gradients after each step, enabling scalable training with minimal communication overhead. It is configured by default: any GPUs not consumed by tensor, pipeline, or context parallelism are used as data-parallel replicas.

Source: NVIDIA, Accessed 8/5/2025
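
A minimal CPU-only sketch of the idea (not the actual NCCL-based implementation): two replicas see different slices of the batch, compute gradients independently, and average them so both take the same optimizer step.

# Toy data parallelism: two model replicas, different data shards,
# gradients averaged after the backward pass (no torch.distributed).
import copy
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)
model = nn.Linear(64, 1)
replica0, replica1 = copy.deepcopy(model), copy.deepcopy(model)

data, target = torch.randn(32, 64), torch.randn(32, 1)

# Each replica processes its own half of the batch.
F.mse_loss(replica0(data[:16]), target[:16]).backward()
F.mse_loss(replica1(data[16:]), target[16:]).backward()

# The all-reduce (here a plain average) keeps the replicas identical.
for p0, p1 in zip(replica0.parameters(), replica1.parameters()):
    avg = (p0.grad + p1.grad) / 2
    p0.grad.copy_(avg)
    p1.grad.copy_(avg)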

Micro Batch Size (MBS)

Micro batch size is the number of samples each data-parallel rank processes in a single forward and backward pass; micro batches are accumulated to build up the global batch.
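
On a single ND GB200 v6 VM, the data-parallel size falls out of the other parallelism settings, and together with the micro batch size it determines how many gradient-accumulation steps are needed to reach a target global batch size. A small arithmetic sketch (the variable names and the example global batch size are ours, not NeMo’s):

# How the batch-related quantities relate on a single 4-GPU ND GB200 v6 VM.
num_gpus = 4
tp, pp, cp = 2, 1, 1                   # example parallelism settings
mbs = 4                                 # micro batch size per data-parallel rank
global_batch_size = 128                 # example target, chosen for illustration

dp = num_gpus // (tp * pp * cp)                 # data-parallel size: 4 / 2 = 2
grad_accum = global_batch_size // (mbs * dp)    # 128 / (4 * 2) = 16 accumulation steps
print(f"DP={dp}, gradient accumulation steps={grad_accum}")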

Understanding Telemetry and Parameter Impact

To fully understand how parallelism parameters affect both model training and cluster efficiency, it’s essential to monitor a range of telemetry and performance metrics. In this analysis, the following key metrics were tracked for each experiment:

GPU Utilization

GPU utilization measures how effectively GPUs are being used. Higher utilization generally means better hardware efficiency.

Memory Usage

Memory usage indicates how much GPU memory is consumed. This helps identify potential bottlenecks or opportunities to increase batch size.

Streaming Multiprocessor (SM) Clock Speed

SM Clock Speed reflects the average operating frequency of the GPU’s SMs during training. Higher SM clock speeds can indicate more intensive computation but may also lead to increased power consumption and thermal load.

Training Step Time

Training step time is the average time to complete one training step. Shorter step times mean faster training.

Training Step Loss

The loss quantifies the discrepancy between predicted outputs and the true outputs. It measures how effectively the model is learning.
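
The hardware metrics above (GPU utilization, memory usage, SM clock speed) can be sampled during a run with NVIDIA’s NVML Python bindings. The sketch below is one possible way to collect them on the VM’s four GPUs, not necessarily the tooling behind the figures in this post:

# Poll GPU utilization, memory usage, and SM clock once per second via NVML
# (pip install nvidia-ml-py).
import time
import pynvml

pynvml.nvmlInit()
handles = [pynvml.nvmlDeviceGetHandleByIndex(i)
           for i in range(pynvml.nvmlDeviceGetCount())]   # 4 GPUs on ND GB200 v6

try:
    while True:
        for i, h in enumerate(handles):
            util = pynvml.nvmlDeviceGetUtilizationRates(h).gpu              # percent
            mem_gib = pynvml.nvmlDeviceGetMemoryInfo(h).used / 2**30        # GiB
            sm_mhz = pynvml.nvmlDeviceGetClockInfo(h, pynvml.NVML_CLOCK_SM) # MHz
            print(f"gpu{i} util={util}% mem={mem_gib:.1f}GiB sm_clock={sm_mhz}MHz")
        time.sleep(1)
finally:
    pynvml.nvmlShutdown()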

Methodology

We swept every valid configuration on a single ND GB200 v6 VM, varying the micro batch size from 1 to 16 and each parallelism parameter from 1 to 4, using fp16 with fp8 mixed precision. For accurate pretraining we recommend bf16 with fp8 mixed precision, as used in the fully automated Azure AI benchmarking guide.
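
The sketch below (our illustration, not the automation script itself) shows how the swept configurations can be enumerated: only combinations whose parallelism product divides the VM’s four GPUs are valid, and the remaining GPUs are used for data parallelism.

# Enumerate the sweep: MBS from 1 to 16, TP/PP/CP from 1 to 4, keeping only
# combinations that map onto a single 4-GPU ND GB200 v6 VM.
from itertools import product

NUM_GPUS = 4
configs = []
for mbs, tp, pp, cp in product(range(1, 17), range(1, 5), range(1, 5), range(1, 5)):
    model_parallel = tp * pp * cp
    if NUM_GPUS % model_parallel != 0:
        continue                           # cannot be laid out on 4 GPUs
    dp = NUM_GPUS // model_parallel        # leftover GPUs become data-parallel replicas
    configs.append({"mbs": mbs, "tp": tp, "pp": pp, "cp": cp, "dp": dp})

print(f"{len(configs)} candidate configurations to benchmark")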

Results

The following plots illustrate how each parameter influenced pretraining speed, memory usage, GPU utilization, and SM clock speed.

MBS

First, we analyzed the impact of micro batch size. With all parallelism parameters held constant, the pretraining time per step (Figure 1) and the SM clock speed (Figure 4) decreased as the MBS increased. On the other hand, memory usage (Figure 2) and GPU utilization (Figure 3) increased with the batch size. Larger batch sizes require more memory and computational effort per step, which drives up these metrics.

Figure 1. Average pretraining time per step of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.

Figure 2. Average memory used by the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.

Figure 3. Average GPU utilization of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.

Figure 4. Average SM clock speed of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.

Tensor Parallelism

As Tensor Parallelism (TP) increased, the average memory usage per GPU (Figure 6) decreased, since computations were distributed more broadly. However, higher TP values introduced communication overhead, which led to increased pretraining step time (Figure 5) and reduced GPU utilization (Figure 7). SM clock speed (Figure 8) slightly increased with TP, reflecting the trade-off between memory efficiency and computational speed. Notably, TP=1 achieved the fastest step time, while higher TP values slowed pretraining despite reducing memory usage.

Figure 5. Average pretraining time per step of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.

Figure 6. Average memory used by the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.

Figure 7. Average GPU utilization of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.

Figure 8. Average SM clock speed of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.

Pipeline Parallelism

Next, we examined Pipeline Parallelism (PP). Increasing PP divided the model into more sequential stages. In our single-VM, 4-GPU setup, higher PP values led to increased synchronization overhead and more idle time for GPUs, resulting in longer step times (Figure 9) and lower GPU utilization (Figure 11). Memory usage (Figure 10) decreased as PP increased, while SM clock speed (Figure 12) slightly increased. These results suggest that minimal pipeline parallelism is optimal for this hardware and model size, which does not benefit much from additional pipeline stages.

Figure 9. Average pretraining time per step of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.

Figure 10. Average memory used by the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.

Figure 11. Average GPU utilization of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.

Figure 12. Average SM clock speed of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.

Context Parallelism

Finally, we explored Context Parallelism (CP). Varying CP changed how input sequences were partitioned across GPUs. Increasing CP beyond 1 led to longer training step times (Figure 13) and underutilized resources, with a steady decline in GPU utilization (Figure 15). Memory usage (Figure 14) and SM clock speed (Figure 16) both decreased with higher CP, but the best training speed was achieved with CP set to 1.

Figure 13. Average pretraining time per step of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.

Figure 14. Average memory used by the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.

Figure 15. Average GPU utilization of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.

Figure 16. Average SM clock speed of the Llama 3 8B model with the NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.

Conclusion

As our goal with validation is to stress the virtual machine, the optimal configuration is the one with the highest memory usage and GPU utilization and the fastest pretraining steps. In the new pretraining section of our Azure AI benchmarking guide, we selected MBS = 4, TP = 1, PP = 2, and CP = 1 for automation. We invite you to reproduce our results by following the README.
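
For reference, the chosen configuration can be expressed along the lines of the NeMo 2.0 recipe API sketched below; the recipe and attribute names shown here are our best approximation and may differ between NeMo releases, so treat the automated benchmarking guide as the tested reference.

# Sketch of the selected configuration (MBS=4, TP=1, PP=2, CP=1) using the
# NeMo 2.0 recipe API. Names may vary between NeMo releases; the automated
# Azure AI Benchmarking guide remains the tested reference.
import nemo_run as run
from nemo.collections import llm

recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_nd_gb200_v6_validation",
    num_nodes=1,
    num_gpus_per_node=4,        # one ND GB200 v6 VM: four NVIDIA Blackwell GPUs
)
recipe.trainer.strategy.tensor_model_parallel_size = 1
recipe.trainer.strategy.pipeline_model_parallel_size = 2
recipe.trainer.strategy.context_parallel_size = 1
recipe.data.micro_batch_size = 4

run.run(recipe, executor=run.LocalExecutor(ntasks_per_node=4, launcher="torchrun"))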

 
