Small performance gaps on a single virtual machine lead to large and costly performance losses at scale. Running small-scale pretraining jobs enables single-VM validation and allows for fine-grained identification of issues such as performance degradation, hardware bottlenecks, or software inefficiencies ahead of large-scale runs. In this post, we present a practical methodology for benchmarking ND GB200 v6 virtual machines (VMs). A single ND GB200 v6 VM on Azure is powered by two NVIDIA Grace CPUs and four NVIDIA Blackwell GPUs. To ensure reliability in production workloads, we used automated pretraining of lightweight Llama models with the NVIDIA NeMo framework. By systematically exploring and tuning key performance parameters and rigorously cross-checking results with detailed telemetry, we identify the conditions that most significantly stress the GPUs. You can reproduce and reuse these pretraining workloads from our fully automated Azure AI Benchmarking guide.
by Mishty Dhekial (Software Engineer Intern) and Hugo Affaticati (Cloud Infrastructure Engineer)
Why Llama?
The Llama 3 8B model was selected as the focus of this analysis due to its relevance as a modern, open-weight large language model (LLM) architecture. Llama models are widely used in both research and industry. Their transformer-based design, featuring multi-head self-attention and deep stacking of layers, reflects the current state-of-the-art in natural language processing and is now part of our Azure AI benchmarking guide.
Model Architecture
The Llama model’s architecture is designed for scalability and efficiency. It incorporates optimizations such as rotary positional embeddings and grouped-query attention, which improve both training speed and inference quality. The open availability of the Llama models and pretraining recipes from the NVIDIA NeMo framework also makes it accessible for experimentation and benchmarking, allowing for reproducible and transparent performance evaluation.
The 8B parameter size strikes a practical balance between model capacity and hardware requirements. Llama 3 70B is too large to fit onto the previous generations of virtual machines covered in our Azure AI Benchmarking guide, while the 3B model doesn’t fully stress the ND GB200 v6 VM. 8B is compact enough to fit on a single virtual machine while still presenting a demanding workload that fully utilizes the ND GB200 v6 VM’s four GPUs. This approach is ideal for conducting controlled, fine-grained validation and troubleshooting on a single VM as an initial step, before progressing to multi-node distributed training.
Performance Parameters Overview
Efficiently training large language models depends on how the model is distributed across your virtual machine. This section explains the parallelism parameters (tensor, pipeline, context, and data) and the micro batch size that we tuned to maximize the Azure virtual machine's performance and training speed.
Tensor Parallelism (TP)
Tensor parallelism splits individual layers of a model across multiple GPUs, allowing large models to be trained by distributing computations.
Source: NVIDIA, Accessed 8/5/2025
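As a minimal single-process illustration of the idea (not NeMo's implementation): a linear layer's weight matrix is split column-wise into shards, each shard computes a partial output, and the partial outputs are concatenated. In a real framework such as NeMo/Megatron, each shard lives on a different GPU and the gather or reduction happens over NVLink; the sketch below only demonstrates the math on CPU.

```python
import torch

torch.manual_seed(0)
hidden, ffn, batch = 16, 32, 4
x = torch.randn(batch, hidden)   # input activations
w = torch.randn(hidden, ffn)     # full weight matrix of a linear layer

# Column-parallel split: each "GPU" owns half of the output features.
w_shards = torch.chunk(w, chunks=2, dim=1)

# Each shard computes its partial output independently.
partial_outputs = [x @ shard for shard in w_shards]

# Gathering the partials along the feature dimension reproduces the full result.
y_parallel = torch.cat(partial_outputs, dim=1)
y_reference = x @ w
assert torch.allclose(y_parallel, y_reference, atol=1e-6)
print("tensor-parallel output matches the single-device result")
```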
Pipeline Parallelism (PP)
Pipeline parallelism divides a model into sequential stages assigned to different GPUs.
Source: NVIDIA, Accessed 8/5/2025
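A minimal single-process sketch of the concept (again, not NeMo's implementation): the model is cut into two sequential stages, and micro batches flow from stage 0 to stage 1. On real hardware each stage would sit on a different GPU, and the framework overlaps work on different micro batches to hide idle time (the "pipeline bubble").

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(
    nn.Linear(16, 64), nn.ReLU(),   # stage 0
    nn.Linear(64, 16), nn.ReLU(),   # stage 1
)

# Cut the model into two pipeline stages. With real pipeline parallelism each
# stage would be placed on a different GPU (e.g. stage0.to("cuda:0")).
stage0, stage1 = model[:2], model[2:]

# A global batch split into micro batches that stream through the pipeline.
micro_batches = torch.randn(4, 8, 16).unbind(0)

outputs = []
for mb in micro_batches:
    activations = stage0(mb)             # stage 0 computes, then sends activations onward
    outputs.append(stage1(activations))  # stage 1 consumes them

print(torch.stack(outputs).shape)  # (4, 8, 16)
```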
Context Parallelism (CP)
Context parallelism partitions long input sequences across multiple GPUs, reducing peak activation memory and enabling efficient training of large-context models.
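Conceptually, context parallelism shards the sequence dimension of the activations so that each GPU only holds a slice of a very long input. The toy sketch below splits a long sequence into chunks and applies a position-wise operation to each chunk independently; attention layers additionally require communication across chunks (for example ring-style exchange of keys and values), which the framework handles and which is not shown here.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
seq_len, hidden, cp_size = 8192, 16, 4
x = torch.randn(1, seq_len, hidden)   # one very long sequence
mlp = nn.Linear(hidden, hidden)

# Shard the sequence dimension across cp_size "ranks": each rank only stores
# and processes seq_len / cp_size tokens, cutting peak activation memory.
chunks = torch.chunk(x, chunks=cp_size, dim=1)
partial = [mlp(c) for c in chunks]    # position-wise ops need no communication

# Re-assembling the chunks matches the unsharded computation.
assert torch.allclose(torch.cat(partial, dim=1), mlp(x), atol=1e-6)
print("context-parallel result matches the unsharded result")
```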
Data Parallelism (DP)
Data parallelism replicates the model across multiple devices, each processing a different subset of the input data and synchronizing gradients after each step, enabling scalable training with minimal communication overhead. This is configured by default.
Source: NVIDIA, Accessed 8/5/2025
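A minimal single-process sketch of data parallelism (frameworks such as NeMo use an NCCL all-reduce across GPUs instead): two identical model replicas each process a different micro batch, their gradients are averaged, and the same update is applied to both so the replicas stay in sync.

```python
import copy
import torch
import torch.nn as nn

torch.manual_seed(0)
replica0 = nn.Linear(16, 4)
replica1 = copy.deepcopy(replica0)   # both "GPUs" start from identical weights

batch0, batch1 = torch.randn(8, 16), torch.randn(8, 16)  # different data shards

# Each replica runs forward/backward on its own shard of the data.
replica0(batch0).sum().backward()
replica1(batch1).sum().backward()

# Average gradients across replicas (the all-reduce step), then apply the same
# update everywhere so the replicas remain identical.
for p0, p1 in zip(replica0.parameters(), replica1.parameters()):
    avg_grad = (p0.grad + p1.grad) / 2
    p0.grad.copy_(avg_grad)
    p1.grad.copy_(avg_grad)
    with torch.no_grad():
        p0 -= 0.01 * p0.grad
        p1 -= 0.01 * p1.grad

assert all(torch.allclose(a, b) for a, b in
           zip(replica0.parameters(), replica1.parameters()))
print("replicas stayed in sync after the averaged update")
```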
Micro Batch Size (MBS)
Micro batch size is the number of samples processed in a single forward/backward pass by each data-parallel rank; several micro batches are accumulated to form the global batch.
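The micro batch size, the degree of gradient accumulation, and the data-parallel size together determine the effective global batch size. A quick sanity check, assuming the usual Megatron/NeMo convention where the global batch is micro batch size × gradient accumulation steps × data-parallel size (the accumulation value below is purely illustrative):

```python
# Hypothetical single-VM example with 4 GPUs (ND GB200 v6).
num_gpus = 4
tensor_parallel = 1
pipeline_parallel = 2
context_parallel = 1

# Data-parallel size is whatever is left after the other parallelisms.
data_parallel = num_gpus // (tensor_parallel * pipeline_parallel * context_parallel)

micro_batch_size = 4
gradient_accumulation_steps = 8   # illustrative value

global_batch_size = micro_batch_size * gradient_accumulation_steps * data_parallel
print(f"DP={data_parallel}, global batch size={global_batch_size}")  # DP=2, 64
```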
Understanding Telemetry and Parameter Impact
To fully understand how parallelism parameters affect both model training and cluster efficiency, it’s essential to monitor a range of telemetry and performance metrics. In this analysis, the following key metrics were tracked for each experiment:
GPU Utilization
GPU utilization measures the percentage of time the GPUs are actively executing work. Higher utilization generally means better hardware efficiency.
Memory Usage
Memory usage indicates how much GPU memory is consumed. This helps identify potential bottlenecks or opportunities to increase batch size.
Streaming Multiprocessor (SM) Clock Speed
SM Clock Speed reflects the average operating frequency of the GPU’s SMs during training. Higher SM clock speeds can indicate more intensive computation but may also lead to increased power consumption and thermal load.
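The three GPU-side metrics above can be sampled directly from NVML while a job runs. Below is a small sketch using the NVML Python bindings (pynvml); DCGM-based collectors expose the same counters, and this is simply a lightweight way to spot-check them on the VM.

```python
import pynvml

pynvml.nvmlInit()
try:
    for i in range(pynvml.nvmlDeviceGetCount()):
        handle = pynvml.nvmlDeviceGetHandleByIndex(i)
        util = pynvml.nvmlDeviceGetUtilizationRates(handle).gpu                 # percent
        mem = pynvml.nvmlDeviceGetMemoryInfo(handle)                            # bytes
        sm_clock = pynvml.nvmlDeviceGetClockInfo(handle, pynvml.NVML_CLOCK_SM)  # MHz
        print(f"GPU {i}: util={util}%, "
              f"mem={mem.used / 2**30:.1f}/{mem.total / 2**30:.1f} GiB, "
              f"SM clock={sm_clock} MHz")
finally:
    pynvml.nvmlShutdown()
```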
Training Step Time
Training step time is the average time to complete one training step. Shorter step times mean faster pretraining.
Training Step Loss
The loss quantifies the discrepancy between predicted outputs and the true outputs. It measures how effectively the model is learning.
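For next-token pretraining, the step loss is the average cross-entropy between the model's predicted token distribution and the actual next tokens; lower values mean the model assigns higher probability to the correct tokens. A small illustration:

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)
vocab_size, num_tokens = 128256, 6                   # Llama 3 vocabulary size, tiny batch
logits = torch.randn(num_tokens, vocab_size)         # model outputs for 6 positions
targets = torch.randint(vocab_size, (num_tokens,))   # the true next tokens

loss = F.cross_entropy(logits, targets)
print(f"step loss: {loss.item():.3f}")  # roughly ln(vocab_size) ≈ 11.8 for random logits
```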
Methodology
We ran a sweep of all possible configurations, with micro batch size from 1 to 16 and each parallelism parameter from 1 to 4, using fp16 with fp8 mixed precision on a single ND GB200 v6 VM. For accurate pretraining, we recommend bf16 with fp8 mixed precision, as used in the fully automated Azure AI benchmarking guide.
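The sweep itself can be expressed as a loop over the parameter grid, skipping layouts that do not map onto the four GPUs of a single VM (TP × PP × CP must divide the GPU count). The sketch below is illustrative; run_pretraining is a hypothetical stand-in for the NeMo launch used in the benchmarking guide.

```python
from itertools import product

NUM_GPUS = 4  # one ND GB200 v6 VM

def run_pretraining(mbs, tp, pp, cp):
    """Hypothetical stand-in for launching a short Llama 3 8B pretraining run
    with these settings and collecting its telemetry."""
    print(f"would launch: MBS={mbs} TP={tp} PP={pp} CP={cp}")

configs = []
for mbs, tp, pp, cp in product(range(1, 17), range(1, 5), range(1, 5), range(1, 5)):
    model_parallel = tp * pp * cp
    # Skip layouts that do not fit the 4 GPUs of a single VM.
    if model_parallel > NUM_GPUS or NUM_GPUS % model_parallel:
        continue
    configs.append((mbs, tp, pp, cp))

print(f"{len(configs)} valid configurations to sweep")
for mbs, tp, pp, cp in configs:
    run_pretraining(mbs, tp, pp, cp)
```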
Results
The following plots illustrate how each parameter influenced pretraining speed, memory usage, GPU utilization, and SM clock speed.
MBS
First, we analyzed the impact of micro batch size. With all parallelism parameters held constant, the pretraining time per step (Figure 1) and the SM clock speed (Figure 4) decreased as the MBS increased. On the other hand, memory usage (Figure 2) and GPU utilization (Figure 3) increased with the batch size. Larger batch sizes require more memory and computational effort per step, which drives up these metrics.
Figure 1. Average pretraining time per step of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.
Figure 2. Average memory used of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.
Figure 3. Average GPU Utilization of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.
Figure 4. Average SM clock speed of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of the micro batch size.
Tensor Parallelism
As Tensor Parallelism (TP) increased, the average memory usage per GPU (Figure 6) decreased, since computations were distributed more broadly. However, higher TP values introduced communication overhead, which led to increased pretraining step time (Figure 5) and reduced GPU utilization (Figure 7). SM clock speed (Figure 8) slightly increased with TP, reflecting the trade-off between memory efficiency and computational speed. Notably, TP=1 achieved the fastest step time, while higher TP values slowed pretraining despite reducing memory usage.
Figure 5. Average pretraining time per step of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.
Figure 6. Average memory used of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.
Figure 7. Average GPU Utilization of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.
Figure 8. Average SM clock speed of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of TP.
Pipeline Parallelism
Next, we examined Pipeline Parallelism (PP). Increasing PP divided the model into more sequential stages. In our single-VM, 4-GPU setup, higher PP values led to increased synchronization overhead and more idle time for GPUs, resulting in longer step times (Figure 9) and lower GPU utilization (Figure 11). Memory usage (Figure 10) decreased as PP increased, while SM clock speed (Figure 12) slightly increased. These results suggest that minimal pipeline parallelism is optimal for a model of this size on this hardware, which does not benefit much from additional pipeline stages.
Figure 9. Average pretraining time per step of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.
Figure 10. Average memory used of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.
Figure 11. Average GPU Utilization of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.
Figure 12. Average SM clock speed of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of PP.
Context Parallelism
Finally, we explored Context Parallelism (CP). Varying CP changed how input sequences were partitioned across GPUs. Increasing CP beyond 1 led to longer training step times (Figure 13) and underutilized resources, with a steady decline in GPU utilization (Figure 15). Memory usage (Figure 14) and SM clock speed (Figure 16) both decreased with higher CP, but the best training speed was achieved with CP set to 1.
Figure 13. Average pretraining time per step of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.
Figure 14. Average memory used of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.
Figure 15. Average GPU Utilization of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.
Figure 16. Average SM clock speed of the LLAMA3 8B model with NVIDIA NeMo framework on Azure ND GB200 v6 VM as a function of CP.
Conclusion
As our goal with validation is to stress the virtual machine, the optimal configuration is the one with the highest memory usage and GPU utilization and the fastest pretraining steps. In the new pretraining section of our Azure AI benchmarking guide, we selected MBS = 4, TP = 1, PP = 2, and CP = 1 for automation. We invite you to reproduce our results by following the README.
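For reference, the selected configuration can be expressed with NeMo's recipe interface roughly as follows. This is an illustrative sketch based on the NeMo 2.0 recipe API; exact attribute names can vary between NeMo versions, and the benchmarking guide's automation remains the authoritative setup.

```python
from nemo.collections import llm

# Llama 3 8B pretraining recipe on a single ND GB200 v6 VM (1 node x 4 GPUs).
recipe = llm.llama3_8b.pretrain_recipe(
    name="llama3_8b_gb200_validation",
    num_nodes=1,
    num_gpus_per_node=4,
)

# Parallelism and batch settings selected for single-VM validation.
recipe.trainer.strategy.tensor_model_parallel_size = 1    # TP
recipe.trainer.strategy.pipeline_model_parallel_size = 2  # PP
recipe.trainer.strategy.context_parallel_size = 1         # CP
recipe.data.micro_batch_size = 4                          # MBS
```

Launching the recipe end to end, including precision settings and telemetry collection, is covered step by step in the guide's README.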