benchmarking
50 TopicsDeploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPU and 8 InfiniBand devices available on each vritual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce job is executed on the NDmv4 AKS cluster to verify its deployed/configured correctly.Performance at Scale: The Role of Interconnects in Azure HPC & AI Infrastructure
Microsoft Azure’s high-performance computing (HPC) & AI infrastructure is designed from the ground up to support the world’s most demanding workloads. High-performance AI workloads are bandwidth-hungry and latency-sensitive. As models scale in size and complexity, the efficiency of the interconnect fabric—how CPUs, GPUs, and storage communicate—becomes a critical factor in overall system performance. Even with the fastest GPUs, poor interconnect design can lead to bottlenecks, underutilized hardware, and extended time-to-results. In this blog post, we will highlight one of the key enabling features for running large-scale distributed workloads on Azure: a highly tuned HPC-class interconnect. Azure has invested years of system-level engineering of the InfiniBand interconnect, into ready-to-use configurations for customers available on Azure’s HB-series and N-series virtual machine (VMs).Azure’s ND GB200 v6 Delivers Record Performance for Inference Workloads
Achieving peak AI performance requires both cutting-edge hardware and a finely optimized infrastructure. Azure’s ND GB200 v6 Virtual Machines, accelerated by the NVIDIA GB200 Blackwell GPUs, have already demonstrated world record performance of 865,000 tokens/s for inferencing on the industry standard LLAMA2 70BAnnouncing Azure HBv5 Virtual Machines: A Breakthrough in Memory Bandwidth for HPC
Discover the new Azure HBv5 Virtual Machines, unveiled at Microsoft Ignite, designed for high-performance computing applications. With up to 7 TB/s of memory bandwidth and custom 4th Generation EPYC processors, these VMs are optimized for the most memory-intensive HPC workloads. Sign up for the preview starting in the first half of 2025 and see them in action at Supercomputing 2024 in AtlantaPerformance & Scalability of HBv4 and HX-Series VMs with Genoa-X CPUs
Azure has announced the general availability of Azure HBv4-series and HX-series virtual machines (VMs) for high performance computing (HPC). This blog provides in-depth technical and performance information about these HPC-optimized VMs.Training large AI models on Azure using CycleCloud + Slurm
Here we demonstrate and provide template to deploy a computing environment optimized to train a transformer-based large language model on Azure using CycleCloud, a tool to orchestrate and manage HPC environments, to provision a cluster comprised of A100, or H100, nodes managed by Slurm. Such environments have been deployed to train foundational models with 10-100s billions of parameters on terabytes of data.DGX Cloud Benchmarking on Azure
This blog presents our benchmarking results for NVIDIA DGX Cloud workloads on Azure, scaling from 8 to 1024 H100 GPUs. We detail the Slurm-based setup using Azure CycleCloud Workspace for Slurm, performance validation via NCCL and thermal screening, and tuning strategies that deliver near-parity with NVIDIA DGX reference metrics.Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)
As HPC & AI workloads continue to scale in complexity and performance demands, ensuring visibility into the underlying infrastructure becomes critical. This guide presents an essential monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices. By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments. In this blog, we'll walk through the process of configuring Telegraf to collect and send GPU and InfiniBand monitoring metrics to Azure Monitor. This end-to-end guide covers all the essential steps to enable robust monitoring for NVIDIA GPUs and Mellanox InfiniBand devices, empowering you to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure. DISCLAIMER: This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support. While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. To monitor GPU and InfiniBand performance, additional configuration using third-party tools—such as Telegraf—is required. As of the time of writing, Azure Monitor does not include built-in support for these metrics without external integrations. Step 1: Making changes in Azure for sending GPU and IB metrics from Telegraf agents to Azure monitor from VM or VMSS. Register the microsoft.insights resource provider in your Azure subscription. Refer: Resource providers and resource types - Azure Resource Manager | Microsoft Learn Step 2: Enable Managed Service Identities to authenticate an Azure VM or Azure VMSS. In the example we are using Managed Identity for authentication. You can also use User Managed Identities or Service Principle to authenticate the VM. Refer: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com) Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for VMSS. The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, ensure that you manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand driver. Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04. This script will also configure the NVIDIA SMI input plugin and InfiniBand Input Plugin, along with setting up the Telegraf configuration to send data to Azure Monitor. Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04. Please read the InfiniBand counter collected by Telegraf - https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters Run the following commands: wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh chmod +x gpu-ib-mon_setup.sh ./gpu-ib-mon_setup.sh Test the Telegraf configuration by executing the following command: sudo telegraf --config /etc/telegraf/telegraf.conf --test Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification. To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal: Set the scope to your VM. Choose the Metric Namespace as Telegraf/nvidia-smi. From there, you can select and display various GPU metrics such as utilization, memory usage, temperature, and more. In example we are using GPU memory_used metrics. Use filters and splits to analyze data across multiple GPUs or over time. To monitor InfiniBand performance, repeat the same process: In the Metrics section, set the scope to your VM. Select the Metric Namespace as Telegraf/infiniband. You can visualize metrics such as port status, data transmitted/received, and error counters. In this example, we are using a Link Flap Metrics to check the InfiniBand link flaps. Use filters to break down the data by port or metric type for deeper insights. Link_downed Metric Note: The link_downed metric with Aggregation: Count is returning incorrect values. We can use Max, Min values. Port_rcv_data metrics Creating custom dashboards in Azure Monitor with both Telegraf/nvidia-smi and Telegraf/infiniband namespaces allows for unified visibility into GPU and InfiniBand. Testing InfiniBand and GPU Usage If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads—especially over InfiniBand—here’s a straightforward solution using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups. NCCL Benchmark and OpenMPI is part of the Ubuntu HPC 22.04 image. Update the variable according to your environment. Update the hostfile with the hostname. module load mpi/hpcx-v2.13.1 export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5 mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \ -mca coll_hcoll_enable 0 --bind-to numa \ -x NCCL_IB_PCI_RELAXED_ORDERING=1 \ -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \ -x CUDA_DEVICE_ORDER=PCI_BUS_ID \ -x NCCL_SOCKET_IFNAME=eth0 \ -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \ -x NCCL_DEBUG=WARN \ /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1 Alternate: GPU Load Simulation Using TensorFlow If you're looking for a more application-like load (e.g., distributed training), I’ve prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines. To get started, run the following: wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh chmod +x gpu_test_program.sh ./gpu_test_program.sh With either method NCCL benchmarks or TensorFlow training you’ll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence. Happy testing! References: Ubuntu HPC on Azure ND A100 v4-series GPU VM Sizes Telegraf Azure Monitor Output Plugin (v1.15) Telegraf NVIDIA SMI Input Plugin (v1.15) Telegraf InfiniBand Input Plugin Documentation