
Azure High Performance Computing (HPC) Blog

Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)

vinilv
Apr 24, 2025

As HPC & AI workloads continue to scale in complexity and performance demands, ensuring visibility into the underlying infrastructure becomes critical. This guide presents an essential monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices.

By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments.

In this blog, we'll walk through the process of configuring Telegraf to collect and send GPU and InfiniBand monitoring metrics to Azure Monitor. This end-to-end guide covers all the essential steps to enable robust monitoring for NVIDIA GPUs and Mellanox InfiniBand devices, empowering you to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure.

DISCLAIMER:
This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support.

While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. To monitor GPU and InfiniBand performance, additional configuration using third-party tools—such as Telegraf—is required. As of the time of writing, Azure Monitor does not include built-in support for these metrics without external integrations.

Step 1: Configure Azure to Receive GPU and InfiniBand Metrics from Telegraf Agents Running on a VM or VMSS

Register the microsoft.insights resource provider in your Azure subscription. Refer: Resource providers and resource types - Azure Resource Manager | Microsoft Learn
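You can do this from the portal or with the Azure CLI. For example, assuming you are already logged in and scoped to the correct subscription:

az provider register --namespace Microsoft.Insights
az provider show --namespace Microsoft.Insights --query registrationState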

Step 2: Enable a Managed Identity to Authenticate the Azure VM or VMSS

In this example, we use a system-assigned managed identity for authentication. You can also use a user-assigned managed identity or a service principal to authenticate the VM. Refer: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)
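For example, a system-assigned managed identity can be enabled with the Azure CLI; the resource group and resource names below are placeholders for your own:

az vm identity assign --resource-group <resource-group> --name <vm-name>
# For a scale set:
az vmss identity assign --resource-group <resource-group> --name <vmss-name>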

Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor

In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image; the same configuration applies to a VMSS. The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, make sure to manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand drivers.

Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04. The script also configures the NVIDIA SMI and InfiniBand input plugins and sets up the Telegraf configuration to send data to Azure Monitor.

Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04.

For background on the InfiniBand counters collected by Telegraf, see: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters

Run the following commands:

wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh
chmod +x gpu-ib-mon_setup.sh
./gpu-ib-mon_setup.sh
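For reference, a minimal telegraf.conf in the spirit of what the script sets up might look like the sketch below. This is an illustration, not the script's exact output; by default the azure_monitor output plugin auto-detects the region and resource ID from the Azure Instance Metadata Service and authenticates with the VM's managed identity:

sudo tee /etc/telegraf/telegraf.conf > /dev/null <<'EOF'
[agent]
  interval = "60s"

# GPU metrics collected by shelling out to nvidia-smi
[[inputs.nvidia_smi]]
  bin_path = "/usr/bin/nvidia-smi"

# InfiniBand counters read from /sys/class/infiniband
[[inputs.infiniband]]

# Aggregate to 1-minute resolution and push to Azure Monitor
[[outputs.azure_monitor]]
  namespace_prefix = "Telegraf/"
EOF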

Test the Telegraf configuration by executing the following command:

sudo telegraf --config /etc/telegraf/telegraf.conf --test
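If the test output looks correct, make sure the Telegraf service is running and enabled at boot (the setup script may already handle this):

sudo systemctl enable --now telegraf
sudo systemctl status telegraf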

Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage

Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification.

To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal:

  • Set the scope to your VM.
  • Choose the Metric Namespace as Telegraf/nvidia-smi.
  • From there, you can select and display various GPU metrics such as utilization, memory usage, temperature, and more. In this example, we use the GPU memory_used metric (see the CLI query after this list).
  • Use filters and splits to analyze data across multiple GPUs or over time.
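The same data can also be queried from the Azure CLI. A sketch, assuming a placeholder resource ID and the memory_used metric name published by the nvidia_smi input plugin:

az monitor metrics list \
    --resource /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Compute/virtualMachines/<vm-name> \
    --namespace "Telegraf/nvidia-smi" \
    --metric "memory_used" \
    --aggregation Average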

To monitor InfiniBand performance, repeat the same process:

  • In the Metrics section, set the scope to your VM.
  • Select the Metric Namespace as Telegraf/infiniband.
  • You can visualize metrics such as port status, data transmitted/received, and error counters. In this example, we use the link_downed metric to check for InfiniBand link flaps.
  • Use filters to break down the data by port or metric type for deeper insights.

link_downed metric

Note: The link_downed metric returns incorrect values with the Count aggregation. Use the Max or Min aggregation instead.
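The same workaround applies when querying from the CLI; for example (placeholder resource ID):

az monitor metrics list \
    --resource <vm-resource-id> \
    --namespace "Telegraf/infiniband" \
    --metric "link_downed" \
    --aggregation Maximum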

port_rcv_data metric

Creating custom dashboards in Azure Monitor that combine the Telegraf/nvidia-smi and Telegraf/infiniband namespaces gives you unified visibility into GPU and InfiniBand performance.

Testing InfiniBand and GPU Usage 

If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads, especially over InfiniBand, here's a straightforward solution using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups. The NCCL benchmarks and Open MPI are part of the Ubuntu-HPC 22.04 image. Update the variables according to your environment, and update the hostfile with your node hostnames (a sample hostfile follows).
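A minimal two-node hostfile matching the 16-rank run below might look like this (the hostnames are placeholders for your actual nodes):

cat > hostfile <<'EOF'
node-000001 slots=8
node-000002 slots=8
EOF

Then run the benchmark: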

# Load the HPC-X MPI stack shipped with the Ubuntu-HPC image
module load mpi/hpcx-v2.13.1
# Order GPUs to match the NUMA topology of the ND96asr_v4 VM
export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
# Run NCCL all_reduce across 16 ranks (8 GPUs per node on 2 nodes)
mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
       -mca coll_hcoll_enable 0 --bind-to numa \
       -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
       -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
       -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
       -x NCCL_SOCKET_IFNAME=eth0 \
       -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
       -x NCCL_DEBUG=WARN \
       /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1

Alternate: GPU Load Simulation Using TensorFlow

If you're looking for a more application-like load (e.g., distributed training), I’ve prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines.

To get started, run the following:

wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
chmod +x gpu_test_program.sh
./gpu_test_program.sh
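While either workload is running, you can sanity-check locally that metrics are being produced before looking at Azure Monitor:

# Watch GPU utilization climb as the job runs
watch -n 5 nvidia-smi

# Confirm Telegraf is still collecting GPU samples
sudo telegraf --config /etc/telegraf/telegraf.conf --input-filter nvidia_smi --test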

With either method (NCCL benchmarks or TensorFlow training), you'll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence.

Happy testing! 
