Overview
Monitoring is a crucial aspect of managing high-performance computing (HPC) and AI clusters. Here we focus specifically on resource utilization monitoring using a custom data collector and the Azure Monitor service. Using a custom data collector, we show how to collect and monitor the following HPC/AI resources:
- CPU (user, sys, idle, iowait, memory utilization, etc.)
- GPU (utilization, tensor core activity, temperature, etc.)
- Network (InfiniBand and Ethernet)
- Storage (local SSD, attached disks and NFS storage)
- Scheduled events (Spot VM evictions, scheduled maintenance, etc.)
Azure Monitor is an Azure service that provides a platform to ingest, analyze, query and monitor all types of data. The primary advantage of using Azure Monitor is simplicity: you do not need to deploy any additional resources or install any extra software to monitor your data.
Here we give an example of how to use a custom data collector to extract various ND96amsr_A100_v4 (A100 on Ubuntu-HPC 20.04) GPU metrics and send them to Log Analytics for analysis. Please note that the HPC/AI monitoring procedures outlined in this blog post are highly customizable and portable, and can be used with other Azure HPC/AI VM types (e.g. HBv3, NDv2 and NC series).
Which GPU metrics to use?
NVIDIA Data Center GPU Manager (DCGM) is a framework that provides access to several low-level GPU counters and metrics that give insight into the performance and health of the GPUs. In this example we will be monitoring counters/metrics provided by the dmon feature. All DCGM metrics/counters are accessed by a specific field id. To see all available field ids:
dcgmi dmon -l
___________________________________________________________________________________
Long Name                 Short Name   Field Id
___________________________________________________________________________________
driver_version            DRVER        1
nvml_version              NVVER        2
process_name              PRNAM        3
device_count              DVCNT        4
cuda_driver_version       CDVER        5
name                      DVNAM        50
brand                     DVBRN        51
nvml_index                NVIDX        52
serial_number             SRNUM        53
uuid                      UUID#        54
minor_number              MNNUM        55
oem_inforom_version       OEMVR        56
pci_busid                 PCBID        57
pci_combined_id           PCCID        58
pci_subsys_id             PCSID        59
...
Note: The DCGM stand-alone executable dcgmi is pre-installed on the Ubuntu-HPC marketplace images.
Some useful DCGM field Ids
Field Id | GPU Metric |
150 | temperature (in C) |
203 | utilization (0-100) |
252 | memory used (0-100) |
1004 | tensor core active (0-1) |
1006 | fp64 unit active (0-1) |
1007 | fp32 unit active (0-1) |
1008 | fp16 unit active (0-1) |
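As a quick illustration of how these field ids can be sampled programmatically (the custom collector described below does something similar), here is a minimal sketch that shells out to dcgmi dmon. The field id list, the -e/-c options and the output parsing are illustrative and may need adjusting for your DCGM version.

import subprocess

# Illustrative example: sample GPU utilization (203), memory used (252) and
# tensor core activity (1004) once for all GPUs using the dcgmi dmon feature.
FIELD_IDS = "203,252,1004"

# -e selects the field ids to watch, -c 1 requests a single sample
output = subprocess.check_output(
    ["dcgmi", "dmon", "-e", FIELD_IDS, "-c", "1"], text=True
)

for line in output.splitlines():
    # Data lines typically start with the entity, e.g. "GPU 0 ..."; headers are skipped.
    if line.startswith("GPU"):
        parts = line.split()
        gpu_id, values = parts[1], parts[2:]
        print(f"GPU {gpu_id}: {dict(zip(FIELD_IDS.split(','), values))}")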
How to create a custom HPC/AI cluster Azure Monitor collector
The Python script hpc_data_collector.py connects to your Log Analytics workspace and collects various GPU, InfiniBand, Ethernet, CPU, disk and NFS I/O metrics, sending them to the workspace at a specified time interval.
hpc_data_collector.py -h
usage: hpc_data_collector.py [-h] [-dfi DCGM_FIELD_IDS] [-nle NAME_LOG_EVENT]
[-fhm] [-gpum] [-ibm] [-ethm] [-nfsm] [-diskm]
[-dfm] [-cpum] [-cpu_memm] [-eventm] [-uc]
[-tis TIME_INTERVAL_SECONDS]
optional arguments:
-h, --help show this help message and exit
-dfi DCGM_FIELD_IDS, --dcgm_field_ids DCGM_FIELD_IDS
Select the DCGM field ids you would like to monitor
(if multiple field ids are desired then separate by
commas) [string] (default: 203,252,1004)
-nle NAME_LOG_EVENT, --name_log_event NAME_LOG_EVENT
Select a name for the log events you want to monitor
(default: MyGPUMonitor)
-fhm, --force_hpc_monitoring
Forces data to be sent to log analytics WS even if no
SLURM job is running on the node (default: False)
-gpum, --gpu_metrics Collect GPU metrics (default: False)
-ibm, --infiniband_metrics
Collect InfiniBand metrics (default: False)
-ethm, --ethernet_metrics
Collect Ethernet metrics (default: False)
-nfsm, --nfs_metrics Collect NFS client side metrics (default: False)
-diskm, --disk_metrics
Collect disk device metrics (default: False)
-dfm, --df_metrics Collect filesystem (usage and inode) metrics (default: False)
-cpum, --cpu_metrics Collects CPU metrics (e.g. user, sys, idle & iowait
time) (default: False)
-cpu_memm, --cpu_mem_metrics
Collects CPU memory metrics (Default: MemTotal,
MemFree) (default: False)
-eventm, --scheduled_event_metrics
Collects Azure/user scheduled events metrics (default:
False)
-uc, --use_crontab This script will be started by the system crontab and
the time interval between each data collection will be
decided by the system crontab (if crontab is selected
then the -tis argument will be ignored). (default:
False)
-tis TIME_INTERVAL_SECONDS, --time_interval_seconds TIME_INTERVAL_SECONDS
The time interval in seconds between each data
collection (This option cannot be used with the -uc
argument) (default: 10)
Note: This script also collects the SLURM job id and the physical hostname (i.e. the physical host on which the VM is running). By default, data is only sent to the Log Analytics workspace if a SLURM job is running on the node (this can be overridden with the -fhm option).
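For example, to collect GPU, InfiniBand and CPU metrics every 30 seconds you could start the collector using only the options shown above:

hpc_data_collector.py -gpum -ibm -cpum -tis 30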
The preferred way to enable HPC/AI cluster monitoring is to use the provided CycleCloud cc_hpc_monitoring project: upload it to your CycleCloud locker and enable it on your compute nodes.
Note: HPC/AI monitoring can also be enabled manually or via a crontab; some sample scripts are provided in the repository.
To connect to the Log Analytics workspace, the customer_id and shared_key need to be defined. (The Customer ID (i.e. Workspace ID) and shared key (primary or secondary key) can be found in the Azure portal --> Log Analytics workspace --> Agents management.)
You can either define customer_id and shared_key in the script or set them with environment variables:
export LOG_ANALYTICS_CUSTOMER_ID=<log_analytics_customer_id>
export LOG_ANALYTICS_SHARED_KEY=<log_analytics_shared_key>
Note: if customer_id or shared_key is defined in the hpc_data_collector.py script, then the LOG_ANALYTICS_CUSTOMER_ID and LOG_ANALYTICS_SHARED_KEY environment variables will be ignored.
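For reference, the ingestion that hpc_data_collector.py performs boils down to the Log Analytics HTTP Data Collector API: a JSON payload is POSTed to the workspace with an HMAC-SHA256 signature built from the shared key. The sketch below is illustrative (the post_to_log_analytics helper name and the example record fields are assumptions), not the script's exact code.

import base64, hashlib, hmac, json, os
from datetime import datetime, timezone
import requests

customer_id = os.environ["LOG_ANALYTICS_CUSTOMER_ID"]
shared_key = os.environ["LOG_ANALYTICS_SHARED_KEY"]

def post_to_log_analytics(log_type, records):
    """Illustrative helper: send a list of dict records as a custom log."""
    body = json.dumps(records)
    rfc1123_date = datetime.now(timezone.utc).strftime("%a, %d %b %Y %H:%M:%S GMT")
    # Signature string defined by the HTTP Data Collector API
    string_to_sign = (f"POST\n{len(body)}\napplication/json\n"
                      f"x-ms-date:{rfc1123_date}\n/api/logs")
    signature = base64.b64encode(
        hmac.new(base64.b64decode(shared_key),
                 string_to_sign.encode("utf-8"),
                 hashlib.sha256).digest()).decode()
    headers = {
        "Content-Type": "application/json",
        "Authorization": f"SharedKey {customer_id}:{signature}",
        "Log-Type": log_type,        # appears as <log_type>_CL in Log Analytics
        "x-ms-date": rfc1123_date,
    }
    url = f"https://{customer_id}.ods.opinsights.azure.com/api/logs?api-version=2016-04-01"
    requests.post(url, data=body, headers=headers).raise_for_status()

# Example record (field names are illustrative)
post_to_log_analytics("MyGPUMonitor", [{"hostname": "node001", "gpu_utilization": 87}])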
Details of HPC/AI Cluster metrics collected
InfiniBand metrics (-ibm)
InfiniBand metrics are collected from this location
/sys/class/infiniband/<IB device>/ports/<port number>/counters
By default, port_xmit_data and port_rcv_data are collected. To change the defaults, modify the IB_COUNTERS list definition in the custom collector script.
IB_COUNTERS = [
'port_xmit_data',
'port_rcv_data'
]
per_sec is appended to the InfiniBand metric fields, so in a Kusto query you can reference IB metrics using the format <METRIC>_per_sec (e.g. port_xmit_data_per_sec).
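To illustrate how the per-second values can be derived, here is a minimal sketch that samples the sysfs counters twice and converts the deltas to rates; the device/port names and the 10-second interval are just examples, and the same pattern applies to the Ethernet counters described in the next section.

import time
from pathlib import Path

IB_COUNTERS = ["port_xmit_data", "port_rcv_data"]

def read_counters(dev="mlx5_ib0", port="1"):
    """Read the selected InfiniBand port counters from sysfs (device/port are examples)."""
    base = Path(f"/sys/class/infiniband/{dev}/ports/{port}/counters")
    return {c: int((base / c).read_text()) for c in IB_COUNTERS}

interval = 10  # seconds between samples
first = read_counters()
time.sleep(interval)
second = read_counters()

# Convert the counter deltas to per-second rates (<counter>_per_sec)
rates = {f"{c}_per_sec": (second[c] - first[c]) / interval for c in IB_COUNTERS}
print(rates)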
Ethernet metrics (-ethm)
Ethernet metrics are collected from
/sys/class/net/eth*/statistics
By default, tx_bytes and rx_bytes are collected. To change the defaults, modify the ETH_COUNTERS list definition in the collector script.
ETH_COUNTERS = [
'tx_bytes',
'rx_bytes'
]
per_sec is appended to the Ethernet metric fields, so in Kusto tx_bytes is referenced as tx_bytes_per_sec.
CPU metrics (-cpum)
CPU metrics are collected from this location.
/proc/stat
and cpu load average (1 minute) from this location.
/proc/loadavg
The CPU metrics can be referenced in Kusto as cpu_user_time_user_hz_d, cpu_nice_time_user_hz_d, cpu_idle_time_user_hz_d, cpu_iowait_time_user_hz_d, cpu_irq_time_user_hz_d and cpu_softirq_time_user_hz_d.
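As a rough illustration of where those numbers come from, the aggregate "cpu" line in /proc/stat holds cumulative times (in USER_HZ ticks) for user, nice, system, idle, iowait, irq, softirq and so on. A minimal parse looks something like this (the field names here are generic, not necessarily the exact column names used by the collector):

# Parse the aggregate "cpu" line from /proc/stat (values are cumulative USER_HZ ticks)
with open("/proc/stat") as f:
    fields = f.readline().split()  # first line: "cpu  user nice system idle iowait irq softirq ..."

names = ["user", "nice", "system", "idle", "iowait", "irq", "softirq"]
cpu_times = dict(zip(names, map(int, fields[1:1 + len(names)])))
print(cpu_times)

# 1-minute load average from /proc/loadavg
with open("/proc/loadavg") as f:
    load_1min = float(f.read().split()[0])
print(load_1min)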
CPU memory (-cpu_memm)
CPU memory metrics are collected from this location.
/proc/meminfo
By default only MemTotal and MemFree are collected, but that can be changed by modifying the CPU_MEM_COUNTERS list definition in the custom collector script.
CPU_MEM_COUNTERS = [
'MemTotal',
'MemFree'
]
_KB is appended to these counter names, so for example MemTotal can be referenced in Kusto as MemTotal_KB.
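A minimal sketch of reading those counters (values in /proc/meminfo are reported in kB, which matches the _KB suffix):

CPU_MEM_COUNTERS = ["MemTotal", "MemFree"]

mem_kb = {}
with open("/proc/meminfo") as f:
    for line in f:
        key, value = line.split(":", 1)
        if key in CPU_MEM_COUNTERS:
            # lines look like "MemTotal:  1915076476 kB"
            mem_kb[f"{key}_KB"] = int(value.split()[0])
print(mem_kb)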
NFS client metrics (-nfsm)
The NFS client I/O statistics are collected using the following command.
mountstats -R
For each mounted device the READ and WRITE IOPS and bytes are collected. The following metrics can be referenced in Kusto: nfs_mount_pt, client_read_bytes_per_sec, client_write_bytes_per_sec, client_read_iops and client_write_iops.
Local disk/ssd metrics (-diskm)
All local disk/device data is obtained from this location.
/proc/diskstats
For each disk/device the following metrics are collected: read_completed, read_sectors, read_time_ms, write_completed, write_sectors, write_time_ms, disk_name and disk_time_interval_secs.
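For illustration, the relevant /proc/diskstats columns map onto those metric names roughly as follows (a sketch based on the standard diskstats layout, not the collector's exact code; the example device name is an assumption):

# /proc/diskstats columns (per the kernel docs):
#  1 major, 2 minor, 3 device name, 4 reads completed, 5 reads merged,
#  6 sectors read, 7 ms spent reading, 8 writes completed, 9 writes merged,
# 10 sectors written, 11 ms spent writing, ...
disks = {}
with open("/proc/diskstats") as f:
    for line in f:
        p = line.split()
        disks[p[2]] = {
            "read_completed": int(p[3]),
            "read_sectors": int(p[5]),
            "read_time_ms": int(p[6]),
            "write_completed": int(p[7]),
            "write_sectors": int(p[9]),
            "write_time_ms": int(p[10]),
        }
print(disks.get("nvme0n1"))  # example device name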
Filesystem inode and capacity metrics (-dfm)
The filesystem inode and capacity metrics are collected using the following command.
df --output=source,itotal,iused,iavail,ipcent,size,used,avail,pcent,target
The following data is collected for each filesystem: df_inode_total, df_inode_used, df_inode_free, df_inode_used_pc, df_total_KB, df_used_KB, df_avail_KB, df_used_pc and df_mount_pt.
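A minimal sketch of running that df command and mapping its columns onto the field names above (the parsing is illustrative):

import subprocess

cols = ["filesystem", "df_inode_total", "df_inode_used", "df_inode_free",
        "df_inode_used_pc", "df_total_KB", "df_used_KB", "df_avail_KB",
        "df_used_pc", "df_mount_pt"]

out = subprocess.check_output(
    ["df", "--output=source,itotal,iused,iavail,ipcent,size,used,avail,pcent,target"],
    text=True)

for line in out.splitlines()[1:]:   # skip the header row
    print(dict(zip(cols, line.split())))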
Scheduled events metrics (-eventm)
The scheduled events are collected from the scheduled events metadata server located at
http://169.254.169.254/metadata/scheduledevents
The following metrics are collected: EventId, EventStatus, EventType, ResourceType, Resources, NotBefore, Description, EventSource and DurationInSeconds.
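Querying that endpoint requires the Metadata: true header and an api-version parameter; a minimal sketch follows (the api-version shown is an assumption, use whichever version your environment supports):

import requests

resp = requests.get(
    "http://169.254.169.254/metadata/scheduledevents",
    headers={"Metadata": "true"},            # required by the metadata service
    params={"api-version": "2020-07-01"},    # assumed API version
    timeout=5,
)
for event in resp.json().get("Events", []):
    print(event["EventId"], event["EventType"], event["EventStatus"],
          event.get("NotBefore"), event.get("Resources"))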
Create an HPC/AI monitoring dashboard (with Azure Monitor)
You can go to the Log Analytics workspace you created in Azure and use Kusto queries to create the metric charts you are interested in.
Here is a query to get the average GPU utilization for a particular SLURM job running on a virtual machine with GPUs.
Here is the Kusto query to view the InfiniBand bandwidth metrics for all 8 IB devices on NDv4.
To monitor CPU utilization
To monitor NFS client I/O (e.g. write throughput)
The following Kusto queries show how to find which nodes are part of a SLURM job ID and how to find the physical hostname for a particular SLURM node.
When you enable scheduled event monitoring (-eventm), you can monitor all the scheduled events, such as Spot virtual machine evictions.
You can then pin these graphs to your Azure dashboard to create a dashboard like the following.
Creating an alert
Setting up alert rules is easy and built into the Azure Monitor service. We show how to set up an alert rule to notify us when the inode usage on any local NVMe SSD exceeds a threshold, 90% in this example.
Conclusion
HPC/AI cluster monitoring is essential for ensuring smooth operation, optimizing performance, and identifying and resolving issues in a timely manner.
Azure Monitor has powerful monitoring capabilities and allows you to provide HPC/AI cluster monitoring without having to deploy additional resources or install extra software. An example custom Python collector is provided that collects and sends GPU, InfiniBand, Ethernet, CPU, disk and NFS client I/O metrics to Azure Monitor, which can then be used to create a custom HPC/AI cluster monitoring dashboard and custom alerts.