HPC/AI Cluster resource utilization monitoring using Azure Monitor
Published Feb 02 2022


Overview

Monitoring is a crucial aspect of managing high-performance computing (HPC) and AI clusters. Here we focus specifically on resource utilization monitoring using a custom data collector and the Azure Monitor service. Using a custom data collector, we show how to collect and monitor the following HPC/AI resources.

  • CPU (user, sys, idle, iowait, memory utilization, etc.)
  • GPU (utilization, tensor core activity, temperature, etc.)
  • Network (InfiniBand and Ethernet)
  • Storage (local SSD, attached disks and NFS storage)
  • Scheduled events (Spot VM evictions, scheduled maintenance, etc.)

Azure Monitor is an Azure service that provides a platform to ingest, analyze, query and monitor all types of data. The primary advantage of using Azure Monitor is simplicity: you do not need to deploy any additional resources or install any extra software to monitor your data.

Here we give an example of how to use a custom data collector to extract various ND96amsr_A100_v4 (A100 on Ubuntu-HPC 20.04) GPU metrics and send them to Log Analytics for analysis. Please note that the HPC/AI monitoring procedures outlined in this blog post are highly customizable and portable, and can be used with other Azure HPC/AI VMs (e.g. HBv3, NDv2 and NC series).

 

Which GPU metrics to use?

NVIDIA Data Center GPU Manager (DCGM) is a framework that provides access to several low-level GPU counters and metrics to help give insight into the performance and health of the GPUs. In this example we will be monitoring the counters/metrics provided by the dmon feature. All DCGM metrics/counters can be accessed by a specific field ID. To see all available field IDs:

dcgmi dmon -l
___________________________________________________________________________________
                               Long Name          Short Name       Field Id
___________________________________________________________________________________
driver_version                                         DRVER              1
nvml_version                                           NVVER              2
process_name                                           PRNAM              3
device_count                                           DVCNT              4
cuda_driver_version                                    CDVER              5
name                                                   DVNAM             50
brand                                                  DVBRN             51
nvml_index                                             NVIDX             52
serial_number                                          SRNUM             53
uuid                                                   UUID#             54
minor_number                                           MNNUM             55
oem_inforom_version                                    OEMVR             56
pci_busid                                              PCBID             57
pci_combined_id                                        PCCID             58
pci_subsys_id                                          PCSID             59

etc

Note: The DCGM standalone executable dcgmi is pre-installed on the Ubuntu-HPC marketplace images.

 

Some useful DCGM field IDs

Field Id    GPU Metric
150         temperature (in C)
203         utilization (0-100)
252         memory used (0-100)
1004        tensor core active (0-1)
1006        fp64 unit active (0-1)
1007        fp32 unit active (0-1)
1008        fp16 unit active (0-1)
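
To quickly verify these counters on a node before wiring them into a collector, dmon can sample specific field IDs directly. The field IDs and options below are just an illustration; exact flags may vary slightly between DCGM versions:

dcgmi dmon -e 203,252,1004 -d 1000 -c 5

This prints GPU utilization, memory used and tensor core activity once per second for five samples.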

How to create a custom HPC/AI cluster Azure Monitor collector

 

The Python script hpc_data_collector.py connects to your Log Analytics workspace and collects various GPU, InfiniBand, Ethernet, CPU, disk and NFS I/O metrics, sending them to the workspace at a specified time interval.

hpc_data_collector.py -h
usage: hpc_data_collector.py [-h] [-dfi DCGM_FIELD_IDS] [-nle NAME_LOG_EVENT]
                             [-fhm] [-gpum] [-ibm] [-ethm] [-nfsm] [-diskm]
                             [-dfm] [-cpum] [-cpu_memm] [-eventm] [-uc]
                             [-tis TIME_INTERVAL_SECONDS]

optional arguments:
  -h, --help            show this help message and exit
  -dfi DCGM_FIELD_IDS, --dcgm_field_ids DCGM_FIELD_IDS
                        Select the DCGM field ids you would like to monitor
                        (if multiple field ids are desired then separate by
                        commas) [string] (default: 203,252,1004)
  -nle NAME_LOG_EVENT, --name_log_event NAME_LOG_EVENT
                        Select a name for the log events you want to monitor
                        (default: MyGPUMonitor)
  -fhm, --force_hpc_monitoring
                        Forces data to be sent to log analytics WS even if no
                        SLURM job is running on the node (default: False)
  -gpum, --gpu_metrics  Collect GPU metrics (default: False)
  -ibm, --infiniband_metrics
                        Collect InfiniBand metrics (default: False)
  -ethm, --ethernet_metrics
                        Collect Ethernet metrics (default: False)
  -nfsm, --nfs_metrics  Collect NFS client side metrics (default: False)
  -diskm, --disk_metrics
                        Collect disk device metrics (default: False)
  -dfm, --df_metrics    Collect filesystem (usage and inode) metrics (default: False)
  -cpum, --cpu_metrics  Collects CPU metrics (e.g. user, sys, idle & iowait
                        time) (default: False)
  -cpu_memm, --cpu_mem_metrics
                        Collects CPU memory metrics (Default: MemTotal,
                        MemFree) (default: False)
  -eventm, --scheduled_event_metrics
                        Collects Azure/user scheduled events metrics (default:
                        False)
  -uc, --use_crontab    This script will be started by the system crontab and
                        the time interval between each data collection will be
                        decided by the system crontab (if crontab is selected
                        then the -tis argument will be ignored). (default:
                        False)
  -tis TIME_INTERVAL_SECONDS, --time_interval_seconds TIME_INTERVAL_SECONDS
                        The time interval in seconds between each data
                        collection (This option cannot be used with the -uc
                        argument) (default: 10)
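
For example, assuming the script and its dependencies are already on the compute node and the Log Analytics credentials are set (see below), GPU, InfiniBand and CPU metrics could be collected every 30 seconds with an invocation like:

python3 hpc_data_collector.py -gpum -ibm -cpum -tis 30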

Note: This script also collects the SLURM job ID and the physical hostname (i.e. the physical host on which the VM is running). By default, data is only sent to the Log Analytics workspace if a SLURM job is running on the node (this can be overridden with the -fhm option).

 

The preferred way to enable HPC/AI cluster monitoring is to use the provided CycleCloud cc_hpc_monitoring project: upload it to your CycleCloud locker and enable it on your compute nodes.

Note: HPC/AI monitoring can also be enabled manually or via a crontab; some sample scripts are provided in the repository.

 

To connect to the Log Analytics workspace, the customer_id and shared_key need to be defined. (The customer ID (i.e. workspace ID) and shared key (primary or secondary key) can be found in the Azure portal under Log Analytics workspace --> Agents management.)

You can either define customer_id and shared_key in the script or set them via environment variables.

export LOG_ANALYTICS_CUSTOMER_ID=<log_analytics_customer_id>
export LOG_ANALYTICS_SHARED_KEY=<log_analytics_shared_key>

Note: If customer_id or shared_key is defined in the hpc_data_collector.py script, then the LOG_ANALYTICS_CUSTOMER_ID or LOG_ANALYTICS_SHARED_KEY environment variables will be ignored.

 

Details of HPC/AI Cluster metrics collected

 

InfiniBand metrics (-ibm)

InfiniBand metrics are collected from this location

/sys/class/infiniband/<IB device>/ports/<port number>/counters

By default, port_xmit_data and port_rcv_data are collected. To change the defaults, modify the IB_COUNTERS list definition in the custom collector script.

IB_COUNTERS = [
                'port_xmit_data',
                'port_rcv_data'
              ]

per_sec is appended to the InfiniBand metric fields, so in a Kusto query you can reference IB metrics using the format <METRIC>_per_sec (e.g. port_xmit_data_per_sec).
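
As a rough illustration of where these per-second values come from, the raw counters can be read twice and differenced over an interval. The device and port names below are placeholders (list your devices with ls /sys/class/infiniband):

DEV=mlx5_ib0
PORT=1
CTR=/sys/class/infiniband/$DEV/ports/$PORT/counters/port_xmit_data
a=$(cat "$CTR"); sleep 1; b=$(cat "$CTR")
echo "port_xmit_data delta over ~1s: $((b - a))"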

 

Ethernet metrics (-ethm)

Ethernet metrics are collected from

/sys/class/net/eth*/statistics

By default, tx_bytes and rx_bytes are collected. To change the defaults, modify the ETH_COUNTERS list definition in the collector script.

ETH_COUNTERS = [
                'tx_bytes',
                'rx_bytes'
               ]

per_sec is appended to the Ethernet metric fields, so in Kusto you reference tx_bytes as tx_bytes_per_sec.
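
To spot-check the raw values that get differenced, the counters can be read directly (eth0 below is just an example interface name):

cat /sys/class/net/eth0/statistics/tx_bytes
cat /sys/class/net/eth0/statistics/rx_bytes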

 

CPU metrics (-cpum)

CPU metrics are collected from this location.

/proc/stat

and cpu load average (1 minute) from this location.

/proc/loadavg

The CPU metrics can be referenced in Kusto by cpu_user_time_user_hz_d, cpu_nice_time_user_hz_d, cpu_idle_time_user_hz_d, cpu_iowait_time_user_hz, cpu_irq_time_user_hz_d and cpu_softirq_time_user_hz_d.
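
For reference, these are the raw values the collector parses (the /proc/stat counters are reported in USER_HZ ticks, hence the user_hz naming):

head -1 /proc/stat   # cpu  user nice system idle iowait irq softirq ...
cat /proc/loadavg    # 1-, 5- and 15-minute load averages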

 

CPU memory (-cpu_memm)

CPU memory metrics are collected from this location.

/proc/meminfo

By default only MemTotal and MemFree are collected, but that can be changed by modifying the CPU_MEM_COUNTERS list definition in the custom collector script.

CPU_MEM_COUNTERS = [
                    'MemTotal',
                    'MemFree'
                   ]

_KB is appended to these counter names, so for example MemTotal can be referenced in Kusto as MemTotal_KB.
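
A quick way to inspect the raw values (the kernel reports them in kB, hence the _KB suffix):

grep -E '^(MemTotal|MemFree):' /proc/meminfo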

 

NFS client metrics (-nfsm)

The NFS client I/O statistics are collected using the following command.

mountstats -R

For each mounted device, the READ and WRITE IOPS and bytes are collected. The following metrics can be referenced in Kusto: nfs_mount_pt, client_read_bytes_per_sec, client_write_bytes_per_sec, client_read_iops and client_write_iops.

 

Local disk/SSD metrics (-diskm)

All local disk/device data is obtained from this location.

/proc/diskstats

For each disk/device the following metrics are collected: read_completed, read_sectors, read_time_ms, write_completed, write_sectors, write_time_ms, disk_name and disk_time_interval_secs.
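
The raw per-device counters can be inspected directly; for example (nvme0n1 is just an example device name):

grep -w nvme0n1 /proc/diskstats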

 

Filesystem inode and capacity metrics (-dfm)

The filesystem inode and capacity metrics are collected using the following command.

df --output=source,itotal,iused,iavail,ipcent,size,used,avail,pcent,target

The following data is collected for each filesystem: df_inode_total, df_inode_used, df_inode_free, df_inode_used_pc, df_total_KB, df_used_KB, df_avail_KB, df_used_pc and df_mount_pt.

 

Scheduled events metrics (-eventm)

The scheduled events are collected from the scheduled events metadata endpoint located at

http://169.254.169.254/metadata/scheduledevents

The following metrics are collected: EventId, EventStatus, EventType, ResourceType, Resources, NotBefore, Description, EventSource and DurationInSeconds.
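
The same endpoint can also be queried manually from the VM; the Metadata header is required, and the api-version shown here is one commonly supported version:

curl -s -H "Metadata: true" "http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"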

 

Create HPC/AI Monitoring dashboard (with Azure Monitor)

You can go to the Log Analytics workspace you created in Azure and use Kusto queries to create the metrics charts you are interested in.

 

Here is a query to get the average GPU utilization for a particular SLURM job running on virtual machines with GPUs.

[Screenshot: Kusto query and chart of average GPU utilization for a SLURM job]

 

 

Here is the Kusto query to view the InfiniBand bandwidth metrics for all 8 IB devices on NDv4.

[Screenshot: Kusto query and chart of InfiniBand bandwidth for all 8 IB devices]

 

To monitor CPU utilization

[Screenshot: Kusto query and chart of CPU utilization]

 

To monitor NFS client I/O (e.g. write throughput)

 

[Screenshot: Kusto query and chart of NFS client write throughput]

 

 

The following Kusto queries show you how to find which nodes are part of a SLURM job ID and how to find the physical hostname for a particular SLURM node.

 

 

[Screenshot: Kusto query listing the nodes that are part of a SLURM job]

 

[Screenshot: Kusto query showing the physical hostname for a SLURM node]

 

When scheduled event monitoring is enabled (-eventm), you can monitor all scheduled events, such as Spot virtual machine evictions.

[Screenshot: Scheduled events chart showing a Spot VM eviction]

 

You can then pin these charts to an Azure dashboard to create a consolidated monitoring view like the following.

 

[Screenshot: Example Azure dashboard with pinned HPC/AI monitoring charts]

 

Creating an alert

Setting up alert rules is easy and built into the Azure Monitor service. Here we show how to set up an alert rule to notify us when the inode usage of any local NVMe SSD exceeds a threshold, 90% in this example.

 

[Screenshot: Alert rule configuration for NVMe SSD inode usage exceeding 90%]

 

[Screenshot: Example alert notification message]

 

 

Conclusion

 

HPC/AI cluster monitoring is essential for ensuring smooth operation, optimizing performance, and identifying and resolving issues in a timely manner.

Azure Monitor has powerful monitoring capabilities and lets you provide HPC/AI cluster monitoring without having to deploy additional resources or install extra software. An example custom Python client is provided that collects GPU, InfiniBand, Ethernet, CPU, disk and NFS client I/O metrics and sends them to Azure Monitor, where they can be used to create a custom HPC/AI cluster monitoring dashboard and custom alerts.
