Health checks for HPC workloads on Microsoft Azure

Microsoft

Sep 05, 2019

Introduction

Many HPC applications are highly parallel and tightly coupled, meaning that during an application's parallel simulation run, all parallel processes must communicate with each other frequently. These applications usually perform best when communication between the parallel processes runs over a high-bandwidth, low-latency network such as InfiniBand. Because the processes are tightly coupled, a single VM that is not functioning optimally can degrade the performance of the whole job. The purpose of these checks/tests is to help you quickly identify a non-optimal node so that it can be excluded from a parallel job. If your job needs an exact number of parallel processes, it is good practice to overprovision slightly, in case you find a few nodes that you need to exclude.

The HBv2, HB and HC SKUs were specifically designed for HPC applications. They have InfiniBand networks (HDR on HBv2, EDR on HB and HC), high floating-point performance, and high memory bandwidth. The tests/checks described here are specifically designed for these SKUs. It is good practice to run them before running a parallel job, especially a large one.

Note: Some GPU specific health-checks can be found here. A Blog post on Automating HPC/AI health-checks with a SLURM scheduler can be found here.

How to access the test/check scripts 

git clone https://github.com/Azure/azurehpc.git

Note: Scripts will be in the apps/health_checks directory.

Tests/Checks

Check the InfiniBand network

This test identifies unexpected issues with the InfiniBand network by running a network bandwidth test on pairs of compute nodes (one process on each node). A hostfile lists all the nodes to be tested, and the node pairs are arranged in a ring. For example, if the hostfile contains 4 hosts (A, B, C and D), the 4 node pairs tested are (A,B), (B,C), (C,D) and (D,A).
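
The ring pairing described above can be sketched in a few lines of shell. This is only an illustration of the pairing rule, not the code used by the AzureHPC scripts; the function name `ring_pairs` is hypothetical, and the hostfile is assumed to hold one hostname or IP address per line.

```shell
# Print the ring of node pairs for a hostfile (one host per line).
# Each host is paired with the next one; the last host wraps around to the first.
ring_pairs() {
    awk 'NF { host[n++] = $1 }
         END { for (i = 0; i < n; i++) print host[i], host[(i + 1) % n] }' "$1"
}

# Example with the four hosts A, B, C and D used in the text.
demo_hostfile=$(mktemp)
printf '%s\n' A B C D > "$demo_hostfile"
ring_pairs "$demo_hostfile"
rm -f "$demo_hostfile"
```

For the four-host example this prints the pairs `A B`, `B C`, `C D` and `D A`, one per line. Every host appears in exactly two pairs, which helps to tell which node of a slow pair is the likely culprit.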

A bad node shows up as a node-pair test that fails, does not run, or underperforms (the measured network bandwidth is much lower than the expected network bandwidth).

Procedure:

Download the OSU micro-benchmark suite:
1. http://mvapich.cse.ohio-state.edu/download/mvapich/osu-micro-benchmarks-5.6.3.tar.gz
Build and install the OSU micro-benchmark suite:
1. module load mpi/mvapich2-2.3.4
2. ./configure --prefix=/location/you/want/to/install CC=/opt/mvapich2-2.3.4/bin/mpicc CXX=/opt/mvapich2-2.3.4/bin/mpicxx
3. make
4. make install
Note: Building and installing the osu-micro-benchmarks with Spack is easier (see azurehpc/apps/spack for details).
Run the ring bandwidth test script:
```
run_ring_osu_bw.sh [/full/path/to/hostlist] [/full/path/to/osu_bw] [/full/path/to/OUTPUT_DIR]
```
1. The first script parameter is the full path to the hostlist, which should have a single hostname or IP address per line.
```
Host1 
Host2 
Host3 
```
2. The second script parameter is the full path to the osu_bw executable that you built in the build step above.
3. The third script parameter is the full path to the output directory, where the output from this test is written.
4. These pairwise point-to-point benchmarks run serially (each test takes less than 20 seconds), so the total test time depends on how many nodes are in the hostlist file.
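
As a rough guide to the serial run time, the ring contains one node pair per host, so an upper bound is the number of hosts multiplied by the per-test time. The sketch below assumes the "less than 20 seconds" figure quoted above as the default; the function name `estimate_ring_test_seconds` is hypothetical.

```shell
# Upper bound (in seconds) for the serial ring test:
# one node pair per host in the hostlist, at most per_test_seconds each.
# $1 = hostlist file, $2 = seconds per pair test (default 20)
estimate_ring_test_seconds() {
    n=$(awk 'NF { c++ } END { print c + 0 }' "$1")
    echo $(( n * ${2:-20} ))
}

# Example: a hostlist with 4 hosts gives an upper bound of 80 seconds.
demo_hostlist=$(mktemp)
printf '%s\n' Host1 Host2 Host3 Host4 > "$demo_hostlist"
estimate_ring_test_seconds "$demo_hostlist"
rm -f "$demo_hostlist"
```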
A number of files will be created for each node pair tested. An output report called “osu_bw_report.log_PID” is also generated in the OUTPUT_DIR directory. Its second column gives the InfiniBand bandwidth in MB/s, sorted in ascending order, so the slowest results are at the top of the file. Any node pair with a bandwidth far below 7000 MB/s should be reported, and its nodes removed from your hostlist. If any node-pair test failed (its output file is empty or contains an error), report those nodes and remove them from your hostlist before running your parallel job.
```
10.32.4.211_to_10.32.4.213_osu_bw.log_68076:4194304 7384.99 
10.32.4.248_to_10.32.4.249_osu_bw.log_68076:4194304 7390.99 
10.32.4.142_to_10.32.4.143_osu_bw.log_68076:4194304 7394.00 
10.32.4.174_to_10.32.4.175_osu_bw.log_68076:4194304 7400.52 
10.32.4.194_to_10.32.4.195_osu_bw.log_68076:4194304 7407.01 
```
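
Scanning a long report by eye is error-prone, so a small filter can help. The sketch below assumes the report format shown above (the file name encodes the two hosts, and the second column is the bandwidth in MB/s). The function name `flag_slow_pairs` and the choice of 7000 MB/s as the default cut-off are assumptions for illustration; how far below 7000 MB/s counts as "far below" is a judgment call. The `nodeA_to_nodeB` line in the example is made up to show a failing pair.

```shell
# Print "source destination bandwidth" for every node pair whose bandwidth
# (column 2, MB/s) is below the cut-off.
# $1 = report file, $2 = cut-off in MB/s (default 7000)
flag_slow_pairs() {
    awk -v min="${2:-7000}" '
        NF >= 2 && ($2 + 0) < min {
            pair = $1
            sub(/_osu_bw\.log.*/, "", pair)   # drop the log-file suffix
            split(pair, host, "_to_")         # host[1] = source, host[2] = destination
            printf "%s %s %s\n", host[1], host[2], $2
        }' "$1"
}

# Example: one hypothetical slow pair plus one line from the report above.
demo_report=$(mktemp)
cat > "$demo_report" <<'EOF'
nodeA_to_nodeB_osu_bw.log_68076:4194304 3120.45
10.32.4.211_to_10.32.4.213_osu_bw.log_68076:4194304 7384.99
EOF
flag_slow_pairs "$demo_report"
rm -f "$demo_report"
```

Only the hypothetical `nodeA`/`nodeB` pair is printed here. Pairs whose test failed outright may not appear in the report with a bandwidth value, so they still need to be checked separately.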
The AzureHPC repository also contains additional scripts to run all the IB tests in parallel, which considerably speeds up testing on a large HPC cluster (see azurehpc/apps/health_checks/readme for details).

Check the memory on all compute nodes

This test helps identify problematic memory DIMMs (for example, DIMMs that are failing or underperforming). It uses the STREAM benchmark, which measures the memory bandwidth, and runs it on all compute nodes in parallel. Bad memory on a compute node shows up as the STREAM benchmark failing or as a measured memory bandwidth much lower than the expected memory bandwidth.

Procedure:

Get the STREAM code from www.cs.virginia.edu/stream/

Build STREAM with the Intel C compiler (icc):

icc -o stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=3200000000 -mcmodel=large -shared-intel -Ofast -qopenmp
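
The array size in the build line determines how much memory the benchmark needs. STREAM allocates three arrays, and with its default element type (an 8-byte double) the footprint is 3 x 8 x STREAM_ARRAY_SIZE bytes. The sketch below works this out for the value used above; the function name `stream_array_bytes` is hypothetical, and the 8-byte element size is an assumption that holds only if STREAM_TYPE is left at its default.

```shell
# Total bytes used by STREAM's three arrays, assuming 8-byte elements.
# $1 = STREAM_ARRAY_SIZE (number of elements per array)
stream_array_bytes() {
    echo $(( $1 * 8 * 3 ))
}

# For -DSTREAM_ARRAY_SIZE=3200000000 this prints 76800000000 (76.8 GB).
stream_array_bytes 3200000000
```

Static arrays of this size are far larger than 2 GB, which is consistent with the -mcmodel=large flag in the build line, and each node needs more free memory than this for the test to run.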

Run the STREAM test on all the compute nodes in the hostlist in parallel with pdsh:
```
WCOLL=hostlist pdsh /path/to/run_stream_bw.sh [/full/path/to/intel/compilervars.sh] [/full/path/to/stream] [/full/path/to/OUTPUT_DIR]
```
1. The first script parameter “/full/path/to/intel/compilervars.sh” is the location of the “compilervars.sh” script in the Intel compiler environment, which is sourced to set up the correct Intel compiler environment.
2. The second parameter “/full/path/to/stream” is the full path to the stream executable built in the previous step.
3. The third parameter “/full/path/to/OUTPUT_DIR” is the full path to the directory where the output from this test will be written.
A test summary report can be generated by running this script:
```
report_stream.sh [/full/path/to/OUTPUT_DIR]
```
The stream test report “stream_report.log_PID” lists the STREAM benchmark result for each node in ascending order, so the slowest results are at the top of the file. The second column gives the node memory bandwidth in MB/s. Any node with a memory bandwidth far below the expected value for its SKU (about 280 GB/s for HBv2, 240 GB/s for HB and 190 GB/s for HC, that is, roughly 280,000, 240,000 and 190,000 MB/s) should be reported and removed from your hostlist. Any node on which this test fails should also be reported and removed from your hostlist.
```
cgbbv300c00009L/stream.out_27138:Triad: 231227.8 0.084653 0.083035 0.086457 
cgbbv300c00009B/stream.out_27363:Triad: 233946.3 0.084680 0.082070 0.095031 
cgbbv300c0000BR/stream.out_28519:Triad: 234140.8 0.083516 0.082002 0.084803 
cgbbv300c00009O/stream.out_26951:Triad: 234578.7 0.082362 0.081849 0.083965 
cgbbv300c00009U/stream.out_27276:Triad: 234736.0 0.083303 0.081794 0.086764 
```
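
The report can be filtered against a cut-off, and the flagged nodes dropped from the hostlist, along the lines of the sketch below. It assumes the report format shown above, with the directory name in the first column being the node name as it appears in the hostlist. The function names `flag_slow_memory_nodes` and `prune_hostlist` are hypothetical, the cut-off (given in GB/s) is yours to choose for your SKU, and the `nodeX` line in the example is made up to show a failing node.

```shell
# Print the nodes whose Triad bandwidth (column 2, MB/s) is below the cut-off.
# $1 = report file, $2 = cut-off in GB/s (1 GB/s = 1000 MB/s)
flag_slow_memory_nodes() {
    awk -v min_gbs="$2" '
        NF >= 2 && ($2 + 0) < min_gbs * 1000 {
            node = $1
            sub(/\/.*/, "", node)   # keep only the directory (node) name
            print node
        }' "$1"
}

# Print the hostlist without the flagged nodes (exact whole-line matches only).
# $1 = hostlist file, $2 = file with one bad node per line
prune_hostlist() {
    grep -v -x -F -f "$2" "$1"
}

# Example: one hypothetical slow node plus one line from the report above.
demo_report=$(mktemp); demo_hostlist=$(mktemp); demo_bad=$(mktemp)
cat > "$demo_report" <<'EOF'
nodeX/stream.out_27000:Triad: 150000.0 0.130000 0.128000 0.133000
cgbbv300c00009L/stream.out_27138:Triad: 231227.8 0.084653 0.083035 0.086457
EOF
printf '%s\n' nodeX cgbbv300c00009L > "$demo_hostlist"
flag_slow_memory_nodes "$demo_report" 200 > "$demo_bad"
prune_hostlist "$demo_hostlist" "$demo_bad"
rm -f "$demo_report" "$demo_hostlist" "$demo_bad"
```

With a cut-off of 200 GB/s only the hypothetical `nodeX` is flagged, so the pruned hostlist contains just `cgbbv300c00009L`. The `-x` and `-F` options make grep match whole lines literally, so removing `node1` would not also remove `node10`.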
Additional memory bandwidth benchmark tests are also available in AzureHPC (see azurehpc/apps/health_checks/readme.md for details).

Summary

A single parallel HPC workload may require many compute nodes to complete in a reasonable time. If one of those compute nodes is configured incorrectly or performs below par, it may degrade the overall performance of the parallel job. The checks/tests described here help identify such problems, and we strongly recommend that you run them before running any large parallel job.

Updated Mar 10, 2023

Version 12.0

CormacGarvey
