How to access the test/check scripts
git clone https://github.com/Azure/azurehpc.git
Note: Scripts will be in the apps/health_checks directory.
This test identifies unexpected problems with the InfiniBand network. It runs a network bandwidth test on pairs of compute nodes (one process running on each compute node). A hostfile lists all the nodes to be tested, and the node pairs are formed as a ring. For example, if the hostfile contains 4 hosts (A, B, C, and D), the 4 node pairs tested are (A,B), (B,C), (C,D), and (D,A).
A bad node can be identified by a node pair test that fails, does not run, or underperforms (measured network bandwidth well below the expected network bandwidth).
run_ring_osu_bw.sh [/full/path/to/hostlist] [/full/path/to/osu_bw] [/full/path/to/OUTPUT_DIR]
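The ring grouping described above can be sketched in a few lines of shell. This is only an illustration of how the pairs are derived from a hostfile, assuming one hostname per line; ring_pairs is a hypothetical helper and is not part of the azurehpc scripts.

```shell
# ring_pairs HOSTFILE
# Hypothetical helper (not part of azurehpc): print the node pairs that a
# ring test would cover, one "first,second" pair per line.
# Assumes HOSTFILE holds one hostname per line; blank lines are skipped.
ring_pairs() {
    awk 'NF { host[n++] = $1 }
         END {
             if (n < 2) exit            # a pair needs at least two hosts
             for (i = 0; i < n; i++)    # last host wraps around to the first
                 print host[i] "," host[(i + 1) % n]
         }' "$1"
}
```

For a hostfile containing A, B, C, and D on separate lines, `ring_pairs hostfile` prints A,B then B,C then C,D then D,A. Each node appears in exactly two pairs, so a node that shows up in two failing or slow pairs is the most likely suspect.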
This test helps identify problematic memory DIMMs (for example, DIMMs that are failing or underperforming). It uses the STREAM benchmark, which measures the memory bandwidth of a compute node, and runs it on every compute node in parallel. Bad memory on a compute node is indicated by the STREAM benchmark failing or by a measured memory bandwidth well below the expected memory bandwidth.
icc -o stream.intel stream.c -DSTATIC -DSTREAM_ARRAY_SIZE=3200000000 -mcmodel=large -shared-intel -Ofast -qopenmp
WCOLL=hostlist pdsh /path/to/run_stream_bw.sh [/full/path/to/intel/compilervars.sh] [/full/path/to/stream] [/full/path/to/OUTPUT_DIR]
The STREAM test report “stream_report.log_PID” lists the STREAM benchmark result for each node in ascending order (the slowest results are at the top of the file). The second column gives the node memory bandwidth in MB/s. For Hb, any node with a memory bandwidth well below ~220 GB/s (about 220,000 MB/s in the report's units) should be reported and removed from your hostlist; for Hc, the corresponding figure is 180 GB/s (about 180,000 MB/s). Any node on which this test fails should also be reported and removed from your hostlist.
cgbbv300c00009L/stream.out_27138:Triad: 231227.8 0.084653 0.083035 0.086457
cgbbv300c00009B/stream.out_27363:Triad: 233946.3 0.084680 0.082070 0.095031
cgbbv300c0000BR/stream.out_28519:Triad: 234140.8 0.083516 0.082002 0.084803
cgbbv300c00009O/stream.out_26951:Triad: 234578.7 0.082362 0.081849 0.083965
cgbbv300c00009U/stream.out_27276:Triad: 234736.0 0.083303 0.081794 0.086764
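Checking the report by eye gets tedious on large clusters, so the filtering could be scripted along these lines. This is a minimal sketch, assuming every report line has the format shown above (node name before the first "/", bandwidth in MB/s in the second column) and that the hostlist uses the same node names as the report. flag_slow_nodes and prune_hostlist are hypothetical helpers, and the threshold is an assumption you pass in yourself.

```shell
# flag_slow_nodes REPORT MIN_MBPS
# Hypothetical helper: print "node bandwidth" for every Triad line whose
# bandwidth (second column, MB/s) is below MIN_MBPS.
flag_slow_nodes() {
    awk -v min="$2" '
        /Triad:/ {
            split($1, parts, "/")          # node name precedes the first "/"
            if ($2 + 0 < min + 0) print parts[1], $2
        }' "$1"
}

# prune_hostlist HOSTLIST BAD_NODES
# Hypothetical helper: print HOSTLIST without the nodes listed in BAD_NODES
# (one node name per line, matched as whole lines and fixed strings).
prune_hostlist() {
    grep -v -x -F -f "$2" "$1"
}
```

A possible use would be `flag_slow_nodes stream_report.log_PID 220000 | awk '{ print $1 }' > bad_nodes` followed by `prune_hostlist hostlist bad_nodes > hostlist.good`. None of the five sample lines above would be flagged at a 220000 MB/s threshold. Nodes on which STREAM failed outright may not appear in the report at all, so they still need to be checked separately.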
A single parallel HPC workload may require many compute nodes to complete in a reasonable time. If one of the compute nodes is configured incorrectly or performs poorly, it may degrade the overall performance of the parallel job. The checks/tests described here help identify such problems; to ensure your nodes are configured correctly, we strongly recommend running them before any large parallel job.