Tool to assist in optimal pinning of processes/threads for Azure HPC/AI VMs
Published Aug 23 2021 06:09 AM
Microsoft

Introduction

To maximize an HPC application's performance (memory bandwidth and floating-point throughput), its processes/threads need to be distributed evenly across the VM, utilizing all sockets, NUMA domains, and L3caches.

In hybrid parallel applications, each process has several threads associated with it, and it is recommended to keep a process and its threads on the same L3cache to maximize data sharing and re-use.

Optimal process/thread placement on Azure AMD-processor VMs (e.g., HB120rs_v2, the HBv3 series, and NDv4) is further complicated because these VMs have several NUMA domains and many L3caches.

There are several Linux tools (e.g., numactl and taskset) that can control the placement of processes; MPI libraries and job schedulers (e.g., Slurm srun) also provide arguments and environment variables to control process affinity.
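As a quick illustration of manual pinning with these standard tools (the executable name and PID below are placeholders):

```shell
# Bind a command's processes and memory to NUMA node 0 ("echo" stands in
# for a real executable here so the example is safe to run):
numactl --cpunodebind=0 --membind=0 echo "bound to NUMA node 0" 2>/dev/null \
  || echo "numactl not installed"

# Restrict a command to cores 0-3 at launch time:
taskset -c 0-3 echo "pinned to cores 0-3" 2>/dev/null \
  || echo "taskset not installed"

# Change the affinity of an already-running process by PID (12345 is a
# placeholder PID):
# taskset -pc 0-3 12345
```

Doing this by hand for every process of a large hybrid MPI job quickly becomes error-prone, which is the gap the tool below fills.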

In this post we discuss a tool that can help you place/pin your HPC/AI application's processes/threads in an optimal manner on specialty HPC VMs. The pinning tool has the following functionality:

 

  1. View the VM CPU topology (NUMA domains, L3caches, core IDs, and GPU IDs).
  2. Check where a parallel application's processes and threads are running (i.e., on which core IDs, GPU IDs, and NUMA domains) and provide feedback/warnings if suboptimal process/thread affinity is detected.
  3. For a given number of processes and threads, generate the optimal MPI (HPC-X, OpenMPI, Intel MPI, and MVAPICH2 are supported) and Slurm scheduler (srun) process affinity arguments.
  4. Run the pinning tool directly in an mpirun or Slurm script and pass the optimal arguments directly to the run command.

 

HPC application pinning tool

The tool is called “check_app_pinning.py” and is located in the azurehpc GitHub repository.

 

git clone https://github.com/Azure/azurehpc.git

See the experimental/check_app_pinning directory.

 

Tool syntax

 

./check_app_pinning.py -h
usage: check_app_pinning.py [-h] [-anp APPLICATION_PATTERN] [-pps] [-f]
                            [-nv TOTAL_NUMBER_VMS]
                            [-nppv NUMBER_PROCESSES_PER_VM]
                            [-ntpp NUMBER_THREADS_PER_PROCESS]
                            [-mt {openmpi,intel,mvapich2,srun}]

optional arguments:
  -h, --help            show this help message and exit
  -anp APPLICATION_PATTERN, --application_name_pattern APPLICATION_PATTERN
                        Select the application pattern to check [string]
                        (default: None)
  -pps, --print_pinning_syntax
                        Print MPI pinning syntax (default: False)
  -f, --force           Force printing MPI pinning syntax (i.e ignore
                        warnings) (default: False)
  -nv TOTAL_NUMBER_VMS, --total_number_vms TOTAL_NUMBER_VMS
                        Total number of VM's (used with -pps) (default: 1)
  -nppv NUMBER_PROCESSES_PER_VM, --number_processes_per_vm NUMBER_PROCESSES_PER_VM
                        Total number of MPI processes per VM (used with -pps)
                        (default: None)
  -ntpp NUMBER_THREADS_PER_PROCESS, --number_threads_per_process NUMBER_THREADS_PER_PROCESS
                        Number of threads per process (used with -pps)
                        (default: None)
  -mt {openmpi,intel,mvapich2,srun}, --mpi_type {openmpi,intel,mvapich2,srun}
                        Select which type of MPI to generate pinning syntax
                        (used with -pps) (select srun when you are using a
                        SLURM scheduler) (default: openmpi)

 

Tool prerequisites

The “check_app_pinning.py” tool requires that python3 and the hwloc package be installed before running it.

On CentOS-HPC images, these packages can be installed with yum as follows:

 

sudo yum install -y python3 hwloc

 

Some examples

You need to run the tool on the HPC compute VM running your application. If you want to run it on all the HPC VMs involved in your MPI application, you can use a parallel shell such as pdsh or pssh.
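For example, with pdsh you could check every VM in the job at once (the hostlist and shared tool path below are hypothetical; substitute your own cluster's values):

```shell
# Check "hpcapp" pinning on all 8 VMs at once (hostlist and shared tool
# path are placeholders -- adjust to your cluster):
if command -v pdsh >/dev/null 2>&1; then
  pdsh -w hpc-vm[01-08] \
    /shared/apps/azurehpc/experimental/check_app_pinning/check_app_pinning.py -anp hpcapp
else
  echo "pdsh is not installed on this node"
fi
```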

 

Generate MPI pinning for hybrid parallel application running on HB120-64rs_v3

You are using 4 HB120-64rs_v3 VMs and would like to know the correct HPC-X MPI syntax to pin a total of 64 MPI processes with 4 threads per process (i.e., 16 MPI processes per VM).

 

check_app_pinning.py -pps -nv 4 -nppv 16 -ntpp 4
Virtual Machine (Standard_HB120-64rs_v3, cghb64v3) Numa topology

NumaNode id  Core ids              GPU ids
============ ==================== ==========
0            ['0-15']             []
1            ['16-31']            []
2            ['32-47']            []
3            ['48-63']            []

L3Cache id   Core ids
============ ====================
0            ['0-3']
1            ['4-7']
2            ['8-11']
3            ['12-15']
4            ['16-19']
5            ['20-23']
6            ['24-27']
7            ['28-31']
8            ['32-35']
9            ['36-39']
10           ['40-43']
11           ['44-47']
12           ['48-51']
13           ['52-55']
14           ['56-59']
15           ['60-63']


Process/thread openmpi MPI Mapping/pinning syntax for 16 processes and 4 threads per process

-np 64 --bind-to l3cache --map-by ppr:4:numa

The first section of the output shows the NUMA topology (i.e., how many NUMA domains and L3caches there are, and which core IDs are in each NUMA domain and L3cache). It also shows how many GPUs the VM has and which NUMA domain each GPU ID belongs to.

You will see warnings if the combination of processes and threads is not optimal for this VM (e.g., too many or too few processes/threads, or the number of threads will not fit in an L3cache). By default, if a warning is raised the MPI placement syntax will not be generated until the warning is corrected, but you can override this behavior with the -f (--force) option to ignore the warning and generate the MPI placement syntax anyway.

In the above example you would cut and paste the HPC-X MPI pinning syntax and launch your application like this:

 

mpirun -np 64 --bind-to l3cache --map-by ppr:4:numa ./mpi_executable

Another option is to reference the optimal affinity arguments directly in your mpirun or Slurm script. The pinning tool stores the optimal affinity arguments in the following files (created in the current working directory):

AZ_MPI_NP: File containing the total number of parallel processes.
AZ_MPI_ARGS: File containing the optimal process/thread pinning/affinity arguments (for HPC-X, OpenMPI, Intel MPI, MVAPICH2, or Slurm srun).

Example MPI script (16 processes, 6 threads per process on HB120-96rs_v3)

 

 

#!/bin/bash

export OMP_NUM_THREADS=6
check_app_pinning.py -pps -nv 1 -nppv 16 -ntpp $OMP_NUM_THREADS -mt openmpi
AZ_MPI_NP=$(cat AZ_MPI_NP)
AZ_MPI_ARGS=$(cat AZ_MPI_ARGS)

mpirun -np $AZ_MPI_NP $AZ_MPI_ARGS mpi_executable

 

 

Note: AZ_MPI_NP=16 and AZ_MPI_ARGS="--bind-to l3cache --map-by ppr:4:numa -report-bindings"

 

If you would prefer Intel MPI placement syntax, just add the -mt intel option and the following Intel MPI pinning syntax will be generated:

 

AZ_MPI_NP=16
AZ_MPI_ARGS="-genv I_MPI_PIN_DOMAIN 6:compact -genv FI_PROVIDER mlx -genv I_MPI_COLL_EXTERNAL 1 -genv I_MPI_DEBUG 6"

In this case you can reference AZ_MPI_NP and AZ_MPI_ARGS in your mpirun command arguments.
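A minimal sketch of such a script fragment (the file contents are simulated here for illustration, since on a real run the tool writes them based on your VM; mpi_executable is a placeholder):

```shell
# Simulate the files check_app_pinning.py writes in the working directory
# (on a real run the tool creates these for you):
echo "16" > AZ_MPI_NP
echo "-genv I_MPI_PIN_DOMAIN 6:compact -genv I_MPI_DEBUG 6" > AZ_MPI_ARGS

AZ_MPI_NP=$(cat AZ_MPI_NP)
AZ_MPI_ARGS=$(cat AZ_MPI_ARGS)

# Print the Intel MPI launch line that would be executed:
echo mpirun -np "$AZ_MPI_NP" $AZ_MPI_ARGS ./mpi_executable
```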

Similarly, you can pass optimal process/thread affinity arguments directly to a Slurm script.

For example, run 8 processes with 1 thread per process on ND96amsr_A100_v4 using the Slurm scheduler.

 

 

#!/bin/bash

#SBATCH --mem=0
#SBATCH --ntasks-per-node=8
#SBATCH --gpus-per-node=8
#SBATCH --exclusive

module load gcc-9.2.0
module load mpi/hpcx

export SLURM_CPU_BIND=verbose
export OMP_NUM_THREADS=1

check_app_pinning.py -pps -nv $SLURM_NNODES -nppv $SLURM_NTASKS_PER_NODE -ntpp $OMP_NUM_THREADS -mt srun
AZ_MPI_NP=$(cat AZ_MPI_NP)
AZ_MPI_ARGS=$(cat AZ_MPI_ARGS)

srun $AZ_MPI_ARGS mpi_executable

 

 

 Note: AZ_MPI_ARGS="--mpi=pmix --cpu-bind=mask_cpu:0xffffff000000,0xffffff000000,0xffffff,0xffffff,0xffffff000000000000000000,0xffffff000000000000000000,0xffffff000000000000,0xffffff000000000000 --ntasks-per-node=8 --gpus-per-node=8"
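To decode masks like these: each value is a hex bitmask in which bit N corresponds to core N, so 0xffffff000000 selects cores 24-47. A minimal sketch of how such a mask can be computed:

```shell
# Build the srun cpu-bind mask for a contiguous core range [lo, hi].
# Uses bash 64-bit arithmetic, so it is valid for core ids up to 62;
# wider VMs need multi-word arithmetic (e.g. via bc or python3).
core_mask() {
  local lo=$1 hi=$2
  printf '0x%x\n' $(( (1 << (hi + 1)) - (1 << lo) ))
}

core_mask 24 47   # cores 24-47 -> 0xffffff000000
core_mask 0 23    # cores 0-23  -> 0xffffff
```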

 

Show where Hybrid MPI application is running on HB120rs_v2

You have a hybrid parallel MPI application (called hpcapp) running on 8 HB120rs_v2 VMs, and you would like to see where its processes and threads are running and whether they are placed optimally on HB120rs_v2.

 

On one of the HB120rs_v2 VMs, execute:

 

check_app_pinning.py -anp hpcapp
Virtual Machine (Standard_HB120_v2) Numa topology

NumaNode id  Core ids              GPU ids
============ ==================== ==========
0            ['0-29']             []
1            ['30-59']            []
2            ['60-89']            []
3            ['90-119']           []

L3Cache id   Core ids
============ ====================
0            ['0-2']
1            ['3-5']
2            ['6-9']
3            ['10-13']
4            ['14-17']
5            ['18-21']
6            ['22-25']
7            ['26-29']
8            ['30-32']
9            ['33-35']
10           ['36-39']
11           ['40-43']
12           ['44-47']
13           ['48-51']
14           ['52-55']
15           ['56-59']
16           ['60-62']
17           ['63-65']
18           ['66-69']
19           ['70-73']
20           ['74-77']
21           ['78-81']
22           ['82-85']
23           ['86-89']
24           ['90-92']
25           ['93-95']
26           ['96-99']
27           ['100-103']
28           ['104-107']
29           ['108-111']
30           ['112-115']
31           ['116-119']


Application (hpcapp) mapping/pinning

PID          Total Threads     Running Threads   Last core id   Core id mapping   Numa Node ids   GPU ids
============ ================= ================= ============== ================= =============== ===============
13405        7                 4                 0              0                 [0]             []
13406        7                 4                 4              4                 [1]             []
13407        7                 4                 8              8                 [2]             []
13408        7                 4                 12             12                [3]             []

Warning: 4 threads are mapped to 1 core(s), for pid (13405)
Warning: 4 threads are mapped to 1 core(s), for pid (13406)
Warning: 4 threads are mapped to 1 core(s), for pid (13407)
Warning: 4 threads are mapped to 1 core(s), for pid (13408)

 

Note: you do not need to provide the full application name, just enough of the name pattern to uniquely identify it.

 

The first part of the output is the same as before: it shows the VM NUMA topology.

The second section of the output titled “Application (hpcapp) mapping/pinning”, shows the details of how many processes and threads are running and on which core_ids.

 

PID: The process identification number, a unique number identifying each process.

Total Threads: The total number of threads associated with each PID.

Running Threads: The number of threads belonging to each PID that are currently in a run state.

Last core id: Identifies the last core ID the PID was running on.

Core id mapping: Shows the PID's CPU affinity, i.e., which core IDs the PID is allowed to run on.

Numa Node ids: Shows which NUMA domains the core IDs identified by “Core id mapping” correspond to.

GPU ids: Shows which GPU ID the PID is running on.
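The affinity information in these columns can also be cross-checked with standard Linux interfaces (a generic check, not necessarily how the tool itself gathers its data):

```shell
# Show the current affinity ("Core id mapping") of a process; the shell's
# own PID is used here as a stand-in for a real application PID:
pid=$$
grep Cpus_allowed_list /proc/$pid/status

# taskset reports the same information, if util-linux is installed:
taskset -cp "$pid" 2>/dev/null || true
```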

 

Warnings at the end of the report indicate that the tool has identified possibly suboptimal process placement. Check the warnings to make sure your application is running as expected.

 

Check HPC application running on ND96asr_v4 (A100)

We can check where the processes and threads of an application (called hello) are running on a VM with GPUs, such as ND96asr_v4, and whether the processes are pinned optimally to utilize all the GPUs.

 

./check_app_pinning.py -anp hello
Virtual Machine (Standard_ND96asr_v4) Numa topology

NumaNode id  Core ids              GPU ids
============ ==================== ==========
0            ['0-23']             [3, 2]
1            ['24-47']            [1, 0]
2            ['48-71']            [7, 6]
3            ['72-95']            [5, 4]

L3Cache id   Core ids
============ ====================
0            ['0-3']
1            ['4-7']
2            ['8-11']
3            ['12-15']
4            ['16-19']
5            ['20-23']
6            ['24-27']
7            ['28-31']
8            ['32-35']
9            ['36-39']
10           ['40-43']
11           ['44-47']
12           ['48-51']
13           ['52-55']
14           ['56-59']
15           ['60-63']
16           ['64-67']
17           ['68-71']
18           ['72-75']
19           ['76-79']
20           ['80-83']
21           ['84-87']
22           ['88-91']
23           ['92-95']

Application (hello) Mapping/pinning

PID          Total Threads     Running Threads   Last core id   Core id mapping   Numa Node ids   GPU ids
============ ================= ================= ============== ================= =============== ===============
32473        6                 0                 0              0                 [0]             [3]
32474        6                 2                 24             24                [1]             [1]
32475        6                 2                 48             48                [2]             [7]
32476        6                 2                 72             72                [3]             [5]

Warning: 2 threads are mapped to 1 core(s), for pid (32474)
Warning: 2 threads are mapped to 1 core(s), for pid (32475)
Warning: 2 threads are mapped to 1 core(s), for pid (32476)
Warning: Virtual Machine has 8 GPU's, but only 6 threads are running

 

In this case the tool has identified the 8 A100 GPUs, shows which core ID and GPU ID each PID is running on, and has flagged possibly suboptimal pinning with warnings.

 

Summary

Proper placement of processes and threads on HPC VMs is important for optimal performance.

It is more complicated to figure out the correct placement of processes/threads on VMs with many NUMA domains and L3caches, such as the HB, HBv2, HBv3, and NDv4 series.

We have discussed a tool to assist with optimal placement of processes/threads on Azure HPC VMs.

 

Last update: Apr 07 2023 09:06 AM