The following applies to Open MPI (including the Open MPI shipped with HPC-X) and to the 96-core, 64-core, 32-core, and 16-core HBv3 VM sizes. It does not apply to the full 120-core HBv3 VM.
For a hybrid solver (e.g. MPI + OpenMP) on the 96-, 64-, 32-, and 16-core HBv3 VMs, you can (and should) map each rank to an L3 cache. The simplest way to do this is --map-by l3cache. Depending on your scheduler (e.g. Slurm), you may find that this creates mapping slots containing only a single core. In that case, add :PE=X, where X is 6 for the 96-core VM, 4 for the 64-core VM, and 2 for the 32-core VM; the default behavior is already what you want on the 16-core VM. You can also use the 96-core VM to emulate the 64-, 32-, and 16-core VMs by using the corresponding values of X on the 96-core VM. The net effect is the same: you are using X of the 8 possible cores on each L3 cache. A sketch of such a command line is shown below.
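As a minimal sketch, a hybrid run on a single 96-core HBv3 VM might look like the following, which places one rank on each of the 16 L3 caches and gives it 6 cores for its OpenMP threads. The executable name ./my_hybrid_solver, the rank count, and the OMP_NUM_THREADS value are placeholders to adapt to your application:

    mpirun -np 16 --map-by l3cache:PE=6 --bind-to core -x OMP_NUM_THREADS=6 ./my_hybrid_solver

To emulate the 64-core VM on the same hardware, change PE=6 to PE=4 and set OMP_NUM_THREADS=4 to match.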
Where it gets tricky is if you have a single-threaded application. Here, you can use --map-by l3cache:PE=1, which creates a mapping slot of one core on each L3 cache. Once a slot has been placed on every L3, the mapper starts over and uses a second core on each L3 cache. Provided your -np value is divisible by the number of L3 cache domains (30 on HBv2 and 16 on the HBv3 sizes covered here), processes will be spread evenly among the L3s; see the example below.
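For example, assuming a hypothetical single-threaded binary ./my_app, the following run on a 96-core HBv3 VM (16 L3 caches) places two ranks on every L3 cache, each bound to its own core, because 32 is divisible by 16:

    mpirun -np 32 --map-by l3cache:PE=1 --bind-to core ./my_app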
Adding --report-bindings to your mpirun command line helps you visualize the results of the mapping and binding phases and confirm that you are getting the layout you intended.
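For instance, appending the flag to the hybrid sketch above prints each rank's binding at startup (again, ./my_hybrid_solver is a placeholder):

    mpirun -np 16 --map-by l3cache:PE=6 --bind-to core -x OMP_NUM_THREADS=6 --report-bindings ./my_hybrid_solver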