Azure High Performance Computing (HPC) Blog

Optimal MPI Process Placement for Azure HB Series VMs

jshelley
Jun 18, 2021

MPI Process Pinning for HB-series VMs

For MPI applications, optimal process pinning can lead to significant application performance improvements on undersubscribed systems. Before AMD introduced the chiplet design a few years back, getting optimal performance simply meant deciding whether the application ran better entirely on one socket or balanced equally across both sockets. With the introduction of the chiplet design, however, placement became more complicated. The following is a link to a diagram that may help you better understand the chiplet design.

In the chiplet design, AMD has essentially integrated a number of smaller CPUs together to provide a socket with 64 cores (8-16 smaller CPUs with 4-8 cores each). To maximize the performance of each core, it is important to balance the amount of L3 cache and memory bandwidth per core. We will discuss how to do this below for the following Azure HB VM types using Intel MPI and OpenMPI/HPC-X.
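Before picking a pinning scheme, it can help to inspect the NUMA and cache topology that the VM actually presents. A minimal sketch, assuming the numactl and hwloc utilities are installed on the VM (they are not part of the MPI stacks discussed below):

    # List the NUMA nodes, the cores in each node, and per-node memory
    numactl --hardware

    # Show the full topology, including the L3 cache groupings (chiplets)
    lstopo-no-graphics --no-io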

 

Azure HB VM:

This instance comes with 60 AMD Naples cores. Each socket contains 8 NUMA domains with 4 cores each. One 4-core NUMA domain is held back for the hypervisor, leaving 15 NUMA domains for the user. When undersubscribing the VM to get the desired resources per core, it is desirable to balance the L3 cache and memory bandwidth equally between cores. To do this, the user will need to select either 15, 30, 45, or 60 cores per node.

 

Metric                     HB60rs   HB60rs   HB60rs   HB60rs
Cores (Physical)               15       30       45       60
RAM (GB)                      224      224      224      224
Network BW (Gb/s)             100      100      100      100
Memory BW (GB/s)              250      250      250      250
RAM/Core (GB)               14.93     7.47     4.98     3.73
Network BW/Core (Gb/s)       6.67     3.33     2.22     1.67
Memory BW/Core (GB/s)       16.67     8.33     5.56     4.17

 

OpenMPI 4 / HPC-X:

Note: To print out the placement of the cores before the application is run, add the flag --report-bindings.

    --bind-to core --map-by ppr:1:numa (15 cores)

    --bind-to core --map-by ppr:2:numa (30 cores)

    --bind-to core --map-by ppr:3:numa (45 cores)
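Putting the flags together, here is a minimal launch sketch for the 30-core case; the hostfile name and the ./myapp binary are placeholders for your own job:

    # 30 ranks per HB60rs node: 2 ranks pinned per NUMA domain
    mpirun -np 30 --hostfile hostfile \
        --bind-to core --map-by ppr:2:numa \
        --report-bindings ./myapp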

 

Intel MPI:

Note: To print out the placement of the cores before the application is run, add the environment variable I_MPI_DEBUG=4.

15 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<60;i+=4) for (j=0;j<1;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

 

30 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<60;i+=4) for (j=0;j<2;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

 

45 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<60;i+=4) for (j=0;j<3;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')
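For reference, the 15 PPN expression above expands to one core per NUMA domain (0,4,8,...,56). A minimal full launch sketch for that case; the hostfile name and ./myapp are placeholders:

    # 15 ranks per HB60rs node, one per 4-core NUMA domain
    mpirun -np 15 -hostfile hostfile \
        -genv I_MPI_DEBUG=4 \
        -env I_MPI_PIN_PROCESSOR_LIST=0,4,8,12,16,20,24,28,32,36,40,44,48,52,56 \
        ./myapp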

 

Azure HBv2 VM:

This instance comes with 120 AMD Rome cores. Each socket contains 15 NUMA domains with 4 cores each. Two 4-core NUMA domains are held back for the hypervisor. When undersubscribing the HBv2 VM to get the desired resources per core, it is desirable to balance the L3 cache and memory bandwidth equally between cores. To do this, the user will need to select either 30, 60, 90, or 120 cores per node.

 

Metric                    HB120rs_v2  HB120rs_v2  HB120rs_v2  HB120rs_v2
Cores (Physical)                  30          60          90         120
RAM (GB)                         448         448         448         448
Network BW (Gb/s)                200         200         200         200
Memory BW (GB/s)                 345         345         345         345
RAM/Core (GB)                  14.93        7.47        4.98        3.73
Network BW/Core (Gb/s)          6.67        3.33        2.22        1.67
Memory BW/Core (GB/s)          11.50        5.75        3.83        2.88

 

If you want to undersubscribe your VM to get the optimal amount of resources per core for your application, you can pin your processes to get the optimal placement for 30, 60, or 90 cores. To do this, you will need to add the following options to your MPI jobs.

 

OpenMPI 4 / HPC-X:

Note: To print out the placement of the cores before the application is run, add the flag --report-bindings.

    --bind-to core --map-by ppr:1:numa (30 cores)

    --bind-to core --map-by ppr:2:numa (60 cores)

    --bind-to core --map-by ppr:3:numa (90 cores)
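As with HB, these map-by options drop straight into an mpirun command line. A minimal sketch for the 60-core case, with hostfile and ./myapp as placeholders:

    # 60 ranks per HB120rs_v2 node: 2 ranks pinned per NUMA domain
    mpirun -np 60 --hostfile hostfile \
        --bind-to core --map-by ppr:2:numa \
        --report-bindings ./myapp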

 

Intel MPI:

Note: To print out the placement of the cores before the application is run, add the environment variable I_MPI_DEBUG=4.

30 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<120;i+=4) for (j=0;j<1;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

 

60 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<120;i+=4) for (j=0;j<2;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')

 

90 PPN:

-env I_MPI_PIN_PROCESSOR_LIST=$(echo "for (i=0;i<120;i+=4) for (j=0;j<3;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/')
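If you want to see the core list one of these expressions generates before launching, you can run the pipeline on its own; for example, the 30 PPN expression expands to one core per NUMA domain (0,4,8,...,116):

    # Print the 30 PPN core list for HBv2 without launching anything
    echo "for (i=0;i<120;i+=4) for (j=0;j<1;j++) i+j" | bc | sed -z 's/\n/,/g;s/,$/\n/'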

 

Azure HBv3 VM:

This instance comes with 120 AMD Milan cores. Each socket contains 2 NUMA domains with 30 cores each. Two cores from each of four chiplets are held back for the hypervisor. When undersubscribing the HBv3 VM to get the desired resources per core, it is desirable to balance the L3 cache and memory bandwidth equally between cores. To do this, the user will need to select either 16, 32, 64, 96, or 120 cores per node. To simplify optimal process placement for our customers, we have provided additional HBv3 VM sizes (HB120-16rs_v3, HB120-32rs_v3, HB120-64rs_v3, HB120-96rs_v3) alongside the standard HB120rs_v3 size. Below is a table of the resources per core for the various sizes.

 

Metric                  HB120-16rs_v3  HB120-32rs_v3  HB120-64rs_v3  HB120-96rs_v3  HB120rs_v3
Cores (Physical)                   16             32             64             96         120
RAM (GB)                          448            448            448            448         448
Network BW (Gb/s)                 200            200            200            200         200
Memory BW (GB/s)                  345            345            345            345         345
RAM/Core (GB)                   28.00          14.00           7.00           4.67        3.73
Network BW/Core (Gb/s)          12.50           6.25           3.13           2.08        1.67
Memory BW/Core (GB/s)           21.56          10.78           5.39           3.59        2.88

 

If you are using the HB120rs_v3 size and want to undersubscribe your VM to get the optimal amount of resources per core for your application, you can pin your processes to the same cores used by the 16, 32, 64, or 96 core VM sizes. To do this, you will need to add the following options to your MPI jobs.

 

OpenMPI 4 / HPC-X:

Note: To print out the placement of the cores before the application is run, add the flag --report-bindings.

 

16 PPN:

--bind-to cpulist:ordered --cpu-set 0,8,16,24,30,38,46,54,60,68,76,84,90,98,106,114

 

32 PPN:

--bind-to cpulist:ordered 

--cpu-set 0,1,8,9,16,17,24,25,30,31,38,39,46,47,54,55,60,61,68,69,76,77,84,85,90,91,98,99,106,107,114,115

 

64 PPN:

--bind-to cpulist:ordered

--cpu-set 0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27,30,31,32,33,38,39,40,41,46,47,48,49,54,55,56,57,60,61,62,63,68,69,70,71,76,77,78,79,84,85,86,87,90,91,92,93,98,99,100,101,106,107,108,109,114,115,116,117

 

96 PPN:

--bind-to cpulist:ordered

--cpu-set 0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29,30,31,32,33,34,35,38,39,40,41,42,43,46,47,48,49,50,51,54,55,56,57,58,59,60,61,62,63,64,65,68,69,70,71,72,75,76,77,78,79,80,81,84,85,86,87,88,89,90,91,92,93,94,95,98,99,100,101,102,103,106,107,108,109,110,111,114,115,116,117,118,119
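Here is a minimal full launch sketch for the 32 PPN case, reusing the --cpu-set list above; hostfile and ./myapp are placeholders:

    # 32 ranks on a HB120rs_v3 node, pinned to the cores exposed by HB120-32rs_v3
    mpirun -np 32 --hostfile hostfile \
        --bind-to cpulist:ordered \
        --cpu-set 0,1,8,9,16,17,24,25,30,31,38,39,46,47,54,55,60,61,68,69,76,77,84,85,90,91,98,99,106,107,114,115 \
        --report-bindings ./myapp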

 

 

Intel MPI:

Note: To print out the placement of the cores before the application is run, add the environment variable I_MPI_DEBUG=4.

 

16 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,8,16,24,30,38,46,54,60,68,76,84,90,98,106,114

 

32 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,8,9,16,17,24,25,30,31,38,39,46,47,54,55,60,61,68,69,76,77,84,85,90,91,98,99,106,107,114,115

 

64 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,8,9,10,11,16,17,18,19,24,25,26,27,30,31,32,33,38,39,40,41,46,47,48,49,54,55,56,57,60,61,62,63,68,69,70,71,76,77,78,79,84,85,86,87,90,91,92,93,98,99,100,101,106,107,108,109,114,115,116,117

 

96 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,8,9,10,11,12,13,16,17,18,19,20,21,24,25,26,27,28,29,30,31,32,33,34,35,38,39,40,41,42,43,46,47,48,49,50,51,54,55,56,57,58,59,60,61,62,63,64,65,68,69,70,71,72,75,76,77,78,79,80,81,84,85,86,87,88,89,90,91,92,93,94,95,98,99,100,101,102,103,106,107,108,109,110,111,114,115,116,117,118,119
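A minimal full launch sketch for the 16 PPN case with Intel MPI; hostfile and ./myapp are placeholders:

    # 16 ranks on a HB120rs_v3 node, one per L3 cache group
    mpirun -np 16 -hostfile hostfile \
        -genv I_MPI_DEBUG=4 \
        -genv I_MPI_PIN_PROCESSOR_LIST=0,8,16,24,30,38,46,54,60,68,76,84,90,98,106,114 \
        ./myapp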

 

Azure HBv4/HX VM:

These instances come with 176 AMD Genoa cores. Each socket contains 2 NUMA domains with 44 cores each. Two cores from each of eight chiplets are held back for the hypervisor. When undersubscribing the HBv4/HX VM to get the desired resources per core, it is desirable to balance the L3 cache and memory bandwidth equally between cores. To do this, the user will need to select either 24, 48, 96, 144, or 176 cores per node. To simplify optimal process placement for our customers, we have provided additional HBv4/HX VM sizes (HB176-24rs_v4/HX176-24rs, HB176-48rs_v4/HX176-48rs, HB176-96rs_v4/HX176-96rs, HB176-144rs_v4/HX176-144rs) alongside the standard HB176rs_v4/HX176rs size. Below is a table of the resources per core for the various sizes.

 

Metric                   HB176-24rs_v4/   HB176-48rs_v4/   HB176-96rs_v4/   HB176-144rs_v4/   HB176rs_v4/
                         HX176-24rs       HX176-48rs       HX176-96rs       HX176-144rs       HX176rs
Cores (Physical)                     24               48               96              144           176
RAM (GB, HBv4/HX)              688/1408         688/1408         688/1408         688/1408      688/1408
Network BW (Gb/s)                   400              400              400              400           400
Memory BW (GB/s)                    800              800              800              800           800
RAM/Core (GB, HBv4/HX)        28.7/58.7        14.3/29.3         7.2/14.7          4.8/9.8       3.9/8.0
Network BW/Core (Gb/s)             16.7              8.3              4.2              2.8           2.3
Memory BW/Core (GB/s)              33.3             16.7              8.3              5.6           4.5

 

If you are using the HB176rs_v4 or HX176rs size and want to undersubscribe your VM to get the optimal amount of resources per core for your application, you can pin your processes to the same cores used by the 24, 48, 96, or 144 core VM sizes. To do this, you will need to add the following options to your MPI jobs.

 

OpenMPI 4 / HPC-X:

Note: To print out the placement of the cores before the application is run, add the flag --report-bindings.

 

24 PPN:

--bind-to cpulist:ordered --cpu-set 0,6,12,20,28,36,44,50,56,64,72,80,88,94,100,108,116,124,132,138,144,152,160,168

 

48 PPN:

--bind-to cpulist:ordered 

--cpu-set   0,1,6,7,12,13,20,21,28,29,36,37,44,45,50,51,56,57,64,65,72,73,80,81,88,89,94,95,100,101,108,109,116,117,124,125,132,133,138,139,144,145,152,153,160,161,168,169

 

96 PPN:

--bind-to cpulist:ordered

--cpu-set    0,1,2,3,6,7,8,9,12,13,14,15,20,21,22,23,28,29,30,31,36,37,38,39,44,45,46,47,50,51,52,53,56,57,58,59,64,65,66,67,72,73,74,75,80,81,82,83,88,89,90,91,94,95,96,97,100,101,102,103,108,109,110,111,116,117,118,119,124,125,126,127,132,133,134,135,138,139,140,141,144,145,146,147,152,153,154,155,160,161,162,163,168,169,170,171

 

144 PPN:

--bind-to cpulist:ordered

--cpu-set 0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,20,21,22,23,24,25,28,29,30,31,32,33,36,37,38,39,40,41,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,64,65,66,67,68,69,72,73,74,75,76,77,80,81,82,83,84,85,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,108,109,110,111,112,113,116,117,118,119,120,121,124,125,126,127,128,129,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,152,153,154,155,156,157,160,161,162,163,164,165,168,169,170,171,172,173
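A minimal full launch sketch for the 24 PPN case, reusing the --cpu-set list above; hostfile and ./myapp are placeholders:

    # 24 ranks on a HB176rs_v4/HX176rs node, one per L3 cache group
    mpirun -np 24 --hostfile hostfile \
        --bind-to cpulist:ordered \
        --cpu-set 0,6,12,20,28,36,44,50,56,64,72,80,88,94,100,108,116,124,132,138,144,152,160,168 \
        --report-bindings ./myapp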

 

 

Intel MPI:

Note: To print out the placement of the cores before the application is run, add the environment variable I_MPI_DEBUG=4.

 

24 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,6,12,20,28,36,44,50,56,64,72,80,88,94,100,108,116,124,132,138,144,152,160,168

 

48 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,6,7,12,13,20,21,28,29,36,37,44,45,50,51,56,57,64,65,72,73,80,81,88,89,94,95,100,101,108,109,116,117,124,125,132,133,138,139,144,145,152,153,160,161,168,169

 

96 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,6,7,8,9,12,13,14,15,20,21,22,23,28,29,30,31,36,37,38,39,44,45,46,47,50,51,52,53,56,57,58,59,64,65,66,67,72,73,74,75,80,81,82,83,88,89,90,91,94,95,96,97,100,101,102,103,108,109,110,111,116,117,118,119,124,125,126,127,132,133,134,135,138,139,140,141,144,145,146,147,152,153,154,155,160,161,162,163,168,169,170,171

 

144 PPN:

-genv I_MPI_PIN_PROCESSOR_LIST=0,1,2,3,4,5,6,7,8,9,10,11,12,13,14,15,16,17,20,21,22,23,24,25,28,29,30,31,32,33,36,37,38,39,40,41,44,45,46,47,48,49,50,51,52,53,54,55,56,57,58,59,60,61,64,65,66,67,68,69,72,73,74,75,76,77,80,81,82,83,84,85,88,89,90,91,92,93,94,95,96,97,98,99,100,101,102,103,104,105,108,109,110,111,112,113,116,117,118,119,120,121,124,125,126,127,128,129,132,133,134,135,136,137,138,139,140,141,142,143,144,145,146,147,148,149,152,153,154,155,156,157,160,161,162,163,164,165,168,169,170,171,172,173
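For a multi-node Intel MPI run, the same processor list applies on every node; -ppn controls the ranks per node. A minimal sketch for 24 PPN across two nodes (48 ranks total), with hostfile and ./myapp as placeholders:

    # 24 ranks per node on two HB176rs_v4/HX176rs nodes
    mpirun -ppn 24 -np 48 -hostfile hostfile \
        -genv I_MPI_DEBUG=4 \
        -genv I_MPI_PIN_PROCESSOR_LIST=0,6,12,20,28,36,44,50,56,64,72,80,88,94,100,108,116,124,132,138,144,152,160,168 \
        ./myapp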

 

Pinning for Hybrid (MPI + OpenMP)

When running in hybrid mode on HBv3, you will need to exclude some cores to get the proper pinning.

 

HB and HBv2

HB and HBv2 are laid out with each chiplet represented as a NUMA domain. To get optimal L3 cache usage, you will only want to use 2, 3, or 4 threads per MPI rank. Below are the options you will need to set to get the optimal MPI placement. For HB, use only 15 (2, 3, or 4 threads/rank) or 30 (2 threads/rank) MPI ranks. For HBv2, use only 30 (2, 3, or 4 threads/rank) or 60 (2 threads/rank) MPI ranks.

 

OpenMPI 4 / HPC-X:

  • --bind-to core
  • --map-by ppr:<mpi ranks/numa>:numa:pe=<threads/mpi rank>

Example: If I wanted to run 30 MPI ranks on HBv2 and use 3 threads/rank (90 total cores), I would use the following options (a full command-line sketch follows the list):

  • -np 30
  • --bind-to core
  • --map-by ppr:1:numa:pe=3
  • OMP_NUM_THREADS=3
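A minimal command-line sketch of that example; hostfile and ./myapp are placeholders, and -x is used to export OMP_NUM_THREADS to the ranks:

    # 30 MPI ranks per HBv2 node, 3 OpenMP threads per rank (90 cores)
    mpirun -np 30 --hostfile hostfile \
        -x OMP_NUM_THREADS=3 \
        --bind-to core --map-by ppr:1:numa:pe=3 \
        --report-bindings ./myapp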

HBv3

  • Under investigation. If you know of a clean way to do this with OpenMPI that is equivalent to what Intel MPI does please share in the comments.

Intel MPI:

  • I_MPI_PIN=on
  • I_MPI_PIN_DOMAIN=cache3
  • OMP_NUM_THREADS=[2, 3, or 4]

Example: If I wanted to run 30 MPI ranks on HBv2 and use 2 threads/rank (60 total cores), I would use the following options (a full command-line sketch follows the list):

  • -np 30 (or some multiple of 30 * number of VMs)
  • I_MPI_PIN=on
  • I_MPI_PIN_DOMAIN=cache3
  • OMP_NUM_THREADS=2
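A minimal command-line sketch of that example; hostfile and ./myapp are placeholders:

    # 30 MPI ranks per HBv2 node, 2 OpenMP threads per rank (60 cores)
    mpirun -np 30 -hostfile hostfile \
        -genv I_MPI_PIN=on \
        -genv I_MPI_PIN_DOMAIN=cache3 \
        -genv OMP_NUM_THREADS=2 \
        -genv I_MPI_DEBUG=4 \
        ./myapp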

HBv3

The approach we found that works with Intel MPI is to exclude the cores you do not want it to use; then, using the I_MPI_PIN_DOMAIN variable, you can get it to use the remaining cores properly. Below is the list of cores you would want to exclude if you were to run 96 (exclude 24), 64 (exclude 56), or 32 (exclude 88) cores per node.

 

Exclude Cores

Exclude 24 cores (run 96 cores/node):
14,15,22,23,6,7,44,45,52,53,36,37,74,75,82,83,66,67,104,105,112,113,96,97

Exclude 56 cores (run 64 cores/node):
4,5,12,13,14,15,20,21,22,23,28,29,6,7,34,35,42,43,44,45,50,51,52,53,58,59,36,37,64,65,72,73,74,75,80,81,82,83,88,89,66,67,94,95,102,103,104,105,110,111,112,113,118,119,96,97

Exclude 88 cores (run 32 cores/node):
2,3,4,5,10,11,12,13,14,15,18,19,20,21,22,23,26,27,28,29,6,7,32,33,34,35,40,41,42,43,44,45,48,49,50,51,52,53,56,57,58,59,36,37,62,63,64,65,70,71,72,73,74,75,78,79,80,81,82,83,86,87,88,89,66,67,92,93,94,95,100,101,102,103,104,105,108,109,110,111,112,113,116,117,118,119,96,97

 

Recommendations for the following hybrid scenarios:

  • Note: If you use other combinations of ranks and threads, you will not have an optimal resource distribution for L3 cache and will span AMD chiplets, which will reduce performance.

MPI Ranks   Threads/MPI Rank   Cores to Exclude
16          6                  24
32          3                  24
48          2                  24
16          4                  56
32          2                  56
16          2                  88

 

To run in hybrid mode, you will want to set the following environment variables

  • I_MPI_PIN=on
  • I_MPI_PIN_DOMAIN=<threads/mpi rank>:compact
  • I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=<exclude core list>

Example: For 16 MPI ranks/node with 6 threads/rank (96 cores/node), set the following (a full command-line sketch follows the list):

    • I_MPI_PIN=on
    • I_MPI_PIN_DOMAIN=6:compact
    • I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=14,15,22,23,6,7,44,45,52,53,36,37,74,75,82,83,66,67,104,105,112,113,96,97
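A minimal command-line sketch of that example; hostfile and ./myapp are placeholders:

    # 16 MPI ranks per HB120rs_v3 node, 6 OpenMP threads per rank (96 cores)
    mpirun -np 16 -hostfile hostfile \
        -genv I_MPI_PIN=on \
        -genv I_MPI_PIN_DOMAIN=6:compact \
        -genv OMP_NUM_THREADS=6 \
        -genv I_MPI_PIN_PROCESSOR_EXCLUDE_LIST=14,15,22,23,6,7,44,45,52,53,36,37,74,75,82,83,66,67,104,105,112,113,96,97 \
        ./myapp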

 

#AzureHPC #AzureHPCAI

Updated Nov 09, 2023
Version 9.0

1 Comment

  • AMD_Lewis

    The following applies to OpenMPI (including the HPC-X OpenMPI) and the 96-core, 64-core, 32-core, and 16-core HBv3 VMs. This does not apply to the 120-core HBv3 VM.

     

    For a hybrid solver (e.g. MPI + OpenMP) with the 96/64/32/16 core HBv3 VMs, you can (and should) map each rank to an L3 cache. The simplest way to do this is --map-by l3cache. Depending on your scheduler (e.g. Slurm), you may find this creates a single-core mapping slot. In that case, add :PE=X where X is 6 for the 96-core VM, 4 for the 64-core VM and 2 for the 32-core VM. The default behavior is what you want for the 16-core VM. You can effectively use the 96-core VM to emulate the 64, 32, and 16-core VMs by using the above values of X on the 96-core VM. The net effect is the same - you are using X of the 8 possible cores on each L3 cache.

     

    Where it gets tricky is if you have a single threaded application. Here, you can use --map-by l3cache:pe=1, which will create a mapping slot on each L3 cache consisting of one core. Once a slot has been created on each L3, the mapper will start over using a second core on each l3cache. Provided your -np value is divisible by the number of L3 cache domains (30 on HBv2 and 16 on HBv3 except for the full-size 120-core VM), you will evenly spread processes among the L3s.

     

    Adding --report-bindings to your mpirun command line can help visualize the results of the mapping and binding phases to ensure you are achieving the desired results.