The following applies to Open MPI (including the Open MPI shipped with HPC-X) and to the 96-core, 64-core, 32-core, and 16-core HBv3 VM sizes. It does not apply to the full 120-core HBv3 VM.
For a hybrid solver (e.g. MPI + OpenMP) on the 96-, 64-, 32-, and 16-core HBv3 VMs, you can (and should) map each rank to an L3 cache. The simplest way to do this is --map-by l3cache. Depending on your scheduler (e.g. Slurm), you may find that this creates mapping slots containing only a single core. In that case, add :PE=X, where X is 6 for the 96-core VM, 4 for the 64-core VM, and 2 for the 32-core VM; the default behavior is already what you want on the 16-core VM. You can also use the 96-core VM to emulate the 64-, 32-, and 16-core VMs by using the corresponding values of X on the 96-core VM. The net effect is the same: you are using X of the 8 possible cores on each L3 cache. A sketch of such a command line is shown below.
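As a minimal sketch, a hybrid run on a single 96-core HBv3 VM might look like the following, which places one rank on each of the 16 L3 caches and gives it 6 cores for its OpenMP threads. The executable name ./my_hybrid_solver, the rank count, and the OMP_NUM_THREADS value are placeholders to adapt to your application:

    mpirun -np 16 --map-by l3cache:PE=6 --bind-to core -x OMP_NUM_THREADS=6 ./my_hybrid_solver

To emulate the 64-core VM on the same hardware, change PE=6 to PE=4 and set OMP_NUM_THREADS=4 to match.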
Where it gets tricky is if you have a single-threaded application. Here, you can use --map-by l3cache:PE=1, which creates a mapping slot of one core on each L3 cache. Once a slot has been placed on every L3, the mapper starts over and uses a second core on each L3 cache. Provided your -np value is divisible by the number of L3 cache domains (30 on HBv2 and 16 on the HBv3 sizes covered here), processes will be spread evenly among the L3s; see the example below.
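For example, assuming a hypothetical single-threaded binary ./my_app, the following run on a 96-core HBv3 VM (16 L3 caches) places two ranks on every L3 cache, each bound to its own core, because 32 is divisible by 16:

    mpirun -np 32 --map-by l3cache:PE=1 --bind-to core ./my_app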
Adding --report-bindings to your mpirun command line helps you visualize the results of the mapping and binding phases and confirm that you are getting the layout you intended.
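For instance, appending the flag to the hybrid sketch above prints each rank's binding at startup (again, ./my_hybrid_solver is a placeholder):

    mpirun -np 16 --map-by l3cache:PE=6 --bind-to core -x OMP_NUM_THREADS=6 --report-bindings ./my_hybrid_solver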