MPI collective communication primitives offer a flexible, portable way to implement group communication operations. They are widely used in scientific parallel applications and have a significant impact on overall application performance. This blog post highlights the configuration parameters that optimize collective communication performance when using HPC-X on Azure HPC virtual machines.
HPC-X uses the 'hcoll' library for collective communication. Its main features are listed below:
Collectives Offload (Core-Direct):
Recent-generation InfiniBand network cards support offloading collective operations to the Host Channel Adapter (HCA), which reduces the overhead on the CPU. CPU cycles can then be spent on the application's actual computation rather than on implementing collective operations. Offload also offers better potential for computation-communication overlap. Most collective operations perform better with hardware offload.
'hcoll' is enabled by default in HPC-X on Azure HPC VMs and can be controlled at runtime with the following MCA parameter:
-mca coll_hcoll_enable 1
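As a minimal sketch of how this parameter fits into a launch line, the helper below prints two mpirun commands, one with hcoll enabled and one with it disabled, so the two settings can be compared side by side. The application path and rank count are hypothetical placeholders, and nothing is actually launched.

```shell
#!/usr/bin/env bash
# Sketch: build an Open MPI command line that toggles hcoll.
# APP and NP are placeholders, not values from this post.

# Print the MCA arguments that switch hcoll on (1) or off (0).
hcoll_args() {
  local enable="${1:-1}"
  case "$enable" in
    0|1) printf -- '-mca coll_hcoll_enable %s\n' "$enable" ;;
    *)   echo "hcoll_args: expected 0 or 1, got '$enable'" >&2; return 1 ;;
  esac
}

APP="./my_mpi_app"   # hypothetical application binary
NP=4                 # hypothetical rank count

# Print both command lines for an A/B comparison; nothing is launched here.
echo "mpirun -np $NP $(hcoll_args 1) $APP"
echo "mpirun -np $NP $(hcoll_args 0) $APP"
```

Running the same application with both settings is a simple way to check whether hcoll helps a particular workload.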
Hardware Multicast (mcast):
InfiniBand offers UD-based hardware multicast, which allows short messages to be broadcast to multicast groups efficiently. Some MPI collectives (such as MPI_Barrier and MPI_Bcast) make use of 'mcast' and see significant performance improvements.
'mcast' is enabled by default in Azure HPC VMs and can be controlled at runtime using the environment parameter:
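A minimal sketch of passing such a setting to every rank with Open MPI's `-x` option. It assumes the multicast switch is the `HCOLL_ENABLE_MCAST_ALL` variable described in the HPC-X documentation, so confirm the name and accepted values against your HPC-X release before relying on it. The application path and rank count are hypothetical.

```shell
#!/usr/bin/env bash
# Sketch: forward hcoll environment settings to all ranks via "-x NAME=value".
# ASSUMPTION: HCOLL_ENABLE_MCAST_ALL is the multicast switch; verify it for
# your HPC-X release. The application path and rank count are placeholders.

# Turn NAME=value pairs into a string of "-x NAME=value" arguments.
export_args() {
  local out="" pair
  for pair in "$@"; do
    out+="${out:+ }-x $pair"
  done
  printf '%s\n' "$out"
}

# Print the resulting command line; nothing is launched here.
echo "mpirun -np 4 -mca coll_hcoll_enable 1 $(export_args HCOLL_ENABLE_MCAST_ALL=1) ./my_mpi_app"
```

Setting the value to 0 instead of 1 would be the corresponding way to disable multicast under the same assumption.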
Most hcoll algorithms use a hierarchical structure to implement a collective operation. This hierarchy must be aware of the underlying CPU architecture and network topology. For example, on NUMA architectures such as Azure HB and Azure HBv2 VMs, NUMA-aware subgrouping offers better performance.
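Before tuning the hierarchy, it can help to check how many NUMA nodes the operating system actually exposes on a VM. The sketch below counts them from Linux sysfs. The optional argument (an alternate sysfs root) is an addition for testability only, and `numactl --hardware` or `lscpu` report the same information where those tools are installed.

```shell
#!/usr/bin/env bash
# Sketch: count the NUMA nodes that the Linux kernel exposes through sysfs.
# The optional argument (alternate sysfs root) exists only to ease testing.

numa_node_count() {
  local root="${1:-/sys/devices/system/node}"
  local count=0 dir
  # Each NUMA node appears as a directory named node0, node1, ...
  for dir in "$root"/node[0-9]*; do
    [ -d "$dir" ] && count=$((count + 1))
  done
  echo "$count"
}

echo "NUMA nodes visible to the OS: $(numa_node_count)"
```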
NUMA-aware subgrouping can be controlled by the environment parameter:
Recommended Configuration on Azure HPC VMs:
Based on the above features, the recommended configuration parameters for the different Azure HPC VM types are highlighted below:
The following figure depicts the performance improvements from enabling UD-based mcast and NUMA-aware subgrouping. The results were obtained with the osu_bcast MPI benchmark on 16 HBv2 VMs with 120 processes per node.
Figure: Impact of HPC-X configuration on MPI_Bcast performance
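A run of the shape described above (16 nodes with 120 processes per node, so 1920 ranks in total) could be assembled roughly as follows. This is a sketch under stated assumptions, not the exact command used for the measurements: the hostfile and the osu_bcast path are hypothetical, and only the hcoll enable parameter shown earlier is included.

```shell
#!/usr/bin/env bash
# Sketch: assemble an osu_bcast launch for 16 nodes x 120 processes per node.
# HOSTFILE and OSU_BCAST are hypothetical paths; adjust them for your cluster.

NODES=16
PPN=120
HOSTFILE="./hosts.txt"    # hypothetical hostfile listing the 16 VMs
OSU_BCAST="./osu_bcast"   # hypothetical path to the OSU broadcast benchmark

# Total MPI ranks = nodes * processes per node.
total_ranks() { echo $(( $1 * $2 )); }

# Print the mpirun command line; nothing is launched here.
bcast_cmd() {
  local nodes="$1" ppn="$2"
  echo "mpirun -np $(total_ranks "$nodes" "$ppn") --hostfile $HOSTFILE --map-by ppr:${ppn}:node -mca coll_hcoll_enable 1 $OSU_BCAST"
}

bcast_cmd "$NODES" "$PPN"
```

The `--map-by ppr:120:node` option asks Open MPI to place 120 ranks on each node, which matches the processes-per-node figure quoted for the benchmark.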
This post lists the hardware features that enable optimal collective communication performance on Azure HPC VMs and highlights the recommended configuration for different VM types. Please note that these are general recommendations; real application performance depends on your application characteristics, runtime configuration, transport protocols, processes per node (ppn) configuration, etc.