Adaptive Routing (AR) allows Azure Virtual Machines (VMs) running EDR and HDR InfiniBand to automatically detect and avoid network congestion by dynamically selecting better network paths. As a result, AR offers improved latency and bandwidth on the InfiniBand network, which in turn drives higher performance and scaling efficiency.
AR is enabled on the following VM families:
In this post, we discuss AR configuration in Azure HPC clusters and its implications for MPI libraries and communication runtimes based on InfiniBand.
AR and Legacy MPIs/Communication Runtimes
Adaptive Routing allows network packets to use different network routes, which can result in out-of-order packet arrivals. Certain protocols/optimizations in MPI libraries and communication runtimes assume an in-order arrival of network packets. Although this assumption is not valid as per the InfiniBand specification from the InfiniBand Trade Association, Mellanox’s InfiniBand implementations in recent years have nonetheless guaranteed in-order arrival of network packets.
With the arrival of AR capability on modern Mellanox InfiniBand switches, however, in-order packet arrival is no longer guaranteed as the switches may change packet ordering in order to provide optimal network paths and flows.
As a result, protocols/optimizations that rely on in-order packet arrival may produce errors not seen previously. Such protocols/optimizations include the rdma-exchange protocol in IntelMPI (2018 and before), RDMA-Fast-Path in MVAPICH2, the eager-rdma protocol in OpenMPI (openib btl), etc. Please note that this issue arises only when these protocols are used for messages larger than one InfiniBand MTU, because a message that fits in a single packet cannot be reordered internally.
Fear not, though. In Azure HPC, we take all legacy MPIs and runtimes into account, and we configure AR in such a way that legacy MPIs and runtimes are well supported.
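To make the failure mode concrete, here is a toy model in Python. It is not code from any MPI library: the `MTU` value, the function names, and the "poll the last byte" completion check are all assumptions chosen to mimic, very roughly, a receiver that treats the arrival of a message's tail as proof that the whole message has landed.

```python
MTU = 4096  # bytes per packet in this toy model (assumed value)


def split_into_packets(message: bytes, mtu: int = MTU):
    """Split a message into (offset, payload) packets of at most mtu bytes."""
    return [(off, message[off:off + mtu]) for off in range(0, len(message), mtu)]


def tail_polling_receive(packets, size: int) -> bytes:
    """Write packets into a buffer in arrival order and return a snapshot of
    the buffer at the moment the last byte becomes non-zero.

    This mimics a receiver that equates "last byte written" with "whole
    message arrived", which is only safe when packets land in order.
    """
    buf = bytearray(size)
    for off, payload in packets:
        buf[off:off + len(payload)] = payload
        if buf[-1] != 0:  # tail observed: the receiver believes it is done
            return bytes(buf)
    return bytes(buf)


message = bytes([1]) * (3 * MTU)  # three packets, every payload byte non-zero
packets = split_into_packets(message)

in_order = tail_polling_receive(packets, len(message))
reordered = tail_polling_receive(list(reversed(packets)), len(message))
single = tail_polling_receive(split_into_packets(bytes([1]) * MTU), MTU)

print(in_order == message)         # True: the tail is written last
print(reordered == message)        # False: the tail arrived first
print(single == bytes([1]) * MTU)  # True: one packet cannot be reordered
```

In the reordered case the receiver reads a buffer whose first two packets are still zeros, which is the kind of silent corruption the in-order assumption can cause. The single-packet case shows why messages of at most one MTU are unaffected.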
AR and Service Levels
Adaptive Routing is enabled per Service Level (SL). The SL is specified during the InfiniBand Queue Pair (QP) initialization phase of MPI libraries and communication runtimes. A preferred SL can be specified through environment parameters exposed by MPI libraries (for example, UCX_IB_SL=1, which instructs the UCX runtime to use Service Level 1).
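As a minimal sketch of how a job script might pass a preferred SL to UCX, the Python snippet below builds an environment with `UCX_IB_SL` set. The helper name `env_with_service_level` is hypothetical. The range check reflects the fact that InfiniBand defines 16 service levels (0 through 15).

```python
import os


def env_with_service_level(sl: int, base_env=None) -> dict:
    """Return a copy of the environment with UCX_IB_SL set to the given SL."""
    if not 0 <= sl <= 15:  # InfiniBand service levels are 0 through 15
        raise ValueError(f"service level must be in 0..15, got {sl}")
    env = dict(os.environ if base_env is None else base_env)
    env["UCX_IB_SL"] = str(sl)  # environment values must be strings
    return env


# A small explicit base environment keeps the example deterministic.
env = env_with_service_level(1, base_env={"PATH": "/usr/bin"})
print(env["UCX_IB_SL"])
```

The resulting dictionary can be passed as `env=` to `subprocess.run` when launching the job. The launcher command itself is site-specific, and some launchers need to be told explicitly to forward environment variables to remote ranks.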
Azure HPC AR Configuration
HBv3, HBv2, HB, HC and NDrv2: Adaptive Routing is enabled on all SLs except SL=0 (i.e., the default SL). This way, all legacy MPIs and communication runtimes work well on the Azure HPC clusters without any modification. MPI libraries and communication runtimes that are optimized for Adaptive Routing can take advantage of AR by specifying a non-default SL (e.g., SL=1).
Adaptive Routing is enabled by default, except for SL=2. It is enabled by default because most AI communication libraries (e.g., NCCL, UCX, newer versions of MPI libraries, etc.) support adaptive routing well. Legacy MPIs can still avoid any issues due to AR by switching to SL=2.
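The two configurations above can be summarized in a small lookup, sketched here in Python. The profile keys and function names are hypothetical labels invented for this example, not Azure identifiers. The only facts encoded are the ones stated above: one configuration disables AR on SL=0, the other disables it on SL=2.

```python
# Service level on which AR is disabled, per configuration described above.
# The keys are hypothetical labels used only in this sketch.
AR_DISABLED_SL = {
    "ar_off_on_default_sl": 0,  # HBv3, HBv2, HB, HC and NDrv2
    "ar_on_by_default": 2,      # configuration where only SL=2 disables AR
}


def ar_enabled(profile: str, sl: int) -> bool:
    """Return True if Adaptive Routing is active for this profile and SL."""
    return sl != AR_DISABLED_SL[profile]


def choose_sl(profile: str, wants_ar: bool) -> int:
    """Pick an SL for a library that does or does not tolerate AR."""
    disabled = AR_DISABLED_SL[profile]
    if not wants_ar:
        return disabled  # legacy protocols stay on the AR-free SL
    # Keep the default SL (0) when AR is already on there; otherwise use
    # SL=1, the non-default example given in the text.
    return 0 if disabled != 0 else 1


for profile in AR_DISABLED_SL:
    for wants_ar in (False, True):
        sl = choose_sl(profile, wants_ar)
        print(profile, wants_ar, sl, ar_enabled(profile, sl))
```

The point of the sketch is that a legacy library needs no change under the first configuration but must opt out (SL=2) under the second, while an AR-aware library must opt in (for example SL=1) under the first and needs no change under the second.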
The following are environment parameters to specify a non-default SL in various MPI libraries and runtimes. Use of these parameters will enable Adaptive Routing:
For transport-specific SL configuration, use the corresponding environment parameter(s) based on the transport type (RC/DC/UD).