Over the past several years, neural networks have proven to be an extremely effective tool in artificial intelligence. As tasks grow more complex and creative, training these models inevitably involves massive amounts of data and computation, which requires large multi-GPU clusters and brings significantly more inter-node communication.
The NCCL library provides inter-GPU communication primitives that are topology-aware and can be easily integrated into applications. To deliver optimal inter-GPU performance, GPUDirect RDMA technology is commonly used. It enables direct communication between NVIDIA GPUs in remote systems, bypassing the system CPUs and eliminating buffer copies through system memory, which yields a significant performance boost [1].
GPUDirect RDMA in a virtualized environment requires enabling ATS (Address Translation Services) on the network adapter [2]. ATS extends the PCIe protocol with a Translation Agent (TA) that translates DMA addresses on behalf of devices; translated addresses can be cached in the device's Address Translation Cache (ATC), which reduces the processing load on the TA and enhances system performance [3]. Because ATS generates additional memory transactions, PCIe Relaxed Ordering consequently plays an important role here.
PCI Express supports the Relaxed Ordering (RO) mechanism introduced by PCI-X: switches on the path between a Producer and a Consumer may forward some newly received transactions ahead of others that were previously enqueued. The ordering rules that exist to support the Producer/Consumer model can block transactions that are in fact completely unrelated to any Producer/Consumer transaction sequence. Consequently, in certain circumstances, a transaction with its Relaxed Ordering attribute bit set can be reordered ahead of other transactions [4].
As a PCIe feature, Relaxed Ordering allows flexibility in transaction ordering over PCIe. This avoids unnecessary stalls on the link and can greatly help the performance of InfiniBand networks in virtualized environments.
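Whether Relaxed Ordering is enabled for a given adapter can be seen in the Device Control register that `lspci -vvv` prints (the `RlxdOrd+` flag). The snippet below is a minimal sketch that parses a hypothetical DevCtl line; on a real system you would feed it the actual `lspci -vvv` output for your adapter's bus address.

```shell
# Hypothetical DevCtl line as printed by `lspci -vvv` for a network adapter.
# On a real system: sudo lspci -s <bus:dev.fn> -vvv | grep 'DevCtl:'
devctl='DevCtl: CorrErr+ NonFatalErr+ FatalErr+ UnsupReq+ RlxdOrd+ ExtTag+'

# "RlxdOrd+" means the Enable Relaxed Ordering bit is set in Device Control;
# "RlxdOrd-" would mean it is clear.
case "$devctl" in
  *RlxdOrd+*) echo "relaxed ordering enabled" ;;
  *)          echo "relaxed ordering disabled" ;;
esac
```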
In this blog, we will demonstrate the performance impact of PCIe Relaxed Ordering using the NCCL Allreduce benchmark across two VMs on Azure.
NCCL 2.12 introduced the environment variable NCCL_IB_PCI_RELAXED_ORDERING, which enables/disables PCIe Relaxed Ordering for the IB Verbs transport directly. By default it automatically uses Relaxed Ordering if available. Azure HPC images already ship with NCCL 2.12 or higher prebuilt, so we can easily turn it on/off to check the NCCL performance impact.
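The benchmark commands that follow refer to a TEST variable. For the Allreduce results in this blog it would point at the corresponding nccl-tests binary (nccl-tests builds one `*_perf` binary per collective):

```shell
# Select the nccl-tests Allreduce benchmark binary for the mpirun commands.
export TEST=all_reduce_perf
```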
mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
-mca coll_hcoll_enable 0 --bind-to numa \
-x NCCL_IB_PCI_RELAXED_ORDERING=0 \
-x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
-x NCCL_DEBUG=WARN \
/opt/nccl-tests/build/${TEST} -b 8 -e 8G -f 2 -g 1 -c 0
Fig.1 NCCL Allreduce performance with Relaxed Ordering disabled for the IB Verbs transport
mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
-mca coll_hcoll_enable 0 --bind-to numa \
-x NCCL_IB_PCI_RELAXED_ORDERING=1 \
-x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
-x CUDA_DEVICE_ORDER=PCI_BUS_ID \
-x NCCL_SOCKET_IFNAME=eth0 \
-x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
-x NCCL_DEBUG=WARN \
/opt/nccl-tests/build/${TEST} -b 8 -e 8G -f 2 -g 1 -c 0
Fig.2 NCCL Allreduce performance with Relaxed Ordering enabled for the IB Verbs transport
As Fig.1 shows, the peak in-place busbw with RO disabled is only 25 GB/s (at a 2M message size), whereas with RO enabled it reaches 188 GB/s, as shown in Fig.2. Enabling RO brings more than a 6x performance improvement.
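The improvement factor follows directly from the two peak busbw numbers reported above:

```shell
# Peak in-place busbw (GB/s) from the two runs above.
ro_off=25   # Relaxed Ordering disabled
ro_on=188   # Relaxed Ordering enabled
awk -v off="$ro_off" -v on="$ro_on" \
    'BEGIN { printf "speedup: %.2fx\n", on / off }'
# prints "speedup: 7.52x", i.e. more than 6x.
```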
For NCCL versions lower than 2.12, the environment variable NCCL_IB_PCI_RELAXED_ORDERING is not available, but there are two ways to control PCIe Relaxed Ordering through the NCCL IB plugin. It is worth noting that both options have been integrated into Azure HPC images.
This blog demonstrates that NCCL performance can be greatly boosted by enabling PCIe Relaxed Ordering. Azure HPC images give users different ways to enable it via environment variables, either through a newer version of NCCL (2.12 onwards) or through the NCCL IB plugin.
[1] https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/overview.html
[2] RDG: Virtualizing GPU-Accelerated HPC & AI Workloads on OpenStack Cloud over InfiniBand Fabric
[3] Address Translation Services, Revision 1.1, PCI-SIG
[4] Tom Shanley, Don Anderson, Ravi Budruk, PCI Express System Architecture, MindShare, Inc.
#AzureHPCAI