This blog post presents practical techniques for optimizing the performance of DeepEP on Azure-based GPU clusters. DeepEP is a high-performance communication library designed to accelerate Mixture-of-Experts (MoE) models through efficient expert parallelism. It leverages NVSHMEM for one-sided GPU communication, enabling low-latency, host-bypass data transfers across nodes. The focus of this post is on affinity-aware optimization, demonstrating how to align processes with the NUMA topology, GPUs, and network interfaces to minimize communication overhead. We describe code-level modifications using psutil, libnuma, and NVSHMEM environment variables to set CPU core, GPU, and memory affinities during initialization, ensuring optimal hardware placement. These enhancements significantly improve DeepEP's communication efficiency and overall performance when deployed for distributed training on Azure.
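To make the approach concrete, here is a minimal sketch of what such affinity-aware initialization can look like on a Linux node, assuming one process per GPU identified by `local_rank`, with `psutil` and `pynvml` installed and `libnuma` available on the system. The helper names (`gpu_numa_node`, `node_cpus`, `bind_process`), the `mlx5_<rank>` NIC naming, and the choice of NVSHMEM variable are illustrative assumptions, not DeepEP's own API; verify the device names and environment variables against your cluster and NVSHMEM build.

```python
# Sketch of affinity-aware initialization (assumptions: Linux sysfs layout,
# libnuma present, one process per GPU; helper names are illustrative).
import ctypes
import os

import psutil
import pynvml


def gpu_numa_node(local_rank: int) -> int:
    """Look up the NUMA node of a GPU through its PCI address in sysfs."""
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(local_rank)
    bus_id = pynvml.nvmlDeviceGetPciInfo(handle).busId
    if isinstance(bus_id, bytes):
        bus_id = bus_id.decode()
    # NVML reports "00000000:0B:00.0"; sysfs uses the short lowercase form.
    sysfs_id = bus_id[-12:].lower()
    with open(f"/sys/bus/pci/devices/{sysfs_id}/numa_node") as f:
        node = int(f.read())
    return max(node, 0)  # sysfs reports -1 when no NUMA info is exposed


def node_cpus(node: int) -> list[int]:
    """Parse a NUMA node's core list from sysfs, e.g. '0-23,48-71'."""
    with open(f"/sys/devices/system/node/node{node}/cpulist") as f:
        spec = f.read().strip()
    cpus: list[int] = []
    for part in spec.split(","):
        lo, _, hi = part.partition("-")
        cpus.extend(range(int(lo), int(hi or lo) + 1))
    return cpus


def bind_process(local_rank: int) -> None:
    """Pin this rank's CPU, memory, and (assumed) NIC to its GPU's NUMA node."""
    node = gpu_numa_node(local_rank)

    # 1) CPU affinity: restrict the process to cores local to the GPU (psutil).
    psutil.Process().cpu_affinity(node_cpus(node))

    # 2) Memory affinity: prefer allocations on the same node (libnuma via ctypes).
    libnuma = ctypes.CDLL("libnuma.so.1")
    if libnuma.numa_available() >= 0:
        libnuma.numa_set_preferred(node)

    # 3) NIC selection for NVSHMEM: the mlx5_<rank> mapping is an assumption
    #    about the VM topology; check your cluster's HCA names before using it.
    os.environ.setdefault("NVSHMEM_HCA_LIST", f"mlx5_{local_rank}")
```

In this sketch, `bind_process(local_rank)` would be called early in each worker, before the process group is created and before DeepEP/NVSHMEM buffers are allocated, so that subsequent host allocations and communication threads inherit the CPU and memory affinities of the GPU's NUMA node.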
Updated May 21, 2025
Version 3.0