Azure High Performance Computing (HPC) Blog

Achieving Optimal Performance for DeepSeek Expert Parallelism (DeepEP) on Azure

mahdiehghazi
Microsoft
May 16, 2025

This blog post presents practical techniques for optimizing the performance of DeepEP on Azure-based GPU clusters. DeepEP is a high-performance communication library designed to accelerate Mixture-of-Experts (MoE) models through efficient expert parallelism. It leverages NVSHMEM for one-sided GPU communication, enabling low-latency, host-bypass data transfers across nodes. The focus of this post is affinity-aware optimization: aligning processes with the NUMA topology, GPUs, and network interfaces to minimize communication overhead. We describe code-level modifications that use psutil, libnuma, and NVSHMEM environment variables to set CPU core, GPU, and memory affinities during initialization, ensuring optimal hardware placement. These enhancements significantly improve DeepEP's communication efficiency and overall performance when deployed for distributed training on Azure.
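To make the affinity-aware setup concrete, the sketch below shows one way such an initialization step could look. It is illustrative only: the rank-to-core/NUMA/NIC table is a hypothetical example (check the real layout for your VM SKU with nvidia-smi topo -m or lstopo), and the NVSHMEM variable names used for NIC selection should be verified against the NVSHMEM documentation for your version.

```python
"""
Illustrative affinity-aware initialization for a per-GPU worker process.
The topology table and NVSHMEM settings below are assumptions, not the
exact values used in the post.
"""
import ctypes
import os

import psutil

# Hypothetical mapping: local rank -> (CPU cores, NUMA node, NIC).
# Replace with the actual topology of your Azure GPU VM SKU.
TOPOLOGY = {
    0: (range(0, 12), 0, "mlx5_0"),
    1: (range(12, 24), 0, "mlx5_1"),
    2: (range(24, 36), 1, "mlx5_2"),
    3: (range(36, 48), 1, "mlx5_3"),
    4: (range(48, 60), 2, "mlx5_4"),
    5: (range(60, 72), 2, "mlx5_5"),
    6: (range(72, 84), 3, "mlx5_6"),
    7: (range(84, 96), 3, "mlx5_7"),
}


def bind_this_process(local_rank: int) -> None:
    cores, numa_node, nic = TOPOLOGY[local_rank]

    # 1) Pin the worker to the CPU cores closest to its GPU.
    psutil.Process().cpu_affinity(list(cores))

    # 2) Prefer memory allocations from the local NUMA node via libnuma.
    libnuma = ctypes.CDLL("libnuma.so.1")
    if libnuma.numa_available() >= 0:
        libnuma.numa_set_preferred(ctypes.c_int(numa_node))

    # 3) Steer NVSHMEM's InfiniBand traffic to the NIC nearest the GPU.
    #    Variable names are assumed; confirm against your NVSHMEM docs.
    os.environ["NVSHMEM_HCA_LIST"] = nic
    os.environ["NVSHMEM_ENABLE_NIC_PE_MAPPING"] = "1"


if __name__ == "__main__":
    bind_this_process(int(os.environ.get("LOCAL_RANK", "0")))
```

Called once per worker before NVSHMEM and the DeepEP buffers are initialized, a step like this keeps each GPU's host-side work, page allocations, and RDMA traffic within the same NUMA domain.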
