Azure High Performance Computing (HPC) Blog

Achieving Optimal Performance for DeepSeek Expert Parallelism (DeepEP) on Azure

mahdiehghazi, Microsoft
May 16, 2025

This blog post presents practical techniques for optimizing the performance of DeepEP on Azure-based GPU clusters. DeepEP is a high-performance communication library designed to accelerate Mixture-of-Experts (MoE) models through efficient expert parallelism. It leverages NVSHMEM for one-sided GPU communication, enabling low-latency, host-bypass data transfers across nodes. The focus of this post is on affinity-aware optimization, demonstrating how to align processes with the NUMA topology, GPUs, and network interfaces to minimize communication overhead. We describe code-level modifications using psutil, libnuma, and NVSHMEM environment variables to set CPU core, GPU, and memory affinities during initialization, ensuring optimal hardware placement. These enhancements significantly improve DeepEP's communication efficiency and overall performance in distributed training on Azure.

DeepEP

DeepEP is a high-performance communication library developed by DeepSeek AI to optimize Mixture-of-Experts (MoE) and expert parallelism (EP) in large-scale AI models. It provides high-throughput, low-latency all-to-all GPU kernels for MoE dispatch and combine operations, which are critical for efficiently routing data between expert modules during training and inference. DeepEP includes specialized kernels for asymmetric-domain bandwidth forwarding—such as transfers between NVLink and RDMA/InfiniBand domains—and requires only 20 Streaming Multiprocessors (SMs) to saturate both. Tokens are first transmitted via IB to GPUs with matching in-node indices, then forwarded via NVLink to target experts, fully overlapping both communication paths. It leverages NVSHMEM for efficient one-sided communication, enabling low-latency data movement without host involvement. With its network-aware design and deep integration with MoE algorithms, DeepEP is a foundational component for scalable, high-performance expert model training and inference.

The Importance of NUMA Affinity

NUMA affinity refers to how well a process or thread is aligned with the memory and hardware resources—such as CPUs, GPUs, or NICs—within a Non-Uniform Memory Access (NUMA) system. In a NUMA architecture, the system’s memory is divided among multiple nodes (often corresponding to CPU sockets), and each node can access its local memory faster than the memory attached to other nodes. NUMA affinity is about ensuring that a process runs on a CPU (or accesses a device) that is physically close to the memory or network resources it needs, minimizing latency and maximizing bandwidth.
NUMA affinity is particularly critical in multi-GPU and multi-node systems where GPUs communicate with each other or with the network through NICs. If a GPU is not NUMA-affined to the NIC it uses, data may be routed across additional interconnects like PCIe switches or CPU sockets, increasing communication latency and reducing throughput. By maintaining proper NUMA affinity—ensuring, for example, that a GPU communicates through a NIC on the same NUMA node—systems can achieve significantly better performance, especially in communication-heavy workloads like distributed deep learning, MoE expert dispatch, or all-to-all collective operations.

 

NVIDIA DGX H100 system topology (Courtesy: https://docs.nvidia.com/dgx/dgxh100-user-guide/introduction-to-dgxh100.html) 

Affinity Considerations on Azure NDv5 VMs (H100)

The lscpu command can be used to get information about NUMA-to-core bindings. The following excerpt is from the output of lscpu on an NVIDIA DGX H100 system, showing that the system has two NUMA nodes: cores 0–47 belong to NUMA node 0, and cores 48–95 belong to NUMA node 1.

NUMA: 
NUMA node(s): 2 
NUMA node0 CPU(s): 0-47 
NUMA node1 CPU(s): 48-95

In addition, the lstopo command, together with the bus IDs of the GPUs and HCA (Host Channel Adapter) cards, can be used to find the mapping between NUMA nodes, CPU cores, GPUs, and HCAs.
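
As a quick cross-check, the NUMA node of any PCI device can also be read directly from sysfs. The snippet below is a minimal sketch, assuming the standard Linux sysfs layout; the GPU bus ID shown is a hypothetical placeholder, while the HCA bus ID matches the lspci output shown later in this post.

import os

def pci_numa_node(bus_id: str) -> int:
    # Each PCI device reports its NUMA node in sysfs; -1 means no affinity is reported.
    with open(f"/sys/bus/pci/devices/{bus_id}/numa_node") as f:
        return int(f.read().strip())

# Bus IDs for illustration only; obtain the real ones from lstopo, nvidia-smi, and lspci.
devices = {
    "GPU0": "0001:00:00.0",      # hypothetical GPU bus ID
    "mlx5_ib0": "0101:00:00.0",  # an InfiniBand controller bus ID from this VM's lspci output
}
for name, bus_id in devices.items():
    print(f"{name} ({bus_id}) -> NUMA node {pci_numa_node(bus_id)}")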

Affinity-Aware Code Adjustments to Boost DeepEP Performance

To improve DeepEP performance on Azure, we introduced code changes that explicitly bind each process to the right set of cores, GPU, and HCA, ensuring alignment with the system's NUMA topology. These modifications reduce cross-NUMA communication overhead and improve data locality, which is crucial for communication-heavy workloads like expert parallelism.

For this, we integrated the libnuma library using ctypes to enable memory binding to specific NUMA nodes, ensuring that memory allocations are local to the process's assigned CPU cores. We also used the psutil library to explicitly set CPU affinity, binding each process to a distinct range of cores based on its rank. This reduces cross-node traffic and improves cache locality. As mentioned earlier, the NVIDIA DGX H100 system has two NUMA nodes with 48 cores each. With 8 processes per node, we can assign 12 cores to each process on this system. These settings are applied early in the init_dist() function, ensuring that compute and communication operations benefit from optimal CPU and memory placement.

diff --git a/tests/utils.py b/tests/utils.py
index a574366..fffa905 100644
--- a/tests/utils.py
+++ b/tests/utils.py
@@ -1,10 +1,34 @@
 import os
 import sys
+import psutil
 import numpy as np
 import torch
 import torch.distributed as dist
 from typing import Optional
-
+import ctypes
+
+# Load libnuma
+libnuma = ctypes.CDLL("libnuma.so")
+libnuma.numa_available.restype = ctypes.c_int
+libnuma.numa_run_on_node.argtypes = [ctypes.c_int]
+libnuma.numa_set_preferred.argtypes = [ctypes.c_int]
+
+def set_numa_affinity(rank):
+    cores_per_rank = 12
+    numa_node = rank // 4
+    core_start = rank * cores_per_rank
+    core_end = core_start + cores_per_rank
+    p = psutil.Process(os.getpid())
+    p.cpu_affinity(list(range(core_start, core_end)))
+    print(f"Rank {rank} numa node {numa_node} bound to cores {core_start}-{core_end - 1}")
+
+    # Bind memory to NUMA node
+    if libnuma.numa_available() != -1:
+        libnuma.numa_set_preferred(numa_node)
+        print(f"Rank {rank}: CPU affinity → cores {core_start}-{core_end - 1}, memory NUMA → node {numa_node}")
+    else:
+        print(f"Rank {rank}: libnuma not available")

 def init_dist(local_rank: int, num_local_ranks: int):
     # NOTES: you may rewrite this function with your own cluster settings
@@ -20,8 +44,10 @@ def init_dist(local_rank: int, num_local_ranks: int):
         world_size=num_nodes * num_local_ranks,
         rank=node_rank * num_local_ranks + local_rank
     )
+    set_numa_affinity(local_rank)
     torch.set_default_dtype(torch.bfloat16)
     torch.set_default_device('cuda')
+
     torch.cuda.set_device(local_rank)

     return dist.get_rank(), dist.get_world_size(), dist.new_group(list(range(num_local_ranks * num_nodes)))

 

Additionally, as noted earlier, DeepEP leverages NVSHMEM for inter-GPU communication. To ensure each process uses the correct set of Host Channel Adapters (HCAs), we set the NVSHMEM_HCA_LIST environment variable to a comma-separated list of HCAs. For this setting to take effect, the NVSHMEM_ENABLE_NIC_PE_MAPPING variable must also be set to 1.

diff --git a/deep_ep/buffer.py b/deep_ep/buffer.py
index feeb386..d81130e 100644
--- a/deep_ep/buffer.py
+++ b/deep_ep/buffer.py
@@ -72,6 +72,8 @@ class Buffer:
             os.environ['NVSHMEM_IB_ENABLE_IBGDA'] = '1'
             os.environ['NVSHMEM_IBGDA_NIC_HANDLER'] = 'gpu'
             os.environ['NVSHMEM_IBGDA_NUM_RC_PER_PE'] = f'{num_qps_per_rank}'
+            os.environ['NVSHMEM_ENABLE_NIC_PE_MAPPING'] = '1'
+            os.environ['NVSHMEM_HCA_LIST'] = 'mlx5_ib0:1,mlx5_ib1:1,mlx5_ib2:1,mlx5_ib3:1,mlx5_ib4:1,mlx5_ib5:1,mlx5_ib6:1,mlx5_ib7:1'
             # Make sure QP depth is always larger than the number of on-flight WRs, so that we can skip WQ slot check
             os.environ['NVSHMEM_QP_DEPTH'] = '1024'
             # NOTES: NVSHMEM initialization requires at least 256 MiB
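
The HCA list above is specific to the ND H100 v5 topology, where the eight InfiniBand NICs are named mlx5_ib0 through mlx5_ib7. If you prefer not to hard-code the list, it can be derived at runtime from sysfs. The snippet below is a minimal sketch, assuming the standard /sys/class/infiniband layout; filtering on the port link layer keeps only true InfiniBand devices and therefore also skips Ethernet virtual functions such as mlx5_an0, which is discussed in the next section.

import glob
import os

def infiniband_hca_list(port: int = 1) -> str:
    # Enumerate RDMA devices exposed by the kernel and keep only InfiniBand ports,
    # which filters out Ethernet/RoCE virtual functions such as mlx5_an0.
    hcas = []
    for dev_path in sorted(glob.glob("/sys/class/infiniband/*")):
        dev = os.path.basename(dev_path)
        link_layer_file = os.path.join(dev_path, "ports", str(port), "link_layer")
        try:
            with open(link_layer_file) as f:
                if f.read().strip() == "InfiniBand":
                    hcas.append(f"{dev}:{port}")
        except OSError:
            continue
    return ",".join(hcas)

# Example: os.environ['NVSHMEM_HCA_LIST'] = infiniband_hca_list()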

 

AtomicOps Support: A Prerequisite for Running DeepEP

DeepEP requires network controllers to support atomic operations. You may face the following error when running DeepEP if the network controllers do not support atomic operations:

WARN: device mlx5_an0 does not support all necessary atomic operations. You may want to check the PCI_ATOMIC_MODE value in the NIC firmware. Skipping...
/root/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:340: NULL value qp creation failed
/root/nvshmem_src/src/modules/transport/ibrc/ibrc.cpp:1395: non-zero status: 7 ep_create failed
h100jcjah000000:16173:16378 [4] NCCL INFO [Service thread] Connection closed by localRank 3
h100jcjah000000:16171:16390 [2] NCCL INFO [Service thread] Connection closed by localRank 3
h100jcjah000000:16169:16382 [0] NCCL INFO [Service thread] Connection closed by localRank 3
W0425 05:06:08.343000 16157 torch/multiprocessing/spawn.py:169] Terminating process 16169 via signal SIGTERM

In this section, we discuss how to verify that atomic operations are supported by the network controllers so that this failure can be avoided. We can use the following command to get the list of network controllers:

 lspci |grep Mell
0101:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
0102:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
0103:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
0104:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
0105:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
0106:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
0107:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
0108:00:00.0 Infiniband controller: Mellanox Technologies ConnectX Family mlx5Gen Virtual Function
14a7:00:02.0 Ethernet controller: Mellanox Technologies MT28800 Family [ConnectX-5 Ex Virtual Function] (rev 80)

The above output shows that we have eight InfiniBand controllers and one Ethernet controller on the VM. To check whether the Ethernet controller supports atomic operations, we can use the following command:

 lspci -s 14a7:00:02.0 -vvv | grep AtomicOpsCap
                         AtomicOpsCap: 32bit- 64bit- 128bitCAS-

Similarly, for one of the InfiniBand controllers we have:

 lspci -s 0108:00:00.0 -vvv | grep AtomicOpsCap
                         AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+

As can be seen from the outputs, atomic operations are not supported by the Ethernet controller (14a7:00:02.0, AtomicOpsCap: 32bit- 64bit- 128bitCAS-), while they are supported by the InfiniBand controller (0108:00:00.0, AtomicOpsCap: 32bit+ 64bit+ 128bitCAS+). One way to avoid using the Ethernet controller is to provide the list of HCAs through NVSHMEM_HCA_LIST and bypass mlx5_an0, as discussed earlier. Another option is to disable Accelerated Networking (AN) on the VM, which can be done through the Azure portal while the VM is deallocated. Disabling AN is not recommended, though, as it impacts the overall infrastructure, particularly for customers using AMLFS.
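
To check all controllers at once rather than one bus ID at a time, a small script can walk the lspci output. The snippet below is a minimal sketch, assuming lspci is installed, the process has sufficient privileges to read extended capabilities, and that the Mellanox PCI vendor ID 15b3 selects all relevant devices; it simply reports the AtomicOpsCap line for each one.

import re
import subprocess

def atomic_ops_caps() -> dict:
    # Map each Mellanox PCI function to its AtomicOpsCap line from `lspci -vvv`.
    caps = {}
    listing = subprocess.run(["lspci", "-d", "15b3:"], capture_output=True, text=True).stdout
    for line in listing.splitlines():
        bus_id = line.split()[0]
        verbose = subprocess.run(["lspci", "-s", bus_id, "-vvv"],
                                 capture_output=True, text=True).stdout
        match = re.search(r"AtomicOpsCap:.*", verbose)
        caps[bus_id] = match.group(0).strip() if match else "AtomicOpsCap not reported"
    return caps

for bus_id, cap in atomic_ops_caps().items():
    print(f"{bus_id}: {cap}")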

Performance Experiments

After applying the above changes, we obtained the following performance numbers for test_internode.py on two Standard_ND96isr_H100_v5 VMs with 8 processes per node (16 processes in total). This benchmark evaluates the performance of dispatch and combine operations in a multi-node setting, where intranode communication is overlapped with internode communication. Please note that the benchmark reports algorithm bandwidth, so the total time of both communication and computation is included in the performance results. The experimental results on Standard_ND96isr_H100_v5 VMs show that we reach and exceed the performance reported in the DeepEP repository.

Item              Best Reported RDMA BW    Best Reported NVL BW
Dispatch (FP8)    45.9 GB/s                149.82 GB/s
Dispatch (BF16)   60.32 GB/s               196.89 GB/s
Combine           61.34 GB/s               200.22 GB/s
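
For reference, the benchmark is launched once per node. The commands below are a minimal sketch, assuming init_dist() in tests/utils.py reads MASTER_ADDR, WORLD_SIZE (number of nodes), and RANK (node rank) from the environment as in the upstream repository; adjust them to your own cluster settings if you have rewritten that function, as the code comments suggest.

 MASTER_ADDR=<node0-ip> WORLD_SIZE=2 RANK=0 python tests/test_internode.py   # on node 0
 MASTER_ADDR=<node0-ip> WORLD_SIZE=2 RANK=1 python tests/test_internode.py   # on node 1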