As cloud storage demands continue to grow, the need for ultra-fast, reliable networking becomes ever more critical. Microsoft Azure’s journey to empower its storage infrastructure with RDMA (Remote Direct Memory Access) has been transformative, but it’s not without challenges—especially when it comes to congestion control at scale. Azure’s deployment of RDMA at regional scale relies on DCQCN (Data Center Quantized Congestion Notification), a protocol that’s become central to Azure’s ability to deliver high-throughput, low-latency storage services across vast, heterogeneous data center regions.
Why congestion control matters in RDMA networks
RDMA offloads the network stack to NIC hardware, reducing CPU overhead and enabling near line-rate performance. However, as Azure scaled RDMA across clusters and regions, it faced new challenges:
- Heterogeneous hardware: Different generations of RDMA NICs (Network Interface Cards) and switches, each with their own quirks.
- Variable latency: Long-haul links between datacenters introduce large round-trip time (RTT) variations.
- Congestion risks: High-speed, incast-like traffic patterns can easily overwhelm buffers, leading to packet loss and degraded performance.
To address these, Azure needed a congestion control protocol that could operate reliably across diverse hardware and network conditions. Traditional TCP congestion control mechanisms don’t apply here, so Azure leverages DCQCN combined with Priority Flow Control (PFC) to maintain high throughput, low latency, and near-zero packet loss.
How DCQCN works
DCQCN coordinates congestion control using three main entities:
- Reaction point (RP): The sender adjusts its rate based on feedback.
- Congestion point (CP): Switches mark packets using ECN when queues exceed thresholds.
- Notification point (NP): The receiver sends Congestion Notification Packets (CNPs) upon receiving ECN-marked packets.
Together, these three roles form a closed feedback loop that lets RDMA flows adapt their sending rates dynamically, preventing congestion collapse while maintaining fairness:
- When a switch queue exceeds its marking threshold, the switch (CP) marks packets with ECN.
- The receiver NIC (NP) observes the ECN marks and sends CNPs back to the sender.
- The sender NIC (RP) reduces its sending rate upon receiving CNPs; in the absence of CNPs, it gradually increases its rate again.
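To make the RP-side behavior concrete, here is a minimal, simplified Python sketch of the DCQCN rate logic: CNPs trigger a multiplicative rate cut scaled by a congestion estimate (alpha), alpha decays when CNPs stop arriving, and periodic increase events recover the rate toward a remembered target. The class, constants, and event structure are illustrative assumptions, not Azure's NIC firmware interface or production parameters.

```python
# Minimal sketch of a DCQCN reaction point (RP) rate update.
# Names and constants are illustrative placeholders, not Azure's production settings.

class DcqcnSender:
    def __init__(self, line_rate_gbps: float):
        self.rc = line_rate_gbps       # current sending rate (Gbps)
        self.rt = line_rate_gbps       # target rate remembered before the last cut
        self.alpha = 1.0               # running estimate of congestion severity
        self.g = 1.0 / 16              # gain applied when updating alpha
        self.r_ai = 0.05               # additive-increase step (Gbps)

    def on_cnp(self) -> None:
        """A CNP arrived: cut the rate and raise the congestion estimate."""
        self.rt = self.rc                                # remember the pre-cut rate
        self.rc *= 1 - self.alpha / 2                    # multiplicative decrease
        self.alpha = (1 - self.g) * self.alpha + self.g  # congestion looks worse

    def on_alpha_timer(self) -> None:
        """No CNP arrived during the last window: congestion looks better."""
        self.alpha = (1 - self.g) * self.alpha

    def on_increase_event(self, in_fast_recovery: bool) -> None:
        """Periodic rate recovery while CNPs are absent."""
        if not in_fast_recovery:
            self.rt += self.r_ai           # additive increase of the target rate
        self.rc = (self.rt + self.rc) / 2  # move halfway back toward the target


sender = DcqcnSender(line_rate_gbps=100.0)
sender.on_cnp()                                  # receiver reported ECN-marked packets
sender.on_increase_event(in_fast_recovery=True)  # CNPs stopped: recover toward rt
print(f"current rate: {sender.rc:.1f} Gbps")
```

Note that nothing in this update depends on a measured round-trip time; that property underpins the RTT-fairness discussed in the tuning section below.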
Interoperability challenges across different hardware generations
Cloud infrastructure evolves incrementally, typically at the level of individual clusters or racks, as newer server hardware generations are introduced. Within a single region, clusters often differ in their NIC configurations. Azure's deployment includes three generations of commodity RDMA NICs (Gen1, Gen2, and Gen3), each implementing DCQCN with distinct design variations. These discrepancies create complex and often problematic interactions when NICs from different generations interoperate.
- Gen1 NICs: Firmware-based DCQCN, NP-side CNP coalescing, burst-based rate limiting.
- Gen2/Gen3 NICs: Hardware-based DCQCN, RP-side CNP coalescing, per-packet rate limiting.
Problems:
- Gen2/Gen3 NICs sending to Gen1 can trigger excessive cache misses on the Gen1 NIC, slowing down its receiver pipeline.
- Gen1 sending to Gen2/Gen3 can cause excessive rate reductions: because Gen2/Gen3 coalesce CNPs at the RP, their NPs generate a CNP for every ECN-marked packet, flooding Gen1's firmware-based RP with far more CNPs than it expects.
Azure’s solution:
- Move CNP coalescing to NP side for Gen2/Gen3.
- Implement per-QP CNP rate limiting at the NP, with a timer matching Gen1's (a sketch of this logic follows the list).
- Enable per-burst rate limiting on Gen2/Gen3 to reduce cache pressure.
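As a rough illustration of the NP-side change, the sketch below coalesces CNPs per queue pair by enforcing a minimum interval between CNPs for the same QP; the interval value, function names, and callback are hypothetical placeholders rather than the NICs' actual firmware interface.

```python
# Sketch of NP-side CNP coalescing with a per-QP rate limit.
# The interval, names, and callback are hypothetical placeholders.
import time
from collections import defaultdict
from typing import Callable, DefaultDict

CNP_INTERVAL_US = 50.0  # assumed minimum gap between CNPs for one queue pair

_last_cnp_us: DefaultDict[int, float] = defaultdict(lambda: float("-inf"))

def on_ecn_marked_packet(qp_id: int, send_cnp: Callable[[int], None]) -> bool:
    """Emit at most one CNP per QP per CNP_INTERVAL_US, no matter how many
    ECN-marked packets arrive for that QP inside the window."""
    now_us = time.monotonic_ns() / 1_000
    if now_us - _last_cnp_us[qp_id] >= CNP_INTERVAL_US:
        _last_cnp_us[qp_id] = now_us
        send_cnp(qp_id)   # hand one CNP to the transmit path
        return True
    return False          # coalesced: the earlier CNP already covers this window
```

With coalescing done at the receiver and paced by a timer comparable to Gen1's, a Gen1 sender sees the CNP cadence its firmware expects, avoiding the excessive rate reductions described above.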
DCQCN tuning: Achieving fairness and performance
DCQCN's rate adjustment does not depend on round-trip time, which makes it inherently RTT-fair and well suited to Azure's regional networks, where RTTs range from microseconds to milliseconds.
Key Tuning Strategies:
- Sparse ECN marking: use a large gap between the ECN marking thresholds (K_max - K_min) and a low marking probability (P_max) so that marks stay sparse for flows with large RTTs (see the marking sketch after this list).
- Joint buffer and DCQCN tuning: Tune switch buffer thresholds and DCQCN parameters together to avoid premature congestion signals and optimize throughput.
- Global parameter settings: Azure’s NICs support only global DCQCN settings, so parameters must work well across all traffic types and RTTs.
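To show how the marking parameters interact, here is a small sketch of the RED-style ECN marking curve that K_min, K_max, and P_max define at the congestion point; the numeric values are illustrative assumptions, not Azure's switch configuration.

```python
# Sketch of the RED-style ECN marking curve defined by K_min, K_max, and P_max.
# The numeric values are illustrative, not Azure's production switch settings.
import random

K_MIN_KB = 400     # below this queue depth: never mark (assumed value)
K_MAX_KB = 1600    # at or above this queue depth: always mark (assumed value)
P_MAX = 0.05       # marking probability reached at K_MAX_KB (assumed value)

def should_mark_ecn(queue_depth_kb: float) -> bool:
    """Mark with probability rising linearly from 0 at K_min to P_max at K_max."""
    if queue_depth_kb <= K_MIN_KB:
        return False
    if queue_depth_kb >= K_MAX_KB:
        return True
    p = P_MAX * (queue_depth_kb - K_MIN_KB) / (K_MAX_KB - K_MIN_KB)
    return random.random() < p
```

Stretching the gap between K_min and K_max and lowering P_max keeps marks sparse, so a long-RTT flow is not cut repeatedly before its earlier rate reductions have taken effect; and because these thresholds also determine how much buffer a queue can consume before marking kicks in, they have to be tuned together with the switch buffer settings.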
Real-world results
- High throughput & low latency:
- RDMA traffic runs at line rate with near-zero packet loss.
- CPU savings:
- Freed CPU cores can be repurposed for customer VMs or application logic.
- Performance metrics:
- RDMA reduces CPU utilization by up to 34.5% compared to TCP for storage frontend traffic.
- Large I/O requests (1 MB) see up to 23.8% latency reduction for reads and 15.6% for writes.
- Scalability:
- As of November 2025, ~85% of Azure’s traffic is RDMA, supported in all public regions.
Conclusion
DCQCN is a cornerstone of Azure’s RDMA-enabled storage infrastructure, enabling reliable, high-performance cloud storage at scale. By combining ECN-based signaling with dynamic rate adjustments, DCQCN ensures high throughput, low latency, and near-zero packet loss—even across heterogeneous hardware and long-haul links. Its interoperability fixes and careful tuning make it a critical enabler for RDMA adoption in modern data centers, paving the way for efficient, scalable, and resilient cloud storage.