Blog Post

Azure Infrastructure Blog
3 MIN READ

RAIDDR: Redefining Memory Reliability

TerryGrunzke's avatar
TerryGrunzke
Icon for Microsoft rankMicrosoft
Jan 20, 2026

Microsoft-developed error correction architecture reduces overhead while delivering cloud scale reliability

Introduction

As datacenters scale to support modern digital life, so does the challenge of keeping memory reliable. Even very rare DRAM faults can translate into an unacceptable number of uncorrectable or silent errors at scale, making robust error correction (ECC) a must. Yet, traditional ECC incurs increased cost, power, and memory footprints, all of which challenge cloud scalability and sustainability. Enter RAIDDR (Redundant Array of Independent Disks for Double Data Rate), Microsoft’s innovative ECC architecture, designed to meet these challenges with a 50 percent reduction in overhead.

The Problem with Traditional ECC

Historically, hyperscale memory reliability has relied on Reed-Solomon and other legacy ECC schemes. While effective, these methods come at a cost: As shown in the first slide below there is up to ~30% memory overhead across hyperscale fleets. As new memory technologies and advanced SoCs emerge, traditional ECC methods struggle to scale due to high reliability requirements, power requirements, meta data requirements, and on-die correction mechanisms that limit flexibility for cloud providers.  As shown in second slide below, the current ECC solutions for x8 devices (e.g. LPDDR5X in byte mode) do not provide cloud-level reliability. In addition, performance requirements for new memories could require even more overhead.

 

 

 

 

RAIDDR Architecture

RAIDDR flips the ECC paradigm by enabling additional error correction on the host’s memory controller.  The controller handles symbol-based correction using a mix of parity, CRC and BCH, inspired by RAID techniques from storage. RAIDDR maximizes the number of correctable failures per device, ensuring robust protection while reducing overheads. This host-centric approach makes RAIDDR adaptable for the next generation of memory technology.

 

 

Variants: Basic vs. Enhanced RAIDDR

Not all deployments require the same level of integration. Basic RAIDDR relies on CRC, may not require additional bits from each die, and works seamlessly with standard DIMMs, supporting a broad array of existing hardware. Enhanced RAIDDR, however, goes a step further—by leveraging access of additional bits available from the device, it pushes reliability and efficiency even further by using BCH to correct additional single bit failures. Enhanced RAIDDR can achieve the reliability requirements of general-purpose cloud providers with lower memory overhead compared to traditional approaches, giving cloud architects flexibility in balancing cost and performance.

Deployment in Azure Silicon

Microsoft has already begun integrating RAIDDR into its Azure silicon stack. Developed by Microsoft with Cadence IP and compatibility with LPDDR5X, RAIDDR is ready for deployment across new platforms.

Open Licensing and Ecosystem Adoption

To accelerate wide adoption of RAIDDR as a standard for memory reliability across the industry, Microsoft has released RAIDDR under the OWF CLA 0.9 open licensing model. With transparent licensing and collaborative development, we encourage broad industry engagement, from silicon and IP vendors to system integrators and cloud builders.  RAIDDR is well positioned to become a standard for memory reliability across, aligning with memory and SoC partner usage.

Technical Deep Dive

For engineers eager for details, RAIDDR’s encoding and decoding flows are meticulously documented (https://github.com/microsoft/BasicRAIDDR). Correction logic is designed to address a wide spectrum of failure scenarios, from single-bit errors to multi-device faults. Extensive simulations and benchmarks demonstrate RAIDDR’s ability to match ECC overhead vs. reliability requirements. RAIDDR can be implemented with a fraction of the gates used in traditional methods at similar or lower latency.

 

 

Conclusion

RAIDDR stands as a transformative leap in memory reliability for hyperscale environments. It delivers robust error correction with a fraction of the traditional overhead, reducing costs and power while unlocking new efficiencies for hyperscale clouds. Looking ahead, RAIDDR’s architecture lays a foundation for next-generation memory. We invite engineers, partners, and the broader ecosystem to join us in shaping the future of cloud memory reliability.

 

Published Jan 20, 2026
Version 1.0
No CommentsBe the first to comment