
Azure Compute Blog

Power stabilization for AI training datacenters

BrijeshW
Microsoft
Oct 13, 2025

Microsoft has been partnering closely with OpenAI and NVIDIA to deliver some of the largest supercomputers, which have served as the foundation for the development of ChatGPT. Operationalizing these massive-scale supercomputers has required addressing several complex technical problems, one of which is stabilizing the power load on the utility.

In our recent paper “Power Stabilization for AI Training Datacenters” (now on arXiv), we introduce the problem with production data and explore innovative solutions to address it. This post highlights the key ideas from the paper in plain language and shares why this problem is central to the next generation of AI infrastructure.

The challenge: Power swings at hyperscale

Modern AI training jobs span tens of thousands of GPUs, operating in tightly synchronized iterations. Each iteration includes a compute-heavy phase, where GPUs process local data, followed by a communication-heavy phase, where they synchronize globally. This pattern creates dramatic swings in power consumption—surges during computation and dips during communication.

Figure 1: Power readings from an at-scale training job on DGX-H100 racks
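To build intuition for the shape of this waveform, here is a purely illustrative sketch (not taken from the paper) that models a fleet of racks alternating between a compute-heavy and a communication-heavy power level. The rack power levels, iteration length, and fleet size are made-up assumptions.

```python
# Toy model of a synchronized training job's power draw. All values are
# illustrative assumptions, not measurements from the production trace above.
import numpy as np

def job_power(t, p_compute=120.0, p_comms=60.0, iteration_s=2.0, compute_frac=0.75):
    """Per-rack power (kW) at time t (s): high during compute, low during comms."""
    phase = (t % iteration_s) / iteration_s
    return np.where(phase < compute_frac, p_compute, p_comms)

fs = 100.0                                  # sample rate (Hz)
t = np.arange(0.0, 60.0, 1.0 / fs)          # 60 seconds of samples
racks = 1000                                # hypothetical deployment size
total_mw = racks * job_power(t) / 1000.0    # fleet-level power in MW

print(f"swing: {total_mw.min():.0f} MW to {total_mw.max():.0f} MW each iteration")
```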

As training jobs scale, the amplitude of these power swings grows with the size of the deployment. But it's not just the amplitude that matters: the frequency spectrum of these swings can align with critical frequencies in utility grids, posing a risk of physical damage to infrastructure. To continue scaling AI training workloads safely, we therefore need to stabilize their power draw to ensure grid stability and operational continuity.

Figure 2: Frequency components of the power waveform shown in Figure 1.
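The same toy trace can be pushed through an FFT to see where its energy concentrates. In the sketch below, `total_mw` and `fs` come from the sketch above, and the 0.1 to 20 Hz band is only an example of the kind of range a utility might flag.

```python
# Sketch: inspect the frequency content of a fluctuating power trace.
spectrum = np.abs(np.fft.rfft(total_mw - total_mw.mean()))
freqs = np.fft.rfftfreq(total_mw.size, d=1.0 / fs)

band = (freqs >= 0.1) & (freqs <= 20.0)     # example band of concern
peak = freqs[band][np.argmax(spectrum[band])]
print(f"dominant oscillation in the 0.1-20 Hz band: {peak:.2f} Hz")
```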

Requirements imposed on the solution

Addressing the challenge and ensuring grid stability requires that we meet the following utility-level requirements and specifications.

Time-Domain Specs

Ramp rate: Limits how quickly a datacenter may increase (ramp up) or decrease (ramp down) its power draw, measured in megawatts per second.

Dynamic power range: Defines how much short-term fluctuation in power is acceptable.

Frequency-Domain Specs

Oscillations: Limits on the magnitude of oscillations within critical frequency ranges (e.g., 0.1–20 Hz). These limits can vary depending on the utility and its power generation equipment.
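To make the shape of these requirements concrete, here is a minimal sketch of how a power trace could be screened against such limits. The numeric limits (5 MW/s of ramp, a 20 MW dynamic range, 2 MW of oscillation in the band) are placeholders invented for the example; real values are set per utility.

```python
import numpy as np

def check_limits(power_mw, fs, max_ramp_mw_per_s=5.0, max_range_mw=20.0,
                 band_hz=(0.1, 20.0), max_band_mw=2.0):
    """Screen a power trace (MW, sampled at fs Hz) against placeholder limits."""
    ramp = np.abs(np.diff(power_mw)) * fs                    # MW/s between samples
    swing = power_mw.max() - power_mw.min()                  # short-term dynamic range, MW
    amp = 2.0 * np.abs(np.fft.rfft(power_mw - power_mw.mean())) / power_mw.size
    freqs = np.fft.rfftfreq(power_mw.size, d=1.0 / fs)
    in_band = (freqs >= band_hz[0]) & (freqs <= band_hz[1])
    return {
        "ramp_ok": ramp.max() <= max_ramp_mw_per_s,
        "range_ok": swing <= max_range_mw,
        "oscillation_ok": amp[in_band].max() <= max_band_mw,
    }
```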

Solution: A multi-pronged power stabilization strategy

The power stabilization solution must meet the above requirements, ideally without adding cost, wasting energy, or reducing performance. With this in mind, we explored three main strategies.

Software-Only Mitigation

This approach introduces supplementary and controlled workloads, such as sequences of matrix multiplications, during periods of reduced activity to maintain consistent power consumption. It offers flexibility and can be rapidly deployed without the need for hardware modifications. However, it may lead to energy inefficiency, has the potential to slightly reduce the performance of primary AI workloads, and requires thorough calibration as well as ongoing monitoring.
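As a rough illustration of the idea (using PyTorch only for the sake of the example; the production mechanism may look quite different), a worker could launch throwaway matrix multiplications whenever it is waiting on communication:

```python
import threading
import torch

def power_filler(stop_event: threading.Event, size: int = 4096, device: str = "cuda"):
    """Illustrative filler: run dummy matmuls until the collective completes."""
    a = torch.randn(size, size, device=device)
    b = torch.randn(size, size, device=device)
    while not stop_event.is_set():      # set by the training loop once comms finish
        _ = torch.matmul(a, b)          # result is discarded; only the power draw matters
    torch.cuda.synchronize()
```

The matrix size and the signaling mechanism here are assumptions; in practice the filler must be calibrated so its power draw matches the dip it compensates for, which is exactly the calibration and monitoring burden noted above.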

Hardware-Based Power Smoothing

We co-developed a power smoothing feature with NVIDIA that was introduced in their GB200 GPUs. The feature enforces a minimum power threshold and regulates power fluctuations. This approach offers consistent performance with negligible impact on AI workloads while also reducing resource utilization overhead. However, it consumes additional energy, and existing hardware may not satisfy the most stringent grid requirements.
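The behavior can be approximated with a toy model that combines a power floor with a bound on how fast power may change between samples. The parameter values below are illustrative and are not the actual settings of the GB200 feature.

```python
def smooth_power(raw_kw, floor_kw=60.0, max_step_kw=5.0):
    """Toy model: clamp power to a floor and limit the per-sample ramp up/down."""
    out, prev = [], raw_kw[0]
    for p in raw_kw:
        target = max(p, floor_kw)                              # never drop below the floor
        step = max(-max_step_kw, min(max_step_kw, target - prev))
        prev += step                                           # bounded ramp
        out.append(prev)
    return out
```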

Rack-Level Energy Storage

Installing batteries or capacitors at the rack level allows surplus energy to be stored during dips in workload demand and discharged during peaks, so the utility sees a far steadier load. This approach enhances energy efficiency by minimizing waste, reduces peak power requirements, and plays an important role in supporting grid stability. However, it comes with high implementation costs and may require substantial rack space, especially when large, rapid fluctuations in power demand must be absorbed.
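A minimal sketch of the concept, assuming an idealized lossless battery: the rack tries to draw a constant target power from the grid, the battery absorbs or supplies the difference, and the raw load passes through only when the battery would run empty or full. The capacities and target draw are made-up numbers.

```python
def storage_smoothing(load_kw, grid_target_kw, capacity_kwh=5.0, soc_kwh=2.5, dt_s=0.01):
    """Return the power (kW) actually drawn from the grid at each time step."""
    grid_trace, soc = [], soc_kwh
    for p in load_kw:
        delta_kwh = (grid_target_kw - p) * dt_s / 3600.0      # + charges, - discharges
        if 0.0 <= soc + delta_kwh <= capacity_kwh:
            soc += delta_kwh
            grid_trace.append(grid_target_kw)                 # battery covers the difference
        else:
            grid_trace.append(p)                              # battery empty/full: load passes through
    return grid_trace
```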

Combined Approach

An optimal approach could combine the best of these solutions: rely on energy storage for steady-state fluctuations, and fall back to GPU-level smoothing (both software and hardware) for ramp-up/ramp-down events and corner cases where the energy storage runs out of capacity. This approach would address all requirements while keeping costs low and limiting the impact on performance.
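One way to express that division of labor is a simple dispatch policy: storage handles routine iteration-to-iteration swings, and GPU-level smoothing takes over for large ramps or when the battery approaches its limits. The thresholds below are assumptions for illustration, not values from the paper.

```python
def choose_mitigation(soc_frac, ramp_mw_per_s, ramp_limit_mw_per_s=5.0,
                      soc_low=0.1, soc_high=0.9):
    """Pick which mechanism should absorb the current power fluctuation."""
    if abs(ramp_mw_per_s) > ramp_limit_mw_per_s:
        return "gpu_smoothing"          # job-wide ramp up/down: shape it at the GPUs
    if soc_low < soc_frac < soc_high:
        return "energy_storage"         # normal steady-state swings
    return "gpu_smoothing"              # storage nearly empty/full: GPUs take over
```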

 

Figure 3: Training run with GPU-level smoothing and energy storage together delivering a clean load to the utility.

Conclusion

As AI training workloads become larger and more complex, the challenges of managing power fluctuations and their impact on the power grid will intensify. This blog and our paper describe a cross-stack approach combining software-based power smoothing, GPU-level controls, and rack-level energy storage, which together offer practical and immediate relief for today's deployments. However, this work is just the beginning. Ensuring that AI infrastructure remains both performant and grid-safe will require sustained collaboration across research, industry, and utilities.

In the context of the broader industry collaboration initiated by Google, Meta, and Microsoft, we will be partnering with many other industry players and utilities to define a common set of standards and requirements for large-scale AI training power stabilization.

 

Stay tuned for more updates from the summit, and explore the full paper on arXiv.
