ai infrastructure
108 Topics

Introducing New Performance Tiers for Azure Managed Lustre: Enhancing HPC Workloads
Building on the success of last month’s General Availability (GA) launch, we’re excited to unveil two new performance tiers for Azure Managed Lustre (AMLFS): 40 MB/s per TiB and 500 MB/s per TiB. This blog post explores the specifics of these new tiers and how they embody a customer-centric approach to innovation.

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce job is executed on the NDm_v4 AKS cluster to verify that it is deployed and configured correctly.

Ramp up with me...on HPC: What is high-performance computing (HPC)?
Over the next several months, let’s take a journey together and learn about the different use cases. Join me as I dive into each use case; for some of them, I’ll even try my hand at the workload for the first time. We’ll talk about what went well and what issues I ran into. And maybe you’ll get to hear a little about our customers and partners along the way.

Building resilient networks for AI supercomputers
By Valerie Cutts and Jithin Jose

Last fall we introduced Fairwater, the world’s most powerful AI datacenter. Delivering a system of this scale required rethinking how Azure designs supercomputers, especially the scale-out network. Today, we are sharing more about the networking innovations that have made Fairwater possible. In this post, we share what’s unique about networking at extreme GPU scale and the system-level design choices required to enable large synchronous training jobs to run reliably, even during network failures. We are also publishing, in partnership with others, the open-source Multipath Reliable Connection (MRC) specification and software interfaces, and open-sourcing the associated libraries.

Fault-tolerant scale-out networking

At extreme scale, synchronous training amplifies the impact of routine network faults, turning packet drops, slow links, and partial failures into stalls, restarts, and wasted GPU time. As we describe in our MRC paper, the path forward is to treat failure as normal and design the network as an integrated system that:

- Scales to 100K+ GPUs using a two-level, multi-path topology to enable enough redundancy
- Balances load evenly across the fabric to prevent congestion
- Recovers predictably and gracefully during failures
- Uses less power than three- or four-layer single-plane topologies

As outlined in the Multipath Reliable Connection Specification, Microsoft partnered with AMD, Broadcom, Intel, NVIDIA and OpenAI to jointly address this problem, focusing on changes to transport and network design needed to support training at extreme scale. Instead of relying on lossless fabrics and dynamic routing, we collectively designed, built, and deployed Multipath Reliable Connection (MRC), which draws upon lessons from the Ultra Ethernet Consortium (UEC), and paired it with a multiplane network topology, enabling reliable training jobs even when links, switches, or paths fail.
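To make the "100K+ GPUs with two levels" claim concrete, here is a minimal sizing sketch. It assumes a non-blocking two-tier folded Clos in which each T0 switch splits its ports evenly between NICs and T1 uplinks, and it takes the 512-port 100 GbE switch from Figure 1 in this post. The function names and the 64-port 800 Gbps comparison switch (picked only because it has the same aggregate bandwidth as 512 x 100 Gbps) are illustrative assumptions, not details from the MRC paper.

```python
def two_tier_endpoints(radix: int) -> int:
    """Maximum endpoints in a non-blocking two-tier (T0/T1) folded Clos.

    Assumption: each T0 uses half its ports for NICs and half for T1
    uplinks, and each T1 port connects to a distinct T0, so a plane can
    hold at most `radix` T0 switches.
    """
    if radix <= 0 or radix % 2:
        raise ValueError("radix must be a positive even number")
    nics_per_t0 = radix // 2
    t0_count = radix
    return t0_count * nics_per_t0


def three_tier_endpoints(radix: int) -> int:
    """Maximum endpoints in a classic three-tier fat tree: radix**3 / 4."""
    if radix <= 0 or radix % 2:
        raise ValueError("radix must be a positive even number")
    return radix ** 3 // 4


# One plane built from 512-port 100 GbE switches, as in Figure 1.
per_plane = two_tier_endpoints(512)

# Hypothetical single-plane alternative with 64 x 800 Gbps ports per switch.
single_plane_two_tier = two_tier_endpoints(64)
single_plane_three_tier = three_tier_endpoints(64)

print(f"512-port switches, two tiers:  {per_plane:,} NIC ports per plane")
print(f"64-port switches, two tiers:   {single_plane_two_tier:,} NICs")
print(f"64-port switches, three tiers: {single_plane_three_tier:,} NICs")
```

Under these assumptions, the two-tier result for 512-port switches reproduces the 512 T0s x 256 NICs = 131,072 figure, while the hypothetical 800 Gbps-per-port fabric would need a third tier to grow past a few thousand endpoints.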
The endpoint-driven transport created a simpler, more resilient network that delivers:

- More resilient, predictable training at very large scale: Large training jobs continue making steady forward progress despite routine network faults, reducing stalls and restarts and improving time-to-train as cluster size increases.
- Better utilization of expensive GPU infrastructure: By avoiding tail latency amplification and repeated recovery cycles, GPUs spend more time doing useful work instead of waiting on synchronization or replaying lost computation, improving overall efficiency and cost effectiveness.
- Automatic adaptation at machine timescales: Failure detection, load balancing, and recovery happen fast enough to keep up with the rate and complexity of faults in 100K+ GPU systems, well beyond what manual intervention or control-plane convergence can achieve, allowing the system to remain stable as scale increases.

In the Fairwater supercomputer, enabling graceful degradation in the scale-out network improves training throughput versus traditional transports and architectures. In combination with a multi-plane topology design, MRC increases the time that installed NVIDIA GPUs perform useful computation.

A shift in philosophy: End-to-end control

The central design decision behind MRC is to shift responsibility for load balancing and failure handling from complex network switch control planes to the network endpoints with end-to-end controls. The network endpoint controls the path selection and can optimally use a set of paths based on feedback from the network. MRC extends the RoCE Reliable Connection (RC) transport to support true multipath operation. Instead of binding a queue pair to a single path, MRC sprays packets across many paths simultaneously, making performance far less sensitive to any single slow or failed link. Several design elements are critical to enable end-to-end control:
- Every packet carries enough information for the receiver to place it directly into memory, even if packets arrive out of order.
- Selective acknowledgments enable rapid retransmission of only the packets that were lost.
- Packet trimming signals network congestion swiftly without forcing full packet drops, enabling efficient congestion control.
- MRC disables Priority Flow Control (PFC) entirely and runs Ethernet in best-effort mode, avoiding global pauses that can devastate tail latency or lead to fabric-wide deadlock behavior.
- The system enables seamless self-recovery from network hardware failures.

The result is a transport protocol that expects loss, adapts quickly, and continues making progress even when parts of the fabric misbehave.

Rethinking topology: Multi-plane design

Transport alone is not enough. To complement MRC, we implemented a two-tier, multiplane network topology in Fairwater using high-radix switches. Our network design splits each NIC into multiple lower-speed ports (i.e., eight x 100 Gbps) and builds multiple parallel network planes. This multi-plane design enables a more compact topology than a traditional three-tier Clos network running at 800 Gbps per port. Our two-tier multi-plane topology design offers several advantages:

- Connects 100K+ GPUs with just two tiers of network.
- Lower latency, since packets traverse fewer switches.
- Reduced impact of network issues on overall job completion: although a single switch connects to more servers, spreading a failure across them reduces the impact on each one, which lowers the performance impact on the overall job.
- Reduced hardware and power costs compared to designs with an additional network layer, without compromising on GPU scale.

Most importantly, the network becomes more tolerant of partial failures, so jobs continue with slightly reduced bandwidth rather than failing outright.

Multiplane networks work efficiently only when traffic is evenly distributed across all planes and paths.
This is where MRC’s packet spraying and path-aware congestion response are essential.

Figure 1: Example of a two-tier multi-plane topology using 512 x 100 GbE switches: 512 T0s x 256 NICs = 131,072 NICs

Static SRv6: Fewer moving parts, more predictability

In many data center networks, switches rely on Border Gateway Protocol (BGP) or other dynamic routing protocols. We removed dynamic routing from our design; instead, packets are source-routed using IPv6 Segment Routing (SRv6). Each packet encodes the end-to-end network path using compact microsegment IDs (uSIDs).

At first glance, static routing seems counterintuitive in failure-prone environments. At extreme scale, however, dynamic routing is more of a liability than an asset. If two or more switches try to reroute packets at the same time, network behavior becomes unpredictable and harder to diagnose. Interactions between adaptive routing and adaptive transport can be hard to resolve and harder still to debug at larger scale.

MRC, on the other hand, handles path health and rapid failover at the transport layer. Static routing enables precise health feedback for each network path from a given network endpoint. Because probe (test) packets follow the same paths as data packets, operators gain accurate, ground-truth insight into fabric health without depending on switch control planes, which are themselves a common source of failures. Additionally, SRv6 routing allows network operators to utilize out-of-band monitoring frameworks to accurately identify link failures and device faults, which has been particularly valuable in managing large-scale AI clusters. Static SRv6 ensures paths are deterministic, making problems easier to reproduce and debug and, ultimately, making the network more stable over time.

Failure as a normal operating condition

In production, failures are expected by design: link flaps, partial failures, and even switch reboots are routine at this scale.
With MRC, many of these events no longer impact training workloads. Repair actions proceed in parallel, while MRC dynamically routes around failed paths. As repairs complete, MRC discovers and validates the restored paths before seamlessly reintegrating them, entirely transparent to the training application. In summary:

- Systems degrade gracefully
- Losing a NIC port reduces available bandwidth in proportion to the lost port but does not crash jobs
- Flapping T0–T1 links often go unnoticed by applications
- Switches can be rebooted without coordinated drain or rerouting of the system

For massive-scale training runs, this translates into higher effective uptime, fewer interrupted jobs, and more training throughput.

Figure 2: Bidirectional bandwidth measurements with point-to-point RDMA Perftest while a T0 switch was taken down. Results indicate that the overall bandwidth dropped in proportion to the T0 switch bandwidth, without failing the job.

Figure 3: Bidirectional bandwidth measurements with RDMA Perftest while a T1 switch is failed and restored. Results indicate no impact on performance, as MRC was able to route around the bad switch.

Measured results at scale

This is not just a thought exercise: Microsoft and OpenAI have both run extensive experiments and record-scale training jobs, showing only brief, bounded performance dips during significant network faults, followed by rapid recovery. Microbenchmarks demonstrate near line-rate bandwidth and predictable latency, even under injected loss. OpenAI describes their scale results in a recent blog post, consistent with what we observe. Taken together, multi-plane MRC with SRv6 delivers better load balancing with fewer queue pairs and substantially higher resilience to packet loss, enabling millions of networking links to connect hundreds of thousands of GPUs.
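The proportional degradation described in this section (losing a NIC port reduces bandwidth in proportion to the lost port) can be illustrated with simple arithmetic. The sketch below assumes the eight x 100 Gbps port split mentioned earlier and that traffic is spread evenly over the remaining healthy ports. The function name is hypothetical, and the numbers are an idealized model, not measured results.

```python
def remaining_bandwidth_gbps(ports: int, port_gbps: float, failed_ports: int) -> float:
    """Idealized NIC bandwidth after some ports (planes) have failed.

    Assumption: traffic is sprayed evenly across all healthy ports, so the
    usable bandwidth is simply the sum of the surviving port speeds.
    """
    if not 0 <= failed_ports <= ports:
        raise ValueError("failed_ports must be between 0 and ports")
    return (ports - failed_ports) * port_gbps


PORTS = 8          # NIC split into eight ports, one per plane
PORT_GBPS = 100.0  # each port runs at 100 Gbps
FULL = PORTS * PORT_GBPS

for failed in range(3):
    bw = remaining_bandwidth_gbps(PORTS, PORT_GBPS, failed)
    print(f"{failed} failed port(s): {bw:.0f} Gbps ({bw / FULL:.1%} of full)")
```

In this model, losing one of eight ports leaves 700 Gbps, or 87.5% of the NIC's bandwidth, so the job slows slightly instead of failing.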
Figure 4: NCCL Send-Recv benchmark results with 42,020 GPUs, each with an 800 Gbps MRC NIC, showing up to 92% of theoretical peak bandwidth for large message sizes.

What this enables

Taken together, MRC, multiplane topologies, and static SRv6 form a coherent strategy for building AI supercomputer scale-out networks that keep large synchronous training jobs moving forward under real-world fault conditions. Instead of treating loss, link flaps and partial failures as events that trigger stalls or restarts, the system is designed to fail gracefully and reach 100K+ GPU scale at high utilization. This design approach has been deployed in Fairwater and elsewhere to train state-of-the-art models, where the result is more predictable performance for large jobs with higher effective GPU utilization. The core takeaway is simple: by assuming failures will happen and designing for them explicitly, events that would otherwise be catastrophic become minor, manageable perturbations.

Join us in advancing resilient AI infrastructure

To help the broader ecosystem adopt these capabilities, Microsoft is joining key partners in releasing the MRC specification to the Open Compute Project and open-sourcing key components:

- libMRC: MRC transport APIs
- NCCL MRC plugin: enables NCCL to run over MRC transport
- MRC shim library: enables compatible verbs applications to run over MRC with no code changes
- MSCCL++ with MRC support: MSCCL++ library with MRC support
- SONiC SRv6: enhances SRv6 with an open NOS for high-performance AI Ethernet

We encourage others to review these public contributions, share feedback, and ultimately adopt these capabilities within the broader ecosystem of AI networking products, infrastructures, and workloads.

Acknowledgements

Advancing AI at this scale requires collaboration across the industry. At Microsoft, we value our partnerships with AMD, Broadcom, Intel, NVIDIA, and OpenAI, and our shared commitment to continuing to evolve MRC alongside the broader community.
References:

- MRC paper: Resilient AI Supercomputer Networking using MRC and SRv6
- Multipath Reliable Connection Specification
- OpenAI MRC blog
- AMD MRC blog
- Broadcom MRC blog
- NVIDIA MRC blog
- libMRC APIs
- microsoft/mrc-verbs-shim-lib: shim library to translate ibverbs to libmrc interfaces
- microsoft/mrc-nccl-plugin: MRC plugin for NCCL
- microsoft/mscclpp: MSCCL++: A GPU-driven communication stack for scalable AI applications (with MRC support)

Simplify troubleshooting at scale - Centralized Log Management for CycleCloud Workspace for Slurm
Training large AI models on hundreds or thousands of nodes introduces a critical operational challenge: when a distributed job fails, identifying the root cause across scattered logs can become incredibly time-consuming. This manual process delays recovery and reduces cluster utilization. The ability to quickly parse centralized cluster logs from a single interface is critical to ensure that job failure root causes are swiftly identified and mitigated, maintaining high cluster utilization.

Solution Architecture

This is a turnkey, customizable log forwarding solution for CycleCloud Workspace for Slurm that centralizes all cluster logs into Azure Monitor Log Analytics. The architecture uses Azure Monitor Agent (AMA), deployed on every VM and Virtual Machine Scale Set (VMSS), to stream logs defined by Data Collection Rules (DCRs) to dedicated tables in a Log Analytics workspace, where they can be queried from a single interface.

The turnkey solution captures three categories of logs essential for troubleshooting distributed workloads, and can be extended to other logs:

- Slurm logs, including slurmctld, slurmd, etc., plus archived job artifacts (job submission scripts, environment variables, stdout/stderr) collected via prolog/epilog scripts.
- Infrastructure logs from CycleCloud, including the CycleCloud Healthagent, which automatically tests nodes for hardware health and drains nodes that fail tests.
- Operating system logs from syslog and dmesg, capturing kernel events, network state changes, and hardware issues.

Each log source flows through its own DCR into a dedicated table following a consistent schema. The solution automatically associates scheduler-specific DCRs with the Slurm scheduler node and compute-specific DCRs with compute nodes, handling dynamic node scaling transparently. The solution is purpose-built for CycleCloud Workspace for Slurm, but designed in a modular fashion so that it can be easily extended with new data sources (e.g., new log formats) and processing (e.g., Data Collection Rules) to support log forwarding and analysis of other required logs.

Key Benefits

- Time-series correlation: Azure Monitor's time-based indexing enables rapid identification of cascading failures. For example, trace a network carrier flap detected in syslog to the corresponding slurmd communication errors and then to specific job failures, all within seconds.
- Centralized visibility: Query logs from thousands of nodes through a single interface instead of SSH-ing to individual machines. Correlate Slurm controller decisions with node-level errors and system events in one query.
- Log persistence: Logs survive node deallocations and reimaging, which is critical in cloud environments where compute nodes are ephemeral.
- Powerful query language: KQL (Kusto Query Language) allows parsing raw logs into structured fields, filtering across multiple sources, and building operational dashboards. Example queries detect patterns like repeated job failures, network instability, or resource exhaustion.
- Production-ready scalability: User-assigned managed identities automatically propagate to new VMSS instances, and DCR associations handle thousands of nodes without manual configuration.

Getting Started

The complete solution is available on GitHub (slurm-log-collection) with deployment scripts that:

- Create all required Log Analytics tables
- Deploy pre-configured DCRs for Slurm, CycleCloud, and OS logs
- Automatically associate DCRs with scheduler and compute resources

After you configure the environment variables and run the setup scripts, logs begin flowing to Azure Monitor and should populate within 15 minutes, although normal log ingestion latency is about 30 seconds to 3 minutes. The repository includes sample KQL queries for common troubleshooting scenarios to accelerate time-to-resolution, as well as queries for non-troubleshooting analysis of cluster usage.

Running GPU accelerated workloads with NVIDIA GPU Operator on AKS
This article focuses on how best to manage and configure NVIDIA GPUs on Azure Kubernetes Service using the NVIDIA GPU Operator, for HPC/AI workloads that require a high degree of customization and granular control over compute resource configuration.

Azure announces new AI optimized VM series featuring AMD’s flagship MI300X GPU
In our relentless pursuit of pushing the boundaries of artificial intelligence, we understand that cutting-edge infrastructure and expertise are needed to harness the full potential of advanced AI. At Microsoft, we've amassed a decade of experience in supercomputing and have consistently supported the most demanding AI training and generative inferencing workloads. Today, we're excited to announce the latest milestone in our journey. We’ve created a virtual machine (VM) with an unprecedented 1.5 TB of high bandwidth memory (HBM) that leverages the power of AMD’s flagship MI300X GPU. Our Azure VMs powered by the MI300X GPU give customers even more choices for AI-optimized VMs.