azure hardware infrastructure
25 Topics

Announcing Kubernetes Center (Preview) On Azure Portal
Today, we’re excited to introduce the Kubernetes Center in the Azure portal, a new experience to simplify how customers manage, monitor, and optimize Azure Kubernetes Service (AKS) environments at scale. The Kubernetes Center provides a unified view across all clusters, intelligent insights, and streamlined workflows that help platform teams stay in control while enabling developers to move fast.

As Kubernetes adoption accelerates, many teams face growing challenges in managing clusters and workloads at scale. Getting a quick snapshot of what needs attention across clusters and workloads can quickly become overwhelming. Kubernetes Center is designed to change that: it brings the most critical Kubernetes capabilities into a single pane of glass for unified visibility and control.

What is Kubernetes Center?

Actionable insights from the start: Kubernetes Center surfaces key issues like security vulnerabilities, cluster alerts, compliance gaps, and upgrade recommendations in a single, unified view. This helps teams focus immediately on what matters most, leading to faster resolution times, improved security posture, and greater operational clarity.

Streamlined management experience: By bringing together AKS, AKS Automatic, Fleet Manager, and Managed Namespaces into a single experience, we’ve reduced the need to jump between services. Everything you need to manage Kubernetes on Azure is now organized in one consistent interface.

Centralized Quickstarts: Whether you’re getting started or troubleshooting advanced scenarios, Kubernetes Center brings relevant documentation, learning resources, and in-context help into one place so you can spend less time searching and more time building.
Figure: Azure Portal, from distinct landing experiences for AKS, Fleet Manager, and Managed Kubernetes Namespaces to a streamlined management experience. Get the big picture at a glance, then dive deeper with individual pages designed for effortless discovery.

Figure: Centralized Quickstarts

Next Steps

Build on your momentum by exploring Kubernetes Center. Create your first AKS cluster or deploy your first application using the Deploy Your Application flow and track your progress in real time, or check out the new experience and instantly see your existing clusters in a streamlined management experience. Your feedback will help shape what comes next. Start building today with Kubernetes Center on Azure Portal!

Learn more: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

FAQ

What products from Azure are included in Kubernetes Center?
A. Kubernetes Center brings together all your Azure Kubernetes resources such as AKS, AKS Automatic, Fleet Manager, and Managed Namespaces into a single interface for simplified operations. Create new resources or view your existing resources in Kubernetes Center.

Does Kubernetes Center handle multi-cluster management?
A. Kubernetes Center provides a unified interface (a single pane of glass) to view and monitor all your Kubernetes resources in one place. For multi-cluster operations like upgrading the Kubernetes version, placing cluster resources on N clusters, policy management, and coordination across environments, Kubernetes Fleet Manager is the solution designed to handle that complexity at scale. It enables teams to manage clusters at scale with automation, consistency, and operational control.

Does Kubernetes Center provide security and compliance insights?
A. Absolutely. When Microsoft Defender for Containers is enabled, Kubernetes Center surfaces critical security vulnerabilities and compliance gaps across your clusters.

Where can I find help and documentation?
A. All relevant documentation, Quickstarts, and learning resources are available directly within Kubernetes Center, making it easier to get support without leaving the platform. For more information: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

What is the status of this launch?
A. Kubernetes Center is currently in preview, offering core capabilities with more features planned for the general availability release.

What is the roadmap for GA?
A. Our roadmap includes adding new features and introducing tailored views designed for Admins and Developers. We also plan to enhance support for multi-cluster capabilities in Azure Fleet Manager, enabling smoother and more efficient operations within the Kubernetes Center.

3.4K Views, 10 likes, 0 Comments

Announcing Cobalt 200: Azure’s next cloud-native CPU
By Selim Bilgin, Corporate Vice President, Silicon Engineering, and Pat Stemen, Vice President, Azure Cobalt

Today, we’re thrilled to announce Azure Cobalt 200, our next-generation Arm-based CPU designed for cloud-native workloads. Cobalt 200 is a milestone in our continued approach to optimize every layer of the cloud stack from silicon to software. Our design goals were to deliver full compatibility for workloads using our existing Azure Cobalt CPUs, deliver up to 50% performance improvement over Cobalt 100, and integrate with the latest Microsoft security, networking and storage technologies. Like its predecessor, Cobalt 200 is optimized for common customer workloads and delivers unique capabilities for our own Microsoft cloud products. Our first production Cobalt 200 servers are now live in our datacenters, with wider rollout and customer availability coming in 2026.

Figure: Azure Cobalt 200 SoC and platform

Building on Cobalt 100: Leading Price-Performance

Our Azure Cobalt journey began with Cobalt 100, our first custom-built processor for cloud-native workloads. Cobalt 100 VMs have been Generally Available (GA) since October of 2024 and availability has expanded rapidly to 32 Azure datacenter regions around the world. In just one year, we have been blown away by the pace at which customers have adopted the new platform and migrated their most critical workloads to Cobalt 100 for the performance, efficiency, and price-performance benefits.

Cloud analytics leaders like Databricks and Snowflake are adopting Cobalt 100 to optimize their cloud footprint. The compute performance and energy-efficiency balance of Cobalt 100-based virtual machines and containers has proven ideal for large-scale data processing workloads. Microsoft’s own cloud services have also rapidly adopted Azure Cobalt for similar benefits. Microsoft Teams achieved up to 45% better performance using Cobalt 100 than their previous compute platform.
This increased performance means fewer servers are needed for the same task; for instance, Microsoft Teams media processing uses 35% fewer compute cores with Cobalt 100.

Designing Compute Infrastructure for Real Workloads

With this solid foundation, we set out to design a worthy successor: Cobalt 200. We faced a key challenge: traditional compute benchmarks do not represent the diversity of our customer workloads. Our telemetry from the wide range of workloads running in Azure (small microservices to globally available SaaS products) did not match common hardware performance benchmarks. Existing benchmarks tend to skew toward CPU core-focused compute patterns, leaving gaps in how real-world cloud applications behave at scale when using network and storage resources. Optimizing Azure Cobalt for customer workloads requires us to expand beyond these CPU core benchmarks to truly understand and model the diversity of customer workloads in Azure.

As a result, we created a portfolio of benchmarks drawn directly from the usage patterns we see in Azure, including databases, web servers, storage caches, network transactions, and data analytics. Each of our benchmark workloads includes multiple variants for performance evaluation based on the ways our customers may use the underlying database, storage, or web serving technology. In total, we built and refined over 140 individual benchmark variants as part of our internal evaluation suite.

With the help of our software teams, we created a complete digital twin simulation from the silicon up: beginning with the CPU core microarchitecture, fabric, and memory IP blocks in Cobalt 200, all the way through the server design and rack topology. Then, we used AI, statistical modelling and the power of Azure to model the performance and power consumption of the 140 benchmarks against 2,800 combinations of SoC and system design parameters: core count, cache size, memory speed, server topology, SoC power, and rack configuration.
This resulted in the evaluation of over 350,000 configuration candidates of the Cobalt 200 system as part of our design process. This extensive modelling and simulation helped us to quickly iterate to find the optimal design point for Cobalt 200, delivering over 50% increased performance compared to Cobalt 100, all while continuing to deliver our most power-efficient platform in Azure.

Cobalt 200: Delivering Performance and Efficiency

At the heart of every Cobalt 200 server is the most advanced compute silicon in Azure: the Cobalt 200 System-on-Chip (SoC). The Cobalt 200 SoC is built around the Arm Neoverse Compute Subsystems V3 (CSS V3), the latest performance-optimized core and fabric from Arm. Each Cobalt 200 SoC includes 132 active cores with 3MB of L2 cache per core and 192MB of L3 system cache to deliver exceptional performance for customer workloads.

Power efficiency is just as important as raw performance. Energy consumption represents a significant portion of the lifetime operating cost of a cloud server. One of the unique innovations in our Azure Cobalt CPUs is individual per-core Dynamic Voltage and Frequency Scaling (DVFS). In Cobalt 200 this allows each of the 132 cores to run at a different performance level, delivering optimal power consumption no matter the workload. We are also taking advantage of the latest TSMC 3nm process, further improving power efficiency.

Security is top-of-mind for all of our customers and a key part of the unique innovation in Cobalt 200. We designed and built a custom memory controller for Cobalt 200, so that memory encryption is on by default with negligible performance impact. Cobalt 200 also implements Arm’s Confidential Compute Architecture (CCA), which supports hardware-based isolation of VM memory from the hypervisor and host OS.
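
As a rough cross-check of the figures quoted above, the short Python sketch below multiplies out the benchmark sweep and the per-SoC cache totals. It uses only numbers stated in the text. The names are illustrative, and treating the sweep as a full cross product of benchmark variants and parameter combinations is an assumption made here, not something the article states.

```python
# Figures quoted in the article; constant and function names are illustrative only.
BENCHMARK_VARIANTS = 140         # "over 140 individual benchmark variants"
PARAMETER_COMBINATIONS = 2_800   # combinations of SoC and system design parameters
ACTIVE_CORES = 132               # active cores per Cobalt 200 SoC
L2_PER_CORE_MB = 3               # L2 cache per core
L3_SYSTEM_CACHE_MB = 192         # shared L3 system cache


def sweep_size(variants: int, combinations: int) -> int:
    """Upper bound on (benchmark, configuration) pairs.

    Assumes every benchmark variant is paired with every parameter
    combination, which is a simplification for illustration.
    """
    return variants * combinations


def total_l2_mb(cores: int, l2_per_core_mb: int) -> int:
    """Aggregate L2 capacity across all active cores, in MB."""
    return cores * l2_per_core_mb


pairs = sweep_size(BENCHMARK_VARIANTS, PARAMETER_COMBINATIONS)
l2_mb = total_l2_mb(ACTIVE_CORES, L2_PER_CORE_MB)

print(pairs)                       # 392000, consistent with "over 350,000"
print(l2_mb)                       # 396
print(l2_mb + L3_SYSTEM_CACHE_MB)  # 588 MB of combined L2 and L3
```

Under that cross-product assumption the sweep comes to 392,000 pairs, which sits comfortably above the "over 350,000" candidates the article reports.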

When designing Cobalt 200, our benchmark workloads and design simulations revealed an interesting trend: several universal compute patterns emerged, namely compression, decompression, and encryption. Over 30% of cloud workloads had significant use of one of these common operations. Optimizing for these common operations required a different approach than just cache sizing and CPU core selection. We designed custom compression and cryptography accelerators (dedicated blocks of silicon on each Cobalt 200 SoC) solely for the purpose of accelerating these operations without sacrificing CPU cycles. These accelerators help reduce workload CPU consumption and overall costs. For example, by offloading compression and encryption tasks to the Cobalt 200 accelerator, Azure SQL is able to reduce use of critical compute resources, prioritizing them for customer workloads.

Leading Infrastructure Innovation with Cobalt 200

Azure Cobalt is more than just an SoC, and we are constantly optimizing and accelerating every layer in the infrastructure. The latest Azure Boost capabilities are built into the new Cobalt 200 system, which significantly improves networking and remote storage performance. Azure Boost delivers increased network bandwidth and offloads remote storage and networking tasks to custom hardware, improving overall workload performance and reducing latency.

Cobalt 200 systems also embed the Azure Integrated HSM (Hardware Security Module), providing customers with top-tier cryptographic key protection within Azure’s infrastructure, ensuring sensitive data stays secure. The Azure Integrated HSM works with Azure Key Vault for simplified management of encryption keys, offering high availability and scalability as well as meeting FIPS 140-3 Level 3 compliance.
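
To make the compression and encryption observation earlier in this section more concrete, the sketch below measures what share of a toy request loop is spent in those operations when they run in software on the CPU. It is only an illustration: the handler, payload mix, and function names are hypothetical, and because the Python standard library ships no cipher, SHA-256 hashing stands in for cryptographic work. It says nothing about how the Cobalt 200 accelerators are programmed.

```python
import hashlib
import os
import time
import zlib


def compress_and_hash(payload: bytes) -> tuple[bytes, float]:
    """Compress and hash a payload; return (digest, seconds spent doing so)."""
    start = time.perf_counter()
    compressed = zlib.compress(payload, level=6)
    digest = hashlib.sha256(compressed).digest()
    return digest, time.perf_counter() - start


def offloadable_share(requests: int = 20, payload_size: int = 256 * 1024) -> float:
    """Fraction of wall-clock time the toy loop spends compressing and hashing."""
    offloadable = 0.0
    loop_start = time.perf_counter()
    for _ in range(requests):
        # Half random, half repetitive bytes, so the payload is only partly compressible.
        payload = os.urandom(payload_size // 2) + b"A" * (payload_size // 2)
        # Stand-in for application logic that would remain on the CPU cores.
        _ = sum(payload[::64])
        _, spent = compress_and_hash(payload)
        offloadable += spent
    total = time.perf_counter() - loop_start
    return offloadable / total


if __name__ == "__main__":
    print(f"compression + hashing share: {offloadable_share():.0%}")
```

The printed share depends entirely on the machine and the payload mix. The point is only that such a measurement identifies the CPU time a dedicated accelerator could, in principle, hand back to the workload.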

Figure: An Azure Cobalt 200 server in a validation lab

Looking Forward to 2026

We are excited about the innovation and advanced technology in Cobalt 200 and look forward to seeing how our customers create breakthrough products and services. We’re busy racking and stacking Cobalt 200 servers around the world and look forward to sharing more as we get closer to wider availability next year.

Check out Microsoft Ignite opening keynote
Read more on what's new in Azure at Ignite
Learn more about Microsoft's global infrastructure

17K Views, 9 likes, 0 Comments

Unleashing GitHub Copilot for Infrastructure as Code
Introduction

Infrastructure management is always changing. Teams want solutions that work, scale to large tasks, and stay reliable. As more companies move to cloud-based systems and adopt Infrastructure as Code (IaC), the role of infrastructure professionals is becoming even more important, and they face new problems in provisioning environments and keeping everything running smoothly.

The Challenges faced by Infrastructure Professionals

Complexity of IaC: Managing infrastructure through code introduces a layer of complexity. Infrastructure professionals often grapple with the intricate syntax and structure required by tools like Terraform and PowerShell. This complexity can lead to errors, delays, and increased cognitive load.

Consistency Across Environments: Achieving consistency across multiple environments (development, testing, and production) poses a significant challenge. Maintaining uniformity in configurations is crucial for ensuring the reliability and stability of the deployed infrastructure.

Learning Curve: The learning curve associated with IaC tools and languages can be steep for those new to the domain. As teams grow and diversify, onboarding members with varying levels of expertise becomes a hurdle.

Time-Consuming Development Cycles: Crafting infrastructure code manually is a time-consuming process. Infrastructure professionals often find themselves reinventing the wheel, writing boilerplate code, and handling repetitive tasks that could be automated.

Unleashing GitHub Copilot for Infrastructure as Code

In response to these challenges, using GitHub Copilot to generate infrastructure code is changing the way infrastructure is written and addresses the pain points experienced by professionals in the field.

The Significance of GH Copilot for Infra

Code Generation with accuracy: Copilot uses machine learning to interpret the intent behind prompts and quickly generate infrastructure code. It takes the context of the infrastructure task into account, allowing professionals to express their requirements in natural language and receive corresponding code suggestions.

Streamlining the IaC Development Process: By automating the generation of infrastructure code, Copilot significantly streamlines the IaC development process. Infrastructure professionals can now focus on higher-level design decisions and business logic rather than wrestling with syntax intricacies.

Consistency Across Environments and Projects: GH Copilot supports consistency across environments by generating standardized code snippets. Whether deploying resources in a development, testing, or production environment, GH Copilot helps maintain uniformity in configurations.

Accelerating Onboarding and Learning: For new team members and those less familiar with IaC, GH Copilot serves as an invaluable learning service. It provides real-time examples and best practices, fostering a collaborative environment where knowledge is shared seamlessly.

Efficiency and Time Savings: The efficiency gains brought about by GH Copilot are substantial. Infrastructure professionals can see a marked reduction in development cycles, allowing for faster iteration and deployment of infrastructure changes.

Copilot in Action

Prerequisites

1. Install the latest version of Visual Studio Code: https://code.visualstudio.com/download
2. Have a GitHub Copilot license (a personal free trial or your company/enterprise GitHub account), install the Copilot extension, and sign in from Visual Studio Code: https://docs.github.com/en/copilot/quickstart
3. Install the PowerShell extension for VS Code, as we are going to use PowerShell for our IaC sample.

Below is the PowerShell code generated using VS Code & GitHub Copilot.
It demonstrates how to create a simple Azure VM. We use a straightforward prompt written as a # comment, and the underlying code is generated automatically within the VS Code editor.

Another example creates an Azure VM scale set with a minimum and maximum instance count. The prompt is again written as a # comment.

The generated PowerShell script can be executed either from the local system or from the Azure Portal Cloud Shell. Similarly, we can create Terraform and DevOps code using this Infra Copilot.

Conclusion

In summary, GH Copilot is a significant development for infrastructure as code. It helps professionals overcome the challenges described above and enables a more efficient and collaborative way of working. The examples in this guide show how it works and how it can be used in practice, and aim to give infrastructure professionals the information they need to improve how they write infrastructure as code.

31K Views, 9 likes, 9 Comments

Reimagining AI at scale: NVIDIA GB300 NVL72 on Azure
By Gohar Waqar, CVP of Cloud Hardware Infrastructure Engineering, Microsoft

Microsoft was the first hyperscaler to deploy the NVIDIA GB300 NVL72 infrastructure at scale, with a fully integrated platform engineered to deliver unprecedented compute density in a single rack to meet the demands of agentic AI workloads. Each GB300 NVL72 rack packs 72 NVIDIA Blackwell Ultra GPUs and 36 NVIDIA Grace™ CPUs with up to ~136 kW of IT load, enabled by Microsoft’s custom liquid cooling heat exchanger unit (HXU) system.

Using a systems approach to architect GB300 clusters, Azure’s new NDv6 GB300 VMs include robust infrastructure innovation across every layer of the stack, including smart rack management for fleet health, innovative cooling systems, and efficient deployment features that make scaling high-density AI clusters easier than ever. With purpose-built hardware engineered for a unified platform, from silicon to systems to software, Azure’s deployment of NVIDIA GB300 NVL72 is a clear representation of Microsoft’s commitment to raising the bar on accelerated computing, enabling training of multitrillion-parameter models and high throughput on inference workloads.

Unique features of NVIDIA GB300 NVL72 system on Microsoft Azure

Ultra-dense AI rack - The GB300 rack integrates 72 NVIDIA Blackwell Ultra GPUs (each with 288 GB of HBM3e) and 36 Grace CPUs, effectively delivering supercomputer-class performance in a single rack.

Advanced liquid cooling - Each rack uses direct-to-chip liquid cooling. In air-cooled data centers, external liquid cooling heat exchanger unit (HXU) radiator units in each rack dissipate ~136 kW to room air. In facilities with chilled water, the rack connects directly to facility water.

Smart rack management - The system is equipped with an embedded controller that monitors power, temperature, coolant flow, and leak sensors in real time.
It can auto-throttle or shut down components if conditions go out-of-range and provide full telemetry for remote fleet diagnostics.

Fully integrated security and offload features - Our unique design also includes the Azure Integrated Hardware Security Module (HSM) chip and Azure Boost offload accelerator for advanced I/O and security performance.

Scalable datacenter deployment - GB300 arrives as an integrated rack (compute trays, NVIDIA NVLink™ fabric, cooling, and power shelves pre-installed). Deployment is streamlined: it requires only connectivity, power, and cooling plus a set of initial checks, after which the rack self-regulates its cooling and power distribution.

Purpose-built architecture designed for rapid deployment and scale

At its core, GB300 is built to maximize AI compute density within a standard data center footprint. It is a single-rack AI inference and training cluster with unprecedented component density. Compared to the previous generation (NVIDIA GB200 NVL72), it introduces higher-performance GPUs (from ~1.2 kW to ~1.4 kW each, with more HBM3e memory), a ~50% boost in NVFP4 throughput, and a revamped power/cooling design to handle ~20% greater thermal and power load. The liquid cooling system for the GPU module is enhanced with a new cold plate and improved leak detection assembly for safe, high-density operation.

Innovations in our purpose-built Azure Boost accelerator for I/O offload unlock higher bandwidth, while our custom Datacenter-secure Control Module (DC-SCM) introduces a secure, modular control plane built on a hardware root of trust, backed by the Azure Integrated Hardware Security Module (HSM). Together, these advancements enable fleet-wide manageability, strengthening security and operational resilience at scale and meeting the demands of hyperscale environments.

Cooling systems designed for deployability and global resiliency

To dissipate ~136 kW of heat per rack, GB300 relies on direct liquid cooling for all major components.
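
To put the ~136 kW figure in context, the back-of-the-envelope sketch below combines the rack numbers quoted so far. All inputs are the article's approximate values. The constant and function names are illustrative, and the article does not break down where the non-GPU power goes, so the remainder is simply reported as "everything else".

```python
# Approximate figures quoted in the article; names are illustrative only.
GPUS_PER_RACK = 72
GPU_POWER_W = 1_400        # "~1.4 kW each"
RACK_IT_LOAD_W = 136_000   # "~136 kW" of IT load per rack
HBM_PER_GPU_GB = 288       # HBM3e per Blackwell Ultra GPU


def gpu_power_w() -> int:
    """Total power drawn by the GPUs alone, in watts."""
    return GPUS_PER_RACK * GPU_POWER_W


def gpu_share_of_rack() -> float:
    """Fraction of the rack's IT load attributable to GPUs."""
    return gpu_power_w() / RACK_IT_LOAD_W


def rack_hbm_gb() -> int:
    """Aggregate HBM3e capacity across the rack, in GB."""
    return GPUS_PER_RACK * HBM_PER_GPU_GB


print(gpu_power_w())                     # 100800 W for the 72 GPUs
print(RACK_IT_LOAD_W - gpu_power_w())    # 35200 W for everything else in the rack
print(round(gpu_share_of_rack() * 100))  # 74 (percent)
print(rack_hbm_gb())                     # 20736 GB of HBM3e per rack
```

On these rounded inputs the GPUs account for roughly three quarters of the heat the cooling loop has to remove.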
To offer resiliency and wide deployability across Microsoft’s datacenter footprint, our cooling designs support both facility-water and air-cooled environments. Both approaches use a closed coolant loop inside the rack with a treated water-glycol fluid. Leak detection cables line each tray, and the base of the rack is equipped with smart management protocols to address potential leaks. Using this method, liquid cooling is highly efficient and reliable: it allows GB300 to run with warmer coolant temperatures than traditional datacenter water, improving overall power usage effectiveness (PUE).

Smart management, fleet health & diagnostics

Each GB300 rack is a “smart IT rack” with an embedded management controller that oversees its operation. This controller is supported by a rack control module that serves as the brain of the rack, providing comprehensive monitoring and automation for power, cooling, and health diagnostics. By delivering an integrated “single pane of glass” view for each rack’s health, the GB300 makes management at scale feasible despite the complexity.

This rack self-regulates its power and thermal environment once installed, adjusting fans or pump speeds automatically, and isolates faults. This reduces the manual effort to keep the cluster running optimally, so customers can focus on the workloads with confidence that the infrastructure is continuously self-monitoring and safeguarding itself. In addition to this, the rack control module monitors and moderates GPU peak power consumption and other power management scenarios. These robust design choices reflect the fleet-first mindset: maximizing uptime and easier diagnostics in large deployments.

Efficient and streamlined deployment

As Microsoft scales thousands of GB300 racks for increased AI supercomputing capacity, fast and repeatable deployment is critical.
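
The self-regulating behaviour described in the smart-management section above (monitor power, coolant, and leak sensors, then throttle or shut down when readings go out of range) can be sketched as a simple decision function. This is a minimal illustration, not the actual rack controller logic. Every threshold, field name, and type below is a hypothetical placeholder; only the ~136 kW figure comes from the article.

```python
from dataclasses import dataclass
from enum import Enum


class Action(Enum):
    NORMAL = "normal"
    THROTTLE = "throttle"
    SHUTDOWN = "shutdown"


@dataclass(frozen=True)
class RackTelemetry:
    power_kw: float
    coolant_supply_c: float
    coolant_flow_lpm: float
    leak_detected: bool


@dataclass(frozen=True)
class Limits:
    # Hypothetical thresholds chosen for illustration; only 136 kW is quoted in the article.
    power_throttle_kw: float = 136.0
    power_shutdown_kw: float = 150.0
    coolant_throttle_c: float = 45.0
    coolant_shutdown_c: float = 55.0
    min_flow_lpm: float = 100.0


def decide(sample: RackTelemetry, limits: Limits = Limits()) -> Action:
    """Map one telemetry sample to the most severe action it warrants."""
    # A detected leak is treated as the most severe condition.
    if sample.leak_detected:
        return Action.SHUTDOWN
    if (sample.power_kw >= limits.power_shutdown_kw
            or sample.coolant_supply_c >= limits.coolant_shutdown_c):
        return Action.SHUTDOWN
    if (sample.power_kw >= limits.power_throttle_kw
            or sample.coolant_supply_c >= limits.coolant_throttle_c
            or sample.coolant_flow_lpm < limits.min_flow_lpm):
        return Action.THROTTLE
    return Action.NORMAL


print(decide(RackTelemetry(120.0, 35.0, 150.0, False)).value)  # normal
print(decide(RackTelemetry(140.0, 35.0, 150.0, False)).value)  # throttle
print(decide(RackTelemetry(120.0, 35.0, 150.0, True)).value)   # shutdown
```

Checking the most severe conditions first keeps the policy easy to audit: a sample can never be downgraded to a throttle once a shutdown condition has matched.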

GB300 introduces a new era of high-density AI infrastructure, tightly integrating cutting-edge hardware (Grace CPUs, Blackwell Ultra GPUs, and NVLink connectivity) with innovations both in power delivery and liquid cooling. Crucially, it does so with an eye toward operational excellence: built-in management, health diagnostics, and deployment-friendly design mean that scaling up AI clusters with GB300 can be done rapidly and reliably. With its unprecedented compute density, intelligent self-management, and flexible cooling options, the GB300 platform enables organizations to scale rapidly with the latest AI supercomputer hardware while maintaining the reliability and serviceability expected in Azure’s promise to customers. GB300 unlocks next-level AI performance delivered in a package engineered for real-world efficiency and fleet-scale success.

2.3K Views, 7 likes, 0 Comments

Deep dive into the Maia 200 architecture
Maia 200 is a breakthrough inference architecture engineered to dramatically shift the economics of large-scale token generation. As Microsoft’s first silicon and system platform optimized specifically for AI inference, Maia 200 is built for modern reasoning and large language models. It delivers the most efficient performance per dollar of any inference system deployed in Azure and represents the highest performance chip of any custom cloud accelerator today.

AI inference is increasingly defined by an efficient frontier, a curve that measures how much real-world capability and accuracy can be delivered at a given level of cost, latency, and energy. Different applications sit at different points on that frontier: interactive copilots prioritize low-latency responsiveness, batch-scale summarization and search emphasize throughput at a given cost, and advanced reasoning workloads demand sustained performance under long-context and multi-step execution. As enterprises deploy AI across these diverse scenarios, the infrastructure requirements are no longer one-size-fits-all; they require a portfolio approach that delivers the highest-performance, lowest-cost infrastructure at scale.

Maia 200 reflects a core principle of AI at scale: innovation across software, silicon, systems, and datacenters is what enables us to deliver 30% better performance per dollar than the latest generation hardware in our fleet today. As agentic applications expand in capability and adoption, this integrated approach makes infrastructure efficiency a foundational advantage.

Maia 200: Purpose-Built for Price-Performance Inference Leadership

To meet these demands, Maia 200 introduces a new system and silicon architecture purpose-built to maximize inference efficiency.
Guided by a deep understanding of AI workloads and supported by an advanced pre-silicon environment that enables hardware/software codesign, Maia 200 incorporates a set of deliberate architectural choices that deliver industry-leading tokens per dollar and per watt. Notable architecture innovations include:

Optimized narrow-precision datapaths on the latest TSMC N3 process technology, enabling 10.1 PetaOPS of FP4 and positioning Maia 200 among the highest FP4-per-dollar accelerators available in any cloud.

A reimagined memory subsystem combining 272 MB of on-die SRAM with 216 GB of HBM3e delivering 7 TB/s of HBM bandwidth to service data-intensive operations, while minimizing off-chip traffic, reducing HBM bandwidth demand, and improving overall energy efficiency.

An efficient data-movement fabric, centered around a multi-level Direct Memory Access (DMA) subsystem and a hierarchical Network-on-Chip (NoC), ensuring predictable, scalable performance for heterogeneous and memory-bound AI workloads.

A highly performant and reliable Ethernet scale-up interconnect, featuring an integrated on-die NIC with 2.8 TB/s (bi-directional) of bandwidth, an advanced transport protocol enabling a two-tier scale-up network, and topology optimizations to deliver high-bandwidth, low-latency communication across a cluster of 6,144 accelerators.

A closer look at Maia 200 reveals the architectural advancements and system-level innovations purpose-built for inference that enable its industry-leading efficiency.

Maia 200 Architecture Overview

Maia accelerators are organized around a hierarchical micro-architecture. At the foundation of this hierarchy is the tile, the smallest autonomous unit of compute and local storage. Each tile integrates two complementary execution engines: a Tile Tensor Unit (TTU) for high-throughput matrix multiply and convolution, and a Tile Vector Processor (TVP) as a highly programmable SIMD engine.
These engines are fed by multi-banked Tile SRAM (TSRAM) and a tile-level DMA subsystem that is responsible for moving data into and out of that SRAM without stalling the compute pipeline. A lightweight Tile Control Processor (TCP) runs the low-level code emitted by the software stack and orchestrates TTU and DMA work issuance, while hardware semaphores provide fine-grained synchronization between data movement and compute. Multiple tiles compose into a cluster, which introduces a second tier of shared locality and coordinated movement. Each cluster contains a large multi-banked Cluster SRAM (CSRAM) accessible across the tiles in that cluster, along with a dedicated cluster DMA subsystem that stages traffic between CSRAM and co-packaged High Bandwidth Memory (HBM). A dedicated cluster core provides the control and synchronization needed to coordinate multi-tile execution, and the full SoC is built by instantiating multiple clusters. Because building at scale requires not just peak performance but manufacturability, the architecture also incorporates redundancy schemes for both tiles and SRAM to improve yield while preserving the hierarchical programming and execution model. Maia accelerators feature a highly optimized data movement infrastructure, centered around its Direct Memory Access (DMA) subsystem coupled with a hierarchical Network-on-Chip (NoC). The DMA engines are architected for multichannel, high-bandwidth transfer and support 1D/2D/3D strided movement, enabling common ML tensor layouts to be moved efficiently between on-chip SRAM, HBM, and external interfaces while overlapping data movement with compute. Meanwhile, the NoC provides scalable, low-latency communication across clusters and memory subsystems and supports both unicast and multicast transfers—an important capability for distributing tensor blocks and coordinating parallel execution. 
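
The 1D/2D/3D strided movement mentioned above can be illustrated with a small address-generation sketch. The article does not describe the actual Maia DMA descriptor format, so the `StridedDescriptor` type and `gather` helper below are hypothetical. They only show the general idea of describing a transfer by a base offset, a shape, and per-dimension strides.

```python
from dataclasses import dataclass
from itertools import product
from typing import Iterator, Sequence


@dataclass(frozen=True)
class StridedDescriptor:
    """Hypothetical N-dimensional transfer descriptor (not the Maia format)."""
    base: int                 # starting element offset in the source buffer
    shape: tuple[int, ...]    # elements to move per dimension, outermost first
    strides: tuple[int, ...]  # element step per dimension, outermost first

    def offsets(self) -> Iterator[int]:
        """Yield source offsets in the order the elements would be moved."""
        for index in product(*(range(n) for n in self.shape)):
            yield self.base + sum(i * s for i, s in zip(index, self.strides))


def gather(source: Sequence[int], desc: StridedDescriptor) -> list[int]:
    """Software model of a strided copy: collect the described elements."""
    return [source[offset] for offset in desc.offsets()]


# A 4x6 row-major matrix stored flat; element value equals its offset.
matrix = list(range(24))

# 2D transfer: a 2x3 sub-block starting at row 1, column 2 (row stride 6).
block = StridedDescriptor(base=1 * 6 + 2, shape=(2, 3), strides=(6, 1))
print(gather(matrix, block))   # [8, 9, 10, 14, 15, 16]

# 1D transfer: every sixth element, i.e. the first column of the matrix.
column = StridedDescriptor(base=0, shape=(4,), strides=(6,))
print(gather(matrix, column))  # [0, 6, 12, 18]
```

Describing a tile of a larger tensor this way lets a single descriptor replace a loop of small contiguous copies, which is the usual motivation for strided DMA support.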
To further improve effective memory efficiency, Maia supports multiple narrow-precision data types as storage formats in both HBM and SRAM and employs hardware-based data casting to convert storage types to compute types at line rate so that movement and execution remain tightly coupled.

For communication beyond the chip, Maia 200 integrates a high-performance NIC and an Ethernet-based scale-up interconnect using an optimized AI Transport Layer (ATL) protocol to deliver scalable, low-latency communication across nodes. The on-die NIC provides 1.4 TB/s unidirectional (2.8 TB/s bidirectional) I/O bandwidth, eliminating the power and cost overhead of external NICs while enabling efficient scaling to 6,144 accelerators within a two-tier scale-up domain. ATL operates end-to-end over standard Ethernet, supporting a commodity, multi-vendor switching ecosystem, while layering on innovations such as packet spraying, multipath routing, and congestion-resistant flow control built directly into the transport layer to maximize throughput and stability.

Optimized Tensor Core for Narrow Precision Data Types

As AI models continue to grow in size and complexity, achieving cost-effective inference increasingly depends on exploiting narrow-precision arithmetic and reducing memory footprints to improve performance and efficiency. Industry results consistently show that formats such as FP4 can maintain robust model accuracy for inference while significantly reducing computational and memory requirements.

Maia 200 is architected from the ground up for narrow-precision execution. Its Tile Tensor Unit (TTU) is optimized for matrix multiplication in FP8, FP6, and FP4, and supports mixed-precision modes such as FP8 activations multiplied by FP4 weights to maximize throughput without compromising accuracy. Complementing this, the Tile Vector Processor (TVP) delivers FP8 compute alongside BF16, FP16, and FP32, providing flexibility for layers or operators that benefit from higher precision.
An integrated reshaper up-converts low-precision formats at line rate prior to computation, ensuring seamless dataflow without introducing bottlenecks. Notably, FP4 throughput on Maia 200 is 2× that of FP8, and 8× that of BF16, enabling substantial gains in tokens-per-second and performance-per-watt for inference-centric workloads.

A Reimagined Memory Subsystem

A defining feature of Maia 200’s architecture is its advanced memory hierarchy, engineered to optimize data movement and sustain high utilization across diverse inference workloads. Maia 200 integrates 272 MB of on-die SRAM partitioned into multi-tier Cluster-level SRAM (CSRAM) and Tile-level SRAM (TSRAM). This substantial on-die memory resource enables a wide range of low-latency, bandwidth-efficient data-management strategies. Both CSRAM and TSRAM are fully software-managed, allowing developers (or the compiler/runtime) to deterministically place and pin data for precise control of locality and movement.

For example, a primary use case for CSRAM is pinning critical working sets within cluster-local memory. Keeping frequently accessed data resident on-chip provides predictable low-latency access, reduces dependence on higher-latency memory tiers, and improves deterministic execution. More broadly, the on-die SRAM hierarchy allows programmers to buffer, stage, and pin data in ways that significantly optimize dataflow patterns across kernel types. Examples include:

- GEMM kernels can retain intermediate matrix tiles in TSRAM, boosting arithmetic intensity by eliminating round-trips to HBM or even CSRAM.
- Attention kernels can pin Q/O tensors, K/V tensors, and partial Q·K products as much as possible in TSRAM, minimizing data-movement overhead throughout the attention pipeline.
- Collective-communication kernels can buffer full payloads in CSRAM while accumulation proceeds in TSRAM, reducing pressure on HBM and preventing bandwidth collapse during multi-node operations.
- Cross-kernel pipelines benefit from CSRAM as a transient buffer between stages, enabling tightly coupled, high-throughput kernel chaining with fewer stalls, which is particularly valuable for workloads with high kernel density or complex operator fusion.

Together, these capabilities allow Maia 200 to maintain high compute efficiency and deterministic performance, even as model architectures and sequence lengths grow increasingly demanding.

An Efficient Data-Movement Fabric: Specialized DMA Engines and a Custom On-Chip Interconnect

Sustained inference utilization on Maia 200 depends on the ability to move data predictably and efficiently among compute tiles, on-die SRAM, HBM, and I/O. Because inference performance is often bounded by data movement rather than peak FLOPs, the interconnect must support high-throughput tensor transfers (broadcast, gather, reduce, scatter) while also ensuring low-latency delivery of synchronization and control signals. Maia 200 addresses this challenge with a custom Network-on-Chip (NoC) designed explicitly for inference-centric dataflow.

At the chip level, the NoC forms a mesh network spanning all clusters, tiles, memory controllers, and I/O units. It is segmented into multiple logical planes (or virtual fabrics), including a high-bandwidth data plane for large tensor transfers and a dedicated control plane for interrupts, synchronization, and small messages. This separation ensures that latency-critical control traffic is never blocked behind bulk data transfers, a key requirement when hundreds of tiles, DMA engines, and controllers operate concurrently.

Maia 200’s on-chip fabric introduces several inference-oriented innovations:

- Efficient HBM-to-cluster broadcast: Hierarchical data movement allows tensors to be fetched once from HBM and fanned out to multiple CSRAMs, avoiding redundant HBM reads and improving energy efficiency.
- Localized high-bandwidth cluster traffic: High-bandwidth cluster-local fabrics keep the hottest data movement within the cluster, enabling common inference patterns (such as intra-layer reductions, scratchpad exchanges, and small collectives) to complete within the cluster without repeatedly traversing global links.
- Tile-to-tile SRAM access: Within a cluster, the fabric allows Tile DMAs and vector units to directly read and write peer tile SRAMs, enabling efficient broadcasts, reductions, and shared-state updates without engaging HBM or CSRAM.
- Quality-of-Service for critical traffic: QoS mechanisms in both the fabric and memory controllers prioritize urgent, low-latency messages, such as synchronization signals or small inference outputs, ensuring they are not delayed by bulk tensor transfers.
- Fail-safe management plane: By isolating control and telemetry traffic from the data path, Maia 200 maintains a reliable, always-available management channel, which is essential for recovery, coordination, and monitoring in large-scale inference deployments.

Complementing the NoC, Maia 200 implements a hierarchy of specialized DMA engines tailored for AI dataflow. Tile DMAs handle fine-grained transfers between TSRAM and CSRAM; Cluster DMAs shuttle data between CSRAM and HBM or across clusters; and Network DMAs manage send/receive paths for off-chip links. This layered DMA architecture enables concurrent, overlapped transfers across memory tiers and across nodes, ensuring compute tiles remain well-fed under diverse workload conditions.

Together, the custom NoC and multi-tier DMA hierarchy form a data-movement subsystem purpose-built for inference: high-bandwidth for tensors, low-latency for control, localized when possible, prioritized when necessary, and efficiently coordinated across the entire chip. This architecture is fundamental to Maia 200’s ability to sustain high utilization across varied and increasingly complex AI workloads.
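The benefit of overlapping transfers with compute can be estimated with a simple timing model. The sketch below assumes two buffers (double buffering), a fixed per-block transfer time, and a fixed per-block compute time. The function names and the model itself are illustrative assumptions, not measurements or APIs from Maia 200.

```python
def serial_time(num_blocks: int, transfer: float, compute: float) -> float:
    """Total time when each block is transferred and then computed, with no overlap."""
    return num_blocks * (transfer + compute)


def overlapped_time(num_blocks: int, transfer: float, compute: float) -> float:
    """Total time with double buffering: block i+1 transfers while block i computes.

    The first transfer and the last compute cannot be hidden; every step in
    between costs whichever of the two stages is slower.
    """
    if num_blocks == 0:
        return 0
    return transfer + (num_blocks - 1) * max(transfer, compute) + compute


# Four blocks, 3 time units to move each block, 5 time units to compute on it.
print(serial_time(4, 3, 5))      # 32
print(overlapped_time(4, 3, 5))  # 23
```

In this model the pipeline is compute-bound whenever the transfer time is at most the compute time, so data movement is hidden except for the first block. That is the condition a layered DMA hierarchy aims to maintain.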
A Highly Performant and Reliable Two-Tier Scale-Up Interconnect with an Innovative AI Transport Layer

Maia 200 incorporates an integrated NIC and a high-performance Ethernet-based scale-up interconnect built around Microsoft’s AI Transport Layer (ATL) protocol to enable scalable, low-latency chip-to-chip communication across 6,144 Maia accelerators arranged in a two-tier topology. Scale-up networking was approached as a full-stack solution, with the interconnect architected as a set of well-defined layers co-optimized end-to-end for performance-per-dollar. The design emphasizes predictable latency, full bandwidth utilization, and software-defined flexibility, while leveraging the robustness and multi-vendor support of the commodity Ethernet switch ecosystem.

A foundational innovation in Maia 200’s interconnect is the on-die integrated NIC and its close coupling with both the ATL protocol engine and the Network DMA. This custom, in-house network controller is engineered for very low power and area, enabling features such as packet spraying, multipath routing, and congestion-resistant flow control directly in the transport layer to maximize throughput and stability. Together, these elements enable a two-tier scale-up fabric optimized for large-scale inference workloads, providing tightly coupled communication both within and across racks.

Many accelerator systems rely on all-switched scale-up fabrics, where even local tensor-parallel traffic must traverse external switches. This approach forces most collective operations onto shared switch paths, adding hop latency and power and requiring significant port and cabling over-provisioning to sustain worst-case all-to-all patterns. Maia 200 avoids these inefficiencies through the Fully Connected Quad (FCQ): groups of four accelerators connected via switchless, direct links.
This intra-node topology delivers significantly faster tensor-parallel communication without relying on an external switch and provides a superior Perf/$ and Perf/W balance for both compute and collective I/O. Beyond the FCQ domain, the switched tier extends connectivity to 6,144 accelerators, enabling very large inference models to be sharded across nodes while preserving communication efficiency, without depending on external NICs and a scale-out network.

This architecture offers three major benefits:

- Bandwidth optimizations and reduced overhead: High-intensity tensor-parallel traffic, KV updates, and partial activations remain localized within FCQ groups, while switches handle lighter-weight cross-domain collectives.
- Multi-rack inference at scale without training-class cost: The design avoids the power, complexity, and fleet-cost burden of a scale-out network while still enabling hyperscale inference topologies under practical power envelopes.
- Workload-aligned network behavior: Modern inference workloads require moderate synchronization, not the extreme all-to-all pressure of training. The two-tier architecture meets these needs without over-engineering the fabric, while still delivering high throughput and low latency for production inference deployments.

The result is a scale-up network that is high-performance, reliable, and right-sized, achieving the bandwidth, latency, and efficiency targets essential for large-scale inference while remaining cost- and power-efficient for hyperscale deployment.

At the top of the scale-up hardware stack is the collective communication layer, which forms the interface between deep-learning frameworks (e.g., PyTorch, TensorFlow) and the underlying hardware. Maia 200 uses the Microsoft Collective Communication Library (MCCL), whose algorithms are co-designed with Maia’s hardware to deliver optimal scale-up performance for specific workload shapes.
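One way to picture the two-tier topology is to classify each pair of accelerators by the tier their traffic would use, and to stage a reduction so that most traffic stays inside a quad. The sketch below assumes accelerators are numbered so that IDs 4k through 4k+3 form one FCQ. That numbering, the function names, and the three-stage sum are illustrative assumptions and do not describe MCCL's actual algorithms.

```python
from typing import List, Sequence

FCQ_SIZE = 4  # from the text: a Fully Connected Quad groups four accelerators


def fcq_group(accel_id: int) -> int:
    """Quad index of an accelerator under the assumed consecutive numbering."""
    return accel_id // FCQ_SIZE


def path_tier(src: int, dst: int) -> str:
    """Classify which tier carries traffic between two accelerators."""
    if src == dst:
        return "local"
    if fcq_group(src) == fcq_group(dst):
        return "direct"    # switchless link inside the quad
    return "switched"      # crosses the switched tier


def hierarchical_all_reduce(values: Sequence[float]) -> List[float]:
    """Sum all-reduce staged to match the two tiers (one value per accelerator)."""
    if len(values) % FCQ_SIZE != 0:
        raise ValueError("expected a whole number of quads")
    # Stage 1: each quad reduces its four values over direct links.
    quad_sums = [sum(values[i:i + FCQ_SIZE]) for i in range(0, len(values), FCQ_SIZE)]
    # Stage 2: only one partial sum per quad crosses the switched tier.
    total = sum(quad_sums)
    # Stage 3: the result is fanned back out to every accelerator.
    return [total] * len(values)


print(path_tier(0, 3))   # direct
print(path_tier(3, 4))   # switched
print(6144 // FCQ_SIZE)  # 1536 quads in a full scale-up domain
print(hierarchical_all_reduce([1, 2, 3, 4, 5, 6, 7, 8]))  # eight copies of 36
```

Staging the reduction this way sends one partial result per quad through the switches instead of four, which is the kind of traffic reduction a hierarchical collective is meant to deliver.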
Key areas of innovation in MCCL include:

- Compute–I/O overlap to hide synchronization overhead and minimize pipeline bubbles.
- Hierarchical collectives that reduce network traffic, lower latency, and minimize incast.
- Dynamic algorithmic selection tuned to tensor sizes and communication patterns.
- I/O latency hiding through pipelined and predictive scheduling.

Together, the interconnect hardware and MCCL software deliver a tightly integrated, inference-optimized scale-up platform capable of supporting the next generation of large-scale, low-latency AI deployments.

Maia 200 System: Azure-Integrated, Cloud-Native by Design

The Maia 200 system is engineered as a fully Azure-native platform, tightly integrated into the same cloud infrastructure that powers Microsoft’s largest AI and GPU fleets. At the hardware layer, Maia 200 is co-designed with Azure’s third-party GPU systems, adhering to a standardized rack, power, and mechanical architecture that simplifies deployment, improves serviceability, and allows heterogeneous accelerators to coexist within the same datacenter footprint. This alignment enables Azure to operate Maia 200 at hyperscale without requiring bespoke infrastructure or specialized site configurations.

Thermal design is equally modular. Maia 200 supports deployments in both air- and liquid-cooled datacenters, including a second-generation liquid-cooling sidecar designed for high-density racks and thermally constrained environments. This ensures broad deployability and fungibility across both legacy air-cooled and next-generation liquid-cooled datacenters while maintaining consistent performance under sustained workloads.

Operationally, Maia 200 integrates with Azure’s native control plane, inheriting the same lifecycle, availability, and reliability guarantees as other Azure compute services.
Firmware rollouts, fault detection, and health monitoring are all performed through impactless, fleet-wide management workflows, minimizing disruption and ensuring consistent service levels. This tight control-plane integration also enables automated node bring-up, safe in-place upgrades, and coordinated multi-rack maintenance, capabilities essential for large-scale inference deployments.

Maia 200 will be part of our heterogeneous AI infrastructure supporting multiple models, including the latest GPT-5.2 models from OpenAI, to power AI workloads in Microsoft Foundry and Microsoft 365 Copilot. It will be fully integrated into Azure, allowing models and workloads to be scheduled, partitioned, and monitored using the same tooling that supports Azure’s GPU fleets. This ensures portability across hardware types and allows service operators to optimize for perf/$, latency, or capacity without rewriting orchestration logic.

Together, these system-level capabilities make Maia 200 not just a highly efficient inference accelerator, but a cloud-native compute building block, integrated seamlessly into Azure’s global AI infrastructure and optimized for reliable, large-scale, multi-tenant operation.

Maia 200 Software Stack and Developer Toolchain: A Cloud-Native Platform for High-Performance Inference

The Maia 200 software stack brings together a fully Azure-integrated inference platform and a modern, developer-oriented SDK built to deliver performance at scale. It is designed so cloud developers can adopt Maia seamlessly, leveraging familiar tooling while accessing low-level control when needed for peak efficiency. For developers, the Maia SDK provides a comprehensive toolchain for building, optimizing, and deploying both open source and proprietary models on Maia hardware.
Workflows begin naturally with PyTorch, and developers can choose the level of abstraction required: use the Maia Triton compiler for rapid kernel generation, rely on highly optimized kernel libraries tuned for Maia’s tile- and cluster-based architecture, or target Microsoft’s Nested Parallel Language (NPL) for explicit control of data movement, SRAM placement, and parallel execution to reach near-peak utilization. The SDK includes a full simulator, compiler pipeline, profiler, debugger, and a robust quantization and validation suite, enabling teams to prototype models before silicon availability, diagnose performance bottlenecks with fine granularity, and tune kernels for optimal execution across the Maia stack. Together, the Maia inference stack and SDK form a unified platform that accelerates model bring-up, simplifies performance optimization, and makes high-performance inference a first-class, cloud-native development experience.

In conclusion, with Maia 200, we demonstrate that leadership in AI infrastructure comes from unified system and workload optimizations across the entire stack: AI models, software toolchain and orchestration, custom silicon, networking, rack-scale architecture, and datacenter infrastructure. Maia 200 embodies this principle, delivering 30% better performance per dollar than the latest generation hardware in our fleet today with an architecture that is purpose-built for efficiency at scale. It represents a decisive step in advancing the world’s most capable, efficient, and scalable cloud platform, and forms the foundation for Microsoft’s AI future.