azure hardware infrastructure
Demystifying On-Demand Capacity Reservations
About On-Demand Capacity Reservations

Introducing the "parking garage" metaphor

There are dozens of VM types available in Azure, spanning multiple generations of CPU across vendors and architectures. Within each Azure region are datacenters hosting pools of hardware that run Azure services, such as virtual machines, of those types. As customers start and stop VMs there is a constant ebb and flow of available capacity for each VM type within the region. Available capacity is driven by the rhythms of the business day, which creates variations in utilization on an hour-to-hour and even minute-to-minute basis. Longer cycles of demand, such as holiday seasons, school calendars and other real-world events, are also a factor.

When you command an Azure Virtual Machine (VM) to start, Azure Resource Manager (ARM) -- the "engine" that manages resources in the Microsoft cloud -- needs to do a few things to make it happen. The most important of these is identifying hardware within the target region with sufficient capacity to bring the desired type and size of VM online at that moment in time. If ARM finds space for the desired VM size, the VM starts normally. However, if there is no room to start the desired VM, the start operation fails with an allocation error.

This process of finding a place to start up an Azure VM has a lot of similarities to finding a place to park a vehicle. Parking facilities are built to handle typical demand for their location. If something is going on nearby, such as a large sporting event, that pushes the need for parking much higher than normal, you might be out of luck when you try to find a spot because the garage is simply full. During periods of high demand in Azure, this can result in VMs failing to start simply because there is nowhere to run them at that particular moment.
If this happens to a VM that needed to be stopped for a configuration change or other reasons, it can cause impact to your environment which you certainly want to avoid.

On-Demand Capacity Reservations

Azure has a resource called an On-Demand Capacity Reservation, or ODCR, which allows you to reserve a spot on the appropriate hardware within a region for a specific VM size. This is similar to "owning" a parking space: it is a place reserved exclusively for the use of a specific VM. At a high level, you create an ODCR that matches the Azure region, availability zone and specific VM type -- for example, a VM of type D16s_v6 in availability zone 2 of the Canada Central region. Once the reservation is created, an Azure VM that matches that configuration can be associated with it, so the VM now "owns" that "parking space". This gives the VM priority over others of the same type when it needs to start, because it already has a "parking space" assigned to it that cannot be used by another VM.

More detail about VM startup

Before we get further into what ODCRs are and how they work, it's important to know a few more things about starting up a VM:

- Azure does not provide an explicit SLA for startup of virtual machines without an ODCR. Finding a hypervisor slot to boot a VM is purely a "best effort" action on Azure's part.
- Having quota headroom does not help with VM startup. Quota in Azure is your "credit limit" for creating VMs: it grants permission to create up to a certain number of cores' worth of virtual machines from a particular family (like Ds_v6) but has no effect on whether you can actually start the machine once it's created.
- Similarly, having a Reserved Instance purchase or a Savings Plan for a particular number of cores of a given VM family has no impact on the ability to start a VM either.
These are discount mechanisms only, where the customer pre-pays for a certain number of VM cores to run 24x7 at a reduced rate.

Assigning an ODCR to a virtual machine applies a formal SLA to its startup. VMs with ODCRs get priority over ones without, so the likelihood of a successful startup is much higher for VMs that have one, especially during times when Azure is experiencing high demand for that particular VM type. The actual language of the ODCR SLA can be found in Microsoft's Service Level Agreements for Online Services document, which can be downloaded from the linked site.

Cost Implications of ODCRs

These are the key points to know about how billing works for ODCRs:

- The compute cost for the reserved "parking space" capacity is exactly the same as for a running VM of the same size. There is no "double billing" for a VM to have an ODCR associated with it.
- Billing for the ODCR starts immediately if the quantity of reserved "parking spaces" is greater than zero.
- Stopping a VM that has an ODCR associated with it does not reduce cost, because the ODCR is holding the reserved hypervisor slot even while the VM is not running.
- Having a Reserved Instance purchase or Savings Plan that covers the same scope as the ODCR means the VM will be billed at the discounted rate.

Are there any cases where using ODCRs results in paying more for a VM? There are two cases I've identified where you pay for two ODCRs for the same VM. First, if you are using Azure Site Recovery (ASR) to protect a VM in Azure by replicating it to another location, you have the option to associate the remote replica of the VM with a capacity reservation. This helps ensure that the replica will start when it's called upon because it has a pre-allocated spot reserved for it.
In this situation, if the original VM is also associated with an ODCR, you are paying both for the original (running) VM and for the reservation being held for its replica. Second, and similarly, when setting up replication for a VM that is being migrated into Azure via Azure Migrate, you can associate a capacity reservation with the replica for the same reason as in the ASR example above: to ensure that the VM will start when its migrated replica is activated. If the source machine is also in Azure, you are again paying twice for the same machine.

When should I use them?

Capacity reservations are an important element when designing for resiliency. They help ensure that VMs will be online when needed, even if they have to be shut down for some reason. For example, there was an incident where a customer had to shut down a VM serving as a firewall appliance to adjust its configuration, and it failed to start afterwards because of a capacity-related failure. This caused significant impact: systems that depended on the firewall for connectivity were cut off until the customer was able to bring it back online.

Based on field experience and resiliency assessments, applying ODCRs to VMs that must be available 24x7 is strongly recommended. Examples include key functions like AD domain controllers, application servers and database servers. Any VM-based appliances running as firewalls, load balancers or other infrastructure-support services should be considered as well. Microsoft offers assessments that review a workload for gaps that impact resiliency in many dimensions, including outages in Azure. These assessments include checks for the presence of capacity reservations and will report any VMs that do not have them as a high-risk finding.

Not all VM stops in Azure are voluntary

Even if you are careful to never stop a VM yourself, it can sometimes happen.
Not every shutdown of a VM in Azure is user-initiated. Involuntary shutdowns are rare, but they can occur due to predictive hardware failures or other events to which ARM responds by stopping the VM in order to move it out of harm's way.

Creating On-Demand Capacity Reservations

This section covers the components of an ODCR, the process of creating them and why creation can fail.

Components of an ODCR

An ODCR has two components. The first is a Capacity Reservation Group (CRG), which is simply a "bucket" for any number of capacity reservations. To create a CRG you only need to provide its name, the region it will be used for and which availability zones within that region it will have access to.

The second -- and more important -- component is the actual capacity reservation, which is created within a CRG. The capacity reservation requires:

- The name of the reservation. Including the VM size and other details in the name is useful to reduce ambiguity; an example could be "Zone1_D16s_v5".
- The specific VM size the reservation is for, such as "D16s_v5".
- The availability zone of the reservation. You can also create a regional reservation, where the VM is "zoneless".
- The number of "parking spaces" (instances) that the reservation holds.

ODCRs can be created via the Azure portal, from the command line using PowerShell or the Azure CLI, or deployed through IaC tools such as Bicep or Terraform. CRGs can also be shared across subscriptions, which allows a CRG created and managed in one subscription to be used by VMs in a different subscription.

When the ODCR is created, if the number of instances it contains is higher than zero, ARM will attempt to allocate the desired number of instances of the specified VM type in the target region/zone. If capacity is available, the creation succeeds and you can move on to associating machines with it to give them the protection of the ODCR.
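As a concrete sketch, the two components and their required inputs can be modeled as plain data. This is an illustrative model only, not the Azure SDK; all class and field names here are hypothetical:

```python
from __future__ import annotations
from dataclasses import dataclass, field

@dataclass
class CapacityReservationGroup:
    """The "bucket": needs only a name, a region and the zones it can use."""
    name: str
    region: str                                      # e.g. "canadacentral"
    zones: list[str] = field(default_factory=list)   # zones the CRG has access to

@dataclass
class CapacityReservation:
    """A reservation inside a CRG, tied to one VM size."""
    name: str            # e.g. "Zone1_D16s_v5" -- encode size/zone to reduce ambiguity
    vm_size: str         # e.g. "Standard_D16s_v5"
    zone: str | None     # None models a regional ("zoneless") reservation
    instance_count: int  # how many "parking spaces" the reservation holds

crg = CapacityReservationGroup("prod-crg", "canadacentral", zones=["1", "2"])
res = CapacityReservation("Zone1_D16s_v5", "Standard_D16s_v5", zone="1", instance_count=2)
print(res.vm_size, res.instance_count)  # Standard_D16s_v5 2
```

In a real deployment the same four reservation inputs map onto whichever tool you use (portal fields, PowerShell/CLI parameters, or Bicep/Terraform properties).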
If creating the ODCR is unsuccessful, the cause can be one of several things, including:

- No open hypervisor slots for the desired VM in the target location -- the "parking lot" was full at the moment the request was submitted. This can result from outages within Azure that reduce capacity, as well as demand pressure.
- Insufficient quota in the subscription to claim the necessary number of VM cores for the reservation in the region.
- The VM type is simply not available in the target region or availability zone. Not all Azure regions are provisioned with identical hardware, so this can be the cause, especially for VM types other than the popular D, E and F series machines.
- A restriction applied to the subscription, zone or region that blocks creation of the reservation.

What you can do if creating an ODCR fails

Below are some things that may help if creating a capacity reservation fails and you know that quota or other restrictions are not a factor. Not coincidentally, these are the same recommendations to try when a VM fails to start, because the same ARM action -- finding and allocating hardware with free capacity -- is taking place.

- In general, creating an ODCR outside of business hours has a higher probability of success. Demand for Azure services typically drops off at the end of the business day in the region's locale.
- Consider using a different VM type, availability zone or Azure region.
- A script or other automation that retries at intervals until the reservation succeeds in claiming the desired number of spots can help, though it can take an unknown amount of time -- possibly days or even weeks -- before it succeeds.
- Submitting a support ticket will give Microsoft visibility into your situation. If the root cause is something other than capacity, support can identify that cause and provide guidance on how to resolve it.
If the issue truly is a capacity squeeze, the ability of support to help get the reservation created is extremely limited, because the support folks, while helpful, cannot create space where none exists. In this case the support teams will often suggest the three options above.

Protecting a VM with an ODCR

Once you have the ODCR created, applying it to a VM is straightforward. To do this from the portal, open the configuration tab on the VM's screen, then scroll to the bottom of the panel to find the "Capacity reservations" section. Select "Capacity reservation group" from the list; the capacity reservation groups that match the VM will appear in a drop-down menu below. Select the CRG that the VM should use and click "Apply". If you are using an Infrastructure-as-Code approach such as Bicep or Terraform, an Azure VM is linked to a CRG by specifying the resource ID of the CRG in the appropriate property on the VM definition.

Impact of associating a virtual machine with an ODCR:

- If the VM is not running, the change takes effect immediately.
- If the VM is running and has no zone assignment (a "regional" VM), it must be stopped and restarted for the protection of the ODCR to apply.
- If the VM is running and has a zone assignment, the change is immediate and there is no disruption to the VM.

Where an ODCR is not the right answer

ODCRs are most effective when used to protect VMs that must always be running because they provide essential services: AD domain controllers, firewall or load balancer appliances, database servers, integration servers that support workflows and the like. The primary thing to keep in mind is the cost impact of ODCRs and whether they are necessary for the service to function. Environments where machines come and go frequently, such as scale in/out setups used to minimize cost, are not ideal for ODCRs.
For example, if you have a pool of app servers configured for scale-out, using ODCRs to cover the entire size of the pool means you would be paying for all of the machines whether they are actually online or not. A possible approach in a scale-out environment is to determine the minimum number of VMs necessary for the service to be available -- even in a degraded state -- and use an ODCR to protect that number of instances. This way you can be confident that at least that many machines in the pool will always be able to run, even if an attempt to scale out fails.

Working with On-Demand Capacity Reservations (and three interesting behaviors you should know about)

This section discusses some ins and outs of working with ODCRs in your environment, especially if you need to apply them to existing machines. This is a common scenario when you are attempting to improve the resiliency of a set of VMs against impacts from maintenance, outages or other situations that may cause VMs to restart.

"Associated" vs. "Allocated"

A capacity reservation group will always have ownership of some number of "parking spots" within a region. The number that it holds is referred to as the reservation's capacity, which is expressed as a number of allocated instances. When you link a VM to a CRG, the VM becomes associated with the CRG and can take advantage of the protection offered by the matching reservations it contains.

It is possible to associate more VMs with a CRG than it has allocated capacity for. This is called overallocation. When a CRG is overallocated, the VMs associated with it are protected on a first-come-first-served basis according to when they were started. If, for example, four VMs are associated with a CRG but the CRG only has an allocated capacity of two, the two associated machines that started first receive protection and the others do not.
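The first-come-first-served behavior can be sketched as a small function. This is a conceptual illustration only; Azure does not expose its internal bookkeeping this way, and the VM names are hypothetical:

```python
def protected_vms(allocated_capacity: int, start_times: dict) -> set:
    """Return the associated VMs that actually receive protection:
    the earliest-started VMs, up to the reservation's allocated capacity."""
    by_start = sorted(start_times, key=start_times.get)  # earliest start first
    return set(by_start[:allocated_capacity])

# Four associated VMs, but an allocated capacity of only two:
starts = {"vm-a": 1, "vm-b": 2, "vm-c": 3, "vm-d": 4}
print(sorted(protected_vms(2, starts)))  # ['vm-a', 'vm-b'] -- the first two started
```

Increasing the allocated capacity to four would bring the remaining two VMs under protection, which is exactly the trick exploited in the next section.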
"Interesting" On-Demand Capacity Reservation behavior #1

Here is the first of three interesting behaviors you can use to your advantage when working with ODCRs: you can add a running VM to a capacity reservation group. As mentioned previously, if the VM is zonal the change is immediate and nondisruptive; if the VM is regional, it must be stopped and restarted for the change to take effect. This is conceptually different from other Azure mechanisms used for resiliency such as availability sets. You can only add a VM to an availability set at the time the VM is created, but you can add or remove a VM from a capacity reservation group at any time, whether the VM is running or not.

"Interesting" On-Demand Capacity Reservation behavior #2

Interesting behavior #2 is deceptively simple: when creating a reservation, you can specify a capacity (number of allocated instances) of zero. This should always succeed because Azure needs to take no action to fulfill it -- it is just a metadata adjustment for the reservation within the CRG. This may not seem terribly useful at first glance, but keep reading.

"Interesting" On-Demand Capacity Reservation behavior #3

If the number of associated VMs is higher than the allocated capacity of the reservation, you can increase the capacity of the reservation to cover the running VMs. Why does this work? Because a running VM, by definition, already has a hypervisor allocation -- a "parking spot" -- so Azure doesn't need to find one for it. Azure can simply link the capacity reservation to the hypervisor slot the running VM is using.

The payoff! Or, using these three behaviors to your advantage

Because ODCRs are relatively new and have not yet been widely adopted, a common finding from field resiliency assessments of running workloads is that the VMs supporting the workload need ODCRs applied to them. In large environments there may be dozens or even hundreds of VMs that need to be protected.
The process can seem daunting to a technical team that is not familiar with ODCRs. Thankfully, these three behaviors make it possible to protect any number of running machines with a very high probability of success -- and zero disruption if they are zonal VMs -- by proceeding in this order:

1. Create a CRG with a reservation for the region, availability zone and VM type of the machine(s) to be covered, with a quantity of zero. (Interesting behavior #2)
2. Associate the VMs with the capacity reservation group. At this point the CRG is overallocated, so the machines are not yet protected. Remember that if the VMs are regional, a restart is required to finalize the ODCR assignment. (Interesting behavior #1)
3. Update the reservation within the CRG to increase the number of allocated instances to match the number of running VMs. (Interesting behavior #3)

When the number of instances on the reservation is equal to or higher than the number of VMs associated with it, all of the associated VMs are protected and you're done!

Final thoughts

This leads to a final piece of advice about working with ODCRs, especially when you know that capacity is a challenge in the target region: as a field CSA, I recommend that you bring VMs online first, then apply a capacity reservation to them. Why? If you already have a set of running VMs that need to be protected, following what seems like the obvious process -- creating a CRG, creating reservations within it for the correct number of instances and then associating the VMs with the reservation -- risks failure at the step of creating the ODCR, because Azure needs to find and allocate additional hypervisor slots for the reservation to own. This can be challenging when there is a lot of demand for the VM type.
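The zero-capacity workflow described above can be sketched as a toy simulation. This is a conceptual model of the state transitions, not the Azure API; all names are hypothetical:

```python
class ReservationSim:
    """Toy model of one capacity reservation inside a CRG."""
    def __init__(self):
        self.capacity = 0            # step 1: created with zero allocated instances
        self.associated = []         # VMs linked to the CRG

    def associate(self, vm):         # step 2: link a (running) VM
        self.associated.append(vm)

    def set_capacity(self, n):       # step 3: grow the allocated instance count
        self.capacity = n

    @property
    def overallocated(self):
        return len(self.associated) > self.capacity

res = ReservationSim()                    # step 1: quantity zero always succeeds
for vm in ("vm-1", "vm-2", "vm-3"):
    res.associate(vm)                     # step 2: associate the running VMs
assert res.overallocated                  # overallocated -- not yet protected
res.set_capacity(len(res.associated))     # step 3: match capacity to running VMs
assert not res.overallocated              # all associated VMs now protected
```

The key property the real system shares with this sketch is that steps 1 and 2 never require Azure to find free hardware, and step 3 succeeds because the running VMs already occupy the slots being claimed.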
As the example in the previous section showed, it's much easier to protect VMs that are already online by associating them with an existing capacity reservation -- even one that doesn't yet have enough instances allocated to it -- and then increasing the capacity of the ODCR to cover the running machines.

References:

- On-Demand Capacity Reservations overview. Monitor the list of restrictions on VM eligibility, because it changes frequently.
- SLA details for On-Demand Capacity Reservations. The legal fine print is in the consolidated SLA for Online Services (.docx).
- Details about overallocating capacity reservations.
- Information on creating a Capacity Reservation Group via Bicep, Terraform or ARM template.

Announcing Microsoft Azure Network Adapter (MANA) support for Existing VM SKUs
As a leader in cloud infrastructure, Microsoft ensures that Azure's IaaS customers always have access to the latest hardware. Our goal is to consistently deliver technology that supports business-critical workloads with world-class efficiency, reliability and security. Customers benefit from cutting-edge performance enhancements and features, helping them future-proof their workloads while maintaining business continuity.

Azure will be deploying the Microsoft Azure Network Adapter (MANA) for existing VM size families, with the deployment timeline to be announced by mid-to-late April. The intent is to provide the benefits of new server hardware to customers of existing VM SKUs as they work toward migrating to newer SKUs. The deployments will be based on capacity needs and won't be restricted by region; once the hardware is available in a region, VMs can be deployed to it as needed.

Workloads on operating systems that fully support MANA will benefit from sub-second Network Interface Card (NIC) firmware upgrades, higher throughput, lower latency, increased security and Azure Boost-enabled data path accelerations. If your workload doesn't support MANA today, you'll still be able to access Azure's network on MANA-enabled SKUs, but performance will be comparable to previous-generation (non-MANA) hardware.

Check out the Azure Boost overview and the Microsoft Azure Network Adapter (MANA) overview for more detailed information and OS compatibility. To determine whether your VMs are impacted and what actions (if any) you should take, start with MANA support for existing VM SKUs. That article provides additional information about which VM sizes are eligible to be deployed on the new MANA-enabled hardware, what actions (if any) you should take, and how to determine whether your workload has been deployed on MANA-enabled hardware.

Proactive Resiliency in Azure for Specialized Workloads: a Citrix VDI on Azure Design Framework
In this post, I'll share my perspective on designing cloud architectures for near-zero downtime. We'll explore how adopting multi-region strategies and other best practices can dramatically improve reliability. The discussion will be technically and architecturally driven, covering key decisions around network architecture, data replication, user-experience continuity and cost management, but will also touch on the business angle of why this matters. The goal is to inform and inspire you to strengthen your own systems, and to guide you toward concrete actions such as engaging with Microsoft Cloud Solution Architects (CSAs), submitting workloads for resiliency reviews, and embracing multi-region design patterns.

Resilience as a Shared Responsibility

One fundamental truth in cloud architecture is that ensuring uptime is a shared responsibility between the cloud provider and you, the customer. Microsoft is responsible for the reliability of the cloud: we build and operate Azure's core infrastructure to be highly available. This includes the physical datacenters, network backbone, power and cooling, and built-in platform features for redundancy. We also provide a rich toolkit of resiliency features (availability sets, availability zones, geo-redundant storage, service failover capabilities, backup services and more) that you can leverage to increase the reliability of your workloads.

However, reliability in the cloud -- of your specific applications and data -- is up to you. You control your application architecture, deployment topology, data replication and failover strategies. If you run everything in a single region with no backups or fallbacks, even Azure's rock-solid foundation can't save you from an outage. On the other hand, if you architect smartly (using multiple regions, zones and Azure resiliency features properly), you can achieve end-to-end high availability even through major platform incidents.
In short: Microsoft ensures the cloud itself is resilient, but you must design resilience into your workload. It's a true partnership, one where both sides play a critical role in delivering robust, continuous services to end users. I emphasize this because it sets the mindset: proactive resiliency is something we do with our customers. As you'll see, Microsoft has programs and people (like CSAs) dedicated to helping you succeed in this shared model.

Six Layers of Resilient Cloud Architecture for Citrix VDI Workloads

To approach multi-region resiliency systematically, it helps to break the problem down into layers. In my work I arrived at a six-layer decision framework for designing resilient architectures. It was originally developed for a global Citrix DaaS deployment on Azure (hence some VDI flavor in the examples), but the principles apply broadly to cloud solutions. The layers ensure we cover everything from ground-up network connectivity to the operational model for failover.

1. Network Fabric (the global backbone)

Establish high-performance, low-latency links between regions. Preferred: use Global VNet Peering for simplified any-to-any connectivity with minimal latency over Microsoft's backbone (ideal for point-to-point replication traffic), rather than a more complex Azure Virtual WAN, unless your topology demands it.

2. Storage Foundation (the bedrock)

In any distributed computing environment, storage is the "heaviest" component. Moving compute (VDAs) is instantaneous; moving data (profiles, user layers) is governed by bandwidth and the speed of light. The success of a multi-region DaaS deployment hinges on the performance and synchronization of the underlying storage subsystem. Use storage that can handle cross-region workload needs, especially for user data or state. For Citrix DaaS, the preferred approach is Azure NetApp Files (ANF) for consistent sub-millisecond latency and high throughput.
ANF provides enterprise-grade performance (critical during "login storms" or peak I/O) and features like Cool Access tiering to optimize cost, outperforming standard Azure Files for this scenario.

3. User Profile & State (solving data gravity)

Enable active-active availability of user data or application state across regions. Solution: FSLogix Cloud Cache (in a VDI context) or similar distributed caching/replication technology, which allows simultaneous read/write of profile data in multiple regions. In our case, Cloud Cache insulates the user session from WAN latency by writing to a local cache and asynchronously replicating to the secondary region, overcoming the challenge of traditional file locking. The principle extends to databases or state stores: use geo-replication or distributed databases to avoid any single-region state.

4. Access & Ingress (the intelligent front door)

Ensure users connect to the right region and can fail over seamlessly. Preferred: deploy a global traffic-management solution under your control, e.g. a customer-managed NetScaler (Citrix ADC) with Global Server Load Balancing (GSLB) to direct users to the nearest available datacenter. In our design, NetScaler's GSLB uses DNS-based geo-routing and supports Local Host Cache for Citrix, meaning that even if the cloud control plane (Citrix Cloud) is unreachable, users can still connect to their desktops and apps. The general point: use Azure Front Door, Traffic Manager or third-party equivalents to steer traffic, and avoid any solution that introduces a new single point of failure in the authentication or gateway path.

5. Master Image (ensuring global consistency)

If you rely on VM images or similar artifacts, replicate them globally. Use Azure Compute Gallery (ACG) to manage and distribute images across regions.
In our case, we maintain a single "golden" image for virtual desktops: it is built once, then the Compute Gallery replicates it from West Europe to East US (and any other region) automatically. This ensures that when we scale out or recover in Region B, we are launching exactly the same app versions and OS as in Region A. Consistency here prevents failover from causing functionality regressions.

6. Operations & Cost (smart economics at scale)

Run an efficient DR strategy: you want readiness without paying double all the time. Approach: warm standby with autoscaling. The secondary region isn't serving full traffic during normal operations (some resources can be scaled down or even deallocated), but it can scale up rapidly when needed. For our scenario, we leverage Citrix Autoscale to keep the DR site in a minimal state: only a small buffer of machines is powered on, just enough to handle a sudden failover until load-based scaling brings up the rest. This "active/passive" model (hot-warm rather than hot-hot) strikes a balance: you pay only for what you use, yet you can meet your RTO (Recovery Time Objective) because resources spin up automatically on trigger. In cloud-native terms, you might use Azure Automation or scale sets to similar effect. The key is to avoid an idle full-duplicate environment incurring full costs 24x7 while still being prepared.

Each of these layers corresponds to critical architectural choices that determine your overall resiliency. Neglect any one layer, and that's where Murphy's Law will strike next. For example, you might perfectly replicate your data across regions, but if you forgot about network connectivity, a regional hub outage could still cut off access. Or you might have every system duplicated, but if users can't be rerouted to the backup region in time, the benefit is lost. The six-layer framework helps make sure we cover all the bases.
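As a small illustration of the layer-4 logic, a GSLB-style decision boils down to "send the user to the nearest healthy region, and fail over otherwise". The sketch below is conceptual only -- it is not NetScaler configuration, and the region names and latency figures are hypothetical:

```python
def pick_region(latency_ms, healthy):
    """Return the lowest-latency healthy region for a user, or None if all are down."""
    candidates = [region for region in latency_ms if healthy.get(region)]
    return min(candidates, key=latency_ms.get) if candidates else None

# Latencies as measured from one user's location:
latency = {"westeurope": 12.0, "eastus": 85.0}

print(pick_region(latency, {"westeurope": True, "eastus": True}))   # westeurope
print(pick_region(latency, {"westeurope": False, "eastus": True}))  # eastus (failover)
```

Real GSLB implementations answer this question at DNS-resolution time, which is what makes the failover transparent to the user's session launch.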
Notably, these design best practices align very closely with Azure's Well-Architected Framework (especially the Reliability pillar), and they are exactly the kind of prescriptive guidance we provide through programs like the Proactive Resiliency Initiative (PRI). In fact, the PRI playbook prioritizes these same steps for customers:

First, harden the network foundation: ensure ExpressRoute gateways are zone-redundant and circuits are "multi-homed" in at least two locations, so no single datacenter failure breaks connectivity.

Next, address in-region resiliency: make sure critical workloads are distributed across availability zones and are not vulnerable to a single-zone outage. (As an aside, Microsoft's internal data shows a huge payoff here: when we configured our top Azure services for zonal resilience, we saw a 68% reduction in platform outages that lead to support incidents.)

Then, enable multi-region continuity (BCDR): for tier-0 and tier-1 workloads, set up cross-regional failover so that even a region-wide disruption won't take you down. Multi-region is the complement to, not a substitute for, zonal design: it's about surviving the "black swan" of a region-level event, and also about supporting geo-distributed users and future growth.

In other words, if you follow the six-layer approach, you're doing exactly what our structured resiliency programs recommend.

Join Microsoft as we share more on Maia 200 in the Bay Area
In the next major step in our AI infrastructure evolution, last week we introduced Maia 200, a breakthrough inference accelerator engineered to dramatically improve the economics of AI token generation. Microsoft engineering leaders will be showcasing our latest silicon innovation in San Francisco this month. Here are a few ways you can learn more and engage with this exciting technology and the team behind it.

Maia 200 ISSCC Whitepaper

Microsoft has submitted a technical whitepaper on Maia 200 as part of the International Solid-State Circuits Conference (ISSCC). The paper will be released on Friday, February 13th to ISSCC attendees and will be available digitally on IEEE after the conference. The paper is titled "Maia: A Reticle-Scale AI Accelerator" by Sherry Xu, Partner, Silicon Architecture at Microsoft.

ISSCC Session

Leadership from Azure Hardware and Systems will present a 25-minute session on Maia development titled "MAIA: A Reticle-Scale AI Accelerator" at 2:45 PM (Session 17.4), part of the wider ISSCC Session 17, "Highlighted Chip Releases for AI", which begins at 1:30 PM inside the Marriott Marquis in downtown San Francisco. During the session we will highlight our design approach for Maia 200 and walk through the architecture and implementation of Microsoft's Maia AI silicon. We'll also share how the team engineered a reticle-limited, ~750W AI SoC and the innovations that enabled a scalable, high-performance accelerator.

Join us at the Microsoft Silicon Social Event

Microsoft will also be hosting a Silicon Social in downtown San Francisco on the evening of the 17th. Maia 200 will be on site, marking its first public appearance outside of Microsoft labs and Azure datacenters, along with a selection of other Microsoft silicon hardware. Microsoft's silicon engineering leadership will be attending, and we will provide food and drink during the event.
All ISSCC attendees and others in the Bay Area silicon community are invited to register interest in attending by February 13th. Due to limited capacity, confirmed attendees will receive a follow‑up email with event details, including the venue.

Deep dive into the Maia 200 architecture
Maia 200 is a breakthrough inference architecture engineered to dramatically shift the economics of large-scale token generation. As Microsoft’s first silicon and system platform optimized specifically for AI inference, Maia 200 is built for modern reasoning and large language models, delivering the most efficient performance per dollar of any inference system deployed in Azure and representing the highest-performance chip of any custom cloud accelerator today. AI inference is increasingly defined by an efficient frontier: a curve that measures how much real-world capability and accuracy can be delivered at a given level of cost, latency, and energy. Different applications sit at different points on that frontier: interactive copilots prioritize low-latency responsiveness, batch-scale summarization and search emphasize throughput at a given cost, and advanced reasoning workloads demand sustained performance under long-context, multi-step execution. As enterprises deploy AI across these diverse scenarios, the infrastructure requirements are no longer one-size-fits-all; they require a portfolio approach that delivers the highest-performance, lowest-cost infrastructure at scale. Maia 200 reflects a core principle of AI at scale: innovation across software, silicon, systems, and datacenters is what enables us to deliver 30% better performance per dollar than the latest-generation hardware in our fleet today. As agentic applications expand in capability and adoption, this integrated approach makes infrastructure efficiency a foundational advantage.

Maia 200: Purpose‑Built for Price-Performance Inference Leadership
To meet these demands, Maia 200 introduces a new system and silicon architecture purpose-built to maximize inference efficiency.
Guided by a deep understanding of AI workloads and supported by an advanced pre-silicon environment enabling hardware/software co-design, Maia 200 incorporates a set of deliberate architectural choices that deliver industry-leading tokens per dollar and per watt. Notable architecture innovations include:
Optimized narrow-precision datapaths on the latest TSMC N3 process technology, enabling 10.1 PetaOPS of FP4 compute and positioning Maia 200 among the highest FP4-per-dollar accelerators available in any cloud.
A reimagined memory subsystem combining 272 MB of on-die SRAM with 216 GB of HBM3e delivering 7 TB/s of HBM bandwidth, servicing data-intensive operations while minimizing off-chip traffic, reducing HBM bandwidth demand, and improving overall energy efficiency.
An efficient data-movement fabric, centered around a multi-level Direct Memory Access (DMA) subsystem and a hierarchical Network-on-Chip (NoC), ensuring predictable, scalable performance for heterogeneous and memory-bound AI workloads.
A highly performant and reliable Ethernet scale-up interconnect, featuring an integrated on-die NIC with 2.8 TB/s of bidirectional bandwidth, an advanced transport protocol enabling a two-tier scale-up network, and topology optimizations that deliver high-bandwidth, low-latency communication across a cluster of 6,144 accelerators.
A closer look at Maia 200 reveals the architectural advancements and system‑level innovations purpose‑built for inference that enable its industry‑leading efficiency.

Maia 200 Architecture Overview
Maia accelerators are organized around a hierarchical micro-architecture. At the foundation of this hierarchy is the tile, the smallest autonomous unit of compute and local storage. Each tile integrates two complementary execution engines: a Tile Tensor Unit (TTU) for high-throughput matrix multiply and convolution, and a Tile Vector Processor (TVP) as a highly programmable SIMD engine.
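As a back-of-envelope illustration (my own arithmetic from the figures quoted above, not a published Maia metric), the 10.1 PetaOPS FP4 and 7 TB/s HBM numbers imply a roofline "ridge point": the arithmetic intensity at which a kernel stops being HBM-bandwidth-bound and becomes compute-bound.

```python
# Roofline balance point from the headline specs quoted above.
# A kernel whose arithmetic intensity (ops per HBM byte) falls below the
# ridge point is bandwidth-bound; above it, compute-bound.

PEAK_FP4_OPS = 10.1e15   # 10.1 PetaOPS (FP4)
HBM_BW_BYTES = 7e12      # 7 TB/s HBM bandwidth

def ridge_point(peak_ops: float, bw_bytes_per_s: float) -> float:
    """Arithmetic intensity at which compute time equals memory time."""
    return peak_ops / bw_bytes_per_s

def attainable_ops(intensity: float) -> float:
    """Classic roofline: min(peak compute, bandwidth * intensity)."""
    return min(PEAK_FP4_OPS, HBM_BW_BYTES * intensity)

print(f"ridge point: {ridge_point(PEAK_FP4_OPS, HBM_BW_BYTES):.0f} ops/byte")
```

At roughly 1,400 ops per HBM byte, most inference kernels sit far below the ridge unless data is reused on-chip, which is why the large on-die SRAM and the data-movement machinery matter so much.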
These engines are fed by multi-banked Tile SRAM (TSRAM) and a tile-level DMA subsystem that is responsible for moving data into and out of that SRAM without stalling the compute pipeline. A lightweight Tile Control Processor (TCP) runs the low-level code emitted by the software stack and orchestrates TTU and DMA work issuance, while hardware semaphores provide fine-grained synchronization between data movement and compute. Multiple tiles compose into a cluster, which introduces a second tier of shared locality and coordinated movement. Each cluster contains a large multi-banked Cluster SRAM (CSRAM) accessible across the tiles in that cluster, along with a dedicated cluster DMA subsystem that stages traffic between CSRAM and co-packaged High Bandwidth Memory (HBM). A dedicated cluster core provides the control and synchronization needed to coordinate multi-tile execution, and the full SoC is built by instantiating multiple clusters. Because building at scale requires not just peak performance but manufacturability, the architecture also incorporates redundancy schemes for both tiles and SRAM to improve yield while preserving the hierarchical programming and execution model. Maia accelerators feature a highly optimized data-movement infrastructure, centered around a Direct Memory Access (DMA) subsystem coupled with a hierarchical Network-on-Chip (NoC). The DMA engines are architected for multi-channel, high-bandwidth transfer and support 1D/2D/3D strided movement, enabling common ML tensor layouts to be moved efficiently between on-chip SRAM, HBM, and external interfaces while overlapping data movement with compute. Meanwhile, the NoC provides scalable, low-latency communication across clusters and memory subsystems and supports both unicast and multicast transfers—an important capability for distributing tensor blocks and coordinating parallel execution.
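To make the strided-movement idea concrete, here is a hedged sketch of how a 2D strided descriptor expands into flat element offsets. The field names are illustrative only, not Maia's actual DMA programming interface.

```python
# Hypothetical 2D strided DMA descriptor: copies a (rows x width) tile out
# of a larger buffer whose rows are `pitch` elements apart.

from dataclasses import dataclass

@dataclass
class Dma2D:
    base: int    # starting element offset
    width: int   # contiguous elements per row
    rows: int    # number of rows in the tile
    pitch: int   # elements between row starts in the source buffer

def expand(desc: Dma2D) -> list:
    """List every source offset the transfer touches, row-major."""
    return [desc.base + r * desc.pitch + c
            for r in range(desc.rows)
            for c in range(desc.width)]

# Copy a 2x3 tile out of a matrix stored with a row pitch of 8 elements.
tile = Dma2D(base=10, width=3, rows=2, pitch=8)
print(expand(tile))  # [10, 11, 12, 18, 19, 20]
```

A 3D descriptor adds one more (count, stride) pair, which is how whole stacks of such tiles move in a single programmed transfer.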
To further improve effective memory efficiency, Maia supports multiple narrow-precision data types as storage formats in both HBM and SRAM and employs hardware-based data casting to convert storage types to compute types at line rate so that movement and execution remain tightly coupled. For communication beyond the chip, Maia 200 integrates a high‑performance NIC and an Ethernet‑based scale‑up interconnect using an optimized AI Transport Layer (ATL) protocol to deliver scalable, low‑latency communication across nodes. The on‑die NIC provides 1.4 TB/s unidirectional (2.8 TB/s bidirectional) I/O bandwidth, eliminating the power and cost overhead of external NICs while enabling efficient scaling to 6,144 accelerators within a two‑tier scale‑up domain. ATL operates end‑to‑end over standard Ethernet, supporting a commodity, multi‑vendor switching ecosystem, while layering on innovations such as packet spraying, multipath routing, and congestion‑resistant flow control built directly into the transport layer to maximize throughput and stability.

Optimized Tensor Core for Narrow-Precision Data Types
As AI models continue to grow in size and complexity, achieving cost‑effective inference increasingly depends on exploiting narrow‑precision arithmetic and reducing memory footprints to improve performance and efficiency. Industry results consistently show that formats such as FP4 can maintain robust model accuracy for inference while significantly reducing computational and memory requirements. Maia 200 is architected from the ground up for narrow‑precision execution. Its Tile Tensor Unit (TTU) is optimized for matrix multiplication in FP8, FP6, and FP4, and supports mixed‑precision modes such as FP8 activations multiplied by FP4 weights to maximize throughput without compromising accuracy. Complementing this, the Tile Vector Processor (TVP) delivers FP8 compute alongside BF16, FP16, and FP32, providing flexibility for layers or operators that benefit from higher precision.
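As a concrete (and hedged) illustration of FP4's coarseness: the post does not specify Maia's exact FP4 encoding, but assuming an E2M1-style format like the OCP MX FP4 variant, a round-to-nearest quantizer looks like this.

```python
# Round-to-nearest quantizer for an assumed E2M1-style FP4 format.
# Representable magnitudes in E2M1: {0, 0.5, 1, 1.5, 2, 3, 4, 6}.

FP4_E2M1 = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]

def quantize_fp4(x: float) -> float:
    """Snap a value to the nearest representable FP4 magnitude, keeping sign."""
    mag = min(FP4_E2M1, key=lambda v: abs(v - abs(x)))
    return -mag if x < 0 else mag

vals = [0.3, -1.7, 2.4, 5.9, 100.0]
print([quantize_fp4(v) for v in vals])  # [0.5, -1.5, 2.0, 6.0, 6.0]
```

In practice such quantization is paired with per-block scale factors so tensor values land inside the representable range; the hardware casting described above then converts storage formats to compute formats at line rate.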
An integrated reshaper up‑converts low‑precision formats at line rate prior to computation, ensuring seamless dataflow without introducing bottlenecks. Notably, FP4 throughput on Maia 200 is 2× that of FP8 and 8× that of BF16, enabling substantial gains in tokens‑per‑second and performance‑per‑watt for inference‑centric workloads.

A Reimagined Memory Subsystem
A defining feature of Maia 200’s architecture is its advanced memory hierarchy, engineered to optimize data movement and sustain high utilization across diverse inference workloads. Maia 200 integrates 272 MB of on‑die SRAM partitioned into multi‑tier Cluster‑level SRAM (CSRAM) and Tile‑level SRAM (TSRAM). This substantial on‑die memory resource enables a wide range of low‑latency, bandwidth‑efficient data‑management strategies. Both CSRAM and TSRAM are fully software‑managed, allowing developers—or the compiler/runtime—to deterministically place and pin data for precise control of locality and movement. For example, a primary use case for CSRAM is pinning critical working sets within cluster‑local memory. Keeping frequently accessed data resident on‑chip provides predictable low‑latency access, reduces dependence on higher‑latency memory tiers, and improves deterministic execution. More broadly, the on‑die SRAM hierarchy allows programmers to buffer, stage, and pin data in ways that significantly optimize dataflow patterns across kernel types. Examples include:
GEMM kernels can retain intermediate matrix tiles in TSRAM, boosting arithmetic intensity by eliminating round‑trips to HBM or even CSRAM.
Attention kernels can pin Q/O tensors, K/V tensors, and partial Q·K products as much as possible in TSRAM, minimizing data‑movement overhead throughout the attention pipeline.
Collective‑communication kernels can buffer full payloads in CSRAM while accumulation proceeds in TSRAM, reducing pressure on HBM and preventing bandwidth collapse during multi‑node operations.
Cross‑kernel pipelines benefit from CSRAM as a transient buffer between stages, enabling tightly coupled, high‑throughput kernel chaining with fewer stalls, which is particularly valuable for workloads with high kernel density or complex operator fusion. Together, these capabilities allow Maia 200 to maintain high compute efficiency and deterministic performance, even as model architectures and sequence lengths grow increasingly demanding.

An Efficient Data‑Movement Fabric: Specialized DMA Engines and a Custom On‑Chip Interconnect
Sustained inference utilization on Maia 200 depends on the ability to move data predictably and efficiently among compute tiles, on‑die SRAM, HBM, and I/O. Because inference performance is often bounded by data movement rather than peak FLOPs, the interconnect must support high‑throughput tensor transfers (broadcast, gather, reduce, scatter) while also ensuring low‑latency delivery of synchronization and control signals. Maia 200 addresses this challenge with a custom Network‑on‑Chip (NOC) designed explicitly for inference‑centric dataflow. At the chip level, the NOC forms a mesh network spanning all clusters, tiles, memory controllers, and I/O units. It is segmented into multiple logical planes—or virtual fabrics—including a high‑bandwidth data plane for large tensor transfers and a dedicated control plane for interrupts, synchronization, and small messages. This separation ensures that latency‑critical control traffic is never blocked behind bulk data transfers, a key requirement when hundreds of tiles, DMA engines, and controllers operate concurrently. Maia 200’s on‑chip fabric introduces several inference‑oriented innovations:
Efficient HBM‑to‑cluster broadcast: Hierarchical data movement allows tensors to be fetched once from HBM and fanned out to multiple CSRAMs, avoiding redundant HBM reads and improving energy efficiency.
Localized high‑bandwidth cluster traffic: High-bandwidth cluster‑local fabrics keep the hottest data movement within the cluster, enabling common inference patterns—such as intra‑layer reductions, scratchpad exchanges, and small collectives—to complete within the cluster without repeatedly traversing global links.
Tile‑to‑tile SRAM access: Within a cluster, the fabric allows Tile DMAs and vector units to directly read and write peer tile SRAMs, enabling efficient broadcasts, reductions, and shared‑state updates without engaging HBM or CSRAM.
Quality‑of‑Service for critical traffic: QoS mechanisms in both the fabric and memory controllers prioritize urgent, low‑latency messages, such as synchronization signals or small inference outputs, ensuring they are not delayed by bulk tensor transfers.
Fail‑safe management plane: By isolating control and telemetry traffic from the data path, Maia 200 maintains a reliable, always‑available management channel—essential for recovery, coordination, and monitoring in large‑scale inference deployments.
Complementing the NOC, Maia 200 implements a hierarchy of specialized DMA engines tailored for AI dataflow. Tile DMAs handle fine‑grained transfers between TSRAM and CSRAM; Cluster DMAs shuttle data between CSRAM and HBM or across clusters; and Network DMAs manage send/receive paths for off‑chip links. This layered DMA architecture enables concurrent, overlapped transfers across memory tiers and across nodes, ensuring compute tiles remain well‑fed under diverse workload conditions. Together, the custom NOC and multi‑tier DMA hierarchy form a data‑movement subsystem purpose‑built for inference—high‑bandwidth for tensors, low‑latency for control, localized when possible, prioritized when necessary, and efficiently coordinated across the entire chip. This architecture is fundamental to Maia 200’s ability to sustain high utilization across varied and increasingly complex AI workloads.
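The payoff of overlapping DMA transfers with compute can be shown with a toy timing model (illustrative numbers, not Maia measurements): with double buffering, the DMA engines fetch tile i+1 while tile i is being processed, so the transfer cost is largely hidden.

```python
# Toy model of compute/DMA overlap. With double buffering, total time
# approaches load_first_tile + (n-1) * max(load, compute) + last_compute,
# instead of the fully serialized n * (load + compute).

def serial_time(n: int, load: float, compute: float) -> float:
    """Load and compute each tile one after the other."""
    return n * (load + compute)

def double_buffered_time(n: int, load: float, compute: float) -> float:
    """Prefetch tile i+1 while computing tile i."""
    return load + (n - 1) * max(load, compute) + compute

n, load, compute = 100, 2.0, 3.0
print(serial_time(n, load, compute))          # 500.0
print(double_buffered_time(n, load, compute)) # 302.0
```

When transfer time is shorter than compute time, the steady-state cost per tile collapses to the compute time alone, which is the "compute tiles remain well-fed" property described above.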
A Highly Performant and Reliable Two-Tier Scale-Up Interconnect with an Innovative AI Transport Layer
Maia 200 incorporates an integrated NIC and a high-performance Ethernet-based scale-up interconnect built around Microsoft’s AI Transport Layer (ATL) protocol to enable scalable, low-latency chip-to-chip communication across 6,144 Maia accelerators arranged in a two-tier topology. Scale-up networking was approached as a full-stack solution, architecting the interconnect as a set of well-defined layers co-optimized end-to-end for performance per dollar. The design emphasizes predictable latency, full bandwidth utilization, and software-defined flexibility, while leveraging the robustness and multi-vendor support of the commodity Ethernet switch ecosystem. A foundational innovation in Maia 200’s interconnect is the on-die integrated NIC and its close coupling with both the ATL protocol engine and the Network DMA. This custom, in-house network controller is engineered for very low power and area, enabling features such as packet spraying, multipath routing, and congestion-resistant flow control directly in the transport layer to maximize throughput and stability. Together, these elements enable a two-tier scale-up fabric optimized for large-scale inference workloads, providing tightly coupled communication both within and across racks. Many accelerator systems rely on all-switched scale-up fabrics, where even local tensor-parallel traffic must traverse external switches. This approach forces most collective operations onto shared switch paths, adding hop latency and power and requiring significant port and cabling over-provisioning to sustain worst-case all-to-all patterns. Maia 200 avoids these inefficiencies through the Fully Connected Quad (FCQ): groups of four accelerators connected via switchless, direct links.
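A quick sketch of the FCQ topology's bookkeeping: a fully connected group of four needs C(4,2) = 6 direct links, and the 6,144-accelerator scale-up domain decomposes into 1,536 such quads.

```python
# Enumerate the switchless point-to-point links inside one Fully
# Connected Quad, and count quads in the 6,144-accelerator domain.

from itertools import combinations

def fcq_links(group: list) -> list:
    """Every direct link inside one fully connected group."""
    return list(combinations(group, 2))

links = fcq_links([0, 1, 2, 3])
print(len(links), "direct links per quad")        # 6 direct links per quad
print(6144 // 4, "quads in the scale-up domain")  # 1536 quads
```

Six direct links per quad is what lets the hottest tensor-parallel traffic complete without ever touching a switch; only cross-quad collectives climb to the switched tier.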
This intra-node topology delivers significantly faster tensor-parallel communication without relying on an external switch and provides a superior Perf/$ and Perf/W balance for both compute and collective I/O. Beyond the FCQ domain, the switched tier extends connectivity to 6,144 accelerators, enabling very large inference models to be sharded across nodes while preserving communication efficiency—without depending on external NICs or a scale-out network. This architecture offers three major benefits:
Bandwidth optimizations and reduced overhead: High-intensity tensor-parallel traffic, KV updates, and partial activations remain localized within FCQ groups, while switches handle lighter-weight cross-domain collectives.
Multi-rack inference at scale without training-class cost: The design avoids the power, complexity, and fleet-cost burden of a scale-out network while still enabling hyperscale inference topologies under practical power envelopes.
Workload-aligned network behavior: Modern inference workloads require moderate synchronization—not the extreme all-to-all pressure of training. The two-tier architecture meets these needs without over-engineering the fabric, while still delivering high throughput and low latency for production inference deployments.
The result is a scale-up network that is high-performance, reliable, and right-sized, achieving the bandwidth, latency, and efficiency targets essential for large-scale inference while remaining cost- and power-efficient for hyperscale deployment. At the top of the scale-up hardware stack is the collective communication layer, which forms the interface between deep-learning frameworks (e.g., PyTorch, TensorFlow) and the underlying hardware. Maia 200 uses the Microsoft Collective Communication Library (MCCL), whose algorithms are co-designed with Maia’s hardware to deliver optimal scale-up performance for specific workload shapes.
Key areas of innovation in MCCL include:
Compute–I/O overlap to hide synchronization overhead and minimize pipeline bubbles.
Hierarchical collectives reducing network traffic, lowering latency, and minimizing incast.
Dynamic algorithmic selection tuned to tensor sizes and communication patterns.
I/O latency hiding through pipelined and predictive scheduling.
Together, the interconnect hardware and MCCL software deliver a tightly integrated, inference-optimized scale-up platform capable of supporting the next generation of large-scale, low-latency AI deployments.

Maia 200 System: Azure‑Integrated, Cloud‑Native by Design
The Maia 200 system is engineered as a fully Azure‑native platform, tightly integrated into the same cloud infrastructure that powers Microsoft’s largest AI and GPU fleets. At the hardware layer, Maia 200 is co‑designed with Azure’s third‑party GPU systems, adhering to a standardized rack, power, and mechanical architecture that simplifies deployment, improves serviceability, and allows heterogeneous accelerators to coexist within the same datacenter footprint. This alignment enables Azure to operate Maia 200 at hyperscale without requiring bespoke infrastructure or specialized site configurations. Thermal design is equally modular. Maia 200 supports deployments in both air- and liquid-cooled datacenters, including a second‑generation liquid‑cooling sidecar designed for high‑density racks and thermally constrained environments. This ensures broad deployability and fungibility across both legacy air-cooled and next‑generation liquid-cooled datacenters while maintaining consistent performance under sustained workloads. Operationally, Maia 200 integrates with Azure’s native control plane, inheriting the same lifecycle, availability, and reliability guarantees as other Azure compute services.
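Returning briefly to MCCL's hierarchical collectives: the core idea, reducing within each four-accelerator group first so that only one partial value per group crosses the switched tier, can be sketched generically. This is a textbook hierarchical all-reduce, not MCCL's actual algorithm or API.

```python
# Generic hierarchical all-reduce sketch: stage 1 reduces inside each
# group (the direct-linked quad), stage 2 reduces across group leaders
# over the switched tier, stage 3 broadcasts the total back to all ranks.

def hierarchical_allreduce(values: list, group: int = 4) -> list:
    # Stage 1: local reduction inside each group of `group` ranks.
    partials = [sum(values[i:i + group]) for i in range(0, len(values), group)]
    # Stage 2: reduction across group leaders (cross-group traffic is
    # one value per group instead of one value per rank).
    total = sum(partials)
    # Stage 3: broadcast the result back to every rank.
    return [total] * len(values)

ranks = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
print(hierarchical_allreduce(ranks))  # every rank sees 36.0
```

With groups of four, cross-group traffic shrinks by 4x versus a flat all-reduce, which is the incast and latency benefit the bullets above describe.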
Firmware rollouts, fault detection, and health monitoring are all performed through impactless, fleet‑wide management workflows, minimizing disruption and ensuring consistent service levels. This tight control‑plane integration also enables automated node bring‑up, safe in‑place upgrades, and coordinated multi‑rack maintenance—capabilities essential for large‑scale inference deployments. Maia 200 will be part of our heterogeneous AI infrastructure supporting multiple models, including the latest GPT-5.2 models from OpenAI, to power AI workloads in Microsoft Foundry and Microsoft 365 Copilot. It will be fully integrated into Azure, allowing models and workloads to be scheduled, partitioned, and monitored using the same tooling that supports Azure’s GPU fleets. This ensures portability across hardware types and allows service operators to optimize for perf/$, latency, or capacity without rewriting orchestration logic. Together, these system‑level capabilities make Maia 200 not just a highly efficient inference accelerator, but a cloud‑native compute building block, integrated seamlessly into Azure’s global AI infrastructure and optimized for reliable, large‑scale, multi‑tenant operation.

Maia 200 Software Stack and Developer Toolchain: A Cloud‑Native Platform for High‑Performance Inference
The Maia 200 software stack brings together a fully Azure‑integrated inference platform and a modern, developer‑oriented SDK built to deliver performance at scale. It is designed so cloud developers can adopt Maia seamlessly, leveraging familiar tooling while accessing low‑level control when needed for peak efficiency. For developers, the Maia SDK provides a comprehensive toolchain for building, optimizing, and deploying both open-source and proprietary models on Maia hardware.
Workflows begin naturally with PyTorch, and developers can choose the level of abstraction required: use the Maia Triton compiler for rapid kernel generation, rely on highly optimized kernel libraries tuned for Maia’s tile‑ and cluster‑based architecture, or target Microsoft’s Nested Parallel Language (NPL) for explicit control of data movement, SRAM placement, and parallel execution to reach near‑peak utilization. The SDK includes a full simulator, compiler pipeline, profiler, debugger, and a robust quantization and validation suite, enabling teams to prototype models before silicon availability, diagnose performance bottlenecks with fine granularity, and tune kernels for optimal execution across the Maia stack. Together, the Maia inference stack and SDK form a unified platform that accelerates model bring‑up, simplifies performance optimization, and makes high‑performance inference a first‑class, cloud‑native development experience. In conclusion, with Maia 200, we demonstrate that leadership in AI infrastructure comes from unified system and workload optimizations across the entire stack — AI models, software toolchain and orchestration, custom silicon, networking, rack‑scale architecture, and datacenter infrastructure. Maia 200 embodies this principle, delivering 30% better performance per dollar than the latest-generation hardware in our fleet today with an architecture that is purpose‑built for efficiency at scale. It represents a decisive step in advancing the world’s most capable, efficient, and scalable cloud platform, and forms the foundation for Microsoft’s AI future.

RAIDDR: Redefining Memory Reliability
Introduction
As datacenters scale to support modern digital life, so does the challenge of keeping memory reliable. Even very rare DRAM faults can translate into an unacceptable number of uncorrectable or silent errors at scale, making robust error correction (ECC) a must. Yet traditional ECC incurs increased cost, power, and memory footprint, all of which challenge cloud scalability and sustainability. Enter RAIDDR (Redundant Array of Independent Disks for Double Data Rate), Microsoft’s innovative ECC architecture, designed to meet these challenges with a 50 percent reduction in overhead.

The Problem with Traditional ECC
Historically, hyperscale memory reliability has relied on Reed-Solomon and other legacy ECC schemes. While effective, these methods come at a cost: as shown in the first slide below, there is up to ~30% memory overhead across hyperscale fleets. As new memory technologies and advanced SoCs emerge, traditional ECC methods struggle to scale due to high reliability requirements, power requirements, metadata requirements, and on-die correction mechanisms that limit flexibility for cloud providers. As shown in the second slide below, current ECC solutions for x8 devices (e.g., LPDDR5X in byte mode) do not provide cloud-level reliability. In addition, performance requirements for new memories could require even more overhead.

RAIDDR Architecture
RAIDDR flips the ECC paradigm by enabling additional error correction on the host’s memory controller. The controller handles symbol-based correction using a mix of parity, CRC, and BCH, inspired by RAID techniques from storage. RAIDDR maximizes the number of correctable failures per device, ensuring robust protection while reducing overheads. This host-centric approach makes RAIDDR adaptable for the next generation of memory technology.

Variants: Basic vs. Enhanced RAIDDR
Not all deployments require the same level of integration.
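Before comparing the variants, the storage-RAID intuition RAIDDR borrows can be shown in a few lines: one parity device lets the host reconstruct any single failed device's data. This is a minimal erasure-recovery sketch only; real RAIDDR layers CRC and BCH symbol correction on top.

```python
# RAID-style single-device erasure recovery: the parity device stores the
# XOR of all data devices, so XOR-ing the survivors with the parity
# reproduces the lost device's payload exactly.

from functools import reduce

def make_parity(devices: list) -> bytes:
    """XOR all device payloads column-wise for one codeword."""
    return bytes(reduce(lambda a, b: a ^ b, col) for col in zip(*devices))

def rebuild(survivors: list, parity: bytes) -> bytes:
    """Reconstruct the single missing device from survivors + parity."""
    return make_parity(survivors + [parity])

data = [b"\x01\x02", b"\x0f\x10", b"\xaa\x55"]
parity = make_parity(data)
# Device 1 fails; its contents come back from the other devices + parity.
recovered = rebuild([data[0], data[2]], parity)
print(recovered == data[1])  # True
```

The host-side controller in RAIDDR generalizes this idea: CRC or BCH identifies which symbols are bad, and the RAID-style redundancy supplies the replacement data.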
Basic RAIDDR relies on CRC, may not require additional bits from each die, and works seamlessly with standard DIMMs, supporting a broad array of existing hardware. Enhanced RAIDDR goes a step further: by leveraging access to additional bits available from the device, it pushes reliability and efficiency even further, using BCH to correct additional single-bit failures. Enhanced RAIDDR can achieve the reliability requirements of general-purpose cloud providers with lower memory overhead compared to traditional approaches, giving cloud architects flexibility in balancing cost and performance.

Deployment in Azure Silicon
Microsoft has already begun integrating RAIDDR into its Azure silicon stack. Developed by Microsoft with Cadence IP and compatible with LPDDR5X, RAIDDR is ready for deployment across new platforms.

Open Licensing and Ecosystem Adoption
To accelerate wide adoption of RAIDDR as a standard for memory reliability across the industry, Microsoft has released RAIDDR under the OWF CLA 0.9 open licensing model. With transparent licensing and collaborative development, we encourage broad industry engagement, from silicon and IP vendors to system integrators and cloud builders. RAIDDR is well positioned to become a standard for memory reliability across the industry, aligning with memory and SoC partner usage.

Technical Deep Dive
For engineers eager for details, RAIDDR’s encoding and decoding flows are meticulously documented (https://github.com/microsoft/BasicRAIDDR). Correction logic is designed to address a wide spectrum of failure scenarios, from single-bit errors to multi-device faults. Extensive simulations and benchmarks demonstrate RAIDDR’s ability to match ECC overhead to reliability requirements. RAIDDR can be implemented with a fraction of the gates used in traditional methods at similar or lower latency.

Conclusion
RAIDDR stands as a transformative leap in memory reliability for hyperscale environments.
It delivers robust error correction with a fraction of the traditional overhead, reducing costs and power while unlocking new efficiencies for hyperscale clouds. Looking ahead, RAIDDR’s architecture lays a foundation for next-generation memory. We invite engineers, partners, and the broader ecosystem to join us in shaping the future of cloud memory reliability.

Advancing embodied carbon measurement at scale for Microsoft Azure hardware
Introduction: Why embodied carbon in cloud hardware matters
At Microsoft, 97% of our greenhouse gas (GHG) emissions fall under the Scope 3 category, with the majority originating in our supply chain. Information and communication technology (ICT) hardware within our datacenters (e.g., servers) is a significant contributor to our supply chain emissions, making it essential to understand and reduce the embodied carbon of our Azure hardware as Microsoft and other major cloud providers pursue ambitious climate goals. Yet accurately measuring these impacts is hard, especially within a complex, global supply chain producing products that continually change and advance. The challenge is immense, and we expect that meeting our targets will require us to innovate just as quickly: developing actionable carbon accounting and metrics that drive accountability and real progress. This blog introduces our approach to addressing this challenge, and our white paper How Microsoft is advancing embodied carbon measurement at scale for Azure hardware provides a deeper look into the methodology. There will continue to be more work to do, for us and collectively with the industry, but we believe it’s a solid and actionable foundation to build on.

Our approach to scaling embodied carbon measurement
To meet the complexity and scale of Azure hardware systems in our datacenters, we developed an in‑house, process‑based lifecycle assessment (LCA) approach for cloud ICT hardware. This approach, known internally as the cloud hardware emissions methodology (CHEM), scales environmental impact modeling of Azure hardware while preserving the data resolution needed to support meaningful decarbonization.
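To make the methodology concrete, here is a toy bill-of-materials roll-up of the kind a process-based LCA pipeline automates at scale. The component list and emission factors are invented for illustration; they are not Microsoft or CHEM data.

```python
# Toy embodied-carbon roll-up: map each BOM entry to a per-unit emission
# factor and aggregate to a server-level total, then rank hotspots.

BOM = {"dram_64gb": 8, "ssd_4tb": 2, "cpu": 2, "mainboard": 1}          # quantities
KG_CO2E_PER_UNIT = {"dram_64gb": 35.0, "ssd_4tb": 120.0,                # hypothetical
                    "cpu": 90.0, "mainboard": 60.0}                     # factors

def server_footprint(bom: dict, factors: dict) -> float:
    """Total embodied carbon (kg CO2e) for one server's bill of materials."""
    return sum(qty * factors[part] for part, qty in bom.items())

def hotspots(bom: dict, factors: dict) -> list:
    """Components ranked by their contribution, largest first."""
    return sorted(bom, key=lambda p: bom[p] * factors[p], reverse=True)

total = server_footprint(BOM, KG_CO2E_PER_UNIT)
print(f"embodied carbon: {total:.0f} kg CO2e per server")
print("largest contributor:", hotspots(BOM, KG_CO2E_PER_UNIT)[0])
```

The real pipeline replaces the hand-written factors with supplier materials declarations and process-based life-cycle inventories, but the aggregation and hotspot-ranking logic is the same shape.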
At a high level, CHEM brings together:
Microsoft’s product data systems and supplier data, including full materials declarations
State-of-the-art, technology-specific semiconductor environmental impact data
Cloud‑based automated LCA software, mapping product, material, and impact data
CHEM connects internal product and supplier material data to environmental life‑cycle inventories, automates repetitive mapping steps, and keeps LCA practitioners focused on data quality and actionable insights. It allows:
Scalability and automation: modeling thousands of hardware configurations efficiently, enabling rapid, consistent, and high-resolution carbon footprinting across Microsoft’s Azure datacenters
Data quality and modularity: integration of supplier-specific data and state-of-the-art semiconductor impact data that improve accuracy, while a modular architecture supports continuous updates as new data becomes available
Actionable insights: identifying emissions hot spots deep within multi-tiered supply chains, supporting targeted decarbonization interventions and informed hardware design decisions
In essence, CHEM translates the rigor of process-based LCA into a repeatable approach that our engineering, sourcing, and sustainability teams can use to identify hotspots and track progress towards decarbonization, without sacrificing the resolution of LCA and the technical depth complex hardware demands.

What CHEM enables for Microsoft
Improving Scope 3 embodied carbon accounting
High‑granularity, process‑based data now covers over 97% of cloud server rack emissions and nearly 80% of semiconductor emissions, a key hotspot in the ICT supply chain. This enhances our annual carbon reporting and helps ensure our disclosures are more representative of real hardware configurations and their embodied carbon in our Azure datacenters.
Identifying, sizing, and tracking supply‑chain decarbonization actions
By quantifying impacts across multiple tiers of the ICT hardware supply chain, we can pinpoint where emissions occur—whether in materials, energy use, or manufacturing processes—and work with suppliers to evaluate, prioritize, and quantify the impact reduction of targeted interventions.

Supporting hardware systems architecture
System architects can use these insights to understand the embodied carbon implications of design choices for components, servers, racks, and clusters. This helps integrate carbon metrics into system‑level design alongside traditional considerations like performance and power.

Informing carbon roadmaps and long‑term planning
More actionable data strengthens Microsoft’s Scope 3 carbon emissions reduction planning, helping teams size interventions, assess tradeoffs, and identify the components—such as memory and storage—that drive the greatest share of embodied emissions.

Looking ahead
Scaling actionable embodied carbon measurement across the ICT sector requires continued improvements in data quality and standardization of carbon accounting frameworks and data exchange. Microsoft is collaborating with other hyperscalers and the ICT sector to align on shared standards for datacenter hardware LCA. Alongside partner methodologies, CHEM is helping shape Product Category Rules (PCRs) and an open, scalable LCA approach through collaborations led by the Open Compute Project (OCP) and the SEMI Semiconductor Climate Consortium (SCC). These efforts aim to harmonize carbon accounting, raise data quality, and enable consistent, actionable measurement of the embodied carbon of datacenter ICT hardware to advance global sustainability goals. Read the white paper How Microsoft is advancing embodied carbon measurement at scale for Azure hardware to dive deeper into CHEM and example applications.

Announcing Kubernetes Center (Preview) On Azure Portal
Today, we're excited to introduce the Kubernetes Center in the Azure portal, a new experience that simplifies how customers manage, monitor, and optimize Azure Kubernetes Service environments at scale. The Kubernetes Center provides a unified view across all clusters, intelligent insights, and streamlined workflows that help platform teams stay in control while enabling developers to move fast.

As Kubernetes adoption accelerates, many teams face growing challenges in managing clusters and workloads at scale. Getting a quick snapshot of what needs attention across clusters and workloads can quickly become overwhelming. Kubernetes Center is designed to change that, offering a streamlined and intuitive experience that brings the most critical Kubernetes capabilities into a single pane of glass for unified visibility and control.

What is Kubernetes Center?

- Actionable insights from the start: Kubernetes Center surfaces key issues like security vulnerabilities, cluster alerts, compliance gaps, and upgrade recommendations in a single, unified view. This helps teams focus immediately on what matters most, leading to faster resolution times, improved security posture, and greater operational clarity.
- Streamlined management experience: By bringing together AKS, AKS Automatic, Fleet Manager, and Managed Namespaces into a single experience, we've reduced the need to jump between services. Everything you need to manage Kubernetes on Azure is now organized in one consistent interface.
- Centralized Quickstarts: Whether you're getting started or troubleshooting advanced scenarios, Kubernetes Center brings relevant documentation, learning resources, and in-context help into one place so you can spend less time searching and more time building.
Azure Portal: from distinct landing experiences for AKS, Fleet Manager, and Managed Kubernetes Namespaces to a streamlined management experience. Get the big picture at a glance, then dive deeper with individual pages designed for effortless discovery.

Next Steps

Build on your momentum by exploring Kubernetes Center. Create your first AKS cluster, or deploy your first application using the Deploy Your Application flow and track your progress in real time, or check out the new experience and instantly see your existing clusters in a streamlined management view. Your feedback will help shape what comes next. Start building today with Kubernetes Center on the Azure portal!

Learn more: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

FAQ

What products from Azure are included in Kubernetes Center?
A. Kubernetes Center brings together all your Azure Kubernetes resources, such as AKS, AKS Automatic, Fleet Manager, and Managed Namespaces, into a single interface for simplified operations. Create new resources or view your existing resources in Kubernetes Center.

Does Kubernetes Center handle multi-cluster management?
A. Kubernetes Center provides a unified interface, a single pane of glass, to view and monitor all your Kubernetes resources in one place. For multi-cluster operations such as upgrading the Kubernetes version, placing cluster resources on N clusters, managing policy, and coordinating across environments, Kubernetes Fleet Manager is the solution designed to handle that complexity at scale. It enables teams to manage clusters at scale with automation, consistency, and operational control.

Does Kubernetes Center provide security and compliance insights?
A. Absolutely. When Microsoft Defender for Containers is enabled, Kubernetes Center surfaces critical security vulnerabilities and compliance gaps across your clusters.
Where can I find help and documentation?
A. All relevant documentation, Quickstarts, and learning resources are available directly within Kubernetes Center, making it easier to get support without leaving the platform. For more information: Create and Manage Kubernetes resources in the Azure portal with Kubernetes Center (preview) - Azure Kubernetes Service | Microsoft Learn

What is the status of this launch?
A. Kubernetes Center is currently in preview, offering core capabilities with more features planned for the general availability release.

What is the roadmap for GA?
A. Our roadmap includes adding new features and introducing tailored views designed for admins and developers. We also plan to enhance support for multi-cluster capabilities in Azure Fleet Manager, enabling smoother and more efficient operations within Kubernetes Center.

Announcing Cobalt 200: Azure's next cloud-native CPU
By Selim Bilgin, Corporate Vice President, Silicon Engineering, and Pat Stemen, Vice President, Azure Cobalt

Today, we're thrilled to announce Azure Cobalt 200, our next-generation Arm-based CPU designed for cloud-native workloads. Cobalt 200 is a milestone in our continued approach of optimizing every layer of the cloud stack, from silicon to software. Our design goals were to deliver full compatibility for workloads using our existing Azure Cobalt CPUs, deliver up to 50% performance improvement over Cobalt 100, and integrate the latest Microsoft security, networking and storage technologies. Like its predecessor, Cobalt 200 is optimized for common customer workloads and delivers unique capabilities for our own Microsoft cloud products. Our first production Cobalt 200 servers are now live in our datacenters, with wider rollout and customer availability coming in 2026.

Azure Cobalt 200 SoC and platform

Building on Cobalt 100: Leading Price-Performance

Our Azure Cobalt journey began with Cobalt 100, our first custom-built processor for cloud-native workloads. Cobalt 100 VMs have been generally available (GA) since October 2024, and availability has expanded rapidly to 32 Azure datacenter regions around the world. In just one year, we have been blown away by the pace at which customers have adopted the new platform and migrated their most critical workloads to Cobalt 100 for its performance, efficiency, and price-performance benefits. Cloud analytics leaders like Databricks and Snowflake are adopting Cobalt 100 to optimize their cloud footprint. The balance of compute performance and energy efficiency of Cobalt 100-based virtual machines and containers has proven ideal for large-scale data processing workloads. Microsoft's own cloud services have also rapidly adopted Azure Cobalt for similar benefits. Microsoft Teams achieved up to 45% better performance on Cobalt 100 than on its previous compute platform.
This increased performance means fewer servers are needed for the same task; for instance, Microsoft Teams media processing uses 35% fewer compute cores with Cobalt 100.

Designing Compute Infrastructure for Real Workloads

With this solid foundation, we set out to design a worthy successor: Cobalt 200. We faced a key challenge: traditional compute benchmarks do not represent the diversity of our customer workloads. Our telemetry from the wide range of workloads running in Azure, from small microservices to globally available SaaS products, did not match common hardware performance benchmarks. Existing benchmarks tend to skew toward CPU core-focused compute patterns, leaving gaps in how real-world cloud applications behave at scale when using network and storage resources. Optimizing Azure Cobalt for customer workloads required us to expand beyond these CPU core benchmarks to truly understand and model the diversity of customer workloads in Azure.

As a result, we created a portfolio of benchmarks drawn directly from the usage patterns we see in Azure, including databases, web servers, storage caches, network transactions, and data analytics. Each of our benchmark workloads includes multiple variants for performance evaluation, based on the ways our customers may use the underlying database, storage, or web-serving technology. In total, we built and refined over 140 individual benchmark variants as part of our internal evaluation suite.

With the help of our software teams, we created a complete digital-twin simulation from the silicon up: beginning with the CPU core microarchitecture, fabric, and memory IP blocks in Cobalt 200, all the way through the server design and rack topology. Then we used AI, statistical modeling, and the power of Azure to model the performance and power consumption of the 140 benchmarks against 2,800 combinations of SoC and system design parameters: core count, cache size, memory speed, server topology, SoC power, and rack configuration.
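A design-space sweep of this kind can be sketched in miniature: score every combination of a few parameter values and rank the results. The toy model below is purely illustrative; the parameter values and the performance/power formulas are invented stand-ins, not the actual digital-twin models or Cobalt 200 design data.

```python
# Illustrative sketch of a design-space sweep, not Microsoft's actual tooling.
# Parameter values and the scoring model are invented for demonstration.
from itertools import product

core_counts = [96, 112, 128, 132]
l3_cache_mb = [96, 128, 192]
mem_speeds  = [4800, 5600, 6400]   # MT/s

def simulate(cores, l3, mem):
    """Stand-in for a digital-twin run: returns (performance, power) scores."""
    perf  = cores * 1.0 + l3 * 0.2 + mem * 0.01      # toy performance model
    power = cores * 2.5 + l3 * 0.1 + mem * 0.005     # toy power model
    return perf, power

# Sweep every combination and rank by performance per watt
results = []
for cores, l3, mem in product(core_counts, l3_cache_mb, mem_speeds):
    perf, power = simulate(cores, l3, mem)
    results.append((perf / power, (cores, l3, mem)))

best_ppw, best_config = max(results)
print(f"Evaluated {len(results)} configurations; best: {best_config}")
```

The real exercise is the same loop at vastly larger scale: 140 benchmark variants against 2,800 parameter combinations, yielding the 350,000+ configuration candidates described in the article.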
This resulted in the evaluation of over 350,000 configuration candidates for the Cobalt 200 system as part of our design process. This extensive modeling and simulation helped us quickly iterate to find the optimal design point for Cobalt 200, delivering over 50% more performance than Cobalt 100 while continuing to deliver our most power-efficient platform in Azure.

Cobalt 200: Delivering Performance and Efficiency

At the heart of every Cobalt 200 server is the most advanced compute silicon in Azure: the Cobalt 200 System-on-Chip (SoC). The Cobalt 200 SoC is built around the Arm Neoverse Compute Subsystems V3 (CSS V3), the latest performance-optimized core and fabric from Arm. Each Cobalt 200 SoC includes 132 active cores with 3MB of L2 cache per core and 192MB of L3 system cache to deliver exceptional performance for customer workloads.

Power efficiency is just as important as raw performance. Energy consumption represents a significant portion of the lifetime operating cost of a cloud server. One of the unique innovations in our Azure Cobalt CPUs is individual per-core Dynamic Voltage and Frequency Scaling (DVFS). In Cobalt 200, this allows each of the 132 cores to run at a different performance level, delivering optimal power consumption no matter the workload. We are also taking advantage of the latest TSMC 3nm process, further improving power efficiency.

Security is top of mind for all of our customers and a key part of the unique innovation in Cobalt 200. We designed and built a custom memory controller for Cobalt 200, so memory encryption is on by default with negligible performance impact. Cobalt 200 also implements Arm's Confidential Compute Architecture (CCA), which supports hardware-based isolation of VM memory from the hypervisor and host OS.
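To make the per-core DVFS idea concrete, here is a toy governor that maps each core's recent utilization to its own frequency level, independently of its neighbors. In real silicon this decision is made by hardware and firmware, not Python; the thresholds and frequency levels below are invented for illustration.

```python
# Toy per-core DVFS governor: illustrative only. Real DVFS is implemented in
# hardware/firmware; these frequency levels and thresholds are hypothetical.
FREQ_LEVELS_MHZ = [1500, 2200, 3000, 3500]   # invented P-state ladder

def pick_frequency(utilization: float) -> int:
    """Map a core's recent utilization (0.0-1.0) to a frequency level."""
    if utilization < 0.25:
        return FREQ_LEVELS_MHZ[0]
    if utilization < 0.50:
        return FREQ_LEVELS_MHZ[1]
    if utilization < 0.80:
        return FREQ_LEVELS_MHZ[2]
    return FREQ_LEVELS_MHZ[3]

# With per-core DVFS, each core gets its own decision rather than one
# frequency shared across the whole socket.
core_utilization = [0.1, 0.6, 0.95, 0.3]      # sample readings for 4 cores
core_freqs = [pick_frequency(u) for u in core_utilization]
print(core_freqs)
```

The payoff of deciding per core is that lightly loaded cores can idle at low frequency and power while busy cores on the same SoC run flat out.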
When designing Cobalt 200, our benchmark workloads and design simulations revealed an interesting trend: several universal compute patterns emerged, namely compression, decompression, and encryption. Over 30% of cloud workloads made significant use of one of these common operations. Optimizing for them required a different approach than cache sizing and CPU core selection alone. We designed custom compression and cryptography accelerators, dedicated blocks of silicon on each Cobalt 200 SoC, solely for the purpose of accelerating these operations without sacrificing CPU cycles. These accelerators help reduce workload CPU consumption and overall costs. For example, by offloading compression and encryption tasks to the Cobalt 200 accelerator, Azure SQL is able to reduce its use of critical compute resources, prioritizing them for customer workloads.

Leading Infrastructure Innovation with Cobalt 200

Azure Cobalt is more than just an SoC; we are constantly optimizing and accelerating every layer of the infrastructure. The latest Azure Boost capabilities are built into the new Cobalt 200 system, significantly improving networking and remote storage performance. Azure Boost delivers increased network bandwidth and offloads remote storage and networking tasks to custom hardware, improving overall workload performance and reducing latency. Cobalt 200 systems also embed the Azure Integrated HSM (Hardware Security Module), providing customers with top-tier cryptographic key protection within Azure's infrastructure and ensuring sensitive data stays secure. The Azure Integrated HSM works with Azure Key Vault for simplified management of encryption keys, offering high availability and scalability as well as meeting FIPS 140-3 Level 3 compliance.
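The offload pattern can be sketched as a simple dispatch: route work to an accelerator when one is present and the payload is big enough to amortize the call overhead, and fall back to a CPU code path otherwise. The accelerator interface below is hypothetical and is not a real Cobalt 200 API; only the zlib CPU fallback is real.

```python
# Illustrative offload-dispatch pattern. The accelerator object and its
# interface are hypothetical; zlib provides the real CPU fallback path.
import zlib

class CompressionEngine:
    """Chooses between a (hypothetical) hardware accelerator and the CPU."""

    def __init__(self, accelerator=None, min_offload_bytes=4096):
        self.accelerator = accelerator          # None means no hardware present
        self.min_offload_bytes = min_offload_bytes

    def compress(self, data: bytes) -> bytes:
        # Offload large payloads when an accelerator is available; small
        # buffers stay on the CPU, where dispatch overhead would dominate.
        if self.accelerator and len(data) >= self.min_offload_bytes:
            return self.accelerator.compress(data)
        return zlib.compress(data)              # CPU fallback path

engine = CompressionEngine()                    # no accelerator in this sketch
payload = b"azure" * 1000
compressed = engine.compress(payload)
print(f"{len(payload)} -> {len(compressed)} bytes")
```

Either path must produce the same compressed format, which is what lets a service like a database engine adopt the accelerator transparently, freeing CPU cycles without changing its on-disk or on-wire data.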
An Azure Cobalt 200 server in a validation lab

Looking Forward to 2026

We are excited about the innovation and advanced technology in Cobalt 200 and look forward to seeing how our customers create breakthrough products and services. We're busy racking and stacking Cobalt 200 servers around the world and look forward to sharing more as we get closer to wider availability next year.

Check out Microsoft Ignite opening keynote
Read more on what's new in Azure at Ignite
Learn more about Microsoft's global infrastructure