virtual machines
Incoming Changes for Windows Server 2022 Marketplace Image Users
A new Azure Marketplace Windows Server 2022 image offer, excluding .NET 6 packages, will be available in March 2026. Migrate to the new images before June 2026. After that date, .NET 6 will no longer be patched on the legacy images, and the legacy images will begin deprecation at the same time.

Public Preview: Ephemeral OS Disk with full caching for VM/VMSS
Today, we're excited to announce the public preview of Ephemeral OS disk with full caching, a new feature designed to significantly enhance performance and reliability by utilizing local storage. This feature is ideal for IO-sensitive stateless workloads, as it eliminates the dependency on remote storage by caching the entire OS image on local storage.

Key Advantages:
- High Performance: Provides extremely high-performance OS disks with consistently fast response times.
- Reliability: Ensures high availability, making it suitable for critical workloads.

Why Full OS Caching?
Currently, Ephemeral OS disks store OS writes locally but still rely on a remote base OS image for reads. With Ephemeral OS Disk with full caching, the entire OS disk image is cached on local storage, removing the dependency on remote storage for OS disk reads. Once caching is complete, all OS disk IO is served locally. This results in:
- Consistently fast OS disk performance with low-millisecond latency
- Improved resilience during remote storage disruptions
- No impact to VM create times, as caching happens asynchronously after boot

This capability is well suited for IO-sensitive stateless workloads that need fast OS disk access, including:
- AI workloads
- Quorum-based databases
- Data analytics and real-time processing systems
- Large-scale stateless services on General Purpose VM families

These workloads benefit directly from lower OS disk latency and reduced exposure to remote storage outages.

How It Works
When full OS caching is enabled:
- The VM's local storage (cache disk, resource disk, or NVMe disk) is used to host the full OS disk
- Local storage capacity is reduced by 2× the OS disk size to accommodate OS caching
- The OS disk is cached in the background after VM boot, ensuring fast provisioning
- All OS disk IO happens on local storage, providing up to 10X better IO performance and resiliency to storage interruptions

Public Preview Availability
During public preview, Ephemeral OS disk with full caching is available for most general purpose VM SKUs (excluding 2-vCPU and 4-vCPU VMs) in 29 regions: AustraliaCentral, AustraliaCentral2, AustraliaSouthEast, BrazilSoutheast, CanadaCentral, CanadaEast, CentralIndia, CentralUSEUAP, EastAsia, GermanyWestCentral, JapanEast, JioIndiaCentral, JioIndiaWest, KoreaCentral, KoreaSouth, MalaysiaSouth, MexicoCentral, NorthEurope, NorwayWest, QatarCentral, SouthAfricaNorth, SwedenCentral, SwitzerlandWest, TaiwanNorth, UAECentral, UKSouth, UKWest, WestCentralUS, and WestIndia. We're continuing to expand support across regions and tooling as we move toward general availability.

Getting Started
Customers can enable Ephemeral OS disk with full caching when creating new VMs or VMSS by updating their ARM templates or REST API definitions and setting the enableFullCaching flag for Ephemeral OS disks.

ARM template to create VMs with full caching:

```json
"resources": [
  "name": "[parameters('virtualMachineName')]",
  "type": "Microsoft.Compute/virtualMachines",
  "apiVersion": "2025-04-01",
  ..
  ..
  "osDisk": {
    "diffDiskSettings": {
      "option": "Local",
      "placement": "ResourceDisk",
      "enableFullCaching": true
    },
    "caching": "ReadOnly",
    "createOption": "FromImage",
    "managedDisk": {
      "storageAccountType": "StandardSSD_LRS"
    }
  }
```

ARM template to create VMSS with full caching:

```json
"resources": [
  "name": "[parameters('vmssName')]",
  "type": "Microsoft.Compute/virtualMachineScaleSets",
  "apiVersion": "2025-04-01",
  ..
  ..
  "osDisk": {
    "diffDiskSettings": {
      "option": "Local",
      "placement": "ResourceDisk",
      "enableFullCaching": true
    },
    "caching": "ReadOnly",
    "createOption": "FromImage",
    "managedDisk": {
      "storageAccountType": "StandardSSD_LRS"
    }
  }
```

Your feedback during public preview will help shape the final experience.

Azure NCv6 Virtual Machines: Enhancements and GA Transition
NCv6 Virtual Machines are Azure's flexible, next-generation platform enabling both leading-edge graphics and generative AI compute workloads. Featuring NVIDIA RTX PRO 6000 Blackwell Server Edition (BSE) GPUs, Intel Xeon™ 6 "Granite Rapids" 6900P series CPUs, and a suite of Microsoft Azure technologies, NCv6 VMs are available now in Preview.

Today, we are pleased to share a series of exciting updates coming soon to Azure NCv6 that will:
- Enhance VM performance and capabilities
- Provide more VM sizes for customers to "right size" their usage
- Bring NCv6 to production readiness with a transition to General Availability
- Expand accessibility across the global Azure cloud

New VM Sizes, Features, and Performance Enhancements
In the coming weeks, Azure will debut fifteen new NCv6-series VM sizes across two sub-families for customers to choose from. The standout features introduced with the new VM sizes include:

🧩 Fractional GPU support, enabling graphics workload customers to deploy VMs with as little as 1/2 or 1/4 of an RTX PRO™ 6000. VMs with fractional GPU support also feature reduced vCPU, memory, SSD, and networking to help customers optimize costs and right-size their VMs to their workloads.

⚡ Increased vCPU per VM size (e.g., 288 vCPUs instead of 256) to provide more performance for high-end VDI workstations and better align with the Intel Xeon 6900P's triple compute tile architecture.

🛠️ General Purpose and Compute Optimized VM sizes. The former provides larger amounts of CPU memory for demanding generative AI inference and ISV CAD/CAE simulations, while the latter offers reduced memory to enable customers with less memory-intensive workloads to cost-optimize their deployments.
The new VM sizes will replace the existing three VM sizes offered in Preview, and will be available as follows:

NCv6 General Purpose VM sizes:

| Size Name | vCPUs | Memory (GB) | Networking (Mbps) | GPUs | GPU Mem (GB) | Temp Disk | NVMe Disk |
|---|---|---|---|---|---|---|---|
| Standard_NC36ds_xl_RTXPro6000_v6 | 36 | 132 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC72ds_xl_RTXPro6000_v6 | 72 | 264 | 45000 | 1/2 | 48 | 512 | 3200 |
| Standard_NC132ds_xl_RTXPro6000_v6 | 132 | 516 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC144ds_xl_RTXPro6000_v6 | 144 | 516 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC264ds_xl_RTXPro6000_v6 | 264 | 1032 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC288ds_xl_RTXPro6000_v6 | 288 | 1032 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC324ds_xl_RTXPro6000_v6 | 324 | 1284 | 180000 | 2 | 192 | 2048 | 12800 |

NCv6 Compute Optimized VM sizes:

| Size Name | vCPUs | Memory (GB) | Networking (Mbps) | GPUs | GPU Mem (GB) | Temp Disk | NVMe Disk |
|---|---|---|---|---|---|---|---|
| Standard_NC24lds_xl_RTXPro6000_v6 | 24 | 72 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC36lds_xl_RTXPro6000_v6 | 36 | 72 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC72lds_xl_RTXPro6000_v6 | 72 | 132 | 45000 | 1/2 | 48 | 512 | 3200 |
| Standard_NC132lds_xl_RTXPro6000_v6 | 132 | 264 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC144lds_xl_RTXPro6000_v6 | 144 | 264 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC264lds_xl_RTXPro6000_v6 | 264 | 516 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC288lds_xl_RTXPro6000_v6 | 288 | 516 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC324lds_xl_RTXPro6000_v6 | 324 | 648 | 180000 | 2 | 192 | 2048 | 12800 |

Note that, until the new VM sizes are available, Microsoft Learn resources will continue to reflect the currently offered VM sizes and technical specifications.

Transition to General Availability
In the coming weeks, Azure will transition the NCv6 series from Preview to General Availability (GA) status. With this transition, NCv6 VMs will become covered by the Azure Service Level Agreement (SLA) and thus ready to support production-grade deployments by customers, partners, and service providers. When the transition occurs, NCv6 VMs will be available in the Azure West US2 and Southeast Asia regions. Information on availability timing of additional regions is provided below.
Regional Expansion Across the Azure Cloud
At the beginning of Preview, NCv6 VMs debuted in the West US2 region. Since then, we have also added NCv6 VMs to the Southeast Asia region. Both regions will be part of the transition to GA status. We are pleased to share that in the coming months, covering Q3 of 2026, NCv6 VMs will also become available in the following Azure regions:
- East US
- West Europe
- East US 2
- North Europe
- South Central US
- Germany West Central
- West US
- Korea Central

Ready to build for the future with Azure NCv6? NCv6 Virtual Machines are available now in Preview. Start your production-grade AI journey today and explore the next frontier of Azure AI infrastructure. Join the Preview.

Remove Unnecessary Azure Storage Account Dependencies in VM Diagnostics
This post explains how to reduce unnecessary Azure Storage Account dependencies, and the associated SAS token usage, by simplifying VM diagnostics configurations: specifically, by removing the retiring legacy IaaS Diagnostics extension and migrating VM boot diagnostics from customer-managed Storage Accounts to Microsoft-managed storage. Using Azure Resource Graph to identify affected virtual machines at scale, the article shows that both changes can be implemented without VM reboots or guest OS impact, reduce storage sprawl and operational overhead, and help organizations stay ahead of platform deprecations. Automation options are available to standardize these improvements across environments.

Upcoming Compute API Change: Always return non-null securityType
Starting with Azure Compute API version 2025-11-01, Virtual Machines and Virtual Machine Scale Sets will always return a non-null securityType value in API responses. This post explains the behavior change, which API versions are affected, and what teams need to update in automation, validation, or post-deployment logic that relies on null checks.

Microsoft at NVIDIA GTC 2026
Microsoft returns to NVIDIA GTC 2026 in San Jose with a strong presence across conference sessions, in-booth theater talks, live demos, and executive-level ancillary events. Together with NVIDIA and our partner ecosystem, Microsoft is showcasing how Azure AI infrastructure enables AI training, inference, and production at global scale. Visit us at Booth #521 to see the latest innovations in action and connect with Azure and NVIDIA experts.

Exclusive GTC Experiences
- LEGO® Datacenter Model: Explore Azure AI infrastructure at the Park Container.
- Candy Lounge: Visit the high-traffic candy wall for co-branded treats all day long.
- Networking Lounge: Relax and recharge with comfy seating and vital charging options.
- Outdoor Juice Truck: Free, refreshing beverages served during outdoor park hours.

Sponsored Breakout Sessions

Reinventing Semiconductor Design with Microsoft Discovery (Microsoft Featured)
S82398 · Mon, Mar 16 · 4:00 PM
Speaker: Prashant Varshney, Microsoft (Semiconductor & AI Engineering)
Abstract: Semiconductor teams face exploding design complexity and shrinking verification windows. This session shows how the Microsoft Discovery AI for Science platform, combined with Synopsys Agent Engineers, introduces an agentic approach to EDA that automates routine steps and accelerates expert decision-making on Azure.

Operationalizing Agentic AI at Hyperscale (Microsoft Featured)
S82399 · Tue, Mar 17 · 1:00 PM
Speakers: Nitin Nagarkatte, Microsoft (Azure AI Infrastructure); Anand Raman, Microsoft (Azure AI); Vipul Modi, Microsoft (AI Systems)
Abstract: As enterprises move to agentic systems, the challenge shifts to operating intelligent agents reliably at scale. This session demonstrates how Microsoft builds AI Factories on Azure using NVIDIA technology and explores Microsoft Foundry as the control plane for deploying and operating coordinated AI agents.
Live from GTC: AI Podcast
Live Special Feature: A conversation with Microsoft Azure, featuring Dayan Rodriguez (Corporate Vice President, Global Manufacturing and Mobility) and Alistair Spiers (General Manager, Azure Infrastructure). Listen & Subscribe: aka.ms/GTC2026Podcast

Earned Conference Sessions
Don't miss these high-impact sessions where Microsoft and NVIDIA leaders discuss the future of AI factories and infrastructure.
- Mon, Mar 16 · 5:00 PM · Drive Optimal Tokens per Watt on AI Infrastructure Using Benchmarking Recipes — Paul Edwards, Emily Potyraj (Microsoft, NVIDIA)
- Tue, Mar 17 · 9:00 AM · Autonomous AI Factories: Technical Preview of Agent-Native Production — JP Vasseur, César Martinez Spessot (NVIDIA, Microsoft Research)
- Tue, Mar 17 · 4:00 PM · The Road to Intelligent Mobility: Vehicle GenAI — Raj Paul, Thomas Evans, Bryan Goodman (Microsoft, NVIDIA, Bosch)
- Wed, Mar 18 · 9:00 AM · Supercharging AI with Multi-Gigawatt AI Factories — Gilad Shainer, Peter Salanki, Evan Burness (NVIDIA, CoreWeave, Meta, Microsoft)

Daily Booth Theater Schedule
Visit the Microsoft Theater for lightning talks from engineering leaders and partners.

Monday, March 16
- 2:00 PM · BTH208 (NVIDIA): Accelerate AI Innovation on Azure with NVIDIA Run:ai — Rob Magno
- 2:30 PM · BTH202 (General Robotics): Models to Machines: Deploying Agentic AI in Real-World Robotics — Dinesh Narayanan
- 3:00 PM · BTH200 (Fractal Analytics): From Generalist to Enterprise-Ready: Fractal Builds Domain AI — C. Chaudhuri
- 3:30 PM · BTH109 (Microsoft): Agentic cloud ops: Smarter Operations with Azure Copilot — Jyoti Sharma
- 4:00 PM · BTH103 (Microsoft): Build a Deep Research Agent for Enterprise Data — D. Casati, A. Slutsky, H. Alkemade
- 4:30 PM · BTH205 (NetApp): Azure NetApp Files: Powering Your Data for AI Capabilities — Andy Chan
- 5:00 PM · BTH207 (NVIDIA): The Agentic Commerce Stack: Open Models on Azure — Antonio Martinez
- 5:30 PM · BTH217 (OPAQUE): Confidential AI on Azure Unlocks Sovereign AI at Scale — Aaron Fulkerson
- 6:00 PM · BTH218 (Simplismart): Making BYOC work at scale with modular inference — Amritanshu Jain
- 6:30 PM · Expo Reception

Tuesday, March 17
- 1:30 PM · BTH100 (Microsoft): From Open Weights to Enterprise Scale: Open-Source Models — Sharmila Chockalingam
- 2:00 PM · BTH212 (Personal AI): Unlocking the power of memory in Teams with Personal AI — Sam Harkness
- 2:30 PM · BTH111 (Microsoft / NVIDIA): Scalable LLM Inference on AKS Using NVIDIA Dynamo — Mohamad Al jazaery, Anton Slutsky
- 3:00 PM · BTH204 (Mistral AI): Innovate with Mistral AI on Microsoft Foundry — Ian Mathew
- 3:30 PM · BTH104 (Microsoft): GPU-Accelerated CFD at Scale: Star-CCM+ on Azure — Jason Scheffelmaer
- 4:00 PM · BTH206 (NeuBird AI): Agentic AI for Incident Response on Microsoft Azure — Grant Griffiths
- 4:30 PM · BTH101 (GitHub): Agentic DevOps: Evolving software with GitHub Copilot — Glenn Wester
- 5:00 PM · BTH209 (Rescale): Real-World AI Physics: GM & NVIDIA on Rescale — Dinal Perera
- 5:30 PM · BTH107 (Microsoft): Intro to LoRA Fine-Tuning on Azure — Christin Pohl
- 6:30 PM · Raffle

Wednesday, March 18
- 1:00 PM · BTH219 (VAST Data): Scaling AI Infrastructure on Azure with VAST Data — Jason Vallery
- 1:30 PM · BTH110 (Microsoft): Physical AI and Robotics: The Next Frontier — F. Miller, C. Souche, D. Narayanan
- 2:00 PM · BTH105 (Microsoft): Sovereign AI options with Azure Local — Kim Lam
- 2:30 PM · BTH108 (Microsoft): Automating HPC Workflows with Copilot Agents — Param Shah
- 3:00 PM · BTH102 (Microsoft): Trustworthy Multi-Agent Workflows with Microsoft Foundry — Brian Benz
- 4:00 PM · BTH106 (Microsoft): Scaling Enterprise AI on ARO with NVIDIA H100 & H200 — Lachie Evenson
- 4:30 PM · BTH211 (WEKA): Hybrid AI Data Orchestration with WEKA NeuralMesh™ — Desiree Campbell
- 5:00 PM · BTH202 (Hammerspace): NVIDIA AI Enterprise Software with NIM — Mike Bloom
- 5:30 PM · BTH203 (Kinaxis): Reimagining Global Supply Planning with Azure — Dane Henshall
- 6:00 PM · BTH214 (AT&T): Connected AI on Azure for Manufacturing — Brad Pritchett
- 6:30 PM · Raffle

Thursday, March 19
- 11:00 AM · BTH210 (Wandelbots): Physical AI: Powering Software-Defined Automation in Robotics — Marwin Kunz, Martin George
- 11:30 AM · Raffle

Explore Our Demo Pods
Visit the Microsoft booth to see our technology in action with live demonstrations across four dedicated pod areas.
- POD 1 · Azure AI Infrastructure: End-to-end AI infrastructure for training and inference at scale, featuring the latest NVIDIA GPU integrations on Azure.
- POD 2 · Microsoft Foundry: Our comprehensive platform for building, deploying, and operating agentic AI systems with enterprise reliability.
- POD 3 · Building AI Together: Showcasing joint Microsoft and NVIDIA solutions across diverse industries, from manufacturing to retail.
- POD 4 · Startups Powering AI: Discover how innovative startups are running next-generation AI workloads on the Azure platform.

Ancillary Events & Networking
Join Microsoft leadership and our partner ecosystem at these curated networking experiences.
- Sun, Mar 15 · 6:00 PM · Microsoft for Startups Executive Leadership Dinner 📍 Morton's Steakhouse, San Jose. Exclusive gathering for startup leaders and Microsoft executives.
- Mon, Mar 16 · 1:30 PM · Microsoft × NVIDIA Open Meet 📍 Signia by Hilton, International Suite. Strategic alignment session for Microsoft and NVIDIA executives.
- Mon, Mar 16 · 7:30 PM · Microsoft + NVIDIA Executive Dinner 📍 Il Fornaio, San Jose. Executive dinner for key customers and leadership teams.
- Tue, Mar 17 · 11:00 AM to 1:00 PM · Microsoft AI Luncheon: Research, Robotics, & Real-World AI 📍 Signia by Hilton, International Suite. Invite-only: a curated executive lunch exploring the journey from AI research to physical enterprise deployments in robotics and manufacturing.
- Tue, Mar 17 · 7:30 PM · Networking in AI & Tech 📍 San Pedro Square Market. Community networking mixer for Microsoft teams, partners, and customers.
- Wed, Mar 18 · 10:00 AM to 1:00 PM · AI Innovator's Circle Brunch: Powering Intelligent Systems Across the Ecosystem 📍 Il Fornaio, San Jose. Hosted by Microsoft & NVIDIA at GTC. Join us for an exclusive brunch and discussion on the intelligent ecosystem.

Enhancing Resiliency in Azure Compute Gallery
In today's cloud-driven world, ensuring the resiliency and recoverability of critical resources is top of mind for organizations of all sizes. Azure Compute Gallery (ACG) continues to evolve, introducing robust features that safeguard your virtual machine (VM) images and application artifacts. In this blog post, we'll explore two key resiliency innovations: the new Soft Delete feature (currently in preview) and zone-redundant storage (ZRS) as the default storage type for image versions. Together, these features significantly reduce the risk of data loss and improve business continuity for Azure users.

Soft Delete in Preview: A Safety Net for Your Images
Many Azure customers have struggled with accidental deletion of VM images, which disrupts workflows and causes unrecoverable data loss, often requiring users to rebuild images from scratch. Previously, removing an image from the Azure Compute Gallery was permanent, and the resulting service disruption and lengthy process of recreating the image led to customer dissatisfaction. Now, with Soft Delete (currently available in public preview), Azure introduces a safeguard that makes it easy to recover deleted images within a specified retention period.

How Soft Delete Works
When Soft Delete is enabled on a gallery, deleting an image doesn't immediately remove it from the system. Instead, the image enters a "soft-deleted" state, where it remains recoverable for up to 7 days. During this grace period, administrators can review and restore images that may have been deleted by mistake, preventing permanent loss. After the retention period expires, the platform automatically performs a hard (permanent) delete, at which point recovery is no longer possible.

Key Capabilities and User Experience
- Recovery period: Images are retained for a default period of 7 days, giving users time to identify and restore any resources deleted in error.
- Seamless recovery: Recover soft-deleted images directly from the Azure Portal or via the REST API, making it easy to integrate with automation and CI/CD pipelines.
- Role-based access: Only owners or users with the Compute Gallery Sharing Admin role at the subscription or gallery level can manage soft-deleted images, ensuring tight control over recovery and deletion operations.
- No additional cost: The Soft Delete feature is provided at no extra charge. After deletion, only one replica per region is retained, and standard storage charges apply until the image is permanently deleted.
- Comprehensive support: Soft Delete is available for Private, Direct Shared, and Community galleries. New and existing galleries can be configured to support the feature.

To enable Soft Delete, update your gallery settings via the Azure Portal or the Azure CLI. Once enabled, the "delete" operation will soft-delete images, and you can view, list, restore, or permanently remove these images as needed. Learn more about the Soft Delete feature at https://aka.ms/sigsoftdelete

Zone-Redundant Storage (ZRS) by Default
Another major resiliency enhancement in Azure Compute Gallery is the default use of zone-redundant storage (ZRS) for image versions. ZRS replicates your images across multiple availability zones within a region, ensuring that your resources remain available even if a zone experiences an outage. By defaulting to ZRS, Azure raises the baseline for image durability and access, reducing the risk of disruptions due to zone-level failures.
- Automatic redundancy: All new image versions are stored using ZRS by default, without requiring manual configuration.
- High availability: Your VM images are protected against the failure of any single availability zone within the region.
- Simplified management: Users benefit from resilient storage without the need to explicitly set up or manage storage account redundancy settings.
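For readers who want to see what the storage setting looks like when set explicitly, here is a minimal ARM template sketch of a gallery image version that pins a target region to ZRS. The gallery, image, and version names are placeholders, and the snippet assumes the `publishingProfile.targetRegions[].storageAccountType` property on gallery image versions, which accepts `Standard_ZRS`; with the default behavior described above, omitting the property should yield ZRS anyway.

```json
{
  "type": "Microsoft.Compute/galleries/images/versions",
  "apiVersion": "2025-03-03",
  "name": "myGallery/myImageDefinition/1.0.0",
  "location": "[resourceGroup().location]",
  "properties": {
    "publishingProfile": {
      "targetRegions": [
        {
          "name": "West US 2",
          "storageAccountType": "Standard_ZRS"
        }
      ]
    }
  }
}
```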
Default ZRS capability starts with API version 2025-03-03; Portal and SDK support will be added later.

Why These Features Matter
The combination of Soft Delete and ZRS by default provides Azure customers with enhanced operational reliability and data protection. Whether overseeing a suite of VM images for development and testing purposes or coordinating production deployments across multiple teams, these features offer the following benefits:
- Mitigate operational risks associated with accidental deletions or regional outages.
- Minimize downtime and reduce manual recovery processes.
- Promote compliance and security through advanced access controls and transparent recovery procedures.

To evaluate the Soft Delete feature, register for the preview and activate it on your galleries through the Azure Portal or the REST API. Please note that, during its preview phase, this capability is recommended for assessment and testing rather than for production environments. ZRS is already available out of the box, delivering image availability starting with API version 2025-03-03. For comprehensive details and step-by-step guidance on enabling and utilizing Soft Delete, please review the public specification document at https://aka.ms/sigsoftdelete

Conclusion
Azure Compute Gallery continues to push the envelope on resource resiliency. With Soft Delete (preview) offering a reliable recovery mechanism for deleted images, and ZRS by default protecting your assets against zonal failures, Azure empowers you to build and manage VM deployments with peace of mind. Stay tuned for future updates as these features evolve toward general availability.

Azure Recognized as an NVIDIA Cloud Exemplar, Setting the Bar for AI Performance in the Cloud
As AI models continue to scale in size and complexity, cloud infrastructure must deliver more than theoretical peak performance. What matters in practice is reliable, end-to-end, workload-level AI performance, where compute, networking, system software, and optimization work together to deliver predictable, repeatable results at scale. This directly translates to business value: efficient full-stack infrastructure accelerates time-to-market, maximizes ROI on GPU and cloud investments, and enables organizations to scale AI from proof-of-concept to revenue-generating products with predictable economics.

Today, Microsoft is proud to share an important milestone in partnership with NVIDIA: Azure has been validated as an NVIDIA Exemplar Cloud, becoming the first cloud provider recognized for Exemplar-class AI performance aligned with GB300-class (Blackwell generation) systems. This recognition builds on Azure's previously validated Exemplar status for H100 training workloads and reflects NVIDIA's confidence in Azure's ability to extend that rigor and performance discipline into the next generation of AI platforms.

What Is NVIDIA Exemplar Cloud?
The NVIDIA Exemplar Cloud initiative celebrates cloud platforms that demonstrate robust end-to-end AI workload performance using NVIDIA's Performance Benchmarking suite. Rather than relying on synthetic microbenchmarks, Performance Benchmarking evaluates real AI training workloads using:
- Large-scale LLM training scenarios
- Production-grade software stacks
- Optimized system and network configurations
- Workload-centric metrics such as throughput and time-to-train

Achieving Exemplar validation signals that a provider can consistently deliver world-class AI performance in the cloud, showcasing that end users are getting optimal performance value by default.
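To make the workload-centric metrics mentioned above concrete, the sketch below shows how throughput and time-to-train relate for a large training run. It is purely illustrative: the numbers are hypothetical and are not part of NVIDIA's Performance Benchmarking suite or any Azure benchmark result.

```python
# Illustrative relationship between the two workload-centric metrics named
# above: throughput (tokens/s) and time-to-train. All figures are hypothetical.

def training_throughput(tokens_per_step: float, step_time_s: float) -> float:
    """Tokens processed per second across the whole cluster."""
    return tokens_per_step / step_time_s

def time_to_train_days(total_tokens: float, tokens_per_sec: float) -> float:
    """Wall-clock days to stream the full training corpus once."""
    return total_tokens / tokens_per_sec / 86_400

# Hypothetical run: a global batch of 4M tokens completing every 2.5 seconds.
tps = training_throughput(4_000_000, 2.5)   # 1,600,000 tokens/s
days = time_to_train_days(1e12, tps)        # ~7.2 days for a 1T-token corpus

print(f"{tps:,.0f} tokens/s, {days:.1f} days")
```

The same arithmetic explains why end-to-end tuning matters: a small reduction in step time translates directly into days saved on a trillion-token run.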
Proven Exemplar Validation on H100
Azure's Exemplar Cloud journey began with publicly shared benchmarking results for H100-based training workloads, where Azure ND GPU clusters demonstrated exemplar performance using NVIDIA Performance Benchmarking recipes. Those results, published previously and validated through NVIDIA's benchmarking framework, established a proven foundation of end-to-end AI performance for large-scale, production workloads running on Azure today.

Extending Exemplar-Class AI Performance to GB300-Class Platforms
Building on the rigor and learnings from H100 validation, Microsoft has now been recognized by NVIDIA as the first cloud provider to achieve Exemplar-class performance and readiness aligned with GB300-class systems. This designation reflects NVIDIA's assessment that the same principles applied to H100, including end-to-end system tuning, networking optimization, and software alignment, are being successfully carried forward into the Blackwell generation. Rather than treating GB300 as a point solution, Azure approaches it as a continuation of a proven performance model: delivering consistent world-class AI performance in the cloud while preserving the flexibility, elasticity, and global scale customers expect.
What Enables Exemplar-Class AI Performance on Azure
Delivering Exemplar-class AI performance requires optimization across the full AI stack:

Infrastructure and Networking
- High-performance Azure ND GPU clusters with NVIDIA InfiniBand
- NUMA-aware CPU, GPU, and NIC alignment to minimize latency
- Tuned NCCL communication paths for efficient multi-GPU scaling

Software and System Optimization
- Tight integration with NVIDIA software, including Performance Benchmarking recipes and NVIDIA AI Enterprise
- Parallelism strategies aligned with large-scale LLM training
- Continuous tuning as models, workloads, and system architectures evolve

End-to-End Workload Focus
- Measuring real training performance, not isolated component metrics
- Driving repeatable improvements in application-level throughput and efficiency
- Closing the performance gap between cloud and on-premises systems, without sacrificing manageability

Together, these capabilities have enabled Azure to deliver consistent Exemplar-class AI performance across generations of NVIDIA platforms.

What This Means for Customers
For customers training and deploying advanced AI models, this milestone delivers clear benefits:
- World-class AI performance in a fully managed cloud environment
- Predictable scaling from small clusters to thousands of GPUs
- Faster time to train and improved performance per dollar
- Confidence that Azure is ready for Blackwell-class and GB300-class AI workloads

As AI workloads become more complex and reasoning-heavy, infrastructure performance increasingly determines outcomes. Azure's NVIDIA Cloud Exemplar recognition provides a clear signal: customers can build and scale next-generation AI systems on Azure without compromising on performance.

Learn More
DGX Cloud Benchmarking on Azure | Microsoft Community Hub

Azure Automated Virtual Machine Recovery: Minimizing Downtime
Co-authors: Mukhtar Ahmed, Shekhar Agrawal, Harish Luckshetty, Vinay Nagarajan, Jie Su, Sri Harsha Kanukuntla, David Maldonado, Shardul Dabholkar.

Keeping virtual machines running smoothly is essential for businesses across every industry. When a VM stays down for even a short period, the impact can cascade quickly: delayed financial transactions, stalled manufacturing lines, unavailable retail systems, or interruptions to healthcare services. This understanding led to the creation of this solution, whose primary goal is to ensure fast and reliable recovery times so customers can focus on their business priorities without worrying about manual recovery strategies.

This feature helps ensure your business Service-Level Agreements are consistently met. When a VM experiences an issue, our system springs into action within seconds, working to restore your service as quickly as possible. It automatically executes the optimal recovery strategy, all without customer intervention. The feature operates continuously in the background, monitoring the health of VMs through multiple detection mechanisms, and automatically selects the fastest recovery path based on the specific failure type.

Getting Started
The best part? Azure Automated VM Recovery requires no setup or configuration. Running quietly in the background, this service helps guarantee the highest level of recoverability and a smooth experience for every Azure customer. Your VMs are already benefiting from faster detection, smarter diagnosis, and optimized recovery strategies.

The Importance of Automated VM Recovery
Automated VM recovery is essential to keeping cloud services resilient, reliable, and interruption-free. It ensures that the moment a failure occurs, the platform responds instantly with fast detection, intelligent diagnostics, and the optimal repair action, all without requiring customer intervention.
- Better experience for customers: By minimizing VM downtime, it helps customers keep their services online, avoiding disruptions and potential business losses.
- Stronger trust in Azure: Fast, reliable recovery builds customer confidence in Azure's platform, reinforcing our reputation for dependability.
- Reduced financial impact for customers: The lower the downtime, the less time your customers will be impacted, reducing potential loss of revenue and minimizing business disruption during critical operations.
- Empowering internal teams: Automated monitoring and clear visibility into recovery metrics help teams track health, onboard easily, and identify opportunities for improvement with minimal effort.

How Azure Automated VM Recovery Works: A Three-Stage Approach
Azure automatically handles VM issues through a three-stage recovery framework: Detection, Diagnosis, and Mitigation.

Detection
From the moment a failure occurs, multiple parallel mechanisms identify issues quickly. Azure hardware devices send regular health signals, which are monitored for interruptions or degradation. At the application level, operational health is tracked via response times, error rates, and successful operations to detect software-level problems rapidly.

Diagnosis
Once an issue is detected, lightweight diagnostics determine the best recovery action without unnecessary delays. Diagnostics operate at multiple levels: host-level checks assess the underlying infrastructure, VM-level diagnostics evaluate the virtual machine state, and system-on-chip (SoC) level analysis examines hardware components. This includes network checks, resource utilization assessments, and service responsiveness tests. Detailed data is also collected for post-incident analysis, continuously improving diagnostic algorithms while active recovery proceeds.

Mitigation
Based on the diagnostics, the system automatically executes the optimal recovery strategy, starting with the least disruptive methods and escalating as needed.
Hardware failures may trigger VM migration, while software issues might be resolved with targeted service restarts. If needed, a host reset is performed while preserving virtual machine state, ensuring minimal disruption to running workloads. Post-mitigation health checks ensure full VM functionality before recovery is considered complete.

Recovery Event Annotations
Recovery Event Annotations are specialized annotations that provide detailed visibility into every stage of VM recovery, going beyond simple uptime metrics. These annotations act as custom monitoring metrics, breaking down each incident into precise time segments. For example, TTD (Time to Detect) measures the time between a VM becoming unhealthy and the system recognizing the issue, while TTDiag (Time to Diagnose) tracks the duration of diagnostic checks. By analyzing these segments, Recovery Event Annotations help identify bottlenecks, optimize recovery steps, and improve overall reliability.

Key benefits include:
- Understanding why some VMs recover faster than others.
- Identifying which diagnostics add value versus those that don't.
- Highlighting opportunities that provide a faster path of recovery.
- Enabling early detection of regressions through event annotation-driven alerts.
- Establishing a common language across Azure teams for measuring and improving downtime.

Customer Impact and Results
Azure Automated VM Recovery demonstrates our commitment to not only high availability but also rapid recovery. By minimizing downtime, it helps customers build resilient applications and maintain business continuity during unexpected failures. Over the past 18 months, this solution has cut average VM downtime by more than half, significantly enhancing reliability and customer experience. Our ongoing goal is to provide a platform where customers can deploy workloads with confidence, knowing automated recovery will minimize disruptions.
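To illustrate how the time segments described under Recovery Event Annotations fit together, here is a minimal sketch that breaks one incident into detect, diagnose, and mitigate segments. The event names, the `TTM` label for the mitigation segment, and the timestamps are all hypothetical; they are not Azure's actual telemetry schema.

```python
# Hypothetical sketch of the recovery time segments described above (TTD,
# TTDiag, plus an assumed "TTM" for mitigation). Event names and timestamps
# are illustrative only, not Azure's telemetry schema.
from datetime import datetime, timedelta

def recovery_segments(events: dict) -> dict:
    """Break one incident into per-stage durations from its event timestamps."""
    return {
        "TTD": events["detected"] - events["unhealthy"],     # time to detect
        "TTDiag": events["diagnosed"] - events["detected"],  # time to diagnose
        "TTM": events["recovered"] - events["diagnosed"],    # time to mitigate
    }

# One illustrative incident: unhealthy at 12:00:00, fully recovered 3 min later.
incident = {
    "unhealthy": datetime(2026, 3, 1, 12, 0, 0),
    "detected":  datetime(2026, 3, 1, 12, 0, 20),
    "diagnosed": datetime(2026, 3, 1, 12, 0, 50),
    "recovered": datetime(2026, 3, 1, 12, 3, 0),
}

segments = recovery_segments(incident)
total = sum(segments.values(), timedelta())
print(segments, "total downtime:", total)  # total downtime: 0:03:00
```

Summing the segments recovers total downtime, which is what makes per-segment annotations useful: a regression in any one stage shows up directly in its own metric rather than being hidden in an aggregate.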