Virtual machines

Incoming Changes for Windows Server 2022 Marketplace Image Users
A new Azure Marketplace Windows Server 2022 image offering, which excludes the .NET 6 packages, becomes available in March 2026. Migrate to the new images before June 2026. After June 2026, .NET 6 on the legacy images will no longer receive patches, and the legacy images will begin deprecation at the same time.

Public Preview: Ephemeral OS Disk with full caching for VM/VMSS
Today, we’re excited to announce the public preview of Ephemeral OS disk with full caching, a new feature designed to significantly enhance performance and reliability by utilizing local storage. This feature is ideal for IO-sensitive stateless workloads, as it eliminates dependency on remote storage by caching the entire OS image on local storage.

Key Advantages:

- High Performance: Provides extremely high-performance OS disks with consistently fast response times.
- Reliability: Ensures high availability, making it suitable for critical workloads.

Why Full OS Caching?

Currently, Ephemeral OS disks store OS writes locally but still rely on a remote base OS image for reads. With Ephemeral OS Disk with full caching, the entire OS disk image is cached on local storage, removing the dependency on remote storage for OS disk reads. Once caching is complete, all OS disk IO is served locally. This results in:

- Consistently fast OS disk performance with low-millisecond latency
- Improved resilience during remote storage disruptions
- No impact to VM create times, as caching happens asynchronously after boot

This capability is well suited for IO-sensitive stateless workloads that need fast OS disk access, including:

- AI workloads
- Quorum-based databases
- Data analytics and real-time processing systems
- Large-scale stateless services on General Purpose VM families

These workloads benefit directly from lower OS disk latency and reduced exposure to remote storage outages.

How It Works?
When full OS caching is enabled:

- The VM's local storage (cache disk, resource disk, or NVMe disk) is used to host the full OS disk
- Local storage capacity is reduced by 2× the OS disk size to accommodate OS caching
- The OS disk is cached in the background after VM boot, ensuring fast provisioning
- All OS disk IOs happen on the local storage, providing 10X better IO performance and resiliency to storage interruptions

Public Preview Availability

During public preview, Ephemeral OS disk with full caching is available for most general purpose VM SKUs (excluding 2-vCPU and 4-vCPU VMs) in 29 regions: AustraliaCentral, AustraliaCentral2, AustraliaSouthEast, BrazilSoutheast, CanadaCentral, CanadaEast, CentralIndia, CentralUSEUAP, EastAsia, GermanyWestCentral, JapanEast, JioIndiaCentral, JioIndiaWest, KoreaCentral, KoreaSouth, MalaysiaSouth, MexicoCentral, NorthEurope, NorwayWest, QatarCentral, SouthAfricaNorth, SwedenCentral, SwitzerlandWest, TaiwanNorth, UAECentral, UKSouth, UKWest, WestCentralUS, and WestIndia. We're continuing to expand support across regions and tooling as we move toward general availability.

Getting Started

Customers can enable Ephemeral OS disk with full caching when creating new VMs or VMSS by updating their ARM templates or REST API definitions and setting the enableFullCaching flag for Ephemeral OS disks.

ARM template to create VMs with full caching:

```json
"resources": [
  {
    "name": "[parameters('virtualMachineName')]",
    "type": "Microsoft.Compute/virtualMachines",
    "apiVersion": "2025-04-01",
    ..
    ..
    "osDisk": {
      "diffDiskSettings": {
        "option": "Local",
        "placement": "ResourceDisk",
        "enableFullCaching": true
      },
      "caching": "ReadOnly",
      "createOption": "FromImage",
      "managedDisk": {
        "storageAccountType": "StandardSSD_LRS"
      }
    }
```

ARM template to create VMSS with full caching:

```json
"resources": [
  {
    "name": "[parameters('vmssName')]",
    "type": "Microsoft.Compute/virtualMachineScaleSets",
    "apiVersion": "2025-04-01",
    ..
    ..
    "osDisk": {
      "diffDiskSettings": {
        "option": "Local",
        "placement": "ResourceDisk",
        "enableFullCaching": true
      },
      "caching": "ReadOnly",
      "createOption": "FromImage",
      "managedDisk": {
        "storageAccountType": "StandardSSD_LRS"
      }
    }
```

Your feedback during public preview will help shape the final experience.

Azure NCv6 Virtual Machines: Enhancements and GA Transition
NCv6 Virtual Machines are Azure's flexible, next-generation platform enabling both leading-edge graphics and generative AI compute workloads. Featuring NVIDIA RTX PRO 6000 Blackwell Server Edition (BSE) GPUs, Intel Xeon™ 6 "Granite Rapids" 6900P series CPUs, and a suite of Microsoft Azure technologies, NCv6 VMs are available now in Preview.

Today, we are pleased to share a series of exciting updates coming soon to Azure NCv6 that will:

- Enhance VM performance and capabilities
- Provide more VM sizes for customers to "right size" their usage
- Bring NCv6 to production readiness with a transition to General Availability, and
- Expand accessibility across the global Azure cloud

New VM Sizes, Features, and Performance Enhancements

In the coming weeks, Azure will debut fifteen new NCv6-series VM sizes across two different sub-families for customers to choose from. The standout features introduced with the new VM sizes include:

🧩 Fractional GPU support, enabling graphics workload customers to deploy VMs with as little as 1/2 or 1/4 of an RTX PRO™ 6000. VMs with fractional GPU support also feature reduced vCPU, memory, SSD, and networking to help customers optimize costs and right-size their VMs to their workloads.

⚡ Increased vCPU per VM size (e.g. 288 vCPUs instead of 256) to provide more performance for high-end VDI workstations and better align with the Intel Xeon 6900P's triple compute tile architecture.

🛠️ General Purpose and Compute Optimized VM sizes. The former provides larger amounts of CPU memory for demanding generative AI inference and ISV CAD/CAE simulations, while the latter offers reduced memory to enable customers with less memory-intensive workloads to cost-optimize their deployments.
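To illustrate how fractional-GPU sizes scale GPU memory with the GPU fraction, here is a minimal sketch. The helper is hypothetical (not an Azure API); the 96 GB per-GPU figure comes from the NCv6 size specifications in this post.

```python
# Illustrative sketch: GPU memory exposed to fractional NCv6 sizes, assuming
# each full RTX PRO 6000 provides 96 GB (per the NCv6 size tables).
from fractions import Fraction

FULL_GPU_MEM_GB = 96  # GPU memory of one full RTX PRO 6000 in the NCv6 tables

def gpu_mem_gb(gpu_fraction: Fraction) -> int:
    """Return GPU memory (GB) for a size with the given GPU fraction."""
    return int(FULL_GPU_MEM_GB * gpu_fraction)

assert gpu_mem_gb(Fraction(1, 4)) == 24   # quarter-GPU sizes
assert gpu_mem_gb(Fraction(1, 2)) == 48   # half-GPU sizes
assert gpu_mem_gb(Fraction(2)) == 192     # dual-GPU sizes
```

These values line up with the GPU Mem column in the size tables that follow.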
The new VM sizes will replace the existing three VM sizes offered in Preview, and be available as follows:

NCv6 - General Purpose VM sizes:

| Size Name | vCPUs | Memory (GB) | Networking (Mbps) | GPUs | GPU Mem (GB) | Temp Disk | NVMe Disk |
|---|---|---|---|---|---|---|---|
| Standard_NC36ds_xl_RTXPro6000_v6 | 36 | 132 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC72ds_xl_RTXPro6000_v6 | 72 | 264 | 45000 | 1/2 | 48 | 512 | 3200 |
| Standard_NC132ds_xl_RTXPro6000_v6 | 132 | 516 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC144ds_xl_RTXPro6000_v6 | 144 | 516 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC264ds_xl_RTXPro6000_v6 | 264 | 1032 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC288ds_xl_RTXPro6000_v6 | 288 | 1032 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC324ds_xl_RTXPro6000_v6 | 324 | 1284 | 180000 | 2 | 192 | 2048 | 12800 |

NCv6 - Compute Optimized VM sizes:

| Size Name | vCPUs | Memory (GB) | Networking (Mbps) | GPUs | GPU Mem (GB) | Temp Disk | NVMe Disk |
|---|---|---|---|---|---|---|---|
| Standard_NC24lds_xl_RTXPro6000_v6 | 24 | 72 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC36lds_xl_RTXPro6000_v6 | 36 | 72 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC72lds_xl_RTXPro6000_v6 | 72 | 132 | 45000 | 1/2 | 48 | 512 | 3200 |
| Standard_NC132lds_xl_RTXPro6000_v6 | 132 | 264 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC144lds_xl_RTXPro6000_v6 | 144 | 264 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC264lds_xl_RTXPro6000_v6 | 264 | 516 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC288lds_xl_RTXPro6000_v6 | 288 | 516 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC324lds_xl_RTXPro6000_v6 | 324 | 648 | 180000 | 2 | 192 | 2048 | 12800 |

Note that, until the new VM sizes are available, Microsoft Learn resources will continue to reflect the currently offered VM sizes and technical specifications.

Transition to General Availability

In the coming weeks, Azure will transition the NCv6-series from Preview to General Availability (GA) status. With this transition, NCv6 VMs will become covered by the Azure Service Level Agreement (SLA) and thus ready to support production-grade deployments by customers, partners, and service providers. When the transition occurs, NCv6 VMs will be available in the Azure West US2 and Southeast Asia regions. Information on availability timing of additional regions is provided below.
Regional Expansion Across the Azure Cloud

At the beginning of Preview, NCv6 VMs debuted in the West US2 region. Since then, we have also added NCv6 VMs to the Southeast Asia region. Both regions will be part of the transition to GA status. We are pleased to share that in the coming months, through Q3 of 2026, NCv6 VMs will also become available in the following Azure regions:

• East US
• East US 2
• South Central US
• West US
• West Europe
• North Europe
• Germany West Central
• Korea Central

Ready to build for the future with Azure NCv6? NCv6 Virtual Machines are available now in Preview. Start your production-grade AI journey today and explore the next frontier of Azure AI infrastructure. Join the Preview

Upcoming Compute API Change: Always return non-null securityType
Starting with Azure Compute API version 2025-11-01, Virtual Machines and Virtual Machine Scale Sets will always return a non-null securityType value in API responses. This post explains the behavior change, which API versions are affected, and what teams need to update in automation, validation, or post-deployment logic that relies on null checks.

Microsoft at NVIDIA GTC 2026
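Automation that treated a missing securityType as "Standard" will need to accept an explicit value instead. A minimal Python sketch of that null-check update follows; the response dictionary shape is simplified and the helper name is hypothetical, not part of any Azure SDK.

```python
# Hypothetical sketch: classifying a VM's security type from an API response.
# Before API version 2025-11-01, "securityType" could be absent or null for
# Standard VMs; newer API versions always return a concrete value.

def classify_security_type(vm_properties: dict) -> str:
    security_profile = vm_properties.get("securityProfile") or {}
    security_type = security_profile.get("securityType")
    if security_type is None:
        # Legacy responses omitted the field for Standard VMs; keep this
        # fallback only while older API versions are still in use.
        return "Standard"
    # Newer API versions return the value explicitly,
    # e.g. "Standard", "TrustedLaunch", or "ConfidentialVM".
    return security_type

# Newer API versions: explicit value returned
assert classify_security_type({"securityProfile": {"securityType": "TrustedLaunch"}}) == "TrustedLaunch"
# Older API versions: field absent, fallback applies
assert classify_security_type({}) == "Standard"
```

Logic that previously branched on `securityType is None` should be reviewed, since that branch will no longer be taken once tooling moves to the newer API version.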
Microsoft returns to NVIDIA GTC 2026 in San Jose with a strong presence across conference sessions, in-booth theater talks, live demos, and executive-level ancillary events. Together with NVIDIA and our partner ecosystem, Microsoft is showcasing how Azure AI infrastructure enables AI training, inference, and production at global scale. Visit us at Booth #521 to see the latest innovations in action and connect with Azure and NVIDIA experts.

Exclusive GTC Experiences

- LEGO® Datacenter Model: Explore Azure AI infrastructure at the Park Container.
- Candy Lounge: Visit the high-traffic candy wall for co-branded treats all day long.
- Networking Lounge: Relax and recharge with comfy seating and vital charging options.
- Outdoor Juice Truck: Free, refreshing beverages served during outdoor park hours.

Sponsored Breakout Sessions

Microsoft Featured: Reinventing Semiconductor Design with Microsoft Discovery (S82398 · Mon, Mar 16 · 4:00 PM)
Speaker: Prashant Varshney, Microsoft · Semiconductor & AI Engineering
Abstract: Semiconductor teams face exploding design complexity and shrinking verification windows. This session shows how the Microsoft Discovery AI for Science platform, combined with Synopsys Agent Engineers, introduces an agentic approach to EDA that automates routine steps and accelerates expert decision-making on Azure.

Microsoft Featured: Operationalizing Agentic AI at Hyperscale (S82399 · Tue, Mar 17 · 1:00 PM)
Speakers: Nitin Nagarkatte (Microsoft · Azure AI Infrastructure), Anand Raman (Microsoft · Azure AI), Vipul Modi (Microsoft · AI Systems)
Abstract: As enterprises move to agentic systems, the challenge shifts to operating intelligent agents reliably at scale. This session demonstrates how Microsoft builds AI Factories on Azure using NVIDIA technology and explores Microsoft Foundry as the control plane for deploying and operating coordinated AI agents.
Live from GTC: AI Podcast

Special Feature: A conversation with Microsoft Azure, featuring Dayan Rodriguez (Corporate Vice President, Global Manufacturing and Mobility) and Alistair Spiers (General Manager, Azure Infrastructure). Listen & Subscribe: aka.ms/GTC2026Podcast

Earned Conference Sessions

Don't miss these high-impact sessions where Microsoft and NVIDIA leaders discuss the future of AI factories and infrastructure.

- Mon, Mar 16, 5:00 PM · Drive Optimal Tokens per Watt on AI Infrastructure Using Benchmarking Recipes · Speakers: Paul Edwards, Emily Potyraj (Microsoft, NVIDIA)
- Tue, Mar 17, 9:00 AM · Autonomous AI Factories: Technical Preview of Agent-Native Production · Speakers: JP Vasseur, César Martinez Spessot (NVIDIA, Microsoft Research)
- Tue, Mar 17, 4:00 PM · The Road to Intelligent Mobility: Vehicle GenAI · Speakers: Raj Paul, Thomas Evans, Bryan Goodman (Microsoft, NVIDIA, Bosch)
- Wed, Mar 18, 9:00 AM · Supercharging AI with Multi-Gigawatt AI Factories · Speakers: Gilad Shainer, Peter Salanki, Evan Burness (NVIDIA, CoreWeave, Meta, Microsoft)

Daily Booth Theater Schedule

Visit the Microsoft Theater for lightning talks from engineering leaders and partners.

Monday, March 16
- 2:00 PM · BTH208 · NVIDIA · Accelerate AI Innovation on Azure with NVIDIA Run:ai — Rob Magno
- 2:30 PM · BTH202 · General Robotics · Models to Machines: Deploying Agentic AI in Real-World Robotics — Dinesh Narayanan
- 3:00 PM · BTH200 · Fractal Analytics · From Generalist to Enterprise-Ready: Fractal Builds Domain AI — C. Chaudhuri
- 3:30 PM · BTH109 · Microsoft · Agentic cloud ops - Smarter Operations with Azure Copilot — Jyoti Sharma
- 4:00 PM · BTH103 · Microsoft · Build a Deep Research Agent for Enterprise Data — D. Casati, A. Slutsky, H. Alkemade
- 4:30 PM · BTH205 · NetApp · Azure NetApp Files: Powering Your Data for AI Capabilities — Andy Chan
- 5:00 PM · BTH207 · NVIDIA · The Agentic Commerce Stack: Open Models on Azure — Antonio Martinez
- 5:30 PM · BTH217 · OPAQUE · Confidential AI on Azure Unlocks Sovereign AI at Scale — Aaron Fulkerson
- 6:00 PM · BTH218 · Simplismart · Making BYOC work at scale with modular inference — Amritanshu Jain
- 6:30 PM · Expo Reception

Tuesday, March 17
- 1:30 PM · BTH100 · Microsoft · From Open Weights to Enterprise Scale: Open-Source Models — Sharmila Chockalingam
- 2:00 PM · BTH212 · Personal AI · Unlocking the power of memory in Teams with Personal AI — Sam Harkness
- 2:30 PM · BTH111 · Microsoft / NVIDIA · Scalable LLM Inference on AKS Using NVIDIA Dynamo — Mohamad Al jazaery, Anton Slutsky
- 3:00 PM · BTH204 · Mistral AI · Innovate with Mistral AI on Microsoft Foundry — Ian Mathew
- 3:30 PM · BTH104 · Microsoft · GPU-Accelerated CFD at Scale: Star-CCM+ on Azure — Jason Scheffelmaer
- 4:00 PM · BTH206 · NeuBird AI · Agentic AI for Incident Response on Microsoft Azure — Grant Griffiths
- 4:30 PM · BTH101 · GitHub · Agentic DevOps: Evolving software with GitHub Copilot — Glenn Wester
- 5:00 PM · BTH209 · Rescale · Real-World AI Physics: GM & NVIDIA on Rescale — Dinal Perera
- 5:30 PM · BTH107 · Microsoft · Intro to LoRA Fine-Tuning on Azure — Christin Pohl
- 6:30 PM · Raffle

Wednesday, March 18
- 1:00 PM · BTH219 · VAST Data · Scaling AI Infrastructure on Azure with VAST Data — Jason Vallery
- 1:30 PM · BTH110 · Microsoft · Physical AI and Robotics: The Next Frontier — F. Miller, C. Souche, D. Narayanan
- 2:00 PM · BTH105 · Microsoft · Sovereign AI options with Azure Local — Kim Lam
- 2:30 PM · BTH108 · Microsoft · Automating HPC Workflows with Copilot Agents — Param Shah
- 3:00 PM · BTH102 · Microsoft · Trustworthy Multi-Agent Workflows with Microsoft Foundry — Brian Benz
- 4:00 PM · BTH106 · Microsoft · Scaling Enterprise AI on ARO with NVIDIA H100 & H200 — Lachie Evenson
- 4:30 PM · BTH211 · WEKA · Hybrid AI Data Orchestration with WEKA NeuralMesh™ — Desiree Campbell
- 5:00 PM · BTH202 · Hammerspace · NVIDIA AI Enterprise Software with NIM — Mike Bloom
- 5:30 PM · BTH203 · Kinaxis · Reimagining Global Supply Planning with Azure — Dane Henshall
- 6:00 PM · BTH214 · AT&T · Connected AI on Azure for Manufacturing — Brad Pritchett
- 6:30 PM · Raffle

Thursday, March 19
- 11:00 AM · BTH210 · Wandelbots · Physical AI: Powering Software-Defined Automation in Robotics — Marwin Kunz, Martin George
- 11:30 AM · Raffle

Explore Our Demo Pods

Visit the Microsoft booth to see our technology in action with live demonstrations across four dedicated pod areas.

- POD 1 · Azure AI Infrastructure: End-to-end AI infrastructure for training and inference at scale, featuring the latest NVIDIA GPU integrations on Azure.
- POD 2 · Microsoft Foundry: Our comprehensive platform for building, deploying, and operating agentic AI systems with enterprise reliability.
- POD 3 · Building AI Together: Showcasing joint Microsoft and NVIDIA solutions across diverse industries, from manufacturing to retail.
- POD 4 · Startups Powering AI: Discover how innovative startups are running next-generation AI workloads on the Azure platform.

Ancillary Events & Networking

Join Microsoft leadership and our partner ecosystem at these curated networking experiences. Click the location to view on Bing Maps.

- Sun, Mar 15, 6:00 PM · Microsoft for Startups Executive Leadership Dinner · 📍 Morton’s Steakhouse, San Jose. Exclusive gathering for startup leaders and Microsoft executives.
- Mon, Mar 16, 1:30 PM · Microsoft × NVIDIA Open Meet · 📍 Signia by Hilton · International Suite. Strategic alignment session for Microsoft and NVIDIA executives.
- Mon, Mar 16, 7:30 PM · Microsoft + NVIDIA Executive Dinner · 📍 Il Fornaio, San Jose. Executive dinner for key customers and leadership teams.
- Tue, Mar 17, 11:00 AM to 1:00 PM · Microsoft AI Luncheon: Research, Robotics, & Real-World AI · 📍 Signia by Hilton · International Suite. Invite-only: a curated executive lunch exploring the journey from AI research to physical enterprise deployments in robotics and manufacturing.
- Tue, Mar 17, 7:30 PM · Networking in AI & Tech · 📍 San Pedro Square Market. Community networking mixer for Microsoft teams, partners, and customers.
- Wed, Mar 18, 10:00 AM to 1:00 PM · AI Innovator’s Circle Brunch: Powering Intelligent Systems Across the Ecosystem · 📍 Il Fornaio, San Jose. Hosted by Microsoft & NVIDIA at GTC. Join us for an exclusive brunch and discussion on the intelligent ecosystem.

Enhancing Resiliency in Azure Compute Gallery
In today's cloud-driven world, ensuring the resiliency and recoverability of critical resources is top of mind for organizations of all sizes. Azure Compute Gallery (ACG) continues to evolve, introducing robust features that safeguard your virtual machine (VM) images and application artifacts. In this blog post, we'll explore two key resiliency innovations: the new Soft Delete feature (currently in preview) and Zonal Redundant Storage (ZRS) as the default storage type for image versions. Together, these features significantly reduce the risk of data loss and improve business continuity for Azure users.

The Soft Delete Feature in Preview: A Safety Net for Your Images

Many Azure customers have struggled with accidental deletion of VM images, which disrupts workflows and causes unrecoverable data loss, often requiring users to rebuild images from scratch. Previously, removing an image from the Azure Compute Gallery was permanent, resulting in customer dissatisfaction due to service disruption and the lengthy process of recreating the image. Now, with Soft Delete (currently available in public preview), Azure introduces a safeguard that makes it easy to recover deleted images within a specified retention period.

How Soft Delete Works

When Soft Delete is enabled on a gallery, deleting an image doesn't immediately remove it from the system. Instead, the image enters a "soft-deleted" state, where it remains recoverable for up to 7 days. During this grace period, administrators can review and restore images that may have been deleted by mistake, preventing permanent loss. After the retention period expires, the platform automatically performs a hard (permanent) delete, at which point recovery is no longer possible.

Key Capabilities and User Experience

- Recovery period: Images are retained for a default period of 7 days, giving users time to identify and restore any resources deleted in error.
- Seamless Recovery: Recover soft-deleted images directly from the Azure Portal or via REST API, making it easy to integrate with automation and CI/CD pipelines.
- Role-Based Access: Only owners or users with the Compute Gallery Sharing Admin role at the subscription or gallery level can manage soft-deleted images, ensuring tight control over recovery and deletion operations.
- No Additional Cost: The Soft Delete feature is provided at no extra charge. After deletion, only one replica per region is retained, and standard storage charges apply until the image is permanently deleted.
- Comprehensive Support: Soft Delete is available for Private, Direct Shared, and Community Galleries. New and existing galleries can be configured to support the feature.

To enable Soft Delete, you can update your gallery settings via the Azure Portal or use the Azure CLI. Once enabled, the "delete" operation will soft-delete images, and you can view, list, restore, or permanently remove these images as needed. Learn more about the Soft Delete feature at https://aka.ms/sigsoftdelete

Zonal Redundant Storage (ZRS) by Default

Another major resiliency enhancement in Azure Compute Gallery is the default use of Zonal Redundant Storage (ZRS) for image versions. ZRS replicates your images across multiple availability zones within a region, ensuring that your resources remain available even if a zone experiences an outage. By defaulting to ZRS, Azure raises the baseline for image durability and access, reducing the risk of disruptions due to zone-level failures.

- Automatic Redundancy: All new image versions are stored using ZRS by default, without requiring manual configuration.
- High Availability: Your VM images are protected against the failure of any single availability zone within the region.
- Simplified Management: Users benefit from resilient storage without the need to explicitly set up or manage storage account redundancy settings.
Default ZRS capability starts with API version 2025-03-03; Portal/SDK support will be added later.

Why These Features Matter

The combination of Soft Delete and ZRS by default provides Azure customers with enhanced operational reliability and data protection. Whether overseeing a suite of VM images for development and testing purposes or coordinating production deployments across multiple teams, these features offer the following benefits:

- Mitigate operational risks associated with accidental deletions or regional outages.
- Minimize downtime and reduce manual recovery processes.
- Promote compliance and security through advanced access controls and transparent recovery procedures.

To evaluate the Soft Delete feature, you may register for the preview and activate it on your galleries through the Azure Portal or REST API. Please note that, during its preview phase, this capability is recommended for assessment and testing rather than for production environments. ZRS is already available out-of-the-box, delivering image availability starting with API version 2025-03-03. For comprehensive details and step-by-step guidance on enabling and utilizing Soft Delete, please review the public specification document at https://aka.ms/sigsoftdelete

Conclusion

Azure Compute Gallery continues to push the envelope on resource resiliency. With Soft Delete (preview) offering a reliable recovery mechanism for deleted images, and ZRS by default protecting your assets against zonal failures, Azure empowers you to build and manage VM deployments with peace of mind. Stay tuned for future updates as these features evolve toward general availability.

Azure Recognized as an NVIDIA Cloud Exemplar, Setting the Bar for AI Performance in the Cloud
As AI models continue to scale in size and complexity, cloud infrastructure must deliver more than theoretical peak performance. What matters in practice is reliable, end-to-end, workload-level AI performance, where compute, networking, system software, and optimization work together to deliver predictable, repeatable results at scale. This directly translates to business value: efficient full-stack infrastructure accelerates time-to-market, maximizes ROI on GPU and cloud investments, and enables organizations to scale AI from proof-of-concept to revenue-generating products with predictable economics.

Today, Microsoft is proud to share an important milestone in partnership with NVIDIA: Azure has been validated as an NVIDIA Exemplar Cloud, becoming the first cloud provider recognized for Exemplar-class AI performance aligned with GB300-class (Blackwell generation) systems. This recognition builds on Azure’s previously validated Exemplar status for H100 training workloads and reflects NVIDIA’s confidence in Azure’s ability to extend that rigor and performance discipline into the next generation of AI platforms.

What Is NVIDIA Exemplar Cloud?

The NVIDIA Exemplar Cloud initiative celebrates cloud platforms that demonstrate robust end-to-end AI workload performance using NVIDIA’s Performance Benchmarking suite. Rather than relying on synthetic microbenchmarks, Performance Benchmarking evaluates real AI training workloads using:

- Large-scale LLM training scenarios
- Production-grade software stacks
- Optimized system and network configurations
- Workload-centric metrics such as throughput and time-to-train

Achieving Exemplar validation signals that a provider can consistently deliver world-class AI performance in the cloud, showcasing that end users are getting optimal performance value by default.
Proven Exemplar Validation on H100

Azure’s Exemplar Cloud journey began with publicly shared benchmarking results for H100-based training workloads, where Azure ND GPU clusters demonstrated exemplar performance using NVIDIA Performance Benchmarking recipes. Those results, published previously and validated through NVIDIA’s benchmarking framework, established a proven foundation of end-to-end AI performance for large-scale, production workloads running on Azure today.

Extending Exemplar-Class AI Performance to GB300-Class Platforms

Building on the rigor and learnings from H100 validation, Microsoft has now been recognized by NVIDIA as the first cloud provider to achieve Exemplar-class performance and readiness aligned with GB300-class systems. This designation reflects NVIDIA’s assessment that the same principles applied to H100, including end-to-end system tuning, networking optimization, and software alignment, are being successfully carried forward into the Blackwell generation.

Rather than treating GB300 as a point solution, Azure approaches it as a continuation of a proven performance model: delivering consistent world-class AI performance in the cloud while preserving the flexibility, elasticity, and global scale customers expect.
What Enables Exemplar-Class AI Performance on Azure

Delivering Exemplar-class AI performance requires optimization across the full AI stack:

Infrastructure and Networking
- High-performance Azure ND GPU clusters with NVIDIA InfiniBand
- NUMA-aware CPU, GPU, and NIC alignment to minimize latency
- Tuned NCCL communication paths for efficient multi-GPU scaling

Software and System Optimization
- Tight integration with NVIDIA software, including Performance Benchmarking recipes and NVIDIA AI Enterprise
- Parallelism strategies aligned with large-scale LLM training
- Continuous tuning as models, workloads, and system architectures evolve

End-to-End Workload Focus
- Measuring real training performance, not isolated component metrics
- Driving repeatable improvements in application-level throughput and efficiency
- Closing the performance gap between cloud and on-premises systems, without sacrificing manageability

Together, these capabilities enabled Azure to deliver consistent Exemplar-class AI performance across generations of NVIDIA platforms.

What This Means for Customers

For customers training and deploying advanced AI models, this milestone delivers clear benefits:

- World-class AI performance in a fully managed cloud environment
- Predictable scaling from small clusters to thousands of GPUs
- Faster time to train and improved performance per dollar
- Confidence that Azure is ready for Blackwell-class and GB300-class AI workloads

As AI workloads become more complex and reasoning-heavy, infrastructure performance increasingly determines outcomes. Azure’s NVIDIA Cloud Exemplar recognition provides a clear signal: customers can build and scale next-generation AI systems on Azure without compromising on performance.

Learn More

DGX Cloud Benchmarking on Azure | Microsoft Community Hub

Azure Automated Virtual Machine Recovery: Minimizing Downtime
Co-authors: Mukhtar Ahmed, Shekhar Agrawal, Harish Luckshetty, Vinay Nagarajan, Jie Su, Sri Harsha Kanukuntla, David Maldonado, Shardul Dabholkar.

Keeping virtual machines running smoothly is essential for businesses across every industry. When a VM stays down for even a short period, the impact can cascade quickly: delayed financial transactions, stalled manufacturing lines, unavailable retail systems, or interruptions to healthcare services. This understanding led to the creation of this solution, with its primary goal of ensuring fast and reliable recovery times so customers can focus on their business priorities without worrying about manual recovery strategies. This feature helps ensure your business Service-Level Agreements are consistently met.

When a VM experiences an issue, our system springs into action within seconds, working to restore your service as quickly as possible. It automatically executes the optimal recovery strategy, all without customer intervention. The feature operates continuously in the background, monitoring the health of VMs through multiple detection mechanisms, and automatically selects the fastest recovery path based on the specific failure type.

Getting Started

The best part? Azure Automated VM Recovery requires no setup or configuration. Running quietly in the background, this service helps guarantee the highest level of recoverability and a smooth experience for every Azure customer. Your VMs are already benefiting from faster detection, smarter diagnosis, and optimized recovery strategies.

The Importance of Automated VM Recovery

Automated VM recovery is essential to keeping cloud services resilient, reliable, and interruption-free. Automated recovery ensures that the moment a failure occurs, the platform responds instantly with fast detection, intelligent diagnostics, and the optimal repair action, all without requiring customer intervention.
- Better experience for customers: By minimizing VM downtime, it helps customers keep their services online, avoiding disruptions and potential business losses.
- Stronger trust in Azure: Fast, reliable recovery builds customer confidence in Azure’s platform, reinforcing our reputation for dependability.
- Reduced financial impact for customers: The lower the downtime, the less time your customers will be impacted, reducing potential loss of revenue and minimizing business disruption during critical operations.
- Empowering internal teams: Automated monitoring and clear visibility into recovery metrics help teams track health, onboard easily, and identify opportunities for improvement with minimal effort.

How Azure Automated VM Recovery Works: A Three-Stage Approach

Azure automatically handles VM issues through a three-stage recovery framework: Detection, Diagnosis, and Mitigation.

Detection

From the moment a failure occurs, multiple parallel mechanisms identify issues quickly. Azure hardware devices send regular health signals, which are monitored for interruptions or degradation. At the application level, operational health is tracked via response times, error rates, and successful operations to detect software-level problems rapidly.

Diagnosis

Once an issue is detected, lightweight diagnostics determine the best recovery action without unnecessary delays. Diagnostics operate at multiple levels: host-level checks assess the underlying infrastructure, VM-level diagnostics evaluate the virtual machine state, and system-on-chip (SoC) level analysis examines hardware components. This includes network checks, resource utilization assessments, and service responsiveness tests. Detailed data is also collected for post-incident analysis, continuously improving diagnostic algorithms while active recovery proceeds.

Mitigation

Based on the diagnosis, the system automatically executes the optimal recovery strategy, starting with the least disruptive methods and escalating as needed.
Hardware failures may trigger VM migration, while software issues might be resolved with targeted service restarts. If needed, a host reset is performed while preserving virtual machine state, ensuring minimal disruption to running workloads. Post-mitigation health checks ensure full VM functionality before recovery is considered complete.

Recovery Event Annotations

Recovery Event Annotations are specialized annotations that provide detailed visibility into every stage of VM recovery, going beyond simple uptime metrics. These annotations act as custom monitoring metrics, breaking down each incident into precise time segments. For example, TTD (Time to Detect) measures the time between a VM becoming unhealthy and the system recognizing the issue, while TTDiag (Time to Diagnose) tracks the duration of diagnostic checks. By analyzing these segments, Recovery Event Annotations help identify bottlenecks, optimize recovery steps, and improve overall reliability.

Key benefits include:

- Understanding why some VMs recover faster than others.
- Identifying which diagnostics add value versus those that don’t.
- Highlighting opportunities that provide a faster path of recovery.
- Enabling early detection of regressions through event annotation-driven alerts.
- Establishing a common language across Azure teams for measuring and improving downtime.

Customer Impact and Results

Azure Automated VM Recovery demonstrates our commitment to not only high availability but also rapid recovery. By minimizing downtime, it helps customers build resilient applications and maintain business continuity during unexpected failures. Over the past 18 months, this solution has cut average VM downtime by more than half, significantly enhancing reliability and customer experience.
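To make the time-segment idea concrete, here is a minimal sketch of how an incident might be broken down into TTD and TTDiag as described above. The function name, the mitigation segment ("TTM"), and the timestamps are hypothetical illustrations, not the actual annotation schema.

```python
# Illustrative sketch (hypothetical names): splitting one recovery incident
# into the time segments TTD (Time to Detect), TTDiag (Time to Diagnose),
# and a mitigation segment, plus the total downtime.
from datetime import datetime, timedelta

def recovery_segments(unhealthy_at, detected_at, diagnosed_at, mitigated_at):
    return {
        "TTD": detected_at - unhealthy_at,      # unhealthy -> issue recognized
        "TTDiag": diagnosed_at - detected_at,   # detection -> diagnosis done
        "TTM": mitigated_at - diagnosed_at,     # diagnosis -> recovery complete
        "total_downtime": mitigated_at - unhealthy_at,
    }

# Example incident with made-up timestamps
t0 = datetime(2025, 1, 1, 12, 0, 0)
segs = recovery_segments(
    t0,
    t0 + timedelta(seconds=15),
    t0 + timedelta(seconds=45),
    t0 + timedelta(seconds=180),
)
assert segs["TTD"] == timedelta(seconds=15)
assert segs["TTDiag"] == timedelta(seconds=30)
assert segs["total_downtime"] == timedelta(seconds=180)
```

Breaking downtime into additive segments like this is what lets annotation-driven alerts pinpoint which stage regressed.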
Our ongoing goal is to provide a platform where customers can deploy workloads with confidence, knowing automated recovery will minimize disruptions.

Announcing General Availability of Azure Da/Ea/Fasv7-series VMs based on AMD ‘Turin’ processors
Today, Microsoft is announcing the general availability of Azure's new AMD-based Virtual Machines (VMs) powered by 5th Gen AMD EPYC™ (Turin) processors. These VMs include general-purpose (Dasv7, Dalsv7), memory-optimized (Easv7), and compute-optimized (Fasv7, Falsv7, Famsv7) series, available with or without local disks. Azure's latest AMD-based VMs offer faster CPU performance, greater scalability, and flexible configurations, making them an ideal choice for high performance, cost efficiency, and diverse workloads.

Key improvements include up to 35% better CPU performance and price-performance compared to equivalent v6 AMD-based VMs. Workload-specific gains are significant: up to 25% for Java applications, up to 65% for in-memory cache applications, up to 80% for crypto workloads, and up to 130% for web server applications, to name a few.

Dalsv7-series VMs are cost-efficient for low-memory workloads like web servers, video encoding, and batch processing. Dasv7-series VMs suit general computing tasks such as e-commerce, web front ends, virtualization, customer relationship management (CRM) applications, and entry- to mid-range databases. Easv7-series VMs target memory-heavy workloads like enterprise applications, data warehousing, business intelligence, in-memory analytics, and more. Falsv7-, Fasv7-, and Famsv7-series VMs deliver full-core performance without Simultaneous Multithreading (SMT) for compute-intensive tasks like scientific simulations, financial modeling, and gaming.

You can now choose constrained-core VM sizes, which reduce the vCPU count by 50% or 75% while maintaining the other resources. Dasv7, Dalsv7, and Easv7 VMs now scale up to 160 vCPUs, an increase from 96 vCPUs in the previous generation. The Fasv7, Falsv7, and Famsv7 VMs, which do not include SMT, support up to 80 vCPUs, up from 64 vCPUs in the prior generation, and introduce a new 1-core option.
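The arithmetic behind constrained-core sizing is simple enough to sketch. The helper below is illustrative only (the function name is invented, and the actual constrained-core sizes offered are defined per VM series in the Azure size documentation):

```python
def constrained_core_sizes(full_vcpus: int) -> list:
    """Constrained-core options cut the vCPU count by 50% or 75%,
    while the memory, storage, and network of the full size are kept."""
    return [full_vcpus // 2, full_vcpus // 4]

# A 160-vCPU size would offer 80- and 40-vCPU constrained variants,
# useful for per-core-licensed software that needs the full memory footprint.
```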
These VMs offer a maximum boost CPU frequency of up to 4.5 GHz for faster compute-intensive operations. The new VMs deliver increased memory capacity, up to 640 GiB for Dasv7 and 1280 GiB for Easv7, making them ideal for memory-intensive workloads. They also support three memory (GiB)-to-vCPU ratios: 2:1 (Dalsv7-series, Daldsv7-series, Falsv7-series, and Faldsv7-series), 4:1 (Dasv7-series, Dadsv7-series, Fasv7-series, and Fadsv7-series), and 8:1 (Easv7-series, Eadsv7-series, Famsv7-series, and Famdsv7-series).

Remote storage performance improves with up to 20% higher IOPS and up to 50% greater throughput, while local storage offers up to 55% higher throughput. Network performance is also enhanced by up to 75% compared to the corresponding D-series and E-series v6 VMs. The new Fadsv7, Faldsv7, and Famdsv7 series introduce local disk support. The new VMs leverage Azure Boost technology to enhance performance and security, utilize the Microsoft Azure Network Adapter (MANA), and support the NVMe protocol for both local and remote disks.

The 5th Generation AMD EPYC™ processor family, based on the newest ‘Zen 5’ core, provides enhanced capabilities for these new Azure AMD-based VM series, such as AVX-512 with a full 512-bit data path for vector and floating-point operations, higher memory bandwidth, and improved instructions per clock compared to the previous generation. These updates provide the ability to handle compute-intensive tasks for AI and machine learning, scientific simulations, and financial analytics, among others. AMD Infinity Guard hardware-based security features, such as Transparent Secure Memory Encryption (TSME), continue in this generation to ensure sensitive information remains secure.

These VMs are available in the following Azure regions: Australia East, Central US, Germany West Central, Japan East, North Europe, South Central US, Southeast Asia, UK South, West Europe, West US 2, and West US 3.
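The memory-to-vCPU ratios above determine each size's memory. A quick sketch to check the headline numbers (the mapping and helper are illustrative, not an Azure API; one representative series is shown per ratio):

```python
# GiB of memory per vCPU, taken from the three ratios above
GIB_PER_VCPU = {"Dalsv7": 2, "Dasv7": 4, "Easv7": 8}

def memory_gib(series: str, vcpus: int) -> int:
    """Memory for a given size, derived from its series' ratio."""
    return GIB_PER_VCPU[series] * vcpus

# At the new 160-vCPU maximum, the 4:1 ratio yields the 640 GiB Dasv7 top size
# and the 8:1 ratio yields the 1280 GiB Easv7 top size quoted above.
```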
The large 160 vCPU Easv7-series and Eadsv7-series sizes are available in North Europe, South Central US, West Europe, and West US 2. More regions are coming in 2026. Refer to Product Availability by Region for the latest information.

Our customers have shared the benefits they've observed with these new VMs:

“Elastic enables customers to drive innovation and cost-efficiency with our observability, security, and search solutions on Azure. In our testing, Azure’s latest Daldsv7 VMs provided up to 13% better indexing throughput compared to previous generation Daldsv6 VMs, and we are looking forward to the improved performance for Elasticsearch users deploying on Azure.” — Yuvraj Gupta, Director, Product Management, Elastic

“The Easv7 series of Azure VMs offers a balanced mix of CPU, memory, storage, and network performance that suits the majority of Oracle Database configurations very well. The 80 Gbps network with the jumbo frame capability is especially helpful for efficient operation of FlashGrid Cluster with Oracle RAC on Azure VMs.” — Art Danielov, CEO, FlashGrid

"Our analysis indicates that Azure’s new AMD based v7 series Virtual Machines demonstrate significantly higher performance compared to the v6 series, particularly in single-thread ratings. This advancement is highly beneficial, as several of our critical applications, such as ArcGIS Enterprise, are single-threaded and CPU-bound. Consequently, these faster v7 series VMs have resulted in improved performance with the same number of users, evidenced by lower server utilization and faster client-side response times." — Thomas Buchmann, Senior Cloud Architect, VertiGIS

Here’s what our technology partners are saying:

“AMD and Microsoft have built one of the industry’s most successful cloud partnerships, bringing over 60 VM series to market through years of deep engineering collaboration.
With the new v7 Azure VMs powered by 5th Gen AMD EPYC processors, we’re setting a new benchmark for performance, efficiency, and scalability—giving customers the proven, leadership compute they expect from AMD in the world’s most demanding cloud environments.” — Steve Berg, Corporate Vice President and General Manager of the Server CPU Cloud Business Group at AMD

“Our collaboration with Microsoft continues to empower developers and enterprises alike. The new AMD based v7-series VMs on Azure offer a powerful foundation for the full spectrum of modern workloads, from development to production AI/ML pipelines. We are excited to support this launch, ensuring every user gets a seamless experience on Ubuntu, with the enterprise security and long-term stability of Ubuntu Pro available for their most critical systems." — Jehudi Castro-Sierra, Public Cloud Alliances Director

"The new Azure Da/Ea/Fa v7-series AMD Turin-based instances running SUSE Linux Enterprise Server provide a significant performance uplift during initial tests. They show an impressive 20%-40% increase with typical Linux kernel compilation tasks compared to the same instance sizes of the v6 series. This demonstrates the enhanced capabilities the v7 series brings to our joint customers seeking maximum efficiency and performance for their critical applications.” — Peter Schinagl, Sr. Technical Architect, SUSE

You can learn more about these latest Azure AMD-based VMs by visiting the specification pages at Dasv7-series, Dadsv7-series, Dalsv7-series, Daldsv7-series, Easv7-series, Eadsv7-series, Fasv7-series, Fadsv7-series, Falsv7-series, Faldsv7-series, Famsv7-series, Famdsv7-series, and constrained-core sizes. For pricing details, visit the Azure Virtual Machines pricing page.

These VMs support all remote disk types. See Azure managed disk types for additional details. Disk storage is billed separately.

Azure Integrated HSM (Hardware Security Module) will continue to be in preview with these VMs.
Azure Integrated HSM is an ephemeral HSM cache that enables secure key management within Azure VMs by ensuring that cryptographic keys remain protected inside a FIPS 140-3 Level 3-compliant boundary throughout their lifecycle. To explore this new feature, please sign up using the form. Have questions? Please reach us at Azure Support and our experts will be there to help you with your Azure journey.