Enhancing Resiliency in Azure Compute Gallery
In today's cloud-driven world, ensuring the resiliency and recoverability of critical resources is top of mind for organizations of all sizes. Azure Compute Gallery (ACG) continues to evolve, introducing robust features that safeguard your virtual machine (VM) images and application artifacts. In this blog post, we'll explore two key resiliency innovations: the new Soft Delete feature (currently in preview) and Zonal Redundant Storage (ZRS) as the default storage type for image versions. Together, these features significantly reduce the risk of data loss and improve business continuity for Azure users.

The Soft Delete Feature in Preview: A Safety Net for Your Images

Many Azure customers have struggled with accidental deletion of VM images, which disrupts workflows and causes unrecoverable data loss, often forcing users to rebuild images from scratch. Previously, removing an image from the Azure Compute Gallery was permanent, resulting in customer dissatisfaction due to service disruption and the lengthy process of recreating the image. Now, with Soft Delete (currently available in public preview), Azure introduces a safeguard that makes it easy to recover deleted images within a specified retention period.

How Soft Delete Works

When Soft Delete is enabled on a gallery, deleting an image doesn't immediately remove it from the system. Instead, the image enters a "soft-deleted" state, where it remains recoverable for up to 7 days. During this grace period, administrators can review and restore images that may have been deleted by mistake, preventing permanent loss. After the retention period expires, the platform automatically performs a hard (permanent) delete, at which point recovery is no longer possible.

Key Capabilities and User Experience

- Recovery period: Images are retained for a default period of 7 days, giving users time to identify and restore any resources deleted in error.
- Seamless recovery: Recover soft-deleted images directly from the Azure Portal or via REST API, making it easy to integrate with automation and CI/CD pipelines.
- Role-based access: Only owners or users with the Compute Gallery Sharing Admin role at the subscription or gallery level can manage soft-deleted images, ensuring tight control over recovery and deletion operations.
- No additional cost: The Soft Delete feature is provided at no extra charge. After deletion, only one replica per region is retained, and standard storage charges apply until the image is permanently deleted.
- Comprehensive support: Soft Delete is available for Private, Direct Shared, and Community Galleries. New and existing galleries can be configured to support the feature.

To enable Soft Delete, you can update your gallery settings via the Azure Portal or use the Azure CLI. Once enabled, the "delete" operation will soft-delete images, and you can view, list, restore, or permanently remove these images as needed. Learn more about the Soft Delete feature at https://aka.ms/sigsoftdelete

Zonal Redundant Storage (ZRS) by Default

Another major resiliency enhancement in Azure Compute Gallery is the default use of Zonal Redundant Storage (ZRS) for image versions. ZRS replicates your images across multiple availability zones within a region, ensuring that your resources remain available even if a zone experiences an outage. By defaulting to ZRS, Azure raises the baseline for image durability and access, reducing the risk of disruptions due to zone-level failures.

- Automatic redundancy: All new image versions are stored using ZRS by default, without requiring manual configuration.
- High availability: Your VM images are protected against the failure of any single availability zone within the region.
- Simplified management: Users benefit from resilient storage without the need to explicitly set up or manage storage account redundancy settings.
Default ZRS capability starts with API version 2025-03-03; Portal/SDK support will be added later.

Why These Features Matter

The combination of Soft Delete and ZRS by default provides Azure customers with enhanced operational reliability and data protection. Whether overseeing a suite of VM images for development and testing purposes or coordinating production deployments across multiple teams, these features offer the following benefits:

- Mitigate operational risks associated with accidental deletions or regional outages.
- Minimize downtime and reduce manual recovery processes.
- Promote compliance and security through advanced access controls and transparent recovery procedures.

To evaluate the Soft Delete feature, you may register for the preview and activate it on your galleries through the Azure Portal or REST API. Please note that, during its preview phase, this capability is recommended for assessment and testing rather than for production environments. ZRS is already available out of the box, delivering image availability starting with API version 2025-03-03. For comprehensive details and step-by-step guidance on enabling and using Soft Delete, please review the public specification document at https://aka.ms/sigsoftdelete

Conclusion

Azure Compute Gallery continues to push the envelope on resource resiliency. With Soft Delete (preview) offering a reliable recovery mechanism for deleted images, and ZRS by default protecting your assets against zonal failures, Azure empowers you to build and manage VM deployments with peace of mind. Stay tuned for future updates as these features evolve toward general availability.

Azure Automated Virtual Machine Recovery: Minimizing Downtime
Co-authors: Mukhtar Ahmed, Shekhar Agrawal, Harish Luckshetty, Vinay Nagarajan, Jie Su, Sri Harsha Kanukuntla, David Maldonado, Shardul Dabholkar.

Keeping virtual machines running smoothly is essential for businesses across every industry. When a VM stays down for even a short period, the impact can cascade quickly: delayed financial transactions, stalled manufacturing lines, unavailable retail systems, or interruptions to healthcare services. This understanding led to the creation of this solution, with its primary goal of ensuring fast and reliable recovery times so customers can focus on their business priorities without worrying about manual recovery strategies. This feature helps ensure your business Service-Level Agreements are consistently met. When a VM experiences an issue, our system springs into action within seconds, working to restore your service as quickly as possible. It automatically executes the optimal recovery strategy, all without customer intervention. The feature operates continuously in the background, monitoring the health of VMs through multiple detection mechanisms, and automatically selects the fastest recovery path based on the specific failure type.

Getting Started

The best part? Azure Automated VM Recovery requires no setup or configuration. Running quietly in the background, this service helps guarantee the highest level of recoverability and a smooth experience for every Azure customer. Your VMs are already benefiting from faster detection, smarter diagnosis, and optimized recovery strategies.

The Importance of Automated VM Recovery

Automated VM recovery is essential to keeping cloud services resilient, reliable, and interruption-free. Automated recovery ensures that the moment a failure occurs, the platform responds instantly with fast detection, intelligent diagnostics, and the optimal repair action, all without requiring customer intervention.
- Better experience for customers: By minimizing VM downtime, it helps customers keep their services online, avoiding disruptions and potential business losses.
- Stronger trust in Azure: Fast, reliable recovery builds customer confidence in Azure’s platform, reinforcing our reputation for dependability.
- Reduced financial impact for customers: The lower the downtime, the less time customers are impacted, reducing potential loss of revenue and minimizing business disruption during critical operations.
- Empowering internal teams: Automated monitoring and clear visibility into recovery metrics help teams track health, onboard easily, and identify opportunities for improvement with minimal effort.

How Azure Automated VM Recovery Works: A Three-Stage Approach

Azure automatically handles VM issues through a three-stage recovery framework: Detection, Diagnosis, and Mitigation.

Detection

From the moment a failure occurs, multiple parallel mechanisms identify issues quickly. Azure hardware devices send regular health signals, which are monitored for interruptions or degradation. At the application level, operational health is tracked via response times, error rates, and successful operations to detect software-level problems rapidly.

Diagnosis

Once detected, lightweight diagnostics determine the best recovery action without unnecessary delays. Diagnostics operate at multiple levels: host-level checks assess the underlying infrastructure, VM-level diagnostics evaluate the virtual machine state, and system-on-chip (SoC)-level analysis examines hardware components. This includes network checks, resource utilization assessments, and service responsiveness tests. Detailed data is also collected for post-incident analysis, continuously improving diagnostic algorithms while active recovery proceeds.

Mitigation

Based on diagnostics, the system automatically executes the optimal recovery strategy, starting with the least disruptive methods and escalating as needed.
Hardware failures may trigger VM migration, while software issues might be resolved with targeted service restarts. If needed, a host reset is performed while preserving virtual machine state, ensuring minimal disruption to running workloads. Post-mitigation health checks ensure full VM functionality before recovery is considered complete.

Recovery Event Annotations

Recovery Event Annotations are specialized annotations that provide detailed visibility into every stage of VM recovery, going beyond simple uptime metrics. These annotations act as custom monitoring metrics, breaking down each incident into precise time segments. For example, TTD (Time to Detect) measures the time between a VM becoming unhealthy and the system recognizing the issue, while TTDiag (Time to Diagnose) tracks the duration of diagnostic checks. By analyzing these segments, the annotations help identify bottlenecks, optimize recovery steps, and improve overall reliability. Key benefits include:

- Understanding why some VMs recover faster than others.
- Identifying which diagnostics add value versus those that don’t.
- Highlighting opportunities that provide a faster path to recovery.
- Enabling early detection of regressions through event annotation-driven alerts.
- Establishing a common language across Azure teams for measuring and improving downtime.

Customer Impact and Results

Azure Automated VM Recovery demonstrates our commitment to not only high availability but also rapid recovery. By minimizing downtime, it helps customers build resilient applications and maintain business continuity during unexpected failures. Over the past 18 months, this solution has cut average VM downtime by more than half, significantly enhancing reliability and customer experience.
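The timing segments introduced in the Recovery Event Annotations discussion reduce to simple interval arithmetic. This is a hypothetical sketch, not an Azure API; TTM is an assumed name for the mitigation segment, since only TTD and TTDiag are named in the post:

```python
from datetime import datetime, timezone

def recovery_timings(unhealthy_at, detected_at, diagnosed_at, mitigated_at):
    """Break one recovery incident into timing segments (in seconds):
    TTD  = time to detect (unhealthy -> detected),
    TTDiag = time to diagnose (detected -> diagnosed),
    TTM  = assumed name for the mitigation segment (diagnosed -> mitigated)."""
    return {
        "TTD": (detected_at - unhealthy_at).total_seconds(),
        "TTDiag": (diagnosed_at - detected_at).total_seconds(),
        "TTM": (mitigated_at - diagnosed_at).total_seconds(),
        "total_downtime": (mitigated_at - unhealthy_at).total_seconds(),
    }

# Illustrative incident: detected after 8 s, diagnosed 12 s later,
# mitigated 70 s after that.
t0 = datetime(2025, 1, 1, 12, 0, 0, tzinfo=timezone.utc)   # becomes unhealthy
t1 = datetime(2025, 1, 1, 12, 0, 8, tzinfo=timezone.utc)   # detected
t2 = datetime(2025, 1, 1, 12, 0, 20, tzinfo=timezone.utc)  # diagnosed
t3 = datetime(2025, 1, 1, 12, 1, 30, tzinfo=timezone.utc)  # mitigated
print(recovery_timings(t0, t1, t2, t3))
# {'TTD': 8.0, 'TTDiag': 12.0, 'TTM': 70.0, 'total_downtime': 90.0}
```

Segmenting downtime this way is what makes it possible to see which stage (detection, diagnosis, or mitigation) dominates a slow recovery.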
Our ongoing goal is to provide a platform where customers can deploy workloads with confidence, knowing automated recovery will minimize disruptions.

Announcing General Availability of Azure Da/Ea/Fasv7-series VMs based on AMD ‘Turin’ processors
Today, Microsoft is announcing the general availability of Azure’s new AMD-based Virtual Machines (VMs) powered by 5th Gen AMD EPYC™ (Turin) processors. These VMs include general-purpose (Dasv7, Dalsv7), memory-optimized (Easv7), and compute-optimized (Fasv7, Falsv7, Famsv7) series, available with or without local disks. Azure’s latest AMD-based VMs offer faster CPU performance, greater scalability, and flexible configurations, making them the ideal choice for high performance, cost efficiency, and diverse workloads. Key improvements include up to 35% better CPU performance and price-performance compared to equivalent v6 AMD-based VMs. Workload-specific gains are significant: up to 25% for Java applications, up to 65% for in-memory cache applications, up to 80% for crypto workloads, and up to 130% for web server applications, to name a few.

Dalsv7-series VMs are cost-efficient for low-memory workloads like web servers, video encoding, and batch processing. Dasv7-series VMs suit general computing tasks such as e-commerce, web front ends, virtualization, customer relationship management (CRM) applications, and entry- to mid-range databases. Easv7-series VMs target memory-heavy workloads like enterprise applications, data warehousing, business intelligence, in-memory analytics, and more. Falsv7-, Fasv7-, and Famsv7-series VMs deliver full-core performance without Simultaneous Multithreading (SMT) for compute-intensive tasks like scientific simulations, financial modeling, gaming, and more. You can now choose constrained-core VM sizes, reducing the vCPU total by 50% or 75% while maintaining the other resources. Dasv7, Dalsv7, and Easv7 VMs now scale up to 160 vCPUs, an increase from 96 vCPUs in the previous generation. The Fasv7, Falsv7, and Famsv7 VMs, which do not include SMT, support up to 80 vCPUs, up from 64 vCPUs in the prior generation, and introduce a new 1-core option.
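To make the constrained-core option concrete, the arithmetic is simple. This illustrative helper is not an Azure API, and the 32-vCPU example is hypothetical; the 50%/75% reductions come from the announcement, while memory, storage, and network stay at the full-size values:

```python
def constrained_core_sizes(full_vcpus: int) -> list[int]:
    """Return the vCPU counts available when constraining a VM size:
    the vCPU total reduced by 50% or by 75%, with all other
    resources (memory, disks, network) kept from the full size."""
    return [full_vcpus // 2, full_vcpus // 4]

# A hypothetical 32-vCPU size could be constrained to 16 or 8 vCPUs,
# useful for optimizing core-based software licensing costs.
print(constrained_core_sizes(32))  # [16, 8]
```

The appeal is that per-core-licensed software (databases, for example) pays for fewer cores while the VM keeps the memory and I/O of the larger size.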
These VMs offer a maximum boost CPU frequency of 4.5 GHz for faster compute-intensive operations. The new VMs deliver increased memory capacity, up to 640 GiB for Dasv7 and 1280 GiB for Easv7, making them ideal for memory-intensive workloads. They also support three memory (GiB)-to-vCPU ratios: 2:1 (Dalsv7-series, Daldsv7-series, Falsv7-series and Faldsv7-series), 4:1 (Dasv7-series, Dadsv7-series, Fasv7-series and Fadsv7-series), and 8:1 (Easv7-series, Eadsv7-series, Famsv7-series and Famdsv7-series). Remote storage performance is improved, with up to 20% higher IOPS and up to 50% greater throughput, while local storage performance offers up to 55% higher throughput. Network performance is also enhanced by up to 75% compared to corresponding D-series and E-series v6 VMs. New VM series Fadsv7, Faldsv7, and Famdsv7 introduce local disk support. The new VMs leverage Azure Boost technology to enhance performance and security, utilize the Microsoft Azure Network Adapter (MANA), and support the NVMe protocol for both local and remote disks.

The 5th Generation AMD EPYC™ processor family, based on the newest ‘Zen 5’ core, provides enhanced capabilities for these new Azure AMD-based VM series, such as AVX-512 with a full 512-bit data path for vector and floating-point operations, higher memory bandwidth, and improved instructions per clock compared to the previous generation. These updates provide the ability to handle compute-intensive tasks for AI and machine learning, scientific simulations, and financial analytics, among others. AMD Infinity Guard hardware-based security features, such as Transparent Secure Memory Encryption (TSME), continue in this generation to ensure sensitive information remains secure.

These VMs are available in the following Azure regions: Australia East, Central US, Germany West Central, Japan East, North Europe, South Central US, Southeast Asia, UK South, West Europe, West US 2, and West US 3.
The large 160 vCPU Easv7-series and Eadsv7-series sizes are available in North Europe, South Central US, West Europe, and West US 2. More regions are coming in 2026. Refer to Product Availability by Region for the latest information. Our customers have shared the benefits they’ve observed with these new VMs: “Elastic enables customers to drive innovation and cost-efficiency with our observability, security, and search solutions on Azure. In our testing, Azure’s latest Daldsv7 VMs provided up to 13% better indexing throughput compared to previous generation Daldsv6 VMs, and we are looking forward to the improved performance for Elasticsearch users deploying on Azure.” — Yuvraj Gupta, Director, Product Management, Elastic “The Easv7 series of Azure VMs offers a balanced mix of CPU, memory, storage, and network performance that suits the majority of Oracle Database configurations very well. The 80 Gbps network with the jumbo frame capability is especially helpful for efficient operation of FlashGrid Cluster with Oracle RAC on Azure VMs.” — Art Danielov, CEO, FlashGrid "Our analysis indicates that Azure’s new AMD based v7 series Virtual Machines demonstrate significantly higher performance compared to the v6 series, particularly in single-thread ratings. This advancement is highly beneficial, as several of our critical applications, such as ArcGIS Enterprise, are single-threaded and CPU-bound. Consequently, these faster v7 series VMs have resulted in improved performance with the same number of users, evidenced by lower server utilization and faster client-side response times." — Thomas Buchmann, Senior Cloud Architect, VertiGIS Here’s what our technology partners are saying: “AMD and Microsoft have built one of the industry’s most successful cloud partnerships, bringing over 60 VM series to market through years of deep engineering collaboration. 
With the new v7 Azure VMs powered by 5th Gen AMD EPYC processors, we’re setting a new benchmark for performance, efficiency, and scalability—giving customers the proven, leadership compute they expect from AMD in the world’s most demanding cloud environments.” — Steve Berg, Corporate Vice President and General Manager of the Server CPU Cloud Business Group at AMD “Our collaboration with Microsoft continues to empower developers and enterprises alike. The new AMD based v7-series VMs on Azure offer a powerful foundation for the full spectrum of modern workloads, from development to production AI/ML pipelines. We are excited to support this launch, ensuring every user gets a seamless experience on Ubuntu, with the enterprise security and long-term stability of Ubuntu Pro available for their most critical systems." — Jehudi Castro-Sierra, Public Cloud Alliances Director "The new Azure Da/Ea/Fa v7-series AMD Turin-based instances running SUSE Linux Enterprise Server provide a significant performance uplift during initial tests. They show an impressive 20%-40% increase with typical Linux kernel compilation tasks compared to the same instance sizes of the v6 series. This demonstrates the enhanced capabilities the v7 series brings to our joint customers seeking maximum efficiency and performance for their critical applications.” — Peter Schinagl, Sr. Technical Architect, SUSE You can learn more about these latest Azure AMD based VMs by visiting the specification pages at Dasv7-series, Dadsv7-series, Dalsv7-series, Daldsv7-series, Easv7-series, Eadsv7-series, Fasv7-series, Fadsv7-series, Falsv7-series, Faldsv7-series, Famsv7-series, Famdsv7-series, constrained-core sizes. For pricing details, visit the Azure Virtual Machines pricing page. These VMs support all remote disk types. See Azure managed disk type for additional details. Disk storage is billed separately. Azure Integrated HSM (Hardware Security Module) will continue to be in preview with these VMs.
Azure Integrated HSM is an ephemeral HSM cache that enables secure key management within Azure VMs by ensuring that cryptographic keys remain protected inside a FIPS 140-3 Level 3-compliant boundary throughout their lifecycle. To explore this new feature, please sign up using the form. Have questions? Please reach us at Azure Support and our experts will be there to help you with your Azure journey.

Announcing preview of new Azure Dasv7, Easv7, Fasv7-series VMs based on AMD EPYC™ ‘Turin’ processor
Today, Microsoft is announcing the preview of the new Azure AMD-based Virtual Machines (VMs), powered by 5th Generation AMD EPYC™ (Turin) processors. The preview includes general purpose (Dasv7 & Dalsv7 series), memory-optimized (Easv7 series) and compute-optimized (Fasv7, Falsv7, Famsv7 series) VMs, available with and without local disks. These VMs are in preview in the following Azure regions: East US 2, North Europe, and West US 3. To request access to the preview, please fill out the Preview-Signup form.

The latest Azure AMD-based VMs deliver significant enhancements over the previous generation (v6) AMD-based VMs: improved CPU performance, greater scalability, and expanded configuration options to meet the needs of a wide range of workloads. Key improvements include:

- Up to 35% CPU performance improvement compared to equivalently sized (v6) AMD-based VMs.
- Significant performance gains on other workloads:
  - Up to 25% for Java-based workloads
  - Up to 65% for in-memory cache applications
  - Up to 80% for crypto workloads
  - Up to 130% for web server applications
- Maximum boost CPU frequency of 4.5 GHz, enabling faster operations for compute-intensive workloads.
- Expanded VM sizes: Dasv7-series, Dalsv7-series and Easv7-series now scale up to 160 vCPUs. Fasv7-series supports up to 80 vCPUs, with a new 1-core size.
- Increased memory capacity: Dasv7-series now offers up to 640 GiB of memory. Easv7-series scales up to 1280 GiB and is ideal for memory-intensive applications.
- Enhanced remote storage performance: VMs offer up to 20% higher IOPS and up to 50% greater throughput compared to similarly sized previous generation (v6) VMs.
- New VM families introduced: Fadsv7, Faldsv7, and Famdsv7 are now available with local disk support.
- Expanded constrained-core offerings: New constrained-core sizes for Easv7 and Famsv7, available with and without local disks, helping to optimize licensing costs for core-based software licensing.
These enhancements make these latest VMs a compelling choice for customers seeking high performance, cost efficiency, and workload flexibility on Azure. Additionally, these VMs leverage the latest Azure Boost technology to enhance performance and security. The new VMs utilize the Microsoft Azure Network Adapter (MANA), a next-generation network interface that provides stable, forward-compatible drivers for Windows and Linux operating systems. These VMs also support the NVMe protocol for both local and remote disks.

The 5th Generation AMD EPYC™ processor family, based on the newest ‘Zen 5’ core, provides enhanced capabilities for these new Azure AMD-based VM series, such as AVX-512 with a full 512-bit data path for vector and floating-point operations, higher memory bandwidth, and improved instructions per clock compared to the previous generation. These updates provide increased throughput and the ability to scale for compute-intensive tasks like AI and machine learning, scientific simulations, and financial analytics, among others. AMD Infinity Guard hardware-based security features, such as Transparent Secure Memory Encryption (TSME), continue in this generation to ensure sensitive information remains secure.

These VMs support three memory (GiB)-to-vCPU ratios: 2:1 (Dalsv7-series, Daldsv7-series, Falsv7-series and Faldsv7-series), 4:1 (Dasv7-series, Dadsv7-series, Fasv7-series and Fadsv7-series), and 8:1 (Easv7-series, Eadsv7-series, Famsv7-series and Famdsv7-series). The Dalsv7-series is ideal for workloads that require less RAM per vCPU and can reduce costs when running non-memory-intensive applications, including web servers, video encoding, batch processing and more. The Dasv7-series VMs work well for many general computing workloads, such as e-commerce systems, web front ends, desktop virtualization solutions, customer relationship management applications, entry-level and mid-range databases, application servers, and more.
The Easv7-series VMs are ideal for workloads such as memory-intensive enterprise applications, data warehousing, business intelligence, in-memory analytics, and financial transactions. The new Falsv7, Fasv7 and Famsv7 series do not have Simultaneous Multithreading (SMT), meaning a vCPU equals a full core, which makes these VMs well suited for compute-intensive workloads needing the highest CPU performance, such as scientific simulations, financial modeling and risk analysis, gaming, and more. In addition to the standard sizes, the latest VM series are available in constrained-core sizes, with the vCPU count constrained to one-half or one-quarter of the original VM size, giving you the flexibility to select the core and memory configuration that best fits your workloads.

In addition to the new VM capabilities, the previously announced Azure Integrated HSM (Hardware Security Module) will be in preview soon with the latest Azure AMD-based VMs. Azure Integrated HSM is an ephemeral HSM cache that enables secure key management within Azure virtual machines by ensuring that cryptographic keys remain protected inside a FIPS 140-3 Level 3-compliant boundary throughout their lifecycle. To explore this new feature, please sign up using the form provided below.
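As a quick sanity check on the memory (GiB)-to-vCPU ratios listed earlier, the per-series memory scaling can be expressed as a simple lookup. This is illustrative only; authoritative size-by-size specs are on the linked series pages:

```python
# Memory (GiB)-to-vCPU ratios for the new series, as listed in the post:
# 2:1 for the "l" (low-memory) series, 4:1 for the standard D/F series,
# and 8:1 for the memory-optimized E/Fam series.
RATIO_BY_SERIES = {
    "Dalsv7": 2, "Daldsv7": 2, "Falsv7": 2, "Faldsv7": 2,
    "Dasv7": 4, "Dadsv7": 4, "Fasv7": 4, "Fadsv7": 4,
    "Easv7": 8, "Eadsv7": 8, "Famsv7": 8, "Famdsv7": 8,
}

def memory_gib(series: str, vcpus: int) -> int:
    """Memory implied by a series' ratio for a given vCPU count."""
    return RATIO_BY_SERIES[series] * vcpus

# Consistent with the announced maximums: 160 vCPUs at 4:1 gives
# 640 GiB (Dasv7), and at 8:1 gives 1280 GiB (Easv7).
print(memory_gib("Dasv7", 160))  # 640
print(memory_gib("Easv7", 160))  # 1280
```

Picking the right ratio is mostly a cost exercise: a 2:1 series avoids paying for memory a web tier will never touch, while 8:1 suits in-memory analytics.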
The latest Azure AMD-based VMs provide options for your wide range of computing needs. Explore the new VMs today and discover how these VMs can enhance your workload performance and lower your costs. To request access to the preview, please fill out the Preview-Signup form. Have questions? Please reach us at Azure Support and our experts will be there to help you with your Azure journey.

Announcing Preview of New Azure Dnl/Dn/En v6 VMs powered by Intel 5th Gen processor & Azure Boost
We are thrilled to announce the public preview of Azure's first Network Optimized VMs, powered by the latest 5th Gen Intel® Xeon® processor, offering unparalleled performance and flexibility. The network optimized VMs are relevant for workloads such as network virtual appliances, large-scale e-commerce applications, ExpressRoute, Application Gateway, central DNS and monitoring servers, firewalls, media processing tasks that involve transferring large amounts of data quickly, and any workloads that require the ability to handle a high number of user connections and data transfers.

Network Optimized VMs enhance networking performance by providing hardware acceleration for initial connection setup for certain traffic types, a task previously performed in software. These VMs have lower end-to-end latency for initially establishing a connection or initial packet flow, and allow a VM to scale up the number of connections it manages more quickly. These Intel-based VMs come with three different memory-to-core ratios and offer options with and without local SSD across the VM families: Dnsv6, Dndsv6, Dnlsv6, Dnldsv6, Ensv6 and Endsv6 series. There are 55 VM sizes in total, ranging from 2 to 192 vCPUs and up to 1.8 TiB of memory. The new Network Optimized VMs have higher network bandwidth per vCPU, more vNICs per vCPU, and more connections per second.
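Since the headline metric for these sizes is connections per second (CPS), capacity planning against a size's CPS ceiling is simple arithmetic. A hypothetical sketch: the 30K and 400K values are the endpoints of the CPS range quoted for these series, and the headroom factor is an assumption, not Azure guidance:

```python
def fits_cps(required_cps: int, vm_max_cps: int, headroom: float = 0.2) -> bool:
    """Check whether a workload's peak rate of new connections per
    second fits within a VM size's CPS ceiling, while reserving a
    fraction of the ceiling as headroom for bursts."""
    return required_cps <= vm_max_cps * (1 - headroom)

# A workload opening 250K connections/s fits the top of the quoted
# range (400K CPS) but not the bottom (30K CPS).
print(fits_cps(250_000, 400_000))  # True
print(fits_cps(250_000, 30_000))   # False
```

The same check generalizes to the other per-size ceilings in the tables below (bandwidth, vNICs, data disks) when sizing a network virtual appliance.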
What’s New

Compared to the current Intel Dl/D/Ev6 VMs, the network optimized VMs have:

- Up to 3x improvement in network bandwidth per vCPU over the current generation Intel Dl/D/Ev6 VMs
- 2x vNIC allocation on smaller vCPU sizes
- Up to 200 Gbps VM network bandwidth
- Up to 8x connections-per-second (CPS) enhancement across sizes
- Up to 192 vCPUs and up to 1.8 TiB of memory
- Azure Boost, which enables:
  - Up to 400k IOPS and 12 GB/s remote storage throughput
  - Up to 200 Gbps VM network bandwidth
  - NVMe interface for local and remote disks
  - Enhanced security through Total Memory Encryption (TME) technology

Customers are excited about the new Azure Dnl/Dn/Ensv6 VMs

“Palo Alto Networks, the global cybersecurity leader, is working with Microsoft to bring best-in-class Network Virtual Appliance performance capabilities to their customers. As the performance needs of customers on Azure continue to grow, innovations like Network Optimized VMs, Azure Boost, and Microsoft Azure Network Adapter (MANA) technology will help ensure that both our VM Series network virtual appliance and Cloud NGFW, our Azure native firewall service, can scale efficiently and cost-effectively,” said Rich Campagna, SVP Products, Palo Alto Networks. “We look forward to continuing our partnership with Microsoft to bring these innovations to life."

General Purpose Workloads - Dnlsv6, Dnldsv6, Dnsv6, Dndsv6

The new Network Optimized Dnlsv6-series and Dnsv6-series VMs offer a balance of memory to CPU performance with increased scalability of up to 128 vCPUs and 512 GiB of RAM. Below is an overview of the specifications offered by the Dnlsv6-series and Dnsv6-series VMs.
| Series | vCPU | vNIC | Network Bandwidth (Gbps) | CPS | Memory (GiB) | Local Disk (GiB) | Max Data Disks |
|---|---|---|---|---|---|---|---|
| Dnlsv6-series | 2 – 128 | 4 – 15 | 25.0 – 200.0 | 30K – 400K | 4 – 256 | n/a | 8 – 64 |
| Dnldsv6-series | 2 – 128 | 4 – 15 | 25.0 – 200.0 | 30K – 400K | 4 – 256 | 110 – 7,040 | 8 – 64 |
| Dnsv6-series | 2 – 128 | 4 – 15 | 25.0 – 200.0 | 30K – 400K | 8 – 512 | n/a | 8 – 64 |
| Dndsv6-series | 2 – 128 | 4 – 15 | 25.0 – 200.0 | 30K – 400K | 8 – 512 | 110 – 7,040 | 8 – 64 |

Memory Intensive Workloads - Ensv6 and Endsv6

The new Network Optimized Ensv6-series and Endsv6-series virtual machines are ideal for memory-intensive workloads, offering up to 192 vCPUs and 1.8 TiB of RAM. Below is an overview of the specifications offered by the Ensv6-series and Endsv6-series VMs.

| Series | vCPU | vNIC | Network Bandwidth (Gbps) | CPS | Memory (GiB) | Local Disk (GiB) | Max Data Disks |
|---|---|---|---|---|---|---|---|
| Ensv6-series | 2 – 128 | 4 – 15 | 25.0 – 200.0 | 30K – 400K | 16 – >1800 | n/a | 8 – 64 |
| Endsv6-series | 2 – 192 | 4 – 15 | 25.0 – 200.0 | 30K – 400K | 16 – >1800 | 110 – 10,560 | 8 – 64 |

The Dnlv6, Dnv6, and Env6-series Azure Virtual Machines will offer options with and without local disk storage. These VMs are also compatible with remote persistent disk options including Premium SSD, Premium SSD v2, and Ultra Disk.

Join the Preview

Dnlv6, Dnv6, and Env6 series VMs are now available for preview in US East. VMs above 96 vCPUs and the VM series with local disk will be supported later in the preview. To request access to the preview, please fill out the survey form here. We look forward to hearing from you.

Your guide to Azure Compute at Microsoft Ignite 2025
The countdown to Microsoft Ignite 2025 is almost over: November 18–21, 2025! Whether you’ll be joining us in person or tuning in virtually, this guide is your essential resource for everything Azure Compute. Explore the latest advancements, connect with product experts, and expand your cloud skills through curated sessions and interactive experiences. Attendees will have the opportunity to dive deep into new product capabilities and solutions, including ways to boost virtual machine performance, enhance resiliency, and optimize cloud operations. Be sure to add these sessions to your schedule for a personalized, can’t-miss Ignite experience. Bookmark this guide for quick access to all the latest Azure Compute news and updates throughout Ignite 2025!

Featured sessions

Tuesday

BRK171: What's new and what's next in Azure IaaS
Level: Intermediate 200
In this session, we’ll introduce the latest capabilities across compute, storage, and networking. Uncover the advancements in Azure IaaS driving performance, resiliency, and cost efficiency. We will present how Azure’s global backbone, enhanced capabilities, and expanding portfolio can support mission-critical, cloud-native and AI workloads, while built-in security and flexible tiering help right-size app deployments and accelerate modernization.
Tuesday, November 18 | 2:30 PM–3:15 PM PST

Wednesday

BRK430: Inside Azure Innovations with Mark Russinovich
Level: Advanced 300
Join Mark Russinovich, CTO and Technical Fellow of Microsoft Azure. Mark will take you on a tour of the latest innovations in Azure architecture and explain how Azure enables intelligent, modern, and innovative applications at scale in the cloud, on-premises, and on the edge. Featuring some of the latest Compute announcements with Azure Boost.
Wednesday, November 19, 2:45 PM PST

Other related IaaS sessions

Use the following as a guide to build your session schedule with an emphasis on our Azure Compute topics.
These sessions will be in person and recorded. Sessions Tuesday–Thursday will be live streamed.

Thursday

BRK176: Driving efficiency and cost optimization for Azure IaaS deployments (Level: Intermediate 200)
Control cloud spend without compromising performance. This session shows how Azure IaaS helps IT leaders optimize costs through flexible pricing, built-in tools, and smart resource planning. Learn how to align infrastructure choices with workload requirements, reduce TCO, and make informed decisions that support growth and innovation. You will gain a deeper understanding of how Azure delivers a comprehensive set of services, tools, and financial instruments to optimize your cloud costs at scale.
Thursday, November 20, 9:45 AM PST

BRK217: Resilience by design: Secure, scalable, AI-ready cloud with Azure (Level: Advanced 300)
Resiliency is foundational. Explore how resiliency on Azure enables secure, scalable, AI-ready cloud architectures. Learn to set resilience goals, simulate failures, and orchestrate recovery. See live demos and discover how shared responsibility empowers teams to deliver trusted, resilient outcomes.
Thursday, November 20, 1:00 PM PST

BRK178: Architecting for resiliency on Azure Infrastructure (Level: Intermediate 200)
Discover how to build resilient cloud solutions on Azure by leveraging availability zones, multi-region deployments, and fungible products. This session explores architectural patterns, platform capabilities, and best practices to ensure high availability, fault tolerance, and business continuity for mission-critical workloads in dynamic and distributed environments.
Thursday, November 20, 1:00 PM PST

BRK148: Architect resilient apps with Azure backup and reliability features (Level: Advanced 300)
Learn to use self-serve tools to strengthen zonal resiliency for critical workloads. Assess and validate resilience across VMs, DBs, and containers.
Explore enhanced data and cyber resiliency with immutability and threat detection to guard against ransomware. Discover expanded workload coverage and real-time insights to proactively protect your applications and infrastructure.
Thursday, November 20, 3:30 PM PST

Friday

BRK146: Resiliency and recovery with Azure Backup and Site Recovery (Level: Advanced 300)
This session will show how to secure, detect threats, and quickly recover critical workloads across Azure environments using advanced backup and disaster recovery solutions. It covers modern techniques like threat-aware backups, container protection, and seamless disaster recovery to help meet compliance and recovery objectives.
Friday, November 21, 9:00 AM PST

BRK149: Unlock cloud-scale observability and optimization with Azure (Level: Advanced 300)
In this session, we'll deep dive into how Azure Monitor delivers end-to-end observability across your cloud and hybrid environments, helping you detect issues early and reduce mean time to recovery. We'll also share how new Copilot in Azure agents can extend this visibility into actionable cost and carbon efficiency insights, helping you identify optimization opportunities, validate recommendations, and streamline resource performance for business impact.
Friday, November 21, 10:15 AM PST

BRK173: Azure IaaS best practices to enhance performance and scale (Level: Advanced 300)
Azure IaaS can deliver excellent performance and scalability across a broad range of workloads. With high-throughput storage, low-latency networking, and intelligent auto-scaling, Azure supports demanding apps with precision. Learn how to optimize compute, storage, and network resources to meet performance goals, reduce costs, and scale confidently across global regions. Dive into the latest capabilities Azure Boost, Compute Fleet, Azure Virtual Machines, Azure Storage, and Networking offer.
Friday, November 21, 10:15 AM PST

BRK172: Powering modern cloud workloads with Azure Compute (Level: Advanced 300)
Uncover new VM offering announcements and explore innovations like Azure Boost. Dive into the latest compute innovation at the core of Azure IaaS. Whether you're running mission-critical enterprise apps or scaling cloud-native services, discover how these innovations are unlocking new value for customers and get a preview of what's coming next.
Friday, November 21, 11:30 AM PST

BRK168: Azure IaaS platform security deep dive (Level: Advanced 300)
As organizations accelerate their cloud adoption, robust security for your Infrastructure as a Service platform is more critical than ever. This session will provide a comprehensive exploration of Azure's security architecture, best practices, and innovations across four pillars: foundational security, compute security, network security, and storage security. Attendees will gain actionable insights to strengthen their cloud posture, ensure compliance, and protect sensitive workloads.
Friday, November 21, 11:30 AM PST

Upskill yourself with hands-on labs

Live demos and hands-on labs are available exclusively to in-person attendees, giving them a direct, firsthand experience.

Tuesday

LAB500: Attain unified observability and optimization in Azure (Level: Intermediate 200)
Get an AI-powered view of your Azure workload health and performance while uncovering cost and carbon savings. In this lab, use AI to investigate anomalies, correlate telemetry, and drive optimization. Apply FinOps and sustainability insights, align health with SLI/SLO targets, and improve monitoring posture for lasting efficiency. Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.
Tuesday, November 18, 2:45 PM PST

LAB520: Start, Get and Stay Resilient with Azure (Level: Intermediate 200)
Understand the Start, Get, and Stay Resilient journey. Get equipped with tools and insights to architect mission-critical applications with Azure's Resiliency and Configuration experiences. Assess your resiliency posture, apply recommendations, validate your posture, and orchestrate recovery. With the Essentials Machine Management bundle from Azure, manage and maintain the state of your resources, enforce configurations across devices, and ensure resilience is not a one-time goal but an ongoing state. Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.
Tuesday, November 18, 4:30 PM PST

Kernel Dump based Online Repair
Introduction

In the ever-evolving landscape of cloud computing, reliability remains paramount. As workloads scale and businesses depend on uninterrupted service, Azure continues to invest in technologies that enhance system resilience and minimize customer impact when failures occur. Azure Compute infrastructure operates at an unmatched scale, with certain Availability Zones (AZs) hosting nearly a million Azure Virtual Machines (Azure VMs) that run customer workloads. These Azure VMs depend on a sophisticated ecosystem of physical machines, networking infrastructure, storage systems, and other essential components. When failures occur at any of these layers, whether from hardware malfunctions, kernel issues, or network disruptions, customers may experience service interruptions. To address these challenges, the Azure Compute Repair Platform plays a vital role in identifying, diagnosing, and applying mitigation strategies to resolve issues as quickly as possible.

To further improve our ability to diagnose and resolve failures swiftly and accurately, we present a novel approach: a real-time kernel dump analysis technology aimed at identifying the root cause of issues and facilitating precise, data-driven repairs. This is an addition to the gamut of detection and mitigation strategies we already leverage. The capability is generally available in all Azure regions and benefits all our customers, including our most critical ones.

This project would not have been possible without the invaluable support and contributions of Binit Mishra, Dhruv Joshi, Abhay Ketkar, Gaurav Jagtiani, Mukhtar Ahmed, Siamak Ahari, Rajeev Acharya, Deepak Venkatesh, Abhinav Dua, Alvina Putri, Emma Montalvo, and Chantale Ninah; my heartfelt thanks to each of you.
Real-Time Failure Diagnosis and Repair

We have developed a novel approach to diagnosing and mitigating failures in Azure Compute infrastructure: understanding the state of the kernel on the Azure Host Machine through real-time collection and analysis of Live Kernel Dumps (LKD). This enables us to pinpoint the exact issue with the kernel and use that insight for precise repair actions, rather than applying a broad set of mitigation strategies. By reducing trial-and-error repair attempts, we significantly minimize downtime and accelerate issue resolution.

Kernel dumps can help detect critical issues such as kernel panics, memory leaks, and driver failures. Kernel panics occur when the system encounters a fatal error, causing the kernel to stop functioning. Memory leaks, where memory is not properly released, can lead to system instability over time. Driver failures, often caused by faulty or incompatible drivers, can also be identified through kernel dump analysis. Importantly, it is the Repair Platform that triggers LKD collection and then consumes the LKD analysis to make informed decisions. By incorporating live kernel dump analysis into our mitigation workflows, we enhance Azure's ability to quickly diagnose, categorize, and resolve infrastructure issues, ultimately reducing system downtime and improving overall performance.

Architecture

How the system works:

1. Dump Collection: When an issue is detected, the Repair Platform triggers the collection of a Live Kernel Dump (LKD) on the machine hosting the affected Azure VM.
2. Dump Upload: An agent running on the machine monitors a designated storage location for newly generated dumps. When a dump is detected, the agent uploads it from the Azure Host Machine to an online Analysis Service.
3. Failure Classification: The Analysis Service processes the uploaded LKD, diagnoses the root cause of the failure, and categorizes it accordingly, for example identifying a networking switch in a hung state.
4. Persistence: The Analysis Service generates a detailed failure message and stores it in an Azure Table for tracking and retrieval.
5. Automated Repair Decisions: The Repair Platform continuously monitors the Azure Table for failure messages. Once a failure is recorded, it retrieves the data and makes an informed repair decision.

Impact

By leveraging this approach, the Azure Compute Repair Platform achieves both a better repair strategy and significant downtime savings.

(A) Better Repair Strategy

By precisely identifying failures, the Repair Platform can classify issues accurately and apply the most effective resolution method, minimizing unnecessary disruptions and enhancing long-term infrastructure stability. For instance, in the case of a VM Switch Hung issue, the Repair Platform attempts to mitigate the problem on the affected Azure Host Machine. If unsuccessful, it migrates the customer's workload to a more stable machine and initiates aggressive repairs on the faulty Azure Host Machine. While this restores service, it does not address the underlying cause, leaving the Azure Host Machine vulnerable to repeated VM Switch Hung failures.

By enabling real-time failure classification, the Repair Platform can instead hold a subset of affected Azure Host Machines in a restricted state, preventing new Azure VMs from being assigned to them. This approach allows Azure's hardware and network partners to run diagnostics, gain deeper insights into the failure, and implement targeted fixes. As a result, Azure has reduced recurring failures, minimized customer impact, and improved overall infrastructure reliability. While the VM Switch Hung issue serves as an example, this data-driven repair strategy can be extended to various failure scenarios, enabling faster recovery, fewer disruptions, and a more resilient platform.

(B) Downtime Reduction

The longer it takes to resolve an issue, the longer a customer workload may experience interruptions.
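Stepping back to the architecture, the collection-to-decision flow described above can be sketched as a minimal pipeline. Everything in this sketch is an illustrative assumption rather than the actual Azure implementation: the failure categories, the `vmswitch` byte signature, and the in-memory dictionary standing in for the Azure Table are all invented for demonstration.

```python
from dataclasses import dataclass
from enum import Enum, auto

# Illustrative failure categories; the real taxonomy is internal to Azure.
class FailureClass(Enum):
    VM_SWITCH_HUNG = auto()
    DRIVER_FAILURE = auto()
    MEMORY_LEAK = auto()
    UNKNOWN = auto()

@dataclass
class DumpAnalysis:
    host_id: str
    failure: FailureClass
    detail: str

def classify_dump(host_id: str, dump_bytes: bytes) -> DumpAnalysis:
    """Stand-in for the Analysis Service: inspect an LKD and categorize the failure."""
    # A real analyzer walks kernel data structures; here we match a fake signature.
    if b"vmswitch" in dump_bytes:
        return DumpAnalysis(host_id, FailureClass.VM_SWITCH_HUNG,
                            "networking switch in a hung state")
    return DumpAnalysis(host_id, FailureClass.UNKNOWN, "no known signature matched")

def decide_repair(analysis: DumpAnalysis) -> str:
    """Stand-in for the Repair Platform: turn a classification into a targeted action."""
    if analysis.failure is FailureClass.VM_SWITCH_HUNG:
        # Targeted action instead of a broad trial-and-error mitigation sequence.
        return f"restrict {analysis.host_id} for diagnostics ({analysis.detail})"
    return f"run default mitigation sequence on {analysis.host_id}"

# End to end: collect (simulated dump) -> classify -> persist -> decide.
record = classify_dump("host-42", b"... vmswitch state: hung ...")
failure_table = {record.host_id: record}  # stand-in for the Azure Table
print(decide_repair(failure_table["host-42"]))
```

The key design point this sketch illustrates is the decoupling: classification writes a structured record, and the repair decision is driven purely by that record rather than by ad hoc retries.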
As a result, downtime reduction is one of the key metrics we prioritize. We significantly reduce time to resolution by providing an early signal that pinpoints the exact issue. This allows the Repair Platform to perform targeted repairs rather than relying on time-consuming, broad mitigation strategies.

Sample scenario: When a customer faces issues stopping or destroying an Azure VM, and the problem is severe enough that all repair attempts fail, the only option may be to migrate the customer's workload to a different Azure Host Machine. Today, this process can take up to 26 minutes before the decision to move the customer workload is reached. With this new approach, however, we are optimizing to detect the failure and surface the issue within 3 minutes, enabling a decision much earlier and reducing customer downtime by 23 minutes, a significant improvement in downtime reduction and customer resolution.

Conclusion

Online kernel dump analysis for machine issue resolution marks a significant advancement in Azure's commitment to reliability, bringing us closer to a future where failures are not just detected but proactively mitigated in real time. By enabling real-time diagnostics and automated repair strategies, this approach is redefining Compute reliability: drastically reducing mitigation times, enhancing repair accuracy, and ensuring customers experience seamless service continuity. As we continue refining it, our focus remains on expanding its capabilities, enhancing kernel analysis, reducing analysis time, and strengthening the entire pipeline for greater efficiency and resilience. Stay tuned for further updates as we push the boundaries of intelligent cloud reliability.

Revolutionizing Reliability: Introducing the Azure Failure Prediction and Detection (AFPD) system
As part of the journey to consistently improve Azure reliability and platform stability, we launched Azure Failure Prediction & Detection (AFPD), Azure's premier shift-left reliability solution. AFPD became operational in 2024, unifying failure prediction, detection, mitigation, and remediation services into a single end-to-end system with the goal of preventing Azure Compute customer workload interruptions and repairing nodes at scale. AFPD builds upon previous reliability solutions such as Project Narya, adding new best practices and fleet health management capabilities on top of pre-existing failure prediction and mitigation capabilities. The end-to-end AFPD system has proven to further reduce the overall number of reboots by over 36% and allows for a proactive approach to maintaining the cloud. This system operates for all Azure Compute General Purpose, Specialized Compute, and High Performance Computing (HPC)/Artificial Intelligence (AI) workloads, as well as select Azure Storage scenarios.

For a deeper dive, you can read the whitepaper here, which won the Best Paper Award at the 2025 IEEE Cloud Summit!
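The shift-left idea behind AFPD, acting on a predicted failure before it interrupts a workload, can be illustrated with a minimal threshold loop. To be clear, this is a toy sketch: the telemetry signals, the weights, and the 0.8 threshold are invented for illustration and bear no relation to the production prediction models described in the whitepaper.

```python
# Minimal sketch of a predict-then-mitigate loop in the spirit of shift-left
# reliability. All signals, weights, and thresholds here are hypothetical.

def predict_failure_risk(node_telemetry: dict) -> float:
    """Toy risk score: weight a few hypothetical health signals into [0, 1]."""
    score = 0.0
    score += 0.5 if node_telemetry.get("corrected_memory_errors", 0) > 100 else 0.0
    score += 0.3 if node_telemetry.get("disk_latency_ms", 0) > 50 else 0.0
    score += 0.2 if node_telemetry.get("thermal_throttle_events", 0) > 0 else 0.0
    return score

def plan_action(node_id: str, risk: float, threshold: float = 0.8) -> str:
    """Shift-left: move workloads off a risky node *before* it fails."""
    if risk >= threshold:
        return f"live-migrate VMs off {node_id}, then send node to repair"
    return f"keep monitoring {node_id}"

fleet = {
    "node-a": {"corrected_memory_errors": 250, "disk_latency_ms": 80,
               "thermal_throttle_events": 1},
    "node-b": {"corrected_memory_errors": 3},
}
for node_id, telemetry in fleet.items():
    print(plan_action(node_id, predict_failure_risk(telemetry)))
```

The point of the sketch is the ordering: prediction runs continuously over fleet telemetry, and mitigation (such as live migration) happens ahead of the failure, which is how a system of this kind avoids customer-visible reboots rather than merely reacting to them.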