Virtual Machines
Azure NCv6 Public Preview: The new Unified Platform for Converged AI and Visual Computing
As enterprises accelerate adoption of physical AI (AI models interacting with real-world physics), digital twins (virtual replicas of physical systems), LLM inference (running language models for predictions), and agentic workflows (autonomous AI-driven processes), the demand for infrastructure that bridges high-end visualization and generative AI inference has never been higher. Today, we are pleased to announce the Public Preview of the NC RTX PRO 6000 BSE v6 series, powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs.

The NCv6 series represents a generational leap in Azure's visual compute portfolio, designed to be the dual engine for both industrial digitalization and cost-effective LLM inference. By leveraging NVIDIA Multi-Instance GPU (MIG) capabilities, the NCv6 platform offers affordable sizing options similar to our legacy NCv3 and NVv5 series. This provides a seamless upgrade path to Blackwell performance, enabling customers to run complex NVIDIA Omniverse simulations and multimodal AI agents with greater efficiency.

Why Choose Azure NCv6?
While traditional GPU instances often force a choice between "compute" (AI) and "graphics" (visualization) optimizations, the NCv6 breaks this silo. Built on the NVIDIA Blackwell architecture, it provides a "right-sized" acceleration platform for workloads that demand both ray-traced fidelity and Tensor Core performance. As outlined in our product documentation, these VMs are ideal for converged AI and visual computing workloads, including:
- Real-time digital twin and NVIDIA Omniverse simulation
- LLM inference and RAG (Retrieval-Augmented Generation) on small to medium AI models
- High-fidelity 3D rendering, product design, and video streaming
- Agentic AI application development and deployment
- Scientific visualization and high-performance computing (HPC)

Key Features of the NCv6 Platform

The Power of NVIDIA Blackwell
At the heart of the NCv6 is the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU. This powerhouse delivers breakthrough performance, featuring 96 GB of ultra-fast GDDR7 memory. This massive frame buffer allows the GPU to handle complex multimodal AI models and high-resolution textures that previous generations simply could not fit.

Host Performance: Intel Granite Rapids
To ensure your workloads aren't bottlenecked by the CPU, the VM host is equipped with Intel Xeon Granite Rapids processors. These provide an all-core turbo frequency of up to 4.2 GHz, ensuring that demanding pre- and post-processing steps—common in rendering and physics simulations—are handled efficiently.

Optimized Sizing for Every Workflow
We understand that one size does not fit all. The NCv6 series introduces three distinct sizing categories to match your specific unit economics:
- General Purpose: balanced CPU-to-GPU ratios (up to 320 vCPUs) for diverse workloads
- Compute Optimized: higher vCPU density for heavy simulation and physics tasks
- Memory Optimized: massive memory footprints (up to 1,280 GB RAM) for data-intensive applications
Crucially, for smaller inference jobs or VDI, we will also offer fractional GPU options, allowing you to right-size your infrastructure and optimize costs.
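For teams planning a proof of concept during the preview, provisioning an NCv6 VM follows the standard Azure CLI workflow. The sketch below is illustrative only: the size name used here is a placeholder, since the actual NCv6 SKU names (including the fractional-GPU sizes) are published in the NCv6 documentation rather than in this post.

```bash
# Hedged sketch: create a resource group and an NCv6 preview VM.
# "Standard_NC_EXAMPLE_v6" is a hypothetical placeholder; substitute the
# actual NCv6 size name (full-GPU or fractional) from the product docs.
az group create --name ncv6-preview-rg --location eastus

az vm create \
  --resource-group ncv6-preview-rg \
  --name ncv6-test-vm \
  --size Standard_NC_EXAMPLE_v6 \
  --image Ubuntu2204 \
  --admin-username azureuser \
  --generate-ssh-keys
```

After the VM is running, the NVIDIA driver and CUDA/Omniverse stack are installed the same way as on other NC-family sizes (for example, via the NVIDIA GPU driver VM extension).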
NCv6 Technical Specifications
- GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
- Processor: Intel Xeon Granite Rapids (up to 4.2 GHz turbo)
- vCPUs: 16 – 320 vCPUs (scalable across General Purpose, Compute Optimized, and Memory Optimized sizes)
- System memory: 64 GB – 1,280 GB DDR5
- Network: up to 200,000 Mbps (200 Gbps) Azure Accelerated Networking
- Storage: up to 2 TB local temp storage; support for Premium SSD v2 and Ultra Disk

Real-World Applications
The NCv6 is built for versatility, powering everything from pixel-perfect rendering to high-throughput language reasoning:
- Production generative AI and inference: Deploy self-hosted LLMs and RAG pipelines with optimized unit economics. The NCv6 is ideal for serving ranking models, recommendation engines, and content generation agents where low latency and cost-efficiency are paramount.
- Automotive and manufacturing: Validate autonomous driving sensors (LiDAR/radar) and train physical AI models in high-fidelity simulation environments before they ever touch the real world.
- Next-gen VDI and Azure Virtual Desktop: Modernize remote workstations with NVIDIA RTX Virtual Workstation capabilities. By leveraging fractional GPU options, organizations can deliver high-fidelity, accelerated desktop experiences to distributed teams—offering a superior, high-density alternative to legacy NVv5 deployments.
- Media and entertainment: Accelerate render farms for VFX studios requiring burst capacity, while simultaneously running generative AI tools for texture creation and scene optimization.

Conclusion: The Engine for the Era of Converged AI
The Azure NCv6 series redefines the boundaries of cloud infrastructure. By combining the raw power of NVIDIA's Blackwell architecture with the high-frequency performance of Intel Granite Rapids, we are moving beyond just "visual computing." Innovators can now leverage a unified platform to build the industrial metaverse, deploy intelligent agents, and scale production AI—all with the enterprise-grade security and hybrid reach of Azure. Ready to experience the next generation? Sign up for the NCv6 Public Preview here.
Azure delivers the first cloud VM with Intel Xeon 6 and CXL memory - now in Private Preview
Intel released its new Intel Xeon 6 6500/6700 series processors with P-cores this year. Intel Xeon 6 processors deliver outstanding performance and scalability for transactional and analytical workloads and provide scale-up capacities of up to 64 TB of memory. In addition, Intel Xeon 6 supports the new Compute Express Link (CXL) standard, which enables memory expansion to accommodate larger data sets in a cost-effective manner. CXL Flat Memory Mode is a unique Intel Xeon 6 capability that enhances the ability to right-size the compute-to-memory ratio and improve scalability without sacrificing performance. This can help run SAP S/4HANA more efficiently, enable greater configuration flexibility that better aligns with business needs, and improve the total cost of ownership (a simple way to check the expanded memory capacity from inside a preview VM is sketched at the end of this article).

In collaboration with SAP and Intel, Microsoft is delighted to announce the private preview of CXL technology on the Azure M-series family of VMs. We believe that, when combined with advancements in the new Intel Xeon 6 processors, it can tackle the challenges of managing the growing volume of data in SAP software, meet the increased demand for faster compute performance, and reduce overall TCO.

Stefan Bäuerle, SVP, Head of BTP, HANA & Persistency at SAP, noted: "Intel Xeon 6 helps deliver system scalability to support the growing demand for high-performance computing and growing database capacity among SAP customers."

Elyse Ge Hylander, Senior Director, Azure SAP Compute, stated: "At Microsoft, we are continually exploring new technological innovations to improve our customer experience. We are thrilled about the potential of Intel's new Xeon 6 processors with CXL and Flat Memory Mode. This is a big step forward to deliver the next-level performance, reliability, and scalability to meet the growing demands of our customers."

Bill Pearson, Vice President of Data Center and Artificial Intelligence at Intel, stated: "Intel Xeon 6 represents a significant advancement for Intel, opening up exciting business opportunities to strengthen our collaboration with Microsoft Azure and SAP. The innovative instance architecture featuring CXL Flat Memory Mode is designed to enhance cost efficiency and performance optimization for SAP software and SAP customers."

If you are interested in joining our CXL private preview in Azure, contact Mseries_CXL_Preview@microsoft.com

Co-author: Phyllis Ng, Senior Director of Hardware Strategic Planning (Memory and Storage), Microsoft
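As referenced above, teams onboarding to the private preview may want to confirm that the expanded memory capacity is visible to the guest OS. The commands below are standard Linux tools rather than a preview-specific API; how CXL-attached capacity is surfaced (folded into system memory in Flat Memory Mode, or as separate NUMA nodes in other configurations) depends on the preview VM configuration, so treat this as a hedged sketch.

```bash
# Show total online memory and the memory block layout as seen by the guest.
free -h
lsmem

# Show NUMA topology; CXL-expanded capacity may appear folded into existing
# nodes (Flat Memory Mode) or as additional nodes, depending on configuration.
numactl -H
```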
Pure Storage Cloud, Azure Native evolves at Microsoft Ignite!
In September, we were pleased to announce the General Availability of Pure Storage Cloud, Azure Native: a co-developed Azure Native Integration that makes it easier for customers to migrate to Azure and benefit from Pure's industry-leading storage platform, now supporting more customer workloads.

Azure CycleCloud 8.8 and CCWS 1.2 at SC25 and Ignite
Azure CycleCloud 8.8: Advancing HPC & AI Workloads with Smarter Health Checks
Azure CycleCloud continues to evolve as the backbone for orchestrating high-performance computing (HPC) and AI workloads in the cloud. With the release of CycleCloud 8.8, users gain access to a suite of new features designed to streamline cluster management, enhance health monitoring, and future-proof their HPC environments.

Key Features in CycleCloud 8.8
1. ARM64 HPC Support
The platform expands its hardware compatibility with ARM64 HPC support, opening new possibilities for energy-efficient and cost-effective compute clusters. This includes access to the newer generation of GB200 VMs as well as general ARM64 support, enabling new AI workloads at a scale never possible before.
2. Slurm Topology-Aware Scheduling
The integration of topology-aware scheduling for Slurm clusters allows CycleCloud users to optimize job placement based on network and hardware topology. This leads to improved performance for tightly coupled HPC workloads and better utilization of available resources (see the sketch at the end of this article).
3. NVIDIA MNNVL and IMEX Support
With expanded support for NVIDIA MNNVL and IMEX, CycleCloud 8.8 ensures compatibility with the latest GPU technologies. This enables users to leverage cutting-edge hardware for AI training, inference, and scientific simulations.
4. HealthAgent: Event-Driven Health Monitoring and Alerting
A standout feature in this release is the enhanced HealthAgent, which delivers event-driven health monitoring and alerting. CycleCloud now proactively detects issues across clusters, nodes, and interconnects, providing real-time notifications and actionable insights. This improvement is a game-changer for maintaining uptime and reliability in large-scale HPC deployments. HealthAgent supports both impactful health checks, which can only run while nodes are idle, and non-impactful health checks that can run throughout the lifecycle of a job. This allows CycleCloud to alert not only on issues that happen while nodes are starting, but also on failures that surface on long-running nodes. Later releases of CycleCloud will also include automatic remediation for common failures, so stay tuned!
5. Enterprise Linux 9 and Ubuntu 24 Support
One common request has been wider support for the various Enterprise Linux (EL) 9 variants, including RHEL 9, AlmaLinux 9, and Rocky Linux 9. CycleCloud 8.8 introduces support for those distributions as well as the latest Ubuntu HPC release.

Why These Features Matter
The CycleCloud 8.8 release marks a significant leap forward for organizations running HPC and AI workloads in Azure. The improved health check support—anchored by HealthAgent and automated remediation—means less downtime, faster troubleshooting, and greater confidence in cloud-based research and innovation. Whether you're managing scientific simulations, AI model training, or enterprise analytics, CycleCloud's latest features help you build resilient, scalable, and future-ready HPC environments.

Key Features in CycleCloud Workspace for Slurm 1.2
Along with the release of CycleCloud 8.8 comes a new CycleCloud Workspace for Slurm (CCWS) release. This release includes the General Availability of features that were previously in preview, such as Open OnDemand, Cendio ThinLinc, and managed Grafana monitoring capabilities. In addition to previously announced features, CCWS 1.2 also includes support for a new Hub and Spoke deployment model.
This allows customers to retain a central hub of shared resources that can be re-used between cluster deployments, with "disposable" spoke clusters that branch from the hub. Hub and Spoke deployments support customers who need to re-deploy clusters to upgrade their operating system, roll out new versions of software, or even reconfigure the overall architecture of their Slurm clusters.

Come Visit Us at SC25 and Microsoft Ignite
To learn more about these features, come visit us at the Microsoft booth at #SC25 in St. Louis, MO and #Microsoft #Ignite in San Francisco this week!
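As referenced in the topology-aware scheduling feature above, here is a minimal sketch of how you might verify and exercise Slurm's view of the network topology on a CCWS cluster. These are standard Slurm commands rather than CycleCloud-specific ones, and the partition and job script names are placeholders.

```bash
# Inspect the switch/node hierarchy Slurm has been configured with.
scontrol show topology

# Submit a tightly coupled MPI job; with topology-aware scheduling enabled,
# Slurm places the allocation on nodes that share as few switches as possible.
# "hpc" is a placeholder partition name and run_mpi_job.sh a placeholder script.
sbatch --partition=hpc --nodes=8 --exclusive run_mpi_job.sh
```

Placement can additionally be constrained per job with Slurm's standard submission options (for example, --switches=1 to request an allocation under a single switch); the defaults applied by CycleCloud 8.8 are the ones described in its release notes.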
Your guide to Azure Compute at Microsoft Ignite 2025
The countdown to Microsoft Ignite 2025 is almost over—Microsoft Ignite runs November 18–21, 2025! Whether you'll be joining us in person or tuning in virtually, this guide is your essential resource for everything Azure Compute. Explore the latest advancements, connect with product experts, and expand your cloud skills through curated sessions and interactive experiences. Attendees will have the opportunity to dive deep into new product capabilities and solutions, including ways to boost virtual machine performance, enhance resiliency, and optimize cloud operations. Be sure to add these sessions to your schedule for a personalized and can't-miss Ignite experience. Bookmark this guide for quick access to all the latest Azure Compute news and updates throughout Ignite 2025!

Featured sessions

Tuesday
BRK171: What's new and what's next in Azure IaaS (Level: Intermediate 200)
In this session, we'll introduce the latest capabilities across compute, storage, and networking. Uncover the advancements in Azure IaaS driving performance, resiliency, and cost efficiency. We will present how Azure's global backbone, enhanced capabilities, and expanding portfolio can support mission-critical, cloud-native, and AI workloads—while built-in security and flexible tiering help right-size app deployments and accelerate modernization.
Tuesday, November 18 | 2:30 PM–3:15 PM PST

Wednesday
BRK430: Inside Azure Innovations with Mark Russinovich (Level: Advanced 300)
Join Mark Russinovich, CTO and Technical Fellow of Microsoft Azure. Mark will take you on a tour of the latest innovations in Azure architecture and explain how Azure enables intelligent, modern, and innovative applications at scale in the cloud, on-premises, and on the edge. The session features some of the latest Compute announcements with Azure Boost.
Wednesday, November 19, 2:45 PM PST

Other related IaaS sessions
Use the following as a guide to build your session schedule with an emphasis on our Azure Compute topics. These sessions will be in person and recorded. Sessions Tuesday through Thursday will be live streamed.

Thursday
BRK176: Driving efficiency and cost optimization for Azure IaaS deployments (Level: Intermediate 200)
Control cloud spend without compromising performance. This session shows how Azure IaaS helps IT leaders optimize costs through flexible pricing, built-in tools, and smart resource planning. Learn how to align infrastructure choices with workload requirements, reduce TCO, and make informed decisions that support growth and innovation. You will gain a deeper understanding of how Azure delivers a comprehensive set of services, tools, and financial instruments to optimize your cloud costs at scale.
Thursday, November 20, 9:45 AM PST

BRK217: Resilience by design: Secure, scalable, AI-ready cloud with Azure (Level: Advanced 300)
Resiliency is foundational. Explore how resiliency on Azure enables secure, scalable, AI-ready cloud architectures. Learn to set resilience goals, simulate failures, and orchestrate recovery. See live demos and discover how shared responsibility empowers teams to deliver trusted, resilient outcomes.
Thursday, November 20, 1:00 PM PST

BRK178: Architecting for resiliency on Azure Infrastructure (Level: Intermediate 200)
Discover how to build resilient cloud solutions on Azure by leveraging availability zones, multi-region deployments, and fungible products.
This session explores architectural patterns, platform capabilities, and best practices to ensure high availability, fault tolerance, and business continuity for mission-critical workloads in dynamic and distributed environments.
Thursday, November 20, 1:00 PM PST

BRK148: Architect resilient apps with Azure backup and reliability features (Level: Advanced 300)
Learn to use self-serve tools to strengthen zonal resiliency for critical workloads. Assess and validate resilience across VMs, DBs, and containers. Explore enhanced data and cyber resiliency with immutability and threat detection to guard against ransomware. Discover expanded workload coverage and real-time insights to proactively protect your applications and infrastructure.
Thursday, November 20, 3:30 PM PST

Friday
BRK146: Resiliency and recovery with Azure Backup and Site Recovery (Level: Advanced 300)
This session will show how to secure, detect threats, and quickly recover critical workloads across Azure environments using advanced backup and disaster recovery solutions. It covers modern techniques like threat-aware backups, container protection, and seamless disaster recovery to help meet compliance and recovery objectives.
Friday, November 21, 9:00 AM PST

BRK149: Unlock cloud-scale observability and optimization with Azure (Level: Advanced 300)
In this session, we'll deep dive into how Azure Monitor delivers end-to-end observability across your cloud and hybrid environments, helping you detect issues early and reduce mean time to recovery. We'll also share how new Copilot in Azure agents can extend this visibility into actionable cost and carbon efficiency insights—helping you identify optimization opportunities, validate recommendations, and streamline resource performance for business impact.
Friday, November 21, 10:15 AM PST

BRK173: Azure IaaS best practices to enhance performance and scale (Level: Advanced 300)
Azure IaaS can deliver excellent performance and scalability across a broad range of workloads. With high-throughput storage, low-latency networking, and intelligent auto-scaling, Azure supports demanding apps with precision. Learn how to optimize compute, storage, and network resources to meet performance goals, reduce costs, and scale confidently across global regions. Dive into the latest capabilities that Azure Boost, Compute Fleet, Azure Virtual Machines, Azure Storage, and Networking offer.
Friday, November 21, 10:15 AM PST

BRK172: Powering modern cloud workloads with Azure Compute (Level: Advanced 300)
Uncover new VM offering announcements and explore innovations like Azure Boost. Dive into the latest compute innovation at the core of Azure IaaS. Whether you're running mission-critical enterprise apps or scaling cloud-native services, discover how these innovations are unlocking new value for customers and get a preview of what's coming next.
Friday, November 21, 11:30 AM PST

BRK168: Azure IaaS platform security deep dive (Level: Advanced 300)
As organizations accelerate their cloud adoption, robust security for your Infrastructure as a Service platform is more critical than ever. This session will provide a comprehensive exploration of Azure's security architecture, best practices, and innovations across four pillars: foundational security, compute security, network security, and storage security. Attendees will gain actionable insights to strengthen their cloud posture, ensure compliance, and protect sensitive workloads.
Friday, November 21, 11:30 AM PST

Upskill yourself with hands-on labs
Live demos and hands-on labs are available exclusively to in-person attendees, providing a direct, firsthand experience.

Tuesday
LAB500: Attain unified observability and optimization in Azure (Level: Intermediate 200)
Get an AI-powered view of your Azure workload health and performance while uncovering cost and carbon savings. In this lab, use AI to investigate anomalies, correlate telemetry, and drive optimization. Apply FinOps and sustainability insights, align health with SLI/SLO targets, and improve monitoring posture for lasting efficiency. Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.
Tuesday, November 18, 2:45 PM PST

LAB520: Start, Get and Stay Resilient with Azure (Level: Intermediate 200)
Understand the Start, Get, and Stay Resilient journey. Get equipped with tools and insights to architect mission-critical applications with Azure's Resiliency and Configuration experiences. Assess your resiliency posture, apply recommendations, validate your posture, and orchestrate recovery. With the Essentials Machine Management bundle from Azure, manage and maintain the state of your resources, enforce configurations across devices, and ensure resilience is not a one-time goal but an ongoing state. Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.
Tuesday, November 18, 4:30 PM PST
Kernel Dump based Online Repair
Introduction
In the ever-evolving landscape of cloud computing, reliability remains paramount. As workloads scale and businesses depend on uninterrupted service, Azure continues to invest in technologies that enhance system resilience and minimize customer impact in cases of failure. Azure Compute infrastructure operates at an unmatched scale, with certain Availability Zones (AZs) hosting nearly a million Azure Virtual Machines (Azure VMs) that run customer workloads. These Azure VMs depend on a sophisticated ecosystem of physical machines, networking infrastructure, storage systems, and other essential components. When failures occur at any of these layers—whether from hardware malfunctions, kernel issues, or network disruptions—customers may experience service interruptions. To address these challenges, the Azure Compute Repair Platform plays a vital role in identifying, diagnosing, and applying mitigation strategies to resolve issues as quickly as possible.

To further improve our ability to diagnose and resolve failures swiftly and accurately, we present a novel approach: a real-time kernel dump analysis technology aimed at identifying the root cause of issues and facilitating precise, data-driven repairs. This is an addition to the gamut of detection and mitigation strategies we already leverage. This capability is generally available in all Azure regions and benefits all of our customers, including our most critical ones.

This project would not have been possible without the invaluable support and contributions of Binit Mishra, Dhruv Joshi, Abhay Ketkar, Gaurav Jagtiani, Mukhtar Ahmed, Siamak Ahari, Rajeev Acharya, Deepak Venkatesh, Abhinav Dua, Alvina Putri, Emma Montalvo, and Chantale Ninah—my heartfelt thanks to each of you.

Real-Time Failure Diagnosis and Repair
We have developed a novel approach to diagnosing and mitigating failures in Azure Compute infrastructure by understanding the state of the kernel on the Azure Host Machine through real-time collection and analysis of Live Kernel Dumps (LKD). This enables us to pinpoint the exact issue with the kernel and use that insight for precise repair actions, rather than applying a broad set of mitigation strategies. By reducing trial-and-error repair attempts, we significantly minimize downtime and accelerate issue resolution.

Kernel dumps can help detect critical issues such as kernel panics, memory leaks, and driver failures. Kernel panics occur when the system encounters a fatal error, causing the kernel to stop functioning. Memory leaks, where memory is not properly released, can lead to system instability over time. Driver failures, often caused by faulty or incompatible drivers, can also be identified through kernel dump analysis. Importantly, it is the Repair Platform that triggers LKD collection and then consumes the LKD analysis to make informed decisions. By incorporating live kernel dump analysis into our mitigation workflows, we enhance Azure's ability to quickly diagnose, categorize, and resolve infrastructure issues, ultimately reducing system downtime and improving overall performance.

Architecture
How the system works:
1. Dump Collection: When an issue is detected, the Repair Platform triggers the collection of a Live Kernel Dump (LKD) on the machine hosting the affected Azure VM.
2. Dump Upload: An agent running on the machine monitors a designated storage location for newly generated dumps. When a dump is detected, the agent uploads it from the Azure Host Machine to an online Analysis Service.
3. Failure Classification: The Analysis Service processes the uploaded Live Kernel Dump (LKD), diagnoses the root cause of the failure, and categorizes it accordingly—for example, identifying a networking switch in a hung state.
4. Persistence: The Analysis Service generates a detailed failure message and stores it in an Azure Table for tracking and retrieval.
5. Automated Repair Decisions: The Repair Platform continuously monitors the Azure Table for failure messages. Once a failure is recorded, it retrieves the data and makes an informed repair decision.

Impact
By leveraging this approach, the Azure Compute Repair Platform achieves both a better repair strategy and significant downtime savings.

(A) Better Repair Strategy
By precisely identifying failures, the Repair Platform can classify issues accurately and apply the most effective resolution method, minimizing unnecessary disruptions and enhancing long-term infrastructure stability. For instance, in the case of a VM Switch Hung issue, the Repair Platform attempts to mitigate the problem on the affected Azure Host Machine. However, if unsuccessful, it migrates the customer's workload to a more stable machine and initiates aggressive repairs on the faulty Azure Host Machine. While this restores service, it does not address the underlying cause, leaving the Azure Host Machine vulnerable to repeated VM Switch Hung failures. By enabling real-time failure classification, the Repair Platform can instead hold a subset of affected Azure Host Machines in a restricted state, preventing new Azure VMs from being assigned to them. This approach allows Azure's hardware and network partners to run diagnostics, gain deeper insights into the failure, and implement targeted fixes. As a result, Azure has reduced recurring failures, minimized customer impact, and improved overall infrastructure reliability. While the VM Switch Hung issue serves as an example, this data-driven repair strategy can be extended to various failure scenarios, enabling faster recovery, fewer disruptions, and a more resilient platform.

(B) Downtime Reduction
The longer it takes to resolve an issue, the longer a customer workload may experience interruptions. As a result, downtime reduction is one of the key metrics we prioritize. We significantly reduce time to resolution by providing an early signal that pinpoints the exact issue. This allows the Repair Platform to perform targeted repairs rather than relying on time-consuming, broad mitigation strategies.

Sample scenario: When a customer faces issues stopping or destroying an Azure VM, and the problem is severe enough that all repair attempts fail, the only option may be to migrate the customer's workload to a different Azure Host Machine. Today, this process can take up to 26 minutes before the decision to move the customer workload is reached. With this new approach, however, we are optimizing to detect the failure and surface the issue within 3 minutes, enabling a decision much earlier and reducing customer downtime by 23 minutes—a significant improvement in downtime reduction and customer resolution.

Conclusion
Online kernel dump analysis for machine issue resolution marks a significant advancement in Azure's commitment to reliability, bringing us closer to a future where failures are not just detected but proactively mitigated in real time.
By enabling real-time diagnostics and automated repair strategies, this approach is redefining Compute reliability—drastically reducing mitigation times, enhancing repair accuracy, and ensuring customers experience seamless service continuity. As we continue refining it, our focus remains on expanding its capabilities, enhancing kernel analysis, reducing analysis time, and strengthening the entire pipeline for greater efficiency and resilience. Stay tuned for further updates as we push the boundaries of intelligent cloud reliability.
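The Azure Table handoff between the Analysis Service and the Repair Platform described above is internal to Azure, so its schema is not public. Purely as an illustration of the pattern (a classifier writing failure records to a table and a consumer polling them), here is a hedged sketch using the public Azure CLI against a hypothetical table with invented column names:

```bash
# Hypothetical illustration only: query failure-classification records for a
# given host from an Azure Table. The table name, partition key, and property
# names are invented for this sketch and do not reflect the internal schema.
az storage entity query \
  --account-name examplediagaccount \
  --table-name FailureClassifications \
  --filter "PartitionKey eq 'host-0042' and Category eq 'VmSwitchHung'" \
  --select Category RootCause Timestamp
```

A consumer such as a repair orchestrator would poll for new entities like these and branch its mitigation logic on the classification field.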
Enhancing Resiliency in Azure Compute Gallery
In today's cloud-driven world, ensuring the resiliency and recoverability of critical resources is top of mind for organizations of all sizes. Azure Compute Gallery (ACG) continues to evolve, introducing robust features that safeguard your virtual machine (VM) images and application artifacts. In this blog post, we'll explore two key resiliency innovations: the new Soft Delete feature (currently in preview) and zone-redundant storage (ZRS) as the default storage type for image versions. Together, these features significantly reduce the risk of data loss and improve business continuity for Azure users.

The Soft Delete Feature in Preview: A Safety Net for Your Images
Many Azure customers have struggled with accidental deletion of VM images, which disrupts workflows and causes data loss with no way to recover, often requiring users to rebuild images from scratch. Previously, removing an image from the Azure Compute Gallery was permanent, and the resulting service disruption and lengthy image-rebuild process caused customer dissatisfaction. Now, with Soft Delete (currently available in public preview), Azure introduces a safeguard that makes it easy to recover deleted images within a specified retention period.

How Soft Delete Works
When Soft Delete is enabled on a gallery, deleting an image doesn't immediately remove it from the system. Instead, the image enters a "soft-deleted" state, where it remains recoverable for up to 7 days. During this grace period, administrators can review and restore images that may have been deleted by mistake, preventing permanent loss. After the retention period expires, the platform automatically performs a hard (permanent) delete, at which point recovery is no longer possible.

Key Capabilities and User Experience
- Recovery period: Images are retained for a default period of 7 days, giving users time to identify and restore any resources deleted in error.
- Seamless recovery: Recover soft-deleted images directly from the Azure Portal or via the REST API, making it easy to integrate with automation and CI/CD pipelines.
- Role-based access: Only owners or users with the Compute Gallery Sharing Admin role at the subscription or gallery level can manage soft-deleted images, ensuring tight control over recovery and deletion operations.
- No additional cost: The Soft Delete feature is provided at no extra charge. After deletion, only one replica per region is retained, and standard storage charges apply until the image is permanently deleted.
- Comprehensive support: Soft Delete is available for Private, Direct Shared, and Community Galleries. New and existing galleries can be configured to support the feature.

To enable Soft Delete, update your gallery settings via the Azure Portal or the Azure CLI (see the sketch at the end of this article). Once enabled, the "delete" operation will soft-delete images, and you can view, list, restore, or permanently remove these images as needed. Learn more about the Soft Delete feature at https://aka.ms/sigsoftdelete

Zone-Redundant Storage (ZRS) by Default
Another major resiliency enhancement in Azure Compute Gallery is the default use of zone-redundant storage (ZRS) for image versions. ZRS replicates your images across multiple availability zones within a region, ensuring that your resources remain available even if a zone experiences an outage. By defaulting to ZRS, Azure raises the baseline for image durability and access, reducing the risk of disruptions due to zone-level failures.
- Automatic redundancy: All new image versions are stored using ZRS by default, without requiring manual configuration.
- High availability: Your VM images are protected against the failure of any single availability zone within the region.
- Simplified management: Users benefit from resilient storage without the need to explicitly set up or manage storage account redundancy settings.
Default ZRS capability starts with API version 2025-03-03; Portal/SDK support will be added later.

Why These Features Matter
The combination of Soft Delete and ZRS by default provides Azure customers with enhanced operational reliability and data protection. Whether overseeing a suite of VM images for development and testing purposes or coordinating production deployments across multiple teams, these features offer the following benefits:
- Mitigate operational risks associated with accidental deletions or regional outages.
- Minimize downtime and reduce manual recovery processes.
- Promote compliance and security through advanced access controls and transparent recovery procedures.

To evaluate the Soft Delete feature, you may register for the preview and activate it on your galleries through the Azure Portal or REST API. Please note that, during its preview phase, this capability is recommended for assessment and testing rather than for production environments. ZRS is already available out of the box, delivering image availability starting with API version 2025-03-03. For comprehensive details and step-by-step guidance on enabling and utilizing Soft Delete, please review the public specification document at https://aka.ms/sigsoftdelete

Conclusion
Azure Compute Gallery continues to push the envelope on resource resiliency. With Soft Delete (preview) offering a reliable recovery mechanism for deleted images, and ZRS by default protecting your assets against zonal failures, Azure empowers you to build and manage VM deployments with peace of mind. Stay tuned for future updates as these features evolve toward general availability.
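As referenced in the Soft Delete section above, the feature can be enabled from the command line as well as the Portal. The following is a hedged sketch: the generic --set property path shown here is assumed from the preview specification and may change before general availability, so confirm the exact property names against https://aka.ms/sigsoftdelete before relying on it.

```bash
# Hedged sketch: enable Soft Delete on an existing gallery by setting the
# soft-delete policy property (property path assumed from the preview spec).
az sig update \
  --resource-group my-image-rg \
  --gallery-name myGallery \
  --set softDeletePolicy.isSoftDeleted=true

# Verify the gallery's current soft-delete setting.
az sig show \
  --resource-group my-image-rg \
  --gallery-name myGallery \
  --query softDeletePolicy
```

Restoring a soft-deleted image version is then done through the Portal or REST API as described above, within the default 7-day retention window.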
Performance and Scalability of Azure HBv5-series Virtual Machines
Azure HBv5-series virtual machines (VMs) for CPU-based high performance computing (HPC) are now Generally Available. This blog provides in-depth information about the technical underpinnings, performance, cost, and management implications of these HPC-optimized VMs. Azure HBv5 VMs bring leadership levels of performance, cost optimization, and server (VM) consolidation for a variety of workloads driven by memory performance, such as computational fluid dynamics, weather simulation, geoscience simulations, and finite element analysis. For these applications, and compared to HBv4 VMs (previously the highest performance offering for these workloads), HBv5 provides up to:
- 5x higher performance for CFD workloads with 43% lower costs
- 3.2x higher performance for weather simulation with 16% lower costs
- 2.8x higher performance for geoscience workloads at the same costs

HBv5-series Technical Overview & VM Sizes
Each HBv5 VM features several new technologies for HPC customers, including:
- Up to 6.6 TB/s of memory bandwidth (STREAM TRIAD) and 432 GB memory capacity
- Up to 368 physical cores per VM (user configurable) with custom AMD EPYC CPUs, Zen4 microarchitecture (SMT disabled)
- Base clock of 3.5 GHz (~1 GHz higher than other 96-core EPYC CPUs) and boost clock of 4 GHz across all cores
- 800 Gb/s NVIDIA Quantum-2 InfiniBand (4 x 200 Gb/s CX-7), ~2x higher than HBv4 VMs
- 180 Gb/s Azure Accelerated Networking, ~2.2x higher than HBv4 VMs
- 15 TB local NVMe SSD with up to 50 GB/s (read) and 30 GB/s (write) of bandwidth, ~4x higher than HBv4 VMs

The highlight feature of HBv5 VMs is their use of high-bandwidth memory (HBM). HBv5 VMs utilize a custom AMD CPU that increases memory bandwidth by ~9x v. dual-socket 4th Gen EPYC (Zen4, "Genoa") server platforms, and ~7x v. dual-socket EPYC (Zen5, "Turin") server platforms, respectively. HBv5 delivers similar levels of memory bandwidth improvement compared to the highest-end alternatives from the Intel Xeon and ARM CPU ecosystems.

HBv5-series VMs are available in the following sizes with specifications as shown below. Just like existing H-series VMs, HBv5-series includes constrained-cores VM sizes, enabling customers to optimize their VM dimensions for a variety of scenarios:
- ISV licensing constraining a job to a targeted number of cores
- Maximum performance per VM or maximum performance per core
- Minimum RAM per core (1.2 GB, suitable for strong-scaling workloads) to maximum memory per core (9 GB, suitable for large datasets and weak-scaling workloads)

Table 1: Technical specifications of HBv5-series VMs

Note: Maximum clock frequencies (FMAX) are based on product specifications of the AMD EPYC 9V64H processor. Clock frequencies experienced by a customer are a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application. For more information, see the official documentation for HBv5-series VMs.

Microbenchmark Performance
This section focuses on microbenchmarks that characterize the performance of the memory subsystem, compute capabilities, and InfiniBand network of HBv5 VMs.

Memory & Compute Performance
To capture synthetic performance, we ran the following industry-standard benchmarks:
- STREAM – memory bandwidth
- High Performance Conjugate Gradient (HPCG) – sparse linear algebra
- High Performance Linpack (HPL) – dense linear algebra
Absolute results and comparisons to HBv4 VMs are shown in Table 2, below.

Table 2: Results of HBv5 running the STREAM, HPCG, and HPL benchmarks.
Note: STREAM was run with the following CLI parameters:
OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./amd_zen_stream
STREAM data size: 2621440000 bytes

InfiniBand Networking Performance
Each HBv5-series VM is equipped with four NVIDIA Quantum-2 network interface cards (NICs), each operating at 200 Gb/s for an aggregate bandwidth of 800 Gb/s per VM (node). We ran the industry-standard IB perftests, based on the OSU benchmarks, across two (2) HBv5-series VMs, as depicted in the results shown in Figures 1–3, below.
Note: all results below are for a single 200 Gb/s (uni-directional) link only. At the VM level, all bandwidth results below are 4x higher, as there are four (4) InfiniBand links per HBv5 server.

Unidirectional bandwidth: numactl -c 0 ib_send_bw -aF -q 2
Figure 1: results showing 99% achieved uni-directional bandwidth v. theoretical peak.

Bi-directional bandwidth: numactl -c 0 ib_send_bw -aF -q 2 -b
Figure 2: results showing 99% achieved bi-directional bandwidth v. theoretical peak.

Latency:
Figure 3: results measuring as low as 1.25 microsecond latencies among HBv5 VMs. Latencies experienced by users will depend on the message sizes employed by applications.

Application Performance, Cost/Performance, and Server (VM) Consolidation
This section focuses on characterizing HBv5-series VMs when running common, real-world HPC applications, with an emphasis on those known to be meaningfully bound by memory performance, as that is the focus of the HB-series family. We characterize HBv5 below in three (3) ways of high relevance to customer interests:
- Performance ("how much faster can it do the work")
- Cost/Performance ("how much can it reduce the costs to complete the work")
- Fleet consolidation ("how much can a customer simplify the size and scale of the compute fleet they manage while still being able to do the work")

Where possible, we have included comparisons to other Azure HPC VMs, including:
- Azure HBv4/HX series with 176 physical cores of 4th Gen AMD EPYC CPUs with 3D V-Cache ("Genoa-X") (HBv4 specifications, HX specifications)
- Azure HBv3 with 120 physical cores of 3rd Gen AMD EPYC CPUs with 3D V-Cache ("Milan-X") (HBv3 specifications)
- Azure HBv2 with 120 physical cores of 2nd Gen AMD EPYC ("Rome") processors (full specifications)

Unless otherwise noted, all tests shown below were performed with:
- Alma Linux 8.10 (image URN: almalinux:almalinux-hpc:8_10-hpc-gen2:latest); for scaling tests, Alma Linux 8.6 (image URN: almalinux:almalinux-hpc:8_6-hpc-gen2:latest)
- NVIDIA HPC-X MPI

Further, all cost/performance comparisons use list-price, Pay-As-You-Go (PAYG) rate information found on Azure Linux Virtual Machines Pricing. Absolute costs will be a function of a customer's workload, model, and consumption approach (PAYG v. Reserved Instance, etc.). That said, the relative cost/performance comparisons illustrated below should hold for the workload and model combinations shown, regardless of the consumption approach.

Computational Fluid Dynamics (CFD)
OpenFOAM – version 2306 with 100M cell Motorbike case
Figure 4: HBv5 v. HBv4 on OpenFOAM with the Motorbike 100M cell case. HBv5 VMs provide a 4.8x performance increase over HBv4 VMs.
Figure 5: The cost to complete the OpenFOAM Motorbike 100M case is just 57% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running OpenFOAM cases similar in size and complexity to the 100M cell Motorbike problem, organizations can consolidate their server (VM) deployments by approximately a factor of five (5).
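For readers who want to reproduce this class of result on their own meshes, a parallel OpenFOAM run on HBv5 follows the standard decompose-and-run pattern shown below. This is a hedged sketch, not the exact benchmark recipe used above: rank counts, decomposition method, and solver depend on the case, and the motorBike tutorial shipped with OpenFOAM is far smaller than the 100M cell benchmark mesh.

```bash
# Hedged sketch of a parallel OpenFOAM run (assumes OpenFOAM v2306 and an MPI
# stack such as HPC-X are already loaded, e.g. via environment modules).
cd "$FOAM_RUN/motorBike"   # placeholder case directory

# Decompose the mesh across the MPI ranks defined in system/decomposeParDict.
decomposePar

# Run the solver in parallel; 368 ranks fills one full-size HBv5 VM.
# For multi-VM runs, add a hostfile and scale -np accordingly.
mpirun -np 368 simpleFoam -parallel

# Reassemble the decomposed fields for post-processing.
reconstructPar
```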
Palabos – version 1.01 with 3D Cavity, 1001 x 1001 x 1001 cells case
Figure 6: On Palabos, a Lattice Boltzmann solver using a streaming memory access pattern, HBv5 VMs provide a 4.4x performance increase over HBv4 VMs.
Figure 7: The cost to complete the Palabos 3D Cavity case is just 62% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running Palabos with cases similar in size and complexity to the 3D Cavity problem, organizations can consolidate their server (VM) deployments by approximately a factor of ~4.5.

Ansys Fluent – version 2025 R2 with F1 racecar 140M case
Figure 8: On Ansys Fluent, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.
Figure 9: The cost to complete the Ansys Fluent F1 racecar 140M case is just 81% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running Ansys Fluent with cases similar in size and complexity to the 140M cell F1 racecar problem, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.5.

Siemens Star-CCM+ – version 17.04.005 with AeroSUV Steady Coupled 106M case
Figure 10: On Star-CCM+, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.
Figure 11: The cost to complete the Star-CCM+ AeroSUV Steady Coupled 106M case is just 81% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running Star-CCM+ with cases similar in size and complexity to the 106M cell AeroSUV Steady Coupled case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.5.

Weather Modeling
WRF – version 4.2.2 with CONUS 2.5km case
Figure 12: On WRF, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.
Figure 13: The cost to complete the WRF CONUS 2.5km case is just 84% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running WRF with cases similar in size and complexity to the 2.5km CONUS case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.

Energy Research
Devito – version 4.8.7 with Acoustic Forward case
Figure 14: On Devito, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.
Figure 15: The cost to complete the Devito Acoustic Forward OP case is equivalent to what it costs to complete the same case on HBv4.
Above, we can see that for customers running Devito with cases similar in size and complexity to the Acoustic Forward OP case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.

Molecular Dynamics
NAMD – version 2.15a2 with STMV 20M case
Figure 16: On NAMD, HBv5 VMs provide a 2.18x performance increase over HBv4 VMs.
Figure 17: The cost to complete the NAMD STMV 20M case is 26% higher on HBv5 than what it costs to complete the same case on HBv4.
Above, we can see that for customers running NAMD with cases similar in size and complexity to the STMV 20M case, organizations can consolidate their server (VM) deployments by approximately a factor of ~2. Notably, NAMD is a compute-bound case rather than a memory-performance-bound one. We include it here to illustrate that not all workloads are a fit for HBv5. This latest Azure HPC VM is the fastest at this workload on the Microsoft Cloud, but it does not benefit substantially from HBv5's premium levels of memory bandwidth. NAMD would instead run more cost-efficiently on a CPU that supports AVX512 instructions natively or, much better still, on a modern GPU.
Scalability of HBv5-series VMs

Weak Scaling
Weak scaling measures how well a parallel application or system performs when both the number of processing elements and the problem size increase proportionally, so that the workload per processor remains constant. Weak-scaling cases are often employed when time-to-solution is fixed (e.g. it is acceptable to solve a problem within a specified period) but a user desires a simulation of higher fidelity or resolution. A common example is operational weather forecasting. To illustrate weak scaling on HBv5 VMs, we ran Palabos with the same 3D cavity problem shown earlier.
Figure 18: On Palabos with the 3D Cavity model, HBv5 scales linearly as the 3D cavity size is proportionately increased.

Strong Scaling
Strong scaling is characterized by the efficiency with which execution time is reduced as the number of processor elements (CPUs, GPUs, etc.) is increased while the problem size is kept constant. Strong-scaling cases are often employed when the fidelity or resolution of the simulation is acceptable, but a user requires faster time to completion. A common example is product engineering validation, when an organization wants to bring a product to market faster but must complete a broad range of validation and verification scenarios before doing so. To illustrate strong scaling on HBv5 VMs, we ran NAMD with two different problems, each intended to illustrate how expectations for strong-scaling efficiency change depending on problem size and the ordering of computation v. communication in distributed-memory workloads.

First, let us examine NAMD with the 20M STMV benchmark.
Figure 19: Strong scaling on HBv5 with the NAMD STMV 20M cell case.
As illustrated above, for strong-scaling cases in which compute time is continuously reduced (by leveraging more and more processor elements) but communication time remains constant, scaling efficiency will only stay high for so long. That principle is well represented by the STMV 20M case, for which parallel efficiency remains linear (i.e. cost/job remains flat) at two (2) nodes but degrades after that. This is because while compute is being sped up, the MPI time remains relatively flat. As such, the relatively static MPI time comes to dominate end-to-end wall clock time as VM scaling increases. Said another way, HBv5 features so much compute performance that even for a moderate-sized problem like STMV 20M, scaling the infrastructure can only take performance so far, and cost/job will begin to increase.

If we examine HBv5 against the 210M cell case, however, with 10.5x as many elements to compute as its 20M sibling, the scaling efficiency story changes significantly.
Figure 20: On NAMD with the STMV 210M cell case, HBv5 scales linearly out to 32 VMs (or more than 11,000 CPU cores).
As illustrated above, larger cases with significant compute requirements will continue to scale efficiently on larger amounts of HBv5 infrastructure. While MPI time remains relatively flat for this case (as it is for the smaller STMV 20M case), the compute demands remain the dominant fraction of end-to-end wall clock time. As such, HBv5 scales these problems with very high levels of efficiency, and job costs to the user remain flat despite up to 8x as many VMs being leveraged compared to the four (4) VM baseline.

The key takeaways for strong-scaling scenarios are two-fold.
First, users should run scaling tests with their applications and models to find a sweet spot of faster performance at constant job cost; this will depend heavily on model size. Second, as new and very high-end compute platforms like HBv5 emerge that accelerate compute time, application developers will need to find ways to reduce the wall clock time lost to communication (MPI) bottlenecks. Recommended approaches include using fewer MPI processes and, ideally, restructuring applications to overlap communication with compute phases.
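To support the first takeaway above, here is a small, self-contained helper (not from the original post) for computing speedup and strong-scaling parallel efficiency from measured wall-clock times, using the usual definitions: speedup = T_base / T_N, and efficiency = speedup x (N_base / N).

```bash
#!/usr/bin/env bash
# Compute strong-scaling speedup and parallel efficiency from two measurements.
# Usage:   ./scaling_efficiency.sh <base_vms> <base_seconds> <vms> <seconds>
# Example: ./scaling_efficiency.sh 4 1200 32 180
base_n=$1; base_t=$2; n=$3; t=$4

awk -v bn="$base_n" -v bt="$base_t" -v n="$n" -v t="$t" 'BEGIN {
  speedup = bt / t                  # how much faster than the baseline run
  efficiency = speedup * bn / n     # 1.0 means perfectly linear scaling
  printf "Speedup vs %d-VM baseline: %.2fx\n", bn, speedup
  printf "Parallel efficiency at %d VMs: %.1f%%\n", n, efficiency * 100
}'
```

At 100% efficiency, cost per job stays flat as VMs are added; efficiency below 100% means the job finishes faster but costs more per run.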
Join Microsoft @ SC25: Experience HPC and AI Innovation
Supercomputing 2025 is coming to St. Louis, MO, November 16–21! Visit Microsoft Booth #1627 to explore cutting-edge HPC and AI solutions, connect with experts, and experience interactive demos that showcase the future of compute. Whether you're attending technical sessions, stopping by for a coffee and a chat with our team, or joining our partner events, we've got something for everyone.

Booth Highlights
- Alpine Formula 1 Showcar: Snap a photo with a real Alpine F1 car and learn how high-performance computing drives innovation in motorsports.
- Silicon Wall: Discover silicon diversity—featuring chips from our partners AMD and NVIDIA, alongside Microsoft's own first-party silicon: Maia, Cobalt, and Majorana.
- NVIDIA Weather Modeling Demo: See how AI and HPC predict extreme weather events with Tomorrow.io and NVIDIA technology.
- Coffee Bar with Barista: Enjoy a handcrafted coffee while you connect with our experts.
- Immersive Screens: Watch live demos and visual stories about HPC breakthroughs and AI innovation.
- Hardware Bar: Explore AMD EPYC™ and NVIDIA GB200 systems powering next-generation workloads.

Conference Details
Conference week: Sun, Nov 16 – Fri, Nov 21
Expo hours (CST):
- Mon, Nov 17: 7:00–9:00 PM (Opening Night)
- Tue, Nov 18: 10:00 AM–6:00 PM
- Wed, Nov 19: 10:00 AM–6:00 PM
- Thu, Nov 20: 10:00 AM–3:00 PM
Customer meeting rooms: Four Seasons Hotel

Quick links
- RSVP — Microsoft + AMD Networking Reception (Tue, Nov 18): https://aka.ms/MicrosoftAMD-Mixer
- RSVP — Microsoft + NVIDIA Panel Luncheon (Wed, Nov 19): the luncheon is now closed as the event is fully booked.

Earned Sessions (Technical Program)
Each entry lists session type | time (CST) | title | Microsoft contributor(s) | location.

Sunday, Nov 16
- Tutorial | 8:30 AM–5:00 PM | Delivering HPC: Procurement, Cost Models, Metrics, Value, and More | Andrew Jones | Room 132
- Tutorial | 8:30 AM–5:00 PM | Modern High Performance I/O: Leveraging Object Stores | Glenn Lockwood | Room 120
- Workshop | 2:00–5:30 PM | 14th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2025) | Torsten Hoefler | Room 265

Monday, Nov 17
- Early Career Program | 3:30–4:45 PM | Voices from the Field: Navigating Careers in Academia, Government, and Industry | Joe Greenseid | Room 262
- Workshop | 3:50–4:20 PM | Towards Enabling Hostile Multi-tenancy in Kubernetes | Ali Kanso; Elzeiny Mostafa; Gurpreet Virdi; Slava Oks | Room 275
- Workshop | 5:00–5:30 PM | On the Performance and Scalability of Cloud Supercomputers: Insights from Eagle and Reindeer | Amirreza Rastegari; Prabhat Ram; Michael F. Ringenburg | Room 267

Tuesday, Nov 18
- BOF | 12:15–1:15 PM | High Performance Software Foundation BoF | Joe Greenseid | Room 230
- Poster | 5:30–7:00 PM | Compute System Simulator: Modeling the Impact of Allocation Policy and Hardware Reliability on HPC Cloud Resource Utilization | Jarrod Leddy; Huseyin Yildiz | Second Floor Atrium

Wednesday, Nov 19
- BOF | 12:15–1:15 PM | The Future of Python on HPC Systems | Michael Droettboom | Room 125
- BOF | 12:15–1:15 PM | Autonomous Science Network: Interconnected Autonomous Science Labs Empowered by HPC and Intelligent Agents | Joe Tostenrude | Room 131
- Paper | 1:30–1:52 PM | Uno: A One-Stop Solution for Inter- and Intra-Data Center Congestion Control and Reliable Connectivity | Abdul Kabbani; Ahmad Ghalayini; Nadeen Gebara; Terry Lam | Rooms 260–267
- Paper | 2:14–2:36 PM | SDR-RDMA: Software-Defined Reliability Architecture for Planetary-Scale RDMA Communication | Abdul Kabbani; Jie Zhang; Jithin Jose; Konstantin Taranov; Mahmoud Elhaddad; Scott Moe; Sreevatsa Anantharamu; Zhuolong Yu | Rooms 260–267
- Panel | 3:30–5:00 PM | CPUs Have a Memory Problem — Designing CPU-Based HPC Systems with Very High Memory Bandwidth | Joe Greenseid | Rooms 231–232
- Paper | 4:36–4:58 PM | SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations | Kun Li; Liang Yuan; Ting Cao; Mao Yang | Rooms 260–267

Thursday, Nov 20
- BOF | 12:15–1:15 PM | Super(computing)heroes | Laura Parry | Rooms 261–266
- Paper | 3:30–3:52 PM | Workload Intelligence: Workload-Aware IaaS Abstraction for Cloud Efficiency | Anjaly Parayil; Chetan Bansal; Eli Cortez; Íñigo Goiri; Jim Kleewein; Jue Zhang; Pantea Zardoshti; Pulkit Misra; Raphael Ghelman; Ricardo Bianchini; Rodrigo Fonseca; Saravan Rajmohan; Xiaoting Qin | Room 275
- Paper | 4:14–4:36 PM | From Deep Learning to Deep Science: AI Accelerators Scaling Quantum Chemistry Beyond Limits | Fusong Ju; Kun Li; Mao Yang | Rooms 260–267

Friday, Nov 21
- Workshop | 9:00 AM–12:30 PM | Eleventh International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC 2025) | Torsten Hoefler | Room 263

Booth Theater Sessions
Each entry lists time (CST) | session title | presenter(s).

Monday, Nov 17 — 7:00 PM–9:00 PM
- 8:00–8:20 PM | Inside the World's Most Powerful AI Data Center | Chris Jones
- 8:30–8:50 PM | Transforming Science and Engineering — Driven by Agentic AI, Powered by HPC | Joe Tostenrude

Tuesday, Nov 18 — 10:00 AM–6:00 PM
- 11:00–11:50 AM | Ignite Keynotes
- 12:00–12:20 PM | Accelerating AI workloads with Azure Storage | Sachin Sheth; Wolfgang De Salvador
- 12:30–12:50 PM | Accelerate Memory Bandwidth-Bound Workloads with Azure HBv5, now GA | Jyothi Venkatesh
- 1:00–1:20 PM | Radiation & Health Companion: AI-Driven Flight-Dose Awareness | Olesya Sarajlic
- 1:30–1:50 PM | Ascend HPC Lab: Your On-Ramp to GPU-Powered Innovation | Daniel Cooke (Oakwood)
- 2:00–2:20 PM | Azure AMD HBv5: Redefining CFD Performance and Value in the Cloud | Rick Knoechel (AMD)
- 2:30–2:50 PM | Empowering High Performance Life Sciences Workloads on Azure | Qumulo
- 3:00–3:20 PM | Transforming Science and Engineering — Driven by Agentic AI, Powered by HPC | Joe Tostenrude
- 4:00–4:20 PM | Unleashing AMD EPYC on Azure: Scalable HPC for Energy and Manufacturing | Varun Selvaraj (AMD)
- 4:30–4:50 PM | Automating HPC Workflows with Copilot Agents | Xavier Pillons
- 5:00–5:20 PM | Scaling the Future: NVIDIA's GB300 NVL72 Rack for Next-Generation AI Inference | Kirthi Devleker (NVIDIA)
- 5:30–5:50 PM | Enabling AI and HPC Workloads in the Cloud with Azure NetApp Files | Andy Chan

Wednesday, Nov 19 — 10:00 AM–6:00 PM
- 10:30–10:50 AM | AI-Powered Digital Twins for Industrial Engineering | John Linford (NVIDIA)
- 11:00–11:20 AM | Advancing 5 Generations of HPC Innovation with AMD on Azure | Allen Leibovitch (AMD)
- 11:30–11:50 AM | Intro to LoRA Fine-Tuning on Azure | Christin Pohl
- 12:00–12:20 PM | VAST + Microsoft: Building the Foundation for Agentic AI | Lior Genzel (VAST Data)
- 12:30–12:50 PM | Inside the World's Most Powerful AI Data Center | Chris Jones
- 1:00–1:20 PM | Supervised GenAI Simulation – Stroke Prognosis (NVads V710 v5) | Kurt Niebuhr
- 1:30–1:50 PM | What You Don't See: How Azure Defines VM Families | Anshul Jain
- 2:00–2:20 PM | Hammerspace Tier 0: Unleashing GPU Storage Performance on Azure | Raj Sharma (Hammerspace)
- 2:30–2:50 PM | GM Motorsports: Accelerating Race Performance with AI Physics on Rescale | Bernardo Mendez (Rescale)
- 3:00–3:20 PM | Hurricane Analysis and Forecasting on the Azure Cloud | Salar Adili (Microsoft); Unni Kirandumkara (GDIT); Stefan Gary (Parallel Works)
- 3:30–3:50 PM | Performance at Scale: Accelerating HPC & AI Workloads with WEKA on Azure | Desiree Campbell; Wolfgang De Salvador
- 4:00–4:20 PM | Pushing the Limits of Performance: Supercomputing on Azure AI Infrastructure | Biju Thankachen; Ojasvi Bhalerao
- 4:30–4:50 PM | Accelerating Momentum: Powering AI & HPC with AMD Instinct™ GPUs | Jay Cayton (AMD)

Thursday, Nov 20 — 10:00 AM–3:00 PM
- 11:30–11:50 AM | Intro to LoRA Fine-Tuning on Azure | Christin Pohl
- 12:00–12:20 PM | Accelerating HPC Workflows with Ansys Access on Microsoft Azure | Dr. John Baker (Ansys)
- 12:30–12:50 PM | Accelerate Memory Bandwidth-Bound Workloads with Azure HBv5, now GA | Jyothi Venkatesh
- 1:00–1:20 PM | Pushing the Limits: Supercomputing on Azure AI Infrastructure | Biju Thankachen; Ojasvi Bhalerao
- 1:30–1:50 PM | The High Performance Software Foundation | Todd Gamblin (HPSF)
- 2:00–2:20 PM | Heidi AI — Deploying Azure Cloud Environments for Higher-Ed Students & Researchers | James Verona (Adaptive Computing); Dr. Sameer Shende (UO/ParaTools)

Partner Session Schedule
Each entry lists time (CST) | title | Microsoft contributor(s) | location.

Tuesday, Nov 18
- 11:00 AM–11:50 AM | Cloud Computing for Engineering Simulation | Joe Greenseid | Ansys Booth
- 1:00 PM–1:30 PM | Revolutionizing Simulation with Artificial Intelligence | Joe Tostenrude | Ansys Booth
- 4:30 PM–5:00 PM | [HBv5] | Jyothi Venkatesh | AMD Booth

Wednesday, Nov 19
- 11:30 AM–1:30 PM | Accelerating Discovery: How HPC and AI Are Shaping the Future of Science (Lunch Panel) | Andrew Jones (Moderator); Joe Greenseid (Panelist) | Ruth's Chris Steak House
- 1:00 PM–1:30 PM | VAST and Microsoft | Kanchan Mehrotra | VAST Booth

Demo Pods at Microsoft Booth
- Azure HPC & AI Infrastructure: Explore how Azure delivers high-performance computing and AI workloads at scale. Learn about VM families, networking, and storage optimized for HPC.
- Agentic AI for Science: See how autonomous agents accelerate scientific workflows, from simulation to analysis, using Azure AI and HPC resources.
- Hybrid HPC with Azure Arc: Discover how Azure Arc enables hybrid HPC environments, integrating on-prem clusters with cloud resources for flexibility and scale.
Ancillary Events (RSVP Required)

Microsoft + AMD Networking Reception — Tuesday Night
When: Tue, Nov 18, 6:30–10:00 PM (CST)
Where: UMB Champions Club, Busch Stadium
RSVP: https://aka.ms/MicrosoftAMD-Mixer

Microsoft + NVIDIA Panel Luncheon — Wednesday
When: Wed, Nov 19, 11:30 AM–1:30 PM (CST)
Where: Ruth's Chris Steak House
Topic: Accelerating Discovery: How AI and HPC Are Shaping the Future of Science
Panelists: Dan Ernst (NVIDIA); Rollin Thomas (NERSC); Joe Greenseid (Microsoft); Antonia Maar (Intersect360 Research); Fernanda Foertter (University of Alabama)
RSVP: The luncheon is now closed as the event is fully booked.

Conclusion
We're excited to connect with you at SC25! Whether you're exploring our booth demos, attending technical sessions, or joining one of our partner events, this is your opportunity to experience how Microsoft is driving innovation in HPC and AI. Stop by Booth #1627 to see the Alpine F1 showcar, explore the Silicon Wall featuring AMD, NVIDIA, and Microsoft's own chips, and enjoy a coffee from our barista while networking with experts. Don't forget to RSVP for our Microsoft + AMD Networking Reception and Microsoft + NVIDIA Panel Luncheon. See you in St. Louis!