Frequent platform-initiated VM redeployments (v6) in North Europe – host OS / firmware issues
Hi everyone,

We've been experiencing recurring platform-initiated redeployments on Azure VMs (v6 series) in the North Europe region and wanted to check if others are seeing something similar.

Around two to three times per week, one of our virtual machines becomes unavailable and is automatically redeployed by the Azure platform. The Service Health notifications usually mention that the host OS became unresponsive and that there is a low-level issue between the host operating system and firmware. The VM is then started on a different host as part of the auto-recovery process. There is no corresponding public Azure Status incident for North Europe when this occurs. From the guest OS perspective, there are no warning signs beforehand, such as high CPU or memory usage, kernel errors, or planned maintenance events.

This behavior looks like a host or hardware stamp issue, but the frequency is concerning.

- Has anyone else running v6 virtual machines in North Europe observed similar unplanned redeployments?
- Has Microsoft shared any statements or acknowledgements regarding ongoing host or firmware stability issues for this region or SKU?
- If you worked with Azure Support on this, were you told this was cluster-specific or related to a particular hardware stamp?

We are already engaging Azure Support, but I wanted to check whether this is an isolated case or something others are also encountering. Thanks in advance for any insights or shared experiences.
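For anyone who wants to rule out missed maintenance signals from inside the guest, below is a minimal sketch that polls the Azure Instance Metadata Service (IMDS) Scheduled Events endpoint and logs anything the platform schedules. The endpoint and api-version are the publicly documented ones; the polling interval and log path are arbitrary choices for this sketch.

```bash
#!/bin/bash
# Poll the IMDS Scheduled Events endpoint to capture any platform-initiated
# maintenance (Reboot, Redeploy, Freeze) before it lands on the VM.
ENDPOINT="http://169.254.169.254/metadata/scheduledevents?api-version=2020-07-01"
LOGFILE="/var/log/scheduled-events.log"

while true; do
  EVENTS=$(curl -s -H "Metadata:true" "$ENDPOINT")
  # Only log when the Events array is non-empty.
  if ! echo "$EVENTS" | grep -q '"Events":\[\]'; then
    echo "$(date -u +%FT%TZ) $EVENTS" >> "$LOGFILE"
  fi
  sleep 60
done
```

If the log stays empty across one of these redeployments, that supports the theory that the recovery is reactive (host failure) rather than planned maintenance.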
Azure V710 V5 Series – AMD Radeon GPU – Validation of Siemens CAD NX

Overview of Siemens NX

Siemens NX is a next-generation integrated CAD/CAM/CAE platform used by aerospace, automotive, industrial machinery, energy, medical, robotics, and defense manufacturers. It spans:

- Complex 3D modeling
- Assemblies containing thousands to millions of parts
- Surfacing and composites
- Tolerance engineering
- CAM and machining simulation
- Integrated multiphysics through Simcenter / NX Nastran

Because NX is used to design real-world engineered systems — aircraft structures, automotive platforms, satellites, robotic arms, injection molds — its usability and performance directly affect engineering velocity and product timelines.

Why NX Needs GPU Acceleration

NX is highly visual. It leans heavily on:

- OpenGL acceleration
- Shader-based rendering
- Hidden-line removal
- Real-time shading / material rendering
- Ray-Traced Studio for photorealistic output

Switching shading modes must keep CAD content readable, and zooming, sectioning, and annotating require stable frame pacing.

NVads V710 v5-Series on Azure

The NVads V710 v5-series virtual machines on Azure are designed for GPU-accelerated workloads and virtual desktop environments. Key highlights:

Hardware specs:
- GPU: AMD Radeon™ Pro V710 (up to 24 GiB frame buffer; fractional GPU options available)
- CPU: AMD EPYC™ 9V64F (Genoa) with SMT, base frequency 3.95 GHz, peak 4.3 GHz
- Memory: 16 GiB to 160 GiB
- Storage: NVMe-based ephemeral local storage supported

VM sizes:
- Range from Standard_NV4ads_V710_v5 (4 vCPUs, 16 GiB RAM, 1/6 GPU) to Standard_NV28adms_V710_v5 (28 vCPUs, 160 GiB RAM, full GPU)

Supported features:
- Premium storage, accelerated networking, ephemeral OS disk
- Both Windows and Linux VMs supported
- No additional GPU licensing required

AMD Radeon™ PRO GPUs offer:
- An optimized OpenGL professional driver stack
- Stable interactive performance against large assemblies

Business Scenarios Enabled by NX + Cloud GPU

- Engineering anywhere: distributed teams can securely work on the same assemblies from any geographic region.
- Supplier ecosystem collaboration: Tier-1/2 manufacturers and engineering partners can access controlled models without local high-end workstations.
- Secure IP protection: data stays in Azure — files never leave the controlled workspace.
- Faster engineering cycles: visualization and simulation accelerate design reviews, decision making, and manufacturability evaluations.
- Scalable cost model: pay for compute only when needed — ideal for burst design cycles and testing workloads.

Architecture Overview – Siemens NX on Azure NVads_v710

Key architecture elements (a provisioning sketch follows this list):

1. Create the Azure virtual machine (NVads_v710_24).
2. Install the Azure AMD V710 GPU drivers.
3. Deploy Azure file-based storage hosting assemblies, metadata, drawing packages, PMI, and simulation data.
4. Configure a VNet with Accelerated Networking.
5. Install NX licenses and software.
6. Install the NXCP and ATS test suites on the virtual machine.
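As a rough sketch of the first architecture element, the VM can be provisioned with the Azure CLI. The resource group, VM name, image URN, and credentials below are placeholders rather than values from the validation setup; the size name comes from the published NVads V710 v5 size list.

```bash
# Provision an NVads V710 v5 full-GPU VM to host the NX workstation.
# All names, the image URN, and the password are illustrative placeholders.
az vm create \
  --resource-group nx-cert-rg \
  --name nx-v710-vm \
  --size Standard_NV28adms_V710_v5 \
  --image MicrosoftWindowsDesktop:Windows-11:win11-23h2-pro:latest \
  --admin-username azureuser \
  --admin-password '<your-password>' \
  --accelerated-networking true \
  --public-ip-sku Standard
```

After the VM is up, install the AMD V710 GPU driver extension and the NX software per the remaining steps in the list above.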
Qualitative Benchmark on Azure NVads_v710_24

Siemens has approved the following qualitative test results; the certification matrix update is currently in progress. Technical verdict: complex assemblies with thousands of components maintained smooth rotation, zooming, and selection, even under concurrent session load.

NXCP and ATS Test Results on NVads_v710_24

Non-interactive test results (execution time in seconds): ATS non-interactive test results validate the correctness and stability of Siemens NX graphical rendering by comparing generated images against approved reference outputs. The minimal or zero pixel differences confirm deterministic and visually consistent rendering, indicating a stable GPU driver and visualization pipeline. The reported execution times represent the duration required to complete each automated graphics validation scenario, demonstrating predictable and repeatable processing performance under non-interactive conditions.

Interactive test results on Azure NVads_v710_24 (execution time in seconds): ATS interactive test results evaluate Siemens NX graphics behavior during real-time user interactions such as rotation, zoom, pan, sectioning, and view manipulation. The results demonstrate stable and consistent rendering during interactive workflows, confirming that the GPU driver and visualization stack reliably support user-driven NX operations. The measured execution times reflect the responsiveness of each interactive graphics operation, indicating predictable behavior under live, user-controlled conditions rather than peak performance tuning.

| NX CAD functions | Test | Automatic tests | Interactive tests |
|---|---|---|---|
| Grace1 Basic Tests | GrPlayer_xp64.exe <FILE> Basic_Features.tgl | Passed! | Passed! |
| Grace1 Basic Tests | GrPlayer_xp64.exe <FILE> Fog_Measurement_Clipping.tgl | Passed! | Passed! |
| Grace1 Basic Tests | GrPlayer_xp64.exe <FILE> lighting.tgl | Passed! | Passed! |
| Grace1 Basic Tests | GrPlayer_xp64.exe <FILE> Shadow_Bump_Environment.tgl | Passed! | Passed! |
| Grace1 Basic Tests | GrPlayer_xp64.exe <FILE> Texture_Map.tgl | Passed! | Passed! |
| Grace2 Graphics Tests | GrPlayer_64.exe <FILE> GrACETrace.tgl | Passed! | Passed! |

| NXCP test scenarios | Test | Automatic tests |
|---|---|---|
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_1.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_2.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_4.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_5.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_6.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_7.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_8.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_9.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_10.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_11.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_12.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_13.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_14.cgi | Passed! |
| NXCP Gdat Tests | gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_15.cgi | Passed! |

Benefits of Azure NVads_v710 (AMD GPU Platform) for NX

- Workstation-class AMD Radeon PRO graphics drivers baked into Azure: ensures an ISV-validated driver pipeline.
- Excellent performance for CAD workloads: makes GPU-accelerated NX accessible to wider user bases.
- Remote engineering enablement: critical for companies that now operate global design teams.
- Elastic scale: spin up GPUs when development peaks; scale down when idle.

Conclusion

Siemens NX on Azure NVads_v710 powered by AMD GPUs enables enterprise-class CAD/CAM/CAE experiences in the cloud. NX benefits directly from workstation-grade OpenGL optimization, shading stability, and Ray Traced Studio acceleration, allowing engineers to interact smoothly with large assemblies, run visualization workloads, and perform design reviews without local hardware dependencies.

Right-sized GPU delivers a workstation-class experience at lower cost

The family enables fractional GPU allocation (down to 1/6 of a Radeon™ Pro V710), allowing Siemens NX deployments to be right-sized per user role.
This avoids over-provisioning full GPUs while still delivering ISV-grade OpenGL and visualization stability, resulting in a lower per-engineer cost compared to fixed full-GPU cloud or on-prem workstations.

Elastic scale improves cost efficiency for burst engineering workloads

NVads_V710_v5 instances support on-demand scaling and ephemeral NVMe storage, allowing NX environments to scale up for design reviews, supplier collaboration, or peak integration cycles and scale down when idle. This consumption model provides a cost advantage over fixed on-prem workstations that remain underutilized outside peak engineering periods.

NX visualization pipelines benefit from a balanced CPU–GPU architecture

The combination of high-frequency AMD EPYC™ Genoa CPUs (up to 4.3 GHz) and Radeon™ Pro V710 GPUs addresses Siemens NX's mixed CPU–GPU workload profile, where scene-graph processing, tessellation, and OpenGL submission are CPU-sensitive. This balance reduces idle GPU cycles, improving effective utilization and overall cost efficiency compared with GPU-heavy but CPU-constrained configurations.

The result is a scalable, secure, and cost-efficient engineering platform that supports distributed innovation, supplier collaboration, and digital product development workflows — all backed by the rendering and interaction consistency of AMD GPU virtualization on Azure.
Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)

As HPC & AI workloads continue to scale in complexity and performance demands, ensuring visibility into the underlying infrastructure becomes critical. This guide presents an essential monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices. By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments.

In this blog, we'll walk through the process of configuring Telegraf to collect and send GPU and InfiniBand monitoring metrics to Azure Monitor. This end-to-end guide covers all the essential steps to enable robust monitoring for NVIDIA GPUs and Mellanox InfiniBand devices, empowering you to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure.

DISCLAIMER: This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support.

While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. To monitor GPU and InfiniBand performance, additional configuration using third-party tools, such as Telegraf, is required. As of the time of writing, Azure Monitor does not include built-in support for these metrics without external integrations.

🔔 Update: Supported Monitoring Option Now Available

Update (December 2025): At the time this guide was written, monitoring InfiniBand (IB) and GPU metrics on Azure H-series and N-series VMs required a largely unofficial approach using Telegraf and Azure Monitor. Microsoft has since introduced a supported solution: Azure Managed Prometheus on VM / VM Scale Sets (VMSS), currently available in private preview. This new capability provides a native, managed Prometheus experience for collecting infrastructure and accelerator metrics directly from VMs and VMSS. It significantly simplifies deployment, lifecycle management, and long-term support compared to custom Telegraf-based setups. For new deployments, customers are encouraged to evaluate Azure Managed Prometheus on VM / VMSS as the preferred and supported approach for HPC and AI workload monitoring. Official announcement: Private Preview: Azure Managed Prometheus on VM / VMSS

Step 1: Prepare Azure to receive GPU and IB metrics sent from Telegraf agents on a VM or VMSS

Register the microsoft.insights resource provider in your Azure subscription. Refer: Resource providers and resource types - Azure Resource Manager | Microsoft Learn

Step 2: Enable a Managed Service Identity to authenticate the Azure VM or VMSS

In this example we use a system-assigned Managed Identity for authentication. You can also use a User-Assigned Managed Identity or a Service Principal to authenticate the VM. Refer: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)

A CLI sketch covering both steps follows.
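A minimal sketch of Steps 1 and 2 with the Azure CLI, assuming a system-assigned identity and placeholder resource names. The "Monitoring Metrics Publisher" role shown is the built-in role commonly used to allow publishing custom metrics.

```bash
# Step 1: register the resource provider that receives custom metrics.
az provider register --namespace Microsoft.Insights

# Step 2: enable a system-assigned managed identity on the VM so the
# Telegraf azure_monitor output plugin can authenticate without secrets.
# Resource group and VM name are placeholders.
az vm identity assign \
  --resource-group hpc-monitor-rg \
  --name ndv4-node-01

# Grant the identity permission to publish metrics to its own resource.
az role assignment create \
  --assignee "$(az vm show -g hpc-monitor-rg -n ndv4-node-01 --query identity.principalId -o tsv)" \
  --role "Monitoring Metrics Publisher" \
  --scope "$(az vm show -g hpc-monitor-rg -n ndv4-node-01 --query id -o tsv)"
```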
Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor

In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for VMSS. The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, ensure that you manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand drivers.

Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04. This script also configures the NVIDIA SMI and InfiniBand input plugins, along with the Telegraf output configuration that sends data to Azure Monitor.

Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04. For background on the InfiniBand counters collected by Telegraf, see https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters

Run the following commands:

```bash
wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh
chmod +x gpu-ib-mon_setup.sh
./gpu-ib-mon_setup.sh
```

Test the Telegraf configuration by executing the following command:

```bash
sudo telegraf --config /etc/telegraf/telegraf.conf --test
```

Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage

Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification.

To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal:

- Set the scope to your VM.
- Choose the metric namespace Telegraf/nvidia-smi.
- Select and display GPU metrics such as utilization, memory usage, temperature, and more. In this example we use the GPU memory_used metric.
- Use filters and splits to analyze data across multiple GPUs or over time.

To monitor InfiniBand performance, repeat the same process:

- In the Metrics section, set the scope to your VM.
- Select the metric namespace Telegraf/infiniband.
- Visualize metrics such as port status, data transmitted/received (for example, port_rcv_data), and error counters. In this example we use the link_downed metric to check for InfiniBand link flaps.
- Use filters to break down the data by port or metric type for deeper insights.

Note: The link_downed metric returns incorrect values with the Count aggregation; use the Max or Min aggregation instead.

Creating custom dashboards in Azure Monitor with both the Telegraf/nvidia-smi and Telegraf/infiniband namespaces allows unified visibility into GPU and InfiniBand health.
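The same namespaces can also be queried from the CLI rather than the portal. A hedged sketch, with placeholder resource group and VM names, and metric names taken from the portal picker described above:

```bash
# Resolve the VM resource ID (names are placeholders).
VM_ID=$(az vm show -g hpc-monitor-rg -n ndv4-node-01 --query id -o tsv)

# GPU memory used, averaged per minute.
az monitor metrics list \
  --resource "$VM_ID" \
  --namespace "Telegraf/nvidia-smi" \
  --metric "memory_used" \
  --aggregation Average \
  --interval PT1M

# InfiniBand link flaps (use Max/Min rather than Count; see the note above).
az monitor metrics list \
  --resource "$VM_ID" \
  --namespace "Telegraf/infiniband" \
  --metric "link_downed" \
  --aggregation Maximum \
  --interval PT1M
```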
Testing InfiniBand and GPU Usage

If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads—especially over InfiniBand—here's a straightforward solution using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups. The NCCL benchmarks and OpenMPI are part of the Ubuntu-HPC 22.04 image. Update the variables according to your environment, and update the hostfile with your hostnames:

```bash
module load mpi/hpcx-v2.13.1
export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
  -mca coll_hcoll_enable 0 --bind-to numa \
  -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
  -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
  -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
  -x NCCL_SOCKET_IFNAME=eth0 \
  -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
  -x NCCL_DEBUG=WARN \
  /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1
```

Alternate: GPU Load Simulation Using TensorFlow

If you're looking for a more application-like load (e.g., distributed training), I've prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines. To get started, run the following:

```bash
wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
chmod +x gpu_test_program.sh
./gpu_test_program.sh
```

With either method (NCCL benchmarks or TensorFlow training), you'll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence. Happy testing!

References:
- Ubuntu HPC on Azure
- ND A100 v4-series GPU VM Sizes
- Telegraf Azure Monitor Output Plugin (v1.15)
- Telegraf NVIDIA SMI Input Plugin (v1.15)
- Telegraf InfiniBand Input Plugin Documentation
Azure NCv6 Public Preview: The new Unified Platform for Converged AI and Visual Computing

As enterprises accelerate adoption of physical AI (AI models interacting with real-world physics), digital twins (virtual replicas of physical systems), LLM inference (running language models for predictions), and agentic workflows (autonomous AI-driven processes), the demand for infrastructure that bridges high-end visualization and generative AI inference has never been higher.

Today, we are pleased to announce the Public Preview of the NC RTX PRO 6000 BSE v6 series, powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. The NCv6 series represents a generational leap in Azure's visual compute portfolio, designed to be the dual engine for both industrial digitalization and cost-effective LLM inference. By leveraging NVIDIA Multi-Instance GPU (MIG) capabilities, the NCv6 platform offers affordable sizing options similar to our legacy NCv3 and NVv5 series. This provides a seamless upgrade path to Blackwell performance, enabling customers to run complex NVIDIA Omniverse simulations and multimodal AI agents with greater efficiency.

Why Choose Azure NCv6?

While traditional GPU instances often force a choice between "compute" (AI) and "graphics" (visualization) optimizations, the NCv6 breaks this silo. Built on the NVIDIA Blackwell architecture, it provides a "right-sized" acceleration platform for workloads that demand both ray-traced fidelity and Tensor Core performance. As outlined in our product documentation, these VMs are ideal for converged AI and visual computing workloads, including:

- Real-time digital twin and NVIDIA Omniverse simulation
- LLM inference and RAG (Retrieval-Augmented Generation) on small to medium AI models
- High-fidelity 3D rendering, product design, and video streaming
- Agentic AI application development and deployment
- Scientific visualization and high-performance computing (HPC)

Key Features of the NCv6 Platform

The Power of NVIDIA Blackwell

At the heart of the NCv6 is the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU. This powerhouse delivers breakthrough performance, featuring 96 GB of ultra-fast GDDR7 memory. This massive frame buffer allows for the handling of complex multimodal AI models and high-resolution textures that previous generations simply could not fit.

Host Performance: Intel Granite Rapids

To ensure your workloads aren't bottlenecked by the CPU, the VM host is equipped with Intel Xeon Granite Rapids processors. These provide an all-core turbo frequency of up to 4.2 GHz, ensuring that demanding pre- and post-processing steps—common in rendering and physics simulations—are handled efficiently.

Optimized Sizing for Every Workflow

We understand that one size does not fit all. The NCv6 series introduces three distinct sizing categories to match your specific unit economics:

- General purpose: balanced CPU-to-GPU ratios (up to 320 vCPUs) for diverse workloads.
- Compute optimized: higher vCPU density for heavy simulation and physics tasks.
- Memory optimized: massive memory footprints (up to 1,280 GB RAM) for data-intensive applications.

Crucially, for smaller inference jobs or VDI, we will also offer fractional GPU options, allowing you to right-size your infrastructure and optimize costs.
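On Azure, the fractional NCv6 sizes are carved out by the platform itself, but the underlying mechanics are standard MIG administration. Below is a hedged sketch of how a full GPU might be partitioned with MIG on a full-GPU size; the profile ID "9" is a placeholder, since the available profiles and their IDs depend on the specific GPU and driver version.

```bash
# Enable MIG mode on GPU 0 (may require draining workloads and a GPU reset).
sudo nvidia-smi -i 0 -mig 1

# List the GPU instance profiles the driver exposes on this card.
sudo nvidia-smi mig -lgip

# Create two GPU instances from a listed profile ID ("9" is a placeholder)
# and their compute instances in one step.
sudo nvidia-smi mig -cgi 9,9 -C

# Verify the resulting MIG devices.
nvidia-smi -L
```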
NCv6 Technical Specifications

| Specification | Details |
|---|---|
| GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7) |
| Processor | Intel Xeon Granite Rapids (up to 4.2 GHz turbo) |
| vCPUs | 16 – 320 vCPUs (scalable across general purpose, compute-optimized, and memory-optimized sizes) |
| System memory | 64 GB – 1,280 GB DDR5 |
| Network | Up to 200,000 Mbps (200 Gbps) Azure Accelerated Networking |
| Storage | Up to 2 TB local temp storage; support for Premium SSD v2 and Ultra Disk |

Real-World Applications

The NCv6 is built for versatility, powering everything from pixel-perfect rendering to high-throughput language reasoning:

- Production generative AI and inference: deploy self-hosted LLMs and RAG pipelines with optimized unit economics. The NCv6 is ideal for serving ranking models, recommendation engines, and content generation agents where low latency and cost-efficiency are paramount.
- Automotive and manufacturing: validate autonomous driving sensors (LiDAR/radar) and train physical AI models in high-fidelity simulation environments before they ever touch the real world.
- Next-gen VDI and Azure Virtual Desktop: modernize remote workstations with NVIDIA RTX Virtual Workstation capabilities. By leveraging fractional GPU options, organizations can deliver high-fidelity, accelerated desktop experiences to distributed teams, offering a superior, high-density alternative to legacy NVv5 deployments.
- Media and entertainment: accelerate render farms for VFX studios requiring burst capacity, while simultaneously running generative AI tools for texture creation and scene optimization.

Conclusion: The Engine for the Era of Converged AI

The Azure NCv6 series redefines the boundaries of cloud infrastructure. By combining the raw power of NVIDIA's Blackwell architecture with the high-frequency performance of Intel Granite Rapids, we are moving beyond just "visual computing." Innovators can now leverage a unified platform to build the industrial metaverse, deploy intelligent agents, and scale production AI, all with the enterprise-grade security and hybrid reach of Azure.

Ready to experience the next generation? Sign up for the NCv6 Public Preview here.
Azure delivers the first cloud VM with Intel Xeon 6 and CXL memory - now in Private Preview

Intel released their new Intel Xeon 6 6500/6700 series processors with P-cores this year. Intel Xeon 6 processors deliver outstanding performance for transactional and analytical workloads and provide scale-up capacities of up to 64 TB of memory. In addition, Intel Xeon 6 supports the new Compute Express Link (CXL) standard, which enables memory expansion to accommodate larger data sets in a cost-effective manner. CXL Flat Memory Mode is a unique Intel Xeon 6 capability that enhances the ability to right-size the compute-to-memory ratio and improve scalability without sacrificing performance. This can help run SAP S/4HANA more efficiently, enable greater configuration flexibility that better aligns with business needs, and improve the total cost of ownership.

In collaboration with SAP and Intel, Microsoft is delighted to announce the private preview of CXL technology on the Azure M-series family of VMs. We believe that, when combined with advancements in the new Intel Xeon 6 processors, it can tackle the challenges of managing the growing volume of data in SAP software, meet the increased demand for faster compute performance, and reduce overall TCO.

Stefan Bäuerle, SVP, Head of BTP, HANA & Persistency at SAP, noted: "Intel Xeon 6 helps deliver system scalability to support the growing demand for high-performance computing and growing database capacity among SAP customers."

Elyse Ge Hylander, Senior Director, Azure SAP Compute, stated: "At Microsoft, we are continually exploring new technological innovations to improve our customer experience. We are thrilled about the potential of Intel's new Xeon 6 processors with CXL and Flat Memory Mode. This is a big step forward to deliver the next-level performance, reliability, and scalability to meet the growing demands of our customers."

Bill Pearson, Vice President of Data Center and Artificial Intelligence at Intel, stated: "Intel Xeon 6 represents a significant advancement for Intel, opening up exciting business opportunities to strengthen our collaboration with Microsoft Azure and SAP. The innovative instance architecture featuring CXL Flat Memory Mode is designed to enhance cost efficiency and performance optimization for SAP software and SAP customers."

If you are interested in joining our CXL private preview in Azure, contact Mseries_CXL_Preview@microsoft.com

Co-author: Phyllis Ng - Senior Director of Hardware Strategic Planning (Memory and Storage) - Microsoft
Pure Storage Cloud, Azure Native evolves at Microsoft Ignite!

In September, we were pleased to announce the General Availability of Pure Storage Cloud, Azure Native: a co-developed Azure Native Integration enabling more customers to migrate to Azure easily and benefit from Pure's industry-leading storage platform, now supporting more customer workloads!
Azure CycleCloud 8.8 and CCWS 1.2 at SC25 and Ignite

Azure CycleCloud 8.8: Advancing HPC & AI Workloads with Smarter Health Checks

Azure CycleCloud continues to evolve as the backbone for orchestrating high-performance computing (HPC) and AI workloads in the cloud. With the release of CycleCloud 8.8, users gain access to a suite of new features designed to streamline cluster management, enhance health monitoring, and future-proof their HPC environments.

Key Features in CycleCloud 8.8

1. ARM64 HPC Support

The platform expands its hardware compatibility with ARM64 HPC support, opening new possibilities for energy-efficient and cost-effective compute clusters. This includes access to the newer generation of GB200 VMs as well as general ARM64 support, enabling new AI workloads at a scale never possible before.

2. Slurm Topology-Aware Scheduling

The integration of topology-aware scheduling for Slurm clusters allows CycleCloud users to optimize job placement based on network and hardware topology. This leads to improved performance for tightly coupled HPC workloads and better utilization of available resources (a job-side sketch appears after this section).

3. NVIDIA MNNVL and IMEX Support

With expanded support for NVIDIA MNNVL and IMEX, CycleCloud 8.8 ensures compatibility with the latest GPU technologies. This enables users to leverage cutting-edge hardware for AI training, inference, and scientific simulations.

4. HealthAgent: Event-Driven Health Monitoring and Alerting

A standout feature in this release is the enhanced HealthAgent, which delivers event-driven health monitoring and alerting. CycleCloud now proactively detects issues across clusters, nodes, and interconnects, providing real-time notifications and actionable insights. This improvement is a game-changer for maintaining uptime and reliability in large-scale HPC deployments. The node HealthAgent supports both impactful health checks, which can only run while nodes are idle, and non-impactful health checks that can run throughout the lifecycle of a job. This allows CycleCloud to alert on issues that occur not only while nodes are starting, but also on failures that develop on long-running nodes. Later releases of CycleCloud will also include automatic remediation for common failures, so stay tuned!

5. Enterprise Linux 9 and Ubuntu 24 Support

One common request has been wider support for the various Enterprise Linux (EL) 9 variants, including RHEL 9, AlmaLinux 9, and Rocky Linux 9. CycleCloud 8.8 introduces support for those distributions as well as the latest Ubuntu HPC release.

Why These Features Matter

The CycleCloud 8.8 release marks a significant leap forward for organizations running HPC and AI workloads in Azure. The improved health check support—anchored by HealthAgent and automated remediation—means less downtime, faster troubleshooting, and greater confidence in cloud-based research and innovation. Whether you're managing scientific simulations, AI model training, or enterprise analytics, CycleCloud's latest features help you build resilient, scalable, and future-ready HPC environments.
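To make the topology-aware scheduling feature (item 2 above) concrete from the job side, here is a minimal sketch using Slurm's standard --switches option, which asks the scheduler to place the allocation under as few network switches as possible. The node counts, wait time, and application are placeholders, and the option assumes topology-aware scheduling is enabled on the cluster.

```bash
#!/bin/bash
#SBATCH --job-name=allreduce-tight
#SBATCH --nodes=16
#SBATCH --ntasks-per-node=8
# Ask Slurm to place all nodes under a single switch, waiting up to
# 30 minutes for such an allocation before relaxing the constraint.
#SBATCH --switches=1@30:00

srun ./my_tightly_coupled_app
```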
Key Features in CycleCloud Workspace for Slurm 1.2

Along with the release of CycleCloud 8.8 comes a new CycleCloud Workspace for Slurm (CCWS) release. This release includes the General Availability of features that were previously in preview, such as Open OnDemand, Cendio ThinLinc, and managed Grafana monitoring capabilities. In addition to previously announced features, CCWS 1.2 also includes support for a new Hub and Spoke deployment model. This allows customers to retain a central hub of shared resources that can be re-used between cluster deployments, with "disposable" spoke clusters that branch from the hub. Hub and Spoke deployments support customers who need to redeploy clusters to upgrade their operating system, deploy new versions of software, or even reconfigure the overall architecture of their Slurm clusters.

Come visit us at SC25 and MS Ignite

To learn more about these features, come visit us at the Microsoft booth at #SC25 in St. Louis, MO and #Microsoft #Ignite in San Francisco this week!
Your guide to Azure Compute at Microsoft Ignite 2025

The countdown to Microsoft Ignite 2025 is almost over: November 18–21, 2025! Whether you'll be joining us in person or tuning in virtually, this guide is your essential resource for everything Azure Compute. Explore the latest advancements, connect with product experts, and expand your cloud skills through curated sessions and interactive experiences. Attendees will have the opportunity to dive deep into new product capabilities and solutions, including ways to boost virtual machine performance, enhance resiliency, and optimize cloud operations. Be sure to add these sessions to your schedule for a personalized and can't-miss Ignite experience. Bookmark this guide for quick access to all the latest Azure Compute news and updates throughout Ignite 2025!

Featured sessions

Tuesday

BRK171: What's new and what's next in Azure IaaS (Level: Intermediate 200)
In this session, we'll introduce the latest capabilities across compute, storage, and networking. Uncover the advancements in Azure IaaS driving performance, resiliency, and cost efficiency. We will present how Azure's global backbone, enhanced capabilities, and expanding portfolio can support mission-critical, cloud-native, and AI workloads, while built-in security and flexible tiering help right-size app deployments and accelerate modernization.
Tuesday, November 18 | 2:30 PM–3:15 PM PST

Wednesday

BRK430: Inside Azure Innovations with Mark Russinovich (Level: Advanced 300)
Join Mark Russinovich, CTO and Technical Fellow of Microsoft Azure. Mark will take you on a tour of the latest innovations in Azure architecture and explain how Azure enables intelligent, modern, and innovative applications at scale in the cloud, on-premises, and on the edge. Featuring some of the latest Compute announcements with Azure Boost.
Wednesday, November 19, 2:45 PM PST

Other related IaaS sessions

Use the following as a guide to build your session schedule with an emphasis on our Azure Compute topics. These sessions will be in person and recorded. Sessions Tuesday–Thursday will be live streamed.

Thursday

BRK176: Driving efficiency and cost optimization for Azure IaaS deployments (Level: Intermediate 200)
Control cloud spend without compromising performance. This session shows how Azure IaaS helps IT leaders optimize costs through flexible pricing, built-in tools, and smart resource planning. Learn how to align infrastructure choices with workload requirements, reduce TCO, and make informed decisions that support growth and innovation. You will gain a deeper understanding of how Azure delivers a comprehensive set of services, tools, and financial instruments to optimize your cloud costs at scale.
Thursday, November 20, 9:45 AM PST

BRK217: Resilience by design: Secure, scalable, AI-ready cloud with Azure (Level: Advanced 300)
Resiliency is foundational. Explore how resiliency on Azure enables secure, scalable, AI-ready cloud architectures. Learn to set resilience goals, simulate failures, and orchestrate recovery. See live demos and discover how shared responsibility empowers teams to deliver trusted, resilient outcomes.
Thursday, November 20, 1:00 PM PST

BRK178: Architecting for resiliency on Azure Infrastructure (Level: Intermediate 200)
Discover how to build resilient cloud solutions on Azure by leveraging availability zones, multi-region deployments, and fungible products.
This session explores architectural patterns, platform capabilities, and best practices to ensure high availability, fault tolerance, and business continuity for mission-critical workloads in dynamic and distributed environments.
Thursday, November 20, 1:00 PM PST

BRK148: Architect resilient apps with Azure backup and reliability features (Level: Advanced 300)
Learn to use self-serve tools to strengthen zonal resiliency for critical workloads. Assess and validate resilience across VMs, DBs, and containers. Explore enhanced data and cyber resiliency with immutability and threat detection to guard against ransomware. Discover expanded workload coverage and real-time insights to proactively protect your applications and infrastructure.
Thursday, November 20, 3:30 PM PST

Friday

BRK146: Resiliency and recovery with Azure Backup and Site Recovery (Level: Advanced 300)
This session will show how to secure, detect threats, and quickly recover critical workloads across Azure environments using advanced backup and disaster recovery solutions. It covers modern techniques like threat-aware backups, container protection, and seamless disaster recovery to help meet compliance and recovery objectives.
Friday, November 21, 9:00 AM PST

BRK149: Unlock cloud-scale observability and optimization with Azure (Level: Advanced 300)
In this session, we'll deep dive into how Azure Monitor delivers end-to-end observability across your cloud and hybrid environments, helping you detect issues early and reduce mean time to recovery. We'll also share how new Copilot in Azure agents can extend this visibility into actionable cost and carbon efficiency insights, helping you identify optimization opportunities, validate recommendations, and streamline resource performance for business impact.
Friday, November 21, 10:15 AM PST

BRK173: Azure IaaS best practices to enhance performance and scale (Level: Advanced 300)
Azure IaaS can deliver excellent performance and scalability across a broad range of workloads. With high-throughput storage, low-latency networking, and intelligent auto-scaling, Azure supports demanding apps with precision. Learn how to optimize compute, storage, and network resources to meet performance goals, reduce costs, and scale confidently across global regions. Dive into the latest capabilities Azure Boost, Compute Fleet, Azure Virtual Machines, Azure Storage, and Networking offer.
Friday, November 21, 10:15 AM PST

BRK172: Powering modern cloud workloads with Azure Compute (Level: Advanced 300)
Uncover new VM offering announcements and explore innovations like Azure Boost. Dive into the latest compute innovation at the core of Azure IaaS. Whether you're running mission-critical enterprise apps or scaling cloud-native services, discover how these innovations are unlocking new value for customers and get a preview of what's coming next.
Friday, November 21, 11:30 AM PST

BRK168: Azure IaaS platform security deep dive (Level: Advanced 300)
As organizations accelerate their cloud adoption, robust security for your Infrastructure as a Service platform is more critical than ever. This session will provide a comprehensive exploration of Azure's security architecture, best practices, and innovations across four pillars: foundational security, compute security, network security, and storage security. Attendees will gain actionable insights to strengthen their cloud posture, ensure compliance, and protect sensitive workloads.
Friday, November 21, 11:30 AM PST

Upskill yourself with hands-on labs

Live demos and hands-on labs are available exclusively to in-person attendees, providing a direct, firsthand experience.

Tuesday

LAB500: Attain unified observability and optimization in Azure (Level: Intermediate 200)
Get an AI-powered view of your Azure workload health and performance while uncovering cost and carbon savings. In this lab, use AI to investigate anomalies, correlate telemetry, and drive optimization. Apply FinOps and sustainability insights, align health with SLI/SLO targets, and improve monitoring posture for lasting efficiency. Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.
Tuesday, November 18, 2:45 PM PST

LAB520: Start, Get and Stay Resilient with Azure (Level: Intermediate 200)
Understand the Start, Get, and Stay Resilient journey. Get equipped with tools and insights to architect mission-critical applications with Azure's Resiliency and Configuration experiences. Assess your resiliency posture, apply recommendations, validate your posture, and orchestrate recovery. With the Essentials Machine Management bundle from Azure, manage and maintain the state of your resources, enforce configurations across devices, and ensure resilience is not a one-time goal but an ongoing state. Please RSVP and arrive at least 5 minutes before the start time, at which point remaining spaces are open to standby attendees.
Tuesday, November 18, 4:30 PM PST
Kernel Dump based Online Repair

Introduction

In the ever-evolving landscape of cloud computing, reliability remains paramount. As workloads scale and businesses depend on uninterrupted service, Azure continues to invest in technologies that enhance system resilience and minimize customer impact when failures occur. Azure Compute infrastructure operates at an unmatched scale, with certain Availability Zones (AZs) hosting nearly a million Azure Virtual Machines (Azure VMs) that run customer workloads. These Azure VMs depend on a sophisticated ecosystem of physical machines, networking infrastructure, storage systems, and other essential components. When failures occur at any of these layers—whether from hardware malfunctions, kernel issues, or network disruptions—customers may experience service interruptions. To address these challenges, the Azure Compute Repair Platform plays a vital role in identifying, diagnosing, and applying mitigation strategies to resolve issues as quickly as possible.

To further improve our ability to diagnose and resolve failures swiftly and accurately, we present a novel approach: a real-time kernel dump analysis technology aimed at identifying the root cause of issues and facilitating precise, data-driven repairs. This is an addition to the gamut of detection and mitigation strategies we already leverage. This capability is generally available in all Azure regions and benefits all our customers, including the most critical ones.

This project would not have been possible without the invaluable support and contributions of Binit Mishra, Dhruv Joshi, Abhay Ketkar, Gaurav Jagtiani, Mukhtar Ahmed, Siamak Ahari, Rajeev Acharya, Deepak Venkatesh, Abhinav Dua, Alvina Putri, Emma Montalvo, and Chantale Ninah — my heartfelt thanks to each of you.

Real-Time Failure Diagnosis and Repair

We have developed a novel approach to diagnosing and mitigating failures in Azure Compute infrastructure by understanding the state of the kernel on the Azure Host Machine through real-time collection and analysis of Live Kernel Dumps (LKDs). This enables us to pinpoint the exact issue with the kernel and use that insight for precise repair actions, rather than applying a broad set of mitigation strategies. By reducing trial-and-error repair attempts, we significantly minimize downtime and accelerate issue resolution.

Kernel dumps can help detect critical issues such as kernel panics, memory leaks, and driver failures. Kernel panics occur when the system encounters a fatal error, causing the kernel to stop functioning. Memory leaks, where memory is not properly released, can lead to system instability over time. Driver failures, often caused by faulty or incompatible drivers, can also be identified through kernel dump analysis. Importantly, it is the Repair Platform that triggers LKD collection and consumes the LKD analysis to make informed decisions. By incorporating live kernel dump analysis into our mitigation workflows, we enhance Azure's ability to quickly diagnose, categorize, and resolve infrastructure issues, ultimately reducing system downtime and improving overall performance.

Architecture

How the system works:

1. Dump collection: When an issue is detected, the Repair Platform triggers the collection of a Live Kernel Dump (LKD) on the machine hosting the affected Azure VM.
2. Dump upload: An agent running on the machine monitors a designated storage location for newly generated dumps. When a dump is detected, the agent uploads it from the Azure Host Machine to an online Analysis Service.
3. Failure classification: The Analysis Service processes the uploaded LKD, diagnoses the root cause of the failure, and categorizes it accordingly—for example, identifying a networking switch in a hung state.
4. Persistence: The Analysis Service generates a detailed failure message and stores it in an Azure Table for tracking and retrieval.
5. Automated repair decisions: The Repair Platform continuously monitors the Azure Table for failure messages. Once a failure is recorded, it retrieves the data and makes an informed repair decision (see the sketch following the Impact section).

Impact

By leveraging this approach, the Azure Compute Repair Platform achieves both a better repair strategy and significant downtime savings.

(A) Better Repair Strategy

By precisely identifying failures, the Repair Platform can classify issues accurately and apply the most effective resolution method, minimizing unnecessary disruptions and enhancing long-term infrastructure stability.

For instance, in the case of a VM Switch Hung issue, the Repair Platform attempts to mitigate the problem on the affected Azure Host Machine. However, if unsuccessful, it migrates the customer's workload to a more stable machine and initiates aggressive repairs on the faulty Azure Host Machine. While this restores service, it does not address the underlying cause, leaving the Azure Host Machine vulnerable to repeated VM Switch Hung failures. With real-time failure classification, the Repair Platform can instead hold a subset of affected Azure Host Machines in a restricted state, preventing new Azure VMs from being assigned to them. This approach allows Azure's hardware and network partners to run diagnostics, gain deeper insights into the failure, and implement targeted fixes. As a result, Azure has reduced recurring failures, minimized customer impact, and improved overall infrastructure reliability. While the VM Switch Hung issue serves as an example, this data-driven repair strategy can be extended to various failure scenarios, enabling faster recovery, fewer disruptions, and a more resilient platform.

(B) Downtime Reduction

The longer it takes to resolve an issue, the longer a customer workload may experience interruptions. As a result, downtime reduction is one of the key metrics we prioritize. We significantly reduce time to resolution by providing an early signal that pinpoints the exact issue. This allows the Repair Platform to perform targeted repairs rather than relying on time-consuming, broad mitigation strategies.

Sample scenario: When a customer faces issues stopping or destroying an Azure VM, and the problem is severe enough that all repair attempts fail, the only option may be to migrate the customer's workload to a different Azure Host Machine. Today, this process can take up to 26 minutes before the decision to move the customer workload is reached. With this new approach, we are optimizing to detect the failure and surface the issue within 3 minutes, enabling a decision much earlier and reducing customer downtime by 23 minutes—a significant improvement in downtime reduction and customer resolution.
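As an illustration of the persistence and monitoring steps above, a consumer could poll the failure-message table with the Azure CLI. This is purely a hypothetical sketch: the storage account, table name, and key scheme below are invented, since the Repair Platform's internals are not public.

```bash
# Poll a failure-message table for recent entries of a given failure class.
# Account name, table name, and PartitionKey/property names are hypothetical.
az storage entity query \
  --account-name repairplatformstore \
  --table-name FailureMessages \
  --filter "PartitionKey eq 'VMSwitchHung'" \
  --select RowKey FailureCategory DetectedAt \
  --auth-mode login
```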
Conclusion

Online kernel dump analysis for machine issue resolution marks a significant advancement in Azure's commitment to reliability, bringing us closer to a future where failures are not just detected but proactively mitigated in real time. By enabling real-time diagnostics and automated repair strategies, this approach is redefining Compute reliability—drastically reducing mitigation times, enhancing repair accuracy, and ensuring customers experience seamless service continuity. As we continue refining it, our focus remains on expanding its capabilities, enhancing kernel analysis, reducing analysis time, and strengthening the entire pipeline for greater efficiency and resilience. Stay tuned for further updates as we push the boundaries of intelligent cloud reliability.