Azure NCv6 Public Preview: The New Unified Platform for Converged AI and Visual Computing
As enterprises accelerate adoption of physical AI (AI models interacting with real-world physics), digital twins (virtual replicas of physical systems), LLM inference (running language models for predictions), and agentic workflows (autonomous AI-driven processes), the demand for infrastructure that bridges high-end visualization and generative AI inference has never been higher. Today, we are pleased to announce the Public Preview of the NC RTX PRO 6000 BSE v6 series, powered by NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs.

The NCv6 series represents a generational leap in Azure's visual compute portfolio, designed to be the dual engine for both industrial digitalization and cost-effective LLM inference. By leveraging NVIDIA Multi-Instance GPU (MIG) capabilities, the NCv6 platform offers affordable sizing options similar to our legacy NCv3 and NVv5 series. This provides a seamless upgrade path to Blackwell performance, enabling customers to run complex NVIDIA Omniverse simulations and multimodal AI agents with greater efficiency.

Why Choose Azure NCv6?

While traditional GPU instances often force a choice between "compute" (AI) and "graphics" (visualization) optimizations, the NCv6 breaks this silo. Built on the NVIDIA Blackwell architecture, it provides a "right-sized" acceleration platform for workloads that demand both ray-traced fidelity and Tensor Core performance. As outlined in our product documentation, these VMs are ideal for converged AI and visual computing workloads, including:

Real-time digital twin and NVIDIA Omniverse simulation.
LLM inference and RAG (Retrieval-Augmented Generation) on small to medium AI models.
High-fidelity 3D rendering, product design, and video streaming.
Agentic AI application development and deployment.
Scientific visualization and High-Performance Computing (HPC).

Key Features of the NCv6 Platform

The Power of NVIDIA Blackwell
At the heart of the NCv6 is the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU. This powerhouse delivers breakthrough performance featuring 96 GB of ultra-fast GDDR7 memory. This massive frame buffer allows for the handling of complex multimodal AI models and high-resolution textures that previous generations simply could not fit.

Host Performance: Intel Granite Rapids
To ensure your workloads aren't bottlenecked by the CPU, the VM host is equipped with Intel Xeon Granite Rapids processors. These provide an all-core turbo frequency of up to 4.2 GHz, ensuring that demanding pre- and post-processing steps—common in rendering and physics simulations—are handled efficiently.

Optimized Sizing for Every Workflow
We understand that one size does not fit all. The NCv6 series introduces three distinct sizing categories to match your specific unit economics:

General Purpose: Balanced CPU-to-GPU ratios (up to 320 vCPUs) for diverse workloads.
Compute Optimized: Higher vCPU density for heavy simulation and physics tasks.
Memory Optimized: Massive memory footprints (up to 1,280 GB RAM) for data-intensive applications.

Crucially, for smaller inference jobs or VDI, we will also offer fractional GPU options, allowing you to right-size your infrastructure and optimize costs.
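Once your subscription is enabled for the preview, a VM can be deployed with the Azure CLI roughly as sketched below. The size name is a placeholder, not a confirmed SKU name; the actual NCv6 size names available in your region come from az vm list-sizes, and everything else uses standard az vm options.

# Sketch: find the NCv6 preview sizes available in a region, then deploy one.
# "Standard_NC16_RTXPRO6000BSE_v6" is a placeholder name, not a confirmed SKU.
az vm list-sizes --location eastus2 --output table | grep -i "NC"

az vm create \
  --resource-group my-ncv6-rg \
  --name ncv6-preview-01 \
  --image Ubuntu2204 \
  --size Standard_NC16_RTXPRO6000BSE_v6 \
  --accelerated-networking true \
  --admin-username azureuser \
  --generate-ssh-keys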
NCv6 Technical Specifications

GPU: NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
Processor: Intel Xeon Granite Rapids (up to 4.2 GHz Turbo)
vCPUs: 16 – 320 vCPUs (scalable across General Purpose, Compute Optimized, and Memory Optimized sizes)
System Memory: 64 GB – 1,280 GB DDR5
Network: Up to 200,000 Mbps (200 Gbps) Azure Accelerated Networking
Storage: Up to 2 TB local temp storage; support for Premium SSD v2 & Ultra Disk

Real-World Applications

The NCv6 is built for versatility, powering everything from pixel-perfect rendering to high-throughput language reasoning:

Production Generative AI & Inference: Deploy self-hosted LLMs and RAG pipelines with optimized unit economics. The NCv6 is ideal for serving ranking models, recommendation engines, and content generation agents where low latency and cost-efficiency are paramount.
Automotive & Manufacturing: Validate autonomous driving sensors (LiDAR/Radar) and train physical AI models in high-fidelity simulation environments before they ever touch the real world.
Next-Gen VDI & Azure Virtual Desktop: Modernize remote workstations with NVIDIA RTX Virtual Workstation capabilities. By leveraging fractional GPU options, organizations can deliver high-fidelity, accelerated desktop experiences to distributed teams—offering a superior, high-density alternative to legacy NVv5 deployments.
Media & Entertainment: Accelerate render farms for VFX studios requiring burst capacity, while simultaneously running generative AI tools for texture creation and scene optimization.

Conclusion: The Engine for the Era of Converged AI

The Azure NCv6 series redefines the boundaries of cloud infrastructure. By combining the raw power of NVIDIA's Blackwell architecture with the high-frequency performance of Intel Granite Rapids, we are moving beyond just "visual computing." Innovators can now leverage a unified platform to build the industrial metaverse, deploy intelligent agents, and scale production AI—all with the enterprise-grade security and hybrid reach of Azure.

Ready to experience the next generation? Sign up for the NCv6 Public Preview here.

Azure CycleCloud 8.8 and CCWS 1.2 at SC25 and Ignite
Azure CycleCloud 8.8: Advancing HPC & AI Workloads with Smarter Health Checks

Azure CycleCloud continues to evolve as the backbone for orchestrating high-performance computing (HPC) and AI workloads in the cloud. With the release of CycleCloud 8.8, users gain access to a suite of new features designed to streamline cluster management, enhance health monitoring, and future-proof their HPC environments.

Key Features in CycleCloud 8.8

1. ARM64 HPC Support
The platform expands its hardware compatibility with ARM64 HPC support, opening new possibilities for energy-efficient and cost-effective compute clusters. This includes access to the newer generation of GB200 VMs as well as general ARM64 support, enabling new AI workloads at a scale never possible before.

2. Slurm Topology-Aware Scheduling
The integration of topology-aware scheduling for Slurm clusters allows CycleCloud users to optimize job placement based on network and hardware topology. This leads to improved performance for tightly coupled HPC workloads and better utilization of available resources. (A minimal configuration sketch appears at the end of this section.)

3. Nvidia MNNVL and IMEX Support
With expanded support for Nvidia MNNVL and IMEX, CycleCloud 8.8 ensures compatibility with the latest GPU technologies. This enables users to leverage cutting-edge hardware for AI training, inference, and scientific simulations.

4. HealthAgent: Event-Driven Health Monitoring and Alerting
A standout feature in this release is the enhanced HealthAgent, which delivers event-driven health monitoring and alerting. CycleCloud now proactively detects issues across clusters, nodes, and interconnects, providing real-time notifications and actionable insights. This improvement is a game-changer for maintaining uptime and reliability in large-scale HPC deployments. The node HealthAgent supports both impactful health checks, which can only run while nodes are idle, and non-impactful health checks that can run throughout the lifecycle of a job. This allows CycleCloud to alert not only on issues that occur while nodes are starting, but also on failures that develop on long-running nodes. Later releases of CycleCloud will also include automatic remediation for common failures, so stay tuned!

5. Enterprise Linux 9 and Ubuntu 24 Support
One common request has been wider support for the various Enterprise Linux (EL) 9 variants, including RHEL 9, AlmaLinux 9, and Rocky Linux 9. CycleCloud 8.8 introduces support for those distributions as well as the latest Ubuntu HPC release.

Why These Features Matter

The CycleCloud 8.8 release marks a significant leap forward for organizations running HPC and AI workloads in Azure. The improved health check support—anchored by HealthAgent today and automatic remediation in upcoming releases—means less downtime, faster troubleshooting, and greater confidence in cloud-based research and innovation. Whether you're managing scientific simulations, AI model training, or enterprise analytics, CycleCloud's latest features help you build resilient, scalable, and future-ready HPC environments.
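As a rough illustration of what topology-aware scheduling (feature 2 above) builds on, the sketch below shows a minimal Slurm tree-topology configuration. CycleCloud 8.8 generates the equivalent for you from the cluster's actual placement information; the switch and node names here are purely illustrative.

# Illustrative only: Slurm's tree topology plugin and a hand-written topology.conf.
# Run as root on the scheduler node; CycleCloud 8.8 manages this automatically.
cat >> /etc/slurm/slurm.conf <<'EOF'
TopologyPlugin=topology/tree
EOF

cat > /etc/slurm/topology.conf <<'EOF'
SwitchName=leaf1 Nodes=hpc-[1-16]
SwitchName=leaf2 Nodes=hpc-[17-32]
SwitchName=spine Switches=leaf[1-2]
EOF

scontrol reconfigure   # reload the controller so the topology takes effect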
Key Features in CycleCloud Workspace for Slurm 1.2

Along with the release of CycleCloud 8.8 comes a new CycleCloud Workspace for Slurm (CCWS) release. This release includes the General Availability of features that were previously in preview, such as Open OnDemand, Cendio ThinLinc, and managed Grafana monitoring capabilities. In addition to previously announced features, CCWS 1.2 also includes support for a new Hub and Spoke deployment model. This allows customers to retain a central hub of shared resources that can be re-used between cluster deployments, with "disposable" spoke clusters that branch from the hub. Hub and Spoke deployments benefit customers who need to re-deploy clusters in order to upgrade their operating system, deploy new versions of software, or even reconfigure the overall architecture of their Slurm clusters.

Come Visit Us at SC25 and Microsoft Ignite

To learn more about these features, come visit us at the Microsoft booth at SC25 in St. Louis, MO, and at Microsoft Ignite in San Francisco this week!

Microsoft Discovery: The path to an agentic EDA environment
Generative AI has been the buzz across engineering, science, and consumer applications, including EDA. It was the centerpiece of the keynotes at both SNUG and CadenceLive, and it will feature heavily at DAC. Very impressive task-specific tools and capabilities powered by traditional and generative AI are being developed by both industry vendors and customers. However, all these solutions are point solutions addressing specific tasks. This leaves the question of how customers will tie it all together, and how they will run and access the LLMs, AI, and data resources needed to power these solutions. While our industry has experience developing, running, and maintaining high-performance EDA environments, an AI-centric data center running GPUs and a low-latency interconnect like InfiniBand is not an environment many chip development companies already have or have experience operating. Unfortunately, because LLMs are so resource hungry, it's difficult to "ease into" a deployment.

The Agentic Platform for EDA

At the Microsoft Build conference in May, Microsoft introduced the Microsoft Discovery Platform. This platform aims to accelerate R&D across several industry verticals, specifically Biology (life science and drug discovery), Chemistry (materials and substance discovery), and Physics (semiconductors and multi-physics). Microsoft Discovery provides the platform and capabilities to help customers implement a complete agentic AI environment. Being a cloud-based solution means customers won't need to manage the AI models or RAG solutions themselves. Running inside the customer's cloud tenant, the AI models, the data they use, and the results they produce all remain under the customer's control and within the customer's environment. No data goes back to the Internet, and all learning remains with the customer. This gives customers the confidence that they can safely and easily deploy and use AI models while maintaining complete sovereignty over their data and IP. Customers are free to deploy any of the dozens of available AI models offered on Azure. Customers can also deploy and use Graph RAG solutions to improve context and get better LLM responses. This is all available without having to deploy additional hardware or manage a large, independent GPU deployment. Customers testing out generative AI solutions and starting to develop their flows, tools, and methodologies with this new technology can deploy and use these resources as needed.

The Microsoft Discovery platform does not try to replace the EDA tools you already have. Instead, it allows you to incorporate those tools into an agentic environment. Without anthropomorphizing, these agents can be thought of as AI-driven task engines that can reason and interact with each other or with tools. They can be used to make decisions, analyze results, generate responses, take action, or even drive tools. Customers will be able to incorporate existing EDA tools into the platform and drive them with an agent. Microsoft Discovery will even be able to run agents from partners, helping customers intelligently tie together multiple capabilities and automate analysis and decision-making across the flow, so each engineering team can accomplish a greater number of tasks more quickly and achieve increased productivity.

HPC Infrastructure for EDA

Of course, to run EDA tools, customers need an effective environment to run those tools in.
One of the things that has always been true in our industry but is often overlooked is that, as good as the algorithms in the tools are, they're always limited by the infrastructure they run on. No matter how fast your algorithm is, running on a slow processor means turn-around time is still going to be slow. No matter how fast your tools are and how new and shiny your servers are, if your file system is a bottleneck, your tool and server will have to wait for the data. The infrastructure you run on sets the speed limit for your job regardless of how fast an engine you have. Most of the AI solutions being discussed for EDA focus only on the engine and ignore the infrastructure. The Microsoft Discovery platform understands this and addresses the issue by having the Azure HPC environment at its core.

The HPC core of the platform uses elements familiar to the EDA community. High-performance file storage utilizes Azure NetApp Files (ANF). This shared file service uses the same NetApp technology and hardware that many in the EDA community already use on-premises. ANF delivers unmatched performance for cloud-based file storage, especially for metadata-heavy workloads like EDA. This gives EDA workloads a familiar pathway into the Discovery platform to make use of the AI capabilities for chip design. Customers will also have access to Azure's fleet of high-performance compute, including the recently released Intel Emerald Rapids-based FXv2, which was developed with large, back-end EDA workloads in mind. FXv2 features 1.8 TB of RAM and an all-core turbo clock speed of 4 GHz, making it ideal for large STA, P&R, and PV workloads. For front-end and moderately sized back-end workloads, in addition to the existing HPC compute offerings, Microsoft recently updated the D- and E-series compute SKUs with Intel Emerald Rapids processors in the v6 versions of those systems, further pushing performance for smaller workloads. Design teams will have access to the high-performance compute and storage resources required to get the most out of their EDA tools while also taking advantage of the AI capabilities offered by the platform. The familiar, EDA-friendly HPC environment makes migration of existing workloads easier and ensures that tools will run effectively and, more importantly, that flows mesh more smoothly.

Industry Standards and Interoperability

Another aspect of the Microsoft Discovery platform that will be especially important for EDA customers is that the platform will utilize A2A for agent-to-agent communication and MCP for agent-service communication. This is important because both A2A and MCP are industry-standard protocols. Microsoft also expects to support the evolution of these and other newer standards that will emerge in this field, future-proofing your investment. Those of us who have been involved in the various standards and interoperability efforts in semiconductor and EDA over the years will understand that having the platform use industry standards-based interfaces makes adoption of new technology much easier for all users. With AI development rushing forward and everyone, customers and vendors alike, trying to capitalize on gen AI's promises, there are already independent efforts by customers and vendors to develop capabilities quickly. In the past, this meant that everyone went off in different directions developing mutually exclusive solutions. Vendors would develop mutually exclusive solutions, and customers would then have to develop customized solutions of their own to leverage them.
The various solutions would all work slightly differently, making integration a painful process. The history of VMM, OVM, and UVM is an example of this. As the industry starts to develop AI and agentic environments, the same fragmentation is likely to happen again. By starting with A2A and MCP, Microsoft is signaling for the industry to align around these industry-standard protocols. This will make it easier for agents developed by customers and vendors to interoperate with each other and with the Discovery platform. Vendor tools implementing an MCP server interface can directly communicate with customer agents using MCP as well as with the Discovery platform. This makes it easier for our industry to develop interoperable solutions. Similarly, agents that use the A2A protocol to interact with other agents can be more easily integrated if the other agents also communicate using A2A. If you're going to be building agents for EDA, or EDA tools or services that interact with agents, build them using A2A for inter-agent communication and MCP for agent-to-tool/service communication (a minimal sketch of an MCP call follows at the end of this article).

Generative AI is likely to be the most transformative technology to impact EDA this decade. Productivity-wise, it will likely be at least as impactful for us as synthesis, STA, and automatic place and route were in their own ways.

To learn more about these innovations, come join the Microsoft team at the Design Automation Conference (DAC) in San Francisco on June 23. At DAC, the Microsoft team will go into depth about the Discovery platform and the larger impact that AI will have on the semiconductor industry. In his opening keynote discussion on Monday, Bill Chappell, Microsoft's CTO for the Microsoft Discovery and Quantum team, will discuss AI's impact on science and the semiconductor industry. Serge Leef's engineering track session will talk about generative AI in chip design, and don't miss Prashant Varshney's detailed explanation of the Microsoft Discovery platform in his Exhibitor Forum session. Visit the Microsoft booth (second floor, 2124) for more in-depth discussions with our team.
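For teams ready to experiment, the sketch below shows roughly what an MCP tool-discovery call looks like over the protocol's HTTP transport. The endpoint URL is hypothetical, the MCP initialize handshake is omitted for brevity, and a real integration would normally use an MCP client library rather than raw curl.

# Hypothetical sketch: asking an MCP server which tools it exposes.
# tools/list (and tools/call) are standard MCP JSON-RPC methods; the URL is a placeholder.
curl -s -X POST https://eda-tools.example.com/mcp \
  -H "Content-Type: application/json" \
  -H "Accept: application/json, text/event-stream" \
  -d '{"jsonrpc":"2.0","id":1,"method":"tools/list","params":{}}'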
Announcing the Public Preview of AMLFS 20: Azure Managed Lustre New SKU for Massive AI & HPC Workloads

Sachin Sheth - Principal PDM Manager
Brian Barbisch - Principal Group Software Engineering Manager
Matt White - Principal Group Software Engineering Manager
Brian Lepore - Principal Product Manager
Wolfgang De Salvador - Senior Product Manager
Ron Hogue - Senior Product Manager

Introduction

We are excited to announce the Public Preview of AMLFS Durable Premium 20 (AMLFS 20), a new SKU in Azure Managed Lustre designed to deliver unprecedented performance and scale for demanding AI and HPC workloads.

Key Features

Massive Scale: Store up to 25 PiB of data in a single namespace, with up to 512 GB/s of total bandwidth.
Advanced Metadata Performance: Multi-MDS (Metadata Server) architecture dramatically improves metadata IOPS. In mdtest benchmarks, AMLFS 20 demonstrated more than 5x improvement in metadata operations (a sketch of this kind of run appears at the end of this post). An additional MDS is provided for every 5 PiB of provisioned filesystem.
High File Capacity: Supports up to 20 billion inodes at the maximum namespace size.

Why AMLFS 20 Matters

Simplified Architecture: Previously, datasets larger than 12.5 PiB required multiple filesystems and complex management. AMLFS 20 enables a single, high-performance file system for massive AI and HPC workloads up to 25 PiB, streamlining deployment and administration.
Accelerated Data Preparation: The multi-MDS/MDT architecture significantly increases metadata IOPS, which is crucial during the data preparation stage of AI training, where rapid access to millions of files is required.
Faster Time-to-Value: Researchers and engineers benefit from easier management, reduced bottlenecks, and faster access to large datasets, accelerating innovation.

Availability

AMLFS 20 is available in Public Preview alongside the already existing AMLFS SKUs. For more details on other SKUs, visit the Azure Managed Lustre documentation.

How to Join the Preview

If you are working with large-scale AI or HPC workloads and would like early access to AMLFS 20, we invite you to fill out this form to tell us about your use case. Our team will follow up with onboarding details.
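For readers who want to reproduce this style of metadata measurement on their own mount, a rough mdtest invocation is sketched below. The node count, path, and parameters are illustrative placeholders, not the configuration behind the published numbers.

# Illustrative mdtest run against an AMLFS client mount.
#   -n  files/directories created per MPI rank
#   -i  iterations to average over
#   -u  give each rank a unique working directory
#   -d  target directory on the Lustre mount
mpirun -np 64 --hostfile hosts \
  mdtest -n 10000 -i 3 -u -d /amlfs/mdtest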
Performance and Scalability of Azure HBv5-series Virtual Machines

Azure HBv5-series virtual machines (VMs) for CPU-based high performance computing (HPC) are now Generally Available. This blog provides in-depth information about the technical underpinnings, performance, cost, and management implications of these HPC-optimized VMs. Azure HBv5 VMs bring leadership levels of performance, cost optimization, and server (VM) consolidation for a variety of workloads driven by memory performance, such as computational fluid dynamics, weather simulation, geoscience simulations, and finite element analysis. For these applications, and compared to HBv4 VMs, previously the highest-performance offering for these workloads, HBv5 provides up to:

5x higher performance for CFD workloads with 43% lower costs
3.2x higher performance for weather simulation with 16% lower costs
2.8x higher performance for geoscience workloads at the same costs

HBv5-series Technical Overview & VM Sizes

Each HBv5 VM features several new technologies for HPC customers, including:

Up to 6.6 TB/s of memory bandwidth (STREAM TRIAD) and 432 GB memory capacity
Up to 368 physical cores per VM (user configurable) with custom AMD EPYC CPUs, Zen4 microarchitecture (SMT disabled)
Base clock of 3.5 GHz (~1 GHz higher than other 96-core EPYC CPUs), and boost clock of 4 GHz across all cores
800 Gb/s NVIDIA Quantum-2 InfiniBand (4 x 200 Gb/s CX-7) (~2x higher than HBv4 VMs)
180 Gb/s Azure Accelerated Networking (~2.2x higher than HBv4 VMs)
15 TB local NVMe SSD with up to 50 GB/s (read) and 30 GB/s (write) of bandwidth (~4x higher than HBv4 VMs)

The highlight feature of HBv5 VMs is their use of high-bandwidth memory (HBM). HBv5 VMs utilize a custom AMD CPU that increases memory bandwidth by ~9x v. dual-socket 4th Gen EPYC (Zen4, "Genoa") server platforms, and ~7x v. dual-socket EPYC (Zen5, "Turin") server platforms, respectively. HBv5 delivers similar levels of memory bandwidth improvement compared to the highest-end alternatives from the Intel Xeon and ARM CPU ecosystems.

HBv5-series VMs are available in the following sizes, with specifications as shown below. Just like existing H-series VMs, the HBv5-series includes constrained-cores VM sizes, enabling customers to optimize their VM dimensions for a variety of scenarios:

ISV licensing constraining a job to a targeted number of cores
Maximum performance per VM or maximum performance per core
Minimum RAM per core (1.2 GB, suitable for strong-scaling workloads) to maximum memory per core (9 GB, suitable for large datasets and weak-scaling workloads)

Table 1: Technical specifications of HBv5-series VMs

Note: Maximum clock frequencies (FMAX) are based on product specifications of the AMD EPYC 9V64H processor. Clock frequencies experienced by a customer are a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application. For more information, see the official documentation for HBv5-series VMs.

Microbenchmark Performance

This section focuses on microbenchmarks that characterize the performance of the memory subsystem, compute capabilities, and InfiniBand network of HBv5 VMs.

Memory & Compute Performance

To capture synthetic performance, we ran the following industry standard benchmarks:

STREAM – memory bandwidth
High Performance Conjugate Gradient (HPCG) – sparse linear algebra
High Performance Linpack (HPL) – dense linear algebra

Absolute results and comparisons to HBv4 VMs are shown in Table 2, below:

Table 2: Results of HBv5 running the STREAM, HPCG, and HPL benchmarks.
Note: STREAM was run with the following CLI parameters:
OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./amd_zen_stream
STREAM data size: 2621440000 bytes

InfiniBand Networking Performance

Each HBv5-series VM is equipped with four NVIDIA Quantum-2 network interface cards (NICs), each operating at 200 Gb/s for an aggregate bandwidth of 800 Gb/s per VM (node). We ran the industry-standard IB perftest and OSU benchmarks across two (2) HBv5-series VMs, as depicted in the results shown in Figures 1–3, below.

Note: all results below are for a single 200 Gb/s (uni-directional) link only. At a VM level, all bandwidth results below are 4x higher, as there are four (4) InfiniBand links per HBv5 server.

Unidirectional bandwidth: numactl -c 0 ib_send_bw -aF -q 2
Figure 1: results showing 99% achieved uni-directional bandwidth v. theoretical peak.

Bi-directional bandwidth: numactl -c 0 ib_send_bw -aF -q 2 -b
Figure 2: results showing 99% achieved bi-directional bandwidth v. theoretical peak.

Latency:
Figure 3: results measuring as low as 1.25 microsecond latencies among HBv5 VMs. Latencies experienced by users will depend on the message sizes employed by applications.

Application Performance, Cost/Performance, and Server (VM) Consolidation

This section focuses on characterizing HBv5-series VMs when running common, real-world HPC applications, with an emphasis on those known to be meaningfully bound by memory performance, as that is the focus of the HB-series family. We characterize HBv5 below in three (3) ways of high relevance to customer interests:

Performance ("how much faster can it do the work")
Cost/Performance ("how much can it reduce the costs to complete the work")
Fleet consolidation ("how much can a customer simplify the size and scale of compute fleet management while still being able to do the work")

Where possible, we have included comparisons to other Azure HPC VMs, including:

Azure HBv4/HX series with 176 physical cores of 4th Gen AMD EPYC CPUs with 3D V-Cache ("Genoa-X") (HBv4 specifications, HX specifications)
Azure HBv3 with 120 physical cores of 3rd Gen AMD EPYC CPUs with 3D V-Cache ("Milan-X") (HBv3 specifications)
Azure HBv2 with 120 physical cores of 2nd Gen AMD EPYC ("Rome") processors (full specifications)

Unless otherwise noted, all tests shown below were performed with:

AlmaLinux 8.10 (image URN: almalinux:almalinux-hpc:8_10-hpc-gen2:latest); for scaling tests, AlmaLinux 8.6 (image URN: almalinux:almalinux-hpc:8_6-hpc-gen2:latest)
NVIDIA HPC-X MPI

Further, all cost/performance comparisons leverage list-price, Pay-As-You-Go (PAYG) pricing rate information found on Azure Linux Virtual Machines Pricing. Absolute costs will be a function of a customer's workload, model, and consumption approach (PAYG v. Reserved Instance, etc.). That said, the relative cost/performance comparisons illustrated below should hold for the workload and model combinations shown, regardless of the consumption approach.

Computational Fluid Dynamics (CFD)

OpenFOAM – version 2306 with 100M cell Motorbike case
Figure 4: HBv5 v. HBv4 on OpenFOAM with the Motorbike 100M cell case. HBv5 VMs provide a 4.8x performance increase over HBv4 VMs.
Figure 5: The cost to complete the OpenFOAM Motorbike 100M case is just 57% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running OpenFOAM cases similar to the size and complexity of the 100M cell Motorbike problem, organizations can consolidate their server (VM) deployments by approximately a factor of five (5).
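For reference, the cost and consolidation comparisons throughout this section follow from a simple relation; this is a generic formulation under the PAYG assumptions stated above, not a separate methodology:

\[
\text{cost/job} = N_{\text{VM}} \times P_{\text{hour}} \times T_{\text{wall}},
\qquad
\frac{\text{cost}_{\text{HBv5}}}{\text{cost}_{\text{HBv4}}}
= \frac{N_{\text{HBv5}}\,P_{\text{HBv5}}}{N_{\text{HBv4}}\,P_{\text{HBv4}}} \times \frac{1}{\text{speedup}}
\]

Here \(T_{\text{wall}}\) is wall-clock time in hours, and \(N\) and \(P\) are the VM count and hourly price for each series. Because the price ratio is the same under any consumption model, the relative comparisons carry over regardless of how the VMs are purchased.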
Palabos – version 1.01 with 3D Cavity, 1001 x 1001 x 1001 cells case
Figure 6: On Palabos, a Lattice Boltzmann solver using a streaming memory access pattern, HBv5 VMs provide a 4.4x performance increase over HBv4 VMs.
Figure 7: The cost to complete the Palabos 3D Cavity case is just 62% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Palabos with cases similar to the size and complexity of the 1001 x 1001 x 1001 cell 3D Cavity problem, organizations can consolidate their server (VM) deployments by approximately a factor of ~4.5.

Ansys Fluent – version 2025 R2 with F1 Racecar 140M case
Figure 8: On Ansys Fluent, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.
Figure 9: The cost to complete the Ansys Fluent F1 Racecar 140M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Ansys Fluent with cases similar to the size and complexity of the 140M cell F1 Racecar problem, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.5.

Siemens Star-CCM+ – version 17.04.005 with AeroSUV Steady Coupled 106M case
Figure 10: On Star-CCM+, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.
Figure 11: The cost to complete the Siemens Star-CCM+ AeroSUV Steady Coupled 106M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Star-CCM+ with cases similar to the size and complexity of the 106M cell AeroSUV Steady Coupled case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.5.

Weather Modeling

WRF – version 4.2.2 with CONUS 2.5km case
Figure 12: On WRF, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.
Figure 13: The cost to complete the WRF CONUS 2.5km case is just 84% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running WRF with cases similar to the size and complexity of the 2.5km CONUS case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.

Energy Research

Devito – version 4.8.7 with Acoustic Forward case
Figure 14: On Devito, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.
Figure 15: The cost to complete the Devito Acoustic Forward OP case is equivalent to what it costs to complete the same case on HBv4.

Above, we can see that for customers running Devito with cases similar to the size and complexity of the Acoustic Forward OP case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.

Molecular Dynamics

NAMD – version 2.15a2 with STMV 20M case
Figure 16: On NAMD, HBv5 VMs provide a 2.18x performance increase over HBv4 VMs.
Figure 17: The cost to complete the NAMD STMV 20M case is 26% higher on HBv5 than what it costs to complete the same case on HBv4.

Above, we can see that for customers running NAMD with cases similar to the size and complexity of the STMV 20M case, organizations can consolidate their server (VM) deployments by approximately a factor of ~2. Notably, NAMD is a compute-bound case, rather than one bound by memory performance. We include it here to illustrate that not all workloads are a good fit for HBv5. This latest Azure HPC VM is the fastest at this workload on the Microsoft Cloud, but it does not benefit substantially from HBv5's premium levels of memory bandwidth. NAMD would instead run more cost-efficiently on a CPU that supports AVX-512 instructions natively or, much better still, a modern GPU.
Scalability of HBv5-series VMs

Weak Scaling

Weak scaling measures how well a parallel application or system performs when both the number of processing elements and the problem size increase proportionally, so that the workload per processor remains constant. Weak scaling cases are often employed when time-to-solution is fixed (e.g. it is acceptable to solve a problem within a specified period) but a user desires a simulation of higher fidelity or resolution. A common example is operational weather forecasting. To illustrate weak scaling on HBv5 VMs, we ran Palabos with the same 3D Cavity problem as shown earlier:

Figure 18: On Palabos with the 3D Cavity model, HBv5 scales linearly as the 3D cavity size is proportionately increased.

Strong Scaling

Strong scaling is characterized by the efficiency with which execution time is reduced as the number of processor elements (CPUs, GPUs, etc.) is increased, while the problem size is kept constant. Strong scaling cases are often employed when the fidelity or resolution of the simulation is acceptable, but a user requires faster time to completion. A common example is product engineering validation, when an organization wants to bring a product to market faster but must complete a broad range of validation and verification scenarios before doing so. To illustrate strong scaling on HBv5 VMs, we ran NAMD with two different problems, each intended to illustrate how expectations for strong scaling efficiency change depending on problem size and the ordering of computation v. communication in distributed memory workloads.

First, let us examine NAMD with the 20M STMV benchmark:

Figure 19: Strong scaling on HBv5 with the NAMD STMV 20M cell case.

As illustrated above, for strong scaling cases in which the compute time is continuously reduced (by leveraging more and more processor elements) but communication time remains constant, scaling efficiency will only stay high for so long. That principle is well represented by the STMV 20M case, for which parallel efficiency remains linear (i.e. cost/job remains flat) at two (2) nodes but degrades after that. This is because while compute is being sped up, the MPI time remains relatively flat. As such, the relatively static MPI time comes to dominate end-to-end wall clock time as VM scaling increases. Said another way, HBv5 features so much compute performance that even for a moderate-sized problem like STMV 20M, scaling the infrastructure can only take performance so far, and cost/job will begin to increase.

If we examine HBv5 against the 210M cell case, however, with 10.5x as many elements to compute as its 20M case sibling, the scaling efficiency story changes significantly.

Figure 20: On NAMD with the STMV 210M cell case, HBv5 scales linearly out to 32 VMs (or more than 11,000 CPU cores).

As illustrated above, larger cases with significant compute requirements will continue to scale efficiently with larger amounts of HBv5 infrastructure. While MPI time remains relatively flat for this case (as with the smaller STMV 20M case), the compute demands remain the dominant fraction of end-to-end wall clock time. As such, HBv5 scales these problems with very high levels of efficiency, and in doing so job costs to the user remain flat despite up to 8x as many VMs being leveraged compared to the four (4) VM baseline.

The key takeaways for strong scaling scenarios are two-fold.
First, users should run scaling tests with their applications and models to find the sweet spot of faster performance with constant job costs. This will depend heavily on model size. Second, as new and very high-end compute platforms like HBv5 emerge that accelerate compute time, application developers will need to find ways to reduce the wall-clock time spent bottlenecked on communication (MPI). Recommended approaches include using fewer MPI processes and, ideally, restructuring applications to overlap communication with compute phases.
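For reference, the scaling metrics discussed in this section can be written compactly as follows (standard definitions, not specific to HBv5):

\[
S(N) = \frac{T(1)}{T(N)}, \qquad
E_{\text{strong}}(N) = \frac{S(N)}{N} = \frac{T(1)}{N\,T(N)}, \qquad
E_{\text{weak}}(N) = \left.\frac{T(1)}{T(N)}\right|_{\text{work}\,\propto\,N}
\]

where \(T(N)\) is the wall-clock time on \(N\) VMs. Linear strong scaling (\(E_{\text{strong}} \approx 1\)) is exactly what keeps cost/job flat as VMs are added, since the total VM-hours consumed stay constant.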
Join Microsoft @ SC25: Experience HPC and AI Innovation

Supercomputing 2025 is coming to St. Louis, MO, November 16–21! Visit Microsoft Booth #1627 to explore cutting-edge HPC and AI solutions, connect with experts, and experience interactive demos that showcase the future of compute. Whether you're attending technical sessions, stopping by for a coffee and a chat with our team, or joining our partner events, we've got something for everyone.

Booth Highlights

Alpine Formula 1 Showcar: Snap a photo with a real Alpine F1 car and learn how high-performance computing drives innovation in motorsports.
Silicon Wall: Discover silicon diversity—featuring chips from our partners AMD and NVIDIA, alongside Microsoft's own first-party silicon: Maia, Cobalt, and Majorana.
NVIDIA Weather Modeling Demo: See how AI and HPC predict extreme weather events with Tomorrow.io and NVIDIA technology.
Coffee Bar with Barista: Enjoy a handcrafted coffee while you connect with our experts.
Immersive Screens: Watch live demos and visual stories about HPC breakthroughs and AI innovation.
Hardware Bar: Explore AMD EPYC™ and NVIDIA GB200 systems powering next-generation workloads.

Conference Details

Conference week: Sun, Nov 16 – Fri, Nov 21
Expo hours (CST):
Mon, Nov 17: 7:00–9:00 PM (Opening Night)
Tue, Nov 18: 10:00 AM–6:00 PM
Wed, Nov 19: 10:00 AM–6:00 PM
Thu, Nov 20: 10:00 AM–3:00 PM
Customer meeting rooms: Four Seasons Hotel

Quick Links

RSVP — Microsoft + AMD Networking Reception (Tue, Nov 18): https://aka.ms/MicrosoftAMD-Mixer
RSVP — Microsoft + NVIDIA Panel Luncheon (Wed, Nov 19): Luncheon is now closed as the event is fully booked.

Earned Sessions (Technical Program)

Sunday, Nov 16
Tutorial | 8:30 AM–5:00 PM | Delivering HPC: Procurement, Cost Models, Metrics, Value, and More | Andrew Jones | Room 132
Tutorial | 8:30 AM–5:00 PM | Modern High Performance I/O: Leveraging Object Stores | Glenn Lockwood | Room 120
Workshop | 2:00–5:30 PM | 14th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2025) | Torsten Hoefler | Room 265

Monday, Nov 17
Early Career Program | 3:30–4:45 PM | Voices from the Field: Navigating Careers in Academia, Government, and Industry | Joe Greenseid | Room 262
Workshop | 3:50–4:20 PM | Towards Enabling Hostile Multi-tenancy in Kubernetes | Ali Kanso; Elzeiny Mostafa; Gurpreet Virdi; Slava Oks | Room 275
Workshop | 5:00–5:30 PM | On the Performance and Scalability of Cloud Supercomputers: Insights from Eagle and Reindeer | Amirreza Rastegari; Prabhat Ram; Michael F. Ringenburg | Room 267
Tuesday, Nov 18
BOF | 12:15–1:15 PM | High Performance Software Foundation BoF | Joe Greenseid | Room 230
Poster | 5:30–7:00 PM | Compute System Simulator: Modeling the Impact of Allocation Policy and Hardware Reliability on HPC Cloud Resource Utilization | Jarrod Leddy; Huseyin Yildiz | Second Floor Atrium

Wednesday, Nov 19
BOF | 12:15–1:15 PM | The Future of Python on HPC Systems | Michael Droettboom | Room 125
BOF | 12:15–1:15 PM | Autonomous Science Network: Interconnected Autonomous Science Labs Empowered by HPC and Intelligent Agents | Joe Tostenrude | Room 131
Paper | 1:30–1:52 PM | Uno: A One-Stop Solution for Inter- and Intra-Data Center Congestion Control and Reliable Connectivity | Abdul Kabbani; Ahmad Ghalayini; Nadeen Gebara; Terry Lam | Rooms 260–267
Paper | 2:14–2:36 PM | SDR-RDMA: Software-Defined Reliability Architecture for Planetary-Scale RDMA Communication | Abdul Kabbani; Jie Zhang; Jithin Jose; Konstantin Taranov; Mahmoud Elhaddad; Scott Moe; Sreevatsa Anantharamu; Zhuolong Yu | Rooms 260–267
Panel | 3:30–5:00 PM | CPUs Have a Memory Problem — Designing CPU-Based HPC Systems with Very High Memory Bandwidth | Joe Greenseid | Rooms 231–232
Paper | 4:36–4:58 PM | SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations | Kun Li; Liang Yuan; Ting Cao; Mao Yang | Rooms 260–267

Thursday, Nov 20
BOF | 12:15–1:15 PM | Super(computing)heroes | Laura Parry | Rooms 261–266
Paper | 3:30–3:52 PM | Workload Intelligence: Workload-Aware IaaS Abstraction for Cloud Efficiency | Anjaly Parayil; Chetan Bansal; Eli Cortez; Íñigo Goiri; Jim Kleewein; Jue Zhang; Pantea Zardoshti; Pulkit Misra; Raphael Ghelman; Ricardo Bianchini; Rodrigo Fonseca; Saravan Rajmohan; Xiaoting Qin | Room 275
Paper | 4:14–4:36 PM | From Deep Learning to Deep Science: AI Accelerators Scaling Quantum Chemistry Beyond Limits | Fusong Ju; Kun Li; Mao Yang | Rooms 260–267

Friday, Nov 21
Workshop | 9:00 AM–12:30 PM | Eleventh International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC 2025) | Torsten Hoefler | Room 263

Booth Theater Sessions

Monday, Nov 17 — 7:00 PM–9:00 PM
8:00–8:20 PM | Inside the World's Most Powerful AI Data Center | Chris Jones
8:30–8:50 PM | Transforming Science and Engineering — Driven by Agentic AI, Powered by HPC | Joe Tostenrude

Tuesday, Nov 18 — 10:00 AM–6:00 PM
11:00–11:50 AM | Ignite Keynotes
12:00–12:20 PM | Accelerating AI Workloads with Azure Storage | Sachin Sheth; Wolfgang De Salvador
12:30–12:50 PM | Accelerate Memory Bandwidth-Bound Workloads with Azure HBv5, now GA | Jyothi Venkatesh
1:00–1:20 PM | Radiation & Health Companion: AI-Driven Flight-Dose Awareness | Olesya Sarajlic
1:30–1:50 PM | Ascend HPC Lab: Your On-Ramp to GPU-Powered Innovation | Daniel Cooke (Oakwood)
2:00–2:20 PM | Azure AMD HBv5: Redefining CFD Performance and Value in the Cloud | Rick Knoechel (AMD)
2:30–2:50 PM | Empowering High Performance Life Sciences Workloads on Azure | Qumulo
3:00–3:20 PM | Transforming Science and Engineering — Driven by Agentic AI, Powered by HPC | Joe Tostenrude
4:00–4:20 PM | Unleashing AMD EPYC on Azure: Scalable HPC for Energy and Manufacturing | Varun Selvaraj (AMD)
4:30–4:50 PM | Automating HPC Workflows with Copilot Agents | Xavier Pillons
5:00–5:20 PM | Scaling the Future: NVIDIA's GB300 NVL72 Rack for Next-Generation AI Inference | Kirthi Devleker (NVIDIA)
5:30–5:50 PM | Enabling AI and HPC Workloads in the Cloud with Azure NetApp Files | Andy Chan

Wednesday, Nov 19 — 10:00 AM–6:00 PM
10:30–10:50 AM | AI-Powered Digital Twins for Industrial Engineering | John Linford (NVIDIA)
11:00–11:20 AM | Advancing 5 Generations of HPC Innovation with AMD on Azure | Allen Leibovitch (AMD)
11:30–11:50 AM | Intro to LoRA Fine-Tuning on Azure | Christin Pohl
12:00–12:20 PM | VAST + Microsoft: Building the Foundation for Agentic AI | Lior Genzel (VAST Data)
12:30–12:50 PM | Inside the World's Most Powerful AI Data Center | Chris Jones
1:00–1:20 PM | Supervised GenAI Simulation – Stroke Prognosis (NVads V710 v5) | Kurt Niebuhr
1:30–1:50 PM | What You Don't See: How Azure Defines VM Families | Anshul Jain
2:00–2:20 PM | Hammerspace Tier 0: Unleashing GPU Storage Performance on Azure | Raj Sharma (Hammerspace)
2:30–2:50 PM | GM Motorsports: Accelerating Race Performance with AI Physics on Rescale | Bernardo Mendez (Rescale)
3:00–3:20 PM | Hurricane Analysis and Forecasting on the Azure Cloud | Salar Adili (Microsoft); Unni Kirandumkara (GDIT); Stefan Gary (Parallel Works)
3:30–3:50 PM | Performance at Scale: Accelerating HPC & AI Workloads with WEKA on Azure | Desiree Campbell; Wolfgang De Salvador
4:00–4:20 PM | Pushing the Limits of Performance: Supercomputing on Azure AI Infrastructure | Biju Thankachen; Ojasvi Bhalerao
4:30–4:50 PM | Accelerating Momentum: Powering AI & HPC with AMD Instinct™ GPUs | Jay Cayton (AMD)

Thursday, Nov 20 — 10:00 AM–3:00 PM
11:30–11:50 AM | Intro to LoRA Fine-Tuning on Azure | Christin Pohl
12:00–12:20 PM | Accelerating HPC Workflows with Ansys Access on Microsoft Azure | Dr. John Baker (Ansys)
12:30–12:50 PM | Accelerate Memory Bandwidth-Bound Workloads with Azure HBv5, now GA | Jyothi Venkatesh
1:00–1:20 PM | Pushing the Limits: Supercomputing on Azure AI Infrastructure | Biju Thankachen; Ojasvi Bhalerao
1:30–1:50 PM | The High Performance Software Foundation | Todd Gamblin (HPSF)
2:00–2:20 PM | Heidi AI — Deploying Azure Cloud Environments for Higher-Ed Students & Researchers | James Verona (Adaptive Computing); Dr. Sameer Shende (UO/ParaTools)

Partner Session Schedule

Tuesday, Nov 18
11:00 AM–11:50 AM | Cloud Computing for Engineering Simulation | Joe Greenseid | Ansys Booth
1:00 PM–1:30 PM | Revolutionizing Simulation with Artificial Intelligence | Joe Tostenrude | Ansys Booth
4:30 PM–5:00 PM | [HBv5] | Jyothi Venkatesh | AMD Booth

Wednesday, Nov 19
11:30 AM–1:30 PM | Accelerating Discovery: How HPC and AI Are Shaping the Future of Science (Lunch Panel) | Andrew Jones (Moderator); Joe Greenseid (Panelist) | Ruth's Chris Steak House
1:00 PM–1:30 PM | VAST and Microsoft | Kanchan Mehrotra | VAST Booth

Demo Pods at Microsoft Booth

Azure HPC & AI Infrastructure: Explore how Azure delivers high-performance computing and AI workloads at scale. Learn about VM families, networking, and storage optimized for HPC.
Agentic AI for Science: See how autonomous agents accelerate scientific workflows, from simulation to analysis, using Azure AI and HPC resources.
Hybrid HPC with Azure Arc: Discover how Azure Arc enables hybrid HPC environments, integrating on-prem clusters with cloud resources for flexibility and scale.
Ancillary Events (RSVP Required)

Microsoft + AMD Networking Reception — Tuesday Night
When: Tue, Nov 18, 6:30–10:00 PM (CST)
Where: UMB Champions Club, Busch Stadium
RSVP: https://aka.ms/MicrosoftAMD-Mixer

Microsoft + NVIDIA Panel Luncheon — Wednesday
When: Wed, Nov 19, 11:30 AM–1:30 PM (CST)
Where: Ruth's Chris Steak House
Topic: Accelerating Discovery: How AI and HPC Are Shaping the Future of Science
Panelists: Dan Ernst (NVIDIA); Rollin Thomas (NERSC); Joe Greenseid (Microsoft); Antonia Maar (Intersect360 Research); Fernanda Foertter (University of Alabama)
RSVP: Luncheon is now closed as the event is fully booked.

Conclusion

We're excited to connect with you at SC25! Whether you're exploring our booth demos, attending technical sessions, or joining one of our partner events, this is your opportunity to experience how Microsoft is driving innovation in HPC and AI. Stop by Booth #1627 to see the Alpine F1 showcar, explore the Silicon Wall featuring AMD, NVIDIA, and Microsoft's own chips, and enjoy a coffee from our barista while networking with experts. Don't forget to RSVP for our Microsoft + AMD Networking Reception and Microsoft + NVIDIA Panel Luncheon. See you in St. Louis!

The Complete Guide to Renewing an Expired Certificate in Microsoft HPC Pack 2019 (Single Head Node)
Managing certificates in an HPC Pack 2019 cluster is critical for secure communication between nodes. However, if your certificate has expired, your cluster services (Scheduler, Broker, Web Components, etc.) may stop functioning properly — preventing nodes from communicating or jobs from scheduling. When the HPC Pack certificate expires, the HPC Cluster Manager will fail to launch, and you may encounter error messages similar to the examples shown below. This comprehensive guide walks you through how to renew an already expired HPC Pack certificate on a single-head-node setup and bring your cluster back online.

Step 1: Check the Current Certificate Expiry

Start by checking the existing certificate and its expiry date.

Get-ChildItem -Path Cert:\LocalMachine\root | Where-Object { $_.Subject -like "*HPC*" }
$thumbprint = "<Thumbprint value from the previous command>".ToUpper()
$cert = Get-ChildItem -Path Cert:\LocalMachine\My | Where-Object { $_.Thumbprint -eq $thumbprint }
$cert | Select-Object Subject, NotBefore, NotAfter, Thumbprint

You can also confirm the system date using the PowerShell date command:

Date

This ensures you're viewing the correct validity period for the currently installed certificate.

Step 2: Prepare a New Self-Signed Certificate

Next, we'll create a new certificate that meets the HPC communication requirements.

Certificate Requirements:
Must have a private key capable of key exchange.
Key usage should include: Digital Signature, Key Encipherment, Key Agreement, and Certificate Signing.
Enhanced key usage should include: Client Authentication and Server Authentication.
If two certificates are used (private/public), both must have the same subject name.

When you prepare the new certificate, make sure you use the same subject name as that of the old certificate. You can verify the existing certificate's subject name by running the following PowerShell commands on the HPC node:

$thumbprint = (Get-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\HPC -Name SSLThumbprint).SSLThumbPrint
$subjectName = (Get-Item Cert:\LocalMachine\My\$thumbprint).Subject
$subjectName

Use the same subject name when generating the new certificate.

Step 3: Create a New Certificate

Use the commands below to create and export a new self-signed certificate (valid for 1 year).

$subjectName = "HPC Pack Node Communication"
$pfxcert = New-SelfSignedCertificate -Subject $subjectName -KeySpec KeyExchange -KeyLength 2048 -HashAlgorithm SHA256 -TextExtension @("2.5.29.37={text}1.3.6.1.5.5.7.3.1,1.3.6.1.5.5.7.3.2") -Provider "Microsoft Enhanced RSA and AES Cryptographic Provider" -CertStoreLocation Cert:\CurrentUser\My -KeyExportPolicy Exportable -NotAfter (Get-Date).AddYears(1) -NotBefore (Get-Date).AddDays(-1)
$certThumbprint = $pfxcert.Thumbprint
$null = New-Item $env:Temp\$certThumbprint -ItemType Directory
$pfxPassword = Get-Credential -UserName 'Protection password' -Message 'Enter protection password below'
Export-PfxCertificate -Cert Cert:\CurrentUser\My\$certThumbprint -FilePath "$env:Temp\$certThumbprint\PrivateCert.pfx" -Password $pfxPassword.Password
Export-Certificate -Cert Cert:\CurrentUser\My\$certThumbprint -FilePath "$env:Temp\$certThumbprint\PublicCert.cer" -Type CERT -Force
start "$env:Temp\$certThumbprint"

This will generate both .pfx (private) and .cer (public) files in a temporary directory.
Step 4: Copy the Certificate to the Install Share

On the master (head) node, copy the newly created certificate to the following path:

C:\Program Files\Microsoft HPC Pack 2019\Data\InstallShare\Certificates

This ensures the certificate is available to all compute nodes in the cluster.

Step 5: Rotate Certificates on Compute Nodes

Important: Always rotate certificates on compute nodes first, before the head node. If you update the head node first, compute nodes will reject the new certificate, forcing manual reconfiguration. After rotating compute node certificates, expect them to appear as Offline in HPC Cluster Manager — this is normal until the head node certificate is updated.

Download the PowerShell script Update-HpcNodeCertificate.ps1 and place it in your HPC install share:

\\<headnode>\REMINST

On each compute node, open PowerShell as Administrator and run:

PowerShell.exe -ExecutionPolicy ByPass -Command "\\<headnode>\REMINST\Update-HpcNodeCertificate.ps1 -PfxFilePath \\<headnode>\REMINST\Certificates\HpcCnCommunication.pfx -Password <password>"

This updates the certificate on each compute node.

Step 6: Update the Certificate on the Master (Head) Node

On the head node, run the following commands in PowerShell as Administrator:

$certPassword = ConvertTo-SecureString -String "YourPassword" -AsPlainText -Force
Import-PfxCertificate -FilePath "C:\Program Files\Microsoft HPC Pack 2019\Data\InstallShare\Certificates\PrivateCert.pfx" -CertStoreLocation "Cert:\LocalMachine\My" -Password $certPassword
PowerShell.exe -ExecutionPolicy ByPass -Command "Import-Certificate -FilePath \\master\REMINST\Certificates\PublicCert.cer -CertStoreLocation cert:\LocalMachine\Root"
Set-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\HPC" -Name SSLThumbprint -Value <Thumbprint>
Set-ItemProperty -Path "HKLM:\SOFTWARE\Wow6432Node\Microsoft\HPC" -Name SSLThumbprint -Value <Thumbprint>

Step 7: Update the Thumbprint in the SQL Database

You'll also need to update the certificate thumbprint stored in the HPCHAStorage database.

1. Install SQL Server Management Studio (SSMS) (latest version).
2. Open SSMS and connect to the HPC database.
3. Navigate to: HPCHAStorage → Tables → dbo.DataTable
4. Right-click and select "Select Top 1000 Rows" to view the current SSL thumbprint.
5. Open a new query window and run the following command with the updated thumbprint:

Update dbo.DataTable set dvalue='<NewThumbprint>' where dpath = 'HKEY_LOCAL_MACHINE\Software\Microsoft\HPC' and dkey = 'SSLThumbprint'

This updates the stored certificate reference used by the HPC services.

Step 8: Reboot the Master Node

Once everything is updated, reboot the head node to apply the changes. After the system restarts, open HPC Cluster Manager — your cluster should now be fully functional with the new certificate in place.

Summary

By following these steps, you can safely renew an expired HPC Pack 2019 certificate and restore secure communication across your cluster — without needing to reinstall or reconfigure HPC Pack components. This guide helps administrators handle expired certificates with confidence and maintain business continuity for HPC workloads. If this guide helped you resolve your certificate issues, please give it a 👍 thumbs up and share your feedback or questions in the comments section below.

Use Entra IDs to run jobs on your HPC cluster
Introduction

This blog demonstrates a practical implementation of the System Security Services Daemon (SSSD) with the recently introduced "idp" provider, which can be used on Azure Linux 3.0 HPC clusters to provide consistent usernames, UIDs, and GIDs across the cluster, all rooted in Microsoft Entra ID. Having consistent identities across the cluster is a fundamental requirement that is commonly met using SSSD with a provider such as LDAP, FreeIPA, or ADDS, or, if no IdP is available, by managing local accounts across all nodes. SSSD 2.11.0 introduced a new generic "idp" provider that can integrate Linux systems with Microsoft Entra ID via OAuth2/OpenID Connect. This means we can now define a domain in sssd.conf with id_provider = idp and idp_type = entra_id, along with Entra tenant and app credentials. With SSSD configured and running, getent can resolve Entra users and groups via Entra ID, fetching each Entra user's POSIX info consistently across the cluster. As this capability is very new (it is being included in the Fedora 43 pre-release), this blog covers the steps required to implement it on Azure Linux 3.0 for those who would like to explore it on their own VMs and clusters.

Implementation

1. Build RPMs

As we are deploying on Azure Linux 3.0 and RPMs are not available in packages.microsoft.com (PMC), we must download release package 2.11.0 from Releases · SSSD/sssd and follow the guidance from Building SSSD - sssd.io. A virtual machine running the Azure Linux 3.0 HPC edition, which provides many of the required build tools (and is our target operating system), was used. A number of dependencies must still be installed to perform the make, but these are all available from PMC, and the make runs without issue.

# Install dependencies
sudo tdnf -y install \
  c-ares-devel \
  cifs-utils-devel \
  curl-devel \
  cyrus-sasl-devel \
  dbus-devel \
  jansson-devel \
  krb5-devel \
  libcap-devel \
  libdhash-devel \
  libldb-devel \
  libini_config-devel \
  libjose-devel \
  libnfsidmap-devel \
  libsemanage-devel \
  libsmbclient-devel \
  libtalloc-devel \
  libtdb-devel \
  libtevent-devel \
  libunistring-devel \
  libwbclient-devel \
  p11-kit-devel \
  samba-devel \
  samba-winbind

sudo ln -s /etc/alternatives/libwbclient.so-64 /usr/lib/libwbclient.so.0

# Build SSSD from source
wget https://github.com/SSSD/sssd/releases/download/2.11.0/sssd-2.11.0.tar.gz
tar -xvf sssd-2.11.0.tar.gz
cd sssd-2.11.0
autoreconf -if
./configure --enable-nsslibdir=/lib64 --enable-pammoddir=/lib64/security --enable-silent-rules --with-smb-idmap-interface-version=6
make
# Success!!

Building the RPMs is more complex, as there are many more dependencies, some of which are not available on PMC, and we are also reusing the generic sssd.spec file. However, this can be done to create a working set of the required SSSD RPMs.
First install the dependencies available from PMC:

# Add dependencies for rpmbuild
sudo tdnf -y install \
  doxygen \
  libcmocka-devel \
  nss_wrapper \
  pam_wrapper \
  po4a \
  shadow-utils-subid-devel \
  softhsm \
  systemtap-sdt-devel \
  uid_wrapper

The remaining four dependencies are sourced from Fedora 42 builds and may be installed using tdnf:

# gdm-pam-extensions-devel
wget https://kojipkgs.fedoraproject.org//packages/gdm/48.0/3.fc42/x86_64/gdm-pam-extensions-devel-48.0-3.fc42.x86_64.rpm
sudo tdnf install ./gdm-pam-extensions-devel-48.0-3.fc42.x86_64.rpm

# libfido2-devel
wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/l/libcbor-0.7.0-6.el8.x86_64.rpm
wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/l/libfido2-1.11.0-2.el8.x86_64.rpm
wget https://dl.fedoraproject.org/pub/epel/8/Everything/x86_64/Packages/l/libfido2-devel-1.11.0-2.el8.x86_64.rpm
sudo tdnf install ./libcbor-0.7.0-6.el8.x86_64.rpm --nogpgcheck
sudo tdnf install ./libfido2-1.11.0-2.el8.x86_64.rpm --nogpgcheck
sudo tdnf install ./libfido2-devel-1.11.0-2.el8.x86_64.rpm --nogpgcheck

The sudo make rpms step can now be initiated. It will fail, but it establishes much of what we need for a successful rpmbuild using the following steps:

# rpmbuild
sudo make rpms
# will error with: File /rpmbuild/SOURCES/sssd-2.11.0.tar.gz: No such file or directory
sudo cp ../sssd-2.11.0.tar.gz /rpmbuild/SOURCES/
cd /rpmbuild
sudo vi SPECS/sssd.spec
# edit build_passkey 1 in SPECS/sssd.spec to 0 to skip passkey support
sudo rpmbuild --define "_topdir /rpmbuild" -ba SPECS/sssd.spec

And we have RPMs!

# RPMS!!!
libipa_hbac-2.11.0-0.azl3.x86_64.rpm
libipa_hbac-devel-2.11.0-0.azl3.x86_64.rpm
libsss_autofs-2.11.0-0.azl3.x86_64.rpm
libsss_certmap-2.11.0-0.azl3.x86_64.rpm
libsss_certmap-devel-2.11.0-0.azl3.x86_64.rpm
libsss_idmap-2.11.0-0.azl3.x86_64.rpm
libsss_idmap-devel-2.11.0-0.azl3.x86_64.rpm
libsss_nss_idmap-2.11.0-0.azl3.x86_64.rpm
libsss_nss_idmap-devel-2.11.0-0.azl3.x86_64.rpm
libsss_sudo-2.11.0-0.azl3.x86_64.rpm
python3-libipa_hbac-2.11.0-0.azl3.x86_64.rpm
python3-libsss_nss_idmap-2.11.0-0.azl3.x86_64.rpm
python3-sss-2.11.0-0.azl3.x86_64.rpm
python3-sss-murmur-2.11.0-0.azl3.x86_64.rpm
python3-sssdconfig-2.11.0-0.azl3.noarch.rpm
sssd-2.11.0-0.azl3.x86_64.rpm
sssd-ad-2.11.0-0.azl3.x86_64.rpm
sssd-client-2.11.0-0.azl3.x86_64.rpm
sssd-common-2.11.0-0.azl3.x86_64.rpm
sssd-common-pac-2.11.0-0.azl3.x86_64.rpm
sssd-dbus-2.11.0-0.azl3.x86_64.rpm
sssd-debuginfo-2.11.0-0.azl3.x86_64.rpm
sssd-idp-2.11.0-0.azl3.x86_64.rpm
sssd-ipa-2.11.0-0.azl3.x86_64.rpm
sssd-kcm-2.11.0-0.azl3.x86_64.rpm
sssd-krb5-2.11.0-0.azl3.x86_64.rpm
sssd-krb5-common-2.11.0-0.azl3.x86_64.rpm
sssd-ldap-2.11.0-0.azl3.x86_64.rpm
sssd-nfs-idmap-2.11.0-0.azl3.x86_64.rpm
sssd-proxy-2.11.0-0.azl3.x86_64.rpm
sssd-tools-2.11.0-0.azl3.x86_64.rpm
sssd-winbind-idmap-2.11.0-0.azl3.x86_64.rpm

2. Deploy RPMs

With the RPMs created, we can now move on to installing them on our cluster. In my case I am using a customised image with other tunings and packages, so these can be included in my Ansible playbook and an updated image produced.
The following details the RPMs (a subset of the 30 or so created) installed into the image:

# Pre install sssd rpms
- name: Copy sssd rpms onto host
  ansible.builtin.copy:
    src: sssd-2.11.0/
    dest: /tmp/sssd/

- name: Install sssd rpms
  ansible.builtin.shell: |
    tdnf -y install /tmp/sssd/libsss_certmap-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/libsss_certmap-devel-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/libsss_idmap-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/libsss_nss_idmap-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-client-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/libsss_sudo-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-nfs-idmap-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-common-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-common-pac-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-idp-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-krb5-common-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-ad-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/libipa_hbac-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-ipa-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-krb5-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-ldap-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-proxy-2.11.0-0.azl3.x86_64.rpm
    tdnf -y install /tmp/sssd/sssd-2.11.0-0.azl3.x86_64.rpm

3. Create an App Registration

For the SSSD “idp” provider to be able to read Entra ID user and group attributes, we must create an app registration with a client secret in our Entra tenant. The application will require Microsoft Graph API permissions sufficient to read user and group attributes.

Additionally, the application must be assigned the Directory Readers role over the directory. This can be done through the Graph API using the following template:

POST https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignments
{
  "principalId": "<ObjectId of your SPN>",
  "roleDefinitionId": "<RoleDefinitionId for Directory Readers>",
  "directoryScopeId": "/"
}

Note the Application (client) ID and its secret, as these will be required for the SSSD configuration.
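If you prefer to script the role assignment rather than drive Graph by hand, the same request can be issued with the Azure CLI. A minimal sketch, assuming you are signed in with an account permitted to assign directory roles; the placeholders mirror the template above and must be replaced with your own values:

# Look up the role definition id for Directory Readers (returned in the response body)
az rest --method get \
  --url "https://graph.microsoft.com/v1.0/roleManagement/directory/roleDefinitions?\$filter=displayName eq 'Directory Readers'"

# Assign the role to the app's service principal object id at tenant scope
az rest --method post \
  --url "https://graph.microsoft.com/v1.0/roleManagement/directory/roleAssignments" \
  --headers "Content-Type=application/json" \
  --body '{
    "principalId": "<ObjectId of your SPN>",
    "roleDefinitionId": "<RoleDefinitionId for Directory Readers>",
    "directoryScopeId": "/"
  }'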
4. Configure SSSD & NSSWITCH

For these I have used cloud-init to add sssd.conf and amend nsswitch.conf during deployment across the Slurm controllers, login nodes and compute nodes. The SSSD service is also enabled and started. The resulting files should look like the following, customized to your own domain, application ID and secret.

/etc/sssd/sssd.conf

[sssd]
config_file_version = 2
services = nss, pam
domains = mydomain.onmicrosoft.com

[domain/mydomain.onmicrosoft.com]
id_provider = idp
idp_type = entra_id
idp_client_id = ########-####-####-####-############
idp_client_secret = ########################################
idp_token_endpoint = https://login.microsoftonline.com/937d5829-df9d-46b6-ad5a-718ebc33371e/oauth2/v2.0/token
idp_userinfo_endpoint = https://graph.microsoft.com/v1.0/me
idp_device_auth_endpoint = https://login.microsoftonline.com/937d5829-df9d-46b6-ad5a-718ebc33371e/oauth2/v2.0/devicecode
idp_id_scope = https%3A%2F%2Fgraph.microsoft.com%2F.default
idp_auth_scope = openid profile email
auto_private_groups = true
use_fully_qualified_names = false
cache_credentials = true
entry_cache_timeout = 5400
entry_cache_nowait_percentage = 50
refresh_expired_interval = 4050
enumerate = false
debug_level = 2

[nss]
debug_level = 2
default_shell = /bin/bash
fallback_homedir = /shared/home/%u

[pam]
debug_level = 2

/etc/nsswitch.conf

# Begin /etc/nsswitch.conf
passwd: files sss
group: files sss
shadow: files sss
hosts: files dns
networks: files
protocols: files
services: files
ethers: files
rpc: files
# End /etc/nsswitch.conf

5. Create User home directories

The use of device auth for Entra users over SSH is not currently supported, so for now my Entra users will authenticate using SSH public key authentication. For that to work, their $HOME directories must be pre-created and their public keys added to .ssh/authorized_keys. This is simplified by having SSSD in place, as we can use getent passwd to look up a user's $HOME and set directory and file permissions using the usual chown command.

The following example script creates the user's home directory, adds their public key, and creates a keypair for internal use across the cluster:

#!/bin/bash
# Script to create a user home directory and populate it with a given SSH public key.
# Must be executed as root or via sudo.

USER_NAME=$1
USER_PUBKEY=$2

if [ -z "${USER_NAME}" ] || [ -z "${USER_PUBKEY}" ]; then
    echo "Usage: $0 <username> <ssh-public-key>"
    exit 1
fi

entry=$(getent passwd "${USER_NAME}")
export USER_UID=$(echo "$entry" | awk -F: '{print $3}')
export USER_HOME=$(echo "$entry" | awk -F: '{print $6}')

# If the directory exists, we're good
if [ -d "${USER_HOME}" ]; then
    echo "Directory ${USER_HOME} exists, do not modify."
else
    mkdir -p "${USER_HOME}"
    chown $USER_UID:$USER_UID $USER_HOME
    chmod 700 $USER_HOME
    cp -r /etc/skel/. $USER_HOME
    mkdir -p $USER_HOME/.ssh
    chmod 700 $USER_HOME/.ssh
    touch $USER_HOME/.ssh/authorized_keys
    chmod 644 $USER_HOME/.ssh/authorized_keys
    echo "${USER_PUBKEY}" >> $USER_HOME/.ssh/authorized_keys
    {
        echo "# Automatically generated - StrictHostKeyChecking is disabled to allow for passwordless SSH between Azure nodes"
        echo "Host *"
        echo "  StrictHostKeyChecking no"
    } >> "$USER_HOME/.ssh/config"
    chmod 644 "$USER_HOME/.ssh/config"
    chown -R $USER_UID:$USER_UID $USER_HOME
    sudo -u $USER_NAME ssh-keygen -f $USER_HOME/.ssh/id_ed25519 -N "" -q
    cat $USER_HOME/.ssh/id_ed25519.pub >> $USER_HOME/.ssh/authorized_keys
fi
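As a usage illustration (the script filename and key material below are hypothetical), creating a home directory for an Entra user on the shared filesystem might look like this:

# Hypothetical example - save the script above as e.g. create_entra_home.sh
sudo ./create_entra_home.sh john.doe "ssh-ed25519 AAAAC3NzaC1lZDI1NTE5AAAA...users-public-key"

# Confirm the result matches what SSSD resolves for the user
getent passwd john.doe
sudo ls -ld /shared/home/john.doe /shared/home/john.doe/.ssh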
6. Run jobs as Entra user

Logged in as an Entra user:

john.doe@tst4-login-0 [ ~ ]$ id
uid=1137116670(john.doe) gid=1137116670(john.doe) groups=1137116670(john.doe)
john.doe@tst4-login-0 [ ~ ]$ getent passwd john.doe
john.doe:*:1137116670:1137116670::/shared/home/john.doe:/bin/bash

And running an MPI job:

john.doe@tst4-login-0 [ ~ ]$ sbatch -p hbv4 /cvmfs/az.pe/1.2.6/tests/imb/imb-env-intel-oneapi.sh
Submitted batch job 330
john.doe@tst4-login-0 [ ~ ]$ squeue
  JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
    330      hbv4 imb-env- john.doe  R  0:06     2 tst4-hbv4-[114-115]
john.doe@tst4-login-0 [ ~ ]$ cat slurm-imb-env-intel-oneapi-330.out
Testing IMB using Spack environment intel-oneapi ...
Setting up Azure PE version 1.2.6 for azurelinux3.0 on x86_64
Testing IMB using srun...
#----------------------------------------------------------------
#    Intel(R) MPI Benchmarks 2021.7, MPI-1 part
#----------------------------------------------------------------
# Date                  : Mon Sep 29 15:24:24 2025
# Machine               : x86_64
# System                : Linux
# Release               : 6.6.96.1-1.azl3
# Version               : #1 SMP PREEMPT_DYNAMIC Tue Jul 29 02:44:24 UTC 2025
# MPI Version           : 4.1
# MPI Thread Environment:

Summary

So, it is early days and it requires a little preparation, but hopefully this demonstrates that, using the new SSSD “idp” provider, we can finally use Entra ID as the source of user identities on our HPC clusters.

Explore HPC & AI Innovation: Microsoft + AMD at HPC Roundtable 2025
The HPC Roundtable 2025 in Turin brings together industry leaders, engineers, and technologists to explore the future of high-performance computing (HPC) and artificial intelligence (AI) infrastructure. Hosted by DoITNow, the event features Microsoft and AMD as key participants, with sessions highlighting real-world innovations such as Polestar’s adoption of Microsoft Azure HPC for Computer-Aided Engineering (CAE). Attendees will gain insights into cloud-native HPC, hybrid compute environments, and the convergence of simulation and machine learning. The roundtable offers networking opportunities and strategic discussions, and showcases how Microsoft Azure and AMD are accelerating engineering innovation and intelligent workloads in automotive and other industries.

CycleCloud + Hammerspace
Abstract

The theme of this blog is “Simplicity”. Today’s HPC user has an overabundance of choices when it comes to HPC schedulers, clouds, infrastructure in those clouds, and data management solutions. Let's simplify it! Using CycleCloud as the nucleus, my intent is to show how simple it is to deploy a Slurm cluster on the Hammerspace data platform while using a standard NFS protocol. And for good measure, we will use a new feature in CycleCloud called Scheduled Events, which will automatically unmount the NFS share when the VMs are shut down.

CycleCloud and Slurm

Azure CycleCloud Workspace for Slurm is an Azure Marketplace solution template that delivers a fully managed Slurm workload environment on Azure, without requiring manually configured infrastructure or Slurm settings. To get started, go to the Azure Marketplace and search for “Azure CycleCloud for Slurm”.

I have not provided a detailed breakdown of the steps for Azure CycleCloud Workspace for Slurm, as Kiran Buchetti does an excellent job of that in the blog here. It is a worthwhile read, so please take a minute to review it.

Getting back to the theme of this blog, the simplicity of Azure CycleCloud Workspace for Slurm is one of its most important value propositions. Here are my top reasons why:

CycleCloud Workspace for Slurm is a single template for entire cluster creation. Without it, a user would have to manually install CycleCloud, install Slurm, configure the compute partitions, attach storage, and so on. Instead, you fill out a Marketplace template and a working cluster is live in 15-20 minutes.

Preconfigured best practices: prebuilt Slurm nodes, partitions, network and security rules are set up for the end user. No deep knowledge of HPC or Slurm is required!

Automatic cost control: Workspace for Slurm is designed to deploy compute only when a job is submitted, and the solution will shut nodes down automatically after a job is complete. Moreover, Workspace for Slurm comes with preconfigured partitions (a GPU partition, an HTC spot partition), so end users can submit jobs to the right partition based on performance and budget.

Now that we have a cluster built, let's turn our attention to data management. I have chosen to highlight the Hammerspace Data Platform in this blog. Why? Because it is a powerful solution that provides high performance and global access to CycleCloud HPC/AI nodes. Sticking true to our theme, it is also incredibly simple to integrate with CycleCloud.

Who is Hammerspace?

Before discussing integration, let's take a minute to introduce Hammerspace. Hammerspace is a software-defined data orchestration platform that provides a global file system across on-premises infrastructure and public clouds. It enables users and applications to access and manage unstructured data anywhere, at any time, without the need to copy, migrate, or manually manage data. Hammerspace's core philosophy is that “data should follow the user, not the other way around”. There is great information on Hammerspace at the following link: Hammerspace Whitepapers

Linux Native

Hammerspace's foundation as a data platform is built natively into the Linux kernel, requiring no additional software installation on any nodes. The company's goal is to deliver a high-performance, plug-and-play model, using standard NFS protocols (v3, v4, pNFS), that makes high-performance, scalable file access familiar to any Linux system administrator.
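Because the data path is standard NFS, mounting a Hammerspace share by hand from any Linux VM looks like any other NFS mount. A minimal sketch, where the Anvil address (10.0.0.4), export name and mount options are purely illustrative assumptions rather than Hammerspace-recommended values:

# Illustrative only - substitute your own Anvil metadata server address, export and options
sudo mkdir -p /data
sudo mount -t nfs4 -o vers=4.2,nconnect=8 10.0.0.4:/data /data
df -h /data    # confirm the share is mounted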
Let's break down why the native kernel approach is important to a CycleCloud Workspace for Slurm user:

POSIX-compliant, high-performance file access with no code changes required.

No agents needed on the hosts and no additional CycleCloud templates needed. From a CycleCloud perspective, Hammerspace is simply an “external NFS”.

No re-staging of jobs required. It's NFS: all the compute nodes can access the same data, regardless of where it resides. The days of copying and moving data between compute nodes are over.

Seamless mounting. Native NFS mounts can be added easily in CycleCloud, and files are instantly available for Slurm jobs with no unnecessary job prep time. We will take a deeper dive into this topic in the next section.

How to export NFS

Native NFS mounts can be added easily to CycleCloud, as in the example below. NFS mounts can be entered on the Marketplace template or, alternatively, via the scheduler. For Hammerspace, click on External NFS, put in the IP of the Hammerspace Anvil metadata server, add your mount options, and that's it. The example below uses NFS mounts for /sched and /data.

Once the nodes are provisioned, log into any of them and the shares will be mounted. On the Hammerspace user interface, we see the /sched share deployed along with the relevant IOPS, growth, and file metrics.

That's it. That's all it takes to mount a powerful parallel file system to CycleCloud. Now let's look at the benefits of a Hammerspace/CycleCloud implementation:

Simplified data management: CycleCloud orchestrates HPC infrastructure on demand, and Hammerspace ensures that the data is immediately available whenever the compute comes up. Hammerspace will also place data in the right location or tier based on its policy-driven management, which reduces the need for manual scripting to put data on lower-cost tiers of storage.

No application refactoring: applications do not need additional agents, nor do they have to change, to benefit from using a global access system like Hammerspace.

CycleCloud Scheduled Events

The last piece of the story is the shutdown/termination process. The HPC jobs are complete; now it is time to shut down the nodes and save costs. What happens to the NFS mounts on each node? Prior to CycleCloud 8.2.2, if nodes were not unmounted properly, NFS mounts could hang indefinitely waiting for I/O. Users can now take advantage of Scheduled Events in CycleCloud, a feature that lets you put a script on your HPC nodes to be executed automatically when a supported event occurs. In our case, the supported event is node termination. The following is taken straight from the CycleCloud main page here:

CycleCloud supports enabling Terminate Notification on scaleset VMs (e.g., execute nodes). To do this, set EnableTerminateNotification to true on the nodearray. This will enable it for scalesets created for this nodearray. To override the timeout allowed, you can set TerminateNotificationTimeout to a new time. These settings go on the nodearray in the cluster template.

The script to unmount an NFS share during a terminate event is not trivial: add it to your project's project.spec and attach it to the shutdown task. Simple! Now a user can run a job and terminate the nodes after job completion without worrying about what that does to the backend storage. No more cleanup! This means cost savings, operational efficiency, and resource cleanliness (no more stale Azure resources like IPs, NICs, and disks cluttering up a subscription).
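The cluster-template snippet and the unmount script themselves are not reproduced here, so the sketch below is an illustration only, not the author's script: a terminate-notification hook that lazily unmounts the shares used earlier. The mount points and filename are assumptions, and the nodearray still needs EnableTerminateNotification set to true as described above.

#!/bin/bash
# onTerminate.sh (hypothetical name) - Scheduled Events shutdown hook
# Lazily unmount the NFS shares so the node does not hang on in-flight I/O during termination.
for mp in /sched /data; do
    if mountpoint -q "${mp}"; then
        umount -l "${mp}"    # lazy unmount: detach now, finish cleanup when outstanding I/O completes
    fi
done

In a CycleCloud project, a script along these lines would be added to the project's project.spec and attached to the shutdown task, as the post describes.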
Conclusion

Azure CycleCloud, along with Slurm and the Hammerspace Data Platform, provides a powerful, scalable and cost-efficient solution for HPC in the cloud. CycleCloud automates the provisioning (and the elastic scaling up and down) of the infrastructure, Slurm manages job scheduling, and Hammerspace delivers a global data environment with high-performance parallel NFS.

Ultimately, the most important element of the solution is its simplicity. Hammerspace enables HPC organizations to focus on solving core problems rather than the headache of managing infrastructure, setup, and unpredictable storage mounts. By reducing the administrative overhead needed to run HPC environments, the solution described in this blog will help organizations accelerate time to results, lower costs, and drive innovation across all industries.