Azure High Performance Computing (HPC) Blog articles

Simplify troubleshooting at scale - Centralized Log Management for CycleCloud Workspace for Slurm

jesselopez — Tue, 31 Mar 2026 21:17:04 GMT

Training large AI models on hundreds or thousands of nodes introduces a critical operational challenge: when a distributed job fails, identifying the root cause across scattered logs can be extremely time-consuming. This manual process delays recovery and reduces cluster utilization. Being able to query centralized cluster logs from a single interface is critical to ensuring that the root causes of job failures are identified and mitigated swiftly, which keeps cluster utilization high.

Solution Architecture

This is a turnkey, customizable log forwarding solution for CycleCloud Workspace for Slurm that centralizes all cluster logs in Azure Monitor Log Analytics. The architecture uses the Azure Monitor Agent (AMA), deployed on every VM and Virtual Machine Scale Set (VMSS), to stream logs defined by Data Collection Rules (DCRs) to dedicated tables in a Log Analytics workspace, where they can be queried from a single interface.

The turnkey solution captures three categories of logs essential for troubleshooting distributed workloads, but can be extended for any other logs:

Slurm logs, including slurmctld and slurmd, plus archived job artifacts (job submission scripts, environment variables, stdout/stderr) collected via prolog/epilog scripts.

Infrastructure logs from CycleCloud, including the CycleCloud Healthagent, which automatically tests nodes for hardware health and drains nodes that fail the tests.

Operating system logs from syslog and dmesg, capturing kernel events, network state changes, and hardware issues.

Each log source flows through its own DCR into a dedicated table following a consistent schema. The solution automatically associates scheduler-specific DCRs with the Slurm scheduler node and compute-specific DCRs with compute nodes, handling dynamic node scaling transparently.

The solution is purpose-built for CycleCloud Workspace for Slurm, but its modular design makes it easy to extend with new data sources (e.g. new log formats) and processing (e.g. additional Data Collection Rules) to support forwarding and analysis of other required logs.

Key Benefits

Time-series correlation: Azure Monitor's time-based indexing enables rapid identification of cascading failures. For example, trace a network carrier flap detected in syslog to the corresponding slurmd communication errors and on to specific job failures, all within seconds.

Centralized visibility: Query logs from thousands of nodes through a single interface instead of SSH-ing to individual machines. Correlate Slurm controller decisions with node-level errors and system events in one query.

Log persistence: Logs survive node deallocations and reimaging. Critical in cloud environments where compute nodes are ephemeral.

Powerful query language: KQL (Kusto Query Language) allows parsing raw logs into structured fields, filtering across multiple sources, and building operational dashboards. Example queries detect patterns like repeated job failures, network instability, or resource exhaustion.

Production-ready scalability: User-assigned managed identities automatically propagate to new VMSS instances, and DCR associations handle thousands of nodes without manual configuration.

Getting Started

The complete solution is available on GitHub (slurm-log-collection) with deployment scripts that:

Create all required Log Analytics tables

Deploy pre-configured DCRs for Slurm, CycleCloud, and OS logs

Automatically associate DCRs with scheduler and compute resources

After you configure the environment variables and run the setup scripts, logs begin flowing to Azure Monitor. Tables can take up to 15 minutes to populate initially; after that, normal log ingestion latency is roughly 30 seconds to 3 minutes. The repository includes sample KQL queries for common troubleshooting scenarios to accelerate time-to-resolution, as well as for non-troubleshooting analysis of cluster usage.
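As a rough illustration of the kind of query this enables, the sketch below builds a small KQL query and runs it through the Azure CLI. The table name `slurmd_raw_CL`, the `RawData` and `Computer` columns, and both function names are assumptions made for illustration and are not taken from the repository; `az monitor log-analytics query` may also require the `log-analytics` CLI extension.

```shell
#!/usr/bin/env bash
# Sketch only: table and column names below are assumptions, not the
# schema shipped with the solution.

# Print a KQL query that counts lines containing "error" per node,
# in 5-minute bins, over the given lookback window (e.g. 1h, 30m).
build_error_query() {
  local table="$1" lookback="$2"
  cat <<EOF
${table}
| where TimeGenerated > ago(${lookback})
| where RawData has "error"
| summarize errors = count() by Computer, bin(TimeGenerated, 5m)
| order by errors desc
EOF
}

# Run the query against a Log Analytics workspace (needs `az login`).
# $1 is the workspace GUID (customer ID), not the resource name.
run_error_query() {
  local workspace_id="$1"
  az monitor log-analytics query \
    --workspace "$workspace_id" \
    --analytics-query "$(build_error_query slurmd_raw_CL 1h)" \
    --output table
}
```

Keeping the query text in its own function makes it easy to paste the same KQL into the Log Analytics query editor while iterating on it.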

Azure NCv6 Virtual Machines: Enhancements and GA Transition

Fernando_Aznar — Wed, 18 Mar 2026 20:08:51 GMT

NCv6 Virtual Machines are Azure's flexible, next generation platform enabling both leading-edge graphics and generative AI compute workloads. Featuring NVIDIA RTX PRO 6000 Blackwell Server Edition (BSE) GPUs, Intel Xeon™ 6 "Granite Rapids" 6900P series CPUs, and a suite of Microsoft Azure technologies, NCv6 VMs are available now in Preview.

Today, we are pleased to share a series of exciting updates coming soon to Azure NCv6 that will:

Enhance VM performance and capabilities
Provide more VM sizes for customers to "right size" their usage
Bring NCv6 to production readiness with a transition to General Availability, and
Expand accessibility across the global Azure cloud

New VM Sizes, Features, and Performance Enhancements

In the coming weeks, Azure will debut seven new NCv6-series VM sizes and two different sub-families for customers to choose from. The standout features introduced with the new VM sizes include:

🧩

Fractional GPU support, enabling graphics workload customers to deploy VMs with as little as 1/2 or 1/4 of an RTX PRO™ 6000. VMs with fractional GPU support also feature reduced vCPU, memory, SSD, and networking to help customers optimize costs and right-size their VMs for their workloads.

⚡

Increased vCPU per VM size (e.g. 288 vCPU instead of 256) to provide more performance for high-end VDI workstations and better align with the Intel Xeon 6900P's triple compute tile architecture.

🛠️

General Purpose and Compute Optimized VM sizes. The former provides larger amounts of CPU memory for demanding generative AI inference and ISV CAD/CAE simulations, while the latter offers reduced memory to enable customers with less memory intensive workloads to cost optimize their deployments.

The new VM sizes will replace the existing three VM sizes offered in Preview, and be available as follows:

NCv6 - General Purpose VM sizes:

Size Name	vCPUs	Memory (GB)	Networking (Mb/s)	GPUs	GPU Mem (GB)	Temp Disk	NVMe Disk
Standard_NC36ds_xl_RTXPro6000_v6	36	132	22500	1/4	24	256	1600
Standard_NC72ds_xl_RTXPro6000_v6	72	264	45000	1/2	48	512	3200
Standard_NC132ds_xl_RTXPro6000_v6	132	516	90000	1	96	1024	6400
Standard_NC144ds_xl_RTXPro6000_v6	144	516	90000	1	96	1024	6400
Standard_NC264ds_xl_RTXPro6000_v6	264	1032	180000	2	192	2048	12800
Standard_NC288ds_xl_RTXPro6000_v6	288	1032	180000	2	192	2048	12800
Standard_NC324ds_xl_RTXPro6000_v6	324	1284	180000	2	192	2048	12800

NCv6 - Compute Optimized VM sizes:

Size Name	vCPUs	Memory (GB)	Networking (Mbps)	GPUs	GPU Mem (GB)	Temp Disk	NVMe Disk
Standard_NC24lds_xl_RTXPro6000_v6	24	72	22500	1/4	24	256	1600
Standard_NC36lds_xl_RTXPro6000_v6	36	72	22500	1/4	24	256	1600
Standard_NC72lds_xl_RTXPro6000_v6	72	132	45000	1/2	48	512	3200
Standard_NC132lds_xl_RTXPro6000_v6	132	264	90000	1	96	1024	6400
Standard_NC144lds_xl_RTXPro6000_v6	144	264	90000	1	96	1024	6400
Standard_NC264lds_xl_RTXPro6000_v6	264	516	180000	2	192	2048	12800
Standard_NC288lds_xl_RTXPro6000_v6	288	516	180000	2	192	2048	12800
Standard_NC324lds_xl_RTXPro6000_v6	324	648	180000	2	192	2048	12800

Note that, until the new VM sizes are available, Microsoft Learn resources will continue to reflect the currently offered VM sizes and technical specifications.

Transition to General Availability

In the coming weeks, Azure will transition NCv6-series from Preview to General Availability (GA) status. With this transition, NCv6 VMs will be covered by the Azure Service Level Agreement (SLA) and thus ready to support production-grade deployments by customers, partners, and service providers.

When the transition to GA occurs, NCv6 VMs will be available in the Azure West US 2 and Southeast Asia regions. Information on the timing of additional regions is provided below.

Regional Expansion Across the Azure Cloud

At the beginning of Preview, NCv6 VMs debuted in the West US 2 region. Since then, we have also added NCv6 VMs to the Southeast Asia region. Both regions will be part of the transition to GA status.

We are pleased to share that in the coming months, covering Q3 of 2026, NCv6 VMs will also become available in the following Azure regions:

• East US

• West Europe

• East US 2

• North Europe

• South Central US

• Germany West Central

• West US

• Korea Central

Ready to build for the future with Azure NCv6?

NCv6 Virtual Machines are available now in Preview. Start your production-grade AI journey today and explore the next frontier of Azure AI infrastructure.

Join the Preview

AI Inferencing in Air-Gapped Environments

damocelj — Mon, 09 Mar 2026 09:18:31 GMT

If you had to name the top trends in IT these days, two strong candidates would be Generative AI and Cybersecurity. On the latter in particular, the sophistication, reach, and volume of cyberattacks have increased significantly in recent years, with added ingredients such as advanced persistent threats, state actors, and “crime-as-a-service” providers.

Interestingly enough, both trends go hand in hand: Artificial Intelligence extracts value from your data, and cyber criminals are exactly after the same thing: your data. It is not surprising that organizations have taken steps to protect themselves against data theft or data exfiltration, as it is often described.

In this post we will explore how to deploy a Hugging Face-hosted model and an NVIDIA NIM™ microservice (a prebuilt, optimized inference container for rapidly deploying the latest AI models) in a Kubernetes cluster while protecting your infrastructure against data theft. You can find more information about NVIDIA NIM here: https://developer.nvidia.com/nim.

We will outline the process for deploying a Kubernetes cluster in Azure with the highest level of network security to prevent data exfiltration, and we will also demonstrate the deployment of container images and required model parameters for both options.

Why Kubernetes clusters

Unless you have been living under a rock, you are probably aware that Kubernetes has taken the IT world by storm, and the AI ecosystem is no exception. Kubernetes makes it extremely easy to package and deploy applications on any infrastructure, and hence it has become one of the most popular platforms for running AI workloads, especially AI inferencing.

Azure Kubernetes Service (AKS) is an Azure service that makes it easy to run Kubernetes clusters in Azure. Over time, AKS has introduced multiple deployment options to meet increasingly stringent requirements, particularly around security. One such option is the private cluster, where no public IP addresses are assigned to the Kubernetes control plane or nodes.

To understand this evolution, let’s have a look at what a “public” AKS cluster looks like:

Figure 1- public AKS API enabled cluster

As the previous figure shows, there are multiple traffic flows that go over the public Internet:

In the bootstrap phase, the nodes get images from Microsoft Container Registry, as well as potentially from other repositories such as Ubuntu.
The Kubernetes administrator operates the cluster accessing the Kubernetes API provided by Microsoft with a public IP address.
When pulling container images, cluster nodes can get them from publicly available repositories such as Docker Hub or Azure Container Registry (if configured to be publicly accessible).
Lastly, administrators are allowed to expose applications that run in the cluster via public IP addresses, so that users will access them over the Internet too.

The first evolution of this concept towards a more restrictive environment was a commonly used pattern consisting of a combination of private clusters (https://learn.microsoft.com/azure/aks/private-clusters) and Azure Firewall to limit egress traffic (https://learn.microsoft.com/azure/aks/limit-egress-traffic) and prevent data exfiltration.

In this model, there are no longer any inbound connections to the cluster:

Figure 2- AKS private cluster

The AKS API control plane is fully integrated in the virtual network.
Azure Container Registry and other Azure services such as Azure Storage or Azure Key Vault are also integrated with the virtual network through the Private Link technology (https://aka.ms/privatelink).

However, there are still outbound flows from the cluster nodes to the Internet, for example during the cluster creation process or the deployment of images stored in public repositories, which need to be explicitly allowed by the egress firewall.

Air-gapped clusters

It can be argued that using private clusters only provides security up to the robustness of your firewall ruleset: it essentially acts as a fail-open mechanism. If there’s a misconfiguration in the firewall rules, you may unintentionally allow data exfiltration or theft.

To address this, AKS offers an even greater degree of isolation with network-isolated clusters (https://learn.microsoft.com/azure/aks/concepts-network-isolated), where all outbound connections are completely blocked without the need for a firewall:

Figure 3- AKS isolated cluster

In this mode, AKS nodes are configured in a way so that no outbound flows to the Internet can exist.

If you are curious about what you need to do to make sure of that in Azure, here is the list:

No public Azure load balancer attached to the AKS nodes.
No NAT gateway attached to the AKS node subnet.
No public IP address attached to the AKS nodes.
The AKS node subnet configured for no default outbound access (https://learn.microsoft.com/azure/virtual-network/ip-services/default-outbound-access).
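The subnet-level items in this list can be spot-checked from the command line. The following is a minimal sketch, assuming the Azure CLI is installed and logged in; the function names are hypothetical, and it covers only the NAT gateway and default outbound access checks, not the load balancer or public IP checks.

```shell
#!/usr/bin/env bash
# Sketch only: resource names are placeholders supplied by the caller.

# Decide whether a subnet looks isolated, given two values read from Azure:
#   $1  the subnet's defaultOutboundAccess property (false when disabled;
#       either casing is accepted, anything else counts as not disabled)
#   $2  the ID of an attached NAT gateway (empty or "None" when absent)
subnet_looks_isolated() {
  local default_outbound="$1" nat_gateway_id="$2"
  case "$default_outbound" in
    false|False) ;;
    *) return 1 ;;
  esac
  [ -z "$nat_gateway_id" ] || [ "$nat_gateway_id" = "None" ]
}

# Read both values with the Azure CLI and report the result.
check_subnet() {
  local rg="$1" vnet="$2" subnet="$3" outbound nat
  outbound=$(az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" \
    --query defaultOutboundAccess -o tsv)
  nat=$(az network vnet subnet show -g "$rg" --vnet-name "$vnet" -n "$subnet" \
    --query natGateway.id -o tsv)
  if subnet_looks_isolated "$outbound" "$nat"; then
    echo "subnet $subnet: no default outbound access, no NAT gateway"
  else
    echo "subnet $subnet: outbound path may exist (defaultOutboundAccess=$outbound, natGateway=$nat)"
  fi
}
```

A call such as `check_subnet <resource-group> <vnet-name> <subnet-name>` prints one line per subnet. Treating anything other than an explicit `false` as "not isolated" errs on the side of caution.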

An important consideration is understanding how AKS nodes receive updates or how images are retrieved from public repositories (e.g. docker.io or nvcr.io). This is achieved through an Azure Container Registry feature known as “artifact cache”: https://learn.microsoft.com/azure/container-registry/artifact-cache-overview.
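To make the artifact cache idea concrete, here is a small sketch of creating a cache rule and deriving the image reference that the isolated cluster would pull. The registry name, rule name, and function names are placeholders, and the sketch assumes the rule's target repository mirrors the upstream path and that the registry uses the public-cloud `azurecr.io` suffix.

```shell
#!/usr/bin/env bash
# Sketch only: registry and rule names are placeholders.

# Map an upstream image reference to the reference served by the ACR cache,
# assuming the cache rule's target repository mirrors the upstream path.
#   cached_ref myacr docker.io/library/ubuntu:22.04
#   prints myacr.azurecr.io/library/ubuntu:22.04
cached_ref() {
  local acr_name="$1" upstream="$2"
  # Drop the upstream registry host (everything up to the first "/").
  printf '%s.azurecr.io/%s\n' "$acr_name" "${upstream#*/}"
}

# Create a cache rule for one upstream repository (no tag in $3).
#   create_cache_rule myacr ubuntu-rule docker.io/library/ubuntu
create_cache_rule() {
  local acr_name="$1" rule_name="$2" source_repo="$3"
  az acr cache create \
    --registry "$acr_name" \
    --name "$rule_name" \
    --source-repo "$source_repo" \
    --target-repo "${source_repo#*/}"
}
```

Upstream registries that require authentication would additionally need a credential set attached to the rule, which this sketch leaves out.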

However, a challenge arises when considering Large Language Models (LLMs): LLM container images sourced from Hugging Face or NVIDIA (or any other source) typically include the inference runtime (for example vLLM) but not the model weights. Instead, model artifacts are downloaded dynamically when the container starts.

Consequently, Azure Container Registry cannot cache these assets. The question then becomes: how can these model weights be made available within an air-gapped Kubernetes environment?

The Model Weights Challenge

While (re)loading model weights at container startup is a flexible approach in connected environments, it fails in air-gapped clusters, where outbound network access is blocked.

To address this, we consider two viable strategies:

Constructing a container image that includes all required components and pushing it to the container registry accessible by the isolated cluster.
Pre-downloading model artifacts to a private file share connected to the virtual network of the isolated cluster, and accessing these resources as needed.

Both methods will be demonstrated in detail. Before proceeding, however, we will further outline the example scenario. To provide context aligned with current priorities among our financial clients and organizations operating within regulated sectors, this demonstration focuses on the process of model deployment and the configuration of an isolated cluster for LLM inferencing.

For inferencing we use Llama-3.1-8B-Instruct-FP8 served by vLLM, a high-performance inference runtime designed specifically for large language models. In simple terms, vLLM is responsible for efficiently loading the model onto the GPU and handling incoming inference requests with very low latency and high throughput. vLLM is typically packaged as a container image, which can be sourced either from Hugging Face or from NVIDIA (in our examples), the latter being highly optimized for NVIDIA GPUs and CUDA®. As described earlier, these images usually contain the inference runtime and dependencies, but not the model weights themselves.

Instead, the model weights and other model-specific artifacts are downloaded dynamically when the container starts, allowing the same container image to be reused across different models and versions while keeping the image size small and deployment flexible. This approach is not suitable in isolated AKS clusters, where network traffic flowing outside of the deployed virtual network is not permitted.

From an architectural perspective, model serving is only one part of the overall inferencing platform, and the design of the underlying GPU infrastructure plays a critical role, especially in isolated AKS clusters. In such environments, challenges are not limited to downloading model weights at container startup; for example, setting up the GPU node pool is another important consideration.

Traditionally, enabling GPUs on AKS requires installing the NVIDIA device plugin for Kubernetes as well as the NVIDIA GPU drivers, most commonly by deploying the NVIDIA GPU Operator, which takes care of both. While the device plugin itself can be installed relatively easily via the artifact cache of an attached container registry, driver installation is more involved, especially in air-gapped or isolated environments: https://learn.microsoft.com/azure/aks/use-nvidia-gpu. NVIDIA also provides detailed guidance on how to deploy the GPU Operator in such scenarios in their documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-air-gapped.html.

While the required procedures are clearly outlined and well documented, configuring the NVIDIA AKS GPU node pool within an air-gapped, isolated cluster continues to be a complex and time-consuming process.

Microsoft has recently unveiled a preview feature that allows users to create fully managed NVIDIA GPU node pools on AKS. With this option, all necessary NVIDIA components including drivers, device plugins, and other supporting software are pre-installed and maintained by Microsoft throughout their lifecycle.

This functionality is supported on isolated AKS clusters operating with Kubernetes v1.34.0 or later. It substantially decreases operational complexity, streamlining the deployment and maintenance of GPU-based AI inferencing solutions accelerated by NVIDIA in restricted environments: https://learn.microsoft.com/en-us/azure/aks/aks-managed-gpu-nodes?tabs=add-ubuntu-gpu-node-pool

For example, this Azure CLI command would deploy a managed GPU AKS nodepool to an existing AKS cluster:

az aks nodepool add \
  --resource-group MyResourceGroup \
  --cluster-name MyAKSCluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size <GPU_SKU> \
  --node-taints sku=gpu:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3 \
  --tags EnableManagedGPUExperience=true

Keep in mind that if you want complete control over driver versions and related settings, you should follow the guidelines for deploying the NVIDIA GPU Operator in an air-gapped environment. With the managed option, Microsoft takes care of maintaining driver versions for you.

Container and model weight deployment

With the managed GPU node pool set up, we can begin implementing both inferencing scenarios.

Scenario 1: Baking Model Weights into the Container Image

Let’s start with the first one, where the model weights are downloaded from Hugging Face, baked into a container image and then pushed to the attached Container Registry which is reachable from the isolated AKS cluster:

Figure 4- Baking Model Weights into Container Image

The easiest way to achieve this is to trigger the container build directly from the container registry, which will take the local Dockerfile, pull the image and data needed from Hugging Face, and tag and push the baked image to the container registry.

Make sure you have acquired a Hugging Face API key (access token). If you are using a gated model, access must be requested before building the image.

az acr build \
  --registry <ACR_NAME> \
  --image llama3-vllm-fat:8b-instruct \
  --build-arg "HF_TOKEN=$HF_TOKEN" \
  .

Note: When you use the “az acr build” command instead of running docker build yourself, it automatically tags your image and pushes it to the Azure Container Registry.

Once this container image is available in the container registry, we can create a simple pod and an internal load balancer to expose the service endpoint to the user. The detailed instructions and code are available here: https://github.com/mocelj/aks-air-gap-vllm-deployment.

You can test the deployment by querying the external IP of the deployed service and interacting with the endpoint in an OpenAI-compatible way. Because this is an isolated cluster, you need to connect to the cluster’s network via VPN or run the curl command in a pod inside the cluster; see aks_isolated.sh in the mocelj/aks-air-gap-vllm-deployment repository for details on setting up a point-to-site VPN in Azure. Here is how to get the service IP address and query the chat completions API:

# Get the Service IP
svc_ip=$(kubectl get svc vllm-llama3-8b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Query the chat completions endpoint
curl -X POST "http://${svc_ip}:8000/v1/chat/completions" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a polite and respectful chatbot."},
      {"role": "user", "content": "Where should I go for lunch close to the Microsoft office in Pratteln?"}
    ],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 512,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "frequency_penalty": 0.0
  }'

Scenario 2: Using a Shared File System for Model Artifacts

In the second scenario, where we are downloading the model weights and other artifacts used by the NVIDIA NIM to a shared NFS drive, we must follow a slightly different strategy.

Figure 5- Pre-download model weights to a private file share

For simplicity, we have used a virtual machine capable of downloading artifacts from the Internet and reaching the internal container registry as well as the shared NFS volume deployed in a virtual network. In this example, we have created a simple NFS share using Azure Files (see the Azure CLI code that creates the share and the endpoint: https://github.com/mocelj/aks-air-gap-vllm-deployment/blob/main/aks_isolated.sh#L352). For large-scale inferencing scenarios, you might want to consider other storage options to ensure reasonable startup times, given that the weights can be of considerable size.

To facilitate model deployment on NVIDIA A100 GPUs, we have provisioned a jump box equipped with the same GPU type. If you start with a fresh virtual machine, you may need to install the appropriate GPU driver as well as NVIDIA’s container runtime (https://developer.nvidia.com/container-runtime). Alternatively, you can deploy a Linux virtual machine from a Data Science Virtual Machine (DSVM) Linux image, in which case only the container runtime needs to be added.

To download the container image from nvcr.io we can leverage the caching rules in the container registry and pull the image via our connected container registry. Once the container image is pulled locally, we can download the model profile with the appropriate artifacts and copy everything to the shared folder, e.g. by using rsync.

The artifacts can be downloaded by using the Utilities for NVIDIA NIM for LLMs: https://docs.nvidia.com/nim/large-language-models/latest/utilities.html

docker run --rm \
  --runtime=nvidia \
  --gpus all \
  -v "$LOCAL_NIM_CACHE:/opt/nim/.cache" \
  -u "$(id -u)" \
  -e NGC_API_KEY \
  "$TARGET_IMAGE" \
  download-to-cache

Important: The NVIDIA API key must not be included in the AKS deployment manifests; otherwise it would trigger outbound network calls that fail in air-gapped environments. The key is only required on the jump box while downloading the model artifacts.

Since this shared folder is reachable from both the jump box network and the isolated, air-gapped AKS cluster, the only thing left to do is point the NVIDIA NIM container at the model weights in the shared folder, so that it does not try to download them when the pod starts.
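One way to picture this wiring is the sketch below, which prints a minimal Pod manifest mounting the pre-populated share at the cache path used during the download step. It is an illustration under stated assumptions, not the manifest from the repository: the pod name, the function name, the plain `nfs` volume, and the single-GPU request are all hypothetical choices, while the toleration matches the `sku=gpu:NoSchedule` taint from the node pool example and the manifest deliberately contains no NVIDIA API key.

```shell
#!/usr/bin/env bash
# Sketch only: every name, path, and image below is a placeholder.

# Print a minimal Pod manifest that mounts a pre-populated NFS share at the
# NIM cache path used in the download step (/opt/nim/.cache).
#   $1 image reference in the private registry
#   $2 NFS server host name
#   $3 exported path on the NFS server
render_nim_pod() {
  local image="$1" nfs_server="$2" nfs_path="$3"
  cat <<EOF
apiVersion: v1
kind: Pod
metadata:
  name: nim-llm
  labels:
    app: nim-llm
spec:
  tolerations:
    - key: sku
      operator: Equal
      value: gpu
      effect: NoSchedule
  containers:
    - name: nim
      image: ${image}
      ports:
        - containerPort: 8000
      resources:
        limits:
          nvidia.com/gpu: 1
      volumeMounts:
        - name: nim-cache
          mountPath: /opt/nim/.cache
  volumes:
    - name: nim-cache
      nfs:
        server: ${nfs_server}
        path: ${nfs_path}
EOF
}

# Usage (from a machine with access to the cluster network):
#   render_nim_pod "$IMAGE" "$NFS_SERVER" "$NFS_PATH" | kubectl apply -f -
```

File ownership on the share may need adjusting so that the user the container runs as can read the files written on the jump box.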

The service can be tested in a similar way as before. First, we need to find out the external service IP, which we will get via “kubectl get svc vllm-nim-llama3-service -o wide” and then interact with the service in the same way as before, adjusting to the new service IP address.

A more detailed description of the implementation steps can be found in the attached repository: https://github.com/mocelj/aks-air-gap-vllm-deployment.

Summary

This document presents a practical guide for deploying LLM inferencing solutions in isolated Azure Kubernetes Service (AKS) clusters.

It outlines two deployment approaches: one where the model weights are baked into a container image that is pushed to the container registry attached to the cluster, and another where model weights and artifacts are pre-downloaded to a shared NFS drive accessible by both the jump box and the cluster.

Both strategies enable secure, air-gapped deployments without relying on outbound internet access. For step-by-step instructions and further technical details, consult the referenced GitHub repository https://github.com/mocelj/aks-air-gap-vllm-deployment.

For large‑scale or production deployments, more performant storage options—such as local NVMe or other high‑throughput solutions—can be explored; the services used in this guide are intentionally chosen to maximize clarity and reproducibility.

Microsoft at NVIDIA GTC 2026

Fernando_Aznar — Fri, 06 Mar 2026 23:44:35 GMT

Microsoft returns to NVIDIA GTC 2026 in San Jose with a strong presence across conference sessions, in‑booth theater talks, live demos, and executive‑level ancillary events. Together with NVIDIA and our partner ecosystem, Microsoft is showcasing how Azure AI infrastructure enables AI training, inference, and production at global scale. Visit us at Booth #521 to see the latest innovations in action and connect with Azure and NVIDIA experts.

Exclusive GTC Experiences

LEGO® Datacenter Model

Explore Azure AI infrastructure at the Park Container.

Candy Lounge

Visit the high-traffic candy wall for co-branded treats all day long.

Networking Lounge

Relax and recharge with comfy seating and vital charging options.

Outdoor Juice Truck

Free, refreshing beverages served during outdoor park hours.

Live from GTC: AI Podcast

Dayan Rodriguez

Corporate Vice President, Global Manufacturing and Mobility

Alistair Spiers

General Manager, Azure Infrastructure

Live Special Feature

A conversation with Microsoft Azure

Listen & Subscribe:

aka.ms/GTC2026Podcast


Earned Conference Sessions

Don't miss these high-impact sessions where Microsoft and NVIDIA leaders discuss the future of AI factories and infrastructure.

Mon · Mar 16

5:00 PM

Drive Optimal Tokens per Watt on AI Infrastructure Using Benchmarking Recipes

Speakers: Paul Edwards, Emily Potyraj

Microsoft, NVIDIA

Tue · Mar 17

9:00 AM

Autonomous AI Factories: Technical Preview of Agent-Native Production

Speakers: JP Vasseur, César Martinez Spessot

NVIDIA, Microsoft Research

Tue · Mar 17

4:00 PM

The Road to Intelligent Mobility: Vehicle GenAI

Speakers: Raj Paul, Thomas Evans, Bryan Goodman

Microsoft, NVIDIA, Bosch

Wed · Mar 18

9:00 AM

Supercharging AI with Multi-Gigawatt AI Factories

Speakers: Gilad Shainer, Peter Salanki, Evan Burness

NVIDIA, CoreWeave, Meta, Microsoft

Daily Booth Theater Schedule

Visit the Microsoft Theater for lightning talks from engineering leaders and partners.

Monday, March 16

2:00 PM

BTH208 · NVIDIA

Accelerate AI Innovation on Azure with NVIDIA Run:ai — Rob Magno

2:30 PM

BTH202 · General Robotics

Models to Machines: Deploying Agentic AI in Real-World Robotics — Dinesh Narayanan

3:00 PM

BTH200 · Fractal Analytics

From Generalist to Enterprise-Ready: Fractal Builds Domain AI — C. Chaudhuri

3:30 PM

BTH109 · Microsoft

Agentic cloud ops - Smarter Operations with Azure Copilot — Jyoti Sharma

4:00 PM

BTH103 · Microsoft

Build a Deep Research Agent for Enterprise Data — D. Casati, A. Slutsky, H. Alkemade

4:30 PM

BTH205 · NetApp

Azure NetApp Files: Powering Your Data for AI Capabilities — Andy Chan

5:00 PM

BTH207 · NVIDIA

The Agentic Commerce Stack: Open Models on Azure — Antonio Martinez

5:30 PM

BTH217 · OPAQUE

Confidential AI on Azure Unlocks Sovereign AI at Scale — Aaron Fulkerson

6:00 PM

BTH218 · Simplismart

Making BYOC work at scale with modular inference — Amritanshu Jain

6:30 PM

Expo Reception

Tuesday, March 17

1:30 PM

BTH100 · Microsoft

From Open Weights to Enterprise Scale: Open-Source Models — Sharmila Chockalingam

2:00 PM

BTH212 · Personal AI

Unlocking the power of memory in Teams with Personal AI — Sam Harkness

2:30 PM

BTH111 · Microsoft / NVIDIA

Scalable LLM Inference on AKS Using NVIDIA Dynamo — Mohamad Al jazaery, Anton Slutsky

3:00 PM

BTH204 · Mistral AI

Innovate with Mistral AI on Microsoft Foundry — Ian Mathew

3:30 PM

BTH104 · Microsoft

GPU-Accelerated CFD at Scale: Star-CCM+ on Azure — Jason Scheffelmaer

4:00 PM

BTH206 · NeuBird AI

Agentic AI for Incident Response on Microsoft Azure — Grant Griffiths

4:30 PM

BTH101 · GitHub

Agentic DevOps: Evolving software with GitHub Copilot — Glenn Wester

5:00 PM

BTH209 · Rescale

Real-World AI Physics: GM & NVIDIA on Rescale — Dinal Perera

5:30 PM

BTH107 · Microsoft

Intro to LoRA Fine-Tuning on Azure — Christin Pohl

6:30 PM

Raffle

Wednesday, March 18

1:00 PM

BTH219 · VAST Data

Scaling AI Infrastructure on Azure with VAST Data — Jason Vallery

1:30 PM

BTH110 · Microsoft

Physical AI and Robotics: The Next Frontier — F. Miller, C. Souche, D. Narayanan

2:00 PM

BTH105 · Microsoft

Sovereign AI options with Azure Local — Kim Lam

2:30 PM

BTH108 · Microsoft

Automating HPC Workflows with Copilot Agents — Param Shah

3:00 PM

BTH102 · Microsoft

Trustworthy Multi-Agent Workflows with Microsoft Foundry — Brian Benz

4:00 PM

BTH106 · Microsoft

Scaling Enterprise AI on ARO with NVIDIA H100 & H200 — Lachie Evenson

4:30 PM

BTH211 · WEKA

Hybrid AI Data Orchestration with WEKA NeuralMesh™ — Desiree Campbell

5:00 PM

BTH202 · Hammerspace

NVIDIA AI Enterprise Software with NIM — Mike Bloom

5:30 PM

BTH203 · Kinaxis

Reimagining Global Supply Planning with Azure — Dane Henshall

6:00 PM

BTH214 · AT&T

Connected AI on Azure for Manufacturing — Brad Pritchett

6:30 PM

Raffle

Thursday, March 19

11:00 AM

BTH210 · Wandelbots

Physical AI: Powering Software-Defined Automation in Robotics — Marwin Kunz, Martin George

11:30 AM

Raffle

Explore Our Demo Pods

Visit the Microsoft booth to see our technology in action with live demonstrations across four dedicated pod areas.

POD 1

Azure AI Infrastructure

End‑to‑end AI infrastructure for training and inference at scale, featuring the latest NVIDIA GPU integrations on Azure.

POD 2

Microsoft Foundry

Our comprehensive platform for building, deploying, and operating agentic AI systems with enterprise reliability.

POD 3

Building AI Together

Showcasing joint Microsoft and NVIDIA solutions across diverse industries, from manufacturing to retail.

POD 4

Startups Powering AI

Discover how innovative startups are running next‑generation AI workloads on the Azure platform.

Ancillary Events & Networking

Join Microsoft leadership and our partner ecosystem at these curated networking experiences.

Sun · Mar 15

6:00

Microsoft for Startups Executive Leadership Dinner

📍 Morton’s Steakhouse, San Jose

Exclusive gathering for startup leaders and Microsoft executives.

Mon · Mar 16

1:30

Microsoft × NVIDIA Open Meet

📍 Signia by Hilton · International Suite

Strategic alignment session for Microsoft and NVIDIA executives.

Mon · Mar 16

7:30

Microsoft + NVIDIA Executive Dinner

📍 Il Fornaio, San Jose

Executive dinner for key customers and leadership teams.

Tue · Mar 17

11:00 AM

to 1:00 PM

Microsoft AI Luncheon: Research, Robotics, & Real‑World AI

📍 Signia by Hilton · International Suite

Invite-only: A curated executive lunch exploring the journey from AI research to physical enterprise deployments in robotics and manufacturing.

Tue · Mar 17

7:30

Networking in AI & Tech

📍 San Pedro Square Market

Community networking mixer for Microsoft teams, partners, and customers.

Wed · Mar 18

10:00 AM

to 1:00 PM

AI Innovator’s Circle Brunch: Powering Intelligent Systems Across the Ecosystem

📍 Il Fornaio, San Jose

Hosted by Microsoft & NVIDIA at GTC. Join us for an exclusive brunch and discussion on the intelligent ecosystem.

Azure Recognized as an NVIDIA Cloud Exemplar, Setting the Bar for AI Performance in the Cloud

Fernando_Aznar — Wed, 18 Feb 2026 22:31:25 GMT

As AI models continue to scale in size and complexity, cloud infrastructure must deliver more than theoretical peak performance. What matters in practice is reliable, end-to-end, workload-level AI performance—where compute, networking, system software, and optimization work together to deliver predictable, repeatable results at scale. This directly translates to business value: efficient full-stack infrastructure accelerates time-to-market, maximizes ROI on GPU and cloud investments, and enables organizations to scale AI from proof-of-concept to revenue-generating products with predictable economics.

Today, Microsoft is proud to share an important milestone in partnership with NVIDIA: Azure has been validated as an NVIDIA Exemplar Cloud, becoming the first cloud provider recognized for Exemplar-class AI performance aligned with GB300-class (Blackwell generation) systems.
This recognition builds on Azure’s previously validated Exemplar status for H100 training workloads and reflects NVIDIA’s confidence in Azure’s ability to extend that rigor and performance discipline into the next generation of AI platforms.

What Is NVIDIA Exemplar Cloud?

The NVIDIA Exemplar Cloud initiative celebrates cloud platforms that demonstrate robust end-to-end AI workload performance using NVIDIA’s Performance Benchmarking suite.

Rather than relying on synthetic microbenchmarks, Performance Benchmarking evaluates real AI training workloads using:

Large-scale LLM training scenarios
Production-grade software stacks
Optimized system and network configurations
Workload-centric metrics such as throughput and time-to-train

Achieving Exemplar validation signals that a provider can consistently deliver world-class AI performance in the cloud, showcasing that end users are getting optimal performance value by default.

Proven Exemplar Validation on H100

Azure’s Exemplar Cloud journey began with publicly shared benchmarking results for H100-based training workloads, where Azure ND GPU clusters demonstrated exemplar performance using NVIDIA Performance Benchmarking recipes.

Those results—published previously and validated through NVIDIA’s benchmarking framework—established a proven foundation of end-to-end AI performance for large-scale, production workloads running on Azure today.

Extending Exemplar-Class AI Performance to GB300-Class Platforms

Building on the rigor and learnings from H100 validation, Microsoft has now been recognized by NVIDIA as the first cloud provider to achieve Exemplar-class performance and readiness aligned with GB300-class systems.

This designation reflects NVIDIA’s assessment that the same principles applied to H100—including end-to-end system tuning, networking optimization, and software alignment—are being successfully carried forward into the Blackwell generation.

Rather than treating GB300 as a point solution, Azure approaches it as a continuation of a proven performance model: delivering consistent world-class AI performance in the cloud while preserving the flexibility, elasticity, and global scale customers expect.

What Enables Exemplar-Class AI Performance on Azure

Delivering Exemplar-class AI performance requires optimization across the full AI stack:

Infrastructure and Networking

High-performance Azure ND GPU clusters with NVIDIA InfiniBand
NUMA-aware CPU, GPU, and NIC alignment to minimize latency
Tuned NCCL communication paths for efficient multi-GPU scaling

Software and System Optimization

Tight integration with NVIDIA software, including Performance Benchmarking recipes and NVIDIA AI Enterprise
Parallelism strategies aligned with large-scale LLM training
Continuous tuning as models, workloads, and system architectures evolve

End-to-End Workload Focus

Measuring real training performance, not isolated component metrics
Driving repeatable improvements in application-level throughput and efficiency
Closing the performance gap between cloud and on-premises systems—without sacrificing manageability

Together, these capabilities enabled Azure to deliver consistent Exemplar-class AI performance across generations of NVIDIA platforms.

What This Means for Customers

For customers training and deploying advanced AI models, this milestone delivers clear benefits:

World-class AI performance in a fully managed cloud environment
Predictable scaling from small clusters to thousands of GPUs
Faster time to train and improved performance per dollar
Confidence that Azure is ready for Blackwell-class and GB300-class AI workloads

As AI workloads become more complex and reasoning-heavy, infrastructure performance increasingly determines outcomes. Azure’s NVIDIA Cloud Exemplar recognition provides a clear signal: customers can build and scale next-generation AI systems on Azure without compromising on performance.

Learn More

DGX Cloud Benchmarking on Azure

Centralized cluster performance metrics with ReFrame HPC and Azure Log Analytics

jimpaine — Fri, 06 Feb 2026 09:37:24 GMT

Imagine having several clusters across different environments (dev, test and prod) or planning a migration between PBS and Slurm or porting codes to a different system. They can all seem like daunting tasks.

This is where the combination of ReFrame HPC, a powerful and feature rich testing framework, and Azure Log Analytics can help improve confidence and assurance in the performance and accuracy of a system.

Here we will look at how to configure ReFrame HPC specifically for Azure: Deploying the required Azure resources, running a test and capturing the results in Log Analytics for analysis.

Deploying the required Azure Resources

First, deploy the required resources in Azure using this Bicep template from GitHub. The deployment creates and configures everything ReFrame HPC requires: a data collection endpoint, a data collection rule, and a Log Analytics workspace.

Azure icons for a Data Collection Endpoint, Data Collection Rule with an arrow pointing from them to the icon for Log Analytics Workspace.

The structure of the endpoint that is needed later is complex, but the Bicep template generates it and outputs it at the end, so make sure to capture it now.

Running ior via ReFrame HPC

For the purpose of demonstrating a running test and capturing the results in Azure from start to finish, here is a simple ior test which will run both a read and a write operation against the shared storage.

import reframe as rfm
import reframe.utility.sanity as sn
@rfm.simple_test
class SimplePerfTest(rfm.RunOnlyRegressionTest):
    valid_systems = ["*"]
    valid_prog_environs = ["+ior"]
    executable = 'ior'
    executable_opts = [
        '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
    ]
    reference = {
        'tst:hbv4': {
            'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
            'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
        }
    }
    @sanity_function
    def validate_run(self):
        return sn.assert_found(r'Summary of all tests:', self.stdout)
    @performance_function('MiB/s')
    def write_bandwidth_mib(self):
        return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)
    @performance_function('MiB/s')
    def read_bandwidth_mib(self):
        return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

Test explanation

Set the binary to be executed to ior, along with its arguments.

executable = 'ior'
executable_opts = [
    '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
]

Specify which systems the test should run on. In this case, any system/cluster which is known to have ior available will be selected. Look at the ReFrame HPC documentation to get a better understanding of the options available for use.

valid_systems = ["*"]
valid_prog_environs = ["+ior"]

Verify the stdout of the job by searching for a specific value to assert that it ran successfully.

@sanity_function
def validate_run(self):
    return sn.assert_found(r'Summary of all tests:', self.stdout)

If the sanity function passes, ReFrame HPC then extracts the performance metrics from the stdout of the job. The names of the methods are important, because they are stored as the metric names in the results later.

@performance_function('MiB/s')
def write_bandwidth_mib(self):
    return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)
@performance_function('MiB/s')
def read_bandwidth_mib(self):
    return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)
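To see what these patterns pick up, here is a minimal sketch that applies the same regular expressions with Python's standard re module. The sample output below is invented for illustration (it is not a real ior result), and the helper name extract_metric is hypothetical. The sketch passes re.MULTILINE so that ^ matches at the start of each line.

```python
import re

# Hypothetical excerpt of ior summary output; all numbers are made up.
sample_stdout = """\
Summary of all tests:
Operation   Max(MiB)   Min(MiB)  Mean(MiB)
write         512.25     498.10     505.40
read          361.50     340.00     350.75
"""


def extract_metric(operation: str, stdout: str) -> float:
    """Return the first number following a line that starts with `operation`."""
    pattern = rf'^{operation}\s+([0-9]+\.?[0-9]*)'
    match = re.search(pattern, stdout, re.MULTILINE)
    if match is None:
        raise ValueError(f'no {operation} line found in output')
    return float(match.group(1))


write_bw = extract_metric('write', sample_stdout)
read_bw = extract_metric('read', sample_stdout)
print(write_bw, read_bw)
```

Only the first numeric column after the operation name is captured, so which ior statistic ends up in the metric depends on the column order of the real output.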

Performance references determine whether the current cluster has met the requirement. Each reference also allows margins to be specified in either direction.

reference = {
    'tst:hbv4': {
        'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
        'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
    }
}
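As a rough illustration of how the margins translate into numbers, the sketch below treats the second and third tuple elements as fractional lower and upper thresholds relative to the reference value, which is how the ReFrame documentation describes them for positive references. The helper names are hypothetical and are not part of ReFrame.

```python
def reference_bounds(ref, lower, upper):
    """Return the (low, high) acceptance limits for a reference tuple."""
    return ref * (1 + lower), ref * (1 + upper)


def within_reference(value, ref, lower, upper):
    """True if a measured value falls inside the allowed band."""
    low, high = reference_bounds(ref, lower, upper)
    return low <= value <= high


# The two references from the test above.
write_low, write_high = reference_bounds(500, -0.05, 0.1)
read_low, read_high = reference_bounds(350, -0.05, 0.5)

print(f'write: {write_low:.1f} to {write_high:.1f} MiB/s')
print(f'read:  {read_low:.1f} to {read_high:.1f} MiB/s')
```

Under that reading, a write result between roughly 475 and 550 MiB/s passes, and anything outside that band is reported as a performance failure.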

ReFrame HPC Configuration

The ReFrame HPC configuration is key to determining how and where the test will run. It is also where the logic that allows ReFrame HPC to use Azure for centralized logging is defined. The full configuration file is extensive and is covered in detail in the ReFrame HPC documentation. For the purpose of this test, an example can be found on GitHub. Below is a breakdown of the key parts that allow ReFrame HPC to push its results into Azure Log Analytics.

Logging Handler

The most important part of this configuration is the logging section; without it, ReFrame HPC will not attempt to log the results. A handlers_perflog entry of type httpjson is added so that the logs are sent to an HTTP endpoint with specific values, which are covered below.

# Requires: from datetime import datetime, timezone
'logging': [
    {
        'perflog_multiline': True,
        'handlers_perflog': [
            {
                'type': 'httpjson',
                'url': 'REDACTED',
                'level': 'info',
                'debug': False,
                'extra_headers': {'Authorization': f'Bearer {_get_token()}'},
                'extras': {
                    'TimeGenerated': f'{datetime.now(timezone.utc).isoformat()}',
                    'facility': 'reframe',
                    'reframe_azure_data_version': '1.0',
                },
                'ignore_keys': ['check_perfvalues'],
                'json_formatter': _format_record
            }
        ]
    }
]

Multiline Perflog

To ensure this works with Azure, enable perflog_multiline. This will ensure a single record per metric is sent to Log Analytics. This is the cleanest way to output the results. Having this set to False will move the metric names into column names, which means that the schema will be different for each test and will become hard to maintain.
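The schema argument can be illustrated with a small, simplified sketch. The field names used here (test, metric, value) and the second test are hypothetical stand-ins rather than the actual keys ReFrame HPC emits; the point is only to compare the set of columns each layout produces.

```python
# Simplified stand-ins for the metrics reported by two different tests.
results = {
    'SimplePerfTest': {'write_bandwidth_mib': 512.25, 'read_bandwidth_mib': 361.5},
    'OtherTest': {'latency_us': 1.9},
}


def long_records(results):
    """One record per metric: the column set is the same for every test."""
    return [
        {'test': test, 'metric': name, 'value': value}
        for test, metrics in results.items()
        for name, value in metrics.items()
    ]


def wide_records(results):
    """One record per test: every metric name becomes its own column."""
    return [{'test': test, **metrics} for test, metrics in results.items()]


long_columns = sorted({key for rec in long_records(results) for key in rec})
wide_columns = sorted({key for rec in wide_records(results) for key in rec})

print(long_columns)
print(wide_columns)
```

With one record per metric, adding a new test never adds a column, so a single table definition can serve every test.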

Extra Headers

A bearer token is required to authenticate the request. ReFrame HPC allows headers to be added via the extra_headers property, and a simple Python function obtains a scoped token that can be placed in the Authorization header.

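from azure.identity import DefaultAzureCredential
def _get_token(scope='https://monitor.azure.com/.default') -> str:
    # DefaultAzureCredential picks up a managed identity or an existing az login session
    credential = DefaultAzureCredential()
    token = credential.get_token(scope)
    return token.token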

URL Structure

The URL can be found in the output of the Bicep deployment that was run previously. It can also be obtained via the portal. Here is the structure of the URL for reference.

'${dce.properties.logsIngestion.endpoint}/dataCollectionRules/${dcr.properties.immutableId}/streams/Custom-${table.name}?api-version=2023-01-01'
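If you need to assemble the URL yourself, a minimal Python sketch following the same structure might look like this. The function name and all example values (endpoint host, immutable ID, table name) are placeholders chosen for illustration, not real resources.

```python
def build_ingestion_url(endpoint, dcr_immutable_id, table_name,
                        api_version='2023-01-01'):
    """Compose the logs ingestion URL from its three parts."""
    stream = f'Custom-{table_name}'
    return (
        f'{endpoint.rstrip("/")}/dataCollectionRules/{dcr_immutable_id}'
        f'/streams/{stream}?api-version={api_version}'
    )


# Placeholder values only; substitute the outputs of your own deployment.
url = build_ingestion_url(
    endpoint='https://example-dce.example-region.ingest.monitor.azure.com',
    dcr_immutable_id='dcr-00000000000000000000000000000000',
    table_name='Example_CL',
)
print(url)
```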

JSON Formatter

A small workaround is needed because the Data Collection Rule expects an array of items and ReFrame HPC outputs a single record. To resolve this, another Python function can be used which simply wraps the record in an array. In this example it also tidies up the record by removing some items that are not required and would cause issues with the JSON serialization.

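import json
def _format_record(record, extras, ignore_keys):
    data = {}
    for attr, val in record.__dict__.items():
        # Skip ignored keys and private attributes that would not serialize cleanly
        if attr in ignore_keys or attr.startswith('_'):
            continue
        data[attr] = val
    data.update(extras)
    # The Data Collection Rule expects an array, so wrap the single record
    return json.dumps([data])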
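To check what the formatter produces without running ReFrame HPC, it can be exercised against a stand-in record. In the sketch below the record is a plain SimpleNamespace with made-up attribute names; a real log record carries many more fields.

```python
import json
from types import SimpleNamespace


def _format_record(record, extras, ignore_keys):
    data = {}
    for attr, val in record.__dict__.items():
        if attr in ignore_keys or attr.startswith('_'):
            continue
        data[attr] = val
    data.update(extras)
    return json.dumps([data])


# Stand-in for a log record; attribute names here are illustrative only.
record = SimpleNamespace(
    check_name='SimplePerfTest',
    check_perf_value=512.25,
    check_perfvalues={'ignored': True},
    _internal='dropped',
)

payload = _format_record(record, {'facility': 'reframe'}, ['check_perfvalues'])
decoded = json.loads(payload)
print(decoded)
```

The decoded payload is a one-element list, which is the shape the Data Collection Rule expects.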

Running the Test

Now that the infrastructure has been deployed, the test has been defined and is correctly configured, we can run the test.

Start by logging in. Here I am using the managed identity of the node, but user authentication and user-assigned managed identities are also supported.

$ az login --identity

ReFrame HPC can be installed via Spack or Python. While I am using Spack for packages on the cluster, I find the simplest approach is to activate a Python environment and install ReFrame HPC along with test-specific Python dependencies.

$ python3 -m venv .venv
$ . .venv/bin/activate
$ python -m pip install -U pip
$ pip install -r requirements.txt

Now, using the ReFrame HPC CLI, the test can be run with the configuration file and the test file.

$ reframe -C config.py -c simple_perf.py --performance-report -r

ReFrame HPC will now run the test against the system/cluster defined in the configuration. For this example it is a Slurm cluster with a partition of HBv4 nodes, and running squeue confirms that.

$ squeue
JOBID PARTITION NAME     USER     ST TIME NODES NODELIST(REASON)
955   hbv4      rfm_Simp jim.pain R  0:28 1     tst4-hbv4-97

Results

And there we have it: results are now appearing in Azure! From here we can use KQL to query and filter the results. This is just a subset of the values available; the full dataset includes a wide range of fields that are useful for analysis.

Summary

By standardizing on the combination of ReFrame HPC and Azure Log Analytics for testing and reporting performance data across your clusters, whether Slurm-based, Azure CycleCloud, or existing on-prem clusters, you can gain unprecedented visibility and confidence in the systems you manage and the codes you deploy. This enables:

🔎Fast cross-cluster comparisons
📈Trend analysis over long running periods
📊Standardized metrics regardless of scheduler or system
☁️Unified monitoring and reporting across clusters

ReFrame HPC is suitable for a wide range of testing, so if testing is something you have been looking to implement, take a look at ReFrame HPC.

Scaling physics-based digital twins: Neural Concept on Azure delivers a New Record in Industrial AI

lmiroslaw — Mon, 12 Jan 2026 12:10:23 GMT

Automotive Design and the DrivAerNet++ Benchmark

In automotive design, external aerodynamics have a direct impact on performance, energy efficiency, and development cost. Even small reductions in drag can translate into significant fuel savings or extended EV range. As development timelines accelerate, engineering teams increasingly rely on data-driven methods to augment or replace traditional CFD workflows.

MIT’s DrivAerNet++ dataset is the largest open multimodal dataset for automotive aerodynamics, offering a large-scale benchmark for evaluating learning-based approaches that capture the physical signals required by engineers. It includes 8,000 vehicle geometries across 3 variants (fastback, notchback and estate-back) and aggregates 39 TB of high-fidelity CFD outputs such as surface pressure, wall shear stress, volumetric flow fields, and drag coefficients.

Benchmark Highlights

Neural Concept trained its geometry-native Geometric Regressor, designed to handle any type of engineering data. The benchmark was executed on Azure HPC infrastructure to evaluate the capabilities of the geometry-native platform under transparent, scalable, and fully reproducible conditions.

Surface pressure: Lowest prediction error recorded on the benchmark, revealing where high- and low-pressure zones form.
Wall shear stress: Outperforming all competing methods to detect flow attachment and separation for drag and stability control.
Volumetric velocity field: More than 50% lower error than previous best, capturing full flow structure for wake stability analysis.
Drag coefficient Cd: R² of 0.978 on the test set, accurate enough for early design screening without full CFD runs.
Dataset Scale and Ingestion: 39 TB of data was ingested into Neural Concept’s platform through a parallel conversion task with 128 workers and 5 GB RAM each that finished in about 1 hour and produced a compact 3 TB dataset in the platform’s native format.
Data Pre-Processing: Pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices for handling external aerodynamics workflows.
Model Training and Deployment: Training completed in 24 hours on 4 A100 GPUs, with the best model obtained after 16 hours. The final model is compact and real-time predictions can be served on a single 16 GB GPU for industrial use.

Neural Concept outperformed all other competing methods, achieving state-of-the-art performance prediction on all metrics and physical quantities within a week:

“Neural Concept’s breakthrough demonstrates the power of combining advanced AI with the scalability of Microsoft Azure,” said Jack Kabat, Partner, Azure HPC and AI Infrastructure Products, Microsoft. “By running training and deployment on Azure’s high-performance infrastructure — specifically the NC A100 Virtual Machine— Neural Concept was able to transform 39 terabytes of data into a production-ready workflow in just one week. This shows how Azure accelerates innovation and helps automotive manufacturers bring better products to market faster.”

For additional benchmark metrics and comparisons, please refer to the Detailed Quantitative Results section at the end of the article.

From State-Of-The-Art Benchmark Accuracy to Proven Industrial Impact

Model accuracy alone is necessary, but not sufficient for industrial impact. Transformative gains at scale and over time are only revealed once high-performing models are deployed into maintainable and repeatable workflows across organizations.

Customers using Neural Concept’s platform have achieved:

30% shorter design cycles
$20M in savings on a 100,000-unit vehicle program

These outcomes fundamentally result from a transformed, systematic approach to design, unlocking better and faster data-driven decisions. The Design Lab interface, described in the next section, is at the core of this transformation.

Within Neural Concept’s ecosystem, validated geometry and physics models can be deployed directly into the Design Lab - a collaborative environment where aerodynamicists and designers evaluate concepts in real time. AI copilots provide instant performance feedback, geometry-aware improvement suggestions, and live KPI updates, effectively reconnecting aerodynamic analysis with the pace of modern vehicle design.

CES 2026: See how OEMs are transforming product development with Engineering Intelligence

Neural Concept and Microsoft will showcase how AI-native aerodynamic workflows can reshape vehicle development — from real-time design exploration to enterprise-scale deployment. Visit the Microsoft booth to see DrivAerNet++ running on Azure HPC and meet the teams shaping the future of automotive engineering.


Neural Concept’s executive team will also be at CES to share flagship results achieved by leading OEMs and Tier-1 suppliers already using the platform in production. Learn more on: https://www.neuralconcept.com/ces-2026

Credits
Microsoft: Hugo Meiland (Principal Program Manager), Guy Bursell (Director Business Strategy, Manufacturing), Fernando Aznar Cornejo (Product Marketing Manager) and Dr. Lukasz Miroslaw (Sr. Industry Advisor)

Neural Concept: Theophile Allard (CTO), Benoit Guillard (Senior ML Research Scientist), Alexander Gorgin (Product Marketing Engineer), Konstantinos Samaras-Tsakiris (Software Engineer)

Detailed Quantitative Results

In the sections that follow, we share the results obtained by applying Neural Concept’s aerodynamics predictive model training template to DrivAerNet++.

We evaluated our model’s prediction errors using the official train/test split and the standard evaluation strategy. For comparison, metrics from other methods were taken from the public leaderboard. We reported both Mean Squared Error (MSE) and Mean Absolute Error (MAE) to quantify prediction accuracy. Lower values for either metric indicate closer agreement with the ground truth simulations, meaning better predictions.
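For readers less familiar with the two metrics, here is a minimal sketch of how MSE and MAE are computed. The values are toy numbers chosen for easy arithmetic, not benchmark data, and the benchmark's own evaluation script may normalize or aggregate differently.

```python
def mean_squared_error(truth, pred):
    """Average of squared differences; penalizes large errors more heavily."""
    return sum((t - p) ** 2 for t, p in zip(truth, pred)) / len(truth)


def mean_absolute_error(truth, pred):
    """Average of absolute differences; in the same units as the data."""
    return sum(abs(t - p) for t, p in zip(truth, pred)) / len(truth)


# Toy values, not taken from DrivAerNet++.
truth = [1.0, 2.0, 3.0, 4.0]
pred = [1.5, 2.0, 2.0, 4.5]

mse = mean_squared_error(truth, pred)
mae = mean_absolute_error(truth, pred)
print(mse, mae)
```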

1. Surface Field Predictions: Pressure and Wall Shear Stress

We began by evaluating predictions for the two physical quantities defined on the vehicle surface.

Surface Pressure

The Geometric Regressor achieved substantially better performance than all existing methods in predicting surface pressure distribution.

Rank	Deep Learning Model	MSE (10^-2, lower = better)	MAE (10^-1, lower = better)
#1	Neural Concept	3.98	1.08
#2	GAOT (May 2025)	4.94	1.10
#3	FIGConvNet (February 2025)	4.99	1.22
#4	TripNet (March 2025)	5.14	1.25
#5	RegDGCNN (June 2024)	8.29	1.61

Table 1: Neural Concept’s Geometric Regressor predicts surface pressure more accurately than previously published state-of-the-art methods. The dates indicate when the competing model architectures were published.

Figure 1: Side-by-side comparison of the ground truth pressure field (left), Neural Concept model’s prediction (middle), and the corresponding error for a representative test sample (right).

Wall Shear Stress

Similarly, the model delivered top-tier results, outperforming all competing methods.

Rank	Deep Learning Model	MSE (10^-2, lower = better)	MAE (10^-1, lower = better)
#1	Neural Concept	7.80	1.44
#2	GAOT (May 2025)	8.74	1.57
#3	TripNet (March 2025)	9.52	2.15
#4	FIGConvNet (February 2025)	9.86	2.22
#5	RegDGCNN (June 2024)	13.82	3.64

Table 2: Neural Concept’s Geometric Regressor predicts wall shear stress more accurately than previously published state-of-the-art methods.

Figure 2: Side-by-side comparison of the ground truth magnitude of the wall shear stress, Neural Concept model’s prediction, and the corresponding error for a representative test sample.

Across both surface fields (pressure and wall shear stress), the Geometric Regressor achieved the lowest MSE and MAE by a clear margin. The baseline methods represent several high-quality, recent academic works (the earliest dating from June 2024), yet our architecture established a new state of the art in predictive performance.

2. Volumetric Predictions: Velocity

Beyond surface quantities, DrivAerNet++ provides 3D velocity fields in the flow volume surrounding the vehicle, which we also predicted using the Geometric Regressor.

Rank	Deep Learning Model	MSE (lower = better)	MAE (10^-1, lower = better)
#1	Neural Concept	3.11	9.22
#2	TripNet (March 2025)	6.71	15.2

Table 3: Neural Concept’s Geometric Regressor predicts velocity more accurately than the previously published state-of-the-art method.

The illustration below shows the velocity magnitude for two test samples. Note that only a single 2D slice of the 3D volumetric domain is shown here, focusing on the wake region behind the car. In practice, the network predicts velocity at any location within the full 3D domain, not just on this slice.

Figure 3: Velocity magnitude for two test samples, arranged in two columns (left and right). For each sample, the top row displays the simulated velocity field, the middle row shows the prediction from the network, and the bottom row presents the error between the two.

3. Scalar Predictions: Drag Coefficient

The drag coefficient (Cd) is the most critical parameter in automotive aerodynamics, as reducing it directly translates to lower fuel consumption in combustion vehicles and increased range in electric vehicles. Using the same underlying architecture, our model achieved state-of-the-art performance in Cd prediction.

In addition to MSE and MAE, we reported the Maximum Absolute Error (Max AE) to reflect worst-case accuracy. We also included the Coefficient of Determination (R² score), which measures the proportion of variance explained by the model. An R² value of 1 indicates a perfect fit to the target data.
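The two additional metrics can be sketched in a few lines. Again, these are toy numbers for illustration rather than benchmark values, and R² is computed with the usual definition of one minus the ratio of residual to total sum of squares.

```python
def max_absolute_error(truth, pred):
    """Largest single deviation: the worst-case prediction."""
    return max(abs(t - p) for t, p in zip(truth, pred))


def r2_score(truth, pred):
    """Proportion of variance in `truth` explained by `pred`."""
    mean_truth = sum(truth) / len(truth)
    ss_res = sum((t - p) ** 2 for t, p in zip(truth, pred))
    ss_tot = sum((t - mean_truth) ** 2 for t in truth)
    return 1 - ss_res / ss_tot


# Toy values, not taken from DrivAerNet++.
truth = [2.0, 4.0, 6.0, 8.0]
pred = [2.5, 3.5, 6.0, 9.0]

max_ae = max_absolute_error(truth, pred)
r2 = r2_score(truth, pred)
print(max_ae, round(r2, 3))
```

A model that always predicted the mean of the targets would score an R² of 0, which is why values well below 1 in the table indicate a weak fit.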

Rank	Deep Learning Model	MSE (1e-5)	MAE (1e-3)	Max AE (1e-2)	R²
#1	Neural Concept	0.8	2.22	1.13	0.978
#2	TripNet	9.1	7.19	7.70	0.957
#3	PointNet	14.9	9.60	12.45	0.643
#4	RegDGCNN	14.2	9.31	12.79	0.641
#5	GCNN	17.1	10.43	15.03	0.596

On the official split, the model shows tight agreement with CFD (R² of 0.978) across the test set, which is sufficient for early design screening where engineers need to rank variants confidently and spot meaningful gains without running full simulations for every change.

4. Compute Efficiency and Azure HPC&AI Collaboration

Executing the full DrivAerNet++ benchmark at industrial scale required Neural Concept’s full software and infrastructure stack combined with seamless cloud integration on Microsoft Azure to dynamically scale computing resources on demand. The entire pipeline runs natively on Microsoft Azure and can scale within minutes, allowing us to process new industrial datasets that contain thousands of geometries without complex capacity planning.

Dataset Scale and Ingestion

The DrivAerNet++ dataset contains 8,000 car designs along with their corresponding CFD simulations. The raw dataset occupies approximately 39 TB of storage. Generating the simulations required a total of about 3 million CPU hours by MIT’s DeCoDE Lab.

Ingestion into Neural Concept’s platform is the first step of the pipeline.

To convert the raw files into the platform’s optimized native format, we use a Conversion task.
This task was parallelized across 128 workers, each allocated 5 GB of RAM.

As a result, the entire conversion process completed in approximately one hour. After converting the relevant data (car geometry, wall shear stress, pressure, and velocity), the full dataset occupies approximately 3 TB in Neural Concept’s native format.
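A back-of-the-envelope calculation puts these figures in perspective. The sketch below assumes decimal units (1 TB = 1000 GB), takes "approximately one hour" as exactly 3600 seconds, and assumes the whole 39 TB was read during conversion; the resulting rates are estimates, not measured values.

```python
RAW_TB = 39             # raw dataset size
NATIVE_TB = 3           # size after conversion to the native format
WORKERS = 128
RAM_PER_WORKER_GB = 5
DURATION_S = 3600       # assumption: "approximately one hour"

total_ram_gb = WORKERS * RAM_PER_WORKER_GB
aggregate_gb_per_s = RAW_TB * 1000 / DURATION_S
per_worker_mb_per_s = aggregate_gb_per_s * 1000 / WORKERS
size_reduction = RAW_TB / NATIVE_TB

print(f'total worker RAM:     {total_ram_gb} GB')
print(f'aggregate read rate:  {aggregate_gb_per_s:.1f} GB/s')
print(f'per-worker read rate: {per_worker_mb_per_s:.0f} MB/s')
print(f'size reduction:       {size_reduction:.0f}x')
```

Note that the size reduction reflects both the more compact format and the fact that only the relevant fields were kept, so it should not be read as a pure compression ratio.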

Data Pre-Processing

Pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices. During this phase, workloads were distributed across multiple compute nodes with peak memory usage reaching approximately 1.5 TB of RAM.

The pre-processing pipeline consists of two main stages. In the first stage, we repaired the car meshes and pre-computed geometric features needed for training. The second stage involved filtering the volumetric domain and re-sampling points to follow a spatial distribution that is more efficient for training our deep learning model.

We scaled the compute resources so that each of the two stages in the pipeline completes in 1 to 3 hours when processing the full dataset. The first stage is the most computationally intensive. To handle it efficiently, we parallelized the task across 256 independent workers, each allocated 6 GB of RAM.
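As a quick sanity check, the memory allocated to the first stage can be totalled. The sketch assumes decimal units (1 TB = 1000 GB); the result is consistent with the peak of approximately 1.5 TB of RAM mentioned earlier in this section.

```python
WORKERS = 256
RAM_PER_WORKER_GB = 6

total_ram_gb = WORKERS * RAM_PER_WORKER_GB
total_ram_tb = total_ram_gb / 1000  # assumption: decimal units

print(f'{total_ram_gb} GB allocated, about {total_ram_tb:.1f} TB')
```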

Model Training and Deployment

While we use state-of-the-art hardware for training, our performance gains come primarily from model design. Once trained, the model remains lightweight and cost-effective to run.

Training was performed on an Azure Standard_NC96ads_A100_v4 node, which provided access to four A100 GPUs, each with 80 GB of memory.
The model was trained for approximately 24 hours.

Neural Concept’s Geometric Regressor achieved the best reported performance on the official benchmark for surface pressure, wall shear stress, volumetric velocity and drag prediction.

mpi-stage: High-Performance File Distribution for HPC Clusters

pauledwards — Fri, 09 Jan 2026 10:24:34 GMT

When running containerized workloads on HPC clusters, one of the first problems you hit is getting container images onto the nodes quickly and repeatably. A .sqsh is a Squashfs image (commonly used by container runtimes on HPC). In some environments you can run a Squashfs image directly from shared storage, but at scale that often turns the shared filesystem into a hot spot.

Copying the image to local NVMe keeps startup time predictable and avoids hundreds of nodes hammering the same source during job launch.

In this post, I'll introduce mpi-stage, a lightweight tool that uses MPI broadcasts to distribute large files across cluster nodes at speeds that can saturate the backend network.

The Problem: Staging Files at Scale

On an Azure CycleCloud Workspace for Slurm cluster with GB300 GPU nodes, I needed to stage a large Squashfs container image from shared storage onto each node's local NVMe storage before launching training jobs.

At small scale you can often get away with ad-hoc copies, but once hundreds of nodes are all trying to read the same source file, the shared source filesystem quickly becomes the bottleneck.

I tried several approaches:

Attempt 1: Slurm's sbcast

Slurm's built-in sbcast seemed like the natural choice. In my quick testing it was slower than I wanted, and the overwrite/skip-existing behavior didn't match the "fast no-op if already present" workflow I was after. I didn't spend much time exploring all the configuration options before moving on.

Attempt 2: Shell Script Fan-Out

I wrote a shell script using a tree-based fan-out approach: copy to N nodes, then each of those copies to N more, and so on. This worked and scaled reasonably, but had some drawbacks:

Multiple stages: The script required orchestrating multiple rounds of copy commands, adding complexity
Source filesystem stress: Even with fan-out, the initial copies still hit the source filesystem simultaneously — a fan-out of 4 meant 4 nodes competing for source bandwidth
Frontend network: Copies went over the Ethernet network by default — I could have configured IPoIB, but that added more setup
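To make the "multiple stages" drawback concrete, here is a small sketch that counts how many copy rounds a tree-based fan-out needs. It models one plausible variant, which is an assumption rather than the exact script used here: in the first round N nodes copy from the shared source, and in every later round each node that already holds the file copies it to N more.

```python
def fanout_rounds(total_nodes: int, fanout: int) -> int:
    """Number of copy rounds until every node holds the file."""
    if total_nodes < 1 or fanout < 1:
        raise ValueError('total_nodes and fanout must be positive')
    holders = min(fanout, total_nodes)  # round 1: copies from the shared source
    rounds = 1
    while holders < total_nodes:
        holders += holders * fanout     # every holder feeds `fanout` new nodes
        rounds += 1
    return rounds


rounds_156 = fanout_rounds(156, 4)
print(rounds_156)
```

Under this model a fan-out of 4 reaches 4, 20, 100, and then 500 nodes, so a 156-node job needs four orchestrated rounds.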

The Solution: MPI Broadcasts

The key insight was that MPI's broadcast primitive (MPI_Bcast) is specifically optimized for one-to-many data distribution. Modern MPI implementations like HPC-X use tree-based algorithms that efficiently utilize the high-bandwidth, low-latency InfiniBand network.

With mpi-stage:

Single source read: Only one node reads from the source filesystem
Backend network utilization: Data flows over InfiniBand using optimized MPI collectives
Intelligent skipping: Nodes that already have the file (verified by size or checksum) skip the copy entirely

Combined, this keeps the shared source (NFS, Lustre, blobfuse, etc.) from being hammered by many concurrent readers while still taking full advantage of the backend fabric.

How It Works

mpi-stage is designed around a simple workflow:

The source node reads the file in chunks and streams each chunk via MPI_Bcast. Destination nodes write each chunk to local storage immediately upon receipt. This streaming approach means the entire file never needs to fit in memory — only a small buffer is required.
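The streaming pattern can be illustrated without MPI. The following single-process sketch is not the tool's implementation: it reads a source file chunk by chunk and hands each chunk to several local files that stand in for the destination nodes, where the real tool would call MPI_Bcast instead. File names and sizes are arbitrary.

```python
import os
import tempfile

CHUNK_SIZE = 1024 * 1024  # only one chunk is held in memory at a time


def read_chunks(path, chunk_size=CHUNK_SIZE):
    """Yield the file contents one chunk at a time."""
    with open(path, 'rb') as src:
        while True:
            chunk = src.read(chunk_size)
            if not chunk:
                break
            yield chunk


def stage_file(source, destinations, chunk_size=CHUNK_SIZE):
    """Stream `source` to every destination, writing each chunk on receipt."""
    outputs = [open(dest, 'wb') for dest in destinations]
    try:
        for chunk in read_chunks(source, chunk_size):
            for out in outputs:  # stand-in for MPI_Bcast delivering the chunk
                out.write(chunk)
    finally:
        for out in outputs:
            out.close()


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'image.sqsh')
    with open(source, 'wb') as f:
        f.write(os.urandom(3 * 1024 + 17))  # deliberately not a chunk multiple
    dests = [os.path.join(tmp, f'node{i}.sqsh') for i in range(3)]
    stage_file(source, dests, chunk_size=1024)

    with open(source, 'rb') as f:
        original = f.read()
    copies = []
    for dest in dests:
        with open(dest, 'rb') as f:
            copies.append(f.read())
    copies_match = all(copy == original for copy in copies)

print(copies_match)
```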

Key Features

Pre-copy Validation

Before any data is transferred, each node checks if the destination file already exists and matches the source. You can choose between:

Size check (default): Fast comparison of file sizes—sufficient for most use cases
Checksum: Stronger validation, but requires reading the full file and is therefore slower

If all nodes already have the correct file, mpi-stage completes in milliseconds with no data transfer.
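The trade-off between the two modes can be sketched as follows. This is a local, simplified stand-in: the function names are hypothetical, SHA-256 is an assumed checksum (the tool may use a different one), and in the real tool each node compares against metadata from the source node rather than reading the source file itself.

```python
import hashlib
import os
import tempfile


def file_sha256(path, chunk_size=1 << 20):
    """Checksum a file without loading it into memory all at once."""
    digest = hashlib.sha256()
    with open(path, 'rb') as f:
        for chunk in iter(lambda: f.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def needs_copy(source, dest, mode='size'):
    """Decide whether `dest` must be (re)copied from `source`."""
    if not os.path.exists(dest):
        return True
    if os.path.getsize(source) != os.path.getsize(dest):
        return True
    if mode == 'checksum':
        return file_sha256(source) != file_sha256(dest)
    return False  # sizes match and no checksum was requested


with tempfile.TemporaryDirectory() as tmp:
    source = os.path.join(tmp, 'source.sqsh')
    dest = os.path.join(tmp, 'dest.sqsh')
    with open(source, 'wb') as f:
        f.write(b'A' * 4096)

    missing = needs_copy(source, dest)

    # Same size, different content: only the checksum mode notices.
    with open(dest, 'wb') as f:
        f.write(b'B' * 4096)
    size_says_copy = needs_copy(source, dest, mode='size')
    checksum_says_copy = needs_copy(source, dest, mode='checksum')

print(missing, size_says_copy, checksum_says_copy)
```

The size check only touches file metadata, which is why it is fast; the checksum catches same-size corruption at the cost of reading every byte.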

Double-Buffered Transfers

The implementation uses double-buffered, chunked transfers to overlap network communication with disk I/O. While one buffer is being broadcast, the next chunk is being read from the source.
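A minimal sketch of the double-buffering idea, using a background thread instead of MPI: while the current chunk is handed to a send callback (standing in for the broadcast), the read of the next chunk is already in flight. The function name and the callback are hypothetical and only illustrate the overlap pattern.

```python
import io
from concurrent.futures import ThreadPoolExecutor


def double_buffered_copy(src, send, chunk_size):
    """Overlap reading chunk N+1 with sending chunk N."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(src.read, chunk_size)  # fill the first buffer
        while True:
            chunk = pending.result()
            if not chunk:
                break
            # Start filling the second buffer before sending the first.
            pending = pool.submit(src.read, chunk_size)
            send(chunk)


data = bytes(range(256)) * 5
received = []
double_buffered_copy(io.BytesIO(data), received.append, chunk_size=100)
reassembled = b''.join(received)

print(len(received), reassembled == data)
```

At most two chunks exist at any time, the one being sent and the one being read, so memory use stays bounded regardless of file size.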

Post-copy Validation

Optionally verify that all nodes received the file correctly after the copy completes.

Single-Writer Per Node

The tool enforces one MPI rank per node to prevent filesystem contention and ensure predictable performance.

Real-World Performance

In one run using 156 GPU nodes, distributing a container image achieved approximately 3 GB/s effective distribution rate (file_size/time), completing in just over 5 seconds:

[0] Copy required: yes
[0] Starting copy phase (source writes: yes)
[0] Copy complete, Bandwidth: 3007.14 MB/s
[0] Post-validation complete
[0] Timings (s):
    Topology check: 5.22463
    Source metadata: 0.00803746
    Pre-validation: 0.0046786
    Copy phase: 5.21189
    Post-validation: 2.2944e-05
    Total time: 5.2563

Because every node writes the file to its own local NVMe, the cumulative write rate across the cluster is roughly this number times the node count: ~3 GB/s × 156 ≈ 468 GB/s of total local writes.
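The arithmetic can be reproduced from the logged figures. The short Python sketch below assumes that the reported bandwidth is the file size divided by the copy-phase time and that MB means decimal megabytes; neither is stated in the log, and the helper names are made up for this example.

```python
# Figures taken from the run log above.
BANDWIDTH_MB_S = 3007.14  # "Copy complete, Bandwidth"
COPY_PHASE_S = 5.21189    # "Copy phase"
NODES = 156


def implied_file_size_mb(bandwidth_mb_s, seconds):
    """File size implied by rate = file_size / time (an assumption)."""
    return bandwidth_mb_s * seconds


def aggregate_write_rate_mb_s(bandwidth_mb_s, nodes):
    """Every node writes the full file locally, so the rates add up."""
    return bandwidth_mb_s * nodes


if __name__ == "__main__":
    size_mb = implied_file_size_mb(BANDWIDTH_MB_S, COPY_PHASE_S)
    total_mb_s = aggregate_write_rate_mb_s(BANDWIDTH_MB_S, NODES)
    print(f"implied image size: {size_mb / 1000:.1f} GB")
    print(f"aggregate local writes: {total_mb_s / 1000:.0f} GB/s")
```

With the unrounded 3007.14 MB/s the aggregate comes out at about 469 GB/s; the figure of roughly 468 GB/s in the text uses the rounded 3 GB/s.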

Workflow: Container Image Distribution

The primary use case is distributing Squashfs images to local NVMe before launching containerized workloads. Run mpi-stage as a job step before your main application:

#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive

# Stage the container image
srun --mpi=pmix ./mpi_stage \
    --source /shared/images/pytorch.sqsh \
    --dest /nvme/images/pytorch.sqsh \
    --pre-validate size \
    --verbose

# Run the actual job (from local NVMe - much faster!)
srun --container-image=/nvme/images/pytorch.sqsh ...

mpi-stage will create the destination directory if it doesn't exist.

If your container runtime supports running the image directly from shared storage, you may not strictly need this step—but staging to local NVMe tends to be faster and more predictable at large scale.

Because of the pre-validation, you can include this step in every job script without penalty—if the image is already present, it completes in milliseconds.

Getting Started

git clone https://github.com/edwardsp/mpi-stage.git
cd mpi-stage
make

For detailed usage and options, see the README.

Summary

mpi-stage started as a solution to a very specific problem—staging large container images efficiently across a large GPU cluster—but the same pattern may be useful in other scenarios where many nodes need the same large file.

By using MPI broadcasts, only a single node reads from the source filesystem, while data is distributed over the backend network using optimized collectives. In practice, this can significantly reduce load on shared filesystems and cloud-backed mounts, such as Azure Blob Storage accessed via blobfuse2, where hundreds of concurrent readers can otherwise become a bottleneck.

While container images were the initial focus, this approach could also be applied to staging training datasets, distributing model checkpoints or pretrained weights, or copying large binaries to local NVMe before a job starts. Anywhere that a “many nodes, same file” pattern exists is a potential fit.

If you're running large-scale containerized workloads on Azure HPC infrastructure, give it a try. If you use mpi-stage in other workflows, I'd love to hear what worked (and what didn't). Feedback and contributions are welcome.

Have questions or feedback? Leave a comment below or open an issue on GitHub.

Azure V710 V5 Series -AMD Radeon GPU - Validation of Siemens CAD -NX

Sunita_AZ0708 — Wed, 07 Jan 2026 16:38:08 GMT

Overview of Siemens NX

Siemens NX is a next-generation integrated CAD/CAM/CAE platform used by aerospace, automotive, industrial machinery, energy, medical, robotics, and defense manufacturers.
It spans:

Complex 3D modeling
Assemblies containing thousands to millions of parts
Surfacing and composites
Tolerance engineering
CAM and machining simulation
Integrated multiphysics through Simcenter / NX Nastran

Because NX is used to design real-world engineered systems — aircraft structures, automotive platforms, satellites, robotic arms, injection molds — its usability and performance directly affect engineering velocity and product timelines.

NX Needs GPU Acceleration

NX is highly visual.
It leans heavily on:

OpenGL acceleration
Shader-based rendering
Hidden line removal
Real-time shading / material rendering
Ray-Traced Studio for photorealistic output

Switch shading modes → CAD content must stay readable
Zoom, section, annotate → requires stable frame pacing

NVads V710 v5-Series on Azure

The NVads V710 v5-series virtual machines on Azure are designed for GPU-accelerated workloads and virtual desktop environments. Key highlights:

Hardware Specs:

o GPU: AMD Radeon™ Pro V710 (up to 24 GiB frame buffer; fractional GPU options available).

o CPU: AMD EPYC™ 9V64F (Genoa) with SMT, base frequency 3.95 GHz, peak 4.3 GHz.

o Memory: 16 GiB to 160 GiB.

o Storage: NVMe-based ephemeral local storage supported.

VM Sizes:

o Ranges from Standard_NV4ads_V710_v5 (4 vCPUs, 16 GiB RAM, 1/6 GPU) to Standard_NV28adms_V710_v5 (28 vCPUs, 160 GiB RAM, full GPU).

Supported Features:

o Premium storage, accelerated networking, ephemeral OS disk.

o Both Windows and Linux VMs supported.

o No additional GPU licensing is required.

AMD Radeon™ PRO GPUs offer:

o Optimized OpenGL professional driver stack

o Stable interactive performance with large assemblies

Business Scenario Enabled by NX + Cloud GPU

Engineering Anywhere

Distributed teams can securely work on the same assemblies from any geographic region.

Supplier Ecosystem Collaboration

Tier-1/2 manufacturers and engineering partners can access controlled models without local high-end workstations.

Secure IP Protection

Data stays in Azure — files never leave the controlled workspace.

Faster Engineering Cycles

Visualization + simulation accelerate design reviews, decision making, and manufacturability evaluations.

Scalable Cost Model

Pay for compute only when needed — ideal for burst design cycles and testing workloads.

Architecture Overview – Siemens NX on Azure NVads_v710

Key Architecture Elements

- Create the Azure virtual machine (NVads_v710_24)
- Install the Azure AMD V710 GPU drivers
- Deploy Azure Files-based storage
  Hosting assemblies, metadata, drawing packages, PMI, simulation data.
- Configure the VNet with Accelerated Networking
- Install NX licenses and software
- Install the NXCP and ATS test suites on the virtual machine

Qualitative Benchmark on Azure NVads_v710_24

Siemens has approved the following qualitative test results. The certification matrix update is currently in progress.

Technical variant:

Complex assemblies with thousands of components maintained smooth rotation, zooming, and selection, even under concurrent session load.

NXCP and ATS test results on NVads_v710_24

Non-Interactive test results:

Note: Execution Time (seconds)

ATS Non‑Interactive Test Results validate the correctness and stability of Siemens NX graphical rendering by comparing generated images against approved reference outputs. The minimal or zero pixel differences confirm deterministic and visually consistent rendering, indicating a stable GPU driver and visualization pipeline. The reported test execution times (in seconds) represent the duration required to complete each automated graphics validation scenario, demonstrating predictable and repeatable processing performance under non‑interactive conditions.

Interactive test results on Azure NVads_v710_24:

Note: Execution Time (seconds)

ATS Interactive Test Results evaluate Siemens NX graphics behavior during real‑time user interactions such as rotation, zoom, pan, sectioning, and view manipulation. The results demonstrate stable and consistent rendering during interactive workflows, confirming that the GPU driver and visualization stack reliably support user‑driven NX operations.
The measured execution times (in seconds) reflect the responsiveness of each interactive graphics operation, indicating predictable behavior under live, user‑controlled conditions rather than peak performance tuning.

NX CAD functions		Automatic Tests	Interactive Tests
Grace1 Basic Tests	GrPlayer_xp64.exe <FILE> Basic_Features.tgl	Passed!	Passed!
	GrPlayer_xp64.exe <FILE> Fog_Measurement_Clipping.tgl	Passed!	Passed!
	GrPlayer_xp64.exe <FILE> lighting.tgl	Passed!	Passed!
	GrPlayer_xp64.exe <FILE> Shadow_Bump_Environment.tgl	Passed!	Passed!
	GrPlayer_xp64.exe <FILE> Texture_Map.tgl	Passed!	Passed!
Grace2 Graphics Tests	GrPlayer_64.exe <FILE> GrACETrace.tgl	Passed!	Passed!

NXCP Test Scenarios
NXCP Test Scenarios		Automatic Tests
NXCP Gdat Tests	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_1.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_2.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_4.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_5.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_6.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_7.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_8.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_9.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_10.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_11.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_12.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_13.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_14.cgi	Passed!
	gdat_leg_xp64.exe -infile <FILE> leg_gfx_cert_15.cgi	Passed!

Benefits of Azure NVads_v710 (AMD GPU Platform for NX)

Workstation-class AMD Radeon PRO graphics drivers baked into Azure
Ensures ISV-validated driver pipeline.
Excellent performance for CAD workloads
Makes GPU-accelerated NX accessible to wider user bases.
Remote engineering enablement
Critical for companies who now operate global design teams.
Elastic scale
Spin up GPU when development peaks; scale down when idle.

Conclusion:

Siemens NX on Azure NVads_v710 powered by AMD GPUs enables enterprise-class CAD/CAM/CAE experiences in the cloud. NX benefits directly from workstation-grade OpenGL optimization, shading stability, and Ray Traced Studio acceleration, allowing engineers to interact smoothly with large assemblies, run visualization workloads, and perform design reviews without local hardware dependencies.

Right‑sized GPU delivers workstation‑class experience at lower cost

The NVads V710 v5 family enables fractional GPU allocation (down to 1/6 of a Radeon™ Pro V710), allowing Siemens NX deployments to be right-sized per user role. This avoids over-provisioning full GPUs while still delivering ISV-grade OpenGL and visualization stability, resulting in a lower per-engineer cost compared to fixed full-GPU cloud or on-prem workstations.

Elastic scale improves cost efficiency for burst engineering workloads

NVads_V710_v5 instances support on-demand scaling and ephemeral NVMe storage, allowing NX environments to scale up for design reviews, supplier collaboration, or peak integration cycles and scale down when idle. This consumption model provides a cost advantage over fixed on-prem workstations that remain underutilized outside peak engineering periods.

NX visualization pipelines benefit from balanced CPU–GPU architecture

The combination of high-frequency AMD EPYC™ Genoa CPUs (up to 4.3 GHz) and Radeon™ Pro V710 GPUs addresses Siemens NX's mixed CPU-GPU workload profile, where scene graph processing, tessellation, and OpenGL submission are CPU-sensitive. This balance reduces idle GPU cycles, improving effective utilization and overall cost efficiency when compared with GPU-heavy but CPU-constrained configurations.

The result is a scalable, secure, and cost-efficient engineering platform that supports distributed innovation, supplier collaboration, and digital product development workflows, all backed by the rendering and interaction consistency of AMD GPU virtualization on Azure.

Announcing Azure CycleCloud Workspace for Slurm: Version 2025.12.01 Release

xpillons — Wed, 07 Jan 2026 09:22:19 GMT

We are excited to announce the latest release of Azure CycleCloud Workspace for Slurm, now available with the powerful features and enhancements introduced in CycleCloud 8.8.1. This update brings significant improvements to cluster management, monitoring, security, and platform support, empowering technical communities to build and operate scalable HPC environments with greater efficiency and flexibility.

Major Feature Updates in CycleCloud Workspace for Slurm 2025.12.01

Integrated Monitoring with Prometheus self-agent and managed Grafana
Entra ID Single Sign-On (SSO) for secure and seamless authentication
Support for ARM64 compute nodes
Compatibility with Ubuntu 24.04 and AlmaLinux 9

Enhanced Monitoring: Prometheus Self Agent and Managed Grafana

With CycleCloud 8.8.1, monitoring your Slurm clusters is easier and more powerful than ever. The integration of Prometheus self-agent enables automated collection of metrics from compute nodes and Slurm jobs, providing real-time insights into cluster performance and resource utilization. Coupled with managed Grafana, users can visualize these metrics through customizable dashboards, making it simple to track system health, identify bottlenecks, and optimize workloads. This seamless monitoring solution reduces operational overhead and enhances the reliability of your HPC environment.

Create the Managed Monitoring Infrastructure

To use this feature, set up an Azure Monitor Workspace for Prometheus and an Azure Managed Grafana environment. Follow the steps outlined in the Azure/cyclecloud-monitoring repository (a cluster-init project and related tools for adding managed monitoring to a CycleCloud cluster):

Create a resource group for the monitoring infrastructure
Deploy with the provided commands

git clone https://github.com/Azure/cyclecloud-monitoring.git
cd cyclecloud-monitoring
./infra/deploy.sh <monitoring_resource_group>

After deployment to the specified resource group, you will find an Azure Monitor Workspace called ccw-mon-xxx and an Azure Managed Grafana named ccw-graf-xxx. To access the dashboards, go to the Grafana endpoint, enter the Grafana portal, and expand the Dashboards/Azure CycleCloud folder to view the available dashboards.

Depending on the node type, monitoring capabilities include:

For GPUs: tracking utilization rates, memory copy utilization, various clock speeds, temperature, power consumption, ECC error counts, and NVLink throughput statistics.

For InfiniBand: assessing throughput and error occurrences.

For other resources: evaluating CPU usage and frequency, memory utilization, disk space usage, network activity, file system capacity, as well as NFS operations and associated throughput.

Enable Monitoring

Monitoring can be enabled during Azure CycleCloud Workspace for Slurm deployment in the Marketplace UI:

You can get the “Monitoring ingestion endpoint” and “Data collection rules” from the Azure Monitor Workspace properties.

Starting with CycleCloud 8.8.1, this option is included in the Slurm default template, so you can enable monitoring directly in the cluster options.

The Client ID to be provided should correspond to the User Managed Identity assigned to the nodes, which has been granted permission to push metrics. For CCWS, this will be ccwLockerManagedIdentity.

Secure and Seamless Authentication: Entra ID SSO

The new Entra ID Single Sign-On (SSO) integration streamlines user authentication across your CycleCloud Workspace. By leveraging Microsoft Entra ID, users benefit from centralized identity management, enhanced security, and simplified access control. This feature supports multi-factor authentication and compliance requirements, making it easier for organizations to manage users and permissions while protecting sensitive HPC workloads. Entra ID SSO ensures a frictionless login experience, reducing administrative burden and improving overall security posture.

Entra ID Single Sign-On (SSO) facilitates authentication for both the CycleCloud user interface and Open OnDemand via OpenID Connect. Mapping to Linux users may be accomplished either through CycleCloud's local user creation process or through LDAP integration with the cc-ldap-auth CycleCloud cluster-init project. This article will concentrate on the former approach.

Pre-deployment Steps

Entra ID Single Sign-On (SSO) requires registering an Entra ID application before deploying a CycleCloud Workspace for Slurm environment. A user-managed identity must also be created; it takes the place of a client secret by being added to the application's federated credentials. This User Managed Identity (UMI) will be assigned to the Open OnDemand virtual machine and designated as a trusted authentication source.

Comprehensive instructions are available in our GitHub repository on the entra_instructions page.

Deployment

You can enable Entra ID SSO from the Basics tab in the latest marketplace UI; enabling it is required if you plan to deploy Open OnDemand as well.

The required values may be obtained from the output generated by the pre-deployment script executed previously.

Post Deployment

When you register the Entra ID application, placeholders are initially used for the CycleCloud and Open OnDemand IP addresses. These need to be updated later, either manually or by using this utility script.

Once the application is configured, you now need to grant permissions to users. To do this, open the app in Enterprise Applications and select Manage/Users and groups.

To add users to the relevant CycleCloud roles, select "Add user/group" and choose one or more of the predefined roles. Assign Global.Node.User to standard users; for users requiring sudo privileges, assign Global.Node.Admin; and for those engaged in cluster administration within CycleCloud, select SuperUser or Administrator as appropriate.

After roles are assigned, users must first access the CycleCloud UI before they can interact with the cluster or Open OnDemand. This process ensures user profiles are retrieved, and local accounts are created on the nodes within the clusters.

Conclusion

The 2025.12.01 release of Azure CycleCloud Workspace for Slurm delivers substantial advancements that strengthen performance, security, and usability for HPC environments. With integrated Prometheus self‑agent monitoring, managed Grafana dashboards, support for ARM64 compute architectures, and compatibility with modern Linux distributions, this update empowers teams to operate clusters with greater visibility and efficiency. The addition of Entra ID Single Sign‑On further streamlines user authentication and reinforces security across both CycleCloud and Open OnDemand interfaces.

Together, these enhancements reflect our ongoing commitment to providing a flexible, scalable, and secure HPC platform that meets the evolving needs of technical and scientific communities. We look forward to seeing how you leverage these capabilities to accelerate innovation and simplify the operation of your HPC workloads.

Private Preview: Azure Managed Prometheus on VM / VMSS

Daramfon — Wed, 18 Feb 2026 20:00:31 GMT

What’s new — Managed Prometheus now supports VMs & VMSS

Today we are excited to announce the private preview of Azure Managed Prometheus support for virtual machines (VM) and virtual machine scale sets (VMSS). Until now, Managed Prometheus on Azure was primarily targeted at containerized workloads such as Kubernetes (AKS) or Azure Arc-enabled clusters. With this preview, you can extend Prometheus-style monitoring to your IaaS workloads running on VMs/VMSS, giving you unified, scalable, resilient metric collection and observability across both containers and traditional compute, including full support for GPU and InfiniBand (IB) metric collection for HPC scenarios.

Behind the scenes, Azure Monitor provides the storage, ingestion pipeline, and query engine, while surfacing a fully compatible Prometheus experience — including scraping, PromQL, alerting rules, and dashboards.

Why this matters — especially for HPC workloads

Azure HPC customers running large fleets of GPU-accelerated VMs and VMSS nodes can now:

Collect node-level metrics (CPU, memory, disk, frontend NIC, InfiniBand) and GPU metrics (utilization, memory, clocks, ECC, throttling) through standard Prometheus exporters
Store all Prometheus metrics in an Azure Monitor Workspace
Visualize cluster performance using Azure Managed Grafana with out of the box dashboards that include cluster-level views, node-level views, and data links to easily move between them.
Run PromQL queries directly against Azure Monitor
Monitor mixed fleets (AKS + VMSS + standalone VMs) in one unified system

All of this is achieved through a fully managed Prometheus backend, with no servers, scaling, or storage to manage.

Access Requirement

This feature is currently in private preview, and your Azure subscription must be allowlisted before you can use Azure Managed Prometheus for VMs/VMSS.

Request access to the private preview

Once approved, you will be notified and can proceed with the onboarding steps in the GitHub repository.

Try it yourself

We invite you to try it out and share your feedback with us. To get started, follow the step-by-step guide in our GitHub repository to help you onboard to the preview quickly.

Once you’ve onboarded, you can begin scraping node and GPU metrics, run sample PromQL queries, and import ready-made HPC dashboards into Azure Managed Grafana.

We hope you enjoy using Azure Managed Prometheus for VM/VMSS and find the new capabilities valuable for your AI and HPC workloads. As this is a private preview, your feedback is especially important. Please share input by opening an issue in the GitHub repository.

Automating HPC Workflows with Copilot Agents

xpillons — Wed, 03 Dec 2025 10:43:26 GMT

Introduction

High Performance Computing (HPC) workloads are complex, requiring precise job submission scripts and careful resource management. Manual scripting for platforms like OpenFOAM is time-consuming, error-prone, and often frustrating. At SC25, we showcased how Copilot Agents—powered by AI—are transforming HPC workflows by automating Slurm submission scripts, making scientific computing more efficient and accessible. A full demonstration can be found in the video at the end of this article.

Why Automate HPC Workflows?

High-performance computing workloads are often elaborate, requiring carefully structured job submission scripts to efficiently manage system resources. In applications like OpenFOAM, where precise setup of nodes, tasks, and memory is essential, composing these scripts by hand can be both labor-intensive and susceptible to errors.

Manually creating Slurm scripts not only consumes valuable time but also raises the likelihood of mistakes, resulting in failed jobs and costly delays that slow research and innovation. For OpenFOAM users, this translates into spending less time on actual simulations and more time resolving script-related problems.

Automating the creation of these scripts eases the burden on researchers and engineers by accelerating research processes, minimizing errors, and enabling users to dedicate more attention to simulation and analysis instead of debugging submission issues.

AI-powered Workflow Automation

Copilot Agents use artificial intelligence to simplify writing job submission scripts, helping HPC workflows run smoothly and efficiently. With this system, users can focus less on manual scripting and more on research and analysis.

Copilot Agent recognizes your workload's context and applies best practices to create precise and optimized Slurm scripts. It interprets specific needs so that each script matches the requirements of individual jobs, which helps with resource allocation and scheduling.

Key benefits include quicker script creation, fewer mistakes, and greater consistency across HPC tasks. Automating this process speeds up the workflow and maintains standards, resulting in more dependable and repeatable job submissions.

Typical Workflow with Copilot Agents

Defining the Context: Begin by outlining your workload requirements clearly and thoroughly. Indicate how to load and run the application, specify the number of tasks per node, and detail any special logging or configuration instructions. The more accurate you are with these details, the more effectively the agent can create a reliable script.

Script Generation by AI: Copilot processes your input and automatically creates a full Slurm submission script. Using AI models, this stage incorporates best practices to save time and prevent errors.

Validation and Submission: After the script is built, it’s checked for accuracy and submitted to the scheduler. You should always examine the output and error logs and adjust as needed. This ongoing review helps ensure that jobs run smoothly and improves your workflow over time.

Best Practices for Defining Context

Consider context as your guideline: providing more specific and thorough details helps the agent produce a more accurate Slurm script. Always make your instructions straightforward and precise. Add links to relevant documentation when possible, and share example cases that show exactly what you need. Be clear about requirements like how to load applications, set the number of tasks per node, or any special configuration and logging needs. Clear and complete context not only lowers the chance of mistakes but also results in higher-quality scripts, ultimately saving you time and effort.

Script Generation: Iterative Improvement

Model Selection: Advanced models such as GPT-5 are capable of producing highly detailed and comprehensive scripts. Although the initial draft may require additional time to generate, these models typically integrate best practices and sophisticated configuration options, which can be further refined through iterative development.

Iterative Improvement: The initial script produced by AI generally serves as a starting point for further enhancement. Systematic revisions informed by output logs, error reports, and user feedback contribute to improving the accuracy, efficiency, and customization of the final submission script according to the specific needs of your HPC workload.

Practical Example: As demonstrated in the video below, a chat-based Copilot Agent facilitates script creation by prompting for the script name and subsequently generating a Bash script that incorporates all requested features. These include leveraging Slurm environment variables, automating task distribution, loading requisite modules, and enabling comprehensive logging. The resulting script is prepared for submission via the sbatch command.
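To make the shape of such a script concrete, here is a small Python sketch that renders a Slurm submission script from a context description. It is not output from Copilot Agent and not the script shown in the video: the function `render_slurm_script`, the module name `openfoam`, and the `simpleFoam -parallel` command are illustrative assumptions, while the `#SBATCH` options, the `%x`/`%j` filename patterns, and the `SLURM_*` variables are standard Slurm features.

```python
def render_slurm_script(job_name, nodes, tasks_per_node, modules, command):
    """Build a Slurm batch script as a string from a few context fields."""
    lines = [
        "#!/bin/bash",
        f"#SBATCH --job-name={job_name}",
        f"#SBATCH --nodes={nodes}",
        f"#SBATCH --ntasks-per-node={tasks_per_node}",
        "#SBATCH --output=%x-%j.out",  # %x = job name, %j = job ID
        "#SBATCH --error=%x-%j.err",
        "",
        # Log the allocation using Slurm-provided environment variables.
        'echo "Job ${SLURM_JOB_ID}: ${SLURM_JOB_NUM_NODES} nodes, '
        '${SLURM_NTASKS_PER_NODE} tasks per node"',
        "",
    ]
    lines += [f"module load {name}" for name in modules]
    lines.append(f"srun {command}")
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    script = render_slurm_script(
        job_name="cavity",
        nodes=2,
        tasks_per_node=4,
        modules=["openfoam"],            # hypothetical module name
        command="simpleFoam -parallel",  # hypothetical workload command
    )
    print(script)
```

Writing the returned string to a file and passing it to sbatch would follow the same review-and-iterate loop described in the next section.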

Validation and Continuous Improvement

Once you have generated your Slurm submission script using Copilot Agent, it is essential to conduct a careful review of the output prior to executing your job. This preliminary assessment is critical for identifying potential issues early and ensuring that the script aligns with your specific workload requirements.

Submit the job to the scheduler for validation, and diligently monitor both the output and error log files, as these will inform your subsequent actions.

Should errors arise—such as missing file paths or incorrect module loads—utilize the feedback from the logs to amend your script accordingly. This iterative refinement process is fundamental to optimizing your workflow and achieving reliable job execution.

The accompanying example illustrates how Copilot Agent can assist in locating and correcting errors, such as updating an OpenFOAM tutorial path. By leveraging AI-enabled feedback, users are able to efficiently address issues and confidently resubmit jobs.

Continuous validation and revision are paramount to advancing high-performance computing automation. Consistently refer to output and error logs to guide subsequent iterations, thereby enhancing the robustness and dependability of your scripts over time.

Key Benefits

Time Efficiency: Copilot Agents significantly decrease the time needed to generate job submission scripts. Tasks that previously required hours of manual scripting can now be completed within minutes, enabling researchers and engineers to give more attention to simulation and analysis rather than script troubleshooting.

Error Reduction: Automation substantially lowers the risk of human error commonly associated with manual script development. By enforcing best practices and standardizing the script generation process, Copilot Agents improve reliability and minimize job failures.

Enhanced Scalability: Automated workflows facilitate more efficient scaling across high-performance computing (HPC) environments. As workloads increase in complexity and scale, Copilot Agents support consistency and optimal resource utilization, simplifying the management of expansive simulations.

User-Friendly Automation: Copilot Agents make HPC scripting more approachable for new users by offering intuitive automation and guidance. This approach ensures adherence to best practices and broadens accessibility, even for individuals with limited prior experience.

Azure NCv6 Public Preview: The new Unified Platform for Converged AI and Visual Computing

rishabv90 — Tue, 25 Nov 2025 17:22:05 GMT

As enterprises accelerate adoption of physical AI (AI models interacting with real-world physics), digital twins (virtual replicas of physical systems), LLM inference (running language models for predictions), and agentic workflows (autonomous AI-driven processes), the demand for infrastructure that bridges high-end visualization and generative AI inference has never been higher. Today, we are pleased to announce the Public Preview of the NC RTX PRO 6000 BSE v6 series, powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs.

The NCv6 series represents a generational leap in Azure’s visual compute portfolio, designed to be the dual engine for both Industrial Digitalization and cost-effective LLM inference. By leveraging NVIDIA Multi-Instance GPU (MIG) capabilities, the NCv6 platform offers affordable sizing options similar to our legacy NCv3 and NVv5 series. This provides a seamless upgrade path to Blackwell performance, enabling customers to run complex NVIDIA Omniverse simulations and multimodal AI agents with greater efficiency.

Why Choose Azure NCv6?

While traditional GPU instances often force a choice between "compute" (AI) and "graphics" (visualization) optimizations, the NCv6 breaks this silo. Built on the NVIDIA Blackwell architecture, it provides a "right-sized" acceleration platform for workloads that demand both ray-traced fidelity and Tensor Core performance.

As outlined in our product documentation, these VMs are ideal for converged AI and visual computing workloads, including:

Real-time digital twin and NVIDIA Omniverse simulation.
LLM Inference and RAG (Retrieval-Augmented Generation) on small to medium AI models.
High-fidelity 3D rendering, product design, and video streaming.
Agentic AI application development and deployment.
Scientific visualization and High-Performance Computing (HPC).

Key Features of the NCv6 Platform

The Power of NVIDIA Blackwell

At the heart of the NCv6 is the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU. This powerhouse delivers breakthrough performance featuring 96 GB of ultra-fast GDDR7 memory. This massive frame buffer allows for the handling of complex multimodal AI models and high-resolution textures that previous generations simply could not fit.

Host Performance: Intel Granite Rapids

To ensure your workloads aren't bottlenecked by the CPU, the VM host is equipped with Intel Xeon Granite Rapids processors. These provide an all-core turbo frequency of up to 4.2 GHz, ensuring that demanding pre- and post-processing steps—common in rendering and physics simulations—are handled efficiently.

Optimized Sizing for Every Workflow

We understand that one size does not fit all. The NCv6 series introduces three distinct sizing categories to match your specific unit economics:

General Purpose: Balanced CPU-to-GPU ratios (up to 320 vCPUs) for diverse workloads.
Compute Optimized: Higher vCPU density for heavy simulation and physics tasks.
Memory Optimized: Massive memory footprints (up to 1,280 GB RAM) for data-intensive applications.

Crucially, for smaller inference jobs or VDI, we will also offer fractional GPU options, allowing you to right-size your infrastructure and optimize costs.

NCv6 Technical Specifications

Specification	Details
GPU	NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
Processor	Intel Xeon Granite Rapids (up to 4.2 GHz Turbo)
vCPUs	16 – 320 vCPUs (Scalable across GP, Compute, and Memory optimized sizes)
System Memory	64 GB – 1,280 GB DDR5
Network	Up to 200,000 Mbps (200 Gbps) Azure Accelerated Networking
Storage	Up to 2 TB local temp storage; support for Premium SSD v2 & Ultra Disk

Real-World Applications

The NCv6 is built for versatility, powering everything from pixel-perfect rendering to high-throughput language reasoning:

Production Generative AI & Inference: Deploy self-hosted LLMs and RAG pipelines with optimized unit economics. The NCv6 is ideal for serving ranking models, recommendation engines, and content generation agents where low latency and cost-efficiency are paramount.
Automotive & Manufacturing: Validate autonomous driving sensors (LiDAR/Radar) and train physical AI models in high-fidelity simulation environments before they ever touch the real world.
Next-Gen VDI & Azure Virtual Desktop: Modernize remote workstations with NVIDIA RTX Virtual Workstation capabilities. By leveraging fractional GPU options, organizations can deliver high-fidelity, accelerated desktop experiences to distributed teams—offering a superior, high-density alternative to legacy NVv5 deployments.
Media & Entertainment: Accelerate render farms for VFX studios requiring burst capacity, while simultaneously running generative AI tools for texture creation and scene optimization.

Conclusion: The Engine for the Era of Converged AI

The Azure NCv6 series redefines the boundaries of cloud infrastructure. By combining the raw power of NVIDIA’s Blackwell architecture with the high-frequency performance of Intel Granite Rapids, we are moving beyond just "visual computing." Innovators can now leverage a unified platform to build the industrial metaverse, deploy intelligent agents, and scale production AI—all with the enterprise-grade security and hybrid reach of Azure.

Ready to experience the next generation? Sign up for the NCv6 Public Preview here.

Azure ND GB300 v6 now Generally Available - Hyper-optimized for Generative and Agentic AI workloads

Nitin_Nagarkatte — Wed, 19 Nov 2025 01:13:53 GMT

We are pleased to announce the General Availability (GA) of ND GB300 v6 virtual machines, delivering the next leap in AI infrastructure. On 10/09, we shared the delivery of the first at-scale production cluster with more than 4,600 NVIDIA GB300 NVL72, featuring NVIDIA Blackwell Ultra GPUs connected through the next-generation NVIDIA InfiniBand network. We have now deployed tens of thousands of GB300 GPUs for production customer workloads and expect to scale to hundreds of thousands. Built on NVIDIA GB300 NVL72 systems, these VMs redefine performance for frontier model training, large-scale inference, multimodal reasoning, and agentic AI.

The ND GB300 v6 series enables customers to:

Deploy trillion-parameter models with unprecedented throughput.
Accelerate inference for long-context and multimodal workloads.
Scale seamlessly at high bandwidth for large-scale training workloads.

In recent benchmarks, ND GB300 v6 achieved over 1.1 million tokens per second on Llama 2 70B inference workloads - a 27% uplift over ND GB200 v6. This performance breakthrough enables customers to serve long-context, multimodal, and agentic AI models with unmatched speed and efficiency.

With the general availability of ND GB300 v6 VMs, Microsoft strengthens its long-standing collaboration with NVIDIA by leading the market in delivering the latest GPU innovations, reaffirming our commitment to world-class AI infrastructure.

The ND GB300 v6 systems are built in a rack-scale design, with each rack hosting 18 VMs for a total of 72 GPUs interconnected by high-speed NVLink. Each VM has 2 NVIDIA Grace CPUs and 4 Blackwell Ultra GPUs. Each NVLink-connected rack contains:

72 NVIDIA Blackwell Ultra GPUs (with 36 NVIDIA Grace CPUs).
800 gigabits per second (Gb/s) per GPU of cross-rack scale-out bandwidth via next-generation NVIDIA Quantum-X800 InfiniBand (2x ND GB200 v6).
130 terabytes per second (TB/s) of NVIDIA NVLink bandwidth within the rack.
37 TB of fast memory (~20 TB HBM3e + ~17 TB LPDDR).
Up to 1,440 petaflops (PFLOPS) of FP4 Tensor Core performance (1.5x ND GB200 v6).
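The rack-level figures above can be broken down per GPU with simple arithmetic. The sketch below uses only the numbers quoted in this section. The per-GPU values are derived by even division and are approximations, not published per-GPU specifications.

```python
# Rack-level figures quoted in the text above.
VMS_PER_RACK = 18
GPUS_PER_VM = 4
CPUS_PER_VM = 2
SCALE_OUT_GBIT_PER_GPU = 800    # Gb/s per GPU over InfiniBand
RACK_NVLINK_TBYTE_PER_S = 130   # TB/s of NVLink bandwidth within the rack
RACK_HBM_TB = 20                # approximate HBM3e capacity per rack
RACK_FP4_PFLOPS = 1440          # "up to" figure per rack


def rack_summary() -> dict:
    """Derive per-rack totals and per-GPU shares by even division."""
    gpus = VMS_PER_RACK * GPUS_PER_VM
    return {
        "gpus": gpus,
        "cpus": VMS_PER_RACK * CPUS_PER_VM,
        "fp4_pflops_per_gpu": RACK_FP4_PFLOPS / gpus,
        "hbm_gb_per_gpu": RACK_HBM_TB * 1000 / gpus,
        "nvlink_tbyte_per_s_per_gpu": RACK_NVLINK_TBYTE_PER_S / gpus,
        "scale_out_tbit_per_s_per_rack": SCALE_OUT_GBIT_PER_GPU * gpus / 1000,
    }


if __name__ == "__main__":
    for key, value in rack_summary().items():
        print(f"{key}: {value:.1f}" if isinstance(value, float) else f"{key}: {value}")
```

The totals reproduce the 72 GPUs and 36 Grace CPUs listed above, and imply up to 20 PFLOPS of FP4 per GPU and 57.6 Tb/s of aggregate scale-out bandwidth per rack.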

Together, NVLink and XDR InfiniBand enable GB300 systems to behave as a unified compute and memory pool, minimizing latency, maximizing bandwidth, and dramatically improving scalability. Within a rack, NVLink enables coherent memory access and fast synchronization for tightly coupled workloads. Across racks, XDR InfiniBand provides ultra-low latency, high-throughput communication with SHARP offloading, maintaining sub-100 µs latency for cross-node collectives.

Azure provides an end-to-end AI platform that enables customers to build, deploy, and scale AI workloads efficiently on GB300 infrastructure. Services like Azure CycleCloud and Azure Batch simplify the setup and management of HPC and AI environments, allowing organizations to dynamically adjust resources, integrate leading schedulers, and run containerized workloads at massive scale. With tools such as CycleCloud Workspace for Slurm, users can create and configure clusters without prior expertise, while Azure Batch handles millions of parallel tasks, ensuring cost and resource efficiency for large-scale training.

For cloud-native AI, Azure Kubernetes Service (AKS) offers rapid deployment and management of containerized workloads, complemented by platform-specific optimizations for observability and reliability. Whether using Kubernetes or custom stacks, Azure delivers a unified suite of services to maximize performance and scalability.

Learn More & Get Started

Announcing the Public Preview of AMLFS 20: Azure Managed Lustre New SKU for Massive AI&HPC Workloads

wolfgangdesalvador — Tue, 18 Nov 2025 17:00:00 GMT

Sachin Sheth - Principal PDM Manager

Brian Barbisch - Principal Group Software Engineering Manager

Matt White - Principal Group Software Engineering Manager

Brian Lepore - Principal Product Manager

Wolfgang De Salvador - Senior Product Manager

Ron Hogue - Senior Product Manager

Introduction

We are excited to announce the Public Preview of AMLFS Durable Premium 20 (AMLFS 20), a new SKU in Azure Managed Lustre designed to deliver unprecedented performance and scale for demanding AI and HPC workloads.

Key Features

Massive Scale: Store up to 25 PiB of data in a single namespace, with up to 512 GB/s of total bandwidth.

Advanced Metadata Performance: Multi-MDS (Metadata Server) architecture dramatically improves metadata IOPS. In mdtest benchmarks, AMLFS 20 demonstrated more than 5x improvement in metadata operations. An additional MDS is provided for every 5 PiB of provisioned filesystem.

High File Capacity: Supports up to 20 billion inodes for maximum namespace size.
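A quick back-of-the-envelope sketch relates the three headline limits above. It assumes bandwidth scales linearly with provisioned capacity, which this announcement does not state; only the maxima are taken from the text. Under that assumption the maximum configuration works out to 20 MB/s per TiB, which is consistent with the "20" in the SKU name.

```python
# Maxima quoted in the Key Features list above.
MAX_CAPACITY_PIB = 25
MAX_BANDWIDTH_GB_PER_S = 512
MAX_INODES = 20e9

TIB_PER_PIB = 1024
BYTES_PER_PIB = 2 ** 50
BYTES_PER_MIB = 2 ** 20


def bandwidth_per_tib_mb_s(capacity_pib: float, bandwidth_gb_per_s: float) -> float:
    """Bandwidth density in MB/s per provisioned TiB."""
    return bandwidth_gb_per_s * 1000 / (capacity_pib * TIB_PER_PIB)


def avg_file_size_mib_at_limits(capacity_pib: float, inodes: float) -> float:
    """Average file size (MiB) if capacity and inode count are both exhausted."""
    return capacity_pib * BYTES_PER_PIB / inodes / BYTES_PER_MIB


if __name__ == "__main__":
    density = bandwidth_per_tib_mb_s(MAX_CAPACITY_PIB, MAX_BANDWIDTH_GB_PER_S)
    avg = avg_file_size_mib_at_limits(MAX_CAPACITY_PIB, MAX_INODES)
    print(f"Bandwidth density: {density:.1f} MB/s per TiB")
    print(f"Average file size at both limits: {avg:.2f} MiB")
```

The second figure is a rough guide: datasets whose average file size is well below about 1.3 MiB would reach the inode limit before the capacity limit.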

Why AMLFS 20 Matters

Simplified Architecture: Previously, datasets larger than 12.5 PiB required multiple filesystems and complex management. AMLFS 20 enables a single, high-performance file system for massive AI and HPC workloads up to 25 PiB, streamlining deployment and administration.

Accelerated Data Preparation: The multi-MDT architecture significantly increases metadata IOPS, which is crucial during the data preparation stage of AI training, where rapid access to millions of files is required.

Faster Time-to-Value: Researchers and engineers benefit from easier management, reduced bottlenecks, and faster access to large datasets, accelerating innovation.

Availability

AMLFS 20 is available in Public Preview alongside the already existing AMLFS SKUs. For more details on other SKUs, visit the Azure Managed Lustre documentation.

How to Join the Preview

If you are working with large-scale AI or HPC workloads and would like early access to AMLFS 20, we invite you to fill out this form to tell us about your use case. Our team will follow up with onboarding details.

Azure CycleCloud 8.8 and CCWS 1.2 at SC25 and Ignite

anhoward — Wed, 19 Nov 2025 15:02:15 GMT

Azure CycleCloud 8.8: Advancing HPC & AI Workloads with Smarter Health Checks

Azure CycleCloud continues to evolve as the backbone for orchestrating high-performance computing (HPC) and AI workloads in the cloud. With the release of CycleCloud 8.8, users gain access to a suite of new features designed to streamline cluster management, enhance health monitoring, and future-proof their HPC environments.

Key Features in CycleCloud 8.8

1. ARM64 HPC Support

The platform expands its hardware compatibility with ARM64 HPC support, opening new possibilities for energy-efficient and cost-effective compute clusters. This includes access to the newer generation of GB200 VMs as well as general ARM64 support, enabling AI workloads at a scale that was not previously possible.

2. Slurm Topology-Aware Scheduling

The integration of topology-aware scheduling for Slurm clusters allows CycleCloud users to optimize job placement based on network and hardware topology. This leads to improved performance for tightly coupled HPC workloads and better utilization of available resources.
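For readers unfamiliar with how Slurm describes topology, the sketch below generates a small topology.conf in the format used by Slurm's topology/tree plugin (enabled with TopologyPlugin=topology/tree in slurm.conf). It is an illustration only: the switch names and node hostlists are hypothetical, and it is not how CycleCloud itself produces or manages the topology file.

```python
def tree_topology_conf(racks: dict, root: str = "root") -> str:
    """Build topology.conf text for Slurm's topology/tree plugin.

    `racks` maps a leaf switch name to a Slurm hostlist expression.
    All names passed in are hypothetical placeholders.
    """
    # One leaf switch per rack, listing the nodes attached to it.
    lines = [f"SwitchName={name} Nodes={nodes}" for name, nodes in racks.items()]
    # A single top-level switch connecting every leaf switch.
    lines.append(f"SwitchName={root} Switches={','.join(racks)}")
    return "\n".join(lines)


if __name__ == "__main__":
    example = {
        "rack1": "gpu-[1-18]",    # hypothetical node names
        "rack2": "gpu-[19-36]",
    }
    print(tree_topology_conf(example))
```

With a file like this in place, Slurm prefers allocations that keep a job's nodes under as few switches as possible, which is what benefits tightly coupled workloads.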

3. Nvidia MNNVL and IMEX Support

With expanded support for Nvidia MNNVL and IMEX, CycleCloud 8.8 ensures compatibility with the latest GPU technologies. This enables users to leverage cutting-edge hardware for AI training, inference, and scientific simulations.

4. HealthAgent: Event-Driven Health Monitoring and Alerting

A standout feature in this release is the enhanced HealthAgent, which delivers event-driven health monitoring and alerting. CycleCloud now proactively detects issues across clusters, nodes, and interconnects, providing real-time notifications and actionable insights. This improvement is a game-changer for maintaining uptime and reliability in large-scale HPC deployments. HealthAgent supports both impactful health checks, which can only run while nodes are idle, and non-impactful health checks, which can run throughout the lifecycle of a job. This allows CycleCloud to alert not only on issues that occur while nodes are starting, but also on failures that develop on long-running nodes.
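The distinction between impactful and non-impactful checks can be sketched as a simple gating rule. This is a conceptual illustration only: the class, function, and check names below are hypothetical and do not reflect HealthAgent's actual interface or configuration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HealthCheck:
    name: str         # hypothetical check name
    impactful: bool   # True if the check would disturb a running job


def runnable_checks(checks: list, node_is_idle: bool) -> list:
    """Return the checks that may run given the node's current state.

    Impactful checks are allowed only on idle nodes; non-impactful
    checks are always allowed.
    """
    return [c for c in checks if node_is_idle or not c.impactful]


if __name__ == "__main__":
    checks = [
        HealthCheck("gpu_stress_test", impactful=True),
        HealthCheck("kernel_log_scan", impactful=False),
    ]
    print([c.name for c in runnable_checks(checks, node_is_idle=True)])
    print([c.name for c in runnable_checks(checks, node_is_idle=False)])
```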

Later releases of CycleCloud will also include automatic remediation for common failures, so stay tuned!

5. Enterprise Linux 9 and Ubuntu 24 support

One common request has been wider support for the various Enterprise Linux (EL) 9 variants, including RHEL9, AlmaLinux 9, and Rocky Linux 9. CycleCloud 8.8 introduces support for those distributions as well as the latest Ubuntu HPC release.

Why These Features Matter

The CycleCloud 8.8 release marks a significant leap forward for organizations running HPC and AI workloads in Azure. The improved health check support, anchored by HealthAgent and with automated remediation planned for later releases, means less downtime, faster troubleshooting, and greater confidence in cloud-based research and innovation.

Whether you’re managing scientific simulations, AI model training, or enterprise analytics, CycleCloud’s latest features help you build resilient, scalable, and future-ready HPC environments.

Key Features in CycleCloud Workspace for Slurm 1.2

Along with the release of CycleCloud 8.8 comes a new CycleCloud Workspace for Slurm (CCWS) release. This release includes the General Availability of features that were previously in preview, such as Open OnDemand, Cendio ThinLinc, and managed Grafana monitoring capabilities.

In addition to previously announced features, CCWS 1.2 also includes support for a new Hub and Spoke deployment model. This allows customers to retain a central hub of shared resources that can be re-used between cluster deployments, with "disposable" spoke clusters that branch from the hub. Hub and Spoke deployments support customers who need to re-deploy clusters in order to upgrade their operating system, deploy new versions of software, or even reconfigure the overall architecture of Slurm clusters.

Come visit us at SC25 and MS Ignite

To learn more about these features, come visit us at the Microsoft booth at #SC25 in St. Louis, MO and #Microsoft #Ignite in San Francisco this week!

Join Microsoft @ SC25: Experience HPC and AI Innovation

Fernando_Aznar — Fri, 14 Nov 2025 21:24:37 GMT

Supercomputing 2025 is coming to St. Louis, MO, November 16–21! Visit Microsoft Booth #1627 to explore cutting-edge HPC and AI solutions, connect with experts, and experience interactive demos that showcase the future of compute. Whether you’re attending technical sessions, stopping by for a coffee, or joining our partner events, we’ve got something for everyone.

Booth Highlights

Alpine Formula 1 Showcar: Snap a photo with a real Alpine F1 car and learn how high-performance computing drives innovation in motorsports.
Silicon Wall: Discover silicon diversity—featuring chips from our partners AMD and NVIDIA, alongside Microsoft’s own first-party silicon: Maia, Cobalt, and Majorana.
NVIDIA Weather Modeling Demo: See how AI and HPC predict extreme weather events with Tomorrow.io and NVIDIA technology.
Coffee Bar with Barista: Enjoy a handcrafted coffee while you connect with our experts.
Immersive Screens: Watch live demos and visual stories about HPC breakthroughs and AI innovation.
Hardware Bar: Explore AMD EPYC™ and NVIDIA GB200 systems powering next-generation workloads.

Conference Details

Conference week: Sun, Nov 16 – Fri, Nov 21
Expo hours (CST):
- Mon, Nov 17: 7:00–9:00 PM (Opening Night)
- Tue, Nov 18: 10:00 AM–6:00 PM
- Wed, Nov 19: 10:00 AM–6:00 PM
- Thu, Nov 20: 10:00 AM–3:00 PM
Customer meeting rooms: Four Seasons Hotel

Quick links

RSVP — Microsoft + AMD Networking Reception (Tue, Nov 18): https://aka.ms/MicrosoftAMD-Mixer
RSVP — Microsoft + NVIDIA Panel Luncheon (Wed, Nov 19): Luncheon is now closed as the event is fully booked.

Earned Sessions (Technical Program)

Sunday, Nov 16

Session Type	Time (CST)	Title	Microsoft Contributor(s)	Location
Tutorial	8:30 AM–5:00 PM	Delivering HPC: Procurement, Cost Models, Metrics, Value, and More	Andrew Jones	Room 132
Tutorial	8:30 AM–5:00 PM	Modern High Performance I/O: Leveraging Object Stores	Glenn Lockwood	Room 120
Workshop	2:00–5:30 PM	14th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2025)	Torsten Hoefler	Room 265

Monday, Nov 17

Session Type	Time (CST)	Title	Microsoft Contributor(s)	Location
Early Career Program	3:30–4:45 PM	Voices from the Field: Navigating Careers in Academia, Government, and Industry	Joe Greenseid	Room 262
Workshop	3:50–4:20 PM	Towards Enabling Hostile Multi-tenancy in Kubernetes	Ali Kanso; Elzeiny Mostafa; Gurpreet Virdi; Slava Oks	Room 275
Workshop	5:00–5:30 PM	On the Performance and Scalability of Cloud Supercomputers: Insights from Eagle and Reindeer	Amirreza Rastegari; Prabhat Ram; Michael F. Ringenburg	Room 267

Tuesday, Nov 18

Session Type	Time (CST)	Title	Microsoft Contributor(s)	Location
BOF	12:15–1:15 PM	High Performance Software Foundation BoF	Joe Greenseid	Room 230
Poster	5:30–7:00 PM	Compute System Simulator: Modeling the Impact of Allocation Policy and Hardware Reliability on HPC Cloud Resource Utilization	Jarrod Leddy; Huseyin Yildiz	Second Floor Atrium

Wednesday, Nov 19

Session Type	Time (CST)	Title	Microsoft Contributor(s)	Location
BOF	12:15–1:15 PM	The Future of Python on HPC Systems	Michael Droettboom	Room 125
BOF	12:15–1:15 PM	Autonomous Science Network: Interconnected Autonomous Science Labs Empowered by HPC and Intelligent Agents	Joe Tostenrude	Room 131
Paper	1:30–1:52 PM	Uno: A One‑Stop Solution for Inter‑ and Intra‑Data Center Congestion Control and Reliable Connectivity	Abdul Kabbani; Ahmad Ghalayini; Nadeen Gebara; Terry Lam	Rooms 260–267
Paper	2:14–2:36 PM	SDR‑RDMA: Software‑Defined Reliability Architecture for Planetary‑Scale RDMA Communication	Abdul Kabbani; Jie Zhang; Jithin Jose; Konstantin Taranov; Mahmoud Elhaddad; Scott Moe; Sreevatsa Anantharamu; Zhuolong Yu	Rooms 260–267
Panel	3:30–5:00 PM	CPUs Have a Memory Problem — Designing CPU‑Based HPC Systems with Very High Memory Bandwidth	Joe Greenseid	Rooms 231–232
Paper	4:36–4:58 PM	SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations	Kun Li; Liang Yuan; Ting Cao; Mao Yang	Rooms 260–267

Thursday, Nov 20

Session Type	Time (CST)	Title	Microsoft Contributor(s)	Location
BOF	12:15–1:15 PM	Super(computing)heroes	Laura Parry	Rooms 261–266
Paper	3:30–3:52 PM	Workload Intelligence: Workload‑Aware IaaS Abstraction for Cloud Efficiency	Anjaly Parayil; Chetan Bansal; Eli Cortez; Íñigo Goiri; Jim Kleewein; Jue Zhang; Pantea Zardoshti; Pulkit Misra; Raphael Ghelman; Ricardo Bianchini; Rodrigo Fonseca; Saravan Rajmohan; Xiaoting Qin	Room 275
Paper	4:14–4:36 PM	From Deep Learning to Deep Science: AI Accelerators Scaling Quantum Chemistry Beyond Limits	Fusong Ju; Kun Li; Mao Yang	Rooms 260–267

Friday, Nov 21

Session Type	Time (CST)	Title	Microsoft Contributor(s)	Location
Workshop	9:00 AM–12:30 PM	Eleventh International Workshop on Heterogeneous High‑performance Reconfigurable Computing (H2RC 2025)	Torsten Hoefler	Room 263

Booth Theater Sessions

Monday, Nov 17 — 7:00 PM–9:00 PM

Time (CST)	Session Title	Presenter(s)
8:00–8:20 PM	Inside the World’s Most Powerful AI Data Center	Chris Jones
8:30–8:50 PM	Transforming Science and Engineering — Driven by Agentic AI, Powered by HPC	Joe Tostenrude

Tuesday, Nov 18 — 10:00 AM–6:00 PM

Time (CST)	Session Title	Presenter(s)
11:00–11:50 AM	Ignite Keynotes
12:00–12:20 PM	Accelerating AI workloads with Azure Storage	Sachin Sheth; Wolfgang De Salvador
12:30–12:50 PM	Accelerate Memory Bandwidth‑Bound Workloads with Azure HBv5, now GA	Jyothi Venkatesh
1:00–1:20 PM	Radiation & Health Companion: AI‑Driven Flight‑Dose Awareness	Olesya Sarajlic
1:30–1:50 PM	Ascend HPC Lab: Your On‑Ramp to GPU‑Powered Innovation	Daniel Cooke (Oakwood)
2:00–2:20 PM	Azure AMD HBv5: Redefining CFD Performance and Value in the Cloud	Rick Knoechel (AMD)
2:30–2:50 PM	Empowering High Performance Life Sciences Workloads on Azure	Qumulo
3:00–3:20 PM	Transforming Science and Engineering — Driven by Agentic AI, Powered by HPC	Joe Tostenrude
4:00–4:20 PM	Unleashing AMD EPYC on Azure: Scalable HPC for Energy and Manufacturing	Varun Selvaraj (AMD)
4:30–4:50 PM	Automating HPC Workflows with Copilot Agents	Xavier Pillons
5:00–5:20 PM	Scaling the Future: NVIDIA’s GB300 NVL72 Rack for Next‑Generation AI Inference	Kirthi Devleker (NVIDIA)
5:30–5:50 PM	Enabling AI and HPC Workloads in the Cloud with Azure NetApp Files	Andy Chan

Wednesday, Nov 19 — 10:00 AM–6:00 PM

Time (CST)	Session Title	Presenter(s)
10:30–10:50 AM	AI‑Powered Digital Twins for Industrial Engineering	John Linford (NVIDIA)
11:00–11:20 AM	Advancing 5 Generations of HPC Innovation with AMD on Azure	Allen Leibovitch (AMD)
11:30–11:50 AM	Intro to LoRA Fine‑Tuning on Azure	Christin Pohl
12:00–12:20 PM	VAST + Microsoft: Building the Foundation for Agentic AI	Lior Genzel (VAST Data)
12:30–12:50 PM	Inside the World’s Most Powerful AI Data Center	Chris Jones
1:00–1:20 PM	Supervised GenAI Simulation – Stroke Prognosis (NVads V710 v5)	Kurt Niebuhr
1:30–1:50 PM	What You Don’t See: How Azure Defines VM Families	Anshul Jain
2:00–2:20 PM	Hammerspace Tier 0: Unleashing GPU Storage Performance on Azure	Raj Sharma (Hammerspace)
2:30–2:50 PM	GM Motorsports: Accelerating Race Performance with AI Physics on Rescale	Bernardo Mendez (Rescale)
3:00–3:20 PM	Hurricane Analysis and Forecasting on the Azure Cloud	Salar Adili (Microsoft); Unni Kirandumkara (GDIT); Stefan Gary (Parallel Works)
3:30–3:50 PM	Performance at Scale: Accelerating HPC & AI Workloads with WEKA on Azure	Desiree Campbell; Wolfgang De Salvador
4:00–4:20 PM	Pushing the Limits of Performance: Supercomputing on Azure AI Infrastructure	Biju Thankachen; Ojasvi Bhalerao
4:30–4:50 PM	Accelerating Momentum: Powering AI & HPC with AMD Instinct™ GPUs	Jay Cayton (AMD)

Thursday, Nov 20 — 10:00 AM–3:00 PM

Time (CST)	Session Title	Presenter(s)
11:30–11:50 AM	Intro to LoRA Fine‑Tuning on Azure	Christin Pohl
12:00–12:20 PM	Accelerating HPC Workflows with Ansys Access on Microsoft Azure	Dr. John Baker (Ansys)
12:30–12:50 PM	Accelerate Memory Bandwidth‑Bound Workloads with Azure HBv5, now GA	Jyothi Venkatesh
1:00–1:20 PM	Pushing the Limits: Supercomputing on Azure AI Infrastructure	Biju Thankachen; Ojasvi Bhalerao
1:30–1:50 PM	The High Performance Software Foundation	Todd Gamblin (HPSF)
2:00–2:20 PM	Heidi AI — Deploying Azure Cloud Environments for Higher‑Ed Students & Researchers	James Verona (Adaptive Computing); Dr. Sameer Shende (UO/ParaTools)

Partner Session Schedule

Tuesday, Nov 18

Date	Time (CST)	Title	Microsoft Contributor(s)	Location
Nov 18	11:00 AM–11:50 AM	Cloud Computing for Engineering Simulation	Joe Greenseid	Ansys Booth
Nov 18	1:00 PM–1:30 PM	Revolutionizing Simulation with Artificial Intelligence	Joe Tostenrude	Ansys Booth
Nov 18	4:30 PM–5:00 PM	[HBv5]	Jyothi Venkatesh	AMD Booth

Wednesday, Nov 19

Date	Time (CST)	Title	Microsoft Contributor(s)	Location
Nov 19	11:30 AM–1:30 PM	Accelerating Discovery: How HPC and AI Are Shaping the Future of Science (Lunch Panel)	Andrew Jones (Moderator); Joe Greenseid (Panelist)	Ruth's Chris Steak House
Nov 19	1:00 PM–1:30 PM	VAST and Microsoft	Kanchan Mehrotra	VAST Booth

Demo Pods at Microsoft Booth

Azure HPC & AI Infrastructure

Explore how Azure delivers high-performance computing and AI workloads at scale. Learn about VM families, networking, and storage optimized for HPC.

Agentic AI for Science

See how autonomous agents accelerate scientific workflows, from simulation to analysis, using Azure AI and HPC resources.

Hybrid HPC with Azure Arc

Discover how Azure Arc enables hybrid HPC environments, integrating on-prem clusters with cloud resources for flexibility and scale.

Ancillary Events (RSVP Required)

Microsoft + AMD Networking Reception — Tuesday Night

When: Tue, Nov 18, 6:30–10:00 PM (CST)

Where: UMB Champions Club, Busch Stadium

RSVP: https://aka.ms/MicrosoftAMD-Mixer

Microsoft + NVIDIA Panel Luncheon — Wednesday

When: Wed, Nov 19, 11:30 AM–1:30 PM (CST)

Where: Ruth’s Chris Steak House

Topic: Accelerating Discovery: How AI and HPC Are Shaping the Future of Science

Panelists: Dan Ernst (NVIDIA); Rollin Thomas (NERSC); Joe Greenseid (Microsoft); Antonia Maar (Intersect360 Research); Fernanda Foertter (University of Alabama)

RSVP: Luncheon is now closed as the event is fully booked.

Conclusion

We’re excited to connect with you at SC25! Whether you’re exploring our booth demos, attending technical sessions, or joining one of our partner events, this is your opportunity to experience how Microsoft is driving innovation in HPC and AI.

Stop by Booth #1627 to see the Alpine F1 showcar, explore the Silicon Wall featuring AMD, NVIDIA, and Microsoft’s own chips, and enjoy a coffee from our barista while networking with experts.

Don’t forget to RSVP for our Microsoft + AMD Networking Reception and Microsoft + NVIDIA Panel Luncheon.

See you in St. Louis!

Performance and Scalability of Azure HBv5-series Virtual Machines

jvenkatesh — Mon, 17 Nov 2025 23:35:04 GMT

Azure HBv5-series virtual machines (VMs) for CPU-based high performance computing (HPC) are now Generally Available. This blog provides in-depth information about the technical underpinnings, performance, cost, and management implications of these HPC-optimized VMs.

Azure HBv5 VMs bring leadership levels of performance, cost optimization, and server (VM) consolidation for a variety of workloads driven by memory performance, such as computational fluid dynamics, weather simulation, geoscience simulations, and finite element analysis. For these applications, and compared to HBv4 VMs, previously the highest performance offering for these workloads, HBv5 provides up to:

5x higher performance for CFD workloads with 43% lower costs
3.2x higher performance for weather simulation with 16% lower costs
2.8x higher performance for geoscience workloads at the same costs

HBv5-series Technical Overview & VM Sizes

Each HBv5 VM features several new technologies for HPC customers, including:

Up to 6.6 TB/s of memory bandwidth (STREAM TRIAD) and 432 GB memory capacity
Up to 368 physical cores per VM (user configurable) with custom AMD EPYC CPUs, Zen4 microarchitecture (SMT disabled)
Base clock of 3.5 GHz (~1 GHz higher than other 96-core EPYC CPUs), and Boost clock of 4 GHz across all cores
800 Gb/s NVIDIA Quantum-2 InfiniBand (4 x 200 Gb/s CX-7) (~2x higher than HBv4 VMs)
180 Gb/s Azure Accelerated Networking (~2.2x higher than HBv4 VMs)
15 TB local NVMe SSD with up to 50 GB/s (read) and 30 GB/s (write) of bandwidth (~4x higher than HBv4 VMs)

The highlight feature of HBv5 VMs is their use of high-bandwidth memory (HBM). HBv5 VMs utilize a custom AMD CPU that increases memory bandwidth by ~9x v. dual-socket 4th Gen EPYC (Zen4, “Genoa”) server platforms and ~7x v. dual-socket EPYC (Zen5, “Turin”) server platforms. HBv5 delivers similar levels of memory bandwidth improvement compared to the highest-end alternatives from the Intel Xeon and ARM CPU ecosystems.

HBv5-series VMs are available in the following sizes with specifications as shown below. Just like existing H-series VMs, HBv5-series includes constrained cores VM sizes, enabling customers to optimize their VM dimensions for a variety of scenarios:

ISV licensing constraining a job to a targeted number of cores
Maximum-performance-per-VM or maximum performance per core
Minimum RAM/core (1.2 GB, suitable for strong scaling workloads) to maximum memory per core (9 GB, suitable for large datasets and weak scaling workloads)
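The per-core figures quoted above follow from dividing the VM-level resources by the number of exposed cores. The sketch below assumes that constrained-core sizes keep the full 432 GB of memory and the full 6.6 TB/s of bandwidth, which is an assumption made for illustration. Other than 368, the core counts in the loop are hypothetical examples and not a list of actual VM sizes.

```python
# VM-level figures quoted in the technical overview above.
MEMORY_GB = 432
MEMORY_BANDWIDTH_GB_PER_S = 6600   # 6.6 TB/s (STREAM TRIAD)
MAX_CORES = 368


def memory_per_core_gb(cores: int) -> float:
    """Memory per core, assuming the full VM memory stays available."""
    return MEMORY_GB / cores


def bandwidth_per_core_gb_s(cores: int) -> float:
    """Memory bandwidth per core, assuming full VM bandwidth stays available."""
    return MEMORY_BANDWIDTH_GB_PER_S / cores


if __name__ == "__main__":
    for cores in (MAX_CORES, 192, 96, 48):  # all but 368 are hypothetical
        print(f"{cores:>3} cores: "
              f"{memory_per_core_gb(cores):.1f} GB/core, "
              f"{bandwidth_per_core_gb_s(cores):.1f} GB/s per core")
```

At 368 cores this gives roughly 1.2 GB per core, matching the minimum quoted above, and 9 GB per core corresponds to exposing 48 cores under the same assumption.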

Table 1: Technical specifications of HBv5-series VMs

Note: Maximum clock frequencies (FMAX) are based on product specifications of the AMD EPYC 9V64H processor. Clock frequencies experienced by a customer are a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application.

For more information, see the official documentation for HBv5-series VMs.

Microbenchmark Performance

This section focuses on microbenchmarks that characterize performance of the memory subsystem, compute capabilities, and InfiniBand network of HBv5 VMs.

Memory & Compute Performance

To capture synthetic performance, we ran the following industry standard benchmarks:

STREAM – memory bandwidth
High Performance Conjugate Gradient (HPCG) – sparse linear algebra
High Performance Linpack (HPL) – dense linear algebra

Absolute results and comparisons to HBv4 VMs are shown in Table 2, below:

Table 2: Results of HBv5 running the STREAM, HPCG, and HPL benchmarks.

Note: STREAM was run with the following CLI parameters:

OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./amd_zen_stream

STREAM data size: 2621440000 bytes

InfiniBand Networking Performance

Each HBv5-series VM is equipped with four NVIDIA Quantum-2 network interface cards (NICs), each operating at 200 Gb/s for an aggregate bandwidth of 800 Gb/s per VM (node).

We ran industry-standard IB perftests, based on the OSU benchmarks, across two (2) HBv5-series VMs. Results are shown in Figures 1-3, below:

Note: all results below are for a single 200 Gb/s (uni-directional) link only. At a VM level, all bandwidth results below are 4x higher as there are four (4) InfiniBand links per HBv5 server.

Unidirectional bandwidth:

numactl -c 0 ib_send_bw -aF -q 2

Figure 1: results showing 99% achieved uni-directional bandwidth v. theoretical peak.

Bi-directional bandwidth:

numactl -c 0 ib_send_bw -aF -q 2 -b

Figure 2: results showing 99% achieved bi-directional bandwidth v. theoretical peak.
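To translate the percentages in Figures 1 and 2 into byte rates, the sketch below converts the 200 Gb/s link speed to GB/s, applies the 99% efficiency reported above, and scales by the four links per VM. It assumes all four links achieve the same efficiency simultaneously, which the single-link results above do not by themselves demonstrate.

```python
LINK_GBIT_PER_S = 200   # per InfiniBand link, uni-directional
LINKS_PER_VM = 4
EFFICIENCY = 0.99       # achieved v. theoretical peak, per the figures above


def achieved_gbyte_per_s(links: int = 1, bidirectional: bool = False) -> float:
    """Achieved bandwidth in GB/s for `links` links at the measured efficiency."""
    directions = 2 if bidirectional else 1
    return LINK_GBIT_PER_S * EFFICIENCY / 8 * links * directions


if __name__ == "__main__":
    print(f"One link, uni-directional: {achieved_gbyte_per_s():.2f} GB/s")
    print(f"One link, bi-directional:  {achieved_gbyte_per_s(bidirectional=True):.2f} GB/s")
    print(f"Per VM, uni-directional:   {achieved_gbyte_per_s(LINKS_PER_VM):.2f} GB/s")
```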

Latency:

Figure 3: results measuring as low as 1.25 microsecond latencies among HBv5 VMs. Latencies experienced by users will depend on message sizes employed by applications.

Application Performance, Cost/Performance, and Server (VM) Consolidation

This section focuses on characterizing HBv5-series VMs when running common, real-world HPC applications with an emphasis on those known to be meaningfully bound by memory performance as that is the focus of the HB-series family.

We characterize HBv5 below in three (3) ways of high relevance to customer interests:

Performance (“how much faster can it do the work”)
Cost/Performance (“how much can it reduce the costs to complete the work”)
Fleet consolidation (“how much can a customer simplify the size and scale of compute fleet management while still being able to do the work”)

Where possible, we have included comparisons to other Azure HPC VMs, including:

Azure HBv4/HX series with 176 physical cores of 4th Gen AMD EPYC CPUs with 3D V-Cache (“Genoa-X”) (HBv4 specifications, HX specifications)
Azure HBv3 with 120 physical cores of 3rd Gen AMD EPYC CPUs with 3D V-Cache (“Milan-X”) (HBv3 specifications)
Azure HBv2 with 120 physical cores of 2nd Gen AMD EPYC (“Rome”) CPUs (full specifications)

Unless otherwise noted, all tests shown below were performed with:

AlmaLinux 8.10 (image URN: almalinux:almalinux-hpc:8_10-hpc-gen2:latest); for scaling tests, image URN: almalinux:almalinux-hpc:8_6-hpc-gen2:latest
NVIDIA HPC-X MPI

Further, all Cost/Performance comparisons leverage pricing rate info from list price, Pay-As-You-Go (PAYG) information found on Azure Linux Virtual Machines Pricing. Absolute costs will be a function of a customer’s workload, model, and consumption (PAYG v. Reserved Instance, etc.) approach. That said, the relative cost/performance comparisons illustrated below should hold for the workload and model combinations shown below, regardless of the consumption approach.

Computational Fluid Dynamics (CFD)

OpenFOAM – version 2306 with 100M Cell Motorbike case

Figure 4: HBv5 v. HBv4 on OpenFOAM with the Motorbike 100M cell case. HBv5 VMs provide a 4.8x performance increase over HBv4 VMs.

Figure 5: The cost to complete the OpenFOAM Motorbike 100M case is just 57% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running OpenFOAM cases similar to the size and complexity of the 100M cell Motorbike problem, organizations can consolidate their server (VM) deployments by approximately a factor of five (5).
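The consolidation factor can be turned into a fleet-sizing estimate. The sketch below assumes a throughput-style workload whose aggregate capacity scales linearly with VM count, which is a simplification. The fleet size of 100 HBv4 VMs is a hypothetical example, and the 4.8x speedup is the OpenFOAM figure reported above.

```python
import math


def hbv5_vms_needed(hbv4_vms: int, speedup: float) -> int:
    """HBv5 VMs needed to match the aggregate throughput of `hbv4_vms` HBv4 VMs.

    Assumes throughput scales linearly with VM count (a simplification).
    """
    return math.ceil(hbv4_vms / speedup)


if __name__ == "__main__":
    fleet = 100  # hypothetical HBv4 fleet size
    needed = hbv5_vms_needed(fleet, speedup=4.8)
    print(f"{fleet} HBv4 VMs -> {needed} HBv5 VMs "
          f"(consolidation factor {fleet / needed:.2f}x)")
```

Because VM counts are whole numbers, the realized consolidation factor lands slightly below the raw speedup, in line with the "approximately a factor of five" wording.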

Palabos – version 1.01 with 3D Cavity, 1001 x 1001 x 1001 cells case

Figure 6: On Palabos, a Lattice Boltzmann solver using a streaming memory access pattern, HBv5 VMs provide a 4.4x performance increase over HBv4 VMs.

Figure 7: The cost to complete the Palabos 3D Cavity case is just 62% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Palabos with cases similar to the size and complexity of the 3D Cavity 1001 x 1001 x 1001 cells problem, organizations can consolidate their server (VM) deployments by a factor of ~4.5.

Ansys Fluent – version 2025 R2 with F1 Racecar 140M case

Figure 8: On ANSYS Fluent, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

Figure 9: The cost to complete the ANSYS Fluent F1 racecar 140M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running ANSYS Fluent with cases similar to the size and complexity of the 140M cell F1 Racecar problem, organizations can consolidate their server (VM) deployments by a factor of ~3.5.

Siemens Star-CCM+ - version 17.04.005 with AeroSUV Steady Coupled 106M case

Figure 10: On Star-CCM+, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

Figure 11: The cost to complete the Siemens Star-CCM+ AeroSUV Steady Coupled 106M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running Star-CCM+ with cases similar to the size and complexity of the 106M cell AeroSUV Steady Coupled case, organizations can consolidate their server (VM) deployments by approximately a factor of 3.5.

Weather Modeling

WRF – version 4.2.2 with CONUS 2.5KM case

Figure 12: On WRF, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.

Figure 13: The cost to complete the WRF CONUS 2.5KM case is just 84% of what it costs to complete the same case on HBv4.

Above, we can see that for customers running WRF with cases similar to the size and complexity of the CONUS 2.5KM case, organizations can consolidate their server (VM) deployments by approximately a factor of 3.

Energy Research

Devito – version 4.8.7 with Acoustic Forward case

Figure 14: On Devito, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.

Figure 15: The cost to complete the Devito Acoustic Forward OP case is equivalent to what it costs to complete the same case on HBv4.

Above, we can see that for customers running Devito with cases similar to the size and complexity of the Acoustic Forward OP, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.

Molecular Dynamics

NAMD - version 2.15a2 with STMV 20M case

Figure 16: On NAMD, HBv5 VMs provide a 2.18x performance increase over HBv4 VMs.

Figure 17: The cost to complete the NAMD STMV 20M case is 26% higher on HBv5 than what it costs to complete the same case on HBv4.

Above, we can see that for customers running NAMD with cases similar to the size and complexity of the STMV 20M case, organizations can consolidate their server (VM) deployments by approximately a factor of ~2.

Notably, NAMD is a compute-bound case rather than a memory-bandwidth-bound one. We include it here to illustrate that not all workloads are a good fit for HBv5. This latest Azure HPC VM is the fastest at this workload on the Microsoft Cloud, but the workload does not benefit substantially from HBv5’s premium levels of memory bandwidth. NAMD would instead run more cost efficiently on a CPU that supports AVX512 instructions natively or, much better still, a modern GPU.
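The speedup and cost figures quoted above are tied together by a simple relation: the relative cost to complete a job is the hourly price ratio divided by the speedup. The sketch below works backwards from the quoted captions to the price ratio they imply. The price ratio itself is not stated in this article, so treat the resulting value as an inference from the figures, not as published pricing. The function and variable names are illustrative.

```python
# (speedup of HBv5 over HBv4, job cost on HBv5 relative to HBv4) as quoted in the captions above.
quoted = {
    "OpenFOAM": (4.8, 0.57),
    "Palabos": (4.4, 0.62),
    "Ansys Fluent": (3.4, 0.81),
    "WRF": (3.27, 0.84),
    "NAMD": (2.18, 1.26),
}


def implied_price_ratio(speedup, relative_cost):
    """relative_cost = price_ratio / speedup, so price_ratio = relative_cost * speedup."""
    return relative_cost * speedup


def relative_job_cost(price_ratio, speedup):
    """Cost to complete a job on the faster VM, relative to the slower VM."""
    return price_ratio / speedup


ratios = {name: implied_price_ratio(s, c) for name, (s, c) in quoted.items()}
for name, ratio in ratios.items():
    print(f"{name}: implied hourly price ratio ~ {ratio:.2f}")

# Assumption: a single price ratio applies to every workload, so average the estimates.
mean_ratio = sum(ratios.values()) / len(ratios)

# Under this model, HBv5 is cheaper per job whenever the speedup exceeds the price ratio.
print(f"Break-even speedup ~ {mean_ratio:.2f}x")
```

Under these assumptions every workload points to roughly the same price ratio, which is why workloads with larger speedups show lower job costs and NAMD, with the smallest speedup, shows a higher one.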

Scalability of HBv5-series VMs

Weak Scaling

Weak scaling measures how well a parallel application or system performs when both the number of processing elements and the problem size increase proportionally, so that the workload per processor remains constant. Weak scaling cases are often employed when time-to-solution is fixed (e.g. it is acceptable to solve a problem within a specified period) but a user desires a simulation to be of a higher fidelity or resolution. A common example is operational weather forecasting.

To illustrate weak scaling on HBv5 VMs, we ran Palabos with the same 3D cavity problem as shown earlier:

Figure 18: On Palabos with the 3D Cavity model, HBv5 scales linearly as the 3D cavity size is proportionately increased.
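As a minimal sketch of how weak scaling efficiency is usually computed: with the work per VM held constant, efficiency is the baseline wall-clock time divided by the wall-clock time at N VMs, and linear scaling corresponds to a value near 1.0. The timings below are hypothetical placeholders, not the measurements behind Figure 18.

```python
def weak_scaling_efficiency(time_baseline, time_scaled):
    """Efficiency = T(baseline) / T(N) when the work per VM is held constant."""
    return time_baseline / time_scaled


# Hypothetical wall-clock times in seconds, keyed by VM count (illustrative only).
timings = {1: 100.0, 2: 101.0, 4: 102.0, 8: 104.0}

efficiency = {n: weak_scaling_efficiency(timings[1], t) for n, t in timings.items()}
for n, e in efficiency.items():
    print(f"{n} VMs: weak scaling efficiency {e:.1%}")
```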

Strong Scaling

Strong scaling is characterized by the efficiency with which execution time is reduced as the number of processor elements (CPUs, GPUs, etc.) is increased while the problem size is kept constant. Strong scaling cases are often employed when the fidelity or resolution of the simulation is acceptable, but a user requires faster time to completion. A common example is product engineering validation when an organization wants to bring a product to market faster but must complete a broad range of validation and verification scenarios before doing so.

To illustrate strong scaling on HBv5 VMs, we ran NAMD with two different problems, each intended to illustrate how expectations for strong scaling efficiency change depending on problem size and the ordering of computation v. communication in distributed memory workloads.

First, let us examine NAMD with the 20M STMV benchmark:

Figure 19: Strong scaling on HBv5 with NAMD STMV 20M cell case

As illustrated above, when compute time is continuously reduced (by leveraging more and more processor elements) but communication time remains constant, strong scaling efficiency will only stay high for so long. That principle is well represented by the STMV 20M case, for which scaling remains linear (i.e. cost/job remains flat) up to two (2) nodes but degrades after that. This is because while compute is being sped up, the MPI time remains relatively flat. As such, the relatively static MPI time comes to dominate end-to-end wall clock time as the number of VMs increases. Said another way, HBv5 features so much compute performance that, even for a moderate-sized problem like STMV 20M, scaling the infrastructure can only take performance so far before cost/job begins to increase.
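The effect described above can be sketched with a toy model in which compute time divides evenly across VMs while MPI time stays constant. This is a simplification and not a model of NAMD itself. All timings are hypothetical, and the "large" case simply assumes 10.5x the compute of the "small" one.

```python
def wall_time(n_vms, compute_time, mpi_time):
    """Toy model: compute divides evenly across VMs, MPI time stays constant."""
    return compute_time / n_vms + mpi_time


def parallel_efficiency(n_vms, compute_time, mpi_time, baseline_vms=1):
    """Measured speedup divided by the ideal (linear) speedup over the baseline."""
    t_base = wall_time(baseline_vms, compute_time, mpi_time)
    t_n = wall_time(n_vms, compute_time, mpi_time)
    return (t_base / t_n) / (n_vms / baseline_vms)


def relative_job_cost(n_vms, compute_time, mpi_time, baseline_vms=1):
    """Cost per job relative to the baseline; 1.0 means cost/job stays flat."""
    return 1.0 / parallel_efficiency(n_vms, compute_time, mpi_time, baseline_vms)


# Hypothetical single-VM timings in arbitrary units (not measured values).
small = {"compute_time": 100.0, "mpi_time": 1.0}
large = {"compute_time": 1050.0, "mpi_time": 1.0}  # 10.5x the compute, same MPI time

for n in (2, 8, 32):
    print(
        f"{n:>2} VMs: small efficiency {parallel_efficiency(n, **small):.2f}, "
        f"large efficiency {parallel_efficiency(n, **large):.2f}, "
        f"small relative cost/job {relative_job_cost(n, **small):.2f}"
    )
```

In this model the small problem stays close to linear at 2 VMs and loses efficiency by 32 VMs, while the larger problem remains close to linear at 32 VMs because compute still dominates wall-clock time.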

If we examine HBv5 against the 210M cell case, however, with 10.5x as many elements to compute as its 20M case sibling, the scaling efficiency story changes significantly.

Figure 20: On NAMD with the STMV 210M cell case, HBv5 scales linearly out to 32 VMs (or more than 11,000 CPU cores).

As illustrated above, larger cases with significant compute requirements will continue to scale efficiently with larger amounts of HBv5 infrastructure. While MPI time remains relatively flat for this case (as it does for the smaller STMV 20M case), the compute demands remain the dominant fraction of end-to-end wall clock time. As such, HBv5 scales these problems with very high levels of efficiency and, in doing so, job costs to the user remain flat despite up to 8x as many VMs being leveraged compared to the four (4) VM baseline.

The key takeaways for strong scaling scenarios are two-fold. First, users should run scaling tests with their applications and models to find a sweet spot of faster performance with constant job costs. This will depend heavily on model size. Second, as new and very high-end compute platforms like HBv5 emerge that accelerate compute time, application developers will need to find ways to reduce wall clock time that is bottlenecked on communication (MPI). Recommended approaches include using fewer MPI processes and, ideally, restructuring applications to overlap communication with compute phases.

Breaking the Million-Token Barrier: The Technical Achievement of Azure ND GB300 v6

HugoAffaticati — Tue, 04 Nov 2025 22:07:44 GMT

By Mark Gitau (Software Engineer) and Hugo Affaticati (Senior Cloud Infrastructure Engineer)

The new Azure ND GB300 v6 virtual machines, built on the cutting-edge NVIDIA Blackwell architecture introduced with the ND GB200 v6, are more optimized for inference workloads with 50% more GPU memory and 16% higher TDP (Thermal Design Power).

To gauge the performance gains of the ND GB300 v6 virtual machines on customer workloads, we ran the Llama2 70B model from MLPerf Inference v5.1 on each of the 18 ND GB300 v6 virtual machines on one NVIDIA GB300 NVL72 domain. The Llama2 70B model is a widely adopted industry standard for large-scale AI deployments, making it a good representation of production inference workloads. One NVL72 rack of Azure ND GB300 v6 achieved an aggregated 1,100,000 tokens/s for an unverified MLPerf Inference v5.1 submission [1], detailed in Table 1 and observed by Signal65. This is a new record in AI inference, beating our own previous record of 865,000 tokens/s on one NVIDIA GB200 NVL72 rack with the ND GB200 v6 VMs.

Metric	Performance (Tokens/Second)
Total Aggregated Throughput	1,100,948.3
Maximum Single-Node Throughput	62,803.9
Minimum Single-Node Throughput	57,599.1
Average Single-Node Throughput	61,163.8
Median Single-Node Throughput	61,759.1

Table 1: Data compiled from 18 parallel test runs, observed by Signal65.

This translates to 15,200 tokens/s per NVIDIA Blackwell Ultra GPU (+/- 5%), a 27% speedup over the 12,022 tokens/s per NVIDIA Blackwell GPU on ND GB200 v6. For comparison, the previous MLPerf Inference v4.1 results show that the NVIDIA DGX H100 system processed 24,525 tokens/s across 8 GPUs, or 3,066 tokens/s per NVIDIA H100 GPU [2]. This means that Azure ND GB300 v6 VMs deliver 5× higher throughput per GPU than the previous-generation ND H100 v5 virtual machines.
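The per-GPU figures above can be reproduced from Table 1 and the quoted comparison points. The sketch below is only the arithmetic, using the aggregate throughput, the 18 VMs, and the 4 GPUs per VM given in this article. Dividing the aggregate by 72 GPUs gives roughly 15,291 tokens/s, which sits inside the quoted +/- 5% band around 15,200. The variable names are illustrative.

```python
total_tokens_per_s = 1_100_948.3   # Table 1: total aggregated throughput
vms = 18                           # VMs in one NVL72 rack
gpus_per_vm = 4                    # Table 2: 4 GPUs per VM
gpus = vms * gpus_per_vm           # 72 GPUs in total

per_vm = total_tokens_per_s / vms
per_gpu = total_tokens_per_s / gpus

gb200_per_gpu = 12_022.0           # quoted per-GPU figure for ND GB200 v6
h100_per_gpu = 24_525.0 / 8        # quoted DGX H100 result across 8 GPUs

speedup_vs_gb200 = per_gpu / gb200_per_gpu
speedup_vs_h100 = per_gpu / h100_per_gpu

print(f"Average per VM:  {per_vm:,.1f} tokens/s")
print(f"Average per GPU: {per_gpu:,.1f} tokens/s")
print(f"vs ND GB200 v6:  {(speedup_vs_gb200 - 1) * 100:.0f}% faster per GPU")
print(f"vs H100:         {speedup_vs_h100:.1f}x per GPU")
```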

This milestone was observed by the third party Signal65. They concluded that this record "of over 1.1 million tokens per second on a single Azure rack is more than a benchmark record; it is a definitive proof point that the performance required for large-scale, transformative AI is now available as a reliable, efficient, and resilient utility".

The Azure ND GB300 v6 virtual machine configuration (Table 2) benefits from performance gains in several key areas, including GEMM efficiency, high-bandwidth memory (HBM) throughput, NVLink connectivity, and NCCL communication. Our benchmarks show that ND GB300 v6 achieves 2.5x more GEMM TFLOPS per GPU than the ND H100 v5. Additionally, we measured 7.37 TB/s HBM bandwidth (92% efficiency) and 4x faster CPU-to-GPU transfer speeds thanks to NVLink C2C.

Component	Specification
Cloud Platform	Microsoft Azure
VM Instance SKU	ND_GB300_v6
System Configuration	18 x ND GB300 v6 VM instances in a single NVL72 rack
GPU	4 x NVIDIA GB300 per VM (72 total)
GPU Memory	279GB per GPU
GPU Power Limit	1,400 Watts
Storage	14 TB Local NVMe RAID per VM
LLM Inference Engine	NVIDIA TensorRT-LLM
Benchmark Harness	MLCommons MLPerf Inference v5.1
Benchmark Scenario	Offline
Model	Llama2-70B
Precision	FP4

Table 2: ND GB300 v6 Configuration for the MLPerf Inference test

The model was run using FP4 precision, a form of quantization that significantly accelerates inference speed while maintaining high accuracy. This was implemented via the NVIDIA TensorRT-LLM library, a highly optimized, production-ready software stack for LLM inference.

Azure is once again raising the bar for enterprise-scale AI Inference with the ND GB300 v6 virtual machines.

How to replicate the results on a single virtual machine in Azure

Clone the repository and enter working directory:

git clone https://github.com/Azure/AI-benchmarking-guide.git && cd AI-benchmarking-guide/Azure_Results

Download the models & datasets

create models, data, and preprocessed_data directories in the working directory
download the Llama 2 70B model inside the models directory
download the datasets inside the data directory
prepare the datasets

Setup container

Inside the working directory:

     mkdir build && cd build

     git clone https://github.com/NVIDIA/TensorRT-LLM.git TRTLLM

     cd TRTLLM

Edit TRTLLM/docker/Makefile lines 135 and 136:

     SOURCE_DIR ?= AI-benchmarking-guide/Azure_Results (make sure it is an absolute path to the working directory)

     CODE_DIR ?= /work

Build & launch the container:

     make -C docker build

     make -C docker run

Once inside the container, install TensorRT-LLM:

     cd 1M_ND_GB300_v6_Inference/build/TRTLLM

     python3 ./scripts/build_wheel.py --trt_root /usr/local/tensorrt --benchmarks --cuda_architectures "103-real" --no-venv --clean

     pip install build/tensorrt_llm-1.1.0rc6-cp312-cp312-linux_aarch64.whl

Inside the 1M_ND_GB300_v6_Inference directory, install MLPerf dependencies:

     make clone_loadgen && make build_loadgen

     git clone https://github.com/NVIDIA/mitten.git ./build/mitten && pip install build/mitten

     pip install -r docker/common/requirements/requirements.llm.txt

Setup and run benchmark

Export env variables and link model & data directories:

     export MLPERF_SCRATCH_PATH=/work

     export SYSTEM_NAME=ND_GB300_v6

     make link_dirs

Run offline benchmark (wait a few minutes after the run server command for the server to start):

     make run_llm_server RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"

     make run_harness RUN_ARGS="--core_type=trtllm_endpoint --benchmarks=llama2-70b --scenarios=Offline"

Find all 18 log files of our run here

[1] Unverified MLPerf® v5.1 Inference Closed Llama 2 70B offline. Result not verified by MLCommons Association. Unverified results have not been through an MLPerf review and may use measurement methodologies and/or workload implementations that are inconsistent with the MLPerf specification for verified results. The MLPerf name and logo are registered and unregistered trademarks of MLCommons Association in the United States and other countries. All rights reserved. Unauthorized use strictly prohibited. See www.mlcommons.org for more information. Results obtained using NVIDIA MLPerf v5.1 code with NVIDIA TensorRT-LLM 1.1.0rc1

[2] Verified result with ID 4.1-0043.

The Complete Guide to Renewing an Expired Certificate in Microsoft HPC Pack 2019 (Single Head Node)

vinilv — Thu, 30 Oct 2025 06:27:43 GMT

Managing certificates in an HPC Pack 2019 cluster is critical for secure communication between nodes. However, if your certificate has expired, your cluster services (Scheduler, Broker, Web Components, etc.) may stop functioning properly — preventing nodes from communicating or jobs from scheduling.

When the HPC Pack certificate expires, the HPC Cluster Manager will fail to launch, and you may encounter error messages similar to the examples shown below.

This comprehensive guide walks you through how to renew an already expired HPC Pack certificate on a single-head-node setup and bring your cluster back online.

Step 1: Check the Current Certificate Expiry

Start by checking the existing certificate and its expiry date.

Get-ChildItem -Path Cert:\LocalMachine\root | Where-Object { $_.Subject -like "*HPC*" }
$thumbprint = "<Thumbprint value from the previous command>".ToUpper()
$cert = Get-ChildItem -Path Cert:\LocalMachine\My | Where-Object { $_.Thumbprint -eq $thumbprint }
$cert | Select-Object Subject, NotBefore, NotAfter, Thumbprint
Date

You can also confirm the system date using the PowerShell date command:

Date

This ensures you’re viewing the correct validity period for the currently installed certificate.

Step 2: Prepare a New Self-Signed Certificate

Next, we’ll create a new certificate that meets the HPC communication requirements.

Certificate Requirements:

Must have a private key capable of key exchange.
Key usage should include: Digital Signature, Key Encipherment, Key Agreement, and Certificate Signing.
Enhanced key usage should include: Client Authentication and Server Authentication.
If two certificates are used (private/public), both must have the same subject name.

When you prepare a new certificate, make sure that you use the same subject name as that of the old certificate. Run the following PowerShell commands on the HPC node to get the subject name of the existing certificate:

$thumbprint = (Get-ItemProperty -Path HKLM:\SOFTWARE\Microsoft\HPC -Name SSLThumbprint).SSLThumbPrint
$subjectName = (Get-Item Cert:\LocalMachine\My\$thumbprint).Subject
$subjectName

Use the same subject name when generating the new certificate.

Step 3: Create a New Certificate

Use the commands below to create and export a new self-signed certificate (valid for 1 year).

$subjectName = "HPC Pack Node Communication"
$pfxcert = New-SelfSignedCertificate -Subject $subjectName -KeySpec KeyExchange -KeyLength 2048 -HashAlgorithm SHA256 -TextExtension @("2.5.29.37={text}1.3.6.1.5.5.7.3.1,1.3.6.1.5.5.7.3.2") -Provider "Microsoft Enhanced RSA and AES Cryptographic Provider" -CertStoreLocation Cert:\CurrentUser\My -KeyExportPolicy Exportable -NotAfter (Get-Date).AddYears(1) -NotBefore (Get-Date).AddDays(-1)
$certThumbprint = $pfxcert.Thumbprint
$null = New-Item $env:Temp\$certThumbprint -ItemType Directory
$pfxPassword = Get-Credential -UserName 'Protection password' -Message 'Enter protection password below'
Export-PfxCertificate -Cert Cert:\CurrentUser\My\$certThumbprint -FilePath "$env:Temp\$certThumbprint\PrivateCert.pfx" -Password $pfxPassword.Password
Export-Certificate -Cert Cert:\CurrentUser\My\$certThumbprint -FilePath "$env:Temp\$certThumbprint\PublicCert.cer" -Type CERT -Force
start "$env:Temp\$certThumbprint"

This will generate both .pfx (private) and .cer (public) files in a temporary directory.

Step 4: Copy Certificate to Install Share

On the master (head) node, copy the newly created certificate to the following path:

C:\Program Files\Microsoft HPC Pack 2019\Data\InstallShare\Certificates

This ensures the certificate is available to all compute nodes in the cluster.

Step 5: Rotate Certificates on Compute Nodes

Important:
Always rotate certificates on compute nodes first, before the head node.
If you update the head node first, compute nodes will reject the new certificate, forcing manual reconfiguration.

After rotating compute node certificates, expect them to appear as Offline in HPC Cluster Manager — this is normal until the head node certificate is updated.

Download the PowerShell script Update-HpcNodeCertificate.ps1 and place it in your HPC install share:
\\<headnode>\REMINST
On each compute node, open PowerShell as Administrator and run:

PowerShell.exe -ExecutionPolicy ByPass -Command "\\<headnode>\REMINST\Update-HpcNodeCertificate.ps1 -PfxFilePath \\<headnode>\REMINST\Certificates\HpcCnCommunication.pfx -Password <password>"

This updates the certificate on each compute node.

Step 6: Update Certificate on the Master (Head) Node

On the head node, run the following commands in PowerShell as Administrator:

$certPassword = ConvertTo-SecureString -String "YourPassword" -AsPlainText -Force
Import-PfxCertificate -FilePath "C:\Program Files\Microsoft HPC Pack 2019\Data\InstallShare\Certificates\PrivateCert.pfx" -CertStoreLocation "Cert:\LocalMachine\My" -Password $certPassword
PowerShell.exe -ExecutionPolicy ByPass -Command "Import-certificate -FilePath \\master\REMINST\Certificates\PublicCert.cer -CertStoreLocation cert:\LocalMachine\Root"
Set-ItemProperty -Path "HKLM:\SOFTWARE\Microsoft\HPC" -Name SSLThumbprint -Value <Thumbprint>
Set-ItemProperty -Path "HKLM:\SOFTWARE\Wow6432Node\Microsoft\HPC" -Name SSLThumbprint -Value <Thumbprint>

Step 7: Update Thumbprint in SQL Database

You’ll also need to update the certificate thumbprint stored in the HPCHAStorage database.

1. Install SQL Server Management Studio (SSMS) (latest version).

2. Open SSMS and connect to the HPC database.

3. Navigate to: HPCHAStorage → Tables → dbo.DataTable

4. Right-click and select “Select Top 1000 Rows” to view the current SSL thumbprint.

5. Open a new query window and run the following command with the updated thumbprint:

Update dbo.DataTable set dvalue='<NewThumbprint>' where dpath = 'HKEY_LOCAL_MACHINE\Software\Microsoft\HPC' and dkey = 'SSLThumbprint'

This updates the stored certificate reference used by the HPC services.
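Because the same thumbprint is entered by hand in the registry and in the database, it can help to normalize and sanity-check the value before pasting it. The sketch below is an optional helper and not part of HPC Pack. It assumes the thumbprint is a SHA-1 hash written as 40 hexadecimal characters, and it strips the spaces and invisible characters that sometimes come along when a thumbprint is copied from a certificate dialog. The function name and the sample value are made up for illustration.

```python
import re


def normalize_thumbprint(raw: str) -> str:
    """Return the thumbprint as 40 uppercase hex characters, or raise ValueError."""
    # Drop everything that is not a hex digit (spaces, colons, invisible marks).
    cleaned = re.sub(r"[^0-9A-Fa-f]", "", raw).upper()
    if len(cleaned) != 40:
        raise ValueError(
            f"expected 40 hex characters for a SHA-1 thumbprint, got {len(cleaned)}"
        )
    return cleaned


# Made-up sample value with a leading invisible character and spaces between bytes.
sample = "\u200ea9 09 50 2d d8 2a e4 14 33 e6 f8 38 86 b0 0d 42 77 a3 2a 7b"
thumbprint = normalize_thumbprint(sample)
print(thumbprint)
```

A value that fails the length check is a sign that part of the thumbprint was lost or that extra hex-like characters were copied, so it is worth re-reading the thumbprint from the certificate store before updating the registry or the database.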

Step 8: Reboot the Master Node

Once everything is updated, reboot the head node to apply the changes.

After the system restarts, open HPC Cluster Manager — your cluster should now be fully functional with the new certificate in place.

Summary

By following these steps, you can safely renew an expired HPC Pack 2019 certificate and restore secure communication across your cluster — without needing to reinstall or reconfigure HPC Pack components.

This guide helps administrators handle expired certificates with confidence and maintain business continuity for HPC workloads.

If this guide helped you resolve your certificate issues, please give it a 👍 thumbs up and share your feedback or questions in the comments section below.
