Scaling High-Performance CAD on Azure Virtual Desktop with NVIDIA RTX PRO 6000
This blog presents a validation study of Siemens NX 2506 running in a multi-user Azure Virtual Desktop (AVD) environment powered by NVIDIA RTX PRO 6000 GPUs. It demonstrates how a single GPU-backed Azure VM can support up to 30 concurrent CAD users with stable rendering, consistent performance, and successful execution of all ATS benchmark workloads. The study also highlights performance improvements achieved through multi-host scaling, making AVD a compelling platform for high-density, enterprise-scale CAD deployments.

AI Infrastructure Preflight at User space: Validating Multi Node, Multi GPU Slurm Clusters
Every team that operates GPU clusters for AI has seen this pattern. The cluster boots, GPUs are visible, and scheduling works at a basic level. Then the first distributed training run stalls in NCCL initialization, fails during rank rendezvous, or silently maps ranks to the wrong devices. The issue is often not in the training code. It is in infrastructure consistency across scheduler, runtime, drivers, networking, and process topology.

The goal of ai-infra-validator is straightforward:
- Run a fast user space preflight before expensive training jobs.
- Validate distributed initialization for multi-node, multi-GPU workloads.
- Confirm GPU affinity and rank mapping are correct.
- Verify that the NCCL communication fabric can complete a collective ring under Slurm.

This post walks through the implementation in detail, explains why each part exists, and shows how to operationalize it in real HPC AI environments.

What the project validates

Zero-dependency user space smoke test for AI clusters. Validates multi-node PyTorch DDP initialization, GPU affinity, and NCCL fabric connectivity under Slurm orchestration.

Git Repo: ai-cluster-validator

In practical terms, this checks that:
- Slurm launches the expected number of ranks per node.
- Distributed process group creation with NCCL succeeds.
- Each rank binds to the expected local GPU.
- Cross-rank all-reduce completes and converges.
- Node-level telemetry confirms software and fabric state.

This is not a performance benchmark. It is a correctness and readiness gate.

Tested platform profile

| Component | Value |
| CycleCloud | 8.8.3-3667 |
| Slurm | 25.05.5 |
| Slurm partition | hpc |
| Scheduler VM SKU | Standard_D8s_v6 |
| Compute VM SKU | Standard_ND96asr_v4 |
| OS images | microsoft-dsvm:ubuntu-hpc:2204:latest and microsoft-dsvm:ubuntu-hpc:2404:latest |
| PyTorch | 2.12.0+cu130 |
| CUDA runtime | 13.0 |
| NCCL target | 2.29.7 |

This profile represents a common enterprise scenario where scheduler and compute nodes have different roles, and the training fleet depends on correct multi-node orchestration.
Step 1: Minimal user space bootstrap

The bootstrap script creates a shared Python environment at /shared/apps/pytorch_env and installs the required packages:
- torch
- torchvision
- torchaudio
- psutil

This choice is intentional:
- No dependency on containers for first-pass validation.
- Single environment path visible to all compute nodes.
- Rapid setup and repeatability for cluster operators.

Command sequence:

    git clone https://github.com/vinil-v/ai-cluster-validator.git
    cd ai-cluster-validator
    sudo bash bootstrap_env.sh

Step 2: Slurm job defines deterministic distributed topology

The Slurm script expresses a clear topology contract:
- nodes=2
- ntasks-per-node=8
- gpus-per-node=8
- cpus-per-task=12

From this, world size is derived as:

    WORLD_SIZE = SLURM_NTASKS = 2 x 8 = 16

The script also configures network and NCCL behavior:

    NCCL_DEBUG=WARN
    NCCL_IB_DISABLE=0
    NCCL_P2P_DISABLE=0
    NCCL_IGNORE_CPU_AFFINITY=1
    GLOO_SOCKET_IFNAME=eth0
    NCCL_SOCKET_IFNAME=eth0

Important implementation detail:
- MASTER_ADDR is set to the first host in SLURM_JOB_NODELIST.
- MASTER_PORT is selected dynamically from the ephemeral range 49152-65535 and falls back to 29500 if needed.

Why this matters:
- Reduces port collision risk when jobs run frequently.
- Avoids hardcoded rendezvous values that may fail in shared clusters.

Launch path:

    srun --cpu-bind=none bash -c '
      source /shared/apps/pytorch_env/bin/activate;
      export RANK=$SLURM_PROCID;
      export LOCAL_RANK=$SLURM_LOCALID;
      python3 ddp_mesh_ping.py
    '

The command string is single-quoted so that $SLURM_PROCID and $SLURM_LOCALID are expanded by each task's shell rather than once by the batch shell. The LOCAL_RANK handoff is critical for stable GPU affinity inside each node.

Step 3: DDP initialization and rank to GPU affinity

Inside ddp_mesh_ping.py, each process executes:
1. Parse WORLD_SIZE, RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
2. Initialize torch.distributed with backend nccl and TCP init method.
3. Set CUDA device using LOCAL_RANK.
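The first item in that list, parsing the rendezvous variables, is also a natural place to check them against the topology contract from Step 2. The sketch below is one possible way to do that and is not taken from ddp_mesh_ping.py: the names RankEnv and parse_rank_env are hypothetical, and the defaults of 2 nodes and 8 ranks per node are assumptions copied from the Slurm directives above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RankEnv:
    """Rendezvous settings for one rank (hypothetical helper type)."""
    world_size: int
    rank: int
    local_rank: int
    master_addr: str
    master_port: int


def parse_rank_env(env, nodes=2, ranks_per_node=8):
    """Read the rendezvous variables and check them against the topology contract.

    `env` is any mapping of variable names to strings, for example os.environ.
    The defaults (2 nodes, 8 ranks per node) are assumptions from Step 2.
    """
    required = ("WORLD_SIZE", "RANK", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if name not in env]
    if missing:
        raise ValueError(f"missing environment variables: {missing}")

    cfg = RankEnv(
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )

    # World size must equal nodes x ranks per node (2 x 8 = 16 in this post).
    expected_world_size = nodes * ranks_per_node
    if cfg.world_size != expected_world_size:
        raise ValueError(
            f"WORLD_SIZE={cfg.world_size}, expected {expected_world_size}"
        )
    # The global rank must fall inside the world.
    if not 0 <= cfg.rank < cfg.world_size:
        raise ValueError(f"RANK={cfg.rank} outside 0..{cfg.world_size - 1}")
    # With one rank per GPU, the local rank doubles as the GPU index.
    if not 0 <= cfg.local_rank < ranks_per_node:
        raise ValueError(
            f"LOCAL_RANK={cfg.local_rank} outside 0..{ranks_per_node - 1}"
        )
    if not 1 <= cfg.master_port <= 65535:
        raise ValueError(f"MASTER_PORT={cfg.master_port} is not a valid TCP port")
    return cfg


# Illustrative values only; a real rank would pass os.environ instead.
example_env = {
    "WORLD_SIZE": "16",
    "RANK": "9",
    "LOCAL_RANK": "1",
    "MASTER_ADDR": "ddpcluster-hpc-1",
    "MASTER_PORT": "53593",
}
print(parse_rank_env(example_env))
```

Failing fast here, before any NCCL call, turns a misconfigured launch into an immediate and readable error instead of a rendezvous hang.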
Core initialization path:

    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        world_size=world_size,
        rank=rank
    )
    torch.cuda.set_device(local_rank)

This validates the minimum distributed contract required by real model training jobs.

Step 4: Rich node and fabric telemetry in user space

Each rank collects detailed metadata before the collective test:
- Node identity from Slurm and hostname.
- GPU model and VRAM from CUDA properties.
- System memory via psutil.
- CPU model from /proc/cpuinfo.
- OS and kernel versions.
- NVIDIA driver version from /proc/driver/nvidia/version.
- PyTorch, CUDA, and NCCL runtime versions.
- InfiniBand device state and link rate from /sys/class/infiniband.
- Basic GPU peer access capability via torch.cuda.can_device_access_peer.

All rank payloads are gathered on rank 0 using dist.gather_object and printed as:
- Cluster hardware topology report.
- Node environment deep dive.
- Network interconnect and fabric status.

This design gives platform teams one artifact that is both operational and diagnostic.

Step 5: Functional collective validation

After telemetry, each rank executes a lightweight DDP compute path:
1. Build nn.Linear(10,10) on the local GPU.
2. Wrap it with DistributedDataParallel.
3. Perform forward, loss, backward.
4. Run all_reduce on the loss tensor.
5. Compute the global average loss.

The pass condition is explicit in the log output:

    SUCCESS: DDP Multi-Node AllReduce Ring Complete!

This confirms that process group initialization and collective communication both completed successfully.
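Because the pass condition is a fixed string, a wrapper script can gate on it. The following is a minimal sketch rather than part of the project: the function name check_smoke_log is hypothetical, and it assumes the job log also carries a "Total Execution Ranks: N" line, as the job output in this post does.

```python
import re

# Marker string printed by the smoke test on success.
SUCCESS_MARKER = "SUCCESS: DDP Multi-Node AllReduce Ring Complete!"
# Rank count line as it appears in the job log.
RANKS_PATTERN = re.compile(r"Total Execution Ranks:\s*(\d+)")


def check_smoke_log(log_text, expected_world_size):
    """Return a list of problems found in the log; an empty list means pass."""
    problems = []
    if SUCCESS_MARKER not in log_text:
        problems.append("success marker not found")
    match = RANKS_PATTERN.search(log_text)
    if match is None:
        problems.append("rank count line not found")
    elif int(match.group(1)) != expected_world_size:
        problems.append(
            f"expected {expected_world_size} ranks, log reports {match.group(1)}"
        )
    return problems


# Two lines in the shape of the job log, used here as a stand-in for a real file.
sample = (
    "Total Execution Ranks: 16\n"
    "SUCCESS: DDP Multi-Node AllReduce Ring Complete!\n"
)
print(check_smoke_log(sample, expected_world_size=16))
```

In a pipeline, the text would come from the job's .log file, and a non-empty result would fail the stage before any training job is queued.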
What a successful run looks like

Submission:

    sbatch ddp_smoke_test.slurm
    squeue

Representative outcomes in the log:
- Total Execution Ranks: 16
- Two nodes with local ranks 0 through 7 on each node
- GPU inventory aligned with expected A100 topology
- Active InfiniBand HCAs discovered per host
- NCCL socket interface set to eth0
- Final success marker and computed convergence loss

When these markers are present and coherent with the expected hardware shape, the cluster is typically ready for distributed training bring-up.

How to check the output file

The Slurm script writes two artifacts per job:
- ai_infra_smoke_test_<jobid>.log
- ai_infra_smoke_test_<jobid>.err

Use this exact workflow:

    # 1. Submit and capture the job id
    sbatch ddp_smoke_test.slurm
    # 2. Check job state
    squeue -j <jobid>
    # 3. Read standard output log
    cat ai_infra_smoke_test_<jobid>.log
    # 4. Read standard error log
    cat ai_infra_smoke_test_<jobid>.err

For stronger validation in automation, also check:
- Total Execution Ranks equals expected world size.
- Both nodes appear in the topology table with local ranks 0 through 7.
- NCCL/CUDA/PyTorch versions are present in the node environment section.

Complete reference output

Use the following full log as a known-good reference from a successful 2-node ND96asr_v4 run.
    Master Node IP/Hostname: ddpcluster-hpc-1
    Dynamically Assigned Port: 53593
    Total Execution Ranks: 16
    ===============================================================================================
    HPC CLUSTER INTERACTION MONITOR
    ===============================================================================================
    --> Initializing DDP on Master Node : ddpcluster-hpc-1
    --> Dynamic Coordination Port : 53593
    --> Target World Cluster Size : 16 GPUs
    -----------------------------------------------------------------------------------------------
    ===============================================================================================
    CLUSTER HARDWARE TOPOLOGY REPORT
    ===============================================================================================
    | Rank | Node Name | Local ID | GPU Model | VRAM | Sys Mem | CPU Cores |
    -----------------------------------------------------------------------------------------------
    | 0 | ddpcluster-hpc-1 | 0 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 1 | ddpcluster-hpc-1 | 1 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 2 | ddpcluster-hpc-1 | 2 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 3 | ddpcluster-hpc-1 | 3 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 4 | ddpcluster-hpc-1 | 4 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 5 | ddpcluster-hpc-1 | 5 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 6 | ddpcluster-hpc-1 | 6 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 7 | ddpcluster-hpc-1 | 7 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 8 | ddpcluster-hpc-2 | 0 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 9 | ddpcluster-hpc-2 | 1 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 10 | ddpcluster-hpc-2 | 2 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 11 | ddpcluster-hpc-2 | 3 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 12 | ddpcluster-hpc-2 | 4 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 13 | ddpcluster-hpc-2 | 5 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 14 | ddpcluster-hpc-2 | 6 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 15 | ddpcluster-hpc-2 | 7 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    ===============================================================================================
    NODE ENVIRONMENT DEEP DIVE
    -----------------------------------------------------------------------------------------------
    [ddpcluster-hpc-1] Details:
    --> CPU Microarchitecture : AMD EPYC 7V12 64-Core Processor
    --> Operating System : Ubuntu 22.04.5 LTS
    --> Kernel Base Version : 5.15.0-1110-azure
    --> Nvidia Driver Loaded : 580.126.20
    --> PyTorch Environment : v2.12.0+cu130
    --> CUDA Runtime Version : v13.0
    --> NCCL Fabric Target : v2.29.7
    --> Discovered InfiniBand HCAs:
    - mlx5_an0:1 (4: ACTIVE - 40 Gb/sec (4X QDR))
    - mlx5_ib0:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib1:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib2:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib3:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib4:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib5:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib6:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib7:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    -----------------------------------------------------------------------------------------------
    [ddpcluster-hpc-2] Details:
    --> CPU Microarchitecture : AMD EPYC 7V12 64-Core Processor
    --> Operating System : Ubuntu 22.04.5 LTS
    --> Kernel Base Version : 5.15.0-1110-azure
    --> Nvidia Driver Loaded : 580.126.20
    --> PyTorch Environment : v2.12.0+cu130
    --> CUDA Runtime Version : v13.0
    --> NCCL Fabric Target : v2.29.7
    --> Discovered InfiniBand HCAs:
    - mlx5_an0:1 (4: ACTIVE - 40 Gb/sec (4X QDR))
    - mlx5_ib0:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib1:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib2:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib3:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib4:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib5:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib6:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib7:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    -----------------------------------------------------------------------------------------------
    NETWORK INTERCONNECT & FABRIC STATUS
    -----------------------------------------------------------------------------------------------
    --> Target Communication Interface (NCCL_SOCKET_IFNAME) : eth0
    --> Active Telemetry Tracking Level (NCCL_DEBUG) : WARN
    --> Inter-GPU Topo Link Verification : Active (P2P/NVLink Capable)
    -----------------------------------------------------------------------------------------------
    SUCCESS: DDP Multi-Node AllReduce Ring Complete!
    --> Computed System Verification Convergence Loss : 1.398719
    ===============================================================================================

Why this is effective for platform operations

For AI infrastructure teams, this pattern is highly effective because it is:
- Fast: can be run after every change window.
- Deterministic: same topology contracts every run.
- Actionable: output includes enough context for first-level triage.
- Low friction: user space only, no heavy control plane dependencies.

This supports common operating workflows:
- Day-0 cluster acceptance.
- Day-1 patch validation after driver, kernel, or image changes.
- Regression gate in golden image pipelines.
- Preflight before large multi-node model training jobs.

Practical guidance for extending to larger clusters
- Adjust Slurm directives for nodes and tasks per node.
- Keep one rank per GPU unless validating an alternate placement policy.
- Set NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME according to your network policy.
- Preserve the dynamic MASTER_PORT logic to avoid static collisions.
- Keep the success marker string stable so automation can parse it.

Closing perspective

Most distributed training failures are expensive because they are discovered late.
A user space preflight that validates scheduler topology, rank rendezvous, GPU affinity, and NCCL collectives provides a high-value guardrail before production starts. ai-infra-validator is a practical implementation of that guardrail. It is compact, transparent, and aligned with how real Slurm-based AI clusters operate. For teams running multi-node, multi-GPU training at scale, this kind of preflight should be a standard operational gate.

Distributing model weights to your AI cluster: a faster pre-flight on AKS and Slurm
Standing up an N-node training or inference job and waiting forever for the model checkpoint to land on every node's NVMe? Here's a small Rust + MPI tool, azcp-cluster, that pays Azure egress once, broadcasts over your fabric, and finishes in seconds. Plus the AKS and Slurm patterns to wire it into a real pipeline.

Teamcenter Simulation Process Data Management Architecture on Azure CycleCloud- Slurm cluster
Introduction:

Many customers run multiple Teamcenter-SPDM solutions across the enterprise, mixing multiple instances, multiple ISV vendors, and hybrid cloud/on-prem implementations. This fragmentation reduces the customer’s ability to access data uniformly. Consolidating Teamcenter-SPDM on Azure can speed the shift to one consistent, harmonized PLM experience, enterprise-wide.

What is Teamcenter Simulation?

Teamcenter Simulation integrates simulation data, processes, and results into the broader PLM (Product Lifecycle Management) environment. Instead of engineers running simulations in silos on local drives, it provides:
- A single source of truth for CAD, simulation models, inputs, and results.
- Traceability across design, analysis, and manufacturing.
- Support for multi-CAD, multi-CAE tools (e.g., NX Nastran, ANSYS, Abaqus, Star-CCM+).

Primary benefit

Teamcenter Simulation SPDM gives you full traceability from source to solution. SPDM is a single source of truth in which each CAE analysis of a product design is related to the corresponding item in the original CAD. This relationship between CAD and simulation data is the key to determining which CAD revision was captured in a particular CAE analysis.

Architecture:

The Siemens Teamcenter SPDM baseline architecture has two major, connected blocks:
- Teamcenter PLM core deployment
- StarCCM deployed on an HPC CycleCloud Slurm workspace

Teamcenter PLM Core Deployment:

It has four distributed tiers (client, web, enterprise, and resource) in a single availability zone. Each tier aligns to a function, and communication flows between these tiers. All four tiers use their own virtual machines in a single virtual network. Teamcenter Simulation, also known as CAE Manager, provides the core business functionality of SPDM. It runs on a central server in the enterprise tier, and users access it through a web-based or thick-client interface.
You can deploy multiple instances in Dev and Test environments by adding extra virtual machines and storage on virtual networks separate from production virtual networks.

StarCCM HPC CycleCloud Slurm cluster architecture:

Siemens StarCCM simulation software is deployed on the Azure CycleCloud HPC scheduler node. The CAE analyst submits simulation jobs from the Teamcenter Active Workspace or Rich Client UI. Azure HPC then spins up HPC nodes, which process the submitted jobs based on the runtime parameters. StarCCM completes the simulation iterations and generates a .sim output file.

Workflow

- CAE analysts, SPDM users, and Teamcenter users access the Teamcenter application via a public HTTPS endpoint URL. Users access the application through two user interfaces: (1) a Rich Client and (2) an Active Workspace client. CAE engineers and simulation analysts access Teamcenter through the Teamcenter Simulation client, a lightweight thin client that runs on the user’s desktop.
- User access is authenticated via the company’s Microsoft Entra ID. Microsoft Entra ID with SAML configuration allows single sign-on (SSO) to the Teamcenter application.
- Azure Firewall, on the Azure backbone, is the security component that filters traffic based on threat intelligence feeds that come directly from Microsoft Cyber Security. HTTPS traffic is directed to Azure Application Gateway. The hub virtual network and spoke virtual network are peered so they can communicate over the Azure backbone network.
- Azure Application Gateway routes traffic to Teamcenter’s web server virtual machines (VMs) in the web tier.
- Siemens PLM Teamcenter deployment on Azure: for detailed information about the Teamcenter architecture on Azure, refer to this URL.
- The Teamcenter Simulation client runs on the Teamcenter user’s desktop. CAE Manager is deployed as an integral part of the Teamcenter package.
Teamcenter Simulation on Azure HPC:

A CAE engineer executes the following typical workflow with the Azure HPC cluster.

Step 1: CAD Data & Product Structures
- CAD models (e.g., from NX, CATIA, SolidWorks) are managed in Teamcenter.
- The simulation engineer links simulation models directly to Teamcenter product structures.
- This ensures the simulation always uses the latest or correct version of the design.

Step 2: Build Simulation Model (Pre-processing)
- Simulation templates define solver type (FEA, CFD, Multiphysics) and required inputs.
- Engineers use tools like NX CAE, Simcenter 3D, ANSYS, Abaqus, or Star-CCM+ integrated with Teamcenter.
- Meshes, boundary conditions, loads, and materials are associated with the correct design revision.

Step 3: Manage Simulation Data
- All input decks, scripts, and models are stored in Teamcenter for version control.
- Metadata (e.g., load case, solver settings) is captured for searchability and re-use.
- Process automation is supported: simulation workflows can be pre-configured for repeatable tasks.

Step 4: Run Simulation Jobs (Enhanced with Azure CycleCloud Benefits)
- Jobs are submitted to local HPC clusters or cloud HPC (Azure CycleCloud) directly from Teamcenter.
- Teamcenter stores solver logs, job status, and output files.

The following diagrams show the end-to-end workflow: Teamcenter CAE Manager → StarCCM → HPC cluster → simulation processing → .sim file → .sim file back to Teamcenter.

Figure: Teamcenter CAE Manager → StarCCM running on the HPC cluster
Figure: Teamcenter generates the job file on the HPC node
Figure: HPC cluster creating HPC nodes
Figure: squeue monitoring on the HPC node
Figure: Job monitoring in the Teamcenter UI
Figure: Simulation output file generated by the sbatch job
Figure: File copied over to the Teamcenter shared file location

Step 5: Post-processing & Results Management
- Results are imported back into Teamcenter: stress plots, temperature distributions, flow fields, etc.
- Visualization is available via Simcenter 3D, JT format (lightweight 3D), or web-based viewers.
Results are tied back to design versions, simulation setup, and load cases. This creates a traceable digital thread from requirements → design → simulation → results.

Step 6: Review, Sign-off, and Collaboration
- Results are shared with design, manufacturing, and management teams in Teamcenter.
- Review workflows, e-signatures, and approvals are integrated into PLM processes.
- Simulation results influence design changes and product validation reports.

Azure CycleCloud adds several key advantages:
- On-demand scaling: Automatically provisions Azure compute nodes when workloads spike, then scales down when jobs complete to reduce costs.
- HPC Slurm scheduler integration: Supports popular schedulers like Slurm, enabling smooth job submission from Teamcenter.
- Multi-VM sizes & GPU support: Allows selecting the right mix of CPU/GPU VMs for different simulation workloads (e.g., CFD, FEA, ML-driven simulations).
- Hybrid flexibility: Combine on-prem HPC with Azure bursting to handle peak demand without over-provisioning local hardware.
- Cost governance: Built-in cost controls, job quotas, and reporting to track simulation expenses.
- Security & compliance: Leverages Azure security, VNet isolation, and role-based access control for simulation data and compute resources.
- Integration with Azure Storage: Simplifies access to input/output files using Azure Blob, Azure NetApp Files, or Lustre for HPC-grade throughput.

Conclusion:

Siemens Teamcenter SPDM, when deployed on Azure HPC CycleCloud Workspaces, delivers a scalable and high-performance simulation data management solution. The integration with Azure CycleCloud enables dynamic provisioning of compute resources, allowing simulation workloads to scale elastically based on demand. This ensures optimal resource utilization and cost efficiency, especially during peak simulation cycles.
With support for Slurm scheduling, multi-VM configurations, and GPU acceleration, SPDM on HPC CCWs empowers engineering teams to run complex simulations faster and more reliably. The architecture’s hybrid flexibility, combining on-premises capacity and cloud bursting, further enhances throughput without overcommitting infrastructure, making it a robust foundation for enterprise-wide digital thread and product validation workflows.

Simplify troubleshooting at scale - Centralized Log Management for CycleCloud Workspace for Slurm
Training large AI models on hundreds or thousands of nodes introduces a critical operational challenge: when a distributed job fails, identifying the root cause across scattered logs can become incredibly time-consuming. This manual process delays recovery and reduces cluster utilization. The ability to parse centralized cluster logs from a single interface is critical to ensuring that the root causes of job failures are swiftly identified and mitigated, which keeps cluster utilization high.

Solution Architecture

This is a turnkey, customizable log forwarding solution for CycleCloud Workspace for Slurm that centralizes all cluster logs into Azure Monitor Log Analytics. The architecture uses the Azure Monitor Agent (AMA), deployed on every VM and Virtual Machine Scale Set (VMSS), to stream the logs defined by Data Collection Rules (DCRs) to dedicated tables in a Log Analytics workspace, where they can be queried from a single interface.

The turnkey solution captures three categories of logs essential for troubleshooting distributed workloads, and can be extended to any other logs:
- Slurm logs, including slurmctld, slurmd, etc., plus archived job artifacts (job submission scripts, environment variables, stdout/stderr) collected via prolog/epilog scripts.
- Infrastructure logs from CycleCloud, including the CycleCloud Healthagent, which automatically tests nodes for hardware health and drains nodes that fail tests.
- Operating system logs from syslog and dmesg, capturing kernel events, network state changes, and hardware issues.

Each log source flows through its own DCR into a dedicated table following a consistent schema. The solution automatically associates scheduler-specific DCRs with the Slurm scheduler node and compute-specific DCRs with compute nodes, handling dynamic node scaling transparently. The solution is purpose-built for CycleCloud Workspace for Slurm, but is designed in a modular fashion so that it can be easily extended with new data sources (i.e., new log formats) and processing (i.e., Data Collection Rules) to support forwarding and analysis of other required logs.

Key Benefits
- Time-series correlation: Azure Monitor's time-based indexing enables rapid identification of cascading failures. For example, trace a network carrier flap detected in syslog to the corresponding slurmd communication errors and then to specific job failures, all within seconds.
- Centralized visibility: Query logs from thousands of nodes through a single interface instead of SSH-ing to individual machines. Correlate Slurm controller decisions with node-level errors and system events in one query.
- Log persistence: Logs survive node deallocations and reimaging. This is critical in cloud environments where compute nodes are ephemeral.
- Powerful query language: KQL (Kusto Query Language) allows parsing raw logs into structured fields, filtering across multiple sources, and building operational dashboards. Example queries detect patterns like repeated job failures, network instability, or resource exhaustion.
- Production-ready scalability: User-assigned managed identities automatically propagate to new VMSS instances, and DCR associations handle thousands of nodes without manual configuration.

Getting Started

The complete solution is available on GitHub (slurm-log-collection) with deployment scripts that:
- Create all required Log Analytics tables
- Deploy pre-configured DCRs for Slurm, CycleCloud, and OS logs
- Automatically associate DCRs with scheduler and compute resources

After you configure the environment variables and run the setup scripts, logs begin flowing to Azure Monitor. Logs populate within 15 minutes at most, although normal ingestion latency is about 30 seconds to 3 minutes. The repository includes sample KQL queries for common troubleshooting scenarios to accelerate time-to-resolution, and for non-troubleshooting analysis of cluster usage.

Microsoft at NVIDIA GTC 2026
Microsoft returns to NVIDIA GTC 2026 in San Jose with a strong presence across conference sessions, in-booth theater talks, live demos, and executive-level ancillary events. Together with NVIDIA and our partner ecosystem, Microsoft is showcasing how Azure AI infrastructure enables AI training, inference, and production at global scale. Visit us at Booth #521 to see the latest innovations in action and connect with Azure and NVIDIA experts.

Exclusive GTC Experiences
- LEGO® Datacenter Model: Explore Azure AI infrastructure at the Park Container.
- Candy Lounge: Visit the high-traffic candy wall for co-branded treats all day long.
- Networking Lounge: Relax and recharge with comfy seating and vital charging options.
- Outdoor Juice Truck: Free, refreshing beverages served during outdoor park hours.

Sponsored Breakout Sessions

Reinventing Semiconductor Design with Microsoft Discovery (Microsoft Featured)
S82398 · Mon, Mar 16 · 4:00 PM
Speaker: Prashant Varshney, Microsoft · Semiconductor & AI Engineering
Abstract: Semiconductor teams face exploding design complexity and shrinking verification windows. This session shows how the Microsoft Discovery AI for Science platform, combined with Synopsys Agent Engineers, introduces an agentic approach to EDA that automates routine steps and accelerates expert decision-making on Azure.

Operationalizing Agentic AI at Hyperscale (Microsoft Featured)
S82399 · Tue, Mar 17 · 1:00 PM
Speakers: Nitin Nagarkatte, Microsoft · Azure AI Infrastructure; Anand Raman, Microsoft · Azure AI; Vipul Modi, Microsoft · AI Systems
Abstract: As enterprises move to agentic systems, the challenge shifts to operating intelligent agents reliably at scale. This session demonstrates how Microsoft builds AI Factories on Azure using NVIDIA technology and explores Microsoft Foundry as the control plane for deploying and operating coordinated AI agents.
Live from GTC: AI Podcast
Live Special Feature: A conversation with Microsoft Azure
- Dayan Rodriguez, Corporate Vice President, Global Manufacturing and Mobility
- Alistair Spiers, General Manager, Azure Infrastructure
Listen & Subscribe: aka.ms/GTC2026Podcast

Earned Conference Sessions

Don't miss these high-impact sessions where Microsoft and NVIDIA leaders discuss the future of AI factories and infrastructure.
- Mon · Mar 16 · 5:00 PM · Drive Optimal Tokens per Watt on AI Infrastructure Using Benchmarking Recipes · Speakers: Paul Edwards, Emily Potyraj (Microsoft, NVIDIA)
- Tue · Mar 17 · 9:00 AM · Autonomous AI Factories: Technical Preview of Agent-Native Production · Speakers: JP Vasseur, César Martinez Spessot (NVIDIA, Microsoft Research)
- Tue · Mar 17 · 4:00 PM · The Road to Intelligent Mobility: Vehicle GenAI · Speakers: Raj Paul, Thomas Evans, Bryan Goodman (Microsoft, NVIDIA, Bosch)
- Wed · Mar 18 · 9:00 AM · Supercharging AI with Multi-Gigawatt AI Factories · Speakers: Gilad Shainer, Peter Salanki, Evan Burness (NVIDIA, CoreWeave, Meta, Microsoft)

Daily Booth Theater Schedule

Visit the Microsoft Theater for lightning talks from engineering leaders and partners.

Monday, March 16
- 2:00 PM · BTH208 · NVIDIA · Accelerate AI Innovation on Azure with NVIDIA Run:ai · Rob Magno
- 2:30 PM · BTH202 · General Robotics · Models to Machines: Deploying Agentic AI in Real-World Robotics · Dinesh Narayanan
- 3:00 PM · BTH200 · Fractal Analytics · From Generalist to Enterprise-Ready: Fractal Builds Domain AI · C. Chaudhuri
- 3:30 PM · BTH109 · Microsoft · Agentic cloud ops - Smarter Operations with Azure Copilot · Jyoti Sharma
- 4:00 PM · BTH103 · Microsoft · Build a Deep Research Agent for Enterprise Data · D. Casati, A. Slutsky, H. Alkemade
- 4:30 PM · BTH205 · NetApp · Azure NetApp Files: Powering Your Data for AI Capabilities · Andy Chan
- 5:00 PM · BTH207 · NVIDIA · The Agentic Commerce Stack: Open Models on Azure · Antonio Martinez
- 5:30 PM · BTH217 · OPAQUE · Confidential AI on Azure Unlocks Sovereign AI at Scale · Aaron Fulkerson
- 6:00 PM · BTH218 · Simplismart · Making BYOC work at scale with modular inference · Amritanshu Jain
- 6:30 PM · Expo Reception

Tuesday, March 17
- 1:30 PM · BTH100 · Microsoft · From Open Weights to Enterprise Scale: Open-Source Models · Sharmila Chockalingam
- 2:00 PM · BTH212 · Personal AI · Unlocking the power of memory in Teams with Personal AI · Sam Harkness
- 2:30 PM · BTH111 · Microsoft / NVIDIA · Scalable LLM Inference on AKS Using NVIDIA Dynamo · Mohamad Al jazaery, Anton Slutsky
- 3:00 PM · BTH204 · Mistral AI · Innovate with Mistral AI on Microsoft Foundry · Ian Mathew
- 3:30 PM · BTH104 · Microsoft · GPU-Accelerated CFD at Scale: Star-CCM+ on Azure · Jason Scheffelmaer
- 4:00 PM · BTH206 · NeuBird AI · Agentic AI for Incident Response on Microsoft Azure · Grant Griffiths
- 4:30 PM · BTH101 · GitHub · Agentic DevOps: Evolving software with GitHub Copilot · Glenn Wester
- 5:00 PM · BTH209 · Rescale · Real-World AI Physics: GM & NVIDIA on Rescale · Dinal Perera
- 5:30 PM · BTH107 · Microsoft · Intro to LoRA Fine-Tuning on Azure · Christin Pohl
- 6:30 PM · Raffle

Wednesday, March 18
- 1:00 PM · BTH219 · VAST Data · Scaling AI Infrastructure on Azure with VAST Data · Jason Vallery
- 1:30 PM · BTH110 · Microsoft · Physical AI and Robotics: The Next Frontier · F. Miller, C. Souche, D. Narayanan
- 2:00 PM · BTH105 · Microsoft · Sovereign AI options with Azure Local · Kim Lam
- 2:30 PM · BTH108 · Microsoft · Automating HPC Workflows with Copilot Agents · Param Shah
- 3:00 PM · BTH102 · Microsoft · Trustworthy Multi-Agent Workflows with Microsoft Foundry · Brian Benz
- 4:00 PM · BTH106 · Microsoft · Scaling Enterprise AI on ARO with NVIDIA H100 & H200 · Lachie Evenson
- 4:30 PM · BTH211 · WEKA · Hybrid AI Data Orchestration with WEKA NeuralMesh™ · Desiree Campbell
- 5:00 PM · BTH202 · Hammerspace · NVIDIA AI Enterprise Software with NIM · Mike Bloom
- 5:30 PM · BTH203 · Kinaxis · Reimagining Global Supply Planning with Azure · Dane Henshall
- 6:00 PM · BTH214 · AT&T · Connected AI on Azure for Manufacturing · Brad Pritchett
- 6:30 PM · Raffle

Thursday, March 19
- 11:00 AM · BTH210 · Wandelbots · Physical AI: Powering Software-Defined Automation in Robotics · Marwin Kunz, Martin George
- 11:30 AM · Raffle

Explore Our Demo Pods

Visit the Microsoft booth to see our technology in action with live demonstrations across four dedicated pod areas.
- POD 1 · Azure AI Infrastructure: End-to-end AI infrastructure for training and inference at scale, featuring the latest NVIDIA GPU integrations on Azure.
- POD 2 · Microsoft Foundry: Our comprehensive platform for building, deploying, and operating agentic AI systems with enterprise reliability.
- POD 3 · Building AI Together: Showcasing joint Microsoft and NVIDIA solutions across diverse industries, from manufacturing to retail.
- POD 4 · Startups Powering AI: Discover how innovative startups are running next-generation AI workloads on the Azure platform.

Ancillary Events & Networking

Join Microsoft leadership and our partner ecosystem at these curated networking experiences.
- Sun · Mar 15 · 6:00 PM · Microsoft for Startups Executive Leadership Dinner · 📍 Morton’s Steakhouse, San Jose. Exclusive gathering for startup leaders and Microsoft executives.
Mon · Mar 16 1:30 PM Microsoft × NVIDIA Open Meet 📍 Signia by Hilton · International Suite Strategic alignment session for Microsoft and NVIDIA executives. Mon · Mar 16 7:30 PM Microsoft + NVIDIA Executive Dinner 📍 Il Fornaio, San Jose Executive dinner for key customers and leadership teams. Tue · Mar 17 11:00 AM to 1:00 PM Microsoft AI Luncheon: Research, Robotics, & Real‑World AI 📍 Signia by Hilton · International Suite Invite-only: A curated executive lunch exploring the journey from AI research to physical enterprise deployments in robotics and manufacturing. Tue · Mar 17 7:30 PM Networking in AI & Tech 📍 San Pedro Square Market Community networking mixer for Microsoft teams, partners, and customers. Wed · Mar 18 10:00 AM to 1:00 PM AI Innovator’s Circle Brunch: Powering Intelligent Systems Across the Ecosystem 📍 Il Fornaio, San Jose Hosted by Microsoft & NVIDIA at GTC. Join us for an exclusive brunch and discussion on the intelligent ecosystem.Centralized cluster performance metrics with ReFrame HPC and Azure Log Analytics
Imagine having several clusters across different environments (dev, test and prod), planning a migration between PBS and Slurm, or porting codes to a different system. These can all seem like daunting tasks. This is where the combination of ReFrame HPC, a powerful and feature-rich testing framework, and Azure Log Analytics can help improve confidence in the performance and accuracy of a system.

Here we will look at how to configure ReFrame HPC specifically for Azure: deploying the required Azure resources, running a test and capturing the results in Log Analytics for analysis.

Deploying the required Azure Resources

First, deploy the required resources in Azure by using this Bicep file from GitHub. The deployment creates and configures everything required for ReFrame HPC: a data collection endpoint, a data collection rule and a Log Analytics workspace.

Running ior via ReFrame HPC

To demonstrate running a test and capturing the results in Azure from start to finish, here is a simple ior test which runs both a write and a read operation against the shared storage.

    import reframe as rfm
    import reframe.utility.sanity as sn


    @rfm.simple_test
    class SimplePerfTest(rfm.RunOnlyRegressionTest):
        valid_systems = ["*"]
        valid_prog_environs = ["+ior"]
        executable = 'ior'
        executable_opts = [
            '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
        ]
        reference = {
            'tst:hbv4': {
                'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
                'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
            }
        }

        @sanity_function
        def validate_run(self):
            return sn.assert_found(r'Summary of all tests:', self.stdout)

        @performance_function('MiB/s')
        def write_bandwidth_mib(self):
            return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

        @performance_function('MiB/s')
        def read_bandwidth_mib(self):
            return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

Test explanation

Set the binary to be executed to ior, along with its arguments.

    executable = 'ior'
    executable_opts = [
        '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
    ]

Specify which systems the test should run on. In this case, any system/cluster which is known to have ior available will be selected. See the ReFrame HPC documentation for a fuller description of the available options.

    valid_systems = ["*"]
    valid_prog_environs = ["+ior"]

Verify the stdout of the job by searching for a specific string to assert that it ran successfully.

    @sanity_function
    def validate_run(self):
        return sn.assert_found(r'Summary of all tests:', self.stdout)

If the sanity function passes, ReFrame HPC then extracts the performance metrics from the stdout of the job. The naming of the methods is important, as the method names are stored as the metric names in the results later.

    @performance_function('MiB/s')
    def write_bandwidth_mib(self):
        return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

    @performance_function('MiB/s')
    def read_bandwidth_mib(self):
        return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

Performance references are used to determine whether the current cluster has met the requirement. Each reference also specifies the allowed margin in either direction, as a fraction of the reference value. For example, (500, -0.05, 0.1, 'MiB/s') accepts any result from 5% below to 10% above 500 MiB/s.

    reference = {
        'tst:hbv4': {
            'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
            'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
        }
    }

ReFrame HPC Configuration

The ReFrame HPC configuration determines how and where the test will run. It is also where the logic allowing ReFrame HPC to use Azure for centralized logging is defined. The full configuration file is extensive and is covered in detail in the ReFrame HPC documentation. For the purpose of this test, an example can be found on GitHub. Below is a breakdown of the key parts that allow ReFrame HPC to push its results into Azure Log Analytics.

Logging Handler

The most important part of this configuration is the logging section; without it, ReFrame HPC will not attempt to log the results. A handlers_perflog entry of type httpjson is added so that the logs are sent to an HTTP endpoint, with specific values which are covered below.

    'logging': [
        {
            'perflog_multiline': True,
            'handlers_perflog': [
                {
                    'type': 'httpjson',
                    'url': 'REDACTED',
                    'level': 'info',
                    'debug': False,
                    'extra_headers': {'Authorization': f'Bearer {_get_token()}'},
                    'extras': {
                        'TimeGenerated': f'{datetime.now(timezone.utc).isoformat()}',
                        'facility': 'reframe',
                        'reframe_azure_data_version': '1.0',
                    },
                    'ignore_keys': ['check_perfvalues'],
                    'json_formatter': _format_record
                }
            ]
        }
    ]

Multiline Perflog

To ensure this works with Azure, enable perflog_multiline. This will ensure a single record per metric is sent to Log Analytics. This is the cleanest way to output the results.
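The difference between the two record shapes can be sketched in plain Python. This is an illustration only, not ReFrame HPC's implementation: the helper names `to_multiline` and `to_wide`, the sample values, and the wide-format column names are all hypothetical, and the `check_perf_*` field names are assumed to match the perflog attributes ReFrame HPC emits.

```python
# Illustrative sketch only. Sample values and helper names are hypothetical;
# the check_perf_* field names are assumed, not taken from the original post.

def to_multiline(common, perfvalues):
    """Return one record per metric; every record has the same keys."""
    return [
        {
            **common,
            'check_perf_var': name,
            'check_perf_value': value,
            'check_perf_unit': unit,
        }
        for name, (value, unit) in perfvalues.items()
    ]


def to_wide(common, perfvalues):
    """Return a single record with one set of columns per metric."""
    record = dict(common)
    for name, (value, unit) in perfvalues.items():
        record[f'{name}_value'] = value
        record[f'{name}_unit'] = unit
    return record


# Made-up results for the two metrics defined in the ior test.
common = {'check_name': 'SimplePerfTest', 'facility': 'reframe'}
perfvalues = {
    'write_bandwidth_mib': (512.3, 'MiB/s'),
    'read_bandwidth_mib': (361.8, 'MiB/s'),
}

rows = to_multiline(common, perfvalues)
wide = to_wide(common, perfvalues)

for row in rows:
    print(sorted(row))   # same column set for every metric
print(sorted(wide))      # column set depends on the metric names
```

In the one-record-per-metric shape, a test with different metric names still produces rows with the same columns, so a single Log Analytics table schema can hold results from every test.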
Having this set to False will move the metric names into column names, which means that the schema will be different for each test and will become hard to maintain.

Extra Headers

A bearer token is required to authenticate the request. ReFrame HPC allows headers to be added via the extra_headers property, and a simple Python function obtains a scoped token that is placed in the Authorization header.

    from azure.identity import DefaultAzureCredential

    def _get_token(scope='https://monitor.azure.com/.default') -> str:
        # Uses whichever credential is available (managed identity, az login, ...)
        credential = DefaultAzureCredential()
        token = credential.get_token(scope)
        return token.token

URL Structure

The URL can be found in the output of the Bicep deployment which was run previously. It can also be obtained via the portal. Here is the structure of the URL for reference.

    '${dce.properties.logsIngestion.endpoint}/dataCollectionRules/${dcr.properties.immutableId}/streams/Custom-${table.name}?api-version=2023-01-01'

JSON Formatter

A small workaround is needed, as the Data Collection Rule expects an array of items and ReFrame HPC outputs a single record. To resolve this, another Python function can be used which simply wraps the record in an array. In this example it also tidies up the record by removing items that are not required and would cause issues with the JSON serialization.

    import json

    def _format_record(record, extras, ignore_keys):
        data = {}
        for attr, val in record.__dict__.items():
            # Skip ignored keys and private attributes of the log record
            if attr in ignore_keys or attr.startswith('_'):
                continue
            data[attr] = val
        data.update(extras)
        # The Data Collection Rule expects a JSON array, so wrap the record
        return json.dumps([data])

Running the Test

Now that the infrastructure has been deployed and the test has been defined and correctly configured, we can run the test. Start by logging in. Here I am using the managed identity of the node, but user authentication and user-assigned managed identities are also supported.

    $ az login --identity

ReFrame HPC can be installed via Spack or Python and, while I am using Spack for packages on the cluster, I find the simplest approach is to activate a Python environment and install ReFrame HPC along with the test-specific Python dependencies.

    $ python3 -m venv .venv
    $ . .venv/bin/activate
    $ python -m pip install -U pip
    $ pip install -r requirements.txt

Now, using the ReFrame HPC CLI, the test can be run with the configuration file and the test file.

    $ reframe -C config.py -c simple_perf.py --performance-report -r

ReFrame HPC will now run the test against the system/cluster defined in the configuration. For this example it is a Slurm cluster with a partition of HBv4 nodes, and running squeue confirms that.

    $ squeue
    JOBID PARTITION     NAME     USER ST  TIME NODES NODELIST(REASON)
      955      hbv4 rfm_Simp jim.pain  R  0:28     1 tst4-hbv4-97

Results

And there we have it, results are now appearing in Azure! From here we can use KQL to query and filter the results. This is just a subset of the values available; the full dataset covers a wide range of fields for each test run.

Summary

By standardizing on the combination of ReFrame HPC and Azure Log Analytics for testing and reporting of performance data across your clusters, whether Slurm based, Azure CycleCloud or existing on-prem clusters, you can gain a level of visibility into, and confidence in, the systems you manage and the codes you deploy that was previously hard to obtain. This enables:

🔎 Fast cross-cluster comparisons
📈 Trend analysis over long-running periods
📊 Standardized metrics regardless of scheduler or system
☁️ Unified monitoring and reporting across clusters

ReFrame HPC is suitable for a wide range of testing, so if testing is something you have been looking to implement, take a look at ReFrame HPC.