ai infrastructure

Microsoft at NVIDIA GTC 2026
Microsoft returns to NVIDIA GTC 2026 in San Jose with a strong presence across conference sessions, in‑booth theater talks, live demos, and executive‑level ancillary events. Together with NVIDIA and our partner ecosystem, Microsoft is showcasing how Azure AI infrastructure enables AI training, inference, and production at global scale. Visit us at Booth #521 to see the latest innovations in action and connect with Azure and NVIDIA experts.

Exclusive GTC Experiences

- LEGO® Datacenter Model: Explore Azure AI infrastructure at the Park Container.
- Candy Lounge: Visit the high-traffic candy wall for co-branded treats all day long.
- Networking Lounge: Relax and recharge with comfy seating and vital charging options.
- Outdoor Juice Truck: Free, refreshing beverages served during outdoor park hours.

Sponsored Breakout Sessions

Microsoft Featured: Reinventing Semiconductor Design with Microsoft Discovery
S82398 · Mon, Mar 16 · 4:00 PM
Speaker: Prashant Varshney (Microsoft · Semiconductor & AI Engineering)
Abstract: Semiconductor teams face exploding design complexity and shrinking verification windows. This session shows how the Microsoft Discovery AI for Science platform, combined with Synopsys Agent Engineers, introduces an agentic approach to EDA that automates routine steps and accelerates expert decision-making on Azure.

Microsoft Featured: Operationalizing Agentic AI at Hyperscale
S82399 · Tue, Mar 17 · 1:00 PM
Speakers: Nitin Nagarkatte (Microsoft · Azure AI Infrastructure), Anand Raman (Microsoft · Azure AI), Vipul Modi (Microsoft · AI Systems)
Abstract: As enterprises move to agentic systems, the challenge shifts to operating intelligent agents reliably at scale. This session demonstrates how Microsoft builds AI Factories on Azure using NVIDIA technology and explores Microsoft Foundry as the control plane for deploying and operating coordinated AI agents.
Live from GTC: AI Podcast

Live Special Feature: A conversation with Microsoft Azure
Monday, March 16 @ 2 PM
- Dayan Rodriguez, Corporate Vice President, Global Manufacturing and Mobility
- Alistair Spiers, General Manager, Azure Infrastructure
Listen & Subscribe: aka.ms/GTC2026Podcast

Earned Conference Sessions

Don't miss these high-impact sessions where Microsoft and NVIDIA leaders discuss the future of AI factories and infrastructure.

- Mon · Mar 16 · 5:00 PM: Drive Optimal Tokens per Watt on AI Infrastructure Using Benchmarking Recipes. Speakers: Paul Edwards, Emily Potyraj (Microsoft, NVIDIA)
- Tue · Mar 17 · 9:00 AM: Autonomous AI Factories: Technical Preview of Agent-Native Production. Speakers: JP Vasseur, César Martinez Spessot (NVIDIA, Microsoft Research)
- Tue · Mar 17 · 4:00 PM: The Road to Intelligent Mobility: Vehicle GenAI. Speakers: Raj Paul, Thomas Evans, Bryan Goodman (Microsoft, NVIDIA, Bosch)
- Wed · Mar 18 · 9:00 AM: Supercharging AI with Multi-Gigawatt AI Factories. Speakers: Gilad Shainer, Peter Salanki, Evan Burness (NVIDIA, CoreWeave, Meta, Microsoft)

Daily Booth Theater Schedule

Visit the Microsoft Theater for lightning talks from engineering leaders and partners.

Monday, March 16

| Time | Session | Presenter | Title | Speaker(s) |
|---|---|---|---|---|
| 2:00 PM | BTH208 | NVIDIA | Accelerate AI Innovation on Azure with NVIDIA Run:ai | Rob Magno |
| 2:30 PM | BTH202 | General Robotics | Models to Machines: Deploying Agentic AI in Real-World Robotics | Dinesh Narayanan |
| 3:00 PM | BTH200 | Fractal Analytics | From Generalist to Enterprise-Ready: Fractal Builds Domain AI | C. Chaudhuri, S. Chakraborty |
| 3:30 PM | BTH109 | Microsoft | Agentic cloud ops - Smarter Operations with Azure Copilot | Jyoti Sharma |
| 4:00 PM | BTH103 | Microsoft | Build a Deep Research Agent for Enterprise Data | D. Casati, A. Slutsky, H. Alkemade |
| 4:30 PM | BTH205 | NetApp | Azure NetApp Files: Powering Your Data for AI Capabilities | Andy Chan |
| 5:00 PM | BTH207 | NVIDIA | The Agentic Commerce Stack: Open Models on Azure | Antonio Martinez |
| 5:30 PM | BTH217 | OPAQUE | Confidential AI on Azure Unlocks Sovereign AI at Scale | Aaron Fulkerson |
| 6:00 PM | BTH218 | Simplismart | Making BYOC work at scale with modular inference | Amritanshu Jain |
| 6:30 PM | | | Expo Reception | |

Tuesday, March 17

| Time | Session | Presenter | Title | Speaker(s) |
|---|---|---|---|---|
| 1:30 PM | BTH100 | Microsoft | From Open Weights to Enterprise Scale: Open-Source Models | Sharmila Chockalingam |
| 2:00 PM | BTH212 | Personal AI | Unlocking the power of memory in Teams with Personal AI | Sam Harkness |
| 2:30 PM | BTH111 | Microsoft / NVIDIA | Scalable LLM Inference on AKS Using NVIDIA Dynamo | Mohamad Al jazaery, Anton Slutsky |
| 3:00 PM | BTH204 | Mistral AI | Innovate with Mistral AI on Microsoft Foundry | Ian Mathew |
| 3:30 PM | BTH104 | Microsoft | GPU-Accelerated CFD at Scale: Star-CCM+ on Azure | Jason Scheffelmaer |
| 4:00 PM | BTH206 | NeuBird AI | Agentic AI for Incident Response on Microsoft Azure | Grant Griffiths |
| 4:30 PM | BTH101 | GitHub | Agentic DevOps: Evolving software with GitHub Copilot | Glenn Wester |
| 5:00 PM | BTH209 | Rescale | Real-World AI Physics: GM & NVIDIA on Rescale | Dinal Perera |
| 5:30 PM | BTH107 | Microsoft | Intro to LoRA Fine-Tuning on Azure | Christin Pohl |
| 6:30 PM | | | Raffle | |

Wednesday, March 18

| Time | Session | Presenter | Title | Speaker(s) |
|---|---|---|---|---|
| 1:00 PM | BTH219 | VAST Data | Scaling AI Infrastructure on Azure with VAST Data | Jason Vallery |
| 1:30 PM | BTH110 | Microsoft | Physical AI and Robotics: The Next Frontier | F. Miller, C. Souche, D. Narayanan |
| 2:00 PM | BTH105 | Microsoft | Sovereign AI options with Azure Local | Kim Lam |
| 2:30 PM | BTH108 | Microsoft | Automating HPC Workflows with Copilot Agents | Param Shah |
| 3:00 PM | BTH102 | Microsoft | Trustworthy Multi-Agent Workflows with Microsoft Foundry | Brian Benz |
| 4:00 PM | BTH106 | Microsoft | Scaling Enterprise AI on ARO with NVIDIA H100 & H200 | Lachie Evenson |
| 4:30 PM | BTH211 | WEKA | Hybrid AI Data Orchestration with WEKA NeuralMesh™ | Desiree Campbell |
| 5:00 PM | BTH202 | Hammerspace | NVIDIA AI Enterprise Software with NIM | Mike Bloom |
| 5:30 PM | BTH203 | Kinaxis | Reimagining Global Supply Planning with Azure | Dane Henshall |
| 6:00 PM | BTH214 | AT&T | Connected AI on Azure for Manufacturing | Brad Pritchett |
| 6:30 PM | | | Raffle | |

Thursday, March 19

| Time | Session | Presenter | Title | Speaker(s) |
|---|---|---|---|---|
| 11:00 AM | BTH210 | Wandelbots | Physical AI: Powering Software-Defined Automation in Robotics | Marwin Kunz, Martin George |
| 11:30 AM | | | Raffle | |

Explore Our Demo Pods

Visit the Microsoft booth to see our technology in action with live demonstrations across four dedicated pod areas.

- POD 1, Azure AI Infrastructure: End‑to‑end AI infrastructure for training and inference at scale, featuring the latest NVIDIA GPU integrations on Azure.
- POD 2, Microsoft Foundry: Our comprehensive platform for building, deploying, and operating agentic AI systems with enterprise reliability.
- POD 3, Building AI Together: Showcasing joint Microsoft and NVIDIA solutions across diverse industries, from manufacturing to retail.
- POD 4, Startups Powering AI: Discover how innovative startups are running next‑generation AI workloads on the Azure platform.

Ancillary Events & Networking

Join Microsoft leadership and our partner ecosystem at these curated networking experiences.

- Sun · Mar 15 · 6:00 PM: Microsoft for Startups Executive Leadership Dinner. 📍 Morton’s Steakhouse, San Jose. Exclusive gathering for startup leaders and Microsoft executives.
- Mon · Mar 16 · 1:30 PM: Microsoft × NVIDIA Open Meet. 📍 Signia by Hilton · International Suite. Strategic alignment session for Microsoft and NVIDIA executives.
- Mon · Mar 16 · 7:30 PM: Microsoft + NVIDIA Executive Dinner. 📍 Il Fornaio, San Jose. Executive dinner for key customers and leadership teams.
- Tue · Mar 17 · 7:30 PM: Networking in AI & Tech. 📍 San Pedro Square Market. Community networking mixer for Microsoft teams, partners, and customers.
- Wed · Mar 18 · 10:00 AM to 1:00 PM: AI Innovator’s Circle Brunch: Powering Intelligent Systems Across the Ecosystem. 📍 Il Fornaio, San Jose. Hosted by Microsoft & NVIDIA at GTC. Join us for an exclusive brunch and discussion on the intelligent ecosystem.
Fernando_Aznar
Feb 27, 2026 · Azure High Performance Computing (HPC) Blog
Azure Recognized as an NVIDIA Cloud Exemplar, Setting the Bar for AI Performance in the Cloud
As AI models continue to scale in size and complexity, cloud infrastructure must deliver more than theoretical peak performance. What matters in practice is reliable, end-to-end, workload-level AI performance, where compute, networking, system software, and optimization work together to deliver predictable, repeatable results at scale. This directly translates to business value: efficient full-stack infrastructure accelerates time-to-market, maximizes ROI on GPU and cloud investments, and enables organizations to scale AI from proof-of-concept to revenue-generating products with predictable economics.

Today, Microsoft is proud to share an important milestone in partnership with NVIDIA: Azure has been validated as an NVIDIA Exemplar Cloud, becoming the first cloud provider recognized for Exemplar-class AI performance aligned with GB300-class (Blackwell generation) systems. This recognition builds on Azure’s previously validated Exemplar status for H100 training workloads and reflects NVIDIA’s confidence in Azure’s ability to extend that rigor and performance discipline into the next generation of AI platforms.

What Is NVIDIA Exemplar Cloud?

The NVIDIA Exemplar Cloud initiative celebrates cloud platforms that demonstrate robust end-to-end AI workload performance using NVIDIA’s Performance Benchmarking suite. Rather than relying on synthetic microbenchmarks, Performance Benchmarking evaluates real AI training workloads using:

- Large-scale LLM training scenarios
- Production-grade software stacks
- Optimized system and network configurations
- Workload-centric metrics such as throughput and time-to-train

Achieving Exemplar validation signals that a provider can consistently deliver world-class AI performance in the cloud, showcasing that end users are getting optimal performance value by default.
Proven Exemplar Validation on H100

Azure’s Exemplar Cloud journey began with publicly shared benchmarking results for H100-based training workloads, where Azure ND GPU clusters demonstrated exemplar performance using NVIDIA Performance Benchmarking recipes. Those results, published previously and validated through NVIDIA’s benchmarking framework, established a proven foundation of end-to-end AI performance for large-scale, production workloads running on Azure today.

Extending Exemplar-Class AI Performance to GB300-Class Platforms

Building on the rigor and learnings from H100 validation, Microsoft has now been recognized by NVIDIA as the first cloud provider to achieve Exemplar-class performance and readiness aligned with GB300-class systems. This designation reflects NVIDIA’s assessment that the same principles applied to H100, including end-to-end system tuning, networking optimization, and software alignment, are being successfully carried forward into the Blackwell generation.

Rather than treating GB300 as a point solution, Azure approaches it as a continuation of a proven performance model: delivering consistent world-class AI performance in the cloud while preserving the flexibility, elasticity, and global scale customers expect.
What Enables Exemplar-Class AI Performance on Azure

Delivering Exemplar-class AI performance requires optimization across the full AI stack:

Infrastructure and Networking
- High-performance Azure ND GPU clusters with NVIDIA InfiniBand
- NUMA-aware CPU, GPU, and NIC alignment to minimize latency
- Tuned NCCL communication paths for efficient multi-GPU scaling

Software and System Optimization
- Tight integration with NVIDIA software, including Performance Benchmarking recipes and NVIDIA AI Enterprise
- Parallelism strategies aligned with large-scale LLM training
- Continuous tuning as models, workloads, and system architectures evolve

End-to-End Workload Focus
- Measuring real training performance, not isolated component metrics
- Driving repeatable improvements in application-level throughput and efficiency
- Closing the performance gap between cloud and on-premises systems without sacrificing manageability

Together, these capabilities enabled Azure to deliver consistent Exemplar-class AI performance across generations of NVIDIA platforms.

What This Means for Customers

For customers training and deploying advanced AI models, this milestone delivers clear benefits:

- World-class AI performance in a fully managed cloud environment
- Predictable scaling from small clusters to thousands of GPUs
- Faster time to train and improved performance per dollar
- Confidence that Azure is ready for Blackwell-class and GB300-class AI workloads

As AI workloads become more complex and reasoning-heavy, infrastructure performance increasingly determines outcomes. Azure’s NVIDIA Cloud Exemplar recognition provides a clear signal: customers can build and scale next-generation AI systems on Azure without compromising on performance.

Learn More

DGX Cloud Benchmarking on Azure | Microsoft Community Hub
Fernando_Aznar
Feb 18, 2026 · Azure High Performance Computing (HPC) Blog
Private Preview: Azure Managed Prometheus on VM / VMSS
Announcing private preview support for Azure Managed Prometheus on VM/VMSS, enabling unified monitoring with GPU, InfiniBand, and node-level metrics for HPC workloads.
Daramfon
Feb 18, 2026 · Azure High Performance Computing (HPC) Blog
Comprehensive NVIDIA GPU Monitoring for Azure N-Series VMs Using Telegraf with Azure Monitor
Unlocking NVIDIA GPU monitoring for Azure N-series VMs with Telegraf and Azure Monitor. In AI and HPC, GPU inefficiencies can bottleneck workflows and drive up costs, so GPU performance needs close monitoring. Azure Monitor tracks key resources like CPU and memory, but it lacks native GPU monitoring for Azure N-series VMs. Telegraf integrates with Azure Monitor to bridge this gap. This blog shows how to use Telegraf for comprehensive GPU monitoring so that your GPUs run efficiently in the cloud.
vinilv
Feb 04, 2026 · Azure High Performance Computing (HPC) Blog
Scaling physics-based digital twins: Neural Concept on Azure delivers a New Record in Industrial AI
Automotive Design and the DrivAerNet++ Benchmark

In automotive design, external aerodynamics have a direct impact on performance, energy efficiency, and development cost. Even small reductions in drag can translate into significant fuel savings or extended EV range. As development timelines accelerate, engineering teams increasingly rely on data-driven methods to augment or replace traditional CFD workflows.

MIT’s DrivAerNet++ dataset is the largest open multimodal dataset for automotive aerodynamics, offering a large-scale benchmark for evaluating learning-based approaches that capture the physical signals required by engineers. It includes 8,000 vehicle geometries across 3 variants (fastback, notchback and estate-back) and aggregates 39 TB of high-fidelity CFD outputs such as surface pressure, wall shear stress, volumetric flow fields, and drag coefficients.

Benchmark Highlights

Neural Concept trained its geometry-native Geometric Regressor, designed to handle any type of engineering data. The benchmark was executed on Azure HPC infrastructure to evaluate the capabilities of the geometry-native platform under transparent, scalable, and fully reproducible conditions.

- Surface pressure: Lowest prediction error recorded on the benchmark, revealing where high- and low-pressure zones form.
- Wall shear stress: Outperforms all competing methods, detecting flow attachment and separation for drag and stability control.
- Volumetric velocity field: More than 50% lower error than the previous best, capturing full flow structure for wake stability analysis.
- Drag coefficient Cd: R² of 0.978 on the test set, accurate enough for early design screening without full CFD runs.
- Dataset Scale and Ingestion: 39 TB of data was ingested into Neural Concept’s platform through a parallel conversion task with 128 workers and 5 GB RAM each that finished in about 1 hour and produced a compact 3 TB dataset in the platform’s native format.
- Data Pre-Processing: Pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices for handling external aerodynamics workflows.
- Model Training and Deployment: Training completed in 24 hours on 4 A100 GPUs, with the best model obtained after 16 hours. The final model is compact and real-time predictions can be served on a single 16 GB GPU for industrial use.

Neural Concept outperformed all other competing methods, achieving state-of-the-art performance prediction on all metrics and physical quantities within a week:

“Neural Concept’s breakthrough demonstrates the power of combining advanced AI with the scalability of Microsoft Azure,” said Jack Kabat, Partner, Azure HPC and AI Infrastructure Products, Microsoft. “By running training and deployment on Azure’s high-performance infrastructure, specifically the NC A100 Virtual Machine, Neural Concept was able to transform 39 terabytes of data into a production-ready workflow in just one week. This shows how Azure accelerates innovation and helps automotive manufacturers bring better products to market faster.”

For additional benchmark metrics and comparisons, please refer to the Detailed Quantitative Results section at the end of the article.

From State-Of-The-Art Benchmark Accuracy to Proven Industrial Impact

Model accuracy alone is necessary, but not sufficient for industrial impact. Transformative gains at scale and over time are only revealed once high-performing models are deployed into maintainable and repeatable workflows across organizations. Customers using Neural Concept’s platform have achieved:

- 30% shorter design cycles
- $20M in savings on a 100,000-unit vehicle program

These outcomes fundamentally result from a transformed, systematic approach to design, unlocking better and faster data-driven decisions. The Design Lab interface, described in the next section, is at the core of this transformation.
Within Neural Concept’s ecosystem, validated geometry and physics models can be deployed directly into the Design Lab, a collaborative environment where aerodynamicists and designers evaluate concepts in real time. AI copilots provide instant performance feedback, geometry-aware improvement suggestions, and live KPI updates, effectively reconnecting aerodynamic analysis with the pace of modern vehicle design.

CES 2026: See how OEMs are transforming product development with Engineering Intelligence

Neural Concept and Microsoft will showcase how AI-native aerodynamic workflows can reshape vehicle development, from real-time design exploration to enterprise-scale deployment. Visit the Microsoft booth to see DrivAerNet++ running on Azure HPC and meet the teams shaping the future of automotive engineering.

Neural Concept’s executive team will also be at CES to share flagship results achieved by leading OEMs and Tier-1 suppliers already using the platform in production. Learn more at: https://www.neuralconcept.com/ces-2026

Credits

Microsoft: Hugo Meiland (Principal Program Manager), Guy Bursell (Director Business Strategy, Manufacturing), Fernando Aznar Cornejo (Product Marketing Manager) and Dr. Lukasz Miroslaw (Sr. Industry Advisor)

Neural Concept: Theophile Allard (CTO), Benoit Guillard (Senior ML Research Scientist), Alexander Gorgin (Product Marketing Engineer), Konstantinos Samaras-Tsakiris (Software Engineer)

Detailed Quantitative Results

In the sections that follow, we share the results obtained by applying Neural Concept’s aerodynamics predictive model training template to DrivAerNet++. We evaluated our model’s prediction errors using the official train/test split and the standard evaluation strategy. For comparison, metrics from other methods were taken from the public leaderboard. We reported both Mean Squared Error (MSE) and Mean Absolute Error (MAE) to quantify prediction accuracy.
Lower values for either metric indicate closer agreement with the ground truth simulations, meaning better predictions.

1. Surface Field Predictions: Pressure and Wall Shear Stress

We began by evaluating predictions for the two physical quantities defined on the vehicle surface.

Surface Pressure

The Geometric Regressor achieved substantially better performance than all existing methods in predicting surface pressure distribution.

| Rank | Deep Learning Model | MSE (×10⁻², lower = better) | MAE (×10⁻¹, lower = better) |
|---|---|---|---|
| #1 | Neural Concept | 3.98 | 1.08 |
| #2 | GAOT (May 2025) | 4.94 | 1.10 |
| #3 | FIGConvNet (February 2025) | 4.99 | 1.22 |
| #4 | TripNet (March 2025) | 5.14 | 1.25 |
| #5 | RegDGCNN (June 2024) | 8.29 | 1.61 |

Table 1: Neural Concept’s Geometric Regressor predicts surface pressure more accurately than previously published state-of-the-art methods. The dates indicate when the competing model architectures were published.

Figure 1: Side-by-side comparison of the ground truth pressure field (left), Neural Concept model’s prediction (middle), and the corresponding error for a representative test sample (right).

Wall Shear Stress

Similarly, the model delivered top-tier results, outperforming all competing methods.

| Rank | Deep Learning Model | MSE (×10⁻², lower = better) | MAE (×10⁻¹, lower = better) |
|---|---|---|---|
| #1 | Neural Concept | 7.80 | 1.44 |
| #2 | GAOT (May 2025) | 8.74 | 1.57 |
| #3 | TripNet (March 2025) | 9.52 | 2.15 |
| #4 | FIGConvNet (Feb. 2025) | 9.86 | 2.22 |
| #5 | RegDGCNN (June 2024) | 13.82 | 3.64 |

Table 2: Neural Concept’s Geometric Regressor predicts wall shear stress more accurately than previously published state-of-the-art methods.

Figure 2: Side-by-side comparison of the ground truth magnitude of the wall shear stress, Neural Concept model’s prediction, and the corresponding error for a representative test sample.

Across both surface fields (pressure and wall shear stress), the Geometric Regressor achieved the lowest MSE and MAE by a clear margin. The baseline methods are recent, high-quality academic works (the earliest from June 2024), yet our architecture established a new state of the art in predictive performance.

2. Volumetric Predictions: Velocity

Beyond surface quantities, DrivAerNet++ provides 3D velocity fields in the flow volume surrounding the vehicle, which we also predicted using the Geometric Regressor.

| Rank | Deep Learning Model | MSE (lower = better) | MAE (×10⁻¹, lower = better) |
|---|---|---|---|
| #1 | Neural Concept | 3.11 | 9.22 |
| #2 | TripNet (March 2025) | 6.71 | 15.2 |

Table 3: Neural Concept’s Geometric Regressor predicts velocity more accurately than the previously published state-of-the-art method.

The illustration below shows the velocity magnitude for two test samples. Note that only a single 2D slice of the 3D volumetric domain is shown here, focusing on the wake region behind the car. In practice, the network predicts velocity at any location within the full 3D domain, not just on this slice.

Figure 3: Velocity magnitude for two test samples, arranged in two columns (left and right). For each sample, the top row displays the simulated velocity field, the middle row shows the prediction from the network, and the bottom row presents the error between the two.

3. Scalar Predictions: Drag Coefficient

The drag coefficient (Cd) is the most critical parameter in automotive aerodynamics, as reducing it directly translates to lower fuel consumption in combustion vehicles and increased range in electric vehicles. Using the same underlying architecture, our model achieved state-of-the-art performance in Cd prediction. In addition to MSE and MAE, we reported the Maximum Absolute Error (Max AE) to reflect worst-case accuracy. We also included the Coefficient of Determination (R² score), which measures the proportion of variance explained by the model. An R² value of 1 indicates a perfect fit to the target data.
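The four metrics described above can be written down compactly. The sketch below is only an illustration of the standard definitions, not Neural Concept's evaluation code; the function name `regression_metrics` and the sample drag coefficients are made up for the example.

```python
def regression_metrics(y_true, y_pred):
    """Return MSE, MAE, Max AE and R² for two equal-length sequences of floats."""
    if len(y_true) != len(y_pred) or len(y_true) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]

    mse = sum(e * e for e in errors) / n      # Mean Squared Error
    mae = sum(abs(e) for e in errors) / n     # Mean Absolute Error
    max_ae = max(abs(e) for e in errors)      # worst-case absolute error

    # R² = 1 - SS_res / SS_tot: share of target variance explained by the model.
    mean_true = sum(y_true) / n
    ss_res = sum(e * e for e in errors)
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    r2 = 1.0 - ss_res / ss_tot if ss_tot > 0 else float("nan")

    return {"mse": mse, "mae": mae, "max_ae": max_ae, "r2": r2}


# Hypothetical drag coefficients for illustration only (not benchmark data).
cd_cfd = [0.25, 0.27, 0.30, 0.32]    # "ground truth" from simulation
cd_pred = [0.25, 0.28, 0.30, 0.31]   # model predictions
metrics = regression_metrics(cd_cfd, cd_pred)
print(metrics)
```

A perfect prediction gives R² = 1, while a model that always predicts the mean of the targets gives R² = 0, which is why R² complements the error magnitudes reported by MSE and MAE.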
| Rank | Deep Learning Model | MSE (×10⁻⁵) | MAE (×10⁻³) | Max AE (×10⁻²) | R² |
|---|---|---|---|---|---|
| #1 | Neural Concept | 0.8 | 2.22 | 1.13 | 0.978 |
| #2 | TripNet | 9.1 | 7.19 | 7.70 | 0.957 |
| #3 | PointNet | 14.9 | 9.60 | 12.45 | 0.643 |
| #4 | RegDGCNN | 14.2 | 9.31 | 12.79 | 0.641 |
| #5 | GCNN | 17.1 | 10.43 | 15.03 | 0.596 |

On the official split, the model shows tight agreement with CFD (R² of 0.978) across the test set, which is sufficient for early design screening where engineers need to rank variants confidently and spot meaningful gains without running full simulations for every change.

4. Compute Efficiency and Azure HPC & AI Collaboration

Executing the full DrivAerNet++ benchmark at industrial scale required Neural Concept’s full software and infrastructure stack combined with seamless cloud integration on Microsoft Azure to dynamically scale computing resources on demand. The entire pipeline runs natively on Microsoft Azure and can scale within minutes, allowing us to process new industrial datasets that contain thousands of geometries without complex capacity planning.

Dataset Scale and Ingestion

The DrivAerNet++ dataset contains 8,000 car designs along with their corresponding CFD simulations. The raw dataset occupies approximately 39 TB of storage. Generating the simulations required a total of about 3 million CPU hours by MIT’s DeCoDE Lab.

Ingestion into Neural Concept’s platform is the first step of the pipeline. A Conversion task transforms the raw files into the platform’s optimized native format. This task was parallelized with 128 workers, each allocated 5 GB of RAM. As a result, the entire conversion process was completed in approximately one hour. After converting the relevant data (car geometry, wall shear stress, pressure, and velocity), the full dataset occupies approximately 3 TB in Neural Concept’s native format.

Data Pre-Processing

Pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices. During this phase, workloads were distributed across multiple compute nodes with peak memory usage reaching approximately 1.5 TB of RAM.

The pre-processing pipeline consists of two main stages. In the first stage, we repaired the car meshes and pre-computed geometric features needed for training. The second stage involved filtering the volumetric domain and re-sampling points to follow a spatial distribution that is more efficient for training our deep learning model. We scaled the compute resources so that each of the two stages in the pipeline completes in 1 to 3 hours when processing the full dataset. The first stage is the most computationally intensive. To handle it efficiently, we parallelized the task across 256 independent workers, each allocated 6 GB of RAM.

Model Training and Deployment

While we use state-of-the-art hardware for training, our performance gains come primarily from model design. Once trained, the model remains lightweight and cost-effective to run. Training was performed on an Azure Standard_NC96ads_A100_v4 node, which provided access to four A100 GPUs, each with 80 GB of memory. The model was trained for approximately 24 hours.

Neural Concept’s Geometric Regressor achieved the best reported performance on the official benchmark for surface pressure, wall shear stress, volumetric velocity and drag prediction.
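As a back-of-the-envelope check, the ingestion and pre-processing figures quoted above can be combined into a few derived numbers. This is a sketch under stated assumptions: decimal units (1 TB = 10¹² bytes), exactly one hour of ingestion, every raw byte being read once, and every worker using its full RAM allocation. None of these is confirmed by the article, so the results are rough estimates only.

```python
TB = 1000**4   # decimal terabyte (assumption; the article does not say TB vs TiB)
GB = 1000**3
MB = 1000**2

raw_bytes = 39 * TB        # raw DrivAerNet++ dataset
native_bytes = 3 * TB      # size after conversion to the native format
ingest_seconds = 3600      # "approximately one hour", taken as exactly 1 h

ingest_workers, ingest_ram_gb = 128, 5   # conversion task
prep_workers, prep_ram_gb = 256, 6       # first pre-processing stage

# Size reduction from raw files to the native format.
size_reduction = raw_bytes / native_bytes

# Upper-bound read throughput, assuming all 39 TB are read during ingestion.
aggregate_ingest_gb_s = raw_bytes / ingest_seconds / GB
per_worker_mb_s = raw_bytes / ingest_seconds / ingest_workers / MB

# Total RAM allocated across workers in each phase.
ingest_ram_total_gb = ingest_workers * ingest_ram_gb
prep_ram_total_gb = prep_workers * prep_ram_gb

print(f"size reduction:         {size_reduction:.0f}x")
print(f"aggregate ingest rate:  {aggregate_ingest_gb_s:.1f} GB/s")
print(f"per-worker ingest rate: {per_worker_mb_s:.1f} MB/s")
print(f"ingestion RAM:          {ingest_ram_total_gb} GB")
print(f"pre-processing RAM:     {prep_ram_total_gb} GB")
```

Under these assumptions, the 256 workers at 6 GB each add up to 1,536 GB, which is in line with the roughly 1.5 TB peak memory reported for the pre-processing phase.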
lmiroslaw
Jan 12, 2026 · Azure High Performance Computing (HPC) Blog
mpi-stage: High-Performance File Distribution for HPC Clusters
When running containerized workloads on HPC clusters, one of the first problems you hit is getting container images onto the nodes quickly and repeatably. A .sqsh is a Squashfs image (commonly used by container runtimes on HPC). In some environments you can run a Squashfs image directly from shared storage, but at scale that often turns the shared filesystem into a hot spot. Copying the image to local NVMe keeps startup time predictable and avoids hundreds of nodes hammering the same source during job launch.

In this post, I'll introduce mpi-stage, a lightweight tool that uses MPI broadcasts to distribute large files across cluster nodes at speeds that can saturate the backend network.

The Problem: Staging Files at Scale

On an Azure CycleCloud Workspace for Slurm cluster with GB300 GPU nodes, I needed to stage a large Squashfs container image from shared storage onto each node's local NVMe storage before launching training jobs. At small scale you can often get away with ad-hoc copies, but once hundreds of nodes are all trying to read the same source file, the shared source filesystem quickly becomes the bottleneck. I tried several approaches:

Attempt 1: Slurm's sbcast

Slurm's built-in sbcast seemed like the natural choice. In my quick testing it was slower than I wanted, and the overwrite/skip-existing behavior didn't match the "fast no-op if already present" workflow I was after. I didn't spend much time exploring all the configuration options before moving on.

Attempt 2: Shell Script Fan-Out

I wrote a shell script using a tree-based fan-out approach: copy to N nodes, then each of those copies to N more, and so on. This worked and scaled reasonably, but had some drawbacks:

- Multiple stages: The script required orchestrating multiple rounds of copy commands, adding complexity.
- Source filesystem stress: Even with fan-out, the initial copies still hit the source filesystem simultaneously; a fan-out of 4 meant 4 nodes competing for source bandwidth.
- Frontend network: Copies went over the Ethernet network by default. I could have configured IPoIB, but that added more setup.

The Solution: MPI Broadcasts

The key insight was that MPI's broadcast primitive (MPI_Bcast) is specifically optimized for one-to-many data distribution. Modern MPI implementations like HPC-X use tree-based algorithms that efficiently utilize the high-bandwidth, low-latency InfiniBand network. With mpi-stage:

- Single source read: Only one node reads from the source filesystem.
- Backend network utilization: Data flows over InfiniBand using optimized MPI collectives.
- Intelligent skipping: Nodes that already have the file (verified by size or checksum) skip the copy entirely.

Combined, this keeps the shared source (NFS, Lustre, blobfuse, etc.) from being hammered by many concurrent readers while still taking full advantage of the backend fabric.

How It Works

mpi-stage is designed around a simple workflow: The source node reads the file in chunks and streams each chunk via MPI_Bcast. Destination nodes write each chunk to local storage immediately upon receipt. This streaming approach means the entire file never needs to fit in memory; only a small buffer is required.

Key Features

Pre-copy Validation

Before any data is transferred, each node checks if the destination file already exists and matches the source. You can choose between:

- Size check (default): Fast comparison of file sizes, sufficient for most use cases
- Checksum: Stronger validation, but requires reading the full file and is therefore slower

If all nodes already have the correct file, mpi-stage completes in milliseconds with no data transfer.
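The workflow and pre-copy validation described above can be sketched in a single process, without MPI. This is an analogy, not mpi-stage's implementation: the loop over open files stands in for MPI_Bcast plus each node's local write, per-node directories stand in for local NVMe, and all names (`needs_copy`, `stage`, `sha256_of`), the chunk size, and the choice of SHA-256 as the checksum are assumptions made for illustration.

```python
import hashlib
import os
import tempfile

CHUNK_SIZE = 4 * 1024 * 1024  # illustrative buffer size, not mpi-stage's default


def sha256_of(path, chunk_size=CHUNK_SIZE):
    """Checksum a file chunk by chunk so it never has to fit in memory."""
    digest = hashlib.sha256()
    with open(path, "rb") as f:
        while chunk := f.read(chunk_size):
            digest.update(chunk)
    return digest.hexdigest()


def needs_copy(source, dest, mode="size"):
    """Pre-copy validation: True if dest is missing or does not match source."""
    if not os.path.exists(dest):
        return True
    if os.path.getsize(dest) != os.path.getsize(source):
        return True
    if mode == "checksum":  # stronger, but reads both files in full
        return sha256_of(dest) != sha256_of(source)
    return False


def stage(source, dests, mode="size", chunk_size=CHUNK_SIZE):
    """Read source once and stream each chunk to every destination that needs it."""
    pending = [d for d in dests if needs_copy(source, d, mode)]
    if not pending:
        return 0  # fast no-op: every destination already matches
    for d in pending:
        parent = os.path.dirname(d)
        if parent:
            os.makedirs(parent, exist_ok=True)
    outputs = [open(d, "wb") for d in pending]
    try:
        with open(source, "rb") as src:
            while chunk := src.read(chunk_size):  # the single source read
                for out in outputs:               # stand-in for broadcast + local write
                    out.write(chunk)
    finally:
        for out in outputs:
            out.close()
    return len(pending)


with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, "shared", "image.sqsh")
    os.makedirs(os.path.dirname(src))
    with open(src, "wb") as f:
        f.write(os.urandom(1_000_000))  # dummy payload standing in for an image
    node_paths = [os.path.join(tmp, f"node{i}", "image.sqsh") for i in range(4)]
    first = stage(src, node_paths, chunk_size=64 * 1024)   # copies to all four "nodes"
    second = stage(src, node_paths, chunk_size=64 * 1024)  # nothing left to copy
    print(first, second)
```

The second call returns immediately because the size check passes everywhere, which mirrors the "fast no-op if already present" behavior that motivated the tool.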
Double-Buffered Transfers The implementation uses double-buffered, chunked transfers to overlap network communication with disk I/O. While one buffer is being broadcast, the next chunk is being read from the source. Post-copy Validation Optionally verify that all nodes received the file correctly after the copy completes. Single-Writer Per Node The tool enforces one MPI rank per node to prevent filesystem contention and ensure predictable performance. Real-World Performance In one run using 156 GPU nodes, distributing a container image achieved approximately 3 GB/s effective distribution rate (file_size/time), completing in just over 5 seconds: [0] Copy required: yes [0] Starting copy phase (source writes: yes) [0] Copy complete, Bandwidth: 3007.14 MB/s [0] Post-validation complete [0] Timings (s): Topology check: 5.22463 Source metadata: 0.00803746 Pre-validation: 0.0046786 Copy phase: 5.21189 Post-validation: 2.2944e-05 Total time: 5.2563 Because every node writes the file to its own local NVMe, the cumulative write rate across the cluster is roughly this number times the node count: ~3 GB/s × 156 ≈ ~468 GB/s of total local writes. Workflow: Container Image Distribution The primary use case is distributing Squashfs images to local NVMe before launching containerized workloads. Run mpi-stage as a job step before your main application: #!/bin/bash #SBATCH --job-name=my-training-job #SBATCH --ntasks-per-node=1 #SBATCH --exclusive # Stage the container image srun --mpi=pmix ./mpi_stage \ --source /shared/images/pytorch.sqsh \ --dest /nvme/images/pytorch.sqsh \ --pre-validate size \ --verbose # Run the actual job (from local NVMe - much faster!) srun --container-image=/nvme/images/pytorch.sqsh ... mpi-stage will create the destination directory if it doesn't exist. If your container runtime supports running the image directly from shared storage, you may not strictly need this step—but staging to local NVMe tends to be faster and more predictable at large scale. 
Because of the pre-validation, you can include this step in every job script without penalty: if the image is already present, it completes in milliseconds.

Getting Started

    git clone https://github.com/edwardsp/mpi-stage.git
    cd mpi-stage
    make

For detailed usage and options, see the README.

Summary

mpi-stage started as a solution to a very specific problem (staging large container images efficiently across a large GPU cluster), but the same pattern may be useful in other scenarios where many nodes need the same large file.

By using MPI broadcasts, only a single node reads from the source filesystem, while data is distributed over the backend network using optimized collectives. In practice, this can significantly reduce load on shared filesystems and cloud-backed mounts, such as Azure Blob Storage accessed via blobfuse2, where hundreds of concurrent readers can otherwise become a bottleneck.

While container images were the initial focus, this approach could also be applied to staging training datasets, distributing model checkpoints or pretrained weights, or copying large binaries to local NVMe before a job starts. Anywhere that a "many nodes, same file" pattern exists is a potential fit.

If you're running large-scale containerized workloads on Azure HPC infrastructure, give it a try. If you use mpi-stage in other workflows, I'd love to hear what worked (and what didn't). Feedback and contributions are welcome.

Have questions or feedback? Leave a comment below or open an issue on GitHub.
pauledwards
Jan 09, 2026 · Azure High Performance Computing (HPC) Blog
Announcing Azure CycleCloud Workspace for Slurm: Version 2025.12.01 Release
The Azure CycleCloud Workspace for Slurm 2025.12.01 release introduces major upgrades that strengthen performance, monitoring, authentication, and platform flexibility for HPC environments. This update integrates Prometheus self‑agent monitoring and Azure Managed Grafana, giving teams real‑time visibility into node metrics, Slurm jobs, and cluster health through ready‑to‑use dashboards. The release also adds Entra ID Single Sign‑On (SSO) to streamline secure access across CycleCloud and Open OnDemand. With centralized identity management and support for MFA, organizations can simplify user onboarding while improving security. Additionally, the update expands platform support with ARM64 compute nodes and compatibility for Ubuntu 24.04 and AlmaLinux 9, enabling more flexible and efficient HPC cluster deployments. Overall, this version focuses on improved observability, stronger security, and broader infrastructure options for technical and scientific HPC teams.
xpillons
Jan 07, 2026 · Azure High Performance Computing (HPC) Blog
Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)
As HPC & AI workloads continue to scale in complexity and performance demands, visibility into the underlying infrastructure becomes critical. This guide presents a monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices. By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments.

In this blog, we'll walk through configuring Telegraf to collect GPU and InfiniBand metrics and send them to Azure Monitor, covering each step needed to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure.

DISCLAIMER: This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support.

While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, at the time of writing it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. Monitoring GPU and InfiniBand performance requires additional configuration using third-party tools such as Telegraf.

🔔 Update: Supported Monitoring Option Now Available

Update (December 2025): At the time this guide was written, monitoring InfiniBand (IB) and GPU metrics on Azure H-series and N-series VMs required a largely unofficial approach using Telegraf and Azure Monitor.
Microsoft has since introduced a supported solution: Azure Managed Prometheus on VM / VM Scale Sets (VMSS), currently available in private preview. This new capability provides a native, managed Prometheus experience for collecting infrastructure and accelerator metrics directly from VMs and VMSS. It significantly simplifies deployment, lifecycle management, and long-term support compared to custom Telegraf-based setups. For new deployments, customers are encouraged to evaluate Azure Managed Prometheus on VM / VMSS as the preferred and supported approach for HPC and AI workload monitoring.

Official announcement: Private Preview: Azure Managed Prometheus on VM / VMSS

Step 1: Prepare Azure to Receive GPU and IB Metrics from Telegraf Agents on a VM or VMSS

Register the microsoft.insights resource provider in your Azure subscription.
Refer to: Resource providers and resource types - Azure Resource Manager | Microsoft Learn

Step 2: Enable Managed Service Identities to Authenticate an Azure VM or Azure VMSS

In this example, we use a Managed Identity for authentication. You can also use a User Managed Identity or a Service Principal to authenticate the VM.
Refer to: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)

Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor

In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for VMSS. The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, make sure you manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand driver.

Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04.
This script will also configure the NVIDIA SMI input plugin and the InfiniBand input plugin, and set up the Telegraf configuration to send data to Azure Monitor.

Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04.

For details on the InfiniBand counters collected by Telegraf, see https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters

Run the following commands:

    wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh
    chmod +x gpu-ib-mon_setup.sh
    ./gpu-ib-mon_setup.sh

Test the Telegraf configuration by executing the following command:

    sudo telegraf --config /etc/telegraf/telegraf.conf --test

Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage

Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification.

To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal:

Set the scope to your VM.
Choose the Metric Namespace as Telegraf/nvidia-smi.
From there, you can select and display various GPU metrics such as utilization, memory usage, temperature, and more. In this example, we use the GPU memory_used metric.
Use filters and splits to analyze data across multiple GPUs or over time.

To monitor InfiniBand performance, repeat the same process:

In the Metrics section, set the scope to your VM.
Select the Metric Namespace as Telegraf/infiniband.
You can visualize metrics such as port status, data transmitted/received, and error counters.
In this example, we use the link flap metric to check for InfiniBand link flaps.
Use filters to break down the data by port or metric type for deeper insights.

Link_downed metric

Note: The link_downed metric with Aggregation: Count returns incorrect values. Use the Max or Min aggregation instead.

Port_rcv_data metrics

Creating custom dashboards in Azure Monitor with both the Telegraf/nvidia-smi and Telegraf/infiniband namespaces allows for unified visibility into GPU and InfiniBand.

Testing InfiniBand and GPU Usage

If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads, especially over InfiniBand, here's a straightforward solution using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups.

The NCCL benchmarks and OpenMPI are part of the Ubuntu HPC 22.04 image. Update the variables according to your environment, and update the hostfile with your hostnames.

    module load mpi/hpcx-v2.13.1
    export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
    mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
        -mca coll_hcoll_enable 0 --bind-to numa \
        -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
        -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_IFNAME=eth0 \
        -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
        -x NCCL_DEBUG=WARN \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1

Alternate: GPU Load Simulation Using TensorFlow

If you're looking for a more application-like load (e.g., distributed training), I've prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines.
To get started, run the following:

    wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
    chmod +x gpu_test_program.sh
    ./gpu_test_program.sh

With either method, NCCL benchmarks or TensorFlow training, you'll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence. Happy testing!

References:

Ubuntu HPC on Azure
ND A100 v4-series GPU VM Sizes
Telegraf Azure Monitor Output Plugin (v1.15)
Telegraf NVIDIA SMI Input Plugin (v1.15)
Telegraf InfiniBand Input Plugin Documentation
vinilv
Dec 18, 2025 · Azure High Performance Computing (HPC) Blog
Automating HPC Workflows with Copilot Agents
High Performance Computing (HPC) workloads are complex, requiring precise job submission scripts and careful resource management. Manual scripting for platforms like OpenFOAM is time-consuming, error-prone, and often frustrating. At SC25, we showcased how Copilot Agents—powered by AI—are transforming HPC workflows by automating Slurm submission scripts, making scientific computing more efficient and accessible.
xpillons
Dec 03, 2025 · Azure High Performance Computing (HPC) Blog
Azure NCv6 Public Preview: The new Unified Platform for Converged AI and Visual Computing
As enterprises accelerate adoption of physical AI (AI models interacting with real-world physics), digital twins (virtual replicas of physical systems), LLM inference (running language models for predictions), and agentic workflows (autonomous AI-driven processes), the demand for infrastructure that bridges high-end visualization and generative AI inference has never been higher.

Today, we are pleased to announce the Public Preview of the NC RTX PRO 6000 BSE v6 series, powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs.

The NCv6 series represents a generational leap in Azure's visual compute portfolio, designed to be the dual engine for both Industrial Digitalization and cost-effective LLM inference. By leveraging NVIDIA Multi-Instance GPU (MIG) capabilities, the NCv6 platform offers affordable sizing options similar to our legacy NCv3 and NVv5 series. This provides a seamless upgrade path to Blackwell performance, enabling customers to run complex NVIDIA Omniverse simulations and multimodal AI agents with greater efficiency.

Why Choose Azure NCv6?

While traditional GPU instances often force a choice between "compute" (AI) and "graphics" (visualization) optimizations, the NCv6 breaks this silo. Built on the NVIDIA Blackwell architecture, it provides a "right-sized" acceleration platform for workloads that demand both ray-traced fidelity and Tensor Core performance.

As outlined in our product documentation, these VMs are ideal for converged AI and visual computing workloads, including:

Real-time digital twin and NVIDIA Omniverse simulation.
LLM Inference and RAG (Retrieval-Augmented Generation) on small to medium AI models.
High-fidelity 3D rendering, product design, and video streaming.
Agentic AI application development and deployment.
Scientific visualization and High-Performance Computing (HPC).

Key Features of the NCv6 Platform

The Power of NVIDIA Blackwell

At the heart of the NCv6 is the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.
This powerhouse delivers breakthrough performance featuring 96 GB of ultra-fast GDDR7 memory. This massive frame buffer allows for the handling of complex multimodal AI models and high-resolution textures that previous generations simply could not fit.

Host Performance: Intel Granite Rapids

To ensure your workloads aren't bottlenecked by the CPU, the VM host is equipped with Intel Xeon Granite Rapids processors. These provide an all-core turbo frequency of up to 4.2 GHz, ensuring that demanding pre- and post-processing steps (common in rendering and physics simulations) are handled efficiently.

Optimized Sizing for Every Workflow

We understand that one size does not fit all. The NCv6 series introduces three distinct sizing categories to match your specific unit economics:

General Purpose: Balanced CPU-to-GPU ratios (up to 320 vCPUs) for diverse workloads.
Compute Optimized: Higher vCPU density for heavy simulation and physics tasks.
Memory Optimized: Massive memory footprints (up to 1,280 GB RAM) for data-intensive applications.

Crucially, for smaller inference jobs or VDI, we will also offer fractional GPU options, allowing you to right-size your infrastructure and optimize costs.

NCv6 Technical Specifications

Specification | Details
GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
Processor | Intel Xeon Granite Rapids (up to 4.2 GHz Turbo)
vCPUs | 16 – 320 vCPUs (scalable across General Purpose, Compute Optimized, and Memory Optimized sizes)
System Memory | 64 GB – 1,280 GB DDR5
Network | Up to 200,000 Mbps (200 Gbps) Azure Accelerated Networking
Storage | Up to 2 TB local temp storage; support for Premium SSD v2 and Ultra Disk

Real-World Applications

The NCv6 is built for versatility, powering everything from pixel-perfect rendering to high-throughput language reasoning:

Production Generative AI & Inference: Deploy self-hosted LLMs and RAG pipelines with optimized unit economics.
The NCv6 is ideal for serving ranking models, recommendation engines, and content generation agents where low latency and cost-efficiency are paramount.
Automotive & Manufacturing: Validate autonomous driving sensors (LiDAR/Radar) and train physical AI models in high-fidelity simulation environments before they ever touch the real world.
Next-Gen VDI & Azure Virtual Desktop: Modernize remote workstations with NVIDIA RTX Virtual Workstation capabilities. By leveraging fractional GPU options, organizations can deliver high-fidelity, accelerated desktop experiences to distributed teams, offering a superior, high-density alternative to legacy NVv5 deployments.
Media & Entertainment: Accelerate render farms for VFX studios requiring burst capacity, while simultaneously running generative AI tools for texture creation and scene optimization.

Conclusion: The Engine for the Era of Converged AI

The Azure NCv6 series redefines the boundaries of cloud infrastructure. By combining the raw power of NVIDIA's Blackwell architecture with the high-frequency performance of Intel Granite Rapids, we are moving beyond just "visual computing." Innovators can now leverage a unified platform to build the industrial metaverse, deploy intelligent agents, and scale production AI, all with the enterprise-grade security and hybrid reach of Azure.

Ready to experience the next generation? Sign up for the NCv6 Public Preview here.
rishabv90
Nov 25, 2025 · Azure High Performance Computing (HPC) Blog