ai infrastructure
95 Topics
Comprehensive Nvidia GPU Monitoring for Azure N-Series VMs Using Telegraf with Azure Monitor
Unlocking Nvidia GPU Monitoring for Azure N-Series VMs with Telegraf and Azure Monitor. In the world of AI and HPC, optimizing GPU performance is critical for avoiding inefficiencies that can bottleneck workflows and drive up costs. While Azure Monitor tracks key resources like CPU and memory, it falls short in native GPU monitoring for Azure N-series VMs. Enter Telegraf, a powerful tool that integrates seamlessly with Azure Monitor to bridge this gap. In this blog, discover how to harness Telegraf for comprehensive GPU monitoring and ensure your GPUs perform at peak efficiency in the cloud.

Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)
As HPC & AI workloads continue to scale in complexity and performance demands, ensuring visibility into the underlying infrastructure becomes critical. This guide presents an essential monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices. By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments.
In this blog, we'll walk through the process of configuring Telegraf to collect and send GPU and InfiniBand monitoring metrics to Azure Monitor. This end-to-end guide covers all the essential steps to enable robust monitoring for NVIDIA GPUs and Mellanox InfiniBand devices, empowering you to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure.
DISCLAIMER: This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support.
While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. To monitor GPU and InfiniBand performance, additional configuration using third-party tools, such as Telegraf, is required. As of the time of writing, Azure Monitor does not include built-in support for these metrics without external integrations.
🔔 Update: Supported Monitoring Option Now Available
Update (December 2025): At the time this guide was written, monitoring InfiniBand (IB) and GPU metrics on Azure H-series and N-series VMs required a largely unofficial approach using Telegraf and Azure Monitor.
Microsoft has since introduced a supported solution: Azure Managed Prometheus on VM / VM Scale Sets (VMSS), currently available in private preview. This new capability provides a native, managed Prometheus experience for collecting infrastructure and accelerator metrics directly from VMs and VMSS. It significantly simplifies deployment, lifecycle management, and long-term support compared to custom Telegraf-based setups. For new deployments, customers are encouraged to evaluate Azure Managed Prometheus on VM / VMSS as the preferred and supported approach for HPC and AI workload monitoring. Official announcement: Private Preview: Azure Managed Prometheus on VM / VMSS
Step 1: Make the changes in Azure that allow Telegraf agents on a VM or VMSS to send GPU and IB metrics to Azure Monitor.
Register the microsoft.insights resource provider in your Azure subscription. Refer: Resource providers and resource types - Azure Resource Manager | Microsoft Learn
Step 2: Enable Managed Service Identities to authenticate an Azure VM or Azure VMSS.
In this example, we are using a Managed Identity for authentication. You can also use user-assigned managed identities or a Service Principal to authenticate the VM. Refer: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)
Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor
In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for VMSS. The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, ensure that you manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand driver. Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04.
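Steps 1 and 2 can also be carried out with the Azure CLI instead of the portal. The following is a minimal sketch and not part of the original guide: it assumes the Azure CLI is installed and logged in, that a system-assigned managed identity is the one wanted, and that the resource group and VM names (my-hpc-rg, my-nd-vm) are placeholders to replace. By default it only prints the commands; set DRY_RUN=0 to execute them.

```shell
#!/usr/bin/env bash
# Sketch of Steps 1 and 2 with the Azure CLI (assumed installed and logged in).
# RESOURCE_GROUP and VM_NAME are hypothetical placeholders; replace them.
RESOURCE_GROUP="${RESOURCE_GROUP:-my-hpc-rg}"
VM_NAME="${VM_NAME:-my-nd-vm}"
DRY_RUN="${DRY_RUN:-1}"   # 1 = print the commands only, 0 = run them

# Print the command in dry-run mode, otherwise execute it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: register the microsoft.insights resource provider in the subscription.
register_provider() {
  run az provider register --namespace microsoft.insights
}

# Step 2: enable a system-assigned managed identity on a single VM.
enable_vm_identity() {
  run az vm identity assign --resource-group "$RESOURCE_GROUP" --name "$VM_NAME"
}

# Step 2 (scale set variant): the VMSS name is passed as the first argument.
enable_vmss_identity() {
  run az vmss identity assign --resource-group "$RESOURCE_GROUP" --name "$1"
}

register_provider
enable_vm_identity
```

With the provider registered and the identity enabled, everything else happens inside the VM through gpu-ib-mon_setup.sh.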
This script also configures the NVIDIA SMI input plugin and the InfiniBand input plugin, and sets up the Telegraf configuration to send data to Azure Monitor. Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04. For a description of the InfiniBand counters collected by Telegraf, see https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters
Run the following commands:
    wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh
    chmod +x gpu-ib-mon_setup.sh
    ./gpu-ib-mon_setup.sh
Test the Telegraf configuration by executing the following command:
    sudo telegraf --config /etc/telegraf/telegraf.conf --test
Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage
Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification.
To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal:
- Set the scope to your VM.
- Choose the Metric Namespace as Telegraf/nvidia-smi.
- From there, you can select and display various GPU metrics such as utilization, memory usage, temperature, and more. In this example, we are using the GPU memory_used metric.
- Use filters and splits to analyze data across multiple GPUs or over time.
To monitor InfiniBand performance, repeat the same process:
- In the Metrics section, set the scope to your VM.
- Select the Metric Namespace as Telegraf/infiniband.
- You can visualize metrics such as port status, data transmitted/received, and error counters.
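To cross-check what the portal shows, the raw InfiniBand counters can also be read directly on the VM. This is a minimal sketch that assumes the standard Linux sysfs layout /sys/class/infiniband/<device>/ports/<port>/counters/. The function name ib_counters and the IB_SYSFS_ROOT override are illustrative and are not part of Telegraf or the setup script.

```shell
#!/usr/bin/env bash
# Print "<device> <port> <counter> <value>" for each requested InfiniBand counter.
# IB_SYSFS_ROOT can be overridden so the function can be tried against a fake tree.
ib_counters() {
  local root="${IB_SYSFS_ROOT:-/sys/class/infiniband}"
  local counter dir dev port file
  for dir in "$root"/*/ports/*/counters; do
    [ -d "$dir" ] || continue   # no InfiniBand devices found: print nothing
    # dir looks like <root>/<device>/ports/<port>/counters
    port="$(basename "$(dirname "$dir")")"
    dev="$(basename "$(dirname "$(dirname "$(dirname "$dir")")")")"
    for counter in "$@"; do
      file="$dir/$counter"
      [ -r "$file" ] && printf '%s %s %s %s\n' "$dev" "$port" "$counter" "$(cat "$file")"
    done
  done
  return 0
}

# Counters mentioned in this guide, plus two common companions.
ib_counters link_downed port_rcv_data port_xmit_data port_rcv_errors
```

On a machine without InfiniBand devices the call prints nothing. These counters are cumulative since the driver was loaded, so a change between two readings is what indicates new activity.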
In this example, we are using the link flap metric to check for InfiniBand link flaps. Use filters to break down the data by port or metric type for deeper insights.
Link_downed Metric
Note: The link_downed metric with Aggregation: Count returns incorrect values. Use the Max or Min aggregation instead.
Port_rcv_data metrics
Creating custom dashboards in Azure Monitor with both Telegraf/nvidia-smi and Telegraf/infiniband namespaces allows for unified visibility into GPU and InfiniBand.
Testing InfiniBand and GPU Usage
If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads, especially over InfiniBand, here's a straightforward solution using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups. The NCCL benchmark and OpenMPI are part of the Ubuntu HPC 22.04 image. Update the variables according to your environment, and update the hostfile with your hostnames.
    module load mpi/hpcx-v2.13.1
    export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
    mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
        -mca coll_hcoll_enable 0 --bind-to numa \
        -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
        -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
        -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
        -x NCCL_SOCKET_IFNAME=eth0 \
        -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
        -x NCCL_DEBUG=WARN \
        /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1
Alternate: GPU Load Simulation Using TensorFlow
If you're looking for a more application-like load (e.g., distributed training), I've prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines.
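For the NCCL run above, the hostfile and the -np value have to agree: with --map-by ppr:8:node, mpirun starts eight ranks per node, so -np 16 corresponds to two nodes. Below is a minimal sketch for generating both, assuming eight GPUs per VM as on Standard_ND96asr_v4. The function name make_hostfile and the hostnames are placeholders.

```shell
#!/usr/bin/env bash
GPUS_PER_NODE=8   # one MPI rank per GPU, matching --map-by ppr:8:node

# Write an Open MPI hostfile ("<host> slots=<n>" per line) to the path in $1
# for the hostnames given as the remaining arguments, then print the matching
# value for -np.
make_hostfile() {
  local out="$1"
  shift
  : > "$out"
  local host
  for host in "$@"; do
    printf '%s slots=%d\n' "$host" "$GPUS_PER_NODE" >> "$out"
  done
  echo $(( $# * GPUS_PER_NODE ))
}

# Example with two placeholder hostnames; prints 16.
example_file="$(mktemp)"
make_hostfile "$example_file" node-0001 node-0002
rm -f "$example_file"
```

The printed number is what goes after -np, and the generated file is what -hostfile points to. The hostfile only matters for the NCCL run; the TensorFlow alternative introduced above is started from its own script.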
To get started, run the following:
    wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
    chmod +x gpu_test_program.sh
    ./gpu_test_program.sh
With either method, NCCL benchmarks or TensorFlow training, you'll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence. Happy testing!
References:
- Ubuntu HPC on Azure
- ND A100 v4-series GPU VM Sizes
- Telegraf Azure Monitor Output Plugin (v1.15)
- Telegraf NVIDIA SMI Input Plugin (v1.15)
- Telegraf InfiniBand Input Plugin Documentation

Automating HPC Workflows with Copilot Agents
High Performance Computing (HPC) workloads are complex, requiring precise job submission scripts and careful resource management. Manual scripting for platforms like OpenFOAM is time-consuming, error-prone, and often frustrating. At SC25, we showcased how Copilot Agents, powered by AI, are transforming HPC workflows by automating Slurm submission scripts, making scientific computing more efficient and accessible.

Azure NCv6 Public Preview: The new Unified Platform for Converged AI and Visual Computing
As enterprises accelerate adoption of physical AI (AI models interacting with real-world physics), digital twins (virtual replicas of physical systems), LLM inference (running language models for predictions), and agentic workflows (autonomous AI-driven processes), the demand for infrastructure that bridges high-end visualization and generative AI inference has never been higher. Today, we are pleased to announce the Public Preview of the NC RTX PRO 6000 BSE v6 series, powered by the NVIDIA RTX PRO 6000 Blackwell Server Edition GPUs. The NCv6 series represents a generational leap in Azure's visual compute portfolio, designed to be the dual engine for both Industrial Digitalization and cost-effective LLM inference. By leveraging NVIDIA Multi-Instance GPU (MIG) capabilities, the NCv6 platform offers affordable sizing options similar to our legacy NCv3 and NVv5 series. This provides a seamless upgrade path to Blackwell performance, enabling customers to run complex NVIDIA Omniverse simulations and multimodal AI agents with greater efficiency.
Why Choose Azure NCv6?
While traditional GPU instances often force a choice between "compute" (AI) and "graphics" (visualization) optimizations, the NCv6 breaks this silo. Built on the NVIDIA Blackwell architecture, it provides a "right-sized" acceleration platform for workloads that demand both ray-traced fidelity and Tensor Core performance. As outlined in our product documentation, these VMs are ideal for converged AI and visual computing workloads, including:
- Real-time digital twin and NVIDIA Omniverse simulation.
- LLM Inference and RAG (Retrieval-Augmented Generation) on small to medium AI models.
- High-fidelity 3D rendering, product design, and video streaming.
- Agentic AI application development and deployment.
- Scientific visualization and High-Performance Computing (HPC).
Key Features of the NCv6 Platform
The Power of NVIDIA Blackwell
At the heart of the NCv6 is the NVIDIA RTX PRO 6000 Blackwell Server Edition GPU.
This powerhouse delivers breakthrough performance featuring 96 GB of ultra-fast GDDR7 memory. This massive frame buffer allows for the handling of complex multimodal AI models and high-resolution textures that previous generations simply could not fit.
Host Performance: Intel Granite Rapids
To ensure your workloads aren't bottlenecked by the CPU, the VM host is equipped with Intel Xeon Granite Rapids processors. These provide an all-core turbo frequency of up to 4.2 GHz, ensuring that demanding pre- and post-processing steps, common in rendering and physics simulations, are handled efficiently.
Optimized Sizing for Every Workflow
We understand that one size does not fit all. The NCv6 series introduces three distinct sizing categories to match your specific unit economics:
- General Purpose: Balanced CPU-to-GPU ratios (up to 320 vCPUs) for diverse workloads.
- Compute Optimized: Higher vCPU density for heavy simulation and physics tasks.
- Memory Optimized: Massive memory footprints (up to 1,280 GB RAM) for data-intensive applications.
Crucially, for smaller inference jobs or VDI, we will also offer fractional GPU options, allowing you to right-size your infrastructure and optimize costs.
NCv6 Technical Specifications
Specification | Details
GPU | NVIDIA RTX PRO 6000 Blackwell Server Edition (96 GB GDDR7)
Processor | Intel Xeon Granite Rapids (up to 4.2 GHz Turbo)
vCPUs | 16 – 320 vCPUs (Scalable across GP, Compute, and Memory optimized sizes)
System Memory | 64 GB – 1,280 GB DDR5
Network | Up to 200,000 Mbps (200 Gbps) Azure Accelerated Networking
Storage | Up to 2 TB local temp storage; Support for Premium SSD v2 & Ultra Disk
Real-World Applications
The NCv6 is built for versatility, powering everything from pixel-perfect rendering to high-throughput language reasoning:
- Production Generative AI & Inference: Deploy self-hosted LLMs and RAG pipelines with optimized unit economics.
The NCv6 is ideal for serving ranking models, recommendation engines, and content generation agents where low latency and cost-efficiency are paramount.
- Automotive & Manufacturing: Validate autonomous driving sensors (LiDAR/Radar) and train physical AI models in high-fidelity simulation environments before they ever touch the real world.
- Next-Gen VDI & Azure Virtual Desktop: Modernize remote workstations with NVIDIA RTX Virtual Workstation capabilities. By leveraging fractional GPU options, organizations can deliver high-fidelity, accelerated desktop experiences to distributed teams, offering a superior, high-density alternative to legacy NVv5 deployments.
- Media & Entertainment: Accelerate render farms for VFX studios requiring burst capacity, while simultaneously running generative AI tools for texture creation and scene optimization.
Conclusion: The Engine for the Era of Converged AI
The Azure NCv6 series redefines the boundaries of cloud infrastructure. By combining the raw power of NVIDIA's Blackwell architecture with the high-frequency performance of Intel Granite Rapids, we are moving beyond just "visual computing." Innovators can now leverage a unified platform to build the industrial metaverse, deploy intelligent agents, and scale production AI, all with the enterprise-grade security and hybrid reach of Azure. Ready to experience the next generation? Sign up for the NCv6 Public Preview here.

Azure CycleCloud 8.8 and CCWS 1.2 at SC25 and Ignite
Azure CycleCloud 8.8: Advancing HPC & AI Workloads with Smarter Health Checks
Azure CycleCloud continues to evolve as the backbone for orchestrating high-performance computing (HPC) and AI workloads in the cloud. With the release of CycleCloud 8.8, users gain access to a suite of new features designed to streamline cluster management, enhance health monitoring, and future-proof their HPC environments.
Key Features in CycleCloud 8.8
1. ARM64 HPC Support
The platform expands its hardware compatibility with ARM64 HPC support, opening new possibilities for energy-efficient and cost-effective compute clusters. This includes access to the newer generation of GB200 VMs as well as general ARM64 support, enabling new AI workloads at a scale never possible before.
2. Slurm Topology-Aware Scheduling
The integration of topology-aware scheduling for Slurm clusters allows CycleCloud users to optimize job placement based on network and hardware topology. This leads to improved performance for tightly coupled HPC workloads and better utilization of available resources.
3. Nvidia MNNVL and IMEX Support
With expanded support for Nvidia MNNVL and IMEX, CycleCloud 8.8 ensures compatibility with the latest GPU technologies. This enables users to leverage cutting-edge hardware for AI training, inference, and scientific simulations.
4. HealthAgent: Event-Driven Health Monitoring and Alerting
A standout feature in this release is the enhanced HealthAgent, which delivers event-driven health monitoring and alerting. CycleCloud now proactively detects issues across clusters, nodes, and interconnects, providing real-time notifications and actionable insights. This improvement is a game-changer for maintaining uptime and reliability in large-scale HPC deployments. Node HealthAgent supports both impactful health checks, which can only run while nodes are idle, and non-impactful health checks, which can run throughout the lifecycle of a job.
This allows CycleCloud to alert not only on issues that happen while nodes are starting, but also on issues that result from failures on long-running nodes. Later releases of CycleCloud will also include automatic remediation for common failures, so stay tuned!
5. Enterprise Linux 9 and Ubuntu 24 support
One common request has been wider support for the various Enterprise Linux (EL) 9 variants, including RHEL9, AlmaLinux 9, and Rocky Linux 9. CycleCloud 8.8 introduces support for those distributions as well as the latest Ubuntu HPC release.
Why These Features Matter
The CycleCloud 8.8 release marks a significant leap forward for organizations running HPC and AI workloads in Azure. The improved health check support, anchored by HealthAgent and automated remediation, means less downtime, faster troubleshooting, and greater confidence in cloud-based research and innovation. Whether you're managing scientific simulations, AI model training, or enterprise analytics, CycleCloud's latest features help you build resilient, scalable, and future-ready HPC environments.
Key Features in CycleCloud Workspace for Slurm 1.2
Along with the release of CycleCloud 8.8 comes a new CycleCloud Workspace for Slurm (CCWS) release. This release includes the General Availability of features that were previously in preview, such as Open OnDemand, Cendio ThinLinc, and managed Grafana monitoring capabilities. In addition to previously announced features, CCWS 1.2 also includes support for a new Hub and Spoke deployment model. This allows customers to retain a central hub of shared resources that can be re-used between cluster deployments, with "disposable" spoke clusters that branch from the hub. Hub and Spoke deployments help customers who need to re-deploy clusters in order to upgrade their operating system, deploy new versions of software, or even reconfigure the overall architecture of Slurm clusters.
Come visit us at SC25 and MS Ignite
To learn more about these features, come visit us at the Microsoft booth at #SC25 in St. Louis, MO and #Microsoft #Ignite in San Francisco this week!

Azure ND GB300 v6 now Generally Available - Hyper-optimized for Generative and Agentic AI workloads
We are pleased to announce the General Availability (GA) of ND GB300 v6 virtual machines, delivering the next leap in AI infrastructure. On 10/09, we shared the delivery of the first at-scale production cluster with more than 4,600 NVIDIA GB300 NVL72, featuring NVIDIA Blackwell Ultra GPUs connected through the next-generation NVIDIA InfiniBand network. We have now deployed tens of thousands of GB300 GPUs for production customer workloads and expect to scale to hundreds of thousands. Built on NVIDIA GB300 NVL72 systems, these VMs redefine performance for frontier model training, large-scale inference, multimodal reasoning, and agentic AI.
The ND GB300 v6 series enables customers to:
- Deploy trillion-parameter models with unprecedented throughput.
- Accelerate inference for long-context and multimodal workloads.
- Scale seamlessly at high bandwidth for large-scale training workloads.
In recent benchmarks, ND GB300 v6 achieved over 1.1 million tokens per second on Llama 2 70B inference workloads - a 27% uplift over ND GB200 v6. This performance breakthrough enables customers to serve long-context, multimodal, and agentic AI models with unmatched speed and efficiency. With the general availability of ND GB300 v6 VMs, Microsoft strengthens its long-standing collaboration with NVIDIA by leading the market in delivering the latest GPU innovations, reaffirming our commitment to world-class AI infrastructure.
The ND GB300 v6 systems are built in a rack-scale design, with each rack hosting 18 VMs for a total of 72 GPUs interconnected by high-speed NVLink. Each VM has 2 NVIDIA Grace CPUs and 4 Blackwell Ultra GPUs. Each NVLink-connected rack contains:
- 72 NVIDIA Blackwell Ultra GPUs (with 36 NVIDIA Grace CPUs).
- 800 gigabits per second (Gbps) per GPU cross-rack scale-out bandwidth via next-generation NVIDIA Quantum-X800 InfiniBand (2x ND GB200 v6).
- 130 terabytes (TB) per second of NVIDIA NVLink bandwidth within rack.
- 37 TB of fast memory.
(~20 TB HBM3e + ~17 TB LPDDR)
- Up to 1,440 petaflops (PFLOPS) of FP4 Tensor Core performance. (1.5x ND GB200 v6)
Together, NVLink and XDR InfiniBand enable GB300 systems to behave as a unified compute and memory pool, minimizing latency, maximizing bandwidth, and dramatically improving scalability. Within a rack, NVLink enables coherent memory access and fast synchronization for tightly coupled workloads. Across racks, XDR InfiniBand ensures ultra-low latency, high-throughput communication with SHARP offloading, maintaining sub-100 µs latency for cross-node collectives.
Azure provides an end-to-end AI platform that enables customers to build, deploy, and scale AI workloads efficiently on GB300 infrastructure. Services like Azure CycleCloud and Azure Batch simplify the setup and management of HPC and AI environments, allowing organizations to dynamically adjust resources, integrate leading schedulers, and run containerized workloads at massive scale. With tools such as CycleCloud Workspace for Slurm, users can create and configure clusters without prior expertise, while Azure Batch handles millions of parallel tasks, ensuring cost and resource efficiency for large-scale training. For cloud-native AI, Azure Kubernetes Service (AKS) offers rapid deployment and management of containerized workloads, complemented by platform-specific optimizations for observability and reliability. Whether using Kubernetes or custom stacks, Azure delivers a unified suite of services to maximize performance and scalability.
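The rack-level figures quoted above are consistent with each other, which a few lines of arithmetic can confirm. This sketch only multiplies and divides the numbers given in the text, assuming they split evenly across the GPUs in a rack. The per-GPU values are derived here for illustration and are not official specifications.

```shell
#!/usr/bin/env bash
# Figures quoted in the text for one ND GB300 v6 rack.
VMS_PER_RACK=18
GPUS_PER_VM=4
CPUS_PER_VM=2
RACK_FP4_PFLOPS=1440
RACK_NVLINK_TB_PER_S=130
IB_GBPS_PER_GPU=800

gpus_per_rack() { echo $(( VMS_PER_RACK * GPUS_PER_VM )); }
cpus_per_rack() { echo $(( VMS_PER_RACK * CPUS_PER_VM )); }

# Split a rack-level total evenly across the GPUs in the rack (one decimal place).
per_gpu() {
  awk -v total="$1" -v n="$(gpus_per_rack)" 'BEGIN { printf "%.1f\n", total / n }'
}

# Aggregate scale-out bandwidth of one rack in terabits per second.
rack_scale_out_tbps() {
  awk -v g="$IB_GBPS_PER_GPU" -v n="$(gpus_per_rack)" 'BEGIN { printf "%.1f\n", g * n / 1000 }'
}

echo "GPUs per rack:           $(gpus_per_rack)"
echo "Grace CPUs per rack:     $(cpus_per_rack)"
echo "FP4 PFLOPS per GPU:      $(per_gpu "$RACK_FP4_PFLOPS")"
echo "NVLink TB/s per GPU:     $(per_gpu "$RACK_NVLINK_TB_PER_S")"
echo "Scale-out Tbps per rack: $(rack_scale_out_tbps)"
```

The first two lines reproduce the 72 GPUs and 36 Grace CPUs stated for each rack, since 18 VMs with 4 GPUs and 2 CPUs each give exactly those totals.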
Learn More & Get Started
- https://azure.microsoft.com/en-us/blog/microsoft-azure-delivers-the-first-large-scale-cluster-with-nvidia-gb300-nvl72-for-openai-workloads/
- https://techcommunity.microsoft.com/blog/azurehighperformancecomputingblog/breaking-the-million-token-barrier-the-technical-achievement-of-azure-nd-gb300-v/4466080
- NVIDIA Blog: Azure's GB300 NVL72 Supercomputing Cluster
- Azure VM Sizes Overview

Announcing the Public Preview of AMLFS 20: Azure Managed Lustre New SKU for Massive AI&HPC Workloads
Sachin Sheth - Principal PDM Manager
Brian Barbisch - Principal Group Software Engineering Manager
Matt White - Principal Group Software Engineering Manager
Brian Lepore - Principal Product Manager
Wolfgang De Salvador - Senior Product Manager
Ron Hogue - Senior Product Manager
Introduction
We are excited to announce the Public Preview of AMLFS Durable Premium 20 (AMLFS 20), a new SKU in Azure Managed Lustre designed to deliver unprecedented performance and scale for demanding AI and HPC workloads.
Key Features
- Massive Scale: Store up to 25 PiB of data in a single namespace, with up to 512 GB/s of total bandwidth.
- Advanced Metadata Performance: Multi-MDS (Metadata Server) architecture dramatically improves metadata IOPS. In mdtest benchmarks, AMLFS 20 demonstrated more than 5x improvement in metadata operations. An additional MDS is provided for every 5 PiB of provisioned filesystem.
- High File Capacity: Supports up to 20 billion inodes for maximum namespace size.
Why AMLFS 20 Matters
- Simplified Architecture: Previously, datasets larger than 12.5 PiB required multiple filesystems and complex management. AMLFS 20 enables a single, high-performance file system for massive AI and HPC workloads up to 25 PiB, streamlining deployment and administration.
- Accelerated Data Preparation: The multi-MDT architecture significantly increases metadata IOPS, which is crucial during the data preparation stage of AI training, where rapid access to millions of files is required.
- Faster Time-to-Value: Researchers and engineers benefit from easier management, reduced bottlenecks, and faster access to large datasets, accelerating innovation.
Availability
AMLFS 20 is available in Public Preview alongside the already existing AMLFS SKUs. For more details on other SKUs, visit the Azure Managed Lustre documentation.
How to Join the Preview
If you are working with large-scale AI or HPC workloads and would like early access to AMLFS 20, we invite you to fill out this form to tell us about your use case. Our team will follow up with onboarding details.

Join Microsoft @ SC25: Experience HPC and AI Innovation
Supercomputing 2025 is coming to St. Louis, MO, November 16–21! Visit Microsoft Booth #1627 to explore cutting-edge HPC and AI solutions, connect with experts, and experience interactive demos that showcase the future of compute. Whether you're attending technical sessions, stopping by for a coffee and a chat with our team, or joining our partner events, we've got something for everyone.
Booth Highlights
- Alpine Formula 1 Showcar: Snap a photo with a real Alpine F1 car and learn how high-performance computing drives innovation in motorsports.
- Silicon Wall: Discover silicon diversity, featuring chips from our partners AMD and NVIDIA alongside Microsoft's own first-party silicon: Maia, Cobalt, and Majorana.
- NVIDIA Weather Modeling Demo: See how AI and HPC predict extreme weather events with Tomorrow.io and NVIDIA technology.
- Coffee Bar with Barista: Enjoy a handcrafted coffee while you connect with our experts.
- Immersive Screens: Watch live demos and visual stories about HPC breakthroughs and AI innovation.
- Hardware Bar: Explore AMD EPYC™ and NVIDIA GB200 systems powering next-generation workloads.
Conference Details
Conference week: Sun, Nov 16 – Fri, Nov 21
Expo hours (CST):
- Mon, Nov 17: 7:00–9:00 PM (Opening Night)
- Tue, Nov 18: 10:00 AM–6:00 PM
- Wed, Nov 19: 10:00 AM–6:00 PM
- Thu, Nov 20: 10:00 AM–3:00 PM
Customer meeting rooms: Four Seasons Hotel
Quick links
- RSVP - Microsoft + AMD Networking Reception (Tue, Nov 18): https://aka.ms/MicrosoftAMD-Mixer
- RSVP - Microsoft + NVIDIA Panel Luncheon (Wed, Nov 19): Luncheon is now closed as the event is fully booked.
Earned Sessions (Technical Program)
Session Type | Time (CST) | Title | Microsoft Contributor(s) | Location
Sunday, Nov 16
Tutorial | 8:30 AM–5:00 PM | Delivering HPC: Procurement, Cost Models, Metrics, Value, and More | Andrew Jones | Room 132
Tutorial | 8:30 AM–5:00 PM | Modern High Performance I/O: Leveraging Object Stores | Glenn Lockwood | Room 120
Workshop | 2:00–5:30 PM | 14th International Workshop on Runtime and Operating Systems for Supercomputers (ROSS 2025) | Torsten Hoefler | Room 265
Monday, Nov 17
Early Career Program | 3:30–4:45 PM | Voices from the Field: Navigating Careers in Academia, Government, and Industry | Joe Greenseid | Room 262
Workshop | 3:50–4:20 PM | Towards Enabling Hostile Multi-tenancy in Kubernetes | Ali Kanso; Elzeiny Mostafa; Gurpreet Virdi; Slava Oks | Room 275
Workshop | 5:00–5:30 PM | On the Performance and Scalability of Cloud Supercomputers: Insights from Eagle and Reindeer | Amirreza Rastegari; Prabhat Ram; Michael F. Ringenburg | Room 267
Tuesday, Nov 18
BOF | 12:15–1:15 PM | High Performance Software Foundation BoF | Joe Greenseid | Room 230
Poster | 5:30–7:00 PM | Compute System Simulator: Modeling the Impact of Allocation Policy and Hardware Reliability on HPC Cloud Resource Utilization | Jarrod Leddy; Huseyin Yildiz | Second Floor Atrium
Wednesday, Nov 19
BOF | 12:15–1:15 PM | The Future of Python on HPC Systems | Michael Droettboom | Room 125
BOF | 12:15–1:15 PM | Autonomous Science Network: Interconnected Autonomous Science Labs Empowered by HPC and Intelligent Agents | Joe Tostenrude | Room 131
Paper | 1:30–1:52 PM | Uno: A One-Stop Solution for Inter- and Intra-Data Center Congestion Control and Reliable Connectivity | Abdul Kabbani; Ahmad Ghalayini; Nadeen Gebara; Terry Lam | Rooms 260–267
Paper | 2:14–2:36 PM | SDR-RDMA: Software-Defined Reliability Architecture for Planetary-Scale RDMA Communication | Abdul Kabbani; Jie Zhang; Jithin Jose; Konstantin Taranov; Mahmoud Elhaddad; Scott Moe; Sreevatsa Anantharamu; Zhuolong Yu | Rooms 260–267
Panel | 3:30–5:00 PM | CPUs Have a Memory Problem - Designing CPU-Based HPC Systems with Very High Memory Bandwidth | Joe Greenseid | Rooms 231–232
Paper | 4:36–4:58 PM | SparStencil: Retargeting Sparse Tensor Cores to Scientific Stencil Computations | Kun Li; Liang Yuan; Ting Cao; Mao Yang | Rooms 260–267
Thursday, Nov 20
BOF | 12:15–1:15 PM | Super(computing)heroes | Laura Parry | Rooms 261–266
Paper | 3:30–3:52 PM | Workload Intelligence: Workload-Aware IaaS Abstraction for Cloud Efficiency | Anjaly Parayil; Chetan Bansal; Eli Cortez; Íñigo Goiri; Jim Kleewein; Jue Zhang; Pantea Zardoshti; Pulkit Misra; Raphael Ghelman; Ricardo Bianchini; Rodrigo Fonseca; Saravan Rajmohan; Xiaoting Qin | Room 275
Paper | 4:14–4:36 PM | From Deep Learning to Deep Science: AI Accelerators Scaling Quantum Chemistry Beyond Limits | Fusong Ju; Kun Li; Mao Yang | Rooms 260–267
Friday, Nov 21
Workshop | 9:00 AM–12:30 PM | Eleventh International Workshop on Heterogeneous High-performance Reconfigurable Computing (H2RC 2025) | Torsten Hoefler | Room 263
Booth Theater Sessions
Time (CST) | Session Title | Presenter(s)
Monday, Nov 17 (7:00 PM–9:00 PM)
8:00–8:20 PM | Inside the World's Most Powerful AI Data Center | Chris Jones
8:30–8:50 PM | Transforming Science and Engineering - Driven by Agentic AI, Powered by HPC | Joe Tostenrude
Tuesday, Nov 18 (10:00 AM–6:00 PM)
11:00–11:50 AM | Ignite Keynotes |
12:00–12:20 PM | Accelerating AI workloads with Azure Storage | Sachin Sheth; Wolfgang De Salvador
12:30–12:50 PM | Accelerate Memory Bandwidth-Bound Workloads with Azure HBv5, now GA | Jyothi Venkatesh
1:00–1:20 PM | Radiation & Health Companion: AI-Driven Flight-Dose Awareness | Olesya Sarajlic
1:30–1:50 PM | Ascend HPC Lab: Your On-Ramp to GPU-Powered Innovation | Daniel Cooke (Oakwood)
2:00–2:20 PM | Azure AMD HBv5: Redefining CFD Performance and Value in the Cloud | Rick Knoechel (AMD)
2:30–2:50 PM | Empowering High Performance Life Sciences Workloads on Azure | Qumulo
3:00–3:20 PM | Transforming Science and Engineering - Driven by Agentic AI, Powered by HPC | Joe Tostenrude
4:00–4:20 PM | Unleashing AMD EPYC on Azure: Scalable HPC for Energy and Manufacturing | Varun Selvaraj (AMD)
4:30–4:50 PM | Automating HPC Workflows with Copilot Agents | Xavier Pillons
5:00–5:20 PM | Scaling the Future: NVIDIA's GB300 NVL72 Rack for Next-Generation AI Inference | Kirthi Devleker (NVIDIA)
5:30–5:50 PM | Enabling AI and HPC Workloads in the Cloud with Azure NetApp Files | Andy Chan
Wednesday, Nov 19 (10:00 AM–6:00 PM)
10:30–10:50 AM | AI-Powered Digital Twins for Industrial Engineering | John Linford (NVIDIA)
11:00–11:20 AM | Advancing 5 Generations of HPC Innovation with AMD on Azure | Allen Leibovitch (AMD)
11:30–11:50 AM | Intro to LoRA Fine-Tuning on Azure | Christin Pohl
12:00–12:20 PM | VAST + Microsoft: Building the Foundation for Agentic AI | Lior Genzel (VAST Data)
12:30–12:50 PM | Inside the World's Most Powerful AI Data Center | Chris Jones
1:00–1:20 PM | Supervised GenAI Simulation – Stroke Prognosis (NVads V710 v5) | Kurt Niebuhr
1:30–1:50 PM | What You Don't See: How Azure Defines VM Families | Anshul Jain
2:00–2:20 PM | Hammerspace Tier 0: Unleashing GPU Storage Performance on Azure | Raj Sharma (Hammerspace)
2:30–2:50 PM | GM Motorsports: Accelerating Race Performance with AI Physics on Rescale | Bernardo Mendez (Rescale)
3:00–3:20 PM | Hurricane Analysis and Forecasting on the Azure Cloud | Salar Adili (Microsoft); Unni Kirandumkara (GDIT); Stefan Gary (Parallel Works)
3:30–3:50 PM | Performance at Scale: Accelerating HPC & AI Workloads with WEKA on Azure | Desiree Campbell; Wolfgang De Salvador
4:00–4:20 PM | Pushing the Limits of Performance: Supercomputing on Azure AI Infrastructure | Biju Thankachen; Ojasvi Bhalerao
4:30–4:50 PM | Accelerating Momentum: Powering AI & HPC with AMD Instinct™ GPUs | Jay Cayton (AMD)
Thursday, Nov 20 (10:00 AM–3:00 PM)
11:30–11:50 AM | Intro to LoRA Fine-Tuning on Azure | Christin Pohl
12:00–12:20 PM | Accelerating HPC Workflows with Ansys Access on Microsoft Azure | Dr. John Baker (Ansys)
12:30–12:50 PM | Accelerate Memory Bandwidth-Bound Workloads with Azure HBv5, now GA | Jyothi Venkatesh
1:00–1:20 PM | Pushing the Limits: Supercomputing on Azure AI Infrastructure | Biju Thankachen; Ojasvi Bhalerao
1:30–1:50 PM | The High Performance Software Foundation | Todd Gamblin (HPSF)
2:00–2:20 PM | Heidi AI - Deploying Azure Cloud Environments for Higher-Ed Students & Researchers | James Verona (Adaptive Computing); Dr. Sameer Shende (UO/ParaTools)
Partner Session Schedule
Date | Time (CST) | Title | Microsoft Contributor(s) | Location
Tuesday, Nov 18
Nov 18 | 11:00 AM–11:50 AM | Cloud Computing for Engineering Simulation | Joe Greenseid | Ansys Booth
Nov 18 | 1:00 PM–1:30 PM | Revolutionizing Simulation with Artificial Intelligence | Joe Tostenrude | Ansys Booth
Nov 18 | 4:30 PM–5:00 PM | [HBv5] | Jyothi Venkatesh | AMD Booth
Wednesday, Nov 19
Nov 19 | 11:30 AM–1:30 PM | Accelerating Discovery: How HPC and AI Are Shaping the Future of Science (Lunch Panel) | Andrew Jones (Moderator); Joe Greenseid (Panelist) | Ruth's Chris Steak House
Nov 19 | 1:00 PM–1:30 PM | VAST and Microsoft | Kanchan Mehrotra | VAST Booth
Demo Pods at Microsoft Booth
- Azure HPC & AI Infrastructure: Explore how Azure delivers high-performance computing and AI workloads at scale. Learn about VM families, networking, and storage optimized for HPC.
- Agentic AI for Science: See how autonomous agents accelerate scientific workflows, from simulation to analysis, using Azure AI and HPC resources.
- Hybrid HPC with Azure Arc: Discover how Azure Arc enables hybrid HPC environments, integrating on-prem clusters with cloud resources for flexibility and scale.
Ancillary Events (RSVP Required)

Microsoft + AMD Networking Reception — Tuesday Night
When: Tue, Nov 18, 6:30–10:00 PM (CST)
Where: UMB Champions Club, Busch Stadium
RSVP: https://aka.ms/MicrosoftAMD-Mixer

Microsoft + NVIDIA Panel Luncheon — Wednesday
When: Wed, Nov 19, 11:30 AM–1:30 PM (CST)
Where: Ruth’s Chris Steak House
Topic: Accelerating Discovery: How AI and HPC Are Shaping the Future of Science
Panelists: Dan Ernst (NVIDIA); Rollin Thomas (NERSC); Joe Greenseid (Microsoft); Antonia Maar (Intersect360 Research); Fernanda Foertter (University of Alabama)
RSVP: Luncheon is now closed as the event is fully booked.

Conclusion
We’re excited to connect with you at SC25! Whether you’re exploring our booth demos, attending technical sessions, or joining one of our partner events, this is your opportunity to experience how Microsoft is driving innovation in HPC and AI. Stop by Booth #1627 to see the Alpine F1 showcar, explore the Silicon Wall featuring AMD, NVIDIA, and Microsoft’s own chips, and enjoy a coffee from our barista while networking with experts. Don’t forget to RSVP for our Microsoft + AMD Networking Reception and Microsoft + NVIDIA Panel Luncheon. See you in St. Louis!

Breaking the Million-Token Barrier: The Technical Achievement of Azure ND GB300 v6
Azure ND GB300 v6 Virtual Machines with NVIDIA GB300 NVL72 rack-scale systems achieve unprecedented performance of 1,100,000 tokens/s on Llama 2 70B inference, beating the previous Azure ND GB200 v6 record of 865,000 tokens/s by 27%.
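As a quick sanity check of the quoted improvement, the short Python sketch below recomputes the relative gain from the two throughput figures stated above. The function name `relative_gain_percent` is purely illustrative and does not come from the original post.

```python
# Throughput figures quoted in the text above (tokens per second).
gb300_tokens_per_s = 1_100_000  # Azure ND GB300 v6
gb200_tokens_per_s = 865_000    # Azure ND GB200 v6 (previous record)


def relative_gain_percent(new: float, old: float) -> float:
    """Return how much larger `new` is than `old`, as a percentage of `old`."""
    return (new - old) / old * 100.0


gain = relative_gain_percent(gb300_tokens_per_s, gb200_tokens_per_s)
print(f"ND GB300 v6 over ND GB200 v6: {gain:.1f}% more tokens/s")
```

Rounded to the nearest whole percent, the result agrees with the 27% figure in the text.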