Training 100B+ Models on a Single GPU: What MegaTrain Changes - and What It Means for Azure
The Paradigm Shift in Model Training

The conventional wisdom in deep learning has been simple: bigger models require bigger infrastructure. Training a 100-billion parameter language model traditionally demands massive GPU clusters with terabytes of combined memory, where each GPU holds a portion of the model. This assumption has shaped the entire AI infrastructure landscape, driving demand for high-memory accelerators and complex distributed training frameworks.

A new paper, MegaTrain (Yuan et al.), by research teams at the University of Notre Dame and Lehigh University, challenges this paradigm by demonstrating that model size need not be limited by GPU memory capacity. Instead of treating GPU memory as a container that must hold the entire model, MegaTrain treats it as a cache through which model components flow during computation. This architectural inversion enables training models orders of magnitude larger than available GPU memory, turning what was thought to be a hardware limitation into a software scheduling problem. The implications extend beyond academic curiosity to practical infrastructure decisions, particularly for cloud platforms like Azure where GPU configurations, interconnect topologies, and cost structures create distinct optimization landscapes.

The Memory Bottleneck in Traditional LLM Training

Large language model training faces a fundamental memory constraint that grows with model scale. A typical 100-billion parameter model with mixed-precision training may require in excess of 1 terabyte of memory when taking into account weights and optimizer states. Activation memory adds another substantial overhead, particularly for long sequences where intermediate layer outputs must be retained for backpropagation. Traditional training approaches keep all parameters, gradients, and optimizer states resident in GPU memory simultaneously, creating a hard ceiling on trainable model size.
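As a rough illustration of where a figure above 1 terabyte can come from, the sketch below tallies per-parameter bytes under one common mixed-precision Adam layout. The byte counts (FP16 weights and gradients, FP32 master weights, and two FP32 Adam moment buffers) are an assumption for illustration rather than figures from the MegaTrain paper, and the function name is hypothetical.

```python
# Rough memory tally for mixed-precision Adam training.
# Assumed layout (not from the MegaTrain paper): FP16 weights and gradients,
# plus FP32 master weights and two FP32 Adam moment buffers.
BYTES_PER_PARAM = {
    "fp16_weights": 2,
    "fp16_gradients": 2,
    "fp32_master_weights": 4,
    "fp32_adam_momentum": 4,
    "fp32_adam_variance": 4,
}


def training_state_gb(num_params: float, include_gradients: bool = True) -> float:
    """Return persistent training state in GB (1 GB = 1e9 bytes), excluding activations."""
    per_param = sum(
        size
        for name, size in BYTES_PER_PARAM.items()
        if include_gradients or name != "fp16_gradients"
    )
    return num_params * per_param / 1e9


if __name__ == "__main__":
    params = 100e9  # 100-billion parameter model
    print(f"weights + optimizer states: {training_state_gb(params, False):,.0f} GB")
    print(f"with gradients:             {training_state_gb(params, True):,.0f} GB")
```

Under these assumptions, weights and optimizer states alone come to 14 bytes per parameter, or about 1.4 TB for 100 billion parameters, before any activations are counted.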
Standard datacenter GPUs like the NVIDIA A100 with 80GB memory can barely accommodate models beyond 20 billion parameters with reasonable batch sizes, while consumer-grade GPUs with 16-24GB memory are restricted to models under 5 billion parameters. This memory wall has forced the AI community toward distributed training strategies like pipeline parallelism, tensor parallelism, and data parallelism, each adding communication overhead, synchronization complexity, and infrastructure costs. The memory bottleneck becomes particularly acute during gradient accumulation and optimizer updates, where transient memory peaks can trigger out-of-memory errors even when average utilization suggests sufficient capacity.

The Key Insight: Inverting the Memory Hierarchy

MegaTrain's core innovation inverts the traditional memory hierarchy by treating GPU memory as a streaming cache rather than a static container. In conventional training, the model resides in GPU memory while slower storage sits idle except for checkpoint loading. MegaTrain reverses this relationship: the full model lives in high-bandwidth storage like NVMe SSDs, with only the actively computing layer residing in GPU memory at any moment. This architectural inversion exploits the sequential nature of neural network computation: forward and backward passes process layers in deterministic order, creating predictable access patterns amenable to prefetching and streaming. The approach transforms the memory constraint from a capacity problem into a bandwidth problem, where success depends on streaming layers between storage and GPU fast enough to keep compute units saturated. Modern NVMe SSDs provide 7-14 GB/s sequential read bandwidth, while PCIe 4.0 x16 offers roughly 32 GB/s in each direction, creating sufficient headroom for layer streaming when properly pipelined.
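To make the bandwidth framing concrete, the following back-of-the-envelope sketch estimates how long one full pass over the model's weights takes at the NVMe read rates quoted above. The 200 GB figure assumes a 100-billion parameter model stored in FP16, and the function name is hypothetical. Real throughput also depends on queue depth, file layout, and how well reads overlap with compute.

```python
def seconds_per_weight_pass(model_gb: float, bandwidth_gb_s: float) -> float:
    """Time to stream every weight once at a sustained bandwidth."""
    if bandwidth_gb_s <= 0:
        raise ValueError("bandwidth must be positive")
    return model_gb / bandwidth_gb_s


if __name__ == "__main__":
    # Assumed: 100B parameters x 2 bytes (FP16) = 200 GB of weights.
    model_gb = 100e9 * 2 / 1e9
    for label, bandwidth in [("NVMe, low end", 7.0), ("NVMe, high end", 14.0)]:
        seconds = seconds_per_weight_pass(model_gb, bandwidth)
        print(f"{label:>15}: {seconds:5.1f} s per pass over the weights")
```

At 7 GB/s a single pass takes about 28.6 seconds, so each forward or backward sweep needs at least that much compute to hide the reads completely.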
This inverted memory hierarchy is the basis for a fundamental architectural shift, through which MegaTrain moves from memory-bound scaling to bandwidth-bound scaling.

MegaTrain Architecture and Core Mechanisms

MegaTrain implements memory hierarchy inversion through four interlocking mechanisms that together enable efficient single-GPU training of massive models. The architecture maintains model parameters, optimizer states, and gradients in host memory or NVMe storage, streaming only the actively computing layer into GPU memory. Each training iteration processes layers sequentially: during the forward pass, layers stream from storage to GPU, compute activations, then stream back to make room for the next layer. The backward pass reverses this flow, streaming layers in reverse order to compute gradients. Critically, MegaTrain maintains a minimal GPU memory footprint by evicting each layer immediately after computation rather than accumulating layers. The system achieves this through careful orchestration of four core techniques that turn naive layer streaming from a theoretical possibility into a practical training method.

Layer-wise streaming provides the foundation by decomposing model computation into independent sequential operations. Pipelined execution overlaps data movement with computation to hide transfer latency. Block-wise recomputation trades redundant forward passes for reduced activation memory. Stateless execution eliminates persistent GPU state to maximize available memory for active computation. Together, these techniques enable training throughput within 2-5x of conventional in-memory training while supporting models 10-100x larger than GPU memory capacity. Let’s look at each of these techniques in detail.

Layer-Wise Streaming: Sequential Model Decomposition

Layer-wise streaming exploits the inherently sequential structure of neural network computation to decompose model execution into memory-independent stages.
Each transformer layer performs a self-contained computation: it receives input activations, applies attention and feedforward transformations, and produces output activations for the next layer. MegaTrain leverages this modularity by loading exactly one layer's parameters into GPU memory, computing its forward pass, storing the output activations, then immediately evicting the layer to make room for the next. During backpropagation, the process reverses: layers stream in reverse order, recompute their forward pass from stored input activations, compute gradients, and stream back to storage. This approach reduces peak GPU memory from the sum of all layer parameters to the size of the single largest layer plus activation storage. For a 100-billion parameter model with 80 transformer layers, each layer contains approximately 1.25 billion parameters requiring 2.5GB in FP16, compared to 200GB for the full model. The memory savings enable training on GPUs that would otherwise be incapable of holding even a fraction of the model. Layer-wise streaming does introduce computational overhead: each layer must be loaded and evicted during both forward and backward passes, creating 4x the parameter transfer volume compared to conventional training. However, modern interconnect bandwidth and careful prefetching largely mitigate this overhead when properly pipelined.

Pipelined Execution: Overlapping Transfer and Compute

Pipelined execution hides layer streaming latency by overlapping data transfers with GPU computation, ensuring compute units remain saturated despite continuous parameter movement. MegaTrain maintains a three-stage pipeline: while the GPU computes layer N, the system prefetches layer N+1 from storage to host memory and simultaneously evicts layer N-1 back to storage. This overlapping execution exploits the independence of CPU-GPU data transfers and GPU computation in modern architectures.
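The per-layer arithmetic above, together with the overlap idea, can be captured in a small timing model. This is a simplified sketch rather than MegaTrain's scheduler: it assumes equally sized layers, a fixed transfer bandwidth, and perfect overlap, so each pipelined step costs the larger of transfer time and compute time. The function names and the 16 GB/s, 50 ms, and 200 ms inputs are illustrative assumptions.

```python
def layer_size_gb(total_params: float, num_layers: int, bytes_per_param: int = 2) -> float:
    """Size of one layer's weights in GB, assuming equally sized layers."""
    return total_params / num_layers * bytes_per_param / 1e9


def step_time_ms(transfer_ms: float, compute_ms: float, overlapped: bool) -> float:
    """Per-layer time: max() when transfer hides behind compute, sum otherwise."""
    if overlapped:
        return max(transfer_ms, compute_ms)
    return transfer_ms + compute_ms


def exposed_transfer_ms(transfer_ms: float, compute_ms: float) -> float:
    """Transfer time that compute cannot hide, even with perfect overlap."""
    return max(0.0, transfer_ms - compute_ms)


if __name__ == "__main__":
    size_gb = layer_size_gb(100e9, 80)   # 2.5 GB per layer in FP16
    transfer_ms = size_gb / 16.0 * 1000  # assumed 16 GB/s effective link rate
    for compute_ms in (50.0, 200.0):
        print(
            f"compute={compute_ms:5.0f} ms  transfer={transfer_ms:6.2f} ms  "
            f"serial={step_time_ms(transfer_ms, compute_ms, False):6.2f} ms  "
            f"pipelined={step_time_ms(transfer_ms, compute_ms, True):6.2f} ms  "
            f"exposed={exposed_transfer_ms(transfer_ms, compute_ms):6.2f} ms"
        )
```

With these inputs a 200 ms compute step fully hides the 156.25 ms transfer, while a 50 ms step leaves about 106 ms of transfer exposed. That is why larger batches and longer sequences favor the streaming approach.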
PCIe transfers and GPU kernels execute concurrently when properly scheduled using asynchronous CUDA streams, allowing near-complete latency hiding if transfer time does not exceed computation time. For typical transformer layers, forward pass computation takes 50-200ms depending on batch size and sequence length, while transferring a 2.5GB layer over PCIe 4.0 x16 at an effective 16 GB/s requires about 156ms. The transfer therefore fits within the computation window only toward the upper end of that range; shorter compute steps leave part of the transfer exposed. The pipeline achieves optimal efficiency when transfer bandwidth and computation throughput are balanced, creating a predictable performance model based on layer size, batch configuration, and interconnect speed. The three stages of the pipeline (prefetch, compute, and eviction) interleave to maintain continuous GPU utilization, as illustrated in the pipeline diagram from the paper.

Block-Wise Recomputation

Block-wise recomputation addresses the activation memory bottleneck by trading computation for memory through selective forward pass recalculation. In standard training, all intermediate activations from the forward pass must be retained for gradient computation during backpropagation, consuming memory proportional to model depth and batch size. MegaTrain instead stores only a subset of checkpoint activations at regular intervals, recomputing intermediate values on demand during the backward pass. For a model with 80 layers, storing checkpoints every 10 layers reduces activation memory by roughly 90 percent at the cost of recomputing 9 out of 10 layers during backpropagation. This trade-off proves favorable because forward pass computation is relatively cheap compared to the memory savings, and recomputation can be pipelined with layer streaming.

Stateless Execution

Stateless execution complements recomputation by eliminating all persistent GPU state between layer computations.
After processing each layer, MegaTrain immediately evicts not only parameters but also optimizer states like momentum buffers and variance estimates, keeping only the minimal activations needed for gradient flow. This aggressive state eviction maximizes available GPU memory for the active layer's computation, enabling larger batch sizes and longer sequences within fixed memory budgets. Together, block-wise recomputation and stateless execution transform MegaTrain's memory footprint from O(model_size) to O(largest_layer + checkpoint_activations), typically reducing requirements by 50-100x for models in the 100-billion parameter range.

[Figure legend: W - weight prefetch, F/R/B - computation, G - gradient offload. Source: Yuan, Zhengqing, et al. MegaTrain: Full Precision Training of 100B+ Parameter Large Language Models on a Single GPU.]

Performance Results and Achievements

MegaTrain demonstrates practical viability by training models far exceeding GPU memory capacity with acceptable performance overhead. The paper reports an implementation that successfully trained a 175-billion parameter model on a single NVIDIA A100 GPU with 80GB memory. Using conventional techniques, that same hardware configuration would support only 15-20 billion parameters. Measured against conventional in-memory training as the baseline, MegaTrain throughput reached 45 percent of baseline speed, translating to approximately 2.2x wallclock time for equivalent training steps. For smaller models at 30 billion parameters, throughput exceeded 65 percent of baseline as the ratio of computation to transfer improved.

The throughput chart demonstrates the inverse relationship between model size and throughput efficiency. As model parameters increase from 30 billion to 175 billion, throughput degrades linearly from 65% to 45% of baseline, validating the predictable performance characteristics of the streaming approach.
This degradation reflects the growing share of each step spent on layer transfers relative to computation as models grow larger. Memory efficiency gains are substantial: peak GPU memory utilization remained under 40GB throughout training regardless of model size, with the remainder allocated to batch size expansion that partially offset streaming overhead. The system sustained 12-14 GB/s effective storage bandwidth during training, approaching theoretical NVMe limits and validating the pipelined streaming approach.

The memory utilization chart reveals MegaTrain's fundamental breakthrough: GPU memory consumption remains nearly constant across vastly different model sizes. While conventional training would require proportionally more GPU memory as models grow, MegaTrain keeps peak utilization between 35GB and 40GB regardless of whether the model has 30 billion or 175 billion parameters. This flat memory profile enables single-GPU training of models that would otherwise require expensive multi-GPU configurations. Scaling analysis revealed predictable performance characteristics: throughput degraded linearly with model size as layer count increased, while improving with batch size as computation became more dominant relative to fixed transfer costs. Energy efficiency comparisons showed single-GPU MegaTrain drawing 85 percent less power than equivalent 8-GPU distributed training for the same model; the 2.2x longer run time offsets part of that gain, and the reported net result is a 40 percent total energy saving.

Running MegaTrain on Azure NC-Series GPUs

Azure's NC-series GPU virtual machines provide diverse configurations for MegaTrain deployment, each with distinct performance characteristics shaped by GPU type, interconnect topology, and storage options. The NC A100 v4 series offers NVIDIA A100 GPUs with 80GB memory over PCIe 4.0 interfaces, directly matching the original MegaTrain research configuration.
These instances provide roughly 32 GB/s of PCIe 4.0 x16 bandwidth in each direction, sufficient for layer streaming without bottlenecking on parameter transfers for models up to 200 billion parameters. Storage configuration critically impacts MegaTrain performance: Azure Premium SSD v2 delivers up to 16 GB/s throughput with appropriate provisioning, while local NVMe SSDs on storage-optimized instances can reach 7-14 GB/s depending on VM size. The NC series of VMs provides a large amount of system memory, local NVMe storage, and GPUs, creating an ideal environment for MegaTrain's host memory and storage requirements. Importantly, Azure's NC-series instances use PCIe-connected GPUs rather than NVLink-connected configurations, making interconnect bandwidth the primary performance constraint rather than GPU compute capacity.

Cost analysis favors MegaTrain for experimentation and medium-scale training: a single NC A100 v4 instance costs ~4 dollars per hour compared to ~30 dollars per hour for an 8-GPU NDv4 instance, roughly a 7.5x lower hourly rate. For workloads that tolerate a 2-3x longer training duration, that works out to a net cost reduction of about 2.5-3.75x per training run.

Code for MegaTrain is available on the project’s GitHub repository at https://github.com/DLYuanGod/MegaTrain. Supported models at the time of writing include Qwen 2/2.5/3/3.5/3.5 MoE, Llama 2/3/3.1/3.2/3.3/4, Mistral, Mixtral, DeepSeek Code/R1, Phi-3/4, Gemma 2/3, as well as some others.

Critical Constraint: PCIe Bandwidth vs NVLink

The interconnect bandwidth hierarchy creates fundamentally different performance envelopes for MegaTrain on Azure compared to specialized research hardware. PCIe 4.0 x16 connections on Azure NC-series instances provide roughly 32 GB/s in each direction between CPU and GPU. For MegaTrain's layer streaming workload, CPU-GPU bandwidth matters most since model parameters flow from host memory or NVMe storage through PCIe to GPU memory.
PCIe 4.0 x16's roughly 32 GB/s per direction proves sufficient for models up to 200 billion parameters when combined with aggressive prefetching and eviction pipelining, but becomes a hard bottleneck for larger models where layer transfer time exceeds computation time.

Trade-Offs and Limitations

MegaTrain introduces several trade-offs that practitioners must evaluate against infrastructure constraints and training objectives. Training throughput degrades to 40-65 percent of baseline depending on model size, extending wallclock training time by 1.5-2.5x for equivalent iteration counts. This slowdown proves acceptable for research experimentation and low-frequency retraining but may be prohibitive for production pipelines requiring rapid iteration. Memory efficiency gains come at the cost of increased storage I/O: a full training run generates 4x the parameter transfer volume compared to conventional training due to bidirectional streaming during forward and backward passes. This I/O amplification accelerates SSD wear and may increase cost for cloud storage with throughput-based pricing.

Implications for AI Infrastructure and Azure Strategy

MegaTrain's viability reshapes infrastructure planning assumptions for AI training workloads, particularly in cost-constrained environments where model size exceeds available GPU memory. Organizations can defer expensive hardware upgrades by exploiting existing GPU capacity more fully through memory hierarchy inversion, extending the useful life of current hardware generations. Cloud platforms like Azure benefit from increased flexibility in instance sizing: users can select GPU types based on compute requirements rather than being forced into high-memory configurations solely for capacity. This decoupling enables better price-performance optimization by matching compute intensity to GPU type while relying on storage for capacity scaling.
The approach particularly suits Azure's NC-series positioning as a cost-effective alternative to specialized AI instances, turning PCIe bandwidth into an acceptable constraint rather than a disqualifying limitation. Research teams gain the ability to prototype and validate large model architectures without provisioning expensive multi-GPU clusters, accelerating iteration during early development phases. However, production deployment strategies should carefully evaluate MegaTrain's throughput trade-offs against distributed training alternatives: the 2-3x slowdown may be unacceptable for latency-sensitive pipelines despite cost savings. Infrastructure teams should prioritize NVMe storage provisioning and PCIe 4.0/5.0 support (for example, by using newer GPU types such as the H100) when deploying MegaTrain-compatible environments, as storage bandwidth directly determines achievable performance. Long-term, the techniques validate a broader trend toward memory-disaggregated computing where storage, memory, and compute scale independently rather than in fixed ratios determined by hardware packages.

AI Infrastructure Preflight at User Space: Validating Multi Node, Multi GPU Slurm Clusters
Every team that operates GPU clusters for AI has seen this pattern. The cluster boots, GPUs are visible, and scheduling works at a basic level. Then the first distributed training run stalls in NCCL initialization, fails during rank rendezvous, or silently maps ranks to the wrong devices. The issue is often not in training code. It is in infrastructure consistency across scheduler, runtime, drivers, networking, and process topology.

The goal of ai-infra-validator is straightforward:

- Run a fast user space preflight before expensive training jobs.
- Validate distributed initialization for multi node, multi GPU workloads.
- Confirm GPU affinity and rank mapping are correct.
- Verify NCCL communication fabric can complete a collective ring under Slurm.

This post walks through the implementation in detail, explains why each part exists, and shows how to operationalize it in real HPC AI environments.

What the project validates

Zero-dependency user space smoke test for AI clusters. Validates multi-node PyTorch DDP initialization, GPU affinity, and NCCL fabric connectivity under Slurm orchestration.

Git Repo: ai-cluster-validator

In practical terms, this checks that:

- Slurm launches the expected number of ranks per node.
- Distributed process group creation with NCCL succeeds.
- Each rank binds to the expected local GPU.
- Cross-rank all-reduce completes and converges.
- Node level telemetry confirms software and fabric state.

This is not a performance benchmark. It is a correctness and readiness gate.

Tested platform profile

| Component | Value |
| CycleCloud | 8.8.3-3667 |
| Slurm | 25.05.5 |
| Slurm partition | hpc |
| Scheduler VM SKU | Standard_D8s_v6 |
| Compute VM SKU | Standard_ND96asr_v4 |
| OS images | microsoft-dsvm:ubuntu-hpc:2204:latest and microsoft-dsvm:ubuntu-hpc:2404:latest |
| PyTorch | 2.12.0+cu130 |
| CUDA runtime | 13.0 |
| NCCL target | 2.29.7 |

This profile represents a common enterprise scenario where scheduler and compute nodes have different roles, and the training fleet depends on correct multi node orchestration.
Step 1: Minimal user space bootstrap

The bootstrap script creates a shared Python environment at /shared/apps/pytorch_env and installs the required packages:

- torch
- torchvision
- torchaudio
- psutil

This choice is intentional:

- No dependency on containers for first-pass validation.
- Single environment path visible to all compute nodes.
- Rapid setup and repeatability for cluster operators.

Command sequence:

    git clone https://github.com/vinil-v/ai-cluster-validator.git
    cd ai-cluster-validator
    sudo bash bootstrap_env.sh

Step 2: Slurm job defines deterministic distributed topology

The Slurm script expresses a clear topology contract:

    nodes=2
    ntasks-per-node=8
    gpus-per-node=8
    cpus-per-task=12

From this, world size is derived as:

    WORLD_SIZE = SLURM_NTASKS = 2 x 8 = 16

The script also configures network and NCCL behavior:

    NCCL_DEBUG=WARN
    NCCL_IB_DISABLE=0
    NCCL_P2P_DISABLE=0
    NCCL_IGNORE_CPU_AFFINITY=1
    GLOO_SOCKET_IFNAME=eth0
    NCCL_SOCKET_IFNAME=eth0

Important implementation detail:

- MASTER_ADDR is set to the first host in SLURM_JOB_NODELIST.
- MASTER_PORT is selected dynamically from the ephemeral range 49152-65535 and falls back to 29500 if needed.

Why this matters:

- Reduces port collision risk when jobs run frequently.
- Avoids hardcoded rendezvous values that may fail in shared clusters.

Launch path:

    # The Slurm variables are escaped so that they expand inside each task,
    # not once in the batch shell (where every rank would see the same value).
    srun --cpu-bind=none bash -c "
      source /shared/apps/pytorch_env/bin/activate;
      export RANK=\$SLURM_PROCID;
      export LOCAL_RANK=\$SLURM_LOCALID;
      python3 ddp_mesh_ping.py
    "

The LOCAL_RANK handoff is critical for stable GPU affinity inside each node.

Step 3: DDP initialization and rank to GPU affinity

Inside ddp_mesh_ping.py, each process executes:

1. Parse WORLD_SIZE, RANK, LOCAL_RANK, MASTER_ADDR, MASTER_PORT.
2. Initialize torch.distributed with backend nccl and TCP init method.
3. Set CUDA device using LOCAL_RANK.
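The first item in the list above, reading the rendezvous contract from the environment, might be sketched as follows. The function and dataclass names are hypothetical and this is not the code from ddp_mesh_ping.py. It only illustrates the kind of early validation that turns a bad launch into a clear error message instead of a hang. The variable names are the ones the post says the script parses.

```python
import os
from dataclasses import dataclass
from typing import Mapping


@dataclass(frozen=True)
class RendezvousConfig:
    world_size: int
    rank: int
    local_rank: int
    master_addr: str
    master_port: int


def read_rendezvous_env(env: Mapping[str, str] = os.environ) -> RendezvousConfig:
    """Parse and sanity-check the variables a launcher is expected to export."""
    required = ("WORLD_SIZE", "RANK", "LOCAL_RANK", "MASTER_ADDR", "MASTER_PORT")
    missing = [name for name in required if not env.get(name)]
    if missing:
        raise RuntimeError(f"missing environment variables: {', '.join(missing)}")

    cfg = RendezvousConfig(
        world_size=int(env["WORLD_SIZE"]),
        rank=int(env["RANK"]),
        local_rank=int(env["LOCAL_RANK"]),
        master_addr=env["MASTER_ADDR"],
        master_port=int(env["MASTER_PORT"]),
    )
    # A global rank must fall inside the world, and a local rank can never
    # exceed the global rank of the same process.
    if not 0 <= cfg.rank < cfg.world_size:
        raise RuntimeError(f"RANK {cfg.rank} is outside [0, {cfg.world_size})")
    if not 0 <= cfg.local_rank <= cfg.rank:
        raise RuntimeError(f"LOCAL_RANK {cfg.local_rank} is inconsistent with RANK {cfg.rank}")
    if not 1 <= cfg.master_port <= 65535:
        raise RuntimeError(f"MASTER_PORT {cfg.master_port} is not a valid TCP port")
    return cfg
```

Failing fast here is a design choice: a rank that starts with a missing or out-of-range value would otherwise block in rendezvous until a timeout, which is harder to triage.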
Core initialization path:

    import torch
    import torch.distributed as dist

    # master_addr, master_port, world_size, rank, and local_rank come from the
    # environment variables parsed in the first step.
    dist.init_process_group(
        backend="nccl",
        init_method=f"tcp://{master_addr}:{master_port}",
        world_size=world_size,
        rank=rank,
    )
    torch.cuda.set_device(local_rank)

This validates the minimum distributed contract required by real model training jobs.

Step 4: Rich node and fabric telemetry in user space

Each rank collects detailed metadata before the collective test:

- Node identity from Slurm and hostname.
- GPU model and VRAM from CUDA properties.
- System memory via psutil.
- CPU model from /proc/cpuinfo.
- OS and kernel versions.
- NVIDIA driver version from /proc/driver/nvidia/version.
- PyTorch, CUDA, and NCCL runtime versions.
- InfiniBand device state and link rate from /sys/class/infiniband.
- Basic GPU peer access capability via torch.cuda.can_device_access_peer.

All rank payloads are gathered on rank 0 using dist.gather_object and printed as:

- Cluster hardware topology report.
- Node environment deep dive.
- Network interconnect and fabric status.

This design gives platform teams one artifact that is both operational and diagnostic.

Step 5: Functional collective validation

After telemetry, each rank executes a lightweight DDP compute path:

1. Build nn.Linear(10,10) on local GPU.
2. Wrap with DistributedDataParallel.
3. Perform forward, loss, backward.
4. Run all_reduce on loss tensor.
5. Compute global average loss.

Pass condition is explicit in log output:

    SUCCESS: DDP Multi-Node AllReduce Ring Complete!

This confirms that process group initialization and collective communication both completed successfully.
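Because the pass condition is a fixed string in the log, an automated gate can be a few lines of parsing. The sketch below is one hypothetical way to do that and is not part of the repository. It looks for the success marker quoted above and, when an expected world size is supplied, for the "Total Execution Ranks" line that the job also prints. The function name is an assumption.

```python
import re
from typing import Optional

SUCCESS_MARKER = "SUCCESS: DDP Multi-Node AllReduce Ring Complete!"
RANKS_PATTERN = re.compile(r"Total Execution Ranks:\s*(\d+)")


def smoke_test_passed(log_text: str, expected_world_size: Optional[int] = None) -> bool:
    """Return True when the log has the success marker and, if requested, the expected rank count."""
    if SUCCESS_MARKER not in log_text:
        return False
    if expected_world_size is None:
        return True
    match = RANKS_PATTERN.search(log_text)
    return match is not None and int(match.group(1)) == expected_world_size


if __name__ == "__main__":
    sample = (
        "Total Execution Ranks: 16\n"
        "SUCCESS: DDP Multi-Node AllReduce Ring Complete!\n"
    )
    print(smoke_test_passed(sample, expected_world_size=16))
```

In a pipeline, the same function can be applied to the contents of the job's .log file, with the calling script exiting non-zero when it returns False.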
What a successful run looks like

Submission:

    sbatch ddp_smoke_test.slurm
    squeue

Representative outcomes in the log:

- Total Execution Ranks: 16
- Two nodes with local ranks 0 through 7 on each node
- GPU inventory aligned with expected A100 topology
- Active InfiniBand HCAs discovered per host
- NCCL socket interface set to eth0
- Final success marker and computed convergence loss

When these markers are present and coherent with expected hardware shape, the cluster is typically ready for distributed training bring-up.

How to check the output file

The Slurm script writes two artifacts per job:

- ai_infra_smoke_test_<jobid>.log
- ai_infra_smoke_test_<jobid>.err

Use this exact workflow after submission:

    # 1. Submit and capture the job id
    sbatch ddp_smoke_test.slurm

    # 2. Check job state
    squeue -j <jobid>

    # 3. Read standard output log
    cat ai_infra_smoke_test_<jobid>.log

    # 4. Read standard error log
    cat ai_infra_smoke_test_<jobid>.err

For stronger validation in automation, also check:

- Total Execution Ranks equals expected world size.
- Both nodes appear in the topology table with local ranks 0 through 7.
- NCCL/CUDA/PyTorch versions are present in the node environment section.

Complete reference output

Use the following full log as a known-good reference from a successful 2-node ND96asr_v4 run.

    Master Node IP/Hostname: ddpcluster-hpc-1
    Dynamically Assigned Port: 53593
    Total Execution Ranks: 16
    ===============================================================================================
    HPC CLUSTER INTERACTION MONITOR
    ===============================================================================================
    --> Initializing DDP on Master Node : ddpcluster-hpc-1
    --> Dynamic Coordination Port : 53593
    --> Target World Cluster Size : 16 GPUs
    -----------------------------------------------------------------------------------------------
    ===============================================================================================
    CLUSTER HARDWARE TOPOLOGY REPORT
    ===============================================================================================
    | Rank | Node Name | Local ID | GPU Model | VRAM | Sys Mem | CPU Cores |
    -----------------------------------------------------------------------------------------------
    | 0 | ddpcluster-hpc-1 | 0 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 1 | ddpcluster-hpc-1 | 1 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 2 | ddpcluster-hpc-1 | 2 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 3 | ddpcluster-hpc-1 | 3 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 4 | ddpcluster-hpc-1 | 4 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 5 | ddpcluster-hpc-1 | 5 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 6 | ddpcluster-hpc-1 | 6 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 7 | ddpcluster-hpc-1 | 7 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 8 | ddpcluster-hpc-2 | 0 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 9 | ddpcluster-hpc-2 | 1 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 10 | ddpcluster-hpc-2 | 2 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 11 | ddpcluster-hpc-2 | 3 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 12 | ddpcluster-hpc-2 | 4 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 13 | ddpcluster-hpc-2 | 5 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 14 | ddpcluster-hpc-2 | 6 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    | 15 | ddpcluster-hpc-2 | 7 | NVIDIA A100-SXM4-4 | 39.5 GB | 885.8 GB | 96 Cores |
    ===============================================================================================
    NODE ENVIRONMENT DEEP DIVE
    -----------------------------------------------------------------------------------------------
    [ddpcluster-hpc-1] Details:
    --> CPU Microarchitecture : AMD EPYC 7V12 64-Core Processor
    --> Operating System : Ubuntu 22.04.5 LTS
    --> Kernel Base Version : 5.15.0-1110-azure
    --> Nvidia Driver Loaded : 580.126.20
    --> PyTorch Environment : v2.12.0+cu130
    --> CUDA Runtime Version : v13.0
    --> NCCL Fabric Target : v2.29.7
    --> Discovered InfiniBand HCAs:
    - mlx5_an0:1 (4: ACTIVE - 40 Gb/sec (4X QDR))
    - mlx5_ib0:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib1:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib2:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib3:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib4:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib5:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib6:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib7:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    -----------------------------------------------------------------------------------------------
    [ddpcluster-hpc-2] Details:
    --> CPU Microarchitecture : AMD EPYC 7V12 64-Core Processor
    --> Operating System : Ubuntu 22.04.5 LTS
    --> Kernel Base Version : 5.15.0-1110-azure
    --> Nvidia Driver Loaded : 580.126.20
    --> PyTorch Environment : v2.12.0+cu130
    --> CUDA Runtime Version : v13.0
    --> NCCL Fabric Target : v2.29.7
    --> Discovered InfiniBand HCAs:
    - mlx5_an0:1 (4: ACTIVE - 40 Gb/sec (4X QDR))
    - mlx5_ib0:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib1:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib2:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib3:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib4:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib5:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib6:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    - mlx5_ib7:1 (4: ACTIVE - 200 Gb/sec (4X HDR))
    -----------------------------------------------------------------------------------------------
    NETWORK INTERCONNECT & FABRIC STATUS
    -----------------------------------------------------------------------------------------------
    --> Target Communication Interface (NCCL_SOCKET_IFNAME) : eth0
    --> Active Telemetry Tracking Level (NCCL_DEBUG) : WARN
    --> Inter-GPU Topo Link Verification : Active (P2P/NVLink Capable)
    -----------------------------------------------------------------------------------------------
    SUCCESS: DDP Multi-Node AllReduce Ring Complete!
    --> Computed System Verification Convergence Loss : 1.398719
    ===============================================================================================

Why this is effective for platform operations

For AI infrastructure teams, this pattern is highly effective because it is:

- Fast: can be run after every change window.
- Deterministic: same topology contracts every run.
- Actionable: output includes enough context for first-level triage.
- Low friction: user space only, no heavy control plane dependencies.

This supports common operating workflows:

- Day-0 cluster acceptance.
- Day-1 patch validation after driver, kernel, or image changes.
- Regression gate in golden image pipelines.
- Preflight before large multi node model training jobs.

Practical guidance for extending to larger clusters

- Adjust Slurm directives for nodes and tasks per node.
- Keep one rank per GPU unless validating alternate placement policy.
- Set NCCL_SOCKET_IFNAME and GLOO_SOCKET_IFNAME according to your network policy.
- Preserve the dynamic MASTER_PORT logic to avoid static collisions.
- Keep the success marker string stable so automation can parse it.

Closing perspective

Most distributed training failures are expensive because they are discovered late.
A user space preflight that validates scheduler topology, rank rendezvous, GPU affinity, and NCCL collectives provides a high value guardrail before production starts. ai-infra-validator is a practical implementation of that guardrail. It is compact, transparent, and aligned with how real Slurm based AI clusters operate. For teams running multi node, multi GPU training at scale, this kind of preflight should be a standard operational gate.

Building resilient networks for AI supercomputers
By Valerie Cutts and Jithin Jose

Last fall we introduced Fairwater, the world’s most powerful AI datacenter. Delivering a system of this scale required rethinking how Azure designs supercomputers, especially the scale-out network. Today, we are sharing more about the networking innovations that have made Fairwater possible. In this post, we share what’s unique about networking at extreme GPU scale and the system-level design choices required to enable large synchronous training jobs to run reliably, even during network failures. We are also publishing, in partnership with others, the open-source Multipath Reliable Connection (MRC) specification and software interfaces, and open-sourcing the associated libraries.

Fault-tolerant scale-out networking

At extreme scale, synchronous training amplifies the impact of routine network faults, turning packet drops, slow links, and partial failures into stalls, restarts, and wasted GPU time. As we describe in our MRC paper, the path forward is to treat failure as normal and design the network as an integrated system that:

- Scales to 100K+ GPUs using a two-level, multi-path topology to enable enough redundancy
- Balances load evenly across the fabric to prevent congestion
- Recovers predictably and gracefully during failures
- Uses less power than three- or four-layer single-plane topologies

As outlined in the Multipath Reliable Connection Specification, Microsoft partnered with AMD, Broadcom, Intel, NVIDIA and OpenAI to jointly address this problem, focusing on changes to transport and network design needed to support training at extreme scale. Instead of relying on lossless fabrics and dynamic routing, we collectively designed, built, and deployed Multipath Reliable Connection (MRC), which draws upon lessons from the Ultra Ethernet Consortium (UEC), and paired it with a multiplane network topology, enabling reliable training jobs even when links, switches, or paths fail.
The endpoint–driven transport created a simpler, more resilient network that delivers: More resilient, predictable training at very large scale Large training jobs continue making steady forward progress despite routine network faults, reducing stalls and restarts and improving time-to-train as cluster size increases. Better utilization of expensive GPU infrastructure By avoiding tail latency amplification and repeated recovery cycles, GPUs spend more time doing useful work instead of waiting on synchronization or replaying lost computation, improving overall efficiency and cost effectiveness. Automatic adaptation at machine timescales Failure detection, load balancing, and recovery happen fast enough to keep up with the rate and complexity of faults in 100K+ GPU systems, well beyond what manual intervention or control-plane convergence can achieve, allowing the system to remain stable as scale increases. In the Fairwater supercomputer, enabling graceful degradation in the scale-out network improves training throughput versus traditional transports and architectures. In combination with a multi-plane topology design, MRC increases the time that installed NVIDIA GPUs perform useful computation. A shift in philosophy: End-to-end control The central design decision behind MRC is to shift responsibility for load balancing and failure handling from complex network switch control planes to the network endpoints with end-to-end controls. The network endpoint controls the path selection and can optimally use a set of paths based on feedback from the network. MRC extends the RoCE Reliable Connection (RC) transport to support true multipath operation. Instead of binding a queue pair to a single path, MRC sprays packets across many paths simultaneously, making performance far less sensitive to any single slow or failed link. Several design elements are critical to enable end-to-end control. 
• Every packet carries enough information for the receiver to place it directly into memory, even if packets arrive out of order.
• Selective acknowledgments enable rapid retransmission of only the packets that were lost.
• Packet trimming signals network congestion swiftly without forcing full packet drops, enabling efficient congestion control.
• MRC disables Priority Flow Control (PFC) entirely and runs Ethernet in best-effort mode, avoiding global pauses that can devastate tail latency or lead to fabric-wide deadlock behavior.
• The system enables seamless self-recovery from network hardware failures.

The result is a transport protocol that expects loss, adapts quickly, and continues making progress even when parts of the fabric misbehave.

Rethinking topology: Multi-plane design

Transport alone is not enough. To complement MRC, we implemented a two-tier, multiplane network topology in Fairwater using high-radix switches. Our network design splits each NIC into multiple lower-speed ports (i.e., eight x 100 Gbps) and builds multiple parallel network planes. This multi-plane design enables a more compact topology than a traditional three-tier Clos network running at 800 Gbps/port. Our two-tier multi-plane topology design offers several advantages:
• Enables connecting 100K+ GPUs with just two tiers of network.
• Lower latency, since packets traverse fewer switches.
• Reduced impact of network issues on overall job completion: although each switch is connected to more servers, a failure is spread across them, so the impact on each server, and on the job as a whole, is smaller.
• Reduced hardware and power costs compared to designs with an additional network layer, without compromising on GPU scale.
• Most importantly, the network becomes more tolerant of partial failures, so jobs continue with slightly reduced bandwidth rather than failing outright.

Multiplane networks work efficiently only when traffic is evenly distributed across all planes and paths.
This is where MRC’s packet spraying and path aware congestion response is essential. Figure 1: Example of two-tier multi-plane topology using 512 x 100 GbE switches: 512 T0s x 256 NICs = 131,072 NICs Static SRv6: Fewer moving parts, more predictability In many data center networks, switches rely on Border Gateway Protocol (BGP) or other dynamic routing protocols. We removed dynamic routing from our design; instead, packets are source-routed using IPv6 Segment Routing (SRv6). Each packet encodes the end-to-end network path using compact microsegment IDs (uSIDs). At first glance, static routing seems counterintuitive in failure-prone environments. At extreme scale, however, dynamic routing is more of a liability than an asset. Namely, if two or more switches try to reroute packets at the same time, network behavior becomes unpredictable and harder to diagnose. Interactions between adaptive routing and adaptive transport can be hard to resolve and harder to debug at larger scale. MRC, on the other hand, handles path health and rapid failover at the transport layer. When static routing is used, it enables precise health feedback for each of the different network paths from that network endpoint. Because probe (test) packets follow the same paths as data packets, operators gain accurate, ground-truth insight into fabric health without depending on switch control planes, which are themselves a common source of failures. Additionally, SRv6 routing allows network operators to utilize out-of-band monitoring frameworks to accurately identify link failures and device faults, which has been particularly valuable in managing large-scale AI clusters. Static SRv6 ensures paths are deterministic, making problems easier to reproduce, debug and, ultimately, more stable over time. Failure as a normal operating condition In production, failures are expected by design—link flaps, partial failures, and even switch reboots are routine at this scale. 
With MRC, many of these events no longer impact training workloads. Repair actions proceed in parallel, while MRC dynamically routes around failed paths. As repairs complete, MRC discovers and validates the restored paths before seamlessly reintegrating them, entirely transparently to the training application. In summary:
• Systems degrade gracefully
• Losing a NIC port reduces available bandwidth in proportion to the lost port but does not crash jobs
• Flapping T0–T1 links often go unnoticed by applications
• Switches can be rebooted without coordinated drain or rerouting of the system

For massive-scale training runs, this translates into higher effective uptime, fewer interrupted jobs, and more training throughput.

Figure 2: Bidirectional bandwidth measurements with point-to-point RDMA Perftest while a T0 switch was taken down. Results indicate that the overall bandwidth dropped in proportion to the T0 switch bandwidth, but without failing the job.

Figure 3: Bidirectional bandwidth measurements with RDMA Perftest while a T1 switch is failed and restored. Results indicate no impact on performance, as MRC was able to route around the bad switch.

Measured results at scale

This is not just a thought exercise: Microsoft and OpenAI have both run extensive experiments and record-scale training jobs, showing only brief, bounded performance dips during significant network faults, followed by rapid recovery. Microbenchmarks demonstrate near line-rate bandwidth and predictable latency, even under injected loss. OpenAI describes their scale results in a recent blog post, consistent with what we observe. Taken together, multi-plane MRC with SRv6 delivers better load balancing with fewer queue pairs and substantially higher resilience to packet loss, enabling millions of networking links to connect hundreds of thousands of GPUs.
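The scale figures quoted in this post can be sanity-checked with a little arithmetic. The sketch below reproduces the Figure 1 count (512 T0s x 256 NICs = 131,072 NICs) for one plane of a two-tier topology and then estimates the total number of links. It assumes, beyond what the post states, that each T0 uses half of its ports as downlinks and half as uplinks, and that each of the eight 100 Gbps NIC ports lands on its own plane. The class and function names are illustrative only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TwoTierPlane:
    """One plane of a two-tier (T0/T1) topology built from identical switches."""

    radix: int  # ports per switch, e.g. 512 x 100 GbE

    @property
    def downlinks_per_t0(self) -> int:
        # Assumption: half of the T0 ports face NICs, half face T1 switches.
        return self.radix // 2

    @property
    def t0_count(self) -> int:
        # Each T1 spends one port per T0, so a plane holds at most `radix` T0s.
        return self.radix

    @property
    def t1_count(self) -> int:
        # Each T0 has radix/2 uplinks, one to each T1.
        return self.radix // 2

    @property
    def nic_ports(self) -> int:
        return self.t0_count * self.downlinks_per_t0

    @property
    def links(self) -> int:
        # NIC-to-T0 links plus T0-to-T1 links.
        return self.nic_ports + self.t0_count * self.t1_count


def fabric_summary(radix: int, planes: int, port_gbps: int) -> dict:
    """Aggregate per-plane counts over all planes (one NIC port per plane)."""
    plane = TwoTierPlane(radix)
    return {
        "nics": plane.nic_ports,
        "nic_gbps": planes * port_gbps,
        "switches": planes * (plane.t0_count + plane.t1_count),
        "links": planes * plane.links,
    }


if __name__ == "__main__":
    print(fabric_summary(radix=512, planes=8, port_gbps=100))
```

Under these assumptions the result is 131,072 NICs at 800 Gbps each and about 2.1 million links, which is consistent with the "millions of networking links" mentioned above.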
Figure 4: NCCL Send-Recv benchmark results with 42,020 GPUs, each with an 800 Gbps MRC NIC, showing up to 92% of theoretical peak bandwidth for large message sizes

What this enables

Taken together, MRC, multiplane topologies, and static SRv6 form a coherent strategy for building AI supercomputer scale-out networks that keep large synchronous training jobs moving forward under real-world fault conditions. Instead of treating loss, link flaps, and partial failures as events that trigger stalls or restarts, the system is designed to fail gracefully and reach 100K+ GPU scale at high utilization. This design approach has been deployed in Fairwater and elsewhere to train state-of-the-art models, where the result is more predictable performance for large jobs with higher effective GPU utilization. The core takeaway is simple: by assuming failures will happen and designing for them explicitly, events that would otherwise be catastrophic become minor, manageable perturbations.

Join us in advancing resilient AI infrastructure

To help the broader ecosystem adopt these capabilities, Microsoft is joining key partners in releasing the MRC specification to the Open Compute Project and open-sourcing key components:
• libMRC: MRC transport APIs
• NCCL MRC plugin: enables NCCL to run over MRC transport
• MRC shim library: enables compatible verbs applications to run over MRC with no code changes
• MSCCL++ with MRC support: MSCCL++ library with MRC support
• SONiC SRv6: enhance SRv6 with open NOS for high performance AI Ethernet

We encourage others to review these public contributions, share feedback, and ultimately adopt these capabilities within the broader ecosystem of AI networking products, infrastructures, and workloads.

Acknowledgements

Advancing AI at this scale requires collaboration across the industry. At Microsoft, we value our partnerships with AMD, Broadcom, Intel, NVIDIA, and OpenAI, and our shared commitment to continuing to evolve MRC alongside the broader community.
References:
• MRC paper: Resilient AI Supercomputer Networking using MRC and SRv6
• Multipath Reliable Connection Specification
• OpenAI MRC blog
• AMD MRC blog
• Broadcom MRC blog
• NVIDIA MRC blog
• libMRC APIs
• microsoft/mrc-verbs-shim-lib: shim library to translate ibverbs to libmrc interfaces
• microsoft/mrc-nccl-plugin: MRC plugin for NCCL
• microsoft/mscclpp: MSCCL++: A GPU-driven communication stack for scalable AI applications (with MRC support)

Distributing model weights to your AI cluster: a faster pre-flight on AKS and Slurm
Standing up an N-node training or inference job and waiting forever for the model checkpoint to land on every node's NVMe? Here's a small Rust + MPI tool, azcp-cluster, that pays Azure egress once, broadcasts over your fabric, and finishes in seconds. Plus the AKS and Slurm patterns to wire it into a real pipeline.

Azure NCv6 Virtual Machines: Enhancements and GA Transition
NCv6 Virtual Machines are Azure's flexible, next generation platform enabling both leading-edge graphics and generative AI compute workloads. Featuring NVIDIA RTX PRO 6000 Blackwell Server Edition (BSE) GPUs, Intel Xeon™ 6 "Granite Rapids" 6900P series CPUs, and a suite of Microsoft Azure technologies, NCv6 VMs are available now in Preview. Today, we are pleased to share a series of exciting updates coming soon to Azure NCv6 that will: Enhance VM performance and capabilities Provide more VM sizes for customers to "right size" their usage Bring NCv6 to production readiness with a transition to General Availability, and Expand accessibility across the global Azure cloud New VM Sizes, Features, and Performance Enhancements In the coming weeks, Azure will debut seven new NCv6-series VM sizes and two different sub-families for customers to choose from. The standout features introduced with the new VM sizes include: 🧩 Fractional GPU support, enabling graphics workload customers to deploy VMs with as little as 1/2 or 1/4 of a RTX PRO™ 6000. VMs with fractional GPU support also feature reduced vCPU, memory, SSD, and networking to help customers optimize costs and right size their VMs to their workloads. ⚡ Increased vCPU per VM size (e.g. 288 vCPU instead of 256) to provide more performance for high-end VDI workstations and better align with the Intel Xeon 6900P's triple compute tile architecture. 🛠️ General Purpose and Compute Optimized VM sizes. The former provides larger amounts of CPU memory for demanding generative AI inference and ISV CAD/CAE simulations, while the latter offers reduced memory to enable customers with less memory intensive workloads to cost optimize their deployments. 
The new VM sizes will replace the existing three VM sizes offered in Preview, and be available as follows:

NCv6 - General Purpose VM sizes:

| Size Name | vCPUs | Memory (GB) | Networking (Mbps) | GPUs | GPU Mem (GB) | Temp Disk | NVMe Disk |
|---|---|---|---|---|---|---|---|
| Standard_NC36ds_xl_RTXPro6000_v6 | 36 | 132 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC72ds_xl_RTXPro6000_v6 | 72 | 264 | 45000 | 1/2 | 48 | 512 | 3200 |
| Standard_NC144ds_xl_RTXPro6000_v6 | 144 | 516 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC288ds_xl_RTXPro6000_v6 | 288 | 1032 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC324ds_xl_RTXPro6000_v6 | 324 | 1284 | 180000 | 2 | 192 | 2048 | 12800 |

NCv6 - Compute Optimized VM sizes:

| Size Name | vCPUs | Memory (GB) | Networking (Mbps) | GPUs | GPU Mem (GB) | Temp Disk | NVMe Disk |
|---|---|---|---|---|---|---|---|
| Standard_NC24lds_xl_RTXPro6000_v6 | 24 | 72 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC36lds_xl_RTXPro6000_v6 | 36 | 72 | 22500 | 1/4 | 24 | 256 | 1600 |
| Standard_NC72lds_xl_RTXPro6000_v6 | 72 | 132 | 45000 | 1/2 | 48 | 512 | 3200 |
| Standard_NC144lds_xl_RTXPro6000_v6 | 144 | 264 | 90000 | 1 | 96 | 1024 | 6400 |
| Standard_NC288lds_xl_RTXPro6000_v6 | 288 | 516 | 180000 | 2 | 192 | 2048 | 12800 |
| Standard_NC324lds_xl_RTXPro6000_v6 | 324 | 648 | 180000 | 2 | 192 | 2048 | 12800 |

Note that, until the new VM sizes are available, Microsoft Learn resources will continue to reflect the currently offered VM sizes and technical specifications.

Transition to General Availability

In the coming weeks, Azure will transition NCv6-series from Preview to General Availability (GA) status. With this transition, NCv6 VMs will become covered by the Azure Service Level Agreement (SLA) and thus ready to support production-grade deployments by customers, partners, and service providers. When the transition to GA occurs, NCv6 VMs will be available in the Azure West US 2 and Southeast Asia regions. Information on availability timing of additional regions is provided below.

Regional Expansion Across the Azure Cloud

At the beginning of Preview, NCv6 VMs debuted in the West US 2 region. Since then, we have also added NCv6 VMs to the Southeast Asia region. Both regions will be part of the transition to GA status.
We are pleased to share that in the following months, covering Q3 of 2026, NCv6 VMs will also become available in the following Azure regions:
• East US
• West Europe
• East US 2
• North Europe
• South Central US
• Germany West Central
• West US
• Korea Central

Ready to build for the future with Azure NCv6? NCv6 Virtual Machines are available now in Preview. Start your production-grade AI journey today and explore the next frontier of Azure AI infrastructure.

Join the Preview

Simplify troubleshooting at scale - Centralized Log Management for CycleCloud Workspace for Slurm
Training large AI models on hundreds or thousands of nodes introduces a critical operational challenge: when a distributed job fails, quickly identifying the root cause across scattered logs can become incredibly time-consuming. This manual process delays recovery and reduces cluster utilization. The ability to quickly parse centralized cluster logs from a single interface is critical to ensure job failure root causes are swiftly identified and mitigated to maintain high cluster utilization.

Solution Architecture

This is a turnkey, customizable log forwarding solution for CycleCloud Workspace for Slurm that centralizes all cluster logs into Azure Monitor Log Analytics. The architecture uses Azure Monitor Agent (AMA) deployed on every VM and Virtual Machine Scale Set (VMSS) to stream logs defined by Data Collection Rules (DCR) to dedicated tables in a Log Analytics workspace where they can be queried from a single interface. The turnkey solution captures three categories of logs essential for troubleshooting distributed workloads, but can be extended for any other logs:
• Slurm logs, including slurmctld, slurmd, etc., plus archived job artifacts (job submission scripts, environment variables, stdout/stderr) collected via prolog/epilog scripts.
• Infrastructure logs from CycleCloud, including the CycleCloud Healthagent, which automatically tests nodes for hardware health and drains nodes that fail tests.
• Operating system logs from syslog and dmesg, capturing kernel events, network state changes, and hardware issues.

Each log source flows through its own DCR into a dedicated table following a consistent schema. The solution automatically associates scheduler-specific DCRs with the Slurm scheduler node and compute-specific DCRs with compute nodes, handling dynamic node scaling transparently. The solution is purpose-built for CycleCloud Workspace for Slurm, but designed in a modular fashion to be easily extended for new data sources (e.g., new log formats) and processing (e.g., Data Collection Rules) to support log forwarding and analysis of other required logs.

Key Benefits

• Time-series correlation: Azure Monitor's time-based indexing enables rapid identification of cascading failures. For example, trace a network carrier flap detected in syslog to corresponding slurmd communication errors and on to specific job failures, all within seconds.
• Centralized visibility: Query logs from thousands of nodes through a single interface instead of SSH-ing to individual machines. Correlate Slurm controller decisions with node-level errors and system events in one query.
• Log persistence: Logs survive node deallocations and reimaging. This is critical in cloud environments where compute nodes are ephemeral.
• Powerful query language: KQL (Kusto Query Language) allows parsing raw logs into structured fields, filtering across multiple sources, and building operational dashboards. Example queries detect patterns like repeated job failures, network instability, or resource exhaustion.
• Production-ready scalability: User-assigned managed identities automatically propagate to new VMSS instances, and DCR associations handle thousands of nodes without manual configuration.

Getting Started

The complete solution is available on GitHub (slurm-log-collection) with deployment scripts that:
• Create all required Log Analytics tables
• Deploy pre-configured DCRs for Slurm, CycleCloud, and OS logs
• Automatically associate DCRs with scheduler and compute resources

After configuring environment variables and running the setup scripts, logs begin flowing to Azure Monitor and populate within 15 minutes; normal log ingestion latency is about 30 seconds to 3 minutes. The repository includes sample KQL queries for common troubleshooting scenarios to accelerate time-to-resolution and to perform non-troubleshooting analysis of cluster usage.

AI Inferencing in Air-Gapped Environments
If you had to point out the top trends in IT these days, two strong candidates would be Generative AI and Cybersecurity. For the latter in particular, the sophistication, reach, and volume of cyberattacks have increased significantly in recent years, with added ingredients such as advanced persistent threats, state actors, or “crime-as-a-service” providers. Interestingly enough, both trends go hand in hand: Artificial Intelligence extracts value from your data, and cybercriminals are after exactly the same thing: your data. It is not surprising that organizations have taken steps to protect themselves against data theft, or data exfiltration, as it is often described.

In this post we will explore how to deploy a Hugging Face-hosted model and an NVIDIA NIM™ microservice (a prebuilt, optimized inference container for rapidly deploying the latest AI models) in a Kubernetes cluster, while protecting your infrastructure against data theft. You can find more information about NVIDIA NIM here: https://developer.nvidia.com/nim. We will outline the process for deploying a Kubernetes cluster in Azure with the highest level of network security to prevent data exfiltration, and we will also demonstrate the deployment of container images and required model parameters for both options.

Why Kubernetes clusters

Unless you have been living under a rock, you are probably aware that Kubernetes has taken the IT world by storm, and the AI ecosystem is no exception. Kubernetes makes it extremely easy to package and deploy applications over any infrastructure, and hence it has become one of the most popular platforms to run AI workloads, especially AI inferencing. Azure Kubernetes Service (AKS) is an Azure service that makes it easy to run Kubernetes clusters in Azure. Over time, AKS has introduced multiple deployment options to meet increasingly stringent requirements, particularly around security.
One such option is the private cluster, where no public IP addresses are assigned to the Kubernetes control plane or nodes. To understand this evolution, let’s have a look at what a “public” AKS cluster looks like:

Figure 1: Public AKS API-enabled cluster

As the previous figure shows, there are multiple traffic flows that go over the public Internet:
• In the bootstrap phase, the nodes get images from Microsoft Container Registry, as well as potentially from other repositories such as Ubuntu.
• The Kubernetes administrator operates the cluster by accessing the Kubernetes API, which Microsoft provides on a public IP address.
• When pulling container images, cluster nodes can get them from publicly available repositories such as Docker Hub or Azure Container Registry (if configured to be publicly accessible).
• Lastly, administrators are allowed to expose applications that run in the cluster via public IP addresses, so that users access them over the Internet too.

The first evolution of this concept towards a more restrictive environment was a commonly used pattern consisting of a combination of private clusters (https://learn.microsoft.com/azure/aks/private-clusters) and Azure Firewall to limit egress traffic (https://learn.microsoft.com/azure/aks/limit-egress-traffic) and prevent data exfiltration. In this model, there are no longer any inbound connections to the cluster:

Figure 2: AKS private cluster

The AKS API control plane is fully integrated in the virtual network. Azure Container Registry and other Azure services such as Azure Storage or Azure Key Vault are also integrated with the virtual network through the Private Link technology (https://aka.ms/privatelink). However, there are still outbound flows from the cluster nodes to the Internet, for example during the cluster creation process or the deployment of images stored in public repositories, which need to be explicitly allowed by the egress firewall.
Air-gapped clusters It can be argued that using private clusters only provides security up to the robustness of your firewall ruleset: it essentially acts as a fail-open mechanism. If there’s a misconfiguration in the firewall rules, you may unintentionally allow data exfiltration or theft. To address this, AKS offers an even greater degree of isolation with network-isolated clusters (https://learn.microsoft.com/azure/aks/concepts-network-isolated), where all outbound connections are completely blocked without the need of a firewall: In this mode, AKS nodes are configured in a way so that no outbound flows to the Internet can exist. If you are curious about what you need to do to make sure of that in Azure, here is the list: No public Azure load balancer attached to the AKS nodes. No NAT gateway attached to the AKS node subnet. No public IP address attached to the AKS nodes. The AKS node subnet configured for no default outbound access (https://learn.microsoft.com/azure/virtual-network/ip-services/default-outbound-access). An important consideration is understanding how AKS nodes receive updates or how images are retrieved from public repositories (e.g. docker.io or nvcr.io). This is achieved through an Azure Container Registry feature known as “artifact cache”: https://learn.microsoft.com/azure/container-registry/artifact-cache-overview. However, a challenge arises when considering Large Language Models (LLMs): LLM container images sourced from Hugging Face or NVIDIA (or any other source) typically include the inference runtime (for example vLLM) but not the model weights. Instead, model artifacts are downloaded dynamically when the container starts. Consequently, Azure Container Registry cannot cache these assets. The question then becomes: how can these model weights be made available within an air-gapped Kubernetes environment? 
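The four isolation conditions listed in this section lend themselves to an automated check. The sketch below evaluates them against a plain dictionary describing the node network. The dictionary keys are hypothetical names chosen for this illustration and are not Azure API fields, so in practice they would have to be filled in from whatever inventory source you use.

```python
from typing import Mapping

# Hypothetical keys describing the AKS node network; not Azure API field names.
# Each key is True when the corresponding outbound path is still present.
CHECKS = {
    "public_load_balancer": "a public Azure load balancer is attached to the nodes",
    "nat_gateway": "a NAT gateway is attached to the node subnet",
    "node_public_ips": "public IP addresses are attached to the nodes",
    "default_outbound_access": "the node subnet still allows default outbound access",
}


def isolation_violations(network: Mapping[str, bool]) -> list[str]:
    """Return one message for every outbound path that is still open.

    A missing key is treated as a violation, so an incomplete inventory
    is reported as not isolated rather than silently passing.
    """
    return [msg for key, msg in CHECKS.items() if network.get(key, True)]


if __name__ == "__main__":
    candidate = {
        "public_load_balancer": False,
        "nat_gateway": True,
        "node_public_ips": False,
        "default_outbound_access": False,
    }
    for problem in isolation_violations(candidate):
        print("not isolated:", problem)
```

Treating unknown values as violations mirrors the argument made above against fail-open designs: the check should only pass when every condition has been positively confirmed.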
The Model Weights Challenge

While (re)loading the model weights at container startup is a flexible approach in connected environments, it fails in air-gapped clusters where outbound network access is blocked. To address this, we consider two viable strategies:
• Constructing a container image that includes all required components and pushing it to the container registry accessible by the isolated cluster.
• Pre-downloading model artifacts to a private file share connected to the virtual network of the isolated cluster, and accessing these resources as needed.

Both methods are demonstrated in detail below. Before proceeding, we outline the example scenario. To provide context aligned with current priorities among our financial clients and organizations operating within regulated sectors, this demonstration focuses on the process of model deployment and the configuration of an isolated cluster for LLM inferencing.

For inferencing we use Llama-3.1-8B-Instruct-FP8 served by vLLM, a high-performance inference runtime designed specifically for large language models. In simple terms, vLLM is responsible for efficiently loading the model onto the GPU and handling incoming inference requests with very low latency and high throughput. vLLM is typically packaged as a container image, which in our examples is sourced either from Hugging Face or from NVIDIA, the latter being highly optimized for NVIDIA GPUs and CUDA®. As described earlier, these images usually contain the inference runtime and dependencies, but not the model weights themselves. Instead, the model weights and other model-specific artifacts are downloaded dynamically when the container starts, allowing the same container image to be reused across different models and versions while keeping the image size small and deployment flexible. This approach is not suitable in isolated AKS clusters, where network traffic flowing outside of the deployed virtual network is not permitted.
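Because neither strategy can fall back to a download at startup, it can help to verify that the weights are actually present before the inference server starts. The following sketch checks a local model directory for a config file and at least one weight file. The `config.json` name and the `.safetensors` suffix are assumptions based on the common Hugging Face layout, and the `/models/llama` path is a hypothetical mount point; none of these are prescribed by this post.

```python
from pathlib import Path


def missing_model_artifacts(model_dir: Path) -> list[str]:
    """List what is missing from a pre-populated model directory.

    Assumes the common Hugging Face layout: a config.json next to one or
    more *.safetensors weight shards. Adjust the checks for other formats.
    """
    if not model_dir.is_dir():
        return [f"{model_dir} does not exist or is not a directory"]
    problems = []
    if not (model_dir / "config.json").is_file():
        problems.append("config.json not found")
    if not any(model_dir.glob("*.safetensors")):
        problems.append("no *.safetensors weight files found")
    return problems


if __name__ == "__main__":
    # Hypothetical mount point for the baked-in or shared model weights.
    for issue in missing_model_artifacts(Path("/models/llama")):
        print("missing:", issue)
```

A check like this could run as an init step so that a misconfigured pod fails fast with a clear message instead of timing out on a blocked download.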
From an architectural perspective, model serving is only one part of the overall inferencing platform, and the design of the underlying GPU infrastructure plays a critical role-especially in isolated AKS clusters. In such environments, challenges are not limited to downloading model weights at container startup; for example, setting up the GPU node pool is another important consideration. Traditionally, enabling GPUs on AKS requires installing the NVIDIA device plugin for Kubernetes as well as the NVIDIA GPU drivers, most commonly by deploying the NVIDIA GPU Operator, which takes care of both. While the device plugin itself can be installed relatively easily via the artifact cache of an attached container registry, driver installation is more involved, especially in air-gapped or isolated environments: https://learn.microsoft.com/azure/aks/use-nvidia-gpu. NVIDIA also provides detailed guidance on how to deploy the GPU Operator in such scenarios in their documentation: https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/latest/install-gpu-operator-air-gapped.html. While the required procedures are clearly outlined and well documented, configuring the NVIDIA AKS GPU node pool within an air-gapped, isolated cluster continues to be a complex and time-consuming process. Microsoft has recently unveiled a preview feature that allows users to create fully managed NVIDIA GPU node pools on AKS. With this option, all necessary NVIDIA components including drivers, device plugins, and other supporting software are pre-installed and maintained by Microsoft throughout their lifecycle. This functionality is supported on isolated AKS clusters operating with Kubernetes v1.34.0 or later. 
It substantially decreases operational complexity, streamlining the deployment and maintenance of GPU-based AI inferencing solutions accelerated by NVIDIA in restricted environments: https://learn.microsoft.com/en-us/azure/aks/aks-managed-gpu-nodes?tabs=add-ubuntu-gpu-node-pool

For example, this Azure CLI command would deploy a managed GPU AKS node pool to an existing AKS cluster:

```bash
az aks nodepool add \
  --resource-group MyResourceGroup \
  --cluster-name MyAKSCluster \
  --name gpunp \
  --node-count 1 \
  --node-vm-size <GPU_SKU> \
  --node-taints sku=gpu:NoSchedule \
  --enable-cluster-autoscaler \
  --min-count 1 \
  --max-count 3 \
  --tags EnableManagedGPUExperience=true
```

Keep in mind that if you want complete control over driver versions and related settings, you should follow the guidelines for deploying the NVIDIA GPU Operator in an air-gapped environment. With the managed option, Microsoft takes care of maintaining driver versions for you.

Container and model weight deployment

With the managed GPU node pool set up, we can begin implementing both inferencing scenarios.

Scenario 1: Baking Model Weights into the Container Image

Let’s start with the first one, where the model weights are downloaded from Hugging Face, baked into a container image, and then pushed to the attached Container Registry, which is reachable from the isolated AKS cluster:

Figure 4: Baking Model Weights into Container Image

The easiest way to achieve this is to trigger the container build directly from the container registry, which will take the local Dockerfile, pull the image and data needed from Hugging Face, and tag and push the baked image to the container registry. Make sure you have acquired a Hugging Face API key. If you are using a gated model, access must be requested before building the image.

```bash
az acr build \
  --registry <ACR_NAME> \
  --image llama3-vllm-fat:8b-instruct \
  --build-arg HF_TOKEN="$HF_TOKEN" \
  .
```
Note: When you use the “az acr build” command instead of running docker build yourself, it automatically tags your image and pushes it to the Azure Container Registry.

Once this container image is available in the container registry, we can create a simple pod and an internal load balancer to expose the service endpoint to the user. The detailed instructions and code are available here: https://github.com/mocelj/aks-air-gap-vllm-deployment.

You can test the deployment by querying the external IP of the deployed service and interacting with the endpoint in an OpenAI-compatible way. Note that since this is an isolated cluster, you need to connect to the cluster’s network via VPN or run the curl command in a pod inside the cluster; see aks_isolated.sh in the mocelj/aks-air-gap-vllm-deployment repository for more details about how to set up a point-to-site VPN in Azure. Here you can see how to get the service IP address and query the chat completions API:

```bash
# Get the Service IP
svc_ip=$(kubectl get svc vllm-llama3-8b -o jsonpath='{.status.loadBalancer.ingress[0].ip}')

# Send a chat completion request to the service
curl -X POST "http://${svc_ip}:8000/v1/chat/completions" \
  -H "accept: application/json" \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [
      {"role": "system", "content": "You are a polite and respectful chatbot."},
      {"role": "user", "content": "Where should I go for lunch close to the Microsoft office in Pratteln?"}
    ],
    "model": "meta/llama3-8b-instruct",
    "max_tokens": 512,
    "top_p": 1,
    "n": 1,
    "stream": false,
    "frequency_penalty": 0.0
  }'
```

Scenario 2: Using a Shared File System for Model Artifacts

In the second scenario, where we download the model weights and other artifacts used by the NVIDIA NIM to a shared NFS drive, we must follow a slightly different strategy.
Figure 5: Pre-download model weights to a private file share

For simplicity, we have used a virtual machine capable of downloading artifacts from the Internet and reaching the internal container registry as well as the shared NFS volume deployed in a virtual network. In this example, we have created a simple NFS share using Azure Files (see the Azure CLI code to create the share and the endpoint here: https://github.com/mocelj/aks-air-gap-vllm-deployment/blob/main/aks_isolated.sh#L352). For large-scale inferencing scenarios, you might want to consider other storage options to ensure reasonable startup times, given that the weights can be of considerable size.

To facilitate model deployment on NVIDIA A100 GPUs, we have provisioned a jump box equipped with the same GPU type. If you start with a fresh virtual machine, you may need to install the appropriate GPU driver as well as NVIDIA’s container runtime (https://developer.nvidia.com/container-runtime). Alternatively, you can deploy a Linux virtual machine using Data Science Virtual Machine (DSVM) Linux images, where only the container runtime needs to be added.

To download the container image from nvcr.io, we can leverage the caching rules in the container registry and pull the image via our connected container registry. Once the container image is pulled locally, we can download the model profile with the appropriate artifacts and copy everything to the shared folder, e.g. by using rsync. The artifacts can be downloaded by using the Utilities for NVIDIA NIM for LLMs: https://docs.nvidia.com/nim/large-language-models/latest/utilities.html

```bash
docker run --rm \
  --runtime=nvidia \
  --gpus all \
  -v "$LOCAL_NIM_CACHE":/opt/nim/.cache \
  -u "$(id -u)" \
  -e NGC_API_KEY \
  "$TARGET_IMAGE" \
  download-to-cache
```

Important: The NVIDIA API key must not be included in the AKS deployment manifests, as otherwise it would trigger outbound network calls that will fail in air-gapped environments.
The key is only required on the jump box during the download of the model artifacts.

Since this shared folder is reachable from both the jump box network and the isolated, air-gapped AKS cluster, the only thing we must do is point the NVIDIA NIM container to the model weights in the shared folder, so that it does not try to download them when the pod starts. It is important to note that the NVIDIA API key should not be part of the deployment script, since otherwise it will trigger an outbound connection to pull an image from the nvcr.io registry, which will fail in an air-gapped environment.

The service can be tested in a similar way as before. First, we need to find out the external service IP, which we get via “kubectl get svc vllm-nim-llama3-service -o wide”. We can then interact with the service in the same way as before, using the new service IP address. A more detailed description of the implementation steps can be found in the attached repository: https://github.com/mocelj/aks-air-gap-vllm-deployment.

Summary

This document presents a practical guide for deploying LLM inferencing solutions in isolated Azure Kubernetes Service (AKS) clusters. It outlines two deployment approaches: one where model weights and artifacts are pre-downloaded and stored in a shared folder accessible by both the jump box and cluster, and another using a shared NFS drive for storing downloaded resources. Both strategies enable secure, air-gapped deployments without relying on outbound internet access. For step-by-step instructions and further technical details, consult the referenced GitHub repository: https://github.com/mocelj/aks-air-gap-vllm-deployment. For large-scale or production deployments, more performant storage options, such as local NVMe or other high-throughput solutions, can be explored; the services used in this guide are intentionally chosen to maximize clarity and reproducibility.

Microsoft at NVIDIA GTC 2026
Microsoft returns to NVIDIA GTC 2026 in San Jose with a strong presence across conference sessions, in-booth theater talks, live demos, and executive-level ancillary events. Together with NVIDIA and our partner ecosystem, Microsoft is showcasing how Azure AI infrastructure enables AI training, inference, and production at global scale. Visit us at Booth #521 to see the latest innovations in action and connect with Azure and NVIDIA experts.

Exclusive GTC Experiences

LEGO® Datacenter Model: Explore Azure AI infrastructure at the Park Container.
Candy Lounge: Visit the high-traffic candy wall for co-branded treats all day long.
Networking Lounge: Relax and recharge with comfy seating and vital charging options.
Outdoor Juice Truck: Free, refreshing beverages served during outdoor park hours.

Sponsored Breakout Sessions

Microsoft Featured
Reinventing Semiconductor Design with Microsoft Discovery
S82398 · Mon, Mar 16 · 4:00 PM
Prashant Varshney (Microsoft · Semiconductor & AI Engineering)
Abstract: Semiconductor teams face exploding design complexity and shrinking verification windows. This session shows how the Microsoft Discovery AI for Science platform, combined with Synopsys Agent Engineers, introduces an agentic approach to EDA that automates routine steps and accelerates expert decision-making on Azure.

Microsoft Featured
Operationalizing Agentic AI at Hyperscale
S82399 · Tue, Mar 17 · 1:00 PM
Nitin Nagarkatte (Microsoft · Azure AI Infrastructure), Anand Raman (Microsoft · Azure AI), Vipul Modi (Microsoft · AI Systems)
Abstract: As enterprises move to agentic systems, the challenge shifts to operating intelligent agents reliably at scale. This session demonstrates how Microsoft builds AI Factories on Azure using NVIDIA technology and explores Microsoft Foundry as the control plane for deploying and operating coordinated AI agents.
Live from GTC: AI Podcast

Live Special Feature: A conversation with Microsoft Azure
Dayan Rodriguez, Corporate Vice President, Global Manufacturing and Mobility
Alistair Spiers, General Manager, Azure Infrastructure
Listen & Subscribe: aka.ms/GTC2026Podcast

Earned Conference Sessions

Don't miss these high-impact sessions where Microsoft and NVIDIA leaders discuss the future of AI factories and infrastructure.

Mon · Mar 16 · 5:00 PM · Drive Optimal Tokens per Watt on AI Infrastructure Using Benchmarking Recipes. Speakers: Paul Edwards, Emily Potyraj (Microsoft, NVIDIA)
Tue · Mar 17 · 9:00 AM · Autonomous AI Factories: Technical Preview of Agent-Native Production. Speakers: JP Vasseur, César Martinez Spessot (NVIDIA, Microsoft Research)
Tue · Mar 17 · 4:00 PM · The Road to Intelligent Mobility: Vehicle GenAI. Speakers: Raj Paul, Thomas Evans, Bryan Goodman (Microsoft, NVIDIA, Bosch)
Wed · Mar 18 · 9:00 AM · Supercharging AI with Multi-Gigawatt AI Factories. Speakers: Gilad Shainer, Peter Salanki, Evan Burness (NVIDIA, CoreWeave, Meta, Microsoft)

Daily Booth Theater Schedule

Visit the Microsoft Theater for lightning talks from engineering leaders and partners.

Monday, March 16
2:00 PM · BTH208 · NVIDIA · Accelerate AI Innovation on Azure with NVIDIA Run:ai (Rob Magno)
2:30 PM · BTH202 · General Robotics · Models to Machines: Deploying Agentic AI in Real-World Robotics (Dinesh Narayanan)
3:00 PM · BTH200 · Fractal Analytics · From Generalist to Enterprise-Ready: Fractal Builds Domain AI (C. Chaudhuri)
3:30 PM · BTH109 · Microsoft · Agentic cloud ops - Smarter Operations with Azure Copilot (Jyoti Sharma)
4:00 PM · BTH103 · Microsoft · Build a Deep Research Agent for Enterprise Data (D. Casati, A. Slutsky, H. Alkemade)
4:30 PM · BTH205 · NetApp · Azure NetApp Files: Powering Your Data for AI Capabilities (Andy Chan)
5:00 PM · BTH207 · NVIDIA · The Agentic Commerce Stack: Open Models on Azure (Antonio Martinez)
5:30 PM · BTH217 · OPAQUE · Confidential AI on Azure Unlocks Sovereign AI at Scale (Aaron Fulkerson)
6:00 PM · BTH218 · Simplismart · Making BYOC work at scale with modular inference (Amritanshu Jain)
6:30 PM · Expo Reception

Tuesday, March 17
1:30 PM · BTH100 · Microsoft · From Open Weights to Enterprise Scale: Open-Source Models (Sharmila Chockalingam)
2:00 PM · BTH212 · Personal AI · Unlocking the power of memory in Teams with Personal AI (Sam Harkness)
2:30 PM · BTH111 · Microsoft / NVIDIA · Scalable LLM Inference on AKS Using NVIDIA Dynamo (Mohamad Al jazaery, Anton Slutsky)
3:00 PM · BTH204 · Mistral AI · Innovate with Mistral AI on Microsoft Foundry (Ian Mathew)
3:30 PM · BTH104 · Microsoft · GPU-Accelerated CFD at Scale: Star-CCM+ on Azure (Jason Scheffelmaer)
4:00 PM · BTH206 · NeuBird AI · Agentic AI for Incident Response on Microsoft Azure (Grant Griffiths)
4:30 PM · BTH101 · GitHub · Agentic DevOps: Evolving software with GitHub Copilot (Glenn Wester)
5:00 PM · BTH209 · Rescale · Real-World AI Physics: GM & NVIDIA on Rescale (Dinal Perera)
5:30 PM · BTH107 · Microsoft · Intro to LoRA Fine-Tuning on Azure (Christin Pohl)
6:30 PM · Raffle

Wednesday, March 18
1:00 PM · BTH219 · VAST Data · Scaling AI Infrastructure on Azure with VAST Data (Jason Vallery)
1:30 PM · BTH110 · Microsoft · Physical AI and Robotics: The Next Frontier (F. Miller, C. Souche, D. Narayanan)
2:00 PM · BTH105 · Microsoft · Sovereign AI options with Azure Local (Kim Lam)
2:30 PM · BTH108 · Microsoft · Automating HPC Workflows with Copilot Agents (Param Shah)
3:00 PM · BTH102 · Microsoft · Trustworthy Multi-Agent Workflows with Microsoft Foundry (Brian Benz)
4:00 PM · BTH106 · Microsoft · Scaling Enterprise AI on ARO with NVIDIA H100 & H200 (Lachie Evenson)
4:30 PM · BTH211 · WEKA · Hybrid AI Data Orchestration with WEKA NeuralMesh™ (Desiree Campbell)
5:00 PM · BTH202 · Hammerspace · NVIDIA AI Enterprise Software with NIM (Mike Bloom)
5:30 PM · BTH203 · Kinaxis · Reimagining Global Supply Planning with Azure (Dane Henshall)
6:00 PM · BTH214 · AT&T · Connected AI on Azure for Manufacturing (Brad Pritchett)
6:30 PM · Raffle

Thursday, March 19
11:00 AM · BTH210 · Wandelbots · Physical AI: Powering Software-Defined Automation in Robotics (Marwin Kunz, Martin George)
11:30 AM · Raffle

Explore Our Demo Pods

Visit the Microsoft booth to see our technology in action with live demonstrations across four dedicated pod areas.

POD 1 · Azure AI Infrastructure: End-to-end AI infrastructure for training and inference at scale, featuring the latest NVIDIA GPU integrations on Azure.
POD 2 · Microsoft Foundry: Our comprehensive platform for building, deploying, and operating agentic AI systems with enterprise reliability.
POD 3 · Building AI Together: Showcasing joint Microsoft and NVIDIA solutions across diverse industries, from manufacturing to retail.
POD 4 · Startups Powering AI: Discover how innovative startups are running next-generation AI workloads on the Azure platform.

Ancillary Events & Networking

Join Microsoft leadership and our partner ecosystem at these curated networking experiences.

Sun · Mar 15 · 6:00 PM · Microsoft for Startups Executive Leadership Dinner · 📍 Morton’s Steakhouse, San Jose. Exclusive gathering for startup leaders and Microsoft executives.
Mon · Mar 16 · 1:30 PM · Microsoft × NVIDIA Open Meet · 📍 Signia by Hilton · International Suite. Strategic alignment session for Microsoft and NVIDIA executives.
Mon · Mar 16 · 7:30 PM · Microsoft + NVIDIA Executive Dinner · 📍 Il Fornaio, San Jose. Executive dinner for key customers and leadership teams.
Tue · Mar 17 · 11:00 AM to 1:00 PM · Microsoft AI Luncheon: Research, Robotics, & Real-World AI · 📍 Signia by Hilton · International Suite. Invite-only: A curated executive lunch exploring the journey from AI research to physical enterprise deployments in robotics and manufacturing.
Tue · Mar 17 · 7:30 PM · Networking in AI & Tech · 📍 San Pedro Square Market. Community networking mixer for Microsoft teams, partners, and customers.
Wed · Mar 18 · 10:00 AM to 1:00 PM · AI Innovator’s Circle Brunch: Powering Intelligent Systems Across the Ecosystem · 📍 Il Fornaio, San Jose. Hosted by Microsoft & NVIDIA at GTC. Join us for an exclusive brunch and discussion on the intelligent ecosystem.

Azure Recognized as an NVIDIA Cloud Exemplar, Setting the Bar for AI Performance in the Cloud
As AI models continue to scale in size and complexity, cloud infrastructure must deliver more than theoretical peak performance. What matters in practice is reliable, end-to-end, workload-level AI performance, where compute, networking, system software, and optimization work together to deliver predictable, repeatable results at scale. This directly translates to business value: efficient full-stack infrastructure accelerates time-to-market, maximizes ROI on GPU and cloud investments, and enables organizations to scale AI from proof-of-concept to revenue-generating products with predictable economics.

Today, Microsoft is proud to share an important milestone in partnership with NVIDIA: Azure has been validated as an NVIDIA Exemplar Cloud, becoming the first cloud provider recognized for Exemplar-class AI performance aligned with GB300-class (Blackwell generation) systems. This recognition builds on Azure’s previously validated Exemplar status for H100 training workloads and reflects NVIDIA’s confidence in Azure’s ability to extend that rigor and performance discipline into the next generation of AI platforms.

What Is NVIDIA Exemplar Cloud?

The NVIDIA Exemplar Cloud initiative celebrates cloud platforms that demonstrate robust end-to-end AI workload performance using NVIDIA’s Performance Benchmarking suite. Rather than relying on synthetic microbenchmarks, Performance Benchmarking evaluates real AI training workloads using:

- Large-scale LLM training scenarios
- Production-grade software stacks
- Optimized system and network configurations
- Workload-centric metrics such as throughput and time-to-train

Achieving Exemplar validation signals that a provider can consistently deliver world-class AI performance in the cloud, showcasing that end users are getting optimal performance value by default.
Proven Exemplar Validation on H100

Azure’s Exemplar Cloud journey began with publicly shared benchmarking results for H100-based training workloads, where Azure ND GPU clusters demonstrated exemplar performance using NVIDIA Performance Benchmarking recipes. Those results, published previously and validated through NVIDIA’s benchmarking framework, established a proven foundation of end-to-end AI performance for large-scale, production workloads running on Azure today.

Extending Exemplar-Class AI Performance to GB300-Class Platforms

Building on the rigor and learnings from H100 validation, Microsoft has now been recognized by NVIDIA as the first cloud provider to achieve Exemplar-class performance and readiness aligned with GB300-class systems. This designation reflects NVIDIA’s assessment that the same principles applied to H100, including end-to-end system tuning, networking optimization, and software alignment, are being successfully carried forward into the Blackwell generation.

Rather than treating GB300 as a point solution, Azure approaches it as a continuation of a proven performance model: delivering consistent world-class AI performance in the cloud while preserving the flexibility, elasticity, and global scale customers expect.
What Enables Exemplar-Class AI Performance on Azure

Delivering Exemplar-class AI performance requires optimization across the full AI stack:

Infrastructure and Networking
- High-performance Azure ND GPU clusters with NVIDIA InfiniBand
- NUMA-aware CPU, GPU, and NIC alignment to minimize latency
- Tuned NCCL communication paths for efficient multi-GPU scaling

Software and System Optimization
- Tight integration with NVIDIA software, including Performance Benchmarking recipes and NVIDIA AI Enterprise
- Parallelism strategies aligned with large-scale LLM training
- Continuous tuning as models, workloads, and system architectures evolve

End-to-End Workload Focus
- Measuring real training performance, not isolated component metrics
- Driving repeatable improvements in application-level throughput and efficiency
- Closing the performance gap between cloud and on-premises systems without sacrificing manageability

Together, these capabilities enabled Azure to deliver consistent Exemplar-class AI performance across generations of NVIDIA platforms.

What This Means for Customers

For customers training and deploying advanced AI models, this milestone delivers clear benefits:

- World-class AI performance in a fully managed cloud environment
- Predictable scaling from small clusters to thousands of GPUs
- Faster time to train and improved performance per dollar
- Confidence that Azure is ready for Blackwell-class and GB300-class AI workloads

As AI workloads become more complex and reasoning-heavy, infrastructure performance increasingly determines outcomes. Azure’s NVIDIA Cloud Exemplar recognition provides a clear signal: customers can build and scale next-generation AI systems on Azure without compromising on performance.

Learn More

DGX Cloud Benchmarking on Azure (Microsoft Community Hub)