TL;DR: Azure Machine Learning now offers ND H200 v5 VMs accelerated by NVIDIA H200 Tensor Core GPUs, purpose‑built to train and serve modern generative AI more efficiently at cloud scale. With massive on‑GPU memory and high intra‑node bandwidth, you can fit larger models and batches, keep tensors local, and cut cross‑GPU transfers - doing more with fewer nodes. Start with a single VM or scale out to hundreds in a managed cluster to capture cloud economics, while Azure’s AI‑optimized infrastructure delivers consistent performance across training and inference.
Why this matters
The AI stack is evolving with bigger parameter counts, longer context windows, multimodal pipelines, and production-scale inference. ND H200 v5 on Azure ML is designed to address these needs with a memory-first, network-optimized, and workflow-friendly approach, enabling data science and MLOps teams to move from experiment to production efficiently.
Memory, the real superpower
At the heart of each ND H200 v5 VM are eight NVIDIA H200 GPUs, each with 141 GB of HBM3e memory - a 76% increase in HBM capacity over the H100. That means each GPU can hold larger models and process more tokens per step. Aggregated across all eight GPUs, that's a massive 1,128 GB of GPU memory per VM (a quick weights-only sizing sketch follows the list below).
- HBM3e throughput: 4.8 TB/s per GPU ensures continuous data flow, preventing compute starvation.
- Larger models with fewer compromises: Accommodate wider context windows, larger batch sizes, deeper expert mixtures, or higher-resolution vision tokens without needing aggressive sharding or offloading techniques.
- Improved scaling: Increased on-GPU memory reduces cross-device communication and enhances step-time stability.
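To put that capacity in perspective, here is a rough, weights-only sizing sketch in Python. The 70B-parameter example and the bytes-per-parameter values are illustrative assumptions; real footprints also include activations, optimizer state, and (for inference) the KV cache.

# Back-of-the-envelope check: do a model's weights fit in H200 HBM?
# Illustrative only - activations, optimizer state, and KV cache add more.
HBM_PER_GPU_GB = 141      # H200 HBM3e capacity per GPU
GPUS_PER_VM = 8           # ND H200 v5 has eight GPUs (1,128 GB total)
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    # Billions of parameters x bytes per parameter ~= gigabytes of weights.
    return params_billion * BYTES_PER_PARAM[dtype]

for dtype in ("bf16", "fp8"):
    gb = weight_footprint_gb(70, dtype)   # hypothetical 70B-parameter model
    share = gb / (HBM_PER_GPU_GB * GPUS_PER_VM)
    print(f"70B weights in {dtype}: ~{gb:.0f} GB ({share:.0%} of one VM's HBM)")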
Built to scale - within a VM and across the cluster
When training across multiple GPUs, communication speed is crucial.
- Inside the VM: Eight NVIDIA H200 GPUs are linked via NVIDIA NVLink, delivering 900 GB/s of bidirectional bandwidth per GPU for ultra-fast all-reduce and model-parallel operations with minimal synchronization overhead.
- Across VMs: Each instance comes with eight 400 Gb/s NVIDIA ConnectX-7 InfiniBand adapters connecting to NVIDIA Quantum-2 InfiniBand switches, totaling 3.2 Tb/s interconnect per VM.
- GPUDirect RDMA: Enables data to move GPU-to-GPU across nodes with lower latency and lower CPU overhead, which is essential for distributed data/model/sequence parallelism.
The result is near-linear scaling characteristics for many large-model training and fine-tuning workloads.
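As a concrete illustration of that topology in action, here is a minimal PyTorch/NCCL all-reduce sketch; launch it with torchrun (for example, torchrun --nproc_per_node=8 allreduce_check.py on a single ND H200 v5 VM). The script name and tensor size are placeholders, and in real training your framework issues these collectives for you.

# allreduce_check.py - minimal NCCL collective across all local GPUs.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")     # NCCL uses NVLink intra-node, InfiniBand inter-node
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)    # gradient-style sum across all ranks
    if dist.get_rank() == 0:
        print("all-reduce result (first element):", x[0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()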
Built into Azure ML workflows (no friction)
Azure Machine Learning integrates ND H200 v5 with the tools your teams already use:
- Frameworks: PyTorch, TensorFlow, JAX, and more
- Containers: Optimized Docker images available via Azure Container Registry
- Distributed training: NVIDIA NCCL fully supported to maximize performance of NVLink and InfiniBand
Bring your existing training scripts, launch distributed runs, and integrate into pipelines, registries, managed endpoints, and MLOps with minimal change.
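As a sketch of what that looks like in practice, the snippet below submits a distributed PyTorch job with the Azure ML Python SDK v2. The workspace identifiers, environment name, training script, and cluster name are placeholders (the cluster name matches the CLI example later in this post); adapt them to your own workspace.

# Sketch: submit a multi-node PyTorch job to an ND H200 v5 cluster (SDK v2).
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                                  # folder containing your train.py
    command="python train.py",
    environment="<your-pytorch-gpu-environment>",  # curated or custom PyTorch image
    compute="h200-training-cluster",               # ND H200 v5 cluster name
    instance_count=2,                              # two VMs = 16 GPUs
    distribution={"type": "PyTorch", "process_count_per_instance": 8},  # one process per GPU
)

ml_client.jobs.create_or_update(job)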
Real-world gains
Early benchmarks show up to 35% throughput improvements for large language model inference compared to the previous generation, particularly on models like Llama 3.1 405B. The increased HBM capacity allows for larger inference batches, improving utilization and cost efficiency. For training, the combination of additional memory and higher bandwidth supports larger models or more data per step, often reducing overall training time.
Your mileage will vary by model architecture, precision, parallelism strategy, and data loader efficiency - but the headroom is real.
Quick spec snapshot
- GPUs: 8× NVIDIA H200 Tensor Core GPUs
- HBM3e: 141 GB per GPU (1,128 GB per VM)
- HBM bandwidth: 4.8 TB/s per GPU
- Inter-GPU: NVIDIA NVLink 900 GB/s (intra-VM)
- Host: 96 vCPUs (Intel Xeon Sapphire Rapids), 1,850 GiB RAM
- Local storage: 28 TB NVMe SSD
- Networking: 8× 400 Gb/s NVIDIA ConnectX-7 InfiniBand adapters (3.2 Tb/s total) with GPUDirect RDMA
Getting started (it’s just a CLI away)
Create an auto-scaling compute cluster in Azure ML:
az ml compute create \
--name h200-training-cluster \
--size Standard_ND96isr_H200_v5 \
--min-instances 0 \
--max-instances 8 \
--type amlcompute
Auto-scaling means you only pay for what you use - perfect for research bursts, scheduled training, and production inference with variable demand.
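If you prefer the Python SDK v2 over the CLI, a roughly equivalent cluster definition looks like this; the workspace identifiers are placeholders and the idle scale-down window is just an example value.

# Sketch: the same auto-scaling cluster via the Azure ML Python SDK v2.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

cluster = AmlCompute(
    name="h200-training-cluster",
    size="Standard_ND96isr_H200_v5",
    min_instances=0,                    # scale to zero when idle
    max_instances=8,
    idle_time_before_scale_down=1800,   # seconds before releasing idle nodes
)
ml_client.compute.begin_create_or_update(cluster).result()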
What you can do now
- Train foundation models with larger batch sizes and longer sequences
- Fine-tune LLMs with fewer memory workarounds, reducing the need for offloading and resharding
- Deploy high-throughput inference for chat, RAG, MoE, and multimodal use cases
- Accelerate scientific and simulation workloads that require high bandwidth + memory
Pro tips to unlock performance
- Optimize HBM usage: Increase batch size/sequence length until you approach the 141 GB of HBM per GPU; the roughly 4.8 TB/s of per-GPU bandwidth keeps those larger batches fed.
- Utilize parallelism effectively: Combine tensor/model parallel (NVLink-aware) with data parallelism across nodes (InfiniBand + GPUDirect RDMA).
- Optimize your input pipeline: Parallelize tokenization/augmentation, and store frequently accessed data on local NVMe to prevent GPU stalls (see the sketch after this list).
- Leverage NCCL: Configure your communication backend to take advantage of the topology, using NVLink intra-node and InfiniBand inter-node.
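As a minimal sketch of the input-pipeline tip above, the loader below uses parallel workers, pinned memory, and prefetching; the dataset class, NVMe path, and sizes are illustrative placeholders.

# Sketch: keep the GPUs fed with a parallel, NVMe-backed input pipeline.
import torch
from torch.utils.data import DataLoader, Dataset

class TokenizedShards(Dataset):
    # Hypothetical dataset over pre-tokenized shards staged on local NVMe.
    def __init__(self, path="/mnt/nvme/tokenized"):  # placeholder path
        self.path = path
        self.num_samples = 1_000_000                 # placeholder size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In practice: memory-map a shard and slice out a sequence.
        return torch.randint(0, 32_000, (4096,))     # fake 4K-token sample

loader = DataLoader(
    TokenizedShards(),
    batch_size=8,
    num_workers=8,            # parallelize tokenization/augmentation off the GPU
    pin_memory=True,          # faster host-to-device copies
    prefetch_factor=4,        # keep batches queued ahead of the GPU
    persistent_workers=True,  # avoid re-spawning workers every epoch
)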
The bottom line
This is more than a hardware bump - it’s a platform designed for the next wave of AI. With ND H200 v5 on Azure ML, you gain the memory capacity, network throughput, and operational simplicity needed to transform ambitious models into production-grade systems.
For comprehensive technical specifications and deployment guidance, visit the official ND H200 v5 documentation and explore our detailed announcement blog for additional insights and use cases.