TL;DR: Azure Machine Learning now offers ND H200 v5 VMs accelerated by NVIDIA H200 Tensor Core GPUs, purpose‑built to train and serve modern generative AI more efficiently at cloud scale. With massive on‑GPU memory and high intra‑node bandwidth, you can fit larger models and batches, keep tensors local, and cut cross‑GPU transfers - doing more with fewer nodes. Start with a single VM or scale out to hundreds in a managed cluster to capture cloud economics, while Azure’s AI‑optimized infrastructure delivers consistent performance across training and inference.
Why this matters
The AI stack is evolving with bigger parameter counts, longer context windows, multimodal pipelines, and production-scale inference. ND H200 v5 on Azure ML is designed to address these needs with a memory-first, network-optimized, and workflow-friendly approach, enabling data science and MLOps teams to move from experiment to production efficiently.
Memory, the real superpower
At the heart of each ND H200 v5 VM are eight NVIDIA H200 GPUs, each with 141 GB of HBM3e memory - a 76% increase in HBM capacity over the H100. That means each GPU can hold larger models and process more tokens per step. Aggregated across all eight GPUs, that's a massive 1,128 GB of GPU memory per VM (a quick weights-only sizing sketch follows the list below).
- HBM3e throughput: 4.8 TB/s per GPU ensures continuous data flow, preventing compute starvation.
- Larger models with fewer compromises: Accommodate wider context windows, larger batch sizes, deeper expert mixtures, or higher-resolution vision tokens without needing aggressive sharding or offloading techniques.
- Improved scaling: Increased on-GPU memory reduces cross-device communication and enhances step-time stability.
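To put that capacity in perspective, here is a rough, weights-only sizing sketch in Python. The 70B-parameter example and the bytes-per-parameter values are illustrative assumptions; real footprints also include activations, optimizer state, and (for inference) the KV cache.

# Back-of-the-envelope check: do a model's weights fit in H200 HBM?
# Illustrative only - activations, optimizer state, and KV cache add more.
HBM_PER_GPU_GB = 141      # H200 HBM3e capacity per GPU
GPUS_PER_VM = 8           # ND H200 v5 has eight GPUs (1,128 GB total)
BYTES_PER_PARAM = {"fp32": 4, "bf16": 2, "fp8": 1}

def weight_footprint_gb(params_billion: float, dtype: str) -> float:
    # Billions of parameters x bytes per parameter ~= gigabytes of weights.
    return params_billion * BYTES_PER_PARAM[dtype]

for dtype in ("bf16", "fp8"):
    gb = weight_footprint_gb(70, dtype)   # hypothetical 70B-parameter model
    share = gb / (HBM_PER_GPU_GB * GPUS_PER_VM)
    print(f"70B weights in {dtype}: ~{gb:.0f} GB ({share:.0%} of one VM's HBM)")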
Built to scale - within a VM and across the cluster
When training across multiple GPUs, communication speed is crucial.
- Inside the VM: Eight NVIDIA H200 GPUs are linked via NVIDIA NVLink, delivering 900 GB/s of bidirectional bandwidth per GPU for ultra-fast all-reduce and model-parallel operations with minimal synchronization overhead.
- Across VMs: Each instance comes with eight 400 Gb/s NVIDIA ConnectX-7 InfiniBand adapters connecting to NVIDIA Quantum-2 InfiniBand switches, totaling 3.2 Tb/s interconnect per VM.
- GPUDirect RDMA: Enables data to move GPU-to-GPU across nodes with lower latency and lower CPU overhead, which is essential for distributed data/model/sequence parallelism.
The result is near-linear scaling characteristics for many large-model training and fine-tuning workloads.
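As a concrete illustration of that topology in action, here is a minimal PyTorch/NCCL all-reduce sketch; launch it with torchrun (for example, torchrun --nproc_per_node=8 allreduce_check.py on a single ND H200 v5 VM). The script name and tensor size are placeholders, and in real training your framework issues these collectives for you.

# allreduce_check.py - minimal NCCL collective across all local GPUs.
import os
import torch
import torch.distributed as dist

def main():
    dist.init_process_group(backend="nccl")     # NCCL uses NVLink intra-node, InfiniBand inter-node
    local_rank = int(os.environ["LOCAL_RANK"])  # set by torchrun
    torch.cuda.set_device(local_rank)

    x = torch.ones(1024, device="cuda") * dist.get_rank()
    dist.all_reduce(x, op=dist.ReduceOp.SUM)    # gradient-style sum across all ranks
    if dist.get_rank() == 0:
        print("all-reduce result (first element):", x[0].item())

    dist.destroy_process_group()

if __name__ == "__main__":
    main()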
Built into Azure ML workflows (no friction)
Azure Machine Learning integrates ND H200 v5 with the tools your teams already use:
- Frameworks: PyTorch, TensorFlow, JAX, and more
- Containers: Optimized Docker images available via Azure Container Registry
- Distributed training: NVIDIA NCCL fully supported to maximize performance of NVLink and InfiniBand
Bring your existing training scripts, launch distributed runs, and integrate into pipelines, registries, managed endpoints, and MLOps with minimal change.
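As a sketch of what that looks like in practice, the snippet below submits a distributed PyTorch job with the Azure ML Python SDK v2. The workspace identifiers, environment name, training script, and cluster name are placeholders (the cluster name matches the CLI example later in this post); adapt them to your own workspace.

# Sketch: submit a multi-node PyTorch job to an ND H200 v5 cluster (SDK v2).
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient, command

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

job = command(
    code="./src",                                  # folder containing your train.py
    command="python train.py",
    environment="<your-pytorch-gpu-environment>",  # curated or custom PyTorch image
    compute="h200-training-cluster",               # ND H200 v5 cluster name
    instance_count=2,                              # two VMs = 16 GPUs
    distribution={"type": "PyTorch", "process_count_per_instance": 8},  # one process per GPU
)

ml_client.jobs.create_or_update(job)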
Real-world gains
Early benchmarks show up to 35% throughput improvements for large language model inference compared to the previous generation, particularly on models like Llama 3.1 405B. The increased HBM capacity allows for larger inference batches, improving utilization and cost efficiency. For training, the combination of additional memory and higher bandwidth supports larger models or more data per step, often reducing overall training time.
Your mileage will vary by model architecture, precision, parallelism strategy, and data loader efficiency - but the headroom is real.
Quick spec snapshot
- GPUs: 8× NVIDIA H200 Tensor Core GPUs
- HBM3e: 141 GB per GPU (1,128 GB per VM)
- HBM bandwidth: 4.8 TB/s per GPU
- Inter-GPU: NVIDIA NVLink 900 GB/s (intra-VM)
- Host: 96 vCPUs (Intel Xeon Sapphire Rapids), 1,850 GiB RAM
- Local storage: 28 TB NVMe SSD
- Networking: 8× 400 Gb/s NVIDIA ConnectX-7 InfiniBand adapters (3.2 Tb/s total) with GPUDirect RDMA
Getting started (it’s just a CLI away)
Create an auto-scaling compute cluster in Azure ML:
az ml compute create \
--name h200-training-cluster \
--size Standard_ND96isr_H200_v5 \
--min-instances 0 \
--max-instances 8 \
--type amlcompute
Auto-scaling means you only pay for what you use - perfect for research bursts, scheduled training, and production inference with variable demand.
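If you prefer the Python SDK v2 over the CLI, a roughly equivalent cluster definition looks like this; the workspace identifiers are placeholders and the idle scale-down window is just an example value.

# Sketch: the same auto-scaling cluster via the Azure ML Python SDK v2.
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

cluster = AmlCompute(
    name="h200-training-cluster",
    size="Standard_ND96isr_H200_v5",
    min_instances=0,                    # scale to zero when idle
    max_instances=8,
    idle_time_before_scale_down=1800,   # seconds before releasing idle nodes
)
ml_client.compute.begin_create_or_update(cluster).result()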
What you can do now
- Train foundation models with larger batch sizes and longer sequences
- Fine-tune LLMs with fewer memory workarounds, reducing the need for offloading and resharding
- Deploy high-throughput inference for chat, RAG, MoE, and multimodal use cases
- Accelerate scientific and simulation workloads that require high bandwidth + memory
Pro tips to unlock performance
- Optimize HBM usage: Increase batch size/sequence length until you approach the 141 GB of HBM per GPU; the roughly 4.8 TB/s of per-GPU bandwidth keeps those larger batches fed.
- Utilize parallelism effectively: Combine tensor/model parallel (NVLink-aware) with data parallelism across nodes (InfiniBand + GPUDirect RDMA).
- Optimize your input pipeline: Parallelize tokenization/augmentation, and store frequently accessed data on local NVMe to prevent GPU stalls (see the sketch after this list).
- Leverage NCCL: Configure your communication backend to take advantage of the topology, using NVLink intra-node and InfiniBand inter-node.
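As a minimal sketch of the input-pipeline tip above, the loader below uses parallel workers, pinned memory, and prefetching; the dataset class, NVMe path, and sizes are illustrative placeholders.

# Sketch: keep the GPUs fed with a parallel, NVMe-backed input pipeline.
import torch
from torch.utils.data import DataLoader, Dataset

class TokenizedShards(Dataset):
    # Hypothetical dataset over pre-tokenized shards staged on local NVMe.
    def __init__(self, path="/mnt/nvme/tokenized"):  # placeholder path
        self.path = path
        self.num_samples = 1_000_000                 # placeholder size

    def __len__(self):
        return self.num_samples

    def __getitem__(self, idx):
        # In practice: memory-map a shard and slice out a sequence.
        return torch.randint(0, 32_000, (4096,))     # fake 4K-token sample

loader = DataLoader(
    TokenizedShards(),
    batch_size=8,
    num_workers=8,            # parallelize tokenization/augmentation off the GPU
    pin_memory=True,          # faster host-to-device copies
    prefetch_factor=4,        # keep batches queued ahead of the GPU
    persistent_workers=True,  # avoid re-spawning workers every epoch
)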
The bottom line
This is more than a hardware bump - it’s a platform designed for the next wave of AI. With ND H200 v5 on Azure ML, you gain the memory capacity, network throughput, and operational simplicity needed to transform ambitious models into production-grade systems.
For comprehensive technical specifications and deployment guidance, visit the official ND H200 v5 documentation and explore our detailed announcement blog for additional insights and use cases.