Azure previews powerful and scalable virtual machine series to accelerate generative AI


Written by Matt Vegas, Principal Product Manager, Azure HPC+AI


Delivering on the promise of advanced AI for our customers requires supercomputing infrastructure, services, and expertise to address the exponentially increasing size and complexity of the latest models. At Microsoft, we are meeting this challenge by applying a decade of experience in supercomputing and supporting the largest AI training workloads to create AI infrastructure capable of massive performance at scale. The Microsoft Azure cloud, and specifically our graphics processing unit (GPU) accelerated virtual machines (VMs), provide the foundation for many generative AI advancements from both Microsoft and our customers.


“Co-designing supercomputers with Azure has been crucial for scaling our demanding AI training needs, making our research and alignment work on systems like ChatGPT possible.”—Greg Brockman, President and Co-Founder of OpenAI.


Azure's most powerful and massively scalable AI virtual machine series

Today, Microsoft is introducing the ND H100 v5 VM, available on demand in sizes ranging from eight to thousands of NVIDIA H100 GPUs interconnected by NVIDIA Quantum-2 InfiniBand networking. Customers will see significantly faster performance for AI models than with our last-generation ND A100 v4 VMs, thanks to innovative technologies such as:


  • 8x NVIDIA H100 Tensor Core GPUs interconnected via next-gen NVSwitch and NVLink 4.0
  • 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand per GPU, with 3.2 Tb/s per VM in a non-blocking fat-tree network
  • NVSwitch and NVLink 4.0 with 3.6 TB/s bisection bandwidth among the 8 local GPUs within each VM
  • 4th Gen Intel Xeon Scalable processors
  • PCIe Gen5 host-to-GPU interconnect with 64 GB/s bandwidth per GPU
  • 16 channels of 4800 MHz DDR5 DIMMs
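
The per-VM figures above follow directly from the per-GPU numbers. As a minimal sketch (the helper function below is illustrative, not part of any Azure API; only the GPU count and per-GPU bandwidths come from the spec list above), the aggregate InfiniBand bandwidth per VM can be checked like this:

```python
# Sanity-check the per-VM bandwidth figures for the ND H100 v5 series.
# Constants are taken from the spec list above; the function name is
# a hypothetical helper for illustration only.

GPUS_PER_VM = 8
IB_PER_GPU_GBPS = 400  # NVIDIA Quantum-2 CX7 InfiniBand, Gb/s per GPU


def vm_aggregate_ib_tbps(gpus: int = GPUS_PER_VM,
                         per_gpu_gbps: int = IB_PER_GPU_GBPS) -> float:
    """Aggregate InfiniBand bandwidth per VM in Tb/s."""
    return gpus * per_gpu_gbps / 1000  # Gb/s -> Tb/s


print(vm_aggregate_ib_tbps())  # 3.2, matching the 3.2 Tb/s per VM above
```

Eight GPUs at 400 Gb/s each yields the 3.2 Tb/s per-VM figure quoted in the list.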

