Azure High Performance Computing (HPC) Blog

Introducing the new Azure AI infrastructure VM series ND MI300X v5

MarcCharest
May 21, 2024

Industry-leading high-bandwidth memory (HBM) capacity and bandwidth targeting generative inferencing and AI training

Artificial intelligence is transforming every industry and creating new opportunities for innovation and growth. On top of this, AI models are continually advancing and becoming more complex and accurate. More powerful computers with purpose-built AI accelerators that have resources like high bandwidth memory (HBM), specialized data formats, and exceptional compute performance are needed to fuel these technological advances.


To meet this need, Azure is proud to be the first cloud provider to offer general availability of the new ND MI300X v5 virtual machine (VM) series, based on AMD's latest Instinct GPU, the MI300X. This new VM series is the first cloud offering of its kind and is designed to deliver the highest high bandwidth memory (HBM) capacity of any available VM at industry-leading speeds, letting customers serve larger models faster and with fewer GPUs.

 

Unmatched infrastructure optimized at every layer to deliver performance, efficiency, and scalability

These new ND MI300X VMs are the product of a long collaboration with AMD to build powerful cloud systems for AI on open-source software. This collaboration includes optimizations across the entire hardware and software stack. For example, these new VMs are powered by 8x AMD MI300X GPUs, each VM with 1.5 TB of high bandwidth memory (HBM) and 5.3 TB/s of HBM bandwidth. HBM is essential for AI applications because of its high bandwidth, low power consumption, and compact size, making it ideal for workloads that must process vast amounts of data quickly. The result is a VM with industry-leading performance, HBM capacity, and HBM bandwidth, enabling you to fit larger models in GPU memory and/or use fewer GPUs. In the end, you save power, cost, and time-to-solution.
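The memory headroom is easy to see with a back-of-envelope calculation. The sketch below is a rough estimate that counts only fp16 weights and ignores KV cache, activations, and optimizer state; the parameter counts chosen are illustrative assumptions, not models from this post:

```python
# Back-of-envelope sizing: how much memory a model's weights need,
# compared against a single ND MI300X v5 VM's 1.5 TB of HBM.

HBM_PER_VM_TB = 1.5        # 8 GPUs per VM, 1.5 TB of HBM total
BYTES_PER_PARAM_FP16 = 2   # fp16/bf16 weights: 2 bytes per parameter

def weights_tb(params_billion: float, bytes_per_param: int = BYTES_PER_PARAM_FP16) -> float:
    """Approximate weight footprint in terabytes (weights only)."""
    return params_billion * 1e9 * bytes_per_param / 1e12

for params in (70, 180, 500):
    size = weights_tb(params)
    fits = "fits" if size <= HBM_PER_VM_TB else "does not fit"
    print(f"{params}B params @ fp16 -> {size:.2f} TB of weights ({fits} in one VM)")
```

Even a 500-billion-parameter model's fp16 weights (about 1.0 TB) fit within a single VM's HBM by this measure, which is the sense in which larger models need fewer GPUs here.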

 

On the software side, the ND MI300X VMs use the AMD ROCm open-source software platform, which provides a comprehensive set of tools and libraries for AI development and deployment. The ROCm platform supports popular frameworks such as TensorFlow and PyTorch, as well as Microsoft libraries for AI acceleration like ONNX Runtime, DeepSpeed, and MSCCL. The ROCm platform also enables seamless porting of models and solutions from one platform to another, lowering your engineering costs and speeding up time to market for your AI solutions.

 

For customers looking to scale out efficiently to thousands of GPUs, it’s as simple as using ND MI300X v5 VMs with a standard Azure Virtual Machine Scale Set (VMSS). ND MI300X v5 VMs feature high-throughput, low-latency InfiniBand communication between VMs. Each GPU has its own dedicated 400 Gb/s NVIDIA Quantum-2 CX7 InfiniBand link, giving 3.2 Tb/s of bandwidth per VM. InfiniBand is the standard for AI workloads that need to scale out to large numbers of VMs/GPUs.
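The per-VM figure follows directly from the per-GPU links; a quick sanity check of the arithmetic:

```python
# Each of the 8 GPUs in an ND MI300X v5 VM has a dedicated
# 400 Gb/s InfiniBand link; the per-VM bandwidth is their sum.

GPUS_PER_VM = 8
LINK_GBPS = 400  # gigabits per second per GPU link

per_vm_tbps = GPUS_PER_VM * LINK_GBPS / 1000  # terabits/s per VM
per_vm_gbps = GPUS_PER_VM * LINK_GBPS / 8     # gigabytes/s per VM (8 bits per byte)

print(f"{per_vm_tbps} Tb/s per VM (~{per_vm_gbps:.0f} GB/s)")  # 3.2 Tb/s (~400 GB/s)
```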

 

Scalable AI infrastructure running the most capable OpenAI models

These VMs, and the software that powers them, were purpose-built for our own Azure AI services production workloads. We have already optimized the most capable natural language model in the world, GPT-4 Turbo, for these VMs. ND MI300X v5 VMs offer leading cost performance for popular OpenAI and open-source models.

 

 

The addition of ND MI300X v5 VMs to our infrastructure expands our capacity to serve these models to more customers, faster. Whether you want to generate text, answer questions, summarize documents, or create new applications, you can leverage the power and scalability of the Azure AI infrastructure to run these models at lightning speed, huge scale, and optimized efficiency.

 

“These new Azure VMs based on AMD’s latest Instinct GPUs, MI300X, have delivered impressive performance results for our Microsoft Copilot Service. Microsoft's heterogeneous approach to silicon investments across AI accelerators ensures we are delivering continuous performance benefits to the thousands of Copilot customers, and we are excited to add the unprecedented power of these new VMs to our fleet. These VMs are part of the leading AI infrastructure platform that runs GPT-4 Turbo and underpins critical M365 Copilot scenarios, including M365 Copilot chat, Word Copilot, and Teams Meeting Copilot.” – Jason Henderson, CVP, Office 365 Product Management, Microsoft

 

Leading with innovation to advance the ecosystem

We are also working closely with our partners and customers so they can take full advantage of these new VMs and accelerate their AI projects and applications. One of our partners, Hugging Face, is a popular provider of open-source natural language processing models. Hugging Face easily ported their models to the ND MI300X VMs without any code changes and achieved 2x to 3x performance gains over AMD’s MI250 on these VMs. Now you can use these open-source models and Hugging Face libraries on the ND MI300X VMs to create and deploy your own NLP applications with ease and efficiency.

 

 

We are also excited to see what our customers will do with the new VMs. Whether you want to bring your own models, use our models through the Azure OpenAI Service, or use open models from the Azure AI catalog or from Hugging Face, you can get the best performance at the best price on the new Azure AI infrastructure VMs. You can also scale your VMs up or down as needed, thanks to the flexibility and elasticity of the Azure cloud.

 

“The deep collaboration between Microsoft, AMD and Hugging Face on the ROCm™ open software ecosystem will enable Hugging Face users to run hundreds of thousands of AI models available on the Hugging Face Hub on Azure with AMD Instinct GPUs without code changes.“

– Julien Simon, Chief Evangelist, Hugging Face

 

The new ND MI300X v5 VMs are now available in Canada Central and Sweden Central, and you can start using them today. To learn more about the new VMs and how to get started, please visit our documentation page. To join the conversation and share your feedback, please visit our forum. We look forward to hearing from you and seeing your amazing AI creations on the new Azure AI infrastructure VMs.

Updated May 22, 2024
Version 2.0
  • frosty54 I've been able to rent MI300X on TensorWave (tensorwave.com). Haven't seen it actually readily available anywhere else, though.

  • bwibking
    Copper Contributor

    I opened a support ticket and asked how to rent these VMs, but I was told:

    "Thank you for requesting additional quota in [Sweden Central]. Unfortunately, due to high demand for virtual machines in this region, we are not able to approve your quota request at this time. To ensure that all customers can access the services they need, we are working through approving quota requests as we bring additional capacity online."
  • Egborbe
    Copper Contributor

    Can you please provide a link to the "documentation page"?

  • Egborbe
    Copper Contributor

    The information in this article is incorrect. I only see NVIDIA GPUs in compute clusters in Sweden Central.

  • jgong1585
    Copper Contributor

    Opened a ticket and I got:

    Thank you for requesting additional quota in [Canada Central (CC)]. Unfortunately, due to high demand for virtual machines in this region, we are not able to approve your quota request at this time. To ensure that all customers can access the services they need, we are working through approving quota requests as we bring additional capacity online. We are continually investing in additional infrastructure to expand our available resources. 


  • LogicMage3
    Copper Contributor

    TensorWave might be a better bet. Only Canada Central and Sweden Central have any availability.