AI Infrastructure
103 Topics

Introducing New Performance Tiers for Azure Managed Lustre: Enhancing HPC Workloads
Building upon the success of its General Availability (GA) launch last month, we’re excited to unveil two new performance tiers for Azure Managed Lustre (AMLFS): 40 MB/s per TiB and 500 MB/s per TiB. This blog post explores the specifics of these new tiers and how they embody a customer-centric approach to innovation.

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce job is executed on the NDm_v4 AKS cluster to verify it is deployed and configured correctly.

Ramp up with me...on HPC: What is high-performance computing (HPC)?
Over the next several months, let’s take a journey together and learn about the different use cases. Join me as I dive into each use case; for some of them, I’ll even try my hand at the workload for the first time. We’ll talk about what went well and what issues I ran into. And maybe you’ll get to hear a little about our customers and partners along the way.

Simplify troubleshooting at scale - Centralized Log Management for CycleCloud Workspace for Slurm
Training large AI models on hundreds or thousands of nodes introduces a critical operational challenge: when a distributed job fails, quickly identifying the root cause across scattered logs can become incredibly time-consuming. This manual process delays recovery and reduces cluster utilization. The ability to quickly parse centralized cluster logs from a single interface is critical to ensure job failure root causes are swiftly identified and mitigated to maintain high cluster utilization.

Solution Architecture

This is a turnkey, customizable log forwarding solution for CycleCloud Workspace for Slurm that centralizes all cluster logs into Azure Monitor Log Analytics. The architecture uses the Azure Monitor Agent (AMA), deployed on every VM and Virtual Machine Scale Set (VMSS), to stream logs defined by Data Collection Rules (DCRs) to dedicated tables in a Log Analytics workspace, where they can be queried from a single interface.

The turnkey solution captures three categories of logs essential for troubleshooting distributed workloads, but can be extended to any other logs:

- Slurm logs, including slurmctld, slurmd, etc., plus archived job artifacts (job submission scripts, environment variables, stdout/stderr) collected via prolog/epilog scripts.
- Infrastructure logs from CycleCloud, including the CycleCloud Healthagent, which automatically tests nodes for hardware health and drains nodes that fail tests.
- Operating system logs from syslog and dmesg, capturing kernel events, network state changes, and hardware issues.

Each log source flows through its own DCR into a dedicated table following a consistent schema. The solution automatically associates scheduler-specific DCRs with the Slurm scheduler node and compute-specific DCRs with compute nodes, handling dynamic node scaling transparently. The solution is purpose-built for CycleCloud Workspace for Slurm, but designed in a modular fashion to be easily extended with new data sources (i.e. new log formats) and processing (i.e. new Data Collection Rules) to support log forwarding and analysis of other required logs.

Key Benefits

- Time-series correlation: Azure Monitor's time-based indexing enables rapid identification of cascading failures. For example, trace a network carrier flap detected in syslog to corresponding slurmd communication errors to specific job failures, all within seconds.
- Centralized visibility: Query logs from thousands of nodes through a single interface instead of SSH-ing to individual machines. Correlate Slurm controller decisions with node-level errors and system events in one query.
- Log persistence: Logs survive node deallocations and reimaging, which is critical in cloud environments where compute nodes are ephemeral.
- Powerful query language: KQL (Kusto Query Language) allows parsing raw logs into structured fields, filtering across multiple sources, and building operational dashboards. Example queries detect patterns like repeated job failures, network instability, or resource exhaustion.
- Production-ready scalability: User-assigned managed identities automatically propagate to new VMSS instances, and DCR associations handle thousands of nodes without manual configuration.

Getting Started

The complete solution is available on GitHub (slurm-log-collection) with deployment scripts that:

- Create all required Log Analytics tables
- Deploy pre-configured DCRs for Slurm, CycleCloud, and OS logs
- Automatically associate DCRs with scheduler and compute resources

After configuring environment variables and running the setup scripts, logs begin flowing to Azure Monitor and will populate within 15 minutes; normal log ingestion latency is roughly 30 seconds to 3 minutes. The repository includes sample KQL queries for common troubleshooting scenarios to accelerate time-to-resolution and to perform non-troubleshooting analysis of cluster usage.

Running GPU accelerated workloads with NVIDIA GPU Operator on AKS
The focus of this article is getting NVIDIA GPUs managed and configured optimally on Azure Kubernetes Service using the NVIDIA GPU Operator, for HPC/AI workloads that require a high degree of customization and granular control over the compute-resource configuration.

Azure announces new AI optimized VM series featuring AMD’s flagship MI300X GPU
In our relentless pursuit of pushing the boundaries of artificial intelligence, we understand that cutting-edge infrastructure and expertise are needed to harness the full potential of advanced AI. At Microsoft, we've amassed a decade of experience in supercomputing and have consistently supported the most demanding AI training and generative inferencing workloads. Today, we're excited to announce the latest milestone in our journey. We’ve created a virtual machine (VM) with an unprecedented 1.5 TB of high bandwidth memory (HBM) that leverages the power of AMD’s flagship MI300X GPU. Our Azure VMs powered with the MI300X GPU give customers even more choices for AI optimized VMs.

Performance considerations for large scale deep learning training on Azure NDv4 (A100) series
Modern DL training jobs require large clusters of multi-GPU nodes with high floating-point performance, connected by high-bandwidth, low-latency networks. The Azure NDv4 VM series is designed specifically for these types of workloads. We will focus on HPC+AI clusters built with the ND96asr_v4 virtual machine type and provide specific optimization recommendations to get the best performance.