AI Infrastructure
Introducing New Performance Tiers for Azure Managed Lustre: Enhancing HPC Workloads
Building upon the success of its General Availability (GA) launch last month, we’re excited to unveil two new performance tiers for Azure Managed Lustre (AMLFS): 40 MB/s per TiB and 500 MB/s per TiB. This blog post explores the specifics of these new tiers and how they embody a customer-centric approach to innovation.

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPU and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce job is then executed on the NDm_v4 AKS cluster to verify that it is deployed and configured correctly.

Ramp up with me...on HPC: What is high-performance computing (HPC)?
Over the next several months, let’s take a journey together and learn about the different use cases. Join me as I dive into each use case; for some of them, I’ll even try my hand at the workload for the first time. We’ll talk about what went well and any issues I ran into. And maybe you’ll get to hear a little about our customers and partners along the way.

Running GPU accelerated workloads with NVIDIA GPU Operator on AKS
The focus of this article is on getting NVIDIA GPUs managed and configured in the best way on Azure Kubernetes Service, using the NVIDIA GPU Operator, for HPC/AI workloads that require a high degree of customization and granular control over the compute-resource configuration.

Azure announces new AI optimized VM series featuring AMD’s flagship MI300X GPU
In our relentless pursuit of pushing the boundaries of artificial intelligence, we understand that cutting-edge infrastructure and expertise are needed to harness the full potential of advanced AI. At Microsoft, we've amassed a decade of experience in supercomputing and have consistently supported the most demanding AI training and generative inferencing workloads. Today, we're excited to announce the latest milestone in our journey: a virtual machine (VM) with an unprecedented 1.5 TB of high-bandwidth memory (HBM) that leverages the power of AMD’s flagship MI300X GPU. Our Azure VMs powered by the MI300X GPU give customers even more choices for AI-optimized VMs.

Performance considerations for large scale deep learning training on Azure NDv4 (A100) series
Modern deep learning training jobs require large clusters of multi-GPU nodes with high floating-point performance, connected by high-bandwidth, low-latency networks. The Azure NDv4 VM series is designed specifically for these types of workloads. We focus on HPC+AI clusters built with the ND96asr_v4 virtual machine type and provide specific optimization recommendations to get the best performance.

Announcing the Public Preview of AMLFS 20: Azure Managed Lustre New SKU for Massive AI & HPC Workloads
Sachin Sheth - Principal PDM Manager
Brian Barbisch - Principal Group Software Engineering Manager
Matt White - Principal Group Software Engineering Manager
Brian Lepore - Principal Product Manager
Wolfgang De Salvador - Senior Product Manager
Ron Hogue - Senior Product Manager

Introduction

We are excited to announce the Public Preview of AMLFS Durable Premium 20 (AMLFS 20), a new SKU in Azure Managed Lustre designed to deliver unprecedented performance and scale for demanding AI and HPC workloads.

Key Features

- Massive Scale: Store up to 25 PiB of data in a single namespace, with up to 512 GB/s of total bandwidth.
- Advanced Metadata Performance: The multi-MDS (Metadata Server) architecture dramatically improves metadata IOPS. In mdtest benchmarks, AMLFS 20 demonstrated more than a 5x improvement in metadata operations. An additional MDS is provided for every 5 PiB of provisioned filesystem.
- High File Capacity: Supports up to 20 billion inodes at the maximum namespace size.

Why AMLFS 20 Matters

- Simplified Architecture: Previously, datasets larger than 12.5 PiB required multiple filesystems and complex management. AMLFS 20 enables a single, high-performance file system for massive AI and HPC workloads of up to 25 PiB, streamlining deployment and administration.
- Accelerated Data Preparation: The multi-MDT architecture significantly increases metadata IOPS, which is crucial during the data preparation stage of AI training, where rapid access to millions of files is required.
- Faster Time-to-Value: Researchers and engineers benefit from easier management, reduced bottlenecks, and faster access to large datasets, accelerating innovation.

Availability

AMLFS 20 is available in Public Preview alongside the already existing AMLFS SKUs. For more details on other SKUs, visit the Azure Managed Lustre documentation.
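To make the scaling rules above concrete, here is a small illustrative sketch. The helper function and its exact formula are our own reading of the announcement ("an additional MDS is provided for every 5 PiB of provisioned filesystem", which we interpret as one baseline MDS plus one per full 5 PiB); this is not an Azure SDK call.

```python
# Illustrative sketch of the AMLFS 20 scaling rules described above.
# Assumption (ours, not documented API behavior): one baseline MDS,
# plus one additional MDS per full 5 PiB of provisioned capacity.

AMLFS20_MAX_PIB = 25        # single-namespace capacity cap from the announcement
AMLFS20_MAX_BW_GBPS = 512   # total bandwidth cap from the announcement

def mds_count(provisioned_pib: float) -> int:
    """Estimated number of metadata servers for a given provisioned size."""
    if not 0 < provisioned_pib <= AMLFS20_MAX_PIB:
        raise ValueError("AMLFS 20 supports up to 25 PiB in a single namespace")
    return 1 + int(provisioned_pib // 5)

# A full 25 PiB namespace would, under this reading, be served by 6 MDSes.
print(mds_count(25))  # -> 6
```

Under this interpretation, a small 4 PiB filesystem runs on a single MDS, while metadata capacity grows with the namespace rather than becoming a fixed bottleneck.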
How to Join the Preview

If you are working with large-scale AI or HPC workloads and would like early access to AMLFS 20, we invite you to fill out this form to tell us about your use case. Our team will follow up with onboarding details.