AI Infrastructure
103 Topics

Introducing New Performance Tiers for Azure Managed Lustre: Enhancing HPC Workloads
Building upon the success of its General Availability (GA) launch last month, we’re excited to unveil two new performance tiers for Azure Managed Lustre (AMLFS): 40 MB/s per TiB and 500 MB/s per TiB. This blog post explores the specifics of these new tiers and how they embody a customer-centric approach to innovation.

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce job is executed on the NDm_v4 AKS cluster to verify it is deployed and configured correctly.

Ramp up with me...on HPC: What is high-performance computing (HPC)?
Over the next several months, let’s take a journey together and learn about the different use cases. Join me as I dive into each use case; for some of them, I’ll even try my hand at the workload for the first time. We’ll talk about what went well and what issues I ran into. And maybe you’ll get to hear a little about our customers and partners along the way.

Simplify troubleshooting at scale - Centralized Log Management for CycleCloud Workspace for Slurm
Training large AI models on hundreds or thousands of nodes introduces a critical operational challenge: when a distributed job fails, quickly identifying the root cause across scattered logs can become incredibly time-consuming. This manual process delays recovery and reduces cluster utilization. The ability to quickly parse centralized cluster logs from a single interface is critical to ensure job failure root causes are swiftly identified and mitigated to maintain high cluster utilization.

Solution Architecture

This is a turnkey, customizable log forwarding solution for CycleCloud Workspace for Slurm that centralizes all cluster logs into Azure Monitor Log Analytics. The architecture uses the Azure Monitor Agent (AMA), deployed on every VM and Virtual Machine Scale Set (VMSS), to stream logs defined by Data Collection Rules (DCRs) to dedicated tables in a Log Analytics workspace, where they can be queried from a single interface.

The turnkey solution captures three categories of logs essential for troubleshooting distributed workloads, but can be extended to any other logs:

- Slurm logs, including slurmctld, slurmd, etc., plus archived job artifacts (job submission scripts, environment variables, stdout/stderr) collected via prolog/epilog scripts.
- Infrastructure logs from CycleCloud, including the CycleCloud Healthagent, which automatically tests nodes for hardware health and drains nodes that fail tests.
- Operating system logs from syslog and dmesg, capturing kernel events, network state changes, and hardware issues.

Each log source flows through its own DCR into a dedicated table following a consistent schema. The solution automatically associates scheduler-specific DCRs with the Slurm scheduler node and compute-specific DCRs with compute nodes, handling dynamic node scaling transparently. The solution is purpose-built for CycleCloud Workspace for Slurm, but designed in a modular fashion to be easily extended with new data sources (i.e. new log formats) and processing (i.e. new Data Collection Rules) to support log forwarding and analysis of other required logs.

Key Benefits

- Time-series correlation: Azure Monitor's time-based indexing enables rapid identification of cascading failures. For example, trace a network carrier flap detected in syslog to corresponding slurmd communication errors to specific job failures, all within seconds.
- Centralized visibility: Query logs from thousands of nodes through a single interface instead of SSH-ing to individual machines. Correlate Slurm controller decisions with node-level errors and system events in one query.
- Log persistence: Logs survive node deallocations and reimaging, which is critical in cloud environments where compute nodes are ephemeral.
- Powerful query language: KQL (Kusto Query Language) allows parsing raw logs into structured fields, filtering across multiple sources, and building operational dashboards. Example queries detect patterns like repeated job failures, network instability, or resource exhaustion.
- Production-ready scalability: User-assigned managed identities automatically propagate to new VMSS instances, and DCR associations handle thousands of nodes without manual configuration.

Getting Started

The complete solution is available on GitHub (slurm-log-collection) with deployment scripts that:

- Create all required Log Analytics tables
- Deploy pre-configured DCRs for Slurm, CycleCloud, and OS logs
- Automatically associate DCRs with scheduler and compute resources

After configuring environment variables and running the setup scripts, logs begin flowing to Azure Monitor and will populate within 15 minutes; normal log ingestion latency is roughly 30 seconds to 3 minutes. The repository includes sample KQL queries for common troubleshooting scenarios to accelerate time-to-resolution and to perform non-troubleshooting analysis of cluster usage.

Running GPU accelerated workloads with NVIDIA GPU Operator on AKS
The focus of this article is getting NVIDIA GPUs managed and configured optimally on Azure Kubernetes Service using the NVIDIA GPU Operator, for HPC/AI workloads that require a high degree of customization and granular control over the compute-resource configuration.

Azure announces new AI optimized VM series featuring AMD’s flagship MI300X GPU
In our relentless pursuit of pushing the boundaries of artificial intelligence, we understand that cutting-edge infrastructure and expertise are needed to harness the full potential of advanced AI. At Microsoft, we've amassed a decade of experience in supercomputing and have consistently supported the most demanding AI training and generative inferencing workloads. Today, we're excited to announce the latest milestone in our journey. We’ve created a virtual machine (VM) with an unprecedented 1.5 TB of high bandwidth memory (HBM) that leverages the power of AMD’s flagship MI300X GPU. Our Azure VMs powered with the MI300X GPU give customers even more choices for AI optimized VMs.

Performance considerations for large scale deep learning training on Azure NDv4 (A100) series
Modern DL training jobs require large clusters of multi-GPU nodes with high floating-point performance, connected by high-bandwidth, low-latency networks. The Azure NDv4 VM series is designed specifically for these types of workloads. We will focus on HPC+AI clusters built with the ND96asr_v4 virtual machine type and provide specific optimization recommendations to get the best performance.