Recent Blog Articles

GPU node health checks integrated into Azure Kubernetes service via node problem detector
The AzureHPC node health repository provides a suite of recommended node health checks for all Azure specialized SKUs (including GPUs). In this blog post we will show how to integrate the GPU node ...

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optima...

Performance impact of enabling Accelerated Networking on HBv3, HBv2 and HC virtual machines
Azure Accelerated Networking is now available on HBv3, HBv2, HC and HB virtual machines (VMs). Enabling this feature improves networking performance between VMs when connecting over the Ethernet-base...

HPC/AI Storage options for NDm_v4 (A100) Azure kubernetes service (AKS) cluster
We will show how to set up and use popular Azure HPC/AI storage options (such as local NVMe SSDs, Azure Managed Lustre File System (AMLFS) and Azure Files + NFSv4) in an NDm_v4 AKS cluster and provide I...

HPC/AI Cluster resource utilization monitoring using Azure Monitor
Monitoring is a crucial aspect of managing a high-performance computing (HPC) or AI cluster. Here we will focus specifically on resource utilization monitoring using a custom data collector and the A...

Tool to assist in optimal pinning of processes/threads for Azure HPC/AI VM’s
We will discuss a tool that helps HPC applications pin processes/threads optimally on HPC specialty VMs. The tool has the following functionality. View VM CPU topology (NUMA domai...

Health checks for HPC workloads on Microsoft Azure
Many HPC applications are highly parallel and have tightly coupled communication, meaning that during an application's parallel simulation run, all parallel processes must communicate with each other ...

E2E deployment of a production ready NDv4 (A100) cluster targeting large deep learning training
The NDv4 series is very popular for running large deep learning training jobs, which require high floating-point performance and high interconnect bandwidth. In this article we will walk throug...

Tuning BeeGFS and BeeOND on Azure for specific I/O patterns
BeeGFS and BeeGFS On Demand (BeeOND) are popular parallel file systems used to handle the I/O requirements of many high-performance computing (HPC) workloads. Both run well on Azure, and both can harness...

Performance considerations for large scale deep learning training on Azure NDv4 (A100) series
Modern DL training jobs require large clusters of multi-GPU nodes with high floating-point performance, connected by high-bandwidth, low-latency networks. The Azure NDv4 VM series is designed specificall...