Recent Blog Articles

Performance analysis of DeepSeek R1 AI Inference using vLLM on ND-H100-v5
The DeepSeek R1 model represents a new frontier in large-scale reasoning for AI applications. Designed to tackle complex inference tasks, R1 pushes the boundaries of what’s possible, bu...

Inference performance of Llama 3.1 8B using vLLM across various GPUs and CPUs
Following our previous evaluation of Llama 3.1 8B inference performance on Azure’s ND-H100-v5 infrastructure using vLLM, this report broadens the scope to compare inference performance...

Performance of Llama 3.1 8B AI Inference using vLLM on ND-H100-v5
The pace of development in large language models (LLMs) has continued to accelerate as the global AI community races toward the goal of artificial general intelligence (AGI). Today’s m...

GPU node health checks integrated into Azure Kubernetes service via node problem detector
The Azurehpc node health repository provides a suite of recommended node health checks for all Azure specialized SKUs (including GPUs). In this blog post we will show how to integrate the GPU node ...

HPC/AI Storage options for NDm_v4 (A100) Azure kubernetes service (AKS) cluster
We will show how to set up and use popular Azure HPC/AI storage options (such as local NVMe SSDs, Azure Managed Lustre Filesystem (AMLFS) and Azure Files + NFSv4) in an NDm_v4 AKS cluster and provide I...

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPU and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optima...

E2E deployment of a production ready NDv4 (A100) cluster targeting large deep learning training
The NDv4 series is very popular for running large deep learning training jobs, which require lots of floating-point performance and high interconnection bandwidth. In this article we will walk throug...

Automated HPC/AI compute node health-checks Integrated with the SLURM scheduler
It is best practice to run health-checks on compute nodes before running jobs; this is especially important for tightly coupled HPC/AI applications. The virtual machines that fail the health-checks s...

HPC/AI Cluster resource utilization monitoring using Azure Monitor
Monitoring is a crucial aspect of managing a high-performance computing (HPC) or AI cluster. Here we will focus specifically on resource utilization monitoring using a custom data collector and the A...

Performance considerations for large scale deep learning training on Azure NDv4 (A100) series
Modern DL training jobs require large clusters of multi-GPU nodes with high floating-point performance, connected by high-bandwidth, low-latency networks. The Azure NDv4 VM series is designed specificall...