Azure Recognized as an NVIDIA Cloud Exemplar, Setting the Bar for AI Performance in the Cloud
As AI models continue to scale in size and complexity, cloud infrastructure must deliver more than theoretical peak performance. What matters in practice is reliable, end-to-end, workload-level AI performance—where compute, networking, system software, and optimization work together to deliver predictable, repeatable results at scale. This directly translates to business value: efficient full-stack infrastructure accelerates time-to-market, maximizes ROI on GPU and cloud investments, and enables organizations to scale AI from proof-of-concept to revenue-generating products with predictable economics.

Today, Microsoft is proud to share an important milestone in partnership with NVIDIA: Azure has been validated as an NVIDIA Exemplar Cloud, becoming the first cloud provider recognized for Exemplar-class AI performance aligned with GB300-class (Blackwell generation) systems. This recognition builds on Azure’s previously validated Exemplar status for H100 training workloads and reflects NVIDIA’s confidence in Azure’s ability to extend that rigor and performance discipline into the next generation of AI platforms.

What Is NVIDIA Exemplar Cloud?
The NVIDIA Exemplar Cloud initiative celebrates cloud platforms that demonstrate robust end-to-end AI workload performance using NVIDIA’s Performance Benchmarking suite. Rather than relying on synthetic microbenchmarks, Performance Benchmarking evaluates real AI training workloads using:
Large-scale LLM training scenarios
Production-grade software stacks
Optimized system and network configurations
Workload-centric metrics such as throughput and time-to-train
Achieving Exemplar validation signals that a provider can consistently deliver world-class AI performance in the cloud, showcasing that end users are getting optimal performance value by default.

Proven Exemplar Validation on H100
Azure’s Exemplar Cloud journey began with publicly shared benchmarking results for H100-based training workloads, where Azure ND GPU clusters demonstrated exemplar performance using NVIDIA Performance Benchmarking recipes. Those results—published previously and validated through NVIDIA’s benchmarking framework—established a proven foundation of end-to-end AI performance for large-scale, production workloads running on Azure today.

Extending Exemplar-Class AI Performance to GB300-Class Platforms
Building on the rigor and learnings from H100 validation, Microsoft has now been recognized by NVIDIA as the first cloud provider to achieve Exemplar-class performance and readiness aligned with GB300-class systems. This designation reflects NVIDIA’s assessment that the same principles applied to H100—including end-to-end system tuning, networking optimization, and software alignment—are being successfully carried forward into the Blackwell generation. Rather than treating GB300 as a point solution, Azure approaches it as a continuation of a proven performance model: delivering consistent world-class AI performance in the cloud while preserving the flexibility, elasticity, and global scale customers expect.
What Enables Exemplar-Class AI Performance on Azure
Delivering Exemplar-class AI performance requires optimization across the full AI stack:
Infrastructure and Networking
High-performance Azure ND GPU clusters with NVIDIA InfiniBand
NUMA-aware CPU, GPU, and NIC alignment to minimize latency
Tuned NCCL communication paths for efficient multi-GPU scaling
Software and System Optimization
Tight integration with NVIDIA software, including Performance Benchmarking recipes and NVIDIA AI Enterprise
Parallelism strategies aligned with large-scale LLM training
Continuous tuning as models, workloads, and system architectures evolve
End-to-End Workload Focus
Measuring real training performance, not isolated component metrics
Driving repeatable improvements in application-level throughput and efficiency
Closing the performance gap between cloud and on-premises systems—without sacrificing manageability
Together, these capabilities enabled Azure to deliver consistent Exemplar-class AI performance across generations of NVIDIA platforms.

What This Means for Customers
For customers training and deploying advanced AI models, this milestone delivers clear benefits:
World-class AI performance in a fully managed cloud environment
Predictable scaling from small clusters to thousands of GPUs
Faster time to train and improved performance per dollar
Confidence that Azure is ready for Blackwell-class and GB300-class AI workloads
As AI workloads become more complex and reasoning-heavy, infrastructure performance increasingly determines outcomes. Azure’s NVIDIA Cloud Exemplar recognition provides a clear signal: customers can build and scale next-generation AI systems on Azure without compromising on performance.

Learn More
DGX Cloud Benchmarking on Azure
DGX Cloud Benchmarking on Azure | Microsoft Community Hub
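To make the "measure real workload communication, not isolated components" point concrete, here is a minimal, hedged sketch of timing an NCCL all-reduce from PyTorch on a multi-GPU ND-series VM. This is only an illustration of the kind of application-level measurement involved, not NVIDIA's Performance Benchmarking suite; the buffer size, iteration count, and torchrun launch are assumptions.

import os
import time
import torch
import torch.distributed as dist

def allreduce_busbw(size_mb: int = 256, iters: int = 20) -> float:
    # The NCCL backend rides the same InfiniBand/NVLink paths a training job uses.
    dist.init_process_group(backend="nccl")
    torch.cuda.set_device(int(os.environ["LOCAL_RANK"]))        # set by torchrun
    x = torch.ones(size_mb * 1024 * 1024 // 4, device="cuda")   # fp32 buffer

    for _ in range(5):                                           # warm-up
        dist.all_reduce(x)
    torch.cuda.synchronize()

    t0 = time.time()
    for _ in range(iters):
        dist.all_reduce(x)
    torch.cuda.synchronize()
    elapsed = (time.time() - t0) / iters

    n = dist.get_world_size()
    # Standard "bus bandwidth" correction: a ring all-reduce moves ~2*(n-1)/n of the buffer.
    busbw_gbs = (x.numel() * 4 / elapsed) * 2 * (n - 1) / n / 1e9
    dist.destroy_process_group()
    return busbw_gbs

if __name__ == "__main__":
    # Launch with e.g.: torchrun --nproc-per-node=8 this_script.py
    print(f"approx all-reduce bus bandwidth: {allreduce_busbw():.1f} GB/s")

In practice, nccl-tests and the Performance Benchmarking recipes referenced above are the authoritative way to measure this; the point is simply that the metric of interest is end-to-end communication throughput as the workload sees it.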
Centralized cluster performance metrics with ReFrame HPC and Azure Log Analytics
Imagine having several clusters across different environments (dev, test and prod), planning a migration between PBS and Slurm, or porting codes to a different system. These can all seem like daunting tasks. This is where the combination of ReFrame HPC, a powerful and feature-rich testing framework, and Azure Log Analytics can help improve confidence and assurance in the performance and accuracy of a system. Here we will look at how to configure ReFrame HPC specifically for Azure: deploying the required Azure resources, running a test, and capturing the results in Log Analytics for analysis.

Deploying the required Azure Resources
First, deploy the required resources in Azure by using this bicep from GitHub. The deployment includes the creation and configuration of everything required for ReFrame HPC. These resources include a data collection endpoint, a data collection rule and a Log Analytics workspace.

Running ior via ReFrame HPC
For the purpose of demonstrating a running test and capturing the results in Azure from start to finish, here is a simple ior test which runs both a read and a write operation against the shared storage.

import reframe as rfm
import reframe.utility.sanity as sn


@rfm.simple_test
class SimplePerfTest(rfm.RunOnlyRegressionTest):
    valid_systems = ["*"]
    valid_prog_environs = ["+ior"]
    executable = 'ior'
    executable_opts = [
        '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
    ]
    reference = {
        'tst:hbv4': {
            'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
            'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
        }
    }

    @sanity_function
    def validate_run(self):
        return sn.assert_found(r'Summary of all tests:', self.stdout)

    @performance_function('MiB/s')
    def write_bandwidth_mib(self):
        return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

    @performance_function('MiB/s')
    def read_bandwidth_mib(self):
        return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

Test explanation
Set the binary to be executed to ior, along with its arguments.

    executable = 'ior'
    executable_opts = [
        '-a POSIX -w -r -C -e -g -F -b 2M -t 2M -s 25600 -o /data/demo/test.bin -D 300'
    ]

Specify which systems the test should run on. In this case, any system/cluster which is known to have ior available will be selected. Look at the ReFrame HPC documentation to get a better understanding of the options available for use.

    valid_systems = ["*"]
    valid_prog_environs = ["+ior"]

Verify the stdout of the job by searching for a specific value to assert that it ran successfully.

    @sanity_function
    def validate_run(self):
        return sn.assert_found(r'Summary of all tests:', self.stdout)

If the sanity function passes, the performance metrics are then extracted from the stdout of the job. The naming of the methods is important, as they will be stored in the results later.

    @performance_function('MiB/s')
    def write_bandwidth_mib(self):
        return sn.extractsingle(r'^write\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

    @performance_function('MiB/s')
    def read_bandwidth_mib(self):
        return sn.extractsingle(r'^read\s+([0-9]+\.?[0-9]*)', self.stdout, 1, float)

Performance references are used to determine whether the current cluster has met the requirement or not. They also allow margins to be specified in either direction.

    reference = {
        'tst:hbv4': {
            'write_bandwidth_mib': (500, -0.05, 0.1, 'MiB/s'),
            'read_bandwidth_mib': (350, -0.05, 0.5, 'MiB/s'),
        }
    }

ReFrame HPC Configuration
The ReFrame HPC configuration is key to determining how and where the test will run.
It is also where the logic allowing ReFrame HPC to use Azure for centralized logging is defined. The full configuration file is vast and is covered in detail within the ReFrame HPC documentation. For the purpose of this test an example can be found on GitHub. Below is a breakdown of the key parts that allow ReFrame HPC to push its results into Azure Log Analytics.

Logging Handler
The most important part of this configuration is the logging section; without it ReFrame HPC will not attempt to log the results. A handlers_perflog entry of type httpjson is added so that the logs are sent to an HTTP endpoint, with specific values which are covered below.

'logging': [
    {
        'perflog_multiline': True,
        'handlers_perflog': [
            {
                'type': 'httpjson',
                'url': 'REDACTED',
                'level': 'info',
                'debug': False,
                'extra_headers': {'Authorization': f'Bearer {_get_token()}'},
                'extras': {
                    'TimeGenerated': f'{datetime.now(timezone.utc).isoformat()}',
                    'facility': 'reframe',
                    'reframe_azure_data_version': '1.0',
                },
                'ignore_keys': ['check_perfvalues'],
                'json_formatter': _format_record
            }
        ]
    }
]

Multiline Perflog
To ensure this works with Azure, enable perflog_multiline. This ensures a single record per metric is sent to Log Analytics, which is the cleanest way to output the results. Having this set to False moves the metric names into column names, which means the schema will be different for each test and will become hard to maintain.

Extra Headers
A bearer token is required to authenticate the request. ReFrame HPC allows headers to be added via the extra_headers property, and a simple Python function obtains a scoped token that can be appended as the additional header.

from azure.identity import DefaultAzureCredential

def _get_token(scope='https://monitor.azure.com/.default') -> str:
    credential = DefaultAzureCredential()
    token = credential.get_token(scope)
    return token.token

URL Structure
The url can be found in the output of the bicep which was run previously. It can also be obtained via the portal. Here is the structure of the url for reference.

'${dce.properties.logsIngestion.endpoint}/dataCollectionRules/${dcr.properties.immutableId}/streams/Custom-${table.name}?api-version=2023-01-01'

JSON Formatter
A small workaround is needed because the Data Collection Rule expects an array of items and ReFrame HPC outputs a single record. To resolve this, another Python function can be used which simply wraps the record in an array. In this example it also tidies up and removes some items that are not required and would cause issues with the JSON serialization.

import json

def _format_record(record, extras, ignore_keys):
    data = {}
    for attr, val in record.__dict__.items():
        if attr in ignore_keys or attr.startswith('_'):
            continue
        data[attr] = val
    data.update(extras)
    return json.dumps([data])

Running the Test
Now that the infrastructure has been deployed and the test has been defined and correctly configured, we can run the test. Start by logging in. Here I am using the managed identity of the node, but user auth and user-assigned managed identities are also supported.

$ az login --identity

ReFrame HPC can be installed via Spack or Python and, while I am using Spack for packages on the cluster, I find the simplest approach is to activate a Python environment and install ReFrame HPC along with test-specific Python dependencies.

$ python3 -m venv .venv
$ . .venv/bin/activate
$ python -m pip install -U pip
$ pip install -r requirements.txt

Now, using the ReFrame HPC CLI, the test can be run using the configuration file and the test file.
$ reframe -C config.py -c simple_perf.py --performance-report -r

ReFrame HPC will now run the test against the system/cluster defined in the configuration. For this example it is a Slurm cluster with a partition of HBv4 nodes, and running squeue confirms that.

$ squeue
JOBID PARTITION     NAME     USER ST   TIME  NODES NODELIST(REASON)
  955      hbv4 rfm_Simp jim.pain  R   0:28      1 tst4-hbv4-97

Results
And there we have it, results are now appearing in Azure! From here we can use KQL to query and filter the results (a short example of querying them programmatically is included at the end of this post). This is just a subset of the values available, but the dataset is vast and includes a huge range of values that are extremely helpful.

Summary
By standardizing on the combination of ReFrame HPC and Azure Log Analytics for testing and reporting of performance data across our clusters, whether Slurm based, Azure CycleCloud or existing on-prem clusters, you can gain unprecedented visibility and confidence in the systems you manage and the codes you deploy that were previously hard to obtain. This enables the potential for:
🔎 Fast cross-cluster comparisons
📈 Trend analysis over long running periods
📊 Standardized metrics regardless of scheduler or system
☁️ Unified monitoring and reporting across clusters
ReFrame HPC is suitable for a wide range of testing, so if testing is something you have been looking to implement, take a look at ReFrame HPC.
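For completeness, here is a hedged sketch of pulling the same records back out of Log Analytics from Python with the azure-monitor-query package, rather than the portal. The workspace ID and custom table name (ReFrameResults_CL) are placeholders; use the values created by the bicep deployment and the schema defined by your data collection rule.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient

client = LogsQueryClient(DefaultAzureCredential())

# Hypothetical custom table name - substitute the table created by the bicep deployment.
query = """
ReFrameResults_CL
| where TimeGenerated > ago(7d)
| order by TimeGenerated desc
| take 50
"""

response = client.query_workspace(
    workspace_id="<log-analytics-workspace-id>",   # placeholder
    query=query,
    timespan=timedelta(days=7),
)

for table in response.tables:
    for row in table.rows:
        print(dict(zip(table.columns, row)))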
Scaling physics-based digital twins: Neural Concept on Azure delivers a New Record in Industrial AI
Automotive Design and the DrivAerNet++ Benchmark
In automotive design, external aerodynamics have a direct impact on performance, energy efficiency, and development cost. Even small reductions in drag can translate into significant fuel savings or extended EV range. As development timelines accelerate, engineering teams increasingly rely on data-driven methods to augment or replace traditional CFD workflows. MIT’s DrivAerNet++ dataset is the largest open multimodal dataset for automotive aerodynamics, offering a large-scale benchmark for evaluating learning-based approaches that capture the physical signals required by engineers. It includes 8,000 vehicle geometries across 3 variants (fastback, notchback and estate-back) and aggregates 39 TB of high-fidelity CFD outputs such as surface pressure, wall shear stress, volumetric flow fields, and drag coefficients.

Benchmark Highlights
Neural Concept trained its geometry-native Geometric Regressor, designed to handle any type of engineering data. The benchmark was executed on Azure HPC infrastructure to evaluate the capabilities of the geometry-native platform under transparent, scalable, and fully reproducible conditions.
Surface pressure: Lowest prediction error recorded on the benchmark, revealing where high- and low-pressure zones form.
Wall shear stress: Outperforming all competing methods to detect flow attachment and separation for drag and stability control.
Volumetric velocity field: More than 50% lower error than previous best, capturing full flow structure for wake stability analysis.
Drag coefficient Cd: R² of 0.978 on the test set, accurate enough for early design screening without full CFD runs.
Dataset Scale and Ingestion: 39 TB of data was ingested into Neural Concept’s platform through a parallel conversion task with 128 workers and 5 GB RAM each that finished in about 1 hour and produced a compact 3 TB dataset in the platform’s native format.
Data Pre-Processing: Pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices for handling external aerodynamics workflows.
Model Training and Deployment: Training completed in 24 hours on 4 A100 GPUs, with the best model obtained after 16 hours. The final model is compact, and real-time predictions can be served on a single 16 GB GPU for industrial use.
Neural Concept outperformed all other competing methods, achieving state-of-the-art performance prediction on all metrics and physical quantities within a week:
“Neural Concept’s breakthrough demonstrates the power of combining advanced AI with the scalability of Microsoft Azure,” said Jack Kabat, Partner, Azure HPC and AI Infrastructure Products, Microsoft. “By running training and deployment on Azure’s high-performance infrastructure — specifically the NC A100 Virtual Machine — Neural Concept was able to transform 39 terabytes of data into a production-ready workflow in just one week. This shows how Azure accelerates innovation and helps automotive manufacturers bring better products to market faster.”
For additional benchmark metrics and comparisons, please refer to the Detailed Quantitative Results section at the end of the article.

From State-Of-The-Art Benchmark Accuracy to Proven Industrial Impact
Model accuracy alone is necessary, but not sufficient for industrial impact. Transformative gains at scale and over time are only revealed once high-performing models are deployed into maintainable and repeatable workflows across organizations.
Customers using Neural Concept’s platform have achieved:
30% shorter design cycles
$20M in savings on a 100,000-unit vehicle program
These outcomes fundamentally result from a transformed, systematic approach to design, unlocking better and faster data-driven decisions. The Design Lab interface, described in the next section, is at the core of this transformation. Within Neural Concept’s ecosystem, validated geometry and physics models can be deployed directly into the Design Lab - a collaborative environment where aerodynamicists and designers evaluate concepts in real time. AI copilots provide instant performance feedback, geometry-aware improvement suggestions, and live KPI updates, effectively reconnecting aerodynamic analysis with the pace of modern vehicle design.

CES 2026: See how OEMs are transforming product development with Engineering Intelligence
Neural Concept and Microsoft will showcase how AI-native aerodynamic workflows can reshape vehicle development — from real-time design exploration to enterprise-scale deployment. Visit the Microsoft booth to see DrivAerNet++ running on Azure HPC and meet the teams shaping the future of automotive engineering.
Visit Microsoft Booth to find out more
Neural Concept’s executive team will also be at CES to share flagship results achieved by leading OEMs and Tier-1 suppliers already using the platform in production. Learn more on: https://www.neuralconcept.com/ces-2026

Credits
Microsoft: Hugo Meiland (Principal Program Manager), Guy Bursell (Director Business Strategy, Manufacturing), Fernando Aznar Cornejo (Product Marketing Manager) and Dr. Lukasz Miroslaw (Sr. Industry Advisor)
Neural Concept: Theophile Allard (CTO), Benoit Guillard (Senior ML Research Scientist), Alexander Gorgin (Product Marketing Engineer), Konstantinos Samaras-Tsakiris (Software Engineer)

Detailed Quantitative Results
In the sections that follow, we share the results obtained by applying Neural Concept’s aerodynamics predictive model training template to DrivAerNet++. We evaluated our model’s prediction errors using the official train/test split and the standard evaluation strategy. For comparison, metrics from other methods were taken from the public leaderboard. We report both Mean Squared Error (MSE) and Mean Absolute Error (MAE) to quantify prediction accuracy. Lower values for either metric indicate closer agreement with the ground truth simulations, meaning better predictions.

1. Surface Field Predictions: Pressure and Wall Shear Stress
We began by evaluating predictions for the two physical quantities defined on the vehicle surface.

Surface Pressure
The Geometric Regressor achieved substantially better performance than all existing methods in predicting surface pressure distribution.

Rank | Deep Learning Model | MSE (×10⁻², lower = better) | MAE (×10⁻¹, lower = better)
#1 | Neural Concept | 3.98 | 1.08
#2 | GAOT (May 2025) | 4.94 | 1.10
#3 | FIGConvNet (February 2025) | 4.99 | 1.22
#4 | TripNet (March 2025) | 5.14 | 1.25
#5 | RegDGCNN (June 2024) | 8.29 | 1.61

Table 1: Neural Concept’s Geometric Regressor predicts surface pressure more accurately than previously published state-of-the-art methods. The dates indicate when the competing model architectures were published.

Figure 1: Side-by-side comparison of the ground truth pressure field (left), Neural Concept model’s prediction (middle), and the corresponding error for a representative test sample (right).

Wall Shear Stress
Similarly, the model delivered top-tier results, outperforming all competing methods.
Rank | Deep Learning Model | MSE (×10⁻², lower = better) | MAE (×10⁻¹, lower = better)
#1 | Neural Concept | 7.80 | 1.44
#2 | GAOT (May 2025) | 8.74 | 1.57
#3 | TripNet (March 2025) | 9.52 | 2.15
#4 | FIGConvNet (Feb. 2025) | 9.86 | 2.22
#5 | RegDGCNN (June 2024) | 13.82 | 3.64

Table 2: Neural Concept’s Geometric Regressor predicts wall shear stress more accurately than previously published state-of-the-art methods.

Figure 2: Side-by-side comparison of the ground truth magnitude of the wall shear stress, Neural Concept model’s prediction, and the corresponding error for a representative test sample.

Across both surface fields (pressure and wall shear stress), the Geometric Regressor achieved the lowest MSE and MAE by a clear margin. The baseline methods represent several high-quality and recent academic works (the earliest being from June 2024), yet our architecture established a new state of the art in predictive performance.

2. Volumetric Predictions: Velocity
Beyond surface quantities, DrivAerNet++ provides 3D velocity fields in the flow volume surrounding the vehicle, which we also predicted using the Geometric Regressor.

Rank | Deep Learning Model | MSE (lower = better) | MAE (×10⁻¹, lower = better)
#1 | Neural Concept | 3.11 | 9.22
#2 | TripNet (March 2025) | 6.71 | 15.2

Table 3: Neural Concept’s Geometric Regressor predicts velocity more accurately than the previously published state-of-the-art method.

The illustration below shows the velocity magnitude for two test samples. Note that only a single 2D slice of the 3D volumetric domain is shown here, focusing on the wake region behind the car. In practice, the network predicts velocity at any location within the full 3D domain, not just on this slice.

Figure 3: Velocity magnitude for two test samples, arranged in two columns (left and right). For each sample, the top row displays the simulated velocity field, the middle row shows the prediction from the network, and the bottom row presents the error between the two.

3. Scalar Predictions: Drag Coefficient
The drag coefficient (Cd) is the most critical parameter in automotive aerodynamics, as reducing it directly translates to lower fuel consumption in combustion vehicles and increased range in electric vehicles. Using the same underlying architecture, our model achieved state-of-the-art performance in Cd prediction. In addition to MSE and MAE, we reported the Maximum Absolute Error (Max AE) to reflect worst-case accuracy. We also included the Coefficient of Determination (R² score), which measures the proportion of variance explained by the model. An R² value of 1 indicates a perfect fit to the target data.

Rank | Deep Learning Model | MSE (×10⁻⁵) | MAE (×10⁻³) | Max AE (×10⁻²) | R²
#1 | Neural Concept | 0.8 | 2.22 | 1.13 | 0.978
#2 | TripNet | 9.1 | 7.19 | 7.70 | 0.957
#3 | PointNet | 14.9 | 9.60 | 12.45 | 0.643
#4 | RegDGCNN | 14.2 | 9.31 | 12.79 | 0.641
#5 | GCNN | 17.1 | 10.43 | 15.03 | 0.596

On the official split, the model shows tight agreement with CFD (R² of 0.978) across the test set, which is sufficient for early design screening, where engineers need to rank variants confidently and spot meaningful gains without running full simulations for every change.

4. Compute Efficiency and Azure HPC & AI Collaboration
Executing the full DrivAerNet++ benchmark at industrial scale required Neural Concept’s full software and infrastructure stack combined with seamless cloud integration on Microsoft Azure to dynamically scale computing resources on demand.
The entire pipeline runs natively on Microsoft Azure and can scale within minutes, allowing us to process new industrial datasets that contain thousands of geometries without complex capacity planning.

Dataset Scale and Ingestion
The DrivAerNet++ dataset contains 8,000 car designs along with their corresponding CFD simulations. The raw dataset occupies approximately 39 TB of storage. Generating the simulations required a total of about 3 million CPU hours by MIT’s DeCoDE Lab. Ingestion into Neural Concept’s platform is the first step of the pipeline. To convert the raw data into the platform’s native format, we use a Conversion task that transforms raw files into the platform’s optimized native format. This task was parallelized with 128 workers, each allocated 5 GB of RAM. As a result, the entire conversion process was completed in approximately one hour. After converting the relevant data (car geometry, wall shear stress, pressure, and velocity), the full dataset occupies approximately 3 TB in Neural Concept’s native format.

Data Pre-Processing
Pre-processing the dataset required both large-scale parallelization and the application of our domain-specific best practices. During this phase, workloads were distributed across multiple compute nodes, with peak memory usage reaching approximately 1.5 TB of RAM. The pre-processing pipeline consists of two main stages. In the first stage, we repaired the car meshes and pre-computed geometric features needed for training. The second stage involved filtering the volumetric domain and re-sampling points to follow a spatial distribution that is more efficient for training our deep learning model. We scaled the compute resources so that each of the two stages in the pipeline completes in 1 to 3 hours when processing the full dataset. The first stage is the most computationally intensive. To handle it efficiently, we parallelized the task across 256 independent workers, each allocated 6 GB of RAM.

Model Training and Deployment
While we use state-of-the-art hardware for training, our performance gains come primarily from model design. Once trained, the model remains lightweight and cost-effective to run. Training was performed on an Azure Standard_NC96ads_A100_v4 node, which provided access to four A100 GPUs, each with 80 GB of memory. The model was trained for approximately 24 hours. Neural Concept’s Geometric Regressor achieved the best reported performance on the official benchmark for surface pressure, wall shear stress, volumetric velocity and drag prediction.
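For readers reproducing the evaluation, the scalar metrics reported above (MSE, MAE, Max AE and R²) are straightforward to compute. The short sketch below uses synthetic placeholder values rather than DrivAerNet++ data and is not Neural Concept's evaluation code.

import numpy as np

def cd_metrics(y_true: np.ndarray, y_pred: np.ndarray) -> dict:
    err = y_pred - y_true
    mse = float(np.mean(err ** 2))
    mae = float(np.mean(np.abs(err)))
    max_ae = float(np.max(np.abs(err)))
    # R^2: fraction of the variance in the ground-truth Cd explained by the model.
    r2 = 1.0 - float(np.sum(err ** 2)) / float(np.sum((y_true - y_true.mean()) ** 2))
    return {"MSE": mse, "MAE": mae, "Max AE": max_ae, "R2": r2}

# Illustrative only: synthetic drag coefficients in a plausible range.
rng = np.random.default_rng(0)
y_true = rng.uniform(0.25, 0.35, size=1200)
y_pred = y_true + rng.normal(0.0, 0.003, size=1200)
print(cd_metrics(y_true, y_pred))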
mpi-stage: High-Performance File Distribution for HPC Clusters
When running containerized workloads on HPC clusters, one of the first problems you hit is getting container images onto the nodes quickly and repeatably. A .sqsh is a Squashfs image (commonly used by container runtimes on HPC). In some environments you can run a Squashfs image directly from shared storage, but at scale that often turns the shared filesystem into a hot spot. Copying the image to local NVMe keeps startup time predictable and avoids hundreds of nodes hammering the same source during job launch. In this post, I'll introduce mpi-stage, a lightweight tool that uses MPI broadcasts to distribute large files across cluster nodes at speeds that can saturate the backend network.

The Problem: Staging Files at Scale
On an Azure CycleCloud Workspace for Slurm cluster with GB300 GPU nodes, I needed to stage a large Squashfs container image from shared storage onto each node's local NVMe storage before launching training jobs. At small scale you can often get away with ad-hoc copies, but once hundreds of nodes are all trying to read the same source file, the shared source filesystem quickly becomes the bottleneck. I tried several approaches:

Attempt 1: Slurm's sbcast
Slurm's built-in sbcast seemed like the natural choice. In my quick testing it was slower than I wanted, and the overwrite/skip-existing behavior didn't match the "fast no-op if already present" workflow I was after. I didn't spend much time exploring all the configuration options before moving on.

Attempt 2: Shell Script Fan-Out
I wrote a shell script using a tree-based fan-out approach: copy to N nodes, then each of those copies to N more, and so on. This worked and scaled reasonably, but had some drawbacks:
Multiple stages: The script required orchestrating multiple rounds of copy commands, adding complexity
Source filesystem stress: Even with fan-out, the initial copies still hit the source filesystem simultaneously — a fan-out of 4 meant 4 nodes competing for source bandwidth
Frontend network: Copies went over the Ethernet network by default — I could have configured IPoIB, but that added more setup

The Solution: MPI Broadcasts
The key insight was that MPI's broadcast primitive (MPI_Bcast) is specifically optimized for one-to-many data distribution. Modern MPI implementations like HPC-X use tree-based algorithms that efficiently utilize the high-bandwidth, low-latency InfiniBand network. With mpi-stage:
Single source read: Only one node reads from the source filesystem
Backend network utilization: Data flows over InfiniBand using optimized MPI collectives
Intelligent skipping: Nodes that already have the file (verified by size or checksum) skip the copy entirely
Combined, this keeps the shared source (NFS, Lustre, blobfuse, etc.) from being hammered by many concurrent readers while still taking full advantage of the backend fabric.

How It Works
mpi-stage is designed around a simple workflow: the source node reads the file in chunks and streams each chunk via MPI_Bcast, and destination nodes write each chunk to local storage immediately upon receipt. This streaming approach means the entire file never needs to fit in memory — only a small buffer is required. (A minimal sketch of this chunked-broadcast pattern appears near the end of this post.)

Key Features
Pre-copy Validation
Before any data is transferred, each node checks if the destination file already exists and matches the source.
You can choose between:
Size check (default): Fast comparison of file sizes—sufficient for most use cases
Checksum: Stronger validation, but requires reading the full file and is therefore slower
If all nodes already have the correct file, mpi-stage completes in milliseconds with no data transfer.

Double-Buffered Transfers
The implementation uses double-buffered, chunked transfers to overlap network communication with disk I/O. While one buffer is being broadcast, the next chunk is being read from the source.

Post-copy Validation
Optionally verify that all nodes received the file correctly after the copy completes.

Single-Writer Per Node
The tool enforces one MPI rank per node to prevent filesystem contention and ensure predictable performance.

Real-World Performance
In one run using 156 GPU nodes, distributing a container image achieved approximately 3 GB/s effective distribution rate (file_size/time), completing in just over 5 seconds:

[0] Copy required: yes
[0] Starting copy phase (source writes: yes)
[0] Copy complete, Bandwidth: 3007.14 MB/s
[0] Post-validation complete
[0] Timings (s):
    Topology check:  5.22463
    Source metadata: 0.00803746
    Pre-validation:  0.0046786
    Copy phase:      5.21189
    Post-validation: 2.2944e-05
    Total time:      5.2563

Because every node writes the file to its own local NVMe, the cumulative write rate across the cluster is roughly this number times the node count: ~3 GB/s × 156 ≈ ~468 GB/s of total local writes.

Workflow: Container Image Distribution
The primary use case is distributing Squashfs images to local NVMe before launching containerized workloads. Run mpi-stage as a job step before your main application:

#!/bin/bash
#SBATCH --job-name=my-training-job
#SBATCH --ntasks-per-node=1
#SBATCH --exclusive

# Stage the container image
srun --mpi=pmix ./mpi_stage \
    --source /shared/images/pytorch.sqsh \
    --dest /nvme/images/pytorch.sqsh \
    --pre-validate size \
    --verbose

# Run the actual job (from local NVMe - much faster!)
srun --container-image=/nvme/images/pytorch.sqsh ...

mpi-stage will create the destination directory if it doesn't exist. If your container runtime supports running the image directly from shared storage, you may not strictly need this step—but staging to local NVMe tends to be faster and more predictable at large scale. Because of the pre-validation, you can include this step in every job script without penalty—if the image is already present, it completes in milliseconds.

Getting Started

git clone https://github.com/edwardsp/mpi-stage.git
cd mpi-stage
make

For detailed usage and options, see the README.

Summary
mpi-stage started as a solution to a very specific problem—staging large container images efficiently across a large GPU cluster—but the same pattern may be useful in other scenarios where many nodes need the same large file. By using MPI broadcasts, only a single node reads from the source filesystem, while data is distributed over the backend network using optimized collectives. In practice, this can significantly reduce load on shared filesystems and cloud-backed mounts, such as Azure Blob Storage accessed via blobfuse2, where hundreds of concurrent readers can otherwise become a bottleneck. While container images were the initial focus, this approach could also be applied to staging training datasets, distributing model checkpoints or pretrained weights, or copying large binaries to local NVMe before a job starts. Anywhere that a “many nodes, same file” pattern exists is a potential fit.
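To make the chunked-broadcast idea concrete, here is a minimal mpi4py sketch of the same pattern: one rank reads from shared storage and every rank (one per node, as mpi-stage enforces) writes the stream to a local path. It omits mpi-stage's double buffering, checksums and error handling, and the paths are only examples; it is an illustration of the approach, not the tool's actual implementation.

from mpi4py import MPI
import os

CHUNK = 64 * 1024 * 1024  # 64 MiB broadcast buffer

def stage(source: str, dest: str) -> None:
    comm = MPI.COMM_WORLD
    rank = comm.Get_rank()

    # Only rank 0 ever touches the shared source filesystem.
    size = comm.bcast(os.path.getsize(source) if rank == 0 else None, root=0)

    # Fast no-op if every node already has a file of the right size.
    have = os.path.exists(dest) and os.path.getsize(dest) == size
    if all(comm.allgather(have)):
        return

    buf = bytearray(CHUNK)
    src = open(source, "rb") if rank == 0 else None
    with open(dest, "wb") as out:
        remaining = size
        while remaining > 0:
            n = min(CHUNK, remaining)
            if rank == 0:
                src.readinto(memoryview(buf)[:n])
            comm.Bcast([buf, n, MPI.BYTE], root=0)  # stream the chunk over the fabric
            out.write(memoryview(buf)[:n])          # every rank writes to local NVMe
            remaining -= n
    if src is not None:
        src.close()

if __name__ == "__main__":
    stage("/shared/images/pytorch.sqsh", "/nvme/images/pytorch.sqsh")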
If you're running large-scale containerized workloads on Azure HPC infrastructure, give it a try. If you use mpi-stage in other workflows, I'd love to hear what worked (and what didn't). Feedback and contributions are welcome. Have questions or feedback? Leave a comment below or open an issue on GitHub.
Monitoring HPC & AI Workloads on Azure H/N VMs Using Telegraf and Azure Monitor (GPU & InfiniBand)
As HPC & AI workloads continue to scale in complexity and performance demands, ensuring visibility into the underlying infrastructure becomes critical. This guide presents an essential monitoring solution for AI infrastructure deployed on Azure RDMA-enabled virtual machines (VMs), focusing on NVIDIA GPUs and Mellanox InfiniBand devices. By leveraging the Telegraf agent and Azure Monitor, this setup enables real-time collection and visualization of key hardware metrics, including GPU utilization, GPU memory usage, InfiniBand port errors, and link flaps. It provides operational insights vital for debugging, performance tuning, and capacity planning in high-performance AI environments. In this blog, we'll walk through the process of configuring Telegraf to collect and send GPU and InfiniBand monitoring metrics to Azure Monitor. This end-to-end guide covers all the essential steps to enable robust monitoring for NVIDIA GPUs and Mellanox InfiniBand devices, empowering you to track, analyze, and optimize performance across your HPC & AI infrastructure on Azure.

DISCLAIMER: This is an unofficial configuration guide and is not supported by Microsoft. Please use it at your own discretion. The setup is provided "as-is" without any warranties, guarantees, or official support.

While Azure Monitor offers robust monitoring capabilities for CPU, memory, storage, and networking, it does not natively support GPU or InfiniBand metrics for Azure H- or N-series VMs. To monitor GPU and InfiniBand performance, additional configuration using third-party tools—such as Telegraf—is required. As of the time of writing, Azure Monitor does not include built-in support for these metrics without external integrations.

🔔 Update: Supported Monitoring Option Now Available
Update (December 2025): At the time this guide was written, monitoring InfiniBand (IB) and GPU metrics on Azure H-series and N-series VMs required a largely unofficial approach using Telegraf and Azure Monitor. Microsoft has since introduced a supported solution: Azure Managed Prometheus on VM / VM Scale Sets (VMSS), currently available in private preview. This new capability provides a native, managed Prometheus experience for collecting infrastructure and accelerator metrics directly from VMs and VMSS. It significantly simplifies deployment, lifecycle management, and long-term support compared to custom Telegraf-based setups. For new deployments, customers are encouraged to evaluate Azure Managed Prometheus on VM / VMSS as the preferred and supported approach for HPC and AI workload monitoring. Official announcement: Private Preview: Azure Managed Prometheus on VM / VMSS

Step 1: Make changes in Azure to allow GPU and IB metrics to be sent from Telegraf agents on a VM or VMSS to Azure Monitor
Register the microsoft.insights resource provider in your Azure subscription.
Refer: Resource providers and resource types - Azure Resource Manager | Microsoft Learn

Step 2: Enable Managed Service Identities to authenticate an Azure VM or Azure VMSS
In this example we are using a managed identity for authentication. You can also use user-assigned managed identities or a service principal to authenticate the VM.
Refer: telegraf/plugins/outputs/azure_monitor at release-1.15 · influxdata/telegraf (github.com)

Step 3: Set Up the Telegraf Agent Inside the VM or VMSS to Send Data to Azure Monitor
In this example, I'll use an Azure Standard_ND96asr_v4 VM with the Ubuntu-HPC 2204 image to configure the environment for VMSS.
The Ubuntu-HPC 2204 image comes with pre-installed NVIDIA GPU drivers, CUDA, and InfiniBand drivers. If you opt for a different image, ensure that you manually install the necessary GPU drivers, CUDA toolkit, and InfiniBand driver. Next, download and run the gpu-ib-mon_setup.sh script to install the Telegraf agent on Ubuntu 22.04. This script will also configure the NVIDIA SMI input plugin and the InfiniBand input plugin, along with setting up the Telegraf configuration to send data to Azure Monitor.
Note: The gpu-ib-mon_setup.sh script is currently supported and tested only on Ubuntu 22.04.
Please read about the InfiniBand counters collected by Telegraf: https://enterprise-support.nvidia.com/s/article/understanding-mlx5-linux-counters-and-status-parameters

Run the following commands:

wget https://raw.githubusercontent.com/vinil-v/gpu-ib-monitoring/refs/heads/main/scripts/gpu-ib-mon_setup.sh -O gpu-ib-mon_setup.sh
chmod +x gpu-ib-mon_setup.sh
./gpu-ib-mon_setup.sh

Test the Telegraf configuration by executing the following command:

sudo telegraf --config /etc/telegraf/telegraf.conf --test

Step 4: Creating Dashboards in Azure Monitor to Check NVIDIA GPU and InfiniBand Usage
Telegraf includes an output plugin specifically designed for Azure Monitor, allowing custom metrics to be sent directly to the platform. Since Azure Monitor supports a metric resolution of one minute, the Telegraf output plugin aggregates metrics into one-minute intervals and sends them to Azure Monitor at each flush cycle. Metrics from each Telegraf input plugin are stored in a separate Azure Monitor namespace, typically prefixed with Telegraf/ for easy identification.

To visualize NVIDIA GPU usage, go to the Metrics section in the Azure portal:
Set the scope to your VM.
Choose the Metric Namespace as Telegraf/nvidia-smi.
From there, you can select and display various GPU metrics such as utilization, memory usage, temperature, and more. In this example we are using the GPU memory_used metric.
Use filters and splits to analyze data across multiple GPUs or over time.

To monitor InfiniBand performance, repeat the same process:
In the Metrics section, set the scope to your VM.
Select the Metric Namespace as Telegraf/infiniband.
You can visualize metrics such as port status, data transmitted/received, and error counters. In this example, we are using the link flap metric (link_downed) to check for InfiniBand link flaps.
Use filters to break down the data by port or metric type for deeper insights.

Link_downed Metric
Note: The link_downed metric with Aggregation: Count returns incorrect values. Use the Max or Min aggregations instead.

Port_rcv_data metrics

Creating custom dashboards in Azure Monitor with both the Telegraf/nvidia-smi and Telegraf/infiniband namespaces allows for unified visibility into GPU and InfiniBand performance (a programmatic example of querying these custom metrics appears at the end of this post).

Testing InfiniBand and GPU Usage
If you're testing GPU metrics and need a reliable way to simulate multi-GPU workloads—especially over InfiniBand—here’s a straightforward solution using the NCCL benchmark suite. This method is ideal for verifying GPU and network monitoring setups. The NCCL benchmarks and OpenMPI are part of the Ubuntu HPC 22.04 image. Update the variables according to your environment and update the hostfile with the hostnames.
module load mpi/hpcx-v2.13.1
export CUDA_VISIBLE_DEVICES=2,3,0,1,6,7,4,5
mpirun -np 16 --map-by ppr:8:node -hostfile hostfile \
    -mca coll_hcoll_enable 0 --bind-to numa \
    -x NCCL_IB_PCI_RELAXED_ORDERING=1 \
    -x LD_LIBRARY_PATH=/usr/local/nccl-rdma-sharp-plugins/lib:$LD_LIBRARY_PATH \
    -x CUDA_DEVICE_ORDER=PCI_BUS_ID \
    -x NCCL_SOCKET_IFNAME=eth0 \
    -x NCCL_TOPO_FILE=/opt/microsoft/ndv4-topo.xml \
    -x NCCL_DEBUG=WARN \
    /opt/nccl-tests/build/all_reduce_perf -b 8 -e 8G -f 2 -g 1 -c 1

Alternate: GPU Load Simulation Using TensorFlow
If you're looking for a more application-like load (e.g., distributed training), I’ve prepared a script that sets up a multi-GPU TensorFlow training environment using Anaconda. This is a great way to simulate real-world GPU workloads and validate your monitoring pipelines. To get started, run the following:

wget -q https://raw.githubusercontent.com/vinil-v/gpu-monitoring/refs/heads/main/scripts/gpu_test_program.sh -O gpu_test_program.sh
chmod +x gpu_test_program.sh
./gpu_test_program.sh

With either method (NCCL benchmarks or TensorFlow training), you’ll be able to simulate realistic GPU usage and validate your GPU and InfiniBand monitoring setup with confidence. Happy testing!

References:
Ubuntu HPC on Azure
ND A100 v4-series GPU VM Sizes
Telegraf Azure Monitor Output Plugin (v1.15)
Telegraf NVIDIA SMI Input Plugin (v1.15)
Telegraf InfiniBand Input Plugin Documentation
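As a programmatic complement to the portal dashboards described above, the custom namespaces can also be queried with the azure-monitor-query SDK. A hedged sketch follows; the VM resource ID is a placeholder, and the namespace and metric name follow the Telegraf/nvidia-smi example used earlier.

from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import MetricsQueryClient, MetricAggregationType

client = MetricsQueryClient(DefaultAzureCredential())

vm_id = ("/subscriptions/<sub-id>/resourceGroups/<rg>/providers/"
         "Microsoft.Compute/virtualMachines/<vm-name>")  # placeholder resource ID

response = client.query_resource(
    vm_id,
    metric_names=["memory_used"],            # GPU memory metric from the nvidia-smi plugin
    metric_namespace="Telegraf/nvidia-smi",  # use "Telegraf/infiniband" for IB counters
    timespan=timedelta(hours=1),
    granularity=timedelta(minutes=1),
    aggregations=[MetricAggregationType.MAXIMUM],
)

for metric in response.metrics:
    for series in metric.timeseries:
        for point in series.data:
            print(metric.name, point.timestamp, point.maximum)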
Performance and Scalability of Azure HBv5-series Virtual Machines
Azure HBv5-series virtual machines (VMs) for CPU-based high performance computing (HPC) are now Generally Available. This blog provides in-depth information about the technical underpinnings, performance, cost, and management implications of these HPC-optimized VMs. Azure HBv5 VMs bring leadership levels of performance, cost optimization, and server (VM) consolidation for a variety of workloads driven by memory performance, such as computational fluid dynamics, weather simulation, geoscience simulations, and finite element analysis. For these applications, and compared to HBv4 VMs, previously the highest performance offering for these workloads, HBv5 provides up to:
5x higher performance for CFD workloads with 43% lower costs
3.2x higher performance for weather simulation with 16% lower costs
2.8x higher performance for geoscience workloads at the same costs

HBv5-series Technical Overview & VM Sizes
Each HBv5 VM features several new technologies for HPC customers, including:
Up to 6.6 TB/s of memory bandwidth (STREAM TRIAD) and 432 GB memory capacity
Up to 368 physical cores per VM (user configurable) with custom AMD EPYC CPUs, Zen4 microarchitecture (SMT disabled)
Base clock of 3.5 GHz (~1 GHz higher than other 96-core EPYC CPUs), and boost clock of 4 GHz across all cores
800 Gb/s NVIDIA Quantum-2 InfiniBand (4 x 200 Gb/s CX-7) (~2x higher than HBv4 VMs)
180 Gb/s Azure Accelerated Networking (~2.2x higher than HBv4 VMs)
15 TB local NVMe SSD with up to 50 GB/s (read) and 30 GB/s (write) of bandwidth (~4x higher than HBv4 VMs)

The highlight feature of HBv5 VMs is their use of high-bandwidth memory (HBM). HBv5 VMs utilize a custom AMD CPU that increases memory bandwidth by ~9x v. dual-socket 4th Gen EPYC (Zen4, “Genoa”) server platforms, and ~7x v. dual-socket EPYC (Zen5, “Turin”) server platforms, respectively. HBv5 delivers similar levels of memory bandwidth improvement compared to the highest-end alternatives from the Intel Xeon and ARM CPU ecosystems. HBv5-series VMs are available in the sizes with the specifications shown below. Just like existing H-series VMs, HBv5-series includes constrained-cores VM sizes, enabling customers to optimize their VM dimensions for a variety of scenarios:
ISV licensing constraining a job to a targeted number of cores
Maximum performance per VM or maximum performance per core
Minimum RAM per core (1.2 GB, suitable for strong scaling workloads) to maximum memory per core (9 GB, suitable for large datasets and weak scaling workloads)

Table 1: Technical specifications of HBv5-series VMs
Note: Maximum clock frequencies (FMAX) are based on product specifications of the AMD EPYC 9V64H processor. Clock frequencies experienced by a customer are a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application. For more information see the official documentation for HBv5-series VMs.

Microbenchmark Performance
This section focuses on microbenchmarks that characterize the performance of the memory subsystem, compute capabilities, and InfiniBand network of HBv5 VMs.

Memory & Compute Performance
To capture synthetic performance, we ran the following industry standard benchmarks:
STREAM – memory bandwidth
High Performance Conjugate Gradient (HPCG) – sparse linear algebra
High Performance Linpack (HPL) – dense linear algebra
Absolute results and comparisons to HBv4 VMs are shown in Table 2, below.
Table 2: Results of HBv5 running the STREAM, HPCG, and HPL benchmarks.
Note: STREAM was run with the following CLI parameters:
OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./amd_zen_stream
STREAM data size: 2621440000 bytes

InfiniBand Networking Performance
Each HBv5-series VM is equipped with four NVIDIA Quantum-2 network interface cards (NICs), each operating at 200 Gb/s for an aggregate bandwidth of 800 Gb/s per VM (node). We ran the industry standard IB perftest and OSU benchmark tests across two (2) HBv5-series VMs, as depicted in the results shown in Figures 1-3, below.
Note: all results below are for a single 200 Gb/s (uni-directional) link only. At a VM level, all bandwidth results below are 4x higher as there are four (4) InfiniBand links per HBv5 server.

Unidirectional bandwidth: numactl -c 0 ib_send_bw -aF -q 2
Figure 1: results showing 99% achieved uni-directional bandwidth v. theoretical peak.

Bi-directional bandwidth: numactl -c 0 ib_send_bw -aF -q 2 -b
Figure 2: results showing 99% achieved bi-directional bandwidth v. theoretical peak.

Latency:
Figure 3: results measuring as low as 1.25 microsecond latencies among HBv5 VMs. Latencies experienced by users will depend on message sizes employed by applications.

Application Performance, Cost/Performance, and Server (VM) Consolidation
This section focuses on characterizing HBv5-series VMs when running common, real-world HPC applications, with an emphasis on those known to be meaningfully bound by memory performance, as that is the focus of the HB-series family. We characterize HBv5 below in three (3) ways of high relevance to customer interests:
Performance (“how much faster can it do the work”)
Cost/Performance (“how much can it reduce the costs to complete the work”)
Fleet consolidation (“how much can a customer simplify the size and scale of compute fleet management while still being able to do the work”)
Where possible, we have included comparisons to other Azure HPC VMs, including:
Azure HBv4/HX series with 176 physical cores of 4th Gen AMD EPYC CPUs with 3D V-Cache (“Genoa-X”) (HBv4 specifications, HX specifications)
Azure HBv3 with 120 physical cores of 3rd Gen AMD EPYC CPUs with 3D V-Cache (“Milan-X”) (HBv3 specifications)
Azure HBv2 with 120 physical cores of 2nd Gen AMD EPYC (“Rome”) processors (full specifications)
Unless otherwise noted, all tests shown below were performed with:
Alma Linux 8.10 (image URN: almalinux:almalinux-hpc:8_10-hpc-gen2:latest); for scaling tests, Alma Linux 8.6 (image URN: almalinux:almalinux-hpc:8_6-hpc-gen2:latest)
NVIDIA HPC-X MPI
Further, all cost/performance comparisons leverage pricing rate info from the list-price, Pay-As-You-Go (PAYG) information found on Azure Linux Virtual Machines Pricing. Absolute costs will be a function of a customer’s workload, model, and consumption approach (PAYG v. Reserved Instance, etc.). That said, the relative cost/performance comparisons illustrated below should hold for the workload and model combinations shown, regardless of the consumption approach.

Computational Fluid Dynamics (CFD)
OpenFOAM – version 2306 with 100M Cell Motorbike case
Figure 4: HBv5 v. HBv4 on OpenFOAM with the Motorbike 100M cell case. HBv5 VMs provide a 4.8x performance increase over HBv4 VMs.
Figure 5: The cost to complete the OpenFOAM Motorbike 100M case is just 57% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running OpenFOAM cases similar to the size and complexity of the 100M cell Motorbike problem, organizations can consolidate their server (VM) deployments by approximately a factor of five (5).
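The cost-to-complete and consolidation figures above follow from simple arithmetic: a job that runs S times faster needs 1/S of the VM-hours, so its relative cost is the hourly price ratio divided by S. A small sketch of that relationship follows; the price ratio below is a placeholder, not actual Azure pricing, so consult the Azure pricing page for real rates.

def relative_cost_and_consolidation(speedup: float, price_ratio: float) -> dict:
    # speedup: HBv5 time-to-solution advantage over HBv4 for the same case.
    # price_ratio: HBv5 hourly price / HBv4 hourly price (placeholder input).
    return {
        "relative_cost": price_ratio / speedup,   # cost to complete vs. HBv4
        "consolidation_factor": speedup,          # ~how many fewer VMs for the same job throughput
    }

# Example: the OpenFOAM Motorbike 100M case above reports a 4.8x speedup; with a
# hypothetical price ratio of 2.7, the cost to complete would be ~56% of HBv4,
# close to the 57% shown in Figure 5.
print(relative_cost_and_consolidation(speedup=4.8, price_ratio=2.7))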
Palabos – version 1.01 with 3D Cavity, 1001 x 1001 x 1001 cells case
Figure 6: On Palabos, a Lattice Boltzmann solver using a streaming memory access pattern, HBv5 VMs provide a 4.4x performance increase over HBv4 VMs.
Figure 7: The cost to complete the Palabos 3D Cavity case is just 62% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running Palabos with cases similar to the size and complexity of the 1001 x 1001 x 1001 cell 3D Cavity problem, organizations can consolidate their server (VM) deployments by approximately a factor of ~4.5.

Ansys Fluent – version 2025 R2 with F1 Racecar 140M case
Figure 8: On Ansys Fluent, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.
Figure 9: The cost to complete the Ansys Fluent F1 Racecar 140M case is just 81% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running Ansys Fluent with cases similar to the size and complexity of the 140M cell F1 Racecar problem, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.5.

Siemens Star-CCM+ – version 17.04.005 with AeroSUV Steady Coupled 106M case
Figure 10: On Star-CCM+, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.
Figure 11: The cost to complete the Siemens Star-CCM+ AeroSUV Steady Coupled 106M case is just 81% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running Star-CCM+ with cases similar to the size and complexity of the 106M cell AeroSUV Steady Coupled case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.5.

Weather Modeling
WRF – version 4.2.2 with CONUS 2.5KM case
Figure 12: On WRF, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.
Figure 13: The cost to complete the WRF CONUS 2.5KM case is just 84% of what it costs to complete the same case on HBv4.
Above, we can see that for customers running WRF with cases similar to the size and complexity of the 2.5km CONUS case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.

Energy Research
Devito – version 4.8.7 with Acoustic Forward case
Figure 14: On Devito, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.
Figure 15: The cost to complete the Devito Acoustic Forward OP case is equivalent to what it costs to complete the same case on HBv4.
Above, we can see that for customers running Devito with cases similar to the size and complexity of the Acoustic Forward OP case, organizations can consolidate their server (VM) deployments by approximately a factor of ~3.

Molecular Dynamics
NAMD – version 2.15a2 with STMV 20M case
Figure 16: On NAMD, HBv5 VMs provide a 2.18x performance increase over HBv4 VMs.
Figure 17: The cost to complete the NAMD STMV 20M case is 26% higher on HBv5 than what it costs to complete the same case on HBv4.
Above, we can see that for customers running NAMD with cases similar to the size and complexity of the STMV 20M case, organizations can consolidate their server (VM) deployments by approximately a factor of ~2. Notably, NAMD is a compute-bound case, rather than a memory-performance-bound one. We include it here to illustrate that not all workloads are fit for purpose with HBv5. This latest Azure HPC VM is the fastest at this workload on the Microsoft Cloud, but does not benefit substantially from HBv5’s premium levels of memory bandwidth. NAMD would instead perform more cost efficiently with a CPU that supports AVX512 instructions natively or, much better still, a modern GPU.
Scalability of HBv5-series VMs
Weak Scaling
Weak scaling measures how well a parallel application or system performs when both the number of processing elements and the problem size increase proportionally, so that the workload per processor remains constant. Weak scaling cases are often employed when time-to-solution is fixed (e.g. it is acceptable to solve a problem within a specified period) but a user desires a simulation of higher fidelity or resolution. A common example is operational weather forecasting. To illustrate weak scaling on HBv5 VMs, we ran Palabos with the same 3D cavity problem as shown earlier:
Figure 18: On Palabos with the 3D Cavity model, HBv5 scales linearly as the 3D cavity size is proportionately increased.

Strong Scaling
Strong scaling is characterized by the efficiency with which execution time is reduced as the number of processor elements (CPUs, GPUs, etc.) is increased, while the problem size remains constant. Strong scaling cases are often employed when the fidelity or resolution of the simulation is acceptable, but a user requires faster time to completion. A common example is product engineering validation, when an organization wants to bring a product to market faster but must complete a broad range of validation and verification scenarios before doing so. To illustrate strong scaling on HBv5 VMs, we ran NAMD with two different problems, each intended to illustrate how expectations for strong scaling efficiency change depending on problem size and the ordering of computation v. communication in distributed memory workloads.
First, let us examine NAMD with the 20M STMV benchmark.
Figure 19: Strong scaling on HBv5 with the NAMD STMV 20M cell case.
As illustrated above, for strong scaling cases in which the compute time is continuously reduced (by leveraging more and more processor elements) but communication time remains constant, scaling efficiency will only stay high for so long. That principle is well represented by the STMV 20M case, for which parallel efficiency remains linear (i.e. cost/job remains flat) at two (2) nodes but degrades after that. This is because while compute is being sped up, the MPI time remains relatively flat. As such, the relatively static MPI time comes to dominate end-to-end wall clock time as VM scaling increases. Said another way, HBv5 features so much compute performance that even for a moderate-sized problem like STMV 20M, scaling the infrastructure can only take performance so far, and cost/job will begin to increase.
If we examine HBv5 against the 210M cell case, however, with 10.5x as many elements to compute as its 20M case sibling, the scaling efficiency story changes significantly.
Figure 20: On NAMD with the STMV 210M cell case, HBv5 scales linearly out to 32 VMs (or more than 11,000 CPU cores).
As illustrated above, larger cases with significant compute requirements will continue to scale efficiently with larger amounts of HBv5 infrastructure. While MPI time remains relatively flat for this case (as is the case with the smaller STMV 20M case), the compute demands remain the dominant fraction of end-to-end wall clock time. As such, HBv5 scales these problems with very high levels of efficiency, and in doing so job costs to the user remain flat despite up to 8x as many VMs being leveraged compared to the four (4) VM baseline.
The key takeaways for strong scaling scenarios are two-fold.
First, users should run scaling tests with their applications and models to find a sweet spot of faster performance with constant job costs. This will depend heavily on model size. Second, as new and very high-end compute platforms like HBv5 emerge that accelerate compute time, application developers will need to find ways to reduce the share of wall clock time lost to communication (MPI). Recommended approaches include using fewer MPI processes and, ideally, restructuring applications to overlap communication with compute phases.
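The strong-scaling behavior described above can be illustrated with a toy model in which compute time shrinks with VM count while MPI time stays roughly flat; efficiency and cost per job degrade once communication dominates. All numbers below are hypothetical and only illustrate the shape of the curves, not measured HBv5 data.

def strong_scaling(compute_1vm_s: float, mpi_s: float, vms: int):
    # Toy model: perfectly divisible compute plus a flat communication term.
    wall = compute_1vm_s / vms + mpi_s
    baseline = compute_1vm_s + mpi_s
    efficiency = (baseline / wall) / vms           # parallel efficiency vs. 1 VM
    relative_cost_per_job = vms * wall / baseline  # VM-seconds vs. the 1-VM baseline
    return wall, efficiency, relative_cost_per_job

for n in (1, 2, 4, 8, 16, 32):
    wall, eff, cost = strong_scaling(compute_1vm_s=3600.0, mpi_s=120.0, vms=n)
    print(f"{n:>3} VMs: wall {wall:7.1f} s  efficiency {eff:4.2f}  cost/job {cost:4.2f}x")

With these placeholder inputs, efficiency stays near 1 (and cost/job near 1x) only while the compute term dominates, which mirrors the STMV 20M vs. 210M contrast discussed above.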
Breaking the Million-Token Barrier: The Technical Achievement of Azure ND GB300 v6
Azure ND GB300 v6 Virtual Machines with NVIDIA GB300 NVL72 rack-scale systems achieve unprecedented performance of 1,100,000 tokens/s on Llama 2 70B inference, beating the previous Azure ND GB200 v6 record of 865,000 tokens/s by 27%.
by Mark Gitau (Software Engineer)

Introduction

For the MLPerf Inference v5.1 submission, Azure shared performance results on the new ND GB200 v6 virtual machines. A single ND GB200 v6 VM on Azure is powered by two NVIDIA Grace CPUs and four NVIDIA Blackwell B200 GPUs. This document highlights Azure's MLPerf Inference v5.1 results and outlines the steps to run these benchmarks on Azure. These MLPerf™ benchmark results demonstrate Azure's commitment to providing our customers with the latest GPU offerings of the highest quality.

Highlights from the MLPerf Inference v5.1 benchmark results include:

Azure had the highest Llama 2 70B Offline submission with 52,000 tokens/s on a single ND GB200 v6 virtual machine. This is an 8% single-node performance increase over our previous record and extrapolates to 937,098 tokens/s on a full NVL72 rack.

Azure's results for Llama 3.1 405B are on par with the best submitters (1% difference), cloud and on-premises, at 847 tokens/s.

How to replicate the results in Azure

Pre-requisites: ND GB200 v6-series (single node): Deploy and set up a virtual machine on Azure

Set up the environment

First, export the path to the directory where the benchmarks will be run. For ND GB200 v6-series (single node), create a directory called mlperf in /mnt/nvme.

Set the mlperf scratch space: export MLPERF_SCRATCH_PATH=/mnt/nvme/mlperf

Clone the MLPerf repository inside the scratch path: git clone https://github.com/mlcommons/inference_results_v5.1.git

Then create empty directories in your scratch space to house the data: mkdir $MLPERF_SCRATCH_PATH/data $MLPERF_SCRATCH_PATH/models $MLPERF_SCRATCH_PATH/preprocessed_data

Download the models & datasets

Download the models inside the models directory you created in the previous step. This will take a while because the weights are large.

Llama 2 70B model
Llama 3.1 405B model

Download the preprocessed datasets for both models:

Llama 2 70B datasets
Llama 3.1 405B datasets

Prepare the datasets for Llama 2 70B: inference_results_v5.1/closed/Azure/code/llama2-70b/tensorrt at main · mlcommons/inference_results_v5.1

Prepare the datasets for Llama 3.1 405B: inference_results_v5.1/closed/Azure/code/llama3.1-405b/tensorrt at main · mlcommons/inference_results_v5.1

Build & launch MLPerf container

Export the submitter and system name: export SUBMITTER=Azure SYSTEM_NAME=ND_GB200_v6

Enter the container by entering the closed/Azure directory and running: make prebuild

Inside the container, run: make build

Build engines & run benchmarks

Make sure you are still in the closed/Azure directory of the MLPerf repository.

To build the engines for both Llama 3.1 405B and Llama 2 70B: make generate_engines RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server"

To run the benchmarks for both Llama 3.1 405B and Llama 2 70B: make run_harness RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server"

For convenience, the host-side commands above are stitched together into a single sketch below.
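This sketch only repeats the commands already documented above; it assumes the model weights and preprocessed datasets have been downloaded and prepared separately, and it stops at make prebuild because the remaining steps run inside the MLPerf container:

#!/bin/bash
# Sketch: host-side setup for the MLPerf Inference v5.1 benchmarks on a single
# ND GB200 v6 VM, stitching together the commands documented above.
set -e

export MLPERF_SCRATCH_PATH=/mnt/nvme/mlperf
mkdir -p "$MLPERF_SCRATCH_PATH"/data "$MLPERF_SCRATCH_PATH"/models "$MLPERF_SCRATCH_PATH"/preprocessed_data

cd "$MLPERF_SCRATCH_PATH"
[ -d inference_results_v5.1 ] || git clone https://github.com/mlcommons/inference_results_v5.1.git

export SUBMITTER=Azure SYSTEM_NAME=ND_GB200_v6
cd inference_results_v5.1/closed/Azure

# This drops you into the MLPerf container; run the remaining commands inside it:
#   make build
#   make generate_engines RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server"
#   make run_harness RUN_ARGS="--benchmarks=llama2-70b,llama3.1-405b --scenarios=offline,server"
make prebuild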
MLPerf from MLCommons®

MLCommons® is an open engineering consortium of AI leaders from academia, research, and industry whose mission is to "build fair and useful benchmarks" that provide unbiased evaluations of training and inference performance for hardware, software, and services, all conducted under predetermined conditions. MLPerf™ Inference benchmarks consist of compute-intensive AI workloads that simulate realistic usage of the systems, making the results very influential in technology management's buying decisions.

Azure's ND GB200 v6 Delivers Record Performance for Inference Workloads

Achieving peak AI performance requires both cutting-edge hardware and a finely optimized infrastructure. Azure's ND GB200 v6 Virtual Machines, accelerated by NVIDIA GB200 Blackwell GPUs, have already demonstrated world-record performance of 865,000 tokens/s for inferencing on the industry-standard Llama 2 70B model.

Performance analysis of DeepSeek R1 AI Inference using vLLM on ND-H100-v5
Introduction

The DeepSeek R1 model represents a new frontier in large-scale reasoning for AI applications. Designed to tackle complex inference tasks, R1 pushes the boundaries of what's possible, but not without significant infrastructure demands. To deploy DeepSeek R1 effectively in an inference service like vLLM, high-performance hardware is essential. Specifically, the model requires two Azure ND_H100_v5 nodes, each equipped with 8 NVIDIA H100 GPUs, for a total of 16 H100s. These nodes are interconnected via InfiniBand and NVLink, ensuring the bandwidth and latency characteristics necessary to support the model's massive memory footprint and parallel processing needs.

In this post, we present inference benchmark results for DeepSeek R1, measuring performance across GPU utilization, memory throughput, and interconnect efficiency. While R1 excels in reasoning tasks, it is important to recognize that such models are not universally optimal. For many general-purpose AI applications, smaller models like Llama 3.1 8B offer a compelling alternative, delivering sufficient accuracy and performance at a fraction of the cost. We explore the performance characteristics of DeepSeek R1 and help you decide when a large reasoning model is worth the investment, and when a leaner solution might be the better choice.

Benchmark environment

Two ND_H100_v5 nodes (16 H100 GPUs) were required to load the DeepSeek R1 model into GPU memory. vLLM was used to set up the online inference server with a single API server endpoint. vLLM bench (specifically the Hugging Face AI-MO/aimo-validation-aime dataset) was used to run the inference benchmark. Detailed specifications are available in the tables below.

Results

Reasoning model's additional chain-of-thought tokens

We show that DeepSeek R1 generates many additional tokens (for chain of thought), which help reasoning models answer complex logical questions. In this example DeepSeek R1 generates 1,162 completion tokens, but Llama 3.1 8B generates only 37. The example also demonstrates that while non-reasoning models may struggle with this seemingly simple logical question, reasoning models like DeepSeek R1 can solve it, though they require significantly more intermediate tokens to do so.

#!/bin/bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "deepseek-ai/DeepSeek-R1",
    "messages": [
      {"role": "user", "content": "9.11 and 9.8, which is greater? Explain your reasoning"}
    ]
  }'

Note: The REST API JSON output is formatted to make it more readable.

(./run_restapi_query.sh 2> /dev/null | python3 ./convert_json_2_txt.py >& run_query.out_$$)

Reasoning: Okay, so I need to figure out whether 9.11 is greater than 9.8 or if 9.8 is greater. Hmm, let me start by looking at both numbers. They both have a 9 in the units place, so that part is equal. Now, I need to compare the decimal parts: .11 and .8. Wait, but decimals can be tricky because the places matter. The first number is 9.11, which has a 1 in the tenths place and a 1 in the hundredths place. The second number is 9.8, which is the same as 9.80 when you add a zero to the hundredths place for comparison. So, 9.80 versus 9.11. Now, looking at the tenths place: 8 is greater than 1. Therefore, 9.80 should be greater than 9.11. But let me double-check because sometimes when numbers have different decimal lengths, it's easy to make a mistake. If I convert both to fractions, maybe that will help. 9.11 is the same as 9 + 0.11, and 0.11 is 11/100.
On the other hand, 9.8 is 9 + 0.8, which is 9 + 8/10. To compare 8/10 and 11/100, I can convert them to have the same denominator. 8/10 is equal to 80/100. Now, 80/100 is much larger than 11/100. So yes, 9.8 is indeed greater than 9.11. Another way to see it is by subtracting the two numbers. If I subtract 9.11 from 9.8, what do I get? Let's see: 9.8 - 9.11. Since 9.8 is the same as 9.80, subtracting 9.11 gives 0.69. Since the result is positive, that means 9.8 is larger. Wait, maybe I can also think about money to make it clearer. If I have $9.11 and you have $9.80, which is more? $9.80 is $9 and 80 cents, whereas $9.11 is $9 and 11 cents. Definitely, 80 cents is more than 11 cents, so $9.80 is more. That helps visualize it. Alternatively, using a number line. If I imagine a number line with 9.0, 9.1, 9.2,... up to 10.0. 9.11 is just a little past 9.1, while 9.8 is almost at 10.0. So clearly, 9.8 is much further to the right, meaning it's a larger number. I think that covers different methods: decimal place comparison, converting to fractions, subtraction, real-life money examples, and number line visualization. All of these methods point to 9.8 being greater than 9.11. I don't see any contradictions here. Maybe I can also consider writing both numbers with the same number of decimal places to make it even clearer. 9.11 is already two decimal places, and 9.8 can be written as 9.80. Now, comparing 9.80 and 9.11 digit by digit: - The units place: 9 vs. 9 (equal) - Tenths place: 8 vs. 1 (8 is greater) - Since the tenths place of 9.80 is greater, the entire number is larger, regardless of the hundredths place. So yes, that confirms it again. The key idea here is that when comparing decimals, you start from the leftmost digit after the decimal point and move right until you find a difference. In this case, the tenths place already shows that 9.8 is larger. I guess another common mistake could be thinking that because 11 is more than 8, 0.11 is more than 0.8, but that's not correct because the places are different. Tenths versus hundredths. 0.8 is equivalent to 0.80, which is way bigger than 0.11. To prevent that mistake, always align the decimal points and add trailing zeros if necessary. Then compare each digit one by one from left to right. That method avoids confusion. So, all in all, through multiple approaches, 9.8 is definitely greater than 9.11. Answer: To determine which number is greater between **9.11** and **9.8**, follow these steps: **Align the decimal places** by rewriting 9.8 as **9.80**. **Compare digit by digit** from left to right: - **Units place**: Both have **9** (equal). - **Tenths place**: **8** (in 9.80) vs. **1** (in 9.11). Since \(8 > 1\), **9.80 is greater** at this point. - No need to compare hundredths place once a difference is found. **Conclusion**: \(9.8\) (or \(9.80\)) is greater than \(9.11\) because **8 tenths** (\(0.8\)) is larger than **1 tenth + 1 hundredth** (\(0.11\)). \(\boxed{9.8 \text{ is greater}}\) Completion Info: Finish reason: stop Prompt tokens: 21 Completion tokens: 1162 Total tokens: 1183 When running using Llama 3.1 8B (Not reasoning model). Answer: To determine which is greater, we need to compare the two numbers. 9.11 is greater than 9.8 because 9.11 is a larger decimal value. 
Completion Info:
Finish reason: stop
Prompt tokens: 51
Completion tokens: 37
Total tokens: 88

Throughput and latency results

Cost comparison

In this cost analysis we use the ND-H100-v5 and ND-H200-v5 pay-as-you-go pricing in the South Central US region and the measured total throughput (tokens/sec) to compute the $/(1K tokens). A small helper sketch for this conversion is included at the end of the appendix. Note: ND-H200-v5 pricing was estimated at 20% more than ND-H100-v5 pricing.

Analysis

DeepSeek R1 is a large, complex reasoning model that is costlier and slower than smaller models. It needs 16 H100 GPUs for FP8 precision and generates many more intermediate tokens in its chain-of-thought process (about 31 times more than Llama 3.1 8B), but at a much slower rate (~54 times slower). Its latency is also higher, with TTFT (time to first token) and ITL (inter-token latency) being roughly 6 and 3 times slower, respectively.

The DeepSeek R1 model has small intranode and internode network requirements (~14% of available InfiniBand network bandwidth and <1% of available NVLink bandwidth were used). GPUs with higher memory bandwidth and higher FLOPS would help improve its performance.

The cost analysis shows that generating DeepSeek R1 tokens is ~54 times more expensive than Llama 3.1 8B on the same 16 H100 GPUs and ~34 times more expensive on 8 H200 GPUs. The DeepSeek R1 model is very capable, but due to its higher TCO it should only be used in specific AI applications that require its strong reasoning abilities.

Conclusion

The DeepSeek R1 model demonstrates exceptional reasoning capabilities, but its deployment demands substantial infrastructure and incurs high latency and cost. While it excels in generating detailed chains of thought, its throughput and efficiency lag significantly behind smaller models like Llama 3.1 8B. For applications requiring deep logical analysis, DeepSeek R1 is a powerful tool. However, for general-purpose inference tasks, more lightweight models offer better performance and cost-effectiveness. Strategic use of DeepSeek R1 should be reserved for scenarios where its advanced reasoning justifies the resource investment.

References

DeepSeek R1 model on Hugging Face: https://huggingface.co/deepseek-ai/DeepSeek-R1
vLLM GitHub repository: https://github.com/vllm-project/vllm
Azure ND H100 v5 documentation: https://learn.microsoft.com/en-us/azure/virtual-machines/nd-h100-v5-series
FlashInfer GitHub repository: https://github.com/flashinfer-ai/flashinfer
DeepGEMM GitHub repository: https://github.com/deepseek-ai/DeepGEMM
AI-MO validation dataset on Hugging Face: https://huggingface.co/datasets/AI-MO/aimo-validation-aime

Appendix

Install vLLM

curl -LsSf https://astral.sh/uv/install.sh | sh
uv venv myvllm --python 3.11 --seed
source myvllm/bin/activate
uv pip install vllm --torch-backend=auto
git clone https://github.com/flashinfer-ai/flashinfer.git --recursive
uv pip install ninja
cd flashinfer
uv pip install --no-build-isolation --verbose .

Install DeepSeek DeepEP

git clone https://github.com/vllm-project/vllm.git
cd ~/vllm/tools/ep_kernels
export CUDA_HOME=/usr/local/cuda-12.8 TORCH_CUDA_ARCH_LIST="9.0"   # 9.0 is for Hopper
bash install_python_libraries.sh 2>&1 | tee install_python_libraries.log_$$
sudo bash configure_system_drivers.sh 2>&1 | tee configure_system_drivers.log_$$
sudo reboot

Install DeepSeek DeepGEMM

git clone --recursive https://github.com/deepseek-ai/DeepGEMM.git
cd DeepGEMM
./install.sh 2>&1 | tee install.log_$$
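Before launching the serve scripts in the next section, it can be worth a quick sanity check that the components above installed cleanly and that the GPUs and InfiniBand ports are visible. This is a hedged sketch; the Python module names for FlashInfer and DeepGEMM are assumptions based on their repositories, not something specified in this article:

#!/bin/bash
# Hedged sanity-check sketch, run on each node after the installs above.
source myvllm/bin/activate

# GPUs and InfiniBand ports visible?
nvidia-smi -L
ibstat | grep -E "CA '|State"

# Python components import cleanly? (module names are assumptions)
python -c "import vllm; print('vllm', vllm.__version__)"
python -c "import flashinfer; print('flashinfer ok')"
python -c "import deep_gemm; print('DeepGEMM ok')"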
Configure DeepSeek R1 with vLLM on 2 ND_H100_v5

Second node configuration

Execute this script on the second node before the script on the primary node.

#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000

export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0

vllm serve $MODEL --port $PORT \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --data-parallel-start-rank 8 \
  --data-parallel-address 10.0.0.6 \
  --data-parallel-rpc-port 23345 \
  --headless \
  --max-model-len 32768 \
  --reasoning-parser deepseek_r1

Primary node configuration

#!/bin/bash
MODEL="deepseek-ai/DeepSeek-R1"
PORT=8000

export VLLM_LOGGING_LEVEL=INFO
export HF_HUB_CACHE=/home/azureuser/cgshared/hf_cache
#export VLLM_ALL2ALL_BACKEND=deepep_high_throughput
export VLLM_ALL2ALL_BACKEND=deepep_low_latency
export VLLM_USE_DEEP_GEMM=1
export GLOO_SOCKET_IFNAME=eth0

vllm serve $MODEL --port $PORT \
  --tensor-parallel-size 1 \
  --enable-expert-parallel \
  --data-parallel-size 16 \
  --data-parallel-size-local 8 \
  --data-parallel-address 10.0.0.6 \
  --data-parallel-rpc-port 23345 \
  --api-server-count 1 \
  --max-model-len 32768 \
  --reasoning-parser deepseek_r1

Install vLLM benchmark environment

cd vllm
uv pip install vllm[bench]

Run vLLM benchmark

#!/bin/bash
vllm bench serve \
  --backend vllm \
  --model deepseek-ai/DeepSeek-R1 \
  --endpoint /v1/completions \
  --dataset-name hf \
  --dataset-path AI-MO/aimo-validation-aime \
  --ramp-up-strategy linear \
  --ramp-up-start-rps 1 \
  --ramp-up-end-rps 10 \
  --num-prompts 400 \
  --seed 42
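Finally, as referenced in the cost comparison above, the $/(1K tokens) figure can be derived from an hourly VM price and the measured aggregate throughput. This is a minimal sketch; the price, throughput, and VM count are inputs supplied by the user, not published values:

#!/bin/bash
# Hedged sketch: convert measured aggregate throughput (tokens/s) and an hourly
# pay-as-you-go VM price into $ per 1K generated tokens.
PRICE_PER_HOUR="${1:?usage: cost_per_1k_tokens.sh PRICE_PER_HOUR_USD TOKENS_PER_SEC [NUM_VMS]}"
TOKENS_PER_SEC="${2:?missing tokens/s}"
NUM_VMS="${3:-1}"

awk -v p="$PRICE_PER_HOUR" -v t="$TOKENS_PER_SEC" -v n="$NUM_VMS" 'BEGIN {
  cost_per_hour   = p * n        # total $ per hour for all VMs serving the model
  tokens_per_hour = t * 3600     # aggregate tokens generated per hour
  printf "$ per 1K tokens: %.6f\n", (cost_per_hour / tokens_per_hour) * 1000
}'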