Azure High Performance Computing (HPC) Blog

Performance and Scalability of Azure HBv5-series Virtual Machines

jvenkatesh (Microsoft)
Nov 05, 2025

Article contributed by Amirreza Rastegari, Sai Kovouri, Zehra Naz, Michael Cui, Yifan Zhang, Neeva Sethi, Evan Burness, Jyothi Venkatesh, Joe Greenseid, Rob Walsh, Malcolm Doster, Chloe Gura, Jeff Yang

Azure HBv5-series virtual machines (VMs) for CPU-based high performance computing (HPC) are now Generally Available. This blog provides in-depth information about the technical underpinnings, performance, cost, and management implications of these HPC-optimized VMs.

Azure HBv5 VMs bring leadership levels of performance, cost optimization, and server (VM) consolidation to a variety of workloads driven by memory performance, such as computational fluid dynamics (CFD), weather simulation, geoscience simulations, and finite element analysis. For these applications, and compared to HBv4 VMs (previously the highest performance offering for these workloads), HBv5 provides up to:

  • 5x higher performance for CFD workloads with 43% lower costs
  • 3.2x higher performance for weather simulation with 16% lower costs
  • 2.8x higher performance for geoscience workloads at the same costs

HBv5-series Technical Overview & VM Sizes

Each HBv5 VM features several new technologies for HPC customers, including:

  • Up to 6.6 TB/s of memory bandwidth (STREAM TRIAD) and 432 GB memory capacity
  • Up to 368 physical cores per VM (user configurable) with custom AMD EPYC CPUs, Zen4 microarchitecture (SMT disabled)
  • Base clock of 3.5 GHz (~1 GHz higher than other 96-core EPYC CPUs), and Boost clock of 4 GHz across all cores
  • 800 Gb/s NVIDIA Quantum-2 InfiniBand (4 x 200 Gb/s CX-7) (~2x higher than HBv4 VMs)
  • 180 Gb/s Azure Accelerated Networking (~2.2x higher than HBv4 VMs)
  • 15 TB local NVMe SSD with up to 50 GB/s (read) and 30 GB/s (write) of bandwidth (~4x higher than HBv4 VMs)

The highlight feature of HBv5 VMs is their use of high-bandwidth memory (HBM). HBv5 VMs utilize a custom AMD CPU that increases memory bandwidth by ~9x compared to dual-socket 4th Gen EPYC (Zen4, “Genoa”) server platforms and by ~7x compared to dual-socket 5th Gen EPYC (Zen5, “Turin”) server platforms. HBv5 delivers similar levels of memory bandwidth improvement compared to the highest-end alternatives from the Intel Xeon and Arm CPU ecosystems.

HBv5-series VMs are available in the following sizes with specifications as shown below. Just like existing H-series VMs, HBv5-series includes constrained cores VM sizes, enabling customers to optimize their VM dimensions for a variety of scenarios:

  • ISV licensing constraining a job to a targeted number of cores
  • Maximum-performance-per-VM or maximum performance per core
  • Minimum RAM/core (1.2 GB, suitable for strong scaling workloads) to maximum memory per core (9 GB, suitable for large datasets and weak scaling workloads)

 Table 1: Technical specifications of HBv5-series VMs

 

Note: Maximum clock frequencies (FMAX) are based on product specifications of the AMD EPYC 9V64H processor. Clock frequencies experienced by a customer are a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application.

For more information, see the official documentation for HBv5-series VMs.
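
For readers who want to try the VMs, the sketch below shows one way to deploy an HBv5 node with the Azure CLI. The resource group, VM name, region, and size placeholders are illustrative; substitute the actual HBv5 size name from Table 1 and the official documentation.

az group create --name hbv5-demo-rg --location <region-with-HBv5>
az vm create \
  --resource-group hbv5-demo-rg \
  --name hbv5-node-01 \
  --size <HBv5-VM-size-from-Table-1> \
  --image almalinux:almalinux-hpc:8_10-hpc-gen2:latest \
  --accelerated-networking true \
  --admin-username azureuser \
  --generate-ssh-keys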

Microbenchmark Performance

This section focuses on microbenchmarks that characterize performance of the memory subsystem, compute capabilities, and InfiniBand network of HBv5 VMs.

Memory & Compute Performance

To capture synthetic performance, we ran the following industry standard benchmarks:

 

Absolute results and comparisons to HBv4 VMs are shown in Table 2, below:

Table 2: Results of HBv5 running the STREAM, HPCG, and HPL benchmarks.

Note: STREAM was run with the following CLI parameters:

OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./amd_zen_stream

STREAM data size: 2621440000 bytes
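
A comparable run can be reproduced with the reference stream.c source from the STREAM benchmark; the compiler flags and array size below are illustrative assumptions, not the exact build of the amd_zen_stream binary used above.

# build the reference STREAM benchmark with OpenMP; array size chosen to far exceed on-chip caches
gcc -Ofast -fopenmp -DSTREAM_ARRAY_SIZE=327680000 -DNTIMES=20 stream.c -o stream
OMP_NUM_THREADS=368 OMP_PROC_BIND=true OMP_PLACES=cores ./stream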

InfiniBand Networking Performance

Each HBv5-series VM is equipped with four NVIDIA Quantum-2 network interface cards (NICs), each operating at 200 Gb/s for an aggregate bandwidth of 800 Gb/s per VM (node).

We ran industry-standard InfiniBand microbenchmarks (IB perftest and OSU benchmarks) across two (2) HBv5-series VMs, with results shown in Figures 1-3, below:

 Note: all results below are for a single 200 Gb/s (uni-directional) link only. At a VM level, all bandwidth results below are 4x higher as there are four (4) InfiniBand links per HBv5 server.  

 Unidirectional bandwidth:

numactl -c 0 ib_send_bw -aF -q 2

Figure 1: results showing 99% achieved uni-directional bandwidth v. theoretical peak.

 

Bi-directional bandwidth:

numactl -c 0 ib_send_bw -aF -q 2 -b

Figure 2: results showing 99% achieved bi-directional bandwidth v. theoretical peak.

 

Latency:
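
The latency result below can be reproduced with the same perftest suite; a representative point-to-point invocation (an illustrative assumption, not necessarily the exact command used for Figure 3) is:

ib_send_lat -aF              # on the first (server) VM
ib_send_lat -aF <server>     # on the second (client) VM, pointing at the first VM's hostname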

 

Figure 3: results measuring as low as 1.25 microsecond latencies among HBv5 VMs. Latencies experienced by users will depend on message sizes employed by applications.

 

Application Performance, Cost/Performance, and Server (VM) Consolidation

This section focuses on characterizing HBv5-series VMs when running common, real-world HPC applications with an emphasis on those known to be meaningfully bound by memory performance as that is the focus of the HB-series family.
We characterize HBv5 below in three (3) ways of high relevance to customer interests:
  • Performance (“how much faster can it do the work”)
  • Cost/Performance (“how much can it reduce the costs to complete the work”)
  • Fleet consolidation (“how much can a customer simplify the size and scale of compute fleet management while still being able to do the work”)
Where possible, we have included comparisons to other Azure HPC VMs, including:

 

Unless otherwise noted, all tests shown below were performed with:

  • Alma Linux 8.10 (image URN: almalinux:almalinux-hpc:8_10-hpc-gen2:latest); scaling tests used Alma Linux 8.6 (image URN: almalinux:almalinux-hpc:8_6-hpc-gen2:latest)
  • NVIDIA HPC-X MPI
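
On the AlmaLinux HPC images, HPC-X is typically available as an environment module. The launch line below is a generic illustration only; the hostfile name, rank counts, and mapping are placeholders, and the per-application binding options used for the results in this post are not reproduced here.

module load mpi/hpcx
mpirun -np <total_ranks> --hostfile ./hostfile --map-by ppr:<ranks_per_node>:node --bind-to core ./my_hpc_app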

 

Further, all cost/performance comparisons use list-price, Pay-As-You-Go (PAYG) rates found on Azure Linux Virtual Machines Pricing. Absolute costs will be a function of a customer’s workload, model, and consumption approach (PAYG v. Reserved Instance, etc.). That said, the relative cost/performance comparisons illustrated below should hold for the workload and model combinations shown, regardless of the consumption approach.
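
As a concrete illustration of how the cost figures below are derived, cost per job is simply the hourly VM price multiplied by the number of VMs and the wall clock hours for the job. The values in this sketch are placeholders, not actual Azure rates or measured times.

# cost_per_job = price_per_VM_hour x VM_count x wall_clock_hours (placeholder values)
awk 'BEGIN { printf "cost per job: $%.2f\n", 10.00 * 1 * 0.50 }'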

 

Computational Fluid Dynamics (CFD)

 

OpenFOAM – version 2306 with 100M Cell Motorbike case

Figure 4: HBv5 v. HBv4 on OpenFOAM with the Motorbike 100M cell case. HBv5 VMs provide a 4.8x performance increase over HBv4 VMs.

Figure 5: The cost to complete the OpenFOAM Motorbike 100M case is just 57% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running OpenFOAM cases similar in size and complexity to the 100M cell Motorbike problem can consolidate their server (VM) deployments by approximately a factor of five (5).
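
For readers who want to reproduce a similar run, the outline below shows a generic parallel execution of an OpenFOAM Motorbike case after meshing; the subdomain count and hostfile are placeholders, and the exact meshing and solver settings for the 100M cell variant are not reproduced here.

# decompose the mesh (subdomain count set in system/decomposeParDict), then run the solver in parallel
decomposePar
mpirun -np 368 --hostfile ./hostfile simpleFoam -parallel > log.simpleFoam 2>&1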

 

 Palabos – version 1.01 with 3D Cavity, 1001 x 1001 x 1001 cells case

Figure 6: On Palabos, a Lattice Boltzmann solver using a streaming memory access pattern, HBv5 VMs provide a 4.4x performance increase over HBv4 VMs.

Figure 7: The cost to complete the Palabos 3D Cavity case is just 62% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running Palabos cases similar in size and complexity to the 3D Cavity problem (1001 x 1001 x 1001 cells) can consolidate their server (VM) deployments by approximately a factor of ~4.5.

 

Ansys Fluent – version 2025 R2 with F1 Racecar 140M case

Figure 8: On ANSYS Fluent, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

 

Figure 9: The cost to complete the ANSYS Fluent F1 racecar 140M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running ANSYS Fluent cases similar in size and complexity to the 140M cell F1 Racecar problem can consolidate their server (VM) deployments by approximately a factor of ~3.5.
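
A representative batch launch for a distributed Fluent run is sketched below; the journal and hostfile names are hypothetical, and the actual solver settings used for the F1 Racecar 140M results are not reproduced here.

# 3D double-precision solver, no GUI, 736 processes across the VMs listed in hosts.txt
fluent 3ddp -g -t736 -cnf=hosts.txt -i run_f1_racecar_140m.jou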

 

Siemens Star-CCM+ - version 17.04.005 with AeroSUV Steady Coupled 106M case

Figure 10: On Star-CCM+, HBv5 VMs provide a 3.4x performance increase over HBv4 VMs.

Figure 11: The cost to complete the Siemens Star-CCM+ AeroSUV Steady Coupled 106M case is just 81% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running Star-CCM+ cases similar in size and complexity to the 106M cell AeroSUV Steady Coupled problem can consolidate their server (VM) deployments by approximately a factor of ~3.5.
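
Similarly, a generic batch launch for a distributed Simcenter STAR-CCM+ run looks like the sketch below; the simulation file and machinefile names are hypothetical.

# Power licensing, 736 workers across the machines listed in hosts.txt, batch mode
starccm+ -power -np 736 -machinefile hosts.txt -batch run aerosuv_steady_coupled_106m.sim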

 

Weather Modeling

WRF – version 4.2.2 with CONUS 2.5KM case

Figure 12: On WRF, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.

Figure 13: The cost to complete the WRF CONUS 2.5km case is just 84% of what it costs to complete the same case on HBv4.

Above, we can see that organizations running WRF cases similar in size and complexity to the CONUS 2.5km problem can consolidate their server (VM) deployments by approximately a factor of ~3.
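
A generic launch of the CONUS 2.5km benchmark, once the WRF input files (wrfinput/wrfbdy) are staged in the run directory, is sketched below; the rank count and hostfile are placeholders.

ulimit -s unlimited                        # WRF commonly requires a large stack
mpirun -np 736 --hostfile ./hostfile ./wrf.exe
tail -n 2 rsl.out.0000                     # per-rank logs report elapsed compute time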

 

Energy Research

 

Devito – version 4.8.7 with Acoustic Forward case

Figure 14: On Devito, HBv5 VMs provide a 3.27x performance increase over HBv4 VMs.

Figure 15: The cost to complete the Devito Acoustic Forward OP case is equivalent to what it costs to complete the same case on HBv4.

Above, we can see that organizations running Devito cases similar in size and complexity to the Acoustic Forward OP problem can consolidate their server (VM) deployments by approximately a factor of ~3.

 

Molecular Dynamics

 

NAMD - version 2.15a2 with STMV 20M case

Figure 16: On NAMD, HBv5 VMs provide a 2.18x performance increase over HBv4 VMs.

Figure 17: The cost to complete the NAMD STMV 20M case is 26% higher on HBv5 than what it costs to complete the same case on HBv4.

Above, we can see that organizations running NAMD cases similar in size and complexity to the STMV 20M case can consolidate their server (VM) deployments by approximately a factor of ~2.

Notably, NAMD is compute bound rather than memory-performance bound. We include it here to illustrate that not every workload is a natural fit for HBv5. This latest Azure HPC VM is the fastest at this workload on the Microsoft Cloud, but NAMD does not benefit substantially from HBv5’s premium levels of memory bandwidth. NAMD would instead run more cost efficiently on a CPU that supports AVX-512 instructions natively or, better still, on a modern GPU.
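
For reference, a single-VM multicore NAMD run of an STMV-style input can be launched as sketched below; the configuration file name is hypothetical, and multi-VM runs would instead use the charmrun/UCX launchers bundled with NAMD.

# 368 worker threads pinned to cores on one HBv5 VM
namd2 +p368 +setcpuaffinity stmv_20m.namd > stmv_20m.log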

 

Scalability of HBv5-series VMs

 

Weak Scaling

Weak scaling measures how well a parallel application or system performs when both the number of processing elements and the problem size increase proportionally, so that the workload per processor remains constant. Weak scaling cases are often employed when time-to-solution is fixed (e.g. it is acceptable to solve a problem within a specified period) but a user desires a simulation to be of a higher fidelity or resolution. A common example is operational weather forecasting.

To illustrate weak scaling on HBv5 VMs, we ran Palabos with the same 3D cavity problem as shown earlier:

Figure 18: On Palabos with the 3D Cavity model, HBv5 scales linearly as the 3D cavity size is proportionately increased.

Strong Scaling

Strong scaling is characterized by the efficiency with which execution time is reduced as the number of processor elements (CPUs, GPUs, etc.) is increased while the problem size is kept constant. Strong scaling cases are often employed when the fidelity or resolution of the simulation is acceptable, but a user requires faster time to completion. A common example is product engineering validation, where an organization wants to bring a product to market faster but must complete a broad range of validation and verification scenarios before doing so.
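
For clarity, strong scaling efficiency at N VMs can be computed directly from measured wall clock times as T(1 VM) / (N x T(N VMs)). The sketch below uses placeholder timings, not measured results, purely to illustrate the arithmetic.

# efficiency(N) = T(1) / (N * T(N)); values here are illustrative only
awk 'BEGIN { t1=1000; n=4; tn=260; printf "strong scaling efficiency at %d VMs: %.2f\n", n, t1/(n*tn) }'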

To illustrate strong scaling on HBv5 VMs, we ran NAMD with two different problems, each intended to illustrate how expectations for strong scaling efficiency change depending on problem size and the balance of computation v. communication in distributed memory workloads.

First, let us examine NAMD with the STMV 20M benchmark:

Figure 19: On NAMD with the STMV 20M cell case, strong scaling on HBv5 remains efficient out to two (2) VMs before MPI time begins to dominate.

As illustrated above, for strong scaling cases in which compute time is continuously reduced (by leveraging more and more processor elements) but communication time remains constant, scaling efficiency will only stay high for so long. That principle is well represented by the STMV 20M case, for which parallel efficiency remains linear (i.e., cost/job remains flat) at two (2) nodes but degrades after that. This is because while compute is being sped up, the MPI time remains relatively flat. As such, the relatively static MPI time comes to dominate end-to-end wall clock time as VM scaling increases. Said another way, HBv5 features so much compute performance that even for a moderate-sized problem like STMV 20M, scaling the infrastructure can only take performance so far before cost/job begins to increase.

If we examine HBv5 against the 210M cell case, however, with 10.5x as many elements to compute as its 20M case sibling, the scaling efficiency story changes significantly.

Figure 20: On NAMD with the STMV 210M cell case, HBv5 scales linearly out to 32 VMs (or more than 11,000 CPU cores).

As illustrated above, larger cases with significant compute requirements will continue to scale efficiently across larger amounts of HBv5 infrastructure. While MPI time remains relatively flat for this case (as with the smaller STMV 20M case), the compute demands remain the dominant fraction of end-to-end wall clock time. As such, HBv5 scales these problems with very high levels of efficiency and, in doing so, job costs to the user remain flat despite up to 8x as many VMs being leveraged compared to the four (4) VM baseline.

The key takeaways for strong scaling scenarios are two-fold. First, users should run scaling tests with their applications and models to find a sweet spot of faster performance with constant job costs; this will depend heavily on model size. Second, as new and very high-end compute platforms like HBv5 emerge that accelerate compute time, application developers will need to find ways to reduce the fraction of wall clock time spent bottlenecked on communication (MPI). Recommended approaches include using fewer MPI processes and, ideally, restructuring applications to overlap communication with compute phases.

Updated Nov 05, 2025
Version 3.0