Article contributed by Amirreza Rastegari, Jon Shelley, Scott Moe, Jie Zhang, Jithin Jose, Anshul Jain, Jyothi Venkatesh, Joe Greenseid, Fanny Ou, and Evan Burness
Azure has announced the general availability of Azure HBv4-series and HX-series virtual machines (VMs) for high performance computing (HPC). This blog provides in-depth technical and performance information about these HPC-optimized VMs.
During the preview announced in November 2022, these VMs were equipped with standard 4th Gen AMD EPYC™ processors (codenamed “Genoa”). Starting with the General Availability announced today, all HBv4 and HX-series VMs have been upgraded to 4th Gen AMD EPYC™ processors with AMD 3D V-Cache, codenamed “Genoa-X”. Standard 4th Gen AMD EPYC processors are no longer available in HBv4 and HX-series VMs.
These VMs feature several new technologies for HPC customers, including:
HBv4 and HX VMs are available in the following sizes with specifications as shown in Tables 1 and 2, respectively. Just like existing H-series VMs, HBv4 and HX-series also include constrained-cores VM sizes, enabling customers to choose a size along a spectrum from maximum performance per VM to maximum performance per core.
HBv4-series VMs
HX-series VMs
Note: Maximum clock frequencies (FMAX) are based on non-AVX workload scenarios measured by the Azure HPC team with AMD EPYC 9004-series processors. The clock frequency a customer experiences is a function of a variety of factors, including but not limited to the arithmetic intensity (SIMD) and parallelism of an application.
For more information, see the official documentation for HBv4-series and HX-series VMs.
It is useful to understand the stacked L3 cache technology, called 3D V-Cache, and the effect it does and does not have on a range of HPC workloads.
4th Gen EPYC processors with 3D V-Cache differ from their standard 4th gen EPYC counterparts only by virtue of having 3x as much L3 cache memory per Genoa core, CCD, socket, and server. This results in a 2-socket server (such as those underlying HBv4 and HX-series VMs) having a total of:
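As a back-of-envelope sketch (the 96 MB-per-CCD and ~2.3 GB figures are quoted later in this post; the count of 12 CCDs per socket is an assumption based on publicly documented 4th Gen EPYC packaging):

96 MB of L3 per CCD × 12 CCDs per socket = 1,152 MB of L3 per socket
1,152 MB per socket × 2 sockets = 2,304 MB, or roughly 2.3 GB of L3 per server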
To put this amount of L3 cache into context, here’s how several processor models widely used by HPC buyers over the last half decade compare in terms of L3 cache per 2-socket server when juxtaposed against the latest Genoa-X processors in HBv4 and HX VMs.
Note that looking at L3 cache size alone, absent context, can be misleading. Different CPU generations balance their (faster) L2 and (slower) L3 caches differently. For example, while an Intel Xeon “Broadwell” CPU has more L3 cache per core, and often per CPU, than an Intel Xeon “Skylake” CPU, that does not mean it has a higher-performance memory subsystem. A Skylake core has a much larger L2 cache than a Broadwell core, higher bandwidth from DRAM, and significantly more capable prefetchers. Instead, the above table is merely intended to make apparent how much larger the total L3 cache is in a Genoa-X server compared to all prior CPUs.
Caches of this size have an opportunity to noticeably improve (1) effective memory bandwidth and (2) effective memory latency. Many HPC applications increase their performance partially or fully in-line with improvements to memory bandwidth and/or memory latency, so the potential impact to HPC customers of Genoa-X processors is large. Examples of workloads that fall into these categories include:
Just as important, however, is understanding what these large caches do not affect. Namely, they do not improve peak FLOPS, clock frequency, or memory capacity. Thus, workloads whose performance, or ability to run at all, is limited by one or more of these will, in general, not see a material impact from the extra-large L3 caches present in Genoa-X processors. Examples of workloads that fall into these categories include:
Some compute-intensive workloads may even see a modest decrease in performance, because the larger SRAM that comprises the 3D V-Cache consumes power from a fixed CPU SoC power budget that would otherwise go to the CPU cores themselves to drive higher frequencies. Workloads this intensely compute bound are uncommon, but they do exist, and they are worth noting so that 3D V-Cache is not misunderstood as a performance-enhancing feature for all HPC workloads.
This section focuses on microbenchmarks that characterize performance of the memory subsystem and the InfiniBand network of HBv4-series and HX-series VMs.
Memory Performance
Benchmarking “memory performance” of servers featuring 4th Gen AMD EPYC CPUs with 3D V-Cache is a nuanced exercise due to the variable and potentially significant impact of the large L3 cache.
To start, however, let us first measure the performance of each type of memory (DRAM and L3) independently to illustrate the large differences between these two values.
To capture this information, we ran the industry standard STREAM benchmark sized both (A) primarily out of system memory (DRAM), as well as (B) intentionally sized down to fit entirely inside of the large L3 cache (3D V-Cache).
Below in Figure 1, we share the results of running the industry standard STREAM benchmark on HBv4/HX VMs with a data size (8.9 GB) too large to fit in the L3 cache (2.3 GB), thereby producing a memory bandwidth result that principally represents DRAM performance.
The STREAM benchmark was run using the following command.
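# Build STREAM with three arrays of 400,000,000 doubles (~3.2 GB each, ~8.9 GB in total), far too large for the 2.3 GB of L3 cache; -fnt-store=aggressive emits non-temporal (streaming) stores since the data will not be reused from cache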
clang -Ofast -ffp-contract=fast -mavx2 -march=znver4 -lomp -fopenmp -mcmodel=large -fnt-store=aggressive -DSTREAM_ARRAY_SIZE=400000000 -DNTIMES=10 stream.c -o stream
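# Run 48 OpenMP threads, explicitly pinned to cores spread across both sockets so that bandwidth demand is distributed over all memory channels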
OMP_PLACES="0,4,8,12,16,20,24,28,32,36,38,42,44,48,52,56,60,64,68,72,76,80,82,86,88,92,96,100,104,108,112,116,120,124,126,130,132,136,140,144,148,152,156,160,164,168,170,174" OMP_NUM_THREADS=48 ./stream
The results shown above are essentially identical to those measured on a standard 2-socket server with “Genoa” processors in a 1 DIMM-per-channel configuration, such as the HBv4/HX Preview VMs (which have since been replaced by GA HBv4/HX VMs with “Genoa-X” processors). As mentioned above, this is because the test is sufficiently large that only a small minority of the benchmark fits within the 3D V-Cache, minimizing its impact on measured memory bandwidth. The result of ~780 GB/s is the expected bandwidth, as the physical DIMMs in these servers are unchanged from when they ran standard 4th Gen EPYC CPUs.
Below in Figure 2, however, we share the results of running STREAM with a data set sized downward (~80 MB) so that it fits entirely into the 2.3 GB of L3 cache of an HBv4 or HX-series VM, and more precisely into each CCD’s 96 MB slice of L3. Here, the STREAM benchmark was run using the following:
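# STREAM_ARRAY_SIZE=3300000 gives three arrays of roughly 26 MB each (~80 MB in total), small enough to fit within even a single CCD's 96 MB L3 slice; -fnt-store=never keeps stores within the cache hierarchy instead of streaming them to DRAM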
clang -Ofast -mavx2 -ffp-contract=fast -march=znver4 -lomp -fopenmp -mcmodel=large -fnt-store=never -DSTREAM_ARRAY_SIZE=3300000 -DNTIMES=100000 stream.c -o stream
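# Run one OpenMP thread per core (176 threads); each CCD's share of the ~80 MB working set fits comfortably within its 96 MB L3 slice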
OMP_NUM_THREADS=176 ./stream
In Figure 2 above, we see measured bandwidths significantly higher than would result from a standard 4th Gen EPYC CPU, or from any other CPU whose L3 cache is not large enough to hold all or a large percentage of a working dataset. The ~5.7 terabytes/sec of STREAM TRIAD bandwidth measured here is more than 7x that of the result shown in Figure 1, which principally represents DRAM bandwidth.
So which number is “correct”? The answer is “both” and “it depends.” We can reasonably say both measured numbers are accurate because:
Below in the Application Performance section, 3D V-Cache effects can be observed as up to a 49% uplift over standard 4th Gen EPYC CPUs in highly memory-bandwidth-bound applications like OpenFOAM. The size of the uplift depends strongly on the characteristics of the model in question.
Thus, based on the performance data we have been able to reproducibly measure thus far, the amplification effect from 3D V-Cache is up to 1.49x the effective memory bandwidth; such a workload performs as if it were being fed roughly 1.2 TB/s (1.49 × ~780 GB/s).
Again, the memory bandwidth amplification should be understood as an “up to” effect, because the data below shows a close relationship between the percentage of an active dataset that runs out of cache and the resulting performance uplift. Moreover, Azure’s characterization of the maximum performance uplift may increase over time as we explore greater levels of scaled performance comparisons between 4th Gen AMD EPYC CPUs with and without 3D V-Cache.
InfiniBand Networking Performance
HBv4 and HX VMs are equipped with the latest 400 Gbps NVIDIA Quantum-2 InfiniBand (NDR) networking. We ran the industry standard InfiniBand perftest benchmarks across two (2) HBv4-series VMs using the following:
Unidirectional bandwidth:
numactl -c 0 ib_send_bw -aF -q 2
Bi-directional bandwidth:
numactl -c 0 ib_send_bw -aF -q 2 -b
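In these commands, -a sweeps across message sizes, -F suppresses the CPU frequency scaling warning, -q 2 uses two queue pairs, and -b selects the bidirectional test. As with all perftest utilities, the command is first started on one VM (acting as the server) and then launched on the second VM with the first VM’s hostname or IP address appended.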
Results of these tests are depicted in Figures 3 and 4, below.
As depicted above, Azure HBv4/HX-series VMs achieve line-rate bandwidth performance (99% of peak) for both unidirectional and bi-directional tests.
This section will focus on characterizing performance of HBv4 and HX VMs on commonly run HPC applications. Performance comparisons are also provided across other HPC VMs offered on Azure, including:
Note: HC-series represents a highly relevant comparison for customers because the majority of HPC workloads, market-wide, still run largely or exclusively in on-premises datacenters and on infrastructure that is operated for, on average, four to five years. Thus, it is important to include performance information for HPC technology spanning the full age range customers may be accustomed to using on-premises. Azure HC-series VMs represent the older end of that spectrum and feature technologies that were highly performant at the time, such as EDR InfiniBand, 1 DIMM-per-channel DDR4 2666 MT/s memory, and 1st Gen Xeon Platinum (“Skylake”) CPUs, which comprised a large majority of HPC customer investments and configuration choices during that period. As such, the application performance comparisons below commonly use HC-series as a representative proxy for an approximately 4-year-old HPC-optimized server.
Note: Standard 4th Gen AMD EPYC CPUs were only available in the Preview of HBv4 and HX series VMs and are no longer available. Starting with General Availability, all HBv4 and HX VMs are available only with 4th Gen EPYC processors with 3D V-Cache (codenamed “Genoa-X”).
Unless otherwise noted, all tests shown below were performed with:
Computational Fluid Dynamics (CFD)
Ansys Fluent
Note: for all ANSYS Fluent tests, HBv4/HX performance data (both with and without 3D V-Cache) were collected using ANSYS Fluent 2022 R2 and HPC-X 2.15 on AlmaLinux 8.6, while all other results were collected with ANSYS Fluent 2021 R1 and HPC-X 2.8.3 on CentOS 8.1. There are no known performance advantages, however, from using the newer software versions. Rather, each was used due to validation coverage by each of ANSYS, NVIDIA, and AMD.
Absolute performance values for the benchmark represented in Figure 6 are shared below:
Absolute performance values (solver rating) for the benchmark represented in Figures 7 & 8 are shared below:
Siemens Simcenter STAR-CCM+
Note: for all Siemens Simcenter Star-CCM+ tests, HBv4/HX with 3D V-Cache performance data were collected using Simcenter 18.04.005, while HBv4 performance data without 3D V-Cache used version 18.02.003. Both utilized HPC-X 2.15 on AlmaLinux 8.6. All other results were collected with Siemens Simcenter Star-CCM+ 17.04.008 and HPC-X 2.8.3 on CentOS 8.1. There are no known performance advantages, however, from using the newer software versions. Rather, each was used due to validation coverage by each of Siemens, NVIDIA, and AMD.
In addition, on AMD-based systems, Simcenter Star-CCM+ with xpmem will underperform (~10% performance loss) due to a large increase in page faults. A workaround is to uninstall the kmod-xpmem package or unload the xpmem kernel module; applications will fall back to a supported shared memory transport provided by UCX without any other user intervention. A patch for XPMEM should be included in the July 2023 release of UCX/NVIDIA OFED.
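For reference, a minimal sketch of this workaround on an RPM-based image such as AlmaLinux is shown below (verify package and module names against your specific image):

# Check whether the xpmem kernel module is currently loaded
lsmod | grep xpmem
# Unload the module for the current boot
sudo modprobe -r xpmem
# Or remove the package so the module is not loaded on future boots
sudo dnf remove kmod-xpmem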
Absolute performance values (elapsed time) for the benchmark represented in Figure 9 are shared below:
Absolute performance values (elapsed time) for the benchmark represented in Figure 10 are shared below:
OpenFOAM
Note: All OpenFOAM performance tests used OpenFOAM version 2006, as well as AlmaLinux 8.6 and HPC-X MPI.
Absolute performance values for the benchmark represented in Figures 11 and 12 are shared below:
Hexagon Cradle CFD
Note: All Hexagon Cradle CFD performance tests used version 2022 with AlmaLinux 8.6.
Absolute performance values (mean time per timestep) for the benchmark represented in Figures 15 and 16 are shared below:
Finite Element Analysis (FEA)
Altair RADIOSS
Note: All Altair RADIOSS performance tests used RADIOSS version 2021.1 with AlmaLinux 8.6.
The absolute values for the benchmark represented in Figure 13 are shared below:
MSC Nastran – version 2022.3
Note: All NASTRAN performance tests used NASTRAN 2022.3 and AlmaLinux 8.6.
Note: for NASTRAN, the SOL108 medium benchmark was only tested on an HX-series VM because this VM type was created to support such large-memory workloads. The larger memory footprint of HX-series (2x that of HBv4-series, and more than 3x that of HBv3-series) allows the benchmark to run completely out of DRAM, which in turn provides an additional performance speedup on top of that provided by the newer 4th Gen EPYC processors with 3D V-Cache. As such, it would not be accurate to characterize the performance depicted below as “HBv4/HX” and we have instead marked it simply as “HX.”
The absolute values for the benchmark represented in Figure 14 are shared below:
Weather Simulation
WRF
Note: All WRF performance tests on HBv4/HX series (both with and without 3D V-Cache) utilized WRF 4.2.2, HPC-X MPI 2.15, and AlmaLinux 8.6.
Absolute performance values (mean time per timestep) for the benchmark represented in Figures 15 and 16 are shared below:
Molecular Dynamics
NAMD – version 2.15
Note: all NAMD performance tests used NAMD version 2.15 with AlmaLinux 8.6 and HPC-X 2.12. For HBv4/HX and HC-series, the AVX512 Tiles binary was utilized to take advantage of the AVX512 capabilities in both 1st Gen Xeon Platinum “Skylake” and 4th Gen EPYC “Genoa” and “Genoa-X” processors. In addition, on AMD systems xpmem will underperform (~10% performance loss) due to a large increase in page faults. A workaround is to uninstall the kmod-xpmem package or unload the xpmem module, as described in the Simcenter Star-CCM+ section above; applications will fall back to a supported shared memory transport provided by UCX without any other user intervention. A patch for XPMEM should be included in the July 2023 release of UCX/NVIDIA OFED.
Absolute performance values (nanoseconds per day) for the benchmark represented in Figure 17 are shared below:
Rendering
Chaos V-Ray
Note: All Chaos V-Ray tests used version 5.02.00. All tests on HBv4/HX VMs used AlmaLinux 8.6, while all tests on HBv3, HBv2, and HC utilized CentOS 7.9. There are no known performance advantages, however, from using the newer operating system. Rather, each was used due to validation coverage by AMD.
Absolute performance values (frames rendered) for the benchmark represented in Figure 20 are shared below:
Chemistry
CP2K
Note: All CP2K tests used version 9.1. All tests on HBv4/HX VMs used AlmaLinux 8.7 and HPC-X 2.15, while all tests on HBv3 VMs used CentOS 8.1 and HPC-X 2.8.3. There are no known performance advantages, however, from using the newer software versions. Rather, each was used due to validation coverage by each of NVIDIA and AMD.
Absolute performance values (average execution time) for the benchmark represented in Figure 21 are shared below: