(Article contributed by Jon Shelly and Evan Burness, Azure)
Just in time for SC’19, Azure this week launched the new HBv2 Virtual Machines for High-Performance Computing (HPC) into Preview.
These VMs feature a wealth of new technology. Below are initial performance characterizations using a variety of configurations, spanning both microbenchmarks and commonly used HPC applications for which the HB family of VMs is optimized.
Microbenchmarks
MPI Latency
OSU Benchmarks (5.6.2) – osu_latency with MPI = HPC-X, Intel MPI, MVAPICH2, OpenMPI
| Message Size (bytes) | HPC-X (µs) | Intel MPI (µs) | MVAPICH2 (µs) | OpenMPI (µs) |
|---|---|---|---|---|
| 0 | 1.62 | 1.95 | 1.85 | 1.61 |
| 1 | 1.62 | 1.95 | 1.90 | 1.61 |
| 2 | 1.61 | 1.95 | 1.90 | 1.61 |
| 4 | 1.62 | 1.95 | 1.90 | 1.61 |
| 8 | 1.61 | 1.96 | 1.90 | 1.61 |
| 16 | 1.62 | 1.96 | 1.93 | 1.62 |
| 32 | 1.77 | 1.97 | 1.94 | 1.77 |
| 64 | 1.83 | 2.03 | 2.08 | 1.82 |
| 128 | 1.90 | 2.09 | 2.29 | 1.90 |
| 256 | 2.44 | 2.65 | 2.78 | 2.44 |
| 512 | 2.53 | 2.71 | 2.84 | 2.53 |
| 1024 | 2.63 | 2.84 | 2.93 | 2.62 |
| 2048 | 2.92 | 3.09 | 3.13 | 2.92 |
| 4096 | 3.72 | 3.89 | 4.07 | 3.74 |
MPI Bandwidth (2 QPs)
OSU Benchmarks (5.6.2) – osu_mbw_mr with ppn = 2 with MPI = HPC-X
| Message Size (bytes) | Peak Bandwidth (MB/s) | Average Bandwidth (MB/s) |
|---|---|---|
| 4096 | 15920.65 | 15902.67 |
| 8192 | 23045.57 | 23036.88 |
| 16384 | 23270.14 | 23260.04 |
| 32768 | 23376.91 | 23372.90 |
| 65536 | 23423.49 | 23423.23 |
| 131072 | 23445.05 | 23443.60 |
| 262144 | 23463.94 | 23463.93 |
| 524288 | 23470.70 | 23470.55 |
| 1048576 | 23474.30 | 23474.08 |
| 2097152 | 23475.77 | 23475.73 |
| 4194304 | 23476.61 | 23476.61 |
| 8388608 | 23477.06 | 23477.05 |
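For context, HBv2 VMs provide 200 Gb/s HDR InfiniBand (this figure comes from the HBv2 specification, not the table above), so the measured peak bandwidth can be compared against the theoretical line rate. A quick back-of-the-envelope check:

```python
# Back-of-the-envelope: compare the measured peak bandwidth above
# against the theoretical line rate of HBv2's 200 Gb/s HDR InfiniBand.
line_rate_mb_s = 200e9 / 8 / 1e6       # 200 Gb/s -> 25,000 MB/s
measured_peak_mb_s = 23477.06          # largest message size in the table

efficiency = measured_peak_mb_s / line_rate_mb_s
print(f"{efficiency:.1%} of line rate")  # prints "93.9% of line rate"
```

In other words, the dual-queue-pair configuration sustains roughly 94% of the theoretical HDR line rate at large message sizes.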
Application Benchmarks
App: Siemens Star-CCM+
Version: 14.06.004
Model: LeMans 100M Coupled Solver
Configuration Details: 116 MPI ranks were run in each HBv2 VM (4 ranks from each of 29 NUMA domains) in order to leave nominal resources for Linux background processes. Adaptive Routing was enabled, DCT (Dynamic Connected Transport) was used as the transport layer, and HPC-X version 2.50 was used for MPI. The Azure CentOS HPC 7.6 image from https://github.com/Azure/azhpc-images was used.
| VMs | Cores | PPN | Solver Elapsed Time (s) | Speedup | Parallel Efficiency (%) |
|---|---|---|---|---|---|
| 1 | 116 | 116 | 258.92 | 116 | 100 |
| 2 | 232 | 116 | 129.56 | 231.82 | 99.9 |
| 4 | 464 | 116 | 62.01 | 484.35 | 104.4 |
| 16 | 1856 | 116 | 16.46 | 1824.71 | 98.3 |
| 32 | 3712 | 116 | 8.4 | 3575.56 | 96.3 |
| 64 | 7424 | 116 | 4.8 | 6257.23 | 84.3 |
| 128 | 14848 | 116 | 2.5 | 12013.89 | 80.9 |
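The Speedup and Parallel Efficiency columns follow directly from the elapsed times: speedup is the single-VM baseline time divided by the N-VM time, scaled by the baseline rank count, and parallel efficiency is that speedup divided by the total rank count. A minimal sketch reproducing two rows of the table:

```python
# Reproduce the Speedup and Parallel Efficiency columns from the
# Solver Elapsed Time, using the 1-VM run (116 ranks, 258.92 s)
# as the baseline.
base_ranks, base_time = 116, 258.92

def scaling(n_vms, elapsed):
    ranks = n_vms * 116                      # 116 MPI ranks per VM
    speedup = base_ranks * base_time / elapsed
    efficiency = 100.0 * speedup / ranks     # percent
    return round(speedup, 2), round(efficiency, 1)

print(scaling(2, 129.56))    # -> (231.82, 99.9), matching the 2-VM row
print(scaling(128, 2.5))     # -> (12013.89, 80.9), matching the 128-VM row
```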
Summary: Star-CCM+ scaled at 81% efficiency to nearly 15,000 MPI ranks, delivering an application speedup of more than 12,000x. This compares favorably to Azure’s previous best of more than 11,500 MPI ranks, which was itself a world record for MPI scalability on the public cloud.
App: ANSYS Fluent
Version: 14.06.004
Model: External Flow over a Formula-1 Race Car (f1_racecar_140m)
Configuration Details: 60 MPI ranks were run in each HBv2 VM (2 of the 4 cores per NUMA domain) in order to leave nominal resources for Linux background processes and to provide ~6 GB/s of memory bandwidth per core. Adaptive Routing was enabled, DCT (Dynamic Connected Transport) was used as the transport layer, and HPC-X version 2.50 was used for MPI. The Azure CentOS HPC 7.6 image from https://github.com/Azure/azhpc-images was used.
| VMs | HBv2 Solver Rating | HBv2 Speedup | Linear (Ideal) Speedup |
|---|---|---|---|
| 1 | 68.5 | 1 | 1 |
| 2 | 134.5 | 1.96 | 2 |
| 4 | 275.9 | 4.03 | 4 |
| 8 | 557.8 | 8.14 | 8 |
| 16 | 1122.1 | 16.38 | 16 |
| 32 | 2385.1 | 34.82 | 32 |
| 64 | 4601.9 | 67.18 | 64 |
| 128 | 9846.2 | 143.74 | 128 |
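Fluent’s Solver Rating is a throughput metric (roughly, benchmark jobs per day), so speedup is simply the N-VM rating divided by the 1-VM rating; super-linear values mean the rating grew faster than the VM count. A quick check against the table:

```python
# Derive speedup and scaling efficiency from the Fluent Solver Rating
# column (higher rating = faster; speedup(N) = rating(N) / rating(1)).
ratings = {1: 68.5, 2: 134.5, 4: 275.9, 8: 557.8,
           16: 1122.1, 32: 2385.1, 64: 4601.9, 128: 9846.2}

for vms, rating in ratings.items():
    speedup = rating / ratings[1]
    efficiency = speedup / vms
    print(f"{vms:3d} VMs: speedup {speedup:7.2f}, efficiency {efficiency:.0%}")
```

At 128 VMs this yields a speedup of 143.74 and an efficiency of about 112%, matching the super-linear scaling noted in the summary below.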
Summary: HBv2 VMs scale super-linearly (112% efficiency) up to the largest measured scale of 128 VMs. The Fluent Solver Rating measured at this scale represents 83% more performance than the current leading submission for this model in the ANSYS public database (https://bit.ly/2OdAExM).
Impact of Adaptive Routing
App: Siemens Star-CCM+
Version: 14.06.004
Model: LeMans 100M Coupled Solver
Configuration Details: Star-CCM+ performance was compared on an “apples to apples” basis, with Adaptive Routing disabled and then enabled as the only variable.
Summary: Adaptive Routing, designed to drive higher sustained application scalability for large MPI jobs, delivered a scaling efficiency improvement of 17% over an identical job run with the feature disabled. This translates to faster time to solution, and more efficient use of application licenses.