benchmarking
Announcing Azure HBv5 Virtual Machines: A Breakthrough in Memory Bandwidth for HPC
Discover the new Azure HBv5 Virtual Machines, unveiled at Microsoft Ignite, designed for high-performance computing applications. With up to 7 TB/s of memory bandwidth and custom 4th Generation EPYC processors, these VMs are optimized for the most memory-intensive HPC workloads. Sign up for the preview starting in the first half of 2025 and see them in action at Supercomputing 2024 in Atlanta.

MatLab and Azure: A Match Made in Performance Heaven
Are you tired of slow MatLab performance when working with complex simulations and data analysis? Microsoft Azure provides virtual machines (VMs) that can help you achieve faster and more efficient MatLab workloads. However, selecting the right VM SKU can be challenging, and a poor choice can lead to poor performance and higher costs. In our upcoming blog article, we'll dive into the factors that impact MatLab performance and guide you through choosing the best Azure VM SKU to maximize MatLab efficiency. We'll also share some best practices to help you optimize your MatLab performance on Azure VMs.

Accelerating NAMD on Azure HB-series VMs
Authors: Amirreza Rastegari, Nihit Pokhrel, Nathan Baker, Jon Shelley

Introduction

Nanoscale Molecular Dynamics (NAMD) is a molecular dynamics program designed for high-performance simulation of large biomolecular systems. The program scales well over many processors and nodes, making it an excellent fit for Azure High Performance Computing (HPC) infrastructure. Azure HB-series virtual machines (VMs) enable customers who run NAMD to accelerate their innovation by using the latest AMD CPU and NVIDIA Quantum InfiniBand offerings. These VMs, as shown in table 1, are designed to deliver leadership-class performance, scalability, and cost efficiency for a variety of real-world HPC workloads at scale.

Table 1: HB-series VM sizes benchmarked.

Performance Benchmarking and Simulation Models

The benchmarks we used to compare the HB-series VM offerings were the standard ApoA1 and STMV models. The software stack used to run the benchmarks was NAMD version 2.15 (Git-2022-07-21) on the AlmaLinux HPC 8.6 Azure marketplace image. These benchmarks enabled us to compare performance across the various VM sizes, using increasing numbers of VMs, in an objective way. The benchmark models are summarized in table 2 below.

Table 2: Description of the two benchmark models.

Representative images of the protein systems in these benchmarks are shown in figure 1 below.

Figure 1: Visual representation of the benchmark models used: ApoA1 (water not shown) and STMV (water not shown). Images generated using the Mol* 3D viewer on RCSB PDB - 3D View.

Performance comparison across the HB-series VMs

For each benchmark model, we scaled the simulations up to a maximum of 64 VMs (32 for the ApoA1 benchmark) and measured the performance in terms of nanoseconds per day (ns/day). Using this metric, we calculated the cost[1] per nanosecond (cost/ns) of the simulations to understand the relative cost per solution across 3-4 years of HB-series VM offerings.

Figure 2: Performance and cost comparison of HB, HBv2 and HBv3-series VMs on the ApoA1 (92k atoms) model.

N    HB (60 ppn, ns/day)    HBv2 (96 ppn, ns/day)    HBv3 (96 ppn, ns/day)
1    5.1                    11.4                     11.4
2    8.5                    20.2                     20.4
4    12.8                   38.1                     38.6
8    29.4                   66.7                     67.3
16   37.9                   102.6                    109.6
32   40.1                   120.7                    134.5

Table 3: Performance of HB, HBv2 and HBv3-series VMs on the ApoA1 (92k atoms) model.

The ApoA1 (92k atoms) benchmark model consists of 92,224 atoms. With this model, the HBv3 VMs deliver up to ~3.3x the performance of the HB VMs while lowering the cost per nanosecond by up to 50%. The HBv2 VMs demonstrate almost identical performance to the HBv3 VMs on up to 8 VMs; beyond this point, the HBv2 VMs fall off relative to the HBv3 VMs. The best performance was achieved using 32 HBv3 VMs, which delivered 134.5 ns/day at a cost of 22.6 USD/ns. However, due to the small size of the problem, the parallel efficiency with 32 HBv3 VMs drops to 37%, since there are fewer than ~30 atoms per core. A reasonable balance of cost and performance is obtained with 4 HBv3 VMs, where there are ~250 atoms per core and the parallel efficiency remains around 85%. At this point, the simulation's performance and cost per nanosecond are 38.6 ns/day and 9.8 USD/ns, respectively.
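The cost and atoms-per-core figures quoted above follow from simple arithmetic on the ns/day numbers. The sketch below illustrates how those metrics are derived from a row of Table 3; the helper names and the 3.60 USD/hour price are placeholders of ours, not Azure pricing or the authors' spreadsheet (see footnote [1] for actual pricing).

```python
# Minimal sketch of the two metrics used above; helper names and the hourly
# price are illustrative assumptions, not values from the post.

def cost_per_ns(ns_per_day: float, n_vms: int, usd_per_vm_hour: float) -> float:
    """USD spent per simulated nanosecond at the measured throughput."""
    return (usd_per_vm_hour * 24 * n_vms) / ns_per_day

def atoms_per_core(n_atoms: int, n_vms: int, cores_per_vm: int) -> float:
    """Work available per core; NAMD scaling degrades when this gets small."""
    return n_atoms / (n_vms * cores_per_vm)

# The 4-VM HBv3 ApoA1 point from Table 3 (38.6 ns/day at 96 ppn), priced with
# a hypothetical 3.60 USD per VM-hour:
print(round(cost_per_ns(38.6, 4, 3.60), 1))   # ~9 USD/ns with this assumed price
print(round(atoms_per_core(92_224, 4, 96)))   # ~240 atoms per core
```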
Figure 3: Performance and cost comparison of HB, HBv2 and HBv3-series VMs on the STMV (1M atoms) model.

N    HB (60 ppn, ns/day)    HBv2 (96 ppn, ns/day)    HBv3 (96 ppn, ns/day)
1    0.49                   1.13                     1.18
2    0.90                   1.91                     1.97
4    1.55                   4.31                     4.55
8    3.81                   8.49                     8.96
16   6.97                   15.95                    16.87
32   10.11                  27.18                    29.65
64   18.24                  31.94                    42.05

Table 4: Performance of HB, HBv2 and HBv3-series VMs on the STMV (1M atoms) model.

In the STMV (1M atoms) benchmark model, with one million atoms, the HBv3 VMs deliver up to ~2.3x the performance of the HB VMs while lowering the cost per nanosecond by as much as 32%. The HBv2 VMs demonstrate similar performance to the HBv3 VMs on up to 4 VMs; beyond this point, however, the HBv2 VMs exhibit lower performance than the HBv3 VMs. The best performance on the STMV (1M atoms) benchmark model was achieved using 64 HBv3 VMs, at ~42 ns/day and a cost of ~145 USD/ns. However, with this many VMs the parallel efficiency drops significantly because of the dwindling number of atoms per core (~350) in the simulation.

Figure 4: Performance and cost comparison of HB, HBv2 and HBv3-series VMs on the STMV (20M atoms) model.

N    HB (60 ppn, ns/day)    HBv2 (120 ppn, ns/day)    HBv3 (120 ppn, ns/day)
1    0.035                  0.086                     0.088
2    0.063                  0.145                     0.154
4    0.138                  0.480                     0.469
8    0.397                  0.862                     0.867
16   0.757                  1.864                     1.854
32   1.399                  3.419                     3.481
64   2.698                  5.486                     5.879

Table 5: Performance of HB, HBv2 and HBv3-series VMs on the STMV (20M atoms) model.

In the STMV (20M atoms) benchmark model, consisting of 20 million atoms, the HBv3 VMs delivered 2.18-2.51x the performance of the HB VMs while reducing the cost per nanosecond to around 64-68% of that of the HB VMs. Since this is a compute-intensive problem, the HBv2 VMs deliver similar performance to the HBv3 VMs at small VM counts of up to 16 VMs, with a small drop at the larger VM counts. The best performance with this model was achieved with 64 HBv3 VMs, which delivered 5.879 ns/day at a cost of ~1035 USD/ns. For this benchmark we scaled the model to just over 3200 atoms per core at 64 HBv3 VMs. Over the range of 1-64 VMs we saw a parallel efficiency of ~105-135%, demonstrating super-linear speedup.

Figure 5: Performance and cost comparison of HB, HBv2 and HBv3-series VMs on the STMV (210M atoms) model.

N    HB (60 ppn, ns/day)    HBv2 (120 ppn, ns/day)    HBv3 (120 ppn, ns/day)
16   0.061                  0.161                     0.157
32   0.119                  0.299                     0.299
64   0.241                  0.654                     0.649

Table 6: Performance of HB, HBv2 and HBv3-series VMs on the STMV (210M atoms) model.

In the STMV (210M atoms) benchmark model, which consists of 210 million atoms, the HBv3 VMs deliver 2.57-2.70x the performance of the HB VMs at only ~60% of the cost. This benchmark is very compute-intensive, so the HBv2 VMs deliver similar performance to the HBv3 VMs. The highest performance with this model was achieved using 64 HBv3 VMs, which delivered 0.649 ns/day at a cost of ~9360 USD/ns. Because of the size of this benchmark, with more than 27,300 atoms per core at 64 HBv3 VMs, the simulation's parallel efficiency remains impressive, reaching as high as 128%, demonstrating super-linear speedup.
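The parallel-efficiency figures quoted above can be checked directly against the throughput tables. A minimal sketch, using the HBv3 column of Table 5 (STMV, 20M atoms) and taking the 1-VM run as the baseline (the authors' exact baseline may differ), reproduces the roughly 105-135% efficiency and super-linear behaviour reported for the 4-64 VM range; the same calculation applies to the other tables.

```python
# Parallel efficiency relative to the single-VM run: eff(N) = (perf_N / perf_1) / N.
# Values above 1.0 indicate super-linear scaling, typically because the working
# set fits better into the larger aggregate L3 cache as it is split across more VMs.

hbv3_stmv_20m = {1: 0.088, 2: 0.154, 4: 0.469, 8: 0.867, 16: 1.854, 32: 3.481, 64: 5.879}

base = hbv3_stmv_20m[1]
for n, ns_per_day in sorted(hbv3_stmv_20m.items()):
    efficiency = (ns_per_day / base) / n
    print(f"{n:>2} VMs: {efficiency:.0%} parallel efficiency")
```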
Continuous innovation with leading-edge capabilities

With Azure's HB-series VMs, NAMD customers can reduce the time and cost of their simulations. Across the various benchmarks, ranging from 1 to 64 VMs (60-7,680 cores), HBv3-series VMs provide the fastest time-to-solution at the lowest relative cost per NAMD simulation when compared to 3-4-year-old technology (HB-series VMs). These performance gains on HBv3 are due to the additional cores and the 200 Gb/s HDR InfiniBand. The benchmarks show HBv3-series VMs deliver up to 3.1x higher performance while reducing the $/nanosecond of the simulation by as much as 1.9x compared to HB-series VMs.

Additional Information
Learn more about Azure HPC
Azure HBv3-series VMs
Azure WOC-Benchmarking GitHub Repository
NAMD compilation recipe on Azure HPC GitHub Repository
Azure HPC Content Hub

* NAMD was developed by the Theoretical and Computational Biophysics Group in the Beckman Institute for Advanced Science and Technology at the University of Illinois at Urbana-Champaign.
[1] Learn more about the pricing of each HB-series VM.

Exploring CPU vs GPU Speed in AI Training: A Demonstration with TensorFlow
In the ever-evolving landscape of artificial intelligence, the speed of model training is a crucial factor that can significantly impact the development and deployment of AI applications. Central Processing Units (CPUs) and Graphics Processing Units (GPUs) are two types of processors commonly used for this purpose. In this blog post, we will delve into a practical demonstration using TensorFlow to showcase the speed differences between CPU and GPU when training a deep learning model.
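As a rough illustration of what such a demonstration can look like, the sketch below trains a small dense model on synthetic data, once pinned to the CPU and once to the GPU, and compares the elapsed times. The model size, data shapes, and device handling are our assumptions, not the code from the upcoming article.

```python
# Hedged sketch of a CPU-vs-GPU training comparison with TensorFlow.
# Model size, data, and device strings are illustrative assumptions.
import time
import tensorflow as tf

def train_on(device: str) -> float:
    """Train a small dense network on random data and return elapsed seconds."""
    with tf.device(device):
        x = tf.random.normal((8192, 512))
        y = tf.random.uniform((8192,), maxval=10, dtype=tf.int32)
        model = tf.keras.Sequential([
            tf.keras.layers.Dense(1024, activation="relu"),
            tf.keras.layers.Dense(1024, activation="relu"),
            tf.keras.layers.Dense(10),
        ])
        model.compile(optimizer="adam",
                      loss=tf.keras.losses.SparseCategoricalCrossentropy(from_logits=True))
        start = time.perf_counter()
        model.fit(x, y, epochs=5, batch_size=256, verbose=0)
        return time.perf_counter() - start

cpu_seconds = train_on("/CPU:0")
gpu_seconds = train_on("/GPU:0") if tf.config.list_physical_devices("GPU") else float("nan")
print(f"CPU: {cpu_seconds:.1f} s   GPU: {gpu_seconds:.1f} s")
```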
Performance & Scalability of HBv4 and HX-Series VMs with Genoa-X CPUs

Azure has announced the general availability of Azure HBv4-series and HX-series virtual machines (VMs) for high performance computing (HPC). This blog provides in-depth technical and performance information about these HPC-optimized VMs.

Reducing cost and time to solution for Simcenter STAR-CCM+ software on Azure's HB-series VMs
Authors: Amirreza Rastegari, Jon Shelley

Simcenter STAR-CCM+ is one of the most widely used computational fluid dynamics (CFD) simulation applications among high performance computing (HPC) customers in a variety of research and commercial fields. Over the past 5 years, our partnership has been focused on delivering great CFD performance in Azure. Recently, we evaluated Simcenter STAR-CCM+ across our HB-series virtual machines (VMs) to see how the HBv3 VMs compare to the previous generations in terms of cost[1], performance, and scalability. The Azure HB series currently includes HBv3, HBv2 and HB VMs, with their technical features summarized in table 1.

Table 1: Summary of VM specs for the HB-series VMs.

A total of 3 benchmarks were run using Simcenter STAR-CCM+. The benchmarks selected cover small, medium, and large CFD model sizes, details of which are listed in table 2.

Table 2: Benchmarks used for the comparisons.

For the software stack we used Simcenter STAR-CCM+ 2021.3, the CentOS 8.1 HPC Azure marketplace image (maintained on GitHub), and the HPC-X 2.8.3 MPI library.

How we calculated performance and cost

To calculate the performance, we ran each benchmark 3 times at each VM count and averaged the resulting "averaged elapsed times" obtained from the three runs. Next, we took the average elapsed time for the single HB VM and divided it by the averages obtained for the other VM counts and VM types. To calculate the relative costs, we took the average elapsed times calculated in the performance step and converted them from seconds to hours. We then multiplied the times by the pay-as-you-go hourly pricing. Once we had the costs, we divided each cost by the cost for a single HB VM to determine the relative cost. Our calculations did not include the Simcenter STAR-CCM+ licensing costs; however, if you are using the Power on Demand licensing scheme, you pay per job and not per number of cores used.

Benchmark Results

Small Benchmark (< 10 million cells)

The Reactor benchmark represents a reactive flow simulation in a reactor chamber using the segregated solvers, with a mesh of approximately 9.1M cells. Using the "Reactor" benchmark we evaluated the cost, performance, and scale for small workloads.

Figure 1: Relative performance and cost of the simulations for the "Reactor" benchmark using Simcenter STAR-CCM+ on Azure's HB-series VMs.

As can be seen in figure 1, the HB-series VMs demonstrate excellent parallel efficiency and cost performance up to 8 VMs. At 4 VMs we see the largest performance differential; at this point there are ~2.3M cells per VM. The HBv3 VM, with 1.5 GB of L3 3D V-Cache™ [2] (3x more L3 cache than HBv2 and 6x more than HB), delivered a ~3.3x performance improvement compared to HB VMs, launched only 4 years prior.
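Before moving on to the medium and large benchmarks, the arithmetic described under "How we calculated performance and cost" can be summarized in a short sketch. The function name, elapsed times, and hourly prices below are illustrative placeholders, not the authors' data or actual Azure pricing.

```python
# Relative performance and relative cost, normalised to a single HB VM, following
# the methodology described above. All numeric inputs below are made-up examples.

def relative_metrics(avg_elapsed_s: float, n_vms: int, usd_per_vm_hour: float,
                     baseline_elapsed_s: float, baseline_usd_per_hour: float):
    """Return (relative_performance, relative_cost) versus one baseline HB VM."""
    rel_perf = baseline_elapsed_s / avg_elapsed_s
    job_cost = (avg_elapsed_s / 3600.0) * usd_per_vm_hour * n_vms
    baseline_cost = (baseline_elapsed_s / 3600.0) * baseline_usd_per_hour
    return rel_perf, job_cost / baseline_cost

# Hypothetical example: a run that finishes 6x faster on 8 VMs than on 1 HB VM.
print(relative_metrics(avg_elapsed_s=100.0, n_vms=8, usd_per_vm_hour=3.60,
                       baseline_elapsed_s=600.0, baseline_usd_per_hour=2.28))
```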
Medium Benchmark (15-20 million cells)

The LeMans Poly 17M benchmark represents an external flow simulation around a race car using the segregated solvers, with a mesh of approximately 17M cells. For this benchmark, we scaled up to 32 VMs. At the 32-VM point, each HBv2 and HBv3 VM simulates ~0.5 million cells, which equates to ~7,000 cells/core; HB VMs simulated twice that amount (~1M cells per VM, ~14,000 cells/core). The results are shown below in figure 2.

Figure 2: Relative performance and cost of the simulations for the "LeMans Poly 17M" benchmark using Simcenter STAR-CCM+ on Azure's HB-series VMs.

For the LeMans Poly 17M benchmark, the HBv3 VMs delivered the lowest cost per job at 8 VMs. For 8 VMs, there are ~2.1 million grid cells per VM. Here, the HBv3 delivered a ~3.4x performance improvement compared to HB VMs, with a speedup of ~10.4 versus a linear speedup of 8.

Large Benchmarks (~100 million cells)

The LeMans 100M Coupled benchmark is an extension of the LeMans Poly 17M benchmark with a much finer grid. It represents an external flow simulation around a race car using the coupled solvers, with a mesh of approximately 106M cells.

Figure 3: Relative performance and cost of the simulations for the "LeMans 100M Coupled" benchmark using Simcenter STAR-CCM+ on Azure's HB-series VMs.

For the LeMans 100M Coupled benchmark, the HBv3 VMs demonstrate excellent scalability and efficiency at all VM counts, up to 64 VMs, as seen in figure 3. The best scalability and lowest costs per job were obtained at 32 and 64 VMs. There are ~1.6 million cells per VM at 64 VMs and ~3.3 million cells per VM at 32 VMs. Here, the HBv3 delivered a ~2.8x performance improvement compared to HB VMs, with a speedup of ~38 versus a linear speedup of 32.

Summary and Conclusions

These benchmarking studies showcase Azure's commitment to continuously improving the computational performance, scalability, and cost effectiveness of its HPC offerings over 3 generations of hardware. These improvements enable our customers to cut their overall costs and time per solution. When compared to the 3-5-year life span of an on-premises cluster, the yearly HB-series improvements lead to significant cost savings, faster time to solution, and shorter time to market. Compared to the 3-4-year-old HB hardware, HBv3 delivers performance gains which help engineers and researchers be more productive and cut the VM cost per solution by 40-50%. The 1.5 GB of L3 3D V-Cache™ on HBv3 provides significant performance gains for CFD workloads. From the Simcenter STAR-CCM+ benchmarking results reported above, we see that optimal performance is achieved at around 2-3 million cells per VM. We invite you to learn more about the Azure HB series to see how it can help your business meet the challenges of tomorrow.

Additional Information:
Azure High-Performance Computing
Azure HPC documentation | Microsoft Learn
HBv3-series - Azure Virtual Machines | Microsoft Learn
HBv2-series - Azure Virtual Machines | Microsoft Learn
HB-series - Azure Virtual Machines | Microsoft Learn
Simcenter STAR-CCM+ | Siemens Software

#AzureHPCAI #AzureHPC

[1] Cost comparisons were constructed using publicly available pay-as-you-go pricing in the East US region.
[2] https://www.amd.com/en/technologies/3d-v-cache

Deploy NDm_v4 (A100) Kubernetes Cluster
We show how to deploy an optimal NDm_v4 (A100) AKS cluster, making sure that all 8 GPUs and 8 InfiniBand devices available on each virtual machine come up correctly and are available to deliver optimal performance. A multi-node NCCL allreduce job is executed on the NDm_v4 AKS cluster to verify that it is deployed and configured correctly.
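The usual tool for this kind of verification is the nccl-tests allreduce benchmark. As a hedged, self-contained illustration of the same idea, the sketch below uses PyTorch's NCCL backend to run an allreduce across all ranks and check the result; the script name, torchrun launch line, and environment handling are our assumptions, not the exact job from the post.

```python
# Hedged sketch: exercise NCCL allreduce across all ranks, e.g. launched with
# `torchrun --nnodes=<N> --nproc_per_node=8 allreduce_check.py` on the AKS pods.
# The post's verification likely uses the nccl-tests binaries; this is only an
# illustrative functional check, not a bandwidth benchmark.
import os
import torch
import torch.distributed as dist

def main() -> None:
    dist.init_process_group(backend="nccl")      # rank/world size come from the launcher
    local_rank = int(os.environ.get("LOCAL_RANK", 0))
    torch.cuda.set_device(local_rank)

    # Each rank contributes a tensor filled with its own rank id; after the
    # allreduce every rank should hold sum(0..world_size-1) in every element.
    world_size = dist.get_world_size()
    x = torch.full((64 * 1024 * 1024,), float(dist.get_rank()), device="cuda")
    dist.all_reduce(x, op=dist.ReduceOp.SUM)
    expected = world_size * (world_size - 1) / 2
    assert torch.allclose(x, torch.full_like(x, expected)), "allreduce mismatch"

    if dist.get_rank() == 0:
        print(f"NCCL allreduce OK across {world_size} ranks")
    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```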