Benchmark EDA workloads on Azure Intel Emerald Rapids (EMR) VMs
Co-authors: Nalla Ram and Wiener Evgeny, Intel

Electronic Design Automation (EDA) consists of a collection of software tools and workflows used for designing semiconductor products, most notably advanced computer chips. With today's rapid pace of innovation, there is increasing demand for higher performance, smaller chip sizes, and lower power consumption. EDA tools require multiple nodes and numerous CPUs (cores) in a cluster to meet this demand. A high-performance network and a centralized file system support this multi-node cluster, ensuring that all components act as a unified whole to provide both consistency and scale-out performance. The cloud offers EDA engineers access to substantial computing capacity, enabling faster design and characterization while removing the need for in-house CPUs that may sit idle during off-peak times. It also provides enhanced quality through higher simulation coverage and facilitates global collaboration among designers.

Objective

In this article, we will evaluate the performance of the latest Azure VMs, which use the 5th Gen Intel® Xeon® Platinum 8573C (Emerald Rapids, "EMR") processor, comparing them to the previous Ice Lake generation. We compared the new D64dsv6 and FX64v2 VMs, both running on the same EMR CPU model, against the previous-generation D64dsv5 VM. The new Dsv6 and Dlsv6 series VMs provide two different memory-to-vCPU ratios; Dsv6 supports up to 128 vCPUs and 512 GiB of RAM. The Esv6 and Edsv6 series VMs are also available, offering up to 192 vCPUs and over 1800 GiB of RAM. FXv2-series VMs feature an all-core turbo frequency of up to 4.0 GHz, supporting a 21:1 memory-to-vCPU ratio with base sizes and an even higher ratio with constrained-core sizes. Please refer to New Azure Dlsv6, Dsv6, Esv6 VMs with new CPU, Azure Boost, and NVMe Support and Azure FXv2-series Virtual Machines based on the Emerald Rapids processor for more details.

We will use two leading EDA tools, Cadence Spectre-X and Synopsys VCS, to benchmark design simulations. We will explore different real-world scenarios, including single-threaded, multi-threaded, and multiple jobs running on one node. Additionally, we will conduct a cost-effectiveness analysis to give silicon companies guidance when considering migrating EDA workloads to Azure.

Testing environment

Figure 1: Testing environment

Compute VMs, the license server VM, and storage all reside in the same Proximity Placement Group to reduce network latency. We used Azure NetApp Files (ANF) as our NFS storage solution with a Premium 4 TiB volume, which provides up to 256 MB/s of throughput.

Cadence Spectre-X Use Case

Business Value

Cadence Spectre-X is a top-tier EDA tool designed for large-scale verification simulations of complex analog, RF, and mixed-signal blocks. With multi-threaded simulation capabilities, users can run single analyses, such as TRAN or HB, on a multi-core machine. Spectre-X excels at distributing simulation workloads across the Azure cloud, leveraging thousands of CPU cores to enhance performance and capacity.

Test Case

The test design is a post-layout DSPF design with over 100,000 circuit entries. All tools, design files, and output files are stored on the shared Azure NetApp Files (ANF) volume (please refer to Benefits of using Azure NetApp Files for EDA). Simulations are run by varying the number of threads per job using the +mt option, and the total time for each run is recorded from the output log files. Here is an example command for running a single-threaded (+mt=1) Spectre-X job:

    spectre -64 +preset=cx +mt=1 input.scs -o SPECTREX_cx_1t +lqt 0 -f sst2
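A minimal wrapper can drive the same command as a sweep over thread counts and pull the elapsed time out of each run's log. The sketch below is illustrative only: it reuses input.scs from the example above, and the log file name and "elapsed" pattern are assumptions that may need adjusting for your Spectre-X version.

```bash
#!/usr/bin/env bash
# Hypothetical thread-count sweep for the Spectre-X benchmark above.
for mt in 1 2 4 8; do
    out="SPECTREX_cx_${mt}t"
    spectre -64 +preset=cx +mt=${mt} input.scs -o "${out}" +lqt 0 -f sst2
    # Log file name and elapsed-time pattern are assumptions; check your logs.
    grep -i "elapsed" "${out}.log" | tail -1
done
```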
During the test, we observed that the number of utilized CPUs matched the number of threads per job. We also noted low storage read/write activity (fewer than 2,000 IOPS), low network bandwidth usage, and minimal memory use. This indicates that Spectre-X is a highly compute-intensive, CPU-bound workload.

Benchmark Results

The table below displays the total elapsed time (in seconds) for Spectre-X simulation runs with thread counts of 1, 2, 4, and 8; a lower value indicates better performance. These results demonstrate the Spectre-X tool's efficiency in distributing workloads on both Ice Lake instances (D64dsv5) and Emerald Rapids instances (D64dsv6 and FX64v2). As expected, we observed that the D64dsv6 instances, with an all-core turbo frequency of up to 3.6 GHz, perform 12 to 18% better, and the FX64v2 instances, with a CPU frequency of up to 4.0 GHz, perform 22 to 29% better than the D64dsv5 instances.

Figure 2: Total elapsed time (in seconds) for Spectre-X simulations with thread counts of 1, 2, 4, and 8

Figure 3: Performance improvement of multi-threaded Spectre-X jobs

Cost-effectiveness Analysis

By estimating the total time and VM cost for running 500 single-threaded jobs, we found that the D64dsv6 instances were the most cost-effective option, while the FX64v2 instances achieved the fastest total time. This gives customers options depending on whether they prioritize cost savings or faster job completion when choosing Azure EMR VMs.

Figure 4: Cost-effectiveness estimate for running 500 single-threaded Spectre-X jobs

Synopsys VCS Use Case

Business Value

VCS is Synopsys' functional verification solution. A significant portion of compute capacity in chip design (up to 40%) is consumed by front-end validation tools, specifically RTL simulators such as Synopsys VCS. The chip logic design cycle consists of recurrent design and validation cycles. The validation cycle involves running a set of simulator tests (a regression) on the latest design snapshot. The ability to scale and accelerate the VCS test regression, keeping validation up to date with design changes, is crucial for a project to meet its time-to-market and quality goals, as shown schematically in Figure 5.

Figure 5: Scaling front-end regression accelerates the design validation cycle and improves quality; in the example on the left panel, Models B and C would land only after Design A validation is completed

Test Case

As a test case we used a containerized, representative VCS test of an Intel design:

- Complex RTL design (>10M gates)
- SVTB (SystemVerilog Test Bench) simulation test running 100K cycles
- Resident memory footprint per simulation instance of 7 GB

Benchmark Results

We ran our test-case VCS simulation separately on the D64dsv5 (Xeon 3, Ice Lake) system and the FX64v2 (Xeon 5, Emerald Rapids) system, scaling from 1 to 32 parallel VCS tests on each. Since VCS simulation is a CPU-intensive application, we anticipated performance acceleration on the FX64v2. The new Emerald Rapids CPU architecture offers higher instructions per cycle (IPC) at the same frequency compared to previous generations. It operates at a higher all-core turbo frequency (4.0 GHz vs. 3.5 GHz for Ice Lake), features larger L2 and L3 caches and faster UPI NUMA links, supports DDR5 memory instead of the DDR4 in Ice Lake, and includes PCIe 5.0 compared to PCIe 4.0 in Ice Lake.
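The parallel-scaling runs can be driven with a simple launcher. The sketch below is a hypothetical illustration, not the harness used in this benchmark: the container image name (vcs-test:latest) and entry point (./run_sim.sh) are placeholders, since the Intel test case itself is not public.

```bash
#!/usr/bin/env bash
# Hypothetical launcher: run N containerized VCS tests in parallel
# and report the overall wall time. Image and entry point are placeholders.
N=${1:-8}
start=$(date +%s)
for i in $(seq 1 "${N}"); do
    docker run --rm -v "${PWD}/results/${i}:/out" vcs-test:latest ./run_sim.sh &
done
wait    # block until all N simulations finish
echo "N=${N} wall time: $(( $(date +%s) - start )) s"
```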
Note that the results presented here are for the specific Intel IP design used; results may vary based on individual configuration and the design used for testing. As expected, we observed a speedup on the Emerald Rapids instance compared to the Ice Lake instance of 17 to 43% across the range of simultaneous simulations shown in the charts below (see Figures 6 and 7).

Figure 6: Scaling parallel VCS simulation tests on Emerald Rapids vs. Ice Lake Azure instances; the vertical axis shows average simulation test runtime in seconds

Figure 7: Speedup percentage for VCS on the Emerald Rapids instance compared to the Ice Lake instance

Summary

This article evaluates the performance of the latest Azure VMs using the 5th Gen Intel® Xeon® Platinum 8573C (Emerald Rapids) processor by comparing them to the previous Ice Lake generation. Using two EDA tools, Cadence Spectre-X and Synopsys VCS, the benchmarks cover real-world scenarios including single-threaded, multi-threaded, and multiple jobs running on one node. Results show that Spectre-X performs 12 to 18% better on D64dsv6 instances and 22 to 29% better on FX64v2 instances compared to D64dsv5 instances. The D64dsv6 instances were found to be more cost-effective, while the FX64v2 instances achieved the shortest total runtime. For Synopsys VCS, the benchmarks revealed a speedup of 17 to 43% for Emerald Rapids instances over Ice Lake instances across various parallel simulation counts. These findings give EDA customers options for selecting Azure EMR instances based on cost-efficiency and runtime priorities.
Announcing Azure HBv5 Virtual Machines: A Breakthrough in Memory Bandwidth for HPC

Discover the new Azure HBv5 Virtual Machines, unveiled at Microsoft Ignite, designed for high-performance computing applications. With up to 7 TB/s of memory bandwidth and custom 4th Generation EPYC processors, these VMs are optimized for the most memory-intensive HPC workloads. Sign up for the preview starting in the first half of 2025 and see them in action at Supercomputing 2024 in Atlanta.
Optimizing Language Model Inference on Azure

Inefficient inference optimization can lead to skyrocketing costs for customers, making it crucial to establish clear performance benchmarking numbers. This blog sets the standard for expected performance, helping customers make informed decisions that maximize efficiency and minimize expenses with the new Azure ND H200 v5-series.
Benchmarking 6th gen. Intel-based Dv6 (preview) VM SKUs for HPC Workloads in Financial Services

Introduction

In the fast-paced world of financial services, high-performance computing (HPC) systems in the cloud have become indispensable. From instrument pricing and risk evaluations to portfolio optimizations and regulatory workloads like CVA and FRTB, the flexibility and scalability of cloud deployments are transforming the industry. Unlike traditional HPC systems that require complex parallelization frameworks (e.g. depending on MPI and InfiniBand networking), many financial calculations can be executed efficiently on general-purpose SKUs in Azure. Depending on the codes used to perform the calculations, many implementations leverage vendor-specific optimizations such as Intel's AVX-512. With the recent announcement of the public preview of the 6th generation of Intel-based Dv6 VMs (see here), this article explores the performance evolution across three generations of D32ds, from D32ds_v4 to D32ds_v6. We follow a testing methodology similar to the January 2023 article "Benchmarking on Azure HPC SKUs for Financial Services Workloads" (link here).

Overview of the D-series VMs in focus

The official announcement notes that the upcoming Dv6 series (currently in preview) offers significant improvements over the previous Dv5 generation. Key highlights include:

- Up to 27% higher vCPU performance and a threefold increase in L3 cache compared to the previous-generation Intel Dl/D/Ev5 VMs.
- Support for up to 192 vCPUs and more than 1800 GiB of memory.
- Azure Boost, which provides up to 400,000 IOPS and 12 GB/s remote storage throughput, and up to 200 Gbps VM network bandwidth.
- A 46% increase in local SSD capacity and more than three times the read IOPS.
- NVMe interface for both local and remote disks.

Note: Enhanced security through Total Memory Encryption (TME) technology is not activated in the preview deployment and will be benchmarked once available.

Technical specifications for three generations of D32ds SKUs:

| | D32ds_v4 | D32ds_v5 | D32ds_v6 |
|---|---|---|---|
| Number of vCPUs | 32 | 32 | 32 |
| InfiniBand | N/A | N/A | N/A |
| Processor | Intel® Xeon® Platinum 8370C (Ice Lake) or Intel® Xeon® Platinum 8272CL (Cascade Lake) | Intel® Xeon® Platinum 8370C (Ice Lake) | Intel® Xeon® Platinum 8573C (Emerald Rapids) |
| Peak CPU Frequency | 3.4 GHz | 3.5 GHz | 3.0 GHz |
| RAM per VM | 128 GB | 128 GB | 128 GB |
| RAM per core | 4 GB | 4 GB | 4 GB |
| Attached Disk | 1200 GiB SSD | 1200 GiB SSD | 440 GiB SSD |

Benchmarking Setup

For our benchmarking setup, we utilised the user-friendly, open-source Phoronix Test Suite (link) to run two tests from the OpenBenchmarking.org test suite, specifically targeting quantitative finance workloads. The tests in the "finance suite" are divided into two groups, each running independent benchmarks. In addition to the finance test suite, we also ran the AI-Benchmark to evaluate the evolution of AI inferencing capabilities across the three VM generations.
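Before running the suites, it can be useful to confirm that the preview VM actually exposes the expected Emerald Rapids CPU and AVX-512 instruction support. A quick, generic check on a standard Ubuntu image (not part of the original setup) might look like this:

```bash
# Report the CPU model and count the AVX-512 feature flags exposed to the VM
lscpu | grep 'Model name'
lscpu | grep -i '^flags' | tr ' ' '\n' | grep -c '^avx512'
```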
The benchmarks in each suite are:

| Finance Bench | QuantLib | AI Benchmark |
|---|---|---|
| Bonds OpenMP | Size XXS | Device Inference Score |
| Repo OpenMP | Size S | Device AI Score |
| Monte-Carlo OpenMP | | Device Training Score |

Software dependencies:

| Component | Version |
|---|---|
| OS Image | Ubuntu marketplace image: 24_04-lts |
| Phoronix Test Suite | 10.8.5 |
| QuantLib Benchmark | 1.35-dev |
| Finance Bench Benchmark | 2016-07-25 |
| AI Benchmark Alpha | 0.1.2 |
| Python | 3.12.3 |

To run the benchmark on a freshly created D-series VM, execute the following commands (after updating the installed packages to the latest versions):

```bash
git clone https://github.com/phoronix-test-suite/phoronix-test-suite.git
sudo apt-get install php-cli php-xml cmake
cd phoronix-test-suite
sudo ./install-sh
phoronix-test-suite benchmark finance
```

For the AI Benchmark tests, a few additional steps are required: creating a virtual environment for the additional Python packages and installing the tensorflow and ai-benchmark packages:

```bash
sudo apt install python3 python3-pip python3-virtualenv
mkdir ai-benchmark && cd ai-benchmark
virtualenv virtualenv
source virtualenv/bin/activate
pip install tensorflow
pip install ai-benchmark
phoronix-test-suite benchmark ai-benchmark
```

Benchmarking Runtimes and Results

The purpose of this article is to share the results of a set of benchmarks that closely align with the use cases mentioned in the introduction. Most of these use cases are predominantly CPU-bound, which is why we have limited the benchmark to D-series VMs. For memory-bound codes that would benefit from a higher memory-to-core ratio, the new Ev6 SKU could be a suitable option. In the picture below, you can see a representative benchmarking run on a Dv6 VM, where nearly 100% of the CPUs were utilised during execution. The individual runs of the Phoronix test suite, starting with Finance Bench and followed by QuantLib, are clearly visible.

Runtimes:

| Benchmark | VM Size | Start Time | End Time | Duration | Minutes |
|---|---|---|---|---|---|
| Finance Benchmark | Standard D32ds_v4 | 12:08 | 15:29 | 03:21 | 201.00 |
| Finance Benchmark | Standard D32ds_v5 | 11:38 | 14:12 | 02:34 | 154.00 |
| Finance Benchmark | Standard D32ds_v6 | 11:39 | 13:27 | 01:48 | 108.00 |

Finance Bench Results

QuantLib Results

AI Benchmark Alpha Results

Discussion of the results

The results show significant performance improvements in QuantLib across the D32v4, D32v5, and D32v6 generations. Specifically, tasks per second for Size S increased by 47.18% from D32v5 to D32v6, while Size XXS saw an increase of 45.55%. Benchmark times for Repo OpenMP and Bonds OpenMP also decreased, indicating better performance: Repo OpenMP times were reduced by 18.72% from D32v4 to D32v5 and by 20.46% from D32v5 to D32v6, and Bonds OpenMP times decreased by 11.98% from D32v4 to D32v5 and by 18.61% from D32v5 to D32v6. In terms of Monte-Carlo OpenMP performance, the D32v6 showed the best result with a time of 51,927.04 ms, followed by the D32v5 at 56,443.91 ms and the D32v4 at 57,093.94 ms, corresponding to runtime reductions of 1.14% from D32v4 to D32v5 and 8.00% from D32v5 to D32v6. AI Benchmark Alpha scores for device inference and training also improved significantly: inference scores increased by 15.22% from D32v4 to D32v5 and by 42.41% from D32v5 to D32v6, while training scores increased by 21.82% from D32v4 to D32v5 and by 43.49% from D32v5 to D32v6. Finally, Device AI scores improved across the generations, with D32v4 scoring 6726, D32v5 scoring 7996, and D32v6 scoring 11436; the percentage increases were 18.88% from D32v4 to D32v5 and 43.02% from D32v5 to D32v6.
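The headline gains track the raw wall-clock numbers directly; as a quick cross-check, the finance-suite durations from the runtimes table above translate into generation-over-generation improvements as follows:

```bash
# Percent runtime reduction between generations, using the finance-suite
# durations from the table above (201, 154 and 108 minutes)
awk 'BEGIN {
    v4 = 201; v5 = 154; v6 = 108
    printf "D32ds_v4 -> D32ds_v5: %.1f%% faster\n", (v4 - v5) / v4 * 100  # ~23.4%
    printf "D32ds_v5 -> D32ds_v6: %.1f%% faster\n", (v5 - v6) / v5 * 100  # ~29.9%
}'
```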
Next Steps & Final Comments

The public preview of the new Intel SKUs has already shown very promising benchmarking results, indicating a significant performance improvement compared to the previous D-series generations, which are still widely used in FSI scenarios. It is important to note that your custom code or purchased libraries might exhibit different characteristics than the benchmarks selected here, so we recommend validating the performance indicators with your own setup. In this benchmarking setup we have not disabled Hyper-Threading on the CPUs, so the available cores are exposed as virtual cores (a quick way to check this is shown below); if this scenario is of interest to you, please reach out to the authors for more information. Additionally, Azure offers a wide range of VM families to suit various needs, including F, FX, Fa, D, Da, E, Ea, and specialized HPC SKUs like HC and HB VMs. A dedicated validation based on your individual code and workload is recommended here as well, to ensure the best-suited SKU is selected for the task at hand.
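For reference, a generic way to verify the SMT (Hyper-Threading) configuration on a running Linux VM, independent of this particular setup:

```bash
# "Thread(s) per core: 2" indicates Hyper-Threading/SMT is active
lscpu | grep -E 'Thread\(s\) per core|Core\(s\) per socket|Socket\(s\)'
```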
Performance & Scalability of HBv4 and HX-Series VMs with Genoa-X CPUs

Azure has announced the general availability of Azure HBv4-series and HX-series virtual machines (VMs) for high performance computing (HPC). This blog provides in-depth technical and performance information about these HPC-optimized VMs.
A quick start guide to benchmarking AI models in Azure: Llama 2 from MLPerf Inference v4.0

Microsoft Azure has delivered industry-leading results for AI inference workloads among cloud service providers in the most recent MLPerf Inference results published publicly by MLCommons. The Azure results were achieved using the new NC H100 v5 Virtual Machines (VMs) and reinforce Azure's commitment to designing AI infrastructure that is optimized for training and inferencing in the cloud. This document provides the steps to reproduce the results for the Llama 2 model from MLPerf Inference v4.0 on the new NC H100 v5 virtual machines.
Training large AI models on Azure using CycleCloud + Slurm

Here we demonstrate, and provide a template for, deploying a computing environment optimized to train a transformer-based large language model on Azure. It uses CycleCloud, a tool to orchestrate and manage HPC environments, to provision a cluster of A100 or H100 nodes managed by Slurm. Such environments have been deployed to train foundation models with tens to hundreds of billions of parameters on terabytes of data.