Azure High Performance Computing (HPC) Blog


Benchmark EDA workloads on Azure Intel Emerald Rapids (EMR) VMs

RaymondTsai

Microsoft

Nov 25, 2024

Co-authors: Nalla Ram and Wiener Evgeny, Intel

Electronic Design Automation (EDA) consists of a collection of software tools and workflows used for designing semiconductor products, most notably advanced computer chips. With today's rapid pace of innovation, there is an increasing demand for higher performance, smaller chip sizes, and lower power consumption. EDA tools require multiple nodes and numerous CPUs (cores) in a cluster to meet this demand. A high-performance network and a centralized file system support this multi-node cluster, ensuring that all components act as a unified whole to provide both consistency and scale-out performance.

The cloud offers EDA engineers access to substantial computing capacities, enabling faster design and characterization while removing the need for in-house CPUs that may remain idle during off-peak times. Additionally, it provides enhanced quality with higher simulation coverage and facilitates global collaboration among designers.

Objective

In this article, we will evaluate the performance of the latest Azure VMs, which utilize the 5th Gen Intel® Xeon® Platinum 8537C (Emerald Rapids “EMR”) processor, comparing them to the previous Ice Lake generation.

We compared the new D64dsv6 and FX64v2 VMs, both running on the same EMR CPU model, against the previous generation D64dsv5 VM. The new Dsv6 and Dlsv6 series VMs provide two different memory-to-vCPU ratios. Dsv6 supports up to 128 vCPUs and 512 GiB of RAM. The Esv6 and Edsv6 series VMs are also available, offering up to 192 vCPUs and over 1800 GiB of RAM. FXv2 series VMs feature an all-core turbo frequency of up to 4.0 GHz, supporting a 21:1 memory-to-vCPU ratio with base sizes, and an even higher ratio with constrained core sizes. For more details, please refer to "New Azure Dlsv6, Dsv6, Esv6 VMs with new CPU, Azure Boost, and NVMe Support" and "Azure FXv2-series Virtual Machines based on the Emerald Rapids processor".

We will use two leading EDA tools, Cadence Spectre-X and Synopsys VCS, to benchmark design simulations. We will explore several real-world scenarios, including single-threaded jobs, multi-threaded jobs, and multiple jobs running on one node. Additionally, we will conduct a cost-effectiveness analysis to give silicon companies guidance when they consider migrating EDA workloads to Azure.

Testing environment

Figure 1: Testing environment

Compute VMs, the license server VM, and storage all reside in the same Proximity Placement Group to reduce network latency. We used Azure NetApp Files (ANF) as our NFS storage solution with a Premium 4 TiB volume, which provides up to 256 MB/s throughput.
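The 256 MB/s figure is consistent with how ANF documents throughput for auto QoS volumes: a per-TiB rate for each service level, multiplied by the provisioned quota. A minimal sketch of that arithmetic, assuming the published per-TiB rates (the function name is ours, for illustration only):

```python
# Documented ANF throughput per provisioned TiB (MiB/s) for auto QoS volumes.
# These rates come from Azure NetApp Files service-level documentation,
# not from this article; verify against current docs before sizing.
ANF_MIBPS_PER_TIB = {"Standard": 16, "Premium": 64, "Ultra": 128}


def anf_throughput_limit(service_level: str, quota_tib: float) -> float:
    """Return the throughput ceiling in MiB/s for a volume of the given size."""
    return ANF_MIBPS_PER_TIB[service_level] * quota_tib


print(anf_throughput_limit("Premium", 4))  # 256
```

If a workload turns out to be I/O-bound rather than CPU-bound, the same calculation shows how much additional quota or a higher service level would raise the ceiling.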

Cadence Spectre-X Use Case

Business Value

Cadence Spectre-X is a top-tier EDA tool designed for large-scale verification simulations of complex analog, RF, and mixed-signal blocks. With multi-threaded simulation capabilities, users can run single analyses, such as TRAN or HB, on a multi-core machine. Spectre-X excels at distributing simulation workloads across Azure cloud, leveraging thousands of CPU cores to enhance performance and capacity.

Test Case

The test design is a post-layout DSPF design with over 100,000 circuit entries. All tools, design files, and output files are stored on the shared Azure NetApp Files (ANF) volume. (Please refer to Benefits of using Azure NetApp Files for EDA.) Simulations are run by altering the number of threads per job using the +mt option, and the total time for each run is recorded from the output log files.

Here is an example command for running a single-threaded (+mt=1) Spectre-X job:

spectre -64 +preset=cx +mt=1 input.scs -o SPECTREX_cx_1t +lqt 0 -f sst2
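The thread sweep described above can be scripted by varying only the +mt value. The following is a minimal sketch that reuses exactly the flags from the example command; the helper name and the per-thread output directory naming (extrapolated from SPECTREX_cx_1t) are our assumptions, and the script only prints the commands rather than launching the simulator.

```python
import shlex


def spectre_command(threads: int, netlist: str = "input.scs") -> list[str]:
    """Build the Spectre-X command line for a given thread count.

    Only flags that appear in the article's example are used; the output
    directory pattern SPECTREX_cx_<N>t is an assumed extension of it.
    """
    return [
        "spectre", "-64", "+preset=cx", f"+mt={threads}", netlist,
        "-o", f"SPECTREX_cx_{threads}t", "+lqt", "0", "-f", "sst2",
    ]


# Thread counts used in the benchmark.
for n in (1, 2, 4, 8):
    print(shlex.join(spectre_command(n)))
```

To execute a run, the returned list could be passed to `subprocess.run(cmd, check=True)`. The elapsed times reported in this article were taken from the Spectre-X output logs, not from wrapper timing.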

During the test, we observed that the number of utilized CPUs matched the number of threads per job. We also noted low storage read/write operations (fewer than 2,000 IOPS), low network bandwidth usage, and minimal memory use during the test. This indicates that Spectre-X is a highly compute-intensive and CPU-bound workload.

Benchmark Results

The table below displays the total elapsed time (in seconds) for Spectre-X simulation runs with thread counts of 1, 2, 4, and 8. A lower value indicates better performance. These results demonstrate the Spectre-X tool's efficiency in distributing workloads on both Ice Lake instances (D64dsv5) and Emerald Rapids instances (D64dsv6 and FX64v2). As expected, the D64dsv6 instances, with an all-core turbo frequency of up to 3.6 GHz, perform 12 to 18% better than the D64dsv5 instances, and the FX64v2 instances, with an all-core turbo frequency of up to 4.0 GHz, perform 22 to 29% better.

Figure 2. Total elapsed time (in seconds) for Spectre-X simulation with thread counts of 1, 2, 4, and 8.

Figure 3. Performance improvement of multi-threading Spectre-X jobs
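When reproducing these comparisons, note that a "percent better" figure can be computed in two common ways from elapsed times, and the two give different numbers. The article does not state which definition was used, so the sketch below shows both; the elapsed times are hypothetical placeholders, not measured results.

```python
def time_reduction_pct(baseline_s: float, new_s: float) -> float:
    """Percentage of wall-clock time saved relative to the baseline."""
    return (baseline_s - new_s) / baseline_s * 100.0


def speedup_pct(baseline_s: float, new_s: float) -> float:
    """Percentage increase in work done per unit time relative to the baseline."""
    return (baseline_s / new_s - 1.0) * 100.0


# Hypothetical elapsed times in seconds (NOT taken from the benchmark).
baseline_s, new_s = 1000.0, 850.0

print(round(time_reduction_pct(baseline_s, new_s), 1))  # 15.0
print(round(speedup_pct(baseline_s, new_s), 1))         # 17.6
```

Stating the definition alongside the number makes results from different VM series or tools directly comparable.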

Cost-effectiveness Analysis

By estimating the total time and VM costs for running 500 single-threaded jobs, we found that the D64dsv6 instances were the most cost-effective option, while the FX64v2 instances achieved the fastest total time. This provides customers with options depending on whether they prioritize cost savings or faster job completion times when choosing Azure EMR VMs.

Figure 4. Cost-effectiveness estimate for running 500 single-threaded Spectre-X jobs
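One simple way to build an estimate of this kind is to run the jobs in fixed-size "waves" on a single VM and multiply the resulting wall-clock hours by the hourly VM price. The sketch below is an illustration under stated assumptions, not the method used for Figure 4: the per-job runtime, the number of concurrent jobs per VM, and the hourly price are all hypothetical, and it assumes per-job runtime does not degrade when the VM is fully loaded.

```python
import math


def batch_estimate(jobs: int, job_seconds: float,
                   concurrent_jobs: int, usd_per_hour: float) -> tuple[float, float]:
    """Return (total_hours, total_cost_usd) for a batch run on one VM.

    Jobs run in waves of `concurrent_jobs`; each wave takes `job_seconds`.
    """
    waves = math.ceil(jobs / concurrent_jobs)
    hours = waves * job_seconds / 3600.0
    return hours, hours * usd_per_hour


# Hypothetical inputs: 900 s per job, 64 jobs at a time, $4.00 per VM-hour.
hours, cost = batch_estimate(jobs=500, job_seconds=900.0,
                             concurrent_jobs=64, usd_per_hour=4.0)
print(hours, cost)  # 2.0 8.0
```

Running the same function with each VM's measured single-threaded runtime and current list price shows the trade-off described above: a faster but pricier VM can finish sooner while costing more per batch.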

Synopsys VCS Use Case

Business value

VCS is a Synopsys functional verification solution. A significant portion of compute capacity (up to 40%) in chip design is consumed by front-end validation tools, specifically RTL simulators like Synopsys VCS. The chip logic design cycle consists of recurrent design and validation cycles. The validation cycle involves running a set of simulator tests (regression) on the latest design snapshot. The ability to scale and accelerate the VCS test regression and keep validation up to date with design changes is crucial for a project to meet its time-to-market and quality goals, as shown schematically in Figure 5.

Figure 5: Scaling front-end regression accelerates the design validation cycle and improves quality. In the left panel of the example above, Model B and C would land after Design A validation is completed.

Test case

As a test case, we used a containerized, representative VCS test of an Intel design:

Complex RTL design (>10M gates)

SVTB (System-Verilog Test Bench) simulation test running 100K cycles

Resident memory footprint per simulation instance is 7 GB.

Benchmark Results

We ran our test case VCS simulation separately on the D64dsv5 (Xeon 3 - Ice Lake) system and the FX64v2 (Xeon 5 - Emerald Rapids) system, scaling from 1 to 32 parallel VCS tests on each.
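Because each simulation instance holds about 7 GB resident, it is worth checking that the highest parallel count fits in VM memory before attributing any slowdown to the CPU. A minimal sketch of that check follows; the 256 GiB memory size used for D64dsv5 comes from Azure's published VM specifications rather than from this article, GB and GiB are treated as interchangeable for this rough check, and the OS reserve value is a hypothetical parameter.

```python
def max_instances_by_memory(vm_memory_gb: float, per_instance_gb: float,
                            reserve_gb: float = 0.0) -> int:
    """Number of simulation instances that fit in memory after an optional reserve."""
    return int((vm_memory_gb - reserve_gb) // per_instance_gb)


PER_INSTANCE_GB = 7   # resident footprint per VCS instance (from the test case)
MAX_PARALLEL = 32     # highest parallel test count in the benchmark

print(MAX_PARALLEL * PER_INSTANCE_GB)                 # 224
print(max_instances_by_memory(256, PER_INSTANCE_GB))  # 36
```

Under these assumptions, 32 parallel tests need roughly 224 GB, which fits within a 256 GiB VM, so the scaling range in this benchmark should not be limited by memory capacity.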

Since VCS simulation is a CPU-intensive application, we anticipated performance acceleration on the FX64v2. The new Emerald Rapids CPU architecture offers higher instructions per cycle (IPC) at the same frequency compared to previous generations. It operates at a higher all-core Turbo frequency (4.0 GHz vs. 3.5 GHz for Ice Lake), features larger L2 and L3 caches, faster UPI NUMA links, supports DDR5 memory instead of DDR4 in Ice Lake, and includes PCIe 5.0 compared to PCIe 4.0 in Ice Lake.

The results presented here are for the specific Intel IP design used.

Results may vary based on the individual configuration and design used for testing.

As expected, we observed a speedup of 17 to 43% for the Emerald Rapids instance compared to the Ice Lake instance across the range of simultaneous simulations shown in the charts below (see Figure 6 and Figure 7).

Figure 6: Scaling the parallel VCS simulation tests on Emerald Rapids vs. Ice Lake Azure Instances. Vertical axis shows simulation test avg. runtime in sec.

Figure 7: Speedup percentage for VCS on Emerald Rapids instance compared to Ice Lake Azure instances

Summary

The article evaluates the performance of the latest Azure VMs using the 5th Gen Intel® Xeon® Platinum 8537C (Emerald Rapids) processor by comparing them to the previous Ice Lake generation. Using two EDA tools, Cadence Spectre-X and Synopsys VCS, the benchmarks involve real-world scenarios including single-threaded, multi-threaded, and multiple jobs running on one node.

Results show that Spectre-X performs 12 to 18% better on D64dsv6 instances and 22 to 29% better on FX64v2 instances compared to D64dsv5 instances. The D64dsv6 instances were found to be the most cost-effective, while the FX64v2 instances achieved the shortest total runtime. For Synopsys VCS, the benchmarks revealed a speedup of 17 to 43% for Emerald Rapids instances over Ice Lake instances across various parallel simulation counts. The findings give EDA customers options for selecting Azure EMR instances based on the cost-effectiveness analysis.

Updated Nov 27, 2024
