The Java Engineering Group (JEG) optimized the Microsoft Build of OpenJDK for Azure's first Ampere Altra Arm-based VM families as an essential part of this launch. The Ampere Altra CPUs are based on Arm's Neoverse N1 architecture. Hence, JEG focused its optimizations on Arm Neoverse N1-specific CPU features, such as hardware support for a variety of cryptographic algorithms, the Large System Extension (LSE) that improves scalability on large server systems, and the full physical core behind each vCPU. This helped ensure we could maximize the scaling factor for Java Virtual Machine (JVM) based applications and take advantage of other optimizations.
The Azure Compute Arm-based VM families include:
| VM family | vCPUs | Memory-to-vCPU ratio | Memory sizes |
| --- | --- | --- | --- |
| Dpsv5/Dpdsv5 (general purpose) | 2 – 64 | 4GiB per vCPU | 8GiB to 208GiB |
| Dplsv5/Dpldsv5 (general purpose) | 2 – 64 | 2GiB per vCPU | 4GiB to 128GiB |
| Epsv5/Epdsv5 (memory optimized) | 2 – 32 | 8GiB per vCPU | 16GiB to 208GiB |

Table 1: Azure Compute's Arm64 Offerings
We tracked allocations and CPU + system memory scaling numbers for the above SKUs. We also baselined and gathered scaling numbers for various CPU features introduced with Azure's Arm64 offerings.
The JEG used the Microsoft Build of OpenJDK 17.0.3+7 LTS, with Parallel GC and G1 GC for all our tests.
The Azure Arm64-based SKUs released in September 2022 offer a different feature set than x86-64-based processors, and Arm64 processors additionally have features tied to specific Instruction Set Architectures (ISAs). As a result, the JEG wanted to ensure that the Microsoft Build of OpenJDK took advantage of the new features in the Armv8.2 ISA supported by the Azure Arm64-based SKUs.
The purpose was to highlight the new CPUs' scaling implications and to introduce new intrinsics into the Microsoft Build of OpenJDK for the algorithms most used by the Azure SDK, such as MD5.
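For illustration, the kind of digest computation the MD5 intrinsic accelerates can be exercised through the standard `MessageDigest` API (a minimal sketch, not the Azure SDK's actual code):

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;

public class Md5Demo {
    public static void main(String[] args) throws Exception {
        // MessageDigest routes MD5 through the JDK's built-in provider; on
        // JVM builds with the MD5 intrinsic, the hot digest loop is compiled
        // to hand-optimized machine code instead of plain Java bytecode.
        MessageDigest md5 = MessageDigest.getInstance("MD5");
        byte[] digest = md5.digest("abc".getBytes(StandardCharsets.UTF_8));
        StringBuilder hex = new StringBuilder();
        for (byte b : digest) hex.append(String.format("%02x", b));
        System.out.println(hex); // prints 900150983cd24fb0d6963f7d28e17f72
    }
}
```

On builds that ship the intrinsic, its effect can typically be isolated for comparison by disabling it with `-XX:-UseMD5Intrinsics`.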
The tests consisted of simple microbenchmarks executed using the Java Microbenchmark Harness (JMH) as well as the SPECJVM2008 suite. These were specifically designed to stress and highlight the new CPU features listed below.
We modified the JMH benchmarks to gather data on scaling both with and without the UseLSE switch. The UseLSE switch, which stands for Large System Extension, toggles the use of single atomic instructions and memory operations (such as CAS, SWP, LDADD) in the JVM when running on ARM processors. Additionally, the JEG measured the impact of Load-Link/Store-Conditional (LL/SC) operations, utilizing exclusive load-acquire/store-release instructions (LDAXR/STLXR) for atomic read-modify-write operations, when the UseLSE flag was false.
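As a sketch of the operations LSE accelerates (not the JEG's actual JMH harness), the following contended counter exercises the atomic fetch-and-add path that `-XX:+UseLSE` maps to a single `LDADD` instruction and `-XX:-UseLSE` maps to an `LDAXR`/`STLXR` retry loop:

```java
import java.util.concurrent.atomic.AtomicLong;

public class AtomicDemo {
    public static void main(String[] args) throws InterruptedException {
        AtomicLong counter = new AtomicLong();
        int threads = 4;
        Thread[] workers = new Thread[threads];
        for (int i = 0; i < threads; i++) {
            workers[i] = new Thread(() -> {
                for (int n = 0; n < 100_000; n++) {
                    // The contended fetch-and-add below is exactly the kind of
                    // atomic read-modify-write whose cost the UseLSE flag changes.
                    counter.getAndIncrement();
                }
            });
            workers[i].start();
        }
        for (Thread t : workers) t.join();
        System.out.println(counter.get()); // prints 400000
    }
}
```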
The JEG used SPECJVM2008's Crypto with the following two sub-benchmark programs:
We used SPECJVM2008's Compress benchmark based on the Lempel-Ziv-Welch (LZW) algorithm to test compression.
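The Compress benchmark itself is not reproduced here, but the dictionary-building LZW scheme it is based on can be sketched in a few lines (a toy encoder, not the benchmark's implementation):

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class LzwDemo {
    // Minimal LZW encoder: starts with a 256-entry byte dictionary and adds
    // a new code for every previously unseen string it encounters.
    static List<Integer> compress(String input) {
        Map<String, Integer> dict = new HashMap<>();
        for (int i = 0; i < 256; i++) dict.put("" + (char) i, i);
        int nextCode = 256;
        String w = "";
        List<Integer> out = new ArrayList<>();
        for (char c : input.toCharArray()) {
            String wc = w + c;
            if (dict.containsKey(wc)) {
                w = wc;                       // extend the current match
            } else {
                out.add(dict.get(w));         // emit code for longest match
                dict.put(wc, nextCode++);     // learn the new string
                w = "" + c;
            }
        }
        if (!w.isEmpty()) out.add(dict.get(w));
        return out;
    }

    public static void main(String[] args) {
        // Repetitive input compresses: 24 characters become 16 codes.
        System.out.println(compress("TOBEORNOTTOBEORTOBEORNOT"));
    }
}
```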
To ensure accurate measurement of CPU features and scalability, the JEG took the following steps:
By conducting these tests, the JEG was able to gain insights into the new CPU features and their impact on performance, allowing for more informed decision-making in terms of hardware and software optimizations.
As mentioned, these Arm-based systems offer an entire physical core per vCPU. Hence, we wanted to study how we could optimize and fully utilize these physical cores to scale up and scale out different workloads on the Microsoft Build of OpenJDK.
The purpose was to discover any scaling implications for the Microsoft Build of OpenJDK across the various processor vs. memory configurations offered by the Ampere Altra-based SKUs. The JEG used several benchmarks/test harnesses (SPECjbb2015, HyperAlloc, YCSB/Cassandra), with the criterion of designing CPU vs. system memory scaling stress tests for each benchmark.
The JEG used industry benchmarks designed to drive load in various shapes on the JVM. We also ensured that the JVM utilized the available CPU and memory in each configuration offered by our Arm64-based SKUs.
The JEG utilized common practices to quieten the operating system and increase run-to-run repeatability before tests, including, but not limited to:
Enable physical CPU binding and local allocation for all runs using numactl --physcpubind=$cpurange --localalloc, where $cpurange consists of the cores under test.
Each benchmark also had its specific configuration.
The HyperAlloc benchmark is a part of the Heapothesys toolset developed by the Amazon Corretto team to help analyze and optimize the JVM’s garbage collectors. HyperAlloc is a multi-threaded benchmark that measures memory allocation performance in Java applications. It uses a scalable allocation scheme that reduces contention on shared resources and can handle large numbers of threads.
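HyperAlloc itself is far more sophisticated, but the core idea of driving a target allocation rate can be sketched as follows (a toy single-threaded pacer; the rate and chunk size are arbitrary illustrative values, not HyperAlloc's defaults):

```java
public class AllocRateDemo {
    public static void main(String[] args) throws InterruptedException {
        final int targetMBPerSec = 64;        // illustrative target rate
        final int chunkBytes = 64 * 1024;     // illustrative object size
        final int chunksPerSec = targetMBPerSec * 1024 * 1024 / chunkBytes;
        byte[] sink = null;
        long start = System.nanoTime();
        // Allocate short-lived chunks at a paced rate for ~2 seconds; only
        // the latest chunk stays reachable, so everything else is garbage
        // and the GC sees a steady allocation stream.
        for (int i = 0; i < chunksPerSec * 2; i++) {
            sink = new byte[chunkBytes];
            long due = start + (long) i * 1_000_000_000L / chunksPerSec;
            long sleepNs = due - System.nanoTime();
            if (sleepNs > 0) {
                Thread.sleep(sleepNs / 1_000_000, (int) (sleepNs % 1_000_000));
            }
        }
        System.out.println("allocated ~" + (2 * targetMBPerSec)
                + " MiB; last chunk " + sink.length + " bytes");
    }
}
```

A real allocation-rate benchmark like HyperAlloc additionally varies object sizes and lifetimes across many threads to shape the live data set.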
For our scaling tests, we specified the following:
SPECjbb2015 is an industry-standard benchmark designed to measure the performance and scalability of Java-based business applications. We have worked extensively with SPECjbb2015 over the past decade to optimize the performance of the entire software-hardware stack. The benchmark consists of a simulated enterprise application that emulates the behavior of a complete e-commerce system. It includes various workloads and scenarios, such as order placement, inventory management, and payment processing. These workloads and scenarios stress different system performance aspects by simulating a high-volume online transaction processing (OLTP) workload.
SPECjbb2015 includes several key metrics for evaluating scalability, such as sustained throughput and throughput under response time SLA constraints. These metrics make SPECjbb2015 an invaluable tool for identifying and addressing performance bottlenecks and validating hardware and software configurations for scaling in the cloud. The JEG has worked with OSG-Java for many years. We deeply understand the different Java-based benchmarks and their nuances and best practices. We can leverage this knowledge to help achieve optimal performance and scalability.
We used the run_composite and run_multi scripts to execute the SPECjbb2015 benchmark for our scaling tests.
The run_composite script is a single-JVM, single-host script that launches all the benchmark components (controller, transaction injector, and backend) in one JVM. This script is easy to use, and the community commonly uses it for comparative learning and scaling experiments.
The run_multi script is a low-level script that provides more fine-grained control over the benchmark execution. It allows the controller to be placed outside of the measurement domain, while a transaction injector and a backend form a group. Depending on the scaling tests, the injector-to-backend relationship can be 1:1, or scaled to 2:1 to drive more load toward the backend. The groups themselves can scale out, so that backends also have some inter-process communication with backends in other groups. Parameters such as the number of warehouses, the number of JVM instances, and the number of virtual users can also be stressed. The run_multi script is ideal for advanced users who need to customize the benchmark execution to suit their specific scaling needs.
For all the scripts, we standardized the following:
Our baseline configuration just had the standard configuration listed above. For the minimally “tuned” runs, we tested different configurations as listed here:
These tunings are an essential aspect of performance engineering. Apart from the standard JVM tunings and settings mentioned earlier, the JEG also conducted a scale analysis on software prefetch distance, type profile width, and small code inline sizes to understand how different options can affect the utilization of resources and benchmark scores.
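These tunables correspond to HotSpot options such as -XX:AllocatePrefetchDistance, -XX:TypeProfileWidth, and -XX:InlineSmallCode. One way to inspect their current values on a given build is via the HotSpot diagnostic MXBean (a sketch; whether an individual flag is present or writable varies by JDK build and JIT configuration):

```java
import java.lang.management.ManagementFactory;
import com.sun.management.HotSpotDiagnosticMXBean;

public class FlagProbe {
    public static void main(String[] args) {
        HotSpotDiagnosticMXBean bean =
                ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
        // Query each flag defensively: getVMOption throws
        // IllegalArgumentException when a flag does not exist in this build.
        for (String flag : new String[] {
                "AllocatePrefetchDistance", "TypeProfileWidth", "InlineSmallCode"}) {
            try {
                System.out.println(flag + " = " + bean.getVMOption(flag).getValue());
            } catch (IllegalArgumentException e) {
                System.out.println(flag + " is not available in this build");
            }
        }
    }
}
```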
By carefully tuning these parameters, the JEG was able to understand the nuances of the Armv8.2 system and how these options tracked for the architecture and the benchmark.
If you found this blog post insightful, you'll want to check out the follow-up in our series on performance optimization for Azure D-series v5 VMs.
In our next post, "Optimizing Java Workloads on Azure General Purpose D-series v5 VMs with Microsoft's Build of OpenJDK", we delve into the nitty-gritty of maximizing the utilization of Azure D-series v5 VMs for Java-based workloads. We shed light on the performance benefits of JVM command-line tuning, the use of Transparent HugePages, and the application of Compressed Oops with a 16-byte offset on Arm VMs.
Whether you're looking to manage high-traffic e-commerce platforms more efficiently, fine-tune data-intensive applications, or optimize gaming workloads, our in-depth analysis can provide valuable insights for you.
In Figure 1 below, you can see the coverage (as a percentage range over the axes) achieved by the benchmarks. Each benchmark stressed different Live Data Set (LDS) sizes over various allocation rates. HyperAlloc could stress 100%, whereas SPECjbb2015 and Cassandra could only stress 50-80% depending on the allocation rate and spans (e.g., allocation span ratios). Since Cassandra and SPECjbb2015 filled similar needs, we only needed to dive more deeply into one after the initial analysis.
Figure 1: Benchmark Selection criteria coverage offered by each benchmark
Allocation Span for Parallel GC refers to the allocation behavior of the Parallel GC collector. Fast-path allocations fall into thread-local allocation buffers (TLABs), which are special areas within the Eden space. Objects that don't fit the TLAB size requirements are allocated outside the TLABs, directly into the Eden space. For objects larger than a certain threshold, the collector performs a Large Object Allocation (LOA) directly into the tenured (old) generation space. The tenured generation stores objects that have survived young collection cycles and aged past the preset tenuring threshold.
Allocation Span for G1 GC is similar to the above, except that humongous objects are allocated directly into humongous regions carved out of the old generation space.
Allocation span refers to the space in the Java heap where an object is allocated. By measuring the span of allocations, we can gain insight into the allocation behavior of a benchmark or an application and identify potential bottlenecks in the heap. To calculate span ratios, we can compare the number of objects allocated in different heap areas. For example, we can calculate the ratio of objects allocated in the Eden space vs. those allocated in the Old generation space. We can also separately calculate the ratio of objects allocated within the TLAB versus those allocated outside the TLAB area. These ratios can determine whether an application is experiencing memory pressure in specific areas of the heap and guide optimizations to improve memory usage.
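To make these spans concrete, the following sketch allocates objects on each path (sizes are illustrative; the actual in-TLAB, outside-TLAB, and humongous thresholds depend on TLAB sizing, heap size, and G1 region size):

```java
public class AllocationSpanDemo {
    public static void main(String[] args) {
        // Small objects: fast-path allocation inside the current TLAB.
        byte[][] small = new byte[1000][];
        for (int i = 0; i < small.length; i++) {
            small[i] = new byte[64];
        }

        // A multi-megabyte array typically exceeds the TLAB and is allocated
        // directly in Eden (Parallel GC), or, if it is at least half a G1
        // region, in humongous regions (G1).
        byte[] large = new byte[8 * 1024 * 1024];

        System.out.println(small.length + " small objects, "
                + large.length + " bytes large object");
    }
}
```

Running such a program with GC logging enabled (for example `-Xlog:gc+heap=debug`) shows which heap areas absorb each class of allocation, which is the span ratio measurement described above.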
 Please note that Microsoft Build of OpenJDK has had security updates since then. Please download the latest from https://www.microsoft.com/openjdk/download.