The Azure Synapse Analytics team is continually focused on delivering a highly performant and scalable platform for Apache Spark workloads, in particular in support of our customers' most common workload patterns. By combining the latest open-source updates in Apache Spark with a strong focus on optimizing performance, customers can now take advantage of significant performance gains reflected in the below benchmarking tests1
We have recently announced General Availability of Apache Spark 3.1.2 as a part of Azure Synapse Analytics. This delivers significant performance improvements over Apache Spark 2.4. In the new release of Spark on Azure Synapse Analytics, our benchmark performance tests indicate that we have also been able to achieve a 13% improvement in performance from the previous release and run 202% faster than Apache Spark 3.1.2. This means you can do more with your data, faster and at a lower cost.
Previously, we improved Apache Spark performance through query optimization, autoscaling, cluster optimizations, intelligent caching, and indexing. In the latest release, we have further improved Apache Spark 3.1.2 in Azure Synapse Analytics by using the following three optimizations:
Bloom filter enhancements
This optimization applies while performing top-k queries by eliminating compute cycles involved in processing rows which are not part of the top-k within the partition.
For example, when identifying the top-selling products across categories, where data is partitioned by categories, identifying the top-k rows within a shuffle, and comparing just those that fall within the top-k across partitions will eliminate the need for processing other rows.
Statistics must be enabled to trigger this optimization.
Sorting is one of the most used and computationally expensive operations along with aggregations. In Synapse Spark, we have written an optimized an implementation of sorting which benefits from prior partitioning of data.
This new algorithm can leverage cardinality information to create multiple sorters and efficiently use prefix comparison. Prefix comparisons are way faster than record comparisons. For sorting on multiple columns, we reorder sorting columns to reduce the number of record comparisons required.
This is very useful for queries requiring window operation like getting top 100 highly paid employees in each department or getting 100 most selling products in different categories.
Bloom filter enhancements
In this release, we have extended support of Bloom filters to sort merge joins in addition to broadcast hash joins which we talked about previously.
Shuffling is a bottleneck in query execution as it requires data to be written on the disk. We have further enhanced Bloom filter implementation in Synapse Spark to operate on sort merge joins. The idea is to create Bloom filters from the smaller tables and leverage them to prune large tables. This will help in reducing shuffle data and thus improving query performance. With this extension we were able to reduce shuffle sizes by 50%.
For example, given a fact table ‘Sales’ and a dimension table ‘Items’, application of a Bloom filter will drastically improve performance when we want to get total sales for selected items.
Limit push-down and optimized sort are effective when stats are enabled & the Analyze Table command is run.
Continuous improvements to performance tuning and optimizations in Azure Synapse Analytics enable you to run your Apache Spark workloads in a cost-effective manner with reduced processing times.
1 The workload used was derived from the TPC-DS benchmark but did not follow the formal benchmarking process. As such, this workload's results are not comparable to published TPC-DS results.