
Azure Synapse Analytics Blog

New query optimization techniques in Apache Spark for Azure Synapse

DanielCoelho
Microsoft
Sep 06, 2022

The Azure Synapse Analytics team includes prominent engineers who enhance and contribute back to the Apache Spark project. One of our focus areas is Spark query optimization, where Microsoft has decades of experience and is making significant contributions to the Apache Spark open source engine.

 

The paper attached at the bottom of this blog post will be presented at the 48th International Conference on Very Large Data Bases (#VLDB2022) and covers the latest developments in query optimization for Apache Spark 3. These optimizations were developed by Microsoft engineers and are available today in the Azure Synapse runtime for Apache Spark versions 3.1 and 3.2.

 

The optimizations are motivated by a detailed performance analysis of Apache Spark on the TPC-DS benchmark. An operator-level breakdown of the 20 most expensive TPC-DS queries identified exchange, aggregation, and sort as the three most expensive operators. To bring down their cost, three classes of optimizations were proposed.

 

  1. Exchange placement: Since exchange is the most expensive operator, a new algorithm was developed to minimize how many exchange operators a plan needs. In other words, the algorithm decides on the best placement of exchange operators, and it takes into consideration the possibility of reusing an exchange to eliminate more exchanges than the default Apache Spark algorithm does (the first sketch after this list illustrates exchange reuse).
  2. Partial push-down: This refers to a class of optimizations where a partial computation is derived from an existing operator and pushed down the plan below an exchange. The existing operator is not eliminated, hence the name partial push-down.
    • Partial aggregation push-down: A new logical operator was introduced to represent local aggregation, together with a comprehensive set of optimization rules that push local aggregates below all standard SQL operators. Local aggregates are derived not only from group-by but also from semi-join and intersect. This allows Spark to aggregate data early and reduce the amount of data shuffled, a critical component of performance (see the second sketch after this list).
    • Partial push-down of semi-joins: This is another example of an optimization that pushes down parts of an operator without eliminating it. The optimization looks at a query plan tree that is rooted at a semi-join and has multiple joins under it, and converts some of those inner joins to semi-joins. It encodes a comprehensive set of conditions under which this conversion is safe (see the third sketch after this list).
  3. Peephole optimizations: The implementation of sorting, the third most expensive operator in Spark, was optimized by re-ordering the sort keys where possible, and the sorting algorithm was specialized for scenarios where some keys have very few distinct values. These optimizations significantly reduce the time spent in sort (see the fourth sketch after this list).
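
To make exchange reuse concrete, here is a minimal sketch in stock Apache Spark. It demonstrates the reuse mechanism that the new placement algorithm builds on, not the algorithm itself; the SparkSession setup and the `sales` data are illustrative assumptions.

```scala
import org.apache.spark.sql.SparkSession

object ExchangeReuseDemo {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("exchange-reuse-demo")
      .master("local[*]")
      .getOrCreate()
    import spark.implicits._

    // Illustrative data, not from the paper.
    val sales = Seq(("a", 10), ("b", 20), ("a", 30)).toDF("store", "amount")

    // The same aggregated subplan feeds both sides of the join, so both
    // sides need an identical shuffle on `store`. The physical plan should
    // show one Exchange plus a ReusedExchange instead of two full shuffles.
    val totals = sales.groupBy("store").sum("amount")
    val joined = totals.as("l").join(totals.as("r"), "store")

    // Look for the "ReusedExchange" node in the formatted physical plan.
    joined.explain("formatted")
    spark.stop()
  }
}
```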
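
The second sketch hand-emulates partial aggregation push-down on the DataFrame API; the optimizer rules described in the paper perform this rewrite automatically. The `fact` and `dim` tables are made up, and the `spark` session and implicits from the previous sketch are assumed.

```scala
import org.apache.spark.sql.functions.sum
import spark.implicits._

val fact = Seq(("a", 10), ("a", 30), ("b", 20)).toDF("store", "amount")
val dim  = Seq(("a", "WA"), ("b", "OR")).toDF("store", "state")

// Original shape: every fact row is shuffled into the join before any
// aggregation happens.
val original = fact.join(dim, "store")
  .groupBy("store", "state")
  .agg(sum("amount").as("total"))

// Pushed-down shape: a local aggregate below the join shrinks the data
// flowing through the exchange. The final aggregate is kept (the push-down
// is partial, not a replacement), so the result is unchanged.
val pushed = fact
  .groupBy("store").agg(sum("amount").as("partial_total"))
  .join(dim, "store")
  .groupBy("store", "state")
  .agg(sum("partial_total").as("total"))
```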
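
The third sketch shows the semi-join rewrite by hand, again with made-up tables. Because the outer semi-join only asks whether a matching `cust` exists, duplicates on its right side cannot change the result, so the inner join feeding it can itself become a semi-join.

```scala
val orders  = Seq((1, "a"), (2, "b"), (3, "c")).toDF("o_id", "cust")
val custs   = Seq(("a", 10), ("b", 20)).toDF("cust", "nation")
val nations = Seq((10, "US")).toDF("nation", "name")

// Original shape: the inner join materializes full (custs x nations) rows
// even though the semi-join above it only checks existence of `cust`.
val original = orders.join(
  custs.join(nations, "nation").select("cust"),
  Seq("cust"), "left_semi")

// Rewritten shape: the inner join is converted to a semi-join, producing
// fewer and narrower rows while preserving the query's result.
val rewritten = orders.join(
  custs.join(nations, Seq("nation"), "left_semi").select("cust"),
  Seq("cust"), "left_semi")
```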
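
Finally, a toy sketch in plain Scala (not Spark's internal code) of the idea behind specializing sort for keys with very few distinct values: bucket rows on the low-cardinality leading key, so full comparisons only run on the remaining keys.

```scala
// Sort pairs by (first key, second key), exploiting a low-cardinality
// leading key: it is compared only once per bucket, not once per row pair.
def sortLowCardinalityFirstKey[K1: Ordering, K2: Ordering](
    rows: Seq[(K1, K2)]): Seq[(K1, K2)] = {
  rows
    .groupBy(_._1)        // one bucket per distinct leading-key value
    .toSeq
    .sortBy(_._1)         // order the few buckets by the leading key
    .flatMap { case (_, bucket) =>
      bucket.sortBy(_._2) // full comparisons only on the second key
    }
}

// Example: `status` has just three distinct values.
val rows   = Seq(("open", 5), ("closed", 2), ("open", 1), ("pending", 9))
val sorted = sortLowCardinalityFirstKey(rows)
// => Seq(("closed",2), ("open",1), ("open",5), ("pending",9))
```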

The work resulted in considerable performance benefits across the board on the reference TPC-DS workload as well as a significant reduction in query plan generation time. Please consider reading the complete paper for details on the work performed. 

 

To learn more about our previous optimization contributions to the Apache Spark open source community, please read:

Updated Sep 06, 2022
Version 2.0
  • dcastell1

    Thank you for your work! Always welcome performance increases 

  • _MartinB

    I really do appreciate that Microsoft contributes to the Spark open source codebase!

  • edfreeman

    Great work! Could you share the links to the open-source contributions on GitHub? Very interested in taking a look at what's been done.

    Or is this only scoped to Synapse Spark?