Apache Spark 3.0 support in Azure Synapse Analytics

Former Employee

May 25, 2021

Starting today, the Apache Spark 3.0 runtime is available in Azure Synapse. This version builds on existing open-source and Microsoft-specific enhancements and adds the improvements listed below. Together, these enhancements deliver significantly faster processing than open-source Spark 3.0.2 and 2.4.

The public preview announced today is based on the open-source Apache Spark 3.0 branch. Subsequent updates will lead up to a Generally Available version derived from the latest 3.1 branch.

Performance Improvements

In large-scale distributed systems, performance is never far from top of mind: "to do more with the same" or "to do the same with less" are always key measures. In addition to the Azure Synapse performance improvements announced recently, Spark 3 brings new enhancements and the opportunity for the performance engineering team to do even more great work.

Predicate pushdown and more efficient shuffle management build on the common performance patterns and optimizations that are often included in releases. The Azure Synapse-specific optimizations in these areas have been ported over to augment the enhancements that come with Spark 3.

Adaptive Query Execution (AQE)

One attribute of data processing jobs run by data-intensive platforms like Apache Spark differentiates them from more traditional data processing systems like relational databases: the volume of data and, consequently, the length of the job needed to process it. It's not uncommon for queries or data processing steps to take hours or even days to run in Spark. This presents unique challenges, and also opportunities to take a different approach to optimizing queries and accessing the data. Over the course of a job that runs for hours or days, the query plan shape can change as estimates of data volume, skew, cardinality, etc., are replaced with actual measurements.

Adaptive Query Execution (AQE) in Azure Synapse provides a framework for dynamic optimization that brings significant performance improvement to Spark workloads and gives valuable time back to data and performance engineering teams by automating manual tasks.

AQE assists with:

Shuffle partition tuning: This is a major source of manual work that data teams deal with today.
Join strategy optimization: Today this requires human review and deep knowledge of query optimization to tune the types of joins used. AQE bases the choice on actual rather than estimated data.
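
As a rough illustration of the two items above, the sketch below applies the related open-source Spark 3.0 settings to a notebook session. The configuration keys are the ones documented for Apache Spark 3.0. The helper name `enable_aqe`, the 64 MB target size, and the idea that a given Synapse pool does not already enable these by default are assumptions made for the example.

```python
# Open-source Spark 3.0 configuration keys related to AQE.
# The values are illustrative assumptions, not Azure Synapse defaults.
AQE_SETTINGS = {
    # Master switch: re-optimize the plan using runtime statistics.
    "spark.sql.adaptive.enabled": "true",
    # Shuffle partition tuning: merge small shuffle partitions after a stage.
    "spark.sql.adaptive.coalescePartitions.enabled": "true",
    # Target size used when coalescing partitions (assumed value).
    "spark.sql.adaptive.advisoryPartitionSizeInBytes": "64m",
    # Split heavily skewed partitions in sort-merge joins.
    "spark.sql.adaptive.skewJoin.enabled": "true",
}


def enable_aqe(spark, settings=None):
    """Apply AQE settings to an existing SparkSession (hypothetical helper)."""
    for key, value in (settings or AQE_SETTINGS).items():
        spark.conf.set(key, value)
    return spark
```

In a notebook where `spark` is the active session, `enable_aqe(spark)` would be enough. With AQE on, open-source Spark can also replace a sort-merge join with a broadcast join when the measured size of one side falls below `spark.sql.autoBroadcastJoinThreshold`, which is the kind of join tuning that otherwise needs manual review.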

Dynamic Partition Pruning

One of the common optimizations in high-scale query processors is eliminating the reading of certain partitions, following the adage that the less you read, the faster you go. However, not all partition elimination can be done as part of query optimization; some of it requires execution-time optimization. This feature is so critical to performance that we added a version of it to the Apache Spark 2.4 codebase used in Azure Synapse. It is also built into the Spark 3.0 runtime now available in Azure Synapse.
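
To make the idea concrete, here is a toy model in plain Python (not Spark code) of pruning that can only happen at execution time. The table contents, the `region` key, and the function name are invented for the illustration. In open-source Spark 3.0 the corresponding switch is `spark.sql.optimizer.dynamicPartitionPruning.enabled`.

```python
# Toy model of dynamic partition pruning (not Spark code).
# A "fact" table is stored as one partition per region; a small
# "dimension" table is filtered first, and the surviving join keys
# decide at run time which fact partitions are read at all.

FACT_PARTITIONS = {
    "emea": [("emea", 10), ("emea", 20)],
    "apac": [("apac", 5)],
    "amer": [("amer", 7), ("amer", 3)],
}

DIM_REGIONS = [
    {"region": "emea", "active": True},
    {"region": "apac", "active": False},
    {"region": "amer", "active": True},
]


def pruned_join_total(fact_partitions, dim_rows, predicate):
    """Sum fact amounts for regions whose dimension row passes `predicate`.

    Returns the total and the list of partitions actually scanned.
    """
    # Step 1: evaluate the dimension filter; its result is only known at run time.
    wanted = {row["region"] for row in dim_rows if predicate(row)}
    # Step 2: read only the fact partitions whose key survived the filter.
    scanned = [key for key in fact_partitions if key in wanted]
    total = sum(amount for key in scanned for _, amount in fact_partitions[key])
    return total, scanned


total, scanned = pruned_join_total(
    FACT_PARTITIONS, DIM_REGIONS, lambda row: row["active"]
)
print(total, scanned)
```

The `apac` partition is never touched, even though nothing in the query text names it. That decision depends on the dimension filter's result, which is why it cannot be made during up-front query optimization.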

ANSI SQL

Over the last 25+ years, SQL has become and continues to be one of the de facto languages for data processing. Even languages such as Python, C#, R, and Scala frequently just expose a SQL call interface or generate SQL code.

One of SQL's challenges as a language, going back to its earliest days, has been that implementations from different vendors (including Spark SQL) are incompatible with each other. ANSI SQL is generally seen as the common definition across all implementations, so adhering to it minimizes rework and relearning. As part of Apache Spark 3, there has been a big push to improve ANSI compatibility within Spark SQL.

With these changes in place in Azure Synapse, the majority of folks who are familiar with some variant of SQL will feel very comfortable and productive in the Spark 3 environment.
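
One concrete example of what ANSI compatibility changes: in open-source Spark 3.0, setting `spark.sql.ansi.enabled` to `true` makes integer overflow raise a runtime error instead of silently wrapping around. The plain-Python model below (not Spark code; the function name is made up for the illustration) mimics that difference for 32-bit integers. How Azure Synapse sets this flag by default is not covered here.

```python
INT32_MIN, INT32_MAX = -2**31, 2**31 - 1


def add_int32(a, b, ansi_enabled):
    """Model adding two 32-bit integers with and without ANSI semantics."""
    result = a + b
    if INT32_MIN <= result <= INT32_MAX:
        return result
    if ansi_enabled:
        # ANSI behaviour: overflow is an error, not a silent wrong answer.
        raise ArithmeticError("integer overflow")
    # Legacy behaviour: wrap around like a two's-complement 32-bit integer.
    return (result - INT32_MIN) % 2**32 + INT32_MIN
```

With `ansi_enabled=False`, adding 1 to the largest 32-bit integer quietly yields the smallest one. With `ansi_enabled=True` the same call fails loudly, which is the behaviour most users coming from other SQL engines would expect.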

Pandas

While we tend to focus on high-scale algorithms and APIs when working on a platform like Apache Spark, that does not diminish the value of highly popular and heavily used local-only APIs like pandas. In fact, for some time, Spark has included support for User Defined Functions (UDFs), which make it easier and more scalable to run these local-only libraries rather than just running them in the driver process.

Given that ~70% of all API calls on Spark are Python, supporting the language APIs is critical to maximize existing skills. In Spark 3, the UDF capability has been upgraded to use a feature only available in newer versions of Python: type hints. Combined with a new UDF implementation that supports new Pandas UDF APIs and types, this release supports existing skills in a more performant environment.
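
A minimal sketch of a type-hinted pandas UDF in the Spark 3 style follows. The temperature conversion, the function names, and the column names in the usage note are made up for the example. It assumes pandas and PyArrow are available on the pool.

```python
def make_to_fahrenheit_udf():
    """Build a Series-to-Series pandas UDF declared with Python type hints.

    Imports are local so this module still loads where PySpark is absent.
    """
    import pandas as pd
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import DoubleType

    # The hints (pd.Series -> pd.Series) tell Spark 3 which kind of
    # pandas UDF this is; no explicit PandasUDFType is needed.
    @pandas_udf(DoubleType())
    def to_fahrenheit(celsius: pd.Series) -> pd.Series:
        # Runs on a whole batch of rows at once, vectorized by pandas.
        return celsius * 9.0 / 5.0 + 32.0

    return to_fahrenheit
```

Given a DataFrame `df` with a numeric `temp_c` column (both hypothetical), the UDF could be applied as `df.withColumn("temp_f", make_to_fahrenheit_udf()(df["temp_c"]))`. The function body is ordinary pandas code, so existing pandas skills carry over while Spark distributes the batches across executors.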

Accelerator aware scheduling

The sheer volume of data and the richness of required analysis have made ML a core workload for systems such as Apache Spark. While it has been possible to use GPUs together with Spark for some time, Spark 3 includes optimizations in the scheduler, a core part of the system, brought in from the Hydrogen project to support more efficient use of hardware accelerators. For hardware-accelerated Spark workloads running in Azure Synapse, there has been deep collaboration with NVIDIA to deliver specific optimizations on top of their hardware and some of their dedicated APIs for running GPUs in Spark.

Delta Lake

Delta Lake is one of the most popular projects that can be used to augment Apache Spark. Azure Synapse uses the Linux Foundation open-source implementation of Delta Lake. Unfortunately, when running on Spark 2.4, the highest supported version of Delta Lake is 0.6.1. Adding support for Spark 3 means that newer versions of Delta Lake can be used with Azure Synapse. Currently, Azure Synapse ships with support for Linux Foundation Delta Lake 0.8.

The biggest enhancements in 0.8 versus 0.6.1 are primarily around the SQL language and some of the APIs. It is now possible to perform most DDL and DML operations without leaving the Spark SQL language/environment. In addition, there have been significant enhancements to the MERGE statement/API (one of the most powerful capabilities of Delta Lake) expanding scope and capability.
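
As a hedged sketch of what staying inside Spark SQL can look like, the helper below builds a Delta Lake `MERGE` statement for a simple upsert and runs it through `spark.sql`. The function names and the table and column names in the usage note are hypothetical. It assumes both tables are Delta tables already registered in the metastore and that the session has Delta's SQL support enabled.

```python
def build_merge_sql(target, source, key):
    """Return a Delta Lake MERGE statement that upserts `source` into `target`."""
    return (
        f"MERGE INTO {target} AS t\n"
        f"USING {source} AS s\n"
        f"ON t.{key} = s.{key}\n"
        "WHEN MATCHED THEN UPDATE SET *\n"
        "WHEN NOT MATCHED THEN INSERT *"
    )


def upsert(spark, target, source, key):
    """Run the MERGE through Spark SQL on an existing SparkSession."""
    return spark.sql(build_merge_sql(target, source, key))
```

For example, `upsert(spark, "customers", "customer_updates", "customer_id")` would update matching rows and insert new ones in a single statement. The identifiers are interpolated directly into the SQL text, so this sketch should only be fed trusted table and column names.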

Get Started Today

Customers with *qualifying subscription types can now try the Apache Spark pool resources in Azure Synapse using free quantities until July 31st, 2021 (up to 120 free vCore-hours per month).

If you don’t have one already, follow this tutorial to create a Synapse workspace
Follow this step-by-step video tutorial on how to create your first Apache Spark pool in Azure Synapse