Improve performance of Parquet external tables using new native technology in dedicated SQL pools

Microsoft

Jun 03, 2021

Azure Synapse Analytics enables you to read Parquet files stored in Azure Data Lake Storage using the T-SQL language and high-performance Parquet readers. The key characteristic of these readers is that they use native (C++) code to read Parquet files, unlike the existing Polybase Parquet reader technology, which uses Java code. These native readers were first introduced in the serverless SQL pools in Azure Synapse Analytics workspaces.

In many experiments, the native technology used in the serverless SQL pools demonstrated better performance than the existing Polybase external tables in the dedicated SQL pools.

This native technology for reading Parquet files is now also available in the dedicated SQL pools. In the dedicated SQL pools in Azure Synapse Analytics, you can create external tables that use native code to read Parquet files, which improves the performance of queries that access external Parquet files.

NOTE: The native external tables are in gated public preview. If you want to try this feature, fill in this form and we will contact you, or ask your Microsoft contacts or the product group how to try it.

In this post you will see how to use these new external tables and what their benefits are.

What are you getting with this new technology?

First, let’s clarify what is new with this improvement:

Before: You could use only Polybase TYPE=HADOOP external tables to read Parquet files in the dedicated SQL pools. Polybase TYPE=HADOOP tables are based on Java code and can perform worse than expected in some cases, because Java data structures must be converted to native structures.

Now: In addition to Polybase TYPE=HADOOP external tables, you can use a new type of native external table. Native external tables are implemented in native code and are much faster. The new native tables and the existing Polybase TYPE=HADOOP external tables can be used side by side on the same dedicated pool.

In the following sections you will see how to get up to 10x better performance when accessing external Parquet files just by removing the TYPE option from the external data source.

Hadoop vs native external tables

In dedicated SQL pools you can use two types of external tables:

Hadoop external tables – the existing Polybase Hadoop external tables that leverage Java technology to read external Parquet files. This technology is generally available.
Native external tables – new external tables that use the native Parquet readers. This feature is currently in gated public preview.

The only syntax difference between these two table types is the external data source definition:

If you want to use the existing Hadoop external tables, create an external data source with the TYPE=HADOOP option.
If you want to use the new native external tables, create an external data source without the TYPE option.
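
As a minimal sketch of the two definitions, assume a hypothetical storage account `contosolake`, a container `tpch`, and an already existing database scoped credential named `LakeCredential`. None of these names come from this post, and the location prefix your storage requires may differ:

```sql
-- Hypothetical names throughout: TpchHadoop, TpchNative, LakeCredential,
-- and the storage URL are placeholders for illustration only.

-- Hadoop data source: TYPE = HADOOP selects the existing Java-based Polybase reader.
CREATE EXTERNAL DATA SOURCE TpchHadoop
WITH (
    TYPE = HADOOP,
    LOCATION = 'abfss://tpch@contosolake.dfs.core.windows.net',
    CREDENTIAL = LakeCredential
);

-- Native data source: same storage location, but the TYPE option is omitted.
CREATE EXTERNAL DATA SOURCE TpchNative
WITH (
    LOCATION = 'abfss://tpch@contosolake.dfs.core.windows.net',
    CREDENTIAL = LakeCredential
);
```

Both data sources can exist in the same database, so you can keep the Hadoop definition in place while you evaluate the native one.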

The syntax for the external tables is the same in both cases – you just need to create an external table on top of the data source that you created with or without the TYPE option. The external table will use native code or Java code depending on the TYPE attribute of the underlying EXTERNAL DATA SOURCE object.
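
To illustrate, the sketch below creates the same table twice. It assumes two hypothetical external data sources, `TpchHadoop` (created with TYPE=HADOOP) and `TpchNative` (created without the TYPE option), and a `/nation/` folder of Parquet files. The column list follows the TPC-H `nation` table and may need adjusting to match your files:

```sql
-- One Parquet file format can be shared by both table types.
CREATE EXTERNAL FILE FORMAT ParquetFormat
WITH (FORMAT_TYPE = PARQUET);

-- Hadoop external table: reads through the Java-based reader because
-- the (hypothetical) TpchHadoop data source was created with TYPE = HADOOP.
CREATE EXTERNAL TABLE dbo.nation_hadoop (
    n_nationkey INT,
    n_name      VARCHAR(25),
    n_regionkey INT,
    n_comment   VARCHAR(152)
)
WITH (
    LOCATION    = '/nation/',
    DATA_SOURCE = TpchHadoop,
    FILE_FORMAT = ParquetFormat
);

-- Native external table: identical definition; only DATA_SOURCE changes,
-- pointing at the (hypothetical) TpchNative data source created without TYPE.
CREATE EXTERNAL TABLE dbo.nation_native (
    n_nationkey INT,
    n_name      VARCHAR(25),
    n_regionkey INT,
    n_comment   VARCHAR(152)
)
WITH (
    LOCATION    = '/nation/',
    DATA_SOURCE = TpchNative,
    FILE_FORMAT = ParquetFormat
);
```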

The query experience does not change. Once you create the external tables, you can use them in any query.
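
For example, assuming two hypothetical external tables, `dbo.nation_hadoop` and `dbo.nation_native`, defined over the same Parquet files, the same query text works against either one. Only the table name changes:

```sql
-- Runs through the Java-based reader (hypothetical Hadoop external table).
SELECT n_regionkey, COUNT(*) AS nation_count
FROM dbo.nation_hadoop
GROUP BY n_regionkey
ORDER BY n_regionkey;

-- Same query, now served by the native Parquet reader
-- (hypothetical native external table).
SELECT n_regionkey, COUNT(*) AS nation_count
FROM dbo.nation_native
GROUP BY n_regionkey
ORDER BY n_regionkey;
```

Running both statements against your own data is a simple way to compare elapsed times for the two table types.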

Performance comparison

The main benefit of this new technology is performance. Let’s compare the performance of the 22 T-SQL queries derived from the TPC-H benchmark, executed using the existing Hadoop external tables (red) and the new native external tables (green). All queries are executed on the same 100GB set of Parquet files organized in the TPC-H table structure:

(Note: smaller is better)

On the chart you can see that all T-SQL queries run faster with the native external tables. The queries run on the same Parquet data set from the same dedicated pool instance, sized at 2000 DWU. The only difference is the TYPE parameter in the external data source.

IMPORTANT: This is not an official TPC-H benchmark. The queries are executed on the Parquet data set that has the same schema as TPC-H tables, but the distribution of data is not identical to the official TPC-H benchmark. You might get different results on your 100GB data sets, due to different data distribution.

Another interesting experiment compares the existing Polybase tables on top of a 100GB data set with the native external tables on top of a 1TB Parquet data set (a 10x bigger data set):

(Note: smaller is better)

With the new native tables (green bars), we get better performance on most of the queries, even though the test runs on a 10x bigger data set.

Conclusion

The native external tables in the dedicated SQL pools in Azure Synapse Analytics are a new technology that will boost the performance of queries that use external tables on top of Parquet files.

Just by removing the TYPE option from the external data source, you can get 5-10x better performance without scaling up your dedicated pools. In the experiments above, most queries remained faster even on a 10x bigger data set.

If you want to try this feature, fill in this form and we will contact you.

This feature is in public preview, and we would appreciate your feedback. You can add your suggestions on the Azure Synapse Analytics Feedback site.

Updated Sep 15, 2021

Version 6.0

Synapse SQL

JovanPop

Azure Synapse Analytics Blog