Synapse Serverless SQL Pool - Performance and cost optimization with partitioning
Published Nov 10 2022 04:12 AM
Microsoft
Why partitions?
 
Partitioning reduces the amount of data each query processes, which lowers cost and improves performance.
Serverless SQL pool is billed by the amount of data processed, which consists of:
  1. Amount of data read from storage. This amount includes:
    1. Data read while reading data.
    2. Data read while reading metadata (for file formats that contain metadata, like Parquet).
  2. Amount of data in intermediate results. This data is transferred among nodes while the query runs. It includes the data transfer to your endpoint, in an uncompressed format.
  3. Amount of data written to storage. If you use CETAS to export your result set to storage, then the amount of data written out is added to the amount of data processed for the SELECT part of CETAS.
The amount of data processed is rounded up to the nearest MB per query. Each query has a minimum of 10 MB of data processed.
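
To make the rounding rule concrete, here is a minimal sketch of the per-query billing arithmetic. It assumes 1 MB = 1,048,576 bytes and 1 TB = 1,048,576 MB, and the function names and the `price_per_tb` parameter are hypothetical: take the real price from your region's price list.

```python
import math

MB = 1024 * 1024  # assumption: 1 MB = 1,048,576 bytes


def billed_megabytes(bytes_processed: int) -> int:
    """Round up to the next whole MB and apply the 10 MB per-query minimum."""
    return max(10, math.ceil(bytes_processed / MB))


def estimated_cost(bytes_processed: int, price_per_tb: float) -> float:
    """Estimated cost of one query; price_per_tb is an input, not a known constant."""
    megabytes_per_tb = 1024 * 1024  # assumption: binary terabyte
    return billed_megabytes(bytes_processed) / megabytes_per_tb * price_per_tb


# A query that touches very little data is still billed for 10 MB.
print(billed_megabytes(3 * MB))       # 10
# Anything above a whole MB is rounded up to the next MB.
print(billed_megabytes(47 * MB + 1))  # 48
```

The practical consequence is that the bill follows the bytes a query touches, so reading fewer files is the most direct way to pay less.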

You can instruct serverless SQL pool to query particular folders and files. Doing so reduces the number of files and the amount of data the query needs to read and process. An added bonus is that you'll achieve better performance and save money.
 
You can find all the details in the cost management documentation for serverless SQL pool.
 
What does a partitioned folder look like?
 
Imagine we have multiple parquet files containing sales data, one parquet file for each month in each year since 2001. This folder might be organized like this:
 
[Image 1.png: folder structure of the partitioned sales dataset]
 
Each Parquet file contains the data for one entire month.
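
As a rough illustration of such a layout, the sketch below lists one folder per month using the "YEAR=yyyy" / "MONTH=mm" naming adopted in this article. The root folder name `sales`, the end year 2022, and the function name are assumptions made for the example only.

```python
from typing import List


def partition_folders(first_year: int, last_year: int, root: str = "sales") -> List[str]:
    """Return one folder path per month, e.g. sales/YEAR=2001/MONTH=01."""
    return [
        f"{root}/YEAR={year}/MONTH={month:02d}"
        for year in range(first_year, last_year + 1)
        for month in range(1, 13)
    ]


folders = partition_folders(2001, 2022)
print(len(folders))  # 264 folders: 22 years x 12 months
print(folders[0])    # sales/YEAR=2001/MONTH=01
print(folders[-1])   # sales/YEAR=2022/MONTH=12
```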
 
Read partitioned data using Synapse Serverless SQL Pool
 
In our scenario, the Parquet files in the dataset do not expose YEAR and MONTH columns; they only contain ORDERDATEKEY in the format yyyymmdd. We want to query the files and filter them by ORDERDATEKEY = 20220119:

[Image 2.png: query filtering the dataset by ORDERDATEKEY = 20220119]
 
Here is the amount of data processed, which is what we pay for:

[Image 3.png: amount of data processed by the query]
This query processed 47 MB, but the file for January 2022 is only about 8.2 MB.

[Image 6.png: size of the January 2022 file]
This means the query read the entire dataset and did not benefit from the partitioning at all.
 
To reduce the amount of data processed, we have to exclude the irrelevant partitions by pointing the query at the right folders (year and month) with the filepath T-SQL function:

[Image 4.png: query using the filepath function]
The amount of data processed decreased, which also improved performance and reduced the cost per query.

[Image 5.png: amount of data processed by the filtered query]
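
Since the query above survives only as a screenshot, here is a hedged sketch of what a filepath-based filter can look like, written as a small Python helper that builds the T-SQL text. The helper name, the `sales` alias, the example storage URL, and the `/*/*/*.parquet` wildcard pattern are assumptions; the original query may differ.

```python
def build_partition_query(dataset_url: str, year_folder: str,
                          month_folder: str, order_date_key: int) -> str:
    """Build a T-SQL query that reads a single year/month folder.

    Values are interpolated as-is, so only pass trusted input.
    """
    return f"""
SELECT *
FROM OPENROWSET(
    BULK '{dataset_url}/*/*/*.parquet',
    FORMAT = 'PARQUET'
) AS sales
WHERE sales.filepath(1) = '{year_folder}'    -- folder matched by the first wildcard
  AND sales.filepath(2) = '{month_folder}'   -- folder matched by the second wildcard
  AND sales.ORDERDATEKEY = {order_date_key};
""".strip()


# Hypothetical storage account and container names.
print(build_partition_query(
    "https://myaccount.dfs.core.windows.net/mycontainer/sales",
    "YEAR=2022", "MONTH=01", 20220119))
```

filepath(n) returns the part of the path matched by the n-th wildcard in the BULK pattern, so with this pattern the filter values are the full folder names.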
Unfortunately, the filepath and filename T-SQL functions cannot be used to define an external table (useful for building a logical data warehouse). If you define an external table over a partitioned dataset, Synapse Serverless SQL Pool cannot take advantage of the partitioning and always reads the entire set of files.
 
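However, you can create a view that exposes the YEAR and MONTH columns by using the filepath and filename functions.

[Image 8.png: view definition exposing YEAR and MONTH]

Here is the data processed:

[Image 9.png: amount of data processed when querying through the view]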
 
In this scenario I created the partitioned dataset with folders named "YEAR=yyyy" and "MONTH=mm". When filtering data through the view, we have to provide the real names of the year and month folders, which is not elegant in T-SQL code.
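
A minimal sketch of the view approach follows, again as Python helpers that produce the T-SQL text. The view name `dbo.SalesPartitioned`, the `sales` alias, the storage URL, and the wildcard pattern are hypothetical; the actual view in the screenshot may be defined differently.

```python
def build_partitioned_view(view_name: str, dataset_url: str) -> str:
    """Build a CREATE VIEW statement that exposes the folder names as columns."""
    return f"""
CREATE VIEW {view_name} AS
SELECT
    *,
    sales.filepath(1) AS [YEAR],   -- folder matched by the first wildcard
    sales.filepath(2) AS [MONTH]   -- folder matched by the second wildcard
FROM OPENROWSET(
    BULK '{dataset_url}/*/*/*.parquet',
    FORMAT = 'PARQUET'
) AS sales;
""".strip()


def build_view_query(view_name: str, year_folder: str,
                     month_folder: str, order_date_key: int) -> str:
    """Build a query against the view; filter values are the literal folder names."""
    return f"""
SELECT *
FROM {view_name}
WHERE [YEAR] = '{year_folder}'
  AND [MONTH] = '{month_folder}'
  AND ORDERDATEKEY = {order_date_key};
""".strip()


url = "https://myaccount.dfs.core.windows.net/mycontainer/sales"  # hypothetical
print(build_partitioned_view("dbo.SalesPartitioned", url))
print(build_view_query("dbo.SalesPartitioned", "YEAR=2022", "MONTH=01", 20220119))
```

Note how the filter has to say 'YEAR=2022' and 'MONTH=01' rather than 2022 and 1, which is the inelegance mentioned above.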
 
Synapse Spark, however, understands this naming convention and can expose the table in Synapse Serverless SQL Pool through shared metadata. This means we can filter by [YEAR] and [MONTH] by specifying just the values (2022 and 1), since Spark "translates" the folder structure into columns.
To use this approach, we need to create a Spark pool and a Spark notebook.

Bear in mind that in this case you need to create the database and all its tables from a Spark notebook, and no changes are permitted through DDL commands from Synapse Serverless (no ALTER, CREATE, or DROP on these objects). All DDL commands must be executed through Spark.

[Image 7.png: Spark notebook creating the database and table]
Now we can query the dataset by filtering on [YEAR] = 2022 AND [MONTH] = 1, with no need for the filepath function:

[Image 10.png: query filtering by YEAR and MONTH]
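
To see why the plain values 2022 and 1 work here, the sketch below roughly mimics how "key=value" folder names can be turned into typed columns. It is an illustration only, not Spark's actual implementation; the function name and the sample path are hypothetical.

```python
def parse_partition_path(path: str) -> dict:
    """Extract key=value folder names from a path as column values."""
    columns = {}
    for segment in path.split("/"):
        name, separator, value = segment.partition("=")
        if not separator:
            continue  # not a key=value folder (for example the root or the file name)
        # Numeric-looking values become integers, which is why MONTH=01 is filtered as 1.
        columns[name] = int(value) if value.isdigit() else value
    return columns


print(parse_partition_path("sales/YEAR=2022/MONTH=01/data.parquet"))
# {'YEAR': 2022, 'MONTH': 1}
```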

That's all.
 
 
 
 
 
Last update: Nov 10 2022 05:55 AM