Azure Data Factory Data Flows perform data transformation ETL at cloud scale. This blog post takes a look at the performance of different source and sink types. Below are our findings from performance tests of different source and sink pairs:
Scenario 1
- Source: Delimited Text Blob Store
- Sink: Azure SQL DB
- File size: 421 MB, 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields (sketched in PySpark after the scenarios)
- Time: 4 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Current partitioning used throughout
Scenario 2
- Source: Azure SQL DB Table
- Sink: Azure SQL DB Table
- Table size: 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields
- Time: 3 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink
Scenario 3
- Source: Delimited Text Blob Store
- Sink: Delimited Text Blob store
- Data size: 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields
- Time: 2 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Leaving the default/current partitioning throughout allows ADF to scale partitions up or down based on the size of the Azure IR (i.e., the number of worker cores)
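For reference, data flows execute on Spark behind the scenes, and the "single derived column to mask 3 fields" transform used in all three scenarios is roughly equivalent to the PySpark sketch below. The column names (ssn, email, phone) are hypothetical stand-ins; ADF generates the actual Spark logic from the visual derived-column settings.

```python
# Minimal PySpark sketch of a derived-column masking step.
# Column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-fields").getOrCreate()

# Tiny stand-in DataFrame so the sketch is self-contained.
df = spark.createDataFrame(
    [("123-45-6789", "jane@contoso.com", "5551234567")],
    ["ssn", "email", "phone"],
)

masked = (
    df.withColumn("ssn", F.lit("***-**-****"))
      .withColumn("email", F.regexp_replace("email", r"^[^@]+", "****"))
      .withColumn("phone", F.regexp_replace("phone", r"\d", "*"))
)
masked.show()
```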
File-based Source / Sink
- Set "current partitioning" on source & sink to allow data flows to leverage native Spark partitioning. This will allow the performance to scale proportionally with an increase in core counts.
- Pre and post-processing operations like "save as single file", "clear folder", and "delete files" will incur additional time in your ETL process.
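To make the "current partitioning" advice concrete, here is a rough PySpark equivalent of the file-to-file scenario. The paths are placeholders, and the commented-out line shows why a "save as single file" post-processing step costs extra time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-csv").getOrCreate()

# Placeholder ADLS/Blob paths.
src = "abfss://data@myaccount.dfs.core.windows.net/input/"
dst = "abfss://data@myaccount.dfs.core.windows.net/output/"

df = spark.read.option("header", "true").csv(src)

# "Current partitioning": no repartition/coalesce call, so the number of tasks
# and output files follows Spark's own partitioning and scales with core count.
df.write.mode("overwrite").option("header", "true").csv(dst)

# A "save as single file" style step forces everything through one partition,
# which serializes the write and adds time to the job:
# df.coalesce(1).write.mode("overwrite").option("header", "true").csv(dst)
```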
Azure SQL DB Source / Sink
SQL DB Source
- Use "Source" partitioning under Optimize and set the number of partitions equal to the number of cores you are using. Use a high-cardinality column or set of columns as the partition column.
- Use "Input query" to minimize the data, columns, and for pushdown functions.
SQL DB Sink
- Make sure you are using a large enough Azure SQL DB tier so the database has the resources to keep up with your ETL job's writes.
- Adding cores to your job will scale performance proportionally, but you will always be throttled by the ability of the database to serialize the incoming data (see the write sketch after this list).
- Use current partitioning.
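The sink side maps to Spark's JDBC writer. A minimal sketch follows, again with placeholder connection details and a tiny stand-in DataFrame; no matter how many cores you add, the effective write rate is capped by the database tier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqldb-sink").getOrCreate()

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"  # placeholder

# Stand-in for the transformed DataFrame produced by the earlier steps.
df = spark.createDataFrame(
    [(1, "****", "**********")],
    ["SalesId", "CustomerEmail", "Phone"],
)

(
    df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Sales_Masked")  # placeholder target table
    .option("user", "etl_user")             # placeholder credentials
    .option("password", "<secret>")
    .option("batchsize", "10000")           # larger batches mean fewer round trips
    .mode("append")
    .save()
)
```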
Synapse SQL DW
Synapse DW Source & Sink
- Always use "Enable Staging" and increase the core count to minimize data processing times (a conceptual sketch of staged writes follows).
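Conceptually, "Enable Staging" bulk-loads Synapse through a blob/ADLS staging folder rather than doing row-by-row inserts. The sketch below illustrates the same staged pattern using the Databricks Synapse connector's option names (not ADF's own settings); the URLs, table, and credential handling are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("synapse-staged-write").getOrCreate()

# Stand-in for the transformed DataFrame.
df = spark.createDataFrame([(1, "****")], ["SalesId", "CustomerEmail"])

# Staged load: data lands in a staging folder first and is then bulk-loaded
# into the DW (PolyBase/COPY style), instead of row-by-row JDBC inserts.
(
    df.write.format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://mysynapse.sql.azuresynapse.net:1433;database=mydw")
    .option("dbTable", "dbo.Sales_Masked")
    .option("tempDir", "abfss://staging@myaccount.dfs.core.windows.net/tmp/")
    .option("forwardSparkAzureStorageCredentials", "true")
    .mode("append")
    .save()
)
```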
Cosmos DB Source / Sink
- Make use of the "throughput" option on the Cosmos DB source and sink to increase the throughput available to the ETL job (see the sketch below).
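The throughput option temporarily raises the RU/s available to the container for the duration of the job. The same idea expressed with the azure-cosmos Python SDK is sketched below; the account details are placeholders and run_etl_job() is a hypothetical stand-in for triggering the data flow.

```python
from azure.cosmos import CosmosClient

# Placeholder account, database, and container names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycontainer")

# Record the current RU/s, then raise it so the ETL job is not rate limited.
original_rus = container.get_throughput().offer_throughput
container.replace_throughput(40000)

try:
    run_etl_job()  # hypothetical stand-in for triggering the data flow
finally:
    # Scale back down afterwards to keep costs in check.
    container.replace_throughput(original_rus)
```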
Updated Oct 15, 2020
Mark Kromer, Microsoft