Azure Data Factory Data Flows perform data transformation ETL at cloud scale. This blog post takes a look at the performance of different source and sink types. Below are our findings from performance tests of different source and sink pairs:
Scenario 1
- Source: Delimited Text Blob Store
- Sink: Azure SQL DB
- File size: 421 MB, 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields (sketched in PySpark after the scenarios)
- Time: 4 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Current partitioning used throughout
Scenario 2
- Source: Azure SQL DB Table
- Sink: Azure SQL DB Table
- Table size: 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields
- Time: 3 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink
Scenario 3
- Source: Delimited Text Blob Store
- Sink: Delimited Text Blob store
- Data size: 74 columns, 887k rows
- Transforms: Single derived column to mask 3 fields
- Time: 2 mins end-to-end using memory-optimized 80-core debug Azure IR
- Recommended settings: Leaving the default/current partitioning throughout allows ADF to scale partitions up or down based on the size of the Azure IR (i.e., the number of worker cores)
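For reference, data flows execute on Spark behind the scenes, and the "single derived column to mask 3 fields" transform used in all three scenarios is roughly equivalent to the PySpark sketch below. The column names (ssn, email, phone) are hypothetical stand-ins; ADF generates the actual Spark logic from the visual derived-column settings.

```python
# Minimal PySpark sketch of a derived-column masking step.
# Column names are hypothetical placeholders.
from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.appName("mask-fields").getOrCreate()

# Tiny stand-in DataFrame so the sketch is self-contained.
df = spark.createDataFrame(
    [("123-45-6789", "jane@contoso.com", "5551234567")],
    ["ssn", "email", "phone"],
)

masked = (
    df.withColumn("ssn", F.lit("***-**-****"))
      .withColumn("email", F.regexp_replace("email", r"^[^@]+", "****"))
      .withColumn("phone", F.regexp_replace("phone", r"\d", "*"))
)
masked.show()
```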
File-based Source / Sink
- Set "current partitioning" on source & sink to allow data flows to leverage native Spark partitioning. This will allow the performance to scale proportionally with an increase in core counts.
- Pre and post-processing operations like "save as single file", "clear folder", and "delete files" will incur additional time in your ETL process.
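To make the "current partitioning" advice concrete, here is a rough PySpark equivalent of the file-to-file scenario. The paths are placeholders, and the commented-out line shows why a "save as single file" post-processing step costs extra time.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("csv-to-csv").getOrCreate()

# Placeholder ADLS/Blob paths.
src = "abfss://data@myaccount.dfs.core.windows.net/input/"
dst = "abfss://data@myaccount.dfs.core.windows.net/output/"

df = spark.read.option("header", "true").csv(src)

# "Current partitioning": no repartition/coalesce call, so the number of tasks
# and output files follows Spark's own partitioning and scales with core count.
df.write.mode("overwrite").option("header", "true").csv(dst)

# A "save as single file" style step forces everything through one partition,
# which serializes the write and adds time to the job:
# df.coalesce(1).write.mode("overwrite").option("header", "true").csv(dst)
```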
Azure SQL DB Source / Sink
SQL DB Source
- Use "Source" partitioning under Optimize and set the number of partitions equal to the number of cores you are using. Use a high-cardinality column or set of columns as the partition column.
- Use "Input query" to minimize the data, columns, and for pushdown functions.
SQL DB Sink
- Make sure you are using a large enough Azure SQL DB tier so the database has the resources to keep up with your ETL job's writes.
- Adding cores to your job will scale performance proportionally, but you will always be throttled by the ability of the database to serialize the incoming data (see the write sketch after this list).
- Use current partitioning.
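The sink side maps to Spark's JDBC writer. A minimal sketch follows, again with placeholder connection details and a tiny stand-in DataFrame; no matter how many cores you add, the effective write rate is capped by the database tier.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("sqldb-sink").getOrCreate()

jdbc_url = "jdbc:sqlserver://myserver.database.windows.net:1433;database=mydb"  # placeholder

# Stand-in for the transformed DataFrame produced by the earlier steps.
df = spark.createDataFrame(
    [(1, "****", "**********")],
    ["SalesId", "CustomerEmail", "Phone"],
)

(
    df.write.format("jdbc")
    .option("url", jdbc_url)
    .option("dbtable", "dbo.Sales_Masked")  # placeholder target table
    .option("user", "etl_user")             # placeholder credentials
    .option("password", "<secret>")
    .option("batchsize", "10000")           # larger batches mean fewer round trips
    .mode("append")
    .save()
)
```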
Synapse SQL DW
Synapse DW Source & Sink
- Always use "Enable Staging" and increase the core count to minimize data processing times (a conceptual sketch of staged writes follows).
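Conceptually, "Enable Staging" bulk-loads Synapse through a blob/ADLS staging folder rather than doing row-by-row inserts. The sketch below illustrates the same staged pattern using the Databricks Synapse connector's option names (not ADF's own settings); the URLs, table, and credential handling are placeholders.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("synapse-staged-write").getOrCreate()

# Stand-in for the transformed DataFrame.
df = spark.createDataFrame([(1, "****")], ["SalesId", "CustomerEmail"])

# Staged load: data lands in a staging folder first and is then bulk-loaded
# into the DW (PolyBase/COPY style), instead of row-by-row JDBC inserts.
(
    df.write.format("com.databricks.spark.sqldw")
    .option("url", "jdbc:sqlserver://mysynapse.sql.azuresynapse.net:1433;database=mydw")
    .option("dbTable", "dbo.Sales_Masked")
    .option("tempDir", "abfss://staging@myaccount.dfs.core.windows.net/tmp/")
    .option("forwardSparkAzureStorageCredentials", "true")
    .mode("append")
    .save()
)
```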
Cosmos DB Source / Sink
- Make use of the "throughput" option on the Cosmos DB source and sink to increase the throughput available to the ETL job (see the sketch below).
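The throughput option temporarily raises the RU/s available to the container for the duration of the job. The same idea expressed with the azure-cosmos Python SDK is sketched below; the account details are placeholders and run_etl_job() is a hypothetical stand-in for triggering the data flow.

```python
from azure.cosmos import CosmosClient

# Placeholder account, database, and container names.
client = CosmosClient("https://myaccount.documents.azure.com:443/", credential="<key>")
container = client.get_database_client("mydb").get_container_client("mycontainer")

# Record the current RU/s, then raise it so the ETL job is not rate limited.
original_rus = container.get_throughput().offer_throughput
container.replace_throughput(40000)

try:
    run_etl_job()  # hypothetical stand-in for triggering the data flow
finally:
    # Scale back down afterwards to keep costs in check.
    container.replace_throughput(original_rus)
```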
Updated Oct 15, 2020
Mark Kromer, Microsoft