%3CLINGO-SUB%20id%3D%22lingo-sub-1781804%22%20slang%3D%22en-US%22%3EPerformance%20Tuning%20ADF%20Data%20Flow%20Sources%20and%20Sinks%3C%2FLINGO-SUB%3E%3CLINGO-BODY%20id%3D%22lingo-body-1781804%22%20slang%3D%22en-US%22%3E%3CP%3EAzure%20Data%20Factory%20Data%20Flows%20perform%20data%20transformation%20ETL%20at%20cloud-scale.%20This%20blog%20post%20takes%20a%20look%20at%20performance%20of%20different%20source%20and%20sink%20types.%20I've%20put%20our%20findings%20below%20based%20on%20performance%20tests%20of%20different%20source%20%26amp%3B%20sink%20pairs%3A%3C%2FP%3E%0A%3CH3%20id%3D%22toc-hId-1260750110%22%20id%3D%22toc-hId-1260774286%22%3EScenario%201%3C%2FH3%3E%0A%3CUL%3E%0A%3CLI%3ESource%3A%20Delimited%20Text%20Blob%20Store%3C%2FLI%3E%0A%3CLI%3ESink%3A%20Azure%20SQL%20DB%3C%2FLI%3E%0A%3CLI%3EFile%20size%3A%20421Mb%2C%2074%20columns%2C%20887k%20rows%3C%2FLI%3E%0A%3CLI%3ETransforms%3A%20Single%20derived%20column%20to%20mask%203%20fields%3C%2FLI%3E%0A%3CLI%3ETime%3A%204%20mins%20end-to-end%20using%20memory-optimized%2080-core%20debug%20Azure%20IR%3C%2FLI%3E%0A%3CLI%3ERecommended%20settings%3A%20Current%20partitioning%20used%20throughout%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CH3%20id%3D%22toc-hId--546704353%22%20id%3D%22toc-hId--546680177%22%3EScenario%202%3C%2FH3%3E%0A%3CUL%3E%0A%3CLI%3ESource%3A%20Azure%20SQL%20DB%20Table%3C%2FLI%3E%0A%3CLI%3ESink%3A%20Azure%20SQL%20DB%20Table%3C%2FLI%3E%0A%3CLI%3ETable%20size%3A%2074%20columns%2C%20887k%20rows%3C%2FLI%3E%0A%3CLI%3ETransforms%3A%20Single%20derived%20column%20to%20mask%203%20fields%3C%2FLI%3E%0A%3CLI%3ETime%3A%203%20mins%20end-to-end%20using%20memory-optimized%2080-core%20debug%20Azure%20IR%3C%2FLI%3E%0A%3CLI%3ERecommended%20settings%3A%20Source%20partitioning%20on%20SQL%20DB%20Source%2C%20current%20partitioning%20on%20Derived%20Column%20and%20Sink%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CH3%20id%3D%22toc-hId-1940808480%22%20id%3D%22toc-hId-1940832656%22%3EScenario%203%3C%2FH3%3E%0A%3CUL%3E%0A%3CLI%3ESource%3A%20Delimited%20Text%20Blob%20Store%3C%2FLI%3E%0A%3CLI%3ESink%3A%20Delimited%20Text%20Blob%20store%3C%2FLI%3E%0A%3CLI%3ETable%20size%3A%2074%20columns%2C%20887k%20rows%3C%2FLI%3E%0A%3CLI%3ETransforms%3A%20Single%20derived%20column%20to%20mask%203%20fields%3C%2FLI%3E%0A%3CLI%3ETime%3A%202%20mins%20end-to-end%20using%20memory%20optimized%2080-core%20debug%20Azure%20IR%3C%2FLI%3E%0A%3CLI%3ERecommended%20settings%3A%20Leaving%20default%2Fcurrent%20partitioning%20throughout%20allows%20ADF%20to%20scale-up%2Fdown%26nbsp%3B%20partitions%20based%20on%20size%20of%20Azure%20IR%20(i.e.%20number%20of%20worker%20cores)%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CH3%20id%3D%22toc-hId-133354017%22%20id%3D%22toc-hId-133378193%22%3EFile-based%20Source%20%2F%20Sink%3C%2FH3%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22perf1.png%22%20style%3D%22width%3A%20999px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F226798i2472A344A41938C2%2Fimage-size%2Flarge%3Fv%3D1.0%26amp%3Bpx%3D999%22%20role%3D%22button%22%20title%3D%22perf1.png%22%20alt%3D%22perf1.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CUL%3E%0A%3CLI%3ESet%20%22current%20partitioning%22%20on%20source%20%26amp%3B%20sink%20to%20allow%20data%20flows%20to%20leverage%20native%20Spark%20partitioning.%20This%20will%20allow%20the%20performance%20to%20scale%20proportionally%20with%20an%20increase%20in%20core%20counts.%3C%2FLI%3E%0A%3CLI%3EPre%20and%20post-processing%20operations%20like%20%22save%20as%20single%20file%22%2C%20%22clear%20folder%22%2C%20and%20%22delete%20files%22%20will%20incur%20additional%20time%20in%20your%20ETL%20process.%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CH3%20id%3D%22toc-hId--1674100446%22%20id%3D%22toc-hId--1674076270%22%3EAzure%20SQL%20DB%20Source%20%2F%20Sink%3C%2FH3%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22perf2.png%22%20style%3D%22width%3A%20999px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F226802i96E8856FB914E58C%2Fimage-size%2Flarge%3Fv%3D1.0%26amp%3Bpx%3D999%22%20role%3D%22button%22%20title%3D%22perf2.png%22%20alt%3D%22perf2.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%0A%3CUL%3E%0A%3CLI%3E%3CH5%20id%3D%22toc-hId-1514476965%22%20id%3D%22toc-hId-1514501141%22%3ESQL%20DB%20Source%3C%2FH5%3E%0A%3CUL%3E%0A%3CLI%3EUse%20%22Source%22%20partitioning%20under%20Optimize%20and%20set%20the%20number%20of%20partitions%20equal%20to%20the%20number%20of%20cores%20you%20are%20using.%20Use%20a%20high-cardinality%20column%20or%20set%20of%20columns%20as%20the%20partition%20column.%3C%2FLI%3E%0A%3CLI%3EUse%20%22Input%20query%22%20to%20minimize%20the%20data%2C%20columns%2C%20and%20for%20pushdown%20functions.%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3C%2FLI%3E%0A%3CLI%3E%3CH5%20id%3D%22toc-hId--292977498%22%20id%3D%22toc-hId--292953322%22%3ESQL%20DB%20Sink%3C%2FH5%3E%0A%3CUL%3E%0A%3CLI%3EMake%20sure%20that%20you%20are%20using%20a%20large%20enough%20SQL%20DB%20tier%20for%20your%20ETL%20job%20to%20write%20to%20the%20database%20with%20enough%20resources.%3C%2FLI%3E%0A%3CLI%3EAdding%20cores%20to%20your%20job%20will%20scale%20the%20performance%20proportionally%2C%20but%20you%20will%20always%20be%20throttled%20by%20the%20ability%20of%20the%20database%20to%20serialize%20data.%3C%2FLI%3E%0A%3CLI%3EUse%20current%20partitioning.%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CH3%20id%3D%22toc-hId-1493470757%22%20id%3D%22toc-hId-1493494933%22%3ESynapse%20SQL%20DW%3C%2FH3%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22perf5.png%22%20style%3D%22width%3A%20999px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F226810i6987BA524477BBAA%2Fimage-size%2Flarge%3Fv%3D1.0%26amp%3Bpx%3D999%22%20role%3D%22button%22%20title%3D%22perf5.png%22%20alt%3D%22perf5.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%0A%3CUL%3E%0A%3CLI%3E%3CH5%20id%3D%22toc-hId-387080872%22%20id%3D%22toc-hId-387105048%22%3ESynapse%20DW%20Source%20%26amp%3B%20Sink%3C%2FH5%3E%0A%3CUL%3E%0A%3CLI%3EAlways%20use%20%22Enable%20Staging%22%20and%20increase%20core%20count%20to%20minimize%20data%20processing%20times%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3C%2FLI%3E%0A%3C%2FUL%3E%0A%3CH3%20id%3D%22toc-hId--2121438169%22%20id%3D%22toc-hId--2121413993%22%3ECosmosDB%20Source%20%2F%20Sink%3C%2FH3%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22perf4.png%22%20style%3D%22width%3A%20999px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F226811i39B2D2D26300F779%2Fimage-size%2Flarge%3Fv%3D1.0%26amp%3Bpx%3D999%22%20role%3D%22button%22%20title%3D%22perf4.png%22%20alt%3D%22perf4.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%0A%3CP%3EMake%20use%20of%20the%20%22throughput%22%20option%20on%20the%20CosmoDB%20source%20and%20sink%20to%20increase%20the%20throughput%20for%20the%20ETL%20job.%3C%2FP%3E%3C%2FLINGO-BODY%3E%3CLINGO-TEASER%20id%3D%22lingo-teaser-1781804%22%20slang%3D%22en-US%22%3E%3CP%3EAzure%20Data%20Factory%20Data%20Flows%20perform%20data%20transformation%20ETL%20at%20cloud-scale.%20This%20blog%20post%20takes%20a%20look%20at%20the%20performance%20of%20different%20source%20and%20sink%20types.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22perf1.png%22%20style%3D%22width%3A%20400px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F226777iB98F51FB1BFF36FD%2Fimage-size%2Fmedium%3Fv%3D1.0%26amp%3Bpx%3D400%22%20role%3D%22button%22%20title%3D%22perf1.png%22%20alt%3D%22perf1.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%3C%2FLINGO-TEASER%3E%3CLINGO-LABS%20id%3D%22lingo-labs-1781804%22%20slang%3D%22en-US%22%3E%3CLINGO-LABEL%3EAzure%20Data%20Factory%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3EAzure%20ETL%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3EMapping%20Data%20Flows%3C%2FLINGO-LABEL%3E%3C%2FLINGO-LABS%3E
Microsoft

Azure Data Factory Data Flows perform data transformation ETL at cloud-scale. This blog post takes a look at performance of different source and sink types. I've put our findings below based on performance tests of different source & sink pairs:

Scenario 1

  • Source: Delimited Text Blob Store
  • Sink: Azure SQL DB
  • File size: 421Mb, 74 columns, 887k rows
  • Transforms: Single derived column to mask 3 fields
  • Time: 4 mins end-to-end using memory-optimized 80-core debug Azure IR
  • Recommended settings: Current partitioning used throughout

Scenario 2

  • Source: Azure SQL DB Table
  • Sink: Azure SQL DB Table
  • Table size: 74 columns, 887k rows
  • Transforms: Single derived column to mask 3 fields
  • Time: 3 mins end-to-end using memory-optimized 80-core debug Azure IR
  • Recommended settings: Source partitioning on SQL DB Source, current partitioning on Derived Column and Sink

Scenario 3

  • Source: Delimited Text Blob Store
  • Sink: Delimited Text Blob store
  • Table size: 74 columns, 887k rows
  • Transforms: Single derived column to mask 3 fields
  • Time: 2 mins end-to-end using memory optimized 80-core debug Azure IR
  • Recommended settings: Leaving default/current partitioning throughout allows ADF to scale-up/down  partitions based on size of Azure IR (i.e. number of worker cores)

File-based Source / Sink

perf1.png

 

  • Set "current partitioning" on source & sink to allow data flows to leverage native Spark partitioning. This will allow the performance to scale proportionally with an increase in core counts.
  • Pre and post-processing operations like "save as single file", "clear folder", and "delete files" will incur additional time in your ETL process.

Azure SQL DB Source / Sink

perf2.png

  • SQL DB Source
    • Use "Source" partitioning under Optimize and set the number of partitions equal to the number of cores you are using. Use a high-cardinality column or set of columns as the partition column.
    • Use "Input query" to minimize the data, columns, and for pushdown functions.
  • SQL DB Sink
    • Make sure that you are using a large enough SQL DB tier for your ETL job to write to the database with enough resources.
    • Adding cores to your job will scale the performance proportionally, but you will always be throttled by the ability of the database to serialize data.
    • Use current partitioning.

Synapse SQL DW

perf5.png

  • Synapse DW Source & Sink
    • Always use "Enable Staging" and increase core count to minimize data processing times

CosmosDB Source / Sink

perf4.png

Make use of the "throughput" option on the CosmoDB source and sink to increase the throughput for the ETL job.