Implement Fill Down in ADF and Synapse Data Flows

Published Dec 23 2020 02:38 PM 2,890 Views
Microsoft

"Fill down" is an operation common in data prep and data cleansing meant to solve the problem with data sets when you want to replace NULL values with the value from the previous non-NULL value in the sequence. Here is how to implement this in ADF and Synapse data flows.

 

Note that this operation can have negative performance implications because you must create a synthetic window across your entire data set with a "dummy" category value. Additionally, you must sort by a value to create the proper data sequence to find the previous non-NULL value.

 

In this case, my products data is missing the Color value in several rows. To fix this, I applied the fill down pattern in the Window transformation to fill in the NULL values:

 

fill2.png

To achieve this, I used the coalesce formula in the Window transformation: 

 

 

coalesce(Color, last(Color, true()))

 

 

The key to make this work is to first set the Window "over" clause to your entire data set. If you do not already have a value that is the same across all rows, you can set a synthetic window value in a Derived Column. In my case, I set an integer value of 1 to a column called "dummy".

 

Next, the Window sorting should be sorted by the column that sets the proper ordering such that the previous non-NULL value will be your fill value. In my case, I used the Surrogate Key transformation to create an incrementing key to sort by.

 

The final fill-down pattern looks something like this:

 

fill1.png

 

To make this easy to add to your data flow pipelines, I added the script-behind as a data flow snippet to our online recipes here.

 

I've also included this snippet below. This creates the synthetic category as "dummy" and sorts by a surrogate key. You can remove the surrogate key and use your own data-specific sort key. This code snippet assumes you've already added a Source transformation to your data flow called source1.

source1 derive(dummy = 1) ~> DerivedColumn
DerivedColumn keyGenerate(output(sk as long),
	startAt: 1L) ~> SurrogateKey
SurrogateKey window(over(dummy),
	asc(sk, true),
	Rating2 = coalesce(Color, last(Color, true()))) ~> Window1

 

1 Comment
%3CLINGO-SUB%20id%3D%22lingo-sub-2013406%22%20slang%3D%22en-US%22%3EImplement%20Fill%20Down%20in%20ADF%20and%20Synapse%20Data%20Flows%3C%2FLINGO-SUB%3E%3CLINGO-BODY%20id%3D%22lingo-body-2013406%22%20slang%3D%22en-US%22%3E%3CP%3E%22Fill%20down%22%20is%20an%20operation%20common%20in%20data%20prep%20and%20data%20cleansing%20meant%20to%20solve%20the%20problem%20with%20data%20sets%20when%20you%20want%20to%20replace%20NULL%20values%20with%20the%20value%20from%20the%20previous%20non-NULL%20value%20in%20the%20sequence.%20Here%20is%20how%20to%20implement%20this%20in%20ADF%20and%20Synapse%20data%20flows.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3ENote%20that%20this%20operation%20can%20have%20negative%20performance%20implications%20because%20you%20must%20create%20a%20synthetic%20window%20across%20your%20entire%20data%20set%20with%20a%20%22dummy%22%20category%20value.%20Additionally%2C%20you%20must%20sort%20by%20a%20value%20to%20create%20the%20proper%20data%20sequence%20to%20find%20the%20previous%20non-NULL%20value.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EIn%20this%20case%2C%20my%20products%20data%20is%20missing%20the%20Color%20value%20in%20several%20rows.%20To%20fix%20this%2C%20I%20applied%20the%20fill%20down%20pattern%20in%20the%20Window%20transformation%20to%20fill%20in%20the%20NULL%20values%3A%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22fill2.png%22%20style%3D%22width%3A%20400px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F242651i48925F8AA7D41F81%2Fimage-size%2Fmedium%3Fv%3D1.0%26amp%3Bpx%3D400%22%20role%3D%22button%22%20title%3D%22fill2.png%22%20alt%3D%22fill2.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%0A%3CP%3ETo%20achieve%20this%2C%20I%20used%20the%20coalesce%20formula%20in%20the%20Window%20transformation%3A%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CPRE%20class%3D%22lia-code-sample%20language-applescript%22%3E%3CCODE%3Ecoalesce(Color%2C%20last(Color%2C%20true()))%3C%2FCODE%3E%3C%2FPRE%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EThe%20key%20to%20make%20this%20work%20is%20to%20first%20set%20the%20Window%20%22over%22%20clause%20to%20your%20entire%20data%20set.%20If%20you%20do%20not%20already%20have%20a%20value%20that%20is%20the%20same%20across%20all%20rows%2C%20you%20can%20set%20a%20synthetic%20window%20value%20in%20a%20Derived%20Column.%20In%20my%20case%2C%20I%20set%20an%20integer%20value%20of%201%20to%20a%20column%20called%20%22dummy%22.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3ENext%2C%20the%20Window%20sorting%20should%20be%20sorted%20by%20the%20column%20that%20sets%20the%20proper%20ordering%20such%20that%20the%20previous%20non-NULL%20value%20will%20be%20your%20fill%20value.%20In%20my%20case%2C%20I%20used%20the%20Surrogate%20Key%20transformation%20to%20create%20an%20incrementing%20key%20to%20sort%20by.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EThe%20final%20fill-down%20pattern%20looks%20something%20like%20this%3A%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22fill1.png%22%20style%3D%22width%3A%20400px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F242609i2080B7DCADDFE23B%2Fimage-size%2Fmedium%3Fv%3D1.0%26amp%3Bpx%3D400%22%20role%3D%22button%22%20title%3D%22fill1.png%22%20alt%3D%22fill1.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3ETo%20make%20this%20easy%20to%20add%20to%20your%20data%20flow%20pipelines%2C%20I%20added%20the%20script-behind%20as%20a%20data%20flow%20snippet%20to%20our%20online%20recipes%20%3CA%20href%3D%22https%3A%2F%2Fdocs.microsoft.com%2Fen-us%2Fazure%2Fdata-factory%2Fdata-flow-script%23fill-down%22%20target%3D%22_self%22%20rel%3D%22noopener%20noreferrer%22%3Ehere%3C%2FA%3E.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3EI've%20also%20included%20this%20snippet%20below.%20This%20creates%20the%20synthetic%20category%20as%20%22dummy%22%20and%20sorts%20by%20a%20surrogate%20key.%20You%20can%20remove%20the%20surrogate%20key%20and%20use%20your%20own%20data-specific%20sort%20key.%20This%20code%20snippet%20assumes%20you've%20already%20added%20a%20Source%20transformation%20to%20your%20data%20flow%20called%3CSPAN%3E%26nbsp%3B%3C%2FSPAN%3E%3CCODE%3Esource1.%3C%2FCODE%3E%3C%2FP%3E%0A%3CP%3E%3CCODE%3E%3C%2FCODE%3E%3C%2FP%3E%0A%3CPRE%3E%3CCODE%3Esource1%20derive(dummy%20%3D%201)%20~%26gt%3B%20DerivedColumn%0ADerivedColumn%20keyGenerate(output(sk%20as%20long)%2C%0A%20startAt%3A%201L)%20~%26gt%3B%20SurrogateKey%0ASurrogateKey%20window(over(dummy)%2C%0A%20asc(sk%2C%20true)%2C%0A%20Rating2%20%3D%20coalesce(Color%2C%20last(Color%2C%20true())))%20~%26gt%3B%20Window1%3C%2FCODE%3E%3C%2FPRE%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%3C%2FLINGO-BODY%3E%3CLINGO-TEASER%20id%3D%22lingo-teaser-2013406%22%20slang%3D%22en-US%22%3E%3CP%3E%22Fill%20Down%22%20is%20an%20operation%20common%20in%20data%20prep%20and%20data%20cleansing%20meant%20to%20solve%20the%20problem%20with%20data%20sets%20when%20you%20want%20to%20replace%20NULL%20values%20with%20the%20value%20from%20the%20previous%20non-NULL%20value%20in%20the%20sequence.%20Here%20is%20how%20to%20implement%20this%20in%20ADF%20and%20Synapse%20data%20flows.%3C%2FP%3E%0A%3CP%3E%26nbsp%3B%3C%2FP%3E%0A%3CP%3E%3CSPAN%20class%3D%22lia-inline-image-display-wrapper%20lia-image-align-inline%22%20image-alt%3D%22fill2.png%22%20style%3D%22width%3A%20400px%3B%22%3E%3CIMG%20src%3D%22https%3A%2F%2Ftechcommunity.microsoft.com%2Ft5%2Fimage%2Fserverpage%2Fimage-id%2F242650iCB5A7E53D4068FFE%2Fimage-size%2Fmedium%3Fv%3D1.0%26amp%3Bpx%3D400%22%20role%3D%22button%22%20title%3D%22fill2.png%22%20alt%3D%22fill2.png%22%20%2F%3E%3C%2FSPAN%3E%3C%2FP%3E%3C%2FLINGO-TEASER%3E%3CLINGO-LABS%20id%3D%22lingo-labs-2013406%22%20slang%3D%22en-US%22%3E%3CLINGO-LABEL%3EAzure%20Data%20Factory%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3EAzure%20Synapse%20Analytics%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3EBig%20Data%20Analytics%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3EData%20Flows%3C%2FLINGO-LABEL%3E%3CLINGO-LABEL%3EMapping%20Data%20Flows%3C%2FLINGO-LABEL%3E%3C%2FLINGO-LABS%3E
Version history
Last update:
‎Dec 23 2020 03:02 PM
Updated by: