New data flow functions for dynamic, reusable patterns

Microsoft

May 15, 2020

ADF has added columns() and byNames() functions to make it even easier to build ETL patterns that are reusable and flexible for generic handling of dimensions and other big data analytics requirements.

In the first example below, I am building a generic change detection data flow that looks for changed column values by hashing the row. I can use static column names in the hashing function (I'm using sha2 in this case), or I can use columns() to tell ADF to pass all of the stream's incoming columns as a single function argument. This way, my pattern can be reused with any source without hardcoding column names. This relies on the ADF schema drift feature and is achieved by passing the new columns() function as the argument to the sha2() function:

sha2(256,columns())
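
To make the idea concrete outside of ADF, here is a minimal Python sketch of the same pattern: hash every value in a row, whatever the schema, and compare the result with a previously stored hash. The names `row_hash` and `has_changed` and the sample rows are hypothetical, and the JSON serialization step is an assumption of this sketch, not a description of how ADF's sha2() combines its arguments.

```python
import hashlib
import json


def row_hash(row: dict) -> str:
    """Hash all values of a row, in column order, without naming any column."""
    # Serialize the values as a JSON array so value boundaries are preserved;
    # default=str is a fallback for types JSON cannot encode, such as dates.
    payload = json.dumps(list(row.values()), default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def has_changed(incoming: dict, stored_hash: str) -> bool:
    """Return True when the incoming row no longer matches the stored hash."""
    return row_hash(incoming) != stored_hash


# Hypothetical sample data: the functions never refer to a column by name.
stored = row_hash({"id": 1, "name": "Ada", "city": "Paris"})
unchanged = has_changed({"id": 1, "name": "Ada", "city": "Paris"}, stored)
changed = has_changed({"id": 1, "name": "Ada", "city": "Lyon"}, stored)
```

Because nothing in the sketch depends on specific column names, the same two functions can be pointed at a row from any source, which mirrors what columns() provides inside the data flow.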

In the second example below, I'm passing in a comma-separated list of column names as a string parameter; these are the specific columns that I want to check for changes. In this case, I don't want to use static column names, and I don't want columns() to look for changes across the entire row. Instead, I want to parameterize the hash function. I use byNames() and split my string parameter to create an array of column names. The parameter can be sent to the data flow activity at runtime, making the hashing dynamic.

sha2(256,byNames(split($cols,',')))

In this case, I'm hashing the columns movies, title, and genres, which detects any change to the values in those columns. This is achieved by passing the new byNames() function as the argument to the sha2() hash function and using split() to turn the string parameter of column names into an array.
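
As a rough Python analogue of the parameterized version, the sketch below splits a comma-separated string into column names, looks up only those values, and hashes them. The function name `hash_columns`, the sample row, and the extra `rating` column are hypothetical; the serialization is an assumption of this sketch and not ADF's internal behavior.

```python
import hashlib
import json


def hash_columns(row: dict, cols: str) -> str:
    """Hash only the columns named in a comma-separated string parameter."""
    names = cols.split(",")                 # plays the role of split($cols, ',')
    values = [row[name] for name in names]  # plays the role of byNames(...)
    payload = json.dumps(values, default=str)
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


cols_param = "movies,title,genres"  # hypothetical runtime parameter value
before = {"movies": 1, "title": "Heat", "genres": "Crime", "rating": 8.3}
after_rating = dict(before, rating=8.4)      # change outside the hashed columns
after_genres = dict(before, genres="Drama")  # change inside the hashed columns

same = hash_columns(before, cols_param) == hash_columns(after_rating, cols_param)
different = hash_columns(before, cols_param) != hash_columns(after_genres, cols_param)
```

Only a change in one of the listed columns alters the hash, so the set of columns that count as a "change" is controlled entirely by the parameter.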

Updated May 15, 2020

Version 2.0

azure data factory

Azure Data Integration

Azure ETL

Big Data Analytics

Mark Kromer

Microsoft

Joined August 14, 2018


Azure Data Factory Blog


25 Comments

matt_gibson
Copper Contributor
Apr 30, 2024
I notice that the sha2() function seems to simply concatenate its values before hashing the whole result. This means that, for example, the following calls generate the same hash:
sha2(256, 'ABC', 'DEF') /* e9c0f8b575cbfcb42ab3b78ecc87efa3b011d9a5d10b09fa4e96f240bf6a82f5 */
sha2(256, 'AB', 'CDEF') /* e9c0f8b575cbfcb42ab3b78ecc87efa3b011d9a5d10b09fa4e96f240bf6a82f5 */

So if data in adjacent columns changes like this, no change will actually be detected. This doesn't seem optimal. Is there a hash function where the calculated hash will change if the source data changes like this? Or some other simple way of doing sha2(256, columns()) that would avoid this issue?
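
The collision described in this comment can be reproduced outside ADF. The Python sketch below assumes, as the comment observes, that values are concatenated before hashing; it then shows one common workaround, length-prefixing each value, so that value boundaries become part of the hashed input. How this would map to an ADF expression is not covered here, and `naive_hash` and `length_prefixed_hash` are hypothetical names.

```python
import hashlib


def naive_hash(*values: str) -> str:
    """Concatenate values with no boundary marker, then hash."""
    return hashlib.sha256("".join(values).encode("utf-8")).hexdigest()


def length_prefixed_hash(*values: str) -> str:
    """Prefix each value with its length so boundaries are part of the input."""
    framed = "".join(f"{len(v)}:{v}" for v in values)
    return hashlib.sha256(framed.encode("utf-8")).hexdigest()


# Both calls hash the string "ABCDEF", so the naive version cannot tell them apart.
naive_collides = naive_hash("ABC", "DEF") == naive_hash("AB", "CDEF")

# The framed inputs are "3:ABC3:DEF" and "2:AB4:CDEF", which differ.
framed_collides = (
    length_prefixed_hash("ABC", "DEF") == length_prefixed_hash("AB", "CDEF")
)
```

A plain delimiter between values achieves the same effect as long as the delimiter cannot occur in the data; length-prefixing avoids that caveat.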
Mark Kromer
Microsoft
Feb 08, 2021
VisRamAzureBI Using columns() in that context in an Exists transformation won't work. columns() needs to be wrapped in an array function like this:

array(columns())

... but Exists currently cannot use arrays to compare, so that technique, unfortunately, does not work.

Instead, I've provided updated guidance on looking for duplicate rows here.
Mark Kromer
Microsoft
Feb 05, 2021
ji_new We do not have that capability today, but it is on our roadmap and we are looking at adding it.
ji_new
Copper Contributor
Feb 05, 2021
Mark Kromer
Is it possible to create reusable transformations in ADF?
E.g., a lookup that can be shared across pipelines.
VisRamAzureBI
Copper Contributor
Feb 02, 2021
Hi Mark Kromer ,
Nice blog. One challenge I have been facing for a long time involves sha2 and Exists. First, let's go with Exists.
A little background on my task: I am comparing/validating row by row and column by column between an ADLS file (no headers) and a Synapse table.
I have many such datasets to validate, so both the table name and the ADLS file should be dynamic.
I saw your reply to another person suggesting columns() == columns().
But when I use that option, my pipeline throws the error "StatusCode":"DFExecutorUserError","Message":"Job failed due to reason: 0","Details":"0"
When I manually input the columns (RN_CD == {_col0_} && LB_CD == {_col1_}), it works. But I need this to be dynamic. Please advise what else can be done. Thanks in advance. Vishnu
Mark Kromer
Microsoft
Jan 12, 2021
Piotr_Kalinski The Merge transformation in my screenshot is just the name I gave to my Union transform. You must branch off your stream and then use a Select to pick the columns you want to keep for your hash in order to use columns() in the hash function.
Piotr_Kalinski
Copper Contributor
Jan 06, 2021
Hi Mark Kromer
Thanks for the quick answer. I thought about selecting my attribute columns into a separate stream and then hashing them (which is what you suggest), but then I am not able to join my MD5 back to the original stream. Am I right that the Merge transformation in your screenshot does the join on the MD5?
This is not quite what I am looking for.

My use case is that I want to deduplicate consecutive rows from my source. My source also has a business key (which I didn't mention in my original question; let's say it is always EMPLOYEE_ID), and with the rows sorted by UPD_TS, I want to get rid of rows that repeat the attribute values of the previous row (although they may have a different UPD_TS).

So ideally, I would add a Derived Column to my stream like this one:
md5(byNames(filter(columnNames(), !in(['employee_id','upd_ts'],#item)) ))
and then I could detect repeating md5s with lag & window transformation.
Unfortunately, the expression above returns an error 😞

Please note that certain MD5s can recur, because underlying attributes may change values in this fashion:
original value --> original value --> another value --> original value
In this case, I want to capture first, third & fourth value.
This means I can't do a simple join of my 2 streams by MD5, because I would get a semi-cartesian product.

It is a really cool feature that you can do rule-based mappings in the Select transformation and easily get rid of columns you don't want,
e.g. !in(['upd_ts'],name)
and it would be great if I could use the same logic when defining which columns should be MD5ed in my Derived Column (and keeping the other columns).
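
The behavior asked for in this comment can be sketched in Python to pin down the intended result: hash every column except the business key and the timestamp, then keep a row only when its hash differs from the previous row for the same key. This is an illustration of the desired logic under stated assumptions, not an ADF expression or a workaround for the error above; `attr_hash`, `drop_consecutive_repeats`, and the sample rows are hypothetical.

```python
import hashlib
import json


def attr_hash(row: dict, exclude: set) -> str:
    """Hash every column except those named in `exclude` (case-insensitive)."""
    values = [v for k, v in row.items() if k.lower() not in exclude]
    payload = json.dumps(values, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


def drop_consecutive_repeats(rows: list, key: str, ts: str) -> list:
    """Keep a row only if its attribute hash differs from the previous row
    of the same key, with rows ordered by timestamp (a lag-style comparison)."""
    exclude = {key.lower(), ts.lower()}
    last_hash = {}  # business key -> hash of the most recent row seen
    kept = []
    for row in sorted(rows, key=lambda r: (r[key], r[ts])):
        h = attr_hash(row, exclude)
        if last_hash.get(row[key]) != h:
            kept.append(row)
        last_hash[row[key]] = h
    return kept


rows = [
    {"EMPLOYEE_ID": 7, "attr_A": "x", "UPD_TS": 1},  # original value
    {"EMPLOYEE_ID": 7, "attr_A": "x", "UPD_TS": 2},  # repeat of previous row
    {"EMPLOYEE_ID": 7, "attr_A": "y", "UPD_TS": 3},  # another value
    {"EMPLOYEE_ID": 7, "attr_A": "x", "UPD_TS": 4},  # original value again
]
kept_ts = [
    r["UPD_TS"] for r in drop_consecutive_repeats(rows, "EMPLOYEE_ID", "UPD_TS")
]
```

Comparing each hash only with the immediately preceding row for the same key keeps the first, third, and fourth rows, so a value that recurs later is retained, which a plain join on the hash would not guarantee.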
Mark Kromer
Microsoft
Jan 05, 2021
Piotr_Kalinski Add a new branch after your Source, then use a Select and remove the column(s) you wish to not include in your hash. After the Select, I added a Derived Column for my hash using columns().
Piotr_Kalinski
Copper Contributor
Jan 05, 2021
Hi Mark Kromer ,
I am trying hard to make an MD5 from all my source columns except one.
Let's imagine my source table has the following columns: attr_A, attr_B, attr_C, UPD_TS.
I want to hash everything except UPD_TS.
I know I can pass a parameter with a list of columns to hash ('attr_A,attr_B,attr_C'), but all my sources have a UPD_TS column, so it would be much easier for me to exclude a single column than to create a list of non-UPD_TS columns for each source table.
I know ADF has a lot of cool functions like columnNames(), byNames(), columns(), etc., and I feel it should be possible to do what I want.
But I have tested plenty of different options and all of them return errors.
Is there a smart way of doing this?
KSchulke
Copper Contributor
Jul 07, 2020
Mark Kromer
Thanks for your reply. I am attempting to process column names dynamically to create the row hash based upon its values, but the issue I am experiencing is how to obtain the values for the sorted column names.

The documentation for the byNames function states 'Computed inputs are not supported but you can use parameter substitutions'. This explains why your example uses a parameter as input to the split function to create the array passed to the byNames function:

sha2(256,byNames(split($cols,',')))

I can use this technique, but I would like to understand why the following method does not work.

Data flow provides convenient functions to get the column names from the input file and sort them. I created a derived column called sortColumnNames and configured it as follows:

sort(columnNames(), compare(#item1, #item2))

In the next or same derived column step, I configured a checksum column to obtain the hash for the row data based upon the sorted names array:

md5(toString(byNames(sortColumnNames)))

This seems simple to do within data flow without having to create an external parameter for the data flow. However, during the debug session for the data flow, this error message is returned:
DF-EXPR-030 - Column name function 'byNames' does not accept column or argument parameters

Can you help explain why the byNames function is limited or prohibited from using this type of parameter?
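
The apparent goal in this comment, hashing a row's values in sorted column-name order so that the column order of the input does not affect the checksum, can be illustrated in Python. This shows the intended result only and does not get around the DF-EXPR-030 restriction in ADF; `order_independent_hash` and the sample rows are hypothetical.

```python
import hashlib
import json


def order_independent_hash(row: dict) -> str:
    """Hash a row's values in sorted column-name order."""
    names = sorted(row)                     # sorted column names
    values = [row[name] for name in names]  # values looked up by name
    payload = json.dumps(values, default=str)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()


# The same data with columns in a different order yields the same checksum.
a = order_independent_hash({"id": 1, "name": "Ada"})
b = order_independent_hash({"name": "Ada", "id": 1})
```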
