New data flow functions for dynamic, reusable patterns

Copper Contributor

Jan 06, 2021

Thanks for the quick answer. I thought about selecting my attribute columns into a separate stream and then hashing them (which is what you suggest), but then I am not able to join my MD5 back to the original stream. Do I understand correctly that the Merge transformation in your screenshot joins on the MD5?
This is not quite what I am looking for.

My use case is that I want to deduplicate consecutive rows from my source. My source also has a business key (which I didn't mention in my original question; let's say it is always EMPLOYEE_ID). With the rows sorted by UPD_TS within each key, I want to get rid of rows that repeat the attribute values of the previous row (even though they may have a different UPD_TS).

So ideally, I would add a Derived Column to my stream like this one:

md5(byNames(filter(columnNames(), !in(['employee_id','upd_ts'],#item)) ))

and then I could detect repeating MD5s with lag in a Window transformation.
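To make the intent concrete outside of data flow syntax, here is a rough Python sketch of the hash I am after. It is only an illustration, not data flow expression code: `row_md5`, the `exclude` list and the separator character are names and choices I made up for the example.

```python
import hashlib

def row_md5(row, exclude):
    """Hash every column of `row` except the ones named in `exclude`."""
    # Compare names case-insensitively, since the source uses EMPLOYEE_ID
    # while the expression uses 'employee_id'.
    excluded = {name.lower() for name in exclude}
    # Sort the remaining names so the hash does not depend on column order.
    names = sorted(n for n in row if n.lower() not in excluded)
    # Join name=value pairs with a separator unlikely to appear in the data
    # (an assumption; None values are hashed as the text "None").
    payload = "\x1f".join(f"{n}={row[n]}" for n in names)
    return hashlib.md5(payload.encode("utf-8")).hexdigest()

# Two rows that differ only in UPD_TS should get the same hash.
a = {"EMPLOYEE_ID": 1, "UPD_TS": "2021-01-01", "NAME": "Ann", "DEPT": "HR"}
b = {"EMPLOYEE_ID": 1, "UPD_TS": "2021-01-02", "NAME": "Ann", "DEPT": "HR"}
print(row_md5(a, ["employee_id", "upd_ts"]) == row_md5(b, ["employee_id", "upd_ts"]))
```

The point is that the list of columns to hash is computed from the column names at run time, rather than typed out by hand.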

Unfortunately, the expression above returns an error 😞

Please note that certain MD5s can recur, because underlying attributes may change values in this fashion:

original value --> original value --> another value --> original value

In this case, I want to capture the first, third and fourth values.
This means I can't do a simple join of my two streams by MD5, because I would get a semi-Cartesian product.
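Again as a Python sketch rather than data flow code, this is the consecutive-row logic I have in mind: compare each row's hash with the previous row's hash for the same key, which is what lag over a window would give me. The function name and the `row_md5` field are hypothetical, and the hash values are just placeholder labels.

```python
from itertools import groupby
from operator import itemgetter

def drop_consecutive_repeats(rows):
    """Keep a row only if its hash differs from the previous row of the same key."""
    # Sort by business key, then by update timestamp.
    ordered = sorted(rows, key=itemgetter("employee_id", "upd_ts"))
    kept = []
    for _, group in groupby(ordered, key=itemgetter("employee_id")):
        previous = None  # equivalent of lag() returning NULL on the first row
        for row in group:
            if row["row_md5"] != previous:
                kept.append(row)
            previous = row["row_md5"]
    return kept

# original value --> original value --> another value --> original value
sample = [
    {"employee_id": 1, "upd_ts": 1, "row_md5": "orig"},
    {"employee_id": 1, "upd_ts": 2, "row_md5": "orig"},
    {"employee_id": 1, "upd_ts": 3, "row_md5": "other"},
    {"employee_id": 1, "upd_ts": 4, "row_md5": "orig"},
]
print([r["upd_ts"] for r in drop_consecutive_repeats(sample)])  # [1, 3, 4]
```

Because the comparison is only against the immediately preceding row, the recurring hash in the fourth row is kept, which a join on MD5 cannot express.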

It is a really cool feature that you can do rule-based mappings in the Select transformation and easily get rid of columns you don't want,

e.g. !in(['upd_ts'],name)

and it would be great if I could use the same logic to define which columns get hashed with MD5 in my Derived Column (while keeping the other columns in the stream).
