Hi Mark Kromer,
Thanks for the quick answer. I thought about selecting my attribute columns into a separate stream and then hashing them (which is what you suggest), but then I am not able to join my MD5 back to the original stream. Do I understand correctly that the Merge transformation in your screenshot joins on the MD5?
This is not quite what I am looking for.
My use case is that I want to deduplicate consecutive rows from my source. My source also has a business key (which I didn't mention in my original question; let's say it is always EMPLOYEE_ID). When the rows are sorted by UPD_TS, I want to get rid of rows that repeat the attribute values of the previous row (although they may have a different UPD_TS).
So ideally, I would add a Derived Column to my stream like this one:
md5(byNames(filter(columnNames(), !in(['employee_id','upd_ts'],#item)) ))
and then I could detect repeating MD5s by comparing each row with the previous one, using lag in a Window transformation.
Unfortunately, the expression above returns an error 😞
Please note that certain MD5s can recur, because underlying attributes may change values in this fashion:
original value --> original value --> another value --> original value
In this case, I want to capture the first, third and fourth values.
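
To make the intended behavior concrete, here is a minimal Python sketch of the logic (outside of Data Factory, so it is not a data flow expression). The DEPT column and the sample rows are made up for illustration, and the helper names row_hash and drop_consecutive_repeats are hypothetical. It hashes every column except the business key and the timestamp, then keeps a row only when its hash differs from the previous row of the same EMPLOYEE_ID:

```python
import hashlib

def row_hash(row, exclude=("EMPLOYEE_ID", "UPD_TS")):
    """MD5 over every column except the excluded ones (names sorted for a stable order)."""
    parts = [f"{name}={row[name]}" for name in sorted(row) if name not in exclude]
    return hashlib.md5("|".join(parts).encode("utf-8")).hexdigest()

def drop_consecutive_repeats(rows):
    """Keep a row only if its hash differs from the previous row of the same EMPLOYEE_ID."""
    kept = []
    last_hash = {}  # EMPLOYEE_ID -> hash of the most recent row seen for that key
    for row in sorted(rows, key=lambda r: (r["EMPLOYEE_ID"], r["UPD_TS"])):
        h = row_hash(row)
        if last_hash.get(row["EMPLOYEE_ID"]) != h:
            kept.append(row)
        last_hash[row["EMPLOYEE_ID"]] = h
    return kept

# Made-up sample: original value, original value, another value, original value.
rows = [
    {"EMPLOYEE_ID": 1, "UPD_TS": 1, "DEPT": "A"},
    {"EMPLOYEE_ID": 1, "UPD_TS": 2, "DEPT": "A"},
    {"EMPLOYEE_ID": 1, "UPD_TS": 3, "DEPT": "B"},
    {"EMPLOYEE_ID": 1, "UPD_TS": 4, "DEPT": "A"},
]
result = drop_consecutive_repeats(rows)
print([r["UPD_TS"] for r in result])  # [1, 3, 4]
```

The comparison is only against the immediately preceding row, which is why the fourth row survives even though its hash equals that of the first.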
This means I can't do a simple join of my two streams by MD5, because every row would match all rows that share its MD5 and I would get a semi-cartesian product.
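
A small Python sketch of that fan-out, again only as an illustration outside of Data Factory. The hash strings h_orig and h_other are placeholders rather than real MD5 values, and the four rows follow the original/original/another/original pattern:

```python
from collections import defaultdict

# (UPD_TS, md5) pairs for one employee; the hash strings are placeholders.
original = [(1, "h_orig"), (2, "h_orig"), (3, "h_other"), (4, "h_orig")]
hashed = list(original)  # the second stream carries the same md5 values

# Index the second stream by md5, as an equi-join on md5 would.
by_hash = defaultdict(list)
for ts, h in hashed:
    by_hash[h].append(ts)

# Every row pairs with every row that shares its md5.
joined = [(ts, other_ts) for ts, h in original for other_ts in by_hash[h]]
print(len(joined))  # 10 rows out of 4 inputs: 3 * 3 for h_orig plus 1 for h_other
```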
It is a really cool feature that you can do rule-based mappings in the Select transformation and easily get rid of columns you don't want,
e.g. !in(['upd_ts'],name)
and it would be great if I could use the same logic when defining which columns should be MD5ed in my Derived Column (while keeping the other columns in the stream).