ADF Data Flows have a low-code, graph-based UI for designing and developing data transformations at scale, but the script behind those graphs is very powerful. This article introduces you to the Data Flow Scr...
Having the data flow script (DFS) as a single line makes version control difficult.
For example, if I change 3 of the 400 logical lines in a DFS, GitHub identifies this as a single change, because all 400 lines are stored on one physical line in the dataflow JSON file.
Sometimes GitHub also reports that the DFS changed even though it is functionally identical: the formatting of the DFS (or the order of the parameters within it) can occasionally change, even when none of the actual values did.
I've tried writing a Python script to automatically parse the DFS of any dataflow into JSON format. The goal is to walk any number of dataflow files and obtain proper "diffs" between them and a corresponding set of earlier/different dataflow files. This has proved easier said than done, mainly because the DFS appears to be written in a non-standard data format that I cannot identify or parse easily. My script is currently a clunky, error-prone mess of regexes, loops, and conditions. Is there a better way?
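For context, here is a minimal sketch of the line-splitting approach I started from, before attempting full parsing. It assumes the script sits at `properties.typeProperties.script` in the exported dataflow JSON (newer exports may use a `scriptLines` array instead); the sample document below is hypothetical and only illustrates the shape of the file:

```python
import json


def extract_script_lines(dataflow_json: str) -> list[str]:
    """Split the single-line DFS embedded in a dataflow JSON export
    into individual lines, so that diffs become line-granular.

    Assumes the script lives at properties.typeProperties.script;
    some exports instead carry a scriptLines array, handled below.
    """
    doc = json.loads(dataflow_json)
    type_props = doc["properties"]["typeProperties"]
    if "scriptLines" in type_props:            # array-based export format
        return type_props["scriptLines"]
    return type_props["script"].split("\n")    # single-string export format


# Hypothetical, minimal dataflow export for illustration only.
sample = json.dumps({
    "name": "MyDataFlow",
    "properties": {
        "type": "MappingDataFlow",
        "typeProperties": {
            "script": "source(output(id as integer)) ~> src\nsrc sink() ~> dst"
        }
    }
})

for line in extract_script_lines(sample):
    print(line)
```

Writing these lines out to a sidecar text file per dataflow gives Git something line-oriented to diff, but it does not solve the reordering/reformatting problem described above, which is why a real parser (or canonicalizer) still seems necessary.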