Forum Discussion
Power Query - removing identical generic text from the start and end of multiple columns
- Lorenzo · Nov 08, 2024 · Silver Contributor
Hi Insert_Key
Thanks for sharing all of that, and TBH I wouldn't have bet on GenAI regarding data types. What it came up with is very similar to what I've seen so far and what I referred to in my previous reply. Though, AI got it wrong when it comes to numbers vs. whole numbers... In the workbook you shared - just for testing purposes - enter a decimal number in column [Parent id] then refresh the query... 😒
In attached v1.2 I fixed it as follows:
DetectDataType = (column as list) as type =>
    if List.IsEmpty(column) then type nullable text // Default to text if empty
    else
        let
            sampleValue = List.First(List.RemoveNulls(column)), // Get the first non-null value
            valueType = Value.Type(sampleValue)
        in
            if valueType = type text then type nullable text
            else if valueType = type number and (sampleValue = Number.Round(sampleValue)) then type nullable Int64.Type // Check for whole numbers
            else if valueType = type number then type nullable number // Decimal numbers
            else if valueType = type logical then type nullable logical
            else if valueType = type date or valueType = type datetime then type nullable datetime // Handle Date/Time
            else type nullable text, // Fallback to text

IMHO the last else should be type nullable any rather than text, but that doesn't matter in your context as any will end up as text once the query loads to the Data Model
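As an aside, the whole-number branch can be sanity-checked on its own in a blank query - a minimal sketch with made-up values (Number.Round uses banker's rounding by default, which doesn't affect the equality test here):

let
    // Type picker mirroring the whole-number check above
    TypeFor = (v as number) as type =>
        if v = Number.Round(v)
        then type nullable Int64.Type // whole number, e.g. 42
        else type nullable number,    // decimal, e.g. 42.7
    WholeType = TypeFor( 42 ),
    DecimalType = TypeFor( 42.7 )
in
    { WholeType, DecimalType }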
A comment, just in case... I see you added a Sort step at the end of your query. I guess that's for checking the query output within Excel. Otherwise, as far as I know, the Data Model doesn't care whether the data is sorted
Last thing. Now that I understand your solution has no external connection, you can slightly speed up your query refresh by disabling the PQ Firewall: Excel Data tab > Get Data > Query Options... > CURRENT WORKBOOK > Privacy > choose Ignore the Privacy Levels...
Hope this helps and makes sense
In attached v2 I did the following:
- Created a mapping table (MapCustomFieldNumber) in the PQ Editor. Probably you'll want to have it as a Table in your workbook for easier expansion/maintenance
- Updated the NewColumnNames step to merge the list of ColumnsToRename (after converting it to a table) with the mapping table. Then we expand the [Replacer] field from the mapping table and recombine the appropriate strings - a rough sketch of the idea is below. Any questions about this, let me know
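A minimal sketch of that merge logic, assuming a two-column mapping table (the lookup shown here is illustrative - the attached v2 does it with an actual Merge/Expand in the PQ Editor, so the step names other than MapCustomFieldNumber, ColumnsToRename and NewColumnNames are mine):

// Illustrative mapping table; in v2 this is the MapCustomFieldNumber query
MapCustomFieldNumber = #table(
    type table [Number = text, Replacer = text],
    { {"1st", "BS"}, {"2nd", "Secondary"} }
),

// Turn the list of columns to rename into a single-column table
ToRenameTable = Table.FromList( ColumnsToRename, Splitter.SplitByNothing(), {"OldName"} ),

// For each old name, find the matching Number and swap in its Replacer
WithNewNames = Table.AddColumn( ToRenameTable, "NewName", (r) =>
    let
        match = Table.SelectRows( MapCustomFieldNumber, each Text.Contains( r[OldName], [Number] ) )
    in
        if Table.IsEmpty( match ) then r[OldName]
        else Text.Replace( r[OldName], match{0}[Number], match{0}[Replacer] )
),

NewColumnNames = WithNewNames[NewName]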
- Insert_Key · Brass Contributor

Hi Lorenzo
Thank you for all of your hard work to find a solution for me.
I must confess that I've become a bit confused by the many versions and their different approaches. I inserted your V1 with the corrected source string (immediately previous comment) and it worked... but then I saw that there was another post beneath it, so I grabbed that code and updated my workbook.
They both seemed to work fine, so I'm unsure which - if either - is superior. When I "Refreshed All" within V1, it took a little over three minutes to complete updating all queries, but I'd made the mistake of unnecessarily loading the JiraExport to the Data Model, and the Results as a table to the workbook. When I wrote V2 over the top, I corrected this - JiraExport is Connection Only and Results are loaded to the Data Model so that I can write some measures and create some Power Pivot Tables and Charts.
With the queries loaded as described, it takes two minutes to refresh all from a data source of 3202 rows. I did modify JiraExport in its own query (i.e. so the source is updated before it's handled by your solution) by removing the Issue id and Parent id columns, and added a couple of steps to QueryResult to address blanks (converting them to nulls) and apply data type auto-detection:
RenamedMergedColumns = Table.RenameColumns(
MergedColumns,
List.Zip({ColumnsToRename, NewColumnNames})
),
ReplacedEmptyWithNull = Table.ReplaceValue( RenamedMergedColumns, "", null, Replacer.ReplaceValue,
    {"Issue Type", "Priority", "Issue key", "Status", "Resolution", "Created", "Updated", "Resolved",
     "Summary", "Change type", "Impacted Business System (BS)", "Environment (BS)",
     "Downstream / Upstream Impacted Business System/s (Secondary)", "Impacted Agency / Agencies (BS)",
     "Environment.", "Number of Users", "Expected Outage", "Change Category", "Labels"} ),
ChangedType = Table.TransformColumnTypes( ReplacedEmptyWithNull, {
    {"Issue Type", type text}, {"Priority", type text}, {"Issue key", type text}, {"Status", type text},
    {"Resolution", type text}, {"Created", type datetime}, {"Updated", type datetime}, {"Resolved", type datetime},
    {"Summary", type text}, {"Change type", type text}, {"Impacted Business System (BS)", type text},
    {"Environment (BS)", type text}, {"Downstream / Upstream Impacted Business System/s (Secondary)", type text},
    {"Impacted Agency / Agencies (BS)", type text}, {"Environment.", type text}, {"Number of Users", type text},
    {"Expected Outage", type text}, {"Change Category", type text}, {"Labels", type text} } ),
Result = if Table.IsEmpty(FilteredCount) then Source else ChangedType
in
Result

There is no maintenance required for me with this mapping table approach, is there... is it dynamic in response to the count of how many headers are being merged, or might I need to manually intervene at some point? I don't really understand how it functions.
EDIT: Actually, I've just had a look at it, and I'm not sure that it serves a function for my requirement? The "(BS)", I believe, is an annotation to differentiate those fields from others that are utilised by other projects/teams. Not a great approach, but it's not my Jira project and not my problem 😁 Even the "." after "Environment" has been placed there intentionally for the same reason.
"Downstream / Upstream Impacted Business System/s (Secondary)" is a Jira field name should always read exactly that and never be transformed/translate to "Downstream / Upstream Impacted Business System/s (Third)" or anything else. I hope that clarifies 🙂
The transformed data that has come out the other end looks great; I would be very pleased to select a final version and get cracking with my analysis 🤓
Is the latest the greatest, as the saying goes? Which version would you recommend I adopt and stick with?
Thanks again, so much!
For reference, I am in Darwin, NT, Australia. I just wanted to make you aware, as I believe that there is a difference in our time zones and I'm normally not active when you are. I am meant to knock off work in about 30 minutes but do tend to stay back a bit on most nights; fighting the urge to do so tonight, as I feel like a lovely colleague has shared their cold with me 🙄😩
- Lorenzo · Nov 08, 2024 · Silver Contributor
You're very welcome. It's been a real pleasure, as it's not common to have such an exchange on forums, and since more or less challenging PQ cases are not that frequent, this has been an opportunity to do a bit of M coding
Re. Date vs Datetime: if your sampleValue is of type any - and in your scenario it is - I don't see another option than checking whether the Time part of the DateTime value = 0. Implemented in the attached file, where I used your [Parent id] column with a DateTime value:
else if valueType = type datetime and ( Number.From( Time.From( sampleValue ) ) = 0 ) then type date // Handle Date
else if valueType = type datetime then type nullable datetime // Handle Date/Time
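To see what that check does with concrete (made-up) values:

let
    Midnight  = Number.From( Time.From( #datetime(2024, 11, 8, 0, 0, 0) ) ),  // 0 -> no time part, treat as date
    Afternoon = Number.From( Time.From( #datetime(2024, 11, 8, 14, 30, 0) ) ) // ~0.604 -> real time part, keep datetime
in
    { Midnight, Afternoon }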
All the best... Sincerely

- Insert_Key · Nov 08, 2024 · Brass Contributor
Thank you, again, so much Lorenzo !
Yes, Copilot couldn't solve the whole number scenario - I kept on pointing out the error of its ways but gave up after, I think, the third attempt because as I mentioned above "...I did modify JiraExport in its own query (i.e. so the source is updated before it's handled by your solution) by removing the Issue id and Parent id columns...". Outcome: the affected columns won't actually make it to the Data Model.
That said, your solution for it will no doubt be useful for other exports/reports and I will definitely use it 😊 I wish that this platform used post numbers, or time & date so that I could better reference quotes/comments!
I think I will also need to tweak the "else if valueType = type date or valueType = type datetime then type nullable datetime // Handle Date/Time" line, because I assume it will assign any date as datetime, which is *not* what I wanted 😆 but easily enough changed. I'll perhaps expand the entire valueType logic to specify all Types that I anticipate; I was just in too much of a rush today, doing quite a bit in Excel's front-end for other bits of work.
I'll have a look to see if I can actually adjust the Privacy Levels in Excel/Power Query. Being that I am employed within a Government setting, everything is pretty locked down, but I'll definitely investigate and very much appreciate the advice/insight. I've been waiting on a replacement work device for three months now! Apparently a hardware shortage, but it is allegedly in my city now. Hopefully worth the wait!
By the end of the day, I had taken my standalone development file and repacked its contents into my "master" file that I had used for all other Q3 reporting. As it now contains the Change Management data, I can start creating statistics around that next week. I then need to develop something similar for Service Notifications, consolidate that back into the master, re-run the stats, and re-author an expanded, enhanced Q3 report.

Again, thank you so much for your help, curiosity and desire to find an effective, sustainable, performant solution for me. It's genuinely appreciated! 👏👏👏
- Insert_Key · Nov 08, 2024 · Brass Contributor
Thanks Lorenzo 😊
I guess the most helpful thing for me to do is describe a common structure for my workbooks. I aim to build templates, so that I am not always reinventing the wheel each time that statistics are required.
- First tab is a quick ReadMe/UserGuide for customers
- Second tab is the raw data, formatted as an Excel table
- The contents of the next tab vary depending on the audience and purpose of the file. If I load the transformed data for customers to play with directly, this is where it resides.
- The next tab would usually contain dynamic, interactive statistics and charts - a dashboard if you will
- There may be a final hidden sheet, with helper tables etc, if I've included/required them
For my work, I have found that using a table to store my raw data has been more flexible or maybe more robust than using a range - and definitely more robust than linking to an external data source, as the files get sent everywhere. I've learned through experience that it's best for me to contain the raw data with the transformations and the output :)
The template contains a table on the raw data tab with the header row and an empty first data row. The complete range of raw data is copied from the CSV and pasted as Values Only into the table - meaning that formatting is retained and the table expands to fit. I hope that this answers your question 🙂
Trying to keep my skills contemporary, I hopped into some GenAI (tried Copilot, then Gemini Advanced, and then back to Copilot) for assistance with those last two things that I wanted to solve (converting blanks to null and checking data Type) and am satisfied with the outcome. The amended QueryResult code is below, and I've attached the file in case you are curious about what AI chose to do:
let
    Source = JiraExport,
    SourceHeaders = Table.FromColumns( { Table.ColumnNames( Source ) }, {"JiraHeader"} ),
    SelectedAssumedHeaders = Table.SelectRows( SourceHeaders, each
        Text.StartsWith( [JiraHeader], Labels_StartsWith )
        or Text.StartsWith( [JiraHeader], CustomFields_StartsWith ) ),
    HeaderStartNames = Table.AddColumn( SelectedAssumedHeaders, "JiraHeaderStartName", each
        if Text.StartsWith( [JiraHeader], Labels_StartsWith )
        then Labels_StartsWith
        else Text.BeforeDelimiter( [JiraHeader], ")", {0, RelativePosition.FromEnd} ) ),
    GroupedStartNames = Table.Group( HeaderStartNames, {"JiraHeaderStartName"},
        {
            {"DATA", each _, type table},
            {"Count", each Table.RowCount(_), Int64.Type}
        } ),
    FilteredCount = Table.SelectRows( GroupedStartNames, each ( [Count] > 1 ) ),
    List_HeadersToMerge = Table.ExpandTableColumn( Table.SelectColumns( FilteredCount, {"DATA"} ), "DATA", {"JiraHeader"}, {"JiraHeaders"} )[JiraHeaders],
    List_HeadersStartsWith = FilteredCount[JiraHeaderStartName],
    MergedColumns = List.Accumulate( List_HeadersStartsWith, Source, (accum, name) as table =>
        Table.CombineColumns( accum,
            List.Select( List_HeadersToMerge, each Text.StartsWith( _, name ) ),
            each Combiner.CombineTextByDelimiter(" ; ", QuoteStyle.None)( List.RemoveNulls( _ ) ),
            name ) ),
    ColumnsToRename = List.Select( Table.ColumnNames( MergedColumns ), each Text.StartsWith( _, CustomFields_StartsWith ) ),
    NewColumnNames = List.Transform( ColumnsToRename, each
        let
            BetweenDelimiters = Text.BetweenDelimiters( _, CustomFields_StartsWith, ")" )
        in
            if Text.Contains( BetweenDelimiters, "(" ) then BetweenDelimiters & ")" else BetweenDelimiters ),
    RenamedMergedColumns = Table.RenameColumns( MergedColumns, List.Zip( { ColumnsToRename, NewColumnNames } ) ),
    Result = if Table.IsEmpty( FilteredCount ) then Source else RenamedMergedColumns,
    EmptyToNull = Table.TransformColumns( Result,
        List.Transform( Table.ColumnNames( Result ), (colName) => { colName, each if _ = "" then null else _ } ) ),
    // Function to detect data type
    DetectDataType = (column as list) as type =>
        if List.IsEmpty(column) then type nullable text // Default to text if empty
        else
            let
                sampleValue = List.First(List.RemoveNulls(column)), // Get the first non-null value
                valueType = Value.Type(sampleValue)
            in
                if valueType = type text then type nullable text
                else if valueType = type number and (sampleValue = Number.Round(sampleValue)) then type nullable number // Check for whole numbers
                else if valueType = type logical then type nullable logical
                else if valueType = type date or valueType = type datetime then type nullable datetime // Handle Date/Time
                else type nullable text, // Fallback to text
    // Apply the function to each column
    ColumnTypes = List.Transform( Table.ColumnNames( EmptyToNull ), (colName) =>
        let
            columnValues = Table.Column( EmptyToNull, colName ) // Get the entire column values
        in
            { colName, DetectDataType( columnValues ) } ), // Pass the column values to the function
    // Transform the table with detected types
    DetectDataTypes = Table.TransformColumnTypes( EmptyToNull, ColumnTypes ),
    #"Sorted Rows" = Table.Sort( DetectDataTypes, {{"Issue key", Order.Ascending}} )
in
    #"Sorted Rows"

- Lorenzo · Nov 07, 2024 · Silver Contributor
Hi Insert_Key
"With other exports, there most definitely will be numerous values that need to have specific types (for calculations etc.) and I would like them to be set within Power Query. Is there a nice way to do that with code?"
There's currently no built-in way to dynamically detect and change Types 🙁 but there are creative people worldwide, and one of them wrote a Power Query function that does this decently. However, his function assumes that the initial data Type is any. And given you introduced a step to replace "" with nulls, I have a key question for you
I quickly searched the Net and found that Jira exports as Excel/CSV. So, the key question is: when you receive the Jira export, do you query the CSV file directly, or do you open the CSV in Excel and format the data as a Table (or something else)?
What I need to understand is if the initial* data Types are any or text
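If you want to check that on your side, a quick sketch (paste into a blank query; JiraExport being your staging query):

let
    Source = JiraExport, // your staging query
    // Table.Schema lists, per column, the type PQ currently assigns
    Schema = Table.SelectColumns( Table.Schema( Source ), {"Name", "Kind"} )
in
    Schema // Kind will read "any", "text", etc.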
* by initial I mean before we start looking at columns to merge or not in the query we've been discussing

- Insert_Key · Nov 07, 2024 · Brass Contributor
I appreciate everything you have done SO much!
The value "1st" was added purely so that we (and by we, I mean: you! 😅) could experiment/attempt a solution that would handle numbers in the headers. I tried to explain that with the text that is above the "Actions" table - "no translation/conversion of the numbers is required - just a straight up merge, as if they were text descriptions".I am currently at home after nearly a twelve-hour day (this is one of many current projects) and about to head out for some fresh air and exercise. I don't have the workbook with me but really appreciate your point about explicitly naming columns - that was not my intention at all! I actually completed the steps in the GUI (selected all columns) and then shuffled the code around in the Advanced Editor for consistency. Probably not a smart way to do it, and possibly the 'blanks to nulls' has been coded the same way - again, I just selected all columns in the user interface and applied the transformation via the menu.
With other exports, there most definitely will be numerous values that need to have specific types (for calculations etc.) and I would like them to be set within Power Query. Is there a nice way to do that with code? This is something that I might well be able to figure out by myself tomorrow and report back - you've been an amazing help to get me this far and I appreciate you pointing this out as it may well have become problematic down the line.
It's 6:00PM here now, so in about 14 hours, I'll be back at my desk, looking at this again and hopefully identifying metrics, writing measures and building some visualisations. Once more, thank you so much! 🌟
- Lorenzo · Nov 07, 2024 · Silver Contributor
First of all, apologies if I confused you. I've re-attached the 2 versions
v1 basically does this: the query does not translate/map - e.g. "2nd" is not replaced with "Secondary"
v2 is the same as v1 + it maps "2nd" as "Secondary" - but not only that: as you can see above, "1st" has been replaced with "BS" for "Custom field (Impacted Business System (1st))". In the data set used, this Custom field only required a renaming (no merge). Reading a comment in the UpdatedSampleData you shared, I understood (and I could have misunderstood) that when there's only 1 Custom field, i.e. "Custom field wxyz (1st)", "(1st)" should always be replaced with "(BS)"
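To make the difference concrete, a minimal sketch with a made-up header (CustomFields_StartsWith is assumed to hold "Custom field (", and v2's mapping is simplified to a single Text.Replace here):

let
    CustomFields_StartsWith = "Custom field (", // assumed parameter value
    Header = "Custom field (Impacted Business System (2nd))", // made-up header, for illustration
    Between = Text.BetweenDelimiters( Header, CustomFields_StartsWith, ")" ),
    // v1: keep the ordinal as-is, just restore the trailing ")"
    v1Name = if Text.Contains( Between, "(" ) then Between & ")" else Between,
    // v2: additionally map the ordinal
    v2Name = Text.Replace( v1Name, "2nd", "Secondary" )
in
    [ v1 = v1Name, v2 = v2Name ]
    // v1 = "Impacted Business System (2nd)"
    // v2 = "Impacted Business System (Secondary)"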
Hope this clarifies
Regarding "Which version would you recommend I adopt and stick with?"
v2 adds a level of complexity to the query and necessarily has a little impact on refresh time. Add to this the fact that I might have misunderstood the point re. "(1st)" => "(BS)": if it was me, I would go with v1 + the other steps you added.

However, I have a point regarding your Change Type step. Now that I know you load to the Data Model, sure, columns must be Typed, otherwise they'll be treated as Text in the Data Model. But your Change Type step hard-codes a number of column names, and if one of them doesn't exist following your next Jira import, your query will fail on that step
Reading your current Change Type step & referring to the workbook you shared, you only have 3 columns that require a Type for the Data Model: [Created], [Updated] and [Resolved]. So, to reduce the risk of query failure, IMHO your Change Type should only be:

ChangedType = Table.TransformColumnTypes( ReplacedEmptyWithNull,
    {
        {"Created", type datetime},
        {"Updated", type datetime},
        {"Resolved", type datetime}
    }
)
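And if you wanted to be extra defensive (a sketch on my side, not in the attached file), you could only type the columns that actually exist after the import:

ChangedType = Table.TransformColumnTypes( ReplacedEmptyWithNull,
    List.Select(
        { {"Created", type datetime}, {"Updated", type datetime}, {"Resolved", type datetime} },
        each List.Contains( Table.ColumnNames( ReplacedEmptyWithNull ), _{0} )
    )
)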
Hope this makes sense

(In France for a few years)