Forum Discussion
Removing duplicates in PowerQuery
- Jul 01, 2020
Remove Duplicates keeps first of duplicated rows in order of how you sorted records before. However, to ensure it works correctly before removing duplicates you need to fix the table in memory. You may wrap it by Table.Buffer, or easier to add Index column->select column(s) for which remove duplicates->remove them->remove Index column.
SergeiBaklan Thank You again for Your help. I used Table.Buffer and it works perfectly now. After You advised me to use this, I wasn't sure how to use it so I found it on https://exceleratorbi.com.au/remove-duplicates-keep-last-record-power-query/
Sorry didn't explain Table.Buffer() in more details. But again, simplest way is to Add index->Sort->Remove index
Anyway, glad to know you sorted the issue out.
- truemhApr 02, 2021Copper Contributor
Thank you for the tip! I used buffer previously, but with lots of flat files and millions of rows refresh of a dataflow became rather slow. Below my initial timings with Table.Buffer / Sort / RemoveDuplicates.
Bytes processed (KB) Max commit (KB) Processor Time 273836 2134400 30:47.9 As the production data set will be four times bigger and growing I got upset. But thanks to your advice same results were achieved faster and with less effort on the capacity side:
Bytes processed (KB) Max commit (KB) Processor Time 211547 1488424 08:57.5 - SergeiBaklanApr 02, 2021Diamond Contributor
Unfortunately not, optimising of performance in significant part is the art, not only technology, it individual for each concrete case. It's always better to follow Best practices when working with Power Query | Microsoft Docs even they are not directly affect the performance. As a minimum check if query folding works, you have modular structure of the queries and you minimised repeated refresh of them.
- camsawayMay 28, 2024Copper Contributor
SergeiBaklan any chance you can expand on the method of Add Index, Sort, Remove Duplicate, Remove Index? I don't seem to be getting the expected results.
I have 3 fields that are relevant here:
f_fi_id which is a sequence number for data updates
sdt_problem_id which is the master reference
sdt_task_index which is the subtask
Each DB update adds a new record as follows:
* Unique f_fi_id for the update run - this increments with each update run
* Record for each Problem_id/task_index combination
Only changed records are appended in each update run
So I need to grab the last update for each Problem_id/task_index combination (i.e. the one with the highest f_fi_id value).
The following doesn't work:
#"Added Index" = Table.AddIndexColumn(#"Filtered Rows", "Index", 0, 1, Int64.Type), #"Removed Duplicates" = Table.Distinct(#"Added Index", {"sdt_problem_id", "sdt_task_index"}), #"Sorted Rows" = Table.Sort(#"Removed Duplicates",{{"f_fi_id", Order.Descending}, {"sdt_problem_id", Order.Descending}, {"sdt_task_index", Order.Descending}}), #"Removed Columns1" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),