SOLVED

Removing duplicates in PowerQuery

Copper Contributor

Hi guys! 

Before I start with my question, please, take a look the pictures that I attached.

My source is folder X which contains 3 files (but also can contain x files and problem is still there). What I do: I import data in Power Query, delete column with file name, sort them by date from newest to oldest. Then I want to know what is last status of item that is bar coded. When I remove duplicates (bar Code column) I only get status from the 1st file, but I want to get status with newest date whatever is position of file in folder.

What is the trick?

 

 

5 Replies
best response confirmed by Zarko_Tripunovic (Copper Contributor)
Solution

@Zarko_Tripunovic 

Remove Duplicates keeps first of duplicated rows in order of how you sorted records before. However, to ensure it works correctly before removing duplicates you need to fix the table in memory. You may wrap it by Table.Buffer, or easier to add Index column->select column(s) for which remove duplicates->remove them->remove Index column.

@Sergei Baklan Thank You again for Your help. I used Table.Buffer and it works perfectly now. After You advised me to use this, I wasn't sure how to use it so I found it on https://exceleratorbi.com.au/remove-duplicates-keep-last-record-power-query/ 

@Zarko_Tripunovic 

Sorry didn't explain Table.Buffer() in more details. But again, simplest way is to Add index->Sort->Remove index

Anyway, glad to know you sorted the issue out.

@Sergei Baklan 

Thank you for the tip! I used buffer previously, but with lots of flat files and millions of rows refresh of a dataflow became rather slow. Below my initial timings with Table.Buffer / Sort / RemoveDuplicates.

Bytes processed (KB)Max commit (KB)Processor Time
273836213440030:47.9

As the production data set will be four times bigger and growing I got upset. But thanks to your advice same results were achieved faster and with less effort on the capacity side:

Bytes processed (KB)Max commit (KB)Processor Time
211547148842408:57.5

@truemh 

Unfortunately not, optimising of performance in significant part is the art, not only technology, it individual for each concrete case. It's always better to follow Best practices when working with Power Query | Microsoft Docs even they are not directly affect the performance. As a minimum check if query folding works, you have modular structure of the queries and you minimised repeated refresh of them. 

1 best response

Accepted Solutions
best response confirmed by Zarko_Tripunovic (Copper Contributor)
Solution

@Zarko_Tripunovic 

Remove Duplicates keeps first of duplicated rows in order of how you sorted records before. However, to ensure it works correctly before removing duplicates you need to fix the table in memory. You may wrap it by Table.Buffer, or easier to add Index column->select column(s) for which remove duplicates->remove them->remove Index column.

View solution in original post