Removing duplicates in PowerQuery

Zarko_Tripunovic · ‎Jul 01 2020

Hi guys!

Before I start with my question, please take a look at the pictures that I attached.

My source is folder X, which contains 3 files (it could contain any number of files and the problem would still be there). What I do: I import the data into Power Query, delete the column with the file name, and sort the rows by date from newest to oldest. Then I want to know the last status of each bar-coded item. When I remove duplicates (on the Bar Code column), I only get the status from the 1st file, but I want the status with the newest date, whatever the position of the file in the folder.

What is the trick?

Sergei Baklan · Jul 01 2020

@Zarko_Tripunovic

Remove Duplicates keeps the first of the duplicated rows, in the order in which you sorted the records beforehand. However, to ensure it works correctly, you need to fix the table in memory before removing duplicates. You can wrap the sorted table in Table.Buffer or, more simply: add an Index column -> select the column(s) to deduplicate on -> remove duplicates -> remove the Index column.
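A minimal sketch of the Table.Buffer variant, for illustration only. The sample table and the column names (BarCode, Status, Date) are assumptions standing in for the combined folder data, not the original poster's actual query.

```powerquery
let
    // Hypothetical sample rows standing in for the combined files
    Source = #table(
        type table [BarCode = text, Status = text, Date = date],
        {
            {"A1", "Received", #date(2020, 6, 1)},
            {"B2", "Received", #date(2020, 6, 5)},
            {"B2", "Shipped",  #date(2020, 6, 15)},
            {"A1", "Shipped",  #date(2020, 6, 20)}
        }
    ),
    // Newest rows first
    Sorted = Table.Sort(Source, {{"Date", Order.Descending}}),
    // Fix the sorted table in memory so the sort order is the one Remove Duplicates sees
    Buffered = Table.Buffer(Sorted),
    // Keep the first row per bar code, which is now the newest one
    Latest = Table.Distinct(Buffered, {"BarCode"})
in
    Latest
```

With this sample data, the intent is that each bar code keeps its row with the most recent date. Note that Table.Buffer loads the whole table into memory, which can matter for large sources.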

Zarko_Tripunovic · ‎Jul 02 2020

@Sergei Baklan Thank you again for your help. I used Table.Buffer and it works perfectly now. After you advised me to use it, I wasn't sure how, so I looked it up at https://exceleratorbi.com.au/remove-duplicates-keep-last-record-power-query/

Sergei Baklan · ‎Jul 02 2020

@Zarko_Tripunovic

Sorry, I didn't explain Table.Buffer() in more detail. But again, the simplest way is Add index -> Sort -> Remove index.
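A rough sketch of the index variant, for illustration only. It follows the step order from the earlier reply (sort, add an Index column, remove duplicates, remove the Index column); the sample table and the column names (BarCode, Status, Date, Index) are assumptions, not taken from the original workbook.

```powerquery
let
    // Hypothetical sample rows standing in for the combined files
    Source = #table(
        type table [BarCode = text, Status = text, Date = date],
        {
            {"A1", "Received", #date(2020, 6, 1)},
            {"B2", "Received", #date(2020, 6, 5)},
            {"B2", "Shipped",  #date(2020, 6, 15)},
            {"A1", "Shipped",  #date(2020, 6, 20)}
        }
    ),
    // Newest rows first
    Sorted = Table.Sort(Source, {{"Date", Order.Descending}}),
    // Temporary index column added on top of the sorted rows
    Indexed = Table.AddIndexColumn(Sorted, "Index", 0, 1),
    // Keep the first row per bar code
    Deduped = Table.Distinct(Indexed, {"BarCode"}),
    // Drop the helper column again
    Result = Table.RemoveColumns(Deduped, {"Index"})
in
    Result
```

In the user interface these steps correspond to Add Column > Index Column, Remove Duplicates on the bar code column, and then removing the Index column.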

Anyway, glad to know you sorted the issue out.

truemh · ‎Apr 02 2021

@Sergei Baklan

Thank you for the tip! I used Table.Buffer previously, but with lots of flat files and millions of rows, the refresh of a dataflow became rather slow. Below are my initial timings with Table.Buffer / Sort / RemoveDuplicates.

Bytes processed (KB)	Max commit (KB)	Processor Time
273836	2134400	30:47.9

As the production data set will be four times bigger and growing, I got upset. But thanks to your advice, the same results were achieved faster and with less load on the capacity side:

Bytes processed (KB)	Max commit (KB)	Processor Time
211547	1488424	08:57.5

Sergei Baklan · ‎Apr 02 2021

@truemh

Unfortunately not. Performance optimisation is in large part an art, not only technology, and it is individual to each concrete case. It's always better to follow "Best practices when working with Power Query" (Microsoft Docs), even if they do not directly affect performance. At a minimum, check that query folding works, that your queries have a modular structure, and that you have minimised repeated refreshes of them.
