Forum Discussion

Copper Contributor

Jul 01, 2020

Solved

Removing duplicates in PowerQuery

Hi guys! Before I start with my question, please, take a look the pictures that I attached. My source is folder X which contains 3 files (but also can contain x files and problem is still there). ...

excel

SergeiBaklan
Jul 01, 2020
Zarko_Tripunovic

Remove Duplicates keeps first of duplicated rows in order of how you sorted records before. However, to ensure it works correctly before removing duplicates you need to fix the table in memory. You may wrap it by Table.Buffer, or easier to add Index column->select column(s) for which remove duplicates->remove them->remove Index column.

SergeiBaklan

Diamond Contributor

Apr 02, 2021

truemh

Unfortunately not, optimising of performance in significant part is the art, not only technology, it individual for each concrete case. It's always better to follow Best practices when working with Power Query | Microsoft Docs even they are not directly affect the performance. As a minimum check if query folding works, you have modular structure of the queries and you minimised repeated refresh of them.

camsaway

Copper Contributor

May 28, 2024

SergeiBaklan any chance you can expand on the method of Add Index, Sort, Remove Duplicate, Remove Index? I don't seem to be getting the expected results.

I have 3 fields that are relevant here:

f_fi_id which is a sequence number for data updates

sdt_problem_id which is the master reference

sdt_task_index which is the subtask

Each DB update adds a new record as follows:

* Unique f_fi_id for the update run - this increments with each update run

* Record for each Problem_id/task_index combination

Only changed records are appended in each update run

So I need to grab the last update for each Problem_id/task_index combination (i.e. the one with the highest f_fi_id value).

The following doesn't work:

    #"Added Index" = Table.AddIndexColumn(#"Filtered Rows", "Index", 0, 1, Int64.Type),
    #"Removed Duplicates" = Table.Distinct(#"Added Index", {"sdt_problem_id", "sdt_task_index"}),
    #"Sorted Rows" = Table.Sort(#"Removed Duplicates",{{"f_fi_id", Order.Descending}, {"sdt_problem_id", Order.Descending}, {"sdt_task_index", Order.Descending}}),
    #"Removed Columns1" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),

SergeiBaklan
Diamond Contributor
May 28, 2024
camsaway

Better

Add Index -> Sort -> Add another index -> Remove duplicates -> delete indexes