Removing duplicates in PowerQuery

truemh
Copper Contributor
Apr 02, 2021
SergeiBaklan
Thank you for the tip! I used buffer previously, but with lots of flat files and millions of rows refresh of a dataflow became rather slow. Below my initial timings with Table.Buffer / Sort / RemoveDuplicates.
Bytes processed (KB) Max commit (KB) Processor Time
273836 2134400 30:47.9
As the production data set will be four times bigger and growing I got upset. But thanks to your advice same results were achieved faster and with less effort on the capacity side:
Bytes processed (KB) Max commit (KB) Processor Time
211547 1488424 08:57.5
- SergeiBaklan
  Diamond Contributor
  Apr 02, 2021
  truemh
  
  Unfortunately not, optimising of performance in significant part is the art, not only technology, it individual for each concrete case. It's always better to follow Best practices when working with Power Query | Microsoft Docs even they are not directly affect the performance. As a minimum check if query folding works, you have modular structure of the queries and you minimised repeated refresh of them.
  - camsaway
    Copper Contributor
    May 28, 2024
    SergeiBaklan any chance you can expand on the method of Add Index, Sort, Remove Duplicate, Remove Index? I don't seem to be getting the expected results.
    
    I have 3 fields that are relevant here:
    f_fi_id which is a sequence number for data updates
    sdt_problem_id which is the master reference
    sdt_task_index which is the subtask
    
    Each DB update adds a new record as follows:
    * Unique f_fi_id for the update run - this increments with each update run
    * Record for each Problem_id/task_index combination
    Only changed records are appended in each update run
    
    So I need to grab the last update for each Problem_id/task_index combination (i.e. the one with the highest f_fi_id value).
    
    The following doesn't work:
    #"Added Index" = Table.AddIndexColumn(#"Filtered Rows", "Index", 0, 1, Int64.Type), #"Removed Duplicates" = Table.Distinct(#"Added Index", {"sdt_problem_id", "sdt_task_index"}), #"Sorted Rows" = Table.Sort(#"Removed Duplicates",{{"f_fi_id", Order.Descending}, {"sdt_problem_id", Order.Descending}, {"sdt_task_index", Order.Descending}}), #"Removed Columns1" = Table.RemoveColumns(#"Sorted Rows",{"Index"}),

Forum Discussion

Resources

Bytes processed (KB)	Max commit (KB)	Processor Time
273836	2134400	30:47.9