SOLVED

MAXIFS vs Earliest vs Aggregation to Find Duplicates {Get&Transform}

Copper Contributor

After having a few looks at various forums, I found that I was not the first with this data filtering requirement for my source data.

 

Case:

You have a dataset to which you frequently need to append new data from a folder into which people drop new Excel files (as discussed in the 2nd Excel Table Talk). It can happen that you get duplicates based on a certain ID, and you only want to keep either the earliest or the latest record.
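
For context, here is a minimal M sketch of the kind of folder import this case assumes. The folder path, the sheet handling and the column names "ID" and "Value" are placeholders; the relevant point is that the "Source" step of a folder query already exposes a "Date created" column per file.

let
    // Hypothetical drop folder; adjust the path to your own
    Source = Folder.Files("C:\Drop"),
    // Keep each file's content plus its creation date for later de-duplication
    KeepCols = Table.SelectColumns(Source, {"Content", "Date created"}),
    // Parse each workbook and take its first sheet (simplified; adjust to your files)
    Parsed = Table.AddColumn(KeepCols, "Data", each Excel.Workbook([Content], true){0}[Data]),
    // "ID" and "Value" are assumed column names inside the dropped files
    Expanded = Table.ExpandTableColumn(Parsed, "Data", {"ID", "Value"}),
    Result = Table.RemoveColumns(Expanded, {"Content"})
in
    Result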

 

Here is where I found a few good solutions to this topic:

http://prologika.com/finding-duplicates-in-dax/

 

Now you have 3 options:

1.) Dump the data as a list into a sheet, build a MAXIFS array formula and filter away what you don't need based on the "Date created" of the files.

(Worst option: your Excel file now has to store redundant data, and the array formula is slow.)

 

2.) You write yourself a DAX formula inside the data model based on the proposal found at Prologika.

(Most efficient option: you still store duplicate data, but now only within the data model.)

 

3.) You duplicate the query you are reading in, merge an aggregated column (the Max {or Min} of the "Date created") from that duplicate onto your main table, add a custom column which compares the "Date created" with the aggregated Max of the "Date created", ... and then keep only the rows which represent either the latest or the earliest version by adding a filter.

(Best in regards to data size, but impossibly slow. You'd fall asleep or be tempted to cancel halfway through if you have more than a few hundred thousand rows.)
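
To make option 3 concrete, here is a rough M sketch of the pattern, assuming a query called "SourceData" with columns "ID" and "Date created" (placeholder names, not my actual steps):

let
    Source = SourceData,
    // Merge the query with its duplicate on ID (what "Merge Queries" produces)
    Merged = Table.NestedJoin(Source, {"ID"}, SourceData, {"ID"}, "NewColumn", JoinKind.LeftOuter),
    // Aggregate the nested table to the Max of "Date created" per row
    Aggregated = Table.AggregateTableColumn(Merged, "NewColumn",
        {{"Date created", List.Max, "Max of NewColumn.Date created"}}),
    // Keep only the rows whose own "Date created" equals the per-ID maximum
    Latest = Table.SelectRows(Aggregated,
        each [Date created] = [Max of NewColumn.Date created]),
    Result = Table.RemoveColumns(Latest, {"Max of NewColumn.Date created"})
in
    Result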

 

Does somebody have a better solution? Could this be done better within Get & Transform itself?

 

3 Replies

Hi,

 

I don't know if other options exist, perhaps they do; as for me, I'd take #2 and am almost sure it'll be the most efficient.

 

G&T is slow and inefficient when working with big arrays of data. Microsoft is working on improving performance, but so far we have what we have.

 

The data model engine is very well optimized for working with large amounts of data, so keeping duplicates shouldn't be a big problem.

 

Of course, a lot depends on how well your M script is optimized for #3, and on the data model structure and DAX formulas for #2.

Thank you for your insight. Indeed #2 is the fastest, though it also creates problems when building relationships afterwards within the data model, as you'd run into duplicates in what are supposed to be unique IDs.


Personally, I am currently experimenting with changing the function that is shown in the aggregation step, potentially using a Table.???({?}) function based on "List.Max" or "List.Min" for the aggregation part.

 

Like Table.TransformColumns or Record.TransformFields (from https://msdn.microsoft.com/en-us/library/mt296614.aspx).


Here is the aggregation step:

= Table.AggregateTableColumn(#"Merged Queries", "NewColumn", {{"Date created", List.Max, "Max of NewColumn.Date created"}})

 

There must be a quicker way of doing it in G&T. Otherwise, sourcing from a folder would only be half as useful.
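
For illustration, the aggregation part could be done with Table.Group and List.Max directly; this is only a sketch with the placeholder names "SourceData", "ID" and "Date created" (swap List.Max for List.Min to keep the earliest). It returns just one key row per ID, so the result still has to be joined back to the full data:

let
    Source = SourceData,
    // One row per ID with the latest "Date created"
    LatestPerID = Table.Group(Source, {"ID"},
        {{"Date created", each List.Max([Date created]), type datetime}})
in
    LatestPerID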

best response confirmed by Elmar Bischof (Copper Contributor)
Solution

Hi.

 

After playing around with it for a couple of days, I found a quicker solution than #2, using the Query Editor.

 

Hence, here is #4 (a rough M sketch of these steps follows at the end of this post):

  1. Import your data and duplicate the query.
  2. In one of the two queries, remove all columns except the unique ID and the "Date created" of the source file.
  3. Use "Group By" on the "Transform" tab: group by the ID and return the Max of the date.
  4. Select both columns, go to "Transform" and use "Merge Columns".
  5. Change the "Load To" setting of this grouped query to "Connection Only", without loading it to the data model.
  6. Go back to the other query, the one you intend to load into the data model, and merge its "Date created" and unique ID columns in the same way.
  7. Merge the two queries using an Inner Join on the combined key column.

I get a refresh time of less than one minute.

Still not the quick single-table solution without the use of a join that I have been looking for, though at least it stays within the Query Editor, is quick and keeps the file size small.

 

PS: You get "Date created" at the very first step of Applied Steps, under "Source".
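
A rough M sketch of these steps, again with the placeholder names "SourceData", "ID" and "Date created" (the UI-generated steps will look slightly different, but the shape is the same):

// Grouped key query "LatestKeys", set to "Connection Only" (steps 1-5)
let
    Source = SourceData,
    KeyCols = Table.SelectColumns(Source, {"ID", "Date created"}),
    // Group by ID and return the Max of the date (step 3)
    LatestPerID = Table.Group(KeyCols, {"ID"},
        {{"Date created", each List.Max([Date created]), type datetime}}),
    // Combine ID and date into a single key column (step 4)
    AddKey = Table.AddColumn(LatestPerID, "Key",
        each Text.From([ID]) & "|" & Text.From([Date created]), type text),
    Keys = Table.SelectColumns(AddKey, {"Key"})
in
    Keys

// Main query that gets loaded to the data model (steps 6-7)
let
    Source = SourceData,
    // Build the same combined key on the full data (step 6)
    WithKey = Table.AddColumn(Source, "Key",
        each Text.From([ID]) & "|" & Text.From([Date created]), type text),
    // Inner Join against the grouped key query keeps only the latest rows (step 7)
    Joined = Table.NestedJoin(WithKey, {"Key"}, LatestKeys, {"Key"}, "Match", JoinKind.Inner),
    Result = Table.RemoveColumns(Joined, {"Match", "Key"})
in
    Result

Keeping the grouped query as connection only is what keeps the file size small: only the de-duplicated main query ends up in the data model.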
