Forum Discussion
MaxIfs vs Earliest vs Aggregation to Find Duplicates {Get&Transform}
- Aug 08, 2017
Hi,
I don't know if other options exist; perhaps they do. As for me, I'd take #2, and I'm almost sure it will be the most efficient.
G&T is slow and inefficient when working with big arrays of data. Microsoft is working on performance improvements, but so far we have what we have.
The data model engine is very well optimized for working with large amounts of data, so keeping the duplicates there shouldn't be a big problem.
Of course, a lot depends on how well your M script is optimized for #3, and on the data model structure and DAX formulas for #2.
- Elmar Bischof, Aug 04, 2017, Copper Contributor
Thank you for your insight. Indeed, #2 is the fastest, though it also creates problems when building relationships afterwards within the data model, as you'd run into duplicates in what are supposed to be unique IDs.
Personally, I am currently experimenting with changing the function that is shown in the aggregation, potentially using a Table.???({?}) function based on "List.Max" or "List.Min" in the aggregation part, like
Table.TransformColumns or Record.TransformFields
(from https://msdn.microsoft.com/en-us/library/mt296614.aspx)
Here the aggregation:
= Table.AggregateTableColumn(#"Merged Queries", "NewColumn", {{"Date created", List.Max, "Max of NewColumn.Date created"}})
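For context, a minimal sketch of where that aggregation step might sit in a full M query; the self-join via Table.NestedJoin and the table/column names ("Data", "ID", "Date created") are illustrative assumptions, not the exact query from this thread:

    let
        // Illustrative source table with a repeating "ID" and a "Date created" column
        Source = Data,
        // Join the table to itself on ID so each row gets a nested table of all rows sharing its ID
        #"Merged Queries" = Table.NestedJoin(Source, {"ID"}, Source, {"ID"}, "NewColumn", JoinKind.Inner),
        // Collapse the nested table down to the latest date per ID
        Aggregated = Table.AggregateTableColumn(#"Merged Queries", "NewColumn",
            {{"Date created", List.Max, "Max of NewColumn.Date created"}})
    in
        Aggregated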
There must be a quicker way of doing it in G&T. Otherwise, sourcing from a folder would be only half as useful.
- Elmar Bischof, Aug 08, 2017, Copper Contributor
Hi.
After playing around with it for a couple of days, I found a quicker solution than #2 while staying in the Query Editor.
Hence, here is #4:
- Import your data and duplicate it.
- Remove all columns in one dataset except the unique ID and the "Date Created" of the source file.
- Use "Group By" within "Transform": group by the ID and return the Max of the date.
- Select both columns, go to "Transform" and use "Merge Columns".
- Change the "Load To" of this grouped dataset to connection only, without loading it to the data model.
- Go back to the other dataset that you intend to load into the data model and merge the "Date created" and your unique ID in the same way.
- Make an inner join between the two (see the M sketch below).
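As a rough illustration of those steps in M, here is a minimal sketch under stated assumptions; the query and column names ("Source", "ID", "Date Created", "GroupedKeys") are placeholders for the example, not taken from the original workbook:

    // Query 1, "GroupedKeys" (set Load To: connection only)
    let
        Data = Source,  // the imported data, duplicated from the main query
        // Group by ID and keep the latest Date Created per ID
        Grouped = Table.Group(Data, {"ID"}, {{"MaxDate", each List.Max([Date Created]), type datetime}}),
        // Merge Columns works on text, so convert before combining into one key
        AsText = Table.TransformColumnTypes(Grouped, {{"ID", type text}, {"MaxDate", type text}}),
        MergedKey = Table.CombineColumns(AsText, {"ID", "MaxDate"}, Combiner.CombineTextByDelimiter("|"), "Key")
    in
        MergedKey

    // Query 2 (loaded to the data model)
    let
        Data = Source,
        AsText = Table.TransformColumnTypes(Data, {{"ID", type text}, {"Date Created", type text}}),
        WithKey = Table.CombineColumns(AsText, {"ID", "Date Created"}, Combiner.CombineTextByDelimiter("|"), "Key"),
        // The inner join keeps only rows whose ID + date match the per-ID maximum,
        // i.e. the latest row per ID survives
        Joined = Table.NestedJoin(WithKey, {"Key"}, GroupedKeys, {"Key"}, "Match", JoinKind.Inner)
    in
        Joined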
I get a refresh time of less than one minute.
Still not the quick single-table solution without the use of "Join" that I have been looking for, but at least it stays within the Query Editor, is quick, and keeps the file size small.
P.S.: You get "Date Created" at the very first step of the Applied Steps, under "Source".
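To illustrate that P.S.: when sourcing from a folder, the Source step itself already exposes file metadata, including a "Date created" column (the folder path below is just a placeholder):

    // Folder.Files returns one row per file with metadata columns such as
    // Name, Extension, Date accessed, Date modified, Date created and Folder Path
    Source = Folder.Files("C:\Data")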