Forum Discussion
Identify Duplicates and Create Column Identifying Them
I'm working with a large data set (70,000+ rows). I'm looking to identify duplicate file names within the same site and create a new column specifying whether each row is a duplicate or not.
The file names can have additional characters in them, which makes this a little more difficult. This is what I'm trying to do:
The file name and site for titles 1 and 2 are the same, but the file name is separated by an underscore. Other examples of items that throw this off are version numbers, dates, delimiters, etc. I've included various examples in my dummy data.
Any help is greatly appreciated here.
8 Replies
- SergeiBaklan (Diamond Contributor)
You may try a Power Query fuzzy match, playing with the threshold:
let
    // similarity threshold (0..1) read from a one-cell range named "Threshold"
    threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0},
    Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
    DeclareType = Table.TransformColumnTypes(Source, {{"File Name", type text}}),
    // fuzzy self-join: each row gets a nested table of similar file names
    Check = Table.FuzzyNestedJoin(
        DeclareType, {"File Name"},
        DeclareType, {"File Name"},
        "Table", JoinKind.LeftOuter,
        [IgnoreCase = true, IgnoreSpace = true, Threshold = threshold]),
    // more than one match (a row always matches itself) means a duplicate
    IsDuplicate = Table.AddColumn(Check, "IsDuplicate",
        each Table.RowCount([Table]) > 1, Logical.Type),
    // list the other file names that were considered similar
    Similar = Table.AddColumn(IsDuplicate, "Similar",
        each Text.Combine(List.RemoveItems([Table][File Name], {[File Name]}), ", "),
        type text),
    RemoveJoin = Table.RemoveColumns(Similar, {"Table"})
in
    RemoveJoin
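A minimal extra sketch, not part of SergeiBaklan's query: the question asks for duplicates within the same site, while the join above keys only on File Name. Assuming Table1 also has a Site column, the IsDuplicate step could be narrowed to same-site matches like this (the step name IsDuplicateInSite is hypothetical):

    // hypothetical replacement for the IsDuplicate step above: only count
    // fuzzy matches whose Site equals the current row's Site
    IsDuplicateInSite = Table.AddColumn(Check, "IsDuplicate",
        each
            let thisSite = [Site]
            in  Table.RowCount(
                    Table.SelectRows([Table], (r) => r[Site] = thisSite)) > 1,
        Logical.Type)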
- renee_crozier (Brass Contributor)
This works great! I just need to finesse it and it's good to go. Thanks!
- SergeiBaklan (Diamond Contributor)
renee_crozier, you are welcome.
Please note, the result depends heavily on the specific texts and the threshold you use.
- m_tarler (Bronze Contributor)
Here is my attempt. It's not as neat as Peter's, but it is 'broken out' for each of the 'factors' I took into account. I agree there may be a processing impact. In this example I just created a 'helper' column to strip out all the unwanted items:
=LET(in, [@[File Name]],
 strip_filetype, TEXTBEFORE(in,".",-1,,,),
 chk_date, TEXTSPLIT(SUBSTITUTE(strip_filetype,".","-")," "),
 date_removed, TEXTJOIN("",,IF(ISNUMBER(DATEVALUE(IF(RIGHT(chk_date,1)="-",LEFT(chk_date,LEN(chk_date)-1),chk_date))),"",chk_date)),
 specials_removed, SUBSTITUTE(SUBSTITUTE(date_removed,"-"," "),"_"," "),
 chk_vers, TRIM(TEXTSPLIT(specials_removed,"v")),
 ver_removed, TEXTJOIN("",,IF(ISNUMBER(--LEFT(chk_vers)),"",chk_vers)),
 strip_end_numbers, LEFT(ver_removed,LEN(TRIM(CONCAT("a",IF(ISNUMBER(--MID(ver_removed,SEQUENCE(LEN(ver_removed)),1))," ","a"))))-1),
 strip_final, MID(strip_end_numbers,IF(LOWER(LEFT(strip_end_numbers,5))="final",6,1),IF(LOWER(RIGHT(strip_end_numbers,5))="final",LEN(strip_end_numbers)-5,LEN(strip_end_numbers))),
 remove_space, SUBSTITUTE(strip_final," ",""),
 remove_ending_s, LEFT(remove_space, LEN(remove_space)-(LOWER(RIGHT(remove_space,1))="s")),
 remove_ending_s)
BTW, I didn't re-add the filetype back on, but that could easily be done on line 12 with &TEXTAFTER(in,".",-1).
I also replaced the date format using "." with "-", because at least my Excel didn't recognize that format.
As for the word "final", I remove it only if it is found at the very beginning or at the end (after removing any other date or version info).
I also split the name on "v" and check whether the next character is a number, removing it if so, but that could also result in errors and might need tweaking.
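Once the stripped name sits in a helper column, the duplicate flag the original question asks for can come from a plain COUNTIFS. A minimal sketch, assuming the table is named Table1 and that the cleaned-up name and the site live in columns named Helper and Site (both column names are assumptions here):

    =IF(COUNTIFS(Table1[Site], [@Site], Table1[Helper], [@Helper]) > 1, "Yes", "No")

This only flags exact matches of the cleaned-up name, so it relies entirely on the helper column doing the normalisation.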
- PeterBartholomew1 (Silver Contributor)
I think you are pretty much doomed. Searching 70,000+ file names once is a noticeable calculation; doing it 70,000 times (on the order of 4.9 billion pairwise comparisons) is getting to be heavy computing. To attempt it with dirty data is asking for trouble.
My attempt is
= LET(
    fileNames, REGEXEXTRACT(Table1[File Name], "[\w|\.\-]+\.pdf"),
    site,      Table1[Site],
    duplicate, MAP(site, fileNames, COMPAREλ(site, fileNames)),
    HSTACK(fileNames, site, IF(duplicate, "Yes", "No"))
  )
where
COMPAREλ = LAMBDA(site, fileNames,
    LAMBDA(s, f,
        LET(
            filteredFile, IF(site = s, fileNames),
            firstMatch,   XMATCH(f, filteredFile, , 1),
            lastMatch,    XMATCH(f, filteredFile, , -1),
            firstMatch <> lastMatch
        )
    )
)
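For what it's worth, the same first-match versus last-match trick can be sketched without the curried MAP by concatenating Site and the extracted file name into a single key and letting XMATCH operate on the whole array at once. This is only an illustrative variant of Peter's idea, not his formula, and it still treats the cleaned-up names as exact-match only:

    = LET(
        fileNames, REGEXEXTRACT(Table1[File Name], "[\w|\.\-]+\.pdf"),
        keys,      Table1[Site] & "|" & fileNames,
        firstPos,  XMATCH(keys, keys, 0, 1),
        lastPos,   XMATCH(keys, keys, 0, -1),
        IF(firstPos <> lastPos, "Yes", "No")
      )

It is still quadratic in the row count, so the performance warning above applies just as much here.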
- renee_crozier (Brass Contributor)
I was able to get it filtered to around 11,000. Does that help my case?
This is going to be difficult. Removing non-alphabetic characters is not hard, but how do you determine whether a difference in spelling is significant or not? Apparently "filename" is to be treated as equal to "filenames", "filenamev", and "finalfilename", but what is the rule there?
- renee_crozier (Brass Contributor)
There's always going to be a margin for error, so as long as it's relatively close, I think it's okay. I acknowledge that this will never be 100% accurate.