Forum Discussion
Identify Duplicates and Create Column Identifying Them
- Dec 09, 2024
You may try Power Query fuzzy match playing with threshold
let threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0}, Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}), Check = Table.FuzzyNestedJoin( DeclareType, {"File Name"}, DeclareType, {"File Name"}, "Table", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]), IsDuplicate = Table.AddColumn( Check, "IsDuplicate", each Table.RowCount( [Table] ) > 1, Logical.Type), Similar = Table.AddColumn( IsDuplicate, "Similar", each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ), type text), RemoveJoin = Table.RemoveColumns(Similar,{"Table"}) in RemoveJoin
here is my attempt and not a neat as peters but it is 'broken out' for each of the 'factors' I took into account. I agree there may be processing impact. In this example I just created the 'helper' column to strip out all the unwanted items:
=LET(in, [@[File Name]],
strip_filetype, TEXTBEFORE(in,".",-1,,,),
chk_date, TEXTSPLIT(SUBSTITUTE(strip_filetype,".","-")," "),
date_removed, TEXTJOIN("",,IF(ISNUMBER(DATEVALUE(IF(RIGHT(chk_date,1)="-",LEFT(chk_date,LEN(chk_date)-1),chk_date))),"",chk_date)),
specials_removed, SUBSTITUTE(SUBSTITUTE(date_removed,"-"," "),"_"," "),
chk_vers, TRIM(TEXTSPLIT(specials_removed,"v")),
ver_removed, TEXTJOIN("",,IF(ISNUMBER(--LEFT(chk_vers)),"",chk_vers)),
strip_end_numbers, LEFT(ver_removed,LEN(TRIM(CONCAT("a",IF(ISNUMBER(--MID(ver_removed,SEQUENCE(LEN(ver_removed)),1))," ","a"))))-1),
strip_final, MID(strip_end_numbers,IF(LOWER(LEFT(strip_end_numbers,5))="final",6,1),IF(LOWER(RIGHT(strip_end_numbers,5))="final",LEN(strip_end_numbers)-5,LEN(strip_end_numbers))),
remove_space, SUBSTITUTE(strip_final," ",""),
remove_ending_s, LEFT(remove_space, LEN(remove_space)-(LOWER(RIGHT(remove_space,1))="s")),
remove_ending_s)
btw, i didn't re-add the file filetype back on but that could be easily done on line 12 with &TEXTAFTER(in,".",-1)
I also replaced the date format using "." with "-" because at least my excel didn't recognize that format.
and as for the "final" word I remove it only if it is found at the very beginning or at the end (after removing any other date or version info)
I also split the name using "v" and check if the 1st character is then a number and then remove it, but that could also result in potential errors and might need tweaking.