Forum Discussion
renee_crozier
Dec 09, 2024Brass Contributor
Identify Duplicates and Create Column Identifying Them
I'm working with a large data set (70,000+ rows). I'm looking to identify the duplicate file names in the same site and create a new column specifying if it is a duplicate or not. The file names can...
- Dec 09, 2024
You may try Power Query fuzzy match playing with threshold
let threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0}, Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}), Check = Table.FuzzyNestedJoin( DeclareType, {"File Name"}, DeclareType, {"File Name"}, "Table", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]), IsDuplicate = Table.AddColumn( Check, "IsDuplicate", each Table.RowCount( [Table] ) > 1, Logical.Type), Similar = Table.AddColumn( IsDuplicate, "Similar", each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ), type text), RemoveJoin = Table.RemoveColumns(Similar,{"Table"}) in RemoveJoin
HansVogelaar
Dec 09, 2024MVP
This is going to be difficult - it's not difficult to remove non-alphabet characters, but how to determine whether a difference in spelling is significant or not? Apparently "filename" is to be treated as equal to "filenames", "filenamev", and "finalfilename", but what is the rule there?
renee_crozier
Dec 09, 2024Brass Contributor
There's always going to be margin for error so as long as it's relatively close, I think it's okay. I acknowledge that this will never be 100%