Forum Discussion
renee_crozier
Dec 09, 2024Brass Contributor
Identify Duplicates and Create Column Identifying Them
I'm working with a large data set (70,000+ rows). I'm looking to identify the duplicate file names in the same site and create a new column specifying if it is a duplicate or not. The file names can...
- Dec 09, 2024
You may try Power Query fuzzy match playing with threshold
let threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0}, Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}), Check = Table.FuzzyNestedJoin( DeclareType, {"File Name"}, DeclareType, {"File Name"}, "Table", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]), IsDuplicate = Table.AddColumn( Check, "IsDuplicate", each Table.RowCount( [Table] ) > 1, Logical.Type), Similar = Table.AddColumn( IsDuplicate, "Similar", each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ), type text), RemoveJoin = Table.RemoveColumns(Similar,{"Table"}) in RemoveJoin
renee_crozier
Dec 10, 2024Brass Contributor
This works great! I just need to finesse it and it's good to go. Thanks!
SergeiBaklan
Dec 11, 2024Diamond Contributor
renee_crozier , you are welcome.
Please note, result very depends on concrete texts and threshold you use.