Forum Discussion
renee_crozier
Dec 09, 2024Brass Contributor
Identify Duplicates and Create Column Identifying Them
I'm working with a large data set (70,000+ rows). I'm looking to identify the duplicate file names in the same site and create a new column specifying if it is a duplicate or not. The file names can...
- Dec 09, 2024
You may try Power Query fuzzy match playing with threshold
let threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0}, Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}), Check = Table.FuzzyNestedJoin( DeclareType, {"File Name"}, DeclareType, {"File Name"}, "Table", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]), IsDuplicate = Table.AddColumn( Check, "IsDuplicate", each Table.RowCount( [Table] ) > 1, Logical.Type), Similar = Table.AddColumn( IsDuplicate, "Similar", each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ), type text), RemoveJoin = Table.RemoveColumns(Similar,{"Table"}) in RemoveJoin
SergeiBaklan
Dec 09, 2024Diamond Contributor
You may try Power Query fuzzy match playing with threshold
let
threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0},
Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}),
Check = Table.FuzzyNestedJoin(
DeclareType, {"File Name"},
DeclareType, {"File Name"},
"Table",
JoinKind.LeftOuter,
[IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]),
IsDuplicate = Table.AddColumn(
Check,
"IsDuplicate",
each Table.RowCount( [Table] ) > 1,
Logical.Type),
Similar = Table.AddColumn(
IsDuplicate,
"Similar",
each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ),
type text),
RemoveJoin = Table.RemoveColumns(Similar,{"Table"})
in
RemoveJoin
- renee_crozierDec 10, 2024Brass Contributor
This works great! I just need to finesse it and it's good to go. Thanks!
- SergeiBaklanDec 11, 2024Diamond Contributor
renee_crozier , you are welcome.
Please note, result very depends on concrete texts and threshold you use.