Forum Discussion

renee_crozier's avatar
renee_crozier
Brass Contributor
Dec 09, 2024
Solved

Identify Duplicates and Create Column Identifying Them

I'm working with a large data set (70,000+ rows). I'm looking to identify the duplicate file names in the same site and create a new column specifying if it is a duplicate or not. The file names can...
  • SergeiBaklan's avatar
    Dec 09, 2024

    You may try Power Query fuzzy match playing with threshold

    let
    
        threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0},
    
        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}),
    
        Check = Table.FuzzyNestedJoin(
            DeclareType, {"File Name"},
            DeclareType, {"File Name"},
            "Table",
            JoinKind.LeftOuter,
            [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]),
        IsDuplicate = Table.AddColumn(
            Check,
            "IsDuplicate",
            each Table.RowCount( [Table] ) > 1,
            Logical.Type),
        Similar = Table.AddColumn(
            IsDuplicate,
            "Similar",
            each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ),
            type text),
        RemoveJoin = Table.RemoveColumns(Similar,{"Table"})
    in
        RemoveJoin

Resources