Forum Discussion

renee_crozier
Brass Contributor
Dec 09, 2024
Solved

Identify Duplicates and Create Column Identifying Them

I'm working with a large data set (70,000+ rows). I'm looking to identify duplicate file names within the same site and create a new column specifying whether each row is a duplicate or not.

The file names can have additional characters in them, which makes this a little more difficult. Here is what I'm trying to do:

The file name and site for titles 1 and 2 are the same, but they're separated by an underscore. Other items throwing this off are version numbers, dates, delimiters, etc. I've included various examples in my dummy data.

Any help is greatly appreciated here.

8 Replies

  • SergeiBaklan
    Diamond Contributor

    You may try a Power Query fuzzy match, playing with the threshold:

    let
        // fuzzy-match threshold (0..1) read from a single-cell named range "Threshold"
        threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0},

        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}),

        // fuzzy self-join: each row collects every similar file name into "Table"
        Check = Table.FuzzyNestedJoin(
            DeclareType, {"File Name"},
            DeclareType, {"File Name"},
            "Table",
            JoinKind.LeftOuter,
            [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]),
        // more than one match means the row matched something besides itself
        IsDuplicate = Table.AddColumn(
            Check,
            "IsDuplicate",
            each Table.RowCount( [Table] ) > 1,
            Logical.Type),
        // list the similar names, excluding the row's own file name
        Similar = Table.AddColumn(
            IsDuplicate,
            "Similar",
            each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ),
            type text),
        RemoveJoin = Table.RemoveColumns(Similar,{"Table"})
    in
        RemoveJoin
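
    For the Threshold value, this query expects a single-cell named range called "Threshold" somewhere in the workbook, holding a number between 0 and 1 (1 = exact matches only). As a baseline to compare the fuzzy results against, an exact-duplicate flag per file name and site fits in one dynamic-array formula; a minimal sketch, with column names assumed from the sample data:

        =LET(
            names,  Table1[File Name],
            sites,  Table1[Site],
            counts, COUNTIFS(Table1[File Name], names, Table1[Site], sites),
            IF(counts > 1, "Duplicate", "Unique")
        )

    Anything the fuzzy query flags that this baseline doesn't is a "near" duplicate caught only by the threshold.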
    • renee_crozier
      Brass Contributor

      This works great! I just need to finesse it and it's good to go. Thanks!

  • m_tarler
    Bronze Contributor

    Here is my attempt. It's not as neat as Peter's, but it is 'broken out' for each of the 'factors' I took into account. I agree there may be a processing impact. In this example I just created a 'helper' column that strips out all the unwanted items:

    =LET(in, [@[File Name]],
    strip_filetype, TEXTBEFORE(in,".",-1,,,),
    chk_date, TEXTSPLIT(SUBSTITUTE(strip_filetype,".","-")," "),
    date_removed, TEXTJOIN("",,IF(ISNUMBER(DATEVALUE(IF(RIGHT(chk_date,1)="-",LEFT(chk_date,LEN(chk_date)-1),chk_date))),"",chk_date)),
    specials_removed, SUBSTITUTE(SUBSTITUTE(date_removed,"-"," "),"_"," "),
    chk_vers, TRIM(TEXTSPLIT(specials_removed,"v")),
    ver_removed, TEXTJOIN("",,IF(ISNUMBER(--LEFT(chk_vers)),"",chk_vers)),
    strip_end_numbers, LEFT(ver_removed,LEN(TRIM(CONCAT("a",IF(ISNUMBER(--MID(ver_removed,SEQUENCE(LEN(ver_removed)),1))," ","a"))))-1),
    strip_final, MID(strip_end_numbers,IF(LOWER(LEFT(strip_end_numbers,5))="final",6,1),IF(LOWER(RIGHT(strip_end_numbers,5))="final",LEN(strip_end_numbers)-5,LEN(strip_end_numbers))),
    remove_space, SUBSTITUTE(strip_final," ",""),
    remove_ending_s, LEFT(remove_space, LEN(remove_space)-(LOWER(RIGHT(remove_space,1))="s")),
    remove_ending_s)

    BTW, I didn't re-add the filetype, but that could easily be done on line 12 with &TEXTAFTER(in,".",-1)

    I also replaced the date format using "." with "-" because my Excel, at least, didn't recognize that format.

    As for the word "final", I remove it only if it is found at the very beginning or at the end (after removing any other date or version info).

    I also split the name on "v" and check whether the first character after the split is a number; if so, I remove that segment. That could also produce errors and might need tweaking.
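
    Once the helper column exists, the duplicate flag itself is just a count per site; a minimal sketch, assuming the helper column is named "Helper" and the site column "Site" in the same table:

    =IF(COUNTIFS([Helper], [@Helper], [Site], [@Site]) > 1, "Duplicate", "Unique")

    Since the heavy text clean-up runs once per row in the helper column, this counting pass stays cheap even at 70,000 rows.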

  • I think you are pretty much doomed. Searching 70,000+ file names is a noticeable calculation; to do it 70,000 times is getting into heavy computing. To attempt it with dirty data is asking for trouble.

    My attempt is

    = LET(
        fileNames, REGEXEXTRACT(Table1[File Name], "[\w.\-]+\.pdf"),
        site,      Table1[Site],
        duplicate, MAP(site, fileNames, COMPAREλ(site, fileNames)),
        HSTACK(fileNames, site, IF(duplicate, "Yes", "No"))
      )

    where

    COMPAREλ
    = LAMBDA(site, fileNames, LAMBDA(s, f,
        LET(
            filteredFile, IF(site = s, fileNames),
            firstMatch,   XMATCH(f, filteredFile, , 1),
            lastMatch,    XMATCH(f, filteredFile, , -1),
            firstMatch <> lastMatch
        )
    ))

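    COMPAREλ is meant to be defined as a named LAMBDA via Formulas > Name Manager. If you would rather avoid a defined name, the curried lambda can be inlined into the MAP call; a sketch under the same assumptions about the table layout:

    = LET(
        fileNames, REGEXEXTRACT(Table1[File Name], "[\w.\-]+\.pdf"),
        site,      Table1[Site],
        duplicate, MAP(site, fileNames, LAMBDA(s, f,
            LET(
                filteredFile, IF(site = s, fileNames),
                XMATCH(f, filteredFile, , 1) <> XMATCH(f, filteredFile, , -1)
            )
        )),
        HSTACK(fileNames, site, IF(duplicate, "Yes", "No"))
      )

    The two versions behave identically; the named LAMBDA is just easier to reuse and test.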

    • renee_crozier
      Brass Contributor

      I was able to get it filtered to around 11,000. Does that help my case?

  • This is going to be difficult. It's not hard to remove non-alphabetic characters, but how do you determine whether a difference in spelling is significant or not? Apparently "filename" is to be treated as equal to "filenames", "filenamev", and "finalfilename", but what is the rule there?

    • renee_crozier
      Brass Contributor

      There's always going to be a margin for error, so as long as it's relatively close, I think it's okay. I acknowledge that this will never be 100%.
