Forum Discussion

renee_crozier
Brass Contributor
Dec 09, 2024
Solved

Identify Duplicates and Create Column Identifying Them

I'm working with a large data set (70,000+ rows). I'm looking to identify duplicate file names within the same site and create a new column specifying whether each row is a duplicate or not.

The file names can have additional characters in them, which makes this a little more difficult. Here is what I'm trying to do:

The file name and site for titles 1 and 2 are the same, but they're separated by an underscore. Other items throwing this off are version numbers, dates, delimiters, etc. I've included various examples in my dummy data.

Any help is greatly appreciated here.

8 Replies

  • SergeiBaklan
    Diamond Contributor

    You may try a Power Query fuzzy match, playing with the threshold:

    let
        // fuzzy-match threshold (0..1) read from a single-cell named range "Threshold"
        threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0},

        Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content],
        DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}),

        // fuzzy self-join: each row collects every similar file name into "Table"
        Check = Table.FuzzyNestedJoin(
            DeclareType, {"File Name"},
            DeclareType, {"File Name"},
            "Table",
            JoinKind.LeftOuter,
            [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]),
        // more than one match means the row matched something besides itself
        IsDuplicate = Table.AddColumn(
            Check,
            "IsDuplicate",
            each Table.RowCount( [Table] ) > 1,
            Logical.Type),
        // list the similar names, excluding the row's own file name
        Similar = Table.AddColumn(
            IsDuplicate,
            "Similar",
            each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ),
            type text),
        RemoveJoin = Table.RemoveColumns(Similar,{"Table"})
    in
        RemoveJoin
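
    For the Threshold value, this query expects a single-cell named range called "Threshold" somewhere in the workbook, holding a number between 0 and 1 (1 = exact matches only). As a baseline to compare the fuzzy results against, an exact-duplicate flag per file name and site fits in one dynamic-array formula; a minimal sketch, with column names assumed from the sample data:

        =LET(
            names,  Table1[File Name],
            sites,  Table1[Site],
            counts, COUNTIFS(Table1[File Name], names, Table1[Site], sites),
            IF(counts > 1, "Duplicate", "Unique")
        )

    Anything the fuzzy query flags that this baseline doesn't is a "near" duplicate caught only by the threshold.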
    • renee_crozier
      Brass Contributor

      This works great! I just need to finesse it and it's good to go. Thanks!

  • m_tarler
    Bronze Contributor

    Here is my attempt. It's not as neat as Peter's, but it is 'broken out' for each of the 'factors' I took into account. I agree there may be a processing impact. In this example I just created a 'helper' column that strips out all the unwanted items:

    =LET(in, [@[File Name]],
    strip_filetype, TEXTBEFORE(in,".",-1,,,),
    chk_date, TEXTSPLIT(SUBSTITUTE(strip_filetype,".","-")," "),
    date_removed, TEXTJOIN("",,IF(ISNUMBER(DATEVALUE(IF(RIGHT(chk_date,1)="-",LEFT(chk_date,LEN(chk_date)-1),chk_date))),"",chk_date)),
    specials_removed, SUBSTITUTE(SUBSTITUTE(date_removed,"-"," "),"_"," "),
    chk_vers, TRIM(TEXTSPLIT(specials_removed,"v")),
    ver_removed, TEXTJOIN("",,IF(ISNUMBER(--LEFT(chk_vers)),"",chk_vers)),
    strip_end_numbers, LEFT(ver_removed,LEN(TRIM(CONCAT("a",IF(ISNUMBER(--MID(ver_removed,SEQUENCE(LEN(ver_removed)),1))," ","a"))))-1),
    strip_final, MID(strip_end_numbers,IF(LOWER(LEFT(strip_end_numbers,5))="final",6,1),IF(LOWER(RIGHT(strip_end_numbers,5))="final",LEN(strip_end_numbers)-5,LEN(strip_end_numbers))),
    remove_space, SUBSTITUTE(strip_final," ",""),
    remove_ending_s, LEFT(remove_space, LEN(remove_space)-(LOWER(RIGHT(remove_space,1))="s")),
    remove_ending_s)

    BTW, I didn't re-add the filetype, but that could easily be done on line 12 with &TEXTAFTER(in,".",-1)

    I also replaced the date format using "." with "-" because my Excel, at least, didn't recognize that format.

    As for the word "final", I remove it only if it is found at the very beginning or at the end (after removing any other date or version info).

    I also split the name on "v" and check whether the first character after the split is a number; if so, I remove that segment. That could also produce errors and might need tweaking.
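
    Once the helper column exists, the duplicate flag itself is just a count per site; a minimal sketch, assuming the helper column is named "Helper" and the site column "Site" in the same table:

    =IF(COUNTIFS([Helper], [@Helper], [Site], [@Site]) > 1, "Duplicate", "Unique")

    Since the heavy text clean-up runs once per row in the helper column, this counting pass stays cheap even at 70,000 rows.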

  • I think you are pretty much doomed. Searching 70,000+ file names is a noticeable calculation; to do it 70,000 times is getting into heavy computing. To attempt it with dirty data is asking for trouble.

    My attempt is

    = LET(
        fileNames, REGEXEXTRACT(Table1[File Name], "[\w.\-]+\.pdf"),
        site,      Table1[Site],
        duplicate, MAP(site, fileNames, COMPAREλ(site, fileNames)),
        HSTACK(fileNames, site, IF(duplicate, "Yes", "No"))
      )

    where

    COMPAREλ
    = LAMBDA(site, fileNames, LAMBDA(s, f,
        LET(
            filteredFile, IF(site = s, fileNames),
            firstMatch,   XMATCH(f, filteredFile, , 1),
            lastMatch,    XMATCH(f, filteredFile, , -1),
            firstMatch <> lastMatch
        )
    ))

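    COMPAREλ is meant to be defined as a named LAMBDA via Formulas > Name Manager. If you would rather avoid a defined name, the curried lambda can be inlined into the MAP call; a sketch under the same assumptions about the table layout:

    = LET(
        fileNames, REGEXEXTRACT(Table1[File Name], "[\w.\-]+\.pdf"),
        site,      Table1[Site],
        duplicate, MAP(site, fileNames, LAMBDA(s, f,
            LET(
                filteredFile, IF(site = s, fileNames),
                XMATCH(f, filteredFile, , 1) <> XMATCH(f, filteredFile, , -1)
            )
        )),
        HSTACK(fileNames, site, IF(duplicate, "Yes", "No"))
      )

    The two versions behave identically; the named LAMBDA is just easier to reuse and test.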

    • renee_crozier
      Brass Contributor

      I was able to get it filtered to around 11,000. Does that help my case?

  • This is going to be difficult. It's not hard to remove non-alphabetic characters, but how do you determine whether a difference in spelling is significant or not? Apparently "filename" is to be treated as equal to "filenames", "filenamev", and "finalfilename", but what is the rule there?

    • renee_crozier
      Brass Contributor

      There's always going to be a margin for error, so as long as it's relatively close, I think it's okay. I acknowledge that this will never be 100%.
