Forum Discussion
renee_crozier
Dec 09, 2024Brass Contributor
Identify Duplicates and Create Column Identifying Them
I'm working with a large data set (70,000+ rows). I'm looking to identify the duplicate file names in the same site and create a new column specifying if it is a duplicate or not. The file names can...
- Dec 09, 2024
You may try Power Query fuzzy match playing with threshold
let threshold = Excel.CurrentWorkbook(){[Name="Threshold"]}[Content][Column1]{0}, Source = Excel.CurrentWorkbook(){[Name="Table1"]}[Content], DeclareType = Table.TransformColumnTypes(Source,{{"File Name", type text}}), Check = Table.FuzzyNestedJoin( DeclareType, {"File Name"}, DeclareType, {"File Name"}, "Table", JoinKind.LeftOuter, [IgnoreCase=true, IgnoreSpace=true, Threshold=threshold]), IsDuplicate = Table.AddColumn( Check, "IsDuplicate", each Table.RowCount( [Table] ) > 1, Logical.Type), Similar = Table.AddColumn( IsDuplicate, "Similar", each Text.Combine( List.RemoveItems( [Table][File Name], {[File Name]} ), ", " ), type text), RemoveJoin = Table.RemoveColumns(Similar,{"Table"}) in RemoveJoin
PeterBartholomew1
Dec 09, 2024Silver Contributor
I think you are pretty much doomed. Searching 70,000+ file names is a noticeable calculation, to do it 70,000 times is getting to be heavy computing. To attempt it with dirty data is asking for trouble.
My attempt is
= LET(
fileNames, REGEXEXTRACT(Table1[File Name], "[\w|\.\-]+\.pdf"),
site, Table1[Site],
duplicate, MAP(site, fileNames, COMPARE(site, fileNames)),
HSTACK(fileNames, site, IF(duplicate, "Yes", "No"))
)
where
COMPAREλ
= LAMBDA(site, fileNames, LAMBDA(s, f,
LET(
filteredFile, IF(site = s, fileNames),
firstMatch, XMATCH(f, filteredFile, , 1),
lastmatch, XMATCH(f, filteredFile, , -1),
firstMatch <> lastmatch
)
))
- renee_crozierDec 09, 2024Brass Contributor
I was able to get it filtered to around 11,000. Does that help my case?