Forum Discussion

lfk73's avatar
lfk73
Brass Contributor
Nov 25, 2022

"Smart" de deuplication

I am working with a spreadsheet of around 100K lines.  About 30K of them are actrually duplicated and need to be removed.  Most people will say just use the deduplication feature and this is probably what I will do.

 

However for each duplicate one line has more information in it that the other so when I reduplicate I need to make sure the row with the extra information remains that one other row is removed.  Luckily there is a column that has a value for the row I want to keep.

 

The below is a very simplfied version of the bigger table however you can see the DNS and IP columns are how I am finding the duplicates.  Then I want to keep the row with more info or another way is keep the rows with "AGENT" in the Tracking column.

 

DCDNSScanIPCountryIDTracking
unknownhostname1central192.168.0.1US86171AGENT
localhostname1local192.168.0.1 86171SCAN
unknownhostname1central192.168.0.2US86172AGENT
localhostname1local192.168.0.2 86172SCAN
unknownhostname1central192.168.0.3US86173AGENT
localhostname1local192.168.0.3 86173SCAN

 

I do the reduplication using the DNS and IP columns but I don't see how I can tell it to keep the row I want.  How does excel choose?  Does it just keep the first row and delete the second for example?

 

Thanks.

  • Harun24HR's avatar
    Harun24HR
    Bronze Contributor
    What output do you expect from your given sample data? Can you please also post your desired result to your post?
  • Patrick2788's avatar
    Patrick2788
    Silver Contributor

    lfk73 

    If you have access to LAMBDA, this might work for you:

    =LET(b,BYROW(HSTACK(IP,tracking),LAMBDA(row,IF(OR(COUNTIF(IP,TAKE(row,,1))=1,AND(COUNTIF(IP,TAKE(row,,1))>1,TAKE(row,,-1)="Agent")),1,0))),FILTER(data,b=1))

Share