"Smart" de deuplication

Brass Contributor

I am working with a spreadsheet of around 100K lines.  About 30K of them are actrually duplicated and need to be removed.  Most people will say just use the deduplication feature and this is probably what I will do.

 

However for each duplicate one line has more information in it that the other so when I reduplicate I need to make sure the row with the extra information remains that one other row is removed.  Luckily there is a column that has a value for the row I want to keep.

 

The below is a very simplfied version of the bigger table however you can see the DNS and IP columns are how I am finding the duplicates.  Then I want to keep the row with more info or another way is keep the rows with "AGENT" in the Tracking column.

 

DCDNSScanIPCountryIDTracking
unknownhostname1central192.168.0.1US86171AGENT
localhostname1local192.168.0.1 86171SCAN
unknownhostname1central192.168.0.2US86172AGENT
localhostname1local192.168.0.2 86172SCAN
unknownhostname1central192.168.0.3US86173AGENT
localhostname1local192.168.0.3 86173SCAN

 

I do the reduplication using the DNS and IP columns but I don't see how I can tell it to keep the row I want.  How does excel choose?  Does it just keep the first row and delete the second for example?

 

Thanks.

2 Replies
What output do you expect from your given sample data? Can you please also post your desired result to your post?

@lfk73 

If you have access to LAMBDA, this might work for you:

=LET(b,BYROW(HSTACK(IP,tracking),LAMBDA(row,IF(OR(COUNTIF(IP,TAKE(row,,1))=1,AND(COUNTIF(IP,TAKE(row,,1))>1,TAKE(row,,-1)="Agent")),1,0))),FILTER(data,b=1))