"Smart" de deuplication

Question

I am working with a spreadsheet of around 100K lines. About 30K of them are actrually duplicated and need to be removed. Most people will say just use the deduplication feature and this is probably what I will do.

However for each duplicate one line has more information in it that the other so when I reduplicate I need to make sure the row with the extra information remains that one other row is removed. Luckily there is a column that has a value for the row I want to keep.

The below is a very simplfied version of the bigger table however you can see the DNS and IP columns are how I am finding the duplicates. Then I want to keep the row with more info or another way is keep the rows with "AGENT" in the Tracking column.

DC	DNS	Scan	IP	Country	ID	Tracking
unknown	hostname1	central	192.168.0.1	US	86171	AGENT
local	hostname1	local	192.168.0.1		86171	SCAN
unknown	hostname1	central	192.168.0.2	US	86172	AGENT
local	hostname1	local	192.168.0.2		86172	SCAN
unknown	hostname1	central	192.168.0.3	US	86173	AGENT
local	hostname1	local	192.168.0.3		86173	SCAN

I do the reduplication using the DNS and IP columns but I don't see how I can tell it to keep the row I want. How does excel choose? Does it just keep the first row and delete the second for example?

Thanks.

harun24hr · Answer

What output do you expect from your given sample data? Can you please also post your desired result to your post?

patrick2788 · Answer

lfk73&nbsp;If you have access to LAMBDA, this might work for you:=LET(b,BYROW(HSTACK(IP,tracking),LAMBDA(row,IF(OR(COUNTIF(IP,TAKE(row,,1))=1,AND(COUNTIF(IP,TAKE(row,,1))&gt;1,TAKE(row,,-1)="Agent")),1,0))),FILTER(data,b=1))

Forum Discussion

"Smart" de deuplication

Share