"Smart" deduplication

lfk73 · Nov 24 2022

I am working with a spreadsheet of around 100K rows. About 30K of them are actually duplicates and need to be removed. Most people will say to just use the deduplication feature, and that is probably what I will do.

However, for each duplicate pair, one row has more information in it than the other, so when I deduplicate I need to make sure the row with the extra information remains and the other row is removed. Luckily, there is a column that has a value identifying the row I want to keep.

The table below is a very simplified version of the bigger table, but you can see that the DNS and IP columns are how I am finding the duplicates. I then want to keep the row with more info or, put another way, keep the rows with "AGENT" in the Tracking column.

DC	DNS	Scan	IP	Country	ID	Tracking
unknown	hostname1	central	192.168.0.1	US	86171	AGENT
local	hostname1	local	192.168.0.1		86171	SCAN
unknown	hostname1	central	192.168.0.2	US	86172	AGENT
local	hostname1	local	192.168.0.2		86172	SCAN
unknown	hostname1	central	192.168.0.3	US	86173	AGENT
local	hostname1	local	192.168.0.3		86173	SCAN

I do the deduplication using the DNS and IP columns, but I don't see how I can tell it to keep the row I want. How does Excel choose? Does it just keep the first row and delete the second, for example?

Thanks.

Harun24HR · Nov 24 2022

What output do you expect from the sample data you gave? Could you please also add your desired result to your post?

Patrick2788 · Nov 25 2022

@lfk73

If you have access to LAMBDA, this might work for you:

=LET(b,BYROW(HSTACK(IP,tracking),LAMBDA(row,IF(OR(COUNTIF(IP,TAKE(row,,1))=1,AND(COUNTIF(IP,TAKE(row,,1))>1,TAKE(row,,-1)="Agent")),1,0))),FILTER(data,b=1))
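
For anyone who wants to check the rule outside Excel, here is a minimal Python sketch of the same idea: keep a row if its key is unique, or if the key is duplicated and the Tracking value is "AGENT". It is only an illustration, not a translation of the formula. The function name `smart_dedupe` and its parameters are made up for this example, and it assumes the duplicate key is DNS plus IP as described in the question (the formula above counts on IP only). The text comparison ignores case, as Excel's `=` does.

```python
from collections import Counter

# Sample rows copied from the question.
# Column order: DC, DNS, Scan, IP, Country, ID, Tracking
ROWS = [
    ("unknown", "hostname1", "central", "192.168.0.1", "US", "86171", "AGENT"),
    ("local", "hostname1", "local", "192.168.0.1", "", "86171", "SCAN"),
    ("unknown", "hostname1", "central", "192.168.0.2", "US", "86172", "AGENT"),
    ("local", "hostname1", "local", "192.168.0.2", "", "86172", "SCAN"),
    ("unknown", "hostname1", "central", "192.168.0.3", "US", "86173", "AGENT"),
    ("local", "hostname1", "local", "192.168.0.3", "", "86173", "SCAN"),
]


def smart_dedupe(rows, key_cols=(1, 3), tracking_col=6, keep_value="AGENT"):
    """Keep rows whose key is unique, plus rows flagged with keep_value.

    key_cols, tracking_col and keep_value are hypothetical parameters;
    the defaults match the sample table (DNS and IP as the key).
    """
    # Count how many rows share each (DNS, IP) key.
    counts = Counter(tuple(row[c] for c in key_cols) for row in rows)

    kept = []
    for row in rows:
        key = tuple(row[c] for c in key_cols)
        is_unique = counts[key] == 1
        # Case-insensitive match, mirroring Excel's text comparison.
        is_preferred = row[tracking_col].upper() == keep_value.upper()
        if is_unique or is_preferred:
            kept.append(row)
    return kept


if __name__ == "__main__":
    for row in smart_dedupe(ROWS):
        print("\t".join(row))
```

One thing to be aware of with this rule (it applies to the formula as well): if a duplicated key has no "AGENT" row at all, every row for that key is dropped, and if it has two "AGENT" rows, both are kept. It may be worth checking the real data for those cases first.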
