Most efficient way to identify duplicates in data?

Copper Contributor

We're moving our data analytics towards Kusto, and one feature we would like is a sanity check that finds duplicate rows in a dataset. The problem I want to solve is making sure we don't accidentally ingest duplicates and then report sales numbers that are too high. Our plan right now is to introduce an extra column with a sha1 hash of the row and then do something like "summarize count() by sha1 | where count_ > 1", but computing that hash would need to touch every column in the dataset. I realize that's in the nature of the problem; I'd just like to investigate strategies that would make this as efficient as possible.


Strategies I've thought of so far are to first limit the dataset to a certain time window, or perhaps to a single customerId. I also know about ingestion tags, but I don't want to impact extents too much since this is more of a sanity check.
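For example, a minimal sketch of the scoped check (again with made-up names, and assuming the sha1 column from our plan already exists by this point):

// Restrict the scan before aggregating: only the last day, optionally only one customer.
SalesEvents
| where OrderTime > ago(1d)
| where CustomerId == "contoso"   // optional narrowing
| summarize count() by RowHash    // RowHash = the per-row sha1 column
| where count_ > 1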


What other strategies could we use to make this operation as efficient as possible?

1 Reply

There are several tactics for dealing with duplicate data, and you may find some of them useful for your use case. Details are in the following article: https://docs.microsoft.com/en-us/azure/data-explorer/dealing-with-duplicates
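For example, one tactic in that family is to filter duplicates out at query time instead of removing them from storage, roughly like this (the key columns below are just placeholders for whatever defines a unique record in your data):

// Keep a single arbitrary row per logical key when reading.
SalesEvents
| summarize take_any(*) by CustomerId, OrderId, OrderTime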