How to: Handle duplicate records in Azure Data Explorer
Published Sep 12 2019 04:04 AM

Azure Data Explorer is an append-only database that isn’t designed to support frequent data deletion. If you accidentally ingest data into Azure Data Explorer multiple times, the following tips can help you handle the duplicate records:

  1. Filter out the duplicate rows at query time. The arg_max() aggregation function can be used to filter out duplicate records and return the last record based on a timestamp (or another column).
  2. Filter duplicates during the ingestion process.
  3. Drop extents with duplicated records and re-ingest the data. 
    // create a table with the extent ids that contain the duplicate data
    .set ExtentsToCompress <| bla // original table name
    | extend eid = extent_id()
    | extend dt = ingestion_time() // one option to find the date
    | where dt between (datetime(2019-09-01) .. datetime(2019-09-02)) // example range; replace with the date range of the duplicate ingestion
    | summarize by eid
    
    // present the extent ids
    ExtentsToCompress
    
    // ingest the distinct rows into a temp table
    // restricting to the affected extents improves performance
    .set BlaTmp <| bla
    | extend eid = extent_id()
    | where eid in (ExtentsToCompress)
    | project-away eid
    | distinct *
    
    // drop the extents that contain duplicate values
    .drop extents <| .show table bla extents | where ExtentId in (ExtentsToCompress)
    
    // re-ingest the distinct values
    .set-or-append bla <| BlaTmp
    
  4. For a small number of records, use the .purge command to remove specific records. Note that data deletion using the .purge command is designed to protect personal data and should not be used in other scenarios. It is not designed to support frequent delete requests or deletion of massive quantities of data, and may have a significant performance impact on the service.
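
The arg_max() approach from tip 1 can be sketched as follows, assuming a hypothetical table bla with a unique key column Id and a timestamp column Timestamp:

    // return only the latest row per Id, hiding earlier duplicates at query time
    bla
    | summarize arg_max(Timestamp, *) by Id

arg_max(Timestamp, *) returns the entire row with the latest Timestamp for each Id, so duplicates are filtered out of the query results without deleting any data.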
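
For tip 4, a purge request can be sketched like the following; the database name and predicate here are assumptions for illustration, and running the command requires purge permissions:

    // remove records matching the predicate (hypothetical database name and predicate)
    .purge table bla records in database MyDatabase <| where Id == "duplicate-key"

Keep in mind that .purge removes every record matching the predicate, including the original, so you may need to re-ingest the rows you want to keep.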

 

For more information on handling queries with duplicate records, read: Handle duplicate data in Azure Data Explorer

Learn more about Azure Data Explorer (Kusto):

  1. Azure Data Explorer
  2. Documentation
  3. Course – Basics of KQL
  4. Query explorer
  5. Azure Portal
  6. User Voice
  7. Cost Estimator

Join us to share questions, thoughts, or ideas about Azure Data Explorer (Kusto) and receive answers from the diverse and knowledgeable Azure Data Explorer community.

 

Azure Data Explorer product team

“Join the conversation on the Azure Data Explorer community”.

2 Comments

Thank you for this information :)

Just wanted to check: instead of the following, when there are large amounts of data:

// re-ingest the distinct values 
.set-or-append bla <| BlaTmp

can we use the .move extents command?

 

.move extents all from table BlaTmp to table bla

I was lucky enough to have some time with one of the product team who highlighted this command to me.

 

 

@RobBarat Yes, and .move extents all even has better performance.

Version history
Last update: Aug 13 2020 09:12 AM