How to: Handle duplicate records in Azure Data Explorer
Published Sep 12 2019 04:04 AM

Azure Data Explorer is an append-only database that isn’t designed to support frequent data deletion. If you accidentally ingest data into Azure Data Explorer multiple times, the following tips can help you handle the duplicate records:

  1. Filter out the duplicate rows at query time. The arg_max() aggregation function can be used to filter out duplicate records and return the last record based on a timestamp (or another column).
  2. Filter duplicates during the ingestion process.
  3. Drop extents with duplicated records and re-ingest the data. 
    // create a table with the extent ids that contain the duplicate data
    .set ExtentsToCompress <| bla // original table name
    | extend eid = extent_id()
    | extend dt = ingestion_time() // one option to find the date
    | where dt between (datetime(2019-09-01) .. datetime(2019-09-02)) // example range; replace with the date range of the duplicate ingestion
    | summarize by eid
    
    // present the extent ids
    ExtentsToCompress
    
    // ingest the distinct rows into a temp table
    // restricting to the affected extents improves performance
    .set BlaTmp <| bla
    | extend eid = extent_id()
    | where eid in (ExtentsToCompress)
    | project-away eid
    | distinct *
    
    // drop the extents that contain duplicate values
    .drop extents <| .show table bla extents | where ExtentId in (ExtentsToCompress)
    
    // re-ingest the distinct values
    .set-or-append bla <| BlaTmp
    
  4. For a small number of records, use the .purge command to remove specific records. Note that data deletion using the .purge command is designed to protect personal data and should not be used in other scenarios. It is not designed to support frequent delete requests or deletion of massive quantities of data, and may have a significant performance impact on the service.
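
The arg_max() approach from tip 1 can be sketched as follows, assuming a hypothetical table bla with a unique key column Id and a timestamp column Timestamp:

    // return only the latest row per Id, hiding earlier duplicates at query time
    bla
    | summarize arg_max(Timestamp, *) by Id

arg_max(Timestamp, *) returns the entire row with the latest Timestamp for each Id, so duplicates are filtered out of the query results without deleting any data.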
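
For tip 4, a purge request can be sketched like the following; the database name and predicate here are assumptions for illustration, and running the command requires purge permissions:

    // remove records matching the predicate (hypothetical database name and predicate)
    .purge table bla records in database MyDatabase <| where Id == "duplicate-key"

Keep in mind that .purge removes every record matching the predicate, including the original, so you may need to re-ingest the rows you want to keep.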

 

For more information on handling queries with duplicate records, read: Handle duplicate data in Azure Data Explorer

Learn more about Azure Data Explorer (Kusto):

  1. Azure Data Explorer
  2. Documentation
  3. Course – Basics of KQL
  4. Query explorer
  5. Azure Portal
  6. User Voice
  7. Cost Estimator

Join us to share questions, thoughts, or ideas about Azure Data Explorer (Kusto) and receive answers from the diverse and knowledgeable Azure Data Explorer community.

 

Azure Data Explorer product team

“Join the conversation on the Azure Data Explorer community”.

2 Comments

Thank you for this information :)

Just wanted to check: instead of the following, when there are large amounts of data:

// re-ingest the distinct values 
.set-or-append bla <| BlaTmp

can we use the .move extents command?

 

.move extents all from table BlaTmp to table bla

I was lucky enough to have some time with one of the product team who highlighted this command to me.

 

 

@RobBarat Yes, and .move extents all even has better performance.

Version history
Last update: Aug 13 2020 09:12 AM