Azure Data Explorer is a big data analytics platform that takes care of everything required to ensure real time decision making can take place, or at least, near real time. This includes data ingestion, data querying, data visualization and data management.
In this blog post you’ll learn how to delete individual records from a table, and how it works behind the scenes.
There are several scenarios in which you might want to delete individual records:
You have multiple devices that continuously send data into a table, and you only want the latest record per device. For this scenario we recommend using Materialized Views.
You sometimes have duplicate records ingested into a table, and you want to remove the duplicates. There are several ways to handle this, please read more details in the Handle duplicate data document.
Your customer asked you to delete the data you have about him, and you want to delete the relevant records from your table, including the storage artifacts in which the data is stored. For this scenario we recommend using Purge.
Your pipeline sometimes ingests corrupted data (for example, a faulty device sends corrupted telemetry). For this scenario you would want to delete these specific records, given a predicate in the form of a KQL query. This blog post will explain how to achieve this with soft-delete.
This blog post will explain how to delete individual records for the last scenario in the list above.
A common scenario for deleting individual records is when you have a table with data coming from IOT devices, and one day you find out that some device has recently started reporting wrong data, and then completely broke and stopped reporting data at all. You then wish to delete the records from the table that were produced by the faulty device in the specific timeframe.
Suppose your company manages a large number of office buildings, and in every room of every building there’s a device that measures the temperature and sends it to a table named DeviceTemperature. Here’s a small data sample from the table:
You know that these thermometers sometimes break, and when they do, they start reporting extremely high temperatures for a while, and then they stop sending data at all. So, you created a script that runs a query once a day to detect such devices, and one day you see that the query returned a result indicating that one device broke:
Your next step is to write a query that returns all the records that were produced by the faulty device, based on these results:
Deleting records by a predicate
To delete records, you’ll need to provide a predicate in the form of a KQL query, that returns all the records that you want to be deleted.
The recommendation is to first run the deletion command in `whatif` mode, in order to see how many records will be deleted if you use the provided predicate, but without deleting anything just yet.
In the result above, you can see that 1857 records were returned by the predicate, and they are spread across three extents.
A quick note on what an “extent” is: Azure Data Explorer is built to support tables with a huge number of records (rows) and large amounts of data. To handle such large tables, each table's data is divided into smaller "chunks" called data shards or extents (the two terms are synonymous). The union of all the table's extents holds the table's data. An extent is a like a type of mini-table.
If the result makes sense to you, then it’s time to actually delete these records, by removing the `with (whatif=true)` part:
How do you read this result? Let’s look at the first record. It says that extent 405da552-04bf-4138-b764-5fbef6513f04 contained 548 records that were returned by the predicate. The command replaced this extent with a new one, 3d93e17d-06d5-4efc-b2f0-209547bf687b, containing all the records from the original extent, except the 548 records that were filtered out by the predicate.
Now when you query the DeviceTemperature table, you won’t get the records that you deleted.
How does this work behind the scenes?
The deletion process is performed using the following steps:
Run predicate query: The table is scanned to identify data extents that contain records to be deleted. The extents identified are those with one or more records returned by the predicate query.
Extents replacement: The identified extents are replaced with new extents that point to the original data blobs, and also have a new hidden column of type bool that indicates per record whether it was deleted or not. Once completed, if no new data is ingested, the predicate query will not return any records if run again.
In this blog post you learnt how to delete individual records by providing a predicate in the form of a KQL query, and how the deletion process works behind the scenes.