Forum Discussion
Azure Data Explorer removing duplicates from Azure Event Hub stream
Hi,
I need a way to preprocess data (I assume before ingestion into ADX) to deduplicate records from Sign-in Logs based on ApplicationID. Essentially, I only need one record per user per application they sign into.
I've been reading about using arg_max() at query time, soft deletes, and materialized views on the ADX platform, per the public docs: https://learn.microsoft.com/en-us/azure/data-explorer/dealing-with-duplicates#solution-3-filter-duplicates-during-the-ingestion-process
...but I don't think these fit my needs.
My assumption is that materialized views and the other approaches Microsoft recommends for de-duping would still retain the original data, and that they only reduce duplicates at query time, since ADX is an append-only resource.
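For reference, the query-time approach from that page would look something like this (the table and column names are my guesses at the Sign-in Logs schema):

    SigninLogs
    | summarize arg_max(TimeGenerated, *) by UserId, ApplicationId

...which only deduplicates at read time; every duplicate row is still stored, which is exactly what I'm trying to avoid.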
I have deployed Azure Stream Analytics to preprocess/parse the columns I need and applied a small temporal window to reduce the output of the Event Hub stream. This has massively reduced the immediate storage footprint; however, the duplication issue remains in the long term.
One resolution I am considering is to instead create an Azure Function that triggers on Event Hub and reads the Kusto table to check whether the record already exists before ingestion.
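For concreteness, the existence check the Function would run per event might be something like this (placeholder names again):

    SigninLogs
    | where UserId == 'incoming-user-id' and ApplicationId == 'incoming-app-id'
    | take 1

If that returns a row, the Function would drop the event instead of ingesting it.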
This could be a very costly operation, though (executing a query against ADX for every event), and I know Azure Stream Analytics outperforms an Azure Function on large data volumes (I would assume latency is also a concern).
This leads to another question: whether, by doing this, I would incur more compute cost than I save in storage cost...
1 Reply
- wernerzirkel (Brass Contributor): That is an architectural question, and the answer depends on many factors: how many events you are going to parse per minute/hour/month, how quickly analysts need the data, how much data it is, how much benefit you would get out of it (use-case worthiness), how long it has to run, whether it has to scale, whether the deduplication logic is going to change in the future... and many more. For this reason I don't think it is realistic to answer your question seriously in a tech forum.
The only hint I can give you is that I have been using deduplication within Kusto at large scale for the past several years. There are three options you might consider:
a) Using materialized views. This is my preferred option; the Kusto team did a pretty good job with this. It works for billions of rows and should handle deduplication on ApplicationIDs easily. It would be my choice for continuous streaming data (see the first sketch after this list).
b) Using update policies. This is not the best option for deduplication, but it can cover the sort of logic you intend to implement in an Azure Function. Also think about a combination of a) and b): ingest into a raw table, handle newly ingested data in an update policy, and then deduplicate via a) on the refined data (see the second sketch below).
c) Using batch jobs in Data Factory to append data. This is only interesting if you need the data at regular intervals, such as hours or days, rather than in streaming mode. If you execute a simple append of mydeduplfunction(mytable) a few times a day, you avoid high CPU load on your cluster (see the third sketch below).
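A minimal sketch of a), assuming a source table named SigninLogs with UserId and ApplicationId columns (all names here are illustrative, not your actual schema):

    .create materialized-view SigninDedup on table SigninLogs
    {
        // Keep a single arbitrary record per user/application pair.
        SigninLogs
        | summarize take_any(*) by UserId, ApplicationId
    }

Queries against SigninDedup only see the deduplicated rows. As you suspected, the source SigninLogs table still stores every ingested record, but you can put a short retention policy on it so the duplicates age out quickly while the view keeps the deduplicated history.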
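A sketch of b), again with illustrative names: a raw table SigninLogsRaw feeds a refined table SigninLogsRefined through a function that runs on every ingestion batch:

    // Function that shapes each newly ingested batch of raw rows.
    .create-or-alter function ParseSignins() {
        SigninLogsRaw
        | project TimeGenerated, UserId, ApplicationId
    }

    // Attach the function to the refined table as an update policy.
    .alter table SigninLogsRefined policy update
    @'[{"IsEnabled": true, "Source": "SigninLogsRaw", "Query": "ParseSignins()", "IsTransactional": false}]'

Note that an update policy only sees the rows of the current ingestion batch, not historical data, so on its own it cannot deduplicate across batches; that is why the combination with a) on the refined table works well.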
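And a sketch of c): a scheduled append command with the deduplication inlined so the example is self-contained (in practice it would sit behind something like the mydeduplfunction mentioned above):

    // Scheduled a few times a day, e.g. from a Data Factory pipeline.
    .set-or-append SigninLogsDeduped <|
        SigninLogsRaw
        | where TimeGenerated > ago(8h)                  // only the recent window
        | summarize take_any(*) by UserId, ApplicationId // one row per pair in the window
        | join kind=leftanti (SigninLogsDeduped | project UserId, ApplicationId)
            on UserId, ApplicationId                     // skip pairs already appended

The leftanti join keeps a pair out of the append if it already landed in a previous run, which is what makes repeated batch runs safe.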