Forum Discussion
jmbowie
Aug 28, 2023 · Copper Contributor
Azure Data Explorer removing duplicates from Azure Event Hub stream
Hi, I need a way to preprocess data (I assume before ingestion into ADX) to deduplicate records based on ApplicationID from the Sign-in Logs. I essentially only need the userIDs from one instance of each ...
wernerzirkel
Sep 01, 2023 · Brass Contributor
That is an architectural question, and the answer depends on many factors: how many events you are going to parse per minute/hour/month, how fast analysts need the data, how much data it is, how much benefit you would get out of it (use-case worthiness), how long it has to run, whether it has to scale, whether the deduplication logic is going to change in the future... and many more. For this reason I don't think your question can be answered seriously in a tech forum.
The only hint I can give you is that I have been using deduplication within Kusto at large scale for the last few years. There are three options you might consider:
a) using materialized views. This is my preferred option, especially for continuous streaming data; the Kusto team did a pretty good job with this. It works for billions of rows and should handle deduplication of ApplicationIDs easily (see the first sketch after this list).
b) using update policies. This is not the best option for deduplication, but it might help with the sort of logic you intend to implement in an Azure Function. Also think about a combination of a) and b): ingest into a raw table, handle newly ingested data in an update policy, and then deduplicate via a) on the refined data (second sketch below).
c) using batch jobs in Data Factory to append data - this is only interesting if you need the data at regular intervals such as hours or days rather than in streaming mode. If you execute a simple append of mydeduplfunction(mytable) a few times a day, you avoid high CPU load on your cluster (third sketch below).
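
To make a) concrete, here is a minimal sketch of a deduplicating materialized view. The table and column names (SigninLogs, ApplicationId, DedupedSignins) are assumptions based on your description, not your actual schema:

    // Maintains exactly one arbitrary row per ApplicationId,
    // kept up to date by the cluster as new data streams in.
    .create materialized-view DedupedSignins on table SigninLogs
    {
        SigninLogs
        | summarize take_any(*) by ApplicationId
    }

Querying DedupedSignins then returns one record per application, so you can project just the userIDs from it.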
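For b), an update policy runs a query over each newly ingested batch of a source table and appends the result to a target table. Because it only sees one batch at a time, it can only deduplicate within that batch, which is why I would pair it with a). A sketch, again with assumed table names (RawSigninLogs as the landing table, SigninLogs as the refined table):

    // Function applied to every freshly ingested batch of the raw table.
    .create-or-alter function DedupNewSignins() {
        RawSigninLogs
        | summarize take_any(*) by ApplicationId
    }

    // Attach it as an update policy on the refined table.
    .alter table SigninLogs policy update
    @'[{"IsEnabled": true, "Source": "RawSigninLogs", "Query": "DedupNewSignins()", "IsTransactional": false}]'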
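And for c), the scheduled append from Data Factory boils down to a control command like the following, where mydeduplfunction stands for whatever stored function wraps your dedup logic (the name is just the placeholder from above):

    // Executed a few times a day, e.g. from a Data Factory pipeline.
    .set-or-append DedupedSignins <|
        mydeduplfunction(SigninLogs)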