Clickstream analytics in Azure Data Explorer

Sher Shah Khan · ‎Nov 03 2020

Clickstream analysis is the process of collecting, analysing and reporting aggregated data about user’s journey on a website. User’s interactions with websites are collected, with applications like Adobe analytics and Tealium. Bringing clickstream data into a centralised place and combining it with other data sources for rich analytics is often required.

Typically clickstream data can reach around 2-3TB/day in size. Using Relational databases to analyse such data might not be suitable or cost effective.

Azure Data Explorer is a fast, fully managed data analytics service for real-time analysis on large volumes of Telemetry, Logs, Time Series data streaming from applications, websites, IoT devices, and more.

Proposed Architecture:

An end to end proposed architecture how you can use Azure Data Explorer to bring Clickstream historical and incremental data into Azure Data Explorer and analyse, aggregate and visualise the data.

Source System:

1.a) Adobe Analytics dataset usually contains about 1000+ fields. Adobe Analytics provides Data feed functionality which can be used to extract the data to Azure Data Lake Store. Data can be extracted with different configurations:

Frequency: daily or hourly
Format: .zip or .gz format
Files: Single vs Multiple

1.b) Tealium datasets are extracted in JSON structure with 100’s of nested fields. The data can be sent to Azure Event hub in near realtime.

Bringing Data into ADX:

Creating the table structure and mapping for 1000+ column is a cumbersome task, it can be made easy with 1 Click Ingestion.

There are multiple ways to load the data.

3) Historical Load using Lightingest:

LightIngest is a command-line utility for ad-hoc data ingestion into Azure Data Explorer.

Simplest and efficient way to load all the historical data, with just one command.

LightIngest.exe "https://ingest-<<ClusterName.ClusterRegion>>.kusto.windows.net;Fed=True"   -database:<<DatabaseName>>   -table:<<TableName>>   -source:" https://<<StorageName>>.blob.core.windows.net/<<ContainerName>>;<<StorageAccountKey>> "   -prefix:"<<FolderPath>>"   -pattern:*.tsv.gz   -format:tsve

Although adobe files are in TSV format but due to special characters in the files, you should use the format TSVE in ADX.

Similarly, for Tealium historical load, use Json format.

Incremental Load:

3.a)

For Adobe batch file loading subscribe to eventgrid.

Alternatively, ADF is useful for loading because you can easily control the loading flow, and manage all your ETL or ELT pipelines from a single tool. Example pipeline (Template attached)

ADF Pipeline.JPG

.ingest command:

Read more about .ingest command. The below sample .ingest command is created dynamically using ADF parameters. It will avoid duplicate loading of same file, encrypt the secrets in logs and create custom creation date.

.ingest into table @{pipeline().parameters.tableName} (h'abfss://@{variables('containerName')}@{pipeline().parameters.storageName}.dfs.core.windows.net/@{variables('folderPath')}/@{pipeline().parameters.fileName};@{activity('GetAccountKeyFromKV').output.value}') with (format='tsve',ingestIfNotExists = '["@{pipeline().parameters.fileName}"]', tags = '["ingest-by:@{pipeline().parameters.fileName}"]',creationTime='@{variables('inputFileDateTime')}')

3.b) For Tealium real time streaming connect Eventhub to ADX table:

4) Expected Compression ratio:

Source	Data Size	Input Format	ADX Mapping Format	Expected Data Size in ADX
Adobe Analytics	XX - TB	TSV	TSVE	XX/10 TB (10x Compression)
Tealium	XX - TB	JSON	JSON	XX/3.5 TB (3.5x Compression)

5) Querying & Analytical workload:

The Analytical work requires more compute but is usually required at working hours. Using a follower cluster will make it easy to pause/resume and optimise for read workload. This will be useful for charge back to different groups and will provide workload isolation.

Cluster pause/resume can be done using Logic app or Azure Automation.

Estimated Cluster Size:

The cluster size will depend on the daily data ingestion and how many days needs to be retained in hot cache. You can estimate your cluster size based on your requirements.

Costing for 2TB daily load with 10x compression.

You can monitor your workload and adjust your cluster accordingly.

kumarneeraj · ‎Nov 04 2020

The article is crystal clear, written in simple language, and easy to understand with proper reasoning behind each chosen option. I although have a few comments/queries.

I assume that we can integrate more than two ADX clusters together with one being primary or the follower cluster, and others as the one which gets the data from different sources, which definitely enables separation of concerns. Is my understanding correct?
Following the point above, if my understanding is correct, can we not have the separation of concern at the primary cluster level? If that is possible, why do we need the follower cluster?
With follower clusters, how do we save on cost? I understand that it will be handled by logic apps or the automation by starting or pausing whenever needed, but if the primary cluster is always running, why can't we utilize it to perform the analysis? We can just increase the compute.
The ingest into table command has the data obfuscation part. For people who do not exactly understand ingestions and KQL, it would be better if that is highlighted and a link provided to the Microsoft documentation. I have found an external link (https://azure-training.com/2019/09/04/data-obfuscation-in-kusto-query-language/).
The availability zones in this case are not enabled. I understand that this is a proposed architecture, but how are the physical failures handled?

I apologize for my queries above. You may definitely choose to not answer them but just tried my luck.

Regards,

Neeraj

Sher Shah Khan · ‎Nov 04 2020

@kumarneeraj , Thanks for your feedback much appreciated.

1) Followers cluster is an in place data sharing functionality and it can be used in multiple scenarios, in my article I explained that it will be useful for workload isolation and consumer/department paying their own cluster bill.

2) We can definitely use the primary cluster for all workload and many customers are using this approach. In some cases data scientist/Analyst do quick testing and POC and bring their own datasets to join with production data. bringing un-managed dataset in the primary production cluster might be a concern in some scenario that why the proposed architecture.

3) Having a very large cluster which can serve both read/write requirements(Autoscale might work as well) might come out to be expensive. If the read and write data retention policy is different it might not be possible in the same cluster as well.

4) Thanks for your feedback i will update the .ingest section. The dynamic .ingest in the blog will hide the secrets, avoid duplicate file loading and also can set a custom creation time if the file arrive a few days late(which is common in prod scenario).

5) Azure Data Explorer is a highly available service with 99.9% uptime. I have not covered the DR scenario in the blog but it is well documented. Also Availability zone is currently only available in few regions worldwide so I might not available in all Azure regions.

Uri Barash · ‎Nov 14 2020

In many case, a single cluster would work extremely well and be very efficient and performant. A very large percentage of the ADX deployments are configured as such.

As Sher wrote, there are scenarios where you would like to separate:

When different departments use it for different purposes - Operational dashboards and data science
When ingestion load can vary a lot, a dedicated ingestion cluster (Compute heavy machines) can scale up and down in higher flexibility than a query cluster (Storage optimized)
When heavy analytics might impact other usages (ingestion dashboarding)
Scaling out for massive concurrent usage by creating more instances of the follower

Products (50)

Special Topics (27)

Video Hub (462)

Most Active Hubs

Most Active Hubs

Video Hub

Clickstream analytics in Azure Data Explorer