How to Monitor Batching Ingestion to ADX in Azure Portal
Published Mar 08 2021 11:23 PM

In this tutorial, you will learn how to use ingestion metrics to monitor batching ingestion to Azure Data Explorer (ADX) in the Azure portal. 

 

Background

 

Batching ingestion is one of the methods used to ingest data into ADX. With this method, ADX batches incoming data chunks to optimize ingestion throughput. The data is batched based on a batching policy defined on the database or on the table. By default, ADX uses a maximum delay time of 5 minutes, 1000 items, and a total size of 1 GB as the batching limits. The batching ingestion goes through several stages, and a specific component is responsible for each of these steps: 

  • For Event Grid, Event Hub and IoT Hub ingestion, there is a Data Connection that gets the data from external sources and performs initial data rearrangement. 
  • The Batching Manager batches the received references to data chunks to optimize ingestion throughput based on a batching policy.
  • The Ingestion Manager sends the ingestion command to the ADX Storage Engine. 
  • The ADX Storage Engine stores the ingested data so it is available for query. 
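
If you want to verify which batching policy actually applies to your database or table, you can query it with a KQL management command. This is a minimal sketch (MyDatabase and MyTable are placeholder names; run each command separately against your cluster, and a null policy means the defaults described above apply):

    // Show the ingestion batching policy defined at the database level.
    .show database MyDatabase policy ingestionbatching

    // Show a table-level override, if one is defined.
    .show table MyTable policy ingestionbatching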

 

By monitoring batching ingestion, you can get information such as ingestion results, the amount of ingested data, ingestion latency, and the batching process itself. When analyzing the amount of data passing through ingestion and the ingestion latency, you can split the metrics by Component Type to better understand the performance of each of the batching ingestion steps.

 

After reading this tutorial you will know how to answer the following questions: 

  1. How can I see the result of my ingestion attempts?
  2. How much data was processed by the ingestion pipeline?
  3. What is the latency of the ingestion process, and did latency build up in the ADX pipeline or upstream of ADX?
  4. How can I better understand the batching process of my cluster during ingestion?
  5. When working with Event Hub, Event Grid and IoT Hub ingestion, how can I compare the number of events that arrived at ADX to the number of events sent for ingestion?

 


 

 

Navigate to the cluster metrics pane and configure the analysis timeframe

 

In this tutorial, we are analyzing data ingestion to ADX during the last 48 hours:

  • Sign in to Azure portal and navigate to your cluster overview page.
  • In the left-hand pane of your ADX cluster, search for metrics.
  • Select Metrics to open the metrics pane and begin analysis on your cluster. 
  • In the upper right corner above the chart, click on the time selector: 

Lea_Shallev_0-1614706677179.png

  • Select the desired time span for analyzing metrics (in this example, the last 48 hours), then select Apply: 

Lea_Shallev_1-1614706677181.png
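
The rest of this tutorial uses the portal charts, but if your cluster's diagnostic settings also route its platform metrics to a Log Analytics workspace, you can run a similar 48-hour analysis with KQL. This is a hedged sketch only: it assumes an "AllMetrics" diagnostic setting is in place, and the internal metric name ("IngestionResult") may differ from the portal display name.

    // Query the exported platform metrics for the last 48 hours
    // (requires a diagnostic setting that sends cluster metrics to Log Analytics).
    AzureMetrics
    | where TimeGenerated > ago(48h)
    | where ResourceProvider == "MICROSOFT.KUSTO"
    | where MetricName == "IngestionResult"
    | summarize IngestedSources = sum(Total) by bin(TimeGenerated, 1h)
    | render timechart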

 

Ingestion result 

 

The Ingestion Result metric gives information about the total number of sources that either failed or succeeded to be ingested. By splitting the metric by status, you can get detailed information about the status of the ingestion operations. 

 

Note:  

  • When using Event Hub or IoT Hub ingestion, events are pre-aggregated into one blob and then treated as a single source to be ingested. Therefore, pre-aggregated events appear as a single ingestion result after pre-aggregation.
  • Transient failures are retried internally up to a limited number of attempts. Each transient failure is reported as a transient ingestion result, so a single ingestion may produce more than one ingestion result. 

 

  • In the metrics pane select the following settings: 

1- Select ingestion result metric.PNG 

Settings | Suggested Value | Field Description
Scope | <Your Cluster Name> | The name of the ADX cluster
Metric Namespace | Kusto Cluster Standard Metrics | A namespace which is the category for the metric
Metric | Ingestion result | The metric name
Aggregation | Sum | The function by which the metric values are aggregated over time. To better understand aggregation, see Changing aggregation.

 

  • You can now see the number of ingestion sources (that either failed or succeeded to be ingested) over time: 

2- ingestion result graph.PNG

 

  • Select Apply splitting above the chart: 

3- apply splitting.PNG

  • Choose the Status dimension to segment your chart by the status of the ingestion operations: 

Lea_Shallev_3-1614707571071.png 

  • After selecting the splitting values, click away from the split selector to close it. Now the chart shows how many ingestion sources ADX tried to ingest over time, and the status of those ingestion operations. There are multiple lines, one for each possible ingestion result. 

5- ingestion result by status graph.PNG

  • In the chart above, you can see three lines: blue for successful ingestion operations, orange for ingestion operations that failed due to “Entity not found”, and purple for ingestion operations that failed due to “Bad request”.
  • The error shown in the chart represents the category of the error code. To see the full list of ingestion error codes by category and to better understand the possible error reasons, see Ingestion error codes in Azure Data Explorer. 
  • To get more details on an ingestion error, you can enable failed ingestion diagnostic logs (take into account that emitting logs creates additional resources and therefore incurs additional cost). 
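
To drill into individual failures without setting up diagnostic logs, you can also query recent ingestion failures directly on the cluster with a KQL management command. A minimal sketch (column names can vary slightly between service versions):

    // Run against the relevant database; lists recent ingestion failures
    // with the failure category and a human-readable reason.
    .show ingestion failures
    | where FailedOn > ago(2d)
    | project FailedOn, Database, Table, FailureKind, Details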

 

The amount of ingested data: 

 

The Blobs Processed, Blobs Received and Blobs Dropped metrics give information about the number of blobs that were processed by the ingestion components. 

  • In the metrics pane select the following settings:

6- select blobs processed.PNG

 

Settings | Suggested Value | Field Description
Scope | <Your Cluster Name> | The name of the ADX cluster
Metric Namespace | Kusto Cluster Standard Metrics | A namespace which is the category for the metric
Metric | Blobs Processed | The metric name
Aggregation | Sum | The function by which the metric values are aggregated over time. To better understand aggregation, see Changing aggregation.

 

  • Select Apply splitting above the chart: 

3- apply splitting.PNG

  • Select the Component Type dimension to segment the chart by different components through ingestion: 

7- split by component type.PNG 

  • If you want to focus on a specific database of your cluster, select Add filter above the chart: 

8- add filter.PNG 

  • Select the database that you want to analyze (this example shows filtering for blobs sent to the GitHub database):

9-filter by database.PNG

  • After selecting the filter values, click away from the Filter Selector to close it. Now the chart shows how many blobs sent to the GitHub database were processed by each of the ingestion components over time: 

10- blobs processed by type in DB.PNG

 

  • In the chart above, you can see that on February 13 there is a decrease in the number of blobs ingested into the GitHub database over time. You can also see that the number of blobs processed at each of the components is similar, meaning that approximately all data processed in the data connection was also processed successfully by the Batching Manager, the Ingestion Manager and the Storage Engine. Therefore, this data is ready for query.
  • To better understand the relation between the number of blobs received at each component and the number of blobs processed successfully at each component, you can add a new chart describing the number of blobs that were sent to the GitHub database and received at each component. 
  • Above the Blobs Processed chart, select New chart: 

11- new chart.PNG

  • Select the following settings for the new chart: 

12- select blobs received.PNG

  • Repeat the splitting and filtering steps above to split the Blobs Received metric by Component Type and filter only blobs sent to the GitHub database. You can now see the following charts next to each other:

 13- blobs received and processed by type.PNG

 

  • Comparing the charts, you can see that the number of blobs received by each component is similar to the number of blobs processed, which means that almost no blobs were dropped during ingestion. 
  • You can also analyze the Blobs Dropped metric following the steps above to see how many blobs were dropped during ingestion and to detect whether there is a problem in processing at a specific component. For each dropped blob you will also get an Ingestion Result metric with more information about the failure reason.
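
To cross-check these metrics against what actually landed in the table, you can count ingested rows over time using ingestion_time(). A minimal sketch, where GitHubEvents is a placeholder table name (this relies on the table's IngestionTime policy, which is enabled by default):

    // Rows ingested per hour over the last 48 hours.
    GitHubEvents
    | where ingestion_time() > ago(48h)
    | summarize IngestedRows = count() by bin(ingestion_time(), 1h)
    | render timechart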

 

Ingestion latency: 

 

Note: According to the default batching policy, the default batching time is 5 minutes. Therefore, the expected latency is roughly 5 minutes when using the default batching policy. 

 

While ingesting data to ADX, it is important to understand the ingestion latency, that is, how much time passes until data is ready for query. The Stage Latency and Discovery Latency metrics are designed to monitor ingestion latency. 

The Stage Latency indicates the timespan from when a message is discovered by ADX, until its content is received by an ingestion component for processing. Stage latency filtered by the Storage Engine component indicates the total ADX ingestion time until data is ready for query.

The Discovery Latency is used for ingestion pipelines with data connections (Event Hub, IoT Hub and Event Grid). This metric gives information about the timespan from data enqueue until discovery by the ADX data connection. This timespan is upstream of ADX, and therefore it is not included in the Stage Latency, which measures only latency inside ADX. 

When you see long latency until data is ready for query, analyzing Stage Latency and Discovery Latency can help you understand whether the delay originates in ADX or upstream of ADX. If the delay is in ADX, you can also detect the specific component responsible for it. 
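
If your ingested records carry their own creation or enqueue timestamp, you can also measure end-to-end freshness per record directly in the table, complementing the portal metrics. A minimal sketch, where GitHubEvents and EventEnqueuedTime are placeholder names for a table and a timestamp column ingested with the data:

    // Per-record end-to-end latency: from the event's own enqueue timestamp
    // until the moment ADX ingested the record.
    GitHubEvents
    | where ingestion_time() > ago(48h)
    | extend IngestedAt = ingestion_time()
    | extend E2ELatency = IngestedAt - EventEnqueuedTime
    | summarize avg(E2ELatency), percentile(E2ELatency, 95) by bin(IngestedAt, 1h)
    | render timechart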

  • In the metrics pane select the following settings: 

14- select stage latency.PNG

Settings | Suggested Value | Field Description
Scope | <Your Cluster Name> | The name of the ADX cluster
Metric Namespace | Kusto Cluster Standard Metrics | A namespace which is the category for the metric
Metric | Stage Latency | The metric name
Aggregation | Avg | The function by which the metric values are aggregated over time. To better understand aggregation, see Changing aggregation.

 

  • Select Apply splitting above the chart: 

3- apply splitting.PNG 

  • Select the Component Type dimension to segment the chart by different components through ingestion:  

7- split by component type.PNG

  • If you want to focus on a specific database of your cluster, select Add filter above the chart: 

8- add filter.PNG

  • Select which database values you want to include when plotting the chart (this example shows filtering for blobs sent to the GitHub database):

9-filter by database.PNG

  • After selecting the filter values, click away from the Filter Selector to close it. Now the chart shows the latency of ingestion operations sent to the GitHub database at each of the ingestion components over time: 

15- stage latency by component type in DB.PNG

 

  • In the chart above, you can see that the latency at the data connection is approximately 0 seconds. This makes sense, since Stage Latency only measures latency from when a message is discovered by ADX. 
  • You can also see that the longest time passes from when the Batching Manager receives data to when the Ingestion Manager receives data. In the chart above it took around 5 minutes, as we used the default batching policy for the GitHub database and the default batching time is 5 minutes. We can conclude that time was apparently the sole seal reason for these batches. This conclusion is examined in detail in the Understand the Batching Process section. 
  • Finally, the Storage Engine latency in the chart represents the latency until data is received by the Storage Engine, so it shows the average total latency from the time data is discovered by ADX until it is ready for query. In the graph above it is 5.2 minutes on average. 

 

  • If you use ingestion with data connections, you may also want to estimate the latency upstream of ADX over time, as long end-to-end latency may also be caused by delays before ADX actually gets the data for ingestion. For that purpose, you can use the Discovery Latency metric. 
  • Above the chart you have already created, select New chart: 

11- new chart.PNG 

  • Select the following settings to see the average Discovery Latency over time: 

16.0 - select discovery latency.PNG

  • Repeat the splitting steps above to split the Discovery Latency by Component Type, which here represents the type of data connection that discovered the data.
  • After selecting the splitting values, click away from the split selector to close it. Now you have a chart for Discovery Latency: 

16- discovery latency graph.PNG

  • You can see that almost all the time the Discovery Latency is close to 0 seconds, meaning that ADX got the data immediately after it was enqueued. The highest peak, of around 300 milliseconds, occurs around February 13 at 14:00, meaning that at this time the ADX cluster got the data around 300 milliseconds after it was enqueued.

 

Understand the Batching Process: 

 

The Batch blob count, Batch duration, Batch size and Batches processed metrics provide information about the batching process: 

  • Batch blob count: the number of blobs in a completed batch for ingestion. 
  • Batch duration: the duration of the batching phase in the ingestion flow. 
  • Batch size: the uncompressed expected data size in an aggregated batch for ingestion. 
  • Batches processed: the number of batches completed for ingestion. 

 

  • In the metrics pane select the following settings: 

17- select batches processed.PNG

 

Settings | Suggested Value | Field Description
Scope | <Your Cluster Name> | The name of the ADX cluster
Metric Namespace | Kusto Cluster Standard Metrics | A namespace which is the category for the metric
Metric | Batches Processed | The metric name
Aggregation | Sum | The function by which the metric values are aggregated over time. To better understand aggregation, see Changing aggregation.

  • Select Apply splitting above the chart: 

3- apply splitting.PNG 

  • Select the Batching Type dimension to segment the chart by the batch seal reason (whether the batch was sealed because it reached the batching time, data size, or number-of-files limit set by the batching policy): 

18- split by batching type.PNG 

  • If you want to focus on a specific database of your cluster, select Add filter above the chart: 

8- add filter.PNG 

  • Select the database that you want to analyze (this example shows filtering for blobs sent to the GitHub database):

9-filter by database.PNG

  • After selecting the filter values, click away from the Filter Selector to close it. Now the chart shows the number of sealed batches with data sent to the GitHub database over time, split by Batching Type: 

19- batches processed graph.PNG

  • You can see that there are 2-4 batches per time unit over time, and that all batches are sealed by time, as estimated in the Ingestion latency section, where you saw that it took around 5 minutes to batch data based on the default batching policy.
  • Select Add chart above the chart and repeat the filtering steps above to create additional charts for the Batch blob count, Batch duration and Batch size metrics on the desired database. Use the Avg aggregation when creating these charts. 

21-batch duration.PNG 22-batch size.PNG 19- batches processed graph.PNG 20-batch blob count.PNG 

From the charts you can conclude some insights: 

  • The average number of blobs in the batches is around 160 blobs over time, and then it decreases to 60-120 blobs. As there are around 280 processed blobs per time unit on February 14 in the Batching Manager (see The amount of ingested data section) and about 3 processed batches per time unit, this makes sense. Based on the default batching policy, a batch can be sealed when its blob count reaches 1000 blobs. As we don't reach 1000 blobs in less than 5 minutes, we indeed don't see batches sealed by count. 
  • The average batch duration is 5 minutes. Note that the default batching time defined in the batching policy is 5 minutes, and it may significantly affect ingestion latency. On the other hand, a batching time that is too small can cause ingestion commands to carry too little data, reducing ingestion efficiency and requiring post-ingestion resources to optimize the small data shards produced by non-batched ingestion. 
  • In the batch size chart, you can see that the average size of batches is around 200-500 MB over time. Note that the optimal amount of data to be ingested in bulk is 1 GB of uncompressed data, which is also defined as a seal condition by the default batching policy. As 1 GB of data does not accumulate within the 5-minute time frames, batches aren't sealed by size. When looking at the size, you should also consider the tradeoff between latency and efficiency explained above.
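
If these charts show that the time-based seal adds more latency than your scenario can tolerate, the batching policy can be tuned with a KQL management command. This is a minimal sketch only: GitHubEvents and the values are illustrative, and you should choose them with the latency/efficiency tradeoff above in mind.

    // Lower the maximum batching time to 2 minutes for one table, keeping the
    // default item-count and size seal conditions; run against the relevant database.
    .alter table GitHubEvents policy ingestionbatching '{"MaximumBatchingTimeSpan": "00:02:00", "MaximumNumberOfItems": 1000, "MaximumRawDataSizeMB": 1024}'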

 

Compare data connection incoming events to the number of events sent for ingestion 

 

When using Event Hub, IoT Hub or Event Grid ingestion, you can compare the number of events received by ADX to the number of events sent from the Event Hub to ADX. The Events Received, Events Processed and Events Dropped metrics enable this comparison. 

  • In the metrics pane select the following settings: 

23- select events received.PNG

 

Settings | Suggested Value | Field Description
Scope | <Your Cluster Name> | The name of the ADX cluster
Metric Namespace | Kusto Cluster Standard Metrics | A namespace which is the category for the metric
Metric | Events Received | The metric name
Aggregation | Sum | The function by which the metric values are aggregated over time. To better understand aggregation, see Changing aggregation.

 

  • Select Add filter above the chart: 

8- add filter.PNG 

  • Select the Component Name property to filter the events received by a specific data connection defined on your cluster: 

24-filter by component name.PNG

  • After selecting the filtering values, click away from the Filter Selector to close it. Now the chart shows the number of events received by the selected data connection over time: 

24- events received graph.PNG

 

Looking at the chart above you can see that the data connection called GitHubStreamingEvents got around 200-500 events over time. 

  • To see whether any events were dropped by ADX, you can focus on the Events Dropped and Events Processed metrics. 
  • In the chart you created, select Add metric: 

25 select add metric.PNG 

  • Select Events Processed as the Metric, and Sum for the Aggregation. 
  • Repeat these steps to add the Events Dropped metric for the data connection.
  • The chart now shows the number of Events that were received, processed and dropped by the GitHubStreamingEvents data connection over time:

26-events graph.PNG

 

  • In the chart above you can see that almost all received events were processed successfully by the data connection. There is one dropped event, which is consistent with the failed ingestion result due to bad request that we saw in the Ingestion result section.

 

  • You may also want to compare the number of Events Received to the number of events sent from the Event Hub to ADX. 
  • On the chart select Add metric. 
  • Click on the Scope to select the desired Event Hub namespace as the scope of the metric. 
  • In the panel that opens, deselect the ADX cluster, search for the namespace of the Event Hub that sends data to your data connection, and select it:

27- select a scope.PNG

  • Select Apply 
  • Select the following settings: 

28- select outgoing messages.PNG

 

Settings | Suggested Value | Field Description
Scope | <Your Event Hub Namespace Name> | The name of the Event Hub namespace that sends data to your data connection
Metric Namespace | Event Hub standard metrics | A namespace which is the category for the metric
Metric | Outgoing Messages | The metric name
Aggregation | Sum | The function by which the metric values are aggregated over time. To better understand aggregation, see Changing aggregation.

  • Click away from the settings to get the full chart, which compares the number of events processed by the ADX data connection to the number of events sent from the Event Hub: 

29-all events metrics graph.PNG

 

  • In the chart above you can see that all events that were sent from the Event Hub were processed successfully by the ADX data connection. 

Note:

  • If you have more than one Event Hub in the Event Hub namespace, you should filter the Outgoing Messages metric by the Entity Name dimension to get data only from the desired Event Hub in your namespace.
  • There is no option to monitor outgoing messages per consumer group. When several consumer groups are defined on the Event Hub, the Outgoing Messages metric counts the total number of messages consumed by all consumer groups. Therefore, if you have several consumer groups on your Event Hub, you may see a larger number of Outgoing Messages than Events Received.

 


 

 

 

 

 

 

 

 

 
