Blog Post

FastTrack for Azure

Monitoring consumer group processing using custom metrics

yodobrin
Feb 08, 2023
Authors:
- @yaronpri is a Principal Consultant in EMEA.
- chgeuer and yodobrin are Principal Customer Engineers in the Growth & Innovation team.

 

Use Case

When using Azure Event Hubs, it is important and useful to monitor the 'lag' of each consumer group. 'Lag' refers to the number of messages/events that have not yet been processed by a consumer group. The larger the lag, the more the consumer group is "falling behind". Monitoring this property can help identify overloaded consuming applications: increasing lag indicates that the consuming application is not able to keep up with the load of the event hub. The ability to monitor the 'lag' is not part of the built-in metrics provided by Azure Event Hubs; for the built-in metrics, please review this document.


In a queue-based system (such as Azure Storage queues or Azure Service Bus), the 'lag' would correspond to the queue length, i.e. the number of unprocessed messages.

 

High Level Solution Approach

How does it work?

Both producers and consumers follow the guidelines in this QuickStart.

The application (deployed as an Azure Container App) iterates over the consumer groups and, for each partition, calculates the lag by subtracting the checkpointed sequence number from the last enqueued sequence number; both values are available through the Azure Event Hubs SDK. It then sends these calculations as custom metrics, scoped to the Event Hubs resource, to Azure Monitor using REST calls.
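
As a rough illustration, here is a minimal, hedged sketch of the per-partition lag calculation using the Azure.Messaging.EventHubs SDK. The namespace, event hub name, and the GetCheckpointedSequenceNumberAsync helper are hypothetical placeholders, not the solution's actual code:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Messaging.EventHubs.Consumer;

// Hedged sketch: enumerate the partitions of one consumer group and compute
// per-partition lag. Names marked 'assumed' are placeholders.
await using var consumer = new EventHubConsumerClient(
    "$Default",                                   // consumer group
    "mynamespace.servicebus.windows.net",         // assumed fully qualified namespace
    "myeventhub",                                 // assumed event hub name
    new DefaultAzureCredential());                // Azure AD auth (e.g. managed identity)

foreach (string partitionId in await consumer.GetPartitionIdsAsync())
{
    PartitionProperties properties = await consumer.GetPartitionPropertiesAsync(partitionId);

    long checkpointed = await GetCheckpointedSequenceNumberAsync(partitionId);
    long lag = Math.Max(0, properties.LastEnqueuedSequenceNumber - checkpointed);
    Console.WriteLine($"partition {partitionId}: lag = {lag}");
}

// Hypothetical stub standing in for reading the checkpointed sequence number
// from the checkpoint store (see "What is checkpointing?" below).
static Task<long> GetCheckpointedSequenceNumberAsync(string partitionId) =>
    Task.FromResult(0L);
```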

Azure Monitor supports custom metrics, i.e. emitting a custom metric to an existing Azure resource, such as an Event Hubs instance. This allows users to enrich their Azure Monitor dashboards and build richer queries. In this solution, we use a dockerized .NET application, running in Azure Container Apps, to augment an Azure Event Hubs instance with the per-consumer-group lag, allowing customers to closely monitor the workload of the Event Hubs consumers.

What is checkpointing?

Checkpointing is a process where consumers mark their position in a partition's event sequence. It is the responsibility of each consumer in a consumer group to keep track of its current position in the event stream. When a reader reconnects, it begins reading from the last checkpoint submitted by the previous reader of that partition in the consumer group. Checkpointing provides failover resiliency and enables event stream replay. For further reading, please review this document.
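
For illustration, here is a minimal, hedged sketch of checkpointing with the Azure.Messaging.EventHubs.Processor SDK; the storage account, container, namespace, and event hub names are assumed placeholders:

```csharp
using System;
using System.Threading.Tasks;
using Azure.Identity;
using Azure.Messaging.EventHubs;
using Azure.Messaging.EventHubs.Processor;
using Azure.Storage.Blobs;

// Hedged sketch: an EventProcessorClient that records its position
// (a checkpoint) in blob storage after handling each event.
var checkpointStore = new BlobContainerClient(
    new Uri("https://mystorage.blob.core.windows.net/checkpoints"), // assumed container
    new DefaultAzureCredential());

var processor = new EventProcessorClient(
    checkpointStore,
    "$Default",                            // consumer group
    "mynamespace.servicebus.windows.net",  // assumed namespace
    "myeventhub",                          // assumed event hub
    new DefaultAzureCredential());

processor.ProcessEventAsync += async args =>
{
    // ... handle args.Data here ...
    await args.UpdateCheckpointAsync();    // persist position to blob storage
};
processor.ProcessErrorAsync += args => Task.CompletedTask; // required handler

await processor.StartProcessingAsync();
await Task.Delay(TimeSpan.FromSeconds(30)); // process for a while
await processor.StopProcessingAsync();
```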

 

The solution periodically scans the checkpoints, created by the Event Hubs SDK in blob storage, to calculate the lag for the different Event Hubs partitions. It continuously sends these deltas as custom metrics to Azure Monitor, along with the event hub name, partition ID, and consumer group name.
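
To make the scan concrete, here is a hedged sketch of reading one checkpoint. At the time of writing, the blob checkpoint store used by the Azure.Messaging.EventHubs processor names checkpoint blobs '{namespace}/{eventhub}/{consumergroup}/checkpoint/{partitionId}' and stores the sequence number in blob metadata; treat the exact layout as an assumption to verify against your SDK version, and the storage account and names as placeholders:

```csharp
using System;
using Azure.Identity;
using Azure.Storage.Blobs;

// Hedged sketch: read the checkpointed sequence number that the processor SDK
// stores as metadata on a checkpoint blob.
var container = new BlobContainerClient(
    new Uri("https://mystorage.blob.core.windows.net/checkpoints"), // assumed
    new DefaultAzureCredential());

// Assumed blob layout: {fully-qualified-namespace}/{eventhub}/{consumergroup}/checkpoint/{partitionId}
string blobName = "mynamespace.servicebus.windows.net/myeventhub/$default/checkpoint/0";

var properties = await container.GetBlobClient(blobName).GetPropertiesAsync();
long checkpointedSequenceNumber = long.Parse(properties.Value.Metadata["sequencenumber"]);
Console.WriteLine($"checkpointed sequence number: {checkpointedSequenceNumber}");
```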

 

These metrics can then be presented as part of the metrics of the Event Hubs instance:


You can also split the metric by partition, consumer group, or event hub name.

In the example above, we can see that some partitions handle the load better than others.

Securely connecting to the backend

The application continuously monitors the lag between events in the Event Hubs partitions and the recorded state in Azure blob storage, and emits the delta to Azure Monitor, i.e. it reads from Event Hubs and Storage, and writes to Azure Monitor. The application connects to all three services with Azure AD authentication, using its user-assigned managed identity. For further reading about identities, please see this document.
 
Azure AD tokens usually have a lifetime of an hour, i.e. for continuous service usage, such access tokens must be refreshed at regular intervals.
 
  • For blob storage access, the application uses the regular .NET storage SDK; the BlobContainerClient handles access token refresh internally on demand.
  • For Event Hubs access, the application uses both the Event Hubs SDK (the EventHubConsumerClient class) and plain REST calls to retrieve the number of partitions, the list of consumer groups, and the concrete partition properties.
  • To emit metrics to Azure Monitor, the application directly calls the Azure Monitor metrics store using the REST API (see the sketch below).
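
To make that REST call concrete, here is a hedged sketch of emitting a 'Lag' custom metric against the Event Hubs namespace resource. The region, resource ID, metric namespace, and dimension values are placeholders, and the request shape follows the Azure Monitor custom metrics documentation; verify the regional endpoint and token scope for your cloud:

```csharp
using System;
using System.Net.Http;
using System.Net.Http.Headers;
using System.Text;
using System.Text.Json;
using System.Threading;
using Azure.Core;
using Azure.Identity;

// Hedged sketch: POST one 'Lag' data point to the Azure Monitor custom
// metrics endpoint, scoped to the Event Hubs namespace resource.
var credential = new DefaultAzureCredential();
AccessToken token = await credential.GetTokenAsync(
    new TokenRequestContext(new[] { "https://monitoring.azure.com/.default" }), // scope per custom metrics docs
    CancellationToken.None);

// Placeholders: substitute your subscription, resource group, namespace, and region.
string resourceId = "/subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.EventHub/namespaces/<ns>";
string endpoint = $"https://westeurope.monitoring.azure.com{resourceId}/metrics";

var payload = new
{
    time = DateTime.UtcNow.ToString("o"),
    data = new
    {
        baseData = new
        {
            metric = "Lag",
            @namespace = "CustomMetrics",
            dimNames = new[] { "EventHubName", "ConsumerGroup", "PartitionId" },
            series = new[]
            {
                new
                {
                    dimValues = new[] { "myeventhub", "$Default", "0" }, // assumed dimension values
                    min = 42, max = 42, sum = 42, count = 1              // one observed lag value
                }
            }
        }
    }
};

using var http = new HttpClient();
http.DefaultRequestHeaders.Authorization = new AuthenticationHeaderValue("Bearer", token.Token);
HttpResponseMessage response = await http.PostAsync(
    endpoint, new StringContent(JsonSerializer.Serialize(payload), Encoding.UTF8, "application/json"));
response.EnsureSuccessStatusCode();
```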

As a result, for both Event Hubs and Azure Monitor, we continuously need a valid access token. The application's TokenStore.RefreshCredentialOnDemand method locally keeps a fresh set of access tokens around (and refreshes them on demand, i.e. around 5 minutes prior to expiry), so continuous access to the backing services is ensured.
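
A minimal sketch of that on-demand refresh pattern (not the solution's actual TokenStore code) might look like this:

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;
using Azure.Core;
using Azure.Identity;

// Hedged sketch of on-demand token refresh: cache the access token and only
// request a new one when it is within five minutes of expiry.
sealed class CachedTokenStore
{
    private readonly TokenCredential _credential = new DefaultAzureCredential();
    private AccessToken _token; // default ExpiresOn forces a refresh on first use

    public async ValueTask<string> GetTokenAsync(string scope, CancellationToken ct = default)
    {
        if (_token.ExpiresOn - DateTimeOffset.UtcNow < TimeSpan.FromMinutes(5))
        {
            _token = await _credential.GetTokenAsync(new TokenRequestContext(new[] { scope }), ct);
        }
        return _token.Token;
    }
}
```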

 

 

Using tokens eliminates the need to store connection strings or other sensitive credentials, providing a more secure mechanism for accessing the backing services and for communication between systems.

 

What's included in the solution?

 

We provide source code and Bicep deployment scripts, which create the solution with all required settings. For step-by-step instructions on how to deploy the solution, please refer to the Azure-Samples/eventhub-custom-metrics-emitter repository here.

 

Updated Feb 08, 2023
Version 1.0

4 Comments

  • stefanhuditech

    Good article, I especially like the very concise and understandable explanation of what 'lag' actually is.

     

    I just wanted to mention two things:

    • If you are on the Premium or Dedicated SKUs of Event Hubs, lag metrics are already available (you can enable them via Diagnostic Settings). https://learn.microsoft.com/en-us/azure/event-hubs/monitor-event-hubs-reference#application-metrics-logs is the documentation for this.
    • If you are on the Standard or Basic SKUs and you do not want to run this yourself (as explained in this article by yodobrin and demonstrated in the linked GitHub repo), you can also use a Marketplace app that we created (https://azuremarketplace.microsoft.com/en-us/marketplace/apps/huditechughaftungsbeschrnkt1673457598758.lag-metrics?tab=overview), which performs the task as a managed solution. One advantage is that it finds all Event Hubs automatically and writes lag metrics for all of them.
  • Hi Jayendran 

    In regards to your question, lag is a metric that helps you understand whether your consumers are able to keep up with the throughput your workload currently generates.

    Also, if the 'total time' (from when a message enters the event hub until it is processed) is critical to you, then it's very important to know this number.

    Throttling is something different: it means that the capacity currently allocated to your event hub is not able to handle the incoming requests. It's worth checking the 'throughput units' allocated to your Event Hubs namespace, checking the Azure documentation for how many incoming messages these units can handle, and possibly enabling 'auto-inflate throughput units' for your event hub (also described in the Azure documentation).

    Here is a blog post discussing throttling: How exactly does Event Hubs Throttling Work? - Microsoft Community Hub.

    This document explains the quotas and limits - https://learn.microsoft.com/en-us/azure/event-hubs/event-hubs-quotas

  • Nice article nadavbh / yodobrin, thanks for this. I have one question: what is the difference between throttling and lag?

    I actually created an alert based on throttling. I wondered: do I need to worry about the lag too?

    Before answering this, I'll give my actual use case so you can understand better. My producer in this case is Azure (diagnostic logs) from all resources (let's say 1000 different Azure resources). My consumer is an Azure Function (Event Hub trigger). Sometimes throttling spikes and leads me to miss lots (GBs) of events (diagnostic logs). Hence I created the alerts based on throttling. I'm wondering if there is any benefit for me to use these lag metrics.