Monitoring consumer group processing using custom metrics
Published Feb 08 2023 12:47 AM
Microsoft
Authors:
- @yaronpri is a Principal Consultant in EMEA.
- @chgeuer and @yodobrin are Principal Customer Engineers in the Growth & Innovation team.

 

Use Case

When using Azure Event Hubs, it is important and useful to monitor the 'lag' of each consumer group. 'Lag' refers to the number of messages/events that have not yet been processed by a consumer group. The larger the lag, the more the consumer group is "falling behind". Monitoring this property can help identify overloaded consuming applications: an increasing lag indicates that the consuming application is not able to keep up with the load of the event hub. Monitoring the 'lag' is not part of the built-in metrics provided by Azure Event Hubs; for the built-in metrics, please review this document.


In a queue-based system (such as Azure Storage queues or Azure Service Bus), the 'lag' would correspond to the queue length, i.e. the number of unprocessed messages.

 

High Level Solution Approach

(design.png: high-level solution design)

How does it work?

Both producers & consumers follow this QuickStart guide.

The application (deployed as an Azure Container App) iterates over the consumer groups and, for each partition, calculates the lag by subtracting the checkpointed sequence number from the last enqueued sequence number; both values are available through the Azure Event Hubs SDK. It then sends these calculations as custom metrics, scoped to the Event Hubs resource, to Azure Monitor using REST calls.
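The per-partition arithmetic can be sketched as follows. `partition_lag` is a hypothetical helper for illustration, not code from the solution; the assumption that an empty partition reports a last enqueued sequence number of -1 matches our reading of the SDK's partition properties.

```python
from typing import Optional

def partition_lag(last_enqueued_sequence_number: int,
                  checkpoint_sequence_number: Optional[int]) -> int:
    """Lag for one partition: events enqueued but not yet checkpointed.

    If the consumer group has never checkpointed this partition, every
    enqueued event counts as unprocessed. Sequence numbers start at 0,
    so the event count is last_enqueued + 1 in that case; an empty
    partition (assumed to report -1) then yields a lag of 0.
    """
    if checkpoint_sequence_number is None:
        return last_enqueued_sequence_number + 1
    return last_enqueued_sequence_number - checkpoint_sequence_number
```

Summing this value over all partitions of a consumer group gives the total lag reported for that group.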

Azure Monitor supports custom metrics, i.e. emitting a custom metric against an existing Azure resource, such as an Event Hubs instance. This allows users to enrich their Azure Monitor dashboards and build richer queries. In this solution, we use a dockerized .NET application, running in Azure Container Apps, to augment an Azure Event Hubs instance with the per-consumer-group lag, allowing customers to closely monitor the workload of the Event Hubs consumers.
 

 

What is checkpointing?

 

Checkpointing is a process where consumers mark their position in a partition event sequence. It's the responsibility of each consumer in a consumer group to keep track of their current position in the event stream. When a reader reconnects, it begins reading from the last checkpoint submitted by the last reader in the consumer group. Checkpointing provides failover resiliency and enables event stream replay. For further reading please review this document.
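As a rough sketch of where the solution finds these checkpoints: the blob checkpoint store used by the Event Hubs SDK names checkpoint blobs by namespace, event hub, consumer group, and partition, and records the position in blob metadata. The exact layout below is our understanding of the SDK's convention, so treat it as illustrative rather than authoritative.

```python
def checkpoint_blob_name(fully_qualified_namespace: str,
                         eventhub_name: str,
                         consumer_group: str,
                         partition_id: str) -> str:
    """Blob name the SDK's blob checkpoint store is understood to use.

    The namespace, hub, and consumer group portions are lower-cased;
    the checkpoint position itself lives in the blob's metadata
    (keys such as 'sequencenumber' and 'offset'), not in its content.
    """
    return (f"{fully_qualified_namespace.lower()}/"
            f"{eventhub_name.lower()}/"
            f"{consumer_group.lower()}/checkpoint/{partition_id}")
```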

 

The solution periodically scans the checkpoints, created by the Event Hubs SDK in blob storage, to calculate the lag against the different Event Hubs partitions. The solution continuously sends these deltas as custom metrics to Azure Monitor, alongside the event hub name, partition id, and consumer group name.

 

These metrics can then be presented as part of the metrics of the Event Hubs instance:


 

(yodobrin_0-1675770347464.png: custom metrics shown on the Event Hubs instance)

 

You can also split the metric by partition, consumer group, or event hub name.

 

(yodobrin_2-1675770631644.png: metric split by partition)

 

In the example above, we can see that some partitions are handling the load better than others.

 

Securely connecting to the backend

The application continuously monitors the lag between events in the Event Hubs partitions and the recorded state in Azure Blob Storage, and emits the delta to Azure Monitor, i.e. it reads from Event Hubs and Storage, and writes to Azure Monitor. The application connects to all three services with Azure AD authentication, using its __user-assigned managed identity__. For further reading about identities, please see this document.
 
Azure AD tokens usually have a token lifetime of an hour, i.e. for continuous service usage, such access tokens must be refreshed at regular intervals.
 
  • For Blob Storage access, the application uses the regular .NET Storage SDK; the SDK, via the BlobContainerClient class, handles access-token refresh internally on demand.
  • For Event Hubs access, the application uses both the Event Hubs SDK (the EventHubConsumerClient class) and plain REST calls to retrieve the number of partitions, the list of consumer groups, and the concrete partition properties.
  • To emit metrics to Azure Monitor, the application calls the Azure Monitor metric store directly using the REST API.
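A minimal sketch of the request body sent to the custom-metrics endpoint (POST against `https://<region>.monitoring.azure.com/<resourceId>/metrics`). The metric name "Lag", the namespace, and the dimension names below are illustrative choices, not values mandated by the API; the `baseData` envelope follows the Azure Monitor custom-metrics REST schema.

```python
from datetime import datetime, timezone

def build_lag_metric(eventhub_name: str, consumer_group: str,
                     partition_id: str, lag: int) -> dict:
    """Build one custom-metric data point, dimensioned by event hub,
    consumer group, and partition, for a single lag observation."""
    return {
        "time": datetime.now(timezone.utc).isoformat(),
        "data": {
            "baseData": {
                "metric": "Lag",
                "namespace": "EventHub consumer lag",
                "dimNames": ["EventHubName", "ConsumerGroup", "PartitionId"],
                "series": [{
                    "dimValues": [eventhub_name, consumer_group, partition_id],
                    # A single observation: min == max == sum, count == 1.
                    "min": lag, "max": lag, "sum": lag, "count": 1,
                }],
            }
        },
    }
```

Because the metric carries these dimensions, the Azure portal can later split the chart by partition, consumer group, or event hub name.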

As a result, for both Event Hubs and Azure Monitor, we continuously need a valid access token. The application's TokenStore.RefreshCredentialOnDemand method keeps a fresh set of access tokens locally (and refreshes them on demand, i.e. around 5 minutes prior to expiry), so continuous access to the backing services is ensured.
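The refresh-on-demand idea can be sketched like this. `CachedToken` and `fetch` are hypothetical stand-ins, not the solution's actual TokenStore implementation; `fetch` represents a real credential call (e.g. a managed-identity token request) returning a token and its expiry.

```python
from datetime import datetime, timedelta, timezone

# Refresh tokens this long before they expire, mirroring the
# "around 5 minutes prior to expiry" behavior described above.
REFRESH_MARGIN = timedelta(minutes=5)

class CachedToken:
    """Caches an access token and refreshes it shortly before expiry."""

    def __init__(self, fetch):
        self._fetch = fetch          # callable returning (token, expires_on)
        self._token = None
        self._expires_on = None

    def get(self, now=None):
        now = now or datetime.now(timezone.utc)
        # Refresh on first use, or once we enter the margin before expiry.
        if self._token is None or now >= self._expires_on - REFRESH_MARGIN:
            self._token, self._expires_on = self._fetch()
        return self._token
```

Callers simply request a token before each REST call; the cache decides whether a network round-trip to the identity endpoint is actually needed.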

 

 

Using tokens eliminates the need to store connection strings or any other sensitive credentials, providing a more secure mechanism for accessing the information required for communication between the systems.

 

What's included in the solution?

 

We provide source code and Bicep deployment scripts, which create the solution with all required settings. For step-by-step instructions on how to deploy the solution, please refer to the Azure-Samples/eventhub-custom-metrics-emitter repository here.

 
