Cluster monitoring is essential to guarantee your users can always get performant and reliable service.
Metrics are one of the foundation stones for monitoring resources in Azure. For the Azure Data Explorer service specifically, a few key metrics are essential for monitoring cluster health and performance, for example CPU, cache utilization factor, ingestion utilization, and keep alive.
In this article we’ll focus on the Keep alive metric and where and how it should be used.
Behind the Keep alive metric is a simple synthetic query that is sent to the cluster every minute to verify that the cluster is responding properly.
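Conceptually, a probe of this kind can be sketched as follows. This is only an illustration of the idea and not the service's actual implementation: the function names, the `run_query` callable, and the rule that any exception maps to 0 are all assumptions made for the example.

```python
import time
from typing import Callable, Iterator


def probe_once(run_query: Callable[[], object]) -> int:
    """Run one synthetic query and return 1 if it completed, 0 if it raised.

    `run_query` is a hypothetical callable that sends a trivial query to the
    cluster and raises an exception on failure or timeout.
    """
    try:
        run_query()
        return 1
    except Exception:
        # Assumption for this sketch: any failure counts as "not healthy".
        return 0


def probe_every_minute(
    run_query: Callable[[], object], interval_seconds: float = 60.0
) -> Iterator[int]:
    """Yield one probe result per interval, indefinitely."""
    while True:
        yield probe_once(run_query)
        time.sleep(interval_seconds)
```

In practice `run_query` would wrap whatever client call you use to send a trivial query to the cluster; it is left abstract here on purpose.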
This metric is used to quickly expose health and performance issues with the cluster. It was not designed for, and is not aimed at, measuring the cluster's Service Level Agreement (SLA).
Keep alive cannot reliably serve as an SLA metric because a cluster can fail to respond for a variety of reasons: some are caused by service issues and some by user actions. One example of a user action is overloading the cluster with too many queries, which causes the CPU to spike and makes the cluster unresponsive. An SLA measurement should cover only service issues that make the service unavailable. For more details on the Azure Data Explorer SLA, see https://azure.microsoft.com/en-us/support/legal/sla/data-explorer/v1_0/
The Keep alive metric can have 3 values:
0 – A response was sent back from the synthetic test indicating that the cluster is not in a healthy state.
1 – A valid response was sent back by the cluster. A healthy and fully functioning cluster will return ‘1’ in the response.
Null – The synthetic query was not sent to the cluster for a specific minute, so no response was received. Null only means that there was no request and no response, which is acceptable when it happens sporadically.
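The three values can be told apart in code along these lines. This is a minimal sketch, assuming the per-minute samples have already been exported as raw values into a list in which `None` stands for Null; the function names and labels are hypothetical and not part of any Azure SDK.

```python
from collections import Counter
from typing import Optional, Sequence


def classify_sample(value: Optional[float]) -> str:
    """Map one per-minute Keep alive sample to a label (labels are illustrative)."""
    if value is None:
        return "no_probe"   # Null: no request was sent, so there is no response
    if value == 1:
        return "healthy"    # 1: the cluster returned a valid response
    if value == 0:
        return "unhealthy"  # 0: the response indicates an unhealthy cluster
    # Assumption: samples are raw, not averaged, so no other value is expected.
    raise ValueError(f"unexpected Keep alive value: {value!r}")


def summarize(samples: Sequence[Optional[float]]) -> Counter:
    """Count how many samples fall under each label."""
    return Counter(classify_sample(v) for v in samples)
```

For example, `summarize([1, 1, None, 0, 1])` counts three healthy minutes, one unhealthy minute, and one minute with no probe.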
A healthy cluster reports a steady value of ‘1’ over time.
Receiving a ‘0’ should be treated as a warning sign. A single ‘0’ response could be a transitory issue, but a series of ‘0’ responses should trigger an analysis of the state of the cluster.
In some cases, a persistent issue is caused by a user action, and the user needs to identify and remediate the problem. Such problems can be caused by overloading the cluster resources with a high load of ingestion and queries, faulty queries, faulty application of policies, and so on.
In other cases, a persistent lack of responsiveness might be related to a service issue. In those cases, the service ops team, which monitors the metric, can identify that the root cause is not a user operation and will handle the issue.
A user who identifies a persistent drop in the Keep alive metric should further analyze the health of the cluster, either with the “Insights” blade in the Azure portal or by examining other metrics to uncover more specific problem areas such as query load, ingestion load, or not enough SSD space.
Our recommendation is to track the Keep alive metric and, if the value is ‘0’ for 5 consecutive minutes, start analyzing the cluster health and open a support ticket if needed.
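The 5-minute rule can be expressed as a simple streak check. This is a minimal sketch, assuming one sample per minute in chronological order with `None` standing for Null; the function name is hypothetical, and treating a Null minute as breaking the streak is an assumption of this example rather than documented behavior.

```python
from typing import Optional, Sequence


def should_investigate(
    samples: Sequence[Optional[float]], threshold: int = 5
) -> bool:
    """Return True if at least `threshold` consecutive samples are 0."""
    run = 0
    for value in samples:
        if value == 0:
            run += 1
            if run >= threshold:
                return True
        else:
            # Assumption: both a 1 and a Null (None) reset the streak.
            run = 0
    return False
```

With this check, a single ‘0’ or a short burst of zeros does not raise a flag, which matches the guidance that an isolated ‘0’ may be transitory.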