Azure Monitor: Use Dynamic Thresholds in Log Alerts

BrunoGabrielli

Microsoft

Apr 17, 2023

Hello, howdy readers!

It has been a while since my last post, and I decided it was time to write a new one.

In this new blog post I am going to explain how to use dynamic thresholds in log alerts. Imagine for a second that you need to create a single alert that applies to more than one resource and, at the same time, reacts to a different threshold for each of them.

The solution lies in the power of the Kusto Query Language (KQL).

This language offers several operators and functions that allow us to manipulate data to extract values, extend fields and compare values with thresholds.

Let’s take as an example the data collected through VM Insights. It collects operating system performance data for Logical Disk, Memory, Network and Processor and, if enabled, it also collects information about processes, ports and connections for each target computer.

The reason I am taking VM Insights as an example is not only because data collection is done in a platform-agnostic way, meaning that data is stored using the same counter name for both Windows and Linux, but also because it collects some important metadata together with the performance counters. For the purposes of this post, metadata is the key to setting up dynamic thresholds.

To see the metadata, you can query the InsightsMetrics table and get, among other fields, the Tags field as part of each record. The Tags field holds a JSON property bag, so the queries in this post pass it through parse_json() to read individual keys.

To explore the tags, you can use queries like the ones below:

Query #1: Number of processors

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Name == "UtilizationPercentage"
| summarize arg_max(TimeGenerated, *) by Computer

Query #2: Memory size

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Name == "AvailableMB"
| summarize arg_max(TimeGenerated, *) by Computer

Query #3: Disk size

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Name == "FreeSpaceMB"
| summarize arg_max(TimeGenerated, *) by Computer
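
To read a single value out of Tags, the field can be parsed and indexed by key. The sketch below replaces InsightsMetrics with a small inline table so it can be run in any workspace; the computer names, timestamps and values are invented for illustration, while the tag key vm.azm.ms/memorySizeMB is the one used later in this post.

```kusto
// Stand-in for InsightsMetrics; every row below is invented sample data.
datatable(TimeGenerated:datetime, Computer:string, Name:string, Val:real, Tags:string)
[
    datetime(2023-04-17 10:00:00), "vm-small", "AvailableMB", 900.0,  '{"vm.azm.ms/memorySizeMB":4095.55}',
    datetime(2023-04-17 10:01:00), "vm-small", "AvailableMB", 880.0,  '{"vm.azm.ms/memorySizeMB":4095.55}',
    datetime(2023-04-17 10:01:00), "vm-large", "AvailableMB", 5200.0, '{"vm.azm.ms/memorySizeMB":16383.55}'
]
| where Name == "AvailableMB"
// Keep only the newest record per computer.
| summarize arg_max(TimeGenerated, *) by Computer
// Parse the JSON in Tags, pick one key, convert it to a number and round it.
| extend memorySizeMB = round(toreal(parse_json(Tags).["vm.azm.ms/memorySizeMB"]))
| project Computer, TimeGenerated, Val, memorySizeMB
```

With this sample data, vm-small keeps only its 10:01 record and the memory sizes round to 4096 and 16384.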

With that in mind, imagine that you want to create an alert for low available memory that fires on different conditions (or thresholds) based on the memory size. How can we make that possible? We need to know the value of what can be identified as the decision maker: our beloved memory size.

This is where query #2, or a similar one, comes in handy. We can use it to match each virtual machine's amount of memory against the ranges we are going to establish, each with an absolute threshold expressed in MB. Speaking of ranges, say that we would like to be alerted when the available memory drops below the thresholds reported in the table below:

Memory size (MB)	Available memory threshold (MB)
Up to 4096	512
Up to 8192	768
Up to 16384	1024
Up to 32768	2048
Larger	3072

As said, the threshold differs according to the memory size, and our goal for this log alert is to assemble a query that:

Retrieves the amount of memory:

We can use the let statement to set a view containing the computer name and the corresponding memory size. We can use a modified version of query #2 to reach this part of the goal:

let memSizing = InsightsMetrics
| where Origin == "vm.azm.ms"
| where Name == "AvailableMB"
| summarize arg_max(TimeGenerated, *) by Computer
| extend memorySizeMB = round(toreal(parse_json(Tags).["vm.azm.ms/memorySizeMB"]))
| distinct Computer, memorySizeMB;

Sets the threshold:

The threshold can be set in the query that actively looks for the alert condition. Once the current value has been retrieved, we can join it with the view created in the previous step, using the computer name as the key. At this point we just have to extend a column with the threshold value, calculated using the case function. As an example, the partial query to achieve this second part of the goal would look like:

InsightsMetrics
| where Origin == "vm.azm.ms"
| where Name == "AvailableMB"
| summarize freeMb = avg(Val) by Computer, bin(TimeGenerated, 5m)
| summarize arg_max(TimeGenerated,*) by Computer
| join kind=innerunique (memSizing) on Computer
| extend mbThreshold = case(memorySizeMB <= 4096, 512,
                            memorySizeMB <= 8192, 768,
                            memorySizeMB <= 16384, 1024,
                            memorySizeMB <= 32768, 2048,
                            3072)
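
As a minimal sketch of the join and the case function working together, the query below swaps both the memSizing view and the averaged readings for invented inline tables (the computer names and numbers are made up), so it can be run without any VM Insights data.

```kusto
// Invented sample data standing in for the memSizing view.
let memSizing = datatable(Computer:string, memorySizeMB:real)
[
    "vm-a", 4096.0,
    "vm-b", 8192.0,
    "vm-c", 65536.0
];
// Invented sample data standing in for the latest averaged readings.
datatable(Computer:string, freeMb:real)
[
    "vm-a", 700.0,
    "vm-b", 700.0,
    "vm-c", 700.0,
    "vm-d", 700.0    // no sizing row, so the inner join drops this computer
]
| join kind=innerunique (memSizing) on Computer
// case() checks its predicates in order and returns the value paired with
// the first one that is true; the last argument is the fallback.
| extend mbThreshold = case(memorySizeMB <= 4096, 512,
                            memorySizeMB <= 8192, 768,
                            memorySizeMB <= 16384, 1024,
                            memorySizeMB <= 32768, 2048,
                            3072)
| project Computer, memorySizeMB, freeMb, mbThreshold
```

With these rows, mbThreshold comes out as 512 for vm-a, 768 for vm-b and 3072 for vm-c, so the same 700 MB of available memory is fine on vm-a but below the threshold on the other two.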

At this point, since we have all the necessary information, why not set up a percentage value as an additional threshold? Our alert could fire either if the available memory drops below the MB threshold or below the percentage threshold. Adding the line below (similar to the extend mbThreshold one) will do the trick:

| extend percThreshold = case(memorySizeMB <= 4096, 10,
                              memorySizeMB <= 8192, 9,
                              memorySizeMB <= 16384, 8,
                              memorySizeMB <= 32768, 7,
                              6)

Compares the value of available memory with the threshold(s) set up according to the memory size:

NOTE: If we decide to also compare with the percentage-based threshold, we need to calculate the percentage of available memory because this is not natively collected. One additional line based on the extend operator, like the one below, is enough:

| extend freePercentage = (freeMb / memorySizeMB) * 100

At this point we are ready to compare and project the results by adding the comparison logic:

| where freeMb < mbThreshold or freePercentage < percThreshold
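
To see how the two thresholds interact, here is a small sketch that applies the same extend and where lines to an invented inline table (computer names, memory sizes and readings are made up for illustration).

```kusto
// Invented sample data: memory size and latest average available memory per computer.
datatable(Computer:string, memorySizeMB:real, freeMb:real)
[
    "vm-1", 32768.0, 2200.0,
    "vm-2", 8192.0,  750.0,
    "vm-3", 16384.0, 6000.0
]
| extend mbThreshold = case(memorySizeMB <= 4096, 512, memorySizeMB <= 8192, 768, memorySizeMB <= 16384, 1024, memorySizeMB <= 32768, 2048, 3072)
| extend percThreshold = case(memorySizeMB <= 4096, 10, memorySizeMB <= 8192, 9, memorySizeMB <= 16384, 8, memorySizeMB <= 32768, 7, 6)
// The percentage is not collected natively, so derive it from the two MB values.
| extend freePercentage = (freeMb / memorySizeMB) * 100
// A row is returned when either threshold is breached.
| where freeMb < mbThreshold or freePercentage < percThreshold
| project Computer, memorySizeMB, freeMb, mbThreshold, freePercentage, percThreshold
```

With these invented numbers, vm-1 is returned only because of the percentage check (2200 MB is above 2048 MB, but roughly 6.7% is below 7%), vm-2 only because of the MB check (750 MB is below 768 MB while roughly 9.2% is above 9%), and vm-3 is not returned at all.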

And the presentation logic:

| project TimeGenerated, Computer, memorySizeMB, percThreshold, freePercentage, mbThreshold, freeMb

The final query would look like this (the first line is a comment and is optional):

// ## AVAILABLE MEMORY (FreeMb or %) - INSIGHTSMETRICS - WINDOWS & LINUX SERVERS WITH DYNAMIC THRESHOLDS
let memSizing = InsightsMetrics
    | where Origin == "vm.azm.ms"
    | where Name == "AvailableMB"
    | summarize arg_max(TimeGenerated, *) by Computer
    | extend memorySizeMB = round(toreal(parse_json(Tags).["vm.azm.ms/memorySizeMB"]))
    | distinct Computer, memorySizeMB;
InsightsMetrics
| where Origin == "vm.azm.ms"
| where Name == "AvailableMB"
| summarize freeMb = avg(Val) by Computer, bin(TimeGenerated, 5m)
| summarize arg_max(TimeGenerated, *) by Computer
| join kind=innerunique (memSizing) on Computer
| extend mbThreshold = case(memorySizeMB <= 4096, 512, memorySizeMB <= 8192, 768, memorySizeMB <= 16384, 1024, memorySizeMB <= 32768, 2048, 3072)
| extend percThreshold = case(memorySizeMB <= 4096, 10, memorySizeMB <= 8192, 9, memorySizeMB <= 16384, 8, memorySizeMB <= 32768, 7, 6)
| extend freePercentage = (freeMb / memorySizeMB) * 100
| where freeMb < mbThreshold or freePercentage < percThreshold
| project 
    TimeGenerated,
    Computer,
    memorySizeMB,
    percThreshold,
    freePercentage,
    mbThreshold,
    freeMb

Using the same concepts, you can create dynamic threshold log alerts for low free disk space. A sample query would look like:

// ## DISK SPACE (FreeMb or %) - INSIGHTSMETRICS - WINDOWS & LINUX SERVERS WITH DYNAMIC THRESHOLDS
let diskSizeAndFreeSpaceMb = InsightsMetrics
    | where Origin == "vm.azm.ms"
    | where Name == "FreeSpaceMB"
    | extend diskUnit = tostring(parse_json(Tags).["vm.azm.ms/mountId"])
    | summarize arg_max(TimeGenerated, *) by Computer, diskUnit
    | extend diskSizeMb = round(toreal(parse_json(Tags).["vm.azm.ms/diskSizeMB"]))
    | extend freeMB = Val
    | distinct TimeGenerated, Computer, diskUnit, diskSizeMb, freeMB, _ResourceId;
InsightsMetrics
| where Origin == "vm.azm.ms"
| where Name == "FreeSpacePercentage"
| extend diskUnit = tostring(parse_json(Tags).["vm.azm.ms/mountId"])
// ## Start of exclusion list
// ## End of exclusion list
| summarize (TimeGenerated, freePerc) = arg_max(TimeGenerated, Val) by Computer, diskUnit, _ResourceId
| join (diskSizeAndFreeSpaceMb) on Computer, diskUnit
| extend percThreshold = case(
    diskSizeMb <= 51200, 10.0, // 50 GB
    diskSizeMb <= 102400, 8.0, // 100 GB
    diskSizeMb <= 204800, 6.0, // 200 GB
    diskSizeMb <= 307200, 4.0, // 300 GB
    diskSizeMb <= 1024000, 2.0, // 1 TB
    0.5)
| extend mbThreshold = case(
    diskSizeMb <= 51200, 200, // 50 GB
    diskSizeMb <= 102400, 1024, // 100 GB
    diskSizeMb <= 204800, 2048, // 200 GB
    diskSizeMb <= 307200, 3072, // 300 GB
    diskSizeMb <= 1024000, 10240, // 1 TB
    20480)
| extend percThreshold = iif((_ResourceId contains "microsoft.compute/virtualmachines") and (diskUnit == "D:"), 0.0, percThreshold)
| extend mbThreshold = iif((_ResourceId contains "microsoft.compute/virtualmachines") and (diskUnit == "D:"), 0, mbThreshold)
| where freePerc < percThreshold or freeMB < mbThreshold
| project TimeGenerated, Computer, diskUnit, diskSizeMb, percThreshold, freePerc, mbThreshold, freeMB
| sort by Computer asc, diskUnit asc
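
The exclusion list markers in the query above are left empty. One possible way to fill them, shown here only as a sketch, is with plain where filters on Computer and diskUnit. The inline table, the computer names and the mount points below are all invented for illustration; which disks to exclude is entirely up to your environment.

```kusto
// Invented sample rows; in the real query this would be the InsightsMetrics
// pipeline right after the diskUnit column has been extended.
datatable(Computer:string, diskUnit:string, freePerc:real)
[
    "vm-a", "C:", 5.0,
    "vm-a", "E:", 1.0,
    "vm-b", "/", 3.0,
    "vm-b", "/mnt/scratch", 0.5
]
// ## Start of exclusion list
// Hypothetical rule: skip one specific disk on one specific computer.
| where not(Computer == "vm-a" and diskUnit == "E:")
// Hypothetical rule: skip a mount point on every computer.
| where diskUnit != "/mnt/scratch"
// ## End of exclusion list
```

With this sample data, only the C: drive of vm-a and the root mount of vm-b remain for the threshold comparison.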

I am sure you also noted one more advantage (maybe from the comment line at the top of the query): this query can be used for both Windows and Linux operating systems thanks to VM Insights.

Surely the examples above can be added to your set of server monitoring queries, but the main goal is to give you the right mindset to search for metadata and create dynamic log alerts.

Thanks for reading, and see you in the next post!

Disclaimer
The sample scripts are not supported under any Microsoft standard support program or service. The sample scripts are provided AS IS without warranty of any kind. Microsoft further disclaims all implied warranties including, without limitation, any implied warranties of merchantability or of fitness for a particular purpose. The entire risk arising out of the use or performance of the sample scripts and documentation remains with you. In no event shall Microsoft, its authors, or anyone else involved in the creation, production, or delivery of the scripts be liable for any damages whatsoever (including, without limitation, damages for loss of business profits, business interruption, loss of business information, or other pecuniary loss) arising out of the use of or inability to use the sample scripts or documentation, even if Microsoft has been advised of the possibility of such damages.

Updated Apr 17, 2023

Version 2.0
