SOLVED

Heartbeat Azure Monitor OMS VMs


Hi all, I am trying to create an alert that fires if a VM has not sent a heartbeat in the last 15 minutes. Here is what I did; unfortunately, it did not fire an alert.

 

I created a new alert rule in Azure Monitor (the new Alerts experience).

 

I used the following query:

 

Heartbeat
| where TimeGenerated > ago(1d)
| summarize max(TimeGenerated) by Computer
| where max_TimeGenerated < ago(15m)

 

This returns a result if the VM has not sent a heartbeat in the last 15 minutes.

I set the alert logic to Number of results, Greater than 0,

evaluated based on a Period of 15 minutes and a Frequency of 5 minutes.

 

I stopped the agent, and the query did return more than 0 results, but the alert did not trigger. Does anyone have a working example for alerting on heartbeats?

 

Thanks.

5 Replies
best response confirmed by Stanislav Zhelyazkov (MVP)
Solution

Hi

Your query is correct, but you should probably remove

| where TimeGenerated > ago(1d)

because when the query is used in an alert, the time span is defined by the alert itself (the Period). For a heartbeat alert you want the evaluation period to be longer than 15 minutes. Make it at least one hour, though 24 hours is probably better, since that was the time span in your query. With that setting you should get an alert roughly 15 minutes after the VM goes down. Keep in mind that the VM has to be down for at least 15 minutes. If it goes down for only 5 minutes you will probably not be alerted, because heartbeat events will start being sent again, so the condition that the last heartbeat is older than 15 minutes is never met.
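
To illustrate why the length of the evaluation period matters, here is a minimal, self-contained sketch. It does not read the real Heartbeat table: SampleHeartbeat, the computer names, and eval_time are made up for illustration. eval_time stands in for now() so the result is reproducible, and the first where line imitates the time window that the alert's Period applies to the query.

```kusto
// Hypothetical evaluation time, used instead of now() for a reproducible result.
let eval_time = datetime(2018-05-30 12:00:00);
// Hypothetical sample data standing in for the Heartbeat table.
let SampleHeartbeat = datatable(TimeGenerated: datetime, Computer: string)
[
    datetime(2018-05-30 11:59:00), "vm-healthy",   // still sending heartbeats
    datetime(2018-05-30 11:30:00), "vm-stopped",   // last heartbeat 30 minutes ago
    datetime(2018-05-30 11:29:00), "vm-stopped"
];
SampleHeartbeat
// Imitates the alert Period. With 24h the stopped VM's old rows are still visible.
| where TimeGenerated > eval_time - 24h
| summarize LastHeartbeat = max(TimeGenerated) by Computer
// Same check as in the original query, relative to eval_time.
| where LastHeartbeat < eval_time - 15m
```

With the 24h window this returns one row, for vm-stopped. If you change 24h to 15m, the only row left in the window is the 11:59 heartbeat from vm-healthy, so the query returns nothing: a VM that has been silent for the whole period has no rows in the window at all and can never appear in the results.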

Let me know if you have further questions.

After removing the one-day filter line, you can instead filter on an explicit time range, like this:

let start_time = startofday(datetime("2018-05-30"));
let end_time = startofday(datetime("2018-05-31"));
Heartbeat
| where TimeGenerated > start_time and TimeGenerated < end_time

Question: is the following configuration fine for alerting, or should I increase the frequency?

Query:
Heartbeat
| summarize ["Last Heartbeat"]=max(TimeGenerated) by Computer
| where ["Last Heartbeat"] < ago(15m)
 
Based on:
Number of results
Condition:
Greater than
Threshold:
0
 
Evaluated based on:
Period: 1440 minutes
Frequency: 15 minutes
 

Keep in mind that if, for example, a server goes down and is unavailable for an hour, you will receive roughly 4-5 alerts for the same server within that hour. This is because your period is 24 hours. My recommendation is to set the Period and Frequency to the same value, for example 15 minutes. That way you will not receive so many alerts for the same thing.

Surely there is a better solution for this? My use case doesn't work:

 

1. Create a computer group

2. Alert when an agent in the computer group has not "heartbeated" for over 24 hours.

 

By the logic in Alerts, even if I write the query as below, the time span that I define is ignored because of the "Period" in Alerts:


Heartbeat
| project TimeGenerated, Computer
| where TimeGenerated < now()
| where Computer in (COMPUTERGROUP)
| summarize ["Last Heartbeat"]=max(TimeGenerated) by Computer
| where ["Last Heartbeat"] < ago(24h)

Is there any way to get around this extremely limiting design?
