SOLVED

Alerting on Heartbeat issue

Iron Contributor

Surely there is a better solution for this? My use case doesn't work:

 

1. Create a computer group

2. Alert when an agent in computer group has not "heartbeated" for over 24 hours

 

By the logic in Alerts, even if I set the query as I do below, the time span that I define is ignored because of the "Period" in Alerts:


Heartbeat
| project TimeGenerated, Computer
| where TimeGenerated < now()
| where Computer in (COMPUTERGROUP)
| summarize ["Last Heartbeat"]=max(TimeGenerated) by Computer
| where ["Last Heartbeat"] < ago(24h)

This query, when run outside of Alerts, returns several machines that have sent no heartbeat in the last 24 hours, looking back as far as we have been collecting the data. But because Alerts confines me to a maximum 24-hour period to check against, I get 0 results.

I essentially want an alert generated every 24 hours as a "nag alert" with a list of the machines that have not sent heartbeat data in over 24 hours. 

Is there any way to get around this extremely limiting design?

4 Replies
best response confirmed by Scott Allison (Iron Contributor)
Solution

@Scott Allison 

 

As you state, Alerts are designed to look back over 24 hours at most.

 

For a report of this nature, I'd suggest a Logic App, something like this mock-up. It fires at a pre-set time (recurrence), runs your query, then sends an email (you could send to Teams, Slack, ServiceNow, etc. instead or in parallel).

 

Annotation 2019-04-15 171941.jpg
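
As a rough sketch of the query such a Logic App could run on its daily recurrence: the version below is a variation of the query from the question with an explicit lookback, so it does not depend on a 24-hour alert period. The 30-day lookback and the column names LastHeartbeat and HoursSinceLastHeartbeat are assumptions for illustration, and COMPUTERGROUP is the same placeholder used in the question.

```kusto
// Sketch only: list computers in the group whose latest heartbeat is older than 24 hours.
// Assumption: 30d is an illustrative lookback; pick a value within your data retention.
Heartbeat
| where TimeGenerated > ago(30d)
| where Computer in (COMPUTERGROUP)   // placeholder for your computer group, as in the question
| summarize LastHeartbeat = max(TimeGenerated) by Computer
| where LastHeartbeat < ago(24h)
// Hours elapsed since the last heartbeat, handy for the body of the nag email
| extend HoursSinceLastHeartbeat = datetime_diff('hour', now(), LastHeartbeat)
| order by LastHeartbeat asc
```

This assumes the time range configured on the Logic App's query step is at least as wide as the lookback in the query. Computers that have sent no heartbeat at all within the lookback will not appear in the results.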

 

I also like the example availability rate query:

// Availability rate
// Calculate the availability rate of each connected computer
Heartbeat
// bin_at is used to set the time grain to 1 hour, starting exactly 24 hours ago
| summarize heartbeatPerHour = count() by bin_at(TimeGenerated, 1h, ago(24h)), Computer
| extend availablePerHour = iff(heartbeatPerHour > 0, true, false)
| summarize totalAvailableHours = countif(availablePerHour == true) by Computer 
| extend availabilityRate = totalAvailableHours*100.0/24
| project-away totalAvailableHours 
| render barchart 

Note: I added the last two lines, as I prefer how it looks as a chart  
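
If you would rather get a list than a chart, a small variation might look like the sketch below. It limits the data to the last 24 hours explicitly, so the result does not depend on the time range picked in the portal, and it keeps only computers that missed at least one hour. The 100 percent threshold is an assumption for illustration.

```kusto
// Sketch only: availability over the last 24 hours, listing computers that missed at least one hour.
Heartbeat
| where TimeGenerated > ago(24h)
// One row per computer per hour in which at least one heartbeat arrived
| summarize heartbeatPerHour = count() by bin_at(TimeGenerated, 1h, ago(24h)), Computer
| summarize totalAvailableHours = countif(heartbeatPerHour > 0) by Computer
| extend availabilityRate = totalAvailableHours * 100.0 / 24
| where availabilityRate < 100   // assumed threshold; lower it to reduce noise
| order by availabilityRate asc
```

Note that a computer with no heartbeat at all in the last 24 hours produces no rows here, so this complements the "last heartbeat" query rather than replacing it.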

Thanks Clive.

This is a pretty straightforward recommendation. It does, however, stray from our deliberate move to Azure Monitor for all alerting (taking advantage of Action Groups and automation). It would be nice to have the option to remove some of these "guardrails" for Alerts... or at the very least, have a viable explanation as to why the guardrail is necessary.

 

cc: @Daniel Thilagan 

Hi @Scott Allison 

 

The best public explanation I've seen is:

https://feedback.azure.com/forums/267889-log-analytics/suggestions/32043751-alerting-timewindow-limi... which gives an explanation. You could add your 'vote' to it.

 

Thanks Clive 

Voted and commented.