VM Unresponsive or Down (Heartbeat) Alerts


Hi! The business would like to be alerted as soon as possible (ideally in under 5 minutes) if a VM becomes unresponsive or goes down. We use Azure Monitor Log Metric Alerts to check VM heartbeats across our various subscriptions. The alert rule fires when a VM misses 4 heartbeats in a 5-minute period. The problem is that we receive lots of false positives: the alert fires because the monitoring agent can't talk to the Azure Monitor service for 4 out of 5 minutes, even though there is absolutely nothing wrong with the VM or the components running on it. We receive so many false positives on VMs in our various subscriptions that everyone completely ignores the alerts. Does anyone have any suggestions on how to alert on unresponsive VMs in a timely manner while reducing or eliminating false positives?
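
To make the rule concrete, here is a rough Python sketch of the condition as I understand it. It is not the actual alert definition: the function names are made up for illustration, and it assumes the agent is expected to send one heartbeat per minute, so a "missed heartbeat" is a one-minute slot in the window with no heartbeat record.

```python
from datetime import datetime, timedelta


def missed_heartbeats(heartbeats, window_end, window_minutes=5):
    """Count one-minute slots in the window that contain no heartbeat.

    Assumption: one heartbeat is expected per minute.
    """
    window_start = window_end - timedelta(minutes=window_minutes)
    seen_slots = set()
    for ts in heartbeats:
        # Only heartbeats inside [window_start, window_end) count.
        if window_start <= ts < window_end:
            slot = int((ts - window_start).total_seconds() // 60)
            seen_slots.add(slot)
    return window_minutes - len(seen_slots)


def should_alert(heartbeats, window_end, window_minutes=5, missed_threshold=4):
    """Hypothetical stand-in for the rule: alert on >= 4 misses in 5 minutes."""
    missed = missed_heartbeats(heartbeats, window_end, window_minutes)
    return missed >= missed_threshold


if __name__ == "__main__":
    end = datetime(2021, 4, 1, 12, 5)
    # Only one heartbeat arrived in the last 5 minutes: 4 missed, alert fires.
    print(should_alert([datetime(2021, 4, 1, 12, 0, 30)], end))
```

The check only sees whether heartbeat records arrived, so an agent that cannot reach the service looks exactly like a VM that is down, which is where the false positives come from.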

Thanks!
