For IT Operations teams, performance metrics are essential to monitoring server health. They give you an early warning of potential problems and quantifiable data beyond “the computers are running slow”. They also give you data to back up a recommendation to upgrade a system. This monitoring basic has been around ever since server operating systems have existed. It translates to a Cloud world too, especially if you’re running virtual machines that are not set to auto scale. Historically, you may have set a “high but not too high” alert threshold boundary (say, CPU usage greater than 70 or 80 percent) and even a low threshold (which can signify that processes, applications or user loads are lower than normal).
With Azure Automation, you might have extended this to plug in some Runbook steps to execute if an alert is triggered. But beyond that, the configuration of server metrics monitoring has been a very subjective task. It usually involves a few initial weeks of pain as you tweak the thresholds to find the sweet spot of what is outside of the normal operating range for that system.
I’m enjoying presenting at Microsoft Ignite the Tour about Azure Monitor, which pulls together monitoring and alerting capabilities from different applications, different operating systems and different physical locations. It’s great to get started with Azure Monitor on your first Cloud virtual machines and then deploy the monitoring agent to your on-premises servers too. And while a “single glass of pain” across all of your systems is nice, our latest announcement gives you a better glimpse into how Cloud power can truly help your IT operations.
Azure Monitor now supports Dynamic Thresholds for metric alerts
Now you have the option to not “hard code” your alert threshold numbers and instead let our advanced machine learning analyze the behaviour of your system and identify patterns and anomalies. This is not a one-off event. Over time, this dynamic threshold will refine itself, even taking into account regular “anomalies” (with Smart Pattern Recognition) such as a dip in usage on the weekends. That becomes recognised as normal for that system.
Dynamic Thresholds make it easy to push out alert rules across multiple systems and applications at scale, which would normally require setting individual thresholds for each system. And Dynamic Thresholds are currently free to try while the service is in public preview.
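As a sketch of what that at-scale deployment could look like, a single metric alert rule can target several resources at once through the `scopes` array of the ARM `Microsoft.Insights/metricAlerts` resource. The subscription ID, resource group and VM names below are hypothetical placeholders:

```python
# Sketch: one alert rule covering several VMs, assuming the ARM
# "Microsoft.Insights/metricAlerts" schema. All IDs here are made up.

SUB = "00000000-0000-0000-0000-000000000000"  # placeholder subscription ID

def vm_id(resource_group: str, vm_name: str) -> str:
    """Build a full Azure resource ID for a virtual machine."""
    return (f"/subscriptions/{SUB}/resourceGroups/{resource_group}"
            f"/providers/Microsoft.Compute/virtualMachines/{vm_name}")

def alert_rule(name: str, scopes: list) -> dict:
    """One rule body whose 'scopes' array targets every VM in the list."""
    return {
        "name": name,
        "location": "global",
        "properties": {
            "severity": 3,
            "enabled": True,
            "scopes": scopes,                 # many resources, one rule
            "evaluationFrequency": "PT5M",    # evaluate every 5 minutes
            "windowSize": "PT5M",             # over a 5-minute window
        },
    }

rule = alert_rule("cpu-dynamic-alert",
                  [vm_id("prod-rg", vm) for vm in ("web01", "web02", "web03")])
```

The point of the sketch is the `scopes` list: the same dynamic threshold rule is learned and applied per resource, so you define the rule once rather than hand-tuning a threshold per VM.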
Enabling a Dynamic Threshold Alert
Within Azure Monitor, when you add a metric alert you’re now presented with the option to set the threshold as Dynamic.
Now there are some options you can tweak:
Operator – Alert only when the metric is greater than the dynamic threshold, only when it is less than it, or when it is in either of those states.
Threshold sensitivity – How far does the metric have to vary from the threshold before the alert should fire? Do you want it to be highly sensitive to changes or less sensitive?
Frequency, number of violations and time period – These let us fine-tune the noise of the alerts. If the CPU spikes once and doesn’t recur, do you need an alert?
And like other Azure Monitor alerts, you can then determine what needs to happen when the alert is generated – who is notified, what severity is set and whether an automation runbook or third-party support ticket is triggered.
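As an illustration of how those knobs fit together, they map onto fields of the rule’s criteria object. Here is a minimal sketch, assuming the ARM `metricAlerts` schema’s `DynamicThresholdCriterion` field names (the criterion name and default values are illustrative, not recommendations):

```python
# Sketch of a dynamic threshold criterion as it might appear in an alert
# rule body, assuming the ARM "DynamicThresholdCriterion" field names.

def dynamic_cpu_criterion(operator: str = "GreaterThan",
                          sensitivity: str = "Medium",
                          evaluation_periods: int = 4,
                          min_failing: int = 3) -> dict:
    """Operator, sensitivity and failing periods map to the options above."""
    assert operator in ("GreaterThan", "LessThan", "GreaterOrLessThan")
    assert sensitivity in ("Low", "Medium", "High")
    return {
        "criterionType": "DynamicThresholdCriterion",
        "name": "HighCpu",                   # illustrative criterion name
        "metricName": "Percentage CPU",
        "timeAggregation": "Average",
        "operator": operator,                # above, below, or either side
        "alertSensitivity": sensitivity,     # how far from normal before firing
        "failingPeriods": {
            # e.g. alert only if 3 of the last 4 evaluations violate the
            # threshold -- filters out a single one-off CPU spike
            "numberOfEvaluationPeriods": evaluation_periods,
            "minFailingPeriodsToAlert": min_failing,
        },
    }

criterion = dynamic_cpu_criterion()
```

Note how the "number of violations and time period" setting becomes the `failingPeriods` pair: a lone spike fails one evaluation out of four and never fires the alert.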
Note: Once a Dynamic Threshold alert is created, it will review the historical data to aggregate and define the initial thresholds. If the resource is new and does not yet have at least 3 days of operational history, the alerts will not fire until that data set is available.
The alerts are also tracked inside Azure Monitor in the Azure Portal. In this example, the Monitor Condition automatically updated itself to Resolved after the virtual machine CPU dropped back down to normal levels, but I still have enough data to go back and investigate the cause. For this reason, the Alert State remains as New, and I can manually change it to Acknowledged or Closed and add an optional comment.
The Alert details also tell me what value the metric reached and what the dynamic threshold was set to at the time (note: this was a very lightly stressed test system in my demo tenant!).
Conclusion
I’m excited about the future of IT Operations with Cloud capabilities at our fingertips. Making monitoring smarter and preventing or auto-remediating problems provides a better level of service availability. I’m hoping this is just the first step of many in applying machine learning to benefit everyday IT Ops tasks. Please take some time to try it out and let us know your feedback!