Dynamically route alerts to the right team.
Hello folks,
After a discussion with a customer who was expressing their “displeasure” with the number of alert notifications the IT department was receiving from environments that were not critical but still needed monitoring, I started thinking about how we could dynamically decide when, where, how, and to whom these alert notifications are sent.
Then I remembered an interview with Aditya Balaji (PM for Azure Backup) where we discussed Azure Backup alerts and monitoring. In his demo, they were using a Logic App to manage the delivery of alert notifications.
I decided to try it and see if I could produce a way to limit the number of action groups in Azure Monitor alerting while enhancing the “decision” flow of how the notifications are handled.
Here’s what I came up with.
Azure Monitor Alerts
In Azure Monitor, alerts help you detect and address issues before users notice them by proactively notifying you when Azure Monitor data indicates that there may be a problem with your infrastructure or application. You can set up alerts on any metric or log data source in the Azure Monitor data platform (formerly known as Log Analytics workspace).
Part of the alert setup includes setting up action groups. An action group is a collection of notification preferences you can define. Azure Monitor, Service Health, and Azure Advisor alerts use action groups to notify users that an alert has been triggered.
Various alerts may use the same action group or different action groups depending on the user's requirements.
However, once an action group is defined, it won't dynamically change if the resource that triggered the alert changes. For example, what if the VM was tagged as a development or testing machine and has now been promoted to production? Would that change how and to whom you send notifications? And yes, I understand that typically you should not move VMs from test to prod. You should deploy a new VM in prod with the workload deployed to it. This is just an example. Don’t come at me… LOL
Alert processing rules
I can already hear some of you say, “Pierre, what about Alert Processing Rules?” Well, you’d be right. Alert processing rules can help in some circumstances, like suppressing notifications during a planned maintenance window, or when you want the same set of actions to apply to a larger set of alerts at scale without editing multiple action groups.
The solution
This is where Azure Logic Apps and the common alert schema came into play. Logic Apps allow you to create and run automated workflows, which opens up tremendous possibilities for evaluating different conditions, states, or parameters to decide your next set of actions.
The common alert schema standardizes the consumption experience for alert notifications. The three alert types in Azure (metric, log, and activity log) used to have their own separate email templates, webhook schemas, etc.
With the common alert schema, you can now receive all alert notifications with a consistent schema, giving you a predictable way of parsing the alert payload and deciding what to do with it. The common schema provides a consistent JSON structure for ALL alert types, which allows you to easily build integrations across the different alert types.
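To give a feel for what that consistent structure looks like, here is a minimal Python sketch that pulls a few fields out of a common alert schema payload. The field names under data.essentials follow the published schema, but the sample values (rule name, subscription ID, VM name) are made up, and parse_alert is a hypothetical helper, not part of any Azure SDK.

```python
import json

# Trimmed sample payload. The values are invented for illustration only.
SAMPLE_PAYLOAD = """
{
  "schemaId": "azureMonitorCommonAlertSchema",
  "data": {
    "essentials": {
      "alertRule": "HighCpuDemo",
      "severity": "Sev3",
      "signalType": "Metric",
      "monitorCondition": "Fired",
      "alertTargetIDs": [
        "/subscriptions/00000000-0000-0000-0000-000000000000/resourcegroups/demo-rg/providers/microsoft.compute/virtualmachines/demo-vm"
      ],
      "firedDateTime": "2024-01-01T00:00:00Z"
    },
    "alertContext": {}
  }
}
"""


def parse_alert(raw: str) -> dict:
    """Extract the fields a routing decision is likely to need."""
    payload = json.loads(raw)
    if payload.get("schemaId") != "azureMonitorCommonAlertSchema":
        raise ValueError("payload does not use the common alert schema")
    essentials = payload["data"]["essentials"]
    return {
        "rule": essentials["alertRule"],
        "severity": essentials["severity"],
        "signal_type": essentials["signalType"],
        "condition": essentials["monitorCondition"],
        "target_ids": essentials["alertTargetIDs"],
    }


alert = parse_alert(SAMPLE_PAYLOAD)
print(alert["rule"], alert["severity"], alert["target_ids"][0])
```

Because every alert type shares the essentials block, the same few lines work whether the alert came from a metric, a log query, or the activity log.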
You can enable the common alert schema when you define the action.
If you noticed, I did NOT define a notification in my alert. But, in my action, I’m triggering a Logic App called NotificationFlow to which I’m passing the alert payload using the common alert schema. At that point, the Logic App has all the info it needs to decide how to deal with notifications.
The actual Logic App will do the following:
- Parse the needed info from the common schema JSON message.
- Collect all the resource info (including tags) from Azure in the subscription where the alert was generated.
- Iterate through the list and find the tags from the resource that triggered the alert.
- Identify from the tags if the machine is a production machine.
- Based on the production status, decide whether to email the on-call team or just post a message in a Teams channel (far less intrusive).
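The steps above can be sketched in a few lines of Python. This is only an illustration of the decision logic, not the Logic App definition itself. The resource list is a hard-coded stand-in for what Azure would return, the IDs are made up, and find_tags, choose_channel, and the channel names are hypothetical. The only detail taken from the tests described in this post is the category tag with its Demo and Production values.

```python
def find_tags(resources, target_id):
    """Return the tags of the resource whose ID matches the alert target."""
    wanted = target_id.lower()  # Azure resource IDs are case-insensitive
    for resource in resources:
        if resource["id"].lower() == wanted:
            return resource.get("tags") or {}
    return {}  # resource not found, or it has no tags


def choose_channel(tags):
    """Production resources go to the on-call team; everything else goes to Teams."""
    is_production = tags.get("category", "").lower() == "production"
    return "email-on-call" if is_production else "teams-channel"


# Stand-in for the resource list collected from the subscription (made-up data).
BASE = "/subscriptions/0000/resourceGroups/demo-rg/providers/Microsoft.Compute/virtualMachines"
resources = [
    {"id": f"{BASE}/vm-demo", "tags": {"category": "Demo"}},
    {"id": f"{BASE}/vm-prod", "tags": {"category": "Production"}},
    {"id": f"{BASE}/vm-untagged", "tags": None},
]

# The ID in the alert payload may differ in casing from the one Azure lists.
target = f"{BASE}/vm-prod".lower()
print(choose_channel(find_tags(resources, target)))
```

Note that in this sketch an untagged or unknown resource falls through to the Teams channel. Depending on your environment, you may prefer the louder option as the default.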
The Result.
My test VMs all have tags that I use to automate processes. This is just one more use for them.
I ran the test twice with a simulated CPU run-off. Once with the category tag set to Demo, and once with the category tag set to Production. Both times the alert was fired, and the Logic App made the decision on how to notify the appropriate team.
This was a simple decision tree. You could put so much more thought into it, like notifying the appropriate teams based on department tags, resource group…
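As a rough idea of how such a richer decision tree might look, here is a Python sketch of an ordered rule list where the first match wins. Everything in it is an assumption made for illustration: the department tag and its finance value, the demo-rg resource group, the destination names, and the helper functions resource_group_of and route.

```python
def resource_group_of(resource_id):
    """Pull the resource group name out of an Azure resource ID, if present."""
    parts = resource_id.strip("/").split("/")
    lowered = [part.lower() for part in parts]
    if "resourcegroups" in lowered:
        index = lowered.index("resourcegroups")
        if index + 1 < len(parts):
            return parts[index + 1]
    return None


# Ordered rules: (predicate, destination). The first matching rule wins.
RULES = [
    (lambda tags, rg: tags.get("category", "").lower() == "production", "email:on-call"),
    (lambda tags, rg: tags.get("department", "").lower() == "finance", "teams:finance-ops"),
    (lambda tags, rg: (rg or "").lower() == "demo-rg", "teams:demo-alerts"),
]
DEFAULT_ROUTE = "teams:general-alerts"


def route(tags, resource_id):
    """Return the destination for an alert, given the resource's tags and ID."""
    resource_group = resource_group_of(resource_id)
    for matches, destination in RULES:
        if matches(tags, resource_group):
            return destination
    return DEFAULT_ROUTE


vm_id = "/subscriptions/0000/resourceGroups/Demo-RG/providers/Microsoft.Compute/virtualMachines/vm1"
print(route({"department": "Finance"}, vm_id))
```

Keeping the rules in one ordered list means adding a new team is a one-line change, and the order makes it explicit which condition takes priority when several match.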
The point is, I could have a single Logic App that could become my “default” action on all alerts and hopefully make the right decision. I would also only have to manage a single set of code.
I hope this was useful.
Let me know if you have any scenarios you’re struggling with.
Cheers!
Pierre.