Monitoring and alerting design thoughts and considerations with Azure Web Apps
Published May 12 2021 01:30 PM
Microsoft

One very important aspect of managing your applications is monitoring and alerting. The Azure product group is acutely aware of this need, of course, and has built an advanced monitoring and alerting system right inside the portal, under the “Alerts” area. As part of this, you can configure various rules to keep track of your resources. These rules are keyed to various elements ("conditions"), which you choose based on your understanding of the app and its key operating parameters. There are about 60 conditions available, such as specific HTTP errors or CPU time. For example, one of the fundamental ways to keep an eye on your app is to set an alert on HTTP server errors and run it for a while without "major" alerting (as in, don't email the entire world about every error just yet) to establish your baseline, since any app will produce a certain number of errors occasionally. Let's say you run this for 2 weeks and see an average of 3 errors per day. You would then set the alert threshold to something higher, thus avoiding waking everyone up at 2am just because one user clicked the wrong button.
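To make the baseline idea concrete, here is a minimal sketch in Python of how you might turn two weeks of observed error counts into a starting threshold. The function name `suggest_threshold`, the sample counts, and the "mean plus three standard deviations" rule are all my own assumptions for illustration; Azure does not prescribe this formula, and your app may need something different.

```python
import math
import statistics


def suggest_threshold(daily_error_counts, sigmas=3.0):
    """Suggest a daily alert threshold from baseline error counts.

    Returns the mean plus `sigmas` standard deviations, rounded up, and
    always at least one above the largest count seen in the baseline.
    """
    if len(daily_error_counts) < 2:
        raise ValueError("need at least two days of baseline data")
    mean = statistics.mean(daily_error_counts)
    spread = statistics.stdev(daily_error_counts)
    statistical = math.ceil(mean + sigmas * spread)
    return max(statistical, max(daily_error_counts) + 1)


# Hypothetical two-week baseline that averages 3 errors per day.
baseline = [2, 4, 3, 1, 5, 3, 2, 4, 3, 3, 2, 4, 3, 3]
threshold = suggest_threshold(baseline)
print(f"Alert when daily HTTP server errors reach {threshold}")
```

The resulting number is only a starting point. You would enter it as the threshold of the alert rule and revisit it as you learn more about how the app behaves.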

After configuring the conditions and thresholds that are appropriate for your application, you would decide what should happen when an alert fires. Azure can send an alert to an email address or by SMS, send a push notification to the Azure app on your phone, or make a phone call (voice). You can add as many targets as you wish, though most people create some kind of corporate alias or group, which people can join or be added to in order to get the notifications. You can see more info and a helpful video about configuring ServiceNow to interact with our alerting on the Azure blog.

However, really keeping track of your application is much more complicated, because the very notion of “up” vs “down” is different for every app. For example, if the application displays a form for the user to fill out, then just testing whether the form loads correctly tells you little, and a truer test would be to see what happens when the form is submitted. If the application uses some kind of authentication, then testing the authentication process is an important goal, but not always possible, because it would typically require creating some kind of test account, and that could create a security risk. One way to clear some of these obstacles is to create specific test pages that perform “backend” operations, such as running a database query and displaying the result. Creating such a page and checking that it loads successfully and/or delivers the expected content is a good way to test the app.
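A check against such a test page could look something like the sketch below. It uses only the Python standard library. The function names, the URL, and the expected text are hypothetical, and the network call is passed in as a parameter so the logic can be tried without a live site.

```python
import urllib.error
import urllib.request


def fetch(url, timeout=10):
    """Return (status_code, body_text) for a GET request."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status, response.read().decode("utf-8", "replace")
    except urllib.error.HTTPError as err:
        # 4xx and 5xx responses still carry a status code and a body.
        return err.code, err.read().decode("utf-8", "replace")


def check_test_page(url, expected_text, fetcher=fetch):
    """Return (ok, reason) for a test page that exercises the backend."""
    try:
        status, body = fetcher(url)
    except OSError as err:  # covers URLError, timeouts and socket errors
        return False, f"request failed: {err}"
    if status != 200:
        return False, f"unexpected status {status}"
    if expected_text not in body:
        return False, "expected content missing"
    return True, "ok"


# Example call (hypothetical URL and marker text, not run here):
# ok, reason = check_test_page(
#     "https://contoso-app.azurewebsites.net/test/dbcheck", "rows returned:")
```

Checking for expected content, and not just a 200 status, matters because a page can load "successfully" while showing a database error message.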

Another aspect of testing is performance. An application can be “up”, but the time it takes to process a transaction can suddenly go from 8 seconds to 50 seconds. That kind of change is well below normal time-outs, but certainly above the patience threshold of many human beings, so tracking it is an important way to know that things might be going awry.
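As a rough illustration, a probe could time a transaction and compare it with a known baseline instead of waiting for a time-out. The helper names and the factor of 3 below are assumptions of mine, not values from any Azure feature.

```python
import time


def timed_call(operation):
    """Run `operation` and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = operation()
    return result, time.perf_counter() - start


def classify_latency(elapsed_seconds, baseline_seconds, slow_factor=3.0):
    """Return "slow" if the call took `slow_factor` times the baseline or more."""
    if baseline_seconds <= 0:
        raise ValueError("baseline_seconds must be positive")
    if elapsed_seconds >= baseline_seconds * slow_factor:
        return "slow"
    return "normal"
```

With the numbers from the example above, `classify_latency(50, 8)` reports "slow" even though a 50-second response would never trip a typical time-out.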

But things can get a lot more complicated, because as I noted, “up” and “down” can mean many things. For example, what if your application normally has about 100 transactions per minute, but suddenly that number jumps to 1600? That’s not “down”, but such growth could mean that the code is going into some kind of loop due to a bug or design issue, which could both create a bad user experience and put undue strain on your resources, and even cause a spike in costs. It could also mean that some malicious party is footprinting your app to find vulnerabilities, or performing a denial-of-service attack against it. All of these are things you probably want to be aware of even if the app feels perfectly normal to all your users.
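One possible way to catch that kind of jump is to compare each minute's transaction count with the recent median. The class below is a sketch under my own assumptions (the name `RateSpikeDetector`, a 60-minute window, a factor of 5), and you would feed it counts from whatever metrics source you already have.

```python
import statistics
from collections import deque


class RateSpikeDetector:
    """Flag per-minute transaction counts far above the recent norm."""

    def __init__(self, window=60, spike_factor=5.0, min_samples=10):
        self.history = deque(maxlen=window)
        self.spike_factor = spike_factor
        self.min_samples = min_samples

    def observe(self, count):
        """Record one per-minute count; return True if it is a spike."""
        is_spike = False
        if len(self.history) >= self.min_samples:
            typical = statistics.median(self.history)
            is_spike = typical > 0 and count > typical * self.spike_factor
        if not is_spike:
            # Keep spikes out of the history so they don't become "normal".
            self.history.append(count)
        return is_spike


detector = RateSpikeDetector()
for per_minute in [100, 98, 103, 101, 97, 102, 99, 100, 104, 96]:
    detector.observe(per_minute)
print(detector.observe(1600))  # the jump from the example above
```

Leaving spikes out of the history is a design choice with a trade-off: a long-lasting but legitimate increase in traffic will keep alerting until you reset the detector or raise the factor.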

Another thing to consider is that for users, there could be nuanced notions of what’s “down”. For example, your form could be loading, but it could be missing some image or CSS files, causing the appearance to suffer. This kind of thing doesn’t mean the app is down, but it can look very ugly, and if your users are customers, it could make the company look bad.
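A check for this "ugly but up" condition could parse the page and verify every image, stylesheet, and script it references. The sketch below uses Python's standard `html.parser`. The names `AssetCollector` and `find_missing_assets` are hypothetical, and `status_of` stands in for whatever function you use to request a URL and return its HTTP status.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class AssetCollector(HTMLParser):
    """Collect URLs of images, stylesheets and scripts referenced by a page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.assets = []

    def handle_starttag(self, tag, attrs):
        attributes = dict(attrs)
        if tag in ("img", "script") and attributes.get("src"):
            self.assets.append(urljoin(self.base_url, attributes["src"]))
        elif tag == "link" and attributes.get("href"):
            rel = (attributes.get("rel") or "").lower().split()
            if "stylesheet" in rel:
                self.assets.append(urljoin(self.base_url, attributes["href"]))


def find_missing_assets(page_url, html_text, status_of):
    """Return asset URLs whose status, per `status_of(url)`, is not 200."""
    collector = AssetCollector(page_url)
    collector.feed(html_text)
    return [url for url in collector.assets if status_of(url) != 200]
```

A non-empty result would be a good candidate for a low-severity alert, since the app still works but no longer looks right.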

Yet another thing to consider is alert levels. If your app is dead, you certainly want all hands on deck, but if its performance is down by 20%, you might want a more limited circulation of just system admins or a developer or two. You might want that specific alert level to be off during the night, and set various thresholds (for example, for a 20% drop, just send an email to be read during the next business day, but a 40% drop warrants a phone call). The more complex the app and development process, the more elaborate your alerting decision tree and flowchart would be. Another aspect of this is the alert interval. Most monitoring options run at very short intervals, like once every 5 minutes or even less, but people don’t typically respond that fast, and code fixes can take time to develop and deploy. You certainly don’t want your CEO to receive a phone call every 60 seconds for 5 hours while your people are trying to fix it, right? Similarly, if the alerting system generates a high volume of alerts, many people tend to set email filters so they don’t wake up in the morning to 540 new emails. Those kinds of filters could lead to the issue not being seen, at which point the alerting is too loud to be useful. A better design would be to have alerting trigger a certain number of alerts, but then quiet them down before they become unmanageable.
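Both ideas, routing by severity and quieting repeated alerts, can be sketched in a few lines of Python. Everything here is illustrative: the 20% and 40% thresholds come from the example above, while the quiet hours, the channel names, and the limit of 3 alerts per hour are assumptions I made up for the sketch.

```python
from datetime import datetime, timedelta


def choose_channel(drop_percent, now, quiet_start=22, quiet_end=7):
    """Map a performance drop to a notification channel, or None.

    A 40% drop always warrants a phone call. A 20% drop is an email,
    and that level is switched off during quiet hours.
    """
    if drop_percent >= 40:
        return "phone"
    if drop_percent >= 20:
        in_quiet_hours = now.hour >= quiet_start or now.hour < quiet_end
        return None if in_quiet_hours else "email"
    return None


class AlertThrottle:
    """Allow at most `max_alerts` per alert key within each cooldown window."""

    def __init__(self, max_alerts=3, cooldown=timedelta(hours=1)):
        self.max_alerts = max_alerts
        self.cooldown = cooldown
        self.sent = {}  # alert key -> send times still inside the window

    def should_send(self, key, now):
        recent = [t for t in self.sent.get(key, []) if now - t < self.cooldown]
        allowed = len(recent) < self.max_alerts
        if allowed:
            recent.append(now)
        self.sent[key] = recent
        return allowed


throttle = AlertThrottle()
start = datetime(2021, 5, 12, 2, 0)
for minute in range(5):
    when = start + timedelta(minutes=minute)
    if throttle.should_send("http-5xx", when):
        print(when.time(), "notify via", choose_channel(45, when))
```

In this run only the first three alerts go out, and the remaining ones are held back until the one-hour window has moved on. A real design would probably also send a "still failing" summary now and then, so that silence is not mistaken for recovery.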

In closing, alerting is an engineering effort that in many cases can be almost as complex as designing the application itself, so a good idea for any organization is to start planning it from day 1, alongside the application’s design and coding. Integrating this into the app early is more likely to lead to reliable and stable monitoring, and thus a more reliable and stable application.

 
