Planning for an outage: Here is how to anticipate failure in your workload
Published Apr 27 2022 10:10 AM
Microsoft

We have all grown accustomed to services being available 24/7. Whether it is a quick game to pass the time on your smartphone at home, a slide deck opened on your tablet before a client presentation, or a report finished on your laptop on the train, our tolerance for application downtime has decreased over the last few years – particularly during the pandemic. Users expect high-quality, instant results for any task at any time, and will happily switch to an alternative if the service they are using doesn’t meet their expectations.

 

As more workloads embrace the opportunities that the Azure platform provides, we must reduce downtime as much as possible to deliver a great customer experience. The Microsoft Azure Well-Architected Framework offers key suggestions that help us run smooth, predictable, global services.

 

Understand your workload’s needs and risks

As you design your workload you must first understand your own uptime requirements. What’s the end user experience that you want to provide? And how much downtime can you afford?

 

Our guidance around calculating your own Service Level Agreement (SLA) can be useful as you learn how to build highly available solutions on Azure. In general, the guidance in our Well-Architected Reliability pillar will help you with key considerations around building a workload that runs smoothly and is resilient to platform outages.
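
To make the question "how much downtime can you afford?" concrete, the arithmetic behind a composite SLA can be sketched in a few lines of Python. This is a minimal illustration, not part of the official guidance: the function names are hypothetical, the percentages in the usage note are made-up examples rather than published Azure SLA figures, and the formulas assume that components fail independently of each other.

```python
from math import prod

MINUTES_PER_YEAR = 365 * 24 * 60  # 525,600 minutes; leap years ignored


def serial_availability(availabilities):
    """Availability of a chain in which every component must be up.

    Assumes independent failures, so the individual values multiply.
    """
    return prod(availabilities)


def redundant_availability(availability, copies):
    """Availability of `copies` independent instances where one is enough.

    The group is down only when every copy is down at the same time.
    """
    return 1 - (1 - availability) ** copies


def downtime_minutes_per_year(availability):
    """Expected downtime budget, in minutes per year, for a given availability."""
    return (1 - availability) * MINUTES_PER_YEAR


if __name__ == "__main__":
    # Illustrative numbers only: a web tier at 99.95% in front of a database at 99.99%.
    composite = serial_availability([0.9995, 0.9999])
    print(f"composite availability: {composite:.6f}")
    print(f"downtime budget: {downtime_minutes_per_year(composite):.0f} minutes/year")
```

With these example inputs, the composite availability is lower than that of either component, and the yearly downtime budget comes out at roughly 315 minutes. Comparing that figure with what your users will tolerate is a useful starting point for deciding where redundancy is worth the cost.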

 

A reliable workload is a great foundation from which you can start addressing additional risks. After working with different customers across various industries we believe that these are best summarized as:

  • Design & Platform risks – which you address by building a highly available, reliable solution
  • Operational risks – which you address through advanced monitoring, automation, and modern software development and deployment practices
  • Inside risks – which you address by designing processes and team structures that can help deploy, operate, and support modern cloud applications sustainably
  • Outside risks – which you address through a strong security posture

An actionable way of having your team address each of those risk areas in the context of their workload is through a structured recovery plan that can be practiced and – where possible – automated. The guidance available across the Well-Architected Framework helps you find pointers to build and structure your plan.

 

Mitigate risk by creating and rehearsing a recovery plan

Every workload needs a tailored recovery plan to address the various types of risk that it is exposed to every day. The shape and structure of the plan depend on your deployment strategy, the type of application that you are running, the risks that you have identified as most pressing, and your uptime needs.

 

Simply putting a recovery plan in place forces your team to think about risk areas and what could go wrong. It allows you to plan with a cool head for a potentially very heated situation.

 

Automation has a variety of benefits when it comes to workload recovery. For example, you can use automated tests to detect whether your workload’s application code deployed correctly. If a test detects that something is wrong, it can trigger a rollback to quickly recover from the failed deployment.
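
The deploy, test, and roll back loop described above could look roughly like the following Python sketch. It is an assumption-laden illustration rather than a prescribed implementation: `deploy_with_rollback` and its parameters are hypothetical names, and the deploy, rollback, and smoke test steps are passed in as plain callables so that the example does not depend on any particular pipeline or Azure API.

```python
from typing import Callable, Iterable


def deploy_with_rollback(
    deploy: Callable[[], None],
    rollback: Callable[[], None],
    smoke_tests: Iterable[Callable[[], bool]],
) -> bool:
    """Deploy, run each smoke test in order, and roll back on the first failure.

    Returns True if the new version stays in place, False if it was rolled back.
    All three arguments are hypothetical hooks supplied by the caller.
    """
    deploy()
    for test in smoke_tests:
        try:
            passed = test()
        except Exception:
            passed = False  # a check that crashes is treated as a failed check
        if not passed:
            rollback()
            return False
    return True


if __name__ == "__main__":
    # Stand-in steps that only record what happened.
    log = []
    kept = deploy_with_rollback(
        deploy=lambda: log.append("deployed v2"),
        rollback=lambda: log.append("rolled back to v1"),
        smoke_tests=[lambda: True, lambda: False],  # second check fails
    )
    print(kept, log)
```

Keeping the three steps as injected callables means the same routine can be rehearsed against fakes, as in the usage above, before it is wired to a real deployment.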

 

Automating recovery and detection processes is essential when it comes to building modern recovery plans that can kick into action 24/7. The following section introduces a few exciting Azure features that could be useful as you build out your own strategy.

 

Leverage exciting new ways to prepare for failure

You can leverage advanced monitoring in Azure to alert you when failures occur. Action groups allow you to customize your response and recovery routine and offer an easy way to automate at least the initial steps towards recovery.

 

Azure Chaos Studio offers a new way to help you introduce failure in your workload deliberately. This helps you test your automation and trains your team to handle failures as a normal operating risk rather than as something out of the ordinary. Injecting failures on a regular schedule turns recovery into a practiced routine over time. See this video for when and how to design for chaos engineering and this one for getting started with Chaos Studio.
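
The idea of deliberately injecting failure can be illustrated in miniature without any Azure service. The Python sketch below is not the Chaos Studio API; `FlakyDependency` and `call_with_retry` are hypothetical names invented for this example. It wraps a dependency so that a configurable share of calls fails on purpose, then checks whether simple retry logic copes.

```python
import random


class FlakyDependency:
    """Wraps a callable and makes a share of its calls fail on purpose."""

    def __init__(self, func, failure_rate, seed=0):
        self._func = func
        self._failure_rate = failure_rate  # 0.0 = never fail, 1.0 = always fail
        self._rng = random.Random(seed)    # seeded so experiments are repeatable
        self.injected_failures = 0

    def __call__(self, *args, **kwargs):
        if self._rng.random() < self._failure_rate:
            self.injected_failures += 1
            raise ConnectionError("injected fault")
        return self._func(*args, **kwargs)


def call_with_retry(func, attempts=3):
    """Call `func` up to `attempts` times, re-raising the last ConnectionError."""
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except ConnectionError as error:
            last_error = error
    raise last_error


if __name__ == "__main__":
    # Hypothetical dependency that normally always succeeds.
    flaky = FlakyDependency(lambda: "ok", failure_rate=0.5, seed=42)
    try:
        print(call_with_retry(flaky, attempts=5))
    except ConnectionError:
        print("retries exhausted")
    print("faults injected:", flaky.injected_failures)
```

Running an experiment like this regularly, with the failure rate turned up until the retries are exhausted, shows where the recovery logic stops coping, which is the kind of insight a managed chaos experiment aims for at the infrastructure level.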

 

Finally, Azure Load Testing allows you to stress test your infrastructure and application code, even in pre-production environments. We recommend a loosely coupled architecture, which can help your code, infrastructure, and organization scale to the challenge of providing a 24-hour service.

 

Build on your strategy today

Take a Well-Architected Review to help you assess what steps your team is already taking to maximize uptime and what more you could be doing.

 

Taking steps to reduce unwanted downtime drives improvements across all Well-Architected pillars for your workload and helps ensure a consistent and reliable service for your users.

 

Author Bio

Daniel Stocker is a Senior Program Manager in Microsoft's App Innovation Tech Strategy team. His background is in DevOps, operational excellence, and organizational change. He leads several programs including the Well-Architected Operational Excellence pillar.  https://www.linkedin.com/in/dan-stocker/

Last update: May 05 2022 12:36 PM