Every day organizations of different sizes and industries choose Azure to take advantage of economic and technological benefits when operating secure and reliable workloads. To maximize their success on the platform, they need to continuously review their operational procedures to minimize maintenance and management overhead.
To achieve this, each team needs to focus on high-value business activities instead of spending time on undifferentiated tasks like manual testing or manual build and deployment.
No matter the type of workload, be it a conventional lift-and-shift or a highly distributed cloud-native application: in the cloud we can no longer avoid all failure, as we sometimes attempted to in a traditional on-premises scenario.
Instead, we need to anticipate outages and take mitigating action to ensure our workload remains available. This shift in the operating culture is often described as a move towards Site Reliability Engineering.
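One common way to anticipate transient outages is to retry failed calls with exponential backoff. The sketch below is illustrative only, not part of any Azure SDK; the function and callable names are made up for this example.

```python
import random
import time

def call_with_retries(operation, max_attempts=4, base_delay=0.1):
    """Retry a zero-argument callable that may fail transiently.

    Hypothetical helper for illustration; `operation` stands in for any
    remote call that can raise ConnectionError during an outage.
    """
    for attempt in range(1, max_attempts + 1):
        try:
            return operation()
        except ConnectionError:
            if attempt == max_attempts:
                raise  # retries exhausted: surface the failure to the caller
            # back off exponentially, with jitter to avoid retry storms
            delay = base_delay * (2 ** (attempt - 1)) * (1 + random.random())
            time.sleep(delay)

# usage: simulate a dependency that fails twice, then recovers
attempts = {"count": 0}
def flaky():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise ConnectionError("transient outage")
    return "ok"

result = call_with_retries(flaky)
```

The jitter matters in practice: if many clients retry on the same schedule after an outage, they can overwhelm a recovering service all at once.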
Here are five operational best practices that can help your team shift into a cloud-first mindset when it comes to your cloud operations.
Minimize time to recovery
When things go wrong, it’s important that services return to normal as soon as possible. There are several ways to reduce the risk of failure in your workload:
- Build a good architectural foundation
- Plan for failure
- Review the state of your workload regularly
Using the Azure Architecture Guide you can learn about the key architectural patterns that you should follow, and find reference implementations that help you create a resilient base architecture. Guides are available for both scenario-specific and industry-specific solutions, and deploying a reference implementation gives you a solid base to build on.
The Well-Architected Framework supports you in developing the core processes that help you run your workload as well as anticipate and recover from issues. This includes specific guidance on what makes a workload resilient to failure and how to respond to downtime. This, in turn, helps you build a recovery plan that is both specific and targeted, so that your team knows what actions to take at what time.
There is also a range of well-architected assessments that you can leverage to continuously review and validate the state of your workload. The assessments go above and beyond the proactive recommendations already available in Azure Advisor, and help you validate the state of your workload on a regular basis. You should combine this with a regular review of your business metrics, including the average time it takes you to recover from failure and your uptime compared to your Service Level Objective target.
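Those two metrics, mean time to recovery (MTTR) and availability against an SLO, can be computed directly from an incident log. A minimal sketch, using made-up incident data rather than any real monitoring export:

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (start, end) of each outage in a 30-day period.
incidents = [
    (datetime(2023, 5, 3, 10, 0), datetime(2023, 5, 3, 10, 24)),
    (datetime(2023, 5, 17, 2, 0), datetime(2023, 5, 17, 2, 6)),
]

# total downtime across all incidents
downtime = sum((end - start for start, end in incidents), timedelta())

# mean time to recovery, in minutes per incident
mttr_minutes = downtime.total_seconds() / 60 / len(incidents)

# availability as the fraction of the period the workload was up
period = timedelta(days=30)
availability = 1 - downtime / period

slo_target = 0.999  # e.g. a "three nines" Service Level Objective
meets_slo = availability >= slo_target
```

With 30 minutes of downtime over 30 days this workload recovers in 15 minutes on average and stays just above a 99.9% objective; tracking the trend of these numbers over time is more useful than any single reading.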
Investing in automation is key, both for streamlining your workload’s recovery and for scaling cloud operations more generally. GitHub Actions and Azure Pipelines are essential tools in this context, enabling deployments that are faster, more predictable, and more reliable.
You should aim for your build and release processes to be automated end-to-end. This allows you to scale better in the long term and take full advantage of your DevOps tool chain.
Picking the right release strategy is also a key aspect. Depending on your workload, you may choose a strategy that allows you to roll back easily if a deployment fails, or one that allows you to trial changes with a subset of users so that you know they are safe to make.
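The "subset of users" approach is often called a canary release. One way to implement it is to hash the user ID into a bucket, so each user consistently sees the same version across requests. A sketch under that assumption; the function name and user IDs are illustrative:

```python
import hashlib

def in_canary(user_id: str, percent: int) -> bool:
    """Deterministically place roughly `percent` of users in the canary group.

    Hashing the user id (rather than assigning randomly per request) keeps
    each user on the same version for the duration of the rollout.
    """
    digest = hashlib.sha256(user_id.encode()).digest()
    bucket = digest[0] * 256 + digest[1]  # uniform-ish value in 0..65535
    return bucket < 65536 * percent // 100

# roughly 10% of a large user population lands in the canary group
users = [f"user-{i}" for i in range(10_000)]
share = sum(in_canary(u, 10) for u in users) / len(users)
```

Ramping `percent` up in stages, while watching error rates in the canary group, lets you catch a bad deployment before it reaches every user.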
Minimize the blast radius
As you consider the architecture of your workload, you want to make sure that any failure that does happen is contained. In an e-commerce scenario, for example, you would want to continue to take payments even if a page in your product catalog is experiencing some downtime.
You can minimize this “blast radius” of a failure in several ways, for example by splitting your workload into loosely coupled components that can fail independently.
Our own product teams at Microsoft have found that a certain degree of organizational autonomy is essential to make full use of these technical ways to create more loosely coupled workload components.
If your workload consists of several components that can be managed and changed independently, then that makes it easier to put changes in front of users more regularly while also not compromising on the safety of the process.
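One widely used pattern for containing failure between loosely coupled components is a circuit breaker: after repeated failures, calls to the broken component fail fast to a fallback instead of piling up and dragging the rest of the workload down. A minimal sketch, with illustrative names, not a production implementation:

```python
class CircuitBreaker:
    """Minimal circuit-breaker sketch: after `threshold` consecutive
    failures the circuit opens and further calls return a fallback
    immediately, isolating the faulty component.
    """
    def __init__(self, threshold=3):
        self.threshold = threshold
        self.failures = 0

    def call(self, operation, fallback):
        if self.failures >= self.threshold:   # circuit open: fail fast
            return fallback
        try:
            result = operation()
            self.failures = 0                 # success closes the circuit
            return result
        except Exception:
            self.failures += 1
            return fallback

# usage: a broken catalog page should not take payments down with it
breaker = CircuitBreaker(threshold=2)
def broken_catalog():
    raise RuntimeError("catalog unavailable")

responses = [breaker.call(broken_catalog, "cached catalog") for _ in range(4)]
```

A real implementation would also reopen the circuit after a cooldown and probe the dependency periodically; libraries exist for this in most ecosystems.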
Maximize observability to minimize disruption
To ensure that a change is indeed “safe,” your workload must be fully observable at all times.
Azure Monitor is a comprehensive monitoring solution that allows you to gather signals from the various resources that make up your workload.
With effective telemetry correlation we can trace operations through different components in the workload and start to identify patterns. For example: operations that come from a particular type of client device always fail when a database call is made, while other client calls perform just fine.
You can use signals and patterns that you detect to set alerts and configure automatic remediation. This can be useful to deal with sudden spikes in usage, to monitor the progress and success of a deployment, or to react to service downtime.
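The pattern-spotting described above relies on tagging every telemetry event with a correlation ID, so that events emitted by different components can be stitched back into one end-to-end operation. A sketch with hypothetical event data; real telemetry pipelines such as Azure Monitor handle this correlation for you:

```python
from collections import defaultdict

# Hypothetical telemetry events, each tagged with the correlation id of the
# user operation that produced it.
events = [
    {"correlation_id": "op-1", "component": "frontend", "client": "mobile", "status": "ok"},
    {"correlation_id": "op-1", "component": "database", "client": "mobile", "status": "error"},
    {"correlation_id": "op-2", "component": "frontend", "client": "desktop", "status": "ok"},
    {"correlation_id": "op-2", "component": "database", "client": "desktop", "status": "ok"},
]

# stitch events back into end-to-end operations
operations = defaultdict(list)
for event in events:
    operations[event["correlation_id"]].append(event)

# for each operation, list the components where it failed
failed_in = {
    cid: [e["component"] for e in evs if e["status"] == "error"]
    for cid, evs in operations.items()
}
```

Grouped this way, the data shows the mobile operation failing at the database step while the desktop one succeeds end to end, exactly the kind of pattern that can drive an alert or automatic remediation.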
Maximize continuous improvement efforts
A key aspect of Site Reliability Engineering is a commitment to continuous learning. While this is a cultural change that can take a while to fully implement, there are some actions that you can take with your team to move in the right direction. Some examples are listed below.
- Help your team get comfortable with constant change and with the fact that things will eventually go wrong.
- Establish a blameless post-incident process and learn to think of incidents as learning opportunities rather than failures.
- Allocate time to improve your monitoring and instrumentation, so that past problems are caught earlier and your workloads become even more stable in the future.
Take the next step
Daniel Stocker is a Senior Program Manager in Microsoft's App Innovation Tech Strategy team. His background is in DevOps, operational excellence, and organizational change. He leads several programs including the Well-Architected Operational Excellence pillar. https://www.linkedin.com/in/dan-stocker/