Jul 21 2021 06:34 AM
Written by Mark Russinovich, Chief Technology Officer and Technical Fellow, Microsoft Azure
“All service engineering teams in Azure are already familiar with postmortems as a tool for better understanding what went wrong, how it went wrong, and the customer impact of the related outage. An important part of our postmortem process is for all relevant teams to create repair items aimed at preventing the same type of outage from happening again, but while reactive approaches can yield results, we wanted to move this to the left—way left. We wanted to get better at uncovering risks before they had an impact on our customers, not after. For today’s post in our Advancing Reliability blog series I’ve asked Richie Costleigh, Principal Program Manager from our Azure problem management team, to share insights into our journey as we work towards advancing our postmortem and resiliency threat modeling processes.”—Mark Russinovich, CTO, Azure
At a high level, our goal is to avoid self-inflicted or otherwise avoidable outages; more immediately, it is to reduce the likelihood of impact to our customers as much as possible. In this context, an outage is any incident in which our services fail to meet the expectations of our customers, or negatively impact their workloads, whether those customers are internal or external. To avoid that, we needed to improve how we uncover risks before they impact customer workloads on Azure. But Azure itself is a large, complex distributed system. How should we approach resiliency threat modeling if our organization offers thousands of solutions, composed of hundreds of “services,” each with a team of five to 50 engineers, distributed across different parts of the organization, and each with its own processes, tools, priorities, and goals? How do we scale our resiliency threat modeling process out and reason across all these individual risk assessments? Addressing these challenges took some major changes to join the reactive approach with a more proactive one.
We started the shift left with a premortem pilot program. We looked back at past outages and developed a questionnaire that helped not only to start discussions but also to give them structure. Next, we selected several services of varying purpose and architecture. Each time we sat down with a team, we learned something new, got better at identifying risks, incorporated the team’s feedback on the process, and then tried again with the next team. Eventually, we started to identify the right questions to ask, as well as the other elements we needed to make this process productive and impactful. Some of these elements already existed; others needed to be created to support a centralized approach to resiliency threat modeling. Many that existed also required changes or integration into an overall solution that met our goals. What follows is a high-level overview of our approach and the elements we found necessary to keep improving in this space.