Aug 10 2021 06:23 AM
Written by Mark Russinovich, Chief Technology Officer and Technical Fellow, Microsoft Azure
“Throughout our Advancing Reliability blog series we’ve explained various techniques used by the Azure platform to prevent technical issues from impacting customers’ virtual machines (VMs) or other resources—like host resiliency with Project Tardigrade, cautious safe deployment practices taking advantage of ML-based AIOps insights, as well as predicting and mitigating hardware failures with Project Narya. Despite these efforts, when operating at the scale of Azure we know that there will inevitably be some failures that impact customer resources—so when they do, we strive for transparency in how we communicate to impacted customers. So for today’s post in the series, I have asked Principal Software Engineering Manager Nick Swanson to highlight recent improvements in the space—specifically, how we surface more detailed root cause statements via Azure resource health.”— Mark Russinovich, CTO, Azure
The existing Azure resource health feature helps you to diagnose and get support for service problems that affect your Azure resources. It reports on the current and past health of your resources, showing any time ranges that each of your resources have been unavailable. But we know that our customers and partners are particularly interested in “the why” to understand what caused the underlying technical issue, and in improving how they can receive communications about any issues—to feed into monitoring processes, to explain hiccups to other stakeholders, and ultimately to inform business decisions.
We recently shipped an improvement to the resource health experience that will enhance the information we share with customers about VM failures, with additional context on the root cause that led to the issue. Now, in addition to getting a fast notification when a VM’s availability is impacted, customers can expect a root cause to be added at a later point once our automated Root Cause Analysis (RCA) system identifies the failing Azure platform component that led to the VM failure. Let’s walk through an example to see how this works in practice: