Welcome to the Proactive Reliability Series — a collection of articles dedicated to raising awareness about the importance of designing, implementing, and operating reliable solutions in Azure. Each article will focus on a specific area of reliability engineering: from identifying critical flows and setting reliability targets, to designing for redundancy, testing strategies, and disaster recovery.
This series draws its foundation from the Reliability pillar of the Azure Well-Architected Framework, Microsoft's authoritative guidance for building workloads that are resilient to malfunction and capable of returning to a fully functioning state after a failure occurs.
In the cloud, failures are not a matter of if but when. Whether it is a regional outage, an availability zone going dark, a misconfigured resource, or a downstream service experiencing degradation — your workload will eventually face adverse conditions. The difference between a minor blip and a major incident often comes down to how deliberately you have planned for failure.
In this first article, we start with one of the most foundational practices: Fault Mode Analysis (FMA) — and the question that underpins it: what kinds of faults can actually happen in Azure?
Disclaimer: The views expressed in this article are my own and do not represent the views or positions of Microsoft. This article is written in a personal capacity and has not been reviewed, endorsed, or approved by Microsoft.
Why Fault Mode Analysis Matters
Fault Mode Analysis is the practice of systematically identifying potential points of failure within your workload and its associated flows, and then planning mitigation actions accordingly. A key tenet of FMA is that in any distributed system, failures can occur regardless of how many layers of resiliency are applied. More complex environments are simply exposed to more types of failures. Given this reality, FMA allows you to design your workload to withstand most types of failures and recover gracefully within defined recovery objectives.
If you skip FMA altogether, or perform an incomplete analysis, your workload is at risk of unpredictable behavior and potential outages caused by suboptimal design.
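To make the practice concrete, an FMA worksheet can be captured as structured data so failure modes can be ranked and reviewed programmatically. The following is a minimal, hypothetical sketch — the entries, scoring scale, and risk formula are illustrative assumptions, not an official FMA template:

```python
from dataclasses import dataclass

# Hypothetical FMA worksheet entry. Likelihood and impact are relative
# 1-5 planning scores, not statistical probabilities.
@dataclass
class FailureMode:
    component: str
    fault: str
    likelihood: int  # 1 (very low) .. 5 (very high)
    impact: int      # 1 (negligible) .. 5 (severe)
    mitigation: str

    @property
    def risk(self) -> int:
        # Simple risk-priority score used to rank mitigation work.
        return self.likelihood * self.impact

worksheet = [
    FailureMode("Web tier", "Single VM failure", 4, 2, "VM Scale Set across zones"),
    FailureMode("Database", "Zone outage", 2, 5, "Zone-redundant deployment"),
    FailureMode("Region", "Partial region fault", 2, 5, "Multi-region failover"),
]

# Review the highest-risk failure modes first.
for fm in sorted(worksheet, key=lambda f: f.risk, reverse=True):
    print(f"{fm.risk:>2}  {fm.component}: {fm.fault} -> {fm.mitigation}")
```

Keeping the worksheet as data (rather than a one-off document) makes it easier to revisit the analysis periodically as the workload and platform evolve.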
But to perform FMA effectively, you first need to understand what kinds of faults can actually occur in Azure infrastructure — and that is where most teams hit a gap.
Sample "Azure Fault Type" Taxonomy
Azure infrastructure is complex and distributed, and while Microsoft invests heavily in reliability, faults can and do occur. These faults can range from large-scale global service outages to localized issues affecting a single VM.
The following is a sample taxonomy of common Azure infrastructure fault types, categorized by their characteristics, likelihood, and mitigation strategies. The taxonomy is organized from a customer impact perspective — focusing on how fault types affect customer workloads and what mitigation options are available — rather than from an internal Azure engineering perspective.
Some of these "faults" may not even be caused by an actual failure in Azure infrastructure. They can be caused by a lack of understanding of how Azure services are designed to behave (e.g., underestimating the impact of Azure planned maintenance) or by Azure platform design decisions (e.g., capacity constraints). However, from a customer perspective, they all represent potential failure modes that need to be considered and mitigated when designing for reliability.
The following table presents infrastructure fault types from a customer impact perspective:
Disclaimer: This is an unofficial taxonomy sample of Azure infrastructure fault types. It is not an official Microsoft publication and is not officially supported, endorsed, or maintained by Microsoft. The fault type definitions, likelihood assessments, and mitigation recommendations are based on publicly available Azure documentation and general cloud architecture best practices, but may not reflect the most current Azure platform behavior. Always refer to official Azure documentation and Azure Service Health for authoritative guidance.
The "Likelihood" values below are relative planning heuristics intended to help prioritize resilience investments. They are not statistical probabilities, do not represent Azure SLA commitments, and are not derived from official Azure reliability data.
| Fault Type | Blast Radius | Likelihood | Required Redundancy for Mitigation |
|---|---|---|---|
| Service Fault (Global) | Worldwide or Multiple Regions | Very Low | High |
| Service Fault (Region) | Single service in region | Medium | Region Redundancy |
| Region Fault | Single region | Very Low | Region Redundancy |
| Partial Region Fault | Multiple services in a single Region | Low | Region Redundancy |
| Availability Zone Fault | Single AZ within region | Low | Availability Zone Redundancy |
| Single Resource Fault | Single VM/instance | High | Resource Redundancy |
| Platform Maintenance Fault | Variable (resource to region) | High | Resource Redundancy, Maintenance Schedules |
| Region Capacity Constraint Fault | Single region | Low | Region Redundancy, Capacity Reservations |
| Network POP Location Fault | Network hardware Colocation site | Low | Site Redundancy |
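One way to make the table actionable is to check the redundancy your workload actually deploys against the redundancy each fault type requires, and list what remains unmitigated. A minimal sketch under simplifying assumptions — it maps a subset of the taxonomy to a single primary redundancy level each, whereas the real table allows multiple mitigations per fault type:

```python
# Simplified subset of the taxonomy above: each fault type mapped to the
# redundancy level that primarily mitigates it. Illustrative only; the
# global-service and network-POP rows are omitted because their
# mitigations do not reduce to a single redundancy level.
REQUIRED_REDUNDANCY = {
    "Service Fault (Region)": "region",
    "Region Fault": "region",
    "Partial Region Fault": "region",
    "Availability Zone Fault": "zone",
    "Single Resource Fault": "resource",
    "Platform Maintenance Fault": "resource",
    "Region Capacity Constraint Fault": "region",
}

def unmitigated(deployed: set[str]) -> list[str]:
    """Return the fault types not covered by the deployed redundancy levels."""
    return [fault for fault, level in REQUIRED_REDUNDANCY.items()
            if level not in deployed]

# A zonal deployment covers resource- and zone-scoped faults,
# but every region-scoped fault type remains unmitigated.
print(unmitigated({"resource", "zone"}))
```

Running a check like this during design reviews surfaces the gap many teams miss: zone redundancy alone leaves every region-scoped fault type, including the Partial Region Fault, uncovered.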
In future articles we will examine each of these fault types in detail. For this first article, let's take a closer look at one that is often underestimated: the Partial Region Fault.
Deep Dive: "Partial Region Fault"
A Partial Region Fault is a fault affecting multiple Azure services within a single region simultaneously, typically due to shared regional infrastructure dependencies, regional network issues, or regional platform incidents. Sometimes the number of affected services is significant enough to resemble a full region outage, but the key distinction is that the region is not completely lost: some services continue to operate normally while others experience degradation or unavailability. Unlike a region outage caused by a natural disaster, the documented partial region faults referenced later in this article have historically been resolved within hours.
| Attribute | Description |
|---|---|
| Blast Radius | Multiple services within a single region |
| Likelihood | Low |
| Typical Duration | Minutes to hours |
| Fault Tolerance Options | Multi-region architecture; cross-region failover |
| Fault Tolerance Cost | High |
| Impact | Severe |
| Typical Cause | Regional networking infrastructure failure affecting multiple services, regional storage subsystem degradation impacting dependent services, regional control plane issues affecting service management |
These faults are rare, but they can happen — and when they do, they can have a severe impact on customer solutions that are not architected for multi-region resilience.
What makes Partial Region Faults particularly dangerous is that they fall into a blind spot in most teams' resilience planning. When organizations think about regional failures, they tend to think in binary terms: either a region is up or it is down. Disaster recovery runbooks are written around the idea of a full region outage — triggered by a natural disaster or a catastrophic infrastructure event — where the response is to fail over everything to a secondary region.
But a Partial Region Fault is not a full region outage. It is something more insidious. A subset of services in the region degrades or becomes unavailable while others continue to function normally. Your VMs might still be running, but the networking layer that connects them is broken. Your compute is fine, but Azure Resource Manager — the control plane through which you manage everything — is unreachable.
This partial nature creates several problems that teams rarely plan for:
- Failover logic may not trigger. Most automated failover mechanisms are designed to detect a complete loss of connectivity to a region. When only some services are affected, health probes may still pass, traffic managers may still route requests to the degraded region, and your failover automation may sit idle — while your users are already experiencing errors.
- Recovery is more complex. With a full region outage, the playbook is straightforward: fail over to the secondary region. With a partial fault, you may need to selectively fail over some services while others remain in the primary region — a scenario that few teams have tested and most architectures do not support gracefully.
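One practical countermeasure for the first problem is to expose a deep health endpoint that aggregates checks against each regional dependency (storage, identity, database, and so on) instead of a shallow liveness probe. The sketch below is a hypothetical illustration of the pattern — the check names and lambdas are placeholders for real dependency probes, not an official Azure mechanism:

```python
from typing import Callable

# Each check probes one regional dependency and returns True if healthy.
# In a real workload these would attempt a storage read, a token
# acquisition, a database query, etc.
HealthCheck = Callable[[], bool]

def deep_health(checks: dict[str, HealthCheck]) -> tuple[str, dict[str, bool]]:
    """Aggregate per-dependency checks into a single status that a
    traffic manager can act on: 'healthy' only if every check passes."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception:
            results[name] = False  # treat a failing probe as unhealthy
    status = "healthy" if all(results.values()) else "degraded"
    return status, results

# A shallow liveness probe would still return 200 here; the deep probe
# reports 'degraded' because one regional dependency is failing.
status, detail = deep_health({
    "storage": lambda: True,
    "identity": lambda: False,  # e.g. Managed Identity token failures
    "database": lambda: True,
})
print(status, detail)
```

Wiring the degraded status into your traffic-routing health probes gives your failover automation a chance to react to a partial fault, rather than waiting for a total loss of connectivity that may never come.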
The real-world examples below illustrate this clearly. In each case, a shared infrastructure dependency — regional networking, Managed Identities, or Azure Resource Manager — experienced an issue that cascaded into a multi-service fault lasting hours. None of these were full region outages, yet in each case the scope of affected services and the duration of impact were significant:
Switzerland North — Network Connectivity Impact (BT6W-FX0)
A platform issue resulted in an impact to customers in Switzerland North who may have experienced service availability issues for resources hosted in the region.
| Attribute | Value |
|---|---|
| Date | September 26–27, 2025 |
| Region | Switzerland North |
| Time Window | 23:54 UTC on 26 Sep – 21:59 UTC on 27 Sep 2025 |
| Total Duration | ~22 hours |
| Services Impacted | Multiple (network-dependent services in the region) |
According to the official Post Incident Review (PIR) published by Microsoft on Azure Status History, a platform issue caused network connectivity degradation affecting multiple network-dependent services across the Switzerland North region, with impact lasting approximately 22 hours. The full root cause analysis, timeline, and remediation steps are documented in the linked PIR below.
🔗 View PIR on Azure Status History
East US and West US — Managed Identities and Dependent Services (_M5B-9RZ)
A platform issue with the Managed Identities for Azure resources service impacted customers trying to create, update, or delete Azure resources, or acquire Managed Identity tokens in East US and West US regions.
| Attribute | Value |
|---|---|
| Date | February 3, 2026 |
| Regions | East US, West US |
| Time Window | 00:10 UTC – 06:05 UTC on 03 February 2026 |
| Total Duration | ~6 hours |
| Services Impacted | Managed Identities + dependent services (resource create/update/delete, token acquisition) |
🔗 View PIR on Azure Status History
Azure Government — Azure Resource Manager Failures (ML7_-DWG)
Customers using any Azure Government region experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.
| Attribute | Value |
|---|---|
| Date | December 8, 2025 |
| Regions | Azure Government (all regions) |
| Time Window | 11:04 EST (16:04 UTC) – 14:13 EST (19:13 UTC) |
| Total Duration | ~3 hours |
| Services Impacted | 20+ services (ARM and all ARM-dependent services) |
🔗 View PIR on Azure Status History
Wrapping Up
Designing resilient Azure solutions requires understanding the full spectrum of potential infrastructure faults. The Partial Region Fault is just one of many fault types you should account for during your Fault Mode Analysis — but it is a powerful reminder that even within a single region, shared infrastructure dependencies can amplify a single failure into a multi-service outage.
Use this taxonomy as a starting point for FMA when designing your Azure architecture. The area continues to evolve along with the Azure platform and the wider industry, so watch this space and revisit your fault type analysis periodically.
In the next article, we will continue exploring additional fault types from the taxonomy. Stay tuned.
Authors & Reviewers
Authored by Zoran Jovanovic, Cloud Solutions Architect at Microsoft.
Peer Review by Catalina Alupoaie, Cloud Solutions Architect at Microsoft.
Peer Review by Stefan Johner, Cloud Solutions Architect at Microsoft.