Welcome to the Proactive Reliability Series — a collection of articles dedicated to raising awareness about the importance of designing, implementing, and operating reliable solutions in Azure. Each article will focus on a specific area of reliability engineering: from identifying critical flows and setting reliability targets, to designing for redundancy, testing strategies, and disaster recovery.
This series draws its foundation from the Reliability pillar of the Azure Well-Architected Framework, Microsoft's authoritative guidance for building workloads that are resilient to malfunction and capable of returning to a fully functioning state after a failure occurs.
In the cloud, failures are not a matter of if but when. Whether it is a regional outage, an availability zone going dark, a misconfigured resource, or a downstream service experiencing degradation — your workload will eventually face adverse conditions. The difference between a minor blip and a major incident often comes down to how deliberately you have planned for failure.
In this first article, we start with one of the most foundational practices: Fault Mode Analysis (FMA) — and the question that underpins it: what kinds of faults can actually happen in Azure?
Disclaimer: The views expressed in this article are my own and do not represent the views or positions of Microsoft. This article is written in a personal capacity and has not been reviewed, endorsed, or approved by Microsoft.
Why Fault Mode Analysis Matters
Fault Mode Analysis is the practice of systematically identifying potential points of failure within your workload and its associated flows, and then planning mitigation actions accordingly. A key tenet of FMA is that in any distributed system, failures can occur regardless of how many layers of resiliency are applied. More complex environments are simply exposed to more types of failures. Given this reality, FMA allows you to design your workload to withstand most types of failures and recover gracefully within defined recovery objectives.
If you skip FMA altogether, or perform an incomplete analysis, your workload is at risk of unpredictable behavior and potential outages caused by suboptimal design.
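To make the practice concrete, an FMA worksheet can be captured as structured data so failure modes can be ranked and reviewed programmatically. The following is a minimal, hypothetical sketch — the entries, scoring scale, and risk formula are illustrative assumptions, not an official FMA template:

```python
from dataclasses import dataclass

# Hypothetical FMA worksheet entry. Likelihood and impact are relative
# 1-5 planning scores, not statistical probabilities.
@dataclass
class FailureMode:
    component: str
    fault: str
    likelihood: int  # 1 (very low) .. 5 (very high)
    impact: int      # 1 (negligible) .. 5 (severe)
    mitigation: str

    @property
    def risk(self) -> int:
        # Simple risk-priority score used to rank mitigation work.
        return self.likelihood * self.impact

worksheet = [
    FailureMode("Web tier", "Single VM failure", 4, 2, "VM Scale Set across zones"),
    FailureMode("Database", "Zone outage", 2, 5, "Zone-redundant deployment"),
    FailureMode("Region", "Partial region fault", 2, 5, "Multi-region failover"),
]

# Review the highest-risk failure modes first.
for fm in sorted(worksheet, key=lambda f: f.risk, reverse=True):
    print(f"{fm.risk:>2}  {fm.component}: {fm.fault} -> {fm.mitigation}")
```

Keeping the worksheet as data (rather than a one-off document) makes it easier to revisit the analysis periodically as the workload and platform evolve.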
But to perform FMA effectively, you first need to understand what kinds of faults can actually occur in Azure infrastructure — and that is where most teams hit a gap.
Sample "Azure Fault Type" Taxonomy
Azure infrastructure is complex and distributed, and while Microsoft invests heavily in reliability, faults can and do occur. These faults can range from large-scale global service outages to localized issues affecting a single VM.
The following is a sample taxonomy of common Azure infrastructure fault types, categorized by their characteristics, likelihood, and mitigation strategies. The taxonomy is organized from a customer impact perspective — focusing on how fault types affect customer workloads and what mitigation options are available — rather than from an internal Azure engineering perspective.
Some of these "faults" may not even be caused by an actual failure in Azure infrastructure. They can be caused by a lack of understanding of how Azure services are designed to behave (e.g., underestimating the impact of Azure planned maintenance) or by Azure platform design decisions (e.g., capacity constraints). However, from a customer perspective, they all represent potential failure modes that need to be considered and mitigated when designing for reliability.
The following table presents infrastructure fault types from a customer impact perspective:
Disclaimer: This is an unofficial taxonomy sample of Azure infrastructure fault types. It is not an official Microsoft publication and is not officially supported, endorsed, or maintained by Microsoft. The fault type definitions, likelihood assessments, and mitigation recommendations are based on publicly available Azure documentation and general cloud architecture best practices, but may not reflect the most current Azure platform behavior. Always refer to official Azure documentation and Azure Service Health for authoritative guidance.
The "Likelihood" values below are relative planning heuristics intended to help prioritize resilience investments. They are not statistical probabilities, do not represent Azure SLA commitments, and are not derived from official Azure reliability data.
| Fault Type | Blast Radius | Likelihood | Required Redundancy for Mitigation |
|---|---|---|---|
| Service Fault (Global) | Worldwide or Multiple Regions | Very Low | High |
| Service Fault (Region) | Single service in region | Medium | Region Redundancy |
| Region Fault | Single region | Very Low | Region Redundancy |
| Partial Region Fault | Multiple services in a single Region | Low | Region Redundancy |
| Availability Zone Fault | Single AZ within region | Low | Availability Zone Redundancy |
| Single Resource Fault | Single VM/instance | High | Resource Redundancy |
| Platform Maintenance Fault | Variable (resource to region) | High | Resource Redundancy, Maintenance Schedules |
| Region Capacity Constraint Fault | Single region | Low | Region Redundancy, Capacity Reservations |
| Network POP Location Fault | Network hardware Colocation site | Low | Site Redundancy |
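One way to make the table actionable is to check the redundancy your workload actually deploys against the redundancy each fault type requires, and list what remains unmitigated. A minimal sketch under simplifying assumptions — it maps a subset of the taxonomy to a single primary redundancy level each, whereas the real table allows multiple mitigations per fault type:

```python
# Simplified subset of the taxonomy above: each fault type mapped to the
# redundancy level that primarily mitigates it. Illustrative only; the
# global-service and network-POP rows are omitted because their
# mitigations do not reduce to a single redundancy level.
REQUIRED_REDUNDANCY = {
    "Service Fault (Region)": "region",
    "Region Fault": "region",
    "Partial Region Fault": "region",
    "Availability Zone Fault": "zone",
    "Single Resource Fault": "resource",
    "Platform Maintenance Fault": "resource",
    "Region Capacity Constraint Fault": "region",
}

def unmitigated(deployed: set[str]) -> list[str]:
    """Return the fault types not covered by the deployed redundancy levels."""
    return [fault for fault, level in REQUIRED_REDUNDANCY.items()
            if level not in deployed]

# A zonal deployment covers resource- and zone-scoped faults,
# but every region-scoped fault type remains unmitigated.
print(unmitigated({"resource", "zone"}))
```

Running a check like this during design reviews surfaces the gap many teams miss: zone redundancy alone leaves every region-scoped fault type, including the Partial Region Fault, uncovered.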
In future articles we will examine each of these fault types in detail. For this first article, let's take a closer look at one that is often underestimated: the Partial Region Fault.
Deep Dive: "Partial Region Fault"
A Partial Region Fault is a fault affecting multiple Azure services within a single region simultaneously, typically due to shared regional infrastructure dependencies, regional network issues, or regional platform incidents. Sometimes the number of affected services is significant enough to resemble a full region outage, but the key distinction is that the region is not completely lost: some services continue to operate normally while others experience degradation or unavailability. Unlike a region outage caused by a natural disaster, the documented partial region faults referenced later in this article have historically been resolved within hours.
| Attribute | Description |
|---|---|
| Blast Radius | Multiple services within a single region |
| Likelihood | Low |
| Typical Duration | Minutes to hours |
| Fault Tolerance Options | Multi-region architecture; cross-region failover |
| Fault Tolerance Cost | High |
| Impact | Severe |
| Typical Cause | Regional networking infrastructure failure affecting multiple services, regional storage subsystem degradation impacting dependent services, regional control plane issues affecting service management |
These faults are rare, but they can happen — and when they do, they can have a severe impact on customer solutions that are not architected for multi-region resilience.
What makes Partial Region Faults particularly dangerous is that they fall into a blind spot in most teams' resilience planning. When organizations think about regional failures, they tend to think in binary terms: either a region is up or it is down. Disaster recovery runbooks are written around the idea of a full region outage — triggered by a natural disaster or a catastrophic infrastructure event — where the response is to fail over everything to a secondary region.
But a Partial Region Fault is not a full region outage. It is something more insidious. A subset of services in the region degrades or becomes unavailable while others continue to function normally. Your VMs might still be running, but the networking layer that connects them is broken. Your compute is fine, but Azure Resource Manager — the control plane through which you manage everything — is unreachable.
This partial nature creates several problems that teams rarely plan for:
- Failover logic may not trigger. Most automated failover mechanisms are designed to detect a complete loss of connectivity to a region. When only some services are affected, health probes may still pass, traffic managers may still route requests to the degraded region, and your failover automation may sit idle — while your users are already experiencing errors.
- Recovery is more complex. With a full region outage, the playbook is straightforward: fail over to the secondary region. With a partial fault, you may need to selectively fail over some services while others remain in the primary region — a scenario that few teams have tested and most architectures do not support gracefully.
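One practical countermeasure for the first problem is to expose a deep health endpoint that aggregates checks against each regional dependency (storage, identity, database, and so on) instead of a shallow liveness probe. The sketch below is a hypothetical illustration of the pattern — the check names and lambdas are placeholders for real dependency probes, not an official Azure mechanism:

```python
from typing import Callable

# Each check probes one regional dependency and returns True if healthy.
# In a real workload these would attempt a storage read, a token
# acquisition, a database query, etc.
HealthCheck = Callable[[], bool]

def deep_health(checks: dict[str, HealthCheck]) -> tuple[str, dict[str, bool]]:
    """Aggregate per-dependency checks into a single status that a
    traffic manager can act on: 'healthy' only if every check passes."""
    results: dict[str, bool] = {}
    for name, check in checks.items():
        try:
            results[name] = check()
        except Exception:
            results[name] = False  # treat a failing probe as unhealthy
    status = "healthy" if all(results.values()) else "degraded"
    return status, results

# A shallow liveness probe would still return 200 here; the deep probe
# reports 'degraded' because one regional dependency is failing.
status, detail = deep_health({
    "storage": lambda: True,
    "identity": lambda: False,  # e.g. Managed Identity token failures
    "database": lambda: True,
})
print(status, detail)
```

Wiring the degraded status into your traffic-routing health probes gives your failover automation a chance to react to a partial fault, rather than waiting for a total loss of connectivity that may never come.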
The real-world examples below illustrate this clearly. In each case, a shared infrastructure dependency — regional networking, Managed Identities, or Azure Resource Manager — experienced an issue that cascaded into a multi-service fault lasting hours. None of these were full region outages, yet in each case the scope of affected services and the duration of impact were significant:
Switzerland North — Network Connectivity Impact (BT6W-FX0)
A platform issue resulted in an impact to customers in Switzerland North who may have experienced service availability issues for resources hosted in the region.
| Attribute | Value |
|---|---|
| Date | September 26–27, 2025 |
| Region | Switzerland North |
| Time Window | 23:54 UTC on 26 Sep – 21:59 UTC on 27 Sep 2025 |
| Total Duration | ~22 hours |
| Services Impacted | Multiple (network-dependent services in the region) |
According to the official Post Incident Review (PIR) published by Microsoft on Azure Status History, a platform issue caused network connectivity degradation affecting multiple network-dependent services across the Switzerland North region, with impact lasting approximately 22 hours. The full root cause analysis, timeline, and remediation steps are documented in the linked PIR below.
🔗 View PIR on Azure Status History
East US and West US — Managed Identities and Dependent Services (_M5B-9RZ)
A platform issue with the Managed Identities for Azure resources service impacted customers trying to create, update, or delete Azure resources, or acquire Managed Identity tokens in East US and West US regions.
| Attribute | Value |
|---|---|
| Date | February 3, 2026 |
| Regions | East US, West US |
| Time Window | 00:10 UTC – 06:05 UTC on 03 February 2026 |
| Total Duration | ~6 hours |
| Services Impacted | Managed Identities + dependent services (resource create/update/delete, token acquisition) |
🔗 View PIR on Azure Status History
Azure Government — Azure Resource Manager Failures (ML7_-DWG)
Customers using any Azure Government region experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.
| Attribute | Value |
|---|---|
| Date | December 8, 2025 |
| Regions | Azure Government (all regions) |
| Time Window | 11:04 EST (16:04 UTC) – 14:13 EST (19:13 UTC) |
| Total Duration | ~3 hours |
| Services Impacted | 20+ services (ARM and all ARM-dependent services) |
🔗 View PIR on Azure Status History
Wrapping Up
Designing resilient Azure solutions requires understanding the full spectrum of potential infrastructure faults. The Partial Region Fault is just one of many fault types you should account for during your Fault Mode Analysis — but it is a powerful reminder that even within a single region, shared infrastructure dependencies can amplify a single failure into a multi-service outage.
Use this taxonomy as a starting point for FMA when designing your Azure architecture. The area continues to evolve along with the Azure platform and the wider industry, so watch this space and revisit your fault type analysis periodically.
In the next article, we will continue exploring additional fault types from the taxonomy. Stay tuned.
Authors & Reviewers
Authored by Zoran Jovanovic, Cloud Solutions Architect at Microsoft.
Peer Review by Catalina Alupoaie, Cloud Solutions Architect at Microsoft.
Peer Review by Stefan Johner, Cloud Solutions Architect at Microsoft.