
Azure Architecture Blog

Proactive Reliability Series — Article 1: Fault Types in Azure

Apr 01, 2026

Welcome to the Proactive Reliability Series — a collection of articles dedicated to raising awareness about the importance of designing, implementing, and operating reliable solutions in Azure. Each article will focus on a specific area of reliability engineering: from identifying critical flows and setting reliability targets, to designing for redundancy, testing strategies, and disaster recovery.

This series draws its foundation from the Reliability pillar of the Azure Well-Architected Framework, Microsoft's authoritative guidance for building workloads that are resilient to malfunction and capable of returning to a fully functioning state after a failure occurs.

In the cloud, failures are not a matter of if but when. Whether it is a regional outage, an availability zone going dark, a misconfigured resource, or a downstream service experiencing degradation — your workload will eventually face adverse conditions. The difference between a minor blip and a major incident often comes down to how deliberately you have planned for failure.

In this first article, we start with one of the most foundational practices: Failure Mode Analysis (FMA), along with the question that underpins it: what kinds of faults can actually happen in Azure?

Disclaimer: The views expressed in this article are my own and do not represent the views or positions of Microsoft. This article is written in a personal capacity and has not been reviewed, endorsed, or approved by Microsoft.

Why Failure Mode Analysis Matters

Failure Mode Analysis is the practice of systematically identifying potential points of failure within your workload and its associated flows, and then planning mitigation actions accordingly. A key tenet of FMA is that failures can occur in any distributed system, regardless of how many layers of resiliency are applied; more complex environments are simply exposed to more types of failure. Given this reality, FMA allows you to design your workload to withstand most types of failure and recover gracefully within defined recovery objectives.
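As a rough sketch, a single row of an FMA worksheet for a hypothetical order-processing flow might be captured like this. The field names and all values are illustrative assumptions, not an official Microsoft schema:

```python
# Illustrative FMA worksheet entry for a hypothetical order-processing flow.
# Field names follow the spirit of failure mode analysis guidance, but the
# schema and every value below are assumptions for demonstration only.
fma_entry = {
    "flow": "Order checkout",
    "component": "Regional SQL database",
    "failure_mode": "Primary replica unavailable",
    "detection": "Connection timeouts exceed threshold in app telemetry",
    "mitigation": "Auto-failover to a secondary region",
    "recovery_time_objective_minutes": 60,
    "recovery_point_objective_minutes": 5,
}

def within_objectives(entry, observed_rto_min, observed_rpo_min):
    """Check whether an observed recovery met the entry's stated objectives."""
    return (observed_rto_min <= entry["recovery_time_objective_minutes"]
            and observed_rpo_min <= entry["recovery_point_objective_minutes"])

# A recovery that took 45 minutes with 2 minutes of data loss meets
# the objectives defined above; 90 minutes would not.
print(within_objectives(fma_entry, observed_rto_min=45, observed_rpo_min=2))
```

Keeping FMA entries as structured data like this makes it easy to review them alongside drill results and check recoveries against the objectives you committed to.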

If you skip FMA altogether, or perform an incomplete analysis, your workload is at risk of unpredicted behavior and potential outages caused by suboptimal design.

But to perform FMA effectively, you first need to understand what kinds of faults can actually occur in Azure infrastructure — and that is where most teams hit a gap.

Sample "Azure Fault Type" Taxonomy

Azure infrastructure is complex and distributed, and while Microsoft invests heavily in reliability, faults can and do occur. These faults can range from large-scale global service outages to localized issues affecting a single VM.

The following is a sample taxonomy of common Azure infrastructure fault types, categorized by their characteristics, likelihood, and mitigation strategies. The taxonomy is organized from a customer impact perspective — focusing on how fault types affect customer workloads and what mitigation options are available — rather than from an internal Azure engineering perspective.

Some of these "faults" may not even be caused by an actual failure in Azure infrastructure. They can instead stem from a lack of understanding of the designed behavior of Azure services (e.g., underestimating the impact of Azure planned maintenance) or from Azure platform design decisions (e.g., capacity constraints). However, from a customer perspective, they all represent potential failure modes that need to be considered and mitigated when designing for reliability.

The following table presents infrastructure fault types from a customer impact perspective:

Disclaimer: This is an unofficial taxonomy sample of Azure infrastructure fault types. It is not an official Microsoft publication and is not officially supported, endorsed, or maintained by Microsoft. The fault type definitions, likelihood assessments, and mitigation recommendations are based on publicly available Azure documentation and general cloud architecture best practices, but may not reflect the most current Azure platform behavior. Always refer to official Azure documentation and Azure Service Health for authoritative guidance.

The "Likelihood" values below are relative planning heuristics intended to help prioritize resilience investments. They are not statistical probabilities, do not represent Azure SLA commitments, and are not derived from official Azure reliability data.

| Fault Type | Blast Radius | Likelihood | Mitigation / Redundancy Requirement |
| --- | --- | --- | --- |
| Service Fault (Global) | Worldwide or multiple regions | Very Low | High |
| Service Fault (Region) | Single service in region | Medium | Region Redundancy |
| Region Fault | Single region | Very Low | Region Redundancy |
| Partial Region Fault | Multiple services in a single region | Low | Region Redundancy |
| Availability Zone Fault | Single AZ within region | Low | Availability Zone Redundancy |
| Single Resource Fault | Single VM/instance | High | Resource Redundancy |
| Platform Maintenance Fault | Variable (resource to region) | High | Resource Redundancy, Maintenance Schedules |
| Region Capacity Constraint Fault | Single region | Low | Region Redundancy, Capacity Reservations |
| Network POP Location Fault | Network hardware colocation site | Low | Site Redundancy |
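To make the taxonomy actionable during an FMA review, it can help to capture it as data and sort fault types by relative likelihood when prioritizing resilience investments. The sketch below is illustrative only: the numeric weights are arbitrary assumptions, not Azure reliability data, and only a subset of the table is shown.

```python
from dataclasses import dataclass

# Relative ordering of the "Likelihood" planning heuristics from the table.
# The numeric weights are arbitrary assumptions used only for sorting.
LIKELIHOOD_WEIGHT = {"Very Low": 1, "Low": 2, "Medium": 3, "High": 4}

@dataclass
class FaultType:
    name: str
    blast_radius: str
    likelihood: str
    mitigation: str

# A subset of the taxonomy above, captured as data.
TAXONOMY = [
    FaultType("Region Fault", "Single region", "Very Low", "Region Redundancy"),
    FaultType("Partial Region Fault", "Multiple services in a single region",
              "Low", "Region Redundancy"),
    FaultType("Availability Zone Fault", "Single AZ within region",
              "Low", "Availability Zone Redundancy"),
    FaultType("Single Resource Fault", "Single VM/instance",
              "High", "Resource Redundancy"),
]

def by_likelihood(taxonomy):
    """Order fault types from most to least likely, for review prioritization."""
    return sorted(taxonomy, key=lambda f: LIKELIHOOD_WEIGHT[f.likelihood],
                  reverse=True)

for f in by_likelihood(TAXONOMY):
    print(f"{f.name}: likelihood={f.likelihood}, mitigation={f.mitigation}")
```

Sorting by likelihood alone is a deliberate simplification; a real review would also weigh blast radius, business impact, and the cost of each mitigation.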

In future articles we will examine each of these fault types in detail. For this first article, let's take a closer look at one that is often underestimated: the Partial Region Fault.

Deep Dive: "Partial Region Fault"

A Partial Region Fault affects multiple Azure services within a single region simultaneously, typically due to shared regional infrastructure dependencies, regional network issues, or regional platform incidents. Sometimes the number of affected services is significant enough to resemble a full region outage, but the key distinction is that the region is not completely lost: some services may continue to operate normally while others experience degradation or unavailability. And unlike a region outage caused by a natural disaster, in the documented cases referenced later in this article, such Partial Region Faults have historically been resolved within hours.

| Attribute | Description |
| --- | --- |
| Blast Radius | Multiple services within a single region |
| Likelihood | Low |
| Typical Duration | Minutes to hours |
| Fault Tolerance Options | Multi-region architecture; cross-region failover |
| Fault Tolerance Cost | High |
| Impact | Severe |
| Typical Cause | Regional networking infrastructure failure affecting multiple services; regional storage subsystem degradation impacting dependent services; regional control plane issues affecting service management |

These faults are rare, but they can happen — and when they do, they can have a severe impact on customer solutions that are not architected for multi-region resilience.

What makes Partial Region Faults particularly dangerous is that they fall into a blind spot in most teams' resilience planning. When organizations think about regional failures, they tend to think in binary terms: either a region is up or it is down. Disaster recovery runbooks are written around the idea of a full region outage — triggered by a natural disaster or a catastrophic infrastructure event — where the response is to fail over everything to a secondary region.

But a Partial Region Fault is not a full region outage. It is something more insidious. A subset of services in the region degrades or becomes unavailable while others continue to function normally. Your VMs might still be running, but the networking layer that connects them is broken. Your compute is fine, but Azure Resource Manager — the control plane through which you manage everything — is unreachable.

This partial nature creates several problems that teams rarely plan for:

  • Failover logic may not trigger. Most automated failover mechanisms are designed to detect a complete loss of connectivity to a region. When only some services are affected, health probes may still pass, traffic managers may still route requests to the degraded region, and your failover automation may sit idle — while your users are already experiencing errors.
  • Recovery is more complex. With a full region outage, the playbook is straightforward: fail over to the secondary region. With a partial fault, you may need to selectively fail over some services while others remain in the primary region — a scenario that few teams have tested and most architectures do not support gracefully.
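One way to reduce the first blind spot is to make health probes "deep": instead of a single liveness check, which can keep passing during a Partial Region Fault, probe each critical regional dependency individually and report healthy only when all of them respond. The sketch below is a minimal illustration of that idea; the dependency names and endpoint URLs are placeholders, not real services.

```python
import urllib.request

# Hypothetical "deep" health check: probe each critical regional dependency
# individually, so a partial fault (one dependency down, the rest fine)
# still flips the aggregate health signal. All URLs below are placeholders.
CRITICAL_DEPENDENCIES = {
    "database": "https://db.contoso.example/health",
    "storage": "https://storage.contoso.example/health",
    "identity": "https://identity.contoso.example/health",
}

def probe(url: str, timeout: float = 3.0) -> bool:
    """Return True if the dependency answers HTTP 200 within the timeout."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as resp:
            return resp.status == 200
    except OSError:
        return False

def deep_health(dependencies: dict, prober=probe) -> dict:
    """Probe every dependency; healthy only if ALL critical probes pass."""
    results = {name: prober(url) for name, url in dependencies.items()}
    return {"healthy": all(results.values()), "dependencies": results}

# Simulate a partial fault: the identity dependency is down while the
# others still respond. The aggregate signal reports unhealthy, which is
# what a failover mechanism would need to see to act.
fake_prober = lambda url: "identity" not in url
print(deep_health(CRITICAL_DEPENDENCIES, prober=fake_prober))
```

Wiring an endpoint like this into your load balancer or traffic-routing health probes lets failover automation react to a degraded dependency rather than only to a total loss of the region.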

The real-world examples below illustrate this clearly. In each case, a shared infrastructure dependency — regional networking, Managed Identities, or Azure Resource Manager — experienced an issue that cascaded into a multi-service fault lasting hours. None of these were full region outages, yet in each case the scope of affected services and the duration of impact were significant:

 

Switzerland North — Network Connectivity Impact (BT6W-FX0)

A platform issue resulted in an impact to customers in Switzerland North who may have experienced service availability issues for resources hosted in the region.

| Attribute | Value |
| --- | --- |
| Date | September 26–27, 2025 |
| Region | Switzerland North |
| Time Window | 23:54 UTC on 26 Sep – 21:59 UTC on 27 Sep 2025 |
| Total Duration | ~22 hours |
| Services Impacted | Multiple (network-dependent services in the region) |

According to the official Post Incident Review (PIR) published by Microsoft on Azure Status History, a platform issue caused network connectivity degradation affecting multiple network-dependent services across the Switzerland North region, with impact lasting approximately 22 hours. The full root cause analysis, timeline, and remediation steps are documented in the linked PIR below.

🔗 View PIR on Azure Status History

 

East US and West US — Managed Identities and Dependent Services (_M5B-9RZ)

A platform issue with the Managed Identities for Azure resources service impacted customers trying to create, update, or delete Azure resources, or acquire Managed Identity tokens in East US and West US regions.

| Attribute | Value |
| --- | --- |
| Date | February 3, 2026 |
| Regions | East US, West US |
| Time Window | 00:10 UTC – 06:05 UTC on 03 February 2026 |
| Total Duration | ~6 hours |
| Services Impacted | Managed Identities + dependent services (resource create/update/delete, token acquisition) |

🔗 View PIR on Azure Status History

 

Azure Government — Azure Resource Manager Failures (ML7_-DWG)

Customers using any Azure Government region experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

| Attribute | Value |
| --- | --- |
| Date | December 8, 2025 |
| Regions | Azure Government (all regions) |
| Time Window | 11:04 EST (16:04 UTC) – 14:13 EST (19:13 UTC) |
| Total Duration | ~3 hours |
| Services Impacted | 20+ services (ARM and all ARM-dependent services) |

🔗 View PIR on Azure Status History

Wrapping Up

Designing resilient Azure solutions requires understanding the full spectrum of potential infrastructure faults. The Partial Region Fault is just one of many fault types you should account for during your Failure Mode Analysis — but it is a powerful reminder that even within a single region, shared infrastructure dependencies can amplify a single failure into a multi-service outage.

Use this taxonomy as a starting point for FMA when designing your Azure architecture. This area continues to evolve along with the Azure platform and the industry — watch this space and revisit your fault type analysis periodically.

In the next article, we will continue exploring additional fault types from the taxonomy. Stay tuned.

Authors & Reviewers

Authored by Zoran Jovanovic, Cloud Solutions Architect at Microsoft.
Peer Review by Catalina Alupoaie, Cloud Solutions Architect at Microsoft.
Peer Review by Stefan Johner, Cloud Solutions Architect at Microsoft.

Updated Apr 01, 2026
Version 1.0