Reliability and Resiliency in Azure
Modern Azure Resilience with Mark Russinovich
Resiliency in the cloud reflects different priorities, from consistent performance, to withstanding failures, to predictable recovery. These map to reliability, resiliency, and recoverability, which together guide how workloads should be designed on Azure. This post extends foundational guidance with practical multi-region design decisions, including when to use availability zones, paired regions, and non-paired regions to meet business continuity goals.

Reliability in Azure isn't defined by a single recommendation, but by a set of architectural patterns designed to balance cost, complexity, recovery speed, and operational effort, because no single approach fits every workload. While disaster recovery is a common driver for multi-region designs, long-term scale planning also matters. Azure regions operate within defined physical and latency boundaries, and large-scale workloads may eventually approach the practical capacity limits of a single region.

This post introduces four resilience patterns, outlining when and why to use each so you can assess options based on your non-functional requirements. It also explains how availability zone–based designs can often provide an alternative to paired regions as a default choice. Here are a few common reliability and availability architecture patterns:

- In-region High Availability (HA) with Availability Zones (AZ): Maximize availability within a single Azure region by deploying across multiple availability zones.
- Regional Business Continuity and Disaster Recovery (BCDR): A primary/secondary region strategy implemented across separate Azure regions, selected based on geographic risk boundaries, regulatory requirements, and service availability. Recovery sequencing and failover behaviors are defined by workload dependencies and organizational requirements.
- Non-paired region BCDR: A primary/secondary region strategy where the secondary region is chosen based on requirements such as capacity, service availability, data residency, and network latency. This approach also supports long-term scale planning, since Azure regions operate within physical datacenter footprints and latency boundaries and can reach practical capacity limits as workloads grow. See multi-region solutions in non-paired regions.
- Multi-region active/active: Deploy workloads across multiple regions simultaneously so that each region can serve production traffic. This approach can provide both high availability and disaster resilience while improving global performance, but it introduces additional architectural complexity and operational overhead.

The rest of this post helps you understand the tradeoffs across these patterns, enabling you to select the right approach per workload while avoiding unnecessary cost and operational complexity.

First post in this series: Achieve agility and scale in a dynamic cloud world

Why did Azure launch with paired regions?

Azure launched in 2010 and was rebranded to Microsoft Azure in 2014. From the beginning, regions were introduced in pairs (West US & East US, West Europe & North Europe, Southeast Asia & East Asia) to align with common enterprise business continuity practices at the time. Many organizations operated multiple datacenters within the same geographic boundary, separated by sufficient distance to reduce shared risk while maintaining regulatory and operational alignment.
This design mirrored familiar enterprise BCDR practices and offered:

- A familiar primary/secondary failover pattern consistent with enterprise BCDR strategies
- Support for regulatory or data residency requirements that required disaster recovery within a defined geographic boundary
- Turnkey replication capabilities for services such as Geo-Redundant Storage (GRS)
- Platform-level sequencing of updates to reduce the likelihood of simultaneous regional impact
- A defined regional recovery prioritization model for rare geography-wide incidents

This model provided assurance that Azure could meet or exceed the resilience of legacy enterprise environments while simplifying early cloud adoption through predefined recovery patterns. However, Azure's engineering strategy has evolved. Many services now support replication to a region of choice rather than being limited to predefined pairs. This provides architects with greater flexibility to select regions based on workload requirements, risk boundaries, compliance constraints, capacity considerations, and cost models. It's important to recognize that regional parity is never guaranteed, even between paired regions. Differences in service availability, supported SKUs, scale limits, capacity, cost, and operational maturity must be explicitly accounted for in the workload design.

How has cloud resilience evolved since launch?

The introduction of Availability Zones in 2018 marked a significant advancement in Azure resilience. Availability Zones are physically isolated groups of datacenters within a region; each zone has independent power, cooling, and networking. Many Azure services (App Service, Storage, Azure SQL, etc.) use zones to provide platform-managed resilience. In addition, customers can deploy zonal resources, such as virtual machines, into specific zones or distribute them across zones to design for higher availability.
Where previously Azure regions were launched in pairs, since 2020 regions have typically been designed with multiple availability zones and without a paired region. This design enables:

- High availability within a single region
- Platform-managed resilience for most failure scenarios
- Reduced need for multi-region deployments for standard high-availability requirements

How should customers design for resilience when using both paired and non-paired regions?

To decide which resiliency model makes sense, customers should start by defining clear expectations, including uptime targets, recovery time objectives (RTO), recovery point objectives (RPO), latency tolerance, and data residency. These non-functional requirements should directly influence architectural decisions. In practice, High Availability (HA) and Disaster Recovery (DR) are differentiated by recovery objectives rather than geography. HA architectures target near-zero downtime and minimal data loss, while DR solutions allow for defined recovery times and acceptable data loss. While HA is commonly established within a region using availability zones, it can also be achieved across regions through active-active designs. Similarly, DR is typically implemented across regions using replication and failover strategies.

HA: Availability Zones

When designing for high availability within a region, Azure builds on AZs with two models:

- Zone-redundant resources are replicated across multiple availability zones to ensure data remains accessible even if one zone fails. Some services provide built-in zone redundancy, while others require manual configuration. Typically, Microsoft chooses the zones used for your resources, though some services allow you to select them.
- Zonal resources are deployed in a single availability zone and do not provide automatic resiliency against zone outages. While faults in other zones do not affect them, ensuring resiliency requires deploying separate resources across multiple zones.
Microsoft does not handle this process; you are responsible for managing failover if an outage occurs.

The decision to design a zone-resilient architecture is critical for balancing availability requirements with cost and regional capacity constraints. Designing workloads to be resilient across availability zones is generally the preferred approach for improving availability and protecting against zone-level failures. Deploying workloads across availability zones can enhance fault tolerance and reduce downtime when supported by the Azure service being used. However, architects should still consider workload characteristics, cost implications, and potential latency impacts, which may vary depending on the services and architecture patterns involved. Ultimately, zone resiliency is an architectural decision that should be strategically aligned with business priorities and risk tolerance, not simply treated as a checkbox to be ticked during deployment.

DR: Paired and Non-Paired Regions

Region pairs should be viewed as an architectural choice rather than a rule. Historically, paired regions played a key role in minimizing correlated failures and streamlining platform updates and recovery processes. However, as the Azure Safe Deployment Practices (SDP) have matured, the advantages of region pairs have become more nuanced. Over time, SDP has evolved to support safer and more flexible change management through longer and more adaptable bake times, richer operational signal integration, and an expanded understanding of regional deployment boundaries. These improvements enable Azure to release changes more safely across a growing and increasingly diverse regional footprint, while still balancing reliability with time-to-market. As a result, regional pairs are no longer the sole mechanism for managing correlated change risk, but one of several architectural tools customers can apply based on their resiliency and compliance needs.
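Whether redundancy is applied at the zone or region level, the availability/cost tradeoff discussed above can be sized with simple back-of-the-envelope arithmetic: with n independent copies of a workload, composite availability is one minus the probability that all copies are down at once. The sketch below is illustrative only; the per-copy availability figure is an assumption for the example, not an Azure SLA.

```python
# Back-of-the-envelope redundancy math (illustrative; the per-copy
# availability value below is an assumed figure, not an Azure SLA).

def composite_availability(per_copy: float, copies: int) -> float:
    """Availability of a workload that survives as long as at least one
    of `copies` independent replicas is up: 1 - P(all copies down)."""
    return 1.0 - (1.0 - per_copy) ** copies

if __name__ == "__main__":
    a = 0.999  # assumed availability of a single zonal deployment
    for n in (1, 2, 3):
        print(f"{n} zone(s): {composite_availability(a, n):.9f}")
```

Real zones and regions are not perfectly independent, and correlated failures reduce the benefit in practice, so treat the result as an upper bound when weighing extra redundancy against its cost.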
Using non-paired regions, or a mix of paired and non-paired regions, allows customers to design high availability and disaster recovery architectures that are driven by business, compliance, and application requirements rather than fixed regional relationships. This enables customers to optimize for data residency, regulatory boundaries, and latency to specific user populations, and to provide differentiated recovery objectives across their workloads. This approach can also reduce exposure to rare but high-impact platform-level events by avoiding tightly coupled regional behaviors.

While some Azure services natively simplify replication and recovery within paired regions, and others support replication across arbitrary regions (such as Azure SQL, Cosmos DB, and Azure Blob Storage with object replication), non-paired designs encourage explicit, workload-aware resiliency strategies such as application-level replication, asynchronous data sync, and failover orchestration. Although this introduces more architectural responsibility and may require compensating for paired-region features, it delivers greater transparency, predictable recovery behavior, and alignment with business-driven RTO/RPO requirements rather than platform defaults. Regional failover is a customer-orchestrated decision; customers should design, test, and operate their own failover and failback processes rather than assuming platform-initiated regional failover.

Designing for regional resilience requires distinguishing between workload mobility and data protection. Azure provides two complementary capabilities that address these needs differently: Azure Site Recovery (ASR) and Azure Backup. Azure Site Recovery (ASR) enables near-continuous replication and orchestrated failover of virtual machine–based workloads to a region of choice, not limited to paired regions. ASR is the primary mechanism for customers who need low RPO, controlled failover, and workload restart in a secondary region.
This is especially relevant for regions without a paired region, or where the paired region does not meet capacity, service availability, or compliance needs. Azure Backup provides durable, policy-based data protection, independent of compute availability. While Azure Backup is not a high-availability or infrastructure failover solution, it plays a critical role when services do not support region-of-choice replication natively. In these scenarios, backup and restore become the recovery mechanism. These two services are often used together: ASR for VM-level workload continuity, and Azure Backup for protecting and restoring data across regions, including to non-paired regions.

I am using paired regions today – does this mean I need to change my architecture?

If your current architecture is built around paired regions for compliance, data residency, or strict disaster recovery objectives, that model remains valid and supported. Azure continues to support paired regions, providing prioritized recovery sequencing, staggered platform updates, and geo-aligned data residency, all backed by Microsoft's global infrastructure strategy. What has changed is that paired regions are no longer the only way to achieve enterprise-grade resilience. For many workloads that adopted a paired-region (1+1) model primarily to protect against local datacenter failure, Availability Zones combined with geo-redundant services now provide equivalent or better protection with far less architectural complexity and cost. The shift to non-paired regions is therefore not a forced migration, but an opportunity to simplify. Customers can continue using paired regions where business requirements demand it, while selectively modernizing other workloads to take advantage of platform-managed zone resilience.

What's coming up next for resilience in Azure?

Resilience is evolving from static guidance to continuous, workload-aware execution.
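Because regional failover is customer-orchestrated, the go/no-go decision itself is worth encoding and testing rather than improvising during an incident. The following is a minimal, hypothetical decision gate (the names and the 50% threshold are assumptions for illustration, not Azure guidance): it approves failover only when the primary region is degraded and the secondary is measurably healthier.

```python
# Minimal customer-orchestrated failover decision gate (illustrative).
# Health figures would come from your own probes; the threshold and
# class names here are hypothetical, not Azure defaults.

from dataclasses import dataclass

@dataclass
class RegionHealth:
    name: str
    checks_passed: int
    checks_total: int

    @property
    def ratio(self) -> float:
        return self.checks_passed / self.checks_total

def should_fail_over(primary: RegionHealth, secondary: RegionHealth,
                     threshold: float = 0.5) -> bool:
    """Approve failover only when the primary is below the health
    threshold AND the secondary is healthier; failing over into an
    equally degraded region would make recovery worse, not better."""
    return primary.ratio < threshold and secondary.ratio > primary.ratio
```

A real orchestrator would add hysteresis (require several consecutive failed evaluations) and a deliberate, tested failback path, in line with the guidance above to design, test, and operate these processes yourself.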
A multi-region strategy isn't only about recovery; it's also a practical hedge against regional capacity constraints (regions have physical limits within a latency boundary, so growth can eventually hit caps). Resiliency agent in Azure Copilot (preview) helps you spot missing resiliency coverage, such as zone alignment gaps or missing backup/DR, and provides automated guidance (including scripts) to remediate issues, configure Azure Backup and Azure Site Recovery, and define recovery drills. Resiliency in Azure brings zone resiliency, high availability, backup, DR, and ransomware protection together into a unified experience within Azure Copilot, enabling teams to set resiliency goals, receive proactive recommendations, and view service-group insights via the Azure portal.

If you're looking for service-specific BCDR and replication guidance, use these authoritative starting points:

- Cloud Adoption Framework (CAF) – Landing zone design area (BCDR): guidance to define platform DR requirements (RTO/RPO), data residency considerations, and operational readiness as part of landing zone design.
- Azure Well-Architected Framework (WAF) – Disaster recovery strategies: guidance for structuring, testing, and operating DR plans aligned to recovery targets, with links to companion DR planning resources.
- WAF design guide – Regions & Availability Zones: how to choose between zone- vs region-based approaches and understand reliability/cost/performance tradeoffs.
- Azure service reliability guides: service-by-service reliability/replication behavior and customer responsibilities.
- Non-paired multi-region configurations: examples of supported multi-region approaches when regions aren't paired.

Validate feasibility before you design: confirm service/SKU/zone availability in both regions.

Next step: Explore Azure Essentials for guidance and tools to build secure, resilient, cost-efficient Azure projects.
To see how shared responsibility and Azure Essentials come together in practice, read Resiliency in the cloud—empowered by shared responsibility and Azure Essentials and How to design reliable, resilient, and recoverable workloads on Azure on the Microsoft Azure Blog. For expert-led, outcome-based engagements to strengthen resiliency and operational readiness, Microsoft Unified provides end-to-end support across the Microsoft cloud. To move from guidance to execution, start your project with experts and investments through Azure Accelerate.

Related Resources

High Availability
- Architecture strategies for using Availability Zones and Regions
- Architecture strategies for highly available multi-region design

Disaster Recovery
- Architecture strategies for designing a Disaster Recovery strategy
- Multi-Region solutions in nonpaired Regions
- Develop a disaster recovery plan for multi-region deployments

Azure Regions and Services
- Azure region pairs and nonpaired regions
- Reliability guides for Azure services

Proactive Reliability Series — Article 1: Fault Types in Azure
Welcome to the Proactive Reliability Series — a collection of articles dedicated to raising awareness about the importance of designing, implementing, and operating reliable solutions in Azure. Each article will focus on a specific area of reliability engineering: from identifying critical flows and setting reliability targets, to designing for redundancy, testing strategies, and disaster recovery. This series draws its foundation from the Reliability pillar of the Azure Well-Architected Framework, Microsoft's authoritative guidance for building workloads that are resilient to malfunction and capable of returning to a fully functioning state after a failure occurs.

In the cloud, failures are not a matter of if but when. Whether it is a regional outage, an availability zone going dark, a misconfigured resource, or a downstream service experiencing degradation, your workload will eventually face adverse conditions. The difference between a minor blip and a major incident often comes down to how deliberately you have planned for failure. In this first article, we start with one of the most foundational practices, Failure Mode Analysis (FMA), and the question that underpins it: what kinds of faults can actually happen in Azure?

Disclaimer: The views expressed in this article are my own and do not represent the views or positions of Microsoft. This article is written in a personal capacity and has not been reviewed, endorsed, or approved by Microsoft.

Why Failure Mode Analysis Matters

Failure Mode Analysis is the practice of systematically identifying potential points of failure within your workload and its associated flows, and then planning mitigation actions accordingly. A key tenet of FMA is that in any distributed system, failures can occur regardless of how many layers of resiliency are applied. More complex environments are simply exposed to more types of failures.
Given this reality, FMA allows you to design your workload to withstand most types of failures and recover gracefully within defined recovery objectives. If you skip FMA altogether, or perform an incomplete analysis, your workload is at risk of unpredicted behavior and potential outages caused by suboptimal design. But to perform FMA effectively, you first need to understand what kinds of faults can actually occur in Azure infrastructure — and that is where most teams hit a gap.

Sample "Azure Fault Type" Taxonomy

Azure infrastructure is complex and distributed, and while Microsoft invests heavily in reliability, faults can and do occur. These faults can range from large-scale global service outages to localized issues affecting a single VM. The following is a sample taxonomy of common Azure infrastructure fault types, categorized by their characteristics, likelihood, and mitigation strategies. The taxonomy is organized from a customer impact perspective — focusing on how fault types affect customer workloads and what mitigation options are available — rather than from an internal Azure engineering perspective.

Some of these "faults" may not even be caused by an actual failure in Azure infrastructure. They can be caused by a lack of understanding of designed Azure service behaviors (e.g., underestimating the impact of Azure planned maintenance) or by Azure platform design decisions (e.g., capacity constraints). However, from a customer perspective, they all represent potential failure modes that need to be considered and mitigated when designing for reliability. The following table presents infrastructure fault types from a customer impact perspective.

Disclaimer: This is an unofficial taxonomy sample of Azure infrastructure fault types. It is not an official Microsoft publication and is not officially supported, endorsed, or maintained by Microsoft.
The fault type definitions, likelihood assessments, and mitigation recommendations are based on publicly available Azure documentation and general cloud architecture best practices, but may not reflect the most current Azure platform behavior. Always refer to official Azure documentation and Azure Service Health for authoritative guidance. The "Likelihood" values below are relative planning heuristics intended to help prioritize resilience investments. They are not statistical probabilities, do not represent Azure SLA commitments, and are not derived from official Azure reliability data.

| Fault Type | Blast Radius | Likelihood | Mitigation / Redundancy Level Requirements |
|---|---|---|---|
| Service Fault (Global) | Worldwide or multiple regions | Very Low | High |
| Service Fault (Region) | Single service in region | Medium | Region Redundancy |
| Region Fault | Single region | Very Low | Region Redundancy |
| Partial Region Fault | Multiple services in a single region | Low | Region Redundancy |
| Availability Zone Fault | Single AZ within region | Low | Availability Zone Redundancy |
| Single Resource Fault | Single VM/instance | High | Resource Redundancy |
| Platform Maintenance Fault | Variable (resource to region) | High | Resource Redundancy, Maintenance Schedules |
| Region Capacity Constraint Fault | Single region | Low | Region Redundancy, Capacity Reservations |
| Network POP Location Fault | Network hardware colocation site | Low | Site Redundancy |

In future articles we will examine each of these fault types in detail. For this first article, let's take a closer look at one that is often underestimated: the Partial Region Fault.

Deep Dive: "Partial Region Fault"

A Partial Region Fault is a fault affecting multiple Azure services within a single region simultaneously, typically due to shared regional infrastructure dependencies, regional network issues, or regional platform incidents. Sometimes, the number of affected services may be significant enough to resemble a full region outage — but the key distinction is that it is not a complete loss of the region.
Some services may continue to operate normally, while others experience degradation or unavailability. Unlike a region outage caused by a natural disaster, in the documented cases referenced later in this article, such Partial Region Faults have historically been resolved within hours.

| Attribute | Description |
|---|---|
| Blast Radius | Multiple services within a single region |
| Likelihood | Low |
| Typical Duration | Minutes to hours |
| Fault Tolerance Options | Multi-region architecture; cross-region failover |
| Fault Tolerance Cost | High |
| Impact | Severe |
| Typical Cause | Regional networking infrastructure failure affecting multiple services; regional storage subsystem degradation impacting dependent services; regional control plane issues affecting service management |

These faults are rare, but they can happen — and when they do, they can have a severe impact on customer solutions that are not architected for multi-region resilience. What makes Partial Region Faults particularly dangerous is that they fall into a blind spot in most teams' resilience planning. When organizations think about regional failures, they tend to think in binary terms: either a region is up or it is down. Disaster recovery runbooks are written around the idea of a full region outage — triggered by a natural disaster or a catastrophic infrastructure event — where the response is to fail over everything to a secondary region.

But a Partial Region Fault is not a full region outage. It is something more insidious. A subset of services in the region degrades or becomes unavailable while others continue to function normally. Your VMs might still be running, but the networking layer that connects them is broken. Your compute is fine, but Azure Resource Manager — the control plane through which you manage everything — is unreachable. This partial nature creates several problems that teams rarely plan for:

Failover logic may not trigger. Most automated failover mechanisms are designed to detect a complete loss of connectivity to a region.
When only some services are affected, health probes may still pass, traffic managers may still route requests to the degraded region, and your failover automation may sit idle — while your users are already experiencing errors.

Recovery is more complex. With a full region outage, the playbook is straightforward: fail over to the secondary region. With a partial fault, you may need to selectively fail over some services while others remain in the primary region — a scenario that few teams have tested and most architectures do not support gracefully.

The real-world examples below illustrate this clearly. In each case, a shared infrastructure dependency — regional networking, Managed Identities, or Azure Resource Manager — experienced an issue that cascaded into a multi-service fault lasting hours. None of these were full region outages, yet the scope and duration of affected services was significant in each case.

Switzerland North — Network Connectivity Impact (BT6W-FX0)

A platform issue resulted in an impact to customers in Switzerland North who may have experienced service availability issues for resources hosted in the region.

| Attribute | Value |
|---|---|
| Date | September 26–27, 2025 |
| Region | Switzerland North |
| Time Window | 23:54 UTC on 26 Sep – 21:59 UTC on 27 Sep 2025 |
| Total Duration | ~22 hours |
| Services Impacted | Multiple (network-dependent services in the region) |

According to the official Post Incident Review (PIR) published by Microsoft on Azure Status History, a platform issue caused network connectivity degradation affecting multiple network-dependent services across the Switzerland North region, with impact lasting approximately 22 hours. The full root cause analysis, timeline, and remediation steps are documented in the linked PIR below.
🔗 View PIR on Azure Status History

East US and West US — Managed Identities and Dependent Services (_M5B-9RZ)

A platform issue with the Managed Identities for Azure resources service impacted customers trying to create, update, or delete Azure resources, or acquire Managed Identity tokens, in the East US and West US regions.

| Attribute | Value |
|---|---|
| Date | February 3, 2026 |
| Regions | East US, West US |
| Time Window | 00:10 UTC – 06:05 UTC on 03 February 2026 |
| Total Duration | ~6 hours |
| Services Impacted | Managed Identities + dependent services (resource create/update/delete, token acquisition) |

🔗 View PIR on Azure Status History

Azure Government — Azure Resource Manager Failures (ML7_-DWG)

Customers using any Azure Government region experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

| Attribute | Value |
|---|---|
| Date | December 8, 2025 |
| Regions | Azure Government (all regions) |
| Time Window | 11:04 EST (16:04 UTC) – 14:13 EST (19:13 UTC) |
| Total Duration | ~3 hours |
| Services Impacted | 20+ services (ARM and all ARM-dependent services) |

🔗 View PIR on Azure Status History

Wrapping Up

Designing resilient Azure solutions requires understanding the full spectrum of potential infrastructure faults. The Partial Region Fault is just one of many fault types you should account for during your Failure Mode Analysis — but it is a powerful reminder that even within a single region, shared infrastructure dependencies can amplify a single failure into a multi-service outage. Use this taxonomy as a starting point for FMA when designing your Azure architecture. The area is continuously evolving as the Azure platform and industry evolve — watch this space and revisit your fault type analysis periodically. In the next article, we will continue exploring additional fault types from the taxonomy. Stay tuned.
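To make the probe blind spot discussed above concrete, here is a minimal sketch of a dependency-aware ("deep") health evaluation; the dependency names are hypothetical. Instead of a single front-end ping, it aggregates per-dependency probe results, so a partial fault, where some dependencies fail while the process itself still answers, surfaces as "degraded" rather than "healthy".

```python
# Sketch of a dependency-aware ("deep") health evaluation. A shallow
# probe reports healthy whenever the endpoint answers; aggregating
# per-dependency results instead surfaces partial faults. Dependency
# names below are hypothetical examples.

def evaluate_health(dependency_status: dict) -> str:
    """Classify overall health from per-dependency probe results."""
    failed = [name for name, ok in dependency_status.items() if not ok]
    if not failed:
        return "healthy"
    if len(failed) < len(dependency_status):
        return "degraded"  # partial fault: some dependencies still work
    return "unhealthy"

if __name__ == "__main__":
    # e.g., identity plane down while compute and storage still answer
    status = {"compute": True, "storage": True, "identity": False}
    print(evaluate_health(status))  # degraded
```

Wiring such a result into your traffic manager's probe endpoint gives failover automation a chance to react to partial faults instead of sitting idle behind a passing shallow check.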
Authors & Reviewers

Authored by Zoran Jovanovic, Cloud Solutions Architect at Microsoft. Peer review by Catalina Alupoaie, Cloud Solutions Architect at Microsoft. Peer review by Stefan Johner, Cloud Solutions Architect at Microsoft.

References

- Azure Well-Architected Framework — Reliability Pillar
- Failure Mode Analysis
- Shared Responsibility for Reliability
- Azure Availability Zones
- Business Continuity and Disaster Recovery
- Transient Fault Handling
- Azure Service Level Agreements
- Azure Reliability Guidance by Service
- Azure Status History