Abhishek Tiwari, Vice President of Engineering, Azure Networking Amit Srivastava, Partner Director of PM, Azure Networking Varun Chawla, Partner Director of Engineering, Azure Networking Karthik Uthaman, Principal Engineer, Azure Networking
In Part 1 of this blog series, we outlined our four‑pillar strategy for resiliency in Azure Front Door: configuration resiliency, data plane resiliency, tenant isolation, and accelerated Recovery Time Objective (RTO). Together, these pillars help Azure Front Door remain continuously available and resilient at global scale.
Part 1 focused on the first two pillars: configuration and data plane resiliency. Our goal is to make configuration propagation safer, so incompatible changes never escape pre‑production environments. We discussed how incompatible configurations are blocked early, and how data plane resiliency ensures the system continues serving traffic from a last‑known‑good (LKG) configuration even if a bad change manages to propagate. We also introduced ‘Food Taster’, a dedicated sacrificial process running in each edge server’s data plane, that pretests every configuration change in isolation, before it ever reaches the live data plane.
In this post, we turn to the recovery pillar. We describe how we have made key enhancements to the Azure Front Door recovery path so the system can return to full operation in a predictable and bounded timeframe. For a global service like Azure Front Door, serving hundreds of thousands of tenants across 210+ edge sites worldwide, we set an explicit target: to be able to recover any edge site – or all edge sites – within approximately 10 minutes, even in worst‑case scenarios. In typical data plane crash scenarios, we expect recovery in under a second.
Repair status
The first blog post in this series mentioned the two Azure Front Door incidents from October 2025 – learn more by watching our Azure Incident Retrospective session recordings for the October 9th incident and/or the October 29th incident. Before diving into our platform investments for improving our Recovery Time Objectives (RTO), we wanted to provide a quick update on the overall repair items from these incidents. We are pleased to report that the work on configuration propagation and data plane resiliency is now complete and fully deployed across the platform (in the table below, “Completed” means broadly deployed in production). With this, we have reduced configuration propagation latency from ~45 minutes to ~20 minutes. We anticipate reducing this even further – to ~15 minutes by the end of April 2026, while ensuring that platform stability remains our top priority.
|
Learning category |
Goal |
Repairs |
Status |
|
Safe customer configuration deployment |
Incompatible configuration never propagates beyond ‘EUAP or canary regions’ |
Control plane and data plane defect fixes Forced synchronous configuration processing Additional stages with extended bake time Early detection of crash state |
Completed |
|
Data plane resiliency |
Configuration processing cannot impact data plane availability |
Manage data-plane lifecycle to prevent outages caused by configuration-processing defects. |
Completed |
|
Isolated work-process in every data plane server to process and load the configuration. |
Completed | ||
|
100% Azure Front Door resiliency posture for Microsoft internal services |
Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services |
Phase 1: Onboarded critical services batch impacted on Oct 29th outage running on a day old configuration |
Completed |
|
Phase 2: Automation & hardening of operations, auto-failover and self-management of Azure Front Door onboarding for additional services |
March 2026 | ||
|
Recovery improvements |
Data plane crash recovery in under 10 minutes |
Data plane boot-up time optimized via local cache (~1 hour) |
Completed |
|
Accelerate recovery time < 10 minutes |
April 2026 | ||
|
Tenant isolation |
No configuration or traffic regression can impact other tenants |
Micro cellular Azure Front Door with ingress layered shards |
June 2026 |
Why recovery at edge scale is deceptively hard
To understand why recovery took as long as it did, it helps to first understand how the Azure Front Door data plane processes configuration.
Azure Front Door operates in 210+ edge sites with multiple servers per site. The data plane of each edge server hosts multiple processes. A master process orchestrates the lifecycle of multiple worker processes, that serve customer traffic. A separate configuration translator process runs alongside the data plane processes, and is responsible for converting customer configuration bundles from the control plane into optimized binary FlatBuffer files. This translation step, covering hundreds of thousands of tenants, represents hours of cumulative computation. A per edge server cache is kept locally at each server level – to enable a fast recovery of the data plane, if needed.
Once the configuration translator process produces these FlatBuffer files, each worker processes them independently and memory-maps them for zero-copy access. Configuration updates flow through a two-phase commit: new FlatBuffers are first loaded into a staging area and validated, then atomically swapped into production maps. In-flight requests continue using the old configuration, until the last request referencing them completes.
The data process recovery is designed to be resilient to different failure modes. A failure or crash at the worker process level has a typical recovery time of less than one second. Since each server has multiple such worker processes which serve customer traffic, this type of crash has no impact on the data plane. In the case of a master process crash, the system automatically tries to recover using the local cache. When the local cache is reused, the system is able to recover quickly – in approximately 60 minutes – since most of the configurations in the cache were already loaded into the data plane before the crash. However, in certain cases if the cache becomes unavailable or must be invalidated because of corruption, the recovery time increases significantly.
During the October 29th incident, a data plane crash triggered a complete recovery sequence that took approximately 4.5 hours. This was not because restarting a process is slow, it is because a defect in the recovery process invalidated the local cache, which meant that “restart” meant rebuilding everything from scratch. The configuration translator process then had to re-fetch and re-translate every one of the hundreds of thousands of customer configurations, before workers could memory-map them and begin serving traffic.
This experience has crystallized three fundamental learnings related to our recovery path:
- Expensive rework: A subset of crashes discarded all previously translated FlatBuffer artifacts, forcing the configuration translator process to repeat hours of conversion work that had already been validated and stored.
- High restart costs: Every worker on every node had to wait for the configuration translator process to complete the full translation, before it could memory-map any configuration and begin serving requests.
- Unbounded recovery time: Recovery time grew linearly with total tenant footprint rather than with active traffic, creating a ‘scale penalty’ as more tenants onboarded to the system.
Separately and together, the insight was clear: recovery must stop being proportional to the total configuration size.
Persisting ‘validated configurations’ across restarts
One of the key recovery improvements was strengthening how validated customer configurations are cached and reused across failures, rather than rebuilding configuration states from scratch during recovery. Azure Front Door already cached customer configurations on host‑mounted storage prior to the October incident. The platform enhancements post outage focused on making the local configuration cache resilient to crashes, partial failures, and bad tenant inputs.
Our goal was to ensure that recovery behavior is dominated by serving traffic safely, not by reconstructing configuration state. This led us to two explicit design goals…
Design goals
- No category of crash should invalidate the configuration cache: Configuration cache invalidation must never be the default response to failures. Whether the failure is a worker crash, master crash, data plane restart, or coordinated recovery action, previously validated customer configurations should remain usable—unless there is a proven reason to discard it.
- Bad tenant configuration must not poison the entire cache: A single faulty or incompatible tenant configuration should result in targeted eviction of that tenant’s configuration only—not wholesale cache invalidation across all tenants.
Platform enhancements
Previously, customer configurations persisted to host‑mounted storage, but certain failure paths treated the cache as unsafe and invalidated it entirely. In those cases, recovery implicitly meant reloading and reprocessing configuration for hundreds of thousands of tenants before traffic could resume, even though the vast majority of cached data was still valid.
We changed the recovery model to avoid invalidating customer configurations, with strict scoping around when and how cached entries are discarded:
- Cached configurations are no longer invalidated based on crash type. Failures are assumed to be orthogonal to configuration correctness unless explicitly proven otherwise.
- Cache eviction is granular and tenant‑scoped. If a cached configuration fails validation or load checks, only that tenant’s configuration is discarded and reloaded. All other tenant configurations remain available.
This ensures that recovery does not regress into a fleet‑wide rebuild due to localized or unrelated faults.
Safety and correctness
Durability is paired with strong correctness controls, to prevent unsafe configurations from being served:
- Per‑tenant validation on load: Each cached tenant configuration is validated during the ‘load and verification’ phase, before being promoted for traffic serving. Therefore, failures are contained to that tenant.
- Targeted re‑translation: When validation fails, only the affected tenant’s configuration is reloaded or reprocessed. Therefore, the cache for other tenants is left untouched.
- Operational escape hatch: Operators retain the ability to explicitly instruct a clean rebuild of the configuration cache (with proper authorization), preserving control without compromising the default fast‑recovery path.
Resulting behavior
With these changes, recovery behavior now aligns with real‑world traffic patterns - configuration defects impact tenants locally and predictably, rather than globally. The system now prefers isolated tenant impact, and continued service using last-known-good over aggressive invalidation, both of which are critical for predictable recovery at the scale of Azure Front Door.
Making recovery scale with active traffic, not total tenants
Reusing configuration cache solves the problem of rebuilding configuration in its entirety, but even with a warm cache, the original startup path had a second bottleneck: eagerly loading a large volume of tenant configurations into memory before serving any traffic. At our scale, memory-mapping, parsing hundreds of thousands of FlatBuffers, constructing internal lookup maps, adding Transport Layer Security (TLS) certificates and configuration blocks for each tenant, collectively added almost an hour to startup time. This was the case even when a majority of those tenants had no active traffic at that moment.
We addressed this by fundamentally changing when configuration is loaded into workers. Rather than eagerly loading most of the tenants at startup across all edge locations, Azure Front Door now uses an Machine Learning (ML)-optimized lazy loading model.
In the new architecture, instead of loading a large number of tenant configurations, we only load a small subset of tenants that are known to be historically active in a given site, we call this the “warm tenants” list. The warm tenants list per edge site is created through a sophisticated traffic analysis pipeline that leverages ML. However, loading the warm tenants is not good enough, because when a request arrives and we don’t have the configuration in memory, we need to know two things. Firstly, is this a request from a real Azure Front Door tenant – and, if it is, where can I find the configuration?
To answer these questions, each worker maintains a hostmap that tracks the state of each tenant’s configuration. This hostmap is constructed during startup, as we process each tenant configuration – if the tenant is in the warm list, we will process and load their configuration fully; if not, then we will just add an entry into the hostmap where all their domain names are mapped to the configuration path location. When a request arrives for one of these tenants, the worker loads and validates that tenant’s configuration on demand, and immediately begins serving traffic. This allows a node to start serving its busiest tenants within a few minutes of startup, while additional tenants are loaded incrementally only when traffic actually arrives—allowing the system to progressively absorb cold tenants as demand increases.
The effect on recovery is transformative. Instead of recovery time scaling with the total number of tenants configured on a server, it scales with the number of tenants actively receiving traffic. In practice, even at our busiest edge sites, the active tenant set is a small fraction of the total.
Just as importantly, this modified form of lazy loading provides a natural failure isolation boundary. Most Edge sites won’t ever load a faulty configuration of an inactive tenant. When a request for an inactive tenant with an incompatible configuration arrives, impact is contained to a single worker.
The configuration load architecture now prefers serving as many customers as quickly as possible, rather than waiting until everything is ready before serving anyone. The above changes are slated to complete in April 2026 and will bring our RTO from the current ~1 hour to under 10 minutes – for complete recovery from a worst case scenario.
Continuous validation through Game Days
A critical element of our recovery confidence comes from GameDay fault-injection testing. We don’t simply design recovery mechanisms and assume they work—we break the system deliberately and observe how it responds. Since late 2025, we have conducted recurring GameDay drills that simulate the exact failure scenarios we are defending against:
- Food Taster crash scenarios: Injecting deliberately faulty tenant configurations, to verify that they are caught and isolated with zero impact on live traffic. In our January 2026 GameDay, the Food Taster process crashed as expected, the system halted the update within approximately 5 seconds, and no customer traffic was affected.
- Master process crash scenarios: Triggering master process crashes across test environments to verify that workers continue serving traffic, that the Local Config Shield engages within 10 seconds, and that the coordinated recovery tool restores full operation within the expected timeframe.
- Multi-region failure drills: Simulating simultaneous failures across multiple regions to validate that global Config Shield mechanisms engage correctly, and that recovery procedures scale without requiring manual per-region intervention.
- Fallback test drills for critical Azure services running behind Azure Front Door: In our February 2026 GameDay, we simulated the complete unavailability of Azure Front Door, and successfully validated failover for critical Azure services with no impact to traffic.
These drills have both surfaced corner cases and built operational confidence. They have transformed recovery from a theoretical plan into tested, repeatable muscle memory. As we noted in an internal communication to our team: “Game day testing is a deliberate shift from assuming resilience to actively proving it—turning reliability into an observed and repeatable outcome.”
Closing
Part 1 of this series emphasized preventing unsafe configurations from reaching the data plane, and data plane resiliency in case an incompatible configuration reaches production. This post has shown that prevention alone is not enough—when failures do occur, recovery must be fast, predictable, and bounded. By ensuring that the FlatBuffer cache is never invalidated, by loading only active tenants, and by building safe coordinated recovery tooling, we have transformed failure handling from a fleet-wide crisis into a controlled operation.
These recovery investments work in concert with the prevention mechanisms described in Part 1. Together, they ensure that the path from incident detection to full service restoration is measured in minutes, with customer traffic protected at every step.
In the next post of this series, we will cover the third pillar of our resiliency strategy: tenant isolation—how micro-cellular architecture and ingress-layered sharding can reduce the blast radius of any failure to a small subset, ensuring that one customer’s configuration or traffic anomaly never becomes everyone’s problem.
We deeply value our customers’ trust in Azure Front Door. We are committed to transparently sharing our progress on these resiliency investments, and to exceed expectations for safety, reliability, and operational readiness.