Azure Networking Blog

Azure Front Door: Implementing lessons learned following October outages

Dec 19, 2025
Abhishek Tiwari, Vice President of Engineering, Azure Networking
Amit Srivastava, Principal PM Manager, Azure Networking
Varun Chawla, Partner Director of Engineering


Introduction

Azure Front Door is Microsoft's advanced edge delivery platform, combining Content Delivery Network (CDN), global security, and traffic distribution into a single unified offering. By using Microsoft's extensive global edge network, Azure Front Door ensures efficient content delivery and advanced security through 210+ global and local points of presence (PoPs) strategically positioned close to both end users and applications.

As the central global entry point from the internet to customer applications, Azure Front Door powers mission-critical customer applications as well as many of Microsoft’s internal services. We have a highly distributed, resilient architecture, which protects against failures at the server, rack, site, and even regional level. This resiliency is achieved by our intelligent traffic management layer, which monitors failures and load balances traffic at the server, rack, or edge-site level within the primary ring, supplemented by a secondary fallback ring that accepts traffic in case of primary traffic overflow or broad regional failures. We also deploy a traffic shield as a terminal safety net to ensure that, in the event of a managed or unmanaged edge site going offline, end user traffic continues to flow to the next available edge site.

Like any large-scale CDN, we deploy each customer configuration across a globally distributed edge fleet, densely shared with thousands of other tenants. While this architecture enables global scale, it carries the risk that certain incompatible configurations, if not contained, can propagate broadly and quickly, resulting in a large blast radius of impact. Here we describe how two recent service incidents impacting Azure Front Door have reinforced the need to accelerate ongoing investments in hardening our resiliency and tenant isolation strategy, to mitigate both the likelihood and the scale of impact from this class of risk.

October incidents: recap and key learnings

Azure Front Door experienced two service incidents, on October 9th and October 29th, both with customer-impacting service degradation.

On October 9th: A manual cleanup of stuck tenant metadata bypassed our configuration protection layer, allowing incompatible metadata to propagate beyond our canary edge sites. This metadata was created on October 7th, from a control-plane defect triggered by a customer configuration change. While the protection system initially blocked the propagation, the manual override operation bypassed our safeguards. This incompatible configuration reached the next stage and activated a latent data-plane defect in a subset of edge sites, causing availability impact primarily across Europe (~6%) and Africa (~16%). You can learn more about this issue in detail at https://aka.ms/AIR/QNBQ-5W8

On October 29th: A different sequence of configuration changes across two control-plane versions produced incompatible metadata. Because the failure mode in the data plane was asynchronous, the health-check validations embedded in our protection systems all passed during the rollout. The incompatible customer configuration metadata propagated globally through a staged rollout and also updated the “last known good” (LKG) snapshot. Following this global rollout, the asynchronous process in the data plane exposed another defect, which caused crashes. This impacted connectivity and DNS resolution for all applications onboarded to our platform. Extended recovery time amplified the impact on customer applications and Microsoft services. You can learn more about this issue in detail at https://aka.ms/AIR/YKYN-BWZ

We took away a number of clear and actionable lessons from these incidents, which are applicable not just to our service, but to any multi-tenant, high-density, globally distributed system.

  • Configuration resiliency – Valid configuration updates should propagate safely, consistently, and predictably across our global edge, while incompatible or erroneous configurations never propagate beyond canary environments.
  • Data plane resiliency – Configuration processing in the data plane must not cause availability impact to any customer.
  • Tenant isolation – Traditional isolation techniques such as hardware partitioning and virtualization are impractical at edge sites. This requires innovative sharding techniques to ensure tenant-level isolation – a must-have to reduce potential blast radius.
  • Accelerated and automated recovery time objective (RTO) – The system should be able to automatically revert to the last known good configuration within an acceptable RTO. For a service like Azure Front Door, we deem ~10 minutes to be a practical RTO for our hundreds of thousands of customers at every edge site.

Following the October 29th outage, given the severity of impact from an incompatible configuration propagating globally, we made the difficult decision to temporarily block configuration changes in order to expedite the rollout of additional safeguards. Between October 29th and November 5th, we prioritized and deployed immediate hardening steps before re-enabling configuration changes. We are confident that the system is stable, and we are continuing to invest in additional safeguards to further strengthen the platform's resiliency.

| Learning category | Goal | Repairs | Status |
| --- | --- | --- | --- |
| Safe customer configuration deployment | Incompatible configuration never propagates beyond canary | Control-plane and data-plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state | Completed |
| Data plane resiliency | Configuration processing cannot impact data plane availability | Manage data-plane lifecycle to prevent outages caused by configuration-processing defects | Completed |
|  |  | Isolated worker process in every data plane server to process and load the configuration | January 2026 |
| 100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent active/active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the critical services impacted in the October 29th outage, running on a day-old configuration | Completed |
|  |  | Phase 2: Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services | March 2026 |
| Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed |
|  |  | Accelerate recovery time to under 10 minutes | March 2026 |
| Tenant isolation | No configuration or traffic regression can impact other tenants | Micro-cellular Azure Front Door with ingress layered shards | June 2026 |

This blog is the first in a multi-part series on Azure Front Door resiliency. In this blog, we will focus on configuration resiliency—how we are making the configuration pipeline safer and more robust. Subsequent blogs will cover tenant isolation and recovery improvements.

How our configuration propagation works

Azure Front Door configuration changes can be broadly classified into three distinct categories.  

  • Service code & data – These include all components of the Azure Front Door service, such as the management plane, control plane, data plane, and configuration propagation system. Azure Front Door follows a safe deployment practice (SDP) process to roll out newer versions of the management, control, or data plane over a period of approximately 2-3 weeks. This ensures that any regression in software does not have a global impact. However, latent bugs that escape pre-validation and SDP rollout can remain undetected until a specific combination of customer traffic patterns or configuration changes triggers the issue.
  • Web Application Firewall (WAF) & L7 DDoS platform data – These datasets are used by Azure Front Door to deliver security and load-balancing capabilities. Examples include GeoIP data, malicious attack signatures, and IP reputation signatures. Updates to these datasets occur daily through multiple SDP stages with an extended bake time of over 12 hours to minimize the risk of global impact during rollout. This dataset is shared across all customers and the platform, and it is validated immediately since it does not depend on variations in customer traffic or configuration steps.
  • Customer configuration data – This covers any customer configuration change—whether a routing rule update, backend pool modification, WAF rule change, or security policy change. Given the nature of these changes, the expectation across the edge delivery / CDN industry is to propagate them globally in 5-10 minutes. Both outages stemmed from issues within this category.

All configuration changes, including customer configuration data, are processed through a multi-stage pipeline designed to ensure correctness before global rollout across Azure Front Door’s 200+ edge locations. At a high level, Azure Front Door’s configuration propagation system has two distinct components - 

  • Control plane – Accepts customer API/portal changes (create/update/delete for profiles, routes, WAF policies, origins, etc.) and translates them into internal configuration metadata which the data plane can understand. 
  • Data plane – Globally distributed edge servers that terminate client traffic, apply routing/WAF logic, and proxy to origins using the configuration produced by the control plane. 

Between these two halves sits a multi-stage configuration rollout pipeline with a dedicated protection system (known as ConfigShield): 

  • Changes flow through multiple stages (pre-canary, canary, expanding waves to production) rather than going global at once. 
  • Each stage is health-gated: the data plane must remain within strict error and latency thresholds before proceeding. Each stage’s health check also rechecks the previous stage’s health for any regressions.
  • A successfully completed rollout updates a last known good (LKG) snapshot used for automated rollback. 
  • Historically, rollout targeted global completion in roughly 5–10 minutes, in line with industry standards. (A simplified sketch of this health-gated rollout follows below.)
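
To make this staged, health-gated flow concrete, the sketch below models a rollout loop in Go. It is a minimal illustration under assumed names (Stage, HealthCheck, the apply callback) and thresholds; it is not Azure Front Door's actual implementation, but it shows how stage-by-stage health gates, re-checking earlier stages, rollback, and LKG promotion fit together.

```go
package rollout

import (
	"errors"
	"fmt"
	"time"
)

// Stage is one wave of the staged rollout (stage names are illustrative).
type Stage struct {
	Name     string
	BakeTime time.Duration
}

// HealthCheck reports whether a stage is within error/latency thresholds.
// A real system would query fleet telemetry; this signature is an assumption.
type HealthCheck func(stage string) (healthy bool, err error)

// Pipeline rolls a configuration snapshot through the stages and only
// promotes it to "last known good" (LKG) after every stage stays healthy.
type Pipeline struct {
	Stages []Stage
	Check  HealthCheck
	LKG    string // identifier of the last known good snapshot
}

// Rollout pushes snapshotID stage by stage. Any unhealthy stage aborts the
// rollout and rolls back to the current LKG instead of proceeding.
func (p *Pipeline) Rollout(snapshotID string, apply func(stage, snapshot string) error) error {
	for i, s := range p.Stages {
		if err := apply(s.Name, snapshotID); err != nil {
			return fmt.Errorf("apply failed at %s: %w", s.Name, err)
		}
		time.Sleep(s.BakeTime) // extended bake time between stages

		// Re-check this stage and all previously completed stages for regressions.
		for _, done := range p.Stages[:i+1] {
			healthy, err := p.Check(done.Name)
			if err != nil || !healthy {
				// Roll back everything touched so far to the LKG snapshot.
				for _, rb := range p.Stages[:i+1] {
					_ = apply(rb.Name, p.LKG)
				}
				return errors.New("health gate failed at " + done.Name + "; rolled back to LKG")
			}
		}
	}
	// Only a fully successful rollout updates the LKG snapshot.
	p.LKG = snapshotID
	return nil
}
```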

 

Customer configuration processing in Azure Front Door data plane stack

Customer configuration changes in Azure Front Door traverse multiple layers—from the control plane through the deployment system—before being converted into FlatBuffers at each Azure Front Door node. These FlatBuffers are then loaded by the Azure Front Door data plane stack, which runs as Kubernetes pods on every node.

  • FlatBuffer Composition: Each FlatBuffer references several sub-resources such as WAF and Rules Engine schematic files, SSL certificate objects, and URL signing secrets.
  • Data plane architecture:
      o Master process: Accepts configuration changes (memory-mapped files with references) and manages the lifecycle of worker processes.
      o Workers: L7 proxy processes that serve customer traffic using the applied configuration.

Processing flow for each configuration update:

  • Load and apply in master: The transformed configuration is loaded and applied in the master process. Cleanup of unused references occurs synchronously, except for certain categories. The October 9 outage occurred during this step, due to a crash triggered by incompatible metadata.
  • Apply to workers: Configuration is applied to all worker processes without memory overhead (FlatBuffers are memory-mapped).
  • Serve traffic: Workers start consuming new FlatBuffers for new requests; in-flight requests continue using old buffers. Old buffers are queued for cleanup post-completion.
  • Feedback to deployment service: Positive feedback signals readiness to proceed with the rollout.
  • Cleanup: FlatBuffers are freed asynchronously by the master process after all workers load updates. The October 29 outage occurred during this step, due to a latent bug in the reference-counting logic. (A simplified sketch of this processing flow follows below.)
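
The sketch below is a minimal, in-process model of this flow in Go, using assumed types and names (ConfigBuffer, Worker, Master): workers atomically swap in a new memory-mapped snapshot, in-flight requests pin the old one via a reference count, and the master frees retired buffers only once their count drops to zero. It deliberately simplifies (for example, it does not handle the pin-versus-swap race a production proxy must) and is not the actual Azure Front Door data plane code.

```go
package dataplane

import (
	"sync"
	"sync/atomic"
)

// ConfigBuffer stands in for a memory-mapped FlatBuffer snapshot.
// refs counts in-flight requests still reading this snapshot.
type ConfigBuffer struct {
	Version int64
	Data    []byte // in production this would be a read-only mmap'd region
	refs    int64
}

// Worker is a traffic-serving process, modeled in-process for brevity.
type Worker struct {
	current atomic.Pointer[ConfigBuffer]
}

// Apply atomically swaps in the new snapshot. New requests see the new
// buffer immediately; in-flight requests keep using the old one.
func (w *Worker) Apply(newCfg *ConfigBuffer) *ConfigBuffer {
	return w.current.Swap(newCfg) // old buffer returned for later cleanup
}

// Serve pins the current snapshot for the duration of one request.
func (w *Worker) Serve(handle func(cfg *ConfigBuffer)) {
	cfg := w.current.Load()
	if cfg == nil {
		return // no configuration loaded yet
	}
	atomic.AddInt64(&cfg.refs, 1)
	defer atomic.AddInt64(&cfg.refs, -1)
	handle(cfg)
}

// Master applies a validated snapshot to every worker and queues the old
// buffers for cleanup once no request references them anymore.
type Master struct {
	mu      sync.Mutex
	workers []*Worker
	retired []*ConfigBuffer
}

func (m *Master) ApplyToWorkers(newCfg *ConfigBuffer) {
	m.mu.Lock()
	defer m.mu.Unlock()
	for _, w := range m.workers {
		if old := w.Apply(newCfg); old != nil {
			m.retired = append(m.retired, old)
		}
	}
}

// Cleanup frees retired buffers whose reference count has dropped to zero.
// A defect in the data plane's analogous reference-counting cleanup path
// is what triggered the October 29 crash.
func (m *Master) Cleanup(free func(*ConfigBuffer)) {
	m.mu.Lock()
	defer m.mu.Unlock()
	kept := m.retired[:0]
	for _, b := range m.retired {
		if atomic.LoadInt64(&b.refs) == 0 {
			free(b) // e.g. unmap the FlatBuffer region
		} else {
			kept = append(kept, b)
		}
	}
	m.retired = kept
}
```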

The October incidents showed we needed to strengthen key aspects of configuration validation, propagation safeguards, and runtime behavior. During the Azure Front Door incident on October 9th, the ConfigShield protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During the Azure Front Door incident on October 29th, the incompatible customer configuration metadata progressed through the protection system before the delayed asynchronous processing task resulted in the crash.

Configuration propagation safeguards

Based on learnings from the incidents, we are implementing a comprehensive set of configuration resiliency improvements. These changes aim to guarantee that any sequence of configuration changes cannot trigger instability in the data plane, and to ensure quicker recovery in the event of anomalies.

Strengthening configuration generation safety

This improvement pivots on a ‘shift-left’ strategy, where we want to catch regressions early, before they propagate to production. It also includes fixing the latent defects that were the proximate cause of the outages.

  • Fixing outage-specific defects - We have fixed the control-plane defects that could generate incompatible tenant metadata under specific operation sequences. We have also remediated the associated data-plane defects.
  • Stronger cross-version validation - We are expanding our test and validation suite to account for changes across multiple control plane build versions. This is expected to be fully completed by February 2026.
  • Fuzz testing - Automated fuzzing and testing of the metadata generation contract between the control plane and the data plane. This allows us to generate an expanded set of invalid/unexpected configuration combinations which might not be achievable by traditional test cases alone. This is expected to be fully completed by February 2026. (A sketch of this fuzzing approach follows below.)
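
As an illustration of the fuzzing approach, the sketch below uses Go's native fuzzing support against a hypothetical, simplified metadata parser (TenantConfig and ParseTenantMetadata are stand-ins, not the real contract). The invariant it enforces is the general one: no generated input may crash the parser, and anything the parser accepts must round-trip cleanly before it could ever reach traffic-serving workers. Placed in a *_test.go file, it runs with `go test -fuzz=FuzzTenantMetadata`.

```go
package metadata

import (
	"encoding/json"
	"testing"
)

// TenantConfig and ParseTenantMetadata are hypothetical stand-ins for the
// real control-plane/data-plane metadata contract; only the fuzzing pattern
// itself is the point here.
type TenantConfig struct {
	Route  string `json:"route"`
	Origin string `json:"origin"`
}

func ParseTenantMetadata(data []byte) (*TenantConfig, error) {
	var c TenantConfig
	if err := json.Unmarshal(data, &c); err != nil {
		return nil, err
	}
	return &c, nil
}

// FuzzTenantMetadata drives the parser with generated inputs: no input may
// panic the parser, and any accepted input must survive a re-encode/re-parse
// round trip unchanged.
func FuzzTenantMetadata(f *testing.F) {
	f.Add([]byte(`{"route":"r1","origin":"contoso.example"}`)) // seed corpus
	f.Add([]byte(`{}`))

	f.Fuzz(func(t *testing.T, data []byte) {
		cfg, err := ParseTenantMetadata(data)
		if err != nil {
			return // rejecting malformed metadata is acceptable
		}
		encoded, err := json.Marshal(cfg)
		if err != nil {
			t.Fatalf("accepted metadata cannot be re-encoded: %v", err)
		}
		again, err := ParseTenantMetadata(encoded)
		if err != nil || *again != *cfg {
			t.Fatalf("accepted metadata does not round-trip: %v", err)
		}
	})
}
```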

Preventing incompatible configurations from being propagated

This segment of the resiliency strategy strives to ensure that a potentially dangerous configuration change never propagates beyond the canary stage.

  • Protection system is “always-on” - Enhancements to operational procedures and tooling prevent bypass in all scenarios (including internal cleanup/maintenance), and any cleanup must flow through the same guarded stages and health checks as standard configuration changes. This is completed.
  • Making rollout behavior more predictable and conservative - Configuration processing in the data plane is now fully synchronous. Every data plane issue due to incompatible metadata can be detected within 10 seconds at every stage. This is completed.
  • Enhancement to deployment pipeline - Additional stages during roll-out and extended bake time between stages serve as an additional safeguard during configuration propagation. This is completed.
  • Recovery tool improvements - It is now easier to revert to any previous version of the LKG with a single click. This is completed. (A sketch of a versioned LKG store follows below.)
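
To illustrate the "revert to any previous LKG version" capability, here is a minimal sketch of a versioned snapshot store in Go. The types and method names are assumptions for illustration only; the actual recovery tooling additionally redeploys the chosen snapshot through the same guarded, health-gated pipeline described above.

```go
package lkg

import (
	"errors"
	"sync"
)

// Snapshot is one validated, fully rolled out configuration version.
type Snapshot struct {
	Version int64
	Blob    []byte // opaque configuration payload
}

// Store keeps a history of LKG snapshots so operators can revert to any
// previous version, not just the most recent one.
type Store struct {
	mu      sync.Mutex
	history []Snapshot
}

// Promote records a snapshot as the new LKG after a successful rollout.
func (s *Store) Promote(snap Snapshot) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.history = append(s.history, snap)
}

// Current returns the latest LKG snapshot.
func (s *Store) Current() (Snapshot, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	if len(s.history) == 0 {
		return Snapshot{}, errors.New("no LKG recorded yet")
	}
	return s.history[len(s.history)-1], nil
}

// RevertTo makes an older version the active LKG again by re-promoting it
// to the top of the history.
func (s *Store) RevertTo(version int64) (Snapshot, error) {
	s.mu.Lock()
	defer s.mu.Unlock()
	for i := len(s.history) - 1; i >= 0; i-- {
		if s.history[i].Version == version {
			s.history = append(s.history, s.history[i])
			return s.history[len(s.history)-1], nil
		}
	}
	return Snapshot{}, errors.New("requested LKG version not found")
}
```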

These changes significantly improve system safety. Post-outage, we have increased the configuration propagation time to approximately 45 minutes. We are working towards reducing configuration propagation time closer to pre-incident levels once the additional safeguards covered in the Data plane resiliency section below are completed by mid-January 2026.

Data plane resiliency

Data plane recovery was the toughest part of the mitigation efforts during the October incidents. We must ensure the data plane both recovers quickly and is resilient to configuration-processing issues. To address this, we implemented changes that decouple the data plane from incompatible configuration changes. With these enhancements, the data plane continues operating on the last known good configuration—even if the configuration pipeline safeguards fail to protect as intended.

Decoupling data plane from configuration changes

Each server’s data plane consists of a master process, which accepts configuration changes and manages the lifecycle of multiple worker processes that serve customer traffic. One of the critical reasons for the prolonged October outage was that, due to latent defects in the data plane, the master process crashed when presented with a bad configuration. The master is a critical command-and-control process, and when it crashes it takes down the entire data plane on that node. Recovery of the master process involves reloading hundreds of thousands of configurations from scratch and took approximately 4.5 hours.

We have since made changes to the system to ensure that even if the master process crashes for any reason - including being presented with incompatible configuration data - the workers remain healthy and able to serve traffic. During such an event, the workers would not be able to accept new configuration changes but will continue to serve customer traffic using the last known good configuration. This work is completed. A simplified sketch of this behavior follows.
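
The sketch below models this decoupling on a single node in Go. The channel, the masterAlive probe, and the serve callback are illustrative assumptions standing in for the real inter-process machinery; the point is only the control flow: losing the master pauses configuration updates, while traffic continues to be served from the configuration the worker already holds.

```go
package dataplane

import (
	"log"
	"time"
)

// workerLoop sketches the post-incident behavior on one node: the worker
// keeps serving traffic from the configuration it already holds, and a dead
// master only means "no new configuration", never "stop serving".
// masterAlive and nextConfig are stand-ins for the real IPC between processes.
func workerLoop(masterAlive func() bool, nextConfig <-chan []byte, serve func(cfg []byte)) {
	var current []byte // last known good configuration held by this worker

	for {
		select {
		case cfg := <-nextConfig:
			current = cfg // master delivered a new, validated configuration
		default:
			if !masterAlive() {
				// Degraded mode: configuration updates pause, traffic does not.
				log.Println("master unavailable; continuing on last known good config")
			}
		}
		if current != nil {
			serve(current) // customer traffic keeps flowing either way
		}
		time.Sleep(10 * time.Millisecond) // pacing for the sketch only
	}
}
```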

Introducing Food Taster: strengthening config propagation resiliency

In our efforts to further strengthen Azure Front Door’s configuration propagation system, we are introducing an additional configuration safeguard, known internally as Food Taster, which protects the master and worker processes from any configuration-change-related incident, thereby ensuring data plane resiliency.

The principle is simple: every data-plane server will have a redundant and isolated process – the Food Taster – whose only job is to ingest and process new configuration metadata first and then pass validated configuration changes to the active data plane. This redundant worker does not accept any customer traffic.

All configuration processing in this Food Taster is fully synchronous. That means we do all parsing, validation, and any expensive or risky work up front, and we do not move on until the Food Taster has either proven the configuration is safe or rejected it. Only when the Food Taster successfully loads the configuration and returns “Config OK” does the master process proceed to load the same config and then instruct the worker processes to do the same. If anything goes wrong in the Food Taster, the failure is contained to that isolated worker; the master and traffic-serving workers never see that invalid configuration.
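
The sketch below captures this hand-off in Go. The taster binary path and the apply callbacks are hypothetical placeholders: what matters is that the candidate configuration is fully processed in an isolated process first, and the master and workers only load it after that process reports success.

```go
package foodtaster

import (
	"fmt"
	"os/exec"
)

// TasteThenApply sketches the Food Taster hand-off on a single node: the
// candidate configuration is first loaded by an isolated, non-traffic-serving
// process; only if that process exits cleanly ("Config OK") does the master
// load the same configuration and fan it out to the workers.
// The validator binary path and the apply callbacks are assumptions.
func TasteThenApply(configPath string, applyToMaster, applyToWorkers func(path string) error) error {
	// Run the isolated taster process. A crash or validation failure here is
	// contained to this process; master and workers never see the config.
	taster := exec.Command("/usr/local/bin/afd-food-taster", "--config", configPath) // hypothetical binary
	if out, err := taster.CombinedOutput(); err != nil {
		return fmt.Errorf("food taster rejected config (output: %s): %w", out, err)
	}

	// Only after "Config OK" does the master load the configuration...
	if err := applyToMaster(configPath); err != nil {
		return fmt.Errorf("master failed to apply validated config: %w", err)
	}
	// ...and then instruct the traffic-serving workers to do the same.
	return applyToWorkers(configPath)
}
```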

We expect this safeguard to reach production globally in the January 2026 timeframe. Introduction of this component will also allow us to return closer to pre-incident configuration propagation times while ensuring data plane safety.

Closing

This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO).

We deeply value our customers’ trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we’re making Azure Front Door even more resilient so your applications can be too.
