Abstract
Azure Front Door (AFD) sits at the edge of Microsoft’s global cloud, delivering secure, performant, and highly available applications to users worldwide. As adoption has grown—especially for mission‑critical workloads—the need for resilient application architectures that can tolerate rare but impactful platform incidents has become essential.
This article summarizes key lessons from Azure Front Door incidents in October 2025, outlines how Microsoft is hardening the platform, and—most importantly—describes proven architectural patterns customers can adopt today to maintain business continuity when global load‑balancing services are unavailable.
Who this is for
This article is intended for:
- Cloud and solution architects designing mission‑critical internet‑facing workloads
- Platform and SRE teams responsible for high availability and disaster recovery
- Security architects evaluating WAF placement and failover trade‑offs
- Customers running revenue‑impacting workloads on Azure Front Door
Introduction
Azure Front Door (AFD) operates at massive global scale, serving secure, low‑latency traffic for Microsoft first‑party services and thousands of customer applications. Internally, Microsoft is investing heavily in tenant isolation, independent infrastructure resiliency, and active‑active service architectures to reduce blast radius and speed recovery.
However, no global distributed system can completely eliminate risk. Customers hosting mission‑critical workloads on AFD should therefore design for the assumption that global routing services can become temporarily unavailable—and provide alternative routing paths as part of their architecture.
Resiliency options for mission‑critical workloads
The following patterns are in active use by customers today. Each represents a different trade‑off between cost, complexity, operational maturity, and availability.
1. No CDN with Application Gateway
Figure 1: Azure Front Door primary routing with DNS failover
When to use: Workloads without CDN caching requirements that prioritize predictable failover.
Architecture summary
- Azure Traffic Manager (ATM) runs in Always Serve mode to provide DNS‑level failover.
- Web Application Firewall (WAF) is implemented regionally using Azure Application Gateway.
- App Gateway can be private, provided the AFD premium is used, and is the default path. DNS failover available when AFD is not reachable.
- When Failover is triggered, one of the steps will be to switch to AppGW IP to Public (ATM can route to public endpoints only)
- Switch back to AFD route, once AFD resumes service.
Pros
- DNS‑based failover away from the global load balancer
- Consistent WAF enforcement at the regional layer
- Application Gateways can remain private during normal operations
Cons
- Additional cost and reduced composite SLA from extra components
- Application Gateway must be made public during failover
- Active‑passive pattern requires regular testing to maintain confidence
2. Multi‑CDN for mission‑critical applications
Figure 2: Multi‑CDN architecture using Azure Front Door and Akamai with DNS‑based traffic steering
When to use: Mission critical Applications with strict availability requirements and heavy CDN usage.
Architecture summary
- Dual CDN setup (for example, Azure Front Door + Akamai)
- Azure Traffic Manager in Always Serve mode
- Traffic split (for example, 90/10) to keep both CDN caches warm
- During failover, 100% of traffic is shifted to the secondary CDN
- Ensure Origin servers can handle the load of extra hits (Cache misses)
Pros
- Highest resilience against CDN‑specific or control‑plane outages
- Maintains cache readiness on both providers
Cons
- Expensive and operationally complex
- Requires origin capacity planning for cache‑miss surges
- Not suitable if applications rely on CDN‑specific advanced features
3. Multi‑layered CDN (Sequential CDN architecture)
Figure 3: Sequential CDN architecture with Akamai as caching layer in front of Azure Front Door
When to use: Rare, niche scenarios where a layered CDN approach is acceptable. Not a common approach, Akamai can be a single entry point of failure. However if the AFD isn't available, you can update Akamai properties to directly route to Origin servers.
Architecture summary
- Akamai used as the front caching layer
- Azure Front Door used as the L7 gateway and WAF
- During failover, Akamai routes traffic directly to origin services
Pros
- Direct fallback path to origins if AFD becomes unavailable
- Single caching layer in normal operation
Cons
- Fronting CDN remains a single point of failure
- Not generally recommended due to complexity
- Requires a well‑tested operational playbook
4. No CDN – Traffic Manager redirect to origin (with Application Gateway)
Figure 4: DNS‑based failover directly to origin via Application Gateway when Azure Front Door is unavailable
When to use: Applications that require L7 routing but no CDN caching.
Architecture summary
- Azure Front Door provides L7 routing and WAF
- Azure Traffic Manager enables DNS failover
- During an AFD outage, Traffic Manager routes directly to Application Gateway‑protected origins
Pros
- Alternative ingress path to origin services
- Consistent regional WAF enforcement
Cons
- Additional infrastructure cost
- Operational dependency on Traffic Manager configuration accuracy
5. No CDN – Traffic Manager redirect to origin (no Application Gateway)
Figure 5: Direct DNS failover to origin services without Application Gateway
When to use: Cost‑sensitive scenarios with clearly accepted security trade‑offs.
Architecture summary
- WAF implemented directly in Azure Front Door
- Traffic Manager provides DNS failover
- During an outage, traffic routes directly to origins
Pros
- Simplest architecture
- No Application Gateway in the primary path
Cons
- Risk of unscreened traffic during failover
- Failover operations can be complex if WAF consistency is required
Frequently asked questions
Is Azure Traffic Manager a single point of failure?
No. Traffic Manager operates as a globally distributed service. For extreme resilience requirements, customers can combine Traffic Manager with a backup FQDN hosted in a separate DNS provider.
Should every workload implement these patterns?
No. These patterns are intended for mission‑critical workloads where downtime has material business impact. Non critical applications do not require multi‑CDN or alternate routing paths.
What does Microsoft use internally?
Microsoft uses a combination of active‑active regions, multi‑layered CDN patterns, and controlled fail‑away mechanisms, selected based on service criticality and performance requirements.
What happened in October 2025 (summary)
Two separate Azure Front Door incidents in October 2025 highlighted the importance of architectural resiliency:
- A control‑plane defect caused erroneous metadata propagation, impacting approximately 26% of global edge sites
- A later compatibility issue across control‑plane versions resulted in DNS resolution failures
Both incidents were mitigated through automated restarts, manual intervention, and controlled failovers. These events accelerated platform‑level hardening investments.
How Azure Front Door is being hardened
Microsoft has already completed or initiated major improvements, including:
- Synchronous configuration processing before rollout
- Control‑plane and data‑plane isolation
- Reduced configuration propagation times
- Active‑active fail‑away for major first‑party services
- Microcell segmentation to reduce blast radius
These changes reinforce a core principle: no single tenant configuration should ever impact others, and recovery must be fast and predictable.
Key takeaways
- Global platforms can experience rare outages—architect for them
- Mission‑critical workloads should include alternate routing paths
- Multi‑CDN and DNS‑based failover patterns remain the most robust
- Resiliency is a business decision, not just a technical one
References
- Azure Front Door: Implementing lessons learned following October outages | Microsoft Community Hub
- Azure Front Door Resiliency Deep Dive and Architecting for Mission Critical - John Savill's deep dive into Azure Front Door resilience and options for mission critical applications
- Global Routing Redundancy for Mission-Critical Web Applications - Azure Architecture Center | Microsoft Learn
- Architecture Best Practices for Azure Front Door - Microsoft Azure Well-Architected Framework | Microsoft Learn