A practical guide for technical architects and cloud workload owners planning and designing resilient ExpressRoute circuits
Production-critical workloads depend on resilient, highly available connectivity between on-premises networks and Azure. Azure ExpressRoute provides dedicated, private connectivity with low latency and high throughput, but a circuit is only as resilient as its design. Achieving resiliency for production-critical workloads requires deliberate choices about sites, gateways, and monitoring. This blog post outlines a resiliency strategy for ExpressRoute based on the Microsoft documentation.
1. Understanding Resiliency in ExpressRoute
Resiliency refers to the ability of your network infrastructure to withstand failures and quickly recover from disruptions. For ExpressRoute, this involves two key aspects:
- Site Resiliency: Ensuring no single point of failure exists at the network edge.
- Zonal Resiliency: Leveraging Azure regions and availability zones to maintain connectivity during localized failures.
ExpressRoute’s resiliency architectures are categorized into three levels:
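a. Maximum Resiliency
Maximum resiliency uses two circuits deployed at two different peering locations, so connectivity survives the loss of an entire edge site. This is the configuration to prefer for production-critical workloads.
Diagram illustrating dual-circuit deployment for maximum resiliency. (Image source aka.ms/learn)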
b. High Resiliency (ExpressRoute Metro)
ExpressRoute Metro splits a single circuit across two sites within the same metropolitan area. This provides more site diversity than the standard configuration and reduces exposure to the failure of a single edge site, although both sites remain in the same metro area.
Diagram showing split circuit configuration across paired sites.
c. Standard Resiliency
This configuration uses a single circuit with two connections at a single peering location. The two connections provide built-in active-active redundancy, but both terminate at the same site, so a site-level failure takes down the whole circuit. Because it lacks site diversity, it is generally not recommended for production-critical workloads.
Diagram illustrating a single-homed ExpressRoute configuration.
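To make the three levels concrete, here is a minimal Python sketch that classifies a circuit inventory by the site diversity it provides. The `Circuit` type, the `classify_resiliency` function, and the site names are hypothetical and exist only for illustration; they are not part of any Azure SDK.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Circuit:
    name: str
    peering_locations: frozenset  # edge sites where the circuit terminates
    metro: bool = False           # True for an ExpressRoute Metro circuit


def classify_resiliency(circuits):
    """Return 'maximum', 'high', 'standard', or 'none' for a list of circuits."""
    if not circuits:
        return "none"
    # Maximum: at least two circuits that share no peering location.
    for i, a in enumerate(circuits):
        for b in circuits[i + 1:]:
            if (a.peering_locations and b.peering_locations
                    and a.peering_locations.isdisjoint(b.peering_locations)):
                return "maximum"
    # High: a Metro circuit split across two sites in one metropolitan area.
    if any(c.metro and len(c.peering_locations) >= 2 for c in circuits):
        return "high"
    # Standard: everything terminates at a single peering location.
    return "standard"
```

For example, two circuits at `site-A` and `site-B` classify as maximum, a single Metro circuit spanning two sites as high, and a single circuit at one site as standard. Two circuits that both terminate at the same site still classify as standard, because they share the same site-level failure domain.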
2. Zonal Resiliency and Azure Regions
In addition to site resiliency, ensuring zonal resiliency is key:
- Availability Zones: Deploy ExpressRoute Virtual Network Gateways as zone-redundant. Availability zones provide fault isolation by spanning multiple physical locations within a region.
- Region-Level Resiliency: Consider geo-redundancy by provisioning ExpressRoute circuits in multiple regions to guard against regional outages.
For detailed guidance on deploying zone-redundant gateways, see Regions & availability zones.
3. Best Practices for ExpressRoute Resiliency
ExpressRoute Circuit Recommendations
- Plan for Circuit or Direct Connectivity: Evaluate whether an ExpressRoute circuit or ExpressRoute Direct best meets your bandwidth and connectivity requirements.
- Multi-Site Redundancy: Deploy circuits with maximum resiliency and ensure on-premises routes are advertised over both circuits.
- Active-Active Configuration: Configure both connections to operate in active-active mode for optimal load balancing and failover.
- Physical Layer Diversity: Establish multiple physical paths from your on-premises edge to the peering locations using different providers or routes.
- Enable BFD: Use Bidirectional Forwarding Detection (BFD) to detect link failures quickly, thereby accelerating failover.
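The effect of BFD on failover can be estimated with simple arithmetic: a BFD session is declared down after a number of consecutive missed packets (the multiplier) at the negotiated interval, whereas BGP on its own waits for the hold timer to expire. The sketch below compares the two. The timer values (300 ms interval, multiplier of 3, 180 s BGP hold time) are assumptions for illustration; confirm the values actually negotiated on your circuit and edge routers.

```python
def bfd_detection_ms(interval_ms: int, multiplier: int) -> int:
    """Approximate BFD failure detection time: interval x multiplier."""
    if interval_ms <= 0 or multiplier <= 0:
        raise ValueError("interval and multiplier must be positive")
    return interval_ms * multiplier


def bgp_detection_ms(hold_time_s: int) -> int:
    """Worst-case detection time when relying on the BGP hold timer alone."""
    if hold_time_s <= 0:
        raise ValueError("hold time must be positive")
    return hold_time_s * 1000


# Assumed timer values; verify against your own configuration.
BFD_INTERVAL_MS = 300
BFD_MULTIPLIER = 3
BGP_HOLD_TIME_S = 180

bfd = bfd_detection_ms(BFD_INTERVAL_MS, BFD_MULTIPLIER)
bgp = bgp_detection_ms(BGP_HOLD_TIME_S)
print(f"BFD detects a failed link in about {bfd} ms")
print(f"BGP hold timer alone can take up to {bgp // 1000} s")
print(f"Detection is roughly {bgp // bfd}x faster with BFD")
```

Under these assumed timers, detection drops from up to three minutes to under a second, which is why BFD matters for workloads that cannot tolerate a long blackhole period during failover.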
ExpressRoute Gateway Recommendations
- Zone-Redundant Virtual Network Gateways: Deploy gateways across multiple availability zones to provide resiliency against zone-level failures.
- Gateway Migration: If using non-zone-redundant gateways, migrate to zone-redundant configurations using the guided migration experience provided by Azure.
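A quick way to scope a gateway migration is to list the gateways whose SKU is not zone-redundant. The sketch below does this over a plain dictionary of gateway names and SKUs. The inventory format and the `gateways_to_migrate` helper are hypothetical, and the SKU set is an assumption based on the `ErGw1AZ`, `ErGw2AZ`, and `ErGw3AZ` names; check the current Azure documentation for the full list of zone-capable SKUs before relying on it.

```python
# Assumed set of zone-redundant ExpressRoute gateway SKUs; verify before use.
ZONE_REDUNDANT_SKUS = {"ergw1az", "ergw2az", "ergw3az"}


def gateways_to_migrate(gateways: dict) -> list:
    """Return the names of gateways whose SKU is not in the zone-redundant set.

    `gateways` maps a gateway name to its SKU string, for example
    {"gw-hub": "ErGw2AZ"}. SKU comparison is case-insensitive.
    """
    return sorted(
        name
        for name, sku in gateways.items()
        if sku.strip().lower() not in ZONE_REDUNDANT_SKUS
    )


inventory = {
    "gw-hub": "ErGw2AZ",
    "gw-legacy": "HighPerformance",
    "gw-old": "Standard",
}
print(gateways_to_migrate(inventory))
```

The SKU name is only a first filter. A zone-capable SKU must also be deployed across zones, so confirm the actual zone configuration of each gateway as part of the migration plan.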
Disaster Recovery and Geo-Redundancy
- High Availability and Disaster Recovery: Architect both the customer and provider segments of your ExpressRoute circuit for high availability. For disaster recovery, plan for redundant circuits in different regions or peering locations.
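- Avoid VPN Backups: For latency-sensitive, production-critical workloads, avoid using site-to-site VPN as a backup for ExpressRoute. A VPN tunnel runs over the public internet and cannot offer the same consistent latency and throughput, so a failover to VPN can degrade the workload even though connectivity is technically preserved.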
Monitoring and Alerting
- Network Insights: Configure Azure Monitor’s Network Insights to track ExpressRoute circuit metrics, including availability, throughput, and packet drops.
- Service Health Alerts: Use Azure Service Health to receive notifications about ExpressRoute circuit maintenance events.
- Connection and Gateway Monitoring: Implement Connection Monitor to check connectivity between on-premises and Azure, and configure gateway health monitoring to track the performance of your ExpressRoute gateways.
For further details on monitoring, refer to Monitoring and alerting recommendations.
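The alerting logic behind these recommendations can be sketched independently of any tooling: collect samples per metric, compare the average against a minimum, and treat missing data as a problem in its own right. The example below is plain Python and does not call Azure Monitor. The metric names `BgpAvailability` and `ArpAvailability` and the 99.0 thresholds are used here as illustrative assumptions; in practice the same rules would be expressed as Azure Monitor alert rules.

```python
from statistics import mean


def find_breaches(samples: dict, minimums: dict) -> list:
    """Return one message per metric that is missing or below its minimum.

    `samples` maps a metric name to a list of recent values (percentages).
    `minimums` maps a metric name to the lowest acceptable average.
    """
    alerts = []
    for metric, minimum in minimums.items():
        values = samples.get(metric, [])
        if not values:
            # No data usually means the collection path itself is broken.
            alerts.append(f"{metric}: no data received")
            continue
        average = mean(values)
        if average < minimum:
            alerts.append(f"{metric}: average {average:.1f} is below {minimum}")
    return alerts


recent = {
    "BgpAvailability": [100, 100, 40],   # one sample shows a BGP session drop
    "ArpAvailability": [100, 100, 100],
}
for alert in find_breaches(recent, {"BgpAvailability": 99.0, "ArpAvailability": 99.0}):
    print(alert)
```

Treating "no data" as an alert is a deliberate choice: a silent monitoring gap during an outage is harder to notice than a low reading.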
4. Implementation Guidance for Production Workloads
When designing a resiliency setup for production-critical workloads, follow these steps:
- Assess Requirements: Determine bandwidth, latency, and uptime requirements for your workloads.
- Select Resiliency Architecture: For production, implement maximum or high resiliency configurations to eliminate single points of failure.
- Deploy Redundant Circuits: Configure dual circuits across diverse sites. Advertise on-premises routes over both circuits to ensure full path redundancy.
- Implement Zone-Redundant Gateways: Use zone-redundant virtual network gateways for connecting your virtual networks to ExpressRoute.
- Set Up Monitoring: Configure Azure Monitor, Service Health alerts, and Connection Monitor to gain real-time insights into your ExpressRoute connectivity.
- Test Failover Scenarios: Regularly test failover and disaster recovery procedures to ensure that your resiliency architecture performs as expected during outages.
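Before running a failover test, it helps to confirm that every on-premises prefix is actually advertised over both circuits, since a prefix missing from one circuit becomes unreachable the moment the other fails. The following sketch compares the expected prefixes against what each circuit carries. The function name, circuit names, and prefixes are hypothetical, and the advertised lists are assumed to come from your own routers or route tables.

```python
import ipaddress


def missing_advertisements(expected_prefixes, advertised_by_circuit):
    """Return {circuit: [missing prefixes]} for circuits lacking expected routes.

    Prefixes are parsed with ipaddress so that malformed entries are rejected.
    Circuits that advertise every expected prefix are omitted from the result.
    """
    expected = {ipaddress.ip_network(p) for p in expected_prefixes}
    gaps = {}
    for circuit, prefixes in advertised_by_circuit.items():
        advertised = {ipaddress.ip_network(p) for p in prefixes}
        missing = sorted(expected - advertised)
        if missing:
            gaps[circuit] = [str(net) for net in missing]
    return gaps


on_prem = ["10.10.0.0/16", "10.20.0.0/16"]
advertised = {
    "er-site-a": ["10.10.0.0/16", "10.20.0.0/16"],
    "er-site-b": ["10.10.0.0/16"],  # 10.20.0.0/16 is missing on this circuit
}
print(missing_advertisements(on_prem, advertised))
```

An empty result means both circuits carry the full set of routes. Any entry in the result points to a prefix that would lose connectivity during a failover to that circuit.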
Conclusion
For production-critical workloads, resilient connectivity to Azure via ExpressRoute is a necessity, not an option. By combining a maximum or high resiliency architecture with zone-redundant gateways and monitoring, you can protect mission-critical applications against site, zone, and link failures.
Following these best practices will help you design an ExpressRoute setup that meets stringent production requirements, delivering the high availability, performance, and reliability that modern enterprises demand.
For more detailed technical insights, please refer to the full documentation on Design and architect Azure ExpressRoute for resiliency.