Understanding ExpressRoute private peering to address ExpressRoute resiliency
This article provides an overview of Microsoft ExpressRoute, including its various components such as the Circuit, the Gateway and the Connection, and different connectivity models like ExpressRoute Service Provider and ExpressRoute Direct. It also covers the resilience and failure scenarios related to ExpressRoute, including geo-redundancy, Availability Zones, and route advertisement limits. If you're looking to learn more about ExpressRoute and its implementation, this article is a great resource.

Microsoft Azure scales Hollow Core Fiber (HCF) production through outsourced manufacturing
Introduction As cloud and AI workloads surge, the pressure on datacenter (DC), Metro and Wide Area Network (WAN) networks has never been greater. Microsoft is tackling the physical limits of traditional networking head-on. From pioneering research in microLED technologies to deploying Hollow Core Fiber (HCF) at global scale, Microsoft is reimagining connectivity to power the next era of cloud networking. Azure’s HCF journey has been one of relentless innovation, collaboration, and a vision to redefine the physical layer of the cloud. Microsoft’s HCF, based on the proprietary Double Nested Antiresonant Nodeless Fiber (DNANF) design, delivers up to 47% faster data transmission and approximately 33% lower latency compared to conventional Single Mode Fiber (SMF), bringing significant advantages to the network that powers Azure. Today, Microsoft is announcing a major milestone: the industrial scale-up of HCF production, powered by new strategic manufacturing collaborations with Corning Incorporated (Corning) and Heraeus Covantics (Heraeus). These collaborations will enable Azure to increase the global fiber production of HCF to meet the demands of the growing network infrastructure, advancing the performance and reliability customers expect for cloud and AI workloads. Real-world benefits for Azure customers Since 2023, Microsoft has deployed HCF across multiple Azure regions, with production links meeting performance and reliability targets. As manufacturing scales, Azure plans to expand deployment of the full end-to-end HCF network solution to help increase capacity, resiliency, and speed for customers, with the potential to set new benchmarks for latency and efficiency in fiber infrastructure. Why it matters Microsoft’s proprietary HCF design brings the following improvements for Azure customers: Increased data transmission speeds with up to 33% lower latency. Enhanced signal performance that improves data transmission quality for customers. 
Improved optical efficiency resulting in higher bandwidth rates compared to conventional fiber. How Microsoft is making it possible To operationalize HCF across Azure with production grade performance, Microsoft is: Deploying a standardized HCF solution with end-to-end systems and components for operational efficiency, streamlined network management, and reliable connectivity across Azure’s infrastructure. Ensuring interoperability with standard SMF environments, enabling seamless integration with existing optical infrastructure in the network for faster deployment and scalable growth. Creating a multinational production supply chain to scale next generation fiber production, ensuring the volumes and speed to market needed for widespread HCF deployment across the Azure network. Scaling up and out With Corning and Heraeus as Microsoft’s first HCF manufacturing collaborators, Azure plans to accelerate deployment to meet surging demand for high-performance connectivity. These collaborations underscore Microsoft’s commitment to enhancing its global infrastructure and delivering a reliable customer experience. They also reinforce Azure’s continued investment in deploying HCF, with a vision for this technology to potentially set the global benchmark for high-capacity fiber innovation. “This milestone marks a new chapter in reimagining the cloud’s physical layer. Our collaborations with Corning and Heraeus establish a resilient, global HCF supply chain so Azure can deliver a standardized, world-class customer experience with ultra-low latency and high reliability for modern AI and cloud workloads.” - Jamie Gaudette, Partner Cloud Network Engineering Manager at Microsoft To scale HCF production, Microsoft will utilize Corning’s established U.S. facilities, while Heraeus will produce out of its sites in both Europe and the U.S. 
"Corning is excited to expand our longtime collaboration with Microsoft, leveraging Corning’s fiber and cable manufacturing facilities in North Carolina to accelerate the production of Microsoft's Hollow Core Fiber. This collaboration not only strengthens our existing relationship but also underscores our commitment to advancing U.S. leadership in AI innovation and infrastructure. By working closely with Microsoft, we are poised to deliver solutions that meet the demands of AI workloads, setting new benchmarks for speed and efficiency in fiber infrastructure." - Mike O'Day, Senior Vice President and General Manager, Corning Optical Communications “We started our work on HCF a decade ago, teamed up with the Optoelectronics Research Centre (ORC) at the University of Southampton and then with Lumenisity prior to its acquisition. Now, we are excited to continue working with Microsoft on shaping the datacom industry. With leading solutions in glass, tube, preform, and fiber manufacturing, we are ready to scale this disruptive HCF technology to significant volumes. We’ll leverage our proven track record of taking glass and fiber innovations from the lab to widespread adoption, just as we did in the telecom industry, where approximately 2 billion kilometers of fiber are made using Heraeus products.” - Dr. Jan Vydra, Executive Vice President Fiber Optics, Heraeus Covantics Azure engineers are working alongside Corning and Heraeus to operationalize Microsoft manufacturing process intellectual property (IP), deliver targeted training programs, and drive the yield, metrology, and reliability improvements required for scaled production. 
The collaborations are foundational to a growing standardized, global ecosystem that supports:
Glass preform/tubing supply
Fiber production at scale
Cable and connectivity for deployment into carrier-grade environments

Building on a foundation of innovation: Microsoft’s HCF program

In 2022, Microsoft acquired Lumenisity, a spin-out from the Optoelectronics Research Centre (ORC) at the University of Southampton, UK. That same year, Microsoft launched the world’s first state-of-the-art HCF fabrication facility in the UK to expand production and drive innovation. This purpose-built site continues to support long-term HCF research, prototyping, and testing, ensuring that Azure remains at the forefront of HCF technology. Working with industry leaders, Microsoft has developed an end-to-end ecosystem of components, equipment, and HCF-specific hardware that has been proven in production deployments and operations.

Pushing the boundaries: recent breakthrough research

Today, the University of Southampton announced a landmark achievement in optical communications: in collaboration with Azure Fiber researchers, they have demonstrated the lowest signal loss ever recorded for optical fibers (<0.1 dB/km) using research-grade DNANF HCF technology (see figure 4). This breakthrough, detailed in a research paper published in Nature Photonics earlier this month, paves the way for a potential revolution in the field, enabling unprecedented data transmission capacities and longer unamplified spans.

[Figure 4: records at around 1550 nm. [1] 2002, Nagayama et al.; [2] 2025, Sato et al.; [3] 2025, research-grade DNANF HCF, Petrovich et al.]

This breakthrough highlights the potential for this technology to transform global internet infrastructure and DC connectivity. Expected benefits include:
Faster: Approximately 47% faster, reducing latency and powering real-time AI inference, cloud gaming, and other interactive workloads.
More capacity: A wider optical spectrum window enabling exponentially greater bandwidth.
Future-ready: Lays the groundwork for quantum-safe links, quantum computing infrastructure, advanced sensing, and remote laser delivery.

Looking ahead: Unlocking the future of cloud networking

The future of cloud networking is being built today! With record-breaking [3] fiber innovations, a rapidly expanding collaborative ecosystem, and the industrialized scale to deliver next-generation performance, Azure continues to evolve to meet the demands for speed, reliability, and connectivity. As we accelerate the deployment of HCF across our global network, we’re not just keeping pace with the demands of AI and cloud, we’re redefining what’s possible.

References:
[1] Nagayama, K., Kakui, M., Matsui, M., Saitoh, T. & Chigusa, Y. Ultra-low-loss (0.1484 dB/km) pure silica core fibre and extension of transmission distance. Electron. Lett. 38, 1168–1169 (2002).
[2] Sato, S., Kawaguchi, Y., Sakuma, H., Haruna, T. & Hasegawa, T. Record low loss optical fiber with 0.1397 dB/km. In Proc. Optical Fiber Communication Conference (OFC) 2024 Tu2E.1 (Optica Publishing Group, 2024).
[3] Petrovich, M., Numkam Fokoua, E., Chen, Y., Sakr, H., Isa Adamu, A., Hassan, R., Wu, D., Fatobene Ando, R., Papadimopoulos, A., Sandoghchi, S., Jasion, G. & Poletti, F. Broadband optical fibre with an attenuation lower than 0.1 decibel per kilometre. Nat. Photon. (2025). https://doi.org/10.1038/s41566-025-01747-5

Useful Links:
The Deployment of Hollow Core Fiber (HCF) in Azure’s Network
How hollow core fiber is accelerating AI | Microsoft Azure Blog
Learn more about Microsoft global infrastructure

Network Redundancy Between AVS, On-Premises, and Virtual Networks in a Multi-Region Design
By Mays_Algebary shruthi_nair Establishing redundant network connectivity is vital to ensuring the availability, reliability, and performance of workloads operating in hybrid and cloud environments. Proper planning and implementation of network redundancy are key to achieving high availability and sustaining operational continuity. This article focuses on network redundancy in multi-region architecture. For details on single-region design, refer to this blog. The diagram below illustrates a common network design pattern for multi-region deployments, using either a Hub-and-Spoke or Azure Virtual WAN (vWAN) topology, and serves as the baseline for establishing redundant connectivity throughout this article. In each region, the Hub or Virtual Hub (VHub) extends Azure connectivity to Azure VMware Solution (AVS) via an ExpressRoute circuit. The regional Hub/VHub is connected to on-premises environments by cross-connecting (bowtie) both local and remote ExpressRoute circuits, ensuring redundancy. The concept of weight, used to influence traffic routing preferences, will be discussed in the next section. The diagram below illustrates the traffic flow when both circuits are up and running. Design Considerations If a region loses its local ExpressRoute connection, AVS in that region will lose connectivity to the on-premises environment. However, VNets will still retain connectivity to on-premises via the remote region’s ExpressRoute circuit. The solutions discussed in this article aim to ensure redundancy for both AVS and VNets. Looking at the diagram above, you might wonder: why do we need to set weights at all, and why do the AVS-ER connections (1b/2b) use the same weight as the primary on-premises connections (1a/2a)? Weight is used to influence routing decisions and ensure optimal traffic flow. In this scenario, both ExpressRoute circuits, ER1-EastUS and ER2-WestUS, advertise the same prefixes to the Azure ExpressRoute gateway. 
As a result, traffic from the VNet to on-premises would be ECMPed across both circuits. To avoid suboptimal routing and ensure that traffic from the VNets prefers the local ExpressRoute circuit, a higher weight is assigned to the local path. It’s also critical that the ExpressRoute gateway connection to on-premises (1a/2a) and to AVS (1b/2b), is assigned the same weight. Otherwise, traffic from the VNet to AVS will follow a less efficient route as AVS routes are also learned over ER1-EastUS via Global Reach. For instance, VNets in EastUS will connect to AVS EUS through ER1-EastUS circuit via Global Reach (as shown by the blue dotted line), instead of using the direct local path (orange line). This suboptimal routing is illustrated in the below diagram. Now let us see what solutions we can have to achieve redundant connectivity. The following solutions will apply to both Hub-and-Spoke and vWAN topology unless noted otherwise. Note: The diagrams in the upcoming solutions will focus only on illustrating the failover traffic flow. Solution1: Network Redundancy via ExpressRoute in Different Peering Location In the solution, deploy an additional ExpressRoute circuit in a different peering location within the same metro area (e.g., ER2–PeeringLocation2), and enable Global Reach between this new circuit and the existing AVS ExpressRoute (e.g., AVS-ER1). If you intend to use this second circuit as a failover path, apply prepends to the on-premises prefixes advertised over it. Alternatively, if you want to use it as an active-active redundant path, do not prepend routes, in this case, both AVS and Azure VNets will ECMP to distribute traffic across both circuits (e.g., ER1–EastUS and ER–PeeringLocation2) when both are available. Note: Compared to the Standard Topology, this design removes both the ExpressRoute cross-connect (bowtie) and weight settings. 
When adding a second circuit in the same metro, there's no benefit in keeping them, otherwise traffic from the Azure VNet will prefer the local AVS circuit (AVS-ER1/AVS-ER2) to reach on-premises due to the higher weight, as on-premises routes are also learned over AVS circuit (AVS-ER1/AVS-ER2) via Global Reach. Also, when connecting the new circuit (e.g., ER–Peering Location2), remove all weight settings across the connections. Traffic will follow the optimal path based on BGP prepending on the new circuit, or load-balance (ECMP) if no prepend is applied. Note: Use public ASN to prepend the on-premises prefix as AVS circuit (e.g., AVS-ER) will strip the private ASN toward AVS. Solution Insights Ideal for mission-critical applications, providing predictable throughput and bandwidth for backup. It could be cost prohibitive depending on the bandwidth of the second circuit. Solution2: Network Redundancy via ExpressRoute Direct In this solution, ExpressRoute Direct is used to provision multiple circuits from a single port pair in each region, for example, ER2-WestUS and ER4-WestUS are created from the same port pair. This allows you to dedicate one circuit for local traffic and another for failover to a remote region. To ensure optimal routing, prepend the on-premises prefixes using public ASN on the newly created circuit (e.g., ER3-EastUS and ER4-WestUS). Remove all weight settings across the connections; traffic will follow the optimal path based on BGP prepending on the new circuit. For instance, if ER1-EastUS becomes unavailable, traffic from AVS and VNets in the EastUS region will automatically route through ER4-WestUS circuit, ensuring continuity. Note: Compared to the Standard Topology, this design connects the newly created ExpressRoute circuits (e.g., ER3-EastUS/ER4-WestUS) to the remote region of ExpressRoute gateway (black dotted lines) instead of having the bowtie to the primary circuits (e.g., ER1-EastUS/ER2-WestUS). 
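The interplay between weight, AS-path prepending, and ECMP described above can be sketched with a small route-selection model. This is a simplified illustration, not the gateway's actual implementation: it assumes routes are compared by longest prefix match, then by connection weight, then by AS-path length, with ties load-balanced. The `Route` class, the `best_paths` function, the prefix 10.10.0.0/16, and ASN 64496 (a documentation ASN standing in for a public ASN) are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from ipaddress import ip_address, ip_network


@dataclass(frozen=True)
class Route:
    prefix: str       # advertised prefix, e.g. "10.10.0.0/16"
    circuit: str      # circuit the route was learned over (illustrative label)
    as_path: tuple    # AS path as received, prepends included
    weight: int = 0   # connection weight; 0 models "no weight configured"


def best_paths(routes, destination):
    """Return the routes chosen for `destination` under the simplified model."""
    dest = ip_address(destination)
    candidates = [r for r in routes if dest in ip_network(r.prefix)]
    if not candidates:
        return []
    # 1. Longest prefix match wins.
    longest = max(ip_network(r.prefix).prefixlen for r in candidates)
    candidates = [r for r in candidates if ip_network(r.prefix).prefixlen == longest]
    # 2. Among those, the highest connection weight wins.
    top_weight = max(r.weight for r in candidates)
    candidates = [r for r in candidates if r.weight == top_weight]
    # 3. Among those, the shortest AS path wins; any remaining tie means ECMP.
    shortest = min(len(r.as_path) for r in candidates)
    return [r for r in candidates if len(r.as_path) == shortest]


# Solution 2 as seen from EastUS: same prefix on both circuits, weights removed,
# and the failover circuit prepended twice (assumed values).
primary = Route("10.10.0.0/16", "ER1-EastUS", (64496,))
backup = Route("10.10.0.0/16", "ER4-WestUS", (64496, 64496, 64496))

print([r.circuit for r in best_paths([primary, backup], "10.10.1.5")])  # ['ER1-EastUS']
print([r.circuit for r in best_paths([backup], "10.10.1.5")])           # ['ER4-WestUS']
```

In this model, dropping the prepend on the second circuit makes both routes tie at every step, which corresponds to the active-active ECMP behaviour described for Solution 1. Leaving a higher weight on one connection would override the prepend, which is why the text recommends removing all weight settings.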
Solution Insights

Easy to implement if you have ExpressRoute Direct.
ExpressRoute Direct supports overprovisioning: you can create logical ExpressRoute circuits on top of an existing 10-Gbps or 100-Gbps ExpressRoute Direct resource, up to a subscribed bandwidth of 20 Gbps or 200 Gbps respectively. For example, you can create two 10-Gbps ExpressRoute circuits within a single 10-Gbps ExpressRoute Direct resource (port pair).
Ideal for mission-critical applications, providing predictable throughput and bandwidth for backup.

Solution3: Network Redundancy via ExpressRoute Metro

Metro ExpressRoute is a new configuration that enables dual-homed connectivity to two different peering locations within the same city. This setup enhances resiliency by allowing traffic to continue flowing over the same circuit even if one peering location goes down.

Solution Insights

Higher Resiliency: Provides increased reliability with a single circuit.
Limited regional availability: Currently available in select regions, with more being added over time.
Cost-effective: Offers redundancy without significantly increasing costs.

Solution4: Deploy VPN as a Backup to ExpressRoute

This solution mirrors solution 1 of the single-region design but extends it to multiple regions. In this approach, a VPN serves as the backup path for each region in the event of an ExpressRoute failure. In a Hub-and-Spoke topology, a backup path to and from AVS can be established by deploying Azure Route Server (ARS) in the hub VNet. ARS enables seamless transit routing between ExpressRoute and the VPN gateway. In vWAN topology, ARS is not required; the vHub's built-in routing service automatically provides transitive routing between the VPN gateway and ExpressRoute. In this design, you should not cross-connect ExpressRoute circuits (e.g., ER1-EastUS and ER2-WestUS) to the ExpressRoute gateways in the Hub VNets (e.g., Hub-EUS or Hub-WUS).
Doing so will lead to routing issues, where the Hub VNet only programs the on-premises routes learned via ExpressRoute. For instance, in the EastUS region, if the primary circuit (ER1-EastUS) goes down, Hub-EUS will receive on-premises routes from both the VPN tunnel and the remote ER2-WestUS circuit. However, it will prefer and program only the ExpressRoute-learned routes from ER2-WestUS circuit. Since ExpressRoute gateways do not support route transitivity between circuits, AVS connected via AVS-ER will not receive the on-premises prefixes, resulting in routing failures. Note: In vWAN topology, to ensure optimal route convergence when failing back to ExpressRoute, you should prepend the prefixes advertised from on-premises over the VPN. Without route prepending, VNets may continue to use the VPN as the primary path to on-premises. If prepend is not an option, you can trigger the failover manually by bouncing the VPN tunnel. Solution Insights Cost-effective and straightforward to deploy. Increased Latency: The VPN tunnel over the internet adds latency due to encryption overhead. Bandwidth Considerations: Multiple VPN tunnels might be needed to achieve bandwidth comparable to a high-capacity ExpressRoute circuit (e.g., over 1G). For details on VPN gateway SKU and tunnel throughput, refer to this link. As you can't cross connect ExpressRoute circuits, VNets will utilize the VPN for failover instead of leveraging remote region ExpressRoute circuit. Solution5: Network Redundancy-Multiple On-Premises (split-prefix) In many scenarios, customers advertise the same prefix from multiple on-premises locations to Azure. However, if the customer can split prefixes across different on-premises sites, it simplifies the implementation of failover strategy using existing ExpressRoute circuits. In this design, each on-premises advertises region-specific prefixes (e.g., 10.10.0.0/16 for EastUS and 10.70.0.0/16 for WestUS), along with a common supernet (e.g., 10.0.0.0/8). 
Under normal conditions, AVS and VNets in each region use longest prefix match to route traffic efficiently to the appropriate on-premises location. For instance, if ER1-EastUS becomes unavailable, AVS and VNets in EastUS will automatically fail over to ER2-WestUS, routing traffic via the supernet prefix to maintain connectivity. Solution Insights Cost-effective: no additional deployment, using existing ExpressRoute circuits. Advertising specific prefixes over each region might need additional planning. Ideal for mission-critical applications, providing predictable throughput and bandwidth for backup. Solution6: Prioritize Network Redundancy for One Region Over Another If you're operating under budget constraints and can prioritize one region (such as hosting critical workloads in a single location) and want to continue using your existing ExpressRoute setup, this solution could be an ideal fit. In this design, assume AVS in EastUS (AVS-EUS) hosts the critical workloads. To ensure high availability, AVS-ER1 is configured with Global Reach connections to both the local ExpressRoute circuit (ER1-EastUS) and the remote circuit (ER2-WestUS). Make sure to prepend the on-premises prefixes advertised to ER2-WestUS using public ASN to ensure optimal routing (no ECMP) from AVS-EUS over both circuits (ER1-EastUS and ER2-WestUS). On the other hand, AVS in WestUS (AVS-WUS) is connected via Global Reach only to its local region ExpressRoute circuit (ER2-WestUS). If that circuit becomes unavailable, you can establish an on-demand Global Reach connection to ER1-EastUS, either manually or through automation (e.g., a triggered script). This approach introduces temporary downtime until the Global Reach link is established. You might be thinking, why not set up Global Reach between the AVS-WUS circuit and remote region circuits (like connecting AVS-ER2 to ER1-EastUS), just like we did for AVS-EUS? Because it would lead to suboptimal routing. 
Due to AS path prepending on ER2-WestUS, if both ER1-EastUS and ER2-WestUS are linked to AVS-ER2, traffic would favor the remote ER1-EastUS circuit since it presents a shorter AS path. As a result, traffic would bypass the local ER2-WestUS circuit, causing inefficient routing. That is why for AVS-WUS, it's better to use on-demand Global Reach to ER1-EastUS as a backup path, enabled manually or via automation, only when ER2-WestUS becomes unavailable.

Note: VNets will fail over via the local AVS circuit. For example, HUB-EUS will route to on-premises through AVS-ER1 and ER2-WestUS via Global Reach Secondary (purple line).

Solution Insights

Cost-effective.
Workloads hosted in AVS within the non-critical region will experience downtime if the local region ExpressRoute circuit becomes unavailable, until the on-demand Global Reach connection is established.

Conclusion

Each solution has its own advantages and considerations, such as cost-effectiveness, ease of implementation, and increased resiliency. By carefully planning and implementing these solutions, organizations can ensure operational continuity and optimal traffic routing in multi-region deployments.

Combining firewall protection and SD-WAN connectivity in Azure virtual WAN
Virtual WAN (vWAN) introduces new security and connectivity features in Azure, including the ability to operate managed third-party firewalls and SD-WAN virtual appliances, integrated natively within a virtual WAN hub (vhub). This article will discuss updated network designs resulting from these integrations and examine how to combine firewall protection and SD-WAN connectivity when using vWAN. The objective is not to delve into the specifics of the security or SD-WAN connectivity solutions, but to provide an overview of the possibilities. Firewall protection in vWAN In a vWAN environment, the firewall solution is deployed either automatically inside the vhub (Routing Intent) or manually in a transit VNet (VM-series deployment). Routing Intent (managed firewall) Routing Intent refers to the concept of implementing a managed firewall solution within the vhub for internet protection or private traffic protection (VNet-to-VNet, Branch-to-VNet, Branch-to-Branch), or both. The firewall could be either an Azure Firewall or a third-party firewall, deployed within the vhub as Network Virtual Appliances or a SaaS solution. A vhub containing a managed firewall is called a secured hub. For an updated list of Routing Intent supported third-party solutions please refer to the following links: managed NVAs SaaS solution Transit VNet (unmanaged firewall) Another way to provide inspection in vWAN is to manually deploy the firewall solution in a spoke of the vhub and to cascade the actual spokes behind that transit firewall VNet (aka indirect spoke model or tiered-VNet design). In this discussion, the primary reasons for choosing unmanaged deployments are: either the firewall solution lacks an integrated vWAN offer, or it has an integrated offer but falls short in horizontal scalability or specific features compared to the VM-based version. For a detailed analysis on the pros and cons of each design please refer to this article. 
SD-WAN connectivity in vWAN Similar to the firewall deployment options, there are two main methods for extending an SDWAN overlay into an Azure vWAN environment: a managed deployment within the vhub, or a standard VM-series deployment in a spoke of the vhub. More options here. SD-WAN in vWAN deployment (managed) In this scenario, a pair of virtual SD-WAN appliances are automatically deployed and integrated in the vhub using dynamic routing (BGP) with the vhub router. Deployment and management processes are streamlined as these appliances are seamlessly provisioned in Azure and set up for a simple import into the partner portal (SD-WAN orchestrator). For an updated list of supported SDWAN partners please refer to this link. For more information on SD-WAN in vWAN deployments please refer to this article. VM-series deployment (unmanaged) This solution requires manual deployment of the virtual SD-WAN appliances in a spoke of the vhub. The underlying VMs and the horizontal scaling are managed by the customer. Dynamic route exchange with the vWAN environment is achieved leveraging BGP peering with the vhub. Alternatively, and depending on the complexity of your addressing plan, static routing may also be possible. Firewall protection and SD-WAN in vWAN THE CHALLENGE! Currently, it is only possible to chain managed third-party SD-WAN connectivity with Azure Firewall in the same vhub, or to use dual-role SD-WAN connectivity and security appliances. Routing Intent provided by third-party firewalls combined with another managed SD-WAN solution inside the same vhub is not yet supported. But how can firewall protection and SD-WAN connectivity be integrated together within vWAN? Solution 1: Routing Intent with Azure Firewall and managed SD-WAN (same vhub) Firewall solution: managed. SD-WAN solution: managed. 
This design is only compatible with Routing Intent using Azure Firewall, as it is the sole firewall solution that can be combined with a managed SD-WAN in vWAN deployment in that same vhub. With the private traffic protection policy enabled in Routing Intent, all East-West flows (VNet-to-VNet, Branch-to-VNet, Branch-to-Branch) are inspected. Solution 2: Routing Intent with a third-party firewall and managed SD-WAN (2 vhubs) Firewall solution: managed. SD-WAN solution: managed. To have both a third-party firewall managed solution in vWAN and an SD-WAN managed solution in vWAN in the same region, the only option is to have a vhub dedicated to the security solution deployment and another vhub dedicated to the SD-WAN solution deployment. In each region, spoke VNets are connected to the secured vhub, while SD-WAN branches are connected to the vhub containing the SD-WAN deployment. In this design, Routing Intent private traffic protection provides VNet-to-VNet and Branch-to-VNet inspection. However, Branch-to-Branch traffic will not be inspected. Solution 3: Routing Intent and SD-WAN spoke VNet (same vhub) Firewall solution: managed. SD-WAN solution: unmanaged. This design is compatible with any Routing Intent supported firewall solution (Azure Firewall or third-party) and with any SD-WAN solution. With Routing Intent private traffic protection enabled, all East-West flows (VNet-to-VNet, Branch-to-VNet, Branch-to-Branch) are inspected. Solution 4: Transit firewall VNet and managed SDWAN (same vhub) Firewall solution: unmanaged. SD-WAN solution: managed. This design utilizes the indirect spoke model, enabling the deployment of managed SD-WAN in vWAN appliances. This design provides VNet-to-VNet and Branch-to-VNet inspection. But because the firewall solution is not hosted in the hub, Branch-to-Branch traffic will not be inspected. Solution 5 - Transit firewall VNet and SD-WAN spoke VNet (same vhub) Firewall solution: unmanaged. SD-WAN solution: unmanaged. 
This design integrates both the security and the SD-WAN connectivity as unmanaged solutions, placing the responsibility for deploying and managing the firewall and the SD-WAN hub on the customer. Just like in solution #4, only VNet-to-VNet and Branch-to-VNet traffic is inspected.

Conclusion

Although it is currently not possible to combine a managed third-party firewall solution with a managed SD-WAN deployment within the same vhub, numerous design options are still available to meet various needs, whether managed or unmanaged approaches are preferred.

Managed SD-WAN in vWAN: throughput considerations and underlay options
Discover how to simplify large-scale branch connectivity by leveraging the power of SD-WAN and vWAN to create a cloud-oriented network that delivers optimized inter-region performance and resiliency. Get insights on the deployment process for managed SD-WAN in vWAN, as well as on throughput considerations and scaling options. Finally, explore the various underlay options available for managed SD-WAN in vWAN deployments.

ExpressRoute Gateway Migration Playbook
Objective

The objective of this document is to help with transitioning the ExpressRoute gateway from a non-zone-redundant SKU to a zone-redundant SKU. This upgrade enhances the reliability and availability of the gateway by ensuring that it is resilient to zone failures. Additionally, the public IP associated with the gateway will be upgraded from a Basic SKU to a Standard SKU. This upgrade provides improved performance, security features, and availability guarantees.

The entire migration should be conducted in accordance with IT Service Management (ITSM) guidelines, ensuring that all best practices and standards are followed. Change management protocols should be strictly adhered to, including obtaining necessary approvals, documenting the change, and communicating with stakeholders. Pre-migration and post-migration testing should be performed to validate the success of the migration and to ensure that there are no disruptions to services.

The migration should be scheduled within a planned maintenance window to minimize impact on users and services. This window should be carefully selected to ensure that it aligns with business requirements and minimizes downtime. Throughout the process, detailed monitoring and logging should be in place to track progress and quickly address any issues that may arise.

[Diagram: single-zone ExpressRoute Gateway]
[Diagram: zone-redundant ExpressRoute Gateway]

Background

The ExpressRoute Gateway Standard SKU is not zone-redundant, which lowers the resiliency of the service.
The Basic SKU public IP is retiring at the end of September 2025. After this date, support for this SKU will cease, which will potentially impact ExpressRoute Gateway support.
The ExpressRoute Gateway public IP is used internally for control plane communication.
Migration Scenarios This document is equally relevant to all of the following scenarios: ExpressRoute Gateway Standard/High/Ultraperformance to ErGw1Az/ ErGw2Az/ ErGw3Az SKU ExpressRoute Gateway Standard/High/Ultraperformance to Standard/High/Ultraperformance (Multi-Zone) SKU Single-zone and multi-zone regions Zone redundant SKU (ErGw1Az/ErGw2Az/ErGw3Az) deployed in single zone. Prerequisites Stakeholder Approvals: Ensure ITSM approvals are in place. This is to ensure that changes to IT systems are properly reviewed and authorized before implementation. Change Request (CR): Submit and secure approval for a Change Request to guarantee that all modifications to IT systems are thoroughly reviewed, authorized, and implemented in a controlled manner. Maintenance Window: When scheduling a maintenance window for production work, consider the following to minimize disruption and ensure efficiency: Key Considerations Minimizing Disruption: Schedule during low activity periods, often outside standard business hours or on weekends. Ensuring Adequate Staffing: Ensure necessary staff and resources are available, including technical support. Aligning with Production Cycles: Coordinate with departments to align with production cycles. Best Practices Preventive and Predictive Maintenance: Focus on regular inspections, part replacements, and system upgrades. Effective Communication: Inform stakeholders in advance about the maintenance schedule. Proper Planning: Use historical data and insights to identify the best time slots for maintenance. Backup Plan: Document rollback or roll-forward procedures in case of failure. Following are some important considerations: Minimizing Disruption: A backup plan minimizes disruptions during planned maintenance, especially for VMs that may shut down or reboot. Ensuring Data Integrity: It protects against data loss or corruption by backing up critical data beforehand. 
Facilitating Quick Recovery: It allows for quick recovery if issues arise, maintaining business continuity and minimizing downtime.
Current Configuration Backup: Back up the configuration of the ExpressRoute Gateway, the ExpressRoute Gateway connection, and the routing table associated with the gateway (if any). Here are the PowerShell commands that can be used to back up the ExpressRoute Gateway configuration.
Review the gateway migration article: About migrating to an availability zone-enabled ExpressRoute virtual network gateway - Azure ExpressRoute | Microsoft Learn
Be ready to open a Microsoft Support ticket (optional/proactive): In corner cases where the migration encounters a blocker, be ready with the necessary details to open a Microsoft support ticket. In the ticket, provide the maintenance plan to the support engineer and make sure they are fully informed about your environment-specific configuration.

Pre-Migration Testing
Connectivity Tests: Run network reachability tests to validate the current state. Some sample tests are as follows:
ICMP test from an on-premises virtual machine to an Azure virtual machine to test basic connectivity. Ping from an on-premises virtual machine to an Azure virtual machine.
$ ping <Azure-Virtual-Machine-IP>
Application access test: Access your workload application from on-premises to a service running in Azure. This depends on the customer application. For example, if it is a web application, access the web server from a browser on a laptop or an on-premises machine.
Latency and throughput tests: You can use the Azure Connectivity Toolkit (ACT) to test latency and throughput. Please refer to this link for installation details. Troubleshoot network link performance: Azure ExpressRoute | Microsoft Learn
$ Get-LinkPerformance -RemoteHost 10.0.0.1 -TestSeconds 10
To test jitter and packet loss, you can use the following tools.
PSPing: psping -l 1024 -n 100 <Azure_VM_IP>:443
PathPing: pathping <Azure VM IP>
Capture the results from the above tests so you can compare them after the migration.
"iperf" is another tool widely used for throughput and latency testing. A web-based latency tool works fine as well: https://www.azurespeed.com/
Test the whole ExpressRoute Gateway migration process in a lower environment (optional): In other words, migrate an ExpressRoute Gateway in a non-production environment first.

Advance Notification
Send an email to the relevant stakeholders and impacted users/teams a few weeks in advance. Send a final notification to the same group a day before.

Stop IOs on hybrid private endpoints
Using private endpoints in Azure over a hybrid connection with ExpressRoute provides a secure, reliable, and high-performance connection to Azure services. By leveraging ExpressRoute's private peering and connectivity models, you can ensure that your traffic remains within the Microsoft global network, avoiding public internet exposure. This setup is ideal for scenarios requiring high security, consistent performance, and seamless integration between on-premises and Azure environments.
Private endpoints (PEs) in a virtual network connected over ExpressRoute private peering might experience a connectivity outage during migration. To avoid this, stop all IOs over hybrid private endpoints.

Validate that you have enough IP addresses for migration
To proceed with migration, a /27 prefix or longer is required in the GatewaySubnet. The migration feature checks for enough address space during the validation phase. If there aren’t enough IP addresses available to create the zone-redundant ExpressRoute Gateway, the gateway migration script will add an additional prefix to the subnet. As a user you don’t have to take any action. The migration feature will tell you if it needs more IPs.

Migration Steps
Migration using Azure portal
Step 1: Test connectivity from on-premises to Azure via ExpressRoute Gateway.
Refer to Step 7.
Step 2: Verify that the Microsoft Azure support engineer is on standby.
Step 3: Send an email to notify users about the start of the planned connectivity outage.
Step 4: Stop or minimize IOs over the ExpressRoute circuit (downtime). Minimizing the IOs will reduce the impact.
Step 5: Follow the document below to migrate the ExpressRoute gateway using the Azure portal.
Migrate to an availability zone-enabled ExpressRoute virtual network gateway in Azure portal - Azure ExpressRoute | Microsoft Learn
Step 6: Restart IOs over the ExpressRoute circuit.
Step 7: Validate and test post-migration connectivity.
Verify BGP Peering:
Get-AzExpressRouteCircuitPeering -ResourceGroupName <RG> -CircuitName <CircuitName>
Route Propagation Check:
Get-AzExpressRouteCircuitRouteTable -ResourceGroupName <RG> -ExpressRouteCircuitName <CircuitName> -PeeringType AzurePrivatePeering
Connectivity Tests: Run network reachability tests to validate the current state. Some sample tests are as follows:
ICMP test from an on-premises virtual machine to an Azure virtual machine to test basic connectivity. Ping from an on-premises virtual machine to an Azure virtual machine.
$ ping <Azure-Virtual-Machine-IP>
Application access test: Access your workload application from on-premises to a service running in Azure. This depends on the customer application. For example, if it is a web application, access the web server from a browser on a laptop or an on-premises machine.
Latency and throughput tests: You can use the Azure Connectivity Toolkit (ACT) to test latency and throughput. Please refer to this link for installation details. Troubleshoot network link performance: Azure ExpressRoute | Microsoft Learn
$ Get-LinkPerformance -RemoteHost 10.0.0.1 -TestSeconds 10
To test jitter and packet loss, you can use the following tools.
PSPing: psping -l 1024 -n 100 <Azure_VM_IP>:443
PathPing: pathping <Azure VM IP>
Compare the new results with the ones captured before the outage.
Validate that the migration is successful.
ExpressRoute Gateway is migrated to the new SKU.

Migration using PowerShell
Step 1: Test connectivity from on-premises to Azure via ExpressRoute Gateway. Refer to Step 7.
Step 2: Verify that the Microsoft Azure support engineer is on standby.
Step 3: Send an email to notify users about the start of the planned connectivity outage.
Step 4: Stop or minimize IOs over the ExpressRoute circuit (downtime). Minimizing the IOs will reduce the impact.
Step 5: Follow the document below to migrate the ExpressRoute gateway using PowerShell.
Migrate to an availability zone-enabled ExpressRoute virtual network gateway using PowerShell - Azure ExpressRoute | Microsoft Learn
Step 6: Restart IOs over the ExpressRoute circuit.
Step 7: Validate and test post-migration connectivity.
Verify BGP Peering:
Get-AzExpressRouteCircuitPeering -ResourceGroupName <RG> -CircuitName <CircuitName>
Route Propagation Check:
Get-AzExpressRouteCircuitRouteTable -ResourceGroupName <RG> -ExpressRouteCircuitName <CircuitName> -PeeringType AzurePrivatePeering
Connectivity Tests: Run network reachability tests to validate the current state. Some sample tests are as follows:
ICMP test from an on-premises virtual machine to an Azure virtual machine to test basic connectivity. Ping from an on-premises virtual machine to an Azure virtual machine.
$ ping <Azure-Virtual-Machine-IP>
Application access test: Access your workload application from on-premises to a service running in Azure. This depends on the customer application. For example, if it is a web application, access the web server from a browser on a laptop or an on-premises machine.
Latency and throughput tests: You can use the Azure Connectivity Toolkit (ACT) to test latency and throughput. Please refer to this link for installation details. Troubleshoot network link performance: Azure ExpressRoute | Microsoft Learn
$ Get-LinkPerformance -RemoteHost 10.0.0.1 -TestSeconds 10
To test jitter and packet loss, you can use the following tools.
PSPing: psping -l 1024 -n 100 <Azure_VM_IP>:443
PathPing: pathping <Azure VM IP>
Compare the new results with the ones captured before the outage.
Validate that the migration is successful. ExpressRoute Gateway is migrated to the new SKU.

Rollback Plan
If any issue arises during migration, work with the Microsoft support engineer to:
Restore Previous Gateway: Use the backed-up configuration to either restore the original gateway or create a new one, based on guidance from the support engineer.
Validate Connectivity: Perform on-premises to Azure connectivity testing as described in Step 7 above.

Post-Migration Steps
Update Change Request: Document and close the CR.
Update CMDB: Reflect the new gateway details in the Configuration Management Database.
Stakeholder Sign-off: Ensure all teams validate and approve the changes.

Contact Information
Network Team:
Azure Support: Azure Support Portal

References
Azure ExpressRoute Gateway Migration Documentation
Install Azure PowerShell with PowerShellGet | Microsoft Learn

Azure ExpressRoute Direct: A Comprehensive Overview
What is ExpressRoute
Azure ExpressRoute allows you to extend your on-premises network into the Microsoft cloud over a private connection made possible through a connectivity provider. With ExpressRoute, you can establish connections to Microsoft cloud services, such as Microsoft Azure and Microsoft 365.
ExpressRoute allows you to create a connection between your on-premises network and the Microsoft cloud in four different ways: CloudExchange Colocation, Point-to-point Ethernet Connection, Any-to-any (IPVPN) Connection, and ExpressRoute Direct.
ExpressRoute Direct gives you the ability to connect directly into the Microsoft global network at peering locations strategically distributed around the world. ExpressRoute Direct provides dual 100-Gbps or 10-Gbps connectivity that supports active-active connectivity at scale.

Why ExpressRoute Direct Is Becoming the Preferred Choice for Customers
ExpressRoute Direct with ExpressRoute Local – Free Egress: ExpressRoute Direct includes ExpressRoute Local, which allows private connectivity to Azure services within the same metro or peering location. This setup is particularly cost-effective because egress (outbound) data transfer is free, regardless of whether you're on a metered or unlimited data plan. By avoiding Microsoft's global backbone, ExpressRoute Local offers high-speed, low-latency connections for regionally co-located workloads without incurring additional data transfer charges.
Dual Port Architecture: Both ExpressRoute Direct and the service provider model feature a dual-port architecture, with two physical fiber pairs connected to separate Microsoft router ports and configured in an active/active BGP setup that distributes traffic across both links simultaneously for redundancy and improved throughput. What sets Microsoft apart is making this level of resiliency standard, not optional.
Forward-thinking customers in regions like Sydney take it even further by deploying ExpressRoute Direct across multiple colocation facilities, for example placing one port pair in Equinix SY2 and another in NextDC S1, creating four connections across two geographically separate sites. This design protects against facility-level outages from power failures, natural disasters, or accidental infrastructure damage, ensuring business continuity for organizations where downtime is simply not an option.
When Geography Limits Your Options: Not every region offers facility diversity. For example, New Zealand has only one ExpressRoute peering location, so businesses needing geographic redundancy must connect to Sydney, incurring Auckland-to-Sydney link costs but gaining critical diversity to mitigate outages. While ExpressRoute’s dual ports provide active/active redundancy, both are on the same Microsoft edge, so true disaster recovery requires using Sydney’s edge. ExpressRoute Direct scales from basic dual-port setups to multi-facility deployments and offers another advantage: free data transfer within the same geopolitical region. Once traffic enters Microsoft’s network, New Zealand customers can move data between Azure services across the trans-Tasman link without per-GB fees, with Microsoft absorbing those costs.
Premium SKU: Global Reach: Azure ExpressRoute Direct with the Premium SKU enables Global Reach, allowing private connectivity between your on-premises networks across different geographic regions through Microsoft's global backbone. This means you can link ExpressRoute circuits in different countries or continents, facilitating secure and high-performance data exchange between global offices or data centers. The Premium SKU extends the capabilities of ExpressRoute Direct by supporting cross-region connectivity, increased route limits, and access to more Azure regions, making it ideal for multinational enterprises with distributed infrastructure.
MACsec: Defense in Depth and Enterprise Security: ExpressRoute Direct uniquely supports MACsec (IEEE 802.1AE) encryption at the data-link layer, allowing your router and Microsoft's router to establish encrypted communication even within the colocation facility. This optional feature provides additional security for compliance-sensitive workloads in banking or government environments.
High-Performance Data Transfer for the Enterprise: Azure ExpressRoute Direct enables ultra-fast and secure data transfer between on-premises infrastructure and Azure by offering dedicated bandwidth of 10 to 100 Gbps. This high-speed connectivity is ideal for large-scale data movement scenarios such as AI workloads, backup, and disaster recovery. It ensures consistent performance, low latency, and enhanced reliability, making it well-suited for hybrid and multicloud environments that require frequent or time-sensitive data synchronization.
FastPath Support: Azure ExpressRoute Direct now supports FastPath for Private Endpoints and Private Link, enabling low-latency, high-throughput connections by bypassing the virtual network gateway. This feature is available only with ExpressRoute Direct circuits (10 Gbps or 100 Gbps) and is in limited general availability. While a gateway is still needed for route exchange, traffic flows directly once FastPath is enabled.
Supported gateway

ExpressRoute Direct Setup Workflow
Before provisioning ExpressRoute Direct resources, proper planning is essential. Key considerations for connectivity include understanding the two connectivity patterns available for ExpressRoute Direct from the customer edge to Microsoft Enterprise Edge (MSEE).
Option 1: Colocation of Customer Equipment: This is a common pattern where the customer racks their network device (edge router) in the same third-party data center facility that houses Microsoft's networking gear (e.g., Equinix or NextDC).
They install their router or firewall there and then order a short cross-connect from their cage to Microsoft's cage in that facility. The cross-connect is simply a fiber cable run through the facility's patch panel connecting the two parties. This direct colocation approach has the advantage of a single, highly efficient physical link (no intermediate hops) between the customer and Microsoft, completing the layer-1 connectivity in one step.
Option 2: Using a Carrier/Exchange Provider: If the customer prefers not to move hardware into a new facility (due to cost or complexity), they can leverage a provider that already has presence in the relevant colocation. In this case, the customer connects from their data center to the provider's network, and the provider extends connectivity into the Microsoft peering location. For instance, the customer could contract with Megaport or a local telco to carry traffic from their on-premises location into Megaport's equipment, and Megaport in turn handles the cross-connection to Microsoft in the target facility. The conversation cited that the customer had already set up connections to Megaport in their data center. Using an exchange can simplify logistics since the provider arranges the cross-connect and often provides an LOA on the customer's behalf. It may also be more cost-effective where the customer's location is far from any Microsoft peering site.
Many enterprises find that placing equipment in a well-connected colocation facility works best for their needs. Banks and large organizations have successfully taken this approach, such as placing routers in Equinix Sydney or NextDC Sydney to establish a direct fiber link to Azure. However, we understand that not every organization wants the capital expense or complexity of managing physical equipment in a new location.
For those situations, using a cloud exchange like Megaport offers a practical alternative that still delivers the dedicated connectivity you're looking for, while letting someone else handle the infrastructure management.
Once the decision on the connectivity pattern is made, the next step is to provision ExpressRoute Direct ports and establish the physical link:

Step 1: Provision ExpressRoute Direct Ports
Through the Azure portal (or CLI), the customer creates an ExpressRoute Direct resource. The customer must select an appropriate peering location, which corresponds to the colocation facility housing Azure's routers. For example, the customer would select the specific facility (such as "Vocus Auckland" or "Equinix Sydney SY2") where they intend to connect. The customer also chooses the port bandwidth (either 10 Gbps or 100 Gbps) and the encapsulation type (Dot1Q or QinQ) during this setup. Azure then allocates two ports on two separate Microsoft devices in that location, essentially giving the customer a primary and a secondary interface for redundancy, so that no single point of failure affects their connectivity.
Critical considerations to keep in mind during this step:
Encapsulation: When configuring ExpressRoute Direct ports, the customer must choose an encapsulation method. Dot1Q (802.1Q) uses a single VLAN tag for the circuit, whereas Q-in-Q (802.1ad) uses stacked VLAN tags (an outer S-Tag and an inner C-Tag). Q-in-Q allows multiple circuits on one physical port with overlapping customer VLAN IDs because Azure assigns a unique outer tag per circuit (making it ideal if the customer needs several ExpressRoute circuits on the same port). Dot1Q, by contrast, requires each VLAN ID to be unique across all circuits on the port, and is often used if the equipment doesn’t support Q-in-Q. (Most modern deployments prefer Q-in-Q for flexibility.)
Capacity Planning: This offering allows customers to overprovision and utilize 20 Gbps of capacity.
Design for 10 Gbps with redundancy, not 20 Gbps total capacity. During Microsoft's monthly maintenance windows, one port may go offline, and your network must handle this seamlessly.

Step 2: Generate Letter of Authorization
After the ExpressRoute Direct resource is created, Microsoft generates a Letter of Authorization (LOA). The LOA is a document (often a PDF) that authorizes the data center operator to connect a specific Microsoft port to the designated port. It includes details like the facility name, patch panel identifier, and port numbers on Microsoft's side. If co-locating your own gear, you will also obtain a corresponding LOA from the facility for your port (or simply indicate your port details on the cross-connect order form). If a provider like Megaport is involved, that provider will generate an LOA for their port as well. Two LOAs are typically needed, one for Microsoft's ports and one for the other party's ports, which are then submitted to the facility to execute the cross-connect.

Step 3: Complete Cross-Connect with the Data Center Provider
Using the LOAs, the data center’s technicians will perform the cross-connection in the meet-me room (MMR). At this point, the physical fiber link is established between the Microsoft router and the customer (or provider) equipment. The link goes through a patch panel in the MMR rather than a direct cable between cages, for security and manageability. After patching, the circuit is in place but typically kept “administratively down” until ready.
Critical considerations to keep in mind during this step:
When port allocation conflicts occur, engage Microsoft Support rather than recreating resources. They coordinate with colocation providers to resolve conflicts or issue new LOAs.

Step 4: Change Admin Status of Each Link
Once the cross-connect is physically completed, you can head into Azure's portal and flip the Admin State of each ExpressRoute Direct link to "Enabled."
This action lights up the optical interface on Microsoft's side and starts your billing meter running, so you'll want to make sure everything is working properly first. The great thing is that Azure gives you visibility into the health of your fiber connection through optical power metrics. You can check the receive light levels right in the portal: a healthy connection should show power readings somewhere between -1 dBm and -9 dBm, which indicates a strong fiber signal. If you're seeing readings outside this range, or worse, no light at all, that's a red flag pointing to a potential issue like a mis-patch or a faulty fiber connector. There was a real case where someone had a bad fiber connector that was caught because the light levels were too low, and the facility had to come back and re-patch the connection. So this optical power check is really your first line of defence: once you see good light levels within the acceptable range, you know your physical layer is solid and you're ready to move on to the next steps.
Critical considerations to keep in mind during this step:
Proactive Monitoring: Set up alerts for BGP session failures and optical power thresholds. Link failures might not immediately impact users but require quick restoration to maintain full redundancy.
At this stage, you've successfully navigated the physical infrastructure challenge: the ExpressRoute Direct port pair is provisioned, fiber cross-connects are in place, and those critical optical power levels are showing healthy readings. Essentially, a private physical highway directly connecting your network edge to Microsoft's backbone infrastructure has been built.

Step 5: Create ExpressRoute Circuits
ExpressRoute circuits represent the logical layer that transforms your physical ExpressRoute Direct ports into functional network connections.
Through the Azure portal, organizations create circuit resources linked to their ExpressRoute Direct infrastructure, specifying bandwidth requirements and selecting the appropriate SKU (Local, Standard, or Premium) based on connectivity needs. A key advantage is the ability to provision multiple circuits on the same physical port pair, provided aggregate bandwidth stays within physical limits. For example, an organization with 10 Gbps ExpressRoute Direct might run a 1 Gbps non-production circuit alongside a 5 Gbps production circuit on the same infrastructure. Azure handles the technical complexity through automatic VLAN management:

Step 6: Establish Peering
Once your ExpressRoute circuit is created and VLAN connectivity is established, the next crucial step involves setting up BGP (Border Gateway Protocol) sessions between your network and Microsoft's infrastructure. ExpressRoute supports two primary BGP peering types: Private Peering for accessing Azure Virtual Networks and Microsoft Peering for reaching Microsoft SaaS services like Office 365 and Azure PaaS offerings. For most enterprise scenarios connecting data centers to Azure workloads, Private Peering becomes the focal point.
Azure provides specific BGP IP addresses for your circuit configuration, defining /30 subnets for both primary and secondary link peering, which you'll configure on your edge router to exchange routing information. The typical flow involves your organization advertising on-premises network prefixes while Azure advertises VNet prefixes through these BGP sessions, creating dynamic route discovery between your environments. Importantly, both primary and secondary links maintain active BGP sessions, ensuring that if one connection fails, the secondary BGP session seamlessly maintains connectivity and keeps your network resilient against single points of failure.
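To make the /30 peering subnets concrete, the sketch below splits a /30 into its two usable host addresses. It assumes the addressing convention documented for ExpressRoute private peering, in which the customer router takes the first usable address and the Microsoft router takes the second. The function name and the example subnets are hypothetical and only illustrate the arithmetic; substitute the subnets configured on your own circuit.

```python
import ipaddress


def peering_addresses(subnet: str) -> dict:
    """Return the two usable host addresses of an IPv4 /30 peering subnet.

    Assumption: the customer router uses the first usable address and the
    Microsoft router uses the second.
    """
    # strict=True rejects input whose host bits are set (e.g. 192.168.15.17/30).
    network = ipaddress.ip_network(subnet, strict=True)
    if network.version != 4 or network.prefixlen != 30:
        raise ValueError(f"{subnet} is not an IPv4 /30 subnet")
    # A /30 has exactly two usable hosts once network and broadcast are excluded.
    first, second = network.hosts()
    return {"customer_router": str(first), "microsoft_router": str(second)}


# Hypothetical example subnets for the primary and secondary links.
primary = peering_addresses("192.168.15.16/30")
secondary = peering_addresses("192.168.15.20/30")
print("primary:", primary)
print("secondary:", secondary)
```

Running the same check against the addresses configured on your edge router is a quick way to catch a swapped or off-by-one peer IP before troubleshooting the BGP session itself.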

Step 7: Routing and Testing
Once BGP sessions are established, your ExpressRoute circuit becomes fully operational, seamlessly extending your on-premises network into Azure virtual networks. Connectivity testing with ping, traceroute, and application traffic confirms that your on-premises servers can now communicate directly with Azure VMs through the private ExpressRoute path, bypassing the public internet entirely. The traffic remains completely isolated to your circuit via VLAN tags, ensuring no intermingling with other tenants while delivering the low latency and predictable performance that only dedicated connectivity can provide.
At the end of this stage, the customer’s data center is linked to Azure at layer 3 via a private, resilient connection. They can access Azure resources as if they were on the same LAN extension, with low latency and high throughput. All that remains is to connect this circuit to the relevant Azure virtual networks (via an ExpressRoute Gateway) and verify end-to-end application traffic.
Step-by-step instructions are available below:
Configure Azure ExpressRoute Direct using the Azure portal | Microsoft Learn
Azure ExpressRoute: Configure ExpressRoute Direct | Microsoft Learn
Azure ExpressRoute: Configure ExpressRoute Direct: CLI | Microsoft Learn