# Azure Front Door: Resiliency Series – Part 2: Faster recovery (RTO)
In Part 1 of this blog series, we outlined our four-pillar strategy for resiliency in Azure Front Door: configuration resiliency, data plane resiliency, tenant isolation, and accelerated Recovery Time Objective (RTO). Together, these pillars help Azure Front Door remain continuously available and resilient at global scale.

Part 1 focused on the first two pillars: configuration and data plane resiliency. Our goal is to make configuration propagation safer, so incompatible changes never escape pre-production environments. We discussed how incompatible configurations are blocked early, and how data plane resiliency ensures the system continues serving traffic from a last-known-good (LKG) configuration even if a bad change manages to propagate. We also introduced ‘Food Taster’, a dedicated sacrificial process running in each edge server’s data plane that pretests every configuration change in isolation before it ever reaches the live data plane.

In this post, we turn to the recovery pillar. We describe the key enhancements we have made to the Azure Front Door recovery path so the system can return to full operation in a predictable and bounded timeframe. For a global service like Azure Front Door, serving hundreds of thousands of tenants across 210+ edge sites worldwide, we set an explicit target: to be able to recover any edge site – or all edge sites – within approximately 10 minutes, even in worst-case scenarios. In typical data plane crash scenarios, we expect recovery in under a second.

## Repair status

The first blog post in this series mentioned the two Azure Front Door incidents from October 2025 – learn more by watching our Azure Incident Retrospective session recordings for the October 9 incident and/or the October 29 incident. Before diving into our platform investments for improving our Recovery Time Objective (RTO), we wanted to provide a quick update on the overall repair items from these incidents.
We are pleased to report that the work on configuration propagation and data plane resiliency is now complete and fully deployed across the platform (in the table below, “Completed” means broadly deployed in production). With this, we have reduced configuration propagation latency from ~45 minutes to ~20 minutes. We anticipate reducing this even further, to ~15 minutes by the end of April 2026, while ensuring that platform stability remains our top priority.

| Learning category | Goal | Repairs | Status |
|---|---|---|---|
| Safe customer configuration deployment | Incompatible configuration never propagates beyond EUAP or canary regions | Control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state | Completed |
| Data plane resiliency | Configuration processing cannot impact data plane availability | Manage data-plane lifecycle to prevent outages caused by configuration-processing defects | Completed |
| | | Isolated work process in every data plane server to process and load the configuration | Completed |
| 100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the batch of critical services impacted in the October 29 outage, running on a day-old configuration | Completed |
| | | Phase 2: Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services | March 2026 |
| Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed |
| | | Accelerate recovery time to < 10 minutes | April 2026 |
| Tenant isolation | No configuration or traffic regression can impact other tenants | Micro-cellular Azure Front Door with ingress layered shards | June 2026 |

## Why recovery at edge scale is deceptively hard

To understand why recovery took as long as it did, it helps to first understand how the Azure Front Door data plane processes configuration. Azure Front Door operates in 210+ edge sites with multiple servers per site. The data plane of each edge server hosts multiple processes: a master process orchestrates the lifecycle of multiple worker processes that serve customer traffic. A separate configuration translator process runs alongside the data plane processes and is responsible for converting customer configuration bundles from the control plane into optimized binary FlatBuffer files. This translation step, covering hundreds of thousands of tenants, represents hours of cumulative computation. A per-server cache is kept locally at each edge server to enable fast recovery of the data plane, if needed.

Once the configuration translator process produces these FlatBuffer files, each worker processes them independently and memory-maps them for zero-copy access. Configuration updates flow through a two-phase commit: new FlatBuffers are first loaded into a staging area and validated, then atomically swapped into production maps.
In-flight requests continue using the old configuration until the last request referencing it completes.

Data plane recovery is designed to be resilient to different failure modes. A failure or crash at the worker process level has a typical recovery time of less than one second. Since each server has multiple worker processes serving customer traffic, this type of crash has no impact on the data plane. In the case of a master process crash, the system automatically tries to recover using the local cache. When the local cache is reused, the system recovers relatively quickly, in approximately 60 minutes, since most of the configurations in the cache were already loaded into the data plane before the crash. However, if the cache becomes unavailable or must be invalidated because of corruption, the recovery time increases significantly.

During the October 29 incident, a data plane crash triggered a complete recovery sequence that took approximately 4.5 hours. This was not because restarting a process is slow; it was because a defect in the recovery process invalidated the local cache, which meant that “restart” meant rebuilding everything from scratch. The configuration translator process then had to re-fetch and re-translate every one of the hundreds of thousands of customer configurations before workers could memory-map them and begin serving traffic.

This experience crystallized three fundamental learnings about our recovery path:

- Expensive rework: A subset of crashes discarded all previously translated FlatBuffer artifacts, forcing the configuration translator process to repeat hours of conversion work that had already been validated and stored.
- High restart costs: Every worker on every node had to wait for the configuration translator process to complete the full translation before it could memory-map any configuration and begin serving requests.
- Unbounded recovery time: Recovery time grew linearly with total tenant footprint rather than with active traffic, creating a ‘scale penalty’ as more tenants onboarded to the system.

Separately and together, the insight was clear: recovery must stop being proportional to the total configuration size.

## Persisting ‘validated configurations’ across restarts

One of the key recovery improvements was strengthening how validated customer configurations are cached and reused across failures, rather than rebuilding configuration state from scratch during recovery. Azure Front Door already cached customer configurations on host-mounted storage prior to the October incidents. The platform enhancements after the outage focused on making the local configuration cache resilient to crashes, partial failures, and bad tenant inputs. Our goal was to ensure that recovery behavior is dominated by serving traffic safely, not by reconstructing configuration state. This led us to two explicit design goals.

### Design goals

- No category of crash should invalidate the configuration cache: Configuration cache invalidation must never be the default response to failures. Whether the failure is a worker crash, master crash, data plane restart, or coordinated recovery action, previously validated customer configurations should remain usable unless there is a proven reason to discard them.
- Bad tenant configuration must not poison the entire cache: A single faulty or incompatible tenant configuration should result in targeted eviction of that tenant’s configuration only, not wholesale cache invalidation across all tenants.

### Platform enhancements

Previously, customer configurations persisted to host-mounted storage, but certain failure paths treated the cache as unsafe and invalidated it entirely. In those cases, recovery implicitly meant reloading and reprocessing configuration for hundreds of thousands of tenants before traffic could resume, even though the vast majority of cached data was still valid.
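As context for the changes described next, the two-phase staging/swap flow from the previous section can be sketched roughly as follows. This is a minimal illustration; the class, method names, and locking scheme are hypothetical stand-ins, not Azure Front Door internals:

```python
import threading

class WorkerConfigMaps:
    """Illustrative two-phase config commit: load into a staging area,
    validate, then atomically swap into the production map."""

    def __init__(self):
        self.production = {}           # tenant -> active config (here: bytes)
        self._lock = threading.Lock()

    def apply_update(self, tenant, flatbuffer, validate):
        staging = flatbuffer           # phase 1: load into staging
        if not validate(staging):      # ...and validate before promotion
            raise ValueError(f"rejected update for {tenant}")
        with self._lock:               # phase 2: atomic swap into production
            new_map = dict(self.production)
            new_map[tenant] = staging
            self.production = new_map  # readers of the old map are unaffected

maps = WorkerConfigMaps()
maps.apply_update("contoso", b"config-v2", validate=lambda b: len(b) > 0)
print(maps.production)  # {'contoso': b'config-v2'}
```

In-flight requests that captured a reference to the old map keep using it until they complete, which mirrors the behavior described above: a rejected update never reaches production, and a committed one appears atomically.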
We changed the recovery model to avoid invalidating customer configurations, with strict scoping around when and how cached entries are discarded:

- Cached configurations are no longer invalidated based on crash type. Failures are assumed to be orthogonal to configuration correctness unless explicitly proven otherwise.
- Cache eviction is granular and tenant-scoped. If a cached configuration fails validation or load checks, only that tenant’s configuration is discarded and reloaded. All other tenant configurations remain available.

This ensures that recovery does not regress into a fleet-wide rebuild due to localized or unrelated faults.

### Safety and correctness

Durability is paired with strong correctness controls to prevent unsafe configurations from being served:

- Per-tenant validation on load: Each cached tenant configuration is validated during the ‘load and verification’ phase, before being promoted for traffic serving. Failures are therefore contained to that tenant.
- Targeted re-translation: When validation fails, only the affected tenant’s configuration is reloaded or reprocessed; the cache for all other tenants is left untouched.
- Operational escape hatch: Operators retain the ability to explicitly instruct a clean rebuild of the configuration cache (with proper authorization), preserving control without compromising the default fast-recovery path.

### Resulting behavior

With these changes, recovery behavior now aligns with real-world traffic patterns: configuration defects impact tenants locally and predictably, rather than globally. The system now prefers isolated tenant impact and continued service using last-known-good configuration over aggressive invalidation, both of which are critical for predictable recovery at the scale of Azure Front Door.
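The tenant-scoped eviction policy described above can be sketched in a few lines. This is an illustrative model, not Azure Front Door code; the names (`ConfigCache`, `recover`, `retranslate`) and the boolean validity flag standing in for real load checks are all hypothetical:

```python
from dataclasses import dataclass

@dataclass
class TenantConfig:
    tenant_id: str
    blob: bytes
    valid: bool  # stand-in for a real validation/load check

class ConfigCache:
    """Illustrative per-tenant cache: eviction is always tenant-scoped."""

    def __init__(self, entries):
        self.entries = {e.tenant_id: e for e in entries}

    def recover(self, retranslate):
        """Reload the cache after a crash. Only tenants whose cached
        configuration fails validation are re-translated; everything
        else is reused as-is, regardless of what kind of crash occurred."""
        loaded, evicted = {}, []
        for tenant_id, cfg in self.entries.items():
            if cfg.valid:                       # per-tenant validation on load
                loaded[tenant_id] = cfg
            else:                               # targeted, tenant-scoped eviction
                evicted.append(tenant_id)
                loaded[tenant_id] = retranslate(tenant_id)
        self.entries = loaded
        return evicted

# One bad tenant does not poison the cache for the other two.
cache = ConfigCache([
    TenantConfig("contoso", b"...", valid=True),
    TenantConfig("fabrikam", b"...", valid=False),
    TenantConfig("adatum", b"...", valid=True),
])
evicted = cache.recover(lambda t: TenantConfig(t, b"rebuilt", valid=True))
print(evicted)  # ['fabrikam']
```

The key property is that `recover` never throws the whole cache away: the cost of recovery is proportional to the number of broken entries, not the number of tenants.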
## Making recovery scale with active traffic, not total tenants

Reusing the configuration cache solves the problem of rebuilding configuration in its entirety, but even with a warm cache, the original startup path had a second bottleneck: eagerly loading a large volume of tenant configurations into memory before serving any traffic. At our scale, memory-mapping and parsing hundreds of thousands of FlatBuffers, constructing internal lookup maps, and adding Transport Layer Security (TLS) certificates and configuration blocks for each tenant collectively added almost an hour to startup time, even though a majority of those tenants had no active traffic at that moment.

We addressed this by fundamentally changing when configuration is loaded into workers. Rather than eagerly loading most tenants at startup across all edge locations, Azure Front Door now uses a Machine Learning (ML)-optimized lazy loading model. In the new architecture, instead of loading a large number of tenant configurations, we only load the small subset of tenants known to be historically active in a given site; we call this the “warm tenants” list. The warm tenants list for each edge site is created through a sophisticated traffic analysis pipeline that leverages ML.

However, loading the warm tenants alone is not enough, because when a request arrives and we don’t have the configuration in memory, we need to answer two questions: is this a request for a real Azure Front Door tenant, and, if it is, where can the configuration be found? To answer these questions, each worker maintains a hostmap that tracks the state of each tenant’s configuration. The hostmap is constructed during startup as we process each tenant configuration: if the tenant is in the warm list, we process and load its configuration fully; if not, we simply add an entry to the hostmap mapping all of its domain names to the configuration path location.
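The hostmap and warm-list behavior described above can be sketched as follows. This is a hypothetical simplification (the `Worker` class, `hostmap` shape, and `load_config` callback are illustrative, not Azure Front Door internals):

```python
class Worker:
    """Sketch of warm-tenant eager loading plus on-demand cold loading."""

    def __init__(self, all_tenants, warm_tenants, load_config):
        self._load = load_config
        self.configs = {}   # fully loaded, memory-mapped configs
        self.hostmap = {}   # domain -> (tenant, config path), for EVERY tenant
        for tenant, (domains, path) in all_tenants.items():
            for d in domains:
                self.hostmap[d] = (tenant, path)
            if tenant in warm_tenants:        # eager load only for warm tenants
                self.configs[tenant] = load_config(path)

    def handle(self, host):
        if host not in self.hostmap:
            return "404: not an Azure Front Door tenant"
        tenant, path = self.hostmap[host]
        if tenant not in self.configs:        # cold tenant: load on first request
            self.configs[tenant] = self._load(path)
        return f"served {host} with config {self.configs[tenant]}"

tenants = {
    "contoso":  (["www.contoso.com"],  "/cache/contoso.fb"),
    "fabrikam": (["shop.fabrikam.io"], "/cache/fabrikam.fb"),
}
w = Worker(tenants, warm_tenants={"contoso"}, load_config=lambda p: p)
print(len(w.configs))         # 1: only the warm tenant is loaded at startup
w.handle("shop.fabrikam.io")  # cold tenant loaded on demand
print(len(w.configs))         # 2
```

Startup cost is proportional to the warm list plus a cheap hostmap entry per tenant, while unknown hosts are rejected immediately because the hostmap still covers every tenant.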
When a request arrives for one of these tenants, the worker loads and validates that tenant’s configuration on demand and immediately begins serving traffic. This allows a node to start serving its busiest tenants within a few minutes of startup, while additional tenants are loaded incrementally only when traffic actually arrives, letting the system progressively absorb cold tenants as demand increases.

The effect on recovery is transformative. Instead of recovery time scaling with the total number of tenants configured on a server, it scales with the number of tenants actively receiving traffic. In practice, even at our busiest edge sites, the active tenant set is a small fraction of the total.

Just as importantly, this modified form of lazy loading provides a natural failure isolation boundary. Most edge sites won’t ever load a faulty configuration of an inactive tenant, and when a request for an inactive tenant with an incompatible configuration arrives, impact is contained to a single worker. The configuration load architecture now prefers serving as many customers as quickly as possible, rather than waiting until everything is ready before serving anyone.

The above changes are slated to complete in April 2026 and will bring our RTO from the current ~1 hour to under 10 minutes for complete recovery from a worst-case scenario.

## Continuous validation through Game Days

A critical element of our recovery confidence comes from GameDay fault-injection testing. We don’t simply design recovery mechanisms and assume they work; we break the system deliberately and observe how it responds. Since late 2025, we have conducted recurring GameDay drills that simulate the exact failure scenarios we are defending against:

- Food Taster crash scenarios: Injecting deliberately faulty tenant configurations to verify that they are caught and isolated with zero impact on live traffic. In our January 2026 GameDay, the Food Taster process crashed as expected, the system halted the update within approximately 5 seconds, and no customer traffic was affected.
- Master process crash scenarios: Triggering master process crashes across test environments to verify that workers continue serving traffic, that the Local Config Shield engages within 10 seconds, and that the coordinated recovery tool restores full operation within the expected timeframe.
- Multi-region failure drills: Simulating simultaneous failures across multiple regions to validate that global Config Shield mechanisms engage correctly, and that recovery procedures scale without requiring manual per-region intervention.
- Fallback test drills for critical Azure services running behind Azure Front Door: In our February 2026 GameDay, we simulated the complete unavailability of Azure Front Door and successfully validated failover for critical Azure services with no impact to traffic.

These drills have both surfaced corner cases and built operational confidence. They have transformed recovery from a theoretical plan into tested, repeatable muscle memory. As we noted in an internal communication to our team: “Game day testing is a deliberate shift from assuming resilience to actively proving it—turning reliability into an observed and repeatable outcome.”

## Closing

Part 1 of this series emphasized preventing unsafe configurations from reaching the data plane, and data plane resiliency in case an incompatible configuration reaches production. This post has shown that prevention alone is not enough: when failures do occur, recovery must be fast, predictable, and bounded. By ensuring that the FlatBuffer cache is not invalidated by default, by loading only active tenants, and by building safe coordinated recovery tooling, we have transformed failure handling from a fleet-wide crisis into a controlled operation. These recovery investments work in concert with the prevention mechanisms described in Part 1.
Together, they ensure that the path from incident detection to full service restoration is measured in minutes, with customer traffic protected at every step.

In the next post of this series, we will cover the third pillar of our resiliency strategy: tenant isolation. We will look at how micro-cellular architecture and ingress-layered sharding can reduce the blast radius of any failure to a small subset of tenants, ensuring that one customer’s configuration or traffic anomaly never becomes everyone’s problem.

We deeply value our customers’ trust in Azure Front Door. We are committed to transparently sharing our progress on these resiliency investments, and to exceeding expectations for safety, reliability, and operational readiness.

# Microsoft Azure scales Hollow Core Fiber (HCF) production through outsourced manufacturing
## Introduction

As cloud and AI workloads surge, the pressure on datacenter (DC), Metro, and Wide Area Network (WAN) networks has never been greater. Microsoft is tackling the physical limits of traditional networking head-on. From pioneering research in microLED technologies to deploying Hollow Core Fiber (HCF) at global scale, Microsoft is reimagining connectivity to power the next era of cloud networking.

Azure’s HCF journey has been one of relentless innovation, collaboration, and a vision to redefine the physical layer of the cloud. Microsoft’s HCF, based on the proprietary Double Nested Antiresonant Nodeless Fiber (DNANF) design, delivers up to 47% faster data transmission and approximately 33% lower latency compared to conventional Single Mode Fiber (SMF), bringing significant advantages to the network that powers Azure.

Today, Microsoft is announcing a major milestone: the industrial scale-up of HCF production, powered by new strategic manufacturing collaborations with Corning Incorporated (Corning) and Heraeus Covantics (Heraeus). These collaborations will enable Azure to increase global HCF production to meet the demands of its growing network infrastructure, advancing the performance and reliability customers expect for cloud and AI workloads.

## Real-world benefits for Azure customers

Since 2023, Microsoft has deployed HCF across multiple Azure regions, with production links meeting performance and reliability targets. As manufacturing scales, Azure plans to expand deployment of the full end-to-end HCF network solution to help increase capacity, resiliency, and speed for customers, with the potential to set new benchmarks for latency and efficiency in fiber infrastructure.

### Why it matters

Microsoft’s proprietary HCF design brings the following improvements for Azure customers:

- Increased data transmission speeds, with up to 33% lower latency.
- Enhanced signal performance that improves data transmission quality for customers.
- Improved optical efficiency, resulting in higher bandwidth rates compared to conventional fiber.

### How Microsoft is making it possible

To operationalize HCF across Azure with production-grade performance, Microsoft is:

- Deploying a standardized HCF solution with end-to-end systems and components for operational efficiency, streamlined network management, and reliable connectivity across Azure’s infrastructure.
- Ensuring interoperability with standard SMF environments, enabling seamless integration with existing optical infrastructure in the network for faster deployment and scalable growth.
- Creating a multinational production supply chain to scale next-generation fiber production, ensuring the volumes and speed to market needed for widespread HCF deployment across the Azure network.

## Scaling up and out

With Corning and Heraeus as Microsoft’s first HCF manufacturing collaborators, Azure plans to accelerate deployment to meet surging demand for high-performance connectivity. These collaborations underscore Microsoft’s commitment to enhancing its global infrastructure and delivering a reliable customer experience. They also reinforce Azure’s continued investment in deploying HCF, with a vision for this technology to potentially set the global benchmark for high-capacity fiber innovation.

> “This milestone marks a new chapter in reimagining the cloud’s physical layer. Our collaborations with Corning and Heraeus establish a resilient, global HCF supply chain so Azure can deliver a standardized, world-class customer experience with ultra-low latency and high reliability for modern AI and cloud workloads.” – Jamie Gaudette, Partner Cloud Network Engineering Manager at Microsoft

To scale HCF production, Microsoft will utilize Corning’s established U.S. facilities, while Heraeus will produce out of its sites in both Europe and the U.S.
> "Corning is excited to expand our longtime collaboration with Microsoft, leveraging Corning’s fiber and cable manufacturing facilities in North Carolina to accelerate the production of Microsoft's Hollow Core Fiber. This collaboration not only strengthens our existing relationship but also underscores our commitment to advancing U.S. leadership in AI innovation and infrastructure. By working closely with Microsoft, we are poised to deliver solutions that meet the demands of AI workloads, setting new benchmarks for speed and efficiency in fiber infrastructure." – Mike O'Day, Senior Vice President and General Manager, Corning Optical Communications

> “We started our work on HCF a decade ago, teamed up with the Optoelectronics Research Centre (ORC) at the University of Southampton and then with Lumenisity prior to its acquisition. Now, we are excited to continue working with Microsoft on shaping the datacom industry. With leading solutions in glass, tube, preform, and fiber manufacturing, we are ready to scale this disruptive HCF technology to significant volumes. We’ll leverage our proven track record of taking glass and fiber innovations from the lab to widespread adoption, just as we did in the telecom industry, where approximately 2 billion kilometers of fiber are made using Heraeus products.” – Dr. Jan Vydra, Executive Vice President Fiber Optics, Heraeus Covantics

Azure engineers are working alongside Corning and Heraeus to operationalize Microsoft manufacturing process intellectual property (IP), deliver targeted training programs, and drive the yield, metrology, and reliability improvements required for scaled production.
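As a rough sanity check on the performance figures quoted earlier (up to 47% faster transmission, approximately 33% lower latency): light in a hollow air core travels at close to the vacuum speed of light, while in solid silica it is slowed by the material's group index. The index values below are textbook approximations, not Microsoft's measured numbers:

```python
C = 299_792.458           # speed of light in vacuum, km/s
n_smf, n_hcf = 1.46, 1.0  # approximate group indices: silica core vs. air core

v_smf, v_hcf = C / n_smf, C / n_hcf
speedup = v_hcf / v_smf - 1        # how much faster signals travel in HCF
latency_cut = 1 - v_smf / v_hcf    # how much lower the latency is

print(f"{speedup:.0%} faster")             # ~46% faster
print(f"{latency_cut:.0%} lower latency")  # ~32% lower latency

# Over a 1,000 km route, one-way propagation time drops from roughly
# 4.87 ms (SMF) to roughly 3.34 ms (HCF).
d_km = 1000
print(d_km / v_smf * 1000, d_km / v_hcf * 1000)  # milliseconds
```

With a slightly higher assumed group index for SMF the numbers land on the "up to 47% faster / ~33% lower latency" figures the post cites; the two percentages are the same physical ratio expressed from opposite directions.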
The collaborations are foundational to a growing standardized, global ecosystem that supports:

- Glass preform/tubing supply
- Fiber production at scale
- Cable and connectivity for deployment into carrier-grade environments

## Building on a foundation of innovation: Microsoft’s HCF program

In 2022, Microsoft acquired Lumenisity, a spin-out from the Optoelectronics Research Centre (ORC) at the University of Southampton, UK. That same year, Microsoft launched the world’s first state-of-the-art HCF fabrication facility in the UK to expand production and drive innovation. This purpose-built site continues to support long-term HCF research, prototyping, and testing, ensuring that Azure remains at the forefront of HCF technology. Working with industry leaders, Microsoft has developed an end-to-end ecosystem of the components, equipment, and HCF-specific hardware necessary for deployment, successfully proven in production deployments and operations.

## Pushing the boundaries: recent breakthrough research

Today, the University of Southampton announced a landmark achievement in optical communications: in collaboration with Azure Fiber researchers, they have demonstrated the lowest signal loss ever recorded for optical fibers (<0.1 dB/km) using research-grade DNANF HCF technology (see figure 4). This breakthrough, detailed in a research paper published in Nature Photonics earlier this month, paves the way for a potential revolution in the field, enabling unprecedented data transmission capacities and longer unamplified spans.

*Figure 4: Fiber attenuation records at around 1550 nm – [1] Nagayama et al., [2] Sato et al., [3] research-grade DNANF HCF, Petrovich et al.*

This breakthrough highlights the potential for this technology to transform global internet infrastructure and DC connectivity. Expected benefits include:

- Faster: Approximately 47% faster transmission, reducing latency and powering real-time AI inference, cloud gaming, and other interactive workloads.
- More capacity: A wider optical spectrum window enabling exponentially greater bandwidth.
- Future-ready: Lays the groundwork for quantum-safe links, quantum computing infrastructure, advanced sensing, and remote laser delivery.

## Looking ahead: Unlocking the future of cloud networking

The future of cloud networking is being built today! With record-breaking [3] fiber innovations, a rapidly expanding collaborative ecosystem, and the industrialized scale to deliver next-generation performance, Azure continues to evolve to meet the demands for speed, reliability, and connectivity. As we accelerate the deployment of HCF across our global network, we’re not just keeping pace with the demands of AI and cloud, we’re redefining what’s possible.

References:

[1] Nagayama, K., Kakui, M., Matsui, M., Saitoh, T. & Chigusa, Y. Ultra-low-loss (0.1484 dB/km) pure silica core fibre and extension of transmission distance. Electron. Lett. 38, 1168–1169 (2002).

[2] Sato, S., Kawaguchi, Y., Sakuma, H., Haruna, T. & Hasegawa, T. Record low loss optical fiber with 0.1397 dB/km. In Proc. Optical Fiber Communication Conference (OFC) 2024 Tu2E.1 (Optica Publishing Group, 2024).

[3] Petrovich, M., Numkam Fokoua, E., Chen, Y., Sakr, H., Isa Adamu, A., Hassan, R., Wu, D., Fatobene Ando, R., Papadimopoulos, A., Sandoghchi, S., Jasion, G., & Poletti, F. Broadband optical fibre with an attenuation lower than 0.1 decibel per kilometre. Nat. Photon. (2025). https://doi.org/10.1038/s41566-025-01747-5

Useful Links:

- The Deployment of Hollow Core Fiber (HCF) in Azure’s Network
- How hollow core fiber is accelerating AI | Microsoft Azure Blog
- Learn more about Microsoft global infrastructure

# June 2025 Recap: Azure Database for PostgreSQL
Hello Azure Community,

We have introduced a range of exciting new features and updates to Azure Database for PostgreSQL in June. From the general availability of PostgreSQL 17 to the public preview of the SSD v2 storage tier for High Availability, there have been significant feature announcements across multiple areas in the last month. Stay tuned as we dive deeper into each of these feature updates. Before that, let’s look at the POSETTE 2025 highlights.

## POSETTE 2025 Highlights

We hosted POSETTE: An Event for Postgres 2025 in June! This year marked our 4th annual event, featuring 45 speakers and a total of 42 talks. PostgreSQL developers, contributors, and community members came together to share insights on topics covering everything from AI-powered applications to deep dives into PostgreSQL internals. If you missed it, you can catch up by watching the POSETTE livestream sessions. If this conference sounds interesting to you and you want to be part of it next year, don’t forget to subscribe to POSETTE news.

## Feature Highlights

- General Availability of PostgreSQL 17 with 'in-place' upgrade support
- General Availability of Online Migration
- Migration service support for PostgreSQL 17
- Public Preview of SSD v2 High Availability
- New Region: Indonesia Central
- VS Code Extension for PostgreSQL enhancements
- Enhanced role management
- Ansible collection released for latest REST API version

### General Availability of PostgreSQL 17 with 'in-place' upgrade support

PostgreSQL 17 is now generally available on Azure Database for PostgreSQL flexible server, bringing key community innovations to your workloads. You’ll see faster vacuum operations, richer JSON processing, smarter query planning (including better join ordering and parallel execution), dynamic logical replication controls, and enhanced security and audit-logging features, backed by Azure’s five-year support policy.
You can easily upgrade to PostgreSQL 17 using the in-place major version upgrade feature, available through the Azure portal and CLI, without changing server endpoints or reconfiguring applications. The process includes built-in validations and rollback safety to help ensure a smooth and reliable upgrade experience. For more details, read the PostgreSQL 17 release announcement blog.

### General Availability of Online Migration

We're excited to announce that Online Migration is now generally available in the Migration service for Azure Database for PostgreSQL! Online migration minimizes downtime by keeping your source database operational during the migration process, with continuous data synchronization until cutover. This is particularly beneficial for mission-critical applications that require minimal downtime during migration. This milestone brings production-ready online migration capabilities supporting various source environments, including on-premises PostgreSQL, Azure VMs, Amazon RDS, Amazon Aurora, and Google Cloud SQL. For detailed information about the capabilities and how to get started, visit our Migration service documentation.

### Migration service support for PostgreSQL 17

Building on our PostgreSQL 17 general availability announcement, the Migration service for Azure Database for PostgreSQL now fully supports PostgreSQL 17. This means you can seamlessly migrate your existing PostgreSQL instances from various source platforms to Azure Database for PostgreSQL flexible server running PostgreSQL 17. With this support, organizations can take advantage of the latest PostgreSQL 17 features and performance improvements while leveraging our online migration capabilities for minimal-downtime transitions. The migration service maintains full compatibility with PostgreSQL 17's enhanced security features, improved query planning, and other community innovations.
### Public Preview of SSD v2 High Availability

We’re excited to announce the public preview of High Availability (HA) support for the Premium SSD v2 storage tier in Azure Database for PostgreSQL flexible server. This support allows you to enable zone-redundant HA with Premium SSD v2 during server deployment. In addition to high availability on SSD v2, you now get improved resiliency and ~10-second failover times when using Premium SSD v2 with zone-redundant HA, helping you build resilient, high-performance PostgreSQL applications with minimal overhead. This feature is particularly well-suited for mission-critical workloads, including those in financial services, real-time analytics, retail, and multi-tenant SaaS platforms.

Key benefits of Premium SSD v2:

- Flexible disk sizing: Scale from 32 GiB to 64 TiB in 1-GiB increments.
- Fast failovers: Planned or unplanned failovers typically complete in around 10 seconds.
- Independent performance configuration: Achieve up to 80,000 IOPS and 1,200 MB/s throughput without resizing your disk.
- Baseline performance: Free throughput of 125 MB/s and 3,000 IOPS for disks up to 399 GiB, and 500 MB/s and 12,000 IOPS for disks of 400 GiB and above, at no additional cost.

For more details, please refer to the Premium SSD v2 HA blog.

### New Region: Indonesia Central

New region rollout! Azure Database for PostgreSQL flexible server is now available in Indonesia Central, giving customers in and around the region lower latency and data residency options. This continues our mission to bring Azure PostgreSQL closer to where you build and run your apps. For the full list of regions, visit: Azure Database for PostgreSQL Regions.

### VS Code Extension for PostgreSQL enhancements

The brand-new VS Code extension for PostgreSQL launched in mid-May and has already garnered over 122K installs from the Visual Studio Marketplace! The kickoff blog about this new IDE for PostgreSQL in VS Code has had over 150K views.
This extension makes it easier for developers to seamlessly interact with PostgreSQL databases. We are committed to making this experience better and have introduced several enhancements to improve reliability and compatibility. You now have better control over service restarts and process terminations on supported operating systems. Additionally, we have added support for parsing additional connection-string formats in the "Create Connection" flow, making it more flexible and user-friendly. We also resolved Entra token-fetching failures for newly created accounts, ensuring a smoother onboarding experience. On the feature front, you can now leverage Entra security groups and guest accounts across multiple tenants when establishing new connections, streamlining permission management in complex Entra environments. Don't forget to update to the latest version in the marketplace to take advantage of these enhancements, and visit our GitHub repository to learn more about this month's release. If you learn best by video, these two videos are a great way to learn more about this new VS Code extension: POSETTE 2025: Introducing Microsoft's VS Code Extension for PostgreSQL, and Demo of using the VS Code extension for PostgreSQL.

Enhanced role management

With the introduction of PostgreSQL 16, a strict role hierarchy has been implemented. As a result, GRANT statements that were functional in PostgreSQL 11-15 may no longer work in PostgreSQL 16. We have improved administrative flexibility and addressed this limitation in Azure Database for PostgreSQL flexible server across all PostgreSQL versions. Members of 'azure_pg_admin' can now manage and access objects owned by any non-restricted role, giving them control and permissions over user-defined roles. To learn more about this improvement, please refer to our documentation on roles.
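To illustrate the kind of statement this affects, here is a small SQL sketch. The role names are hypothetical, and the behavior is summarized from the text: a member of azure_pg_admin can now manage membership and objects of non-restricted user-defined roles without owning them.

```sql
-- Hypothetical roles; run while connected as a member of azure_pg_admin.
CREATE ROLE app_owner NOLOGIN;
CREATE ROLE app_reader LOGIN PASSWORD 'change-me';

-- In stock PostgreSQL 16, granting membership in a role you do not
-- administer can fail; on flexible server, azure_pg_admin members can
-- manage non-restricted user-defined roles such as app_owner.
GRANT app_owner TO app_reader;
GRANT SELECT ON ALL TABLES IN SCHEMA public TO app_reader;
```

This is a sketch of the scenario rather than a definitive recipe; consult the roles documentation referenced above for the exact rules.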
Ansible collection released for latest REST API version

A new version of the Ansible collection for Azure Database for PostgreSQL flexible server has been released. Version 3.6.0 now includes the latest GA REST API features. This update introduces several enhancements, such as support for virtual endpoints, on-demand backups, system-assigned identity, storage auto-grow, and seamless switchover of read replicas to a new site (Read Replicas - Switchover), among many other improvements. To get started, visit the flexible server Ansible collection link.

Azure Postgres Learning Bytes 🎓

Using the PostgreSQL VS Code extension with agent mode

The VS Code extension for PostgreSQL has been trending amongst the developer community. In this month's Learning Bytes section, we want to share how to enable the extension and use GitHub Copilot in Agent Mode to create a database, add dummy data, and visualize it.

Step 1: Download the VS Code extension for PostgreSQL.
Step 2: Check that GitHub Copilot and agent mode are enabled. Go to File -> Preferences -> Settings (Ctrl + ,). Search for and enable "chat.agent.enabled" and "pgsql copilot.enable". Reload VS Code to apply the changes.
Step 3: Connect to Azure Database for PostgreSQL. Use the extension to enter instance details and establish a connection. Create and view schemas under Databases -> Schemas.
Step 4: Visualize and populate data. Right-click the database to visualize schemas. Ask the agent to insert dummy data or run queries.

Conclusion

That's all for the June 2025 feature updates! We are dedicated to continuously improving Azure Database for PostgreSQL with every release. Stay up to date with our latest features by following this link. Your feedback is important and helps us continue to improve. If you have any suggestions, ideas, or questions, we'd love to hear from you.
Share your thoughts here: aka.ms/pgfeedback. We look forward to bringing you even more exciting updates throughout the year. Stay tuned!

Network Redundancy Between AVS, On-Premises, and Virtual Networks in a Multi-Region Design
By Mays_Algebary and shruthi_nair

Establishing redundant network connectivity is vital to ensuring the availability, reliability, and performance of workloads operating in hybrid and cloud environments. Proper planning and implementation of network redundancy are key to achieving high availability and sustaining operational continuity. This article focuses on network redundancy in multi-region architectures. For details on the single-region design, refer to this blog.

The diagram below illustrates a common network design pattern for multi-region deployments, using either a Hub-and-Spoke or Azure Virtual WAN (vWAN) topology, and serves as the baseline for establishing redundant connectivity throughout this article. In each region, the Hub or Virtual Hub (VHub) extends Azure connectivity to Azure VMware Solution (AVS) via an ExpressRoute circuit. The regional Hub/VHub is connected to on-premises environments by cross-connecting (bowtie) both local and remote ExpressRoute circuits, ensuring redundancy. The concept of weight, used to influence traffic routing preferences, is discussed in the next section. The diagram below illustrates the traffic flow when both circuits are up and running.

Design Considerations

If a region loses its local ExpressRoute connection, AVS in that region will lose connectivity to the on-premises environment. However, VNets will still retain connectivity to on-premises via the remote region's ExpressRoute circuit. The solutions discussed in this article aim to ensure redundancy for both AVS and VNets.

Looking at the diagram above, you might wonder: why do we need to set weights at all, and why do the AVS-ER connections (1b/2b) use the same weight as the primary on-premises connections (1a/2a)? Weight is used to influence routing decisions and ensure optimal traffic flow. In this scenario, both ExpressRoute circuits, ER1-EastUS and ER2-WestUS, advertise the same prefixes to the Azure ExpressRoute gateway.
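To make the weight mechanism concrete, here is a toy model (illustrative Python, not Azure code) of how a gateway chooses among connections advertising the same prefix: the highest weight wins, and ties are load-balanced across all tied connections.

```python
# Toy model of ExpressRoute gateway path selection for one prefix:
# highest connection weight wins; equal weights are ECMP load-balanced.
def select_paths(paths):
    """paths: list of (connection_name, weight) tuples for the same prefix."""
    top = max(weight for _, weight in paths)
    return [name for name, weight in paths if weight == top]

# Local circuit weighted higher than the remote bowtie circuit:
print(select_paths([("ER1-EastUS", 100), ("ER2-WestUS", 0)]))
# -> ['ER1-EastUS']

# No weights set: traffic is ECMPed across both circuits:
print(select_paths([("ER1-EastUS", 0), ("ER2-WestUS", 0)]))
# -> ['ER1-EastUS', 'ER2-WestUS']
```

Real BGP best-path selection evaluates more attributes; this model only captures the weight-versus-ECMP behavior discussed here.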
As a result, traffic from the VNet to on-premises would be ECMPed across both circuits. To avoid suboptimal routing and ensure that traffic from the VNets prefers the local ExpressRoute circuit, a higher weight is assigned to the local path. It's also critical that the ExpressRoute gateway connections to on-premises (1a/2a) and to AVS (1b/2b) are assigned the same weight. Otherwise, traffic from the VNet to AVS will follow a less efficient route, as AVS routes are also learned over ER1-EastUS via Global Reach. For instance, VNets in EastUS would connect to AVS EUS through the ER1-EastUS circuit via Global Reach (as shown by the blue dotted line), instead of using the direct local path (orange line). This suboptimal routing is illustrated in the diagram below.

Now let us look at the solutions that can achieve redundant connectivity. The following solutions apply to both Hub-and-Spoke and vWAN topologies unless noted otherwise.

Note: The diagrams in the upcoming solutions focus only on illustrating the failover traffic flow.

Solution 1: Network Redundancy via ExpressRoute in a Different Peering Location

In this solution, deploy an additional ExpressRoute circuit in a different peering location within the same metro area (e.g., ER–PeeringLocation2), and enable Global Reach between this new circuit and the existing AVS ExpressRoute (e.g., AVS-ER1). If you intend to use this second circuit as a failover path, prepend the on-premises prefixes advertised over it. Alternatively, if you want to use it as an active-active redundant path, do not prepend routes; in this case, both AVS and Azure VNets will use ECMP to distribute traffic across both circuits (e.g., ER1-EastUS and ER–PeeringLocation2) when both are available.

Note: Compared to the Standard Topology, this design removes both the ExpressRoute cross-connect (bowtie) and the weight settings.
When adding a second circuit in the same metro, there's no benefit in keeping them; otherwise, traffic from the Azure VNet will prefer the local AVS circuit (AVS-ER1/AVS-ER2) to reach on-premises due to the higher weight, as on-premises routes are also learned over the AVS circuit (AVS-ER1/AVS-ER2) via Global Reach. Also, when connecting the new circuit (e.g., ER–PeeringLocation2), remove all weight settings across the connections. Traffic will follow the optimal path based on BGP prepending on the new circuit, or load-balance (ECMP) if no prepend is applied.

Note: Use a public ASN to prepend the on-premises prefixes, as the AVS circuit (e.g., AVS-ER) will strip private ASNs toward AVS.

Solution Insights

Ideal for mission-critical applications, providing predictable throughput and bandwidth for backup. It could be cost-prohibitive depending on the bandwidth of the second circuit.

Solution 2: Network Redundancy via ExpressRoute Direct

In this solution, ExpressRoute Direct is used to provision multiple circuits from a single port pair in each region; for example, ER2-WestUS and ER4-WestUS are created from the same port pair. This allows you to dedicate one circuit for local traffic and another for failover to a remote region. To ensure optimal routing, prepend the on-premises prefixes using a public ASN on the newly created circuits (e.g., ER3-EastUS and ER4-WestUS). Remove all weight settings across the connections; traffic will follow the optimal path based on BGP prepending on the new circuits. For instance, if ER1-EastUS becomes unavailable, traffic from AVS and VNets in the EastUS region will automatically route through the ER4-WestUS circuit, ensuring continuity.

Note: Compared to the Standard Topology, this design connects the newly created ExpressRoute circuits (e.g., ER3-EastUS/ER4-WestUS) to the remote region's ExpressRoute gateway (black dotted lines) instead of having the bowtie to the primary circuits (e.g., ER1-EastUS/ER2-WestUS).
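Both solutions above use AS-path prepending to demote the backup circuit. A toy best-path comparison (illustrative only; real BGP evaluates more attributes than AS-path length) shows why the prepended circuit carries traffic only after the primary fails:

```python
# Toy BGP best-path selection on AS-path length alone (illustrative).
def best_path(routes):
    """routes: {circuit_name: as_path list} for one on-premises prefix."""
    return min(routes, key=lambda name: len(routes[name]))

AS_ONPREM = 2000  # placeholder public ASN (private ASNs are stripped toward AVS)
routes = {
    "ER1-EastUS": [AS_ONPREM],               # primary, no prepend
    "ER-PeeringLocation2": [AS_ONPREM] * 3,  # backup, prepended twice
}
print(best_path(routes))  # -> ER1-EastUS while the primary is up

del routes["ER1-EastUS"]  # simulate primary circuit failure (route withdrawn)
print(best_path(routes))  # -> ER-PeeringLocation2 (failover)
```

With no prepend on the backup circuit, the two AS paths would be equal length and traffic would be ECMPed instead, which is the active-active variant described in Solution 1.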
Solution Insights

Easy to implement if you have ExpressRoute Direct. ExpressRoute Direct supports over-provisioning, where you can create logical ExpressRoute circuits on top of your existing 10-Gbps or 100-Gbps ExpressRoute Direct resource, up to 20 Gbps or 200 Gbps of subscribed bandwidth. For example, you can create two 10-Gbps ExpressRoute circuits within a single 10-Gbps ExpressRoute Direct resource (port pair). Ideal for mission-critical applications, providing predictable throughput and bandwidth for backup.

Solution 3: Network Redundancy via ExpressRoute Metro

Metro ExpressRoute is a new configuration that enables dual-homed connectivity to two different peering locations within the same city. This setup enhances resiliency by allowing traffic to continue flowing even if one peering location goes down, using the same circuit.

Solution Insights

Higher resiliency: provides increased reliability with a single circuit. Limited regional availability: currently available in select regions, with more being added over time. Cost-effective: offers redundancy without significantly increasing costs.

Solution 4: Deploy VPN as a Backup to ExpressRoute

This solution mirrors Solution 1 of the single-region design but extends it to multiple regions. In this approach, a VPN serves as the backup path for each region in the event of an ExpressRoute failure. In a Hub-and-Spoke topology, a backup path to and from AVS can be established by deploying Azure Route Server (ARS) in the hub VNet. ARS enables seamless transit routing between ExpressRoute and the VPN gateway. In a vWAN topology, ARS is not required; the vHub's built-in routing service automatically provides transitive routing between the VPN gateway and ExpressRoute. In this design, you should not cross-connect ExpressRoute circuits (e.g., ER1-EastUS and ER2-WestUS) to the ExpressRoute gateways in the Hub VNets (e.g., Hub-EUS or Hub-WUS).
Doing so will lead to routing issues, where the Hub VNet programs only the on-premises routes learned via ExpressRoute. For instance, in the EastUS region, if the primary circuit (ER1-EastUS) goes down, Hub-EUS will receive on-premises routes from both the VPN tunnel and the remote ER2-WestUS circuit. However, it will prefer and program only the ExpressRoute-learned routes from the ER2-WestUS circuit. Since ExpressRoute gateways do not support route transitivity between circuits, AVS connected via AVS-ER will not receive the on-premises prefixes, resulting in routing failures.

Note: In a vWAN topology, to ensure optimal route convergence when failing back to ExpressRoute, you should prepend the prefixes advertised from on-premises over the VPN. Without route prepending, VNets may continue to use the VPN as the primary path to on-premises. If prepending is not an option, you can trigger the failover manually by bouncing the VPN tunnel.

Solution Insights

Cost-effective and straightforward to deploy. Increased latency: the VPN tunnel over the internet adds latency due to encryption overhead. Bandwidth considerations: multiple VPN tunnels might be needed to achieve bandwidth comparable to a high-capacity ExpressRoute circuit (e.g., over 1 Gbps); for details on VPN gateway SKUs and tunnel throughput, refer to this link. Because you can't cross-connect ExpressRoute circuits in this design, VNets will use the VPN for failover instead of leveraging the remote region's ExpressRoute circuit.

Solution 5: Network Redundancy with Multiple On-Premises Locations (Split-Prefix)

In many scenarios, customers advertise the same prefix from multiple on-premises locations to Azure. However, if the customer can split prefixes across different on-premises sites, it simplifies the implementation of a failover strategy using existing ExpressRoute circuits. In this design, each on-premises site advertises region-specific prefixes (e.g., 10.10.0.0/16 for EastUS and 10.70.0.0/16 for WestUS), along with a common supernet (e.g., 10.0.0.0/8).
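The split-prefix design relies on longest-prefix-match forwarding: the regional /16 draws traffic to the local circuit, while the /8 supernet stays in the table as the failover route. A minimal sketch with Python's ipaddress module, using the prefixes from the text (next hops simplified to circuit names):

```python
import ipaddress

# Longest-prefix-match lookup over a simple route table (illustrative).
def lookup(dest, routes):
    """routes: {prefix: circuit}; returns the circuit of the longest match."""
    addr = ipaddress.ip_address(dest)
    matches = [(ipaddress.ip_network(p), hop) for p, hop in routes.items()
               if addr in ipaddress.ip_network(p)]
    return max(matches, key=lambda m: m[0].prefixlen)[1]

routes = {
    "10.10.0.0/16": "ER1-EastUS",  # EastUS-specific prefix
    "10.70.0.0/16": "ER2-WestUS",  # WestUS-specific prefix
    "10.0.0.0/8":   "ER2-WestUS",  # common supernet (failover path)
}
print(lookup("10.10.1.5", routes))  # -> ER1-EastUS (most specific route)

# If ER1-EastUS fails, its /16 is withdrawn and the supernet takes over:
del routes["10.10.0.0/16"]
print(lookup("10.10.1.5", routes))  # -> ER2-WestUS via 10.0.0.0/8
```

In the real design the supernet is advertised from both sites; the single next hop here only models the view from EastUS after its local circuit fails.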
Under normal conditions, AVS and VNets in each region use longest-prefix match to route traffic efficiently to the appropriate on-premises location. If ER1-EastUS becomes unavailable, AVS and VNets in EastUS will automatically fail over to ER2-WestUS, routing traffic via the supernet prefix to maintain connectivity.

Solution Insights

Cost-effective: no additional deployment; uses existing ExpressRoute circuits. Advertising specific prefixes over each region might need additional planning. Ideal for mission-critical applications, providing predictable throughput and bandwidth for backup.

Solution 6: Prioritize Network Redundancy for One Region Over Another

If you're operating under budget constraints, can prioritize one region (such as hosting critical workloads in a single location), and want to continue using your existing ExpressRoute setup, this solution could be an ideal fit. In this design, assume AVS in EastUS (AVS-EUS) hosts the critical workloads. To ensure high availability, AVS-ER1 is configured with Global Reach connections to both the local ExpressRoute circuit (ER1-EastUS) and the remote circuit (ER2-WestUS). Make sure to prepend the on-premises prefixes advertised to ER2-WestUS using a public ASN to ensure optimal routing (no ECMP) from AVS-EUS over both circuits (ER1-EastUS and ER2-WestUS). On the other hand, AVS in WestUS (AVS-WUS) is connected via Global Reach only to its local region's ExpressRoute circuit (ER2-WestUS). If that circuit becomes unavailable, you can establish an on-demand Global Reach connection to ER1-EastUS, either manually or through automation (e.g., a triggered script). This approach introduces temporary downtime until the Global Reach link is established. You might be thinking: why not set up Global Reach between the AVS-WUS circuit and the remote region's circuits (like connecting AVS-ER2 to ER1-EastUS), just as we did for AVS-EUS? Because it would lead to suboptimal routing.
Due to AS-path prepending on ER2-WestUS, if both ER1-EastUS and ER2-WestUS were linked to AVS-ER2, traffic would favor the remote ER1-EastUS circuit since it presents a shorter AS path. As a result, traffic would bypass the local ER2-WestUS circuit, causing inefficient routing. That is why, for AVS-WUS, it's better to use on-demand Global Reach to ER1-EastUS as a backup path, enabled manually or via automation only when ER2-WestUS becomes unavailable.

Note: VNets will fail over via the local AVS circuit. For example, Hub-EUS will route to on-premises through AVS-ER1 and ER2-WestUS via the secondary Global Reach connection (purple line).

Solution Insights

Cost-effective. Workloads hosted in AVS within the non-critical region will experience downtime if the local region's ExpressRoute circuit becomes unavailable, until the on-demand Global Reach connection is established.

Conclusion

Each solution has its own advantages and considerations, such as cost-effectiveness, ease of implementation, and increased resiliency. By carefully planning and implementing these solutions, organizations can ensure operational continuity and optimal traffic routing in multi-region deployments.

Azure ExpressRoute Direct: A Comprehensive Overview
What is ExpressRoute?

Azure ExpressRoute allows you to extend your on-premises network into the Microsoft cloud over a private connection made possible through a connectivity provider. With ExpressRoute, you can establish connections to Microsoft cloud services such as Microsoft Azure and Microsoft 365. ExpressRoute lets you create a connection between your on-premises network and the Microsoft cloud in four different ways: CloudExchange colocation, point-to-point Ethernet connection, any-to-any (IPVPN) connection, and ExpressRoute Direct. ExpressRoute Direct gives you the ability to connect directly into the Microsoft global network at peering locations strategically distributed around the world. ExpressRoute Direct provides dual 100-Gbps or 10-Gbps connectivity that supports active-active connectivity at scale.

Why ExpressRoute Direct Is Becoming the Preferred Choice for Customers

ExpressRoute Direct with ExpressRoute Local – Free Egress: ExpressRoute Direct includes ExpressRoute Local, which allows private connectivity to Azure services within the same metro or peering location. This setup is particularly cost-effective because egress (outbound) data transfer is free, regardless of whether you're on a metered or unlimited data plan. By avoiding Microsoft's global backbone, ExpressRoute Local offers high-speed, low-latency connections for regionally co-located workloads without incurring additional data transfer charges.

Dual-Port Architecture: Both ExpressRoute Direct and the service provider model feature a dual-port architecture, with two physical fiber pairs connected to separate Microsoft router ports and configured in an active/active BGP setup that distributes traffic across both links simultaneously for redundancy and improved throughput. What sets Microsoft apart is making this level of resiliency standard, not optional.
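The dual-port behavior can be sketched as a toy flow-distribution model (illustrative Python, not Azure code): flows are balanced across both links while they are up, and collapse onto the survivor when one fails.

```python
# Toy model of the dual-port active/active design: traffic is spread across
# both links; if one goes down, the surviving link carries everything.
def distribute(flows, links_up):
    """Hash each flow onto one of the links that is currently up."""
    live = [link for link, up in links_up.items() if up]
    assert live, "both ports down: connectivity lost"
    return {flow: live[hash(flow) % len(live)] for flow in flows}

flows = ["flow-a", "flow-b", "flow-c", "flow-d"]
print(distribute(flows, {"port1": True, "port2": True}))   # spread across both
print(distribute(flows, {"port1": True, "port2": False}))  # all on port1
```

This is also why capacity planning (discussed later) assumes a single surviving port: the design must absorb the full load on one link during maintenance or failure.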
Forward-thinking customers in regions like Sydney take it even further by deploying ExpressRoute Direct across multiple colocation facilities, for example placing one port pair in Equinix SY2 and another in NextDC S1, creating four connections across two geographically separate sites. This design protects against facility-level outages from power failures, natural disasters, or accidental infrastructure damage, ensuring business continuity for organizations where downtime is simply not an option.

When Geography Limits Your Options: Not every region offers facility diversity. New Zealand, for example, has only one ExpressRoute peering location, so businesses needing geographic redundancy must connect to Sydney, incurring Auckland-to-Sydney link costs but gaining critical diversity to mitigate outages. While ExpressRoute's dual ports provide active/active redundancy, both are on the same Microsoft edge, so true disaster recovery requires using Sydney's edge. ExpressRoute Direct scales from basic dual-port setups to multi-facility deployments and offers another advantage: free data transfer within the same geopolitical region. Once traffic enters Microsoft's network, New Zealand customers can move data between Azure services across the trans-Tasman link without per-GB fees, with Microsoft absorbing those costs.

Premium SKU: Global Reach: Azure ExpressRoute Direct with the Premium SKU enables Global Reach, allowing private connectivity between your on-premises networks across different geographic regions through Microsoft's global backbone. This means you can link ExpressRoute circuits in different countries or continents, facilitating secure and high-performance data exchange between global offices or data centers. The Premium SKU extends the capabilities of ExpressRoute Direct by supporting cross-region connectivity, increased route limits, and access to more Azure regions, making it ideal for multinational enterprises with distributed infrastructure.
MACsec: Defense in Depth and Enterprise Security

ExpressRoute Direct uniquely supports MACsec (IEEE 802.1AE) encryption at the data-link layer, allowing your router and Microsoft's router to establish encrypted communication even within the colocation facility. This optional feature provides additional security for compliance-sensitive workloads in banking or government environments.

High-Performance Data Transfer for the Enterprise: Azure ExpressRoute Direct enables ultra-fast and secure data transfer between on-premises infrastructure and Azure by offering dedicated bandwidth of 10 to 100 Gbps. This high-speed connectivity is ideal for large-scale data movement scenarios such as AI workloads, backup, and disaster recovery. It ensures consistent performance, low latency, and enhanced reliability, making it well suited for hybrid and multicloud environments that require frequent or time-sensitive data synchronization.

FastPath Support: Azure ExpressRoute Direct now supports FastPath for Private Endpoints and Private Link, enabling low-latency, high-throughput connections by bypassing the virtual network gateway. This feature is available only with ExpressRoute Direct circuits (10 Gbps or 100 Gbps) and is in limited general availability. While a gateway is still needed for route exchange, traffic flows directly once FastPath is enabled.

ExpressRoute Direct Setup Workflow

Before provisioning ExpressRoute Direct resources, proper planning is essential. Key considerations include understanding the two connectivity patterns available for ExpressRoute Direct from the customer edge to the Microsoft Enterprise Edge (MSEE).

Option 1: Colocation of Customer Equipment: This is a common pattern where the customer racks their network device (edge router) in the same third-party data center facility that houses Microsoft's networking gear (e.g., Equinix or NextDC).
They install their router or firewall there and then order a short cross-connect from their cage to Microsoft's cage in that facility. The cross-connect is simply a fiber cable run through the facility's patch panel connecting the two parties. This direct colocation approach has the advantage of a single, highly efficient physical link (no intermediate hops) between the customer and Microsoft, completing the layer-1 connectivity in one step.

Option 2: Using a Carrier/Exchange Provider: If the customer prefers not to move hardware into a new facility (due to cost or complexity), they can leverage a provider that already has a presence in the relevant colocation facility. In this case, the customer connects from their data center to the provider's network, and the provider extends connectivity into the Microsoft peering location. For instance, the customer could contract with Megaport or a local telco to carry traffic from their on-premises location into Megaport's equipment, and Megaport in turn handles the cross-connection to Microsoft in the target facility. A customer may, for example, already have connections to Megaport in their data center. Using an exchange can simplify logistics, since the provider arranges the cross-connect and often provides an LOA on the customer's behalf. It may also be more cost-effective where the customer's location is far from any Microsoft peering site.

Many enterprises find that placing equipment in a well-connected colocation facility works best for their needs. Banks and large organizations have successfully taken this approach, such as placing routers in Equinix Sydney or NextDC Sydney to establish a direct fiber link to Azure. However, we understand that not every organization wants the capital expense or complexity of managing physical equipment in a new location.
For those situations, using a cloud exchange like Megaport offers a practical alternative that still delivers the dedicated connectivity you're looking for, while letting someone else handle the infrastructure management. Once the decision on the connectivity pattern is made, the next step is to provision ExpressRoute Direct ports and establish the physical link.

Step 1: Provision ExpressRoute Direct Ports

Through the Azure portal (or CLI), the customer creates an ExpressRoute Direct resource. The customer must select an appropriate peering location, which corresponds to the colocation facility housing Azure's routers. For example, the customer would select the specific facility (such as "Vocus Auckland" or "Equinix Sydney SY2") where they intend to connect. The customer also chooses the port bandwidth (either 10 Gbps or 100 Gbps) and the encapsulation type (Dot1Q or QinQ) during this setup. Azure then allocates two ports on two separate Microsoft devices in that location – essentially giving the customer a primary and secondary interface for redundancy, removing a single point of failure.

Critical considerations to keep in mind during this step:

Encapsulation: When configuring ExpressRoute Direct ports, the customer must choose an encapsulation method. Dot1Q (802.1Q) uses a single VLAN tag for the circuit, whereas Q-in-Q (802.1ad) uses stacked VLAN tags (an outer S-Tag and inner C-Tag). Q-in-Q allows multiple circuits on one physical port with overlapping customer VLAN IDs, because Azure assigns a unique outer tag per circuit (making it ideal if the customer needs several ExpressRoute circuits on the same port). Dot1Q, by contrast, requires each VLAN ID to be unique across all circuits on the port, and is often used if the equipment doesn't support Q-in-Q. (Most modern deployments prefer Q-in-Q for flexibility.)

Capacity Planning: This offering allows customers to overprovision and utilize 20 Gbps of capacity.
Design for 10 Gbps with redundancy, not 20 Gbps of total capacity. During Microsoft's monthly maintenance windows, one port may go offline, and your network must handle this seamlessly.

Step 2: Generate a Letter of Authorization

After the ExpressRoute Direct resource is created, Microsoft generates a Letter of Authorization (LOA). The LOA is a document (often a PDF) that authorizes the data center operator to connect a specific Microsoft port to the designated customer port. It includes details like the facility name, patch panel identifier, and port numbers on Microsoft's side. If co-locating your own gear, you will also obtain a corresponding LOA from the facility for your port (or simply indicate your port details on the cross-connect order form). If a provider like Megaport is involved, that provider will generate an LOA for their port as well. Two LOAs are typically needed – one for Microsoft's ports and one for the other party's ports – which are then submitted to the facility to execute the cross-connect.

Step 3: Complete the Cross-Connect with the Data Center Provider

Using the LOAs, the data center's technicians will perform the cross-connection in the meet-me room (MMR). At this point, the physical fiber link is established between the Microsoft router and the customer (or provider) equipment. The link goes through a patch panel in the MMR rather than a direct cable between cages, for security and manageability. After patching, the circuit is in place but typically kept "administratively down" until ready.

Critical considerations to keep in mind during this step: When port allocation conflicts occur, engage Microsoft Support rather than recreating resources. They coordinate with colocation providers to resolve conflicts or issue new LOAs.

Step 4: Change the Admin Status of Each Link

Once the cross-connect is physically completed, you can head into Azure's portal and flip the Admin State of each ExpressRoute Direct link to "Enabled."
This action lights up the optical interface on Microsoft's side and starts your billing meter running, so you'll want to make sure everything is working properly first. Azure gives you visibility into the health of your fiber connection through optical power metrics. You can check the receive light levels right in the portal: a healthy connection should show power readings somewhere between -1 dBm and -9 dBm, which indicates a strong fiber signal. If you're seeing readings outside this range or, worse, no light at all, that's a red flag pointing to a potential issue like a mis-patch or a faulty fiber connector. In one real case, a bad fiber connector was caught because the light levels were too low, and the facility had to come back and re-patch the connection. This optical power check is your first line of defense: once you see good light levels within the acceptable range, you know your physical layer is solid and you're ready to move on to the next steps.

Critical considerations to keep in mind during this step: Set up proactive monitoring with alerts for BGP session failures and optical power thresholds. Link failures might not immediately impact users but require quick restoration to maintain full redundancy.

At this stage, you've successfully navigated the physical infrastructure challenge: the ExpressRoute Direct port pair is provisioned, fiber cross-connects are in place, and those critical optical power levels are showing healthy readings. Essentially, the private physical highway directly connecting your network edge to Microsoft's backbone infrastructure has been built.

Step 5: Create ExpressRoute Circuits

ExpressRoute circuits represent the logical layer that transforms your physical ExpressRoute Direct ports into functional network connections.
Through the Azure portal, organizations create circuit resources linked to their ExpressRoute Direct infrastructure, specifying bandwidth requirements and selecting the appropriate SKU (Local, Standard, or Premium) based on connectivity needs. A key advantage is the ability to provision multiple circuits on the same physical port pair, provided aggregate bandwidth stays within physical limits. For example, an organization with 10-Gbps ExpressRoute Direct might run a 1-Gbps non-production circuit alongside a 5-Gbps production circuit on the same infrastructure. Azure handles the technical complexity through automatic VLAN management.

Step 6: Establish Peering

Once your ExpressRoute circuit is created and VLAN connectivity is established, the next crucial step is setting up BGP (Border Gateway Protocol) sessions between your network and Microsoft's infrastructure. ExpressRoute supports two primary BGP peering types: Private Peering for accessing Azure virtual networks, and Microsoft Peering for reaching Microsoft SaaS services like Office 365 and Azure PaaS offerings. For most enterprise scenarios connecting data centers to Azure workloads, Private Peering is the focal point. Azure provides specific BGP IP addresses for your circuit configuration, defining /30 subnets for both primary and secondary link peering, which you'll configure on your edge router to exchange routing information. The typical flow involves your organization advertising on-premises network prefixes while Azure advertises VNet prefixes through these BGP sessions, creating dynamic route discovery between your environments. Importantly, both primary and secondary links maintain active BGP sessions, ensuring that if one connection fails, the secondary BGP session seamlessly maintains connectivity and keeps your network resilient against single points of failure.
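As a sketch of the private-peering addressing, each /30 you supply yields exactly two usable addresses; by ExpressRoute convention the first is configured on your router and the second on Microsoft's. The example subnets below are hypothetical:

```python
import ipaddress

def peer_ips(subnet):
    """Split a /30 peering subnet into (customer_ip, microsoft_ip).
    Convention: first usable address = customer edge, second = Microsoft edge."""
    net = ipaddress.ip_network(subnet)
    assert net.prefixlen == 30, "ExpressRoute peering expects /30 subnets"
    first, second = list(net.hosts())  # /30 has exactly two usable hosts
    return str(first), str(second)

# Hypothetical primary and secondary /30s for one circuit's private peering:
print(peer_ips("192.168.10.0/30"))  # -> ('192.168.10.1', '192.168.10.2')
print(peer_ips("192.168.10.4/30"))  # -> ('192.168.10.5', '192.168.10.6')
```

You would configure the first address of each /30 on your edge router's primary and secondary subinterfaces and peer with the second address on each link.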
Step 7: Routing and Testing

Once BGP sessions are established, your ExpressRoute circuit becomes fully operational, seamlessly extending your on-premises network into Azure virtual networks. Connectivity testing with ping, traceroute, and application traffic confirms that your on-premises servers can now communicate directly with Azure VMs through the private ExpressRoute path, bypassing the public internet entirely. The traffic remains completely isolated to your circuit via VLAN tags, ensuring no intermingling with other tenants while delivering the low latency and predictable performance that only dedicated connectivity can provide. At the end of this stage, the customer's data center is linked to Azure at layer 3 via a private, resilient connection. They can access Azure resources as if they were on the same LAN extension, with low latency and high throughput. All that remains is to connect this circuit to the relevant Azure virtual networks (via an ExpressRoute Gateway) and verify end-to-end application traffic. Step-by-step instructions are available below:

Configure Azure ExpressRoute Direct using the Azure portal | Microsoft Learn
Azure ExpressRoute: Configure ExpressRoute Direct | Microsoft Learn
Azure ExpressRoute: Configure ExpressRoute Direct: CLI | Microsoft Learn

Mastering Azure at Scale: Why AVNM Is a Game-Changer for Network Management
In today’s dynamic cloud-first landscape, managing distributed and large-scale network infrastructures has become increasingly complex. As enterprises expand their digital footprint across multiple Azure subscriptions and regions, the demand for centralized, scalable, and policy-driven network governance becomes critical. Azure Virtual Network Manager (AVNM) emerges as a strategic solution enabling unified control, automation, and security enforcement across diverse environments.

Key Challenges in Large-Scale Network Management

Managing multiple VNets increases operational complexity
Decentralized security approaches may leave vulnerabilities
Validating network changes becomes more critical as VNets grow
IP address consumption requires efficient management solutions
Complex network topologies demand simplified peering strategies

What is Azure Virtual Network Manager (AVNM)?

AVNM is a centralized network management service in Azure that allows you to:

🌐 Group and manage virtual networks across multiple subscriptions and regions.
🔐 Apply security and connectivity configurations (like hub-and-spoke or mesh topologies) at scale.
⚙️ Automate network configuration using static or dynamic membership (via Azure Policy).
📊 Enforce organization-wide security rules that can override standard NSG rules.
🛠️ Deploy changes regionally with control over rollout sequence and frequency.

AVNM is especially useful for large enterprises managing complex, multi-region Azure environments, helping reduce operational overhead and improve consistency in network security and connectivity.

Use Cases of AVNM

1) Network Segmentation and Connectivity Features: AVNM allows network segmentation into development, production, testing, or team-based groups, and supports static and dynamic grouping. Connectivity configuration features include hub-and-spoke, mesh, and direct connectivity topologies, simplifying the creation and management of complex network topologies.
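As an illustration of the Azure Policy-driven dynamic membership mentioned above, a policy definition of roughly the following shape adds tagged VNets to a network group automatically. The tag name, tag value, and resource IDs here are hypothetical, and the exact `details` payload should be taken from the AVNM documentation rather than this sketch:

```json
{
  "if": {
    "allOf": [
      { "field": "type", "equals": "Microsoft.Network/virtualNetworks" },
      { "field": "tags['environment']", "equals": "production" }
    ]
  },
  "then": {
    "effect": "addToNetworkGroup",
    "details": {
      "networkGroupId": "/subscriptions/.../networkGroups/prod-vnets"
    }
  }
}
```

Because membership is condition-driven, any new VNet carrying the matching tag joins the group, and all of the group's connectivity and security configurations, without manual intervention.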
Network Segmentation Features:

Segment networks into Dev, Prod, Test, or team-based groups.
Group VNets and subnets at the subscription, management group, or tenant level.
Use static or dynamic grouping by name or tags: the Basic Editor edits ID/tag/resource group/subscription conditions through a GUI, while the Advanced Editor specifies flexible criteria with JSON.
Membership changes are reflected automatically in applied configurations.

Network Group Management: AVNM simplifies the management of network groups. For static membership you manually pick VNets; for dynamic membership, Azure Policy evaluates the conditions and updates the group automatically.

Connectivity Configuration Features:

Create different topologies with a few clicks: hub-and-spoke, mesh, and hub-and-spoke with direct connectivity.
Common use cases place gateways, firewalls, and other shared infrastructure in the hub, shared by the spoke virtual networks.
Direct connectivity means fewer hops and lower latency across regions, subscriptions, and tenants.

2) Security Configuration Features: Admin rules enforce organizational-level security policies, are applied to all resources in the desired network groups, and can overwrite conflicting rules. User rules, managed by AVNM, allow micro-segmentation and conflict-free rules with modularity. Security admin rules work with NSGs and are evaluated prior to NSG rules, ensuring consistent, high-priority network security policies globally.

Admin Rules and User Rules: Admin rules target network admins and central governance teams, applying security policies at a high level and automatically to new VMs. User rules, managed by AVNM, are designed for product/service teams, enabling micro-segmentation and modular, conflict-free rules.
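To make the evaluation order concrete, here is a deliberately simplified sketch (not AVNM's actual implementation) of how a security admin rule layer evaluated before NSG rules behaves, including the "always allow" case that bypasses NSGs entirely. The rule shapes and traffic labels are illustrative only:

```python
# Simplified model of rule evaluation order: security admin rules run
# first; "AlwaysAllow" and "Deny" are final, while "Allow" (or no match)
# defers the decision to the NSG layer.

def evaluate(traffic, admin_rules, nsg_rules):
    """Return True if the traffic is permitted."""
    action = admin_rules.get(traffic)
    if action == "AlwaysAllow":
        return True                    # bypasses NSG evaluation entirely
    if action == "Deny":
        return False                   # blocked before NSGs are consulted
    # "Allow" or no matching admin rule: fall through to NSG rules
    return nsg_rules.get(traffic, "Deny") == "Allow"

admin = {"ssh": "Deny", "monitoring": "AlwaysAllow"}
nsg = {"ssh": "Allow", "web": "Allow"}
print(evaluate("ssh", admin, nsg))         # False: admin Deny wins over NSG Allow
print(evaluate("monitoring", admin, nsg))  # True: AlwaysAllow skips NSGs
print(evaluate("web", admin, nsg))         # True: no admin rule, NSG allows
```

The key property the sketch shows is that a team-level NSG "Allow" can never override an organization-level "Deny", which is exactly the guard-rail behavior central governance teams rely on.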
Security Admin Rules and NSGs: Security admin rules are evaluated before NSG rules, ensuring high-priority network security. They can allow, deny, or always allow traffic, and both allowed and denied traffic can be viewed in VNet flow logs.

Comparison of Security Admin Rules and NSG Rules: Security admin rules, applied by network admins, enforce organizational security with higher priority. NSG rules, applied by individual teams, offer flexibility within those guard rails and operate with lower priority.

3) Virtual Network Verifier: AVNM's Virtual Network Verifier prevents errors through compliance checking, simplifies debugging, enhances network management, allows role-based access, and offers detailed analysis and reporting.

Virtual Network Verifier's Goal: Virtual Network Verifier assists with virtual network management by providing diagnostics, security compliance, and guard-rail checks across setups involving ExpressRoute, VNet connections, Virtual WAN, and SD-WAN branches with remote users.

Benefits of using AVNM Network Verifier: AVNM Network Verifier offers verification of network configurations, simplified debugging, enhanced network management, flexible and delegated access, and detailed analysis and reporting, scoped to the AVNM instance.

4) IPAM: AVNM IPAM creates pools for IP address planning, automatically assigns non-overlapping CIDRs, reserves IPs for specific demands, and prevents Azure address space from overlapping with on-premises or multi-cloud environments.
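The non-overlapping allocation behavior described above can be sketched with the standard library's `ipaddress` module. This is not AVNM's actual algorithm, just a minimal illustration of carving fresh /24 blocks out of a pool while skipping space already reserved for on-premises ranges or existing VNets:

```python
import ipaddress

# Simplified sketch of non-overlapping CIDR allocation from a pool,
# skipping ranges already reserved (on-premises or existing VNets).

def allocate(pool, reserved, new_prefix=24):
    """Return the first /new_prefix CIDR in pool that overlaps nothing in reserved."""
    taken = [ipaddress.ip_network(r) for r in reserved]
    for candidate in ipaddress.ip_network(pool).subnets(new_prefix=new_prefix):
        if not any(candidate.overlaps(t) for t in taken):
            return str(candidate)
    raise ValueError("pool exhausted")

reserved = ["10.0.0.0/24", "10.0.1.0/24"]   # e.g. on-prem range + existing VNet
print(allocate("10.0.0.0/16", reserved))    # 10.0.2.0/24: first free block
```

Centralizing this bookkeeping in one place is exactly what prevents the overlapping-CIDR mistakes that otherwise surface only when peering or hybrid routing breaks.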
Benefits of using AVNM IPAM: AVNM IPAM benefits include creating pools for IP address planning, automatically assigning non-overlapping CIDRs, reserving IPs for specific demands, preventing Azure address space from overlapping with on-premises or multi-cloud environments, and enforcing that users create VNets with no overlapping CIDRs.

View allocations for an IP address pool: Users can view allocation details for an IP address pool, including how many IPs are allocated to subnets and CIDR blocks, and how many IPs are consumed by Azure resources such as VMs or VM Scale Sets.

Historically, AVNM pricing was structured around Azure subscriptions, which often limited flexibility for customers managing diverse network scopes. With the latest update, pricing is now aligned to virtual networks, empowering customers with greater control and adaptability in how they adopt AVNM across their environments. https://learn.microsoft.com/en-us/azure/virtual-network-manager/overview#pricing

Azure Virtual Network Manager (AVNM) is a game-changer for network management, offering centralized control, enhanced security, and simplified connectivity configurations. Whether you're managing a small network or a complex multi-region environment, AVNM provides the tools and features needed to streamline operations and ensure robust network security.

The Deployment of Hollow Core Fiber (HCF) in Azure’s Network
Co-authors: Jamie Gaudette, Frank Rey, Tony Pearson, Russell Ellis, Chris Badgley and Arsalan Saljoghei In the evolving Cloud and AI landscape, Microsoft is deploying state-of-the-art Hollow Core Fiber (HCF) technology in Azure’s network to optimize infrastructure and enhance performance for customers. By deploying cabled HCF technology together with HCF-supportable datacenter (DC) equipment, this solution creates ultra-low latency traffic routes with faster data transmission to meet the demands of Cloud & AI workloads. The successful adoption of HCF technology in Azure’s network relies on developing a new ecosystem to take full advantage of the solution, including new cables, field splicing, installation and testing… and Microsoft has done exactly that. Azure has collaborated with industry leaders to deliver components and equipment, cable manufacturing and installation. These efforts, along with advancements in HCF technology, have paved the way for its deployment in-field. HCF is now operational and carrying live customer traffic in multiple Microsoft Azure regions, proving it is as reliable as conventional fiber with no field failures or outages. This article will explore the installation activities, testing, and link performance of a recent HCF deployment, showcasing the benefits that Azure customers can leverage from HCF technology. HCF connected Azure DCs are ready for service The latest HCF cable deployment connects two Azure DCs in a major city, with two metro routes each over 20km long. The hybrid cables both include 32 HCF and 48 single mode fiber (SMF) strands, with HCFs delivering high-capacity Dense Wavelength Division Multiplexing (DWDM) transmission comparable to SMF. The cables are installed over two diverse paths (the red and blue lines shown in image 1), each with different entry points into the DC. 
Route diversity at the physical layer enhances network resilience and reliability by allowing traffic to be rerouted through alternate paths, minimizing the risk of a network outage should there be a disruption. It also allows for increased capacity by distributing network traffic more evenly, improving overall network performance and operational efficiency.

Image 1: Satellite image of two Azure DC sites (A & Z) within a metro region interconnected with new ultra-low latency HCF technology, using two diverse paths (blue & red)

Image 2 shows the optical routing that the deployed HCF cables take through both Inside Plant (ISP) and Outside Plant (OSP) for interconnecting terminal equipment within key sites in the region (comprising DCs, Network Gateways, and PoPs).

Image 2: Optical connectivity at the physical layer between DCA and DCZ

The HCF OSP cables have been developed for outdoor use in harsh environments without degrading the propagation properties of the fiber. The cable technology is smaller, faster, and easier to install (using a blown installation method). Alongside cables, various other technologies have been developed and integrated to provide a reliable end-to-end HCF network solution. This includes dedicated HCF-compatible equipment (shown in image 3), such as custom cable joint enclosures, fusion splicing technology, HCF patch tails for cable termination in the DC, and an HCF custom-designed Optical Time Domain Reflectometer (OTDR) to locate faults in the link. These solutions work with commercially available transponders and DWDM technologies to deliver multi-Tb/s capacities for Azure customers. Looking more closely at an HCF cable installation (image 4), the cable is installed by passing it through a blowing head (1) and inserting it into pre-installed conduit in segments underground along the route.
As with traditional installations of conventional cable, the conduit, cable entry/exit, and cable joints are accessible through pre-installed access chambers, typically a few hundred meters apart. The blowing head uses high-pressure air from a compressor to push the cable into the conduit. A single drum length of cable can be re-fleeted above ground (2) at multiple access points and re-jetted (3) over several kilometers. After the cables are installed inside the conduit, they are jointed at pre-designated access chamber locations, which house the purpose-designed cable joint enclosures.

Image 4: Cable preparation and installation during in-field deployment

Image 5 shows a custom HCF cable joint enclosure in the field, tailored to protect HCFs for reliable data transmission. These enclosures organize the many HCF splices inside and are placed in underground chambers across the link.

Image 5: 1) HCF joint enclosure in a chamber in-field 2) Open enclosure showing fiber loop storage protected by colored tubes at the rear side of the joint 3) Open enclosure showing HCF spliced on multiple splice tray layers

Inside the DC, connectorized ‘plug-and-play’ HCF-specific patch tails have been developed and installed for use with existing DWDM solutions. The patch tails interface between the HCF transmission and the SMF active and passive equipment, each containing two SMF-compatible connectors coupled to the ISP HCF cable. In image 6, this has been terminated to a patch panel and mated with existing DWDM equipment inside the DC.

Image 6: HCF patch tail solution connected to DWDM equipment

Testing

To validate the end-to-end quality of the installed HCF links (post deployment and during operation), field-deployable solutions have been developed and integrated to ensure all required transmission metrics are met, and to identify and restore any faults before the link is ready for customer traffic.
One such solution is Microsoft's custom-designed HCF-specific OTDR, which helps measure individual splice losses and verify attenuation in all cable sections, checked against rigorous Azure HCF specification requirements. The OTDR tool is invaluable for locating high splice losses or faults that need to be reworked before the link can be brought into service. The diagram below shows an OTDR trace detecting splice locations and splice loss levels (dB) across a single strand of installed HCF. The OTDR can also continuously monitor HCF links and quickly locate faults, such as cable cuts, for quick recovery and remediation. For this deployment, a mean splice loss of 0.16 dB was achieved, with some splices as low as 0.04 dB, comparable to conventional fiber. Low attenuation and splice loss help maintain higher signal integrity, supporting longer transmission reach and higher traffic capacity. There are ongoing Azure HCF roadmap programs to continually improve this.

Performance

Before running customer traffic on the link, the fiber is tested to ensure reliable, error-free data transmission across the operating spectrum by counting lost or errored bits. Once confirmed, the link is moved into production, allowing customer traffic to flow on the route. These optical tests, tailored to HCF, are carried out by the installer to meet Azure's acceptance requirements. Image 8 illustrates the flow of traffic across an HCF link, dictated by changing demand on capacity and routing protocols in the region, which fluctuate throughout the day. The HCF span has supported varying levels of customer traffic from the point the link was made live, without incurring any outages or link flaps. A critical metric for measuring transmission performance over each HCF path is the instantaneous pre-Forward Error Correction (FEC) Bit Error Rate (BER) level. Pre-FEC BER measures errors in a digital data stream at the receiver before any error correction is applied.
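As a simplified illustration (not the actual telemetry pipeline), a pre-FEC BER reading can be computed from raw error counts and compared against a log-scale threshold like the -3.4 (log10) level this deployment stays well below:

```python
import math

# Simplified sketch: compute a pre-FEC bit error rate from raw counts
# and check it against a log10 threshold.

def pre_fec_ber_log10(error_bits, total_bits):
    """Return log10 of the pre-FEC bit error rate."""
    return math.log10(error_bits / total_bits)

def within_spec(error_bits, total_bits, threshold=-3.4):
    return pre_fec_ber_log10(error_bits, total_bits) < threshold

# 4 errored bits in 10**7 transmitted bits -> log10 BER ≈ -6.4, in spec
print(within_spec(4, 10**7))   # True
print(within_spec(1, 10**3))   # False: log10(1e-3) = -3.0, above threshold
```

Since FEC can fully correct error rates below the threshold, staying in this regime is what yields the zero post-FEC errors reported for the deployed spans.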
Pre-FEC BER is crucial for transmission quality when the link carries data traffic; lower levels mean fewer errors and higher signal quality, essential for reliable data transmission. The following graph (image 9) shows the evolution of the pre-FEC BER level on an HCF span once the link is live. Each color represents a single strand of HCF, with all showing minimal fluctuation. This demonstrates very stable pre-FEC BER levels, well below -3.4 (log10), across all 400G optical transponders, operating over all channels during a 24-day period. This indicates the network can handle high-data-rate transmission efficiently with no post-FEC errors, leading to high customer traffic performance and reliability.

Image 9: Very stable pre-FEC BER levels across the HCF span over 20 days

The graph below demonstrates the optical loss stability over one entire span, composed of two HCF strands. It was monitored continuously over 20 days using the in-built line system and measured in both directions to assess the optical health of the HCF link. The new HCF cable paths are live and carrying customer traffic across multiple Azure regions. Having demonstrated the end-to-end deployment capabilities and network compatibility of the HCF solution, it is possible to take full advantage of the ultra-stable, high-performance, and reliable connectivity HCF delivers to Azure customers.

What’s next?

Unlocking the full potential of HCF requires compatible, end-to-end solutions. This blog outlines the holistic and deployable HCF systems we have developed to better serve our customers. While we further integrate HCF into more Azure regions, our development roadmap continues: smaller cables with more fibers and enhanced system components to further increase the capacity of our solutions, standardized and simplified deployment and operations, and extended deployable distances for HCF long-haul transmission solutions.
Creating a more stable, higher-capacity, faster network will allow Azure to better serve all its customers. Learn more about how hollow core fiber is accelerating AI.

Recently published HCF research papers:

Ultra high resolution and long range OFDRs for characterizing and monitoring HCF DNANFs
Unrepeated HCF transmission over spans up to 301.7km