azure networking
Simplify Virtual WAN Spoke Connectivity at Scale with Azure Virtual Network Manager
With Azure Virtual Network Manager (AVNM) integration, organizations using Virtual WAN for transitive connectivity can simplify spoke connectivity and policy management across large-scale hub-and-spoke deployments. By using a Virtual WAN hub as the hub in an AVNM hub-and-spoke topology, organizations can define connectivity and routing intent once at the network group level and apply it consistently across large numbers of spoke VNets. This reduces repetitive per-spoke connection and routing configuration, helps maintain operational consistency as deployments expand, and makes it easier to manage hub-and-spoke environments at scale. Together, AVNM’s centralized, group-based orchestration and Virtual WAN’s managed routing, security integration, and hybrid connectivity provide a more streamlined way to simplify operations and scale with confidence. What is Azure Virtual Network Manager? Azure Virtual Network Manager is a management service that lets you group, configure, and deploy network connectivity and security policies across virtual networks at scale. Instead of configuring VNet peering and access rules on each virtual network individually, you define network groups — logical collections of virtual networks based on static selection or dynamic Azure Policy conditions — and apply connectivity configurations and security admin rules to those groups. Key capabilities include: Hub-and-spoke and mesh topologies — Define how virtual networks in a network group connect to a central hub or to each other. Network groups — Group VNets statically or dynamically (using tags, subscriptions, resource group names, or other Azure Policy conditions). Security admin rules — Author and enforce access control lists across all VNets in a network group, providing a centralized layer of defense that complements NSGs and firewalls. Region-scoped deployment — Deploy configurations to specific Azure regions, enabling incremental rollout and controlled blast radius. 
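To make the idea of network groups concrete, here is a minimal Python sketch of how static and dynamic membership could be resolved. It is a simplified model, not AVNM's implementation: the `VNet` and `NetworkGroup` classes and the `required_tags` field are hypothetical, and real dynamic membership is expressed as Azure Policy conditions rather than a tag dictionary.

```python
from dataclasses import dataclass, field


@dataclass
class VNet:
    """A virtual network reduced to the fields this sketch needs (hypothetical)."""
    name: str
    tags: dict[str, str] = field(default_factory=dict)


@dataclass
class NetworkGroup:
    """Hypothetical model: static members plus a tag-based dynamic condition."""
    name: str
    static_members: set[str] = field(default_factory=set)
    required_tags: dict[str, str] = field(default_factory=dict)

    def members(self, vnets: list[VNet]) -> list[str]:
        matched = set()
        for vnet in vnets:
            is_static = vnet.name in self.static_members
            # Dynamic match only applies when a condition is actually defined.
            is_dynamic = bool(self.required_tags) and all(
                vnet.tags.get(key) == value
                for key, value in self.required_tags.items()
            )
            if is_static or is_dynamic:
                matched.add(vnet.name)
        return sorted(matched)


vnets = [
    VNet("vnet-app1", {"env": "prod"}),
    VNet("vnet-app2", {"env": "dev"}),
    VNet("vnet-shared"),
]
prod_spokes = NetworkGroup(
    "prod-spokes",
    static_members={"vnet-shared"},
    required_tags={"env": "prod"},
)
print(prod_spokes.members(vnets))  # ['vnet-app1', 'vnet-shared']
```

The point of the sketch is that configurations target the group, so any VNet that later satisfies the condition is picked up without editing the configuration itself.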
AVNM operates as an overlay management layer — it orchestrates VNet peering, connectivity, and security rules without replacing the underlying networking primitives. What is Azure Virtual WAN? Azure Virtual WAN as a service brings together routing, security, VPN, ExpressRoute, and transitive connectivity in a hub-and-spoke architecture. A Virtual WAN hub is a managed regional resource that acts as a central transit point for branch connectivity, remote users, private enterprise connectivity, spoke virtual networks, and private traffic routing through security services. Site-to-site VPN connectivity (branch offices, SD-WAN devices) Point-to-site VPN connectivity (remote users) ExpressRoute private connectivity (on-premises datacenters) VNet-to-VNet transitive connectivity (spoke virtual networks) Routing, firewall, and encryption for private traffic All hubs in a Standard Virtual WAN are connected in a full mesh over the Microsoft backbone, enabling any-to-any connectivity between spokes, branches, and remote users across regions. Virtual WAN removes the need to manually manage complex route tables and transit VNets — routing is handled by the hub's built-in router. What this integration enables When you select a Virtual WAN hub as the hub in an AVNM connectivity configuration, AVNM handles the spoke-to-hub wiring for you. For each virtual network in your selected network groups: If the VNet is not yet connected to the Virtual WAN hub, AVNM creates the Virtual Network connection to Virtual WAN hub and applies a consistent routing configuration with Virtual WAN connection policy. If the VNet is already connected, AVNM updates the existing Virtual Network connection to utilize the routing properties in the Virtual WAN connection policy. 
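The create-or-update behavior described above can be sketched as a small planning function. This is an illustrative model only: the function name, the dictionary shape of the routing settings, and the `no-change` outcome are assumptions made for the example, not part of any Azure API.

```python
def plan_connections(group_members, existing_connections, policy):
    """Decide what to do for each spoke VNet in the selected network groups.

    group_members:        iterable of VNet names
    existing_connections: dict mapping VNet name -> current routing settings
    policy:               routing settings the connection policy prescribes
    """
    plan = []
    for vnet in group_members:
        current = existing_connections.get(vnet)
        if current is None:
            # Not yet connected: create the hub connection with policy routing.
            plan.append((vnet, "create"))
        elif current != policy:
            # Already connected: bring routing in line with the policy.
            plan.append((vnet, "update"))
        else:
            # Assumption for this sketch: a compliant connection needs no work.
            plan.append((vnet, "no-change"))
    return plan


# Hypothetical routing settings; the field names are illustrative only.
policy = {"associated_route_table": "defaultRouteTable", "internet_security": True}
existing = {
    "vnet-b": {"associated_route_table": "customRouteTable", "internet_security": False},
    "vnet-c": dict(policy),
}
print(plan_connections(["vnet-a", "vnet-b", "vnet-c"], existing, policy))
```

Because the plan is computed per VNet from the desired state, running it again after a successful deployment yields only `no-change` entries in this model.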
A connection policy is a hub-level Virtual WAN resource that defines shared routing behavior for the virtual network connections it governs, including route table association and propagation, route maps, internet security settings, and propagated labels. Because the policy applies these settings consistently across governed connections, it helps standardize routing and overrides conflicting settings configured directly on individual connections. How it works The setup follows AVNM's standard workflow: Create a network group. Add virtual networks as members — either statically (by selecting specific VNets) or dynamically (using Azure Policy conditions such as tags or resource group names). Create a connectivity configuration. Choose hub-and-spoke topology, select your Virtual WAN hub as the hub, and select or create a connection policy. Deploy. Commit the configuration to your target regions. AVNM connects all VNets in the network groups to the Virtual WAN hub and applies the connection policy in parallel. You can also enable direct connectivity within a spoke network group. When enabled, VNet-to-VNet traffic within that group routes directly between virtual networks instead of transiting the Virtual WAN hub — useful for latency-sensitive or high-throughput east-west workloads. By default, direct connectivity is regional; enable global mesh to extend it across Azure regions. Key use cases Bulk spoke onboarding Connect many virtual networks to a Virtual WAN hub in one operation. All connections are orchestrated in parallel by AVNM, and the pre-defined routing configuration is automatically applied. Policy-based dynamic onboarding Use Azure Policy to define network group membership conditions. When a new virtual network matches those conditions—for example, a VNet tagged env:prod—it is automatically added to the network group. On the next deployment, AVNM connects it to the Virtual WAN hub with the correct routing configuration, reducing manual onboarding effort. 
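Returning to the direct connectivity option described under "How it works": the following sketch models, in a deliberately simplified way, when spoke-to-spoke traffic would go direct and when it would transit the Virtual WAN hub. The `Spoke` type, the group names, and the function are hypothetical; actual routing depends on the effective routes in your environment.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Spoke:
    name: str
    region: str
    group: str  # network group the spoke belongs to (one group, for simplicity)


def traffic_path(src, dst, direct_groups, global_mesh_groups=frozenset()):
    """Return 'direct' or 'via-hub' for traffic between two spokes.

    direct_groups:      groups with direct connectivity enabled
    global_mesh_groups: subset of those groups with global mesh enabled
    """
    same_group = src.group == dst.group
    if same_group and src.group in direct_groups:
        # Direct connectivity is regional unless global mesh is enabled.
        if src.region == dst.region or src.group in global_mesh_groups:
            return "direct"
    return "via-hub"


db_east = Spoke("vnet-db-east", "eastus", "data")
db_east2 = Spoke("vnet-db-east2", "eastus", "data")
db_west = Spoke("vnet-db-west", "westus", "data")
web_east = Spoke("vnet-web-east", "eastus", "web")

print(traffic_path(db_east, db_east2, {"data"}))           # direct
print(traffic_path(db_east, db_west, {"data"}))            # via-hub
print(traffic_path(db_east, db_west, {"data"}, {"data"}))  # direct
print(traffic_path(db_east, web_east, {"data"}))           # via-hub
```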
Batch routing configuration updates Push routing changes to all virtual networks in a network group as a single, fully parallelized operation. This significantly reduces maintenance window duration for network-wide changes and makes rollback straightforward. Incremental deployment Segment your network into precise update domains by creating separate network groups — for example, by environment (staging, dev, production) or by region. Deploy connection policies to each group or region independently. This lets you test changes on a smaller subset before applying them broadly, minimizing blast radius. Mesh for selective inspection bypass If you use routing intent to send all private traffic through a firewall in the Virtual WAN hub, certain high-throughput or latency-sensitive flows (such as database replication) may benefit from bypassing that inspection. Enable direct connectivity in AVNM to create a mesh between selected spokes, allowing VNet-to-VNet traffic to route directly while all other traffic continues through the hub firewall. Security admin rules at scale Define network groups for your Virtual WAN spokes, then use AVNM security admin rules to author and deploy access control lists across those spokes. This provides an additional layer of defense alongside next-generation firewalls in the Virtual WAN hub. Getting started Prerequisites: An existing Azure Virtual Network Manager instance An existing Azure Virtual WAN and Virtual WAN hub One or more virtual networks to use as spoke members To configure: Go to your Network Manager instance in the Azure portal. Create a network group and add your spoke VNets. Create a connectivity configuration → select hub-and-spoke → select your Virtual WAN hub → select or create a connection policy → add spoke network groups. Deploy the configuration to your target regions. In your Virtual WAN resource, verify that the expected spoke VNet connections are in a connected state. 
Review effective routes in the virtual hub to confirm routing behavior matches the selected connection policy.

For detailed step-by-step instructions, see Configure Azure Virtual WAN hub for Azure Virtual Network Manager. For more on connection policy, see Connection policy in Azure Virtual WAN.

Learn more
Azure Virtual Network Manager documentation
Virtual WAN and Virtual Network Manager integration overview
Azure Virtual WAN documentation

Deploy with Confidence: Using Rule Impact Analyzer in Azure Virtual Network Manager
Introduction

In a previous blog post, we described how Azure Virtual Network Manager (AVNM) enables central teams to enforce security admin rules across hundreds of virtual networks, bringing consistency and governance to complex enterprise environments. But enforcement at scale introduces a new challenge: deployment confidence. Security admin rules take priority over NSG rules and can span subscriptions and management groups. That makes them powerful, but a single misconfigured rule can disrupt critical traffic across your entire network. Governance teams need a way to understand the real-world impact of a rule before it reaches production, not after.

This is exactly the problem Azure Virtual Network Manager now solves with the Rule Impact Analyzer, a capability that simulates proposed security admin rules against your real network traffic, so you can see exactly what will change and what won't, and deploy with confidence instead of guesswork.

The Challenge: Understanding Rule Impact Before Deployment

As enterprises scale up their use of security admin rules, a visibility gap emerges. Consider a common scenario: a central governance team needs to block high-risk ports across all production virtual networks. The rules are well-intentioned, but the team has no visibility into which existing traffic flows would be affected. Without a way to preview the impact, teams face an uncomfortable tradeoff: move quickly and risk disruption, or slow down for manual review across every affected network. The Rule Impact Analyzer is designed to close this gap, giving teams a clear, data-driven view of what a rule change will do before it reaches production.

What Is the Rule Impact Analyzer?

The Rule Impact Analyzer is a joint capability of Azure Virtual Network Manager and Azure Network Watcher. It lets you simulate proposed security admin rules against traffic data derived from virtual network (VNet) flow logs and Traffic Analytics in your environment.
Instead of relying on manual review, the analyzer evaluates proposed rules against observed traffic and classifies each flow:

Affected: The proposed rule would change the current evaluation outcome for this flow (e.g., traffic that is currently allowed would be blocked).
Not Affected: The flow would continue as-is; the rule does not apply.
Indeterminate: The flow cannot be conclusively evaluated (e.g., insufficient traffic data).

This gives governance teams and network administrators a clear, data-driven view of what a rule change will do before it reaches production.

Note: The analysis is based on traffic data available through flow logs and Traffic Analytics. Results reflect recorded traffic patterns; traffic that has not yet been observed will not appear in results.

The Customer Journey: From Rule Authoring to Validated Deployment

The Rule Impact Analyzer fits naturally into the lifecycle of security admin rule management. The workflow lets teams author rules, simulate impact, review results, and refine policies before committing a single change to production. Teams can cycle through simulation as many times as needed.

Key Capabilities

Predicted Impact Visibility
See at a glance how your proposed security admin rules would affect existing traffic flows. Results are based on Traffic Analytics data, helping teams make informed deployment decisions.

Flow-Level Drill-Down
Go beyond summary counts. Inspect specific source and destination paths, see which rule affects each flow, and identify legitimate traffic that would be unintentionally blocked. This makes it easy to pinpoint issues and refine your rules.

Configurable Scope
You don't have to analyze everything at once. Target your analysis to specific:
Rule collections or individual security admin rules
Network groups or specific virtual networks
This lets you focus on the areas that matter most, whether you're validating a single rule change or assessing a broad policy rollout.
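The three-way classification described earlier in this section (Affected, Not Affected, Indeterminate) can be illustrated with a short Python sketch. It is a conceptual model under stated assumptions, not the analyzer's actual logic: the function name and the convention that `None` means "outcome unknown" are invented for the example.

```python
from enum import Enum
from typing import Optional


class Impact(Enum):
    AFFECTED = "Affected"
    NOT_AFFECTED = "Not Affected"
    INDETERMINATE = "Indeterminate"


def classify(current_outcome: Optional[str], proposed_outcome: Optional[str]) -> Impact:
    """Compare a flow's outcome today with its outcome under the proposed rules.

    Outcomes are 'allow' or 'deny'; None stands for "could not be determined",
    for example because the recorded traffic data is insufficient.
    """
    if current_outcome is None or proposed_outcome is None:
        return Impact.INDETERMINATE
    if current_outcome != proposed_outcome:
        return Impact.AFFECTED
    return Impact.NOT_AFFECTED


print(classify("allow", "deny").value)   # Affected
print(classify("allow", "allow").value)  # Not Affected
print(classify("allow", None).value)     # Indeterminate
```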
Controlled Iteration Modify your security admin rules, re-run the analysis, and repeat—as many times as you need. Deploy only when the simulated impact matches your intended connectivity outcome. Inbound and Outbound Evaluation The analyzer evaluates both inbound and outbound traffic directions, giving you full visibility into the rule's impact across your network. Real-World Scenario: Locking Down Internet-Exposed Management Ports at Scale Let’s look at a real-world scenario as an example. Your organization runs hundreds of VNets across multiple subscriptions. Over time, different teams have created NSG rules that allow inbound SSH (port 22) and RDP (port 3389) from broad source ranges — some even from 0.0.0.0/0. Your security team mandates: block all inbound management-port access except from trusted bastion subnets. The challenge? You can't just flip a switch. Blocking the wrong traffic could be risky, and you want to know the impact of applying the security rules. With Rule Impact Analyzer, you can: Define the proposed security admin rule — deny inbound TCP 22/3389 from all sources except your bastion subnet prefix Simulate before you commit — see exactly which VNets, subnets, and NICs currently have traffic matching the rule, and which existing NSG rules would be overridden Identify conflicts — spot cases where a team's NSG "Allow" rule would be superseded by your new admin-level "Deny," so you can coordinate before deployment Deploy with confidence — roll out the rule knowing the blast radius is fully understood, not guessed Before Rule Impact Analyzer, this required manually auditing NSG rules across every subscription, cross-referencing with resource inventories, and hoping nothing was missed. Now, a single simulation gives you a complete picture in minutes — turning a week-long audit into a self-service workflow. How It Works: Architecture and Design Rule Impact Analyzer uses existing Azure networking telemetry and analytics components. 
It does not require a separate data collection pipeline.

[Figure: Rule Impact Analyzer architecture diagram]

Step 1: Traffic Analytics as Ground Truth. The analyzer queries your existing VNet flow logs through Traffic Analytics. No new agents, log pipelines, or storage accounts are required.
Step 2: Log Analytics as the Query Engine. Traffic Analytics data resides in your Log Analytics workspace. The Rule Impact Analyzer runs Kusto Query Language (KQL) queries to retrieve the observed flows relevant to your analysis scope.
Step 3: AVNM Rule Evaluation Engine. The retrieved flows are evaluated using AVNM's own enforcement logic: the same priority ordering, allow/deny behavior, and scope resolution used in production. This ensures that what you see in the analyzer matches what would happen when rules are enforced.
Step 4: Results Correlation and Surfacing. Each flow is classified and surfaced in the Azure Portal with drill-down capabilities, from summary impact counts down to individual flow paths and the specific rules affecting them.

What This Means for You

Uses existing infrastructure. If you already have Traffic Analytics enabled, there is nothing new to deploy.
No data duplication. Queries run in place within your existing Log Analytics workspace, under your existing RBAC and data retention policies.
Transparent costs. Only standard Log Analytics query costs apply, with no hidden charges or separate billing.

Getting Started

You can access Rule Impact Analyzer from two entry points in the Azure Portal:
From Azure Virtual Network Manager: Navigate to your security admin configuration → select a rule collection → launch the Rule Impact Analyzer.
From Azure Network Watcher: Navigate to Monitoring → Traffic Analytics → Rule Analyzer.
Both paths lead to the same analysis experience, so you can start with whichever tool fits your workflow.
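As a rough illustration of Step 3 (rule evaluation against observed flows), the sketch below replays a handful of flows against two proposed admin rules modeled on the management-port scenario. It rests on several assumptions that you should verify against the AVNM documentation: that a lower priority number is evaluated first, that the first matching rule decides, and that an admin "Allow" hands the flow on to NSG evaluation. The types, the bastion prefix `10.0.255.0/27`, and the sample flows are all hypothetical.

```python
import ipaddress
from dataclasses import dataclass


@dataclass(frozen=True)
class AdminRule:
    priority: int          # assumption: lower number is evaluated first
    action: str            # "Allow" or "Deny"
    source_prefix: str     # CIDR the flow's source must fall within
    dest_ports: frozenset  # destination ports the rule applies to


@dataclass(frozen=True)
class Flow:
    src_ip: str
    dest_port: int
    nsg_allowed: bool      # outcome observed today, before the new rules


def evaluate(flow, rules):
    """Return the action of the first matching rule, or None if none match."""
    src = ipaddress.ip_address(flow.src_ip)
    for rule in sorted(rules, key=lambda r: r.priority):
        in_prefix = src in ipaddress.ip_network(rule.source_prefix)
        if in_prefix and flow.dest_port in rule.dest_ports:
            return rule.action
    return None


def would_be_blocked(flow, rules):
    """True if traffic that is allowed today would be denied by the new rules."""
    return flow.nsg_allowed and evaluate(flow, rules) == "Deny"


MGMT_PORTS = frozenset({22, 3389})
rules = [
    AdminRule(100, "Allow", "10.0.255.0/27", MGMT_PORTS),  # hypothetical bastion subnet
    AdminRule(200, "Deny", "0.0.0.0/0", MGMT_PORTS),       # everyone else
]
flows = [
    Flow("10.0.255.5", 22, True),     # bastion SSH: still allowed
    Flow("203.0.113.7", 3389, True),  # internet RDP: newly blocked
    Flow("203.0.113.7", 443, True),   # HTTPS: rule does not apply
]
for flow in flows:
    print(flow.src_ip, flow.dest_port, would_be_blocked(flow, rules))
```

Ordering the narrow allow rule ahead of the broad deny is what keeps bastion access intact in this model; swapping the two priorities would block the bastion flow as well.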
Prerequisites Before using the Rule Impact Analyzer, ensure the following are in place: VNet flow logs are enabled on the virtual networks you want to analyze. Traffic Analytics is configured and sends data to a Log Analytics workspace. You have the necessary RBAC permissions to access the AVNM security admin configuration and the Log Analytics workspace. Steps Enable VNet flow logs and Traffic Analytics on your target virtual networks. Learn more about Traffic Analytics. Author or update your security admin rules in Azure Virtual Network Manager. Learn more about AVNM security admin rules. Launch the Rule Impact Analyzer from either portal entry point, configure your scope (rule collections, network groups, or specific VNets), and run the analysis. Review, refine, and deploy. Iterate your rules until the simulated impact matches your intended outcome, then deploy with confidence. The screenshot below shows the Rule Impact Analyzer in the Azure Portal. After running a simulation, you can see a summary of predicted traffic impact—total paths analyzed, how many are affected or not affected—along with a detailed results table to drill into individual flows and identify which rule impacts each one. Why It Matters Outage Prevention For organizations rolling out network isolation policies at scale, Rule Impact Analyzer acts as a safety net. By simulating rule impact against recorded traffic patterns, teams can catch misconfigurations before they reach production. Faster Rule Adoption Without the analyzer, deploying new admin rules often requires lengthy manual review cycles. With self-service impact analysis, governance teams can validate and deploy rules faster—without waiting for manual approval. Aligning with Behavior Security policies express intent—what traffic should or shouldn't be allowed. Rule Impact Analyzer validates whether a proposed rule achieves that intent against your observed traffic, closing the loop between policy design and operational behavior. 
Conclusion

The AVNM Rule Impact Analyzer closes the gap between policy intent and deployment confidence. By simulating rules against observed traffic, with no additional infrastructure required, governance teams can validate impact before enforcement. Enforcement without visibility is a risk. Visibility without enforcement is incomplete. This capability brings both together.

We welcome your feedback as you start using this capability. Share your experience through the Azure Portal feedback button or your Microsoft account team.

Learn more: Azure Virtual Network Manager Azure Network Watcher Traffic Analytics AVNM Security Admin Rules Using Azure Virtual Network Manager to Enhance Network Security

Authors: Deepak Bansal, Corporate Vice President and Technical Fellow, Microsoft Azure, Xinyan Zan, Vice President, Ashish Bhargava, Principal Software Development Manager, and Jay Li, Senior Product Manager

Summarized Gateway Prefixes for Route Advertisement in Azure Virtual Networks
Background Many Azure deployments follow a hub-and-spoke topology: one VNet is designated as the hub and holds the connection to on-premises (via ExpressRoute Gateway, VPN Gateway, or both), and workload VNets — the spokes — peer to the hub to reach on-premises and shared services. This centralizes gateway connectivity so many workloads can share a single ExpressRoute or VPN Gateway. However, in large hub-and-spoke topologies, ExpressRoute and VPN Gateway limits on advertised prefixes (for example, 1,000 IPv4 and 100 IPv6 prefixes) can be reached. Because each spoke adds its own address prefixes to that count, these limits are approached quickly, constraining how far the topology can scale. What's New With Summarized Gateway Prefixes, customers can now advertise a single covering prefix (for example, 10.0.0.0/16) instead of many smaller CIDRs (for example, multiple /24s) – dramatically reducing advertised route count and enabling larger-scale Azure environments. A new property, summarizedGatewayPrefixes, is now available on the Virtual Network resource in public preview. When configured on a hub VNet, it controls what your ExpressRoute Gateway and VPN Gateway advertise to on-premises, replacing the default behavior of advertising all individual hub and spoke VNet CIDRs with a set of aggregated prefixes you define. For example, instead of advertising 10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24, and so on for each spoke, you can advertise a single 10.0.0.0/16. Key Benefits Fewer advertised routes — Replace hundreds of individual spoke CIDRs with a small set of summarized prefixes. Scales with your topology — Supports deployments with 500+ spokes without requiring address plan redesigns or VNet splits. IPv4 and IPv6 — Summarize both address families. Works with both gateway types — Supported on ExpressRoute Gateway and VPN Gateway. Simple configuration — A single property on the VNet resource. No additional services or dependencies. 
Backward compatible: If the property is left empty, behavior is unchanged, and all hub and peered spoke address spaces are advertised as before.

How It Works

Default behavior
ExpressRoute Gateway and VPN Gateway advertise all address spaces of the hub VNet and all address spaces of peered spoke VNets to on-premises.

With summarizedGatewayPrefixes configured
The gateways advertise the summarized prefixes instead of the hub VNet's individual address spaces. For each peered spoke, if the spoke's address space falls within a summarized prefix, the spoke's individual CIDRs are suppressed from advertisement. Spoke address spaces not covered by a summarized prefix continue to be advertised individually.

Example:

Without Summarization | With Summarization
10.0.1.0/24, 10.0.2.0/24, 10.0.3.0/24, … | 10.0.0.0/16
Hundreds of prefixes | One prefix

Getting Started

1. Open the hub VNet (the VNet containing your GatewaySubnet) in the Azure portal.
2. Go to Address space → Advertised gateway prefixes.
3. Add one or more IPv4 or IPv6 CIDR prefixes that cover the address spaces you want to summarize.
4. Navigate to your virtual network and verify that the summarized prefixes appear.

Things to Know

The property is set on the hub VNet (the VNet with the GatewaySubnet).
The summarized prefixes list can include prefixes outside the VNet's own address space.
Avoid overlap among prefixes within the list, but overlap with peered VNet address spaces is expected in hub-and-spoke designs.
For dual-stack (IPv4 + IPv6) VNets, define both IPv4 and IPv6 summarized prefixes explicitly.

Metrics Filtering and Log Aggregation Now GA for Advanced Container Networking Services
We are thrilled to announce that Advanced Container Networking Services (ACNS) for Azure Kubernetes Service (AKS) now delivers two powerful observability features in General Availability: container network metrics filtering and container network log filtering and aggregation. Together, these capabilities set a new standard for Kubernetes network observability, giving you high-fidelity visibility at dramatically lower cost and noise. These capabilities fundamentally redefine how network observability works at scale while delivering up to 97% cost reduction.

Why Is This a Milestone?

Most Kubernetes observability solutions face a fundamental tension: collect everything and drown in noise and cost, or sample and miss the signals that matter. With the new features of Advanced Container Networking Services, that tradeoff has been eliminated. With this release, Azure becomes the first cloud provider to deliver on-node metrics filtering and flow log aggregation for Kubernetes networking, capabilities now also contributed to the upstream Hubble project, making them available to the broader open-source community.

For AKS customers running Cilium-based clusters, this means:
Every flow you care about is captured. Everything else is dropped at the source.
Log volume is compressed by up to 45% through aggregation, without losing security verdicts or error context.
Costs scale with what you monitor, not with cluster size.

What’s been improved in observability?

This release introduces two capabilities that work together: container network metrics filtering and container network log filtering and aggregation. Both are available on AKS clusters with the Cilium data plane and give you precise controls to keep observability costs predictable while maintaining the visibility you need.

Container Network Metrics Filtering

Container network metrics are generated for all pods by default whenever Advanced Container Networking Services is enabled.
With metrics filtering, you now control what gets collected at the point of ingestion, on the node, before anything is scraped or transmitted. A single ContainerNetworkMetric CRD per cluster defines which metric types (dns, flow, tcp, drop), namespaces, pod labels, and protocols to ingest. It supports both include and exclude filters, so you can maintain broad collection while carving out specific workloads or namespaces. Anything that doesn't match is dropped on the node. Changes reconcile in a few seconds, with no Cilium agent or Prometheus restarts required.

Container Network Log Filtering and Aggregation

Unlike metrics, container network logs are not generated automatically. You start capturing network flows only after applying a ContainerNetworkLog CRD that defines exactly which traffic to capture: by namespace, pod, service, protocol, or verdict. Only matching flows are logged, giving you a precise, targeted view rather than a fire hose.

This is where Azure's first-to-market innovation comes in. Flow log aggregation, now built into Advanced Container Networking Services and contributed upstream to Hubble for the open-source community, groups similar flows into summarized records every 30 seconds. The result is dramatically reduced data volume while preserving security verdicts, service identity, and error context. What previously required custom post-processing pipelines is now built directly into the platform before storage costs are incurred.

Every matched flow log captures: source and destination pods, namespaces, ports, protocols, traffic direction, and policy verdicts. Logs are stored in a Log Analytics workspace (ContainerNetworkLogs table) with a choice of using the Analytics or Basic tier. Built-in Azure portal dashboards are available for both tiers. Logs can also be exported to external log collectors such as Splunk or Datadog.
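To show what "grouping similar flows into summarized records every 30 seconds" can look like, here is a conceptual Python sketch. It is not the Hubble or ACNS implementation: the choice of grouping key, the tuple layout, and the sample flows are assumptions made for illustration.

```python
from collections import Counter

WINDOW_SECONDS = 30  # aggregation interval mentioned in the text above


def aggregate(flows):
    """Group flows that share a window and identity into one counted record.

    Each flow is a tuple: (timestamp_seconds, source, destination, port,
    protocol, verdict). The verdict is part of the key, so forwarded and
    dropped traffic are never merged into the same record.
    """
    counts = Counter()
    for ts, src, dst, port, proto, verdict in flows:
        window_start = int(ts // WINDOW_SECONDS) * WINDOW_SECONDS
        counts[(window_start, src, dst, port, proto, verdict)] += 1
    return counts


flows = [
    (0.5, "shop/web", "shop/db", 5432, "TCP", "FORWARDED"),
    (12.0, "shop/web", "shop/db", 5432, "TCP", "FORWARDED"),
    (29.9, "shop/web", "shop/db", 5432, "TCP", "FORWARDED"),
    (30.1, "shop/web", "shop/db", 5432, "TCP", "FORWARDED"),
    (31.0, "shop/web", "shop/db", 5432, "TCP", "DROPPED"),
]
records = aggregate(flows)
print(len(flows), "flows ->", len(records), "records")
for key, count in sorted(records.items()):
    print(key, count)
```

Keeping the verdict in the grouping key is the design choice that lets volume shrink without hiding a drop behind a larger number of forwarded flows.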
First to Market: Azure and the upstream Hubble Contribution Advanced Container Networking Services built-in filtering and aggregation capabilities were engineered from the ground up to solve real production observability challenges at scale. Rather than keeping this innovation proprietary, Azure contributed the log aggregation and filtering capabilities to the upstream Hubble project, the observability layer of the Cilium ecosystem. This means: AKS customers get a fully managed, Azure-native experience with portal dashboards, Log Analytics integration, and Grafana visualization, out of the box. The broader open-source community gains access to the same filtering and aggregation primitives through upstream Hubble. Azure is the first to ship this capability in a managed Kubernetes service, and the first to give it back to the community. Key Benefits 💰 Lower observability cost. Metrics filtering drops unwanted data on the node before Prometheus ever scrapes it. Flow log aggregation compresses log data by up to 97% in lab testing. Your cost scales with what you choose to monitor, not with cluster size. 📉 Less noise, more signal. Metrics filtering carves out the namespaces and workloads that matter, so dashboards show only relevant signals. Log filters scope collection to specific pods and verdicts. Engineers start every investigation with data that's already relevant. ⚡ Faster root-cause isolation. Every metric carries source and destination pod context. Targeted flow logs add the forensic detail, which policy, destination, or port is involved. Together, they cut mean time to resolution from hours of guesswork to minutes of structured investigation. 🔒 Full signal, zero gaps. Within the scope you define, every flow is captured and every pattern is preserved. Aggregation compresses volume without losing security verdicts or error context. 
Who Benefits

Platform engineers managing multi-tenant clusters can scope data collection per namespace, so each team gets visibility into their own traffic without contributing to a shared cost pool.
SREs can isolate packet drops, TCP resets, or DNS failures to a specific workload in minutes, starting with data that's already scoped to what matters.
Decision-makers evaluating observability spend get predictable, controllable ingestion costs that scale with intent, not infrastructure size.

How to optimize metrics and logs with filtering?

1. Enable Advanced Container Networking Services (ACNS) on your AKS cluster with the Cilium data plane:
   az aks create --enable-acns
   Or on an existing cluster:
   az aks update --resource-group $RESOURCE_GROUP --name $CLUSTER --enable-acns
2. Apply a ContainerNetworkMetric CRD to filter which metrics are collected on each node. Start by excluding noisy system namespaces, then scope to business-critical workloads.
3. Apply a ContainerNetworkLog CRD to define which flows to capture. Enable Azure Monitor integration with --enable-container-network-logs to send logs to a Log Analytics workspace, or export logs from the node to an external logging system such as Splunk or Datadog.
4. Check your dashboards. Open your cluster in the Azure portal and go to Monitor > Insights > Networking for bytes, drops, DNS errors, and flows. For flow logs, use the built-in Azure portal dashboards available for both Basic and Analytics tiers.

Conclusion

Kubernetes network observability has long meant choosing between visibility and cost. With container network metrics filtering and log filtering and aggregation now GA in Advanced Container Networking Services (ACNS) and contributed to upstream Hubble for the open-source community, that tradeoff is gone. Azure is first to market with this capability. AKS customers get it fully managed, out of the box, with built-in dashboards and Log Analytics integration.
And the broader Cilium ecosystem gets it through upstream Hubble. High-fidelity visibility. Lower cost. No compromise. Learn more: Container network metrics overview: Container network metrics overview - Azure Kubernetes Service | Microsoft Learn Container network logs overview: Container Network Logs Overview - Azure Kubernetes Service | Microsoft Learn Configure container network metrics filtering: Configure Container network metrics filtering for Azure Kubernetes Service (AKS) - Azure Kubernetes Service | Microsoft Learn Set up container network logs: Set up container network logs - Azure Kubernetes Service | Microsoft Learn
Understanding and building an Azure Hybrid Meshed Hub-Spoke Topology
A meshed hybrid hub-spoke topology Azure offers two main approaches to build network architectures. This article focuses on traditional networking (using VNets, peering, route tables, etc.), rather than Azure Virtual WAN. Why a hub-spoke topology? A hub‑spoke topology is the only way to control traffic flows while maintaining scalability, because it enforces a central point of connectivity and policy enforcement: Centralized traffic control / inspection: All connectivity (to on‑premises, the internet, and between spokes) is anchored through the hub. The hub hosts shared services such as firewalls or NVAs, providing a single control point where traffic is inspected, filtered, and governed consistently. Avoids uncontrolled lateral communication: Spokes do not connect arbitrarily to each other. All connectivity is routed through the hub, preventing uncontrolled east‑west communication and ensuring traffic follows defined security and routing policies. Inherent scalability by design: New workloads are added by introducing additional spokes. The core network design remains unchanged, enabling linear scaling without the complexity of full-mesh connectivity. In summary, the hub‑spoke model provides centralized control combined with scalable, decoupled workload networks—something that flat or full-mesh designs struggle to achieve. From hub-spoke to meshed multi-region In a hub‑spoke topology, it’s important to keep in mind that the hub is implemented as an Azure Virtual Network (VNet) and VNets are scoped to a single region. This means that in a multi‑region setup, you’ll always need at least one hub per region. Each of these hubs hosts shared services like firewalls, NVAs, and DNS, acting as the central point for connectivity and traffic control. Extending dependencies across regions—for example by connecting spokes to a hub in another region—is generally not recommended. 
It creates tight coupling between regions, which goes against the goal of keeping regions independent. A well-designed multi‑region architecture aims for regional self‑containment to improve resilience and fault isolation. Relying on a remote hub can lead to issues like failure propagation between regions, higher latency for inspected traffic and more complex routing and operations. It can also introduce organizational challenges when different regions are managed by separate teams, reducing agility and increasing operational risk. For this reason, meshed hub‑spoke architectures should use hubs that are deployed within each region. Connectivity between regions should be established directly between the hubs, not through spokes. In a meshed design, hubs are typically connected in a full‑mesh peering model, allowing for controlled and predictable inter‑region communication while still maintaining regional independence. Within a single region, it can also make sense to deploy multiple hubs to create isolated environments. This is especially useful when you need to separate workloads based on security requirements, regulatory needs, or organizational boundaries. Each hub can then have its own dedicated set of connectivity and inspection services. Finally, each spoke VNet connects to just one hub. This keeps routing simple and predictable, ensures that all traffic passes through the correct inspection and policy enforcement layers, and reinforces the hub’s role as the central control point for network traffic within the region. Integrating hybrid connectivity In most enterprise scenarios, Azure doesn’t operate in isolation—it needs to connect to external networks such as on‑premises datacenters or other cloud environments. This hybrid connectivity is typically set up using services like Azure ExpressRoute, Azure VPN Gateway or third‑party SD‑WAN solutions. 
In a (meshed) hub‑spoke topology, these connectivity components are best deployed in the hub VNet, since the hub acts as the central point where all inbound and outbound traffic comes together. By centralizing external connectivity in the hub, all traffic—whether entering or leaving Azure—can be routed, inspected and governed in a consistent way using shared services like firewalls or NVAs. It also avoids the need to duplicate gateways and connectivity components across multiple spokes, which helps reduce cost and operational overhead. This approach also simplifies routing and policy management. Spokes can rely on the hub’s shared connectivity instead of maintaining their own connections to external networks. Overall, this reinforces the hub’s role as the single, controlled integration point between Azure and the broader network landscape. Implementation fundamentals With the overall architecture in place, the next step is to understand how Azure actually handles routing and traffic control in this kind of design. When working with a hub‑spoke topology in Azure, it’s important to realize that a virtual network (VNet) doesn’t behave like a traditional router. While you can associate Azure Route Tables with subnets, those routes only apply to traffic originating from within that subnet. Traffic entering the VNet from outside isn’t automatically re‑routed. This is also why VNet peering is non‑transitive by design: peered VNets can communicate directly, but they won’t forward traffic for other networks. To enable controlled routing between spokes—and between Azure and external networks such as ExpressRoute or VPN—you need a component in the hub that can actively receive and forward traffic. In most cases, this is handled by an Azure Firewall or a network virtual appliance (NVA) deployed in the hub. 
These components act as an explicit routing hop: they receive traffic, inspect or process it based on defined policies, and then send it back into the virtual network so Azure's routing engine can continue forwarding it.

In a secure hub-spoke design, the firewall plays a dual role. It not only provides centralized traffic inspection and enforces security policies, but also acts as the mechanism that enables transitive communication between spokes and external networks. This combination of control and connectivity is a key part of the architecture. Of course, this only works as intended if the firewall is configured with the right rules to allow or block traffic according to your security requirements.

While it's technically possible to implement routing using a basic virtual machine or even a Virtual Network Gateway, these approaches don't meet typical enterprise requirements. They lack built-in capabilities like advanced traffic inspection, high availability, autoscaling, and centralized policy management. Purpose-built solutions such as Azure Firewall or mature third-party NVAs are designed to provide not just routing, but also integrated security, consistency, and scalability. For that reason, they're generally the only realistic choice for production-grade hub-spoke environments where both control and resilience matter.

Design principles for building the topology

The diagram below shows the topology for a hybrid meshed hub-spoke, with two hubs and Azure Firewall (a third-party firewall could be used as well). Ensuring correct connectivity in a hub-and-spoke topology may initially appear complex, but in practice it comes down to understanding and correctly applying four key design principles:

controlled routing in the GatewaySubnet
controlled routing in each spoke
proper peering of spokes to the hub
meshing the hubs

Before looking at these in detail, it is important to understand a fundamental behavior of Azure Virtual Network (VNet) peering.
When two VNets are peered, Azure automatically exchanges their address spaces (CIDR ranges) and injects these prefixes as system routes into the effective route tables of all subnets. As a result, resources in one VNet can communicate directly with resources in the other using private IP addressing, without any additional routing configuration. This built-in route propagation is what makes VNet peering an efficient and low-latency connectivity mechanism in Azure. However, this default behavior is not always aligned with the requirements of a hub-and-spoke topology. In this model, network services such as firewalls, inspection and routing control are typically centralized in the hub VNet. If communication between spokes is allowed to follow the automatically injected system routes, traffic could bypass these centralized controls, which would undermine design objectives such as inspection, segmentation and governance. For this reason, although VNet peering provides seamless connectivity by default, additional configuration is required in a hub-and-spoke architecture. This is usually achieved through Azure Route Tables, network virtual appliances (NVAs) or Azure Firewall, ensuring that traffic between spokes is routed through the hub as intended. This approach enables a controlled routing model that is essential for maintaining security and architectural consistency in enterprise-scale Azure environments. Design principle 1: Controlled routing in the GatewaySubnet In hybrid connectivity scenarios, traffic originating from on-premises environments over VPN or ExpressRoute is first terminated by the Azure Virtual Network Gateway. From there, the traffic is injected into the Azure network using the routing context of the GatewaySubnet. By default, this process relies on system routes that are automatically populated through VNet peering. 
As a result, when the destination resides in a spoke VNet, the traffic is forwarded directly to that spoke, since its address space has already been learned and installed as a system route. While this behavior is efficient, it also means that traffic will bypass centralized security controls in the hub, such as Azure Firewall. To ensure that all incoming traffic is properly inspected, this default routing behavior needs to be adjusted. This is done by associating a custom Azure Route Table with the GatewaySubnet and defining user-defined routes for each spoke address range. These routes should point to the private IP address of the firewall as the next hop, effectively overriding the system routes created by VNet peering. Because Azure gives precedence to user-defined routes over system routes, traffic that would normally go directly to the spoke is instead redirected through the firewall before reaching its destination. It is important that these user-defined routes precisely match the CIDR ranges defined for the spoke VNets! Any mismatch, such as using broader or more specific prefixes, can lead to unexpected routing behavior and may introduce issues such as asymmetric traffic flows or packet loss. For instance, if a spoke uses address spaces like 10.10.10.0/24 and 192.168.10.0/24, these exact prefixes must be reflected in the route table. Only by aligning the custom routes with the advertised address ranges can you ensure predictable routing and consistent inspection through the firewall. If the hub VNet hosts additional resources beyond an Azure Firewall or third-party network virtual appliance that also require traffic inspection, the corresponding CIDR ranges—either for the specific subnets or for the entire hub VNet—should be included as routes in the route table associated with the GatewaySubnet. 
These routes should be configured in the same way as those for spoke VNets, ensuring that traffic destined for these resources is directed through the intended inspection point. A typical example is Azure DNS Private Resolver, which can include both inbound and outbound endpoints deployed in dedicated subnets. When such endpoints are present in the hub, their associated subnet address ranges must also be added to the route table for the GatewaySubnet. This ensures that traffic to and from these endpoints is routed through the designated inspection path, maintaining consistent enforcement of security controls.

Design principle 2: Controlled routing in every spoke

In a hub-and-spoke architecture, traffic flows should follow the intended security model. Workloads within the same spoke VNet are usually treated as part of the same trust boundary, so traffic between resources in that spoke can flow directly over the Azure backbone without needing to pass through centralized controls. Network Security Groups (NSGs) should still be used at the subnet level to provide granular, stateful filtering, but routing this traffic through a central firewall is typically not required.

The situation changes when traffic leaves the local VNet. As soon as traffic is destined for another spoke, the hub, or on-premises networks, it crosses a trust boundary and needs to be inspected centrally. To enforce this, Azure's default routing behavior must be overridden by associating an Azure Route Table with each subnet in the spoke VNets. In most cases, this route table can be kept simple by defining a single default route that sends all outbound, non-local traffic to the firewall in the hub:

Destination: 0.0.0.0/0
Next hop: Private IP address of the hub firewall (next hop type: Virtual Appliance)

With this configuration in place, traffic that is not local to the spoke and has no more specific route is forced through the hub, ensuring that communication between VNets and towards external networks is inspected and controlled. Note that traffic to the hub VNet's own address space still matches the more specific system route created by the peering; if resources hosted in the hub must also be inspected, add a user-defined route for the hub address range that points to the firewall.
From a management perspective, the same route table can often be reused across multiple subnets or even multiple VNets within the same subscription, which helps keep the design consistent and easy to maintain. It’s worth noting, however, that Azure requires route tables and the subnets they’re associated with to be in the same subscription, as this association is enforced by the platform. There is one additional setting that is often overlooked but plays an important role in getting routing right in a hub-and-spoke design. Azure route tables include an option called “Propagate gateway routes”, which controls whether routes learned by a Virtual Network Gateway are added to the effective routes of the associated subnets. By default, routes learned via BGP (for example from ExpressRoute or VPN) or defined through a Local Network Gateway are propagated not only within the hub VNet, but also across VNet peerings. This means that spoke VNets can automatically learn routes to on-premises or external networks and may send traffic directly to the gateway, bypassing the firewall in the hub. To avoid this and keep traffic flowing through the centralized security controls, this setting should be disabled on the route tables used by the spoke subnets. When “Propagate gateway routes” is set to No, routes learned by the gateway are no longer injected into the spokes. As a result, traffic to those destinations cannot take a direct path and instead follows the user-defined default route (0.0.0.0/0) toward the hub firewall, where it can be properly inspected. When combined with the default route to the firewall, this setup ensures that traffic—whether it is going to other VNets, on-premises environments, or external networks—always follows a controlled and predictable path through the hub. This helps maintain consistent security enforcement and avoids unexpected routing behavior in larger or hybrid deployments. 
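The route-selection behavior that design principles 1 and 2 rely on can be sketched in a few lines of Python. This is a simplified model, not an Azure API: it assumes only the two documented rules that the longest matching prefix wins and that, for identical prefixes, user-defined routes are preferred over BGP routes, which are preferred over system routes. It ignores platform exceptions, and the `Route` class, the next-hop labels, and the firewall address `10.0.1.4` are hypothetical names chosen for illustration.

```python
import ipaddress
from dataclasses import dataclass

# For identical prefixes: user-defined beats BGP, BGP beats system (simplified model).
SOURCE_PRIORITY = {"user": 0, "bgp": 1, "system": 2}


@dataclass(frozen=True)
class Route:
    prefix: str    # destination CIDR
    next_hop: str  # illustrative label, not an Azure API value
    source: str    # "user", "bgp" or "system"


def select_route(routes, destination):
    """Return the winning route: longest prefix first, then source priority."""
    ip = ipaddress.ip_address(destination)
    candidates = [r for r in routes if ip in ipaddress.ip_network(r.prefix)]
    if not candidates:
        return None
    return min(
        candidates,
        key=lambda r: (-ipaddress.ip_network(r.prefix).prefixlen,
                       SOURCE_PRIORITY[r.source]),
    )


FIREWALL = "firewall 10.0.1.4"  # hypothetical private IP of the hub firewall

# GatewaySubnet: peering injects a system route for the spoke range 10.10.10.0/24.
peering = Route("10.10.10.0/24", "vnet-peering", "system")
exact_udr = Route("10.10.10.0/24", FIREWALL, "user")  # matches the spoke CIDR exactly
broad_udr = Route("10.10.0.0/16", FIREWALL, "user")   # broader than the spoke CIDR

gateway_exact = select_route([peering, exact_udr], "10.10.10.5")
gateway_broad = select_route([peering, broad_udr], "10.10.10.5")

# Spoke subnet: default route to the firewall, with and without gateway propagation.
local_vnet = Route("10.10.10.0/24", "vnet-local", "system")
default_udr = Route("0.0.0.0/0", FIREWALL, "user")
onprem_bgp = Route("172.16.0.0/16", "virtual-network-gateway", "bgp")

spoke_no_propagation = select_route([local_vnet, default_udr], "172.16.1.10")
spoke_with_propagation = select_route(
    [local_vnet, default_udr, onprem_bgp], "172.16.1.10"
)

for name, route in [
    ("gateway, exact UDR", gateway_exact),
    ("gateway, broad UDR", gateway_broad),
    ("spoke, propagation off", spoke_no_propagation),
    ("spoke, propagation on", spoke_with_propagation),
]:
    print(f"{name}: {route.next_hop}")
```

In this model, the exact-match UDR in the GatewaySubnet sends traffic to the firewall, while the broader /16 UDR loses to the /24 peering route and the firewall is bypassed. In the spoke, the propagated on-premises /16 is more specific than the 0.0.0.0/0 default route, which is why gateway route propagation has to be disabled for the default route to take effect.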
Design principle 3: Peering the spokes to the hub Virtual Network (VNet) peering in Azure is often seen as a simple, single configuration, but in reality it is directional by design. To fully connect two VNets, you need two separate peering configurations—one in each direction—and both must be configured correctly to ensure not only connectivity, but also proper routing behavior. Each peering exposes four key settings and getting these right is especially important in a hub-and-spoke architecture. For basic connectivity, the first two settings—“allow virtual network access” and “allow forwarded traffic”—should be enabled on both peerings. These ensure that traffic can flow between VNets and support scenarios where traffic is routed through a central component, such as a firewall in the hub. The other two settings depend on the direction of the peering. In a typical hub-and-spoke setup, the Virtual Network Gateway (or Azure Route Server) is deployed in the hub. This means the peering from the spoke to the hub must enable “use remote gateways”, while the peering from the hub to the spoke must enable “allow gateway transit.” At first, this might seem to contradict the idea that spokes should not directly use the gateway. However, these settings influence control plane behavior and don't enable unrestricted traffic flow. They are required so the gateway can learn and advertise spoke address ranges via BGP to external networks, such as those connected over VPN or ExpressRoute. Whether those routes are actually used in the spokes is still controlled through the “propagate gateway routes” setting on the route tables, allowing you to enforce routing through the firewall as intended. Even if you are not currently using BGP—for example, in environments relying on static routing—it is still a good practice to configure peerings this way. Doing so makes the design future-proof, allowing you to introduce dynamic routing later without changes to the peering model. 
This approach keeps the architecture consistent and avoids unnecessary rework as the environment evolves. Design principle 4: Meshing the hubs When you extend a hub-and-spoke design across multiple regions, you typically introduce multiple hubs, each managing its own regional spokes. In this setup, it becomes important to connect the hubs to each other, which is done by fully meshing the hub VNets using VNet peering. At the same time, a key principle remains unchanged: each spoke should connect to only one hub in the same region. This keeps the architecture simple, scalable and easier to reason about from a routing perspective. When configuring connectivity between hubs, it’s important to note that VNet peering settings differ from the typical hub–spoke configuration. For inter-hub peerings, only “allow virtual network access” and “allow forwarded traffic” should be enabled. The remaining options—“allow gateway transit” and “use remote gateways”—should be left disabled, as gateway sharing is not required between hubs and would even be blocked in the configuration. Just connecting the hubs with peering is not enough to guarantee correct traffic flow. To ensure traffic moves between regions in a controlled and secure way, you need additional routing logic. Each hub should have an Azure Route Table assigned to its FirewallSubnet (or the subnet hosting the 3rd party NVAs) defining how traffic towards other hub-and-spoke environments is handled. This ensures that inter-region traffic is always routed through the appropriate hub firewall, instead of flowing directly across the Azure backbone. At this point, IP address planning becomes critical. Without a clear addressing strategy, routing quickly becomes complex and hard to maintain. A common best practice is to assign a single “master” CIDR range per region, and then allocate all VNets in that region—both hub and spokes—from that range. 
This creates a clean, hierarchical addressing model that simplifies routing decisions. With this approach in place, route tables can remain relatively simple. Instead of adding routes for every individual spoke, you only need one route per remote region. The destination is the master CIDR range of that region and the next hop is the private IP of the firewall in the corresponding hub. Because all hubs are peered with each other, these address ranges and firewall endpoints are automatically known through peering, allowing for consistent and predictable routing. Overall, this design keeps routing logic straightforward while ensuring that all inter-region traffic is inspected in the correct hub, preserving the security model and making it easy to scale as new regions are added. Conclusion When the four design principles described in this article are applied consistently, a hub-and-spoke architecture becomes a strong, scalable and easy-to-operate foundation for your network. By combining controlled routing, centralized inspection and clear traffic flows, the model delivers both solid security and predictable behavior, even in complex environments. More importantly, the concepts covered here go beyond just one specific design. They represent the key building blocks of Azure networking, including routing, peering and traffic control. Understanding these fundamentals not only helps you implement hub-and-spoke topologies correctly, but also gives you a solid base for designing and running reliable, enterprise-grade network architectures in Azure. To make this easier to apply in practice, the table below summarizes the main concepts from this article and how they translate into actual configuration. It can be useful both when setting up a hub-and-spoke topology and when troubleshooting existing environments. 
| Area | Configuration | Key Setting / Value | Purpose |
|---|---|---|---|
| Hub VNet | Deploy shared services | Azure Firewall or NVA in hub | Central inspection + routing |
| Hub VNet | Deploy connectivity | VPN Gateway / ExpressRoute in hub | Centralize hybrid connectivity |
| GatewaySubnet | Associate Route Table | UDRs for each spoke CIDR → Firewall IP | Force inbound traffic through firewall |
| Spoke Subnets | Associate Route Table | 0.0.0.0/0 → Firewall (Virtual Appliance) | Force all outbound traffic via hub |
| Spoke Subnets | Route Table setting | Propagate gateway routes = Disabled | Prevent bypass of firewall via gateway |
| VNet Peering (Spoke → Hub) | Setting | Allow VNet access = Yes | Basic connectivity |
| VNet Peering (Spoke → Hub) | Setting | Allow forwarded traffic = Yes | Support transitive routing via firewall |
| VNet Peering (Spoke → Hub) | Setting | Allow gateway transit = No | - |
| VNet Peering (Spoke → Hub) | Setting | Use remote gateways = Yes | Allow spoke to leverage hub gateway |
| VNet Peering (Hub → Spoke) | Setting | Allow VNet access = Yes | Basic connectivity |
| VNet Peering (Hub → Spoke) | Setting | Allow forwarded traffic = Yes | Support routing through firewall |
| VNet Peering (Hub → Spoke) | Setting | Allow gateway transit = Yes | Advertise spoke prefixes via hub gateway |
| VNet Peering (Hub → Spoke) | Setting | Use remote gateways = No | - |
| VNet Peering (Hub → Hub) | Setting | Allow VNet access = Yes | Basic connectivity |
| VNet Peering (Hub → Hub) | Setting | Allow forwarded traffic = Yes | Support transitive routing via firewall |
| VNet Peering (Hub → Hub) | Setting | Allow gateway transit = No | - |
| VNet Peering (Hub → Hub) | Setting | Use remote gateways = No | - |
| Hub FirewallSubnet | Associate Route Table | Route remote region CIDR → remote hub firewall IP | Ensure inter-region/hub routing |
| Addressing strategy | CIDR planning | Assign master CIDR per region | Simplify routing and reduce UDR complexity |
| Spoke design rule | Peering constraint | Each spoke connected to one hub only | Prevent routing ambiguity |

Azure virtual network terminal access point (TAP) public preview announcement
What is virtual network TAP?

Virtual network TAP allows customers to continuously stream virtual machine network traffic to a network packet collector or analytics tool. Many security and performance monitoring tools rely on packet-level insights that are difficult to access in cloud environments. Virtual network TAP bridges this gap by integrating with our industry partners to offer:

Enhanced security and threat detection: Security teams can inspect full packet data in real time to detect and respond to potential threats.
Performance monitoring and troubleshooting: Operations teams can analyze live traffic patterns to identify bottlenecks, troubleshoot latency issues, and optimize application performance.
Regulatory compliance: Organizations subject to compliance frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) can use virtual network TAP to capture network activity for auditing and forensic investigations.

Why use virtual network TAP?

Unlike traditional packet capture solutions that require deploying additional agents or network appliances, virtual network TAP leverages Azure's native infrastructure to enable seamless traffic mirroring without complex configurations and without impacting the performance of the virtual machine. A key advantage is that mirrored traffic does not count towards the virtual machine's network limits, ensuring complete visibility without compromising application performance. Additionally, virtual network TAP supports all Azure virtual machine SKUs.

Deploying virtual network TAP

The portal is a convenient way to get started with Azure virtual network TAP. However, if you have a lot of Azure resources and want to automate the setup, you may want to use PowerShell, the Azure CLI, or the REST API. Add a TAP configuration on a network interface that is attached to a virtual machine deployed in your virtual network.
The destination is a virtual network IP address in the same virtual network as the monitored network interface or in a peered virtual network. The collector solution for virtual network TAP can be deployed behind an Azure internal load balancer for high availability. You can use the same virtual network TAP resource to aggregate traffic from multiple network interfaces in the same or different subscriptions. If the monitored network interfaces are in different subscriptions, the subscriptions must be associated with the same Microsoft Entra tenant. Additionally, the monitored network interfaces and the destination endpoint for aggregating the TAP traffic can be in peered virtual networks in the same region.

Partnering with industry leaders to enhance network monitoring in Azure

To maximize the value of virtual network TAP, we are proud to collaborate with industry-leading security and network visibility partners. Our partners provide deep packet inspection, analytics, threat detection, and monitoring solutions that seamlessly integrate with virtual network TAP:

Network packet brokers

| Partner | Product |
|---|---|
| Gigamon | GigaVUE Cloud Suite for Azure |
| Keysight | CloudLens |

Security analytics, network/application performance management

| Partner | Product |
|---|---|
| Darktrace | Darktrace /NETWORK |
| Netscout | Omnis Cyber Intelligence NDR |
| Corelight | Corelight Open NDR Platform |
| LinkShadow | LinkShadow NDR |
| Fortinet | FortiNDR Cloud, FortiGate VM |
| cPacket | cPacket Cloud Suite |
| TrendMicro | Trend Vision One™ Network Security |
| Extrahop | RevealX |
| Bitdefender | GravityZone Extended Detection and Response for Network |
| eSentire | eSentire MDR |
| Vectra | Vectra NDR |
| AttackFence | AttackFence NDR |
| Arista Networks | Arista NDR |

See our partner blogs:

Bitdefender + Microsoft Virtual Network TAP: Deepening Visibility, Strengthening Security
Streamline Traffic Mirroring in the Cloud with Azure Virtual Network Terminal Access Point (TAP) and Keysight Visibility | Keysight Blogs
eSentire | Unlocking New Possibilities for Network Monitoring and…
LinkShadow Unified Identity, Data, and Network Platform Integrated with Microsoft Virtual Network TAP
Extrahop and Microsoft Extend Coverage for Azure Workloads
Resources | Announcing cPacket Partnership with Azure virtual network terminal access point (TAP)
Gain Network Traffic Visibility with FortiGate and Azure virtual network TAP

Get started with virtual network TAP

To learn more and get started, visit our website. We look forward to seeing how you leverage virtual network TAP to enhance security, performance, and compliance in your cloud environment. Stay tuned for more updates as we continue to refine and expand our feature set! If you have any questions, please reach out to us at azurevnettap@microsoft.com.

Azure Incident Retrospective - Please register! Session 2 - Tracking ID: 5GP8-W0G
Join our upcoming live webcast for a transparent discussion about this recent Azure service incident, led by our engineering teams.

Control plane issues in East US
Tracking ID: 5GP8-W0G | Impacted: 24-25 April 2026

What to expect
📚 Understand: What happened, how we responded, and what we learned
💬 Ask: Live Q&A with our engineering experts throughout the session
🛠 Learn: The fixes we've put in place and guidance for workload resiliency

Choose your session
Same content presented at both times; pick the one that works best for your timezone:

Session 1: 17:30 UTC, Thursday, 14 May 2026
9:30 AM US Pacific (PDT), 12:30 PM US Eastern (EDT), 5:30 PM London (BST), 1:30 AM +1 Beijing (CST), 4:30 AM +1 Sydney (AEDT), 6:30 AM +1 Auckland (NZDT)

Session 2: 05:30 UTC, Friday, 15 May 2026
9:30 PM -1 US Pacific (PDT), 12:30 AM US Eastern (EDT), 5:30 AM London (BST), 1:30 PM Beijing (CST), 4:30 PM Sydney (AEDT), 6:30 PM Auckland (NZDT)

Our engineering leaders
Deepak Bansal, Corporate Vice President, Technical Fellow, Azure Networking, Cloud+AI Engineering
Qi Zhang, Partner Software Engineering Manager, Azure Networking, Cloud+AI Engineering

⚠️ Prepare before the livestream
Read the Post Incident Review (PIR) ahead of time so you can ask any follow-up questions during the live Q&A.

Helpful resources
🔔 Azure Service Health Alerts: Get alerts for relevant incidents by setting up notifications via email, SMS, or webhook
🎥 Past Retrospective Recordings: Watch recordings of previous retrospective livestreams
📄 Azure Post Incident Reviews: Learn more about PIRs and the retrospective program

Azure Front Door: Implementing lessons learned following October outages
Abhishek Tiwari, Vice President of Engineering, Azure Networking Amit Srivastava, Principal PM Manager, Azure Networking Varun Chawla, Partner Director of Engineering Introduction Azure Front Door is Microsoft's advanced edge delivery platform encompassing Content Delivery Network (CDN), global security and traffic distribution into a single unified offering. By using Microsoft's extensive global edge network, Azure Front Door ensures efficient content delivery and advanced security through 210+ global and local points of presence (PoPs) strategically positioned closely to both end users and applications. As the central global entry point from the internet onto customer applications, we power mission critical customer applications as well as many of Microsoft’s internal services. We have a highly distributed resilient architecture, which protects against failures at the server, rack, site and even at the regional level. This resiliency is achieved by the use of our intelligent traffic management layer which monitors failures and load balances traffic at server, rack or edge sites level within the primary ring, supplemented by a secondary-fallback ring which accepts traffic in case of primary traffic overflow or broad regional failures. We also deploy a traffic shield as a terminal safety net to ensure that in the event of a managed or unmanaged edge site going offline, end user traffic continues to flow to the next available edge site. Like any large-scale CDN, we deploy each customer configuration across a globally distributed edge fleet, densely shared with thousands of other tenants. While this architecture enables global scale, it carries the risk that certain incompatible configurations, if not contained, can propagate broadly and quickly which can result in a large blast radius of impact. 
Here we describe how the two recent service incidents impacting Azure Front Door have reinforced the need to accelerate ongoing investments in hardening our resiliency and tenant isolation strategy, to reduce both the likelihood and the scale of impact from this class of risk.

October incidents: recap and key learnings

Azure Front Door experienced two service incidents, on October 9th and October 29th, both with customer-impacting service degradation.

On October 9th: A manual cleanup of stuck tenant metadata bypassed our configuration protection layer, allowing incompatible metadata to propagate beyond our canary edge sites. This metadata was created on October 7th, from a control-plane defect triggered by a customer configuration change. While the protection system initially blocked the propagation, the manual override operation bypassed our safeguards. This incompatible configuration reached the next stage and activated a latent data-plane defect in a subset of edge sites, causing availability impact primarily across Europe (~6%) and Africa (~16%). You can learn more about this issue in detail at https://aka.ms/AIR/QNBQ-5W8

On October 29th: A different sequence of configuration changes across two control-plane versions produced incompatible metadata. Because the failure mode in the data plane was asynchronous, the health check validations embedded in our protection systems all passed during the rollout. The incompatible customer configuration metadata successfully propagated globally through a staged rollout and also updated the "last known good" (LKG) snapshot. Following this global rollout, the asynchronous process in the data plane exposed another defect, which caused crashes. This impacted connectivity and DNS resolution for all applications onboarded to our platform. Extended recovery time amplified the impact on customer applications and Microsoft services.
You can learn more about this issue in detail at https://aka.ms/AIR/YKYN-BWZ

We took away a number of clear and actionable lessons from these incidents, which are applicable not just to our service, but to any multi-tenant, high-density, globally distributed system.

· Configuration resiliency – Valid configuration updates should propagate safely, consistently, and predictably across our global edge, while incompatible or erroneous configuration must never propagate beyond canary environments.
· Data plane resiliency – Configuration processing in the data plane must not cause availability impact to any customer.
· Tenant isolation – Traditional isolation techniques such as hardware partitioning and virtualization are impractical at edge sites. This requires innovative sharding techniques to ensure single tenant-level isolation, a must-have to reduce potential blast radius.
· Accelerated and automated recovery time objective (RTO) – The system should be able to automatically revert to the last known good configuration within an acceptable RTO. For a service like Azure Front Door, we deem ~10 minutes to be a practical RTO for our hundreds of thousands of customers at every edge site.

After the outage, given the severity of an impact that allowed an incompatible configuration to propagate globally, we made the difficult decision to temporarily block configuration changes in order to expedite the rollout of additional safeguards. Between October 29th and November 5th, we prioritized and deployed immediate hardening steps before reopening configuration changes. We are confident that the system is stable, and we are continuing to invest in additional safeguards to further strengthen the platform's resiliency.
Learning category | Goal | Repairs | Status
Safe customer configuration deployment | Incompatible configuration never propagates beyond Canary | Control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state | Completed
Data plane resiliency | Configuration processing cannot impact data plane availability | Manage data-plane lifecycle to prevent outages caused by configuration-processing defects | Completed
Data plane resiliency | Configuration processing cannot impact data plane availability | Isolated worker process in every data plane server to process and load the configuration | January 2026
100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the batch of critical services impacted by the October 29th outage, running on a day-old configuration | Completed
100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services | Phase 2: Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services | March 2026
Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed
Recovery improvements | Data plane crash recovery in under 10 minutes | Accelerate recovery time to < 10 minutes | March 2026
Tenant isolation | No configuration or traffic regression can impact other tenants | Micro cellular Azure Front Door with ingress layered shards | June 2026

This blog is the first in a multi-part series on Azure Front Door resiliency. In this blog, we will focus on configuration resiliency: how we are making the configuration pipeline safer and more robust. Subsequent blogs will cover tenant isolation and recovery improvements.

How our configuration propagation works

Azure Front Door configuration changes can be broadly classified into three distinct categories.

Service code & data – This includes all aspects of the Azure Front Door service, such as the management plane, control plane, data plane, and configuration propagation system.
Azure Front Door follows a safe deployment practice (SDP) process to roll out newer versions of management, control or data plane over a period of approximately 2-3 weeks. This ensures that any regression in software does not have a global impact. However, latent bugs that escape pre-validation and SDP rollout can remain undetected until a specific combination of customer traffic patterns or configuration changes trigger the issue. Web Application Firewall (WAF) & L7 DDoS platform data – These datasets are used by Azure Front Door to deliver security and load-balancing capabilities. Examples include GeoIP data, malicious attack signatures, and IP reputation signatures. Updates to these datasets occur daily through multiple SDP stages with an extended bake time of over 12 hours to minimize the risk of global impact during rollout. This dataset is shared across all customers and the platform, and it is validated immediately since it does not depend on variations in customer traffic or configuration steps. Customer configuration data – Examples of these are any customer configuration change—whether a routing rule update, backend pool modification, WAF rule change, or security policy change. Due to the nature of these changes, it is expected across the edge delivery / CDN industry to propagate these changes globally in 5-10 mins. Both outages stemmed from issues within this category. All configuration changes, including customer configuration data, are processed through a multi-stage pipeline designed to ensure correctness before global rollout across Azure Front Door’s 200+ edge locations. At a high level, Azure Front Door’s configuration propagation system has two distinct components - Control plane – Accepts customer API/portal changes (create/update/delete for profiles, routes, WAF policies, origins, etc.) and translates them into internal configuration metadata which the data plane can understand. 
Data plane – Globally distributed edge servers that terminate client traffic, apply routing/WAF logic, and proxy to origins using the configuration produced by the control plane.

Between these two halves sits a multi-stage configuration rollout pipeline with a dedicated protection system (known as ConfigShield):
· Changes flow through multiple stages (pre-canary, canary, expanding waves to production) rather than going global at once.
· Each stage is health-gated: the data plane must remain within strict error and latency thresholds before proceeding.
· Each stage’s health check also rechecks the previous stage’s health for any regressions.
· A successfully completed rollout updates a last known good (LKG) snapshot used for automated rollback.
· Historically, rollout targeted global completion in roughly 5–10 minutes, in line with industry standards.

Customer configuration processing in Azure Front Door data plane stack

Customer configuration changes in Azure Front Door traverse multiple layers, from the control plane through the deployment system, before being converted into FlatBuffers at each Azure Front Door node. These FlatBuffers are then loaded by the Azure Front Door data plane stack, which runs as Kubernetes pods on every node.

FlatBuffer composition: Each FlatBuffer references several sub-resources such as WAF and Rules Engine schematic files, SSL certificate objects, and URL signing secrets.

Data plane architecture:
o Master process: Accepts configuration changes (memory-mapped files with references) and manages the lifecycle of worker processes.
o Workers: L7 proxy processes that serve customer traffic using the applied configuration.

Processing flow for each configuration update:
· Load and apply in master: The transformed configuration is loaded and applied in the master process. Cleanup of unused references occurs synchronously except for certain categories. → The October 9 outage occurred during this step due to a crash triggered by incompatible metadata.
· Apply to workers: Configuration is applied to all worker processes without memory overhead (FlatBuffers are memory-mapped).
· Serve traffic: Workers start consuming new FlatBuffers for new requests; in-flight requests continue using old buffers. Old buffers are queued for cleanup post-completion.
· Feedback to deployment service: Positive feedback signals readiness for rollout.
· Cleanup: FlatBuffers are freed asynchronously by the master process after all workers load updates. → The October 29 outage occurred during this step due to a latent bug in reference counting logic.

The October incidents showed we needed to strengthen key aspects of configuration validation, propagation safeguards, and runtime behavior. During the Azure Front Door incident on October 9th, the protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During the Azure Front Door incident on October 29th, the incompatible customer configuration metadata progressed through the protection system before the delayed asynchronous processing task resulted in the crash.

Configuration propagation safeguards

Based on learnings from the incidents, we are implementing a comprehensive set of configuration resiliency improvements. These changes aim to guarantee that no sequence of configuration changes can trigger instability in the data plane, and to ensure quicker recovery in the event of anomalies.

Strengthening configuration generation safety

This improvement pivots on a ‘shift-left’ strategy: we want to catch regressions early, before they propagate to production. It also includes fixing the latent defects which were the proximate cause of the outages.

Fixing outage-specific defects - We have fixed the control-plane defects that could generate incompatible tenant metadata under specific operation sequences. We have also remediated the associated data-plane defects.
Stronger cross-version validation - We are expanding our test and validation suite to account for changes across multiple control plane build versions. This is expected to be fully completed by February 2026.

Fuzz testing - Automated fuzzing and testing of the metadata generation contract between the control plane and the data plane. This allows us to generate an expanded set of invalid or unexpected configuration combinations which might not be achievable by traditional test cases alone. This is expected to be fully completed by February 2026.

Preventing incompatible configurations from being propagated

This segment of the resiliency strategy strives to ensure that a potentially dangerous configuration change never propagates beyond the canary stage.

Protection system is “always-on” - Enhancements to operational procedures and tooling prevent bypass in all scenarios (including internal cleanup/maintenance), and any cleanup must flow through the same guarded stages and health checks as standard configuration changes. This is completed.

Making rollout behavior more predictable and conservative - Configuration processing in the data plane is now fully synchronous. Every data plane issue due to incompatible metadata can be detected within 10 seconds at every stage. This is completed.

Enhancement to deployment pipeline - Additional stages during rollout and extended bake time between stages serve as an additional safeguard during configuration propagation. This is completed.

Recovery tool improvements - It is now easier to revert to any previous version of LKG with a single click. This is completed.

These changes significantly improve system safety. Post-outage, we have increased the configuration propagation time to approximately 45 minutes. We are working towards reducing configuration propagation time closer to pre-incident levels once the additional safeguards covered in the Data plane resiliency section below are completed by mid-January 2026.
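To make the staged, health-gated rollout and last known good (LKG) rollback described above more concrete, here is a minimal Python sketch. It is an illustration only and does not represent ConfigShield's actual implementation: the names `Stage`, `Rollout`, and the `is_healthy` probe are hypothetical, and a real health gate would evaluate error and latency thresholds over a bake period rather than call a single function.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class Stage:
    """A hypothetical rollout stage, e.g. pre-canary, canary, or a production wave."""
    name: str
    sites: List[str]


@dataclass
class Rollout:
    stages: List[Stage]
    # Hypothetical health probe: (site, config_version) -> True if the site is healthy.
    is_healthy: Callable[[str, int], bool]
    last_known_good: int = 0
    deployed: Dict[str, int] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Every site starts on the last known good version.
        for stage in self.stages:
            for site in stage.sites:
                self.deployed.setdefault(site, self.last_known_good)

    def propagate(self, version: int) -> bool:
        """Roll a version out stage by stage; stop and roll back on any failed gate."""
        touched: List[Stage] = []
        for stage in self.stages:
            for site in stage.sites:
                self.deployed[site] = version
            touched.append(stage)
            # Gate: the current stage AND all earlier stages must still be healthy.
            healthy = all(
                self.is_healthy(site, version)
                for done in touched
                for site in done.sites
            )
            if not healthy:
                self._rollback(touched)
                return False
        # Only a fully completed rollout updates the LKG snapshot.
        self.last_known_good = version
        return True

    def _rollback(self, touched: List[Stage]) -> None:
        # Revert only the sites that received the new version.
        for stage in touched:
            for site in stage.sites:
                self.deployed[site] = self.last_known_good
```

In this sketch a failed gate at the first stage means later stages never receive the candidate version, which is the containment property the safeguards above aim for. Note that a gate of this kind can only catch failures that show up while the gate is being evaluated, which is why synchronous configuration processing matters.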
Data plane resiliency The data plane recovery was the toughest part of recovery efforts during the October incidents. We must ensure fast recovery as well as resilience to configuration processing related issues for the data plane. To address this, we implemented changes that decouple the data plane from incompatible configuration changes. With these enhancements, the data plane continues operating on the last known good configuration—even if the configuration pipeline safeguards fail to protect as intended. Decoupling data plane from configuration changes Each server’s data plane consists of a master process which accepts configuration changes and manages lifecycle of multiple worker processes which serve customer traffic. One of the critical reasons for the prolonged outage in October was that due to latent defects in the data plane, when presented with a bad configuration the master process crashed. The master is a critical command-and-control process and when it crashes it takes down the entire data plane, in that node. Recovery of the master process involves reloading hundreds of thousands of configurations from scratch and took approximately 4.5 hours. We have since made changes to the system to ensure that even in the event of the master process crash due to any reason - including incompatible configuration data being presented - the workers remain healthy and able to serve traffic. During such an event, the workers would not be able to accept new configuration changes but will continue to serve customer traffic using the last known good configuration. This work is completed. Introducing Food Taster: strengthening config propagation resiliency In our efforts to further strengthen Azure Front Door’s configuration propagation system, we are introducing an additional configuration safeguard known internally as Food Taster which protects the master and worker processes from any configuration change related incidents, thereby ensuring data plane resiliency. 
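As a rough illustration of this kind of safeguard, the Python sketch below processes each candidate configuration in a separate, isolated process first, and only lets the serving side adopt it if that process finishes cleanly. This is not Azure Front Door's implementation: the `taste` function, the `Master` class, the JSON configuration shape, and the `poison` flag that simulates a crash are all assumptions made for the example.

```python
import json
import subprocess
import sys

# Hypothetical taster program. It runs in its own interpreter, so a crash
# while processing a bad configuration cannot take down the calling process.
TASTER_SOURCE = r"""
import json, os, sys
config = json.load(sys.stdin)
if config.get("poison"):      # stand-in for a latent defect hit by bad metadata
    os._exit(139)             # simulate a hard crash of the taster process
routes = config.get("routes")
if not isinstance(routes, list) or not routes:
    sys.exit(2)               # clean rejection: the configuration is invalid
print("Config OK")
"""


def taste(config: dict, timeout_s: float = 10.0) -> bool:
    """Return True only if the isolated taster fully processed the config."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", TASTER_SOURCE],
            input=json.dumps(config),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False  # a hung taster counts as a rejection
    return result.returncode == 0 and result.stdout.strip() == "Config OK"


class Master:
    """Hypothetical stand-in for the process that owns the active configuration."""

    def __init__(self, last_known_good: dict) -> None:
        self.active = last_known_good

    def apply(self, candidate: dict) -> bool:
        # Synchronous: nothing is loaded until the taster has returned a verdict.
        if not taste(candidate):
            return False  # keep serving the last known good configuration
        self.active = candidate
        return True
```

For example, `Master({"routes": ["/a"]}).apply({"routes": ["/b"], "poison": True})` returns `False` and leaves the active configuration unchanged, because the simulated crash is contained in the taster process.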
The principle is simple: every data-plane server will have a redundant and isolated process – the Food Taster – whose only job is to ingest and process new configuration metadata first and then pass validated configuration changes to active data plane. This redundant worker does not accept any customer traffic. All configuration processing in this Food Taster is fully synchronous. That means we do all parsing, validation, and any expensive or risky work up front, and we do not move on until the Food Taster has either proven the configuration is safe or rejected it. Only when the Food Taster successfully loads the configuration and returns “Config OK” does the master process proceed to load the same config and then instruct the worker processes to do the same. If anything goes wrong in the Food Taster, the failure is contained to that isolated worker; the master and traffic-serving workers never see that invalid configuration. We expect this safeguard to reach production globally in January 2026 timeframe. Introduction of this component will also allow us to return closer to pre-incident level of configuration propagation while ensuring data plane safety. Closing This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO). We deeply value our customers’ trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. 
By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we’re making Azure Front Door even more resilient so your applications can be too.

Consistent DNS resolution in a hybrid hub spoke network topology
DNS is one of the most essential networking services, next to IP routing. A modern hybrid cloud network may have various sources of DNS: Azure Private DNS Zones, public DNS, domain controllers, etc. Some organizations may also prefer to route their public Internet DNS queries through a specific DNS provider. Therefore, it is crucial to ensure consistent DNS resolution across the whole (hybrid) network. This article describes how DNS Private Resolver can be leveraged to build such an architecture.