Private subnets by default in Azure Virtual Networks: What changed and how to use NAT Gateway
Azure is evolving to better support secure-by-default cloud architectures. Starting with API version 2025-07-01 (released after March 31, 2026), newly created virtual networks default to private subnets. This change removes the long-standing platform behavior of automatically enabling outbound internet access through implicit public IPs, also known as default outbound access (DOA). As a result, newly deployed virtual machines will not have public outbound connectivity unless it is explicitly configured.

What changed?

Previously, Azure automatically assigned a hidden, Microsoft-owned public IP to virtual machines deployed without an explicit outbound method (such as NAT Gateway, Load Balancer outbound rules, or instance-level public IPs). This allowed public outbound connectivity without requiring customer configuration. While convenient, this model introduced challenges:

- Security – implicit internet access conflicts with Zero Trust principles.
- Reliability – platform-managed outbound IPs can change unexpectedly.
- Operational consistency – VMSS instances or multi-NIC VMs may egress using different default outbound IPs.

With API version 2025-07-01 and later:

- Subnets in newly created VNets are private by default.
- The subnet property `defaultOutboundAccess` is set to false.
- Azure no longer assigns implicit outbound public IPs.

This applies across deployment methods, including Portal, ARM/Bicep, CLI, and PowerShell. The Portal started using the new model on April 1, 2026. Note: this change does not yet apply to Terraform.

Am I impacted by this change?
| Deployment scenario | Behavior |
| --- | --- |
| Existing VNets or VMs using DOA | ✅ Unchanged |
| New VMs in existing VNets | ✅ Unchanged |
| Subnets already using explicit outbound | ✅ Continue using configured outbound method |
| New VMs in new VNets (subnets created with API 2025-07-01 or later) | 🔒 Subnets private by default |
| New VMs in private subnets without explicit outbound configured | ❌ No public outbound connectivity |

Existing workloads are not impacted. If required, you can still create new subnets without the private setting by choosing the appropriate configuration option during creation; see the FAQ section of this blog for more information. However, we strongly recommend transitioning to an explicit outbound method so that:

- Your workloads won't be affected by public IP address changes.
- You have greater control over how your VMs connect to public endpoints.
- Your VMs use traceable IP resources that you own.

When is outbound connectivity required?

If your virtual network contains virtual machines, you must configure explicit outbound connectivity. Common scenarios that require it:

- Virtual machine operating system activation and updates (Windows or Linux).
- Pulling container images from public registries (Docker Hub or Microsoft Container Registry).
- Accessing third-party SaaS or public APIs.

Virtual machine scale sets using flexible orchestration mode are always secure by default and therefore require an explicit outbound method. Private subnets don't apply to delegated or managed subnets that host PaaS services; in those cases the service handles outbound connectivity. See the service-specific documentation for details.
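If you are unsure whether a given subnet is affected, you can inspect the `defaultOutboundAccess` property directly. A quick sketch with Azure CLI (resource names are placeholders; the property path follows the subnet API described above):

```shell
# Check whether a subnet still hands out default outbound access.
# A value of false means the subnet is private; names are placeholders.
az network vnet subnet show \
  --resource-group rgname \
  --vnet-name vnetname \
  --name subnetname \
  --query "defaultOutboundAccess" \
  --output tsv
```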
Recommended outbound connectivity method: StandardV2 NAT Gateway

Azure now recommends using an explicit outbound connectivity method such as:

- NAT Gateway
- Load Balancer outbound rules
- Public IP assigned to the VM
- Network Virtual Appliance (NVA) / Firewall

Among these, Azure StandardV2 NAT Gateway is the recommended method for scalable and resilient outbound connectivity. StandardV2 NAT Gateway:

- Provides zone-redundancy by default in supported regions
- Supports up to 100 Gbps throughput
- Provides dual-stack support with IPv4 and IPv6 public IPs
- Uses customer-owned static public IPs
- Enables outbound connectivity without allowing inbound internet access
- Requires no route table configuration when associated to a subnet

When configured, NAT Gateway automatically becomes the subnet's default outbound path and takes precedence over Load Balancer outbound rules and VM instance-level public IPs. Note: a UDR directing 0.0.0.0/0 traffic to a virtual appliance/firewall takes precedence over NAT Gateway.

Migrate from Default Outbound Access to NAT Gateway

To transition from DOA to Azure's recommended outbound method, StandardV2 NAT Gateway:

1. Go to your virtual network in the portal and select the subnet you want to modify.
2. In the Edit subnet menu, select the 'Enable private subnet' checkbox under the Private subnet section. Enabling private subnet can also be done through other supported clients; here is a CLI example in which the default-outbound parameter is set to false:

```shell
az network vnet subnet update \
  --resource-group rgname \
  --name subnetname \
  --vnet-name vnetname \
  --default-outbound false
```

3. Deploy a StandardV2 NAT gateway resource.
4. Associate one or more StandardV2 public IP addresses or prefixes.
5. Attach the NAT gateway to the target subnet.
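Steps 3 to 5 can also be scripted. A minimal Azure CLI sketch with placeholder resource names; note that the `--sku Standard` value shown is an assumption, and selecting the StandardV2 SKU may require different values or a newer CLI version (check `az network nat gateway create --help`):

```shell
# Create a public IP for the NAT gateway (SKU value is an assumption;
# verify StandardV2 naming in your CLI version).
az network public-ip create \
  --resource-group rgname \
  --name natgw-pip \
  --sku Standard

# Create the NAT gateway and associate the public IP.
az network nat gateway create \
  --resource-group rgname \
  --name natgwname \
  --public-ip-addresses natgw-pip

# Attach the NAT gateway to the target subnet.
az network vnet subnet update \
  --resource-group rgname \
  --vnet-name vnetname \
  --name subnetname \
  --nat-gateway natgwname
```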
Once associated:

- All new outbound traffic from that subnet uses NAT Gateway automatically
- VM-level public IPs are no longer required
- Existing outbound connections are not interrupted

Note: Enabling private subnet on an existing subnet will not affect any VMs already using default outbound IPs. Private subnet only ensures that new VMs don't receive a default outbound public IP. For step-by-step guidance, see migrate default outbound access to NAT Gateway.

FAQ

1. Will my existing workloads lose outbound connectivity?

No. Workloads currently using default outbound IPs are not impacted by this change. The private-subnets-by-default update only affects:

- Newly created VNets
- New subnets created using the updated API (2025-07-01)
- New virtual machines deployed into those subnets using the updated API

VMs and subnets using an explicit outbound connectivity method, such as a NAT gateway, an NVA/firewall, a VM instance-level public IP, or Load Balancer outbound rules, are not impacted by this change.

2. Why can't my new VM reach the internet or other public endpoints within Microsoft (e.g. VM activation, updates)?

New subnets are private by default. If your deployment does not include an explicit outbound method, such as a NAT Gateway, public IP, Load Balancer outbound rule, or NVA/firewall, outbound connectivity is not automatically enabled.

3. My workload has a dependency on default outbound IPs and isn't ready to move to private subnets. What should I do?

You can opt out of the default private subnet setting by disabling the private subnet feature. You can do this in the portal by clearing the private subnet checkbox. Disabling private subnet can also be done through other supported clients; here is a CLI example in which the default-outbound parameter is set to true:

```shell
az network vnet subnet update \
  --resource-group rgname \
  --name subnetname \
  --vnet-name vnetname \
  --default-outbound true
```

4. Why do I see an alert showing that I have a default outbound IP on my VM?
There's a NIC-level property, `defaultOutboundConnectivityEnabled`, that tracks whether a default outbound IP is allocated to a VM or Virtual Machine Scale Set instance. If one is detected, the Azure portal displays a notification banner and generates Azure Advisor recommendations about disabling default outbound connectivity for your VMs/VMSS.

5. How do I clear this alert?

To remove the default outbound IP and clear the alert:

1. Configure a StandardV2 NAT gateway (or other explicit outbound method).
2. Set your subnet to private by setting the subnet property defaultOutboundAccess = false using one of the supported clients.
3. Stop and deallocate any applicable virtual machines (this removes the default outbound IP currently associated with the VM).

6. I have a NAT gateway (or a UDR pointing to an NVA) configured for my private subnet. Why do I still see this alert?

In some cases, a default outbound IP is still assigned to virtual machines in a non-private subnet, even when an explicit outbound method, such as a NAT gateway or a UDR directing traffic to an NVA/firewall, is configured. This does not mean that the default outbound IP is used for egress traffic. To fully remove the assignment (and clear the alert):

1. Set the subnet to private.
2. Stop and deallocate the affected virtual machines.

Summary

The move to private subnets by default improves the security posture of Azure networking deployments by removing implicit outbound internet access. Customers deploying new workloads must now explicitly configure outbound connectivity. StandardV2 NAT Gateway provides a scalable, resilient method for enabling outbound internet access without exposing workloads to inbound connections or relying on platform-managed IPs.

Learn more

- Default Outbound Access
- StandardV2 NAT Gateway
- Migrate Default Outbound Access to StandardV2 NAT Gateway

Azure VNet Data Gateway for Secure Power BI & Power Platform Access in Enterprises
What is a VNet data gateway?

The VNet data gateway is a Microsoft-managed gateway service that runs inside a delegated subnet of an Azure Virtual Network. It allows supported Microsoft cloud services, such as Power BI, Power Platform dataflows, and Microsoft Fabric workloads, to securely connect to data sources that are protected using private networking.

Key characteristics:

- No customer-managed VM or container
- No OS, patching, or gateway software upgrades
- Gateway lifecycle fully managed by Microsoft
- Traffic stays on the Azure backbone network
- Works seamlessly with Private Endpoints

This makes it ideal for enterprise and regulated environments where security and operational efficiency are equally important.

Why enterprises need the VNet data gateway

Eliminates gateway infrastructure management. Traditional gateways require:

- Virtual machines
- High availability setup
- OS patching and scaling
- Monitoring and troubleshooting

With the VNet data gateway:

- Microsoft manages the compute lifecycle
- No VM or gateway software to maintain
- No HA or load balancer design needed

✅ Result: significant reduction in operational and maintenance overhead for platform and infrastructure teams.

Secure access to private Azure resources. Most enterprise Azure environments use:

- Private Endpoints
- NSGs and route tables
- Firewalls blocking public access

The VNet data gateway:

- Is injected into a delegated subnet in your VNet
- Uses private IP addressing
- Enforces NSG and UDR rules
- Communicates with Microsoft services over a Microsoft-managed internal tunnel

✅ Result: data sources remain fully private; no public endpoints or inbound ports required.

Designed for Power Platform and Power BI at scale. The gateway supports secure access for:

- Power BI semantic models
- Power BI paginated reports
- Microsoft Fabric Dataflow Gen2
- Fabric pipelines and copy jobs

Because it's cloud-native and centrally managed, the VNet data gateway scales well in large enterprises standardizing on Power Platform and Fabric.
High-level architecture overview

At runtime, the VNet data gateway works as follows:

1. A query is initiated from Power BI / Power Platform.
2. Query details and credentials are sent to the Microsoft Power Platform VNet service.
3. A containerized gateway instance is injected into the delegated subnet.
4. The gateway connects to the private data source using private networking.
5. Results are sent back to Power BI or Power Platform via a Microsoft-managed internal tunnel.

Key security highlights:

- No inbound connectivity
- No public IP exposure
- Traffic remains on the Azure backbone
- Full enforcement of NSGs and routing rules

Key enterprise benefits

- Least management overhead – no gateway servers
- Zero Trust aligned – private-only connectivity
- Fully managed by Microsoft
- Enterprise-grade security and governance
- Works with Azure Private Endpoint architectures

When to use the VNet data gateway

| Scenario | Recommendation |
| --- | --- |
| Azure private PaaS services | ✅ VNet data gateway |
| Private Endpoint–only access | ✅ VNet data gateway |
| Zero Trust network model | ✅ VNet data gateway |
| Minimal ops and maintenance | ✅ VNet data gateway |
| On-prem only, no Azure | ❌ Traditional gateway |

Step-by-step configuration: VNet data gateway (enterprise setup)

High-level flow (what you will configure):

1. Register the required Azure resource provider
2. Prepare the Azure Virtual Network and subnet
3. Configure private connectivity to the data source
4. Create the VNet data gateway
5. Create and bind data source connections
6. Validate with Power BI / Power Platform workloads

Step 1: Register the Microsoft.PowerPlatform resource provider

Why this step is required: the VNet data gateway is a Microsoft-managed service that is injected into your Azure VNet. Azure must explicitly allow Power Platform to deploy managed infrastructure into your subscription.
Configuration steps:

1. Sign in to the Azure portal.
2. Navigate to Subscriptions.
3. Select the subscription that hosts the target VNet.
4. Go to Resource providers.
5. Search for Microsoft.PowerPlatform.
6. Click Register.

✅ Status must show Registered. This step enables subnet delegation to Power Platform services.

Step 2: Prepare the Azure Virtual Network

Why this step is required: the gateway runs inside your VNet. It must be placed in a dedicated, delegated subnet to maintain isolation and security boundaries.

Requirements:

- The VNet can be in any Azure region
- The subnet must be exclusive to the VNet data gateway
- The subnet must have outbound connectivity to the data source

Configuration steps:

1. Go to Azure portal → Virtual networks.
2. Select your existing VNet (or create one).
3. Navigate to Subnets → + Subnet.
4. Configure:
   - Subnet name: snet-vnet-datagateway
   - Address range: /27 or larger (recommended)
   - Subnet delegation: Microsoft.PowerPlatform/vnetaccesslinks
5. Save the subnet.

⚠️ Do not place any VMs, application gateways, or other workloads in this subnet.

Step 3: Configure private connectivity to the data source

Why this step is required: enterprises typically block public access to PaaS services. The VNet data gateway is designed to work natively with private endpoints.

Example: Azure SQL / SQL Managed Instance

1. Create a Private Endpoint for the data service.
2. Attach it to the same VNet (it can be in a different subnet).
3. Create or link a Private DNS Zone, for example: privatelink.database.windows.net.
4. Link the Private DNS Zone to the VNet.
5. Ensure DNS resolution from the delegated subnet resolves to the private IP.

✅ This ensures all traffic remains private and internal.

Step 4: Create the VNet data gateway

Why this step is required: this is where the actual Microsoft-managed gateway is logically created and associated with your VNet.

Configuration steps: you can do this from either the Power BI Service or the Power Platform Admin Center.
Using the Power Platform Admin Center:

1. Go to https://admin.powerplatform.microsoft.com.
2. Select Data → Gateways.
3. Click + New → Virtual network data gateway.
4. Provide:
   - Gateway name
   - Azure subscription
   - Resource group
   - Virtual network
   - Delegated subnet
5. Click Create.

📌 Notes:

- Gateway metadata is stored in the Power BI tenant home region
- The gateway runtime executes in the VNet region
- No VM or scale settings are required

Step 5: Create and configure data source connections

Why this step is required: the gateway exists, but Power BI / Power Platform must know which data sources can be accessed through it.

Configuration steps (Power BI example):

1. Go to the Power BI Service.
2. Navigate to Settings → Manage connections and gateways.
3. Select the newly created VNet data gateway.
4. Click + New connection.
5. Provide:
   - Data source type (Azure SQL, Storage, Databricks, etc.)
   - Server / endpoint name (private DNS name)
   - Authentication (SQL / Entra ID)
6. Save the connection.
7. Assign users or security groups.

✅ This step enables governance and access control.
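The Azure-side preparation in Steps 1 and 2 (provider registration and subnet delegation) can also be scripted. A minimal Azure CLI sketch, assuming placeholder resource names and address ranges:

```shell
# Step 1: register the Power Platform resource provider.
az provider register --namespace Microsoft.PowerPlatform

# Step 2: create the dedicated subnet delegated to Power Platform.
# Names and the address range are placeholders.
az network vnet subnet create \
  --resource-group rgname \
  --vnet-name vnetname \
  --name snet-vnet-datagateway \
  --address-prefixes 10.0.3.0/27 \
  --delegations Microsoft.PowerPlatform/vnetaccesslinks
```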
Step 6: Use the gateway in Power BI / Power Platform

Power BI:

1. Open the dataset or semantic model settings.
2. Under Gateway connection, select Use a data gateway.
3. Choose the VNet data gateway.
4. Apply changes.
5. Refresh or run queries.

Power Platform / Fabric: select the same connection when configuring:

- Dataflows Gen2
- Fabric pipelines
- Copy jobs

Step 7: Validate and test

Validation checklist:

- ✅ DNS resolves to a private IP
- ✅ No public endpoint access enabled
- ✅ NSGs allow outbound traffic to the data source
- ✅ Dataset refresh succeeds
- ✅ No gateway VM exists in the subscription

Optional:

- Enable logging and auditing from Power BI / Fabric
- Monitor gateway health in the Admin Center

Key enterprise design guidance (best practices)

- Use one gateway per environment tier (Prod / Non-Prod)
- Use dedicated VNets for data access where possible
- Use Private Endpoints only (avoid service endpoints)
- Control access via AAD groups, not individuals
- Avoid mixing the gateway subnet with other workloads

Conclusion: for enterprises looking to consume Power Platform, Power BI, and Microsoft Fabric securely while keeping operational overhead close to zero, the VNet data gateway is the recommended approach. It removes gateway infrastructure complexity, strengthens security posture, and aligns perfectly with modern Azure landing zone and Zero Trust architectures.

Introducing the Container Network Insights Agent for AKS: Now in Public Preview
We are thrilled to announce the public preview of the Container Network Insights Agent: agentic AI network troubleshooting for your workloads running in Azure Kubernetes Service (AKS).

The challenge

AKS networking is layered by design: Azure CNI, eBPF, Cilium, CoreDNS, NetworkPolicy, CiliumNetworkPolicy, Hubble. Each layer contributes capabilities, and some of these can fail silently in ways the surrounding layers cannot observe. When something breaks, the evidence usually exists. Operators already have the tools: Azure Monitor for metrics, Container Insights for cluster health, Prometheus and Grafana for dashboarding, Cilium and Hubble for pod network observation, and kubectl for direct inspection. However, correlating different signals and identifying the root cause takes time.

Imagine this scenario: an application performance alert fires. The on-call engineer checks dashboards, reviews events, and inspects pod health. Each tool shows its own slice, but the root cause usually lives in the relationship between signals, not in any single tool. So the real work begins: manually cross-referencing Hubble flows, NetworkPolicy specs, DNS state, node-level stats, and verdicts. Each check is a separate query, a separate context switch, a separate mental model of how the layers interact. This process is manual, slow, demands domain knowledge, and does not scale. Mean time to resolution (MTTR) stays high not because engineers lack skill, but because the investigation surface is wide and the interactions between the layers are complex.

The solution: Container Network Insights Agent

The Container Network Insights Agent is agentic AI that simplifies and speeds up AKS network troubleshooting. Rather than replacing your existing observability tools, the agent correlates signals on demand to help you quickly identify and resolve network issues. You describe a problem in natural language, and the agent runs a structured investigation across layers.
It delivers a diagnosis with the evidence, the root cause, and the exact commands to fix it.

The Container Network Insights Agent gets its visibility through two data sources:

- AKS MCP server: the agent integrates with the AKS MCP (Model Context Protocol) server, a standardized and secure interface to kubectl, Cilium, and Hubble. Every diagnostic command runs through the same tools operators already use, via a well-defined protocol that enforces security boundaries. No ad-hoc scripts, no custom API integrations.
- Linux networking plugin: for diagnostics that require visibility below the Kubernetes API layer, the agent collects kernel-level telemetry directly from cluster nodes. This includes NIC ring buffer stats, kernel packet counters, SoftIRQ distribution, and socket buffer utilization. This is how it pinpoints packet drops and network saturation that surface-level metrics cannot explain.

When you describe a symptom, the agent:

- Classifies the issue and plans an investigation tailored to the symptom pattern
- Gathers evidence through the AKS MCP server and its Linux networking plugin across DNS, service routing, network policies, Cilium, and node-level statistics
- Reasons across layers to identify how a failure in one component manifests in another
- Delivers a structured report with pass/fail evidence, root cause analysis, and specific remediation guidance

The agent is scoped to AKS networking: DNS failures, packet drops, connectivity issues, policy conflicts, and Cilium dataplane health. It does not modify workloads or change configurations. All remediation guidance is advisory: the agent tells you what to run, and you decide whether to apply it.

What makes the Container Network Insights Agent different

Deep telemetry, not just surface metrics. Most observability tools operate at the Kubernetes API level.
The agent goes deeper, collecting kernel-level network statistics, BPF program drop counters, and interface-level diagnostics that pinpoint exactly where packets are being lost and why. This is the difference between knowing something is wrong and knowing precisely what is causing it.

Cross-layer reasoning. Networking incidents rarely have single-layer explanations. The agent correlates evidence from DNS, service routing, network policy, Cilium, and node-level statistics together, surfacing causal relationships that span layers. For example: node-level RX drops caused by a Cilium policy denial, triggered by a label mismatch after a routine Helm deployment, even though the pods themselves appear healthy.

Structured and auditable. Every conclusion traces to a specific check, its output, and its pass/fail status. If all checks pass, the agent reports no issue; it does not invent problems. Investigations are deterministic and reproducible. Results can be reviewed, shared, and rerun.

Guidance, not just findings. The agent explains what the evidence means, identifies the root cause, and provides specific remediation commands. The analysis is done; the operator reviews and decides.

Where the Container Network Insights Agent fits

The agent is not another monitoring tool. It does not collect continuous metrics or replace dashboards. Your existing observability stack, including Azure Monitor, Prometheus, Grafana, Container Insights, and your log pipelines, keeps doing what it does. The agent complements those tools by adding an intelligence layer that turns fragmented signals into actionable diagnosis. Your alerting detects the problem; this agent helps you understand it.

Safe by design

The Container Network Insights Agent is built for production clusters.
- Read-only access: minimal RBAC scoped to pods, services, endpoints, nodes, namespaces, network policies, and Cilium resources. The agent deploys a temporary debug DaemonSet only for packet-drop diagnostics that require host-level stats.
- Advisory remediation only: the agent tells you what to run. It never executes changes.
- Evidence-backed conclusions: every root cause traces to a specific failed check. No speculation.
- Scoped and enforced: the agent handles AKS networking questions only. It does not respond to off-topic requests. Prompt injection defenses are built in.
- Credentials stay in the cluster: the agent authenticates via managed identity with workload identity federation. No secrets, no static credentials. Only a session ID cookie reaches the browser.

Get started

The Container Network Insights Agent is available in public preview in **Central US, East US, East US 2, UK South, and West US 2**. The agent deploys as an AKS cluster extension and uses your own Azure OpenAI resource, giving you control over model configuration and data residency. Full capabilities require Cilium and Advanced Container Networking Services. DNS and packet drop diagnostics work on all supported AKS clusters.

To try it:

- Review the Container Network Insights Agent overview on Microsoft Learn: https://learn.microsoft.com/en-us/azure/aks/container-network-insights-agent-overview
- Follow the quickstart to deploy the agent and run your first diagnostic
- Share feedback via the Azure feedback channel or the thumbs-up and thumbs-down feedback controls on each response

Your feedback shapes the roadmap. If the agent gets something wrong or misses a scenario you encounter, we want to hear about it.
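Since full capabilities require Cilium, you can check your cluster before deploying, and see the kind of manual check the agent automates. A quick sketch (label selectors may vary by installation; requires the Hubble CLI for the second command):

```shell
# Verify Cilium pods are running in the cluster.
kubectl get pods -n kube-system -l k8s-app=cilium

# Example of a manual cross-layer check the agent performs for you:
# recent flows dropped by the dataplane, via the Hubble CLI.
hubble observe --verdict DROPPED --last 20
```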
Introduction

Azure Private Endpoint enables secure connectivity to Azure PaaS services, such as:

- Azure SQL Managed Instance
- Azure Container Registry
- Azure Key Vault
- Azure Storage Account

through private IP addresses within a virtual network. When Private Endpoint is enabled for a service, Azure DNS automatically changes the name resolution path using CNAME redirection. Example:

myserver.database.windows.net → myserver.privatelink.database.windows.net → Private IP

Azure Private DNS Zones are then used to resolve this Private Endpoint FQDN within the VNet. However, this introduces a critical DNS limitation in:

- Hybrid cloud architectures (AWS → Azure SQL MI)
- Multi-region deployments (DR region access)
- Cross-tenant / cross-subscription access
- Multi-VNet isolated networks

If the Private DNS zone does not contain a corresponding record, Azure DNS returns NXDOMAIN (Non-Existent Domain). When a DNS resolver receives a negative response (NXDOMAIN), the client receives no usable answer and the query fails. This results in:

- ❌ Application connectivity failure
- ❌ Database connection timeouts
- ❌ AKS pod DNS resolution errors
- ❌ DR failover application outages

Problem statement

In traditional Private Endpoint DNS resolution:

1. A DNS query is sent from the application.
2. Azure DNS checks the linked Private DNS Zone.
3. If no matching record exists, NXDOMAIN is returned.

DNS queries for Azure Private Link and network isolation scenarios across different tenants and resource groups have unique name resolution paths, which can affect the ability to reach Private Link-enabled resources outside a tenant's control. Azure does not retry resolution using public DNS by default.
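This behavior can be sketched as a toy resolver in Python. It is an illustration only: the zone contents and IP addresses are made up, and real resolution happens inside Azure's resolver fleet, not in your code.

```python
# Toy model of Azure Private DNS resolution with and without the
# NxDomainRedirect ("fallback to internet") policy. Zone data is made up.

PRIVATE_ZONE = {  # records in the linked privatelink.database.windows.net zone
    "my-mi.privatelink.database.windows.net": "10.0.1.4",
}
PUBLIC_DNS = {  # what public recursion would return for the same names
    "other-mi.database.windows.net": "52.168.0.10",
}

def resolve(fqdn, fallback_to_internet):
    # The query is matched against the linked private zone first,
    # via the privatelink CNAME redirection.
    private_name = fqdn.replace(".database.", ".privatelink.database.", 1)
    if private_name in PRIVATE_ZONE:
        return PRIVATE_ZONE[private_name]   # private IP answer
    if fallback_to_internet:
        return PUBLIC_DNS.get(fqdn)         # retry via public recursion
    return None                             # NXDOMAIN: the query fails

print(resolve("my-mi.database.windows.net", False))     # -> 10.0.1.4
print(resolve("other-mi.database.windows.net", False))  # -> None (NXDOMAIN)
print(resolve("other-mi.database.windows.net", True))   # -> 52.168.0.10
```

The third call models the fallback policy: the private zone misses, so the query is retried against public DNS instead of failing.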
Therefore:

- Public endpoint resolution never occurs
- The DNS query fails permanently
- The application cannot connect

Microsoft native solution: fallback to internet (NxDomainRedirect)

Azure introduced a DNS resolution policy: resolutionPolicy = NxDomainRedirect. This property enables public recursion via Azure's recursive resolver fleet when an authoritative NXDOMAIN response is received for a Private Link zone. When enabled:

- ✅ Azure DNS retries the query
- ✅ Public endpoint resolution occurs
- ✅ Application connectivity continues
- ✅ No custom DNS forwarder required

The fallback policy is configured at the Private DNS Zone → virtual network link. The resolution policy is enabled at the virtual network link level with the NxDomainRedirect setting. In the Azure portal this appears as "Enable fallback to internet".

How it works

Without fallback:

Application → Azure DNS → Private DNS Zone → Record missing → NXDOMAIN returned → Connection failure

With fallback enabled:

Application → Azure DNS → Private DNS Zone → Record missing → NXDOMAIN returned → Azure recursive resolver → Public DNS resolution → Public endpoint IP returned → Connection successful

The Azure recursive resolver retries the query using the public endpoint QNAME each time NXDOMAIN is received from the private zone scope.

Real-world use case: AWS application connecting to Azure SQL Managed Instance

You are running:

- SQL MI in Azure
- Private Endpoint enabled
- Private DNS Zone: privatelink.database.windows.net

An AWS application tries to connect to my-mi.database.windows.net. If the DR region DNS record is not available:

- Without fallback: DNS query → NXDOMAIN → application failure
- With fallback enabled: DNS query → retry via public DNS → connection success

Step-by-step configuration

Method 1 – Azure portal

1. Go to Private DNS Zones.
2. Select your Private Link DNS zone, for example: privatelink.database.windows.net.
3. Select Virtual network links.
4. Open your linked VNet.
5. Enable: ✅ Enable fallback to internet.
6. Click Save.

Method 2 – Azure CLI

You can configure the fallback policy as follows:

```shell
az network private-dns link vnet update \
  --resource-group RG-Network \
  --zone-name privatelink.database.windows.net \
  --name VNET-Link \
  --resolution-policy NxDomainRedirect
```

Validation steps

Run from an Azure VM:

nslookup my-mi.database.windows.net

Expected:

- ✔ Private IP (if available)
- ✔ Public IP (if fallback triggered)

Security considerations

Fallback to internet:

- ✅ Does NOT expose data
- ✅ Only impacts DNS resolution
- ✅ Network traffic is still governed by NSGs, Azure Firewall, UDRs, and Service Endpoint Policies

DNS resolution fallback only triggers on NXDOMAIN and does not change network-level firewall controls.

When should you enable this?

Recommended in:

- Hybrid AWS → Azure connectivity
- Multi-region DR deployments
- AKS accessing Private Endpoint services
- Cross-tenant connectivity
- Private Link + VPN / ExpressRoute scenarios

Conclusion

Fallback to internet using NxDomainRedirect provides seamless hybrid connectivity, reduced DNS complexity, no custom forwarders, and improved application resilience, and it simplifies DNS resolution for modern Private Endpoint-enabled architectures.

A demonstration of Virtual Network TAP
Azure Virtual Network Terminal Access Point (VTAP), at the time of writing in April 2026 in public preview in select regions, copies network traffic from source virtual machines to a collector or traffic analytics tool running as a Network Virtual Appliance (NVA). VTAP creates a full copy of all traffic sent and received by the virtual machine network interface card(s) (NICs) designated as VTAP sources. This includes packet payload content, in contrast to VNET Flow Logs, which only collect traffic metadata. Traffic collectors and analytics tools are third-party partner products available from the Azure Marketplace, among which are the major Network Detection and Response solutions.

VTAP is an agentless, cloud-native traffic tap at the Azure network infrastructure level. It is entirely out-of-band: it has no impact on the source VM's network performance, and the source VM is unaware of the tap. Tapped traffic is VXLAN-encapsulated and delivered to the collector NVA, in the same VNET as the source VMs or in a peered VNET.

This post demonstrates the basic functionality of VTAP: copying traffic into and out of a source VM, to a destination VM. The demo consists of three Windows VMs in one VNET, each running a basic web server that responds with the VM's name. Another VNET contains the target, a Windows VM with Wireshark installed, to inspect traffic forwarded by VTAP. This demo does not use third-party VTAP partner solutions from the Marketplace. The lab for this demonstration is available on GitHub: Virtual Network TAP.

The VTAP resource is configured with the target VM's NIC as the destination. All traffic captured from sources is VXLAN-encapsulated and sent to the destination on UDP port 4789 (this cannot be changed). We use a single source to make it easier to inspect the traffic flows in Wireshark; we will see that communication from the other VMs to our source VM is captured and copied to the destination.
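Wireshark decodes the encapsulation automatically, but the VXLAN header is simple enough to parse by hand. A minimal Python sketch of the 8-byte header defined in RFC 7348, using made-up example bytes:

```python
def parse_vxlan(payload):
    """Parse the 8-byte VXLAN header (RFC 7348) from a UDP/4789 payload.

    Returns whether the VNI flag is set, the 24-bit VXLAN Network
    Identifier, and the encapsulated inner Ethernet frame.
    """
    if len(payload) < 8:
        raise ValueError("too short for a VXLAN header")
    flags = payload[0]                        # 0x08 means the VNI field is valid
    vni = int.from_bytes(payload[4:7], "big") # 24-bit VNI
    return {
        "vni_valid": bool(flags & 0x08),
        "vni": vni,
        "inner_frame": payload[8:],           # original frame from the source VM
    }

# Example: header with the I flag set and VNI 100, plus a dummy inner frame.
sample = bytes([0x08, 0, 0, 0, 0, 0, 100, 0]) + b"\xaa" * 14
print(parse_vxlan(sample))  # vni_valid=True, vni=100
```

The inner frame returned here is what Wireshark shows in its capture panel: the original Ethernet frame sent or received by the source VM.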
In a real-world scenario, multiple or all of the VMs in an environment could be set up as TAP sources. The source VM, vm1, generates traffic through a script that continuously polls vm2 and vm3 on http://10.0.2.5 and http://10.0.2.6, and https://ipconfig.io.

On the destination VM, we use Wireshark to observe captured traffic. A filter on UDP port 4789 causes Wireshark to capture only the VXLAN-encapsulated traffic forwarded by VTAP. Wireshark automatically decodes VXLAN and displays the actual traffic to and from vm1, which is set up as the (only) VTAP source. Wireshark's capture panel shows the decapsulated TCP and HTTP exchanges, including the TCP handshake, between vm1 and the other VMs, and https://ipconfig.io.

Expanding the lines in the detail panel below the capture panel shows the details of the VXLAN encapsulation. The outer IP packets, encapsulating the VXLAN frames in UDP, originate from the source VM's IP address, 10.0.2.4, and have the target VM's address, 10.1.1.4, as the destination. The VXLAN frames contain all the details of the original Ethernet frames sent from and received by the source VM, and the IP packets within them. The Wireshark trace shows the full exchange between vm1 and the destinations it speaks with.

This brief demonstration uses Wireshark simply to visualize the operation of VTAP. The partner solutions available from the Azure Marketplace operate on the captured traffic to implement their specific functionality.

ExpressRoute Gateway Microsoft initiated migration
Important: Microsoft initiated Gateway migrations are temporarily paused. You will be notified when migrations resume.

Objective

The backend migration process is an automated upgrade performed by Microsoft to ensure your ExpressRoute gateways use the Standard IP SKU. This migration enhances gateway reliability and availability while maintaining service continuity. You receive notifications about scheduled maintenance windows and have options to control the migration timeline. For guidance on upgrading Basic SKU public IP addresses for other networking services, see Upgrading Basic to Standard SKU.

Important: As of September 30, 2025, Basic SKU public IPs are retired. For more information, see the official announcement.

You can initiate the ExpressRoute gateway migration yourself at a time that best suits your business needs, before the Microsoft team performs the migration on your behalf. This gives you control over the migration timing. Please use the ExpressRoute Gateway Migration Tool to migrate your gateway public IP to the Standard SKU. This tool provides a guided workflow in the Azure portal and PowerShell, enabling a smooth migration with minimal service disruption.

Backend migration overview

The backend migration is scheduled during your preferred maintenance window. During this time, the Microsoft team performs the migration with minimal disruption. You don't need to take any action. The process includes the following steps:

Deploy new gateway: Azure provisions a second virtual network gateway in the same GatewaySubnet alongside your existing gateway. Microsoft automatically assigns a new Standard SKU public IP address to this gateway.

Transfer configuration: The process copies all existing configurations (connections, settings, routes) from the old gateway. Both gateways run in parallel during the transition to minimize downtime. Brief connectivity interruptions may occur.
Clean up resources: After migration completes successfully and passes validation, Azure removes the old gateway and its associated connections. The new gateway includes a tag CreatedBy: GatewayMigrationByService to indicate it was created through the automated backend migration.

Important: To ensure a smooth backend migration, avoid making non-critical changes to your gateway resources or connected circuits during the migration process. If modifications are absolutely required, you can choose (after the Migrate stage completes) to either commit or abort the migration and make your changes.

Backend process details

This section provides an overview of the Azure portal experience during backend migration for an existing ExpressRoute gateway. It explains what to expect at each stage and what you see in the Azure portal as the migration progresses. To reduce risk and ensure service continuity, the process performs validation checks before and after every phase. The backend migration follows four key stages:

Validate: Checks that your gateway and connected resources meet all migration requirements for the Basic to Standard public IP migration.
Prepare: Deploys the new gateway with a Standard IP SKU alongside your existing gateway.
Migrate: Cuts over traffic from the old gateway to the new gateway with a Standard public IP.
Commit or abort: Finalizes the public IP SKU migration by removing the old gateway, or reverts to the old gateway if needed.

These stages mirror the Gateway migration tool process, ensuring consistency across both migration approaches. The Azure resource group RGA serves as a logical container that displays all associated resources as the process updates, creates, or removes them. Before the migration begins, RGA contains the following resources:

This image uses an example ExpressRoute gateway named ERGW-A with two connections (Conn-A and LAconn) in the resource group RGA.
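The four-stage lifecycle can be sketched as a small state machine. This is a hypothetical illustration of the flow described above, not code from the actual migration tool; the gateway names mirror the ERGW-A example:

```python
from dataclasses import dataclass, field

# Stage order from the post: Validate -> Prepare -> Migrate -> Commit.
STAGES = ["Validate", "Prepare", "Migrate", "Commit"]

@dataclass
class Migration:
    # Start state: one gateway with a Basic-SKU public IP.
    gateways: dict = field(default_factory=lambda: {"ERGW-A": "Basic"})
    stage: int = -1  # index into STAGES; -1 = not started

    def advance(self) -> str:
        """Run the next stage (each stage is preceded by validation checks)."""
        self.stage += 1
        stage = STAGES[self.stage]
        if stage == "Prepare":
            # New gateway deployed alongside the old one, suffixed _migrate,
            # with a Standard-SKU public IP.
            self.gateways["ERGW-A_migrate"] = "Standard"
        elif stage == "Commit":
            # Finalize: the old Basic-IP gateway and its connections are removed.
            del self.gateways["ERGW-A"]
        return stage

    def abort(self):
        """Revert: delete the new gateway and keep serving from the old one."""
        self.gateways.pop("ERGW-A_migrate", None)
        self.stage = -1

m = Migration()
for _ in range(3):
    m.advance()          # Validate, Prepare, Migrate
assert set(m.gateways) == {"ERGW-A", "ERGW-A_migrate"}  # both run in parallel
m.advance()              # Commit
assert set(m.gateways) == {"ERGW-A_migrate"}
```

Note how abort is available until commit: it simply discards the `_migrate` resources, which matches the portal behavior described in the walkthrough below.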
Portal walkthrough

Before the backend migration starts, a banner appears in the Overview blade of the ExpressRoute gateway. It notifies you that the gateway uses the deprecated Basic IP SKU and will undergo backend migration between March 7, 2026, and April 30, 2026.

Validate stage

Once the migration starts, the banner in your gateway's Overview page updates to indicate that migration is currently in progress. In this initial stage, all resources are checked to ensure they are in a Passed state. If any prerequisites aren't met, validation fails and the Azure team doesn't proceed with the migration, to avoid traffic disruptions. No resources are created or modified in this stage. After the validation phase completes successfully, a notification appears indicating that validation passed and the migration can proceed to the Prepare stage.

Prepare stage

In this stage, the backend process provisions a new virtual network gateway in the same region and SKU type as the existing gateway. Azure automatically assigns a new public IP address and re-establishes all connections. This preparation step typically takes up to 45 minutes. To indicate that the new gateway is created by migration, the backend mechanism appends _migrate to the original gateway name. During this phase, the existing gateway is locked to prevent configuration changes, but you retain the option to abort the migration, which deletes the newly created gateway and its connections. After the Prepare stage starts, a notification appears showing that new resources are being deployed to the resource group.

Deployment status

In the resource group RGA, under Settings → Deployments, you can view the status of all newly deployed resources as part of the backend migration process. In the resource group RGA, under the Activity Log blade, you can see events related to the Prepare stage.
These events are initiated by GatewayRP, which indicates they are part of the backend process.

Deployment verification

After the Prepare stage completes, you can verify the deployment details in the resource group RGA under Settings > Deployments. This section lists all components created as part of the backend migration workflow. The new gateway ERGW-A_migrate is deployed successfully along with its corresponding connections: Conn-A_migrate and LAconn_migrate.

Gateway tag

The newly created gateway ERGW-A_migrate includes the tag CreatedBy: GatewayMigrationByService, which indicates it was provisioned by the backend migration process.

Migrate stage

After the Prepare stage finishes, the backend process starts the Migrate stage. During this stage, the process switches traffic from the existing gateway ERGW-A to the new gateway ERGW-A_migrate.

Old gateway (ERGW-A) handles traffic: After the backend team initiates the traffic migration, the process switches traffic from the old gateway to the new gateway. This step can take up to 15 minutes and might cause brief connectivity interruptions.

New gateway (ERGW-A_migrate) handles traffic:

Commit stage

After migration, the Azure team monitors connectivity for 15 days to ensure everything is functioning as expected. The banner automatically updates to indicate completion of the migration. During this validation period, you can't modify resources associated with either the old or the new gateway. To resume normal CRUD operations without waiting 15 days, you have two options:

Commit: Finalize the migration and unlock resources.
Abort: Revert to the old gateway, which deletes the new gateway and its connections.

To initiate Commit before the 15-day window ends, type yes and select Commit in the portal. When the commit is initiated from the backend, you will see "Committing migration. The operation may take some time to complete." The old gateway and its connections are deleted.
The event shows as initiated by GatewayRP in the activity logs. After the old connections are deleted, the old gateway gets deleted. Finally, the resource group RGA contains only resources related to the migrated gateway ERGW-A_migrate. The ExpressRoute gateway migration from Basic to Standard public IP SKU is now complete.

Frequently asked questions

How long will the Microsoft team wait before committing to the new gateway? The Microsoft team waits around 15 days after migration to allow you time to validate connectivity and ensure all requirements are met. You can commit at any time during this 15-day period.

What is the traffic impact during migration? Is there packet loss or routing disruption? Traffic is rerouted seamlessly during migration. Under normal conditions, no packet loss or routing disruption is expected. Brief connectivity interruptions (typically less than 1 minute) might occur during the traffic cutover phase.

Can we make any changes to the ExpressRoute gateway deployment during the migration? Avoid making non-critical changes to the deployment (gateway resources, connected circuits, etc.). If modifications are absolutely required, you have the option (after the Migrate stage) to either commit or abort the migration.

Announcing public preview: Cilium mTLS encryption for Azure Kubernetes Service
We are thrilled to announce the public preview of Cilium mTLS encryption in Azure Kubernetes Service (AKS), delivered as part of Advanced Container Networking Services and powered by the Azure CNI dataplane built on Cilium. This capability is the result of a close engineering collaboration between Microsoft and Isovalent (now part of Cisco). It brings transparent, workload-level mutual TLS (mTLS) to AKS without sidecars, without application changes, and without introducing a separate service mesh stack. This public preview represents a major step forward in delivering secure, high-performance, and operationally simple networking for AKS customers. In this post, we'll walk through how Cilium mTLS works, when to use it, and how to get started.

Why Cilium mTLS encryption matters

Traditionally, teams looking to encrypt traffic in transit in Kubernetes have had two primary options:

Node-level encryption (for example, WireGuard or virtual network encryption), which secures traffic in transit but lacks workload identity and authentication.
Service meshes, which provide strong identity and mTLS guarantees but introduce operational complexity.

This trade-off has become increasingly problematic, as many teams want workload-level encryption and authentication, but without the cost, overhead, and architectural impact of deploying and operating a full service mesh.

Cilium mTLS closes this gap directly in the dataplane. It delivers transparent, inline mTLS encryption and authentication for pod-to-pod TCP traffic, enforced below the application layer and implemented natively in the Azure CNI dataplane built on Cilium, so customers gain workload-level security without introducing a separate service mesh, resulting in a simpler architecture with lower operational overhead. To see how this works under the hood, the next section breaks down the Cilium mTLS architecture and follows a pod-to-pod TCP flow from interception to authentication and encryption.
Architecture and design: How Cilium mTLS works

Cilium mTLS achieves workload-level authentication and encryption by combining three key components, each responsible for a specific part of the authentication and encryption lifecycle.

Cilium agent: Transparent traffic interception and wiring

The Cilium agent, which already exists on any cluster running Azure CNI powered by Cilium, is responsible for making mTLS invisible to applications. When a namespace is labelled with "io.cilium/mtls-enabled=true", the Cilium agent enrolls all pods in that namespace. It enters each pod's network namespace and installs iptables rules that redirect outbound traffic to ztunnel on port 15001. It is also responsible for passing workload metadata (such as pod IP and namespace context) to ztunnel.

Ztunnel: Node-level mTLS enforcement

Ztunnel is an open source, lightweight, node-level Layer 4 proxy that was originally created by Istio. Ztunnel runs as a DaemonSet; on the source node, it looks up the destination workload via XDS (streamed from the Cilium agent) and establishes mutually authenticated TLS 1.3 sessions between source and destination nodes. Connections are held inline until authentication is complete, ensuring that traffic is never sent in plaintext. The destination ztunnel decrypts the traffic and delivers it into the target pod, bypassing the interception rules via an in-pod mark. The application sees a normal plaintext connection; it is completely unaware encryption happened.

SPIRE: Workload identity and trust

SPIRE (SPIFFE Runtime Environment) provides the identity foundation for Cilium mTLS. SPIRE acts as the cluster Certificate Authority, issuing short-lived X.509 certificates (SVIDs) that are automatically rotated and validated. This is a key design principle of Cilium mTLS: trust is based on workload identity, not network topology.
Each workload receives a cryptographic identity derived from:

Kubernetes namespace
Kubernetes ServiceAccount

These identities are issued and rotated automatically by SPIRE and validated on both sides of every connection. As a result:

Identity remains stable across pod restarts and rescheduling
Authentication is decoupled from IP addresses
Trust decisions align naturally with Kubernetes RBAC and namespace boundaries

This enables a zero-trust networking model that fits cleanly into existing AKS security practices.

End-to-end workflow example

To see how these components work together, consider a simple pod-to-pod connection:

A pod initiates a TCP connection to another pod.
Traffic is intercepted inside the pod network namespace and redirected to the local ztunnel instance.
ztunnel retrieves the workload identity using certificates issued by SPIRE.
ztunnel establishes a mutually authenticated TLS session with the destination node's ztunnel.
Traffic is encrypted and sent between pods.
The destination ztunnel decrypts the traffic and delivers it to the target pod.

Every packet from an enrolled pod is encrypted. There is no plaintext window, and no dropped first packets. The connection is held inline by ztunnel until the mTLS tunnel is established, then traffic flows bidirectionally through an HBONE (HTTP/2 CONNECT) tunnel.

Workload enrolment and scope

Cilium mTLS in AKS is opt-in and scoped at the namespace level. Platform teams enable mTLS by applying a single label to a namespace. From that point on:

All pods in that namespace participate in mTLS
Authentication and encryption are mandatory between enrolled workloads
Non-enrolled namespaces continue to operate unchanged

Encryption is applied only when both pods are enrolled. Traffic between enrolled and non-enrolled workloads continues in plaintext without causing connectivity issues or hard failures. This model enables gradual rollout, staged migrations, and low-risk adoption across environments.
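The enrolment rule above (encryption applies only when both sides are enrolled; otherwise traffic stays plaintext rather than hard-failing) can be sketched as a small decision function. The label key is the real one from this post; the function itself is purely illustrative:

```python
# The namespace label that enrolls workloads into Cilium mTLS.
MTLS_LABEL = "io.cilium/mtls-enabled"

def connection_mode(src_ns_labels: dict, dst_ns_labels: dict) -> str:
    """mTLS applies only when BOTH namespaces are enrolled; otherwise
    traffic continues in plaintext instead of failing the connection."""
    def enrolled(labels: dict) -> bool:
        return labels.get(MTLS_LABEL) == "true"
    return "mtls" if enrolled(src_ns_labels) and enrolled(dst_ns_labels) else "plaintext"

assert connection_mode({MTLS_LABEL: "true"}, {MTLS_LABEL: "true"}) == "mtls"
assert connection_mode({MTLS_LABEL: "true"}, {}) == "plaintext"  # gradual rollout
assert connection_mode({}, {}) == "plaintext"                    # unchanged namespaces
```

This "both sides or plaintext" semantics is what makes staged, namespace-by-namespace rollout safe: enrolling one namespace never breaks its traffic to namespaces that haven't been enrolled yet.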
Getting started in AKS

Cilium mTLS encryption is available in public preview for AKS clusters that use:

Azure CNI powered by Cilium
Advanced Container Networking Services

You can enable mTLS:

When creating a new cluster, or
On an existing cluster, by updating the Advanced Container Networking Services configuration

Once enabled, enrolling workloads is as simple as labelling a namespace.

👉 Learn more

Concepts: How Cilium mTLS works, architecture, and trust boundaries
How-to guide: Step-by-step instructions to enable and verify mTLS in AKS

Looking ahead

This public preview represents an important step forward in simplifying network security for AKS and reflects a deep collaboration between Microsoft and Isovalent to bring open, standards-based innovation into production-ready cloud platforms. We're continuing to work closely with the community to improve the feature and move it toward general availability. If you're looking for workload-level encryption without the overhead of a traditional service mesh, we invite you to try Cilium mTLS in AKS and share your experience.

Azure Front Door: Resiliency Series – Part 2: Faster recovery (RTO)
In Part 1 of this blog series, we outlined our four-pillar strategy for resiliency in Azure Front Door: configuration resiliency, data plane resiliency, tenant isolation, and accelerated Recovery Time Objective (RTO). Together, these pillars help Azure Front Door remain continuously available and resilient at global scale.

Part 1 focused on the first two pillars: configuration and data plane resiliency. Our goal is to make configuration propagation safer, so incompatible changes never escape pre-production environments. We discussed how incompatible configurations are blocked early, and how data plane resiliency ensures the system continues serving traffic from a last-known-good (LKG) configuration even if a bad change manages to propagate. We also introduced 'Food Taster', a dedicated sacrificial process running in each edge server's data plane that pretests every configuration change in isolation, before it ever reaches the live data plane.

In this post, we turn to the recovery pillar. We describe how we have made key enhancements to the Azure Front Door recovery path so the system can return to full operation in a predictable and bounded timeframe. For a global service like Azure Front Door, serving hundreds of thousands of tenants across 210+ edge sites worldwide, we set an explicit target: to be able to recover any edge site – or all edge sites – within approximately 10 minutes, even in worst-case scenarios. In typical data plane crash scenarios, we expect recovery in under a second.

Repair status

The first blog post in this series mentioned the two Azure Front Door incidents from October 2025 – learn more by watching our Azure Incident Retrospective session recordings for the October 9th incident and/or the October 29th incident. Before diving into our platform investments for improving our Recovery Time Objectives (RTO), we wanted to provide a quick update on the overall repair items from these incidents.
We are pleased to report that the work on configuration propagation and data plane resiliency is now complete and fully deployed across the platform (below, "Completed" means broadly deployed in production). With this, we have reduced configuration propagation latency from ~45 minutes to ~20 minutes. We anticipate reducing this even further – to ~15 minutes by the end of April 2026 – while ensuring that platform stability remains our top priority.

Safe customer configuration deployment
Goal: Incompatible configuration never propagates beyond EUAP or canary regions.
Repairs: Control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state.
Status: Completed

Data plane resiliency
Goal: Configuration processing cannot impact data plane availability.
Repairs: Manage data-plane lifecycle to prevent outages caused by configuration-processing defects (Completed); isolated work-process in every data plane server to process and load the configuration (Completed).

100% Azure Front Door resiliency posture for Microsoft internal services
Goal: Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services.
Repairs: Phase 1 – Onboarded the batch of critical services impacted in the Oct 29th outage, running on a day-old configuration (Completed); Phase 2 – Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services (March 2026).

Recovery improvements
Goal: Data plane crash recovery in under 10 minutes.
Repairs: Data plane boot-up time optimized via local cache, ~1 hour (Completed); accelerate recovery time to under 10 minutes (April 2026).

Tenant isolation
Goal: No configuration or traffic regression can impact other tenants.
Repairs: Micro-cellular Azure Front Door with ingress-layered shards (June 2026).

Why recovery at edge scale is deceptively hard

To understand why recovery took as long as it did, it helps to first understand how the Azure Front Door data plane processes configuration. Azure Front Door operates in 210+ edge sites with multiple servers per site. The data plane of each edge server hosts multiple processes. A master process orchestrates the lifecycle of multiple worker processes that serve customer traffic. A separate configuration translator process runs alongside the data plane processes and is responsible for converting customer configuration bundles from the control plane into optimized binary FlatBuffer files. This translation step, covering hundreds of thousands of tenants, represents hours of cumulative computation. A per-server cache is kept locally on each edge server, to enable fast recovery of the data plane if needed.

Once the configuration translator process produces these FlatBuffer files, each worker processes them independently and memory-maps them for zero-copy access. Configuration updates flow through a two-phase commit: new FlatBuffers are first loaded into a staging area and validated, then atomically swapped into production maps.
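The staged-then-swap update described above can be sketched in miniature. This is an illustrative model, not Azure Front Door source code; the validation rule is a stand-in for the real load checks:

```python
import threading

# Sketch of a two-phase configuration commit: new per-tenant config is
# validated in a staging area, then swapped into the live map atomically.
class ConfigStore:
    def __init__(self):
        self._live = {}       # tenant -> config currently serving traffic
        self._staging = {}
        self._lock = threading.Lock()

    def stage(self, tenant, config):
        # Phase 1: validate in staging; a bad config never reaches live.
        if not self._validate(config):
            raise ValueError(f"invalid config for {tenant}")
        self._staging[tenant] = config

    def commit(self):
        # Phase 2: atomic swap. Readers holding a reference to the old map
        # (in-flight requests) keep using it until they complete.
        with self._lock:
            self._live = {**self._live, **self._staging}
            self._staging = {}

    def snapshot(self):
        return self._live     # readers grab a stable reference

    @staticmethod
    def _validate(config):
        # Stand-in validation rule for the example.
        return isinstance(config, dict) and "routes" in config

store = ConfigStore()
old = store.snapshot()
store.stage("contoso", {"routes": ["/api"]})
store.commit()
assert store.snapshot()["contoso"] == {"routes": ["/api"]}
assert old == {}              # in-flight readers still see the old view
```

Swapping a whole map rather than mutating it in place is what lets in-flight requests keep using the old configuration safely, as the next paragraph describes.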
In-flight requests continue using the old configuration until the last request referencing it completes.

The data plane recovery process is designed to be resilient to different failure modes. A failure or crash at the worker process level has a typical recovery time of less than one second. Since each server has multiple such worker processes serving customer traffic, this type of crash has no impact on the data plane. In the case of a master process crash, the system automatically tries to recover using the local cache. When the local cache is reused, the system is able to recover quickly – in approximately 60 minutes – since most of the configurations in the cache were already loaded into the data plane before the crash. However, in certain cases, if the cache becomes unavailable or must be invalidated because of corruption, the recovery time increases significantly.

During the October 29th incident, a data plane crash triggered a complete recovery sequence that took approximately 4.5 hours. This was not because restarting a process is slow; it is because a defect in the recovery process invalidated the local cache, which meant that "restart" meant rebuilding everything from scratch. The configuration translator process then had to re-fetch and re-translate every one of the hundreds of thousands of customer configurations before workers could memory-map them and begin serving traffic.

This experience crystallized three fundamental learnings related to our recovery path:

Expensive rework: A subset of crashes discarded all previously translated FlatBuffer artifacts, forcing the configuration translator process to repeat hours of conversion work that had already been validated and stored.

High restart costs: Every worker on every node had to wait for the configuration translator process to complete the full translation before it could memory-map any configuration and begin serving requests.
Unbounded recovery time: Recovery time grew linearly with total tenant footprint rather than with active traffic, creating a 'scale penalty' as more tenants onboarded to the system.

Separately and together, the insight was clear: recovery must stop being proportional to the total configuration size.

Persisting 'validated configurations' across restarts

One of the key recovery improvements was strengthening how validated customer configurations are cached and reused across failures, rather than rebuilding configuration state from scratch during recovery. Azure Front Door already cached customer configurations on host-mounted storage prior to the October incident. The post-outage platform enhancements focused on making the local configuration cache resilient to crashes, partial failures, and bad tenant inputs. Our goal was to ensure that recovery behavior is dominated by serving traffic safely, not by reconstructing configuration state. This led us to two explicit design goals.

Design goals

No category of crash should invalidate the configuration cache: Configuration cache invalidation must never be the default response to failures. Whether the failure is a worker crash, master crash, data plane restart, or coordinated recovery action, previously validated customer configurations should remain usable, unless there is a proven reason to discard them.

Bad tenant configuration must not poison the entire cache: A single faulty or incompatible tenant configuration should result in targeted eviction of that tenant's configuration only, not wholesale cache invalidation across all tenants.

Platform enhancements

Previously, customer configurations were persisted to host-mounted storage, but certain failure paths treated the cache as unsafe and invalidated it entirely. In those cases, recovery implicitly meant reloading and reprocessing configuration for hundreds of thousands of tenants before traffic could resume, even though the vast majority of cached data was still valid.
We changed the recovery model to avoid invalidating customer configurations, with strict scoping around when and how cached entries are discarded:

Cached configurations are no longer invalidated based on crash type. Failures are assumed to be orthogonal to configuration correctness unless explicitly proven otherwise.

Cache eviction is granular and tenant-scoped. If a cached configuration fails validation or load checks, only that tenant's configuration is discarded and reloaded. All other tenant configurations remain available.

This ensures that recovery does not regress into a fleet-wide rebuild due to localized or unrelated faults.

Safety and correctness

Durability is paired with strong correctness controls, to prevent unsafe configurations from being served:

Per-tenant validation on load: Each cached tenant configuration is validated during the 'load and verification' phase, before being promoted for traffic serving. Failures are therefore contained to that tenant.

Targeted re-translation: When validation fails, only the affected tenant's configuration is reloaded or reprocessed; the cache for other tenants is left untouched.

Operational escape hatch: Operators retain the ability to explicitly instruct a clean rebuild of the configuration cache (with proper authorization), preserving control without compromising the default fast-recovery path.

Resulting behavior

With these changes, recovery behavior now aligns with real-world traffic patterns: configuration defects impact tenants locally and predictably, rather than globally. The system now prefers isolated tenant impact, and continued service using last-known-good over aggressive invalidation, both of which are critical for predictable recovery at the scale of Azure Front Door.
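The tenant-scoped eviction behavior can be sketched as follows. This is illustrative only; `validate` and `retranslate` are stand-ins for the real per-tenant load checks and the translation step:

```python
# Sketch: loading a persisted configuration cache where a tenant that fails
# validation is evicted and re-translated individually, while every other
# tenant's cached entry survives untouched.
def load_cache(cache, validate, retranslate):
    live, evicted = {}, []
    for tenant, cfg in cache.items():
        if validate(cfg):
            live[tenant] = cfg                  # reuse: no rework at all
        else:
            evicted.append(tenant)
            live[tenant] = retranslate(tenant)  # targeted rework, this tenant only
    return live, evicted

cache = {"a": {"ok": True}, "b": {"ok": False}, "c": {"ok": True}}
live, evicted = load_cache(
    cache,
    validate=lambda c: c["ok"],
    retranslate=lambda t: {"ok": True, "rebuilt": True},
)
assert evicted == ["b"]               # only the bad tenant is rebuilt
assert live["a"] == {"ok": True}      # others are served from cache as-is
assert live["b"]["rebuilt"]
```

Contrast this with the old behavior, where a single unsafe failure path could throw away the entire cache and force re-translation for every tenant.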
Making recovery scale with active traffic, not total tenants

Reusing the configuration cache solves the problem of rebuilding configuration in its entirety, but even with a warm cache, the original startup path had a second bottleneck: eagerly loading a large volume of tenant configurations into memory before serving any traffic. At our scale, memory-mapping and parsing hundreds of thousands of FlatBuffers, constructing internal lookup maps, and adding Transport Layer Security (TLS) certificates and configuration blocks for each tenant collectively added almost an hour to startup time. This was the case even when a majority of those tenants had no active traffic at that moment.

We addressed this by fundamentally changing when configuration is loaded into workers. Rather than eagerly loading most of the tenants at startup across all edge locations, Azure Front Door now uses a Machine Learning (ML)-optimized lazy loading model. In the new architecture, instead of loading a large number of tenant configurations, we only load a small subset of tenants that are known to be historically active in a given site; we call this the "warm tenants" list. The warm tenants list per edge site is created through a sophisticated traffic analysis pipeline that leverages ML.

However, loading the warm tenants alone is not enough, because when a request arrives and we don't have the configuration in memory, we need to know two things. Firstly, is this a request from a real Azure Front Door tenant? And, if it is, where can we find the configuration? To answer these questions, each worker maintains a hostmap that tracks the state of each tenant's configuration. This hostmap is constructed during startup, as we process each tenant configuration: if the tenant is in the warm list, we process and load its configuration fully; if not, we just add an entry to the hostmap mapping all of its domain names to the configuration path location.
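A minimal sketch of this hostmap model (illustrative, not AFD source code): warm tenants are loaded eagerly at startup, while every other tenant gets only a path entry that is resolved to a full configuration on first request:

```python
# Sketch: a hostmap that fully loads only "warm" tenants at startup and
# records just a config path for everyone else, loading cold tenants on
# the first request that actually arrives for them.
class HostMap:
    def __init__(self, all_tenants, warm_tenants, load_config):
        self._load = load_config            # expensive: parse + validate
        self._entries = {}                  # host -> loaded config, or path str
        for host, path in all_tenants.items():
            if host in warm_tenants:
                self._entries[host] = self._load(path)   # eager: active tenants
            else:
                self._entries[host] = path               # lazy: just the path

    def resolve(self, host):
        if host not in self._entries:
            return None                     # not an AFD tenant at all
        entry = self._entries[host]
        if isinstance(entry, str):          # cold tenant: load on demand
            entry = self._entries[host] = self._load(entry)
        return entry

loads = []
def load_config(path):
    loads.append(path)
    return {"path": path}

hm = HostMap({"hot.example": "/cfg/hot", "cold.example": "/cfg/cold"},
             warm_tenants={"hot.example"}, load_config=load_config)
assert loads == ["/cfg/hot"]                # startup cost: warm set only
hm.resolve("cold.example")                  # first request triggers the load
assert loads == ["/cfg/hot", "/cfg/cold"]
assert hm.resolve("unknown.example") is None
```

The design choice here is that startup cost is proportional to the warm set, while the hostmap still answers "is this a real tenant, and where is its configuration?" for everyone else.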
When a request arrives for one of these tenants, the worker loads and validates that tenant's configuration on demand, and immediately begins serving traffic. This allows a node to start serving its busiest tenants within a few minutes of startup, while additional tenants are loaded incrementally only when traffic actually arrives, allowing the system to progressively absorb cold tenants as demand increases.

The effect on recovery is transformative. Instead of recovery time scaling with the total number of tenants configured on a server, it scales with the number of tenants actively receiving traffic. In practice, even at our busiest edge sites, the active tenant set is a small fraction of the total.

Just as importantly, this modified form of lazy loading provides a natural failure isolation boundary. Most edge sites won't ever load a faulty configuration of an inactive tenant. When a request for an inactive tenant with an incompatible configuration arrives, impact is contained to a single worker. The configuration load architecture now prefers serving as many customers as quickly as possible, rather than waiting until everything is ready before serving anyone.

The above changes are slated to complete in April 2026 and will bring our RTO from the current ~1 hour to under 10 minutes, for complete recovery from a worst-case scenario.

Continuous validation through Game Days

A critical element of our recovery confidence comes from GameDay fault-injection testing. We don't simply design recovery mechanisms and assume they work; we break the system deliberately and observe how it responds. Since late 2025, we have conducted recurring GameDay drills that simulate the exact failure scenarios we are defending against:

Food Taster crash scenarios: Injecting deliberately faulty tenant configurations to verify that they are caught and isolated with zero impact on live traffic.
In our January 2026 GameDay, the Food Taster process crashed as expected, the system halted the update within approximately 5 seconds, and no customer traffic was affected.

Master process crash scenarios: Triggering master process crashes across test environments to verify that workers continue serving traffic, that the Local Config Shield engages within 10 seconds, and that the coordinated recovery tool restores full operation within the expected timeframe.

Multi-region failure drills: Simulating simultaneous failures across multiple regions to validate that global Config Shield mechanisms engage correctly, and that recovery procedures scale without requiring manual per-region intervention.

Fallback test drills for critical Azure services running behind Azure Front Door: In our February 2026 GameDay, we simulated the complete unavailability of Azure Front Door, and successfully validated failover for critical Azure services with no impact to traffic.

These drills have both surfaced corner cases and built operational confidence. They have transformed recovery from a theoretical plan into tested, repeatable muscle memory. As we noted in an internal communication to our team: "Game day testing is a deliberate shift from assuming resilience to actively proving it—turning reliability into an observed and repeatable outcome."

Closing

Part 1 of this series emphasized preventing unsafe configurations from reaching the data plane, and data plane resiliency in case an incompatible configuration reaches production. This post has shown that prevention alone is not enough: when failures do occur, recovery must be fast, predictable, and bounded. By ensuring that the FlatBuffer cache is never invalidated, by loading only active tenants, and by building safe coordinated recovery tooling, we have transformed failure handling from a fleet-wide crisis into a controlled operation. These recovery investments work in concert with the prevention mechanisms described in Part 1.
Together, they ensure that the path from incident detection to full service restoration is measured in minutes, with customer traffic protected at every step.

In the next post of this series, we will cover the third pillar of our resiliency strategy: tenant isolation—how micro-cellular architecture and ingress-layered sharding can reduce the blast radius of any failure to a small subset of tenants, ensuring that one customer’s configuration or traffic anomaly never becomes everyone’s problem.

We deeply value our customers’ trust in Azure Front Door. We are committed to transparently sharing our progress on these resiliency investments, and to exceeding expectations for safety, reliability, and operational readiness.

DNS best practices for implementation in Azure Landing Zones
Why DNS architecture matters in Landing Zones

A well-designed DNS layer is the glue that lets workloads in disparate subscriptions discover one another quickly and securely. Getting it right during your Azure Landing Zone rollout avoids painful refactoring later, especially once you start enforcing Zero-Trust and hub-and-spoke network patterns.

Typical Landing-Zone topology

| Subscription | Typical role | Key resources |
|---|---|---|
| Connectivity (Hub) | Transit, routing, shared security | Hub VNet, Azure Firewall / NVA, VPN/ER gateways, Private DNS Resolver |
| Security | Security tooling & SOC | Sentinel, Defender, Key Vault (HSM) |
| Shared Services | Org-wide shared apps | ADO and agents, Automation |
| Management | Ops & governance | Log Analytics, backup, etc. |
| Identity | Directory and auth services | Extended domain controllers, Azure AD DS |

Each of the five subscriptions contains a single VNet. The spokes (Security, Shared Services, Management, Identity) are peered to the Connectivity VNet, forming the classic hub-and-spoke.

Centralized DNS with mandatory firewall inspection

Objective: All network communication from a spoke must cross the firewall in the hub, including DNS traffic.

| Design element | Best-practice configuration |
|---|---|
| Private DNS zones | Link only to the Connectivity VNet. Spokes have no direct zone links. |
| Private DNS Resolver | Deploy inbound + outbound endpoints in the Connectivity VNet. Link the Connectivity VNet to the outbound resolver endpoint. |
| Spoke DNS settings | Set custom DNS servers on each spoke VNet to the inbound endpoint’s IPs. |
| Forwarding ruleset | Create a ruleset, associate it with the outbound endpoint, and add forwarders: specific domains → on-prem / external servers; wildcard “.” → on-prem DNS (for compliance scenarios). |
| Firewall rules | Allow UDP/TCP 53 from spokes to the resolver inbound endpoint, and from the resolver outbound endpoint to the target DNS servers. |

Note: An Azure private DNS zone is a global resource, meaning a single private DNS zone can be used to resolve DNS queries for resources deployed in multiple regions.
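To make the forwarding-ruleset behavior concrete, here is a minimal sketch (illustrative only; the real Private DNS Resolver is a managed Azure service) of how queries for specific namespaces are routed to designated DNS servers, with the wildcard “.” rule acting as the catch-all forwarder to on-prem DNS. The rule names and target IPs below are hypothetical:

```python
# Hypothetical forwarding ruleset: namespace -> target DNS server IPs.
# The most specific (longest) matching suffix wins; "." is the wildcard
# fallback, mirroring the compliance scenario described above.
RULESET = {
    "corp.contoso.com.": ["10.10.0.4"],       # on-prem AD DNS (assumed IP)
    "blob.core.windows.net.": ["10.0.0.68"],  # hub inbound endpoint (assumed IP)
    ".": ["10.10.0.4"],                       # wildcard -> on-prem DNS
}

def resolve_target(query: str) -> list[str]:
    """Return the forwarder targets for a DNS query name."""
    name = query.rstrip(".") + "."
    # Longest-suffix match: the most specific rule takes precedence.
    best = max(
        (rule for rule in RULESET
         if rule == "." or name == rule or name.endswith("." + rule)),
        key=len,
    )
    return RULESET[best]

# A query under the AD domain goes on-prem; a storage query goes to the hub;
# anything else falls through to the wildcard rule.
print(resolve_target("app.corp.contoso.com"))
print(resolve_target("mystorage.blob.core.windows.net"))
print(resolve_target("example.org"))
```

The longest-suffix-wins ordering is why a specific rule such as blob.core.windows.net can coexist with a wildcard “.” forwarder without the wildcard swallowing every query.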
A DNS Private Resolver, by contrast, is a regional resource, meaning it can only be linked to virtual networks within the same region.

Traffic flow

1. Spoke VM → inbound endpoint (hub).
2. The firewall receives the packet based on the spoke's UDR configuration and processes it before it is sent to the inbound endpoint IP.
3. The resolver applies forwarding rules to queries it cannot answer; unresolved queries leave via the outbound endpoint.

DNS forwarding rulesets provide a way to route queries for specific DNS namespaces to designated custom DNS servers.

Fallback to internet and NXDOMAIN redirect

Azure Private DNS now supports two powerful features to enhance name resolution flexibility in hybrid and multi-tenant environments:

Fallback to internet

- Purpose: Allows Azure to resolve DNS queries using public DNS if no matching record is found in the private DNS zone.
- Use case: Ideal when your private DNS zone doesn't contain all possible hostnames (e.g., partial zone coverage or phased migrations).
- How to enable: Go to Azure private DNS zones → select the zone → Virtual network links → Edit.

Ref article: https://learn.microsoft.com/en-us/azure/dns/private-dns-fallback

Centralized DNS - when firewall inspection isn’t required

Objective: DNS queries are not monitored by the firewall and may bypass it.

- Link every spoke virtual network directly to the required private DNS zones so that spokes can resolve PaaS resources directly.
- Keep a single Private DNS Resolver (optional) for on-prem name resolution; spokes can reach its inbound endpoint privately or via VNet peering.

Spoke-level custom DNS

Spoke DNS settings can instead point to extended domain controllers placed within the Identity virtual network. This pattern reduces latency and cost but still centralizes zone management.

Integrating on-premises Active Directory DNS

Create conditional forwarders on each domain controller for every private DNS zone, pointing them to the DNS Private Resolver inbound endpoint IP address (e.g., blob.core.windows.net, database.windows.net).
Do not include the literal privatelink label.

Ref article: https://github.com/dmauser/PrivateLink/tree/master/DNS-Integration-Scenarios#43-on-premises-dns-server-conditional-forwarder-considerations

Note: Avoid selecting the option “Store this conditional forwarder in Active Directory and replicate as follows” in environments with multiple Azure subscriptions and domain controllers deployed across different Azure environments.

Key takeaways

- Linking zones exclusively to the Connectivity subscription's virtual network keeps firewall inspection and egress control simple.
- Private DNS Resolver plus forwarding rulesets lets you shape hybrid name resolution without custom appliances.
- When no inspection is needed, direct zone links to spokes cut hops and complexity.
- For on-prem AD DNS, a conditional forwarder pointing to the inbound endpoint IP is required; exclude the privatelink label when creating it, and do not replicate the conditional forwarder zone through AD replication if you have a footprint in multiple Azure tenants.

Plan your DNS early, bake it into your infrastructure-as-code, and your landing zone will scale cleanly no matter how many spokes join the hub tomorrow.

Unlock outbound traffic insights with Azure StandardV2 NAT Gateway flow logs
Recommended Outbound Connectivity

StandardV2 NAT Gateway is the next evolution of outbound connectivity in Azure. As the recommended solution for providing secure, reliable outbound Internet access, NAT Gateway continues to be the default choice for modern Azure deployments. With the highly anticipated general availability of the new StandardV2 SKU, customers gain access to the following highly requested upgrades:

- Zone-redundancy: Automatically maintains outbound connectivity during single-zone failures in AZ-enabled regions.
- Enhanced performance: Up to 100 Gbps of throughput and 10 million packets per second, double the Standard SKU capacity.
- Dual-stack support: Attach up to 16 IPv6 and 16 IPv4 public IP addresses for future-ready connectivity.
- Flow logs: Access historical logs of connections established through your NAT gateway.

This blog focuses on how enabling StandardV2 NAT Gateway flow logs can benefit your team, along with some tips to get the most out of the data.

What are flow logs?

StandardV2 NAT Gateway flow logs are enabled through the Diagnostic settings on your NAT gateway resource, and the log data can be sent to a Log Analytics workspace, a storage account, or an Event Hub destination. "NatGatewayFlowlogV1" is the released log category, and it provides IP-level information on traffic flowing through your StandardV2 NAT gateway.

(Figure: enabling flow logs through the Diagnostic settings on your StandardV2 NAT gateway resource.)

Why should I use flow logs?

Security and compliance visibility

Before NAT gateway flow logs, customers could not see NAT gateway information when their virtual machines connected outbound. This made it difficult to:

- Validate that only approved destinations were being accessed
- Audit suspicious or unexpected outbound patterns
- Satisfy compliance requirements that mandate traffic recording

Flow logs now provide visibility into the source IP -> NAT gateway outbound IP -> destination IP path, along with details on sent/dropped packets and bytes.
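As a rough sketch of what consuming this data looks like, the snippet below parses one flow-log-style record and prints the source → outbound → destination path it describes. The field names SourceIP, DestinationIP, and PacketsSent* follow the columns shown in this post's KQL queries; "NatGatewayIP" is a hypothetical name for the outbound IP column, and the values are made up:

```python
import json

# A hypothetical NatGatewayFlowlogV1-style record (field values invented).
record_json = """{
  "TimeGenerated": "2025-11-01T12:00:00Z",
  "SourceIP": "10.0.0.4",
  "NatGatewayIP": "20.1.2.3",
  "DestinationIP": "140.82.113.3",
  "PacketsSent": 120,
  "PacketsSentDropped": 0,
  "PacketsReceivedDropped": 0
}"""

def describe_flow(raw: str) -> str:
    """Summarize one record as source -> NAT outbound IP -> destination."""
    r = json.loads(raw)
    dropped = r["PacketsSentDropped"] + r["PacketsReceivedDropped"]
    return (f'{r["SourceIP"]} -> {r["NatGatewayIP"]} -> {r["DestinationIP"]} '
            f'({r["PacketsSent"]} pkts sent, {dropped} dropped)')

print(describe_flow(record_json))
```

A summary line like this is exactly the audit trail that was missing before: which private IP egressed through which NAT gateway public IP to which destination.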
Usage analytics

Flow logs allow you to answer usage questions such as:

- Which VMs are generating the most outbound requests?
- Which destinations receive the most traffic?
- Is throughput growth caused by a specific workload pattern?

This level of insight is especially useful when debugging unexpected throughput increases, billing spikes, and connection bottlenecks.

Note: Flow logs only capture established connections. This means the TCP 3-way handshake (SYN → SYN/ACK → ACK) or the UDP ephemeral session setup must complete. If a connection never establishes, for example due to NSG denial, a routing mismatch, or SNAT exhaustion, it will not appear in flow logs.

Workflow of troubleshooting with flow logs

Let's walk through how you can leverage flow logs to troubleshoot a scenario where you are seeing intermittent connection drops.

Scenario: You have VMs that use a StandardV2 NAT gateway to reach the Internet. However, your VMs intermittently fail to reach github.com.

Step 1: Check NAT gateway health

Start with the datapath availability metric, which reflects the NAT gateway's overall health.

- If the metric is above 90%, the NAT gateway is healthy and is working as expected to send outbound traffic to the internet. Continue to Step 2.
- If the metric is lower, visit Troubleshoot Azure NAT Gateway connectivity - Azure NAT Gateway | Microsoft Learn for troubleshooting tips.

Step 2: Enable StandardV2 NAT Gateway flow logs

To investigate the root cause further, enable StandardV2 NAT Gateway flow logs (the NatGatewayFlowLogsV1 log category in Diagnostic settings) for the NAT gateway resource providing outbound connectivity for the impacted VMs. Log Analytics is the recommended destination because it allows you to easily query the data. For the detailed steps, visit Monitor with StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn.

Tip: You may enable flow logs even when not troubleshooting, so that you have historical data to reference when issues occur.
Step 3: Confirm whether the connection was established

Use Log Analytics to query for flows where the source IP is the VM's private IP and the destination IP is one of github.com's IP addresses. The following query generates a table and chart of the total packets sent per minute from your source IP to the destination IP through your NAT gateway over the last 24 hours:

```kusto
NatGatewayFlowlogsV1
| where TimeGenerated > ago(1d)
| where SourceIP == '10.0.0.4' //and DestinationIP == <"github.com IP">
| summarize TotalPacketsSent = sum(PacketsSent) by TimeGenerated = bin(TimeGenerated, 1m), SourceIP, DestinationIP
| order by TimeGenerated asc
```

If there are no records of this connection, the problem is likely in establishing the connection, because flow logs only capture records of established connections. Look at the SNAT connection metrics to determine whether it may be SNAT port exhaustion, or NSGs/UDRs blocking the traffic. If there are records of the connection, proceed to the next step.

Step 4: Check if there are any packets dropped

In Log Analytics, query for the total "PacketsSentDropped" and "PacketsReceivedDropped" per source/outbound/destination IP connection.

- If PacketsSentDropped > 0, the NAT gateway dropped traffic sent from your VM.
- If PacketsReceivedDropped > 0, the NAT gateway dropped traffic received from the destination IP (github.com in this case).

In both cases, it typically means either the client or the server is pushing more traffic through a single connection than is optimal, causing connection-level rate limiting. To mitigate:

- Avoid relying on one connection and instead spread traffic across multiple connections.
- Distribute traffic across multiple outbound IP addresses by assigning more public IP addresses to the NAT gateway resource.
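The Step 4 drop check can also be sketched client-side on exported rows instead of in KQL: sum the two dropped-packet counters per (source, outbound, destination) tuple and flag any connection with drops. The column names match the queries above; "NatGatewayIP" and all values are hypothetical:

```python
from collections import defaultdict

# Hypothetical exported flow-log rows for one VM-to-github.com connection.
rows = [
    {"SourceIP": "10.0.0.4", "NatGatewayIP": "20.1.2.3",
     "DestinationIP": "140.82.113.3",
     "PacketsSentDropped": 0, "PacketsReceivedDropped": 0},
    {"SourceIP": "10.0.0.4", "NatGatewayIP": "20.1.2.3",
     "DestinationIP": "140.82.113.3",
     "PacketsSentDropped": 17, "PacketsReceivedDropped": 3},
]

def flag_dropped_flows(records):
    """Return {(src, nat, dst): total_dropped} for connections with drops."""
    totals = defaultdict(int)
    for r in records:
        key = (r["SourceIP"], r["NatGatewayIP"], r["DestinationIP"])
        totals[key] += r["PacketsSentDropped"] + r["PacketsReceivedDropped"]
    # Only connections that actually dropped packets are candidates for
    # connection-level rate limiting.
    return {k: v for k, v in totals.items() if v > 0}

print(flag_dropped_flows(rows))
```

Any tuple that surfaces here is a candidate for the mitigations above: spread the workload over more connections, or attach more public IPs to the NAT gateway.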
Conclusion

StandardV2 NAT Gateway flow logs unlock a powerful new dimension of outbound visibility. They can help you:

- Validate cybersecurity readiness
- Audit outbound flows
- Diagnose intermittent connectivity issues
- Understand traffic patterns and optimize architecture

We are excited to see how you leverage this new capability with your StandardV2 NAT gateways! Have more questions? As always, for any feedback, please feel free to reach us by submitting your feedback. We look forward to hearing your thoughts and hope this announcement helps you build more resilient applications in Azure.

For more information on StandardV2 NAT Gateway flow logs and how to enable them, visit:

- Manage StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn
- Monitor with StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn

To see the most up-to-date pricing for flow logs, visit Azure NAT Gateway - Pricing | Microsoft Azure. To learn more about StandardV2 NAT Gateway, visit What is Azure NAT Gateway? | Microsoft Learn.