Azure Networking

Introducing the Container Network Insights Agent for AKS: Now in Public Preview
We are thrilled to announce the public preview of the Container Network Insights Agent - agentic AI network troubleshooting for your workloads running in Azure Kubernetes Service (AKS).

The Challenge

AKS networking is layered by design: Azure CNI, eBPF, Cilium, CoreDNS, NetworkPolicy, CiliumNetworkPolicy, Hubble. Each layer contributes capabilities, and some can fail silently in ways the surrounding layers cannot observe. When something breaks, the evidence usually exists. Operators already have tools such as Azure Monitor for metrics, Container Insights for cluster health, Prometheus and Grafana for dashboarding, Cilium and Hubble for pod network observation, and kubectl for direct inspection. However, correlating different signals and identifying the root cause takes time.

Imagine this scenario: an application performance alert fires. The on-call engineer checks dashboards, reviews events, and inspects pod health. Each tool shows its own slice, but the root cause usually lives in the relationship between signals, not in any single tool. So the real work begins: manually cross-referencing Hubble flows, NetworkPolicy specs, DNS state, node-level stats, and verdicts. Each check is a separate query, a separate context switch, a separate mental model of how the layers interact. This process is manual and slow, demands domain knowledge, and does not scale. Mean time to resolution (MTTR) stays high not because engineers lack skill, but because the investigation surface is wide and the interactions between the layers are complex.

The solution: Container Network Insights Agent

The Container Network Insights Agent is agentic AI that simplifies and speeds up AKS network troubleshooting. Rather than replacing your existing observability tools, the agent correlates signals on demand to help you quickly identify and resolve network issues. You describe a problem in natural language, and the agent runs a structured investigation across layers.
It delivers a diagnosis with the evidence, the root cause, and the exact commands to fix it.

The Container Network Insights Agent gets its visibility through two data sources:

- AKS MCP server: The agent integrates with the AKS MCP (Model Context Protocol) server, a standardized and secure interface to kubectl, Cilium, and Hubble. Every diagnostic command runs through the same tools operators already use, via a well-defined protocol that enforces security boundaries. No ad-hoc scripts, no custom API integrations.
- Linux Networking plugin: For diagnostics that require visibility below the Kubernetes API layer, the agent collects kernel-level telemetry directly from cluster nodes, including NIC ring buffer stats, kernel packet counters, SoftIRQ distribution, and socket buffer utilization. This is how it pinpoints packet drops and network saturation that surface-level metrics cannot explain.

When you describe a symptom, the agent:

- Classifies the issue and plans an investigation tailored to the symptom pattern
- Gathers evidence through the AKS MCP server and its Linux Networking plugin across DNS, service routing, network policies, Cilium, and node-level statistics
- Reasons across layers to identify how a failure in one component manifests in another
- Delivers a structured report with pass/fail evidence, root cause analysis, and specific remediation guidance

The agent is scoped to AKS networking: DNS failures, packet drops, connectivity issues, policy conflicts, and Cilium dataplane health. It does not modify workloads or change configurations. All remediation guidance is advisory: the agent tells you what to run, and you decide whether to apply it.

What makes the Container Network Insights Agent different

Deep telemetry, not just surface metrics

Most observability tools operate at the Kubernetes API level.
The Container Network Insights Agent goes deeper, collecting kernel-level network statistics, BPF program drop counters, and interface-level diagnostics that pinpoint exactly where packets are being lost and why. This is the difference between knowing something is wrong and knowing precisely what is causing it.

Cross-layer reasoning

Networking incidents rarely have single-layer explanations. The agent correlates evidence from DNS, service routing, network policy, Cilium, and node-level statistics together, surfacing causal relationships that span layers. For example: node-level RX drops caused by a Cilium policy denial triggered by a label mismatch after a routine Helm deployment, even though the pods themselves appear healthy.

Structured and auditable

Every conclusion traces to a specific check, its output, and its pass/fail status. If all checks pass, the agent reports no issue; it does not invent problems. Investigations are deterministic and reproducible. Results can be reviewed, shared, and rerun.

Guidance, not just findings

The agent explains what the evidence means, identifies the root cause, and provides specific remediation commands. The analysis is done; the operator reviews and decides.

Where the Container Network Insights Agent fits

The agent is not another monitoring tool. It does not collect continuous metrics or replace dashboards. Your existing observability stack (Azure Monitor, Prometheus, Grafana, Container Insights, and your log pipelines) keeps doing what it does. The agent complements those tools by adding an intelligence layer that turns fragmented signals into actionable diagnosis. Your alerting detects the problem; this agent helps you understand it.

Safe by Design

The Container Network Insights Agent is built for production clusters.
- Read-only access: Minimal RBAC scoped to pods, services, endpoints, nodes, namespaces, network policies, and Cilium resources. The agent deploys a temporary debug DaemonSet only for packet-drop diagnostics that require host-level stats.
- Advisory remediation only: The agent tells you what to run. It never executes changes.
- Evidence-backed conclusions: Every root cause traces to a specific failed check. No speculation.
- Scoped and enforced: The agent handles AKS networking questions only. It does not respond to off-topic requests. Prompt injection defenses are built in.
- Credentials stay in the cluster: The agent authenticates via managed identity with workload identity federation. No secrets, no static credentials. Only a session ID cookie reaches the browser.

Get Started

The Container Network Insights Agent is available in public preview in **Central US, East US, East US 2, UK South, and West US 2**. The agent deploys as an AKS cluster extension and uses your own Azure OpenAI resource, giving you control over model configuration and data residency. Full capabilities require Cilium and Advanced Container Networking Services; DNS and packet drop diagnostics work on all supported AKS clusters.

To try it:

- Review the Container Network Insights Agent overview on Microsoft Learn: https://learn.microsoft.com/en-us/azure/aks/container-network-insights-agent-overview
- Follow the quickstart to deploy the Container Network Insights Agent and run your first diagnostic
- Share feedback via the Azure feedback channel or the thumbs-up and thumbs-down controls on each response

Your feedback shapes the roadmap. If the agent gets something wrong or misses a scenario you encounter, we want to hear about it.
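As a rough sketch of the deployment step, AKS cluster extensions are typically installed with the `az k8s-extension` CLI. The extension type below is a placeholder, not the agent's actual identifier; follow the quickstart for the exact values.

```shell
# Illustrative template only: deploy an AKS cluster extension via Azure CLI.
# <extension-type> is a placeholder; the quickstart documents the real value
# for the Container Network Insights Agent.
az extension add --name k8s-extension   # CLI support for cluster extensions

az k8s-extension create \
  --resource-group myResourceGroup \
  --cluster-name myAKSCluster \
  --cluster-type managedClusters \
  --name container-network-insights \
  --extension-type <extension-type>
```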
Enabling fallback to internet for Azure Private DNS Zones in hybrid architectures
Introduction

Azure Private Endpoint enables secure connectivity to Azure PaaS services such as:

- Azure SQL Managed Instance
- Azure Container Registry
- Azure Key Vault
- Azure Storage Account

through private IP addresses within a virtual network. When Private Endpoint is enabled for a service, Azure DNS automatically changes the name resolution path using CNAME redirection. Example:

myserver.database.windows.net
↓
myserver.privatelink.database.windows.net
↓
Private IP

Azure Private DNS Zones are then used to resolve this Private Endpoint FQDN within the VNet. However, this introduces a critical DNS limitation in:

- Hybrid cloud architectures (AWS → Azure SQL MI)
- Multi-region deployments (DR region access)
- Cross-tenant / cross-subscription access
- Multi-VNet isolated networks

If the Private DNS zone does not contain a corresponding record, Azure DNS returns NXDOMAIN (Non-Existent Domain). When the resolver returns this negative response, the DNS client receives no usable answer and the query fails. This results in:

❌ Application connectivity failure
❌ Database connection timeout
❌ AKS pod DNS resolution errors
❌ DR failover application outage

Problem statement

In traditional Private Endpoint DNS resolution:

1. A DNS query is sent from the application.
2. Azure DNS checks the linked Private DNS Zone.
3. If no matching record exists, NXDOMAIN is returned.

DNS queries for Azure Private Link and network isolation scenarios across different tenants and resource groups have unique name resolution paths, which can affect the ability to reach Private Link-enabled resources outside a tenant's control. Azure does not retry resolution using public DNS by default.
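As an illustration, a client inside the VNet resolving a Private Link name with no matching private zone record might see something like the following. This is a hypothetical transcript; 168.63.129.16 is the Azure-provided DNS address, and exact output varies by resolver.

```
$ nslookup myserver.database.windows.net
Server:  168.63.129.16
Address: 168.63.129.16#53

** server can't find myserver.database.windows.net: NXDOMAIN
```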
Therefore:

- Public endpoint resolution never occurs
- The DNS query fails permanently
- The application cannot connect

Microsoft native solution: fallback to internet (NxDomainRedirect)

Azure introduced a DNS resolution policy:

resolutionPolicy = NxDomainRedirect

This property enables public recursion via Azure's recursive resolver fleet when an authoritative NXDOMAIN response is received for a Private Link zone. When enabled:

✅ Azure DNS retries the query
✅ Public endpoint resolution occurs
✅ Application connectivity continues
✅ No custom DNS forwarder required

The fallback policy is configured on the Private DNS Zone's virtual network link: the resolution policy is enabled at the virtual network link level with the NxDomainRedirect setting. In the Azure portal this appears as "Enable fallback to internet".

How it works

Without fallback:

Application → Azure DNS → Private DNS Zone → Record missing → NXDOMAIN returned → Connection failure

With fallback enabled:

Application → Azure DNS → Private DNS Zone → Record missing → NXDOMAIN returned → Azure recursive resolver → Public DNS resolution → Public endpoint IP returned → Connection successful

The Azure recursive resolver retries the query using the public endpoint QNAME each time NXDOMAIN is received from the private zone scope.

Real-world use case: AWS application connecting to Azure SQL Managed Instance

You are running:

- SQL MI in Azure
- Private Endpoint enabled
- Private DNS Zone: privatelink.database.windows.net

An AWS application tries to connect to my-mi.database.windows.net. If the DR region DNS record is not available:

Without fallback: DNS query → NXDOMAIN → App failure
With fallback enabled: DNS query → Retry public DNS → Connection success

Step-by-step configuration

Method 1 – Azure portal

1. Go to Private DNS Zones.
2. Select your Private Link DNS zone, for example privatelink.database.windows.net.
3. Select Virtual network links and open your linked VNet.
4. Enable ✅ Enable fallback to internet.
5. Click Save.

Method 2 – Azure CLI

You can configure the fallback policy using:

az network private-dns link vnet update \
  --resource-group RG-Network \
  --zone-name privatelink.database.windows.net \
  --name VNET-Link \
  --resolution-policy NxDomainRedirect

Validation steps

Run from an Azure VM:

nslookup my-mi.database.windows.net

Expected:
✔ Private IP (if available)
✔ Public IP (if fallback triggered)

Security considerations

Fallback to internet:

✅ Does NOT expose data
✅ Only impacts DNS resolution
✅ Network traffic is still governed by NSGs, Azure Firewall, UDRs, and Service Endpoint Policies

DNS resolution fallback only triggers on NXDOMAIN and does not change network-level firewall controls.

When should you enable this?

Recommended in:

- Hybrid AWS → Azure connectivity
- Multi-region DR deployments
- AKS accessing Private Endpoint services
- Cross-tenant connectivity
- Private Link + VPN / ExpressRoute scenarios

Conclusion

Fallback to internet using NxDomainRedirect provides seamless hybrid connectivity, reduced DNS complexity, no custom forwarders, and improved application resilience, and it simplifies DNS resolution for modern Private Endpoint-enabled architectures.

A demonstration of Virtual Network TAP
Azure Virtual Network Terminal Access Point (VTAP), at the time of writing in April 2026 in public preview in select regions, copies network traffic from source Virtual Machines to a collector or traffic analytics tool running as a Network Virtual Appliance (NVA). VTAP creates a full copy of all traffic sent and received by the Virtual Machine Network Interface Cards (NICs) designated as VTAP sources. This includes packet payload content, in contrast to VNET Flow Logs, which only collect traffic metadata. Traffic collectors and analytics tools are third-party partner products available from the Azure Marketplace, amongst which are the major Network Detection and Response solutions.

VTAP is an agentless, cloud-native traffic tap at the Azure network infrastructure level. It is entirely out-of-band: it has no impact on the source VM's network performance, and the source VM is unaware of the tap. Tapped traffic is VXLAN-encapsulated and delivered to the collector NVA, in the same VNET as the source VMs or in a peered VNET.

This post demonstrates the basic functionality of VTAP: copying traffic into and out of a source VM to a destination VM. The demo consists of three Windows VMs in one VNET, each running a basic web server that responds with the VM's name. Another VNET contains the target, a Windows VM on which Wireshark is installed to inspect traffic forwarded by VTAP. This demo does not use third-party VTAP partner solutions from the Marketplace. The lab for this demonstration is available on GitHub: Virtual Network TAP.

The VTAP resource is configured with the target VM's NIC as the destination. All traffic captured from sources is VXLAN-encapsulated and sent to the destination on UDP port 4789 (this cannot be changed). We use a single source to make it easier to inspect the traffic flows in Wireshark; we will see that communication from the other VMs to our source VM is captured and copied to the destination.
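The traffic generation described next can be as simple as a polling loop. The following is an illustrative PowerShell sketch using the demo's addresses; the lab repository contains the actual script, which may differ.

```powershell
# Illustrative sketch: continuously poll the two demo web servers and an
# external site from the source VM. Addresses are the demo lab's; adjust
# for your environment.
while ($true) {
    foreach ($url in "http://10.0.2.5", "http://10.0.2.6", "https://ipconfig.io") {
        try {
            # Fetch the page and report the HTTP status code
            (Invoke-WebRequest -Uri $url -UseBasicParsing -TimeoutSec 5).StatusCode
        } catch {
            Write-Warning "Request to $url failed: $_"
        }
    }
    Start-Sleep -Seconds 2
}
```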
In a real-world scenario, multiple or all of the VMs in an environment could be set up as TAP sources.

The source VM, vm1, generates traffic through a script that continuously polls vm2 and vm3 on http://10.0.2.5 and http://10.0.2.6, and https://ipconfig.io. On the destination VM, we use Wireshark to observe the captured traffic. A filter on UDP port 4789 causes Wireshark to capture only the VXLAN-encapsulated traffic forwarded by VTAP. Wireshark automatically decodes VXLAN and displays the actual traffic to and from vm1, which is set up as the (only) VTAP source.

Wireshark's capture panel shows the decapsulated TCP and HTTP exchanges, including the TCP handshake, between vm1 and the other VMs, and https://ipconfig.io. Expanding the lines in the detail panel below the capture panel shows the details of the VXLAN encapsulation. The outer IP packets, encapsulating the VXLAN frames in UDP, originate from the source VM's IP address, 10.0.2.4, and have the target VM's address, 10.1.1.4, as the destination. The VXLAN frames contain all the details of the original Ethernet frames sent from and received by the source VM, and the IP packets within those. The Wireshark trace shows the full exchange between vm1 and the destinations it speaks with.

This brief demonstration uses Wireshark simply to visualize the operation of VTAP. The partner solutions available from the Azure Marketplace operate on the captured traffic to implement their specific functionality.

Orchestrating Intrusion Detection and Prevention Signature overrides in Azure Firewall Premium
Introduction

Azure Firewall Premium provides strong protection with a built-in Intrusion Detection and Prevention System (IDPS). It inspects inbound, outbound, and east-west traffic against Microsoft's continuously updated signature set and can block threats before they reach your workloads. IDPS works out of the box without manual intervention. However, in many environments administrators need the flexibility to override specific signatures to better align with operational or security requirements. Common reasons include:

- Compliance enforcement: enforcing policies that require certain threats (such as High severity signatures) to always be blocked, or directional, protocol-, or category-based tuning.
- Incident response: reacting quickly to emerging vulnerabilities by enabling blocking for newly relevant signatures.
- Noise reduction: keeping informational signatures in alert mode to avoid false positives while still maintaining visibility.

Signature overrides are typically managed in one of two ways:

- Using the global IDPS mode
- Using the Azure portal to apply per-signature overrides individually

While these approaches work, managing overrides manually becomes difficult when thousands of signatures are involved. The Azure portal also limits the number of changes that can be applied at once, which makes large tuning operations time-consuming. To simplify this process, this blog introduces an automation approach that allows you to export, filter, and apply IDPS signature overrides in bulk using PowerShell scripts.

A Common Operational Scenario

Consider a scenario frequently encountered by security teams: a security team wants to move their firewall from Alert to Alert + Deny globally to strengthen threat prevention. However, they do not want Low severity signatures to deny traffic, because these signatures are primarily informational and may create unnecessary noise or false positives.
Example:

- Signature ID: 2014906
- Severity: Low
- Description: INFO – .exe File requested over FTP

This signature is classified as informational because requesting an .exe file over FTP indicates contextual risk, not necessarily confirmed malicious activity. If the global mode is switched to Alert + Deny, this signature may start blocking traffic unnecessarily. The goal therefore becomes:

- Enable Alert + Deny globally
- Keep Low severity signatures in Alert mode

The workflow described in this blog demonstrates how to achieve this outcome using the IDPS override scripts.

Automation Workflow

The automation process uses two scripts to export and update signatures. Workflow overview:

Azure Firewall Policy
│
▼
Export Signatures (ipssigs.ps1)
│
▼
CSV Review / Edit
│
▼
Bulk Update (ipssigupdate.ps1)
│
▼
Updated Firewall Policy

Before implementing the workflow, it's helpful to briefly review the available IDPS modes (Off, Alert, and Alert + Deny) and severity levels (High, Medium, and Low).

Prerequisites

Now that we understand the Azure Firewall IDPS concepts and have the context for this script, let's get started with the script itself. First, ensure that you are connected to your Azure account and have selected the correct subscription.
You can do so by running the following command:

Connect-AzAccount -Subscription "<your-subscription-id>"

Ensure the following modules, required for this operation, are installed:

- Az.Accounts
- Az.Network

💡 Tip: You can check whether the above modules are installed by running:

Get-Module -ListAvailable Az*

or check specific modules using the following commands:

Get-Module Az.Network | Select-Object Name, Version, Path
Get-Module Az.Accounts | Select-Object Name, Version, Path

If you need to install them, run the following command, which downloads all generally available Azure service modules from the PowerShell Gallery, overwriting existing versions without prompting:

Install-Module -Name Az -Repository PSGallery -Force

Then import the required modules into your session:

Import-Module Az.Network
Import-Module Az.Accounts

Restart PowerShell after installation.

Configure ipsconfig.json

Now, configure the ipsconfig.json file and ensure it contains your target environment details, i.e., the target subscription, target firewall policy resource group name, firewall name, firewall policy name, location, and rule collection group name. Example:

{
  "subs": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "rg": "TEST-RG",
  "fw": "fw",
  "fwp": "fw-policy",
  "location": "CentralUS",
  "rcg": "DefaultNetworkRuleCollectionGroup"
}

Note: Your account must have permissions to read and update the firewall policy and IDPS settings.

Running the Script

1. Export Signatures

Now that we have all the prerequisites ready, it's time to run the script. Run the following command in PowerShell in the directory where the script exists:

.\ipssigs.ps1

The script prompts for filtering criteria, and you can enter values as per your requirements. For the example scenario we considered, we give the following inputs:

- Mode: Alert
- Severity: Low

💡 Tip: When specifying multiple values, separate them with a space and no comma; otherwise the script may return no results.
The script exports the results to the ipssignatures_results.csv file by default (or a custom filename if specified). The exported CSV includes metadata such as severity, direction, group, and protocol, which can help inform tuning decisions.

2. Prepare the CSV

We do not need all of these columns when feeding the CSV file back to update the firewall policy; we only need the following:

- Signature Id
- Mode

Therefore, remove all other columns, keeping the SignatureId and Mode columns along with their headers.

3. Update the Firewall Policy

Now it's time to update the firewall policy with the signature/mode overrides we need, using the above CSV file. Note that the script supports two operations:

- Changing the global IDPS mode
- Applying bulk signature overrides using the CSV file

You can use either option independently or both together. Let's look at two examples.

Example 1: Change the global mode and override Low severity signatures

Goal:

- Set the global mode to Alert + Deny
- Keep Low severity signatures in Alert

Command:

.\ipssigupdate.ps1 -GlobalMode Deny -InputFile Lowseveritysignatures.csv

Result:

- High and Medium signatures → Alert + Deny
- Low signatures → Alert

Example 2: Override signatures only

If the global mode should remain unchanged, run the following command only:

.\ipssigupdate.ps1

The script then prompts for the input CSV file. The changes are applied to the Azure Firewall in just a few seconds. After the script completes, the updated signature actions should appear in the firewall policy.

4. Monitoring Script Execution

Use the following commands to track and monitor the background jobs, verify their status, check for errors, and remove completed jobs.

Check background job status:

Get-Job -Id <#>

View results:

Receive-Job -Id <#> -Keep

Remove completed jobs:

Remove-Job -Id <#>

Note: Up to 10,000 IDPS rules can be customized at a time.

5. Validate the Changes

Now that the script has run, verify the update by confirming:

- The global IDPS mode in the firewall policy
- The signature override state
- Alert or block events in your logging destination (Log Analytics or Microsoft Sentinel)

Note: While most signatures support Off, Alert, or Deny actions, some context-setting signatures have fixed actions and cannot be overridden.

Conclusion

Azure Firewall Premium makes it straightforward to apply broad IDPS configuration changes through the Azure portal. However, as environments scale, administrators often require more precise and repeatable ways to manage signature tuning. The automation approach described in this blog allows administrators to query, review, and update thousands of signatures in minutes. This enables repeatable tuning workflows, improves operational efficiency, and simplifies large-scale security configuration changes.

References:

- GitHub repository for the IDPS scripts
- Azure Firewall IDPS
- Azure Firewall IDPS signature rule categories

ExpressRoute Gateway Microsoft initiated migration
Important: Microsoft-initiated gateway migrations are temporarily paused. You will be notified when migrations resume.

Objective

The backend migration process is an automated upgrade performed by Microsoft to ensure your ExpressRoute gateways use the Standard IP SKU. This migration enhances gateway reliability and availability while maintaining service continuity. You receive notifications about scheduled maintenance windows and have options to control the migration timeline. For guidance on upgrading Basic SKU public IP addresses for other networking services, see Upgrading Basic to Standard SKU.

Important: As of September 30, 2025, Basic SKU public IPs are retired. For more information, see the official announcement.

You can initiate the ExpressRoute gateway migration yourself at a time that best suits your business needs, before the Microsoft team performs the migration on your behalf. This gives you control over the migration timing. Please use the ExpressRoute Gateway Migration Tool to migrate your gateway public IP to the Standard SKU. This tool provides a guided workflow in the Azure portal and PowerShell, enabling a smooth migration with minimal service disruption.

Backend migration overview

The backend migration is scheduled during your preferred maintenance window. During this time, the Microsoft team performs the migration with minimal disruption; you don't need to take any action. The process includes the following steps:

- Deploy new gateway: Azure provisions a second virtual network gateway in the same GatewaySubnet alongside your existing gateway. Microsoft automatically assigns a new Standard SKU public IP address to this gateway.
- Transfer configuration: The process copies all existing configurations (connections, settings, routes) from the old gateway. Both gateways run in parallel during the transition to minimize downtime. Brief connectivity interruptions may occur.
- Clean up resources: After migration completes successfully and passes validation, Azure removes the old gateway and its associated connections.

The new gateway includes a tag CreatedBy: GatewayMigrationByService to indicate it was created through the automated backend migration.

Important: To ensure a smooth backend migration, avoid making non-critical changes to your gateway resources or connected circuits during the migration process. If modifications are absolutely required, you can choose (after the Migrate stage completes) to either commit or abort the migration and then make your changes.

Backend process details

This section provides an overview of the Azure portal experience during backend migration for an existing ExpressRoute gateway. It explains what to expect at each stage and what you see in the Azure portal as the migration progresses. To reduce risk and ensure service continuity, the process performs validation checks before and after every phase. The backend migration follows four key stages:

- Validate: Checks that your gateway and connected resources meet all migration requirements for the Basic to Standard public IP migration.
- Prepare: Deploys the new gateway with the Standard IP SKU alongside your existing gateway.
- Migrate: Cuts over traffic from the old gateway to the new gateway with a Standard public IP.
- Commit or abort: Finalizes the public IP SKU migration by removing the old gateway, or reverts to the old gateway if needed.

These stages mirror the Gateway Migration Tool process, ensuring consistency across both migration approaches. The Azure resource group RGA serves as a logical container that displays all associated resources as the process updates, creates, or removes them. This walkthrough uses an example ExpressRoute gateway named ERGW-A with two connections (Conn-A and LAconn) in the resource group RGA.
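Before the walkthrough, you can check whether a gateway's public IP is still on the Basic SKU with standard Az PowerShell cmdlets. The following is a sketch; the resource group name is a placeholder.

```powershell
# Illustrative: inspect the SKU of the public IP addresses in a resource group.
# Replace RGA with your resource group name.
Get-AzPublicIpAddress -ResourceGroupName "RGA" |
    Select-Object Name, IpAddress,
        @{ Name = "Sku"; Expression = { $_.Sku.Name } }
```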
Portal walkthrough

Before the backend migration starts, a banner appears in the Overview blade of the ExpressRoute gateway. It notifies you that the gateway uses the deprecated Basic IP SKU and will undergo backend migration between March 7, 2026, and April 30, 2026.

Validate stage

Once the migration starts, the banner on your gateway's Overview page updates to indicate that migration is in progress. In this initial stage, all resources are checked to ensure they are in a Passed state. If any prerequisites aren't met, validation fails and the Azure team doesn't proceed with the migration, to avoid traffic disruptions. No resources are created or modified in this stage. After the validation phase completes successfully, a notification appears indicating that validation passed and the migration can proceed to the Prepare stage.

Prepare stage

In this stage, the backend process provisions a new virtual network gateway in the same region and SKU type as the existing gateway. Azure automatically assigns a new public IP address and re-establishes all connections. This preparation step typically takes up to 45 minutes. To indicate that the new gateway is created by migration, the backend mechanism appends _migrate to the original gateway name. During this phase, the existing gateway is locked to prevent configuration changes, but you retain the option to abort the migration, which deletes the newly created gateway and its connections. After the Prepare stage starts, a notification appears showing that new resources are being deployed to the resource group.

Deployment status

In the resource group RGA, under Settings → Deployments, you can view the status of all newly deployed resources that are part of the backend migration process. In the resource group RGA, under the Activity Log blade, you can see events related to the Prepare stage.
These events are initiated by GatewayRP, which indicates they are part of the backend process.

Deployment verification

After the Prepare stage completes, you can verify the deployment details in the resource group RGA under Settings > Deployments. This section lists all components created as part of the backend migration workflow. The new gateway ERGW-A_migrate is deployed successfully along with its corresponding connections, Conn-A_migrate and LAconn_migrate.

Gateway tag

The newly created gateway ERGW-A_migrate includes the tag CreatedBy: GatewayMigrationByService, which indicates it was provisioned by the backend migration process.

Migrate stage

After the Prepare stage finishes, the backend process starts the Migrate stage. During this stage, the process switches traffic from the existing gateway ERGW-A to the new gateway ERGW-A_migrate. Initially, the old gateway (ERGW-A) handles traffic. After the backend team initiates the traffic migration, the process switches traffic from the old gateway to the new gateway. This step can take up to 15 minutes and might cause brief connectivity interruptions. Afterwards, the new gateway (ERGW-A_migrate) handles traffic.

Commit stage

After migration, the Azure team monitors connectivity for 15 days to ensure everything is functioning as expected. The banner automatically updates to indicate completion of the migration. During this validation period, you can't modify resources associated with either the old or the new gateway. To resume normal CRUD operations without waiting 15 days, you have two options:

- Commit: Finalize the migration and unlock resources.
- Abort: Revert to the old gateway, which deletes the new gateway and its connections.

To initiate Commit before the 15-day window ends, type yes and select Commit in the portal. When the commit is initiated from the backend, you see "Committing migration. The operation may take some time to complete." The old gateway and its connections are then deleted.
The event shows as initiated by GatewayRP in the activity logs. After the old connections are deleted, the old gateway is deleted. Finally, the resource group RGA contains only resources related to the migrated gateway ERGW-A_migrate. The ExpressRoute gateway migration from Basic to Standard Public IP SKU is now complete.

Frequently asked questions

How long will the Microsoft team wait before committing to the new gateway?
The Microsoft team waits around 15 days after migration to allow you time to validate connectivity and ensure all requirements are met. You can commit at any time during this 15-day period.

What is the traffic impact during migration? Is there packet loss or routing disruption?
Traffic is rerouted seamlessly during migration. Under normal conditions, no packet loss or routing disruption is expected. Brief connectivity interruptions (typically less than 1 minute) might occur during the traffic cutover phase.

Can we make changes to the ExpressRoute gateway deployment during the migration?
Avoid making non-critical changes to the deployment (gateway resources, connected circuits, and so on). If modifications are absolutely required, you have the option (after the Migrate stage) to either commit or abort the migration.

Assess Azure DDoS Protection Status Across Your Environment
Introduction

Distributed Denial of Service (DDoS) attacks continue to be one of the most prevalent threats facing organizations with internet-facing workloads. Azure DDoS Protection provides cloud-scale protection against L3/L4 volumetric attacks, helping ensure your applications remain available during an attack. However, as Azure environments grow, maintaining visibility into which resources are protected, and whether diagnostic logging is properly configured, becomes increasingly challenging. Security teams often struggle to answer basic questions:

Which Public IP addresses are protected by Azure DDoS Protection?
Are we using IP Protection or Network Protection (DDoS Protection Plan)?
Is diagnostic logging enabled for protected resources?

To address these questions at scale, we've developed a PowerShell script that assesses your Azure DDoS Protection posture across all subscriptions.

Understanding Azure DDoS Protection SKUs

Azure offers two DDoS Protection SKUs:

Network Protection: Enterprise-grade protection via a DDoS Protection Plan attached to VNETs. Scope: all Public IPs in protected VNETs.
IP Protection: Per-IP protection for individual Public IP addresses. Scope: the individual Public IP.

For more details, see Azure DDoS Protection overview.

The Assessment Script

The Check-DDoSProtection.ps1 script provides a full view of DDoS Protection status across your Azure environment. This section covers the script's key capabilities and the resource types it supports.
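Conceptually, the per-IP check at the heart of the script looks something like the following sketch. This is an illustrative outline, not the published script itself, and it assumes the DdosSettings property exposed on Get-AzPublicIpAddress output by the Az.Network module; the real script additionally correlates VNETs and diagnostic settings.

```powershell
# Illustrative sketch only - the published Check-DDoSProtection.ps1 is more complete.
# Assumes Az.Accounts and Az.Network are installed and Connect-AzAccount has run.
$results = foreach ($pip in Get-AzPublicIpAddress) {
    # A Public IP is protected if IP Protection is enabled on the IP itself,
    # or if it sits in a VNET with a DDoS Protection Plan (checked separately).
    $ipProtected = $pip.DdosSettings -and
                   $pip.DdosSettings.ProtectionMode -eq 'Enabled'
    [pscustomobject]@{
        PublicIpName  = $pip.Name
        ResourceGroup = $pip.ResourceGroupName
        IpAddress     = $pip.IpAddress
        DdosProtected = [bool]$ipProtected
        RiskLevel     = if ($ipProtected) { 'Low' } else { 'High' }
    }
}
$results | Format-Table
```

The full script layers the VNET correlation and diagnostic-logging checks described below on top of this basic loop.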
Key Features

Multi-subscription support: Scan a single subscription or all subscriptions you have access to
DDoS Protection status: Identifies which Public IPs are protected and which SKU is being used
VNET correlation: Automatically determines the VNET associated with each Public IP to assess Network Protection inheritance
Diagnostic logging check: Verifies whether DDoS diagnostic logs are configured for protected resources
CSV export: Export results for further analysis or reporting

Prerequisites

Before running the script, ensure you have:

Azure PowerShell modules installed. Run the following commands in PowerShell (version 5.1+) or PowerShell Core to install the required Azure modules. No special permissions are needed; these install in your user profile.

Install-Module -Name Az.Accounts -Scope CurrentUser -Force
Install-Module -Name Az.Network -Scope CurrentUser -Force
Install-Module -Name Az.Monitor -Scope CurrentUser -Force

Appropriate Azure permissions:
- Reader role on subscriptions you want to scan
- Microsoft.Network/publicIPAddresses/read
- Microsoft.Network/virtualNetworks/read
- Microsoft.Insights/diagnosticSettings/read

Azure login: Authenticate to Azure before running the script. This opens a browser window for interactive sign-in.

Connect-AzAccount

How to Use the Script

Run the script from a PowerShell session where you've already authenticated with Connect-AzAccount. The account must have the Reader role on the subscriptions you want to scan.

Download the Script

You can download the script from GitHub: Check-DDoSProtection.ps1

Basic Usage: Scan Current Subscription

Scans only the subscription currently selected in your Azure context.

.\Check-DDoSProtection.ps1

Scan a Specific Subscription

Scans a single subscription by its ID.

.\Check-DDoSProtection.ps1 -SubscriptionId "12345678-1234-1234-1234-123456789012"

Scan All Subscriptions

Scans every subscription your account has Reader access to.
.\Check-DDoSProtection.ps1 -AllSubscriptions

Export Results to CSV

Exports the assessment results to a CSV file for reporting or further analysis.

.\Check-DDoSProtection.ps1 -AllSubscriptions -ExportPath "C:\Reports\DDoS-Report.csv"

Large Environment Options

For organizations with many subscriptions or thousands of Public IPs, use the following parameters to handle errors gracefully and avoid API throttling.

.\Check-DDoSProtection.ps1 -AllSubscriptions `
    -ContinueOnError `
    -SavePerSubscription `
    -ExportPath "C:\Reports\DDoS-Report.csv" `
    -ThrottleDelayMs 200

Parameters for large environments:

-ContinueOnError: Continue scanning even if a subscription fails (for example, access denied)
-SavePerSubscription: Save a separate CSV file for each subscription
-ThrottleDelayMs: Delay between API calls to avoid throttling (default: 100 ms)

Understanding the Output

The script provides both console output and optional CSV export. This section covers what each output type contains.

Console Output

The script displays a summary table for each subscription, followed by summary statistics at the end of each subscription scan.

CSV Export Columns

Subscription: Name of the Azure subscription
Public IP Name: Name of the Public IP resource
Resource Group: Resource group containing the Public IP
Location: Azure region
IP Address: Actual IP address (or "Dynamic" if not allocated)
IP SKU: Basic or Standard
DDoS Protected: Yes/No
Risk Level: High (unprotected) / Low (protected)
DDoS SKU: Network Protection, IP Protection, or None
DDoS Plan Name: Name of the DDoS Protection Plan (if applicable)
VNET Name: Associated Virtual Network name
Associated Resource: Resource the Public IP is attached to
Resource Type: Type of associated resource (VM, AppGw, LB, etc.)
Diagnostic Logging: Configured / Not Configured / N/A
Log Destination: Log Analytics, Storage, Event Hub, or None
Recommendation: Suggested action for unprotected resources

Sample Scenarios

Scenario 1: Protected Application Gateway

Public IP Name: appgw-frontend-pip
DDoS Protected: Yes
DDoS SKU: Network Protection
DDoS Plan Name: contoso-ddos-plan
VNET Name: production-vnet
Diagnostic Logging: Configured (Log Analytics)
Risk Level: Low

Explanation: The Application Gateway's Public IP inherits protection from the VNET, which has a DDoS Protection Plan attached. Diagnostic logging is properly configured.

Scenario 2: Unprotected External Load Balancer

Public IP Name: external-lb-pip
DDoS Protected: No
DDoS SKU: VNET not protected
VNET Name: (External LB)
Diagnostic Logging: N/A
Risk Level: High
Recommendation: Enable DDoS Protection on the associated VNET or enable IP Protection

Explanation: This external Load Balancer's Public IP is not in a protected VNET. The script flags this as high risk.

Scenario 3: IP Protection Without Logging

Public IP Name: standalone-api-pip
DDoS Protected: Yes
DDoS SKU: IP Protection
VNET Name: -
Diagnostic Logging: Not Configured
Risk Level: Low
Recommendation: Configure diagnostic logging for DDoS-protected resources

Explanation: The IP has IP Protection enabled, but diagnostic logging is not configured. While protected, you won't have visibility into attack telemetry.

Troubleshooting

Script Doesn't Find All Subscriptions

Use the following command to list your Azure role assignments and verify you have Reader access to the target subscriptions. Run it from Azure Cloud Shell or a local PowerShell session after authenticating with Connect-AzAccount.

# Check your role assignments
Get-AzRoleAssignment -SignInName (Get-AzContext).Account.Id | Select-Object Scope, RoleDefinitionName

API Throttling

The script includes built-in retry logic for API throttling. If you still experience rate limit errors, increase the delay between API calls.
Run this from the directory containing the script.

.\Check-DDoSProtection.ps1 -AllSubscriptions -ThrottleDelayMs 500

Access Denied for Specific Resources

The script displays "(Access Denied)" for VNETs or resources you don't have permission to read. This doesn't affect the overall assessment but may result in incomplete VNET information.

Summary

This guide covered how to use the Check-DDoSProtection.ps1 script to identify unprotected Public IP addresses, determine which DDoS SKU (Network Protection vs. IP Protection) is in use, verify diagnostic logging configuration, and assess risk levels across all subscriptions. Running this script periodically helps security teams track protection coverage as their Azure environment evolves.

Related Resources

Azure DDoS Protection Overview
Azure DDoS Protection SKU Comparison
Configure DDoS Protection Diagnostic Logging
Best Practices for Azure DDoS Protection
Zero Trust with Azure DDoS Protection

Announcing public preview: Cilium mTLS encryption for Azure Kubernetes Service
We are thrilled to announce the public preview of Cilium mTLS encryption in Azure Kubernetes Service (AKS), delivered as part of Advanced Container Networking Services and powered by the Azure CNI dataplane built on Cilium. This capability is the result of a close engineering collaboration between Microsoft and Isovalent (now part of Cisco). It brings transparent, workload-level mutual TLS (mTLS) to AKS without sidecars, without application changes, and without introducing a separate service mesh stack. This public preview represents a major step forward in delivering secure, high-performance, and operationally simple networking for AKS customers. In this post, we'll walk through how Cilium mTLS works, when to use it, and how to get started.

Why Cilium mTLS encryption matters

Traditionally, teams looking to encrypt in-transit traffic in Kubernetes have had two primary options:

Node-level encryption (for example, WireGuard or virtual network encryption), which secures traffic in transit but lacks workload identity and authentication.
Service meshes, which provide strong identity and mTLS guarantees but introduce operational complexity.

This trade-off has become increasingly problematic: many teams want workload-level encryption and authentication, but without the cost, overhead, and architectural impact of deploying and operating a full service mesh. Cilium mTLS closes this gap directly in the dataplane. It delivers transparent, inline mTLS encryption and authentication for pod-to-pod TCP traffic, enforced below the application layer and implemented natively in the Azure CNI dataplane built on Cilium. Customers gain workload-level security without introducing a separate service mesh, resulting in a simpler architecture with lower operational overhead. To see how this works under the hood, the next section breaks down the Cilium mTLS architecture and follows a pod-to-pod TCP flow from interception to authentication and encryption.
Architecture and design: How Cilium mTLS works

Cilium mTLS achieves workload-level authentication and encryption by combining three key components, each responsible for a specific part of the authentication and encryption lifecycle.

Cilium agent: Transparent traffic interception and wiring

The Cilium agent, which already runs on any cluster using Azure CNI powered by Cilium, is responsible for making mTLS invisible to applications. When a namespace is labelled with io.cilium/mtls-enabled=true, the Cilium agent enrolls all pods in that namespace. It enters each pod's network namespace and installs iptables rules that redirect outbound traffic to ztunnel on port 15001. It is also responsible for passing workload metadata (such as pod IP and namespace context) to ztunnel.

Ztunnel: Node-level mTLS enforcement

Ztunnel is an open source, lightweight, node-level Layer 4 proxy originally created by Istio. Ztunnel runs as a DaemonSet. On the source node, it looks up the destination workload via xDS (streamed from the Cilium agent) and establishes mutually authenticated TLS 1.3 sessions between source and destination nodes. Connections are held inline until authentication is complete, ensuring that traffic is never sent in plaintext. The destination ztunnel decrypts the traffic and delivers it into the target pod, bypassing the interception rules via an in-pod mark. The application sees a normal plaintext connection; it is completely unaware that encryption happened.

SPIRE: Workload identity and trust

SPIRE (SPIFFE Runtime Environment) provides the identity foundation for Cilium mTLS. SPIRE acts as the cluster Certificate Authority, issuing short-lived X.509 certificates (SVIDs) that are automatically rotated and validated. This is a key design principle of Cilium mTLS: trust is based on workload identity, not network topology.
Each workload receives a cryptographic identity derived from:

Kubernetes namespace
Kubernetes ServiceAccount

These identities are issued and rotated automatically by SPIRE and validated on both sides of every connection. As a result:

Identity remains stable across pod restarts and rescheduling
Authentication is decoupled from IP addresses
Trust decisions align naturally with Kubernetes RBAC and namespace boundaries

This enables a zero-trust networking model that fits cleanly into existing AKS security practices.

End-to-end workflow example

To see how these components work together, consider a simple pod-to-pod connection:

1. A pod initiates a TCP connection to another pod.
2. Traffic is intercepted inside the pod network namespace and redirected to the local ztunnel instance.
3. ztunnel retrieves the workload identity using certificates issued by SPIRE.
4. ztunnel establishes a mutually authenticated TLS session with the destination node's ztunnel.
5. Traffic is encrypted and sent between pods.
6. The destination ztunnel decrypts the traffic and delivers it to the target pod.

Every packet from an enrolled pod is encrypted. There is no plaintext window and no dropped first packets. The connection is held inline by ztunnel until the mTLS tunnel is established, then traffic flows bidirectionally through an HBONE (HTTP/2 CONNECT) tunnel.

Workload enrolment and scope

Cilium mTLS in AKS is opt-in and scoped at the namespace level. Platform teams enable mTLS by applying a single label to a namespace. From that point on:

All pods in that namespace participate in mTLS
Authentication and encryption are mandatory between enrolled workloads
Non-enrolled namespaces continue to operate unchanged

Encryption is applied only when both pods are enrolled. Traffic between enrolled and non-enrolled workloads continues in plaintext without causing connectivity issues or hard failures. This model enables gradual rollout, staged migrations, and low-risk adoption across environments.
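For example, using the io.cilium/mtls-enabled label mentioned earlier, enrollment is a single labeling operation. The namespace name here (payments) is purely illustrative:

```shell
# Enroll every pod in the namespace in mTLS (namespace name is illustrative)
kubectl label namespace payments io.cilium/mtls-enabled=true

# Verify the label is present; the Cilium agent enrolls the namespace's pods
kubectl get namespace payments --show-labels
```

Removing the label reverses the opt-in, consistent with the namespace-scoped enrollment model described above.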
Getting started in AKS

Cilium mTLS encryption is available in public preview for AKS clusters that use:

Azure CNI powered by Cilium
Advanced Container Networking Services

You can enable mTLS:

When creating a new cluster, or
On an existing cluster by updating the Advanced Container Networking Services configuration

Once enabled, enrolling workloads is as simple as labelling a namespace.

👉 Learn more

Concepts: How Cilium mTLS works, architecture, and trust boundaries
How-to guide: Step-by-step instructions to enable and verify mTLS in AKS

Looking ahead

This public preview represents an important step forward in simplifying network security for AKS and reflects a deep collaboration between Microsoft and Isovalent to bring open, standards-based innovation into production-ready cloud platforms. We're continuing to work closely with the community to improve the feature and move it toward general availability. If you're looking for workload-level encryption without the overhead of a traditional service mesh, we invite you to try Cilium mTLS in AKS and share your experience.

Detect, correlate, contain: New Azure Firewall IDPS detections in Microsoft Sentinel and XDR
As threat actors continue to blend reconnaissance, exploitation, and post-compromise activity, network-level signals remain critical for early detection and correlated response. To strengthen this layer, we're introducing five new Azure Firewall IDPS detections, now available out of the box in the Azure Firewall solution for Microsoft Sentinel and Microsoft Defender XDR.

See It in Action

This short demo walks through Azure Firewall's IDPS capabilities, the new Sentinel detections, and the automated response playbook, from malicious traffic hitting the firewall to the threat being contained without manual intervention. Watch the demo: Azure Firewall integration with Microsoft Sentinel and Defender XDR. Read on for the full details on each detection, customization options, and a step-by-step walkthrough of the automated response workflow.

What's new

The Azure Firewall solution now includes five new analytic detections built on Azure Firewall IDPS telemetry:

High severity malicious activity
Detects: Repeated high-confidence IDPS hits such as exploit kits, malware C2, credential theft, trojans, and shellcode delivery
MITRE ATT&CK tactics: Initial Access (TA0001), Execution (TA0002), Command and Control (TA0011)
Representative techniques: Exploit Public-Facing Application (T1190), Application Layer Protocol: Web Protocols (T1071.001), Ingress Tool Transfer (T1105)
SOC impact: Highlights active exploitation or post-compromise behavior at the network layer; a strong pivot point into XDR investigations

Elevation of privilege attempt
Detects: Repeated attempts or success gaining user or administrator privileges
MITRE ATT&CK tactic: Privilege Escalation (TA0004)
Representative technique: Exploitation for Privilege Escalation (T1068)
SOC impact: Flags critical inflection points where attackers move from foothold to higher-impact control

Web application attack
Detects: Probing or exploitation attempts against web applications
MITRE ATT&CK tactic: Initial Access (TA0001)
Representative technique: Exploit Public-Facing Application (T1190)
SOC impact: Surfaces external attack pressure against internet-facing apps protected by Azure Firewall

Medium severity malicious activity
Detects: Potentially unwanted programs, crypto mining, social engineering indicators, suspicious filenames/system calls
MITRE ATT&CK tactics: Initial Access (TA0001), Execution (TA0002), Impact (TA0040)
Representative techniques: User Execution (T1204), Resource Hijacking (T1496)
SOC impact: Early-stage or lower-confidence signals that help teams hunt, monitor, and tune response before escalation

Denial of Service (DoS) attack
Detects: Attempted or sustained denial-of-service traffic patterns
MITRE ATT&CK tactic: Impact (TA0040)
Representative technique: Network Denial of Service (T1498)
SOC impact: Enables faster DoS identification and escalation, reducing time to mitigation

Where these detections apply

These detections are available through the Azure Firewall solution in:

Microsoft Sentinel, enabling SOC-centric investigation, hunting, and automation
Microsoft Defender XDR, allowing network-level signals to participate in end-to-end attack correlation across identity, endpoint, cloud, and email

They are powered by the AZFWIdpsSignature log table and require Azure Firewall with IDPS enabled (preferably with TLS inspection).

Customizing the detections to fit your environment

The Azure Firewall IDPS detections included in the Microsoft Sentinel solution are designed to be fully adaptable to customer environments, allowing SOC teams to tune sensitivity, scope, and signal fidelity based on their risk tolerance and operational maturity. Each detection is built on the AZFWIdpsSignature log table and exposes several clearly defined parameters that customers can modify without rewriting the analytic logic.

1. Tune alert sensitivity and time horizon
Customers can adjust the lookback period (TimeWindow) and minimum hit count (HitThreshold) to control how aggressively the detection triggers. Shorter windows and lower thresholds surface faster alerts for high-risk environments, while longer windows and higher thresholds help reduce noise in high-volume networks.

2.
Align severity with internal risk models
Each analytic rule includes a configurable minimum severity (MinSeverity) aligned to Azure Firewall IDPS severity scoring. Organizations can raise or lower this value to match internal incident classification standards and escalation policies.

3. Focus on relevant threat categories and behaviors
Optional filters allow detections to be scoped to specific threat categories, descriptions, or enforcement actions. Customers can enable or disable:

Category filtering to focus on specific attack classes (for example, command and control, exploit kits, denial of service, or privilege escalation).
Description filtering to target specific behavioral patterns.
Action filtering to alert only on denied or alerted traffic versus purely observed activity.

This flexibility makes it easy to tailor detections for different deployment scenarios such as internet-facing workloads, internal east-west traffic monitoring, or regulated environments with stricter alerting requirements.

4. Preserve structure while customizing output
Even with customization, the detections retain consistent enrichment fields, including source IP, threat category, hit count, severity, actions taken, and signature IDs, ensuring alerts remain actionable and easy to correlate across Microsoft Sentinel and Microsoft Defender XDR workflows.

By allowing customers to tune thresholds, scope, and focus areas while preserving analytic intent, these Azure Firewall IDPS detections provide a strong out-of-the-box baseline that can evolve alongside an organization's threat landscape and SOC maturity.

Automated detection and response for Azure Firewall using Microsoft Sentinel

In this walkthrough, we'll follow a real-world attack simulation and see how Azure Firewall, Microsoft Sentinel, and an automated playbook work together to detect, respond to, and contain malicious activity, without manual intervention.
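Before stepping through the simulation, here is a KQL sketch of how the tuning parameters described earlier (TimeWindow, HitThreshold, MinSeverity) typically combine over the AZFWIdpsSignature table. The column names used here (SourceIp, Severity, Category, SignatureId, Action) are illustrative assumptions, not the exact query shipped in the analytic rules:

```kusto
// Illustrative sketch - tune the three parameters to your environment.
let TimeWindow = 1h;      // lookback period
let HitThreshold = 5;     // minimum signature hits per source
let MinSeverity = 2;      // minimum IDPS severity to consider
AZFWIdpsSignature
| where TimeGenerated > ago(TimeWindow)
| where Severity <= MinSeverity   // assumption: lower value = higher severity
| summarize HitCount = count(),
            Signatures = make_set(SignatureId),
            Actions = make_set(Action)
          by SourceIp, Category
| where HitCount >= HitThreshold
```

Raising HitThreshold or narrowing Category produces fewer, higher-confidence alerts, which is the same trade-off the shipped rules expose through their parameters.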
Step 1: Malicious traffic originates from a compromised source

A source IP address, 10.0.100.20, hosted within a virtual network, attempts to reach a web application protected by Azure Firewall. To validate the scenario, we intentionally generate malicious outbound traffic from this source, such as payloads that match known attack patterns. This is an outbound flow, meaning the traffic is leaving the internal network and attempting to reach an external destination through Azure Firewall. At this stage:

Azure Firewall is acting as the central enforcement point
Traffic is still allowed, but deep packet inspection is in effect

Step 2: Azure Firewall IDPS detects malicious behavior

Azure Firewall's intrusion detection and prevention system (IDPS) is enabled and inspects traffic as it passes through the firewall. When IDPS detects patterns that match known malicious signatures, the action taken depends on the signature's configured mode:

Alert mode: IDPS generates a detailed security log for the matched signature but allows the traffic to continue. This is useful for monitoring and tuning before enforcing blocks.
Alert and Deny mode: IDPS blocks the matching traffic and generates a detailed security log. The threat is stopped at the network layer while full telemetry is preserved for investigation.

In both cases, IDPS records rich metadata including source IP, destination, protocol, signature name, severity, and threat category. These logs are what power the downstream detections in Microsoft Sentinel. In this walkthrough, the signature is configured in Alert and Deny mode, meaning the malicious traffic from 10.0.100.20 is blocked immediately at the firewall while the corresponding log is forwarded for analysis.

Step 3: Firewall logs are sent to Log Analytics

All Azure Firewall logs, including IDPS logs, are sent to a Log Analytics workspace named law-cxeinstance.
At this point:

Firewall logs are centralized
Logs are normalized and can be queried
No alerting has happened yet, only data collection

This workspace becomes the single source of truth for downstream analytics and detections.

Step 4: Microsoft Sentinel ingests and analyzes the firewall logs

The Log Analytics workspace is connected to Microsoft Sentinel, which continuously analyzes incoming data. Using the Azure Firewall solution from the Sentinel Content Hub, we previously deployed a set of built-in analytic rule templates designed specifically for firewall telemetry. One of these rules is "High severity malicious activity detected". This rule evaluates IDPS logs and looks for high-confidence signatures, known exploit techniques, and malicious categories identified by Firewall IDPS.

Step 5: Sentinel creates an incident

When the analytic rule conditions are met, Microsoft Sentinel automatically:

Raises an alert
Groups related alerts into an incident
Extracts entities such as IP addresses, severity, and evidence

In this case, the source IP 10.0.100.20 is clearly identified as the malicious actor and attached as an IP entity to the incident. This marks the transition from detection to response.

Step 6: An automation rule triggers the playbook

To avoid manual response, we configured a Sentinel automation rule that triggers whenever:

An incident is created
The analytic rule name matches any of the analytic rules we configured

The automation rule immediately triggers a Logic App playbook named AzureFirewallBlockIPaddToIPGroup. This playbook is available as part of the Azure Firewall solution and can be deployed directly from the solution package. In addition, a simplified version of the playbook is published in our GitHub repository, allowing you to deploy it directly to your resource group using the provided ARM template. This is where automated containment begins.
Step 7: The playbook aggregates and updates the IP Group

The playbook performs several critical actions in sequence:

1. Extracts IP entities from the Sentinel incident
2. Retrieves the existing Azure Firewall IP Group named MaliciousIPs
3. Checks for duplicates to avoid unnecessary updates
4. Aggregates new IPs into a single array/list
5. Updates the IP Group in a single operation

Note that the playbook's managed identity needs Contributor access on the IP Group or its resource group to perform this action. In our scenario, the IP 10.0.100.20 is added to the MaliciousIPs IP Group.

Step 8: Firewall policy enforces the block immediately

Azure Firewall already has a network rule named BlockMaliciousTraffic configured with:

Source: MaliciousIPs IP Group
Destination: Any
Protocol: Any
Action: Deny

Because the rule references the IP Group dynamically, the moment the playbook updates MaliciousIPs, the firewall enforcement takes effect instantly, without modifying the rule itself. Traffic originating from 10.0.100.20 is now fully blocked, preventing any further probing or communication with the destination. The threat has been effectively contained.

When a SOC analyst opens the Sentinel incident, they see that containment has already occurred: the malicious IP was identified, the IP Group was updated, and the firewall block is in effect, all with a full audit trail of every automated action taken, from detection through response. No manual intervention was required.

Conclusion

With these five new IDPS detections, Azure Firewall closes the gap between network-level signal and SOC-level action. Raw signature telemetry is automatically transformed into severity-aware, MITRE ATT&CK-mapped alerts inside Microsoft Sentinel and Microsoft Defender XDR, giving security teams correlated, investigation-ready incidents instead of isolated log entries.
Combined with automation playbooks, the result is a fully integrated detect-and-respond workflow: Azure Firewall identifies malicious behavior, Sentinel raises and enriches the incident, and a Logic App playbook contains the threat by updating firewall policy in real time, all without manual intervention. These detections are included at no additional cost. Simply install the Azure Firewall solution from the Microsoft Sentinel Content Hub, and the analytic rules automatically appear in your Sentinel workspace, ready to enable, customize, and operationalize.

Get started today:

Azure Firewall with Microsoft Sentinel overview
Automate Threat Response with Playbooks in Microsoft Sentinel
Azure Firewall Premium features implementation guide

Recent real-world breaches underscore why these detections matter. Over the past year, attackers have repeatedly gained initial access by exploiting public-facing applications, followed by command-and-control activity, web shell deployment, cryptomining, and denial-of-service attacks. Incidents such as the GoAnywhere MFT exploitation, widespread web-application intrusions observed by Cisco Talos, and large-scale cryptomining campaigns against exposed cloud services demonstrate the value of correlating repeated network-level malicious signals. The new Azure Firewall IDPS detections are designed to surface these patterns early, reduce alert noise, and feed high-confidence network signals directly into Microsoft Sentinel and Microsoft Defender XDR for faster investigation and response. Your network telemetry is a first-class security signal - let it work for you! Visit us at RSA 2026 to see the full detection-to-containment workflow live.

My First TechCommunity Post: Azure VPN Gateway BGP Timer Mismatches
This is my first post on the Microsoft TechCommunity. Today is my seven-year anniversary at Microsoft. In my current role as a Senior Cloud Solution Architect supporting Infrastructure in Cloud & AI Platforms, I want to start by sharing a real-world lesson learned from customer engagements rather than a purely theoretical walkthrough. This work, and the resulting update to the official documentation on Microsoft Learn, is the culmination of nearly two years of support for a very large global SD-WAN deployment with hundreds of site-to-site VPN connections into Azure VPN Gateway.

The topic is deceptively simple (BGP timers), but mismatched expectations can cause significant instability when connecting on-premises environments to Azure. If you've ever seen seemingly random BGP session resets, intermittent route loss, or confusing failover behavior, there's a good chance that a timer mismatch between Azure and your customer premises equipment (CPE) was a contributing factor.

Customer Expectation: BGP Timer Negotiation

Many enterprise routers and firewalls support aggressive BGP timers and expect them to be negotiated during session establishment. A common configuration I see in customer environments looks like:

Keepalive: 10 seconds
Hold time: 30 seconds

This configuration is not inherently wrong. In fact, it is often used intentionally to speed up failure detection and convergence in conventional network environments. My own past experience with short timers was in a national cellular carrier's network, between core switching routers in adjacent racks; all other connections used the default timer values. The challenge appears when that expectation is carried into Azure VPN Gateway.

Azure VPN Gateway Reality: Fixed BGP Timers

Azure VPN Gateway supports BGP but uses fixed timers (60/180) and won't negotiate down. The timers are documented: the BGP keepalive timer is 60 seconds, and the hold timer is 180 seconds.
Azure VPN Gateways use fixed timer values and do not support configurable keepalive or hold timers. This behavior is consistent across supported VPN Gateway SKUs that offer BGP support. Unlike some on-premises devices, Azure will not adapt its timers downward during session establishment.

What Happens During a Timer Mismatch

When a CPE is configured with a 30-second hold timer, it expects to receive BGP keepalives well within that window. Azure, however, sends BGP keepalives every 60 seconds. From the CPE's point of view:

No keepalive is received within 30 seconds
The BGP hold timer expires
The session is declared dead and torn down

Azure may not declare the peer down on the same timeline as the CPE. This mismatch leads to repeated session flaps.

The Hidden Side Effect: BGP State and Stability Controls

During these rapid teardown and re-establishment cycles, many CPE platforms rebuild their BGP tables and may increment internal routing metadata. When this occurs repeatedly:

Azure observes unexpected and rapid route updates
The BGP finite state machine is forced to continually reset and re-converge
BGP session stability is compromised

CPE logging may trigger alerts and internal support tickets. The resulting behavior is often described by customers as "Azure randomly drops routes" or "BGP is unstable", when the instability actually originates from mismatched BGP timer expectations between the CPE and Azure VPN Gateway.

Why This Is More Noticeable on VPN (Not ExpressRoute)

This issue is far more common with VPN Gateway than with ExpressRoute. ExpressRoute supports BFD and allows faster failure detection without relying solely on aggressive BGP timers. VPN Gateway does not support BFD, so customers sometimes compensate by lowering BGP timers on the CPE, unintentionally creating this mismatch. The VPN path traverses Internet/WAN-like transport where delay, loss, and jitter are normal, so Azure's conservative timer choices are stability-focused.
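As an illustration, a Cisco IOS-style CPE configuration that aligns the neighbor timers with Azure's fixed 60/180 values might look like the following sketch. The local ASN and peer address are placeholders; 65515 is the default ASN used by Azure VPN Gateway:

```
router bgp 65010
 ! Azure VPN Gateway BGP peer (placeholder address)
 neighbor 169.254.21.1 remote-as 65515
 ! keepalive 60s, hold time 180s - matches Azure's fixed timers
 neighbor 169.254.21.1 timers 60 180
```

On most platforms, per-neighbor timers override the global BGP defaults, so setting them explicitly on the Azure-facing peer avoids the mismatch without changing timers for other sessions.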
Updated Azure Documentation

The good news is that the official Azure documentation has been updated to clearly state the fixed BGP timer values for VPN Gateway:

- Keepalive: 60 seconds
- Hold time: 180 seconds
- Timer negotiation: Azure uses fixed timers

Azure VPN Gateway FAQ | Microsoft Learn

This clarification helps set the right expectations and prevents customers from assuming Azure behaves like conventional CPE routers.

Practical Guidance

If you are connecting a CPE to Azure VPN Gateway using BGP:

- Do not configure BGP timers lower than Azure's defaults
- Align CPE timers to 60/180 or higher
- Avoid using aggressive timers as a substitute for BFD

For further resilience:

- Consider active-active VPN gateways for better resiliency
- Use four tunnels, commonly implemented in a bowtie configuration, for even better resiliency and traffic stability

Closing Thoughts

This is a great example of how cloud networking often behaves correctly, but differently than conventional on-premises networking environments. Understanding those differences, and documenting them clearly, can save hours of troubleshooting and frustration. If this post helps even one engineer avoid a late-night or multi-month BGP debugging session, then it has done its job.

I did use AI (M365 Copilot) to aid in formatting and to validate technical accuracy. Otherwise, these are my thoughts. Thanks for reading my first TechCommunity post.

Azure Front Door: Resiliency Series – Part 2: Faster recovery (RTO)
In Part 1 of this blog series, we outlined our four-pillar strategy for resiliency in Azure Front Door: configuration resiliency, data plane resiliency, tenant isolation, and accelerated Recovery Time Objective (RTO). Together, these pillars help Azure Front Door remain continuously available and resilient at global scale.

Part 1 focused on the first two pillars: configuration and data plane resiliency. Our goal is to make configuration propagation safer, so incompatible changes never escape pre-production environments. We discussed how incompatible configurations are blocked early, and how data plane resiliency ensures the system continues serving traffic from a last-known-good (LKG) configuration even if a bad change manages to propagate. We also introduced 'Food Taster', a dedicated sacrificial process running in each edge server's data plane that pretests every configuration change in isolation, before it ever reaches the live data plane.

In this post, we turn to the recovery pillar. We describe the key enhancements we have made to the Azure Front Door recovery path so the system can return to full operation in a predictable and bounded timeframe. For a global service like Azure Front Door, serving hundreds of thousands of tenants across 210+ edge sites worldwide, we set an explicit target: to be able to recover any edge site, or all edge sites, within approximately 10 minutes, even in worst-case scenarios. In typical data plane crash scenarios, we expect recovery in under a second.

Repair status

The first blog post in this series mentioned the two Azure Front Door incidents from October 2025; learn more by watching our Azure Incident Retrospective session recordings for the October 9 incident and/or the October 29 incident. Before diving into our platform investments for improving our Recovery Time Objective (RTO), we wanted to provide a quick update on the overall repair items from these incidents.
We are pleased to report that the work on configuration propagation and data plane resiliency is now complete and fully deployed across the platform (in the table below, "Completed" means broadly deployed in production). With this, we have reduced configuration propagation latency from ~45 minutes to ~20 minutes. We anticipate reducing this even further, to ~15 minutes by the end of April 2026, while ensuring that platform stability remains our top priority.

| Learning category | Goal | Repairs | Status |
| --- | --- | --- | --- |
| Safe customer configuration deployment | Incompatible configuration never propagates beyond EUAP or canary regions | Control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state | Completed |
| Data plane resiliency | Configuration processing cannot impact data plane availability | Manage data plane lifecycle to prevent outages caused by configuration-processing defects | Completed |
| | | Isolated work process in every data plane server to process and load the configuration | Completed |
| 100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the batch of critical services impacted by the Oct 29 outage, running on a day-old configuration | Completed |
| | | Phase 2: Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services | March 2026 |
| Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed |
| | | Accelerate recovery time to < 10 minutes | April 2026 |
| Tenant isolation | No configuration or traffic regression can impact other tenants | Micro-cellular Azure Front Door with ingress layered shards | June 2026 |

Why recovery at edge scale is deceptively hard

To understand why recovery took as long as it did, it helps to first understand how the Azure Front Door data plane processes configuration. Azure Front Door operates in 210+ edge sites with multiple servers per site. The data plane of each edge server hosts multiple processes: a master process orchestrates the lifecycle of multiple worker processes that serve customer traffic. A separate configuration translator process runs alongside the data plane processes and is responsible for converting customer configuration bundles from the control plane into optimized binary FlatBuffer files. This translation step, covering hundreds of thousands of tenants, represents hours of cumulative computation. A cache is kept locally on each edge server to enable fast recovery of the data plane, if needed.

Once the configuration translator process produces these FlatBuffer files, each worker processes them independently and memory-maps them for zero-copy access. Configuration updates flow through a two-phase commit: new FlatBuffers are first loaded into a staging area and validated, then atomically swapped into production maps.
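The staged-then-swap pattern above can be sketched in a few lines of Python. This is a hypothetical illustration of the general two-phase commit technique, not Azure Front Door source; the class and method names are invented, and a dictionary assignment stands in for the real atomic pointer swap over memory-mapped FlatBuffers.

```python
# Minimal sketch of a two-phase configuration commit: a candidate config
# is loaded into a staging area and validated first, then atomically
# promoted into the production map that serves traffic. A candidate that
# fails validation never touches the live configuration.

class ConfigStore:
    def __init__(self):
        self.production = {}   # tenant -> validated config, served to traffic
        self.staging = {}      # tenant -> candidate config, not yet live

    def stage(self, tenant: str, config: dict) -> None:
        """Phase 1: load the candidate into staging only."""
        self.staging[tenant] = config

    def validate(self, tenant: str) -> bool:
        """Placeholder validation; the real checks are far richer."""
        cfg = self.staging.get(tenant)
        return cfg is not None and "routes" in cfg

    def commit(self, tenant: str) -> bool:
        """Phase 2: promote to production only if validation passed.
        The single assignment stands in for the atomic swap."""
        if not self.validate(tenant):
            self.staging.pop(tenant, None)   # discard the bad candidate
            return False
        self.production[tenant] = self.staging.pop(tenant)
        return True

store = ConfigStore()
store.stage("contoso", {"routes": ["/api"]})
store.commit("contoso")       # valid candidate -> promoted to production
store.stage("fabrikam", {})   # missing routes -> validation fails
store.commit("fabrikam")      # rejected; production map is untouched
```

The key property, which the post's prose also relies on, is that workers only ever read from the production map, so a defective candidate can fail validation without any impact on live traffic.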
In-flight requests continue using the old configuration until the last request referencing it completes.

The data plane recovery is designed to be resilient to different failure modes. A failure or crash at the worker process level has a typical recovery time of less than one second. Since each server has multiple such worker processes serving customer traffic, this type of crash has no impact on the data plane. In the case of a master process crash, the system automatically tries to recover using the local cache. When the local cache is reused, the system is able to recover quickly, in approximately 60 minutes, since most of the configurations in the cache were already loaded into the data plane before the crash. However, if the cache becomes unavailable or must be invalidated because of corruption, the recovery time increases significantly.

During the October 29 incident, a data plane crash triggered a complete recovery sequence that took approximately 4.5 hours. This was not because restarting a process is slow; it is because a defect in the recovery process invalidated the local cache, which meant that "restart" meant rebuilding everything from scratch. The configuration translator process then had to re-fetch and re-translate every one of the hundreds of thousands of customer configurations before workers could memory-map them and begin serving traffic.

This experience crystallized three fundamental learnings about our recovery path:

- Expensive rework: A subset of crashes discarded all previously translated FlatBuffer artifacts, forcing the configuration translator process to repeat hours of conversion work that had already been validated and stored.
- High restart costs: Every worker on every node had to wait for the configuration translator process to complete the full translation before it could memory-map any configuration and begin serving requests.
- Unbounded recovery time: Recovery time grew linearly with total tenant footprint rather than with active traffic, creating a 'scale penalty' as more tenants onboarded to the system.

Separately and together, these learnings made the insight clear: recovery must stop being proportional to the total configuration size.

Persisting 'validated configurations' across restarts

One of the key recovery improvements was strengthening how validated customer configurations are cached and reused across failures, rather than rebuilding configuration state from scratch during recovery. Azure Front Door already cached customer configurations on host-mounted storage prior to the October incident. The platform enhancements after the outage focused on making the local configuration cache resilient to crashes, partial failures, and bad tenant inputs. Our goal was to ensure that recovery behavior is dominated by serving traffic safely, not by reconstructing configuration state. This led us to two explicit design goals.

Design goals

- No category of crash should invalidate the configuration cache: Configuration cache invalidation must never be the default response to failures. Whether the failure is a worker crash, master crash, data plane restart, or coordinated recovery action, previously validated customer configurations should remain usable unless there is a proven reason to discard them.
- Bad tenant configuration must not poison the entire cache: A single faulty or incompatible tenant configuration should result in targeted eviction of that tenant's configuration only, not wholesale cache invalidation across all tenants.

Platform enhancements

Previously, customer configurations persisted to host-mounted storage, but certain failure paths treated the cache as unsafe and invalidated it entirely. In those cases, recovery implicitly meant reloading and reprocessing configuration for hundreds of thousands of tenants before traffic could resume, even though the vast majority of cached data was still valid.
We changed the recovery model to avoid invalidating customer configurations, with strict scoping around when and how cached entries are discarded:

- Cached configurations are no longer invalidated based on crash type. Failures are assumed to be orthogonal to configuration correctness unless explicitly proven otherwise.
- Cache eviction is granular and tenant-scoped. If a cached configuration fails validation or load checks, only that tenant's configuration is discarded and reloaded. All other tenant configurations remain available.

This ensures that recovery does not regress into a fleet-wide rebuild due to localized or unrelated faults.

Safety and correctness

Durability is paired with strong correctness controls to prevent unsafe configurations from being served:

- Per-tenant validation on load: Each cached tenant configuration is validated during the 'load and verification' phase, before being promoted for traffic serving. Failures are therefore contained to that tenant.
- Targeted re-translation: When validation fails, only the affected tenant's configuration is reloaded or reprocessed. The cache for other tenants is left untouched.
- Operational escape hatch: Operators retain the ability to explicitly instruct a clean rebuild of the configuration cache (with proper authorization), preserving control without compromising the default fast-recovery path.

Resulting behavior

With these changes, recovery behavior now aligns with real-world traffic patterns: configuration defects impact tenants locally and predictably, rather than globally. The system now prefers isolated tenant impact, and continued service using the last-known-good configuration, over aggressive invalidation; both are critical for predictable recovery at the scale of Azure Front Door.
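The tenant-scoped eviction behavior described above can be sketched as a small cache wrapper. Again, this is an illustrative sketch with invented names, not Azure Front Door code: the point is that a failed validation evicts and re-fetches only that one tenant's entry, while every other tenant's validated configuration stays warm.

```python
# Sketch of tenant-scoped cache eviction: a bad cached entry triggers a
# targeted evict-and-refetch for that tenant only, never a fleet-wide
# cache invalidation.

class TenantConfigCache:
    def __init__(self, fetch_fn, validate_fn):
        self._cache = {}            # tenant -> validated config
        self._fetch = fetch_fn      # pulls a fresh config (e.g. control plane)
        self._validate = validate_fn

    def load(self, tenant: str):
        """Return a validated config; on a bad cached entry, evict and
        re-fetch only this tenant, leaving the rest of the cache intact."""
        cfg = self._cache.get(tenant)
        if cfg is None or not self._validate(cfg):
            self._cache.pop(tenant, None)          # targeted eviction
            cfg = self._fetch(tenant)
            if not self._validate(cfg):
                raise ValueError(f"tenant {tenant}: no valid config")
            self._cache[tenant] = cfg
        return cfg

# Usage: one corrupt entry is re-fetched; the healthy one is served from cache.
cache = TenantConfigCache(
    fetch_fn=lambda t: {"tenant": t, "valid": True},
    validate_fn=lambda c: c.get("valid", False),
)
cache._cache = {
    "good": {"tenant": "good", "valid": True},
    "corrupt": {"tenant": "corrupt", "valid": False},
}
cache.load("good")      # served straight from the warm cache
cache.load("corrupt")   # evicted, re-fetched, revalidated, re-cached
```

The design choice worth noting: eviction is a per-tenant decision made at load time, which is exactly what keeps a localized fault from regressing into a full rebuild.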
Making recovery scale with active traffic, not total tenants

Reusing the configuration cache solves the problem of rebuilding configuration in its entirety, but even with a warm cache, the original startup path had a second bottleneck: eagerly loading a large volume of tenant configurations into memory before serving any traffic. At our scale, memory-mapping and parsing hundreds of thousands of FlatBuffers, constructing internal lookup maps, and adding Transport Layer Security (TLS) certificates and configuration blocks for each tenant collectively added almost an hour to startup time, even when a majority of those tenants had no active traffic at that moment.

We addressed this by fundamentally changing when configuration is loaded into workers. Rather than eagerly loading most of the tenants at startup across all edge locations, Azure Front Door now uses a Machine Learning (ML)-optimized lazy loading model. In the new architecture, instead of loading a large number of tenant configurations, we load only the small subset of tenants that are known to be historically active in a given site; we call this the "warm tenants" list. The warm tenants list per edge site is created through a sophisticated traffic analysis pipeline that leverages ML.

However, loading the warm tenants alone is not enough, because when a request arrives and we don't have the configuration in memory, we need to know two things. First, is this a request from a real Azure Front Door tenant? And if it is, where can we find the configuration? To answer these questions, each worker maintains a hostmap that tracks the state of each tenant's configuration. This hostmap is constructed during startup as we process each tenant configuration: if the tenant is in the warm list, we process and load its configuration fully; if not, we simply add an entry to the hostmap mapping all of its domain names to the configuration path location.
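A simplified sketch of this hostmap-driven model follows. It is a hypothetical illustration (the `Worker` class, field names, and paths are invented for this example, not Azure Front Door internals): warm tenants are loaded fully at startup, every other known tenant gets only a hostname-to-path entry, and a cold tenant's configuration is loaded on demand when its first request arrives. Unknown hostnames can be rejected immediately without loading anything.

```python
# Sketch of warm-list startup plus hostmap-driven lazy loading: startup
# cost scales with the warm set, not with the total tenant count.

class Worker:
    def __init__(self, all_tenants, warm_tenants, load_fn):
        self._load = load_fn        # parses/maps a config from its path
        self._loaded = {}           # hostname -> in-memory config
        self._hostmap = {}          # hostname -> config path (known, not loaded)
        for host, path in all_tenants.items():
            if host in warm_tenants:
                self._loaded[host] = self._load(path)   # eager: warm tenant
            else:
                self._hostmap[host] = path              # lazy: path entry only

    def serve(self, host: str):
        cfg = self._loaded.get(host)
        if cfg is None:
            path = self._hostmap.get(host)
            if path is None:
                return None         # unknown host: not a real tenant here
            cfg = self._load(path)  # on-demand load for a cold tenant
            self._loaded[host] = cfg
        return cfg

tenants = {"hot.example.com": "/cfg/hot", "cold.example.com": "/cfg/cold"}
worker = Worker(tenants, warm_tenants={"hot.example.com"},
                load_fn=lambda p: {"path": p})
worker.serve("hot.example.com")      # already in memory at startup
worker.serve("cold.example.com")     # loaded lazily on its first request
worker.serve("unknown.example.com")  # None: rejected without any load work
```

This also mirrors the failure-isolation point made later in the post: a defective cold tenant's configuration is only ever loaded when a request for it actually arrives.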
When a request arrives for one of these tenants, the worker loads and validates that tenant's configuration on demand, and immediately begins serving traffic. This allows a node to start serving its busiest tenants within a few minutes of startup, while additional tenants are loaded incrementally only when traffic actually arrives, letting the system progressively absorb cold tenants as demand increases.

The effect on recovery is transformative. Instead of recovery time scaling with the total number of tenants configured on a server, it scales with the number of tenants actively receiving traffic. In practice, even at our busiest edge sites, the active tenant set is a small fraction of the total.

Just as importantly, this modified form of lazy loading provides a natural failure isolation boundary. Most edge sites will never load a faulty configuration of an inactive tenant. When a request for an inactive tenant with an incompatible configuration arrives, impact is contained to a single worker. The configuration load architecture now prefers serving as many customers as quickly as possible, rather than waiting until everything is ready before serving anyone.

The above changes are slated to complete in April 2026 and will bring our RTO from the current ~1 hour to under 10 minutes for complete recovery from a worst-case scenario.

Continuous validation through Game Days

A critical element of our recovery confidence comes from GameDay fault-injection testing. We don't simply design recovery mechanisms and assume they work; we break the system deliberately and observe how it responds. Since late 2025, we have conducted recurring GameDay drills that simulate the exact failure scenarios we are defending against:

- Food Taster crash scenarios: Injecting deliberately faulty tenant configurations to verify that they are caught and isolated with zero impact on live traffic.
  In our January 2026 GameDay, the Food Taster process crashed as expected, the system halted the update within approximately 5 seconds, and no customer traffic was affected.
- Master process crash scenarios: Triggering master process crashes across test environments to verify that workers continue serving traffic, that the Local Config Shield engages within 10 seconds, and that the coordinated recovery tool restores full operation within the expected timeframe.
- Multi-region failure drills: Simulating simultaneous failures across multiple regions to validate that global Config Shield mechanisms engage correctly, and that recovery procedures scale without requiring manual per-region intervention.
- Fallback test drills for critical Azure services running behind Azure Front Door: In our February 2026 GameDay, we simulated the complete unavailability of Azure Front Door and successfully validated failover for critical Azure services with no impact to traffic.

These drills have both surfaced corner cases and built operational confidence. They have transformed recovery from a theoretical plan into tested, repeatable muscle memory. As we noted in an internal communication to our team: "Game day testing is a deliberate shift from assuming resilience to actively proving it—turning reliability into an observed and repeatable outcome."

Closing

Part 1 of this series emphasized preventing unsafe configurations from reaching the data plane, and data plane resiliency in case an incompatible configuration reaches production. This post has shown that prevention alone is not enough: when failures do occur, recovery must be fast, predictable, and bounded. By ensuring that the FlatBuffer cache is never invalidated, by loading only active tenants, and by building safe coordinated recovery tooling, we have transformed failure handling from a fleet-wide crisis into a controlled operation. These recovery investments work in concert with the prevention mechanisms described in Part 1.
Together, they ensure that the path from incident detection to full service restoration is measured in minutes, with customer traffic protected at every step.

In the next post of this series, we will cover the third pillar of our resiliency strategy: tenant isolation. We will show how micro-cellular architecture and ingress-layered sharding can reduce the blast radius of any failure to a small subset, ensuring that one customer's configuration or traffic anomaly never becomes everyone's problem.

We deeply value our customers' trust in Azure Front Door. We are committed to transparently sharing our progress on these resiliency investments, and to exceeding expectations for safety, reliability, and operational readiness.