Announcing public preview: Cilium mTLS encryption for Azure Kubernetes Service
We are thrilled to announce the public preview of Cilium mTLS encryption in Azure Kubernetes Service (AKS), delivered as part of Advanced Container Networking Services and powered by the Azure CNI dataplane built on Cilium. This capability is the result of a close engineering collaboration between Microsoft and Isovalent (now part of Cisco). It brings transparent, workload‑level mutual TLS (mTLS) to AKS without sidecars, without application changes, and without introducing a separate service mesh stack. This public preview represents a major step forward in delivering secure, high‑performance, and operationally simple networking for AKS customers. In this post, we’ll walk through how Cilium mTLS works, when to use it, and how to get started.

Why Cilium mTLS encryption matters

Traditionally, teams looking to encrypt traffic in transit in Kubernetes have had two primary options:

Node-level encryption (for example, WireGuard or virtual network encryption), which secures traffic in transit but lacks workload identity and authentication.
Service meshes, which provide strong identity and mTLS guarantees but introduce operational complexity.

This trade‑off has become increasingly problematic: many teams want workload‑level encryption and authentication, but without the cost, overhead, and architectural impact of deploying and operating a full service mesh.

Cilium mTLS closes this gap directly in the dataplane. It delivers transparent, inline mTLS encryption and authentication for pod‑to‑pod TCP traffic, enforced below the application layer. Because it is implemented natively in the Azure CNI dataplane built on Cilium, customers gain workload‑level security without introducing a separate service mesh, resulting in a simpler architecture with lower operational overhead.

To see how this works under the hood, the next section breaks down the Cilium mTLS architecture and follows a pod‑to‑pod TCP flow from interception to authentication and encryption.
Architecture and design: How Cilium mTLS works

Cilium mTLS achieves workload‑level authentication and encryption by combining three key components, each responsible for a specific part of the authentication and encryption lifecycle.

Cilium agent: Transparent traffic interception and wiring

The Cilium agent, which already runs on any cluster using Azure CNI powered by Cilium, is responsible for making mTLS invisible to applications. When a namespace is labelled with “io.cilium/mtls-enabled=true”, the Cilium agent enrolls all pods in that namespace. It enters each pod's network namespace and installs iptables rules that redirect outbound traffic to ztunnel on port 15001. It is also responsible for passing workload metadata (such as pod IP and namespace context) to ztunnel.

Ztunnel: Node‑level mTLS enforcement

Ztunnel is an open source, lightweight, node‑level Layer 4 proxy originally created by the Istio project. Ztunnel runs as a DaemonSet. On the source node, it looks up the destination workload via XDS (streamed from the Cilium agent) and establishes mutually authenticated TLS 1.3 sessions between source and destination nodes. Connections are held inline until authentication is complete, ensuring that traffic is never sent in plaintext. The destination ztunnel decrypts the traffic and delivers it into the target pod, bypassing the interception rules via an in-pod mark. The application sees a normal plaintext connection — it is completely unaware encryption happened.

SPIRE: Workload identity and trust

SPIRE (SPIFFE Runtime Environment) provides the identity foundation for Cilium mTLS. SPIRE acts as the cluster Certificate Authority, issuing short‑lived X.509 certificates (SVIDs) that are automatically rotated and validated. This is a key design principle of Cilium mTLS: trust is based on workload identity, not network topology.
Each workload receives a cryptographic identity derived from:

Kubernetes namespace
Kubernetes ServiceAccount

These identities are issued and rotated automatically by SPIRE and validated on both sides of every connection. As a result:

Identity remains stable across pod restarts and rescheduling
Authentication is decoupled from IP addresses
Trust decisions align naturally with Kubernetes RBAC and namespace boundaries

This enables a zero‑trust networking model that fits cleanly into existing AKS security practices.

End‑to‑End workflow example

To see how these components work together, consider a simple pod‑to‑pod connection:

A pod initiates a TCP connection to another pod.
Traffic is intercepted inside the pod network namespace and redirected to the local ztunnel instance.
ztunnel retrieves the workload identity using certificates issued by SPIRE.
ztunnel establishes a mutually authenticated TLS session with the destination node’s ztunnel.
Traffic is encrypted and sent between pods.
The destination ztunnel decrypts the traffic and delivers it to the target pod.

Every packet from an enrolled pod is encrypted. There is no plaintext window, and no dropped first packets. The connection is held inline by ztunnel until the mTLS tunnel is established, then traffic flows bidirectionally through an HBONE (HTTP/2 CONNECT) tunnel.

Workload enrolment and scope

Cilium mTLS in AKS is opt‑in and scoped at the namespace level. Platform teams enable mTLS by applying a single label to a namespace. From that point on:

All pods in that namespace participate in mTLS
Authentication and encryption are mandatory between enrolled workloads
Non-enrolled namespaces continue to operate unchanged

Encryption is applied only when both pods are enrolled. Traffic between enrolled and non‑enrolled workloads continues in plaintext without causing connectivity issues or hard failures. This model enables gradual rollout, staged migrations, and low-risk adoption across environments.
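Enrolment is driven entirely by the namespace label described above. As a sketch, an enrolled namespace could look like the following (the namespace name demo is illustrative; the label key and value come from the Cilium agent section earlier):

```yaml
apiVersion: v1
kind: Namespace
metadata:
  name: demo                       # illustrative namespace name
  labels:
    # Enrols every pod in this namespace in Cilium mTLS
    io.cilium/mtls-enabled: "true"
```

Equivalently, an existing namespace can be labelled in place with `kubectl label namespace demo io.cilium/mtls-enabled=true`.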
Getting started in AKS

Cilium mTLS encryption is available in public preview for AKS clusters that use:

Azure CNI powered by Cilium
Advanced Container Networking Services

You can enable mTLS:

When creating a new cluster, or
On an existing cluster by updating the Advanced Container Networking Services configuration

Once enabled, enrolling workloads is as simple as labelling a namespace.

👉 Learn more

Concepts: How Cilium mTLS works, architecture, and trust boundaries
How-to guide: Step-by-step instructions to enable and verify mTLS in AKS

Looking ahead

This public preview represents an important step forward in simplifying network security for AKS and reflects a deep collaboration between Microsoft and Isovalent to bring open, standards‑based innovation into production‑ready cloud platforms. We’re continuing to work closely with the community to improve the feature and move it toward general availability. If you’re looking for workload‑level encryption without the overhead of a traditional service mesh, we invite you to try Cilium mTLS in AKS and share your experience.

Detect, correlate, contain: New Azure Firewall IDPS detections in Microsoft Sentinel and XDR
As threat actors continue to blend reconnaissance, exploitation, and post-compromise activity, network-level signals remain critical for early detection and correlated response. To strengthen this layer, we're introducing five new Azure Firewall IDPS detections, now available out of the box in the Azure Firewall solution for Microsoft Sentinel and Microsoft Defender XDR.

See It in Action

This short demo walks through Azure Firewall's IDPS capabilities, the new Sentinel detections, and the automated response playbook — from malicious traffic hitting the firewall to the threat being contained without manual intervention.

Watch the demo → Azure Firewall integration with Microsoft Sentinel and Defender XDR

Read on for the full details on each detection, customization options, and a step-by-step walkthrough of the automated response workflow.

What’s new

The Azure Firewall solution now includes five new analytic detections built on Azure Firewall telemetry. For each detection below: the network signal it detects, the MITRE ATT&CK tactic(s), representative techniques, and the SOC impact.

High severity malicious activity
Detects: Repeated high confidence IDPS hits such as exploit kits, malware C2, credential theft, trojans, and shellcode delivery.
Tactics: Initial Access (TA0001), Execution (TA0002), Command and Control (TA0011).
Techniques: Exploit Public-Facing Application (T1190), Command and Control over Web Protocols (T1071.001), Ingress Tool Transfer (T1105).
SOC impact: Highlights active exploitation or post-compromise behavior at the network layer; strong pivot point into XDR investigations.

Elevation of privilege attempt
Detects: Repeated attempts or success gaining user or administrator privileges.
Tactics: Privilege Escalation (TA0004).
Techniques: Exploitation for Privilege Escalation (T1068).
SOC impact: Flags critical inflection points where attackers move from foothold to higher impact control.

Web application attack
Detects: Probing or exploitation attempts against web applications.
Tactics: Initial Access (TA0001).
Techniques: Exploit Public-Facing Application (T1190).
SOC impact: Surfaces external attack pressure against internet facing apps protected by Azure Firewall.

Medium severity malicious activity
Detects: Potentially unwanted programs, crypto mining, social engineering indicators, suspicious filenames/system calls.
Tactics: Initial Access (TA0001), Execution (TA0002), Impact (TA0040).
Techniques: User Execution (T1204), Resource Hijacking (T1496).
SOC impact: Early stage or lower confidence signals that help teams hunt, monitor, and tune response before escalation.

Denial of Service (DoS) attack
Detects: Attempted or sustained denial of service traffic patterns.
Tactics: Impact (TA0040).
Techniques: Network Denial of Service (T1498).
SOC impact: Enables faster DoS identification and escalation, reducing time to mitigation.

Where these detections apply

These detections are available through the Azure Firewall solution in:

Microsoft Sentinel, enabling SOC centric investigation, hunting, and automation
Microsoft Defender XDR, allowing network level signals to participate in end-to-end attack correlation across identity, endpoint, cloud, and email

They are powered by the AZFWIdpsSignature log table and require Azure Firewall with IDPS enabled (preferably with TLS inspection).

Customizing the detections to fit your environment

The Azure Firewall IDPS detections included in the Microsoft Sentinel solution are designed to be fully adaptable to customer environments, allowing SOC teams to tune sensitivity, scope, and signal fidelity based on their risk tolerance and operational maturity. Each detection is built on the AZFWIdpsSignature log table and exposes several clearly defined parameters that customers can modify without rewriting the analytic logic.

1. Tune alert sensitivity and time horizon

Customers can adjust the lookback period (TimeWindow) and minimum hit count (HitThreshold) to control how aggressively the detection triggers. Shorter windows and lower thresholds surface faster alerts for high-risk environments, while longer windows and higher thresholds help reduce noise in high volume networks.

2.
Align severity with internal risk models

Each analytic rule includes a configurable minimum severity (MinSeverity) aligned to Azure Firewall IDPS severity scoring. Organizations can raise or lower this value to match internal incident classification standards and escalation policies.

3. Focus on relevant threat categories and behaviors

Optional filters allow detections to be scoped to specific threat categories, descriptions, or enforcement actions. Customers can enable or disable:

Category filtering to focus on specific attack classes (for example, command and control, exploit kits, denial of service, or privilege escalation).
Description filtering to target specific behavioral patterns.
Action filtering to alert only on denied or alerted traffic versus purely observed activity.

This flexibility makes it easy to tailor detections for different deployment scenarios such as internet facing workloads, internal east-west traffic monitoring, or regulated environments with stricter alerting requirements.

4. Preserve structure while customizing output

Even with customization, the detections retain consistent enrichment fields, including source IP, threat category, hit count, severity, actions taken, and signature IDs, ensuring alerts remain actionable and easy to correlate across Microsoft Sentinel and Microsoft Defender XDR workflows.

By allowing customers to tune thresholds, scope, and focus areas while preserving analytic intent, these Azure Firewall IDPS detections provide a strong out of the box baseline that can evolve alongside an organization’s threat landscape and SOC maturity.

Automated detection and response for Azure Firewall using Microsoft Sentinel

In this walkthrough, we’ll follow a real-world attack simulation and see how Azure Firewall, Microsoft Sentinel, and an automated playbook work together to detect, respond to, and contain malicious activity, without manual intervention.
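Before stepping through the attack scenario, the tuning parameters described earlier (TimeWindow, HitThreshold, MinSeverity) can be made concrete. The following Python sketch is purely illustrative: the shipped detections are KQL analytic rules over the AZFWIdpsSignature table, but the thresholding idea they implement is the same.

```python
from datetime import datetime, timedelta

# Severity ranking used for the minimum-severity comparison (illustrative).
SEVERITY_RANK = {"Low": 1, "Medium": 2, "High": 3}

def should_alert(events, now, time_window_minutes, hit_threshold, min_severity):
    """Count qualifying IDPS hits per source IP inside the lookback window
    and return the set of source IPs that meet or exceed the threshold."""
    window_start = now - timedelta(minutes=time_window_minutes)
    hits = {}
    for e in events:  # each event: dict with time, source_ip, severity
        if e["time"] < window_start:
            continue  # outside the lookback window (TimeWindow)
        if SEVERITY_RANK[e["severity"]] < SEVERITY_RANK[min_severity]:
            continue  # below the configured MinSeverity
        hits[e["source_ip"]] = hits.get(e["source_ip"], 0) + 1
    return {ip for ip, count in hits.items() if count >= hit_threshold}

now = datetime(2025, 1, 1, 12, 0)
events = [
    {"time": now - timedelta(minutes=m), "source_ip": "10.0.100.20", "severity": "High"}
    for m in (1, 2, 3)
] + [{"time": now - timedelta(minutes=5), "source_ip": "10.0.0.5", "severity": "Low"}]

# Three High hits in 10 minutes trip a threshold of 3; the single Low hit is filtered out.
print(should_alert(events, now, time_window_minutes=10, hit_threshold=3, min_severity="Medium"))
# → {'10.0.100.20'}
```

Raising hit_threshold or shortening time_window_minutes makes the detection quieter, which mirrors how the analytic rule parameters trade alert speed for noise.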
Step 1: Malicious traffic originates from a compromised source

A source IP address 10.0.100.20, hosted within a virtual network, attempts to reach a web application protected by Azure Firewall. To validate the scenario, we intentionally generate malicious outbound traffic from this source, such as payloads that match known attack patterns. This is an outbound flow, meaning the traffic is leaving the internal network and attempting to reach an external destination through Azure Firewall.

At this stage:

Azure Firewall is acting as the central enforcement point
Traffic is still allowed, but deep packet inspection is in effect

Step 2: Azure Firewall IDPS detects malicious behavior

Azure Firewall's intrusion detection and prevention system (IDPS) is enabled and inspects traffic as it passes through the firewall. When IDPS detects patterns that match known malicious signatures, the action taken depends on the signature's configured mode:

Alert mode: IDPS generates a detailed security log for the matched signature but allows the traffic to continue. This is useful for monitoring and tuning before enforcing blocks.
Alert and Deny mode: IDPS blocks the matching traffic and generates a detailed security log. The threat is stopped at the network layer while full telemetry is preserved for investigation.

In both cases, IDPS records rich metadata including source IP, destination, protocol, signature name, severity, and threat category. These logs are what power the downstream detections in Microsoft Sentinel. In this walkthrough, the signature is configured in Alert and Deny mode, meaning the malicious traffic from 10.0.100.20 is blocked immediately at the firewall while the corresponding log is forwarded for analysis.

Step 3: Firewall logs are sent to Log Analytics

All Azure Firewall logs, including IDPS logs, are sent to a Log Analytics workspace named law-cxeinstance.
At this point:

Firewall logs are centralized
Logs are normalized and can be queried
No alerting has happened yet, only data collection

This workspace becomes the single source of truth for downstream analytics and detections.

Step 4: Microsoft Sentinel ingests and analyzes the firewall logs

The Log Analytics workspace is connected to Microsoft Sentinel, which continuously analyzes incoming data. Using the Azure Firewall solution from the Sentinel Content Hub, we previously deployed a set of built-in analytic rule templates designed specifically for firewall telemetry. One of these rules is “High severity malicious activity detected”. This rule evaluates IDPS logs and looks for high-confidence signatures, known exploit techniques, and malicious categories identified by Firewall IDPS.

Step 5: Sentinel creates an incident

When the conditions of an analytic rule are met, Microsoft Sentinel automatically:

Raises an alert
Groups related alerts into an incident
Extracts entities such as IP addresses, severity, and evidence

In this case, the source IP 10.0.100.20 is clearly identified as the malicious actor and attached as an IP entity to the incident. This marks the transition from detection to response.

Step 6: An automation rule triggers the playbook

To avoid manual response, we configured a Sentinel automation rule that triggers whenever:

An incident is created
The analytic rule name matches any of the analytic rules we configured

The automation rule immediately triggers a Logic App playbook named AzureFirewallBlockIPaddToIPGroup. This playbook is available as part of the Azure Firewall solution and can be deployed directly from the solution package. In addition, a simplified version of the playbook is published in our GitHub repository, allowing you to deploy it directly to your resource group using the provided ARM template. This is where automated containment begins.
Step 7: The playbook aggregates and updates the IP Group

The playbook performs several critical actions in sequence:

Extracts IP entities from the Sentinel incident
Retrieves the existing Azure Firewall IP Group named MaliciousIPs
Checks for duplicates to avoid unnecessary updates
Aggregates new IPs into a single array/list
Updates the IP Group in a single operation

Note that the playbook's managed identity must have Contributor access on the IP Group or its resource group to perform this action. In our scenario, the IP 10.0.100.20 is added to the MaliciousIPs IP Group.

Step 8: Firewall policy enforces the block immediately

Azure Firewall already has a network rule named BlockMaliciousTraffic configured with:

Source: MaliciousIPs IP Group
Destination: Any
Protocol: Any
Action: Deny

Because the rule references the IP Group dynamically, the moment the playbook updates MaliciousIPs, the firewall enforcement takes effect instantly — without modifying the rule itself. Traffic originating from 10.0.100.20 is now fully blocked, preventing any further probing or communication with the destination. The threat has been effectively contained.

When a SOC analyst opens the Sentinel incident, they see that containment has already occurred: the malicious IP was identified, the IP Group was updated, and the firewall block is in effect — all with a full audit trail of every automated action taken, from detection through response. No manual intervention was required.

Conclusion

With these five new IDPS detections, Azure Firewall closes the gap between network-level signal and SOC-level action. Raw signature telemetry is automatically transformed into severity-aware, MITRE ATT&CK-mapped alerts inside Microsoft Sentinel and Microsoft Defender XDR — giving security teams correlated, investigation-ready incidents instead of isolated log entries.
Combined with automation playbooks, the result is a fully integrated detect-and-respond workflow: Azure Firewall identifies malicious behavior, Sentinel raises and enriches the incident, and a Logic App playbook contains the threat by updating firewall policy in real time — all without manual intervention.

These detections are included at no additional cost. Simply install the Azure Firewall solution from the Microsoft Sentinel Content Hub, and the analytic rules automatically appear in your Sentinel workspace — ready to enable, customize, and operationalize.

Get started today:

Azure Firewall with Microsoft Sentinel overview
Automate Threat Response with Playbooks in Microsoft Sentinel
Azure Firewall Premium features implementation guide

Recent real‑world breaches underscore why these detections matter. Over the past year, attackers have repeatedly gained initial access by exploiting public‑facing applications, followed by command‑and‑control activity, web shell deployment, cryptomining, and denial‑of‑service attacks. Incidents such as the GoAnywhere MFT exploitation, widespread web‑application intrusions observed by Cisco Talos, and large‑scale cryptomining campaigns against exposed cloud services demonstrate the value of correlating repeated network‑level malicious signals. The new Azure Firewall IDPS detections are designed to surface these patterns early, reduce alert noise, and feed high‑confidence network signals directly into Microsoft Sentinel and Microsoft Defender XDR for faster investigation and response.

Your network telemetry is a first-class security signal - let it work for you! Visit us at RSA 2026 to see the full detection-to-containment workflow live.

My First TechCommunity Post: Azure VPN Gateway BGP Timer Mismatches
This is my first post on the Microsoft TechCommunity. Today is my seven-year anniversary at Microsoft. In my current role as a Senior Cloud Solution Architect supporting Infrastructure in Cloud & AI Platforms, I want to start by sharing a real-world lesson learned from customer engagements rather than a purely theoretical walkthrough. This work and the update of the official documentation on Microsoft Learn is the culmination of nearly two years of support for a very large global SD-WAN deployment with hundreds of site-to-site VPN connections into Azure VPN Gateway.

The topic is deceptively simple—BGP timers—but mismatched expectations can cause significant instability when connecting on‑premises environments to Azure. If you’ve ever seen seemingly random BGP session resets, intermittent route loss, or confusing failover behavior, there’s a good chance that a timer mismatch between Azure and your customer premises equipment (CPE) was a contributing factor.

Customer Expectation: BGP Timer Negotiation

Many enterprise routers and firewalls support aggressive BGP timers and expect them to be negotiated during session establishment. A common configuration I see in customer environments looks like:

Keepalive: 10 seconds
Hold time: 30 seconds

This configuration is not inherently wrong. In fact, it is often used intentionally to speed up failure detection and convergence in conventional network environments. My past experience with short timers was in a national cellular network carrier between core switching routers in adjacent racks, but all other connections used the default timer values. The challenge appears when that expectation is carried into Azure VPN Gateway.

Azure VPN Gateway Reality: Fixed BGP Timers

Azure VPN Gateway supports BGP but uses fixed timers (60/180) and won’t negotiate down. The timers are documented: The BGP keepalive timer is 60 seconds, and the hold timer is 180 seconds.
Azure VPN Gateway uses fixed timer values and does not support configurable keepalive or hold timers. This behavior is consistent across supported VPN Gateway SKUs that offer BGP support. Unlike some on‑premises devices, Azure will not adapt its timers downward during session establishment.

What Happens During a Timer Mismatch

When a CPE is configured with a 30‑second hold timer, it expects to receive BGP keepalives well within that window. Azure, however, sends BGP keepalives every 60 seconds. From the CPE’s point of view:

No keepalive is received within 30 seconds
The BGP hold timer expires
The session is declared dead and torn down

Azure may not declare the peer down on the same timeline as the CPE. This mismatch leads to repeated session flaps.

The Hidden Side Effect: BGP State and Stability Controls

During these rapid teardown and re‑establishment cycles, many CPE platforms rebuild their BGP tables and may increment internal routing metadata. When this occurs repeatedly:

Azure observes unexpected and rapid route updates
The BGP finite state machine is forced to continually reset and re‑converge
BGP session stability is compromised

CPE equipment logging may trigger alerts and internal support tickets. The resulting behavior is often described by customers as “Azure randomly drops routes” or “BGP is unstable”, when the instability originates from mismatched BGP timer expectations between the CPE and Azure VPN Gateway.

Why This Is More Noticeable on VPN (Not ExpressRoute)

This issue is far more common with VPN Gateway than with ExpressRoute. ExpressRoute supports BFD and allows faster failure detection without relying solely on aggressive BGP timers. VPN Gateway does not support BFD, so customers sometimes compensate by lowering BGP timers on the CPE—unintentionally creating this mismatch. The VPN path is Internet/WAN-like, where delay, loss, and jitter are normal, so conservative timer choices are stability-focused.
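The arithmetic behind the flap is simple. Here is a tiny Python sketch (an illustration only, not a real BGP implementation) of whether a session survives when the remote side sends keepalives on Azure's fixed 60-second interval:

```python
def holds_session(peer_keepalive_interval, local_hold_time):
    """A BGP speaker declares the session dead if no keepalive arrives
    within its hold time. With fixed timers on the remote side, the
    session is stable only if a keepalive fits inside the window."""
    return peer_keepalive_interval < local_hold_time

AZURE_KEEPALIVE = 60  # seconds, fixed on Azure VPN Gateway

# CPE tuned for fast convergence: 30s hold time flaps against Azure.
print(holds_session(AZURE_KEEPALIVE, 30))   # False: session torn down
# CPE aligned with Azure defaults: 180s hold time is stable.
print(holds_session(AZURE_KEEPALIVE, 180))  # True
```

This is also why the 60/180 pairing is conventional: hold timers are typically set to three times the keepalive interval, leaving room for two lost keepalives before the session is declared dead.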
Updated Azure Documentation

The good news is that the official Azure documentation has been updated to clearly state the fixed BGP timer values for VPN Gateway:

Keepalive: 60 seconds
Hold time: 180 seconds
Timer negotiation: Azure uses fixed timers

Azure VPN Gateway FAQ | Microsoft Learn

This clarification helps set the right expectations and prevents customers from assuming Azure behaves like conventional CPE routers.

Practical Guidance

If you are connecting a CPE to Azure VPN Gateway using BGP:

Do not configure BGP timers lower than Azure’s defaults
Align CPE timers to 60 / 180 or higher
Avoid using aggressive timers as a substitute for BFD

For further resilience:

Consider Active‑Active VPN Gateways for better resiliency
Use 4 tunnels, commonly implemented in a bowtie configuration, for even better resiliency and traffic stability

Closing Thoughts

This is a great example of how cloud networking often behaves correctly, but differently than conventional on‑premises networking environments. Understanding those differences—and documenting them clearly—can save hours of troubleshooting and frustration. If this post helps even one engineer avoid a late‑night or multi-month BGP debugging session, then it has done its job. I did use AI (M365 Copilot) to aid in formatting and to validate technical accuracy. Otherwise, these are my thoughts. Thanks for reading my first TechCommunity post.

Orchestrating Intrusion Detection and Prevention Signature overrides in Azure Firewall Premium
Introduction:

Azure Firewall Premium provides strong protection with a built-in Intrusion Detection and Prevention System (IDPS). It inspects inbound, outbound, and east-west traffic against Microsoft’s continuously updated signature set and can block threats before they reach your workloads. IDPS works out of the box without manual intervention. However, in many environments administrators need the flexibility to override specific signatures to better align with operational or security requirements. Common reasons include:

Compliance enforcement – enforcing policies that require certain threats (such as High severity signatures) to always be blocked, directional tuning, or protocol/category-based tuning.
Incident response – reacting quickly to emerging vulnerabilities by enabling blocking for newly relevant signatures.
Noise reduction – keeping informational signatures in alert mode to avoid false positives while still maintaining visibility.

In many environments, signature overrides are typically managed in one of two ways:

Using the global IDPS mode
Using the Azure portal to apply per-signature overrides individually

While these approaches work, managing overrides manually becomes difficult when thousands of signatures are involved. The Azure portal also limits the number of changes that can be applied at once, which makes large tuning operations time-consuming. To simplify this process, this blog introduces an automation approach that allows you to export, filter, and apply IDPS signature overrides in bulk using PowerShell scripts.

A Common Operational Scenario:

Consider the following scenario frequently encountered by security teams: a security team wants to move their firewall from Alert → Alert + Deny globally to strengthen threat prevention. However, they do not want Low severity signatures to deny traffic, because these signatures are primarily informational and may create unnecessary noise or false positives.
Example:

Signature ID: 2014906
Severity: Low
Description: INFO – .exe File requested over FTP

This signature is classified as informational because requesting an .exe file over FTP indicates contextual risk, not necessarily confirmed malicious activity. If the global mode is switched to Alert + Deny, this signature may start blocking traffic unnecessarily. The goal therefore becomes:

Enable Alert + Deny globally
Keep Low severity signatures in Alert mode

The workflow described in this blog demonstrates how to achieve this outcome using the IDPS Override script.

Automation Workflow:

The automation process uses two scripts to export and update signatures. Workflow overview:

Azure Firewall Policy
│
▼
Export Signatures (ipssigs.ps1)
│
▼
CSV Review / Edit
│
▼
Bulk Update (ipssigupdate.ps1)
│
▼
Updated Firewall Policy

Before implementing the workflow, it’s helpful to briefly review the available IDPS modes and severity levels.

IDPS Modes:
Severity:

Prerequisites:

Now that we understand Azure Firewall IDPS concepts and have the context for this script, let's get started with the workings of the script itself. First of all, ensure that you are connected to your Azure account and have selected the correct subscription.
You can do so by running the following command:

Connect-AzAccount -Subscription "<your-subscription-id>"

Ensure the following modules, which are required for this operation, are installed:

Az.Accounts
Az.Network

💡 Tip: You can check whether the above modules are installed by running:

Get-Module -ListAvailable Az*

or check specific modules using the following commands:

Get-Module Az.Network | Select-Object Name, Version, Path
Get-Module Az.Accounts | Select-Object Name, Version, Path

If you need to install them, run the following command, which downloads all generally available Azure service modules from the PowerShell Gallery, overwriting existing versions without prompting:

Install-Module Az -Repository PSGallery -Force

Restart PowerShell after installation.

Configure ipsconfig.json

Now, configure the ipsconfig.json file and ensure the configuration file contains your target environment details, i.e., target subscription, target firewall policy resource group name, firewall name, firewall policy name, location, and rule collection group name. Example:

{
  "subs": "xxxxxxxx-xxxx-xxxx-xxxx-xxxxxxxxxxxx",
  "rg": "TEST-RG",
  "fw": "fw",
  "fwp": "fw-policy",
  "location": "CentralUS",
  "rcg": "DefaultNetworkRuleCollectionGroup"
}

Note: Your account must have permissions to read and update firewall policy and IDPS settings.

Running the Script:

1. Export Signatures

Now that we have all the prerequisites ready, it's time to run the script. Run the following command in PowerShell from the directory where the script exists:

.\ipssigs.ps1

The script should prompt for filtering criteria, and you can input the values as per your requirements. For the example scenario that we considered, we will give the following inputs:

Mode: Alert
Severity: Low

💡 Tip: When specifying multiple values, ensure there is a space between the two values but no comma, otherwise the script may return no results.
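To see why the comma matters, assume the script splits multi-value input on whitespace (a hypothetical reading of its behavior, consistent with the tip above). A quick Python illustration:

```python
# Space-separated input produces two usable filter values.
values_ok = "Low Medium".split()
print(values_ok)       # ['Low', 'Medium']

# Comma-separated input produces a single token that matches neither
# "Low" nor "Medium", so the filter would return no results.
values_bad = "Low,Medium".split()
print(values_bad)      # ['Low,Medium']
```

The same pitfall applies to the Mode prompt: separate multiple values with spaces only.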
The script exports the results to the ipssignatures_results.csv file by default (or a custom filename if specified). The exported CSV includes metadata such as severity, direction, group, and protocol, which can help inform tuning decisions.

2. Prepare the CSV

We do not need all of these columns when inputting the CSV file to update the firewall policy. We only need the following columns:

SignatureId
Mode

Therefore, remove all other columns while keeping the SignatureId and Mode columns along with their headers.

3. Update the Firewall Policy

Now, it's time to update the firewall policy with the signature/mode overrides that we need using the above CSV file. Note that the script supports two operations:

Changing the global IDPS mode
Applying bulk signature overrides using the CSV file

You can use either option independently or both together. Let's understand this further by looking at two examples.

Example 1: Change Global Mode and Override Low Severity Signatures

Goal:

Set global mode to Alert + Deny
Keep Low severity signatures in Alert

Command:

.\ipssigupdate.ps1 -GlobalMode Deny -InputFile Lowseveritysignatures.csv

Result:

High and Medium signatures → Alert + Deny
Low signatures → Alert

Example 2: Override Signatures Only

If the global mode should remain unchanged, run the following command only:

.\ipssigupdate.ps1

The script will then prompt for the input CSV file in the next step. As seen, the changes were made to the Azure Firewall in just a few seconds. After the script completes, updated signature actions should appear in the firewall policy.

4.
Monitoring Script Execution

Use the following commands to track and monitor the background processes, verify their status, check for any errors, and remove completed jobs:

Check background job status:
Get-Job -Id <#>

View results:
Receive-Job -Id <#> -Keep

Remove completed jobs:
Remove-Job -Id <#>

Note: Up to 10,000 IDPS rules can be customized at a time.

5. Validate the Changes

Now that we have finished running the script, it's time to verify the update by confirming:

- The global IDPS mode in the firewall policy
- The signature override state
- Alert or block events in your logging destination (Log Analytics or Microsoft Sentinel)

Note: While most signatures support Off, Alert, or Deny actions, some context-setting signatures have fixed actions and cannot be overridden.

Conclusion:

Azure Firewall Premium makes it straightforward to apply broad IDPS configuration changes through the Azure portal. However, as environments scale, administrators often require more precise and repeatable ways to manage signature tuning. The automation approach described in this blog allows administrators to query, review, and update thousands of signatures in minutes. This enables repeatable tuning workflows, improves operational efficiency, and simplifies large-scale security configuration changes.

References:

- GitHub repository for the IDPS scripts
- Azure Firewall IDPS
- Azure Firewall IDPS signature rule categories

Azure Front Door: Resiliency Series – Part 2: Faster recovery (RTO)
In Part 1 of this blog series, we outlined our four‑pillar strategy for resiliency in Azure Front Door: configuration resiliency, data plane resiliency, tenant isolation, and accelerated Recovery Time Objective (RTO). Together, these pillars help Azure Front Door remain continuously available and resilient at global scale.

Part 1 focused on the first two pillars: configuration and data plane resiliency. Our goal is to make configuration propagation safer, so incompatible changes never escape pre‑production environments. We discussed how incompatible configurations are blocked early, and how data plane resiliency ensures the system continues serving traffic from a last‑known‑good (LKG) configuration even if a bad change manages to propagate. We also introduced ‘Food Taster’, a dedicated sacrificial process running in each edge server’s data plane that pretests every configuration change in isolation, before it ever reaches the live data plane.

In this post, we turn to the recovery pillar. We describe how we have made key enhancements to the Azure Front Door recovery path so the system can return to full operation in a predictable and bounded timeframe. For a global service like Azure Front Door, serving hundreds of thousands of tenants across 210+ edge sites worldwide, we set an explicit target: to be able to recover any edge site – or all edge sites – within approximately 10 minutes, even in worst‑case scenarios. In typical data plane crash scenarios, we expect recovery in under a second.

Repair status

The first blog post in this series mentioned the two Azure Front Door incidents from October 2025 – learn more by watching our Azure Incident Retrospective session recordings for the October 9th incident and/or the October 29th incident. Before diving into our platform investments for improving our Recovery Time Objectives (RTO), we wanted to provide a quick update on the overall repair items from these incidents.
We are pleased to report that the work on configuration propagation and data plane resiliency is now complete and fully deployed across the platform (in the table below, “Completed” means broadly deployed in production). With this, we have reduced configuration propagation latency from ~45 minutes to ~20 minutes. We anticipate reducing this even further – to ~15 minutes by the end of April 2026, while ensuring that platform stability remains our top priority.

| Learning category | Goal | Repairs | Status |
|---|---|---|---|
| Safe customer configuration deployment | Incompatible configuration never propagates beyond EUAP or canary regions | Control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state | Completed |
| Data plane resiliency | Configuration processing cannot impact data plane availability | Manage data-plane lifecycle to prevent outages caused by configuration-processing defects | Completed |
| | | Isolated work-process in every data plane server to process and load the configuration | Completed |
| 100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the critical services batch impacted in the Oct 29th outage, running on a day-old configuration | Completed |
| | | Phase 2: Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services | March 2026 |
| Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed |
| | | Accelerate recovery time to < 10 minutes | April 2026 |
| Tenant isolation | No configuration or traffic regression can impact other tenants | Micro-cellular Azure Front Door with ingress layered shards | June 2026 |

Why recovery at edge scale is deceptively hard

To understand why recovery took as long as it did, it helps to first understand how the Azure Front Door data plane processes configuration. Azure Front Door operates in 210+ edge sites with multiple servers per site. The data plane of each edge server hosts multiple processes. A master process orchestrates the lifecycle of multiple worker processes, which serve customer traffic. A separate configuration translator process runs alongside the data plane processes and is responsible for converting customer configuration bundles from the control plane into optimized binary FlatBuffer files. This translation step, covering hundreds of thousands of tenants, represents hours of cumulative computation. A per-server cache is kept locally on each edge server to enable fast recovery of the data plane, if needed.

Once the configuration translator process produces these FlatBuffer files, each worker processes them independently and memory-maps them for zero-copy access. Configuration updates flow through a two-phase commit: new FlatBuffers are first loaded into a staging area and validated, then atomically swapped into production maps.
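The staging-and-swap flow can be illustrated with a small sketch. This is not the actual FlatBuffer implementation; it is only an illustrative model of the two-phase commit: stage a candidate configuration, validate it, then promote it atomically while readers keep using the old entry until the swap.

```python
import threading

class ConfigStore:
    """Illustrative two-phase commit: stage and validate a new config, then swap atomically."""
    def __init__(self):
        self._lock = threading.Lock()
        self._live = {}      # tenant -> config currently served to traffic
        self._staging = {}   # tenant -> candidate config under validation

    def stage(self, tenant, config):
        # Stand-in for real validation of the staged artifact.
        if "routes" not in config:
            raise ValueError(f"invalid config for {tenant}")
        self._staging[tenant] = config

    def commit(self, tenant):
        # Atomic promotion from staging into the live map.
        with self._lock:
            self._live[tenant] = self._staging.pop(tenant)

    def lookup(self, tenant):
        return self._live.get(tenant)

store = ConfigStore()
store.stage("contoso", {"routes": ["/api"]})
store.commit("contoso")
print(store.lookup("contoso"))  # {'routes': ['/api']}
```

A validation failure in `stage` leaves the live map untouched, which mirrors why a bad candidate never disturbs traffic being served from the current configuration.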
In-flight requests continue using the old configuration until the last request referencing it completes.

The data plane recovery process is designed to be resilient to different failure modes. A failure or crash at the worker process level has a typical recovery time of less than one second. Since each server has multiple such worker processes serving customer traffic, this type of crash has no impact on the data plane. In the case of a master process crash, the system automatically tries to recover using the local cache. When the local cache is reused, the system is able to recover quickly – in approximately 60 minutes – since most of the configurations in the cache were already loaded into the data plane before the crash. However, if the cache becomes unavailable or must be invalidated because of corruption, the recovery time increases significantly.

During the October 29th incident, a data plane crash triggered a complete recovery sequence that took approximately 4.5 hours. This was not because restarting a process is slow; it is because a defect in the recovery process invalidated the local cache, which meant that “restart” meant rebuilding everything from scratch. The configuration translator process then had to re-fetch and re-translate every one of the hundreds of thousands of customer configurations before workers could memory-map them and begin serving traffic. This experience crystallized three fundamental learnings related to our recovery path:

- Expensive rework: A subset of crashes discarded all previously translated FlatBuffer artifacts, forcing the configuration translator process to repeat hours of conversion work that had already been validated and stored.
- High restart costs: Every worker on every node had to wait for the configuration translator process to complete the full translation before it could memory-map any configuration and begin serving requests.
- Unbounded recovery time: Recovery time grew linearly with total tenant footprint rather than with active traffic, creating a ‘scale penalty’ as more tenants onboarded to the system.

Separately and together, the insight was clear: recovery must stop being proportional to the total configuration size.

Persisting ‘validated configurations’ across restarts

One of the key recovery improvements was strengthening how validated customer configurations are cached and reused across failures, rather than rebuilding configuration state from scratch during recovery. Azure Front Door already cached customer configurations on host‑mounted storage prior to the October incident. The platform enhancements after the outage focused on making the local configuration cache resilient to crashes, partial failures, and bad tenant inputs. Our goal was to ensure that recovery behavior is dominated by serving traffic safely, not by reconstructing configuration state. This led us to two explicit design goals.

Design goals

- No category of crash should invalidate the configuration cache: Configuration cache invalidation must never be the default response to failures. Whether the failure is a worker crash, master crash, data plane restart, or coordinated recovery action, previously validated customer configurations should remain usable—unless there is a proven reason to discard them.
- Bad tenant configuration must not poison the entire cache: A single faulty or incompatible tenant configuration should result in targeted eviction of that tenant’s configuration only—not wholesale cache invalidation across all tenants.

Platform enhancements

Previously, customer configurations persisted to host‑mounted storage, but certain failure paths treated the cache as unsafe and invalidated it entirely. In those cases, recovery implicitly meant reloading and reprocessing configuration for hundreds of thousands of tenants before traffic could resume, even though the vast majority of cached data was still valid.
We changed the recovery model to avoid invalidating customer configurations, with strict scoping around when and how cached entries are discarded:

- Cached configurations are no longer invalidated based on crash type. Failures are assumed to be orthogonal to configuration correctness unless explicitly proven otherwise.
- Cache eviction is granular and tenant‑scoped. If a cached configuration fails validation or load checks, only that tenant’s configuration is discarded and reloaded. All other tenant configurations remain available.

This ensures that recovery does not regress into a fleet‑wide rebuild due to localized or unrelated faults.

Safety and correctness

Durability is paired with strong correctness controls to prevent unsafe configurations from being served:

- Per‑tenant validation on load: Each cached tenant configuration is validated during the ‘load and verification’ phase, before being promoted for traffic serving. Therefore, failures are contained to that tenant.
- Targeted re‑translation: When validation fails, only the affected tenant’s configuration is reloaded or reprocessed. Therefore, the cache for other tenants is left untouched.
- Operational escape hatch: Operators retain the ability to explicitly instruct a clean rebuild of the configuration cache (with proper authorization), preserving control without compromising the default fast‑recovery path.

Resulting behavior

With these changes, recovery behavior now aligns with real‑world traffic patterns: configuration defects impact tenants locally and predictably, rather than globally. The system now prefers isolated tenant impact and continued service using last-known-good configuration over aggressive invalidation, both of which are critical for predictable recovery at the scale of Azure Front Door.
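A minimal sketch of the tenant-scoped eviction model described above (illustrative only; the class, names, and validation logic are invented for the example):

```python
class TenantConfigCache:
    """Sketch of tenant-scoped eviction: one bad entry never invalidates the whole cache."""
    def __init__(self, validator):
        self._entries = {}
        self._validate = validator

    def put(self, tenant, config):
        self._entries[tenant] = config

    def load(self, tenant):
        """Validate on load; evict and signal re-translation only for the failing tenant."""
        config = self._entries.get(tenant)
        if config is None:
            return None, "miss"        # caller re-fetches / re-translates this tenant only
        if not self._validate(config):
            del self._entries[tenant]  # targeted eviction; other tenants untouched
            return None, "evicted"
        return config, "hit"

cache = TenantConfigCache(validator=lambda c: c.get("version", 0) > 0)
cache.put("good-tenant", {"version": 3})
cache.put("bad-tenant", {"version": 0})

print(cache.load("bad-tenant"))   # (None, 'evicted')
print(cache.load("good-tenant"))  # ({'version': 3}, 'hit')
```

Note that after the bad tenant's entry is evicted, every other tenant still loads from cache; only the failing tenant pays the re-translation cost.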
Making recovery scale with active traffic, not total tenants

Reusing the configuration cache solves the problem of rebuilding configuration in its entirety, but even with a warm cache, the original startup path had a second bottleneck: eagerly loading a large volume of tenant configurations into memory before serving any traffic. At our scale, memory-mapping and parsing hundreds of thousands of FlatBuffers, constructing internal lookup maps, and adding Transport Layer Security (TLS) certificates and configuration blocks for each tenant collectively added almost an hour to startup time. This was the case even when a majority of those tenants had no active traffic at that moment.

We addressed this by fundamentally changing when configuration is loaded into workers. Rather than eagerly loading most of the tenants at startup across all edge locations, Azure Front Door now uses a machine learning (ML)-optimized lazy loading model. In the new architecture, instead of loading a large number of tenant configurations, we only load a small subset of tenants that are known to be historically active in a given site; we call this the “warm tenants” list. The warm tenants list per edge site is created through a sophisticated traffic analysis pipeline that leverages ML.

However, loading the warm tenants is not enough, because when a request arrives and we don’t have the configuration in memory, we need to know two things: is this a request from a real Azure Front Door tenant – and, if it is, where can we find the configuration? To answer these questions, each worker maintains a hostmap that tracks the state of each tenant’s configuration. This hostmap is constructed during startup, as we process each tenant configuration – if the tenant is in the warm list, we process and load their configuration fully; if not, we just add an entry to the hostmap where all their domain names are mapped to the configuration path location.
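The hostmap construction and warm/lazy loading split can be sketched as follows (illustrative Python only; the real data plane works on memory-mapped FlatBuffers, not dictionaries, and the names here are invented for the example):

```python
class Worker:
    """Sketch of warm/lazy tenant loading via a hostmap."""
    def __init__(self, all_tenants, warm_tenants, load_config):
        self._load = load_config
        self._configs = {}   # tenant -> fully loaded config
        self._hostmap = {}   # hostname -> (tenant, config path)
        for tenant, (hostnames, path) in all_tenants.items():
            for host in hostnames:
                self._hostmap[host] = (tenant, path)
            if tenant in warm_tenants:            # warm tenants load eagerly at startup
                self._configs[tenant] = load_config(path)

    def serve(self, host):
        entry = self._hostmap.get(host)
        if entry is None:
            return None                           # not an Azure Front Door tenant
        tenant, path = entry
        if tenant not in self._configs:           # cold tenant: load on first request
            self._configs[tenant] = self._load(path)
        return self._configs[tenant]

loads = []
def fake_load(path):
    loads.append(path)
    return {"path": path}

worker = Worker(
    all_tenants={"hot": (["hot.example.com"], "/cfg/hot"),
                 "cold": (["cold.example.com"], "/cfg/cold")},
    warm_tenants={"hot"},
    load_config=fake_load,
)
print(loads)                      # only /cfg/hot is loaded at startup
worker.serve("cold.example.com")  # first request triggers the cold load
print(loads)                      # ['/cfg/hot', '/cfg/cold']
```

Startup cost is paid only for the warm set; every other tenant costs a single hostmap entry until its first request arrives.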
When a request arrives for one of these tenants, the worker loads and validates that tenant’s configuration on demand, and immediately begins serving traffic. This allows a node to start serving its busiest tenants within a few minutes of startup, while additional tenants are loaded incrementally only when traffic actually arrives—allowing the system to progressively absorb cold tenants as demand increases.

The effect on recovery is transformative. Instead of recovery time scaling with the total number of tenants configured on a server, it scales with the number of tenants actively receiving traffic. In practice, even at our busiest edge sites, the active tenant set is a small fraction of the total. Just as importantly, this modified form of lazy loading provides a natural failure isolation boundary. Most edge sites won’t ever load a faulty configuration of an inactive tenant. When a request for an inactive tenant with an incompatible configuration arrives, impact is contained to a single worker. The configuration load architecture now prefers serving as many customers as quickly as possible, rather than waiting until everything is ready before serving anyone. The above changes are slated to complete in April 2026 and will bring our RTO from the current ~1 hour to under 10 minutes – for complete recovery from a worst-case scenario.

Continuous validation through Game Days

A critical element of our recovery confidence comes from GameDay fault-injection testing. We don’t simply design recovery mechanisms and assume they work—we break the system deliberately and observe how it responds. Since late 2025, we have conducted recurring GameDay drills that simulate the exact failure scenarios we are defending against:

- Food Taster crash scenarios: Injecting deliberately faulty tenant configurations, to verify that they are caught and isolated with zero impact on live traffic.
In our January 2026 GameDay, the Food Taster process crashed as expected, the system halted the update within approximately 5 seconds, and no customer traffic was affected.

- Master process crash scenarios: Triggering master process crashes across test environments to verify that workers continue serving traffic, that the Local Config Shield engages within 10 seconds, and that the coordinated recovery tool restores full operation within the expected timeframe.
- Multi-region failure drills: Simulating simultaneous failures across multiple regions to validate that global Config Shield mechanisms engage correctly, and that recovery procedures scale without requiring manual per-region intervention.
- Fallback test drills for critical Azure services running behind Azure Front Door: In our February 2026 GameDay, we simulated the complete unavailability of Azure Front Door, and successfully validated failover for critical Azure services with no impact to traffic.

These drills have both surfaced corner cases and built operational confidence. They have transformed recovery from a theoretical plan into tested, repeatable muscle memory. As we noted in an internal communication to our team: “Game day testing is a deliberate shift from assuming resilience to actively proving it—turning reliability into an observed and repeatable outcome.”

Closing

Part 1 of this series emphasized preventing unsafe configurations from reaching the data plane, and data plane resiliency in case an incompatible configuration reaches production. This post has shown that prevention alone is not enough—when failures do occur, recovery must be fast, predictable, and bounded. By ensuring that the FlatBuffer cache is never invalidated, by loading only active tenants, and by building safe coordinated recovery tooling, we have transformed failure handling from a fleet-wide crisis into a controlled operation. These recovery investments work in concert with the prevention mechanisms described in Part 1.
Together, they ensure that the path from incident detection to full service restoration is measured in minutes, with customer traffic protected at every step.

In the next post of this series, we will cover the third pillar of our resiliency strategy: tenant isolation – how micro-cellular architecture and ingress-layered sharding can reduce the blast radius of any failure to a small subset of tenants, ensuring that one customer’s configuration or traffic anomaly never becomes everyone’s problem.

We deeply value our customers’ trust in Azure Front Door. We are committed to transparently sharing our progress on these resiliency investments, and to exceeding expectations for safety, reliability, and operational readiness.

ExpressRoute Gateway Microsoft initiated migration
Objective

The backend migration process is an automated upgrade performed by Microsoft to ensure your ExpressRoute gateways use the Standard IP SKU. This migration enhances gateway reliability and availability while maintaining service continuity. You receive notifications about scheduled maintenance windows and have options to control the migration timeline. For guidance on upgrading Basic SKU public IP addresses for other networking services, see Upgrading Basic to Standard SKU.

Important: As of September 30, 2025, Basic SKU public IPs are retired. For more information, see the official announcement.

You can initiate the ExpressRoute gateway migration yourself at a time that best suits your business needs, before the Microsoft team performs the migration on your behalf. This gives you control over the migration timing. Please use the ExpressRoute Gateway Migration Tool to migrate your gateway public IP to the Standard SKU. This tool provides a guided workflow in the Azure portal and PowerShell, enabling a smooth migration with minimal service disruption.

Backend migration overview

The backend migration is scheduled during your preferred maintenance window. During this time, the Microsoft team performs the migration with minimal disruption. You don’t need to take any action. The process includes the following steps:

1. Deploy new gateway: Azure provisions a second virtual network gateway in the same GatewaySubnet alongside your existing gateway. Microsoft automatically assigns a new Standard SKU public IP address to this gateway.
2. Transfer configuration: The process copies all existing configurations (connections, settings, routes) from the old gateway. Both gateways run in parallel during the transition to minimize downtime. You may experience brief connectivity interruptions.
3. Clean up resources: After migration completes successfully and passes validation, Azure removes the old gateway and its associated connections.
The new gateway includes a tag CreatedBy: GatewayMigrationByService to indicate it was created through the automated backend migration.

Important: To ensure a smooth backend migration, avoid making non-critical changes to your gateway resources or connected circuits during the migration process. If modifications are absolutely required, you can choose (after the Migrate stage completes) to either commit or abort the migration and make your changes.

Backend process details

This section provides an overview of the Azure portal experience during backend migration for an existing ExpressRoute gateway. It explains what to expect at each stage and what you see in the Azure portal as the migration progresses. To reduce risk and ensure service continuity, the process performs validation checks before and after every phase. The backend migration follows four key stages:

1. Validate: Checks that your gateway and connected resources meet all migration requirements for the Basic to Standard public IP migration.
2. Prepare: Deploys the new gateway with a Standard IP SKU alongside your existing gateway.
3. Migrate: Cuts over traffic from the old gateway to the new gateway with a Standard public IP.
4. Commit or abort: Finalizes the public IP SKU migration by removing the old gateway, or reverts to the old gateway if needed.

These stages mirror the Gateway Migration Tool process, ensuring consistency across both migration approaches. The Azure resource group RGA serves as a logical container that displays all associated resources as the process updates, creates, or removes them. The walkthrough uses an example ExpressRoute gateway named ERGW-A with two connections (Conn-A and LAconn) in the resource group RGA.

Portal walkthrough

Before the backend migration starts, a banner appears in the Overview blade of the ExpressRoute gateway.
It notifies you that the gateway uses the deprecated Basic IP SKU and will undergo backend migration between March 7, 2026, and April 30, 2026.

Validate stage

Once the migration starts, the banner in your gateway’s Overview page updates to indicate that migration is currently in progress. In this initial stage, all resources are checked to ensure they are in a Passed state. If any prerequisites aren't met, validation fails and the Azure team doesn't proceed with the migration, to avoid traffic disruptions. No resources are created or modified in this stage. After the validation phase completes successfully, a notification appears indicating that validation passed and the migration can proceed to the Prepare stage.

Prepare stage

In this stage, the backend process provisions a new virtual network gateway in the same region and SKU type as the existing gateway. Azure automatically assigns a new public IP address and re-establishes all connections. This preparation step typically takes up to 45 minutes. To indicate that the new gateway is created by migration, the backend mechanism appends _migrate to the original gateway name. During this phase, the existing gateway is locked to prevent configuration changes, but you retain the option to abort the migration, which deletes the newly created gateway and its connections. After the Prepare stage starts, a notification appears showing that new resources are being deployed to the resource group.

Deployment status

In the resource group RGA, under Settings → Deployments, you can view the status of all newly deployed resources that are part of the backend migration process. In the resource group RGA, under the Activity Log blade, you can see events related to the Prepare stage. These events are initiated by GatewayRP, which indicates they are part of the backend process.

Deployment verification

After the Prepare stage completes, you can verify the deployment details in the resource group RGA under Settings > Deployments.
This section lists all components created as part of the backend migration workflow. The new gateway ERGW-A_migrate is deployed successfully along with its corresponding connections: Conn-A_migrate and LAconn_migrate.

Gateway tag

The newly created gateway ERGW-A_migrate includes the tag CreatedBy: GatewayMigrationByService, which indicates it was provisioned by the backend migration process.

Migrate stage

After the Prepare stage finishes, the backend process starts the Migrate stage. During this stage, the process switches traffic from the existing gateway ERGW-A to the new gateway ERGW-A_migrate. This step can take up to 15 minutes and might cause brief connectivity interruptions. Once it completes, the new gateway (ERGW-A_migrate) handles traffic.

Commit stage

After migration, the Azure team monitors connectivity for 15 days to ensure everything is functioning as expected. The banner automatically updates to indicate completion of migration. During this validation period, you can’t modify resources associated with either the old or new gateway. To resume normal CRUD operations without waiting 15 days, you have two options:

- Commit: Finalize the migration and unlock resources.
- Abort: Revert to the old gateway, which deletes the new gateway and its connections.

To initiate Commit before the 15-day window ends, type yes and select Commit in the portal. When the commit is initiated from the backend, you will see “Committing migration. The operation may take some time to complete.” The old connections are deleted first, and then the old gateway is deleted. These events show as initiated by GatewayRP in the activity logs.
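Taken together, the Validate → Prepare → Migrate → Commit/Abort flow can be modeled as a small state machine. The sketch below is illustrative only (it is not the actual migration tooling); it just captures the ordering and finalization rules described in this walkthrough:

```python
class GatewayMigration:
    """Sketch of the four-stage flow: Validate -> Prepare -> Migrate -> Commit/Abort."""
    ORDER = ["Validate", "Prepare", "Migrate"]

    def __init__(self):
        self.completed = []
        self.finalized = None

    def advance(self, stage):
        # Stages must run in order; a failed validation blocks everything after it.
        expected = self.ORDER[len(self.completed)]
        if stage != expected:
            raise RuntimeError(f"expected stage {expected}, got {stage}")
        self.completed.append(stage)

    def commit(self):
        # Commit is only legal after Migrate; old gateway and connections are removed.
        if self.completed != self.ORDER:
            raise RuntimeError("can only commit after Migrate completes")
        self.finalized = "Committed"

    def abort(self):
        # Abort reverts to the old gateway and deletes the new one.
        self.finalized = "Aborted"

m = GatewayMigration()
for stage in ["Validate", "Prepare", "Migrate"]:
    m.advance(stage)
m.commit()
print(m.finalized)  # Committed
```

The key constraint the model encodes is that Commit is unreachable until Migrate has completed, while Abort remains available throughout as the escape path.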
Finally, the resource group RGA contains only resources related to the migrated gateway ERGW-A_migrate. The ExpressRoute gateway migration from Basic to Standard public IP SKU is now complete.

Frequently asked questions

How long will the Microsoft team wait before committing to the new gateway?
The Microsoft team waits around 15 days after migration to allow you time to validate connectivity and ensure all requirements are met. You can commit at any time during this 15-day period.

What is the traffic impact during migration? Is there packet loss or routing disruption?
Traffic is rerouted seamlessly during migration. Under normal conditions, no packet loss or routing disruption is expected. Brief connectivity interruptions (typically less than 1 minute) might occur during the traffic cutover phase.

Can we make any changes to the ExpressRoute gateway deployment during the migration?
Avoid making non-critical changes to the deployment (gateway resources, connected circuits, etc.). If modifications are absolutely required, you have the option (after the Migrate stage) to either commit or abort the migration.

Navigating the 2025 holiday season: Insights into Azure’s DDoS defense
The holiday season continues to be one of the most demanding periods for online businesses. Traffic surges, higher transaction volumes, and user expectations for seamless digital experiences all converge, making reliability a non-negotiable requirement. For attackers, this same period presents an opportunity: even brief instability can translate into lost revenue, operational disruption, and reputational impact.

This year, the most notable shift wasn’t simply the size of attacks, but how they were executed. We observed a rise in burst‑style DDoS events: fast-ramping, high-intensity surges distributed across multiple resources, designed to overwhelm packet processing and connection-handling layers before traditional bandwidth metrics show signs of strain. From November 15, 2025 through January 5, 2026, Azure DDoS Protection helped customers maintain continuity through sustained Layer 3 and Layer 4 attack traffic, underscoring two persistent realities:

- Most attacks remain short, automated, and frequently create constant background attack traffic.
- The upper limit of attacker capability continues to grow, with botnets across the industry regularly demonstrating multi‑Tbps scale.

The holiday season once again reinforced that DDoS resilience must be treated as a continuous operational discipline.

Rising volume and intensity

Between November 15 and January 5, Azure mitigated approximately 174,054 inbound DDoS attacks. While many were small and frequent, the distribution revealed the real shift:

- 16% exceeded 1M packets per second (pps).
- ~3% surpassed 10M pps, up significantly from 0.2% last year.

Even when individual events are modest, the cumulative impact of sustained attack traffic can be operationally draining—consuming on-call cycles, increasing autoscale and egress costs, and creating intermittent instability that can provide cover for more targeted activity.

Operational takeaway: Treat DDoS mitigation as an always-on requirement.
Ensure protection is enabled across all internet-facing entry points, align alerting to packet rate trends, and maintain clear triage workflows.

What the TCP/UDP mix is telling us this season

TCP did what it usually does during peak season: it carried the fight. TCP floods made up ~72% of activity, and ACK floods dominated (58.7%) – a reliable way to grind down packet processing and connection handling. UDP was ~24%, showing up as sharp, high-intensity bursts; amplification (like NTP) appeared, but it wasn’t the main play. Put together, it’s a familiar one-two punch: sustain TCP/ACK pressure to exhaust the edge, then spike UDP to jolt stability and steal attention. The goal isn’t just to saturate bandwidth; it’s to push services into intermittent instability, where things technically stay online but feel broken to users.

- TCP-heavy pressure: Make sure your edge and backends can absorb a surge in connections without falling over—check load balancer limits, connection/state capacity, and confirm health checks won’t start flapping during traffic spikes.
- UDP burst patterns: Rely on automated detection and mitigation—these bursts are often over before a human can respond.
- Reduce exposure: Inventory any internet-facing UDP services and shut down, restrict, or isolate anything you don’t truly need.

Attack duration

Attackers continued to favor short-lived bursts designed to outrun manual response, but we also saw a notable shift in “who” felt the impact most. High-sensitivity workloads, especially gaming, experienced some of the highest packet-per-second and bandwidth-driven spikes, often concentrated into bursts lasting from a few minutes to several minutes. Even when these events were brief, the combination of high PPS + high bandwidth can be enough to trigger jitter, session drops, match instability, or rapid scaling churn.
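Duration-based reporting of this kind reduces to a simple bucketing pass over per-attack durations. The sketch below uses hypothetical sample data and the 5- and 40-minute thresholds discussed in this report:

```python
def bucket_durations(durations_min):
    """Classify attack durations (in minutes) into the buckets used in this report."""
    buckets = {"<=5 min": 0, "5-40 min": 0, ">40 min": 0}
    for d in durations_min:
        if d <= 5:
            buckets["<=5 min"] += 1
        elif d <= 40:
            buckets["5-40 min"] += 1
        else:
            buckets[">40 min"] += 1
    return buckets

# Hypothetical sample of per-attack durations in minutes (illustrative data only).
sample = [1, 2, 3, 4, 12, 25, 38, 55, 90, 4]
counts = bucket_durations(sample)
share_short = counts["<=5 min"] / len(sample)
share_under_40 = (counts["<=5 min"] + counts["5-40 min"]) / len(sample)
print(counts)          # {'<=5 min': 5, '5-40 min': 3, '>40 min': 2}
print(share_short)     # 0.5
print(share_under_40)  # 0.8
```

Computing shares like these against your own mitigation logs is a quick way to see whether your attack profile matches the short-burst pattern described here.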
Overall, 34% of attacks lasted 5 minutes or less, and 83% ended within 40 minutes, reinforcing the same lesson: modern DDoS patterns are optimized for speed and disruption, not longevity. For latency- and session-sensitive services, “only a few minutes” can still be a full outage experience. Attack duration is an attacker advantage when defenses rely on humans to notice, diagnose, and react.

- Design for minute-long spikes: Assume attacks will be short, sharp, and high PPS, so your protections should engage automatically.
- Watch the right signals: Alert on PPS spikes and service health (disconnect rates, latency/jitter), not bandwidth alone.

Botnet-driven surges

Azure observed rapid rotation of botnet traffic associated with Aisuru and KimWolf targeting public-facing endpoints. The traffic was highly distributed across regions and networks. In several instances, when activity was mitigated in one region, similar traffic shifted to alternate regions or segments shortly afterward. “Relocation” behavior is the operational signature of automated botnet playbooks: probe → hit → shift → retry. If defenses vary by region or endpoint, attackers will find the weakest link quickly. Customers should standardize their protection posture, ensure consistent DDoS policies and thresholds across regions, and monitor by setting the right alerts and notifications.

The source-side distribution observed during this window showed which industry verticals were used to generate the botnet traffic. The geography indicators reflect where the traffic was observed egressing onto the internet, and do not imply attribution or intent by any provider or country.

Preparing for 2026

As organizations transition into 2026, the lessons from the 2025 holiday season – marked by persistent and evolving DDoS threats, including the rise of DDoS-for-hire services and massive botnets – underscore the critical need for proactive, resilient cybersecurity.
Azure's proven ability to automatically detect, mitigate, and withstand advanced attacks (such as record-breaking volumetric incidents) highlights the value of always-on protections to maintain business continuity and safeguard digital services during peak demand periods. Adopting a Zero Trust approach is essential in this landscape: it operates on the principle of "never trust, always verify," assumes breaches are inevitable, and requires continuous validation of access and traffic, principles that complement DDoS defenses by limiting lateral movement and exposure even under attack.

To achieve comprehensive protection, implement layered security: deploy Azure DDoS Protection for network-layer (Layers 3 and 4) volumetric mitigation with always-on monitoring, adaptive tuning, telemetry, and alerting; combine it with Azure Web Application Firewall (WAF) to defend the application layer (Layer 7) against sophisticated techniques like HTTP floods; and integrate Azure Firewall for additional network perimeter controls.

Key preparatory steps include:

- Identifying public-facing exposure points
- Establishing normal traffic baselines
- Conducting regular DDoS simulations
- Configuring alerts for active mitigations
- Forming a dedicated response team
- Enabling expert support such as the DDoS Rapid Response (DRR) team when needed

By prioritizing these multi-layered defenses and a well-practiced response plan, organizations can significantly enhance resilience against the evolving DDoS landscape in 2026.

# A Practical Guide to Azure DDoS Protection Cost Optimization
## Introduction

Azure provides infrastructure-level DDoS protection by default to protect Azure's own platform and services. However, this protection does not extend to customer workloads or non-Microsoft-managed resources such as Application Gateway, Azure Firewall, or virtual machines with public IPs. To protect these resources, Azure offers enhanced DDoS protection capabilities (Network Protection and IP Protection) that customers can apply based on workload exposure and business requirements. As environments scale, it's important to ensure these capabilities are applied deliberately and aligned with actual risk. For more details on how Azure DDoS protection works, see Understanding Azure DDoS Protection: A Closer Look.

## Why Cost Optimization Matters

Cost inefficiencies related to Azure DDoS Protection typically emerge as environments scale:

- New public IPs are introduced
- Virtual networks evolve
- Workloads change ownership
- Protection scope grows without clear alignment to workload exposure

The goal here is deliberate, consistent application of enhanced protection matched to real risk rather than historical defaults.

## Scoping Enhanced Protection

Customer workloads with public IPs require enhanced DDoS protection to be protected against targeted attacks. Enhanced DDoS protection provides:

- Advanced mitigation capabilities
- Detailed telemetry and attack insights
- Mitigation tuned to specific traffic patterns
- Dedicated support for customer workloads

When to apply enhanced protection:

| Workload Type | Enhanced Protection Recommended? |
|---|---|
| Internet-facing production apps with direct customer impact | Yes |
| Business-critical systems with compliance requirements | Yes |
| Internal-only workloads behind private endpoints | Typically not needed |
| Development/test environments | Evaluate based on exposure |

**Best Practice:** Regularly review public IP exposure and workload criticality to ensure enhanced protection aligns with current needs.
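The scoping guidance above is essentially a small decision rule. As a minimal sketch (the function and attribute names are invented for illustration; real scoping should also weigh factors the table doesn't capture):

```python
def recommend_enhanced_protection(internet_facing: bool,
                                  production: bool,
                                  compliance_required: bool) -> str:
    """Map workload attributes to the scoping guidance in the table above."""
    if not internet_facing:
        return "Typically not needed"      # internal-only / private endpoints
    if production or compliance_required:
        return "Yes"                       # customer-impacting or regulated
    return "Evaluate based on exposure"    # dev/test with public exposure

print(recommend_enhanced_protection(True, True, False))    # -> Yes
print(recommend_enhanced_protection(False, False, False))  # -> Typically not needed
print(recommend_enhanced_protection(True, False, False))   # -> Evaluate based on exposure
```

Encoding the rule like this (for example inside a policy-as-code pipeline) makes the quarterly reviews discussed later repeatable instead of ad hoc.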
## Understanding Azure DDoS Protection SKUs

Azure offers two ways to apply enhanced DDoS protection: DDoS Network Protection and DDoS IP Protection. Both provide DDoS protection for customer workloads.

### Comparison Table

| Feature | DDoS Network Protection | DDoS IP Protection |
|---|---|---|
| Scope | Virtual network level | Individual public IP |
| Pricing model | Fixed base + overage per IP | Per protected IP |
| Included IPs | 100 public IPs | N/A |
| DDoS Rapid Response (DRR) | Included | Not available |
| Cost protection guarantee | Included | Not available |
| WAF discount | Included | Not available |
| Best for | Production environments with many public IPs | Selective protection for specific endpoints |
| Management | Centralized | Granular |
| Cost efficiency | Lower per-IP cost at scale (100+ IPs) | Lower total cost for few IPs (< 15) |

### DDoS Network Protection

DDoS Network Protection can be applied in two ways:

- **VNet-level protection:** Associate a DDoS Protection Plan with virtual networks, and all public IPs within those VNets receive enhanced protection
- **Selective IP linking:** Link specific public IPs directly to a DDoS Protection Plan without enabling protection for the entire VNet

This flexibility allows you to protect entire production VNets while also selectively adding individual IPs from other environments to the same plan. For more details on selective IP linking, see Optimizing DDoS Protection Costs: Adding IPs to Existing DDoS Protection Plans.

Ideal for:

- Production environments with multiple internet-facing workloads
- Mixed environments where some VNets need full coverage and others need selective protection
- Scenarios requiring centralized visibility, management, and access to DRR, cost protection, and WAF discounts

### DDoS IP Protection

DDoS IP Protection allows enhanced protection to be applied directly to individual public IPs, with per-IP billing. This is a standalone option that does not require a DDoS Protection Plan.
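The "< 15 IPs" cost-efficiency threshold in the comparison table falls out of simple arithmetic. The sketch below uses the Network Protection figures quoted later in this guide ($2,944/month base with 100 IPs included, $30/IP overage) and an assumed per-IP rate for IP Protection of $199/month; check current Azure pricing before relying on any of these numbers:

```python
def network_protection_cost(n_ips, base=2944.0, included=100, overage=30.0):
    """Monthly cost of one DDoS Network Protection plan covering n_ips."""
    return base + max(0, n_ips - included) * overage

def ip_protection_cost(n_ips, per_ip=199.0):
    """Monthly cost of protecting n_ips individually with DDoS IP Protection
    (per_ip is an assumed list price for illustration)."""
    return n_ips * per_ip

# Smallest IP count where the plan becomes cheaper than per-IP billing
breakeven = next(n for n in range(1, 201)
                 if network_protection_cost(n) <= ip_protection_cost(n))
print(breakeven)  # -> 15
```

At 14 IPs, per-IP billing is still cheaper ($2,786 vs. $2,944); at 15, the fixed-base plan wins, which is where the table's guidance comes from, before even counting DRR, cost protection, and the WAF discount that only Network Protection includes.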
Ideal for:

- Environments with fewer than 15 IPs requiring protection
- Cases where DRR, cost protection, and WAF discounts are not needed
- Quick enablement without creating a protection plan

### Decision Tree: Choosing the Right SKU

Now that you know the main scenarios, the decision tree below can help you determine which SKU best fits your environment based on feature requirements and scale.

Network Protection exclusive features:

- **DDoS Rapid Response (DRR):** Access to Microsoft DDoS experts during active attacks
- **Cost protection:** Resource credits for scale-out costs incurred during attacks
- **WAF discount:** Reduced pricing on Azure Web Application Firewall

## Consolidating Protection Plans at Tenant Level

A single DDoS Protection Plan can protect multiple virtual networks and subscriptions within a tenant. Each plan includes:

- Fixed monthly base cost
- 100 public IPs included
- Overage charges for additional IPs beyond the included threshold

### Cost Comparison Example

Consider a customer with 130 public IPs requiring enhanced protection:

| Configuration | Plans | Base Cost | Overage | Total Monthly Cost |
|---|---|---|---|---|
| Two separate plans | 2 | $2,944 × 2 = $5,888 | $0 | ~$5,888 |
| Single consolidated plan | 1 | $2,944 | 30 IPs × $30 = $900 | ~$3,844 |

Savings: ~$2,044/month ($24,528/year) by consolidating to a single plan. In both cases, the same public IPs receive the same enhanced protection. The cost difference is driven entirely by plan architecture.

### How to Consolidate Plans

Use the PowerShell script below to list existing DDoS Protection Plans and associate virtual networks with a consolidated plan. Run this script from Azure Cloud Shell or a local PowerShell session with the [Az module](https://learn.microsoft.com/powershell/azure/install-azure-powershell) installed. The account running the script must have the Network Contributor role (or equivalent) on the virtual networks being modified and Reader access to the DDoS Protection Plan.
```powershell
# List all DDoS Protection Plans in your tenant
Get-AzDdosProtectionPlan | Select-Object Name, ResourceGroupName, Id

# Associate a virtual network with an existing DDoS Protection Plan
$ddosPlan = Get-AzDdosProtectionPlan -Name "ConsolidatedDDoSPlan" -ResourceGroupName "rg-security"
$vnet = Get-AzVirtualNetwork -Name "vnet-production" -ResourceGroupName "rg-workloads"
$vnet.DdosProtectionPlan = New-Object Microsoft.Azure.Commands.Network.Models.PSResourceId
$vnet.DdosProtectionPlan.Id = $ddosPlan.Id
$vnet.EnableDdosProtection = $true
Set-AzVirtualNetwork -VirtualNetwork $vnet
```

## Preventing Protection Drift

Protection drift occurs when the resources covered by DDoS protection no longer align with the resources that actually need it. This mismatch can result in wasted spend (protecting resources that are no longer critical) or security gaps (missing protection on newly deployed resources). Common causes include:

- Applications are retired but protection remains
- Test environments persist longer than expected
- Ownership changes without updating protection configuration

### Quarterly Review Checklist

- List all public IPs with enhanced protection enabled
- Verify each protected IP maps to an active, production workload
- Confirm workload criticality justifies enhanced protection
- Review ownership tags and update as needed
- Remove protection from decommissioned or non-critical resources
- Validate DDoS Protection Plan consolidation opportunities

### Sample Query: List Protected Public IPs

Use the following PowerShell script to identify all public IPs currently receiving DDoS protection in your environment. This helps you audit which resources are protected and spot candidates for removal. Run this from Azure Cloud Shell or a local PowerShell session with the Az module installed. The account must have Reader access to the subscriptions being queried.
```powershell
# List all public IPs with DDoS protection enabled
Get-AzPublicIpAddress | Where-Object {
    $_.DdosSettings.ProtectionMode -eq "Enabled" -or
    ($_.IpConfiguration -and
     (Get-AzVirtualNetwork | Where-Object { $_.EnableDdosProtection -eq $true }).Subnets.IpConfigurations.Id -contains $_.IpConfiguration.Id)
} | Select-Object Name, ResourceGroupName, IpAddress, @{N='Tags';E={$_.Tag | ConvertTo-Json -Compress}}
```

For a comprehensive assessment of all public IPs and their DDoS protection status across your environment, use the DDoS Protection Assessment Tool.

## Making Enhanced Protection Costs Observable

Ongoing visibility into DDoS Protection costs enables proactive optimization rather than reactive bill shock. When costs are surfaced early, you can spot scope creep before it impacts your budget, attribute spending to specific workloads, and measure whether your optimization efforts are paying off. The following sections cover three key capabilities: budget alerts to notify you when spending exceeds thresholds, Azure Resource Graph queries to analyze protection coverage, and tagging strategies to attribute costs by workload.
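Once a protected-IP inventory has been exported (for example from the query above), parts of the drift review can be automated. A minimal sketch, assuming the inventory has been flattened into dictionaries; the field names (`name`, `workload`, `internet_facing`) are invented for illustration:

```python
def find_drift_candidates(protected_ips, active_workloads):
    """Return (wasted, gaps): protected IPs whose workload is no longer an
    active internet-facing workload, and active internet-facing workloads
    that have no protected IP."""
    protected = {ip["workload"] for ip in protected_ips}
    active = {w["name"] for w in active_workloads if w["internet_facing"]}
    wasted = [ip["name"] for ip in protected_ips if ip["workload"] not in active]
    gaps = sorted(active - protected)
    return wasted, gaps

protected_ips = [
    {"name": "pip-portal", "workload": "CustomerPortal"},
    {"name": "pip-legacy", "workload": "RetiredApp"},   # app decommissioned
]
active_workloads = [
    {"name": "CustomerPortal", "internet_facing": True},
    {"name": "NewCheckout", "internet_facing": True},    # not yet protected
]
wasted, gaps = find_drift_candidates(protected_ips, active_workloads)
print(wasted)  # -> ['pip-legacy']  (candidate for removing protection)
print(gaps)    # -> ['NewCheckout'] (candidate for adding protection)
```

Running a comparison like this each quarter surfaces both failure modes of drift at once: wasted spend on retired resources and coverage gaps on new ones.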
### Setting Up Cost Alerts

1. Navigate to Azure Cost Management + Billing
2. Select Cost alerts > Add
3. Configure:
   - Scope: Subscription or resource group
   - Budget amount: Based on expected DDoS Protection spend
   - Alert threshold: 80%, 100%, 120%
   - Action group: Email security and finance teams

### Tagging Strategy for Cost Attribution

Apply consistent tags to track DDoS protection costs by workload:

```powershell
# Tag public IPs for cost attribution
$pip = Get-AzPublicIpAddress -Name "pip-webapp" -ResourceGroupName "rg-production"
$tags = @{
    "CostCenter"         = "IT-Security"
    "Workload"           = "CustomerPortal"
    "Environment"        = "Production"
    "DDoSProtectionTier" = "NetworkProtection"
}
Set-AzPublicIpAddress -PublicIpAddress $pip -Tag $tags
```

## Summary

This guide covered how to consolidate DDoS Protection Plans to avoid paying multiple base costs, select the appropriate SKU based on IP count and feature needs, apply protection selectively with IP linking, and prevent configuration drift through regular reviews. These practices help ensure you're paying only for the protection your workloads actually need.

## References

- Review Azure DDoS Protection pricing
- Enable DDoS Network Protection for a virtual network
- Configure DDoS IP Protection
- Configure Cost Management alerts

# DNS best practices for implementation in Azure Landing Zones
## Why DNS architecture matters in Landing Zone

A well-designed DNS layer is the glue that lets workloads in disparate subscriptions discover one another quickly and securely. Getting it right during your Azure Landing Zone rollout avoids painful refactoring later, especially once you start enforcing Zero-Trust and hub-and-spoke network patterns.

## Typical Landing-Zone topology

| Subscription | Typical Role | Key Resources |
|---|---|---|
| Connectivity (Hub) | Transit, routing, shared security | Hub VNet, Azure Firewall / NVA, VPN/ER gateways, Private DNS Resolver |
| Security | Security tooling & SOC | Sentinel, Defender, Key Vault (HSM) |
| Shared Services | Org-wide shared apps | ADO and agents, Automation |
| Management | Ops & governance | Log Analytics, backup, etc. |
| Identity | Directory and auth services | Extended domain controllers, Azure AD DS |

All five subscriptions contain a single VNet. Spokes (Security, Shared Services, Management, Identity) are peered to the Connectivity VNet, forming the classic hub-and-spoke.

## Centralized DNS with mandatory firewall inspection

Objective: All network communication from a spoke must cross the firewall in the hub, including DNS traffic.

| Design Element | Best-Practice Configuration |
|---|---|
| Private DNS Zones | Link only to the Connectivity VNet. Spokes have no direct zone links. |
| Private DNS Resolver | Deploy inbound + outbound endpoints in the Connectivity VNet. Link the connectivity virtual network to the outbound resolver endpoint. |
| Spoke DNS Settings | Set custom DNS servers on each spoke VNet equal to the inbound endpoint's IPs. |
| Forwarding Ruleset | Create a ruleset, associate it with the outbound endpoint, and add forwarders: specific domains → on-prem / external servers; wildcard "." → on-prem DNS (for compliance scenarios) |
| Firewall Rules | Allow UDP/TCP 53 from spokes to the resolver inbound endpoint, and from the resolver outbound endpoint to target DNS servers |

Note: An Azure private DNS zone is a global resource, meaning a single private DNS zone can be used to resolve DNS queries for resources deployed in multiple regions.
A DNS Private Resolver, by contrast, is a regional resource, meaning it can only be linked to virtual networks within the same region.

### Traffic flow

1. Spoke VM → inbound endpoint (hub)
2. The firewall receives the packet based on the spoke's UDR configuration and processes it before it is sent to the inbound endpoint IP.
3. The resolver applies forwarding rules to unresolved DNS queries; unresolved queries leave via the outbound endpoint. DNS forwarding rulesets provide a way to route queries for specific DNS namespaces to designated custom DNS servers.

## Fallback to internet and NXDOMAIN redirect

Azure Private DNS now supports two powerful features to enhance name resolution flexibility in hybrid and multi-tenant environments.

### Fallback to internet

- Purpose: Allows Azure to resolve DNS queries using public DNS if no matching record is found in the private DNS zone.
- Use case: Ideal when your private DNS zone doesn't contain all possible hostnames (e.g., partial zone coverage or phased migrations).
- How to enable: Go to Azure private DNS zones -> select the zone -> Virtual network link -> Edit option

Ref article: https://learn.microsoft.com/en-us/azure/dns/private-dns-fallback

## Centralized DNS - when firewall inspection isn't required

Objective: DNS queries are not monitored via the firewall and can bypass it.

- Link every spoke VNet directly to the required Private DNS Zones so that spokes can resolve PaaS resources directly.
- Keep a single Private DNS Resolver (optional) for on-prem name resolution; spokes can reach its inbound endpoint privately or via VNet peering.

### Spoke-level custom DNS

Spoke DNS settings can point to extended domain controllers placed within the identity VNet. This pattern reduces latency and cost but still centralizes zone management.

## Integrating on-premises Active Directory DNS

Create conditional forwarders on each domain controller for every Private DNS Zone, pointing them to the DNS Private Resolver inbound endpoint IP address (e.g., blob.core.windows.net, database.windows.net).
Do not include the literal privatelink label. Ref article: https://github.com/dmauser/PrivateLink/tree/master/DNS-Integration-Scenarios#43-on-premises-dns-server-conditional-forwarder-considerations

Note: Avoid selecting the option "Store this conditional forwarder in Active Directory and replicate as follows" in environments with multiple Azure subscriptions and domain controllers deployed across different Azure environments.

## Key takeaways

- Linking zones exclusively to the Connectivity subscription's virtual network keeps firewall inspection and egress control simple.
- Private DNS Resolver plus forwarding rulesets let you shape hybrid name resolution without custom appliances.
- When no inspection is needed, direct zone links to spokes cut hops and complexity.
- For on-prem AD DNS, a conditional forwarder is required pointing to the resolver inbound endpoint IP; exclude the privatelink name when creating the forwarder, and do not replicate the conditional-forwarder zone via AD replication if the customer has a footprint in multiple Azure tenants.

Plan your DNS early, bake it into your infrastructure-as-code, and your landing zone will scale cleanly no matter how many spokes join the hub tomorrow.
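The forwarder-naming rule above (forward the public zone name on-prem, while the privatelink zone lives only in Azure) trips up many rollouts, so it is worth spelling out the mapping. A minimal sketch; the helper function is invented for illustration, though the zone-name pattern matches the standard Azure Private Link convention:

```python
def forwarder_and_zone(public_zone: str):
    """For an Azure PaaS DNS suffix, return (a, b) where
    a = on-prem conditional-forwarder name (public zone, WITHOUT privatelink)
    b = Azure Private DNS zone name (WITH the privatelink label)."""
    return public_zone, f"privatelink.{public_zone}"

for suffix in ["blob.core.windows.net", "database.windows.net"]:
    forwarder, private_zone = forwarder_and_zone(suffix)
    print(f"on-prem forwarder: {forwarder:<26} Azure zone: {private_zone}")
```

The asymmetry is deliberate: the PaaS service returns a CNAME from the public name into the privatelink zone, so the on-prem forwarder must target the public suffix and let the resolver inside Azure answer from the privatelink zone.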