azure networking
Navigating the 2025 holiday season: Insights into Azure’s DDoS defense
The holiday season continues to be one of the most demanding periods for online businesses. Traffic surges, higher transaction volumes, and user expectations for seamless digital experiences all converge, making reliability a non-negotiable requirement. For attackers, this same period presents an opportunity: even brief instability can translate into lost revenue, operational disruption, and reputational impact.

This year, the most notable shift wasn’t simply the size of attacks, but how they were executed. We observed a rise in burst-style DDoS events: fast-ramping, high-intensity surges distributed across multiple resources, designed to overwhelm packet processing and connection-handling layers before traditional bandwidth metrics show signs of strain.

From November 15, 2025 through January 5, 2026, Azure DDoS Protection helped customers maintain continuity through sustained Layer 3 and Layer 4 attack traffic, underscoring two persistent realities:

- Most attacks remain short and automated, and together they create constant background attack traffic.
- The upper limit of attacker capability continues to grow, with botnets across the industry regularly demonstrating multi-Tbps scale.

The holiday season once again reinforced that DDoS resilience must be treated as a continuous operational discipline.

Rising volume and intensity

Between November 15 and January 5, Azure mitigated approximately 174,054 inbound DDoS attacks. While many were small and frequent, the distribution revealed the real shift:

- 16% exceeded 1M packets per second (pps).
- ~3% surpassed 10M pps, up significantly from 0.2% last year.

Even when individual events are modest, the cumulative impact of sustained attack traffic can be operationally draining: it consumes on-call cycles, increases autoscale and egress costs, and creates intermittent instability that can provide cover for more targeted activity.

Operational takeaway: Treat DDoS mitigation as an always-on requirement. Ensure protection is enabled across all internet-facing entry points, align alerting to packet rate trends, and maintain clear triage workflows.

What the TCP/UDP mix is telling us this season

TCP did what it usually does during peak season: it carried the fight. TCP floods made up ~72% of activity, and ACK floods dominated (58.7%), a reliable way to grind down packet processing and connection handling. UDP was ~24%, showing up as sharp, high-intensity bursts; amplification (like NTP) appeared, but it wasn’t the main play.

Put together, it’s a familiar one-two punch: sustain TCP/ACK pressure to exhaust the edge, then spike UDP to jolt stability and steal attention. The goal isn’t just to saturate bandwidth; it’s to push services into intermittent instability, where things technically stay online but feel broken to users.

- TCP-heavy pressure: Make sure your edge and backends can absorb a surge in connections without falling over. Check load balancer limits and connection/state capacity, and confirm health checks won’t start flapping during traffic spikes.
- UDP burst patterns: Rely on automated detection and mitigation, because these bursts are often over before a human can respond.
- Reduce exposure: Inventory any internet-facing UDP services and shut down, restrict, or isolate anything you don’t truly need.

Attack duration

Attackers continued to favor short-lived bursts designed to outrun manual response, but we also saw a notable shift in “who” felt the impact most. High-sensitivity workloads, especially gaming, experienced some of the highest packet-per-second and bandwidth-driven spikes, often concentrated into bursts lasting only a few to several minutes. Even when these events were brief, the combination of high PPS and high bandwidth can be enough to trigger jitter, session drops, match instability, or rapid scaling churn.

Overall, 34% of attacks lasted 5 minutes or less, and 83% ended within 40 minutes, reinforcing the same lesson: modern DDoS patterns are optimized for speed and disruption, not longevity. For latency- and session-sensitive services, “only a few minutes” can still be a full outage experience. Short duration is an attacker advantage when defenses rely on humans to notice, diagnose, and react.

- Design for minute-long spikes: assume attacks will be short, sharp, and high-PPS, so your protections must engage automatically.
- Watch the right signals: alert on PPS spikes and service health (disconnect rates, latency/jitter), not bandwidth alone.

Botnet-driven surges

Azure observed rapid rotation of botnet traffic associated with Aisuru and KimWolf targeting public-facing endpoints. The traffic was highly distributed across regions and networks. In several instances, when activity was mitigated in one region, similar traffic shifted to alternate regions or segments shortly afterward.

This “relocation” behavior is the operational signature of automated botnet playbooks: probe → hit → shift → retry. If defenses vary by region or endpoint, attackers will find the weakest link quickly. Customers should standardize their protection posture, ensure consistent DDoS policies and thresholds across regions, and monitor by setting the right alerts and notifications.

The snapshot below captures the source-side distribution at that moment, showing which industry verticals were used to generate the botnet traffic during the observation window.

The geography indicators below reflect where the traffic was observed egressing onto the internet, and do not imply attribution or intent by any provider or country.

Preparing for 2026

As organizations transition into 2026, the lessons from the 2025 holiday season, which was marked by persistent and evolving DDoS threats including the rise of DDoS-for-hire services and massive botnets, underscore the critical need for proactive, resilient cybersecurity. Azure's proven ability to automatically detect, mitigate, and withstand advanced attacks (such as record-breaking volumetric incidents) highlights the value of always-on protections to maintain business continuity and safeguard digital services during peak demand periods.

Adopting a Zero Trust approach is essential in this landscape. It operates on the principle of "never trust, always verify," assuming breaches are inevitable and requiring continuous validation of access and traffic. These principles complement DDoS defenses by limiting lateral movement and exposure even under attack.

To achieve comprehensive protection, implement layered security:

- Deploy Azure DDoS Protection for network-layer (Layers 3 and 4) volumetric mitigation with always-on monitoring, adaptive tuning, telemetry, and alerting.
- Combine it with Azure Web Application Firewall (WAF) to defend the application layer (Layer 7) against sophisticated techniques like HTTP floods.
- Integrate Azure Firewall for additional network perimeter controls.

Key preparatory steps include identifying public-facing exposure points, establishing normal traffic baselines, conducting regular DDoS simulations, configuring alerts for active mitigations, forming a dedicated response team, and enabling expert support like the DDoS Rapid Response (DRR) team when needed. By prioritizing these multi-layered defenses and a well-practiced response plan, organizations can significantly enhance resilience against the evolving DDoS landscape in 2026.

A Practical Guide to Azure DDoS Protection Cost Optimization
Introduction

Azure provides infrastructure-level DDoS protection by default to protect Azure’s own platform and services. However, this protection does not extend to customer workloads or non-Microsoft managed resources like Application Gateway, Azure Firewall, or virtual machines with public IPs. To protect these resources, Azure offers enhanced DDoS protection capabilities (Network Protection and IP Protection) that customers can apply based on workload exposure and business requirements. As environments scale, it’s important to ensure these capabilities are applied deliberately and aligned with actual risk. For more details on how Azure DDoS protection works, see Understanding Azure DDoS Protection: A Closer Look.

Why Cost Optimization Matters

Cost inefficiencies related to Azure DDoS Protection typically emerge as environments scale:

- New public IPs are introduced
- Virtual networks evolve
- Workloads change ownership
- Protection scope grows without clear alignment to workload exposure

The goal here is deliberate, consistent application of enhanced protection matched to real risk rather than historical defaults.

Scoping Enhanced Protection

Customer workloads with public IPs require enhanced DDoS protection to be protected against targeted attacks. Enhanced DDoS protection provides:

- Advanced mitigation capabilities
- Detailed telemetry and attack insights
- Mitigation tuned to specific traffic patterns
- Dedicated support for customer workloads

When to apply enhanced protection:

Workload Type | Enhanced Protection Recommended?
Internet-facing production apps with direct customer impact | Yes
Business-critical systems with compliance requirements | Yes
Internal-only workloads behind private endpoints | Typically not needed
Development/test environments | Evaluate based on exposure

Best Practice: Regularly review public IP exposure and workload criticality to ensure enhanced protection aligns with current needs.

Understanding Azure DDoS Protection SKUs

Azure offers two ways to apply enhanced DDoS protection: DDoS Network Protection and DDoS IP Protection. Both provide DDoS protection for customer workloads.

Comparison Table

Feature | DDoS Network Protection | DDoS IP Protection
Scope | Virtual network level | Individual public IP
Pricing model | Fixed base + overage per IP | Per protected IP
Included IPs | 100 public IPs | N/A
DDoS Rapid Response (DRR) | Included | Not available
Cost protection guarantee | Included | Not available
WAF discount | Included | Not available
Best for | Production environments with many public IPs | Selective protection for specific endpoints
Management | Centralized | Granular
Cost efficiency | Lower per-IP cost at scale (100+ IPs) | Lower total cost for few IPs (< 15)

DDoS Network Protection

DDoS Network Protection can be applied in two ways:

- VNet-level protection: Associate a DDoS Protection Plan with virtual networks, and all public IPs within those VNets receive enhanced protection
- Selective IP linking: Link specific public IPs directly to a DDoS Protection Plan without enabling protection for the entire VNet

This flexibility allows you to protect entire production VNets while also selectively adding individual IPs from other environments to the same plan. For more details on selective IP linking, see Optimizing DDoS Protection Costs: Adding IPs to Existing DDoS Protection Plans.

Ideal for:

- Production environments with multiple internet-facing workloads
- Mixed environments where some VNets need full coverage and others need selective protection
- Scenarios requiring centralized visibility, management, and access to DRR, cost protection, and WAF discounts

DDoS IP Protection

DDoS IP Protection allows enhanced protection to be applied directly to individual public IPs, with per-IP billing. This is a standalone option that does not require a DDoS Protection Plan.

Ideal for:

- Environments with fewer than 15 IPs requiring protection
- Cases where DRR, cost protection, and WAF discounts are not needed
- Quick enablement without creating a protection plan

Decision Tree: Choosing the Right SKU

Now that you know the main scenarios, the decision tree below can help you determine which SKU best fits your environment based on feature requirements and scale.

Network Protection exclusive features:

- DDoS Rapid Response (DRR): Access to Microsoft DDoS experts during active attacks
- Cost protection: Resource credits for scale-out costs incurred during attacks
- WAF discount: Reduced pricing on Azure Web Application Firewall

Consolidating Protection Plans at Tenant Level

A single DDoS Protection Plan can protect multiple virtual networks and subscriptions within a tenant. Each plan includes:

- Fixed monthly base cost
- 100 public IPs included
- Overage charges for additional IPs beyond the included threshold

Cost Comparison Example

Consider a customer with 130 public IPs requiring enhanced protection:

Configuration | Plans | Base Cost | Overage | Total Monthly Cost
Two separate plans | 2 | $2,944 × 2 = $5,888 | $0 | ~$5,888
Single consolidated plan | 1 | $2,944 | 30 IPs × $30 = $900 | ~$3,844

Savings: ~$2,044/month ($24,528/year) by consolidating to a single plan.

In both cases, the same public IPs receive the same enhanced protection. The cost difference is driven entirely by plan architecture.

How to Consolidate Plans

Use the PowerShell script below to list existing DDoS Protection Plans and associate virtual networks with a consolidated plan. Run this script from Azure Cloud Shell or a local PowerShell session with the [Az module](https://learn.microsoft.com/powershell/azure/install-azure-powershell) installed. The account running the script must have the Network Contributor role (or equivalent) on the virtual networks being modified and Reader access to the DDoS Protection Plan.
    # List the DDoS Protection Plans visible in the current subscription context
    Get-AzDdosProtectionPlan | Select-Object Name, ResourceGroupName, Id

    # Associate a virtual network with an existing DDoS Protection Plan
    $ddosPlan = Get-AzDdosProtectionPlan -Name "ConsolidatedDDoSPlan" -ResourceGroupName "rg-security"
    $vnet = Get-AzVirtualNetwork -Name "vnet-production" -ResourceGroupName "rg-workloads"

    # Point the VNet at the consolidated plan, enable protection, and save the change
    $vnet.DdosProtectionPlan = New-Object Microsoft.Azure.Commands.Network.Models.PSResourceId
    $vnet.DdosProtectionPlan.Id = $ddosPlan.Id
    $vnet.EnableDdosProtection = $true
    Set-AzVirtualNetwork -VirtualNetwork $vnet

Preventing Protection Drift

Protection drift occurs when the resources covered by DDoS protection no longer align with the resources that actually need it. This mismatch can result in wasted spend (protecting resources that are no longer critical) or security gaps (missing protection on newly deployed resources). Common causes include:

- Applications are retired but protection remains
- Test environments persist longer than expected
- Ownership changes without updating protection configuration

Quarterly Review Checklist

- List all public IPs with enhanced protection enabled
- Verify each protected IP maps to an active, production workload
- Confirm workload criticality justifies enhanced protection
- Review ownership tags and update as needed
- Remove protection from decommissioned or non-critical resources
- Validate DDoS Protection Plan consolidation opportunities

Sample Query: List Protected Public IPs

Use the following PowerShell script to identify all public IPs currently receiving DDoS protection in your environment. This helps you audit which resources are protected and spot candidates for removal. Run this from Azure Cloud Shell or a local PowerShell session with the Az module installed. The account must have Reader access to the subscriptions being queried.

    # Collect the IP configuration IDs that sit in VNets with DDoS protection enabled
    # (looked up once, rather than once per public IP)
    $protectedIpConfigIds = (Get-AzVirtualNetwork |
        Where-Object { $_.EnableDdosProtection -eq $true }).Subnets.IpConfigurations.Id

    # List public IPs protected directly or through a protected VNet
    Get-AzPublicIpAddress |
        Where-Object {
            $_.DdosSettings.ProtectionMode -eq "Enabled" -or
            ($_.IpConfiguration -and $protectedIpConfigIds -contains $_.IpConfiguration.Id)
        } |
        Select-Object Name, ResourceGroupName, IpAddress,
            @{N='Tags';E={$_.Tag | ConvertTo-Json -Compress}}

For a comprehensive assessment of all public IPs and their DDoS protection status across your environment, use the DDoS Protection Assessment Tool.

Making Enhanced Protection Costs Observable

Ongoing visibility into DDoS Protection costs enables proactive optimization rather than reactive bill shock. When costs are surfaced early, you can spot scope creep before it impacts your budget, attribute spending to specific workloads, and measure whether your optimization efforts are paying off.

The following sections cover three key capabilities: budget alerts to notify you when spending exceeds thresholds, Azure Resource Graph queries to analyze protection coverage, and tagging strategies to attribute costs by workload.
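To make the consolidation savings easy to re-check as the number of protected IPs changes, the arithmetic from the cost comparison example earlier in this guide can be sketched in a few lines of Python. This is a minimal sketch, not a pricing tool: the constants are the illustrative figures from that example (actual Azure pricing may differ), the function names are hypothetical, and the 65/65 split across two plans is an assumption made only so that neither plan incurs overage.

```python
# Illustrative figures from the cost comparison example in this guide.
# Actual Azure pricing may differ, so treat these constants as assumptions.
BASE_MONTHLY_COST = 2944   # fixed monthly cost per DDoS Protection Plan (USD)
INCLUDED_IPS = 100         # public IPs covered by the base cost
OVERAGE_PER_IP = 30        # monthly charge per public IP beyond the included 100 (USD)


def plan_monthly_cost(protected_ips: int) -> int:
    """Monthly cost of one plan protecting `protected_ips` public IPs."""
    overage_ips = max(0, protected_ips - INCLUDED_IPS)
    return BASE_MONTHLY_COST + overage_ips * OVERAGE_PER_IP


def total_monthly_cost(ips_per_plan: list[int]) -> int:
    """Total monthly cost when protected IPs are spread across several plans."""
    return sum(plan_monthly_cost(count) for count in ips_per_plan)


# 130 public IPs: two plans (hypothetical 65/65 split) versus one consolidated plan.
separate = total_monthly_cost([65, 65])
consolidated = total_monthly_cost([130])
monthly_savings = separate - consolidated

print(f"Two plans:    ${separate:,}/month")
print(f"One plan:     ${consolidated:,}/month")
print(f"Savings:      ${monthly_savings:,}/month (${monthly_savings * 12:,}/year)")
```

With the example figures this reproduces the table: $5,888 for two plans, $3,844 for one, and $2,044 saved per month. Changing the list passed to `total_monthly_cost` shows how quickly the extra base fee outweighs overage charges.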
Setting Up Cost Alerts

1. Navigate to Azure Cost Management + Billing.
2. Select Cost alerts > Add.
3. Configure:
   - Scope: Subscription or resource group
   - Budget amount: Based on expected DDoS Protection spend
   - Alert threshold: 80%, 100%, 120%
   - Action group: Email security and finance teams

Tagging Strategy for Cost Attribution

Apply consistent tags to track DDoS protection costs by workload:

    # Tag public IPs for cost attribution
    $pip = Get-AzPublicIpAddress -Name "pip-webapp" -ResourceGroupName "rg-production"
    $tags = @{
        "CostCenter"         = "IT-Security"
        "Workload"           = "CustomerPortal"
        "Environment"        = "Production"
        "DDoSProtectionTier" = "NetworkProtection"
    }

    # Merge the new tags into the public IP's existing tags
    # (Set-AzPublicIpAddress does not take a -Tag parameter)
    Update-AzTag -ResourceId $pip.Id -Tag $tags -Operation Merge

Summary

This guide covered how to consolidate DDoS Protection Plans to avoid paying multiple base costs, select the appropriate SKU based on IP count and feature needs, apply protection selectively with IP linking, and prevent configuration drift through regular reviews. These practices help ensure you're paying only for the protection your workloads actually need.

References

- Review Azure DDoS Protection pricing
- Enable DDoS Network Protection for a virtual network
- Configure DDoS IP Protection
- Configure Cost Management alerts

DNS best practices for implementation in Azure Landing Zones
Why DNS architecture matters in Landing Zone

A well-designed DNS layer is the glue that lets workloads in disparate subscriptions discover one another quickly and securely. Getting it right during your Azure Landing Zone rollout avoids painful refactoring later, especially once you start enforcing Zero-Trust and hub-and-spoke network patterns.

Typical Landing-Zone topology

Subscription | Typical Role | Key Resources
Connectivity (Hub) | Transit, routing, shared security | Hub VNet, Azure Firewall / NVA, VPN/ER gateways, Private DNS Resolver
Security | Security tooling & SOC | Sentinel, Defender, Key Vault (HSM)
Shared Services | Org-wide shared apps | ADO and Agents, Automation
Management | Ops & governance | Log Analytics, backup etc
Identity | Directory and auth services | Extended domain controllers, Azure AD DS

Each of the five subscriptions contains a single VNet. Spokes (Security, Shared, Management, Identity) are peered to the Connectivity VNet, forming the classic hub-and-spoke.

Centralized DNS with mandatory firewall inspection

Objective: All network communication from a spoke, including DNS traffic, must cross the firewall in the hub.

Design Element | Best-Practice Configuration
Private DNS Zones | Link only to the Connectivity VNet. Spokes have no direct zone links.
Private DNS Resolver | Deploy inbound + outbound endpoints in the Connectivity VNet. Link connectivity virtual network to outbound resolver endpoint.
Spoke DNS Settings | Set custom DNS servers on each spoke VNet equal to the inbound endpoint’s IPs.
Forwarding Ruleset | Create a ruleset, associate it with the outbound endpoint, and add forwarders: specific domains → on-prem / external servers; wildcard “.” → on-prem DNS (for compliance scenarios)
Firewall Rules | Allow UDP/TCP 53 from spokes to Resolver-inbound, and from Resolver-outbound to target DNS servers

Note: An Azure private DNS zone is a global resource, meaning a single private DNS zone can be used to resolve DNS queries for resources deployed in multiple regions. A DNS private resolver is a regional resource, meaning it can only link to virtual networks within the same region.

Traffic flow

- Spoke VM → Inbound endpoint (hub): the firewall receives the packet based on the spoke UDR configuration and processes it before it is sent to the inbound endpoint IP.
- The resolver applies forwarding rules to unresolved DNS queries; unresolved queries leave via the outbound endpoint.
- DNS forwarding rulesets provide a way to route queries for specific DNS namespaces to designated custom DNS servers.

Fallback to internet and NXDOMAIN redirect

Azure Private DNS now supports two powerful features to enhance name resolution flexibility in hybrid and multi-tenant environments:

Fallback to internet

- Purpose: Allows Azure to resolve DNS queries using public DNS if no matching record is found in the private DNS zone.
- Use case: Ideal when your private DNS zone doesn't contain all possible hostnames (e.g., partial zone coverage or phased migrations).
- How to enable: Go to Azure private DNS zones -> Select zone -> Virtual network link -> Edit option
- Ref article: https://learn.microsoft.com/en-us/azure/dns/private-dns-fallback

Centralized DNS - when firewall inspection isn’t required

Objective: DNS queries are not monitored by the firewall and can bypass it.

- Link every spoke virtual network directly to the required Private DNS Zones so that spokes can resolve PaaS resources directly.
- Keep a single Private DNS Resolver (optional) for on-prem name resolution; spokes can reach its inbound endpoint privately or via VNet peering.
- Spoke-level custom DNS: this can point to extended domain controllers placed within the Identity virtual network.

This pattern reduces latency and cost but still centralizes zone management.

Integrating on-premises Active Directory DNS

Create conditional forwarders on each domain controller for every Private DNS Zone, pointing them to the DNS private resolver inbound endpoint IP address (e.g., blob.core.windows.net, database.windows.net).
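The forwarder naming rule in this section (use the public service domain, as in the blob.core.windows.net example, rather than the zone name carrying the privatelink label) can be sketched as a small helper. This is a minimal illustration under stated assumptions: `forwarder_domain` is a hypothetical function name, and the two zone names are common Private Link zones chosen to match the examples in the text.

```python
def forwarder_domain(private_zone: str) -> str:
    """Return the domain to use for an on-premises conditional forwarder.

    Drops the literal "privatelink" label from a Private DNS zone name,
    leaving the public service domain.
    """
    labels = private_zone.rstrip(".").lower().split(".")
    return ".".join(label for label in labels if label != "privatelink")


# Example Private DNS zone names matching the domains mentioned in the text.
zones = [
    "privatelink.blob.core.windows.net",
    "privatelink.database.windows.net",
]

for zone in zones:
    print(f"{zone} -> {forwarder_domain(zone)}")
```

Each resulting domain is what the conditional forwarder on the domain controller would be created for, with the resolver's inbound endpoint IP as its target.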
Do not include the literal privatelink label. Ref article: https://github.com/dmauser/PrivateLink/tree/master/DNS-Integration-Scenarios#43-on-premises-dns-server-conditional-forwarder-considerations

Note: Avoid selecting the option “Store this conditional forwarder in Active Directory and replicate as follows” in environments with multiple Azure subscriptions and domain controllers deployed across different Azure environments.

Key takeaways

- Linking zones exclusively to the connectivity subscription's virtual network keeps firewall inspection and egress control simple.
- Private DNS Resolver plus forwarding rulesets let you shape hybrid name resolution without custom appliances.
- When no inspection is needed, direct zone links to spokes cut hops and complexity.
- For on-prem AD DNS, a conditional forwarder pointing to the inbound endpoint IP is required. Exclude the privatelink label when creating the conditional forwarder, and do not replicate the conditional forwarder zone with AD replication if the customer has a footprint in multiple Azure tenants.

Plan your DNS early, bake it into your infrastructure-as-code, and your landing zone will scale cleanly no matter how many spokes join the hub tomorrow.

Unlock outbound traffic insights with Azure StandardV2 NAT Gateway flow logs
Recommended Outbound Connectivity

StandardV2 NAT Gateway is the next evolution of outbound connectivity in Azure. As the recommended solution for providing secure, reliable outbound Internet access, NAT Gateway continues to be the default choice for modern Azure deployments. With the highly anticipated general availability of the new StandardV2 SKU, customers gain access to the following highly requested upgrades:

- Zone-redundancy: Automatically maintains outbound connectivity during single-zone failures in AZ-enabled regions.
- Enhanced performance: Up to 100 Gbps of throughput and 10 million packets per second - double the Standard SKU capacity.
- Dual-stack support: Attach up to 16 IPv6 and 16 IPv4 public IP addresses for future-ready connectivity.
- Flow logs: Access historical logs of connections being established through your NAT gateway.

This blog will focus on how enabling StandardV2 NAT Gateway flow logs can be beneficial for your team, along with some tips to get the most out of the data.

What are flow logs?

StandardV2 NAT Gateway flow logs are enabled through Diagnostic settings on your NAT gateway resource, where the log data can be sent to Log Analytics, a storage account, or an Event Hub destination. “NatGatewayFlowlogV1” is the released log category, and it provides IP-level information on traffic flowing through your StandardV2 NAT gateway.

[Figure: Gateway Flow Logs through Diagnostics setting on your StandardV2 NAT gateway resource.]

Why should I use flow logs?

Security and compliance visibility

Prior to NAT gateway flow logs, customers could not see NAT gateway information when their virtual machines connected outbound. This made it difficult to:

- Validate that only approved destinations were being accessed
- Audit suspicious or unexpected outbound patterns
- Satisfy compliance requirements that mandate traffic recording

Flow logs now provide visibility into the source IP -> NAT gateway outbound IP -> destination IP path, along with details on sent/dropped packets and bytes.
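As one way to act on the "only approved destinations" check, flow-log rows exported from Log Analytics could be compared against an allowlist. The sketch below is an illustration under assumptions, not part of the NAT Gateway feature: the record shape is hypothetical (it borrows the SourceIP and DestinationIP column names used in the query later in this post), and the allowlist ranges are documentation-only example networks.

```python
import ipaddress

# Hypothetical allowlist of approved destination ranges (documentation-only
# example networks; replace with your organization's approved ranges).
APPROVED_DESTINATIONS = [
    ipaddress.ip_network("203.0.113.0/24"),
    ipaddress.ip_network("198.51.100.0/24"),
]


def is_approved(destination_ip: str) -> bool:
    """True if the destination falls inside any approved range."""
    ip = ipaddress.ip_address(destination_ip)
    return any(ip in network for network in APPROVED_DESTINATIONS)


def unapproved_flows(records):
    """Return the flow records whose destination is not on the allowlist."""
    return [record for record in records if not is_approved(record["DestinationIP"])]


# Hypothetical records shaped like exported flow-log rows.
sample_records = [
    {"SourceIP": "10.0.0.4", "DestinationIP": "203.0.113.10"},
    {"SourceIP": "10.0.0.5", "DestinationIP": "192.0.2.55"},
]

for flow in unapproved_flows(sample_records):
    print(f'{flow["SourceIP"]} -> {flow["DestinationIP"]} is not on the allowlist')
```

The same comparison could be expressed directly in a Log Analytics query; the Python form is used here only because it can be run without a workspace.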
Usage analytics

Flow logs allow you to answer usage questions such as:

- Which VMs are generating the most outbound requests?
- Which destinations receive the most traffic?
- Is throughput growth caused by a specific workload pattern?

This level of insight is especially useful when debugging unexpected throughput increases, billing spikes, and connection bottlenecks.

To note: Flow logs only capture established connections. This means the TCP 3-way handshake (SYN → SYN/ACK → ACK) or the UDP ephemeral session setup must complete. If a connection never establishes, for example due to NSG denial, routing mismatch, or SNAT exhaustion, it will not appear in flow logs.

Workflow of troubleshooting with flow logs

Let's walk through how you can leverage flow logs to troubleshoot a scenario where you are seeing intermittent connection drops.

Scenario: You have VMs that use a StandardV2 NAT gateway to reach the Internet. However, your VMs intermittently fail to reach github.com.

Step 1: Check NAT gateway health

Start with the datapath availability metric, which reflects the NAT gateway's overall health.

- If the metric is > 90%, this confirms the NAT gateway is healthy and is working as expected to send outbound traffic to the internet. Continue to Step 2.
- If the metric is lower, visit Troubleshoot Azure NAT Gateway connectivity - Azure NAT Gateway | Microsoft Learn for troubleshooting tips.

Step 2: Enable StandardV2 NAT Gateway Flow Logs

To further investigate the root cause, enable StandardV2 NAT Gateway Flow Logs (NatGatewayFlowLogsV1 log category in Diagnostics Setting) for the NAT gateway resource providing outbound connectivity for the impacted VMs. It is recommended to enable Log Analytics as a destination as it allows you to easily query the data. For the detailed steps, visit Monitor with StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn.

Tip: You may enable flow logs even when not troubleshooting to ensure you’ll have historical data to reference when issues occur.

Step 3: Confirm whether the connection was established

Use Log Analytics to query for flows with source IP == VM private IP and destination IP == IP address(es) of github.com. The following query will generate a table and chart of the total packets sent per minute from your source IP to the destination IP through your NAT gateway in the last 24 hours.

    NatGatewayFlowlogsV1
    | where TimeGenerated > ago(1d)
    | where SourceIP == '10.0.0.4' //and DestinationIP == <"github.com IP">
    | summarize TotalPacketsSent = sum(PacketsSent) by TimeGenerated = bin(TimeGenerated, 1m), SourceIP, DestinationIP
    | order by TimeGenerated asc

- If there are no records of this connection, the connection was most likely never established, because flow logs only capture records of established connections. Take a look at SNAT connection metrics to determine whether it may be a SNAT port exhaustion issue, and check for NSGs/UDRs that may be blocking the traffic.
- If there are records of the connection, proceed with the next step.

Step 4: Check if there are any packets dropped

In Log Analytics, query for the total "PacketsSentDropped" and "PacketsReceivedDropped" per source/outbound/destination IP connection.

- If "PacketsSentDropped" > 0, NAT gateway dropped traffic sent from your VM.
- If "PacketsReceivedDropped" > 0, NAT gateway dropped traffic received from the destination IP, github.com in this case.

In both instances, it typically means that either the client or the server is pushing more traffic through a single connection than is optimal, causing connection-level rate limiting. To mitigate:

- Avoid relying on one connection and instead use multiple connections.
- Distribute traffic across multiple outbound IP addresses by assigning more public IP addresses to the NAT gateway resource.
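The Step 4 check can also be run offline on exported rows. The following is a minimal sketch, assuming rows have been exported as dictionaries carrying the SourceIP, DestinationIP, PacketsSentDropped, and PacketsReceivedDropped columns named in this post; the function names and sample values are hypothetical, and 203.0.113.20 is a documentation address standing in for the destination.

```python
from collections import defaultdict


def summarize_drops(records):
    """Sum dropped-packet counters per (SourceIP, DestinationIP) pair."""
    totals = defaultdict(lambda: {"PacketsSentDropped": 0, "PacketsReceivedDropped": 0})
    for record in records:
        key = (record["SourceIP"], record["DestinationIP"])
        totals[key]["PacketsSentDropped"] += record.get("PacketsSentDropped", 0)
        totals[key]["PacketsReceivedDropped"] += record.get("PacketsReceivedDropped", 0)
    return dict(totals)


def flows_with_drops(records):
    """Keep only the pairs where at least one drop counter is above zero."""
    return {
        key: counts
        for key, counts in summarize_drops(records).items()
        if counts["PacketsSentDropped"] > 0 or counts["PacketsReceivedDropped"] > 0
    }


# Hypothetical exported rows for two VMs talking to the same destination.
sample_records = [
    {"SourceIP": "10.0.0.4", "DestinationIP": "203.0.113.20",
     "PacketsSentDropped": 0, "PacketsReceivedDropped": 12},
    {"SourceIP": "10.0.0.4", "DestinationIP": "203.0.113.20",
     "PacketsSentDropped": 3, "PacketsReceivedDropped": 0},
    {"SourceIP": "10.0.0.5", "DestinationIP": "203.0.113.20",
     "PacketsSentDropped": 0, "PacketsReceivedDropped": 0},
]

for (source, destination), counts in flows_with_drops(sample_records).items():
    print(source, "->", destination, counts)
```

Pairs that show up in the result are the candidates for the mitigations listed above, such as spreading traffic over more connections or more outbound public IPs.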
Conclusion

StandardV2 NAT Gateway Flow Logs unlock a powerful new dimension of outbound visibility, and they can help you:

- Validate cybersecurity readiness
- Audit outbound flows
- Diagnose intermittent connectivity issues
- Understand traffic patterns and optimize architecture

We are excited to see how you leverage this new capability with your StandardV2 NAT gateways!

Have more questions?

As always, for any feedback, please feel free to reach us by submitting your feedback. We look forward to hearing your thoughts and hope this announcement helps you build more resilient applications in Azure.

For more information on StandardV2 NAT Gateway Flow Logs and how to enable them, visit:

- Manage StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn
- Monitor with StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn

To see the most up-to-date pricing for flow logs, visit Azure NAT Gateway - Pricing | Microsoft Azure. To learn more about StandardV2 NAT Gateway, visit What is Azure NAT Gateway? | Microsoft Learn.

Azure Front Door: Implementing lessons learned following October outages
Abhishek Tiwari, Vice President of Engineering, Azure Networking Amit Srivastava, Principal PM Manager, Azure Networking Varun Chawla, Partner Director of Engineering Introduction Azure Front Door is Microsoft's advanced edge delivery platform encompassing Content Delivery Network (CDN), global security and traffic distribution into a single unified offering. By using Microsoft's extensive global edge network, Azure Front Door ensures efficient content delivery and advanced security through 210+ global and local points of presence (PoPs) strategically positioned closely to both end users and applications. As the central global entry point from the internet onto customer applications, we power mission critical customer applications as well as many of Microsoft’s internal services. We have a highly distributed resilient architecture, which protects against failures at the server, rack, site and even at the regional level. This resiliency is achieved by the use of our intelligent traffic management layer which monitors failures and load balances traffic at server, rack or edge sites level within the primary ring, supplemented by a secondary-fallback ring which accepts traffic in case of primary traffic overflow or broad regional failures. We also deploy a traffic shield as a terminal safety net to ensure that in the event of a managed or unmanaged edge site going offline, end user traffic continues to flow to the next available edge site. Like any large-scale CDN, we deploy each customer configuration across a globally distributed edge fleet, densely shared with thousands of other tenants. While this architecture enables global scale, it carries the risk that certain incompatible configurations, if not contained, can propagate broadly and quickly which can result in a large blast radius of impact. 
Here we describe how the two recent service incidents impacting Azure Front Door have reinforced the need to accelerate ongoing investments in hardening our resiliency, and tenant isolation strategy to mitigate likelihood and the scale of impact from this class of risk. October incidents: recap and key learnings Azure Front Door experienced two service incidents; on October 9 th and October 29 th , both with customer-impacting service degradation. On October 9 th : A manual cleanup of stuck tenant metadata bypassed our configuration protection layer, allowing incompatible metadata to propagate beyond our canary edge sites. This metadata was created on October 7 th , from a control-plane defect triggered by a customer configuration change. While the protection system initially blocked the propagation, the manual override operation bypassed our safeguards. This incompatible configuration reached the next stage and activated a latent data-plane defect in a subset of edge sites, causing availability impact primarily across Europe (~6%) and Africa (~16%). You can learn more about this issue in detail at https://aka.ms/AIR/QNBQ-5W8 On October 29 th : A different sequence of configuration changes across two control-plane versions produced incompatible metadata. Because the failure mode in the data-plane was asynchronous, the health checks validations embedded in our protection systems were all passed during the rollout. The incompatible customer configuration metadata successfully propagated globally through a staged rollout and also updated the “last known good” (LKG) snapshot. Following this global rollout, the asynchronous process in data-plane exposed another defect which caused crashes. This impacted connectivity and DNS resolutions for all applications onboarded to our platform. Extended recovery time amplified impact on customer applications and Microsoft services. 
You can learn more about this issue in detail at https://aka.ms/AIR/YKYN-BWZ
We took away a number of clear and actionable lessons from these incidents, which are applicable not just to our service, but to any multi-tenant, high-density, globally distributed system.
Configuration resiliency – Valid configuration updates should propagate safely, consistently, and predictably across our global edge, while ensuring that incompatible or erroneous configuration never propagates beyond canary environments.
Data plane resiliency – Configuration processing in the data plane must not cause availability impact to any customer.
Tenant isolation – Traditional isolation techniques such as hardware partitioning and virtualization are impractical at edge sites. This requires innovative sharding techniques to ensure single tenant-level isolation, a must-have to reduce potential blast radius.
Accelerated and automated recovery time objective (RTO) – The system should be able to automatically revert to the last known good configuration within an acceptable RTO. For a service like Azure Front Door, we deem ~10 mins to be a practical RTO for our hundreds of thousands of customers at every edge site.
Post outage, given the severity of an incident that allowed an incompatible configuration to propagate globally, we made the difficult decision to temporarily block configuration changes in order to expedite rollout of additional safeguards. Between October 29th and November 5th, we prioritized and deployed immediate hardening steps before reopening configuration changes. We are confident that the system is stable, and we are continuing to invest in additional safeguards to further strengthen the platform's resiliency.
Learning category | Goal | Repairs | Status
--- | --- | --- | ---
Safe customer configuration deployment | Incompatible configuration never propagates beyond Canary | Control plane and data plane defect fixes; Forced synchronous configuration processing; Additional stages with extended bake time; Early detection of crash state | Completed
Data plane resiliency | Configuration processing cannot impact data plane availability | Manage data-plane lifecycle to prevent outages caused by configuration-processing defects. | Completed
 | | Isolated work-process in every data plane server to process and load the configuration. | January 2026
100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the batch of critical services impacted by the Oct 29th outage, running on a day-old configuration | Completed
 | | Phase 2: Automation & hardening of operations, auto-failover and self-management of Azure Front Door onboarding for additional services | March 2026
Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed
 | | Accelerate recovery time < 10 minutes | March 2026
Tenant isolation | No configuration or traffic regression can impact other tenants | Micro cellular Azure Front Door with ingress layered shards | June 2026
This blog is the first in a multi-part series on Azure Front Door resiliency. In this blog, we will focus on configuration resiliency: how we are making the configuration pipeline safer and more robust. Subsequent blogs will cover tenant isolation and recovery improvements.
How our configuration propagation works
Azure Front Door configuration changes can be broadly classified into three distinct categories.
Service code & data – these include all aspects of the Azure Front Door service, such as the management plane, control plane, data plane, and configuration propagation system.
Azure Front Door follows a safe deployment practice (SDP) process to roll out newer versions of management, control or data plane over a period of approximately 2-3 weeks. This ensures that any regression in software does not have a global impact. However, latent bugs that escape pre-validation and SDP rollout can remain undetected until a specific combination of customer traffic patterns or configuration changes trigger the issue. Web Application Firewall (WAF) & L7 DDoS platform data – These datasets are used by Azure Front Door to deliver security and load-balancing capabilities. Examples include GeoIP data, malicious attack signatures, and IP reputation signatures. Updates to these datasets occur daily through multiple SDP stages with an extended bake time of over 12 hours to minimize the risk of global impact during rollout. This dataset is shared across all customers and the platform, and it is validated immediately since it does not depend on variations in customer traffic or configuration steps. Customer configuration data – Examples of these are any customer configuration change—whether a routing rule update, backend pool modification, WAF rule change, or security policy change. Due to the nature of these changes, it is expected across the edge delivery / CDN industry to propagate these changes globally in 5-10 mins. Both outages stemmed from issues within this category. All configuration changes, including customer configuration data, are processed through a multi-stage pipeline designed to ensure correctness before global rollout across Azure Front Door’s 200+ edge locations. At a high level, Azure Front Door’s configuration propagation system has two distinct components - Control plane – Accepts customer API/portal changes (create/update/delete for profiles, routes, WAF policies, origins, etc.) and translates them into internal configuration metadata which the data plane can understand. 
Data plane – Globally distributed edge servers that terminate client traffic, apply routing/WAF logic, and proxy to origins using the configuration produced by the control plane.
Between these two halves sits a multi-stage configuration rollout pipeline with a dedicated protection system (known as ConfigShield):
Changes flow through multiple stages (pre-canary, canary, expanding waves to production) rather than going global at once.
Each stage is health-gated: the data plane must remain within strict error and latency thresholds before proceeding. Each stage’s health check also rechecks the previous stages’ health for any regressions.
A successfully completed rollout updates a last known good (LKG) snapshot used for automated rollback.
Historically, rollout targeted global completion in roughly 5–10 minutes, in line with industry standards.
Customer configuration processing in Azure Front Door data plane stack
Customer configuration changes in Azure Front Door traverse multiple layers, from the control plane through the deployment system, before being converted into FlatBuffers at each Azure Front Door node. These FlatBuffers are then loaded by the Azure Front Door data plane stack, which runs as Kubernetes pods on every node.
FlatBuffer composition: Each FlatBuffer references several sub-resources such as WAF and Rules Engine schematic files, SSL certificate objects, and URL signing secrets.
Data plane architecture:
 o Master process: Accepts configuration changes (memory-mapped files with references) and manages the lifecycle of worker processes.
 o Workers: L7 proxy processes that serve customer traffic using the applied configuration.
Processing flow for each configuration update:
Load and apply in master: The transformed configuration is loaded and applied in the master process. Cleanup of unused references occurs synchronously except for certain categories. → The October 9 outage occurred during this step due to a crash triggered by incompatible metadata.
Apply to workers: Configuration is applied to all worker processes without memory overhead (FlatBuffers are memory-mapped).
Serve traffic: Workers start consuming new FlatBuffers for new requests; in-flight requests continue using old buffers. Old buffers are queued for cleanup post-completion.
Feedback to deployment service: Positive feedback signals readiness for rollout.
Cleanup: FlatBuffers are freed asynchronously by the master process after all workers load updates. → The October 29 outage occurred during this step due to a latent bug in reference counting logic.
The October incidents showed we needed to strengthen key aspects of configuration validation, propagation safeguards, and runtime behavior. During the Azure Front Door incident on October 9th, the protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During the Azure Front Door incident on October 29th, the incompatible customer configuration metadata progressed through the protection system before the delayed asynchronous processing task resulted in the crash.
Configuration propagation safeguards
Based on learnings from the incidents, we are implementing a comprehensive set of configuration resiliency improvements. These changes aim to guarantee that any sequence of configuration changes cannot trigger instability in the data plane, and to ensure quicker recovery in the event of anomalies.
Strengthening configuration generation safety
This improvement pivots on a ‘shift-left’ strategy where we want to ensure that we catch regressions early, before they propagate to production. It also includes fixing the latent defects which were the proximate cause of the outage.
Fixing outage specific defects - We have fixed the control-plane defects that could generate incompatible tenant metadata under specific operation sequences. We have also remediated the associated data-plane defects.
Stronger cross-version validation - We are expanding our test and validation suite to account for changes across multiple control plane build versions. This is expected to be fully completed by February 2026.
Fuzz testing - Automated fuzzing and testing of the metadata generation contract between the control plane and the data plane. This allows us to generate an expanded set of invalid/unexpected configuration combinations which might not be achievable by traditional test cases alone. This is expected to be fully completed by February 2026.
Preventing incompatible configurations from being propagated
This segment of the resiliency strategy strives to ensure that a potentially dangerous configuration change never propagates beyond the canary stage.
Protection system is “always-on” - Enhancements to operational procedures and tooling prevent bypass in all scenarios (including internal cleanup/maintenance), and any cleanup must flow through the same guarded stages and health checks as standard configuration changes. This is completed.
Making rollout behavior more predictable and conservative - Configuration processing in the data plane is now fully synchronous. Every data plane issue due to incompatible metadata can be detected within 10 seconds at every stage. This is completed.
Enhancement to deployment pipeline - Additional stages during rollout and extended bake time between stages serve as an additional safeguard during configuration propagation. This is completed.
Recovery tool improvements - It is now easier to revert to any previous version of LKG with a single click. This is completed.
These changes significantly improve system safety. Post-outage we have increased the configuration propagation time to approximately 45 minutes. We are working towards reducing configuration propagation time closer to pre-incident levels once additional safeguards covered in the Data plane resiliency section below are completed by mid-January 2026.
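The gating behavior described in this section (staged rollout, health checks that also recheck earlier stages, and an LKG snapshot that only advances after a fully successful rollout) can be sketched in a few lines of Python. This is an illustration only, under stated assumptions: the stage names, the is_healthy callback, and the Rollout class are hypothetical and do not represent Azure Front Door's actual implementation, and bake time between stages is omitted.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict

# Hypothetical stage names, ordered from smallest to largest blast radius.
STAGES = ["pre-canary", "canary", "wave-1", "wave-2", "global"]


@dataclass
class Rollout:
    # is_healthy(stage, version) -> bool is an assumed health probe.
    is_healthy: Callable[[str, str], bool]
    last_known_good: str
    deployed: Dict[str, str] = field(default_factory=dict)

    def propagate(self, version: str) -> bool:
        """Push `version` stage by stage; roll back on any failed gate."""
        completed = []
        for stage in STAGES:
            self.deployed[stage] = version
            touched = completed + [stage]
            # Gate: the new stage AND every earlier stage must be healthy.
            if not all(self.is_healthy(s, version) for s in touched):
                # Revert every stage that received the new version.
                for s in touched:
                    self.deployed[s] = self.last_known_good
                return False
            completed.append(stage)
        # Only a fully successful rollout advances the LKG snapshot.
        self.last_known_good = version
        return True
```

A health probe that only observes synchronous failures would still pass a configuration whose failure shows up later, which is why the text pairs these gates with fully synchronous configuration processing.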
Data plane resiliency The data plane recovery was the toughest part of recovery efforts during the October incidents. We must ensure fast recovery as well as resilience to configuration processing related issues for the data plane. To address this, we implemented changes that decouple the data plane from incompatible configuration changes. With these enhancements, the data plane continues operating on the last known good configuration—even if the configuration pipeline safeguards fail to protect as intended. Decoupling data plane from configuration changes Each server’s data plane consists of a master process which accepts configuration changes and manages lifecycle of multiple worker processes which serve customer traffic. One of the critical reasons for the prolonged outage in October was that due to latent defects in the data plane, when presented with a bad configuration the master process crashed. The master is a critical command-and-control process and when it crashes it takes down the entire data plane, in that node. Recovery of the master process involves reloading hundreds of thousands of configurations from scratch and took approximately 4.5 hours. We have since made changes to the system to ensure that even in the event of the master process crash due to any reason - including incompatible configuration data being presented - the workers remain healthy and able to serve traffic. During such an event, the workers would not be able to accept new configuration changes but will continue to serve customer traffic using the last known good configuration. This work is completed. Introducing Food Taster: strengthening config propagation resiliency In our efforts to further strengthen Azure Front Door’s configuration propagation system, we are introducing an additional configuration safeguard known internally as Food Taster which protects the master and worker processes from any configuration change related incidents, thereby ensuring data plane resiliency. 
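As a rough illustration of the validate-first pattern behind this kind of safeguard, the sketch below loads a candidate configuration in a separate, throwaway process before the master touches it. Everything here is an assumption made for illustration: the JSON configuration shape, the checks in TASTER_SCRIPT, and the taste and Master names are hypothetical, and the real component's validation logic is not public.

```python
import json
import subprocess
import sys

# Code run inside the isolated taster process. A crash or non-zero exit
# here cannot take down the caller. The checks are illustrative only.
TASTER_SCRIPT = r"""
import json, sys
config = json.load(sys.stdin)
if not isinstance(config.get("routes"), list):
    sys.exit(2)
for route in config["routes"]:
    if not route["origin"]:  # a KeyError here simulates a crash on bad metadata
        sys.exit(3)
print("Config OK")
"""


def taste(config: dict, timeout_s: float = 10.0) -> bool:
    """Return True only if the isolated process loads the config cleanly."""
    try:
        result = subprocess.run(
            [sys.executable, "-c", TASTER_SCRIPT],
            input=json.dumps(config),
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return False
    return result.returncode == 0 and result.stdout.strip() == "Config OK"


class Master:
    """Keeps serving the last known good config unless a candidate passes."""

    def __init__(self, last_known_good: dict):
        self.active = last_known_good

    def apply(self, candidate: dict) -> bool:
        if not taste(candidate):
            return False  # rejected: active config is left untouched
        self.active = candidate
        return True
```

The design point is that the check is synchronous and the failure domain is the throwaway process: a crash, a hang past the timeout, or a missing confirmation are all treated as a rejection.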
The principle is simple: every data-plane server will have a redundant and isolated process – the Food Taster – whose only job is to ingest and process new configuration metadata first and then pass validated configuration changes to active data plane. This redundant worker does not accept any customer traffic. All configuration processing in this Food Taster is fully synchronous. That means we do all parsing, validation, and any expensive or risky work up front, and we do not move on until the Food Taster has either proven the configuration is safe or rejected it. Only when the Food Taster successfully loads the configuration and returns “Config OK” does the master process proceed to load the same config and then instruct the worker processes to do the same. If anything goes wrong in the Food Taster, the failure is contained to that isolated worker; the master and traffic-serving workers never see that invalid configuration. We expect this safeguard to reach production globally in January 2026 timeframe. Introduction of this component will also allow us to return closer to pre-incident level of configuration propagation while ensuring data plane safety. Closing This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO). We deeply value our customers’ trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. 
By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we’re making Azure Front Door even more resilient so your applications can be too.
Advanced Container Apps Networking: VNet Integration and Centralized Firewall Traffic Logging
Azure community, I recently documented a networking scenario relevant to Azure Container Apps environments where you need to control and inspect application traffic using a third-party network virtual appliance. The article walks through a practical deployment pattern:
• Integrate your Azure Container Apps environment with a Virtual Network.
• Configure user-defined routes (UDRs) so that traffic from your container workloads is directed toward a firewall appliance before reaching external networks or backend services.
• Verify actual traffic paths using firewall logs to confirm that routing policies are effective.
This pattern is helpful for organizations that must enforce advanced filtering, logging, or compliance checks on container egress/ingress traffic, going beyond what native Azure networking controls provide. It also complements Azure Firewall and NSG controls by introducing a dedicated next-generation firewall within your VNet. If you’re working with network control, security perimeters, or hybrid network architectures involving containerized workloads on Azure, you might find it useful. Read the full article on my blog
Data Center Quantized Congestion Notification: Scaling congestion control for RoCE RDMA in Azure
As cloud storage demands continue to grow, the need for ultra-fast, reliable networking becomes ever more critical. Microsoft Azure’s journey to empower its storage infrastructure with RDMA (Remote Direct Memory Access) has been transformative, but it’s not without challenges—especially when it comes to congestion control at scale. Azure’s deployment of RDMA at regional scale relies on DCQCN (Data Center Quantized Congestion Notification), a protocol that’s become central to Azure’s ability to deliver high-throughput, low-latency storage services across vast, heterogeneous data center regions. Why congestion control matters in RDMA networks RDMA offloads the network stack to NIC hardware, reducing CPU overhead and enabling near line-rate performance. However, as Azure scaled RDMA across clusters and regions, it faced new challenges: Heterogeneous hardware: Different generations of RDMA NICs (Network Interface Cards) and switches, each with their own quirks. Variable latency: Long-haul links between datacenters introduce large round-trip time (RTT) variations. Congestion risks: High-speed, incast-like traffic patterns can easily overwhelm buffers, leading to packet loss and degraded performance. To address these, Azure needed a congestion control protocol that could operate reliably across diverse hardware and network conditions. Traditional TCP congestion control mechanisms don’t apply here, so Azure leverages DCQCN combined with Priority Flow Control (PFC) to maintain high throughput, low latency, and near-zero packet loss. How DCQCN works DCQCN coordinates congestion control using three main entities: Reaction point (RP): The sender adjusts its rate based on feedback. Congestion point (CP): Switches mark packets using ECN when queues exceed thresholds. Notification point (NP): The receiver sends Congestion Notification Packets (CNPs) upon receiving ECN-marked packets. 
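The interaction between these three roles can be sketched as follows. This is a simplified model based on the published DCQCN algorithm, not Azure's NIC or switch implementation: the parameter values (g, the additive step) are illustrative assumptions, and the separate timer and byte-counter stages of rate increase are collapsed into a single on_quiet_period step.

```python
from dataclasses import dataclass, field


def ecn_mark_probability(queue_kb: float, k_min: float, k_max: float,
                         p_max: float) -> float:
    """Congestion point: RED-style ECN marking probability for a queue depth."""
    if queue_kb <= k_min:
        return 0.0
    if queue_kb > k_max:
        return 1.0
    return p_max * (queue_kb - k_min) / (k_max - k_min)


@dataclass
class ReactionPoint:
    """Sender-side rate state. Names and defaults are illustrative."""
    line_rate_gbps: float
    g: float = 1 / 256                 # gain used to update alpha
    additive_step_gbps: float = 0.04   # assumed additive-increase step
    alpha: float = 1.0                 # running estimate of congestion extent
    current_rate: float = field(init=False)
    target_rate: float = field(init=False)

    def __post_init__(self) -> None:
        self.current_rate = self.line_rate_gbps
        self.target_rate = self.line_rate_gbps

    def on_cnp(self) -> None:
        """A CNP arrived from the notification point: cut the rate."""
        self.target_rate = self.current_rate
        self.current_rate *= 1 - self.alpha / 2
        self.alpha = (1 - self.g) * self.alpha + self.g

    def on_quiet_period(self) -> None:
        """No CNP during a timer period: decay alpha and recover rate."""
        self.alpha *= 1 - self.g
        self.target_rate = min(self.target_rate + self.additive_step_gbps,
                               self.line_rate_gbps)
        self.current_rate = (self.target_rate + self.current_rate) / 2
```

For example, a sender at a line rate of 100 Gbps that receives one CNP with alpha at 1.0 halves its rate to 50 Gbps, and one quiet period later climbs halfway back toward its target, to 75 Gbps.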
This feedback loop allows RDMA flows to dynamically adapt their sending rates, preventing congestion collapse while maintaining fairness. When the switch detects congestion, it marks packets with ECN. The receiver NIC (NP) observes ECN marks and sends CNPs to the sender. The sender NIC (RP) reduces its sending rate upon receiving CNPs; otherwise, it increases the rate gradually. Interoperability challenges across different hardware generations Cloud infrastructure evolves incrementally, typically at the level of individual clusters or racks, as newer server hardware generations are introduced. Within a single region, clusters often differ in their NIC configurations. Our deployment includes three generations of commodity RDMA NICs—Gen1, Gen2, and Gen3—each implementing DCQCN with distinct design variations. These discrepancies create complex and often problematic interactions when NICs from different generations interoperate. Gen1 NICs: Firmware-based DCQCN, NP-side CNP coalescing, burst-based rate limiting. Gen2/Gen3 NICs: Hardware-based DCQCN, RP-side CNP coalescing, per-packet rate limiting. Problem: Gen2/Gen3 NICs sending to Gen1 can trigger excessive cache misses, slowing down Gen1’s receiver pipeline. Gen1 sending to Gen2/Gen3 can cause excessive rate reductions due to frequent CNPs. Azure’s solution: Move CNP coalescing to NP side for Gen2/Gen3. Implement per-QP CNP rate limiting, matching Gen1’s timer. Enable per-burst rate limiting on Gen2/Gen3 to reduce cache pressure. DCQCN tuning: Achieving fairness and performance DCQCN is inherently RTT-fair—its rate adjustment is independent of round-trip time, making it suitable for Azure’s regional networks with RTTs ranging from microseconds to milliseconds. Key Tuning Strategies: Sparse ECN marking: Use large ECN marking thresholds (K_max - K_min) and low marking probabilities (P_max) for flows with large RTTs. 
Joint buffer and DCQCN tuning: Tune switch buffer thresholds and DCQCN parameters together to avoid premature congestion signals and optimize throughput.
Global parameter settings: Azure’s NICs support only global DCQCN settings, so parameters must work well across all traffic types and RTTs.
Real-world results
High throughput & low latency: RDMA traffic runs at line rate with near-zero packet loss.
CPU savings: Freed CPU cores can be repurposed for customer VMs or application logic.
Performance metrics: RDMA reduces CPU utilization by up to 34.5% compared to TCP for storage frontend traffic. Large I/O requests (1 MB) see up to 23.8% latency reduction for reads and 15.6% for writes.
Scalability: As of November 2025, ~85% of Azure’s traffic is RDMA, supported in all public regions.
Conclusion
DCQCN is a cornerstone of Azure’s RDMA-enabled storage infrastructure, enabling reliable, high-performance cloud storage at scale. By combining ECN-based signaling with dynamic rate adjustments, DCQCN ensures high throughput, low latency, and near-zero packet loss, even across heterogeneous hardware and long-haul links. Its interoperability fixes and careful tuning make it a critical enabler for RDMA adoption in modern data centers, paving the way for efficient, scalable, and resilient cloud storage.
Azure PostgreSQL Lesson Learned#12: Private Endpoint Approval Fails for Cross Subscription
Co‑authored with HaiderZ-MSFT Symptoms Customers experience issues when attempting to approve a Private Endpoint for Azure PostgreSQL Flexible Server, particularly in cross‑subscription or cross‑tenant setups: Private Endpoint remains stuck in Pending state Portal approval action fails silently or reverts Selecting the Private Endpoint displays a “No Access” message Activity logs show repeated retries followed by failure Common Error Message AuthorizationFailed: The client '<object-id>' does not have authorization to perform action 'Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write' over scope '<private-endpoint-resource-id>' or the scope is invalid. Root Cause Although the approval action is initiated from the PostgreSQL Flexible Server (service provider resource), Azure performs additional network‑level operations during approval. Specifically, Azure must update a Private Link Service Proxy on the Private Endpoint resource, which exists in the consumer subscription. When the Private Endpoint resides in a different subscription or tenant, the approval process fails if: Required Resource Providers are not registered, or The approving identity lacks network‑level permissions on the Private Endpoint scope In this case, the root cause was missing Resource Provider registration, resulting in an AuthorizationFailed error during proxy updates. Required Resource Providers Microsoft.Network Microsoft.DBforPostgreSQL If either provider is missing on either subscription, the approval process will fail regardless of RBAC configuration. Mitigation Steps Step 1: Register Resource Providers (Mandatory) Register the following providers on both subscriptions: Microsoft.Network Microsoft.DBforPostgreSQL This step alone resolves most cross‑subscription approval failures. 
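Step 1 can be scripted. The sketch below builds the Azure CLI calls (az provider register with --namespace and the global --subscription argument) for every provider and subscription pair, and optionally runs them. It is a minimal illustration: the helper names are hypothetical, the subscription IDs are placeholders, and it assumes the Azure CLI is installed and already signed in with rights to register providers.

```python
import subprocess
from typing import Callable, Iterable, List, Sequence

# Providers named in this article as required on both subscriptions.
REQUIRED_PROVIDERS = ("Microsoft.Network", "Microsoft.DBforPostgreSQL")


def registration_commands(
    subscription_ids: Iterable[str],
    providers: Sequence[str] = REQUIRED_PROVIDERS,
) -> List[List[str]]:
    """Build one `az provider register` call per subscription and provider."""
    return [
        ["az", "provider", "register", "--namespace", ns, "--subscription", sub]
        for sub in subscription_ids
        for ns in providers
    ]


def register_all(
    subscription_ids: Iterable[str],
    run: Callable[..., object] = subprocess.run,
) -> None:
    """Execute the commands; `run` is injectable so the logic can be dry-run."""
    for cmd in registration_commands(subscription_ids):
        run(cmd, check=True)
```

Calling register_all with the consumer and provider subscription IDs issues four registrations. Registration is asynchronous, so checking the provider's registration state afterwards (for example with az provider show) is a sensible follow-up before retrying the approval.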
Azure resource providers and types - Azure Resource Manager | Microsoft Learn
Step 2: Validate Network Permissions
Ensure the approving identity can perform: Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write
Grant Network Contributor if needed.
Step 3: Refresh Credentials and Retry
If changes were made recently:
Sign out and sign in again
Retry the Private Endpoint approval
Post-Resolution Outcome
After correcting provider registration and permissions:
Private Endpoint approval succeeds immediately
Connection state transitions from Pending → Approved
No further authorization or retry errors
PostgreSQL connectivity works as expected
Prevention & Best Practices
Pre-register required Resource Providers in landing zones
Validate cross-subscription readiness before creating Private Endpoints
Document service-specific approval requirements (PostgreSQL differs from Key Vault)
Automate provider registration via policy or IaC where possible
Include provider validation in enterprise onboarding checklists
Why This Matters
Missing provider registration can lead to:
Failed Private Endpoint approvals
Confusing authorization errors
Extended troubleshooting cycles
Production delays during go-live
A simple subscription readiness check prevents downstream networking failures that are difficult to diagnose from portal errors alone.
Key Takeaways
Issue: Azure PostgreSQL private endpoint approval fails across subscriptions
Root Cause: Missing Resource Provider registration
Fix: Register Microsoft.Network and Microsoft.DBforPostgreSQL on both subscriptions
Result: Approval succeeds without backend authorization failures
References
Manage Azure Private Endpoints – Azure Private Link
Approve Private Endpoint Connections – Azure Database for PostgreSQL
Private Endpoint Overview – Azure Private Link
Azure Networking 2025: Powering cloud innovation and AI at global scale
In 2025, Azure’s networking platform proved itself as the invisible engine driving the cloud’s most transformative innovations. Consider the construction of Microsoft’s new Fairwater AI datacenter in Wisconsin – a 315-acre campus housing hundreds of thousands of GPUs. To operate as one giant AI supercomputer, Fairwater required a single flat, ultra-fast network interconnecting every GPU. Azure’s networking team delivered: the facility’s network fabric links GPUs at 800 Gbps speeds in a non-blocking architecture, enabling 10× the performance of the world’s fastest supercomputer. This feat showcases how fundamental networking is to cloud innovation. Whether it’s uniting massive AI clusters or connecting millions of everyday users, Azure’s globally distributed network is the foundation upon which new breakthroughs are built. In 2025, the surge of AI workloads, data-driven applications, and hybrid cloud adoption put unprecedented demands on this foundation. We responded with bold network investments and innovations. Each new networking feature delivered in 2025, from smarter routing to faster gateways, was not just a technical upgrade but an innovation enabling customers to achieve more. This post recaps the year’s major releases across Azure Networking services and highlights how AI both drives and benefits from these advancements.
Unprecedented connectivity for a hybrid and AI era
Hybrid connectivity at scale: Azure’s network enhancements in 2025 focused on making global and hybrid connectivity faster, simpler, and ready for the next wave of AI-driven traffic. For enterprises extending on-premises infrastructure to Azure, Azure ExpressRoute private connectivity saw a major leap in capacity: Microsoft announced support for 400 Gbps ExpressRoute Direct ports (available in 2026) to meet the needs of AI supercomputing and massive data volumes.
These high-speed ports – which can be aggregated into multi-terabit links – ensure that even the largest enterprises or HPC clusters can transfer data to Azure with dedicated, low-latency links. In parallel, Azure VPN Gateway performance reached new highs, with a generally available upgrade that delivers up to 20 Gbps aggregate throughput per gateway and 5 Gbps per individual tunnel. This is a 3× increase over previous limits, enabling branch offices and remote sites to connect to Azure even more seamlessly without bandwidth bottlenecks. Together, the ExpressRoute and VPN improvements give customers a spectrum of high-performance options for hybrid networking – from offices and datacenters to the cloud – supporting scenarios like large-scale data migrations, resilient multi-site architectures, and hybrid AI processing. Simplified global networking: Azure Virtual WAN (vWAN) continued to mature as the one-stop solution for managing global connectivity. Virtual WAN introduced forced tunneling for Secure Virtual Hubs (now in preview), which allows organizations to route all Internet-bound traffic from branch offices or virtual networks back to a central hub for inspection. This capability simplifies the implementation of a “backhaul to hub” security model – for example, forcing branches to use a central firewall or security appliance – without complex user-defined routing. Empowering multicloud and NVA integration: Azure recognizes that enterprise networks are diverse. Azure Route Server improvements enhanced interoperability with customer equipment and third-party network virtual appliances (NVAs). Notably, Azure Route Server now supports up to 500 virtual network connections (spokes) per route server, a significant scale boost that enables larger hub-and-spoke topologies and simplified Border Gateway Protocol (BGP) route exchange even in very large environments. 
This helps customers using SD-WAN appliances or custom firewalls in Azure to seamlessly learn routes from hundreds of VNet spokes – maintaining central routing control without manual configuration. Additionally, Azure Route Server introduced a preview of hub routing preference, giving admins the ability to influence BGP route selection (for example, preferring ExpressRoute over a VPN path, or vice versa). This fine-grained control means hybrid networks can be tuned for optimal performance and cost. Resilience and reliability by design Azure’s growth has been underpinned by making the network “resilient by default.” We shipped tools to help validate and improve network resiliency. ExpressRoute Resiliency Insights was released for general availability – delivering an intelligent assessment of an enterprise’s ExpressRoute setup. This feature evaluates how well your ExpressRoute circuits and gateways are architected for high availability (for example, using dual circuits in diverse locations, zone-redundant gateways, etc.) and assigns a resiliency index score as a percentage. It will highlight suboptimal configurations – such as routes advertised on only one circuit, or a gateway that isn’t zone-redundant – and provide recommendations for improvement. Moreover, Resiliency Insights includes a failover simulation tool that can test circuit redundancy by mimicking failures, so you can verify that your connections will survive real-world incidents. By proactively monitoring and testing resilience, Azure is helping customers achieve “always-on” connectivity even in the face of fiber cuts, hardware faults, or other disruptions. Security, governance, and trust in the network As enterprises entrust more core business to Azure, the platform’s networking services advanced on security and governance – helping customers achieve Zero Trust networks and high compliance with minimal complexity. Azure DNS now offers DNS Security Policies with Threat Intelligence feeds (GA). 
This capability allows organizations to protect their DNS queries from known malicious domains by leveraging continuously updated threat intel. For example, if a known phishing domain or C2 (command-and-control) hostname appears in DNS queries from your environment, Azure DNS can automatically block or redirect those requests. Because DNS is often the first line of detection for malware and phishing activities, this built-in filtering provides a powerful layer of defense that’s fully managed by Azure. It’s essentially a cloud-delivered DNS firewall using Microsoft’s vast threat intelligence – enabling all Azure customers to benefit from enterprise-grade security without deploying additional appliances. Network traffic governance was another focus. The introduction of forced tunneling in Azure Virtual WAN hubs (preview) shared above is a prime example where networking meets security compliance. Optimizing cloud-native and edge networks We previewed DNS intelligent traffic control features – such as filtering DNS queries to prevent data exfiltration and applying flexible recursion policies – which complement the DNS Security offering in safeguarding name resolution. Meanwhile, for load balancing across regions, Azure Traffic Manager’s behind-the-scenes upgrades (as noted earlier) improved reliability, and it’s evolving to integrate with modern container-based apps and edge scenarios. AI-powered networking: Both enabling and enabled by AI We are infusing AI into networking to make management and troubleshooting more intelligent. Networking functionality in Azure Copilot accelerates tasks like never before: it outlines the best practices instantly and troubleshooting that once required combing through docs and logs can be conversational. It effectively democratizes networking expertise, helping even smaller IT teams manage sophisticated networks by leveraging AI recommendations. 
The future of cloud networking in an AI world As we close out 2025, one message is clear: networking is strategic. The network is no longer a static utility – it is the adaptive circulatory system of the cloud, determining how far and fast customers can go. By delivering higher speeds, greater reliability, tighter security, and easier management, Azure Networking has empowered businesses to connect everything to anything, anywhere – securely and at scale. These advances unlock new scenarios: global supply chains running in real-time over a trusted network, multi-player AR/VR and gaming experiences delivered without lag, and AI models trained across continents. Looking ahead, AI-powered networking will become the norm. The convergence of AI and network tech means we will see more self-optimizing networks that can heal, defend, and tune themselves with minimal human intervention.