Azure Networking
Data Center Quantized Congestion Notification: Scaling congestion control for RoCE RDMA in Azure
As cloud storage demands continue to grow, the need for ultra-fast, reliable networking becomes ever more critical. Microsoft Azure’s journey to empower its storage infrastructure with RDMA (Remote Direct Memory Access) has been transformative, but it’s not without challenges—especially when it comes to congestion control at scale. Azure’s deployment of RDMA at regional scale relies on DCQCN (Data Center Quantized Congestion Notification), a protocol that’s become central to Azure’s ability to deliver high-throughput, low-latency storage services across vast, heterogeneous data center regions.

Why congestion control matters in RDMA networks
RDMA offloads the network stack to NIC hardware, reducing CPU overhead and enabling near line-rate performance. However, as Azure scaled RDMA across clusters and regions, it faced new challenges:
Heterogeneous hardware: Different generations of RDMA NICs (Network Interface Cards) and switches, each with their own quirks.
Variable latency: Long-haul links between datacenters introduce large round-trip time (RTT) variations.
Congestion risks: High-speed, incast-like traffic patterns can easily overwhelm buffers, leading to packet loss and degraded performance.
To address these, Azure needed a congestion control protocol that could operate reliably across diverse hardware and network conditions. Traditional TCP congestion control mechanisms don’t apply here, so Azure leverages DCQCN combined with Priority Flow Control (PFC) to maintain high throughput, low latency, and near-zero packet loss.

How DCQCN works
DCQCN coordinates congestion control using three main entities:
Reaction point (RP): The sender adjusts its rate based on feedback.
Congestion point (CP): Switches mark packets using ECN when queues exceed thresholds.
Notification point (NP): The receiver sends Congestion Notification Packets (CNPs) upon receiving ECN-marked packets.
This feedback loop allows RDMA flows to dynamically adapt their sending rates, preventing congestion collapse while maintaining fairness. When the switch detects congestion, it marks packets with ECN. The receiver NIC (NP) observes ECN marks and sends CNPs to the sender. The sender NIC (RP) reduces its sending rate upon receiving CNPs; otherwise, it increases the rate gradually. (The canonical equations behind this loop are sketched at the end of this section.)

Interoperability challenges across different hardware generations
Cloud infrastructure evolves incrementally, typically at the level of individual clusters or racks, as newer server hardware generations are introduced. Within a single region, clusters often differ in their NIC configurations. Our deployment includes three generations of commodity RDMA NICs—Gen1, Gen2, and Gen3—each implementing DCQCN with distinct design variations. These discrepancies create complex and often problematic interactions when NICs from different generations interoperate.
Gen1 NICs: Firmware-based DCQCN, NP-side CNP coalescing, burst-based rate limiting.
Gen2/Gen3 NICs: Hardware-based DCQCN, RP-side CNP coalescing, per-packet rate limiting.
Problem:
Gen2/Gen3 NICs sending to Gen1 can trigger excessive cache misses, slowing down Gen1’s receiver pipeline.
Gen1 sending to Gen2/Gen3 can cause excessive rate reductions due to frequent CNPs.
Azure’s solution:
Move CNP coalescing to the NP side for Gen2/Gen3.
Implement per-QP CNP rate limiting, matching Gen1’s timer.
Enable per-burst rate limiting on Gen2/Gen3 to reduce cache pressure.
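For readers who want the feedback loop made concrete, the textbook DCQCN formulation (from the original DCQCN literature, not Azure-specific parameter values) works as follows. The congestion point marks packets with a RED-style probability governed by the K_min, K_max, and P_max thresholds referenced in the tuning section below, and on each CNP the reaction point records a target rate and cuts its current rate multiplicatively. Treat this as a sketch of the standard algorithm rather than Azure's exact implementation:

```latex
% RED-style ECN marking probability at the congestion point, for queue length q:
p(q) =
\begin{cases}
0 & q < K_{\min} \\[4pt]
P_{\max}\,\dfrac{q - K_{\min}}{K_{\max} - K_{\min}} & K_{\min} \le q \le K_{\max} \\[4pt]
1 & q > K_{\max}
\end{cases}

% Reaction-point update on receiving a CNP (R_C = current rate, R_T = target rate,
% alpha = congestion estimate, g = a small gain constant):
R_T \leftarrow R_C, \qquad
R_C \leftarrow R_C\!\left(1 - \frac{\alpha}{2}\right), \qquad
\alpha \leftarrow (1 - g)\,\alpha + g
```

Between CNPs, the estimate decays as alpha <- (1 - g) * alpha and the sender recovers toward R_T through fast recovery and additive increase, which is what "increases the rate gradually" refers to above.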
DCQCN tuning: Achieving fairness and performance
DCQCN is inherently RTT-fair—its rate adjustment is independent of round-trip time, making it suitable for Azure’s regional networks with RTTs ranging from microseconds to milliseconds.
Key tuning strategies:
Sparse ECN marking: Use a wide ECN marking band (a large K_max − K_min range) and low marking probabilities (P_max) for flows with large RTTs.
Joint buffer and DCQCN tuning: Tune switch buffer thresholds and DCQCN parameters together to avoid premature congestion signals and optimize throughput.
Global parameter settings: Azure’s NICs support only global DCQCN settings, so parameters must work well across all traffic types and RTTs.

Real-world results
High throughput & low latency: RDMA traffic runs at line rate with near-zero packet loss.
CPU savings: Freed CPU cores can be repurposed for customer VMs or application logic.
Performance metrics: RDMA reduces CPU utilization by up to 34.5% compared to TCP for storage frontend traffic. Large I/O requests (1 MB) see up to 23.8% latency reduction for reads and 15.6% for writes.
Scalability: As of November 2025, ~85% of Azure’s traffic is RDMA, supported in all public regions.

Conclusion
DCQCN is a cornerstone of Azure’s RDMA-enabled storage infrastructure, enabling reliable, high-performance cloud storage at scale. By combining ECN-based signaling with dynamic rate adjustments, DCQCN ensures high throughput, low latency, and near-zero packet loss—even across heterogeneous hardware and long-haul links. Its interoperability fixes and careful tuning make it a critical enabler for RDMA adoption in modern data centers, paving the way for efficient, scalable, and resilient cloud storage.
Azure PostgreSQL Lesson Learned #12: Private Endpoint Approval Fails for Cross Subscription
Co‑authored with HaiderZ-MSFT

Symptoms
Customers experience issues when attempting to approve a Private Endpoint for Azure PostgreSQL Flexible Server, particularly in cross‑subscription or cross‑tenant setups:
Private Endpoint remains stuck in Pending state
Portal approval action fails silently or reverts
Selecting the Private Endpoint displays a “No Access” message
Activity logs show repeated retries followed by failure

Common Error Message
AuthorizationFailed: The client '<object-id>' does not have authorization to perform action 'Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write' over scope '<private-endpoint-resource-id>' or the scope is invalid.

Root Cause
Although the approval action is initiated from the PostgreSQL Flexible Server (service provider resource), Azure performs additional network‑level operations during approval. Specifically, Azure must update a Private Link Service Proxy on the Private Endpoint resource, which exists in the consumer subscription. When the Private Endpoint resides in a different subscription or tenant, the approval process fails if:
Required Resource Providers are not registered, or
The approving identity lacks network‑level permissions on the Private Endpoint scope
In this case, the root cause was missing Resource Provider registration, resulting in an AuthorizationFailed error during proxy updates.

Required Resource Providers
Microsoft.Network
Microsoft.DBforPostgreSQL
If either provider is missing on either subscription, the approval process will fail regardless of RBAC configuration.

Mitigation Steps
Step 1: Register Resource Providers (Mandatory)
Register the following providers on both subscriptions: Microsoft.Network and Microsoft.DBforPostgreSQL. This step alone resolves most cross‑subscription approval failures (a hedged CLI sketch of these steps appears just before the key takeaways below). See: Azure resource providers and types - Azure Resource Manager | Microsoft Learn
Step 2: Validate Network Permissions
Ensure the approving identity can perform: Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write. Grant Network Contributor if needed.
Step 3: Refresh Credentials and Retry
If changes were made recently: sign out and sign in again, then retry the Private Endpoint approval.

Post‑Resolution Outcome
After correcting provider registration and permissions:
Private Endpoint approval succeeds immediately
Connection state transitions from Pending → Approved
No further authorization or retry errors
PostgreSQL connectivity works as expected

Prevention & Best Practices
Pre‑register required Resource Providers in landing zones
Validate cross‑subscription readiness before creating Private Endpoints
Document service‑specific approval requirements (PostgreSQL differs from Key Vault)
Automate provider registration via policy or IaC where possible
Include provider validation in enterprise onboarding checklists

Why This Matters
Missing provider registration can lead to:
Failed Private Endpoint approvals
Confusing authorization errors
Extended troubleshooting cycles
Production delays during go‑live
A simple subscription readiness check prevents downstream networking failures that are difficult to diagnose from portal errors alone.
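The mitigation steps above can be scripted. The following Azure CLI sketch is illustrative only: the subscription IDs and the private endpoint connection resource ID are placeholders you must substitute, and the approval command assumes the approving identity already holds the permissions described in Step 2.

```bash
# Register the required resource providers on BOTH subscriptions (placeholder IDs)
for SUB in "<consumer-subscription-id>" "<provider-subscription-id>"; do
  az account set --subscription "$SUB"
  az provider register --namespace Microsoft.Network
  az provider register --namespace Microsoft.DBforPostgreSQL
done

# Confirm both providers report "Registered" before retrying the approval
az provider show --namespace Microsoft.Network --query registrationState --output tsv
az provider show --namespace Microsoft.DBforPostgreSQL --query registrationState --output tsv

# Retry the approval against the private endpoint connection (placeholder resource ID)
az network private-endpoint-connection approve \
  --id "<private-endpoint-connection-resource-id>" \
  --description "Approved after provider registration"
```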
Key Takeaways
Issue: Azure PostgreSQL private endpoint approval fails across subscriptions
Root Cause: Missing Resource Provider registration
Fix: Register Microsoft.Network and Microsoft.DBforPostgreSQL on both subscriptions
Result: Approval succeeds without backend authorization failures

References
Manage Azure Private Endpoints – Azure Private Link
Approve Private Endpoint Connections – Azure Database for PostgreSQL
Private Endpoint Overview – Azure Private Link
Azure Front Door: Implementing lessons learned following October outages
Abhishek Tiwari, Vice President of Engineering, Azure Networking
Amit Srivastava, Principal PM Manager, Azure Networking
Varun Chawla, Partner Director of Engineering

Introduction
Azure Front Door is Microsoft's advanced edge delivery platform, combining Content Delivery Network (CDN), global security, and traffic distribution in a single unified offering. By using Microsoft's extensive global edge network, Azure Front Door ensures efficient content delivery and advanced security through 210+ global and local points of presence (PoPs) strategically positioned close to both end users and applications. As the central global entry point from the internet onto customer applications, we power mission-critical customer applications as well as many of Microsoft’s internal services.

We have a highly distributed, resilient architecture, which protects against failures at the server, rack, site, and even regional level. This resiliency is achieved through our intelligent traffic management layer, which monitors failures and load balances traffic at the server, rack, or edge-site level within the primary ring, supplemented by a secondary fallback ring which accepts traffic in case of primary traffic overflow or broad regional failures. We also deploy a traffic shield as a terminal safety net to ensure that in the event of a managed or unmanaged edge site going offline, end user traffic continues to flow to the next available edge site.

Like any large-scale CDN, we deploy each customer configuration across a globally distributed edge fleet, densely shared with thousands of other tenants. While this architecture enables global scale, it carries the risk that certain incompatible configurations, if not contained, can propagate broadly and quickly, resulting in a large blast radius of impact. Here we describe how the two recent service incidents impacting Azure Front Door have reinforced the need to accelerate ongoing investments in hardening our resiliency and tenant isolation strategy, to mitigate both the likelihood and the scale of impact from this class of risk.

October incidents: recap and key learnings
Azure Front Door experienced two service incidents, on October 9th and October 29th, both with customer-impacting service degradation.

On October 9th: A manual cleanup of stuck tenant metadata bypassed our configuration protection layer, allowing incompatible metadata to propagate beyond our canary edge sites. This metadata was created on October 7th, from a control-plane defect triggered by a customer configuration change. While the protection system initially blocked the propagation, the manual override operation bypassed our safeguards. This incompatible configuration reached the next stage and activated a latent data-plane defect in a subset of edge sites, causing availability impact primarily across Europe (~6%) and Africa (~16%). You can learn more about this issue in detail at https://aka.ms/AIR/QNBQ-5W8

On October 29th: A different sequence of configuration changes across two control-plane versions produced incompatible metadata. Because the failure mode in the data plane was asynchronous, the health-check validations embedded in our protection systems all passed during the rollout. The incompatible customer configuration metadata successfully propagated globally through a staged rollout and also updated the “last known good” (LKG) snapshot. Following this global rollout, the asynchronous process in the data plane exposed another defect which caused crashes.
This impacted connectivity and DNS resolution for all applications onboarded to our platform. Extended recovery time amplified the impact on customer applications and Microsoft services. You can learn more about this issue in detail at https://aka.ms/AIR/YKYN-BWZ

We took away a number of clear and actionable lessons from these incidents, which are applicable not just to our service, but to any multi-tenant, high-density, globally distributed system.
Configuration resiliency – Valid configuration updates should propagate safely, consistently, and predictably across our global edge, while ensuring that incompatible or erroneous configurations never propagate beyond canary environments.
Data plane resiliency – Additionally, configuration processing in the data plane must not cause availability impact to any customer.
Tenant isolation – Traditional isolation techniques such as hardware partitioning and virtualization are impractical at edge sites. This requires innovative sharding techniques to ensure single-tenant-level isolation – a must-have to reduce potential blast radius.
Accelerated and automated recovery time objective (RTO) – The system should be able to automatically revert to the last known good configuration within an acceptable RTO. For a service like Azure Front Door, we deem ~10 minutes to be a practical RTO for our hundreds of thousands of customers at every edge site.

Post-outage, given the severity of impact which allowed an incompatible configuration to propagate globally, we made the difficult decision to temporarily block configuration changes in order to expedite the rollout of additional safeguards. Between October 29th and November 5th, we prioritized and deployed immediate hardening steps before re-enabling configuration changes. We are confident that the system is stable, and we are continuing to invest in additional safeguards to further strengthen the platform's resiliency.

| Learning category | Goal | Repairs | Status |
|---|---|---|---|
| Safe customer configuration deployment | Incompatible configuration never propagates beyond Canary | Control plane and data plane defect fixes · Forced synchronous configuration processing · Additional stages with extended bake time · Early detection of crash state | Completed |
| Data plane resiliency | Configuration processing cannot impact data plane availability | Manage data-plane lifecycle to prevent outages caused by configuration-processing defects | Completed |
| | | Isolated work-process in every data plane server to process and load the configuration | January 2026 |
| 100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent Active/Active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the batch of critical services impacted in the Oct 29th outage, running on a day-old configuration | Completed |
| | | Phase 2: Automation & hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services | March 2026 |
| Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed |
| | | Accelerate recovery time to < 10 minutes | March 2026 |
| Tenant isolation | No configuration or traffic regression can impact other tenants | Micro-cellular Azure Front Door with ingress layered shards | June 2026 |

This blog is the first in a multi-part series on Azure Front Door resiliency. In this blog, we will focus on configuration resiliency—how we are making the configuration pipeline safer and more robust.
Subsequent blogs will cover tenant isolation and recovery improvements.

How our configuration propagation works
Azure Front Door configuration changes can be broadly classified into three distinct categories.
Service code & data – These include all aspects of the Azure Front Door service, such as the management plane, control plane, data plane, and configuration propagation system. Azure Front Door follows a safe deployment practice (SDP) process to roll out newer versions of the management, control, or data plane over a period of approximately 2-3 weeks. This ensures that any regression in software does not have a global impact. However, latent bugs that escape pre-validation and SDP rollout can remain undetected until a specific combination of customer traffic patterns or configuration changes triggers the issue.
Web Application Firewall (WAF) & L7 DDoS platform data – These datasets are used by Azure Front Door to deliver security and load-balancing capabilities. Examples include GeoIP data, malicious attack signatures, and IP reputation signatures. Updates to these datasets occur daily through multiple SDP stages with an extended bake time of over 12 hours to minimize the risk of global impact during rollout. This dataset is shared across all customers and the platform, and it is validated immediately since it does not depend on variations in customer traffic or configuration steps.
Customer configuration data – Examples are any customer configuration change—whether a routing rule update, backend pool modification, WAF rule change, or security policy change. Due to the nature of these changes, it is expected across the edge delivery / CDN industry to propagate them globally in 5-10 minutes. Both outages stemmed from issues within this category.

All configuration changes, including customer configuration data, are processed through a multi-stage pipeline designed to ensure correctness before global rollout across Azure Front Door’s 200+ edge locations. At a high level, Azure Front Door’s configuration propagation system has two distinct components:
Control plane – Accepts customer API/portal changes (create/update/delete for profiles, routes, WAF policies, origins, etc.) and translates them into internal configuration metadata which the data plane can understand.
Data plane – Globally distributed edge servers that terminate client traffic, apply routing/WAF logic, and proxy to origins using the configuration produced by the control plane.
Between these two halves sits a multi-stage configuration rollout pipeline with a dedicated protection system (known as ConfigShield):
Changes flow through multiple stages (pre-canary, canary, expanding waves to production) rather than going global at once.
Each stage is health-gated: the data plane must remain within strict error and latency thresholds before proceeding. Each stage’s health check also rechecks the previous stage’s health for any regressions.
A successfully completed rollout updates a last known good (LKG) snapshot used for automated rollback.
Historically, rollout targeted global completion in roughly 5–10 minutes, in line with industry standards.

Customer configuration processing in the Azure Front Door data plane stack
Customer configuration changes in Azure Front Door traverse multiple layers—from the control plane through the deployment system—before being converted into FlatBuffers at each Azure Front Door node. These FlatBuffers are then loaded by the Azure Front Door data plane stack, which runs as Kubernetes pods on every node.
FlatBuffer composition: Each FlatBuffer references several sub-resources such as WAF and Rules Engine schematic files, SSL certificate objects, and URL signing secrets.
Data plane architecture:
Master process: Accepts configuration changes (memory-mapped files with references) and manages the lifecycle of worker processes.
Workers: L7 proxy processes that serve customer traffic using the applied configuration.
Processing flow for each configuration update:
Load and apply in master: The transformed configuration is loaded and applied in the master process. Cleanup of unused references occurs synchronously except for certain categories → the October 9 outage occurred during this step, due to a crash triggered by incompatible metadata.
Apply to workers: Configuration is applied to all worker processes without memory overhead (FlatBuffers are memory-mapped).
Serve traffic: Workers start consuming new FlatBuffers for new requests; in-flight requests continue using old buffers. Old buffers are queued for cleanup post-completion.
Feedback to deployment service: Positive feedback signals readiness for rollout.
Cleanup: FlatBuffers are freed asynchronously by the master process after all workers load updates → the October 29 outage occurred during this step, due to a latent bug in the reference-counting logic.

The October incidents showed we needed to strengthen key aspects of configuration validation, propagation safeguards, and runtime behavior. During the October 9th incident, the protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During the October 29th incident, the incompatible customer configuration metadata progressed through the protection system before the delayed asynchronous processing task resulted in the crash.

Configuration propagation safeguards
Based on learnings from the incidents, we are implementing a comprehensive set of configuration resiliency improvements. These changes aim to guarantee that any sequence of configuration changes cannot trigger instability in the data plane, and to ensure quicker recovery in the event of anomalies.

Strengthening configuration generation safety
This improvement pivots on a "shift-left" strategy: we want to catch regressions early, before they propagate to production. It also includes fixing the latent defects which were the proximate cause of the outage.
Fixing outage-specific defects – We have fixed the control-plane defects that could generate incompatible tenant metadata under specific operation sequences. We have also remediated the associated data-plane defects.
Stronger cross-version validation – We are expanding our test and validation suite to account for changes across multiple control plane build versions. This is expected to be fully completed by February 2026.
Fuzz testing – Automated fuzzing and testing of the metadata generation contract between the control plane and the data plane. This allows us to generate an expanded set of invalid/unexpected configuration combinations which might not be achievable by traditional test cases alone. This is expected to be fully completed by February 2026.

Preventing incompatible configurations from being propagated
This segment of the resiliency strategy strives to ensure that a potentially dangerous configuration change never propagates beyond the canary stage.
Protection system is “always-on” – Enhancements to operational procedures and tooling prevent bypass in all scenarios (including internal cleanup/maintenance), and any cleanup must flow through the same guarded stages and health checks as standard configuration changes. This is completed.
Making rollout behavior more predictable and conservative – Configuration processing in the data plane is now fully synchronous. Every data plane issue due to incompatible metadata can be detected within 10 seconds at every stage. This is completed.
Enhancements to the deployment pipeline – Additional stages during rollout and extended bake time between stages serve as an additional safeguard during configuration propagation. This is completed. Recovery tool improvements now make it easier to revert to any previous version of LKG with a single click. This is completed.

These changes significantly improve system safety. Post-outage, we have increased the configuration propagation time to approximately 45 minutes. We are working towards reducing configuration propagation time closer to pre-incident levels once the additional safeguards covered in the Data plane resiliency section below are completed by mid-January 2026.

Data plane resiliency
Data plane recovery was the toughest part of the recovery efforts during the October incidents. We must ensure fast recovery as well as resilience to configuration-processing-related issues in the data plane. To address this, we implemented changes that decouple the data plane from incompatible configuration changes. With these enhancements, the data plane continues operating on the last known good configuration—even if the configuration pipeline safeguards fail to protect as intended.

Decoupling the data plane from configuration changes
Each server’s data plane consists of a master process, which accepts configuration changes and manages the lifecycle of multiple worker processes that serve customer traffic. One of the critical reasons for the prolonged outage in October was that, due to latent defects in the data plane, the master process crashed when presented with a bad configuration. The master is a critical command-and-control process, and when it crashes it takes down the entire data plane on that node. Recovery of the master process involves reloading hundreds of thousands of configurations from scratch and took approximately 4.5 hours. We have since made changes to the system to ensure that even in the event of a master process crash for any reason—including incompatible configuration data being presented—the workers remain healthy and able to serve traffic. During such an event, the workers would not be able to accept new configuration changes but will continue to serve customer traffic using the last known good configuration. This work is completed.

Introducing Food Taster: strengthening config propagation resiliency
In our efforts to further strengthen Azure Front Door’s configuration propagation system, we are introducing an additional configuration safeguard known internally as Food Taster, which protects the master and worker processes from configuration-change-related incidents, thereby ensuring data plane resiliency. The principle is simple: every data-plane server will have a redundant and isolated process – the Food Taster – whose only job is to ingest and process new configuration metadata first and then pass validated configuration changes to the active data plane. This redundant worker does not accept any customer traffic.
All configuration processing in the Food Taster is fully synchronous. That means we do all parsing, validation, and any expensive or risky work up front, and we do not move on until the Food Taster has either proven the configuration is safe or rejected it. Only when the Food Taster successfully loads the configuration and returns “Config OK” does the master process proceed to load the same configuration and then instruct the worker processes to do the same. If anything goes wrong in the Food Taster, the failure is contained to that isolated worker; the master and traffic-serving workers never see the invalid configuration. We expect this safeguard to reach production globally in the January 2026 timeframe. Introduction of this component will also allow us to return closer to the pre-incident configuration propagation time while ensuring data plane safety.

Closing
This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO). We deeply value our customers’ trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we’re making Azure Front Door even more resilient—so your applications can be too.
Azure Networking 2025: Powering cloud innovation and AI at global scale

In 2025, Azure’s networking platform proved itself as the invisible engine driving the cloud’s most transformative innovations. Consider the construction of Microsoft’s new Fairwater AI datacenter in Wisconsin – a 315-acre campus housing hundreds of thousands of GPUs. To operate as one giant AI supercomputer, Fairwater required a single flat, ultra-fast network interconnecting every GPU. Azure’s networking team delivered: the facility’s network fabric links GPUs at 800 Gbps speeds in a non-blocking architecture, enabling 10× the performance of the world’s fastest supercomputer. This feat showcases how fundamental networking is to cloud innovation. Whether it’s uniting massive AI clusters or connecting millions of everyday users, Azure’s globally distributed network is the foundation upon which new breakthroughs are built.

In 2025, the surge of AI workloads, data-driven applications, and hybrid cloud adoption put unprecedented demands on this foundation. We responded with bold network investments and innovations. Each new networking feature delivered in 2025, from smarter routing to faster gateways, was not just a technical upgrade but an innovation enabling customers to achieve more. This post recaps the year’s major releases across Azure Networking services and highlights how AI both drives and benefits from these advancements.

Unprecedented connectivity for a hybrid and AI era
Hybrid connectivity at scale: Azure’s network enhancements in 2025 focused on making global and hybrid connectivity faster, simpler, and ready for the next wave of AI-driven traffic. For enterprises extending on-premises infrastructure to Azure, Azure ExpressRoute private connectivity saw a major leap in capacity: Microsoft announced support for 400 Gbps ExpressRoute Direct ports (available in 2026) to meet the needs of AI supercomputing and massive data volumes. These high-speed ports – which can be aggregated into multi-terabit links – ensure that even the largest enterprises or HPC clusters can transfer data to Azure with dedicated, low-latency links. In parallel, Azure VPN Gateway performance reached new highs, with a generally available upgrade that delivers up to 20 Gbps aggregate throughput per gateway and 5 Gbps per individual tunnel. This is a 3× increase over previous limits, enabling branch offices and remote sites to connect to Azure even more seamlessly without bandwidth bottlenecks. Together, the ExpressRoute and VPN improvements give customers a spectrum of high-performance options for hybrid networking – from offices and datacenters to the cloud – supporting scenarios like large-scale data migrations, resilient multi-site architectures, and hybrid AI processing.

Simplified global networking: Azure Virtual WAN (vWAN) continued to mature as the one-stop solution for managing global connectivity. Virtual WAN introduced forced tunneling for Secure Virtual Hubs (now in preview), which allows organizations to route all Internet-bound traffic from branch offices or virtual networks back to a central hub for inspection. This capability simplifies the implementation of a “backhaul to hub” security model – for example, forcing branches to use a central firewall or security appliance – without complex user-defined routing.

Empowering multicloud and NVA integration: Azure recognizes that enterprise networks are diverse. Azure Route Server improvements enhanced interoperability with customer equipment and third-party network virtual appliances (NVAs).
Notably, Azure Route Server now supports up to 500 virtual network connections (spokes) per route server, a significant scale boost that enables larger hub-and-spoke topologies and simplified Border Gateway Protocol (BGP) route exchange even in very large environments. This helps customers using SD-WAN appliances or custom firewalls in Azure to seamlessly learn routes from hundreds of VNet spokes – maintaining central routing control without manual configuration. Additionally, Azure Route Server introduced a preview of hub routing preference, giving admins the ability to influence BGP route selection (for example, preferring ExpressRoute over a VPN path, or vice versa). This fine-grained control means hybrid networks can be tuned for optimal performance and cost.

Resilience and reliability by design
Azure’s growth has been underpinned by making the network “resilient by default.” We shipped tools to help validate and improve network resiliency. ExpressRoute Resiliency Insights was released for general availability, delivering an intelligent assessment of an enterprise’s ExpressRoute setup. This feature evaluates how well your ExpressRoute circuits and gateways are architected for high availability (for example, using dual circuits in diverse locations, zone-redundant gateways, etc.) and assigns a resiliency index score as a percentage. It will highlight suboptimal configurations – such as routes advertised on only one circuit, or a gateway that isn’t zone-redundant – and provide recommendations for improvement. Moreover, Resiliency Insights includes a failover simulation tool that can test circuit redundancy by mimicking failures, so you can verify that your connections will survive real-world incidents. By proactively monitoring and testing resilience, Azure is helping customers achieve “always-on” connectivity even in the face of fiber cuts, hardware faults, or other disruptions.

Security, governance, and trust in the network
As enterprises entrust more core business to Azure, the platform’s networking services advanced on security and governance – helping customers achieve Zero Trust networks and high compliance with minimal complexity. Azure DNS now offers DNS Security Policies with Threat Intelligence feeds (GA). This capability allows organizations to protect their DNS queries from known malicious domains by leveraging continuously updated threat intel. For example, if a known phishing domain or C2 (command-and-control) hostname appears in DNS queries from your environment, Azure DNS can automatically block or redirect those requests. Because DNS is often the first line of detection for malware and phishing activities, this built-in filtering provides a powerful layer of defense that’s fully managed by Azure. It’s essentially a cloud-delivered DNS firewall using Microsoft’s vast threat intelligence – enabling all Azure customers to benefit from enterprise-grade security without deploying additional appliances.

Network traffic governance was another focus. The introduction of forced tunneling in Azure Virtual WAN hubs (preview), described above, is a prime example of where networking meets security compliance.

Optimizing cloud-native and edge networks
We previewed DNS intelligent traffic control features – such as filtering DNS queries to prevent data exfiltration and applying flexible recursion policies – which complement the DNS Security offering in safeguarding name resolution.
Meanwhile, for load balancing across regions, Azure Traffic Manager’s behind-the-scenes upgrades (as noted earlier) improved reliability, and it is evolving to integrate with modern container-based apps and edge scenarios.

AI-powered networking: Both enabling and enabled by AI
We are infusing AI into networking to make management and troubleshooting more intelligent. Networking functionality in Azure Copilot accelerates tasks like never before: it outlines best practices instantly, and troubleshooting that once required combing through docs and logs can now be conversational. It effectively democratizes networking expertise, helping even smaller IT teams manage sophisticated networks by leveraging AI recommendations.

The future of cloud networking in an AI world
As we close out 2025, one message is clear: networking is strategic. The network is no longer a static utility – it is the adaptive circulatory system of the cloud, determining how far and fast customers can go. By delivering higher speeds, greater reliability, tighter security, and easier management, Azure Networking has empowered businesses to connect everything to anything, anywhere – securely and at scale. These advances unlock new scenarios: global supply chains running in real time over a trusted network, multi-player AR/VR and gaming experiences delivered without lag, and AI models trained across continents. Looking ahead, AI-powered networking will become the norm. The convergence of AI and network tech means we will see more self-optimizing networks that can heal, defend, and tune themselves with minimal human intervention.
Network Detection and Response (NDR) in Financial Services

New PCI DSS v4.0.1 requirements heighten the need for automated monitoring and analysis of security logs. Network Detection and Response (NDR) solutions fulfill these mandates by providing 24/7 network traffic inspection and real-time alerting on suspicious activities. Azure’s native tools (Azure vTAP for full packets, VNet flow logs for all flows) capture rich network data and integrate with advanced NDR analytics from partners. This combination detects intrusions (satisfying IDS requirements under Requirement 11), validates network segmentation (for scope reduction under Requirement 1), and feeds alerts into Microsoft Sentinel for rapid response (fulfilling incident response obligations in Requirement 12). The result is a cloud architecture that not only meets PCI DSS controls but actively strengthens security.
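The flow-data side of this architecture is enabled through Network Watcher flow logs. The sketch below is illustrative only: resource names are placeholders, and the `--vnet` target is an assumption that your Azure CLI version already supports virtual network flow logs; check `az network watcher flow-log create --help` for the exact parameters available to you.

```bash
# Illustrative only: placeholder names; --vnet assumes current VNet flow log support in the CLI.
az network watcher flow-log create \
  --location westeurope \
  --resource-group rg-network-monitoring \
  --name fl-hub-vnet \
  --vnet vnet-hub \
  --storage-account sthubflowlogs \
  --traffic-analytics true \
  --workspace law-sentinel
```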
Introducing eBPF Host Routing: High performance AI networking with Azure CNI powered by Cilium

AI-driven applications demand low-latency workloads for an optimal user experience. To meet this need, services are moving to containerized environments, with Kubernetes as the standard. Kubernetes networking relies on the Container Network Interface (CNI) for pod connectivity and routing. Traditional CNI implementations use iptables for packet processing, adding latency and reducing throughput. Azure CNI powered by Cilium natively integrates the Azure Kubernetes Service (AKS) data plane with Azure CNI networking modes for superior performance, hardware offload support, and enterprise-grade reliability. Azure CNI powered by Cilium delivers up to 30% higher throughput in both benchmark and real-world customer tests compared to a bring-your-own Cilium setup on AKS.

The next leap forward: AKS data plane performance can now be optimized even further with eBPF host routing, an open-source Cilium CNI capability that accelerates packet forwarding by executing routing logic directly in eBPF. As shown in the figure, this architecture eliminates reliance on iptables and connection tracking (conntrack) within the host network namespace, significantly improving packet-processing efficiency, reducing CPU overhead, and optimizing performance for modern workloads.

Figure: Comparison of host routing using the Linux kernel stack vs eBPF

Azure CNI powered by Cilium is battle-tested for mission-critical workloads, backed by Microsoft support, and enriched with Advanced Container Networking Services features for security, observability, and accelerated performance. eBPF host routing is now included as part of the Advanced Container Networking Services suite, delivering network performance acceleration. In this blog, we highlight the performance benefits of eBPF host routing, explain how to enable it in an AKS cluster, and provide a deep dive into its implementation on Azure. We start by examining AKS cluster performance before and after enabling eBPF host routing.

Performance comparison
Our comparative benchmarks measure the difference in Azure CNI powered by Cilium when eBPF host routing is enabled. For these measurements, we use AKS clusters on Kubernetes version 1.33, with 16-core host nodes running Ubuntu 24.04. We are interested in throughput and latency numbers for pod-to-pod traffic in these clusters. For throughput measurements, we deploy netperf client and server pods and measure TCP_STREAM throughput at varying message sizes in tests running 20 seconds each. The wide range of message sizes is meant to capture the variety of workloads running on AKS clusters, ranging from AI training and inference to messaging systems and media streaming. For latency, we run TCP_RR tests, measuring latency at various percentiles, as well as transaction rates.

The following figure compares pods on the same node; eBPF-based routing results in a dramatic improvement in throughput (~30%). This is because, on the same node, the throughput is not constrained by factors such as the VM NIC limits and is almost entirely determined by host routing performance. For pod-to-pod throughput across different nodes in the cluster, eBPF host routing also results in better throughput, and the difference is more pronounced with smaller message sizes (3× more). This is because, with smaller messages, the per-message overhead incurred in the host network stack has a bigger impact on performance.

Next, we compare latency for pod-to-pod traffic.
We limit this benchmark to intra-node traffic, because cross-node traffic latency is determined by factors other than the routing latency incurred in the hosts. eBPF host routing results in reduced latency compared to the non-accelerated configuration at all measured percentiles. We have also measured the transaction rate between client and server pods, with and without eBPF host routing. This benchmark is an alternative measurement of latency, because a transaction is essentially a small TCP request/response pair. We observe that eBPF host routing improves transactions per second by around 27% compared to legacy host routing.

Transactions/second (same node)
| Azure CNI configuration | Transactions/second |
|---|---|
| eBPF host routing | 20396.9 |
| Traditional host routing | 16003.7 |

Enabling eBPF routing through Advanced Container Networking Services
eBPF host routing is disabled by default in Advanced Container Networking Services because bypassing iptables in the host network namespace can ignore custom user rules and host-level security policies. This may lead to visible failures such as dropped traffic or broken network policies, as well as silent issues like unintended access or missed audit logs. To mitigate these risks, eBPF host routing is offered as an opt-in feature, enabled through Advanced Container Networking Services on Azure CNI powered by Cilium.

The Advanced Container Networking Services advantage: Built-in safeguards
Enabling eBPF host routing in ACNS enhances the open-source offering with strong built-in safeguards. Before activation, ACNS validates existing iptables rules in the host network namespace and blocks enablement if user-defined rules are detected. Once enabled, kernel-level protections prevent new iptables rules and generate Kubernetes events for visibility. These measures allow customers to benefit from eBPF’s performance gains while maintaining security and reliability.

How to enable eBPF Host Routing with ACNS
Visit the documentation on how to enable eBPF Host Routing for new and existing Azure CNI Powered by Cilium clusters. Verify the network profile with the new performance `accelerationMode` field set to `BpfVeth` (a hedged CLI check appears after the resources list below):

```
"networkProfile": {
  "advancedNetworking": {
    "enabled": true,
    "performance": {
      "accelerationMode": "BpfVeth"
    },
    …
```

For more information on Advanced Container Networking Services and ACNS Performance, please visit https://aka.ms/acnsperformance.

Resources
For more info about Advanced Container Networking Services, please visit Container Network Security with Advanced Container Networking Services (ACNS) - Azure Kubernetes Service | Microsoft Learn.
For more info about Azure CNI Powered by Cilium, please visit Configure Azure CNI Powered by Cilium in Azure Kubernetes Service (AKS) - Azure Kubernetes Service | Microsoft Learn.
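As a quick post-enablement check, the cluster's network profile can be inspected from the CLI. This is a sketch only: the resource group and cluster names are placeholders, and the JMESPath query simply mirrors the networkProfile fields shown above.

```bash
# Placeholders: substitute your resource group and cluster name.
az aks show \
  --resource-group <resource-group> \
  --name <cluster-name> \
  --query "networkProfile.advancedNetworking" \
  --output json
# Expect "performance.accelerationMode" to report "BpfVeth" once eBPF host routing is on.
```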
How Azure network security can help you meet NIS2 compliance

With the adoption of the NIS2 Directive (EU) 2022/2555, cybersecurity obligations for both public and private sector organizations have become stricter and more far-reaching. NIS2 aims to establish a higher common level of cybersecurity across the European Union by enforcing stronger requirements on risk management, incident reporting, supply chain protection, and governance. If your organization runs on Microsoft Azure, you already have powerful services to support your NIS2 journey. In particular, Azure network security products such as Azure Firewall, Azure Web Application Firewall (WAF), and Azure DDoS Protection provide foundational controls. The key is to configure and operate them in a way that aligns with the directive’s expectations.

Important note
This article is a technical guide based on the NIS2 Directive (EU) 2022/2555 and Microsoft product documentation. It is not legal advice. For formal interpretations, consult your legal or regulatory experts.

What is NIS2?
NIS2 replaces the original NIS Directive (2016) and entered into force on 16 January 2023. Member states must transpose it into national law by 17 October 2024. Its goals are to:
Expand the scope of covered entities (essential and important entities)
Harmonize cybersecurity standards across member states
Introduce stricter supervisory and enforcement measures
Strengthen supply chain security and reporting obligations
Key provisions include:
Article 20 – management responsibility and governance
Article 21 – cybersecurity risk management measures
Article 23 – incident notification obligations
These articles require organizations to implement technical, operational, and organizational measures to manage risks, respond to incidents, and ensure leadership accountability.

Where Azure network security fits
The table below maps common NIS2 focus areas to Azure network security capabilities and how they support compliance outcomes.

| NIS2 focus area | Azure services and capabilities | How this supports compliance |
|---|---|---|
| Incident handling and detection | Azure Firewall Premium IDPS and TLS inspection, Threat Intelligence mode; Azure WAF managed rule sets and custom rules; Azure DDoS Protection; Azure Bastion diagnostic logs | Detect, block, and log threats across layers three to seven. Provide telemetry for triage and enable response workflows that are auditable. |
| Business continuity and resilience | Azure Firewall availability zones and autoscale; Azure Front Door or Application Gateway WAF with zone-redundant deployments; Azure Monitor with Log Analytics; Traffic Manager or Front Door for failover | Improve service availability and provide data for resilience reviews and disaster recovery scenarios. |
| Access control and segmentation | Azure Firewall policy with DNAT, network, and application rules; NSGs and ASGs; Azure Bastion for browser-based RDP/SSH without public IPs; Private Link | Enforce segmentation and isolation of critical assets. Support Zero Trust and least privilege for inbound and egress. |
| Vulnerability and misconfiguration defense | Azure WAF Microsoft managed rule set based on OWASP CRS; Azure Firewall Premium IDPS signatures | Reduce exposure to common web exploits and misconfigurations for public-facing apps and APIs. |
| Encryption and secure communications | TLS policy: Application Gateway SSL policy, Front Door TLS policy, App Service/PaaS minimum TLS. Inspection: Azure Firewall Premium TLS inspection | Inspect and enforce encrypted communication policies and block traffic that violates TLS requirements. Inspect decrypted traffic for threats. |
| Incident reporting and evidence | Azure network security diagnostics, Log Analytics, Microsoft Sentinel incidents, workbooks, and playbooks | Capture and retain telemetry. Correlate events, create incident timelines, and export reports to meet regulator timelines. |

NIS2 articles in practice
Article 21: cybersecurity risk management measures
Azure network controls contribute to several required measures:
Prevention and detection. Azure Firewall blocks unauthorized access and inspects traffic with IDPS. Azure DDoS Protection mitigates volumetric and protocol attacks. Azure WAF prevents common web exploits based on OWASP guidance.
Logging and monitoring. Azure Firewall, WAF, DDoS, and Bastion resources produce detailed resource logs and metrics in Azure Monitor. Ingest these into Microsoft Sentinel for correlation, analytics rules, and automation.
Control of encrypted communications. Azure Firewall Premium provides TLS inspection to reveal malicious payloads inside encrypted sessions.
Supply chain and service provider management. Use Azure Policy and Defender for Cloud to continuously assess configuration and require approved network security baselines across subscriptions and landing zones.

Article 23: incident notification
Build an evidence-friendly workflow with Sentinel:
Early warning within twenty-four hours. Use Sentinel analytics rules on Firewall, WAF, DDoS, and Bastion logs to generate incidents and trigger playbooks that assemble an initial advisory.
Incident notification within seventy-two hours. Enrich the incident with additional context such as mitigation actions from DDoS, Firewall, and WAF.
Final report within one month. Produce a summary that includes root cause, impact, and corrective actions. Use Workbooks to export charts and tables that back up your narrative.

Article 20: governance and accountability
Management accountability. Track policy compliance with Azure Policy initiatives for Firewall, DDoS, and WAF. Use exemptions rarely and record justification.
Centralized visibility. Defender for Cloud’s network security posture views and recommendations give executives and owners a quick view of exposure and misconfigurations.
Change control and drift prevention. Manage Firewall, WAF, and DDoS through Network Security Hub and Infrastructure as Code with Bicep or Terraform. Require pull requests and approvals to enforce four eyes on changes.

Network security baseline
Use this blueprint as a starting point. Adapt it to your landing zone architecture and regulator guidance.

Topology and control plane
Hub-and-spoke architecture with a centralized Azure Firewall Premium in the hub. Enable availability zones.
Deploy Azure Bastion Premium in the hub or a dedicated management VNet; peer to spokes. Remove public IPs from management NICs and disable public RDP/SSH on VMs.
Use Network Security Hub for at-scale management. Require Infrastructure as Code for all network security resources.

Web application protection
Protect public apps with Azure Front Door Premium WAF where edge inspection is required. Use Application Gateway WAF v2 for regional scenarios.
Enable the Microsoft managed rule set on the latest version. Add custom rules for geo-based allow or deny and bot management. Enable rate limiting when appropriate.

DDoS strategy
Enable DDoS Network Protection on virtual networks that contain internet-facing resources. Use IP Protection for single public IP scenarios.
Configure DDoS diagnostics and alerts and stream them to Sentinel (a hedged CLI sketch follows this section). Define runbooks for escalation and service team engagement.
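To illustrate the diagnostics streaming called out above, the following Azure CLI sketch creates a diagnostic setting on a protected resource and routes its logs to the Log Analytics workspace that Sentinel reads from. The resource and workspace IDs are placeholders, and the `allLogs` category group is an assumption; some resource types still require explicit log categories, so check what your resource actually exposes.

```bash
# Placeholders: substitute real resource and workspace IDs.
RESOURCE_ID="<resource-id-of-firewall-waf-policy-or-protected-public-ip>"
WORKSPACE_ID="<log-analytics-workspace-resource-id>"

az monitor diagnostic-settings create \
  --name "stream-to-sentinel-workspace" \
  --resource "$RESOURCE_ID" \
  --workspace "$WORKSPACE_ID" \
  --logs '[{"categoryGroup": "allLogs", "enabled": true}]' \
  --metrics '[{"category": "AllMetrics", "enabled": true}]'
```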
Firewall policy
Enable IDPS in alert mode, then in alert-and-deny mode for high-confidence signatures. Enable TLS inspection for outbound and inbound where supported.
Enforce FQDN and URL filtering for egress. Require explicit allow lists for critical segments.
Deny inbound RDP/SSH from the internet. Allow management traffic only from Bastion subnets or approved management jump segments.

Logging, retention, and access
Turn on diagnostic settings for Firewall, WAF, DDoS, and Application Gateway or Front Door. Send to Log Analytics and an archive storage account for long-term retention.
Set retention per national law and internal policy. Azure Monitor Log Analytics supports table-level retention and archive for up to 12 years; many teams keep a shorter interactive window and a multi-year archive for audits.
Restrict access with Azure RBAC and customer-managed keys where applicable.

Automation and playbooks
Build Sentinel playbooks for regulator notifications, ticket creation, and evidence collection. Maintain dry-run versions for exercises.
Add analytics for Bastion session starts to sensitive VMs, excessive failed connection attempts, and out-of-hours access.

Conclusion
Azure network security services provide the technical controls most organizations need in order to align with NIS2. When combined with policy enforcement, centralized logging, and automated detection and response, they create a defensible and auditable posture. Focus on layered protection, secure connectivity, and real-time response so that you can reduce exposure to evolving threats, accelerate incident response, and meet NIS2 obligations with confidence.

References
NIS2 primary source: Directive (EU) 2022/2555 (NIS2). https://eur-lex.europa.eu/eli/dir/2022/2555/oj/eng
Azure Firewall Premium features (TLS inspection, IDPS, URL filtering). https://learn.microsoft.com/en-us/azure/firewall/premium-features
Deploy & configure Azure Firewall Premium. https://learn.microsoft.com/en-us/azure/firewall/premium-deploy
IDPS signature categories reference. https://learn.microsoft.com/en-us/azure/firewall/idps-signature-categories
Monitoring & diagnostic logs reference. https://learn.microsoft.com/en-us/azure/firewall/monitor-firewall-reference
Web Application Firewall (WAF) on Azure Front Door overview & features. https://learn.microsoft.com/en-us/azure/frontdoor/web-application-firewall
WAF on Application Gateway overview. https://learn.microsoft.com/en-us/azure/web-application-firewall/overview
Examine WAF logs with Log Analytics. https://learn.microsoft.com/en-us/azure/application-gateway/log-analytics
Rate limiting with Front Door WAF. https://learn.microsoft.com/en-us/azure/web-application-firewall/afds/waf-front-door-rate-limit
Azure DDoS Protection service overview & SKUs (Network Protection, IP Protection). https://learn.microsoft.com/en-us/azure/ddos-protection/ddos-protection-overview
Quickstart: Enable DDoS IP Protection. https://learn.microsoft.com/en-us/azure/ddos-protection/manage-ddos-ip-protection-portal
View DDoS diagnostic logs (Notifications, Mitigation Reports/Flows). https://learn.microsoft.com/en-us/azure/ddos-protection/ddos-view-diagnostic-logs

Azure Bastion
Azure Bastion overview and SKUs. https://learn.microsoft.com/en-us/azure/bastion/bastion-overview
Deploy and configure Azure Bastion. https://learn.microsoft.com/en-us/azure/bastion/tutorial-create-host-portal
Disable public RDP and SSH on Azure VMs. https://learn.microsoft.com/en-us/azure/virtual-machines/security-baseline
Azure Bastion diagnostic logs and metrics. https://learn.microsoft.com/en-us/azure/bastion/bastion-diagnostic-logs

Microsoft Sentinel
Sentinel documentation (onboard, analytics, automation). https://learn.microsoft.com/en-us/azure/sentinel/
Azure Firewall solution for Microsoft Sentinel. https://learn.microsoft.com/en-us/azure/firewall/firewall-sentinel-overview
Use Microsoft Sentinel with Azure WAF. https://learn.microsoft.com/en-us/azure/web-application-firewall/waf-sentinel

Architecture & routing
Hub-spoke network topology (reference). https://learn.microsoft.com/en-us/azure/architecture/networking/architecture/hub-spoke
Azure Firewall Manager & secured virtual hub. https://learn.microsoft.com/en-us/azure/firewall-manager/secured-virtual-hub
Simplify container network metrics filtering in Azure Container Networking Services for AKS

We’re excited to introduce container network metrics filtering in Azure Container Networking Services for Azure Kubernetes Service (AKS), now in public preview! This capability transforms how you manage network observability in Kubernetes clusters by giving you control over which metrics matter most.

Why excessive metrics are a problem (and how we’re fixing it)
In today’s large-scale, microservices-driven environments, teams often face metrics bloat, collecting far more data than they need. The result?
High storage & ingestion costs: Paying for data you’ll never use.
Cluttered dashboards: Hunting for critical latency spikes in a sea of irrelevant pod restarts.
Operational overhead: Slower queries, higher maintenance, and fatigue.
Our new filtering capability solves this by letting you define precise filters at the pod level using standard Kubernetes custom resources. You collect only what matters, before it ever reaches your monitoring stack.

Key benefits: Signal over noise
| Benefit | Your gain |
|---|---|
| Fine-grained control | Filter by namespace or pod label. Target critical services and ignore noise. |
| Cost optimization | Reduce ingestion costs for Prometheus, Grafana, and other tools. |
| Improved observability | Cleaner dashboards and faster troubleshooting with relevant metrics only. |
| Dynamic & zero-downtime | Apply or update filters without restarting Cilium agents or Prometheus. |

How it works: Filtering at the source
Unlike traditional sampling or post-processing, filtering happens at the Cilium agent level—inside the kernel’s data plane. You define filters using the ContainerNetworkMetric custom resource to include or exclude metrics such as:
DNS lookups
TCP connection metrics
Flow metrics
Drop (error) metrics
This reduces data volume before metrics leave the host, ensuring your observability tools receive only curated, high-value data.

Example: Filtering flow metrics to reduce noise
Here’s a sample ContainerNetworkMetric CRD that keeps only dropped flows from the traffic/http pods and excludes flows from traffic/fortio pods:

```yaml
apiVersion: acn.azure.com/v1alpha1
kind: ContainerNetworkMetric
metadata:
  name: container-network-metric
spec:
  filters:
    - metric: flow
      includeFilters:
        # Include only DROPPED flows from traffic namespace
        verdict:
          - "dropped"
        from:
          namespacedPod:
            - "traffic/http"
      excludeFilters:
        # Exclude traffic/fortio flows to reduce noise
        from:
          namespacedPod:
            - "traffic/fortio"
```

Before filtering:
After applying filters:

Getting started today
Ready to simplify your network observability?
Enable Advanced Container Networking Services: Make sure Advanced Container Networking Services is enabled on your AKS cluster.
Define your filter: Apply the ContainerNetworkMetric CRD with your include/exclude rules.
Validate: Check your settings via ConfigMap and Cilium agent logs (a hedged kubectl sketch follows).
See the impact: Watch ingestion costs drop and dashboards become clearer!
👉 Learn more in the Metrics Filtering Guide. Try the public preview today and take control of your container network metrics.
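A minimal apply-and-verify flow might look like the following. The CRD resource/plural name and the Cilium pod label are assumptions about the cluster; adjust them to what `kubectl api-resources` and your ACNS-managed pods actually report.

```bash
# Save the CRD above as container-network-metric.yaml, then apply it.
kubectl apply -f container-network-metric.yaml

# Confirm the custom resource type exists and the object was created.
# (The exact resource/plural name is an assumption; verify with kubectl api-resources.)
kubectl api-resources | grep -i containernetworkmetric
kubectl get containernetworkmetrics.acn.azure.com

# Spot-check that the Cilium agents picked up the filter (the label selector is an assumption).
kubectl -n kube-system logs -l k8s-app=cilium --tail=50
```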
Announcing Azure DNS security policy with Threat Intelligence feed general availability

Azure DNS security policy with Threat Intelligence feed enables early detection and prevention of security incidents on customer virtual networks: known malicious domains, sourced by Microsoft’s Security Response Center (MSRC), can be blocked from name resolution. Azure DNS security policy with Threat Intelligence feed is being announced for all customers and will have regional availability in all public regions.