Azure Networking
Azure virtual network terminal access point (TAP) public preview announcement
What is virtual network TAP?
Virtual network TAP allows customers to continuously stream virtual machine network traffic to a network packet collector or analytics tool. Many security and performance monitoring tools rely on packet-level insights that are difficult to access in cloud environments. Virtual network TAP bridges this gap by integrating with our industry partners to offer:
Enhanced security and threat detection: Security teams can inspect full packet data in real time to detect and respond to potential threats.
Performance monitoring and troubleshooting: Operations teams can analyze live traffic patterns to identify bottlenecks, troubleshoot latency issues, and optimize application performance.
Regulatory compliance: Organizations subject to compliance frameworks such as the Health Insurance Portability and Accountability Act (HIPAA) and the General Data Protection Regulation (GDPR) can use virtual network TAP to capture network activity for auditing and forensic investigations.
Why use virtual network TAP?
Unlike traditional packet capture solutions that require deploying additional agents or network appliances, virtual network TAP leverages Azure's native infrastructure to enable seamless traffic mirroring without complex configurations and without impacting the performance of the virtual machine. A key advantage is that mirrored traffic does not count towards the virtual machine's network limits, ensuring complete visibility without compromising application performance. Additionally, virtual network TAP supports all Azure virtual machine SKUs.
Deploying virtual network TAP
The portal is a convenient way to get started with Azure virtual network TAP. However, if you have many Azure resources and want to automate the setup, you may want to use PowerShell, the Azure CLI, or the REST API. Add a TAP configuration on a network interface that is attached to a virtual machine deployed in your virtual network. The destination is a virtual network IP address in the same virtual network as the monitored network interface or a peered virtual network. The collector solution for virtual network TAP can be deployed behind an Azure internal load balancer for high availability. You can use the same virtual network TAP resource to aggregate traffic from multiple network interfaces in the same or different subscriptions. If the monitored network interfaces are in different subscriptions, the subscriptions must be associated to the same Microsoft Entra tenant. Additionally, the monitored network interfaces and the destination endpoint for aggregating the TAP traffic can be in peered virtual networks in the same region.
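For automation, the following Azure CLI sketch illustrates the two deployment steps described above: creating the TAP resource that points at your collector, and adding a TAP configuration to a monitored network interface. Treat it as a sketch rather than a definitive recipe: it assumes the `virtual-network-tap` CLI extension and the command names shown here (check `az network vnet tap --help` and the documentation for the current syntax), and all resource names and IDs are placeholders.

```bash
# Assumption: the virtual-network-tap CLI extension provides these commands.
az extension add --name virtual-network-tap

# Create the virtual network TAP resource. The destination is the IP configuration
# of the collector NIC (or an internal load balancer frontend) in the same or a peered VNet.
az network vnet tap create \
  --resource-group myResourceGroup \
  --name myVNetTap \
  --destination "/subscriptions/<sub-id>/resourceGroups/myResourceGroup/providers/Microsoft.Network/networkInterfaces/collectorNic/ipConfigurations/ipconfig1"

# Add a TAP configuration to the network interface of the VM you want to monitor.
az network nic vtap-config create \
  --resource-group myResourceGroup \
  --nic-name monitoredVmNic \
  --name myTapConfig \
  --vnet-tap myVNetTap
```

Repeat the second command for every network interface whose traffic you want to aggregate into the same TAP resource.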
Partnering with industry leaders to enhance network monitoring in Azure
To maximize the value of virtual network TAP, we are proud to collaborate with industry-leading security and network visibility partners. Our partners provide deep packet inspection, analytics, threat detection, and monitoring solutions that seamlessly integrate with virtual network TAP:
Network packet brokers:
Gigamon: GigaVUE Cloud Suite for Azure
Keysight: CloudLens
Security analytics, network/application performance management:
Darktrace: Darktrace /NETWORK
Netscout: Omnis Cyber Intelligence NDR
Corelight: Corelight Open NDR Platform
LinkShadow: LinkShadow NDR
Fortinet: FortiNDR Cloud, FortiGate VM
cPacket: cPacket Cloud Suite
TrendMicro: Trend Vision One™ Network Security
Extrahop: RevealX
Bitdefender: GravityZone Extended Detection and Response for Network
eSentire: eSentire MDR
Vectra: Vectra NDR
AttackFence: AttackFence NDR
Arista Networks: Arista NDR
See our partner blogs:
Bitdefender + Microsoft Virtual Network TAP: Deepening Visibility, Strengthening Security
Streamline Traffic Mirroring in the Cloud with Azure Virtual Network Terminal Access Point (TAP) and Keysight Visibility | Keysight Blogs
eSentire | Unlocking New Possibilities for Network Monitoring and…
LinkShadow Unified Identity, Data, and Network Platform Integrated with Microsoft Virtual Network TAP
Extrahop and Microsoft Extend Coverage for Azure Workloads
Resources | Announcing cPacket Partnership with Azure virtual network terminal access point (TAP)
Gain Network Traffic Visibility with FortiGate and Azure virtual network TAP
Get started with virtual network TAP
To learn more and get started, visit our website. We look forward to seeing how you leverage virtual network TAP to enhance security, performance, and compliance in your cloud environment. Stay tuned for more updates as we continue to refine and expand our feature set! If you have any questions, please reach out to us at azurevnettap@microsoft.com.

Azure Incident Retrospective - Please register! Session 2 - Tracking ID: 5GP8-W0G
Join our upcoming live webcast for a transparent discussion about this recent Azure service incident — led by our engineering teams.
Control plane issues in East US
Tracking ID: 5GP8-W0G | Impacted: 24-25 April 2026
What to expect
📚 Understand: What happened, how we responded, and what we learned
💬 Ask: Live Q&A with our engineering experts throughout the session
🛠 Learn: The fixes we've put in place and guidance for workload resiliency
Choose your session (the same content is presented at both times; pick the one that works best for your timezone):
Session 1: 17:30 UTC, Thursday, 14 May 2026 — Register now → (9:30 AM US Pacific (PDT), 12:30 PM US Eastern (EDT), 5:30 PM London (BST), 1:30 AM +1 Beijing (CST), 4:30 AM +1 Sydney (AEDT), 6:30 AM +1 Auckland (NZDT))
Session 2: 05:30 UTC, Friday, 15 May 2026 — Register now → (9:30 PM -1 US Pacific (PDT), 12:30 AM US Eastern (EDT), 5:30 AM London (BST), 1:30 PM Beijing (CST), 4:30 PM Sydney (AEDT), 6:30 PM Auckland (NZDT))
Our engineering leaders
Deepak Bansal, Corporate Vice President, Technical Fellow, Azure Networking, Cloud+AI Engineering (LinkedIn ↗)
Qi Zhang, Partner Software Engineering Manager, Azure Networking, Cloud+AI Engineering (LinkedIn ↗)
⚠️ Prepare before the livestream
Read the Post Incident Review (PIR) ahead of time so you can ask any follow-up questions during the live Q&A.
Helpful resources
🔔 Azure Service Health Alerts: Get alerts for relevant incidents by setting up notifications via email, SMS, or webhook
🎥 Past Retrospective Recordings: Watch recordings of previous retrospective livestreams
📄 Azure Post Incident Reviews: Learn more about PIRs and the retrospective program

Azure Incident Retrospective - Please register! Session 1 - Tracking ID: 5GP8-W0G
Join our upcoming live webcast for a transparent discussion about this recent Azure service incident — led by our engineering teams.
Control plane issues in East US
Tracking ID: 5GP8-W0G | Impacted: 24-25 April 2026
What to expect
📚 Understand: What happened, how we responded, and what we learned
💬 Ask: Live Q&A with our engineering experts throughout the session
🛠 Learn: The fixes we've put in place and guidance for workload resiliency
Choose your session (the same content is presented at both times; pick the one that works best for your timezone):
Session 1: 17:30 UTC, Thursday, 14 May 2026 — Register now → (9:30 AM US Pacific (PDT), 12:30 PM US Eastern (EDT), 5:30 PM London (BST), 1:30 AM +1 Beijing (CST), 4:30 AM +1 Sydney (AEDT), 6:30 AM +1 Auckland (NZDT))
Session 2: 05:30 UTC, Friday, 15 May 2026 — Register now → (9:30 PM -1 US Pacific (PDT), 12:30 AM US Eastern (EDT), 5:30 AM London (BST), 1:30 PM Beijing (CST), 4:30 PM Sydney (AEDT), 6:30 PM Auckland (NZDT))
Our engineering leaders
Deepak Bansal, Corporate Vice President, Technical Fellow, Azure Networking, Cloud+AI Engineering (LinkedIn ↗)
Qi Zhang, Partner Software Engineering Manager, Azure Networking, Cloud+AI Engineering (LinkedIn ↗)
⚠️ Prepare before the livestream
Read the Post Incident Review (PIR) ahead of time so you can ask any follow-up questions during the live Q&A.
Helpful resources
🔔 Azure Service Health Alerts: Get alerts for relevant incidents by setting up notifications via email, SMS, or webhook
🎥 Past Retrospective Recordings: Watch recordings of previous retrospective livestreams
📄 Azure Post Incident Reviews: Learn more about PIRs and the retrospective program

Azure Front Door: Implementing lessons learned following October outages
Abhishek Tiwari, Vice President of Engineering, Azure Networking
Amit Srivastava, Principal PM Manager, Azure Networking
Varun Chawla, Partner Director of Engineering
Introduction
Azure Front Door is Microsoft's advanced edge delivery platform, encompassing Content Delivery Network (CDN), global security, and traffic distribution in a single unified offering. By using Microsoft's extensive global edge network, Azure Front Door ensures efficient content delivery and advanced security through 210+ global and local points of presence (PoPs) strategically positioned close to both end users and applications. As the central global entry point from the internet onto customer applications, we power mission-critical customer applications as well as many of Microsoft's internal services.
We have a highly distributed, resilient architecture, which protects against failures at the server, rack, site, and even regional level. This resiliency is achieved through our intelligent traffic management layer, which monitors failures and load balances traffic at the server, rack, or edge-site level within the primary ring, supplemented by a secondary fallback ring which accepts traffic in case of primary traffic overflow or broad regional failures. We also deploy a traffic shield as a terminal safety net to ensure that in the event of a managed or unmanaged edge site going offline, end user traffic continues to flow to the next available edge site.
Like any large-scale CDN, we deploy each customer configuration across a globally distributed edge fleet, densely shared with thousands of other tenants. While this architecture enables global scale, it carries the risk that certain incompatible configurations, if not contained, can propagate broadly and quickly, which can result in a large blast radius of impact. Here we describe how the two recent service incidents impacting Azure Front Door have reinforced the need to accelerate ongoing investments in hardening our resiliency and tenant isolation strategy, to mitigate the likelihood and the scale of impact from this class of risk.
October incidents: recap and key learnings
Azure Front Door experienced two service incidents, on October 9th and October 29th, both with customer-impacting service degradation.
On October 9th: A manual cleanup of stuck tenant metadata bypassed our configuration protection layer, allowing incompatible metadata to propagate beyond our canary edge sites. This metadata was created on October 7th from a control-plane defect triggered by a customer configuration change. While the protection system initially blocked the propagation, the manual override operation bypassed our safeguards. This incompatible configuration reached the next stage and activated a latent data-plane defect in a subset of edge sites, causing availability impact primarily across Europe (~6%) and Africa (~16%). You can learn more about this issue in detail at https://aka.ms/AIR/QNBQ-5W8
On October 29th: A different sequence of configuration changes across two control-plane versions produced incompatible metadata. Because the failure mode in the data plane was asynchronous, the health check validations embedded in our protection systems all passed during the rollout. The incompatible customer configuration metadata successfully propagated globally through a staged rollout and also updated the "last known good" (LKG) snapshot. Following this global rollout, the asynchronous process in the data plane exposed another defect which caused crashes.
This impacted connectivity and DNS resolution for all applications onboarded to our platform. Extended recovery time amplified the impact on customer applications and Microsoft services. You can learn more about this issue in detail at https://aka.ms/AIR/YKYN-BWZ
We took away a number of clear and actionable lessons from these incidents, which are applicable not just to our service, but to any multi-tenant, high-density, globally distributed system.
Configuration resiliency – Valid configuration updates should propagate safely, consistently, and predictably across our global edge, while ensuring that incompatible or erroneous configurations never propagate beyond canary environments.
Data plane resiliency – Additionally, configuration processing in the data plane must not cause availability impact to any customer.
Tenant isolation – Traditional isolation techniques such as hardware partitioning and virtualization are impractical at edge sites. This requires innovative sharding techniques to ensure single-tenant-level isolation – a must-have to reduce potential blast radius.
Accelerated and automated recovery time objective (RTO) – The system should be able to automatically revert to the last known good configuration within an acceptable RTO. In the case of a service like Azure Front Door, we deem ~10 minutes to be a practical RTO for our hundreds of thousands of customers at every edge site.
Post outage, given the severity of impact which allowed an incompatible configuration to propagate globally, we made the difficult decision to temporarily block configuration changes in order to expedite the rollout of additional safeguards. Between October 29th and November 5th, we prioritized and deployed immediate hardening steps before reopening configuration changes. We are confident that the system is stable, and we are continuing to invest in additional safeguards to further strengthen the platform's resiliency.
Safe customer configuration deployment (goal: incompatible configuration never propagates beyond canary). Repairs: control plane and data plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state. Status: Completed.
Data plane resiliency (goal: configuration processing cannot impact data plane availability). Repairs: manage data-plane lifecycle to prevent outages caused by configuration-processing defects (Completed); isolated work process in every data plane server to process and load the configuration (January 2026).
100% Azure Front Door resiliency posture for Microsoft internal services (goal: Microsoft operates an isolated, independent active/active fleet with automatic failover for critical Azure services). Repairs: Phase 1 – onboarded the batch of critical services impacted in the October 29th outage, running on a day-old configuration (Completed); Phase 2 – automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services (March 2026).
Recovery improvements (goal: data plane crash recovery in under 10 minutes). Repairs: data plane boot-up time optimized via local cache (~1 hour) (Completed); accelerate recovery time to under 10 minutes (March 2026).
Tenant isolation (goal: no configuration or traffic regression can impact other tenants). Repairs: micro-cellular Azure Front Door with ingress layered shards (June 2026).
This blog is the first in a multi-part series on Azure Front Door resiliency. In this blog, we will focus on configuration resiliency—how we are making the configuration pipeline safer and more robust.
Subsequent blogs will cover tenant isolation and recovery improvements.
How our configuration propagation works
Azure Front Door configuration changes can be broadly classified into three distinct categories.
Service code & data – These include all aspects of the Azure Front Door service, such as the management plane, control plane, data plane, and configuration propagation system. Azure Front Door follows a safe deployment practice (SDP) process to roll out newer versions of the management, control, or data plane over a period of approximately 2-3 weeks. This ensures that any regression in software does not have a global impact. However, latent bugs that escape pre-validation and SDP rollout can remain undetected until a specific combination of customer traffic patterns or configuration changes triggers the issue.
Web Application Firewall (WAF) & L7 DDoS platform data – These datasets are used by Azure Front Door to deliver security and load-balancing capabilities. Examples include GeoIP data, malicious attack signatures, and IP reputation signatures. Updates to these datasets occur daily through multiple SDP stages with an extended bake time of over 12 hours to minimize the risk of global impact during rollout. This dataset is shared across all customers and the platform, and it is validated immediately since it does not depend on variations in customer traffic or configuration steps.
Customer configuration data – Examples are any customer configuration change—whether a routing rule update, backend pool modification, WAF rule change, or security policy change. Due to the nature of these changes, it is expected across the edge delivery / CDN industry to propagate them globally in 5-10 minutes. Both outages stemmed from issues within this category.
All configuration changes, including customer configuration data, are processed through a multi-stage pipeline designed to ensure correctness before global rollout across Azure Front Door's 200+ edge locations. At a high level, Azure Front Door's configuration propagation system has two distinct components:
Control plane – Accepts customer API/portal changes (create/update/delete for profiles, routes, WAF policies, origins, etc.) and translates them into internal configuration metadata which the data plane can understand.
Data plane – Globally distributed edge servers that terminate client traffic, apply routing/WAF logic, and proxy to origins using the configuration produced by the control plane.
Between these two halves sits a multi-stage configuration rollout pipeline with a dedicated protection system (known as ConfigShield):
Changes flow through multiple stages (pre-canary, canary, expanding waves to production) rather than going global at once.
Each stage is health-gated: the data plane must remain within strict error and latency thresholds before proceeding. Each stage's health check also rechecks the previous stage's health for any regressions.
A successfully completed rollout updates a last known good (LKG) snapshot used for automated rollback.
Historically, rollout targeted global completion in roughly 5–10 minutes, in line with industry standards.
Customer configuration processing in the Azure Front Door data plane stack
Customer configuration changes in Azure Front Door traverse multiple layers—from the control plane through the deployment system—before being converted into FlatBuffers at each Azure Front Door node. These FlatBuffers are then loaded by the Azure Front Door data plane stack, which runs as Kubernetes pods on every node.
FlatBuffer composition: Each FlatBuffer references several sub-resources such as WAF and Rules Engine schematic files, SSL certificate objects, and URL signing secrets.
Data plane architecture:
Master process: Accepts configuration changes (memory-mapped files with references) and manages the lifecycle of worker processes.
Workers: L7 proxy processes that serve customer traffic using the applied configuration.
Processing flow for each configuration update:
Load and apply in master: The transformed configuration is loaded and applied in the master process. Cleanup of unused references occurs synchronously except for certain categories → the October 9 outage occurred during this step, due to a crash triggered by incompatible metadata.
Apply to workers: Configuration is applied to all worker processes without memory overhead (FlatBuffers are memory-mapped).
Serve traffic: Workers start consuming new FlatBuffers for new requests; in-flight requests continue using old buffers. Old buffers are queued for cleanup post-completion.
Feedback to deployment service: Positive feedback signals readiness for rollout.
Cleanup: FlatBuffers are freed asynchronously by the master process after all workers load updates → the October 29 outage occurred during this step, due to a latent bug in reference counting logic.
The October incidents showed we needed to strengthen key aspects of configuration validation, propagation safeguards, and runtime behavior. During the Azure Front Door incident on October 9th, that protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During the Azure Front Door incident on October 29th, the incompatible customer configuration metadata progressed through the protection system before the delayed asynchronous processing task resulted in the crash.
Configuration propagation safeguards
Based on learnings from the incidents, we are implementing a comprehensive set of configuration resiliency improvements. These changes aim to guarantee that no sequence of configuration changes can trigger instability in the data plane, and to ensure quicker recovery in the event of anomalies.
Strengthening configuration generation safety
This improvement pivots on a 'shift-left' strategy where we want to catch regressions early, before they propagate to production. It also includes fixing the latent defects which were the proximate cause of the outages.
Fixing outage-specific defects – We have fixed the control-plane defects that could generate incompatible tenant metadata under specific operation sequences. We have also remediated the associated data-plane defects.
Stronger cross-version validation – We are expanding our test and validation suite to account for changes across multiple control plane build versions. This is expected to be fully completed by February 2026.
Fuzz testing – Automated fuzzing and testing of the metadata generation contract between the control plane and the data plane. This allows us to generate an expanded set of invalid/unexpected configuration combinations which might not be achievable by traditional test cases alone. This is expected to be fully completed by February 2026.
Preventing incompatible configurations from being propagated
This segment of the resiliency strategy strives to ensure that a potentially dangerous configuration change never propagates beyond the canary stage.
Protection system is "always-on" – Enhancements to operational procedures and tooling prevent bypass in all scenarios (including internal cleanup/maintenance), and any cleanup must flow through the same guarded stages and health checks as standard configuration changes. This is completed.
Making rollout behavior more predictable and conservative – Configuration processing in the data plane is now fully synchronous. Every data plane issue due to incompatible metadata can be detected within 10 seconds at every stage. This is completed.
Enhancement to the deployment pipeline – Additional stages during rollout and extended bake time between stages serve as an additional safeguard during configuration propagation. This is completed.
Recovery tool improvements now make it easier to revert to any previous version of the LKG with a single click. This is completed.
These changes significantly improve system safety. Post-outage, we have increased the configuration propagation time to approximately 45 minutes. We are working towards reducing configuration propagation time closer to pre-incident levels once the additional safeguards covered in the data plane resiliency section below are completed by mid-January 2026.
Data plane resiliency
Data plane recovery was the toughest part of the recovery efforts during the October incidents. We must ensure fast recovery as well as resilience to configuration-processing issues for the data plane. To address this, we implemented changes that decouple the data plane from incompatible configuration changes. With these enhancements, the data plane continues operating on the last known good configuration—even if the configuration pipeline safeguards fail to protect as intended.
Decoupling the data plane from configuration changes
Each server's data plane consists of a master process, which accepts configuration changes and manages the lifecycle of multiple worker processes which serve customer traffic. One of the critical reasons for the prolonged outage in October was that, due to latent defects in the data plane, the master process crashed when presented with a bad configuration. The master is a critical command-and-control process, and when it crashes it takes down the entire data plane on that node. Recovery of the master process involves reloading hundreds of thousands of configurations from scratch and took approximately 4.5 hours.
We have since made changes to the system to ensure that even in the event of a master process crash for any reason—including incompatible configuration data being presented—the workers remain healthy and able to serve traffic. During such an event, the workers would not be able to accept new configuration changes but will continue to serve customer traffic using the last known good configuration. This work is completed.
Introducing Food Taster: strengthening config propagation resiliency
In our efforts to further strengthen Azure Front Door's configuration propagation system, we are introducing an additional configuration safeguard known internally as Food Taster, which protects the master and worker processes from any configuration-change-related incidents, thereby ensuring data plane resiliency. The principle is simple: every data-plane server will have a redundant and isolated process – the Food Taster – whose only job is to ingest and process new configuration metadata first and then pass validated configuration changes to the active data plane. This redundant worker does not accept any customer traffic.
All configuration processing in the Food Taster is fully synchronous. That means we do all parsing, validation, and any expensive or risky work up front, and we do not move on until the Food Taster has either proven the configuration is safe or rejected it. Only when the Food Taster successfully loads the configuration and returns "Config OK" does the master process proceed to load the same configuration and then instruct the worker processes to do the same. If anything goes wrong in the Food Taster, the failure is contained to that isolated worker; the master and traffic-serving workers never see that invalid configuration. We expect this safeguard to reach production globally in the January 2026 timeframe. Introduction of this component will also allow us to return closer to pre-incident levels of configuration propagation time while ensuring data plane safety.
Closing
This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO). We deeply value our customers' trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we're making Azure Front Door even more resilient so your applications can be too.

Consistent DNS resolution in a hybrid hub spoke network topology
DNS is one of the most essential networking services, next to IP routing. A modern hybrid cloud network may have various sources of DNS: Azure Private DNS zones, public DNS, domain controllers, etc. Some organizations may also prefer to route their public Internet DNS queries through a specific DNS provider. Therefore, it is crucial to ensure consistent DNS resolution across the whole (hybrid) network. This article describes how DNS Private Resolver can be leveraged to build such an architecture.

High-Fidelity Network Observability at Scale — ACNS Metrics Filtering and Log Aggregation Now GA
We are thrilled to announce that Advanced Container Networking Services (ACNS) for Azure Kubernetes Service (AKS) now delivers two powerful observability features in General Availability: container network metrics filtering and container network log filtering and aggregation. Together, these capabilities set a new standard for Kubernetes network observability, giving you high-fidelity visibility at dramatically lower cost and noise. They fundamentally redefine how network observability works at scale while delivering up to 97% cost reduction.
Why is this a milestone?
Most Kubernetes observability solutions face a fundamental tension: collect everything and drown in noise and cost, or sample and miss the signals that matter. ACNS breaks that tradeoff. With this release, Azure becomes the first cloud provider to deliver on-node metrics filtering and flow log aggregation for Kubernetes networking, capabilities now also contributed to the upstream Hubble project, making them available to the broader open-source community. For AKS customers running Cilium-based clusters, this means:
Every flow you care about is captured. Everything else is dropped at the source.
Log volume is compressed by up to 45% through aggregation, without losing security verdicts or error context.
Costs scale with what you monitor, not with cluster size.
What's been improved in ACNS observability?
This release introduces two capabilities that work together: container network metrics filtering and container network log filtering and aggregation. Both are available on AKS clusters with the Cilium data plane and give you precise controls to keep observability costs predictable while maintaining the visibility you need.
Container Network Metrics Filtering
Container network metrics are generated for all pods by default whenever ACNS is enabled. With metrics filtering, you now control what gets collected at the point of ingestion, on the node, before anything is scraped or transmitted. A single ContainerNetworkMetric CRD per cluster defines which metric types (dns, flow, tcp, drop), namespaces, pod labels, and protocols to ingest. It supports both include and exclude filters, so you can maintain broad collection while carving out specific workloads or namespaces. Anything that doesn't match is dropped on the node. Changes reconcile in a few seconds, with no Cilium agent or Prometheus restarts required.
Container Network Log Filtering and Aggregation
Unlike metrics, container network logs are not generated automatically. You start capturing network flows only after applying a ContainerNetworkLog CRD that defines exactly which traffic to capture, by namespace, pod, service, protocol, or verdict. Only matching flows are logged, giving you a precise, targeted view rather than a fire hose. This is where Azure's first-to-market innovation comes in. Flow log aggregation, now built into ACNS and contributed upstream to Hubble for the open-source community, groups similar flows into summarized records every 30 seconds. The result is dramatically reduced data volume while preserving security verdicts, service identity, and error context. What previously required custom post-processing pipelines is now built directly into the platform before storage costs are incurred. Every matched flow log captures source and destination pods, namespaces, ports, protocols, traffic direction, and policy verdicts. Logs are stored in a Log Analytics workspace (ContainerNetworkLogs table) with a choice of using the Analytics or Basic tier.
Built-in Azure portal dashboards are available for both tiers. Logs can also be exported to external log collectors such as Splunk or Datadog.
First to Market: Azure and the upstream Hubble Contribution
ACNS's filtering and aggregation capabilities were engineered from the ground up to solve real production observability challenges at scale. Rather than keeping this innovation proprietary, Azure contributed the log aggregation and filtering capabilities to the upstream Hubble project, the observability layer of the Cilium ecosystem. This means:
AKS customers get a fully managed, Azure-native experience with portal dashboards, Log Analytics integration, and Grafana visualization, out of the box.
The broader open-source community gains access to the same filtering and aggregation primitives through upstream Hubble.
Azure is the first to ship this capability in a managed Kubernetes service, and the first to give it back to the community.
Key Benefits
💰 Lower observability cost. Metrics filtering drops unwanted data on the node before Prometheus ever scrapes it. Flow log aggregation compresses log data by up to 97% in lab testing. Your cost scales with what you choose to monitor, not with cluster size.
📉 Less noise, more signal. Metrics filtering carves out the namespaces and workloads that matter, so dashboards show only relevant signals. Log filters scope collection to specific pods and verdicts. Engineers start every investigation with data that's already relevant.
⚡ Faster root-cause isolation. Every metric carries source and destination pod context. Targeted flow logs add the forensic detail: which policy, destination, or port is involved. Together, they cut mean time to resolution from hours of guesswork to minutes of structured investigation.
🔒 Full signal, zero gaps. ACNS doesn't sample. Within the scope you define, every flow is captured and every pattern is preserved. Aggregation compresses volume without losing security verdicts or error context.
Who Benefits
Platform engineers managing multi-tenant clusters can scope data collection per namespace, so each team gets visibility into their own traffic without contributing to a shared cost pool.
SREs can isolate packet drops, TCP resets, or DNS failures to a specific workload in minutes, starting with data that's already scoped to what matters.
Decision-makers evaluating observability spend get predictable, controllable ingestion costs that scale with intent, not infrastructure size.
How to optimize ACNS metrics and logs with filtering?
Follow these steps (a consolidated CLI sketch is included at the end of this post):
1. Enable ACNS on your AKS cluster with the Cilium data plane: az aks create --enable-acns, or on an existing cluster: az aks update --resource-group $RESOURCE_GROUP --name $CLUSTER --enable-acns
2. Apply a ContainerNetworkMetric CRD to filter which metrics are collected on each node. Start by excluding noisy system namespaces, then scope to business-critical workloads.
3. Apply a ContainerNetworkLog CRD to define which flows to capture.
4. Enable Azure Monitor integration with --enable-container-network-logs to send logs to a Log Analytics workspace, or export logs from the node to an external logging system such as Splunk or Datadog.
5. Check your dashboards. Open your cluster in the Azure portal and go to Monitor > Insights > Networking for bytes, drops, DNS errors, and flows. For flow logs, use the built-in Azure portal dashboards available for both Basic and Analytics tiers.
Conclusion
Kubernetes network observability has long meant choosing between visibility and cost.
With container network metrics filtering and log filtering and aggregation now GA in ACNS and contributed to upstream Hubble for the open-source community, that tradeoff is gone. Azure is first to market with this capability. AKS customers get it fully managed, out of the box, with built-in dashboards and Log Analytics integration. And the broader Cilium ecosystem gets it through upstream Hubble. High-fidelity visibility. Lower cost. No compromise.
Learn more:
Container network metrics overview: Container network metrics overview - Azure Kubernetes Service | Microsoft Learn
Container network logs overview: Container Network Logs Overview - Azure Kubernetes Service | Microsoft Learn
Configure container network metrics filtering: Configure Container network metrics filtering for Azure Kubernetes Service (AKS) - Azure Kubernetes Service | Microsoft Learn
Set up container network logs: Set up container network logs - Azure Kubernetes Service | Microsoft Learn
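To make the setup steps above concrete, here is a minimal sketch of the CLI portion of the workflow. It only uses the flags named in this post (--enable-acns and --enable-container-network-logs); the resource group, cluster name, manifest file names, and workspace GUID are placeholders, and the ContainerNetworkMetric/ContainerNetworkLog manifests themselves should follow the field reference in the linked how-to guides.

```bash
# Placeholders: adjust resource group, cluster name, and workspace to your environment.
RESOURCE_GROUP=myResourceGroup
CLUSTER=myAksCluster

# Enable ACNS and container network logs on an existing Azure CNI powered by Cilium cluster.
az aks update \
  --resource-group $RESOURCE_GROUP \
  --name $CLUSTER \
  --enable-acns \
  --enable-container-network-logs

# Apply your ContainerNetworkMetric / ContainerNetworkLog CRD manifests
# (see the "Configure" and "Set up" guides above for the supported fields).
az aks get-credentials --resource-group $RESOURCE_GROUP --name $CLUSTER
kubectl apply -f container-network-metric.yaml
kubectl apply -f container-network-log.yaml

# Spot-check ingested flow logs in the Log Analytics workspace (workspace GUID is a placeholder).
az monitor log-analytics query \
  --workspace "<log-analytics-workspace-guid>" \
  --analytics-query "ContainerNetworkLogs | take 10"
```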
Public Preview: Managed Identity support for graphical session recording
Overview
Azure Bastion provides secure RDP and SSH access to Azure virtual machines directly via the Azure portal or via the native SSH/RDP client already installed on your local computer. Today, we are introducing the public preview of managed identity support for session recording, giving administrators a seamless, identity-based way to authenticate Bastion when writing recordings to a designated storage account.
Why Managed Identities?
With managed identity support, Bastion authenticates directly to your storage account using an Azure identity, with no additional credentials to configure or manage. You can use either a system-assigned or user-assigned managed identity depending on your needs. Authentication is handled automatically through Microsoft Entra ID, which means setup is straightforward: enable the identity, assign a role, and point Bastion at your storage container. For organizations operating at scale across many Bastion deployments and regions, this identity-based approach removes the need to manage credentials, aligns with Zero Trust principles, and lets you control access centrally through Azure RBAC.
Getting Started in Azure Portal
Prerequisites
Ensure that Azure Bastion is deployed with the Premium SKU.
Ensure that a storage account with a dedicated container for session recordings is created.
Ensure that the storage account has the required CORS policy configured. Click here to set up the storage account for session recordings.
Ensure that users who need to view recordings have the Storage Blob Data Reader role on the storage account.
Steps
1. Navigate to your Bastion resource in the Azure portal.
2. Select Identity (Preview) in the left pane and turn the Status to On to enable a system-assigned managed identity. Wait for the configuration to complete.
3. Select Azure role assignments, then select Add role assignment (Preview). Assign the Storage Blob Data Contributor role scoped to your storage account (a CLI sketch for this role assignment is included at the end of this post).
4. Select Save, then navigate to the Configuration blade.
5. Under Session Recording Configuration, select System Assigned Managed Identity and enter the Blob Container URI for your storage container.
6. Navigate to the Session recordings blade to view and play back recorded sessions.
Next Steps
Learn more about configuring session recording with managed identities here and keep up to date with all things Azure Bastion in our What's New page.
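If you prefer to script the storage permissions from the prerequisites and step 3 rather than using the portal, here is a minimal Azure CLI sketch. The resource names are placeholders, and the way the Bastion identity's principal ID is looked up is an assumption (any method that returns the system-assigned identity's principal ID works); the roles themselves are the ones described above.

```bash
# Placeholders: replace resource group, Bastion, and storage account names with your own.
STORAGE_ID=$(az storage account show \
  --resource-group myResourceGroup \
  --name mystorageaccount \
  --query id --output tsv)

# Principal ID of the Bastion host's system-assigned managed identity
# (assumes the identity has already been enabled on the Bastion resource).
BASTION_PRINCIPAL_ID=$(az resource show \
  --resource-group myResourceGroup \
  --name myBastion \
  --resource-type "Microsoft.Network/bastionHosts" \
  --query identity.principalId --output tsv)

# Let Bastion write session recordings to the storage account.
az role assignment create \
  --assignee-object-id $BASTION_PRINCIPAL_ID \
  --assignee-principal-type ServicePrincipal \
  --role "Storage Blob Data Contributor" \
  --scope $STORAGE_ID

# Let a user who needs to play back recordings read the blobs.
az role assignment create \
  --assignee "user@contoso.com" \
  --role "Storage Blob Data Reader" \
  --scope $STORAGE_ID
```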
Introducing WireGuard In-Transit Encryption for AKS (Generally Available)
Update (Generally Available)
WireGuard in-transit encryption for Azure Kubernetes Service (AKS) is now generally available for clusters using Azure CNI powered by Cilium and Advanced Container Networking Services. The feature is production-ready and no longer requires preview enrollment. The core behavior, scope, and configuration model remain unchanged from the public preview.
As organizations continue to scale containerized workloads in Azure Kubernetes Service (AKS), securing network traffic between applications and services is critical—especially in regulated or security-sensitive environments. WireGuard in-transit encryption is now generally available in AKS, delivering transparent, node-level encryption for inter-node pod traffic as part of Advanced Container Networking Services, powered by Azure CNI built on Cilium.
What is WireGuard?
WireGuard is a modern, high-performance VPN protocol known for its simplicity and robust cryptography. Integrated into the Cilium data plane and managed as part of AKS networking, WireGuard offers an efficient way to encrypt traffic transparently within your cluster. With this new feature, WireGuard is now natively supported as part of Azure CNI powered by Cilium with Advanced Container Networking Services, with no need for third-party encryption tools or custom key management systems.
What Gets Encrypted?
The WireGuard integration in AKS focuses on the most critical traffic path:
✅ Encrypted: Inter-node pod traffic: Network communication between pods running on different nodes in the AKS cluster. This traffic traverses the underlying network infrastructure and is encrypted using WireGuard to ensure confidentiality and integrity.
❌ Not encrypted:
Same-node pod traffic: Communication between pods that are running on the same node. Since this traffic does not leave the node, it bypasses WireGuard and remains unencrypted.
Node-generated traffic: Traffic initiated by the node itself, which is currently not routed through WireGuard and thus not encrypted.
This scope is designed to strike the right balance between strong protection and performance by securing the most critical traffic: data that leaves the host and traverses the network.
Key Benefits
Simple Configuration: Enable WireGuard with just a few flags during AKS cluster creation or update.
Automatic Key Management: Each node generates and exchanges WireGuard keys automatically; no manual configuration is needed.
Transparent to Applications: No application-level changes are required. Encryption happens at the network layer.
Cloud-Native Integration: Fully managed as part of Advanced Container Networking Services and Cilium, offering a seamless and reliable experience.
Architecture: How It Works
When WireGuard is enabled:
Each node generates a unique public/private key pair.
The public keys are securely shared between nodes via the CiliumNode custom resource.
A dedicated network interface (cilium_wg0) is created and managed by the Cilium agent running on each node.
Peers are dynamically updated, and keys are rotated automatically every 120 seconds to minimize risk.
This mechanism ensures that only validated nodes can participate in encrypted communication.
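As a quick sanity check of the architecture described above, you can ask the Cilium agent on a node whether encryption is active and confirm that the cilium_wg0 interface exists. This is a sketch under a few assumptions: the Cilium agent runs as the cilium DaemonSet in kube-system (the AKS default for Azure CNI powered by Cilium), the agent image includes the ip utility, and the agent CLI binary is named cilium (on some Cilium versions it is cilium-dbg).

```bash
# Look for the Encryption line in the agent status; it should report WireGuard when enabled.
kubectl -n kube-system exec ds/cilium -- cilium status | grep -i encryption

# Confirm the WireGuard interface managed by the agent exists on that node.
kubectl -n kube-system exec ds/cilium -- ip link show cilium_wg0

# The per-node public keys are exchanged through the CiliumNode custom resources.
kubectl get ciliumnodes
```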
WireGuard and VNet Encryption
AKS now offers two powerful in-transit encryption options:
Scope: WireGuard encrypts pod-to-pod inter-node traffic; VNet encryption covers all traffic in the VNet.
VM Support: WireGuard works on all VM SKUs; VNet encryption requires hardware support (e.g., Gen2 VMs).
Deployment Flexibility: WireGuard is cloud-agnostic and hybrid ready; VNet encryption is Azure-only.
Performance: WireGuard is software-based with moderate CPU usage; VNet encryption is hardware-accelerated with low overhead.
Choose WireGuard if you want encryption flexibility across clouds or have VM SKUs that don't support VNet encryption. Choose VNet encryption for full-network coverage and ultra-low CPU overhead.
Conclusion and Next Steps
With WireGuard now generally available in AKS, customers can secure inter-node pod traffic using a lightweight, cloud-native encryption mechanism that requires no application changes and minimal operational overhead.
Ready to get started? Check out our how-to guide for step-by-step instructions on enabling WireGuard in your cluster and securing your container networking with ease.
Explore more about Advanced Container Networking Services:
Container Network Observability
L7 network policies
FQDN-based Policy

Designing Outbound Connectivity for "Private Subnets" in Azure
Why Private Subnets Change Everything
Historically, Azure virtual machines relied on default outbound internet access, where the platform automatically assigned a dynamic SNAT IP from a shared pool. This was convenient but problematic:
❌ No deterministic outbound IP addresses
❌ No traffic inspection or filtering
❌ No FQDN or URL governance
❌ Difficult to audit for compliance
❌ Susceptible to noisy neighbor SNAT exhaustion
With private subnets, outbound access is disabled by default. This shifts the responsibility to the architect — deliberately. The result is an environment where:
✅ Every outbound flow is intentional
✅ Every outbound IP is known and documented
✅ Every egress path can be governed and logged
✅ Compliance evidence is straightforward to produce
The question is no longer "does my VM have internet access?" but rather "how exactly does my VM reach the internet, and is that path appropriate for this workload?"
The Three Outbound Patterns at a Glance
NAT Gateway: managed outbound SNAT; inspection: ❌ none; scale: ⭐⭐⭐ high; cost: 💲 low; best for simple, scalable egress.
Azure Firewall: secure governed egress; inspection: ✅ full L3–L7; scale: ⭐⭐⭐ high; cost: 💲💲💲 higher; best for security boundaries.
Load Balancer: legacy SNAT; inspection: ❌ none; scale: ⭐⭐ limited; cost: 💲 low; best for legacy / transitional scenarios.
Scenario 1: NAT Gateway
What is NAT Gateway?
Azure NAT Gateway is a fully managed, zone-resilient, outbound-only SNAT service. It attaches at the subnet level and automatically handles all outbound flows from that subnet using one or more static public IP addresses or prefixes. It is purpose-built for one thing: providing predictable, scalable outbound internet access — without routing complexity or inline devices.
Key flows are depicted below:
VM → NAT Gateway: Automatic SNAT (no UDR required)
NAT Gateway → Internet: Static, deterministic public IP
Inbound: NOT supported (outbound only)
How it works (step by step)
1. VM initiates an outbound connection (e.g., HTTPS to an API)
2. NAT Gateway intercepts the flow at the subnet boundary
3. Source IP is translated to the NAT Gateway's static public IP
4. The packet is forwarded to the internet
5. Return traffic is automatically tracked and delivered back to the VM
No UDRs. No routing tables. No inline devices. It just works.
Strengths
Massive SNAT scale — no port exhaustion concerns at typical enterprise scale
Deterministic outbound IPs — easy to allowlist with external services
Zone resilient — survives availability zone failures
Subnet scoped — applies to all VMs in the subnet automatically
No routing configuration required
Limitations
❌ No traffic inspection or filtering
❌ No FQDN or URL policy enforcement
❌ No threat intelligence integration
❌ Cannot restrict which internet destinations are allowed
Best Fit Use Cases
✅ Application tiers calling external SaaS APIs
✅ VMs requiring OS updates and patch downloads
✅ CI/CD build agents and pipeline runners
✅ Spoke VNets in hub-and-spoke where east-west goes through the firewall, but simple internet egress is acceptable
✅ Dev/test environments
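To make Scenario 1 concrete, here is a minimal Azure CLI sketch of the pattern: a static public IP, a NAT gateway, and the subnet association. Resource names are placeholders, and the sketch assumes the virtual network and subnet already exist; adjust zones, public IP prefixes, and idle timeout to your own requirements.

```bash
# Placeholders throughout; assumes the VNet and subnet already exist.
az network public-ip create \
  --resource-group myResourceGroup \
  --name natgw-pip \
  --sku Standard \
  --allocation-method Static

az network nat gateway create \
  --resource-group myResourceGroup \
  --name myNatGateway \
  --public-ip-addresses natgw-pip \
  --idle-timeout 4

# Associate the NAT gateway with the workload subnet; all outbound flows
# from this subnet now SNAT behind the static public IP created above.
az network vnet subnet update \
  --resource-group myResourceGroup \
  --vnet-name myVnet \
  --name workload-subnet \
  --nat-gateway myNatGateway
```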
Scenario 2: Azure Firewall
What is Azure Firewall?
Azure Firewall is a cloud-native, stateful, L3–L7 network security service. When used for outbound egress, it transforms the egress path from a connectivity function into a security enforcement boundary. Unlike NAT Gateway, Azure Firewall inspects every packet, evaluates it against policy, and either allows or denies it based on network rules, application rules, and threat intelligence feeds.
Key flows are depicted below:
VM → UDR: Forces ALL outbound traffic to Firewall
Firewall: Evaluates against policy before allowing
Firewall → Internet: Only explicitly permitted flows pass
All denied flows: Logged and alertable
How it works (step by step)
1. VM initiates an outbound connection
2. UDR intercepts the flow and redirects it to Azure Firewall's private IP
3. Azure Firewall evaluates the traffic: network rules (IP/port match), application rules (FQDN/URL match), threat intelligence (known malicious IPs/domains)
4. If allowed: traffic is forwarded via the Firewall's public IP
5. If denied: traffic is dropped and logged
6. All flows (allowed and denied) are logged to Log Analytics / Sentinel
Strengths
✅ Full L3–L7 inspection
✅ FQDN and URL-based filtering (application rules)
✅ Threat intelligence integration (Microsoft TI feed)
✅ TLS inspection (Premium SKU)
✅ Centralized governance across multiple VNets via Firewall Manager
✅ Rich logging — every allowed and denied flow is recorded
✅ IDPS (Intrusion Detection and Prevention) available in Premium
Limitations
❌ Higher cost (hourly + data processing charges)
❌ Requires UDR configuration on each spoke subnet
❌ Adds latency (small but non-zero)
❌ Requires careful SNAT configuration at scale
Best Fit Use Cases
✅ Regulated industries (financial services, healthcare, government)
✅ Any workload where outbound internet is a security boundary
✅ Environments requiring egress allowlisting for compliance
✅ Hub-and-spoke architectures with centralized control plane
✅ SOC environments needing outbound flow telemetry
Scenario 3: Load Balancer Outbound
What is Load Balancer Outbound?
Azure Load Balancer outbound rules were historically the primary mechanism for providing SNAT to VMs behind a Standard Load Balancer. While newer patterns (NAT Gateway, Azure Firewall) have largely replaced this approach for new designs, outbound rules remain valid in specific scenarios.
Key flows are depicted below:
VMs → Load Balancer: Backend pool members get SNAT
LB Outbound Rules: Define port allocation per VM
⚠️ Port exhaustion risk at scale
⚠️ No inspection or policy enforcement
How it works (step by step)
1. VM in the backend pool initiates an outbound connection
2. Load Balancer applies SNAT using the frontend public IP
3. Ephemeral ports are allocated per VM from a fixed pool
4. Return traffic is tracked and delivered back to the correct VM
5. If the port pool is exhausted: connections fail (SNAT exhaustion)
Strengths
Lower cost than NAT Gateway or Firewall
Tightly integrated with existing load-balanced workloads
Familiar operational model for legacy teams
Limitations
❌ SNAT port pool is fixed and must be manually managed
❌ Risk of SNAT exhaustion at scale
❌ No traffic inspection
❌ Less flexible than NAT Gateway
❌ Not recommended for new designs
Best Fit Use Cases
✅ Existing architectures already built around Azure Load Balancer
✅ Low outbound connection volume workloads
✅ Transitional architectures during modernization to NAT Gateway
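Before moving to the decision framework, here is a minimal sketch of the UDR plumbing that Scenario 2 relies on: a route table that sends 0.0.0.0/0 to the firewall's private IP, associated with a spoke subnet. It assumes the Azure Firewall is already deployed in the hub; names and the firewall private IP are placeholders, and the firewall policy rules themselves are configured separately.

```bash
# Placeholders; assumes the Azure Firewall is already deployed in the hub VNet.
az network route-table create \
  --resource-group myResourceGroup \
  --name spoke-egress-rt

# Default route: send all internet-bound traffic to the firewall's private IP.
az network route-table route create \
  --resource-group myResourceGroup \
  --route-table-name spoke-egress-rt \
  --name default-via-firewall \
  --address-prefix 0.0.0.0/0 \
  --next-hop-type VirtualAppliance \
  --next-hop-ip-address 10.0.1.4

# Associate the route table with the spoke workload subnet.
az network vnet subnet update \
  --resource-group myResourceGroup \
  --vnet-name spokeVnet \
  --name workload-subnet \
  --route-table spoke-egress-rt
```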
Decision Framework: Choosing the Right Outbound Pattern
Common Pitfalls to Avoid
⚠️ Pitfall 1: Forgetting SNAT scale limits. Load Balancer outbound rules allocate a fixed number of ephemeral ports per VM. At scale this exhausts quickly. Use NAT Gateway instead.
⚠️ Pitfall 2: Over-securing low-risk workloads. Not every workload needs Azure Firewall for outbound. Dev/test and patch traffic are better served by NAT Gateway — simpler, cheaper, faster.
⚠️ Pitfall 3: Mixing outbound models in the same subnet. NAT Gateway and Load Balancer outbound rules cannot coexist on the same subnet. NAT Gateway always takes precedence. Plan your subnet boundaries carefully.
⚠️ Pitfall 4: Blocking Azure platform dependencies. Many Azure services still use public endpoints (even when Private Link is available). Ensure your outbound policy allows required Azure service tags before enforcing egress controls.
⚠️ Pitfall 5: Relying on platform defaults. Default outbound access is retired for new VNets. Do not assume VMs can reach the internet without explicit configuration.
Summary and Key Takeaways
Simple internet egress at scale: NAT Gateway (scalable, predictable, no complexity)
Security boundary for egress: Azure Firewall (inspection, FQDN rules, threat intel)
Legacy load-balanced workloads: Load Balancer Outbound (transitional only)
Regulated / compliance environments: Azure Firewall (audit logs, policy enforcement)
Dev / test / patch traffic: NAT Gateway (low cost, low friction)
The core principle
Private subnets make outbound access intentional. Choose the outbound pattern that matches the risk level of the workload, not the most complex option available.
References
https://learn.microsoft.com/azure/nat-gateway/nat-overview
https://learn.microsoft.com/azure/firewall/overview
https://learn.microsoft.com/azure/load-balancer/outbound-rules
https://azure.microsoft.com/blog/default-outbound-access-for-vms-in-azure-will-be-retired

Private subnets by default in Azure Virtual Networks: What changed and how to use NAT Gateway
Azure is evolving to better support secure-by-default cloud architectures. Starting with API version 2025-07-01 (released after March 31, 2026), newly created virtual networks now default to using private subnets. This change removes the long-standing platform behavior of automatically enabling outbound internet access through implicit public IPs, also known as default outbound access (DOA). As a result, newly deployed virtual machines will not have public outbound connectivity unless explicitly configured.
What changed?
Previously, Azure automatically assigned a hidden Microsoft-owned public IP to virtual machines deployed without an explicit outbound method (such as NAT Gateway, Load Balancer outbound rules, or instance-level public IPs). This allowed public outbound connectivity without requiring customer configuration. While convenient, this model introduced challenges:
Security – Implicit internet access conflicts with Zero Trust principles.
Reliability – Platform-managed outbound IPs can change unexpectedly.
Operational consistency – VMSS instances or multi-NIC VMs may egress using different default outbound IPs.
With API version 2025-07-01 and later:
Subnets in newly created VNets are private by default.
The subnet property `defaultOutboundAccess` is set to false.
Azure no longer assigns implicit outbound public IPs.
This applies across deployment methods including the Portal, ARM/Bicep, CLI, and PowerShell. The portal has used the new model since April 1, 2026. Note: This change has not yet been applied to Terraform.
Am I impacted by this change?
Existing VNets or VMs using DOA: ✅ Unchanged
New VMs in existing VNets: ✅ Unchanged
Subnets already using explicit outbound: ✅ Continue using the configured outbound method
New VMs in new VNets (with subnets created using API 2025-07-01 or later): 🔒 Subnets private by default
New VMs in private subnets without explicit outbound configured: ❌ No public outbound connectivity
Existing workloads are not impacted. If required, you can still create new subnets without the private setting by choosing the appropriate configuration option during creation. See the FAQ section of this blog for more information. However, we strongly recommend transitioning to an explicit outbound method so that:
Your workloads won't be affected by public IP address changes.
You have greater control over how your VMs connect to public endpoints.
Your VMs use traceable IP resources that you own.
When is outbound connectivity required?
If your virtual network contains virtual machines, you must configure explicit outbound connectivity. Here are common scenarios that require it:
Virtual machine operating system activation and updates, such as Windows or Linux.
Pulling container images from public registries (Docker Hub or Microsoft Container Registry).
Accessing third-party SaaS or public APIs.
Virtual machine scale sets using flexible orchestration mode are always secure by default and therefore require an explicit outbound method. Private subnets don't apply to delegated or managed subnets that host PaaS services. In these cases, the service handles outbound connectivity—see the service-specific documentation for details.
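If you are unsure whether a given VM is affected, a quick way to check is to test outbound reachability from inside the VM and see which public IP, if any, it egresses from. This sketch assumes a Linux VM with curl installed and uses a public IP echo service (ifconfig.me) purely as an example endpoint; any destination you control works equally well.

```bash
# Run from inside the VM. If the subnet is private and no explicit outbound
# method is configured, this request will time out.
curl --max-time 10 https://ifconfig.me
echo ""

# If it succeeds, the printed address is the VM's effective outbound IP:
# a NAT gateway, firewall, or load balancer IP that you own, an instance-level
# public IP, or (for older deployments) a platform-assigned default outbound IP.
```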
Recommended outbound connectivity method: StandardV2 NAT Gateway
Azure now recommends using an explicit outbound connectivity method such as:
NAT Gateway
Load Balancer outbound rules
A public IP assigned to the VM
A Network Virtual Appliance (NVA) / firewall
Among these, Azure StandardV2 NAT Gateway is the recommended method for scalable and resilient outbound connectivity. StandardV2 NAT Gateway:
Provides zone-redundancy by default in supported regions
Supports up to 100 Gbps throughput
Provides dual-stack support with IPv4 and IPv6 public IPs
Uses customer-owned static public IPs
Enables outbound connectivity without allowing inbound internet access
Requires no route table configuration when associated to a subnet
When configured, NAT Gateway automatically becomes the subnet's default outbound path and takes precedence over:
Load Balancer outbound rules
VM instance-level public IPs
Note: UDRs for 0.0.0.0/0 traffic directed to virtual appliances/firewalls take precedence over NAT gateway.
Migrate from Default Outbound Access to NAT Gateway
To transition from DOA to Azure's recommended outbound method, StandardV2 NAT Gateway:
1. Go to your virtual network in the portal, and select the subnet you want to modify.
2. In the Edit subnet menu, select the 'Enable private subnet' checkbox under the Private subnet section. Enabling private subnet can also be done through other supported clients; below is an example for the CLI, in which the default-outbound parameter is set to false:
az network vnet subnet update \
  --resource-group rgname \
  --name subnetname \
  --vnet-name vnetname \
  --default-outbound false
3. Deploy a StandardV2 NAT gateway resource.
4. Associate one or more StandardV2 public IP addresses or prefixes.
5. Attach the NAT gateway to the target subnet.
Once associated:
All new outbound traffic from that subnet uses NAT Gateway automatically
VM-level public IPs are no longer required
Existing outbound connections are not interrupted
Note: Enabling private subnet on an existing subnet will not affect any VMs already using default outbound IPs. Private subnet only ensures that new VMs don't receive a default outbound public IP.
For step-by-step guidance, see migrate default outbound access to NAT Gateway.
FAQ
1. Will my existing workloads lose outbound connectivity?
No. Workloads currently using default outbound IPs are not impacted by this change. The private subnet by default update only affects:
Newly created VNets
New subnets created using the updated API, 2025-07-01
New virtual machines deployed into those subnets using the updated API
VMs and subnets using an explicit outbound connectivity method like a NAT gateway, NVA/firewall, a VM instance-level public IP, or Load Balancer outbound rules are not impacted by this change.
2. Why can't my new VM reach the internet or other public endpoints within Microsoft (e.g., VM activation, updates)?
New subnets are private by default. If your deployment does not include an explicit outbound method — such as a NAT Gateway, public IP, Load Balancer outbound rule, or NVA/firewall — outbound connectivity is not automatically enabled.
3. My workload has a dependency on default outbound IPs and isn't ready to move to private subnets. What should I do?
You can opt out of the default private subnet setting by disabling the private subnet feature.
You can do this in the portal by unselecting the private subnet checkbox. Disabling private subnet can also be done through other supported clients; below is an example for the CLI, in which the default-outbound parameter is set to true:
az network vnet subnet update \
  --resource-group rgname \
  --name subnetname \
  --vnet-name vnetname \
  --default-outbound true
4. Why do I see an alert showing that I have a default outbound IP on my VM?
There's a NIC-level parameter `defaultOutboundConnectivityEnabled` that tracks whether a default outbound IP is allocated to a VM / Virtual Machine Scale Set instance. If detected, the Azure portal displays a notification banner and will generate Azure Advisor recommendations about disabling default outbound connectivity for your VMs/VMSS.
5. How do I clear this alert?
To remove the default outbound IP and clear the alert:
Configure a StandardV2 NAT gateway (or another explicit outbound method).
Set your subnet to be private by setting the subnet property defaultOutboundAccess = false using one of the supported clients.
Stop and deallocate any applicable virtual machines (this will remove the default outbound IP currently associated with the VM).
6. I have a NAT gateway (or a UDR pointing to an NVA) configured for my private subnet. Why do I still see this alert?
In some cases, a default outbound IP is still assigned to virtual machines in a non-private subnet, even when an explicit outbound method—such as a NAT gateway or a UDR directing traffic to an NVA/firewall—is configured. This does not mean that the default outbound IP is used for egress traffic. To fully remove the assignment (and clear the alert):
Set the subnet to private
Stop and deallocate the affected virtual machines
Summary
The move to private subnets by default improves the security posture of Azure networking deployments by removing implicit outbound internet access. Customers deploying new workloads must now explicitly configure outbound connectivity. StandardV2 NAT Gateway provides a scalable, resilient method for enabling outbound internet access without exposing workloads to inbound connections or relying on platform-managed IPs.
Learn more
Default Outbound Access
StandardV2 NAT Gateway
Migrate Default Outbound Access to StandardV2 NAT Gateway