Azure Networking
ExpressRoute Gateway Microsoft-initiated migration
Objective

The backend migration process is an automated upgrade performed by Microsoft to ensure your ExpressRoute gateways use the Standard IP SKU. This migration enhances gateway reliability and availability while maintaining service continuity. You receive notifications about scheduled maintenance windows and have options to control the migration timeline. For guidance on upgrading Basic SKU public IP addresses for other networking services, see Upgrading Basic to Standard SKU.

Important: As of September 30, 2025, Basic SKU public IPs are retired. For more information, see the official announcement.

You can initiate the ExpressRoute gateway migration yourself at a time that best suits your business needs, before the Microsoft team performs the migration on your behalf. This gives you control over the migration timing. Please use the ExpressRoute Gateway Migration Tool to migrate your gateway public IP to the Standard SKU. This tool provides a guided workflow in the Azure portal and PowerShell, enabling a smooth migration with minimal service disruption.

Backend migration overview

The backend migration is scheduled during your preferred maintenance window. During this time, the Microsoft team performs the migration with minimal disruption. You don’t need to take any action. The process includes the following steps:

- Deploy new gateway: Azure provisions a second virtual network gateway in the same GatewaySubnet alongside your existing gateway. Microsoft automatically assigns a new Standard SKU public IP address to this gateway.
- Transfer configuration: The process copies all existing configurations (connections, settings, routes) from the old gateway. Both gateways run in parallel during the transition to minimize downtime. You may experience brief connectivity interruptions.
- Clean up resources: After migration completes successfully and passes validation, Azure removes the old gateway and its associated connections. The new gateway includes a tag CreatedBy: GatewayMigrationByService to indicate it was created through the automated backend migration.

Important: To ensure a smooth backend migration, avoid making non-critical changes to your gateway resources or connected circuits during the migration process. If modifications are absolutely required, you can choose (after the Migrate stage completes) to either commit or abort the migration and make your changes.

Backend process details

This section provides an overview of the Azure portal experience during backend migration for an existing ExpressRoute gateway. It explains what to expect at each stage and what you see in the Azure portal as the migration progresses. To reduce risk and ensure service continuity, the process performs validation checks before and after every phase. The backend migration follows four key stages:

- Validate: Checks that your gateway and connected resources meet all migration requirements for the Basic to Standard public IP migration.
- Prepare: Deploys the new gateway with the Standard IP SKU alongside your existing gateway.
- Migrate: Cuts over traffic from the old gateway to the new gateway with a Standard public IP.
- Commit or abort: Finalizes the public IP SKU migration by removing the old gateway, or reverts to the old gateway if needed.

These stages mirror the Gateway migration tool process, ensuring consistency across both migration approaches. The Azure resource group RGA serves as a logical container that displays all associated resources as the process updates, creates, or removes them.
Before the migration begins, RGA contains the original gateway resources. The examples in this walkthrough use an ExpressRoute gateway named ERGW-A with two connections (Conn-A and LAconn) in the resource group RGA.

Portal walkthrough

Before the backend migration starts, a banner appears in the Overview blade of the ExpressRoute gateway. It notifies you that the gateway uses the deprecated Basic IP SKU and will undergo backend migration between March 7, 2026, and April 30, 2026.

Validate stage

Once the migration starts, the banner on your gateway’s Overview page updates to indicate that migration is currently in progress. In this initial stage, all resources are checked to ensure they are in a Passed state. If any prerequisites aren't met, validation fails and the Azure team doesn't proceed with the migration, to avoid traffic disruptions. No resources are created or modified in this stage. After the validation phase completes successfully, a notification appears indicating that validation passed and the migration can proceed to the Prepare stage.

Prepare stage

In this stage, the backend process provisions a new virtual network gateway in the same region and of the same SKU type as the existing gateway. Azure automatically assigns a new public IP address and re-establishes all connections. This preparation step typically takes up to 45 minutes. To indicate that the new gateway is created by migration, the backend mechanism appends _migrate to the original gateway name. During this phase, the existing gateway is locked to prevent configuration changes, but you retain the option to abort the migration, which deletes the newly created gateway and its connections. After the Prepare stage starts, a notification appears showing that new resources are being deployed to the resource group.

Deployment status

In the resource group RGA, under Settings > Deployments, you can view the status of all resources newly deployed as part of the backend migration process. In the resource group RGA, under the Activity log blade, you can see events related to the Prepare stage. These events are initiated by GatewayRP, which indicates they are part of the backend process.

Deployment verification

After the Prepare stage completes, you can verify the deployment details in the resource group RGA under Settings > Deployments. This section lists all components created as part of the backend migration workflow. The new gateway ERGW-A_migrate is deployed successfully along with its corresponding connections: Conn-A_migrate and LAconn_migrate.

Gateway tag

The newly created gateway ERGW-A_migrate includes the tag CreatedBy: GatewayMigrationByService, which indicates it was provisioned by the backend migration process.

Migrate stage

After the Prepare stage finishes, the backend process starts the Migrate stage. At the beginning of this stage, the old gateway (ERGW-A) still handles traffic. After the backend team initiates the traffic migration, the process switches traffic from the old gateway to the new gateway (ERGW-A_migrate). This step can take up to 15 minutes and might cause brief connectivity interruptions. Once the switch completes, the new gateway (ERGW-A_migrate) handles all traffic.

Commit stage

After migration, the Azure team monitors connectivity for 15 days to ensure everything is functioning as expected. The banner automatically updates to indicate completion of the migration. During this validation period, you can’t modify resources associated with either the old or the new gateway.
To resume normal CRUD operations without waiting 15 days, you have two options:

- Commit: Finalize the migration and unlock resources.
- Abort: Revert to the old gateway, which deletes the new gateway and its connections.

To initiate Commit before the 15-day window ends, type yes and select Commit in the portal. When the commit is initiated from the backend, you will see “Committing migration. The operation may take some time to complete.” The old gateway and its connections are deleted, and the events show as initiated by GatewayRP in the activity logs. After the old connections are deleted, the old gateway is deleted. Finally, the resource group RGA contains only the resources related to the migrated gateway ERGW-A_migrate. The ExpressRoute gateway migration from Basic to Standard public IP SKU is now complete.

Frequently asked questions

How long will the Microsoft team wait before committing to the new gateway? The Microsoft team waits around 15 days after migration to allow you time to validate connectivity and ensure all requirements are met. You can commit at any time during this 15-day period.

What is the traffic impact during migration? Is there packet loss or routing disruption? Traffic is rerouted seamlessly during migration. Under normal conditions, no packet loss or routing disruption is expected. Brief connectivity interruptions (typically less than 1 minute) might occur during the traffic cutover phase.

Can we make any changes to the ExpressRoute gateway deployment during the migration? Avoid making non-critical changes to the deployment (gateway resources, connected circuits, and so on). If modifications are absolutely required, you have the option (after the Migrate stage) to either commit or abort the migration.
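If you manage many subscriptions and want to inventory which gateways were produced by the backend migration, the CreatedBy tag described above can be queried directly. Below is a minimal PowerShell sketch, assuming the Az modules are installed and you are signed in to the right subscription; it only reads resources.

# List virtual network gateways created by the backend migration,
# identified by the CreatedBy: GatewayMigrationByService tag.
Get-AzResource -ResourceType "Microsoft.Network/virtualNetworkGateways" `
    -Tag @{ CreatedBy = "GatewayMigrationByService" } |
    Select-Object Name, ResourceGroupName, Location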
Navigating the 2025 holiday season: Insights into Azure’s DDoS defense

The holiday season continues to be one of the most demanding periods for online businesses. Traffic surges, higher transaction volumes, and user expectations for seamless digital experiences all converge, making reliability a non-negotiable requirement. For attackers, this same period presents an opportunity: even brief instability can translate into lost revenue, operational disruption, and reputational impact. This year, the most notable shift wasn’t simply the size of attacks, but how they were executed. We observed a rise in burst‑style DDoS events: fast-ramping, high-intensity surges distributed across multiple resources, designed to overwhelm packet processing and connection-handling layers before traditional bandwidth metrics show signs of strain. From November 15, 2025 through January 5, 2026, Azure DDoS Protection helped customers maintain continuity through sustained Layer 3 and Layer 4 attack traffic, underscoring two persistent realities:

- Most attacks remain short and automated, and they frequently create constant background attack traffic.
- The upper limit of attacker capability continues to grow, with botnets across the industry regularly demonstrating multi‑Tbps scale.

The holiday season once again reinforced that DDoS resilience must be treated as a continuous operational discipline.

Rising volume and intensity

Between November 15 and January 5, Azure mitigated approximately 174,054 inbound DDoS attacks. While many were small and frequent, the distribution revealed the real shift:

- 16% exceeded 1 million packets per second (pps).
- ~3% surpassed 10M pps, up significantly from 0.2% last year.

Even when individual events are modest, the cumulative impact of sustained attack traffic can be operationally draining—consuming on-call cycles, increasing autoscale and egress costs, and creating intermittent instability that can provide cover for more targeted activity. Operational takeaway: Treat DDoS mitigation as an always-on requirement. Ensure protection is enabled across all internet-facing entry points, align alerting to packet rate trends, and maintain clear triage workflows.

What the TCP/UDP mix is telling us this season

TCP did what it usually does during peak season: it carried the fight. TCP floods made up ~72% of activity, and ACK floods dominated (58.7%), a reliable way to grind down packet processing and connection handling. UDP was ~24%, showing up as sharp, high-intensity bursts; amplification (like NTP) appeared, but it wasn’t the main play. Put together, it’s a familiar one-two punch: sustain TCP/ACK pressure to exhaust the edge, then spike UDP to jolt stability and steal attention. The goal isn’t just to saturate bandwidth, it’s to push services into intermittent instability, where things technically stay online but feel broken to users.

- TCP-heavy pressure: Make sure your edge and backends can absorb a surge in connections without falling over—check load balancer limits, connection/state capacity, and confirm health checks won’t start flapping during traffic spikes.
- UDP burst patterns: Rely on automated detection and mitigation—these bursts are often over before a human can respond.
- Reduce exposure: Inventory any internet-facing UDP services and shut down, restrict, or isolate anything you don’t truly need.

Attack duration

Attackers continued to favor short-lived bursts designed to outrun manual response, but we also saw a notable shift in “who” felt the impact most.
High-sensitivity workloads, especially gaming, experienced some of the highest packet-per-second and bandwidth-driven spikes, often concentrated into bursts lasting a few minutes to several minutes. Even when these events were brief, the combination of high PPS and high bandwidth can be enough to trigger jitter, session drops, match instability, or rapid scaling churn. Overall, 34% of attacks lasted 5 minutes or less, and 83% ended within 40 minutes, reinforcing the same lesson: modern DDoS patterns are optimized for speed and disruption, not longevity. For latency- and session-sensitive services, “only a few minutes” can still be a full outage experience. Attack duration is an attacker advantage when defenses rely on humans to notice, diagnose, and react.

- Design for minute-long spikes: Assume attacks will be short, sharp, and high-PPS, so your protections should engage automatically.
- Watch the right signals: Alert on PPS spikes and service health (disconnect rates, latency/jitter), not bandwidth alone.

Botnet-driven surges

Azure observed rapid rotation of botnet traffic associated with Aisuru and KimWolf targeting public-facing endpoints. The traffic was highly distributed across regions and networks. In several instances, when activity was mitigated in one region, similar traffic shifted to alternate regions or segments shortly afterward. “Relocation” behavior is the operational signature of automated botnet playbooks: probe → hit → shift → retry. If defenses vary by region or endpoint, attackers will find the weakest link quickly. Customers should standardize their protection posture, ensure consistent DDoS policies and thresholds across regions, and monitor by setting the right alerts and notifications. The snapshot below captures the source-side distribution at that moment, showing which industry verticals were used to generate the botnet traffic during the observation window. The geography indicators below reflect where the traffic was observed egressing onto the internet, and do not imply attribution or intent by any provider or country.

Preparing for 2026

As organizations transition into 2026, the lessons from the 2025 holiday season, marked by persistent and evolving DDoS threats including the rise of DDoS-for-hire services and massive botnets, underscore the critical need for proactive, resilient cybersecurity. Azure's proven ability to automatically detect, mitigate, and withstand advanced attacks (such as record-breaking volumetric incidents) highlights the value of always-on protections to maintain business continuity and safeguard digital services during peak demand periods. Adopting a Zero Trust approach is essential in this landscape, as it operates on the principle of "never trust, always verify," assuming breaches are inevitable and requiring continuous validation of access and traffic, principles that complement DDoS defenses by limiting lateral movement and exposure even under attack. To achieve comprehensive protection, implement layered security: deploy Azure DDoS Protection for network-layer (Layers 3 and 4) volumetric mitigation with always-on monitoring, adaptive tuning, telemetry, and alerting; combine it with Azure Web Application Firewall (WAF) to defend the application layer (Layer 7) against sophisticated techniques like HTTP floods; and integrate Azure Firewall for additional network perimeter controls.
Key preparatory steps include identifying public-facing exposure points, establishing normal traffic baselines, conducting regular DDoS simulations, configuring alerts for active mitigations, forming a dedicated response team, and enabling expert support like the DDoS Rapid Response (DRR) team when needed. By prioritizing these multi-layered defenses and a well-practiced response plan, organizations can significantly enhance resilience against the evolving DDoS landscape in 2026.
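One of the preparatory steps listed above, configuring alerts for active mitigations, can be scripted rather than clicked through. Below is a minimal PowerShell sketch, assuming the Az.Monitor and Az.Network modules, an existing action group, and placeholder resource names; the IfUnderDDoSAttack metric on a public IP reports 1 while a mitigation is in progress.

# Alert when a protected public IP reports an active DDoS mitigation.
$pip = Get-AzPublicIpAddress -Name "pip-webapp" -ResourceGroupName "rg-production"
$ag  = Get-AzActionGroup -Name "ag-secops" -ResourceGroupName "rg-monitoring"

# Fire when the under-attack signal rises above 0 in any 5-minute window.
$criteria = New-AzMetricAlertRuleV2Criteria -MetricName "IfUnderDDoSAttack" `
    -TimeAggregation Maximum -Operator GreaterThan -Threshold 0

Add-AzMetricAlertRuleV2 -Name "alert-ddos-mitigation-active" `
    -ResourceGroupName "rg-monitoring" -TargetResourceId $pip.Id `
    -WindowSize 00:05:00 -Frequency 00:01:00 -Severity 1 `
    -Condition $criteria -ActionGroupId $ag.Id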
A Practical Guide to Azure DDoS Protection Cost Optimization

Introduction

Azure provides infrastructure-level DDoS protection by default to protect Azure’s own platform and services. However, this protection does not extend to customer workloads or non-Microsoft managed resources like Application Gateway, Azure Firewall, or virtual machines with public IPs. To protect these resources, Azure offers enhanced DDoS protection capabilities (Network Protection and IP Protection) that customers can apply based on workload exposure and business requirements. As environments scale, it’s important to ensure these capabilities are applied deliberately and aligned with actual risk. For more details on how Azure DDoS protection works, see Understanding Azure DDoS Protection: A Closer Look.

Why Cost Optimization Matters

Cost inefficiencies related to Azure DDoS Protection typically emerge as environments scale:

- New public IPs are introduced
- Virtual networks evolve
- Workloads change ownership
- Protection scope grows without clear alignment to workload exposure

The goal here is deliberate, consistent application of enhanced protection matched to real risk rather than historical defaults.

Scoping Enhanced Protection

Customer workloads with public IPs require enhanced DDoS protection to be protected against targeted attacks. Enhanced DDoS protection provides:

- Advanced mitigation capabilities
- Detailed telemetry and attack insights
- Mitigation tuned to specific traffic patterns
- Dedicated support for customer workloads

When to apply enhanced protection:

Workload Type | Enhanced Protection Recommended?
Internet-facing production apps with direct customer impact | Yes
Business-critical systems with compliance requirements | Yes
Internal-only workloads behind private endpoints | Typically not needed
Development/test environments | Evaluate based on exposure

Best Practice: Regularly review public IP exposure and workload criticality to ensure enhanced protection aligns with current needs.

Understanding Azure DDoS Protection SKUs

Azure offers two ways to apply enhanced DDoS protection: DDoS Network Protection and DDoS IP Protection. Both provide DDoS protection for customer workloads.

Comparison Table

Feature | DDoS Network Protection | DDoS IP Protection
Scope | Virtual network level | Individual public IP
Pricing model | Fixed base + overage per IP | Per protected IP
Included IPs | 100 public IPs | N/A
DDoS Rapid Response (DRR) | Included | Not available
Cost protection guarantee | Included | Not available
WAF discount | Included | Not available
Best for | Production environments with many public IPs | Selective protection for specific endpoints
Management | Centralized | Granular
Cost efficiency | Lower per-IP cost at scale (100+ IPs) | Lower total cost for few IPs (< 15)

DDoS Network Protection

DDoS Network Protection can be applied in two ways:

- VNet-level protection: Associate a DDoS Protection Plan with virtual networks, and all public IPs within those VNets receive enhanced protection
- Selective IP linking: Link specific public IPs directly to a DDoS Protection Plan without enabling protection for the entire VNet

This flexibility allows you to protect entire production VNets while also selectively adding individual IPs from other environments to the same plan. For more details on selective IP linking, see Optimizing DDoS Protection Costs: Adding IPs to Existing DDoS Protection Plans.
Ideal for:

- Production environments with multiple internet-facing workloads
- Mixed environments where some VNets need full coverage and others need selective protection
- Scenarios requiring centralized visibility, management, and access to DRR, cost protection, and WAF discounts

DDoS IP Protection

DDoS IP Protection allows enhanced protection to be applied directly to individual public IPs, with per-IP billing. This is a standalone option that does not require a DDoS Protection Plan. Ideal for:

- Environments with fewer than 15 IPs requiring protection
- Cases where DRR, cost protection, and WAF discounts are not needed
- Quick enablement without creating a protection plan

Decision Tree: Choosing the Right SKU

Now that you know the main scenarios, the decision tree below can help you determine which SKU best fits your environment based on feature requirements and scale. Network Protection exclusive features:

- DDoS Rapid Response (DRR): Access to Microsoft DDoS experts during active attacks
- Cost protection: Resource credits for scale-out costs incurred during attacks
- WAF discount: Reduced pricing on Azure Web Application Firewall

Consolidating Protection Plans at Tenant Level

A single DDoS Protection Plan can protect multiple virtual networks and subscriptions within a tenant. Each plan includes:

- Fixed monthly base cost
- 100 public IPs included
- Overage charges for additional IPs beyond the included threshold

Cost Comparison Example

Consider a customer with 130 public IPs requiring enhanced protection:

Configuration | Plans | Base Cost | Overage | Total Monthly Cost
Two separate plans | 2 | $2,944 × 2 = $5,888 | $0 | ~$5,888
Single consolidated plan | 1 | $2,944 | 30 IPs × $30 = $900 | ~$3,844

Savings: ~$2,044/month ($24,528/year) by consolidating to a single plan. In both cases, the same public IPs receive the same enhanced protection. The cost difference is driven entirely by plan architecture.

How to Consolidate Plans

Use the PowerShell script below to list existing DDoS Protection Plans and associate virtual networks with a consolidated plan. Run this script from Azure Cloud Shell or a local PowerShell session with the Az module (https://learn.microsoft.com/powershell/azure/install-azure-powershell) installed. The account running the script must have the Network Contributor role (or equivalent) on the virtual networks being modified and Reader access to the DDoS Protection Plan.

# List all DDoS Protection Plans in your tenant
Get-AzDdosProtectionPlan | Select-Object Name, ResourceGroupName, Id

# Associate a virtual network with an existing DDoS Protection Plan
$ddosPlan = Get-AzDdosProtectionPlan -Name "ConsolidatedDDoSPlan" -ResourceGroupName "rg-security"
$vnet = Get-AzVirtualNetwork -Name "vnet-production" -ResourceGroupName "rg-workloads"
$vnet.DdosProtectionPlan = New-Object Microsoft.Azure.Commands.Network.Models.PSResourceId
$vnet.DdosProtectionPlan.Id = $ddosPlan.Id
$vnet.EnableDdosProtection = $true
Set-AzVirtualNetwork -VirtualNetwork $vnet

Preventing Protection Drift

Protection drift occurs when the resources covered by DDoS protection no longer align with the resources that actually need it. This mismatch can result in wasted spend (protecting resources that are no longer critical) or security gaps (missing protection on newly deployed resources).
Common causes include:

- Applications are retired but protection remains
- Test environments persist longer than expected
- Ownership changes without updating protection configuration

Quarterly Review Checklist

- List all public IPs with enhanced protection enabled
- Verify each protected IP maps to an active, production workload
- Confirm workload criticality justifies enhanced protection
- Review ownership tags and update as needed
- Remove protection from decommissioned or non-critical resources
- Validate DDoS Protection Plan consolidation opportunities

Sample Query: List Protected Public IPs

Use the following PowerShell script to identify all public IPs currently receiving DDoS protection in your environment. This helps you audit which resources are protected and spot candidates for removal. Run this from Azure Cloud Shell or a local PowerShell session with the Az module installed. The account must have Reader access to the subscriptions being queried.

# List all public IPs with DDoS protection enabled
Get-AzPublicIpAddress | Where-Object {
    $_.DdosSettings.ProtectionMode -eq "Enabled" -or
    ($_.IpConfiguration -and
        (Get-AzVirtualNetwork | Where-Object { $_.EnableDdosProtection -eq $true }).Subnets.IpConfigurations.Id -contains $_.IpConfiguration.Id)
} | Select-Object Name, ResourceGroupName, IpAddress, @{N='Tags';E={$_.Tag | ConvertTo-Json -Compress}}

For a comprehensive assessment of all public IPs and their DDoS protection status across your environment, use the DDoS Protection Assessment Tool.

Making Enhanced Protection Costs Observable

Ongoing visibility into DDoS Protection costs enables proactive optimization rather than reactive bill shock. When costs are surfaced early, you can spot scope creep before it impacts your budget, attribute spending to specific workloads, and measure whether your optimization efforts are paying off. The following sections cover three key capabilities: budget alerts to notify you when spending exceeds thresholds, Azure Resource Graph queries to analyze protection coverage, and tagging strategies to attribute costs by workload.

Setting Up Cost Alerts

1. Navigate to Azure Cost Management + Billing
2. Select Cost alerts > Add
3. Configure:
   - Scope: Subscription or resource group
   - Budget amount: Based on expected DDoS Protection spend
   - Alert threshold: 80%, 100%, 120%
   - Action group: Email security and finance teams

Tagging Strategy for Cost Attribution

Apply consistent tags to track DDoS protection costs by workload:

# Tag public IPs for cost attribution
$pip = Get-AzPublicIpAddress -Name "pip-webapp" -ResourceGroupName "rg-production"
$tags = @{
    "CostCenter" = "IT-Security"
    "Workload" = "CustomerPortal"
    "Environment" = "Production"
    "DDoSProtectionTier" = "NetworkProtection"
}
Set-AzPublicIpAddress -PublicIpAddress $pip -Tag $tags

Summary

This guide covered how to consolidate DDoS Protection Plans to avoid paying multiple base costs, select the appropriate SKU based on IP count and feature needs, apply protection selectively with IP linking, and prevent configuration drift through regular reviews. These practices help ensure you're paying only for the protection your workloads actually need.

References

- Review Azure DDoS Protection pricing
- Enable DDoS Network Protection for a virtual network
- Configure DDoS IP Protection
- Configure Cost Management alerts
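As a scripted companion to the portal-based cost alert steps in this article, a budget with notification thresholds can also be created with PowerShell. This is a minimal sketch, assuming the Az.Billing module and subscription scope; the budget name, amount, and email addresses are placeholder assumptions to adapt to your environment.

# Create a monthly cost budget that emails at 80% of the expected
# DDoS Protection spend. Values below are illustrative placeholders.
New-AzConsumptionBudget -Name "budget-ddos-protection" `
    -Amount 4000 -Category Cost -TimeGrain Monthly `
    -StartDate (Get-Date -Day 1).Date `
    -NotificationKey "SecOpsAlert" -NotificationEnabled `
    -NotificationThreshold 80 `
    -ContactEmail "secops@contoso.com","finance@contoso.com"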
DNS best practices for implementation in Azure Landing Zones

Why DNS architecture matters in Landing Zones

A well-designed DNS layer is the glue that lets workloads in disparate subscriptions discover one another quickly and securely. Getting it right during your Azure Landing Zone rollout avoids painful refactoring later, especially once you start enforcing Zero-Trust and hub-and-spoke network patterns.

Typical Landing-Zone topology

Subscription | Typical Role | Key Resources
Connectivity (Hub) | Transit, routing, shared security | Hub VNet, Azure Firewall / NVA, VPN/ER gateways, Private DNS Resolver
Security | Security tooling & SOC | Sentinel, Defender, Key Vault (HSM)
Shared Services | Org-wide shared apps | ADO and agents, Automation
Management | Ops & governance | Log Analytics, backup, etc.
Identity | Directory and auth services | Extended domain controllers, Azure AD DS

Each of the five subscriptions contains a single VNet. Spokes (Security, Shared, Management, Identity) are peered to the Connectivity VNet, forming the classic hub-and-spoke.

Centralized DNS with mandatory firewall inspection

Objective: All network communication from a spoke must cross the firewall in the hub, including DNS communication.

Design Element | Best-Practice Configuration
Private DNS Zones | Link only to the Connectivity VNet. Spokes have no direct zone links.
Private DNS Resolver | Deploy inbound + outbound endpoints in the Connectivity VNet. Link the Connectivity virtual network to the outbound resolver endpoint.
Spoke DNS Settings | Set custom DNS servers on each spoke VNet equal to the inbound endpoint’s IPs.
Forwarding Ruleset | Create a ruleset, associate it with the outbound endpoint, and add forwarders: specific domains → on-prem / external servers; wildcard “.” → on-prem DNS (for compliance scenarios).
Firewall Rules | Allow UDP/TCP 53 from spokes to the resolver inbound endpoint, and from the resolver outbound endpoint to the target DNS servers.

Note: An Azure private DNS zone is a global resource, meaning a single private DNS zone can be used to resolve DNS queries for resources deployed in multiple regions. A DNS private resolver is a regional resource, meaning it can only link to virtual networks within the same region.

Traffic flow

1. Spoke VM → inbound endpoint (hub).
2. The firewall receives the packet, based on the spoke's UDR configuration, and processes it before it is sent to the inbound endpoint IP.
3. The resolver applies forwarding rules to unresolved DNS queries; unresolved queries leave via the outbound endpoint. DNS forwarding rulesets provide a way to route queries for specific DNS namespaces to designated custom DNS servers.

Fallback to internet and NXDOMAIN redirect

Azure Private DNS now supports two powerful features to enhance name resolution flexibility in hybrid and multi-tenant environments. Fallback to internet:

- Purpose: Allows Azure to resolve DNS queries using public DNS if no matching record is found in the private DNS zone.
- Use case: Ideal when your private DNS zone doesn't contain all possible hostnames (e.g., partial zone coverage or phased migrations).
- How to enable: Go to Azure private DNS zones -> Select zone -> Virtual network link -> Edit option

Ref article: https://learn.microsoft.com/en-us/azure/dns/private-dns-fallback

Centralized DNS - when firewall inspection isn’t required

Objective: DNS queries are not monitored via the firewall and can bypass it.

- Link every spoke virtual network directly to the required Private DNS Zones so that spokes can resolve PaaS resources directly.
- Keep a single Private DNS Resolver (optional) for on-prem name resolution; spokes can reach its inbound endpoint privately or via VNet peering.
- Spoke-level custom DNS: This can point to extended domain controllers placed within the Identity virtual network.

This pattern reduces latency and cost but still centralizes zone management.

Integrating on-premises Active Directory DNS

Create conditional forwarders on each domain controller for every Private DNS Zone, pointing them to the DNS private resolver inbound endpoint IP address (e.g., blob.core.windows.net, database.windows.net). Do not include the literal privatelink label. Ref article: https://github.com/dmauser/PrivateLink/tree/master/DNS-Integration-Scenarios#43-on-premises-dns-server-conditional-forwarder-considerations

Note: Avoid selecting the option “Store this conditional forwarder in Active Directory and replicate as follows” in environments with multiple Azure subscriptions and domain controllers deployed across different Azure environments.

Key takeaways

- Linking zones exclusively to the connectivity subscription's virtual network keeps firewall inspection and egress control simple.
- Private DNS Resolver plus forwarding rulesets let you shape hybrid name resolution without custom appliances.
- When no inspection is needed, direct zone links to spokes cut hops and complexity.
- For on-prem AD DNS, a conditional forwarder pointing to the resolver's inbound endpoint IP is required; exclude the privatelink label when creating the conditional forwarder, and do not replicate the conditional forwarder zone with AD replication if the customer has a footprint in multiple Azure tenants.

Plan your DNS early, bake it into your infrastructure-as-code, and your landing zone will scale cleanly no matter how many spokes join the hub tomorrow.
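To make the conditional-forwarder takeaway concrete, here is a minimal sketch for a Windows Server DNS domain controller using the DnsServer module; the inbound endpoint address 10.100.0.4 and the zone list are placeholder assumptions. Note that the zone names omit the privatelink label, and no -ReplicationScope is specified, so the forwarders are not stored in Active Directory.

# Run on each on-premises domain controller / DNS server.
$inboundEndpointIp = "10.100.0.4"   # placeholder: resolver inbound endpoint IP

foreach ($zone in "blob.core.windows.net", "database.windows.net") {
    # Omitting -ReplicationScope keeps the forwarder local to this server
    # instead of replicating it through Active Directory.
    Add-DnsServerConditionalForwarderZone -Name $zone -MasterServers $inboundEndpointIp
}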
Unlock outbound traffic insights with Azure StandardV2 NAT Gateway flow logs

Recommended Outbound Connectivity

StandardV2 NAT Gateway is the next evolution of outbound connectivity in Azure. As the recommended solution for providing secure, reliable outbound Internet access, NAT Gateway continues to be the default choice for modern Azure deployments. With the highly anticipated general availability of the new StandardV2 SKU, customers gain access to the following highly requested upgrades:

- Zone-redundancy: Automatically maintains outbound connectivity during single‑zone failures in AZ-enabled regions.
- Enhanced performance: Up to 100 Gbps of throughput and 10 million packets per second - double the Standard SKU capacity.
- Dual-stack support: Attach up to 16 IPv6 and 16 IPv4 public IP addresses for future-ready connectivity.
- Flow logs: Access historical logs of connections being established through your NAT gateway.

This blog will focus on how enabling StandardV2 NAT Gateway flow logs can be beneficial for your team, along with some tips to get the most out of the data.

What are flow logs?

StandardV2 NAT Gateway flow logs are enabled through Diagnostic settings on your NAT gateway resource, where the log data can be sent to a Log Analytics, storage account, or Event Hub destination. “NatGatewayFlowlogV1” is the released log category, and it provides IP-level information on traffic flowing through your StandardV2 NAT gateway. (Screenshot: enabling NAT Gateway flow logs through the Diagnostic settings of a StandardV2 NAT gateway resource.)

Why should I use flow logs?

Security and compliance visibility

Prior to NAT gateway flow logs, customers could not see NAT gateway information when their virtual machines connect outbound. This made it difficult to:

- Validate that only approved destinations were being accessed
- Audit suspicious or unexpected outbound patterns
- Satisfy compliance requirements that mandate traffic recording

Flow logs now provide visibility into the source IP -> NAT gateway outbound IP -> destination IP path, along with details on sent/dropped packets and bytes.

Usage analytics

Flow logs allow you to answer usage questions such as:

- Which VMs are generating the most outbound requests?
- Which destinations receive the most traffic?
- Is throughput growth caused by a specific workload pattern?

This level of insight is especially useful when debugging unexpected throughput increases, billing spikes, and connection bottlenecks. To note: Flow logs only capture established connections. This means the TCP 3‑way handshake (SYN → SYN/ACK → ACK) or the UDP ephemeral session setup must complete. If a connection never establishes, for example due to NSG denial, routing mismatch, or SNAT exhaustion, it will not appear in flow logs.

Workflow of troubleshooting with flow logs

Let's walk through how you can leverage flow logs to troubleshoot a scenario where you are seeing intermittent connection drops. Scenario: You have VMs that use a StandardV2 NAT gateway to reach the Internet. However, your VMs intermittently fail to reach github.com.

Step 1: Check NAT gateway health

Start with the datapath availability metric, which reflects the NAT gateway's overall health.

- If the metric is > 90%, this confirms the NAT gateway is healthy and is working as expected to send outbound traffic to the internet. Continue to Step 2.
- If the metric is lower, visit Troubleshoot Azure NAT Gateway connectivity - Azure NAT Gateway | Microsoft Learn for troubleshooting tips.
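For Step 1, the datapath availability metric can also be checked from PowerShell instead of the portal. Below is a minimal sketch, assuming the Az.Network and Az.Monitor modules and placeholder resource names; the metric name follows the portal's Datapath Availability metric, so verify the exact name in your environment.

# Pull the NAT gateway's datapath availability for the last hour.
$natGw = Get-AzNatGateway -Name "natgw-prod" -ResourceGroupName "rg-network"

(Get-AzMetric -ResourceId $natGw.Id -MetricName "DatapathAvailability" `
    -TimeGrain 00:01:00 -StartTime (Get-Date).AddHours(-1)).Data |
    Measure-Object -Property Average -Minimum -Average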
Step 2: Enable StandardV2 NAT Gateway Flow Logs

To further investigate the root cause, enable StandardV2 NAT Gateway flow logs (the NatGatewayFlowLogsV1 log category in Diagnostic settings) for the NAT gateway resource providing outbound connectivity for the impacted VMs. It is recommended to enable Log Analytics as a destination, as it allows you to easily query the data. For the detailed steps, visit Monitor with StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn. Tip: You may enable flow logs even when not troubleshooting, to ensure you’ll have historical data to reference when issues occur.

Step 3: Confirm whether the connection was established

Use Log Analytics to query for flows with source IP == VM private IP and destination IP == IP address(es) of github.com. The following query will generate a table and chart of the total packets sent per minute from your source IP to the destination IP through your NAT gateway in the last 24 hours.

NatGatewayFlowlogsV1
| where TimeGenerated > ago(1d)
| where SourceIP == '10.0.0.4' //and DestinationIP == <"github.com IP">
| summarize TotalPacketsSent = sum(PacketsSent) by TimeGenerated = bin(TimeGenerated, 1m), SourceIP, DestinationIP
| order by TimeGenerated asc

If there are no records of this connection, it is likely an issue with establishing the connection, because flow logs only capture records of established connections. Take a look at the SNAT connection metrics to determine whether it may be a SNAT port exhaustion issue or NSGs/UDRs that may be blocking the traffic. If there are records of the connection, proceed with the next step.

Step 4: Check if there are any packets dropped

In Log Analytics, query for the total PacketsSentDropped and PacketsReceivedDropped per source/outbound/destination IP connection.

- If PacketsSentDropped > 0, the NAT gateway dropped traffic sent from your VM.
- If PacketsReceivedDropped > 0, the NAT gateway dropped traffic received from the destination IP, github.com in this case.

In both instances, it typically means that either the client or server is pushing more traffic through a single connection than is optimal, causing connection-level rate limiting. To mitigate:

- Avoid relying on one connection and instead use multiple connections.
- Distribute traffic across multiple outbound IP addresses by assigning more public IP addresses to the NAT gateway resource.

Conclusion

StandardV2 NAT Gateway flow logs unlock a powerful new dimension of outbound visibility, and they can help you:

- Validate cybersecurity readiness
- Audit outbound flows
- Diagnose intermittent connectivity issues
- Understand traffic patterns and optimize architecture

We are excited to see how you leverage this new capability with your StandardV2 NAT gateways! Have more questions? As always, for any feedback, please feel free to reach us by submitting your feedback. We look forward to hearing your thoughts and hope this announcement helps you build more resilient applications in Azure. For more information on StandardV2 NAT Gateway flow logs and how to enable them, visit:

- Manage StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn
- Monitor with StandardV2 NAT Gateway Flow Logs - Azure NAT Gateway | Microsoft Learn

To see the most up-to-date pricing for flow logs, visit Azure NAT Gateway - Pricing | Microsoft Azure. To learn more about StandardV2 NAT Gateway, visit What is Azure NAT Gateway? | Microsoft Learn.
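Returning to Step 4 of the workflow above: the article names the dropped-packet columns but leaves the query to the reader, so here is a minimal Log Analytics sketch of that check. It reuses the placeholder source IP from Step 3; table and column names are as used in this article.

NatGatewayFlowlogsV1
| where TimeGenerated > ago(1d)
| where SourceIP == '10.0.0.4'
| summarize TotalSentDropped = sum(PacketsSentDropped),
            TotalReceivedDropped = sum(PacketsReceivedDropped)
    by SourceIP, DestinationIP
| where TotalSentDropped > 0 or TotalReceivedDropped > 0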
Azure Front Door: Implementing lessons learned following October outages

Abhishek Tiwari, Vice President of Engineering, Azure Networking
Amit Srivastava, Principal PM Manager, Azure Networking
Varun Chawla, Partner Director of Engineering

Introduction

Azure Front Door is Microsoft's advanced edge delivery platform encompassing Content Delivery Network (CDN), global security, and traffic distribution into a single unified offering. By using Microsoft's extensive global edge network, Azure Front Door ensures efficient content delivery and advanced security through 210+ global and local points of presence (PoPs) strategically positioned close to both end users and applications. As the central global entry point from the internet onto customer applications, we power mission-critical customer applications as well as many of Microsoft’s internal services. We have a highly distributed, resilient architecture, which protects against failures at the server, rack, site, and even regional level. This resiliency is achieved by the use of our intelligent traffic management layer, which monitors failures and load balances traffic at the server, rack, or edge-site level within the primary ring, supplemented by a secondary fallback ring which accepts traffic in case of primary traffic overflow or broad regional failures. We also deploy a traffic shield as a terminal safety net to ensure that in the event of a managed or unmanaged edge site going offline, end-user traffic continues to flow to the next available edge site. Like any large-scale CDN, we deploy each customer configuration across a globally distributed edge fleet, densely shared with thousands of other tenants. While this architecture enables global scale, it carries the risk that certain incompatible configurations, if not contained, can propagate broadly and quickly, which can result in a large blast radius of impact. Here we describe how the two recent service incidents impacting Azure Front Door have reinforced the need to accelerate ongoing investments in hardening our resiliency and tenant isolation strategy, to mitigate the likelihood and the scale of impact from this class of risk.

October incidents: recap and key learnings

Azure Front Door experienced two service incidents, on October 9th and October 29th, both with customer-impacting service degradation.

On October 9th: A manual cleanup of stuck tenant metadata bypassed our configuration protection layer, allowing incompatible metadata to propagate beyond our canary edge sites. This metadata was created on October 7th, from a control-plane defect triggered by a customer configuration change. While the protection system initially blocked the propagation, the manual override operation bypassed our safeguards. This incompatible configuration reached the next stage and activated a latent data-plane defect in a subset of edge sites, causing availability impact primarily across Europe (~6%) and Africa (~16%). You can learn more about this issue in detail at https://aka.ms/AIR/QNBQ-5W8

On October 29th: A different sequence of configuration changes across two control-plane versions produced incompatible metadata. Because the failure mode in the data plane was asynchronous, the health-check validations embedded in our protection systems all passed during the rollout. The incompatible customer configuration metadata successfully propagated globally through a staged rollout and also updated the “last known good” (LKG) snapshot. Following this global rollout, the asynchronous process in the data plane exposed another defect, which caused crashes.
This impacted connectivity and DNS resolution for all applications onboarded to our platform. Extended recovery time amplified the impact on customer applications and Microsoft services. You can learn more about this issue in detail at https://aka.ms/AIR/YKYN-BWZ

We took away a number of clear and actionable lessons from these incidents, which are applicable not just to our service, but to any multi-tenant, high-density, globally distributed system.

- Configuration resiliency – Valid configuration updates should propagate safely, consistently, and predictably across our global edge, while ensuring that incompatible or erroneous configurations never propagate beyond canary environments.
- Data plane resiliency – Additionally, configuration processing in the data plane must not cause availability impact to any customer.
- Tenant isolation – Traditional isolation techniques such as hardware partitioning and virtualization are impractical at edge sites. This requires innovative sharding techniques to ensure single-tenant-level isolation – a must-have to reduce the potential blast radius.
- Accelerated and automated recovery time objective (RTO) – The system should be able to automatically revert to the last known good configuration within an acceptable RTO. In the case of a service like Azure Front Door, we deem ~10 minutes to be a practical RTO for our hundreds of thousands of customers at every edge site.

Post-outage, given the severity of an impact that allowed an incompatible configuration to propagate globally, we made the difficult decision to temporarily block configuration changes in order to expedite the rollout of additional safeguards. Between October 29th and November 5th, we prioritized and deployed immediate hardening steps before re-enabling configuration changes. We are confident that the system is stable, and we are continuing to invest in additional safeguards to further strengthen the platform's resiliency.

Learning category | Goal | Repairs | Status
Safe customer configuration deployment | Incompatible configuration never propagates beyond canary | Control-plane and data-plane defect fixes; forced synchronous configuration processing; additional stages with extended bake time; early detection of crash state | Completed
Data plane resiliency | Configuration processing cannot impact data plane availability | Manage the data-plane lifecycle to prevent outages caused by configuration-processing defects | Completed
Data plane resiliency | Configuration processing cannot impact data plane availability | Isolated work process in every data plane server to process and load the configuration | January 2026
100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent active/active fleet with automatic failover for critical Azure services | Phase 1: Onboarded the batch of critical services impacted in the Oct 29th outage, running on a day-old configuration | Completed
100% Azure Front Door resiliency posture for Microsoft internal services | Microsoft operates an isolated, independent active/active fleet with automatic failover for critical Azure services | Phase 2: Automation and hardening of operations, auto-failover, and self-management of Azure Front Door onboarding for additional services | March 2026
Recovery improvements | Data plane crash recovery in under 10 minutes | Data plane boot-up time optimized via local cache (~1 hour) | Completed
Recovery improvements | Data plane crash recovery in under 10 minutes | Accelerate recovery time to < 10 minutes | March 2026
Tenant isolation | No configuration or traffic regression can impact other tenants | Micro-cellular Azure Front Door with ingress layered shards | June 2026

This blog is the first in a multi-part series on Azure Front Door resiliency. In this blog, we will focus on configuration resiliency—how we are making the configuration pipeline safer and more robust.
Subsequent blogs will cover tenant isolation and recovery improvements.

How our configuration propagation works

Azure Front Door configuration changes can be broadly classified into three distinct categories.

- Service code & data – These include all aspects of the Azure Front Door service, like the management plane, control plane, data plane, and configuration propagation system. Azure Front Door follows a safe deployment practice (SDP) process to roll out newer versions of the management, control, or data plane over a period of approximately 2-3 weeks. This ensures that any regression in software does not have a global impact. However, latent bugs that escape pre-validation and SDP rollout can remain undetected until a specific combination of customer traffic patterns or configuration changes triggers the issue.
- Web Application Firewall (WAF) & L7 DDoS platform data – These datasets are used by Azure Front Door to deliver security and load-balancing capabilities. Examples include GeoIP data, malicious attack signatures, and IP reputation signatures. Updates to these datasets occur daily through multiple SDP stages with an extended bake time of over 12 hours to minimize the risk of global impact during rollout. This dataset is shared across all customers and the platform, and it is validated immediately since it does not depend on variations in customer traffic or configuration steps.
- Customer configuration data – This covers any customer configuration change—whether a routing rule update, backend pool modification, WAF rule change, or security policy change. Due to the nature of these changes, it is expected across the edge delivery / CDN industry to propagate these changes globally in 5-10 minutes. Both outages stemmed from issues within this category.

All configuration changes, including customer configuration data, are processed through a multi-stage pipeline designed to ensure correctness before global rollout across Azure Front Door’s 200+ edge locations. At a high level, Azure Front Door’s configuration propagation system has two distinct components:

- Control plane – Accepts customer API/portal changes (create/update/delete for profiles, routes, WAF policies, origins, etc.) and translates them into internal configuration metadata which the data plane can understand.
- Data plane – Globally distributed edge servers that terminate client traffic, apply routing/WAF logic, and proxy to origins using the configuration produced by the control plane.

Between these two halves sits a multi-stage configuration rollout pipeline with a dedicated protection system (known as ConfigShield):

- Changes flow through multiple stages (pre-canary, canary, expanding waves to production) rather than going global at once.
- Each stage is health-gated: the data plane must remain within strict error and latency thresholds before proceeding. Each stage’s health check also rechecks the previous stage’s health for any regressions.
- A successfully completed rollout updates a last known good (LKG) snapshot used for automated rollback.

Historically, rollout targeted global completion in roughly 5–10 minutes, in line with industry standards.

Customer configuration processing in the Azure Front Door data plane stack

Customer configuration changes in Azure Front Door traverse multiple layers—from the control plane through the deployment system—before being converted into FlatBuffers at each Azure Front Door node. These FlatBuffers are then loaded by the Azure Front Door data plane stack, which runs as Kubernetes pods on every node.
FlatBuffer composition: Each FlatBuffer references several sub-resources such as WAF and Rules Engine schematic files, SSL certificate objects, and URL signing secrets.

Data plane architecture:

- Master process: Accepts configuration changes (memory-mapped files with references) and manages the lifecycle of worker processes.
- Workers: L7 proxy processes that serve customer traffic using the applied configuration.

Processing flow for each configuration update:

1. Load and apply in master: The transformed configuration is loaded and applied in the master process. Cleanup of unused references occurs synchronously except for certain categories → the October 9 outage occurred during this step due to a crash triggered by incompatible metadata.
2. Apply to workers: Configuration is applied to all worker processes without memory overhead (FlatBuffers are memory-mapped).
3. Serve traffic: Workers start consuming new FlatBuffers for new requests; in-flight requests continue using old buffers. Old buffers are queued for cleanup post-completion.
4. Feedback to deployment service: Positive feedback signals readiness for rollout.
5. Cleanup: FlatBuffers are freed asynchronously by the master process after all workers load updates → the October 29 outage occurred during this step due to a latent bug in reference-counting logic.

The October incidents showed we needed to strengthen key aspects of configuration validation, propagation safeguards, and runtime behavior. During the Azure Front Door incident on October 9th, that protection system worked as intended but was later bypassed by our engineering team during a manual cleanup operation. During the Azure Front Door incident on October 29th, the incompatible customer configuration metadata progressed through the protection system, before the delayed asynchronous processing task resulted in the crash.

Configuration propagation safeguards

Based on learnings from the incidents, we are implementing a comprehensive set of configuration resiliency improvements. These changes aim to guarantee that any sequence of configuration changes cannot trigger instability in the data plane, and to ensure quicker recovery in the event of anomalies.

Strengthening configuration generation safety

This improvement pivots on a ‘shift-left’ strategy where we want to ensure that we catch regressions early, before they propagate to production. It also includes fixing the latent defects which were the proximate cause of the outage.

- Fixing outage-specific defects - We have fixed the control-plane defects that could generate incompatible tenant metadata under specific operation sequences. We have also remediated the associated data-plane defects.
- Stronger cross-version validation - We are expanding our test and validation suite to account for changes across multiple control-plane build versions. This is expected to be fully completed by February 2026.
- Fuzz testing - Automated fuzzing and testing of the metadata generation contract between the control plane and the data plane. This allows us to generate an expanded set of invalid/unexpected configuration combinations which might not be achievable by traditional test cases alone. This is expected to be fully completed by February 2026.

Preventing incompatible configurations from being propagated

This segment of the resiliency strategy strives to ensure that a potentially dangerous configuration change never propagates beyond the canary stage.
- Protection system is “always-on” - Enhancements to operational procedures and tooling prevent bypass in all scenarios (including internal cleanup/maintenance), and any cleanup must flow through the same guarded stages and health checks as standard configuration changes. This is completed.
- Making rollout behavior more predictable and conservative - Configuration processing in the data plane is now fully synchronous. Every data plane issue due to incompatible metadata can be detected within 10 seconds at every stage. This is completed.
- Enhancement to the deployment pipeline - Additional stages during rollout and extended bake time between stages serve as an additional safeguard during configuration propagation. This is completed.
- Recovery tool improvements now make it easier to revert to any previous version of the LKG with a single click. This is completed.

These changes significantly improve system safety. Post-outage, we have increased the configuration propagation time to approximately 45 minutes. We are working towards reducing configuration propagation time closer to pre-incident levels once the additional safeguards covered in the Data plane resiliency section below are completed by mid-January 2026.

Data plane resiliency

The data plane recovery was the toughest part of the recovery efforts during the October incidents. We must ensure fast recovery as well as resilience to configuration-processing-related issues for the data plane. To address this, we implemented changes that decouple the data plane from incompatible configuration changes. With these enhancements, the data plane continues operating on the last known good configuration—even if the configuration pipeline safeguards fail to protect as intended.

Decoupling the data plane from configuration changes

Each server’s data plane consists of a master process, which accepts configuration changes and manages the lifecycle of multiple worker processes, which serve customer traffic. One of the critical reasons for the prolonged outage in October was that, due to latent defects in the data plane, the master process crashed when presented with a bad configuration. The master is a critical command-and-control process, and when it crashes it takes down the entire data plane on that node. Recovery of the master process involves reloading hundreds of thousands of configurations from scratch and took approximately 4.5 hours. We have since made changes to the system to ensure that even in the event of a master process crash for any reason - including incompatible configuration data being presented - the workers remain healthy and able to serve traffic. During such an event, the workers would not be able to accept new configuration changes but will continue to serve customer traffic using the last known good configuration. This work is completed.

Introducing Food Taster: strengthening config propagation resiliency

In our efforts to further strengthen Azure Front Door’s configuration propagation system, we are introducing an additional configuration safeguard known internally as Food Taster, which protects the master and worker processes from any configuration-change-related incidents, thereby ensuring data plane resiliency. The principle is simple: every data-plane server will have a redundant and isolated process – the Food Taster – whose only job is to ingest and process new configuration metadata first, and then pass validated configuration changes to the active data plane. This redundant worker does not accept any customer traffic.
All configuration processing in this Food Taster is fully synchronous. That means we do all parsing, validation, and any expensive or risky work up front, and we do not move on until the Food Taster has either proven the configuration is safe or rejected it. Only when the Food Taster successfully loads the configuration and returns “Config OK” does the master process proceed to load the same config and then instruct the worker processes to do the same. If anything goes wrong in the Food Taster, the failure is contained to that isolated worker; the master and traffic-serving workers never see that invalid configuration. We expect this safeguard to reach production globally in the January 2026 timeframe. Introduction of this component will also allow us to return closer to pre-incident configuration propagation times while ensuring data plane safety.

Closing

This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO). We deeply value our customers’ trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we’re making Azure Front Door even more resilient so your applications can be too.
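The Food Taster pattern described above, validating a new configuration in an isolated, expendable process before the traffic-serving processes ever load it, generalizes well beyond Azure Front Door. Below is a minimal conceptual sketch of the pattern in PowerShell. This is not AFD's actual implementation: the file paths, the validation logic, and the promotion step are all illustrative assumptions.

# Conceptual "food taster": parse and validate a candidate config in a
# separate background-job process, then promote it only on clean success.
$candidate = "C:\configs\candidate.json"   # placeholder paths
$live      = "C:\configs\live.json"

$taster = Start-Job -ScriptBlock {
    param($path)
    # All risky work happens here, isolated from the serving process.
    $cfg = Get-Content $path -Raw | ConvertFrom-Json
    if (-not $cfg.routes) { throw "no routes defined" }   # example check
} -ArgumentList $candidate

if ((Wait-Job $taster -Timeout 10) -and $taster.State -eq 'Completed') {
    Copy-Item $candidate $live   # "Config OK": workers may now load it
} else {
    Write-Warning "Candidate config rejected; keeping last known good."
}
Remove-Job $taster -Force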
Closing
This is the first in a series of planned blogs on Azure Front Door resiliency enhancements. We are continuously improving platform safety and reliability and will transparently share updates through this series. Upcoming posts will cover advancements in tenant isolation and improvements to recovery time objectives (RTO).
We deeply value our customers’ trust in Azure Front Door. The October incidents reinforced how critical configuration resiliency is, and we are committed to exceeding industry expectations for safety, reliability, and transparency. By hardening our configuration pipeline, strengthening safety gates, and reinforcing isolation boundaries, we’re making Azure Front Door even more resilient so your applications can be too.
Advanced Container Apps Networking: VNet Integration and Centralized Firewall Traffic Logging
Azure community, I recently documented a networking scenario relevant to Azure Container Apps environments where you need to control and inspect application traffic using a third-party network virtual appliance. The article walks through a practical deployment pattern:
• Integrate your Azure Container Apps environment with a Virtual Network.
• Configure user-defined routes (UDRs) so that traffic from your container workloads is directed toward a firewall appliance before reaching external networks or backend services (see the sketch after this list).
• Verify actual traffic paths using firewall logs to confirm that routing policies are effective.
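As a hedged illustration of the UDR step, the Python sketch below creates a route table with a default route pointing at a firewall appliance and attaches it to the workload subnet. All names, the region, and the firewall IP are placeholders, and the exact SDK surface may vary between azure-mgmt-network versions.

```python
from azure.identity import DefaultAzureCredential
from azure.mgmt.network import NetworkManagementClient

# Placeholder identifiers - substitute your own.
SUB, RG = "<subscription-id>", "my-rg"
VNET, SUBNET = "aca-vnet", "aca-subnet"

net = NetworkManagementClient(DefaultAzureCredential(), SUB)

# Route table with a 0.0.0.0/0 route whose next hop is the firewall NVA.
rt = net.route_tables.begin_create_or_update(
    RG, "rt-aca-egress",
    {
        "location": "eastus",
        "routes": [{
            "name": "default-to-firewall",
            "address_prefix": "0.0.0.0/0",
            "next_hop_type": "VirtualAppliance",
            "next_hop_ip_address": "10.0.1.4",  # the NVA's private IP
        }],
    },
).result()

# Associate the route table with the Container Apps subnet.
subnet = net.subnets.get(RG, VNET, SUBNET)
subnet.route_table = rt
net.subnets.begin_create_or_update(RG, VNET, SUBNET, subnet).result()
```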
This pattern is helpful for organizations that must enforce advanced filtering, logging, or compliance checks on container egress/ingress traffic, going beyond what native Azure networking controls provide. It also complements Azure Firewall and NSG controls by introducing a dedicated next-generation firewall within your VNet.
If you’re working with network control, security perimeters, or hybrid network architectures involving containerized workloads on Azure, you might find it useful. Read the full article on my blog.
Data Center Quantized Congestion Notification: Scaling congestion control for RoCE RDMA in Azure
As cloud storage demands continue to grow, the need for ultra-fast, reliable networking becomes ever more critical. Microsoft Azure’s journey to empower its storage infrastructure with RDMA (Remote Direct Memory Access) has been transformative, but it’s not without challenges - especially when it comes to congestion control at scale. Azure’s deployment of RDMA at regional scale relies on DCQCN (Data Center Quantized Congestion Notification), a protocol that has become central to Azure’s ability to deliver high-throughput, low-latency storage services across vast, heterogeneous data center regions.
Why congestion control matters in RDMA networks
RDMA offloads the network stack to NIC hardware, reducing CPU overhead and enabling near line-rate performance. However, as Azure scaled RDMA across clusters and regions, it faced new challenges:
Heterogeneous hardware: Different generations of RDMA NICs (Network Interface Cards) and switches, each with their own quirks.
Variable latency: Long-haul links between datacenters introduce large round-trip time (RTT) variations.
Congestion risks: High-speed, incast-like traffic patterns can easily overwhelm buffers, leading to packet loss and degraded performance.
To address these, Azure needed a congestion control protocol that could operate reliably across diverse hardware and network conditions. Traditional TCP congestion control mechanisms don’t apply here, so Azure leverages DCQCN combined with Priority Flow Control (PFC) to maintain high throughput, low latency, and near-zero packet loss.
How DCQCN works
DCQCN coordinates congestion control using three main entities:
Reaction point (RP): The sender adjusts its rate based on feedback.
Congestion point (CP): Switches mark packets using ECN when queues exceed thresholds.
Notification point (NP): The receiver sends Congestion Notification Packets (CNPs) upon receiving ECN-marked packets.
This feedback loop allows RDMA flows to dynamically adapt their sending rates, preventing congestion collapse while maintaining fairness. When a switch detects congestion, it marks packets with ECN. The receiver NIC (NP) observes the ECN marks and sends CNPs to the sender. The sender NIC (RP) reduces its sending rate upon receiving CNPs; otherwise, it gradually increases the rate.
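For readers who want the mechanics, the sender-side update rules published in the original DCQCN paper (Zhu et al., SIGCOMM 2015) take roughly the following form; Azure’s production parameter values are not stated here, so treat this as the textbook formulation.

```latex
% On each CNP, the RP saves a target rate and multiplicatively cuts the current rate:
R_T \leftarrow R_C, \qquad R_C \leftarrow R_C \, (1 - \alpha/2), \qquad \alpha \leftarrow (1-g)\,\alpha + g

% If no CNP arrives within the update timer, the congestion estimate decays:
\alpha \leftarrow (1-g)\,\alpha

% Fast recovery then climbs back toward the saved target rate:
R_C \leftarrow \tfrac{1}{2}\,(R_T + R_C)
```

Here R_C is the current sending rate, R_T the saved target rate, α the congestion estimate, and g a configurable gain.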
Interoperability challenges across different hardware generations
Cloud infrastructure evolves incrementally, typically at the level of individual clusters or racks, as newer server hardware generations are introduced. Within a single region, clusters often differ in their NIC configurations. Our deployment includes three generations of commodity RDMA NICs - Gen1, Gen2, and Gen3 - each implementing DCQCN with distinct design variations. These discrepancies create complex and often problematic interactions when NICs from different generations interoperate.
Gen1 NICs: Firmware-based DCQCN, NP-side CNP coalescing, burst-based rate limiting.
Gen2/Gen3 NICs: Hardware-based DCQCN, RP-side CNP coalescing, per-packet rate limiting.
Problem: Gen2/Gen3 NICs sending to Gen1 can trigger excessive cache misses, slowing down Gen1’s receiver pipeline. Gen1 sending to Gen2/Gen3 can cause excessive rate reductions due to frequent CNPs.
Azure’s solution:
Move CNP coalescing to the NP side for Gen2/Gen3.
Implement per-QP CNP rate limiting, matching Gen1’s timer.
Enable per-burst rate limiting on Gen2/Gen3 to reduce cache pressure.
DCQCN tuning: Achieving fairness and performance
DCQCN is inherently RTT-fair - its rate adjustment is independent of round-trip time - making it suitable for Azure’s regional networks with RTTs ranging from microseconds to milliseconds.
Key tuning strategies:
Sparse ECN marking: Use large ECN marking thresholds (K_max - K_min) and low marking probabilities (P_max) for flows with large RTTs; the marking curve is sketched below.
Joint buffer and DCQCN tuning: Tune switch buffer thresholds and DCQCN parameters together to avoid premature congestion signals and optimize throughput.
Global parameter settings: Azure’s NICs support only global DCQCN settings, so parameters must work well across all traffic types and RTTs.
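For context on those thresholds, a switch following the standard RED/ECN-style marking that DCQCN relies on marks an arriving packet with a probability that ramps between the two thresholds as a function of queue depth q; the exact curve on Azure’s switches is an assumption here.

```latex
p(q) =
\begin{cases}
0 & q \le K_{\min} \\
\dfrac{q - K_{\min}}{K_{\max} - K_{\min}} \, P_{\max} & K_{\min} < q < K_{\max} \\
1 & q \ge K_{\max}
\end{cases}
```

Widening K_max - K_min flattens the ramp, which is why sparse marking suits large-RTT flows: fewer marks per RTT means gentler rate cuts.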
Real-world results
High throughput and low latency: RDMA traffic runs at line rate with near-zero packet loss.
CPU savings: Freed CPU cores can be repurposed for customer VMs or application logic.
Performance metrics: RDMA reduces CPU utilization by up to 34.5% compared to TCP for storage frontend traffic. Large I/O requests (1 MB) see up to a 23.8% latency reduction for reads and 15.6% for writes.
Scalability: As of November 2025, approximately 85% of Azure’s traffic is RDMA, supported in all public regions.
Conclusion
DCQCN is a cornerstone of Azure’s RDMA-enabled storage infrastructure, enabling reliable, high-performance cloud storage at scale. By combining ECN-based signaling with dynamic rate adjustments, DCQCN ensures high throughput, low latency, and near-zero packet loss - even across heterogeneous hardware and long-haul links. Its interoperability fixes and careful tuning make it a critical enabler for RDMA adoption in modern data centers, paving the way for efficient, scalable, and resilient cloud storage.
Azure PostgreSQL Lesson Learned #12: Private Endpoint Approval Fails for Cross Subscription
Co-authored with HaiderZ-MSFT
Symptoms
Customers experience issues when attempting to approve a Private Endpoint for Azure PostgreSQL Flexible Server, particularly in cross-subscription or cross-tenant setups:
Private Endpoint remains stuck in the Pending state
Portal approval action fails silently or reverts
Selecting the Private Endpoint displays a “No Access” message
Activity logs show repeated retries followed by failure
Common Error Message
AuthorizationFailed: The client '<object-id>' does not have authorization to perform action 'Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write' over scope '<private-endpoint-resource-id>' or the scope is invalid.
Root Cause
Although the approval action is initiated from the PostgreSQL Flexible Server (the service provider resource), Azure performs additional network-level operations during approval. Specifically, Azure must update a Private Link Service Proxy on the Private Endpoint resource, which exists in the consumer subscription. When the Private Endpoint resides in a different subscription or tenant, the approval process fails if:
Required Resource Providers are not registered, or
The approving identity lacks network-level permissions on the Private Endpoint scope
In this case, the root cause was missing Resource Provider registration, resulting in an AuthorizationFailed error during proxy updates.
Required Resource Providers
Microsoft.Network
Microsoft.DBforPostgreSQL
If either provider is missing on either subscription, the approval process will fail regardless of RBAC configuration.
Mitigation Steps
Step 1: Register Resource Providers (Mandatory)
Register Microsoft.Network and Microsoft.DBforPostgreSQL on both subscriptions. This step alone resolves most cross-subscription approval failures; a scripted sketch follows after Step 3. See Azure resource providers and types - Azure Resource Manager | Microsoft Learn.
Step 2: Validate Network Permissions
Ensure the approving identity can perform Microsoft.Network/privateEndpoints/privateLinkServiceProxies/write. Grant Network Contributor if needed.
Step 3: Refresh Credentials and Retry
If changes were made recently, sign out and sign in again, then retry the Private Endpoint approval.
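For teams that script this readiness check, here is a minimal Python sketch of Step 1 using the Azure SDK (azure-mgmt-resource). The subscription IDs are placeholders; provider registration is asynchronous, so the sketch polls until each provider reports Registered.

```python
import time
from azure.identity import DefaultAzureCredential
from azure.mgmt.resource import ResourceManagementClient

cred = DefaultAzureCredential()

# Register on BOTH subscriptions - consumer and service provider.
for sub_id in ["<consumer-subscription-id>", "<provider-subscription-id>"]:
    rm = ResourceManagementClient(cred, sub_id)
    for ns in ["Microsoft.Network", "Microsoft.DBforPostgreSQL"]:
        rm.providers.register(ns)
        # Registration runs in the background; poll until it completes.
        while rm.providers.get(ns).registration_state != "Registered":
            time.sleep(10)
        print(f"{sub_id}: {ns} registered")
```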
Post-Resolution Outcome
After correcting provider registration and permissions:
Private Endpoint approval succeeds immediately
Connection state transitions from Pending → Approved
No further authorization or retry errors
PostgreSQL connectivity works as expected
Prevention & Best Practices
Pre-register required Resource Providers in landing zones
Validate cross-subscription readiness before creating Private Endpoints
Document service-specific approval requirements (PostgreSQL differs from Key Vault)
Automate provider registration via policy or IaC where possible
Include provider validation in enterprise onboarding checklists
Why This Matters
Missing provider registration can lead to:
Failed Private Endpoint approvals
Confusing authorization errors
Extended troubleshooting cycles
Production delays during go-live
A simple subscription readiness check prevents downstream networking failures that are difficult to diagnose from portal errors alone.
Key Takeaways
Issue: Azure PostgreSQL private endpoint approval fails across subscriptions
Root Cause: Missing Resource Provider registration
Fix: Register Microsoft.Network and Microsoft.DBforPostgreSQL on both subscriptions
Result: Approval succeeds without backend authorization failures
References
Manage Azure Private Endpoints – Azure Private Link
Approve Private Endpoint Connections – Azure Database for PostgreSQL
Private Endpoint Overview – Azure Private Link