Designing Reliable Health Check Endpoints for IIS Behind Azure Application Gateway
Why Health Probes Matter in Azure Application Gateway

Azure Application Gateway relies entirely on health probes to determine whether backend instances should receive traffic. If a probe:

- Receives a non-200 response
- Times out
- Gets redirected
- Requires authentication

…the backend is marked Unhealthy and traffic is stopped, resulting in user-facing errors. A healthy IIS application does not automatically mean a healthy Application Gateway backend.

Failure Flow: How a Misconfigured Health Probe Leads to 502 Errors

One of the most confusing scenarios teams encounter is when the IIS application is running correctly, yet users intermittently receive 502 Bad Gateway errors. This typically happens when health probes fail, causing Azure Application Gateway to mark backend instances as Unhealthy and stop routing traffic to them. The following diagram illustrates this failure flow.

Failure Flow Diagram (Probe Fails → Backend Unhealthy → 502)

Key takeaway: Most 502 errors behind Azure Application Gateway are not application failures; they are health probe failures.

What's Happening Here?

Azure Application Gateway periodically sends health probes to backend IIS instances. If the probe endpoint:

- Redirects to /login
- Requires authentication
- Returns 401 / 403 / 302
- Times out

the probe is considered failed. After consecutive failures, the backend instance is marked Unhealthy. Application Gateway stops forwarding traffic to unhealthy backends. If all backend instances are unhealthy, every client request results in a 502 Bad Gateway, even though IIS itself may still be running.

This is why a dedicated, lightweight, unauthenticated health endpoint is critical for production stability.

Common Health Probe Pitfalls with IIS

Before designing a solution, let's look at what commonly goes wrong.

1. Probing the Root Path (/)
Many IIS applications:
- Redirect / → /login
- Require authentication
- Return 401 / 302 / 403
Application Gateway expects a clean 200 OK, not redirects or auth challenges.

2. Authentication-Enabled Endpoints
Health probes do not support authentication headers. If your app enforces:
- Windows Authentication
- OAuth / JWT
- Client certificates
…the probe will fail.

3. Slow or Heavy Endpoints
Probing a controller that:
- Calls a database
- Performs startup checks
- Loads configuration
can cause intermittent failures, especially under load.

4. Certificate and Host Header Mismatch
TLS-enabled backends may fail probes due to:
- Missing Host header
- Incorrect SNI configuration
- Certificate CN mismatch

Design Principles for a Reliable IIS Health Endpoint

A good health check endpoint should be:
- Lightweight
- Anonymous
- Fast (< 100 ms)
- Always return HTTP 200
- Independent of business logic

Client Browser
      |
      | HTTPS (Public DNS)
      v
+-------------------------------------------------+
| Azure Application Gateway (v2)                  |
|  - HTTPS Listener                               |
|  - SSL Certificate                              |
|  - Custom Health Probe (/health)                |
+-------------------------------------------------+
      |
      | HTTPS (SNI + Host Header)
      v
+-------------------------------------------------------------------+
| IIS Backend VM                                                    |
|                                                                   |
| Site Bindings:                                                    |
|  - HTTPS : app.domain.com                                         |
|                                                                   |
| Endpoints:                                                        |
|  - /health  (Anonymous, Static, 200 OK)                           |
|  - /login   (Authenticated)                                       |
+-------------------------------------------------------------------+

Azure Application Gateway health probe architecture for IIS backends using a dedicated /health endpoint. Azure Application Gateway continuously probes a dedicated /health endpoint on each IIS backend instance. The health endpoint is designed to return a fast, unauthenticated 200 OK response, allowing Application Gateway to reliably determine backend health while keeping application endpoints secure.
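To make the failure flow above concrete, the probe logic (consecutive failed probes mark a backend Unhealthy; an all-unhealthy pool yields 502s) can be sketched as a small state model. This is an illustrative simplification, not Application Gateway's actual implementation; the threshold of 3 matches the probe settings recommended later in this article.

```python
# Illustrative model of health-probe-driven backend state (NOT the real
# Application Gateway implementation). A probe "fails" on timeouts,
# redirects, or auth challenges -- anything other than a clean 200.

UNHEALTHY_THRESHOLD = 3  # consecutive failures before marking Unhealthy


def probe_failed(status_code):
    """A probe fails on timeout (None) or any non-200 response,
    including 302 redirects and 401/403 auth challenges."""
    return status_code != 200


class Backend:
    def __init__(self):
        self.consecutive_failures = 0
        self.healthy = True

    def record_probe(self, status_code):
        if probe_failed(status_code):
            self.consecutive_failures += 1
            if self.consecutive_failures >= UNHEALTHY_THRESHOLD:
                self.healthy = False
        else:
            self.consecutive_failures = 0
            self.healthy = True


def gateway_response(backends):
    """If every backend is unhealthy, clients see 502 Bad Gateway."""
    return 200 if any(b.healthy for b in backends) else 502


# A backend whose /health redirects to /login (302) three times in a row
# is marked Unhealthy even though IIS itself is still running.
b = Backend()
for _ in range(3):
    b.record_probe(302)
```

Note how the application never has to crash for users to see 502s: three consecutive redirects are enough to take the backend out of rotation.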
Step 1: Create a Dedicated Health Endpoint

Recommended path:

    /health

This endpoint should:
- Bypass authentication
- Avoid redirects
- Avoid database calls

Example: Simple IIS Health Page

Create a static file:

    C:\inetpub\wwwroot\website\health\index.html

- Static
- Fast
- Zero dependencies

Step 2: Exclude the Health Endpoint from Authentication

If your IIS site uses authentication, explicitly allow anonymous access to /health.

web.config example:

    <location path="health">
      <system.webServer>
        <security>
          <authentication>
            <anonymousAuthentication enabled="true" />
            <windowsAuthentication enabled="false" />
          </authentication>
        </security>
      </system.webServer>
    </location>

⚠️ This ensures probes succeed even if the rest of the site is secured.

Step 3: Configure Azure Application Gateway Health Probe

Recommended probe settings:

- Protocol: HTTPS
- Path: /health
- Interval: 30 seconds
- Timeout: 30 seconds
- Unhealthy threshold: 3
- Pick host name from backend: Enabled

Why "Pick host name from backend" matters. This ensures:
- Correct Host header
- Proper certificate validation
- Fewer TLS handshake failures

Step 4: Validate Health Probe Behavior

From Application Gateway:
- Navigate to Backend health
- Ensure status shows Healthy
- Confirm response code = 200

From the IIS VM:

    Invoke-WebRequest https://your-app-domain/health

Expected:

    StatusCode : 200

Troubleshooting Common Failures

Probe shows Unhealthy but app works:
✔ Check authentication rules
✔ Verify /health does not redirect
✔ Confirm HTTP 200 response

TLS or certificate errors:
✔ Ensure certificate CN matches backend domain
✔ Enable "Pick host name from backend"
✔ Validate certificate is bound in IIS

Intermittent failures:
✔ Reduce probe complexity
✔ Avoid DB or service calls
✔ Use static content

Production Best Practices

- Use separate health endpoints per application
- Never reuse business endpoints for probes
- Monitor probe failures as early warning signs
- Test probes after every deployment
- Keep health endpoints simple and boring
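The Step 4 validation can also be automated. The sketch below spins up a throwaway local HTTP server standing in for the IIS /health endpoint and checks the three properties a probe cares about: status 200, no redirect, and a fast response. The server, port, and latency threshold here are illustrative assumptions, not part of the original runbook.

```python
import http.server
import threading
import time
import urllib.request


class HealthHandler(http.server.BaseHTTPRequestHandler):
    """Stand-in for a static IIS /health page: anonymous, instant, 200 OK."""
    def do_GET(self):
        if self.path == "/health":
            body = b"OK"
            self.send_response(200)
            self.send_header("Content-Length", str(len(body)))
            self.end_headers()
            self.wfile.write(body)
        else:
            self.send_response(404)
            self.end_headers()

    def log_message(self, *args):  # keep output quiet during checks
        pass


def check_health(url, max_latency_s=0.1):
    """Validate probe-friendliness: 200 OK, no redirect, fast response."""
    start = time.monotonic()
    # urllib's default opener follows redirects; a redirecting /health
    # surfaces as a changed final URL, which we treat as a failure.
    with urllib.request.urlopen(url, timeout=5) as resp:
        elapsed = time.monotonic() - start
        return {
            "status_ok": resp.status == 200,
            "no_redirect": resp.geturl() == url,
            "fast_enough": elapsed < max_latency_s,
        }


# Start a throwaway server on an ephemeral port and probe it once.
server = http.server.HTTPServer(("127.0.0.1", 0), HealthHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()
port = server.server_address[1]
result = check_health(f"http://127.0.0.1:{port}/health")
server.shutdown()
```

The same check_health function could be pointed at a real backend (for example https://your-app-domain/health) as a post-deployment smoke test.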
Final Thoughts

A reliable health check endpoint is not optional when running IIS behind Azure Application Gateway; it is a core part of application availability. By designing a dedicated, authentication-free, lightweight health endpoint, you can eliminate a large class of false outages and significantly improve platform stability. If you're migrating IIS applications to Azure or troubleshooting unexplained Application Gateway failures, start with your health probe; it's often the silent culprit.

Fast cloud migration, measurable ROI: Forrester Total Economic Impact study of Azure VMware Solution
Many organizations are balancing near-term continuity for VMware-based workloads with longer-term cloud modernization goals, all while managing cost, security, and resiliency. Azure VMware Solution (AVS) is built for this moment: a Microsoft-managed service verified by VMware that enables running VMware Cloud Foundation (VCF) workloads (vSphere, NSX-T, vSAN, HCX) on dedicated Azure infrastructure. It gives organizations a practical way to move or extend VMware environments into Azure while maintaining operational consistency and leveraging the skills of existing VMware teams.

To help leaders quantify the potential value of this approach, Microsoft commissioned Forrester Consulting to conduct The Total Economic Impact™ (TEI) of Microsoft Azure VMware Solution (March 2026). The study models the financial impact over three years and risk-adjusts results. Access the full study here: aka.ms/AVS-TEI

Here's what the study found and how IT leaders can use it as a framework for decision-making.

Topline results from the study

Forrester's risk-adjusted financial analysis for a composite organization 1 found:

- 341% ROI over three years 2
- $5.6M net present value (NPV) 3
- <6 months payback 4

These metrics are meaningful on their own, but the bigger story for leadership is where the value comes from: improved operational stability, reduced infrastructure costs driven by data center exit and hardware refresh avoidance, and the ability to redeploy skilled IT resources from maintenance to modernization.

The customer journey: why organizations turn to AVS

AVS offers a bridge: lift and shift VMware workloads into Azure without forcing immediate re-platforming, then modernize at a pace aligned to business priorities. In the study, Forrester interviewed decision-makers with experience using AVS.
Interviewees described common challenges that led them to invest in AVS, including:

- Fragmented systems that complicated and slowed operations: Inherited stacks, duplicated tools, and unclear ownership of orphan machines made operations and governance harder.
- Rising cost and complexity of on-premises operation: Colocation fees, energy and cooling costs, server refresh cycles, and tooling renewals were difficult to justify against cloud economics.
- Limited capacity and skills to refactor at scale: Teams wanted the cost and agility benefits of the cloud but didn't have the time or skills to rewrite hundreds (or thousands) of VMs on aggressive timelines.
- Security and audit pressure: Disparate environments and legacy access models elevated risk and created audit friction.
- Operational variability and end-user experience: VPN dependencies, inconsistent remote tooling, and endpoint logistics led to slow first-call resolution and downtime risks.

Three quantified benefits that drive the business case

1) Reduction in downtime and associated costs by 80%

In the study, interviewees reported that moving VMware workloads to AVS improved day-to-day reliability by eliminating fragile on-premises workflows and leveraging Azure's managed infrastructure. Examples included fewer VPN-related failures, faster issue resolution through centralized tooling, and stronger service-level performance. For leadership teams, this benefit is about more than avoided cost. Better uptime protects customer experience and employee productivity, and reduces the operational noise that can slow modernization programs.

2) Reduced infrastructure costs through data center exit, refresh avoidance, and cleanup

A second driver is the ability to avoid or eliminate significant portions of data center cost and refresh spend. In the study, interviewees described using AVS to close data centers, avoid upcoming hardware refresh cycles, and reduce ongoing capital and operating costs.
Importantly, interviewees also reported that migration waves prompted additional savings through portfolio hygiene: validating each VM, decommissioning redundant systems, and rightsizing oversized workloads. Those actions helped organizations reduce their ongoing compute, storage, and licensing footprint after migration.

3) Redeployment of 50% of IT team members from maintenance to modernization

The TEI study quantifies a practical advantage of a managed VMware environment in Azure: fewer hours spent on hardware lifecycle, cluster patching, upgrades, and other routine data center tasks. In practice, many leaders treat this as capacity created rather than budget eliminated: the opportunity to shift experienced engineers toward modernization, automation, cloud governance, proactive incident prevention, and higher-value business initiatives.

Unquantified benefits organizations should weigh

Beyond the quantified categories, the study also highlights benefits that are strategically important but not fully quantified in the model:

- Acceleration of future modernization: With workloads running in Azure via AVS, organizations can integrate platform services across security, identity, data, and analytics and build a runway for new capabilities, including AI-driven scenarios in Azure.
- Fast, cost-effective migration of legacy workloads: Interviewees described avoiding major consulting or hiring costs that would have been required to refactor complex workloads into cloud-native designs.
- Improved audit readiness and security posture: Consolidating fragmented environments into governed Azure landing zones can simplify audit preparation and strengthen governance and monitoring.

For many leadership teams, these benefits strengthen the business case because they support broader transformation outcomes that extend beyond infrastructure cost alone.
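The ROI, NPV, and payback metrics reported in the topline results translate directly into standard formulas (they are also defined in the study's footnotes). The sketch below computes all three from a purely hypothetical cash-flow series; the numbers are illustrative only and are not taken from the Forrester model.

```python
# Illustrative finance helpers matching the standard TEI metric definitions.
# All figures below are made up for demonstration; they are NOT the
# Forrester study's inputs or outputs.

def roi(total_benefits, total_costs):
    """ROI = net benefits (benefits minus costs) divided by costs."""
    return (total_benefits - total_costs) / total_costs


def npv(rate, cash_flows):
    """Present value of net cash flows at the given discount rate.
    cash_flows[0] is the initial (usually negative) investment at t=0."""
    return sum(cf / (1 + rate) ** t for t, cf in enumerate(cash_flows))


def payback_period(cash_flows):
    """First period in which cumulative net benefits cover the initial cost."""
    cumulative = 0.0
    for t, cf in enumerate(cash_flows):
        cumulative += cf
        if cumulative >= 0:
            return t
    return None  # never breaks even within the modeled horizon


# Hypothetical three-year model: $1.0M upfront, $0.8M net benefit per year.
flows = [-1_000_000, 800_000, 800_000, 800_000]
example_roi = roi(total_benefits=2_400_000, total_costs=1_000_000)  # 1.4 = 140%
example_npv = npv(rate=0.10, cash_flows=flows)
example_payback = payback_period(flows)  # breaks even during year 2
```

Plugging your own benefit and cost estimates into helpers like these is a quick way to pressure-test a migration business case before committing to it.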
Things to consider in your own decision process

If you're building a business case to move workloads to Azure, whether it be lifting and shifting to AVS or replatforming and refactoring to Azure IaaS and managed services, consider mapping your environment across these areas:

- Data center timelines: Refresh cycles, colocation exit deadlines, and contract constraints.
- Operating model readiness: How quickly teams can adopt cloud-native services versus preserving VMware operations during transition.
- Modernization roadmap: Determine which applications are candidates for investment in replatforming, refactoring, replacement, or retirement once in Azure.

Next steps

- Read the full TEI study: aka.ms/AVS-TEI
- Explore more about AVS: aka.ms/AzureVMwareSolution
- Get the VMware to Azure VMware Solution Planning Guide: aka.ms/VMwareToAVSguide
- Learn more about the Azure Copilot migration agent: aka.ms/migrate/AMA

1 Composite organization: Forrester designed a composite organization based on characteristics of the interviewees' organizations.
2 Return on investment (ROI): A project's expected return in percentage terms. ROI is calculated by dividing net benefits (benefits less costs) by costs.
3 Net present value (NPV): The present or current value of (discounted) future net cash flows given an interest rate (the discount rate). A positive project NPV normally indicates that the investment should be made unless other projects have higher NPVs.
4 Payback: The breakeven point for an investment. This is the point in time at which net benefits (benefits minus costs) equal the initial investment or cost.

Join us at Microsoft Azure Infra Summit 2026 for deep technical Azure infrastructure content
Microsoft Azure Infra Summit 2026 is a free, engineering-led virtual event created for IT professionals, platform engineers, SREs, and infrastructure teams who want to go deeper on how Azure really works in production. It will take place May 19-21, 2026.

This event is built for the people responsible for keeping systems running, making sound architecture decisions, and dealing with the operational realities that show up long after deployment day. Over the past year, one message has come through clearly from the community: infrastructure and operations audiences want more in-depth technical content. They want fewer surface-level overviews and more practical guidance from the engineers and experts who build, run, and support these systems every day. That is exactly what Azure Infra Summit aims to deliver. All content is created AND delivered by engineering, targeting folks working with Azure infrastructure and operating production environments.

- Who is this for: IT professionals, platform engineers, SREs, and infrastructure teams
- When: May 19-21, 2026, 8:00 AM-1:00 PM Pacific Time, all 3 days
- Where: Online (virtual)
- Cost: Free
- Level: Most sessions are advanced (L300-400)
- Register here: https://aka.ms/MAIS-Reg

Built for the people who run workloads on Azure

Azure Infra Summit is for the people who do more than deploy to Azure. It is for the people who run it. If your day involves uptime, patching, governance, monitoring, reliability, networking, identity, storage, or hybrid infrastructure, this event is for you. Whether you are an IT professional managing enterprise environments, a platform engineer designing landing zones, an Azure administrator, an architect, or an SRE responsible for resilience and operational excellence, you will find content built with your needs in mind. We are intentionally shaping this event around peer-to-peer technical learning.
That means engineering-led sessions, practical examples, and candid discussion about architecture, failure modes, operational tradeoffs, and what breaks in production. The promise here is straightforward: less fluff, more infrastructure.

What to expect

Azure Infra Summit will feature deep technical content in the 300 to 400 level range, with sessions designed by engineering to help you build, operate, and optimize Azure infrastructure more effectively. The event will include a mix of live and pre-recorded sessions and live Q&A. Throughout the three days, we will dig into topics such as:

- Hybrid operations and management
- Networking at scale
- Storage, backup, and disaster recovery
- Observability, SLOs, and day-2 operations
- Confidential compute
- Architecture, automation, governance, and optimization in Azure Core environments
- And more…

The goal is simple: to give you practical guidance you can take back to your environment and apply right away. We want attendees to leave with stronger mental models, a better understanding of how Azure behaves in the real world, and clearer patterns for designing and operating infrastructure with confidence.

Why this event matters

Infrastructure decisions have a long tail. The choices we make around architecture, operations, governance, and resilience show up later in the form of performance issues, outages, cost, complexity, and recovery challenges. That is why deep technical learning matters, and why events like this matter.

Join us

I hope you will join us for Microsoft Azure Infra Summit 2026, happening May 19-21, 2026. If you care about how Azure infrastructure behaves in the real world, and you want practical, engineering-led guidance on how to build, operate, and optimize it, this event was built for you. Register here: https://aka.ms/MAIS-Reg

Cheers!
Pierre Roman

Secure HTTP-Only AKS Ingress with Azure Front Door Premium, Firewall DNAT, and Private AGIC
Reference architecture and runbook (Part 1: HTTP-only) for Hub-Spoke networking with private Application Gateway (AGIC), Azure Firewall DNAT, and Azure Front Door Premium (WAF)

0. When and Why to Use This Architecture

Series note: This document is Part 1 and uses HTTP to keep the focus on routing and control points. A follow-up Part 2 will extend the same architecture to HTTPS (end-to-end TLS) with the recommended certificate and policy configuration.

What this document contains

Scope: Architecture overview and traffic flow, build/run steps, sample Kubernetes manifests, DNS configuration, and validation steps for end-to-end connectivity through Azure Front Door → Azure Firewall DNAT → private Application Gateway (AGIC) → AKS.

Typical scenarios

- Private-by-default Kubernetes ingress: You want application ingress without exposing a public Application Gateway or public load balancer for the cluster.
- Centralized hub ingress and inspection: You need a shared Hub VNet pattern with centralized inbound control (NAT, allow-listing, inspection) for one or more spoke workloads.
- Global entry point + edge WAF: You want a globally distributed frontend with WAF, bot/rate controls, and consistent L7 policy before traffic reaches your VNets.
- Controlled origin exposure: You need to ensure only the edge service can reach your origin (firewall public IP), and all other inbound sources are blocked.

Key benefits (the "why")

- Layered security: WAF blocks common web attacks at the edge; the hub firewall enforces network-level allow lists and DNAT; App Gateway applies L7 routing to AKS.
- Reduced public attack surface: Application Gateway and AKS remain private; only Azure Front Door and the firewall public IP are internet-facing.
- Hub-spoke scalability: The hub pattern supports multiple spokes and consistent ingress controls across workloads.
- Operational clarity: Clear separation of responsibilities (edge policy vs. network boundary vs. app routing) makes troubleshooting and governance easier.
When not to use this

- Simple dev/test exposure: If you only need quick internet access, a public Application Gateway or public AKS ingress may be simpler and cheaper.
- You require end-to-end TLS in this lab: This runbook is HTTP-only for learning; production designs should use HTTPS throughout.
- You do not need hub centralization: If there is only one workload and no hub-spoke standardization requirement, the firewall hop may be unnecessary.

Prerequisites and assumptions

- Series scope: Part 1 is HTTP-only to focus on routing and control points. Part 2 will cover HTTPS (end-to-end TLS) and the certificate/policy configuration typically required for production deployments.
- Permissions: Ability to create VNets, peerings, Azure Firewall + policy, Application Gateway, AKS, and Private DNS (typically Contributor on the subscription/resource groups).
- Networking: Hub-Spoke VNets with peering configured to allow forwarded traffic, plus name resolution via Private DNS.
- Tools: Azure CLI, kubectl, and permission to enable the AKS AGIC addon.

Architecture Diagram

1. Architecture Components and Workflow

Workflow (end-to-end request path): Client → Azure Front Door (WAF + TLS, public endpoint) → Azure Firewall public IP (Hub VNet; DNAT) → private Application Gateway (Spoke VNet; AGIC-managed) → AKS service/pods.

1.1 Network topology (Hub-Spoke)

Connectivity: Hub and Spoke VNets are connected via VNet peering with forwarded traffic allowed, so Azure Front Door traffic can traverse Azure Firewall DNAT to the private Application Gateway, and Hub-based validation hosts can resolve private DNS and reach Spoke private IPs.

Hub VNet (10.0.0.0/16)

Purpose: Central ingress and shared services. The Hub hosts the security boundary (Azure Firewall) and optional connectivity/management components used to reach and validate private resources in the Spoke.
- Azure Firewall in AzureFirewallSubnet (10.0.1.0/24); example private IP 10.0.1.4, with a public IP used as the Azure Front Door origin and for inbound DNAT.
- Azure Bastion (optional) in AzureBastionSubnet (10.0.2.0/26) for browser-based access to test VMs without public IPs.
- Test VM subnet (optional): testvm-subnet (10.0.3.0/24) for in-VNet validation (for example, nslookup and curl against the private App Gateway hostname).

Spoke VNet (10.224.0.0/12)

Purpose: Hosts private application workloads (AKS) and the private layer-7 ingress (Application Gateway) that is managed by AGIC.

- AKS subnet aks-subnet: 10.224.0.0/16 (node pool subnet for the AKS cluster).
- Application Gateway subnet appgw-subnet: 10.238.0.0/24 (dedicated subnet for a private Application Gateway; example private frontend IP 10.238.0.10).
- AKS + AGIC: AGIC programs listeners/rules on the private Application Gateway based on Kubernetes Ingress resources.

1.2 Azure Front Door (Frontend)

Role: Public entry point for the application, providing global anycast ingress, TLS termination, and Layer 7 routing to the origin (Azure Firewall public IP) while keeping Application Gateway private.

SKU: Use Azure Front Door Premium when you need WAF plus advanced security/traffic controls; Standard also supports WAF, but Premium is typically chosen for broader capabilities and enterprise patterns.

WAF support: Azure Front Door supports WAF with managed rule sets and custom rules (for example, allow/deny lists, geo-matching, header-based controls, and rate limiting policies).

What WAF brings: Adds edge protection against common web attacks (for example, OWASP Top 10 patterns), reduces attack surface before traffic reaches the Hub, and centralizes L7 policy enforcement for all apps onboarded to Front Door.

Security note: Apply WAF policy at the edge (managed + custom rules) to block malicious requests early; origin access control is enforced at the Azure Firewall layer (see Section 1.3).
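The address plan in Section 1.1 can be sanity-checked programmatically before any resources are created. The sketch below uses Python's standard ipaddress module to confirm each subnet sits inside its parent VNet, that the Hub and Spoke ranges don't overlap, and that the example App Gateway frontend IP falls inside its subnet; the prefixes are the ones used throughout this runbook.

```python
from ipaddress import ip_address, ip_network

# Address plan from Section 1.1 of this runbook.
hub_vnet = ip_network("10.0.0.0/16")
spoke_vnet = ip_network("10.224.0.0/12")

hub_subnets = {
    "AzureFirewallSubnet": ip_network("10.0.1.0/24"),
    "AzureBastionSubnet": ip_network("10.0.2.0/26"),
    "testvm-subnet": ip_network("10.0.3.0/24"),
}
spoke_subnets = {
    "aks-subnet": ip_network("10.224.0.0/16"),
    "appgw-subnet": ip_network("10.238.0.0/24"),
}


def validate_plan():
    """Return a list of address-plan problems (empty means the plan is sane)."""
    problems = []
    if hub_vnet.overlaps(spoke_vnet):
        problems.append("Hub and Spoke VNet ranges overlap")
    for name, subnet in hub_subnets.items():
        if not subnet.subnet_of(hub_vnet):
            problems.append(f"{name} is outside the Hub VNet")
    for name, subnet in spoke_subnets.items():
        if not subnet.subnet_of(spoke_vnet):
            problems.append(f"{name} is outside the Spoke VNet")
    # The example App Gateway frontend IP must fall inside appgw-subnet.
    if ip_address("10.238.0.10") not in spoke_subnets["appgw-subnet"]:
        problems.append("App Gateway private IP is outside appgw-subnet")
    return problems
```

Running validate_plan() on this plan returns an empty list: 10.224.0.0/16 and 10.238.0.0/24 both fall inside 10.224.0.0/12, and the Hub range (10.0.0.0/16) does not overlap the Spoke range.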
1.3 Azure Firewall Premium (Hub security boundary)

Role: Security boundary in the Hub that exposes a controlled public ingress point (the firewall public IP) for Azure Front Door origins, then performs DNAT to the private Application Gateway in the Spoke.

Why Premium: Use Firewall Premium when you need advanced threat protection beyond basic L3/L4 controls, while keeping the origin private.

- IDPS (intrusion detection and prevention): Premium can add signature-based detection and prevention to help identify and block known threats as traffic traverses the firewall.
- TLS inspection (optional): Premium supports TLS inspection patterns so you can apply threat detection to encrypted flows when your compliance and certificate management model allows it.

Premium feature note (DNAT scenarios): These security features still apply when Azure Firewall is used for DNAT (public IP) scenarios. IDPS operates in all traffic directions; however, Azure Firewall does not perform TLS inspection on inbound internet traffic, so the effectiveness of IDPS for inbound encrypted flows is inherently limited. That said, Threat Intelligence enforcement still applies, so protection against known malicious IPs and domains remains in effect.

Hardening guidance: Enforce origin lockdown here by restricting the DNAT listener to AzureFrontDoor.Backend (typically via an IP Group) so only Front Door can reach the firewall public IP; use Front Door WAF as the complementary L7 control plane at the edge.

2. Build Steps (Command Runbook)

2.1 Set variables

$HUB_RG="HUB-VNET-Rgp"
$AKS_RG="AKS-VNET-RGp"
$LOCATION="eastus"
$HUB_VNET="Hub-VNet"
$SPOKE_VNET="Spoke-AKS-VNet"
$APPGW_NAME="spoke-appgw"
$APPGW_PRIVATE_IP="10.238.0.10"

Note: The commands below are formatted for PowerShell. When capturing output from an az command, use $VAR = (az ...).
2.2 Create resource groups

az group create --name $HUB_RG --location $LOCATION
az group create --name $AKS_RG --location $LOCATION

2.3 Create Hub VNet + AzureFirewallSubnet + Bastion subnet + VM subnet

# Create Hub VNet with AzureFirewallSubnet
az network vnet create -g $HUB_RG -n $HUB_VNET -l $LOCATION --address-prefixes 10.0.0.0/16 --subnet-name AzureFirewallSubnet --subnet-prefixes 10.0.1.0/24

# Create Azure Bastion subnet (optional)
az network vnet subnet create -g $HUB_RG --vnet-name $HUB_VNET -n "AzureBastionSubnet" --address-prefixes "10.0.2.0/26"

# Deploy Bastion (optional; requires AzureBastionSubnet)
az network public-ip create -g $HUB_RG -n "bastion-pip" --sku Standard --allocation-method Static
az network bastion create -g $HUB_RG -n "hub-bastion" --vnet-name $HUB_VNET --public-ip-address "bastion-pip" -l $LOCATION

# Create test VM subnet for validation
az network vnet subnet create -g $HUB_RG --vnet-name $HUB_VNET -n "testvm-subnet" --address-prefixes "10.0.3.0/24"

# Create a Windows test VM in the Hub (no public IP)
$VM_NAME = "win-testvm-hub"
$ADMIN_USER = "adminuser"
$ADMIN_PASS = ""
$NIC_NAME = "win-testvm-nic"
az network nic create --resource-group $HUB_RG --location $LOCATION --name $NIC_NAME --vnet-name $HUB_VNET --subnet "testvm-subnet"
az vm create --resource-group $HUB_RG --name $VM_NAME --location $LOCATION --nics $NIC_NAME --image MicrosoftWindowsServer:WindowsServer:2022-datacenter-azure-edition:latest --admin-username $ADMIN_USER --admin-password $ADMIN_PASS --size Standard_D2s_v5

2.4 Create Spoke VNet + AKS subnet + App Gateway subnet

# Create Spoke VNet
az network vnet create -g $AKS_RG -n $SPOKE_VNET -l $LOCATION --address-prefixes 10.224.0.0/12

# Create AKS subnet
az network vnet subnet create -g $AKS_RG --vnet-name $SPOKE_VNET -n aks-subnet --address-prefixes 10.224.0.0/16

# Create Application Gateway subnet
az network vnet subnet create -g $AKS_RG --vnet-name $SPOKE_VNET -n appgw-subnet --address-prefixes 10.238.0.0/24

2.5 Validate and delegate the App Gateway subnet (required)

# Validate subnet exists
az network vnet subnet show -g $AKS_RG --vnet-name $SPOKE_VNET -n appgw-subnet
az network vnet subnet show -g $AKS_RG --vnet-name $SPOKE_VNET -n appgw-subnet --query addressPrefix -o tsv

# Delegate subnet for Application Gateway (required)
az network vnet subnet update -g $AKS_RG --vnet-name $SPOKE_VNET -n appgw-subnet --delegations Microsoft.Network/applicationGateways

2.6 Create the private Application Gateway

az network application-gateway create -g $AKS_RG -n $APPGW_NAME --sku Standard_v2 --capacity 2 --vnet-name $SPOKE_VNET --subnet appgw-subnet --frontend-port 80 --http-settings-protocol Http --http-settings-port 80 --routing-rule-type Basic --priority 100 --private-ip-address $APPGW_PRIVATE_IP

2.7 Create AKS (public, Azure CNI overlay)

$AKS_SUBNET_ID = (az network vnet subnet show -g $AKS_RG --vnet-name $SPOKE_VNET -n aks-subnet --query id -o tsv)
$AKS_NAME = "aks-public-overlay"
az aks create -g $AKS_RG -n $AKS_NAME -l $LOCATION --enable-managed-identity --network-plugin azure --network-plugin-mode overlay --vnet-subnet-id $AKS_SUBNET_ID --node-count 2 --node-vm-size Standard_DS3_v2 --dns-name-prefix aks-overlay --generate-ssh-keys

2.8 Enable AGIC and attach the existing Application Gateway

$APPGW_ID = (az network application-gateway show -g $AKS_RG -n $APPGW_NAME --query id -o tsv)
az aks enable-addons -g $AKS_RG -n $AKS_NAME --addons ingress-appgw --appgw-id $APPGW_ID

2.9 Connect to the cluster and validate AGIC

az aks get-credentials -g $AKS_RG -n $AKS_NAME --overwrite-existing
kubectl get nodes

# Validate AGIC is running
kubectl get pods -n kube-system | findstr ingress

# Inspect AGIC logs (optional)
$AGIC_POD = (kubectl get pod -n kube-system -l app=ingress-appgw -o jsonpath="{.items[0].metadata.name}")
kubectl logs -n kube-system $AGIC_POD

2.10 Create and link Private DNS zone (Hub) and add an A record

Create a Private DNS zone in the Hub, link it to both VNets, then create an A record for app1 pointing to the private Application Gateway IP.

$PRIVATE_ZONE = "clusterksk.com"
az network private-dns zone create -g $HUB_RG -n $PRIVATE_ZONE
$HUB_VNET_ID = (az network vnet show -g $HUB_RG -n $HUB_VNET --query id -o tsv)
$SPOKE_VNET_ID = (az network vnet show -g $AKS_RG -n $SPOKE_VNET --query id -o tsv)
az network private-dns link vnet create -g $HUB_RG -n "link-hub-vnet" -z $PRIVATE_ZONE -v $HUB_VNET_ID -e false
az network private-dns link vnet create -g $HUB_RG -n "link-spoke-aks-vnet" -z $PRIVATE_ZONE -v $SPOKE_VNET_ID -e false
az network private-dns record-set a create -g $HUB_RG -z $PRIVATE_ZONE -n "app1" --ttl 30
az network private-dns record-set a add-record -g $HUB_RG -z $PRIVATE_ZONE -n "app1" -a $APPGW_PRIVATE_IP

2.11 Create VNet peering (Hub ↔ Spoke)

az network vnet peering create -g $HUB_RG --vnet-name $HUB_VNET -n "HubToSpoke" --remote-vnet $SPOKE_VNET_ID --allow-vnet-access --allow-forwarded-traffic
az network vnet peering create -g $AKS_RG --vnet-name $SPOKE_VNET -n "SpokeToHub" --remote-vnet $HUB_VNET_ID --allow-vnet-access --allow-forwarded-traffic

2.12 Deploy sample app + Ingress and validate App Gateway programming

# Create namespace
kubectl create namespace demo

# Create Deployment + Service (PowerShell)
@'
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app1
  namespace: demo
spec:
  replicas: 2
  selector:
    matchLabels:
      app: app1
  template:
    metadata:
      labels:
        app: app1
    spec:
      containers:
      - name: app1
        image: hashicorp/http-echo:1.0
        args:
        - "-text=Hello from app1 via AGIC"
        ports:
        - containerPort: 5678
---
apiVersion: v1
kind: Service
metadata:
  name: app1-svc
  namespace: demo
spec:
  selector:
    app: app1
  ports:
  - port: 80
    targetPort: 5678
  type: ClusterIP
'@ | Set-Content .\app1.yaml
kubectl apply -f .\app1.yaml

# Create Ingress (PowerShell)
@'
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app1-ing
  namespace: demo
  annotations:
    kubernetes.io/ingress.class: azure/application-gateway
    appgw.ingress.kubernetes.io/use-private-ip: "true"
spec:
  rules:
  - host: app1.clusterksk.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app1-svc
            port:
              number: 80
'@ | Set-Content .\app1-ingress.yaml
kubectl apply -f .\app1-ingress.yaml

# Validate Kubernetes objects
kubectl -n demo get deploy,svc,ingress
kubectl -n demo describe ingress app1-ing

# Validate App Gateway has been programmed by AGIC
az network application-gateway show -g $AKS_RG -n $APPGW_NAME --query "{frontendIPConfigs:frontendIPConfigurations[].name,listeners:httpListeners[].name,rules:requestRoutingRules[].name,backendPools:backendAddressPools[].name}" -o json

# If rules/listeners are missing, re-check AGIC logs from step 2.9
kubectl logs -n kube-system $AGIC_POD

2.13 Deploy Azure Firewall Premium + policy + public IP

Firewall deployment (run after the sample Ingress is created):

$FWPOL_NAME = "hub-azfw-pol-test"
$FW_NAME = "hub-azfw-test"
$FW_PIP_NAME = "hub-azfw-pip"
$FW_IPCONF_NAME = "azfw-ipconf"

# Create Firewall Policy (Premium)
az network firewall policy create -g $HUB_RG -n $FWPOL_NAME -l $LOCATION --sku Premium

# Create Firewall public IP (Standard)
az network public-ip create -g $HUB_RG -n $FW_PIP_NAME -l $LOCATION --sku Standard --allocation-method Static

# Deploy Azure Firewall in Hub VNet and associate policy + public IP
az network firewall create -g $HUB_RG -n $FW_NAME -l $LOCATION --sku AZFW_VNet --tier Premium --vnet-name $HUB_VNET --conf-name $FW_IPCONF_NAME --public-ip $FW_PIP_NAME --firewall-policy $FWPOL_NAME

$FW_PUBLIC_IP = (az network public-ip show -g $HUB_RG -n $FW_PIP_NAME --query ipAddress -o tsv)
$FW_PUBLIC_IP

2.14 (Optional) Validate from Hub test VM

Optional: From the Hub Windows test VM (created in step 2.3), confirm app1.clusterksk.com resolves privately and the app responds through the private Application Gateway.
```powershell
# DNS should resolve to the private App Gateway IP
nslookup app1.clusterksk.com

# HTTP request should return the sample response (for example: "Hello from app1 via AGIC")
curl http://app1.clusterksk.com

# Browser validation (from the VM)
# Open: http://app1.clusterksk.com
```

2.15 Restrict DNAT to Azure Front Door (IP Group + DNAT rule)

```powershell
$IPG_NAME = "ipg-afd-backend"
$RCG_NAME = "rcg-dnat"
$NATCOLL_NAME = "dnat-afd-to-appgw"
$NATRULE_NAME = "afd80-to-appgw80"

# 1) Get AzureFrontDoor.Backend IPv4 prefixes and create an IP Group
$AFD_BACKEND_IPV4 = (az network list-service-tags --location $LOCATION --query "values[?name=='AzureFrontDoor.Backend'].properties.addressPrefixes[] | [?contains(@, '.')]" -o tsv)
az network ip-group create -g $HUB_RG -n $IPG_NAME -l $LOCATION --ip-addresses $AFD_BACKEND_IPV4

# 2) Create a rule collection group for DNAT
az network firewall policy rule-collection-group create -g $HUB_RG --policy-name $FWPOL_NAME -n $RCG_NAME --priority 100

# 3) Add NAT collection + DNAT rule (source = AFD IP Group, destination = Firewall public IP, 80 → 80)
az network firewall policy rule-collection-group collection add-nat-collection -g $HUB_RG --policy-name $FWPOL_NAME --rule-collection-group-name $RCG_NAME --name $NATCOLL_NAME --collection-priority 1000 --action DNAT --rule-name $NATRULE_NAME --ip-protocols TCP --source-ip-groups $IPG_NAME --destination-addresses $FW_PUBLIC_IP --destination-ports 80 --translated-address $APPGW_PRIVATE_IP --translated-port 80
```

3. Azure Front Door Configuration

In this section, we configure Azure Front Door Premium as the public frontend with WAF, create an endpoint, and route requests over HTTP (port 80) to the Azure Firewall public IP origin while preserving the host header (app1.clusterksk.com) for AGIC-based Ingress routing.

Create Front Door profile: Create an Azure Front Door profile and choose Premium.
Premium enables enterprise-grade edge features (including WAF and richer traffic/security controls) that you'll use in this lab.

Attach WAF: Enable/associate a WAF policy so requests are inspected at the edge (managed rules + any custom rules) before they're allowed to reach the Azure Firewall origin.

Create an endpoint: Add an endpoint name to create the public Front Door hostname (<endpoint>.azurefd.net) that clients will browse to in this lab.

Create an origin group: Create an origin group to define how Front Door health-probes and load-balances traffic to one or more origins (for this lab, it will contain a single origin: the Firewall public IP).

Add an origin: Add the Azure Firewall as the origin so Front Door forwards requests to the Hub entry point (Firewall public IP), which then DNATs to the private Application Gateway.
- Origin type: Public IP address
- Public IP address: select the Azure Firewall public IP
- Origin protocol/port: HTTP, 80
- Host header: app1.clusterksk.com

Create a route: Create a route to connect the endpoint to the origin group and define the HTTP behaviors (patterns, accepted protocols, and forwarding protocol) used for this lab.
- Patterns to match: /*
- Accepted protocols: HTTP
- Forwarding protocol: HTTP only (this lab is HTTP-only)

Review + create, then wait for propagation: Select Review + create (or Create) to deploy the Front Door configuration, wait ~30–40 minutes for global propagation, then browse to http://<endpoint>.azurefd.net/.

4. Validation (Done Criteria)

- app1.clusterksk.com resolves to 10.238.0.10 from within the Hub/Spoke VNets (Private DNS link working).
- Azure Front Door can reach the origin over HTTP and returns a 200/expected response (origin health is healthy).
- Requests to http://app1.clusterksk.com/ (internal) and http://<your-front-door-domain>/ (external) are routed to app1-svc and return the expected http-echo text (Ingress + AGIC wiring correct).
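The done criteria above can also be checked mechanically once you have collected the probe results (for example with the nslookup/curl commands from step 2.14). The helper below is an illustrative sketch, not part of the lab; only the expected IP and echo text come from this walkthrough:

```python
# Illustrative harness for the lab's "done criteria".
# The checks mirror the validation list above: private DNS resolution,
# plus the expected http-echo body via the internal and Front Door paths.

EXPECTED_APPGW_IP = "10.238.0.10"           # private App Gateway frontend IP
EXPECTED_BODY = "Hello from app1 via AGIC"  # text served by http-echo

def evaluate(results: dict) -> list:
    """Return the list of failed criteria for observed probe results.

    `results` is expected to look like:
      {"dns_ip": "...", "internal_body": "...", "frontdoor_body": "..."}
    (collected separately, e.g. with nslookup/curl as shown earlier).
    """
    failures = []
    if results.get("dns_ip") != EXPECTED_APPGW_IP:
        failures.append("private DNS does not resolve to the App Gateway IP")
    if EXPECTED_BODY not in results.get("internal_body", ""):
        failures.append("internal path did not return the http-echo text")
    if EXPECTED_BODY not in results.get("frontdoor_body", ""):
        failures.append("Front Door path did not return the http-echo text")
    return failures

# A fully healthy run produces no failures.
print(evaluate({
    "dns_ip": "10.238.0.10",
    "internal_body": "Hello from app1 via AGIC",
    "frontdoor_body": "Hello from app1 via AGIC",
}))  # → []
```

Keeping the criteria in one place like this makes it easy to re-run the same checks after every change to the DNAT rules or Front Door route.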
Author: Kumar Shashi Kaushal (Sr. Digital Cloud Solutions Architect, Microsoft)

AKS cluster with AGIC hits the Azure Application Gateway backend pool limit (100)
I’m writing this article to document a real-world scaling issue we hit while exposing many applications from an Azure Kubernetes Service (AKS) cluster using Application Gateway Ingress Controller (AGIC). The problem is easy to miss because Kubernetes resources keep applying successfully, but the underlying Azure Application Gateway has a hard platform limit of 100 backend pools—so once your deployment pattern requires the 101st pool, AGIC can’t reconcile the gateway configuration and traffic stops flowing for new apps. This post explains how the limit is triggered, how to reproduce and recognize it, and what practical mitigation paths exist as you grow.

A real-world scalability limit, reproduction steps, and recommended mitigation options:

- AGIC typically creates one Application Gateway backend pool per Kubernetes Service referenced by an Ingress.
- Azure Application Gateway enforces a hard limit of 100 backend pools.
- When the 101st backend pool is required, Application Gateway rejects the update and AGIC fails reconciliation.
- Kubernetes resources appear created, but traffic does not flow due to the external platform limit.
- Gateway API–based application routing is the most scalable forward-looking solution.

Architecture Overview

The environment follows a Hub-and-Spoke network architecture, commonly used in enterprise Azure deployments to centralize shared services and isolate workloads.
Hub Network

- Azure Firewall / Network security services
- VPN / ExpressRoute Gateways
- Private DNS Zones
- Shared monitoring and governance components

Spoke Network

- Private Azure Kubernetes Service (AKS) cluster
- Azure Application Gateway with private frontend
- Application Gateway Ingress Controller (AGIC)
- Application workloads exposed via Kubernetes Services and Ingress

Ingress Traffic Flow

Client → Private Application Gateway → AGIC-managed routing → Kubernetes Service → Pod

Application Deployment Model

Each application followed a simple and repeatable Kubernetes pattern that ultimately triggered backend pool exhaustion.

- One Deployment per application
- One Service per application
- One Ingress per application
- Each Ingress referencing a unique Service

Kubernetes Manifests Used

Note: All Kubernetes manifests in this example are deployed into the demo namespace. Please ensure the namespace is created before applying the manifests.

Deployment template

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: app-{{N}}
  namespace: demo
spec:
  replicas: 1
  selector:
    matchLabels:
      app: app-{{N}}
  template:
    metadata:
      labels:
        app: app-{{N}}
    spec:
      containers:
      - name: app
        image: hashicorp/http-echo:1.0
        args:
        - "-text=Hello from app {{N}}"
        ports:
        - containerPort: 5678
```

Service template

```yaml
apiVersion: v1
kind: Service
metadata:
  name: svc-{{N}}
  namespace: demo
spec:
  selector:
    app: app-{{N}}
  ports:
  - port: 80
    targetPort: 5678
```

Ingress template

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ing-{{N}}
  namespace: demo
spec:
  ingressClassName: azure-application-gateway
  rules:
  - host: app{{N}}.example.internal
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: svc-{{N}}
            port:
              number: 80
```

Reproducing the Backend Pool Limitation

The issue was reproduced by deploying 101 applications using the same pattern. Each iteration resulted in AGIC attempting to create a new backend pool.
```powershell
for ($i = 1; $i -le 101; $i++) {
    (Get-Content deployment.yaml) -replace "{{N}}", $i | kubectl apply -f -
    (Get-Content service.yaml) -replace "{{N}}", $i | kubectl apply -f -
    (Get-Content ingress.yaml) -replace "{{N}}", $i | kubectl apply -f -
}
```

Observed AGIC Error

```
Code="ApplicationGatewayBackendAddressPoolLimitReached"
Message="The number of BackendAddressPools exceeds the maximum allowed value. The number of BackendAddressPools is 101 and the maximum allowed is 100."
```

Root Cause Analysis

Azure Application Gateway enforces a non-configurable maximum of 100 backend pools. AGIC creates backend pools based on Services referenced by Ingress resources, leading to exhaustion at scale.

Available Options After Hitting the Limit

Option 1: Azure Gateway Controller (AGC)

AGC uses the Kubernetes Gateway API and avoids the legacy Ingress model. However, it currently supports only public frontends and does not support private frontends.

Option 2: ingress-nginx via Application Routing

This option is supported only until November 2026 and is not recommended due to deprecation and lack of long-term viability.

Option 3: Application Routing with Gateway API (Preview)

Gateway API–based application routing is the strategic long-term direction for AKS. Although currently in preview, it has been stable upstream for several years and is suitable for onboarding new applications with appropriate risk awareness. As shown in the screenshot below, I am using two controllers.

Reference Microsoft documents:

- Azure Kubernetes Service (AKS) Managed Gateway API Installation (preview) - Azure Kubernetes Service | Microsoft Learn
- Azure Kubernetes Service (AKS) application routing add-on with the Kubernetes Gateway API (preview) - Azure Kubernetes Service | Microsoft Learn
- Secure ingress traffic with the application routing Gateway API implementation

Conclusion

The 100-backend pool limitation is a hard Azure Application Gateway constraint.
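Because the limit is hard, it is worth checking a planned wave of applications against it before onboarding. The sketch below is an illustrative planning helper (the 100-pool limit is real; the helper itself is an assumption): it models AGIC's roughly one-backend-pool-per-referenced-Service behavior and flags the deployment that would exceed the limit.

```python
APPGW_MAX_BACKEND_POOLS = 100  # hard Application Gateway limit

def pools_required(ingress_services):
    """AGIC creates roughly one backend pool per unique Service
    referenced by an Ingress, so count the distinct Services."""
    return len(set(ingress_services))

def check_capacity(ingress_services):
    """Raise if the planned deployment would exceed the pool limit."""
    needed = pools_required(ingress_services)
    if needed > APPGW_MAX_BACKEND_POOLS:
        raise RuntimeError(
            f"ApplicationGatewayBackendAddressPoolLimitReached: "
            f"{needed} pools required, maximum is {APPGW_MAX_BACKEND_POOLS}"
        )

# 100 one-service-per-ingress apps fit; the 101st does not.
check_capacity([f"svc-{i}" for i in range(1, 101)])        # fine
try:
    check_capacity([f"svc-{i}" for i in range(1, 102)])    # 101 services
except RuntimeError as err:
    print(err)
```

Running this kind of check in CI against the rendered Ingress manifests catches the 101st Service at review time instead of at reconciliation time.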
Teams using AGIC must plan for scale early by consolidating services or adopting Gateway API–based routing to avoid production onboarding blockers.

Author: Kumar Shashi Kaushal (Sr. Digital Cloud Solutions Architect)

Proactive Reliability Series — Article 1: Fault Types in Azure
Welcome to the Proactive Reliability Series — a collection of articles dedicated to raising awareness about the importance of designing, implementing, and operating reliable solutions in Azure. Each article will focus on a specific area of reliability engineering: from identifying critical flows and setting reliability targets, to designing for redundancy, testing strategies, and disaster recovery.

This series draws its foundation from the Reliability pillar of the Azure Well-Architected Framework, Microsoft's authoritative guidance for building workloads that are resilient to malfunction and capable of returning to a fully functioning state after a failure occurs.

In the cloud, failures are not a matter of if but when. Whether it is a regional outage, an availability zone going dark, a misconfigured resource, or a downstream service experiencing degradation — your workload will eventually face adverse conditions. The difference between a minor blip and a major incident often comes down to how deliberately you have planned for failure.

In this first article, we start with one of the most foundational practices: Failure Mode Analysis (FMA) — and the question that underpins it: what kinds of faults can actually happen in Azure?

Disclaimer: The views expressed in this article are my own and do not represent the views or positions of Microsoft. This article is written in a personal capacity and has not been reviewed, endorsed, or approved by Microsoft.

Why Failure Mode Analysis Matters

Failure Mode Analysis is the practice of systematically identifying potential points of failure within your workload and its associated flows, and then planning mitigation actions accordingly. A key tenet of FMA is that in any distributed system, failures can occur regardless of how many layers of resiliency are applied. More complex environments are simply exposed to more types of failures.
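FMA results are usually captured in a worksheet; expressing the same exercise as data makes it easy to re-rank failure modes as the workload evolves. The following is a toy sketch — the 1–5 scoring scale, the risk formula, and the sample entries are illustrative assumptions in the spirit of a classic FMEA worksheet, not Well-Architected Framework guidance:

```python
from dataclasses import dataclass

@dataclass
class FailureMode:
    name: str
    likelihood: int  # 1 (rare) .. 5 (frequent) — illustrative scale
    impact: int      # 1 (minor) .. 5 (severe)  — illustrative scale
    mitigation: str

    @property
    def risk(self) -> int:
        # Simple risk-priority score, as in classic FMEA worksheets.
        return self.likelihood * self.impact

modes = [
    FailureMode("Single resource fault", 5, 2, "Resource redundancy"),
    FailureMode("Availability zone fault", 2, 4, "Zone redundancy"),
    FailureMode("Partial region fault", 2, 5, "Multi-region architecture"),
]

# Review the highest-risk modes first.
for m in sorted(modes, key=lambda m: m.risk, reverse=True):
    print(f"{m.risk:>2}  {m.name} -> {m.mitigation}")
```

The point of keeping the worksheet as data is that a likelihood or impact reassessment immediately reorders the review queue.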
Given this reality, FMA allows you to design your workload to withstand most types of failures and recover gracefully within defined recovery objectives. If you skip FMA altogether, or perform an incomplete analysis, your workload is at risk of unpredicted behavior and potential outages caused by suboptimal design. But to perform FMA effectively, you first need to understand what kinds of faults can actually occur in Azure infrastructure — and that is where most teams hit a gap.

Sample "Azure Fault Type" Taxonomy

Azure infrastructure is complex and distributed, and while Microsoft invests heavily in reliability, faults can and do occur. These faults can range from large-scale global service outages to localized issues affecting a single VM. The following is a sample taxonomy of common Azure infrastructure fault types, categorized by their characteristics, likelihood, and mitigation strategies. The taxonomy is organized from a customer impact perspective — focusing on how fault types affect customer workloads and what mitigation options are available — rather than from an internal Azure engineering perspective.

Some of these "faults" may not even be caused by an actual failure in Azure infrastructure. They can be caused by a lack of understanding of Azure service designed behaviors (e.g., underestimating the impact of Azure planned maintenance) or by Azure platform design decisions (e.g., capacity constraints). However, from a customer perspective, they all represent potential failure modes that need to be considered and mitigated when designing for reliability.

The following table presents infrastructure fault types from a customer impact perspective:

Disclaimer: This is an unofficial taxonomy sample of Azure infrastructure fault types. It is not an official Microsoft publication and is not officially supported, endorsed, or maintained by Microsoft.
The fault type definitions, likelihood assessments, and mitigation recommendations are based on publicly available Azure documentation and general cloud architecture best practices, but may not reflect the most current Azure platform behavior. Always refer to official Azure documentation and Azure Service Health for authoritative guidance. The "Likelihood" values below are relative planning heuristics intended to help prioritize resilience investments. They are not statistical probabilities, do not represent Azure SLA commitments, and are not derived from official Azure reliability data.

| Fault Type | Blast Radius | Likelihood | Mitigation / Redundancy Level Requirements |
| --- | --- | --- | --- |
| Service Fault (Global) | Worldwide or multiple regions | Very Low | High |
| Service Fault (Region) | Single service in region | Medium | Region Redundancy |
| Region Fault | Single region | Very Low | Region Redundancy |
| Partial Region Fault | Multiple services in a single region | Low | Region Redundancy |
| Availability Zone Fault | Single AZ within region | Low | Availability Zone Redundancy |
| Single Resource Fault | Single VM/instance | High | Resource Redundancy |
| Platform Maintenance Fault | Variable (resource to region) | High | Resource Redundancy, Maintenance Schedules |
| Region Capacity Constraint Fault | Single region | Low | Region Redundancy, Capacity Reservations |
| Network POP Location Fault | Network hardware colocation site | Low | Site Redundancy |

In future articles we will examine each of these fault types in detail. For this first article, let's take a closer look at one that is often underestimated: the Partial Region Fault.

Deep Dive: "Partial Region Fault"

A Partial Region Fault is a fault affecting multiple Azure services within a single region simultaneously, typically due to shared regional infrastructure dependencies, regional network issues, or regional platform incidents. Sometimes, the number of affected services may be significant enough to resemble a full region outage — but the key distinction is that it is not a complete loss of the region.
Some services may continue to operate normally, while others experience degradation or unavailability. Unlike a region outage caused by a natural disaster, in the documented cases referenced later in this article, such "Partial Region Faults" have historically been resolved within hours.

| Attribute | Description |
| --- | --- |
| Blast Radius | Multiple services within a single region |
| Likelihood | Low |
| Typical Duration | Minutes to hours |
| Fault Tolerance Options | Multi-region architecture; cross-region failover |
| Fault Tolerance Cost | High |
| Impact | Severe |
| Typical Cause | Regional networking infrastructure failure affecting multiple services; regional storage subsystem degradation impacting dependent services; regional control plane issues affecting service management |

These faults are rare, but they can happen — and when they do, they can have a severe impact on customer solutions that are not architected for multi-region resilience. What makes Partial Region Faults particularly dangerous is that they fall into a blind spot in most teams' resilience planning. When organizations think about regional failures, they tend to think in binary terms: either a region is up or it is down. Disaster recovery runbooks are written around the idea of a full region outage — triggered by a natural disaster or a catastrophic infrastructure event — where the response is to fail over everything to a secondary region.

But a Partial Region Fault is not a full region outage. It is something more insidious. A subset of services in the region degrades or becomes unavailable while others continue to function normally. Your VMs might still be running, but the networking layer that connects them is broken. Your compute is fine, but Azure Resource Manager — the control plane through which you manage everything — is unreachable. This partial nature creates several problems that teams rarely plan for:

Failover logic may not trigger. Most automated failover mechanisms are designed to detect a complete loss of connectivity to a region.
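This blind spot can be made concrete with a small sketch. In the hedged example below (service names and the set of critical dependencies are invented for illustration), a binary region-level check stays green during a partial fault, while a per-dependency check correctly flags the degraded path:

```python
# Per-service health for one region; a partial fault leaves some
# dependencies healthy while others fail (names are illustrative).
region_health = {
    "compute": "healthy",
    "storage": "healthy",
    "regional-network": "degraded",   # the shared dependency that broke
    "control-plane": "healthy",
}

def region_is_down(health):
    """Binary check typical of naive failover automation:
    it only trips when *everything* in the region is gone."""
    return all(status != "healthy" for status in health.values())

def critical_path_impaired(health, critical):
    """Partial-fault-aware check: any unhealthy critical dependency
    should trigger (possibly selective) failover."""
    return any(health.get(dep) != "healthy" for dep in critical)

critical_deps = {"compute", "regional-network"}

print(region_is_down(region_health))                         # False — automation stays idle
print(critical_path_impaired(region_health, critical_deps))  # True — users already see errors
```

The per-dependency variant requires you to enumerate which shared dependencies sit on your critical flows — which is exactly the output FMA is meant to produce.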
When only some services are affected, health probes may still pass, traffic managers may still route requests to the degraded region, and your failover automation may sit idle — while your users are already experiencing errors.

Recovery is more complex. With a full region outage, the playbook is straightforward: fail over to the secondary region. With a partial fault, you may need to selectively fail over some services while others remain in the primary region — a scenario that few teams have tested and most architectures do not support gracefully.

The real-world examples below illustrate this clearly. In each case, a shared infrastructure dependency — regional networking, Managed Identities, or Azure Resource Manager — experienced an issue that cascaded into a multi-service fault lasting hours. None of these were full region outages, yet the scope and duration of affected services were significant in each case:

Switzerland North — Network Connectivity Impact (BT6W-FX0)

A platform issue resulted in an impact to customers in Switzerland North who may have experienced service availability issues for resources hosted in the region.

| Attribute | Value |
| --- | --- |
| Date | September 26–27, 2025 |
| Region | Switzerland North |
| Time Window | 23:54 UTC on 26 Sep – 21:59 UTC on 27 Sep 2025 |
| Total Duration | ~22 hours |
| Services Impacted | Multiple (network-dependent services in the region) |

According to the official Post Incident Review (PIR) published by Microsoft on Azure Status History, a platform issue caused network connectivity degradation affecting multiple network-dependent services across the Switzerland North region, with impact lasting approximately 22 hours. The full root cause analysis, timeline, and remediation steps are documented in the linked PIR below.
🔗 View PIR on Azure Status History

East US and West US — Managed Identities and Dependent Services (_M5B-9RZ)

A platform issue with the Managed Identities for Azure resources service impacted customers trying to create, update, or delete Azure resources, or acquire Managed Identity tokens in East US and West US regions.

| Attribute | Value |
| --- | --- |
| Date | February 3, 2026 |
| Regions | East US, West US |
| Time Window | 00:10 UTC – 06:05 UTC on 03 February 2026 |
| Total Duration | ~6 hours |
| Services Impacted | Managed Identities + dependent services (resource create/update/delete, token acquisition) |

🔗 View PIR on Azure Status History

Azure Government — Azure Resource Manager Failures (ML7_-DWG)

Customers using any Azure Government region experienced failures when attempting to perform service management operations through Azure Resource Manager (ARM). This included operations through the Azure Portal, Azure REST APIs, Azure PowerShell, and Azure CLI.

| Attribute | Value |
| --- | --- |
| Date | December 8, 2025 |
| Regions | Azure Government (all regions) |
| Time Window | 11:04 EST (16:04 UTC) – 14:13 EST (19:13 UTC) |
| Total Duration | ~3 hours |
| Services Impacted | 20+ services (ARM and all ARM-dependent services) |

🔗 View PIR on Azure Status History

Wrapping Up

Designing resilient Azure solutions requires understanding the full spectrum of potential infrastructure faults. The Partial Region Fault is just one of many fault types you should account for during your Failure Mode Analysis — but it is a powerful reminder that even within a single region, shared infrastructure dependencies can amplify a single failure into a multi-service outage.

Use this taxonomy as a starting point for FMA when designing your Azure architecture. The area is continuously evolving as the Azure platform and industry evolve — watch the space and revisit your fault type analysis periodically. In the next article, we will continue exploring additional fault types from the taxonomy. Stay tuned.
Authors & Reviewers

Authored by Zoran Jovanovic, Cloud Solutions Architect at Microsoft.
Peer Review by Catalina Alupoaie, Cloud Solutions Architect at Microsoft.
Peer Review by Stefan Johner, Cloud Solutions Architect at Microsoft.

References

- Azure Well-Architected Framework — Reliability Pillar
- Failure Mode Analysis
- Shared Responsibility for Reliability
- Azure Availability Zones
- Business Continuity and Disaster Recovery
- Transient Fault Handling
- Azure Service Level Agreements
- Azure Reliability Guidance by Service
- Azure Status History

AWS to Azure Migration — From the Cloud Economics & FinOps Lens
“ROI fails when FinOps joins late.” That single pattern explains why many cloud migrations deliver technical success but financial disappointment. Workloads move. SLAs hold. Teams celebrate go‑live. Then the CFO asks: Where are the savings we modeled? In most cases, FinOps was engaged after architecture decisions were locked, licenses were double‑paid, and governance debt had already accumulated.

This article frames AWS‑to‑Azure migration through a FinOps lens—not to chase immediate modernization, but to deliver defensible, incremental cost savings during and after migration, without increasing risk.

Azure migration guidance consistently emphasizes a structured, phased approach—discover, migrate like‑for‑like, stabilize, then optimize. From a FinOps perspective, this sequencing is not conservative—it is economically rational:

- Like‑for‑like preserves performance baselines and business KPIs
- Cost comparisons remain apples‑to‑apples
- Optimization levers can be applied surgically, not blindly

The real value emerges in the first 90 days after migration, when cost signals stabilize and commitment‑based savings become safe to apply.

TL;DR: Cloud migrations miss ROI when FinOps joins late. AWS → Azure migrations deliver real savings when FinOps leads early, migrations stay like‑for‑like, and optimization is applied after costs stabilize. Azure enables this through four levers: AI‑assisted planning (Copilot + Azure Migrate), cheaper non‑prod with Dev/Test pricing, license reuse via Azure Hybrid Benefit, and low‑risk long‑term savings with Reservations—across compute and storage. Result: lower migration risk, controlled spend, and sustainable savings post‑move.

This article covers the top four FinOps levers in an AWS → Azure migration.

1. Azure Copilot Migration Agent + Azure Migrate

Azure Copilot Migration Agent (currently in public preview) is a planning‑focused, AI‑assisted experience built on Azure Migrate.
It analyzes inventory, readiness, landing zone requirements, and ROI before execution. You can interact with the Agent using natural language prompts to explore inventory, migration readiness, strategies, ROI considerations, and landing zone requirements. From a FinOps perspective, this directly translates into faster decision cycles and lower planning overhead. By simplifying and compressing activities that traditionally required weeks of manual analysis or external managed services support, organizations can reduce the cost of migration planning, accelerate business case creation, and bring cost and ROI discussions forward—before environments are deployed and financial commitments are made.

2. Azure Dev/Test pricing

Azure Dev/Test pricing provides discounted rates for non‑production workloads for eligible subscriptions, significantly reducing dev and test environment costs (Azure Dev/Test pricing). You can save up to 57 percent for a typical web app dev/test environment running SQL Database and App Service. Unlike other cloud providers, this directly reduces environment sprawl costs, which often exceed production waste post‑migration. It also enables wave‑based migration by lowering the cost of parallel environments, allowing teams to migrate deliberately rather than under financial pressure.

3. Azure Hybrid Benefit

Azure Hybrid Benefit allows organizations to reuse existing Windows Server, SQL Server, and supported Linux subscriptions (RHEL and SLES) on Azure, reducing both migration and steady‑state run costs. It enables license portability across Azure services, helping organizations avoid repurchasing software licenses they already own and redirect savings toward innovation and modernization. During migration, Azure Hybrid Benefit is especially impactful because it addresses migration overlap costs.
The 180‑day migration allowance for Windows Server and SQL Server allows workloads to run on‑premises and in Azure simultaneously, supporting parallel validation, phased cutovers, and rollback readiness without double‑paying for licenses. For Linux, Azure Hybrid Benefit enables RHEL and SLES workloads to move to Azure without redeployment, ensuring continuity and avoiding downtime. From a FinOps perspective, this reduces one of the most underestimated migration cost drivers, delivering up to 76% savings versus pay‑as‑you‑go pricing for Linux and up to 29% versus leading cloud providers for SQL Server, while keeping migration timelines driven by readiness—not cost pressure.

4. Azure Reservations

Azure Reservations enable organizations to reduce costs by committing to one‑year or three‑year plans for eligible Azure services, receiving a billing discount that is automatically applied to matching resources. Reservations provide discounts of up to 72% compared to pay‑as‑you‑go pricing, do not affect the runtime state of workloads, and can be paid upfront or monthly with no difference in total cost. Importantly, Azure Reservations apply not only to compute and database, but also to storage services like Azure Blob Storage, Azure Data Lake Storage Gen2, and Azure Files (for storage capacity), which often represent a significant portion of enterprise cloud spend.

In the context of migration, Azure Reservations matter because they allow FinOps teams to optimize baseline costs across both compute and data layers once workloads stabilize. Unlike AWS, where commitment‑based discounts are largely compute‑centric and storage services such as Amazon S3 do not offer reservation‑style pricing, Azure enables long‑term cost optimization for persistent storage footprints that continue to grow post‑migration. Additionally, Azure Reservations offer greater flexibility—customers can modify, exchange, or cancel reservations through a self‑service program, subject to defined limits.
This is particularly valuable during wave‑based migrations, where workload shapes evolve over time. From a FinOps perspective, Azure Reservations allow organizations to commit to predictable savings with broader scope and lower risk, covering both infrastructure and data‑heavy workloads common in migration scenarios.

Successful migrations are no longer measured by workloads moved, but by cost control maintained and value unlocked. Azure’s FinOps‑aligned migration capabilities allow organizations to reduce risk first, optimize deliberately, and ensure that savings are sustained long after the last workload migrates.

Resiliency Patterns for Azure Front Door: Field Lessons
Abstract

Azure Front Door (AFD) sits at the edge of Microsoft’s global cloud, delivering secure, performant, and highly available applications to users worldwide. As adoption has grown—especially for mission‑critical workloads—the need for resilient application architectures that can tolerate rare but impactful platform incidents has become essential. This article summarizes key lessons from Azure Front Door incidents in October 2025, outlines how Microsoft is hardening the platform, and—most importantly—describes proven architectural patterns customers can adopt today to maintain business continuity when global load‑balancing services are unavailable.

Who this is for

This article is intended for:

- Cloud and solution architects designing mission‑critical internet‑facing workloads
- Platform and SRE teams responsible for high availability and disaster recovery
- Security architects evaluating WAF placement and failover trade‑offs
- Customers running revenue‑impacting workloads on Azure Front Door

Introduction

Azure Front Door (AFD) operates at massive global scale, serving secure, low‑latency traffic for Microsoft first‑party services and thousands of customer applications. Internally, Microsoft is investing heavily in tenant isolation, independent infrastructure resiliency, and active‑active service architectures to reduce blast radius and speed recovery. However, no global distributed system can completely eliminate risk. Customers hosting mission‑critical workloads on AFD should therefore design for the assumption that global routing services can become temporarily unavailable—and provide alternative routing paths as part of their architecture.

Resiliency options for mission‑critical workloads

The following patterns are in active use by customers today. Each represents a different trade‑off between cost, complexity, operational maturity, and availability.

1.
No CDN with Application Gateway

Figure 1: Azure Front Door primary routing with DNS failover

When to use: Workloads without CDN caching requirements that prioritize predictable failover.

Architecture summary

- Azure Traffic Manager (ATM) runs in Always Serve mode to provide DNS‑level failover.
- Web Application Firewall (WAF) is implemented regionally using Azure Application Gateway.
- Application Gateway can be private, provided AFD Premium is used, and is the default path.
- DNS failover is available when AFD is not reachable.
- When failover is triggered, one of the steps is to switch the Application Gateway IP to public (ATM can route to public endpoints only).
- Switch back to the AFD route once AFD resumes service.

Pros

- DNS‑based failover away from the global load balancer
- Consistent WAF enforcement at the regional layer
- Application Gateways can remain private during normal operations

Cons

- Additional cost and reduced composite SLA from extra components
- Application Gateway must be made public during failover
- Active‑passive pattern requires regular testing to maintain confidence

2. Multi‑CDN for mission‑critical applications

Figure 2: Multi‑CDN architecture using Azure Front Door and Akamai with DNS‑based traffic steering

When to use: Mission‑critical applications with strict availability requirements and heavy CDN usage.

Architecture summary

- Dual CDN setup (for example, Azure Front Door + Akamai)
- Azure Traffic Manager in Always Serve mode
- Traffic split (for example, 90/10) to keep both CDN caches warm
- During failover, 100% of traffic is shifted to the secondary CDN
- Ensure origin servers can handle the load of extra hits (cache misses)

Pros

- Highest resilience against CDN‑specific or control‑plane outages
- Maintains cache readiness on both providers

Cons

- Expensive and operationally complex
- Requires origin capacity planning for cache‑miss surges
- Not suitable if applications rely on CDN‑specific advanced features

3.
Multi‑layered CDN (Sequential CDN architecture)

Figure 3: Sequential CDN architecture with Akamai as caching layer in front of Azure Front Door

When to use: Rare, niche scenarios where a layered CDN approach is acceptable. This is not a common approach, as Akamai becomes a single entry point of failure. However, if AFD isn't available, you can update Akamai properties to route directly to origin servers.

Architecture summary

- Akamai used as the front caching layer
- Azure Front Door used as the L7 gateway and WAF
- During failover, Akamai routes traffic directly to origin services

Pros

- Direct fallback path to origins if AFD becomes unavailable
- Single caching layer in normal operation

Cons

- Fronting CDN remains a single point of failure
- Not generally recommended due to complexity
- Requires a well‑tested operational playbook

4. No CDN – Traffic Manager redirect to origin (with Application Gateway)

Figure 4: DNS‑based failover directly to origin via Application Gateway when Azure Front Door is unavailable

When to use: Applications that require L7 routing but no CDN caching.

Architecture summary

- Azure Front Door provides L7 routing and WAF
- Azure Traffic Manager enables DNS failover
- During an AFD outage, Traffic Manager routes directly to Application Gateway‑protected origins

Pros

- Alternative ingress path to origin services
- Consistent regional WAF enforcement

Cons

- Additional infrastructure cost
- Operational dependency on Traffic Manager configuration accuracy

5. No CDN – Traffic Manager redirect to origin (no Application Gateway)

Figure 5: Direct DNS failover to origin services without Application Gateway

When to use: Cost‑sensitive scenarios with clearly accepted security trade‑offs.
Architecture summary
- WAF implemented directly in Azure Front Door
- Traffic Manager provides DNS failover
- During an outage, traffic routes directly to origins

Pros
- Simplest architecture
- No Application Gateway in the primary path

Cons
- Risk of unscreened traffic during failover
- Failover operations can be complex if WAF consistency is required

Frequently asked questions

Is Azure Traffic Manager a single point of failure?
No. Traffic Manager operates as a globally distributed service. For extreme resilience requirements, customers can combine Traffic Manager with a backup FQDN hosted in a separate DNS provider.

Should every workload implement these patterns?
No. These patterns are intended for mission-critical workloads where downtime has material business impact. Non-critical applications do not require multi-CDN or alternate routing paths.

What does Microsoft use internally?
Microsoft uses a combination of active-active regions, multi-layered CDN patterns, and controlled fail-away mechanisms, selected based on service criticality and performance requirements.

What happened in October 2025 (summary)
Two separate Azure Front Door incidents in October 2025 highlighted the importance of architectural resiliency:
- A control-plane defect caused erroneous metadata propagation, impacting approximately 26% of global edge sites
- A later compatibility issue across control-plane versions resulted in DNS resolution failures

Both incidents were mitigated through automated restarts, manual intervention, and controlled failovers. These events accelerated platform-level hardening investments.
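To make the DNS-steering behavior behind patterns 1 and 2 concrete, here is a minimal Python sketch of weighted endpoint selection with failover. It is purely illustrative: the hostnames, weights, and health map are made-up assumptions, and this is a simulation of Traffic Manager-style logic, not how the service is implemented. During normal operation roughly 90% of lookups resolve to the primary CDN (keeping the secondary cache warm); when the primary is marked unhealthy, 100% of traffic shifts to the secondary.

```python
import random

# Hypothetical endpoints and weights modeling a 90/10 traffic split
# (e.g., Azure Front Door primary, Akamai secondary kept warm).
ENDPOINTS = {
    "afd.example.net": 90,
    "akamai.example.net": 10,
}

def pick_endpoint(health, rng=random.random):
    """Return a healthy endpooint honoring weights; unhealthy endpoints
    are excluded, so traffic shifts entirely to whatever remains."""
    healthy = {ep: w for ep, w in ENDPOINTS.items() if health.get(ep, False)}
    if not healthy:
        raise RuntimeError("no healthy endpoints")
    total = sum(healthy.values())
    r = rng() * total
    for ep, w in healthy.items():
        if r < w:
            return ep
        r -= w
    return ep  # floating-point edge case: fall back to the last endpoint

# Normal operation: ~90% of simulated lookups resolve to the primary.
all_up = {"afd.example.net": True, "akamai.example.net": True}
sample = [pick_endpoint(all_up) for _ in range(10_000)]
print(sample.count("afd.example.net") / len(sample))  # ~0.9

# Failover: primary unhealthy, all traffic shifts to the secondary.
primary_down = {"afd.example.net": False, "akamai.example.net": True}
print(pick_endpoint(primary_down))  # akamai.example.net
```

The same structure models pattern 1 if the two entries are an AFD endpoint and a (temporarily public) Application Gateway IP with priority-style weights.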
How Azure Front Door is being hardened

Microsoft has already completed or initiated major improvements, including:
- Synchronous configuration processing before rollout
- Control-plane and data-plane isolation
- Reduced configuration propagation times
- Active-active fail-away for major first-party services
- Microcell segmentation to reduce blast radius

These changes reinforce a core principle: no single tenant configuration should ever impact others, and recovery must be fast and predictable.

Key takeaways
- Global platforms can experience rare outages; architect for them
- Mission-critical workloads should include alternate routing paths
- Multi-CDN and DNS-based failover patterns remain the most robust
- Resiliency is a business decision, not just a technical one

References
- Azure Front Door: Implementing lessons learned following October outages | Microsoft Community Hub
- Azure Front Door Resiliency Deep Dive and Architecting for Mission Critical – John Savill's deep dive into Azure Front Door resilience and options for mission-critical applications
- Global Routing Redundancy for Mission-Critical Web Applications – Azure Architecture Center | Microsoft Learn
- Architecture Best Practices for Azure Front Door – Microsoft Azure Well-Architected Framework | Microsoft Learn

How Azure NetApp Files Object REST API powers Azure and ISV Data and AI services – on YOUR data
This article introduces the Azure NetApp Files Object REST API, a transformative solution for enterprises seeking seamless, real-time integration between their data and Azure's advanced analytics and AI services. By enabling direct, secure access to enterprise data, without costly transfers or duplication, the Object REST API accelerates innovation, streamlines workflows, and enhances operational efficiency. With S3-compatible object storage support, it empowers organizations to make faster, data-driven decisions while maintaining compliance and data security. Discover how this new capability unlocks business potential and drives a new era of productivity in the cloud.

Azure Local LENS workbook—deep insights at scale, in minutes
Azure Local at scale needs fleet-level visibility

As Azure Local deployments grow from a handful of instances to hundreds (or even thousands), the operational questions change. You are no longer troubleshooting a single environment; you are looking for patterns across your entire fleet:
- Which sites are trending with a specific health issue?
- Where are workload deployments increasing over time, and do we have enough capacity available?
- Which clusters are outliers compared to the rest?

Today we're sharing Azure Local LENS: a free, community-driven Azure Workbook designed to help you gain deep insights across a large Azure Local fleet, quickly and consistently, so you can move from reactive troubleshooting to proactive operations.

Get the workbook and step-by-step instructions to deploy it here: https://aka.ms/AzureLocalLENS

Who is it for?

This workbook is especially useful if you manage or support:
- Large Azure Local fleets distributed across many sites (retail, manufacturing, branch offices, healthcare, etc.)
- Central operations teams that need standardized health/update views
- Architects who want to aggregate data to gain insights into cluster and workload deployment trends over time

What is Azure Local LENS?

Azure Local - Lifecycle, Events & Notification Status (LENS) is a workbook that brings together the signals you need to understand your Azure Local estate through a fleet lens. Instead of jumping between individual resources, you can use a consistent set of views to compare instances, spot outliers, and drill into the focus areas that need attention.
- Fleet-first design: Start with an estate-wide view, then drill down to a specific site/cluster using the seven tabs in the workbook.
- Operational consistency: Standard dashboards help teams align on "what good looks like" across environments, covering update trends, health check results, and more.
- Actionable insights: Identify hotspots and trends early so you can plan health remediation, updates, and workload capacity with confidence.

What insights does it provide?

Azure Local LENS is built to help you answer the questions that matter at scale, such as:
- Fleet scale overview and connection status: How many Azure Local instances do you have, and what are their connection, health, and update statuses?
- Workload deployment trends: Where have you deployed Azure Local VMs and AKS Arc clusters, how many do you have in total, and are they connected and in a healthy state?
- Top issues to prioritize: What are the common signals across your estate that deserve operational focus, such as update health checks, extension failures, or Azure Resource Bridge connectivity issues?
- Updates: What is your overall update compliance status for Solution and SBE updates? What are the average, standard deviation, and 95th percentile update durations for your fleet?
- Drilldown workflow: After spotting an outlier, what does the instance-level view show, so you can act or link directly to the Azure portal for more actions and support?

Get started in minutes

If you are managing Azure Local instances, give Azure Local LENS a try and see how quickly a fleet-wide view can help with day-to-day management by surfacing trends and actionable insights. The workbook is an open-source, community-driven project hosted in a public GitHub repository, which includes full step-by-step setup instructions at https://aka.ms/AzureLocalLENS. Most teams can deploy the workbook and start exploring insights in a matter of minutes (depending on your environment).

An example of the "Azure Local Instances" tab:

How teams are using fleet dashboards like LENS
- Weekly fleet review: Use a standard set of views to review top outliers and trend shifts, then assign follow-ups.
- Update planning: Identify clusters with system health check failures, and prioritize resolving the issues based on the frequency of each issue category.
- Update progress: Review clusters' update status (InProgress, Failed, Success) and take action based on trends and insights from real-time data.
- Baseline validation: Spot clusters that consistently differ from the norm, which can signal a configuration or environmental difference, such as network access, policies, operational procedures, or other factors.

Feedback and what's next

This workbook is a community-driven, open-source project intended to be practical and easy to adopt. The project is not a Microsoft-supported offering. If you encounter any issues, have feedback, or want to request a new feature, please raise an issue on the GitHub repository so we can track discussions, prioritize improvements, and keep updates transparent for everyone.

Author Bio

Neil Bird is a Principal Program Manager in the Azure Edge & Platform Engineering team at Microsoft. His background is in Azure and hybrid/sovereign cloud infrastructure, specialising in operational excellence and automation. He is passionate about helping customers deploy and manage cloud solutions successfully using Azure and Azure Edge technologies.
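Circling back to the update-duration metrics mentioned under "What insights does it provide?": the average, standard deviation, and 95th percentile figures LENS surfaces are plain descriptive statistics over per-cluster update run times. The sketch below illustrates that arithmetic in Python with made-up sample data; it is not taken from the workbook's actual queries, and the nearest-rank percentile shown is just one common convention.

```python
import statistics

# Hypothetical per-cluster solution-update durations, in minutes.
durations = [95, 110, 102, 240, 98, 105, 400, 99, 120, 101]

avg = statistics.mean(durations)       # fleet average duration
stdev = statistics.stdev(durations)    # sample standard deviation

def percentile(values, pct):
    """Nearest-rank percentile: smallest value >= pct% of the sample."""
    ordered = sorted(values)
    rank = max(1, round(pct / 100 * len(ordered)))
    return ordered[rank - 1]

p95 = percentile(durations, 95)
print(f"avg={avg:.1f} min, stdev={stdev:.1f} min, p95={p95} min")
```

A large gap between the average and the 95th percentile (as in this sample, where two long-running updates dominate the tail) is exactly the kind of outlier signal the drilldown workflow is meant to surface.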