Building an Enterprise Platform for Inference at Scale
Architecture Decisions

With the optimization stack in place, the next layer of decisions is architectural — how you distribute compute across GPUs, nodes, and deployment environments to match your model size and traffic profile.

GPU Parallelism Strategy on AKS

| Strategy | How It Works | When to Use | Tradeoff |
|---|---|---|---|
| Tensor Parallelism | Splits weight matrices within each layer across GPUs (intra-layer sharding); all GPUs participate in every forward pass. | Model exceeds single-GPU memory (e.g., 70B on A100 GPUs once weights, KV cache, and runtime overhead are included). | Inter-GPU communication overhead; requires fast interconnects (NVLink on ND-series) — costly to scale beyond a single node without them. |
| Pipeline Parallelism | Distributes layers sequentially across nodes, with each stage processing part of the model. | Model exceeds single-node GPU memory — typically unquantized deployments beyond ~70–100B, depending on node GPU count and memory. | Pipeline "bubbles" reduce utilization; unfriendly to small batches. |
| Data Parallelism | Replicates the full model across GPUs. | Scaling throughput/QPS on AKS node pools. | Memory-inefficient (full copy per replica); the only strategy that scales throughput linearly. |
| Combined | Tensor within node + pipeline across nodes + data for throughput scaling. | Production at scale on AKS — for any model requiring multi-node deployment, combine TP within each node and PP across nodes. | Complexity; standard for large deployments. |

When a model can be quantized to fit a single GPU or a single node, the performance and cost benefits of avoiding cross-node communication are substantial. When quality permits, quantize before introducing distributed sharding, because fitting on a single GPU or single node often delivers the best latency and cost profile. If the model still doesn't fit after quantization, tensor parallelism across GPUs within a single node is the next step — keeping communication on fast intra-node interconnects like NVLink.
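As a back-of-envelope illustration of that quantize-first decision, the sketch below estimates the minimum tensor-parallel degree for a model at different weight precisions. The numbers are illustrative assumptions (a flat 30% headroom for KV cache and runtime overhead), not sizing guidance for any specific engine or SKU.

```python
# Rough sizing sketch — illustrative assumptions only, not vendor guidance.

def weights_gib(params_b: float, bytes_per_param: float) -> float:
    """Approximate weight memory in GiB (ignores KV cache and runtime overhead)."""
    return params_b * 1e9 * bytes_per_param / 2**30

def min_tensor_parallel(params_b: float, bytes_per_param: float,
                        gpu_mem_gib: float, overhead: float = 1.3) -> int:
    """Smallest power-of-two GPU count whose pooled memory holds the model,
    assuming ~30% extra headroom for KV cache and runtime overhead."""
    need = weights_gib(params_b, bytes_per_param) * overhead
    tp = 1
    while tp * gpu_mem_gib < need:
        tp *= 2
    return tp

# A 70B model in FP16 needs ~130 GiB of weights alone, so it cannot fit one
# A100 80GB — but INT4 quantization (~0.5 bytes/param) brings it under one GPU.
print(min_tensor_parallel(70, 2.0, 80))   # FP16: prints 4 (2 GPUs leave no headroom)
print(min_tensor_parallel(70, 0.5, 80))   # INT4: prints 1 (fits a single GPU)
```

Under these assumed numbers, quantization alone collapses a multi-GPU tensor-parallel deployment back to a single GPU, which is exactly the latency and cost win the paragraph above describes.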
Once the model fits, scale throughput through data parallelism. Pipeline parallelism across nodes is a last resort: it introduces cross-node communication overhead and pipeline bubbles that hurt latency at inference batch sizes.

In practice, implementing combined parallelism requires coordinating placement of model shards across nodes, managing inter-GPU communication, and ensuring that scaling decisions don't break shard assignments. Anyscale on Azure handles this orchestration layer through Ray's distributed scheduling primitives — specifically placement groups, which allow tensor-parallel shards to be co-located within a node while data-parallel replicas scale independently across node pools. The result is that teams get the throughput benefits of combined parallelism without building and maintaining the scheduling logic themselves.

Deployment Topology

Parallelism strategy determines how you use GPUs inside a deployment. Topology determines where those deployments run.

Cloud (AKS) offers flexibility and elastic scaling across Azure GPU SKUs (ND GB200-v6, ND H100 v5, NC A100 v4). Anyscale on Azure adds managed Ray clusters that run inside the customer's AKS environment, with Azure billing integration, Microsoft Entra ID integration, and connectivity to Azure storage services.

Edge enables ultra-low latency, avoids per-query cloud inference cost, and supports local data residency — critical in environments such as manufacturing, healthcare, and retail.

Hybrid is the pragmatic default for most enterprises. Sensitive data stays local with small quantized models; complex analysis routes to AKS. Azure Arc can extend governance across hybrid deployments.

Across all three deployment patterns — cloud, edge, and hybrid — the operational challenge is consistent: managing distributed inference workloads without fragmenting your control plane. Anyscale on AKS addresses this directly.
In pure cloud deployments, it provides managed Ray clusters inside your own Azure subscription, eliminating the need to operate Ray infrastructure yourself. In hybrid architectures, Ray clusters on AKS serve as the cloud leg, with Azure Arc extending Azure RBAC, Azure Policy for governance, and centralized audit logging to Arc-enabled servers and Kubernetes clusters on the edge infrastructure. The result is a single operational model regardless of where inference is actually executing: scheduling, scaling, and observability are handled by Ray, the network boundary stays inside your Azure environment, and the governance layer stays consistent across locations. Teams that would otherwise maintain separate orchestration stacks for cloud and edge workloads can run both through a unified Ray deployment managed by Anyscale.

The Enterprise Platform — Security, Compliance, and Governance on AKS

The optimizations in this series — quantization, continuous batching, disaggregated inference, MIG partitioning — all assume a platform that meets enterprise requirements for security, compliance, and data governance. Without that foundation, none of the performance work matters. A fraud detection model that leaks customer data is not "cost-efficient." An inference endpoint exposed to the public internet is not "low-latency." The platform has to be solid before the optimizations can be useful.

Self-hosting inference on AKS provides that foundation. Every inference request — input prompts, output tokens, KV cache, model weights, fine-tuning data — stays inside the customer's own Azure subscription and virtual network. Data never traverses third-party infrastructure. This eliminates an entire class of data residency and sovereignty concerns that hosted API services cannot address by design.
Network Isolation and Access Control

AKS supports private clusters in which the Kubernetes API server is exposed through Azure Private Link rather than a public endpoint, limiting API-server access to approved private network paths. All traffic between the API server and GPU node pools stays internal. Network Security Groups (NSGs), Azure Firewall, and Kubernetes network policies enforced through Azure CNI powered by Cilium can restrict traffic between pods, namespaces, and external endpoints, enabling micro-segmentation between inference workloads.

Microsoft Entra ID integration with Kubernetes RBAC handles enterprise identity management: SSO, group-based role assignments, and automatic permission updates when team membership changes. Managed identities eliminate credentials in application code. Azure Key Vault stores secrets, certificates, and API keys with hardware-backed encryption.

The Anyscale on Azure integration inherits this entire stack. Workloads run inside the customer's AKS cluster — with Entra ID authentication, Azure Blob Storage connectivity via private endpoints, and unified Azure billing. There is no separate Anyscale-controlled infrastructure to audit or secure.

The Metrics That Determine Profitability

| Metric | What It Measures | Why It Matters |
|---|---|---|
| Tokens/second/GPU | Raw hardware throughput | Helps you understand how much work each GPU can do and supports capacity planning on AKS GPU node pools. |
| Tokens/GPU-hour | Unit economics | Tokens generated per Azure VM billing hour — the number your CFO cares about. |
| P95 / P99 latency | Tail latency | Shows the experience of slower requests, which matters more than averages in real production systems. |
| GPU utilization % | Paid vs. used Azure GPU capacity | Low utilization means you are paying for expensive GPU capacity that is sitting idle or underused. |
| Output-to-input token ratio | Generation cost ratio | Higher output ratios increase generation time and reduce how many requests each GPU can serve per hour. |
| KV cache hit rate | Context reuse efficiency | Low hit rates mean more recomputation of prior context, which increases latency and cost. |

Product design directly affects inference economics. Defaulting to verbose responses when concise ones suffice consumes more GPU cycles per request, reducing how many requests each GPU can serve per hour.

Conclusion

Base model intelligence is increasingly commoditized. Inference efficiency compounds. Organizations that treat inference as a first-class engineering and financial discipline win. By deliberately managing the accuracy–latency–cost tradeoff and tracking tokens per GPU-hour like a core unit metric, they deploy AI cheaper, scale faster, and protect margins as usage grows.

Links:
Strategic partnership: Powering Distributed AI/ML at Scale with Azure and Anyscale | All things Azure
Part 1: Inference at Enterprise Scale: Why LLM Inference Is a Capital Allocation Problem | Microsoft Community Hub
Part 2: The LLM Inference Optimization Stack: A Prioritized Playbook for Enterprise Teams | Microsoft Community Hub
Part 3: (this one)

Migrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)
What is ACI/Virtual Nodes?

Azure Container Instances (ACI) is a fully managed serverless container platform which gives you the ability to run containers on demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM-backed node pools. From a developer's perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI's serverless infrastructure, enabling fast scale-out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short-lived workloads where speed and cost efficiency matter more than long-running capacity planning.

Introducing the next generation of Virtual Nodes on ACI

The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add-on and delivering a more Kubernetes-native, flexible, and scalable experience when bursting workloads from AKS to ACI. In this article I will demonstrate how you can migrate an existing AKS cluster using the Virtual Nodes managed add-on (legacy) to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm.

More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide. Please note that all code samples within this guide are examples only, and are provided without warranty/support.
Background

Virtual Nodes on ACI is rebuilt from the ground up, and includes several fixes and enhancements, for instance:

Added support/features
- VNet peering, and outbound traffic to the internet with network security groups
- Init containers
- Host aliases
- Arguments for exec in ACI
- Persistent Volumes and Persistent Volume Claims
- Container hooks
- Confidential containers (see supported regions list here)
- ACI standby pools
- Support for image pulling via Private Link and Managed Identity (MSI)

Planned future enhancements
- Kubernetes network policies
- Support for IPv6
- Windows containers
- Port Forwarding

Note: The new generation of the add-on is managed via Helm rather than as an AKS managed add-on.

Requirements & limitations
- Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB memory on one of the AKS cluster's VMs
- Each Virtual Nodes node supports up to 200 pods
- DaemonSets are not supported
- Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (Kubenet is not supported, nor is overlay networking)

Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart

For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, and therefore the steps below show an example of removing the Virtual Nodes managed add-on and its resources and then installing the Virtual Nodes on ACI Helm chart.

In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet; however, if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here. Be sure to use a new subnet CIDR within the VNet address space which doesn't overlap with other subnets, nor with the AKS CIDRs for nodes/pods and ClusterIP services. To minimise disruption, we'll first install the Virtual Nodes on ACI Helm chart, before then removing the legacy managed add-on and its resources.
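When picking that new, non-overlapping subnet CIDR, a quick overlap check can save a failed deployment. The address ranges below are placeholder assumptions — substitute your own VNet layout:

```python
# Quick overlap sanity check for a candidate Virtual Nodes subnet CIDR.
# The address ranges are illustrative placeholders, not recommendations.
from ipaddress import ip_network

vnet = ip_network("10.240.0.0/16")
existing = [
    ip_network("10.240.0.0/20"),    # e.g. AKS node subnet
    ip_network("10.240.16.0/24"),   # e.g. another existing subnet
]
candidate = ip_network("10.240.32.0/24")

assert candidate.subnet_of(vnet), "candidate must sit inside the VNet address space"
assert not any(candidate.overlaps(net) for net in existing), "CIDR overlap detected"
print("candidate CIDR is safe to use")
```

Note this only guards against overlap within one VNet; as the article advises, the same check is worth running against the AKS pod and ClusterIP service CIDRs too.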
Prerequisites
- A recent version of the Azure CLI
- An Azure subscription with sufficient ACI quota for your selected region
- Helm

Deployment steps

Initialise environment variables:

```bash
location=northeurope
rg=rg-virtualnode-demo
vnetName=vnet-virtualnode-demo
clusterName=aks-virtualnode-demo
aksSubnetName=subnet-aks
vnSubnetName=subnet-vn
```

Create the new Virtual Nodes on ACI subnet with the specific name value of cg (a custom subnet can be used by following the steps here):

```bash
vnSubnetId=$(az network vnet subnet create \
  --resource-group $rg \
  --vnet-name $vnetName \
  --name cg \
  --address-prefixes <your subnet CIDR> \
  --delegations Microsoft.ContainerInstance/containerGroups \
  --query id -o tsv)
```

Assign the cluster's -kubelet identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet:

```bash
nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv)
nodeRgId=$(az group show -n $nodeRg --query id -o tsv)
agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv)
agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv)

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Contributor" \
  --scope "$nodeRgId"

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Network Contributor" \
  --scope "$vnSubnetId"
```

Download the cluster's kubeconfig file:

```bash
az aks get-credentials -n $clusterName -g $rg
```

Clone the virtualnodesOnAzureContainerInstances GitHub repo:

```bash
git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git
```

Install the Virtual Nodes on ACI Helm chart:

```bash
helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode
```

Confirm the Virtual Nodes node shows within the cluster and is in a Ready state
(virtualnode-n):

```
$ kubectl get node
NAME                                STATUS   ROLES    AGE     VERSION
aks-nodepool1-35702456-vmss000000   Ready    <none>   4h13m   v1.33.6
aks-nodepool1-35702456-vmss000001   Ready    <none>   4h13m   v1.33.6
virtualnode-0                       Ready    <none>   162m    v1.33.7
```

Scale down any running Virtual Nodes workloads (example below):

```bash
kubectl scale deploy <deploymentName> -n <namespace> --replicas=0
```

Drain and cordon the legacy Virtual Nodes node:

```bash
kubectl drain virtual-node-aci-linux
```

Disable the Virtual Nodes managed add-on (legacy):

```bash
az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node
```

Export a backup of the original subnet configuration:

```bash
az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json
```

Delete the original subnet (subnets cannot be renamed and therefore must be re-created):

```bash
az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName
```

Delete the previous (legacy) Virtual Nodes node from the cluster:

```bash
kubectl delete node virtual-node-aci-linux
```

Test and confirm pod scheduling on the Virtual Node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
  name: demo-pod
spec:
  containers:
    - command:
        - /bin/bash
        - -c
        - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done'
      image: mcr.microsoft.com/azure-cli
      name: hello-world-counter
      resources:
        limits:
          cpu: 2250m
          memory: 2256Mi
        requests:
          cpu: 100m
          memory: 128Mi
  nodeSelector:
    virtualization: virtualnode2
  tolerations:
    - effect: NoSchedule
      key: virtual-kubelet.io/provider
      operator: Exists
```

If the pod successfully starts on the Virtual Node, you should see something similar to the below:

```
$ kubectl get pod -o wide demo-pod
NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
demo-pod   1/1     Running   0          95s   10.241.0.4   vnode2-virtualnode-0   <none>           <none>
```

Modify the nodeSelector and tolerations properties of your Virtual Nodes workloads to match the requirements of Virtual Nodes on ACI (see details below).

Modify your deployments to run on Virtual Nodes on ACI

For the Virtual Nodes managed add-on (legacy), the following nodeSelector and tolerations are used to run pods on Virtual Nodes:

```yaml
nodeSelector:
  kubernetes.io/role: agent
  kubernetes.io/os: linux
  type: virtual-kubelet
tolerations:
  - key: virtual-kubelet.io/provider
    operator: Exists
  - key: azure.com/aci
    effect: NoSchedule
```

For Virtual Nodes on ACI, the nodeSelector/tolerations are slightly different:

```yaml
nodeSelector:
  virtualization: virtualnode2
tolerations:
  - effect: NoSchedule
    key: virtual-kubelet.io/provider
    operator: Exists
```

Troubleshooting

Check the virtual-node-admission-controller and virtualnode-n pods are running within the vn2 namespace:

```
$ kubectl get pod -n vn2
NAME                                                 READY   STATUS    RESTARTS        AGE
virtual-node-admission-controller-54cb7568f5-b7hnr   1/1     Running   1 (5h21m ago)   5h21m
virtualnode-0                                        6/6     Running   6 (4h48m ago)   4h51m
```

If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use kubectl describe pod to validate).
If the virtualnode-n pod is crashing, check the logs of the proxycri container to see whether there are any Managed Identity permissions issues (the cluster's -agentpool MSI needs to have Contributor access on the infrastructure resource group):

```bash
kubectl logs -n vn2 virtualnode-0 -c proxycri
```

Further troubleshooting guidance is available within the official documentation.

Support

If you have issues deploying or using Virtual Nodes on ACI, add a GitHub issue here.

After Ingress NGINX: Migrating to Application Gateway for Containers
If you're running Ingress NGINX on AKS, you've probably seen the announcements by now. The community Ingress NGINX project is being retired, upstream maintenance ends in March 2026, and Microsoft's extended support for the Application Routing add-on runs out in November 2026. A migration to another solution is inevitable. There are a few places you can go. This post focuses on Application Gateway for Containers: what it is, why it's worth the move, and how to actually do it. Microsoft has also released a migration utility that handles most of the translation work from your existing Ingress resources, so we'll cover that too.

Ingress NGINX Retirement

Ingress NGINX has been the default choice for Kubernetes HTTP routing for years. It's reliable, well-understood, and it appears in roughly half the "getting started with AKS" tutorials on the internet. So the retirement announcement caught a lot of teams off guard. In November 2025, the Kubernetes SIG Network and Security Response Committee announced that the community ingress-nginx project would enter best-effort maintenance until March 2026, after which there will be no further releases, bug fixes, or security patches. It had been running on a small group of volunteers for years, accumulated serious technical debt from its flexible annotation model, and the maintainers couldn't sustain it.

For AKS, the timeline depends on how you're running it. If you self-installed via Helm, you're directly exposed to the March 2026 upstream deadline; after that, you're on your own for CVEs. If you're using the Application Routing add-on, Microsoft has committed to critical security patches until November 2026, but nothing beyond that. No new features, no general bug fixes.

Application Gateway for Containers

Application Gateway for Containers (AGC) is Azure's managed Layer 7 load balancer for AKS, and it's the successor to both the classic Application Gateway Ingress Controller and the Ingress API approach more broadly.
It went GA in late 2024 and added WAF support in November 2025. The architecture splits across two planes. On the Azure side, you have the AGC resource itself: a managed load balancer that sits outside your cluster and handles the actual traffic. It has child resources for frontends (the public entry points, each with an auto-generated FQDN) and an association that links it to a dedicated delegated subnet in your VNet. Unlike the older App Gateway Ingress Controller, AGC is a standalone Azure resource; you don't deploy an App Gateway instance.

On the Kubernetes side, the ALB Controller runs as a small deployment in your cluster. It watches for Gateway API resources — Gateways, HTTPRoutes, and the various AGC policy types — and translates them into configuration on the AGC resource. When you create or update an HTTPRoute, the controller picks it up and pushes the changes to the data plane.

AGC supports both Gateway API and the Ingress API. This means you don't have to convert everything to Gateway API resources in one shot. Gateway API is where the richer functionality lives, though, so you may want to consider undertaking this migration.

For deployment, you have two options:

Bring Your Own (BYO) — you create the AGC resource, frontend, and subnet association in Azure yourself using the CLI, portal, Bicep, Terraform, or whatever tool you prefer. The ALB Controller then references the resource by ID. This gives you full control over the Azure-side lifecycle and fits well into existing IaC pipelines.

Managed by ALB Controller — you define an ApplicationLoadBalancer custom resource in Kubernetes and the ALB Controller creates and manages the Azure resources for you. Simpler to get started, but the Azure resource lifecycle is tied to the Kubernetes resource, which some teams find uncomfortable for production workloads.

One prerequisite worth flagging upfront: AGC requires Azure CNI or Azure CNI Overlay.
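For reference, the ApplicationLoadBalancer custom resource used by the managed option is a small manifest along these lines. This is a sketch: the name, namespace, and subnet resource ID are placeholders, and the exact schema for your ALB Controller version should be checked against the AGC documentation.

```yaml
# Illustrative sketch only — substitute your own names and subnet resource ID.
apiVersion: alb.networking.azure.io/v1
kind: ApplicationLoadBalancer
metadata:
  name: demo-alb
  namespace: alb-infra
spec:
  associations:
    # Delegated subnet the ALB Controller will associate the AGC resource with
    - /subscriptions/<sub-id>/resourceGroups/<rg>/providers/Microsoft.Network/virtualNetworks/<vnet>/subnets/<subnet>
```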
Kubenet has been deprecated and will be fully retired in 2028, so if you're on Kubenet, you'll need to plan a CNI migration alongside this work. There is an in-place cluster migration process to allow you to do this without re-building your cluster.

Why Choose AGC Over Other Alternatives?

AGC's architecture is different from running an in-cluster ingress controller, and worth understanding before you start. The data plane runs outside your cluster entirely. With NGINX you're running pods that consume node resources, need upgrading, and can themselves become a reliability concern. With AGC, that's Azure's problem. You're not patching an ingress controller or sizing nodes around it. The ALB Controller does run a small number of pods in your cluster, but they're lightweight, watching Kubernetes resources and syncing configuration to the Azure data plane. They're not in the traffic path, and their resource footprint is minimal.

Ingress and HTTPRoute resources still reference Kubernetes Services as usual. Application Gateway for Containers runs an Azure-managed data plane outside the cluster and routes traffic directly to backend pod IPs using Kubernetes Endpoint/EndpointSlice data, rather than relying on in-cluster ingress pods. This enables faster convergence as pods scale and allows health probing and traffic management to be handled at the gateway layer.

WAF is built in, using the same Azure WAF policies you might already have. If you're currently running a separate Application Gateway in front of your cluster purely for WAF, AGC removes that extra hop and leaves you one fewer resource to keep current.

Configuration changes push to the data plane near-instantly, without a reload cycle. NGINX reloads its config when routes change, which is mostly fine, but noticeable if you're in a high-churn environment with frequent deployments.

Building on Gateway API from the start also means you're not doing this twice. It's where Kubernetes ingress is heading, and AGC fully supports it.
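To make that concrete, a minimal Gateway API pairing for AGC might look like the sketch below. The resource names and backend Service are placeholders, and a real deployment also wires the Gateway to the AGC resource and frontend (via annotations in BYO mode), so treat this as the shape of the configuration rather than a working manifest.

```yaml
# Illustrative sketch — names and the backend Service are assumptions.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: demo-gateway
spec:
  gatewayClassName: azure-alb-external   # AGC's gateway class
  listeners:
    - name: http
      port: 80
      protocol: HTTP
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: demo-route
spec:
  parentRefs:
    - name: demo-gateway
  rules:
    - backendRefs:
        - name: demo-service   # existing ClusterIP Service
          port: 8080
```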
By taking advantage of the Gateway API you are defining your configuration once, in a proxy-agnostic manner, and can easily switch the underlying proxy if you need to at a later date, avoiding vendor lock-in.

Planning Your Migration

Before you run any tooling or touch any manifests, spend some time understanding what you actually have. Start by inventorying your Ingress NGINX resources across all clusters and namespaces. You want to know how many Ingress objects you have, which annotations they're using, and whether there's anything non-standard such as custom snippets, Lua configurations, or patterns that lean heavily on NGINX-specific behaviour. The migration utility will flag most of this, but knowing upfront means fewer surprises.

Next, confirm your cluster prerequisites. AGC requires Azure CNI or Azure CNI Overlay and workload identity. If you're on Kubenet, that migration needs to happen first. Check that workload identity is enabled on your cluster.

Decide on your deployment model before generating any output. BYO gives you full control over the AGC resource lifecycle and slots into existing IaC pipelines cleanly, but requires you to pre-create Azure resources. Managed is simpler to get started with but ties the Azure resource lifecycle to Kubernetes objects, which can feel uncomfortable for production workloads.

Finally, decide whether you want to migrate from the Ingress API to the Gateway API as part of this work, or keep your existing Ingress resources and just swap the controller. AGC supports both. Doing both at once is more work upfront but gets you to the right place in a single migration. Keeping Ingress resources is lower risk in the short term, but you'll need to do the API migration later regardless.

Introducing the AGC Migration Utility

Microsoft released the AGC Migration Utility in January 2026 as an official CLI tool to handle the conversion of existing Ingress resources to Gateway API resources compatible with AGC.
It doesn't modify anything on your cluster. It reads your existing configuration and generates YAML you can review and apply when you're ready. One thing to be aware of is that the migration utility only generates Gateway API resources, so if you use it, you're moving off the Ingress API at the same time as moving off NGINX. There's no flag to produce Ingress resources for AGC instead. If you want to land on AGC but keep Ingress resources for now, you'll need to set that up manually.

There are two input modes. In files mode, you point it at a directory of YAML manifests and it converts them locally without needing cluster access. In cluster mode, it connects to your current kubeconfig context and reads Ingress resources directly from a live cluster. Both produce the same output.

Alongside the converted YAML, the utility produces a migration report covering every annotation it encountered. Each annotation gets a status: completed, warning, not-supported, or error. The warning and not-supported statuses are where you'll need to do some manual work. These represent annotations that either migrated with caveats, or have no AGC equivalent at all.

The coverage of NGINX annotations is broad. URL rewrites, SSL redirects, session affinity, backend protocol, mTLS, WAF, canary routing by weight or header, permanent and temporary redirects, custom hostnames: most of the common patterns are covered. Before you run a full conversion, it's worth doing a --dry-run pass first to get a clear picture of what needs manual attention.

Migrating Step by Step

With prerequisites confirmed and your deployment model chosen, here's how the migration looks end to end.

1. Get the utility

Pre-built binaries for Linux, macOS, and Windows are available on the GitHub releases page. Download the binary for your platform and make it executable. If you'd prefer to build from source, clone the repo and run ./build.sh from the root, which produces binaries in the bin folder.

2. Run a dry-run against your manifests

Before generating any output, run in dry-run mode to see what the migration report looks like. This tells you which annotations are fully supported, which need manual attention, and which have no AGC equivalent.

```bash
./agc-migration files --provider nginx --ingress-class nginx --dry-run ./manifests/*.yaml
```

If you'd rather read directly from your cluster, use cluster mode:

```bash
./agc-migration cluster --provider nginx --ingress-class nginx --dry-run
```

3. Review the migration report

Work through the report before proceeding. Anything marked not-supported needs a plan. The next section covers the most common gaps, but the report itself includes specific recommendations for each issue it finds.

4. Set up AGC and install the ALB Controller

Before applying any generated resources you need AGC running in Azure and the ALB Controller installed in your cluster. The setup process is well documented, so rather than reproduce it here, follow the official quickstart at aka.ms/agc. Make sure you note the resource ID of your AGC instance if you're using BYO deployment, as you'll need it in the next step.

5. Generate the converted resources

Run the utility again with your chosen deployment flag to generate output:

```bash
# BYO
./agc-migration files --provider nginx --ingress-class nginx \
  --byo-resource-id $AGC_ID \
  --output-dir ./output \
  ./manifests/*.yaml

# Managed
./agc-migration files --provider nginx --ingress-class nginx \
  --managed-subnet-id $SUBNET_ID \
  --output-dir ./output \
  ./manifests/*.yaml
```

6. Review and apply the generated resources

Check the generated Gateway, HTTPRoute, and policy resources against your expected routing behaviour before applying anything. Apply to a non-production cluster first if you can.

```bash
kubectl apply -f ./output/
```

7. Validate and cut over

Test your routes before updating DNS.
Running both NGINX and AGC in parallel while you validate is a sensible approach: route test traffic to AGC while NGINX continues serving production, then update your DNS records to point to the AGC frontend FQDN once you're satisfied.

8. Decommission NGINX

Once traffic has been running through AGC cleanly, uninstall the NGINX controller and remove the old Ingress resources. Two ingress controllers watching the same resources will cause confusion sooner or later.

What the Migration Utility Doesn't Handle

The utility covers a lot of ground, but there are some gaps you should be clear on. Annotations marked not-supported in the migration report have no direct AGC equivalent and won't appear in the generated output. The most common for NGINX users are custom snippets and Lua-based configurations, which allow arbitrary NGINX config to be injected directly into the server block. There's no equivalent in AGC or Gateway API. If you're relying on these, you'll need to work out whether AGC's native routing capabilities can cover the same requirements through HTTPRoute filters, URL rewrites, or header manipulation.

The utility doesn't migrate TLS certificates or update certificate references in the generated resources. Your existing Kubernetes Secrets containing certificates should carry over without changes, but verify that the Secret references in your generated Gateway and HTTPRoute resources are correct before cutting over.

DNS cutover is outside the scope of the utility entirely. Once your AGC frontend is provisioned it gets an auto-generated FQDN, and you'll need to update your DNS records or CNAME entries accordingly. Any GitOps or CI/CD pipelines that reference your Ingress resources by name or apply them from a specific path will also need updating to reflect the new Gateway API resource types and output structure.

Conclusion

For many, the retirement of Ingress NGINX is unwanted complexity and extra work.
If you have to migrate, though, you can use it as an opportunity to land on a significantly better architecture: Gateway API as your routing layer, WAF and per-pod load balancing built in, and an ingress data plane that's fully managed by Azure rather than running in your cluster.

The migration utility can take care of a lot of the mechanical conversion work. Rather than manually rewriting Ingress resources into Gateway API equivalents and mapping NGINX annotations to their AGC counterparts, the utility does that translation for you and produces a migration report that tells you exactly what it couldn't handle. Running a dry-run against your manifests is a good first step to get a clear picture of your annotation coverage and what needs manual attention before you commit to a timeline.

Full documentation for AGC is at aka.ms/agc and the migration utility repo is at github.com/Azure/Application-Gateway-for-Containers-Migration-Utility.

Ingress NGINX retirement is coming up fast: the standalone implementation retires at the end of March 2026. Using the App Routing add-on for AKS gives you a little breathing room until November 2026, but it's still not long. Make sure you have a solution in place before this date to avoid running unsupported and potentially vulnerable software on your critical infrastructure.

The LLM Inference Optimization Stack: A Prioritized Playbook for Enterprise Teams
The Solutions — An Optimization Stack for Enterprise Inference

The optimizations below are ordered by implementation priority — starting with the highest-leverage.

The Three-Layer Serving Stack

Most enterprise LLM deployments operate across three layers, each responsible for a different part of the inference pipeline. Understanding which layer a bottleneck belongs to is often the fastest path to improving inference performance. Ray Serve provides the distributed model serving layer — handling request routing, autoscaling, batching, replica placement, and multi-model serving. Azure Kubernetes Service (AKS) orchestrates the infrastructure — GPU nodes, networking, and container lifecycle. Inference engines such as vLLM execute the model forward passes and implement token-generation optimizations such as continuous batching and KV-cache management.

In simple terms: AKS manages infrastructure. Ray Serve manages inference workloads. vLLM generates tokens.

With that architecture in mind, we can examine the optimization stack.

1. GPU Utilization: Maximize What You Already Have

Before optimizing models or inference engines, start here: are you fully utilizing the GPUs you're already paying for? For most enterprise deployments, the answer is no. GPU utilization below 50% means you're effectively paying double for every token generated.

Autoscaling on inference-specific signals. Autoscaling should be driven by request queue depth, GPU utilization, and P95 latency — not generic CPU or memory metrics, which are poor proxies for LLM serving load. AKS supports GPU-enabled node pools with cluster autoscaler integration across NC-series (A100, H100) and ND-series VMs. Scale to zero during idle periods; scale up based on token-level demand, not container-level metrics.

Inference-aware orchestration. AKS orchestrates infrastructure resources such as GPU nodes, pods, and containers.
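The scale-up/scale-down logic described above can be sketched as a plain decision function. This is an illustrative sketch, not a Ray Serve or KEDA API: the `ServingSignals` type, the `desired_replicas` function, and all thresholds are hypothetical placeholders for signals you would wire up from your own metrics pipeline.

```python
from dataclasses import dataclass

@dataclass
class ServingSignals:
    queue_depth: int        # requests waiting per replica
    gpu_utilization: float  # 0.0-1.0
    p95_latency_ms: float

def desired_replicas(current: int, s: ServingSignals,
                     target_queue: int = 4,
                     max_p95_ms: float = 2000.0,
                     min_replicas: int = 0,
                     max_replicas: int = 8) -> int:
    """Scale on inference-specific signals, not CPU/memory proxies."""
    replicas = current
    if s.queue_depth > target_queue or s.p95_latency_ms > max_p95_ms:
        replicas = current + 1   # demand is outpacing capacity
    elif s.queue_depth == 0 and s.gpu_utilization < 0.3:
        replicas = current - 1   # idle GPUs are burned capital
    return max(min_replicas, min(max_replicas, replicas))

# Queue backlog triggers scale-up even though CPU metrics would look quiet.
print(desired_replicas(2, ServingSignals(queue_depth=9, gpu_utilization=0.85, p95_latency_ms=1500)))
# An idle deployment scales toward zero.
print(desired_replicas(1, ServingSignals(queue_depth=0, gpu_utilization=0.05, p95_latency_ms=120)))
```

The point of the sketch is the choice of inputs: queue depth and tail latency move when token-level demand moves, long before container CPU does.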
Ray Serve operates one layer above as the inference orchestration framework, managing model replicas, request routing, autoscaling, streaming responses, and backpressure handling, while inference engines like vLLM perform continuous batching and KV-cache management. The distinction matters because LLM serving load doesn't express well in CPU or memory metrics; Ray Serve operates at the level of tokens and requests, not containers. AKS orchestrates infrastructure; Ray Serve orchestrates model serving. Anyscale Runtime reports faster performance and lower compute cost than self-managed Ray OSS on selected workloads, though gains depend on workload and configuration.

Right-sizing Azure GPU selection. The default instinct when deploying GenAI in production is often to grab the biggest, fastest hardware available. For inference, that is often the wrong call. For structured output tasks, a well-optimized, quantized 7B model running on an NCads H100 v5 (H100 NVL 94GB) or an NC A100 v4 (A100 80GB) node can easily outperform a generalized 70B model on a full ND allocation—at a fraction of the cost. New deployments should target NCads H100 v5.

The secret to cost-effective inference is matching your VM SKU to your workload's specific bottleneck. For compute-heavy prefill phases or massive multi-GPU parallelism, the ND H100 v5's ultra-fast interconnects are unmatched. However, autoregressive token generation (decode) is primarily bound by memory bandwidth. For single-GPU, decode-heavy workloads, the NCads series is the better fit: the H100 NVL 94GB has higher published HBM bandwidth (3.9 TB/s) than the H100 80GB (3.35 TB/s). ND H100 v5 remains the right choice when you need multi-GPU sharding, high aggregate throughput, or tightly coupled scale-out inference. You can extend utilization further with MIG partitioning to host multiple small models on a single NVL card, provided your application can tolerate the proportional drop in memory bandwidth per slice.
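A quick way to sanity-check SKU choices for decode-heavy work is the memory-bandwidth roofline sketched below: each decode step must stream the model weights from HBM at least once, so published bandwidth puts a floor on time per output token, and total latency is then TTFT plus TPOT times the remaining output tokens. The function names are illustrative; the 14 GB figure assumes a 7B model in FP16, and real TPOT will be higher once KV-cache reads, kernel overhead, and batching effects are included.

```python
def decode_floor_ms(model_bytes: float, hbm_bandwidth_bytes_s: float) -> float:
    """Lower bound on time-per-output-token: every decode step must
    stream the model weights from HBM at least once."""
    return model_bytes / hbm_bandwidth_bytes_s * 1e3

def total_latency_ms(ttft_ms: float, tpot_ms: float, output_tokens: int) -> float:
    # Total latency = TTFT + TPOT * (output_tokens - 1)
    return ttft_ms + tpot_ms * (output_tokens - 1)

GB, TB = 1e9, 1e12
weights_7b_fp16 = 14 * GB  # ~7B params * 2 bytes/param

for name, bw in [("H100 NVL (3.9 TB/s)", 3.9 * TB),
                 ("H100 80GB (3.35 TB/s)", 3.35 * TB)]:
    tpot = decode_floor_ms(weights_7b_fp16, bw)
    print(f"{name}: TPOT floor ~{tpot:.2f} ms, <= {1000/tpot:.0f} tok/s per sequence")

# A 256-token answer with an assumed 200 ms TTFT and a realistic 5 ms TPOT:
print(f"total ~{total_latency_ms(200, 5, 256):.0f} ms")
```

The higher-bandwidth NVL part gets a proportionally lower TPOT floor, which is exactly the decode-bound argument made above.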
2. GPU Partitioning: MIG and Fractional GPU Allocation on AKS

For smaller models or moderate-concurrency workloads, dedicating an entire GPU to a single model replica wastes resources. Two techniques address this on AKS.

NVIDIA Multi-Instance GPU (MIG) partitions a single physical GPU into up to seven hardware-isolated instances, each with its own compute cores, memory, cache, and memory bandwidth. Each instance behaves as a standalone GPU with no code changes required. On AKS, MIG is supported on Standard_NC40ads_H100_v5, Standard_ND96isr_H100_v5, and A100 GPU VM sizes, configured at node pool creation using the --gpu-instance-profile parameter (e.g., MIG1g, MIG3g, MIG7g).

Fractional GPU allocation in Ray Serve is a scheduling and placement mechanism, not hardware partitioning. By assigning fractional GPU resources (say, 0.5 GPU per replica) through Ray placement groups, multiple model replicas can share a single physical GPU. Ray Serve propagates the configured fraction to the serving worker (i.e., vLLM), but, unlike MIG, replicas still share the same underlying GPU memory and memory bandwidth. There's no hard isolation. Because fractional allocation does not enforce hard VRAM limits, it requires careful memory management: conservative gpu_memory_utilization configuration, controlled concurrency and context length, and enough headroom for KV cache growth, CUDA overhead, and allocator fragmentation. It works best when model weights are relatively small, concurrency is predictable and moderate, and replica counts are stable. For stronger isolation and guaranteed memory partitioning, use NVIDIA MIG. Fractional allocation is best treated as a GPU packing optimization, not an isolation mechanism.

3. Quantization: The Fastest Path to Cost Reduction

Quantization reduces the numerical precision of model weights, activations, and KV cache entries to shrink memory footprint and increase throughput.
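The memory arithmetic behind quantization is simple enough to sketch: weight footprint is roughly parameter count times bytes per parameter. The script below is an illustrative calculator, not a measurement; `reserve_gb` is a hypothetical allowance standing in for KV cache and runtime overhead, which real capacity planning must model against concurrency and context length.

```python
PRECISION_BYTES = {"bf16": 2.0, "fp16": 2.0, "fp8": 1.0, "int8": 1.0, "int4": 0.5}

def weight_gb(params_b: float, precision: str) -> float:
    """Approximate weight memory in GB: parameters (billions) x bytes/param."""
    return params_b * PRECISION_BYTES[precision]

def fits_on_gpu(params_b: float, precision: str,
                gpu_gb: float = 80, reserve_gb: float = 10) -> bool:
    """reserve_gb is a rough, assumed allowance for KV cache and runtime
    overhead; real planning must model concurrency and context length."""
    return weight_gb(params_b, precision) + reserve_gb <= gpu_gb

for prec in ("bf16", "fp8", "int4"):
    gb = weight_gb(70, prec)
    print(f"70B @ {prec:>4}: ~{gb:.0f} GB weights, "
          f"fits on an 80 GB GPU: {fits_on_gpu(70, prec)}")
```

The 70B row reproduces the numbers in the text: ~140 GB in BF16 (no single-GPU fit) versus ~70 GB in FP8, which squeezes onto an 80 GB card only with modest KV-cache headroom.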
FP16 → INT8 roughly halves memory; 4-bit quantization cuts it by approximately 4×. Post-Training Quantization (PTQ) is the fastest path to production gains. As one example, quantizing Llama-3.3-70B-Instruct reduces weight memory from ~140 GB in BF16 to ~70 GB in FP8, which can make single-GPU deployment feasible on an 80GB GPU for low-concurrency or short-context workloads. Production feasibility still depends on KV cache size, engine overhead, and concurrency, so careful capacity planning is required.

4. Inference Engine Optimizations in vLLM

Modern inference engines — particularly vLLM, which powers Anyscale's Ray Serve on AKS — implement several optimizations that compound to deliver significant throughput improvements.

Continuous batching replaces static batching, where the system waits for all requests in a batch to complete before accepting new ones. With continuous batching, new requests join at every decode iteration, keeping GPUs more fully utilized. Anyscale has demonstrated up to 23x throughput improvement using continuous batching versus static batching (measured on OPT-13B on A100 40GB with varying concurrency levels). In practice, this can push GPU utilization from 30–40% to 80%+ on AKS GPU node pools.

PagedAttention manages KV cache allocation the way an operating system manages RAM — breaking it into small, non-contiguous pages to eliminate fragmentation. Naive KV cache allocation wastes significant reserved memory through internal and external fragmentation. PagedAttention eliminates this, enabling more concurrent requests per GPU. Enabled by default in vLLM.

Prefix caching automatically stores the KV cache of completed requests in a global on-GPU cache. When new requests share common prefixes — system prompts, shared context in RAG — vLLM reuses cached state instead of recomputing it, reducing TTFT and compute load.
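A rough sketch of why prefix caching cuts TTFT: prefill cost scales approximately linearly with the number of uncached prompt tokens, so a cached shared prefix leaves only the suffix to process. The 10,000 tokens/s prefill rate below is an assumed figure for illustration, and the linear model ignores attention's superlinear terms at very long contexts.

```python
def ttft_with_prefix_cache(prompt_tokens: int, cached_prefix_tokens: int,
                           prefill_tokens_per_s: float) -> float:
    """Approximate TTFT in seconds: prefill cost scales roughly linearly
    with the number of *uncached* prompt tokens."""
    uncached = max(prompt_tokens - cached_prefix_tokens, 0)
    return uncached / prefill_tokens_per_s

# RAG-style request: a 3,000-token system prompt + shared context that is
# already cached, plus 500 fresh user tokens (all figures assumed).
cold = ttft_with_prefix_cache(3500, 0, 10_000)
warm = ttft_with_prefix_cache(3500, 3000, 10_000)
print(f"cold TTFT ~{cold*1000:.0f} ms, warm TTFT ~{warm*1000:.0f} ms "
      f"({cold/warm:.0f}x faster)")
```

The larger the shared prefix relative to the full prompt, the larger the TTFT win, which is why system-prompt-heavy and RAG workloads benefit most.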
Anyscale’s PrefixCacheAffinityRouter extends this by routing requests with similar prefixes to the same replica, maximizing cache hit rates across AKS pods.

Chunked prefill breaks large prefill operations into smaller chunks and interleaves them with decode steps. Without it, a long incoming prompt can stall all ongoing decode operations. Chunked prefill keeps streaming responses smooth even when new long prompts arrive, and improves GPU utilization by mixing compute-bound prefill chunks with memory-bound decode. Enabled by default in vLLM V1.

Speculative decoding addresses the sequential decode bottleneck directly. A smaller, faster “draft” model proposes multiple tokens ahead; the larger “target” model verifies them in parallel in a single forward pass. When the draft predicts correctly — which is frequent for routine language patterns — multiple tokens are generated in one step. Output quality is identical because every token is verified by the target model. Particularly effective for code completion, where token patterns are highly predictable.

5. Disaggregated Prefill and Decode

Since prefill is compute-bound and decode is memory-bandwidth-bound, running both on the same GPU forces a compromise — the hardware is optimized for neither. Disaggregated inference separates these phases across different hardware resources. vLLM supports disaggregated prefill and decode, and Ray Serve can orchestrate separate worker pools for each phase. In practice, this means Ray Serve routes each incoming request to a prefill worker first, then hands off the resulting KV cache to a dedicated decode worker — without the application layer needing to manage that handoff. This capability is evolving and should be validated against your Ray and vLLM versions before deploying to production. With MIG or separate node pools, prefill and decode resources can be isolated to better match each phase's hardware requirements.
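Disaggregation makes KV-cache movement a first-order cost, which a short sketch can quantify: the handoff must ship the request's full KV cache from the prefill worker to the decode worker. The attention dimensions below assume a Llama-3-8B-like GQA layout, and both interconnect bandwidth figures are illustrative assumptions rather than measured numbers.

```python
def kv_cache_bytes(tokens: int, num_layers: int = 32, num_kv_heads: int = 8,
                   head_dim: int = 128, dtype_bytes: int = 2) -> int:
    # K and V per token, per layer (Llama-3-8B-like GQA dimensions assumed)
    return tokens * num_layers * num_kv_heads * head_dim * dtype_bytes * 2

def handoff_ms(tokens: int, link_bytes_per_s: float) -> float:
    """Time to ship one request's KV cache from a prefill worker to a
    decode worker over the given interconnect."""
    return kv_cache_bytes(tokens) / link_bytes_per_s * 1e3

GBps = 1e9
for name, bw in [("NVLink-class link (~450 GB/s, assumed)", 450 * GBps),
                 ("100 GbE (~12.5 GB/s)", 12.5 * GBps)]:
    print(f"8K-token KV handoff over {name}: {handoff_ms(8192, bw):.1f} ms")
```

An 8K-token handoff is roughly a gigabyte under these assumptions, which is why the high-bandwidth GPU-to-GPU fabrics discussed next matter so much for this architecture.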
Azure ND GB200 v6 VMs include four NVIDIA Blackwell GPUs per VM, while the broader GB200 NVL72 system enables rack-scale NVLink connectivity — providing the high-bandwidth GPU-to-GPU communication that disaggregated prefill/decode architectures depend on for KV-cache movement.

6. Multi-LoRA Adapters: Serve Many Use Cases from One Deployment

Fine-tuned Low-Rank Adaptation (LoRA) adapters for different domains can share a single base model in GPU memory, with lightweight task-specific layers swapped at inference time. Legal, HR, finance, and engineering copilots are served from one AKS GPU deployment instead of four separate ones. This is a direct cost multiplier: instead of provisioning N separate model deployments for N departments, you provision one base model and swap adapters per request. Ray Serve and vLLM both support multi-LoRA serving on AKS.

Open-Source Models for Enterprise Inference

The open-source model ecosystem has matured to the point where self-hosted inference on open-weight models - running on AKS with Ray Serve and vLLM - is a viable and often preferable alternative to proprietary API access. The strategic advantages are significant: full control over data residency and privacy (workloads run inside your Azure subscription), no per-token API fees (cost shifts to Azure GPU infrastructure), the ability to fine-tune and distill for domain-specific accuracy, no vendor lock-in, and predictable cost structures that don't scale with usage volume.

Leading Open-Source Model Families

Meta Llama (Llama 3.1, Llama 4) is the most widely adopted open-weight model family. Llama 3.1 offers dense models from 8B to 405B parameters; Llama 4 introduces MoE variants. Strong general-purpose performance with native vLLM integration. The 70B variant hits a reasonable quality-to-serving-cost tradeoff for most enterprise use cases. Available under Meta's community license; validate the specific model architecture and license you plan to use.
Qwen (Alibaba) excels in multilingual and reasoning tasks. Qwen3-235B is a MoE model activating roughly 22B parameters per token — delivering frontier-class quality at a fraction of dense-model inference cost. Strong on code, math, and structured output. Apache 2.0 license on most variants.

Mistral models are optimized for efficiency and inference speed. Mistral 7B remains one of the highest-performing models at its size class, making it well-suited for cost-sensitive, high-throughput deployments on smaller Azure GPU SKUs. Mixtral 8x22B provides MoE-based quality scaling. Mistral Large (123B) competes with frontier proprietary models. Licensing varies: most smaller models are Apache 2.0, while some larger releases use research or commercial licensing terms. Verify the license for the specific model prior to production deployment.

DeepSeek (DeepSeek AI) introduced aggressive MoE architectures with cost-efficient training. DeepSeek-V3 (671B total, 37B active per token) delivers strong reasoning quality at significantly lower per-token inference cost than dense models of comparable capability. Strong on math, code, and multilingual tasks. DeepSeek models are developed by a Chinese AI research lab. Organizations in regulated industries should evaluate applicable data sovereignty, export control, and vendor risk policies before deploying DeepSeek weights in production.

The examples below are illustrative starting points rather than fixed recommendations. Actual model and infrastructure choices should be validated against workload-specific latency, accuracy, and cost requirements.
Model Selection Examples

| Workload | Recommended Model Class | Azure Infrastructure | Rationale |
|---|---|---|---|
| Internal copilots, high-throughput APIs | 7B–13B (Llama 8B, Mistral 7B, Qwen 7B) | NCads H100 v5 with MIG, or NC A100 v4 (existing deployments) | 10–30x cheaper serving; recover accuracy via RAG and fine-tuning |
| Customer-facing assistants | 30B–70B (Llama 70B, Qwen 72B, Mistral Large) | NC A100 v4 (80GB, existing deployments) or ND H100 v5 | Quality directly impacts revenue and trust |
| Frontier quality at sub-frontier cost | MoE (Qwen3-235B-A22B, DeepSeek-V3, and Mistral's Mixtral-family models) | ND H100 v5 or ND GB200 v6 | Active parameters determine inference cost, not total model size |
| Code completion and engineering copilots | Code-specialized (DeepSeek-Coder, Qwen-Coder) | NCads H100 v5 with MIG | Domain models outperform larger general models at lower cost |
| Multilingual | Qwen, DeepSeek | Matches workload size above | Strongest non-English performance in open-weight ecosystem |
| Edge / on-device | Small edge-capable models (for example, 2B–8B-class models, often quantized) | Azure IoT Edge / local hardware | Fits within edge memory and power envelopes |

The rule of thumb: start with the smallest model that meets your quality threshold. Add RAG, caching, fine-tuning, and batching before scaling model size. Treat model choice as an ongoing decision — the open-source ecosystem evolves fast enough that what's optimal today may not be in six months. Actual performance varies by workload, so these model and size recommendations should be validated through testing in your target environment.

All leading open-weight models are natively supported by vLLM and Ray Serve / Anyscale on AKS, with out-of-the-box quantization, multi-GPU parallelism, and Multi-LoRA support.

The optimizations above assume a platform that is already secure, governed, and production-hardened. Continuous batching on an exposed endpoint is not a production system.
Part three covers the architecture decisions, security controls, and operational metrics that make enterprise inference deployable — and auditable.

Continue to Part 3: Building an Enterprise Platform for Inference at Scale →
Part 1: Inference at Enterprise Scale: Why LLM Inference Is a Capital Allocation Problem | Microsoft Community Hub
Part 3: Building an Enterprise Platform for Inference at Scale | Microsoft Community Hub

Help wanted: Refresh articles in Azure Architecture Center (AAC)
I’m the Project Manager for architecture review boards (ARBs) in the Azure Architecture Center (AAC). We’re looking for subject matter experts to help us improve the freshness of the AAC, Cloud Adoption Framework (CAF), and Well-Architected Framework (WAF) repos. This opportunity is currently limited to Microsoft employees only.

As an ARB member, your main focus is to review, update, and maintain content to meet quarterly freshness targets. Your involvement directly impacts the quality, relevance, and direction of Azure Patterns & Practices content across AAC, CAF, and WAF. The content in these repos reaches almost 900,000 unique readers per month, so your time investment has a big, global impact. The expected commitment is 4-6 hours per month, including attendance at weekly or bi-weekly sync meetings.

Become an ARB member to gain:

- Increased visibility and credibility as a subject‑matter expert by contributing to Microsoft‑authored guidance used by customers and partners worldwide.
- Broader internal reach and networking without changing roles or teams.
- Attribution on Microsoft Learn articles that you own.
- Opportunity to take on expanded roles over time (for example, owning a set of articles, mentoring contributors, or helping shape ARB direction).

We’re recruiting new members across several ARBs. Our highest needs are in the Web ARB, Containers ARB, and Data & Analytics ARB:

- The Web ARB focuses on modern web application architecture on Azure—App Service and PaaS web apps, APIs and API Management, ingress and networking (Application Gateway, Front Door, DNS), security and identity, and designing for reliability, scalability, and disaster recovery.
- The Containers ARB focuses on containerized and Kubernetes‑based architectures—AKS design and operations, networking and ingress, security and identity, scalability, and reliability for production container platforms.
- The Data & Analytics ARB focuses on data platform and analytics architectures—data ingestion and integration, analytics and reporting, streaming and real‑time scenarios, data security and governance, and designing scalable, reliable data solutions on Azure.

We’re also looking for people to take ownership of other articles across AAC, CAF, and WAF. These articles span many areas, including application and solution architectures, containers and compute, networking and security, governance and observability, data and integration, and reliability and operational best practices. You don’t need to know everything—deep expertise in one or two areas and an interest in keeping Azure architecture guidance accurate and current is what matters most.

Please reply to this post if you’re interested in becoming an ARB member, and I’ll follow up with next steps. If you prefer, you can email me at v-jodimartis@microsoft.com. Thanks! 🙂

Inference at Enterprise Scale: Why LLM Inference Is a Capital Allocation Problem
Most enterprise AI conversations focus on model selection and fine-tuning. The harder problem — and the one that determines whether AI investments produce returns or just costs — is inference: serving those models reliably, at scale, under real production load. For organizations running millions of daily requests across copilots, analytics pipelines, and agentic workflows, inference is what drives cloud spend. It is not purely an infrastructure decision. At scale, it becomes a capital allocation decision.

Microsoft and Anyscale recently announced a strategic partnership that brings Ray — the open-source distributed compute framework powering AI workloads at scale — directly into Azure Kubernetes Service (AKS) as an Azure Native Integration. Azure customers can provision and manage Anyscale-powered Ray clusters from the Azure Portal, with unified billing and Microsoft Entra ID integration. (Sign up for private preview.) Workloads run inside the customer's own AKS clusters within their Azure tenant, so you keep full control over your data, compliance posture, and security boundaries. The serving stack referenced throughout this series is built on two components: Anyscale’s services powered by Ray Serve for inference orchestration and vLLM as the inference engine for high-throughput token generation.

One organizing principle ties all three parts together: inference systems live on a three-way tradeoff between accuracy, latency, and cost — the Pareto frontier of LLMs. Pick two; engineer around the third. Improving one dimension almost always requires tradeoffs in another. A larger model improves accuracy but increases latency and GPU costs. A smaller model reduces cost but risks quality degradation. Optimizing aggressively for speed can sacrifice reasoning depth. The frontier itself is not fixed — it shifts outward as your engineering matures — but the tradeoffs never disappear.
Every architectural decision in this series maps back to that constraint, alongside the security, compliance, and governance requirements enterprise deployments cannot skip.

The Challenges — Why Inference Is Hard

Challenge 1: The Pareto Frontier — You Cannot Optimize Everything Simultaneously

Enterprise inference teams run into the same constraint regardless of stack: accuracy, latency, and cost are interdependent. These pressures play out across three dimensions that define every inference architecture:

Dimension 1: Model quality (accuracy). The baseline capability curve. Larger models, better fine-tuning, and RAG shift you to a higher-quality frontier.

Dimension 2: Throughput per GPU (cost). Tokens per GPU-hour — since self-hosted models on AKS are billed by VM uptime, not per token. Quantization, continuous batching, MIG partitioning, and batch inference all move this number.

Dimension 3: Latency per user (speed). How fast each user gets a response. Speculative decoding, prefix caching, disaggregated prefill/decode, and smaller context windows push this dimension.

In practice, this plays out in two stages. First, you choose the accuracy level your business requires — this is a model selection decision (model size, fine-tuning, RAG, quantization precision). That decision locks you onto a specific cost-latency curve. Second, you optimize along that curve: striving for more tokens per GPU-hour, lower tail latency, or both. The frontier itself isn’t fixed – it shifts outward as your engineering matures. The tradeoffs don't disappear, but they get progressively less painful.

The practical question to anchor on: What is the minimum acceptable accuracy for this business outcome, and how far can I push the throughput-latency frontier at that level? When your team has answered that, the table below maps directly to the engineering levers available.
| Priority | Tradeoff | Engineering Bridges |
|---|---|---|
| Accuracy + Low Latency | Higher cost | Use smaller models to reduce serving cost; recover accuracy with RAG, fine-tuning, and tool use. Quantization cuts GPU memory footprint further. |
| Accuracy + Low Cost | Higher latency | Batch inference, async pipelines, and queue-tolerant architectures absorb the latency gracefully. |
| Low Latency + Low Cost | Accuracy at risk | Smaller or distilled models with quantization; improve accuracy via RAG, fine-tuning |

Challenge 2: Two Phases, Two Bottlenecks

Inference has two computationally distinct phases, each constrained by different hardware resources.

Prefill processes the entire input prompt in parallel, builds the Key-Value (KV) cache and produces the first output token. It is compute-bound — limited by how fast the GPUs can execute matrix multiplications. Time scales with input length. This phase determines Time to First Token (TTFT).

Decode generates output tokens sequentially, one at a time. Each token depends on all prior tokens, so the GPU reads the full KV cache from memory at each step. It is memory-bandwidth-bound — limited by how fast data moves from GPU memory to processor. This phase determines Time Per Output Token (TPOT).

Total Latency = TTFT + (TPOT × (Output Token Count − 1))

These bottlenecks don't overlap. A document classification workload (long input, short output) is prefill-dominated and compute-bound. A content generation workload (short input, long output) is decode-dominated and memory-bandwidth-bound. Optimizing one phase does not automatically improve the other. That is why advanced inference stacks now disaggregate these phases across different hardware to optimize each independently – a technique covered in depth in part two of this series.

Challenge 3: The KV Cache — The Hidden Cost Driver

Model weights are static — loaded once into GPU VRAM per replica.
The KV cache is dynamic: it's allocated at runtime per request, and grows linearly with context length, batch size, and number of attention layers. At high concurrency and long context, it is a frequent primary driver of OOM failures, often amplified by prefill workspace and runtime overhead. In practice, this means LLM serving capacity is constrained less by model size and more by KV cache growth driven by context length and concurrency.

A 7B-parameter model needs roughly 14 GB for weights in FP16. On an NC A100 v4 node on AKS (A100 80GB per GPU), a single idle replica has plenty of headroom. But KV cache scales with concurrent users. KV cache memory is determined by:

KV_Bytes_total = batch_size × num_layers × num_KV_heads × head_dim × tokens × bytes_per_element × 2 (K and V)

The KV cache is where things get unpredictable. Llama 3 8B, at ~8B parameters, requires roughly ~16 GB for weights. On an A100 80GB GPU (e.g., Azure NC A100 v4 on AKS), a single low-concurrency replica typically has plenty of headroom — but KV cache scales with concurrency. For Llama-3 8B, which uses Grouped Query Attention (GQA), an 8K-token sequence consumes roughly ~1 GB of KV cache in FP16/BF16. This compounds quickly: 40 concurrent 8K-token requests consume ~40 GB of KV cache. Add model weights (~16 GB) and runtime overhead, and total memory usage can approach ~60 GB+ on an 80 GB GPU, leaving limited headroom.

Because KV memory scales linearly with tokens, 15 users at 32K context create roughly the same KV pressure as 60 users at 8K. At 128K+ context lengths, even a single long-running sequence can materially reduce safe concurrency. The model weights never changed — KV cache growth drove the failure.

Context Length Is the Sharpest Lever

Context length is often the largest controllable driver of GPU memory consumption.
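The formula above can be checked numerically. The sketch below plugs in Llama-3-8B-like GQA dimensions (32 layers, 8 KV heads, head dimension 128, 2-byte FP16 values, assumed for illustration) and reproduces the figures in the text: roughly 1 GB of KV cache per 8K-token sequence, and the same aggregate pressure from 15 users at 32K as from 60 users at 8K.

```python
def kv_bytes(batch_size: int, tokens: int, num_layers: int = 32,
             num_kv_heads: int = 8, head_dim: int = 128,
             dtype_bytes: int = 2) -> int:
    # KV_Bytes_total = batch x layers x KV heads x head_dim x tokens
    #                  x bytes_per_element x 2 (K and V)
    return (batch_size * num_layers * num_kv_heads * head_dim
            * tokens * dtype_bytes * 2)

GB = 1024**3
weights_gb = 16  # Llama-3 8B weights in FP16/BF16 (from the text)

print(f"one 8K sequence: {kv_bytes(1, 8192) / GB:.2f} GB of KV cache")

for batch, ctx in [(40, 8192), (60, 8192), (15, 32_768)]:
    kv_gb = kv_bytes(batch, ctx) / GB
    print(f"{batch:>2} users x {ctx:>6} tokens: KV {kv_gb:5.1f} GB, "
          f"total with weights ~{kv_gb + weights_gb:5.1f} GB")
```

Note how the 15 × 32K and 60 × 8K rows land on identical KV totals: tokens, not user count, are what the memory budget actually tracks.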
Rather than defaulting to the maximum context window supported by a model, systems should match context size to the workload.

| Context Length | Typical Use Cases | Memory Impact |
|---|---|---|
| 4K–8K tokens | Q&A, simple chat | Low KV cache memory |
| 32K–128K tokens | Document analysis, summarization | Moderate — GPU memory pressure begins |
| 128K+ tokens | Multi-step agents, complex reasoning | KV cache dominates VRAM; drives architecture decisions |

Context length directly controls KV cache growth, which is often the main cause of GPU memory exhaustion. Increasing context length reduces the number of concurrent sequences a GPU can safely serve — often more sharply than any other single variable. Controlling it at the application layer, through chunking, retrieval (RAG), or enforced limits, is one of the highest-leverage interventions available before reaching for more hardware.

Challenge 4: Agentic AI Multiplies Everything

Agentic workloads fundamentally change the resource profile. A single user interaction with an AI agent can trigger dozens or hundreds of sequential inference calls — planning, executing, verifying, iterating — each consuming context that grows over the session. Agentic workloads stress every dimension of the Pareto frontier simultaneously: they need accuracy (autonomous decisions carry risk), low latency (multi-step chains compound delays), and cost efficiency (token consumption scales with autonomy duration).

Challenge 5: GPU Economics — Idle Capacity Is Burned Capital

Production inference traffic is bursty and unpredictable. Idle GPUs equal burned cash. Under-batching means low utilization. Choosing the wrong Azure VM SKU for your workload introduces significant cost inefficiency. In self-hosted AKS deployments, cost is GPU-hours — you pay for the VM regardless of token throughput. Output tokens are more expensive per token than input tokens because decode is sequential, so generation-heavy workloads require more GPU-hours per request.
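The GPU-hour arithmetic can be sketched directly: with self-hosted serving, unit cost is the VM rate divided by sustained throughput, and response length divides straight into requests served per GPU-hour. The hourly rate and throughput below are hypothetical round numbers for illustration, not quotes for any specific SKU or model.

```python
def cost_per_million_tokens(vm_hourly_usd: float, tokens_per_s: float) -> float:
    """Self-hosted economics: you pay for GPU-hours, so unit cost is
    just the VM rate divided by sustained token throughput."""
    tokens_per_hour = tokens_per_s * 3600
    return vm_hourly_usd / tokens_per_hour * 1e6

def requests_per_gpu_hour(tokens_per_s: float, output_tokens_per_request: int) -> float:
    return tokens_per_s * 3600 / output_tokens_per_request

# Hypothetical figures: a $7/hr single-GPU VM sustaining 2,500 output tok/s
# with continuous batching (actual rates and throughput vary widely).
rate, tps = 7.0, 2500
print(f"${cost_per_million_tokens(rate, tps):.2f} per million output tokens")
for out_len in (100, 400, 1600):
    print(f"{out_len:>4}-token responses: "
          f"{requests_per_gpu_hour(tps, out_len):>8,.0f} requests/GPU-hour")
```

The last loop is the verbosity point made above in numbers: quadrupling default response length cuts the requests each GPU-hour can serve by the same factor.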
Product design decisions like response verbosity and default generation length directly affect how many requests each GPU can serve per hour. Token discipline is cost discipline — not because tokens are priced individually, but because they determine how efficiently you use the GPU-hours you're already paying for.

These five challenges don't operate in isolation — they compound. An agentic workload running at long context on the wrong GPU SKU hits all five simultaneously. Part 2 of this series walks through the optimization stack that addresses each one, ordered by implementation priority: The LLM Inference Optimization Stack: A Prioritized Playbook for Enterprise Teams | Microsoft Community Hub

Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent — your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

Microsoft Azure at KubeCon Europe 2026 | Amsterdam, NL - March 23-26
Microsoft Azure is coming back to Amsterdam for KubeCon + CloudNativeCon Europe 2026 in two short weeks, from March 23-26! As a Diamond Sponsor, we have a full week of sessions, hands-on activities, and ways to connect with the engineers behind AKS and our open-source projects. Here's what's on the schedule:

Azure Day with Kubernetes: 23 March 2026

Before the main conference begins, join us at Hotel Casa Amsterdam for a free, full-day technical event built around AKS (registration required for entry - capacity is limited!). Whether you're early in your Kubernetes journey, running clusters at scale, or building AI apps, the day is designed to give you practical guidance from Microsoft product and engineering teams. Morning sessions cover what's new in AKS, including how teams are building and running AI apps on Kubernetes. In the afternoon, pick your track:

- Hands-on AKS Labs: Instructor-led labs to put the morning's concepts into practice.
- Expert Roundtables: Small-group conversations with AKS engineers on topics like security, autoscaling, AI workloads, and performance. Bring your hard questions.
- Evening: Drinks on us.

Capacity is limited, so secure your spot before it closes: aka.ms/AKSDayEU

KubeCon + CloudNativeCon: 24-26 March 2026

There will be lots going on at the main conference! Here's what to add to your calendar:

- Keynote (24 March): Jorge Palma takes the stage to tackle a question the industry is actively wrestling with: can AI agents reliably operate and troubleshoot Kubernetes at scale, and should they?
- Customer Keynote (24 March): Wayve's Mukund Muralikrishnan shares how they handle GPU scheduling across multi-tenant inference workloads using Kueue, providing a practical look at what production AI infrastructure actually requires.
- Demo Theatre (25 March): Anson Qian and Jorge Palma walk through a Kubernetes-native approach to cross-cloud AI inference, covering elastic autoscaling with Karpenter and GPU capacity scheduling across clouds.
- Sessions: Microsoft engineers are presenting across all three days on topics including multi-cluster networking, supply chain security, observability, Istio in production, and more. Full list below.

Find our team in the Project Pavilion at kiosks for Inspektor Gadget, Headlamp, Drasi, Radius, Notary Project, Flatcar, ORAS, Ratify, and Istio. Brendan Burns, Kubernetes co-founder and Microsoft CVP & Technical Fellow, will also share his thoughts on the latest developments and key Microsoft announcements related to open-source, cloud native, and AI application development in his KubeCon Europe blog on March 24.

Come find us at Microsoft Azure booth #200 all three days. We'll be running short demos and sessions on AKS, running Kubernetes at scale, AI workloads, and cloud-native topics throughout the show, plus fun activations and opportunities to unlock special swag. Read on below for full details on our KubeCon sessions and booth theater presentations:

Sponsored Keynote
Date: Tues 24 March 2026
Start Time: 10:18 AM CET
Room: Hall 12
Title: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation
Speakers: Jorge Palma, Natan Yellin (Robusta)

As AI agents increasingly write our code, can they also operate and troubleshoot our infrastructure? More importantly, should they? This keynote explores the practical reality of deploying AI agents to maintain Kubernetes clusters at scale. We'll demonstrate HolmesGPT, an open-source CNCF sandbox project that connects LLMs to operational and observability data to diagnose production issues. You'll see how agents reduce MTTR by correlating logs, metrics, and cluster state far faster than manual investigation. Then we'll tackle the harder problem: moving from diagnosis to remediation. We'll show how agents with remediation policies can detect and fix issues autonomously, within strict RBAC boundaries, approval workflows, and audit trails.
We'll be honest about challenges: LLM non-determinism, building trust, and why guardrails are non-negotiable. This isn't about replacing SREs; it's about multiplying their effectiveness so they can focus on creative problem-solving and system design.

Customer Keynote
Date: Tues 24 March 2026
Start Time: 9:37 AM CET
Room: Hall 12
Title: Rules of the road for shared GPUs: AI inference scheduling at Wayve
Speaker: Mukund Muralikrishnan, Wayve Technologies

As AI inference workloads grow in both scale and diversity, predictable access to GPUs becomes as important as raw throughput, especially in large, multi-tenant Kubernetes clusters. At Wayve, Kubernetes underpins a wide range of inference workloads, from latency-sensitive evaluation and validation to large-scale synthetic data generation supporting the development of an end-to-end self-driving system. These workloads run side by side, have very different priorities, and all compete for the same GPU capacity. In this keynote, we will share how we manage scheduling and resources for multi-tenant AI inference on Kubernetes. We will explain why default Kubernetes scheduling falls short, and how we use Kueue, a Kubernetes-native queueing and admission control solution, to operate shared GPU clusters reliably at scale. This approach gives teams predictable GPU allocations, improves cluster utilisation, and reduces operational noise. We will close by briefly showing how frameworks like Ray fit into this model as Wayve scales its AI Driver platform.

KubeCon Theatre Demo
Date: Wed 25 March 2026
Start Time: 13:15 CET
Room: Hall 1-5 | Solutions Showcase | Demo Theater
Title: Building cross-cloud AI inference on Kubernetes with OSS
Speakers: Anson Qian, Jorge Palma

Operating AI inference under bursty, latency-sensitive workloads is hard enough on a single cluster. It gets harder when GPU capacity is fragmented across regions and cloud providers.
This demo walks through a Kubernetes-native pattern for cross-cloud AI inference, using an incident triage and root cause analysis workflow as the example. The stack is built on open-source capabilities for lifecycle management, inference, autoscaling, and cross-cloud capacity scheduling. We will specifically highlight Karpenter for elastic autoscaling and a GPU flex nodes project for scheduling capacity across multiple cloud providers into a single cluster. Models, inference endpoints, and GPU resources are treated as first-class Kubernetes objects, enabling elastic scaling, stable routing under traffic spikes, and cross-provider failover without a separate AI control plane.

KubeCon Europe 2026 Sessions with Microsoft Speakers

Jorge Palma - Microsoft keynote: Scaling Platform Ops with AI Agents: Troubleshooting to Remediation
Anson Qian, Jorge Palma - Microsoft demo: Building cross-cloud AI inference on Kubernetes with OSS
Will Tsai - Leveling up with Radius: Custom Resources and Headlamp Integration for Real-World Workloads
Simone Rodigari - Demystifying the Kubernetes Network Stack (From Pod to Pod)
Joaquin Rodriguez - Privacy as Infrastructure: Declarative Data Protection for AI on Kubernetes
Cijo Thomas - ⚡Lightning Talk: “Metrics That Lie”: Understanding OpenTelemetry’s Cardinality Capping and Its Implications
Gaurika Poplai - ⚡Lightning Talk: Compliance as Code Meets Developer Portals: Kyverno + Backstage in Action
Mereta Degutyte & Anubhab Majumdar - Network Flow Aggregation: Pay for the Logs You Care About!
Niranjan Shankar - Expl(AI)n Like I’m 5: An Introduction To AI-Native Networking
Danilo Chiarlone - Running Wasmtime in Hardware-Isolated Microenvironments
Jack Francis - Cluster Autoscaler Evolution
Jackie Maertens - Cloud Native Theater | Istio Day: Running State of the Art Inference with Istio and LLM-D
Jackie Maertens & Mitch Connors - Bob and Alice Revisited: Understanding Encryption in Kubernetes
Mitch Connors - Istio in Production: Expected Value, Results, and Effort at GitHub Scale
Mitch Connors - Evolution or Revolution: Istio as the Network Platform for Cloud Native
René Dudfield - Ping SRE? I Am the SRE! Awesome Fun I Had Drawing a Zine for Troubleshooting Kubernetes Deployments
René Dudfield & Santhosh Nagaraj - Does Your Project Want a UI in Kubernetes-SIGs/headlamp?
Bridget Kromhout - How Will Customized Kubernetes Distributions Work for You? A Discussion on Options and Use Cases
Kenneth Kilty - AI-Powered Cloud Native Modernization: From Real Challenges to Concrete Solutions
Mike Morris - Building the Next Generation of Multi-Cluster with Gateway API
Toddy Mladenov, Flora Taagen & Dallas Delaney - Beyond Image Pull-Time: Ensuring Runtime Integrity With Image Layer Signing

Microsoft Booth Theatre Sessions

Tues 24 March (11:00 - 18:00)
Zero-Migration AI with Drasi: Bridge Your Existing Infrastructure to Modern Workflows
Bringing real-time Kubernetes observability to AI agents via Model Context Protocol
Secure Kubernetes Across the Stack: Supply Chain to Runtime
Cut the Noise, Cut the Bill: Cost‑Smart Network Observability for Kubernetes
AKS everywhere: one Kubernetes experience from Cloud to Edge
Teaching AI to Build Better AKS Clusters with Terraform
AKS-Flex: autoscale GPU nodes from Azure and neocloud like Nebius using karpenter
Block Game with Block Storage: Running Minecraft on Kubernetes with local NVMe
When One Cluster Fails: Keeping Kubernetes Services Online with Cilium ClusterMesh
You Spent How Much? Controlling Your AI Spend with Istio + agentgateway
Azure Front Door Edge Actions: Hardware-protected CDN functions in Azure
Secure Your Sensitive Workloads with Confidential Containers on Azure Red Hat OpenShift
AKS Automatic
Anyscale on Azure

Wed 25 March
Kubernetes Answers without AI (And That's Okay)
Accelerating Cloud‑Native and AI Workloads with Azure Linux on AKS
Codeless OpenTelemetry: Auto‑Instrumenting Kubernetes Apps in Minutes
Life After ingress-nginx: Modern Kubernetes Ingress on AKS
Modern Apps, Faster: Modernization with AKS + GitHub Copilot App Mod
Get started developing on AKS
Encrypt Everything, Complicate Nothing: Rethinking Kubernetes Workload Network Security
From Repo to Running on AKS with GitHub Copilot
Simplify Multi‑Cluster App Traffic with Azure Kubernetes Application Network
Open Source with Chainguard and Microsoft: Better Together on AKS
Accelerating Cloud-Native Delivery for Developers: API-Driven Platforms with Radius
Operate Kubernetes at Scale with Azure Kubernetes Fleet Manager

Thurs 26 March
Oooh Wee! An AKS GUI! – Deploy, Secure & Collaborate in Minutes (No CLI Required)
Sovereign Kubernetes: Run AKS Where the Cloud Can’t Go
Thousand Pods, One SAN: Burst-Scaling Stateful Apps with Azure Container Storage + Elastic SAN

There will also be a wide variety of demos running at our booth throughout the show – be sure to swing by to chat with the team. We look forward to seeing you at KubeCon Europe 2026 in Amsterdam!

Psst! Local or coming in to Amsterdam early? You can also catch the Microsoft team at:

Cloud Native Rejekts on 21 March
Maintainer Summit on 22 March

Rethinking Ingress on Azure: Application Gateway for Containers Explained
Introduction

Azure Application Gateway for Containers is a managed Azure service designed to handle incoming traffic for container-based applications. It brings Layer-7 load balancing, routing, TLS termination, and web application protection outside of the Kubernetes cluster and into an Azure-managed data plane. By separating traffic management from the cluster itself, the service reduces operational complexity while providing a more consistent, secure, and scalable way to expose container workloads on Azure.

Service Overview

What Application Gateway for Containers does

Azure Application Gateway for Containers is a managed Layer-7 load balancing and ingress service built specifically for containerized workloads. Its main job is to receive incoming application traffic (HTTP/HTTPS), apply routing and security rules, and forward that traffic to the right backend containers running in your Kubernetes cluster. Instead of deploying and operating an ingress controller inside the cluster, Application Gateway for Containers runs outside the cluster, as an Azure-managed data plane. It integrates natively with Kubernetes through the Gateway API (and Ingress API), translating Kubernetes configuration into fully managed Azure networking behavior.

In practical terms, it handles:

HTTP/HTTPS routing based on hostnames, paths, headers, and methods
TLS termination and certificate management
Web Application Firewall (WAF) protection
Scaling and high availability of the ingress layer

All of this is provided as a managed Azure service, without running ingress pods in your cluster.

What problems it solves

Application Gateway for Containers addresses several common challenges teams face with traditional Kubernetes ingress setups:

Operational overhead: Running ingress controllers inside the cluster means managing upgrades, scaling, certificates, and availability yourself. Moving ingress to a managed Azure service significantly reduces this burden.
Security boundaries: By keeping traffic management and WAF outside the cluster, you reduce the attack surface of the Kubernetes environment and keep security controls aligned with Azure-native services.
Consistency across environments: Platform teams can offer a standard, Azure-managed ingress layer that behaves the same way across clusters and environments, instead of relying on different in-cluster ingress configurations.
Separation of responsibilities: Infrastructure teams manage the gateway and security policies, while application teams focus on Kubernetes resources like routes and services.

How it differs from classic Application Gateway

While both services share the “Application Gateway” name, they target different use cases and operating models. The classic Azure Application Gateway is a general-purpose Layer-7 load balancer primarily designed for VM-based or service-based backends. It relies on centralized configuration through Azure resources and is not Kubernetes-native by design.

Application Gateway for Containers, on the other hand:

Is designed specifically for container platforms
Uses Kubernetes APIs (Gateway API / Ingress) instead of manual listener and rule configuration
Separates control plane and data plane more cleanly
Enables faster, near real-time updates driven by Kubernetes changes
Avoids running ingress components inside the cluster

In short, classic Application Gateway is infrastructure-first, while Application Gateway for Containers is platform- and Kubernetes-first.

Architecture at a Glance

At a high level, Azure Application Gateway for Containers is built around a clear separation between control plane and data plane. This separation is one of the key architectural ideas behind the service and explains many of its benefits.

Control plane and data plane

The control plane is responsible for configuration and orchestration.
It listens to Kubernetes resources—such as Gateway API or Ingress objects—and translates them into a running gateway configuration. When you create or update routing rules, TLS settings, or security policies in Kubernetes, the control plane picks up those changes and applies them automatically.

The data plane is where traffic actually flows. It handles incoming HTTP and HTTPS requests, applies routing rules, performs TLS termination, and forwards traffic to the correct backend services inside your cluster. This data plane is fully managed by Azure and runs outside of the Kubernetes cluster, providing isolation and high availability by design. Because the data plane is not deployed as pods inside the cluster, it does not consume cluster resources and does not need to be scaled or upgraded by the customer.

Managed components vs customer responsibilities

One of the goals of Application Gateway for Containers is to reduce what customers need to operate, while still giving them control where it matters.

Managed by Azure:
Application Gateway for Containers data plane
Scaling, availability, and patching of the gateway
Integration with Azure networking
Web Application Firewall engine and updates
Translation of Kubernetes configuration into gateway rules

Customer-managed:
Kubernetes resources (Gateway API or Ingress)
Backend services and workloads
TLS certificates and references
Routing and security intent (hosts, paths, policies)
Network design and connectivity to the cluster

This split allows platform teams to keep ownership of the underlying Azure infrastructure, while application teams interact with the gateway using familiar Kubernetes APIs. The result is a cleaner operating model with fewer moving parts inside the cluster. In short, Application Gateway for Containers acts as an Azure-managed ingress layer, driven by Kubernetes configuration but operated outside the cluster.
This architecture keeps traffic management simple, scalable, and aligned with Azure-native networking and security services.

Traffic Handling and Routing

This section explains what happens to a request from the moment it reaches Azure until it is forwarded to a container running in your cluster.

Traffic Flow: From Internet to Pod

Azure Application Gateway for Containers (AGC) acts as the specialized "front door" for your Kubernetes workloads. By sitting outside the cluster, it manages high-volume traffic ingestion so your environment remains focused on application logic rather than networking overhead.

The Request Journey

Once a request is initiated by a client—such as a browser or an API—it follows a streamlined path to your container:

1. Entry via Public Frontend: The request reaches AGC’s public frontend endpoint. Note: While private frontends are currently the most requested feature and are under high-priority development, the service currently supports public-facing endpoints.
2. Rule Evaluation: AGC evaluates the incoming request against the routing rules you’ve defined using standard Kubernetes resources (Gateway API or Ingress).
3. Direct Pod Proxying: Once a rule is matched, AGC forwards the traffic directly to the backend pods within your cluster.
4. Azure Native Delivery: Because AGC operates as a managed data plane outside the cluster, traffic reaches your workloads via Azure networking. This removes the need for managing scaling or resource contention for in-cluster ingress pods.

Flexibility in Security and Routing

The architecture is designed to be as "hands-off" or as "hands-on" as your security policy requires:

Optional TLS Offloading: You have full control over the encryption lifecycle. Depending on your specific use case, you can choose to perform TLS termination at the gateway to offload the compute-intensive decryption, or maintain encryption all the way to the container for end-to-end security.
Simplified Infrastructure: By using AGC, you eliminate the "hop" typically required by in-cluster controllers, allowing the gateway to communicate with pods with minimal latency and high predictability.

Kubernetes Integration

Application Gateway for Containers is designed to integrate natively with Kubernetes, allowing teams to manage ingress behavior using familiar Kubernetes resources instead of Azure-specific configuration. This makes the service feel like a natural extension of the Kubernetes platform rather than an external load balancer.

Gateway API as the primary integration model

The Gateway API is the preferred and recommended way to integrate Application Gateway for Containers with Kubernetes. With the Gateway API:

Platform teams define the Gateway and control how traffic enters the cluster.
Application teams define routes (such as HTTPRoute) to expose their services.
Responsibilities are clearly separated, supporting multi-team and multi-namespace environments.

Application Gateway for Containers supports core Gateway API resources such as:

GatewayClass
Gateway
HTTPRoute

When these resources are created or updated, Application Gateway for Containers automatically translates them into gateway configuration and applies the changes in near real time.

Ingress API support

For teams that already use the traditional Kubernetes Ingress API, Application Gateway for Containers also provides Ingress support. This allows:

Reuse of existing Ingress manifests
A smoother migration path from older ingress controllers
Gradual adoption of Gateway API over time

Ingress resources are associated with Application Gateway for Containers using a specific ingress class. While fully functional, the Ingress API offers fewer capabilities and less flexibility compared to the Gateway API.
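To make the platform/application split concrete, here is a minimal sketch of a platform-owned Gateway that application HTTPRoutes could attach to. The names (agc-gateway, the infra namespace) are illustrative, and the gatewayClassName shown follows the class name used in the public AGC documentation; verify the class name and any required ALB annotations against the current docs for your deployment strategy.

```yaml
# Hypothetical platform-owned Gateway; names and namespace are illustrative.
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: agc-gateway
  namespace: infra
spec:
  # Class name per AGC documentation; confirm for your setup.
  gatewayClassName: azure-alb-external
  listeners:
  - name: http
    protocol: HTTP
    port: 80
    allowedRoutes:
      namespaces:
        from: All   # let application namespaces attach their own HTTPRoutes
```

With a Gateway like this in place, application teams only ever create HTTPRoute objects in their own namespaces that reference it via parentRefs, which is what enables the self-service model described above.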
How teams interact with the service

A key benefit of this integration model is the clean separation of responsibilities:

Platform teams:
Provision and manage Application Gateway for Containers
Define gateways, listeners, and security boundaries
Own network and security policies

Application teams:
Define routes using Kubernetes APIs
Control how their applications are exposed
Do not need direct access to Azure networking resources

This approach enables self-service for application teams while keeping governance and security centralized.

Why this matters

By integrating deeply with Kubernetes APIs, Application Gateway for Containers avoids custom controllers, sidecars, or ingress pods inside the cluster. Configuration stays declarative, changes are automated, and the operational model stays consistent with Kubernetes best practices.

Security Capabilities

Security is a core part of Azure Application Gateway for Containers and one of the main reasons teams choose it over in-cluster ingress controllers. The service brings Azure-native security controls directly in front of your container workloads, without adding complexity inside the cluster.

Web Application Firewall (WAF)

Application Gateway for Containers integrates with Azure Web Application Firewall (WAF) to protect applications against common web attacks such as SQL injection, cross-site scripting, and other OWASP Top 10 threats. A key differentiator of this service is that it leverages Microsoft's global threat intelligence. This provides an enterprise-grade layer of security that constantly evolves to block emerging threats, a significant advantage over many open-source or standard competitor WAF solutions. Because the WAF operates within the managed data plane, it offers several operational benefits:

Zero Cluster Footprint: No WAF-specific pods or components are required to run inside your Kubernetes cluster, saving resources for your actual applications.
Edge Protection: Security rules and policies are applied at the Azure network edge, ensuring malicious traffic is blocked before it ever reaches your workloads.
Automated Maintenance: All rule updates, patching, and engine maintenance are handled entirely by Azure.
Centralized Governance: WAF policies can be managed centrally, ensuring consistent security enforcement across multiple teams and namespaces—a critical requirement for regulated environments.

TLS and certificate handling

TLS termination happens directly at the gateway. HTTPS traffic is decrypted at the edge, inspected, and then forwarded to backend services. Key points:

Certificates are referenced from Kubernetes configuration
TLS policies are enforced by the Azure-managed gateway
Applications receive plain HTTP traffic, keeping workloads simpler

This approach allows teams to standardize TLS behavior across clusters and environments, while avoiding certificate logic inside application pods.

Network isolation and exposure control

Because Application Gateway for Containers runs outside the cluster, it provides a clear security boundary between external traffic and Kubernetes workloads. Common patterns include:

Internet-facing gateways with WAF protection
Private gateways for internal or zero-trust access
Controlled exposure of only selected services

By keeping traffic management and security at the gateway layer, clusters remain more isolated and easier to protect.

Security by design

Overall, the security model follows a simple principle: inspect, protect, and control traffic before it enters the cluster. This reduces the attack surface of Kubernetes, centralizes security controls, and aligns container ingress with Azure’s broader security ecosystem.

Scale, Performance, and Limits

Azure Application Gateway for Containers is built to handle production-scale traffic without requiring customers to manage capacity, scaling rules, or availability of the ingress layer.
Scalability and performance are handled as part of the managed service.

Interoperability: The Best of Both Worlds

A common hesitation when adopting cloud-native networking is the fear of vendor lock-in. Many organizations worry that using a provider-specific ingress service will tie their application logic too closely to a single cloud’s proprietary configuration. Azure Application Gateway for Containers (AGC) addresses this directly by utilizing the Kubernetes Gateway API as its primary integration model. This creates a powerful decoupling between how you define your traffic and how that traffic is actually delivered.

Standardized API, Managed Execution

By adopting this model, you gain two critical advantages simultaneously:

Zero Vendor Lock-In (Standardized API): Your routing logic is defined using the open-source Kubernetes Gateway API standard. Because HTTPRoute and Gateway resources are community-driven standards, your configuration remains portable and familiar to any Kubernetes professional, regardless of the underlying infrastructure.
Zero Operational Overhead (Managed Implementation): While the interface is a standard Kubernetes API, the implementation is a high-performance Azure-managed service. You gain the benefits of an enterprise-grade load balancer—automatic scaling, high availability, and integrated security—without the burden of managing, patching, or troubleshooting proxy pods inside your cluster.

The "Pragmatic" Advantage

As highlighted in recent architectural discussions, moving from traditional Ingress to the Gateway API is about more than just new features; it’s about interoperability. It allows platform teams to offer a consistent, self-service experience to developers while retaining the ability to leverage the best-in-class performance and security that only a native cloud provider can offer.
The result is a future-proof architecture: your teams use the industry-standard language of Kubernetes to describe what they need, and Azure provides the managed muscle to make it happen.

Scaling model

Application Gateway for Containers uses an automatic scaling model. The gateway data plane scales up or down based on incoming traffic patterns, without manual intervention. From an operator’s perspective:

There are no ingress pods to scale
No node capacity planning for ingress
No separate autoscaler to configure

Scaling is handled entirely by Azure, allowing teams to focus on application behavior rather than ingress infrastructure.

Performance characteristics

Because the data plane runs outside the Kubernetes cluster, ingress traffic does not compete with application workloads for CPU or memory. This often results in:

More predictable latency
Better isolation between traffic management and application execution
Consistent performance under load

The service supports common production requirements such as:

High concurrent connections
Low-latency HTTP and HTTPS traffic
Near real-time configuration updates driven by Kubernetes changes

Service limits and considerations

Like any managed service, Application Gateway for Containers has defined limits that architects should be aware of when designing solutions. These include limits around:

Number of listeners and routes
Backend service associations
Certificates and TLS configurations
Throughput and connection scaling thresholds

These limits are documented and enforced by the platform to ensure stability and predictable behavior. For most application platforms, these limits are well above typical usage. However, they should be reviewed early when designing large multi-tenant or high-traffic environments.

Designing with scale in mind

The key takeaway is that Application Gateway for Containers removes ingress scaling from the cluster and turns it into an Azure-managed concern.
This simplifies operations and provides a stable, high-performance entry point for container workloads.

When to Use (and When Not to Use)

Kubernetes workloads on Azure: ✅ Yes. The service is designed specifically for container platforms and integrates natively with Kubernetes APIs.
Need for managed Layer-7 ingress: ✅ Yes. Routing, TLS, and scaling are handled by Azure without in-cluster components.
Enterprise security requirements (WAF, TLS policies): ✅ Yes. Built-in Azure WAF and centralized TLS enforcement simplify security.
Platform team managing ingress for multiple apps: ✅ Yes. Clear separation between platform and application responsibilities.
Multi-tenant Kubernetes clusters: ✅ Yes. Gateway API model supports clean ownership boundaries and isolation.
Desire to avoid running ingress controllers in the cluster: ✅ Yes. No ingress pods, no cluster resource consumption.
VM-based or non-container backends: ❌ No. Classic Application Gateway is a better fit for non-container workloads.
Simple, low-traffic test or dev environments: ❌ Maybe not. A lightweight in-cluster ingress may be simpler and more cost-effective.
Need for custom or unsupported L7 features: ❌ Maybe not. Some advanced or niche ingress features may not yet be available.
Non-Kubernetes platforms: ❌ No. The service is tightly integrated with Kubernetes APIs.

When to Choose a Different Path: Azure Container Apps

While Application Gateway for Containers provides the ultimate control for Kubernetes environments, not every project requires that level of infrastructure management. For teams that don't need the full flexibility of Kubernetes and are looking for the fastest path to running containers on Azure without managing clusters or ingress infrastructure at all, Azure Container Apps offers a specialized alternative. It provides a fully managed, serverless container platform that handles scaling, ingress, and networking automatically "out of the box".
Key Differences at a Glance

Control: AGC + Kubernetes gives granular control over cluster and ingress; Azure Container Apps is a fully managed, serverless experience.
Management: With AGC, you manage the cluster and Azure manages the gateway; with Azure Container Apps, Azure manages both the platform and ingress.
Best For: AGC + Kubernetes suits complex, multi-team, or highly regulated environments; Azure Container Apps suits rapid development and simplified operations.

Appendix - Routing configuration examples

The following examples show how Application Gateway for Containers can be configured using both Gateway API and Ingress API for common routing and TLS scenarios. More examples can be found here, in the detailed documentation.

HTTP listener

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: app-route
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - backendRefs:
    - name: app-service
      port: 80

Path routing logic

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: path-routing
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - matches:
    - path:
        type: PathPrefix
        value: /api
    backendRefs:
    - name: api-service
      port: 80
  - backendRefs:
    - name: web-service
      port: 80

Weighted canary / rollout

apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: canary-route
spec:
  parentRefs:
  - name: agc-gateway
  rules:
  - backendRefs:
    - name: app-v1
      port: 80
      weight: 80
    - name: app-v2
      port: 80
      weight: 20

TLS Termination

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: azure-alb-external
  tls:
  - hosts:
    - app.contoso.com
    secretName: tls-cert
  rules:
  - host: app.contoso.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service:
            name: app-service
            port:
              number: 80

Seamless Migrations From Self Hosted Nginx Ingress To The AKS App Routing Add-On
The Kubernetes Steering Committee has announced that the Nginx Ingress controller will be retired in March 2026. That's not far away, and once this happens Nginx Ingress will not receive any further updates, including security patches. Continuing to run the standalone Nginx Ingress controller past the end of March could open you up to security risks.

Azure Kubernetes Service (AKS) offers a managed routing add-on which also implements Nginx as the Ingress Controller. Microsoft has recently committed to supporting this version of Nginx Ingress until November 2026. There is also an updated version of the App Routing add-on in the works, which will be based on Istio to allow for transition off Nginx Ingress. This new App Routing add-on will support Gateway API based ingress only, so there will be some migration required if you are using the Ingress API. There is tooling available to support migration from Ingress to Gateway API, such as the Ingress2Gateway tool.

If you are already using the App Routing add-on then you are supported until November and have extra time to either move to the new Istio based solution when it is released or migrate to another solution such as App Gateway for Containers. However, if you are running the standalone version of Nginx Ingress, you may want to consider migrating to the App Routing add-on to give you some extra time. To be very clear, migrating to the App Routing add-on does not solve the problem; it buys you some more time until November and sets you up for a transition to the future Istio based App Routing add-on. Once you complete this migration you will need to plan to either move to the new version based on Istio, or migrate to another Ingress solution, before November.

The rest of this article walks through migrating from BYO Nginx to the App Routing add-on without disrupting your existing traffic.

How Parallel Running Works

The key to a zero-downtime migration is that both controllers can run simultaneously.
Each controller uses a different IngressClass, so Kubernetes routes traffic based on which class your Ingress resources reference. Your BYO Nginx uses the nginx IngressClass and runs in the ingress-nginx namespace. The App Routing add-on uses the webapprouting.kubernetes.azure.com IngressClass and runs in the app-routing-system namespace. They operate completely independently, each with its own load balancer IP. This means you can:

Enable the add-on alongside your existing controller
Create new Ingress resources targeting the add-on (or duplicate existing ones)
Validate everything works via the add-on's IP
Cut over DNS or backend pool configuration
Remove the old Ingress resources once you're satisfied

At no point does your production traffic go offline.

Enabling the App Routing add-on

Start by enabling the add-on on your existing cluster. This doesn't touch your BYO Nginx installation.

az aks approuting enable \
  --resource-group <resource-group> \
  --name <cluster-name>

Wait for the add-on to deploy. You can verify it's running by checking the app-routing-system namespace:

kubectl get pods -n app-routing-system
kubectl get svc -n app-routing-system

You should see the Nginx controller pod running and a service called nginx with a load balancer IP. This IP is separate from your BYO controller's IP.

# Get both IPs for comparison
BYO_IP=$(kubectl get svc ingress-nginx-controller -n ingress-nginx \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
ADDON_IP=$(kubectl get svc nginx -n app-routing-system \
  -o jsonpath='{.status.loadBalancer.ingress[0].ip}')
echo "BYO Nginx IP: $BYO_IP"
echo "add-on IP: $ADDON_IP"

Both controllers are now running. Your existing applications continue to use the BYO controller because their Ingress resources still reference ingressClassName: nginx.

Migrating Applications: The Parallel Ingress Approach

For production workloads, create a second Ingress resource that targets the add-on.
This lets you validate everything before cutting over traffic.

Here's an example. Your existing Ingress might look like this:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress-byo
  namespace: myapp
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: nginx # BYO controller
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```

Create a new Ingress for the add-on:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: myapp-ingress-add-on
  namespace: myapp
  annotations:
    nginx.ingress.kubernetes.io/rewrite-target: /
spec:
  ingressClassName: webapprouting.kubernetes.azure.com # add-on controller
  rules:
    - host: myapp.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: myapp
                port:
                  number: 80
```

Apply this new Ingress resource. The add-on controller picks it up and configures routing, but your production traffic still flows through the BYO controller because DNS (or your backend pool) still points to the old IP.

## Validating Before Cutover

Test the new route via the add-on IP before touching anything else:

```bash
# For public ingress with DNS
curl -H "Host: myapp.example.com" http://$ADDON_IP

# For private ingress, test from a VM in the VNet
curl -H "Host: myapp.example.com" http://$ADDON_IP
```

Run your full validation suite against this IP. Check TLS certificates, path routing, authentication, rate limiting, custom headers, and anything else your application depends on. If you have monitoring or synthetic tests, point them at the add-on IP temporarily to gather confidence. If something doesn't work, you can troubleshoot without affecting production. The BYO controller is still handling all real traffic.

## Cutover to the Routing Add-on

If your ingress has a public IP and you're using DNS to route traffic, the cutover is straightforward. Lower your DNS TTL well in advance: set it to 60 seconds at least an hour before you plan to cut over.
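If your zone happens to be hosted in Azure DNS, the TTL change and the later record swap can be done with the Azure CLI. This is a sketch only; the resource group, zone name, record name, and IP addresses below are assumptions for illustration:

```bash
# Assumed values for illustration -- substitute your own
RG=dns-rg
ZONE=example.com
RECORD=myapp
OLD_IP=203.0.113.10   # current BYO Nginx load balancer IP
NEW_IP=203.0.113.20   # App Routing add-on load balancer IP

# Well before cutover: lower the TTL on the existing A record
az network dns record-set a update \
  --resource-group $RG --zone-name $ZONE --name $RECORD --set ttl=60

# At cutover time: swap the old IP for the add-on IP
az network dns record-set a remove-record \
  --resource-group $RG --zone-name $ZONE --record-set-name $RECORD \
  --ipv4-address "$OLD_IP"
az network dns record-set a add-record \
  --resource-group $RG --zone-name $ZONE --record-set-name $RECORD \
  --ipv4-address "$NEW_IP"
```

Rolling back is the same swap in reverse, which the low TTL makes fast.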
This ensures changes propagate quickly and you can roll back fast if needed. When you're ready, update your DNS A record to point to the add-on IP.

If your ingress has a private IP and sits behind App Gateway, API Management, or Front Door, the cutover involves updating the backend pool instead of DNS.

## In-Place Patching: The Faster but Riskier Option

If you're migrating a non-critical application or an internal service where some downtime is acceptable, you can patch the `ingressClassName` in place:

```bash
kubectl patch ingress myapp-ingress-byo -n myapp \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/ingressClassName","value":"webapprouting.kubernetes.azure.com"}]'
```

This is atomic from Kubernetes' perspective. The BYO controller immediately drops the route, and the add-on controller immediately picks it up. In practice, there's usually a gap of a few seconds while the add-on configures Nginx and reloads. Once this change is made, the Ingress will not work until you update your DNS or backend pool details to point to the new IP.

## Decommissioning the BYO Nginx Controller

Once all your applications are migrated and you're confident everything works, you can remove the BYO controller. First, verify nothing is still using it:

```bash
kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \
  | grep -v "webapprouting"
```

If that returns only the header row (or is empty), you're clear to proceed. If it shows any Ingress resources, you've still got work to do.

Remove the BYO Nginx Helm release:

```bash
helm uninstall ingress-nginx -n ingress-nginx
kubectl delete namespace ingress-nginx
```

The Azure Load Balancer provisioned for the BYO controller will be deprovisioned automatically. Verify only the add-on IngressClass remains:

```bash
kubectl get ingressclass
```

You should see only `webapprouting.kubernetes.azure.com`.
## Key Differences Between BYO Nginx and the App Routing Add-on

The add-on runs the same Nginx binary, so most of your configuration carries over. However, there are a few differences worth noting.

**TLS Certificates:** The BYO setup typically uses cert-manager or manual Secrets for certificates. The add-on supports this, but it also integrates natively with Azure Key Vault. If you want to use Key Vault, you need to configure the add-on with the appropriate annotations. Otherwise, your existing cert-manager setup continues to work.

**DNS Management:** If you're using external-dns with your BYO controller, it works with the add-on too. The add-on also has native integration with Azure DNS zones if you want to simplify your setup. This is optional.

**Custom Nginx Configuration:** With BYO Nginx, you have full access to the ConfigMap and can customise the global Nginx configuration extensively. The add-on restricts this because it's managed by Azure. If you've done significant customisation (Lua scripts, custom modules, etc.), audit carefully to ensure the add-on supports what you need. Most standard configurations work fine.

**Annotations:** The `nginx.ingress.kubernetes.io/*` annotations work the same way. The add-on adds some Azure-specific annotations for WAF integration and other features, but your existing annotations should carry over without changes.

## What Comes Next

This migration gets you onto a supported platform, but it's temporary. November 2026 is not far away, and you'll need to plan your next move.

Microsoft is building a new App Routing add-on based on Istio. This is expected later in 2026 and will likely become the long-term supported option. Keep an eye on Azure updates for announcements about preview availability and migration paths. If you need something production-ready sooner, App Gateway for Containers is worth evaluating. It's built on Envoy and supports the Kubernetes Gateway API, which is the future direction for ingress in Kubernetes.
The Gateway API is more expressive than the Ingress API and is designed to be vendor-neutral.

For now, getting off the unsupported BYO Nginx controller is the priority. The App Routing add-on gives you the breathing room to make an informed decision about your long-term strategy rather than rushing into something because you're running out of time.
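To give a feel for that future migration, here is roughly what the `myapp` Ingress from earlier in this article maps to as a Gateway API HTTPRoute. This is a sketch, not output from a specific tool: the Gateway name and namespace in `parentRefs` are assumptions, and tools like Ingress2Gateway generate similar resources from your existing Ingress definitions:

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: myapp-route
  namespace: myapp
spec:
  parentRefs:
    - name: my-gateway          # assumed Gateway resource supplied by the controller
      namespace: gateway-system # assumed namespace for that Gateway
  hostnames:
    - myapp.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: myapp
          port: 80
```

Note that Nginx-specific behaviour expressed as annotations today (such as `rewrite-target`) becomes first-class configuration in Gateway API, for example a `URLRewrite` filter on the route rule.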