# Announcing general availability for the Azure SRE Agent
Today, we're excited to announce the General Availability (GA) of Azure SRE Agent — your AI-powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

# NFS Permission Denied in Azure App Service on Linux: What It Means and What to Do
If your Azure App Service on Linux uses an Azure Files NFS share, you may sometimes see errors like Permission denied or Errno 13 when your app tries to write to the mounted path. Azure Files supports NFS for Linux and Unix workloads, and NFS uses Unix-style numeric ownership and permissions (UID/GID), which can behave differently from SMB-based file sharing.

## Overview

This post is for customers using Azure App Service on Linux together with an Azure Files NFS share for persistent storage. Azure Files NFS is designed for Linux and Unix-style workloads, supports POSIX-style permissions, and does not support Windows clients or NFS ACLs.

In this setup, a write failure does not always mean the file is corrupted. Sometimes it means the file ownership seen by the running app no longer matches the identity context currently used to access the NFS share. In containerized Linux environments, user IDs inside a container can be mapped differently outside the container, and Docker documents that this can affect access to host-mounted resources.

## Common signs

You may notice:

- Permission denied
- Errno 13
- your app can read files but cannot update or overwrite them
- file ownership looks different than expected when you inspect the mounted path

These symptoms are consistent with how NFS handles Unix-style ownership and permissions. Azure documents that NFS permissions are enforced through the operating system and NFS model rather than SMB-style user authentication.

## Why this can happen

At a high level, NFS uses numeric ownership such as UID and GID. In container-based Linux environments, the identity that appears inside the container is not always the same as the identity seen outside the container. Docker's user namespace documentation explains that a container user such as root can be mapped to a less-privileged user on the host, and that mounted-resource access can become more complex because of that mapping.

That means a file created earlier under one effective identity context may later be accessed under a different one. When that happens, the app may no longer be able to write to the file even though the file itself is still present and intact.

## What to check first

Start by checking the mounted share from the app's runtime context.

```bash
ls -l /mount/path/file
ls -ln /mount/path/file
id -u
id -g
```

The ls -ln output is especially useful because it shows the numeric UID and GID directly. If you need shell access for investigation, App Service supports SSH into Linux containers, and Microsoft notes that Linux custom containers may need extra SSH configuration.

You should also review the NFS share's squash setting. Azure Files NFS supports No Root Squash, Root Squash, and All Squash. Microsoft documents these options in the root squash guidance.

## A practical mitigation

If the main issue is inconsistent ownership behavior, a practical mitigation is often to use All Squash on the NFS share. Azure documents All Squash as a supported NFS setting, and squash settings are specifically intended to control how client identities are handled when they access the share.

One important note: changing the squash setting does not automatically rewrite old files. If existing data was created under a different ownership context, you may still need to migrate that data to a new share configured the way you want.

## Recommended approach

A simple and cautious approach is:

1. Create a new Azure Files NFS share.
2. Configure it with All Squash if that matches your workload needs.
3. Mount both the old share and the new share on a Linux environment.
4. Copy the data from old to new.
5. Validate that the app can read and write correctly.
6. Repoint production to the validated share.
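As a rough sketch of steps 3 and 4 from a Linux VM, assuming the NFS 4.1 mount options Azure Files documents (the storage account, share, and mount-point names here are hypothetical):

```bash
sudo mkdir -p /mnt/oldshare /mnt/newshare

# Mount both NFS shares (Azure Files supports NFS 4.1)
sudo mount -t nfs -o vers=4,minorversion=1,sec=sys \
  mystorageacct.file.core.windows.net:/mystorageacct/oldshare /mnt/oldshare
sudo mount -t nfs -o vers=4,minorversion=1,sec=sys \
  mystorageacct.file.core.windows.net:/mystorageacct/newshare /mnt/newshare

# Copy data recursively, preserving timestamps; ownership on the target
# is governed by the new share's squash setting rather than the source
sudo rsync -rt /mnt/oldshare/ /mnt/newshare/
```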
Azure Files supports NFS shares and squash configuration, and Azure also documents how to mount NFS shares on Linux if you need a separate environment for validation or migration.

## Final takeaway

If your App Service on Linux starts hitting NFS permission denied errors, focus first on ownership, UID/GID behavior, and squash settings before assuming the files are damaged. For many users, the most effective path is to validate the current ownership model, review the NFS squash setting, and, if needed, migrate data to a share configured with All Squash.

## References

- NFS file shares in Azure Files | Microsoft Learn
- Configure Root Squash Settings for NFS Azure File Shares | Microsoft Learn
- SSH Access for Linux and Windows Containers - Azure App Service | Microsoft Learn
- Isolate containers with a user namespace | Docker Docs

# Autonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes
When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m.

Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc kubectl commands.

This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams.

## Core concepts

Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow:

- **Incident platform.** Where incidents originate. In this demo, that is Azure Monitor.
- **Built-in Azure capabilities.** The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors.
- **Connectors.** Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers.
- **Permission levels.** Reader for investigation and read-oriented access, privileged for operational changes when allowed.
- **Run modes.** Review for approval-gated execution and Autonomous for direct execution.

The most important production controls are permission level and run mode, not prompt quality. Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability. The safest production rollout path:

1. Start: Reader + Review
2. Then: Privileged + Review
3. Finally: Privileged + Autonomous, only for narrow, trusted incident paths.

## Demo environment

The full scripts and manifests are available if you want to reproduce this:

- Demo repository: github.com/hailugebru/azure-sre-agents-aks. The README includes setup and configuration details.

The environment uses an AKS cluster with node auto-provisioning (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the AKS Store sample microservices application, and Azure SRE Agent configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal. It provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself.

```
Azure Monitor → Action Group → Azure SRE Agent  →  AKS Cluster
   (Alert)       (Webhook)    (Investigate / Fix)   (Recover)
                                      ↓
  Teams notification + GitHub issue → GitHub Agent → PR for review
```

## How the agent was configured

Configuration came down to four things: scope, permissions, incident intake, and response mode.

I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take. I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction: permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly.
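For the first rollout stage (Reader + Review), granting the agent's UAMI read-only access over a single resource group is a plain role assignment; a sketch with placeholder IDs:

```bash
# Give the agent's user-assigned managed identity Reader on the demo
# resource group only; wider or write-capable roles come later, if at all.
az role assignment create \
  --assignee-object-id "<uami-principal-id>" \
  --assignee-principal-type ServicePrincipal \
  --role "Reader" \
  --scope "/subscriptions/<sub-id>/resourceGroups/rg-sre-agent-demo"
```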
For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates. I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow-up after remediation. For the complete setup, see the README.

A note on context. The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts.

## Two incidents, two response modes

These incidents occurred on the same cluster in one session and illustrate two realistic operating modes:

- **Alert-triggered automation.** The agent acts when Azure Monitor fires.
- **Ad hoc chat investigation.** An engineer sees a symptom first and asks the agent to investigate.

Both matter in real environments. The first is your scale path. The second is your operator-assist path.

## Incident 1. CPU starvation (alert driven, ~8 min MTTR)

The makeline-service deployment manifest contained a CPU and memory configuration that was not viable for startup:

```yaml
resources:
  requests:
    cpu: 1m
    memory: 6Mi
  limits:
    cpu: 5m
    memory: 20Mi
```

Within five minutes, Azure Monitor fired the pod-not-healthy Sev1 alert. The agent picked it up immediately. Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code:

> "Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario."

That is the kind of distinction that often takes an on-call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure.

The agent then:

1. Identified three additional CPU-throttled pods at 112 to 200% of configured limit using kubectl top.
2. Patched four workloads: makeline-service, virtual-customer, virtual-worker, and mongodb.
3. Verified that all affected pods returned to healthy running state with 0 restarts cluster wide.

*Azure SRE Agent's Incident History blade confirming full cluster recovery: 4 patches applied, 0 unhealthy pods — no human intervention required.*

**Outcome.** Full cluster recovery in ~8 minutes, 0 human interventions.
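The post doesn't show the exact patches the agent applied, but the equivalent manual remediation is a one-liner per workload; a hedged sketch with illustrative resource values, assuming the same pets namespace used elsewhere in the demo:

```bash
# Raise the starved requests/limits on one of the affected workloads;
# the specific CPU and memory numbers here are illustrative only.
kubectl -n pets set resources deployment makeline-service \
  --requests=cpu=100m,memory=64Mi \
  --limits=cpu=500m,memory=256Mi
```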
## Incident 2. OOMKilled (chat driven, ~4 min MTTR)

For the second case, I deployed a deliberately undersized version of order-service:

```
kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets
```

I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. CrashLoopBackOff is a container waiting reason, not a pod phase, so production coverage should come from Prometheus-based crash-loop signals rather than pod phase alone.

Here is the PromQL query I use in Azure Monitor to catch this class of failure:

```promql
sum by (namespace, pod) (
  (
    max_over_time(
      kube_pod_container_status_waiting_reason{
        namespace="pets", reason="CrashLoopBackOff"
      }[5m]
    ) == 1
  )
  and on (namespace, pod, container)
  (
    increase(
      kube_pod_container_status_restarts_total{
        namespace="pets"
      }[15m]
    ) > 0
  )
) > 0
```

This query fires when a container has been in CrashLoopBackOff within the last 5 minutes and its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces.

> The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it.

The agent's reasoning:

> "Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload."

The agent increased the memory limit (20Mi to 128Mi) and request (10Mi to 50Mi), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts.

**Outcome.** Service recovered in ~4 minutes without any manual cluster interaction.

## Side by side comparison

| Dimension | Incident 1: CPU starvation | Incident 2: OOMKilled |
|---|---|---|
| Trigger | Azure Monitor alert (automated) | Engineer chat prompt (ad hoc) |
| Failure mode | CPU too low for startup probe to pass | Memory limit too low for process to start |
| Key signal | Exit code 1, probe timeout | Exit code 137, empty container logs |
| Blast radius | 4 workloads affected cluster wide | 1 workload in target namespace |
| Remediation | CPU request/limit patches across 4 deployments | Memory request/limit patch on 1 deployment |
| MTTR | ~8 min | ~4 min |
| Human interventions | 0 | 0 |

## Why this matters

Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc kubectl commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow.

The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps. In this lab, the impact was measurable:

| Metric | This demo with Azure SRE Agent |
|---|---|
| Alert to recovery | ~4 to 8 min |
| Human interventions | 0 |
| Scope of investigation | Cluster wide, automated |
| Correlate evidence and diagnose | ~2 min |
| Apply fix and verify | ~4 min |
| Post-incident follow-up | GitHub issue + draft PR |

These results came from a controlled run on April 10, 2026. Real-world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments.

## Teams + GitHub follow-up

Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post-incident path matters.

After Incident 1 resolved, Azure SRE Agent used the GitHub connector to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix.
The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review still remains the final control point before merge. Setup details for the GitHub connector are in the demo repo README, and the official reference is in the Azure SRE Agent docs.

Azure SRE Agent fixes the live issue, and the GitHub follow-up prepares the durable source change so future deployments do not reintroduce the same configuration problem.

*The operations-to-engineering handoff: Azure SRE Agent fixed the live cluster; GitHub Copilot agent prepares the durable source change so the same misconfiguration can't ship again.*

In parallel, the Teams connector posted milestone updates during the incident:

1. Investigation started.
2. Root cause and remediation identified.
3. Incident resolved.

Teams handled real-time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery.

## Key takeaways

Three things to carry forward:

1. Treat Azure SRE Agent as a governed incident-response system, not a chatbot with infrastructure access. The most important controls are permission levels and run modes, not prompt quality.
2. Anchor detection in your existing incident platforms. For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live.
3. Use connectors to extend the workflow outward. Teams for real-time coordination, GitHub for durable engineering follow-up.

Start where you're comfortable. If you are just getting your feet wet, begin with one resource group, one incident type, and Review mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling Autonomous. Expand only once each layer is trusted.

## Next steps

- Add Prometheus-based alert coverage for ImagePullBackOff and node resource pressure to complement the pod-phase rule.
- Expand to multi-cluster managed scopes once the single-cluster path is trusted and validated.
- Explore how NAP and Azure SRE Agent complement each other — NAP manages infrastructure capacity, while the agent investigates and remediates incidents.

I'd like to thank Cary Chai, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review — his feedback sharpened both the accuracy and quality of this post.

# AKS App Routing's Next Chapter: Gateway API with Istio
If you've been following my previous posts on the Ingress NGINX retirement, you'll know the story so far. The community Ingress NGINX project was retired in March 2026, and Microsoft's extended support for the NGINX-based App Routing add-on runs until November 2026. I've covered migrating from standalone NGINX to the App Routing add-on to buy time, and migrating to Application Gateway for Containers as a long-term option. In both of those posts I mentioned that Microsoft was working on a new version of the App Routing add-on based on Istio and the Gateway API. Well, it's here, in preview at least.

The App Routing Gateway API implementation is Microsoft's recommended migration path for anyone currently using the NGINX-based App Routing add-on. It moves you off NGINX entirely and onto the Kubernetes Gateway API, with a lightweight Istio control plane handling the gateway infrastructure under the hood. Let's look at what this actually is, how it differs from other options, and how to migrate from both standalone NGINX and the existing App Routing add-on.

## What Is It?

The new App Routing mode uses the Kubernetes Gateway API instead of the Ingress API. When you enable the add-on, AKS deploys an Istio control plane (istiod) to manage Envoy-based gateway proxies. The important thing to understand here is that this is not the full Istio service mesh. There's no sidecar injection, no Istio CRDs installed for your workloads. It's Istio doing one specific job: managing gateway proxies for ingress traffic.

When you create a Gateway resource, AKS provisions an Envoy Deployment, a LoadBalancer Service, a HorizontalPodAutoscaler (defaulting to 2-5 replicas at 80% CPU), and a PodDisruptionBudget. All managed. You write Gateway and HTTPRoute resources, and AKS handles everything else.

This is a fundamentally different API from what you're used to with Ingress. Instead of a single Ingress resource that combines the entry point and routing rules, Gateway API splits things into layers:

- **GatewayClass** defines the type of gateway infrastructure (provided by AKS in this case)
- **Gateway** creates the actual gateway with its listeners
- **HTTPRoute** defines the routing rules and attaches to a Gateway

This separation is one of Gateway API's main selling points. Platform teams can own the Gateway resources while application teams manage their own HTTPRoutes independently, without needing to modify shared infrastructure. If you've ever had a team accidentally break routing for everyone by editing a shared Ingress, you'll appreciate why this matters.

## How It Differs From the Istio Service Mesh Add-On

If you're already running or considering the Istio service mesh add-on for AKS, this is a different thing. The App Routing Gateway API mode uses the approuting-istio GatewayClass, doesn't install Istio CRDs, doesn't enable sidecar injection, and handles upgrades in-place. The full Istio service mesh add-on uses the istio GatewayClass, installs Istio CRDs cluster-wide, enables sidecar injection, and uses canary upgrades for minor versions.

The two cannot run at the same time. If you have the Istio service mesh add-on enabled, you need to disable it before enabling App Routing Gateway API (and vice versa). If you need full mesh capabilities like mTLS between services, traffic policies, and telemetry, stick with the Istio service mesh add-on. If you just need managed ingress via Gateway API without the mesh overhead, this is the right choice.
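To make the layered model concrete, here's a minimal sketch of a Gateway and an HTTPRoute using the approuting-istio class described above (the resource names, namespace, hostname, and backend service are illustrative):

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: demo-gateway
  namespace: demo
spec:
  gatewayClassName: approuting-istio
  listeners:
    - name: http
      port: 80
      protocol: HTTP
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: demo-route
  namespace: demo
spec:
  parentRefs:
    - name: demo-gateway
  hostnames:
    - myapp.example.com
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        # Routes all matching traffic to this backend Service
        - name: demo-service
          port: 80
```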
## Current Limitations

The new App Routing solution is in preview, so should not be run in production yet. There are also some gaps compared to the existing add-on, which you need to be aware of before planning a production migration.

The biggest one: DNS and TLS certificate management via the add-on isn't supported yet for Gateway API. If you're currently using az aks approuting update and az aks approuting zone add to automate Key Vault and Azure DNS integration with the NGINX-based add-on, that workflow doesn't carry over. TLS termination is still possible, but you'll need to set it up manually. The AKS docs cover the steps, but it's more hands-on than what the NGINX add-on gives you today. This is expected to be addressed when the feature reaches GA.

SNI passthrough (TLSRoute) and egress traffic management aren't supported either. And as mentioned, it's mutually exclusive with the Istio service mesh add-on.

For production workloads that depend heavily on automated DNS and TLS management, you may want to wait until GA, or look at Application Gateway for Containers as an alternative. But for teams that can handle TLS setup manually, for non-production environments, there's no reason not to start testing this now.

## Getting Started

Before you can enable the feature, you need the aks-preview CLI extension (version 19.0.0b24 or later), the Managed Gateway API CRDs enabled, and the App Routing Gateway API preview feature flag registered:

```bash
az extension add --name aks-preview
az extension update --name aks-preview

# Managed Gateway API CRDs (required dependency)
az feature register --namespace "Microsoft.ContainerService" --name "ManagedGatewayAPIPreview"

# App Routing Gateway API implementation
az feature register --namespace "Microsoft.ContainerService" --name "AppRoutingIstioGatewayAPIPreview"
```

Feature flag registration can take a few minutes. Once they're registered, enable the add-on on a new or existing cluster. You need both --enable-gateway-api (for the managed Gateway API CRD installation) and --enable-app-routing-istio (for the Istio-based implementation):

```bash
# New cluster
az aks create \
  --resource-group ${RESOURCE_GROUP} \
  --name ${CLUSTER} \
  --location swedencentral \
  --enable-gateway-api \
  --enable-app-routing-istio

# Existing cluster
az aks update \
  --resource-group ${RESOURCE_GROUP} \
  --name ${CLUSTER} \
  --enable-gateway-api \
  --enable-app-routing-istio
```

Verify istiod is running:

```bash
kubectl get pods -n aks-istio-system
```

You should see two istiod pods in a Running state. From here, you can create a Gateway and HTTPRoute to test traffic flow. The AKS quickstart walks through this with the httpbin sample app if you want a quick validation.

## Migrating From NGINX Ingress

Whether you're running standalone NGINX (self-installed via Helm) or the NGINX-based App Routing add-on, the migration process is essentially the same. You're moving from Ingress API resources to Gateway API resources, and the new controller runs alongside your existing one during the transition. The only real differences are what you're cleaning up at the end and, if you're on the App Routing add-on, whether you were relying on its built-in DNS and TLS automation.

### Inventory Your Ingress Resources

Before anything else, understand what you have:

```bash
kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName'
```

Look specifically for custom snippets, lua configurations, or anything that relies heavily on NGINX-specific behaviour. These won't have direct equivalents in Gateway API and will need manual attention.
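As an illustration of what "needs manual attention" looks like, an Ingress that leans on a controller-specific snippet annotation (a hypothetical example, not from the original post) can't be converted mechanically:

```yaml
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: legacy-app
  annotations:
    # Arbitrary NGINX config snippets have no direct Gateway API equivalent;
    # some effects (like header rewrites) map to HTTPRoute filters, others
    # need a redesign rather than a conversion.
    nginx.ingress.kubernetes.io/configuration-snippet: |
      more_set_headers "X-Legacy: true";
spec:
  ingressClassName: nginx
  rules:
    - host: legacy.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: legacy-app
                port:
                  number: 80
```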
### Convert Ingress Resources to Gateway API

The ingress2gateway tool (v1.0.0) handles conversion of Ingress resources to Gateway API equivalents. It supports over 30 common NGINX annotations and generates Gateway and HTTPRoute YAML. It works regardless of whether your Ingress resources use the nginx or webapprouting.kubernetes.azure.com IngressClass:

```bash
# Install
go install github.com/kubernetes-sigs/ingress2gateway@v1.0.0

# Convert from live cluster
ingress2gateway print --providers=ingress-nginx -A > gateway-resources.yaml

# Or convert from a local file
ingress2gateway print --providers=ingress-nginx --input-file=./manifests/ingress.yaml > gateway-resources.yaml
```

Review the output carefully. The tool flags annotations it can't convert as comments in the generated YAML, so you'll know exactly what needs manual work. Common gaps include custom configuration snippets and regex-based rewrites that don't map cleanly to Gateway API's routing model.

Make sure you update the gatewayClassName in the generated Gateway resources to approuting-istio. The tool may generate a generic GatewayClass name that you'll need to change.

### Handle DNS and TLS

If you're coming from standalone NGINX, you're likely managing DNS and TLS yourself already, so nothing changes here: just make sure your certificate Secrets and DNS records are ready for the new Gateway IP.

If you're coming from the App Routing add-on and relying on its built-in DNS and TLS management (via az aks approuting zone add and Key Vault integration), this is the part that needs extra thought. That automation doesn't carry over to the Gateway API implementation yet, so you'll need to handle it differently until GA. For TLS, you can either create Kubernetes Secrets with your certificates manually or set up a workflow to sync them from Key Vault. The AKS docs on securing Gateway API traffic cover the manual approach. For DNS, you'll need to manage records yourself or use ExternalDNS to automate it. ExternalDNS supports Gateway API resources, so this is a viable path if you want automation.

### Deploy and Validate

With the add-on enabled, apply your converted resources:

```bash
kubectl apply -f gateway-resources.yaml
```

Wait for the Gateway to be programmed and get the external IP:

```bash
kubectl wait --for=condition=programmed gateways.gateway.networking.k8s.io <gateway-name>
export GATEWAY_IP=$(kubectl get gateways.gateway.networking.k8s.io <gateway-name> -ojsonpath='{.status.addresses[0].value}')
```

The key thing here is that your existing NGINX controller (whether standalone or add-on managed) is still running and serving production traffic. The Gateway API resources are handled separately by the Istio-based controller in aks-istio-system. This parallel running is what makes the migration safe.

Test your routes against the new Gateway IP. You'll need to provide the appropriate URL as a host header, as your DNS will still be pointing at the NGINX add-on at this point.

```bash
curl -H "Host: myapp.example.com" http://$GATEWAY_IP
```

Run your full validation suite. Check TLS, path routing, headers, authentication, anything your applications depend on. Take your time here; nothing changes for production until you update DNS.

### Cut Over DNS and Clean Up

Once you're confident, lower your DNS TTL to 60 seconds (do this well in advance), then update your DNS records to point to the new Gateway IP. Keep the old NGINX controller running for 24-48 hours as a rollback option.
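If your zone happens to live in Azure DNS, that cutover can be a couple of CLI calls; a hedged sketch with placeholder resource group, zone, and record names:

```bash
# Lower the TTL well ahead of the cutover window
az network dns record-set a update \
  --resource-group dns-rg --zone-name example.com --name myapp --set ttl=60

# Point the A record at the new Gateway IP captured earlier; remove the
# old controller's IP with `az network dns record-set a remove-record`
# once you're satisfied with the cutover
az network dns record-set a add-record \
  --resource-group dns-rg --zone-name example.com \
  --record-set-name myapp --ipv4-address $GATEWAY_IP
```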
After traffic has been flowing cleanly through the Gateway API path, clean up the old setup. What this looks like depends on where you started.

If you were on standalone NGINX:

```bash
helm uninstall ingress-nginx -n ingress-nginx
kubectl delete namespace ingress-nginx
```

If you were on the App Routing add-on with NGINX, verify nothing is still using the old IngressClass:

```bash
kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \
  | grep "webapprouting"
```

Delete any remaining Ingress resources that reference the old class, then disable the NGINX-based App Routing add-on:

```bash
az aks approuting disable --resource-group ${RESOURCE_GROUP} --name ${CLUSTER}
```

Some resources (configMaps, secrets, and the controller deployment) will remain in the app-routing-system namespace after disabling. You can clean these up by deleting the namespace once you're satisfied everything is running through the Gateway API path:

```bash
kubectl delete ns app-routing-system
```

In both cases, clean up any old Ingress resources that are no longer being used.

## Upgrades and Lifecycle

The Istio control plane version is tied to your AKS cluster's Kubernetes version. AKS automatically handles patch upgrades as part of its release cycle, and minor version upgrades happen in-place when you upgrade your cluster's Kubernetes version or when a new Istio minor version is released for your AKS version.

One thing to be aware of: unlike the Istio service mesh add-on, upgrades here are in-place, not canary-based. The HPA and PDB on each Gateway help minimise disruption, but plan accordingly for production. If you have maintenance windows configured, the istiod upgrades will respect them.

## What Should You Do Now?

The timeline hasn't changed. The standalone NGINX Ingress project was retired in March 2026, so if you're still running that, you're already on unsupported software. The NGINX App Routing add-on is supported until November 2026, which gives you a window, but it's not a long one.

If you're on standalone NGINX, you could get onto the App Routing add-on now to buy time (I covered this in my earlier post), then plan your migration to either the Gateway API mode or AGC.

If you're on the NGINX App Routing add-on: start testing the Gateway API mode in non-production now. Get familiar with the Gateway API resource model, understand the TLS and DNS gaps in the preview, and be ready to migrate when the feature reaches GA or when November gets close, whichever comes first.

If you need production-ready TLS and DNS automation today and can't wait for GA, App Gateway for Containers is your best option right now.

Whatever path you choose, make sure you have a plan in place before November. Running unsupported ingress software on production infrastructure isn't where you want to be.

# Continued Investment in Azure App Service
*This blog was originally published to the App Service team blog.*

## Recent Investments

### Premium v4 (Pv4)

Azure App Service Premium v4 delivers higher performance and scalability on newer Azure infrastructure while preserving the fully managed PaaS experience developers rely on. Premium v4 offers expanded CPU and memory options, improved price-performance, and continued support for App Service capabilities such as deployment slots, integrated monitoring, and availability zone resiliency. These improvements help teams modernize and scale demanding workloads without taking on additional operational complexity.

### App Service Managed Instance

App Service Managed Instance extends the App Service model to support Windows web applications that require deeper environment control. It enables plan-level isolation, optional private networking, and operating system customization while retaining managed scaling, patching, identity, and diagnostics. Managed Instance is designed to reduce migration friction for existing applications, allowing teams to move to a modern PaaS environment without code changes.

### Faster Runtime and Language Support

Azure App Service continues to invest in keeping pace with modern application stacks. Regular updates across .NET, Node.js, Python, Java, and PHP help developers adopt new language versions and runtime improvements without managing underlying infrastructure.

### Reliability and Availability Improvements

Ongoing investments in platform reliability and resiliency strengthen production confidence. Expanded Availability Zone support and related infrastructure improvements help applications achieve higher availability with more flexible configuration options as workloads scale.

### Deployment Workflow Enhancements

Deployment workflows across Azure App Service continue to evolve, with ongoing improvements to GitHub Actions, Azure DevOps, and platform tooling. These enhancements reduce friction from build to production while preserving the managed App Service experience.

## A Platform That Grows With You

These recent investments reflect a consistent direction for Azure App Service: active development focused on performance, reliability, and developer productivity. Improvements to runtimes, infrastructure, availability, and deployment workflows are designed to work together, so applications benefit from platform progress without needing to re-architect or change operating models.

The recent General Availability of Aspire on Azure App Service is another example of this direction. Developers building distributed .NET applications can now use the Aspire AppHost model to define, orchestrate, and deploy their services directly to App Service — bringing a code-first development experience to a fully managed platform.

We are also seeing many customers build and run AI-powered applications on Azure App Service, integrating models, agents, and intelligent features directly into their web apps and APIs. App Service continues to evolve to support these scenarios, providing a managed, scalable foundation that works seamlessly with Azure's broader AI services and tooling.

Whether you are modernizing with Premium v4, migrating existing workloads using App Service Managed Instance, or running production applications at scale, including AI-enabled workloads, Azure App Service provides a predictable and transparent foundation that evolves alongside your applications.
Azure App Service continues to focus on long-term value through sustained investment in a managed platform developers can rely on as requirements grow, change, and increasingly incorporate AI.

## Get Started

Ready to build on Azure App Service? Here are some resources to help you get started:

- **Create your first web app** — Deploy a web app in minutes using the Azure portal, CLI, or VS Code.
- **App Service documentation** — Explore guides, tutorials, and reference for the full platform.
- **Aspire on Azure App Service** — Now generally available. Deploy distributed .NET applications to App Service using the Aspire AppHost model.
- **Pricing and plans** — Compare tiers including Premium v4 and find the right fit for your workload.
- **App Service on Azure Architecture Center** — Reference architectures and best practices for production deployments.
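If you want to try the CLI route, a single command can create and deploy a small app; a hedged sketch (the app name, runtime string, SKU, and region are illustrative, and exact runtime identifiers vary by az CLI version):

```bash
# From the folder containing your app code: creates the resource group,
# plan, and web app if they don't exist, then deploys the current directory.
az webapp up \
  --name my-first-webapp \
  --runtime "NODE:20-lts" \
  --sku P1V3 \
  --location westeurope
```

# Unit Testing Helm Charts with Terratest: A Pattern Guide for Type-Safe Validation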
Helm charts are the de facto standard for packaging Kubernetes applications. But here's a question worth asking: how do you know your chart actually produces the manifests you expect, across every environment, before it reaches a cluster?

If you're like most teams, the answer is some combination of helm template eyeball checks, catching issues in staging, or hoping for the best. That's slow, error-prone, and doesn't scale. In this post, we'll walk through a better way: a render-and-assert approach to unit testing Helm charts using Terratest and Go. The result? Type-safe, automated tests that run locally in seconds with no cluster required.

## The Problem

Let's start with why this matters. Helm charts are templates that produce YAML, and templates have logic: conditionals, loops, value overrides per environment. That logic can break silently:

- A values-prod.yaml override points to the wrong container registry
- A security context gets removed during a refactor and nobody notices
- An ingress host is correct in dev but wrong in production
- HPA scaling bounds are accidentally swapped between environments
- Label selectors drift out of alignment with pod templates, causing orphaned ReplicaSets

These aren't hypothetical scenarios. They're real bugs that slip through helm lint and code review because those tools don't understand what your chart should produce. They only check whether the YAML is syntactically valid. These bugs surface at deploy time, or worse, in production. So how do we catch them earlier?

## The Approach: Render and Assert

The idea is straightforward. Instead of deploying to a cluster to see if things work, we render the chart locally and validate the output programmatically. Here's the three-step model:

1. **Render**: Terratest calls helm template with your base values.yaml + an environment-specific values-<env>.yaml override
2. **Unmarshal**: The rendered YAML is deserialized into real Kubernetes API structs (appsV1.Deployment, coreV1.ConfigMap, networkingV1.Ingress, etc.)
3. **Assert**: Testify assertions validate every field that matters, including names, labels, security context, probes, resource limits, ingress routing, and more

No cluster. No mocks. No flaky integration tests. Just fast, deterministic validation of your chart's output. Here's what that looks like in practice:

```go
// Arrange
options := &helm.Options{
	ValuesFiles: s.valuesFiles,
}
output := helm.RenderTemplate(s.T(), options, s.chartPath, s.releaseName, s.templates)

// Act
var deployment appsV1.Deployment
helm.UnmarshalK8SYaml(s.T(), output, &deployment)

// Assert: security context is hardened
secCtx := deployment.Spec.Template.Spec.Containers[0].SecurityContext
require.Equal(s.T(), int64(1000), *secCtx.RunAsUser)
require.True(s.T(), *secCtx.RunAsNonRoot)
require.True(s.T(), *secCtx.ReadOnlyRootFilesystem)
require.False(s.T(), *secCtx.AllowPrivilegeEscalation)
```

Notice something important here: because you're working with real Go structs, the compiler catches schema errors. If you typo a field path like secCtx.RunAsUsr, the code won't compile. With YAML-based assertion tools, that same typo would fail silently at runtime. This type safety is a big deal when you're validating complex resources like Deployments.

## What to Test: 16 Patterns Across 6 Categories

That covers the how. But what should you actually assert? Through applying this approach across multiple charts, we've identified 16 test patterns that consistently catch real bugs.
They fall into six categories:

| Category | What Gets Validated |
|---|---|
| Identity & Labels | Resource names, 5 standard Helm/K8s labels, selector alignment |
| Configuration | Environment-specific configmap data, env var injection |
| Container | Image registry per env, ports, resource requests/limits |
| Security | Non-root user, read-only FS, dropped capabilities, AppArmor, seccomp, SA token automount |
| Reliability | Startup/liveness/readiness probes, volume mounts |
| Networking & Scaling | Ingress hosts/TLS per env, service port wiring, HPA bounds per env |

You don't need all 16 on day one. Start with resource name and label validation, since those apply to every resource and catch the most common _helpers.tpl bugs. Then add security and environment-specific patterns as your coverage grows. Now, let's look at how to structure these tests to handle the trickiest part: multiple environments.

## Multi-Environment Testing

One of the most common Helm chart bugs is environment drift, where values that are correct in dev are wrong in production. A single test suite that only validates one set of values will miss these entirely. The solution is to maintain separate test suites per environment:

```
tests/unit/my-chart/
├── dev/   ← Asserts against values.yaml + values-dev.yaml
├── test/  ← Asserts against values.yaml + values-test.yaml
└── prod/  ← Asserts against values.yaml + values-prod.yaml
```

Each environment's tests assert the merged result of values.yaml + values-<env>.yaml. So when your values-prod.yaml overrides the container registry to prod.azurecr.io, the prod tests verify exactly that, while the dev tests verify dev.azurecr.io.

This structure catches a class of bugs that no other approach does: "it works in dev" issues where an environment-specific override has a typo, a missing field, or an outdated value. But environment-specific configuration isn't the only thing worth testing per commit. Let's talk about security.

## Security as Code

Security controls in Kubernetes manifests are notoriously easy to weaken by accident. Someone refactors a deployment template, removes a securityContext block they think is unused, and suddenly your containers are running as root in production. No linter catches this. No code reviewer is going to diff every field of a rendered manifest.

With this approach, you encode your security posture directly into your test suite. Every deployment test should validate:

- Container runs as non-root (UID 1000)
- Root filesystem is read-only
- All Linux capabilities are dropped
- Privilege escalation is blocked
- AppArmor profile is set to runtime/default
- Seccomp profile is set to RuntimeDefault
- Service account token automount is disabled

If someone removes a security control during a refactor, the test fails immediately, not after a security review weeks later. Security becomes a CI gate, not a review checklist.
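One way to keep that checklist consistent across suites is a shared helper; a sketch under the assumption that your tests unmarshal into k8s.io/api types as shown earlier (the helper and package names are ours, and the annotation-based AppArmor and pod-level automount checks are omitted for brevity):

```go
package charttest

import (
	"testing"

	"github.com/stretchr/testify/require"
	coreV1 "k8s.io/api/core/v1"
)

// assertHardenedSecurityContext encodes the security checklist above so every
// deployment test can call one helper instead of repeating the assertions.
func assertHardenedSecurityContext(t *testing.T, c coreV1.Container) {
	sc := c.SecurityContext
	require.NotNil(t, sc)
	require.Equal(t, int64(1000), *sc.RunAsUser)
	require.True(t, *sc.RunAsNonRoot)
	require.True(t, *sc.ReadOnlyRootFilesystem)
	require.False(t, *sc.AllowPrivilegeEscalation)
	require.NotNil(t, sc.Capabilities)
	require.Contains(t, sc.Capabilities.Drop, coreV1.Capability("ALL"))
	require.NotNil(t, sc.SeccompProfile)
	require.Equal(t, coreV1.SeccompProfileTypeRuntimeDefault, sc.SeccompProfile.Type)
}
```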
With patterns and environments covered, the next question is: how do you wire this into your CI/CD pipeline?

## CI/CD Integration with Azure DevOps

These tests integrate naturally into Azure DevOps pipelines. Since they're just Go tests that call helm template under the hood, all you need is a Helm CLI and a Go runtime on your build agent. A typical multi-stage pipeline looks like:

```yaml
stages:
  - stage: Build       # Package the Helm chart
  - stage: Dev         # Lint + test against values-dev.yaml
  - stage: Test        # Lint + test against values-test.yaml
  - stage: Production  # Lint + test against values-prod.yaml
```

Each stage uses a shared template that installs Helm and Go, extracts the packaged chart, runs helm lint, and executes the Go tests with gotestsum. Environment gates ensure production tests pass before deployment proceeds. Here's the key part of a reusable test template:

```yaml
- script: |
    export PATH=$PATH:/usr/local/go/bin:$(go env GOPATH)/bin
    go install gotest.tools/gotestsum@latest
    cd $(Pipeline.Workspace)/helm.artifact/tests/unit
    gotestsum --format testname --junitfile $(Agent.TempDirectory)/test-results.xml \
      -- ./${{ parameters.helmTestPath }}/... -count=1 -timeout 50m
  displayName: 'Test helm chart'
  env:
    HELM_RELEASE_NAME: ${{ parameters.helmReleaseName }}
    HELM_VALUES_FILE_OVERRIDE: ${{ parameters.helmValuesFileOverride }}

- task: PublishTestResults@2
  displayName: 'Publish test results'
  inputs:
    testResultsFormat: 'JUnit'
    testResultsFiles: '$(Agent.TempDirectory)/test-results.xml'
  condition: always()
```

The PublishTestResults@2 task makes pass/fail results visible on the build's Tests tab, showing individual test names, durations, and failure details. The condition: always() ensures results are published even when tests fail, so you always have visibility.

At this point you might be wondering: why Go and Terratest? Why not a simpler YAML-based tool?

## Why Terratest + Go Instead of helm-unittest?

helm-unittest is a popular YAML-based alternative, and it's a fair question. Both tools are valid. Here's why we landed on Terratest:

| | Terratest + Go | helm-unittest (YAML) |
|---|---|---|
| Type safety | Renders into real K8s API structs; compiler catches schema errors | String matching on raw YAML; typos in field paths fail silently |
| Language features | Loops, conditionals, shared setup, table-driven tests | Limited to YAML assertion DSL |
| Debugging | Standard Go debugger, stack traces | YAML diff output only |
| Ecosystem alignment | Same language as Terraform tests, one testing stack | Separate tool, YAML-only |

The type safety argument is the strongest. When you unmarshal into appsV1.Deployment, the Go compiler guarantees your assertions reference real fields. With helm-unittest, a YAML path like spec.template.spec.containers[0].securityContest (note the typo) would silently pass because it matches nothing, rather than failing loudly. That said, if your team has no Go experience and needs the lowest adoption barrier, helm-unittest is a reasonable starting point. For teams already using Go or Terraform, Terratest is the stronger long-term choice.

## Getting Started

Ready to try this? Here's a minimal project structure to get you going:

```
your-repo/
├── charts/
│   └── your-chart/
│       ├── Chart.yaml
│       ├── values.yaml
│       ├── values-dev.yaml
│       ├── values-test.yaml
│       ├── values-prod.yaml
│       └── templates/
├── tests/
│   └── unit/
│       ├── go.mod
│       └── your-chart/
│           ├── dev/
│           ├── test/
│           └── prod/
└── Makefile
```

Prerequisites: Go 1.22+, Helm 3.14+

You'll need three Go module dependencies:

- github.com/gruntwork-io/terratest v0.46.16
- github.com/stretchr/testify v1.8.4
- k8s.io/api v0.28.4

Initialize your test module, write your first test using the patterns above, and run:

```bash
cd tests/unit
HELM_RELEASE_NAME=your-chart \
HELM_VALUES_FILE_OVERRIDE=values-dev.yaml \
go test -v ./your-chart/dev/... -timeout 30m
```

Start with a ConfigMap test. It's the simplest resource type and lets you validate the full render-unmarshal-assert flow before tackling Deployments. Once that passes, work your way through the pattern categories, adding security and environment-specific assertions as you go.
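Here's what that first ConfigMap test might look like; a sketch where the chart path, release name, ConfigMap name, and data key are illustrative, not from the original post:

```go
package dev

import (
	"testing"

	"github.com/gruntwork-io/terratest/modules/helm"
	"github.com/stretchr/testify/require"
	coreV1 "k8s.io/api/core/v1"
)

func TestConfigMapRendersDevValues(t *testing.T) {
	chartPath := "../../../../charts/your-chart"
	options := &helm.Options{
		ValuesFiles: []string{
			chartPath + "/values.yaml",
			chartPath + "/values-dev.yaml",
		},
	}

	// Render only the ConfigMap template, then unmarshal into a typed struct.
	output := helm.RenderTemplate(t, options, chartPath, "your-chart",
		[]string{"templates/configmap.yaml"})

	var cm coreV1.ConfigMap
	helm.UnmarshalK8SYaml(t, output, &cm)

	require.Equal(t, "your-chart-config", cm.Name)
	require.Equal(t, "dev", cm.Data["environment"])
}
```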
## Wrapping Up

Unit testing Helm charts with Terratest gives you something that helm lint and manual review can't:

- **Type-safe validation**: The compiler catches schema errors, not production
- **Environment-specific coverage**: Each environment's values are tested independently
- **Security as code**: Security controls are verified on every commit, not in periodic reviews
- **Fast feedback**: Tests run in seconds with no cluster required
- **CI/CD integration**: JUnit results published natively to Azure DevOps

The patterns we've covered here are the ones that have caught the most real bugs for us. Start small with resource names and labels, and expand from there. The investment is modest, and the first time a test catches a broken values-prod.yaml override before it reaches production, it'll pay for itself.

## We'd Love Your Feedback

We'd love to hear how this approach works for your team:

- Which patterns were most useful for your charts?
- What resource types or patterns are missing?
- How did the adoption experience go?

Drop a comment below. Happy to dig into any of these topics further!
# Heroku Entered Maintenance Mode — Here's Your Next Move

Heroku has entered sustaining engineering — no new features, no new enterprise contracts. If you're running production workloads on the platform, you're probably thinking about what comes next.

Azure Container Apps is worth a serious look. Scale-to-zero pricing, event-driven autoscaling, built-in microservice support, serverless GPUs, and an active roadmap — it's a container platform that handles everything from a simple web app to AI-native workloads, and you only pay for what you use. I migrated a real Heroku app to Container Apps to pressure-test the experience. Here's what I learned, what to watch out for, and how you can do it in an afternoon.

## Why Container Apps is the natural next chapter

The philosophy carries over directly. Where Heroku had git push, Container Apps has:

```bash
az containerapp up --name my-app --source . --environment my-env
```

One command. If you have a Dockerfile, it builds and deploys your app directly. No local Docker install, no manual registry push — code in, URL out. That part didn't change. The concept mapping is tight:

| Heroku | Azure Container Apps |
|---|---|
| Dyno | Container App replica |
| Procfile process types | Separate Container Apps |
| Heroku add-ons | Azure managed services |
| Config vars | Environment variables + secrets |
| heroku run one-off dynos | Container Apps Jobs |
| Heroku Pipelines | GitHub Actions |
| Heroku Scheduler | Scheduled Container Apps Jobs |

Container Apps also includes capabilities you'd need to piece together separately on Heroku — KEDA-powered autoscaling from any event source, Dapr for service-to-service communication, traffic splitting across revisions for safe rollouts, and scale to zero so you stop paying when nothing's running.

Simplest path? If your app is a straightforward web server and you don't want containers at all, Azure App Service (az webapp up) is also available. But for most Heroku workloads — especially anything with workers, background jobs, or variable traffic — Container Apps is the better fit.

## What a real migration looks like

I took a Node.js + Redis todo app from Heroku and moved it to Container Apps. The app is intentionally boring — Express server, Redis for storage, one web process. This is roughly what a lot of Heroku apps look like, and the migration took about 90 minutes end-to-end (most of that waiting for Redis to provision).

### Step 1: Export what you have

```bash
heroku config --json --app my-heroku-app > heroku-config.json
heroku apps:info --app my-heroku-app
heroku addons --app my-heroku-app
```

You want three things: your environment variables, your add-on list, and your app metadata. The config export is the most important one — it's every secret and connection string your app needs.

### Step 2: Create the Azure backing services

For each Heroku add-on, create the Azure equivalent. Here are the common ones:

| Heroku add-on | Azure service | CLI command |
|---|---|---|
| Heroku Postgres | Azure Database for PostgreSQL | az postgres flexible-server create |
| Heroku Redis | Azure Cache for Redis | az redis create |
| Heroku Scheduler | Container Apps Jobs | az containerapp job create |
| SendGrid | SendGrid via Marketplace | (Portal) |
| Papertrail / LogDNA | Azure Monitor + Log Analytics | (Enabled by default) |

For my todo app, I needed Redis:

```bash
az redis create \
  --name my-redis \
  --resource-group my-rg \
  --location swedencentral \
  --sku Basic --vm-size c0
```

One thing to know: Azure Cache for Redis takes 10–20 minutes to provision. Heroku's Redis add-on takes about two minutes. Budget the time.

### Step 3: Containerize

If you don't have a Dockerfile, write one.
For a Node app this is about 10 lines:

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
CMD ["node", "server.js"]
```

Don't have a Dockerfile? Point GitHub Copilot at the migration repo and it will generate one for your stack — Node, Python, Ruby, Java, or Go. The repo includes templates and a containerization skill that inspects your app and produces a production-ready Dockerfile.

### Step 4: Build, push, deploy

I used Azure Container Registry (ACR) for the build. No local Docker install needed.

```bash
az acr create --name myacr --resource-group my-rg --sku Basic
az acr build --registry myacr --image my-app:v1 .
```

Then create the Container App:

```bash
az containerapp create \
  --name my-app \
  --resource-group my-rg \
  --environment my-env \
  --image myacr.azurecr.io/my-app:v1 \
  --registry-server myacr.azurecr.io \
  --target-port 8080 \
  --ingress external \
  --min-replicas 1
```

### Step 5: Wire up the config

This is where Heroku's config:get maps to Container Apps' secrets and environment variables. One gotcha I hit: you have to set secrets before you reference them in environment variables. If you try to do both at once, the deployment fails.

```bash
# Set the secret first
az containerapp secret set \
  --name my-app \
  --resource-group my-rg \
  --secrets redis-url="rediss://:ACCESS_KEY@my-redis.redis.cache.windows.net:6380"

# Then reference it
az containerapp update \
  --name my-app \
  --resource-group my-rg \
  --set-env-vars "REDIS_URL=secretref:redis-url"
```

### Step 6: Verify and cut over

Hit the Azure URL, test your routes, check that data flows through the new Redis instance. When you're satisfied, update your DNS CNAME to point at the Container Apps FQDN. Total time: ~90 minutes, and most of that was waiting for Redis to provision. The actual migration work was about 30 minutes of CLI commands.
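For the remaining non-secret config vars from Step 1's heroku-config.json, a bulk conversion can save typing; a hedged sketch (it assumes values contain no spaces, and anything sensitive should go through secrets as in Step 5 instead):

```bash
# Turn {"KEY":"value", ...} into KEY=value pairs for the CLI
vars=$(jq -r 'to_entries[] | "\(.key)=\(.value)"' heroku-config.json)

# Apply them as plain environment variables on the Container App
az containerapp update \
  --name my-app \
  --resource-group my-rg \
  --set-env-vars $vars
```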
## Lessons from the field

I want to be upfront about what to watch for — these are the things that would waste your time if you hit them unprepared.

📌 **Register Azure providers first.** If your subscription has never used Container Apps, you need to register the resource providers before creating anything:

```bash
az provider register --namespace Microsoft.App --wait
az provider register --namespace Microsoft.OperationalInsights --wait
```

This takes a minute or two. Without it, resource creation fails with confusing error messages.

📌 **Set secrets before referencing them in env vars.** The CLI doesn't warn you — it just fails the deployment. Always az containerapp secret set first, then az containerapp update --set-env-vars.

📌 **Budget time for Azure resource provisioning.** Azure Cache for Redis takes 10–20 minutes vs Heroku's ~2 minutes. Enterprise-grade infrastructure takes a bit longer to spin up — plan accordingly and provision backing services in parallel.

None of these are blockers. They're the kind of things a migration guide should tell you upfront — and ours does.

## Migrate today, build intelligent apps tomorrow

Once your app is on Container Apps, you're on a platform built for AI-native workloads too. No second migration required:

- **Serverless GPU** — attach GPU compute to your Container Apps for inference workloads. Run models alongside your app, same environment, same deployment pipeline. No separate ML infrastructure to manage.
- **Dynamic Sessions** — spin up isolated, sandboxed code execution environments on demand. Build AI agents that execute tools, run LLM-generated code safely, or offer interactive coding experiences — all within your existing Container Apps environment.

These aren't separate services you bolt on — they're configuration changes on the platform you're already running on.

Building AI-native? Container Apps pairs naturally with Azure AI Foundry — one place to access state-of-the-art models from both OpenAI and Anthropic, manage prompts, evaluate outputs, and deploy endpoints. Your app runs on Container Apps; your intelligence runs on Foundry. Same subscription, same identity, no glue code between clouds.

The applications being migrated today won't look the same in two years. A platform that grows with you — from web app to intelligent service — means you make this move once.

## You don't have to figure this out alone

We've built the resources to make this migration fast and repeatable:

- 📖 **Official Migration Guide** — End-to-end walkthrough covering assessment, containerization, service mapping, and deployment. Start here.
- 🤖 **Agent-Assisted Migration Repo** — An open-source repository designed to work with GitHub Copilot and other AI coding assistants. It includes an AGENTS.md file and six migration skills that walk you through the entire process — from Heroku assessment to DNS cutover — with real CLI commands, Dockerfile templates for five languages, Bicep IaC, and GitHub Actions workflows.

Point Copilot at the repo alongside your app's source code, and it becomes a migration pair-programmer: running the right commands, generating Dockerfiles, setting up CI/CD, and flagging things you might miss. This isn't a magic migrate-my-app button. It's more like having a colleague who has done this migration twenty times sitting next to you while you do it.

## The cost math works

Let's talk numbers.

| Heroku plan | Monthly cost | Container Apps equivalent | Monthly cost |
|---|---|---|---|
| Standard-1X (idle most of the day) | $25/mo | Consumption plan (scale to zero) | ~$0–5/mo |
| Performance-L (steady traffic) | $500/mo | Dedicated plan with autoscaling | Meaningfully less |
| 10 low-traffic apps across dev/staging/prod | $250+/mo | Consumption plan with free grants | Near zero |

Container Apps' monthly free grant covers 180,000 vCPU-seconds and 2 million requests. For apps that idle most of the day, that's often enough to pay nothing at all. The biggest savings come from workloads that don't run 24/7. Heroku charges for every hour a dyno is running, period. Container Apps charges for actual compute consumption and scales to zero when there's no traffic.

## Get started

1. **Inventory** — Run heroku apps and heroku addons to see what you have.
2. **Pick a pilot** — Choose a non-critical app for your first migration.
3. **Migrate** — Follow the official migration guide, or point GitHub Copilot at the migration repo and let it pair with you.

Azure Container Apps gives you a production-grade container platform with scale-to-zero economics, an active roadmap, and a path to AI-native workloads — all from a single az containerapp up command. If you're evaluating what comes after Heroku, start here.

👉 Start your migration · Clone the migration repo · Explore Azure Container Apps
Get started

1. Inventory — Run heroku apps and heroku addons to see what you have.
2. Pick a pilot — Choose a non-critical app for your first migration.
3. Migrate — Follow the official migration guide, or point GitHub Copilot at the migration repo and let it pair with you.

Azure Container Apps gives you a production-grade container platform with scale-to-zero economics, an active roadmap, and a path to AI-native workloads — all from a single az containerapp up command. If you're evaluating what comes after Heroku, start here.

👉 Start your migration · Clone the migration repo · Explore Azure Container Apps

Migrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)

What is ACI/Virtual Nodes?

Azure Container Instances (ACI) is a fully managed serverless container platform that lets you run containers on demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM-backed node pools. From a developer's perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI's serverless infrastructure, enabling fast scale-out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short-lived workloads where speed and cost efficiency matter more than long-running capacity planning.

Introducing the next generation of Virtual Nodes on ACI

The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add-on and delivering a more Kubernetes-native, flexible, and scalable experience when bursting workloads from AKS to ACI.

In this article I will demonstrate how you can migrate an existing AKS cluster from the Virtual Nodes managed add-on (legacy) to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm.

More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide.

Please note that all code samples within this guide are examples only, and are provided without warranty/support.

Background

Virtual Nodes on ACI is rebuilt from the ground up, and includes several fixes and enhancements, for instance:

Added support/features
- VNet peering, outbound traffic to the internet with network security groups
- Init containers
- Host aliases
- Arguments for exec in ACI
- Persistent Volumes and Persistent Volume Claims
- Container hooks
- Confidential containers (see supported regions list here)
- ACI standby pools
- Support for image pulling via Private Link and Managed Identity (MSI)

Planned future enhancements
- Kubernetes network policies
- Support for IPv6
- Windows containers
- Port forwarding

Note: The new generation of the add-on is managed via Helm rather than as an AKS managed add-on.

Requirements & limitations
- Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB memory on one of the AKS cluster's VMs
- Each Virtual Nodes node supports up to 200 pods
- DaemonSets are not supported
- Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (Kubenet is not supported, nor is overlay networking)

Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart

For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, so the steps below show an example of installing the Virtual Nodes on ACI Helm chart and then removing the Virtual Nodes managed add-on and its resources.

In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet; however, if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here. Be sure to use a new subnet CIDR within the VNet address space which doesn't overlap with other subnets or with the AKS CIDRs for nodes/pods and ClusterIP services.
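To pick a CIDR that doesn't collide, it can help to list the prefixes already in use in the VNet. A quick check, assuming the $rg and $vnetName variables initialised under Deployment steps below (note that subnets configured with multiple prefixes expose addressPrefixes instead of addressPrefix):

```bash
# Show each subnet's name and address prefix in the target VNet
az network vnet subnet list \
  --resource-group $rg \
  --vnet-name $vnetName \
  --query "[].{name:name, prefix:addressPrefix}" -o table
```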
To minimise disruption, we'll first install the Virtual Nodes on ACI Helm chart, before then removing the legacy managed add-on and its resources.

Prerequisites
- A recent version of the Azure CLI
- An Azure subscription with sufficient ACI quota for your selected region
- Helm

Deployment steps

Initialise environment variables:

```bash
location=northeurope
rg=rg-virtualnode-demo
vnetName=vnet-virtualnode-demo
clusterName=aks-virtualnode-demo
aksSubnetName=subnet-aks
vnSubnetName=subnet-vn
```

Create the new Virtual Nodes on ACI subnet with the specific name value of cg (a custom subnet can be used by following the steps here):

```bash
vnSubnetId=$(az network vnet subnet create \
  --resource-group $rg \
  --vnet-name $vnetName \
  --name cg \
  --address-prefixes <your subnet CIDR> \
  --delegations Microsoft.ContainerInstance/containerGroups \
  --query id -o tsv)
```

Assign the cluster's -kubelet identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet:

```bash
nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv)
nodeRgId=$(az group show -n $nodeRg --query id -o tsv)
agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv)
agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv)

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Contributor" \
  --scope "$nodeRgId"

az role assignment create \
  --assignee-object-id "$agentPoolIdentityObjectId" \
  --assignee-principal-type ServicePrincipal \
  --role "Network Contributor" \
  --scope "$vnSubnetId"
```

Download the cluster's kubeconfig file:

```bash
az aks get-credentials -n $clusterName -g $rg
```

Clone the virtualnodesOnAzureContainerInstances GitHub repo:

```bash
git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git
```

Install the Virtual Nodes on ACI Helm chart:

```bash
helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode
```

Confirm the Virtual Nodes node shows within the cluster and is in a Ready state (virtualnode-n):

```
$ kubectl get node
NAME                                STATUS   ROLES    AGE     VERSION
aks-nodepool1-35702456-vmss000000   Ready    <none>   4h13m   v1.33.6
aks-nodepool1-35702456-vmss000001   Ready    <none>   4h13m   v1.33.6
virtualnode-0                       Ready    <none>   162m    v1.33.7
```

Scale down any running Virtual Nodes workloads (example below):

```bash
kubectl scale deploy <deploymentName> -n <namespace> --replicas=0
```

Drain and cordon the legacy Virtual Nodes node:

```bash
kubectl drain virtual-node-aci-linux
```

Disable the Virtual Nodes managed add-on (legacy):

```bash
az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node
```

Export a backup of the original subnet configuration:

```bash
az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json
```

Delete the original subnet (subnets cannot be renamed and therefore must be re-created):

```bash
az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName
```

Delete the previous (legacy) Virtual Nodes node from the cluster:

```bash
kubectl delete node virtual-node-aci-linux
```
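Before testing, it's worth confirming the cleanup took: the node list should now match the earlier listing minus virtual-node-aci-linux, with virtualnode-0 still Ready.

```bash
# The legacy node should be gone; the new Virtual Nodes node should remain Ready
kubectl get nodes
```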
Test and confirm pod scheduling on the Virtual Node:

```yaml
apiVersion: v1
kind: Pod
metadata:
  annotations:
  name: demo-pod
spec:
  containers:
  - command:
    - /bin/bash
    - -c
    - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done'
    image: mcr.microsoft.com/azure-cli
    name: hello-world-counter
    resources:
      limits:
        cpu: 2250m
        memory: 2256Mi
      requests:
        cpu: 100m
        memory: 128Mi
  nodeSelector:
    virtualization: virtualnode2
  tolerations:
  - effect: NoSchedule
    key: virtual-kubelet.io/provider
    operator: Exists
```

If the pod successfully starts on the Virtual Node, you should see similar to the below:

```
$ kubectl get pod -o wide demo-pod
NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
demo-pod   1/1     Running   0          95s   10.241.0.4   vnode2-virtualnode-0   <none>           <none>
```

Finally, modify the nodeSelector and tolerations properties of your Virtual Nodes workloads to match the requirements of Virtual Nodes on ACI, as detailed in the next section.

Modify your deployments to run on Virtual Nodes on ACI

For the Virtual Nodes managed add-on (legacy), the following nodeSelector and tolerations are used to run pods on Virtual Nodes:

```yaml
nodeSelector:
  kubernetes.io/role: agent
  kubernetes.io/os: linux
  type: virtual-kubelet
tolerations:
- key: virtual-kubelet.io/provider
  operator: Exists
- key: azure.com/aci
  effect: NoSchedule
```

For Virtual Nodes on ACI, the nodeSelector/tolerations are slightly different:

```yaml
nodeSelector:
  virtualization: virtualnode2
tolerations:
- effect: NoSchedule
  key: virtual-kubelet.io/provider
  operator: Exists
```
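Putting those pieces together, a minimal sketch of a Deployment manifest updated for Virtual Nodes on ACI could look like the following; the name, image, and replica count are placeholders rather than values from this walkthrough:

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: my-app                            # placeholder name
spec:
  replicas: 2
  selector:
    matchLabels:
      app: my-app
  template:
    metadata:
      labels:
        app: my-app
    spec:
      containers:
      - name: my-app
        image: mcr.microsoft.com/azure-cli   # placeholder image
      # Schedule onto the new Virtual Nodes on ACI node
      nodeSelector:
        virtualization: virtualnode2
      tolerations:
      - effect: NoSchedule
        key: virtual-kubelet.io/provider
        operator: Exists
```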
Troubleshooting

Check the virtual-node-admission-controller and virtualnode-n pods are running within the vn2 namespace:

```
$ kubectl get pod -n vn2
NAME                                                  READY   STATUS    RESTARTS        AGE
virtual-node-admission-controller-54cb7568f5-b7hnr    1/1     Running   1 (5h21m ago)   5h21m
virtualnode-0                                         6/6     Running   6 (4h48m ago)   4h51m
```

If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use kubectl describe pod to validate).

If the virtualnode-n pod is crashing, check the logs of the proxycri container to see whether there are any Managed Identity permissions issues (the cluster's -agentpool MSI needs to have Contributor access on the infrastructure resource group):

```bash
kubectl logs -n vn2 virtualnode-0 -c proxycri
```

Further troubleshooting guidance is available within the official documentation.

Support

If you have issues deploying or using Virtual Nodes on ACI, add a GitHub issue here.

The Swarm Diaries: What Happens When You Let AI Agents Loose on a Codebase

The Idea

Single-agent coding assistants are impressive, but they have a fundamental bottleneck: they think serially. Ask one to build a full CLI app with a database layer, a command parser, pretty output, and tests, and it'll grind through each piece one by one.

Industry benchmarks bear this out: AIMultiple's 2026 agentic coding benchmark measured Claude Code CLI completing full-stack tasks in ~12 minutes on average, with other CLI agents ranging from 3 to 14 minutes depending on the tool. A three-week real-world test by Render.com found single-agent coding workflows taking 10–30 minutes for multi-file feature work.

But these subtasks don't depend on each other. A storage agent doesn't need to wait for the CLI agent. A test writer doesn't need to watch the renderer work. What if they all ran at the same time?

The hypothesis was straightforward: a swarm of specialized agents should beat a single generalist on at least two of three pillars — speed, quality, or cost. The architecture looked clean on a whiteboard. The reality was messier. But first, let me explain the machinery that makes this possible.

How It's Wired: Brains and Hands

The system runs on a brains-and-hands split.

The brain is an Azure Durable Task Scheduler (DTS) orchestration — a deterministic workflow that decomposes the goal into a task DAG, fans agents out in parallel, merges their branches, and runs quality gates. If the worker crashes mid-run, DTS replays from the last checkpoint. No work lost. Simple LLM calls — the planner that decomposes the goal, the judge that scores the output — run as lightweight DTS activities. One call, no tools, cheap.

The hands are Microsoft Agent Framework (MAF) agents, each running in its own Docker container. One sandbox per agent, each with its own git clone, filesystem, and toolset. When an agent's LLM decides to edit a file or run a build, the call routes through middleware to that agent's isolated container. No two agents ever touch the same workspace. These complex agents — coders, researchers, the integrator — run as DTS durable entities with full agentic loops and turn-level checkpointing.

The split matters because LLM reasoning and code execution have completely different reliability profiles. The brain checkpoints and replays deterministically. The hands are ephemeral — if a container dies, spin up a new one and replay the agent's last turn. This separation is what lets you run five agents in parallel without them stepping on each other's git branches, build artifacts, or file handles.

It's also what made every bug I was about to encounter debuggable. When something broke, I always knew which side broke — the orchestration logic, or the agent behavior. That distinction saved me more hours than any other design decision.

The First Run Produced Nothing

After hours of vibe-coding the foundation — Pydantic models, skill prompts, a prompt builder, a context store, sixteen architectural decisions documented in ADRs — I wired up the seven-phase orchestration and hit go.

All five agents returned empty responses. Every single one.

The logs showed agents "running" but producing zero output. I stared at the code for an embarrassingly long time before I found it. The planner returned task IDs as integers — 1, 2, 3. The sandbox provisioner stored them as string keys — "1", "2", "3". When the orchestrator did sandbox_map.get(1), it got None. No sandbox meant no middleware. The agents were literally talking to thin air — making LLM calls with no tools attached, like a carpenter showing up to a job site with no hammer.

The fix was one line. The lesson was bigger: LLMs don't respect type contracts. They'll return an integer when you expect a string, a list when you expect a dict, and a confident hallucination when they have nothing to say. Every boundary between AI-generated data and deterministic systems needs defensive normalization.
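In hindsight, the guard is trivial. A minimal sketch of the idea, with illustrative names rather than the project's actual code: canonicalize IDs at the boundary and fail loudly when a lookup misses, instead of handing an agent an empty toolset.

```python
def normalize_task_id(task_id) -> str:
    """Planners may emit 1 or "1"; canonicalize before any dict lookup."""
    return str(task_id).strip()

def get_sandbox(sandbox_map: dict, task_id):
    key = normalize_task_id(task_id)
    sandbox = sandbox_map.get(key)
    if sandbox is None:
        # Fail loudly rather than silently running an agent with no tools
        raise LookupError(f"no sandbox provisioned for task {key!r}")
    return sandbox
```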
This would not be the last time I learned that lesson.

The Seven-Minute Merge

Once agents actually ran and produced code, a new problem emerged. I watched the logs on a run that took twenty-one minutes total. Four agents finished their work in about twelve minutes. The remaining seven minutes were the LLM integrator merging four branches — eight to thirty tool calls per merge, using the premium model, to do what git merge --no-edit does in five seconds.

I was paying for a premium LLM to run git diff, read both sides of every file, and write a merged version. For branches that merged cleanly. With zero conflicts.

The fix was obvious in retrospect: try git merge first. If it succeeds — great, five seconds, done. Only call the LLM integrator when there are actual conflicts to resolve. Merge time dropped from seven minutes to under thirty seconds. I felt a little silly for not doing this from the start.

When Agents Build Different Apps

The merge speedup felt like a win until I looked at what was actually being merged. The storage agent had built a JSON-file backend. The CLI agent had written its commands against SQLite. Both modules were well-written. They compiled individually. Together, nothing worked — the CLI tried to import a Storage class that didn't exist in the JSON backend.

This was the moment I realized the agents weren't really a team. They were strangers who happened to be assigned to the same project, each interpreting the goal in their own way.

The fix was the single most impactful change in the entire project: contract-first planning. Instead of just decomposing the goal into tasks, the planner now generates API contracts — function signatures, class shapes, data model definitions — and injects them into every agent's prompt. "Here's what the Storage class looks like. Here's what Task looks like. Build against these interfaces."

Before contracts, three of six branches conflicted and the quality score was 28. After contracts, zero of four branches conflicted and the score hit 68.

It turns out the plan isn't just a plan. In a multi-agent system, the plan is the product. A brilliant plan with mediocre agents produces working code. A vague plan with brilliant agents produces beautiful components that don't fit together.

The Agent Who Lied

PR #4 came back with what looked like a solid result. The test writer reported three test files with detailed coverage summaries. The JSON output was meticulous — file names, function names, which modules each test covered.

Then I checked tool_call_count: 0.

The test writer hadn't written a single file. It hadn't even opened a file. It received zero tools — because the skill loader normalized test_writer to underscores while the tool registry used test-writer with hyphens. The lookup failed silently. The agent got no tools, couldn't do any work, and did what LLMs do when they can't fulfill a request but feel pressure to answer: it made something up. Confidently.

This happened in three of our first four evaluation runs. I called them "phantom agents" — they showed up to work, clocked in, filed a report, and went home without lifting a finger.

The fix had two parts. First, obviously, fix the hyphen/underscore normalization. Second, and more importantly: add a zero-tool-call guard. If an agent that should be writing files reports success with zero tool calls, don't believe it. Nudge it and retry.
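As a sketch of what that guard can look like (illustrative types and names, not the project's code):

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class AgentResult:
    report: str
    tool_call_count: int  # taken from the agent's tool-call telemetry

def guard_phantom_agent(run_agent: Callable[[str], AgentResult],
                        prompt: str, max_retries: int = 1) -> AgentResult:
    """Reject 'success' reports from agents that never invoked a tool."""
    result = run_agent(prompt)
    for _ in range(max_retries):
        if result.tool_call_count > 0:
            return result
        # Phantom agent: confident report, zero actions. Nudge and retry.
        result = run_agent(prompt + "\nYour previous response made no tool calls. "
                                    "Use your tools to actually write the files.")
    if result.tool_call_count == 0:
        raise RuntimeError("agent reported success without making any tool calls")
    return result
```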
The deeper lesson stuck with me: agents will never tell you they failed. They'll report success with elaborate detail. You have to verify what they actually did, not what they said they did.

The Integrator Who Took Shortcuts

Even with contracts preventing mismatched architectures, merge conflicts still happened when multiple agents touched the same files. The LLM integrator's job was to resolve these conflicts intelligently, preserving logic from both sides.

Instead, facing a gnarly conflict in models.py, it ran:

```
git restore --source=HEAD -- models.py
```

One command. Silently destroyed one agent's entire implementation — the Task class, the constants, the schema version — gone. The integrator committed the lobotomized file and reported "merge resolved successfully."

The downstream damage was immediate. storage.py imported symbols that no longer existed. The judge scored 43 out of 100. The fixer agent had to spend five minutes reconstructing the data model from scratch.

But that wasn't even the worst shortcut. On other runs, the integrator replaced conflicting code with:

```python
def add_task(desc, priority=0):
    pass  # TODO: implement storage layer
```

When an LLM is asked to resolve a hard conflict, it'll sometimes pick the easiest valid output — delete everything and write a placeholder. Technically valid Python. Functionally a disaster.

Fixing this required explicit prompt guardrails:

- Never run git restore --source=HEAD
- Never replace implementations with pass # TODO placeholders
- When two implementations conflict, keep the more complete one
- After resolving each file, read it back and verify the expected symbols still exist

The lesson: LLMs optimize for the path of least resistance. Under pressure, "valid" and "useful" diverge sharply.

Demolishing the House for a Leaky Faucet

When the judge scored a run below 70, the original retry strategy was: start over. Re-plan. Re-provision five sandboxes. Re-run all agents. Re-merge. Re-judge. Seven minutes and a non-trivial cloud bill, all because one agent missed an import statement.

This was absurd. Most failures weren't catastrophic — they were close. A missing model field. A broken import. An unhandled error case. The code was 90% right. Starting from scratch was like tearing down a house because the bathroom faucet leaks.

So I built the fixer agent: a premium-tier model that receives the judge's specific complaints and makes surgical edits directly on the integrator's branch. No new sandboxes, no new branches, no merge step. The first time it ran, the score jumped from 43 to 89.5. Three minutes instead of seven. And it solved the problem that actually existed, rather than hoping a second roll of the dice would land better.

Of course, the fixer's first implementation had its own bug — it ran in a new sandbox, created a new branch, and occasionally conflicted with the code it was trying to fix. The fix to the fixer: just edit in place on the integrator's existing sandbox. No branch, no merge, no drama.
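Sketching the recovery policy that emerged from these two stories (the two callbacks are placeholders for the LLM integrator and fixer described above, not the project's actual code):

```python
import subprocess

def merge_agent_branch(branch: str, resolve_with_llm) -> None:
    """Deterministic first: only pay for the LLM integrator on real conflicts."""
    result = subprocess.run(["git", "merge", "--no-edit", branch],
                            capture_output=True, text=True)
    if result.returncode != 0:
        # Leave the conflicted working tree in place for the LLM to resolve
        resolve_with_llm(branch)

def quality_gate(score: float, run_fixer, threshold: float = 70.0) -> None:
    """Cheap recovery: patch the integrator's branch instead of re-running everything."""
    if score < threshold:
        run_fixer()  # surgical edits in place; no new sandboxes or branches
```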
How Others Parallelize (and Why We Went Distributed)

Most multi-agent coding frameworks today parallelize by spawning agents as local processes on a single developer machine. Depending on the framework, there's typically a lead agent or orchestrator that breaks the task down into subtasks, spins up new agents to handle each piece, and combines their work when they finish — often through parallel tmux sessions or subprocess pools sharing a local filesystem. It's simple, it's fast to set up, and for many tasks it works.

But local parallelization hits a ceiling. All agents share one machine's CPU, memory, and disk I/O. Five agents each running npm install or cargo build compete for the same 32 GB of RAM. There's no true filesystem isolation — two agents can clobber the same file if the orchestrator doesn't carefully sequence writes. Recovery from a crash means restarting the entire local process tree. And scaling from 3 agents to 10 means buying a bigger machine.

Our swarm takes a different approach: fully distributed execution. Each agent runs in its own Docker container with its own filesystem, git clone, and compute allocation — provisioned on AKS, ACA, or any container host. Four agents get four independent resource pools. If one container dies, DTS replays that agent from its last checkpoint in a fresh container without affecting the others. Git branch-per-agent isolation means zero filesystem conflicts by design.

The trade-off is overhead: container provisioning, network latency, and the merge step add wall-clock time that a local tmux setup avoids. On a small two-agent task, local parallelization on a fast laptop probably wins. But for tasks with 4+ agents doing real work — cloning repos, installing dependencies, running builds and tests — independent resource pools and crash isolation matter. Our benchmarks on a 4-agent helpdesk system showed the swarm completing in ~8 minutes with zero resource contention, producing 1,029 lines across 14 files with 4 clean branch merges.

The Scorecard

After all of this, did the swarm actually beat a single agent?

I ran head-to-head benchmarks: same prompt, same model (GPT-5-nano), solo agent vs. swarm, scored by a Sonnet 4.6 judge on a four-criterion rubric. Two tasks — a simple URL shortener (Render.com's benchmark prompt) and a complex helpdesk ticket system. All runs are public — you can review every line of generated code:

| Task | Solo Agent PR | Swarm PR |
|---|---|---|
| URL Shortener | PR #1 | PR #2 |
| Helpdesk System | PR #3 | PR #4 |

| | URL Shortener (Simple) | Helpdesk System (Complex) |
|---|---|---|
| Quality (rubric, /5) | Solo 1.9 → Swarm 2.5 (+32%) | Solo 2.3 → Swarm 2.95 (+28%) |
| Speed | Solo 2.5 min → Swarm 5.5 min (2.2×) | Solo 1.75 min → Swarm ~8 min (~4.5×) |
| Tokens | 7.7K → 30K (3.9×) | 11K → 39K (3.4×) |

The pattern held across both tasks: a 28–32% quality improvement, at the cost of 2–4× more time and ~3.5× more tokens. On the complex task, the quality gains broadened — the swarm produced better code organization (3/5 vs 2/5), actually wrote tests (code:test ratio 0 → 0.15), and generated 5× more files with cleaner decomposition. On the simple task, the gap came entirely from security practices: environment variables, parameterized queries, and proper .gitignore rules that the solo agent skipped entirely.

Industry benchmarks from AIMultiple and Render.com show single CLI agents averaging 10–15 minutes on comparable full-stack tasks. Our swarm runs in 5–12 minutes depending on parallelizability — but the real win is quality, not speed. Specialized agents with a narrow, well-defined scope tend to be more thorough: the solo agent skipped tests and security practices entirely, while the swarm's dedicated agents actually addressed them.
Two out of three pillars — with a caveat the size of your task. On small, tightly coupled problems, just use one good agent. On larger, parallelizable work with three or more independent modules? The swarm earns its keep.

What I Actually Learned

The Rules That Stuck

- Contract-first planning. Define interfaces before writing implementations. The plan isn't just a guide — it's the product.
- Deterministic before LLM. Try git merge before calling the LLM integrator. Run ruff check before asking an agent to debug. Use code when you can; use AI when you must.
- Validate actions, not claims. An agent that reports "merge resolved successfully" may have deleted everything. Check tool call counts. Read the actual diff. Trust nothing.
- Cheap recovery over expensive retries. A fixer agent that patches one file beats re-running five agents from scratch. The cost of failure should be proportional to the failure.
- Not every problem needs a swarm. If the task fits in one agent's context window, adding four more just adds overhead. The sweet spot is 3+ genuinely independent modules.

The Bigger Picture

The biggest surprise? Building a multi-agent AI system is more about software engineering than AI engineering. The hard problems weren't prompt design or model selection — they were contracts between components, isolation of concerns, idempotent operations, observability, and recovery strategies. Principles that have been around since the 1970s.

The agents themselves are almost interchangeable. Swap GPT for Claude, change the temperature, fine-tune the system prompt — it barely matters if your orchestration is broken. What matters is how you decompose work, how you share context, how you merge results, and how you recover from failure. Get the engineering right, and the AI just works. Get it wrong, and no model on earth will save you.

By the Numbers

The codebase is ~7,400 lines of Python across 230 tests and 141 commits. Over 10+ evaluation runs, the swarm processed a combined ~200K+ tokens, merged 20+ branches, and resolved conflicts ranging from trivial (package.json version bumps) to gnarly (overlapping data models). It's built on Azure Durable Task Scheduler, Microsoft Agent Framework, and containerized sandboxes that run anywhere Docker does — AKS, ACA, or a plain docker run on your laptop.

And somewhere in those 141 commits is a one-line fix for an integer-vs-string bug that took me an embarrassingly long time to find.

References

- Azure Durable Task Scheduler — Deterministic workflow orchestration with replay, checkpointing, and fan-out/fan-in patterns.
- Microsoft Agent Framework (MAF) — Python agent framework for tool-calling, middleware, and structured output.
- Azure Kubernetes Service (AKS) — Managed Kubernetes for running containerized agent workloads at scale.
- Azure Container Apps (ACA) — Serverless container platform for simpler deployments.
- Azure OpenAI Service — Hosts the GPT models used by planner, coder, and judge agents.

Built with Azure DTS, Microsoft Agent Framework, and containerized sandboxes (Docker, AKS, ACA — your choice). And a lot of grep through log files.