Hardening OpenClaw on AKS: Mitigating Container Escapes with Kata microVM Isolation
What is OpenClaw, and what security challenges does it pose with container escapes?

OpenClaw is an open-source autonomous AI agent designed for power users and developers to automate tasks, such as managing emails, files, and scheduling via chat apps like WhatsApp or Telegram. While OpenClaw functions as a powerful autonomous assistant, its runtime model creates a massive security paradox: to be truly useful, the agent requires broad permissions to your filesystem and APIs, yet this "God Mode" access often lacks the rigorous containerized isolation typical of enterprise workloads. Because many users run the framework natively rather than within a hardened sandbox, the primary security challenge is that a single malicious "Skill" or an indirect prompt injection can escalate into full system compromise. This structural vulnerability, exemplified by high-profile exploits like CVE-2026-25253, transforms the agent from a helpful tool into a high-risk entry point for lateral movement and data exfiltration within a private network.

Why container escapes matter in OpenClaw-style deployments: because containers share the host kernel, a successful container escape turns a single compromised container into a host compromise (or at least a compromise of other co-located workloads). This is especially important when OpenClaw runs code from many tenants, many teams, or varying trust levels on the same worker nodes. That soft isolation is often permeable due to the following structural and configuration-based weaknesses:

- Shared-kernel attack surface: the container boundary is not a hypervisor boundary. Kernel vulnerabilities (e.g., privilege escalation bugs) can allow a process in a container to gain host-level privileges.
- Excessive privileges / misconfiguration: running with --privileged, broad Linux capabilities, hostPath mounts, access to the Docker socket, or device passthrough (e.g., /dev/kvm, /dev/fuse) can provide direct paths to host control.
- Filesystem and namespace boundary breaks: mount namespace confusion, writable host mounts, or mistakes in chroot/pivot_root handling can expose host files and credentials.
- Supply-chain and image risk: a malicious image or dependency can execute within the container and then attempt escalation/escape.
- Blast radius: once the host is compromised, attackers can access node-level secrets (service account tokens, registry creds), tamper with the runtime, sniff traffic, or pivot to other containers and the broader cluster.

In short, OpenClaw’s security challenge is not that containers are inherently insecure, but that the isolation boundary is thinner than a VM boundary. When the threat model includes adversarial code execution, a “container-only” isolation strategy often requires additional hardening or a stronger sandbox.

What are MicroVMs and Kata Containers, and how do they help mitigate OpenClaw container-escape risks?

MicroVMs are lightweight virtual machines optimized for running short-lived or container-like workloads with much lower overhead than traditional VMs. They use hardware virtualization (via a hypervisor such as KVM) but keep the device model and boot path minimal, reducing startup time and the overall attack surface compared to a full general-purpose VM.

Kata Containers is an “OCI-compatible containers in a VM” approach: it runs each container (or pod sandbox) inside a dedicated microVM by default (implementation varies by runtime/config). To the orchestration layer (e.g., Kubernetes), it still looks like a container runtime, but isolation is provided by a hypervisor boundary rather than only namespaces/cgroups.

- Stronger isolation boundary: a container escape that relies on Linux kernel exploitation is far less likely to directly compromise the host, because the workload’s “host” kernel is typically the guest kernel inside the microVM.
- Reduced blast radius: compromise is contained to the microVM/pod sandbox; lateral movement to other workloads on the same node becomes significantly harder.
- Smaller and more controllable attack surface: minimal device models, tighter default privileges, and fewer host mounts/devices exposed to the workload.
- Defense-in-depth with container controls: you still can (and should) apply seccomp, capability dropping, read-only root filesystems, and LSMs inside the guest, but the hypervisor boundary becomes an additional layer.
- Better fit for hostile multi-tenant workloads: when OpenClaw executes third-party jobs/plugins, Kata-style sandboxing aligns better with an adversarial threat model.

Solution overview

Figure 1 illustrates a Kubernetes-based sandboxing architecture for running OpenClaw workloads with stronger isolation. The design keeps the developer experience and packaging model of containers (OCI images, Kubernetes scheduling) while ensuring that untrusted agent code executes inside a microVM boundary using Kata Containers. This reduces the likelihood that a container escape can compromise the underlying node or other co-located workloads.

Key components:

1. Application gateway for HTTPS traffic to the backend
2. Kubernetes as the orchestration, scheduling, and policy enforcement plane
3. A container runtime (e.g., containerd) configured with a Kata Containers runtime class
4. KVM-backed microVMs that provide the isolation boundary for each untrusted workload
5. Azure Files for persistent storage, which allows scaling of OpenClaw

Figure 1: Solution architecture diagram

End-to-end flow:

1. Traffic Entry via Application Gateway: Incoming user requests (e.g., from WhatsApp or Discord) first hit the Azure Application Gateway.
2. Orchestration in AKS: The traffic is routed into an Azure Kubernetes Service (AKS) cluster, which manages the lifecycle of the OpenClaw agent and its associated "Skills."
3. Hardened Execution via Kata Containers: Instead of running in standard shared-kernel containers, the OpenClaw agent runs inside Kata Containers. This provides a dedicated lightweight VM for the agent, creating a hardware-level isolation boundary that prevents "container escapes" from compromising the host.
4. Stateful Storage in Azure Files: The agent interacts with Azure Files to read and write persistent data, such as conversation history, configuration files, and downloaded assets, ensuring data remains available even if the container is restarted.

Security posture: by shifting isolation from “shared-kernel containers” to “containers inside microVMs,” the architecture limits the blast radius of kernel-level exploits and common escape paths. Even if an attacker achieves code execution within an OpenClaw container, they must additionally break the microVM/hypervisor boundary to affect the node or neighboring workloads, providing a strong defense-in-depth improvement over standard containers alone.

Implement the solution

This section describes how to deploy the solution architecture. In this post, you’ll perform the following tasks:

1. Create a Kata VM-isolated AKS node pool
2. Mount NFS persistent storage
3. Create the application ConfigMap
4. Deploy the OpenClaw gateway
5. Expose the gateway internally
6. Set up TLS termination
7. Route external traffic through the Azure Application Gateway for Containers

Ensure that you have the following prerequisites deployed before moving to the next section:

- An AKS cluster provisioned in Azure
- An Azure Files NFS share with private link enabled
- An Application Gateway for Containers managed by the ALB Controller
- kubectl configured and pointing to the cluster
- Azure CLI (az) authenticated with the correct subscription

Initialise environment variables

In your Linux terminal, export these variables with your own values. They will be used in later commands.
```shell
export cluster_name=<CLUSTER_NAME>
export resource_group=<RESOURCE_GROUP>
```

Create the AKS Node Pool with Kata VM Isolation

The OpenClaw gateway pods require Kata VM isolation (runtimeClassName: kata-vm-isolation). You must create a dedicated AKS node pool that supports this runtime before deploying any workloads. Use the Azure CLI to add a node pool with the Kata VM isolation workload runtime to your existing AKS cluster:

```shell
az aks nodepool add \
  --resource-group $resource_group \
  --cluster-name $cluster_name \
  --name katanp \
  --node-count 2 \
  --node-vm-size Standard_D4s_v3 \
  --os-sku AzureLinux \
  --workload-runtime KataMshvVmIsolation \
  --labels agentpool=katanp
```

**Important:** The `--workload-runtime KataMshvVmIsolation` flag enables the `kata-vm-isolation` runtime class on the node pool. The VM size must support nested virtualization (D-series v3/v5, E-series v3/v5, etc.).

Create NFS Persistent Volume

The deployment uses an Azure Files NFS share for persistent workspace storage. The PersistentVolume must exist before the PVC can bind to it. Replace volumeHandle and volumeAttributes with your own Azure Files values.

```shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolume
metadata:
  name: openclaw-nfs-pv
spec:
  capacity:
    storage: 100Gi
  accessModes:
    - ReadWriteMany
  persistentVolumeReclaimPolicy: Retain
  mountOptions:
    - sec=sys
    - noresvport
    - actimeo=30
  csi:
    driver: file.csi.azure.com
    volumeHandle: <resource-group>#<storage-account>#<share-name>
    volumeAttributes:
      resourceGroup: <resource-group>
      shareName: <share-name>
      protocol: nfs
      server: <storage-account>.privatelink.file.core.windows.net
EOF
```

Verify that the persistent volume is created.

```shell
kubectl get pv openclaw-nfs-pv
```

Figure 2: Persistent volume

Create the NFS PersistentVolumeClaim

The PVC binds to the PV created above. The deployment references this PVC by name (`pvc-openclaw-nfs`).
```shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  # The name of the PVC
  name: pvc-openclaw-nfs
spec:
  accessModes:
    - ReadWriteMany
  resources:
    requests:
      # The real storage capacity in the claim
      storage: 50Gi
  # Must match the storage class name of the PV (empty for a statically created PV)
  storageClassName: ""
  volumeName: openclaw-nfs-pv
EOF
```

Verify that the persistent volume claim is created successfully. The status should show Bound.

Figure 3: Persistent Volume Claim

Create the ConfigMap

The ConfigMap provides the openclaw.json configuration file to the gateway pods. It configures the allowed CORS origins for the control UI and the gateway token. Replace the allowed origins with your own ALB frontend URL. Note that the heredoc below expands `${AUTH_TOKEN}` from your shell at apply time, so the `AUTH_TOKEN` variable must already be set (it is generated in the next step). DO NOT hardcode your token here: keep it as a variable rather than storing it in plain text so that, if attackers gain access to this file, they cannot see the OpenClaw gateway auth token.

```shell
cat <<EOF | kubectl apply -f -
apiVersion: v1
kind: ConfigMap
metadata:
  name: openclaw-config
data:
  openclaw.json: |
    {
      "gateway": {
        "auth": {
          "token": "${AUTH_TOKEN}"
        },
        "controlUi": {
          "allowedOrigins": [
            "https://<YOUR ALB FRONTEND URL>.alb.azure.com"
          ]
        }
      }
    }
EOF
```

Create the Auth Token Secret

The OpenClaw gateway requires an authentication token to secure access. The deployment references a Kubernetes Secret named openclaw-auth-token and injects it into the container as the AUTH_TOKEN environment variable via secretKeyRef. Generate a random token (or use an existing one) and create the Kubernetes Secret.

```shell
# Generate a random 32-byte hex token
AUTH_TOKEN=$(openssl rand -hex 32)
echo "$AUTH_TOKEN" # save this — you'll need it to authenticate with the gateway

kubectl create secret generic openclaw-auth-token \
  --from-literal=token="$AUTH_TOKEN"
```

If the secret does not exist when the deployment is applied, pods will fail with `CreateContainerConfigError`.
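Before creating the secret, it can be worth sanity-checking the token you generated. The sketch below is my addition, not part of the walkthrough: `openssl rand -hex 32` emits 32 random bytes as 64 hex characters, and the check confirms that shape before the value is stored.

```shell
# Generate the token the same way the walkthrough does,
# then verify it is exactly 64 hex characters before using it.
AUTH_TOKEN=$(openssl rand -hex 32)

if [ "${#AUTH_TOKEN}" -eq 64 ]; then
  echo "token looks valid"
else
  echo "unexpected token length: ${#AUTH_TOKEN}" >&2
fi
```

Once the check passes, create the secret with `kubectl create secret generic openclaw-auth-token --from-literal=token="$AUTH_TOKEN"` exactly as shown above.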
Deploy the OpenClaw Gateway

This is the main application deployment. It depends on all previous steps:

- Kata node pool (pods require runtimeClassName: kata-vm-isolation and nodeSelector: agentpool=katanp)
- PVC (pvc-openclaw-nfs for persistent workspace data)
- ConfigMap (openclaw-config for openclaw.json)

Key details:

- Runs 2 replicas with a rolling update strategy
- Uses an init container to copy the config file to a writable volume
- Exposes port 18789
- Includes liveness and readiness probes on /health
- Resource requests: 500m CPU, 2Gi memory
- Resource limits: 1 CPU, 4Gi memory

```shell
cat <<EOF | kubectl apply -f -
apiVersion: apps/v1
kind: Deployment
metadata:
  name: openclaw-gateway
spec:
  replicas: 2
  selector:
    matchLabels:
      app: openclaw-gateway
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 1
      maxSurge: 1
  template:
    metadata:
      labels:
        app: openclaw-gateway
    spec:
      runtimeClassName: kata-vm-isolation
      nodeSelector:
        agentpool: katanp
      securityContext:
        fsGroup: 1000
      initContainers:
        - name: copy-openclaw-config
          image: alpine/openclaw:latest
          env:
            - name: HOME
              value: /writable
          command:
            - sh
            - -c
            - |
              cp /config/openclaw.json /writable/openclaw.json \
                && chown 1000:1000 /writable/openclaw.json \
                && echo "--- Config file contents ---" \
                && cat /writable/openclaw.json
          volumeMounts:
            - name: openclaw-config-volume
              mountPath: /config
            - name: openclaw-writable
              mountPath: /writable
      containers:
        - name: gateway
          image: alpine/openclaw:latest
          ports:
            - containerPort: 18789
          env:
            - name: NODE_OPTIONS
              value: "--max-old-space-size=4096"
            - name: AUTH_TOKEN
              valueFrom:
                secretKeyRef:
                  name: openclaw-auth-token
                  key: token
          # Start gateway the way the tutorial indicates
          command: ["openclaw", "gateway"]
          args: ["run", "--allow-unconfigured", "--bind", "lan"]
          volumeMounts:
            - name: openclaw-writable
              mountPath: /home/node/.openclaw
            - name: openclaw-data
              mountPath: /home/node/workspace
              subPath: workspace
          resources:
            requests:
              cpu: "500m"
              memory: "2Gi"
            limits:
              cpu: "1000m"
              memory: "4Gi"
          livenessProbe:
            httpGet:
              path: /health
              port: 18789
            initialDelaySeconds: 60
            periodSeconds: 15
            failureThreshold: 3
          readinessProbe:
            httpGet:
              path: /health
              port: 18789
            initialDelaySeconds: 10
            periodSeconds: 5
      volumes:
        - name: openclaw-data
          persistentVolumeClaim:
            claimName: pvc-openclaw-nfs
        - name: openclaw-config-volume
          configMap:
            name: openclaw-config
            items:
              - key: openclaw.json
                path: openclaw.json
        - name: openclaw-writable
          emptyDir: {}
---
apiVersion: v1
kind: Service
metadata:
  name: openclaw-gateway-service
spec:
  type: ClusterIP
  selector:
    app: openclaw-gateway
  ports:
    - protocol: TCP
      port: 18789
      targetPort: 18789
EOF
```

Verify that the deployment succeeds. Wait until the deployment shows `READY 2/2` and all pods show `Running`.

```shell
kubectl get deployment openclaw-gateway
kubectl get pods -l app=openclaw-gateway
```

Figure 4: OpenClaw deployment

Create the TLS secret (for HTTPS)

The Application Gateway for Containers references a TLS secret (gateway-tls-secret) for HTTPS termination. This blog post uses a self-signed certificate; in a production environment, use a certificate signed by a certificate authority. Replace `<path-to-tls-cert>` and `<path-to-tls-key>` with paths to your TLS certificate and private key files.

```shell
kubectl create secret tls gateway-tls-secret \
  --cert=<path-to-tls-cert> \
  --key=<path-to-tls-key>
```

Create the Gateway

The Gateway resource defines the HTTPS listener on the Application Gateway for Containers (managed by the ALB Controller). Update the `alb.network.azure.com/application-gateway-id` annotation to match your ALB traffic controller resource ID. You will also need to reference the gateway-tls-secret to enable HTTPS.
```shell
cat <<EOF | kubectl apply -f -
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: https
  annotations:
    alb.network.azure.com/application-gateway-id: /subscriptions/<subscription id>/resourceGroups/mc_openclaw_openclaw-cluster_centralus/providers/Microsoft.ServiceNetworking/trafficControllers/<alb id>
    alb.networking.azure.io/alb-namespace: default
    alb.networking.azure.io/alb-name: alb-openclaw
spec:
  gatewayClassName: azure-alb-external
  listeners:
    - name: https
      protocol: HTTPS
      port: 443
      allowedRoutes:
        namespaces:
          from: All
      tls:
        mode: Terminate
        certificateRefs:
          - kind: Secret
            group: ""
            name: gateway-tls-secret
EOF
```

```shell
kubectl get gateway https
```

Wait until the Gateway shows a `Programmed=True` condition.

Create the HTTPRoute

The HTTPRoute connects the Gateway to the backend Service. It routes all traffic (`/` prefix) from the HTTPS Gateway to `openclaw-gateway-service` on port 18789.

```shell
cat <<EOF | kubectl apply -f -
kind: HTTPRoute
apiVersion: gateway.networking.k8s.io/v1
metadata:
  name: http-route
spec:
  parentRefs:
    - name: https
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: openclaw-gateway-service
          kind: Service
          namespace: default
          port: 18789
EOF
```

Test OpenClaw application

Get the external endpoint.

```shell
kubectl get gateway https -o jsonpath='{.status.addresses[0].value}'
```

Paste the endpoint into your browser to reach the OpenClaw application. If you are using a self-signed certificate, you will see a “Not secure” warning; click Advanced to proceed. In a production environment with a certificate signed by a certificate authority, you should not see that warning.

Figure 5: OpenClaw Authentication

Paste in your Gateway Token (the auth token created earlier). You will notice that even though the token is valid, it throws back a “pairing required” error.
Pairing is required in OpenClaw whenever a new device, browser profile, or CLI client attempts to connect to the gateway for the first time, ensuring only authorized clients can control the AI agent. Approve the pending device from each gateway pod:

```shell
POD=$(kubectl get pod -l app=openclaw-gateway -o jsonpath='{.items[0].metadata.name}')
POD2=$(kubectl get pod -l app=openclaw-gateway -o jsonpath='{.items[1].metadata.name}')
TOKEN=$(kubectl get secret openclaw-auth-token -o jsonpath='{.data.token}' | base64 -d)

kubectl exec "$POD" -c gateway -- openclaw devices approve --latest --token "$TOKEN"
kubectl exec "$POD2" -c gateway -- openclaw devices approve --latest --token "$TOKEN"
```

You should see a message like the one in the image below. You can now open the OpenClaw application and start using it.

Figure 6: OpenClaw pairing success message

Figure 7: OpenClaw Application

You have successfully deployed OpenClaw within a microVM hosted on Azure Kubernetes Service.

Test microVM kernel isolation

From within the OpenClaw pod, try to read the host’s root filesystem via /proc/1/root. You should see an error like: ls: cannot access '/proc/1/root/etc/kubernetes': No such file or directory.

```shell
kubectl exec -it "$POD" -c gateway -- ls /proc/1/root/etc/kubernetes 2>&1
```

In a standard container deployment, PID 1 inside the container is still running on the host kernel, so traversing /proc/1/root/ can expose the host's root filesystem — including sensitive paths like /etc/kubernetes (which holds kubelet credentials). With Kata VM isolation, the picture is completely different. When we run ls /proc/1/root/etc/kubernetes from inside the OpenClaw pod, it returns "No such file or directory". This is because PID 1 is no longer a process on the host — it's running inside a dedicated guest VM with its own kernel. The /proc/1/root/ path leads to the microVM's root filesystem, not the host's, and that microVM has no knowledge of the node's Kubernetes configuration or machine identity. The host is simply invisible.
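A complementary check, which is my suggestion rather than part of the original walkthrough, is to compare kernel versions. In a standard runc pod, the container reports the node's kernel because they share it; under Kata VM isolation, the pod reports the guest kernel instead. The local building block is trivial:

```shell
# Both commands report the kernel release of the current execution context.
# Inside a Kata pod this is the guest kernel; on the node it is the host kernel.
uname -r
cat /proc/sys/kernel/osrelease
```

To apply it to the cluster, run `kubectl exec "$POD" -c gateway -- uname -r` and compare against `kubectl get node -o jsonpath='{.items[0].status.nodeInfo.kernelVersion}'`; a mismatch confirms the workload is not executing against the host kernel.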
This is the core security guarantee of Kata Containers: even if an attacker achieves a full container escape, there is nothing to escape to — they land inside a lightweight VM boundary, not on the shared host, making lateral movement to other pods or the node itself dramatically harder.

Conclusion

This post discussed why running OpenClaw workloads in standard containers can be risky when the workload includes untrusted or semi-trusted code: containers share the host Linux kernel, so a single container escape or privileged misconfiguration can expand into node-level compromise and a much larger blast radius. To address this, we introduced microVM-based sandboxing with Kata Containers on Azure Kubernetes Service (AKS) and walked through an implementation approach (a node pool with Kata VM isolation, storage, gateway deployment, and ingress). Finally, we validated the isolation properties by demonstrating that common host-visibility techniques (for example, probing /proc/1/root) no longer reveal host paths when the workload runs inside a microVM.

- Separate kernel boundary: Kata runs the container inside a microVM, so the workload executes against a guest kernel rather than the shared host kernel—kernel exploits and escape attempts don’t directly translate into host control.
- Host filesystem is no longer “in scope”: paths that often leak host context in standard containers (for example, traversals via /proc) resolve inside the microVM’s filesystem, not the node’s root filesystem.
- Reduced blast radius per workload: each sandbox has its own VM boundary, making it much harder to pivot from one compromised workload to other pods/containers on the same node.
- Stronger default device and privilege separation: the hypervisor boundary and minimal virtual device model limit exposure to host devices and privileged interfaces that commonly enable breakouts.
- Defense-in-depth still applies: you can keep container hardening (seccomp, capability dropping, read-only filesystems, restricted mounts) while gaining an additional isolation layer that is independent of Linux namespaces/cgroups.

Overall, this post helps you deploy OpenClaw on AKS with Kata microVM isolation so you can run agent workloads with a significantly reduced risk of host-kernel compromise from container escape techniques.

Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent — your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

Simplifying gMSA for Windows Containers on AKS: Open-Source Tooling Now Available
We’re excited to announce that the Windows Containers AKS gMSA tooling is now publicly available on our GitHub repo (Microsoft/Windows-Containers-AKS-gMSA): Windows-Containers-AKS-gMSA. This open-source repository provides tooling to simplify configuring Group Managed Service Accounts (gMSA) for Windows containers running on Azure Kubernetes Service (AKS)—making it easier to containerize and run Active Directory–dependent applications in Kubernetes.

Many enterprises rely on Windows applications that integrate with Active Directory (AD) for authentication and authorization. As these workloads move to AKS using Windows containers, it’s critical that they continue to securely support AD‑based authentication. This tooling helps organizations modernize to containers while maintaining trusted identity and authorization workflows built on Active Directory.

Who this is for

This tooling is intended for:

- Teams modernizing existing AD-dependent Windows applications
- Customers running Windows containers on AKS who require Kerberos or Integrated Windows Authentication
- Platform and infrastructure teams looking to standardize gMSA setup across environments
- Anyone evaluating whether gMSA is the right fit for their Windows container scenarios

If you’re running workloads that depend on Active Directory and want to bring them to AKS with minimal refactoring, this repository can serve as a starting point for validating gMSA in your environment.

Why gMSA on AKS matters

Windows containers are a natural fit for modernizing existing IIS, .NET Framework, and other AD-integrated applications with minimal code changes. However, containers themselves can’t be domain joined, which historically made AD authentication challenging in containerized environments. With gMSA support on AKS, Windows containers can securely authenticate to Active Directory without requiring domain-joined nodes, instead relying on the AKS host to perform authentication on the application’s behalf.
This enables:

- Secure AD authentication for Windows containers
- Easier cluster scaling and upgrades
- Reduced operational overhead compared to domain-joined node models, with no changes to the AD infrastructure required

While platform support exists, configuring gMSA on AKS still involves multiple moving parts—including AKS, Active Directory, Azure Key Vault, and credential specifications. This tooling is intended to help streamline that setup by reducing manual configuration across these components.

What’s in the repository

The Windows-Containers-AKS-gMSA repository provides a PowerShell module and supporting scripts designed to simplify the end-to-end setup of gMSA for Windows containers on AKS. Key highlights include:

- A PowerShell module to help configure an AKS cluster for gMSA usage
- Automation to reduce manual setup across Azure, AD, and AKS components
- Documentation and troubleshooting guidance for prerequisites and common pitfalls
- A trial/validation setup to help stand up a test environment for gMSA on AKS

The goal is to lower the barrier to entry and make it easier for teams to experiment with—and ultimately adopt—gMSA for their Windows container workloads.

Getting started

To get started, visit the GitHub repository and review the README and documentation: https://github.com/microsoft/Windows-Containers-AKS-gMSA

You’ll find:

- Environment and prerequisite requirements
- Instructions for importing and using the PowerShell module
- Guidance for validating your setup in a non-production environment

For the official documentation, please visit Use gMSA on Azure Kubernetes Service in Windows containers | Microsoft Learn.

Open source and community feedback

By making this repository public, we’re inviting the community to explore, experiment, and provide feedback. While this tooling is designed to simplify setup, it’s important to review the documentation carefully and validate configurations in test environments before production use.
We welcome issues and feedback, suggestions for improvements, and any contributions that help improve reliability, clarity, or usability.

What’s next

This release is part of our continued effort to improve the experience of running Windows containers on AKS—particularly for customers modernizing existing Windows Server workloads that depend on Active Directory. We look forward to hearing how you’re using gMSA on AKS and where we can continue to improve the setup and deployment experience.

NFS Permission Denied in Azure App Service on Linux: What It Means and What to Do
If your Azure App Service on Linux uses an Azure Files NFS share, you may sometimes see errors like Permission denied or Errno 13 when your app tries to write to the mounted path. Azure Files supports NFS for Linux and Unix workloads, and NFS uses Unix-style numeric ownership and permissions (UID/GID), which can behave differently from SMB-based file sharing.

Overview

This post is for customers using Azure App Service on Linux together with an Azure Files NFS share for persistent storage. Azure Files NFS is designed for Linux and Unix-style workloads, supports POSIX-style permissions, and does not support Windows clients or NFS ACLs. In this setup, a write failure does not always mean the file is corrupted. Sometimes it means the file ownership seen by the running app no longer matches the identity context currently used to access the NFS share. In containerized Linux environments, user IDs inside a container can be mapped differently outside the container, and Docker documents that this can affect access to host-mounted resources.

Common signs

You may notice:

- Permission denied
- Errno 13
- Your app can read files but cannot update or overwrite them
- File ownership looks different than expected when you inspect the mounted path

These symptoms are consistent with how NFS handles Unix-style ownership and permissions. Azure documents that NFS permissions are enforced through the operating system and NFS model rather than SMB-style user authentication.

Why this can happen

At a high level, NFS uses numeric ownership such as UID and GID. In container-based Linux environments, the identity that appears inside the container is not always the same as the identity seen outside the container. Docker’s user namespace documentation explains that a container user such as root can be mapped to a less-privileged user on the host, and that mounted-resource access can become more complex because of that mapping.
That means a file created earlier under one effective identity context may later be accessed under a different one. When that happens, the app may no longer be able to write to the file even though the file itself is still present and intact.

What to check first

Start by checking the mounted share from the app’s runtime context.

```shell
ls -l /mount/path/file
ls -ln /mount/path/file
id -u
id -g
```

The ls -ln output is especially useful because it shows the numeric UID and GID directly. If you need shell access for investigation, App Service supports SSH into Linux containers, and Microsoft notes that Linux custom containers may need extra SSH configuration. You should also review the NFS share’s squash setting. Azure Files NFS supports No Root Squash, Root Squash, and All Squash. Microsoft documents these options in the root squash guidance.

A practical mitigation

If the main issue is inconsistent ownership behavior, a practical mitigation is often to use All Squash on the NFS share. Azure documents All Squash as a supported NFS setting, and squash settings are specifically intended to control how client identities are handled when they access the share. One important note: changing the squash setting does not automatically rewrite old files. If existing data was created under a different ownership context, you may still need to migrate that data to a new share configured the way you want.

Recommended approach

A simple and cautious approach is:

1. Create a new Azure Files NFS share.
2. Configure it with All Squash if that matches your workload needs.
3. Mount both the old share and the new share on a Linux environment.
4. Copy the data from old to new.
5. Validate that the app can read and write correctly.
6. Repoint production to the validated share.

Azure Files supports NFS shares and squash configuration, and Azure also documents how to mount NFS shares on Linux if you need a separate environment for validation or migration.
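To see the numeric-identity behavior concretely, here is a minimal sketch of the check described above. The /tmp path is illustrative rather than an actual NFS mount: it creates a file and compares the file's numeric owner with the effective UID of the current process, which is exactly the comparison NFS makes on the mounted share.

```shell
# Sketch: compare a file's numeric owner (what NFS enforces) with the
# effective UID of the process trying to write it.
demo_file=$(mktemp /tmp/nfs-uid-demo.XXXXXX)

owner_uid=$(stat -c '%u' "$demo_file")   # numeric UID that owns the file
my_uid=$(id -u)                          # effective UID of this process

if [ "$owner_uid" -eq "$my_uid" ]; then
  echo "UID match: writes succeed"
else
  echo "UID mismatch: expect Permission denied (Errno 13)"
fi
rm -f "$demo_file"
```

On the real share, `ls -ln` gives you the same numeric owner. Because the comparison is purely numeric, a container whose UID mapping changes between restarts can suddenly fail this check even though the file data is intact.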
Final takeaway

If your App Service on Linux starts hitting NFS permission denied errors, focus first on ownership, UID/GID behavior, and squash settings before assuming the files are damaged. For many users, the most effective path is to validate the current ownership model, review the NFS squash setting, and, if needed, migrate data to a share configured with All Squash.

References

- NFS file shares in Azure Files | Microsoft Learn
- Configure Root Squash Settings for NFS Azure File Shares | Microsoft Learn
- SSH Access for Linux and Windows Containers - Azure App Service | Microsoft Learn
- Isolate containers with a user namespace | Docker Docs

Autonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes
When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m. Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc kubectl commands.

This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams.

Core concepts

Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow:

- Incident platform. Where incidents originate. In this demo, that is Azure Monitor.
- Built-in Azure capabilities. The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors.
- Connectors. Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers.
- Permission levels. Reader for investigation and read-oriented access, privileged for operational changes when allowed.
- Run modes. Review for approval-gated execution and Autonomous for direct execution.

The most important production controls are permission level and run mode, not prompt quality. Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability. The safest production rollout path:

1. Start: Reader + Review
2. Then: Privileged + Review
3. Finally: Privileged + Autonomous, only for narrow, trusted incident paths.
## Demo environment

The full scripts and manifests are available if you want to reproduce this:

- Demo repository: github.com/hailugebru/azure-sre-agents-aks. The README includes setup and configuration details.

The environment uses an AKS cluster with node auto-provisioning (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the AKS Store sample microservices application, and Azure SRE Agent configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal. It provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself.

```
Azure Monitor → Action Group → Azure SRE Agent   →  AKS Cluster
   (Alert)       (Webhook)    (Investigate / Fix)    (Recover)
                                      ↓
   Teams notification + GitHub issue → GitHub Agent → PR for review
```

## How the agent was configured

Configuration came down to four things: scope, permissions, incident intake, and response mode.

I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take. I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction: permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly.

For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates.

I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow-up after remediation. For the complete setup, see the README.

A note on context.
The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts.

## Two incidents, two response modes

These incidents occurred on the same cluster in one session and illustrate two realistic operating modes:

- **Alert-triggered automation.** The agent acts when Azure Monitor fires.
- **Ad hoc chat investigation.** An engineer sees a symptom first and asks the agent to investigate.

Both matter in real environments. The first is your scale path. The second is your operator-assist path.

### Incident 1. CPU starvation (alert driven, ~8 min MTTR)

The makeline-service deployment manifest contained a CPU and memory configuration that was not viable for startup:

```yaml
resources:
  requests:
    cpu: 1m
    memory: 6Mi
  limits:
    cpu: 5m
    memory: 20Mi
```

Within five minutes, Azure Monitor fired the pod-not-healthy Sev1 alert. The agent picked it up immediately. Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code:

> "Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario."

That is the kind of distinction that often takes an on-call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure. The agent then:

- Identified three additional CPU-throttled pods at 112 to 200% of configured limit using kubectl top.
- Patched four workloads: makeline-service, virtual-customer, virtual-worker, and mongodb.
- Verified that all affected pods returned to a healthy running state with 0 restarts cluster-wide.

*Azure SRE Agent's Incident History blade confirming full cluster recovery: 4 patches applied, 0 unhealthy pods — no human intervention required.*

**Outcome.** Full cluster recovery in ~8 minutes, 0 human interventions.

### Incident 2. OOMKilled (chat driven, ~4 min MTTR)

For the second case, I deployed a deliberately undersized version of order-service:

```bash
kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets
```

I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. CrashLoopBackOff is a container waiting reason, not a pod phase, so production coverage should come from Prometheus-based crash-loop signals rather than pod phase alone. Here is the PromQL query I use in Azure Monitor to catch this class of failure:

```promql
sum by (namespace, pod) (
  (
    max_over_time(
      kube_pod_container_status_waiting_reason{
        namespace="pets",
        reason="CrashLoopBackOff"
      }[5m]
    ) == 1
  )
  and on (namespace, pod, container)
  (
    increase(
      kube_pod_container_status_restarts_total{
        namespace="pets"
      }[15m]
    ) > 0
  )
) > 0
```

This query fires when a container has been in CrashLoopBackOff within the last 5 minutes and its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces.

I then gave the agent this prompt in chat:

> The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it.

The agent's reasoning:

> "Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload."
The agent increased the memory limit (20Mi to 128Mi) and request (10Mi to 50Mi), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts.

**Outcome.** Service recovered in ~4 minutes without any manual cluster interaction.

## Side by side comparison

| Dimension | Incident 1: CPU starvation | Incident 2: OOMKilled |
| --- | --- | --- |
| Trigger | Azure Monitor alert (automated) | Engineer chat prompt (ad hoc) |
| Failure mode | CPU too low for startup probe to pass | Memory limit too low for process to start |
| Key signal | Exit code 1, probe timeout | Exit code 137, empty container logs |
| Blast radius | 4 workloads affected cluster-wide | 1 workload in target namespace |
| Remediation | CPU request/limit patches across 4 deployments | Memory request/limit patch on 1 deployment |
| MTTR | ~8 min | ~4 min |
| Human interventions | 0 | 0 |

## Why this matters

Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc kubectl commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow. The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps.

In this lab, the impact was measurable:

| Metric | This demo with Azure SRE Agent |
| --- | --- |
| Alert to recovery | ~4 to 8 min |
| Human interventions | 0 |
| Scope of investigation | Cluster-wide, automated |
| Correlate evidence and diagnose | ~2 min |
| Apply fix and verify | ~4 min |
| Post-incident follow-up | GitHub issue + draft PR |

These results came from a controlled run on April 10, 2026. Real-world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments.
## Teams + GitHub follow-up

Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post-incident path matters.

After Incident 1 resolved, Azure SRE Agent used the GitHub connector to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix. The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review still remains the final control point before merge. Setup details for the GitHub connector are in the demo repo README, and the official reference is in the Azure SRE Agent docs.

This is the operations-to-engineering handoff: Azure SRE Agent fixes the live issue, and the GitHub follow-up prepares the durable source change so future deployments do not reintroduce the same configuration problem.

In parallel, the Teams connector posted milestone updates during the incident:

- Investigation started.
- Root cause and remediation identified.
- Incident resolved.

Teams handled real-time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery.

## Key takeaways

Three things to carry forward:

1. Treat Azure SRE Agent as a governed incident-response system, not a chatbot with infrastructure access. The most important controls are permission levels and run modes, not prompt quality.
2. Anchor detection in your existing incident platforms. For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live.
3. Use connectors to extend the workflow outward. Teams for real-time coordination, GitHub for durable engineering follow-up.

Start where you're comfortable. If you are just getting your feet wet, begin with one resource group, one incident type, and Review mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling Autonomous. Expand only once each layer is trusted.

## Next steps

- Add Prometheus-based alert coverage for ImagePullBackOff and node resource pressure to complement the pod-phase rule.
- Expand to multi-cluster managed scopes once the single-cluster path is trusted and validated.
- Explore how NAP and Azure SRE Agent complement each other — NAP manages infrastructure capacity, while the agent investigates and remediates incidents.

I'd like to thank Cary Chai, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review — his feedback sharpened both the accuracy and quality of this post.

# AKS App Routing's Next Chapter: Gateway API with Istio
If you've been following my previous posts on the Ingress NGINX retirement, you'll know the story so far. The community Ingress NGINX project was retired in March 2026, and Microsoft's extended support for the NGINX-based App Routing add-on runs until November 2026. I've covered migrating from standalone NGINX to the App Routing add-on to buy time, and migrating to Application Gateway for Containers as a long-term option. In both of those posts I mentioned that Microsoft was working on a new version of the App Routing add-on based on Istio and the Gateway API. Well, it's here, in preview at least.

The App Routing Gateway API implementation is Microsoft's recommended migration path for anyone currently using the NGINX-based App Routing add-on. It moves you off NGINX entirely and onto the Kubernetes Gateway API, with a lightweight Istio control plane handling the gateway infrastructure under the hood. Let's look at what this actually is, how it differs from other options, and how to migrate from both standalone NGINX and the existing App Routing add-on.

## What Is It?

The new App Routing mode uses the Kubernetes Gateway API instead of the Ingress API. When you enable the add-on, AKS deploys an Istio control plane (istiod) to manage Envoy-based gateway proxies. The important thing to understand here is that this is not the full Istio service mesh. There's no sidecar injection, no Istio CRDs installed for your workloads. It's Istio doing one specific job: managing gateway proxies for ingress traffic.

When you create a Gateway resource, AKS provisions an Envoy Deployment, a LoadBalancer Service, a HorizontalPodAutoscaler (defaulting to 2-5 replicas at 80% CPU), and a PodDisruptionBudget. All managed. You write Gateway and HTTPRoute resources, and AKS handles everything else.

This is a fundamentally different API from what you're used to with Ingress.
Instead of a single Ingress resource that combines the entry point and routing rules, Gateway API splits things into layers:

- **GatewayClass** defines the type of gateway infrastructure (provided by AKS in this case)
- **Gateway** creates the actual gateway with its listeners
- **HTTPRoute** defines the routing rules and attaches to a Gateway

This separation is one of Gateway API's main selling points. Platform teams can own the Gateway resources while application teams manage their own HTTPRoutes independently, without needing to modify shared infrastructure. If you've ever had a team accidentally break routing for everyone by editing a shared Ingress, you'll appreciate why this matters.

## How It Differs From the Istio Service Mesh Add-On

If you're already running or considering the Istio service mesh add-on for AKS, this is a different thing. The App Routing Gateway API mode uses the approuting-istio GatewayClass, doesn't install Istio CRDs, doesn't enable sidecar injection, and handles upgrades in-place. The full Istio service mesh add-on uses the istio GatewayClass, installs Istio CRDs cluster-wide, enables sidecar injection, and uses canary upgrades for minor versions.

The two cannot run at the same time. If you have the Istio service mesh add-on enabled, you need to disable it before enabling App Routing Gateway API (and vice versa). If you need full mesh capabilities like mTLS between services, traffic policies, and telemetry, stick with the Istio service mesh add-on. If you just need managed ingress via Gateway API without the mesh overhead, this is the right choice.

## Current Limitations

The new App Routing solution is in preview, so it should not be run in production yet. There are also some gaps compared to the existing add-on, which you need to be aware of before planning a production migration. The biggest one: DNS and TLS certificate management via the add-on isn't supported yet for Gateway API.
If you're currently using az aks approuting update and az aks approuting zone add to automate Key Vault and Azure DNS integration with the NGINX-based add-on, that workflow doesn't carry over. TLS termination is still possible, but you'll need to set it up manually. The AKS docs cover the steps, but it's more hands-on than what the NGINX add-on gives you today. This is expected to be addressed when the feature reaches GA.

SNI passthrough (TLSRoute) and egress traffic management aren't supported either. And as mentioned, it's mutually exclusive with the Istio service mesh add-on.

For production workloads that depend heavily on automated DNS and TLS management, you may want to wait until GA, or look at Application Gateway for Containers as an alternative. But for teams that can handle TLS setup manually, and for non-production environments, there's no reason not to start testing this now.

## Getting Started

Before you can enable the feature, you need the aks-preview CLI extension (version 19.0.0b24 or later), the Managed Gateway API CRDs enabled, and the App Routing Gateway API preview feature flag registered:

```bash
az extension add --name aks-preview
az extension update --name aks-preview

# Managed Gateway API CRDs (required dependency)
az feature register --namespace "Microsoft.ContainerService" --name "ManagedGatewayAPIPreview"

# App Routing Gateway API implementation
az feature register --namespace "Microsoft.ContainerService" --name "AppRoutingIstioGatewayAPIPreview"
```

Feature flag registration can take a few minutes. Once they're registered, enable the add-on on a new or existing cluster.
You need both --enable-gateway-api (for the managed Gateway API CRD installation) and --enable-app-routing-istio (for the Istio-based implementation):

```bash
# New cluster
az aks create \
  --resource-group ${RESOURCE_GROUP} \
  --name ${CLUSTER} \
  --location swedencentral \
  --enable-gateway-api \
  --enable-app-routing-istio

# Existing cluster
az aks update \
  --resource-group ${RESOURCE_GROUP} \
  --name ${CLUSTER} \
  --enable-gateway-api \
  --enable-app-routing-istio
```

Verify istiod is running:

```bash
kubectl get pods -n aks-istio-system
```

You should see two istiod pods in a Running state. From here, you can create a Gateway and HTTPRoute to test traffic flow. The AKS quickstart walks through this with the httpbin sample app if you want a quick validation.

## Migrating From NGINX Ingress

Whether you're running standalone NGINX (self-installed via Helm) or the NGINX-based App Routing add-on, the migration process is essentially the same. You're moving from Ingress API resources to Gateway API resources, and the new controller runs alongside your existing one during the transition. The only real differences are what you're cleaning up at the end and, if you're on the App Routing add-on, whether you were relying on its built-in DNS and TLS automation.

### Inventory Your Ingress Resources

Before anything else, understand what you have:

```bash
kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName'
```

Look specifically for custom snippets, lua configurations, or anything that relies heavily on NGINX-specific behaviour. These won't have direct equivalents in Gateway API and will need manual attention.

### Convert Ingress Resources to Gateway API

The ingress2gateway tool (v1.0.0) handles conversion of Ingress resources to Gateway API equivalents. It supports over 30 common NGINX annotations and generates Gateway and HTTPRoute YAML.
It works regardless of whether your Ingress resources use the nginx or webapprouting.kubernetes.azure.com IngressClass:

```bash
# Install
go install github.com/kubernetes-sigs/ingress2gateway@v1.0.0

# Convert from live cluster
ingress2gateway print --providers=ingress-nginx -A > gateway-resources.yaml

# Or convert from a local file
ingress2gateway print --providers=ingress-nginx --input-file=./manifests/ingress.yaml > gateway-resources.yaml
```

Review the output carefully. The tool flags annotations it can't convert as comments in the generated YAML, so you'll know exactly what needs manual work. Common gaps include custom configuration snippets and regex-based rewrites that don't map cleanly to Gateway API's routing model.

Make sure you update the gatewayClassName in the generated Gateway resources to approuting-istio. The tool may generate a generic GatewayClass name that you'll need to change.

### Handle DNS and TLS

If you're coming from standalone NGINX, you're likely managing DNS and TLS yourself already, so nothing changes here: just make sure your certificate Secrets and DNS records are ready for the new Gateway IP.

If you're coming from the App Routing add-on and relying on its built-in DNS and TLS management (via az aks approuting zone add and Key Vault integration), this is the part that needs extra thought. That automation doesn't carry over to the Gateway API implementation yet, so you'll need to handle it differently until GA.

For TLS, you can either create Kubernetes Secrets with your certificates manually or set up a workflow to sync them from Key Vault. The AKS docs on securing Gateway API traffic cover the manual approach. For DNS, you'll need to manage records yourself or use ExternalDNS to automate it. ExternalDNS supports Gateway API resources, so this is a viable path if you want automation.
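For reference, here is a minimal sketch of the kind of Gateway and HTTPRoute pair you should end up with after conversion. This is illustrative only: the names, namespace, and backend service are placeholders I've made up, while the gatewayClassName is the approuting-istio class provided by the add-on.

```yaml
apiVersion: gateway.networking.k8s.io/v1
kind: Gateway
metadata:
  name: my-gateway        # placeholder name
  namespace: my-namespace # placeholder namespace
spec:
  gatewayClassName: approuting-istio
  listeners:
    - name: http
      protocol: HTTP
      port: 80
      allowedRoutes:
        namespaces:
          from: Same
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: my-route
  namespace: my-namespace
spec:
  parentRefs:
    - name: my-gateway    # attaches the route to the Gateway above
  rules:
    - matches:
        - path:
            type: PathPrefix
            value: /
      backendRefs:
        - name: my-service # placeholder backend Service
          port: 80
```

Comparing the ingress2gateway output against a shape like this makes it easy to spot a GatewayClass name or parentRef that still needs fixing.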
### Deploy and Validate

With the add-on enabled, apply your converted resources:

```bash
kubectl apply -f gateway-resources.yaml
```

Wait for the Gateway to be programmed and get the external IP:

```bash
kubectl wait --for=condition=programmed gateways.gateway.networking.k8s.io <gateway-name>

export GATEWAY_IP=$(kubectl get gateways.gateway.networking.k8s.io <gateway-name> -ojsonpath='{.status.addresses[0].value}')
```

The key thing here is that your existing NGINX controller (whether standalone or add-on managed) is still running and serving production traffic. The Gateway API resources are handled separately by the Istio-based controller in aks-istio-system. This parallel running is what makes the migration safe.

Test your routes against the new Gateway IP. You'll need to provide the appropriate URL as a host header, as your DNS will still be pointing at the NGINX add-on at this point:

```bash
curl -H "Host: myapp.example.com" http://$GATEWAY_IP
```

Run your full validation suite. Check TLS, path routing, headers, authentication, anything your applications depend on. Take your time here; nothing changes for production until you update DNS.

### Cut Over DNS and Clean Up

Once you're confident, lower your DNS TTL to 60 seconds (do this well in advance), then update your DNS records to point to the new Gateway IP. Keep the old NGINX controller running for 24-48 hours as a rollback option.

After traffic has been flowing cleanly through the Gateway API path, clean up the old setup.
What this looks like depends on where you started.

If you were on standalone NGINX:

```bash
helm uninstall ingress-nginx -n ingress-nginx
kubectl delete namespace ingress-nginx
```

If you were on the App Routing add-on with NGINX, first verify nothing is still using the old IngressClass:

```bash
kubectl get ingress --all-namespaces \
  -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \
  | grep "webapprouting"
```

Delete any remaining Ingress resources that reference the old class, then disable the NGINX-based App Routing add-on:

```bash
az aks approuting disable --resource-group ${RESOURCE_GROUP} --name ${CLUSTER}
```

Some resources (configMaps, secrets, and the controller deployment) will remain in the app-routing-system namespace after disabling. You can clean these up by deleting the namespace once you're satisfied everything is running through the Gateway API path:

```bash
kubectl delete ns app-routing-system
```

In both cases, clean up any old Ingress resources that are no longer being used.

## Upgrades and Lifecycle

The Istio control plane version is tied to your AKS cluster's Kubernetes version. AKS automatically handles patch upgrades as part of its release cycle, and minor version upgrades happen in-place when you upgrade your cluster's Kubernetes version or when a new Istio minor version is released for your AKS version.

One thing to be aware of: unlike the Istio service mesh add-on, upgrades here are in-place, not canary-based. The HPA and PDB on each Gateway help minimise disruption, but plan accordingly for production. If you have maintenance windows configured, the istiod upgrades will respect them.

## What Should You Do Now?

The timeline hasn't changed. The standalone NGINX Ingress project was retired in March 2026, so if you're still running that, you're already on unsupported software. The NGINX App Routing add-on is supported until November 2026, which gives you a window, but it's not a long one.
If you're on standalone NGINX: you could get onto the App Routing add-on now to buy time (I covered this in my earlier post), then plan your migration to either the Gateway API mode or AGC.

If you're on the NGINX App Routing add-on: start testing the Gateway API mode in non-production now. Get familiar with the Gateway API resource model, understand the TLS and DNS gaps in the preview, and be ready to migrate when the feature reaches GA or when November gets close, whichever comes first.

If you need production-ready TLS and DNS automation today and can't wait for GA, Application Gateway for Containers is your best option right now.

Whatever path you choose, make sure you have a plan in place before November. Running unsupported ingress software on production infrastructure isn't where you want to be.

# Continued Investment in Azure App Service
This blog was originally published to the App Service team blog.

## Recent Investments

### Premium v4 (Pv4)

Azure App Service Premium v4 delivers higher performance and scalability on newer Azure infrastructure while preserving the fully managed PaaS experience developers rely on. Premium v4 offers expanded CPU and memory options, improved price-performance, and continued support for App Service capabilities such as deployment slots, integrated monitoring, and availability zone resiliency. These improvements help teams modernize and scale demanding workloads without taking on additional operational complexity.

### App Service Managed Instance

App Service Managed Instance extends the App Service model to support Windows web applications that require deeper environment control. It enables plan-level isolation, optional private networking, and operating system customization while retaining managed scaling, patching, identity, and diagnostics. Managed Instance is designed to reduce migration friction for existing applications, allowing teams to move to a modern PaaS environment without code changes.

### Faster Runtime and Language Support

Azure App Service continues to invest in keeping pace with modern application stacks. Regular updates across .NET, Node.js, Python, Java, and PHP help developers adopt new language versions and runtime improvements without managing underlying infrastructure.

### Reliability and Availability Improvements

Ongoing investments in platform reliability and resiliency strengthen production confidence. Expanded Availability Zone support and related infrastructure improvements help applications achieve higher availability with more flexible configuration options as workloads scale.

### Deployment Workflow Enhancements

Deployment workflows across Azure App Service continue to evolve, with ongoing improvements to GitHub Actions, Azure DevOps, and platform tooling.
These enhancements reduce friction from build to production while preserving the managed App Service experience.

## A Platform That Grows With You

These recent investments reflect a consistent direction for Azure App Service: active development focused on performance, reliability, and developer productivity. Improvements to runtimes, infrastructure, availability, and deployment workflows are designed to work together, so applications benefit from platform progress without needing to re-architect or change operating models.

The recent General Availability of Aspire on Azure App Service is another example of this direction. Developers building distributed .NET applications can now use the Aspire AppHost model to define, orchestrate, and deploy their services directly to App Service — bringing a code-first development experience to a fully managed platform.

We are also seeing many customers build and run AI-powered applications on Azure App Service, integrating models, agents, and intelligent features directly into their web apps and APIs. App Service continues to evolve to support these scenarios, providing a managed, scalable foundation that works seamlessly with Azure's broader AI services and tooling.

Whether you are modernizing with Premium v4, migrating existing workloads using App Service Managed Instance, or running production applications at scale (including AI-enabled workloads), Azure App Service provides a predictable and transparent foundation that evolves alongside your applications. Azure App Service continues to focus on long-term value through sustained investment in a managed platform developers can rely on as requirements grow, change, and increasingly incorporate AI.

## Get Started

Ready to build on Azure App Service? Here are some resources to help you get started:

- Create your first web app — Deploy a web app in minutes using the Azure portal, CLI, or VS Code.
- App Service documentation — Explore guides, tutorials, and reference for the full platform.
- Aspire on Azure App Service — Now generally available. Deploy distributed .NET applications to App Service using the Aspire AppHost model.
- Pricing and plans — Compare tiers including Premium v4 and find the right fit for your workload.
- App Service on Azure Architecture Center — Reference architectures and best practices for production deployments.

# Unit Testing Helm Charts with Terratest: A Pattern Guide for Type-Safe Validation
Helm charts are the de facto standard for packaging Kubernetes applications. But here's a question worth asking: how do you know your chart actually produces the manifests you expect, across every environment, before it reaches a cluster?

If you're like most teams, the answer is some combination of helm template eyeball checks, catching issues in staging, or hoping for the best. That's slow, error-prone, and doesn't scale. In this post, we'll walk through a better way: a render-and-assert approach to unit testing Helm charts using Terratest and Go. The result? Type-safe, automated tests that run locally in seconds with no cluster required.

## The Problem

Let's start with why this matters. Helm charts are templates that produce YAML, and templates have logic: conditionals, loops, value overrides per environment. That logic can break silently:

- A values-prod.yaml override points to the wrong container registry
- A security context gets removed during a refactor and nobody notices
- An ingress host is correct in dev but wrong in production
- HPA scaling bounds are accidentally swapped between environments
- Label selectors drift out of alignment with pod templates, causing orphaned ReplicaSets

These aren't hypothetical scenarios. They're real bugs that slip through helm lint and code review because those tools don't understand what your chart should produce. They only check whether the YAML is syntactically valid. These bugs surface at deploy time, or worse, in production. So how do we catch them earlier?

## The Approach: Render and Assert

The idea is straightforward. Instead of deploying to a cluster to see if things work, we render the chart locally and validate the output programmatically. Here's the three-step model:

1. **Render:** Terratest calls helm template with your base values.yaml + an environment-specific values-<env>.yaml override
2. **Unmarshal:** The rendered YAML is deserialized into real Kubernetes API structs (appsV1.Deployment, coreV1.ConfigMap, networkingV1.Ingress, etc.)
3. **Assert:** Testify assertions validate every field that matters, including names, labels, security context, probes, resource limits, ingress routing, and more

No cluster. No mocks. No flaky integration tests. Just fast, deterministic validation of your chart's output. Here's what that looks like in practice:

```go
// Arrange
options := &helm.Options{
	ValuesFiles: s.valuesFiles,
}
output := helm.RenderTemplate(s.T(), options, s.chartPath, s.releaseName, s.templates)

// Act
var deployment appsV1.Deployment
helm.UnmarshalK8SYaml(s.T(), output, &deployment)

// Assert: security context is hardened
secCtx := deployment.Spec.Template.Spec.Containers[0].SecurityContext
require.Equal(s.T(), int64(1000), *secCtx.RunAsUser)
require.True(s.T(), *secCtx.RunAsNonRoot)
require.True(s.T(), *secCtx.ReadOnlyRootFilesystem)
require.False(s.T(), *secCtx.AllowPrivilegeEscalation)
```

Notice something important here: because you're working with real Go structs, the compiler catches schema errors. If you typo a field path like secCtx.RunAsUsr, the code won't compile. With YAML-based assertion tools, that same typo would fail silently at runtime. This type safety is a big deal when you're validating complex resources like Deployments.

## What to Test: 16 Patterns Across 6 Categories

That covers the how. But what should you actually assert? Through applying this approach across multiple charts, we've identified 16 test patterns that consistently catch real bugs.
They fall into six categories:

| Category | What Gets Validated |
| --- | --- |
| Identity & Labels | Resource names, 5 standard Helm/K8s labels, selector alignment |
| Configuration | Environment-specific configmap data, env var injection |
| Container | Image registry per env, ports, resource requests/limits |
| Security | Non-root user, read-only FS, dropped capabilities, AppArmor, seccomp, SA token automount |
| Reliability | Startup/liveness/readiness probes, volume mounts |
| Networking & Scaling | Ingress hosts/TLS per env, service port wiring, HPA bounds per env |

You don't need all 16 on day one. Start with resource name and label validation, since those apply to every resource and catch the most common _helpers.tpl bugs. Then add security and environment-specific patterns as your coverage grows.

Now, let's look at how to structure these tests to handle the trickiest part: multiple environments.

## Multi-Environment Testing

One of the most common Helm chart bugs is environment drift, where values that are correct in dev are wrong in production. A single test suite that only validates one set of values will miss these entirely. The solution is to maintain separate test suites per environment:

```
tests/unit/my-chart/
├── dev/   ← Asserts against values.yaml + values-dev.yaml
├── test/  ← Asserts against values.yaml + values-test.yaml
└── prod/  ← Asserts against values.yaml + values-prod.yaml
```

Each environment's tests assert the merged result of values.yaml + values-<env>.yaml. So when your values-prod.yaml overrides the container registry to prod.azurecr.io, the prod tests verify exactly that, while the dev tests verify dev.azurecr.io.

This structure catches a class of bugs that no other approach does: "it works in dev" issues where an environment-specific override has a typo, a missing field, or an outdated value. But environment-specific configuration isn't the only thing worth testing per commit. Let's talk about security.
## Security as Code

Security controls in Kubernetes manifests are notoriously easy to weaken by accident. Someone refactors a deployment template, removes a `securityContext` block they think is unused, and suddenly your containers are running as root in production. No linter catches this. No code reviewer is going to diff every field of a rendered manifest.

With this approach, you encode your security posture directly into your test suite. Every deployment test should validate:

- Container runs as non-root (UID 1000)
- Root filesystem is read-only
- All Linux capabilities are dropped
- Privilege escalation is blocked
- AppArmor profile is set to `runtime/default`
- Seccomp profile is set to `RuntimeDefault`
- Service account token automount is disabled

If someone removes a security control during a refactor, the test fails immediately, not after a security review weeks later. Security becomes a CI gate, not a review checklist.

With patterns and environments covered, the next question is: how do you wire this into your CI/CD pipeline?

## CI/CD Integration with Azure DevOps

These tests integrate naturally into Azure DevOps pipelines. Since they're just Go tests that call `helm template` under the hood, all you need is a Helm CLI and a Go runtime on your build agent. A typical multi-stage pipeline looks like:

```yaml
stages:
  - stage: Build       # Package the Helm chart
  - stage: Dev         # Lint + test against values-dev.yaml
  - stage: Test        # Lint + test against values-test.yaml
  - stage: Production  # Lint + test against values-prod.yaml
```

Each stage uses a shared template that installs Helm and Go, extracts the packaged chart, runs `helm lint`, and executes the Go tests with gotestsum. Environment gates ensure production tests pass before deployment proceeds.
Here's the key part of a reusable test template:

```yaml
- script: |
    export PATH=$PATH:/usr/local/go/bin:$(go env GOPATH)/bin
    go install gotest.tools/gotestsum@latest
    cd $(Pipeline.Workspace)/helm.artifact/tests/unit
    gotestsum --format testname --junitfile $(Agent.TempDirectory)/test-results.xml \
      -- ./${{ parameters.helmTestPath }}/... -count=1 -timeout 50m
  displayName: 'Test helm chart'
  env:
    HELM_RELEASE_NAME: ${{ parameters.helmReleaseName }}
    HELM_VALUES_FILE_OVERRIDE: ${{ parameters.helmValuesFileOverride }}

- task: PublishTestResults@2
  displayName: 'Publish test results'
  inputs:
    testResultsFormat: 'JUnit'
    testResultsFiles: '$(Agent.TempDirectory)/test-results.xml'
  condition: always()
```

The `PublishTestResults@2` task makes pass/fail results visible on the build's Tests tab, showing individual test names, durations, and failure details. The `condition: always()` ensures results are published even when tests fail, so you always have visibility.

At this point you might be wondering: why Go and Terratest? Why not a simpler YAML-based tool?

## Why Terratest + Go Instead of helm-unittest?

helm-unittest is a popular YAML-based alternative, and it's a fair question. Both tools are valid. Here's why we landed on Terratest:

| | Terratest + Go | helm-unittest (YAML) |
|---|---|---|
| Type safety | Renders into real K8s API structs; compiler catches schema errors | String matching on raw YAML; typos in field paths fail silently |
| Language features | Loops, conditionals, shared setup, table-driven tests | Limited to YAML assertion DSL |
| Debugging | Standard Go debugger, stack traces | YAML diff output only |
| Ecosystem alignment | Same language as Terraform tests, one testing stack | Separate tool, YAML-only |

The type safety argument is the strongest. When you unmarshal into `appsV1.Deployment`, the Go compiler guarantees your assertions reference real fields. With helm-unittest, a YAML path like `spec.template.spec.containers[0].securityContest` (note the typo) would silently pass because it matches nothing, rather than failing loudly.
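To see why path-based matching fails quietly, here's a minimal sketch of what a dotted-path lookup does when one segment is typo'd. This illustrates the failure mode in plain Go; it is not helm-unittest's actual implementation:

```go
package main

import (
	"fmt"
	"strings"
)

// lookup walks a dotted path through nested maps, the way a
// YAML-path assertion tool conceptually does. A typo'd segment
// doesn't raise an error — it just finds nothing.
func lookup(doc map[string]any, path string) (any, bool) {
	var cur any = doc
	for _, seg := range strings.Split(path, ".") {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		cur, ok = m[seg]
		if !ok {
			return nil, false // missing key: silently nothing
		}
	}
	return cur, true
}

func main() {
	rendered := map[string]any{
		"securityContext": map[string]any{"runAsNonRoot": true},
	}
	v, ok := lookup(rendered, "securityContext.runAsNonRoot")
	fmt.Println(v, ok) // correct path: true true

	v, ok = lookup(rendered, "securityContest.runAsNonRoot") // typo
	fmt.Println(v, ok) // <nil> false — no compile error, no loud failure
}
```

A compiled field access like `secCtx.RunAsNonRoot` simply cannot end up in this state: the typo'd name doesn't exist on the struct, so the build fails before any test runs.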
That said, if your team has no Go experience and needs the lowest adoption barrier, helm-unittest is a reasonable starting point. For teams already using Go or Terraform, Terratest is the stronger long-term choice.

## Getting Started

Ready to try this? Here's a minimal project structure to get you going:

```
your-repo/
├── charts/
│   └── your-chart/
│       ├── Chart.yaml
│       ├── values.yaml
│       ├── values-dev.yaml
│       ├── values-test.yaml
│       ├── values-prod.yaml
│       └── templates/
├── tests/
│   └── unit/
│       ├── go.mod
│       └── your-chart/
│           ├── dev/
│           ├── test/
│           └── prod/
└── Makefile
```

Prerequisites: Go 1.22+, Helm 3.14+. You'll need three Go module dependencies:

- `github.com/gruntwork-io/terratest` v0.46.16
- `github.com/stretchr/testify` v1.8.4
- `k8s.io/api` v0.28.4

Initialize your test module, write your first test using the patterns above, and run:

```shell
cd tests/unit
HELM_RELEASE_NAME=your-chart \
HELM_VALUES_FILE_OVERRIDE=values-dev.yaml \
go test -v ./your-chart/dev/... -timeout 30m
```

Start with a ConfigMap test. It's the simplest resource type and lets you validate the full render-unmarshal-assert flow before tackling Deployments. Once that passes, work your way through the pattern categories, adding security and environment-specific assertions as you go.

## Wrapping Up

Unit testing Helm charts with Terratest gives you something that `helm lint` and manual review can't:

- **Type-safe validation**: The compiler catches schema errors, not production
- **Environment-specific coverage**: Each environment's values are tested independently
- **Security as code**: Security controls are verified on every commit, not in periodic reviews
- **Fast feedback**: Tests run in seconds with no cluster required
- **CI/CD integration**: JUnit results published natively to Azure DevOps

The patterns we've covered here are the ones that have caught the most real bugs for us. Start small with resource names and labels, and expand from there.
The investment is modest, and the first time a test catches a broken `values-prod.yaml` override before it reaches production, it'll pay for itself.

## We'd Love Your Feedback

We'd love to hear how this approach works for your team:

- Which patterns were most useful for your charts?
- What resource types or patterns are missing?
- How did the adoption experience go?

Drop a comment below. Happy to dig into any of these topics further!

---

# Heroku Entered Maintenance Mode — Here's Your Next Move
Heroku has entered sustaining engineering — no new features, no new enterprise contracts. If you're running production workloads on the platform, you're probably thinking about what comes next.

Azure Container Apps is worth a serious look. Scale-to-zero pricing, event-driven autoscaling, built-in microservice support, serverless GPUs, and an active roadmap — it's a container platform that handles everything from a simple web app to AI-native workloads, and you only pay for what you use.

I migrated a real Heroku app to Container Apps to pressure-test the experience. Here's what I learned, what to watch out for, and how you can do it in an afternoon.

## Why Container Apps is the natural next chapter

The philosophy carries over directly. Where Heroku had `git push`, Container Apps has:

```shell
az containerapp up --name my-app --source . --environment my-env
```

One command. If you have a Dockerfile, it builds and deploys your app directly. No local Docker install, no manual registry push — code in, URL out. That part didn't change.

The concept mapping is tight:

| Heroku | Azure Container Apps |
|---|---|
| Dyno | Container App replica |
| Procfile process types | Separate Container Apps |
| Heroku add-ons | Azure managed services |
| Config vars | Environment variables + secrets |
| `heroku run` one-off dynos | Container Apps Jobs |
| Heroku Pipelines | GitHub Actions |
| Heroku Scheduler | Scheduled Container Apps Jobs |

Container Apps also includes capabilities you'd need to piece together separately on Heroku — KEDA-powered autoscaling from any event source, Dapr for service-to-service communication, traffic splitting across revisions for safe rollouts, and scale to zero so you stop paying when nothing's running.

**Simplest path?** If your app is a straightforward web server and you don't want containers at all, Azure App Service (`az webapp up`) is also available. But for most Heroku workloads — especially anything with workers, background jobs, or variable traffic — Container Apps is the better fit.
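To make the scale-to-zero point concrete, here's a quick back-of-envelope sketch. The 180,000 vCPU-second monthly free grant figure comes from the cost discussion later in this article; the workload shape (0.25 vCPU, active about 2 hours a day) is a hypothetical example, so run the numbers for your own app:

```go
package main

import "fmt"

// Monthly consumption free grant for Container Apps, as cited in the
// cost section of this article. Check current Azure pricing docs —
// grants and rates can change.
const freeGrantVCPUSeconds = 180000.0

// monthlyVCPUSeconds computes billable consumption for an app that
// scales to zero outside its active hours.
func monthlyVCPUSeconds(vCPU, activeHoursPerDay float64, daysPerMonth int) float64 {
	return vCPU * activeHoursPerDay * 3600 * float64(daysPerMonth)
}

func main() {
	// Hypothetical: a 0.25 vCPU app actually serving traffic ~2h/day.
	used := monthlyVCPUSeconds(0.25, 2, 30)
	fmt.Printf("used: %.0f vCPU-s, within free grant: %v\n",
		used, used <= freeGrantVCPUSeconds)
	// An always-on dyno, by contrast, bills for all 720 hours in the
	// month regardless of how little traffic it sees.
}
```

The same shape of calculation works in reverse: an app busy around the clock at 0.25 vCPU uses roughly 648,000 vCPU-seconds a month, well past the grant, which is where the dedicated plan comparison later in this article becomes the relevant one.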
## What a real migration looks like

I took a Node.js + Redis todo app from Heroku and moved it to Container Apps. The app is intentionally boring — Express server, Redis for storage, one web process. This is roughly what a lot of Heroku apps look like, and the migration took about 90 minutes end-to-end (most of that waiting for Redis to provision).

### Step 1: Export what you have

```shell
heroku config --json --app my-heroku-app > heroku-config.json
heroku apps:info --app my-heroku-app
heroku addons --app my-heroku-app
```

You want three things: your environment variables, your add-on list, and your app metadata. The config export is the most important one — it's every secret and connection string your app needs.

### Step 2: Create the Azure backing services

For each Heroku add-on, create the Azure equivalent. Here are the common ones:

| Heroku add-on | Azure service | CLI command |
|---|---|---|
| Heroku Postgres | Azure Database for PostgreSQL | `az postgres flexible-server create` |
| Heroku Redis | Azure Cache for Redis | `az redis create` |
| Heroku Scheduler | Container Apps Jobs | `az containerapp job create` |
| SendGrid | SendGrid via Marketplace | (Portal) |
| Papertrail / LogDNA | Azure Monitor + Log Analytics | (Enabled by default) |

For my todo app, I needed Redis:

```shell
az redis create \
  --name my-redis \
  --resource-group my-rg \
  --location swedencentral \
  --sku Basic --vm-size c0
```

One thing to know: Azure Cache for Redis takes 10–20 minutes to provision. Heroku's Redis add-on takes about two minutes. Budget the time.

### Step 3: Containerize

If you don't have a Dockerfile, write one. For a Node app this is about 10 lines:

```dockerfile
FROM node:20-slim
WORKDIR /app
COPY package*.json ./
RUN npm ci --omit=dev
COPY . .
EXPOSE 8080
CMD ["node", "server.js"]
```

Don't have a Dockerfile? Point GitHub Copilot at the migration repo and it will generate one for your stack — Node, Python, Ruby, Java, or Go. The repo includes templates and a containerization skill that inspects your app and produces a production-ready Dockerfile.
### Step 4: Build, push, deploy

I used Azure Container Registry (ACR) for the build. No local Docker install needed.

```shell
az acr create --name myacr --resource-group my-rg --sku Basic
az acr build --registry myacr --image my-app:v1 .
```

Then create the Container App:

```shell
az containerapp create \
  --name my-app \
  --resource-group my-rg \
  --environment my-env \
  --image myacr.azurecr.io/my-app:v1 \
  --registry-server myacr.azurecr.io \
  --target-port 8080 \
  --ingress external \
  --min-replicas 1
```

### Step 5: Wire up the config

This is where Heroku's `config:get` maps to Container Apps' secrets and environment variables. One gotcha I hit: you have to set secrets *before* you reference them in environment variables. If you try to do both at once, the deployment fails.

```shell
# Set the secret first
az containerapp secret set \
  --name my-app \
  --resource-group my-rg \
  --secrets redis-url="rediss://:ACCESS_KEY@my-redis.redis.cache.windows.net:6380"

# Then reference it
az containerapp update \
  --name my-app \
  --resource-group my-rg \
  --set-env-vars "REDIS_URL=secretref:redis-url"
```

### Step 6: Verify and cut over

Hit the Azure URL, test your routes, check that data flows through the new Redis instance. When you're satisfied, update your DNS CNAME to point at the Container Apps FQDN.

Total time: ~90 minutes, and most of that was waiting for Redis to provision. The actual migration work was about 30 minutes of CLI commands.

## Lessons from the field

I want to be upfront about what to watch for — these are the things that would waste your time if you hit them unprepared.

📌 **Register Azure providers first.** If your subscription has never used Container Apps, you need to register the resource providers before creating anything:

```shell
az provider register --namespace Microsoft.App --wait
az provider register --namespace Microsoft.OperationalInsights --wait
```

This takes a minute or two. Without it, resource creation fails with confusing error messages.

📌 **Set secrets before referencing them in env vars.**
The CLI doesn't warn you — it just fails the deployment. Always `az containerapp secret set` first, then `az containerapp update --set-env-vars`.

📌 **Budget time for Azure resource provisioning.** Azure Cache for Redis takes 10–20 minutes vs Heroku's ~2 minutes. Enterprise-grade infrastructure takes a bit longer to spin up — plan accordingly and provision backing services in parallel.

None of these are blockers. They're the kind of things a migration guide should tell you upfront — and ours does.

## Migrate today, build intelligent apps tomorrow

Once your app is on Container Apps, you're on a platform built for AI-native workloads too. No second migration required:

- **Serverless GPU** — attach GPU compute to your Container Apps for inference workloads. Run models alongside your app, same environment, same deployment pipeline. No separate ML infrastructure to manage.
- **Dynamic Sessions** — spin up isolated, sandboxed code execution environments on demand. Build AI agents that execute tools, run LLM-generated code safely, or offer interactive coding experiences — all within your existing Container Apps environment.

These aren't separate services you bolt on — they're configuration changes on the platform you're already running on.

**Building AI-native?** Container Apps pairs naturally with Azure AI Foundry — one place to access state-of-the-art models from both OpenAI and Anthropic, manage prompts, evaluate outputs, and deploy endpoints. Your app runs on Container Apps; your intelligence runs on Foundry. Same subscription, same identity, no glue code between clouds.

The applications being migrated today won't look the same in two years. A platform that grows with you — from web app to intelligent service — means you make this move once.

## You don't have to figure this out alone

We've built the resources to make this migration fast and repeatable:

📖 **Official Migration Guide** — End-to-end walkthrough covering assessment, containerization, service mapping, and deployment. Start here.
🤖 **Agent-Assisted Migration Repo** — An open-source repository designed to work with GitHub Copilot and other AI coding assistants. It includes an AGENTS.md file and six migration skills that walk you through the entire process — from Heroku assessment to DNS cutover — with real CLI commands, Dockerfile templates for five languages, Bicep IaC, and GitHub Actions workflows.

Point Copilot at the repo alongside your app's source code, and it becomes a migration pair-programmer: running the right commands, generating Dockerfiles, setting up CI/CD, and flagging things you might miss. This isn't a magic "migrate my app" button. It's more like having a colleague who has done this migration twenty times sitting next to you while you do it.

## The cost math works

Let's talk numbers.

| Heroku plan | Monthly cost | Container Apps equivalent | Monthly cost |
|---|---|---|---|
| Standard-1X (idle most of the day) | $25/mo | Consumption plan (scale to zero) | ~$0–5/mo |
| Performance-L (steady traffic) | $500/mo | Dedicated plan with autoscaling | Meaningfully less |
| 10 low-traffic apps across dev/staging/prod | $250+/mo | Consumption plan with free grants | Near zero |

Container Apps' monthly free grant covers 180,000 vCPU-seconds and 2 million requests. For apps that idle most of the day, that's often enough to pay nothing at all.

The biggest savings come from workloads that don't run 24/7. Heroku charges for every hour a dyno is running, period. Container Apps charges for actual compute consumption and scales to zero when there's no traffic.

## Get started

1. **Inventory** — Run `heroku apps` and `heroku addons` to see what you have.
2. **Pick a pilot** — Choose a non-critical app for your first migration.
3. **Migrate** — Follow the official migration guide, or point GitHub Copilot at the migration repo and let it pair with you.

Azure Container Apps gives you a production-grade container platform with scale-to-zero economics, an active roadmap, and a path to AI-native workloads — all from a single `az containerapp up` command.
If you're evaluating what comes after Heroku, start here.

👉 Start your migration · Clone the migration repo · Explore Azure Container Apps