azure kubernetes service
Troubleshoot with OpenTelemetry in Azure Monitor - Public Preview

OpenTelemetry is fast becoming the industry standard for modern telemetry collection and ingestion pipelines. With Azure Monitor’s new OpenTelemetry Protocol (OTLP) support, you can ship logs, metrics, and traces from wherever you run workloads to analyze and act on your observability data in one place.

What’s in the preview
- Direct OTLP ingestion into Azure Monitor for logs, metrics, and traces.
- Automated onboarding for AKS workloads.
- Application Insights on OTLP for distributed tracing, performance and troubleshooting experiences.
- Pre-built Grafana dashboards to visualize signals quickly.
- Prometheus for metric storage and query.
- OpenTelemetry semantic conventions for logs and traces, so your data lands in a familiar standard-based schema.

How to send OTLP to Azure Monitor: pick your path
- AKS: Auto-instrument Java and Node.js workloads using the Azure Monitor OpenTelemetry distro, or auto-configure any OpenTelemetry SDK-instrumented workload to export OTLP to Azure Monitor. Limited preview: Auto-instrumentation for .NET and Python is also available.
- VMs/VM Scale Sets (and Azure Arc-enabled compute): Use the Azure Monitor Agent (AMA) to receive OTLP from your apps and export it to Azure Monitor.
- Any environment: Use the OpenTelemetry Collector to receive OTLP signals and export directly to Azure Monitor cloud ingestion endpoints.

Under the hood: where your telemetry lands
- Metrics: Stored in an Azure Monitor Workspace, a Prometheus metrics store.
- Logs + traces: Stored in a Log Analytics workspace using an OpenTelemetry semantic conventions-based schema.
- Troubleshooting: Application Insights lights up distributed tracing and end-to-end performance investigations, backed by Azure Monitor.

Why it matters
- Standardize once: Instrument with OpenTelemetry and keep your telemetry portable.
- Reduce overhead: Fewer bespoke exporters and pipelines to maintain.
- Debug faster: Correlate metrics, logs, and traces to get from alert to root cause with less guesswork.
- Observe with confidence: Use dashboards and tracing views that are ready on day one.

Next step: Try the OTLP preview in your environment, then validate end-to-end signal flow with Application Insights and Grafana dashboards.

Autonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes
When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m.

Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc kubectl commands.

This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams.

Core concepts

Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow:
- Incident platform. Where incidents originate. In this demo, that is Azure Monitor.
- Built-in Azure capabilities. The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors.
- Connectors. Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers.
- Permission levels. Reader for investigation and read-oriented access; Privileged for operational changes when allowed.
- Run modes. Review for approval-gated execution and Autonomous for direct execution.

The most important production controls are permission level and run mode, not prompt quality. Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability.

The safest production rollout path:
- Start: Reader + Review
- Then: Privileged + Review
- Finally: Privileged + Autonomous. Only for narrow, trusted incident paths.
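The rollout path above amounts to two independent switches: what the agent is allowed to touch, and whether it has to ask first. Here is a minimal sketch of how they combine. The names `Permission`, `RunMode`, and `decide` are hypothetical and are not part of any Azure SRE Agent API, and the sketch assumes read-only investigation is never approval-gated, which the post does not state explicitly.

```python
from enum import Enum


class Permission(Enum):
    READER = "reader"          # investigation and read-oriented access
    PRIVILEGED = "privileged"  # operational changes when allowed


class RunMode(Enum):
    REVIEW = "review"          # approval-gated execution
    AUTONOMOUS = "autonomous"  # direct execution


def decide(permission: Permission, run_mode: RunMode, is_write: bool) -> str:
    """Return what happens to a proposed action in this toy model."""
    if not is_write:
        return "execute"         # assumption: reads are never gated
    if permission is Permission.READER:
        return "blocked"         # permission level denies the change outright
    if run_mode is RunMode.REVIEW:
        return "await_approval"  # permitted, but a human approves first
    return "execute"


# The three rollout stages from the text, applied to a write action.
ROLLOUT = [
    (Permission.READER, RunMode.REVIEW),
    (Permission.PRIVILEGED, RunMode.REVIEW),
    (Permission.PRIVILEGED, RunMode.AUTONOMOUS),
]

for permission, run_mode in ROLLOUT:
    print(permission.value, run_mode.value, decide(permission, run_mode, True))
```

In this model only the last stage lets a write run without a human in the loop, which is why the post reserves it for narrow, trusted incident paths.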
Demo environment

The full scripts and manifests are available if you want to reproduce this:
Demo repository: github.com/hailugebru/azure-sre-agents-aks. The README includes setup and configuration details.

The environment uses an AKS cluster with node auto-provisioning (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the AKS Store sample microservices application, and Azure SRE Agent configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal. It provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself.

    Azure Monitor (Alert) → Action Group (Webhook) → Azure SRE Agent (Investigate / Fix) → AKS Cluster (Recover)
                                                            ↓
                              Teams notification + GitHub issue → GitHub Agent → PR for review

How the agent was configured

Configuration came down to four things: scope, permissions, incident intake, and response mode.

I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take. I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction: permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly.

For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates. I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow-up after remediation. For the complete setup, see the README.

A note on context.
The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts.

Two incidents, two response modes

These incidents occurred on the same cluster in one session and illustrate two realistic operating modes:
- Alert-triggered automation. The agent acts when Azure Monitor fires.
- Ad hoc chat investigation. An engineer sees a symptom first and asks the agent to investigate.

Both matter in real environments. The first is your scale path. The second is your operator-assist path.

Incident 1. CPU starvation (alert driven, ~8 min MTTR)

The makeline-service deployment manifest contained a CPU and memory configuration that was not viable for startup:

    resources:
      requests:
        cpu: 1m
        memory: 6Mi
      limits:
        cpu: 5m
        memory: 20Mi

Within five minutes, Azure Monitor fired the pod-not-healthy Sev1 alert. The agent picked it up immediately. Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code:

"Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario."

That is the kind of distinction that often takes an on-call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure.

The agent then:
- Identified three additional CPU-throttled pods at 112 to 200% of configured limit using kubectl top.
- Patched four workloads: makeline-service, virtual-customer, virtual-worker, and mongodb.
- Verified that all affected pods returned to healthy running state with 0 restarts cluster wide.

Azure SRE Agent's Incident History blade confirming full cluster recovery: 4 patches applied, 0 unhealthy pods, no human intervention required.

Outcome. Full cluster recovery in ~8 minutes, 0 human interventions.

Incident 2. OOMKilled (chat driven, ~4 min MTTR)

For the second case, I deployed a deliberately undersized version of order-service:

    kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets

I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. CrashLoopBackOff is a container waiting reason, not a pod phase, so production coverage should come from Prometheus-based crash-loop signals rather than pod phase alone.

Here is the PromQL query I use in Azure Monitor to catch this class of failure:

    sum by (namespace, pod) (
      (
        max_over_time(
          kube_pod_container_status_waiting_reason{
            namespace="pets",
            reason="CrashLoopBackOff"
          }[5m]
        ) == 1
      )
      and on (namespace, pod, container)
      (
        increase(
          kube_pod_container_status_restarts_total{
            namespace="pets"
          }[15m]
        ) > 0
      )
    ) > 0

This query fires when a container has been in CrashLoopBackOff within the last 5 minutes and its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces.

The chat prompt:

"The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it."

The agent's reasoning:

"Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload."
The agent increased the memory limit (20Mi to 128Mi) and request (10Mi to 50Mi), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts.

Outcome. Service recovered in ~4 minutes without any manual cluster interaction.

Side by side comparison

Dimension | Incident 1: CPU starvation | Incident 2: OOMKilled
Trigger | Azure Monitor alert (automated) | Engineer chat prompt (ad hoc)
Failure mode | CPU too low for startup probe to pass | Memory limit too low for process to start
Key signal | Exit code 1, probe timeout | Exit code 137, empty container logs
Blast radius | 4 workloads affected cluster wide | 1 workload in target namespace
Remediation | CPU request/limit patches across 4 deployments | Memory request/limit patch on 1 deployment
MTTR | ~8 min | ~4 min
Human interventions | 0 | 0

Why this matters

Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc kubectl commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow.

The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps.

In this lab, the impact was measurable:

Metric | This demo with Azure SRE Agent
Alert to recovery | ~4 to 8 min
Human interventions | 0
Scope of investigation | Cluster wide, automated
Correlate evidence and diagnose | ~2 min
Apply fix and verify | ~4 min
Post incident follow-up | GitHub issue + draft PR

These results came from a controlled run on April 10, 2026. Real world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments.
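The "key signal" row in the comparison rests on a general convention: an exit code above 128 usually means the process was killed by signal number (code - 128), so 137 corresponds to SIGKILL (9). A small sketch of that triage, plus the utilization arithmetic quoted for Incident 2, follows. The function names are illustrative only and this is not how the agent itself is implemented. In a real cluster, the container's last termination reason (OOMKilled) is the stronger evidence, because SIGKILL alone can have other causes.

```python
SIGKILL = 9  # POSIX signal number for SIGKILL


def classify_exit(exit_code: int) -> str:
    """Map a container exit code to a coarse failure class."""
    if exit_code == 0:
        return "clean exit"
    if exit_code > 128:
        # By convention, 128 + N means "terminated by signal N".
        sig = exit_code - 128
        if sig == SIGKILL:
            return "killed by SIGKILL (consistent with OOMKill)"
        return f"killed by signal {sig}"
    return "application error (process exited on its own)"


def utilization_pct(used_mi: float, limit_mi: float) -> int:
    """Memory utilization as a whole-number percentage of the limit."""
    return round(100 * used_mi / limit_mi)


print(classify_exit(1))          # Incident 1: startup failure
print(classify_exit(137))        # Incident 2: 128 + 9
print(utilization_pct(74, 128))  # 74Mi of a 128Mi limit
```

The last line reproduces the 58% figure reported after the memory patch (74 / 128 = 0.578).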
Teams + GitHub follow-up

Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post-incident path matters.

After Incident 1 resolved, Azure SRE Agent used the GitHub connector to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix. The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review still remains the final control point before merge. Setup details for the GitHub connector are in the demo repo README, and the official reference is in the Azure SRE Agent docs.

Azure SRE Agent fixes the live issue, and the GitHub follow-up prepares the durable source change so future deployments do not reintroduce the same configuration problem.

The operations to engineering handoff: Azure SRE Agent fixed the live cluster; GitHub Copilot agent prepares the durable source change so the same misconfiguration can't ship again.

In parallel, the Teams connector posted milestone updates during the incident:
- Investigation started.
- Root cause and remediation identified.
- Incident resolved.

Teams handled real-time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery.

Key takeaways

Three things to carry forward:
1. Treat Azure SRE Agent as a governed incident response system, not a chatbot with infrastructure access. The most important controls are permission levels and run modes, not prompt quality.
2. Anchor detection in your existing incident platforms. For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live.
3. Use connectors to extend the workflow outward. Teams for real-time coordination, GitHub for durable engineering follow-up.

Start where you're comfortable. If you are just getting your feet wet, begin with one resource group, one incident type, and Review mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling Autonomous. Expand only once each layer is trusted.

Next steps
- Add Prometheus-based alert coverage for ImagePullBackOff and node resource pressure to complement the pod phase rule.
- Expand to multi-cluster managed scopes once the single-cluster path is trusted and validated.
- Explore how NAP and Azure SRE Agent complement each other: NAP manages infrastructure capacity, while the agent investigates and remediates incidents.

I'd like to thank Cary Chai, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review; his feedback sharpened both the accuracy and quality of this post.

Secure, Keyless Application Access with Managed Identities - Now GA in Azure Files SMB
As enterprises modernize applications and strengthen their security posture, identity is central to how applications access shared storage. Traditional identity models relying on account keys, stored credentials, or domain-joined infrastructure add operational overhead and introduce security risks such as credential leakage, lack of identity attribution, and excessive privilege if shared keys are compromised.

Today, we are excited to announce the General Availability (GA) of Managed Identity support for Azure Files over SMB, enabling applications and virtual machines to securely access Azure Files without secrets, passwords, or key distribution.

Managed Identity support enables customers to meet modern enterprise security standards without reliance on storage account keys, streamlining how organizations securely enable file-based application access and reducing the operational overhead of filing internal exceptions. New storage accounts can support secure, identity-based SMB access out of the box, while existing deployments can be secured by enabling Managed Identity authentication.

From web application workloads such as WordPress, to databases on Azure Kubernetes Service (AKS), to CI/CD pipelines, applications require secure access. In a world where security is foundational, continued reliance on key-based access conflicts with Zero Trust principles and least-privilege access.

What’s New In GA

AKS Workload Identity Support

AKS Workload Identity (preview) extends the traditional managed identity model for Kubernetes by shifting the identity from the node to pods. Instead of inheriting the identity of the underlying cluster, each Kubernetes pod can use its own federated identity, mapped directly to a Microsoft Entra ID principal.
This feature enables:
- Pod-level identity isolation, rather than cluster-level
- Least-privilege access with secure RBAC
- Seamless scaling and redeployment, without identity reconfiguration
- No secrets, no key rotation, no credential injection

When combined with Azure Files over SMB, Workload Identity allows AKS workloads to access shared file storage securely and natively per pod, using the same identity-driven model as cluster-level managed identities. Available with AKS 1.35 and aimed particularly at customers in the financial services industry, AKS Workload Identity enables per-application, least-privilege access to Azure Files without credentials, improving isolation and auditability. This allows regulated, stateful workloads to run securely on AKS while meeting strict compliance and regulatory requirements.

Co-existence of Application Identities and end-user identity access

Azure Files now enables both Managed Identity and end-user access on the same storage account, with users and applications independently authenticated via Entra ID and authorized through a shared permissions model. This unified access model eliminates the need for duplicate storage or credentials, enabling secure collaboration, troubleshooting, and automation on shared data without compromising governance or compliance.

This supports scenarios such as:
- Developers accessing the same file share as an application for debugging
- Admins managing content used by automated workflows
- Hybrid environments with user-driven and app-driven access

Simplified Storage Account enablement via the Azure portal

We have now added a dedicated Managed Identity property that makes enabling identity-based SMB access simple and transparent via the Azure portal for new as well as existing storage accounts. With a single configuration at the storage account level, customers can allow applications to authenticate to Azure Files using Managed Identities.
This portal experience supports incremental adoption, making it easy to modernize authentication while maintaining compatibility with existing user access and governance models.

Get Started with Managed Identities with SMB Azure Files

Start using Managed Identities with Azure Files today at no additional cost. This feature is supported on HDD and SSD SMB shares across all billing models. Refer to our documentation for complete set-up guidance. Whether provisioning new storage or enhancing existing deployments, this capability provides secure, enterprise-grade access with a streamlined configuration experience.

For any questions, reach out to the team at azurefiles@microsoft.com.

AKS App Routing's Next Chapter: Gateway API with Istio
If you've been following my previous posts on the Ingress NGINX retirement, you'll know the story so far. The community Ingress NGINX project was retired in March 2026, and Microsoft's extended support for the NGINX-based App Routing add-on runs until November 2026. I've covered migrating from standalone NGINX to the App Routing add-on to buy time, and migrating to Application Gateway for Containers as a long-term option. In both of those posts I mentioned that Microsoft was working on a new version of the App Routing add-on based on Istio and the Gateway API. Well, it's here, in preview at least.

The App Routing Gateway API implementation is Microsoft's recommended migration path for anyone currently using the NGINX-based App Routing add-on. It moves you off NGINX entirely and onto the Kubernetes Gateway API, with a lightweight Istio control plane handling the gateway infrastructure under the hood. Let's look at what this actually is, how it differs from other options, and how to migrate from both standalone NGINX and the existing App Routing add-on.

What Is It?

The new App Routing mode uses the Kubernetes Gateway API instead of the Ingress API. When you enable the add-on, AKS deploys an Istio control plane (istiod) to manage Envoy-based gateway proxies. The important thing to understand here is that this is not the full Istio service mesh. There's no sidecar injection, no Istio CRDs installed for your workloads. It's Istio doing one specific job: managing gateway proxies for ingress traffic.

When you create a Gateway resource, AKS provisions an Envoy Deployment, a LoadBalancer Service, a HorizontalPodAutoscaler (defaulting to 2-5 replicas at 80% CPU), and a PodDisruptionBudget. All managed. You write Gateway and HTTPRoute resources, and AKS handles everything else.

This is a fundamentally different API from what you're used to with Ingress.
Instead of a single Ingress resource that combines the entry point and routing rules, Gateway API splits things into layers:
- GatewayClass defines the type of gateway infrastructure (provided by AKS in this case)
- Gateway creates the actual gateway with its listeners
- HTTPRoute defines the routing rules and attaches to a Gateway

This separation is one of Gateway API's main selling points. Platform teams can own the Gateway resources while application teams manage their own HTTPRoutes independently, without needing to modify shared infrastructure. If you've ever had a team accidentally break routing for everyone by editing a shared Ingress, you'll appreciate why this matters.

How It Differs From the Istio Service Mesh Add-On

If you're already running or considering the Istio service mesh add-on for AKS, this is a different thing. The App Routing Gateway API mode uses the approuting-istio GatewayClass, doesn't install Istio CRDs, doesn't enable sidecar injection, and handles upgrades in-place. The full Istio service mesh add-on uses the istio GatewayClass, installs Istio CRDs cluster-wide, enables sidecar injection, and uses canary upgrades for minor versions.

The two cannot run at the same time. If you have the Istio service mesh add-on enabled, you need to disable it before enabling App Routing Gateway API (and vice versa). If you need full mesh capabilities like mTLS between services, traffic policies, and telemetry, stick with the Istio service mesh add-on. If you just need managed ingress via Gateway API without the mesh overhead, this is the right choice.

Current Limitations

The new App Routing solution is in preview, so it should not be run in production yet. There are also some gaps compared to the existing add-on, which you need to be aware of before planning a production migration.

The biggest one: DNS and TLS certificate management via the add-on isn't supported yet for Gateway API.
If you're currently using az aks approuting update and az aks approuting zone add to automate Key Vault and Azure DNS integration with the NGINX-based add-on, that workflow doesn't carry over. TLS termination is still possible, but you'll need to set it up manually. The AKS docs cover the steps, but it's more hands-on than what the NGINX add-on gives you today. This is expected to be addressed when the feature reaches GA.

SNI passthrough (TLSRoute) and egress traffic management aren't supported either. And as mentioned, it's mutually exclusive with the Istio service mesh add-on.

For production workloads that depend heavily on automated DNS and TLS management, you may want to wait until GA, or look at Application Gateway for Containers as an alternative. But for teams that can handle TLS setup manually, or for non-production environments, there's no reason not to start testing this now.

Getting Started

Before you can enable the feature, you need the aks-preview CLI extension (version 19.0.0b24 or later), the Managed Gateway API CRDs enabled, and the App Routing Gateway API preview feature flag registered:

    az extension add --name aks-preview
    az extension update --name aks-preview

    # Managed Gateway API CRDs (required dependency)
    az feature register --namespace "Microsoft.ContainerService" --name "ManagedGatewayAPIPreview"

    # App Routing Gateway API implementation
    az feature register --namespace "Microsoft.ContainerService" --name "AppRoutingIstioGatewayAPIPreview"

Feature flag registration can take a few minutes. Once both flags are registered, enable the add-on on a new or existing cluster.
You need both --enable-gateway-api (for the managed Gateway API CRD installation) and --enable-app-routing-istio (for the Istio-based implementation):

    # New cluster
    az aks create \
      --resource-group ${RESOURCE_GROUP} \
      --name ${CLUSTER} \
      --location swedencentral \
      --enable-gateway-api \
      --enable-app-routing-istio

    # Existing cluster
    az aks update \
      --resource-group ${RESOURCE_GROUP} \
      --name ${CLUSTER} \
      --enable-gateway-api \
      --enable-app-routing-istio

Verify istiod is running:

    kubectl get pods -n aks-istio-system

You should see two istiod pods in a Running state. From here, you can create a Gateway and HTTPRoute to test traffic flow. The AKS quickstart walks through this with the httpbin sample app if you want a quick validation.

Migrating From NGINX Ingress

Whether you're running standalone NGINX (self-installed via Helm) or the NGINX-based App Routing add-on, the migration process is essentially the same. You're moving from Ingress API resources to Gateway API resources, and the new controller runs alongside your existing one during the transition. The only real differences are what you're cleaning up at the end and, if you're on the App Routing add-on, whether you were relying on its built-in DNS and TLS automation.

Inventory Your Ingress Resources

Before anything else, understand what you have:

    kubectl get ingress --all-namespaces \
      -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName'

Look specifically for custom snippets, lua configurations, or anything that relies heavily on NGINX-specific behaviour. These won't have direct equivalents in Gateway API and will need manual attention.

Convert Ingress Resources to Gateway API

The ingress2gateway tool (v1.0.0) handles conversion of Ingress resources to Gateway API equivalents. It supports over 30 common NGINX annotations and generates Gateway and HTTPRoute YAML.
It works regardless of whether your Ingress resources use the nginx or webapprouting.kubernetes.azure.com IngressClass:

    # Install
    go install github.com/kubernetes-sigs/ingress2gateway@v1.0.0

    # Convert from live cluster
    ingress2gateway print --providers=ingress-nginx -A > gateway-resources.yaml

    # Or convert from a local file
    ingress2gateway print --providers=ingress-nginx --input-file=./manifests/ingress.yaml > gateway-resources.yaml

Review the output carefully. The tool flags annotations it can't convert as comments in the generated YAML, so you'll know exactly what needs manual work. Common gaps include custom configuration snippets and regex-based rewrites that don't map cleanly to Gateway API's routing model.

Make sure you update the gatewayClassName in the generated Gateway resources to approuting-istio. The tool may generate a generic GatewayClass name that you'll need to change.

Handle DNS and TLS

If you're coming from standalone NGINX, you're likely managing DNS and TLS yourself already, so nothing changes here: just make sure your certificate Secrets and DNS records are ready for the new Gateway IP.

If you're coming from the App Routing add-on and relying on its built-in DNS and TLS management (via az aks approuting zone add and Key Vault integration), this is the part that needs extra thought. That automation doesn't carry over to the Gateway API implementation yet, so you'll need to handle it differently until GA.

For TLS, you can either create Kubernetes Secrets with your certificates manually or set up a workflow to sync them from Key Vault. The AKS docs on securing Gateway API traffic cover the manual approach. For DNS, you'll need to manage records yourself or use ExternalDNS to automate it. ExternalDNS supports Gateway API resources, so this is a viable path if you want automation.
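To make the manual TLS path a little more concrete, here is a rough sketch that builds a Gateway manifest with an HTTPS listener terminating TLS from a Kubernetes Secret, and prints it as JSON (which kubectl apply also accepts). The listener fields follow the upstream Gateway API v1 schema. The gateway name, hostname, and Secret name are hypothetical placeholders, and the AKS preview may require settings beyond what is shown here, so treat this as a starting point and check it against the AKS docs.

```python
import json


def tls_gateway(name: str, hostname: str, secret_name: str) -> dict:
    """Build a Gateway manifest with one TLS-terminating HTTPS listener."""
    return {
        "apiVersion": "gateway.networking.k8s.io/v1",
        "kind": "Gateway",
        "metadata": {"name": name},
        "spec": {
            # GatewayClass named in the post for the App Routing add-on.
            "gatewayClassName": "approuting-istio",
            "listeners": [
                {
                    "name": "https",
                    "hostname": hostname,
                    "port": 443,
                    "protocol": "HTTPS",
                    "tls": {
                        "mode": "Terminate",
                        # Secret of type kubernetes.io/tls in the same
                        # namespace as the Gateway (you create this yourself).
                        "certificateRefs": [
                            {"kind": "Secret", "name": secret_name}
                        ],
                    },
                }
            ],
        },
    }


# Placeholder values for illustration only.
manifest = tls_gateway("web-gateway", "myapp.example.com", "myapp-tls")
print(json.dumps(manifest, indent=2))
```

Piping the output to a file and applying it alongside the converted HTTPRoutes keeps the gatewayClassName change and the TLS listener in one reviewable place.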
Deploy and Validate

With the add-on enabled, apply your converted resources:

    kubectl apply -f gateway-resources.yaml

Wait for the Gateway to be programmed and get the external IP:

    kubectl wait --for=condition=programmed gateways.gateway.networking.k8s.io <gateway-name>
    export GATEWAY_IP=$(kubectl get gateways.gateway.networking.k8s.io <gateway-name> -ojsonpath='{.status.addresses[0].value}')

The key thing here is that your existing NGINX controller (whether standalone or add-on managed) is still running and serving production traffic. The Gateway API resources are handled separately by the Istio-based controller in aks-istio-system. This parallel running is what makes the migration safe.

Test your routes against the new Gateway IP. You'll need to pass your application's hostname as the Host header, because your DNS will still be pointing at the NGINX add-on at this point:

    curl -H "Host: myapp.example.com" http://$GATEWAY_IP

Run your full validation suite. Check TLS, path routing, headers, authentication, anything your applications depend on. Take your time here; nothing changes for production until you update DNS.

Cut Over DNS and Clean Up

Once you're confident, lower your DNS TTL to 60 seconds (do this well in advance), then update your DNS records to point to the new Gateway IP. Keep the old NGINX controller running for 24-48 hours as a rollback option. After traffic has been flowing cleanly through the Gateway API path, clean up the old setup.
What this looks like depends on where you started.

If you were on standalone NGINX:

    helm uninstall ingress-nginx -n ingress-nginx
    kubectl delete namespace ingress-nginx

If you were on the App Routing add-on with NGINX, verify nothing is still using the old IngressClass:

    kubectl get ingress --all-namespaces \
      -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \
      | grep "webapprouting"

Delete any remaining Ingress resources that reference the old class, then disable the NGINX-based App Routing add-on:

    az aks approuting disable --resource-group ${RESOURCE_GROUP} --name ${CLUSTER}

Some resources (configMaps, secrets, and the controller deployment) will remain in the app-routing-system namespace after disabling. You can clean these up by deleting the namespace once you're satisfied everything is running through the Gateway API path:

    kubectl delete ns app-routing-system

In both cases, clean up any old Ingress resources that are no longer being used.

Upgrades and Lifecycle

The Istio control plane version is tied to your AKS cluster's Kubernetes version. AKS automatically handles patch upgrades as part of its release cycle, and minor version upgrades happen in-place when you upgrade your cluster's Kubernetes version or when a new Istio minor version is released for your AKS version.

One thing to be aware of: unlike the Istio service mesh add-on, upgrades here are in-place, not canary-based. The HPA and PDB on each Gateway help minimise disruption, but plan accordingly for production. If you have maintenance windows configured, the istiod upgrades will respect them.

What Should You Do Now?

The timeline hasn't changed. The standalone NGINX Ingress project was retired in March 2026, so if you're still running that, you're already on unsupported software. The NGINX App Routing add-on is supported until November 2026, which gives you a window, but it's not a long one.
If you're on standalone NGINX: you could get onto the App Routing add-on now to buy time (I covered this in my earlier post), then plan your migration to either the Gateway API mode or AGC.

If you're on the NGINX App Routing add-on: start testing the Gateway API mode in non-production now. Get familiar with the Gateway API resource model, understand the TLS and DNS gaps in the preview, and be ready to migrate when the feature reaches GA or when November gets close, whichever comes first.

If you need production-ready TLS and DNS automation today and can't wait for GA, App Gateway for Containers is your best option right now.

Whatever path you choose, make sure you have a plan in place before November. Running unsupported ingress software on production infrastructure isn't where you want to be.

Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent, your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

Azure Monitor in Azure SRE Agent: Autonomous Alert Investigation and Intelligent Merging
Azure Monitor is great at telling you something is wrong. But once the alert fires, the real work begins: someone has to open the portal, triage it, dig into logs, and figure out what happened. That takes time. And while they're investigating, the same alert keeps firing every few minutes, stacking up duplicates of a problem that's already being looked at.

This is exactly what Azure SRE Agent's Azure Monitor integration addresses. The agent picks up alerts as they fire, investigates autonomously, and remediates when it can, all without waiting for a human to get involved. And when that same alert fires again while the investigation is still underway, the agent merges it into the existing thread rather than creating a new one.

In this blog, we'll walk through the full Azure Monitor experience in SRE Agent with a live AKS + Redis scenario: how alerts get picked up, what the agent does with them, how merging handles the noise, and why one often-overlooked setting (auto-resolve) makes a bigger difference than you'd expect.

Key Takeaways

- Set up Incident Response Plans to scope which alerts the agent handles: filter by severity, title patterns, and resource type. Start with review mode, then promote to autonomous once you trust the agent's behavior for that failure pattern.
- Recurring alerts merge into one thread automatically: when the same alert rule fires repeatedly, the agent merges subsequent firings into the existing investigation instead of creating duplicates.
- Turn auto-resolve OFF for persistent failures (bad credentials, misconfigurations, resource exhaustion) so all firings merge into one thread. Turn it ON for transient issues (traffic spikes, brief timeouts) so each gets a fresh investigation.
- Design alert rules around failure categories, not components: one alert rule = one investigation thread. Structure rules by symptom (Redis errors, HTTP errors, pod health) to give the agent focused, non-overlapping threads.
- Attach Custom Response Plans for specialized handling: route specific alert patterns to custom-agents with custom instructions, tools, and runbooks.

It Starts with Any Azure Monitor Alert

Before we get to the demo, a quick note on what SRE Agent actually watches. The agent queries the Azure Alerts Management REST API, which returns every fired alert regardless of signal type. Log search alerts, metric alerts, activity log alerts, smart detection, service health, Prometheus: all of them come through the same API, and the agent processes them all the same way. You don't need to configure connectors or webhooks per alert type. If it fires in Azure Monitor, the agent can see it.

What you do need to configure is which alerts the agent should care about. That's where Incident Response Plans come in.

Setting Up: Incident Response Plans and Alert Rules

We start by heading to Settings > Incident Platform > Azure Monitor and creating an Incident Response Plan. Response Plans let you scope the agent's attention by severity, alert name patterns, target resource types, and (importantly) whether the agent should act autonomously or wait for human approval.

Action: Match the agent mode to your confidence in the remediation, not just the severity. Use autonomous mode for well-understood failure patterns where the fix is predictable and safe (e.g., rolling back a bad config, restarting a pod). Use review mode for anything where you want a human to validate before the agent acts, especially Sev0/Sev1 alerts that touch critical systems. You can always start in review mode and promote to autonomous once you've validated the agent's behavior.

For our demo, we created a Sev1 response plan in autonomous mode, meaning the agent would pick up any Sev1 alert and immediately start investigating and remediating, no approval needed.

On the Azure Monitor side, we set up three log-based alert rules against our AKS cluster's Log Analytics workspace.
The star of the show was a Redis connection error alert: a custom log search query looking for WRONGPASS, ECONNREFUSED, and other Redis failure signatures in ContainerLog.

Each rule evaluates every 5 minutes with a 15-minute aggregation window. If the query returns any results, the alert fires. Simple enough.

Breaking Redis (On Purpose)

Our test app is a Node.js journal app on AKS, backed by Azure Cache for Redis. To create a realistic failure scenario, we updated the Redis password in the Kubernetes secret to a wrong value. The app pods picked up the bad credential, Redis connections started failing, and error logs started flowing. Within minutes, the Redis connection error alert fired.

What Happened Next

Here's where it gets interesting. We didn't touch anything; we just watched.

The agent's scanner polls the Azure Monitor Alerts API every 60 seconds. It spotted the new alert (state: "New", condition: "Fired"), matched it against our Sev1 Incident Response Plan, and immediately acknowledged it in Azure Monitor, flipping the state to "Acknowledged" so other systems and humans know someone's on it.

Then it created a new investigation thread. The thread included everything the agent needed to get started: the alert ID, rule name, severity, description, affected resource, subscription, resource group, and a deep-link back to the Azure Portal alert.

From there, the agent went to work autonomously. It queried container logs, identified the Redis WRONGPASS errors, traced them to the bad secret, retrieved the correct access key from Azure Cache for Redis, updated the Kubernetes secret, and triggered a pod rollout. By the time we checked the thread, it was already marked "Completed."

No pages. No human investigation. No context-switching.

But the Alert Kept Firing...

Here's the thing: our alert rule evaluates every 5 minutes. Between the first firing and the agent completing the fix, the alert fired again. And again. Seven times total over 35 minutes.
Without intelligent handling, that would mean seven separate investigation threads. Seven notifications. Seven disruptions.

SRE Agent handles this with alert merging. When a subsequent firing comes in for the same alert rule, the agent checks: is there already an active thread for this rule, created within the last 7 days, that hasn't been resolved or closed? If yes, the new firing gets silently merged into the existing thread: the total alert count goes up, the "Last fired" timestamp updates, and that's it. No new thread, no new notification, no interruption to the ongoing investigation.

How merging decides: new thread or merge?

| Condition | Result |
|---|---|
| Same alert rule, existing thread still active | Merged: alert count increments, no new thread |
| Same alert rule, existing thread resolved/closed | New thread: fresh investigation starts |
| Different alert rule | New thread: always separate |

Five minutes after the first alert, the second firing came in, and the pattern continued. The agent finished the fix and closed the thread, and the final tally was one thread, seven merged alerts, spanning 35 minutes of continuous firings.

On the Azure Portal side, you can see all seven individual alert instances. Each one was acknowledged by the agent.

[Screenshot: 7 Redis Connection Error Alert entries, all Sev1, Fired condition, Closed by user, spanning 8:50 PM to 9:21 PM]

Seven firings. One investigation. One fix. That's the merge in action.

The Auto-Resolve Twist

Now here's the part we didn't expect to matter as much as it did. Azure Monitor has a setting called "Automatically resolve alerts". When enabled, Azure Monitor automatically transitions an alert to "Resolved" once the underlying condition clears, for example when the Redis errors stop because the pod restarted.

For our first scenario above, we had auto-resolve turned off. That's why the alert stayed in "Fired" state across all seven evaluation cycles, and all seven firings merged cleanly into one thread. But what happens if auto-resolve is on?
We turned it on and ran the same scenario again. Here's what happened: Redis broke. Alert fired. Agent picked it up and created a thread. The agent investigated, found the bad Redis password, fixed it. With Redis working again, error logs stopped. We noticed that the condition cleared and closed all 7 alerts manually. We broke Redis a second time (simulating a recurrence). The alert fired again, but the previous alert was already closed/resolved. The merge check found no active thread. A brand-new thread was created, and the issue was reinvestigated and mitigated.

Two threads for the same alert rule, right there on the Incidents page. And on the Azure Monitor side, the newest alert shows the "Resolved" condition; that's the auto-resolve doing its thing.

For a persistent failure like a Redis misconfiguration, this is clearly worse. You get a new investigation thread every break-fix cycle instead of one continuous investigation.

So, Should You Just Turn Auto-Resolve Off?

No. It depends on what kind of failure the alert is watching for.

Quick Reference: Auto-Resolve Decision Guide

| | Auto-Resolve OFF | Auto-Resolve ON |
|---|---|---|
| Use when | Problem persists until fixed | Problem is transient and self-correcting |
| Examples | Bad credentials, misconfigurations, CrashLoopBackOff, connection pool exhaustion, IOPS limits | OOM kills during traffic spikes, brief latency from neighboring deployments, one-off job timeouts |
| Merge behavior | All repeat firings merge into one thread | Each break-fix cycle creates a new thread |
| Best for | Agent is actively managing the alert lifecycle | Each occurrence may have a different root cause |
| Tradeoff | Alerts stay in "Fired/Acknowledged" state in Azure Monitor until the agent closes them | More threads, but each gets a clean investigation |

Turn auto-resolve OFF when you want repeated firings from the same alert rule to stay in a single investigation thread until the alert is explicitly resolved or closed in Azure Monitor.
This works best for persistent issues such as a Kubernetes deployment stuck in CrashLoopBackOff because of a bad image tag, a database connection pool exhausted due to a leaked connection, or a storage account hitting its IOPS limit under sustained load.

Turn auto-resolve ON when you want a new investigation thread after the previous occurrence has been resolved or closed in Azure Monitor. This works best for episodic or self-clearing issues such as a pod getting OOM-killed during a temporary traffic spike, a brief latency increase during a neighboring service’s deployment, or a scheduled job that times out once due to short-lived resource contention.

The key question is: when this alert fires again, is it the same ongoing problem or a new one? If it's the same problem, turn auto-resolve off and let the merges do their job. If it's a new problem, leave auto-resolve on and let the agent investigate fresh.

Note: These behaviors describe how SRE Agent groups alert investigations and may differ from how Azure Monitor documents native alert state behavior.

A Few Things We Learned Along the Way

Design alert rules around symptoms, not components. Each alert rule maps to one investigation thread. We structured ours around failure categories: root cause signal (Redis errors, Sev1), blast radius signal (HTTP errors, Sev2), infrastructure signal (unhealthy pods, Sev2). This gave the agent focused threads without overlap.

Incident Response Plans let you tier your response. Not every alert needs the agent to go fix things immediately. We used a Sev1 filter in autonomous mode for the Redis alert, but you could set up a Sev2 filter in review mode, where the agent investigates and provides analysis but waits for human approval before taking action.

Response Plans specialize the agent. For specific alert patterns, you can give the agent custom instructions, specialized tools, and a tailored system prompt.
A Redis alert can route to a custom-agent loaded with Redis-specific runbooks; a Kubernetes alert can route to one with deep kubectl expertise.

Best Practices Checklist

Here's what we learned distilled into concrete actions:

Alert Rule Design

| Do | Don't |
|---|---|
| Design rules around failure categories (root cause, blast radius, infra health) | Create one alert per component; you'll get overlapping threads |
| Set evaluation frequency and aggregation window to match the failure pattern | Use the same frequency for everything; transient vs. persistent issues need different cadences |

Example rule structure from our test:

- Root cause signal: Redis WRONGPASS/ECONNREFUSED errors → Sev1
- Blast radius signal: HTTP 5xx response codes → Sev2
- Infrastructure signal: KubeEvents Reason="Unhealthy" → Sev2

Incident Response Plan Setup

| Do | Don't |
|---|---|
| Create separate response plans per severity tier | Use one catch-all filter for everything |
| Start with review mode, especially for Sev0/Sev1 where wrong fixes are costly | Jump straight to autonomous mode on critical alerts without validating agent behavior first |
| Promote to autonomous mode once you've validated the agent handles a specific failure pattern correctly | Assume severity alone determines the right mode; it's about confidence in the remediation |

Response Plans

| Do | Don't |
|---|---|
| Attach custom response plans to specific alert patterns for specialized handling | Leave every alert to the agent's general knowledge |
| Include custom instructions, tools, and runbooks relevant to the failure type | Write generic instructions; the more specific, the better the investigation |
| Route Redis alerts to a Redis-specialized custom-agent; K8s alerts to one with kubectl expertise | Assume one agent configuration fits all failure types |

Getting Started

- Head to sre.azure.com and open your agent
- Make sure the agent's managed identity has Monitoring Reader on your target subscriptions
- Go to Settings > Incident Platform > Azure Monitor and create your Incident Response Plans
- Review the auto-resolve setting on your alert rules: turn it off for persistent issues, leave it on for transient ones (see the decision guide above)
- Start with a test response plan using Title Contains to target a specific alert rule, and validate agent behavior before broadening
- Watch the Incidents page and review the agent's investigation threads before expanding to more alert rules

Learn More

- Azure SRE Agent Documentation
- Incident Response Guide
- Azure Monitor Alert Rules

Announcing a flexible, predictable billing model for Azure SRE Agent
Billing for Azure SRE Agent will start on September 1, 2025. Announced at Microsoft Build 2025, Azure SRE Agent is a pre-built AI agent for root cause analysis, uptime improvement, and operational cost reduction. Learn more about the billing model and example scenarios.

The Durable Task Scheduler Consumption SKU is now Generally Available
Today, we're excited to announce that the Durable Task Scheduler Consumption SKU has reached General Availability. Developers can now run durable workflows and agents on Azure with pay-per-use pricing, no storage to manage, no capacity to plan, and no idle costs. Just create a scheduler, connect your app, and start orchestrating. Whether you're coordinating AI agent workflows, processing event-driven pipelines, or running background jobs, the Consumption SKU is ready to go. GET STARTED WITH THE DURABLE TASK SCHEDULER CONSUMPTION SKU Since launching the Consumption SKU in public preview last November, we've seen incredible adoption and have incorporated feedback from developers around the world to ensure the GA release is truly production ready. “The Durable Task Scheduler has become a foundational piece of what we call ‘workflows’. It gives us the reliability guarantees we need for processing financial documents and sensitive workflows, while keeping the programming model straightforward. The combination of durable execution, external event correlation, deterministic idempotency, and the local emulator experience has made it a natural fit for our event-driven architecture. We have been delighted with the consumption SKUs cost model for our lower environments.”– Emily Lewis, CarMax What is the Durable Task Scheduler? If you're new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background: Announcing Limited Early Access of the Durable Task Scheduler Announcing Workflow in Azure Container Apps with the Durable Task Scheduler Announcing Dedicated SKU GA & Consumption SKU Public Preview In brief, the Durable Task Scheduler is a fully managed orchestration backend for durable execution on Azure, meaning your workflows and agent sessions can reliably resume and run to completion, even through process failures, restarts, and scaling events. 
Whether you’re running workflows or orchestrating durable agents, it handles task scheduling, state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing their own execution engines and storage backends. The Durable Task Scheduler works across Azure compute environments: Azure Functions: Using the Durable Functions extension across all Function App SKUs, including Flex Consumption. Azure Container Apps: Using the Durable Functions or Durable Task SDKs with built-in workflow support and auto-scaling. Any compute: Azure Kubernetes Service, Azure App Service, or any environment where you can run the Durable Task SDKs (.NET, Python, Java, JavaScript). Why choose the Consumption SKU? With the Consumption SKU you’re charged only for actions dispatched, with no minimum commitments or idle costs. There’s no capacity to size or throughput to reserve. Create a scheduler, connect your app, and you’re running. The Consumption SKU is a natural fit for workloads with unpredictable or bursty usage patterns: AI agent orchestration: Multi-step agent workflows that call LLMs, retrieve data, and take actions. Users trigger these on demand, so volume is spiky and pay-per-use avoids idle costs between bursts. Event-driven pipelines: Processing events from queues, webhooks, or streams with reliable orchestration and automatic checkpointing, where volumes spike and dip unpredictably. API-triggered workflows: User signups, form submissions, payment flows, and other request-driven processing where volume varies throughout the day. Distributed transactions: Retries and compensation logic across microservices with durable sagas that survive failures and restarts. What's included in the Consumption SKU at GA The Consumption SKU has been hardened based on feedback and real-world usage during the public preview. 
Here's what's included at GA: Performance Up to 500 actions per second: Sufficient throughput for a wide range of workloads, with the option to move to the Dedicated SKU for higher-scale scenarios. Up to 30 days of data retention: View and manage orchestration history, debug failures, and audit execution data for up to 30 days. Built-in monitoring dashboard Filter orchestrations by status, drill into execution history, view visual Gantt and sequence charts, and manage orchestrations (pause, resume, terminate, or raise events), all from the dashboard, secured with Role-Based Access Control (RBAC). Identity-based security The Consumption SKU uses Entra ID for authentication and RBAC for authorization. No SAS tokens or access keys to manage, just assign the appropriate role and connect. Get started with the Durable Task Scheduler today The Consumption SKU is now Generally Available. Provision a scheduler in the Azure portal, connect your app, and start orchestrating. You only pay for what you use. Resources: Documentation, Getting started, Samples, Pricing, Consumption SKU docs. We'd love to hear your feedback. Reach out to us by filing an issue on our GitHub repository.

Building the agentic future together at JDConf 2026
JDConf 2026 is just weeks away, and I’m excited to welcome Java developers, architects, and engineering leaders from around the world for two days of learning and connection. Now in its sixth year, JDConf has become a place where the Java community compares notes on their real-world production experience: patterns, tooling, and hard-earned lessons you can take back to your team, while we keep moving the Java systems that run businesses and services forward in the AI era. This year’s program lines up with a shift many of us are seeing first-hand: delivery is getting more intelligent, more automated, and more tightly coupled to the systems and data we already own. Agentic approaches are moving from demos to backlog items, and that raises practical questions: what’s the right architecture, where do you draw trust boundaries, how do you keep secrets safe, and how do you ship without trading reliability for novelty? JDConf is for and by the people who build and manage the mission-critical apps powering organizations worldwide. Across three regional livestreams, you’ll hear from open source and enterprise practitioners who are making the same tradeoffs you are—velocity vs. safety, modernization vs. continuity, experimentation vs. operational excellence. Expect sessions that go beyond “what” and get into “how”: design choices, integration patterns, migration steps, and the guardrails that make AI features safe to run in production. You’ll find several practical themes for shipping Java in the AI era: connecting agents to enterprise systems with clear governance; frameworks and runtimes adapting to AI-native workloads; and how testing and delivery pipelines evolve as automation gets more capable. 
To make this more concrete, a sampling of sessions would include topics like Secrets of Agentic Memory Management (patterns for short- and long-term memory and safe retrieval), Modernizing a Java App with GitHub Copilot (end-to-end upgrade and migration with AI-powered technologies), and Docker Sandboxes for AI Agents (guardrails for running agent workflows without risking your filesystem or secrets). The goal is to help you adopt what’s new while hardening your long lived codebases. JDConf is built for community learning—free to attend, accessible worldwide, and designed for an interactive live experience in three time zones. You’ll not only get 23 practitioner-led sessions with production-ready guidance but also free on-demand access after the event to re-watch with your whole team. Pro tip: join live and get more value by discussing practical implications and ideas with your peers in the chat. This is where the “how” details and tradeoffs become clearer. JDConf 2026 Keynote Building the Agentic Future Together Rod Johnson, Embabel | Bruno Borges, Microsoft | Ayan Gupta, Microsoft The JDConf 2026 keynote features Rod Johnson, creator of the Spring Framework and founder of Embabel, joined by Bruno Borges and Ayan Gupta to explore where the Java ecosystem is headed in the agentic era. Expect a practitioner-level discussion on how frameworks like Spring continue to evolve, how MCP is changing the way agents interact with enterprise systems, and what Java developers should be paying attention to right now. Register. Attend. Earn. Register for JDConf 2026 to earn Microsoft Rewards points, which you can use for gift cards, sweepstakes entries, and more. Earn 1,000 points simply by signing up. When you register for any regional JDConf 2026 event with your Microsoft account, you'll automatically receive these points. Get 5,000 additional points for attending live (limited to the first 300 attendees per stream). 
On the day of your regional event, check in through the Reactor page or your email confirmation link to qualify. Disclaimer: Points are added to your Microsoft account within 60 days after the event. Must register with a Microsoft account email. Up to 10,000 developers eligible. Points will be applied upon registration and attendance and will not be counted multiple times for registering or attending at different events. Terms | Privacy JDConf 2026 Regional Live Streams Americas – April 8, 8:30 AM – 12:30 PM PDT (UTC -7) Bruno Borges hosts the Americas stream, discussing practical agentic Java topics like memory management, multi-agent system design, LLM integration, modernization with AI, and dependency security. Experts from Redis, IBM, Hammerspace, HeroDevs, AI Collective, Tekskills, and Microsoft share their insights. Register for Americas → Asia-Pacific – April 9, 10:00 AM – 2:00 PM SGT (UTC +8) Brian Benz and Ayan Gupta co-host the APAC stream, highlighting Java frameworks and practices for agentic delivery. Topics include Spring AI, multi-agent orchestration, spec-driven development, scalable DevOps, and legacy modernization, with speakers from Broadcom, Alibaba, CERN, MHP (A Porsche Company), and Microsoft. Register for Asia-Pacific → Europe, Middle East and Africa – April 9, 9:00 AM – 12:30 PM GMT (UTC +0) The EMEA stream, hosted by Sandra Ahlgrimm, will address the implementation of agentic Java in production environments. Topics include self-improving systems utilizing Spring AI, Docker sandboxes for agent workflow management, Retrieval-Augmented Generation (RAG) pipelines, modernization initiatives from a national tax authority, and AI-driven CI/CD enhancements. Presentations will feature experts from Broadcom, Docker, Elastic, Azul Systems, IBM, Team Rockstars IT, and Microsoft. 
Register for EMEA → Make It Interactive: Join Live Come prepared with an actual challenge you’re facing, whether you’re modernizing a legacy application, connecting agents to internal APIs, or refining CI/CD processes. Test your strategies by participating in live chats and Q&As with presenters and fellow professionals. If you’re attending with your team, schedule a debrief after the live stream to discuss how to quickly use key takeaways and insights in your pilots and projects. Learning Resources Java and AI for Beginners Video Series: Practical, episode-based walkthroughs on MCP, GenAI integration, and building AI-powered apps from scratch. Modernize Java Apps Guide: Step-by-step guide using GitHub Copilot agent mode for legacy Java project upgrades, automated fixes, and cloud-ready migrations. AI Agents for Java Webinar: Embedding AI Agent capabilities into Java applications using Microsoft Foundry, from project setup to production deployment. Java Practitioner’s Guide: Learning plan for deploying, managing, and optimizing Java applications on Azure using modern cloud-native approaches. Register Now JDConf 2026 is a free global event for Java teams. Join live to ask questions, connect, and gain practical patterns. All 23 sessions will be available on-demand. Register now to earn Microsoft Rewards points for attending. Register at JDConf.com

Unit Testing Helm Charts with Terratest: A Pattern Guide for Type-Safe Validation
Helm charts are the de facto standard for packaging Kubernetes applications. But here's a question worth asking: how do you know your chart actually produces the manifests you expect, across every environment, before it reaches a cluster?

If you're like most teams, the answer is some combination of helm template eyeball checks, catching issues in staging, or hoping for the best. That's slow, error-prone, and doesn't scale.

In this post, we'll walk through a better way: a render-and-assert approach to unit testing Helm charts using Terratest and Go. The result? Type-safe, automated tests that run locally in seconds with no cluster required.

The Problem

Let's start with why this matters. Helm charts are templates that produce YAML, and templates have logic: conditionals, loops, value overrides per environment. That logic can break silently:

- A values-prod.yaml override points to the wrong container registry
- A security context gets removed during a refactor and nobody notices
- An ingress host is correct in dev but wrong in production
- HPA scaling bounds are accidentally swapped between environments
- Label selectors drift out of alignment with pod templates, causing orphaned ReplicaSets

These aren't hypothetical scenarios. They're real bugs that slip through helm lint and code review because those tools don't understand what your chart should produce. They only check whether the YAML is syntactically valid. These bugs surface at deploy time, or worse, in production.

So how do we catch them earlier?

The Approach: Render and Assert

The idea is straightforward. Instead of deploying to a cluster to see if things work, we render the chart locally and validate the output programmatically. Here's the three-step model:

- Render: Terratest calls helm template with your base values.yaml + an environment-specific values-<env>.yaml override
- Unmarshal: The rendered YAML is deserialized into real Kubernetes API structs (appsV1.Deployment, coreV1.ConfigMap, networkingV1.Ingress, etc.)
- Assert: Testify assertions validate every field that matters, including names, labels, security context, probes, resource limits, ingress routing, and more

No cluster. No mocks. No flaky integration tests. Just fast, deterministic validation of your chart's output.

Here's what that looks like in practice:

    // Arrange
    options := &helm.Options{
        ValuesFiles: s.valuesFiles,
    }
    output := helm.RenderTemplate(s.T(), options, s.chartPath, s.releaseName, s.templates)

    // Act
    var deployment appsV1.Deployment
    helm.UnmarshalK8SYaml(s.T(), output, &deployment)

    // Assert: security context is hardened
    secCtx := deployment.Spec.Template.Spec.Containers[0].SecurityContext
    // Fail cleanly instead of panicking if the securityContext block is missing
    require.NotNil(s.T(), secCtx)
    require.Equal(s.T(), int64(1000), *secCtx.RunAsUser)
    require.True(s.T(), *secCtx.RunAsNonRoot)
    require.True(s.T(), *secCtx.ReadOnlyRootFilesystem)
    require.False(s.T(), *secCtx.AllowPrivilegeEscalation)

Notice something important here: because you're working with real Go structs, the compiler catches schema errors. If you typo a field path like secCtx.RunAsUsr, the code won't compile. With YAML-based assertion tools, that same typo would fail silently at runtime. This type safety is a big deal when you're validating complex resources like Deployments.

What to Test: 16 Patterns Across 6 Categories

That covers the how. But what should you actually assert? Through applying this approach across multiple charts, we've identified 16 test patterns that consistently catch real bugs.
They fall into six categories:

| Category | What Gets Validated |
|---|---|
| Identity & Labels | Resource names, 5 standard Helm/K8s labels, selector alignment |
| Configuration | Environment-specific configmap data, env var injection |
| Container | Image registry per env, ports, resource requests/limits |
| Security | Non-root user, read-only FS, dropped capabilities, AppArmor, seccomp, SA token automount |
| Reliability | Startup/liveness/readiness probes, volume mounts |
| Networking & Scaling | Ingress hosts/TLS per env, service port wiring, HPA bounds per env |

You don't need all 16 on day one. Start with resource name and label validation, since those apply to every resource and catch the most common _helpers.tpl bugs. Then add security and environment-specific patterns as your coverage grows.

Now, let's look at how to structure these tests to handle the trickiest part: multiple environments.

Multi-Environment Testing

One of the most common Helm chart bugs is environment drift, where values that are correct in dev are wrong in production. A single test suite that only validates one set of values will miss these entirely.

The solution is to maintain separate test suites per environment:

    tests/unit/my-chart/
    ├── dev/     ← Asserts against values.yaml + values-dev.yaml
    ├── test/    ← Asserts against values.yaml + values-test.yaml
    └── prod/    ← Asserts against values.yaml + values-prod.yaml

Each environment's tests assert the merged result of values.yaml + values-<env>.yaml. So when your values-prod.yaml overrides the container registry to prod.azurecr.io, the prod tests verify exactly that, while the dev tests verify dev.azurecr.io.

This structure catches a class of bugs that no other approach does: "it works in dev" issues where an environment-specific override has a typo, a missing field, or an outdated value.

But environment-specific configuration isn't the only thing worth testing per commit. Let's talk about security.
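Before moving on, it may help to see what "the merged result of values.yaml + values-<env>.yaml" means in code. The Go sketch below is a simplified approximation, not Helm's actual merge logic: mergeValues is a hypothetical helper that layers an override map on top of a base map, the image.tag key and its value are made up for illustration, and only the two registry names come from the example above.

```go
package main

import "fmt"

// mergeValues returns base with override layered on top. Nested maps are
// merged key by key; any other value in override replaces the base value.
// This is a simplified stand-in for how a values-<env>.yaml is applied
// over values.yaml.
func mergeValues(base, override map[string]any) map[string]any {
	out := make(map[string]any, len(base))
	for k, v := range base {
		out[k] = v
	}
	for k, v := range override {
		baseMap, baseIsMap := out[k].(map[string]any)
		overMap, overIsMap := v.(map[string]any)
		if baseIsMap && overIsMap {
			out[k] = mergeValues(baseMap, overMap)
		} else {
			out[k] = v
		}
	}
	return out
}

func main() {
	// Base values shared by every environment (hypothetical content).
	base := map[string]any{
		"image": map[string]any{"registry": "dev.azurecr.io", "tag": "1.0.0"},
	}
	// Per-environment overrides, one entry per values-<env>.yaml.
	overrides := map[string]map[string]any{
		"dev":  {},
		"prod": {"image": map[string]any{"registry": "prod.azurecr.io"}},
	}
	// Table of expectations: one row per environment.
	want := map[string]string{"dev": "dev.azurecr.io", "prod": "prod.azurecr.io"}

	for _, env := range []string{"dev", "prod"} {
		merged := mergeValues(base, overrides[env])
		got := merged["image"].(map[string]any)["registry"]
		fmt.Printf("%s: registry=%v matches=%v\n", env, got, got == want[env])
	}
}
```

The per-environment test suites do the same thing one level up: each one pins the expected value for its own environment, so a wrong override in one file cannot hide behind a passing test for another.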
Security as Code

Security controls in Kubernetes manifests are notoriously easy to weaken by accident. Someone refactors a deployment template, removes a securityContext block they think is unused, and suddenly your containers are running as root in production. No linter catches this. No code reviewer is going to diff every field of a rendered manifest.

With this approach, you encode your security posture directly into your test suite. Every deployment test should validate:

- Container runs as non-root (UID 1000)
- Root filesystem is read-only
- All Linux capabilities are dropped
- Privilege escalation is blocked
- AppArmor profile is set to runtime/default
- Seccomp profile is set to RuntimeDefault
- Service account token automount is disabled

If someone removes a security control during a refactor, the test fails immediately, not after a security review weeks later. Security becomes a CI gate, not a review checklist.

With patterns and environments covered, the next question is: how do you wire this into your CI/CD pipeline?

CI/CD Integration with Azure DevOps

These tests integrate naturally into Azure DevOps pipelines. Since they're just Go tests that call helm template under the hood, all you need is the Helm CLI and a Go toolchain on your build agent.

A typical multi-stage pipeline looks like:

    stages:
      - stage: Build        # Package the Helm chart
      - stage: Dev          # Lint + test against values-dev.yaml
      - stage: Test         # Lint + test against values-test.yaml
      - stage: Production   # Lint + test against values-prod.yaml

Each stage uses a shared template that installs Helm and Go, extracts the packaged chart, runs helm lint, and executes the Go tests with gotestsum. Environment gates ensure production tests pass before deployment proceeds.
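On the Go side, each stage's test run has to know which release name and values override to use. One plausible way to wire that up, assuming the suite reads the HELM_RELEASE_NAME and HELM_VALUES_FILE_OVERRIDE environment variables that the pipeline and go test examples in this article set, is sketched below. The helper names valuesFiles and releaseName, the injected getenv parameter, and the chart path are all hypothetical.

```go
package main

import (
	"fmt"
	"os"
)

// valuesFiles returns the ordered values files for one environment: the
// chart's base values.yaml, then the optional override file named by
// HELM_VALUES_FILE_OVERRIDE. getenv is injected so the helper can be
// tested without touching the real process environment.
func valuesFiles(chartPath string, getenv func(string) string) []string {
	files := []string{chartPath + "/values.yaml"}
	if override := getenv("HELM_VALUES_FILE_OVERRIDE"); override != "" {
		files = append(files, chartPath+"/"+override)
	}
	return files
}

// releaseName returns HELM_RELEASE_NAME, or fallback when it is unset.
func releaseName(fallback string, getenv func(string) string) string {
	if name := getenv("HELM_RELEASE_NAME"); name != "" {
		return name
	}
	return fallback
}

func main() {
	// In a real suite these values would feed the Helm render options.
	chartPath := "charts/your-chart" // hypothetical location
	fmt.Println(releaseName("your-chart", os.Getenv))
	fmt.Println(valuesFiles(chartPath, os.Getenv))
}
```

Keeping the base file first and the override second mirrors the values.yaml + values-<env>.yaml ordering used throughout, so the same test code can serve every stage with only the environment variables changing.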
Here's the key part of a reusable test template:

    - script: |
        # Add Go to PATH first so that `go env` resolves on the next line
        export PATH=$PATH:/usr/local/go/bin
        export PATH=$PATH:$(go env GOPATH)/bin
        go install gotest.tools/gotestsum@latest
        cd $(Pipeline.Workspace)/helm.artifact/tests/unit
        gotestsum --format testname --junitfile $(Agent.TempDirectory)/test-results.xml \
          -- ./${{ parameters.helmTestPath }}/... -count=1 -timeout 50m
      displayName: 'Test helm chart'
      env:
        HELM_RELEASE_NAME: ${{ parameters.helmReleaseName }}
        HELM_VALUES_FILE_OVERRIDE: ${{ parameters.helmValuesFileOverride }}

    - task: PublishTestResults@2
      displayName: 'Publish test results'
      inputs:
        testResultsFormat: 'JUnit'
        testResultsFiles: '$(Agent.TempDirectory)/test-results.xml'
      condition: always()

The PublishTestResults@2 task makes pass/fail results visible on the build's Tests tab, showing individual test names, durations, and failure details. The condition: always() setting ensures results are published even when tests fail, so you always have visibility.

At this point you might be wondering: why Go and Terratest? Why not a simpler YAML-based tool?

Why Terratest + Go Instead of helm-unittest?

helm-unittest is a popular YAML-based alternative, and it's a fair question. Both tools are valid. Here's why we landed on Terratest:

| | Terratest + Go | helm-unittest (YAML) |
|---|---|---|
| Type safety | Renders into real K8s API structs; compiler catches schema errors | String matching on raw YAML; typos in field paths can fail silently |
| Language features | Loops, conditionals, shared setup, table-driven tests | Limited to YAML assertion DSL |
| Debugging | Standard Go debugger, stack traces | YAML diff output only |
| Ecosystem alignment | Same language as Terraform tests, one testing stack | Separate tool, YAML-only |

The type safety argument is the strongest. When you unmarshal into appsV1.Deployment, the Go compiler guarantees your assertions reference real fields. With helm-unittest, a YAML path like spec.template.spec.containers[0].securityContest (note the typo) can silently pass because it matches nothing, rather than failing loudly.
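The difference can be illustrated without any dependencies. The sketch below is a generic comparison of string-path lookup on untyped data versus typed field access; it is not how helm-unittest is implemented. The lookup helper and the one-field SecurityContext struct are hypothetical stand-ins, the latter for the real type in k8s.io/api.

```go
package main

import "fmt"

// SecurityContext is a hypothetical stand-in for the Kubernetes API type;
// only one field is modelled here.
type SecurityContext struct {
	RunAsNonRoot *bool
}

// lookup walks a nested map along a key path, the way a string-path
// assertion might. It reports false when any segment is missing.
func lookup(doc map[string]any, path ...string) (any, bool) {
	var cur any = doc
	for _, key := range path {
		m, ok := cur.(map[string]any)
		if !ok {
			return nil, false
		}
		next, ok := m[key]
		if !ok {
			return nil, false
		}
		cur = next
	}
	return cur, true
}

func main() {
	doc := map[string]any{
		"securityContext": map[string]any{"runAsNonRoot": true},
	}

	// Typo in the path ("securityContest"): nothing matches, no error is raised.
	_, found := lookup(doc, "securityContest", "runAsNonRoot")
	fmt.Println("typo path found:", found)

	// Typed access: misspelling the field (e.g. ctx.RunAsNonRot) would not compile.
	nonRoot := true
	ctx := SecurityContext{RunAsNonRoot: &nonRoot}
	fmt.Println("typed value:", *ctx.RunAsNonRoot)
}
```

An assertion that checks for absence on the mistyped path would succeed vacuously, whereas the typed version cannot even be built with the wrong field name.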
That said, if your team has no Go experience and needs the lowest adoption barrier, helm-unittest is a reasonable starting point. For teams already using Go or Terraform, Terratest is the stronger long-term choice.

Getting Started

Ready to try this? Here's a minimal project structure to get you going:

    your-repo/
    ├── charts/
    │   └── your-chart/
    │       ├── Chart.yaml
    │       ├── values.yaml
    │       ├── values-dev.yaml
    │       ├── values-test.yaml
    │       ├── values-prod.yaml
    │       └── templates/
    ├── tests/
    │   └── unit/
    │       ├── go.mod
    │       └── your-chart/
    │           ├── dev/
    │           ├── test/
    │           └── prod/
    └── Makefile

Prerequisites: Go 1.22+, Helm 3.14+

You'll need three Go module dependencies:

    github.com/gruntwork-io/terratest v0.46.16
    github.com/stretchr/testify v1.8.4
    k8s.io/api v0.28.4

Initialize your test module, write your first test using the patterns above, and run:

    cd tests/unit
    HELM_RELEASE_NAME=your-chart \
    HELM_VALUES_FILE_OVERRIDE=values-dev.yaml \
    go test -v ./your-chart/dev/... -timeout 30m

Start with a ConfigMap test. It's the simplest resource type and lets you validate the full render-unmarshal-assert flow before tackling Deployments. Once that passes, work your way through the pattern categories, adding security and environment-specific assertions as you go.

Wrapping Up

Unit testing Helm charts with Terratest gives you something that helm lint and manual review can't:

- Type-safe validation: The compiler catches schema errors, not production
- Environment-specific coverage: Each environment's values are tested independently
- Security as code: Security controls are verified on every commit, not in periodic reviews
- Fast feedback: Tests run in seconds with no cluster required
- CI/CD integration: JUnit results published natively to Azure DevOps

The patterns we've covered here are the ones that have caught the most real bugs for us. Start small with resource names and labels, and expand from there.
The investment is modest, and the first time a test catches a broken values-prod.yaml override before it reaches production, it'll pay for itself.

We'd Love Your Feedback

We'd love to hear how this approach works for your team:

- Which patterns were most useful for your charts?
- What resource types or patterns are missing?
- How did the adoption experience go?

Drop a comment below. Happy to dig into any of these topics further!