azure kubernetes service
232 TopicsGetting Started with OpenSearch on AKS with AKS AVM and Helm
A starter blueprint for running OpenSearch on Azure Kubernetes Service using an AVM-based baseline, separate Helm releases for manager and data tiers, internal-only Dashboards, and keyless snapshots via workload identity. Portal-first and CLI-first paths included.70Views1like0CommentsPerformance Tuning and Scaling Optimization for Large-Scale Azure Workloads
Summary As cloud-native systems scale, performance challenges rarely stem from a single bottleneck. Instead, they emerge from the interaction between compute, orchestration, and data layers under load. This article captures a practical optimization journey of a high-volume Azure-based workload and highlights how controlled scaling, improved orchestration design, and proactive database maintenance can significantly outperform brute-force scaling. Introduction Distributed systems are often designed with the assumption that scaling out will solve performance issues. However, for orchestration-heavy and database-intensive workloads, this approach can introduce more problems than it solves. In this scenario, the system processed millions of transactional records through Azure Functions, Durable Functions, messaging pipelines, APIs, and SQL databases. As the workload grew, the platform began experiencing: CPU and memory spikes Slower SQL queries Service Bus throttling Increased retries and execution delays What stood out was that these issues were not due to insufficient resources, but due to inefficient execution patterns at scale. The optimization effort therefore focused on controlling how the system scaled and executed, rather than simply increasing capacity. Understanding Workload Behavior A critical early step was identifying the nature of the workload—specifically, whether it was CPU-heavy or data-heavy. Rethinking Scaling: More Is Not Always Better One of the most important lessons was that scaling out aggressively can degrade performance. As more function instances processed messages in parallel: Database calls increased sharply API traffic surged Lock contention intensified Retry rates increased This created a cascading effect where retries amplified load, further slowing down the system. To address this, scaling was intentionally controlled using: Concurrency limits on function execution Batch-based processing instead of full parallel fan-out Small delays to smooth traffic spikes Chunking of large datasets into manageable units This shift from maximum parallelism to controlled throughput significantly improved system stability. Compute Optimization: CPU and Memory After stabilizing scaling behavior, the next step was optimizing compute usage. CPU Optimization CPU spikes were largely caused by excessive parallel execution and orchestration overhead. Improvements included: Breaking large workloads into smaller units Reducing unnecessary fan-outs of processes Limiting concurrent executions This resulted in more predictable CPU usage and improved execution consistency. Memory Optimization Memory pressure was primarily driven by large payloads and batch processing. Optimizations focused on: Processing data in smaller chunks Avoiding large in-memory payloads and memory leaks Reducing orchestration state size These changes improved system reliability and reduced execution failures under load. Scaling Approaches: Practical Trade-Offs Both vertical and horizontal scaling were used, but with careful consideration. Scale Up (Vertical Scaling) Quick to implement No architectural changes required Useful for immediate stabilization However, it had cost and scalability limits. Scale Out (Horizontal Scaling) Better suited for long-term scalability Enables workload distribution But without control, it can: Increase database contention Amplify retries Introduce instability Key Insight The most effective approach was not choosing one over the other but combining both with strict control over concurrency and execution patterns. Durable Functions: Orchestration Optimization Durable Functions were central to the system, making orchestration design a key factor in performance. Challenges Observed The initial design relied heavily on nested sub-orchestrators, which introduced: High orchestration overhead Increased replay and persistence operations Slower execution at scale Key Improvements Refactoring unnecessary sub-orchestrators into Activity Functions simplified execution and improved throughput. The benefits included: Reduced orchestration latency Faster execution cycles Lower infrastructure cost Note: However, sub-orchestrators remain the right choice when the design requires composing multiple dependent steps, managing scoped retry/error logic, or isolating orchestration history. The decision should be driven by the complexity and reuse requirements of each workflow segment and not applied as a blanket rule. Improved Retry Strategy Retry behavior was also optimized by redefining execution boundaries. Previously: One activity processed multiple records A single failure triggered a retry of the entire batch After optimization: One activity handled one logical unit of work This enabled: Granular retries Better failure isolation Reduced duplicate processing Database Hygiene: A Critical Foundation The database emerged as a major bottleneck due to fragmentation and stale statistics caused by continuous high-volume operations. Issues Identified Fragmented indexes Inefficient query plans Increased query execution time Optimization Approach A proactive maintenance strategy was implemented using scheduled jobs to: Update statistics regularly Rebuild indexes Maintain query performance consistency Controlled Database Load For heavy long-running workloads in multi-tenant architecture, execution of DB intensive process was intentionally run in singleton fashion at a tenant level to reduce contention. This approach: Prevented concurrent heavy operations Improved overall system stability Delivered more predictable throughput Observability: Finding the Real Problem A major challenge during optimization was distinguishing between symptoms and root causes. For example: Slow APIs were often caused by database contention High retries were triggered by upstream throttling Orchestration delays originated from downstream dependencies To address this, end-to-end observability was established using: Application-level tracing Load testing correlations Cross-service telemetry analysis This enabled accurate root cause identification and prevented misdirected optimization efforts. Key Takeaways Some key principles emerged from this optimization journey: Scaling more does not always mean performing better Controlled parallelism is more effective than unrestricted concurrency Orchestration design directly impacts system performance Database maintenance must be proactive Retry strategies should align with logical units of work Observability is essential for correct diagnosis Conclusion Performance tuning in distributed systems is less about adding resources and more about using them efficiently. By focusing on controlled scaling, simplifying orchestration, maintaining database health, and improving observability, the system achieved higher throughput, lower cost, and significantly improved stability. These lessons are broadly applicable to any Azure-based system handling large-scale, orchestration-heavy workloads and can help teams design more predictable and resilient architectures.409Views5likes0CommentsHardening OpenClaw on AKS: Mitigating Container Escapes with Kata microVM Isolation
What is OpenClaw, and what security challenges does it pose with container escapes? OpenClaw is an open-source autonomous AI agent designed for power users and developers to automate tasks, such as managing emails, files, and scheduling via chat apps like WhatsApp or Telegram. While OpenClaw functions as a powerful autonomous assistant, its runtime model creates a massive security paradox: to be truly useful, the agent requires broad permissions to your filesystem and APIs, yet this "God Mode" access often lacks the rigorous containerized isolation typical of enterprise workloads. Because many users run the framework natively rather than within a hardened sandbox, the primary security challenge is that a single malicious "Skill" or an indirect prompt injection can escalate into full system compromise. This structural vulnerability, exemplified by high-profile exploits like CVE-2026-25253, transforms the agent from a helpful tool into a high-risk entry point for lateral movement and data exfiltration within a private network. Why container escapes matter in OpenClaw-style deployments: because containers share the host kernel, a successful container escape turns a single compromised container into a host compromise (or at least a compromise of other co-located workloads). This is especially important when OpenClaw runs code from many tenants, many teams, or varying trust levels on the same worker nodes. That soft isolation is often permeable due to the following structural and configuration-based weaknesses: Shared-kernel attack surface: the container boundary is not a hypervisor boundary. Kernel vulnerabilities (e.g., privilege escalation bugs) can allow a process in a container to gain host-level privileges. Excessive privileges / misconfiguration: running with --privileged, broad Linux capabilities, hostPath mounts, access to the Docker socket, or device passthrough (e.g., /dev/kvm, /dev/fuse) can provide direct paths to host control. Filesystem and namespace boundary breaks: mount namespace confusion, writable host mounts, or mistakes in chroot/pivot_root handling can expose host files and credentials. Supply-chain and image risk: a malicious image or dependency can execute within the container and then attempt escalation/escape. Blast radius: once the host is compromised, attackers can access node-level secrets (service account tokens, registry creds), tamper with the runtime, sniff traffic, or pivot to other containers and the broader cluster. In short, OpenClaw’s security challenge is not that containers are inherently insecure, but that the isolation boundary is thinner than a VM boundary. When the threat model includes adversarial code execution, a “container-only” isolation strategy often requires additional hardening or a stronger sandbox. What are MicroVMs and Kata Containers, and how do they help mitigate OpenClaw container-escape risks? MicroVMs are lightweight virtual machines optimized for running short-lived or container-like workloads with much lower overhead than traditional VMs. They use hardware virtualization (via a hypervisor such as KVM) but keep the device model and boot path minimal, reducing startup time and the overall attack surface compared to a full general-purpose VM. Kata Containers is an “OCI-compatible containers in a VM” approach: it runs each container (or pod sandbox) inside a dedicated microVM by default (implementation varies by runtime/config). To the orchestration layer (e.g., Kubernetes), it still looks like a container runtime, but isolation is provided by a hypervisor boundary rather than only namespaces/cgroups. Stronger isolation boundary: a container escape that relies on Linux kernel exploitation is far less likely to directly compromise the host, because the workload’s “host” kernel is typically the guest kernel inside the microVM. Reduced blast radius: compromise is contained to the microVM/pod sandbox; lateral movement to other workloads on the same node becomes significantly harder. Smaller and more controllable attack surface: minimal device models, tighter default privileges, and fewer host mounts/devices exposed to the workload. Defense-in-depth with container controls: you still can (and should) apply seccomp, capabilities dropping, read-only root filesystems, and LSMs inside the guest, but the hypervisor boundary becomes an additional layer. Better fit for hostile multi-tenant workloads: when OpenClaw executes third-party jobs/plugins, Kata-style sandboxing aligns better with an adversarial threat model. Solution overview Figure 1 illustrates a Kubernetes-based sandboxing architecture for running OpenClaw workloads with stronger isolation. The design keeps the developer experience and packaging model of containers (OCI images, Kubernetes scheduling) while ensuring that untrusted agent code executes inside a microVM boundary using Kata Containers. This reduces the likelihood that a container escape can compromise the underlying node or other co-located workloads. Key components: (1) Application gateway for HTTPS traffic to the backend, (2) Kubernetes as the orchestration, scheduling and policy enforcement plane, (3) a container runtime (e.g., containerd) configured with a Kata Containers runtime class, (4) KVM-backed microVMs that provide the isolation boundary for each untrusted workload and (5) Azure files for persistent storage which allows scaling of OpenClaw. Figure 1: Solution architecture diagram End-to-end flow: Traffic Entry via Application Gateway: Incoming user requests (e.g., from WhatsApp or Discord) first hit the Azure Application Gateway. Orchestration in AKS: The traffic is routed into an Azure Kubernetes Service (AKS) cluster, which manages the lifecycle of the OpenClaw agent and its associated "Skills." Hardened Execution via Kata Containers: Instead of running in standard shared-kernel containers, the OpenClaw agent runs inside Kata Containers. This provides a dedicated lightweight VM for the agent, creating a hardware-level isolation boundary that prevents "container escapes" from compromising the host. Stateful Storage in Azure Files: The agent interacts with Azure Files to read and write persistent data, such as conversation history, configuration files, and downloaded assets, ensuring data remains available even if the container is restarted. Security posture: by shifting isolation from “shared-kernel containers” to “containers inside microVMs,” the architecture limits the blast radius of kernel-level exploits and common escape paths. Even if an attacker achieves code execution within an OpenClaw container, they must additionally break the microVM/hypervisor boundary to affect the node or neighboring workloads, providing a strong defense-in-depth improvement over standard container alone. Implement the solution This section describes how to deploy the solution architecture. In this post, you’ll perform the following tasks: Create a Kata VM-isolated AKS node pool Mount a NFS persistent storage Create the application ConfigMap Deploy the OpenClaw gateway Expose the gateway internally Set up TLS termination Route external traffic through the Azure application gateway for containers. Ensure that you have the following prerequisites deployed before moving to the next section: An AKS cluster provisioned in Azure An Azure NFS File Share with private link enabled. An Application gateway for containers managed by ALB controller Kubectl configured and pointing to the cluster Az CLI authenticated with the correct subscription Initialise environment variables In your Linux terminal, export these variables with your own values. They will be used in later commands. export cluster_name=<CLUSTER_NAME> export resource_group=<RESOURCE_GROUP> Create the AKS Node Pool with Kata VM Isolation The OpenClaw gateway pods require Kata VM isolation (runtimeClassName: kata-vm-isolation). You must create a dedicated AKS node pool that supports this runtime before deploying any workloads. Use the Azure CLI to add a node pool with the Kata VM isolation workload runtime to your existing AKS cluster: az aks nodepool add \ --resource-group $resource_group \ --cluster-name $cluster_name \ --name katanp \ --node-count 2 \ --node-vm-size Standard_D4s_v3 \ --os-sku AzureLinux \ --workload-runtime KataMshvVmIsolation \ --labels agentpool=katanp **Important:** The `--workload-runtime KataMshvVmIsolation` flag enables the `kata-vm-isolation` runtime class on the node pool. The VM size must support nested virtualization (D-series v3/v5, E-series v3/v5, etc.). Create NFS Persistent Volume The deployment uses an Azure Files NFS share for persistent workspace storage. The PersistentVolume must exist before the PVC can bind to it. Replace volumeHandle and volumeAttributes with your own Azure Files values. cat <<EOF | kubectl apply -f - apiVersion: v1 kind: PersistentVolume metadata: name: openclaw-nfs-pv spec: capacity: storage: 100Gi accessModes: - ReadWriteMany persistentVolumeReclaimPolicy: Retain mountOptions: - sec=sys - noresvport - actimeo=30 csi: driver: file.csi.azure.com volumeHandle: <resource-group>#<storage-account>#<share-name> volumeAttributes: resourceGroup: <resource-group> shareName: <share-name> protocol: nfs server: <storage-account>.privatelink.file.core.windows.net EOF Verify that the persistent volume is created. kubectl get pv openclaw-nfs-pv Figure 2: Persistent volume Create the NFS PersistentVolumeClaim The PVC binds to the PV created. The deployment references this PVC by name (`pvc-openclaw-nfs`). cat <<EOF | kubectl apply -f - apiVersion: v1 kind: PersistentVolumeClaim metadata: # The name of the PVC name: pvc-openclaw-nfs spec: accessModes: - ReadWriteMany resources: requests: # The real storage capacity in the claim storage: 50Gi # This field must be the same as the storage class name in StorageClass storageClassName: "" volumeName: openclaw-nfs-pv EOF Verify that the persistent volume claim is created successfully. The status should show bound. Figure 3: Persistent Volume Claim Create the ConfigMap The ConfigMap provides the openclaw.json configuration file to the gateway pods. It configures allowed CORS origins for the control UI and the gateway token. Replace the allowed origins with your own ALB frontend URL. The ConfigMap also stores the gateway auth token, so DO NOT hardcode your token here. Always keep it as a variable rather than storing it in plain text so that, if attackers gain access to this file, they cannot see the OpenClaw gateway auth token. cat <<EOF | kubectl apply -f - apiVersion: v1 kind: ConfigMap metadata: name: openclaw-config data: openclaw.json: | { "gateway": { "auth": { "token": "${AUTH_TOKEN}" }, "controlUi": { "allowedOrigins": [ "https://<YOUR ALB FRONTEND URL>.alb.azure.com" ] } } } EOF Create the Auth Token Secret The OpenClaw gateway requires an authentication token to secure access. The deployment references a Kubernetes Secret named openclaw-auth-token and injects it into the container as the AUTH_TOKEN environment variable via secretKeyRef. Generate a random token (or use an existing one) and create the kubernetes secret. # Generate a random 32-byte hex token AUTH_TOKEN=$(openssl rand -hex 32) echo "$AUTH_TOKEN" # save this — you'll need it to authenticate with the gateway kubectl create secret generic openclaw-auth-token \ --from-literal=token="$AUTH_TOKEN" If the secret does not exist when the deployment is applied, pods will fail with `CreateContainerConfigError`. Deploy the OpenClaw Gateway This is the main application deployment. It depends on all previous steps: - Kata node pool (pods require runtimeClassName: kata-vm-isolation and nodeSelector: agentpool=katanp) - PVC (pvc-openclaw-nfs for persistent workspace data) - ConfigMap (openclaw-config for openclaw.json) Key details: - Runs 2 replicas with a rolling update strategy - Uses an init container to copy the config file to a writable volume - Exposes port 18789 - Includes liveness and readiness probes on /health - Resource requests: 500m CPU, 512Mi memory - Resource limits: 2 CPU, 2Gi memory cat <<EOF | kubectl apply -f - apiVersion: apps/v1 kind: Deployment metadata: name: openclaw-gateway spec: replicas: 2 selector: matchLabels: app: openclaw-gateway strategy: type: RollingUpdate rollingUpdate: maxUnavailable: 1 maxSurge: 1 template: metadata: labels: app: openclaw-gateway spec: runtimeClassName: kata-vm-isolation nodeSelector: agentpool: katanp securityContext: fsGroup: 1000 initContainers: - name: copy-openclaw-config image: alpine/openclaw:latest env: - name: HOME value: /writable command: - sh - -c - | cp /config/openclaw.json /writable/openclaw.json \ && chown 1000:1000 /writable/openclaw.json \ && echo "--- Config file contents ---" \ && cat /writable/openclaw.json volumeMounts: - name: openclaw-config-volume mountPath: /config - name: openclaw-writable mountPath: /writable containers: - name: gateway image: alpine/openclaw:latest ports: - containerPort: 18789 env: - name: NODE_OPTIONS value: "--max-old-space-size=4096" - name: AUTH_TOKEN valueFrom: secretKeyRef: name: openclaw-auth-token key: token # Start gateway the way the tutorial indicates command: ["openclaw", "gateway"] args: ["run", "--allow-unconfigured", "--bind", "lan"] volumeMounts: - name: openclaw-writable mountPath: /home/node/.openclaw - name: openclaw-data mountPath: /home/node/workspace subPath: workspace resources: requests: cpu: "500m" memory: "2Gi" limits: cpu: "1000m" memory: "4Gi" livenessProbe: httpGet: path: /health port: 18789 initialDelaySeconds: 60 periodSeconds: 15 failureThreshold: 3 readinessProbe: httpGet: path: /health port: 18789 initialDelaySeconds: 10 periodSeconds: 5 volumes: - name: openclaw-data persistentVolumeClaim: claimName: pvc-openclaw-nfs - name: openclaw-config-volume configMap: name: openclaw-config items: - key: openclaw.json path: openclaw.json - name: openclaw-writable emptyDir: {} --- apiVersion: v1 kind: Service metadata: name: openclaw-gateway-service spec: type: ClusterIP selector: app: openclaw-gateway ports: - protocol: TCP port: 18789 targetPort: 18789 EOF Verify that the deployment succeeds. Wait until all pods show `Running` and `READY 2/2`. kubectl get deployment openclaw-gateway kubectl get pods -l app=openclaw-gateway Figure 4: OpenClaw deployment Create the TLS secret (for HTTPS) The Application Gateway for Containers references a TLS secret (gateway-tls-secret) for HTTPS termination. This blog post uses a self-signed certificate; in a production environment, use a certificate signed by a certificate authority. Replace `<path-to-tls-cert>` and `<path-to-tls-key>` with paths to your TLS certificate and private key files. kubectl create secret tls gateway-tls-secret \ --cert=<path-to-tls-cert> \ --key=<path-to-tls-key> Create the Gateway The Gateway resource defines the HTTPS listener on the Azure Application Load Balancer (ALB). Update the `alb.network.azure.com/application-gateway-id` annotation to match your ALB traffic controller resource ID. You will also need to reference the gateway-tls-secret to enable HTTPS. cat <<EOF | kubectl apply -f - apiVersion: gateway.networking.k8s.io/v1 kind: Gateway metadata: name: https annotations: alb.network.azure.com/application-gateway-id: /subscriptions/<subscription id>/resourceGroups/mc_openclaw_openclaw-cluster_centralus/providers/Microsoft.ServiceNetworking/trafficControllers/<alb id> alb.networking.azure.io/alb-namespace: default alb.networking.azure.io/alb-name: alb-openclaw spec: gatewayClassName: azure-alb-external listeners: - name: https protocol: HTTPS port: 443 allowedRoutes: namespaces: from: All tls: mode: Terminate certificateRefs: - kind: Secret group: "" name: gateway-tls-secret EOF kubectl get gateway https Wait until the Gateway shows a `Programmed=True` condition. Create the HTTPRoute The HTTPRoute connects the Gateway to the backend Service. It routes all traffic (`/` prefix) from the HTTPS Gateway to `openclaw-gateway-service` on port 18789. cat <<EOF | kubectl apply -f - kind: HTTPRoute apiVersion: gateway.networking.k8s.io/v1 metadata: name: http-route spec: parentRefs: - name: https rules: - matches: - path: type: PathPrefix value: / backendRefs: - name: openclaw-gateway-service kind: Service namespace: default port: 18789 EOF Test OpenClaw application Get the external endpoint. kubectl get gateway https -o jsonpath='{.status.addresses[0].value}' Paste the endpoint into your browser to reach the OpenClaw application. If you are using a self-signed certificate, you will see a “Not secure” warning; click Advanced to proceed. In a production environment with a certificate signed by a certificate authority, you should not see that warning. Figure 5: OpenClaw Authentication Paste in your Gateway Token (the auth token created earlier). You will notice that even though the token is valid, it throws back a “pairing required” error. Pairing is required in OpenClaw whenever a new device, browser profile, or CLI client attempts to connect to the gateway for the first time, ensuring only authorized clients can control the AI agent. POD=$(kubectl get pod -l app=openclaw-gateway -o jsonpath='{.items[0].metadata.name}') POD2=$(kubectl get pod -l app=openclaw-gateway -o jsonpath='{.items[1].metadata.name}') TOKEN=$(kubectl get secret openclaw-auth-token -o jsonpath='{.data.token}' | base64 -d) kubectl exec "$POD" -c gateway -- openclaw devices approve --latest --token "$TOKEN" kubectl exec "$POD2" -c gateway -- openclaw devices approve --latest --token "$TOKEN" You should see a message like the one in the image below. You can now open the OpenClaw application and start using it. Figure 6: OpenClaw pairing success message Figure 7: OpenClaw Application You have successfully deployed OpenClaw within a microVM hosted on Azure Kubernetes Service. Test microVM kernel isolation From within the OpenClaw pod, try to read the host’s root filesystem via /proc/1/root. You should see an error like: ls: cannot access '/proc/1/root/etc/kubernetes': No such file or directory. kubectl exec -it "$POD" -c gateway -- ls /proc/1/root/etc/kubernetes 2>&1 In a standard container deployment, PID 1 inside the container is still running on the host kernel, so traversing /proc/1/root/ exposes the host's root filesystem — including sensitive paths like /etc/kubernetes (which holds kubelet credentials). With Kata VM isolation, the picture is completely different. When we run ls /proc/1/root/etc/kubernetes from inside the OpenClaw pod, it returns "No such file or directory". This is because PID 1 is no longer a process on the host — it's running inside a dedicated guest VM with its own kernel. The /proc/1/root/ path leads to the microVM's root filesystem, not the host's, and that microVM has no knowledge of the node's Kubernetes configuration or machine identity. The host is simply invisible. This is the core security guarantee of Kata Containers: even if an attacker achieves a full container escape, there is nothing to escape to — they land inside a lightweight VM boundary, not on the shared host, making lateral movement to other pods or the node itself impossible. Conclusion This post discussed why running OpenClaw workloads in standard containers can be risky when the workload includes untrusted or semi-trusted code: containers share the host Linux kernel, so a single container escape or privileged misconfiguration can expand into node-level compromise and a much larger blast radius. To address this, we introduced microVM-based sandboxing with Kata Containers on Azure Kubernetes Service (AKS) and walked through an implementation approach (a node pool with Kata VM isolation, storage, gateway deployment, and ingress). Finally, we validated the isolation properties by demonstrating that common host-visibility techniques (for example, probing /proc/1/root) no longer reveal host paths when the workload runs inside a microVM. Separate kernel boundary: Kata runs the container inside a microVM, so the workload executes against a guest kernel rather than the shared host kernel—kernel exploits and escape attempts don’t directly translate into host control. Host filesystem is no longer “in scope”: paths that often leak host context in standard containers (for example, traversals via /proc) resolve inside the microVM’s filesystem, not the node’s root filesystem. Reduced blast radius per workload: each sandbox has its own VM boundary, making it much harder to pivot from one compromised workload to other pods/containers on the same node. Stronger default device and privilege separation: the hypervisor boundary and minimal virtual device model limit exposure to host devices and privileged interfaces that commonly enable breakouts. Defense-in-depth still applies: you can keep container hardening (seccomp, capability dropping, read-only filesystems, restricted mounts) while gaining an additional isolation layer that is independent of Linux namespaces/cgroups. Overall, this post helps you deploy OpenClaw on AKS with Kata microVM isolation so you can run agent workloads with a significantly reduced risk of host-kernel compromise from container escape techniques.Plugin Marketplace for Azure SRE Agent: Build Once, Install Anywhere
What's a Plugin? A plugin bundles two things: Skills — Operational knowledge (triage runbooks, policy rules, known issues) the agent reads at runtime to guide its reasoning MCP Connectors — Live integrations to your internal APIs (deployment tracker, cost dashboard, CMDB) the agent can query during an investigation This is the key distinction: a plugin doesn't just tell the agent what your policies are — it gives the agent tools to query your internal systems and apply those policies with real data. A plugin bundles skills and MCP connectors as a single installable unit. The Marketplace Model: Create Once, Install Everywhere The marketplace is a GitHub repository with a marketplace.json manifest. Any team pushes their plugin to the repo. Every SRE Agent in the org can discover it and install it with one click — no need for each team to manually recreate skills and configure connectors. How it works: A specialist team creates a plugin (skills + MCP connector config) and pushes it to the shared GitHub repo Any SRE Agent user browses the marketplace, sees what's available, and clicks Install The plugin's skills and connectors are deployed to that agent instance instantly Contoso runs multiple SRE Agent instances — payments team, platform team, data team. The same marketplace serves all of them. Each team installs exactly the plugins they need. One marketplace, many agents. Teams publish plugins once — every agent in the org can install them. The Scenario: AKS Incident Investigation with Plugins Contoso runs a payment processing service on AKS. Three teams have contributed plugins to the company's internal marketplace: Plugin Team Skills MCP Connector AKS Runbooks K8s Platform Team aks-incident-triage, aks-deployment-analysis Deployment Tracker API Cost & Capacity Cloud FinOps Team cost-analysis, capacity-planning Cost Dashboard API Service Catalog SRE Leadership service-ownership-lookup, dependency-impact-analysis CMDB API All three are installed on the payments team's SRE Agent. Let's see what happens when an incident hits. Building and Publishing the Plugins Each team creates their plugin independently and pushes it to the shared marketplace repo. 1. AKS Runbooks (Kubernetes Platform Team) The K8s Platform Team packages their triage procedures, node pool naming conventions, PDB policies, known issues registry, and deployment gates. Skills: aks-incident-triage — Per-symptom triage procedures (OOMKill, NodePressure, CrashLoop), PDB-first policy checks, Tier-0 escalation rules, and a known issues registry aks-deployment-analysis — Correlates incidents with recent deployments, surfaces resource spec diffs and gate violations, provides a rollback decision tree MCP Connector: contoso-deploy-tracker — Exposes get_deployments: recent deployments by namespace with deployer, image versions, resource diffs, and gate status 2. Cost & Capacity (Cloud FinOps Team) The FinOps team packages their SKU approval matrix, team budget allocations, chargeback model, and scaling governance. Skills: cost-analysis — Team budget tiers, cost dashboard API usage, incident cost impact calculations capacity-planning — "Scale-out before scale-up" rule (CCP-001), SKU approval matrix (B-series = team lead, D-series = director, E/N-series = VP/CTO), auto-scale thresholds MCP Connector: contoso-cost-dashboard — Exposes get_team_spend (budget, burn rate) and get_resources_cost (resource-level cost with utilization) 3. Service Catalog (SRE Leadership) SRE Leadership packages service ownership, SLA tiers, escalation paths, and the dependency graph. Skills: service-ownership-lookup — Maps namespaces to owning teams, on-call contacts, SLA tiers (Tier-0 through Tier-3), escalation policies dependency-impact-analysis — Dependency classification (hard/soft/async), blast radius assessment, security implications MCP Connector: contoso-cmdb — Exposes get_service_info (ownership, SLA), get_service_dependencies, and get_blast_radius The Marketplace Manifest All three plugins are described in a single marketplace.json that the SRE Agent discovers: { "name": "Contoso SRE Plugins", "description": "Internal plugin marketplace for Contoso SRE teams", "version": "1.0.0", "plugins": [ { "id": "aks-runbooks", "name": "AKS Runbooks", "description": "Kubernetes Platform Team's operational runbooks and deployment correlation", "author": "K8s Platform Team", "source": "./aks-runbooks", "category": "Operations" }, { "id": "cost-capacity", "name": "Cost & Capacity", "description": "FinOps team's cost governance, SKU approval matrix, and capacity planning", "author": "Cloud FinOps Team", "source": "./cost-capacity", "category": "Cost Management" }, { "id": "service-catalog", "name": "Service Catalog", "description": "Service ownership, SLA tiers, dependency graphs, and escalation paths", "author": "SRE Leadership", "source": "./service-catalog", "category": "Governance" } ] } The internal plugin marketplace on GitHub. Each directory is a plugin contributed by a different team. The marketplace.json manifest tells the SRE Agent what's available. Registering the Marketplace and Installing Plugins Step 1: Add the Marketplace In the SRE Agent, navigate to Builder → Plugins → Browse and click "Add Marketplace". Enter the GitHub repository path (contoso/sre-agent-plugins) and click Add. The agent fetches marketplace.json and displays the marketplace card with all three plugins discovered. Adding the internal marketplace — just point to the GitHub repo. Step 2: Browse the Catalog The Browse tab now shows the Contoso SRE Plugins marketplace. Clicking into it reveals three plugin cards — one from each contributing team — with descriptions, skill counts, and connector details. & Capacity, Service Catalog — with author teams, skill counts, and install buttonsThree plugins from three teams. Each one brings skills (organizational knowledge) and an MCP connector (internal API access). Step 3: Install All Three Plugins Click into each plugin to review what it installs — skills and MCP connectors — then click "Install Plugin" for each one. After installing all three: 6 skills loaded (2 per plugin) — organizational knowledge documents the agent reads at runtime 3 MCP connectors registered — internal API integrations the agent can call as tools Each plugin clearly shows what it installs — skills and connectors — before you commit. ith green "Installed" badges, green borders, and skill/connector countsAll three plugins installed — each card shows its "Installed" status, the authoring team, and exactly what it brings (2 skills, 1 connector). The green border and badge make installed plugins immediately recognizable. The Agent in Action Now let's ask the question: "Pods are crashing in payments-prod on aks-payments-prod-eastus2 in sre-marketplace-demo-rg. Investigate and give me a full incident report." The agent investigates — combining its native Kubernetes capabilities with the organizational context from all three plugins. The same strong Kubernetes diagnosis, now enriched with organizational context. Deployment correlation, policy violations, cost governance, blast radius, and escalation paths — all layered on top of the agent's native investigation. Why This Matters: Different Teams, One Agent The K8s Platform Team writes triage procedures and known issues. The FinOps Team writes budget governance and SKU rules. SRE Leadership defines service ownership and escalation paths. Each team packages their domain expertise independently. The SRE Agent combines all of it at runtime — producing a response no single team could have written alone, drawn from three internal systems and three bodies of institutional knowledge. This is how organizational knowledge scales: composable plugins that the agent reasons with in real-time, not longer wikis that nobody reads. Learn More Plugin Marketplace overview — How the marketplace works, manifest formats, and MCP config support Tutorial: Install a marketplace plugin — Step-by-step walkthrough of adding a marketplace, browsing plugins, and importing skills Skills in Azure SRE Agent — How skills work, how the agent loads them at runtime, and how they relate to custom agents and knowledge files MCP connectors and tools — Connecting your agent to external systems via the Model Context Protocol Tutorial: Set up an MCP connector — Configuring remote and local MCP servers as agent connectors272Views0likes0CommentsAnnouncing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent— your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.14KViews1like2CommentsTroubleshoot with OpenTelemetry in Azure Monitor - Public Preview
OpenTelemetry is fast becoming the industry standard for modern telemetry collection and ingestion pipelines. With Azure Monitor’s new OpenTelemetry Protocol (OTLP) support, you can ship logs, metrics, and traces from wherever you run workloads to analyze and act on your observability data in one place. What’s in the preview Direct OTLP ingestion into Azure Monitor for logs, metrics, and traces. Automated onboarding for AKS workloads. Application Insights on OTLP for distributed tracing, performance and troubleshooting experiences. Pre-built Grafana dashboards to visualize signals quickly. Prometheus for metric storage and query. OpenTelemetry semantic conventions for logs and traces, so your data lands in a familiar standard-based schema. How to send OTLP to Azure Monitor: pick your path AKS: Auto-instrument Java and Node.js workloads using the Azure Monitor OpenTelemetry distro, or auto-configure any OpenTelemetry SDK-instrumented workload to export OTLP to Azure Monitor. Get started Limited preview: Auto-instrumentation for .NET and Python is also available. Get started VMs/VM Scale Sets (and Azure Arc-enabled compute): Use the Azure Monitor Agent (AMA) to receive OTLP from your apps and export it to Azure Monitor. Get started Any environment: Use the OpenTelemetry Collector to receive OTLP signals and export directly to Azure Monitor cloud ingestion endpoints. Get started Under the hood: where your telemetry lands Metrics: Stored in an Azure Monitor Workspace, a Prometheus metrics store. Logs + traces: Stored in a Log Analytics workspace using an OpenTelemetry semantic conventions–based schema. Troubleshooting: Application Insights lights up distributed tracing and end-to-end performance investigations, backed by Azure Monitor. Why it matters Standardize once: Instrument with OpenTelemetry and keep your telemetry portable. Reduce overhead: Fewer bespoke exporters and pipelines to maintain. Debug faster: Correlate metrics, logs, and traces to get from alert to root cause with less guesswork. Observe with confidence: Use dashboards and tracing views that are ready on day one. Next step: Try the OTLP preview in your environment, then validate end-to-end signal flow with Application Insights and Grafana dashboards. Learn More422Views3likes0CommentsAutonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes
When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m. Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc kubectl commands. This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams. Core concepts Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow: Incident platform. Where incidents originate. In this demo, that is Azure Monitor. Built-in Azure capabilities. The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors. Connectors. Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers. Permission levels. Reader for investigation and read oriented access, privileged for operational changes when allowed. Run modes. Review for approval-gated execution and Autonomous for direct execution. The most important production controls are permission level and run mode, not prompt quality. Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability. The safest production rollout path: Start: Reader + Review Then: Privileged + Review Finally: Privileged + Autonomous. Only for narrow, trusted incident paths. Demo environment The full scripts and manifests are available if you want to reproduce this: Demo repository: github.com/hailugebru/azure-sre-agents-aks. The README includes setup and configuration details. The environment uses an AKS cluster with node auto-provisioning (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the AKS Store sample microservices application, and Azure SRE Agent configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal. It provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself. Azure Monitor → Action Group → Azure SRE Agent → AKS Cluster (Alert) (Webhook) (Investigate / Fix) (Recover) ↓ Teams notification + GitHub issue → GitHub Agent → PR for review How the agent was configured Configuration came down to four things: scope, permissions, incident intake, and response mode. I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take. I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction: permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly. For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates. I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow up after remediation. For the complete setup, see the README. A note on context. The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts. Two incidents, two response modes These incidents occurred on the same cluster in one session and illustrate two realistic operating modes: Alert triggered automation. The agent acts when Azure Monitor fires. Ad hoc chat investigation. An engineer sees a symptom first and asks the agent to investigate. Both matter in real environments. The first is your scale path. The second is your operator assist path. Incident 1. CPU starvation (alert driven, ~8 min MTTR) The makeline-service deployment manifest contained a CPU and memory configuration that was not viable for startup: resources: requests: cpu: 1m memory: 6Mi limits: cpu: 5m memory: 20Mi Within five minutes, Azure Monitor fired the pod-not-healthy Sev1 alert. The agent picked it up immediately. Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code: "Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario." That is the kind of distinction that often takes an on call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure. The agent then: Identified three additional CPU-throttled pods at 112 to 200% of configured limit using kubectl top. Patched four workloads: makeline-service, virtual-customer, virtual-worker, and mongodb. Verified that all affected pods returned to healthy running state with 0 restarts cluster wide. Azure SRE Agent's Incident History blade confirming full cluster recovery: 4 patches applied, 0 unhealthy pods — no human intervention required. Outcome. Full cluster recovery in ~8 minutes, 0 human interventions. Incident 2. OOMKilled (chat driven, ~4 min MTTR) For the second case, I deployed a deliberately undersized version of order-service: kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. CrashLoopBackOff is a container waiting reason, not a pod phase, so production coverage should come from Prometheus based crash-loop signals rather than pod phase alone. Here is the PromQL query I use in Azure Monitor to catch this class of failure: sum by (namespace, pod) ( ( max_over_time( kube_pod_container_status_waiting_reason{ namespace="pets", reason="CrashLoopBackOff" }[5m] ) == 1 ) and on (namespace, pod, container) ( increase( kube_pod_container_status_restarts_total{ namespace="pets" }[15m] ) > 0 ) ) > 0 This query fires when a container has been in CrashLoopBackOff within the last 5 minutes and its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces. The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it. The agent's reasoning: "Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload." The agent increased the memory limit (20Mi to 128Mi) and request (10Mi to 50Mi), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts. Outcome. Service recovered in ~4 minutes without any manual cluster interaction. Side by side comparison Dimension Incident 1: CPU starvation Incident 2: OOMKilled Trigger Azure Monitor alert (automated) Engineer chat prompt (ad hoc) Failure mode CPU too low for startup probe to pass Memory limit too low for process to start Key signal Exit code 1, probe timeout Exit code 137, empty container logs Blast radius 4 workloads affected cluster wide 1 workload in target namespace Remediation CPU request/limit patches across 4 deployments Memory request/limit patch on 1 deployment MTTR ~8 min ~4 min Human interventions 0 0 Why this matters Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc kubectl commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow. The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps. In this lab, the impact was measurable: Metric This demo with Azure SRE Agent Alert to recovery ~4 to 8 min Human interventions 0 Scope of investigation Cluster wide, automated Correlate evidence and diagnose ~2 min Apply fix and verify ~4 min Post incident follow-up GitHub issue + draft PR These results came from a controlled run on April 10, 2026. Real world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments. Teams + GitHub follow-up Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post incident path matters. After Incident 1 resolved, Azure SRE Agent used the GitHub connector to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix. The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review still remains the final control point before merge. Setup details for the GitHub connector are in the demo repo README, and the official reference is in the Azure SRE Agent docs. Azure SRE Agent fixes the live issue, and the GitHub follow-up prepares the durable source change so future deployments do not reintroduce the same configuration problem. The operations to engineering handoff: Azure SRE Agent fixed the live cluster; GitHub Copilot agent prepares the durable source change so the same misconfiguration can't ship again. In parallel, the Teams connector posted milestone updates during the incident: Investigation started. Root cause and remediation identified. Incident resolved. Teams handled real time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery. Key takeaways Three things to carry forward Treat Azure SRE Agent as a governed incident response system, not a chatbot with infrastructure access. The most important controls are permission levels and run modes, not prompt quality. Anchor detection in your existing incident platforms. For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live. Use connectors to extend the workflow outward. Teams for real time coordination, GitHub for durable engineering follow-up. Start where you're comfortable. If you are just getting your feet wet, begin with one resource group, one incident type, and Review mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling Autonomous. Expand only once each layer is trusted. Next steps Add Prometheus based alert coverage for ImagePullBackOff and node resource pressure to complement the pod phase rule. Expand to multi cluster managed scopes once the single cluster path is trusted and validated. Explore how NAP and Azure SRE Agent complement each other — NAP manages infrastructure capacity, while the agent investigates and remediates incidents. I'd like to thank Cary Chai, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review — his feedback sharpened both the accuracy and quality of this post.738Views0likes0CommentsSecure, Keyless Application Access with Managed Identities - Now GA in Azure Files SMB
As enterprises modernize applications and strengthen their security posture, identity is central to how applications access shared storage. Traditional identity models relying on account keys, stored credentials, or domain‑joined infrastructure add operational overhead and introduce security risks such as credential leakage, lack of identity attribution, and excessive privilege if shared keys are compromised. Today, we are excited to announce the General Availability (GA) of Managed Identity support for Azure Files over SMB, enabling applications and virtual machines to securely access Azure Files without secrets, passwords, or key distribution. Managed Identity support enables customers to meet modern enterprise security standards without reliance on storage account keys, streamlining how organizations securely enable file‑based application access and reducing the operational overhead of filing internal exceptions. New storage accounts can support secure, identity‑based SMB access out of the box, while existing deployments can get secure by enabling Managed Identity authentication. From web application workloads such as WordPress, to databases on Azure Kubernetes Service (AKS), to CI/CD pipelines, applications require secure access. In a world where security is foundational, continued reliance on key-based access conflicts with Zero Trust principles and least privilege access. What’s New In GA AKS Workload Identity Support AKS Workload Identity (preview) extends the traditional managed identity model for Kubernetes by shifting the identity from the node to pods. Instead of inheriting the identity of the underlying cluster, each Kubernetes pod can use its own federated identity, mapped directly to a Microsoft Entra ID principal. This feature enables: Pod-level identity isolation, rather than cluster-level Least-privilege access with secure RBAC Seamless scaling and redeployment, without identity reconfiguration No secrets, no key rotation, no credential injection When combined with Azure Files over SMB, Workload Identity allows AKS workloads to access shared file storage securely and natively per pod, using the same identity-driven model as cluster level managed identities. Now available with AKS 1.35, for customers specifically in the financial services industries, AKS Workload Identity enables per‑application, least‑privilege access to Azure Files without credentials, improving isolation and auditability. This allows regulated, stateful workloads to run securely on AKS while meeting strict compliance and regulatory requirements. Co-existence of Application Identities and end-user identity access Azure Files now enables both Managed Identity and end‑user access on the same storage account, with users and applications independently authenticated via Entra ID and authorized through a shared permissions model. This unified access model eliminates the need for duplicate storage or credentials, enabling secure collaboration, troubleshooting, and automation on shared data without compromising governance or compliance. This supports scenarios such as: Developers accessing the same file share as an application for debugging Admins managing content used by automated workflows Hybrid environments with user-driven and app-driven access Simplified Storage Account enablement via the Azure portal We have now added a dedicated Managed Identity property that makes enabling identity‑based SMB access simple and transparent via the Azure portal for new as well as existing storage accounts. With a single configuration at the storage account level, customers can allow applications to authenticate to Azure Files using Managed Identities. This portal experience supports incremental adoption, making it easy to modernize authentication while maintaining compatibility with existing user access and governance models. Get Started with Managed Identities with SMB Azure Files Start using Managed Identities with Azure Files today at no additional cost. This feature is supported on HDD and SSD SMB shares across all billing models. Refer to our documentation for complete set-up guidance. Whether provisioning new storage or enhancing existing deployments, this capability provides secure, enterprise‑grade access with a streamlined configuration experience. For any questions, reach out to the team at azurefiles@microsoft.com.639Views0likes0CommentsAKS App Routing's Next Chapter: Gateway API with Istio
If you've been following my previous posts on the Ingress NGINX retirement, you'll know the story so far. The community Ingress NGINX project was retired in March 2026, and Microsoft's extended support for the NGINX-based App Routing add-on runs until November 2026. I've covered migrating from standalone NGINX to the App Routing add-on to buy time, and migrating to Application Gateway for Containers as a long-term option. In both of those posts I mentioned that Microsoft was working on a new version of the App Routing add-on based on Istio and the Gateway API. Well, it's here, in preview at least. The App Routing Gateway API implementation is Microsoft's recommended migration path for anyone currently using the NGINX-based App Routing add-on. It moves you off NGINX entirely and onto the Kubernetes Gateway API, with a lightweight Istio control plane handling the gateway infrastructure under the hood. Let's look at what this actually is, how it differs from other options, and how to migrate from both standalone NGINX and the existing App Routing add-on. What Is It? The new App Routing mode uses the Kubernetes Gateway API instead of the Ingress API. When you enable the add-on, AKS deploys an Istio control plane (istiod) to manage Envoy-based gateway proxies. The important thing to understand here is that this is not the full Istio service mesh. There's no sidecar injection, no Istio CRDs installed for your workloads. It's Istio doing one specific job: managing gateway proxies for ingress traffic. When you create a Gateway resource, AKS provisions an Envoy Deployment, a LoadBalancer Service, a HorizontalPodAutoscaler (defaulting to 2-5 replicas at 80% CPU), and a PodDisruptionBudget. All managed. You write Gateway and HTTPRoute resources, and AKS handles everything else. This is a fundamentally different API from what you're used to with Ingress. Instead of a single Ingress resource that combines the entry point and routing rules, Gateway API splits things into layers: GatewayClass defines the type of gateway infrastructure (provided by AKS in this case) Gateway creates the actual gateway with its listeners HTTPRoute defines the routing rules and attaches to a Gateway This separation is one of Gateway API's main selling points. Platform teams can own the Gateway resources while application teams manage their own HTTPRoutes independently, without needing to modify shared infrastructure. If you've ever had a team accidentally break routing for everyone by editing a shared Ingress, you'll appreciate why this matters. How It Differs From the Istio Service Mesh Add-On If you're already running or considering the Istio service mesh add-on for AKS, this is a different thing. The App Routing Gateway API mode uses the approuting-istio GatewayClass, doesn't install Istio CRDs, doesn't enable sidecar injection, and handles upgrades in-place. The full Istio service mesh add-on uses the istio GatewayClass, installs Istio CRDs cluster-wide, enables sidecar injection, and uses canary upgrades for minor versions. The two cannot run at the same time. If you have the Istio service mesh add-on enabled, you need to disable it before enabling App Routing Gateway API (and vice versa). If you need full mesh capabilities like mTLS between services, traffic policies, and telemetry, stick with the Istio service mesh add-on. If you just need managed ingress via Gateway API without the mesh overhead, this is the right choice. Current Limitations The new App Routing solution is in preview, so should not be run in production yet. There are also some gaps compared to the existing add-on, which you need to be aware of before planning a production migration. The biggest one: DNS and TLS certificate management via the add-on isn't supported yet for Gateway API. If you're currently using az aks approuting update and az aks approuting zone add to automate Key Vault and Azure DNS integration with the NGINX-based add-on, that workflow doesn't carry over. TLS termination is still possible, but you'll need to set it up manually. The AKS docs cover the steps, but it's more hands-on than what the NGINX add-on gives you today. This is expected to be addressed when the feature reaches GA. SNI passthrough (TLSRoute) and egress traffic management aren't supported either. And as mentioned, it's mutually exclusive with the Istio service mesh add-on. For production workloads that depend heavily on automated DNS and TLS management, you may want to wait until GA, or look at Application Gateway for Containers as an alternative. But for teams that can handle TLS setup manually, for non-production environments, there's no reason not to start testing this now. Getting Started Before you can enable the feature, you need the aks-preview CLI extension (version 19.0.0b24 or later), the Managed Gateway API CRDs enabled, and the App Routing Gateway API preview feature flag registered: az extension add --name aks-preview az extension update --name aks-preview # Managed Gateway API CRDs (required dependency) az feature register --namespace "Microsoft.ContainerService" --name "ManagedGatewayAPIPreview" # App Routing Gateway API implementation az feature register --namespace "Microsoft.ContainerService" --name "AppRoutingIstioGatewayAPIPreview" Feature flag registration can take a few minutes. Once they're registered, enable the add-on on a new or existing cluster. You need both --enable-gateway-api (for the managed Gateway API CRD installation) and --enable-app-routing-istio (for the Istio-based implementation): # New cluster az aks create \ --resource-group ${RESOURCE_GROUP} \ --name ${CLUSTER} \ --location swedencentral \ --enable-gateway-api \ --enable-app-routing-istio # Existing cluster az aks update \ --resource-group ${RESOURCE_GROUP} \ --name ${CLUSTER} \ --enable-gateway-api \ --enable-app-routing-istio Verify istiod is running: kubectl get pods -n aks-istio-system You should see two istiod pods in a Running state. From here, you can create a Gateway and HTTPRoute to test traffic flow. The AKS quickstart walks through this with the httpbin sample app if you want a quick validation. Migrating From NGINX Ingress Whether you're running standalone NGINX (self-installed via Helm) or the NGINX-based App Routing add-on, the migration process is essentially the same. You're moving from Ingress API resources to Gateway API resources, and the new controller runs alongside your existing one during the transition. The only real differences are what you're cleaning up at the end and, if you're on the App Routing add-on, whether you were relying on its built-in DNS and TLS automation. Inventory Your Ingress Resources Before anything else, understand what you have: kubectl get ingress --all-namespaces \ -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' Look specifically for custom snippets, lua configurations, or anything that relies heavily on NGINX-specific behaviour. These won't have direct equivalents in Gateway API and will need manual attention. Convert Ingress Resources to Gateway API The ingress2gateway tool (v1.0.0) handles conversion of Ingress resources to Gateway API equivalents. It supports over 30 common NGINX annotations and generates Gateway and HTTPRoute YAML. It works regardless of whether your Ingress resources use the nginx or webapprouting.kubernetes.azure.com IngressClass: # Install go install github.com/kubernetes-sigs/ingress2gateway@v1.0.0 # Convert from live cluster ingress2gateway print --providers=ingress-nginx -A > gateway-resources.yaml # Or convert from a local file ingress2gateway print --providers=ingress-nginx --input-file=./manifests/ingress.yaml > gateway-resources.yaml Review the output carefully. The tool flags annotations it can't convert as comments in the generated YAML, so you'll know exactly what needs manual work. Common gaps include custom configuration snippets and regex-based rewrites that don't map cleanly to Gateway API's routing model. Make sure you update the gatewayClassName in the generated Gateway resources to approuting-istio. The tool may generate a generic GatewayClass name that you'll need to change. Handle DNS and TLS If you're coming from standalone NGINX, you're likely managing DNS and TLS yourself already, so nothing changes here: just make sure your certificate Secrets and DNS records are ready for the new Gateway IP. If you're coming from the App Routing add-on and relying on its built-in DNS and TLS management (via az aks approuting zone add and Key Vault integration), this is the part that needs extra thought. That automation doesn't carry over to the Gateway API implementation yet, so you'll need to handle it differently until GA. For TLS, you can either create Kubernetes Secrets with your certificates manually or set up a workflow to sync them from Key Vault. The AKS docs on securing Gateway API traffic cover the manual approach. For DNS, you'll need to manage records yourself or use ExternalDNS to automate it. ExternalDNS supports Gateway API resources, so this is a viable path if you want automation. Deploy and Validate With the add-on enabled, apply your converted resources: kubectl apply -f gateway-resources.yaml Wait for the Gateway to be programmed and get the external IP: kubectl wait --for=condition=programmed gateways.gateway.networking.k8s.io <gateway-name> export GATEWAY_IP=$(kubectl get gateways.gateway.networking.k8s.io <gateway-name> -ojsonpath='{.status.addresses[0].value}') The key thing here is that your existing NGINX controller (whether standalone or add-on managed) is still running and serving production traffic. The Gateway API resources are handled separately by the Istio-based controller in aks-istio-system. This parallel running is what makes the migration safe. Test your routes against the new Gateway IP, you'll need to provide the appropriate URL as a host header, as your DNS will still be pointing at the NGINX Add-On at this point. curl -H "Host: myapp.example.com" http://$GATEWAY_IP Run your full validation suite. Check TLS, path routing, headers, authentication, anything your applications depend on. Take your time here; nothing changes for production until you update DNS. Cut Over DNS and Clean Up Once you're confident, lower your DNS TTL to 60 seconds (do this well in advance), then update your DNS records to point to the new Gateway IP. Keep the old NGINX controller running for 24-48 hours as a rollback option. After traffic has been flowing cleanly through the Gateway API path, clean up the old setup. What this looks like depends on where you started: If you were on standalone NGINX: helm uninstall ingress-nginx -n ingress-nginx kubectl delete namespace ingress-nginx If you were on the App Routing add-on with NGINX: Verify nothing is still using the old IngressClass: kubectl get ingress --all-namespaces \ -o custom-columns='NAMESPACE:.metadata.namespace,NAME:.metadata.name,CLASS:.spec.ingressClassName' \ | grep "webapprouting" Delete any remaining Ingress resources that reference the old class, then disable the NGINX-based App Routing add-on: az aks approuting disable --resource-group ${RESOURCE_GROUP} --name ${CLUSTER} Some resources (configMaps, secrets, and the controller deployment) will remain in the app-routing-system namespace after disabling. You can clean these up by deleting the namespace once you're satisfied everything is running through the Gateway API path: kubectl delete ns app-routing-system In both cases, clean up any old Ingress resources that are no longer being used. Upgrades and Lifecycle The Istio control plane version is tied to your AKS cluster's Kubernetes version. AKS automatically handles patch upgrades as part of its release cycle, and minor version upgrades happen in-place when you upgrade your cluster's Kubernetes version or when a new Istio minor version is released for your AKS version. One thing to be aware of - unlike the Istio service mesh add-on, upgrades here are in-place, not canary-based. The HPA and PDB on each Gateway help minimise disruption, but plan accordingly for production. If you have maintenance windows configured, the istiod upgrades will respect them. What Should You Do Now? The timeline hasn't changed. The standalone NGINX Ingress project was retired in March 2026, so if you're still running that, you're already on unsupported software. The NGINX App Routing add-on is supported until November 2026, which gives you a window, but it's not a long one. If you're on standalone NGINX you could get onto the App Routing add-on now to buy time (I covered this in my earlier post), then plan your migration to either the Gateway API mode or AGC. If you're on the NGINX App Routing add-on: start testing the Gateway API mode in non-production now. Get familiar with the Gateway API resource model, understand the TLS and DNS gaps in the preview, and be ready to migrate when the feature reaches GA or when November gets close, whichever comes first. If you need production-ready TLS and DNS automation today and can't wait for GA, App Gateway for Containers is your best option right now. Whatever path you choose, make sure you have a plan in place before November. Running unsupported ingress software on production infrastructure isn't where you want to be.471Views1like0CommentsAzure Monitor in Azure SRE Agent: Autonomous Alert Investigation and Intelligent Merging
Azure Monitor is great at telling you something is wrong. But once the alert fires, the real work begins — someone has to open the portal, triage it, dig into logs, and figure out what happened. That takes time. And while they're investigating, the same alert keeps firing every few minutes, stacking up duplicates of a problem that's already being looked at. This is exactly what Azure SRE Agent's Azure Monitor integration addresses. The agent picks up alerts as they fire, investigates autonomously, and remediates when it can — all without waiting for a human to get involved. And when that same alert fires again while the investigation is still underway, the agent merges it into the existing thread rather than creating a new one. In this blog, we'll walk through the full Azure Monitor experience in SRE Agent with a live AKS + Redis scenario — how alerts get picked up, what the agent does with them, how merging handles the noise, and why one often-overlooked setting (auto-resolve) makes a bigger difference than you'd expect. Key Takeaways Set up Incident Response Plans to scope which alerts the agent handles — filter by severity, title patterns, and resource type. Start with review mode, then promote to autonomous once you trust the agent's behavior for that failure pattern. Recurring alerts merge into one thread automatically — when the same alert rule fires repeatedly, the agent merges subsequent firings into the existing investigation instead of creating duplicates. Turn auto-resolve OFF for persistent failures (bad credentials, misconfigurations, resource exhaustion) so all firings merge into one thread. Turn it ON for transient issues (traffic spikes, brief timeouts) so each gets a fresh investigation. Design alert rules around failure categories, not components — one alert rule = one investigation thread. Structure rules by symptom (Redis errors, HTTP errors, pod health) to give the agent focused, non-overlapping threads. Attach Custom Response Plans for specialized handling — route specific alert patterns to custom-agents with custom instructions, tools, and runbooks. It Starts with Any Azure Monitor Alert Before we get to the demo, a quick note on what SRE Agent actually watches. The agent queries the Azure Alerts Management REST API, which returns every fired alert regardless of signal type. Log search alerts, metric alerts, activity log alerts, smart detection, service health, Prometheus — all of them come through the same API, and the agent processes them all the same way. You don't need to configure connectors or webhooks per alert type. If it fires in Azure Monitor, the agent can see it. What you do need to configure is which alerts the agent should care about. That's where Incident Response Plans come in. Setting Up: Incident Response Plans and Alert Rules We start by heading to Settings > Incident Platform > Azure Monitor and creating an Incident Response Plan. Response Plans et you scope the agent's attention by severity, alert name patterns, target resource types, and — importantly — whether the agent should act autonomously or wait for human approval. Action: Match the agent mode to your confidence in the remediation, not just the severity. Use autonomous mode for well-understood failure patterns where the fix is predictable and safe (e.g., rolling back a bad config, restarting a pod). Use review mode for anything where you want a human to validate before the agent acts — especially Sev0/Sev1 alerts that touch critical systems. You can always start in review mode and promote to autonomous once you've validated the agent's behavior. For our demo, we created a Sev1 response plan in autonomous mode — meaning the agent would pick up any Sev1 alert and immediately start investigating and remediating, no approval needed. On the Azure Monitor side, we set up three log-based alert rules against our AKS cluster's Log Analytics workspace. The star of the show was a Redis connection error alert — a custom log search query looking for WRONGPASS, ECONNREFUSED, and other Redis failure signatures in ContainerLog: Each rule evaluates every 5 minutes with a 15-minute aggregation window. If the query returns any results, the alert fires. Simple enough. Breaking Redis (On Purpose) Our test app is a Node.js journal app on AKS, backed by Azure Cache for Redis. To create a realistic failure scenario, we updated the Redis password in the Kubernetes secret to a wrong value. The app pods picked up the bad credential, Redis connections started failing, and error logs started flowing. Within minutes, the Redis connection error alert fired. What Happened Next Here's where it gets interesting. We didn't touch anything — we just watched. The agent's scanner polls the Azure Monitor Alerts API every 60 seconds. It spotted the new alert (state: "New", condition: "Fired"), matched it against our Sev1 Incident Response Plan, and immediately acknowledged it in Azure Monitor — flipping the state to "Acknowledged" so other systems and humans know someone's on it. Then it created a new investigation thread. The thread included everything the agent needed to get started: the alert ID, rule name, severity, description, affected resource, subscription, resource group, and a deep-link back to the Azure Portal alert. From there, the agent went to work autonomously. It queried container logs, identified the Redis WRONGPASS errors, traced them to the bad secret, retrieved the correct access key from Azure Cache for Redis, updated the Kubernetes secret, and triggered a pod rollout. By the time we checked the thread, it was already marked "Completed." No pages. No human investigation. No context-switching. But the Alert Kept Firing... Here's the thing — our alert rule evaluates every 5 minutes. Between the first firing and the agent completing the fix, the alert fired again. And again. Seven times total over 35 minutes. Without intelligent handling, that would mean seven separate investigation threads. Seven notifications. Seven disruptions. SRE Agent handles this with alert merging. When a subsequent firing comes in for the same alert rule, the agent checks: is there already an active thread for this rule, created within the last 7 days, that hasn't been resolved or closed? If yes, the new firing gets silently merged into the existing thread — the total alert count goes up, the "Last fired" timestamp updates, and that's it. No new thread, no new notification, no interruption to the ongoing investigation. How merging decides: new thread or merge? Condition Result Same alert rule, existing thread still active Merged — alert count increments, no new thread Same alert rule, existing thread resolved/closed New thread — fresh investigation starts Different alert rule New thread — always separate Five minutes after the first alert, the second firing came in and that continued. The agent finished the fix and closed the thread, and the final tally was one thread, seven merged alerts — spanning 35 minutes of continuous firings. On the Azure Portal side, you can see all seven individual alert instances. Each one was acknowledged by the agent. 7 Redis Connection Error Alert entries, all Sev1, Fired condition, Closed by user, spanning 8:50 PM to 9:21 PM Seven firings. One investigation. One fix. That's the merge in action. The Auto-Resolve Twist Now here's the part we didn't expect to matter as much as it did. Azure Monitor has a setting called "Automatically resolve alerts". When enabled, Azure Monitor automatically transitions an alert to "Resolved" once the underlying condition clears — for example, when the Redis errors stop because the pod restarted. For our first scenario above, we had auto-resolve turned off. That's why the alert stayed in "Fired" state across all seven evaluation cycles, and all seven firings merged cleanly into one thread. But what happens if auto-resolve is on? We turned it on and ran the same scenario again: Here's what happened: Redis broke. Alert fired. Agent picked it up and created a thread. The agent investigated, found the bad Redis password, fixed it. With Redis working again, error logs stopped. We noticed that the condition cleared and closed all the 7 alerts manually. We broke Redis a second time (simulating a recurrence). The alert fired again — but the previous alert was already closed/resolved. The merge check found no active thread. A brand-new thread was created, reinvestigated and mitigated. Two threads for the same alert rule, right there on the Incidents page: And on the Azure Monitor side, the newest alert shows "Resolved" condition — that's the auto-resolve doing its thing: For a persistent failure like a Redis misconfiguration, this is clearly worse. You get a new investigation thread every break-fix cycle instead of one continuous investigation. So, Should You Just Turn Auto-Resolve Off? No. It depends on what kind of failure the alert is watching for. Quick Reference: Auto-Resolve Decision Guide Auto-Resolve OFF Auto-Resolve ON Use when Problem persists until fixed Problem is transient and self-correcting Examples Bad credentials, misconfigurations, CrashLoopBackOff, connection pool exhaustion, IOPS limits OOM kills during traffic spikes, brief latency from neighboring deployments, one-off job timeouts Merge behavior All repeat firings merge into one thread Each break-fix cycle creates a new thread Best for Agent is actively managing the alert lifecycle Each occurrence may have a different root cause Tradeoff Alerts stay in "Fired/Acknowledged" state in Azure Monitor until the agent closes them More threads, but each gets a clean investigation Turn auto-resolve OFF when you want repeated firings from the same alert rule to stay in a single investigation thread until the alert is explicitly resolved or closed in Azure Monitor. This works best for persistent issues such as a Kubernetes deployment stuck in CrashLoopBackOff because of a bad image tag, a database connection pool exhausted due to a leaked connection, or a storage account hitting its IOPS limit under sustained load. Turn auto-resolve ON when you want a new investigation thread after the previous occurrence has been resolved or closed in Azure Monitor. This works best for episodic or self-clearing issues such as a pod getting OOM-killed during a temporary traffic spike, a brief latency increases during a neighboring service’s deployment, or a scheduled job that times out once due to short-lived resource contention. The key question is: when this alert fires again, is it the same ongoing problem or a new one? If it's the same problem, turn auto-resolve off and let the merges do their job. If it's a new problem, leave auto-resolve on and let the agent investigate fresh. Note: These behaviors describe how SRE Agent groups alert investigations and may differ from how Azure Monitor documents native alert state behavior. A Few Things We Learned Along the Way Design alert rules around symptoms, not components. Each alert rule maps to one investigation thread. We structured ours around failure categories — root cause signal (Redis errors, Sev1), blast radius signal (HTTP errors, Sev2), infrastructure signal (unhealthy pods, Sev2). This gave the agent focused threads without overlap. Incident Response Plans let you tier your response. Not every alert needs the agent to go fix things immediately. We used a Sev1 filter in autonomous mode for the Redis alert, but you could set up a Sev2 filter in review mode — the agent investigates and provides analysis but waits for human approval before taking action. Response Plans specialize the agent. For specific alert patterns, you can give the agent custom instructions, specialized tools, and a tailored system prompt. A Redis alert can route to a custom-agent loaded with Redis-specific runbooks; a Kubernetes alert can route to one with deep kubectl expertise. Best Practices Checklist Here's what we learned distilled into concrete actions: Alert Rule Design Do Don't Design rules around failure categories (root cause, blast radius, infra health) Create one alert per component — you'll get overlapping threads Set evaluation frequency and aggregation window to match the failure pattern Use the same frequency for everything — transient vs. persistent issues need different cadences Example rule structure from our test: Root cause signal — Redis WRONGPASS/ECONNREFUSED errors → Sev1 Blast radius signal — HTTP 5xx response codes → Sev2 Infrastructure signal — KubeEvents Reason="Unhealthy" → Sev2 Incident Response Plan Setup Do Don't Create separate response plans per severity tier Use one catch-all filter for everything Start with review mode — especially for Sev0/Sev1 where wrong fixes are costly Jump straight to autonomous mode on critical alerts without validating agent behavior first Promote to autonomous mode once you've validated the agent handles a specific failure pattern correctly Assume severity alone determines the right mode — it's about confidence in the remediation Response Plans Do Don't Attach custom response plans to specific alert patterns for specialized handling Leave every alert to the agent's general knowledge Include custom instructions, tools, and runbooks relevant to the failure type Write generic instructions — the more specific, the better the investigation Route Redis alerts to a Redis-specialized custom-agent; K8s alerts to one with kubectl expertise Assume one agent configuration fits all failure types Getting Started Head to sre.azure.com and open your agent Make sure the agent's managed identity has Monitoring Reader on your target subscriptions Go to Settings > Incident Platform > Azure Monitor and create your Incident Response Plans Review the auto-resolve setting on your alert rules — turn it off for persistent issues, leave it on for transient ones (see the decision guide above) Start with a test response plan using Title Contains to target a specific alert rule — validate agent behavior before broadening Watch the Incidents page and review the agent's investigation threads before expanding to more alert rules Learn More Azure SRE Agent Documentation Incident Response Guide Azure Monitor Alert Rules448Views0likes0Comments