# Autonomous AKS Incident Response with Azure SRE Agent: From Alert to Verified Recovery in Minutes
When a Sev1 alert fires on an AKS cluster, detection is rarely the hard part. The hard part is what comes next: proving what broke, why it broke, and fixing it without widening the blast radius, all under time pressure, often at 2 a.m. Azure SRE Agent is designed to close that gap. It connects Azure-native observability, AKS diagnostics, and engineering workflows into a single incident-response loop that can investigate, remediate, verify, and follow up, without waiting for a human to page through dashboards and run ad-hoc kubectl commands.

This post walks through that loop in two real AKS failure scenarios. In both cases, the agent received an incident, investigated Azure Monitor and AKS signals, applied targeted remediation, verified recovery, and created follow-up in GitHub, all while keeping the team informed in Microsoft Teams.

## Core concepts

Azure SRE Agent is a governed incident-response system, not a conversational assistant with infrastructure access. Five concepts matter most in an AKS incident workflow:

- **Incident platform.** Where incidents originate. In this demo, that is Azure Monitor.
- **Built-in Azure capabilities.** The agent uses Azure Monitor, Log Analytics, Azure Resource Graph, Azure CLI/ARM, and AKS diagnostics without requiring external connectors.
- **Connectors.** Extend the workflow to systems such as GitHub, Teams, Kusto, and MCP servers.
- **Permission levels.** Reader for investigation and read-oriented access, Privileged for operational changes when allowed.
- **Run modes.** Review for approval-gated execution and Autonomous for direct execution.

The most important production controls are permission level and run mode, not prompt quality. Custom instructions can shape workflow behavior, but they do not replace RBAC, telemetry quality, or tool availability. The safest production rollout path:

1. Start: Reader + Review
2. Then: Privileged + Review
3. Finally: Privileged + Autonomous, only for narrow, trusted incident paths.
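The interaction between permission level and run mode can be pictured as a simple gating rule. This is an illustrative sketch only; the names `PermissionLevel`, `RunMode`, `can_execute_write`, and `requires_approval` are my own, not the product's API:

```python
from enum import Enum

class PermissionLevel(Enum):
    READER = "reader"          # investigation and read-oriented access
    PRIVILEGED = "privileged"  # operational changes allowed

class RunMode(Enum):
    REVIEW = "review"          # approval-gated execution
    AUTONOMOUS = "autonomous"  # direct execution

def can_execute_write(permission: PermissionLevel) -> bool:
    # Permission level controls what the agent can do at all.
    return permission is PermissionLevel.PRIVILEGED

def requires_approval(mode: RunMode) -> bool:
    # Run mode controls whether it asks for approval or acts directly.
    return mode is RunMode.REVIEW
```

The rollout path above walks this grid from the safest cell (Reader + Review) to the most autonomous one (Privileged + Autonomous), enabling each capability only after the previous layer is trusted.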
## Demo environment

The full scripts and manifests are available if you want to reproduce this: github.com/hailugebru/azure-sre-agents-aks. The README includes setup and configuration details.

The environment uses an AKS cluster with node auto-provisioning (NAP), Azure CNI Overlay powered by Cilium, managed Prometheus metrics, the AKS Store sample microservices application, and Azure SRE Agent configured for incident-triggered investigation and remediation. This setup is intentionally realistic but minimal: it provides enough surface area to exercise real AKS failure modes without distracting from the incident workflow itself.

```text
Azure Monitor → Action Group → Azure SRE Agent → AKS Cluster
   (Alert)       (Webhook)    (Investigate / Fix)  (Recover)
                                      ↓
     Teams notification + GitHub issue → GitHub Agent → PR for review
```

## How the agent was configured

Configuration came down to four things: scope, permissions, incident intake, and response mode.

I scoped the agent to the demo resource group and used its user-assigned managed identity (UAMI) for Azure access. That scope defined what the agent could investigate, while RBAC determined what actions it could take. I used broader AKS permissions than I would recommend as a default production baseline so the agent could complete remediation end to end in the lab. That is an important distinction: permissions control what the agent can access, while run mode controls whether it asks for approval or acts directly.

For this scenario, Azure Monitor served as the incident platform, and I set the response plan to Autonomous for a narrow, trusted path so the workflow could run without manual approval gates. I also added Teams and GitHub integrations so the workflow could extend beyond Azure. Teams provided milestone updates during the incident, and GitHub provided durable follow-up after remediation. For the complete setup, see the README.

### A note on context
The more context you can provide the agent about your environment, resources, runbooks, and conventions, the better it performs. Scope boundaries, known workloads, common failure patterns, and links to relevant documentation all sharpen its investigations and reduce the time it spends exploring. Treat custom instructions and connector content as first-class inputs, not afterthoughts.

## Two incidents, two response modes

These incidents occurred on the same cluster in one session and illustrate two realistic operating modes:

- **Alert-triggered automation.** The agent acts when Azure Monitor fires.
- **Ad hoc chat investigation.** An engineer sees a symptom first and asks the agent to investigate.

Both matter in real environments. The first is your scale path. The second is your operator-assist path.

## Incident 1: CPU starvation (alert driven, ~8 min MTTR)

The makeline-service deployment manifest contained a CPU and memory configuration that was not viable for startup:

```yaml
resources:
  requests:
    cpu: 1m
    memory: 6Mi
  limits:
    cpu: 5m
    memory: 20Mi
```

Within five minutes, Azure Monitor fired the pod-not-healthy Sev1 alert. The agent picked it up immediately. Here is the key diagnostic conclusion the agent reached from the pod state, probe behavior, and exit code:

> "Exit code 1 (not 137) rules out OOMKill. The pod failed at startup, not at runtime memory pressure. CPU limit of 5m is insufficient for the process to bind its port before the startup probe times out. This is a configuration error, not a resource exhaustion scenario."

That is the kind of distinction that often takes an on-call engineer several minutes to prove under pressure: startup failure from CPU starvation vs. runtime termination from memory pressure. The agent then:

- Identified three additional CPU-throttled pods at 112 to 200% of configured limit using kubectl top.
- Patched four workloads: makeline-service, virtual-customer, virtual-worker, and mongodb.
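The distinction the agent drew follows mechanically from the observable signals. Here is a minimal triage sketch of that reasoning, using my own illustrative heuristic rather than the agent's actual implementation:

```python
def triage_crash(exit_code: int, startup_probe_failed: bool) -> str:
    """Classify a container crash from its exit code and probe behavior.

    Illustrative heuristic: exit code 137 means the process received
    SIGKILL (128 + 9), which on a memory-limited pod typically indicates
    an OOMKill; exit code 1 combined with a failed startup probe points
    at a startup/configuration failure instead of runtime memory pressure.
    """
    if exit_code == 137:
        return "oomkill"  # killed at runtime by memory pressure
    if exit_code == 1 and startup_probe_failed:
        return "startup-failure"  # e.g. CPU-starved before binding its port
    return "unknown"

# Incident 1: exit code 1 + startup probe timeout -> configuration error
print(triage_crash(1, startup_probe_failed=True))     # startup-failure
# Incident 2 (below): exit code 137, empty logs -> OOMKill
print(triage_crash(137, startup_probe_failed=False))  # oomkill
```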
Verified that all affected pods returned to a healthy running state with 0 restarts cluster wide.

Azure SRE Agent's Incident History blade confirmed full cluster recovery: 4 patches applied, 0 unhealthy pods, no human intervention required.

**Outcome.** Full cluster recovery in ~8 minutes, 0 human interventions.

## Incident 2: OOMKilled (chat driven, ~4 min MTTR)

For the second case, I deployed a deliberately undersized version of order-service:

```shell
kubectl apply -f .\manifests\aks-store\order-service-changed.yaml -n pets
```

I started this case from chat before the pod-phase alert fired to demonstrate the interactive troubleshooting flow. That was a demo choice, not an alerting gap. CrashLoopBackOff is a container waiting reason, not a pod phase, so production coverage should come from Prometheus-based crash-loop signals rather than pod phase alone. Here is the PromQL query I use in Azure Monitor to catch this class of failure:

```promql
sum by (namespace, pod) (
  (
    max_over_time(
      kube_pod_container_status_waiting_reason{
        namespace="pets",
        reason="CrashLoopBackOff"
      }[5m]
    ) == 1
  )
  and on (namespace, pod, container) (
    increase(
      kube_pod_container_status_restarts_total{
        namespace="pets"
      }[15m]
    ) > 0
  )
) > 0
```

This query fires when a container has been in CrashLoopBackOff within the last 5 minutes and its restart count has increased in the last 15 minutes. In production, replace the hardcoded namespace with a regex matcher or remove it entirely to cover all namespaces.

The prompt to the agent:

> The order-service pod in the pets namespace is not healthy. Please investigate, identify the root cause, and fix it.

The agent's reasoning:

> "Container logs are empty. The process was killed before it could write its first log line. Exit code 137 confirms OOMKill. No NODE_OPTIONS in the ConfigMap rules out a V8 heap misconfiguration. The 20Mi limit is 12.8x below the pod's observed 50Mi runtime baseline. This limit was never viable for this workload."
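The viability judgment in that reasoning (a limit well below the workload's observed baseline could never have worked) is easy to mechanize. A hypothetical sketch, where the 1.2 headroom factor is my own assumption, not a product default:

```python
def limit_viability(limit_mi: float, observed_baseline_mi: float,
                    headroom: float = 1.2) -> str:
    """Judge a memory limit against an observed runtime baseline.

    Illustrative heuristic: a limit below the baseline itself was never
    viable; a limit below baseline * headroom is too tight; otherwise it
    is plausible.
    """
    if limit_mi < observed_baseline_mi:
        return "never-viable"
    if limit_mi < observed_baseline_mi * headroom:
        return "too-tight"
    return "plausible"

# Incident 2: a 20Mi limit vs. a 50Mi observed baseline
print(limit_viability(20, 50))   # never-viable
# The remediated config: a 128Mi limit vs. a 74Mi steady state
print(limit_viability(128, 74))  # plausible
```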
The agent increased the memory limit (20Mi to 128Mi) and request (10Mi to 50Mi), then verified the new pod stabilized at 74Mi/128Mi (58% utilization) with 0 restarts.

**Outcome.** Service recovered in ~4 minutes without any manual cluster interaction.

## Side-by-side comparison

| Dimension | Incident 1: CPU starvation | Incident 2: OOMKilled |
| --- | --- | --- |
| Trigger | Azure Monitor alert (automated) | Engineer chat prompt (ad hoc) |
| Failure mode | CPU too low for startup probe to pass | Memory limit too low for process to start |
| Key signal | Exit code 1, probe timeout | Exit code 137, empty container logs |
| Blast radius | 4 workloads affected cluster wide | 1 workload in target namespace |
| Remediation | CPU request/limit patches across 4 deployments | Memory request/limit patch on 1 deployment |
| MTTR | ~8 min | ~4 min |
| Human interventions | 0 | 0 |

## Why this matters

Most AKS environments already emit rich telemetry through Azure Monitor and managed Prometheus. What is still manual is the response: engineers paging through dashboards, running ad-hoc kubectl commands, and applying hotfixes under time pressure. Azure SRE Agent changes that by turning repeatable investigation and remediation paths into an automated workflow. The value isn't just that the agent patched a CPU limit. It's that the investigation, remediation, and verification loop is the same regardless of failure mode, and it runs while your team sleeps.

In this lab, the impact was measurable:

| Metric | This demo with Azure SRE Agent |
| --- | --- |
| Alert to recovery | ~4 to 8 min |
| Human interventions | 0 |
| Scope of investigation | Cluster wide, automated |
| Correlate evidence and diagnose | ~2 min |
| Apply fix and verify | ~4 min |
| Post-incident follow-up | GitHub issue + draft PR |

These results came from a controlled run on April 10, 2026. Real-world outcomes depend on alert quality, cluster size, and how much automation you enable. For reference, industry reports from PagerDuty and Datadog typically place manual Sev1 MTTR in the 30 to 120 minute range for Kubernetes environments.
## Teams + GitHub follow-up

Runtime remediation is only half the story. If the workflow ends when the pod becomes healthy again, the same issue returns on the next deployment. That is why the post-incident path matters.

After Incident 1 resolved, Azure SRE Agent used the GitHub connector to file an issue with the incident summary, root cause, and runtime changes. In the demo, I assigned that issue to GitHub Copilot agent, which opened a draft pull request to align the source manifests with the hotfix. The agent can also be configured to submit the PR directly in the same workflow, not just open the issue, so the fix is in your review queue by the time anyone sees the notification. Human review remains the final control point before merge. Setup details for the GitHub connector are in the demo repo README, and the official reference is in the Azure SRE Agent docs.

This is the operations-to-engineering handoff: Azure SRE Agent fixes the live cluster, and the GitHub follow-up prepares the durable source change so future deployments cannot reintroduce the same misconfiguration.

In parallel, the Teams connector posted milestone updates during the incident:

1. Investigation started.
2. Root cause and remediation identified.
3. Incident resolved.

Teams handled real-time situational awareness. GitHub handled durable engineering follow-up. Together, they closed the gap between operations and software delivery.

## Key takeaways

Three things to carry forward:

- Treat Azure SRE Agent as a governed incident-response system, not a chatbot with infrastructure access. The most important controls are permission levels and run modes, not prompt quality.
- Anchor detection in your existing incident platforms. For this demo, we used Prometheus and Azure Monitor, but the pattern applies regardless of where your signals live.
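The follow-up issue can be thought of as a mapping from incident findings to a GitHub issue payload. A hand-rolled sketch: the field names `summary`, `root_cause`, and `runtime_changes` are my own, and the connector's real field mapping is internal, but the resulting dict matches the shape accepted by GitHub's `POST /repos/{owner}/{repo}/issues` endpoint:

```python
def build_followup_issue(incident: dict) -> dict:
    """Build a GitHub issue payload from incident findings (illustrative)."""
    body = "\n\n".join([
        f"## Summary\n{incident['summary']}",
        f"## Root cause\n{incident['root_cause']}",
        "## Runtime changes applied\n" + "\n".join(
            f"- {change}" for change in incident["runtime_changes"]
        ),
    ])
    return {
        "title": f"[SRE Agent] Align manifests with hotfix: {incident['id']}",
        "body": body,
        "labels": ["sre-agent", "post-incident"],
    }

issue = build_followup_issue({
    "id": "incident-1-cpu-starvation",
    "summary": "makeline-service failed startup under a 5m CPU limit.",
    "root_cause": "CPU limit too low for the startup probe to pass.",
    "runtime_changes": ["Patched CPU requests/limits on 4 deployments"],
})
print(issue["title"])
```

Assigning such an issue to a coding agent is what turns the runtime hotfix into a reviewable source change.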
- Use connectors to extend the workflow outward. Teams for real-time coordination, GitHub for durable engineering follow-up.
- Start where you're comfortable. If you are just getting your feet wet, begin with one resource group, one incident type, and Review mode. Validate that telemetry flows, RBAC is scoped correctly, and your alert rules cover the failure modes you actually care about before enabling Autonomous. Expand only once each layer is trusted.

## Next steps

- Add Prometheus-based alert coverage for ImagePullBackOff and node resource pressure to complement the pod-phase rule.
- Expand to multi-cluster managed scopes once the single-cluster path is trusted and validated.
- Explore how NAP and Azure SRE Agent complement each other: NAP manages infrastructure capacity, while the agent investigates and remediates incidents.

I'd like to thank Cary Chai, Senior Product Manager for Azure SRE Agent, for his early technical guidance and thorough review; his feedback sharpened both the accuracy and quality of this post.

# New in Azure SRE Agent: Log Analytics and Application Insights Connectors
Azure SRE Agent now supports Log Analytics and Application Insights as log providers, backed by the Azure MCP Server. Connect your workspaces and App Insights resources, and the agent can query them directly during investigations.

## Why This Matters

Log Analytics and Application Insights are common destinations for Azure operational data: container logs, application traces, dependency failures, security events. The agent could already access this data through az monitor CLI commands if you granted RBAC roles to its managed identity, and that approach still works. But it required manual RBAC setup, and the agent had to shell out to the CLI for every query.

With these connectors, setup is simpler and querying is faster. You pick a workspace, we handle the RBAC grants, and the agent gets native MCP-backed query tools instead of going through the CLI.

## What You Get

Two new connector types in Builder > Connectors (or through the onboarding flow under Logs):

- **Log Analytics** - connect a workspace. The agent can query ContainerLog, Syslog, AzureDiagnostics, KubeEvents, SecurityEvent, custom tables, anything in that workspace.
- **Application Insights** - connect an App Insights resource. The agent gets access to requests, dependencies, exceptions, traces, and custom telemetry.

You can connect multiple workspaces and App Insights resources. The agent knows which ones are available and targets the right one based on the investigation.

## Setup

If you want early access, enable Early access to features under Settings > Basics. From there you can add connectors in two ways:

- **Through onboarding:** Click Logs in the onboarding flow, then select Log Analytics Workspace or Application Insights under Additional connectors.
- **Through Builder:** Go to Builder > Connectors in the sidebar and add a Log Analytics or Application Insights connector. Pick your resource from the dropdown and save.

If discovery doesn't find your resource, both connector types have a manual entry fallback.
On save, we grant the agent's managed identity Log Analytics Reader and Monitoring Reader on the target resource group. If your account can't assign roles, you can grant them separately.

## Backed by Azure MCP

Under the hood, this uses the Azure MCP Server with the monitor namespace. When you save your first connector, we spin up an MCP server instance automatically. The agent gets access to tools like:

- monitor_workspace_log_query - KQL against a workspace
- monitor_resource_log_query - KQL against a specific resource
- monitor_workspace_list - discover workspaces
- monitor_table_list - list tables in a workspace

Everything is read-only. The agent can query but never modify your monitoring configuration. If different connectors use different managed identities, the system handles per-call identity routing automatically.

## What It Looks Like

An alert fires on your AKS cluster. The agent starts investigating and queries your connected workspace:

```kusto
ContainerLog
| where TimeGenerated > ago(30m)
| where LogEntry contains "error" or LogEntry contains "exception"
| summarize count() by ContainerID, LogEntry
| top 10 by count_
```

```kusto
KubeEvents
| where TimeGenerated > ago(1h)
| where Reason in ("BackOff", "Failed", "Unhealthy")
| summarize count() by Reason, Name, Namespace
| order by count_ desc
```

The agent also ships with built-in skills for common Log Analytics and App Insights query patterns, so it knows which tables to look at and how to structure queries for typical failure scenarios.

## Things to Know

- **Read-only** - the agent can query data but cannot modify alerts, retention, or workspace config
- **Resource discovery needs Reader** - the dropdown uses Azure Resource Graph. If your resources don't show up, use the manual entry fallback
- **One identity per connector** - if workspaces need different managed identities, create separate connectors

## Learn More

- Azure SRE Agent documentation
- Azure MCP Server

We'd love feedback. Try it out and let us know what works and what doesn't.
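"Targets the right one" can be pictured as routing on the first table a KQL query touches: application telemetry tables live in App Insights, platform and container tables in Log Analytics workspaces. A toy sketch of that idea; the table sets and function name are my own illustration, not the product's internals:

```python
# App Insights resources expose application telemetry tables;
# Log Analytics workspaces hold platform/container tables.
APP_INSIGHTS_TABLES = {"requests", "dependencies", "exceptions", "traces"}
LOG_ANALYTICS_TABLES = {"ContainerLog", "Syslog", "AzureDiagnostics",
                        "KubeEvents", "SecurityEvent"}

def route_query(kql: str) -> str:
    """Pick a connector type from the first token of a KQL query."""
    table = kql.lstrip().split()[0].split("|")[0]
    if table in APP_INSIGHTS_TABLES:
        return "application-insights"
    # Custom tables also live in Log Analytics workspaces.
    return "log-analytics"

print(route_query("ContainerLog | where TimeGenerated > ago(30m)"))  # log-analytics
print(route_query("requests | summarize count() by resultCode"))     # application-insights
```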
Azure SRE Agent is generally available. Learn more at sre.azure.com/docs.

# Event-Driven IaC Operations with Azure SRE Agent: Terraform Drift Detection via HTTP Triggers
## What Happens After terraform plan Finds Drift?

If your team is like most, the answer looks something like this:

1. A nightly terraform plan runs and finds 3 drifted resources
2. A notification lands in Slack or Teams
3. Someone files a ticket
4. During the next sprint, an engineer opens 4 browser tabs (Terraform state, Azure Portal, Activity Log, Application Insights) and spends 30 minutes piecing together what happened
5. They discover the drift was caused by an on-call engineer who scaled up the App Service during a latency incident at 2 AM
6. They revert the drift with terraform apply
7. The app goes down because they just scaled it back down while the bug that caused the incident is still deployed

Step 7 is the one nobody talks about.

Drift detection tooling has gotten remarkably good: scheduled plans, speculative runs, drift alerts. But the output is always the same: a list of differences. What changed. Not why. Not whether it's safe to fix. The gap isn't detection. It's everything that happens after detection.

HTTP Triggers in Azure SRE Agent close that gap. They turn the structured output that drift detection already produces (webhook payloads, plan summaries, run notifications) into the starting point of an autonomous investigation. Detection feeds the agent. The agent does the rest: correlates with incidents, reads source code, classifies severity, recommends context-aware remediation, notifies the team, and even ships a fix.

Here's what that looks like end to end.
What you'll see in this blog:

- An agent that classifies drift as Benign, Risky, or Critical, not just "changed"
- Incident correlation that links a SKU change to a latency spike in Application Insights
- A remediation recommendation that says "Do NOT revert" and explains why reverting would cause an outage
- A Teams notification with the full investigation summary
- An agent that reviews its own performance, finds gaps, and improves its own skill file
- A pull request the agent created on its own to fix the root cause

## The Pipeline: Detection to Resolution in One Webhook

The architecture is straightforward. Terraform Cloud (or any drift detection tool) sends a webhook when it finds drift. An Azure Logic App adds authentication. The SRE Agent's HTTP Trigger receives it and starts an autonomous investigation.

The end-to-end pipeline: Terraform Cloud detects drift and sends a webhook. The Logic App adds Azure AD authentication via Managed Identity. The SRE Agent's HTTP Trigger fires and the agent autonomously investigates across 7 dimensions.

## Setting Up the Pipeline

### Step 1: Deploy the Infrastructure with Terraform

We start with a simple Azure App Service running a Node.js application, deployed via Terraform. The Terraform configuration defines the desired state:

- App Service Plan: B1 (Basic) - single vCPU, ~$13/mo
- App Service: Node 20-lts with TLS 1.2
- Tags: environment: demo, managed_by: terraform, project: sre-agent-iac-blog

```hcl
resource "azurerm_service_plan" "demo" {
  name                = "iacdemo-plan"
  resource_group_name = azurerm_resource_group.demo.name
  location            = azurerm_resource_group.demo.location
  os_type             = "Linux"
  sku_name            = "B1"
}
```

A Logic App is also deployed to act as the authentication bridge between Terraform Cloud webhooks and the SRE Agent's HTTP Trigger endpoint, using Managed Identity to acquire Azure AD tokens. Learn more about HTTP Triggers here.

### Step 2: Create the Drift Analysis Skill

Skills are domain knowledge files that teach the agent how to approach a problem.
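Before the auth bridge forwards a webhook, it is worth checking that the payload really came from Terraform Cloud. Terraform Cloud signs notification payloads with an HMAC-SHA512 of the request body using the token configured on the notification, delivered in the `X-TFE-Notification-Signature` header. A minimal verification sketch; confirm the header name and algorithm against the Terraform Cloud notification docs for your version before relying on it:

```python
import hashlib
import hmac

def verify_tfc_signature(body: bytes, signature_header: str, token: str) -> bool:
    """Check a Terraform Cloud notification signature (HMAC-SHA512 of the body)."""
    expected = hmac.new(token.encode(), body, hashlib.sha512).hexdigest()
    # Constant-time comparison avoids leaking information via timing.
    return hmac.compare_digest(expected, signature_header)

# Example with a made-up token and payload:
token = "demo-notification-token"
body = b'{"run_id": "run-abc123", "workspace_name": "iacdemo"}'
sig = hmac.new(token.encode(), body, hashlib.sha512).hexdigest()
print(verify_tfc_signature(body, sig, token))  # True
```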
We create a terraform-drift-analysis skill with an 8-step workflow:

1. **Identify Scope** - Which resource group and resources to check
2. **Detect Drift** - Compare Terraform config against Azure reality
3. **Correlate with Incidents** - Check Activity Log and App Insights
4. **Classify Severity** - Benign, Risky, or Critical
5. **Investigate Root Cause** - Read source code from the connected repository
6. **Generate Drift Report** - Structured summary with severity-coded table
7. **Recommend Smart Remediation** - Context-aware: don't blindly revert
8. **Notify Team** - Post findings to Microsoft Teams

The key insight in the skill: "NEVER revert critical drift that is actively mitigating an incident." This teaches the agent to think like an experienced SRE, not just a diff tool.

### Step 3: Create the HTTP Trigger

In the SRE Agent UI, we create an HTTP Trigger named tfc-drift-handler with a 7-step agent prompt:

```text
A Terraform Cloud run has completed and detected infrastructure drift.

Workspace: {payload.workspace_name}
Organization: {payload.organization_name}
Run ID: {payload.run_id}
Run Message: {payload.run_message}

STEP 1 — DETECT DRIFT: Compare Terraform configuration against actual Azure state...
STEP 2 — CORRELATE WITH INCIDENTS: Check Azure Activity Log and App Insights...
STEP 3 — CLASSIFY SEVERITY: Rate each drift item as Benign, Risky, or Critical...
STEP 4 — INVESTIGATE ROOT CAUSE: Read the application source code...
STEP 5 — GENERATE DRIFT REPORT: Produce a structured summary...
STEP 6 — RECOMMEND SMART REMEDIATION: Context-aware recommendations...
STEP 7 — NOTIFY TEAM: Post a summary to Microsoft Teams...
```

### Step 4: Connect GitHub and Teams

We connect two integrations in the SRE Agent Connectors settings:

- **Code Repository: GitHub** - so the agent can read application source code during investigations
- **Notification: Microsoft Teams** - so the agent can post drift reports to the team channel

## The Incident Story

### Act 1: The Latency Bug

Our demo app has a subtle but devastating bug.
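The `{payload.X}` placeholders are straightforward substitutions from the webhook body. A sketch of how such a prompt template could be rendered; this is my own illustrative renderer, not the trigger's actual implementation:

```python
import re

def render_prompt(template: str, payload: dict) -> str:
    """Replace {payload.key} placeholders with values from a webhook payload."""
    def lookup(match: re.Match) -> str:
        # Missing keys render as empty strings rather than raising.
        return str(payload.get(match.group(1), ""))
    return re.sub(r"\{payload\.(\w+)\}", lookup, template)

template = "Workspace: {payload.workspace_name}\nRun ID: {payload.run_id}"
payload = {"workspace_name": "iacdemo", "run_id": "run-abc123"}
print(render_prompt(template, payload))
```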
The /api/data endpoint calls processLargeDatasetSync(), a function that sorts an array on every iteration, creating an O(n² log n) blocking operation. On a B1 App Service Plan (single vCPU), this blocks the Node.js event loop entirely. Under load, response times spike from milliseconds to 25-58 seconds, with 502 Bad Gateway errors from the Azure load balancer.

### Act 2: The On-Call Response

An on-call engineer sees the latency alerts and responds, not through Terraform, but directly through the Azure Portal and CLI. They:

1. Add diagnostic tags, manual_update=True and changed_by=portal_user (benign)
2. Downgrade TLS from 1.2 to 1.0 while troubleshooting (risky: a security regression)
3. Scale the App Service Plan from B1 to S1 to throw more compute at the problem (critical: a cost increase from ~$13/mo to ~$73/mo)

The incident is partially mitigated: S1 has more compute, so latency drops from catastrophic to merely bad. Everyone goes back to sleep. Nobody updates Terraform.

### Act 3: The Drift Check Fires

The next morning, a nightly speculative Terraform plan runs and detects 3 drifted attributes. The notification webhook fires, flowing through the Logic App auth bridge to the SRE Agent HTTP Trigger. The agent wakes up and begins its investigation.

## What the Agent Found

### Layer 1: Drift Detection

The agent compares Terraform configuration against Azure reality and produces a severity-classified drift report. Three drift items detected:

- **Critical:** App Service Plan SKU changed from B1 (~$13/mo) to S1 (~$73/mo), a +462% cost increase
- **Risky:** Minimum TLS version downgraded from 1.2 to 1.0, a security regression vulnerable to BEAST and POODLE attacks
- **Benign:** Additional tags (changed_by: portal_user, manual_update: True), cosmetic with no functional impact

### Layer 2: Incident Correlation

Here's where the agent goes beyond simple drift detection.
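Severity classification like this can be expressed as a small rule table keyed on which attribute drifted. A toy sketch; the rules and names are mine, illustrating the idea rather than the agent's internals:

```python
def classify_drift(attribute: str, old: str, new: str) -> str:
    """Classify a single drifted attribute as benign, risky, or critical.

    Illustrative rules: tags are cosmetic; a TLS downgrade is a security
    regression; a SKU change has cost and capacity impact.
    """
    if attribute.startswith("tags"):
        return "benign"
    if attribute == "minimum_tls_version" and new < old:
        return "risky"      # e.g. "1.0" < "1.2": security regression
    if attribute == "sku_name":
        return "critical"   # cost/capacity change, needs investigation
    return "risky"          # unknown attributes deserve a closer look

print(classify_drift("tags.manual_update", "", "True"))     # benign
print(classify_drift("minimum_tls_version", "1.2", "1.0"))  # risky
print(classify_drift("sku_name", "B1", "S1"))               # critical
```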
It queries Application Insights and discovers a performance incident correlated with the SKU change. Key findings from the incident correlation:

- 97.6% of requests (40 of 41) were impacted by high latency
- The /api/data endpoint does not exist in the repository source code: the deployed application has diverged from the codebase
- The endpoint likely contains a blocking synchronous pattern. Node.js runs on a single event loop, and any synchronous blocking call would explain 26-58s response times
- The SKU scale-up from B1→S1 was an attempt to mitigate latency by adding more compute, but scaling cannot fix application-level blocking code on a single-threaded Node.js server

### Layer 3: Smart Remediation

This is the insight that separates an autonomous agent from a reporting tool. Instead of blindly recommending "revert all drift," the agent produces context-aware remediation recommendations:

- **Tags (Benign)** → Safe to revert anytime via terraform apply -target
- **TLS 1.0 (Risky)** → Revert immediately; the TLS downgrade is a security risk unrelated to the incident
- **SKU S1 (Critical)** → DO NOT revert until the /api/data performance root cause is fixed

This is the logic an experienced SRE would apply. Blindly running terraform apply to revert all drift would scale the app back down to B1 while the blocking code is still deployed, turning a mitigated incident into an active outage.
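The guard "never revert critical drift that is actively mitigating an incident" composes naturally with the severity classes. A hypothetical decision function, with policy choices that are mine rather than the agent's:

```python
def recommend(severity: str, mitigating_active_incident: bool) -> str:
    """Recommend an action for one drift item.

    Illustrative policy: benign drift can be reverted anytime; risky
    drift should be reverted promptly; critical drift must NOT be
    reverted while it is mitigating an active incident.
    """
    if severity == "critical" and mitigating_active_incident:
        return "do-not-revert: fix the root cause first"
    if severity == "benign":
        return "revert-anytime"
    return "revert-now"

# The three drift items from the demo:
print(recommend("benign", False))   # revert-anytime
print(recommend("risky", False))    # revert-now (TLS change unrelated to incident)
print(recommend("critical", True))  # do-not-revert: fix the root cause first
```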
### Layer 4: Investigation Summary

The agent produces a complete summary tying everything together. Key findings in the summary:

- **Actor:** surivineela@microsoft.com made all changes via Azure Portal at ~23:19 UTC
- **Performance incident:** /api/data averaging 25-57s latency, affecting 97.6% of requests
- **Code-infrastructure mismatch:** /api/data exists in production but not in the repository source code
- **Root cause:** SKU scale-up was emergency incident response, not unauthorized drift

### Layer 5: Teams Notification

The agent posts a structured drift report to the team's Microsoft Teams channel. The on-call engineer opens Teams in the morning and sees everything they need: what drifted, why it drifted, and exactly what to do about it, without logging into any dashboard.

## The Payoff: A Self-Improving Agent

Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for.

### The Agent Improved Its Own Skill

The agent performed an Execution Review, analyzing what worked and what didn't during its investigation, and found 5 gaps in its own terraform-drift-analysis.md skill file.

What worked well:

- Drift detection via az CLI comparison against Terraform HCL was straightforward
- Activity Log correlation identified the actor and timing
- Application Insights telemetry revealed the performance incident driving the SKU change

Gaps it found and fixed:

- No incident correlation guidance - the skill didn't instruct checking App Insights
- No code-infrastructure mismatch detection - no guidance to verify deployed code matches the repository
- No smart remediation logic - didn't warn against reverting critical drift during active incidents
- Report template missing incident correlation column
- No Activity Log integration guidance - didn't instruct checking who made changes and when

The agent then edited its own skill file to incorporate these learnings.
Next time it runs a drift analysis, it will include incident correlation, code-infra mismatch checks, and smart remediation logic by default. This is a learning loop: every investigation makes the agent better at future investigations.

### The Agent Created a PR

Without being asked, the agent identified the root cause code issue and proactively created a pull request to fix it. The PR includes:

- **App safety fixes:** Adding MAX_DELAY_MS and SERVER_TIMEOUT_MS constants to prevent unbounded latency
- **Skill improvements:** Incorporating incident correlation, code-infra mismatch detection, and smart remediation logic

From a single webhook: drift detected → incident correlated → root cause found → team notified → skill improved → fix shipped.

## Key Takeaways

- **Drift detection is not enough.** Knowing that B1 changed to S1 is table stakes. Knowing it changed because of a latency incident, and that reverting it would cause an outage, is the insight that matters.
- **Context-aware remediation prevents outages.** Blindly running terraform apply after drift would have scaled the app back to B1 while blocking code was still deployed. The agent's "DO NOT revert SKU" recommendation is the difference between fixing drift and causing a P1.
- **Skills create a learning loop.** The agent's self-review and skill improvement means every investigation makes the next one better, without human intervention.
- **HTTP Triggers connect any platform.** The auth bridge pattern (Logic App + Managed Identity) works for Terraform Cloud, but the same architecture applies to any webhook source: GitHub Actions, Jenkins, Datadog, PagerDuty, custom internal tools.
- **The agent acts, not just reports.** From a single webhook: drift detected, incident correlated, root cause identified, team notified via Teams, skill improved, and PR created. End to end in one autonomous session.
## Getting Started

HTTP Triggers are available now in Azure SRE Agent:

1. **Create a Skill** - Teach the agent your operational runbook (in this case, drift analysis with severity classification and smart remediation)
2. **Create an HTTP Trigger** - Define your agent prompt with {payload.X} placeholders and connect it to a skill
3. **Set Up an Auth Bridge** - Deploy a Logic App with Managed Identity to handle Azure AD token acquisition
4. **Connect Your Source** - Point Terraform Cloud (or any webhook-capable platform) at the Logic App URL
5. **Connect GitHub + Teams** - Give the agent access to source code and team notifications

Within minutes, you'll have an autonomous pipeline that turns infrastructure drift events into fully contextualized investigations, with incident correlation, root cause analysis, and smart remediation recommendations. The full implementation guide, Terraform files, skill definitions, and demo scripts are available in this repository.

# Managing Multi‑Tenant Azure Resources with SRE Agent and Lighthouse
Azure SRE Agent is an AI-powered reliability assistant that helps teams diagnose and resolve production issues faster while reducing operational toil. It analyzes logs, metrics, alerts, and deployment data to perform root cause analysis and recommend or execute mitigations with human approval. It is capable of integrating with Azure services across the subscriptions and resource groups that you need to monitor and manage.

Today's enterprise customers live in a multi-tenant world, for reasons ranging from acquisitions and complex corporate structures to managed service providers and IT partners. Azure Lighthouse enables enterprise IT teams and managed service providers to manage resources across multiple Azure tenants from a single control plane. In this demo I will walk you through how to set up Azure SRE Agent to manage and monitor multi-tenant resources delegated through Azure Lighthouse.

1. Navigate to the Azure SRE agent and select Create agent.
2. Fill in the required details along with the deployment region and deploy the SRE agent.
3. Once the deployment is complete, hit Set up your agent.
4. Select the Azure resources you would like your agent to analyze, such as resource groups or subscriptions. This lands you on the popup window that allows you to select the subscriptions and resource groups that you would like the SRE agent to monitor and manage.
5. Select the subscriptions and resource groups under the same tenant that you want the SRE agent to manage.

Great, so far so good 👍

As a Managed Service Provider (MSP) you are managing multiple tenants via Azure Lighthouse, and the SRE agent needs access to those. For this demo, we will set up Azure Lighthouse with the correct set of roles and configuration to delegate access to the management subscription where the centralized SRE agent is running.

1. From the Azure portal, search for Lighthouse.
2. Navigate to the Lighthouse home page and select Manage your customers.
On the My customers Overview, select Create ARM Template. Provide a Name and Description, and select the subscriptions on a Delegated scope. Select + Add authorization, which opens the Add authorization window. Select the Principal type; I am selecting User for demo purposes. The pop-up window allows you to select users from a list. Select the checkbox next to the user to whom you want to delegate the subscription and hit Select. Then select the Role that you would like to assign the user from the managing tenant in the delegated tenant and select Add. You can add multiple roles by adding additional authorizations to the selected user.

This step is important: the delegated tenant must be assigned the right role for the SRE agent to add it as an Azure source. Azure SRE Agent requires an Owner or User Access Administrator RBAC role to assign the subscription to the list of managed resources. If an appropriate role is not assigned, you will see an error when selecting the delegated subscriptions under the SRE agent's Managed resources. Per Lighthouse role support, the Owner role isn’t supported, and the User Access Administrator role is supported only for limited purposes. Refer to the Azure Lighthouse documentation for additional information.

If the role is not defined correctly, you might see an error stating:

🛑 Failed to add Role assignment "The 'delegatedRoleDefinitionIds' property is required when using certain roleDefinitionIds for authorization. To allow a principalId to assign roles to a managed identity in the customer tenant, set its roleDefinitionId to User Access Administrator."

Download the ARM template and add the specific Azure built-in roles that you want to grant in the delegatedRoleDefinitionIds property. You can include any supported Azure built-in role except User Access Administrator or Owner.
This example shows a principalId with the User Access Administrator role that can assign two built-in roles to managed identities in the customer tenant: Contributor and Log Analytics Contributor.

```json
{
  "principalId": "00000000-0000-0000-0000-000000000000",
  "principalIdDisplayName": "Policy Automation Account",
  "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9",
  "delegatedRoleDefinitionIds": [
    "b24988ac-6180-42a0-ab88-20f7382dd24c",
    "92aaf0da-9dab-42b6-94a3-d43ce8d16293"
  ]
}
```

In addition, the SRE agent requires certain roles at the managed identity level in order to access and operate on those services. Locate the SRE agent's user-assigned managed identity and add roles to the service principal. For demo purposes I am assigning the Reader, Monitoring Reader, and Log Analytics Reader roles.

Here is the sample ARM template used for this demo:

```json
{
  "$schema": "https://schema.management.azure.com/schemas/2019-08-01/subscriptionDeploymentTemplate.json#",
  "contentVersion": "1.0.0.0",
  "parameters": {
    "mspOfferName": {
      "type": "string",
      "metadata": { "description": "Specify a unique name for your offer" },
      "defaultValue": "lighthouse-sre-demo"
    },
    "mspOfferDescription": {
      "type": "string",
      "metadata": { "description": "Name of the Managed Service Provider offering" },
      "defaultValue": "lighthouse-sre-demo"
    }
  },
  "variables": {
    "mspRegistrationName": "[guid(parameters('mspOfferName'))]",
    "mspAssignmentName": "[guid(parameters('mspOfferName'))]",
    "managedByTenantId": "6e03bca1-4300-400d-9e80-000000000000",
    "authorizations": [
      {
        "principalId": "504adfc5-da83-47d4-8709-000000000000",
        "roleDefinitionId": "e40ec5ca-96e0-45a2-b4ff-59039f2c2b59",
        "principalIdDisplayName": "Pranab Mandal"
      },
      {
        "principalId": "504adfc5-da83-47d4-8709-000000000000",
        "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9",
        "delegatedRoleDefinitionIds": [
          "b24988ac-6180-42a0-ab88-20f7382dd24c",
          "92aaf0da-9dab-42b6-94a3-d43ce8d16293"
        ],
        "principalIdDisplayName": "Pranab Mandal"
      },
      {
        "principalId": "504adfc5-da83-47d4-8709-000000000000",
        "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c",
        "principalIdDisplayName": "Pranab Mandal"
      },
      {
        "principalId": "0374ff5c-5272-49fa-878a-000000000000",
        "roleDefinitionId": "acdd72a7-3385-48ef-bd42-f606fba81ae7",
        "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu"
      },
      {
        "principalId": "0374ff5c-5272-49fa-878a-000000000000",
        "roleDefinitionId": "43d0d8ad-25c7-4714-9337-8ba259a9fe05",
        "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu"
      },
      {
        "principalId": "0374ff5c-5272-49fa-878a-000000000000",
        "roleDefinitionId": "73c42c96-874c-492b-b04d-ab87d138a893",
        "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu"
      }
    ]
  },
  "resources": [
    {
      "type": "Microsoft.ManagedServices/registrationDefinitions",
      "apiVersion": "2022-10-01",
      "name": "[variables('mspRegistrationName')]",
      "properties": {
        "registrationDefinitionName": "[parameters('mspOfferName')]",
        "description": "[parameters('mspOfferDescription')]",
        "managedByTenantId": "[variables('managedByTenantId')]",
        "authorizations": "[variables('authorizations')]"
      }
    },
    {
      "type": "Microsoft.ManagedServices/registrationAssignments",
      "apiVersion": "2022-10-01",
      "name": "[variables('mspAssignmentName')]",
      "dependsOn": [
        "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]"
      ],
      "properties": {
        "registrationDefinitionId": "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]"
      }
    }
  ],
  "outputs": {
    "mspOfferName": {
      "type": "string",
      "value": "[concat('Managed by', ' ', parameters('mspOfferName'))]"
    },
    "authorizations": {
      "type": "array",
      "value": "[variables('authorizations')]"
    }
  }
}
```

Log in to the customer's tenant and navigate to Service providers from the Azure portal. From the Service providers overview screen, select Service provider offers from the left navigation pane. From the top menu, select the Add offer drop-down and select Add via template.
In the Upload Offer Template window, drag and drop or upload the template file that was created in the earlier step and hit Upload. Once the file is uploaded, select Review + Create. It will take a few minutes to deploy the template, and a successful deployment page should be displayed. Navigate to Delegations from the Lighthouse overview and validate that you see the delegated subscription and the assigned role.

Once the Lighthouse delegation is set up, sign in to the managing tenant and navigate to the deployed SRE agent. Navigate to Azure resources from the top menu or via Settings > Managed resources. Select Add subscriptions to choose the customer subscriptions that you need the SRE agent to manage. Adding a subscription will automatically add the required permissions for the agent. Once the appropriate roles are added, the subscriptions are ready for the agent to manage and monitor resources within them.

Summary - Benefits

This blog post demonstrates how Azure SRE Agent can be used to centrally monitor and manage Azure resources across multiple tenants by integrating it with Azure Lighthouse, a common requirement for enterprises and managed service providers operating in complex, multi-tenant environments. It walks through:

- Centralized SRE operations across multiple Azure tenants
- Secure, role-based access using delegated resource management
- Reduced operational overhead for MSPs and enterprise IT teams
- Unified visibility into resource health and reliability across customer environments

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself.

Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis.

First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives.

The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline.

That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible.

Where we started

In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction.
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start.

We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived.

We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale.

The Inversion: Three bets

The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself.

The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance.
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last.

Bet 1: The Filesystem as the Agent's World

Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface.

Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents.

But interface design is only half the story. The other half is what you put inside it.

Code Repositories: the highest-leverage context

Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it.
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

- Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.
- Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.
- Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In the SRE context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
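As a rough illustration of how file-based memory can be navigated, here is a small sketch. The file names overview.md and debugging.md come from the post, but the contents and the naive link-following traversal are my own invention, not the agent's actual implementation:

```python
import pathlib
import re
import tempfile

# Build a tiny illustrative memory tree (contents are made up for this sketch).
root = pathlib.Path(tempfile.mkdtemp())
(root / "overview.md").write_text(
    "# payments-api\nToken service for checkout.\nSee [debugging](debugging.md).\n")
(root / "debugging.md").write_text(
    "# Known failure modes\n- KV cache misses when the prompt prefix changes.\n")

def read_memory(entry: pathlib.Path, max_hops: int = 2) -> list[str]:
    """Follow Markdown links breadth-first from a structured entry point."""
    seen, pages, frontier = set(), [], [entry]
    for _ in range(max_hops):
        nxt = []
        for path in frontier:
            if path in seen or not path.exists():
                continue
            seen.add(path)
            text = path.read_text()
            pages.append(text)
            # Collect relative Markdown links like [debugging](debugging.md).
            nxt += [path.parent / link
                    for link in re.findall(r"\]\(([\w./-]+\.md)\)", text)]
        frontier = nxt
    return pages

pages = read_memory(root / "overview.md")
print(len(pages))  # the entry file plus the one file it drilled into
```

The point of the sketch is the navigation pattern: relevance emerges by following links from an entry point, rather than by matching a query against an embedding index.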
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed.

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism.

Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with gh cli, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.

Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

- Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
- Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
- Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
- Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast.
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal.

Tool result compression via the filesystem

Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive.

Context Pruning and Auto Compact

Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work.

Parallel subagents

The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another.

The Feedback loop

These architectural bets have enabled us to close the original scaling gap.
Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway, and some conversations broke entirely. So we set up a daily monitoring task for these failures. The agent searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

- Analyzed our user churn rate and built dashboards we now review weekly.
- Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
- Ran security analysis and found vulnerabilities in the read path.
- Helped fill out parts of its own Responsible AI review, with strict human review.
- Handles customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations.

The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one.

We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one.

Thanks to visagarwal for co-authoring this post.

An AI led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub.
This is due to the inevitable move towards fully agentic, end-to-end SDLCs. We may not yet be at a point where software engineers are managing fleets of agents creating the billion-dollar AI abstraction layer, but (as I will evidence in this article) we are certainly on the precipice of such a world. Before we dive into the reality of agentic development today, let me examine two very different modules from university and their relevance in an AI-first development environment.

Manual Requirements Translation

At university I dedicated two whole years to a unit called “Systems Design”. This was one of my favourite units, primarily focused on requirements translation. Often, I would receive a scenario between “The Proprietor” and “The Proprietor’s wife”, who seemed to be in a never-ending cycle of new product ideas. These tasks would be analysed, broken down, manually refined, and then mapped to some kind of early-stage application architecture (potentially some pseudo-code and a UML diagram or two). The big intellectual effort in this exercise was taking human intention and turning it into something tangible to build from (BA’s). Today, by the time I have opened Notepad and started to decipher requirements, an agent can already have created a comprehensive list, a service blueprint, and a code scaffold to start the process (*cough* spec-kit *cough*).

Manual debugging

Need I say any more? Old-school debugging with print()’s and breakpoints is dead. I spent countless hours learning to debug in a classroom and then later with my own software, stepping through execution line by line, reading through logs, and understanding what to look for; where correlation did and didn’t mean causation.
I think back to my year at IBM as a fresh-faced intern in a cloud engineering team, where around 50% of my time was debugging different issues until it was sufficiently “narrowed down”, and then reading countless Stack Overflow posts figuring out the actual change I would need to make to a PowerShell script or Jenkins pipeline. Already in Azure, with the emergence of SRE agents, that debug process looks entirely different. The debug process for software even more so…

#terminallastcommand WHY IS THIS NOT RUNNING?
#terminallastcommand Review these logs and surface errors relating to XYZ.

As I said: breakpoints are dead, for now at least.

Caveat – Is this a good thing?

One more deviation from the main core of the article if you would be so kind (if you are not as kind, skip to the implementation walkthrough below). Is this actually a good thing? Is a software engineering degree now worthless? What if I love printf()? I don’t know is my answer today, at the start of 2026. Two things worry me: one theoretical and one very real.

To start with the theoretical: today AI takes a significant amount of the “donkey work” away from developers. How does this impact cognitive load at both ends of the spectrum? The list that “donkey work” encapsulates is certainly growing. As a result, on one end of the spectrum humans are left with the complicated parts yet to be within an agent’s remit. This could have quite an impact on our ability to perform tasks. If we are constantly dealing with the complex and advanced, when do we have time to re-root ourselves in the foundations? Will we see an increase in developer burnout? How do technical people perform without the mundane or routine tasks? I often hear people who have been in the industry for years discuss how simple infrastructure, computing, development, etc. were 20 years ago, almost with a longing to return to a world where today’s zero trust, globally replicated architectures are a twinkle in an architect’s eye.
Is constantly working on only the most complex problems a good thing?

At the other end of the spectrum, what if the performance of AI tooling and agents outperforms our wildest expectations? Suddenly, AI tools and agents are picking up more and more of today’s complicated and advanced tasks. Will developers, architects, and organisations lose some ability to innovate? Fundamentally, we are not talking about artificial general intelligence when we say AI; we are talking about incredibly complex predictive models that can augment the existing ideas they are built upon but are not, in themselves, innovators. Put simply, in the words of Scott Hanselman: “Spicy auto-complete”. Does increased reliance on these agents in more and more of our business processes remove the opportunity for innovative ideas? For example, if agents were football managers, would we ever have graduated from Neil Warnock and Mick McCarthy football to Pep? Would every agent just augment a ‘lump it long and hope’ approach? We hear about learning loops, but can these learning loops evolve into “innovation loops?”

Past the theoretical and the game of 20 questions, the very real concern I have is off the back of some data shared recently on Stack Overflow traffic. We can see in the diagram below that Stack Overflow traffic has dipped significantly since the release of GitHub Copilot in October 2021, and as the product has matured that trend has only accelerated. Data from 12 months ago suggests that Stack Overflow has lost 77% of new questions compared to 2022…

Stack Overflow democratises access to problem-solving (I have to be careful not to talk in past tense here), but I will admit I cannot remember the last time I was reviewing Stack Overflow or furiously searching through solutions that are vaguely similar to my own issue. This causes some concern over the data available in the future to train models. Today, models can be grounded in real, tested scenarios built by developers in anger.
What happens with this question drop when API schemas change, when the technology built for today is old and deprecated, and the dataset is stale and never returning to its peak? How do we mitigate this impact? There is potential for some closed-loop type continuous improvement in the future, but do we think this is a scalable solution? I am unsure.

So, back to the question: “Is this a good thing?”. It’s great today; the long-term impacts are yet to be seen. If we think that AGI may never be achieved, or is at least a very distant horizon, then understanding the foundations of your technical discipline is still incredibly important. Developers will not only be the managers of their fleet of agents, but also the janitors mopping up the mess when there is an accident (albeit likely mopping with AI-augmented tooling).

An AI First SDLC Today – The Reality

Enough reflection and nostalgia (I don’t think that’s why you clicked the article); let’s start building something. For the rest of this article I will be building an AI-led, agent-powered software development lifecycle. The example I will be building is an AI-generated weather dashboard. It’s a simple example, but if agents can generate, test, deploy, observe, and evolve this application, it proves that today, and into the future, the process can likely scale to more complex domains.

Let’s start with the entry point, the problem statement that we will build from:

“As a user I want to view real time weather data for my city so that I can plan my day.”

We will use this as the single input for our AI-led SDLC. This is what we will pass to promptkit and watch our app and subsequent features built in front of our eyes. The goal is that we will:

- Use Spec Kit to get going and move from a textual idea to requirements and a scaffold.
- Use a coding agent to implement our plan.
- Use a quality agent to assess the output and quality of the code.
- Use GitHub Actions to not only host the agents (abstracted) but also handle the build and deployment.
- Use an SRE agent to proactively monitor and open issues automatically.

The end-to-end flow that we will review through this article is the following:

Step 1: Spec-driven development - Spec First, Code Second

A big piece of realising an AI-led SDLC today relies on spec-driven development (SDD). One of the best summaries of SDD that I have seen is: “Version control for your thinking”. Instead of huge specs that are stale and buried in a knowledge repository somewhere, SDD looks to make them a first-class citizen within the SDLC. Architectural decisions, business logic, and intent can be captured and versioned as a product evolves; an executable artefact that evolves with the project.

In 2025, GitHub released the open-source Spec Kit: a tool that enables the goal of placing a specification at the centre of the engineering process. Specs drive the implementation, checklists, and task breakdowns, steering an agent towards the end goal. This article from GitHub does a great job explaining the basics, so if you’d like to learn more it’s a great place to start (https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/). In short, Spec Kit generates requirements, a plan, and tasks to guide a coding agent through an iterative, structured development process. Through the Spec Kit constitution, organisational standards and tech-stack preferences are adhered to throughout each change.

I did notice one (likely intentional) gap in functionality that would cement Spec Kit’s role in an autonomous SDLC. That gap is that the implement stage is designed to run within an IDE or client coding agent. You can now, in the IDE, toggle between task implementation locally or with an agent in the cloud. That is great, but again it still requires you to drive through the IDE.
Thinking about this in the context of an AI-led SDLC (where we are pushing tasks from Spec Kit to a coding agent outside of my own desktop), it was clear that a bridge was needed. As a result, I used Spec Kit to create the Spec-to-issue tool. This allows us to take the tasks and plan generated by Spec Kit, parse the important parts, and automatically create a GitHub issue, with the option to auto-assign the coding agent. From the perspective of an autonomous AI-led SDLC, Spec Kit really is the entry point that triggers the flow. How Spec Kit is surfaced to users will vary depending on the organisation and the context of the users. For the rest of this demo I use Spec Kit to create a weather app calling out to the OpenWeather API, and then add additional features with new specs. With one simple prompt of “/promptkit.specify "Application feature/idea/change"” I suddenly had a really clear breakdown of the tasks and plan required to get to my desired end state while respecting the context and preferences I had previously set in my Spec Kit constitution. I had mentioned a desire for test-driven development, that I required a certain level of coverage, and that all solutions were to be Azure-native. The real benefit here compared to prompting directly into the coding agent is that breaking one large task into small, individual, measurable components that are clear and methodical improves the coding agent’s ability to perform them by a considerable degree. We can see an example below of not just creating a whole application but another spec to iterate on an existing application and add a feature. We can see the result of the spec creation, the issue in our GitHub repo, and, most importantly for the next step, that our coding agent, GitHub Copilot, has been assigned automatically.
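To make the bridge concrete, here is a rough sketch of the kind of parsing a Spec-to-issue step could perform. The tasks.md checklist format, the payload fields, and the "copilot" assignee are my illustrative assumptions, not the tool's actual implementation:

```python
import re

def parse_speckit_tasks(tasks_md: str) -> list:
    """Parse Spec Kit-style task lines (e.g. '- [ ] T001 Set up scaffold')
    into GitHub issue payloads. The tasks.md layout here is an assumption."""
    issues = []
    for line in tasks_md.splitlines():
        m = re.match(r"-\s\[[ xX]\]\s+(T\d+)\s+(.*)", line.strip())
        if m:
            task_id, title = m.groups()
            issues.append({
                "title": f"{task_id}: {title}",
                "body": f"Auto-created from Spec Kit task {task_id}.",
                "assignees": ["copilot"],  # hypothetical auto-assign of the coding agent
            })
    return issues

tasks = """\
- [ ] T001 Scaffold weather dashboard project
- [x] T002 Add OpenWeather API client
"""
payloads = parse_speckit_tasks(tasks)
```

Each payload could then be handed to the GitHub API or CLI to create the actual issue; the point is that the plan generated by Spec Kit becomes machine-readable input for the next agent in the chain.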
Step 2: GitHub Coding Agent - Iterative, autonomous software creation Talking of coding agents, GitHub Copilot’s coding agent is an autonomous agent in GitHub that can take a scoped development task and work on it in the background using the repository’s context. It can make code changes and produce concrete outputs like commits and pull requests for a developer to review. The developer stays in control by reviewing, requesting changes, or taking over at any point. This does the heavy lifting in our AI-led SDLC. We have already seen great success with customers who have adopted the coding agent when it comes to carrying out menial tasks to save developers time. These coding agents can work in parallel to human developers and with each other. In our example we see that the coding agent creates a new branch for its changes, and creates a PR which it starts working on as it ticks off the various tasks generated in our spec. One huge positive of the coding agent that sets it apart from other similar solutions is the transparency in decision-making and actions taken. The monitoring and observability built directly into the feature means that the agent’s “thinking” is easily visible: the iterations and steps being taken can be viewed in full sequence in the Agents tab. Furthermore, the action that the agent is running is also transparently available to view in the Actions tab, meaning problems can be assessed very quickly. Once the coding agent is finished, it has run the required tests and, even in the case of a UI change, goes as far as calling the Playwright MCP server and screenshotting the change to showcase in the PR. We are then asked to review the change. In this demo, I also created a GitHub Action that is triggered when a PR review is requested: it creates the required resources in Azure and surfaces the (in this case) Azure Container Apps revision URL, making it even smoother for the human in the loop to evaluate the changes.
Just like any normal PR, if changes are required, comments can be left; when they are, the coding agent can pick them up and action what is needed. It’s also worth noting that for any manual intervention here, use of GitHub Codespaces would work very well to make minor changes or perform testing on an agent’s branch. We can even see that the unit tests specified in our spec have been executed by our coding agent. The pattern used here (Spec Kit -> coding agent) overcomes one of the biggest challenges we see with the coding agent. Unlike an IDE-based coding agent, the GitHub.com coding agent is left to its own iterations and implementation without input until the PR review. This can lead to subpar performance, especially compared to IDE agents which have constant input and interruption. The concise and considered breakdown generated from Spec Kit provides the structure and foundation for the agent to execute on; very little is left to interpretation for the coding agent. Step 3: GitHub Code Quality Review (Human in the loop with agent assistance) GitHub Code Quality is a feature (currently in preview) that proactively identifies code quality risks and opportunities for enhancement both in PRs and through repository scans. These are surfaced within a PR and also in repo-level scoreboards. This means that PRs can now extend existing static code analysis: Copilot can action CodeQL, PMD, and ESLint scanning on top of the new, in-context code quality findings and autofixes. Furthermore, we receive a summary of the actual changes made. This can be used to assist the human in the loop in understanding what changes have been made and whether enhancements or improvements are required. Thinking about this in the context of review coverage, one of the challenges in already-lean development teams is finding the time to give proper credence to PRs. Now, with AI-assisted quality scanning, we can be more confident in our overall evaluation and test coverage.
I would expect that use of these tools alongside existing human review processes would increase repository code quality and reduce uncaught errors. The data points support this too. The Qodo 2025 AI Code Quality report showed that usage of AI code reviews increased quality improvements to 81% (from 55%). A similar 2026 study from Atlassian RovoDev showed that 38.7% of comments left by AI agents in code reviews lead to additional code fixes. LLMs in their current form are never going to achieve 100% accuracy; however, these are still considerable, significant gains in one of the most important (and often neglected) parts of the SDLC. With a significant number of software supply chain attacks recently, it is also not a stretch to imagine that many projects could benefit from "independently" (I use this term loosely) reviewed and summarised PRs and commits. In the future this could be performed by a specialist sub-agent during a PR or merge, focused on identifying malicious code that may be hidden within otherwise normal contributions; case in point being the "near-miss" XZ Utils attack. Step 4: GitHub Actions for build and deploy - No agents here, just deterministic automation. This step will be our briefest, as the idea of CI/CD and automation needs no introduction. It is worth noting that while I am sure there are additional opportunities for using agents within a build and deploy pipeline, I have not investigated them. I often speak with customers about deterministic and non-deterministic business process automation, and the importance of distinguishing between the two. Some processes were created to be deterministic because that is all that was available at the time; the number of conditions required to deal with N possible flows just did not scale. However, now those processes can be non-deterministic.
Good examples include IVR decision trees in customer service or hard-coded sales routines to retain a customer regardless of context; these would benefit from less determinism in their execution. However, some processes remain best as deterministic flows: financial transactions, policy engines, document ingestion. While all these flows may be part of an AI solution in the future (possibly as a tool an agent calls, or as part of a larger agent-based orchestration), the processes themselves are deterministic for a reason. Just because we could have dynamic decision-making doesn’t mean we should. Infrastructure deployment and CI/CD pipelines are one good example of this, in my opinion. We could have an agent decide what service best fits our codebase and which region we should deploy to, but do we really want to, and do the benefits outweigh the potential negatives? In this process flow we use a deterministic GitHub Action to deploy our weather application into our “development” environment and then promote through the environments until we reach production, at which point we want to ensure that the application is running smoothly. We also use an action, as mentioned above, to deploy and surface our agent’s changes. In Azure Container Apps we can do this in a secure sandbox environment called a “Dynamic Session” to ensure strong isolation of what is essentially “untrusted code”. Enterprises can often view the building and development of AI applications as something that requires a completely new process to take to production. While certain additional processes are new (evaluation, model deployment, and so on), many of our traditional SDLC principles are just as relevant as ever, CI/CD pipelines being a great example: checked-in code that is predictably deployed alongside the services required to run tests or promote through environments. Whether you are deploying a Java calculator app or a multi-agent customer service bot, CI/CD even in this new world is non-negotiable.
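As a contrast with agent-led decisions, a deterministic promotion gate can be expressed as a fixed lookup. This is a sketch only; the environment names and gate conditions are assumptions for illustration, not the pipeline used in the demo:

```python
from typing import Optional

# Fixed, ordered environments: promotion is a lookup, never a judgment call.
PROMOTION_ORDER = ["development", "staging", "production"]

def next_environment(current: str, tests_passed: bool, approved: bool) -> Optional[str]:
    """Return the next environment to promote to, or None if promotion is blocked."""
    if not (tests_passed and approved):
        return None  # deterministic gate: both conditions are required
    idx = PROMOTION_ORDER.index(current)
    if idx + 1 >= len(PROMOTION_ORDER):
        return None  # already in production; nowhere left to promote
    return PROMOTION_ORDER[idx + 1]
```

The same inputs always produce the same outcome, which is exactly the property you want from a release pipeline, whatever agents sit upstream of it.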
We can see that our geolocation feature is running on our Azure Container Apps revision and we can begin to evaluate whether we agree with Copilot that all the feature requirements have been met. In this case they have. If they hadn't, we'd just jump into the PR and add a new comment with "@copilot" requesting our changes. Step 5: SRE Agent - Proactive agentic day-two operations. The SRE Agent service on Azure is an operations-focused agent that continuously watches a running service using telemetry such as logs, metrics, and traces. When it detects incidents or reliability risks, it can investigate signals, correlate likely causes, and propose or initiate response actions such as opening issues, creating runbook-guided fixes, or escalating to an on-call engineer. It effectively automates parts of day-two operations while keeping humans in control of approval and remediation. It can run with two different permission models: a reader role that can temporarily assume the user's permissions for approved actions, or a privileged level that allows it to autonomously take approved actions on resources and resource types within the resource groups it is monitoring. In our example, our SRE Agent could take actions to ensure our container app runs as intended: restarting pods, changing traffic allocations, and alerting on secret expiry. The SRE Agent can also perform detailed debugging to save human SREs time, summarising the issue, the fixes tried so far, and narrowing down potential root causes to reduce time to resolution, even across the most complex issues. My initial concern with these types of autonomous fixes (be it VPA on Kubernetes or an SRE agent across your infrastructure) is always that they can very quickly mask problems, or become an anti-pattern where you have drift between your IaC and what is actually running in Azure. One of my favourite features of SRE agents is sub-agents.
Sub-agents can be created to handle very specific tasks that the primary SRE agent can leverage. Examples include alerting, report generation, and potentially other third-party integrations or tooling that require a more concise context. In my example, I created a GitHub sub-agent to be called by the primary agent after every issue that is resolved. When called, the GitHub sub-agent creates an issue summarising the origin, context, and resolution. This really brings us full circle. We can then potentially assign this to our coding agent to implement the fix before we proceed with the rest of the cycle; for example, a change where a port is incorrect in some Bicep, or min scale has been adjusted because of latency observed by the SRE agent. These are quick fixes that can be easily implemented by a coding agent, subsequently creating an autonomous feedback loop with human review. Conclusion: The journey through this AI-led SDLC demonstrates that it is possible, with today’s tooling, to improve any existing SDLC with AI assistance, evolving from simply using a chat interface in an IDE. By combining Spec Kit’s spec-driven development, autonomous coding agents, AI-augmented quality checks, deterministic CI/CD pipelines, and proactive SRE agents, we see an emerging ecosystem where human creativity and oversight guide an increasingly capable fleet of collaborative agents. As with all AI solutions we design today, I remind myself that “this is as bad as it gets”. If the last two years are anything to go by, the rate of change in this space means this article may look very different in 12 months. I imagine Spec-to-issue will no longer be required as a bridge, as native solutions evolve to make this process even smoother. There are also some areas of an AI-led SDLC that are not included in this post, things like reviewing the inner-loop process or the use of existing enterprise patterns and blueprints.
I also did not review use of third-party plugins or tools available through GitHub. These would make for an interesting expansion of the demo. We also did not look at the creation of custom coding agents, which could be hosted in Microsoft Foundry; this is especially pertinent with the recent announcement of Anthropic models now being available to deploy in Foundry. Does today’s tooling mean that developers, QAs, and engineers are no longer required? Absolutely not (and if I am honest, I can’t see that changing any time soon). However, it is evidently clear that in the next 12 months, enterprises who reshape their SDLC (and any other business process) to become one augmented by agents will innovate faster, learn faster, and deliver faster, leaving organisations who resist this shift struggling to keep up.
An update to the active flow billing model for Azure SRE Agent
Earlier today, we announced that Azure SRE Agent now supports multiple AI model providers, starting with Anthropic. To support multi-model choice, and make active usage costs easier to understand, we’re updating how active flow usage is measured, effective April 15, 2026.
At a glance
What’s changing
Active flow billing moves from time-based to token-based usage. You’ll be billed based on the tokens consumed when SRE Agent is actively doing work (for example, investigating an incident, responding to an alert, or helping in chat). Each model provider has its own published rate (AAUs per million tokens), so you can choose the model provider that fits your scenario and budget.
What stays the same
- Azure Agent Unit (AAU) remains the billing unit.
- Always-on flow pricing is unchanged: 4 AAUs per agent-hour.
- Your bill continues to have two components: a fixed always-on component plus a variable active flow component.
What you need to do
For most customers, no action is required. Your existing agents continue running. For the latest information on the AAU rates by model provider and estimates of example consumption scenarios, please refer to the pricing documentation.
Why we’re making this change
In reliability operations, different tasks can look very different: a quick health check isn’t the same as a multi-step investigation across logs, deployments, and metrics. With multi-model provider support, token consumption varies by model provider and by task complexity. Moving active flow billing to a token-based model provides a more direct, transparent connection between the work being performed and the active usage you’re billed for; especially as we expand model options over time.
How token-based active flow helps
More predictable costs for common tasks
Simple interactions typically use fewer tokens. More complex investigations use more. With token-based billing, the relationship between task complexity and active usage is clearer.
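As a back-of-the-envelope illustration of the two billing components, the always-on rate below comes from the announcement, while the AAU-per-million-tokens rate is a made-up placeholder; always check the pricing documentation for real rates:

```python
ALWAYS_ON_AAU_PER_HOUR = 4  # unchanged under the new model

def monthly_aau(hours_running: float, tokens_used: float, aau_per_million_tokens: float) -> float:
    """Fixed always-on component plus variable, token-based active flow component."""
    always_on = hours_running * ALWAYS_ON_AAU_PER_HOUR
    active_flow = (tokens_used / 1_000_000) * aau_per_million_tokens
    return always_on + active_flow

# A full month (~730 hours) plus 5M active tokens at a hypothetical 10 AAU per million
estimate = monthly_aau(730, 5_000_000, 10)
```

The fixed component depends only on how long the agent runs; the variable component scales with how much work it actually does.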
More flexibility as we add models
You choose the provider, we select the best model for the job. As model providers release newer models and we adopt them, we publish updated AAU-per-token rates so you always know what you're paying. See the current rates in the pricing documentation.
Spending controls stay in place
You can still set a monthly AAU allocation limit in Settings → Agent consumption in the SRE Agent portal. When you reach your active flow limit, your agent continues to run, but pauses chat and autonomous actions until the next month. You can adjust your limit at any time.
Next steps
For most customers, this change requires no action. Your always-on billing is unchanged, your existing agents continue running, and your AAU meter remains the same. The billing change affects only how active flow usage is measured and calculated. If you're currently using SRE Agent and want to understand the new pricing in detail – including AAU rates per model, example consumption scenarios for light, medium, and heavy workloads, and guidance on setting spending limits – please visit the pricing documentation for the latest information.
💡 Tip: The pricing section in product documentation contains the current AAU rates by model.
Questions or feedback on the new billing model? Use the Feedback & issues link in the SRE Agent portal or reach out through the Azure SRE Agent community.
Additional resources
- Product documentation: https://aka.ms/sreagent/docs
- Self-paced hands-on labs: https://aka.ms/sreagent/lab
- Technical videos and demos: https://aka.ms/sreagent/youtube
- Azure SRE Agent home page: https://www.azure.com/sreagent
- Azure SRE Agent on X: https://x.com/azuresreagent
Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent— your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.
How we build and use Azure SRE Agent with agentic workflows
The Challenge: Ops is critical but takes time from innovation Microsoft operates always-on, mission-critical production systems at extraordinary scale. Thousands of services, millions of deployments, and constant change are the reality of modern cloud engineering. These are titan systems that power organizations around the globe—including our own—with extremely low risk tolerance for downtime. While operations work like incident investigation, response and recovery, and remediation is essential, it’s also disruptive to innovation. For engineers, operational toil often means being pulled away from feature work to diagnose alerts, sift through logs, correlate metrics across systems, or respond to incidents at any hour. On-call rotations and manual investigations slow teams down and introduce burnout. What's more, in the era of AI, demand for operational excellence has spiked to new heights. It became clear that traditional human-only processes couldn't meet the scale and complexity needs of system maintenance, especially in the AI world where code-shipping velocity has increased exponentially. At the same time, we needed to integrate with an AI landscape that continues to evolve at a breakneck pace. New models, new tooling, and new best practices are released constantly, fragmenting ecosystems between different platforms for observability, DevOps, incident management, and security. Beyond simply automating tasks, we needed to build an adaptable approach that could integrate with existing systems and improve over time. Microsoft needed a fundamentally different way to perform operations—one that reduced toil, accelerated response, and gave engineers the time to focus on building great products. The Solution: How we build Azure SRE Agent using agentic workflows To address these challenges, Microsoft built Azure SRE Agent, an AI-powered operations agent that serves as an always-on SRE partner for engineers.
In practice, Azure SRE Agent continuously observes production environments to detect and investigate incidents. It reasons across signals like logs, metrics, code changes, and other deployment records to perform root cause analysis. It supports engineers from triage to resolution, and it’s used at a variety of autonomy levels, from assistive investigation to automating remediation proposals. Everything occurs within governance guardrails and human approval checks grounded in role‑based access controls and clear escalation paths. What’s more, Azure SRE Agent learns from past incidents, outcomes, and human feedback to improve over time. But just as important as what was built is how it was built. Azure SRE Agent was created using the agentic workflow approach: building agents with agents. Rather than treating AI as a bolt-on tool, Microsoft embedded specialized agents across the entire software development lifecycle (SDLC) to collaborate with developers, from planning through operations. The diagram above outlines the agents used at each stage of development. They come together to form a full lifecycle:
Plan & Code: Agents support spec‑driven development to unlock faster inner loop cycles for developers and even product managers. With AI, we can not only draft spec documentation that defines feature requirements for UX and software development agents, but also create prototypes and check code into staging, which enables PM, UX, and engineering teams to rapidly iterate, generate, and improve code even for early-stage merges.
Verify, Test & Deploy: Agents for code quality review, security, evaluation, and deployment work together to shift left on quality and security issues. They also continuously assess reliability, ensure performance, and enforce consistent release best practices.
Operate & Optimize: Azure SRE Agent handles ongoing operational work, from investigating alerts to assisting with remediation, and even resolving some issues autonomously.
Moreover, it learns continuously over time, and we provide Azure SRE Agent with its own specialized instance of Azure SRE Agent to maintain itself and catalyze feedback loops. While agents surface insights, propose actions, mitigate issues, and suggest long-term code or IaC fixes autonomously, humans remain in the loop for oversight, approval, and decision-making when required. This combination of autonomy and governance proved critical for safe operations at scale. We also designed Azure SRE Agent to integrate across existing systems. Our team uses custom agents, Model Context Protocol (MCP) and Python tools, telemetry connections, incident management platforms, code repositories, knowledge sources, and business process and operational tools to add intelligence on top of established workflows rather than replacing them. Built this way, Azure SRE Agent was not just a new tool but a new operational system. And at Microsoft’s scale, transformative systems lead to transformative outcomes. The Impact: Reducing toil at enterprise scale The impact of Azure SRE Agent is felt most clearly in day-to-day operations. By automating investigations and assisting with remediation, the agent reduces burden for on-call engineers and accelerates time to resolution. Internally at Microsoft in the last nine months, we've seen:
- 35,000+ incidents handled autonomously by Azure SRE Agent.
- 50,000+ developer hours saved by reducing manual investigation and response work.
- Reduced on-call burden and faster time-to-mitigation during incidents.
To share a couple of specific cases, the Azure Container Apps and Azure App Service product group teams have had tremendous success with Azure SRE Agent. Engineers for Azure Container Apps had overwhelmingly positive (89%) responses to the root cause analysis (RCA) results from Azure SRE Agent, covering over 90% of incidents.
Meanwhile, Azure App Service has brought their time-to-mitigation for live-site incidents (LSIs) down to 3 minutes, a drastic improvement from the 40.5-hour average with human-only activity. And this impact is felt within the developer experience. When asked developers about how the agent has changed ops work, one of our engineers had this to say: “[It’s] been a massive help in dealing with quota requests which were being done manually at first. I can also say with high confidence that there have been quite a few CRIs that the agent was spot on/ gave the right RCA / provided useful clues that helped navigate my initial investigation in the right direction RATHER than me having to spend time exploring all different possibilities before arriving at the correct one. Since the Agent/AI has already explored all different combinations and narrowed it down to the right one, I can pick the investigation up from there and save me countless hours of logs checking.” - Software Engineer II, Microsoft Engineering Beyond the impact of the agent itself, the agentic workflow process has also completely redefined how we build. Key learnings: Agentic workflow process and impact It's very easy to think of agents as another form of advanced automation, but it's important to understand that Azure SRE agent is also a collaborative tool. Engineers can prompt the agent in their investigations to surface relevant context (logs, metrics, and related code changes) to propose actions far faster and easier than traditional troubleshooting. What’s more, they can also extend it for data analysis and dashboarding. Now engineers can focus on the agent’s findings to approve actions or intervene when necessary. The result is a human-AI partnership that scales operations expertise without sacrificing control. 
While the process took time and experimentation to refine, the payoff has been extraordinary; our team is building high-quality features faster than ever since we introduced specialized agents for each stage of the SDLC. While these results were achieved inside Microsoft, the underlying patterns are broadly applicable. First, building agents with agents is essential to scaling, as manual development quickly became a bottleneck; agents dramatically accelerated inner loop iteration through code generation, review, debugging, security fixes, etc. In practice, we found that a generic agent—guided by rich context and powered by memory and learning—can continuously adapt, becoming faster and more effective over time as it builds experience. This allows the agent to apply prior knowledge, avoid relearning, and reduce the effort required to resolve similar problems repeatedly. In parallel, specialized agents help bring consistency and repeatability to well‑defined categories of incidents, encoding proven patterns, workflows, and safeguards. Together, these approaches enable systems that both adapt to new situations and respond reliably at scale. Microsoft also learned to integrate deeply with existing systems, embedding agents into established telemetry, workflows, and platforms rather than attempting to replace them. Throughout this process, maintaining tight human‑in‑the‑loop governance proved critical. Autonomy had to be balanced with clear approval boundaries, role‑based access, and safety checks to build trust. Finally, teams learned to invest in continuous feedback and evaluation, using ongoing measurement to improve agents over time and understand where automation added value versus where human judgment should remain central. Want to learn more? Azure SRE Agent is one example of how agentic workflows can transform both product development and operations at scale. Teams at Microsoft are on a mission of leading the industry by example, not just sharing results. 
We invite you to take the practical learnings from this blog and apply the same principles in your own environments.
- Discover more about Azure SRE Agent
- Learn about agents in DevOps tools and processes
- Read best practices on agent management with Azure
3 Ways to Get More from Azure SRE Agent
When you first set up Azure SRE Agent, it’s tempting to give it everything. Connect all your alert sources, route every severity, set up scheduled tasks to poll your channels every 30 seconds. The agent can handle all of it. But a few simple configuration choices can help you get more value from every token the agent uses. Each investigation creates a conversation thread, and each thread consumes tokens. With the right setup, you can make sure the agent is spending those tokens on the work that has the highest impact. The pattern that works best: start focused, see results, and expand from there. Here are three ways to do that. 1. Start with the incidents that matter most It's natural to want full coverage from day one. But in practice, starting narrow and expanding works better. When you route only high-severity or high-impact incidents to the agent first, you get to see the quality of its investigations on the work that matters most. Once you trust the output, expanding to broader coverage is a confident decision, not a leap of faith. The mechanism for this is your incident response plan. Instead of relying on a default handler that routes everything, create a targeted response plan with filters that match the incidents you want the agent to investigate.
Incident response plan filters: severity, title keywords, and exclusions.
Getting started:
1. Go to Response plan configuration and create a new incident response plan.
2. Set the Severity filter. A good starting point is Sev0 through Sev2. These are the incidents where deep investigation has the highest impact.
3. Use Title contains to focus on specific incident patterns, or Title does not contain to exclude known noisy alerts.
4. Preview the filter results to see which past incidents would have matched.
As you see results and get comfortable, widen the filters. Add Sev3. Remove title exclusions. Bring in more incident sources.
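The filter logic above can be pictured as a simple predicate. The field names and defaults here are illustrative; the real evaluation happens inside the agent's response plan configuration, not in your own code:

```python
def matches_response_plan(incident: dict,
                          max_severity: int = 2,
                          title_contains: str = "",
                          title_excludes: str = "") -> bool:
    """Return True if an incident should be routed to the agent (Sev0-Sev2 by default)."""
    if incident["severity"] > max_severity:
        return False  # lower-severity incidents stay out until you widen the filter
    title = incident["title"].lower()
    if title_contains and title_contains.lower() not in title:
        return False
    if title_excludes and title_excludes.lower() in title:
        return False
    return True
```

Starting with a strict predicate and loosening it over time is exactly the "start focused, expand" pattern in configuration form.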
The agent will handle the volume, and you'll know what the cost looks like because you've been watching it grow incrementally. If you already have an agent running with broad filters, it's worth reviewing your response plan. A quick check on your severity and title filters can make sure the agent is spending its time on the incidents you care about. 2. Replace high-frequency polling with smarter patterns Scheduled tasks are one of the most powerful features of the agent, but they're also where cost can quietly balloon. The reason is simple: a scheduled task runs on a timer whether there's anything to find. An incident investigation fires once per incident. A task polling every 2 minutes fires 720 times a day, and most of those runs may find nothing new. High-frequency polling is generally a weak engineering pattern regardless of cost. It wastes compute, creates unnecessary load, and in the case of an AI agent, burns tokens checking for changes that haven't happened. Better patterns exist.
Prefer push over poll. If the source system can send a signal (an alert, a webhook, a ticket), use that to trigger the agent. Push-based workflows fire only when something happens. This is cheaper and faster than polling.
When polling is the right fit, batch it. Instead of checking every 2 minutes, run a thorough check every hour. One consolidated report from 24 daily runs is more useful than 720 micro-checks that mostly say "nothing changed." The hourly report shows trends. The 2-minute poll shows snapshots.
Consider HTTP triggers. If you have an external system that knows when work is needed (a deployment pipeline, a CI/CD tool, a monitoring platform), use an HTTP trigger to invoke the agent on demand. The agent only runs when there's actually something to do.
Match frequency to the operational cadence. A Teams channel monitor works fine at 5-minute intervals. Humans don't type that fast. A health summary runs once a day. A shift-handoff report runs once per shift.
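The run counts behind these patterns follow directly from the timer interval, which makes the savings easy to estimate. A quick sketch (per-run token cost is left out, since it varies by task):

```python
def runs_per_day(interval_minutes: float) -> int:
    """How many times a scheduled task fires per day at a given interval."""
    return int(24 * 60 / interval_minutes)

high_freq = runs_per_day(2)           # polling every 2 minutes
hourly_batch = runs_per_day(60)       # one thorough check per hour
daily_report = runs_per_day(24 * 60)  # a single consolidated daily report
```

Moving from a 2-minute poll to an hourly batch drops the invocation count by 30x before you even consider push or HTTP triggers.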
Ask: how quickly do I actually need to detect this change? The answer is almost always slower than the timer you first set. 3. Keep threads fresh Here's a detail that's easy to miss: every time a scheduled task runs, it adds to the same conversation thread. The agent reads the full thread history before responding. So a task that runs hourly accumulates 24 conversations a day in the same thread. After a week, the agent is reading through hundreds of prior exchanges before it even starts on the new work. The work stays the same. The cost per run keeps climbing. It's the equivalent of reopening a document and reading the entire thing from page one every time you want to add a sentence at the end. The fix is one setting. When creating or editing a scheduled task, set "Message grouping for updates" to "New chat thread for each run." That gives the agent a clean context on every execution. No accumulated history, no growing cost. One dropdown, predictable token usage on every run.
The pattern
- Start small with incident routing, expand as you see results.
- Replace high-frequency polling with push signals, batching, and HTTP triggers.
- Keep scheduled task threads fresh with "New chat thread for each run."
The agent is built to handle whatever you throw at it. These patterns just make sure you're getting the most value for what you spend.
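To see why the fresh-thread setting in point 3 matters so much, here is a rough model of context growth. The per-run token figure is a placeholder; the shape of the curve is the point: an accumulating thread grows quadratically in total tokens read, while a fresh thread per run grows linearly:

```python
def total_context_tokens(runs: int, tokens_per_run: int, fresh_thread_per_run: bool) -> int:
    """Total tokens read across all runs of a scheduled task."""
    if fresh_thread_per_run:
        return runs * tokens_per_run  # clean context every run: linear growth
    # accumulating thread: run k re-reads the k-1 prior exchanges plus its own
    return sum(k * tokens_per_run for k in range(1, runs + 1))

WEEK_OF_HOURLY_RUNS = 24 * 7  # 168 runs in a week
accumulating = total_context_tokens(WEEK_OF_HOURLY_RUNS, 1_000, fresh_thread_per_run=False)
fresh = total_context_tokens(WEEK_OF_HOURLY_RUNS, 1_000, fresh_thread_per_run=True)
```

After one week of hourly runs, the accumulating thread has read more than 80x the tokens of the fresh-thread setup, which is the difference one dropdown makes.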