Who's Calling Your Service? Designing for Humans and Agents at the Same Time
We're building three interfaces for Azure SRE Agent: an interactive CLI for humans at a terminal, an agent mode for coding agents that spawn it as a subprocess, and an MCP server for humans inside coding agents and for remote agents in other ecosystems. The CLI and agent mode are coming. The MCP server ships first. That ordering wasn't obvious at first, and this post is about why we landed there.

Three interfaces, one question: who's actually calling this?

When we started designing the Azure SRE Agent CLI, the question we kept running into was deceptively simple: who's the caller? A human at 2 AM during an incident? Yes. But also a coding agent mid-session that wants SRE Agent capabilities without leaving VS Code. And a PagerDuty SRE agent running an automated triage loop with no human in the picture. And another Azure SRE Agent instance that wants to delegate a sub-task.

Four callers. Same backend. None of them want the same thing.

The three interfaces map to these callers:

- Interactive CLI: humans at a terminal, terse and incident-optimized
- Agent Mode: coding agents like Copilot CLI that spawn it as a subprocess
- MCP Server: humans inside coding agents, and remote agents in other ecosystems

The MCP server is the one shipping now. Here's why.

The CLI requires someone to reach for it

The interactive CLI and agent mode have something in common: the caller has to know Azure SRE Agent exists and decide to invoke it. A human types a command. A coding agent spawns a subprocess. Either way, it's a deliberate call.

The MCP server works differently. It surfaces itself as tools inside whatever environment the caller is already in. The model decides when to use them. An SRE working in Copilot CLI doesn't open a separate terminal and type a command. They ask a question and the right tool fires. A remote agent in a PagerDuty loop doesn't spawn a subprocess. It speaks a protocol and gets a response.

That's the difference. The CLI requires intent.
The MCP server meets callers where they already are.

Two callers, one protocol

The MCP surface has two audiences. They speak the same protocol but come from completely different contexts.

Humans inside coding agents. An SRE in VS Code Copilot, Claude Desktop, GitHub Copilot CLI, or Cursor is already in a session: writing a deployment script, reviewing a runbook, debugging a failing service. They don't want a context switch. They want SRE Agent capabilities alongside the work they're already doing. Connect the MCP server once and it's just there.

Remote agents in other ecosystems. An AWS DevOps agent handling a cross-cloud incident might need to check Azure resource health without bouncing the call to a human. A PagerDuty SRE agent might pull an incident summary as part of its triage loop. One Azure SRE Agent instance might delegate work to another. MCP is what makes any of this work without custom integrations on both sides. Both sides agree on the protocol. Neither side has to know the other's internals.

| Caller | Context | What they need |
| --- | --- | --- |
| Human in Copilot CLI / VS Code Copilot | Mid-workflow, coding session | Readable summary, minimal overhead |
| Human in Claude Desktop / Cursor | Agentic session | SRE tools available in the conversation |
| AWS DevOps Agent | Automated incident loop | Defined schema, stable fields |
| PagerDuty SRE Agent | Triage pipeline | Parseable, sparse, no narrative |
| Other Azure SRE Agent | Delegated sub-task | Agent-to-agent contract |

Tool descriptions are product decisions

Each MCP tool maps to a specific SRE Agent capability. Tools have names, descriptions in natural language, and JSON input schemas. The descriptions do more work than they look like they should. When someone in Copilot CLI asks "what's wrong with my API gateway," the model reads tool descriptions to decide which tool to call.
A description that says "Returns health status for an Azure resource" gets invoked less reliably than one that says "Check whether an Azure resource (VM, gateway, database, container) is healthy, degraded, or unreachable. Use this when diagnosing an active outage or validating state after a deployment." The second version tells the model when to reach for the tool, not just what it does.

PM, engineering, and content design reviewed descriptions together. When an invocation misfired in testing, the fix was almost always the description, not the schema. We iterated on tool descriptions the same way you'd iterate on a system prompt, because that's what they are.

One output shape, two callers

The human-in-a-coding-agent and the remote-agent-in-an-automated-loop want different things from the same tool response. A human wants something readable. A remote agent wants something parseable. The obvious answer is to return different shapes based on who's calling. We didn't do that.

Every tool response follows the same contract: defined fields, stable semantics, no preamble, no internal reasoning, plus one summary field with a plain-language sentence. A human reads the summary. A remote agent ignores it and parses the structured fields. The overhead is negligible in both directions.

We briefly considered branching on a caller-type hint in the request header. The problem: it added surface area to maintain and created subtle failure modes when the hint was wrong or missing. One shape, always.

What's harder than it looks

Statelessness is a feature for remote agents and friction for humans. MCP tools are stateless by design. Each invocation is independent. Remote agents love this; they don't want to manage session state across calls. Humans working interactively want context to carry forward. We handled it by making every response self-sufficient: the tool returns enough context that the model can construct a coherent follow-up call without re-explaining the situation.
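As a rough illustration of the one-shape contract and the self-sufficient response described above, here is a minimal Python sketch. Every field name in it (resource_id, status, checked_at, summary, follow_up_context) is hypothetical and chosen for illustration; the post does not publish the actual SRE Agent response schema.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass(frozen=True)
class ToolResponse:
    """One response shape for every caller. All field names are illustrative."""

    resource_id: str   # stable, machine-parseable identifier
    status: str        # for example "healthy", "degraded", or "unreachable"
    checked_at: str    # ISO 8601 timestamp
    summary: str       # one plain-language sentence for human readers
    # Context a caller can pass into a follow-up call; the tool keeps no state.
    follow_up_context: dict = field(default_factory=dict)


def render(response: ToolResponse) -> str:
    """Serialize the same payload no matter who is calling."""
    return json.dumps(asdict(response), sort_keys=True)


response = ToolResponse(
    resource_id="example-gateway",
    status="degraded",
    checked_at="2025-01-01T02:00:00Z",
    summary="The gateway is degraded: one of two backends is failing health probes.",
    follow_up_context={"failing_backend": "backend-2"},
)
payload = render(response)

# A human (through the model) reads the summary; a remote agent parses the fields.
parsed = json.loads(payload)
print(parsed["summary"])
print(parsed["status"], parsed["follow_up_context"]["failing_backend"])
```

Because there is only one shape, there is no caller-type branch to get wrong: the remote agent can simply ignore the summary key.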
The tool doesn't remember. It returns enough that memory is cheap for whoever holds it.

You can't test the remote agent use case the same way you test the human use case. Spin up Copilot CLI, connect the server, ask a question and you can watch what happens. You can't easily simulate an AWS agent calling you cold with no prior context about what Azure SRE Agent does. Designing for that caller meant writing descriptions and schemas that work for a model meeting your tools for the first time, with no assumed vocabulary and no assumed workflow.

What's next

The interactive CLI and agent mode follow the same three-node architecture. The interactive CLI is for humans at the terminal: terse, incident-optimized, with progressive disclosure. Agent Mode is for coding agents that spawn the CLI as a subprocess and want direct access to SRE Agent capabilities without a protocol layer in between. Both are in progress. In the meantime connect SRE Agent to your MCP client and it will show up where you're already working.

Part of a series on the design decisions behind Azure SRE Agent. Companion posts on the CLI and agent mode will follow when they ship.

Access Your SRE Agent from Any IDE, Terminal, or AI Assistant
Your team already uses an SRE Agent: it monitors your services, learns your architecture, and handles operational tasks. Now developers can talk to that agent in natural language from the interfaces they already use every day: their editor, their terminal, their AI assistant. Check what the agent knows, ask it a question, search its memories, wire it into a workflow, all without leaving the tool where they're already writing code.

What we're announcing

Azure SRE Agent tools are now shipping in the Azure MCP Server. The @azure/mcp package includes a full set of SRE Agent tools that let you manage and operate your SRE Agents from any MCP-compatible client: GitHub Copilot CLI, VS Code Copilot, Cursor, Claude Desktop, or any agent framework that speaks MCP. No separate CLI. No portal tab. No custom integration code. Your SRE Agent becomes accessible wherever you already think and work.

This is about meeting developers where they are. Your SRE Agent has deep context about your systems: incident history, architecture knowledge, operational patterns. Now that expertise is accessible from VS Code, from your terminal, from any MCP-compatible AI assistant. Just type a question in natural language and your agent responds, right inside the workflow you're already in. Your SRE Agent stops being a destination you visit and becomes part of how your team works every day.

This post walks through what you can do and how to get it running, using GitHub Copilot CLI as the example. The same setup works in VS Code Copilot, Claude Desktop, Cursor, and any other MCP-compatible client.
What this unlocks

Once the Azure MCP Server is connected to Copilot CLI, you can talk to your SRE Agent infrastructure the same way you'd ask a colleague:

- "List my SRE agents in subscription <sub-id>"
- "Create a Kusto connector named prod-logs on agent myagent pointing at cluster https://help.kusto.windows.net, database Samples"
- "Search memories on agent myagent for 'deployment failures'"
- "Pause the nightly scheduled task on agent myagent"
- "Generate an architecture plan for a multi-region web app"

The full capability set breaks down into seven areas:

- Manage SRE Agents. List, get, and create Microsoft.App/agents resources in your subscription. Discover which tools a given agent has access to. Resource groups are resolved automatically via Resource Graph.
- Configure connectors. Create and manage Kusto connectors, MCP connectors (both http and stdio transports), Azure Monitor connectors, and more. Connectors go through ARM and show up in the Azure portal alongside anything you created there. MCP connectors default to system-assigned managed identity.
- Run and inspect threads. Create and list conversation threads, get thread details, send messages, and manage hooks on a thread. This is how you talk to the agent programmatically or inspect what it's doing mid-run.
- Schedule recurring work. Create, list, pause, resume, and delete scheduled tasks.
- Manage incidents. List active incidents and run incident setup commands for PagerDuty and ServiceNow.
- Knowledge and prompts. Manage common prompts like safety rules and standing instructions. Search, upload, and delete memories. List and delete skills. Fetch agent docs by topic.
- Author workflows. Generate architecture plans from requirements. Generate, validate, and apply YAML workflows.

Safety

Giving an AI assistant broad management access to your SRE Agents means it's worth knowing what guardrails are in place:

- Destructive operations require --confirm true. Any delete (connectors, hooks, memories, skills, scheduled tasks, sub-agents) refuses to run without the explicit flag. There's no way to accidentally tear something down through an autocompleted command.
- Secrets are stripped before they reach your client. Bearer tokens, API keys, passwords, connection strings, and Authorization headers are redacted from connector and tool responses.
- Error messages are sanitized. Upstream error bodies are scrubbed for secrets and truncated before surfacing, so credentials don't leak through error text.
- Data-plane calls are pinned to *.azuresre.ai. HTTPS is required; the host suffix is enforced to prevent SSRF. http:// is only allowed for localhost.
- Third-party hosts are pinned. ServiceNow connectors are restricted to .service-now.com and .servicenowservices.com. PagerDuty subdomains must be valid DNS labels.
- MCP connector secrets must be env-referenced. Header and environment values for MCP connectors must use ${env:NAME} syntax; literal secrets are rejected so they never enter LLM context.

Prerequisites

Before connecting anything, make sure you have the following installed and authenticated.

Node.js LTS

    node --version

If you're not on a current LTS, update via nodejs.org or your package manager. The Azure MCP Server is tested against the active Node.js LTS releases.

Azure CLI

The MCP server uses DefaultAzureCredential, which picks up credentials from az login. You need the Azure CLI installed and signed in. Install: https://learn.microsoft.com/cli/azure/install-azure-cli

    az --version
    az login

If you work across multiple tenants:

    az login --tenant <tenant-id>

If you have multiple subscriptions, set a default:

    az account set --subscription <subscription-id>

GitHub Copilot CLI

Install Copilot CLI following the official instructions: https://docs.github.com/en/copilot/how-tos/use-copilot-agents/use-copilot-cli

Once installed, launch it and authenticate with your GitHub account.
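If you want to check the prerequisites above in one pass, a small script along these lines may help. It is a sketch under assumptions: it only checks that the executables are on PATH, not that you are signed in, and the name "copilot" for the GitHub Copilot CLI binary is an assumption you may need to adjust.

```python
import shutil
from typing import Callable, Iterable, Optional

# Executables the walkthrough expects on PATH. "copilot" is an assumed binary
# name for GitHub Copilot CLI; change it if your install uses another name.
REQUIRED_TOOLS = ("node", "npx", "az", "copilot")


def missing_tools(
    tools: Iterable[str] = REQUIRED_TOOLS,
    which: Callable[[str], Optional[str]] = shutil.which,
) -> list:
    """Return the tools that cannot be found on PATH, in the order given."""
    return [tool for tool in tools if which(tool) is None]


if __name__ == "__main__":
    missing = missing_tools()
    if missing:
        print("Missing from PATH: " + ", ".join(missing))
    else:
        print("All prerequisite executables found. Run 'az login' if you are not signed in.")
```

The lookup function is injectable so the check can be exercised without touching the real PATH.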
Connect the Azure MCP Server to Copilot CLI

The Azure MCP Server runs as an npm package (@azure/mcp) and launches via npx. The easiest way to add it is interactively from within Copilot CLI:

    /mcp add

Follow the prompts: name it azure, set the command to npx, and args to -y @azure/mcp@latest server start.

Or add it manually to your MCP config file (~/.copilot/mcp.json or .copilot/mcp.json in your repo):

    {
      "mcpServers": {
        "azure": {
          "type": "stdio",
          "command": "npx",
          "args": ["-y", "@azure/mcp@latest", "server", "start"]
        }
      }
    }

Restart Copilot CLI after saving. On the next launch, npx fetches @azure/mcp and starts the server automatically.

If you'd rather install globally:

    npm install -g @azure/mcp
    azmcp server start

Keeping it up to date

@latest means npx pulls the newest version on each launch, but npx caches aggressively. If you upgrade and the old version is still running:

    npx clear-npx-cache
    # or: rm -rf ~/.npm/_npx

Then restart Copilot CLI. For environments where you want version stability, pin an exact version instead:

    "args": ["-y", "@azure/mcp@0.x.y", "server", "start"]

Bump the version string when you're ready to upgrade.

Set up access to your SRE Agents

The MCP server doesn't pin to a specific agent. It discovers agents dynamically from your subscription and you target one per command. Two steps to get access working.
Assign RBAC

You need two roles on the Microsoft.App/agents resource (or at the resource group or subscription level):

| Role | What it covers |
| --- | --- |
| Reader | Control-plane: list and get agents and connectors via ARM |
| SRE Agent Administrator | Data-plane: threads, memories, scheduled tasks, prompts, and everything on the agent's own endpoint |

    az role assignment create \
      --assignee <your-upn-or-objectid> \
      --role "Reader" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.App/agents/<agentName>

    az role assignment create \
      --assignee <your-upn-or-objectid> \
      --role "SRE Agent Administrator" \
      --scope /subscriptions/<sub>/resourceGroups/<rg>/providers/Microsoft.App/agents/<agentName>

On Windows PowerShell, use a single line or backtick continuations instead of \.

Find your agents

From Copilot CLI, ask: "List my SRE agents in subscription <sub-id>"

This returns each agent's name, resource group, and endpoint. Once you have that, you're ready to work.

How the calls work under the hood

Two distinct layers, worth knowing which is which.

Control-plane (agents, connectors): goes through Azure Resource Manager at Microsoft.App/agents, API version 2025-05-01-preview. Anything you create or modify shows up in the Azure portal.

Data-plane (threads, memories, scheduled tasks, incidents, prompts, skills, hooks, docs, workflows): goes through the agent's own endpoint at https://<name>--<hash>.<region>.azuresre.ai. The server handles the SRE Agent token audience automatically.

You don't need to manage separate credentials for the data plane. Your az login session covers both.
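The data-plane endpoint shape above, combined with the host-pinning rule from the Safety section (HTTPS on *.azuresre.ai, http:// only for localhost), can be sketched as a small validator. This is an illustration of the rule as the post states it, not the server's actual implementation; the function name and the choice to treat only the literal host "localhost" as local are assumptions.

```python
from urllib.parse import urlsplit

ALLOWED_SUFFIX = ".azuresre.ai"  # data-plane host suffix named in the post
LOCAL_HOSTS = {"localhost"}      # assumption: only this literal name counts as local


def is_allowed_endpoint(url: str) -> bool:
    """Return True for HTTPS on *.azuresre.ai, or HTTP(S) on localhost."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host in LOCAL_HOSTS:
        return parts.scheme in ("http", "https")
    if parts.scheme != "https":
        return False
    # Match on the parsed hostname with a leading dot, so lookalikes such as
    # "azuresre.ai.evil.example" or "evilazuresre.ai" are rejected.
    return host.endswith(ALLOWED_SUFFIX) and len(host) > len(ALLOWED_SUFFIX)


for candidate in (
    "https://myagent--abc123.eastus.azuresre.ai",
    "http://myagent--abc123.eastus.azuresre.ai",
    "https://azuresre.ai.evil.example",
    "http://localhost:8080",
):
    print(candidate, is_allowed_endpoint(candidate))
```

Checking the parsed hostname rather than the raw string is the part that matters for SSRF: a substring test on the full URL would accept a path or fragment that merely contains the suffix.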
When things go wrong

| Symptom | What's happening | Fix |
| --- | --- | --- |
| 401/403 on data-plane calls | Missing SRE Agent Administrator role | Assign the role at the agent scope |
| 403 on ARM calls | Missing Reader role | Assign Reader at subscription, RG, or agent scope |
| "No agent endpoint" | Agent not fully provisioned | Check provisioningState in the portal |
| sreagent_* tools not showing up | npx cache is stale | npx clear-npx-cache, restart Copilot CLI |
| Wrong tenant errors | Credentials from a different tenant | az login --tenant <id>, restart Copilot CLI |

Verify it's working

Ask Copilot CLI: "List my Azure subscriptions" or "List my SRE agents." If sreagent_* tools appear in the tool list and return results, you're connected and on a version that includes this release.

Get started with Azure SRE Agent

If you don't have an SRE Agent yet, you can create one in minutes from the Azure portal or through the CLI. Connect it to your code, your logs, and your incident sources, and it starts building expertise from day one. Once you've added the Azure MCP Server to your editor, your agent is one sentence away in every session.

Resources

- SRE Agent documentation: https://aka.ms/sreagent/newdocs
- SRE Agent overview: https://aka.ms/sreagent/newdocsoverview
- Azure MCP Server: https://aka.ms/azmcp
- Azure MCP Server get-started: https://learn.microsoft.com/azure/developer/azure-mcp-server/get-started
- Deep Context blog: https://aka.ms/sreagent/blogs/deepcontextblog

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself.

Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis.

First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives.

The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline.

That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent.

Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible.

Where we started

In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction.
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start.

We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived.

We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale.

The Inversion: Three bets

The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed.

We'd been building the agent to reason over telemetry. We needed it to reason over the system itself.

The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance.
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios.

Three architectural decisions made this possible – and each one compounded on the last.

Bet 1: The Filesystem as the Agent's World

Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint.

This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent.

Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it.

When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents.

But interface design is only half the story. The other half is what you put inside it.

Code Repositories: the highest-leverage context

Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding.

The repo is the schema. Everything else is derived from it.
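To make the single-interface idea concrete, here is a small Python sketch in which source code, a runbook, and a memory note live in one directory tree and are searched by the same function. The workspace layout, file contents, log event name, and the grep helper are all made up for illustration; they are not the agent's actual tools or files.

```python
import tempfile
from pathlib import Path


def grep(root: Path, needle: str) -> list:
    """Search every file under root; return (relative path, line number, line)."""
    hits = []
    for path in sorted(p for p in root.rglob("*") if p.is_file()):
        lines = path.read_text(encoding="utf-8").splitlines()
        for number, line in enumerate(lines, start=1):
            if needle in line:
                hits.append((path.relative_to(root).as_posix(), number, line.strip()))
    return hits


with tempfile.TemporaryDirectory() as tmp:
    workspace = Path(tmp)
    # Code, runbooks, and past notes share one tree (hypothetical contents).
    files = {
        "repo/cache.py": 'log.warning("CacheMissBelowThreshold model=%s", model)\n',
        "runbooks/cache.md": "If CacheMissBelowThreshold fires, check recent prompt changes.\n",
        "memory/debugging.md": "Prefix reordering can lower the cache hit rate.\n",
    }
    for name, text in files.items():
        target = workspace / name
        target.parent.mkdir(parents=True, exist_ok=True)
        target.write_text(text, encoding="utf-8")

    # One call covers code and documentation alike; no per-source API is needed.
    results = grep(workspace, "CacheMissBelowThreshold")
    for rel_path, line_number, line in results:
        print(f"{rel_path}:{line_number}: {line}")
```

The hit in repo/cache.py is the kind of grounding the section describes: the exact event name comes from the code that emits it, so a log query built from it does not have to guess.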
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible.

But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

- Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.
- Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.
- Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out.

But the deeper problem was retrieval. In an SRE context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance.

We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed.

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates.

This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism.

Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with the gh CLI, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
No bespoke integration required, just a shell.

The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.

Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin.

This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it.

We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

- Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
- Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
- Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
- Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast.
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal.

Tool result compression via the filesystem

Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive.

Context Pruning and Auto Compact

Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies.

Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters.

Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work.

Parallel subagents

The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions.

This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another.

The Feedback loop

These architectural bets have enabled us to close the original scaling gap.
Instead of debugging the agent at human speed, we could finally start using it to fix itself.

As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway and some conversations broke entirely. So, we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

- Analyzed our user churn rate and built dashboards we now review weekly.
- Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
- Ran security analysis and found vulnerabilities in the read path.
- Helped fill out parts of its own Responsible AI review, with strict human review.
- Handled customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again.

The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem.

The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.
From Coding Agents to Cloud Automation: AI-Assisted Customer Related Incidents in Azure Functions
On the Azure Functions team, we have been exploring how AI can help with investigating customer-reported incidents, root-cause analysis, and incident mitigation. This post shares our journey from early RCA agents to coding-agent-assisted investigations and cloud-hosted automation, and the lessons we learned along the way. Microsoft engineering teams like Azure Functions work on production live-site issues alongside customer-reported issues, and this work is one of the most important and rewarding parts of the job. On the Azure Functions team, complex customer incidents often require deep investigation. Engineers review Azure Data Explorer (Kusto) query results, source code, GitHub issues, previous incidents, public documentation, internal troubleshooting guides, and service-specific operational knowledge. The goal is always to mitigate customer impact quickly, identify the root cause, feed what we learn back into the platform, and log improvement work items where needed. This work is valuable, but it is also time-consuming. As AI capabilities improved, we started asking a practical question: could AI help us reduce the operational burden of incident investigation while preserving the learning and engineering judgment that make those investigations useful? This is the story of how our approach evolved: from early RCA agents, to coding-agent-based workflows, and finally to cloud-hosted automation. Starting with AI-Assisted RCA Around May 2024, we began experimenting with an internal RCA agent together with a colleague from Microsoft Research. The first version was an informal step toward a formal service: a personal tool to help with our own investigations. The early experiments were very useful. We could give the agent an incident, let it run for several minutes, and then review the analysis.
It did not always produce a perfect root cause, but it could run multiple queries, explore different hypotheses, and narrow the solution space enough to save time. Later, Azure SRE Agent emerged as a formal internal service. We contributed to it based on what we had learned from our earlier experiments. At that point, using AI to help resolve customer-reported incidents became a major focus for our team. What We Learned from Agentic Workflows The first generation of AI-assisted incident workflows was highly structured. Early experiments with the models available at the time required careful design. For generating complex Kusto in particular, we often needed to expose fixed Kusto queries as tools and let the model call them through well-defined parameters. Fig 1. Kusto query tool This made execution more predictable and reproducible, but it also revealed limitations. Detailed agentic workflows could work well when the incident matched the predefined path. Outside those paths, they were less flexible. Engineers also found it expensive to define and maintain those workflows, especially when the output felt only modestly better than a dashboard. Fig 2. Agentic workflow That experience taught us an important lesson: for complex operational investigations, flexibility matters as much as structure. The Shift to Coding Agents Near the end of 2025, we started using an internal tool built on GitHub Copilot and skills, which made it possible to define and share VS Code workspaces. A workspace could include agent definitions, instructions, prompts, skills, MCP configuration, and repositories. Fig 3. GitHub Copilot internal tool The quality difference was significant. Combined with newer models, coding agents could investigate incidents much more flexibly. They could run Kusto queries, inspect code, use CLI and MCP tools, and iterate quickly by trying different paths. The team quickly adopted this model.
With earlier workflow-based systems, engineers were reluctant to onboard because defining detailed workflows took effort and the payoff was limited. With this internal tool, engineers started contributing agent definitions and skills because the system was easy to extend. Over time, the Azure Functions team accumulated a growing set of AI-ready materials: agent definitions, skills, MCP tools, instructions, and repositories. A number of lessons stood out. Lessons from Building AI-Ready Materials Prefer guidance over over-specification Modern coding agents are capable enough that they do not need every step spelled out. In fact, too many instructions can make the system brittle or stale. We found it better to provide concise guidance and point the agent to maintained sources of truth rather than embedding large amounts of detail directly into prompts. Manage context deliberately Instructions, tool definitions, conversation history, and tool outputs all compete for model context. Irrelevant or contradictory information can reduce quality. Tool design matters too: if a tool returns a large payload directly to the model, it can consume many tokens and confuse the agent. For large outputs, writing results to files and returning concise pointers often works better. Use files as durable memory Long-running investigations benefit from a simple pattern: create a plan and checklist file, update it as work progresses, and let the agent re-read it when needed. This helps the agent recover from context compaction and gives the investigation a durable state inside the workspace. Prefer references over inline knowledge Agent definitions and skills carry domain knowledge: internal troubleshooting guides, product behavior, operational history, and expert judgment. Instead of placing all that information directly in prompts, we found it more effective to provide references to where the knowledge lives and guidance on when to use it.
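The file-based lessons above (return a pointer instead of a large payload, and keep durable state in the workspace) can be sketched in a few lines of Python. This is a minimal illustration, not our actual tooling: the `to_tool_result` function, the `MAX_INLINE_CHARS` threshold, and the result fields are all hypothetical names chosen for the example.

```python
import tempfile
from pathlib import Path

# Hypothetical threshold: payloads above this size go to disk instead of context.
MAX_INLINE_CHARS = 2000


def to_tool_result(name: str, payload: str, workdir: Path) -> dict:
    """Return small payloads inline; write large ones to a file and return a pointer."""
    if len(payload) <= MAX_INLINE_CHARS:
        return {"tool": name, "inline": payload}

    path = workdir / f"{name}.out"
    path.write_text(payload, encoding="utf-8")
    lines = payload.splitlines()
    return {
        "tool": name,
        "file": str(path),       # pointer the agent can open or search later
        "chars": len(payload),
        "lines": len(lines),
        "preview": lines[:3],    # a few lines so the agent can judge relevance
    }


if __name__ == "__main__":
    workdir = Path(tempfile.mkdtemp(prefix="investigation-"))
    big = "\n".join(f"row {i}" for i in range(1000))
    print(to_tool_result("kusto_query", big, workdir))
```

With a result shaped like this, the agent can search the file on disk and bring only the matching lines into context. The threshold is a guess and would need tuning per model and per tool.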
Make the right repositories visible Coding agents are strong at reading code. For our scenarios, multi-repository workspaces were especially powerful. When the agent could see related repositories together, it could trace behavior across components, understand dependencies, and produce better analysis. Domain knowledge matters most The best agent assets were often created by engineers with deep product and operational experience, not necessarily by AI specialists. The key skill was turning expert knowledge into instructions, references, and repository layouts that an agent could use. Facilitate and streamline domain knowledge updates Every incident not fully handled by an agent is a learning opportunity. Feed the context engineering flywheel: investigate, find gaps, update agent guidance, then re-test. It's important to keep this cycle quick and easy. Why We Moved Toward Cloud Automation Coding agents were extremely helpful, but they were still interactive tools. An engineer had to start the investigation and often guide it. For incident response, we wanted to go further. If an incident entered a specific feature area, the system should be able to start the investigation automatically, run the relevant analysis, and post useful results back to the incident. Even if the analysis was not perfect, narrowing the problem space early could reduce mitigation time. Some scenarios could eventually support automatic mitigation or automatic transfer to the right team. A local coding-agent workflow has advantages, especially because it can authenticate as the user. But as a foundation for reliable automation, it also had important limitations. First, it still depended on human involvement. AI dramatically improved individual productivity, but in incident response the bottleneck is often human attention and time. Even when starting an agent is simple, requiring an engineer to initiate the run introduces a context switch and consumes a scarce resource. 
Second, it depended on user credentials. Coding agents run with the user’s permissions, which can be overly broad for automation, and they inherit human-oriented flows such as browser-based reauthentication. For durable automation, we wanted an identity model better suited to unattended execution, such as managed identity. Third, there were execution-environment and security concerns. A local environment is powerful, but it does not naturally provide the sandboxing we wanted for safe automation. Because it runs with user access, it may also reach a much wider set of files and resources than is desirable for an automated incident workflow. Local and dev-box environments also have operational drawbacks. They can require restarts, contend with other workloads, and are not ideal for durable execution, failure recovery, or failover. For automation, we wanted a dedicated execution environment rather than something tied to an engineer’s machine. Finally, token management became an operational concern. User-linked token consumption can create instability when limits are reached, and automation can skew usage patterns so that one user appears to consume a disproportionate share of AI capacity. That adds noise to operational analysis and makes governance harder. For all of these reasons, cloud execution looked like the right direction. We wanted managed identity, a secure sandbox, durable execution, and a system that would not depend on someone’s local machine. Requirements for Cloud Automation Many of us had become strong supporters of coding agents and wanted to keep using them. Just as importantly, we had already accumulated assets that had been proven to work well: agent definitions, instructions, prompts, skills, MCP configuration, and repository layouts that the team had gradually built up and refined. That meant our move toward cloud automation was not about replacing coding agents with something entirely different. 
We wanted to preserve and reuse the assets that had made coding agents successful, while moving to an execution model that was better suited to automation. At the same time, coding agents had set a high quality bar. Because they worked so well in practice, we were not willing to assume that a cloud service would automatically deliver the same level of quality. So we defined two concrete goals for the cloud path. Achieve the same level of quality we were seeing from our existing coding-agent workflows when run in a focused, one-shot investigation. Ensure the assets we had already built could continue to be used and improved. In other words, we were not looking for just another cloud AI system. We were looking for a cloud automation path that could inherit the strengths of coding agents while providing the operational properties automation required. Comparing the Headless Coding-Agent Execution Service and Azure SRE Agent To evaluate which approach could meet those requirements, we ran a side-by-side comparison. One path was a prototype headless coding-agent execution service. It reused the same workspace definitions from the internal tool that engineers used locally, but ran them without a human in the loop. When an incident entered a target loop, the system created an agent workspace, prepared repositories, started GitHub Copilot CLI with an initial prompt, collected the analysis, and posted the result back to the incident. It also preserved session artifacts so that an engineer could later review or resume the investigation. Fig 4. Agent Helped Trend: as engineers adopted coding agents, and as the headless coding-agent execution service and SRE Agent were introduced, the percentage of cases where the agent helped increased. The other path used Azure SRE Agent, which, since our earlier experiments, had improved with preview customer feedback and was nearing general availability.
It now supported newer models, stronger custom-agent behavior, MCP and built-in tools, repository access, and incident-triggered execution. We performed a one-time migration of our coding-agent assets to Azure SRE Agent assets. It took one day using GitHub Copilot CLI and our existing coding agents. The comparison was deliberately practical. We already knew that our internal coding-agent environment produced results engineers trusted and liked. That became our quality bar. If Azure SRE Agent could meet or exceed that bar while also satisfying the operational requirements of cloud automation, it would be the stronger long-term path. Results and Feedback Loop The first results from the headless coding-agent execution service were very encouraging. In its first set of incidents, the RCA matched the SME conclusion in cases where the agent could safely process the incident. That showed that the assets we had built for local coding agents could transfer effectively into a headless scenario. Azure SRE Agent also performed strongly from the beginning. The headless coding-agent execution service initially had slightly better analysis in some areas, but Azure SRE Agent was already good enough to be operationally useful. We then built an evaluation framework that compared: each agent’s RCA, confidence score, and mitigation steps; the RCA and mitigation reason later provided by a human; auto-mitigation recommendations; the path to auto-mitigation; auto-transfer recommendations; and session-level execution issues. This evaluation became a feedback loop. Engineers reviewed interesting incidents, identified weaknesses, improved agent definitions and skills, and submitted pull requests. We also used agent assistance to generate improvement PRs from comparison reports. Fig 5.
LLM-as-judge side-by-side eval for the headless coding-agent execution service (blue) vs Azure SRE Agent (green) Within a few weeks, Azure SRE Agent’s quality consistently exceeded the headless coding-agent execution service baseline. At that point, we stopped posting headless results back to incidents and focused on improving the Azure SRE Agent path instead. We also automated synchronization from the internal coding-agent assets so improvements could continue to flow through pull requests. That shift was important. It meant Azure SRE Agent was no longer just an interesting alternative; it had become the cloud path that could inherit what worked in coding agents while providing a better foundation for automation. Why Cloud AI Started to Work Better A common reaction to coding agents is that they feel much better than previous cloud AI experiences. Our experience suggests two main reasons: stronger models and improved access to the right context. A coding agent sees a workspace. It can use instructions, skills, tools, repositories, and files. Traditional cloud AI systems often did not have access to the same set of resources. Once Azure SRE Agent could see similar assets - the right repositories, the right tools, and the right domain-specific knowledge - it could reach comparable or better quality. The details of context compaction, tool execution, and orchestration matter. But the core principle is simpler: the agent needs to reach the right knowledge at the right time without carrying unnecessary context all the time. That means the most important work is not only choosing a model or building a tool. It is creating high-quality AI-ready assets: concise instructions, useful skills, accurate references, well-structured repository access, and domain knowledge that was previously locked in people’s heads.
The cloud-hosted automation path also brought an immediate benefit: the issue analysis is stored in the cloud, not only on a developer's machine. Conclusions and investigations remain available for later review, and engineers can follow up through the chat interface. Fig 6. An Example of the Chat Interface for Azure SRE Agent Conclusion Our journey started with a personal RCA assistant, moved through structured agentic workflows, accelerated with coding agents, and eventually led us back to a cloud-hosted automation path. The lesson is not that coding agents or cloud agents are universally better. The lesson is that agent quality depends heavily on what the agent can see, how much irrelevant context it avoids, and whether domain experts have translated their knowledge into usable assets. For us, the key was not abandoning coding agents. It was carrying their strengths forward into Azure SRE Agent and a cloud execution model that was better suited to automation. Modern agents are now capable enough to make that work worthwhile. For incident response, that opens the door to faster investigation, safer automation, and ultimately lower incident mitigation time for customers. The Azure Functions team hopes this experience is useful to other teams exploring how to apply AI to complex engineering operations. In the next post, we plan to go deeper into the evaluation framework and how we automated the feedback loop behind these improvements.
Managing Multi‑Tenant Azure Resource with SRE Agent and Lighthouse
Azure SRE Agent is an AI‑powered reliability assistant that helps teams diagnose and resolve production issues faster while reducing operational toil. It analyzes logs, metrics, alerts, and deployment data to perform root cause analysis and recommend or execute mitigations with human approval. It can integrate with Azure services across the subscriptions and resource groups that you need to monitor and manage. Today’s enterprise customers live in a multi-tenant world, for reasons that include acquisitions, complex corporate structures, managed service providers, and IT partners. Azure Lighthouse enables enterprise IT teams and managed service providers to manage resources across multiple Azure tenants from a single control plane. In this demo I will walk you through how to set up Azure SRE Agent to manage and monitor multi-tenant resources delegated through Azure Lighthouse. Navigate to the Azure SRE Agent and select Create agent. Fill in the required details along with the deployment region and deploy the SRE agent. Once the deployment is complete, select Set up your agent. Select the Azure resources you would like your agent to analyze, such as resource groups or subscriptions. This opens a pop-up window where you can select the subscriptions and resource groups that you would like SRE Agent to monitor and manage. You can then select the subscriptions and resource groups under the same tenant that you want SRE Agent to manage. Great, so far so good 👍 As a Managed Service Provider (MSP), you have multiple tenants that you manage via Azure Lighthouse, and you need SRE Agent to have access to them. To demo this, we need to set up Azure Lighthouse with the correct set of roles and configuration to delegate access to the management subscription where the centralized SRE agent is running. In the Azure portal, search for Lighthouse. Navigate to the Lighthouse home page and select Manage your customers.
On My customers Overview, select Create ARM Template. Provide a Name and Description. Select Subscription as the Delegated scope. Select + Add authorization, which takes you to the Add authorization window. Select the Principal type; I am selecting User for demo purposes. The pop-up window lets you select users from the list. Select the checkbox next to the user to whom you want to delegate the subscription and choose Select. Then select the Role that you would like to assign to the user from the managing tenant in the delegated tenant and select Add. You can add multiple roles by adding additional authorizations for the selected user. This step is important: the delegated tenant must be assigned the right role for SRE Agent to add it as an Azure source. Azure SRE Agent requires an Owner or User Access Administrator RBAC role to assign the subscription to the list of managed resources. If an appropriate role is not assigned, you will see an error when selecting the delegated subscriptions in SRE Agent Managed resources. Per Azure Lighthouse role support, the Owner role isn’t supported, and the User Access Administrator role is supported only for a limited purpose. Refer to the Azure Lighthouse documentation for additional information. If the role is not defined correctly, you might see an error stating: 🛑Failed to add Role assignment “The 'delegatedRoleDefinitionIds' property is required when using certain roleDefinitionIds for authorization.” To allow a principalId to assign roles to a managed identity in the customer tenant, set its roleDefinitionId to User Access Administrator. Download the ARM template and add the specific Azure built-in roles that you want to grant in the delegatedRoleDefinitionIds property. You can include any supported Azure built-in role except for User Access Administrator or Owner.
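The role rules above are easy to get wrong when editing the template by hand, so a small pre-flight check can help. The following is a minimal sketch, not part of Azure Lighthouse or SRE Agent: `check_authorization` is a hypothetical helper, and it encodes only the rules stated above. The User Access Administrator ID matches the one used in this post; the Owner ID does not appear in this post and is the Azure built-in role ID as I know it, so verify it against the built-in roles reference before relying on it.

```python
# Azure built-in role definition IDs.
# USER_ACCESS_ADMINISTRATOR matches the ID used in this post's template.
# OWNER is an assumption to verify against the Azure built-in roles reference.
OWNER = "8e3af657-a8ff-443c-a75c-2fe8c4bcb635"
USER_ACCESS_ADMINISTRATOR = "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9"


def check_authorization(auth: dict) -> list[str]:
    """Return the problems found in one Lighthouse authorization entry."""
    problems = []
    role = auth.get("roleDefinitionId", "").lower()
    delegated = [r.lower() for r in auth.get("delegatedRoleDefinitionIds", [])]

    if role == OWNER:
        problems.append("Owner is not supported by Azure Lighthouse")
    if role == USER_ACCESS_ADMINISTRATOR and not delegated:
        problems.append(
            "delegatedRoleDefinitionIds is required with User Access Administrator"
        )
    for delegated_role in delegated:
        if delegated_role in (OWNER, USER_ACCESS_ADMINISTRATOR):
            problems.append(f"{delegated_role} cannot be a delegated role")
    return problems


if __name__ == "__main__":
    entry = {"roleDefinitionId": USER_ACCESS_ADMINISTRATOR}
    print(check_authorization(entry))
```

Running a check like this over every entry in the `authorizations` array before uploading the template should surface the "delegatedRoleDefinitionIds property is required" error early.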
This example shows a principalId with User Access Administrator role that can assign two built in roles to managed identities in the customer tenant: Contributor and Log Analytics Contributor. { "principalId": "00000000-0000-0000-0000-000000000000", "principalIdDisplayName": "Policy Automation Account", "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9", "delegatedRoleDefinitionIds": [ "b24988ac-6180-42a0-ab88-20f7382dd24c", "92aaf0da-9dab-42b6-94a3-d43ce8d16293" ] } In addition SRE agent would require certain roles at the managed identity level in order to access and operate on those services. Locate SRE agent User assigned managed identity and add roles to the service principal. For the demo purpose I am assigning Reader, Monitoring Reader, and Log Analytics Reader role. Here is the sample ARM template used for this demo. { "$schema": "https://schema.management.azure.com/schemas/2019-08-01/subscriptionDeploymentTemplate.json#", "contentVersion": "1.0.0.0", "parameters": { "mspOfferName": { "type": "string", "metadata": { "description": "Specify a unique name for your offer" }, "defaultValue": "lighthouse-sre-demo" }, "mspOfferDescription": { "type": "string", "metadata": { "description": "Name of the Managed Service Provider offering" }, "defaultValue": "lighthouse-sre-demo" } }, "variables": { "mspRegistrationName": "[guid(parameters('mspOfferName'))]", "mspAssignmentName": "[guid(parameters('mspOfferName'))]", "managedByTenantId": "6e03bca1-4300-400d-9e80-000000000000", "authorizations": [ { "principalId": "504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "e40ec5ca-96e0-45a2-b4ff-59039f2c2b59", "principalIdDisplayName": "Pranab Mandal" }, { "principalId": "504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9", "delegatedRoleDefinitionIds": [ "b24988ac-6180-42a0-ab88-20f7382dd24c", "92aaf0da-9dab-42b6-94a3-d43ce8d16293" ], "principalIdDisplayName": "Pranab Mandal" }, { "principalId": 
"504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c", "principalIdDisplayName": "Pranab Mandal" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "acdd72a7-3385-48ef-bd42-f606fba81ae7", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "43d0d8ad-25c7-4714-9337-8ba259a9fe05", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "73c42c96-874c-492b-b04d-ab87d138a893", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" } ] }, "resources": [ { "type": "Microsoft.ManagedServices/registrationDefinitions", "apiVersion": "2022-10-01", "name": "[variables('mspRegistrationName')]", "properties": { "registrationDefinitionName": "[parameters('mspOfferName')]", "description": "[parameters('mspOfferDescription')]", "managedByTenantId": "[variables('managedByTenantId')]", "authorizations": "[variables('authorizations')]" } }, { "type": "Microsoft.ManagedServices/registrationAssignments", "apiVersion": "2022-10-01", "name": "[variables('mspAssignmentName')]", "dependsOn": [ "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]" ], "properties": { "registrationDefinitionId": "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]" } } ], "outputs": { "mspOfferName": { "type": "string", "value": "[concat('Managed by', ' ', parameters('mspOfferName'))]" }, "authorizations": { "type": "array", "value": "[variables('authorizations')]" } } } Login to the customers tenant and navigate to the service provides from the Azure Portal. From the Service Providers overview screen, select Service provider offers from the left navigation pane. From the top menu, select the Add offer drop down and select Add via template. 
In the Upload Offer Template window, drag and drop or upload the template file that was created in the earlier step and select Upload. Once the file is uploaded, select Review + Create. The template takes a few minutes to deploy, after which a successful deployment page should be displayed. Navigate to Delegations from the Lighthouse overview and validate that you see the delegated subscription and the assigned role. Once the Lighthouse delegation is set up, sign in to the managing tenant and navigate to the deployed SRE agent. Navigate to Azure resources from the top menu or via Settings > Managed resources. Select Add subscriptions to choose the customer subscriptions that you need SRE Agent to manage. Adding a subscription automatically adds the required permissions for the agent. Once the appropriate roles are added, the subscriptions are ready for the agent to manage and monitor resources within them. Summary - Benefits This blog post demonstrates how Azure SRE Agent can be used to centrally monitor and manage Azure resources across multiple tenants by integrating it with Azure Lighthouse, a common requirement for enterprises and managed service providers operating in complex, multi-tenant environments. The benefits include: Centralized SRE operations across multiple Azure tenants Secure, role-based access using delegated resource management Reduced operational overhead for MSPs and enterprise IT teams Unified visibility into resource health and reliability across customer environments
Announcing AWS with Azure SRE Agent: Cross-Cloud Investigation using the brand new AWS DevOps Agent
Overview Connect Azure SRE Agent to AWS services using the official AWS MCP server. Query AWS documentation, execute any of the 15,000+ AWS APIs, run operational workflows, and kick off incident investigations through AWS DevOps Agent, which is now generally available. The AWS MCP server connects Azure SRE Agent to AWS documentation, APIs, regional availability data, pre-built operational workflows (Agent SOPs), and AWS DevOps Agent for incident investigation. When connected, the proxy exposes 23 MCP tools organized into four categories: documentation and knowledge, API execution, guided workflows, and DevOps Agent operations. How it works The MCP Proxy for AWS runs as a local stdio process that SRE Agent spawns via uvx . The proxy handles AWS authentication using credentials you provide as environment variables. No separate infrastructure or container deployment is needed. In the portal, you use the generic MCP server (User provided connector) option with stdio transport. Key capabilities Area Capabilities Documentation Search all AWS docs, API references, and best practices; retrieve pages as markdown API execution Execute authenticated calls across 15,000+ AWS APIs with syntax validation and error handling Agent SOPs Pre-built multi-step workflows following AWS Well-Architected principles Regional info List all AWS regions, check service and feature availability by region Infrastructure Provision VPCs, databases, compute instances, storage, and networking resources Troubleshooting Analyze CloudWatch logs, CloudTrail events, permission issues, and application failures Cost management Set up billing alerts, analyze resource usage, and review cost data DevOps Agent Start AWS incident investigations, read root cause analyses, get remediation recommendations, and chat with AWS DevOps Agent Note: The AWS MCP Server is free to use. You pay only for the AWS resources consumed by API calls made through the server. All actions respect your existing IAM policies. 
Prerequisites Azure SRE Agent resource deployed in Azure AWS account with IAM credentials configured uv package manager installed on the SRE Agent host (used to run the MCP proxy via uvx ) IAM permissions: aws-mcp:InvokeMcp , aws-mcp:CallReadOnlyTool , and optionally aws-mcp:CallReadWriteTool Step 1: Create AWS access keys The AWS MCP server authenticates using AWS access keys (an Access Key ID and a Secret Access Key). These keys are tied to an IAM user in your AWS account. You create them in the AWS Management Console. Navigate to IAM in the AWS Console Sign in to the AWS Management Console In the top search bar, type IAM and select IAM from the results (Direct URL: https://console.aws.amazon.com/iam/ ) In the left sidebar, select Users (Direct URL: https://console.aws.amazon.com/iam/home#/users ) Create a dedicated IAM user Create a dedicated user for SRE Agent rather than reusing a personal account. This makes it easy to scope permissions and rotate keys independently. Select Create user Enter a descriptive user name (e.g., sre-agent-mcp ) Do not check "Provide user access to the AWS Management Console" (this user only needs programmatic access) Select Next Select Attach policies directly Select Create policy (opens in a new tab) and paste the following JSON in the JSON editor: { "Version": "2012-10-17", "Statement": [ { "Effect": "Allow", "Action": [ "aws-mcp:InvokeMcp", "aws-mcp:CallReadOnlyTool", "aws-mcp:CallReadWriteTool" ], "Resource": "*" } ] } Select Next, give the policy a name (e.g., SREAgentMCPAccess ), and select Create policy Back on the Create user tab, select the refresh button in the policy list, search for SREAgentMCPAccess , and check it Select Next > Create user Generate access keys After the user is created, generate the access keys that SRE Agent will use: From the Users list, select the user you just created (e.g., sre-agent-mcp ) Select the Security credentials tab Scroll down to the Access keys section Select Create access key For the 
use case, select Third-party service Check the confirmation checkbox and select Next Optionally add a description tag (e.g., Azure SRE Agent ) and select Create access key Copy both values immediately: Value Example format Where you'll use it Access Key ID <your-access-key-id> Connector environment variable AWS_ACCESS_KEY_ID Secret Access Key <your-secret-access-key> Connector environment variable AWS_SECRET_ACCESS_KEY Important: The Secret Access Key is shown only once on this screen. If you close the page without copying it, you must delete the key and create a new one. Select Download .csv file as a backup, then store the file securely and delete it after configuring the connector. Tip: For production use, also add service-specific IAM permissions for the AWS APIs you want SRE Agent to call. The MCP permissions above grant access to the MCP server itself, but individual API calls (e.g., ec2:DescribeInstances , logs:GetQueryResults ) require their own IAM actions. Start broad for testing, then scope down using the principle of least privilege. Required permissions summary Permission Description Required? aws-mcp:InvokeMcp Base access to the AWS MCP server Yes aws-mcp:CallReadOnlyTool Read operations (describe, list, get, search) Yes aws-mcp:CallReadWriteTool Write operations (create, update, delete resources) Optional Step 2: Add the MCP connector Connect the AWS MCP server to your SRE Agent using the portal. The proxy runs as a local stdio process that SRE Agent spawns via uvx . It handles SigV4 signing using the AWS credentials you provide as environment variables. Determine the AWS MCP endpoint for your region The AWS MCP server has regional endpoints. 
Choose the one matching your AWS resources: AWS Region MCP Endpoint URL us-east-1 (default) https://aws-mcp.us-east-1.api.aws/mcp us-west-2 https://aws-mcp.us-west-2.api.aws/mcp eu-west-1 https://aws-mcp.eu-west-1.api.aws/mcp Note: Without the --metadata AWS_REGION=<region> argument, operations default to us-east-1 . You can always override the region in your query. Using the Azure portal In Azure portal, navigate to your SRE Agent resource Select Builder > Connectors Select Add connector Select MCP server (User provided connector) and select Next Configure the connector with these values: Field Value Name aws-mcp Connection type stdio Command python3 Arguments -c , __import__('subprocess').check_call(['pip','install','-q','mcp-proxy-for-aws']);__import__('os').execlp('mcp-proxy-for-aws','mcp-proxy-for-aws','https://aws-mcp.us-east-1.api.aws/mcp','--metadata','AWS_REGION=us-west-2') Environment variables AWS_ACCESS_KEY_ID=<your-access-key-id> , AWS_SECRET_ACCESS_KEY=<your-secret-access-key> Select Next to review Select Add connector This is equivalent to the following MCP client configuration used by tools like Claude Desktop or Amazon Kiro CLI: { "mcpServers": { "aws-mcp": { "command": "uvx", "args": [ "mcp-proxy-for-aws@latest", "https://aws-mcp.us-east-1.api.aws/mcp", "--metadata", "AWS_REGION=us-west-2" ] } } } Important: Store the AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY securely. In the portal, environment variables for connectors are stored encrypted. For production deployments, consider using a dedicated IAM user with scoped-down permissions (see Step 1). Never commit credentials to source control. Tip: If your SRE Agent host already has AWS credentials configured (e.g., via aws configure or an instance profile), the proxy will pick them up automatically from the environment. In that case, you can omit the explicit AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY environment variables. 
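The endpoint table and the client configuration above can be generated rather than hand-edited. The following is a minimal sketch, assuming only the three regions listed in the table and the uvx-based configuration shown above; the function names `endpoint_url` and `build_client_config` are illustrative and not part of any AWS or Azure tooling.

```python
import json

# Regions taken from the endpoint table above; no other regions are assumed.
KNOWN_ENDPOINT_REGIONS = ("us-east-1", "us-west-2", "eu-west-1")


def endpoint_url(region: str) -> str:
    """Return the regional AWS MCP endpoint URL for a region listed above."""
    if region not in KNOWN_ENDPOINT_REGIONS:
        raise ValueError(f"no endpoint listed for region {region!r}")
    return f"https://aws-mcp.{region}.api.aws/mcp"


def build_client_config(endpoint_region: str, default_region: str) -> dict:
    """Build the uvx-based MCP client configuration shown in this step."""
    return {
        "mcpServers": {
            "aws-mcp": {
                "command": "uvx",
                "args": [
                    "mcp-proxy-for-aws@latest",
                    endpoint_url(endpoint_region),
                    "--metadata",
                    # Default region for operations when the query names none.
                    f"AWS_REGION={default_region}",
                ],
            }
        }
    }


if __name__ == "__main__":
    print(json.dumps(build_client_config("us-east-1", "us-west-2"), indent=2))
```

Keeping the endpoint region and the default operating region as separate parameters mirrors the example above, where the us-east-1 endpoint is used with AWS_REGION=us-west-2.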
Note: After adding the connector, the agent service initializes the MCP connection. This may take up to 30 seconds as uvx downloads the proxy package on first run (~89 dependencies). If the connector does not show Connected status after a minute, see the Troubleshooting section below. Step 3: Add an AWS skill Skills give agents domain knowledge and best practices for specific tool sets. Create an AWS skill so your agent knows how to troubleshoot AWS services, provision infrastructure, and follow operational workflows. Tip: Why skills over subagents? Skills inject domain knowledge into the main agent's context, so it can use AWS expertise without handing off to a separate agent. Conversation context stays intact and there's no handoff latency. Use a subagent when you need full isolation with its own system prompt and tool restrictions. Navigate to Builder > Skills Select Add skill Paste the following skill configuration: api_version: azuresre.ai/v1 kind: SkillConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: aws_infrastructure_operations display_name: AWS Infrastructure & Operations description: | AWS infrastructure and operations: EC2, EKS, Lambda, S3, RDS, CloudWatch, CloudTrail, IAM, VPC, and others. Also covers AWS DevOps Agent for incident investigation, root cause analysis, and remediation. Use for querying AWS resources, investigating issues, provisioning infrastructure, searching documentation, running AWS API calls via the AWS MCP server, and coordinating investigations between Azure SRE Agent and AWS DevOps Agent. instructions: | ## Overview The AWS MCP Server is a managed remote MCP server that gives AI assistants authenticated access to AWS services. It combines documentation access, authenticated API execution, and pre-built Agent SOPs in a single interface. **Authentication:** Handled automatically by the MCP Proxy for AWS, running as a local stdio process. 
All actions respect existing IAM policies configured in the connector environment variables. **Regional endpoints:** The MCP server has regional endpoints. The proxy is configured with a default region; you can override by specifying a region in your queries (e.g., "list my EC2 instances in eu-west-1"). ## Searching Documentation Use aws___search_documentation to find information across all AWS docs. ## Executing AWS API Calls Use aws___call_aws to execute authenticated AWS API calls. The tool handles SigV4 signing and provides syntax validation. ## Using Agent SOPs Use aws___retrieve_agent_sop to find and follow pre-built workflows. SOPs provide step-by-step guidance following AWS Well-Architected principles. ## Regional Operations Use aws___list_regions to see all available AWS regions and aws___get_regional_availability to check service support in specific regions. ## AWS DevOps Agent Integration The AWS MCP server includes tools for AWS DevOps Agent: - aws___list_agent_spaces / aws___create_agent_space: Manage AgentSpaces - aws___create_investigation: Start incident investigations (5-8 min async) - aws___get_task: Poll investigation status - aws___list_journal_records: Read root cause analysis - aws___list_recommendations / aws___get_recommendation: Get remediation steps - aws___start_evaluation: Run proactive infrastructure evaluations - aws___create_chat / aws___send_message: Chat with AWS DevOps Agent ## Troubleshooting | Issue | Solution | |-------|----------| | Access denied errors | Verify IAM policy includes aws-mcp:InvokeMcp and aws-mcp:CallReadOnlyTool | | API call fails | Check IAM policy includes the specific service action | | Wrong region results | Specify the region explicitly in your query | | Proxy connection error | Verify uvx is installed and the proxy can reach aws-mcp.region.api.aws | mcp_connectors: - aws-mcp Select Save Note: The mcp_connectors: - aws-mcp at the bottom links this skill to the connector you created in Step 2. 
The skill's instructions teach the agent how to use the 23 AWS MCP tools effectively. Step 4: Test the integration Open a new chat session with your SRE Agent and try these example prompts to verify the connection is working. Quick verification Start with this simple test to confirm the AWS MCP proxy is connected and authenticating correctly: What AWS regions are available? If the agent returns a list of regions, the connection is working. If you see authentication errors, go back and verify the IAM credentials and permissions from Step 1. Documentation and knowledge Search AWS documentation for EKS best practices for production clusters What AWS regions support Amazon Bedrock? Read the AWS documentation page about S3 bucket policies Infrastructure queries List all my running EC2 instances in us-east-1 Show me the details of my EKS cluster named "production-cluster" What Lambda functions are deployed in my account? CloudWatch and monitoring What CloudWatch alarms are currently in ALARM state? Show me the CPU utilization metrics for my RDS instance over the last 24 hours Search CloudWatch Logs for errors in the /aws/lambda/my-function log group Troubleshooting workflows My EC2 instance i-0abc123 is not reachable. Help me troubleshoot. My Lambda function is timing out. Walk me through the investigation. Find an Agent SOP for troubleshooting EKS pod scheduling failures Cross-cloud scenarios My Azure Function is failing when calling AWS S3. Check if there are any S3 service issues and review the bucket policy for "my-data-bucket". Compare the health of my AWS EKS cluster with my Azure AKS cluster. 
AWS DevOps Agent investigations List all available AWS DevOps Agent spaces in my account Create an AWS DevOps Agent investigation for the high error rate on my Lambda function "order-processor" in us-west-2 Start a chat with AWS DevOps Agent about my EKS cluster performance Cross-agent investigation (Azure SRE Agent + AWS DevOps Agent) My application is failing across both Azure and AWS. Start an AWS DevOps Agent investigation for the AWS side while you check Azure Monitor for errors on the Azure side. Then combine the findings into a unified root cause analysis. What's New: AWS DevOps Agent Integration The AWS MCP server now includes full integration with AWS DevOps Agent, which recently became generally available. This means Azure SRE Agent can start autonomous incident investigations on AWS infrastructure and get back root cause analyses and remediation recommendations — all within the same chat session. Available tools by category AgentSpace management Tool Description aws___list_agent_spaces Discover available AgentSpaces aws___get_agent_space Get AgentSpace details including ARN and configuration aws___create_agent_space Create a new AgentSpace for investigations Investigation lifecycle Tool Description aws___create_investigation Start an incident investigation (async, 5-8 min) aws___get_task Poll investigation task status aws___list_tasks List investigation tasks with filters aws___list_journal_records Read root cause analysis journal aws___list_executions List execution runs for a task aws___list_recommendations Get prioritized mitigation recommendations aws___get_recommendation Get full remediation specification Proactive evaluations Tool Description aws___start_evaluation Start an evaluation to find preventive recommendations aws___list_goals List evaluation goals and criteria Real-time chat Tool Description aws___create_chat Start a real-time chat session with AWS DevOps Agent aws___list_chats List recent chat sessions aws___send_message Send a message 
and get a streamed response Cross-Agent Investigation Workflow With the AWS MCP server connected, SRE Agent can run parallel investigations across both clouds. Here's how the cross-agent workflow works: Start an AWS investigation: Ask SRE Agent to create an AWS DevOps Agent investigation for the AWS-side symptoms Investigate Azure in parallel: While the AWS investigation runs (5-8 minutes), SRE Agent uses its native tools to check Azure Monitor, Log Analytics, and resource health Read AWS results: When the investigation completes, SRE Agent reads the journal records and recommendations Correlate findings: SRE Agent combines both sets of findings into a single root cause analysis with remediation steps for both clouds Common cross-cloud scenarios: Azure app calling AWS services: Investigate Azure Function errors that correlate with AWS API failures Hybrid deployments: Check AWS EKS clusters alongside Azure AKS clusters during multi-cloud outages Data pipeline issues: Trace data flow across Azure Event Hubs and AWS Kinesis or SQS Agent-to-agent investigation: Start an AWS DevOps Agent investigation for the AWS side while Azure SRE Agent checks Azure resources in parallel Architecture The integration uses a stdio proxy architecture. SRE Agent spawns the proxy as a child process, and the proxy forwards requests to the AWS MCP endpoint: Azure SRE Agent | | stdio (local process) v mcp-proxy-for-aws (spawned via uvx) | | Authenticated HTTPS requests v AWS MCP Server (aws-mcp.<region>.api.aws) | |--- Authenticated AWS API calls --> AWS Services | (EC2, S3, CloudWatch, EKS, Lambda, etc.) 
| '--- DevOps Agent API calls ------> AWS DevOps Agent |-- AgentSpaces (workspaces) |-- Investigations (async root cause analysis) |-- Recommendations (remediation specs) '-- Chat sessions (real-time interaction) Troubleshooting Authentication and connectivity issues Error Cause Solution 403 Forbidden IAM user lacks MCP permissions Add aws-mcp:InvokeMcp , aws-mcp:CallReadOnlyTool to the IAM policy 401 Unauthorized Invalid or expired AWS credentials Rotate access keys and update the connector environment variables Proxy fails to start uvx not installed or not on PATH Install uv on the SRE Agent host Connection timeout Proxy cannot reach the AWS MCP endpoint Verify outbound HTTPS (port 443) is allowed to aws-mcp.<region>.api.aws Connector added but tools not available MCP connections are initialized at agent startup Redeploy or restart the agent service from the Azure portal Slow first connection uvx downloads ~89 dependencies on first run Wait up to 30 seconds for the initial connection API and permission issues Error Cause Solution AccessDenied on API call IAM user lacks the service-specific permission Add the required IAM action (e.g., ec2:DescribeInstances ) to the user's policy CallReadWriteTool denied Write permission not granted Add aws-mcp:CallReadWriteTool to the IAM policy Wrong region data Proxy configured for a different region Update the AWS_REGION metadata in the connector arguments, or specify the region in your query API not found Newly released or unsupported API Use aws___suggest_aws_commands to find the correct API name Verify the connection Test that the proxy can authenticate by opening a new chat session and asking: What AWS regions are available? If the agent returns a list of regions, the connection is working. If you see authentication errors, verify the IAM credentials and permissions from Step 1. 
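When a 403 Forbidden or a CallReadWriteTool denial appears, one quick check is whether the policy from Step 1 actually names the MCP actions. Below is a minimal sketch in Python, assuming you have the policy JSON at hand. It matches exact action names only and does not evaluate wildcards, Deny statements, or conditions the way IAM does; the helper names are illustrative.

```python
import json

# Action names taken from the required permissions summary in Step 1.
REQUIRED_ACTIONS = {"aws-mcp:InvokeMcp", "aws-mcp:CallReadOnlyTool"}
OPTIONAL_ACTIONS = {"aws-mcp:CallReadWriteTool"}


def allowed_actions(policy: dict) -> set:
    """Collect action names from Allow statements (exact names, no wildcard expansion)."""
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a policy may hold a single statement object
        statements = [statements]
    actions = set()
    for stmt in statements:
        if stmt.get("Effect") != "Allow":
            continue
        action = stmt.get("Action", [])
        actions.update([action] if isinstance(action, str) else action)
    return actions


def check_policy(policy: dict) -> dict:
    """Report which required MCP actions are missing and whether writes are enabled."""
    granted = allowed_actions(policy)
    return {
        "missing_required": sorted(REQUIRED_ACTIONS - granted),
        "write_enabled": OPTIONAL_ACTIONS <= granted,
    }


if __name__ == "__main__":
    read_only = json.loads("""
    {"Version": "2012-10-17",
     "Statement": [{"Effect": "Allow",
                    "Action": ["aws-mcp:InvokeMcp", "aws-mcp:CallReadOnlyTool"],
                    "Resource": "*"}]}
    """)
    print(check_policy(read_only))
```

An empty `missing_required` list rules out the first row of the table above; `write_enabled` being false explains a CallReadWriteTool denial.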
Re-authorize the integration If you encounter persistent authentication issues: Navigate to the IAM console Select the user created in Step 1 Navigate to Security credentials > Access keys Deactivate or delete the old access key Create a new access key Update the connector environment variables in the SRE Agent portal with the new credentials Related content AWS MCP Server documentation MCP Proxy for AWS on GitHub AWS MCP Server tools reference AWS DevOps Agent documentation AWS DevOps Agent GA announcement AWS IAM documentation
Azure SRE Agent for Azure Monitor Alerts: Reduce Alert Fatigue, Investigate What Matters
The Alert Problem Organizations running Azure Monitor tend to land in one of two situations: Alert fatigue has set in. Alert rules tend to grow over time — a CPU threshold from two years ago, a health probe check from a migration, a disk alert from an outage that never got cleaned up. These rules fire regularly, most auto-resolve, and nobody investigates them. But buried in that noise are real incidents that go unnoticed until they escalate. Teams respond, but the effort is repetitive. Engineers triage the same alerts repeatedly — running the same diagnostic queries, confirming the same "transient spike, no action needed" conclusion. They know the rule is noisy, but fixing it in Azure Monitor requires data they don't have readily available: What should the threshold be? What's the auto-resolution rate? Is it safe to change? So the noisy rule stays, and the manual toil continues. Both situations share the same gap: there's no intelligent layer between Azure Monitor and the team. Azure SRE Agent fills that gap — it receives alert fires in real time, investigates them automatically, consolidates noisy ones, and surfaces the data your team needs to improve the rules at the source. Here's how to set it up. 1. Intelligent Alert Handling: Cooldown and Response Plan Configuration 1.a. Alert Reinvestigation Cooldown The most impactful configuration for Azure Monitor alerts is the new reinvestigation cooldown. This is a per-response-plan setting that controls how the agent handles repeated fires of the same alert rule. When an alert rule fires and the agent already has an active thread for that rule, it merges the new fire into the existing thread — no new investigation, no duplicate work. What makes this especially useful: if the previous thread was resolved or closed within the cooldown window, the agent reopens it and appends the new fire rather than starting a fresh investigation. 
This catches the common "it fired, we resolved it, it fired again 30 minutes later" pattern that generates the most duplicate effort.

To configure it: Navigate to your AzMonitor response plan and look for the "Alert reinvestigation cooldown" section in the Save step. It's enabled by default with a 3-hour window. That default was chosen because most noisy alert rules re-fire within a 1–3 hour cycle, so the window is broad enough to catch recurring patterns while short enough that a genuinely new issue several hours later still gets a fresh investigation. To disable the cooldown entirely (for critical alerts where every fire demands a fresh investigation), uncheck the merge toggle.

You can adjust the window between 1 and 24 hours depending on the alert pattern:

| Alert Pattern | Recommended Window |
|-------|----------|
| Frequent polling-based alerts (health probes, heartbeats) | 1–2 hours |
| Recurring issues tied to daily batch jobs or deploy cycles | 6–12 hours |
| Intermittent failures with unpredictable recurrence | 12–24 hours |
| Critical alerts where every fire demands a fresh look | Disable the cooldown entirely |

1.b. Segmenting Alerts with Response Plans

The cooldown works best when paired with tiered response plans that route alerts by severity and title keyword. Rather than one catch-all plan for all alert types, create separate plans that match the right investigation depth to the right alerts.

Critical alerts (Sev0–1, titles containing "failover", "security", "data loss"): disable cooldown. Every fire gets a fresh investigation because a repeat fire here likely means the first remediation didn't hold.

Operational alerts (Sev2, titles containing "high CPU", "memory pressure", "latency"): set a 6-hour cooldown. These are real issues, but recurring fires within a few hours are almost always the same root cause. The agent consolidates them into one thread while still giving a genuinely new occurrence later in the day a fresh look.
Low-priority alerts (Sev3–4, titles containing "health probe", "availability test"): set a short 1-hour cooldown. These rarely require deep investigation. The agent captures context without spending effort on redundant analysis.

Informational alerts: don't create a response plan at all. These are telemetry, not incidents.

This tiering works regardless of which agent mode (Autonomous or Review) your team uses. The value comes from the cooldown and severity segmentation; agent mode is a separate decision based on your team's comfort level with autonomous remediation.

To see the difference this makes in practice: we deployed a web app with Azure Monitor alert rules and induced real failures. Azure Monitor fired 9 alerts across three rule types over a few hours. The agent consolidated them based on each response plan's cooldown:

| Alert Rule | Response Plan | Merge Setting | AzMon Fires | Agent Threads | Total Alerts (in thread) | What Happened |
|-------|----------|----------|----------|----------|----------|----------|
| High Response Time (Sev3) | low-priority-alerts | Merge ON, 4h cooldown | 4 | 1 | 4 | All 4 fires merged into a single thread: the agent investigated once and appended recurring fires |
| HTTP 5xx Errors (Sev2) | critical-alerts-no-merge | Merge OFF | 3 | 3 | 1 each | Each fire created its own investigation, which is appropriate for critical alerts where every occurrence matters |
| High CPU (Sev2) | operational-alerts | Merge ON, 1h cooldown | 2 | 2 | 1 each | Fires were more than 1 hour apart (resolved at 12:05, re-fired at 3:37), outside the cooldown window, so the agent correctly treated them as separate incidents |

The key insight: the same 9 Azure Monitor alerts produced different agent behavior depending on the response plan configuration. The High Response Time rule demonstrates the merge path saving 3 redundant investigations. The HTTP 5xx rule shows merge disabled for critical alerts. And the High CPU rule shows what happens when the cooldown window is too short for the alert's recurrence pattern, which is a signal to increase the window.

2.
Proactive Noise Monitoring: Let the Agent Analyze Its Own Patterns

Handling alerts intelligently is the first step. The next is having the agent proactively surface insights about your alert landscape so your team can improve the rules at the source. This is the data that teams in the second situation from the intro (responding, but repetitively) are missing.

2.a. Weekly Alert Hygiene Report

Create a weekly scheduled task with instructions like:

Analyze all Azure Monitor alert threads from the past 7 days. For each alert rule that fired more than 3 times, produce a ranked report covering:
High Auto-Resolution Rules: Rules with high auto-resolution rates. Recommend threshold changes or suppression windows.
Rules with Recurring Root Causes: Rules where the same root cause recurs. Recommend permanent remediation actions.
Miscategorized Severity: Rules where investigation concludes low impact but the alert is Sev1/Sev2. Recommend severity adjustment.
Cost Summary: Estimated effort consumed per alert rule this week.

This creates a compounding feedback loop. Week over week, your team has a concrete, data-backed list of which alert rules to adjust in Azure Monitor, complete with specific recommendations. The data that was too time-consuming to gather manually is now generated automatically.

2.b. Monthly Threshold Audit

For a deeper analysis, schedule a monthly task:

Audit Azure Monitor alert rules for this agent's subscriptions. For each rule:
Query the rule's metric history over 30 days
Compare current threshold vs. actual P50, P90, and P99 values
Flag rules with threshold below P50 (always firing) or above P99 (never firing)
For high-frequency rules with high auto-resolution, recommend a threshold at P95 to reduce fires while still catching genuine anomalies
Produce: a threshold optimization table, dormant rules (no fires in 30+ days), and specific Azure CLI commands to update each rule.

This is the highest-leverage outcome because it fixes noise at the source.
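The percentile comparison in the monthly audit can be illustrated with a few lines of Python. This is a hedged sketch of the arithmetic only, not what the agent runs: it assumes a "greater than" threshold and a plain list of metric samples already exported from Azure Monitor, and the names `percentiles` and `audit_threshold` are hypothetical.

```python
from statistics import quantiles


def percentiles(samples):
    """Return P50, P90, P95 and P99 of the metric samples."""
    # quantiles(..., n=100) yields 99 cut points; index k-1 is the k-th percentile.
    cuts = quantiles(samples, n=100, method="inclusive")
    return {"p50": cuts[49], "p90": cuts[89], "p95": cuts[94], "p99": cuts[98]}


def audit_threshold(threshold, samples):
    """Classify a 'greater than' alert threshold against the observed distribution."""
    p = percentiles(samples)
    if threshold < p["p50"]:
        finding = "below P50 (always firing)"
    elif threshold > p["p99"]:
        finding = "above P99 (never firing)"
    else:
        finding = "within P50-P99"
    # P95 is the candidate threshold suggested for noisy, auto-resolving rules.
    return {"finding": finding, "suggested_threshold": p["p95"], **p}


if __name__ == "__main__":
    cpu_samples = list(range(1, 101))  # stand-in data, not real telemetry
    for current in (10, 80, 500):
        print(current, audit_threshold(current, cpu_samples)["finding"])
```

For a "less than" rule (for example, available memory) the comparisons would be mirrored; the sketch leaves that case out to stay short.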
A single threshold adjustment on one noisy rule can eliminate hundreds of alert fires per month — permanently. And the agent provides the data and specific commands to make it happen. What This Means for Agent Costs Each alert investigation consumes LLM tokens — for reasoning, querying, and building analysis. Without thoughtful configuration, a high-volume alert pipeline can lead to higher agent costs than expected. The setup described in this post naturally keeps token usage in check: the cooldown prevents redundant investigations, tiered response plans match effort to alert importance, and low-priority alerts get minimal attention. For additional control, you can optionally add a PostToolUse hook that nudges the agent to include time-range filters in Log Analytics queries — preventing large, unbounded result sets from inflating the conversation context. Since this hook uses a simple regex check on the query text rather than an LLM call, it adds zero token cost of its own. Getting Started Connect Azure Monitor as an incident source in your SRE Agent Enable the reinvestigation cooldown on your response plans (the 3-hour default is a sensible starting point) Create tiered response plans — at minimum, separate critical alerts (cooldown disabled) from operational alerts (cooldown 6h) and low-priority alerts (cooldown 1h) Set up a weekly alert hygiene report as a scheduled task to start building visibility into your alert patterns Add the monthly threshold audit once your weekly reports have a few weeks of data Start with the first three — they take a few minutes each and begin working immediately. 
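The cooldown behavior behind steps 2 and 3 can be modeled in a few lines. The sketch below is a simplified model of the behavior described in this post, not the agent's implementation: the `Thread` class, the `handle_fire` function, and the plan names are hypothetical, and the example dates are invented to mirror the High CPU case (resolved at 12:05, re-fired at 3:37, read here as 15:37).

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from typing import Optional

# Cooldown per tier, as suggested in the list above; None means merge is disabled.
TIER_COOLDOWNS = {
    "critical": None,
    "operational": timedelta(hours=6),
    "low-priority": timedelta(hours=1),
}


@dataclass
class Thread:
    rule: str
    resolved_at: Optional[datetime] = None  # None means the thread is still active


def handle_fire(fired_at: datetime, existing: Optional[Thread],
                cooldown: Optional[timedelta]) -> str:
    """Decide what happens to a new fire of an alert rule: 'merge', 'reopen' or 'new'."""
    if cooldown is None or existing is None:
        return "new"      # merge disabled, or no earlier thread for this rule
    if existing.resolved_at is None:
        return "merge"    # active thread: append the fire to it
    if fired_at - existing.resolved_at <= cooldown:
        return "reopen"   # resolved within the window: reopen and append
    return "new"          # outside the window: fresh investigation


if __name__ == "__main__":
    earlier = Thread("High CPU", resolved_at=datetime(2025, 1, 1, 12, 5))
    refire = datetime(2025, 1, 1, 15, 37)
    print(handle_fire(refire, earlier, timedelta(hours=1)))        # 1h window
    print(handle_fire(refire, earlier, TIER_COOLDOWNS["operational"]))
```

With a 1-hour window the re-fire falls outside the cooldown and starts a new investigation; with the 6-hour operational window the same re-fire reopens the earlier thread, which is the adjustment the High CPU example points toward.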
Learn More
Incident Response Overview: How SRE Agent handles incidents across platforms including Azure Monitor
Incident Response Plans: Configuring response plans, filters, severity routing, and cooldown settings
Setting Up a Response Plan: Step-by-step tutorial for creating your first response plan
Scheduled Tasks: Creating weekly and monthly automated reports
Agent Hooks: PostToolUse hooks, command hooks, and governance controls
Monitor Agent Usage: Tracking token usage and agent activity
Getting Started with Incident Response: Connecting Azure Monitor and configuring your first alert pipeline

Plugin Marketplace for Azure SRE Agent: Build Once, Install Anywhere
What's a Plugin? A plugin bundles two things: Skills — Operational knowledge (triage runbooks, policy rules, known issues) the agent reads at runtime to guide its reasoning MCP Connectors — Live integrations to your internal APIs (deployment tracker, cost dashboard, CMDB) the agent can query during an investigation This is the key distinction: a plugin doesn't just tell the agent what your policies are — it gives the agent tools to query your internal systems and apply those policies with real data. A plugin bundles skills and MCP connectors as a single installable unit. The Marketplace Model: Create Once, Install Everywhere The marketplace is a GitHub repository with a marketplace.json manifest. Any team pushes their plugin to the repo. Every SRE Agent in the org can discover it and install it with one click — no need for each team to manually recreate skills and configure connectors. How it works: A specialist team creates a plugin (skills + MCP connector config) and pushes it to the shared GitHub repo Any SRE Agent user browses the marketplace, sees what's available, and clicks Install The plugin's skills and connectors are deployed to that agent instance instantly Contoso runs multiple SRE Agent instances — payments team, platform team, data team. The same marketplace serves all of them. Each team installs exactly the plugins they need. One marketplace, many agents. Teams publish plugins once — every agent in the org can install them. The Scenario: AKS Incident Investigation with Plugins Contoso runs a payment processing service on AKS. 
Three teams have contributed plugins to the company's internal marketplace: Plugin Team Skills MCP Connector AKS Runbooks K8s Platform Team aks-incident-triage, aks-deployment-analysis Deployment Tracker API Cost & Capacity Cloud FinOps Team cost-analysis, capacity-planning Cost Dashboard API Service Catalog SRE Leadership service-ownership-lookup, dependency-impact-analysis CMDB API All three are installed on the payments team's SRE Agent. Let's see what happens when an incident hits. Building and Publishing the Plugins Each team creates their plugin independently and pushes it to the shared marketplace repo. 1. AKS Runbooks (Kubernetes Platform Team) The K8s Platform Team packages their triage procedures, node pool naming conventions, PDB policies, known issues registry, and deployment gates. Skills: aks-incident-triage — Per-symptom triage procedures (OOMKill, NodePressure, CrashLoop), PDB-first policy checks, Tier-0 escalation rules, and a known issues registry aks-deployment-analysis — Correlates incidents with recent deployments, surfaces resource spec diffs and gate violations, provides a rollback decision tree MCP Connector: contoso-deploy-tracker — Exposes get_deployments: recent deployments by namespace with deployer, image versions, resource diffs, and gate status 2. Cost & Capacity (Cloud FinOps Team) The FinOps team packages their SKU approval matrix, team budget allocations, chargeback model, and scaling governance. Skills: cost-analysis — Team budget tiers, cost dashboard API usage, incident cost impact calculations capacity-planning — "Scale-out before scale-up" rule (CCP-001), SKU approval matrix (B-series = team lead, D-series = director, E/N-series = VP/CTO), auto-scale thresholds MCP Connector: contoso-cost-dashboard — Exposes get_team_spend (budget, burn rate) and get_resources_cost (resource-level cost with utilization) 3. 
Service Catalog (SRE Leadership) SRE Leadership packages service ownership, SLA tiers, escalation paths, and the dependency graph. Skills: service-ownership-lookup — Maps namespaces to owning teams, on-call contacts, SLA tiers (Tier-0 through Tier-3), escalation policies dependency-impact-analysis — Dependency classification (hard/soft/async), blast radius assessment, security implications MCP Connector: contoso-cmdb — Exposes get_service_info (ownership, SLA), get_service_dependencies, and get_blast_radius The Marketplace Manifest All three plugins are described in a single marketplace.json that the SRE Agent discovers: { "name": "Contoso SRE Plugins", "description": "Internal plugin marketplace for Contoso SRE teams", "version": "1.0.0", "plugins": [ { "id": "aks-runbooks", "name": "AKS Runbooks", "description": "Kubernetes Platform Team's operational runbooks and deployment correlation", "author": "K8s Platform Team", "source": "./aks-runbooks", "category": "Operations" }, { "id": "cost-capacity", "name": "Cost & Capacity", "description": "FinOps team's cost governance, SKU approval matrix, and capacity planning", "author": "Cloud FinOps Team", "source": "./cost-capacity", "category": "Cost Management" }, { "id": "service-catalog", "name": "Service Catalog", "description": "Service ownership, SLA tiers, dependency graphs, and escalation paths", "author": "SRE Leadership", "source": "./service-catalog", "category": "Governance" } ] } The internal plugin marketplace on GitHub. Each directory is a plugin contributed by a different team. The marketplace.json manifest tells the SRE Agent what's available. Registering the Marketplace and Installing Plugins Step 1: Add the Marketplace In the SRE Agent, navigate to Builder → Plugins → Browse and click "Add Marketplace". Enter the GitHub repository path (contoso/sre-agent-plugins) and click Add. The agent fetches marketplace.json and displays the marketplace card with all three plugins discovered. 
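Before pushing a manifest to the shared repo, a team might sanity-check it locally. The sketch below is one way to do that in Python. It assumes that every field appearing in the example manifest above is expected on each plugin entry; the post does not state which fields the SRE Agent actually requires, so treat the field list and the `validate_manifest` helper as illustrative.

```python
import json

# Fields present on every plugin entry in the example manifest (assumed, not a spec).
PLUGIN_FIELDS = ("id", "name", "description", "author", "source", "category")


def validate_manifest(manifest: dict) -> list:
    """Return a list of human-readable problems; an empty list means none were found."""
    problems = []
    for key in ("name", "plugins"):
        if key not in manifest:
            problems.append(f"missing top-level field: {key}")
    seen_ids = set()
    for index, plugin in enumerate(manifest.get("plugins", [])):
        for field in PLUGIN_FIELDS:
            if not plugin.get(field):
                problems.append(f"plugins[{index}] missing field: {field}")
        plugin_id = plugin.get("id")
        if plugin_id is not None:
            if plugin_id in seen_ids:
                problems.append(f"duplicate plugin id: {plugin_id}")
            seen_ids.add(plugin_id)
    return problems


if __name__ == "__main__":
    manifest = json.loads("""
    {"name": "Contoso SRE Plugins",
     "plugins": [{"id": "aks-runbooks", "name": "AKS Runbooks",
                  "description": "Operational runbooks", "author": "K8s Platform Team",
                  "source": "./aks-runbooks", "category": "Operations"}]}
    """)
    print(validate_manifest(manifest) or "no problems found")
```

Running a check like this in the repo's pull request pipeline would catch a missing `source` path or a duplicated `id` before other teams try to install the plugin.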
Adding the internal marketplace: just point to the GitHub repo.

Step 2: Browse the Catalog

The Browse tab now shows the Contoso SRE Plugins marketplace. Clicking into it reveals three plugin cards, one from each contributing team, with descriptions, skill counts, and connector details.

Three plugins from three teams. Each one brings skills (organizational knowledge) and an MCP connector (internal API access).

Step 3: Install All Three Plugins

Click into each plugin to review what it installs (skills and MCP connectors), then click "Install Plugin" for each one. After installing all three:
6 skills loaded (2 per plugin): organizational knowledge documents the agent reads at runtime
3 MCP connectors registered: internal API integrations the agent can call as tools

Each plugin clearly shows what it installs, skills and connectors, before you commit.

All three plugins installed: each card shows its "Installed" status, the authoring team, and exactly what it brings (2 skills, 1 connector). The green border and badge make installed plugins immediately recognizable.

The Agent in Action

Now let's ask the question: "Pods are crashing in payments-prod on aks-payments-prod-eastus2 in sre-marketplace-demo-rg. Investigate and give me a full incident report."

The agent investigates, combining its native Kubernetes capabilities with the organizational context from all three plugins.

The same strong Kubernetes diagnosis, now enriched with organizational context. Deployment correlation, policy violations, cost governance, blast radius, and escalation paths are all layered on top of the agent's native investigation.

Why This Matters: Different Teams, One Agent

The K8s Platform Team writes triage procedures and known issues. The FinOps Team writes budget governance and SKU rules.
SRE Leadership defines service ownership and escalation paths. Each team packages their domain expertise independently. The SRE Agent combines all of it at runtime, producing a response no single team could have written alone, drawn from three internal systems and three bodies of institutional knowledge. This is how organizational knowledge scales: composable plugins that the agent reasons with in real time, not longer wikis that nobody reads.

Learn More
Plugin Marketplace overview: How the marketplace works, manifest formats, and MCP config support
Tutorial: Install a marketplace plugin: Step-by-step walkthrough of adding a marketplace, browsing plugins, and importing skills
Skills in Azure SRE Agent: How skills work, how the agent loads them at runtime, and how they relate to custom agents and knowledge files
MCP connectors and tools: Connecting your agent to external systems via the Model Context Protocol
Tutorial: Set up an MCP connector: Configuring remote and local MCP servers as agent connectors

Get started with the New Relic MCP server in Azure SRE Agent
Overview

The New Relic MCP server is a cloud-hosted bridge between your New Relic account and Azure SRE Agent. Once configured, it enables real-time interaction with your observability data (APM traces, distributed traces, logs, metrics, alerts, incidents, dashboards, and entities) through natural language. All actions respect your existing New Relic role and capability assignments.

The server uses Streamable HTTP transport with a single Api-Key header for authentication. Azure SRE Agent connects directly to the New Relic-hosted endpoint (https://mcp.newrelic.com/mcp/); no npm packages, local proxies, or container deployments are required. The SRE Agent portal includes a dedicated New Relic connector type that pre-populates the endpoint URL and the Api-Key header for streamlined setup.

Key capabilities

| Area | Capabilities |
|------|--------------|
| NRQL | Run NRQL queries against any event type, build queries from natural language, explain existing NRQL |
| APM | Search applications, view golden signals (throughput, errors, latency), inspect transactions |
| Distributed tracing | Find slow or errored traces, fetch full trace details by trace ID |
| Logs | Search and filter logs by attribute, service, host, or time range |
| Infrastructure | Query host metrics (CPU, memory, disk, network), inspect Kubernetes clusters and containers |
| Alerts & Incidents | List alert policies and conditions, search open and closed incidents, view incident timelines |
| Entities | Search entities (services, hosts, databases, browsers, mobile apps) by name, tag, or domain |
| Dashboards | Search and retrieve dashboards by name, account, or tag |
| Synthetics | List synthetic monitors, view recent check results and failures |
| Errors Inbox | Query error groups, occurrences, and stack traces |
| Change tracking | Inspect deployment markers and configuration changes correlated with incidents |

Note: This is the official New Relic-hosted MCP server (Preview).
Tool availability depends on your New Relic plan, the data your account ingests, and your role-based access control assignments.

Prerequisites

- Azure SRE Agent resource deployed in Azure
- New Relic account with an active subscription (Free, Standard, Pro, or Enterprise)
- New Relic user with appropriate role and capability assignments
- User API key created from your New Relic account (NRAK-…)
- (Optional) Your New Relic account ID for NRQL queries that explicitly target a specific account

Step 1: Get your New Relic credentials

The New Relic MCP server requires a User API key, which authenticates the user and inherits that user's role-based permissions. The key is created in the New Relic UI.

Create a User API key

1. Sign in to New Relic One
2. Select your user menu (your initials in the bottom-left corner)
3. Select API keys (Direct URL: https://one.newrelic.com/api-keys)
4. Select Create a key in the top-right corner
5. Configure the key:
   - Key type: Select User
   - Name: Enter a descriptive name (e.g., sre-agent-mcp)
6. Select Create a key
7. Copy the key value (starts with NRAK-…). It is shown only once; if lost, you must create a new key.

Tip: User API keys inherit the role and capabilities of the user who created them. For production use, create the key from a dedicated service user rather than a personal account so the integration continues to work if team members leave the organization.

Find your account ID

You don't need the account ID for the connector itself (the User API key carries the user's default account). It's still useful for NRQL queries that explicitly scope to a specific account when the user has access to multiple.
1. From any page in New Relic One, select the account picker in the top navigation
2. Note the numeric Account ID displayed next to your account name
3. Alternatively, navigate to Administration > Account settings (Direct URL: https://one.newrelic.com/admin-portal/organizations/organization-detail)

Note: If you have access to multiple accounts under one organization, the User API key uses the user's default account. Tools that accept an explicit accountId argument can target other accounts the user has access to.

Required role and capabilities

The User API key inherits the permissions of the underlying user. Configure the user's role to grant the capabilities your agent needs:

| Capability | Description | Required? |
|------------|-------------|-----------|
| Logs.read | Read log data | Recommended |
| APM.read | Read APM application data, transactions, and traces | Recommended |
| Infrastructure.read | Read infrastructure metrics, hosts, and Kubernetes data | Recommended |
| Alerts.read | Read alert policies, conditions, and incidents | Recommended |
| Dashboards.read | Read dashboards | Optional |
| Synthetics.read | Read synthetic monitor results | Optional |
| Entities.read | Search and inspect entities | Required |
| NRQL.execute | Run NRQL queries | Required |

Important: Apply the principle of least privilege. Grant read-only capabilities unless the agent needs to acknowledge incidents, mute conditions, or modify other resources. Avoid using a full-admin user account.

Step 2: Add the MCP connector

Connect the New Relic MCP server to your SRE Agent using the portal. The portal includes a dedicated New Relic connector type that pre-populates the endpoint URL and the Api-Key header.

Using the SRE Agent portal

1. In the SRE Agent portal, open your agent
2. In the left navigation, expand Builder and select Connectors
3. Select + Add connector in the toolbar
4. In the Choose a connector step, scroll to the Additional connectors section under Telemetry and select the New Relic card. The card description reads "Connect to New Relic for application performance monitoring and analytics."
Select Next. In the Set up New Relic connector step, configure the fields: Field Value Name newrelic-mcp (any unique name for this connector) URL https://mcp.newrelic.com/mcp/ (pre-populated) Authentication method Custom headers (pre-selected) Key Api-Key (pre-populated) Value Your New Relic User API key ( NRAK-… ) Select Next to advance to Review + test connection. The portal validates the endpoint and credentials. In the Select tools step, choose which New Relic tools to expose to the agent (or leave the defaults selected). Select Add connector to save. Note: > The New Relic connector type pre-populates the endpoint URL ( https://mcp.newrelic.com/mcp/ ), sets the authentication method to Custom headers, and adds the Api-Key header key automatically. You only need to paste in the User API key value ( NRAK-… ). Tip: > The single https://mcp.newrelic.com/mcp/ endpoint serves accounts in all New Relic data centers. The User API key includes the account context required to route the request to the correct region. Once the connector shows Connected status on the Connectors list, the New Relic MCP tools are automatically available to your agent. Step 3: Create a New Relic subagent Create a specialized subagent to give the AI focused New Relic observability expertise and better prompt responses. Navigate to Builder > Subagents Select Add subagent Paste the following YAML configuration: api_version: azuresre.ai/v1 kind: AgentConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: NewRelicObservabilityExpert display_name: New Relic Observability Expert system_prompt: | You are a New Relic observability expert with access to NRQL, APM data, distributed traces, logs, infrastructure metrics, alerts, incidents, entities, dashboards, and synthetic monitors via the New Relic MCP server. 
## Capabilities ### NRQL (New Relic Query Language) - Run NRQL queries with `run_nrql_query` against any event type (Transaction, TransactionError, Log, Metric, SystemSample, etc.) - Generate NRQL from natural language with `create_nrql_query` - Explain existing NRQL with `explain_nrql_query` - Always include a `SINCE` clause to bound the time range ### APM (Application Performance Monitoring) - Search APM applications with `search_apm_applications` - Get golden signals (throughput, errors, latency, Apdex) with `get_apm_golden_signals` - Inspect slow or errored transactions with NRQL on the Transaction event ### Distributed Tracing - Search traces with `search_traces` filtered by service, error, or duration - Fetch a full trace with `get_trace` using the trace ID - Identify the slowest span and the service that owns it ### Logs - Search logs with `search_logs` using attribute filters and time ranges - Use NRQL aggregation on the Log event for grouping and counts - Correlate logs with traces using `trace.id` and `span.id` attributes ### Infrastructure - List hosts with `search_hosts` and filter by tag or status - Query host metrics (CPU, memory, disk, network) with NRQL on SystemSample, ProcessSample, NetworkSample, StorageSample - For Kubernetes, use the K8sClusterSample, K8sPodSample, K8sNodeSample event types ### Alerts & Incidents - List alert policies with `search_alert_policies` - List alert conditions with `search_alert_conditions` - Search open and closed incidents with `search_incidents` - Get full incident details and timeline with `get_incident` ### Entities - Search entities with `search_entities` across services, hosts, databases, browsers, mobile apps, and synthetic monitors - Get entity details with `get_entity` including tags, golden metrics, and related entities ### Dashboards - Search dashboards with `search_dashboards` - Retrieve dashboard widgets and embedded NRQL with `get_dashboard` ### Synthetics - List synthetic monitors with 
`search_synthetic_monitors` - Inspect recent check results with `get_synthetic_check_results` ### Errors Inbox - Query error groups with `search_error_groups` - Get occurrences and stack traces with `get_error_group` ## Best Practices When investigating incidents: - Start with `search_incidents` or `get_incident` for context - Check related alert conditions with `search_alert_conditions` - Pull APM golden signals with `get_apm_golden_signals` for the affected service - Use `search_traces` to find slow or errored requests - Correlate with `search_logs` filtered by service and time range - Check `search_hosts` and SystemSample for infrastructure-level problems When writing NRQL: - Always include `SINCE` (e.g., `SINCE 30 minutes ago`) - Use `FACET` to group results by service, host, or status code - Use `TIMESERIES` to plot trends over time - Use `LIMIT` to bound result size - Filter by `appName`, `service.name`, or `host` to scope queries When handling errors: - If access is denied, explain which capability the user is missing - If no data is returned, suggest broadening the time range - For multi-account organizations, mention the `accountId` parameter - For NRQL syntax errors, use `explain_nrql_query` to validate mcp_connectors: - newrelic-mcp handoffs: [] Select Save Click to edit the created sub-agent and scroll down to Tools to add the New Relic tools Step 4: Add a New Relic skill (optional) Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a New Relic skill to give your agent expertise in NRQL, APM analysis, and incident investigation workflows. 
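The NRQL rules in the subagent prompt above (always include `SINCE`, bound results with `LIMIT`) can also be checked mechanically before a query is handed to `run_nrql_query`. The following is a minimal sketch in Python and is not part of New Relic's tooling or Azure SRE Agent: `check_nrql` is a hypothetical helper, and its keyword matching is a simplifying assumption that ignores cases such as keywords appearing inside string literals.

```python
import re

# Hypothetical lint helper for illustration only; this is not a
# New Relic or Azure SRE Agent API.
_SINCE = re.compile(r"\bSINCE\b", re.IGNORECASE)
_LIMIT = re.compile(r"\bLIMIT\b", re.IGNORECASE)
_FACET = re.compile(r"\bFACET\b", re.IGNORECASE)


def check_nrql(query: str) -> list[str]:
    """Return guideline warnings for an NRQL string (empty list if none)."""
    warnings = []
    # Guideline from the prompt: always state the time range explicitly.
    if not _SINCE.search(query):
        warnings.append("no SINCE clause: add an explicit time range")
    # Guideline from the prompt: use LIMIT to bound grouped results.
    if _FACET.search(query) and not _LIMIT.search(query):
        warnings.append("FACET without LIMIT: consider an explicit LIMIT")
    return warnings


if __name__ == "__main__":
    # First query omits both SINCE and LIMIT; second follows both guidelines.
    print(check_nrql("SELECT count(*) FROM TransactionError FACET appName"))
    print(check_nrql(
        "SELECT count(*) FROM Log SINCE 1 hour ago FACET message LIMIT 10"
    ))
```

A check like this only flags style issues from the prompt's guidelines; it does not validate NRQL syntax, for which the prompt already points to `explain_nrql_query`.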
Navigate to Builder > Skills Select Add skill Paste the following skill configuration: api_version: azuresre.ai/v1 kind: SkillConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: newrelic_observability display_name: New Relic Observability description: | Expertise in New Relic's observability platform including NRQL, APM, distributed tracing, logs, infrastructure metrics, alerts, incidents, entities, dashboards, and synthetics. Use for running NRQL queries, investigating slow services, analyzing traces, searching logs, inspecting incidents, and navigating New Relic data via the New Relic MCP server. instructions: | ## Overview New Relic is a cloud-scale observability platform for APM, distributed tracing, logs, metrics, infrastructure, browser, mobile, and synthetics. NRQL (New Relic Query Language) is the unified query language across all event types. **Authentication:** A single `Api-Key` header containing a User API key (`NRAK-…`). The key inherits the role and capabilities of the user who created it, and the user's default account scopes the queries. **Endpoint:** `https://mcp.newrelic.com/mcp/` — a single endpoint serves accounts in all New Relic data centers (US, EU, FedRAMP). ## Writing NRQL queries NRQL syntax is similar to SQL but optimized for time-series event data. 
**Common patterns:** ```sql -- Errors per service in the last hour SELECT count(*) FROM TransactionError SINCE 1 hour ago FACET appName -- p95 latency by transaction SELECT percentile(duration, 95) FROM Transaction WHERE appName = 'checkout-api' SINCE 30 minutes ago FACET name TIMESERIES -- Top error messages from logs SELECT count(*) FROM Log WHERE level = 'ERROR' AND service.name = 'payment-api' SINCE 1 hour ago FACET message LIMIT 10 -- Host CPU utilization SELECT average(cpuPercent) FROM SystemSample WHERE hostname LIKE 'web-prod-%' SINCE 30 minutes ago TIMESERIES -- Apdex by application SELECT apdex(duration, t: 0.5) FROM Transaction SINCE 1 hour ago FACET appName ``` Always include `SINCE`. Use `FACET` for grouping, `TIMESERIES` for trends, `LIMIT` to bound results, and `WHERE` to filter. ## APM investigation workflow 1. `search_apm_applications` — find the affected service 2. `get_apm_golden_signals` — pull throughput, error rate, response time, Apdex 3. NRQL on Transaction — drill into slow or failing transactions 4. NRQL on TransactionError — identify error classes and frequencies 5. `search_traces` — find specific slow or errored traces 6. `get_trace` — inspect span timing and errors 7. `search_logs` — correlate logs by `trace.id` or `service.name` ## Distributed trace investigation Use `search_traces` for span-level queries and `get_trace` for full traces. **Workflow:** 1. Search for slow or errored traces with `search_traces` 2. Get the full trace with `get_trace` using the trace ID 3. Identify the bottleneck span (longest duration or error) 4. Note the owning service and operation name 5. Correlate with `search_logs` using the `trace.id` attribute 6. Check `get_apm_golden_signals` for the bottleneck service ## Log analysis Use `search_logs` for filtered retrieval and NRQL on the Log event for aggregation. 
**Common log filters:** ``` # Errors from a specific service service.name:payment-api level:ERROR # Logs containing a trace ID trace.id:abc123def456 # Logs from a Kubernetes pod k8s.namespace.name:production k8s.deployment.name:checkout # HTTP 5xx responses http.statusCode:>=500 ``` **Aggregations on the Log event:** ```sql SELECT count(*) FROM Log WHERE level = 'ERROR' SINCE 1 hour ago FACET service.name LIMIT 20 ``` ## Infrastructure investigation New Relic infrastructure data lives in event types: | Event type | Purpose | |------------|---------| | `SystemSample` | Host CPU, memory, disk, load average | | `ProcessSample` | Per-process CPU and memory usage | | `NetworkSample` | Network interface throughput, errors, drops | | `StorageSample` | Disk capacity and I/O | | `K8sClusterSample` | Kubernetes cluster-level metrics | | `K8sNodeSample` | Kubernetes node metrics | | `K8sPodSample` | Pod-level CPU, memory, restart counts | | `K8sContainerSample` | Container-level metrics | ## Incident investigation workflow For structured incident investigation: 1. `search_incidents` — find recent or active incidents 2. `get_incident` — get full incident details and timeline 3. `search_alert_conditions` — check which conditions triggered 4. `search_logs` — search for errors around the incident time 5. NRQL on Transaction/TransactionError — check service health 6. `search_traces` — inspect request traces for latency or errors 7. `search_hosts` — verify infrastructure health 8. Check change tracking markers near the incident start time ## Working with entities New Relic entities include services (APM, browser, mobile), hosts, databases, queues, synthetic monitors, dashboards, and workloads. 
- Use `search_entities` to find entities by name, domain, or tag - Use `get_entity` to retrieve tags, golden metrics, and related entities - Filter by `domain` (APM, INFRA, BROWSER, MOBILE, SYNTH, EXT) to narrow ## Troubleshooting | Issue | Solution | |-------|----------| | 401 Unauthorized | Verify the User API key is valid and starts with `NRAK-` | | 403 Forbidden | The user's role lacks the required capability—check role assignments | | No data returned | Confirm the account ID is correct and contains data for the queried event type | | Wrong region | Ensure the connector URL matches your data center (US, EU, or FedRAMP) | | NRQL syntax error | Use `explain_nrql_query` to validate before running | | Cross-account query | Pass `accountId` explicitly if the user has access to multiple accounts | mcp_connectors: - newrelic-mcp Select Save Reference the skill in your subagent Update your subagent configuration to include the skill: spec: name: NewRelicObservabilityExpert skills: - newrelic_observability mcp_connectors: - newrelic-mcp Step 5: Test the integration Open a new chat session with your SRE Agent Try these example prompts: NRQL queries Run NRQL: SELECT count(*) FROM TransactionError SINCE 1 hour ago FACET appName Show me the p95 response time for the checkout-api over the last 30 minutes Build me an NRQL query to find the top 10 slowest transactions in the payment-api Explain this NRQL query: SELECT rate(count(*), 1 minute) FROM Transaction FACET appName APM and traces Show me the golden signals for the payment-api service in the last hour Find the slowest distributed traces for the checkout-api in the last 30 minutes Get the full trace details for trace ID abc123def456 What is the error rate trend for the api-gateway over the last 4 hours? 
Log analysis Search for ERROR logs from the payment-api service in the last hour Count log errors grouped by service over the last 24 hours Find logs containing trace.id abc123def456 Show me HTTP 5xx responses across all services in the last 30 minutes Incident investigation Show me all open incidents in the last 24 hours Get details for incident 12345 including the timeline and triggered condition What alert conditions triggered during the most recent incident? Correlate the latest incident with related logs and traces Infrastructure What is the CPU utilization across all production hosts in the last 30 minutes? Show me memory pressure on web-prod-01 over the last hour Find Kubernetes pods that have restarted in the last 24 hours List hosts tagged with env:production and team:platform Entities and dashboards Search for all APM services tagged with team:checkout Get details for the entity named payment-api including its tags and related services List all dashboards related to "Checkout" or "Payments" What synthetic monitors are currently failing? Available tools The New Relic MCP server exposes the following core tools. Tool availability depends on your New Relic plan and the user's role capabilities. 
| Tool | Description |
|------|-------------|
| run_nrql_query | Execute an NRQL query and return results |
| create_nrql_query | Generate an NRQL query from a natural language prompt |
| explain_nrql_query | Get a plain English explanation of an NRQL query |
| search_apm_applications | Search APM applications by name, language, or tag |
| get_apm_golden_signals | Get throughput, error rate, response time, and Apdex for an APM service |
| search_traces | Search distributed traces by service, error, or duration |
| get_trace | Fetch a full distributed trace by trace ID |
| search_logs | Search logs by attribute filters and time ranges |
| search_hosts | Search infrastructure hosts by name, tag, or status |
| search_entities | Search entities (services, hosts, dashboards, monitors) across domains |
| get_entity | Get entity details including tags, golden metrics, and related entities |
| search_alert_policies | List alert policies in the account |
| search_alert_conditions | List alert conditions, optionally filtered by policy |
| search_incidents | Search open and closed incidents |
| get_incident | Get full incident details and timeline |
| search_dashboards | Search dashboards by title, account, or tag |
| get_dashboard | Retrieve dashboard widgets and embedded NRQL |
| search_synthetic_monitors | List synthetic monitors |
| get_synthetic_check_results | Get recent check results for a synthetic monitor |
| search_error_groups | Query Errors Inbox for error groups |
| get_error_group | Get occurrences and stack traces for an error group |
| search_change_markers | List deployment markers and configuration changes |

Related content

- New Relic User API keys
- NRQL reference
- NerdGraph API
- New Relic role-based access control
- New Relic regional data centers
- MCP integration overview
- Build a custom subagent

How Microsoft 1ES uses agentic AI to take on security and compliance at scale
Microsoft’s Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade IQ platform. Learn best practices from our engineering teams with real-world lessons, architectural patterns, and operational strategies for pressure-tested solutions in building, operating, and scaling AI apps and agent fleets across the organization. What we do Within Microsoft’s One Engineering System (1ES) organization, teams build and maintain the internal engineering systems that product groups across the company rely on to ship and secure their services. These shared tools and processes support teams responsible for mission-critical products, from modern cloud-native platforms to long-lived legacy applications. Security, compliance, and reliability work is non-negotiable at this scale. But it has to coexist with developer productivity and velocity across thousands of independently owned repositories. The problem: the CVE and compliance treadmill Here’s the loop we kept living: A security or compliance alert arrives, often via automation like Dependabot or a CVE finding. The version gets bumped, or the config gets nudged. CI is green. The PR merges. Production fails or the finding reopens because the fix required code changes beyond a version bump or a config flip. This repeats across repositories, teams, and organizations. And the hard truth is not all vulnerabilities are mechanical version bumps, and not all compliance findings are config tweaks. Many introduce behavioral or security model changes. Automation handles the easy cases but silently fails on the hard ones. A second pattern compounds it: when a service has 30+ open action items spanning OTel audit, identity, secret rotation, and CodeQL findings, just figuring out which ones are quick versus deep can take longer than the fixes themselves. 
Multiply this across Microsoft’s repo footprint and the cost becomes months of engineering time spent on work that doesn’t ship new customer value. But this is exactly the kind of challenge AI was made for: high-speed, high-scale evaluation and judgment calls, coached by human expertise.

Why this is solvable now

In the previous era of software development, an average CVE alert meant hours of developer toil. Three things changed at once. Frontier models like GPT-5.5 and Claude Opus 4.7 can now reason about context, intent, and tradeoffs, not just generate code. Agent runtimes like GitHub Copilot CLI can read repositories, run tools, execute tests, and open pull requests end-to-end. And we’ve started encoding hard-won domain expertise as portable skills, so an agent doesn’t have to re-derive what an expert already knows.

None of these is enough alone. Frontier models without runtimes are just chat. Runtimes without skills hallucinate confidently. Skills without judgment automate the wrong thing. Together, bounded by human–AI partnership patterns that make escalation a first-class behavior, they enable a safer, more disciplined way to tackle judgment-heavy engineering work.

How we approach it: collaborate, don’t automate

The co-creative model

Instead of treating AI as a script executor, we treat agents as collaborators operating within explicit guardrails:

- Agents propose changes based on skills and available context.
- Humans review, approve, and retain final ownership of every change.

Skills over prompts

Agents start cold. They don’t have repo-specific context beyond the invoked skill. A skill captures the exact steps, decisions, and edge cases a human expert would apply to a specific class of problem. Skills are written once as Markdown and loaded only when needed: focused context, improved complexity handling, more predictable behavior.

We author skills with agents too. The same operating model we use for remediation (human owns the decision, agent does the work, signals feed back) is how the skills themselves get written and refined. One of those agents, Ember, is now open-sourced on awesome-copilot.

A real example: the XStream CVE

Some CVEs change things like the default security model, which requires code changes beyond just bumping the dependency version. Take the XStream dependency update. In the previous 1.4.17 version, any class could be deserialized under a default-allow policy. In the latest update, the policy changed to default-deny, meaning we need to make permitted types explicit. Once we find the XStream call sites, we need to fix type permissions after each instantiation and make sure that change propagates from test, to PR, to run. This is the type of judgment-heavy work where naïve automation creates risk and blocks developers from focusing on feature work.

How execution works

- The agent loads the relevant skill for the task at hand.
- If it encounters ambiguity or risk, it stops and escalates rather than guessing.
- The agent goes through the required steps (compile, test, pull request) as explicitly agreed upon in the guidance we provide.
- After each run, the agent emits an Agent Signal: a structured self-assessment of what worked, what was hard, and where the skill fell short. These compound across sessions so the system improves continuously.

Autonomy is great, but trust is far better. Between the CVE context, the skills, and our working agreement with the agent, we’re creating a dynamic where the agent feels empowered to execute until it reaches a point of uncertainty. This cuts down the risk of hallucinations dramatically and scales repeatable, trustworthy execution. The most important issues get surfaced for humans in the loop, where human judgment actually matters.

Closing the loop: dev-side and ops-side

Skills and agents handle the dev-side work: CVE remediation, compliance findings, codebase changes that need judgment.
On the ops side, Azure SRE Agent handles at-scale data analysis and operational toil. Same philosophy on both sides: agents act within explicit guardrails, humans own the decisions that matter, and signals from every run feed back into the system. Then the two sides connect. Every Agent Signal our dev-side skills emit flows into Azure SRE Agent, which analyzes them at scale, identifies where skills are degrading or falling short, opens PRs against the skills themselves to fix the gaps, and sends us a daily skill-health report. The ops-side agent maintains the dev-side agents: agents improving agents, while humans review and merge every change. The same human-in-the-loop discipline that governs a CVE fix governs a skill fix. Impact Across Microsoft, 1ES supports teams working on hundreds of repos at a variety of ages and sizes. Agents enable velocity while skills enable uniqueness which is what helps us scale across such a vast enterprise. Impact of the frontier models, GitHub Copilot, agent skills and agent signals for compliance work. Real engineering time saved We’re finding 18-15 hours of manual work compressed into ~9 hours of agent+skill assisted work – a 50-60% reduction overall, with some compliance work moving from 3-4 hrs manually to 30 min with the agent+skill. What devs told us “Considering I didn’t know anything about any of this, including never having seen the IaC in question, I’d say at least a week’s worth, done in less than 10 prompts.” — Patrick, Senior Engineer “Many times with [compliance], the actual changes are minimal, but reading the docs and knowing what applies to your app can be more time consuming… When you have 30+ action items, you need to go hunting for which one is quick versus time-consuming. This [agent+skills] saves a lot of time.” — Greg, Engineering Manager “The [agent+skills] eliminates most early-phase toil — up to ~90% — but 0% of the last-mile effort. 
The bottleneck shifts entirely to validation and deployment.” — CloudBuild team That last quote is the one we keep coming back to. The agent+skills doesn’t eliminate the work, it changes where the work lives. Discovery, scoping, and first-draft remediation collapse. Validation and deployment become the new ceiling. That’s the right problem to have and it tells us where to invest next. Security and compliance response with agents is evolving from reactive maintenance to a proactive, strategic defense capability. What we’ve learned On quality and trust With agents, silent confidence is more dangerous than visible uncertainty. Testing agents cold exposes gaps early, before risk compounds. Build uncertainty into skills, and lean on Agent Signals to capture what worked, what was hard, and where the skill fell short. When agents report honestly, the next run starts smarter than the last one. Quality is measured, not assumed. We evaluate every PR on an A/B/C scale, and we run agents that evaluate other agents’ output, closing the loop between execution and assessment. On scaling Not all work should be automated. Some work requires human-AI collaboration. Encoding expertise will always be more valuable than scaling generic prompts. Start with a win in one repo, then slowly scale out that skill to other teams and repos. Where teams can start Teams don’t adopt AI through mandates. They adopt it through trust, built on quality results in their code. Start with one team, one skill, and one real win. Identify a CVE or dependency issue that appears repeatedly across repositories. Write the fix as Markdown, as if you’re onboarding a new engineer. That’s your first skill file. Test the skill with a cold agent on a real repo with a real problem. Iterate until the agent knows both how to act and when to stop. Agents can assess their own work and flag gaps in skills. Want to learn more? 
- Watch the demo video of the dependency update scenario
- Learn more about the co-creative framework
- Discover how the GitHub Copilot CLI can help you run and orchestrate agents
- Learn more about Agent Signals
- Learn more about Agent Skills
- Read the companion ops-side story: How we build and use Azure SRE Agent with agentic workflows