azure sre agent
54 TopicsContext Engineering Lessons from Building Azure SRE Agent
We started with 100+ tools and 50+ specialized agents. We ended with 5 core tools and a handful of generalists. The agent got more reliable, not less. Every context decision is a tradeoff: latency vs autonomy, evidence-building vs speed, oversight - and the cost of being wrong. This post is a practical map of those knobs and how we adjusted them for SRE Agent.13KViews22likes2CommentsAn AI led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub.
This is due to the inevitable move towards fully agentic, end-to-end SDLCs. We may not yet be at a point where software engineers are managing fleets of agents creating the billion-dollar AI abstraction layer, but (as I will evidence in this article) we are certainly on the precipice of such a world. Before we dive into the reality of agentic development today, let me examine two very different modules from university and their relevance in an AI-first development environment. Manual Requirements Translation. At university I dedicated two whole years to a unit called “Systems Design”. This was one of my favourite units, primarily focused on requirements translation. Often, I would receive a scenario between “The Proprietor” and “The Proprietor’s wife”, who seemed to be in a never-ending cycle of new product ideas. These tasks would be analysed, broken down, manually refined, and then mapped to some kind of early-stage application architecture (potentially some pseudo-code and a UML diagram or two). The big intellectual effort in this exercise was taking human intention and turning it into something tangible to build from (BA’s). Today, by the time I have opened Notepad and started to decipher requirements, an agent can already have created a comprehensive list, a service blueprint, and a code scaffold to start the process (*cough* spec-kit *cough*). Manual debugging. Need I say any more? Old-school debugging with print()’s and breakpoints is dead. I spent countless hours learning to debug in a classroom and then later with my own software, stepping through execution line by line, reading through logs, and understanding what to look for; where correlation did and didn’t mean causation. I think back to my year at IBM as a fresh-faced intern in a cloud engineering team, where around 50% of my time was debugging different issues until it was sufficiently “narrowed down”, and then reading countless Stack Overflow posts figuring out the actual change I would need to make to a PowerShell script or Jenkins pipeline. Already in Azure, with the emergence of SRE agents, that debug process looks entirely different. The debug process for software even more so… #terminallastcommand WHY IS THIS NOT RUNNING? #terminallastcommand Review these logs and surface errors relating to XYZ. As I said: breakpoints are dead, for now at least. Caveat – Is this a good thing? One more deviation from the main core of the article if you would be so kind (if you are not as kind skip to the implementation walkthrough below). Is this actually a good thing? Is a software engineering degree now worthless? What if I love printf()? I don’t know is my answer today, at the start of 2026. Two things worry me: one theoretical and one very real. To start with the theoretical: today AI takes a significant amount of the “donkey work” away from developers. How does this impact cognitive load at both ends of the spectrum? The list that “donkey work” encapsulates is certainly growing. As a result, on one end of the spectrum humans are left with the complicated parts yet to be within an agent’s remit. This could have quite an impact on our ability to perform tasks. If we are constantly dealing with the complex and advanced, when do we have time to re-root ourselves in the foundations? Will we see an increase in developer burnout? How do technical people perform without the mundane or routine tasks? I often hear people who have been in the industry for years discuss how simple infrastructure, computing, development, etc. were 20 years ago, almost with a longing to return to a world where today’s zero trust, globally replicated architectures are a twinkle in an architect’s eye. Is constantly working on only the most complex problems a good thing? At the other end of the spectrum, what if the performance of AI tooling and agents outperforms our wildest expectations? Suddenly, AI tools and agents are picking up more and more of today’s complicated and advanced tasks. Will developers, architects, and organisations lose some ability to innovate? Fundamentally, we are not talking about artificial general intelligence when we say AI; we are talking about incredibly complex predictive models that can augment the existing ideas they are built upon but are not, in themselves, innovators. Put simply, in the words of Scott Hanselman: “Spicy auto-complete”. Does increased reliance on these agents in more and more of our business processes remove the opportunity for innovative ideas? For example, if agents were football managers, would we ever have graduated from Neil Warnock and Mick McCarthy football to Pep? Would every agent just augment a ‘lump it long and hope’ approach? We hear about learning loops, but can these learning loops evolve into “innovation loops?” Past the theoretical and the game of 20 questions, the very real concern I have is off the back of some data shared recently on Stack Overflow traffic. We can see in the diagram below that Stack Overflow traffic has dipped significantly since the release of GitHub Copilot in October 2021, and as the product has matured that trend has only accelerated. Data from 12 months ago suggests that Stack Overflow has lost 77% of new questions compared to 2022… Stack Overflow democratises access to problem-solving (I have to be careful not to talk in past tense here), but I will admit I cannot remember the last time I was reviewing Stack Overflow or furiously searching through solutions that are vaguely similar to my own issue. This causes some concern over the data available in the future to train models. Today, models can be grounded in real, tested scenarios built by developers in anger. What happens with this question drop when API schemas change, when the technology built for today is old and deprecated, and the dataset is stale and never returning to its peak? How do we mitigate this impact? There is potential for some closed-loop type continuous improvement in the future, but do we think this is a scalable solution? I am unsure. So, back to the question: “Is this a good thing?”. It’s great today; the long-term impacts are yet to be seen. If we think that AGI may never be achieved, or is at least a very distant horizon, then understanding the foundations of your technical discipline is still incredibly important. Developers will not only be the managers of their fleet of agents, but also the janitors mopping up the mess when there is an accident (albeit likely mopping with AI-augmented tooling). An AI First SDLC Today – The Reality Enough reflection and nostalgia (I don’t think that’s why you clicked the article), let’s start building something. For the rest of this article I will be building an AI-led, agent-powered software development lifecycle. The example I will be building is an AI-generated weather dashboard. It’s a simple example, but if agents can generate, test, deploy, observe, and evolve this application, it proves that today, and into the future, the process can likely scale to more complex domains. Let’s start with the entry point. The problem statement that we will build from. “As a user I want to view real time weather data for my city so that I can plan my day.” We will use this as the single input for our AI led SDLC. This is what we will pass to promptkit and watch our app and subsequent features built in front of our eyes. The goal is that we will: - Spec-kit to get going and move from textual idea to requirements and scaffold. - Use a coding agent to implement our plan. - A Quality agent to assess the output and quality of the code. - GitHub Actions that not only host the agents (Abstracted) but also handle the build and deployment. - An SRE agent proactively monitoring and opening issues automatically. The end to end flow that we will review through this article is the following: Step 1: Spec-driven development - Spec First, Code Second A big piece of realising an AI-led SDLC today relies on spec-driven development (SDD). One of the best summaries for SDD that I have seen is: “Version control for your thinking”. Instead of huge specs that are stale and buried in a knowledge repository somewhere, SDD looks to make them a first-class citizen within the SDLC. Architectural decisions, business logic, and intent can be captured and versioned as a product evolves; an executable artefact that evolves with the project. In 2025, GitHub released the open-source Spec Kit: a tool that enables the goal of placing a specification at the centre of the engineering process. Specs drive the implementation, checklists, and task breakdowns, steering an agent towards the end goal. This article from GitHub does a great job explaining the basics, so if you’d like to learn more it’s a great place to start (https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/). In short, Spec Kit generates requirements, a plan, and tasks to guide a coding agent through an iterative, structured development process. Through the Spec Kit constitution, organisational standards and tech-stack preferences are adhered to throughout each change. I did notice one (likely intentional) gap in functionality that would cement Spec Kit’s role in an autonomous SDLC. That gap is that the implement stage is designed to run within an IDE or client coding agent. You can now, in the IDE, toggle between task implementation locally or with an agent in the cloud. That is great but again it still requires you to drive through the IDE. Thinking about this in the context of an AI-led SDLC (where we are pushing tasks from Spec Kit to a coding agent outside of my own desktop), it was clear that a bridge was needed. As a result, I used Spec Kit to create the Spec-to-issue tool. This allows us to take the tasks and plan generated by Spec Kit, parse the important parts, and automatically create a GitHub issue, with the option to auto-assign the coding agent. From the perspective of an autonomous AI-led SDLC, Speckit really is the entry point that triggers the flow. How Speckit is surfaced to users will vary depending on the organisation and the context of the users. For the rest of this demo I use Spec Kit to create a weather app calling out to the OpenWeather API, and then add additional features with new specs. With one simple prompt of “/promptkit.specify “Application feature/idea/change” I suddenly had a really clear breakdown of the tasks and plan required to get to my desired end state while respecting the context and preferences I had previously set in my Spec Kit constitution. I had mentioned a desire for test driven development, that I required certain coverage and that all solutions were to be Azure Native. The real benefit here compared to prompting directly into the coding agent is that the breakdown of one large task into individual measurable small components that are clear and methodical improves the coding agents ability to perform them by a considerable degree. We can see an example below of not just creating a whole application but another spec to iterate on an existing application and add a feature. We can see the result of the spec creation, the issue in our github repo and most importantly for the next step, our coding agent, GitHub CoPilot has been assigned automatically. Step 2: GitHub Coding Agent - Iterative, autonomous software creation Talking of coding agents, GitHub Copilot’s coding agent is an autonom ous agent in GitHub that can take a scoped development task and work on it in the background using the repository’s context. It can make code changes and produce concrete outputs like commits and pull requests for a developer to review. The developer stays in control by reviewing, requesting changes, or taking over at any point. This does the heavy lifting in our AI-led SDLC. We have already seen great success with customers who have adopted the coding agent when it comes to carrying out menial tasks to save developers time. These coding agents can work in parallel to human developers and with each other. In our example we see that the coding agent creates a new branch for its changes, and creates a PR which it starts working on as it ticks off the various tasks generated in our spec. One huge positive of the coding agent that sets it apart from other similar solutions is the transparency in decision-making and actions taken. The monitoring and observability built directly into the feature means that the agent’s “thinking” is easily visible: the iterations and steps being taken can be viewed in full sequence in the Agents tab. Furthermore, the action that the agent is running is also transparently available to view in the Actions tab, meaning problems can be assessed very quickly. Once the coding agent is finished, it has run the required tests and, even in the case of a UI change, goes as far as calling the Playwright MCP server and screenshotting the change to showcase in the PR. We are then asked to review the change. In this demo, I also created a GitHub Action that is triggered when a PR review is requested: it creates the required resources in Azure and surfaces the (in this case) Azure Container Apps revision URL, making it even smoother for the human in the loop to evaluate the changes. Just like any normal PR, if changes are required comments can be left; when they are, the coding agent can pick them up and action what is needed. It’s also worth noting that for any manual intervention here, use of GitHub Codespaces would work very well to make minor changes or perform testing on an agent’s branch. We can even see the unit tests that have been specified in our spec how been executed by our coding agent. The pattern used here (Spec Kit -> coding agent) overcomes one of the biggest challenges we see with the coding agent. Unlike an IDE-based coding agent, the GitHub.com coding agent is left to its own iterations and implementation without input until the PR review. This can lead to subpar performance, especially compared to IDE agents which have constant input and interruption. The concise and considered breakdown generated from Spec Kit provides the structure and foundation for the agent to execute on; very little is left to interpretation for the coding agent. Step 3: GitHub Code Quality Review (Human in the loop with agent assistance.) GitHub Code Quality is a feature (currently in preview) that proactively identifies code quality risks and opportunities for enhancement both in PRs and through repository scans. These are surfaced within a PR and also in repo-level scoreboards. This means that PRs can now extend existing static code analysis: Copilot can action CodeQL, PMD, and ESLint scanning on top of the new, in-context code quality findings and autofixes. Furthermore, we receive a summary of the actual changes made. This can be used to assist the human in the loop in understanding what changes have been made and whether enhancements or improvements are required. Thinking about this in the context of review coverage, one of the challenges sometimes in already-lean development teams is the time to give proper credence to PRs. Now, with AI-assisted quality scanning, we can be more confident in our overall evaluation and test coverage. I would expect that use of these tools alongside existing human review processes would increase repository code quality and reduce uncaught errors. The data points support this too. The Qodo 2025 AI Code Quality report showed that usage of AI code reviews increased quality improvements to 81% (from 55%). A similar study from Atlassian RovoDev 2026 study showed that 38.7% of comments left by AI agents in code reviews lead to additional code fixes. LLM’s in their current form are never going to achieve 100% accuracy however these are still considerable, significant gains in one of the most important (and often neglected) parts of the SDLC. With a significant number of software supply chain attacks recently it is also not a stretch to imagine that that many projects could benefit from "independently" (use this term loosely) reviewed and summarised PR's and commits. This in the future could potentially by a specialist/sub agent during a PR or merge to focus on identifying malicious code that may be hidden within otherwise normal contributions, case in point being the "near-miss" XZ Utils attack. Step 4: GitHub Actions for build and deploy - No agents here, just deterministic automation. This step will be our briefest, as the idea of CI/CD and automation needs no introduction. It is worth noting that while I am sure there are additional opportunities for using agents within a build and deploy pipeline, I have not investigated them. I often speak with customers about deterministic and non-deterministic business process automation, and the importance of distinguishing between the two. Some processes were created to be deterministic because that is all that was available at the time; the number of conditions required to deal with N possible flows just did not scale. However, now those processes can be non-deterministic. Good examples include IVR decision trees in customer service or hard-coded sales routines to retain a customer regardless of context; these would benefit from less determinism in their execution. However, some processes remain best as deterministic flows: financial transactions, policy engines, document ingestion. While all these flows may be part of an AI solution in the future (possibly as a tool an agent calls, or as part of a larger agent-based orchestration), the processes themselves are deterministic for a reason. Just because we could have dynamic decision-making doesn’t mean we should. Infrastructure deployment and CI/CD pipelines are one good example of this, in my opinion. We could have an agent decide what service best fits our codebase and which region we should deploy to, but do we really want to, and do the benefits outweigh the potential negatives? In this process flow we use a deterministic GitHub action to deploy our weather application into our “development” environment and then promote through the environments until we reach production and we want to now ensure that the application is running smoothly. We also use an action as mentioned above to deploy and surface our agents changes. In Azure Container Apps we can do this in a secure sandbox environment called a “Dynamic Session” to ensure strong isolation of what is essentially “untrusted code”. Often enterprises can view the building and development of AI applications as something that requires a completely new process to take to production, while certain additional processes are new, evaluation, model deployment etc many of our traditional SDLC principles are just as relevant as ever before, CI/CD pipelines being a great example of that. Checked in code that is predictably deployed alongside required services to run tests or promote through environments. Whether you are deploying a java calculator app or a multi agent customer service bot, CI/CD even in this new world is a non-negotiable. We can see that our geolocation feature is running on our Azure Container Apps revision and we can begin to evaluate if we agree with CoPilot that all the feature requirements have been met. In this case they have. If they hadn't we'd just jump into the PR and add a new comment with "@copilot" requesting our changes. Step 5: SRE Agent - Proactive agentic day two operations. The SRE agent service on Azure is an operations-focused agent that continuously watches a running service using telemetry such as logs, metrics, and traces. When it detects incidents or reliability risks, it can investigate signals, correlate likely causes, and propose or initiate response actions such as opening issues, creating runbook-guided fixes, or escalating to an on-call engineer. It effectively automates parts of day two operations while keeping humans in control of approval and remediation. It can be run in two different permission models: one with a reader role that can temporarily take user permissions for approved actions when identified. The other model is a privileged level that allows it to autonomously take approved actions on resources and resource types within the resource groups it is monitoring. In our example, our SRE agent could take actions to ensure our container app runs as intended: restarting pods, changing traffic allocations, and alerting for secret expiry. The SRE agent can also perform detailed debugging to save human SREs time, summarising the issue, fixes tried so far, and narrowing down potential root causes to reduce time to resolution, even across the most complex issues. My initial concern with these types of autonomous fixes (be it VPA on Kubernetes or an SRE agent across your infrastructure) is always that they can very quickly mask problems, or become an anti-pattern where you have drift between your IaC and what is actually running in Azure. One of my favourite features of SRE agents is sub-agents. Sub-agents can be created to handle very specific tasks that the primary SRE agent can leverage. Examples include alerting, report generation, and potentially other third-party integrations or tooling that require a more concise context. In my example, I created a GitHub sub-agent to be called by the primary agent after every issue that is resolved. When called, the GitHub sub-agent creates an issue summarising the origin, context, and resolution. This really brings us full circle. We can then potentially assign this to our coding agent to implement the fix before we proceed with the rest of the cycle; for example, a change where a port is incorrect in some Bicep, or min scale has been adjusted because of latency observed by the SRE agent. These are quick fixes that can be easily implemented by a coding agent, subsequently creating an autonomous feedback loop with human review. Conclusion: The journey through this AI-led SDLC demonstrates that it is possible, with today’s tooling, to improve any existing SDLC with AI assistance, evolving from simply using a chat interface in an IDE. By combining Speckit, spec-driven development, autonomous coding agents, AI-augmented quality checks, deterministic CI/CD pipelines, and proactive SRE agents, we see an emerging ecosystem where human creativity and oversight guide an increasingly capable fleet of collaborative agents. As with all AI solutions we design today, I remind myself that “this is as bad as it gets”. If the last two years are anything to go by, the rate of change in this space means this article may look very different in 12 months. I imagine Spec-to-issue will no longer be required as a bridge, as native solutions evolve to make this process even smoother. There are also some areas of an AI-led SDLC that are not included in this post, things like reviewing the inner-loop process or the use of existing enterprise patterns and blueprints. I also did not review use of third-party plugins or tools available through GitHub. These would make for an interesting expansion of the demo. We also did not look at the creation of custom coding agents, which could be hosted in Microsoft Foundry; this is especially pertinent with the recent announcement of Anthropic models now being available to deploy in Foundry. Does today’s tooling mean that developers, QAs, and engineers are no longer required? Absolutely not (and if I am honest, I can’t see that changing any time soon). However, it is evidently clear that in the next 12 months, enterprises who reshape their SDLC (and any other business process) to become one augmented by agents will innovate faster, learn faster, and deliver faster, leaving organisations who resist this shift struggling to keep up.37KViews9likes2CommentsThe Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself. Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis. First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives. The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline. That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible. Where we started In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction. But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start. We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale. The Inversion: Three bets The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself. The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last. Bet 1: The Filesystem as the Agent's World Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it. Code Repositories: the highest-leverage context Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it. When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide: Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation. Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors. Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them. Memory as a filesystem, not a vector store Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In SRE Context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface. The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carry just enough context to orient the agent, with links to deeper files when needed. The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically. One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it. The sandbox as epistemic boundary The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism. Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with gh cli, like the prompt-ordering fix from KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed. No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free. Bet 2: Context Layering Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action. Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration. Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions. Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed. Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope. Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start. Bet 3: Frugal Context Management Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast. Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal. Tool result compression via the filesystem Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive. Context Pruning and Auto Compact Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work. Parallel subagents The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another. The Feedback loop These architectural bets have enabled us to close the original scaling gap. Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway and some conversations broke entirely. So, we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%. Over the last month, we have successfully used our agent across a wide range of scenarios: Analyzed our user churn rate and built dashboards we now review weekly. Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase. Ran security analysis and found vulnerabilities in the read path. Helped fill out parts of its own Responsible AI review, with strict human review. Handles customer-reported issues and LiveSite alerts end to end. Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users. What We Learned We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space. Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.13KViews6likes0CommentsHow we build and use Azure SRE Agent with agentic workflows
The Challenge: Ops is critical but takes time from innovation Microsoft operates always-on, mission-critical production systems at extraordinary scale. Thousands of services, millions of deployments, and constant change are the reality of modern cloud engineering. These are titan systems that power organizations around the globe—including our own—with extremely low risk tolerance for downtime. While operations work like incident investigation, response and recovery, and remediation is essential, it’s also disruptive to innovation. For engineers, operational toil often means being pulled away from feature work to diagnose alerts, sift through logs, correlate metrics across systems, or respond to incidents at any hour. On-call rotations and manual investigations slow teams down and introduce burnout. What's more, in the era of AI, demand for operational excellence has spiked to new heights. It became clear that traditional human-only processes couldn't meet the scale and complexity needs for system maintenance especially in the AI world where code shipping velocity has increased exponentially. At the same time, we needed to integrate with the AI landscape which continues to evolve at a breakneck pace. New models, new tooling, and new best practices released constantly, fragmenting ecosystems between different platforms for observability, DevOps, incident management, and security. Beyond simply automating tasks, we needed to build an adaptable approach that could integrate with existing systems and improve over time. Microsoft needed a fundamentally different way to perform operations—one that reduced toil, accelerated response, and gave engineers the time to focus on building great products. The Solution: How we build Azure SRE Agent using agentic workflows To address these challenges, Microsoft built Azure SRE Agent, an AI-powered operations agent that serves as an always-on SRE partner for engineers. In practice, Azure SRE Agent continuously observes production environments to detect and investigate incidents. It reasons across signals like logs, metrics, code changes, and other deployment records to perform root cause analysis. It supports engineers from triage to resolution and it’s used in a variety of autonomy levels from assistive investigation to automating remediation proposals. Everything occurs within governance guardrails and human approval checks grounded in role‑based access controls and clear escalation paths. What’s more, Azure SRE Agent learns from past incidents, outcomes, and human feedback to improve over time. But just as important as what was built is how it was built. Azure SRE Agent was created using the agentic workflow approach—building agents with agents. Rather than treating AI as a bolt-on tool, Microsoft embedded specialized agents across the entire software development lifecycle (SDLC) to collaborate with developers, from planning through operations. The diagram above outlines the agents used at each stage of development. They come together to form a full lifecycle: Plan & Code: Agents support spec‑driven development to unlock faster inner loop cycles for developers and even product managers. With AI, we can not only draft spec documentation that defines feature requirements for UX and software development agents but also create prototypes and check in code to staging which now enables PMs/UX/Engineering to rapidly iterate, generate and improve code even for early-stage merges. Verify, Test & Deploy: Agents for code quality review, security, evaluation, and deployment agents work together to shift left on quality and security issues. They also continuously assess reliability, ensure performance, and enforce consistent release best practices. Operate & Optimize: Azure SRE Agent handles ongoing operational work from investigating alerts, to assisting with remediation, and even resolving some issues autonomously. Moreover, it learns continuously over time and we provide Azure SRE Agent with its own specialized instance of Azure SRE Agent to maintain itself and catalyze feedback loops. While agents surface insights, propose actions, mitigate issues and suggest long term code or IaC fixes autonomously, humans remain in the loop for oversight, approval, and decision-making when required. This combination of autonomy and governance proved critical for safe operations at scale. We also designed Azure SRE agent to integrate across existing systems. Our team uses custom agents, Model Context Protocol (MCP) and Python tools, telemetry connections, incident management platforms, code repositories, knowledge sources, business process and operational tools to add intelligence on top of established workflows rather than replacing them. Built this way, Azure SRE Agent was not just a new tool but a new operational system. And at Microsoft’s scale, transformative systems lead to transformative outcomes. The Impact: Reducing toil at enterprise scale The impact of Azure SRE Agent is felt most clearly in day-to-day operations. By automating investigations and assisting with remediation, the agent reduces burden for on-call engineers and accelerates time to resolution. Internally at Microsoft in the last nine months, we've seen: 35,000+ incidents have been handled autonomously by Azure SRE Agent. 50,000+ developer hours have been saved by reducing manual investigation and response work. Teams experienced a reduced on-call burden and faster time-to-mitigation during incidents. To share a couple of specific cases, the Azure Container Apps and Azure App Service product group teams have had tremendous success with Azure SRE Agent. Engineers for Azure Container Apps had overwhelmingly positive (89%) responses to the root cause analysis (RCA) results from Azure SRE agent, covering over 90% of incidents. Meanwhile, Azure App Service has brought their time-to-mitigation for live-site incidents (LSIs) down to 3 minutes, a drastic improvement from the 40.5-hour average with human-only activity. And this impact is felt within the developer experience. When asked developers about how the agent has changed ops work, one of our engineers had this to say: “[It’s] been a massive help in dealing with quota requests which were being done manually at first. I can also say with high confidence that there have been quite a few CRIs that the agent was spot on/ gave the right RCA / provided useful clues that helped navigate my initial investigation in the right direction RATHER than me having to spend time exploring all different possibilities before arriving at the correct one. Since the Agent/AI has already explored all different combinations and narrowed it down to the right one, I can pick the investigation up from there and save me countless hours of logs checking.” - Software Engineer II, Microsoft Engineering Beyond the impact of the agent itself, the agentic workflow process has also completely redefined how we build. Key learnings: Agentic workflow process and impact It's very easy to think of agents as another form of advanced automation, but it's important to understand that Azure SRE agent is also a collaborative tool. Engineers can prompt the agent in their investigations to surface relevant context (logs, metrics, and related code changes) to propose actions far faster and easier than traditional troubleshooting. What’s more, they can also extend it for data analysis and dashboarding. Now engineers can focus on the agent’s findings to approve actions or intervene when necessary. The result is a human-AI partnership that scales operations expertise without sacrificing control. While the process took time and experimentation to refine, the payoff has been extraordinary; our team is building high-quality features faster than ever since we introduced specialized agents for each stage of the SDLC. While these results were achieved inside Microsoft, the underlying patterns are broadly applicable. First, building agents with agents is essential to scaling, as manual development quickly became a bottleneck; agents dramatically accelerated inner loop iteration through code generation, review, debugging, security fixes, etc. In practice, we found that a generic agent—guided by rich context and powered by memory and learning—can continuously adapt, becoming faster and more effective over time as it builds experience. This allows the agent to apply prior knowledge, avoid relearning, and reduce the effort required to resolve similar problems repeatedly. In parallel, specialized agents help bring consistency and repeatability to well‑defined categories of incidents, encoding proven patterns, workflows, and safeguards. Together, these approaches enable systems that both adapt to new situations and respond reliably at scale. Microsoft also learned to integrate deeply with existing systems, embedding agents into established telemetry, workflows, and platforms rather than attempting to replace them. Throughout this process, maintaining tight human‑in‑the‑loop governance proved critical. Autonomy had to be balanced with clear approval boundaries, role‑based access, and safety checks to build trust. Finally, teams learned to invest in continuous feedback and evaluation, using ongoing measurement to improve agents over time and understand where automation added value versus where human judgment should remain central. Want to learn more? Azure SRE Agent is one example of how agentic workflows can transform both product development and operations at scale. Teams at Microsoft are on a mission of leading the industry by example, not just sharing results. We invite you to take the practical learnings from this blog and apply the same principles in your own environments. Discover more about Azure SRE Agent Learn about agents in DevOps tools and processes Read best practices on agent management with Azure8.6KViews4likes1CommentHow Microsoft 1ES uses agentic AI to take on security and compliance at scale
Microsoft’s Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade IQ platform. Learn best practices from our engineering teams with real-world lessons, architectural patterns, and operational strategies for pressure-tested solutions in building, operating, and scaling AI apps and agent fleets across the organization. What we do Within Microsoft’s One Engineering System (1ES) organization, teams build and maintain the internal engineering systems that product groups across the company rely on to ship and secure their services. These shared tools and processes support teams responsible for mission-critical products, from modern cloud-native platforms to long-lived legacy applications. Security, compliance, and reliability work is non-negotiable at this scale. But it has to coexist with developer productivity and velocity across thousands of independently owned repositories. The problem: the CVE and compliance treadmill Here’s the loop we kept living: A security or compliance alert arrives, often via automation like Dependabot or a CVE finding. The version gets bumped, or the config gets nudged. CI is green. The PR merges. Production fails or the finding reopens because the fix required code changes beyond a version bump or a config flip. This repeats across repositories, teams, and organizations. And the hard truth is not all vulnerabilities are mechanical version bumps, and not all compliance findings are config tweaks. Many introduce behavioral or security model changes. Automation handles the easy cases but silently fails on the hard ones. A second pattern compounds it: when a service has 30+ open action items spanning OTel audit, identity, secret rotation, and CodeQL findings, just figuring out which ones are quick versus deep can take longer than the fixes themselves. Multiply this across Microsoft’s repo footprint and the cost becomes months of engineering time spent on work that doesn’t ship new customer value. But this is exactly the kind of challenge AI was made for: high-speed, high-scale evaluation and judgment calls, coached by human expertise. Why this is solvable now In the previous era of software development, an average CVE alert meant hours of developer toil. Three things changed at once. Frontier models like GPT-5.5 and Claude Opus 4.7 can now reason about context, intent, and tradeoffs not just generate code. Agent runtimes like GitHub Copilot CLI can read repositories, run tools, execute tests, and open pull requests end-to-end. And we’ve started encoding hard-won domain expertise as portable skills, so an agent doesn’t have to re-derive what an expert already knows. None of these is enough alone. Frontier models without runtimes are just chat. Runtimes without skills hallucinate confidently. Skills without judgment automate the wrong thing. Together, bounded by human–AI partnership patterns that make escalation a first-class behavior, they enable a safer, more disciplined way to tackle judgment-heavy engineering work. How we approach it: collaborate, don’t automate The co-creative model Instead of treating AI as a script executor, we treat agents as collaborators operating within explicit guardrails: Agents propose changes based on skills and available context. Humans review, approve, and retain final ownership of every change. Skills over prompts Agents start cold. They don’t have repo-specific context beyond the invoked skill. A skill captures the exact steps, decisions, and edge cases a human expert would apply to a specific class of problem. Skills are written once as Markdown and loaded only when needed: focused context, improved complexity handling, more predictable behavior. We author skills with agents too. The same operating model we use for remediation. Human owns the decision, agent does the work, signals feed back is how the skills themselves get written and refined. One of those agents, Ember, is now open-sourced on awesome-copilot. A real example: the XStream CVE Some CVEs include changes in aspects like default security models, which require code changes beyond just bumping the dependency version. Take the XStream dependency update. In the previous 1.4.17 version, any class deserializes through a default-allow classification. But in the latest update, classification changed to default-deny meaning we need to make permitted types explicit. Once we find the XStream call sites, we need to fix type permissions after each instantiation and make sure that change propagates from test, to PR, to run. This is the type of judgment-heavy work where naïve automation creates risk and blocks developers from focusing on feature work. How execution works The agent loads the relevant skill for the task at hand. If it encounters ambiguity or risk, it stops and escalates rather than guessing. The agent goes through required steps: compile, test, pull request, as explicitly agreed upon in the guidance we provide. After each run, the agent emits an Agent Signals: a structured self-assessment of what worked, what was hard, and where the skill fell short. These compound across sessions so the system improves continuously. Autonomy is great, but trust is far better. Between the CVE context, the skills, and our working agreement with the agent, we’re creating a dynamic where the agent feels empowered to execute until it reaches a point of uncertainty. This cuts down the risk of hallucinations dramatically and scales repeatable, trustworthy execution. The most important issues get surfaced for humans in the loop, where human judgment actually matters. Closing the loop: dev-side and ops-side Skills and agents handle the dev-side work: CVE remediation, compliance findings, codebase changes that need judgment. On the ops side, Azure SRE Agent handles at-scale data analysis and operational toil. Same philosophy on both sides: agents act within explicit guardrails, humans own the decisions that matter, and signals from every run feed back into the system. Then the two sides connect. Every Agent Signal our dev-side skills emit flows into Azure SRE Agent, which analyzes them at scale, identifies where skills are degrading or falling short, opens PRs against the skills themselves to fix the gaps, and sends us a daily skill-health report. The ops-side agent maintains the dev-side agents: agents improving agents, while humans review and merge every change. The same human-in-the-loop discipline that governs a CVE fix governs a skill fix. Impact Across Microsoft, 1ES supports teams working on hundreds of repos at a variety of ages and sizes. Agents enable velocity while skills enable uniqueness which is what helps us scale across such a vast enterprise. Impact of the frontier models, GitHub Copilot, agent skills and agent signals for compliance work. Real engineering time saved We’re finding 18-15 hours of manual work compressed into ~9 hours of agent+skill assisted work – a 50-60% reduction overall, with some compliance work moving from 3-4 hrs manually to 30 min with the agent+skill. What devs told us “Considering I didn’t know anything about any of this, including never having seen the IaC in question, I’d say at least a week’s worth, done in less than 10 prompts.” — Patrick, Senior Engineer “Many times with [compliance], the actual changes are minimal, but reading the docs and knowing what applies to your app can be more time consuming… When you have 30+ action items, you need to go hunting for which one is quick versus time-consuming. This [agent+skills] saves a lot of time.” — Greg, Engineering Manager “The [agent+skills] eliminates most early-phase toil — up to ~90% — but 0% of the last-mile effort. The bottleneck shifts entirely to validation and deployment.” — CloudBuild team That last quote is the one we keep coming back to. The agent+skills doesn’t eliminate the work, it changes where the work lives. Discovery, scoping, and first-draft remediation collapse. Validation and deployment become the new ceiling. That’s the right problem to have and it tells us where to invest next. Security and compliance response with agents is evolving from reactive maintenance to a proactive, strategic defense capability. What we’ve learned On quality and trust With agents, silent confidence is more dangerous than visible uncertainty. Testing agents cold exposes gaps early, before risk compounds. Build uncertainty into skills, and lean on Agent Signals to capture what worked, what was hard, and where the skill fell short. When agents report honestly, the next run starts smarter than the last one. Quality is measured, not assumed. We evaluate every PR on an A/B/C scale, and we run agents that evaluate other agents’ output, closing the loop between execution and assessment. On scaling Not all work should be automated. Some work requires human-AI collaboration. Encoding expertise will always be more valuable than scaling generic prompts. Start with a win in one repo, then slowly scale out that skill to other teams and repos. Where teams can start Teams don’t adopt AI through mandates. They adopt it through trust, built on quality results in their code. Start with one team, one skill, and one real win. Identify a CVE or dependency issue that appears repeatedly across repositories. Write the fix as Markdown, as if you’re onboarding a new engineer. That’s your first skill file. Test the skill with a cold agent on a real repo with a real problem. Iterate until the agent knows both how to act and when to stop. Agents can assess their own work and flag gaps in skills. Want to learn more? Watch the demo video of the dependency update scenario Learn more about the co-creative framework Discover how the GitHub Copilot CLI can help you run and orchestrate agents Learn more about Agent Signals Learn more about Agent Skills Read the companion ops-side story: How we build and use Azure SRE Agent with agentic workflows465Views3likes0CommentsFrom Vibe Coding to Working App: How SRE Agent Completes the Developer Loop
The Most Common Challenge in Modern Cloud Apps There's a category of bugs that drive engineers crazy: multi-layer infrastructure issues. Your app deploys successfully. Every Azure resource shows "Succeeded." But the app fails at runtime with a vague error like Login failed for user ''. Where do you even start? You're checking the Web App, the SQL Server, the VNet, the private endpoint, the DNS zone, the identity configuration... and each one looks fine in isolation. The problem is how they connect and that's invisible in the portal. Networking issues are especially brutal. The error says "Login failed" but the actual causes could be DNS, firewall, identity, or all three. The symptom and the root causes are in completely different resources. Without deep Azure networking knowledge, you're just clicking around hoping something jumps out. Now imagine you vibe coded the infrastructure. You used AI to generate the Bicep, deployed it, and moved on. When it breaks, you're debugging code you didn't write, configuring resources you don't fully understand. This is where I wanted AI to help not just to build, but to debug. Enter SRE Agent + Coding Agent Here's what I used: Layer Tool Purpose Build VS Code Copilot Agent Mode + Claude Opus Generate code, Bicep, deploy Debug Azure SRE Agent Diagnose infrastructure issues and create developer issue with suggested fixes in source code (app code and IaC) Fix GitHub Coding Agent Create PRs with code and IaC fix from Github issue created by SRE Agent Copilot builds. SRE Agent debugs. Coding Agent fixes. What I Built I used VS Code Copilot in Agent Mode with Claude Opus to create a .NET 8 Web App connected to Azure SQL via private endpoint: Private networking (no public exposure) Entra-only authentication Managed identity (no secrets) Deployed with azd up. All green. Then I tested the health endpoint: $ curl https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql {"status":"unhealthy","error":"Login failed for user ''.","errorType":"SqlException"} Deployment succeeded. App failed. One error. How I Fixed It: Step by Step Step 1: Create SRE Agent with Azure Access I created an SRE Agent with read access to my Azure subscription. You can scope it to specific resource groups. The agent builds a knowledge graph of your resources and their dependencies visible in the Resource Mapping view below. Step 2: Connect GitHub to SRE Agent using GitHub MCP server I connected the GitHub MCP server so the agent could read my repository and create issues. Step 3: Create Sub Agent to analyze source code I created a sub-agent for analyzing source code using GitHub mcp tools. this lets SRE Agent understand not just Azure resources, but also the Bicep and source code files that created them. "you are expert in analyzing source code (bicep and app code) from github repos" Step 4: Invoke Sub-Agent to Analyze the Error In the SRE Agent chat, I invoked the sub-agent to diagnose the error I received from my app end point. It correlated the runtime error with the infrastructure configuration Step 5: Watch the SRE Agent Think and Reason SRE Agent analyzed the error by tracing code in Program.cs, Bicep configurations, and Azure resource relationships Web App, SQL Server, VNet, private endpoint, DNS zone, and managed identity. Its reasoning process worked through each layer, eliminating possibilities one by one until it identified the root causes. Step 6: Agent Creates GitHub Issue Based on its analysis, SRE Agent summarized the root causes and suggested fixes in a GitHub issue: Root Causes: Private DNS Zone missing VNet link Managed identity not created as SQL user Suggested Fixes: Add virtualNetworkLinks resource to Bicep Add SQL setup script to create user with db_datareader and db_datawriter roles Step 7: Merge the PR from Coding Agent Assign the Github issue to Coding Agent which then creates a PR with the fixes. I just reviewed the fix. It made sense and I merged it. Redeployed with azd up, ran the SQL script: curl -s https://app-tsdvdfdwo77hc.azurewebsites.net/health/sql | jq . { "status": "healthy", "database": "tododb", "server": "tcp:sql-tsdvdfdwo77hc.database.windows.net,1433", "message": "Successfully connected to SQL Server" } 🎉 From error to fix in minutes without manually debugging a single Azure resource. Why This Matters If you're a developer building and deploying apps to Azure, SRE Agent changes how you work: You don't need to be a networking expert. SRE Agent understands the relationships between Azure resources private endpoints, DNS zones, VNet links, managed identities. It connects dots you didn't know existed. You don't need to guess. Instead of clicking through the portal hoping something looks wrong, the agent systematically eliminates possibilities like a senior engineer would. You don't break your workflow. SRE Agent suggests fixes in your Bicep and source code not portal changes. Everything stays version controlled. Deployed through pipelines. No hot fixes at 2 AM. You close the loop. AI helps you build fast. Now AI helps you debug fast too. Try It Yourself Do you vibe code your app, your infrastructure, or both? How do you debug when things break? Here's a challenge: Vibe code a todo app with a Web App, VNet, private endpoint, and SQL database. "Forget" to link the DNS zone to the VNet. Deploy it. Watch it fail. Then point SRE Agent at it and see how it identifies the root cause, creates a GitHub issue with the fix, and hands it off to Coding Agent for a PR. Share your experience. I'd love to hear how it goes. Learn More Azure SRE Agent documentation Azure SRE Agent blogs Azure SRE Agent community Azure SRE Agent home page Azure SRE Agent pricing1.2KViews3likes0CommentsAzure SRE Agent: Expanding Observability and Multi-Cloud Resilience
The Azure SRE Agent continues to evolve as a cornerstone for operational excellence and incident management. Over the past few months, we have made significant strides in enabling integrations with leading observability platforms—Dynatrace, New Relic, and Datadog—through Model Context Protocol (MCP) Servers. These partnerships serve joint customers, enabling automated remediation across diverse environments. Deepening Integrations with MCP Servers Our collaboration with these partners is more than technical—it’s about delivering value at scale. Datadog, New Relic, and Dynatrace are all Azure Native ISV Service partners. With these integrations Azure Native customers can also choose to add these MCP servers directly from the Azure Native partners’ resource: Datadog: At Ignite, Azure SRE Agent was presented with the Datadog MCP Server, to demonstrate how our customers can streamline complex workflows. Customers can now bring their Datadog MCP Server into Azure SRE Agent, extending knowledge capabilities and centralizing logs and metrics. Find Datadog Azure Native offerings on Marketplace. New Relic: When an alert fires in New Relic, the Azure SRE Agent calls the New Relic MCP Server to provide Intelligent Observability insights. This agentic integration with the New Relic MCP Server offers over 35 specialized tools across, entity and account management, alerts and monitoring, data analysis and queries, performance analysis, and much more. The advanced remediation skills of the Azure SRE Agent + New Relic AI help our joint customers diagnose and resolve production issues faster. Find New Relic’s Azure Native offering on Marketplace Dynatrace: The Dynatrace integration bridges Microsoft Azure's cloud-native infrastructure management with Dynatrace's AI-powered observability platform, leveraging the Davis AI engine and remote MCP server capabilities for incident detection, root cause analysis, and remediation across hybrid cloud environments. Check out Dynatrace’s Azure Native offering on Marketplace. These integrations are made possible by Azure SRE Agent’s MCP connectors. The MCP connectors in Azure SRE Agent act as the bridge between the agent and MCP servers, enabling dynamic discovery and execution of specialized tools for observability and incident management across diverse environments. This feature allows customers to build their own custom sub-agents to leverage tools from MCP Servers from integrated platforms like Dynatrace, Datadog, and New Relic to complement the agent’s diagnostic and remediation capabilities. By connecting Azure SRE Agent to external MCP servers scenarios such as cross-platform telemetry analysis are unlocked. Looking Ahead: Multi-Agent Collaboration Azure SRE Agent isn’t stopping with MCP integrations We’re actively working with PagerDuty and NeuBird to support dynamic use cases via agent-to-agent collaboration: PagerDuty: PagerDuty’s PD Advance SRE Agent is an AI-powered assistant that triages incidents by analyzing logs, diagnostics, past incident history, and runbooks to surface relevant context and recommended remediations. At Ignite, PagerDuty and Microsoft demonstrated how Azure SRE Agent can ingest PagerDuty incidents and collaborate with PagerDuty’s SRE Agent to complement triage using historical patterns, runbook intelligence and Azure diagnostics. NeuBird: NeuBird’s Agentic AI SRE, Hawkeye, autonomously investigates and resolves incidents across hybrid, and multi-cloud environments. By connecting to telemetry sources like Azure Monitor, Prometheus, and GitHub, Hawkeye delivers real-time diagnosis and targeted fixes. Building on the work presented at SRE Day this partnership underscores our commitment to agentic ecosystems where specialized agents collaborate for complex scenarios. Sign up for the private preview to try the integration, here. Additionally, please check out NeuBird on Marketplace. These efforts reflect a broader vision: Azure SRE Agent as a hub for cross-platform reliability, enabling customers to manage incidents across Azure, on-premises, and other clouds with confidence. Why This Matters As organizations embrace distributed architectures, the need for integrated, intelligent, and multi-cloud-ready SRE solutions has never been greater. By partnering with industry leaders and pioneering agent-to-agent workflows, Azure SRE Agent is setting the stage for a future where resilience is not just reactive—it’s proactive and collaborative.1.3KViews3likes0CommentsFrom Coding Agents to Cloud Automation: AI-Assisted Customer Related Incidents in Azure Functions
On the Azure Functions team, we have been exploring how AI can help with investigating customer-reported incidents, root-cause analysis, and incident mitigation. This post shares our journey from early RCA agents to coding-agent-assisted investigations and cloud-hosted automation, and the lessons we learned along the way. Microsoft Engineering teams like Azure Functions work on production live site issues alongside customer reported issues, and these are one of the most important and rewarding parts of the job. On the Azure Functions team, complex customer incidents often require deep investigation. Engineers review Azure Data Explorer (Kusto) query results, source code, GitHub issues, previous incidents, public documentation, internal troubleshooting guides, and service-specific operational knowledge. The goal is always to mitigate customer impact quickly, identify the root cause, and feed what we learn back into the platform, and log improvement work items where needed. This work is valuable, but it is also time-consuming. As AI capabilities improved, we started asking a practical question: could AI help us reduce the operational burden of incident investigation while preserving the learning and engineering judgment that make those investigations useful? This is the story of how our approach evolved—from early RCA agents, to coding-agent-based workflows, and finally to cloud-hosted automation. Starting with AI-Assisted RCA Around May 2024, we began experimenting with an internal RCA agent together with a colleague from Microsoft Research. The first version was an informal approach towards the development of a formal service. It was a personal tool to help with our own investigations. The early experiments were very useful. We could give the agent an incident, let it run for several minutes, and then review the analysis. It did not always produce a perfect root cause, but it could run multiple queries, explore different hypotheses, and narrow the solution space enough to save time. Later, Azure SRE Agent emerged as a formal internal service. We contributed to it based on what We had learned from our earlier experiments. At that point, using AI to help resolve customer-reported incidents became a major focus for our team. What We Learned from Agentic Workflows The first generation of AI-assisted incident workflows were highly structured. The early experiments with the available models required careful design—especially for generating complex Kusto—we often needed to expose fixed Kusto queries as tools and let the model call them through well-defined parameters. Fig. 1 kusto query tool This made execution more predictable and reproducible, but it also revealed limitations. Detailed agentic workflows could work well when the incident matched the predefined path. Outside those paths, they were less flexible. Engineers also found it expensive to define and maintain those workflows, especially when the output felt only modestly better than a dashboard. Fig2. Agentic Workflow That experience taught us an important lesson: for complex operational investigations, flexibility matters as much as structure. The Shift to Coding Agents Near the end of 2025, we started using an internal tool using GitHub Copilot and skills, which made it possible to define and share VS Code workspaces. A workspace could include agent definitions, instructions, prompts, skills, MCP configuration, and repositories. Fig 3. GitHub Copilot internal tool The quality difference was significant. Combined with newer models, coding agents could investigate incidents much more flexibly. They could run Kusto queries, inspect code, use CLI and MCP tools, and iterate quickly by trying different paths. The team quickly adopted this model. With earlier workflow-based systems, engineers were reluctant to onboard because defining detailed workflows took effort, and the payoff was limited. With this internal tool, engineers started contributing agent definitions and skills because the system was easy to extend. Over time, the Azure Functions team accumulated a growing set of AI-ready materials which consisted of agent definitions, skills, MCP tools, instructions and repositories A number of lessons stood out. Lessons from Building AI-Ready Materials Prefer guidance over over-specification Modern coding agents are capable enough that they do not need every step spelled out. In fact, too many instructions can make the system brittle or stale. We found it better to provide concise guidance and point the agent to maintained sources of truth rather than embedding large amounts of detail directly into prompts. Manage context deliberately Instructions, tool definitions, conversation history, and tool outputs all compete for model context. Irrelevant or contradictory information can reduce quality. Tool design matters too: if a tool returns a large payload directly to the model, it can consume many tokens and confuse the agent. For large outputs, writing results to files and returning concise pointers often works better. Use files as durable memory Long-running investigations benefit from a simple pattern: create a plan and checklist file, update it as work progresses, and let the agent re-read it when needed. This helps the agent recover from context compaction and gives the investigation a durable state inside the workspace. Prefer references over inline knowledge Agent Definition and skills includes domain knowledge. Internal troubleshooting guides, product behavior, operational history, and expert judgment. Instead of placing all that information directly in prompts, we found it more effective to provide references to where the knowledge lives and guidance on when to use it. Make the right repositories visible Coding agents are strong at reading code. For our scenarios, multi-repository workspaces were especially powerful. When the agent could see related repositories together, it could trace behavior across components, understand dependencies, and produce better analysis. Domain knowledge matters most The best agent assets were often created by engineers with deep product and operational experience, not necessarily by AI specialists. The key skill was turning expert knowledge into instructions, references, and repository layouts that an agent could use. Facilitate and streamline domain knowledge updates Every incident not fully handled by an agent is a learning opportunity. Feed the context engineering flywheel: investigate, find gaps, update agent guidance, then re-test. It's important to keep this cycle quick and easy. Why We Moved Toward Cloud Automation Coding agents were extremely helpful, but they were still interactive tools. An engineer had to start the investigation and often guide it. For incident response, we wanted to go further. If an incident entered a specific feature area, the system should be able to start the investigation automatically, run the relevant analysis, and post useful results back to the incident. Even if the analysis was not perfect, narrowing the problem space early could reduce mitigation time. Some scenarios could eventually support automatic mitigation or automatic transfer to the right team. A local coding-agent workflow has advantages, especially because it can authenticate as the user. But as a foundation for reliable automation, it also had important limitations. First, it still depended on human involvement. AI dramatically improved individual productivity, but in incident response the bottleneck is often human attention and time. Even when starting an agent is simple, requiring an engineer to initiate the run introduces a context switch and consumes a scarce resource. Second, it depended on user credentials. Coding agents run with the user’s permissions, which can be overly broad for automation, and they inherit human-oriented flows such as browser-based reauthentication. For durable automation, we wanted an identity model better suited to unattended execution, such as managed identity. Third, there were execution-environment and security concerns. A local environment is powerful, but it does not naturally provide the sandboxing we wanted for safe automation. Because it runs with user access, it may also reach a much wider set of files and resources than is desirable for an automated incident workflow. Local and dev-box environments also have operational drawbacks. They can require restarts, contend with other workloads, and are not ideal for durable execution, failure recovery, or failover. For automation, we wanted a dedicated execution environment rather than something tied to an engineer’s machine. Finally, token management became an operational concern. User-linked token consumption can create instability when limits are reached, and automation can skew usage patterns so that one user appears to consume a disproportionate share of AI capacity. That adds noise to operational analysis and makes governance harder. For all of these reasons, cloud execution looked like the right direction. We wanted managed identity, a secure sandbox, durable execution, and a system that would not depend on someone’s local machine. Requirements for Cloud Automation Many of us had become strong supporters of coding agents and wanted to keep using them. Just as importantly, we had already accumulated assets that had been proven to work well: agent definitions, instructions, prompts, skills, MCP configuration, and repository layouts that the team had gradually built up and refined. That meant our move toward cloud automation was not about replacing coding agents with something entirely different. We wanted to preserve and reuse the assets that had made coding agents successful, while moving to an execution model that was better suited to automation. At the same time, coding agents had set a high quality bar. Because they worked so well in practice, we were not willing to assume that a cloud service would automatically deliver the same level of quality. So we defined two concrete goals for the cloud path. Achieve the same level of quality we were seeing from our existing coding-agent workflows when run in a focused, one-shot investigation. Ensure the assets we had already built could continue to be used and improved. In other words, we were not looking for just another cloud AI system. We were looking for a cloud automation path that could inherit the strengths of coding agents while providing the operational properties automation required. Comparing Headless coding-agent execution service and Azure SRE Agent To evaluate which approach could meet those requirements, we ran a side-by-side comparison. One path was a prototype headless coding agent execution service. It reused the same the internal tool’s workspace definitions that engineers used locally, but ran them without a human in the loop. When an incident entered a target loop, the system created an agent workspace, prepared repositories, started GitHub Copilot CLI with an initial prompt, collected the analysis, and posted the result back to the incident. It also preserved session artifacts so that an engineer could later review or resume the investigation. Fig 4. Agent Helped Trend – It shows people use Coding Agent, the introduction of headless coding agent execution service and SRE Agent increases the percentage of usefulness. The other path used Azure SRE Agent, which had improved with preview customer feedback and was nearing general availability, since our earlier experiments. It now supported newer models, stronger custom-agent behavior, MCP and built-in tools, repository access, and incident-triggered execution. We’ve performed a one-time migration from the Coding Agent asset to the Azure SRE Agent asset. This was achieved in one day using GitHub Copilot CLI and our existing coding agents. The comparison was deliberately practical. We already knew that our internal coding-agent environment produced results engineers trusted and liked. That became our quality bar. If Azure SRE Agent could meet or exceed that bar while also satisfying the operational requirements of cloud automation, it would be the stronger long-term path. Results and Feedback Loop The first Headless coding-agent execution service results were very encouraging. In its first set of incidents, the RCA matched the SME conclusion in cases where the agent could safely process the incident. That showed that the assets we had built for local coding agents could transfer effectively into a headless scenario. Azure SRE Agent also performed strongly from the beginning. Headless coding-agent execution service initially had slightly better analysis in some areas, but Azure SRE Agent was already good enough to be operationally useful. We then built an evaluation framework that compared: Each agent’s RCA, confidence score, and mitigation steps The RCA and mitigation reason later provided by a human Auto-mitigation recommendations Path to auto-mitigation Auto-transfer recommendations Session-level execution issues This evaluation became a feedback loop. Engineers reviewed interesting incidents, identified weaknesses, improved agent definitions and skills, and submitted pull requests. We also used agent assistance to generate improvement PRs from comparison reports. Fig 5. LLM as Judge side-by-side eval for headless coding-agent execution service (blue) vs Azure SRE Agent (green) Within a few weeks, Azure SRE Agent’s quality consistently exceeded the headless coding-agent execution service baseline. At that point, we stopped posting headless results back to incidents and focused on improving the Azure SRE Agent path instead. We also automated synchronization from the internal coding-agent assets so improvements could continue to flow through pull requests. That shift was important. It meant Azure SRE Agent was no longer just an interesting alternative—it had become the cloud path that could inherit what worked in coding agents while providing a better foundation for automation. Why Cloud AI Started to Work Better A common reaction to coding agents is that they feel much improved than the previous cloud AI experiences. Our experience suggests two main reasons: stronger models and improved access to the right context. A coding agent sees a workspace. It can use instructions, skills, tools, repositories, and files. Traditional cloud AI systems often did not have access to the same set of resources. Once Azure SRE Agent could see similar assets - the right repositories, the right tools, and the right domain-specific knowledge - it could reach comparable or better quality. The details of context compaction, tool execution, and orchestration matter. But the core principle is simpler: the agent needs to reach the right knowledge at the right time without carrying unnecessary context all the time. That means the most important work is not only choosing a model or building a tool. It creates high-quality AI-ready assets: concise instructions, useful skills, accurate references, well-structured repository access, and domain knowledge that was previously locked in people’s heads. The cloud hosted automation path instantly provided an exciting benefit, which is that the issue analysis is stored in the cloud and not only on the developer's machine. This means that the conclusions and investigations are stored for perusal and human interaction is possible via the chat interface. Fig 6. An Example of the Chat Interface for Azure SRE Agent Conclusion Our journey started with a personal RCA assistant, moved through structured agentic workflows, accelerated with coding agents, and eventually led us back to a cloud-hosted automation path. The lesson is not that coding agents or cloud agents are universally better. The lesson is that agent quality depends heavily on what the agent can see, how much irrelevant context it avoids, and whether domain experts have translated their knowledge into usable assets. For us, the key was not abandoning coding agents. It was carrying their strengths forward into Azure SRE Agent and a cloud execution model that was better suited to automation. Modern agents are now capable enough to make that work worthwhile. For incident response, that opens the door to faster investigation, safer automation, and ultimately lower incident mitigation time for customers. The Azure Functions team hope this experience is useful to other teams exploring how to apply AI to complex engineering operations. In the next post, we plan to go deeper into the evaluation framework and how we automated the feedback loop behind these improvements.1.6KViews2likes1CommentManaging Multi‑Tenant Azure Resource with SRE Agent and Lighthouse
Azure SRE Agent is an AI‑powered reliability assistant that helps teams diagnose and resolve production issues faster while reducing operational toil. It analyzes logs, metrics, alerts, and deployment data to perform root cause analysis and recommend or execute mitigations with human approval. It’s capable of integrating with azure services across subscriptions and resource groups that you need to monitor and manage. Today’s enterprise customers live in a multi-tenant world, and there are multiple reasons to that due to acquisitions, complex corporate structures, managed service providers, or IT partners. Azure Lighthouse enables enterprise IT teams and managed service providers to manage resources across multiple azure tenants from a single control plane. In this demo I will walk you through how to set up Azure SRE agent to manage and monitor multi-tenant resources delegated through Azure Lighthouse. Navigate to the Azure SRE agent and select Create agent. Fill in the required details along with the deployment region and deploy the SRE agent. Once the deployment is complete, hit Set up your agent. Select the Azure resources you would like your agent to analyze like resource groups or subscriptions. This will land you to the popup window that allows you to select the subscriptions and resource groups that you would like SRE agent to monitor and manage. You can then select the subscriptions and resource groups under the same tenant that you want SRE agent to manage; Great, So far so good 👍 As a Managed Service Provider (MSP) you have multiple tenants that you are managing via Azure Lighthouse, and you need to have SRE agent access to those. So, to demo this will need to set up Azure Lighthouse with correct set of roles and configuration to delegate access to management subscription where the Centralized SRE agent is running. From Azure portal search Lighthouse. Navigate to the Lighthouse home page and select Manage your customers. On My customers Overview select Create ARM Template Provide a Name and Description. Select subscriptions on a Delegated scope. Select + Add authorization which will take you to Add authorization window. Select Principal type, I am selecting User for demo purposes. The pop-up window will allow Select users from the list. Select the checkbox next to the desired user who you want to delegate the subscription and hit Select Then select the Role that you would like to assign the user from the managing tenant to the delegated tenant and select add. You can add multiple roles by adding additional authorization to the selected user. This step is important to make sure the delegated tenant is assigned with the right role in order for SRE Agents to add it as Azure source. Azure SRE agent requires an Owner or User Administrator RBAC role to assign the subscription to the list of managed resources. If an appropriate role is not assigned, you will see an error when selecting the delegated subscriptions in SRE agent Managed resources. As per Lighthouse role support Owner role isn’t supported and User access Administrator role is supported, but only for limited purpose. Refer Azure Lighthouse documentation for additional information. If role is not defined correctly, you might see an error stating: 🛑Failed to add Role assignment “The 'delegatedRoleDefinitionIds' property is required when using certain roleDefinitionIds for authorization. To allow a principalId to assign roles to a managed identity in the customer tenant, set its roleDefinitionId to User Access Administrator. Download the ARM template and add specific Azure built-in roles that you want to grant in the delegatedRoleDefinitionIds property. You can include any supported Azure built-in role except for User Access Administrator or Owner. This example shows a principalId with User Access Administrator role that can assign two built in roles to managed identities in the customer tenant: Contributor and Log Analytics Contributor. { "principalId": "00000000-0000-0000-0000-000000000000", "principalIdDisplayName": "Policy Automation Account", "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9", "delegatedRoleDefinitionIds": [ "b24988ac-6180-42a0-ab88-20f7382dd24c", "92aaf0da-9dab-42b6-94a3-d43ce8d16293" ] } In addition SRE agent would require certain roles at the managed identity level in order to access and operate on those services. Locate SRE agent User assigned managed identity and add roles to the service principal. For the demo purpose I am assigning Reader, Monitoring Reader, and Log Analytics Reader role. Here is the sample ARM template used for this demo. { "$schema": "https://schema.management.azure.com/schemas/2019-08-01/subscriptionDeploymentTemplate.json#", "contentVersion": "1.0.0.0", "parameters": { "mspOfferName": { "type": "string", "metadata": { "description": "Specify a unique name for your offer" }, "defaultValue": "lighthouse-sre-demo" }, "mspOfferDescription": { "type": "string", "metadata": { "description": "Name of the Managed Service Provider offering" }, "defaultValue": "lighthouse-sre-demo" } }, "variables": { "mspRegistrationName": "[guid(parameters('mspOfferName'))]", "mspAssignmentName": "[guid(parameters('mspOfferName'))]", "managedByTenantId": "6e03bca1-4300-400d-9e80-000000000000", "authorizations": [ { "principalId": "504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "e40ec5ca-96e0-45a2-b4ff-59039f2c2b59", "principalIdDisplayName": "Pranab Mandal" }, { "principalId": "504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "18d7d88d-d35e-4fb5-a5c3-7773c20a72d9", "delegatedRoleDefinitionIds": [ "b24988ac-6180-42a0-ab88-20f7382dd24c", "92aaf0da-9dab-42b6-94a3-d43ce8d16293" ], "principalIdDisplayName": "Pranab Mandal" }, { "principalId": "504adfc5-da83-47d4-8709-000000000000", "roleDefinitionId": "b24988ac-6180-42a0-ab88-20f7382dd24c", "principalIdDisplayName": "Pranab Mandal" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "acdd72a7-3385-48ef-bd42-f606fba81ae7", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "43d0d8ad-25c7-4714-9337-8ba259a9fe05", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" }, { "principalId": "0374ff5c-5272-49fa-878a-000000000000", "roleDefinitionId": "73c42c96-874c-492b-b04d-ab87d138a893", "principalIdDisplayName": "sre-agent-ext-sub1-4n4y4v5jjdtuu" } ] }, "resources": [ { "type": "Microsoft.ManagedServices/registrationDefinitions", "apiVersion": "2022-10-01", "name": "[variables('mspRegistrationName')]", "properties": { "registrationDefinitionName": "[parameters('mspOfferName')]", "description": "[parameters('mspOfferDescription')]", "managedByTenantId": "[variables('managedByTenantId')]", "authorizations": "[variables('authorizations')]" } }, { "type": "Microsoft.ManagedServices/registrationAssignments", "apiVersion": "2022-10-01", "name": "[variables('mspAssignmentName')]", "dependsOn": [ "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]" ], "properties": { "registrationDefinitionId": "[resourceId('Microsoft.ManagedServices/registrationDefinitions/', variables('mspRegistrationName'))]" } } ], "outputs": { "mspOfferName": { "type": "string", "value": "[concat('Managed by', ' ', parameters('mspOfferName'))]" }, "authorizations": { "type": "array", "value": "[variables('authorizations')]" } } } Login to the customers tenant and navigate to the service provides from the Azure Portal. From the Service Providers overview screen, select Service provider offers from the left navigation pane. From the top menu, select the Add offer drop down and select Add via template. In the Upload Offer Template window drag and drop or upload the template file that was created in the earlier step and hit Upload. Once the file is uploaded, select Review + Create. This will take a few minutes to deploy the template, and a successful deployment page should be displayed. Navigate to Delegations from Lighthouse overview and validate if you see the delegated subscription and the assigned role. Once the Lighthouse delegation is set up sign in to the managing tenant and navigate to the deployed SRE agent. Navigate to Azure resources from top menu or via Settings > Managed resources. Navigate to Add subscriptions to select customers subscriptions that you need SRE agent to manage. Adding subscription will automatically add required permission for the agent. Once the appropriate roles are added, the subscriptions are ready for the agent to manage and monitor resources within them. Summary - Benefits This blog post demonstrates how Azure SRE Agent can be used to centrally monitor and manage Azure resources across multiple tenants by integrating it with Azure Lighthouse, a common requirement for enterprises and managed service providers operating in complex, multi-tenant environments. It walks through: Centralized SRE operations across multiple Azure tenants Secure, role-based access using delegated resource management Reduced operational overhead for MSPs and enterprise IT teams Unified visibility into resource health and reliability across customer environments557Views2likes1CommentWhat It Takes to Give SRE Agent a Useful Starting Point
In our latest posts, The Agent that investigates itself and Azure SRE Agent Now Builds Expertise Like Your Best Engineer Introducing Deep Context, we wrote about a moment that changed how we think about agent systems. Azure SRE Agent investigated a regression in its own prompt cache, traced the drop to a specific PR, and proposed fixes. What mattered was not just the model. What mattered was the starting point. The agent had code, logs, deployment history, and a workspace it could use to discover the next piece of context. That lesson forced an uncomfortable question about onboarding. If a customer finishes setup and the agent still knows nothing about their app, we have not really onboarded them. We have only created a resource. So for the March 10 GA release, we rebuilt onboarding around a more practical bar: can a new agent become useful on day one? To test that, we used the new flow the way we expect customers to use it. We connected a real sample app, wired up live Azure Monitor alerts, attached code and logs, uploaded a knowledge file, and then pushed the agent through actual work. We asked it to inspect the app, explain a 401 path from the source, debug its own log access, and triage GitHub issues in the repo. This post walks through that experience. We connected everything we could because we wanted to see what the agent does when it has a real starting point, not a partial one. If your setup is shorter, the SRE Agent still works. It just knows less. The cold start we were trying to fix The worst version of an agent experience is familiar by now. You ask a concrete question about your system and get back a smart-sounding answer that is only loosely attached to reality. The model knows what a Kubernetes probe is. It knows what a 500 looks like. It may even know common Kusto table names. But it does not know your deployment, your repo, your auth flow, or the naming mistakes your team made six months ago and still lives with. We saw the same pattern again and again inside our own work. When the agent had real context, it could do deep investigations. When it started cold, it filled the gaps with general knowledge and good guesses. The new onboarding is our attempt to close that gap up front. Instead of treating code, logs, incidents, and knowledge as optional extras, the flow is built around connecting the things the agent needs to reason well. Walking through the new onboarding Starting March 10, you can create and configure an SRE Agent at sre.azure.com. Here is what that looked like for us. Step 1: Create the agent You choose a subscription, resource group, name, and region. Azure provisions the runtime, managed identity, Application Insights, and Log Analytics workspace. In our run, the whole thing took about two minutes. That first step matters more than it may look. We are not just spinning up a chatbot. We are creating the execution environment where the agent can actually work: run commands, inspect files, query services, and keep track of what it learns. Step 2: Start adding context Once provisioning finishes, you land on the setup page. The page is organized around the sources that make the agent useful: code, logs, incidents, Azure resources, and knowledge files. Data source Why it matters Code Lets the agent read the system it is supposed to investigate. Logs Gives it real tables, schemas, and data instead of guesses. Incidents Connects the agent to the place where operational pain actually shows up. Azure resources Gives it the right scope so it starts in the right subscription and resource group. Knowledge files Adds the team-specific context that never shows up cleanly in telemetry. The page is blunt in a way we like. If you have not connected anything yet, it tells you the agent does not know enough about your app to answer useful questions. That is the right framing. The job of onboarding is to fix that. Step 3: Connect logs We started with Azure Data Explorer. The wizard supports Azure Kusto, Datadog, Elasticsearch, Dynatrace, New Relic, Splunk, and Hawkeye. After choosing Kusto, it generated the MCP connector settings for us. We supplied the cluster details, tested the connection, and let it discover the tools. This step removes a whole class of bad agent behavior. The model no longer has to invent table names or hope the cluster it wants is the cluster that exists. It knows what it can query because the connection is explicit. Step 4: Connect the incident platform For incidents, we chose Azure Monitor. This part is simple by design. If incidents are where the agent proves its value, connecting them should feel like the most natural part of setup, not a side quest. PagerDuty and ServiceNow work too, but for this walkthrough we kept it on Azure Monitor so we could wire real alerts to a real app. Step 5: Connect code Then we connected the code repo. We used microsoft-foundry/foundry-agent-webapp, a React and ASP.NET Core sample app running on Azure Container Apps. This is still the highest-leverage source we give the agent. Once the repo is connected, the agent can stop treating the app as an abstract web service. It can read the auth flow. It can inspect how health probes are configured. It can compare logs against the exact code paths that produced them. It can even look at the commit that was live when an incident happened. That changes the quality of the investigation immediately. Step 6: Scope the Azure resources Next we told the agent which resources it was responsible for. We scoped it to the resource group that contained the sample Container App. The wizard then set the roles the agent needed to observe and investigate the environment. That sounds like a small step, but it fixes another common failure mode. Agents do better when they start from the right part of the world. Subscription and resource-group scope give them that boundary. Step 7: Upload knowledge Last, we uploaded a Markdown knowledge file we wrote for the sample app. The file covered the app architecture, API endpoints, auth flow, likely failure modes, and the files we would expect an engineer to open first during debugging. We like Markdown here because it stays honest. It is easy for a human to read, easy for the agent to navigate, and easy to update as the system changes. All sources configured Once everything was connected, the setup panel turned green. At that point the agent had a repo, logs, incidents, Azure resources, and a knowledge file. That is the moment where onboarding stops being a checklist and starts being operational setup. The chat experience makes the setup visible When you open a new thread, the configuration panel stays at the top of the chat. If you expand it, you can see exactly what is connected and what is not. We built this because people should not have to guess what the agent knows. If code is connected and logs are not, that should be obvious. If incidents are wired up but knowledge files are missing, that should be obvious too. The panel makes the agent's working context visible in the same place where you ask it to think. It also makes partial setup less punishing. You do not have to finish every step before the agent becomes useful. But you can see, very clearly, what extra context would make the next answer better. What changed once the agent had context The easiest way to evaluate the onboarding is to look at the first questions we asked after setup. We started with a simple one: What do you know about the Container App in the rg-big-refactor resource group? The agent used Azure CLI to inspect the app, its revisions, and the system logs, then came back with a concise summary: image version, resource sizing, ingress, scale-to-zero behavior, and probe failures during cold start. It also correctly called out that the readiness probe noise was expected and not the root of a real outage. That answer was useful because it was grounded in the actual resource, not in generic advice about Container Apps. Then we asked a harder question: Based on the connected repo, what authentication flow does this app use? If a user reports 401s, what should we check first? The agent opened authConfig.ts, Program.cs, useAuth.ts, postprovision.ps1, and entra-app.bicep, then traced the auth path end to end. The checklist it produced was exactly the kind of thing we hoped onboarding would unlock: client ID alignment, identifier URI issues, redirect URI mismatches, audience validation, missing scopes, token expiry handling, and the single-tenant assumption in the backend. It even pointed to the place in Program.cs where extra logging could be enabled. Without the repo, this would have been a boilerplate answer about JWTs. With the repo, it read like advice from someone who had already been paged for this app before. We did not stop at setup. We wired real monitoring. A polished demo can make any agent look capable, so we pushed farther. We set up live Azure Monitor alerts for the sample web app instead of leaving the incident side as dummy data. We created three alerts: HTTP 5xx errors (Sev 1), for more than 3 server errors in 5 minutes Container restarts (Sev 2), to catch crash loops and OOMs High response latency (Sev 2), when average response time goes above 10 seconds The high-latency alert fired almost immediately. The app was scaling from zero, and the cold start was slow enough to trip the threshold. That was perfect. It gave us a real incident to put through the system instead of a fictional one. Incident response plans From the Builder menu, we created a response plan targeted at incidents with foundry-webapp in the title and severity 1 or 2. The incident that had just fired showed up in the learning flow. We used the actual codebase and deployment details to write the default plan: which files to inspect for failures, how to reason about health probes, and how to tell the difference between a cold start and a real crash. That felt like an important moment in the product. The response plan was not generic incident theater. It was anchored in the system we had just onboarded. One of the most useful demos was the agent debugging itself The sharpest proof point came when we tried to query the Log Analytics workspace from the agent. We expected it to query tables and summarize what it found. Instead, it hit insufficient_scope. That could have been a dead end. Instead, the agent turned the failure into the investigation. It identified the missing permissions, noticed there were two managed identities in play, told us which RBAC roles were required, and gave us the exact commands to apply them. After we fixed the access, it retried and ran a series of KQL queries against the workspace. That is where it found the next problem: Container Apps platform logs were present, but AppRequests, AppExceptions, and the rest of the App Insights-style tables were still empty. That was not a connector bug. It was a real observability gap in the sample app. The backend had OpenTelemetry packages, but the exporter configuration was not actually sending the telemetry we expected. The agent did not just tell us that data was missing. It explained which data was present, which data was absent, and why that difference mattered. That is the sort of thing we wanted this onboarding to set up: not just answering the first question, but exposing the next real thing that needs fixing. We also asked it to triage the repo backlog Once the repo was connected, it was natural to see how well the agent could read open issues against the code. We pointed it at the three open GitHub issues in the sample repo and asked it to triage them. It opened the relevant files, compared the code to the issue descriptions, and came back with a clear breakdown: Issue #21, @fluentui-copilot is not opensource? Partially valid, low severity. The package is public and MIT licensed. The real concern is package maturity, not licensing. Issue #20, SDK fails to deserialize agent tool definitions Confirmed, medium severity. The agent traced the problem to metadata handling in AgentFrameworkService.cs and suggested a safe fallback path. Issue #19, Create Preview experience from AI Foundry is incomplete Confirmed, medium severity. The agent found the gap between the environment variables people are told to paste and the variables the app actually expects. What stood out to us was not just that the output was correct. It was that the agent was careful. It did not overclaim. It separated a documentation concern from two real product bugs. Then it asked whether we wanted it to start implementing the fixes. That is the posture we want from an engineering agent: useful, specific, and a little humble. What the onboarding is really doing After working through the whole flow, we do not think of onboarding as a wizard anymore. We think of it as the process of giving the agent a fair shot. Each connection removes one reason for the model to bluff: Code keeps it from guessing how the system works. Logs keep it from guessing what data exists. Incidents keep it close to operational reality. Azure resource scope keeps it from wandering. Knowledge files keep team-specific context from getting lost. This is the same lesson we learned building the product itself. The agent does better when it can discover context progressively inside a world that is real and well-scoped. Good onboarding is how you create that world. Closing The main thing we learned from this work is simple: onboarding is not done when the resource exists. It is done when the agent can help with a real problem. In one setup we were able to connect a real app, fire a real alert, create a real response plan, debug a real RBAC problem, inspect real logs, and triage real GitHub issues. That is a much better standard than "the wizard completed successfully." If you try SRE Agent after GA, start there. Connect the things that make your system legible, then ask a question that would actually matter during a bad day. The answer will tell you very quickly whether the agent has a real starting point. Create your SRE Agent -> Azure SRE Agent is generally available starting March 10, 2026.954Views2likes0Comments