An AI-led SDLC: Building an End-to-End Agentic Software Development Lifecycle with Azure and GitHub
This is due to the inevitable move towards fully agentic, end-to-end SDLCs. We may not yet be at a point where software engineers are managing fleets of agents creating the billion-dollar AI abstraction layer, but (as I will evidence in this article) we are certainly on the precipice of such a world. Before we dive into the reality of agentic development today, let me examine two very different modules from university and their relevance in an AI-first development environment.

Manual Requirements Translation. At university I dedicated two whole years to a unit called “Systems Design”. This was one of my favourite units, primarily focused on requirements translation. Often, I would receive a scenario between “The Proprietor” and “The Proprietor’s wife”, who seemed to be in a never-ending cycle of new product ideas. These tasks would be analysed, broken down, manually refined, and then mapped to some kind of early-stage application architecture (potentially some pseudo-code and a UML diagram or two). The big intellectual effort in this exercise was taking human intention and turning it into something tangible to build from (the bread and butter of business analysts). Today, by the time I have opened Notepad and started to decipher requirements, an agent can already have created a comprehensive requirements list, a service blueprint, and a code scaffold to start the process (*cough* Spec Kit *cough*).

Manual debugging. Need I say any more? Old-school debugging with print() statements and breakpoints is dead. I spent countless hours learning to debug in a classroom and then later with my own software, stepping through execution line by line, reading through logs, and understanding what to look for; where correlation did and didn’t mean causation.
I think back to my year at IBM as a fresh-faced intern in a cloud engineering team, where around 50% of my time was spent debugging different issues until they were sufficiently “narrowed down”, and then reading countless Stack Overflow posts figuring out the actual change I would need to make to a PowerShell script or Jenkins pipeline. Already in Azure, with the emergence of SRE agents, that debug process looks entirely different. The debug process for software even more so…

#terminallastcommand WHY IS THIS NOT RUNNING?
#terminallastcommand Review these logs and surface errors relating to XYZ.

As I said: breakpoints are dead, for now at least.

Caveat – Is this a good thing?

One more deviation from the main core of the article if you would be so kind (if you are not as kind, skip to the implementation walkthrough below). Is this actually a good thing? Is a software engineering degree now worthless? What if I love printf()? I don’t know is my answer today, at the start of 2026. Two things worry me: one theoretical and one very real.

To start with the theoretical: today AI takes a significant amount of the “donkey work” away from developers. How does this impact cognitive load at both ends of the spectrum? The list that “donkey work” encapsulates is certainly growing. As a result, at one end of the spectrum humans are left with only the complicated parts yet to be within an agent’s remit. This could have quite an impact on our ability to perform tasks. If we are constantly dealing with the complex and advanced, when do we have time to re-root ourselves in the foundations? Will we see an increase in developer burnout? How do technical people perform without the mundane or routine tasks? I often hear people who have been in the industry for years discuss how simple infrastructure, computing, development, etc. were 20 years ago, almost with a longing to return to a world where today’s zero trust, globally replicated architectures are a twinkle in an architect’s eye.
Is constantly working on only the most complex problems a good thing? At the other end of the spectrum, what if the performance of AI tooling and agents outperforms our wildest expectations? Suddenly, AI tools and agents are picking up more and more of today’s complicated and advanced tasks. Will developers, architects, and organisations lose some ability to innovate? Fundamentally, we are not talking about artificial general intelligence when we say AI; we are talking about incredibly complex predictive models that can augment the existing ideas they are built upon but are not, in themselves, innovators. Put simply, in the words of Scott Hanselman: “Spicy auto-complete”. Does increased reliance on these agents in more and more of our business processes remove the opportunity for innovative ideas? For example, if agents were football managers, would we ever have graduated from Neil Warnock and Mick McCarthy football to Pep? Would every agent just augment a ‘lump it long and hope’ approach? We hear about learning loops, but can these learning loops evolve into “innovation loops”?

Past the theoretical and the game of 20 questions, the very real concern I have comes off the back of some data shared recently on Stack Overflow traffic. We can see in the diagram below that Stack Overflow traffic has dipped significantly since the release of GitHub Copilot in October 2021, and as the product has matured that trend has only accelerated. Data from 12 months ago suggests that Stack Overflow has lost 77% of new questions compared to 2022… Stack Overflow democratises access to problem-solving (I have to be careful not to talk in the past tense here), but I will admit I cannot remember the last time I was reviewing Stack Overflow or furiously searching through solutions vaguely similar to my own issue. This causes some concern over the data available in the future to train models. Today, models can be grounded in real, tested scenarios built by developers in anger.
What happens with this question drop when API schemas change, when the technology built for today is old and deprecated, and the dataset is stale and never returning to its peak? How do we mitigate this impact? There is potential for some closed-loop type continuous improvement in the future, but do we think this is a scalable solution? I am unsure. So, back to the question: “Is this a good thing?”. It’s great today; the long-term impacts are yet to be seen. If we think that AGI may never be achieved, or is at least a very distant horizon, then understanding the foundations of your technical discipline is still incredibly important. Developers will not only be the managers of their fleet of agents, but also the janitors mopping up the mess when there is an accident (albeit likely mopping with AI-augmented tooling).

An AI First SDLC Today – The Reality

Enough reflection and nostalgia (I don’t think that’s why you clicked the article), let’s start building something. For the rest of this article I will be building an AI-led, agent-powered software development lifecycle. The example I will be building is an AI-generated weather dashboard. It’s a simple example, but if agents can generate, test, deploy, observe, and evolve this application, it proves that today, and into the future, the process can likely scale to more complex domains.

Let’s start with the entry point: the problem statement that we will build from. “As a user I want to view real time weather data for my city so that I can plan my day.” We will use this as the single input for our AI-led SDLC. This is what we will pass to Spec Kit and watch our app and subsequent features built in front of our eyes. The goal is that we will use:

- Spec Kit to get going and move from textual idea to requirements and scaffold.
- A coding agent to implement our plan.
- A quality agent to assess the output and quality of the code.
- GitHub Actions that not only host the agents (abstracted) but also handle the build and deployment.
- An SRE agent proactively monitoring and opening issues automatically.

The end-to-end flow that we will review through this article is the following:

Step 1: Spec-driven development - Spec First, Code Second

A big piece of realising an AI-led SDLC today relies on spec-driven development (SDD). One of the best summaries for SDD that I have seen is: “Version control for your thinking”. Instead of huge specs that are stale and buried in a knowledge repository somewhere, SDD looks to make them a first-class citizen within the SDLC. Architectural decisions, business logic, and intent can be captured and versioned as a product evolves; an executable artefact that evolves with the project. In 2025, GitHub released the open-source Spec Kit: a tool that enables the goal of placing a specification at the centre of the engineering process. Specs drive the implementation, checklists, and task breakdowns, steering an agent towards the end goal. This article from GitHub does a great job explaining the basics, so if you’d like to learn more it’s a great place to start (https://github.blog/ai-and-ml/generative-ai/spec-driven-development-with-ai-get-started-with-a-new-open-source-toolkit/). In short, Spec Kit generates requirements, a plan, and tasks to guide a coding agent through an iterative, structured development process. Through the Spec Kit constitution, organisational standards and tech-stack preferences are adhered to throughout each change.

I did notice one (likely intentional) gap in functionality which, if closed, would cement Spec Kit’s role in an autonomous SDLC: the implement stage is designed to run within an IDE or client coding agent. You can now, in the IDE, toggle between task implementation locally or with an agent in the cloud. That is great, but it still requires you to drive through the IDE.
Thinking about this in the context of an AI-led SDLC (where we are pushing tasks from Spec Kit to a coding agent outside of my own desktop), it was clear that a bridge was needed. As a result, I used Spec Kit to create the Spec-to-issue tool. This allows us to take the tasks and plan generated by Spec Kit, parse the important parts, and automatically create a GitHub issue, with the option to auto-assign the coding agent. From the perspective of an autonomous AI-led SDLC, Spec Kit really is the entry point that triggers the flow. How Spec Kit is surfaced to users will vary depending on the organisation and the context of the users.

For the rest of this demo I use Spec Kit to create a weather app calling out to the OpenWeather API, and then add additional features with new specs. With one simple prompt of /speckit.specify “Application feature/idea/change” I suddenly had a really clear breakdown of the tasks and plan required to get to my desired end state, while respecting the context and preferences I had previously set in my Spec Kit constitution. I had specified a desire for test-driven development, a required level of test coverage, and that all solutions were to be Azure native. The real benefit here, compared to prompting the coding agent directly, is that breaking one large task down into small, measurable components that are clear and methodical improves the coding agent’s ability to perform them by a considerable degree. We can see an example below of not just creating a whole application, but another spec used to iterate on an existing application and add a feature. We can see the result of the spec creation, the issue in our GitHub repo, and, most importantly for the next step, that our coding agent, GitHub Copilot, has been assigned automatically.
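To give a feel for what such a bridge involves, here is a minimal sketch of the parse-and-create-issue flow. This is an assumption-laden illustration, not the real Spec-to-issue tool: the checklist format, the helper names, and the coding agent's assignee handle are all hypothetical, and the `gh` call requires an authenticated GitHub CLI.

```python
import re
import subprocess

def parse_tasks(tasks_md: str) -> list[str]:
    """Pull the unchecked checklist items (- [ ] ...) out of a Spec Kit tasks.md."""
    return re.findall(r"^- \[ \] (.+)$", tasks_md, flags=re.MULTILINE)

def build_issue_body(spec_summary: str, tasks: list[str]) -> str:
    """Assemble a GitHub issue body from the spec summary and remaining tasks."""
    checklist = "\n".join(f"- [ ] {task}" for task in tasks)
    return f"## Spec\n{spec_summary}\n\n## Tasks\n{checklist}"

def create_issue(title: str, body: str) -> None:
    """Create the issue via the GitHub CLI and hand it to the coding agent.

    The assignee handle for Copilot's coding agent is an assumption here;
    check your organisation's setup for the correct actor name.
    """
    subprocess.run(
        ["gh", "issue", "create", "--title", title,
         "--body", body, "--assignee", "copilot-swe-agent"],
        check=True,
    )
```

The heavy lifting is in the parsing: once the tasks are a clean checklist, the issue itself is a one-liner with the `gh` CLI, and auto-assignment is what triggers the coding agent in the next step.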
Step 2: GitHub Coding Agent - Iterative, autonomous software creation

Talking of coding agents, GitHub Copilot’s coding agent is an autonomous agent in GitHub that can take a scoped development task and work on it in the background using the repository’s context. It can make code changes and produce concrete outputs like commits and pull requests for a developer to review. The developer stays in control by reviewing, requesting changes, or taking over at any point. This does the heavy lifting in our AI-led SDLC. We have already seen great success with customers who have adopted the coding agent when it comes to carrying out menial tasks to save developers time. These coding agents can work in parallel with human developers and with each other. In our example we see that the coding agent creates a new branch for its changes and opens a PR, which it works through as it ticks off the various tasks generated in our spec.

One huge positive of the coding agent that sets it apart from other similar solutions is the transparency in decision-making and actions taken. The monitoring and observability built directly into the feature means that the agent’s “thinking” is easily visible: the iterations and steps being taken can be viewed in full sequence in the Agents tab. Furthermore, the action that the agent is running is also transparently available to view in the Actions tab, meaning problems can be assessed very quickly. Once the coding agent is finished, it has run the required tests and, even in the case of a UI change, goes as far as calling the Playwright MCP server and screenshotting the change to showcase in the PR. We are then asked to review the change. In this demo, I also created a GitHub Action that is triggered when a PR review is requested: it creates the required resources in Azure and surfaces the (in this case) Azure Container Apps revision URL, making it even smoother for the human in the loop to evaluate the changes.
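As a rough sketch of the "surface the preview URL" part of that Action: the script below asks the `az` CLI for the Container App's latest revision FQDN and builds the PR comment. The JMESPath query path (`properties.latestRevisionFqdn`), the comment wording, and the use of the `az` and `gh` CLIs are assumptions about one way to wire this, not the exact implementation from the demo.

```python
import json
import subprocess

def latest_revision_url(app_name: str, resource_group: str) -> str:
    """Fetch the latest revision FQDN for a Container App via the az CLI.

    Assumes the resource exposes properties.latestRevisionFqdn; adjust the
    query if your API version differs.
    """
    result = subprocess.run(
        ["az", "containerapp", "show",
         "--name", app_name, "--resource-group", resource_group,
         "--query", "properties.latestRevisionFqdn", "--output", "json"],
        check=True, capture_output=True, text=True,
    )
    return f"https://{json.loads(result.stdout)}"

def review_comment(url: str) -> str:
    """Body of the PR comment that surfaces the preview environment."""
    return f"Preview deployment is live for this PR: {url}"

# Posting back to the PR would run inside the workflow, e.g.:
# subprocess.run(["gh", "pr", "comment", pr_number,
#                 "--body", review_comment(url)], check=True)
```

Keeping this step as a plain deterministic script is deliberate: the reviewer always gets the same comment, in the same place, for every PR.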
Just like any normal PR, if changes are required comments can be left; when they are, the coding agent can pick them up and action what is needed. It’s also worth noting that for any manual intervention here, use of GitHub Codespaces would work very well to make minor changes or perform testing on an agent’s branch. We can even see that the unit tests specified in our spec have been executed by our coding agent.

The pattern used here (Spec Kit -> coding agent) overcomes one of the biggest challenges we see with the coding agent. Unlike an IDE-based coding agent, the GitHub.com coding agent is left to its own iterations and implementation without input until the PR review. This can lead to subpar performance, especially compared to IDE agents, which have constant input and interruption. The concise and considered breakdown generated from Spec Kit provides the structure and foundation for the agent to execute on; very little is left to interpretation for the coding agent.

Step 3: GitHub Code Quality Review (Human in the loop with agent assistance)

GitHub Code Quality is a feature (currently in preview) that proactively identifies code quality risks and opportunities for enhancement, both in PRs and through repository scans. These are surfaced within a PR and also in repo-level scoreboards. This means that PRs can now extend existing static code analysis: Copilot can action CodeQL, PMD, and ESLint scanning on top of the new, in-context code quality findings and autofixes. Furthermore, we receive a summary of the actual changes made. This can be used to assist the human in the loop in understanding what changes have been made and whether enhancements or improvements are required. Thinking about this in the context of review coverage, one of the challenges in already-lean development teams is finding the time to give proper scrutiny to PRs. Now, with AI-assisted quality scanning, we can be more confident in our overall evaluation and test coverage.
I would expect that use of these tools alongside existing human review processes would increase repository code quality and reduce uncaught errors. The data points support this too. The Qodo 2025 AI Code Quality report showed that usage of AI code reviews increased quality improvements to 81% (from 55%). A similar Atlassian RovoDev study from 2026 showed that 38.7% of comments left by AI agents in code reviews lead to additional code fixes. LLMs in their current form are never going to achieve 100% accuracy; however, these are still considerable, significant gains in one of the most important (and often neglected) parts of the SDLC. With a significant number of software supply chain attacks recently, it is also not a stretch to imagine that many projects could benefit from "independently" (I use this term loosely) reviewed and summarised PRs and commits. In the future this could potentially be handled by a specialist sub-agent during a PR or merge, focused on identifying malicious code that may be hidden within otherwise normal contributions; case in point being the "near-miss" XZ Utils attack.

Step 4: GitHub Actions for build and deploy - No agents here, just deterministic automation

This step will be our briefest, as the idea of CI/CD and automation needs no introduction. It is worth noting that while I am sure there are additional opportunities for using agents within a build and deploy pipeline, I have not investigated them. I often speak with customers about deterministic and non-deterministic business process automation, and the importance of distinguishing between the two. Some processes were created to be deterministic because that is all that was available at the time; the number of conditions required to deal with N possible flows just did not scale. However, now those processes can be non-deterministic.
Good examples include IVR decision trees in customer service, or hard-coded sales routines to retain a customer regardless of context; these would benefit from less determinism in their execution. However, some processes remain best as deterministic flows: financial transactions, policy engines, document ingestion. While all these flows may be part of an AI solution in the future (possibly as a tool an agent calls, or as part of a larger agent-based orchestration), the processes themselves are deterministic for a reason. Just because we could have dynamic decision-making doesn’t mean we should. Infrastructure deployment and CI/CD pipelines are one good example of this, in my opinion. We could have an agent decide which service best fits our codebase and which region we should deploy to, but do we really want to, and do the benefits outweigh the potential negatives?

In this process flow we use a deterministic GitHub Action to deploy our weather application into our “development” environment, and then promote it through the environments until we reach production, where we want to ensure that the application is running smoothly. We also use an Action, as mentioned above, to deploy and surface our agent’s changes. In Azure Container Apps we can do this in a secure sandbox environment called a “Dynamic Session” to ensure strong isolation of what is essentially “untrusted code”.

Enterprises can often view the building and development of AI applications as something that requires a completely new process to take to production. While certain additional processes are new (evaluation, model deployment, etc.), many of our traditional SDLC principles are just as relevant as ever, CI/CD pipelines being a great example: checked-in code that is predictably deployed alongside the services required to run tests or promote through environments. Whether you are deploying a Java calculator app or a multi-agent customer service bot, CI/CD even in this new world is a non-negotiable.
We can see that our geolocation feature is running on our Azure Container Apps revision, and we can begin to evaluate whether we agree with Copilot that all the feature requirements have been met. In this case they have. If they hadn't, we'd just jump into the PR and add a new comment with "@copilot" requesting our changes.

Step 5: SRE Agent - Proactive agentic day two operations

The SRE Agent service on Azure is an operations-focused agent that continuously watches a running service using telemetry such as logs, metrics, and traces. When it detects incidents or reliability risks, it can investigate signals, correlate likely causes, and propose or initiate response actions such as opening issues, creating runbook-guided fixes, or escalating to an on-call engineer. It effectively automates parts of day two operations while keeping humans in control of approval and remediation. It can be run in two different permission models: one with a reader role that can temporarily take a user’s permissions for approved actions when needed; the other a privileged level that allows it to autonomously take approved actions on resources and resource types within the resource groups it is monitoring. In our example, our SRE agent could take actions to ensure our container app runs as intended: restarting pods, changing traffic allocations, and alerting for secret expiry. The SRE agent can also perform detailed debugging to save human SREs time, summarising the issue and fixes tried so far, and narrowing down potential root causes to reduce time to resolution, even across the most complex issues. My initial concern with these types of autonomous fixes (be it VPA on Kubernetes or an SRE agent across your infrastructure) is always that they can very quickly mask problems, or become an anti-pattern where you have drift between your IaC and what is actually running in Azure. One of my favourite features of SRE agents is sub-agents.
Sub-agents can be created to handle very specific tasks that the primary SRE agent can leverage. Examples include alerting, report generation, and potentially other third-party integrations or tooling that require a more concise context. In my example, I created a GitHub sub-agent to be called by the primary agent after every issue that is resolved. When called, the GitHub sub-agent creates an issue summarising the origin, context, and resolution. This really brings us full circle. We can then potentially assign this to our coding agent to implement the fix before we proceed with the rest of the cycle; for example, a change where a port is incorrect in some Bicep, or min scale has been adjusted because of latency observed by the SRE agent. These are quick fixes that can be easily implemented by a coding agent, subsequently creating an autonomous feedback loop with human review.

Conclusion

The journey through this AI-led SDLC demonstrates that it is possible, with today’s tooling, to improve any existing SDLC with AI assistance, evolving from simply using a chat interface in an IDE. By combining Spec Kit, spec-driven development, autonomous coding agents, AI-augmented quality checks, deterministic CI/CD pipelines, and proactive SRE agents, we see an emerging ecosystem where human creativity and oversight guide an increasingly capable fleet of collaborative agents. As with all AI solutions we design today, I remind myself that “this is as bad as it gets”. If the last two years are anything to go by, the rate of change in this space means this article may look very different in 12 months. I imagine Spec-to-issue will no longer be required as a bridge, as native solutions evolve to make this process even smoother. There are also some areas of an AI-led SDLC that are not included in this post, things like reviewing the inner-loop process or the use of existing enterprise patterns and blueprints.
I also did not review use of third-party plugins or tools available through GitHub. These would make for an interesting expansion of the demo. We also did not look at the creation of custom coding agents, which could be hosted in Microsoft Foundry; this is especially pertinent with the recent announcement of Anthropic models now being available to deploy in Foundry. Does today’s tooling mean that developers, QAs, and engineers are no longer required? Absolutely not (and if I am honest, I can’t see that changing any time soon). However, it is evidently clear that in the next 12 months, enterprises who reshape their SDLC (and any other business process) to become one augmented by agents will innovate faster, learn faster, and deliver faster, leaving organisations who resist this shift struggling to keep up.

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself.

Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis.

First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives.

The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline.

That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible.

Where we started

In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction.
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start.

We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale.

The Inversion: Three bets

The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself.

The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance.
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last.

Bet 1: The Filesystem as the Agent's World

Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it.

Code Repositories: the highest-leverage context

Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it.
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.

Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.

Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In the SRE context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed. The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism. Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with the gh CLI, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.

Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

- Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
- Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
- Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
- Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast.
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal.

Tool result compression via the filesystem

Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive.

Context Pruning and Auto-Compact

Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs, keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work.

Parallel subagents

The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another.

The feedback loop

These architectural bets have enabled us to close the original scaling gap.
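The tool-result-compression pattern described above can be sketched as follows (the data is synthetic; in the real system the framework writes the file and the agent's shell tools do the extraction):

```python
import json
import os
import tempfile

# Synthetic "large tool output": e.g., a 10,000-row log query result.
rows = [{"status": 500 if i % 50 == 0 else 200, "ms": i % 300} for i in range(10_000)]

# The full output goes to disk, not into the model's context window.
path = os.path.join(tempfile.mkdtemp(), "query_result.json")
with open(path, "w") as f:
    json.dump(rows, f)

# The agent post-processes the file (here with Python; grep or jq work too)
# so that only a small derived summary ever enters context.
with open(path) as f:
    data = json.load(f)
summary = {"rows": len(data), "errors": sum(1 for r in data if r["status"] >= 500)}
```

Ten thousand rows stay on disk; two numbers enter the context window.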
Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway, and some conversations broke entirely. So we set up a daily monitoring task for these failures. The agent searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

- Analyzed our user churn rate and built dashboards we now review weekly.
- Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
- Ran security analysis and found vulnerabilities in the read path.
- Helped fill out parts of its own Responsible AI review, with strict human review.
- Handles customer-reported issues and LiveSite alerts end to end. Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again.

The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.

Context Engineering Lessons from Building Azure SRE Agent
We started with 100+ tools and 50+ specialized agents. We ended with 5 core tools and a handful of generalists. The agent got more reliable, not less. Every context decision is a tradeoff: latency vs autonomy, evidence-building vs speed, oversight - and the cost of being wrong. This post is a practical map of those knobs and how we adjusted them for SRE Agent.

Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent — your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

Expanding the Public Preview of the Azure SRE Agent
We are excited to share that the Azure SRE Agent is now available in public preview for everyone instantly – no sign-up required. A big thank you to all our preview customers who provided feedback and helped shape this release! Watching teams put the SRE Agent to work taught us a ton, and we’ve baked those lessons into a smarter, more resilient, and enterprise-ready experience. You can now find Azure SRE Agent directly in the Azure Portal and get started, or use the link below.

📖 Learn more about SRE Agent.
👉 Create your first SRE Agent (Azure login required)

What’s New in Azure SRE Agent - October Update

The Azure SRE Agent now delivers secure-by-default governance, deeper diagnostics, and extensible automation—built for scale. It can even resolve incidents autonomously by following your team’s runbooks. With native integrations across Azure Monitor, GitHub, ServiceNow, and PagerDuty, it supports root cause analysis using both source code and historical patterns. And since September 1, billing and reporting are available via Azure Agent Units (AAUs). Please visit the product documentation for the latest updates. Here are a few highlights for this month:

Prioritizing enterprise governance and security: By default, the Azure SRE Agent operates with least-privilege access and never executes write actions on Azure resources without explicit human approval. Additionally, it uses role-based access control (RBAC) so organizations can assign read-only or approver roles, providing clear oversight and traceability from day one. This allows teams to choose their desired level of autonomy, from read-only insights to approval-gated actions to full automation, without compromising control.

Covering the breadth and depth of Azure: The Azure SRE Agent helps teams manage and understand their entire Azure footprint. With built-in support for AZ CLI and kubectl, it works across all Azure services.
But it doesn’t stop there—diagnostics are enhanced for platforms like PostgreSQL, API Management, Azure Functions, AKS, Azure Container Apps, and Azure App Service. Whether you're running microservices or managing monoliths, the agent delivers consistent automation and deep insights across your cloud environment.

Automating Incident Management: The Azure SRE Agent now plugs directly into Azure Monitor, PagerDuty, and ServiceNow to streamline incident detection and resolution. These integrations let the Agent ingest alerts and trigger workflows that match your team’s existing tools—so you can respond faster, with less manual effort.

Engineered for extensibility: The Azure SRE Agent incident management approach lets teams reuse existing runbooks and customize response plans to fit their unique workflows. Whether you want to keep a human in the loop or empower the Agent to autonomously mitigate and resolve issues, the choice is yours. This flexibility gives teams the freedom to evolve—from guided actions to trusted autonomy—without ever giving up control.

Root cause, meet source code: The Azure SRE Agent now supports code-aware root cause analysis (RCA) by linking diagnostics directly to source context in GitHub and Azure DevOps. This tight integration helps teams trace incidents back to the exact code changes that triggered them—accelerating resolution and boosting confidence in automated responses. By bridging operational signals with engineering workflows, the agent makes RCA faster, clearer, and more actionable.

Close the loop with DevOps: The Azure SRE Agent now generates incident summary reports directly in GitHub and Azure DevOps—complete with diagnostic context. These reports can be assigned to a GitHub Copilot coding agent, which automatically creates pull requests and merges validated fixes. Every incident becomes an actionable code change, driving permanent resolution instead of temporary mitigation.
Getting Started

- Start here: Create a new SRE Agent in the Azure portal (Azure login required)
- Blog: Announcing a flexible, predictable billing model for Azure SRE Agent
- Blog: Enterprise-ready and extensible – Update on the Azure SRE Agent preview
- Product documentation
- Product home page

Community & Support

We’d love to hear from you! Please use our GitHub repo to file issues, request features, or share feedback with the team.

Reimagining AI Ops with Azure SRE Agent: New Automation, Integration, and Extensibility features
Azure SRE Agent offers intelligent and context-aware automation for IT operations. Enhanced by customer feedback from our preview, the SRE Agent has evolved into an extensible platform to automate and manage tasks across Azure and other environments. Built on an Agentic DevOps approach - drawing from proven practices in internal Azure operations - the Azure SRE Agent has already saved over 20,000 engineering hours across Microsoft product teams’ operations, delivering strong ROI for teams seeking sustainable AIOps.

An Operations Agent that adapts to your playbooks

Azure SRE Agent is an AI-powered operations automation platform that empowers SREs, DevOps, IT operations, and support teams to automate tasks such as incident response, customer support, and developer operations from a single, extensible agent. Its value proposition and capabilities have evolved beyond diagnosis and mitigation of Azure issues to automating operational workflows and integrating seamlessly with the standards and processes used in your organization. SRE Agent is designed to automate operational work and reduce toil, enabling developers and operators to focus on high-value tasks. By streamlining repetitive and complex processes, SRE Agent accelerates innovation and improves reliability across cloud and hybrid environments. In this article, we will look at what’s new and what has changed since the last update.

What’s New: Automation, Integration, and Extensibility

Azure SRE Agent just got a major upgrade. From no-code automation to seamless integrations and expanded data connectivity, here’s what’s new in this release:

- No-code Sub-Agent Builder: Rapidly create custom automations without writing code.
- Flexible, event-driven triggers: Instantly respond to incidents and operational changes.
- Expanded data connectivity: Unify diagnostics and troubleshooting across more data sources.
- Custom actions: Integrate with your existing tools and orchestrate end-to-end workflows via MCP.
- Prebuilt operational scenarios: Accelerate deployment and improve reliability out of the box.

Unlike generic agent platforms, Azure SRE Agent comes with deep integrations, prebuilt tools, and frameworks specifically for IT, DevOps, and SRE workflows. This means you can automate complex operational tasks faster and more reliably, tailored to your organization’s needs.

Sub-Agent Builder: Custom Automation, No Code Required

Empower teams to automate repetitive operational tasks without coding expertise, dramatically reducing manual workload and development cycles. This feature addresses the need for targeted automation, letting teams solve specific operational pain points without relying on one-size-fits-all solutions.

- Modular Sub-Agents: Easily create custom sub-agents tailored to your team’s needs. Each sub-agent can have its own instructions, triggers, and toolsets, letting you automate everything from outage response to customer email triage.
- Prebuilt System Tools: Eliminate the inefficiency of creating basic automation from scratch; choose from a rich library of hundreds of built-in tools for Azure operations, code analysis, deployment management, diagnostics, and more.
- Custom Logic: Align automation to your unique business processes by defining your automation logic and prompts, teaching the agent to act exactly as your workflow requires.

Flexible Triggers: Automate on Your Terms

Invoke the agent to respond automatically to mission-critical events rather than waiting for manual commands. This speeds up incident response and eliminates missed opportunities for efficiency.

- Multi-Source Triggers: Go beyond chat-based interactions and trigger the agent automatically from incident management and ticketing systems like PagerDuty and ServiceNow, observability alerting systems like Azure Monitor Alerts, or even on a cron-based schedule for proactive monitoring and best-practices checks.
Additional trigger sources such as GitHub issues, Azure DevOps pipelines, and email will be added over time. This means automation can start exactly when and where you need it.

- Event-Driven Operations: Integrate with your CI/CD, monitoring, or support systems to launch automations in response to real-world events, like deployments, incidents, or customer requests. Vital for reducing downtime, this ensures that business-critical actions happen automatically and promptly.

Expanded Data Connectivity: Unified Observability and Troubleshooting

Integrate data to enable comprehensive diagnostics and troubleshooting, and make faster, more informed decisions by eliminating silos and speeding up issue resolution.

- Multiple Data Sources: The agent can now read data from Azure Monitor, Log Analytics, and Application Insights based on its Azure role-based access control (RBAC). Additional observability data sources such as Dynatrace, New Relic, and Datadog can be added via the Remote Model Context Protocol (MCP) servers for these tools. This gives you a unified view for diagnostics and automation.
- Knowledge Integration: Rather than manually detailing every instruction in your prompt, you can upload your Troubleshooting Guide (TSG) or Runbook directly, allowing the agent to automatically create an execution plan from the file. You may also connect the agent to resources like SharePoint, Jira, or documentation repositories through Remote MCP servers, enabling it to retrieve needed files on its own. This approach leverages your organization’s existing knowledge base, streamlining onboarding and enhancing consistency in managing incidents.

Azure SRE Agent is also building multi-agent collaboration by integrating with PagerDuty and Neubird, enabling advanced, cross-platform incident management and reliability across diverse environments.
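Under the hood, a Remote MCP server is simply an HTTP endpoint speaking the MCP JSON-RPC protocol. As a rough, stdlib-only sketch of the shape of one custom action (illustrative only: a production server would use an MCP SDK, implement the full initialize handshake, and require authentication; `restart_widget` is a made-up stand-in for a proprietary internal action):

```python
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

def restart_widget(name: str) -> str:
    # Stand-in for a proprietary internal action.
    return f"widget '{name}' restarted"

class McpSketchHandler(BaseHTTPRequestHandler):
    def do_POST(self):
        length = int(self.headers.get("Content-Length", 0))
        req = json.loads(self.rfile.read(length))
        if req.get("method") == "tools/list":
            # Advertise the custom action so the agent can discover it.
            result = {"tools": [{
                "name": "restart_widget",
                "description": "Restart an internal widget service",
                "inputSchema": {"type": "object",
                                "properties": {"name": {"type": "string"}}},
            }]}
        elif req.get("method") == "tools/call":
            text = restart_widget(req["params"]["arguments"]["name"])
            result = {"content": [{"type": "text", "text": text}]}
        else:
            result = {}  # a real server also handles initialize, errors, etc.
        body = json.dumps({"jsonrpc": "2.0", "id": req.get("id"),
                           "result": result}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args):  # keep the sketch quiet
        pass
```

Once registered as a connector, the agent can discover `restart_widget` via tools/list and invoke it like any built-in action.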
Custom Actions: Automate Anything, Anywhere

Extend automation beyond Azure and integrate with any tool or workflow, solving the problem of limited automation scope and enabling end-to-end process orchestration.

- Out-of-the-Box Actions: Instantly automate common tasks like running az cli or kubectl commands, creating GitHub issues, or updating Azure resources, reducing setup time and operational overhead.
- Communication Notifications: The SRE Agent now features built-in connectors for Outlook, enabling automated email notifications, and for Microsoft Teams, allowing it to post messages directly to Teams channels for streamlined communication.
- Bring Your Own Actions: Drop in your own Remote MCP servers to extend the agent’s capabilities to any custom tool or workflow. Future-proof your agentic DevOps by automating proprietary or emerging processes with confidence.

Prebuilt Operations Scenarios

Address common operational challenges out of the box, saving teams time and effort while improving reliability and customer satisfaction.

- Incident Response: Minimize business impact and reduce operational risk by automating detection, diagnosis, and mitigation of your workload stack. The agent has built-in runbooks for common issues related to many Azure resource types, including Azure Kubernetes Service (AKS), Azure Container Apps (ACA), Azure App Service, Azure Logic Apps, Azure Database for PostgreSQL, Azure Cosmos DB, and Azure VMs. Support for additional resource types is being added continually; please see the product documentation for the latest information.
- Root Cause Analysis & IaC Drift Detection: Instantly pinpoint incident causes with AI-driven root cause analysis, including automated source code scanning via GitHub and Azure DevOps integration. Proactively detect and resolve infrastructure drift by comparing live cloud environments against source-controlled IaC, ensuring configuration consistency and compliance.
- Handle Complex Investigations: Enable the deep investigation mode, which uses a hypothesis-driven method to analyze possible root causes. It collects logs and metrics, tests hypotheses with iterative checks, and documents findings. The process delivers a clear summary and actionable steps to help teams accurately resolve critical issues.
- Incident Analysis: The integrated dashboard offers a comprehensive overview of all incidents managed by the SRE Agent. It presents essential metrics, including the number of incidents reviewed, assisted, and mitigated by the agent, as well as those awaiting human intervention. Users can leverage aggregated visualizations and AI-generated root cause analyses to gain insights into incident processing, identify trends, enhance response strategies, and detect areas for improvement in incident management.
- Inbuilt Agent Memory: The new SRE Agent Memory System transforms incident response by institutionalizing the expertise of top SREs, capturing, indexing, and reusing critical knowledge from past incidents, investigations, and user guidance. Benefit from faster, more accurate troubleshooting as the agent learns from both successes and mistakes, surfacing relevant insights, runbooks, and mitigation strategies exactly when needed. This system leverages advanced retrieval techniques and a domain-aware schema to ensure every on-call engagement is smarter than the last, reducing mean time to resolution (MTTR) and minimizing repeated toil. Automatically gain a continuously improving agent that remembers what works, avoids past pitfalls, and delivers actionable guidance tailored to the environment.
- GitHub Copilot and Azure DevOps Integration: Automatically triage, respond to, and resolve issues raised in GitHub or Azure DevOps. Integration with modern development platforms such as the GitHub Copilot coding agent increases efficiency and ensures that issues are resolved faster, reducing bottlenecks in the development lifecycle.

Ready to get started?
- Azure SRE Agent home page
- Product overview
- Pricing Page
- Pricing Calculator
- Pricing Blog
- Demo recordings
- Deployment samples

What’s Next?

Give us feedback: your feedback is critical. You can thumbs-up or thumbs-down each interaction or thread, use the “Give Feedback” button in the agent to give us in-product feedback, or create issues and share your thoughts in our GitHub repo at https://github.com/microsoft/sre-agent.

We’re just getting started. In the coming months, expect even more prebuilt integrations, expanded data sources, and new automation scenarios. We anticipate continuous growth and improvement throughout our agentic AI platforms and services to effectively address customer needs and preferences. Let us know what Ops toil you want to automate next!

Announcing a flexible, predictable billing model for Azure SRE Agent
Billing for Azure SRE Agent will start on September 1, 2025. Announced at Microsoft Build 2025, Azure SRE Agent is a pre-built AI agent for root cause analysis, uptime improvement, and operational cost reduction. Learn more about the billing model and example scenarios.

What's new in Azure SRE Agent in the GA release
Azure SRE Agent is now generally available (read the GA announcement). After months in preview with teams across Microsoft and early customers, here's what's shipping at GA.

We use SRE Agent in our team

We built SRE Agent to solve our own operational problems first. It investigates our regressions, triages errors daily, and turns investigations into reusable knowledge. Every capability in this release was shaped by those learnings. → The Agent That Investigates Itself

What's new at GA

Redesigned onboarding — useful on day one

Can a new agent become useful the same day you set it up? That's the bar we designed around. Connect code, logs, incidents, Azure resources, and knowledge files in a single guided flow. → What It Takes to Give an SRE Agent a Useful Starting Point

Deep Context — your agent builds expertise on your environment

Continuous access to your logs, code, and knowledge. Persistent memory across investigations. Background intelligence that runs when nobody is asking questions. Your agent already knows your routes, error handlers, and deployment configs because it has been exploring your environment continuously. It remembers what worked last time and surfaces operational insights nobody asked for. → Meet the Best Engineer That Learns Continuously

Why SRE Agent - capabilities that move the needle

Automated investigation — proactive and reactive: Set up scheduled tasks to run investigations on a cadence and catch issues before they become incidents. When an incident does fire, your agent picks it up automatically through integrations with platforms like ICM, PagerDuty, and ServiceNow.

Faster root cause analysis → lower MTTR: Your agent is code and context aware and learns continuously. It connects runtime errors to the code that caused them and gets faster with every investigation.

Automate workflows across any ecosystem → reduce toil: Connect to any system via MCP connectors.
Eliminate the context-switching of working across multiple platforms, and orchestrate workflows across Azure, monitoring, ticketing, and more from a single place.

Integrate with any HTTP API → bring your own tools: Write custom Python tools that call any endpoint. Extend your agent to interact with internal APIs, third-party services, or any system your team relies on.

Customize your agent → skills and plugins: Add your own skills to teach domain-specific knowledge, or browse the Plugin Marketplace to install pre-built capabilities with a single click.

Get started

- Create your agent
- Documentation
- Get started guide
- Pricing
- Feedback & issues
- Samples
- Videos

This is just the start — more capabilities are coming soon. Try it out and let us know what you think.

Get started with Datadog MCP server in Azure SRE Agent
Overview

The Datadog MCP server is a cloud-hosted bridge between your Datadog organization and Azure SRE Agent. Once configured, it enables real-time interaction with logs, metrics, APM traces, monitors, incidents, dashboards, and other Datadog data through natural language. All actions respect your existing Datadog RBAC permissions. The server uses Streamable HTTP transport with two custom headers (DD_API_KEY and DD_APPLICATION_KEY) for authentication. Azure SRE Agent connects directly to the Datadog-hosted endpoint—no npm packages, local proxies, or container deployments are required. The SRE Agent portal includes a dedicated Datadog MCP server connector type that pre-populates the required header keys for streamlined setup.

Key capabilities

| Area | Capabilities |
|---|---|
| Logs | Search and analyze logs with SQL-based queries, filter by facets and time ranges |
| Metrics | Query metric values, explore available metrics, get metric metadata and tags |
| APM | Search spans, fetch complete traces, analyze trace performance, compare traces |
| Monitors | Search monitors, validate configurations, inspect monitor groups and templates |
| Incidents | Search and get incident details, view timeline and responders |
| Dashboards | Search and list dashboards by name or tag |
| Hosts | Search hosts by name, tags, or status |
| Services | List services and map service dependencies |
| Events | Search events including monitor alerts, deployments, and custom events |
| Notebooks | Search and retrieve notebooks for investigation documentation |
| RUM | Search Real User Monitoring events for frontend observability |

This is the official Datadog-hosted MCP server (Preview). The server exposes 16+ core tools, with additional toolsets available for alerting, APM, Database Monitoring, Error Tracking, feature flags, LLM Observability, networking, security, software delivery, and Synthetic tests. Tool availability depends on your Datadog plan and RBAC permissions.
Prerequisites

- Azure SRE Agent resource deployed in Azure
- Datadog organization with an active plan
- Datadog user account with appropriate RBAC permissions
- API key: created from Organization Settings > API Keys
- Application key: created from Organization Settings > Application Keys, with MCP Read and/or MCP Write permissions
- Your organization must be allowlisted for the Datadog MCP server Preview

Step 1: Create API and Application keys

The Datadog MCP server requires two credentials: an API key (identifies your organization) and an Application key (authenticates the user and defines permission scope). Both are created in the Datadog portal.

Create an API key

1. Log in to your Datadog organization (use your region-specific URL if applicable—e.g., app.datadoghq.eu for EU1)
2. Select your account avatar in the bottom-left corner of the navigation bar
3. Select Organization Settings
4. In the left sidebar, select API Keys (under the Access section). Direct URL: https://app.datadoghq.com/organization-settings/api-keys
5. Select + New Key in the top-right corner
6. Enter a descriptive name (e.g., sre-agent-mcp)
7. Select Create Key
8. Copy the key value immediately—it is shown only once. If lost, you must create a new key.

[!TIP] API keys are organization-level credentials. Any Datadog Admin or user with the API Keys Write permission can create them. The API key alone does not grant data access—it must be paired with an Application key.
Create an Application key

1. From the same Organization Settings page, select Application Keys in the left sidebar. Direct URL: https://app.datadoghq.com/organization-settings/application-keys
2. Select + New Key in the top-right corner
3. Enter a descriptive name (e.g., sre-agent-mcp-app)
4. Select Create Key
5. Copy the key value immediately—it is shown only once

Add MCP permissions to the Application key

After creating the Application key, you must grant it the MCP-specific scopes:

1. In the Application Keys list, locate the key you just created
2. Select the key name to open its detail panel
3. In the detail panel, find the Scopes section and select Edit
4. Search for MCP in the scopes search box
5. Check MCP Read to enable read access to Datadog data via MCP tools
6. Optionally check MCP Write if your agent needs to create or modify resources (e.g., feature flags, Synthetic tests)
7. Select Save

If you don't see the MCP Read or MCP Write scopes, your organization may not be enrolled in the Datadog MCP server preview. Contact your Datadog account representative to request access.

Required permissions summary

| Permission | Description | Required? |
|---|---|---|
| MCP Read | Read access to Datadog data via MCP tools (logs, metrics, traces, monitors, etc.) | Yes |
| MCP Write | Write access for mutating operations (creating feature flags, editing Synthetic tests, etc.) | Optional |

For production use, create keys from a service account rather than a personal account. Navigate to Organization Settings > Service Accounts to create one. This ensures the integration continues to work if team members leave the organization. Apply the principle of least privilege: grant only MCP Read unless write operations are needed. Use scoped Application keys to restrict access to only the permissions your agent needs. This limits blast radius if a key is compromised.

Step 2: Add the MCP connector

Connect the Datadog MCP server to your SRE Agent using the portal.
The portal includes a dedicated Datadog connector type that pre-populates the required configuration.

Determine your regional endpoint

Select the endpoint URL that matches your Datadog organization's region:

| Region | Endpoint URL |
|---|---|
| US1 (default) | https://mcp.datadoghq.com/api/unstable/mcp-server/mcp |
| US3 | https://mcp.us3.datadoghq.com/api/unstable/mcp-server/mcp |
| US5 | https://mcp.us5.datadoghq.com/api/unstable/mcp-server/mcp |
| EU1 | https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp |
| AP1 | https://mcp.ap1.datadoghq.com/api/unstable/mcp-server/mcp |
| AP2 | https://mcp.ap2.datadoghq.com/api/unstable/mcp-server/mcp |

Using the Azure portal

1. In the Azure portal, navigate to your SRE Agent resource
2. Select Builder > Connectors
3. Select Add connector
4. Select Datadog MCP server and select Next
5. Configure the connector:

| Field | Value |
|---|---|
| Name | datadog-mcp |
| Connection type | Streamable-HTTP (pre-selected) |
| URL | https://mcp.datadoghq.com/api/unstable/mcp-server/mcp (change for non-US1 regions) |
| Authentication | Custom headers (pre-selected, disabled) |
| DD_API_KEY | Your Datadog API key |
| DD_APPLICATION_KEY | Your Datadog Application key |

6. Select Next to review
7. Select Add connector

The Datadog connector type pre-populates both header keys (DD_API_KEY and DD_APPLICATION_KEY) and sets the authentication method to "Custom headers" automatically. The default URL is the US1 endpoint—update it if your organization is in a different region. Once the connector shows Connected status, the Datadog MCP tools are automatically available to your agent. You can verify by checking the tools list in the connector details.

Step 3: Create a Datadog subagent (optional)

Create a specialized subagent to give the AI focused Datadog observability expertise and better prompt responses.
1. Navigate to **Builder > Subagents**.
2. Select **Add subagent**.
3. Paste the following YAML configuration:

```yaml
api_version: azuresre.ai/v1
kind: AgentConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: DatadogObservabilityExpert
  display_name: Datadog Observability Expert
  system_prompt: |
    You are a Datadog observability expert with access to logs, metrics,
    APM traces, monitors, incidents, dashboards, hosts, services, and more
    via the Datadog MCP server.

    ## Capabilities

    ### Logs
    - Search logs using facets, tags, and time ranges with `search_datadog_logs`
    - Perform SQL-based log analysis with `analyze_datadog_logs` for aggregations, grouping, and statistical queries
    - Correlate log entries with traces and metrics

    ### Metrics
    - Query metric time series with `get_datadog_metric`
    - Get metric metadata, tags, and context with `get_datadog_metric_context`
    - Discover available metrics with `search_datadog_metrics`

    ### APM (Application Performance Monitoring)
    - Fetch complete traces with `get_datadog_trace`
    - Search distributed traces and spans with `search_datadog_spans`
    - Analyze service-level performance and latency patterns
    - Map service dependencies with `search_datadog_service_dependencies`

    ### Monitors & Alerting
    - Search monitors by name, tag, or status with `search_datadog_monitors`
    - Investigate triggered monitors and alert history
    - Correlate monitor alerts with underlying metrics and logs

    ### Incidents
    - Search incidents with `search_datadog_incidents`
    - Get incident details, timeline, and responders with `get_datadog_incident`
    - Correlate incidents with monitors, logs, and traces

    ### Infrastructure
    - Search hosts by name, tag, or status with `search_datadog_hosts`
    - List and discover services with `search_datadog_services`
    - Search dashboards with `search_datadog_dashboards`
    - Search events (monitor alerts, deployments) with `search_datadog_events`

    ### Notebooks
    - Search notebooks with `search_datadog_notebooks`
    - Retrieve notebook content with `get_datadog_notebook`

    ### Real User Monitoring
    - Search RUM events for frontend performance data with `search_datadog_rum_events`

    ## Best Practices

    When investigating incidents:
    - Start with `search_datadog_incidents` or `get_datadog_incident` for context
    - Check related monitors with `search_datadog_monitors`
    - Correlate with `search_datadog_logs` and `get_datadog_metric` for root cause
    - Use `get_datadog_trace` to inspect request flows for latency issues
    - Check `search_datadog_hosts` for infrastructure-level problems

    When analyzing logs:
    - Use `analyze_datadog_logs` for SQL-based aggregation queries
    - Use `search_datadog_logs` for individual log retrieval and filtering
    - Include time ranges to narrow results and reduce response size
    - Filter by service, host, or status to focus on relevant data

    When working with metrics:
    - Use `search_datadog_metrics` to discover available metric names
    - Use `get_datadog_metric_context` to understand metric tags and metadata
    - Use `get_datadog_metric` to query actual metric values with time ranges

    When handling errors:
    - If access is denied, explain which RBAC permission is needed
    - Suggest the user verify their Application key has `MCP Read` or `MCP Write`
    - For large traces that appear truncated, note this is a known limitation
  mcp_connectors:
    - datadog-mcp
  handoffs: []
```

4. Select **Save**.

The `mcp_connectors` field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the Datadog MCP server.

## Step 4: Add a Datadog skill (optional)

Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a Datadog skill to give your agent expertise in log queries, metric analysis, and incident investigation workflows.
1. Navigate to **Builder > Skills**.
2. Select **Add skill**.
3. Paste the following skill configuration:

````yaml
api_version: azuresre.ai/v1
kind: SkillConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: datadog_observability
  display_name: Datadog Observability
  description: |
    Expertise in Datadog's observability platform including logs, metrics,
    APM, monitors, incidents, dashboards, hosts, and services. Use for
    searching logs, querying metrics, investigating incidents, analyzing
    traces, inspecting monitors, and navigating Datadog data via the
    Datadog MCP server.
  instructions: |
    ## Overview
    Datadog is a cloud-scale observability platform for logs, metrics, APM
    traces, monitors, incidents, infrastructure, and more. The Datadog MCP
    server enables natural language interaction with your organization's
    Datadog data.

    **Authentication:** Two custom headers—`DD_API_KEY` (API key) and
    `DD_APPLICATION_KEY` (Application key with MCP permissions). All actions
    respect existing RBAC permissions.

    **Regional endpoints:** The MCP server URL varies by Datadog region
    (US1, US3, US5, EU1, AP1, AP2). Ensure the connector URL matches your
    organization's region.

    ## Searching Logs
    Use `search_datadog_logs` for individual log retrieval and
    `analyze_datadog_logs` for SQL-based aggregation queries.

    **Common log search patterns:**
    ```
    # Errors from a specific service
    service:payment-api status:error

    # Logs from a host in the last hour
    host:web-prod-01

    # Logs containing a specific trace ID
    trace_id:abc123def456

    # Errors with a specific HTTP status
    @http.status_code:500 service:api-gateway

    # Logs from a Kubernetes pod
    kube_namespace:production kube_deployment:checkout-service
    ```

    **SQL-based log analysis with `analyze_datadog_logs`:**
    ```sql
    -- Count errors by service in the last hour
    SELECT service, count(*) as error_count
    FROM logs
    WHERE status = 'error'
    GROUP BY service
    ORDER BY error_count DESC

    -- Average response time by endpoint
    SELECT @http.url_details.path, avg(@duration) as avg_duration
    FROM logs
    WHERE service = 'api-gateway'
    GROUP BY @http.url_details.path
    ```

    ## Querying Metrics
    Use `search_datadog_metrics` to discover metrics,
    `get_datadog_metric_context` for metadata, and `get_datadog_metric`
    for time series data.

    **Common metric patterns:**
    ```
    # System metrics
    system.cpu.user, system.mem.used, system.disk.used

    # Container metrics
    docker.cpu.usage, kubernetes.cpu.requests

    # Application metrics
    trace.servlet.request.hits, trace.servlet.request.duration

    # Custom metrics
    app.payment.processed, app.queue.depth
    ```

    Always specify a time range when querying metrics to avoid retrieving
    excessive data.

    ## Investigating Traces
    Use `get_datadog_trace` for complete trace details and
    `search_datadog_spans` for span-level queries.

    **Trace investigation workflow:**
    1. Search for slow or errored spans with `search_datadog_spans`
    2. Get the full trace with `get_datadog_trace` using the trace ID
    3. Identify the bottleneck service or operation
    4. Correlate with `search_datadog_logs` using the trace ID
    5. Check related metrics with `get_datadog_metric`

    ## Working with Monitors
    Use `search_datadog_monitors` to find monitors by name, tag, or status.

    **Common monitor queries:**
    ```
    # Find all triggered monitors
    Search for monitors with status "Alert"

    # Find monitors for a specific service
    Search for monitors tagged with service:payment-api

    # Find monitors by name
    Search for monitors matching "CPU" or "memory"
    ```

    ## Incident Investigation Workflow
    For structured incident investigation:
    1. `search_datadog_incidents` — find recent or active incidents
    2. `get_datadog_incident` — get full incident details and timeline
    3. `search_datadog_monitors` — check which monitors triggered
    4. `search_datadog_logs` — search for errors around the incident time
    5. `get_datadog_metric` — check key metrics for anomalies
    6. `get_datadog_trace` — inspect request traces for latency or errors
    7. `search_datadog_hosts` — verify infrastructure health
    8. `search_datadog_service_dependencies` — map affected services

    ## Working with Dashboards and Notebooks
    - Use `search_datadog_dashboards` to find dashboards by title or tag
    - Use `search_datadog_notebooks` and `get_datadog_notebook` for
      investigation notebooks that document past analyses

    ## Toolsets
    The Datadog MCP server supports toolsets via the `?toolsets=` query
    parameter on the endpoint URL. Available toolsets:

    | Toolset | Description |
    |---------|-------------|
    | `core` | Logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, notebooks (default) |
    | `alerting` | Monitor validation, groups, and templates |
    | `apm` | Trace analysis, span search, Watchdog insights, performance investigation |
    | `dbm` | Database Monitoring query plans and samples |
    | `error-tracking` | Error Tracking issues across RUM, Logs, and Traces |
    | `feature-flags` | Creating, listing, and updating feature flags |
    | `llmobs` | LLM Observability spans |
    | `networks` | Cloud Network Monitoring, Network Device Monitoring |
    | `onboarding` | Guided Datadog setup and configuration |
    | `security` | Code security scanning, security signals, findings |
    | `software-delivery` | CI Visibility, Test Optimization |
    | `synthetics` | Synthetic test management |

    To enable additional toolsets, append `?toolsets=core,apm,alerting` to
    the connector URL.

    ## Troubleshooting
    | Issue | Solution |
    |-------|----------|
    | 401/403 errors | Verify API key and Application key are correct and active |
    | No data returned | Check that Application key has `MCP Read` permission |
    | Wrong region | Ensure the connector URL matches your Datadog organization's region |
    | Truncated traces | Large traces may be truncated; this is a known limitation |
    | Tool not found | The tool may require a non-default toolset; update the connector URL |
    | Write operations fail | Verify Application key has `MCP Write` permission |
  mcp_connectors:
    - datadog-mcp
````

4. Select **Save**.

### Reference the skill in your subagent

Update your subagent configuration to include the skill:

```yaml
spec:
  name: DatadogObservabilityExpert
  skills:
    - datadog_observability
  mcp_connectors:
    - datadog-mcp
```

## Step 5: Test the integration

1. Open a new chat session with your SRE Agent.
2. Try these example prompts:

**Log analysis**

- Search for error logs from the payment-api service in the last hour
- Analyze logs to count errors by service over the last 24 hours
- Find all logs with HTTP 500 status from the api-gateway in the last 30 minutes
- Show me the most recent logs from host web-prod-01

**Metrics investigation**

- What is the current CPU usage across all production hosts?
- Show me the request rate and error rate for the checkout-service over the last 4 hours
- What metrics are available for the payment-api service?
- Get the p99 latency for the api-gateway service in the last hour

**APM and trace analysis**

- Find the slowest traces for the checkout-service in the last hour
- Get the full trace details for trace ID abc123def456
- What services depend on the payment-api?
- Search for errored spans in the api-gateway service from the last 30 minutes

**Monitor and alerting workflows**

- Show me all monitors currently in Alert status
- Find monitors related to the database-primary host
- What monitors are tagged with team:platform?
- Search for monitors matching "disk space" or "memory"

**Incident investigation**

- Show me all active incidents from the last 24 hours
- Get details for incident INC-12345 including the timeline
- What monitors triggered during the last production incident?
- Correlate the most recent incident with related logs and metrics

**Infrastructure and dashboards**

- Search for hosts tagged with env:production and team:platform
- List all dashboards related to "Kubernetes" or "EKS"
- What services are running in the production environment?
- Show me recent deployment events for the checkout-service

## Available tools

### Core toolset (default)

The core toolset is included by default and provides essential observability tools.
| Tool | Description |
|---|---|
| `search_datadog_logs` | Search logs by facets, tags, and time ranges |
| `analyze_datadog_logs` | SQL-based log analysis for aggregations and statistical queries |
| `get_datadog_metric` | Query metric time series with rollup and aggregation |
| `get_datadog_metric_context` | Get metric metadata, tags, and related context |
| `search_datadog_metrics` | List and discover available metrics |
| `get_datadog_trace` | Fetch a complete distributed trace by trace ID |
| `search_datadog_spans` | Search APM spans by service, operation, or tags |
| `search_datadog_monitors` | Search monitors by name, tag, or status |
| `get_datadog_incident` | Get incident details including timeline and responders |
| `search_datadog_incidents` | List and search incidents |
| `search_datadog_dashboards` | Search dashboards by title or tag |
| `search_datadog_hosts` | Search hosts by name, tag, or status |
| `search_datadog_services` | List and search services |
| `search_datadog_service_dependencies` | Map service dependency relationships |
| `search_datadog_events` | Search events (monitor alerts, deployments, custom events) |
| `get_datadog_notebook` | Retrieve notebook content by ID |
| `search_datadog_notebooks` | Search notebooks by title or tag |
| `search_datadog_rum_events` | Search Real User Monitoring events |

### Alerting toolset

Enable with `?toolsets=core,alerting` on the connector URL.

| Tool | Description |
|---|---|
| `validate_datadog_monitor` | Validate monitor configuration before creation |
| `get_datadog_monitor_templates` | Get monitor configuration templates |
| `search_datadog_monitor_groups` | Search monitor groups and their statuses |

### APM toolset

Enable with `?toolsets=core,apm` on the connector URL.

| Tool | Description |
|---|---|
| `apm_search_spans` | Advanced span search with APM-specific filters |
| `apm_explore_trace` | Interactive trace exploration and analysis |
| `apm_trace_summary` | Get a summary analysis of a trace |
| `apm_trace_comparison` | Compare two traces side by side |
| `apm_analyze_trace_metrics` | Analyze aggregated trace metrics and trends |

### Database Monitoring toolset

Enable with `?toolsets=core,dbm` on the connector URL.

| Tool | Description |
|---|---|
| `search_datadog_dbm_plans` | Search database query execution plans |
| `search_datadog_dbm_samples` | Search database query samples and statistics |

### Error Tracking toolset

Enable with `?toolsets=core,error-tracking` on the connector URL.

| Tool | Description |
|---|---|
| `search_datadog_error_tracking_issues` | Search error tracking issues across RUM, Logs, and Traces |
| `get_datadog_error_tracking_issue` | Get details of a specific error tracking issue |

### Feature Flags toolset

Enable with `?toolsets=core,feature-flags` on the connector URL.

| Tool | Description |
|---|---|
| `list_datadog_feature_flags` | List feature flags |
| `create_datadog_feature_flag` | Create a new feature flag |
| `update_datadog_feature_flag_environment` | Update feature flag settings for an environment |

### LLM Observability toolset

Enable with `?toolsets=core,llmobs` on the connector URL.

| Tool | Description |
|---|---|
| LLM Observability spans | Query and analyze LLM Observability span data |

### Networks toolset

Enable with `?toolsets=core,networks` on the connector URL.

| Tool | Description |
|---|---|
| Cloud Network Monitoring tools | Analyze cloud network traffic and dependencies |
| Network Device Monitoring tools | Monitor and troubleshoot network devices |

### Security toolset

Enable with `?toolsets=core,security` on the connector URL.

| Tool | Description |
|---|---|
| `datadog_code_security_scan` | Run code security scanning |
| `datadog_sast_scan` | Run Static Application Security Testing |
| `datadog_secrets_scan` | Scan for secrets and credentials in code |

### Software Delivery toolset

Enable with `?toolsets=core,software-delivery` on the connector URL.

| Tool | Description |
|---|---|
| `search_datadog_ci_pipeline_events` | Search CI pipeline execution events |
| `get_datadog_flaky_tests` | Identify flaky tests in CI pipelines |

### Synthetics toolset

Enable with `?toolsets=core,synthetics` on the connector URL.

| Tool | Description |
|---|---|
| `get_synthetics_tests` | List and get Synthetic test configurations |
| `edit_synthetics_tests` | Edit Synthetic test settings |
| `synthetics_test_wizard` | Guided wizard for creating Synthetic tests |

## Toolsets

The Datadog MCP server organizes tools into toolsets.
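Each non-default toolset above is enabled by appending it to the connector URL's `?toolsets=` query parameter. If you manage connector configuration in scripts, that step can be sketched with a small helper (the function name is illustrative):

```shell
# Join toolset names into a ?toolsets= query parameter on a base MCP URL.
# Assumes the base URL carries no existing query string.
with_toolsets() {
  base="$1"; shift
  # "$*" joins the remaining arguments using the first character of IFS.
  echo "${base}?toolsets=$(IFS=,; echo "$*")"
}

with_toolsets "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp" core apm alerting
```

The resulting URL is what you would paste into the connector's **URL** field.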
By default, only the core toolset is enabled. To enable additional toolsets, append the `?toolsets=` query parameter to the connector URL.

### Syntax

`https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting`

### Examples

| Use case | URL suffix |
|---|---|
| Default (core only) | No suffix needed |
| Core + APM analysis | `?toolsets=core,apm` |
| Core + Alerting + APM | `?toolsets=core,alerting,apm` |
| Core + Database Monitoring | `?toolsets=core,dbm` |
| Core + Security scanning | `?toolsets=core,security` |
| Core + CI/CD visibility | `?toolsets=core,software-delivery` |
| All toolsets | `?toolsets=core,alerting,apm,dbm,error-tracking,feature-flags,llmobs,networks,onboarding,security,software-delivery,synthetics` |

> [!TIP]
> Only enable the toolsets you need. Each additional toolset increases the number of tools exposed to the agent, which can increase token usage and may impact response quality. Start with `core` and add toolsets as needed.

### Updating the connector URL

To add toolsets after initial setup:

1. Navigate to **Builder > Connectors**
2. Select the `datadog-mcp` connector
3. Update the URL field to include the `?toolsets=` parameter
4. Select **Save**

## Troubleshooting

### Authentication issues

| Error | Cause | Solution |
|---|---|---|
| 401 Unauthorized | Invalid API key or Application key | Verify both keys are correct and active in Organization Settings |
| 403 Forbidden | Missing RBAC permissions | Ensure the Application key has MCP Read and/or MCP Write permissions |
| Connection refused | Wrong regional endpoint | Verify the connector URL matches your Datadog organization's region |
| "Organization not allowlisted" | Preview access not granted | Contact Datadog support to request MCP server Preview access |

### Data and permission issues

| Error | Cause | Solution |
|---|---|---|
| No data returned | Insufficient permissions or wrong time range | Verify Application key permissions; try a broader time range |
| Tool not found | Tool belongs to a non-default toolset | Add the required toolset to the `?toolsets=` parameter in the connector URL |
| Truncated trace data | Trace exceeds size limit | Large traces are truncated for context window efficiency; query specific spans instead |
| Write operation failed | Missing MCP Write permission | Add MCP Write permission to the Application key |
| Metric not found | Wrong metric name or no data in time range | Use `search_datadog_metrics` to discover available metric names |

### Verify the connection

Test the server endpoint directly:

```shell
curl -I "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp" \
  -H "DD_API_KEY: <your_api_key>" \
  -H "DD_APPLICATION_KEY: <your_application_key>"
```

Expected response: `200 OK` confirms authentication is working.

### Re-authorize the integration

If you encounter persistent issues:

1. Navigate to **Organization Settings > Application Keys** in Datadog
2. Revoke the existing Application key
3. Create a new Application key with the required MCP Read / MCP Write permissions
4. Update the connector in the SRE Agent portal with the new key

## Limitations

| Limitation | Details |
|---|---|
| Preview only | The Datadog MCP server is in Preview and not recommended for production use |
| Allowlisted organizations | Only organizations that have been allowlisted by Datadog can access the MCP server |
| Large trace truncation | Responses are optimized for LLM context windows; large traces may be truncated |
| Unstable API path | The endpoint URL contains `/unstable/`, indicating the API may change without notice |
| Toolset availability | Some toolsets may not be available depending on your Datadog plan and features enabled |
| Regional endpoints | You must use the endpoint matching your organization's region; cross-region queries are not supported |

## Security considerations

### How permissions work

- **RBAC-scoped:** All actions respect the RBAC permissions associated with the API and Application keys
- **Key-based:** Access is controlled through the API key (organization-level) and Application key (user or service account-level)
- **Permission granularity:** MCP Read enables read operations; MCP Write enables mutating operations

### Admin controls

Datadog administrators can:

- Create and revoke API and Application keys in Organization Settings
- Assign granular RBAC permissions (`MCP Read`, `MCP Write`) to Application keys
- Use service accounts to decouple access from individual user accounts
- Monitor MCP tool usage through the Datadog Audit Trail
- Scope Application keys to limit the blast radius of compromised credentials

The Datadog MCP server can read sensitive operational data including logs, metrics, and traces. Use service accounts with scoped Application keys, grant only the permissions your agent needs, and monitor the Audit Trail for unusual activity.

## Related content

- Datadog MCP Server documentation
- Datadog API and Application keys
- Datadog RBAC permissions
- Datadog Audit Trail
- Datadog regional sites
- MCP integration overview
- Build a custom subagent

---

# Connect Azure SRE Agent to ServiceNow: End-to-End Incident Response
## 🎯 What You'll Achieve

In this tutorial, you'll:

- Connect Azure SRE Agent to ServiceNow as your incident management platform
- Create a test incident in ServiceNow
- Watch the AI agent automatically pick up, investigate, and resolve the incident
- See the agent write triage findings and resolution notes back to ServiceNow

**Time to complete:** ~10 minutes

## 🎬 The End Result

Before we dive in, here's what the end result looks like:

*ServiceNow incident — resolved by Azure SRE Agent*

The Azure SRE Agent:

- ✅ Detected the incident from ServiceNow
- ✅ Acknowledged and began triage automatically
- ✅ Investigated AKS cluster memory utilization
- ✅ Documented findings in work notes
- ✅ Resolved the incident with detailed root cause analysis

## 📋 Prerequisites

- A ServiceNow instance (Developer, PDI, or Enterprise)
- Administrator access to ServiceNow
- An Azure SRE Agent deployed in your Azure subscription

💡 **Don't have a ServiceNow instance?** Get a free Personal Developer Instance (PDI) at developer.servicenow.com

## 🔧 Step 1: Gather Your ServiceNow Credentials

To connect Azure SRE Agent to ServiceNow, you need three pieces of information:

| Component | Where to Find It |
|---|---|
| ServiceNow endpoint | Browser address bar when logged into ServiceNow (format: `https://your-instance.service-now.com`) |
| Username | Click your profile avatar → Profile → User ID |
| Password | Your ServiceNow login password |

### Finding Your Instance URL

Your ServiceNow instance URL is visible in your browser's address bar when logged in: `https://{your-instance-name}.service-now.com`

### Finding Your Username

1. Click your profile avatar in the top-right corner of ServiceNow
2. Click **Profile**
3. Your User ID is your username

## ⚙️ Step 2: Connect SRE Agent to ServiceNow

### Navigate to Your SRE Agent

1. Open the Azure Portal
2. Search for "Azure SRE Agent" in the search bar
3. Click **Azure SRE Agent (Preview)** in the results
4. Select your agent from the list

### Configure the Incident Platform

1. In the left navigation, expand **Settings**
2. Click **Incident platform**
3. Click the **Incident platform** dropdown, then select
**ServiceNow**.

Here's what the ServiceNow configuration form looks like:

### Enter Your ServiceNow Credentials

| Field | Value |
|---|---|
| ServiceNow endpoint | Your ServiceNow instance URL (from Step 1) |
| Username | Your ServiceNow username (from Step 1) |
| Password | Your ServiceNow password |
| Quickstart response plan | ✓ Enable this for automatic investigation |

### Save and Verify

1. Click the **Save** button
2. Wait for validation to complete
3. Look for: "ServiceNow is connected." with a green checkmark

## 🚨 Step 3: Create a Test Incident in ServiceNow

Now let's test the integration by creating an incident in ServiceNow.

### Navigate to Create Incident

1. In ServiceNow, click **All** in the left navigation
2. Search for "Incident"
3. Click **Incident → Create New**

### Fill in the Incident Details

| Field | Value |
|---|---|
| Caller | Select any user (e.g., System Administrator) |
| Short description | `[SRE Agent Test] AKS Cluster memory pressure detected in production environment` |
| Impact | 2 - Medium |

### Submit the Incident

Click **Submit** to create the incident. Note the incident number that's assigned.

## 🤖 Step 4: Watch SRE Agent Investigate

### Check the SRE Agent Portal

1. Return to the Azure Portal
2. Open your SRE Agent
3. Click **Activities → Incidents**

Within seconds, you should see your ServiceNow incident appear!

### Observe the Autonomous Investigation

Click on the incident to see the SRE Agent's investigation in action. The agent automatically:

- 🔔 Acknowledged the incident
- 📋 Created a triage plan with clear steps
- 🔍 Identified AKS clusters in your subscription
- 📊 Validated memory utilization metrics
- ✅ Resolved the incident with findings

## 📝 Step 5: Review Resolution in ServiceNow

### Check the ServiceNow Incident

Return to ServiceNow and open your incident.
You'll see:

- **State:** Changed to Resolved
- **Activity Stream:** Multiple work notes from the agent
- **Resolution notes:** Detailed findings

### Resolution Notes

The agent writes comprehensive resolution notes including:

- Timestamp of resolution
- Root cause analysis
- Validation steps performed
- Fix applied (if any)

## 🚀 Next Steps

- Create custom response plans to customize how the agent responds to different incident types
- Configure alert routing to route specific Azure Monitor alerts to the agent
- Explore the Azure SRE Agent documentation for more features
- Share your experiences, learnings, and questions with other early adopters. Start a discussion in our Community Hub
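If you'd rather create test incidents programmatically than through the UI, ServiceNow's Table API (`POST /api/now/table/incident`) accepts the same fields used in Step 3. A hedged sketch — the instance name and credentials are placeholders, and how your instance resolves fields such as the caller may differ:

```shell
SN_INSTANCE="your-instance"   # placeholder: the part before .service-now.com
SN_USER="admin"               # placeholder credentials

# The payload mirrors the form values from Step 3.
payload='{
  "short_description": "[SRE Agent Test] AKS Cluster memory pressure detected in production environment",
  "impact": "2"
}'
printf '%s\n' "$payload"

# Against a live instance (not executed here), submit it with basic auth:
#   curl -u "$SN_USER:$SN_PASSWORD" \
#        -H "Content-Type: application/json" \
#        -d "$payload" \
#        "https://$SN_INSTANCE.service-now.com/api/now/table/incident"
```

This makes it easy to script repeated end-to-end tests of the SRE Agent integration, since each POST returns the new incident's number and sys_id in the response body.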