The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself.

Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis.

First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives.

The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline.

That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible.

Where we started

In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction.
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start.

We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent that could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale.

The Inversion: Three bets

The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself.

The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance.
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents that can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last.

Bet 1: The Filesystem as the Agent's World

Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent.

Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it.

Code Repositories: the highest-leverage context

Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it.
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

- Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.
- Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.
- Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In an SRE context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
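A minimal sketch of what that interface reduces to: memory entries as plain Markdown files the agent appends to and reads back with ordinary file tools. (The file layout and helper names here are illustrative, not the product's implementation.)

```python
import datetime
import pathlib

MEMORY = pathlib.Path("memory")
MEMORY.mkdir(exist_ok=True)

def record_learning(topic: str, note: str) -> None:
    """Append a dated note to a per-topic Markdown file (hypothetical layout)."""
    entry = f"- [{datetime.date.today().isoformat()}] {note}\n"
    with (MEMORY / f"{topic}.md").open("a", encoding="utf-8") as f:
        f.write(entry)

def recall(topic: str) -> str:
    """Reading memory is just reading a file: no embeddings, no re-ranking."""
    path = MEMORY / f"{topic}.md"
    return path.read_text(encoding="utf-8") if path.exists() else ""

record_learning("debugging", "Cache-hit alert fires falsely for sub-4096-token Haiku calls.")
print(recall("debugging"))
```

Because memory is just files, the agent's existing tools (grep, cat, an editor) are also its memory tools, and a snapshot of memory is a snapshot of a directory.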
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed.

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism. Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with gh cli, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
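That execution surface is easy to picture: the agent composes ordinary shell commands from inside its sandbox instead of calling purpose-built tool endpoints. A rough sketch (the helper and commands are illustrative, and a POSIX shell is assumed):

```python
import subprocess

def sh(cmd: str) -> str:
    """Run a shell command in the sandbox and return its stdout."""
    result = subprocess.run(cmd, shell=True, capture_output=True, text=True, check=True)
    return result.stdout

# Plain Unix composition stands in for bespoke integrations:
sh("printf 'ERROR timeout\\nINFO ok\\nERROR 429\\n' > app.log")
error_count = sh("grep -c ERROR app.log").strip()
print(error_count)

# Richer workflows go through the same helper, e.g. sh("gh pr create ...")
# or pip-installing a domain CLI mid-investigation.
```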
No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.

Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

- Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
- Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
- Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
- Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast.
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal.

Tool result compression via the filesystem

Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive.

Context Pruning and Auto Compact

Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work.

Parallel subagents

The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another.

The Feedback loop

These architectural bets have enabled us to close the original scaling gap.
Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway, and some conversations broke entirely. So, we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

- Analyzed our user churn rate and built dashboards we now review weekly.
- Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
- Ran security analysis and found vulnerabilities in the read path.
- Helped fill out parts of its own Responsible AI review, with strict human review.
- Handles customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one.

We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one.

Thanks to visagarwal for co-authoring this post.

What It Takes to Give SRE Agent a Useful Starting Point
In our latest posts, The Agent that investigates itself and Azure SRE Agent Now Builds Expertise Like Your Best Engineer: Introducing Deep Context, we wrote about a moment that changed how we think about agent systems. Azure SRE Agent investigated a regression in its own prompt cache, traced the drop to a specific PR, and proposed fixes. What mattered was not just the model. What mattered was the starting point. The agent had code, logs, deployment history, and a workspace it could use to discover the next piece of context.

That lesson forced an uncomfortable question about onboarding. If a customer finishes setup and the agent still knows nothing about their app, we have not really onboarded them. We have only created a resource. So for the March 10 GA release, we rebuilt onboarding around a more practical bar: can a new agent become useful on day one?

To test that, we used the new flow the way we expect customers to use it. We connected a real sample app, wired up live Azure Monitor alerts, attached code and logs, uploaded a knowledge file, and then pushed the agent through actual work. We asked it to inspect the app, explain a 401 path from the source, debug its own log access, and triage GitHub issues in the repo. This post walks through that experience. We connected everything we could because we wanted to see what the agent does when it has a real starting point, not a partial one. If your setup is shorter, the SRE Agent still works. It just knows less.

The cold start we were trying to fix

The worst version of an agent experience is familiar by now. You ask a concrete question about your system and get back a smart-sounding answer that is only loosely attached to reality. The model knows what a Kubernetes probe is. It knows what a 500 looks like. It may even know common Kusto table names. But it does not know your deployment, your repo, your auth flow, or the naming mistakes your team made six months ago and still lives with.
We saw the same pattern again and again inside our own work. When the agent had real context, it could do deep investigations. When it started cold, it filled the gaps with general knowledge and good guesses. The new onboarding is our attempt to close that gap up front. Instead of treating code, logs, incidents, and knowledge as optional extras, the flow is built around connecting the things the agent needs to reason well.

Walking through the new onboarding

Starting March 10, you can create and configure an SRE Agent at sre.azure.com. Here is what that looked like for us.

Step 1: Create the agent

You choose a subscription, resource group, name, and region. Azure provisions the runtime, managed identity, Application Insights, and Log Analytics workspace. In our run, the whole thing took about two minutes. That first step matters more than it may look. We are not just spinning up a chatbot. We are creating the execution environment where the agent can actually work: run commands, inspect files, query services, and keep track of what it learns.

Step 2: Start adding context

Once provisioning finishes, you land on the setup page. The page is organized around the sources that make the agent useful, and why each one matters:

- Code: lets the agent read the system it is supposed to investigate.
- Logs: gives it real tables, schemas, and data instead of guesses.
- Incidents: connects the agent to the place where operational pain actually shows up.
- Azure resources: gives it the right scope so it starts in the right subscription and resource group.
- Knowledge files: adds the team-specific context that never shows up cleanly in telemetry.

The page is blunt in a way we like. If you have not connected anything yet, it tells you the agent does not know enough about your app to answer useful questions. That is the right framing. The job of onboarding is to fix that.

Step 3: Connect logs

We started with Azure Data Explorer.
The wizard supports Azure Kusto, Datadog, Elasticsearch, Dynatrace, New Relic, Splunk, and Hawkeye. After choosing Kusto, it generated the MCP connector settings for us. We supplied the cluster details, tested the connection, and let it discover the tools. This step removes a whole class of bad agent behavior. The model no longer has to invent table names or hope the cluster it wants is the cluster that exists. It knows what it can query because the connection is explicit.

Step 4: Connect the incident platform

For incidents, we chose Azure Monitor. This part is simple by design. If incidents are where the agent proves its value, connecting them should feel like the most natural part of setup, not a side quest. PagerDuty and ServiceNow work too, but for this walkthrough we kept it on Azure Monitor so we could wire real alerts to a real app.

Step 5: Connect code

Then we connected the code repo. We used microsoft-foundry/foundry-agent-webapp, a React and ASP.NET Core sample app running on Azure Container Apps. This is still the highest-leverage source we give the agent. Once the repo is connected, the agent can stop treating the app as an abstract web service. It can read the auth flow. It can inspect how health probes are configured. It can compare logs against the exact code paths that produced them. It can even look at the commit that was live when an incident happened. That changes the quality of the investigation immediately.

Step 6: Scope the Azure resources

Next we told the agent which resources it was responsible for. We scoped it to the resource group that contained the sample Container App. The wizard then set the roles the agent needed to observe and investigate the environment. That sounds like a small step, but it fixes another common failure mode. Agents do better when they start from the right part of the world. Subscription and resource-group scope give them that boundary.
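Mechanically, scoping like this amounts to role assignments on the agent's managed identity at resource-group scope rather than subscription scope. A hedged sketch of the kind of CLI call involved (the GUIDs are placeholders and Reader is only an example role; the wizard picks the actual role set for you):

```python
# Build the role-assignment command the wizard effectively issues on your
# behalf. All identifiers below are placeholders, not real values.
agent_principal_id = "00000000-0000-0000-0000-000000000000"
subscription = "11111111-1111-1111-1111-111111111111"
resource_group = "rg-big-refactor"

# Scoping to the resource group, not the whole subscription, is the point:
scope = f"/subscriptions/{subscription}/resourceGroups/{resource_group}"
cmd = (
    "az role assignment create "
    f"--assignee {agent_principal_id} "
    "--role Reader "
    f"--scope {scope}"
)
print(cmd)
```

The narrower the scope string, the smaller the part of the world the agent starts from, which is exactly the boundary the step above is creating.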
Step 7: Upload knowledge

Last, we uploaded a Markdown knowledge file we wrote for the sample app. The file covered the app architecture, API endpoints, auth flow, likely failure modes, and the files we would expect an engineer to open first during debugging. We like Markdown here because it stays honest. It is easy for a human to read, easy for the agent to navigate, and easy to update as the system changes.

All sources configured

Once everything was connected, the setup panel turned green. At that point the agent had a repo, logs, incidents, Azure resources, and a knowledge file. That is the moment where onboarding stops being a checklist and starts being operational setup.

The chat experience makes the setup visible

When you open a new thread, the configuration panel stays at the top of the chat. If you expand it, you can see exactly what is connected and what is not. We built this because people should not have to guess what the agent knows. If code is connected and logs are not, that should be obvious. If incidents are wired up but knowledge files are missing, that should be obvious too. The panel makes the agent's working context visible in the same place where you ask it to think. It also makes partial setup less punishing. You do not have to finish every step before the agent becomes useful. But you can see, very clearly, what extra context would make the next answer better.

What changed once the agent had context

The easiest way to evaluate the onboarding is to look at the first questions we asked after setup. We started with a simple one: What do you know about the Container App in the rg-big-refactor resource group? The agent used Azure CLI to inspect the app, its revisions, and the system logs, then came back with a concise summary: image version, resource sizing, ingress, scale-to-zero behavior, and probe failures during cold start. It also correctly called out that the readiness probe noise was expected and not the root of a real outage.
That answer was useful because it was grounded in the actual resource, not in generic advice about Container Apps. Then we asked a harder question: Based on the connected repo, what authentication flow does this app use? If a user reports 401s, what should we check first? The agent opened authConfig.ts, Program.cs, useAuth.ts, postprovision.ps1, and entra-app.bicep, then traced the auth path end to end. The checklist it produced was exactly the kind of thing we hoped onboarding would unlock: client ID alignment, identifier URI issues, redirect URI mismatches, audience validation, missing scopes, token expiry handling, and the single-tenant assumption in the backend. It even pointed to the place in Program.cs where extra logging could be enabled. Without the repo, this would have been a boilerplate answer about JWTs. With the repo, it read like advice from someone who had already been paged for this app before.

We did not stop at setup. We wired real monitoring.

A polished demo can make any agent look capable, so we pushed farther. We set up live Azure Monitor alerts for the sample web app instead of leaving the incident side as dummy data. We created three alerts:

- HTTP 5xx errors (Sev 1), for more than 3 server errors in 5 minutes
- Container restarts (Sev 2), to catch crash loops and OOMs
- High response latency (Sev 2), when average response time goes above 10 seconds

The high-latency alert fired almost immediately. The app was scaling from zero, and the cold start was slow enough to trip the threshold. That was perfect. It gave us a real incident to put through the system instead of a fictional one.

Incident response plans

From the Builder menu, we created a response plan targeted at incidents with foundry-webapp in the title and severity 1 or 2. The incident that had just fired showed up in the learning flow.
We used the actual codebase and deployment details to write the default plan: which files to inspect for failures, how to reason about health probes, and how to tell the difference between a cold start and a real crash. That felt like an important moment in the product. The response plan was not generic incident theater. It was anchored in the system we had just onboarded.

One of the most useful demos was the agent debugging itself

The sharpest proof point came when we tried to query the Log Analytics workspace from the agent. We expected it to query tables and summarize what it found. Instead, it hit insufficient_scope. That could have been a dead end. Instead, the agent turned the failure into the investigation. It identified the missing permissions, noticed there were two managed identities in play, told us which RBAC roles were required, and gave us the exact commands to apply them. After we fixed the access, it retried and ran a series of KQL queries against the workspace.

That is where it found the next problem: Container Apps platform logs were present, but AppRequests, AppExceptions, and the rest of the App Insights-style tables were still empty. That was not a connector bug. It was a real observability gap in the sample app. The backend had OpenTelemetry packages, but the exporter configuration was not actually sending the telemetry we expected. The agent did not just tell us that data was missing. It explained which data was present, which data was absent, and why that difference mattered. That is the sort of thing we wanted this onboarding to set up: not just answering the first question, but exposing the next real thing that needs fixing.

We also asked it to triage the repo backlog

Once the repo was connected, it was natural to see how well the agent could read open issues against the code. We pointed it at the three open GitHub issues in the sample repo and asked it to triage them.
It opened the relevant files, compared the code to the issue descriptions, and came back with a clear breakdown:

- Issue #21, @fluentui-copilot is not opensource? Partially valid, low severity. The package is public and MIT licensed. The real concern is package maturity, not licensing.
- Issue #20, SDK fails to deserialize agent tool definitions. Confirmed, medium severity. The agent traced the problem to metadata handling in AgentFrameworkService.cs and suggested a safe fallback path.
- Issue #19, Create Preview experience from AI Foundry is incomplete. Confirmed, medium severity. The agent found the gap between the environment variables people are told to paste and the variables the app actually expects.

What stood out to us was not just that the output was correct. It was that the agent was careful. It did not overclaim. It separated a documentation concern from two real product bugs. Then it asked whether we wanted it to start implementing the fixes. That is the posture we want from an engineering agent: useful, specific, and a little humble.

What the onboarding is really doing

After working through the whole flow, we do not think of onboarding as a wizard anymore. We think of it as the process of giving the agent a fair shot. Each connection removes one reason for the model to bluff:

- Code keeps it from guessing how the system works.
- Logs keep it from guessing what data exists.
- Incidents keep it close to operational reality.
- Azure resource scope keeps it from wandering.
- Knowledge files keep team-specific context from getting lost.

This is the same lesson we learned building the product itself. The agent does better when it can discover context progressively inside a world that is real and well-scoped. Good onboarding is how you create that world.

Closing

The main thing we learned from this work is simple: onboarding is not done when the resource exists. It is done when the agent can help with a real problem.
In one setup we were able to connect a real app, fire a real alert, create a real response plan, debug a real RBAC problem, inspect real logs, and triage real GitHub issues. That is a much better standard than "the wizard completed successfully." If you try SRE Agent after GA, start there. Connect the things that make your system legible, then ask a question that would actually matter during a bad day. The answer will tell you very quickly whether the agent has a real starting point.

Create your SRE Agent -> Azure SRE Agent is generally available starting March 10, 2026.

What's new in Azure SRE Agent in the GA release
Azure SRE Agent is now generally available (read the GA announcement). After months in preview with teams across Microsoft and early customers, here's what's shipping at GA.

We use SRE Agent in our team

We built SRE Agent to solve our own operational problems first. It investigates our regressions, triages errors daily, and turns investigations into reusable knowledge. Every capability in this release was shaped from those learnings. → The Agent That Investigates Itself

What's new at GA

Redesigned onboarding — useful on day one

Can a new agent become useful the same day you set it up? That's the bar we designed around. Connect code, logs, incidents, Azure resources, and knowledge files in a single guided flow. → What It Takes to Give an SRE Agent a Useful Starting Point

Deep Context — your agent builds expertise on your environment

Continuous access to your logs, code, and knowledge. Persistent memory across investigations. Background intelligence that runs when nobody is asking questions. Your agent already knows your routes, error handlers, and deployment configs because it's been exploring your environment continuously. It remembers what worked last time and surfaces operational insights nobody asked for. → Meet the Best Engineer That Learns Continuously

Why SRE Agent - Capabilities that move the needle

Automated investigation — proactive and reactive

Set up scheduled tasks to run investigations on a cadence — catch issues before they become incidents. When an incident does fire, your agent picks it up automatically through integrations with platforms like ICM, PagerDuty, and ServiceNow.

Faster root cause analysis → lower MTTR

Your agent is code and context aware and learns continuously. It connects runtime errors to the code that caused them and gets faster with every investigation.

Automate workflows across any ecosystem → reduce toil

Connect to any system via MCP connectors.
Eliminate the context-switching of working across multiple platforms by orchestrating workflows across Azure, monitoring, ticketing, and more from a single place.

Integrate with any HTTP API → bring your own tools
Write custom Python tools that call any endpoint. Extend your agent to interact with internal APIs, third-party services, or any system your team relies on.

Customize your agent → skills and plugins
Add your own skills to teach domain-specific knowledge, or browse the Plugin Marketplace to install pre-built capabilities with a single click.

Get started
- Create your agent
- Documentation
- Get started guide
- Pricing
- Feedback & issues
- Samples
- Videos

This is just the start — more capabilities are coming soon. Try it out and let us know what you think.

Agent Hooks: Production-Grade Governance for Azure SRE Agent
Introduction
Azure SRE Agent helps engineering teams automate incident response, diagnostics, and remediation tasks. But when you're giving an agent access to production systems—your databases, your Kubernetes clusters, your cloud resources—you need more than just automation. You need governance.

Today, we're diving deep into Agent Hooks, the built-in governance framework in Azure SRE Agent that lets you enforce quality standards, prevent dangerous operations, and maintain audit trails without writing custom middleware or proxies. Agent Hooks work by intercepting your SRE Agent at critical execution points—before it responds to users (Stop hooks) or after it executes tools (PostToolUse hooks). You define the rules once in your custom agent configuration, and the SRE Agent runtime enforces them automatically across every conversation thread. In this post, we'll show you how to configure Agent Hooks for a real production scenario: diagnosing and remediating PostgreSQL connection pool exhaustion while maintaining enterprise controls.

The Challenge: Autonomous Remediation with Guardrails
You're managing a production application backed by Azure PostgreSQL Flexible Server. Your on-call team frequently deals with connection pool exhaustion issues that cause latency spikes. You want your SRE Agent to diagnose and resolve these incidents autonomously, but you need to ensure:
- Quality Control: The agent provides thorough, evidence-based analysis instead of superficial guesses
- Safety: The agent can't accidentally execute dangerous commands, but can still perform necessary remediation
- Compliance: Every agent action is logged for security audits and post-mortems
Without Agent Hooks, you'd need to build custom middleware, write validation logic around the SRE Agent API, or settle for manual approval workflows. With Agent Hooks, you configure these controls once in your custom agent definition and the SRE Agent platform enforces them automatically.
The Scenario: PostgreSQL Connection Pool Exhaustion
For our demo, we'll use a real production application (octopets-prod-web) experiencing connection pool exhaustion. When this happens:
- P95 latency spikes from ~120ms to 800ms+
- Active connections reach the pool limit
- New requests get queued or fail
The correct remediation is to restart the PostgreSQL Flexible Server to flush stale connections—but we want our agent to do this safely and with proper oversight.

Demo Setup: Three Hooks, Three Purposes
We'll configure three hooks that work together to create a robust governance framework:
- Hook #1: Quality Gate (Stop Hook) ensures the agent provides structured, evidence-based responses before presenting them to users.
- Hook #2: Safety Guardrails (PostToolUse Hook) blocks dangerous commands while allowing safe operations through an explicit allowlist.
- Hook #3: Audit Trail (Global Hook) logs every tool execution across all agents for compliance and debugging.

Step-by-Step Implementation

Creating the Custom Agent
First, we create a specialized subagent in the Azure SRE Agent platform called sre_analyst_agent designed for PostgreSQL diagnostics. In the Agent Canvas, we configure the agent instructions:

You are an SRE agent responsible for diagnosing and remediating production issues for an application backed by an Azure PostgreSQL Flexible Server.

When investigating a problem:
- Use available tools to query Azure Monitor metrics, PostgreSQL logs, and connection statistics
- Look for patterns: latency spikes, connection counts, error rates, CPU/memory pressure
- Quantify findings with actual numbers where possible (e.g., P95 latency in ms, active connection count, error rate %)

When presenting your diagnosis, structure your response with these exact sections:

## Root Cause
A precise explanation of what is causing the issue.

## Evidence
Specific metrics and observations that support your root cause.
Include actual numbers: latency values in ms, connection counts, error rates, timestamps.

## Recommended Actions
Numbered list of remediation steps ordered by priority. Be specific — include actual resource names and exact commands.

When executing a fix:
- Always verify the current state before acting
- Confirm the fix worked by re-checking the same metrics after the action
- Report before and after numbers to show impact

This explicit guidance ensures the agent knows the correct remediation path.

Configuring Hook #1: Quality Gate
In the Agent Canvas' Hooks tab, we add our first agent-level hook—a Stop hook that fires before the SRE Agent presents its response. This hook uses the SRE Agent's own LLM to evaluate response quality:

Event Type: Stop
Hook Type: Prompt
Activation: Always

Hook Prompt:

You are a quality gate for an SRE agent that investigates database and app performance issues. Review the agent's response below:

$ARGUMENTS

Evaluate whether the response meets ALL of the following criteria:
1. Has a "## Root Cause" section with a specific, clear explanation (not vague — must say specifically what failed, e.g., "connection pool exhaustion due to long-running queries holding connections" not just "database issue")
2. Has a "## Evidence" section that includes at least one concrete metric or data point with an actual number (e.g., "P95 latency spiked to 847ms", "active connections: 497/500", "error rate: 23% over last 15 minutes")
3. Has a "## Recommended Actions" section with numbered, specific steps (must include actual resource names or commands, not just "restart the database")

If ALL three criteria are met with substantive content, respond: {"ok": true}
If ANY criterion is missing, vague, or uses placeholder text, respond: {"ok": false, "reason": "Your response needs more depth before it reaches the user.
Specifically: ## Root Cause must name the exact failure mechanism, ## Evidence must include real metric values with numbers (latency in ms, connection counts, error rates), ## Recommended Actions must reference actual resource names and specific commands. Go back and verify your findings."}

This hook acts as an automated quality gate built directly into the SRE Agent runtime, catching superficial responses before they reach your on-call engineers.

Configuring Hook #2: Safety Guardrails
Our second agent-level hook is a PostToolUse hook that fires after the SRE Agent executes Bash or Python tools. This implements an allowlist pattern to control what commands can actually run in production:

Event Type: PostToolUse
Hook Type: Command (Python)
Matcher: Bash|ExecuteShellCommand|ExecutePythonCode
Activation: Always

Hook Script:

#!/usr/bin/env python3
import sys, json, re

context = json.load(sys.stdin)
tool_input = context.get('tool_input', {})
command = ''
if isinstance(tool_input, dict):
    command = tool_input.get('command', '') or tool_input.get('code', '')

# Safe allowlist — check these FIRST before any blocking logic
# These are explicitly approved remediation actions for PostgreSQL issues
safe_allowlist = [
    r'az\s+postgres\s+flexible-server\s+restart',
]
for safe_pattern in safe_allowlist:
    if re.search(safe_pattern, command, re.IGNORECASE):
        print(json.dumps({
            'decision': 'allow',
            'hookSpecificOutput': {
                'additionalContext': '[SAFETY] ✅ PostgreSQL server restart approved — recognized as a safe remediation action for connection pool exhaustion.'
            }
        }))
        sys.exit(0)

# Destructive commands to block
dangerous = [
    (r'\baz\s+postgres\s+flexible-server\s+delete\b', 'az postgres flexible-server delete (permanent server deletion)'),
    (r'\baz\s+\S+\s+delete\b', 'az delete (Azure resource deletion)'),
    (r'\brm\s+-rf\b', 'rm -rf (recursive force delete)'),
    (r'\bsudo\b', 'sudo (privilege escalation)'),
    (r'\bdrop\s+(table|database)\b', 'DROP TABLE/DATABASE (irreversible data loss)'),
    (r'\btruncate\s+table\b', 'TRUNCATE TABLE (irreversible data wipe)'),
    (r'\bdelete\s+from\b(?!.*\bwhere\b)', 'DELETE FROM without WHERE clause (wipes entire table)'),
]
for pattern, label in dangerous:
    if re.search(pattern, command, re.IGNORECASE):
        print(json.dumps({
            'decision': 'block',
            'reason': f'🛑 BLOCKED: {label} is not permitted. Use safe, non-destructive alternatives. For PostgreSQL connection issues, prefer server restart or connection pool configuration changes.'
        }))
        sys.exit(0)

print(json.dumps({'decision': 'allow'}))

This ensures only pre-approved PostgreSQL operations can execute, preventing accidental data deletion or configuration changes.

Now that we've configured both agent-level hooks, here's what our custom agent looks like in the canvas:
[Screenshot: Overview of sre_analyst_agent with hooks. Agent Canvas showing the sre_analyst_agent configuration with two agent-level hooks attached.]

Configuring Hook #3: Audit Trail
Finally, we create a Global hook using the Hooks management page in the Azure SRE Agent Portal. Global hooks apply across all custom agents in your organization, providing centralized governance:
[Screenshot: Global Hooks Management Page, creating the sre_audit_trail global hook.]
The Global Hooks management page shows the sre_audit_trail hook configuration with event type, activation mode, matcher pattern, and the Python script editor.

Event Type: PostToolUse
Hook Type: Command (Python)
Matcher: * (all tools)
Activation: On-demand

Hook Script:

#!/usr/bin/env python3
import sys, json

context = json.load(sys.stdin)
tool_name = context.get('tool_name', 'unknown')
agent_name = context.get('agent_name', 'unknown')
succeeded = context.get('tool_succeeded', False)
turn = context.get('current_turn', '?')

audit = f'[AUDIT] Turn {turn} | Agent: {agent_name} | Tool: {tool_name} | Success: {succeeded}'
print(audit, file=sys.stderr)
print(json.dumps({
    'decision': 'allow',
    'hookSpecificOutput': {
        'additionalContext': audit
    }
}))

By setting this as "on-demand," your SRE engineers can toggle this hook on/off per conversation thread from the chat interface—enabling detailed audit logging during incident investigations without overwhelming logs during routine queries.

Seeing Agent Hooks in Action
Now let's see how these hooks work together when our SRE Agent investigates a real production incident.

Activating Audit Trail
Before starting our investigation, we toggle on the audit trail hook from the chat interface:
[Screenshot: The "Manage hooks for this thread" menu showing the sre_audit_trail global hook toggled on for this conversation.]
This gives us visibility into every tool the agent executes during the investigation.

Starting the Investigation
We prompt our SRE Agent: "Can you check the octopets-prod-web application and diagnose any performance issues?"
The SRE Agent begins gathering metrics from Azure Monitor, and we immediately see our audit trail hook logging each tool execution. This real-time visibility is invaluable for understanding what your SRE Agent is doing and debugging issues when things don't go as planned.
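Because the audit hook prints lines in a fixed format, they are easy to post-process after an incident. As an illustration only (summarize_audit is our own hypothetical helper, not part of the product), a few lines of Python can aggregate per-tool outcomes from captured audit output:

```python
import re
from collections import Counter

# Matches the fixed line format emitted by the sre_audit_trail hook.
AUDIT_RE = re.compile(
    r"\[AUDIT\] Turn (?P<turn>\S+) \| Agent: (?P<agent>\S+) \| "
    r"Tool: (?P<tool>\S+) \| Success: (?P<success>\w+)"
)

def summarize_audit(lines):
    """Aggregate per-tool success/failure counts from audit trail output."""
    summary = Counter()
    for line in lines:
        m = AUDIT_RE.search(line)
        if not m:
            continue  # skip non-audit log noise
        outcome = "ok" if m.group("success") == "True" else "failed"
        summary[(m.group("tool"), outcome)] += 1
    return summary

sample = [
    "[AUDIT] Turn 1 | Agent: sre_analyst_agent | Tool: Bash | Success: True",
    "[AUDIT] Turn 2 | Agent: sre_analyst_agent | Tool: Bash | Success: False",
    "[AUDIT] Turn 3 | Agent: sre_analyst_agent | Tool: ExecutePythonCode | Success: True",
    "unrelated log line",
]
assert summarize_audit(sample)[("Bash", "failed")] == 1
```

A summary like this is a quick way to spot which tools failed repeatedly during a post-mortem review.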
Quality Gate Rejection
The SRE Agent completes its initial analysis and attempts to respond. But our Stop hook intercepts it—the response doesn't meet our quality standards:
[Screenshot: Stop hook rejection message, "Your response needs more depth and specificity...", forcing the agent to re-analyze with more evidence.]
The hook rejects the response and forces the SRE Agent to retry—gathering more evidence, querying additional metrics, and providing specific numbers. This self-correction happens automatically within the SRE Agent runtime, with no manual intervention required.

Structured Final Response
After re-verification, the SRE Agent presents a properly structured analysis that passes our quality gate:
[Screenshot: Agent response showing the required structure: a Root Cause section with the connection pool exhaustion diagnosis, an Evidence section with specific metric numbers, and Recommended Actions with the exact restart command.]
- Root Cause: Connection pool exhaustion
- Evidence: Specific metrics (83 active connections, P95 latency 847ms)
- Recommended Actions: Restart command with actual resource names
This is the level of rigor we expect from production-ready agents.

Safety Allowlist in Action
The SRE Agent determines it needs to restart the PostgreSQL server to remediate the connection pool exhaustion. Our PostToolUse hook intercepts the command execution and validates it against our allowlist:
[Screenshot: Code execution output showing the PostgreSQL metrics query results and the az postgres flexible-server restart command being executed successfully.]
Because the az postgres flexible-server restart command matches our safety allowlist pattern, the hook allows it to proceed. If the SRE Agent had attempted any unapproved operation (like DROP DATABASE or firewall rule changes), the safety hook would have blocked it immediately.
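Because Hook #2 is plain Python reading JSON on stdin, its decision logic is easy to unit-test locally before you deploy it. Here is a minimal sketch that re-implements the same allowlist-first ordering as a function (decide is our own name, and only a subset of the hook's patterns is shown):

```python
import re

# A subset of Hook #2's patterns, for illustration.
SAFE = [r'az\s+postgres\s+flexible-server\s+restart']
DANGEROUS = [
    r'\baz\s+postgres\s+flexible-server\s+delete\b',
    r'\brm\s+-rf\b',
    r'\bsudo\b',
    r'\bdrop\s+(table|database)\b',
]

def decide(command):
    """Mirror Hook #2's ordering: the allowlist wins before any deny pattern runs."""
    for pat in SAFE:
        if re.search(pat, command, re.IGNORECASE):
            return 'allow'
    for pat in DANGEROUS:
        if re.search(pat, command, re.IGNORECASE):
            return 'block'
    return 'allow'

assert decide('az postgres flexible-server restart -g rg -n octopets-db') == 'allow'
assert decide('az postgres flexible-server delete -g rg -n octopets-db') == 'block'
assert decide('sudo rm -rf /var/lib/data') == 'block'
assert decide('SELECT count(*) FROM pg_stat_activity') == 'allow'
```

Checking patterns against realistic command strings like this catches regex mistakes (a missing word boundary, an over-broad wildcard) before the hook ever runs in production.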
The Results
After the SRE Agent restarts the PostgreSQL server:
- P95 latency drops from 847ms back to ~120ms
- Active connections reset to healthy levels
- Application performance returns to normal
But more importantly, we achieved autonomous remediation with enterprise governance:
- ✅ Quality assurance: Every response met our evidence standards (enforced by Stop hooks)
- ✅ Safety controls: Only pre-approved operations executed (enforced by PostToolUse hooks)
- ✅ Complete audit trail: Every tool call logged for compliance (enforced by Global hooks)
- ✅ Zero manual interventions: The SRE Agent self-corrected when quality standards weren't met
This is the power of Agent Hooks—governance that doesn't get in the way of automation.

Key Takeaways
Agent Hooks bring production-grade governance to Azure SRE Agent:
- Layered Governance: Combine agent-level hooks for custom agent-specific controls with global hooks for organization-wide policies
- Fail-Safe by Default: Use allowlist patterns in PostToolUse hooks rather than denylists—explicitly permit safe operations instead of trying to block every dangerous one
- Self-Correcting SRE Agents: Stop hooks with quality gates create feedback loops that improve response quality without human intervention
- Audit Without Overhead: On-demand global hooks let your engineers toggle detailed logging only during incident investigations
- No Custom Middleware: All governance logic lives in your custom agent configuration—no need to build validation proxies or wrapper services

Getting Started
Agent Hooks are available now in the Azure SRE Agent platform.
You can configure them entirely through the UI—no API calls or tokens needed:
- Agent-Level Hooks: Navigate to the Agent Canvas → Hooks tab and add hooks directly to your custom agent
- Global Hooks: Use the Hooks management page to create organization-wide policies
- Thread-Level Control: Toggle on-demand hooks from the chat interface using the "Manage hooks" menu

Learn More
- Agent Hooks Documentation
- YAML Schema Reference
- Subagent Builder Guide

Ready to build safer, smarter agents? Start experimenting with Agent Hooks today at sre.azure.com.

Azure SRE Agent Now Builds Expertise Like Your Best Engineer: Introducing Deep Context
What if SRE Agent already knew your system before the next incident? Your most experienced SRE didn't become an expert overnight. Day one: reading runbooks, studying architecture diagrams, asking a lot of questions. Month three: knowing which services are fragile, which config changes cascade, which log patterns mean real trouble. Year two: diagnosing a production issue at 2 AM from a single alert because they'd built deep, living context about your systems. That learning process of absorbing documentation, reading code, handling incidents, and building intuition from every interaction is what makes an expert. Azure SRE Agent could do the same thing.

From pulling context to living in it
Azure SRE Agent already connects to Azure Monitor, PagerDuty, and ServiceNow. It queries Kusto logs, checks resource health, reads your code, and delivers root cause analysis, often resolving incidents without waking anyone up. Thousands of incidents handled. Thousands of engineering hours saved. Deep Context takes this to the next level. Instead of accessing context on demand, your agent now lives in it — continuously reading your code and knowledge, building persistent memory from every interaction, and evolving its understanding of your systems in the background.

Three things make Deep Context work:
- Continuous access. Source code, terminal, Python runtime, and Azure environment are available whenever the agent needs them. Connected repos are cloned into the agent's workspace automatically. The agent knows your code structure from the first message.
- Persistent memory. Insights from previous investigations, architecture understanding, team context — it all persists across sessions. The next time the agent picks up an alert, it already knows what happened last time.
- Background intelligence. Even when you're not chatting, background services continuously learn. After every conversation, the agent extracts what worked, what failed, what the root cause was.
It aggregates these across all past investigations to build evolving operational insights. The agent recognizes patterns you haven't noticed yet. One example: connected to Kusto, background scanning auto-discovers every table, documents schemas, and builds reusable query templates. But this learning applies broadly — every conversation, every incident, every data source makes the agent sharper.

Expertise that compounds with every incident

| | New on-call engineer | SRE Agent with Deep Context |
| --- | --- | --- |
| Alert fires | Opens runbook, looks up which service this maps to | Already knows the service, its dependencies, and failure patterns from prior incidents |
| Investigation | Reads logs, searches code, asks teammates | Goes straight to the relevant code path, correlates with logs and persistent insights from similar incidents |
| After 100 incidents | Becomes the team expert — irreplaceable institutional knowledge | Same institutional knowledge — always available, never forgets, scales across your entire organization |

A human expert takes months to build this depth. An agent with Deep Context builds it in days, and the knowledge compounds with every interaction.

You shape what your agent learns.
Deep Context learns automatically, but the best results come when your team actively guides what the agent retains. Type #remember in chat to save important facts your agent should always know: environment details, escalation paths, team preferences. For example: "#remember our Redis cache uses Premium tier with 6GB" or "#remember database failover takes approximately 15 minutes." These are recalled automatically during future investigations.

Turn investigations into knowledge.
After a good investigation, ask your agent to turn the resolution into a runbook: "Create a troubleshooting guide from the steps we just followed and save it to Knowledge settings."
The agent generates a structured document, uploads it, and indexes it — so the next time a similar issue occurs, the agent finds and follows that guide automatically. The agent captures insights from every conversation on its own. Your guidance tells it which ones matter most. This is exactly how Microsoft's own SRE team gets the best results: "Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again." Read the full story in The Agent That Investigates Itself.

See it in action: an Azure Monitor alert, end to end
An HTTP 5xx spike fires on your container app. Your agent is in autonomous mode. It acknowledges the alert, checks resource health, reads logs, and delivers a diagnosis — that's what it already does well. Deep Context makes this dramatically better. Two things change everything:
1. The agent already knows your environment. It's already read your code and runbooks and built context from previous investigations. Your route handlers, database layer, deployment configs, operational procedures: it knows all of it. So when this alert fires, it doesn't start from scratch. It goes straight to the relevant code path, correlates a recent connection pooling commit with the deployment timeline, and confirms the root cause.
2. The agent remembers. It's seen this pattern before: a similar incident last week that was investigated but never permanently fixed. It recognizes the recurrence from persistent memory, skips rediscovery, confirms the issue is still in the code, and this time fixes it. Because it's in autonomous mode, the agent edits the source code, restarts the container, pushes the fix to a new branch, creates a PR, opens a GitHub Issue, and verifies service health, all before you wake up.
The agent delivers a complete remediation summary including the alert, root cause with code references, fix applied, and PR created, without a single message from you. Code access turns diagnosis into action.
Persistent memory turns recurring problems into solved problems.

Give your agent your code — here's why it matters
If you're on an IT operations, SRE, or DevOps team, you might think: "Code access? That's for developers." We'd encourage you to rethink that. Your infrastructure-as-code, deployment configs, Helm charts, Terraform files, pipeline definitions — that's all code. And it's exactly the context your agent needs to go from good to extraordinary. When your agent can read your actual configuration and infrastructure code, investigations transform. Instead of generic troubleshooting, you get root cause analysis that points to the exact file, the exact line, the exact config change. It correlates a deployment failure with a specific commit. It reads your Helm values and spots the misconfiguration that caused the pod crash loop.

"Will the agent modify our production code?" No. The agent works in a secure sandbox — a copy of your repository, not your production environment. When it identifies a fix, it creates a pull request on a new branch. Your code review process, your CI/CD pipeline, your approval gates — all untouched. The agent proposes. Your team decides. Whether you're a developer, an SRE, or an IT operator managing infrastructure you didn't write, connecting your code is the single highest-impact thing you can do to make your agent smarter.

The compound effects
Deep Context amplifies every other SRE Agent capability:
- Deep Context + Incident management → Alerts fire, the agent correlates logs with actual code. Root cause references specific files and line numbers.
- Deep Context + Scheduled tasks → Automated code analysis, compliance checks, and drift detection — inspecting your actual infrastructure code, not just metrics.
- Deep Context + MCP connectors → Datadog, Splunk, PagerDuty data combined with source code context. The full picture in one conversation.
- Deep Context + Knowledge files → Upload runbooks, architecture docs, postmortems — in any format.
The agent cross-references your team's knowledge with live code, logs, and infrastructure state. Logs tell the agent what happened. Code tells it why. Your knowledge files tell it what to do about it.

Get started
Deep Context is available today as part of Azure SRE Agent GA. New agents have it enabled by default. For a step-by-step walkthrough connecting your code, logs, incidents, and knowledge files, see What It Takes to Give an SRE Agent a Useful Starting Point.

Resources
- SRE Agent GA Announcement blog - https://aka.ms/sreagent/ga
- SRE Agent GA What's new post - https://aka.ms/sreagent/blog/whatsnewGA
- SRE Agent Documentation - https://aka.ms/sreagent/newdocs
- SRE Agent Overview - https://aka.ms/sreagent/newdocsoverview

Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent — your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

Get started with Datadog MCP server in Azure SRE Agent
Overview
The Datadog MCP server is a cloud-hosted bridge between your Datadog organization and Azure SRE Agent. Once configured, it enables real-time interaction with logs, metrics, APM traces, monitors, incidents, dashboards, and other Datadog data through natural language. All actions respect your existing Datadog RBAC permissions.

The server uses Streamable HTTP transport with two custom headers (DD_API_KEY and DD_APPLICATION_KEY) for authentication. Azure SRE Agent connects directly to the Datadog-hosted endpoint—no npm packages, local proxies, or container deployments are required. The SRE Agent portal includes a dedicated Datadog MCP server connector type that pre-populates the required header keys for streamlined setup.

Key capabilities

| Area | Capabilities |
| --- | --- |
| Logs | Search and analyze logs with SQL-based queries, filter by facets and time ranges |
| Metrics | Query metric values, explore available metrics, get metric metadata and tags |
| APM | Search spans, fetch complete traces, analyze trace performance, compare traces |
| Monitors | Search monitors, validate configurations, inspect monitor groups and templates |
| Incidents | Search and get incident details, view timeline and responders |
| Dashboards | Search and list dashboards by name or tag |
| Hosts | Search hosts by name, tags, or status |
| Services | List services and map service dependencies |
| Events | Search events including monitor alerts, deployments, and custom events |
| Notebooks | Search and retrieve notebooks for investigation documentation |
| RUM | Search Real User Monitoring events for frontend observability |

This is the official Datadog-hosted MCP server (Preview). The server exposes 16+ core tools with additional toolsets available for alerting, APM, Database Monitoring, Error Tracking, feature flags, LLM Observability, networking, security, software delivery, and Synthetic tests. Tool availability depends on your Datadog plan and RBAC permissions.
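For a concrete picture of that transport, the sketch below assembles what a request to the hosted endpoint looks like: the two custom headers plus a JSON-RPC envelope, which is the wire format MCP uses. This is an illustration only; build_mcp_request is our own helper, the snippet never sends anything, and in practice the SRE Agent connector handles all of this for you:

```python
import json
import os

# US1 endpoint; other regions use different hosts (see Step 2).
MCP_URL = "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"

def build_mcp_request(method, params=None, req_id=1):
    """Assemble URL, headers, and JSON-RPC body for the Datadog MCP endpoint."""
    headers = {
        # The two custom headers the connector configures for you:
        "DD_API_KEY": os.environ.get("DD_API_KEY", "<your-api-key>"),
        "DD_APPLICATION_KEY": os.environ.get("DD_APPLICATION_KEY", "<your-app-key>"),
        "Content-Type": "application/json",
        # Streamable HTTP responses may arrive as plain JSON or as an SSE stream:
        "Accept": "application/json, text/event-stream",
    }
    body = {"jsonrpc": "2.0", "id": req_id, "method": method}
    if params is not None:
        body["params"] = params
    return MCP_URL, headers, json.dumps(body)

url, headers, body = build_mcp_request("tools/list")
assert json.loads(body) == {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}
```

Seeing the request shape makes it clear why no local proxy is needed: authentication is just two headers on an ordinary HTTPS POST.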
Prerequisites
- Azure SRE Agent resource deployed in Azure
- Datadog organization with an active plan
- Datadog user account with appropriate RBAC permissions
- API key: Created from Organization Settings > API Keys
- Application key: Created from Organization Settings > Application Keys with MCP Read and/or MCP Write permissions
- Your organization must be allowlisted for the Datadog MCP server Preview

Step 1: Create API and Application keys
The Datadog MCP server requires two credentials: an API key (identifies your organization) and an Application key (authenticates the user and defines permission scope). Both are created in the Datadog portal.

Create an API key
1. Log in to your Datadog organization (use your region-specific URL if applicable—e.g., app.datadoghq.eu for EU1)
2. Select your account avatar in the bottom-left corner of the navigation bar
3. Select Organization Settings
4. In the left sidebar, select API Keys (under the Access section). Direct URL: https://app.datadoghq.com/organization-settings/api-keys
5. Select + New Key in the top-right corner
6. Enter a descriptive name (e.g., sre-agent-mcp)
7. Select Create Key
8. Copy the key value immediately—it is shown only once. If lost, you must create a new key.

Tip: API keys are organization-level credentials. Any Datadog Admin or user with the API Keys Write permission can create them. The API key alone does not grant data access—it must be paired with an Application key.
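Before wiring the key into the connector, you can sanity-check it against Datadog's key-validation endpoint (a GET to /api/v1/validate with a DD-API-KEY header reports whether the key is accepted). The helper names below are our own, and note this validates only the API key; the Application key's MCP scopes are checked separately by the MCP server:

```python
import json
import urllib.request

def validation_request(api_key, site="datadoghq.com"):
    """Build the GET request for Datadog's key-validation endpoint.

    Use site="datadoghq.eu" for EU1, "us3.datadoghq.com" for US3, etc.
    """
    url = f"https://api.{site}/api/v1/validate"
    return urllib.request.Request(url, headers={"DD-API-KEY": api_key})

def check_api_key(api_key, site="datadoghq.com"):
    """Return True if Datadog reports the API key as valid (makes a network call)."""
    with urllib.request.urlopen(validation_request(api_key, site)) as resp:
        return json.loads(resp.read()).get("valid", False)

# Build-only check (no network): verify the request shape for the EU1 site.
req = validation_request("dummy-key", site="datadoghq.eu")
assert req.full_url == "https://api.datadoghq.eu/api/v1/validate"
assert "dummy-key" in req.headers.values()
```

A quick check_api_key(os.environ["DD_API_KEY"]) call from any machine with outbound HTTPS will confirm the key works before you touch the SRE Agent portal.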
Create an Application key
1. From the same Organization Settings page, select Application Keys in the left sidebar. Direct URL: https://app.datadoghq.com/organization-settings/application-keys
2. Select + New Key in the top-right corner
3. Enter a descriptive name (e.g., sre-agent-mcp-app)
4. Select Create Key
5. Copy the key value immediately—it is shown only once

Add MCP permissions to the Application key
After creating the Application key, you must grant it the MCP-specific scopes:
1. In the Application Keys list, locate the key you just created
2. Select the key name to open its detail panel
3. In the detail panel, find the Scopes section and select Edit
4. Search for MCP in the scopes search box
5. Check MCP Read to enable read access to Datadog data via MCP tools
6. Optionally check MCP Write if your agent needs to create or modify resources (e.g., feature flags, Synthetic tests)
7. Select Save

If you don't see the MCP Read or MCP Write scopes, your organization may not be enrolled in the Datadog MCP server preview. Contact your Datadog account representative to request access.

Required permissions summary

| Permission | Description | Required? |
| --- | --- | --- |
| MCP Read | Read access to Datadog data via MCP tools (logs, metrics, traces, monitors, etc.) | Yes |
| MCP Write | Write access for mutating operations (creating feature flags, editing Synthetic tests, etc.) | Optional |

For production use, create keys from a service account rather than a personal account. Navigate to Organization Settings > Service Accounts to create one. This ensures the integration continues to work if team members leave the organization. Apply the principle of least privilege—grant only MCP Read unless write operations are needed. Use scoped Application keys to restrict access to only the permissions your agent needs. This limits blast radius if a key is compromised.

Step 2: Add the MCP connector
Connect the Datadog MCP server to your SRE Agent using the portal.
The portal includes a dedicated Datadog connector type that pre-populates the required configuration.

Determine your regional endpoint
Select the endpoint URL that matches your Datadog organization's region:

| Region | Endpoint URL |
| --- | --- |
| US1 (default) | https://mcp.datadoghq.com/api/unstable/mcp-server/mcp |
| US3 | https://mcp.us3.datadoghq.com/api/unstable/mcp-server/mcp |
| US5 | https://mcp.us5.datadoghq.com/api/unstable/mcp-server/mcp |
| EU1 | https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp |
| AP1 | https://mcp.ap1.datadoghq.com/api/unstable/mcp-server/mcp |
| AP2 | https://mcp.ap2.datadoghq.com/api/unstable/mcp-server/mcp |

Using the Azure portal
1. In the Azure portal, navigate to your SRE Agent resource
2. Select Builder > Connectors
3. Select Add connector
4. Select Datadog MCP server and select Next
5. Configure the connector:

| Field | Value |
| --- | --- |
| Name | datadog-mcp |
| Connection type | Streamable-HTTP (pre-selected) |
| URL | https://mcp.datadoghq.com/api/unstable/mcp-server/mcp (change for non-US1 regions) |
| Authentication | Custom headers (pre-selected, disabled) |
| DD_API_KEY | Your Datadog API key |
| DD_APPLICATION_KEY | Your Datadog Application key |

6. Select Next to review
7. Select Add connector

The Datadog connector type pre-populates both header keys (DD_API_KEY and DD_APPLICATION_KEY) and sets the authentication method to "Custom headers" automatically. The default URL is the US1 endpoint—update it if your organization is in a different region. Once the connector shows Connected status, the Datadog MCP tools are automatically available to your agent. You can verify by checking the tools list in the connector details.

Step 3: Create a Datadog subagent (optional)
Create a specialized subagent to give the AI focused Datadog observability expertise and better prompt responses.
1. Navigate to Builder > Subagents
2. Select Add subagent
3. Paste the following YAML configuration:

api_version: azuresre.ai/v1
kind: AgentConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: DatadogObservabilityExpert
  display_name: Datadog Observability Expert
  system_prompt: |
    You are a Datadog observability expert with access to logs, metrics, APM traces, monitors, incidents, dashboards, hosts, services, and more via the Datadog MCP server.

    ## Capabilities

    ### Logs
    - Search logs using facets, tags, and time ranges with `search_datadog_logs`
    - Perform SQL-based log analysis with `analyze_datadog_logs` for aggregations, grouping, and statistical queries
    - Correlate log entries with traces and metrics

    ### Metrics
    - Query metric time series with `get_datadog_metric`
    - Get metric metadata, tags, and context with `get_datadog_metric_context`
    - Discover available metrics with `search_datadog_metrics`

    ### APM (Application Performance Monitoring)
    - Fetch complete traces with `get_datadog_trace`
    - Search distributed traces and spans with `search_datadog_spans`
    - Analyze service-level performance and latency patterns
    - Map service dependencies with `search_datadog_service_dependencies`

    ### Monitors & Alerting
    - Search monitors by name, tag, or status with `search_datadog_monitors`
    - Investigate triggered monitors and alert history
    - Correlate monitor alerts with underlying metrics and logs

    ### Incidents
    - Search incidents with `search_datadog_incidents`
    - Get incident details, timeline, and responders with `get_datadog_incident`
    - Correlate incidents with monitors, logs, and traces

    ### Infrastructure
    - Search hosts by name, tag, or status with `search_datadog_hosts`
    - List and discover services with `search_datadog_services`
    - Search dashboards with `search_datadog_dashboards`
    - Search events (monitor alerts, deployments) with `search_datadog_events`

    ### Notebooks
    - Search notebooks with `search_datadog_notebooks`
    - Retrieve notebook content with
`get_datadog_notebook` ### Real User Monitoring - Search RUM events for frontend performance data with `search_datadog_rum_events` ## Best Practices When investigating incidents: - Start with `search_datadog_incidents` or `get_datadog_incident` for context - Check related monitors with `search_datadog_monitors` - Correlate with `search_datadog_logs` and `get_datadog_metric` for root cause - Use `get_datadog_trace` to inspect request flows for latency issues - Check `search_datadog_hosts` for infrastructure-level problems When analyzing logs: - Use `analyze_datadog_logs` for SQL-based aggregation queries - Use `search_datadog_logs` for individual log retrieval and filtering - Include time ranges to narrow results and reduce response size - Filter by service, host, or status to focus on relevant data When working with metrics: - Use `search_datadog_metrics` to discover available metric names - Use `get_datadog_metric_context` to understand metric tags and metadata - Use `get_datadog_metric` to query actual metric values with time ranges When handling errors: - If access is denied, explain which RBAC permission is needed - Suggest the user verify their Application key has `MCP Read` or `MCP Write` - For large traces that appear truncated, note this is a known limitation mcp_connectors: - datadog-mcp handoffs: [] Select Save The mcp_connectors field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the Datadog MCP server. Step 4: Add a Datadog skill (optional) Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a Datadog skill to give your agent expertise in log queries, metric analysis, and incident investigation workflows. 
1. Navigate to **Builder > Skills**
2. Select **Add skill**
3. Paste the following skill configuration:

````yaml
api_version: azuresre.ai/v1
kind: SkillConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: datadog_observability
  display_name: Datadog Observability
  description: |
    Expertise in Datadog's observability platform including logs, metrics,
    APM, monitors, incidents, dashboards, hosts, and services. Use for
    searching logs, querying metrics, investigating incidents, analyzing
    traces, inspecting monitors, and navigating Datadog data via the
    Datadog MCP server.
  instructions: |
    ## Overview
    Datadog is a cloud-scale observability platform for logs, metrics, APM
    traces, monitors, incidents, infrastructure, and more. The Datadog MCP
    server enables natural language interaction with your organization's
    Datadog data.

    **Authentication:** Two custom headers—`DD_API_KEY` (API key) and
    `DD_APPLICATION_KEY` (Application key with MCP permissions). All actions
    respect existing RBAC permissions.

    **Regional endpoints:** The MCP server URL varies by Datadog region
    (US1, US3, US5, EU1, AP1, AP2). Ensure the connector URL matches your
    organization's region.

    ## Searching Logs
    Use `search_datadog_logs` for individual log retrieval and
    `analyze_datadog_logs` for SQL-based aggregation queries.

    **Common log search patterns:**
    ```
    # Errors from a specific service
    service:payment-api status:error

    # Logs from a host in the last hour
    host:web-prod-01

    # Logs containing a specific trace ID
    trace_id:abc123def456

    # Errors with a specific HTTP status
    @http.status_code:500 service:api-gateway

    # Logs from a Kubernetes pod
    kube_namespace:production kube_deployment:checkout-service
    ```

    **SQL-based log analysis with `analyze_datadog_logs`:**
    ```sql
    -- Count errors by service in the last hour
    SELECT service, count(*) as error_count
    FROM logs
    WHERE status = 'error'
    GROUP BY service
    ORDER BY error_count DESC

    -- Average response time by endpoint
    SELECT @http.url_details.path, avg(@duration) as avg_duration
    FROM logs
    WHERE service = 'api-gateway'
    GROUP BY @http.url_details.path
    ```

    ## Querying Metrics
    Use `search_datadog_metrics` to discover metrics,
    `get_datadog_metric_context` for metadata, and `get_datadog_metric`
    for time series data.

    **Common metric patterns:**
    ```
    # System metrics
    system.cpu.user, system.mem.used, system.disk.used

    # Container metrics
    docker.cpu.usage, kubernetes.cpu.requests

    # Application metrics
    trace.servlet.request.hits, trace.servlet.request.duration

    # Custom metrics
    app.payment.processed, app.queue.depth
    ```

    Always specify a time range when querying metrics to avoid retrieving
    excessive data.

    ## Investigating Traces
    Use `get_datadog_trace` for complete trace details and
    `search_datadog_spans` for span-level queries.

    **Trace investigation workflow:**
    1. Search for slow or errored spans with `search_datadog_spans`
    2. Get the full trace with `get_datadog_trace` using the trace ID
    3. Identify the bottleneck service or operation
    4. Correlate with `search_datadog_logs` using the trace ID
    5. Check related metrics with `get_datadog_metric`

    ## Working with Monitors
    Use `search_datadog_monitors` to find monitors by name, tag, or status.

    **Common monitor queries:**
    ```
    # Find all triggered monitors
    Search for monitors with status "Alert"

    # Find monitors for a specific service
    Search for monitors tagged with service:payment-api

    # Find monitors by name
    Search for monitors matching "CPU" or "memory"
    ```

    ## Incident Investigation Workflow
    For structured incident investigation:
    1. `search_datadog_incidents` — find recent or active incidents
    2. `get_datadog_incident` — get full incident details and timeline
    3. `search_datadog_monitors` — check which monitors triggered
    4. `search_datadog_logs` — search for errors around the incident time
    5. `get_datadog_metric` — check key metrics for anomalies
    6. `get_datadog_trace` — inspect request traces for latency or errors
    7. `search_datadog_hosts` — verify infrastructure health
    8. `search_datadog_service_dependencies` — map affected services

    ## Working with Dashboards and Notebooks
    - Use `search_datadog_dashboards` to find dashboards by title or tag
    - Use `search_datadog_notebooks` and `get_datadog_notebook` for
      investigation notebooks that document past analyses

    ## Toolsets
    The Datadog MCP server supports toolsets via the `?toolsets=` query
    parameter on the endpoint URL.

    Available toolsets:

    | Toolset | Description |
    |---------|-------------|
    | `core` | Logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, notebooks (default) |
    | `alerting` | Monitor validation, groups, and templates |
    | `apm` | Trace analysis, span search, Watchdog insights, performance investigation |
    | `dbm` | Database Monitoring query plans and samples |
    | `error-tracking` | Error Tracking issues across RUM, Logs, and Traces |
    | `feature-flags` | Creating, listing, and updating feature flags |
    | `llmobs` | LLM Observability spans |
    | `networks` | Cloud Network Monitoring, Network Device Monitoring |
    | `onboarding` | Guided Datadog setup and configuration |
    | `security` | Code security scanning, security signals, findings |
    | `software-delivery` | CI Visibility, Test Optimization |
    | `synthetics` | Synthetic test management |

    To enable additional toolsets, append `?toolsets=core,apm,alerting` to
    the connector URL.

    ## Troubleshooting

    | Issue | Solution |
    |-------|----------|
    | 401/403 errors | Verify API key and Application key are correct and active |
    | No data returned | Check that Application key has `MCP Read` permission |
    | Wrong region | Ensure the connector URL matches your Datadog organization's region |
    | Truncated traces | Large traces may be truncated; this is a known limitation |
    | Tool not found | The tool may require a non-default toolset; update the connector URL |
    | Write operations fail | Verify Application key has `MCP Write` permission |
  mcp_connectors:
    - datadog-mcp
````

4. Select **Save**

### Reference the skill in your subagent

Update your subagent configuration to include the skill:

```yaml
spec:
  name: DatadogObservabilityExpert
  skills:
    - datadog_observability
  mcp_connectors:
    - datadog-mcp
```

## Step 5: Test the integration

1. Open a new chat session with your SRE Agent
2. Try these example prompts:

**Log analysis**

- Search for error logs from the payment-api service in the last hour
- Analyze logs to count errors by service over the last 24 hours
- Find all logs with HTTP 500 status from the api-gateway in the last 30 minutes
- Show me the most recent logs from host web-prod-01

**Metrics investigation**

- What is the current CPU usage across all production hosts?
- Show me the request rate and error rate for the checkout-service over the last 4 hours
- What metrics are available for the payment-api service?
- Get the p99 latency for the api-gateway service in the last hour

**APM and trace analysis**

- Find the slowest traces for the checkout-service in the last hour
- Get the full trace details for trace ID abc123def456
- What services depend on the payment-api?
- Search for errored spans in the api-gateway service from the last 30 minutes

**Monitor and alerting workflows**

- Show me all monitors currently in Alert status
- Find monitors related to the database-primary host
- What monitors are tagged with team:platform?
- Search for monitors matching "disk space" or "memory"

**Incident investigation**

- Show me all active incidents from the last 24 hours
- Get details for incident INC-12345 including the timeline
- What monitors triggered during the last production incident?
- Correlate the most recent incident with related logs and metrics

**Infrastructure and dashboards**

- Search for hosts tagged with env:production and team:platform
- List all dashboards related to "Kubernetes" or "EKS"
- What services are running in the production environment?
- Show me recent deployment events for the checkout-service

## Available tools

### Core toolset (default)

The core toolset is included by default and provides essential observability tools.
| Tool | Description |
|------|-------------|
| `search_datadog_logs` | Search logs by facets, tags, and time ranges |
| `analyze_datadog_logs` | SQL-based log analysis for aggregations and statistical queries |
| `get_datadog_metric` | Query metric time series with rollup and aggregation |
| `get_datadog_metric_context` | Get metric metadata, tags, and related context |
| `search_datadog_metrics` | List and discover available metrics |
| `get_datadog_trace` | Fetch a complete distributed trace by trace ID |
| `search_datadog_spans` | Search APM spans by service, operation, or tags |
| `search_datadog_monitors` | Search monitors by name, tag, or status |
| `get_datadog_incident` | Get incident details including timeline and responders |
| `search_datadog_incidents` | List and search incidents |
| `search_datadog_dashboards` | Search dashboards by title or tag |
| `search_datadog_hosts` | Search hosts by name, tag, or status |
| `search_datadog_services` | List and search services |
| `search_datadog_service_dependencies` | Map service dependency relationships |
| `search_datadog_events` | Search events (monitor alerts, deployments, custom events) |
| `get_datadog_notebook` | Retrieve notebook content by ID |
| `search_datadog_notebooks` | Search notebooks by title or tag |
| `search_datadog_rum_events` | Search Real User Monitoring events |

### Alerting toolset

Enable with `?toolsets=core,alerting` on the connector URL.

| Tool | Description |
|------|-------------|
| `validate_datadog_monitor` | Validate monitor configuration before creation |
| `get_datadog_monitor_templates` | Get monitor configuration templates |
| `search_datadog_monitor_groups` | Search monitor groups and their statuses |

### APM toolset

Enable with `?toolsets=core,apm` on the connector URL.

| Tool | Description |
|------|-------------|
| `apm_search_spans` | Advanced span search with APM-specific filters |
| `apm_explore_trace` | Interactive trace exploration and analysis |
| `apm_trace_summary` | Get a summary analysis of a trace |
| `apm_trace_comparison` | Compare two traces side by side |
| `apm_analyze_trace_metrics` | Analyze aggregated trace metrics and trends |

### Database Monitoring toolset

Enable with `?toolsets=core,dbm` on the connector URL.

| Tool | Description |
|------|-------------|
| `search_datadog_dbm_plans` | Search database query execution plans |
| `search_datadog_dbm_samples` | Search database query samples and statistics |

### Error Tracking toolset

Enable with `?toolsets=core,error-tracking` on the connector URL.

| Tool | Description |
|------|-------------|
| `search_datadog_error_tracking_issues` | Search error tracking issues across RUM, Logs, and Traces |
| `get_datadog_error_tracking_issue` | Get details of a specific error tracking issue |

### Feature Flags toolset

Enable with `?toolsets=core,feature-flags` on the connector URL.

| Tool | Description |
|------|-------------|
| `list_datadog_feature_flags` | List feature flags |
| `create_datadog_feature_flag` | Create a new feature flag |
| `update_datadog_feature_flag_environment` | Update feature flag settings for an environment |

### LLM Observability toolset

Enable with `?toolsets=core,llmobs` on the connector URL.

| Tool | Description |
|------|-------------|
| LLM Observability spans | Query and analyze LLM Observability span data |

### Networks toolset

Enable with `?toolsets=core,networks` on the connector URL.

| Tool | Description |
|------|-------------|
| Cloud Network Monitoring tools | Analyze cloud network traffic and dependencies |
| Network Device Monitoring tools | Monitor and troubleshoot network devices |

### Security toolset

Enable with `?toolsets=core,security` on the connector URL.

| Tool | Description |
|------|-------------|
| `datadog_code_security_scan` | Run code security scanning |
| `datadog_sast_scan` | Run Static Application Security Testing |
| `datadog_secrets_scan` | Scan for secrets and credentials in code |

### Software Delivery toolset

Enable with `?toolsets=core,software-delivery` on the connector URL.

| Tool | Description |
|------|-------------|
| `search_datadog_ci_pipeline_events` | Search CI pipeline execution events |
| `get_datadog_flaky_tests` | Identify flaky tests in CI pipelines |

### Synthetics toolset

Enable with `?toolsets=core,synthetics` on the connector URL.

| Tool | Description |
|------|-------------|
| `get_synthetics_tests` | List and get Synthetic test configurations |
| `edit_synthetics_tests` | Edit Synthetic test settings |
| `synthetics_test_wizard` | Guided wizard for creating Synthetic tests |

## Toolsets

The Datadog MCP server organizes tools into toolsets.
By default, only the core toolset is enabled. To enable additional toolsets, append the `?toolsets=` query parameter to the connector URL.

### Syntax

```
https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting
```

### Examples

| Use case | URL suffix |
|----------|------------|
| Default (core only) | No suffix needed |
| Core + APM analysis | `?toolsets=core,apm` |
| Core + Alerting + APM | `?toolsets=core,alerting,apm` |
| Core + Database Monitoring | `?toolsets=core,dbm` |
| Core + Security scanning | `?toolsets=core,security` |
| Core + CI/CD visibility | `?toolsets=core,software-delivery` |
| All toolsets | `?toolsets=core,alerting,apm,dbm,error-tracking,feature-flags,llmobs,networks,onboarding,security,software-delivery,synthetics` |

> [!TIP]
> Only enable the toolsets you need. Each additional toolset increases the number of tools exposed to the agent, which can increase token usage and may impact response quality. Start with `core` and add toolsets as needed.

### Updating the connector URL

To add toolsets after initial setup:

1. Navigate to **Builder > Connectors**
2. Select the `datadog-mcp` connector
3. Update the **URL** field to include the `?toolsets=` parameter
4. Select **Save**

## Troubleshooting

### Authentication issues

| Error | Cause | Solution |
|-------|-------|----------|
| 401 Unauthorized | Invalid API key or Application key | Verify both keys are correct and active in Organization Settings |
| 403 Forbidden | Missing RBAC permissions | Ensure the Application key has `MCP Read` and/or `MCP Write` permissions |
| Connection refused | Wrong regional endpoint | Verify the connector URL matches your Datadog organization's region |
| "Organization not allowlisted" | Preview access not granted | Contact Datadog support to request MCP server Preview access |

### Data and permission issues

| Error | Cause | Solution |
|-------|-------|----------|
| No data returned | Insufficient permissions or wrong time range | Verify Application key permissions; try a broader time range |
| Tool not found | Tool belongs to a non-default toolset | Add the required toolset to the `?toolsets=` parameter in the connector URL |
| Truncated trace data | Trace exceeds size limit | Large traces are truncated for context window efficiency; query specific spans instead |
| Write operation failed | Missing `MCP Write` permission | Add `MCP Write` permission to the Application key |
| Metric not found | Wrong metric name or no data in time range | Use `search_datadog_metrics` to discover available metric names |

### Verify the connection

Test the server endpoint directly:

```bash
curl -I "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp" \
  -H "DD_API_KEY: <your_api_key>" \
  -H "DD_APPLICATION_KEY: <your_application_key>"
```

Expected response: `200 OK` confirms authentication is working.

### Re-authorize the integration

If you encounter persistent issues:

1. Navigate to **Organization Settings > Application Keys** in Datadog
2. Revoke the existing Application key
3. Create a new Application key with the required `MCP Read` / `MCP Write` permissions
4. Update the connector in the SRE Agent portal with the new key

## Limitations

| Limitation | Details |
|------------|---------|
| Preview only | The Datadog MCP server is in Preview and not recommended for production use |
| Allowlisted organizations | Only organizations that have been allowlisted by Datadog can access the MCP server |
| Large trace truncation | Responses are optimized for LLM context windows; large traces may be truncated |
| Unstable API path | The endpoint URL contains `/unstable/`, indicating the API may change without notice |
| Toolset availability | Some toolsets may not be available depending on your Datadog plan and features enabled |
| Regional endpoints | You must use the endpoint matching your organization's region; cross-region queries are not supported |

## Security considerations

### How permissions work

- **RBAC-scoped:** All actions respect the RBAC permissions associated with the API and Application keys
- **Key-based:** Access is controlled through API key (organization-level) and Application key (user or service account-level)
- **Permission granularity:** `MCP Read` enables read operations; `MCP Write` enables mutating operations

### Admin controls

Datadog administrators can:

- Create and revoke API and Application keys in Organization Settings
- Assign granular RBAC permissions (`MCP Read`, `MCP Write`) to Application keys
- Use service accounts to decouple access from individual user accounts
- Monitor MCP tool usage through the Datadog Audit Trail
- Scope Application keys to limit the blast radius of compromised credentials

The Datadog MCP server can read sensitive operational data including logs, metrics, and traces. Use service accounts with scoped Application keys, grant only the permissions your agent needs, and monitor the Audit Trail for unusual activity.

## Related content

- Datadog MCP Server documentation
- Datadog API and Application keys
- Datadog RBAC permissions
- Datadog Audit Trail
- Datadog regional sites
- MCP integration overview
- Build a custom subagent

# Build a Custom SSL Certificate Monitor with Azure SRE Agent: From Python Tool to Production Skill
**TL;DR:** Expired SSL certificates cause outages that are 100% preventable. In this post, you'll learn how to create a custom Python tool in Azure SRE Agent that checks SSL certificate expiry across your domains, then wrap it in a skill that gives your agent a complete certificate health audit workflow. The result: your SRE Agent proactively flags certificates expiring in the next 30 days and recommends renewal actions before they become 3 AM pages.

## The Problem Every ITOps Team Knows Too Well

It's a Tuesday morning. Your monitoring dashboard lights up with alerts: your customer-facing API is returning connection errors. Users are calling. Slack is on fire. After 20 minutes of frantic debugging, someone discovers the root cause: an SSL certificate expired overnight.

This scenario plays out across enterprises every week. According to industry data, certificate-related outages cost an average of $300,000 per incident in downtime and remediation. The frustrating part? Every single one is preventable.

> ITOps teams say: "We have spreadsheets for tracking certs, but someone always forgets to update them after a renewal."
>
> On-call engineers say: "I spent 20 minutes debugging before realizing it was just an expired certificate."

Most teams rely on a patchwork of solutions, and they all have gaps:

| Current Approach | The Gap |
|------------------|---------|
| Spreadsheets | Go stale; someone always forgets to update after a renewal |
| Calendar reminders | Fire too late; 7 days isn't enough for compliance review |
| Standalone SaaS tools | Don't integrate with existing incident workflows |
| Manual checks | Don't scale with multi-domain sprawl |

What if your SRE Agent could check certificate health as part of its regular investigation workflow, and proactively warn you during routine health checks?
## What We'll Build

In this tutorial, you'll create two things:

1. **A Python Tool (`CheckSSLCertificateExpiry`)**: a custom tool that connects to any domain, retrieves its SSL certificate details, and returns structured data about the certificate's validity, issuer, and days until expiry.
2. **A Skill (`ssl_certificate_audit`)**: a reusable knowledge package that teaches your SRE Agent how to perform a complete certificate health audit across multiple domains, classify risk levels, and recommend actions.

By the end, your agent will respond to prompts like:

- "Check the SSL certificates for all our production domains"
- "Are any of our certificates expiring in the next 30 days?"
- "Run a certificate health audit for api.contoso.com, portal.contoso.com, and store.contoso.com"

*The CheckSSLCertificateExpiry tool in the Azure SRE Agent portal, showing the Python code, parameters, and description.*

## Prerequisites

- An Azure SRE Agent instance deployed in your subscription
- Access to the Azure SRE Agent portal
- Basic familiarity with Python and YAML

## Part 1: Creating the Python Tool

### Step 1: Create the Tool in the Portal

Navigate to the Azure SRE Agent portal, go to **Settings > Subagent Builder**, and click **Create New Tool**. Select **Python Tool** as the type, enter the name `CheckSSLCertificateExpiry`, and provide the description:

> Checks SSL/TLS certificate expiry for a given domain and returns certificate details including days until expiration, issuer, and validity dates.
Add two parameters:

- `domain` (string, required): The fully qualified domain name to check (e.g., api.contoso.com)
- `port` (string, optional): The port to connect on (default 443)

### Step 2: Write the Python Code

In the **Function Code** field, paste the following Python implementation:

```python
import ssl
import socket
from datetime import datetime, timezone


def main(domain, port="443"):
    """Check SSL certificate expiry for a domain."""
    port = int(port)
    context = ssl.create_default_context()
    try:
        with socket.create_connection((domain, port), timeout=10) as sock:
            with context.wrap_socket(sock, server_hostname=domain) as ssock:
                cert = ssock.getpeercert()

        not_before = datetime.strptime(
            cert["notBefore"], "%b %d %H:%M:%S %Y %Z"
        ).replace(tzinfo=timezone.utc)
        not_after = datetime.strptime(
            cert["notAfter"], "%b %d %H:%M:%S %Y %Z"
        ).replace(tzinfo=timezone.utc)
        now = datetime.now(timezone.utc)
        days_remaining = (not_after - now).days

        issuer = dict(x[0] for x in cert.get("issuer", []))
        subject = dict(x[0] for x in cert.get("subject", []))

        if days_remaining < 0:
            risk_level = "EXPIRED"
        elif days_remaining <= 7:
            risk_level = "CRITICAL"
        elif days_remaining <= 30:
            risk_level = "WARNING"
        elif days_remaining <= 60:
            risk_level = "ATTENTION"
        else:
            risk_level = "HEALTHY"

        san_list = []
        for entry_type, value in cert.get("subjectAltName", []):
            if entry_type == "DNS":
                san_list.append(value)

        return {
            "domain": domain,
            "port": port,
            "status": "valid" if days_remaining >= 0 else "expired",
            "risk_level": risk_level,
            "days_remaining": days_remaining,
            "not_before": not_before.isoformat(),
            "not_after": not_after.isoformat(),
            "issuer": issuer.get("organizationName", "Unknown"),
            "issuer_cn": issuer.get("commonName", "Unknown"),
            "subject_cn": subject.get("commonName", domain),
            "serial_number": cert.get("serialNumber", "Unknown"),
            "version": cert.get("version", "Unknown"),
            "san_count": len(san_list),
            "san_domains": san_list[:10],
            "checked_at": now.isoformat(),
        }
    except ssl.SSLCertVerificationError as e:
        return {
            "domain": domain,
            "port": port,
            "status": "verification_failed",
            "risk_level": "CRITICAL",
            "error": str(e),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }
    except (socket.timeout, socket.gaierror, ConnectionRefusedError, OSError) as e:
        return {
            "domain": domain,
            "port": port,
            "status": "connection_failed",
            "risk_level": "UNKNOWN",
            "error": str(e),
            "checked_at": datetime.now(timezone.utc).isoformat(),
        }
```

**Key Design Decisions:**

- **Structured output**: the tool returns a JSON object with clearly labeled fields so the LLM can compare, sort, and aggregate results across multiple domains.
- **Risk classification**: five risk levels (EXPIRED, CRITICAL, WARNING, ATTENTION, HEALTHY) give the agent clear thresholds to reason about.
- **Error handling**: specific exception types return structured error objects rather than crashing, so the agent gets useful information even when a domain is unreachable.
- **Zero dependencies**: uses only the Python standard library (`ssl`, `socket`, `datetime`) for fast cold starts and no supply chain risk.

### Step 3: Deploy the Tool

Click **Save** in the tool editor to deploy the tool to your SRE Agent instance. The portal validates the YAML and Python code before saving.

*The Subagent Builder in the Azure SRE Agent portal, showing all deployed subagents, Python tools, and skills at a glance.*

### Step 4: Test the Tool

Open a new chat thread in the portal, select the SSLCertificateMonitor agent, and type:

> "Check the SSL certificate for microsoft.com"

The agent checks microsoft.com and returns real certificate data: valid, healthy, 164 days remaining, issued by Microsoft Azure RSA TLS Issuing CA 04.

## Part 2: Creating the Skill

A tool gives the agent a capability. A skill gives the agent a methodology.

- Tool: "I can check one certificate."
- Skill: "Here's how to audit all your certificates, classify the risks, and tell you exactly what to do about each one."

### What is a Skill?
A skill is a markdown document with YAML frontmatter that contains:

- **Metadata**: name, description, and which tools the skill uses
- **Instructions**: step-by-step guidance the agent follows when the skill is loaded

Think of it as a runbook injected into the agent's context when relevant.

### Step 1: Create the Skill in the Portal

In the Azure SRE Agent portal, go to **Settings > Subagent Builder** and click **Create New Skill**. You will need to provide the full SKILL.md content, which includes both the YAML frontmatter and the markdown instructions.

### Step 2: Write the Skill Document

Paste the following as the complete skill content:

```markdown
---
name: ssl_certificate_audit
description: |
  Load this skill when the user asks about SSL/TLS certificate health,
  certificate expiry, certificate monitoring, or requests a certificate
  audit across one or more domains. Trigger phrases: "check our
  certificates", "are any certs expiring", "SSL audit", "certificate
  health check", "TLS certificate status", "cert renewal needed".
  Do NOT load for general security assessments, network connectivity
  issues unrelated to TLS, or application-level HTTPS errors (use
  standard troubleshooting for those).
tools:
  - CheckSSLCertificateExpiry
---

# SSL/TLS Certificate Health Audit Skill

## Purpose
Perform a structured certificate health audit across one or more domains:
check each certificate, classify risk, aggregate findings, and deliver a
prioritized action plan with specific renewal deadlines.

## Scope
Focus ONLY on SSL/TLS certificate validity, expiry, and health. Exclude:
- Application-level HTTPS configuration issues
- Cipher suite or TLS version analysis (unless certificate is the root cause)
- Certificate Authority trust chain debugging (unless verification fails)

## Workflow

### Phase 1: Domain Collection
1. If the user provides specific domains, use those directly.
2. If the user says "all our domains" or "production domains," ask them to
   list the domains or provide a resource group to discover App Services,
   Front Doors, or API Management instances with custom domains.
3. Confirm the domain list before proceeding.

### Phase 2: Certificate Checks
1. Run CheckSSLCertificateExpiry for each domain. Execute checks in
   parallel when possible.
2. Collect all results before analysis.
3. If any domain returns a connection error, note it separately; do not
   abort the audit.

### Phase 3: Risk Classification and Reporting
Classify each certificate into one of these categories:

| Risk Level | Criteria | Required Action |
|------------|----------|-----------------|
| EXPIRED | days_remaining < 0 | Immediate renewal; this is causing outages |
| CRITICAL | days_remaining <= 7 | Emergency renewal within 24 hours |
| WARNING | days_remaining <= 30 | Schedule renewal this sprint |
| ATTENTION | days_remaining <= 60 | Add to next renewal cycle |
| HEALTHY | days_remaining > 60 | No action needed |

### Phase 4: Summary Report
Present findings in this order:
1. **Executive Summary** (1-2 sentences): Total domains checked, how many
   need action.
2. **Certificates Needing Action** (table): Domain, expiry date, days
   remaining, risk level, recommended action. Sort by days_remaining
   ascending (most urgent first).
3. **Healthy Certificates** (compact list): Domain and expiry date only.
4. **Unreachable Domains** (if any): Domain and error reason.
5. **Recommendations**: Specific next steps based on findings.

### Phase 5: Actionable Recommendations
Based on findings, recommend:
- **For EXPIRED or CRITICAL**: "Renew the certificate for {domain}
  immediately. If using Azure-managed certificates, check the App Service
  custom domain binding. If using a third-party CA, initiate the renewal
  process with {issuer}."
- **For WARNING**: "Schedule renewal for {domain} (expires {date}).
  Recommended to renew by {date - 7 days} to allow for propagation and
  testing."
- **For ATTENTION**: "Add {domain} to the renewal queue. Certificate
  expires {date}."
- **For mixed results**: "Consider implementing automated certificate
  management (e.g., Azure Key Vault with auto-renewal) to prevent future
  expiry risks."

## Output Format
Use markdown tables for certificate status. Include the checked_at
timestamp to establish when the audit was performed. Bold the risk level
for EXPIRED and CRITICAL entries.

## Example Output (Condensed)
Certificate Health Audit: 5 domains checked at 2026-02-18T14:30:00Z.
2 certificates need immediate attention; 3 are healthy.

| Domain | Expires | Days Left | Risk | Action |
|--------|---------|-----------|------|--------|
| api.contoso.com | 2026-02-20 | **2** | **CRITICAL** | Renew within 24 hours |
| store.contoso.com | 2026-03-10 | 20 | WARNING | Schedule renewal this sprint |
| portal.contoso.com | 2026-06-15 | 117 | HEALTHY | None |
| auth.contoso.com | 2026-08-22 | 185 | HEALTHY | None |
| cdn.contoso.com | 2026-09-01 | 195 | HEALTHY | None |

Recommendation: Renew api.contoso.com immediately to prevent service
disruption. Schedule store.contoso.com renewal by March 3rd.

## Quality Principles
- Check all domains before reporting (don't report one-by-one).
- Never guess certificate details; only report what the tool returns.
- Sort urgent items first in all outputs.
- Include specific dates, not vague timeframes.
- Align with system prompt: answer first, then evidence.
```

### Step 3: Deploy the Skill and Configure the Agent

Back in the Subagent Builder, create a new subagent called `SSLCertificateMonitor`. In the agent configuration:

1. Add the `CheckSSLCertificateExpiry` tool to the agent's tool list
2. Enable **Allow Parallel Tool Calls** in the agent settings
3. Click **Save** to deploy the agent

Skills are automatically enabled on every agent, so no additional configuration is needed. The portal will validate and deploy the skill, tool, and agent together.
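Before relying on the agent's output, you can sanity-check the threshold logic offline. This standalone sketch (not part of the deployed tool) mirrors the risk classification used in both the tool code and the skill's Phase 3 table; `days_until` parses the `notAfter` string format that `ssl.getpeercert()` returns:

```python
from datetime import datetime, timezone


def classify_risk(days_remaining):
    """Mirror the tool's five risk thresholds for offline unit testing."""
    if days_remaining < 0:
        return "EXPIRED"
    if days_remaining <= 7:
        return "CRITICAL"
    if days_remaining <= 30:
        return "WARNING"
    if days_remaining <= 60:
        return "ATTENTION"
    return "HEALTHY"


def days_until(not_after, now):
    """Parse the notAfter format returned by ssl.getpeercert(), e.g. 'Feb 20 12:00:00 2026 GMT'."""
    expiry = datetime.strptime(not_after, "%b %d %H:%M:%S %Y %Z").replace(
        tzinfo=timezone.utc
    )
    return (expiry - now).days


# With a fixed "now", the classification is deterministic and testable
# without opening any network connections.
now = datetime(2026, 2, 18, tzinfo=timezone.utc)
print(classify_risk(days_until("Feb 20 12:00:00 2026 GMT", now)))  # CRITICAL
```

Because the thresholds live in one pure function, they can be unit-tested in CI without any network access, which is exactly why the tool separates classification from the socket work.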
*The SSLCertificateMonitor subagent in the portal, showing the CheckSSLCertificateExpiry tool, agent instructions, and skills enabled.*

## Part 3: See It in Action

Here's what happens when you ask the agent to audit four real domains: microsoft.com, azure.com, github.com, and learn.microsoft.com.

Open a new chat thread in the portal, select the SSLCertificateMonitor agent, and type:

> "Run a certificate health audit for microsoft.com, azure.com, github.com, and learn.microsoft.com"

The agent checks all 4 domains in parallel, classifies github.com as ATTENTION (45 days remaining), and recommends scheduling renewal by March 29, 2026.

The agent:

- ✅ Loaded the `ssl_certificate_audit` skill (matched by "certificate health audit")
- ✅ Ran `CheckSSLCertificateExpiry` for all 4 domains in parallel
- ✅ Classified github.com as ATTENTION (45 days) and the rest as HEALTHY
- ✅ Produced a prioritized report: action items first, healthy domains second
- ✅ Recommended a specific renewal date and suggested Azure Key Vault auto-renewal

**Real result:** This audit ran against live production domains and completed in under 25 seconds. The agent correctly identified that github.com's certificate expires soonest and needs to be added to the renewal cycle.

### Scenario 1: Morning Certificate Health Check

User: "Run a certificate health check across our production domains: api.contoso.com, portal.contoso.com, store.contoso.com, auth.contoso.com, and payments.contoso.com"

The agent:

- ✅ Loads the `ssl_certificate_audit` skill (matched by "certificate health check")
- ✅ Runs `CheckSSLCertificateExpiry` for each domain in parallel
- ✅ Classifies each result by risk level
- ✅ Delivers a prioritized report with specific action items

### Scenario 2: Discovering Cert Issues During Incident Investigation

During a connectivity incident, the agent may use `CheckSSLCertificateExpiry` to check if the certificate has expired, discovering the root cause without the engineer needing to manually check.
Scenario 3: Cross-Agent Integration

Because the skill references tools by name, any agent with access to CheckSSLCertificateExpiry can use it: add it to your triage agent, weekly health-check workflow, or other skills that deal with frontend health.

How Tools and Skills Work Together

```
┌──────────────────────────────────────┐
│              Skill                   │
│      "ssl_certificate_audit"         │
│                                      │
│  Methodology:                        │
│  1. Collect domains                  │
│  2. Check each certificate ─┐        │
│  3. Classify risk levels    │        │
│  4. Generate report         │        │
│  5. Recommend actions       │        │
└─────────────────────────────┼────────┘
                              │
                              ▼
┌──────────────────────────────────────┐
│              Tool                    │
│     "CheckSSLCertificateExpiry"      │
│                                      │
│  Capability:                         │
│  - Connect to domain:port            │
│  - Read SSL certificate              │
│  - Return structured cert data      │
└──────────────────────────────────────┘
```

| Concept | Role | Analogy |
|---------|------|---------|
| Tool | Atomic capability: does one thing, returns data | A stethoscope |
| Skill | Methodology: combines tools, interprets results, makes decisions | A diagnostic protocol |

Key Takeaways

Custom Python tools are first-class citizens
You don’t need to build a microservice or deploy an MCP server. Write a Python function, deploy it through the Azure SRE Agent portal, and it’s immediately available.

Skills turn tools into expertise
A tool tells the agent what it can do. A skill tells the agent what it should do and how. The audit skill transforms a simple certificate check into a comprehensive capability.

Start small, iterate fast
Tool creation, skill creation, deployment, and testing take under 30 minutes. Start with one domain check and expand incrementally.

ITOps value is immediate
Every team has certificates. Every team has been burned by an expired one. Deploy this on day one and prevent the next certificate outage.

Want to learn more about Azure SRE Agent extensibility? Check out the YAML Schema Reference and the Python Tool documentation.

Get started with PagerDuty MCP server and PagerDuty SRE Agent in Azure SRE Agent
Overview

The PagerDuty MCP server is a cloud-hosted bridge between your PagerDuty account and Azure SRE Agent. Once configured, it enables real-time interaction with incidents, on-call schedules, services, teams, escalation policies, event orchestration, incident workflows, status pages, and more through natural language. All actions respect the permissions of the user account associated with the API token.

The server uses Streamable HTTP transport with a single Authorization custom header for authentication. Azure SRE Agent connects directly to the PagerDuty-hosted endpoint—no npm packages, local proxies, or container deployments are required. Since there is no dedicated PagerDuty connector type in the portal, you use the generic MCP server (User provided connector) option and configure the authorization header manually.

Key capabilities

| Area | Capabilities |
|------|--------------|
| Incidents | Create, list, manage incidents; add notes, responders; view alerts; find related/outlier/past incidents |
| Services | Create, list, update, and get service details |
| On-Call & Schedules | List on-calls, manage schedules, create overrides, list schedule users |
| Escalation Policies | List and get escalation policy details |
| Teams & Users | Create, update, delete teams; manage team members; list and get user data |
| Alert Grouping | Create, update, delete, and list alert grouping settings |
| Change Events | List and get change events by service or incident |
| Event Orchestration | Manage event orchestration routers, global rules, and service rules |
| Incident Workflows | List, get, and start incident workflows |
| Log Entries | List and get log entry details |
| Status Pages | Create and manage status page posts, updates, impacts, and severities |

This is the official PagerDuty-hosted MCP server. It exposes 60+ tools covering incidents, services, on-call, escalation, event orchestration, incident workflows, status pages, and more. The hosted service at mcp.pagerduty.com exposes all tools (both read and write) by default.
Tool availability depends on your PagerDuty plan and user account permissions.

Prerequisites

- Azure SRE Agent resource deployed in Azure
- PagerDuty account with an active plan
- PagerDuty user account with appropriate permissions
- User API Token: Created from User Profile > User Settings > API Access

Step 1: Create a PagerDuty API token

Generate the User API Token needed to authenticate with the PagerDuty MCP server. PagerDuty uses a single token for both authentication and authorization—the token inherits all permissions of the user account that creates it.

Navigate to API Access in PagerDuty

1. Log in to your PagerDuty account (for EU accounts, use https://app.eu.pagerduty.com/)
2. Select your user avatar in the top-right corner of the navigation bar
3. Select My Profile from the dropdown menu
4. Select the User Settings tab at the top of your profile page
5. Scroll down to the API Access section

Create a User API Token

1. In the API Access section, select Create API User Token
2. Enter a descriptive name for the token (e.g., sre-agent-mcp)
3. Select Create Token
4. Copy the token value immediately—it is displayed only once and cannot be retrieved later

The token format will look like: u+xxxxxxxxxxxxxxxx

Store the API token securely. If you lose it, you must delete the old token and create a new one. Navigate back to My Profile > User Settings > API Access to manage your tokens.

Choose the right account for token creation

The API token inherits all permissions of the PagerDuty user account that creates it.
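If you provision the connector through automation rather than the portal, a small sanity check can catch a pasted account-level key or a Bearer-prefixed value before deployment. A sketch: the u+ prefix pattern below is inferred from the example above, not a documented contract, so treat a failed match as a warning rather than an error.

```python
import re

# Heuristic only: User API Tokens in this guide look like "u+xxxxxxxxxxxxxxxx".
# The exact alphabet and length are assumptions, so a mismatch is a warning, not proof.
USER_TOKEN_RE = re.compile(r"^u\+[A-Za-z0-9_\-+/=]{8,}$")

def looks_like_user_token(token: str) -> bool:
    """Return True if the secret resembles a PagerDuty User API Token."""
    return bool(USER_TOKEN_RE.match(token.strip()))
```

A value like "Bearer u+..." fails the check, which is exactly the misconfiguration the connector setup below warns about.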
Consider these options:

| Account type | When to use | Permissions |
|--------------|-------------|-------------|
| Personal account | Quick testing and development | Full permissions of your user role |
| Service account (recommended for production) | Production deployments | Create a dedicated PagerDuty user with a restricted role |
| Read-only account | Monitoring-only use cases | Create a user with the Observer or Restricted Access role |

For production use, create a dedicated PagerDuty user with a Responder or Observer role (depending on whether write access is needed), then generate the token from that account. This ensures the integration continues to work if team members leave the organization and limits the blast radius of a compromised token.

PagerDuty also supports Account-level API keys (created under Integrations > Developer Tools > API Access Keys), but the MCP server requires a User API Token, not an account-level key.

Step 2: Add the MCP connector

Connect the PagerDuty MCP server to your SRE Agent using the portal. Since there is no dedicated PagerDuty connector type, you use the generic MCP server (User provided connector) option.

Determine your regional endpoint

Select the endpoint URL that matches your PagerDuty account's service region:

| Region | Endpoint URL |
|--------|--------------|
| US (default) | https://mcp.pagerduty.com/mcp |
| EU | https://mcp.eu.pagerduty.com/mcp |

Using the Azure portal

1. In Azure portal, navigate to your SRE Agent resource
2. Select Builder > Connectors
3. Select Add connector
4. Select MCP server (User provided connector) and select Next
5. Configure the connector:

| Field | Value |
|-------|-------|
| Name | pagerduty-mcp |
| Connection type | Streamable-HTTP |
| URL | https://mcp.pagerduty.com/mcp (use the EU endpoint for the EU service region) |
| Authentication | Custom headers |
| Authorization | Token <your-pagerduty-api-token> |

6. Select Next to review
7. Select Add connector

The token format in the Authorization header must be Token <your-api-token> (not Bearer). For example: Token u+abcdefg123456789. Using the wrong format will result in 401 Unauthorized errors.
Once the connector shows Connected status, the PagerDuty MCP tools are automatically available to your agent. You can verify by checking the tools list in the connector details.

Step 3: Create a PagerDuty subagent (optional)

Create a specialized subagent to give the AI focused PagerDuty incident management expertise and better prompt responses.

1. Navigate to Builder > Subagents
2. Select Add subagent
3. Paste the following YAML configuration:

```yaml
api_version: azuresre.ai/v1
kind: AgentConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: PagerDutyIncidentExpert
  display_name: PagerDuty Incident Expert
  system_prompt: |
    You are a PagerDuty incident management expert with access to incidents,
    services, on-call schedules, escalation policies, teams, event
    orchestration, incident workflows, status pages, and more via the
    PagerDuty MCP server.

    ## Capabilities

    ### Incidents
    - List and search incidents with `list_incidents`
    - Get incident details with `get_incident`
    - Create new incidents with `create_incident`
    - Manage incidents (update status, urgency, assignment, escalation) with `manage_incidents`
    - Add notes with `add_note_to_incident` and list notes with `list_incident_notes`
    - Add responders with `add_responders`
    - View alerts from incidents with `list_alerts_from_incident` and `get_alert_from_incident`
    - Find related incidents with `get_related_incidents`
    - Find similar past incidents with `get_past_incidents`
    - Identify outlier incidents with `get_outlier_incident`

    ### Services
    - List all services with `list_services`
    - Get service details with `get_service`
    - Create new services with `create_service`
    - Update service configuration with `update_service`

    ### On-Call & Schedules
    - List current on-calls with `list_oncalls`
    - Get schedule details with `get_schedule`
    - List all schedules with `list_schedules`
    - List users in a schedule with `list_schedule_users`
    - Create and update schedules with `create_schedule` and `update_schedule`
    - Create schedule overrides with `create_schedule_override`

    ### Escalation Policies
    - List escalation policies with `list_escalation_policies`
    - Get escalation policy details with `get_escalation_policy`

    ### Teams & Users
    - List teams with `list_teams` and get team details with `get_team`
    - Create, update, and delete teams
    - Manage team members with `add_team_member` and `remove_team_member`
    - List users with `list_users` and get user data with `get_user_data`

    ### Event Orchestration
    - List and get event orchestrations
    - Manage orchestration routers, global rules, and service rules
    - Append rules to event orchestration routers

    ### Incident Workflows
    - List and get incident workflows
    - Start incident workflows with `start_incident_workflow`

    ### Status Pages
    - Create and manage status page posts and updates
    - List status page impacts, severities, and statuses

    ### Log Entries
    - List and get log entry details for audit trails

    ### Alert Grouping
    - Create, update, and manage alert grouping settings

    ### Change Events
    - List and get change events, including by service or incident

    ## Best Practices

    When investigating incidents:
    - Start with `list_incidents` to find active or recent incidents
    - Use `get_incident` for full details including status and assignments
    - Check `list_alerts_from_incident` to see triggering alerts
    - Use `get_related_incidents` to find correlated issues
    - Use `get_past_incidents` to find similar historical incidents
    - Check `list_oncalls` to identify who is currently on-call
    - Review `list_incident_notes` for any existing investigation notes

    When managing on-call:
    - Use `list_oncalls` to see current on-call assignments
    - Use `get_schedule` and `list_schedule_users` for schedule details
    - Use `create_schedule_override` for temporary coverage changes

    When handling errors:
    - If 401 errors occur, explain the token may be invalid or expired
    - If 403 errors occur, explain which permissions may be missing
    - Suggest the user verify their API token is valid and has sufficient permissions
  mcp_connectors:
    - pagerduty-mcp
  handoffs: []
```

4. Select Save

The mcp_connectors field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the PagerDuty MCP server.

Step 4: Add a PagerDuty skill (optional)

Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a PagerDuty skill to give your agent expertise in incident management, on-call scheduling, and escalation workflows.

1. Navigate to Builder > Skills
2. Select Add skill
3. Paste the following skill configuration:

````yaml
api_version: azuresre.ai/v1
kind: SkillConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: pagerduty_incident_management
  display_name: PagerDuty Incident Management
  description: |
    Expertise in PagerDuty's incident management platform including incidents,
    on-call schedules, services, teams, escalation policies, event
    orchestration, incident workflows, and status pages. Use for managing
    incidents, checking on-call status, investigating alerts, escalating
    issues, and navigating PagerDuty data via the PagerDuty MCP server.
  instructions: |
    ## Overview

    PagerDuty is an incident management and on-call scheduling platform for
    operations teams. The PagerDuty MCP server enables natural language
    interaction with your PagerDuty account data including incidents,
    services, schedules, teams, escalation policies, and more.

    **Authentication:** A single `Authorization` custom header with the
    format `Token <api-token-value>`. All actions respect the permissions
    of the user account associated with the token.

    **Regional endpoints:** The hosted MCP server has two endpoints—US
    (`mcp.pagerduty.com`) and EU (`mcp.eu.pagerduty.com`). Ensure the
    connector URL matches your PagerDuty service region.

    ## Incident Management

    Use `list_incidents` to search and filter incidents, `get_incident`
    for details, and `manage_incidents` to update status, urgency,
    assignment, or escalation level.

    **Common incident workflows:**

    ```
    # List all triggered incidents
    Use list_incidents with status "triggered"

    # List high-urgency incidents
    Use list_incidents filtered by urgency "high"

    # Get details for a specific incident
    Use get_incident with the incident ID

    # Acknowledge an incident
    Use manage_incidents to set status to "acknowledged"

    # Resolve an incident
    Use manage_incidents to set status to "resolved"

    # Escalate an incident
    Use manage_incidents to escalate to the next level
    ```

    ## On-Call Management

    Use `list_oncalls` to see current on-call assignments, `get_schedule`
    for schedule details, and `create_schedule_override` for temporary
    coverage.

    **Common on-call workflows:**

    ```
    # Who is currently on-call?
    Use list_oncalls to see all current on-call assignments

    # Who is on-call for a specific escalation policy?
    Use list_oncalls filtered by escalation_policy_id

    # Get details for a schedule
    Use get_schedule with the schedule ID

    # Create a temporary override
    Use create_schedule_override with start/end times and user
    ```

    ## Service Management

    Use `list_services` to discover services, `get_service` for details,
    and `create_service` or `update_service` for configuration changes.

    **Service investigation patterns:**

    ```
    # List all services
    Use list_services

    # Get service details including integrations
    Use get_service with the service ID

    # Find incidents for a specific service
    Use list_incidents filtered by service_id
    ```

    ## Escalation Policy Management

    Use `list_escalation_policies` to discover policies and
    `get_escalation_policy` for details including escalation rules and targets.

    ## Team Management

    Use `list_teams` to discover teams, `get_team` for details, and team
    member management tools for roster changes.

    ## Incident Investigation Workflow

    For structured incident investigation:
    1. `list_incidents` — find active or recent incidents
    2. `get_incident` — get full incident details and current status
    3. `list_alerts_from_incident` — see triggering alerts and their details
    4. `get_alert_from_incident` — get specific alert details
    5. `get_related_incidents` — find correlated incidents
    6. `get_past_incidents` — find similar historical incidents
    7. `list_oncalls` — identify who is currently on-call
    8. `list_incident_notes` — review existing investigation notes
    9. `add_note_to_incident` — document findings
    10. `manage_incidents` — update status, urgency, or escalate

    ## Event Orchestration

    Use event orchestration tools to manage how events are routed and processed:
    - `list_event_orchestrations` — discover orchestration configurations
    - `get_event_orchestration_router` — view routing rules
    - `append_event_orchestration_router_rule` — add new routing rules
    - `get_event_orchestration_global` — view global orchestration rules
    - `get_event_orchestration_service` — view service-level rules

    ## Incident Workflows

    Use `list_incident_workflows` to discover automated workflows and
    `start_incident_workflow` to trigger them for an incident.

    ## Status Page Management

    Use status page tools to communicate during incidents:
    - `list_status_pages` — discover status pages
    - `create_status_page_post` — create a new incident post
    - `create_status_page_post_update` — add updates to existing posts
    - `list_status_page_impacts` — view impact categories
    - `list_status_page_severities` — view severity levels

    ## Troubleshooting

    | Issue | Solution |
    |-------|----------|
    | 401 Unauthorized | Verify the API token is valid and not expired |
    | 403 Forbidden | Check that the user account has sufficient permissions |
    | Connection refused | Verify firewall allows HTTPS to mcp.pagerduty.com |
    | EU region errors | Ensure you are using `mcp.eu.pagerduty.com` for EU accounts |
    | Token format error | Use `Token <value>` format, not `Bearer <value>` |
    | No data returned | Verify the token's user account has access to the requested resources |
  mcp_connectors:
    - pagerduty-mcp
````

4. Select Save

Reference the skill in your subagent

Update your subagent configuration to include the skill:

```yaml
spec:
  name: PagerDutyIncidentExpert
  skills:
    - pagerduty_incident_management
  mcp_connectors:
    - pagerduty-mcp
```

Step 5: Test the integration

1. Open a new chat session with your SRE Agent
2. Try these example prompts:

Incident management

- Show me all currently triggered incidents
- Get details for incident P1234567 including the timeline and notes
- Create a new high-urgency incident for the payment-service with title "Payment processing degraded"
- Acknowledge all triggered incidents assigned to me

On-call and schedules

- Who is currently on-call for the platform-engineering escalation policy?
- Show me the on-call schedule for the next 7 days
- Create a schedule override for John Smith covering Saturday 9am to Monday 9am
- List all users in the primary on-call schedule

Service and team management

- List all services and their current status
- Get details for the checkout-service including escalation policy and integrations
- Show me all teams and their members
- What escalation policies are configured for the payment team?

Incident investigation

- Find incidents related to the current database outage
- Show me similar past incidents to P1234567
- What alerts triggered incident P1234567?
- List all notes and timeline entries for the most recent SEV-1 incident

Event orchestration and workflows

- List all event orchestration configurations
- Show me the routing rules for the production orchestration
- What incident workflows are available?
- Start the "SEV-1 Response" workflow for incident P1234567

Status page management

- List all status pages
- Create a new status page post for the ongoing API degradation
- Add an update to the current status page post indicating the issue is being investigated
- What severity levels are available for status page posts?
Available tools

Incidents

| Tool | Description |
|------|-------------|
| get_incident | Get details of a specific incident by ID |
| list_incidents | List and filter incidents by status, urgency, service, and more |
| create_incident | Create a new incident on a specified service |
| manage_incidents | Update incident status, urgency, assignment, or escalation level |
| add_note_to_incident | Add an investigation note to an incident |
| list_incident_notes | List all notes on an incident |
| add_responders | Add additional responders to an incident |
| list_alerts_from_incident | List all alerts associated with an incident |
| get_alert_from_incident | Get details of a specific alert from an incident |
| get_outlier_incident | Identify outlier incidents based on patterns |
| get_past_incidents | Find similar historical incidents |
| get_related_incidents | Find incidents related to a specific incident |

Services

| Tool | Description |
|------|-------------|
| get_service | Get details of a specific service |
| list_services | List all services in the account |
| create_service | Create a new service |
| update_service | Update service configuration |

On-Call & Schedules

| Tool | Description |
|------|-------------|
| list_oncalls | List current on-call assignments |
| get_schedule | Get details of a specific schedule |
| list_schedules | List all schedules |
| list_schedule_users | List users in a specific schedule |
| create_schedule | Create a new on-call schedule |
| update_schedule | Update an existing schedule |
| create_schedule_override | Create a temporary schedule override |

Escalation Policies

| Tool | Description |
|------|-------------|
| list_escalation_policies | List all escalation policies |
| get_escalation_policy | Get details of a specific escalation policy |

Teams & Users

| Tool | Description |
|------|-------------|
| get_team | Get details of a specific team |
| list_teams | List all teams |
| list_team_members | List members of a specific team |
| create_team | Create a new team |
| update_team | Update team details |
| delete_team | Delete a team |
| add_team_member | Add a user to a team |
| remove_team_member | Remove a user from a team |
| get_user_data | Get details of a specific user |
| list_users | List all users in the account |

Alert Grouping

| Tool | Description |
|------|-------------|
| create_alert_grouping_setting | Create an alert grouping configuration |
| get_alert_grouping_setting | Get details of an alert grouping setting |
| list_alert_grouping_settings | List all alert grouping settings |
| update_alert_grouping_setting | Update an alert grouping setting |
| delete_alert_grouping_setting | Delete an alert grouping setting |

Change Events

| Tool | Description |
|------|-------------|
| get_change_event | Get details of a specific change event |
| list_change_events | List all change events |
| list_incident_change_events | List change events related to an incident |
| list_service_change_events | List change events for a specific service |

Event Orchestration

| Tool | Description |
|------|-------------|
| get_event_orchestration | Get details of an event orchestration |
| list_event_orchestrations | List all event orchestrations |
| get_event_orchestration_router | Get routing rules for an orchestration |
| update_event_orchestration_router | Update routing rules |
| append_event_orchestration_router_rule | Add a new routing rule |
| get_event_orchestration_global | Get global orchestration rules |
| get_event_orchestration_service | Get service-level orchestration rules |

Incident Workflows

| Tool | Description |
|------|-------------|
| get_incident_workflow | Get details of an incident workflow |
| list_incident_workflows | List all incident workflows |
| start_incident_workflow | Start an incident workflow for a specific incident |

Log Entries

| Tool | Description |
|------|-------------|
| get_log_entry | Get details of a specific log entry |
| list_log_entries | List log entries for audit and investigation |

Status Pages

| Tool | Description |
|------|-------------|
| create_status_page_post | Create a new status page incident post |
| create_status_page_post_update | Add an update to a status page post |
| get_status_page_post | Get details of a status page post |
| list_status_page_impacts | List available impact categories |
| list_status_page_post_updates | List updates for a status page post |
| list_status_page_severities | List available severity levels |
| list_status_page_statuses | List available status values |
| list_status_pages | List all status pages |

Write operations

The PagerDuty MCP server supports both read and write operations. The hosted service at mcp.pagerduty.com exposes all tools (both read and write) by default.

Write tools

Write operations include creating and modifying PagerDuty resources:

| Category | Write tools |
|----------|-------------|
| Incidents | create_incident, manage_incidents, add_note_to_incident, add_responders |
| Services | create_service, update_service |
| Schedules | create_schedule, update_schedule, create_schedule_override |
| Teams | create_team, update_team, delete_team, add_team_member, remove_team_member |
| Alert Grouping | create_alert_grouping_setting, update_alert_grouping_setting, delete_alert_grouping_setting |
| Event Orchestration | update_event_orchestration_router, append_event_orchestration_router_rule |
| Incident Workflows | start_incident_workflow |
| Status Pages | create_status_page_post, create_status_page_post_update |

PagerDuty also provides a self-hosted MCP server that can be run locally. The self-hosted server exposes only read-only tools by default; write tools require the --enable-write-tools flag at startup. For Azure SRE Agent, the hosted service at mcp.pagerduty.com is recommended as it requires no infrastructure management and exposes all tools automatically.
Troubleshooting

Authentication issues

| Error | Cause | Solution |
|-------|-------|----------|
| 401 Unauthorized | Invalid or expired API token | Verify the token is correct and active in User Settings > API Access |
| 403 Forbidden | Insufficient user permissions | Ensure the user account associated with the token has the required PagerDuty role |
| Connection refused | Firewall blocking outbound HTTPS | Verify firewall allows HTTPS traffic to mcp.pagerduty.com (port 443) |
| Token format error | Using Bearer instead of Token | The Authorization header must use Token <value> format, not Bearer <value> |

Data and permission issues

| Error | Cause | Solution |
|-------|-------|----------|
| No data returned | Token user lacks access to the resource | Verify the user account has access to the requested services, teams, or incidents |
| EU region errors | Using US endpoint for EU account | Switch the connector URL to https://mcp.eu.pagerduty.com/mcp |
| Write operation failed | User lacks write permissions | Verify the token's user account has a role that allows write operations (e.g., Manager, Admin) |
| Rate limit exceeded | Too many API requests | PagerDuty rate limits vary by plan; reduce request frequency or contact PagerDuty support |
| Incident not found | Wrong incident ID or no access | Verify the incident ID and that the token's user has access to the incident's service |

Verify the connection

Test the server endpoint directly:

```
curl -I "https://mcp.pagerduty.com/mcp" \
  -H "Authorization: Token <your-api-token>"
```

Expected response: 200 OK confirms authentication is working.
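The same check can be scripted, for example as a pre-deployment step in a pipeline. A minimal Python sketch using only the standard library; the endpoint and Token header format are as documented above, while the helper name is illustrative:

```python
import urllib.request

def build_mcp_probe(endpoint: str, token: str) -> urllib.request.Request:
    """Build a HEAD request against the hosted MCP endpoint.

    The hosted server expects "Token <value>", not "Bearer <value>".
    """
    return urllib.request.Request(
        endpoint,
        headers={"Authorization": f"Token {token}"},
        method="HEAD",
    )

# To actually probe the endpoint (requires network access and a real token):
#   resp = urllib.request.urlopen(
#       build_mcp_probe("https://mcp.pagerduty.com/mcp", token))
#   A resp.status of 200 confirms authentication is working.
```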
Re-authorize the integration

If you encounter persistent issues:

1. Navigate to My Profile > User Settings > API Access in PagerDuty
2. Delete the existing API User Token
3. Create a new API User Token
4. Update the connector in the SRE Agent portal with the new token value in the Authorization header (format: Token <new-token>)

Limitations

| Limitation | Details |
|------------|---------|
| User-scoped permissions | API token permissions are tied to the creating user's account; the token cannot exceed the user's access level |
| Self-hosted write restriction | The self-hosted MCP server only exposes read-only tools by default; write tools require the --enable-write-tools flag |
| Rate limits | API rate limits apply per your PagerDuty plan; high-frequency usage may be throttled |
| No dedicated connector type | The portal does not have a dedicated PagerDuty connector; you must use the generic MCP server connector and configure headers manually |
| Two regional endpoints only | Only US and EU service regions are supported; the endpoint must match your account's service region |
| Token rotation | API tokens do not automatically expire; manual rotation is recommended as a security best practice |

Security considerations

How permissions work

- User-scoped: All actions respect the permissions of the PagerDuty user account that created the API token
- Token-based: A single User API Token in the Authorization header provides both authentication and authorization
- Role-based: The token inherits the PagerDuty role (Observer, Responder, Manager, Admin, etc.)
of the creating user

Admin controls

PagerDuty administrators can:

- Create and revoke User API tokens from user profile settings
- Assign roles to user accounts to control permission scope
- Use service accounts with restricted roles to limit the blast radius of compromised tokens
- Monitor API token usage through PagerDuty's audit logs
- Enforce token rotation policies as part of security governance

PagerDuty User API tokens can read and modify sensitive operational data including incidents, on-call schedules, and service configurations. Use service account tokens with restricted roles, grant only the permissions your agent needs, and rotate tokens regularly. Monitor the PagerDuty audit logs for unusual activity.

PagerDuty SRE Agent

In addition to connecting Azure SRE Agent to PagerDuty via MCP, PagerDuty offers its own built-in SRE Agent—an AI-powered assistant that works side-by-side with responders during incident triage and resolution. When combined with the Azure SRE Agent MCP integration, you get a powerful end-to-end incident management experience.

What is PagerDuty SRE Agent?

PagerDuty’s SRE Agent transforms incident response in the Operations Console and Slack by automatically analyzing incidents, providing key context, and recommending remediation actions. It accelerates triage to reduce risk, cost, and cognitive load, and it continuously learns to prevent repeat issues.
Key features

- Automated incident analysis: Ingests and analyzes runbooks, SOPs, and diagnostics (e.g., error logs) to surface likely root causes
- Playbook generation: Generates and saves playbooks for recurring issues based on past resolutions
- Pattern detection: Detects patterns, recalls similar incidents, and provides structured troubleshooting
- Actionable nudges: Recommends next steps through interactive buttons such as “Upload Runbook,” “Analyze Past Incidents,” “Generate a Playbook,” and “Search Logs”
- Continuous learning: Builds memory from resolved incidents including incident playbooks, service runbooks, incident summaries, and service profiles
- Observability integrations: Retrieves log data from platforms like Grafana, Datadog, New Relic, and AWS CloudWatch for deeper investigation

Prerequisites

- PagerDuty Advance add-on (required for both Operations Console and Slack access)
- AIOps add-on (required for Operations Console access)
- Available on Enterprise, Business, and Professional plans
- An Account Owner or Global Admin role is required to enable SRE Agent

Step 1: Enable PagerDuty SRE Agent

1. In the PagerDuty web app, navigate to AI > AI Settings
2. Select the Assistant and AI Agents Configuration tab
3. Under AI Agents, find SRE Agent and toggle the switch to the on position

If you don’t have Account Owner or Global Admin permissions, click Request to Admin next to the SRE Agent toggle. This sends an email request to your admins to enable it for you.

Step 2: Configure tool integrations (optional)

PagerDuty SRE Agent can retrieve log data and runbooks from external tools for deeper investigation. Set up Workflow Integrations and select Allow SRE Agent access for each integration.
Supported integrations include:

- Log platforms: Grafana, Datadog, New Relic, AWS CloudWatch
- Runbook sources: Confluence, GitHub

For runbook sources, update your event payload to include the runbook URL in custom_details:

```
"custom_details": {
  "runbook_url": {
    "confluence": "https://YOUR-RUNBOOK-LINK"
  }
}
```

For more details, see Agent Tooling Configuration.

Step 3: Use SRE Agent in Operations Console

1. Navigate to AIOps > Operations Console
2. Optional: Add the SRE Agent column to the Operations Console for faster incident triage
3. Select an incident by clicking its Title
4. Select the SRE Agent tab and wait for the agent to load your incident summary
5. Begin troubleshooting by asking questions or using the agent’s nudge buttons (e.g., Upload Runbook, Analyze Past Incidents, Generate a Playbook)

How it works with Azure SRE Agent

Azure SRE Agent has a built-in direct integration with PagerDuty’s SRE Agent. This means you can query PagerDuty’s AI-powered SRE Agent directly from within Azure SRE Agent’s chat interface—no separate tab or tool switching required.

Built-in PagerDuty Incident Management Agent

Azure SRE Agent includes a dedicated PagerDuty Incident Management Agent that provides the following tools:

| Tool | Description |
|------|-------------|
| QueryPagerDutyIncidentChat | Queries PagerDuty’s SRE Agent (Advance Chat API) for intelligent insights, troubleshooting guidance, runbook generation, or diagnostic recommendations about a specific incident |
| GetPagerDutyIncidentById | Retrieves details for a specific PagerDuty incident by its ID |
| ResolvePagerDutyIncident | Resolves a PagerDuty incident directly from Azure SRE Agent |
| AcknowledgePagerDutyIncident | Acknowledges a PagerDuty incident |
| AddNoteToPagerDutyIncident | Adds notes to a PagerDuty incident for tracking investigation progress |

Querying PagerDuty SRE Agent from Azure SRE Agent

The QueryPagerDutyIncidentChat tool connects directly to PagerDuty’s Advance Chat API (https://api.pagerduty.com/advance/chat) using your PagerDuty API token.
When you ask Azure SRE Agent a question about a PagerDuty incident, it automatically calls PagerDuty's SRE Agent and returns the AI-powered response. This enables scenarios like:

- "What caused incident Q391Y5VW0YYUEL?" — PagerDuty SRE Agent analyzes the incident context and provides root cause analysis
- "Generate a runbook for incident Q391Y5VW0YYUEL" — PagerDuty SRE Agent creates a step-by-step runbook based on the incident details
- "How do I troubleshoot incident Q391Y5VW0YYUEL?" — PagerDuty SRE Agent recommends diagnostic and remediation steps
- "Provide mitigation steps for incident Q391Y5VW0YYUEL" — PagerDuty SRE Agent suggests actions prioritized by urgency and impact
- "Triage incident Q391Y5VW0YYUEL" — PagerDuty SRE Agent provides a full triage summary with next steps

### Configuration

The PagerDuty SRE Agent integration uses the same API token you configured for PagerDuty incident management. No additional setup is required beyond the standard PagerDuty connector configuration. When PagerDuty is configured as your incident management platform in Azure SRE Agent settings, the `QueryPagerDutyIncidentChat` tool is automatically available.

> [!NOTE]
> The PagerDuty Advance Chat API requires a PagerDuty Advance subscription. Each query to the SRE Agent consumes 4 PagerDuty Advance credits. Ensure your account has sufficient credits for your expected usage.
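Before relying on the Advance Chat integration, it can help to confirm that the API token itself is accepted by the PagerDuty REST API. The sketch below is an assumption-laden example, not part of the official setup: it builds the standard REST API v2 auth header, and the commented `curl` call (against the documented `/users` endpoint) shows how you might exercise it; `PD_TOKEN` and `pd_auth_header` are placeholder names of ours.

```shell
# Build the standard PagerDuty REST API v2 authorization header
# from an API token passed as the first argument.
pd_auth_header() {
  printf 'Authorization: Token token=%s' "$1"
}

# Uncomment to run a pre-flight check against the live API:
# curl -s -o /dev/null -w '%{http_code}\n' "https://api.pagerduty.com/users?limit=1" \
#   -H "$(pd_auth_header "$PD_TOKEN")" \
#   -H "Content-Type: application/json"
# A 200 response means the token is valid; 401 means it is not.
```

If the pre-flight check fails, fix the token in the connector configuration before troubleshooting the chat integration itself.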
## End-to-end workflow

With PagerDuty configured as both an MCP connector and an incident management platform, Azure SRE Agent enables a seamless workflow:

1. **Detect:** Azure SRE Agent monitors your Azure infrastructure and detects issues
2. **Correlate:** Azure SRE Agent retrieves related PagerDuty incidents for the affected Azure resources
3. **Triage:** Azure SRE Agent queries PagerDuty's SRE Agent for AI-powered root cause analysis, troubleshooting steps, and runbook recommendations
4. **Act:** Azure SRE Agent acknowledges, adds notes to, or resolves PagerDuty incidents—all from a single conversation
5. **Learn:** PagerDuty SRE Agent saves incident learnings and playbooks for future incidents, improving response over time

For the best experience, configure both the PagerDuty MCP connector (for service and schedule queries) and PagerDuty as your incident management platform (for direct SRE Agent access). This gives your team the full breadth of PagerDuty capabilities from within Azure SRE Agent.

For full documentation on PagerDuty SRE Agent capabilities, including best practices and example questions, see the PagerDuty SRE Agent documentation.

## Related content

- PagerDuty MCP Server documentation
- PagerDuty REST API v2 documentation
- PagerDuty API Access Keys
- PagerDuty User Roles
- PagerDuty Audit Records
- MCP integration overview
- Build a custom subagent
- PagerDuty SRE Agent documentation
- PagerDuty Advance
# Get started with Atlassian Rovo MCP server in Azure SRE Agent

Connect Azure SRE Agent to Jira, Confluence, Compass, and Jira Service Management using the official Atlassian Rovo MCP server.

## Overview

The Atlassian Rovo MCP server is a cloud-hosted bridge between your Atlassian Cloud site and Azure SRE Agent. Once configured, it enables real-time interaction with Jira, Confluence, Compass, and Jira Service Management data through natural language. All actions respect your existing Atlassian user permissions.

The server supports API tokens (Basic or Bearer auth) for headless or automated setups. Azure SRE Agent connects using Streamable-HTTP transport directly to the Atlassian-hosted endpoint.

## Key capabilities

| Product | Capabilities |
|---------|--------------|
| Jira | Search issues with JQL, create/update tickets, add comments and worklogs, transition issues through workflows |
| Confluence | Search pages with CQL, create/update pages and live docs, manage inline and footer comments |
| Compass | Create/delete service components and relationships, manage custom fields, query dependencies |
| Jira Service Management | Query ops alerts, view on-call schedules, get team info, escalate alerts |
| Rovo Search | Natural language search across Jira and Confluence, fetch content by ARI |

> [!NOTE]
> This is the official Atlassian-hosted MCP server at `https://mcp.atlassian.com/v1/mcp`. The server exposes 46+ tools across five product areas. Tool availability depends on authentication method and granted scopes.

## Prerequisites

- Azure SRE Agent resource deployed in Azure
- Atlassian Cloud site with one or more of: Jira, Confluence, Compass, or Jira Service Management
- User account with appropriate permissions in the Atlassian products you want to access
- For API token auth: Organization admin must enable API token authentication in the Rovo MCP server settings

## Step 1: Get your Atlassian credentials

Choose one of the two authentication methods below.
API token (Option A) is recommended for Azure SRE Agent because it enables headless configuration without browser-based flows.

### Option A: Personal API token (recommended for Azure SRE Agent)

API token authentication allows headless configuration without browser-based OAuth flows—ideal for Azure SRE Agent connectors.

**Navigate to the API token page**

1. Log in to your Atlassian account
2. Select your profile avatar in the top-right corner
3. Select **Manage account**
4. In the left sidebar, select **Security**
5. Under the **API tokens** section, you can manage your existing tokens

Alternatively, use this direct link that pre-selects all MCP scopes: Create API token with all MCP scopes

**Create the token**

1. Navigate to the Atlassian API token creation page to create a token with all MCP scopes preselected
2. Optionally click **Back** to manually select only the scopes you need (see Available scopes)
3. Copy the generated token and note the email address associated with your Atlassian account
4. Base64-encode your credentials:

```bash
# Format: email:api_token
echo -n "your.email@example.com:YOUR_API_TOKEN_HERE" | base64
```

On Windows PowerShell:

```powershell
[Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes("your.email@example.com:YOUR_API_TOKEN_HERE"))
```

This produces a base64-encoded string you'll use in the connector configuration as the `Authorization: Basic <value>` header.

> [!IMPORTANT]
> Store the API token securely. It cannot be viewed again after creation. If lost, generate a new token from the same API tokens page.

> [!NOTE]
> API token authentication must be enabled by your organization admin. If you cannot create a token, ask your admin to enable API token authentication in the Rovo MCP server settings at admin.atlassian.com > Security > Rovo MCP server.
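If you rotate tokens regularly, the encoding step above can be wrapped in a small helper so the header value is always built the same way. This is a sketch of ours, not part of the official setup; it assumes a POSIX shell with coreutils `base64`, and the `build_basic_auth` name is hypothetical.

```shell
# Build the base64 value for the "Authorization: Basic <value>" header
# from an Atlassian account email ($1) and API token ($2).
build_basic_auth() {
  # printf avoids a trailing newline in the encoded input;
  # tr strips any line wrapping that base64 may add
  printf '%s:%s' "$1" "$2" | base64 | tr -d '\n'
}

# Example:
# build_basic_auth "your.email@example.com" "YOUR_API_TOKEN_HERE"
```

Paste the helper's output after `Basic ` in the connector's Authorization header value.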
### Available scopes

The API token supports the following scope categories:

| Category | Scopes |
|----------|--------|
| Jira | `read:jira-work`, `write:jira-work`, `read:jira-user` |
| Confluence | `read:page:confluence`, `write:page:confluence`, `read:comment:confluence`, `write:comment:confluence`, `read:space:confluence`, `read:hierarchical-content:confluence`, `read:confluence-user`, `search:confluence` |
| Compass | `read:component:compass`, `write:component:compass` |
| JSM | `read:incident:jira-service-management`, `write:incident:jira-service-management`, `read:ops-alert:jira-service-management`, `write:ops-alert:jira-service-management`, `read:ops-config:jira-service-management`, `read:servicedesk-request` |
| Bitbucket | `read:repository:bitbucket`, `write:repository:bitbucket`, `read:pullrequest:bitbucket`, `write:pullrequest:bitbucket`, `read:pipeline:bitbucket`, `write:pipeline:bitbucket`, `read:user:bitbucket`, `read:workspace:bitbucket`, `admin:repository:bitbucket` |
| Platform | `read:me`, `read:account`, `search:rovo:mcp` |

> [!NOTE]
> Bitbucket scopes are available in the token, but Bitbucket tools are not yet listed on the official supported tools page. Bitbucket tool support may be added in a future update.

## Step 2: Add the MCP connector

Connect the Atlassian Rovo MCP server to your SRE Agent using the portal.

### Using the Azure portal (API token auth)

1. In Azure portal, navigate to your SRE Agent resource
2. Select **Builder > Connectors**
3. Select **Add connector**
4. Select **MCP server (User provided connector)** and select **Next**
5. Configure the connector:

   | Field | Value |
   |-------|-------|
   | Name | `atlassian-rovo-mcp` |
   | Connection type | Streamable-HTTP |
   | URL | `https://mcp.atlassian.com/v1/mcp` |
   | Authentication | Custom headers |
   | Header Key | `Authorization` |
   | Header Value | `Basic <your_base64_encoded_email_and_token>` |

6. Select **Next** to review
7. Select **Add connector**

## Step 3: Create an Atlassian subagent

Create a specialized subagent to give the AI focused Atlassian expertise and better prompt responses.
1. Navigate to **Builder > Subagents**
2. Select **Add subagent**
3. Paste the following YAML configuration:

```yaml
api_version: azuresre.ai/v1
kind: AgentConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: AtlassianRovoExpert
  display_name: Atlassian Rovo Expert
  system_prompt: |
    You are an Atlassian expert with access to Jira, Confluence, Compass,
    and Jira Service Management via the Atlassian Rovo MCP server.

    ## Capabilities

    ### Jira
    - Search issues using JQL (Jira Query Language) with `searchJiraIssuesUsingJql`
    - Create, update, and transition issues through workflows
    - Add comments, worklogs, and manage issue metadata
    - Look up user account IDs and project configurations

    ### Confluence
    - Search pages and content using CQL (Confluence Query Language) with `searchConfluenceUsingCql`
    - Create and update pages and live docs with Markdown content
    - Add inline and footer comments on pages
    - Navigate spaces and page hierarchies

    ### Compass
    - Create, query, and delete service components (services, libraries, applications)
    - Define and manage relationships between components
    - Manage custom field definitions and component metadata
    - View component activity events (deployments, alerts)

    ### Jira Service Management
    - Query ops alerts by ID, alias, or search criteria
    - View on-call schedules and current/next responders
    - Get team info including escalation policies and roles
    - Acknowledge, close, or escalate alerts

    ### Cross-Product
    - Use Rovo Search (`search`) for natural language queries across Jira and Confluence
    - Fetch specific content by Atlassian Resource Identifier (ARI) using `fetch`
    - Get current user info and list accessible cloud sites

    ## Best Practices

    When searching Jira:
    - Use JQL for precise queries: `project = "MYPROJ" AND status = "Open"`
    - Start with broad searches, then refine based on results
    - Use `currentUser()` for user-relative queries
    - Use `openSprints()` for active sprint work

    When searching Confluence:
    - Use CQL for structured searches: `space = "ENG" AND type = page`
    - Use Rovo Search for natural language queries when JQL/CQL isn't needed
    - Consider space keys to narrow results

    When creating content:
    - Confirm project/space/issue type with the user before creating
    - Use `getJiraIssueTypeMetaWithFields` to check required fields
    - Use `getConfluenceSpaces` to list available spaces

    When handling errors:
    - If access is denied, explain what permission is needed
    - Suggest the user contact their Atlassian administrator
    - For expired tokens, advise re-authentication
  mcp_connectors:
    - atlassian-rovo-mcp
  handoffs: []
```

4. Select **Save**

> [!NOTE]
> The `mcp_connectors` field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the Atlassian Rovo MCP server.

## Step 4: Add an Atlassian skill

Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create an Atlassian skill to give your agent expertise in JQL, CQL, and Atlassian workflows.

1. Navigate to **Builder > Skills**
2. Select **Add skill**
3. Paste the following skill configuration:

````yaml
api_version: azuresre.ai/v1
kind: SkillConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: atlassian_rovo
  display_name: Atlassian Rovo
  description: |
    Expertise in Atlassian Cloud products including Jira, Confluence, Compass,
    and Jira Service Management. Use for searching issues with JQL, creating
    and updating pages, managing service components, investigating ops alerts,
    and navigating Atlassian workspaces via the Rovo MCP server.
  instructions: |
    ## Overview
    Atlassian Cloud provides integrated tools for project tracking (Jira),
    documentation (Confluence), service catalog management (Compass), and
    incident management (Jira Service Management). The Atlassian Rovo MCP
    server enables natural language interaction with all four products.

    **Authentication:** OAuth 2.1 or API token (Basic/Bearer). All actions
    respect existing user permissions.

    ## Searching Jira with JQL
    JQL (Jira Query Language) enables precise issue searches. Always use
    `searchJiraIssuesUsingJql` for structured queries.

    **Common JQL patterns:**
    ```jql
    # Open issues assigned to current user
    assignee = currentUser() AND status != Done

    # Bugs created in the last 7 days
    project = "MYPROJ" AND type = Bug AND created >= -7d

    # High-priority issues in active sprints
    project = "MYPROJ" AND priority in (High, Highest) AND sprint in openSprints()

    # Full-text search
    project = "MYPROJ" AND text ~ "payment error"

    # Issues updated recently
    updated >= -24h ORDER BY updated DESC
    ```

    **JQL operators:** `=`, `!=`, `~` (contains), `in`, `>=`, `<=`, `NOT`, `AND`, `OR`
    **JQL functions:** `currentUser()`, `openSprints()`, `startOfDay()`, `endOfDay()`, `membersOf("group")`

    ## Searching Confluence with CQL
    CQL (Confluence Query Language) searches pages, blog posts, and attachments.
    Use `searchConfluenceUsingCql` for structured queries.

    **Common CQL patterns:**
    ```cql
    # Search by title
    title ~ "Architecture"

    # Search in specific space
    space = "ENG" AND type = page

    # Full-text content search
    text ~ "deployment pipeline"

    # Recently modified pages
    lastModified >= now("-7d") AND type = page

    # Pages by label
    label = "runbook" AND space = "SRE"
    ```

    **CQL fields:** `title`, `text`, `space`, `type`, `label`, `creator`, `lastModified`

    ## Creating Jira Issues
    Follow this workflow:
    1. `getVisibleJiraProjects` — list available projects
    2. `getJiraProjectIssueTypesMetadata` — list issue types for the project
    3. `getJiraIssueTypeMetaWithFields` — get required/optional fields
    4. `createJiraIssue` — create the issue

    Common issue types: Story, Bug, Task, Epic, Sub-task.

    ## Creating Confluence Pages
    Pages support Markdown content:
    1. `getConfluenceSpaces` — list available spaces
    2. `getPagesInConfluenceSpace` — optionally find a parent page
    3. `createConfluencePage` — create the page with space, title, and body

    ## Working with Compass Components
    Component types: SERVICE, LIBRARY, APPLICATION, CAPABILITY, CLOUD_RESOURCE,
    DATA_PIPELINE, MACHINE_LEARNING_MODEL, UI_ELEMENT, WEBSITE, OTHER.
    Relationship types: DEPENDS_ON, OTHER.

    ## Jira Service Management Operations
    For incident and alert management:
    - `getJsmOpsAlerts` — query alerts by ID, alias, or search
    - `updateJsmOpsAlert` — acknowledge, close, or escalate alerts
    - `getJsmOpsScheduleInfo` — view on-call schedules and responders
    - `getJsmOpsTeamInfo` — list teams with escalation policies

    ## Cross-Product Workflows
    - Use `search` (Rovo Search) for natural language queries across products
    - Use `fetch` with ARIs (Atlassian Resource Identifiers) for direct content retrieval
    - Use `getAccessibleAtlassianResources` to list cloud sites and get cloudIds

    ## Troubleshooting
    | Issue | Solution |
    |-------|----------|
    | JQL syntax error | Check field names; quote values with spaces |
    | CQL returns no results | Verify space key; try broader terms |
    | Cannot create issue | Verify "Create" permission in the project |
    | Cannot edit page | Verify "Edit" permission in the space |
    | OAuth expired | Re-invoke any tool to trigger fresh OAuth flow |
    | "Site admin must authorize" | Admin must complete initial 3LO consent |
    | cloudId errors | Use `getAccessibleAtlassianResources` to find correct cloudId |
  mcp_connectors:
    - atlassian-rovo-mcp
````

4. Select **Save**

### Reference the skill in your subagent

Update your subagent configuration to include the skill:

```yaml
spec:
  name: AtlassianRovoExpert
  skills:
    - atlassian_rovo
  mcp_connectors:
    - atlassian-rovo-mcp
```

## Step 5: Test the integration

1. Open a new chat session with your SRE Agent
2. Try these example prompts:

**Jira workflows**

- Find all open bugs assigned to me in the PAYMENTS project
- Create a new story in project PLATFORM titled "Implement rate limiting for API gateway"
- Show me the available transitions for issue PLATFORM-1234
- Add a comment to PLATFORM-1234: "Reviewed and approved for deployment"
- Log 2 hours of work on PLATFORM-1234 with description "Code review and testing"

**Confluence workflows**

- Search Confluence for pages about "incident response runbooks"
- Show me the spaces I have access to
- Create a new Confluence page in the Engineering space titled "Q3 2025 Architecture Review"
- What pages are under the "Runbooks" parent page?

**Compass workflows**

- List all service components in Compass
- Create a new service component called "payment-gateway"
- What components depend on the api-gateway service?
- Show me recent activity events for the auth-service component

**Jira Service Management workflows**

- Show me active ops alerts from the last 24 hours
- Who is currently on-call for the platform-engineering schedule?
- Acknowledge alert with alias "high-cpu-prod-web-01"
- Get team info and escalation policies for the SRE team

**Cross-product workflows**

- Search across Jira and Confluence for content related to "deployment pipeline"
- What Atlassian cloud sites do I have access to?
- Fetch the Confluence page linked to Jira issue PLATFORM-500

## Available tools

### Jira tools (14 tools)

| Tool | Description | Required Scopes |
|------|-------------|-----------------|
| `searchJiraIssuesUsingJql` | Search issues using a JQL query | `read:jira-work` |
| `getJiraIssue` | Get issue details by ID or key | `read:jira-work` |
| `createJiraIssue` | Create a new issue in a project | `write:jira-work` |
| `editJiraIssue` | Update fields on an existing issue | `write:jira-work` |
| `addCommentToJiraIssue` | Add a comment to an issue | `write:jira-work` |
| `addWorklogToJiraIssue` | Add a time tracking worklog entry | `write:jira-work` |
| `transitionJiraIssue` | Perform a workflow transition | `write:jira-work` |
| `getTransitionsForJiraIssue` | List available workflow transitions | `read:jira-work` |
| `getVisibleJiraProjects` | List projects the user can access | `read:jira-work` |
| `getJiraProjectIssueTypesMetadata` | List issue types in a project | `read:jira-work` |
| `getJiraIssueTypeMetaWithFields` | Get create-field metadata for a project and issue type | `read:jira-work` |
| `getJiraIssueRemoteIssueLinks` | List remote links (e.g., Confluence pages) on an issue | `read:jira-work` |
| `lookupJiraAccountId` | Find user account IDs by name or email | `read:jira-work` |

### Confluence tools (11 tools)

| Tool | Description | Required Scopes |
|------|-------------|-----------------|
| `searchConfluenceUsingCql` | Search content using a CQL query | `search:confluence` |
| `getConfluencePage` | Get page content by ID (as Markdown) | `read:page:confluence` |
| `createConfluencePage` | Create a new page or live doc with Markdown body | `write:page:confluence` |
| `updateConfluencePage` | Update an existing page (title, body, location) | `write:page:confluence` |
| `getConfluenceSpaces` | List spaces by key, ID, type, status, or labels | `read:space:confluence` |
| `getPagesInConfluenceSpace` | List pages in a space, filtered by title/status/type | `read:page:confluence` |
| `getConfluencePageDescendants` | List descendant pages under a parent page | `read:hierarchical-content:confluence` |
| `createConfluenceFooterComment` | Create a footer comment or reply on a page | `write:page:confluence` |
| `createConfluenceInlineComment` | Create an inline comment tied to selected text | `write:page:confluence` |
| `getConfluencePageFooterComments` | List footer comments on a page (as Markdown) | `read:comment:confluence` |
| `getConfluencePageInlineComments` | List inline comments on a page | `read:comment:confluence` |

### Compass tools (13 tools)

| Tool | Description | Required Scopes |
|------|-------------|-----------------|
| `getCompassComponents` | Search or list components | `read:component:compass` |
| `getCompassComponent` | Get component details by ID | `read:component:compass` |
| `createCompassComponent` | Create a service, library, or other component | `write:component:compass` |
| `deleteCompassComponent` | Delete an existing component and its definitions | `write:component:compass` |
| `createCompassComponentRelationship` | Create a relationship between two components | `write:component:compass` |
| `deleteCompassComponentRelationship` | Remove a relationship between two components | `write:component:compass` |
| `getCompassComponentActivityEvents` | List recent activity events (deployments, alerts) | `read:component:compass` |
| `getCompassComponentLabels` | Get labels applied to a component | `read:component:compass` |
| `getCompassComponentTypes` | List available component types | `read:component:compass` |
| `getCompassComponentsOwnedByMyTeams` | List components owned by your teams | `read:component:compass` |
| `getCompassCustomFieldDefinitions` | List custom field definitions | `read:component:compass` |
| `createCompassCustomFieldDefinition` | Create a custom field definition | `write:component:compass` |
| `deleteCompassCustomFieldDefinition` | Delete a custom field definition | `write:component:compass` |

### Jira Service Management tools (4 tools)

> [!NOTE]
> JSM tools only support authentication via API token. These tools are available only if API token authentication is enabled by your organization admin.

| Tool | Description | Required Scopes |
|------|-------------|-----------------|
| `getJsmOpsAlerts` | Get alert by ID/alias or search by query and time window | `read:ops-alert:jira-service-management`, `read:ops-config:jira-service-management`, `read:jira-user` |
| `getJsmOpsScheduleInfo` | List on-call schedules or get current/next responders | `read:ops-config:jira-service-management`, `read:jira-user` |
| `getJsmOpsTeamInfo` | List ops teams with escalation policies and roles | `read:ops-config:jira-service-management`, `read:jira-user` |
| `updateJsmOpsAlert` | Acknowledge, unacknowledge, close, or escalate an alert | `read:ops-alert:jira-service-management`, `write:ops-alert:jira-service-management` |

### Rovo / Shared platform tools (4 tools)

| Tool | Description | Required Scopes |
|------|-------------|-----------------|
| `search` | Natural language search across Jira and Confluence (not JQL/CQL) | `search:rovo:mcp` |
| `fetch` | Fetch content by Atlassian Resource Identifier (ARI) | `search:rovo:mcp` |
| `atlassianUserInfo` | Get current user details (account ID) | `read:me` |
| `getAccessibleAtlassianResources` | List accessible cloud sites and their cloudIds | `read:account`, `read:me` |

## Troubleshooting

### Authentication issues

| Error | Cause | Solution |
|-------|-------|----------|
| 401 Unauthorized | Invalid or expired API token | Generate a new token at id.atlassian.com |
| 403 Forbidden | Missing product permissions | Verify you have access to the Atlassian product (Jira, Confluence, etc.) |
| "Your site admin must authorize this app" | First-time setup requires admin | A site admin must complete initial 3LO consent |
| "Your organization admin must authorize access from a domain" | Domain not allowed | Admin must add the domain in Rovo MCP server settings |
| "You don't have permission to connect from this IP address" | IP allowlisting enabled | Admin must add your IP range to the allowlist |
| API token auth fails | Feature disabled by admin | Admin must enable API token authentication |

### Data and permission issues

| Error | Cause | Solution |
|-------|-------|----------|
| No data returned | Wrong cloudId or expired session | Use `getAccessibleAtlassianResources` to find the correct cloudId |
| Cannot create issue | Missing project permission | Verify "Create" permission in the Jira project |
| Cannot update page | Missing space permission | Verify "Edit" permission in the Confluence space |
| Tool not available | Missing scopes | Re-create API token with required scopes |
| Compass tools unavailable | Scopes not available for API tokens | Some Compass tools require OAuth 2.1 |
| JSM tools not working | API token auth disabled | Admin must enable API token authentication |

### Verify the connection

Test the server endpoint directly:

```bash
# Test with API token (Basic auth)
curl -I https://mcp.atlassian.com/v1/mcp \
  -H "Authorization: Basic <your_base64_encoded_credentials>"

# Test with service account (Bearer auth)
curl -I https://mcp.atlassian.com/v1/mcp \
  -H "Authorization: Bearer <your_api_key>"
```

Expected response: `200 OK` confirms authentication is working.

### Re-authorize the integration

If you encounter persistent issues:

1. Go to id.atlassian.com/manage-profile/apps
2. Find and revoke the MCP app authorization
3. Generate a new API token or re-invoke a tool to trigger a fresh OAuth flow

## Limitations

| Limitation | Details |
|------------|---------|
| Limited tool availability with API tokens | Some tools (e.g., certain Compass tools) may not be available because required scopes aren't available for API tokens |
| No bounded cloudId | API tokens are not bound to a specific cloudId. Tools must explicitly pass the cloudId where needed |
| No domain allowlist validation | API token auth doesn't use OAuth redirect URIs, so domain allowlist checks cannot be performed |
| Bitbucket tools | Bitbucket scopes are available in the token, but Bitbucket tools are not yet listed as supported |
| JSM requires API token | Jira Service Management tools only work with API token authentication, not OAuth 2.1 |

## Security considerations

### How permissions work

- **User-scoped:** All actions respect the authenticated user's existing Atlassian permissions
- **Product-level:** Access requires matching product permissions (Jira, Confluence, Compass)
- **Session-based:** OAuth tokens expire and require re-authentication; API tokens persist until revoked

### Admin controls

Atlassian administrators can:

- Enable or disable API token authentication in Rovo MCP server settings
- Manage and revoke MCP app access from the Connected Apps list
- Control which external domains can connect via domain allowlists
- Monitor activity through Atlassian audit logs
- Configure IP allowlisting for additional security

> [!IMPORTANT]
> MCP clients can perform actions in Jira, Confluence, and Compass with your existing permissions. Use least privilege, review high-impact changes before confirming, and monitor audit logs for unusual activity. See MCP Clients - Understanding security risks.

## Related content

- Atlassian Rovo MCP Server - Getting started
- Atlassian Rovo MCP Server - Supported tools
- Atlassian Rovo MCP Server - API token authentication
- Atlassian Rovo MCP Server - OAuth 2.1 configuration
- Control Atlassian Rovo MCP Server settings
- MCP Clients - Understanding security risks
- MCP integration overview
- Build a custom subagent