azure sre agent

28 Topics

Get started with PagerDuty MCP server and PagerDuty SRE Agent in Azure SRE Agent
Overview

The PagerDuty MCP server is a cloud-hosted bridge between your PagerDuty account and Azure SRE Agent. Once configured, it enables real-time interaction with incidents, on-call schedules, services, teams, escalation policies, event orchestration, incident workflows, status pages, and more through natural language. All actions respect the permissions of the user account associated with the API token.

The server uses Streamable HTTP transport with a single Authorization custom header for authentication. Azure SRE Agent connects directly to the PagerDuty-hosted endpoint; no npm packages, local proxies, or container deployments are required. Since there is no dedicated PagerDuty connector type in the portal, you use the generic MCP server (User provided connector) option and configure the authorization header manually.

Key capabilities

| Area | Capabilities |
|-------|----------|
| Incidents | Create, list, manage incidents; add notes, responders; view alerts; find related/outlier/past incidents |
| Services | Create, list, update, and get service details |
| On-Call & Schedules | List on-calls, manage schedules, create overrides, list schedule users |
| Escalation Policies | List and get escalation policy details |
| Teams & Users | Create, update, delete teams; manage team members; list and get user data |
| Alert Grouping | Create, update, delete, and list alert grouping settings |
| Change Events | List and get change events by service or incident |
| Event Orchestration | Manage event orchestration routers, global rules, and service rules |
| Incident Workflows | List, get, and start incident workflows |
| Log Entries | List and get log entry details |
| Status Pages | Create and manage status page posts, updates, impacts, and severities |

This is the official PagerDuty-hosted MCP server. It exposes 60+ tools covering incidents, services, on-call, escalation, event orchestration, incident workflows, status pages, and more. The hosted service at mcp.pagerduty.com exposes all tools (both read and write) by default.
Tool availability depends on your PagerDuty plan and user account permissions.

Prerequisites

- Azure SRE Agent resource deployed in Azure
- PagerDuty account with an active plan
- PagerDuty user account with appropriate permissions
- User API Token: created from My Profile > User Settings > API Access

Step 1: Create a PagerDuty API token

Generate the User API Token needed to authenticate with the PagerDuty MCP server. PagerDuty uses a single token for both authentication and authorization; the token inherits all permissions of the user account that creates it.

Navigate to API Access in PagerDuty

1. Log in to your PagerDuty account (for EU accounts, use https://app.eu.pagerduty.com/)
2. Select your user avatar in the top-right corner of the navigation bar
3. Select My Profile from the dropdown menu
4. Select the User Settings tab at the top of your profile page
5. Scroll down to the API Access section

Create a User API Token

1. In the API Access section, select Create API User Token
2. Enter a descriptive name for the token (e.g., sre-agent-mcp)
3. Select Create Token
4. Copy the token value immediately; it is displayed only once and cannot be retrieved later

The token format will look like: u+xxxxxxxxxxxxxxxx

Store the API token securely. If you lose it, you must delete the old token and create a new one. Navigate back to My Profile > User Settings > API Access to manage your tokens.

Choose the right account for token creation

The API token inherits all permissions of the PagerDuty user account that creates it. Consider these options:

| Account type | When to use | Permissions |
|-------|----------|----------|
| Personal account | Quick testing and development | Full permissions of your user role |
| Service account (recommended for production) | Production deployments | Create a dedicated PagerDuty user with a restricted role |
| Read-only account | Monitoring-only use cases | Create a user with the Observer or Restricted Access role |

For production use, create a dedicated PagerDuty user with a Responder or Observer role (depending on whether write access is needed), then generate the token from that account. This ensures the integration continues to work if team members leave the organization and limits the blast radius of a compromised token.

PagerDuty also supports Account-level API keys (created under Integrations > Developer Tools > API Access Keys), but the MCP server requires a User API Token, not an account-level key.

Step 2: Add the MCP connector

Connect the PagerDuty MCP server to your SRE Agent using the portal. Since there is no dedicated PagerDuty connector type, you use the generic MCP server (User provided connector) option.

Determine your regional endpoint

Select the endpoint URL that matches your PagerDuty account's service region:

| Region | Endpoint URL |
|-------|----------|
| US (default) | https://mcp.pagerduty.com/mcp |
| EU | https://mcp.eu.pagerduty.com/mcp |

Using the Azure portal

1. In the Azure portal, navigate to your SRE Agent resource
2. Select Builder > Connectors
3. Select Add connector
4. Select MCP server (User provided connector) and select Next
5. Configure the connector:

| Field | Value |
|-------|----------|
| Name | pagerduty-mcp |
| Connection type | Streamable-HTTP |
| URL | https://mcp.pagerduty.com/mcp (use EU endpoint for EU service region) |
| Authentication | Custom headers |
| Authorization | Token <your-pagerduty-api-token> |

6. Select Next to review
7. Select Add connector

The token format in the Authorization header must be Token <your-api-token> (not Bearer). For example: Token u+abcdefg123456789. Using the wrong format will result in 401 Unauthorized errors.
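As a rough illustration of the connector values in Step 2, the following Python sketch assembles the endpoint URL and Authorization header for a given service region. The function name `pagerduty_connector_settings` and the shape of the returned dictionary are hypothetical and are not part of Azure SRE Agent or PagerDuty; only the two endpoint URLs, the connector name, and the `Token <value>` header format come from the steps above.

```python
# Endpoint URLs copied from the regional endpoint table in Step 2.
PAGERDUTY_MCP_ENDPOINTS = {
    "us": "https://mcp.pagerduty.com/mcp",
    "eu": "https://mcp.eu.pagerduty.com/mcp",
}


def pagerduty_connector_settings(region: str, api_token: str) -> dict:
    """Return the values to enter in the generic MCP connector form.

    Hypothetical helper for illustration only; the portal form is filled in by hand.
    """
    key = region.strip().lower()
    if key not in PAGERDUTY_MCP_ENDPOINTS:
        raise ValueError(f"unknown service region: {region!r} (expected 'us' or 'eu')")

    token = api_token.strip()
    if not token:
        raise ValueError("API token must not be empty")

    # Reject values that already carry a scheme, so the header never ends up
    # as 'Token Bearer ...' or 'Token Token ...'.
    if token.split()[0].lower() in ("bearer", "token"):
        raise ValueError("pass the raw token value without a 'Bearer' or 'Token' prefix")

    return {
        "name": "pagerduty-mcp",
        "connection_type": "Streamable-HTTP",
        "url": PAGERDUTY_MCP_ENDPOINTS[key],
        "headers": {"Authorization": f"Token {token}"},
    }
```

Calling `pagerduty_connector_settings("eu", "u+abcdefg123456789")` yields the EU endpoint and the header value `Token u+abcdefg123456789`, which are the same values you would type into the connector form.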
Once the connector shows Connected status, the PagerDuty MCP tools are automatically available to your agent. You can verify by checking the tools list in the connector details.

Step 3: Create a PagerDuty subagent (optional)

Create a specialized subagent to give the AI focused PagerDuty incident management expertise and better prompt responses.

1. Navigate to Builder > Subagents
2. Select Add subagent
3. Paste the following YAML configuration:

```yaml
api_version: azuresre.ai/v1
kind: AgentConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: PagerDutyIncidentExpert
  display_name: PagerDuty Incident Expert
  system_prompt: |
    You are a PagerDuty incident management expert with access to incidents,
    services, on-call schedules, escalation policies, teams, event orchestration,
    incident workflows, status pages, and more via the PagerDuty MCP server.

    ## Capabilities

    ### Incidents
    - List and search incidents with `list_incidents`
    - Get incident details with `get_incident`
    - Create new incidents with `create_incident`
    - Manage incidents (update status, urgency, assignment, escalation) with `manage_incidents`
    - Add notes with `add_note_to_incident` and list notes with `list_incident_notes`
    - Add responders with `add_responders`
    - View alerts from incidents with `list_alerts_from_incident` and `get_alert_from_incident`
    - Find related incidents with `get_related_incidents`
    - Find similar past incidents with `get_past_incidents`
    - Identify outlier incidents with `get_outlier_incident`

    ### Services
    - List all services with `list_services`
    - Get service details with `get_service`
    - Create new services with `create_service`
    - Update service configuration with `update_service`

    ### On-Call & Schedules
    - List current on-calls with `list_oncalls`
    - Get schedule details with `get_schedule`
    - List all schedules with `list_schedules`
    - List users in a schedule with `list_schedule_users`
    - Create and update schedules with `create_schedule` and `update_schedule`
    - Create schedule overrides with `create_schedule_override`

    ### Escalation Policies
    - List escalation policies with `list_escalation_policies`
    - Get escalation policy details with `get_escalation_policy`

    ### Teams & Users
    - List teams with `list_teams` and get team details with `get_team`
    - Create, update, and delete teams
    - Manage team members with `add_team_member` and `remove_team_member`
    - List users with `list_users` and get user data with `get_user_data`

    ### Event Orchestration
    - List and get event orchestrations
    - Manage orchestration routers, global rules, and service rules
    - Append rules to event orchestration routers

    ### Incident Workflows
    - List and get incident workflows
    - Start incident workflows with `start_incident_workflow`

    ### Status Pages
    - Create and manage status page posts and updates
    - List status page impacts, severities, and statuses

    ### Log Entries
    - List and get log entry details for audit trails

    ### Alert Grouping
    - Create, update, and manage alert grouping settings

    ### Change Events
    - List and get change events, including by service or incident

    ## Best Practices

    When investigating incidents:
    - Start with `list_incidents` to find active or recent incidents
    - Use `get_incident` for full details including status and assignments
    - Check `list_alerts_from_incident` to see triggering alerts
    - Use `get_related_incidents` to find correlated issues
    - Use `get_past_incidents` to find similar historical incidents
    - Check `list_oncalls` to identify who is currently on-call
    - Review `list_incident_notes` for any existing investigation notes

    When managing on-call:
    - Use `list_oncalls` to see current on-call assignments
    - Use `get_schedule` and `list_schedule_users` for schedule details
    - Use `create_schedule_override` for temporary coverage changes

    When handling errors:
    - If 401 errors occur, explain the token may be invalid or expired
    - If 403 errors occur, explain which permissions may be missing
    - Suggest the user verify their API token is valid and has sufficient permissions
  mcp_connectors:
    - pagerduty-mcp
  handoffs: []
```

4. Select Save

The mcp_connectors field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the PagerDuty MCP server.

Step 4: Add a PagerDuty skill (optional)

Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a PagerDuty skill to give your agent expertise in incident management, on-call scheduling, and escalation workflows.

1. Navigate to Builder > Skills
2. Select Add skill
3. Paste the following skill configuration:

```yaml
api_version: azuresre.ai/v1
kind: SkillConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: pagerduty_incident_management
  display_name: PagerDuty Incident Management
  description: |
    Expertise in PagerDuty's incident management platform including incidents,
    on-call schedules, services, teams, escalation policies, event orchestration,
    incident workflows, and status pages. Use for managing incidents, checking
    on-call status, investigating alerts, escalating issues, and navigating
    PagerDuty data via the PagerDuty MCP server.
  instructions: |
    ## Overview

    PagerDuty is an incident management and on-call scheduling platform for
    operations teams. The PagerDuty MCP server enables natural language
    interaction with your PagerDuty account data including incidents, services,
    schedules, teams, escalation policies, and more.

    **Authentication:** A single `Authorization` custom header with the format
    `Token <api-token-value>`. All actions respect the permissions of the user
    account associated with the token.

    **Regional endpoints:** The hosted MCP server has two endpoints: US
    (`mcp.pagerduty.com`) and EU (`mcp.eu.pagerduty.com`). Ensure the connector
    URL matches your PagerDuty service region.

    ## Incident Management

    Use `list_incidents` to search and filter incidents, `get_incident` for
    details, and `manage_incidents` to update status, urgency, assignment, or
    escalation level.

    **Common incident workflows:**

    ```
    # List all triggered incidents
    Use list_incidents with status "triggered"

    # List high-urgency incidents
    Use list_incidents filtered by urgency "high"

    # Get details for a specific incident
    Use get_incident with the incident ID

    # Acknowledge an incident
    Use manage_incidents to set status to "acknowledged"

    # Resolve an incident
    Use manage_incidents to set status to "resolved"

    # Escalate an incident
    Use manage_incidents to escalate to the next level
    ```

    ## On-Call Management

    Use `list_oncalls` to see current on-call assignments, `get_schedule` for
    schedule details, and `create_schedule_override` for temporary coverage.

    **Common on-call workflows:**

    ```
    # Who is currently on-call?
    Use list_oncalls to see all current on-call assignments

    # Who is on-call for a specific escalation policy?
    Use list_oncalls filtered by escalation_policy_id

    # Get details for a schedule
    Use get_schedule with the schedule ID

    # Create a temporary override
    Use create_schedule_override with start/end times and user
    ```

    ## Service Management

    Use `list_services` to discover services, `get_service` for details, and
    `create_service` or `update_service` for configuration changes.

    **Service investigation patterns:**

    ```
    # List all services
    Use list_services

    # Get service details including integrations
    Use get_service with the service ID

    # Find incidents for a specific service
    Use list_incidents filtered by service_id
    ```

    ## Escalation Policy Management

    Use `list_escalation_policies` to discover policies and
    `get_escalation_policy` for details including escalation rules and targets.

    ## Team Management

    Use `list_teams` to discover teams, `get_team` for details, and team member
    management tools for roster changes.

    ## Incident Investigation Workflow

    For structured incident investigation:

    1. `list_incidents`: find active or recent incidents
    2. `get_incident`: get full incident details and current status
    3. `list_alerts_from_incident`: see triggering alerts and their details
    4. `get_alert_from_incident`: get specific alert details
    5. `get_related_incidents`: find correlated incidents
    6. `get_past_incidents`: find similar historical incidents
    7. `list_oncalls`: identify who is currently on-call
    8. `list_incident_notes`: review existing investigation notes
    9. `add_note_to_incident`: document findings
    10. `manage_incidents`: update status, urgency, or escalate

    ## Event Orchestration

    Use event orchestration tools to manage how events are routed and processed:

    - `list_event_orchestrations`: discover orchestration configurations
    - `get_event_orchestration_router`: view routing rules
    - `append_event_orchestration_router_rule`: add new routing rules
    - `get_event_orchestration_global`: view global orchestration rules
    - `get_event_orchestration_service`: view service-level rules

    ## Incident Workflows

    Use `list_incident_workflows` to discover automated workflows and
    `start_incident_workflow` to trigger them for an incident.

    ## Status Page Management

    Use status page tools to communicate during incidents:

    - `list_status_pages`: discover status pages
    - `create_status_page_post`: create a new incident post
    - `create_status_page_post_update`: add updates to existing posts
    - `list_status_page_impacts`: view impact categories
    - `list_status_page_severities`: view severity levels

    ## Troubleshooting

    | Issue | Solution |
    |-------|----------|
    | 401 Unauthorized | Verify the API token is valid and not expired |
    | 403 Forbidden | Check that the user account has sufficient permissions |
    | Connection refused | Verify firewall allows HTTPS to mcp.pagerduty.com |
    | EU region errors | Ensure you are using `mcp.eu.pagerduty.com` for EU accounts |
    | Token format error | Use `Token <value>` format, not `Bearer <value>` |
    | No data returned | Verify the token's user account has access to the requested resources |
  mcp_connectors:
    - pagerduty-mcp
```

4. Select Save

Reference the skill in your subagent

Update your subagent configuration to include the skill:

```yaml
spec:
  name: PagerDutyIncidentExpert
  skills:
    - pagerduty_incident_management
  mcp_connectors:
    - pagerduty-mcp
```

Step 5: Test the integration

1. Open a new chat session with your SRE Agent
2. Try these example prompts:

Incident management

- Show me all currently triggered incidents
- Get details for incident P1234567 including the timeline and notes
- Create a new high-urgency incident for the payment-service with title "Payment processing degraded"
- Acknowledge all triggered incidents assigned to me

On-call and schedules

- Who is currently on-call for the platform-engineering escalation policy?
- Show me the on-call schedule for the next 7 days
- Create a schedule override for John Smith covering Saturday 9am to Monday 9am
- List all users in the primary on-call schedule

Service and team management

- List all services and their current status
- Get details for the checkout-service including escalation policy and integrations
- Show me all teams and their members
- What escalation policies are configured for the payment team?

Incident investigation

- Find incidents related to the current database outage
- Show me similar past incidents to P1234567
- What alerts triggered incident P1234567?
- List all notes and timeline entries for the most recent SEV-1 incident

Event orchestration and workflows

- List all event orchestration configurations
- Show me the routing rules for the production orchestration
- What incident workflows are available?
- Start the "SEV-1 Response" workflow for incident P1234567

Status page management

- List all status pages
- Create a new status page post for the ongoing API degradation
- Add an update to the current status page post indicating the issue is being investigated
- What severity levels are available for status page posts?
Available tools

Incidents

| Tool | Description |
|-------|----------|
| get_incident | Get details of a specific incident by ID |
| list_incidents | List and filter incidents by status, urgency, service, and more |
| create_incident | Create a new incident on a specified service |
| manage_incidents | Update incident status, urgency, assignment, or escalation level |
| add_note_to_incident | Add an investigation note to an incident |
| list_incident_notes | List all notes on an incident |
| add_responders | Add additional responders to an incident |
| list_alerts_from_incident | List all alerts associated with an incident |
| get_alert_from_incident | Get details of a specific alert from an incident |
| get_outlier_incident | Identify outlier incidents based on patterns |
| get_past_incidents | Find similar historical incidents |
| get_related_incidents | Find incidents related to a specific incident |

Services

| Tool | Description |
|-------|----------|
| get_service | Get details of a specific service |
| list_services | List all services in the account |
| create_service | Create a new service |
| update_service | Update service configuration |

On-Call & Schedules

| Tool | Description |
|-------|----------|
| list_oncalls | List current on-call assignments |
| get_schedule | Get details of a specific schedule |
| list_schedules | List all schedules |
| list_schedule_users | List users in a specific schedule |
| create_schedule | Create a new on-call schedule |
| update_schedule | Update an existing schedule |
| create_schedule_override | Create a temporary schedule override |

Escalation Policies

| Tool | Description |
|-------|----------|
| list_escalation_policies | List all escalation policies |
| get_escalation_policy | Get details of a specific escalation policy |

Teams & Users

| Tool | Description |
|-------|----------|
| get_team | Get details of a specific team |
| list_teams | List all teams |
| list_team_members | List members of a specific team |
| create_team | Create a new team |
| update_team | Update team details |
| delete_team | Delete a team |
| add_team_member | Add a user to a team |
| remove_team_member | Remove a user from a team |
| get_user_data | Get details of a specific user |
| list_users | List all users in the account |

Alert Grouping

| Tool | Description |
|-------|----------|
| create_alert_grouping_setting | Create an alert grouping configuration |
| get_alert_grouping_setting | Get details of an alert grouping setting |
| list_alert_grouping_settings | List all alert grouping settings |
| update_alert_grouping_setting | Update an alert grouping setting |
| delete_alert_grouping_setting | Delete an alert grouping setting |

Change Events

| Tool | Description |
|-------|----------|
| get_change_event | Get details of a specific change event |
| list_change_events | List all change events |
| list_incident_change_events | List change events related to an incident |
| list_service_change_events | List change events for a specific service |

Event Orchestration

| Tool | Description |
|-------|----------|
| get_event_orchestration | Get details of an event orchestration |
| list_event_orchestrations | List all event orchestrations |
| get_event_orchestration_router | Get routing rules for an orchestration |
| update_event_orchestration_router | Update routing rules |
| append_event_orchestration_router_rule | Add a new routing rule |
| get_event_orchestration_global | Get global orchestration rules |
| get_event_orchestration_service | Get service-level orchestration rules |

Incident Workflows

| Tool | Description |
|-------|----------|
| get_incident_workflow | Get details of an incident workflow |
| list_incident_workflows | List all incident workflows |
| start_incident_workflow | Start an incident workflow for a specific incident |

Log Entries

| Tool | Description |
|-------|----------|
| get_log_entry | Get details of a specific log entry |
| list_log_entries | List log entries for audit and investigation |

Status Pages

| Tool | Description |
|-------|----------|
| create_status_page_post | Create a new status page incident post |
| create_status_page_post_update | Add an update to a status page post |
| get_status_page_post | Get details of a status page post |
| list_status_page_impacts | List available impact categories |
| list_status_page_post_updates | List updates for a status page post |
| list_status_page_severities | List available severity levels |
| list_status_page_statuses | List available status values |
| list_status_pages | List all status pages |

Write operations

The PagerDuty MCP server supports both read and write operations. The hosted service at mcp.pagerduty.com exposes all tools (both read and write) by default.

Write tools

Write operations include creating and modifying PagerDuty resources:

| Category | Write tools |
|-------|----------|
| Incidents | create_incident, manage_incidents, add_note_to_incident, add_responders |
| Services | create_service, update_service |
| Schedules | create_schedule, update_schedule, create_schedule_override |
| Teams | create_team, update_team, delete_team, add_team_member, remove_team_member |
| Alert Grouping | create_alert_grouping_setting, update_alert_grouping_setting, delete_alert_grouping_setting |
| Event Orchestration | update_event_orchestration_router, append_event_orchestration_router_rule |
| Incident Workflows | start_incident_workflow |
| Status Pages | create_status_page_post, create_status_page_post_update |

PagerDuty also provides a self-hosted MCP server that can be run locally. The self-hosted server exposes only read-only tools by default; write tools require the --enable-write-tools flag at startup. For Azure SRE Agent, the hosted service at mcp.pagerduty.com is recommended as it requires no infrastructure management and exposes all tools automatically.
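Because the hosted endpoint exposes write tools by default, it can help to know which tool names can modify PagerDuty data when deciding what role the token's user account should have. The Python sketch below is one way to partition a tool list using the write tools named in this section. `WRITE_TOOLS` and `split_tools` are hypothetical names, and treating every name outside the list as read-only is an assumption of this sketch, not a guarantee from PagerDuty.

```python
# Write tools copied from the "Write tools" list in this section.
WRITE_TOOLS = frozenset({
    "create_incident", "manage_incidents", "add_note_to_incident", "add_responders",
    "create_service", "update_service",
    "create_schedule", "update_schedule", "create_schedule_override",
    "create_team", "update_team", "delete_team", "add_team_member", "remove_team_member",
    "create_alert_grouping_setting", "update_alert_grouping_setting",
    "delete_alert_grouping_setting",
    "update_event_orchestration_router", "append_event_orchestration_router_rule",
    "start_incident_workflow",
    "create_status_page_post", "create_status_page_post_update",
})


def split_tools(tool_names):
    """Partition tool names into (read_only, write) lists, preserving input order.

    Assumption: any name not listed in WRITE_TOOLS is treated as read-only.
    """
    read_only, write = [], []
    for name in tool_names:
        (write if name in WRITE_TOOLS else read_only).append(name)
    return read_only, write
```

For example, passing the tool names shown in the connector details to `split_tools` gives a quick view of how many of them an Observer-style, read-only account would be unable to use.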
Troubleshooting

Authentication issues

| Error | Cause | Solution |
|-------|----------|----------|
| 401 Unauthorized | Invalid or expired API token | Verify the token is correct and active in User Settings > API Access |
| 403 Forbidden | Insufficient user permissions | Ensure the user account associated with the token has the required PagerDuty role |
| Connection refused | Firewall blocking outbound HTTPS | Verify firewall allows HTTPS traffic to mcp.pagerduty.com (port 443) |
| Token format error | Using Bearer instead of Token | The Authorization header must use Token <value> format, not Bearer <value> |

Data and permission issues

| Error | Cause | Solution |
|-------|----------|----------|
| No data returned | Token user lacks access to the resource | Verify the user account has access to the requested services, teams, or incidents |
| EU region errors | Using US endpoint for EU account | Switch the connector URL to https://mcp.eu.pagerduty.com/mcp |
| Write operation failed | User lacks write permissions | Verify the token's user account has a role that allows write operations (e.g., Manager, Admin) |
| Rate limit exceeded | Too many API requests | PagerDuty rate limits vary by plan; reduce request frequency or contact PagerDuty support |
| Incident not found | Wrong incident ID or no access | Verify the incident ID and that the token's user has access to the incident's service |

Verify the connection

Test the server endpoint directly:

```
curl -I "https://mcp.pagerduty.com/mcp" \
  -H "Authorization: Token <your-api-token>"
```

Expected response: 200 OK confirms authentication is working.
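If you script the connection check, the status code can be mapped back to the troubleshooting guidance above. The following Python sketch does only that mapping and makes no network calls. The function name `troubleshooting_hint` is hypothetical, and associating "Rate limit exceeded" with HTTP 429 is an assumption based on the standard HTTP status for too many requests; this document does not state which code PagerDuty returns.

```python
def troubleshooting_hint(status_code: int) -> str:
    """Map an HTTP status code to the matching guidance from the tables above."""
    hints = {
        200: "Authentication is working.",
        401: ("Invalid or expired API token, or wrong header format. "
              "Use 'Token <value>', not 'Bearer <value>'."),
        403: ("Insufficient user permissions. Check the PagerDuty role of the "
              "account that created the token."),
        # Assumption: rate limiting is reported as 429 Too Many Requests.
        429: "Rate limit exceeded. Reduce request frequency.",
    }
    return hints.get(
        status_code,
        "Unexpected status. Check the regional endpoint URL and firewall rules.",
    )
```

A wrapper around the curl command could pass its returned status to `troubleshooting_hint` and print the result for whoever is setting up the connector.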
Re-authorize the integration

If you encounter persistent issues:

1. Navigate to My Profile > User Settings > API Access in PagerDuty
2. Delete the existing API User Token
3. Create a new API User Token
4. Update the connector in the SRE Agent portal with the new token value in the Authorization header (format: Token <new-token>)

Limitations

| Limitation | Details |
|-------|----------|
| User-scoped permissions | API token permissions are tied to the creating user's account; the token cannot exceed the user's access level |
| Self-hosted write restriction | The self-hosted MCP server only exposes read-only tools by default; write tools require the --enable-write-tools flag |
| Rate limits | API rate limits apply per your PagerDuty plan; high-frequency usage may be throttled |
| No dedicated connector type | The portal does not have a dedicated PagerDuty connector; you must use the generic MCP server connector and configure headers manually |
| Two regional endpoints only | Only US and EU service regions are supported; the endpoint must match your account's service region |
| Token rotation | API tokens do not automatically expire; manual rotation is recommended as a security best practice |

Security considerations

How permissions work

- User-scoped: All actions respect the permissions of the PagerDuty user account that created the API token
- Token-based: A single User API Token in the Authorization header provides both authentication and authorization
- Role-based: The token inherits the PagerDuty role (Observer, Responder, Manager, Admin, etc.) of the creating user

Admin controls

PagerDuty administrators can:

- Create and revoke User API tokens from user profile settings
- Assign roles to user accounts to control permission scope
- Use service accounts with restricted roles to limit the blast radius of compromised tokens
- Monitor API token usage through PagerDuty's audit logs
- Enforce token rotation policies as part of security governance

PagerDuty User API tokens can read and modify sensitive operational data including incidents, on-call schedules, and service configurations. Use service account tokens with restricted roles, grant only the permissions your agent needs, and rotate tokens regularly. Monitor the PagerDuty audit logs for unusual activity.

PagerDuty SRE Agent

In addition to connecting Azure SRE Agent to PagerDuty via MCP, PagerDuty offers its own built-in SRE Agent, an AI-powered assistant that works side-by-side with responders during incident triage and resolution. When combined with the Azure SRE Agent MCP integration, you get a powerful end-to-end incident management experience.

What is PagerDuty SRE Agent?

PagerDuty’s SRE Agent transforms incident response in the Operations Console and Slack by automatically analyzing incidents, providing key context, and recommending remediation actions. It accelerates triage to reduce risk, cost, and cognitive load, and it continuously learns to prevent repeat issues.
Key features

- Automated incident analysis: Ingests and analyzes runbooks, SOPs, and diagnostics (e.g., error logs) to surface likely root causes
- Playbook generation: Generates and saves playbooks for recurring issues based on past resolutions
- Pattern detection: Detects patterns, recalls similar incidents, and provides structured troubleshooting
- Actionable nudges: Recommends next steps through interactive buttons such as “Upload Runbook,” “Analyze Past Incidents,” “Generate a Playbook,” and “Search Logs”
- Continuous learning: Builds memory from resolved incidents including incident playbooks, service runbooks, incident summaries, and service profiles
- Observability integrations: Retrieves log data from platforms like Grafana, Datadog, New Relic, and AWS CloudWatch for deeper investigation

Prerequisites

- PagerDuty Advance add-on (required for both Operations Console and Slack access)
- AIOps add-on (required for Operations Console access)
- Available on Enterprise, Business, and Professional plans
- An Account Owner or Global Admin role is required to enable SRE Agent

Step 1: Enable PagerDuty SRE Agent

1. In the PagerDuty web app, navigate to AI > AI Settings
2. Select the Assistant and AI Agents Configuration tab
3. Under AI Agents, find SRE Agent and toggle the switch to the on position

If you don’t have Account Owner or Global Admin permissions, click Request to Admin next to the SRE Agent toggle. This sends an email request to your admins to enable it for you.

Step 2: Configure tool integrations (optional)

PagerDuty SRE Agent can retrieve log data and runbooks from external tools for deeper investigation. Set up Workflow Integrations and select Allow SRE Agent access for each integration.
Supported integrations include:

- Log platforms: Grafana, Datadog, New Relic, AWS CloudWatch
- Runbook sources: Confluence, GitHub

For runbook sources, update your event payload to include the runbook URL in custom_details:

```
"custom_details": {
  "runbook_url": {
    "confluence": "https://YOUR-RUNBOOK-LINK"
  }
}
```

For more details, see Agent Tooling Configuration.

Step 3: Use SRE Agent in Operations Console

1. Navigate to AIOps > Operations Console
2. Optional: Add the SRE Agent column to the Operations Console for faster incident triage
3. Select an incident by clicking its Title
4. Select the SRE Agent tab and wait for the agent to load your incident summary
5. Begin troubleshooting by asking questions or using the agent’s nudge buttons (e.g., Upload Runbook, Analyze Past Incidents, Generate a Playbook)

How it works with Azure SRE Agent

Azure SRE Agent has a built-in direct integration with PagerDuty’s SRE Agent. This means you can query PagerDuty’s AI-powered SRE Agent directly from within Azure SRE Agent’s chat interface, with no separate tab or tool switching required.

Built-in PagerDuty Incident Management Agent

Azure SRE Agent includes a dedicated PagerDuty Incident Management Agent that provides the following tools:

| Tool | Description |
|-------|----------|
| QueryPagerDutyIncidentChat | Queries PagerDuty’s SRE Agent (Advance Chat API) for intelligent insights, troubleshooting guidance, runbook generation, or diagnostic recommendations about a specific incident |
| GetPagerDutyIncidentById | Retrieves details for a specific PagerDuty incident by its ID |
| ResolvePagerDutyIncident | Resolves a PagerDuty incident directly from Azure SRE Agent |
| AcknowledgePagerDutyIncident | Acknowledges a PagerDuty incident |
| AddNoteToPagerDutyIncident | Adds notes to a PagerDuty incident for tracking investigation progress |

Querying PagerDuty SRE Agent from Azure SRE Agent

The QueryPagerDutyIncidentChat tool connects directly to PagerDuty’s Advance Chat API (https://api.pagerduty.com/advance/chat) using your PagerDuty API token. When you ask Azure SRE Agent a question about a PagerDuty incident, it automatically calls PagerDuty’s SRE Agent and returns the AI-powered response.

This enables scenarios like:

- “What caused incident Q391Y5VW0YYUEL?”: PagerDuty SRE Agent analyzes the incident context and provides root cause analysis
- “Generate a runbook for incident Q391Y5VW0YYUEL”: PagerDuty SRE Agent creates a step-by-step runbook based on the incident details
- “How do I troubleshoot incident Q391Y5VW0YYUEL?”: PagerDuty SRE Agent recommends diagnostic and remediation steps
- “Provide mitigation steps for incident Q391Y5VW0YYUEL”: PagerDuty SRE Agent suggests actions prioritized by urgency and impact
- “Triage incident Q391Y5VW0YYUEL”: PagerDuty SRE Agent provides a full triage summary with next steps

Configuration

The PagerDuty SRE Agent integration uses the same API token you configured for PagerDuty incident management. No additional setup is required beyond the standard PagerDuty connector configuration. When PagerDuty is configured as your incident management platform in Azure SRE Agent settings, the QueryPagerDutyIncidentChat tool is automatically available.

The PagerDuty Advance Chat API requires a PagerDuty Advance subscription. Each query to the SRE Agent consumes 4 PagerDuty Advance credits. Ensure your account has sufficient credits for your expected usage.
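To size credit usage against the figure quoted above, a small calculation is enough. The Python sketch below assumes the stated cost of 4 PagerDuty Advance credits per SRE Agent query applies uniformly; the function names are hypothetical, and actual billing may differ from this simple model.

```python
CREDITS_PER_QUERY = 4  # cost per SRE Agent query, as stated in the note above


def advance_credits_needed(queries: int) -> int:
    """Credits consumed by a given number of SRE Agent queries."""
    if queries < 0:
        raise ValueError("queries must be non-negative")
    return queries * CREDITS_PER_QUERY


def queries_affordable(credit_balance: int) -> int:
    """Whole number of queries a credit balance can cover."""
    return max(credit_balance, 0) // CREDITS_PER_QUERY
```

For instance, 25 queries would consume 100 credits under this model, and a balance of 10 credits would cover 2 queries.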
End-to-end workflow

With PagerDuty configured as both an MCP connector and an incident management platform, Azure SRE Agent enables a seamless workflow:

1. Detect: Azure SRE Agent monitors your Azure infrastructure and detects issues
2. Correlate: Azure SRE Agent retrieves related PagerDuty incidents for the affected Azure resources
3. Triage: Azure SRE Agent queries PagerDuty’s SRE Agent for AI-powered root cause analysis, troubleshooting steps, and runbook recommendations
4. Act: Azure SRE Agent acknowledges, adds notes to, or resolves PagerDuty incidents, all from a single conversation
5. Learn: PagerDuty SRE Agent saves incident learnings and playbooks for future incidents, improving response over time

For the best experience, configure both the PagerDuty MCP connector (for service and schedule queries) and PagerDuty as your incident management platform (for direct SRE Agent access). This gives your team the full breadth of PagerDuty capabilities from within Azure SRE Agent.

For full documentation on PagerDuty SRE Agent capabilities, including best practices and example questions, see the PagerDuty SRE Agent documentation.

Related content

- PagerDuty MCP Server documentation
- PagerDuty REST API v2 documentation
- PagerDuty API Access Keys
- PagerDuty User Roles
- PagerDuty Audit Records
- MCP integration overview
- Build a custom subagent
- PagerDuty SRE Agent documentation
- PagerDuty Advance
dbandaru
Feb 26, 2026 Place Apps on Azure Blog
Get started with Datadog MCP server in Azure SRE Agent
Overview

The Datadog MCP server is a cloud-hosted bridge between your Datadog organization and Azure SRE Agent. Once configured, it enables real-time interaction with logs, metrics, APM traces, monitors, incidents, dashboards, and other Datadog data through natural language. All actions respect your existing Datadog RBAC permissions.

The server uses Streamable HTTP transport with two custom headers (DD_API_KEY and DD_APPLICATION_KEY) for authentication. Azure SRE Agent connects directly to the Datadog-hosted endpoint, so no npm packages, local proxies, or container deployments are required. The SRE Agent portal includes a dedicated Datadog MCP server connector type that pre-populates the required header keys for streamlined setup.

Key capabilities

| Area | Capabilities |
|------|--------------|
| Logs | Search and analyze logs with SQL-based queries, filter by facets and time ranges |
| Metrics | Query metric values, explore available metrics, get metric metadata and tags |
| APM | Search spans, fetch complete traces, analyze trace performance, compare traces |
| Monitors | Search monitors, validate configurations, inspect monitor groups and templates |
| Incidents | Search and get incident details, view timeline and responders |
| Dashboards | Search and list dashboards by name or tag |
| Hosts | Search hosts by name, tags, or status |
| Services | List services and map service dependencies |
| Events | Search events including monitor alerts, deployments, and custom events |
| Notebooks | Search and retrieve notebooks for investigation documentation |
| RUM | Search Real User Monitoring events for frontend observability |

This is the official Datadog-hosted MCP server (Preview). The server exposes 16+ core tools with additional toolsets available for alerting, APM, Database Monitoring, Error Tracking, feature flags, LLM Observability, networking, security, software delivery, and Synthetic tests. Tool availability depends on your Datadog plan and RBAC permissions.
Prerequisites Azure SRE Agent resource deployed in Azure Datadog organization with an active plan Datadog user account with appropriate RBAC permissions API key: Created from Organization Settings > API Keys Application key: Created from Organization Settings > Application Keys with MCP Read and/or MCP Write permissions Your organization must be allowlisted for the Datadog MCP server Preview Step 1: Create API and Application keys The Datadog MCP server requires two credentials: an API key (identifies your organization) and an Application key (authenticates the user and defines permission scope). Both are created in the Datadog portal. Create an API key Log in to your Datadog organization (use your region-specific URL if applicable—e.g., app.datadoghq.eu for EU1) Select your account avatar in the bottom-left corner of the navigation bar Select Organization Settings In the left sidebar, select API Keys (under the Access section) Direct URL: https://app.datadoghq.com/organization-settings/api-keys Select + New Key in the top-right corner Enter a descriptive name (e.g., sre-agent-mcp ) Select Create Key Copy the key value immediately—it is shown only once. If lost, you must create a new key. [!TIP] API keys are organization-level credentials. Any Datadog Admin or user with the API Keys Write permission can create them. The API key alone does not grant data access—it must be paired with an Application key. 
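Because the API key only works when paired with an Application key, it can be useful to check both values before pasting them into a connector. The sketch below is illustrative only: the header names DD_API_KEY and DD_APPLICATION_KEY come from the overview, while the helper names `build_datadog_headers` and `mask_secret` are hypothetical and not part of any Datadog or Azure SDK.

```python
from typing import Dict


def build_datadog_headers(api_key: str, application_key: str) -> Dict[str, str]:
    """Return the two custom headers; refuse to continue if either key is missing."""
    api_key = api_key.strip()
    application_key = application_key.strip()
    if not api_key or not application_key:
        raise ValueError("both the API key and the Application key are required")
    return {"DD_API_KEY": api_key, "DD_APPLICATION_KEY": application_key}


def mask_secret(value: str, visible: int = 4) -> str:
    """Mask a secret for display, keeping only the last `visible` characters."""
    if visible <= 0 or len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


if __name__ == "__main__":
    # Placeholder values, not real keys.
    headers = build_datadog_headers("example-api-key", "example-app-key")
    for name, value in headers.items():
        print(name, mask_secret(value))
```

Masking the values before printing keeps the keys, which are shown only once at creation, out of terminal history and logs.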
Create an Application key From the same Organization Settings page, select Application Keys in the left sidebar Direct URL: https://app.datadoghq.com/organization-settings/application-keys Select + New Key in the top-right corner Enter a descriptive name (e.g., sre-agent-mcp-app ) Select Create Key Copy the key value immediately—it is shown only once Add MCP permissions to the Application key After creating the Application key, you must grant it the MCP-specific scopes: In the Application Keys list, locate the key you just created Select the key name to open its detail panel In the detail panel, find the Scopes section and select Edit Search for MCP in the scopes search box Check MCP Read to enable read access to Datadog data via MCP tools Optionally check MCP Write if your agent needs to create or modify resources (e.g., feature flags, Synthetic tests) Select Save If you don't see the MCP Read or MCP Write scopes, your organization may not be enrolled in the Datadog MCP server preview. Contact your Datadog account representative to request access. Required permissions summary Permission Description Required? MCP Read Read access to Datadog data via MCP tools (logs, metrics, traces, monitors, etc.) Yes MCP Write Write access for mutating operations (creating feature flags, editing Synthetic tests, etc.) Optional For production use, create keys from a service account rather than a personal account. Navigate to Organization Settings > Service Accounts to create one. This ensures the integration continues to work if team members leave the organization. Apply the principle of least privilege—grant only MCP Read unless write operations are needed. Use scoped Application keys to restrict access to only the permissions your agent needs. This limits blast radius if a key is compromised. Step 2: Add the MCP connector Connect the Datadog MCP server to your SRE Agent using the portal. 
The portal includes a dedicated Datadog connector type that pre-populates the required configuration. Determine your regional endpoint Select the endpoint URL that matches your Datadog organization's region: Region Endpoint URL US1 (default) https://mcp.datadoghq.com/api/unstable/mcp-server/mcp US3 https://mcp.us3.datadoghq.com/api/unstable/mcp-server/mcp US5 https://mcp.us5.datadoghq.com/api/unstable/mcp-server/mcp EU1 https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp AP1 https://mcp.ap1.datadoghq.com/api/unstable/mcp-server/mcp AP2 https://mcp.ap2.datadoghq.com/api/unstable/mcp-server/mcp Using the Azure portal In Azure portal, navigate to your SRE Agent resource Select Builder > Connectors Select Add connector Select Datadog MCP server and select Next Configure the connector: Field Value Name datadog-mcp Connection type Streamable-HTTP (pre-selected) URL https://mcp.datadoghq.com/api/unstable/mcp-server/mcp (change for non-US1 regions) Authentication Custom headers (pre-selected, disabled) DD_API_KEY Your Datadog API key DD_APPLICATION_KEY Your Datadog Application key Select Next to review Select Add connector The Datadog connector type pre-populates both header keys ( DD_API_KEY and DD_APPLICATION_KEY ) and sets the authentication method to "Custom headers" automatically. The default URL is the US1 endpoint—update it if your organization is in a different region. Once the connector shows Connected status, the Datadog MCP tools are automatically available to your agent. You can verify by checking the tools list in the connector details. Step 3: Create a Datadog subagent (optional) Create a specialized subagent to give the AI focused Datadog observability expertise and better prompt responses. 
Navigate to Builder > Subagents Select Add subagent Paste the following YAML configuration: api_version: azuresre.ai/v1 kind: AgentConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: DatadogObservabilityExpert display_name: Datadog Observability Expert system_prompt: | You are a Datadog observability expert with access to logs, metrics, APM traces, monitors, incidents, dashboards, hosts, services, and more via the Datadog MCP server. ## Capabilities ### Logs - Search logs using facets, tags, and time ranges with `search_datadog_logs` - Perform SQL-based log analysis with `analyze_datadog_logs` for aggregations, grouping, and statistical queries - Correlate log entries with traces and metrics ### Metrics - Query metric time series with `get_datadog_metric` - Get metric metadata, tags, and context with `get_datadog_metric_context` - Discover available metrics with `search_datadog_metrics` ### APM (Application Performance Monitoring) - Fetch complete traces with `get_datadog_trace` - Search distributed traces and spans with `search_datadog_spans` - Analyze service-level performance and latency patterns - Map service dependencies with `search_datadog_service_dependencies` ### Monitors & Alerting - Search monitors by name, tag, or status with `search_datadog_monitors` - Investigate triggered monitors and alert history - Correlate monitor alerts with underlying metrics and logs ### Incidents - Search incidents with `search_datadog_incidents` - Get incident details, timeline, and responders with `get_datadog_incident` - Correlate incidents with monitors, logs, and traces ### Infrastructure - Search hosts by name, tag, or status with `search_datadog_hosts` - List and discover services with `search_datadog_services` - Search dashboards with `search_datadog_dashboards` - Search events (monitor alerts, deployments) with `search_datadog_events` ### Notebooks - Search notebooks with `search_datadog_notebooks` - Retrieve notebook content with 
`get_datadog_notebook` ### Real User Monitoring - Search RUM events for frontend performance data with `search_datadog_rum_events` ## Best Practices When investigating incidents: - Start with `search_datadog_incidents` or `get_datadog_incident` for context - Check related monitors with `search_datadog_monitors` - Correlate with `search_datadog_logs` and `get_datadog_metric` for root cause - Use `get_datadog_trace` to inspect request flows for latency issues - Check `search_datadog_hosts` for infrastructure-level problems When analyzing logs: - Use `analyze_datadog_logs` for SQL-based aggregation queries - Use `search_datadog_logs` for individual log retrieval and filtering - Include time ranges to narrow results and reduce response size - Filter by service, host, or status to focus on relevant data When working with metrics: - Use `search_datadog_metrics` to discover available metric names - Use `get_datadog_metric_context` to understand metric tags and metadata - Use `get_datadog_metric` to query actual metric values with time ranges When handling errors: - If access is denied, explain which RBAC permission is needed - Suggest the user verify their Application key has `MCP Read` or `MCP Write` - For large traces that appear truncated, note this is a known limitation mcp_connectors: - datadog-mcp handoffs: [] Select Save The mcp_connectors field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the Datadog MCP server. Step 4: Add a Datadog skill (optional) Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a Datadog skill to give your agent expertise in log queries, metric analysis, and incident investigation workflows. 
Navigate to Builder > Skills Select Add skill Paste the following skill configuration: api_version: azuresre.ai/v1 kind: SkillConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: datadog_observability display_name: Datadog Observability description: | Expertise in Datadog's observability platform including logs, metrics, APM, monitors, incidents, dashboards, hosts, and services. Use for searching logs, querying metrics, investigating incidents, analyzing traces, inspecting monitors, and navigating Datadog data via the Datadog MCP server. instructions: | ## Overview Datadog is a cloud-scale observability platform for logs, metrics, APM traces, monitors, incidents, infrastructure, and more. The Datadog MCP server enables natural language interaction with your organization's Datadog data. **Authentication:** Two custom headers—`DD_API_KEY` (API key) and `DD_APPLICATION_KEY` (Application key with MCP permissions). All actions respect existing RBAC permissions. **Regional endpoints:** The MCP server URL varies by Datadog region (US1, US3, US5, EU1, AP1, AP2). Ensure the connector URL matches your organization's region. ## Searching Logs Use `search_datadog_logs` for individual log retrieval and `analyze_datadog_logs` for SQL-based aggregation queries. 
**Common log search patterns:** ``` # Errors from a specific service service:payment-api status:error # Logs from a host in the last hour host:web-prod-01 # Logs containing a specific trace ID trace_id:abc123def456 # Errors with a specific HTTP status @http.status_code:500 service:api-gateway # Logs from a Kubernetes pod kube_namespace:production kube_deployment:checkout-service ``` **SQL-based log analysis with `analyze_datadog_logs`:** ```sql -- Count errors by service in the last hour SELECT service, count(*) as error_count FROM logs WHERE status = 'error' GROUP BY service ORDER BY error_count DESC -- Average response time by endpoint SELECT @http.url_details.path, avg(@duration) as avg_duration FROM logs WHERE service = 'api-gateway' GROUP BY @http.url_details.path ``` ## Querying Metrics Use `search_datadog_metrics` to discover metrics, `get_datadog_metric_context` for metadata, and `get_datadog_metric` for time series data. **Common metric patterns:** ``` # System metrics system.cpu.user, system.mem.used, system.disk.used # Container metrics docker.cpu.usage, kubernetes.cpu.requests # Application metrics trace.servlet.request.hits, trace.servlet.request.duration # Custom metrics app.payment.processed, app.queue.depth ``` Always specify a time range when querying metrics to avoid retrieving excessive data. ## Investigating Traces Use `get_datadog_trace` for complete trace details and `search_datadog_spans` for span-level queries. **Trace investigation workflow:** 1. Search for slow or errored spans with `search_datadog_spans` 2. Get the full trace with `get_datadog_trace` using the trace ID 3. Identify the bottleneck service or operation 4. Correlate with `search_datadog_logs` using the trace ID 5. Check related metrics with `get_datadog_metric` ## Working with Monitors Use `search_datadog_monitors` to find monitors by name, tag, or status. 
**Common monitor queries:** ``` # Find all triggered monitors Search for monitors with status "Alert" # Find monitors for a specific service Search for monitors tagged with service:payment-api # Find monitors by name Search for monitors matching "CPU" or "memory" ``` ## Incident Investigation Workflow For structured incident investigation: 1. `search_datadog_incidents` — find recent or active incidents 2. `get_datadog_incident` — get full incident details and timeline 3. `search_datadog_monitors` — check which monitors triggered 4. `search_datadog_logs` — search for errors around the incident time 5. `get_datadog_metric` — check key metrics for anomalies 6. `get_datadog_trace` — inspect request traces for latency or errors 7. `search_datadog_hosts` — verify infrastructure health 8. `search_datadog_service_dependencies` — map affected services ## Working with Dashboards and Notebooks - Use `search_datadog_dashboards` to find dashboards by title or tag - Use `search_datadog_notebooks` and `get_datadog_notebook` for investigation notebooks that document past analyses ## Toolsets The Datadog MCP server supports toolsets via the `?toolsets=` query parameter on the endpoint URL. 
Available toolsets: | Toolset | Description | |---------|-------------| | `core` | Logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, notebooks (default) | | `alerting` | Monitor validation, groups, and templates | | `apm` | Trace analysis, span search, Watchdog insights, performance investigation | | `dbm` | Database Monitoring query plans and samples | | `error-tracking` | Error Tracking issues across RUM, Logs, and Traces | | `feature-flags` | Creating, listing, and updating feature flags | | `llmobs` | LLM Observability spans | | `networks` | Cloud Network Monitoring, Network Device Monitoring | | `onboarding` | Guided Datadog setup and configuration | | `security` | Code security scanning, security signals, findings | | `software-delivery` | CI Visibility, Test Optimization | | `synthetics` | Synthetic test management | To enable additional toolsets, append `?toolsets=core,apm,alerting` to the connector URL. ## Troubleshooting | Issue | Solution | |-------|----------| | 401/403 errors | Verify API key and Application key are correct and active | | No data returned | Check that Application key has `MCP Read` permission | | Wrong region | Ensure the connector URL matches your Datadog organization's region | | Truncated traces | Large traces may be truncated; this is a known limitation | | Tool not found | The tool may require a non-default toolset; update the connector URL | | Write operations fail | Verify Application key has `MCP Write` permission | mcp_connectors: - datadog-mcp Select Save Reference the skill in your subagent Update your subagent configuration to include the skill: spec: name: DatadogObservabilityExpert skills: - datadog_observability mcp_connectors: - datadog-mcp Step 5: Test the integration Open a new chat session with your SRE Agent Try these example prompts: Log analysis Search for error logs from the payment-api service in the last hour Analyze logs to count errors by service over the last 24 hours Find all 
logs with HTTP 500 status from the api-gateway in the last 30 minutes Show me the most recent logs from host web-prod-01 Metrics investigation What is the current CPU usage across all production hosts? Show me the request rate and error rate for the checkout-service over the last 4 hours What metrics are available for the payment-api service? Get the p99 latency for the api-gateway service in the last hour APM and trace analysis Find the slowest traces for the checkout-service in the last hour Get the full trace details for trace ID abc123def456 What services depend on the payment-api? Search for errored spans in the api-gateway service from the last 30 minutes Monitor and alerting workflows Show me all monitors currently in Alert status Find monitors related to the database-primary host What monitors are tagged with team:platform? Search for monitors matching "disk space" or "memory" Incident investigation Show me all active incidents from the last 24 hours Get details for incident INC-12345 including the timeline What monitors triggered during the last production incident? Correlate the most recent incident with related logs and metrics Infrastructure and dashboards Search for hosts tagged with env:production and team:platform List all dashboards related to "Kubernetes" or "EKS" What services are running in the production environment? Show me recent deployment events for the checkout-service Available tools Core toolset (default) The core toolset is included by default and provides essential observability tools. 
Tool Description search_datadog_logs Search logs by facets, tags, and time ranges analyze_datadog_logs SQL-based log analysis for aggregations and statistical queries get_datadog_metric Query metric time series with rollup and aggregation get_datadog_metric_context Get metric metadata, tags, and related context search_datadog_metrics List and discover available metrics get_datadog_trace Fetch a complete distributed trace by trace ID search_datadog_spans Search APM spans by service, operation, or tags search_datadog_monitors Search monitors by name, tag, or status get_datadog_incident Get incident details including timeline and responders search_datadog_incidents List and search incidents search_datadog_dashboards Search dashboards by title or tag search_datadog_hosts Search hosts by name, tag, or status search_datadog_services List and search services search_datadog_service_dependencies Map service dependency relationships search_datadog_events Search events (monitor alerts, deployments, custom events) get_datadog_notebook Retrieve notebook content by ID search_datadog_notebooks Search notebooks by title or tag search_datadog_rum_events Search Real User Monitoring events Alerting toolset Enable with ?toolsets=core,alerting on the connector URL. Tool Description validate_datadog_monitor Validate monitor configuration before creation get_datadog_monitor_templates Get monitor configuration templates search_datadog_monitor_groups Search monitor groups and their statuses APM toolset Enable with ?toolsets=core,apm on the connector URL. Tool Description apm_search_spans Advanced span search with APM-specific filters apm_explore_trace Interactive trace exploration and analysis apm_trace_summary Get a summary analysis of a trace apm_trace_comparison Compare two traces side by side apm_analyze_trace_metrics Analyze aggregated trace metrics and trends Database Monitoring toolset Enable with ?toolsets=core,dbm on the connector URL. 
Tool Description search_datadog_dbm_plans Search database query execution plans search_datadog_dbm_samples Search database query samples and statistics Error Tracking toolset Enable with ?toolsets=core,error-tracking on the connector URL. Tool Description search_datadog_error_tracking_issues Search error tracking issues across RUM, Logs, and Traces get_datadog_error_tracking_issue Get details of a specific error tracking issue Feature Flags toolset Enable with ?toolsets=core,feature-flags on the connector URL. Tool Description list_datadog_feature_flags List feature flags create_datadog_feature_flag Create a new feature flag update_datadog_feature_flag_environment Update feature flag settings for an environment LLM Observability toolset Enable with ?toolsets=core,llmobs on the connector URL. Tool Description LLM Observability spans Query and analyze LLM Observability span data Networks toolset Enable with ?toolsets=core,networks on the connector URL. Tool Description Cloud Network Monitoring tools Analyze cloud network traffic and dependencies Network Device Monitoring tools Monitor and troubleshoot network devices Security toolset Enable with ?toolsets=core,security on the connector URL. Tool Description datadog_code_security_scan Run code security scanning datadog_sast_scan Run Static Application Security Testing datadog_secrets_scan Scan for secrets and credentials in code Software Delivery toolset Enable with ?toolsets=core,software-delivery on the connector URL. Tool Description search_datadog_ci_pipeline_events Search CI pipeline execution events get_datadog_flaky_tests Identify flaky tests in CI pipelines Synthetics toolset Enable with ?toolsets=core,synthetics on the connector URL. Tool Description get_synthetics_tests List and get Synthetic test configurations edit_synthetics_tests Edit Synthetic test settings synthetics_test_wizard Guided wizard for creating Synthetic tests Toolsets The Datadog MCP server organizes tools into toolsets. 
By default, only the core toolset is enabled. To enable additional toolsets, append the ?toolsets= query parameter to the connector URL.

Syntax

https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting

Examples

| Use case | URL suffix |
|----------|------------|
| Default (core only) | No suffix needed |
| Core + APM analysis | ?toolsets=core,apm |
| Core + Alerting + APM | ?toolsets=core,alerting,apm |
| Core + Database Monitoring | ?toolsets=core,dbm |
| Core + Security scanning | ?toolsets=core,security |
| Core + CI/CD visibility | ?toolsets=core,software-delivery |
| All toolsets | ?toolsets=core,alerting,apm,dbm,error-tracking,feature-flags,llmobs,networks,onboarding,security,software-delivery,synthetics |

[!TIP] Only enable the toolsets you need. Each additional toolset increases the number of tools exposed to the agent, which can increase token usage and may impact response quality. Start with core and add toolsets as needed.

Updating the connector URL

To add toolsets after initial setup:

1. Navigate to Builder > Connectors
2. Select the datadog-mcp connector
3. Update the URL field to include the ?toolsets= parameter
4. Select Save

Troubleshooting

Authentication issues

| Error | Cause | Solution |
|-------|-------|----------|
| 401 Unauthorized | Invalid API key or Application key | Verify both keys are correct and active in Organization Settings |
| 403 Forbidden | Missing RBAC permissions | Ensure the Application key has MCP Read and/or MCP Write permissions |
| Connection refused | Wrong regional endpoint | Verify the connector URL matches your Datadog organization's region |
| "Organization not allowlisted" | Preview access not granted | Contact Datadog support to request MCP server Preview access |

Data and permission issues

| Error | Cause | Solution |
|-------|-------|----------|
| No data returned | Insufficient permissions or wrong time range | Verify Application key permissions; try a broader time range |
| Tool not found | Tool belongs to a non-default toolset | Add the required toolset to the ?toolsets= parameter in the connector URL |
| Truncated trace data | Trace exceeds size limit | Large traces are truncated for context window efficiency; query specific spans instead |
| Write operation failed | Missing MCP Write permission | Add MCP Write permission to the Application key |
| Metric not found | Wrong metric name or no data in time range | Use search_datadog_metrics to discover available metric names |

Verify the connection

Test the server endpoint directly:

```
curl -I "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp" \
  -H "DD_API_KEY: <your_api_key>" \
  -H "DD_APPLICATION_KEY: <your_application_key>"
```

Expected response: 200 OK confirms authentication is working.

Re-authorize the integration

If you encounter persistent issues:

1. Navigate to Organization Settings > Application Keys in Datadog
2. Revoke the existing Application key
3. Create a new Application key with the required MCP Read / MCP Write permissions
4. Update the connector in the SRE Agent portal with the new key

Limitations

| Limitation | Details |
|------------|---------|
| Preview only | The Datadog MCP server is in Preview and not recommended for production use |
| Allowlisted organizations | Only organizations that have been allowlisted by Datadog can access the MCP server |
| Large trace truncation | Responses are optimized for LLM context windows; large traces may be truncated |
| Unstable API path | The endpoint URL contains /unstable/ indicating the API may change without notice |
| Toolset availability | Some toolsets may not be available depending on your Datadog plan and features enabled |
| Regional endpoints | You must use the endpoint matching your organization's region; cross-region queries are not supported |

Security considerations

How permissions work

- RBAC-scoped: All actions respect the RBAC permissions associated with the API and Application keys
- Key-based: Access is controlled through API key (organization-level) and Application key (user or service account-level)
- Permission granularity: MCP Read enables read operations; MCP Write enables mutating operations

Admin controls

Datadog administrators can:

- Create and revoke API and Application keys in Organization Settings
- Assign granular RBAC permissions (MCP Read, MCP Write) to Application keys
- Use service accounts to decouple access from individual user accounts
- Monitor MCP tool usage through the Datadog Audit Trail
- Scope Application keys to limit the blast radius of compromised credentials

The Datadog MCP server can read sensitive operational data including logs, metrics, and traces. Use service accounts with scoped Application keys, grant only the permissions your agent needs, and monitor the Audit Trail for unusual activity.

Related content

- Datadog MCP Server documentation
- Datadog API and Application keys
- Datadog RBAC permissions
- Datadog Audit Trail
- Datadog regional sites
- MCP integration overview
- Build a custom subagent
dbandaru
Feb 26, 2026 Place Apps on Azure Blog
Get started with Atlassian Rovo MCP server in Azure SRE Agent
Connect Azure SRE Agent to Jira, Confluence, Compass, and Jira Service Management using the official Atlassian Rovo MCP server.

Overview

The Atlassian Rovo MCP server is a cloud-hosted bridge between your Atlassian Cloud site and Azure SRE Agent. Once configured, it enables real-time interaction with Jira, Confluence, Compass, and Jira Service Management data through natural language. All actions respect your existing Atlassian user permissions.

The server supports API token authentication (Basic or Bearer) for headless or automated setups. Azure SRE Agent connects using Streamable-HTTP transport directly to the Atlassian-hosted endpoint.

Key capabilities

| Product | Capabilities |
|---------|--------------|
| Jira | Search issues with JQL, create/update tickets, add comments and worklogs, transition issues through workflows |
| Confluence | Search pages with CQL, create/update pages and live docs, manage inline and footer comments |
| Compass | Create/delete service components and relationships, manage custom fields, query dependencies |
| Jira Service Management | Query ops alerts, view on-call schedules, get team info, escalate alerts |
| Rovo Search | Natural language search across Jira and Confluence, fetch content by ARI |

[!NOTE] This is the official Atlassian-hosted MCP server at https://mcp.atlassian.com/v1/mcp. The server exposes 46+ tools across five product areas. Tool availability depends on authentication method and granted scopes.

Prerequisites

- Azure SRE Agent resource deployed in Azure
- Atlassian Cloud site with one or more of: Jira, Confluence, Compass, or Jira Service Management
- User account with appropriate permissions in the Atlassian products you want to access
- For API token auth: Organization admin must enable API token authentication in the Rovo MCP server settings

Step 1: Get your Atlassian credentials

Choose one of the two authentication methods below.
API token (Option A) is recommended for Azure SRE Agent because it enables headless configuration without browser-based flows.

Option A: Personal API token (recommended for Azure SRE Agent)

API token authentication allows headless configuration without browser-based OAuth flows, which makes it a good fit for Azure SRE Agent connectors.

Navigate to the API token page

1. Log in to your Atlassian account
2. Select your profile avatar in the top-right corner
3. Select Manage account
4. In the left sidebar, select Security
5. Under the API tokens section, you can manage your existing tokens

Alternatively, use this direct link that pre-selects all MCP scopes:

Direct URL: Create API token with all MCP scopes

Create the token

1. Navigate to the Atlassian API token creation page to create a token with all MCP scopes preselected
2. Optionally click Back to manually select only the scopes you need (see Available scopes)
3. Copy the generated token and note the email address associated with your Atlassian account
4. Base64-encode your credentials:

```bash
# Format: email:api_token
echo -n "your.email@example.com:YOUR_API_TOKEN_HERE" | base64
```

On Windows PowerShell:

```powershell
[Convert]::ToBase64String([Text.Encoding]::UTF8.GetBytes("your.email@example.com:YOUR_API_TOKEN_HERE"))
```

This produces a base64-encoded string you'll use in the connector configuration as the Authorization: Basic <value> header.

[!IMPORTANT] Store the API token securely. It cannot be viewed again after creation. If lost, generate a new token from the same API tokens page.

[!NOTE] API token authentication must be enabled by your organization admin. If you cannot create a token, ask your admin to enable API token authentication in the Rovo MCP server settings at admin.atlassian.com > Security > Rovo MCP server.
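The Base64 encoding step in this section can also be done in Python when a shell is not convenient. This is a minimal sketch using only the standard library; the function name `basic_auth_header` is hypothetical, and the email and token shown are placeholders.

```python
import base64


def basic_auth_header(email: str, api_token: str) -> str:
    """Build the Authorization header value from email:api_token (hypothetical helper)."""
    raw = f"{email}:{api_token}".encode("utf-8")
    return "Basic " + base64.b64encode(raw).decode("ascii")


if __name__ == "__main__":
    # Placeholder credentials; substitute your own email and API token.
    print(basic_auth_header("your.email@example.com", "YOUR_API_TOKEN_HERE"))
```

The returned string is the complete header value, prefix included, so it can be pasted as-is into the connector's Authorization header field.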
Available scopes The API token supports the following scope categories: Category Scopes Jira read:jira-work , write:jira-work , read:jira-user Confluence read:page:confluence , write:page:confluence , read:comment:confluence , write:comment:confluence , read:space:confluence , read:hierarchical-content:confluence , read:confluence-user , search:confluence Compass read:component:compass , write:component:compass JSM read:incident:jira-service-management , write:incident:jira-service-management , read:ops-alert:jira-service-management , write:ops-alert:jira-service-management , read:ops-config:jira-service-management , read:servicedesk-request Bitbucket read:repository:bitbucket , write:repository:bitbucket , read:pullrequest:bitbucket , write:pullrequest:bitbucket , read:pipeline:bitbucket , write:pipeline:bitbucket , read:user:bitbucket , read:workspace:bitbucket , admin:repository:bitbucket Platform read:me , read:account , search:rovo:mcp [!NOTE] Bitbucket scopes are available in the token, but Bitbucket tools are not yet listed on the official supported tools page. Bitbucket tool support may be added in a future update. Step 2: Add the MCP connector Connect the Atlassian Rovo MCP server to your SRE Agent using the portal. Using the Azure portal (API token auth) In Azure portal, navigate to your SRE Agent resource Select Builder > Connectors Select Add connector Select MCP server (User provided connector) and select Next Configure the connector: Field Value Name atlassian-rovo-mcp Connection type Streamable-HTTP URL https://mcp.atlassian.com/v1/mcp Authentication Custom headers Header Key Authorization Header Value Basic <your_base64_encoded_email_and_token> Select Next to review Select Add connector Step 3: Create an Atlassian subagent Create a specialized subagent to give the AI focused Atlassian expertise and better prompt responses. 
Navigate to Builder > Subagents Select Add subagent Paste the following YAML configuration: api_version: azuresre.ai/v1 kind: AgentConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: AtlassianRovoExpert display_name: Atlassian Rovo Expert system_prompt: | You are an Atlassian expert with access to Jira, Confluence, Compass, and Jira Service Management via the Atlassian Rovo MCP server. ## Capabilities ### Jira - Search issues using JQL (Jira Query Language) with `searchJiraIssuesUsingJql` - Create, update, and transition issues through workflows - Add comments, worklogs, and manage issue metadata - Look up user account IDs and project configurations ### Confluence - Search pages and content using CQL (Confluence Query Language) with `searchConfluenceUsingCql` - Create and update pages and live docs with Markdown content - Add inline and footer comments on pages - Navigate spaces and page hierarchies ### Compass - Create, query, and delete service components (services, libraries, applications) - Define and manage relationships between components - Manage custom field definitions and component metadata - View component activity events (deployments, alerts) ### Jira Service Management - Query ops alerts by ID, alias, or search criteria - View on-call schedules and current/next responders - Get team info including escalation policies and roles - Acknowledge, close, or escalate alerts ### Cross-Product - Use Rovo Search (`search`) for natural language queries across Jira and Confluence - Fetch specific content by Atlassian Resource Identifier (ARI) using `fetch` - Get current user info and list accessible cloud sites ## Best Practices When searching Jira: - Use JQL for precise queries: `project = "MYPROJ" AND status = "Open"` - Start with broad searches, then refine based on results - Use `currentUser()` for user-relative queries - Use `openSprints()` for active sprint work When searching Confluence: - Use CQL for structured searches: 
`space = "ENG" AND type = page` - Use Rovo Search for natural language queries when JQL/CQL isn't needed - Consider space keys to narrow results When creating content: - Confirm project/space/issue type with the user before creating - Use `getJiraIssueTypeMetaWithFields` to check required fields - Use `getConfluenceSpaces` to list available spaces When handling errors: - If access is denied, explain what permission is needed - Suggest the user contact their Atlassian administrator - For expired tokens, advise re-authentication mcp_connectors: - atlassian-rovo-mcp handoffs: [] Select Save [!NOTE] The mcp_connectors field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the Atlassian Rovo MCP server. Step 4: Add an Atlassian skill Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create an Atlassian skill to give your agent expertise in JQL, CQL, and Atlassian workflows. Navigate to Builder > Skills Select Add skill Paste the following skill configuration: api_version: azuresre.ai/v1 kind: SkillConfiguration metadata: owner: your-team@contoso.com version: "1.0.0" spec: name: atlassian_rovo display_name: Atlassian Rovo description: | Expertise in Atlassian Cloud products including Jira, Confluence, Compass, and Jira Service Management. Use for searching issues with JQL, creating and updating pages, managing service components, investigating ops alerts, and navigating Atlassian workspaces via the Rovo MCP server. instructions: | ## Overview Atlassian Cloud provides integrated tools for project tracking (Jira), documentation (Confluence), service catalog management (Compass), and incident management (Jira Service Management). The Atlassian Rovo MCP server enables natural language interaction with all four products. **Authentication:** OAuth 2.1 or API token (Basic/Bearer). All actions respect existing user permissions. 
## Searching Jira with JQL JQL (Jira Query Language) enables precise issue searches. Always use `searchJiraIssuesUsingJql` for structured queries. **Common JQL patterns:** ```jql # Open issues assigned to current user assignee = currentUser() AND status != Done # Bugs created in the last 7 days project = "MYPROJ" AND type = Bug AND created >= -7d # High-priority issues in active sprints project = "MYPROJ" AND priority in (High, Highest) AND sprint in openSprints() # Full-text search project = "MYPROJ" AND text ~ "payment error" # Issues updated recently updated >= -24h ORDER BY updated DESC ``` **JQL operators:** `=`, `!=`, `~` (contains), `in`, `>=`, `<=`, `NOT`, `AND`, `OR` **JQL functions:** `currentUser()`, `openSprints()`, `startOfDay()`, `endOfDay()`, `membersOf("group")` ## Searching Confluence with CQL CQL (Confluence Query Language) searches pages, blog posts, and attachments. Use `searchConfluenceUsingCql` for structured queries. **Common CQL patterns:** ```cql # Search by title title ~ "Architecture" # Search in specific space space = "ENG" AND type = page # Full-text content search text ~ "deployment pipeline" # Recently modified pages lastModified >= now("-7d") AND type = page # Pages by label label = "runbook" AND space = "SRE" ``` **CQL fields:** `title`, `text`, `space`, `type`, `label`, `creator`, `lastModified` ## Creating Jira Issues Follow this workflow: 1. `getVisibleJiraProjects` — list available projects 2. `getJiraProjectIssueTypesMetadata` — list issue types for the project 3. `getJiraIssueTypeMetaWithFields` — get required/optional fields 4. `createJiraIssue` — create the issue Common issue types: Story, Bug, Task, Epic, Sub-task. ## Creating Confluence Pages Pages support Markdown content: 1. `getConfluenceSpaces` — list available spaces 2. `getPagesInConfluenceSpace` — optionally find a parent page 3. 
`createConfluencePage` — create the page with space, title, and body ## Working with Compass Components Component types: SERVICE, LIBRARY, APPLICATION, CAPABILITY, CLOUD_RESOURCE, DATA_PIPELINE, MACHINE_LEARNING_MODEL, UI_ELEMENT, WEBSITE, OTHER. Relationship types: DEPENDS_ON, OTHER. ## Jira Service Management Operations For incident and alert management: - `getJsmOpsAlerts` — query alerts by ID, alias, or search - `updateJsmOpsAlert` — acknowledge, close, or escalate alerts - `getJsmOpsScheduleInfo` — view on-call schedules and responders - `getJsmOpsTeamInfo` — list teams with escalation policies ## Cross-Product Workflows - Use `search` (Rovo Search) for natural language queries across products - Use `fetch` with ARIs (Atlassian Resource Identifiers) for direct content retrieval - Use `getAccessibleAtlassianResources` to list cloud sites and get cloudIds ## Troubleshooting | Issue | Solution | |-------|----------| | JQL syntax error | Check field names; quote values with spaces | | CQL returns no results | Verify space key; try broader terms | | Cannot create issue | Verify "Create" permission in the project | | Cannot edit page | Verify "Edit" permission in the space | | OAuth expired | Re-invoke any tool to trigger fresh OAuth flow | | "Site admin must authorize" | Admin must complete initial 3LO consent | | cloudId errors | Use `getAccessibleAtlassianResources` to find correct cloudId | mcp_connectors: - atlassian-rovo-mcp Select Save Reference the skill in your subagent Update your subagent configuration to include the skill: spec: name: AtlassianRovoExpert skills: - atlassian_rovo mcp_connectors: - atlassian-rovo-mcp Step 5: Test the integration Open a new chat session with your SRE Agent Try these example prompts: Jira workflows Find all open bugs assigned to me in the PAYMENTS project Create a new story in project PLATFORM titled "Implement rate limiting for API gateway" Show me the available transitions for issue PLATFORM-1234 Add a comment to 
PLATFORM-1234: "Reviewed and approved for deployment" Log 2 hours of work on PLATFORM-1234 with description "Code review and testing" Confluence workflows Search Confluence for pages about "incident response runbooks" Show me the spaces I have access to Create a new Confluence page in the Engineering space titled "Q3 2025 Architecture Review" What pages are under the "Runbooks" parent page? Compass workflows List all service components in Compass Create a new service component called "payment-gateway" What components depend on the api-gateway service? Show me recent activity events for the auth-service component Jira Service Management workflows Show me active ops alerts from the last 24 hours Who is currently on-call for the platform-engineering schedule? Acknowledge alert with alias "high-cpu-prod-web-01" Get team info and escalation policies for the SRE team Cross-product workflows Search across Jira and Confluence for content related to "deployment pipeline" What Atlassian cloud sites do I have access to? 
Fetch the Confluence page linked to Jira issue PLATFORM-500 Available tools Jira tools (14 tools) Tool Description Required Scopes searchJiraIssuesUsingJql Search issues using a JQL query read:jira-work getJiraIssue Get issue details by ID or key read:jira-work createJiraIssue Create a new issue in a project write:jira-work editJiraIssue Update fields on an existing issue write:jira-work addCommentToJiraIssue Add a comment to an issue write:jira-work addWorklogToJiraIssue Add a time tracking worklog entry write:jira-work transitionJiraIssue Perform a workflow transition write:jira-work getTransitionsForJiraIssue List available workflow transitions read:jira-work getVisibleJiraProjects List projects the user can access read:jira-work getJiraProjectIssueTypesMetadata List issue types in a project read:jira-work getJiraIssueTypeMetaWithFields Get create-field metadata for a project and issue type read:jira-work getJiraIssueRemoteIssueLinks List remote links (e.g., Confluence pages) on an issue read:jira-work lookupJiraAccountId Find user account IDs by name or email read:jira-work Confluence tools (11 tools) Tool Description Required Scopes searchConfluenceUsingCql Search content using a CQL query search:confluence getConfluencePage Get page content by ID (as Markdown) read:page:confluence createConfluencePage Create a new page or live doc with Markdown body write:page:confluence updateConfluencePage Update an existing page (title, body, location) write:page:confluence getConfluenceSpaces List spaces by key, ID, type, status, or labels read:space:confluence getPagesInConfluenceSpace List pages in a space, filtered by title/status/type read:page:confluence getConfluencePageDescendants List descendant pages under a parent page read:hierarchical-content:confluence createConfluenceFooterComment Create a footer comment or reply on a page write:page:confluence createConfluenceInlineComment Create an inline comment tied to selected text write:page:confluence 
getConfluencePageFooterComments List footer comments on a page (as Markdown) read:comment:confluence getConfluencePageInlineComments List inline comments on a page read:comment:confluence Compass tools (13 tools) Tool Description Required Scopes getCompassComponents Search or list components read:component:compass getCompassComponent Get component details by ID read:component:compass createCompassComponent Create a service, library, or other component write:component:compass deleteCompassComponent Delete an existing component and its definitions write:component:compass createCompassComponentRelationship Create a relationship between two components write:component:compass deleteCompassComponentRelationship Remove a relationship between two components write:component:compass getCompassComponentActivityEvents List recent activity events (deployments, alerts) read:component:compass getCompassComponentLabels Get labels applied to a component read:component:compass getCompassComponentTypes List available component types read:component:compass getCompassComponentsOwnedByMyTeams List components owned by your teams read:component:compass getCompassCustomFieldDefinitions List custom field definitions read:component:compass createCompassCustomFieldDefinition Create a custom field definition write:component:compass deleteCompassCustomFieldDefinition Delete a custom field definition write:component:compass Jira Service Management tools (4 tools) [!NOTE] JSM tools only support authentication via API token. These tools are available only if API token authentication is enabled by your organization admin. 
Tool Description Required Scopes getJsmOpsAlerts Get alert by ID/alias or search by query and time window read:ops-alert:jira-service-management , read:ops-config:jira-service-management , read:jira-user getJsmOpsScheduleInfo List on-call schedules or get current/next responders read:ops-config:jira-service-management , read:jira-user getJsmOpsTeamInfo List ops teams with escalation policies and roles read:ops-config:jira-service-management , read:jira-user updateJsmOpsAlert Acknowledge, unacknowledge, close, or escalate an alert read:ops-alert:jira-service-management , write:ops-alert:jira-service-management Rovo / Shared platform tools (4 tools) Tool Description Required Scopes search Natural language search across Jira and Confluence (not JQL/CQL) search:rovo:mcp fetch Fetch content by Atlassian Resource Identifier (ARI) search:rovo:mcp atlassianUserInfo Get current user details (account ID) read:me getAccessibleAtlassianResources List accessible cloud sites and their cloudIds read:account , read:me Troubleshooting Authentication issues Error Cause Solution 401 Unauthorized Invalid or expired API token Generate a new token at id.atlassian.com 403 Forbidden Missing product permissions Verify you have access to the Atlassian product (Jira, Confluence, etc.) 
"Your site admin must authorize this app" First-time setup requires admin A site admin must complete initial 3LO consent "Your organization admin must authorize access from a domain" Domain not allowed Admin must add the domain in Rovo MCP server settings "You don't have permission to connect from this IP address" IP allowlisting enabled Admin must add your IP range to the allowlist API token auth fails Feature disabled by admin Admin must enable API token authentication Data and permission issues Error Cause Solution No data returned Wrong cloudId or expired session Use getAccessibleAtlassianResources to find the correct cloudId Cannot create issue Missing project permission Verify "Create" permission in the Jira project Cannot update page Missing space permission Verify "Edit" permission in the Confluence space Tool not available Missing scopes Re-create API token with required scopes Compass tools unavailable Scopes not available for API tokens Some Compass tools require OAuth 2.1 JSM tools not working API token auth disabled Admin must enable API token authentication Verify the connection Test the server endpoint directly: # Test with API token (Basic auth) curl -I https://mcp.atlassian.com/v1/mcp \ -H "Authorization: Basic <your_base64_encoded_credentials>" # Test with service account (Bearer auth) curl -I https://mcp.atlassian.com/v1/mcp \ -H "Authorization: Bearer <your_api_key>" Expected response: 200 OK confirms authentication is working. Re-authorize the integration If you encounter persistent issues: Go to id.atlassian.com/manage-profile/apps Find and revoke the MCP app authorization Generate a new API token or re-invoke a tool to trigger fresh OAuth flow Limitations Limitation Details Limited tool availability with API tokens Some tools (e.g., certain Compass tools) may not be available because required scopes aren't available for API tokens No bounded cloudId API tokens are not bound to a specific cloudId. 
Tools must explicitly pass the cloudId where needed No domain allowlist validation API token auth doesn't use OAuth redirect URIs, so domain allowlist checks cannot be performed Bitbucket tools Bitbucket scopes are available in the token, but Bitbucket tools are not yet listed as supported JSM requires API token Jira Service Management tools only work with API token authentication, not OAuth 2.1 Security considerations How permissions work User-scoped: All actions respect the authenticated user's existing Atlassian permissions Product-level: Access requires matching product permissions (Jira, Confluence, Compass) Session-based: OAuth tokens expire and require re-authentication; API tokens persist until revoked Admin controls Atlassian administrators can: - Enable or disable API token authentication in Rovo MCP server settings - Manage and revoke MCP app access from the Connected Apps list - Control which external domains can connect via domain allowlists - Monitor activity through Atlassian audit logs - Configure IP allowlisting for additional security [!IMPORTANT] MCP clients can perform actions in Jira, Confluence, and Compass with your existing permissions. Use least privilege, review high-impact changes before confirming, and monitor audit logs for unusual activity. See MCP Clients - Understanding security risks. Related content Atlassian Rovo MCP Server - Getting started Atlassian Rovo MCP Server - Supported tools Atlassian Rovo MCP Server - API token authentication Atlassian Rovo MCP Server - OAuth 2.1 configuration Control Atlassian Rovo MCP Server settings MCP Clients - Understanding security risks MCP integration overview Build a custom subagent
dbandaru
Feb 26, 2026 Place Apps on Azure Blog
Unifying Scattered Observability Data from Dynatrace + Azure for Self-Healing with SRE Agent
What if your deployments could fix themselves?

The Deployment Remediation Challenge

Modern operations teams face a recurring nightmare:

- A deployment ships at 9 AM
- Errors spike at 9:15 AM
- By the time you correlate logs, identify the bad revision, and execute a rollback, it's 10:30 AM
- Your users felt 75 minutes of degraded experience

The data to detect and fix this existed the entire time, but it was scattered across clouds and platforms:

- Error logs and traces → Dynatrace (third-party observability cloud)
- Deployment history and revisions → Azure Container Apps API
- Resource health and metrics → Azure Monitor
- Rollback commands → Azure CLI

Your observability data lives in one cloud. Your deployment data lives in another. Stitching together log analysis from Dynatrace with deployment correlation from Azure, and then executing remediation, required a human to manually bridge these silos.

What if an AI agent could unify data from third-party observability platforms with Azure deployment history and act on it automatically, every week, before users even notice?

Enter SRE Agent + Model Context Protocol (MCP) + Subagents

Azure SRE Agent doesn't just work with Azure. Using the Model Context Protocol (MCP), you can connect external observability platforms like Dynatrace directly to your agent. Combined with subagents for specialized expertise and scheduled tasks for automation, you can build an automated deployment remediation system.
Here's what I built and configured for my Azure Container Apps environment inside SRE Agent:

| Component | Purpose |
|-----------|---------|
| Dynatrace MCP Connector | Connect to Dynatrace's MCP gateway for log queries via DQL |
| 'Dynatrace' Subagent | Log analysis specialist that executes DQL queries and identifies root causes |
| 'Remediation' Subagent | Deployment remediation specialist that correlates errors with deployments and executes rollbacks |
| Scheduled Task | Weekly Monday 9:30 AM health check for the 'octopets-prod-api' Container App |

Figure: The subagent workflow in SRE Agent Builder. 'OctopetsScheduledTask' triggers 'RemediationSubagent' (12 tools), which hands off to 'DynatraceSubagent' (3 MCP tools) for log analysis.

How I Set It Up: Step by Step

Step 1: Connect Dynatrace via MCP

SRE Agent supports the Model Context Protocol (MCP) for connecting external data sources. Dynatrace exposes an MCP gateway that provides access to its APIs as first-class tools.

Connection configuration:

    {
      "name": "dynatrace-mcp-connector",
      "dataConnectorType": "Mcp",
      "dataSource": "Endpoint=https://<your-tenant>.live.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp;AuthType=BearerToken;BearerToken=<your-api-token>"
    }

Once connected, SRE Agent automatically discovers Dynatrace tools.

💡 Tip: When creating your Dynatrace API token, grant the `entities.read`, `events.read`, and `metrics.read` scopes for comprehensive access.

Step 2: Build Specialized Subagents

Generic agents are good. Specialized agents are better. I created two subagents that work together in a coordinated workflow: one for Dynatrace log analysis, the other for deployment remediation.

DynatraceSubagent

This subagent is the log analysis specialist. It uses the Dynatrace MCP tools to execute DQL queries and identify root causes.
Key capabilities:

- Executes DQL queries via MCP tools (`create-dql`, `execute-dql`, `explain-dql`)
- Fetches 5xx error counts, request volumes, and spike detection
- Returns consolidated analysis with root cause, affected services, and error patterns

👉 View full DynatraceSubagent configuration here

RemediationSubagent

This is the deployment remediation specialist. It correlates Dynatrace log analysis with Azure Container Apps deployment history, generates correlation charts, and executes rollbacks when confidence is high.

Key capabilities:

- Retrieves Container Apps revision history (`GetDeploymentTimes`, `ListRevisions`)
- Generates correlation charts (`PlotTimeSeriesData`, `PlotBarChart`, `PlotAreaChartWithCorrelation`)
- Computes confidence score (0-100%) for deployment causation
- Executes rollback and traffic shift when confidence > 70%

👉 View full RemediationSubagent configuration here

The power of specialization: Each agent focuses on its domain. DynatraceSubagent handles log analysis, and RemediationSubagent handles deployment correlation and rollback. When the workflow runs, RemediationSubagent hands off to DynatraceSubagent (bi-directional handoff) for analysis, gets the findings back, and continues with remediation. Simple delegation, not a single monolithic agent trying to do everything.

Step 3: Create the Weekly Scheduled Task

Now the automation. I configured a scheduled task that runs every Monday at 9:30 AM to check whether deployments in the last 4 hours caused any issues, and to remediate automatically if needed.

Scheduled task configuration:

| Setting | Value |
|---------|-------|
| Task Name | OctopetsScheduledTask |
| Frequency | Weekly |
| Day of Week | Monday |
| Time | 9:30 AM |
| Response Subagent | RemediationSubagent |

Figure: Configuring the OctopetsScheduledTask in the SRE Agent portal

The key insight: the scheduled task is just a coordinator. It immediately hands off to the RemediationSubagent, which orchestrates the entire workflow including handoffs to DynatraceSubagent.
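To make the confidence gate from Step 2 concrete ("Executes rollback and traffic shift when confidence > 70%"), here is a small Python sketch of the decision. It is not the RemediationSubagent's actual implementation: the names `CorrelationFinding`, `decide_action`, and the `"report_only"` fallback are hypothetical, and the article does not say what the agent does below the threshold.

```python
from dataclasses import dataclass

# Threshold taken from the article ("confidence > 70%").
ROLLBACK_THRESHOLD_PERCENT = 70


@dataclass(frozen=True)
class CorrelationFinding:
    """Hypothetical summary of one deployment-vs-error correlation."""
    revision: str    # Container Apps revision suspected of causing errors
    confidence: int  # 0-100 score that the deployment caused the errors


def decide_action(finding: CorrelationFinding,
                  threshold: int = ROLLBACK_THRESHOLD_PERCENT) -> str:
    """Return "rollback" only when confidence is strictly above the threshold."""
    if not 0 <= finding.confidence <= 100:
        raise ValueError("confidence must be between 0 and 100")
    # Assumption: at or below the gate, the agent only reports its findings.
    return "rollback" if finding.confidence > threshold else "report_only"


if __name__ == "__main__":
    print(decide_action(CorrelationFinding("rev-0002", 95)))  # rollback
    print(decide_action(CorrelationFinding("rev-0002", 70)))  # report_only
```

Using a strict comparison means a score of exactly 70 does not trigger a rollback, which matches the "> 70%" wording above.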
Step 4: See It In Action

Here's what happens when the scheduled task runs:

Figure: The scheduled task triggering and initiating Dynatrace analysis for octopets-prod-api

The DynatraceSubagent analyzes the logs and identifies the root cause:

Figure: Executing DQL queries and returning consolidated log analysis

The RemediationSubagent then generates correlation charts.

Finally, with a 95% confidence score, SRE Agent executes the rollback autonomously:

Figure: Executing rollback and traffic shift autonomously

The agent detected the bad deployment, generated visual evidence, and automatically shifted 100% traffic to the last known working revision, all without human intervention.

Why This Matters

| Before | After |
|--------|-------|
| Manually check Dynatrace after incidents | Automated DQL queries via MCP |
| Stitch together logs + deployments manually | Subagents correlate data automatically |
| Rollback requires human decision + execution | Confidence-based auto-remediation |
| 75+ minutes from deployment to rollback | Under 5 minutes with autonomous workflow |
| Reactive incident response | Proactive weekly health checks |

Try It Yourself

1. Connect your observability tool via MCP (Dynatrace, Datadog, New Relic, Prometheus, or any tool with an MCP gateway)
2. Build a log analysis subagent that knows how to query your observability data
3. Build a remediation subagent that can correlate logs with deployments and execute fixes
4. Wire them together with handoffs so the subagents can delegate log analysis
5. Create a scheduled task to trigger the workflow automatically

Learn More

- Azure SRE Agent documentation
- Model Context Protocol (MCP) integration guide
- Building subagents for specialized workflows
- Scheduled tasks and automation
- SRE Agent Community
- Azure SRE Agent pricing
- SRE Agent Blogs
Vineela-Suri
Feb 25, 2026 Place Apps on Azure Blog
Get started with Elasticsearch MCP server in Azure SRE Agent
Overview The Elasticsearch MCP server enables Azure SRE Agent to interact with your Elasticsearch clusters using natural language. Query your logs, analyze metrics, check cluster health, and troubleshoot issues conversationally. This integration uses Elastic's Agent Builder MCP endpoint, the recommended approach for Elastic 9.2.0+ and Elasticsearch Serverless projects. Key capabilities Capability Description Search Execute search queries using Elasticsearch Query DSL ES|QL Run ES|QL queries for data exploration Mappings Get field mappings for indices Cluster health Check shard information and cluster status Index management List available indices Prerequisites Azure SRE Agent resource at sre.azure.com Elasticsearch cluster (Elastic Cloud or self-hosted, version 9.2.0 or higher) Kibana with Agent Builder enabled (Elastic 9.2.0+ or Serverless) API key with appropriate permissions Step 1: Get your Elasticsearch credentials Log in to Elastic Cloud or your self-hosted Kibana at https://{your-kibana-url} Navigate to Management > API Keys Click Create API key Provide a name (e.g., azure-sre-agent-mcp ) Set appropriate permissions (at minimum, read access to indices you want to query) Click Create API key and copy the encoded API key Note your Kibana URL (e.g., https://my-deployment.kb.us-east-1.aws.elastic.cloud ) Step 2: Add the MCP connector Navigate to your Azure SRE Agent at sre.azure.com Select your agent from the list In the left navigation, expand Builder > Connectors Select Add connector In "Choose a connector", select MCP server (User provided connector) Click Next and configure: Field Value Name elasticsearch-mcp Connection type Streamable-HTTP URL https://{KIBANA_URL}/api/agent_builder/mcp Authentication method Custom headers Header name Authorization Header value ApiKey {your-api-key} 7. 
Click Next to review, then Add to save Equivalent mcp.json configuration For reference, the equivalent mcp.json configuration: { "mcpServers": { "elasticsearch-mcp": { "url": "https://{KIBANA_URL}/api/agent_builder/mcp", "transport": "streamable-http", "headers": { "Authorization": "ApiKey {your-api-key}" } } } } Step 3: Create a subagent and add tools In the left navigation, select Builder > Subagent builder Click + Create Switch to the YAML tab and paste this configuration: api_version: azuresre.ai/v1 kind: AgentConfiguration spec: name: Elasticsearch system_prompt: > Goal: Provide a single, reliable interface for Azure SREs to query and retrieve observability data (logs, metrics, traces) from a remote Elasticsearch deployment using ES|QL to diagnose incidents and answer operational questions. Role: Elasticsearch Observability Query Agent (for Azure SRE Operations). Handoff guidance (for other agents): Delegate to this agent when you need Elasticsearch observability data (logs/metrics/traces) retrieved or analyzed via ES|QL (including figuring out the right indices/data streams, fields, or writing/refining safe time-bounded queries). Do not delegate for remediation/changes outside querying/analysis. Capabilities: - Discoverability: list supported operations (connectivity/test, list indices/data streams, mappings/field discovery, sample documents). - Data access: identify relevant indices/data streams from incident context; request permission to enumerate when needed. - Query authoring: write ES|QL for time-bounded log/metric/trace retrieval, filtering, aggregation, grouping, sorting, limits. - Query iteration: refine queries based on results/errors; explain changes. Operating guidelines: - Ask only for the minimum required context when missing: time range (UTC), environment/cluster, service/app name, and any identifiers (host, pod, trace.id, correlation id). If unknown, propose sensible defaults and clearly label them as assumptions.
- Prefer safe, bounded queries: always include explicit time filters and LIMIT; avoid unbounded scans. - If index/data stream is unknown, first propose likely patterns and/or request permission to enumerate indices/data streams. - If ES|QL is unsupported in the target, propose an equivalent query approach supported by the deployment and state the assumption. - Do not fabricate index names, field names, mappings, or results. If you must assume, label it and ask the user to confirm. Constraints: - No destructive actions: never modify, delete, or reindex data. - Treat endpoints/credentials as sensitive: request only if necessary; never echo secrets. Output format (default): 1) Intent (1–2 lines) 2) ES|QL query (fenced code block) 3) What to look for in results 4) Optional next-step query Interaction rule: - When you ask the user a question to proceed, stop and end your turn immediately after the question. - Do not repeat the same question in later turns; instead, acknowledge what was answered and ask only what remains. Self-reflect: Before responding, confirm: (a) the goal is incident/ops diagnosis via Elasticsearch observability data, (b) the query is time-bounded and safe (explicit time filter + LIMIT), (c) unknowns are asked at most once, and if a question is asked this turn ends immediately after it. tools: - Elasticsearch_platform_core_execute_esql - Elasticsearch_platform_core_generate_esql - Elasticsearch_platform_core_get_document_by_id - Elasticsearch_platform_core_get_index_mapping - Elasticsearch_platform_core_index_explorer - Elasticsearch_platform_core_list_indices - Elasticsearch_platform_core_search handoff_description: >- Delegate to this agent when you need Elasticsearch observability data (logs/metrics/traces) retrieved or analyzed via ES|QL (including figuring out the right indices/data streams, fields, or writing/refining safe time-bounded queries). Do not delegate for remediation/changes outside querying/analysis. 
agent_type: Autonomous enable_skills: false 4. Click Create to save the subagent CRITICAL: You must add tools to the subagent! Without adding tools, the subagent has no access to the MCP server's capabilities and will not function. After creating the subagent, select it from the list to open the editor Navigate to the Tools tab Click + Add tools In the tool picker, find the elasticsearch-mcp connector tools: list_indices get_mappings search esql get_shards Select all the tools you want the subagent to use Click Add to attach the tools to the subagent Click Save to finalize the subagent configuration Step 4: Create an Elasticsearch skill (optional) Skills provide a reusable way to package Elasticsearch expertise that any agent can invoke on demand. Unlike subagents (which run as autonomous agents), skills are callable capabilities with structured instructions. Navigate to your SRE Agent at sre.azure.com Select your agent, then go to Builder > Skills Click Create (use the dropdown arrow) > Skill Configure the skill: Field Value Name elasticsearch_observability Description Query and analyze Elasticsearch data including logs, indices, mappings, and cluster health using ES|QL 5. In the SKILL.md editor, paste the following content: --- name: elasticsearch_observability description: Query and analyze Elasticsearch data including logs, indices, mappings, and cluster health using ES|QL --- # Elasticsearch Observability Skill You have access to the Elasticsearch MCP server tools for querying and analyzing data from Elasticsearch clusters. ## Available Tools | Tool | Purpose | |------|---------| | **list_indices** | List all available Elasticsearch indices | | **get_mappings** | Get field mappings for a specific index | | **search** | Perform searches using Elasticsearch Query DSL | | **esql** | Execute ES|QL queries for data exploration | | **get_shards** | Get shard information and cluster health status | ## Workflow 1. 
**Understand the request** — Determine what Elasticsearch data the user needs 2. **Discover indices** — Use `list_indices` to find relevant data sources 3. **Check mappings** — Use `get_mappings` to understand available fields 4. **Query data** — Use `search` (Query DSL) or `esql` (ES|QL) to retrieve results 5. **Check health** — Use `get_shards` for cluster and index health information 6. **Analyze results** — Summarize findings with actionable recommendations ## Best Practices - Always start by listing indices to discover available data sources - Check field mappings before writing queries to ensure correct field names - Use ES|QL for complex aggregations and data exploration - Use Query DSL search for precise filtering and full-text search - Start with smaller timeframes and add LIMIT to optimize query performance - Prefer safe, bounded queries — include explicit time filters - Do not fabricate index names or field names; discover them first ## Example Prompts - "List all indices in my Elasticsearch cluster" - "What fields are available in the logs-* index?" - "Search for errors in the last hour across all log indices" - "Run an ES|QL query to find the top 10 error types" - "Show me shard information and cluster health" 6. Click Create to save the skill 💡 TIP: Skills are ideal when you want any agent (including the meta agent) to be able to invoke Elasticsearch capabilities without routing to a dedicated subagent. Use the subagent approach when you want a specialized, autonomous Elasticsearch query expert. Step 5: Test the integration Open a new chat session with your Azure SRE Agent Try these example prompts: Prompt What it tests "List all indices in my Elasticsearch cluster" list_indices tool "What are the mappings for the logs-* index?" 
get_mappings tool "Search for errors in the last hour across all logs indices" search tool "Run an ES|QL query to find the top 10 error types" esql tool "Show me shard information for my cluster" get_shards tool Available tools Tool Description list_indices List all available Elasticsearch indices get_mappings Get field mappings for a specific Elasticsearch index search Perform an Elasticsearch search with the provided Query DSL esql Perform an ES|QL query get_shards Get shard information for all or specific indices Troubleshooting Issue Solution Subagent doesn't have Elasticsearch tools You MUST add tools to the subagent after creating it! Go to subagent > Tools tab > Add tools > select elasticsearch-mcp tools Connection refused Verify Kibana URL is correct and accessible from Azure 401 Unauthorized Check API key is valid and has proper permissions 403 Forbidden Ensure Agent Builder is enabled in your Elastic deployment Tools not appearing Wait a few seconds after adding connector, then refresh SSL/TLS errors Ensure your Kibana URL uses HTTPS Related content Elasticsearch MCP Server (GitHub) Elastic Agent Builder Documentation Elastic Agent Builder MCP Endpoint MCP integration overview
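The subagent prompt and the skill above both ask for time-bounded queries with an explicit LIMIT. As a rough illustration, the sketch below builds such an ES|QL string and runs a simple heuristic check before a query would be sent. The helper names `build_bounded_esql` and `is_bounded`, the `@timestamp` field, and the `logs-*` pattern are assumptions for illustration; they are not part of the Elasticsearch MCP tools, and your deployment's field names may differ.

```python
import re


def build_bounded_esql(index_pattern: str, minutes: int, limit: int = 100) -> str:
    """Return an ES|QL query restricted to a recent time window and a row cap."""
    if minutes <= 0 or limit <= 0:
        raise ValueError("minutes and limit must be positive")
    return (
        f"FROM {index_pattern} "
        f"| WHERE @timestamp >= NOW() - {minutes} minutes "
        f"| SORT @timestamp DESC "
        f"| LIMIT {limit}"
    )


def is_bounded(query: str) -> bool:
    """Heuristic pre-flight check: a WHERE clause on @timestamp plus a LIMIT."""
    has_time_filter = re.search(
        r"\|\s*WHERE\b[^|]*@timestamp", query, re.IGNORECASE
    ) is not None
    has_limit = re.search(r"\|\s*LIMIT\s+\d+", query, re.IGNORECASE) is not None
    return has_time_filter and has_limit
```

The check is deliberately conservative string matching, not an ES|QL parser, so treat a passing result as a hint rather than a guarantee.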
dbandaru
Feb 25, 2026 Place Apps on Azure Blog
Get started with Dynatrace MCP server in Azure SRE Agent
Overview
Dynatrace provides a hosted MCP server that enables Azure SRE Agent to interact with the Dynatrace observability platform via SSE (Server-Sent Events) transport. You can query logs, investigate problems, analyze security vulnerabilities, execute DQL (Dynatrace Query Language), and generate timeseries forecasts directly from your SRE Agent conversations.

Key capabilities

| Capability | Description |
|------------|-------------|
| Create DQL query | Generate DQL queries from natural language prompts |
| Explain DQL query | Get plain English explanations of DQL queries |
| Ask Dynatrace | Ask general Dynatrace questions (workflows, alerts, etc.) |
| Run DQL query | Execute DQL and get results (up to 1000 records) |
| Query problems | List active or closed Davis Problems |
| Get problem by ID | Get detailed problem information |
| Get vulnerabilities | List open security vulnerabilities by risk level |
| Timeseries forecast | Predict future values using statistical models |

NOTE: This is the official Dynatrace-hosted MCP server. For the community OSS version, see the alternate configuration section.

Prerequisites
- Azure SRE Agent resource
- Dynatrace Platform account (SaaS)
- Platform Token or OAuth Client with required scopes
- Your Dynatrace environment URL (e.g., https://abc12345.apps.dynatrace.com)

Step 1: Get your Dynatrace credentials
Create a Platform Token
1. Go to My Platform Tokens
2. Click + Create token
3. Add the token name, expiration date, and required scopes (see below).
For more details on creating platform tokens click here Copy the generated token Required scopes Tool Required Scopes MCP Gateway access mcp-gateway:servers:invoke , mcp-gateway:servers:read Create DQL query davis-copilot:nl2dql:execute Explain DQL query davis-copilot:dql2nl:execute Ask Dynatrace davis-copilot:conversations:execute Run DQL query storage:buckets:read Query problems storage:buckets:read , storage:events:read Get vulnerabilities storage:security.events:read Timeseries forecast davis:analyzers:read , davis:analyzers:execute IMPORTANT: The mcp-gateway:servers:invoke and mcp-gateway:servers:read scopes are required for SSE transport to the Dynatrace-hosted MCP server. Step 2: Add the MCP connector Navigate to your Azure SRE Agent at sre.azure.com Select your agent from the list In the left navigation, expand Builder > Connectors Select Add connector In "Choose a connector", select MCP server (User provided connector) Click Next and configure: For Streamable-HTTP connection (Dynatrace-hosted MCP): Field Value Name dynatrace-mcp Connection type Streamable-HTTP URL https://abc12345.apps.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp Authentication method Bearer token Token Your Dynatrace Platform Token TIP: Replace abc12345 with your Dynatrace tenant name from your environment URL. 
Click Next to review, then Add to save Equivalent mcp.json configuration For reference, here's the equivalent configuration in mcp.json format (used by VS Code and other MCP clients): { "mcpServers": { "dynatrace-mcp": { "url": "https://abc12345.apps.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp", "transport": "streamable-http", "headers": { "Authorization": "Bearer YOUR_PLATFORM_TOKEN" } } } } Step 3: Create a Dynatrace subagent Create a specialized agent focused on Dynatrace observability: In the left navigation, select Builder > Subagent builder Click + Create Switch to the YAML tab and paste this configuration: api_version: azuresre.ai/v1 kind: AgentConfiguration spec: name: DynatraceExpert system_prompt: | You are a Dynatrace observability expert with access to real-time monitoring data via the Dynatrace MCP server. ## Capabilities You can help users with: - Creating DQL queries from natural language descriptions - Explaining existing DQL queries in plain English - Running DQL queries to fetch logs, events, and metrics - Investigating active and closed Davis Problems - Analyzing security vulnerabilities by risk level - Getting Kubernetes cluster events - Generating timeseries forecasts ## Best Practices When working with DQL: - Use the Create DQL query tool to generate queries from natural language - Use the Explain DQL query tool to help users understand complex queries - Start with smaller timeframes (last 1h or 12h) to optimize query performance - Always explain what data your queries will return When investigating problems: - First query problems to understand the active issue landscape - Get detailed problem info including root cause analysis - Correlate with related logs and events agent_type: Autonomous enable_skills: true Click Create to save the subagent Click on the newly created subagent named "DynatraceExpert" and scroll down to the Tools section Under Tools select "Choose Tools" Filter for the "MCP Tool" and select 
your Dynatrace MCP Server Tools Click Save Step 4: Create a Dynatrace skill (optional) Skills provide a reusable way to package Dynatrace expertise that any agent can invoke on demand. Unlike subagents (which run as autonomous agents), skills are callable capabilities with structured instructions. Navigate to your SRE Agent at sre.azure.com Select your agent, then go to Builder > Skills Click Create (use the dropdown arrow) > Skill Configure the skill: Field Value Name dynatrace_observability Description Query Dynatrace observability data including logs, metrics, problems, and vulnerabilities using DQL 5. In the SKILL.md editor, paste the following content: --- name: dynatrace_observability description: Query Dynatrace observability data including logs, metrics, problems, and vulnerabilities using DQL --- # Dynatrace Observability Skill You have access to the Dynatrace MCP server tools for querying observability data from Dynatrace environments. ## Available Tools | Tool | Purpose | |------|---------| | **Create DQL query** | Generate DQL queries from natural language descriptions | | **Explain DQL query** | Get plain English explanations of existing DQL queries | | **Ask Dynatrace** | Ask general questions about Dynatrace workflows, alerts, and configuration | | **Run DQL query** | Execute DQL queries and return results (up to 1000 records) | | **Query problems** | List active or closed Davis Problems with filtering | | **Get problem by ID** | Get detailed problem information including root cause | | **Get vulnerabilities** | List open security vulnerabilities by risk level | | **Get K8s cluster events** | Get Kubernetes events for cluster troubleshooting | | **Timeseries forecast** | Predict future metric values using statistical models | ## Workflow 1. **Understand the request** — Determine what observability data the user needs 2. **Use the right tool** — Select the appropriate Dynatrace tool for the task 3. 
**Query with DQL** — Use "Create DQL query" for natural language to DQL conversion, then "Run DQL query" to execute 4. **Investigate problems** — Use "Query problems" to list issues, then "Get problem by ID" for details 5. **Analyze results** — Summarize findings with actionable recommendations ## Best Practices - Start with smaller timeframes (last 1h or 12h) to optimize query performance - Use "Create DQL query" to generate queries from natural language before running them - When investigating problems, correlate with related logs and events using DQL - For security reviews, check vulnerabilities by risk level (CRITICAL, HIGH, MEDIUM, LOW) - Always explain the data returned and suggest next steps 6. Click Create to save the skill 💡 TIP: Skills are ideal when you want any agent (including the meta agent) to be able to invoke Dynatrace capabilities without routing to a dedicated subagent. Use the subagent approach when you want a specialized, autonomous Dynatrace expert. Step 5: Test the integration Open a new chat session with your SRE Agent Try these example prompts: DQL queries Create a DQL query to find all error logs from the last hour Explain this DQL query: fetch logs | filter loglevel == "ERROR" | limit 10 Run a DQL query to show me the top 10 slowest requests in the last 24 hours Problem investigation List active problems in my Dynatrace environment Get details for problem P-12345 Query problems from the last 7 days that are now closed Available tools Tool Description Required Scopes Create DQL query Generate DQL from natural language davis-copilot:nl2dql:execute Explain DQL query Get plain English explanation of DQL davis-copilot:dql2nl:execute Ask Dynatrace General Dynatrace questions and guidance davis-copilot:conversations:execute Run DQL query Execute DQL and get results (max 1000 records) storage:buckets:read Query problems List active or closed Davis Problems storage:buckets:read , storage:events:read Get problem by ID Get detailed problem 
information storage:buckets:read , storage:events:read Get vulnerabilities List open security vulnerabilities by risk storage:security.events:read Get K8s cluster events Get events for Kubernetes clusters storage:buckets:read Timeseries forecast Predict future values using statistical models davis:analyzers:read , davis:analyzers:execute

Troubleshooting
Connection issues

| Error | Cause | Solution |
|-------|-------|----------|
| 401 Unauthorized | Invalid or expired token | Generate a new Platform Token |
| 403 Forbidden | Missing scopes | Add mcp-gateway:servers:invoke and mcp-gateway:servers:read scopes |
| Could not connect | Wrong tenant URL | Verify your tenant name in the URL |
| Timeout | Network issues | Check network access to *.apps.dynatrace.com |

Verify your token
Test your token with this curl command (the header values must be quoted so the shell passes each header as a single argument):
curl -X GET "https://abc12345.apps.dynatrace.com/platform/management/v1/environment" -H "Authorization: Bearer YOUR_PLATFORM_TOKEN" -H "accept: application/json"

Alternate: OSS MCP server
Dynatrace also provides an open-source MCP server with additional capabilities like entity management, workflows, and document creation. This version uses Stdio transport and requires Node.js.
NOTE: The OSS MCP server is community-supported. For help, use GitHub Issues.
Setup in Azure SRE Agent Navigate to Builder > Connectors > Add connector Select MCP server and click Next Configure the Stdio connection: Field Value Name dynatrace-mcp-oss Connection type Stdio Command npx Arguments -y , @dynatrace-oss/dynatrace-mcp-server@latest Environment variables See table below Environment variables: Key Value DT_ENVIRONMENT Your Dynatrace environment URL DT_PLATFORM_TOKEN Your Platform Token with required scopes DT_GRAIL_QUERY_BUDGET_GB 100 (optional, limits query costs) Click Next to review, then Add to save Equivalent mcp.json configuration { "mcpServers": { "dynatrace-mcp-oss": { "command": "npx", "args": ["-y", "@dynatrace-oss/dynatrace-mcp-server@latest"], "env": { "DT_ENVIRONMENT": "https://abc12345.apps.dynatrace.com", "DT_PLATFORM_TOKEN": "YOUR_PLATFORM_TOKEN", "DT_GRAIL_QUERY_BUDGET_GB": "100" } } } } OSS server scopes Feature Required Scopes Read logs storage:logs:read Read metrics storage:metrics:read Read entities storage:entities:read Workflows automation:workflows:read , automation:workflows:write Documents document:documents:read , document:documents:write Slack app-settings:objects:read Related content Dynatrace MCP documentation Dynatrace MCP Server (OSS) Dynatrace Platform Token documentation Dynatrace Query Language (DQL)
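For scripted checks, the curl-based token test from the Troubleshooting section can be approximated with Python's standard library. This is a minimal sketch: the helper names are hypothetical, and the tenant `abc12345` and the token value are placeholders, as in the rest of this guide. Only HTTP-level rejections are mapped to `False`; network failures are left to raise so they are not mistaken for a bad token.

```python
import urllib.error
import urllib.request


def mcp_gateway_url(tenant: str) -> str:
    """Build the hosted MCP endpoint URL shown in Step 2 for a tenant name."""
    return (
        f"https://{tenant}.apps.dynatrace.com"
        "/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp"
    )


def build_token_check(tenant: str, token: str) -> urllib.request.Request:
    """Prepare (but do not send) the GET request used to verify a Platform Token."""
    url = f"https://{tenant}.apps.dynatrace.com/platform/management/v1/environment"
    headers = {"Authorization": f"Bearer {token}", "accept": "application/json"}
    return urllib.request.Request(url, headers=headers, method="GET")


def token_is_accepted(request: urllib.request.Request) -> bool:
    """Send the request: True on HTTP 200, False on an HTTP error such as 401 or 403."""
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return response.status == 200
    except urllib.error.HTTPError:
        return False
```

Typical use would be `token_is_accepted(build_token_check("abc12345", token))`, with the token read from a secret store or environment variable rather than hard-coded.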
dbandaru
Feb 25, 2026 Place Apps on Azure Blog
Connect Azure SRE Agent to ServiceNow: End-to-End Incident Response
🎯 What You'll Achieve In this tutorial, you'll: Connect Azure SRE Agent to ServiceNow as your incident management platform Create a test incident in ServiceNow Watch the AI agent automatically pick up, investigate, and resolve the incident See the agent write triage findings and resolution notes back to ServiceNow Time to complete: ~10 minutes 🎬 The End Result Before we dive in, here's what the end result looks like: ServiceNow Incident - Resolved by Azure SRE Agent The Azure SRE Agent: ✅ Detected the incident from ServiceNow ✅ Acknowledged and began triage automatically ✅ Investigated AKS cluster memory utilization ✅ Documented findings in work notes ✅ Resolved the incident with detailed root cause analysis 📋 Prerequisites A ServiceNow instance (Developer, PDI, or Enterprise) Administrator access to ServiceNow An Azure SRE Agent deployed in your Azure subscription 💡 Don't have a ServiceNow instance? Get a free Personal Developer Instance (PDI) at developer.servicenow.com 🔧 Step 1: Gather Your ServiceNow Credentials To connect Azure SRE Agent to ServiceNow, you need three pieces of information: Component Where to Find It ServiceNow Endpoint Browser address bar when logged into ServiceNow (format: https://your-instance.service-now.com ) Username Click your profile avatar → Profile → User ID Password Your ServiceNow login password Finding Your Instance URL Your ServiceNow instance URL is visible in your browser's address bar when logged in: https://{your-instance-name}.service-now.com Finding Your Username Click your profile avatar in the top-right corner of ServiceNow Click Profile Your User ID is your username ⚙️ Step 2: Connect SRE Agent to ServiceNow Navigate to Your SRE Agent Open the Azure Portal Search for "Azure SRE Agent" in the search bar Click on Azure SRE Agent (Preview) in the results Select your agent from the list Configure the Incident Platform In the left navigation, expand Settings Click Incident platform Click the Incident platform dropdown Select 
ServiceNow Here's what the ServiceNow configuration form looks like: Enter Your ServiceNow Credentials Field Value ServiceNow endpoint Your ServiceNow instance URL (from Step 1) Username Your ServiceNow username (from Step 1) Password Your ServiceNow password Quickstart response plan ✓ Enable this for automatic investigation Save and Verify Click the Save button Wait for validation to complete Look for: "ServiceNow is connected." with a green checkmark 🚨 Step 3: Create a Test Incident in ServiceNow Now let's test the integration by creating an incident in ServiceNow. Navigate to Create Incident In ServiceNow, click All in the left navigation Search for "Incident" Click Incident → Create New Fill in the Incident Details Field Value Caller Select any user (e.g., System Administrator) Short description [SRE Agent Test] AKS Cluster memory pressure detected in production environment Impact 2 - Medium Submit the Incident Click Submit to create the incident. Note the incident number that's assigned. 🤖 Step 4: Watch SRE Agent Investigate Check the SRE Agent Portal Return to the Azure Portal Open your SRE Agent Click Activities → Incidents Within seconds, you should see your ServiceNow incident appear! Observe the Autonomous Investigation Click on the incident to see the SRE Agent's investigation in action: The agent automatically: 🔔 Acknowledged the incident 📋 Created a triage plan with clear steps 🔍 Identified AKS clusters in your subscription 📊 Validated memory utilization metrics ✅ Resolved the incident with findings 📝 Step 5: Review Resolution in ServiceNow Check the ServiceNow Incident Return to ServiceNow and open your incident. 
You'll see: State: Changed to Resolved Activity Stream: Multiple work notes from the agent Resolution notes: Detailed findings Resolution Notes The agent writes comprehensive resolution notes including: Timestamp of resolution Root cause analysis Validation steps performed Fix applied (if any) 🚀 Next Steps Create custom response plans to customize how the agent responds to different incident types Configure alert routing to route specific Azure Monitor alerts to the agent Explore the Azure SRE Agent documentation for more features Share your experiences, learnings, and questions with other early adopters. Start a discussion in our Community Hub
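If you want to repeat the test from Step 3 without clicking through the ServiceNow UI, the incident can in principle be created through the ServiceNow Table API (`POST /api/now/table/incident`). The sketch below only prepares the request. The helper name is hypothetical, basic authentication with the credentials from Step 1 is assumed to be allowed on your instance, and the impact value "2" is assumed to correspond to "2 - Medium" as on a default instance. The Caller field is omitted because it expects a user reference that depends on your instance.

```python
import base64
import json
import urllib.request


def build_incident_request(instance_url: str, username: str, password: str,
                           short_description: str, impact: str = "2") -> urllib.request.Request:
    """Prepare (but do not send) a Table API request that creates an incident."""
    credentials = base64.b64encode(f"{username}:{password}".encode("utf-8")).decode("ascii")
    body = json.dumps({"short_description": short_description, "impact": impact}).encode("utf-8")
    return urllib.request.Request(
        f"{instance_url.rstrip('/')}/api/now/table/incident",
        data=body,
        method="POST",
        headers={
            "Authorization": f"Basic {credentials}",
            "Content-Type": "application/json",
            "Accept": "application/json",
        },
    )
```

Sending the prepared request with `urllib.request.urlopen(request, timeout=30)` should return the created record as JSON, after which the incident appears in the SRE Agent portal in the same way as one created by hand.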
dbandaru
Feb 19, 2026 Place Apps on Azure Blog
Build a Custom SSL Certificate Monitor with Azure SRE Agent: From Python Tool to Production Skill
TL;DR
Expired SSL certificates cause outages that are 100% preventable. In this post, you’ll learn how to create a custom Python tool in Azure SRE Agent that checks SSL certificate expiry across your domains, then wrap it in a skill that gives your agent a complete certificate health audit workflow. The result: your SRE Agent proactively flags certificates expiring in the next 30 days and recommends renewal actions before they become 3 AM pages.

The Problem Every ITOps Team Knows Too Well
It’s a Tuesday morning. Your monitoring dashboard lights up with alerts: your customer-facing API is returning connection errors. Users are calling. Slack is on fire. After 20 minutes of frantic debugging, someone discovers the root cause: an SSL certificate expired overnight.

This scenario plays out across enterprises every week. According to industry data, certificate-related outages cost an average of $300,000 per incident in downtime and remediation. The frustrating part? Every single one is preventable.

ITOps teams say: “We have spreadsheets for tracking certs, but someone always forgets to update them after a renewal.”
On-call engineers say: “I spent 20 minutes debugging before realizing it was just an expired certificate.”

Most teams rely on a patchwork of solutions, and they all have gaps:

| Current Approach | The Gap |
|------------------|---------|
| Spreadsheets | Go stale: someone forgets to update after renewal |
| Calendar reminders | Fire too late: 7 days isn’t enough for compliance review |
| Standalone SaaS tools | Don’t integrate with existing incident workflows |
| Manual checks | Don’t scale with multi-domain sprawl |

What if your SRE Agent could check certificate health as part of its regular investigation workflow, and proactively warn you during routine health checks?
What We’ll Build
In this tutorial, you’ll create two things:
1. A Python Tool (CheckSSLCertificateExpiry): a custom tool that connects to any domain, retrieves its SSL certificate details, and returns structured data about the certificate’s validity, issuer, and days until expiry.
2. A Skill (ssl_certificate_audit): a reusable knowledge package that teaches your SRE Agent how to perform a complete certificate health audit across multiple domains, classify risk levels, and recommend actions.

By the end, your agent will respond to prompts like:
- “Check the SSL certificates for all our production domains”
- “Are any of our certificates expiring in the next 30 days?”
- “Run a certificate health audit for api.contoso.com, portal.contoso.com, and store.contoso.com”

The CheckSSLCertificateExpiry tool in the Azure SRE Agent portal, showing the Python code, parameters, and description.

Prerequisites
- An Azure SRE Agent instance deployed in your subscription
- Access to the Azure SRE Agent portal
- Basic familiarity with Python and YAML

Part 1: Creating the Python Tool
Step 1: Create the Tool in the Portal
Navigate to the Azure SRE Agent portal, go to Settings > Subagent Builder, and click Create New Tool. Select Python Tool as the type, enter the name CheckSSLCertificateExpiry, and provide the description: Checks SSL/TLS certificate expiry for a given domain and returns certificate details including days until expiration, issuer, and validity dates.
Add two parameters: domain (string, required): The fully qualified domain name to check (e.g., api.contoso.com) port (string, optional): The port to connect on (default 443) Step 2: Write the Python Code In the Function Code field, paste the following Python implementation: import ssl import socket import json from datetime import datetime, timezone def main(domain, port="443"): """Check SSL certificate expiry for a domain.""" port = int(port) context = ssl.create_default_context() try: with socket.create_connection((domain, port), timeout=10) as sock: with context.wrap_socket(sock, server_hostname=domain) as ssock: cert = ssock.getpeercert() not_before = datetime.strptime(cert["notBefore"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc) not_after = datetime.strptime(cert["notAfter"], "%b %d %H:%M:%S %Y %Z").replace(tzinfo=timezone.utc) now = datetime.now(timezone.utc) days_remaining = (not_after - now).days issuer = dict(x[0] for x in cert.get("issuer", [])) subject = dict(x[0] for x in cert.get("subject", [])) if days_remaining < 0: risk_level = "EXPIRED" elif days_remaining <= 7: risk_level = "CRITICAL" elif days_remaining <= 30: risk_level = "WARNING" elif days_remaining <= 60: risk_level = "ATTENTION" else: risk_level = "HEALTHY" san_list = [] for entry_type, value in cert.get("subjectAltName", []): if entry_type == "DNS": san_list.append(value) return { "domain": domain, "port": port, "status": "valid" if days_remaining >= 0 else "expired", "risk_level": risk_level, "days_remaining": days_remaining, "not_before": not_before.isoformat(), "not_after": not_after.isoformat(), "issuer": issuer.get("organizationName", "Unknown"), "issuer_cn": issuer.get("commonName", "Unknown"), "subject_cn": subject.get("commonName", domain), "serial_number": cert.get("serialNumber", "Unknown"), "version": cert.get("version", "Unknown"), "san_count": len(san_list), "san_domains": san_list[:10], "checked_at": now.isoformat() } except ssl.SSLCertVerificationError as e: return 
{ "domain": domain, "port": port, "status": "verification_failed", "risk_level": "CRITICAL", "error": str(e), "checked_at": datetime.now(timezone.utc).isoformat() } except (socket.timeout, socket.gaierror, ConnectionRefusedError, OSError) as e: return { "domain": domain, "port": port, "status": "connection_failed", "risk_level": "UNKNOWN", "error": str(e), "checked_at": datetime.now(timezone.utc).isoformat() }

Key Design Decisions:
- Structured output: The tool returns a JSON object with clearly labeled fields so the LLM can compare, sort, and aggregate results across multiple domains.
- Risk classification: Five risk levels (EXPIRED, CRITICAL, WARNING, ATTENTION, HEALTHY) give the agent clear thresholds to reason about.
- Error handling: Specific exception types return structured error objects rather than crashing, so the agent gets useful information even when a domain is unreachable.
- Zero dependencies: Uses only the Python standard library (ssl, socket, datetime) for fast cold starts and no supply chain risk.

Step 3: Deploy the Tool
Click Save in the tool editor to deploy the tool to your SRE Agent instance. The portal validates the YAML and Python code before saving.

The Subagent Builder in the Azure SRE Agent portal, showing all deployed subagents, Python tools, and skills at a glance.

Step 4: Test the Tool
Open a new chat thread in the portal, select the SSLCertificateMonitor agent, and type: "Check the SSL certificate for microsoft.com"

The agent checks microsoft.com and returns real certificate data: valid, healthy, 164 days remaining, issued by Microsoft Azure RSA TLS Issuing CA 04.

Part 2: Creating the Skill
A tool gives the agent a capability. A skill gives the agent a methodology.
Tool: “I can check one certificate.”
Skill: “Here’s how to audit all your certificates, classify the risks, and tell you exactly what to do about each one.”

What is a Skill?
A skill is a markdown document with YAML frontmatter that contains: Metadata: name, description, and which tools the skill uses Instructions: step-by-step guidance the agent follows when the skill is loaded Think of it as a runbook injected into the agent’s context when relevant. Step 1: Create the Skill in the Portal In the Azure SRE Agent portal, go to Settings > Subagent Builder and click Create New Skill. You will need to provide the full SKILL.md content, which includes both the YAML frontmatter and the markdown instructions. Step 2: Write the Skill Document Paste the following as the complete skill content: --- name: ssl_certificate_audit description: | Load this skill when the user asks about SSL/TLS certificate health, certificate expiry, certificate monitoring, or requests a certificate audit across one or more domains. Trigger phrases: "check our certificates", "are any certs expiring", "SSL audit", "certificate health check", "TLS certificate status", "cert renewal needed". Do NOT load for general security assessments, network connectivity issues unrelated to TLS, or application-level HTTPS errors (use standard troubleshooting for those). tools: - CheckSSLCertificateExpiry --- # SSL/TLS Certificate Health Audit Skill ## Purpose Perform a structured certificate health audit across one or more domains: check each certificate, classify risk, aggregate findings, and deliver a prioritized action plan with specific renewal deadlines. ## Scope Focus ONLY on SSL/TLS certificate validity, expiry, and health. Exclude: - Application-level HTTPS configuration issues - Cipher suite or TLS version analysis (unless certificate is the root cause) - Certificate Authority trust chain debugging (unless verification fails) ## Workflow ### Phase 1: Domain Collection 1. If the user provides specific domains, use those directly. 2. 
If the user says "all our domains" or "production domains," ask them to list the domains or provide a resource group to discover App Services, Front Doors, or API Management instances with custom domains. 3. Confirm the domain list before proceeding. ### Phase 2: Certificate Checks 1. Run CheckSSLCertificateExpiry for each domain. Execute checks in parallel when possible. 2. Collect all results before analysis. 3. If any domain returns a connection error, note it separately; do not abort the audit. ### Phase 3: Risk Classification and Reporting Classify each certificate into one of these categories: | Risk Level | Criteria | Required Action | |------------|----------|-----------------| | EXPIRED | days_remaining < 0 | Immediate renewal, this is causing outages | | CRITICAL | days_remaining <= 7 | Emergency renewal within 24 hours | | WARNING | days_remaining <= 30 | Schedule renewal this sprint | | ATTENTION | days_remaining <= 60 | Add to next renewal cycle | | HEALTHY | days_remaining > 60 | No action needed | ### Phase 4: Summary Report Present findings in this order: 1. **Executive Summary** (1-2 sentences): Total domains checked, how many need action. 2. **Certificates Needing Action** (table): Domain, expiry date, days remaining, risk level, recommended action. Sort by days_remaining ascending (most urgent first). 3. **Healthy Certificates** (compact list): Domain and expiry date only. 4. **Unreachable Domains** (if any): Domain and error reason. 5. **Recommendations**: Specific next steps based on findings. ### Phase 5: Actionable Recommendations Based on findings, recommend: - **For EXPIRED or CRITICAL**: "Renew the certificate for {domain} immediately. If using Azure-managed certificates, check the App Service custom domain binding. If using a third-party CA, initiate the renewal process with {issuer}." - **For WARNING**: "Schedule renewal for {domain} (expires {date}). Recommended to renew by {date - 7 days} to allow for propagation and testing." 
- **For ATTENTION**: "Add {domain} to the renewal queue. Certificate expires {date}." - **For mixed results**: "Consider implementing automated certificate management (e.g., Azure Key Vault with auto-renewal) to prevent future expiry risks." ## Output Format Use markdown tables for certificate status. Include the checked_at timestamp to establish when the audit was performed. Bold the risk level for EXPIRED and CRITICAL entries. ## Example Output (Condensed) Certificate Health Audit: 5 domains checked at 2026-02-18T14:30:00Z. 2 certificates need immediate attention; 3 are healthy. | Domain | Expires | Days Left | Risk | Action | |--------|---------|-----------|------|--------| | api.contoso.com | 2026-02-20 | **2** | **CRITICAL** | Renew within 24 hours | | store.contoso.com | 2026-03-10 | 20 | WARNING | Schedule renewal this sprint | | portal.contoso.com | 2026-06-15 | 117 | HEALTHY | None | | auth.contoso.com | 2026-08-22 | 185 | HEALTHY | None | | cdn.contoso.com | 2026-09-01 | 195 | HEALTHY | None | Recommendation: Renew api.contoso.com immediately to prevent service disruption. Schedule store.contoso.com renewal by March 3rd. ## Quality Principles - Check all domains before reporting (don't report one-by-one). - Never guess certificate details; only report what the tool returns. - Sort urgent items first in all outputs. - Include specific dates, not vague timeframes. - Align with system prompt: answer first, then evidence. Step 3: Deploy the Skill and Configure the Agent Back in the Subagent Builder, create a new subagent called SSLCertificateMonitor . In the agent configuration: Add the CheckSSLCertificateExpiry tool to the agent's tool list Enable Allow Parallel Tool Calls in the agent settings Click Save to deploy the agent Skills are automatically enabled on every agent, so no additional configuration is needed. The portal will validate and deploy the skill, tool, and agent together. 
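The Phase 3 thresholds and the Phase 4 ordering described in the skill can be sanity-checked offline before you rely on the agent's report. The sketch below mirrors the risk table and sorts tool results the way the summary report expects. The function names are hypothetical, and placing results that lack `days_remaining` (for example a `verification_failed` result) at the top of the action list is an assumption, not something the skill specifies.

```python
def classify_risk(days_remaining: int) -> str:
    """Map days until expiry to the risk levels used by the tool and the skill."""
    if days_remaining < 0:
        return "EXPIRED"
    if days_remaining <= 7:
        return "CRITICAL"
    if days_remaining <= 30:
        return "WARNING"
    if days_remaining <= 60:
        return "ATTENTION"
    return "HEALTHY"


def order_for_report(results: list) -> tuple:
    """Split tool results into (needs_action, healthy, unreachable), most urgent first."""
    unreachable = [r for r in results if r.get("status") == "connection_failed"]
    checked = [r for r in results if r.get("status") != "connection_failed"]

    def urgency(result: dict) -> float:
        # Assumption: results without days_remaining sort as most urgent.
        return result.get("days_remaining", float("-inf"))

    needs_action = sorted(
        (r for r in checked if r.get("risk_level") != "HEALTHY"), key=urgency
    )
    healthy = sorted(
        (r for r in checked if r.get("risk_level") == "HEALTHY"), key=urgency
    )
    return needs_action, healthy, unreachable
```

Keeping the thresholds in one small function also makes it easy to notice if the tool code and the skill's risk table drift apart after a later edit.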
The SSLCertificateMonitor subagent in the portal, showing the CheckSSLCertificateExpiry tool, agent instructions, and skills enabled.

Part 3: See It in Action
Here’s what happens when you ask the agent to audit four real domains (microsoft.com, azure.com, github.com, and learn.microsoft.com). Open a new chat thread in the portal, select the SSLCertificateMonitor agent, and type: "Run a certificate health audit for microsoft.com, azure.com, github.com, and learn.microsoft.com"

The agent checks all 4 domains in parallel, classifies github.com as ATTENTION (45 days remaining), and recommends scheduling renewal by March 29, 2026.

The agent:
✅ Loaded the ssl_certificate_audit skill (matched by “certificate health audit”)
✅ Ran CheckSSLCertificateExpiry for all 4 domains in parallel
✅ Classified github.com as ATTENTION (45 days) and the rest as HEALTHY
✅ Produced a prioritized report: action items first, healthy domains second
✅ Recommended a specific renewal date and suggested Azure Key Vault auto-renewal

Real result: This audit ran against live production domains and completed in under 25 seconds. The agent correctly identified that github.com’s certificate expires soonest and needs to be added to the renewal cycle.

Scenario 1: Morning Certificate Health Check
User: “Run a certificate health check across our production domains: api.contoso.com, portal.contoso.com, store.contoso.com, auth.contoso.com, and payments.contoso.com”

The agent:
✅ Loads the ssl_certificate_audit skill (matched by “certificate health check”)
✅ Runs CheckSSLCertificateExpiry for each domain in parallel
✅ Classifies each result by risk level
✅ Delivers a prioritized report with specific action items

Scenario 2: Discovering Cert Issues During Incident Investigation
During a connectivity incident, the agent may use CheckSSLCertificateExpiry to check whether the certificate has expired, discovering the root cause without the engineer needing to check manually.
Scenario 3: Cross-Agent Integration

Because the skill references tools by name, any agent with access to CheckSSLCertificateExpiry can use it: add it to your triage agent, weekly health-check workflow, or other skills that deal with frontend health.

How Tools and Skills Work Together

┌──────────────────────────────────────┐
│   Skill                              │
│   "ssl_certificate_audit"            │
│                                      │
│   Methodology:                       │
│   1. Collect domains                 │
│   2. Check each certificate ─┐       │
│   3. Classify risk levels    │       │
│   4. Generate report         │       │
│   5. Recommend actions       │       │
└──────────────────────────────┼───────┘
                               │
                               ▼
┌──────────────────────────────────────┐
│   Tool                               │
│   "CheckSSLCertificateExpiry"        │
│                                      │
│   Capability:                        │
│   - Connect to domain:port           │
│   - Read SSL certificate             │
│   - Return structured cert data      │
└──────────────────────────────────────┘

| Concept | Role | Analogy |
|---------|------|---------|
| Tool | Atomic capability: does one thing, returns data | A stethoscope |
| Skill | Methodology: combines tools, interprets results, makes decisions | A diagnostic protocol |

Key Takeaways

Custom Python tools are first-class citizens
You don’t need to build a microservice or deploy an MCP server. Write a Python function, deploy it through the Azure SRE Agent portal, and it’s immediately available.

Skills turn tools into expertise
A tool tells the agent what it can do. A skill tells the agent what it should do and how. The audit skill transforms a simple certificate check into a comprehensive capability.

Start small, iterate fast
Tool creation, skill creation, deployment, and testing take under 30 minutes. Start with one domain check and expand incrementally.

ITOps value is immediate
Every team has certificates. Every team has been burned by an expired one. Deploy this on day one and prevent the next certificate outage.

Want to learn more about Azure SRE Agent extensibility? Check out the YAML Schema Reference and the Python Tool documentation.
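The classify-and-sort step of the audit skill could look like the minimal sketch below. The day thresholds are assumptions chosen to agree with the examples in this post (2 days is CRITICAL, 20 is WARNING, 45 is ATTENTION, 117 is HEALTHY); the skill's own instructions define the real values. The names `CertStatus`, `classify`, and `prioritize` are hypothetical.

```python
from dataclasses import dataclass

# Assumed thresholds in days remaining; not taken from the original skill.
CRITICAL_DAYS = 7
WARNING_DAYS = 30
ATTENTION_DAYS = 60

# Lower rank sorts first, so urgent items lead the report.
RISK_ORDER = {"EXPIRED": 0, "CRITICAL": 1, "WARNING": 2, "ATTENTION": 3, "HEALTHY": 4}


def classify(days_left: int) -> str:
    """Map days until expiry to a risk level."""
    if days_left < 0:
        return "EXPIRED"
    if days_left <= CRITICAL_DAYS:
        return "CRITICAL"
    if days_left <= WARNING_DAYS:
        return "WARNING"
    if days_left <= ATTENTION_DAYS:
        return "ATTENTION"
    return "HEALTHY"


@dataclass
class CertStatus:
    """Hypothetical container for one certificate check result."""
    domain: str
    days_left: int

    @property
    def risk(self) -> str:
        return classify(self.days_left)


def prioritize(results: list[CertStatus]) -> list[CertStatus]:
    """Sort by risk level first, then by fewest days remaining."""
    return sorted(results, key=lambda r: (RISK_ORDER[r.risk], r.days_left))


if __name__ == "__main__":
    audit = [
        CertStatus("portal.contoso.com", 117),
        CertStatus("api.contoso.com", 2),
        CertStatus("store.contoso.com", 20),
    ]
    for cert in prioritize(audit):
        print(cert.domain, cert.days_left, cert.risk)
```

Keeping the thresholds as named constants makes it easy to align the code with whatever risk bands the skill instructions actually specify.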
dbandaru
Feb 19, 2026 Place Apps on Azure Blog
How Azure SRE Agent Can Investigate Resources in a Private Network
⚠️ Important Note on Network Communication: In this pattern, Azure SRE Agent communicates over the public network to reach the Azure Function proxy. The proxy endpoint is secured with Easy Auth (Microsoft Entra ID) and only authenticated callers can invoke it. We are also actively working on enabling SRE Agent to be injected directly into private networks, which will eliminate the need for a public proxy altogether. Stay tuned for updates on private network injection support.

TL;DR

When you configure Azure Monitor Private Link Scope (AMPLS) with publicNetworkAccessForQuery: Disabled, all public queries to your Log Analytics Workspace are blocked. To enable Azure SRE Agent to query these protected workspaces, deploy Azure Functions inside your VNet as a secure query proxy.

What We Built: This sample deploys to a single subscription with two resource groups (rg-originations-* and rg-workload-*). The same pattern works identically across subscriptions. Simply deploy each resource group to a different subscription.

Why Public Queries Get Blocked

Many organizations secure their Log Analytics Workspaces using Azure Monitor Private Link Scope (AMPLS) with Private Endpoints. This is a best practice for compliance and data security, but it means all public queries are blocked.

| Resource Type | Can Live in VNet? | How to Access Privately |
|---------------|-------------------|-------------------------|
| Virtual Machine | Yes | Direct (it has a NIC) |
| Container App | Yes | VNet integration |
| Azure SQL | No | Private Endpoint |
| Storage Account | No | Private Endpoint |
| Log Analytics Workspace | No | AMPLS + Private Endpoint |

When you configure publicNetworkAccessForQuery: Disabled on the workspace and queryAccessMode: PrivateOnly on the AMPLS, any query that does not come through a Private Endpoint is rejected. This includes queries from Azure SRE Agent, which runs as a cloud service outside your VNet.

The Architecture

Two resource groups: rg-originations-ampls-demo (LAW + AMPLS, query access disabled) and rg-workload-ampls-demo (VNet + Private Endpoint + Function).
This pattern works across subscriptions with cross-subscription RBAC for the Function's Managed Identity.

The Problem: Blocked Queries

When you configure AMPLS with private-only query access, any attempt to query from outside the VNet fails:

InsufficientAccessError: The query was blocked due to private link
configuration. Access is denied because this request was not made
through a private endpoint.

This is the expected behavior. But it means Azure SRE Agent (which runs as a cloud service, not inside your VNet) cannot directly query these workspaces.

The Solution: Azure Functions as Query Proxy

Deploy Azure Functions inside the workload VNet. This serverless proxy:

| Capability | Description |
|------------|-------------|
| Runs inside VNet | VNet-integrated with vnetRouteAllEnabled: true |
| Uses Managed Identity | Authenticates to LAW via Azure RBAC |
| Exposes HTTPS endpoints | SRE Agent calls as custom HTTP tools |
| Proxies queries | Transforms API calls into KQL queries |
| Serverless scaling | Pay only when queries are executed |

Why This Pattern Works

Data ingestion and query access use different network paths:

| Operation | Direction | Network | Status |
|-----------|-----------|---------|--------|
| Log Ingestion | AMA → Private Endpoint → LAW | Private | Works |
| External Query | Public Internet → LAW | Public | Blocked |
| VNet Query | VNet → Private Endpoint → LAW | Private | Works |
| SRE Agent Query | HTTPS → Function → PE → LAW | Hybrid | Works |

The Azure Function acts as a bridge between two networks: (1) Public side with HTTPS endpoint for SRE Agent, (2) Private side with VNet integration routing all outbound traffic through the Private Endpoint. This is why the pattern works. The Function "translates" public API calls into private network queries.

Setting Up the Architecture

Step 1: Configure the Originations LAW

az monitor log-analytics workspace create --resource-group originations-rg --workspace-name originations-law --location eastus
az monitor log-analytics workspace update --resource-group originations-rg --workspace-name originations-law --set properties.publicNetworkAccessForQuery=Disabled

Step 2: Create the Azure Monitor Private Link Scope

az monitor private-link-scope create --name originations-ampls --resource-group originations-rg
# ... (see full sample on GitHub)

Step 3: Create the Private Endpoint in Workload Resource Group

az network private-endpoint create --name pe-ampls --resource-group rg-workload-ampls-demo --vnet-name vnet-workload-ampls-demo --subnet endpoints --private-connection-resource-id "/subscriptions/.../resourceGroups/rg-originations-ampls-demo/providers/Microsoft.Insights/privateLinkScopes/ampls-originations-ampls-demo" --group-id azuremonitor --connection-name ampls-connection

Step 4: Deploy the Azure Function

az functionapp plan create --name plan-law-query --resource-group workload-rg --location eastus --sku EP1 --is-linux true
az functionapp create --name func-law-query --resource-group workload-rg --plan plan-law-query --runtime python --runtime-version 3.11 --functions-version 4 --assign-identity '[system]'
# ... (see full sample on GitHub)

Step 5: Configure Easy Auth (Microsoft Entra ID) on the Function App

Instead of function keys, we secure the Azure Function with Easy Auth (Microsoft Entra ID authentication). This eliminates the need to manage secrets. The SRE Agent authenticates using its Managed Identity.

5a. Set Function Auth Level to Anonymous

Since Easy Auth handles authentication at the platform level, set authLevel to anonymous in each function.json:

{
  "scriptFile": "__init__.py",
  "bindings": [
    {"authLevel": "anonymous", "type": "httpTrigger", "direction": "in", "name": "req", "methods": ["get", "post"]},
    {"type": "http", "direction": "out", "name": "$return"}
  ]
}

5b. Enable Easy Auth via Azure Portal

1. Navigate to your Function App in the Azure Portal
2. Go to Settings → Authentication
3. Click Add identity provider and select Microsoft
4. Configure: Create new app registration, Current tenant, Federated identity credential, Require authentication, HTTP 401 for unauthenticated
5. Add the SRE Agent's Managed Identity Client ID under Allowed client applications
6. Note the Application (client) ID for PythonTool configuration

Finding the SRE Agent Managed Identity Client ID

Option 1: Azure Portal: Navigate to your SRE Agent → Settings → Identity → copy the Client ID under System assigned or User assigned.

Option 2: Azure CLI:

az containerapp show --name <YOUR-SRE-AGENT-NAME> --resource-group <YOUR-SRE-AGENT-RG> --query "identity.userAssignedIdentities" -o json

5c. Deploy SRE Agent Tools (PythonTools with Easy Auth)

Critical: PythonTools must use def main(**kwargs). Each tool acquires a Bearer Token from the SRE Agent's Managed Identity via IDENTITY_ENDPOINT and calls the Azure Function endpoints. See sample repository for full subagent definition and tool implementations.

How Easy Auth Token Acquisition Works

1. PythonTool reads IDENTITY_ENDPOINT and IDENTITY_HEADER environment variables (set automatically by the SRE Agent runtime)
2. PythonTool calls the identity endpoint with resource=api://<app-id> to get a Bearer Token
3. PythonTool includes the token in the Authorization: Bearer <token> header
4. Easy Auth validates the token against the App Registration
5. Function App executes the query using its Managed Identity

No secrets required: Unlike function keys, Easy Auth uses Managed Identity tokens that are automatically rotated and never stored in code or configuration.

The Investigation Flow

| Step | Actor | Action |
|------|-------|--------|
| 1 | You | "There are errors on my workload VMs. Investigate." |
| 2 | SRE Agent | Calls Azure Function's query_logs endpoint |
| 3 | Azure Function | Queries LAW via Private Endpoint |
| 4 | Log Analytics | Returns results (allowed, since request came from PE) |
| 5 | Azure Function | Returns JSON response to SRE Agent |
| 6 | SRE Agent | Analyzes logs, identifies root cause, responds |

Security Considerations

| Concern | How It's Secured |
|---------|------------------|
| Log Analytics | Public query access disabled, Private Link only |
| Private Endpoint | In isolated subnet with NSG rules |
| Azure Function | Managed Identity for LAW access (no secrets) |
| API Authentication | Easy Auth (Microsoft Entra ID) with Bearer Token, no secrets to manage |
| VNet Routing | vnetRouteAllEnabled: true for all traffic |
| Audit Trail | All invocations logged in Application Insights |

Try It Yourself

git clone https://github.com/BandaruDheeraj/private-law-query-sample
cd private-law-query-sample
azd up
# See Step 5 for Easy Auth configuration
./inject-failure.ps1

This creates: rg-originations-{env} (LAW + AMPLS) and rg-workload-{env} (VNet + PE + Functions + VMs)

Key Takeaways

- AMPLS blocks public queries by design: When you configure Private Link with private-only query access, all external queries are rejected. This is the expected security behavior.
- Azure Functions provide a serverless query proxy: VNet-integrated Functions with Managed Identity can query private Log Analytics on behalf of SRE Agent.
- Resource groups simulate cross-subscription: This sample uses two resource groups; the same pattern works across subscriptions.
- Easy Auth eliminates secret management: Using Microsoft Entra ID authentication instead of function keys means no secrets to rotate or store.
- Security is maintained end-to-end: The workspace remains fully private; only the trusted Function can query it. The SRE Agent authenticates with its Managed Identity.
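The Easy Auth token acquisition flow from Step 5c can be sketched with the Python standard library alone. This is an illustration, not the sample repository's implementation. It assumes the SRE Agent runtime follows the App Service / Container Apps managed identity REST protocol (api-version 2019-08-01 with an X-IDENTITY-HEADER request header). The parameter names `app_id`, `function_url`, `client_id`, and `query`, and the `{"query": ...}` request body, are hypothetical.

```python
import json
import os
import urllib.parse
import urllib.request

API_VERSION = "2019-08-01"  # managed identity REST protocol version (assumed to apply here)


def build_token_request(resource, client_id=None, env=None):
    """Return (url, headers) for a managed identity token request."""
    env = os.environ if env is None else env
    params = {"resource": resource, "api-version": API_VERSION}
    if client_id:
        # Only needed to select a specific user-assigned identity.
        params["client_id"] = client_id
    url = f"{env['IDENTITY_ENDPOINT']}?{urllib.parse.urlencode(params)}"
    headers = {"X-IDENTITY-HEADER": env["IDENTITY_HEADER"]}
    return url, headers


def get_bearer_token(resource, client_id=None):
    """Fetch an access token from the local identity endpoint."""
    url, headers = build_token_request(resource, client_id)
    request = urllib.request.Request(url, headers=headers)
    with urllib.request.urlopen(request, timeout=10) as response:
        return json.load(response)["access_token"]


def main(**kwargs):
    """Hypothetical PythonTool entry point that calls the Function proxy."""
    app_id = kwargs["app_id"]              # assumed parameter name
    function_url = kwargs["function_url"]  # assumed parameter name
    token = get_bearer_token(f"api://{app_id}", kwargs.get("client_id"))
    # The request body shape is an assumption; match it to your Function.
    body = json.dumps({"query": kwargs.get("query", "")}).encode("utf-8")
    request = urllib.request.Request(
        function_url,
        data=body,
        headers={
            "Authorization": f"Bearer {token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return json.load(response)
```

Separating `build_token_request` from the network call keeps the URL and header construction easy to check locally, since the identity endpoint only exists inside the agent runtime.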
Resources

| Resource | Link |
|----------|------|
| Sample Repository | github.com/BandaruDheeraj/private-law-query-sample |
| Azure Monitor Private Link | docs.microsoft.com/azure/azure-monitor/logs/private-link-security |
| Azure Functions VNet Integration | docs.microsoft.com/azure/azure-functions/functions-networking-options |
| AMPLS Design Guidance | docs.microsoft.com/azure/azure-monitor/logs/private-link-design |
| Managed Identity for Azure Functions | docs.microsoft.com/azure/app-service/overview-managed-identity |
| Azure Developer CLI (azd) | learn.microsoft.com/azure/developer/azure-developer-cli |
dbandaru
Feb 17, 2026 Place Apps on Azure Blog
MCP-Driven Azure SRE for Databricks
Azure SRE Agent is an AI-powered operations assistant built for incident response and governance. MCP (Model Context Protocol) is the standard interface it uses to connect to external systems and tools. Azure SRE Agent integrates with Azure Databricks through the Model Context Protocol (MCP) to provide:

- Proactive Compliance - Automated best practice validation
- Reactive Troubleshooting - Root cause analysis and remediation for job failures

This post demonstrates both capabilities with real examples.

Architecture

The Azure SRE Agent orchestrates Ops Skills and Knowledge Base prompts, then calls the Databricks MCP server over HTTPS. The MCP server translates those requests into Databricks REST API calls, returns structured results, and the agent composes findings, evidence, and remediation. End-to-end, this yields a single loop: intent -> MCP tool calls -> Databricks state -> grounded response.

Deployment

The MCP server runs as a containerized FastMCP application on Azure Container Apps, fronted by HTTPS and configured with Databricks workspace connection settings. It exposes a tool catalog that the agent invokes through MCP, while the container handles authentication and REST API calls to Databricks.

👉 For deployment instructions, see the GitHub repository.

Getting Started

1. Deploy the MCP Server: Follow the quickstart guide to deploy to Azure Container Apps (~30 min)
2. Configure Azure SRE Agent:
   - Create MCP connector with streamable-http transport
   - Upload Knowledge Base from Builder > Knowledge Base using the Best Practices doc: AZURE_DATABRICKS_BEST_PRACTICES.md
     Benefit: Gives the agent authoritative compliance criteria and remediation commands.
   - Create Ops Skill from Builder > Subagent Builder > Create skill and drop the Ops Skill doc: DATABRICKS_OPS_RUNBOOK_SKILL.md
     Benefit: Adds incident timelines, runbooks, and escalation triggers to responses.
   - Deploy the subagent YAML: Databricks_MCP_Agent.yaml
     Benefit: Wires the MCP connector, Knowledge Base, and Ops Skill into one agent for proactive and reactive workflows.
3. Integrate with Alerting:
   - Connect PagerDuty/ServiceNow webhooks
   - Enable auto-remediation for common issues

Part 1: Proactive Compliance

Use Case: Best Practice Validation

Prompt: @Databricks_MCP_Agent: Validate the Databricks workspace for best practices compliance and provide a summary, detailed findings, and concrete remediation steps.

What the Agent Does:

1. Calls MCP tools to gather current state:
   - list_clusters() - Audit compute configurations
   - list_catalogs() - Check Unity Catalog setup
   - list_jobs() - Review job configurations
   - execute_sql() - Query governance policies
2. Cross-references findings with Knowledge Base (best practices document)
3. Generates prioritized compliance report

Expected Output:

Benefits:

- Time Savings: 5 minutes vs. 2-3 hours manual review
- Consistency: Same validation criteria across all workspaces
- Actionable: Specific remediation steps with code examples

Part 2: Reactive Incident Response

Example 1: Job Failure - Non-Zero Exit Code

Scenario: Job job_exceptioning_out fails repeatedly due to notebook code errors.

Prompt:

Agent Investigation - Calls MCP Tools:

- get_job() - Retrieves job definition
- list_job_runs() - Gets recent run history (4 failed runs)
- get_run_output() - Analyzes error logs

Root Cause Analysis:

Expected Outcome:

- Root Cause Identified: sys.exit(1) in notebook code
- Evidence Provided: Job ID, run history, code excerpt, settings
- Confidence: HIGH (explicit failing code present)
- Remediation: Fix code + add retry policy
- Resolution Time: 3-5 minutes (vs. 30-45 minutes manual investigation)

Example 2: Job Failure - Task Notebook Exception

Scenario: Job hourly-data-sync fails repeatedly due to exception in task notebook.

Prompt:

Agent Investigation - Calls MCP Tools:

- get_job() - Job definition and task configuration
- list_job_runs() - Recent runs show "TERMINATED with TIMEOUT"
- execute_sql() - Queries notebook metadata

Root Cause Analysis:

Expected Outcome:

- Root Cause Identified: Exception at line 7 - null partition detected
- Evidence: Notebook path, code excerpt (lines 5-7), run history (7 consecutive failures)
- Confidence: HIGH (explicit failing code + TIMEOUT/queue issues)
- Remediation: Fix exception handling + add retry policy
- Resolution Time: 5-8 minutes (vs. 45+ minutes manual log analysis)

Key Benefits

Proactive Governance

✅ Continuous compliance monitoring
✅ Automated best practice validation
✅ 95% reduction in manual review time

Reactive Incident Response

🚨 Automated root cause analysis
⚡ 80-95% reduction in MTTR
🧠 Context-aware remediation recommendations
📊 Evidence-based troubleshooting

Operational Impact

| Metric | Before | After | Improvement |
|--------|--------|-------|-------------|
| Compliance review time | 2-3 hours | 5 minutes | 95% |
| Job failure investigation | 30-45 min | 3-8 min | 85% |
| On-call alerts requiring intervention | 4-6 per shift | 1-2 per shift | 70% |

Conclusion

Azure SRE Agent transforms Databricks operations by combining proactive governance with reactive troubleshooting. The MCP integration provides:

- Comprehensive visibility into workspace health
- Automated compliance monitoring and validation
- Intelligent incident response with root cause analysis
- Self-healing capabilities for common failures

Result: Teams spend less time firefighting and more time building.

Resources

- 📘 Deployment Guide
- 🤖 Subagent Configuration
- 📋 Best Practices Document
- 🧰 Ops Skill Runbook
- 🔧 Validation Script
- 📖 Azure SRE Agent Documentation
- 📰 Azure SRE Agent Blogs
- 📜 MCP Specification

Questions? Open an issue on GitHub or reach out to the Azure SRE team.
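The run-history evidence in the incident examples (such as "7 consecutive failures") amounts to counting the failure streak at the head of a job's run list. The sketch below shows one way to do that. It assumes each run is a dictionary shaped like a Databricks Jobs API run object with a `state.result_state` field, ordered newest first; the format actually returned by the MCP server's list_job_runs() tool may differ. The function names and the default threshold are hypothetical.

```python
# result_state values treated as failures (an assumption for this sketch).
FAILURE_STATES = {"FAILED", "TIMEDOUT"}


def consecutive_failures(runs):
    """Count failed runs at the head of a newest-first run list.

    Runs without a result_state (for example, still running) are skipped
    so that an in-progress run does not hide an existing failure streak.
    """
    streak = 0
    for run in runs:
        result = run.get("state", {}).get("result_state")
        if result is None:
            continue
        if result not in FAILURE_STATES:
            break
        streak += 1
    return streak


def needs_investigation(runs, threshold=3):
    """Flag a job whose latest completed runs failed `threshold` times in a row."""
    return consecutive_failures(runs) >= threshold


if __name__ == "__main__":
    history = [
        {"state": {"result_state": "FAILED"}},
        {"state": {"result_state": "TIMEDOUT"}},
        {"state": {"result_state": "SUCCESS"}},
    ]
    print(consecutive_failures(history), needs_investigation(history, threshold=2))
```

A streak check like this is a cheap trigger for starting the deeper investigation, which still needs the job definition and run output to identify the root cause.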
varghesejoji
Feb 13, 2026 Place Apps on Azure Blog