# Get started with Datadog MCP server in Azure SRE Agent
## Overview

The Datadog MCP server is a cloud-hosted bridge between your Datadog organization and Azure SRE Agent. Once configured, it enables real-time interaction with logs, metrics, APM traces, monitors, incidents, dashboards, and other Datadog data through natural language. All actions respect your existing Datadog RBAC permissions.

The server uses Streamable HTTP transport with two custom headers (`DD_API_KEY` and `DD_APPLICATION_KEY`) for authentication. Azure SRE Agent connects directly to the Datadog-hosted endpoint—no npm packages, local proxies, or container deployments are required. The SRE Agent portal includes a dedicated Datadog MCP server connector type that pre-populates the required header keys for streamlined setup.

### Key capabilities

| Area | Capabilities |
|------|--------------|
| Logs | Search and analyze logs with SQL-based queries, filter by facets and time ranges |
| Metrics | Query metric values, explore available metrics, get metric metadata and tags |
| APM | Search spans, fetch complete traces, analyze trace performance, compare traces |
| Monitors | Search monitors, validate configurations, inspect monitor groups and templates |
| Incidents | Search and get incident details, view timeline and responders |
| Dashboards | Search and list dashboards by name or tag |
| Hosts | Search hosts by name, tags, or status |
| Services | List services and map service dependencies |
| Events | Search events including monitor alerts, deployments, and custom events |
| Notebooks | Search and retrieve notebooks for investigation documentation |
| RUM | Search Real User Monitoring events for frontend observability |

This is the official Datadog-hosted MCP server (Preview). The server exposes 16+ core tools, with additional toolsets available for alerting, APM, Database Monitoring, Error Tracking, feature flags, LLM Observability, networking, security, software delivery, and Synthetic tests. Tool availability depends on your Datadog plan and RBAC permissions.
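To make the transport concrete, here is a minimal sketch of the request shape: a JSON-RPC `initialize` call over Streamable HTTP with the two custom authentication headers named above. This is an illustrative check, not part of the setup flow; the `protocolVersion` value is an assumption, and the endpoint shown is the US1 region.

```python
import json
import urllib.request

# US1 endpoint from the table in this article; other regions differ.
MCP_URL = "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp"

def build_mcp_request(api_key: str, app_key: str) -> urllib.request.Request:
    """Build (but do not send) an MCP initialize request carrying
    Datadog's two custom authentication headers."""
    payload = {
        "jsonrpc": "2.0",
        "id": 1,
        "method": "initialize",
        "params": {
            "protocolVersion": "2025-03-26",  # assumption: current MCP revision
            "capabilities": {},
            "clientInfo": {"name": "sre-agent-check", "version": "0.1"},
        },
    }
    return urllib.request.Request(
        MCP_URL,
        data=json.dumps(payload).encode(),
        headers={
            "Content-Type": "application/json",
            "DD_API_KEY": api_key,
            "DD_APPLICATION_KEY": app_key,
        },
        method="POST",
    )

req = build_mcp_request("<api-key>", "<app-key>")
```

In practice the SRE Agent connector performs this handshake for you; the sketch only shows where the two headers travel.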
## Prerequisites

- Azure SRE Agent resource deployed in Azure
- Datadog organization with an active plan
- Datadog user account with appropriate RBAC permissions
- API key: created from Organization Settings > API Keys
- Application key: created from Organization Settings > Application Keys with MCP Read and/or MCP Write permissions
- Your organization must be allowlisted for the Datadog MCP server Preview

## Step 1: Create API and Application keys

The Datadog MCP server requires two credentials: an API key (identifies your organization) and an Application key (authenticates the user and defines permission scope). Both are created in the Datadog portal.

### Create an API key

1. Log in to your Datadog organization (use your region-specific URL if applicable—e.g., app.datadoghq.eu for EU1)
2. Select your account avatar in the bottom-left corner of the navigation bar
3. Select Organization Settings
4. In the left sidebar, select API Keys (under the Access section). Direct URL: https://app.datadoghq.com/organization-settings/api-keys
5. Select + New Key in the top-right corner
6. Enter a descriptive name (e.g., `sre-agent-mcp`)
7. Select Create Key
8. Copy the key value immediately—it is shown only once. If lost, you must create a new key.

> [!TIP]
> API keys are organization-level credentials. Any Datadog Admin or user with the API Keys Write permission can create them. The API key alone does not grant data access—it must be paired with an Application key.
### Create an Application key

1. From the same Organization Settings page, select Application Keys in the left sidebar. Direct URL: https://app.datadoghq.com/organization-settings/application-keys
2. Select + New Key in the top-right corner
3. Enter a descriptive name (e.g., `sre-agent-mcp-app`)
4. Select Create Key
5. Copy the key value immediately—it is shown only once

### Add MCP permissions to the Application key

After creating the Application key, you must grant it the MCP-specific scopes:

1. In the Application Keys list, locate the key you just created
2. Select the key name to open its detail panel
3. In the detail panel, find the Scopes section and select Edit
4. Search for MCP in the scopes search box
5. Check MCP Read to enable read access to Datadog data via MCP tools
6. Optionally check MCP Write if your agent needs to create or modify resources (e.g., feature flags, Synthetic tests)
7. Select Save

If you don't see the MCP Read or MCP Write scopes, your organization may not be enrolled in the Datadog MCP server preview. Contact your Datadog account representative to request access.

### Required permissions summary

| Permission | Description | Required? |
|------------|-------------|-----------|
| MCP Read | Read access to Datadog data via MCP tools (logs, metrics, traces, monitors, etc.) | Yes |
| MCP Write | Write access for mutating operations (creating feature flags, editing Synthetic tests, etc.) | Optional |

For production use, create keys from a service account rather than a personal account. Navigate to Organization Settings > Service Accounts to create one. This ensures the integration continues to work if team members leave the organization. Apply the principle of least privilege—grant only MCP Read unless write operations are needed. Use scoped Application keys to restrict access to only the permissions your agent needs. This limits the blast radius if a key is compromised.

## Step 2: Add the MCP connector

Connect the Datadog MCP server to your SRE Agent using the portal.
The portal includes a dedicated Datadog connector type that pre-populates the required configuration.

### Determine your regional endpoint

Select the endpoint URL that matches your Datadog organization's region:

| Region | Endpoint URL |
|--------|--------------|
| US1 (default) | https://mcp.datadoghq.com/api/unstable/mcp-server/mcp |
| US3 | https://mcp.us3.datadoghq.com/api/unstable/mcp-server/mcp |
| US5 | https://mcp.us5.datadoghq.com/api/unstable/mcp-server/mcp |
| EU1 | https://mcp.datadoghq.eu/api/unstable/mcp-server/mcp |
| AP1 | https://mcp.ap1.datadoghq.com/api/unstable/mcp-server/mcp |
| AP2 | https://mcp.ap2.datadoghq.com/api/unstable/mcp-server/mcp |

### Using the Azure portal

1. In the Azure portal, navigate to your SRE Agent resource
2. Select Builder > Connectors
3. Select Add connector
4. Select Datadog MCP server and select Next
5. Configure the connector:

| Field | Value |
|-------|-------|
| Name | datadog-mcp |
| Connection type | Streamable-HTTP (pre-selected) |
| URL | https://mcp.datadoghq.com/api/unstable/mcp-server/mcp (change for non-US1 regions) |
| Authentication | Custom headers (pre-selected, disabled) |
| DD_API_KEY | Your Datadog API key |
| DD_APPLICATION_KEY | Your Datadog Application key |

6. Select Next to review
7. Select Add connector

The Datadog connector type pre-populates both header keys (`DD_API_KEY` and `DD_APPLICATION_KEY`) and sets the authentication method to "Custom headers" automatically. The default URL is the US1 endpoint—update it if your organization is in a different region.

Once the connector shows Connected status, the Datadog MCP tools are automatically available to your agent. You can verify by checking the tools list in the connector details.

## Step 3: Create a Datadog subagent (optional)

Create a specialized subagent to give the AI focused Datadog observability expertise and better prompt responses.
1. Navigate to Builder > Subagents
2. Select Add subagent
3. Paste the following YAML configuration:

```yaml
api_version: azuresre.ai/v1
kind: AgentConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: DatadogObservabilityExpert
  display_name: Datadog Observability Expert
  system_prompt: |
    You are a Datadog observability expert with access to logs, metrics,
    APM traces, monitors, incidents, dashboards, hosts, services, and more
    via the Datadog MCP server.

    ## Capabilities

    ### Logs
    - Search logs using facets, tags, and time ranges with `search_datadog_logs`
    - Perform SQL-based log analysis with `analyze_datadog_logs` for aggregations, grouping, and statistical queries
    - Correlate log entries with traces and metrics

    ### Metrics
    - Query metric time series with `get_datadog_metric`
    - Get metric metadata, tags, and context with `get_datadog_metric_context`
    - Discover available metrics with `search_datadog_metrics`

    ### APM (Application Performance Monitoring)
    - Fetch complete traces with `get_datadog_trace`
    - Search distributed traces and spans with `search_datadog_spans`
    - Analyze service-level performance and latency patterns
    - Map service dependencies with `search_datadog_service_dependencies`

    ### Monitors & Alerting
    - Search monitors by name, tag, or status with `search_datadog_monitors`
    - Investigate triggered monitors and alert history
    - Correlate monitor alerts with underlying metrics and logs

    ### Incidents
    - Search incidents with `search_datadog_incidents`
    - Get incident details, timeline, and responders with `get_datadog_incident`
    - Correlate incidents with monitors, logs, and traces

    ### Infrastructure
    - Search hosts by name, tag, or status with `search_datadog_hosts`
    - List and discover services with `search_datadog_services`
    - Search dashboards with `search_datadog_dashboards`
    - Search events (monitor alerts, deployments) with `search_datadog_events`

    ### Notebooks
    - Search notebooks with `search_datadog_notebooks`
    - Retrieve notebook content with `get_datadog_notebook`

    ### Real User Monitoring
    - Search RUM events for frontend performance data with `search_datadog_rum_events`

    ## Best Practices

    When investigating incidents:
    - Start with `search_datadog_incidents` or `get_datadog_incident` for context
    - Check related monitors with `search_datadog_monitors`
    - Correlate with `search_datadog_logs` and `get_datadog_metric` for root cause
    - Use `get_datadog_trace` to inspect request flows for latency issues
    - Check `search_datadog_hosts` for infrastructure-level problems

    When analyzing logs:
    - Use `analyze_datadog_logs` for SQL-based aggregation queries
    - Use `search_datadog_logs` for individual log retrieval and filtering
    - Include time ranges to narrow results and reduce response size
    - Filter by service, host, or status to focus on relevant data

    When working with metrics:
    - Use `search_datadog_metrics` to discover available metric names
    - Use `get_datadog_metric_context` to understand metric tags and metadata
    - Use `get_datadog_metric` to query actual metric values with time ranges

    When handling errors:
    - If access is denied, explain which RBAC permission is needed
    - Suggest the user verify their Application key has `MCP Read` or `MCP Write`
    - For large traces that appear truncated, note this is a known limitation
  mcp_connectors:
    - datadog-mcp
  handoffs: []
```

4. Select Save

The `mcp_connectors` field references the connector name you created in Step 2. This gives the subagent access to all tools provided by the Datadog MCP server.

## Step 4: Add a Datadog skill (optional)

Skills provide contextual knowledge and best practices that help agents use tools more effectively. Create a Datadog skill to give your agent expertise in log queries, metric analysis, and incident investigation workflows.
1. Navigate to Builder > Skills
2. Select Add skill
3. Paste the following skill configuration:

````yaml
api_version: azuresre.ai/v1
kind: SkillConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: datadog_observability
  display_name: Datadog Observability
  description: |
    Expertise in Datadog's observability platform including logs, metrics,
    APM, monitors, incidents, dashboards, hosts, and services. Use for
    searching logs, querying metrics, investigating incidents, analyzing
    traces, inspecting monitors, and navigating Datadog data via the
    Datadog MCP server.
  instructions: |
    ## Overview

    Datadog is a cloud-scale observability platform for logs, metrics, APM
    traces, monitors, incidents, infrastructure, and more. The Datadog MCP
    server enables natural language interaction with your organization's
    Datadog data.

    **Authentication:** Two custom headers—`DD_API_KEY` (API key) and
    `DD_APPLICATION_KEY` (Application key with MCP permissions). All
    actions respect existing RBAC permissions.

    **Regional endpoints:** The MCP server URL varies by Datadog region
    (US1, US3, US5, EU1, AP1, AP2). Ensure the connector URL matches your
    organization's region.

    ## Searching Logs

    Use `search_datadog_logs` for individual log retrieval and
    `analyze_datadog_logs` for SQL-based aggregation queries.

    **Common log search patterns:**

    ```
    # Errors from a specific service
    service:payment-api status:error

    # Logs from a host in the last hour
    host:web-prod-01

    # Logs containing a specific trace ID
    trace_id:abc123def456

    # Errors with a specific HTTP status
    @http.status_code:500 service:api-gateway

    # Logs from a Kubernetes pod
    kube_namespace:production kube_deployment:checkout-service
    ```

    **SQL-based log analysis with `analyze_datadog_logs`:**

    ```sql
    -- Count errors by service in the last hour
    SELECT service, count(*) as error_count
    FROM logs
    WHERE status = 'error'
    GROUP BY service
    ORDER BY error_count DESC

    -- Average response time by endpoint
    SELECT @http.url_details.path, avg(@duration) as avg_duration
    FROM logs
    WHERE service = 'api-gateway'
    GROUP BY @http.url_details.path
    ```

    ## Querying Metrics

    Use `search_datadog_metrics` to discover metrics,
    `get_datadog_metric_context` for metadata, and `get_datadog_metric`
    for time series data.

    **Common metric patterns:**

    ```
    # System metrics
    system.cpu.user, system.mem.used, system.disk.used

    # Container metrics
    docker.cpu.usage, kubernetes.cpu.requests

    # Application metrics
    trace.servlet.request.hits, trace.servlet.request.duration

    # Custom metrics
    app.payment.processed, app.queue.depth
    ```

    Always specify a time range when querying metrics to avoid retrieving
    excessive data.

    ## Investigating Traces

    Use `get_datadog_trace` for complete trace details and
    `search_datadog_spans` for span-level queries.

    **Trace investigation workflow:**
    1. Search for slow or errored spans with `search_datadog_spans`
    2. Get the full trace with `get_datadog_trace` using the trace ID
    3. Identify the bottleneck service or operation
    4. Correlate with `search_datadog_logs` using the trace ID
    5. Check related metrics with `get_datadog_metric`

    ## Working with Monitors

    Use `search_datadog_monitors` to find monitors by name, tag, or status.

    **Common monitor queries:**

    ```
    # Find all triggered monitors
    Search for monitors with status "Alert"

    # Find monitors for a specific service
    Search for monitors tagged with service:payment-api

    # Find monitors by name
    Search for monitors matching "CPU" or "memory"
    ```

    ## Incident Investigation Workflow

    For structured incident investigation:
    1. `search_datadog_incidents` — find recent or active incidents
    2. `get_datadog_incident` — get full incident details and timeline
    3. `search_datadog_monitors` — check which monitors triggered
    4. `search_datadog_logs` — search for errors around the incident time
    5. `get_datadog_metric` — check key metrics for anomalies
    6. `get_datadog_trace` — inspect request traces for latency or errors
    7. `search_datadog_hosts` — verify infrastructure health
    8. `search_datadog_service_dependencies` — map affected services

    ## Working with Dashboards and Notebooks

    - Use `search_datadog_dashboards` to find dashboards by title or tag
    - Use `search_datadog_notebooks` and `get_datadog_notebook` for
      investigation notebooks that document past analyses

    ## Toolsets

    The Datadog MCP server supports toolsets via the `?toolsets=` query
    parameter on the endpoint URL. Available toolsets:

    | Toolset | Description |
    |---------|-------------|
    | `core` | Logs, metrics, traces, dashboards, monitors, incidents, hosts, services, events, notebooks (default) |
    | `alerting` | Monitor validation, groups, and templates |
    | `apm` | Trace analysis, span search, Watchdog insights, performance investigation |
    | `dbm` | Database Monitoring query plans and samples |
    | `error-tracking` | Error Tracking issues across RUM, Logs, and Traces |
    | `feature-flags` | Creating, listing, and updating feature flags |
    | `llmobs` | LLM Observability spans |
    | `networks` | Cloud Network Monitoring, Network Device Monitoring |
    | `onboarding` | Guided Datadog setup and configuration |
    | `security` | Code security scanning, security signals, findings |
    | `software-delivery` | CI Visibility, Test Optimization |
    | `synthetics` | Synthetic test management |

    To enable additional toolsets, append `?toolsets=core,apm,alerting`
    to the connector URL.

    ## Troubleshooting

    | Issue | Solution |
    |-------|----------|
    | 401/403 errors | Verify API key and Application key are correct and active |
    | No data returned | Check that Application key has `MCP Read` permission |
    | Wrong region | Ensure the connector URL matches your Datadog organization's region |
    | Truncated traces | Large traces may be truncated; this is a known limitation |
    | Tool not found | The tool may require a non-default toolset; update the connector URL |
    | Write operations fail | Verify Application key has `MCP Write` permission |
  mcp_connectors:
    - datadog-mcp
````

4. Select Save

### Reference the skill in your subagent

Update your subagent configuration to include the skill:

```yaml
spec:
  name: DatadogObservabilityExpert
  skills:
    - datadog_observability
  mcp_connectors:
    - datadog-mcp
```

## Step 5: Test the integration

1. Open a new chat session with your SRE Agent
2. Try these example prompts:

**Log analysis**
- Search for error logs from the payment-api service in the last hour
- Analyze logs to count errors by service over the last 24 hours
- Find all logs with HTTP 500 status from the api-gateway in the last 30 minutes
- Show me the most recent logs from host web-prod-01

**Metrics investigation**
- What is the current CPU usage across all production hosts?
- Show me the request rate and error rate for the checkout-service over the last 4 hours
- What metrics are available for the payment-api service?
- Get the p99 latency for the api-gateway service in the last hour

**APM and trace analysis**
- Find the slowest traces for the checkout-service in the last hour
- Get the full trace details for trace ID abc123def456
- What services depend on the payment-api?
- Search for errored spans in the api-gateway service from the last 30 minutes

**Monitor and alerting workflows**
- Show me all monitors currently in Alert status
- Find monitors related to the database-primary host
- What monitors are tagged with team:platform?
- Search for monitors matching "disk space" or "memory"

**Incident investigation**
- Show me all active incidents from the last 24 hours
- Get details for incident INC-12345 including the timeline
- What monitors triggered during the last production incident?
- Correlate the most recent incident with related logs and metrics

**Infrastructure and dashboards**
- Search for hosts tagged with env:production and team:platform
- List all dashboards related to "Kubernetes" or "EKS"
- What services are running in the production environment?
- Show me recent deployment events for the checkout-service

## Available tools

### Core toolset (default)

The core toolset is included by default and provides essential observability tools.
| Tool | Description |
|------|-------------|
| `search_datadog_logs` | Search logs by facets, tags, and time ranges |
| `analyze_datadog_logs` | SQL-based log analysis for aggregations and statistical queries |
| `get_datadog_metric` | Query metric time series with rollup and aggregation |
| `get_datadog_metric_context` | Get metric metadata, tags, and related context |
| `search_datadog_metrics` | List and discover available metrics |
| `get_datadog_trace` | Fetch a complete distributed trace by trace ID |
| `search_datadog_spans` | Search APM spans by service, operation, or tags |
| `search_datadog_monitors` | Search monitors by name, tag, or status |
| `get_datadog_incident` | Get incident details including timeline and responders |
| `search_datadog_incidents` | List and search incidents |
| `search_datadog_dashboards` | Search dashboards by title or tag |
| `search_datadog_hosts` | Search hosts by name, tag, or status |
| `search_datadog_services` | List and search services |
| `search_datadog_service_dependencies` | Map service dependency relationships |
| `search_datadog_events` | Search events (monitor alerts, deployments, custom events) |
| `get_datadog_notebook` | Retrieve notebook content by ID |
| `search_datadog_notebooks` | Search notebooks by title or tag |
| `search_datadog_rum_events` | Search Real User Monitoring events |

### Alerting toolset

Enable with `?toolsets=core,alerting` on the connector URL.

| Tool | Description |
|------|-------------|
| `validate_datadog_monitor` | Validate monitor configuration before creation |
| `get_datadog_monitor_templates` | Get monitor configuration templates |
| `search_datadog_monitor_groups` | Search monitor groups and their statuses |

### APM toolset

Enable with `?toolsets=core,apm` on the connector URL.

| Tool | Description |
|------|-------------|
| `apm_search_spans` | Advanced span search with APM-specific filters |
| `apm_explore_trace` | Interactive trace exploration and analysis |
| `apm_trace_summary` | Get a summary analysis of a trace |
| `apm_trace_comparison` | Compare two traces side by side |
| `apm_analyze_trace_metrics` | Analyze aggregated trace metrics and trends |

### Database Monitoring toolset

Enable with `?toolsets=core,dbm` on the connector URL.
| Tool | Description |
|------|-------------|
| `search_datadog_dbm_plans` | Search database query execution plans |
| `search_datadog_dbm_samples` | Search database query samples and statistics |

### Error Tracking toolset

Enable with `?toolsets=core,error-tracking` on the connector URL.

| Tool | Description |
|------|-------------|
| `search_datadog_error_tracking_issues` | Search error tracking issues across RUM, Logs, and Traces |
| `get_datadog_error_tracking_issue` | Get details of a specific error tracking issue |

### Feature Flags toolset

Enable with `?toolsets=core,feature-flags` on the connector URL.

| Tool | Description |
|------|-------------|
| `list_datadog_feature_flags` | List feature flags |
| `create_datadog_feature_flag` | Create a new feature flag |
| `update_datadog_feature_flag_environment` | Update feature flag settings for an environment |

### LLM Observability toolset

Enable with `?toolsets=core,llmobs` on the connector URL.

| Tool | Description |
|------|-------------|
| LLM Observability spans | Query and analyze LLM Observability span data |

### Networks toolset

Enable with `?toolsets=core,networks` on the connector URL.

| Tool | Description |
|------|-------------|
| Cloud Network Monitoring tools | Analyze cloud network traffic and dependencies |
| Network Device Monitoring tools | Monitor and troubleshoot network devices |

### Security toolset

Enable with `?toolsets=core,security` on the connector URL.

| Tool | Description |
|------|-------------|
| `datadog_code_security_scan` | Run code security scanning |
| `datadog_sast_scan` | Run Static Application Security Testing |
| `datadog_secrets_scan` | Scan for secrets and credentials in code |

### Software Delivery toolset

Enable with `?toolsets=core,software-delivery` on the connector URL.

| Tool | Description |
|------|-------------|
| `search_datadog_ci_pipeline_events` | Search CI pipeline execution events |
| `get_datadog_flaky_tests` | Identify flaky tests in CI pipelines |

### Synthetics toolset

Enable with `?toolsets=core,synthetics` on the connector URL.

| Tool | Description |
|------|-------------|
| `get_synthetics_tests` | List and get Synthetic test configurations |
| `edit_synthetics_tests` | Edit Synthetic test settings |
| `synthetics_test_wizard` | Guided wizard for creating Synthetic tests |

## Toolsets

The Datadog MCP server organizes tools into toolsets.
By default, only the core toolset is enabled. To enable additional toolsets, append the `?toolsets=` query parameter to the connector URL.

### Syntax

```
https://mcp.datadoghq.com/api/unstable/mcp-server/mcp?toolsets=core,apm,alerting
```

### Examples

| Use case | URL suffix |
|----------|------------|
| Default (core only) | No suffix needed |
| Core + APM analysis | `?toolsets=core,apm` |
| Core + Alerting + APM | `?toolsets=core,alerting,apm` |
| Core + Database Monitoring | `?toolsets=core,dbm` |
| Core + Security scanning | `?toolsets=core,security` |
| Core + CI/CD visibility | `?toolsets=core,software-delivery` |
| All toolsets | `?toolsets=core,alerting,apm,dbm,error-tracking,feature-flags,llmobs,networks,onboarding,security,software-delivery,synthetics` |

> [!TIP]
> Only enable the toolsets you need. Each additional toolset increases the number of tools exposed to the agent, which can increase token usage and may impact response quality. Start with core and add toolsets as needed.

### Updating the connector URL

To add toolsets after initial setup:

1. Navigate to Builder > Connectors
2. Select the datadog-mcp connector
3. Update the URL field to include the `?toolsets=` parameter
4. Select Save

## Troubleshooting

### Authentication issues

| Error | Cause | Solution |
|-------|-------|----------|
| 401 Unauthorized | Invalid API key or Application key | Verify both keys are correct and active in Organization Settings |
| 403 Forbidden | Missing RBAC permissions | Ensure the Application key has MCP Read and/or MCP Write permissions |
| Connection refused | Wrong regional endpoint | Verify the connector URL matches your Datadog organization's region |
| "Organization not allowlisted" | Preview access not granted | Contact Datadog support to request MCP server Preview access |

### Data and permission issues

| Error | Cause | Solution |
|-------|-------|----------|
| No data returned | Insufficient permissions or wrong time range | Verify Application key permissions; try a broader time range |
| Tool not found | Tool belongs to a non-default toolset | Add the required toolset to the `?toolsets=` parameter in the connector URL |
| Truncated trace data | Trace exceeds size limit | Large traces are truncated for context window efficiency; query specific spans instead |
| Write operation failed | Missing MCP Write permission | Add MCP Write permission to the Application key |
| Metric not found | Wrong metric name or no data in time range | Use `search_datadog_metrics` to discover available metric names |

### Verify the connection

Test the server endpoint directly:

```shell
curl -I "https://mcp.datadoghq.com/api/unstable/mcp-server/mcp" \
  -H "DD_API_KEY: <your_api_key>" \
  -H "DD_APPLICATION_KEY: <your_application_key>"
```

Expected response: 200 OK confirms authentication is working.

### Re-authorize the integration

If you encounter persistent issues:

1. Navigate to Organization Settings > Application Keys in Datadog
2. Revoke the existing Application key
3. Create a new Application key with the required MCP Read / MCP Write permissions
4. Update the connector in the SRE Agent portal with the new key

## Limitations

| Limitation | Details |
|------------|---------|
| Preview only | The Datadog MCP server is in Preview and not recommended for production use |
| Allowlisted organizations | Only organizations that have been allowlisted by Datadog can access the MCP server |
| Large trace truncation | Responses are optimized for LLM context windows; large traces may be truncated |
| Unstable API path | The endpoint URL contains /unstable/, indicating the API may change without notice |
| Toolset availability | Some toolsets may not be available depending on your Datadog plan and features enabled |
| Regional endpoints | You must use the endpoint matching your organization's region; cross-region queries are not supported |

## Security considerations

### How permissions work

- RBAC-scoped: All actions respect the RBAC permissions associated with the API and Application keys
- Key-based: Access is controlled through the API key (organization-level) and Application key (user or service account-level)
- Permission granularity: MCP Read enables read operations; MCP Write enables mutating operations

### Admin controls

Datadog administrators can:

- Create and revoke API and Application keys in Organization Settings
- Assign granular RBAC permissions (MCP Read, MCP Write) to Application keys
- Use service accounts to decouple access from individual user accounts
- Monitor MCP tool usage through the Datadog Audit Trail
- Scope Application keys to limit the blast radius of compromised credentials

The Datadog MCP server can read sensitive operational data including logs, metrics, and traces. Use service accounts with scoped Application keys, grant only the permissions your agent needs, and monitor the Audit Trail for unusual activity.

## Related content

- Datadog MCP Server documentation
- Datadog API and Application keys
- Datadog RBAC permissions
- Datadog Audit Trail
- Datadog regional sites
- MCP integration overview
- Build a custom subagent

# Shared Agent Context: How We Are Tackling Partner Agent Collaboration
Your Azure SRE agent detects a spike in error rates. It triages with cloud-native telemetry, but the root cause trail leads into a third-party observability platform your team also runs. The agent can't see that data. A second agent can, one that speaks Datadog or Dynatrace or whatever your team chose. The two agents talk to each other using protocols like MCP or directly via an API endpoint and come up with a remediation. The harder question is what happens to the conversation afterward.

## TL;DR

Two AI agents collaborate on incidents using two communication paths: a direct real-time channel (MCP) for fast investigation, and a shared memory layer that writes to systems your team already uses, like PagerDuty, GitHub Issues, or ServiceNow. No new tools to adopt. No ephemeral conversations that vanish when the incident closes.

## The problem

Most operational AI agents work in isolation. Your cloud monitoring agent doesn't have access to your third-party observability stack. Your Datadog specialist doesn't know what your Azure resource topology looks like. When an incident spans both, a human has to bridge the gap manually. At 2 AM. With half the context missing.

And even when two agents do exchange information directly, the conversation is ephemeral. The investigation ends, the findings disappear. The next on-call engineer sees a resolved alert with no record of what was tried, what was found, or why the remediation worked. The next agent that hits the same pattern starts over from scratch.

What we needed was somewhere for both agents to persist their findings, somewhere humans could see it too. And we really didn't want to force teams onto a new system just to get there.

## Two communication paths

### Direct agent-to-agent (real-time)

During an active investigation, the primary agent calls the partner agent directly. The partner runs whatever domain-specific analysis it's good at (log searches, span analysis, custom metric queries) and returns findings in real time.
This is the fast path. The direct channel uses MCP, so any partner agent can plug in without custom integration work. The primary agent doesn't need to understand the internals of Datadog or Dynatrace. It asks questions, gets answers.

### Shared memory (durable)

After the direct exchange, both agents write their actions and findings to external systems that humans already use. This is the durable path, the one that creates audit trails and makes handoffs work. The shared memory backends are systems your team already has open during an incident:

| Backend | What gets written | Good fit for |
|---------|-------------------|--------------|
| Incident platform (e.g., PagerDuty) | Timeline notes, on-call handoff context | Teams with alerting-centric workflows |
| Issue tracker (e.g., GitHub Issues) | Code-level findings, root cause analysis, action comments | Teams with dev workflow integration |
| ITSM system (e.g., ServiceNow) | Work notes, ITSM-compliant audit trail | Enterprise IT, regulated industries |

The important thing: this doesn't require a new system. Agents write to whatever your team already uses.

## How it works

| Step | Actor | What happens | Path |
|------|-------|--------------|------|
| 1 | Alert source | Monitoring fires an alert | — |
| 2 | Primary agent | Receives alert, triages, starts investigating with native tools | Internal |
| 3 | Primary agent | Calls partner agent for domain-specific analysis (third-party logs, spans) | Direct via MCP or API |
| 4 | Partner agent | Runs analysis, returns findings in real time | Direct via MCP or API |
| 5 | Primary agent | Correlates partner findings with native data, runs remediation | Internal |
| 6 | Both agents | Write findings, actions, and resolution to external systems | Shared memory via existing sources |
| 7 | Agent or human | Verifies resolution, closes incident | Shared memory via existing sources |

Steps 3 through 5 happen in real time over the direct channel. Nothing gets written to shared memory until the investigation has actual results. The investigation is fast; the record-keeping is thorough.
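The "investigate first, persist second" ordering can be sketched in a few lines. This is an illustrative model, not a real SRE Agent API: `SharedMemory` and `handle_incident` are hypothetical names standing in for the incident platform and the primary agent's loop.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class SharedMemory:
    """Append-only record store standing in for PagerDuty/GitHub/ServiceNow."""
    entries: list = field(default_factory=list)

    def append(self, note: str) -> None:
        # Additive only: no overwrites, no deletions.
        self.entries.append(note)

def handle_incident(alert: str,
                    partner_query: Callable[[str], str],
                    memory: SharedMemory) -> str:
    """Fast path first, durable path after findings exist."""
    findings = [f"triage: received {alert}"]      # native triage (step 2)
    findings.append(partner_query(alert))         # direct partner call (steps 3-4)
    resolution = "remediation applied"            # act on correlated data (step 5)
    for note in findings + [resolution]:          # persist only at the end (step 6)
        memory.append(note)
    return resolution

memory = SharedMemory()
result = handle_incident(
    "checkout-service error spike",
    lambda alert: f"partner finding: slow spans behind {alert}",
    memory,
)
```

Notice that `memory` stays empty until the investigation completes; the external system never sits on the critical path.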
## Who does what

In this system the primary agent owns the full incident lifecycle: detection, triage, investigation, remediation, closure. The partner agent gets called when the primary agent needs to see into a part of the stack it can't access natively. It does the specialized deep-dive, returns what it found, and the primary agent takes it from there. Both agents write to shared memory, and the primary agent acts on the proposed next steps.

| | Primary agent | Partner agent |
|--|---------------|---------------|
| Communication | Calls partner directly; writes to shared memory after | Responds to calls; writes enrichment to shared memory |
| Scope | Full lifecycle | Domain-specific deep-dive |
| Tools | Cloud-native monitoring, CLI, runbooks, issue trackers | Third-party observability APIs |
| Typical share | ~80% of investigation + all remediation | ~20%, specialized enrichment |

## Why shared context should live where humans already work

If your agent writes its findings to a system nobody checks, you've built a very expensive diary. Write them to a GitHub Issue, a ServiceNow ticket, a Jira epic, or whatever your team actually monitors, and the dynamics change: humans can participate without changing their workflow.

Your team already watches these systems. When an agent posts its reasoning and pending decisions to a place engineers already check, anyone can review or correct it using the tools they know. Comments, reactions, status updates. No custom approval UI. The collaboration features built into your workflow tool become the oversight mechanism for free.

That persistence pays off in a second way. Every entry the agent writes is a record that future runs can search. Instead of context that disappears when a conversation ends, you accumulate operational history. How was this incident type handled last time? What did the agent try? What did the human override? That history is retrievable by both people and agents through the same interface, without spinning up a separate vector database.
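As a concrete sketch of writing findings where humans already look, here is a comment posted to a GitHub Issue via the REST API (`POST /repos/{owner}/{repo}/issues/{number}/comments`). The repo, issue number, and token are placeholders; the request is built but not sent.

```python
import json
import urllib.request

def build_finding_comment(owner: str, repo: str, issue: int,
                          findings: str, token: str) -> urllib.request.Request:
    """Persist an agent's investigation notes as an issue comment,
    so review happens in the tracker the team already watches."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{issue}/comments"
    body = {"body": f"**Agent investigation notes**\n\n{findings}"}
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode(),
        headers={
            "Authorization": f"Bearer {token}",
            "Accept": "application/vnd.github+json",
        },
        method="POST",
    )

req = build_finding_comment(
    "contoso", "payments", 42,
    "Root cause: connection pool exhaustion after 14:02 deploy", "<token>",
)
```

Because comments are append-only by nature, this backend satisfies the append-only design principle without any extra machinery.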
You could build a dedicated agent database for all this. But nobody will look at it. Teams already have notifications, permissions, and audit trails configured in their existing tools. A purpose-built system means a new UI to learn, new permissions to manage, and one more thing competing for attention. Store context where people already look and you skip all of that. The best agent memory is the one your team is already reading. Design principles A few opinions that came out of watching real incidents: Investigate first, persist second. The primary agent calls the partner directly for real-time analysis. Both agents write to shared memory only after findings are collected. Investigation speed should never be bottlenecked by writes to external systems. Humans see everything through shared context. The direct path is agent-to-agent only, but the shared context layer is where humans can see the full picture and step in. Agents don't bypass human visibility. Append-only. Both agents' writes are additive. No overwrites, no deletions. You can always reconstruct the full history of an investigation. Backend-agnostic. Swapping PagerDuty for ServiceNow, or adding GitHub Issues alongside either one, is a connector config change. What this actually gets you The practical upside is pretty simple: investigations aren't waiting on writes to external systems, nothing is lost when the conversation ends, and the next on-call engineer picks up where the last one left off instead of starting over. Every action from both agents shows up in the systems humans already look at. Adding a new partner agent or a new shared memory backend is a connector change. The architecture doesn't care which specific tools your team chose. The fast path is for investigation. The durable path is for everything else.

HTTP Triggers in Azure SRE Agent: From Jira Ticket to Automated Investigation
Introduction Many teams run their observability, incident management, ticketing, and deployment on platforms outside of Azure—Jira, Opsgenie, Grafana, Zendesk, GitLab, Jenkins, Harness, or homegrown internal tools. These are the systems where alerts fire, tickets get filed, deployments happen, and operational decisions are made every day. HTTP Triggers make it easy to connect any of them to Azure SRE Agent—turning events from any platform into automated agent actions with a simple HTTP POST. No manual copy-paste, no context-switching, no delay between detection and response. In this blog, we'll demonstrate by connecting Jira to SRE Agent—so that every new incident ticket automatically triggers an investigation, and the agent posts its findings back to the Jira ticket when it's done. The Scenario: Jira Incident → Automated Investigation Your team manages production applications backed by Azure PostgreSQL Flexible Server. You use Jira for incident tracking. Today, when a P1 or P2 incident is filed, your on-call engineer has to manually triage—reading through the ticket, checking dashboards, querying logs, correlating recent deployments—before they can even begin working on a fix. Some teams have Jira automations that route or label tickets, but the actual investigation still starts with a human. HTTP Triggers let you bring SRE Agent directly into that existing workflow. Instead of adding another tool for engineers to check, the agent meets them where they already work. Jira ticket created → SRE Agent automatically investigates → Agent writes findings back to Jira The on-call engineer opens the Jira ticket and the investigation is already there—root cause analysis, evidence from logs and metrics, and recommended next steps—posted as a comment by the agent. Here's how to set this up. 
Architecture Overview Here's the end-to-end flow we'll build: Jira — A new issue is created in your project Logic App — The Jira connector detects the new issue, and the Logic App calls the SRE Agent HTTP Trigger, using Managed Identity for authentication HTTP Trigger — The agent prompt is rendered with the Jira ticket details (key, summary, priority, etc.) via payload placeholders Agent Investigation — The agent uses Jira MCP tools to read the ticket and search related issues, queries Azure logs, metrics, and recent deployments, then posts its findings back to the Jira ticket as a comment How HTTP Triggers Work Every HTTP Trigger you create in Azure SRE Agent exposes a unique webhook URL: https://<your-agent>.<instance>.azuresre.ai/api/v1/httptriggers/trigger/<trigger-id> When an external system sends a POST request to this URL with a JSON payload, the SRE Agent: Validates the trigger exists and is enabled Renders your agent prompt by injecting payload values into {payload.X} placeholders Creates a new investigation thread (or reuses an existing one) Executes the agent with the rendered prompt—autonomously or in review mode Records the execution in the trigger's history for auditing Payload Placeholders The real power of HTTP Triggers is in payload placeholders. When you configure a trigger, you write an agent prompt with {payload.X} tokens that get replaced at runtime with values from the incoming JSON. For example, a prompt like: Investigate Jira incident {payload.key}: {payload.summary} (Priority: {payload.priority}) Gets rendered with actual incident data before the agent sees it, giving it immediate context to begin investigating. If your prompt doesn't use any placeholders, the raw JSON payload is automatically appended to the prompt, so the agent always has access to the full context regardless. 
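To make the placeholder behavior concrete, here's an illustrative re-implementation of the rendering step. This is not the agent's actual code, and the exact formatting of the raw-payload fallback is an assumption:

```python
import json
import re

def render_prompt(template: str, payload: dict) -> str:
    """Replace {payload.X} tokens with values from the incoming JSON.

    If the template uses no placeholders, append the raw payload so the
    agent still receives full context.
    """
    pattern = re.compile(r"\{payload\.([A-Za-z0-9_]+)\}")

    if not pattern.search(template):
        return template + "\n\nPayload:\n" + json.dumps(payload, indent=2)

    # Unknown keys are left as-is here; the platform's behavior may differ.
    return pattern.sub(lambda m: str(payload.get(m.group(1), m.group(0))), template)

payload = {"key": "KAN-16", "summary": "Elevated API response times", "priority": "High"}
prompt = render_prompt(
    "Investigate Jira incident {payload.key}: {payload.summary} (Priority: {payload.priority})",
    payload,
)
# prompt == "Investigate Jira incident KAN-16: Elevated API response times (Priority: High)"
```

The same function also covers the no-placeholder case: a template without any `{payload.X}` tokens comes back with the serialized JSON appended.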
Thread Modes HTTP Triggers support two thread modes: New Thread (recommended for incidents): Every trigger invocation creates a fresh investigation thread, giving each incident its own isolated workspace Same Thread: All invocations share a single thread, building up a continuous conversation—useful for accumulating alerts from a single source Authenticating External Platforms The HTTP Trigger endpoint is secured with Azure AD authentication, ensuring only authorized callers can create agent investigation threads. Every request requires a valid bearer token scoped to the SRE Agent's data plane. External platforms like Jira send standard HTTP webhooks and don't natively acquire Azure AD tokens. To bridge this, you can use any Azure service that supports Managed Identity as an intermediary—this approach means zero secrets to store or rotate in the external platform. Common options include: Approach Best For Azure Logic Apps Native connectors for many platforms, no code required, visual workflow designer Azure Functions Simple relay with ~15 lines of code, clean URL for any webhook source API Management (APIM) Enterprise environments needing rate limiting, IP filtering, or API key management All three support Managed Identity and can transparently acquire the Azure AD token before forwarding requests to the SRE Agent HTTP Trigger. In this walkthrough, we'll use Azure Logic Apps with the built-in Jira connector. Step-by-Step: Connecting Jira to SRE Agent Prerequisites An Azure SRE Agent resource deployed in your subscription A Jira Cloud project with API token access An Azure subscription for the Logic App Step 1: Set Up the Jira MCP Connector First, let's give the SRE Agent the ability to interact with Jira directly. 
In your agent's MCP Tool settings, add the Jira connector: Setting Value Package mcp-atlassian (npm, version 2.0.0) Transport STDIO Configure these environment variables: Variable Value ATLASSIAN_BASE_URL https://your-site.atlassian.net ATLASSIAN_EMAIL Your Jira account email ATLASSIAN_API_TOKEN Your Jira API token Once the connector is added, select the specific MCP tools you want the agent to use. The connector provides 18 Jira tools out of 80 available. For our incident investigation workflow, the key tools include: jira-mcp_read_jira_issue — Read details from a Jira issue by issue key jira-mcp_search_jira_issues — Search for Jira issues using JQL (Jira Query Language) jira-mcp_add_jira_comment — Add a comment to a Jira issue (post investigation findings back) jira-mcp_list_jira_projects — List available Jira projects jira-mcp_create_jira_issue — Create a new Jira issue This gives the SRE Agent bidirectional access to Jira—it can read ticket details, fetch comments, query related issues, and post investigation findings back as comments on the original ticket. This closes the loop so your on-call engineers see the agent's analysis directly in Jira without switching tools. Step 2: Create the HTTP Trigger Navigate to Builder → HTTP Triggers in the SRE Agent UI and click Create. 
Setting Value Name jira-incident-handler Agent Mode Autonomous Thread Mode New Thread (one investigation per incident) Sub-Agent (optional) Select a specialized incident response agent Agent Prompt: A new Jira incident has been filed that requires investigation: Jira Ticket: {payload.key} Summary: {payload.summary} Priority: {payload.priority} Reporter: {payload.reporter} Description: {payload.description} Jira URL: {payload.ticketUrl} Investigate this incident by: Identifying the affected Azure resources mentioned in the description Querying recent metrics and logs for anomalies Checking for recent deployments or configuration changes Providing a structured analysis with Root Cause, Evidence, and Recommended Actions Once your investigation is complete, use the Jira MCP tools to post a summary of your findings as a comment on the original ticket ({payload.key}). After saving, enable the trigger and open the trigger detail view. Copy the Trigger URL—you'll need it for the Logic App. Step 3: Create the Azure Logic App In the Azure Portal, create a new Logic App: Setting Value Type Consumption (Multi-tenant, Stateful) Name jira-sre-agent-bridge Region Same region as your SRE Agent (e.g., East US 2) Resource Group Same resource group as your SRE Agent (recommended for simplicity) Step 4: Enable Managed Identity In the Logic App → Identity → System assigned: Set Status to On Click Save Step 5: Assign the SRE Agent Admin Role Navigate to your SRE Agent resource → Access control (IAM) → Add role assignment: Setting Value Role SRE Agent Admin Assign to Managed Identity → select your Logic App This grants the Logic App's Managed Identity the data-plane permissions needed to invoke HTTP Triggers. Important: The Contributor role alone is not sufficient. Contributor covers the Azure control plane, but SRE Agent uses a separate data plane with its own RBAC. The SRE Agent Admin role provides the required data-plane permissions. 
Step 6: Create the Jira Connection Open the Logic App designer. When adding the Jira trigger, it will prompt you to create a connection: Setting Value Connection name jira-connection Jira instance https://your-site.atlassian.net Email Your Jira email API Token Your Jira API token Step 7: Configure the Logic App Workflow Switch to the Logic App Code view and paste this workflow definition: { "definition": { "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#", "contentVersion": "1.0.0.0", "triggers": { "When_a_new_issue_is_created_(V2)": { "recurrence": { "interval": 3, "frequency": "Minute" }, "splitOn": "@triggerBody()", "type": "ApiConnection", "inputs": { "host": { "connection": { "name": "@parameters('$connections')['jira']['connectionId']" } }, "method": "get", "path": "/v2/new_issue_trigger/search", "queries": { "X-Request-Jirainstance": "https://YOUR-SITE.atlassian.net", "projectKey": "YOUR_PROJECT_ID" } } } }, "actions": { "Call_SRE_Agent_HTTP_Trigger": { "runAfter": {}, "type": "Http", "inputs": { "uri": "https://YOUR-AGENT.azuresre.ai/api/v1/httptriggers/trigger/YOUR-TRIGGER-ID", "method": "POST", "headers": { "Content-Type": "application/json" }, "body": { "key": "@{triggerBody()?['key']}", "summary": "@{triggerBody()?['fields']?['summary']}", "priority": "@{triggerBody()?['fields']?['priority']?['name']}", "reporter": "@{triggerBody()?['fields']?['reporter']?['displayName']}", "description": "@{triggerBody()?['fields']?['description']}", "ticketUrl": "@{concat('https://YOUR-SITE.atlassian.net/browse/', triggerBody()?['key'])}" }, "authentication": { "type": "ManagedServiceIdentity", "audience": "https://azuresre.dev" } } } }, "outputs": {}, "parameters": { "$connections": { "type": "Object", "defaultValue": {} } } }, "parameters": { "$connections": { "type": "Object", "value": { "jira": { "id": "/subscriptions/YOUR-SUB/providers/Microsoft.Web/locations/YOUR-REGION/managedApis/jira", 
"connectionId": "/subscriptions/YOUR-SUB/resourceGroups/YOUR-RG/providers/Microsoft.Web/connections/jira", "connectionName": "jira" } } } } } Replace the YOUR-* placeholders with your actual values. To find your Jira project ID, navigate to https://your-site.atlassian.net/rest/api/3/project/YOUR-PROJECT-KEY in your browser and find the "id" field in the JSON response. The critical piece is the authentication block: "authentication": { "type": "ManagedServiceIdentity", "audience": "https://azuresre.dev" } This tells the Logic App to automatically acquire an Azure AD token for the SRE Agent data plane and attach it as a Bearer token. No secrets, no expiration management, no manual token refresh. After pasting the JSON and clicking Save, switch back to the Designer view. The Logic App automatically generates the visual workflow from the code — you'll see the Jira trigger ("When a new issue is created (V2)") connected to the HTTP action ("Call SRE Agent HTTP Trigger") as a two-step flow, with all the field mappings and authentication settings already configured What Happens Inside the Agent When the HTTP Trigger fires, the SRE Agent receives a fully contextualized prompt with all the Jira incident data injected: A new Jira incident has been filed that requires investigation: Jira Ticket: KAN-16 Summary: Elevated API Response Times — PostgreSQL Table Lock Causing Request Blocking on Listings Service Priority: High Reporter: Vineela Suri Description: Severity: P2 — High. Affected Service: Production API (octopets-prod-postgres). Impact: End users experience slow or unresponsive listing pages. Jira URL: https://your-site.atlassian.net/browse/KAN-16 Investigate this incident by: Identifying the affected Azure resources mentioned in the description Querying recent metrics and logs for anomalies ... The agent then uses its configured tools to investigate—Azure CLI to query metrics, Kusto to analyze logs, and the Jira MCP connector to read the ticket for additional context. 
Once the investigation is complete, the agent posts its findings as a comment directly on the Jira ticket, closing the loop without any manual copy-paste. Each execution is recorded in the trigger's history with timestamp, thread ID, success status, duration, and an AI-generated summary—giving you full observability into your automated investigation pipeline. Extending to Other Platforms The pattern we built here works for any external platform that isn't natively supported by SRE Agent. The core architecture stays the same: External Platform → Auth Bridge (Managed Identity) → SRE Agent HTTP Trigger You only need to swap the inbound side of the bridge. For example: External Platform Auth Bridge Configuration Jira Logic App with Jira V2 connector (polling) OpsGenie Logic App with OpsGenie connector, or Azure Function relay receiving OpsGenie webhooks Datadog Azure Function relay or APIM policy receiving Datadog webhook notifications Grafana Azure Function relay or APIM policy receiving Grafana alert webhooks Splunk APIM with webhook endpoint and Managed Identity forwarding Custom / Internal tools Logic App HTTP trigger, Azure Function relay, or APIM — any service that supports Managed Identity The SRE Agent HTTP Trigger and the Managed Identity authentication remain the same regardless of the source platform. You configure the trigger once, set up the auth bridge, and connect as many external sources as needed. Each trigger can have its own tailored prompt, sub-agent, and thread mode optimized for the type of incoming event. 
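For teams that prefer the Azure Function relay from the table above, the core of it is small. This sketch follows the App Service/Functions managed identity endpoint convention (`IDENTITY_ENDPOINT` and `IDENTITY_HEADER` environment variables, `api-version=2019-08-01`); the trigger URL and payload shape are placeholders, and error handling is omitted:

```python
import json
import os
import urllib.parse
import urllib.request

# SRE Agent data-plane audience, as used in the Logic App example above.
AUDIENCE = "https://azuresre.dev"

def build_token_request(endpoint: str, identity_header: str, audience: str):
    """Build the managed-identity token request (App Service/Functions convention)."""
    query = urllib.parse.urlencode({"resource": audience, "api-version": "2019-08-01"})
    return f"{endpoint}?{query}", {"X-IDENTITY-HEADER": identity_header}

def relay(trigger_url: str, payload: dict) -> int:
    """Acquire an Azure AD token, then forward the webhook payload to the trigger."""
    url, headers = build_token_request(
        os.environ["IDENTITY_ENDPOINT"], os.environ["IDENTITY_HEADER"], AUDIENCE
    )
    with urllib.request.urlopen(urllib.request.Request(url, headers=headers)) as resp:
        token = json.load(resp)["access_token"]

    req = urllib.request.Request(
        trigger_url,
        data=json.dumps(payload).encode(),
        headers={"Content-Type": "application/json", "Authorization": f"Bearer {token}"},
        method="POST",
    )
    with urllib.request.urlopen(req) as resp:
        return resp.status
```

Swapping the source platform only changes how you map its webhook body into the `payload` dict; the token acquisition and forward stay identical.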
Key Takeaways HTTP Triggers extend Azure SRE Agent's reach to any external platform: Connect What You Use: If your incident platform isn't natively supported, HTTP Triggers provide the integration point—no code changes to SRE Agent required Secure by Design: Azure AD authentication with Managed Identity keeps the data plane protected while making integration straightforward through standard Azure services Bidirectional with MCP: Combine HTTP Triggers (inbound) with MCP connectors (outbound) for full round-trip integration—receive incidents automatically and post findings back to the source platform Full Observability: Every trigger execution is recorded with timestamps, thread IDs, duration, and AI-generated summaries Flexible Context Injection: Payload placeholders let you craft precise investigation prompts from incident data, while raw payload passthrough ensures the agent always has full context Getting Started HTTP Triggers are available now in the Azure SRE Agent platform: Create a Trigger: Navigate to Builder → HTTP Triggers → Create. Define your agent prompt with {payload.X} placeholders Set Up an Auth Bridge: Use Logic Apps, Azure Functions, or APIM with Managed Identity to handle Azure AD authentication Connect Your Platform: Point your external platform at the bridge and create a test event Within minutes, you'll have an automated pipeline that turns every incident ticket into an AI-driven investigation. Learn More HTTP Triggers Documentation Agent Hooks Blog Post — Governance controls for automated investigations YAML Schema Reference SRE Agent Getting Started Guide Ready to extend your SRE Agent to platforms it doesn't support natively? Set up your first HTTP Trigger today at sre.azure.com.

Get started with NeuBird Hawkeye MCP server in Azure SRE Agent
Integrate NeuBird Hawkeye MCP with Azure SRE Agent TL;DR If your infrastructure spans multiple clouds, say Azure and GCP, or Azure alongside any other cloud provider, investigating incidents means jumping between completely separate consoles, log systems, and monitoring stacks. Azure SRE Agent now integrates with NeuBird Hawkeye via Model Context Protocol (MCP), so you can investigate incidents across all of your clouds and monitoring tools from a single conversation. Key benefits: 90-second investigations vs 3-4 hours of manual dashboard-hopping Multi-cloud support - Azure, GCP, and other cloud providers investigated from a single conversation 42 MCP tools across 7 categories for investigation, analysis, and remediation Real-time streaming progress - watch investigations unfold step-by-step (v2.0+) MTTR tracking and continuous improvement metrics The problem: incidents don't stay in one cloud When an alert fires at 3 AM, your on-call engineer doesn't just need to find the problem — they need to figure out which cloud it's in. A single incident can involve an Azure Function calling a GCP Cloud Run service, with logs split across Azure Monitor and GCP Cloud Logging. Here's what that looks like: Challenge Time Cost Correlate signals across multiple monitoring tools 30-45 minutes Query logs and metrics from multiple clouds 45-60 minutes Piece together the chain of events 30-45 minutes Identify root cause and develop fixes 60-90 minutes Total 3-4 hours Sound familiar? "Is it the database? The cache? The load balancer? Let me check the GCP console... now Azure Monitor... now the other logging stack... wait, what time zone is this in?"
What NeuBird Hawkeye does NeuBird Hawkeye is an autonomous incident investigation platform that connects to your cloud providers and uses AI to: Core capabilities: Investigate alerts from your monitoring tools automatically Query multiple data sources across cloud providers and observability platforms Generate detailed RCAs with incident timelines Provide corrective actions with ready-to-execute scripts Learn from your architecture through customizable instructions Supported Integrations: Category Platforms Cloud Providers Azure, Google Cloud Platform, AWS Monitoring Tools Datadog, Grafana, Dynatrace, New Relic Incident Management PagerDuty, ServiceNow, FireHydrant, Incident.io Log Aggregation CloudWatch, Azure Monitor, Google Cloud Logging How the integration works With the new Hawkeye MCP server integration, Azure SRE Agent leverages Hawkeye's autonomous investigation capabilities through natural language conversation. What is Model Context Protocol (MCP)? Model Context Protocol is an open standard that enables AI agents to securely connect to external tools and data sources. It's like a universal adapter for connecting LLMs to the real world. How it connects: Your SRE Agent can now talk to Hawkeye, which talks to Azure, GCP, and other cloud platforms—all through a single conversational interface. 42 MCP tools across 7 categories Category Tool Count Examples Projects 5 List, create, update, delete projects Connections 10 Azure, GCP, and other cloud provider connections Investigations 11 List alerts, investigate, monitor progress, RCA, follow-ups Instructions 7 Create, test, validate, apply investigation rules Analytics 4 MTTR, time saved, quality scores, session reports Discovery 2 Explore resources across all project connections Help 1 Interactive guidance system Remote MCP server Azure SRE Agent connects to the hosted Hawkeye MCP server via Streamable-HTTP transport. No local installation or Node.js required. 
Mode Best For How It Works Streamable-HTTP All users, teams, enterprise Connect to hosted MCP endpoint at https://<your-deployment-name>.app.neubird.ai/mcp Supports secure authentication via email/password headers or bearer tokens (JWT). Example investigation walkthrough Here's an example investigation with Azure SRE Agent + Hawkeye: Step 1: List Uninvestigated Alerts User: Show me uninvestigated alerts from the last 24 hours Hawkeye response: Hawkeye Projects: - Production (UUID: b4b902d4-a540-466b-96cf-4ec2f734b010) Uninvestigated Alerts (Production): 1. [P1] High CPU usage on api-service pods - 3 instances affected 2. [P2] Database connection pool exhausted - payment-service 3. [P3] Increased error rate in user-authentication endpoint Step 2: Investigate an Alert User: Investigate the high CPU alert The agent calls hawkeye_investigate_alert and streams real-time progress (v2.0+): Preparing telemetry sources [2s] Loading investigation context [3s] Classifying datasources (Step 1) [18s] Fine-tuning datasources [26s] Consulting telemetry (Query 7) [1m 6s] Analyzing telemetry data (Analysis 12) [1m 13s] Summarizing results [1m 29s] Investigation completed! Here's what I found... Step 3: Get Root Cause Analysis User: Show me the root cause analysis Hawkeye provides a comprehensive RCA: Incident summary Field Value Severity P1 Duration 45 minutes Status Resolved Affected Services api-service (3 pods) Root cause A memory leak in the api-service v2.3.1 deployment caused garbage collection pressure, leading to CPU exhaustion. The leak originated from unclosed database connections in the /users/search endpoint. 
Timeline

| Time | Event |
| --- | --- |
| 14:00 UTC | Deployment of api-service v2.3.1 completed |
| 14:15 UTC | Memory usage begins climbing (500MB → 1.2GB) |
| 14:30 UTC | GC pressure increases, CPU spikes to 95%+ |
| 14:35 UTC | Alert fired: High CPU usage |
| 14:45 UTC | Auto-scaling triggered (3 → 6 pods) |
| 14:50 UTC | Investigation identified unclosed connections |
| 14:55 UTC | Rollback initiated to v2.3.0 |

Corrective actions

Immediate Fix:

```shell
# Rollback to previous stable version
kubectl set image deployment/api-service api-service=api-service:v2.3.0

# Force restart to clear leaked connections
kubectl rollout restart deployment/api-service
```

Preventive measures:

- Add connection pool monitoring to your cloud monitoring dashboard
- Implement database connection lifecycle tests in CI pipeline
- Add memory leak detection to staging environment tests

Time saved

| Metric | Value |
| --- | --- |
| Manual investigation estimate | 3-4 hours |
| Hawkeye investigation time | 90 seconds |
| Time saved | ~3.5 hours |

Step 4: Follow-Up Questions

The conversation doesn't stop at the RCA. You can ask intelligent follow-ups:

User: Has this happened before?

Agent: Checking investigation history... Yes, a similar pattern was detected 2 weeks ago in the same service. The previous incident was also caused by unclosed database connections, but in a different endpoint (/orders/list). Recommendation: Implement connection lifecycle testing across all endpoints.

User: What tests should we add to catch this?

Agent: Based on the RCA, I recommend:

1. Connection pool exhaustion test:
   - Simulate high request volume to database-heavy endpoints
   - Monitor for unclosed connections over 1000 requests
2. Memory leak detection:
   - Run load tests with memory profiling enabled
   - Alert if memory growth exceeds 20% over 10-minute baseline
3. Integration test for connection lifecycle:
   - Verify all database connections are properly closed
   - Check connection pool statistics after each test run

Key capabilities 1. Multi-Cloud Investigation Hawkeye unifies investigation across all your connected infrastructure.
Example scenario: An alert fires about an Azure Function returning errors. Hawkeye can: Query Azure Monitor for function execution metrics Check GCP Cloud Logging for upstream API errors on the GCP side Review GCP Cloud Monitoring metrics for the dependent Cloud Run service Correlate with recent deployments in GitHub Actions or Azure DevOps "Finally, one place to investigate instead of 7 browser tabs!" 2. Instruction Management Customize how Hawkeye investigates incidents by creating instructions: Instruction Type Purpose Example SYSTEM Provide architecture context "We use microservices on Kubernetes with PostgreSQL and Redis" FILTER Reduce investigation noise "Only investigate P1 and P2 incidents" RCA Guide investigation steps "For database issues, check slow queries and connection pools first" GROUPING Group related alerts "Group alerts from the same service within 5 minutes" Instruction testing workflow Before deploying instructions to production, test them on past investigations: Step Action Tool 1 Validate content hawkeye_validate_instruction 2 Apply to test session hawkeye_apply_session_instruction 3 Rerun investigation hawkeye_rerun_session 4 Compare RCAs Manual review 5 Measure improvement Check quality score 6 Deploy if better hawkeye_create_project_instruction Note: Test instruction changes on historical data before applying them to live investigations. No more "oops, that filter was too aggressive!" 3. Analytics and Continuous Improvement Track the effectiveness of your incident response process: Metric What It Measures MTTR Mean Time to Resolution Time Saved Efficiency gains vs manual investigation Quality Score Accuracy and completeness of RCAs Noise Reduction Percentage of duplicate/grouped alerts Use cases for analytics: Justify investment in SRE tooling to leadership Demonstrate continuous improvement over time Identify patterns in recurring incidents Measure impact of instruction changes 4. 
Proactive Investigation You don't need an alert to investigate. Create manual investigations for proactive analysis: User: Investigate potential memory leak in user-api pods. Memory usage increased from 500MB to 1.2GB between 8am-10am UTC today. Hawkeye will: Query metrics for the specified time range Correlate with deployment events Check for similar patterns in the past Provide root cause analysis and recommendations When to use proactive investigation: Use Case Example Pre-production testing "Investigate performance regression in staging" Performance analysis "Why did latency increase after the last deploy?" Capacity planning "Analyze memory growth trends over the past month" Post-incident deep dive "What else happened during that outage?" Setup guide Prerequisites Azure SRE Agent resource Active Hawkeye account (contact NeuBird to get started) At least one connected cloud provider in Hawkeye (Azure, GCP, etc.) Step 1: Add the Remote MCP Connector Navigate to your SRE Agent at sre.azure.com (e.g., https://sre.azure.com/agents/subscriptions/3eaf90b4-f4fa-416e-a0aa-ac2321d9decb/resourceGroups/sre-agent/providers/Microsoft.App/agents/dbandaru-pagerduty ) Go to Builder > Connectors Click Add connector > MCP server (User provided connector) Field Value Name hawkeye-mcp Connection type Streamable-HTTP URL https://<your-deployment-name>.app.neubird.ai/mcp Authentication Custom headers Authentication headers: Header Value X-Hawkeye-Email Your Hawkeye email X-Hawkeye-Password Your Hawkeye password Or use bearer token (JWT) for CI/CD: Header Value Authorization Bearer <your-jwt-token> To obtain a bearer token: curl -s -X POST "https://<your-deployment-name>.app.neubird.ai/api/v1/user/login" \ -H "Content-Type: application/json" \ -d '{"email": "your@email.com", "password": "your-password"}' \ | jq -r '.access_token' Step 2: Create a Hawkeye skill After adding the connector, create a skill that knows how to use the Hawkeye tools. 
The skill has a system prompt tuned for incident investigation and a reference to your MCP connector.

1. In the left navigation, select Builder > Skills
2. Click Add skill
3. Paste the following YAML configuration (see below)
4. Click Save

```yaml
api_version: azuresre.ai/v1
kind: AgentConfiguration
metadata:
  owner: your-team@contoso.com
  version: "1.0.0"
spec:
  name: HawkeyeInvestigator
  display_name: Hawkeye Incident Investigator
  system_prompt: |
    You are an incident investigation specialist with access to NeuBird
    Hawkeye's autonomous investigation platform.

    ## Capabilities

    ### Finding alerts
    - List uninvestigated alerts from the last N hours/days
    - Filter alerts by severity (P1, P2, P3, P4)
    - Search alerts by keyword or service name

    ### Running investigations
    - Investigate existing alerts by alert ID
    - Create manual investigations for proactive analysis
    - Monitor investigation progress in real-time

    ### Root cause analysis
    - Retrieve detailed RCA reports with incident timelines
    - View chain of thought and reasoning
    - Get data sources and queries consulted
    - Ask follow-up questions about incidents

    ### Remediation
    - Execute corrective action scripts
    - Implement preventive measures
    - Generate post-mortem documentation

    ### Project management
    - List and switch between Hawkeye projects
    - View connected data sources and sync status
    - Create and manage investigation instructions
    - Get organization-wide incident analytics (MTTR, time saved)

    ## Best practices
    - Start with uninvestigated alerts from the last 24 hours
    - Investigations typically complete in 30-90 seconds
    - First investigation may take 5-10 minutes while connections sync
    - Review corrective actions before executing

    ## Permissions
    All investigations use the connected data sources in your Hawkeye
    project. Ensure connections are properly synced before investigating.
  mcp_connectors:
    - hawkeye-mcp
  handoffs: []
```

The mcp_connectors field references the connector name from Step 1. This gives the skill access to all 42 Hawkeye tools.
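Before saving, it can be worth sanity-checking an edited skill config. The checks below are illustrative only (the platform performs its own validation); the `config` dict mirrors the parsed form of the YAML above, for example as returned by yaml.safe_load:

```python
def validate_skill(config: dict, known_connectors: set) -> list:
    """Return a list of problems; an empty list means the config looks sane."""
    problems = []
    if config.get("kind") != "AgentConfiguration":
        problems.append("kind must be AgentConfiguration")
    spec = config.get("spec", {})
    for field in ("name", "system_prompt", "mcp_connectors"):
        if not spec.get(field):
            problems.append(f"spec.{field} is missing or empty")
    # Every referenced connector must exist under Builder > Connectors.
    for connector in spec.get("mcp_connectors", []):
        if connector not in known_connectors:
            problems.append(f"unknown connector: {connector}")
    return problems

config = {
    "api_version": "azuresre.ai/v1",
    "kind": "AgentConfiguration",
    "spec": {
        "name": "HawkeyeInvestigator",
        "system_prompt": "You are an incident investigation specialist...",
        "mcp_connectors": ["hawkeye-mcp"],
    },
}
```

A typo in the connector name is the most common failure mode here, since the skill silently loses access to all 42 tools if the reference doesn't resolve.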
Customizing the skill: Edit the system prompt to match your team's workflow. For example, add instructions like "Always check P1 alerts first" or "Include deployment history in every investigation." The YAML above is a starting point. Step 3: Test the Integration Open a chat session with your SRE Agent Type /agent and select HawkeyeInvestigator Try these prompts: Show me uninvestigated alerts from the last 24 hours List all Hawkeye projects and their connections Investigate the first P1 alert Show me the root cause analysis What corrective actions are recommended? Has this happened before? Security Authentication methods Method Headers Best For Email/Password X-Hawkeye-Email + X-Hawkeye-Password Simple setup, most use cases Bearer Token (JWT) Authorization: Bearer <token> CI/CD pipelines, OAuth, enterprise Data security Encrypted traffic - HTTPS with TLS 1.2+ Read-only access to cloud providers and monitoring tools SOC 2 compliant - Secure data processing environment RBAC support - Role-based access at project level Access controls Each user authenticates with their own Hawkeye credentials Investigations scoped to connected data sources in your project Respects existing IAM and RBAC policies Security note: Store credentials in environment variables, never in config files. Hawkeye only needs read access to investigate. 
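The "credentials in environment variables" advice can be wired up with a small helper that produces whichever header style you're using. The HAWKEYE_* variable names here are my own convention, not Hawkeye's; only the header names come from the table above:

```python
import os

def hawkeye_auth_headers(env=None) -> dict:
    """Build auth headers: prefer a bearer token (CI/CD), else email/password."""
    env = os.environ if env is None else env
    token = env.get("HAWKEYE_TOKEN")
    if token:
        return {"Authorization": f"Bearer {token}"}
    return {
        "X-Hawkeye-Email": env["HAWKEYE_EMAIL"],
        "X-Hawkeye-Password": env["HAWKEYE_PASSWORD"],
    }
```

Keeping the fallback logic in one place means pipelines and local setups share the same code path, with no credentials ever landing in a config file.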
Available MCP tools (42)

Project tools (5)

| Tool | Description |
| --- | --- |
| hawkeye_list_projects | List all Hawkeye projects |
| hawkeye_create_project | Create a new project |
| hawkeye_get_project_details | Get project configuration |
| hawkeye_update_project | Update project name or description |
| hawkeye_delete_project | Delete a project (requires confirmation) |

Connection tools (10)

| Tool | Description |
| --- | --- |
| hawkeye_list_connections | List all available connections |
| hawkeye_create_aws_connection | Create AWS connection with IAM role |
| hawkeye_create_datadog_connection | Create Datadog connection with API keys |
| hawkeye_wait_for_connection_sync | Wait for connection to reach SYNCED state |
| hawkeye_add_connection_to_project | Link connections to a project |
| hawkeye_list_project_connections | List connections for a specific project |

Plus 4 additional tools for Azure, GCP, and other connections.

Investigation tools (11)

| Tool | Description |
| --- | --- |
| hawkeye_list_sessions | List investigation sessions with filtering |
| hawkeye_investigate_alert | Investigate an alert (supports real-time streaming) |
| hawkeye_create_manual_investigation | Create investigation from custom prompt (supports streaming) |
| hawkeye_get_investigation_status | Get real-time progress with step-by-step breakdown |
| hawkeye_get_rca | Retrieve root cause analysis |
| hawkeye_continue_investigation | Ask follow-up questions on completed investigations |
| hawkeye_get_chain_of_thought | View investigation reasoning steps |
| hawkeye_get_investigation_sources | List data sources consulted |
| hawkeye_get_investigation_queries | List queries executed during investigation |
| hawkeye_get_follow_up_suggestions | Get suggested follow-up questions |
| hawkeye_get_rca_score | Get investigation quality score |

Instruction tools (7)

| Tool | Description |
| --- | --- |
| hawkeye_list_project_instructions | List project instructions with type/status filtering |
| hawkeye_create_project_instruction | Create SYSTEM/FILTER/RCA/GROUPING instruction |
| hawkeye_validate_instruction | Validate instruction content before applying |
| hawkeye_apply_session_instruction | Apply instruction to session for testing |
| hawkeye_rerun_session | Rerun investigation with updated instructions |

Plus 2 additional tools to update and delete instructions.

Analytics tools (4)

| Tool | Description |
| --- | --- |
| hawkeye_get_incident_report | Get organization-wide analytics (MTTR, time saved) |
| hawkeye_inspect_session | Get session metadata |
| hawkeye_get_session_report | Get summary reports for multiple sessions |
| hawkeye_get_session_summary | Get detailed analysis and scoring for a session |

Discovery tools (2)

| Tool | Description |
| --- | --- |
| hawkeye_discover_project_resources | Explore available resources across all project connections |
| hawkeye_list_connection_resource_types | Get resource types for connection type and telemetry type |

Help tools (1)

| Tool | Description |
| --- | --- |
| hawkeye_get_guidance | Interactive help system with embedded knowledge base |

Use cases

1. Faster Incident Response

| Phase | Before Hawkeye | After Hawkeye |
| --- | --- | --- |
| Alert detection | Alert notification | Alert notification |
| Investigation | Log into multiple cloud consoles | Ask: "Investigate this alert" |
| Correlation | Manual log/metric analysis | Automated multi-source query |
| Root cause | 2-4 hours | 2-3 minutes |
| Remediation | Write runbook, execute | Copy/paste bash script, execute |

Result: roughly 95% reduction in MTTR for common incident types.

2. Knowledge Retention

The problem:
- Senior engineer leaves
- Tribal knowledge lost
- Junior engineers struggle with the same issues

The Hawkeye solution:
- Capture investigation patterns through instructions
- Preserve institutional knowledge in reusable rules
- Train new engineers with past investigation history

3. Reduced Toil

Common repetitive investigations:

| Issue Type | Manual Time | Hawkeye Time | Frequency |
| --- | --- | --- | --- |
| Database connection issues | 2 hours | 90 seconds | 3x/week |
| Pod restart loops | 1.5 hours | 60 seconds | 5x/week |
| Deployment failures | 3 hours | 2 minutes | 2x/week |

Result: engineers spend more time on prevention and architecture, less on firefighting.

4.
Cross-Team Collaboration

Platform team provides:
- SYSTEM instructions describing architecture
- FILTER instructions for noise reduction
- RCA instructions for common patterns

Application team benefits:
- Investigations leverage platform context
- No need for deep infrastructure knowledge
- Consistent incident response across teams

5. Continuous Learning

Track and improve over time:

| Month | MTTR | Time Saved | Quality Score | Noise Reduction |
| --- | --- | --- | --- | --- |
| Month 1 | 45 min | 15 hours | 7.2/10 | 20% |
| Month 3 | 12 min | 45 hours | 8.5/10 | 55% |
| Month 6 | 3 min | 90 hours | 9.1/10 | 78% |

Result: data-driven improvement of incident response processes.

Next steps

The Hawkeye MCP integration is available now for all Azure SRE Agent customers.

Get started:
1. Contact NeuBird to set up a Hawkeye account
2. Connect your cloud providers (Azure, GCP, etc.)
3. Add the Hawkeye MCP connector to your SRE Agent
4. Create a Hawkeye skill in Builder > Skills
5. Start investigating!

Learn more:
- Hawkeye MCP documentation
- Tool reference (all 42 tools)
- Advanced workflows
- hawkeye-mcp-server on npm
- NeuBird help documentation
- Azure SRE Agent MCP integration guide
- NeuBird AI

Need OAuth support? Contact NeuBird support: support@neubird.ai

Try it out

Ready to get started? Quick start checklist:
1. Sign up for Hawkeye at https://neubird.ai/contact-us/
2. Connect your cloud infrastructure (Azure, GCP, etc.)
3. Install the MCP connector in Azure SRE Agent
4. Create a Hawkeye skill in Builder > Skills
5. Test with "Show me uninvestigated alerts"
6. Investigate your first incident in under 2 minutes!

Questions? Drop a comment below or reach out to the Azure SRE Agent team.

Want to see Hawkeye in action? Request a demo from NeuBird: https://neubird.ai/contact-us/

Azure SRE Agent helps SRE teams build automated incident response workflows. Learn more at aka.ms/sreagent.

Tags: #Azure #SREAgent #NeuBird #Hawkeye #MCP #IncidentResponse #DevOps #SRE #AI #Automation #CloudOps #MTTR #RootCauseAnalysis

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself.

Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis.

First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic's 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives.

The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline.

That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent.

Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible.

Where we started

In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction.
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren't measuring the agent's reasoning - we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start.

We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived.

We needed an agent powerful enough to take this toil off us. An agent that could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale.

The Inversion: Three bets

The problem we faced was structural, and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself.

The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance.
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios.

Three architectural decisions made this possible, and each one compounded on the last.

Bet 1: The Filesystem as the Agent's World

Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over - source code, runbooks, query schemas, past investigation notes - is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint.

This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it.

When we materialized the agent's world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents.

But interface design is only half the story. The other half is what you put inside it.

Code Repositories: the highest-leverage context

Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it.
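The single-interface idea from Bet 1 can be sketched as a tiny workspace layer: every heterogeneous resource is reduced to files, and the agent gets a couple of generic tools instead of bespoke APIs. The tool names and layout here are illustrative, not the product's actual interface.

```python
import re
from pathlib import Path

class Workspace:
    """Everything the agent reasons over - code, runbooks, memory - is a file."""

    def __init__(self, root: str):
        self.root = Path(root)

    def read_file(self, rel_path: str) -> str:
        return (self.root / rel_path).read_text()

    def grep(self, pattern: str, glob: str = "**/*") -> list[str]:
        """Return 'path:line:text' hits, like grep -rn over the workspace."""
        hits = []
        for p in sorted(self.root.glob(glob)):
            if not p.is_file():
                continue
            for i, line in enumerate(p.read_text().splitlines(), 1):
                if re.search(pattern, line):
                    hits.append(f"{p.relative_to(self.root)}:{i}:{line.strip()}")
        return hits
```

Because runbooks, memory files, and source all live behind the same generic calls, a new resource type needs no new tool. It just needs to be written into the workspace.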
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

- Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.
- Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.
- Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes - the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In an SRE context, embedding similarity is a weak proxy for relevance. "KV cache regression" and "prompt prefix instability" may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance.

We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
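A minimal sketch of what navigating memory as files can look like: start from a semantic entry point and follow Markdown links as evidence accumulates, instead of issuing an embedding query up front. The link syntax and traversal policy here are illustrative assumptions.

```python
import re
from pathlib import Path

LINK = re.compile(r"\[[^\]]*\]\(([^)]+\.md)\)")  # matches [label](path.md)

def navigate_memory(root: Path, start: str = "overview.md", max_hops: int = 5) -> list[str]:
    """Follow links breadth-first from the entry point; return the visit order."""
    queue, visited = [start], []
    while queue and len(visited) < max_hops:
        rel = queue.pop(0)
        path = root / rel
        if rel in visited or not path.exists():
            continue
        visited.append(rel)
        # In the real agent the model decides which links are worth following;
        # this sketch simply enqueues all of them.
        queue.extend(LINK.findall(path.read_text()))
    return visited
```

The design choice this illustrates: relevance emerges from traversal, so there is nothing to chunk, embed, or re-rank.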
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed.

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism.

Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with the gh CLI, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.

Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it.

We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

- Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
- Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
- Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
- Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast.
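The context hooks from Bet 2, the same layered context that fills the budget, can be sketched as a small assembly step at prompt construction time. The hook names follow the post; the rendering format and section ordering are illustrative assumptions.

```python
def build_system_context(hooks: dict[str, str]) -> str:
    """Render the available context hooks into one layered prompt preamble."""
    order = ["connectors", "repositories", "knowledge_map", "resource_topology"]
    sections = []
    for name in order:
        if name in hooks:  # hooks are optional; absent ones are simply skipped
            title = name.replace("_", " ").title()
            sections.append(f"## {title}\n{hooks[name]}")
    return "\n\n".join(sections)

preamble = build_system_context({
    "connectors": "- Log Analytics (workspace: contoso-prod)\n- Grafana",
    "resource_topology": "- sub: contoso -> rg: rg-app -> containerapp: web",
})
print(preamble.splitlines()[0])  # prints "## Connectors"
```

Each hook contributes a titled section before the agent takes its first action, so a cold start already knows what exists and where to look.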
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal.

Tool result compression via the filesystem

Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive.

Context Pruning and Auto Compact

Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies.

Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs, keeping the window focused on what still matters.

Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work.

Parallel subagents

The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another.

The Feedback loop

These architectural bets have enabled us to close the original scaling gap.
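The spill-to-file pattern from Bet 3 can be sketched in a few lines: the full tool output is persisted in the workspace, and only a filtered slice enters the model's context. File naming, record shape, and the row cap are illustrative assumptions.

```python
import json
from pathlib import Path

def spill_and_filter(output: list[dict], workspace: Path, name: str,
                     predicate) -> str:
    """Persist the full tool output; return only the matching slice for context."""
    full = workspace / f"{name}.json"
    full.write_text(json.dumps(output))  # full result stays greppable on disk
    matches = [row for row in output if predicate(row)]
    return (f"{len(matches)}/{len(output)} rows matched; "
            f"full output at {full.name}\n" +
            "\n".join(json.dumps(m) for m in matches[:20]))  # cap what enters context
```

For example, a huge log-query result where only the 5xx rows matter collapses to a handful of lines, while the agent can still grep the full file later if a hypothesis changes.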
Instead of debugging the agent at human speed, we could finally start using it to fix itself.

As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway, and some conversations broke entirely. So we set up a daily monitoring task for these failures. The agent searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

- Analyzed our user churn rate and built dashboards we now review weekly.
- Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
- Ran security analysis and found vulnerabilities in the read path.
- Helped fill out parts of its own Responsible AI review, with strict human review.
- Handles customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations.

The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one.

We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one.

Thanks to visagarwal for co-authoring this post.

Announcing general availability for the Azure SRE Agent
Today, we're excited to announce the General Availability (GA) of Azure SRE Agent, your AI-powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

What It Takes to Give SRE Agent a Useful Starting Point
In our latest posts, The Agent that investigates itself and Azure SRE Agent Now Builds Expertise Like Your Best Engineer: Introducing Deep Context, we wrote about a moment that changed how we think about agent systems. Azure SRE Agent investigated a regression in its own prompt cache, traced the drop to a specific PR, and proposed fixes. What mattered was not just the model. What mattered was the starting point. The agent had code, logs, deployment history, and a workspace it could use to discover the next piece of context.

That lesson forced an uncomfortable question about onboarding. If a customer finishes setup and the agent still knows nothing about their app, we have not really onboarded them. We have only created a resource. So for the March 10 GA release, we rebuilt onboarding around a more practical bar: can a new agent become useful on day one?

To test that, we used the new flow the way we expect customers to use it. We connected a real sample app, wired up live Azure Monitor alerts, attached code and logs, uploaded a knowledge file, and then pushed the agent through actual work. We asked it to inspect the app, explain a 401 path from the source, debug its own log access, and triage GitHub issues in the repo.

This post walks through that experience. We connected everything we could because we wanted to see what the agent does when it has a real starting point, not a partial one. If your setup is shorter, the SRE Agent still works. It just knows less.

The cold start we were trying to fix

The worst version of an agent experience is familiar by now. You ask a concrete question about your system and get back a smart-sounding answer that is only loosely attached to reality. The model knows what a Kubernetes probe is. It knows what a 500 looks like. It may even know common Kusto table names. But it does not know your deployment, your repo, your auth flow, or the naming mistakes your team made six months ago and still lives with.
We saw the same pattern again and again inside our own work. When the agent had real context, it could do deep investigations. When it started cold, it filled the gaps with general knowledge and good guesses.

The new onboarding is our attempt to close that gap up front. Instead of treating code, logs, incidents, and knowledge as optional extras, the flow is built around connecting the things the agent needs to reason well.

Walking through the new onboarding

Starting March 10, you can create and configure an SRE Agent at sre.azure.com. Here is what that looked like for us.

Step 1: Create the agent

You choose a subscription, resource group, name, and region. Azure provisions the runtime, managed identity, Application Insights, and Log Analytics workspace. In our run, the whole thing took about two minutes.

That first step matters more than it may look. We are not just spinning up a chatbot. We are creating the execution environment where the agent can actually work: run commands, inspect files, query services, and keep track of what it learns.

Step 2: Start adding context

Once provisioning finishes, you land on the setup page. The page is organized around the sources that make the agent useful: code, logs, incidents, Azure resources, and knowledge files.

| Data source | Why it matters |
| --- | --- |
| Code | Lets the agent read the system it is supposed to investigate. |
| Logs | Gives it real tables, schemas, and data instead of guesses. |
| Incidents | Connects the agent to the place where operational pain actually shows up. |
| Azure resources | Gives it the right scope so it starts in the right subscription and resource group. |
| Knowledge files | Adds the team-specific context that never shows up cleanly in telemetry. |

The page is blunt in a way we like. If you have not connected anything yet, it tells you the agent does not know enough about your app to answer useful questions. That is the right framing. The job of onboarding is to fix that.

Step 3: Connect logs

We started with Azure Data Explorer.
The wizard supports Azure Kusto, Datadog, Elasticsearch, Dynatrace, New Relic, Splunk, and Hawkeye. After choosing Kusto, it generated the MCP connector settings for us. We supplied the cluster details, tested the connection, and let it discover the tools.

This step removes a whole class of bad agent behavior. The model no longer has to invent table names or hope the cluster it wants is the cluster that exists. It knows what it can query because the connection is explicit.

Step 4: Connect the incident platform

For incidents, we chose Azure Monitor. This part is simple by design. If incidents are where the agent proves its value, connecting them should feel like the most natural part of setup, not a side quest. PagerDuty and ServiceNow work too, but for this walkthrough we kept it on Azure Monitor so we could wire real alerts to a real app.

Step 5: Connect code

Then we connected the code repo. We used microsoft-foundry/foundry-agent-webapp, a React and ASP.NET Core sample app running on Azure Container Apps.

This is still the highest-leverage source we give the agent. Once the repo is connected, the agent can stop treating the app as an abstract web service. It can read the auth flow. It can inspect how health probes are configured. It can compare logs against the exact code paths that produced them. It can even look at the commit that was live when an incident happened. That changes the quality of the investigation immediately.

Step 6: Scope the Azure resources

Next we told the agent which resources it was responsible for. We scoped it to the resource group that contained the sample Container App. The wizard then set the roles the agent needed to observe and investigate the environment.

That sounds like a small step, but it fixes another common failure mode. Agents do better when they start from the right part of the world. Subscription and resource-group scope give them that boundary.
Step 7: Upload knowledge

Last, we uploaded a Markdown knowledge file we wrote for the sample app. The file covered the app architecture, API endpoints, auth flow, likely failure modes, and the files we would expect an engineer to open first during debugging. We like Markdown here because it stays honest. It is easy for a human to read, easy for the agent to navigate, and easy to update as the system changes.

All sources configured

Once everything was connected, the setup panel turned green. At that point the agent had a repo, logs, incidents, Azure resources, and a knowledge file. That is the moment where onboarding stops being a checklist and starts being operational setup.

The chat experience makes the setup visible

When you open a new thread, the configuration panel stays at the top of the chat. If you expand it, you can see exactly what is connected and what is not. We built this because people should not have to guess what the agent knows. If code is connected and logs are not, that should be obvious. If incidents are wired up but knowledge files are missing, that should be obvious too. The panel makes the agent's working context visible in the same place where you ask it to think.

It also makes partial setup less punishing. You do not have to finish every step before the agent becomes useful. But you can see, very clearly, what extra context would make the next answer better.

What changed once the agent had context

The easiest way to evaluate the onboarding is to look at the first questions we asked after setup. We started with a simple one: What do you know about the Container App in the rg-big-refactor resource group?

The agent used Azure CLI to inspect the app, its revisions, and the system logs, then came back with a concise summary: image version, resource sizing, ingress, scale-to-zero behavior, and probe failures during cold start. It also correctly called out that the readiness probe noise was expected and not the root of a real outage.
That answer was useful because it was grounded in the actual resource, not in generic advice about Container Apps.

Then we asked a harder question: Based on the connected repo, what authentication flow does this app use? If a user reports 401s, what should we check first?

The agent opened authConfig.ts, Program.cs, useAuth.ts, postprovision.ps1, and entra-app.bicep, then traced the auth path end to end. The checklist it produced was exactly the kind of thing we hoped onboarding would unlock: client ID alignment, identifier URI issues, redirect URI mismatches, audience validation, missing scopes, token expiry handling, and the single-tenant assumption in the backend. It even pointed to the place in Program.cs where extra logging could be enabled. Without the repo, this would have been a boilerplate answer about JWTs. With the repo, it read like advice from someone who had already been paged for this app before.

We did not stop at setup. We wired real monitoring.

A polished demo can make any agent look capable, so we pushed farther. We set up live Azure Monitor alerts for the sample web app instead of leaving the incident side as dummy data. We created three alerts:

- HTTP 5xx errors (Sev 1), for more than 3 server errors in 5 minutes
- Container restarts (Sev 2), to catch crash loops and OOMs
- High response latency (Sev 2), when average response time goes above 10 seconds

The high-latency alert fired almost immediately. The app was scaling from zero, and the cold start was slow enough to trip the threshold. That was perfect. It gave us a real incident to put through the system instead of a fictional one.

Incident response plans

From the Builder menu, we created a response plan targeted at incidents with foundry-webapp in the title and severity 1 or 2. The incident that had just fired showed up in the learning flow.
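For reference, the logic of the three alert conditions above can be sketched over a 5-minute window of samples. The thresholds match the alerts we created; the sample record shape is illustrative, not the Azure Monitor schema.

```python
def evaluate_alerts(requests: list[dict], restarts: int) -> list[str]:
    """Evaluate the three demo alert conditions over one evaluation window."""
    fired = []
    server_errors = sum(1 for r in requests if r["status"] >= 500)
    if server_errors > 3:  # HTTP 5xx errors: more than 3 in 5 minutes, Sev 1
        fired.append("Sev1: HTTP 5xx errors")
    if restarts > 0:  # container restarts: crash loops and OOMs, Sev 2
        fired.append("Sev2: container restarts")
    if requests:
        avg_ms = sum(r["duration_ms"] for r in requests) / len(requests)
        if avg_ms > 10_000:  # average response time above 10 seconds, Sev 2
            fired.append("Sev2: high response latency")
    return fired

# A cold start from scale-to-zero: few requests, all slow - exactly the
# alert that fired first in our run.
cold_start = [{"status": 200, "duration_ms": 14_000}] * 2
print(evaluate_alerts(cold_start, restarts=0))  # ['Sev2: high response latency']
```

The sketch also shows why the latency alert tripped on a healthy app: a slow cold start dominates the average when there are only a couple of requests in the window.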
We used the actual codebase and deployment details to write the default plan: which files to inspect for failures, how to reason about health probes, and how to tell the difference between a cold start and a real crash.

That felt like an important moment in the product. The response plan was not generic incident theater. It was anchored in the system we had just onboarded.

## One of the most useful demos was the agent debugging itself

The sharpest proof point came when we tried to query the Log Analytics workspace from the agent. We expected it to query tables and summarize what it found. Instead, it hit insufficient_scope.

That could have been a dead end. Instead, the agent turned the failure into the investigation. It identified the missing permissions, noticed there were two managed identities in play, told us which RBAC roles were required, and gave us the exact commands to apply them. After we fixed the access, it retried and ran a series of KQL queries against the workspace.

That is where it found the next problem: Container Apps platform logs were present, but AppRequests, AppExceptions, and the rest of the App Insights-style tables were still empty. That was not a connector bug. It was a real observability gap in the sample app. The backend had OpenTelemetry packages, but the exporter configuration was not actually sending the telemetry we expected.

The agent did not just tell us that data was missing. It explained which data was present, which data was absent, and why that difference mattered. That is the sort of thing we wanted this onboarding to set up: not just answering the first question, but exposing the next real thing that needs fixing.

## We also asked it to triage the repo backlog

Once the repo was connected, it was natural to see how well the agent could read open issues against the code. We pointed it at the three open GitHub issues in the sample repo and asked it to triage them.
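The kind of present-versus-absent summary the agent produced can be sketched in a few lines of Python. This is purely illustrative, not the agent's actual implementation: it assumes you have already fetched a row count per table (for example, with a KQL `count` query per table) and simply separates populated tables from empty ones.

```python
# Hypothetical sketch: classify Log Analytics tables into "has data" vs
# "empty", the way the agent summarized the observability gap. The table
# names are real Log Analytics tables; the row counts are made up.

def classify_tables(row_counts):
    """Split a {table_name: row_count} mapping into populated and empty tables."""
    populated = sorted(t for t, n in row_counts.items() if n > 0)
    empty = sorted(t for t, n in row_counts.items() if n == 0)
    return populated, empty

counts = {
    "ContainerAppConsoleLogs_CL": 12480,  # platform logs: present
    "ContainerAppSystemLogs_CL": 3310,    # platform logs: present
    "AppRequests": 0,                     # App Insights-style tables: empty
    "AppExceptions": 0,
    "AppDependencies": 0,
}

populated, empty = classify_tables(counts)
print("present:", populated)
print("missing telemetry:", empty)
```

The interesting part is not the code, it is the framing: reporting which tables have data alongside which do not is what turned "no results" into a diagnosable exporter misconfiguration.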
It opened the relevant files, compared the code to the issue descriptions, and came back with a clear breakdown:

- Issue #21, @fluentui-copilot is not opensource? Partially valid, low severity. The package is public and MIT licensed. The real concern is package maturity, not licensing.
- Issue #20, SDK fails to deserialize agent tool definitions. Confirmed, medium severity. The agent traced the problem to metadata handling in AgentFrameworkService.cs and suggested a safe fallback path.
- Issue #19, Create Preview experience from AI Foundry is incomplete. Confirmed, medium severity. The agent found the gap between the environment variables people are told to paste and the variables the app actually expects.

What stood out to us was not just that the output was correct. It was that the agent was careful. It did not overclaim. It separated a documentation concern from two real product bugs. Then it asked whether we wanted it to start implementing the fixes. That is the posture we want from an engineering agent: useful, specific, and a little humble.

## What the onboarding is really doing

After working through the whole flow, we do not think of onboarding as a wizard anymore. We think of it as the process of giving the agent a fair shot. Each connection removes one reason for the model to bluff:

- Code keeps it from guessing how the system works.
- Logs keep it from guessing what data exists.
- Incidents keep it close to operational reality.
- Azure resource scope keeps it from wandering.
- Knowledge files keep team-specific context from getting lost.

This is the same lesson we learned building the product itself. The agent does better when it can discover context progressively inside a world that is real and well-scoped. Good onboarding is how you create that world.

## Closing

The main thing we learned from this work is simple: onboarding is not done when the resource exists. It is done when the agent can help with a real problem.
In one setup we were able to connect a real app, fire a real alert, create a real response plan, debug a real RBAC problem, inspect real logs, and triage real GitHub issues. That is a much better standard than "the wizard completed successfully."

If you try SRE Agent after GA, start there. Connect the things that make your system legible, then ask a question that would actually matter during a bad day. The answer will tell you very quickly whether the agent has a real starting point.

Create your SRE Agent -> Azure SRE Agent is generally available starting March 10, 2026.

# What's new in Azure SRE Agent in the GA release
Azure SRE Agent is now generally available (read the GA announcement). After months in preview with teams across Microsoft and early customers, here's what's shipping at GA.

## We use SRE Agent in our team

We built SRE Agent to solve our own operational problems first. It investigates our regressions, triages errors daily, and turns investigations into reusable knowledge. Every capability in this release was shaped from those learnings. → The Agent That Investigates Itself

## What's new at GA

### Redesigned onboarding — useful on day one

Can a new agent become useful the same day you set it up? That's the bar we designed around. Connect code, logs, incidents, Azure resources, and knowledge files in a single guided flow. → What It Takes to Give an SRE Agent a Useful Starting Point

### Deep Context — your agent builds expertise on your environment

Continuous access to your logs, code, and knowledge. Persistent memory across investigations. Background intelligence that runs when nobody is asking questions. Your agent already knows your routes, error handlers, and deployment configs because it's been exploring your environment continuously. It remembers what worked last time and surfaces operational insights nobody asked for. → Meet the Best Engineer That Learns Continuously

## Why SRE Agent - Capabilities that move the needle

### Automated investigation — proactive and reactive

Set up scheduled tasks to run investigations on a cadence — catch issues before they become incidents. When an incident does fire, your agent picks it up automatically through integrations with platforms like ICM, PagerDuty, and ServiceNow.

### Faster root cause analysis → lower MTTR

Your agent is code and context aware and learns continuously. It connects runtime errors to the code that caused them and gets faster with every investigation.

### Automate workflows across any ecosystem → reduce toil

Connect to any system via MCP connectors.
Eliminate the context-switching of working across multiple platforms, and orchestrate workflows across Azure, monitoring, ticketing, and more from a single place.

### Integrate with any HTTP API → bring your own tools

Write custom Python tools that call any endpoint. Extend your agent to interact with internal APIs, third-party services, or any system your team relies on.

### Customize your agent → skills and plugins

Add your own skills to teach domain-specific knowledge, or browse the Plugin Marketplace to install pre-built capabilities with a single click.

## Get started

- Create your agent
- Documentation
- Get started guide
- Pricing
- Feedback & issues
- Samples
- Videos

This is just the start — more capabilities are coming soon. Try it out and let us know what you think.

# Agent Hooks: Production-Grade Governance for Azure SRE Agent
## Introduction

Azure SRE Agent helps engineering teams automate incident response, diagnostics, and remediation tasks. But when you're giving an agent access to production systems—your databases, your Kubernetes clusters, your cloud resources—you need more than just automation. You need governance.

Today, we're diving deep into Agent Hooks, the built-in governance framework in Azure SRE Agent that lets you enforce quality standards, prevent dangerous operations, and maintain audit trails without writing custom middleware or proxies.

Agent Hooks work by intercepting your SRE Agent at critical execution points—before it responds to users (Stop hooks) or after it executes tools (PostToolUse hooks). You define the rules once in your custom agent configuration, and the SRE Agent runtime enforces them automatically across every conversation thread.

In this post, we'll show you how to configure Agent Hooks for a real production scenario: diagnosing and remediating PostgreSQL connection pool exhaustion while maintaining enterprise controls.

## The Challenge: Autonomous Remediation with Guardrails

You're managing a production application backed by Azure PostgreSQL Flexible Server. Your on-call team frequently deals with connection pool exhaustion issues that cause latency spikes. You want your SRE Agent to diagnose and resolve these incidents autonomously, but you need to ensure:

- Quality Control: The agent provides thorough, evidence-based analysis instead of superficial guesses
- Safety: The agent can't accidentally execute dangerous commands, but can still perform necessary remediation
- Compliance: Every agent action is logged for security audits and post-mortems

Without Agent Hooks, you'd need to build custom middleware, write validation logic around the SRE Agent API, or settle for manual approval workflows. With Agent Hooks, you configure these controls once in your custom agent definition and the SRE Agent platform enforces them automatically.
## The Scenario: PostgreSQL Connection Pool Exhaustion

For our demo, we'll use a real production application (octopets-prod-web) experiencing connection pool exhaustion. When this happens:

- P95 latency spikes from ~120ms to 800ms+
- Active connections reach the pool limit
- New requests get queued or fail

The correct remediation is to restart the PostgreSQL Flexible Server to flush stale connections—but we want our agent to do this safely and with proper oversight.

## Demo Setup: Three Hooks, Three Purposes

We'll configure three hooks that work together to create a robust governance framework:

- Hook #1: Quality Gate (Stop Hook). Ensures the agent provides structured, evidence-based responses before presenting them to users.
- Hook #2: Safety Guardrails (PostToolUse Hook). Blocks dangerous commands while allowing safe operations through an explicit allowlist.
- Hook #3: Audit Trail (Global Hook). Logs every tool execution across all agents for compliance and debugging.

## Step-by-Step Implementation

### Creating the Custom Agent

First, we create a specialized subagent in the Azure SRE Agent platform called sre_analyst_agent designed for PostgreSQL diagnostics. In the Agent Canvas, we configure the agent instructions:

```
You are an SRE agent responsible for diagnosing and remediating production issues for an application backed by an Azure PostgreSQL Flexible Server.

When investigating a problem:
- Use available tools to query Azure Monitor metrics, PostgreSQL logs, and connection statistics
- Look for patterns: latency spikes, connection counts, error rates, CPU/memory pressure
- Quantify findings with actual numbers where possible (e.g., P95 latency in ms, active connection count, error rate %)

When presenting your diagnosis, structure your response with these exact sections:

## Root Cause
A precise explanation of what is causing the issue.

## Evidence
Specific metrics and observations that support your root cause. Include actual numbers: latency values in ms, connection counts, error rates, timestamps.

## Recommended Actions
Numbered list of remediation steps ordered by priority. Be specific — include actual resource names and exact commands.

When executing a fix:
- Always verify the current state before acting
- Confirm the fix worked by re-checking the same metrics after the action
- Report before and after numbers to show impact
```

This explicit guidance ensures the agent knows the correct remediation path.

### Configuring Hook #1: Quality Gate

In the Agent Canvas' Hooks tab, we add our first agent-level hook—a Stop hook that fires before the SRE Agent presents its response. This hook uses the SRE Agent's own LLM to evaluate response quality:

- Event Type: Stop
- Hook Type: Prompt
- Activation: Always

Hook Prompt:

```
You are a quality gate for an SRE agent that investigates database and app performance issues. Review the agent's response below:

$ARGUMENTS

Evaluate whether the response meets ALL of the following criteria:

1. Has a "## Root Cause" section with a specific, clear explanation (not vague — must say specifically what failed, e.g., "connection pool exhaustion due to long-running queries holding connections" not just "database issue")
2. Has a "## Evidence" section that includes at least one concrete metric or data point with an actual number (e.g., "P95 latency spiked to 847ms", "active connections: 497/500", "error rate: 23% over last 15 minutes")
3. Has a "## Recommended Actions" section with numbered, specific steps (must include actual resource names or commands, not just "restart the database")

If ALL three criteria are met with substantive content, respond: {"ok": true}

If ANY criterion is missing, vague, or uses placeholder text, respond: {"ok": false, "reason": "Your response needs more depth before it reaches the user. Specifically: ## Root Cause must name the exact failure mechanism, ## Evidence must include real metric values with numbers (latency in ms, connection counts, error rates), ## Recommended Actions must reference actual resource names and specific commands. Go back and verify your findings."}
```

This hook acts as an automated quality gate built directly into the SRE Agent runtime, catching superficial responses before they reach your on-call engineers.

### Configuring Hook #2: Safety Guardrails

Our second agent-level hook is a PostToolUse hook that fires after the SRE Agent executes Bash or Python tools. This implements an allowlist pattern to control what commands can actually run in production:

- Event Type: PostToolUse
- Hook Type: Command (Python)
- Matcher: Bash|ExecuteShellCommand|ExecutePythonCode
- Activation: Always

Hook Script:

```python
#!/usr/bin/env python3
import sys, json, re

context = json.load(sys.stdin)
tool_input = context.get('tool_input', {})
command = ''
if isinstance(tool_input, dict):
    command = tool_input.get('command', '') or tool_input.get('code', '')

# Safe allowlist — check these FIRST before any blocking logic
# These are explicitly approved remediation actions for PostgreSQL issues
safe_allowlist = [
    r'az\s+postgres\s+flexible-server\s+restart',
]
for safe_pattern in safe_allowlist:
    if re.search(safe_pattern, command, re.IGNORECASE):
        print(json.dumps({
            'decision': 'allow',
            'hookSpecificOutput': {
                'additionalContext': '[SAFETY] ✅ PostgreSQL server restart approved — recognized as a safe remediation action for connection pool exhaustion.'
            }
        }))
        sys.exit(0)

# Destructive commands to block
dangerous = [
    (r'\baz\s+postgres\s+flexible-server\s+delete\b', 'az postgres flexible-server delete (permanent server deletion)'),
    (r'\baz\s+\S+\s+delete\b', 'az delete (Azure resource deletion)'),
    (r'\brm\s+-rf\b', 'rm -rf (recursive force delete)'),
    (r'\bsudo\b', 'sudo (privilege escalation)'),
    (r'\bdrop\s+(table|database)\b', 'DROP TABLE/DATABASE (irreversible data loss)'),
    (r'\btruncate\s+table\b', 'TRUNCATE TABLE (irreversible data wipe)'),
    (r'\bdelete\s+from\b(?!.*\bwhere\b)', 'DELETE FROM without WHERE clause (wipes entire table)'),
]
for pattern, label in dangerous:
    if re.search(pattern, command, re.IGNORECASE):
        print(json.dumps({
            'decision': 'block',
            'reason': f'🛑 BLOCKED: {label} is not permitted. Use safe, non-destructive alternatives. For PostgreSQL connection issues, prefer server restart or connection pool configuration changes.'
        }))
        sys.exit(0)

print(json.dumps({'decision': 'allow'}))
```

This ensures only pre-approved PostgreSQL operations can execute, preventing accidental data deletion or configuration changes.

Now that we've configured both agent-level hooks, here's what our custom agent looks like in the canvas:

Overview of sre_analyst_agent with hooks. Agent Canvas showing the sre_analyst_agent configuration with two agent-level hooks attached.

### Configuring Hook #3: Audit Trail

Finally, we create a Global hook using the Hooks management page in the Azure SRE Agent Portal. Global hooks apply across all custom agents in your organization, providing centralized governance:

Global Hooks Management Page - Creating the sre_audit_trail global hook.
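Before trusting a hook like this in production, it helps to sanity-check the regex logic locally. The sketch below re-implements the same allow/block decision as a plain function so it can be exercised against sample commands. `decide` is a hypothetical helper for testing, not part of any SRE Agent SDK, and it mirrors only a subset of the patterns above:

```python
import re

# Illustrative, testable re-implementation of the hook's decision logic.
# Patterns are copied from the hook script; `decide` is a made-up name.

SAFE = [r'az\s+postgres\s+flexible-server\s+restart']
DANGEROUS = [
    r'\baz\s+postgres\s+flexible-server\s+delete\b',
    r'\baz\s+\S+\s+delete\b',
    r'\brm\s+-rf\b',
    r'\bsudo\b',
    r'\bdrop\s+(table|database)\b',
    r'\bdelete\s+from\b(?!.*\bwhere\b)',
]

def decide(command):
    """Return 'allow' or 'block' for a candidate shell command."""
    if any(re.search(p, command, re.IGNORECASE) for p in SAFE):
        return 'allow'  # explicit allowlist is checked first, as in the hook
    if any(re.search(p, command, re.IGNORECASE) for p in DANGEROUS):
        return 'block'
    return 'allow'  # default: commands matching neither list pass through

print(decide('az postgres flexible-server restart -g prod-rg -n octopets-db'))  # allow
print(decide('az postgres flexible-server delete -g prod-rg -n octopets-db'))   # block
print(decide('psql -c "SELECT count(*) FROM pg_stat_activity"'))                # allow
```

Running a table of known-safe and known-dangerous commands through this kind of harness catches ordering mistakes (for example, a block rule that accidentally shadows the restart allowlist) before the hook ever sees real traffic.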
The Global Hooks management page showing the sre_audit_trail hook configuration with event type, activation mode, matcher pattern, and Python script editor.

- Event Type: PostToolUse
- Hook Type: Command (Python)
- Matcher: * (all tools)
- Activation: On-demand

Hook Script:

```python
#!/usr/bin/env python3
import sys, json

context = json.load(sys.stdin)
tool_name = context.get('tool_name', 'unknown')
agent_name = context.get('agent_name', 'unknown')
succeeded = context.get('tool_succeeded', False)
turn = context.get('current_turn', '?')

audit = f'[AUDIT] Turn {turn} | Agent: {agent_name} | Tool: {tool_name} | Success: {succeeded}'
print(audit, file=sys.stderr)

print(json.dumps({
    'decision': 'allow',
    'hookSpecificOutput': {
        'additionalContext': audit
    }
}))
```

By setting this as "on-demand," your SRE engineers can toggle this hook on/off per conversation thread from the chat interface—enabling detailed audit logging during incident investigations without overwhelming logs during routine queries.

## Seeing Agent Hooks in Action

Now let's see how these hooks work together when our SRE Agent investigates a real production incident.

### Activating Audit Trail

Before starting our investigation, we toggle on the audit trail hook from the chat interface:

Managing hooks for this thread with sre_audit_trail activated. The "Manage hooks for this thread" menu showing the sre_audit_trail global hook toggled on for this conversation.

This gives us visibility into every tool the agent executes during the investigation.

### Starting the Investigation

We prompt our SRE Agent: "Can you check the octopets-prod-web application and diagnose any performance issues?"

The SRE Agent begins gathering metrics from Azure Monitor, and we immediately see our audit trail hook logging each tool execution. This real-time visibility is invaluable for understanding what your SRE Agent is doing and debugging issues when things don't go as planned.
### Quality Gate Rejection

The SRE Agent completes its initial analysis and attempts to respond. But our Stop hook intercepts it—the response doesn't meet our quality standards:

Stop hook forcing the agent to provide more detailed analysis. Stop hook rejection message: "Your response needs more depth and specificity..." forcing the agent to re-analyze with more evidence.

The hook rejects the response and forces the SRE Agent to retry—gathering more evidence, querying additional metrics, and providing specific numbers. This self-correction happens automatically within the SRE Agent runtime, with no manual intervention required.

### Structured Final Response

After re-verification, the SRE Agent presents a properly structured analysis that passes our quality gate:

Agent response with Root Cause, Evidence, and Recommended Actions. Agent response showing the required structure: Root Cause section with connection pool exhaustion diagnosis, Evidence section with specific metric numbers, and Recommended Actions with the exact restart command.

- Root Cause: Connection pool exhaustion
- Evidence: Specific metrics (83 active connections, P95 latency 847ms)
- Recommended Actions: Restart command with actual resource names

This is the level of rigor we expect from production-ready agents.

### Safety Allowlist in Action

The SRE Agent determines it needs to restart the PostgreSQL server to remediate the connection pool exhaustion. Our PostToolUse hook intercepts the command execution and validates it against our allowlist:

PostgreSQL metrics query and restart command output. Code execution output showing the PostgreSQL metrics query results and the az postgres flexible-server restart command being executed successfully.

Because the az postgres flexible-server restart command matches our safety allowlist pattern, the hook allows it to proceed. If the SRE Agent had attempted any unapproved operation (like DROP DATABASE or firewall rule changes), the safety hook would have blocked it immediately.
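The three criteria the Stop hook prompt enforces can also be approximated with a deterministic check. The sketch below is a hypothetical validator, not the LLM-based gate the platform actually runs, but it shows the shape of the contract a response must satisfy before it passes:

```python
import re

# Hypothetical structural check mirroring the Stop hook's three criteria:
# all required sections present, and the Evidence section contains a number.
REQUIRED_SECTIONS = ["## Root Cause", "## Evidence", "## Recommended Actions"]

def passes_quality_gate(response):
    """Rough stand-in for the LLM quality gate: structure plus a numeric check."""
    # Criteria 1 and 3: every required section header must appear
    if not all(section in response for section in REQUIRED_SECTIONS):
        return False
    # Criterion 2: Evidence must contain at least one numeric data point
    evidence = response.split("## Evidence", 1)[1].split("## Recommended Actions", 1)[0]
    return bool(re.search(r"\d", evidence))

good = (
    "## Root Cause\nConnection pool exhaustion.\n"
    "## Evidence\nP95 latency 847ms, 83 active connections.\n"
    "## Recommended Actions\n1. az postgres flexible-server restart ...\n"
)
print(passes_quality_gate(good))                    # True
print(passes_quality_gate("## Root Cause\nvague"))  # False
```

A regex gate like this is cheaper but blunter than the prompt-based one: it can verify structure and the presence of numbers, while the LLM gate can also judge whether the explanation is specific enough.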
## The Results

After the SRE Agent restarts the PostgreSQL server:

- P95 latency drops from 847ms back to ~120ms
- Active connections reset to healthy levels
- Application performance returns to normal

But more importantly, we achieved autonomous remediation with enterprise governance:

- ✅ Quality assurance: Every response met our evidence standards (enforced by Stop hooks)
- ✅ Safety controls: Only pre-approved operations executed (enforced by PostToolUse hooks)
- ✅ Complete audit trail: Every tool call logged for compliance (enforced by Global hooks)
- ✅ Zero manual interventions: The SRE Agent self-corrected when quality standards weren't met

This is the power of Agent Hooks—governance that doesn't get in the way of automation.

## Key Takeaways

Agent Hooks bring production-grade governance to Azure SRE Agent:

- Layered Governance: Combine agent-level hooks for custom agent-specific controls with global hooks for organization-wide policies
- Fail-Safe by Default: Use allowlist patterns in PostToolUse hooks rather than denylists—explicitly permit safe operations instead of trying to block every dangerous one
- Self-Correcting SRE Agents: Stop hooks with quality gates create feedback loops that improve response quality without human intervention
- Audit Without Overhead: On-demand global hooks let your engineers toggle detailed logging only during incident investigations
- No Custom Middleware: All governance logic lives in your custom agent configuration—no need to build validation proxies or wrapper services

## Getting Started

Agent Hooks are available now in the Azure SRE Agent platform.
You can configure them entirely through the UI—no API calls or tokens needed:

- Agent-Level Hooks: Navigate to the Agent Canvas → Hooks tab and add hooks directly to your custom agent
- Global Hooks: Use the Hooks management page to create organization-wide policies
- Thread-Level Control: Toggle on-demand hooks from the chat interface using the "Manage hooks" menu

## Learn More

- Agent Hooks Documentation
- YAML Schema Reference
- Subagent Builder Guide

Ready to build safer, smarter agents? Start experimenting with Agent Hooks today at sre.azure.com.

# Azure SRE Agent Now Builds Expertise Like Your Best Engineer: Introducing Deep Context
## What if SRE Agent already knew your system before the next incident?

Your most experienced SRE didn't become an expert overnight. Day one: reading runbooks, studying architecture diagrams, asking a lot of questions. Month three: knowing which services are fragile, which config changes cascade, which log patterns mean real trouble. Year two: diagnosing a production issue at 2 AM from a single alert because they'd built deep, living context about your systems.

That learning process, absorbing documentation, reading code, handling incidents, building intuition from every interaction, is what makes an expert. Azure SRE Agent can now do the same thing.

## From pulling context to living in it

Azure SRE Agent already connects to Azure Monitor, PagerDuty, and ServiceNow. It queries Kusto logs, checks resource health, reads your code, and delivers root cause analysis, often resolving incidents without waking anyone up. Thousands of incidents handled. Thousands of engineering hours saved.

Deep Context takes this to the next level. Instead of accessing context on demand, your agent now lives in it — continuously reading your code and knowledge, building persistent memory from every interaction, and evolving its understanding of your systems in the background.

Three things make Deep Context work:

- Continuous access. Source code, terminal, Python runtime, and Azure environment are available whenever the agent needs them. Connected repos are cloned into the agent's workspace automatically. The agent knows your code structure from the first message.
- Persistent memory. Insights from previous investigations, architecture understanding, team context — it all persists across sessions. The next time the agent picks up an alert, it already knows what happened last time.
- Background intelligence. Even when you're not chatting, background services continuously learn. After every conversation, the agent extracts what worked, what failed, what the root cause was.
It aggregates these across all past investigations to build evolving operational insights. The agent recognizes patterns you haven't noticed yet.

One example: connected to Kusto, background scanning auto-discovers every table, documents schemas, and builds reusable query templates. But this learning applies broadly — every conversation, every incident, every data source makes the agent sharper.

## Expertise that compounds with every incident

| | New on-call engineer | SRE Agent with Deep Context |
| --- | --- | --- |
| Alert fires | Opens runbook, looks up which service this maps to | Already knows the service, its dependencies, and failure patterns from prior incidents |
| Investigation | Reads logs, searches code, asks teammates | Goes straight to the relevant code path, correlates with logs and persistent insights from similar incidents |
| After 100 incidents | Becomes the team expert — irreplaceable institutional knowledge | Same institutional knowledge — always available, never forgets, scales across your entire organization |

A human expert takes months to build this depth. An agent with Deep Context builds it in days, and the knowledge compounds with every interaction.

## You shape what your agent learns

Deep Context learns automatically, but the best results come when your team actively guides what the agent retains.

Type #remember in chat to save important facts your agent should always know: environment details, escalation paths, team preferences. For example: "#remember our Redis cache uses Premium tier with 6GB" or "#remember database failover takes approximately 15 minutes." These are recalled automatically during future investigations.

Turn investigations into knowledge. After a good investigation, ask your agent to turn the resolution into a runbook: "Create a troubleshooting guide from the steps we just followed and save it to Knowledge settings."
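To make the Kusto example concrete, here's a rough sketch of what "building reusable query templates from discovered schemas" could look like. This is purely illustrative — the function name, the discovered column names, and the template shape are assumptions, not the agent's actual internals:

```python
# Hypothetical sketch: turn a discovered table schema into a reusable KQL
# query template. The helper and the template shape are illustrative only.

def error_rate_template(table, timestamp_col, status_col):
    """Build a KQL snippet that charts server-error counts over time for a table."""
    return (
        f"{table}\n"
        f"| where {timestamp_col} > ago(1h)\n"
        f"| where toint({status_col}) >= 500\n"
        f"| summarize errors = count() by bin({timestamp_col}, 5m)"
    )

# Columns a background schema scan might have discovered for AppRequests
discovered = {"table": "AppRequests", "timestamp_col": "TimeGenerated", "status_col": "ResultCode"}
template = error_rate_template(**discovered)
print(template)
```

The point is not this particular query — it's that once schemas are discovered and cached, the agent can stamp out ready-to-run queries instantly instead of rediscovering column names during every investigation.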
The agent generates a structured document, uploads it, and indexes it — so the next time a similar issue occurs, the agent finds and follows that guide automatically.

The agent captures insights from every conversation on its own. Your guidance tells it which ones matter most. This is exactly how Microsoft's own SRE team gets the best results: "Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again." Read the full story in The Agent That Investigates Itself.

## See it in action: an Azure Monitor alert, end to end

An HTTP 5xx spike fires on your container app. Your agent is in autonomous mode. It acknowledges the alert, checks resource health, reads logs, and delivers a diagnosis — that's what it already does well. Deep Context makes this dramatically better. Two things change everything:

1. The agent already knows your environment. It's already read your code and runbooks and built context from previous investigations. Your route handlers, database layer, deployment configs, operational procedures: it knows all of it. So when this alert fires, it doesn't start from scratch. It goes straight to the relevant code path, correlates a recent connection pooling commit with the deployment timeline, and confirms the root cause.
2. The agent remembers. It's seen this pattern before: a similar incident last week that was investigated but never permanently fixed. It recognizes the recurrence from persistent memory, skips rediscovery, confirms the issue is still in the code, and this time fixes it.

Because it's in autonomous mode, the agent edits the source code, restarts the container, pushes the fix to a new branch, creates a PR, opens a GitHub Issue, and verifies service health, all before you wake up. The agent delivers a complete remediation summary including the alert, the root cause with code references, the fix applied, and the PR created, without a single message from you.

Code access turns diagnosis into action.
Persistent memory turns recurring problems into solved problems.

## Give your agent your code — here's why it matters

If you're on an IT operations, SRE, or DevOps team, you might think: "Code access? That's for developers." We'd encourage you to rethink that. Your infrastructure-as-code, deployment configs, Helm charts, Terraform files, pipeline definitions — that's all code. And it's exactly the context your agent needs to go from good to extraordinary.

When your agent can read your actual configuration and infrastructure code, investigations transform. Instead of generic troubleshooting, you get root cause analysis that points to the exact file, the exact line, the exact config change. It correlates a deployment failure with a specific commit. It reads your Helm values and spots the misconfiguration that caused the pod crash loop.

### "Will the agent modify our production code?"

No. The agent works in a secure sandbox — a copy of your repository, not your production environment. When it identifies a fix, it creates a pull request on a new branch. Your code review process, your CI/CD pipeline, your approval gates — all untouched. The agent proposes. Your team decides.

Whether you're a developer, an SRE, or an IT operator managing infrastructure you didn't write — connecting your code is the single highest-impact thing you can do to make your agent smarter.

## The compound effects

Deep Context amplifies every other SRE Agent capability:

- Deep Context + Incident management → Alerts fire, the agent correlates logs with actual code. Root cause references specific files and line numbers.
- Deep Context + Scheduled tasks → Automated code analysis, compliance checks, and drift detection — inspecting your actual infrastructure code, not just metrics.
- Deep Context + MCP connectors → Datadog, Splunk, PagerDuty data combined with source code context. The full picture in one conversation.
- Deep Context + Knowledge files → Upload runbooks, architecture docs, postmortems — in any format.
The agent cross-references your team's knowledge with live code, logs, and infrastructure state.

Logs tell the agent what happened. Code tells it why. Your knowledge files tell it what to do about it.

## Get started

Deep Context is available today as part of Azure SRE Agent GA. New agents have it enabled by default.

For a step-by-step walkthrough connecting your code, logs, incidents, and knowledge files, see What It Takes to Give an SRE Agent a Useful Starting Point.

## Resources

- SRE Agent GA Announcement blog – https://aka.ms/sreagent/ga
- SRE Agent GA What's new post – https://aka.ms/sreagent/blog/whatsnewGA
- SRE Agent Documentation – https://aka.ms/sreagent/newdocs
- SRE Agent Overview – https://aka.ms/sreagent/newdocsoverview