Apps on Azure Blog


Unifying Scattered Observability Data from Dynatrace + Azure for Self-Healing with SRE Agent

Vineela-Suri

Microsoft

Jan 26, 2026

What if your deployments could fix themselves?

The Deployment Remediation Challenge

Modern operations teams face a recurring nightmare:

A deployment ships at 9 AM
Errors spike at 9:15 AM
By the time you correlate logs, identify the bad revision, and execute a rollback—it's 10:30 AM
Your users felt 75 minutes of degraded experience

The data to detect and fix this existed the entire time—but it was scattered across clouds and platforms:

Error logs and traces → Dynatrace (third-party observability cloud)
Deployment history and revisions → Azure Container Apps API
Resource health and metrics → Azure Monitor
Rollback commands → Azure CLI

Your observability data lives in one cloud. Your deployment data lives in another. Stitching together log analysis from Dynatrace with deployment correlation from Azure—and then executing remediation—required a human to manually bridge these silos.

What if an AI agent could unify data from third-party observability platforms with Azure deployment history and act on it automatically—every week, before users even notice?

Enter SRE Agent + Model Context Protocol (MCP) + Subagents

Azure SRE Agent doesn't just work with Azure. Using the Model Context Protocol (MCP), you can connect external observability platforms like Dynatrace directly to your agent. Combined with subagents for specialized expertise and scheduled tasks for automation, you can build an automated deployment remediation system.

Here's what I configured for my Azure Container Apps environment inside SRE Agent:

Component	Purpose
Dynatrace MCP Connector	Connect to Dynatrace's MCP gateway for log queries via DQL
'Dynatrace' Subagent	Log analysis specialist that executes DQL queries and identifies root causes
'Remediation' Subagent	Deployment remediation specialist that correlates errors with deployments and executes rollbacks
Scheduled Task	Weekly Monday 9:30 AM health check for the 'octopets-prod-api' Container App

Subagent workflow:

The subagent workflow in SRE Agent Builder: 'OctopetsScheduledTask' triggers 'RemediationSubagent' (12 tools), which hands off to 'DynatraceSubagent' (3 MCP tools) for log analysis.

How I Set It Up: Step by Step

Step 1: Connect Dynatrace via MCP

SRE Agent supports the Model Context Protocol (MCP) for connecting external data sources. Dynatrace exposes an MCP gateway that provides access to its APIs as first-class tools.

Connection configuration:

{
  "name": "dynatrace-mcp-connector",
  "dataConnectorType": "Mcp",
  "dataSource": "Endpoint=https://<your-tenant>.live.dynatrace.com/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp;AuthType=BearerToken;BearerToken=<your-api-token>"
}

Once connected, SRE Agent automatically discovers Dynatrace tools.

💡 Tip: When creating your Dynatrace API token, grant the `entities.read`, `events.read`, and `metrics.read` scopes for comprehensive access.
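If you manage more than one tenant, it can help to generate the connector definition rather than hand-edit the long `dataSource` string. The snippet below is a minimal sketch that assembles the same three fields shown in the connection configuration. The function name `build_dynatrace_connector` is hypothetical, and the placeholders stand in for values you would normally load from a secret store rather than hard-code.

```python
import json


def build_dynatrace_connector(tenant: str, api_token: str) -> dict:
    """Assemble the connector definition from a tenant ID and an API token.

    Hypothetical helper: the field names and endpoint path are copied from
    the connection configuration in this post, nothing else is assumed.
    """
    endpoint = (
        f"https://{tenant}.live.dynatrace.com"
        "/platform-reserved/mcp-gateway/v0.1/servers/dynatrace-mcp/mcp"
    )
    return {
        "name": "dynatrace-mcp-connector",
        "dataConnectorType": "Mcp",
        "dataSource": (
            f"Endpoint={endpoint};AuthType=BearerToken;BearerToken={api_token}"
        ),
    }


if __name__ == "__main__":
    # Placeholders only; substitute real values from your own secret store.
    connector = build_dynatrace_connector("<your-tenant>", "<your-api-token>")
    print(json.dumps(connector, indent=2))
```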

Step 2: Build Specialized Subagents

Generic agents are good. Specialized agents are better.

I created two subagents that work together in a coordinated workflow—one for Dynatrace log analysis, the other for deployment remediation.

DynatraceSubagent

This subagent is the log analysis specialist. It uses the Dynatrace MCP tools to execute DQL queries and identify root causes.

Key capabilities:

Executes DQL queries via MCP tools (`create-dql`, `execute-dql`, `explain-dql`)
Fetches 5xx error counts, request volumes, and spike detection
Returns consolidated analysis with root cause, affected services, and error patterns

👉 View full DynatraceSubagent configuration here
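To make "spike detection" concrete, here is a minimal sketch of one common approach: compare each time bucket's 5xx count against a baseline taken from the earliest buckets. This is an illustration only, not the logic the subagent actually runs. The function name, the baseline length, the multiplier `k`, and the sample counts are all assumptions.

```python
from statistics import mean, pstdev
from typing import List


def detect_spikes(counts: List[int], baseline_len: int = 6, k: float = 3.0) -> List[int]:
    """Return indexes of buckets whose 5xx count exceeds mean + k * std.

    The first `baseline_len` buckets form the baseline. The standard
    deviation is floored at 1.0 so a near-silent baseline does not flag
    a single stray error as a spike.
    """
    if len(counts) <= baseline_len:
        return []
    baseline = counts[:baseline_len]
    threshold = mean(baseline) + k * max(pstdev(baseline), 1.0)
    return [i for i in range(baseline_len, len(counts)) if counts[i] > threshold]


if __name__ == "__main__":
    # Made-up 5xx counts per time bucket; the jump at indexes 7 and 8 is the spike.
    sample = [0, 1, 0, 2, 1, 0, 1, 40, 55, 3]
    print(detect_spikes(sample))  # [7, 8]
```

In practice the counts would come from a DQL query result, bucketed by time; the threshold rule is the part you would tune for your own traffic.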

RemediationSubagent

This is the deployment remediation specialist. It correlates Dynatrace log analysis with Azure Container Apps deployment history, generates correlation charts, and executes rollbacks when confidence is high.

Key capabilities:

Retrieves Container Apps revision history (`GetDeploymentTimes`, `ListRevisions`)
Generates correlation charts (`PlotTimeSeriesData`, `PlotBarChart`, `PlotAreaChartWithCorrelation`)
Computes confidence score (0-100%) for deployment causation
Executes rollback and traffic shift when confidence > 70%

👉 View full RemediationSubagent configuration here
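The post does not spell out how the confidence score is calculated, so the snippet below is a hypothetical heuristic rather than the subagent's real formula. It averages two ratios: how much of the error volume falls after the deployment, and how much of that volume is served by the new revision. Only the 70% threshold comes from the capabilities listed above; the `Evidence` fields and function names are assumptions.

```python
from dataclasses import dataclass

ROLLBACK_THRESHOLD = 70.0  # from the post: roll back only when confidence > 70%


@dataclass
class Evidence:
    errors_before: int      # 5xx count in the window before the deployment
    errors_after: int       # 5xx count in an equal-length window after it
    errors_on_new_rev: int  # portion of errors_after served by the new revision


def confidence_score(e: Evidence) -> float:
    """Hypothetical heuristic: mean of two ratios, scaled to 0-100."""
    if e.errors_after == 0:
        return 0.0  # no errors after the deployment, nothing to blame on it
    after_share = e.errors_after / (e.errors_before + e.errors_after)
    new_rev_share = min(e.errors_on_new_rev, e.errors_after) / e.errors_after
    return round(100.0 * (after_share + new_rev_share) / 2, 1)


def should_roll_back(score: float) -> bool:
    """Strictly greater than the threshold, matching 'confidence > 70%'."""
    return score > ROLLBACK_THRESHOLD


if __name__ == "__main__":
    # Made-up numbers: a few errors before, many after, all on the new revision.
    score = confidence_score(Evidence(errors_before=5, errors_after=95, errors_on_new_rev=95))
    print(score, should_roll_back(score))  # 97.5 True
```

Keeping the threshold check separate from the scoring makes it easy to log the score for every run, including the runs that do not trigger a rollback.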

The power of specialization: Each agent focuses on its domain—DynatraceSubagent handles log analysis, RemediationSubagent handles deployment correlation and rollback. When the workflow runs, RemediationSubagent hands off to DynatraceSubagent (bi-directional handoff) for analysis, gets the findings back, and continues with remediation. Simple delegation, not a single monolithic agent trying to do everything.

Step 3: Create the Weekly Scheduled Task

Now the automation. I configured a scheduled task that runs every Monday at 9:30 AM to check whether deployments in the last 4 hours caused any issues—and automatically remediate if needed.

Scheduled task configuration:

Setting	Value
Task Name	OctopetsScheduledTask
Frequency	Weekly
Day of Week	Monday
Time	9:30 AM
Response Subagent	RemediationSubagent


Configuring the OctopetsScheduledTask in the SRE Agent portal

The key insight: the scheduled task is just a coordinator. It immediately hands off to the RemediationSubagent, which orchestrates the entire workflow including handoffs to DynatraceSubagent.
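For readers who want to reason about when the task fires and which deployments it inspects, here is a minimal sketch of the schedule arithmetic: the next Monday 9:30 AM run and the 4-hour lookback window described in this step. The function names are hypothetical, and the datetimes are naive because the post does not state a time zone.

```python
from datetime import datetime, timedelta
from typing import Tuple

LOOKBACK = timedelta(hours=4)  # from the post: deployments in the last 4 hours


def next_monday_run(now: datetime, hour: int = 9, minute: int = 30) -> datetime:
    """Return the next Monday at hour:minute strictly after `now`.

    In cron notation the same weekly schedule would be written "30 9 * * 1".
    """
    days_ahead = (0 - now.weekday()) % 7  # Monday is weekday 0
    candidate = (now + timedelta(days=days_ahead)).replace(
        hour=hour, minute=minute, second=0, microsecond=0
    )
    if candidate <= now:
        candidate += timedelta(days=7)  # today's slot already passed
    return candidate


def lookback_window(run_time: datetime) -> Tuple[datetime, datetime]:
    """Start and end of the deployment window inspected by a run."""
    return run_time - LOOKBACK, run_time


if __name__ == "__main__":
    run = next_monday_run(datetime(2026, 1, 28, 12, 0))  # a Wednesday
    print(run)                   # 2026-02-02 09:30:00
    print(lookback_window(run))  # window from 05:30 to 09:30 that day
```

One consequence worth noticing: with a weekly trigger and a 4-hour window, only deployments made early on Monday morning fall inside the check, so the frequency and lookback should be chosen together.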

Step 4: See It In Action

Here's what happens when the scheduled task runs:

The scheduled task triggering and initiating Dynatrace analysis for octopets-prod-api

The DynatraceSubagent analyzes the logs and identifies the root cause:

DynatraceSubagent executing DQL queries and returning consolidated log analysis

The RemediationSubagent then generates correlation charts:

5xx errors spiked after deploying revision 0000039

Error volume isolated to revision 0000039; prior revision 0000038 shows zero errors

5xx error rate correlates strongly with deployment of revision 0000039

Finally, with a 95% confidence score, SRE Agent executes the rollback autonomously:

RemediationSubagent executing rollback and traffic shift autonomously.

The agent detected the bad deployment, generated visual evidence, and automatically shifted 100% traffic to the last known working revision—all without human intervention.
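The rollback decision boils down to two steps: pick the newest earlier revision with no attributed errors, then send all traffic to it. The sketch below shows that selection and builds the matching Azure CLI arguments without running them. How the agent performs the shift internally is not shown in this post, so treat this as an illustration. The `Revision` fields, the resource group `octopets-rg`, and the full revision names are hypothetical, and the traffic command assumes the app runs in multiple-revision mode.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class Revision:
    name: str
    created_order: int  # higher means newer
    error_count: int    # 5xx count attributed to this revision


def last_known_good(revisions: List[Revision], bad: str) -> Optional[Revision]:
    """Newest revision older than `bad` with zero attributed errors."""
    bad_rev = next(r for r in revisions if r.name == bad)
    older_clean = [
        r for r in revisions
        if r.created_order < bad_rev.created_order and r.error_count == 0
    ]
    return max(older_clean, key=lambda r: r.created_order, default=None)


def traffic_shift_command(app: str, resource_group: str, revision: str) -> List[str]:
    """Azure CLI arguments that route 100% of ingress traffic to `revision`."""
    return [
        "az", "containerapp", "ingress", "traffic", "set",
        "--name", app,
        "--resource-group", resource_group,
        "--revision-weight", f"{revision}=100",
    ]


if __name__ == "__main__":
    # Made-up revision list mirroring the scenario above.
    revisions = [
        Revision("octopets-prod-api--0000037", 37, 0),
        Revision("octopets-prod-api--0000038", 38, 0),
        Revision("octopets-prod-api--0000039", 39, 120),
    ]
    target = last_known_good(revisions, "octopets-prod-api--0000039")
    if target is not None:
        # Printed rather than executed, so the sketch has no side effects.
        print(" ".join(traffic_shift_command("octopets-prod-api", "octopets-rg", target.name)))
```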

Why This Matters

Before	After
Manually check Dynatrace after incidents	Automated DQL queries via MCP
Stitch together logs + deployments manually	Subagents correlate data automatically
Rollback requires human decision + execution	Confidence-based auto-remediation
75+ minutes from error spike to rollback	Under 5 minutes with autonomous workflow
Reactive incident response	Proactive weekly health checks

Try It Yourself

Connect your observability tool via MCP (Dynatrace, Datadog, New Relic, Prometheus—any tool with an MCP gateway)
Build a log analysis subagent that knows how to query your observability data
Build a remediation subagent that can correlate logs with deployments and execute fixes
Wire them together with handoffs so the subagents can delegate log analysis
Create a scheduled task to trigger the workflow automatically