Using Azure SRE Agent for Autonomous Performance Monitoring and Remediation
What if your infrastructure could detect performance issues and fix them automatically—before your users even notice?
This blog brings that vision to life using Azure SRE Agent, an AI-powered autonomous agent that monitors, detects, and remediates production issues in real-time.
💡 The magic: Zero human intervention required. The agent handles detection, diagnosis, remediation, and reporting—all autonomously.
📺 Watch the Demo
This content was presented at .NET Day 2025. Watch the full session to see Azure SRE Agent in action:
🎬 Fix it before they feel it - .NET Day 2025
🎯 What You'll See in This Demo
Watch as we intentionally deploy "bad" code to production and observe how the SRE Agent:
- Detects the degradation — Compares live response times against learned baselines
- Takes autonomous action — Executes a slot swap to roll back to healthy code
- Communicates the incident — Posts to Teams and creates a GitHub issue
- Generates reports — Summarizes deployment metrics for stakeholders
🚀 Key Capabilities
| Capability | What It Shows |
|---|---|
| Proactive Baseline Learning | Agent learns normal response times and stores them in a knowledge base |
| Real-time Anomaly Detection | Instant comparison of current vs. baseline metrics |
| Autonomous Remediation | Agent executes Azure CLI commands to swap slots without human approval |
| Cross-platform Communication | Automatic Teams posts and GitHub issue creation |
| Incident Reporting | End-of-day email summaries with deployment health metrics |
Architecture Overview
The solution uses Azure SRE Agent with three specialized sub-agents working together:
Components
Application Layer:
- .NET 9 Web API running on Azure App Service
- Application Insights for telemetry collection
- Azure Monitor Alerts for incident triggers
Azure SRE Agent:
- AvgResponseTime Sub-Agent: Captures baseline metrics every 15 minutes, stores in Knowledge Store
- DeploymentHealthCheck Sub-Agent: Triggered by deployment alerts, compares metrics to baseline, auto-remediates
- DeploymentReporter Sub-Agent: Generates daily summary emails from Teams activity
External Integrations:
- GitHub (issue creation, semantic code search, Copilot assignment)
- Microsoft Teams (deployment summaries)
- Outlook (summary reports)
What the SRE Agent Does
When a deployment occurs, the SRE Agent autonomously performs the following actions:
1. Health Check (DeploymentHealthCheck Sub-Agent)
When a slot swap alert fires, the agent:
- Queries App Insights for current response times
- Retrieves the baseline from the Knowledge Store
- Compares current performance against baseline
- If degradation > 20%: Executes rollback and creates GitHub issue
- If healthy: Posts confirmation to Teams
Healthy Deployment - No Action Needed (Teams Post):
The agent confirms the deployment is healthy — response time (22ms) is 80% faster than baseline (116ms).
Degraded Deployment - Automatic Rollback (Teams Post):
The agent detects +332% latency regression (212ms vs 116ms baseline), executes a slot swap to rollback, and creates GitHub Issue.
2. Daily Summary (DeploymentReporter Sub-Agent)
Every 24 hours, the reporter agent:
- Reads all Teams deployment posts from the last 24 hours
- Aggregates deployment metrics
- Sends an executive summary email
Daily Summary (Outlook Email):
The daily report shows 9 deployments, 6 healthy, 3 rollbacks, and 3 GitHub issues created — complete with response time details and issue links.
Demo Flow
View Step-by-Step Instructions →
| Step | Action |
|---|---|
| Step 1 | Deploy Infrastructure + Applications |
| Step 2 | Create Sub-Agents, Triggers & Schedules |
| Step 3 | Swap bad code, watch agent remediate |
Setting Up the Demo
Prerequisites
- Azure subscription with Contributor access
- Azure CLI installed and logged in (az login)
- .NET 9.0 SDK
- PowerShell 7.0+
Step 1: Deploy Infrastructure
cd scripts .\1-setup-demo.ps1 -ResourceGroupName "sre-demo-rg" -AppServiceName "sre-demo-app-12345"
This script will:
- Prompt for Azure subscription selection
- Deploy Azure infrastructure (App Service, App Insights, Alerts)
- Build and deploy healthy code to production
- Build and deploy problematic code to staging
Step 2: Configure Azure SRE Agent
Navigate to Azure SRE Agents Portal Sub Agent builder tab and create three sub-agents:
| Sub-Agent | Purpose | Tools Used |
|---|---|---|
| AvgResponseTime | Captures baseline response time metrics | QueryAppInsightsByAppId, UploadKnowledgeDocument |
| DeploymentHealthCheck | Detects degradation and executes remediation | SearchMemory, QueryAppInsights, PostTeamsMessage, CreateGithubIssue, Az CLI commands |
| DeploymentReporter | Generates deployment summary reports | GetTeamsMessages, SendOutlookEmail |
Creating Each Sub-Agent
In the Github links below you can find gif images that capture the creation flow
AvgResponseTime + Baseline Task:
DeploymentHealthCheck + Swap Alert:
DeploymentReporter + Reporter Task:
Step 3: Run the Demo
.\2-run-demo.ps1
This triggers the following flow:
Slot Swap Occurs (demo script)
▼ Activity Log Alert Fires
▼ Incident Trigger Activated
▼ DeploymentHealthCheck Agent Runs
─ Queries current response time from App Insights
─ Retrieves baseline from knowledge store
─ Compares (if >20% degradation)
─ Executes: az webapp deployment slot swap
─ Creates GitHub issue (if degraded)
─ Posts to Teams channel
Demo Timeline
| Time | Event |
|---|---|
| 0:00 | Run 2-run-demo.ps1 |
| 0:30 | Swap staging → production (bad code deployed) |
| 1:00 | Production now slow (~1500ms vs ~50ms baseline) |
| ~5:00 | Slot Swap Alert fires |
| ~5:04 | Agent executes slot swap (rollback) |
| ~5:30 | Production restored to healthy state |
| ~6:00 | Agent posts to Teams, creates GitHub issue |
How the Performance Toggle Works
The app has a compile-time toggle in ProductsController.cs:
private const bool EnableSlowEndpoints = false; // false = fast, true = slow
The setup script creates two versions:
- Production: EnableSlowEndpoints = false → ~50ms responses
- Staging: EnableSlowEndpoints = true → ~1500ms responses (artificial delay)
Get Started
🔗 Full source code and instructions: github.com/microsoft/sre-agent/samples/proactive-reliability
🔗 Azure SRE Agent documentation: https://learn.microsoft.com/en-us/azure/sre-agent/
Technology Stack
- Framework: ASP.NET Core 9.0
- Infrastructure: Azure Bicep
- Monitoring: Application Insights + Log Analytics
- Automation: Azure SRE Agent
- Scripts: PowerShell 7.0+
Tags: Azure, SRE Agent, DevOps, Reliability, .NET, App Service, Application Insights, Autonomous Remediation