Fix It Before They Feel It: Higher Reliability with Proactive Mitigation

saziz_msft
Dec 22, 2025

Using Azure SRE Agent for Autonomous Performance Monitoring and Remediation

What if your infrastructure could detect performance issues and fix them automatically—before your users even notice?

This post brings that vision to life using Azure SRE Agent, an AI-powered autonomous agent that monitors, detects, and remediates production issues in real time.

💡 The magic: Zero human intervention required. The agent handles detection, diagnosis, remediation, and reporting—all autonomously.

📺 Watch the Demo

This content was presented at .NET Day 2025. Watch the full session to see Azure SRE Agent in action:

🎬 Fix it before they feel it - .NET Day 2025

🎯 What You'll See in This Demo

Watch as we intentionally deploy "bad" code to production and observe how the SRE Agent:

  1. Detects the degradation — Compares live response times against learned baselines
  2. Takes autonomous action — Executes a slot swap to roll back to healthy code
  3. Communicates the incident — Posts to Teams and creates a GitHub issue
  4. Generates reports — Summarizes deployment metrics for stakeholders

🚀 Key Capabilities

| Capability | What It Shows |
| --- | --- |
| Proactive Baseline Learning | Agent learns normal response times and stores them in a knowledge base |
| Real-time Anomaly Detection | Instant comparison of current vs. baseline metrics |
| Autonomous Remediation | Agent executes Azure CLI commands to swap slots without human approval |
| Cross-platform Communication | Automatic Teams posts and GitHub issue creation |
| Incident Reporting | End-of-day email summaries with deployment health metrics |

Architecture Overview

The solution uses Azure SRE Agent with three specialized sub-agents working together:

Components

Application Layer:

  • .NET 9 Web API running on Azure App Service
  • Application Insights for telemetry collection
  • Azure Monitor Alerts for incident triggers

Azure SRE Agent:

  • AvgResponseTime Sub-Agent: Captures baseline metrics every 15 minutes and stores them in the Knowledge Store
  • DeploymentHealthCheck Sub-Agent: Triggered by deployment alerts, compares metrics to baseline, auto-remediates
  • DeploymentReporter Sub-Agent: Generates daily summary emails from Teams activity
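
As a rough illustration of the baseline step, the sketch below averages a window of response-time samples and packages the result as a small record. The helper name `build_baseline_document`, the field names, and the sample values are all hypothetical; the actual sub-agent relies on its own tools to query App Insights and upload to the Knowledge Store.

```python
from statistics import mean


def build_baseline_document(samples_ms, window_minutes=15):
    """Summarize response-time samples (in ms) into a baseline record.

    Hypothetical helper: the field names are illustrative, not the agent's schema.
    """
    if not samples_ms:
        raise ValueError("need at least one sample to compute a baseline")
    return {
        "metric": "avg_response_time_ms",
        "window_minutes": window_minutes,
        "sample_count": len(samples_ms),
        "baseline_ms": round(mean(samples_ms), 1),
    }


# Made-up samples clustered around the ~50 ms figure used in the demo.
doc = build_baseline_document([48, 52, 50, 47, 53])
print(doc)
```

Storing the sample count alongside the average is a design choice that lets a later comparison decide whether the baseline is based on enough data to trust.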

External Integrations:

  • GitHub (issue creation, semantic code search, Copilot assignment)
  • Microsoft Teams (deployment summaries)
  • Outlook (summary reports)

What the SRE Agent Does

When a deployment occurs, the SRE Agent autonomously performs the following actions:

1. Health Check (DeploymentHealthCheck Sub-Agent)

When a slot swap alert fires, the agent:

  • Queries App Insights for current response times
  • Retrieves the baseline from the Knowledge Store
  • Compares current performance against baseline
  • If degradation > 20%: Executes rollback and creates GitHub issue
  • If healthy: Posts confirmation to Teams
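
The decision rule in the list above can be sketched as a single comparison. This is a minimal sketch that assumes "degradation" means the relative increase in response time over the baseline; the function name `evaluate_deployment` and the returned labels are illustrative, not part of the agent.

```python
def evaluate_deployment(current_ms, baseline_ms, threshold=0.20):
    """Return (action, change), where change is the relative change vs. baseline.

    Assumption: degradation = (current - baseline) / baseline.
    """
    if baseline_ms <= 0:
        raise ValueError("baseline must be positive")
    change = (current_ms - baseline_ms) / baseline_ms
    action = "rollback" if change > threshold else "healthy"
    return action, change


# Figures taken from this post: the healthy example (22 ms vs. a 116 ms baseline)
# and the demo's slow build (~1500 ms vs. a ~50 ms baseline).
for current, baseline in [(22, 116), (1500, 50)]:
    action, change = evaluate_deployment(current, baseline)
    print(f"{current} ms vs {baseline} ms: {change:+.0%} -> {action}")
```

A response time exactly 20% above baseline is treated as healthy here, because the rule in the list is a strict "greater than".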

Healthy Deployment - No Action Needed (Teams Post):

The agent confirms the deployment is healthy — response time (22ms) is 80% faster than baseline (116ms).

 

Degraded Deployment - Automatic Rollback (Teams Post):

The agent detects a +332% latency regression (212ms vs 116ms baseline), executes a slot swap to roll back, and creates a GitHub issue.

2. Daily Summary (DeploymentReporter Sub-Agent)

Every 24 hours, the reporter agent:

  • Reads all Teams deployment posts from the last 24 hours
  • Aggregates deployment metrics
  • Sends an executive summary email

Daily Summary (Outlook Email):

The daily report shows 9 deployments, 6 healthy, 3 rollbacks, and 3 GitHub issues created — complete with response time details and issue links.
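
The aggregation behind a report like this can be approximated with a simple count. The sketch below assumes each Teams post has already been parsed into a dictionary with a `status` field and an optional `issue_url`; that shape, the helper name `summarize_deployments`, and the placeholder URLs are hypothetical.

```python
from collections import Counter


def summarize_deployments(posts):
    """Count outcomes across parsed deployment posts (hypothetical dict shape)."""
    counts = Counter(post["status"] for post in posts)
    return {
        "deployments": len(posts),
        "healthy": counts["healthy"],
        "rollbacks": counts["rollback"],
        "issues_created": sum(1 for post in posts if post.get("issue_url")),
    }


# Input shaped to reproduce the counts in the sample report above.
posts = [{"status": "healthy"} for _ in range(6)] + [
    {"status": "rollback", "issue_url": f"https://example.invalid/issues/{n}"}
    for n in (1, 2, 3)
]
summary = summarize_deployments(posts)
print(summary)
```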

Demo Flow

View Step-by-Step Instructions →

Setting Up the Demo

Prerequisites

  • Azure subscription with Contributor access
  • Azure CLI installed and logged in (az login)
  • .NET 9.0 SDK
  • PowerShell 7.0+

Step 1: Deploy Infrastructure

cd scripts
.\1-setup-demo.ps1 -ResourceGroupName "sre-demo-rg" -AppServiceName "sre-demo-app-12345"

This script will:

  1. Prompt for Azure subscription selection
  2. Deploy Azure infrastructure (App Service, App Insights, Alerts)
  3. Build and deploy healthy code to production
  4. Build and deploy problematic code to staging

Full Setup Instructions →

Step 2: Configure Azure SRE Agent

Navigate to the Sub Agent builder tab in the Azure SRE Agent portal and create three sub-agents:

| Sub-Agent | Purpose | Tools Used |
| --- | --- | --- |
| AvgResponseTime | Captures baseline response time metrics | QueryAppInsightsByAppId, UploadKnowledgeDocument |
| DeploymentHealthCheck | Detects degradation and executes remediation | SearchMemory, QueryAppInsights, PostTeamsMessage, CreateGithubIssue, Az CLI commands |
| DeploymentReporter | Generates deployment summary reports | GetTeamsMessages, SendOutlookEmail |

Creating Each Sub-Agent

The GitHub links below include GIF images that capture the creation flow.

AvgResponseTime + Baseline Task: 

Detailed Instructions →

DeploymentHealthCheck + Swap Alert: 

Detailed Instructions →

DeploymentReporter + Reporter Task: 

Detailed Instructions →

Step 3: Run the Demo

.\2-run-demo.ps1

This triggers the following flow:

Slot Swap Occurs (demo script)
  ▼
Activity Log Alert Fires
  ▼
Incident Trigger Activated
  ▼
DeploymentHealthCheck Agent Runs
  ─ Queries current response time from App Insights
  ─ Retrieves baseline from knowledge store
  ─ Compares current vs. baseline (degraded if >20% slower)
  ─ Executes: az webapp deployment slot swap (if degraded)
  ─ Creates GitHub issue (if degraded)
  ─ Posts to Teams channel
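
For readers who want to see what the rollback step in the flow above might look like as a full command, here is a minimal sketch that builds an `az webapp deployment slot swap` invocation. The resource group and app names are the example values from Step 1, the slot names `staging` and `production` follow the demo, and the helper `build_slot_swap_command` is hypothetical. The exact arguments the agent passes are not shown in this post.

```python
import shlex


def build_slot_swap_command(resource_group, app_name,
                            source_slot="staging", target_slot="production"):
    """Build the argument list for `az webapp deployment slot swap`."""
    return [
        "az", "webapp", "deployment", "slot", "swap",
        "--resource-group", resource_group,
        "--name", app_name,
        "--slot", source_slot,
        "--target-slot", target_slot,
    ]


cmd = build_slot_swap_command("sre-demo-rg", "sre-demo-app-12345")
print(shlex.join(cmd))

# To actually run it (requires Azure CLI and a logged-in session):
# import subprocess
# subprocess.run(cmd, check=True)
```

Because a swap exchanges the contents of the two slots, running the same swap a second time is what returns the previous healthy build to production.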

Full Demo Instructions →

Demo Timeline

| Time | Event |
| --- | --- |
| 0:00 | Run 2-run-demo.ps1 |
| 0:30 | Swap staging → production (bad code deployed) |
| 1:00 | Production now slow (~1500ms vs ~50ms baseline) |
| ~5:00 | Slot Swap Alert fires |
| ~5:04 | Agent executes slot swap (rollback) |
| ~5:30 | Production restored to healthy state |
| ~6:00 | Agent posts to Teams, creates GitHub issue |
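
The timeline also gives a rough sense of how long users are exposed to the slow build. The sketch below converts the approximate timestamps above into seconds and derives three intervals; the interval labels are an interpretation of the timeline rather than metrics reported by the agent.

```python
def to_seconds(stamp):
    """Convert an 'm:ss' timestamp (optionally prefixed with '~') to seconds."""
    minutes, seconds = stamp.lstrip("~").split(":")
    return int(minutes) * 60 + int(seconds)


# Timestamps copied from the demo timeline.
timeline = {
    "bad_code_live": "0:30",
    "alert_fires": "~5:00",
    "rollback_started": "~5:04",
    "restored": "~5:30",
}

detect = to_seconds(timeline["alert_fires"]) - to_seconds(timeline["bad_code_live"])
react = to_seconds(timeline["rollback_started"]) - to_seconds(timeline["alert_fires"])
degraded = to_seconds(timeline["restored"]) - to_seconds(timeline["bad_code_live"])

print(f"swap to alert: ~{detect}s, alert to rollback: ~{react}s, total degraded: ~{degraded}s")
```

Under these figures, most of the degraded window is spent waiting for the alert to fire, while the agent's own reaction takes only a few seconds.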

How the Performance Toggle Works

The app has a compile-time toggle in ProductsController.cs:

private const bool EnableSlowEndpoints = false; // false = fast, true = slow

The setup script creates two versions:

  • Production: EnableSlowEndpoints = false → ~50ms responses
  • Staging: EnableSlowEndpoints = true → ~1500ms responses (artificial delay)
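
To make the toggle's effect concrete, here is a language-neutral model of the idea in Python. It is not the C# code from ProductsController.cs: the function `get_products`, the 1.5-second delay value, and the use of a sleep call are assumptions chosen to match the ~1500ms figure above.

```python
import time

ENABLE_SLOW_ENDPOINTS = False   # mirrors the C# constant: False = fast, True = slow
ARTIFICIAL_DELAY_SECONDS = 1.5  # assumption, chosen to match the ~1500 ms figure


def get_products(enable_slow=ENABLE_SLOW_ENDPOINTS, sleep=time.sleep):
    """Return a product list, optionally after an artificial delay.

    The sleep function is injectable so the slow path can be exercised
    without actually waiting.
    """
    if enable_slow:
        sleep(ARTIFICIAL_DELAY_SECONDS)
    return [{"id": 1, "name": "sample"}]


# Fast path: the flag is off, so no delay is introduced.
print(get_products())
```

In the demo the flag is a compile-time constant, so the two behaviors come from two separate builds rather than from a runtime switch like the parameter used here.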

Get Started

🔗 Full source code and instructions: github.com/microsoft/sre-agent/samples/proactive-reliability

🔗 Azure SRE Agent documentation: https://learn.microsoft.com/en-us/azure/sre-agent/

Technology Stack

  • Framework: ASP.NET Core 9.0
  • Infrastructure: Azure Bicep
  • Monitoring: Application Insights + Log Analytics
  • Automation: Azure SRE Agent
  • Scripts: PowerShell 7.0+

Tags: Azure, SRE Agent, DevOps, Reliability, .NET, App Service, Application Insights, Autonomous Remediation

Updated Dec 23, 2025