When things break at 2 AM, your AI teammate is already investigating.
[SRE Agent portal overview with incident list]
The Reactive Incident Challenge
Your monitoring is solid. Alerts fire when they should. But then what?
- Alert lands in Teams/PagerDuty
- On-call engineer wakes up, logs in
- Starts investigating: "What's broken? Why? How do I fix it?"
- 20 minutes later, they're still gathering context
The alert was fast. The human response? Not so much.
The Traditional Incident Response Flow
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Alert    │───▶│    Human    │───▶│   Manual    │───▶│ Resolution  │
│    Fires    │    │ Acknowledges│    │Investigation│    │   (Maybe)   │
│             │    │  (5-15 min) │    │ (15-30 min) │    │             │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      t=0             t=5-15min         t=20-45min         t=30-60min
The SRE Agent Flow
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Alert    │───▶│  SRE Agent  │───▶│     AI      │───▶│    Human    │
│    Fires    │    │ Acknowledges│    │Investigation│    │  Approves   │
│             │    │  (Instant)  │    │ (2-10 min)  │    │             │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
      t=0               t=0              t=2-10min         t=10-15min
What if the investigation started the moment the alert fired?
That's exactly what Azure SRE Agent does. It doesn't wait for humans to acknowledge—it starts investigating immediately, gathering context, identifying root causes, and preparing remediation options.
I tested this with two real-world scenarios: a database connectivity outage and a VM CPU spike. Here's what happened.
Two Real-World Incidents
| Scenario | Trigger | Root Cause | Resolution |
|---|---|---|---|
| Web App Health Failure | Sev1 Alert - Health check failing | SQL Server public access disabled | Enabled public access + firewall rule |
| VM High CPU | Sev2 Alert - CPU > 85% for 5 mins | Runaway PowerShell processes | Identified and killed processes |
Both incidents were detected, diagnosed, and remediated by SRE Agent with minimal human intervention—just approval clicks.
Incident 1: Azure SQL Database Connectivity Outage
The Alert
🔴 Sev1 Alert Fired
Alert Rule: sre-demo-webapp-health-alert
Description: Alert when Web App health check fails - indicates backend/database connectivity issues
Time: 02/04/2026 07:59:35 UTC
Alert Configuration Details
The alert was configured using Azure Monitor metric alerts:
resource webAppHealthAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'sre-demo-webapp-health-alert'
  properties: {
    severity: 1
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'HealthCheckStatus'
          metricName: 'HealthCheckStatus'
          operator: 'LessThan'
          threshold: 100
          timeAggregation: 'Average'
        }
      ]
    }
    targetResourceType: 'Microsoft.Web/sites'
    targetResourceRegion: 'centralindia'
  }
}
What SRE Agent Did (Autonomously)
[SRE Agent chat showing the investigation steps and thinking process]
The moment the alert fired, SRE Agent acknowledged and began investigating:
- Symptom Assessment
- Pulled web app ARM configuration (AlwaysOn, Basic plan, system-assigned identity)
- Analyzed HTTP 5xx and request metrics over 2 hours
- Observed intermittent traffic spikes indicating service impact
// KQL query SRE Agent ran against Application Insights
// (resultCode is a string, so cast it before the numeric comparison)
requests
| where timestamp > ago(2h)
| summarize
    TotalRequests = count(),
    FailedRequests = countif(toint(resultCode) >= 500),
    FailureRate = round(100.0 * countif(toint(resultCode) >= 500) / count(), 2)
| project TotalRequests, FailedRequests, FailureRate
- Dependency Mapping
[Application Insights showing SQL dependency failures at 100%]
- Queried Application Insights to identify failing backends
- Found: sre-demo-sql-6o26gsgynw436.database.windows.net failing 100% (80/80 calls) in last 30 minutes
- Result code: 503 on "SQL Health Check" and "GetProducts" operations
// Dependency failure analysis
dependencies
| where timestamp > ago(30m)
| where target contains "database.windows.net"
| summarize
TotalCalls = count(),
FailedCalls = countif(success == false),
FailureRate = round(100.0 * countif(success == false) / count(), 2)
| project TotalCalls, FailedCalls, FailureRate
- Network Validation
- Tested DNS resolution from web app to SQL endpoint ✅ Success
- Tested TCP reachability on port 1433 ✅ Success
- Conclusion: Network path is healthy; issue is at access/auth layer
# SRE Agent validated network connectivity using:
# DNS Resolution Test
nslookup sre-demo-sql-6o26gsgynw436.database.windows.net
# TCP Port Test (from App Service)
tcpping sre-demo-sql-6o26gsgynw436.database.windows.net:1433
- Configuration Analysis
# SRE Agent queried SQL server configuration
az sql server show -g rg-sre-demo-india -n sre-demo-sql-6o26gsgynw436 \
--query "{publicNetworkAccess:publicNetworkAccess, fullyQualifiedDomainName:fullyQualifiedDomainName}"
# Output:
{
  "publicNetworkAccess": "Disabled",   # ⚠️ Root cause
  "fullyQualifiedDomainName": "sre-demo-sql-6o26gsgynw436.database.windows.net"
}
- Discovered: Azure SQL public network access = Disabled
- Web app has no VNet integration or Private Endpoint
- Root cause identified: Access model mismatch
The Root Cause Analysis
Root cause: Azure SQL public network access is Disabled while the web app has no VNet integration/private endpoint, so the app cannot reach SQL at the access model layer.
SRE Agent presented three remediation options:
| Option | Approach | Speed | Security | Use Case |
|---|---|---|---|---|
| A | Enable public access + Allow Azure Services (0.0.0.0) | ⚡ Fast | 🟡 Moderate | Quick restore, non-prod |
| B | Add web app's specific outbound IPs to firewall | 🐢 Slower | 🟢 Stricter | Production environments |
| C | Configure Private Endpoint + VNet Integration | 🐢🐢 Slowest | 🟢🟢 Best | Long-term solution |
Remediation (With Approval)
[SRE Agent asking for approval before executing remediation]
I approved Option A for rapid restoration. SRE Agent executed:
# Step 1: Enable public network access
az sql server update -g rg-sre-demo-india -n sre-demo-sql-6o26gsgynw436 \
--subscription <subid> \
--set publicNetworkAccess=Enabled
# Step 2: Add Azure Services firewall rule
az sql server firewall-rule create \
-g rg-sre-demo-india \
-s sre-demo-sql-6o26gsgynw436 \
-n AllowAzureServices \
--subscription <subid> \
--start-ip-address 0.0.0.0 \
--end-ip-address 0.0.0.0
⚠️ Security Note: The 0.0.0.0 rule allows traffic from any Azure service, not just your web app. For production, use Option B (specific IPs) or Option C (Private Endpoint).
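If you later tighten this to Option B, a minimal sketch (assuming the demo resource names above; the web app name is a placeholder) is to enumerate the web app's outbound IPs and add one firewall rule per address:
# Option B sketch: scope the SQL firewall to the web app's outbound IPs only
$ips = (az webapp show -g rg-sre-demo-india -n <webAppName> `
        --query outboundIpAddresses -o tsv) -split ','
$i = 0
foreach ($ip in $ips) {
    az sql server firewall-rule create `
        -g rg-sre-demo-india `
        -s sre-demo-sql-6o26gsgynw436 `
        -n "webapp-outbound-$i" `
        --start-ip-address $ip `
        --end-ip-address $ip
    $i++
}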
Recovery Verified
SRE Agent automatically verified recovery by re-querying Application Insights:
// Post-remediation verification
dependencies
| where timestamp > ago(10m)
| where target contains "database.windows.net"
| summarize
TotalCalls = count(),
SuccessfulCalls = countif(success == true),
SuccessRate = round(100.0 * countif(success == true) / count(), 2)
Results:
- SQL dependencies: 65/65 successful (100% success rate)
- HTTP 5xx errors: Dropped to 0
- Service restored ✅
Timeline
| Time (UTC) | Event | Duration |
|---|---|---|
| 07:59:35 | Alert fired | - |
| 07:59:36 | SRE Agent acknowledged | +1s |
| 08:00:00 | Started symptom assessment | +25s |
| 08:05:00 | Dependency mapping complete | +5m |
| 08:08:00 | Network validation complete | +3m |
| 08:10:00 | Root cause identified | +2m |
| 08:16:00 | Remediation approved | +6m (human) |
| 08:17:00 | Remediation executed | +1m |
| 08:20:00 | Recovery verified | +3m |
Total time from alert to resolution: ~20 minutes (6 minutes waiting for human approval)
Incident 2: VM High CPU Spike
The Alert
[Azure VM showing Average CPU metric is increasing]
🟡 Sev2 Alert Fired
Alert Rule: sre-demo-vm-cpu-alert
Description: Alert when VM CPU exceeds 85% - indicates runaway process or resource exhaustion
Resource: sre-demo-vm
Time: 02/04/2026 16:16:18 UTC
Alert Configuration Details
The VM CPU alert was configured as a metric alert:
resource vmCpuAlert 'Microsoft.Insights/metricAlerts@2018-03-01' = {
  name: 'sre-demo-vm-cpu-alert'
  properties: {
    severity: 2
    evaluationFrequency: 'PT1M'
    windowSize: 'PT5M'
    criteria: {
      'odata.type': 'Microsoft.Azure.Monitor.SingleResourceMultipleMetricCriteria'
      allOf: [
        {
          name: 'HighCPU'
          metricName: 'Percentage CPU'
          operator: 'GreaterThan'
          threshold: 85
          timeAggregation: 'Average'
        }
      ]
    }
    targetResourceType: 'Microsoft.Compute/virtualMachines'
  }
}
What SRE Agent Did
[SRE Agent chat showing VM investigation and Run Command execution]
- Process Capture via VM Run Command
SRE Agent requested approval to run a safe, read-only command to capture top CPU processes:
# Read-only diagnostic command
Get-Process | Sort-Object CPU -Descending | Select-Object -First 10 Name,CPU,Id | ConvertTo-Json
The agent used Azure VM Run Command (az vm run-command invoke) to execute PowerShell remotely:
az vm run-command invoke \
-g rg-sre-demo-india \
-n sre-demo-vm \
--subscription <subid> \
--command-id RunPowerShellScript \
--scripts "Get-Process | Sort-Object CPU -Descending | Select-Object -First 10 Name,CPU,Id | ConvertTo-Json"
- Runaway Process Identification
Results revealed two PowerShell processes consuming excessive CPU:
[
{ "Name": "powershell", "CPU": 683.45, "Id": 3164 },
{ "Name": "powershell", "CPU": 652.12, "Id": 2776 },
{ "Name": "MsMpEng", "CPU": 54.23, "Id": 1892 },
{ "Name": "svchost", "CPU": 12.34, "Id": 1024 }
]
| Process | PID | CPU Time (seconds) | Assessment | Reasoning |
|---|---|---|---|---|
| powershell | 3164 | 683.45s (~11 min) | 🔴 Runaway | CPU time > 60s threshold from IRP |
| powershell | 2776 | 652.12s (~10 min) | 🔴 Runaway | CPU time > 60s threshold from IRP |
| MsMpEng | 1892 | 54.23s | ✅ Normal | Windows Defender - expected |
| svchost | 1024 | 12.34s | ✅ Normal | System process - expected |
SRE Agent correctly identified these as stress/runaway processes based on the custom instructions I provided in the Incident Response Plan:
"If process is 'powershell' with CPU > 80 seconds → LIKELY stress script"
- Targeted Remediation
With my approval, SRE Agent executed targeted process termination:
az vm run-command invoke \
-g rg-sre-demo-india \
-n sre-demo-vm \
--subscription <subid> \
--command-id RunPowerShellScript \
--scripts "Stop-Process -Id 3164 -Force -ErrorAction SilentlyContinue; Stop-Process -Id 2776 -Force -ErrorAction SilentlyContinue; Write-Output 'Stopped'"
💡 Why specific PIDs? SRE Agent targeted only the identified runaway processes (PIDs 3164, 2776) rather than killing all PowerShell processes. This minimizes blast radius and avoids disrupting legitimate automation.
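A slightly more defensive variant (hypothetical, not what the agent executed) re-checks each PID against the IRP criteria before stopping it, so a recycled PID can't be terminated by mistake:
# Hypothetical guard: only stop the PID if it is still a runaway powershell process
$targets = 3164, 2776
foreach ($procId in $targets) {
    $p = Get-Process -Id $procId -ErrorAction SilentlyContinue
    if ($p -and $p.Name -eq 'powershell' -and $p.CPU -gt 60) {
        Stop-Process -Id $procId -Force
    }
}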
- Recovery Verification
Post-remediation check showed:
// After remediation - Top processes
[
{ "Name": "MsMpEng", "CPU": 54.23, "Id": 1892 }, // Now the top consumer
{ "Name": "svchost", "CPU": 12.34, "Id": 1024 },
{ "Name": "WmiPrvSE", "CPU": 8.12, "Id": 2048 }
]
- ✅ PowerShell processes no longer in top CPU list
- ✅ Highest CPU consumer: MsMpEng (Windows Defender) at ~54s - normal baseline
- ✅ VM CPU normalized
Technical Deep Dive: Understanding CPU Metrics
An important learning from this incident:
| Metric | What It Measures | When to Use |
|---|---|---|
| Get-Process.CPU | Cumulative CPU time in seconds since process start | Identifying long-running resource hogs |
| Get-Counter '\Processor(_Total)\% Processor Time' | Instantaneous CPU percentage | Validating current system state |
| Get-CimInstance Win32_Processor | CPU load percentage | Quick health check |
SRE Agent initially tried to verify recovery using performance counters but encountered parsing issues. The Session Insights captured this learning for future incidents.
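For reference, here is a small sketch of both views (assuming it runs on the VM, for example via Run Command):
# Cumulative CPU seconds per process (what Get-Process.CPU reports)
Get-Process | Sort-Object CPU -Descending |
    Select-Object -First 5 Name, @{n='CpuSeconds'; e={[math]::Round($_.CPU, 1)}}, Id
# Instantaneous CPU load percentage - easier to quote inside Run Command than Get-Counter
$load = (Get-CimInstance Win32_Processor |
    Measure-Object -Property LoadPercentage -Average).Average
Write-Output "Current CPU load: $load%"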
Timeline
| Time (UTC) | Event | Duration |
|---|---|---|
| 16:16:18 | Alert fired (CPU > 85% for 5 min) | - |
| 16:16:20 | SRE Agent acknowledged | +2s |
| 16:48:00 | Process capture approved | +32m (human delay) |
| 16:48:30 | Top processes captured | +30s |
| 16:51:00 | Runaway processes identified | +2.5m |
| 16:52:00 | Remediation approved | +1m |
| 16:52:30 | Processes terminated | +30s |
| 16:55:00 | Recovery verified | +2.5m |
Total time from alert to resolution: ~39 minutes (32 minutes waiting for initial human approval)
Why Custom Instructions Matter
Out of the box, SRE Agent knows Azure. But it doesn't know your environment.
For the VM CPU scenario, I created an Incident Response Plan with custom instructions that taught the agent:
- What "HighCpuProcess" means (it's our test stress process)
- When it's safe to kill PowerShell processes (CPU > 60 seconds)
- How to validate recovery (check CPU percentage)
- When to escalate vs. auto-remediate
Full Custom Instructions for VM CPU Scenario
You are investigating a high CPU alert on a Windows Virtual Machine.
INVESTIGATION METHODOLOGY:
- Connect to the VM and query current CPU usage
- Identify which process is consuming the most CPU
- Determine if the process is legitimate or a runaway/malicious process
- Take appropriate action based on findings
DIAGNOSTIC STEPS:
- Use Azure VM Run Command to execute diagnostic scripts on the VM
- Query the top CPU-consuming processes using:
- Get-Process | Sort-Object CPU -Descending | Select-Object -First 10 Name, CPU, Id
- Check for known runaway process indicators:
- Process name contains "HighCpuProcess" → This is a test stress process, safe to kill
- PowerShell process with unusually high CPU → Likely a stress script, investigate further
- Unknown process consuming >50% CPU → Potential runaway, gather more info before killing
IDENTIFICATION CRITERIA:
- If process name is "HighCpuProcess" → CONFIRMED runaway test process
- If process is "powershell" with CPU > 80 seconds → LIKELY stress script
- If multiple PowerShell background jobs named "HighCpuProcess-*" exist → CONFIRMED stress test
REMEDIATION ACTIONS:
For PowerShell stress jobs:
Get-Job -Name "HighCpuProcess*" | Stop-Job
For high-CPU PowerShell processes:
Get-Process -Name "powershell*" | Where-Object { $_.CPU -gt 60 } | Stop-Process -Force
General process termination (use process ID from investigation):
Stop-Process -Id <ProcessId> -Force
VALIDATION:
After remediation, verify CPU has returned to normal:
$cpu = (Get-Counter '\Processor(_Total)\% Processor Time' -SampleInterval 2 -MaxSamples 3 |
Select-Object -ExpandProperty CounterSamples |
Measure-Object -Property CookedValue -Average).Average
Write-Host "Current CPU: $([math]::Round($cpu, 1))%"
ESCALATION:
- If CPU remains high after killing identified processes, escalate to human operator
- If process is a critical system process, do NOT kill - escalate instead
- If unable to connect to VM, check VM health and network connectivity first
How Custom Instructions Change Agent Behavior
| Without Custom Instructions | With Custom Instructions |
|---|---|
| "I see high CPU on this VM" | "PowerShell PID 3164 has 683s CPU time, exceeding 60s threshold - confirmed runaway" |
| "Should I investigate?" | "Based on IRP criteria, this matches stress script pattern - recommending termination" |
| Generic troubleshooting | Targeted, context-aware remediation |
| May escalate unnecessarily | Knows when to act vs. escalate |
This context transformed SRE Agent from a generic troubleshooter into a teammate who understands our specific runbooks.
What SRE Agent Learned (Session Insights)
After each incident, SRE Agent generates Session Insights—a structured summary of what happened, what went well, and what to improve. These become organizational knowledge.
Session Insights Structure
TIMELINE
├── Event 1: Initial acknowledgment
├── Event 2: Symptom assessment
├── Event 3: Root cause identified
├── Event 4: Remediation executed
└── Event 5: Recovery verified
EVALUATION
├── What Went Well
│   └── Specific actions that succeeded
└── What Didn't Go Well
    └── Issues encountered + better approaches
DERIVED LEARNING
├── System Design Knowledge
│   └── Azure-specific learnings
└── Investigation Pattern
    └── Reusable troubleshooting approaches
From Incident 1 (SQL Connectivity):
What Went Well:
- Rapid isolation of failing backend: Used Application Insights to pinpoint the SQL dependency target with 80/80 failures
- Layered validation before change: Validated DNS and TCP connectivity to confirm network path
- Targeted remediation with verification: Enabled SQL public access and confirmed recovery through dependency metrics
What Didn't Go Well:
- Metric query failed for HealthCheckStatus: "cannot support requested time grain: 00:01:00"
- Better approach: Use supported grains (00:05:00, 01:00:00) or query Requests/Http5xx instead
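A sketch of that better approach, re-querying the metric at a supported 5-minute grain (the resource ID is a placeholder):
# Query HealthCheckStatus at a supported 5-minute grain instead of 1 minute
az monitor metrics list `
  --resource "<webAppResourceId>" `
  --metric "HealthCheckStatus" `
  --interval PT5M `
  --aggregation Average `
  --offset 1h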
System Design Knowledge:
Azure SQL: Disabling publicNetworkAccess blocks App Service access unless a Private Endpoint + VNet integration is in place; enabling PNA plus an appropriate firewall rule restores reachability quickly.
Investigation Pattern:
Triage pattern: platform metrics (Requests/Http5xx) → App Insights dependencies to find the failing backend → connectivity probes (DNS/TCP) → configuration check (PNA/firewall) → minimal remediation → telemetry verification.
From Incident 2 (VM CPU):
What Went Well:
- Efficient diagnostics via Run Command: Used az vm run-command invoke with a simple Get-Process pipeline
- Targeted remediation: Stopped specific PIDs with minimal script lines
- Clear verification step: Rechecked top processes to confirm normalization
What Didn't Go Well:
- Safety validation blocked Remove-Job: "Delete operations are not allowed for safety reasons"
- Better approach: Use Stop-Job only and avoid Remove-Job
- CPU percent checks failed due to quoting/escaping in Run Command
- Better approach: Use typeperf or Get-CimInstance Win32_Processor
System Design Knowledge:
Windows process metrics: Get-Process CPU is cumulative seconds, not percentage; use Get-Counter or typeperf for instantaneous CPU percent to verify recovery thresholds.
Investigation Pattern:
Diagnose-remediate-verify loop: capture top processes via Run Command, terminate only confirmed runaway PIDs, then re-run the same read to confirm normalization.
Component Details
| Component | Purpose | Integration |
|---|---|---|
| Azure Monitor | Detect anomalies via metric/log alerts | Native alert routing to SRE Agent |
| Application Insights | Dependency tracking, failure analysis | KQL queries for root cause |
| Log Analytics | Centralized logging, performance data | KQL queries for investigation |
| VM Run Command | Remote script execution on VMs | az vm run-command invoke |
| ARM API | Resource configuration queries | Read/write resource properties |
Setting Up Your Own Demo
Prerequisites
- Azure subscription with SRE Agent Preview access
- Permissions: RBAC Admin or User Access Admin (for role assignments)
- Region: East US 2 (required for preview)
- Tools: Azure CLI, PowerShell 7+, Node.js 18+ (optional for web app)
Infrastructure Overview
| Resource | Purpose | SKU/Tier |
|---|---|---|
| Azure SQL Server | Backend database | Serverless |
| Azure SQL Database | Product data | Basic |
| App Service Plan | Web app hosting | B1 (Basic) |
| Web App | Frontend + API | Node.js 18 |
| Windows VM | CPU spike demo | Standard_B2s |
| Application Insights | Telemetry & dependencies | - |
| Log Analytics Workspace | Centralized logging | - |
Step 1: Deploy Infrastructure
# Clone the demo repo
git clone https://github.com/Saby007/SREAgentDemo.git
cd SREAgentDemo
# Deploy SQL scenario (Web App + SQL Database)
.\scripts\deploy.ps1 -ResourceGroupName "rg-sre-demo" -Location "eastus2"
# Wait for deployment (~5-10 minutes)
# This creates: SQL Server, Database, App Service, Application Insights, Alerts
# Deploy VM scenario
cd scenario-vm-cpu
.\deploy-vm.ps1 -AdminPassword (ConvertTo-SecureString "YourP@ss123!" -AsPlainText -Force)
# Wait for VM + Azure Monitor Agent (~10 minutes)
Step 2: Create SRE Agent
- Go to Azure SRE Agent Portal
- Click Create → Select subscription → Name: sre-agent-demo
- Region: East US 2 (required for preview)
- Add resource group: rg-sre-demo
- Click Create
⚠️ Important: SRE Agent needs appropriate RBAC permissions on the resource group. The agent will request Contributor access during setup.
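If you need to assign the role yourself, a minimal sketch (the agent's principal ID is a placeholder you would copy from the portal) looks like:
# Grant the SRE Agent identity Contributor on the demo resource group
az role assignment create `
  --assignee "<agentPrincipalId>" `
  --role "Contributor" `
  --scope "/subscriptions/<subid>/resourceGroups/rg-sre-demo"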
Step 3: Configure Incident Response Plans
Create two Incident Response Plans:
Plan 1: Web App Health (SQL Connectivity)
| Setting | Value |
|---|---|
| Incident Type | Default |
| Impacted Service | App Services |
| Priority | Sev 1 |
| Title Contains | health |
| Autonomy | Review (approval required) |
Plan 2: VM High CPU
| Setting | Value |
|---|---|
| Incident Type | Default |
| Impacted Service | Virtual Machines |
| Priority | Sev 2 |
| Title Contains | CPU |
| Autonomy | Review (approval required) |
Add custom instructions from scenario-vm-cpu/README.md.
Step 4: Trigger Incidents
# Scenario 1: Cause SQL connectivity failure
# This disables public network access on SQL Server
.\scripts\trigger-incident.ps1 -Action "pause"
# Wait 5-10 minutes for alert to fire
# Scenario 2: Cause CPU spike on VM
.\scenario-vm-cpu\trigger-cpu-spike.ps1 -Action start
# This runs background PowerShell jobs that consume ~90% CPU
# Wait 5-10 minutes for alert to fire (CPU > 85% for 5 min window)
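For context, the stress behind a spike like this might look roughly like the following (a hypothetical sketch, not the repo's actual script), matching the HighCpuProcess-* job names referenced in the IRP:
# Hypothetical stress sketch: background jobs that busy-loop until stopped
1..2 | ForEach-Object {
    Start-Job -Name "HighCpuProcess-$_" -ScriptBlock {
        while ($true) { [math]::Sqrt((Get-Random)) | Out-Null }   # burn CPU
    }
}
# Stop the stress later with:
Get-Job -Name "HighCpuProcess*" | Stop-Job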
Step 5: Watch SRE Agent Work
Open the SRE Agent portal and watch it:
- ✅ Acknowledge the alert (instant)
- 🔍 Investigate autonomously (metrics, logs, config)
- 🎯 Identify root cause
- 💡 Propose remediation options
- ✋ Wait for your approval
- 🔧 Execute remediation
- ✅ Verify recovery
- 📝 Generate Session Insights
Step 6: Cleanup
# Remove all demo resources
.\scripts\cleanup.ps1 -ResourceGroupName "rg-sre-demo"
# Or manually via Azure CLI
az group delete --name rg-sre-demo --yes --no-wait
Key Takeaways
Quantitative Results
| Metric | Incident 1 (SQL) | Incident 2 (VM) |
|---|---|---|
| Time to Acknowledge | 1 second | 2 seconds |
| Time to Root Cause | ~10 minutes | ~3 minutes |
| Human Time Required | ~6 minutes (approval) | ~33 minutes (approvals) |
| Total Resolution Time | ~20 minutes | ~39 minutes |
| Automated Steps | 12 | 8 |
Before vs. After Comparison
| Before SRE Agent | After SRE Agent |
|---|---|
| Alert fires → Wait for human to wake up | Alert fires → Investigation starts immediately |
| Engineer manually queries metrics, logs | Agent queries metrics, logs, ARM configs in seconds |
| Root cause found after 20-30 mins of digging | Root cause identified in <10 mins automatically |
| Remediation requires tribal knowledge | Custom instructions encode runbooks in IRP |
| Post-incident docs written (maybe, days later) | Session Insights auto-generated immediately |
| Knowledge stays in engineer's head | Learnings captured and reusable |
Key Benefits
- Faster MTTR - Investigation starts instantly, not when humans are available
- Consistent Triage - Same investigation pattern every time
- Knowledge Capture - Session Insights preserve learnings
- Reduced Toil - Automated data gathering and correlation
- Guardrails - Approval workflow for remediation actions
Lessons Learned & Best Practices
Do's ✅
| Practice | Why |
|---|---|
| Write specific IRP instructions | Generic instructions = generic responses |
| Include identification criteria | Help agent distinguish safe vs. risky remediations |
| Define escalation triggers | Know when NOT to auto-remediate |
| Test in Review mode first | Validate agent behavior before enabling Autonomous |
| Use supported metric time grains | Avoid query failures (5m, 1h, not 1m for some metrics) |
Don'ts ❌
| Anti-Pattern | Issue |
|---|---|
| Overly broad permissions | Security risk; use least-privilege RBAC |
| Complex PowerShell in Run Command | Parsing/escaping issues; keep scripts simple |
| Skipping recovery verification | Agent should always validate the fix worked |
| Using Remove-Job in remediations | May trigger safety blocks; use Stop-Job |
| Enabling Autonomous mode without testing | Unintended remediations on production resources |
What's Next?
Immediate Next Steps
- Autonomous Mode: For trusted, well-tested scenarios, skip approval and let SRE Agent remediate automatically
- More Scenarios: Add database pause/resume, storage throttling, AKS pod failures
- Teams Integration: Get incident updates and approve remediations directly in Teams
Future Enhancements
- Scheduled Checks: Combine reactive response with proactive optimization (see Proactive Cloud Ops blog)
- GitHub Issues: Auto-create issues for infrastructure problems linked to repos
- Knowledge Base: Upload runbooks, architecture docs to improve agent context
- MCP Servers: Connect external tools (Datadog, PagerDuty, Splunk) for broader observability
Conclusion
Azure SRE Agent transforms incident response from a reactive, human-dependent process into an AI-assisted workflow that starts investigating the moment an alert fires.
In these two real-world scenarios:
- SQL Connectivity Outage: Agent identified misconfigured public network access and restored connectivity in ~20 minutes
- VM CPU Spike: Agent captured process data, identified runaway PowerShell, and terminated the culprits in ~39 minutes
The key differentiator? Custom Instructions. By encoding our team's runbooks and identification criteria into Incident Response Plans, SRE Agent became a context-aware teammate—not just a generic troubleshooter.
Is it perfect? No. We encountered metric query failures, CLI escaping issues, and safety blocks. But the Session Insights captured these learnings, making the agent better for next time.
Is it valuable? Absolutely. Even with human approval delays, we resolved both incidents faster than traditional triage—and with comprehensive documentation auto-generated.
Learn More
- Azure SRE Agent Documentation
- Azure SRE Agent Blogs
- Azure SRE Agent Community
- Azure SRE Agent Home Page
- Azure SRE Agent Pricing
Azure SRE Agent is currently in preview. Get Started →