Apps on Azure Blog

Stop Running Runbooks at 3 am: Let Azure SRE Agent Do Your On-Call Grunt Work

dchelupati
Microsoft
Dec 20, 2025

Your pager goes off. It's 2:47am. Production is throwing 500 errors. You know the drill - SSH into this, query that, check these metrics, correlate those logs. Twenty minutes later, you're still piecing together what went wrong. Sound familiar?

The On-Call Reality Nobody Talks About

Every SRE, DevOps engineer, and developer who's carried a pager knows this pain. When incidents hit, you're not solving problems - you're executing runbooks. Copy-paste this query. Check that dashboard. Run these az commands. Connect the dots between five different tools.

It's tedious. It's error-prone at 3am. And honestly? It's work that doesn't need human creativity but still eats human time.

What if an AI agent could do this for you?

Enter Azure SRE Agent + Runbook Automation

Here's what I built: I gave SRE Agent a simple markdown runbook containing the same diagnostic steps I'd run manually during an incident. The agent executes those steps, collects evidence, and sends me an email with everything I need to take action.

No more bouncing between terminals. No more forgetting a step because it's 3am and your brain is foggy.

What My Runbook Contains

Just the basics any on-call would run:

  • az monitor metrics – CPU, memory, request rates
  • Log Analytics queries – Error patterns, exception details, dependency failures
  • App Insights data – Failed requests, stack traces, correlation IDs
  • az containerapp logs – Revision logs, app configuration

That's it. Plain markdown with KQL queries and CLI commands. Nothing fancy.
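To make "plain markdown with KQL queries and CLI commands" concrete, here is a minimal Python sketch that renders a runbook of that shape. It is not the exact runbook from this post: the app name, resource group, metric names (`UsageNanoCores`, `Requests`), and KQL queries are illustrative assumptions, and `render_runbook` is a hypothetical helper. Swap in whatever you run by hand today.

```python
def render_runbook(app: str, resource_group: str, resource_id: str) -> str:
    """Render a markdown runbook. Every command below is an illustrative placeholder."""
    steps = [
        ("Check CPU and request metrics",
         f"az monitor metrics list --resource {resource_id} "
         "--metric UsageNanoCores Requests --interval PT1M"),
        ("Pull recent container logs",
         f"az containerapp logs show --name {app} "
         f"--resource-group {resource_group} --tail 100"),
        ("Top exceptions in the last hour (KQL, App Insights)",
         "exceptions | where timestamp > ago(1h) "
         "| summarize occurrences = count() by type | top 5 by occurrences"),
        ("Server errors by result code (KQL, App Insights)",
         "requests | where timestamp > ago(1h) and toint(resultCode) >= 500 "
         "| summarize failures = count() by resultCode"),
    ]
    lines = [f"# Runbook: HTTP 500s on {app}", ""]
    for number, (title, command) in enumerate(steps, start=1):
        # Four-space indentation makes each command a markdown code block.
        lines += [f"## Step {number}: {title}", "", f"    {command}", ""]
    return "\n".join(lines)


if __name__ == "__main__":
    print(render_runbook("my-app", "my-rg", "<container-app-resource-id>"))
```

Save the output as a `.md` file and you have something you could add to the knowledge base in Step 3.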

What the Agent Does

  1. Reads the runbook from its knowledge base
  2. Executes each diagnostic step
  3. Collects results and evidence
  4. Sends me an email with analysis and findings
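If it helps to picture that flow, here is a rough Python sketch of the same loop: run each step, timestamp the result, and assemble a summary. This is only an illustration of the pattern, not how SRE Agent is implemented internally; `Evidence`, `run_runbook`, and `format_email` are hypothetical names, and the steps are stand-in callables rather than real Azure calls.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable


@dataclass
class Evidence:
    step: str
    collected_at: str  # UTC timestamp, ISO 8601
    output: str


def run_runbook(steps: dict[str, Callable[[], str]]) -> list[Evidence]:
    """Run each step in order; a failing step is recorded instead of aborting the run."""
    evidence = []
    for name, action in steps.items():
        try:
            output = action()
        except Exception as exc:
            output = f"step failed: {exc}"
        stamp = datetime.now(timezone.utc).isoformat(timespec="seconds")
        evidence.append(Evidence(name, stamp, output))
    return evidence


def format_email(evidence: list[Evidence]) -> str:
    """Turn the collected evidence into a plain-text email body."""
    lines = ["Incident analysis", ""]
    for item in evidence:
        lines.append(f"[{item.collected_at}] {item.step}: {item.output}")
    return "\n".join(lines)


if __name__ == "__main__":
    def broken() -> str:
        raise TimeoutError("workspace did not respond")

    demo = {"cpu": lambda: "peak 92%", "logs": broken}
    print(format_email(run_runbook(demo)))
```

Recording a failed step rather than stopping is a deliberate choice in this sketch: at 3am, a partial report beats no report.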

I wake up to an email that says: "CPU spiked to 92% at 2:45am, triggering connection pool exhaustion. Top exception: SqlException (1,832 occurrences). Errors correlate with traffic spike. Recommend scaling to 5 replicas."

All the evidence. All the queries used. All the timestamps. Ready for me to act.

How to Set This Up (6 Steps)

Here's how you can build this yourself:

Step 1: Create SRE Agent

Create a new SRE Agent in the Azure portal. No Azure resource groups to configure. If your apps run on Azure, the agent pulls context from the incident itself. If your apps run elsewhere, you don't need Azure resource configuration at all.

Step 2: Grant Reader Permission (Optional)

If your runbooks execute against Azure resources, assign the Reader role to the SRE Agent's managed identity on your subscription. This allows the agent to run az commands and query metrics. Skip this if your runbooks target non-Azure apps.
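If you prefer the CLI to the portal for this step, the role assignment comes down to a single `az role assignment create` call. The sketch below only builds and prints that command so you can review it before running it; the principal ID and subscription ID are placeholders you would need to look up, and `reader_assignment_command` is a hypothetical helper.

```python
import shlex


def reader_assignment_command(principal_id: str, subscription_id: str) -> list[str]:
    """Build the az CLI arguments that grant Reader at subscription scope."""
    return [
        "az", "role", "assignment", "create",
        "--assignee", principal_id,
        "--role", "Reader",
        "--scope", f"/subscriptions/{subscription_id}",
    ]


if __name__ == "__main__":
    # Placeholders: replace with the agent's managed identity and your subscription.
    cmd = reader_assignment_command("<agent-principal-id>", "<subscription-id>")
    print(shlex.join(cmd))
```

Reader is read-only, so the agent can query metrics and logs but cannot change your resources.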

Step 3: Add Your Runbook to SRE Agent's Knowledge Base

You already have runbooks; they're in your wiki, Confluence, or team docs. Just add them as .md files to the agent's knowledge base. To learn about other ways to link your runbooks to the agent, read this.

Step 4: Connect Outlook

Connect the agent to your Outlook so it can send you the analysis email with findings.

Step 5: Create a Subagent

Create a subagent with simple instructions like:

"You are an expert in triaging and diagnosing incidents. When triggered, search the knowledge base for the relevant runbook, execute the diagnostic steps, collect evidence, and send an email summary with your findings."

Assign the tools the agent needs:

  • RunAzCliReadCommands – for az monitor and az containerapp commands
  • QueryLogAnalyticsByWorkspaceId – for KQL queries against Log Analytics
  • QueryAppInsightsByResourceId – for App Insights data
  • SearchMemory – to find the right runbook
  • SendOutlookEmail – to deliver the analysis
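A quick way to sanity-check the tool list is to map each kind of runbook step to the tool it depends on and see what is missing. The tool names below are the ones listed above; the step "kinds" and the `missing_tools` helper are hypothetical labels for illustration, not part of SRE Agent.

```python
# Maps a (hypothetical) step kind to the SRE Agent tool named in this post.
REQUIRED_TOOL = {
    "az_cli": "RunAzCliReadCommands",
    "log_analytics_kql": "QueryLogAnalyticsByWorkspaceId",
    "app_insights_kql": "QueryAppInsightsByResourceId",
    "find_runbook": "SearchMemory",
    "send_email": "SendOutlookEmail",
}


def missing_tools(step_kinds: list[str], assigned: set[str]) -> list[str]:
    """Return the tools the runbook needs that the subagent has not been given."""
    needed = {REQUIRED_TOOL[kind] for kind in step_kinds}
    return sorted(needed - assigned)


if __name__ == "__main__":
    kinds = ["find_runbook", "az_cli", "app_insights_kql", "send_email"]
    assigned = {"RunAzCliReadCommands", "SearchMemory"}
    print(missing_tools(kinds, assigned))
```

An empty list means every step in the runbook has a tool behind it.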

Step 6: Set Up Incident Trigger

Connect your incident management tool (PagerDuty, ServiceNow, or Azure Monitor alerts) and set up the incident trigger to invoke the subagent. When an incident fires, the agent kicks off automatically.

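That's it. Your agentic workflow now looks like this: an incident fires, the trigger invokes the subagent, the subagent finds the matching runbook in the knowledge base, runs its diagnostic steps, and emails you the findings.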

This Works for Any App, Not Just Azure

Here's the thing: SRE Agent is platform-agnostic. It's executing your runbooks, whatever they contain.

  • On-prem databases? Add your diagnostic SQL.
  • Custom monitoring stack? Add those API calls.

The agent doesn't care where your app runs. It cares about following your runbook and getting you answers.

Why This Matters

Lower MTTR. By the time you're awake and coherent, the analysis is done.

Consistent execution. No missed steps. No "I forgot to check the dependencies" at 4am.

Evidence for postmortems. Every query, every result, timestamped and documented.

Focus on what matters. Your brain should be deciding what to do, not gathering data.

The Bottom Line

On-call runbook execution is the most common, most tedious, and most automatable part of incident response. It's grunt work that pulls engineers away from the creative problem-solving they were hired for.

SRE Agent takes that work off your plate. You write the runbook once, and the agent executes it every time, faster and more consistently than any human at 3am.

Stop running runbooks. Start reviewing results.

Try it yourself: Create a markdown runbook with your diagnostic queries and commands, add it to your SRE Agent's knowledge base, and let the agent handle your next incident. Your 3am self will thank you.

 

Published Dec 20, 2025
Version 1.0