Apps on Azure Blog

Find the Alerts You Didn't Know You Were Missing with Azure SRE Agent

dchelupati
Jan 06, 2026

I had 6 alert rules. CPU. Memory. Pod restarts. Container errors. OOMKilled. Job failures. I thought I was covered.

Then my app went down. I kept refreshing the Azure portal, waiting for an alert. Nothing.

That's when it hit me: my alerts were working perfectly. They just weren't designed for this failure mode.

Sound familiar?

The Problem Every Developer Knows

If you're a developer or DevOps engineer, you've been here: a customer reports an issue, you scramble to check your monitoring, and then you realize you don't have the right alerts set up. By the time you find out, it's already too late.

You set up what seems like reasonable alerting and assume you're covered. But real-world failures are sneaky. They slip through the cracks of your carefully planned thresholds.

My Setup: AKS with Redis

I love to vibe code apps using GitHub Copilot Agent mode with Claude Opus 4.5. It's fast, it understands context, and it lets me focus on building rather than boilerplate.

For this project, I built a simple journal entry app:

  • AKS cluster hosting the web API
  • Azure Cache for Redis storing journal data
  • Azure Monitor alerts for CPU, memory, pod restarts, container errors, OOMKilled, and job failures
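
For reference, here's a rough sketch of how a rule like the CPU one can be set up with the az CLI (not necessarily how mine were defined); the cluster name and action group ID are placeholders, and rg-aks-journal is my resource group.

```
# Sketch of a basic metric alert: high node CPU on the AKS cluster.
# <cluster-name> and <action-group-id> are placeholders.
AKS_ID=$(az aks show --resource-group rg-aks-journal --name <cluster-name> --query id -o tsv)

az monitor metrics alert create \
  --name "aks-high-cpu" \
  --resource-group rg-aks-journal \
  --scopes "$AKS_ID" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action <action-group-id>
```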

Seemed solid. What could go wrong?

The Scenario: Redis Password Rotation

Here's something that happens constantly in enterprise environments: the security team rotates passwords. It's best practice. It's in the compliance checklist. And it breaks things when apps don't pick up the new credentials.

I simulated exactly this.
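
If you want to simulate something similar, regenerating the cache's access key is one way to do it (I'm not showing my exact steps here); the cache name below is a placeholder.

```
# Rotating the Redis access key invalidates the credential the app has cached.
# <cache-name> is a placeholder for your Azure Cache for Redis instance.
az redis regenerate-keys \
  --name <cache-name> \
  --resource-group rg-aks-journal \
  --key-type Primary
```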

The pods came back up. But they couldn't connect to Redis (as expected). The readiness probes started failing. The LoadBalancer had no healthy backends. The endpoint timed out.
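
If you want to see the symptoms for yourself, a few quick checks tell the story; the namespace, service, and pod names below are placeholders.

```
# Pods are Running but never become Ready, so the service has no healthy backends.
kubectl get pods -n <namespace>                        # READY shows 0/1 on every pod
kubectl get endpoints <service-name> -n <namespace>    # ENDPOINTS column is empty
kubectl describe pod <pod-name> -n <namespace> | grep -A2 "Readiness"

# And the public endpoint just hangs until the client gives up.
curl -m 10 http://132.196.167.102/api/journals/john
```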

And not a single alert fired.

Using SRE Agent to Find the Alert Gaps

Instead of manually auditing every alert rule and trying to figure out what I missed, I turned to Azure SRE Agent. I asked it a simple question: "My endpoint is timing out. What alerts do I have, and why didn't any of them fire?"

Within minutes, it had diagnosed the problem. Here's what it found:

My Existing Alerts and Why They Didn't Fire

  • High CPU/Memory: no resource pressure, just auth failures
  • Pod Restarts: pods weren't restarting, just unhealthy
  • Container Errors: app logs weren't being written
  • OOMKilled: no memory issues
  • Job Failures: no K8s jobs involved

The gaps SRE Agent identified:

  1. ❌ No synthetic URL availability test
  2. ❌ No readiness/liveness probe failure alerts
  3. ❌ No "pods not ready" alerts scoped to my namespace
  4. ❌ No Redis connection error detection
  5. ❌ No ingress 5xx/timeout spike alerts
  6. ❌ No per-pod resource alerts (only node-level)
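
To make gap #2 concrete, this is roughly the kind of Log Analytics query that would have caught the failure, run through the az CLI the way the subagent does. It assumes Container insights is enabled on the cluster; the workspace ID and namespace are placeholders.

```
# Readiness probe failures surface as 'Unhealthy' Kubernetes events in Container insights.
az monitor log-analytics query \
  --workspace <workspace-customer-id> \
  --analytics-query "KubeEvents
    | where TimeGenerated > ago(15m)
    | where Namespace == '<namespace>'
    | where Reason == 'Unhealthy' and Message has 'Readiness probe failed'
    | summarize Failures = count() by Name"
```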

SRE Agent didn't just tell me what was wrong; it created a GitHub issue with:

  • KQL queries to detect each failure type
  • Bicep code snippets for new alert rules
  • Remediation suggestions for the app code
  • Exact file paths in my repo to update
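
The actual KQL and Bicep are in the issue itself. As a rough az CLI equivalent of one suggested rule, a scheduled-query alert along these lines would fire on the readiness failures above; the condition syntax and flags vary across CLI versions, and the IDs are placeholders, so treat this as a sketch.

```
# Sketch of a log alert on readiness probe failures. Verify the condition syntax
# against your az CLI version before relying on it; IDs below are placeholders.
az monitor scheduled-query create \
  --name "journal-readiness-probe-failures" \
  --resource-group rg-aks-journal \
  --scopes <log-analytics-workspace-resource-id> \
  --condition "count 'ProbeFailures' > 0" \
  --condition-query ProbeFailures="KubeEvents | where Reason == 'Unhealthy' and Message has 'Readiness probe failed'" \
  --evaluation-frequency 5m \
  --window-size 5m \
  --severity 2 \
  --action-groups <action-group-resource-id>
```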

Check it out: GitHub Issue

How I Built It: Step by Step

Let me walk you through exactly how I set this up inside SRE Agent.

Step 1: Create an SRE Agent

I created a new SRE Agent in the Azure portal. Since this workflow analyzes alerts across my subscription (not just one resource group), I didn't configure any specific resource groups. Instead, I gave the agent's managed identity Reader permissions on my entire subscription. This lets it discover resources, list alert rules, and query Log Analytics across all my resource groups.
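
A minimal sketch of that role assignment, assuming you assign it to the agent's managed identity; the object ID and subscription ID are placeholders.

```
# Grant the SRE Agent's managed identity Reader on the whole subscription
# so it can discover resources, alert rules, and Log Analytics workspaces.
az role assignment create \
  --assignee <sre-agent-managed-identity-object-id> \
  --role "Reader" \
  --scope "/subscriptions/<subscription-id>"
```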

Step 2: Connect GitHub to SRE Agent via MCP

I added a GitHub MCP server to give the agent access to my source code repository. MCP (Model Context Protocol) lets you bring any API into the agent. If your tool has an API, you can connect it. I use GitHub for both source code and tracking dev tickets, but you can connect to wherever your code lives (GitLab, Azure DevOps) or your ticketing system (Jira, ServiceNow, PagerDuty).

Step 3: Create a Subagent inside SRE Agent for managing Azure Monitor Alerts

I created a focused subagent with a specific job and only the tools it needs:

Azure Monitor Alerts Expert

Prompt:

"You are an expert in managing operations related to Azure Monitor alerts on Azure resources, including discovering alert rules configured on Azure resources, creating new alert rules (with user approval and authorization only), processing the alerts fired on Azure resources, and identifying gaps in the alert rules. You can get the resource details from the Azure Monitor alert if triggered via an alert. If not, you need to ask the user for the specific resource to perform analysis on. You can use the az cli tool to diagnose logs and check the app health metrics. You must use the app code and infra code (Bicep files) you have access to in the GitHub repo <insert your repo> to further understand the possible diagnoses and suggest remediations. Once analysis is done, you must create a GitHub issue with details of the analysis and suggested remediation to the source code files in the same repo."

Tools enabled:

  • az cli – List resources, alert rules, action groups
  • Log Analytics workspace querying – Run KQL queries for diagnostics
  • GitHub MCP – Search repositories, read file contents, create issues
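
To give a feel for what those tools translate to, the discovery side is roughly calls like these (not a transcript of what the agent actually ran); the workspace ID is a placeholder.

```
# List the metric alert rules already configured in the resource group.
az monitor metrics alert list --resource-group rg-aks-journal -o table

# Run an ad hoc KQL query against the Container insights workspace.
az monitor log-analytics query \
  --workspace <workspace-customer-id> \
  --analytics-query "KubeEvents | where TimeGenerated > ago(1h) | take 10"
```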

Step 4: Ask the Subagent About Alert Gaps

I gave the agent context and asked a simple question: "@AzureAlertExpert: My API endpoint http://132.196.167.102/api/journals/john is timing out. What alerts do I have configured in rg-aks-journal, and why didn't any of them fire?"

The agent did the analysis autonomously, then summarized its findings and suggested new alert rules in a GitHub issue.

Here's the agentic workflow for performing Azure Monitor alert operations.


Why This Matters

Faster response times. Issues get diagnosed in minutes, not hours of manual investigation.

Consistent analysis. No more "I thought we had an alert for that" moments. The agent systematically checks what's covered and what's not.

Proactive coverage. You don't have to wait for an incident to find gaps. Ask the agent to review your alerts before something breaks.

The Bottom Line

Your alerts have gaps. You just don't know it until something slips through.

I had 6 alert rules and still missed a basic failure. My pods weren't restarting; they were just unhealthy. My CPU wasn't spiking; the app was just returning errors. None of my alerts were designed for this.

You don't need to audit every alert rule manually. Give SRE Agent your environment, describe the failure, and let it tell you what's missing.

Stop discovering alert gaps from customer complaints. Start finding them before they matter.

A Few Tips

  • Give the agent Reader access at subscription level so it can discover all resources
  • Use a focused subagent prompt; don't try to do everything in one agent
  • Test your MCP connections before running workflows

What Alert Gaps Have Burned You?

What's the alert you wish you had set up before an incident? Credential rotation? Certificate expiry? DNS failures? Let us know in the comments.

Published Jan 06, 2026
Version 1.0