I had 6 alert rules. CPU. Memory. Pod restarts. Container errors. OOMKilled. Job failures. I thought I was covered.
Then my app went down. I kept refreshing the Azure portal, waiting for an alert. Nothing.
That's when it hit me: my alerts were working perfectly. They just weren't designed for this failure mode.
Sound familiar?
The Problem Every Developer Knows
If you're a developer or DevOps engineer, you've been here: a customer reports an issue, you scramble to check your monitoring, and then you realize you don't have the right alerts set up. By the time you find out, it's already too late.
You set up what seems like reasonable alerting and assume you're covered. But real-world failures are sneaky. They slip through the cracks of your carefully planned thresholds.
My Setup: AKS with Redis
I love to vibe code apps using GitHub Copilot Agent mode with Claude Opus 4.5. It's fast, it understands context, and it lets me focus on building rather than boilerplate.
For this project, I built a simple journal entry app:
- AKS cluster hosting the web API
- Azure Cache for Redis storing journal data
- Azure Monitor alerts for CPU, memory, pod restarts, container errors, OOMKilled, and job failures
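For reference, here's roughly the shape of one of those rules — a minimal az CLI sketch of the CPU alert, where the cluster name and action group are placeholders rather than my exact configuration:

```bash
# Sketch: node-level CPU metric alert on the AKS cluster (placeholder names).
# Fires when average node CPU usage stays above 80% over a 5-minute window.
az monitor metrics alert create \
  --name "aks-high-cpu" \
  --resource-group rg-aks-journal \
  --scopes "$(az aks show -g rg-aks-journal -n aks-journal --query id -o tsv)" \
  --condition "avg node_cpu_usage_percentage > 80" \
  --window-size 5m \
  --evaluation-frequency 1m \
  --action ag-oncall \
  --description "Node CPU above 80% for 5 minutes"
```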
Seemed solid. What could go wrong?
The Scenario: Redis Password Rotation
Here's something that happens constantly in enterprise environments: the security team rotates passwords. It's best practice. It's in the compliance checklist. And it breaks things when apps don't pick up the new credentials.
I simulated exactly this.
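If you want to reproduce something similar, the rotation itself is a single az CLI call — a sketch with placeholder cache, deployment, and namespace names, not my exact commands:

```bash
# Rotate the Redis access key so running pods are left holding a stale credential
# (cache, deployment, and namespace names are placeholders).
az redis regenerate-keys \
  --resource-group rg-aks-journal \
  --name redis-journal \
  --key-type Primary

# Restart the deployment; the pods come back up still configured with the old password.
kubectl rollout restart deployment/journal-api -n journal

# Watch the symptoms: pods Running but not Ready, readiness probes failing.
kubectl get pods -n journal
kubectl describe pods -n journal -l app=journal-api | grep -A 3 Readiness
```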
The pods came back up. But they couldn't connect to Redis (as expected). The readiness probes started failing. The LoadBalancer had no healthy backends. The endpoint timed out.
And not a single alert fired.
Using SRE Agent to Find the Alert Gaps
Instead of manually auditing every alert rule and trying to figure out what I missed, I turned to Azure SRE Agent. I asked it a simple question: "My endpoint is timing out. What alerts do I have, and why didn't any of them fire?"
Within minutes, it had diagnosed the problem. Here's what it found:
| My Existing Alerts | Why They Didn't Fire |
|---|---|
| High CPU/Memory | No resource pressure, just auth failures |
| Pod Restarts | Pods weren't restarting, just unhealthy |
| Container Errors | App logs weren't being written |
| OOMKilled | No memory issues |
| Job Failures | No K8s jobs involved |
The gaps SRE Agent identified:
- ❌ No synthetic URL availability test
- ❌ No readiness/liveness probe failure alerts
- ❌ No "pods not ready" alerts scoped to my namespace
- ❌ No Redis connection error detection
- ❌ No ingress 5xx/timeout spike alerts
- ❌ No per-pod resource alerts (only node-level)
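To make one of those gaps concrete: failed readiness and liveness probes surface as Kubernetes "Unhealthy" events, so a Log Analytics query along these lines would have shown the outage immediately. This is my own illustration (workspace GUID and namespace are placeholders), not the exact KQL from the agent's output:

```bash
# Count probe failures in the last 15 minutes via Container insights data
# (workspace GUID and namespace are placeholders).
az monitor log-analytics query \
  --workspace <log-analytics-workspace-guid> \
  --analytics-query "
    KubeEvents
    | where TimeGenerated > ago(15m)
    | where Reason == 'Unhealthy'
    | where Namespace == 'journal'
    | summarize FailureCount = count() by Name
  "
```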
SRE Agent didn't just tell me what was wrong; it created a GitHub issue with:
- KQL queries to detect each failure type
- Bicep code snippets for new alert rules
- Remediation suggestions for the app code
- Exact file paths in my repo to update
Check it out: GitHub Issue
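To give a flavor of what one of those suggested rules looks like (my own sketch, not the snippet from the issue), the probe-failure query above can be turned into a standing log alert — the scheduled-query condition syntax and the workspace ID below are assumptions:

```bash
# Sketch: log alert that fires whenever readiness/liveness probes fail in the
# journal namespace (workspace resource ID, names, and thresholds are placeholders).
az monitor scheduled-query create \
  --name "journal-probe-failures" \
  --resource-group rg-aks-journal \
  --scopes "/subscriptions/<sub-id>/resourceGroups/rg-aks-journal/providers/Microsoft.OperationalInsights/workspaces/<workspace>" \
  --condition "count 'ProbeFailures' > 0" \
  --condition-query ProbeFailures="KubeEvents | where Reason == 'Unhealthy' and Namespace == 'journal'" \
  --window-size 5m \
  --evaluation-frequency 5m \
  --severity 2 \
  --description "Readiness/liveness probe failures detected in the journal namespace"
```

Wire it to an action group and you finally get paged for the failure mode that started all this.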
How I Built It: Step by Step
Let me walk you through exactly how I set this up inside SRE Agent.
Step 1: Create an SRE Agent
I created a new SRE Agent in the Azure portal. Since this workflow analyzes alerts across my subscription (not just one resource group), I didn't configure any specific resource groups. Instead, I gave the agent's managed identity Reader permissions on my entire subscription. This lets it discover resources, list alert rules, and query Log Analytics across all my resource groups.
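Granting that access is a single role assignment once you have the agent's managed identity principal ID — both IDs below are placeholders:

```bash
# Give the SRE Agent's managed identity Reader on the whole subscription
# (principal ID and subscription ID are placeholders).
az role assignment create \
  --assignee <agent-managed-identity-principal-id> \
  --role "Reader" \
  --scope "/subscriptions/<subscription-id>"
```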
Step 2: Connect GitHub to SRE Agent via MCP
I added a GitHub MCP server to give the agent access to my source code repository. MCP (Model Context Protocol) lets you bring any API into the agent: if your tool has an API, you can connect it. I use GitHub for both source code and tracking dev tickets, but you can connect to wherever your code lives (GitLab, Azure DevOps) or your ticketing system (Jira, ServiceNow, PagerDuty).
Step 3: Create a Subagent Inside SRE Agent for Managing Azure Monitor Alerts
I created a focused subagent with a specific job and only the tools it needs:
Azure Monitor Alerts Expert
Prompt:
" You are expert in managing operations related to azure monitor alerts on azure resources including discovering alert rules configured on azure resources, creating new alert rules (with user approval and authorization only), processing the alerts fired on azure resources and identifying gaps in the alert rules. You can get the resource details from azure monitor alert if triggered via alert. If not, you need to ask user for the specific resource to perform analysis on. You can use az cli tool to diagnose logs, check the app health metrics. You must use the app code and infra code (bicep files) files you have access to in the github repo <insert your repo> to further understand the possible diagnoses and suggest remediations. Once analysis is done, you must create a github issue with details of analysis and suggested remediation to the source code files in the same repo."
Tools enabled:
- az cli – List resources, alert rules, action groups
- Log Analytics workspace querying – Run KQL queries for diagnostics
- GitHub MCP – Search repositories, read file contents, create issues
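In practice, the az cli tool is what lets the subagent inventory the existing coverage. These are the kinds of discovery commands it can run — a sketch of typical calls, not a transcript of the agent's session:

```bash
# Inventory what's already configured in the resource group from this post.
az monitor metrics alert list --resource-group rg-aks-journal -o table     # metric alert rules
az monitor scheduled-query list --resource-group rg-aks-journal -o table   # log (KQL) alert rules
az monitor action-group list --resource-group rg-aks-journal -o table      # who gets notified
```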
Step 4: Ask the Subagent About Alert Gaps
I gave the agent context and asked a simple question: "@AzureAlertExpert: My API endpoint http://132.196.167.102/api/journals/john is timing out. What alerts do I have configured in rg-aks-journal, and why didn't any of them fire?"
The agent did the analysis autonomously and summarized its findings, along with suggested new alert rules, in a GitHub issue.
Here's the agentic workflow for performing Azure Monitor alert operations:
Why This Matters
Faster response times. Issues get diagnosed in minutes, not hours of manual investigation.
Consistent analysis. No more "I thought we had an alert for that" moments. The agent systematically checks what's covered and what's not.
Proactive coverage. You don't have to wait for an incident to find gaps. Ask the agent to review your alerts before something breaks.
The Bottom Line
Your alerts have gaps. You just don't know it until something slips through.
I had 6 alert rules and still missed a basic failure. My pods weren't restarting, they were just unhealthy. My CPU wasn't spiking, the app was just returning errors. None of my alerts were designed for this.
You don't need to audit every alert rule manually. Give SRE Agent your environment, describe the failure, and let it tell you what's missing.
Stop discovering alert gaps from customer complaints. Start finding them before they matter.
A Few Tips
- Give the agent Reader access at subscription level so it can discover all resources
- Use a focused subagent prompt; don't try to do everything in one agent
- Test your MCP connections before running workflows
What Alert Gaps Have Burned You?
What's the alert you wish you had set up before an incident? Credential rotation? Certificate expiry? DNS failures? Let us know in the comments.