
Apps on Azure Blog

Azure Monitor in Azure SRE Agent: Autonomous Alert Investigation and Intelligent Merging

Vineela-Suri
Apr 07, 2026

Azure Monitor is great at telling you something is wrong. But once the alert fires, the real work begins — someone has to open the portal, triage it, dig into logs, and figure out what happened. That takes time. And while they're investigating, the same alert keeps firing every few minutes, stacking up duplicates of a problem that's already being looked at.

This is exactly what Azure SRE Agent's Azure Monitor integration addresses. The agent picks up alerts as they fire, investigates autonomously, and remediates when it can — all without waiting for a human to get involved. And when that same alert fires again while the investigation is still underway, the agent merges it into the existing thread rather than creating a new one.

In this blog, we'll walk through the full Azure Monitor experience in SRE Agent with a live AKS + Redis scenario — how alerts get picked up, what the agent does with them, how merging handles the noise, and why one often-overlooked setting (auto-resolve) makes a bigger difference than you'd expect.

Key Takeaways

  1. Set up Incident Response Plans to scope which alerts the agent handles — filter by severity, title patterns, and resource type. Start with review mode, then promote to autonomous once you trust the agent's behavior for that failure pattern.
  2. Recurring alerts merge into one thread automatically — when the same alert rule fires repeatedly, the agent merges subsequent firings into the existing investigation instead of creating duplicates.
  3. Turn auto-resolve OFF for persistent failures (bad credentials, misconfigurations, resource exhaustion) so all firings merge into one thread. Turn it ON for transient issues (traffic spikes, brief timeouts) so each gets a fresh investigation.
  4. Design alert rules around failure categories, not components — one alert rule = one investigation thread. Structure rules by symptom (Redis errors, HTTP errors, pod health) to give the agent focused, non-overlapping threads.
  5. Attach Custom Response Plans for specialized handling — route specific alert patterns to custom-agents with custom instructions, tools, and runbooks.

It Starts with Any Azure Monitor Alert

Before we get to the demo, a quick note on what SRE Agent actually watches. The agent queries the Azure Alerts Management REST API, which returns every fired alert regardless of signal type. Log search alerts, metric alerts, activity log alerts, smart detection, service health, Prometheus — all of them come through the same API, and the agent processes them all the same way. You don't need to configure connectors or webhooks per alert type. If it fires in Azure Monitor, the agent can see it.
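
As a rough sketch of what that single-API polling looks like, here is how a client might list newly fired alerts. The endpoint shape, api-version, and filter parameters are assumptions for illustration, not the agent's internal implementation:

```python
import json
import urllib.request

def build_alerts_url(subscription_id: str, api_version: str = "2019-03-01") -> str:
    # Every alert type (log, metric, activity log, smart detection, service
    # health, Prometheus) surfaces through this one provider, which is why
    # no per-signal connectors or webhooks are needed.
    return (
        f"https://management.azure.com/subscriptions/{subscription_id}"
        f"/providers/Microsoft.AlertsManagement/alerts"
        f"?api-version={api_version}&alertState=New&monitorCondition=Fired"
    )

def fetch_new_alerts(subscription_id: str, bearer_token: str) -> list:
    # Requires a valid ARM bearer token; returns the raw alert objects.
    req = urllib.request.Request(
        build_alerts_url(subscription_id),
        headers={"Authorization": f"Bearer {bearer_token}"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp).get("value", [])
```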

What you do need to configure is which alerts the agent should care about. That's where Incident Response Plans come in.

Setting Up: Incident Response Plans and Alert Rules

We start by heading to Settings > Incident Platform > Azure Monitor and creating an Incident Response Plan. Response Plans let you scope the agent's attention by severity, alert name patterns, target resource types, and, importantly, whether the agent should act autonomously or wait for human approval.

Action: Match the agent mode to your confidence in the remediation, not just the severity. Use autonomous mode for well-understood failure patterns where the fix is predictable and safe (e.g., rolling back a bad config, restarting a pod). Use review mode for anything where you want a human to validate before the agent acts — especially Sev0/Sev1 alerts that touch critical systems. You can always start in review mode and promote to autonomous once you've validated the agent's behavior.

[Screenshot: Incident Response Plan configuration for Azure Monitor, showing severity, agent mode, and title pattern fields]

For our demo, we created a Sev1 response plan in autonomous mode — meaning the agent would pick up any Sev1 alert and immediately start investigating and remediating, no approval needed.

On the Azure Monitor side, we set up three log-based alert rules against our AKS cluster's Log Analytics workspace. The star of the show was a Redis connection error alert — a custom log search query looking for WRONGPASS, ECONNREFUSED, and other Redis failure signatures in ContainerLog:
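
The exact rule text isn't reproduced here, but a log search of this shape would match those signatures. This is an illustrative sketch, assuming the standard ContainerLog schema; the query string and signature list are not the demo's actual rule:

```python
# Illustrative KQL for a log search alert of the kind described above,
# kept as a string, plus a local check of the same failure signatures.
REDIS_ERROR_QUERY = """
ContainerLog
| where TimeGenerated > ago(15m)
| where LogEntry has_any ("WRONGPASS", "ECONNREFUSED", "NOAUTH", "Connection lost")
| project TimeGenerated, Name, LogEntry
"""

REDIS_FAILURE_SIGNATURES = ("WRONGPASS", "ECONNREFUSED", "NOAUTH", "Connection lost")

def is_redis_failure(log_line: str) -> bool:
    """Mirror of the has_any check: does this log line look like a Redis failure?"""
    return any(sig in log_line for sig in REDIS_FAILURE_SIGNATURES)
```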

Each rule evaluates every 5 minutes with a 15-minute aggregation window. If the query returns any results, the alert fires. Simple enough.

Breaking Redis (On Purpose)

Our test app is a Node.js journal app on AKS, backed by Azure Cache for Redis. To create a realistic failure scenario, we updated the Redis password in the Kubernetes secret to a wrong value. The app pods picked up the bad credential, Redis connections started failing, and error logs started flowing.

Within minutes, the Redis connection error alert fired.

What Happened Next

Here's where it gets interesting. We didn't touch anything — we just watched.

The agent's scanner polls the Azure Monitor Alerts API every 60 seconds. It spotted the new alert (state: "New", condition: "Fired"), matched it against our Sev1 Incident Response Plan, and immediately acknowledged it in Azure Monitor — flipping the state to "Acknowledged" so other systems and humans know someone's on it.
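
The acknowledgement step uses the same API surface. A sketch of the state-change call; the `changestate` action and its parameters are assumptions about endpoint shape, not the agent's actual client code:

```python
def change_state_url(alert_resource_id: str, new_state: str = "Acknowledged",
                     api_version: str = "2019-03-01") -> str:
    # alert_resource_id is the full ARM id of the alert, e.g.
    # /subscriptions/<sub>/providers/Microsoft.AlertsManagement/alerts/<guid>
    # The resulting URL would be POSTed to flip the alert's state so other
    # systems and humans can see the alert is being handled.
    return (
        f"https://management.azure.com{alert_resource_id}/changestate"
        f"?api-version={api_version}&newState={new_state}"
    )
```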

Then it created a new investigation thread. The thread included everything the agent needed to get started: the alert ID, rule name, severity, description, affected resource, subscription, resource group, and a deep-link back to the Azure Portal alert.

From there, the agent went to work autonomously. It queried container logs, identified the Redis WRONGPASS errors, traced them to the bad secret, retrieved the correct access key from Azure Cache for Redis, updated the Kubernetes secret, and triggered a pod rollout. By the time we checked the thread, it was already marked "Completed."

No pages. No human investigation. No context-switching.

But the Alert Kept Firing...

Here's the thing — our alert rule evaluates every 5 minutes. Between the first firing and the agent completing the fix, the alert fired again. And again. Seven times total over 35 minutes.

Without intelligent handling, that would mean seven separate investigation threads. Seven notifications. Seven disruptions.

SRE Agent handles this with alert merging. When a subsequent firing comes in for the same alert rule, the agent checks: is there already an active thread for this rule, created within the last 7 days, that hasn't been resolved or closed? If yes, the new firing gets silently merged into the existing thread — the total alert count goes up, the "Last fired" timestamp updates, and that's it. No new thread, no new notification, no interruption to the ongoing investigation.

How merging decides: new thread or merge?

  • Same alert rule, existing thread still active → merged: the alert count increments, no new thread.
  • Same alert rule, existing thread resolved/closed → new thread: a fresh investigation starts.
  • Different alert rule → new thread, always separate.
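
The decision rules above can be sketched as a small function. Field names here are hypothetical, not the agent's internal schema:

```python
from datetime import datetime, timedelta

MERGE_WINDOW = timedelta(days=7)  # merge window described above

def decide(alert_rule_id: str, existing_threads: list, now: datetime):
    """Return ('merge', thread) or ('new', None) for an incoming firing."""
    for t in existing_threads:
        same_rule = t["rule_id"] == alert_rule_id
        still_active = t["state"] not in ("resolved", "closed")
        recent = now - t["created_at"] <= MERGE_WINDOW
        if same_rule and still_active and recent:
            t["alert_count"] += 1   # merged: count goes up,
            t["last_fired"] = now   # "Last fired" updates, no new thread
            return ("merge", t)
    return ("new", None)  # different rule, or prior thread resolved/closed
```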

Five minutes after the first alert, the second firing came in, and the firings kept coming on every evaluation cycle after that. The agent finished the fix and closed the thread; the final tally was one thread with seven merged alerts, spanning 35 minutes of continuous firings.

On the Azure Portal side, you can see all seven individual alert instances. Each one was acknowledged by the agent.

[Screenshot: seven Redis Connection Error Alert entries, all Sev1, Fired condition, Closed by user, spanning 8:50 PM to 9:21 PM]

Seven firings. One investigation. One fix. That's the merge in action.

The Auto-Resolve Twist

Now here's the part we didn't expect to matter as much as it did.

Azure Monitor has a setting called "Automatically resolve alerts". When enabled, Azure Monitor automatically transitions an alert to "Resolved" once the underlying condition clears — for example, when the Redis errors stop because the pod restarted.

For our first scenario above, we had auto-resolve turned off. That's why the alert stayed in "Fired" state across all seven evaluation cycles, and all seven firings merged cleanly into one thread.

But what happens if auto-resolve is on? We turned it on and ran the same scenario again. Here's what happened:

  1. Redis broke. Alert fired. Agent picked it up and created a thread.
  2. The agent investigated, found the bad Redis password, fixed it.
  3. With Redis working again, the error logs stopped. Once we saw the condition had cleared, we manually closed all seven alerts.
  4. We broke Redis a second time (simulating a recurrence). The alert fired again, but the previous alerts were already closed/resolved, so the merge check found no active thread. A brand-new thread was created, and the agent reinvestigated and mitigated the issue.
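
A toy simulation makes the difference concrete: the same firing sequence yields one merged thread with auto-resolve off, but a second thread once auto-resolve closes the first. The thread model here is a deliberate simplification, not the agent's internals:

```python
def run(firings: list, auto_resolve: bool) -> list:
    """Simulate thread creation. firings is an ordered list of
    'fired' / 'cleared' events for a single alert rule."""
    threads = []
    for event in firings:
        if event == "cleared" and auto_resolve and threads:
            threads[-1]["state"] = "resolved"  # Azure Monitor auto-resolves
        elif event == "fired":
            active = [t for t in threads if t["state"] == "active"]
            if active:
                active[-1]["count"] += 1  # merged into the existing thread
            else:
                threads.append({"state": "active", "count": 1})  # new thread
    return threads
```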

Two threads for the same alert rule, right there on the Incidents page.

And on the Azure Monitor side, the newest alert shows a "Resolved" condition — that's auto-resolve doing its thing.

For a persistent failure like a Redis misconfiguration, this is clearly worse. You get a new investigation thread every break-fix cycle instead of one continuous investigation.

So, Should You Just Turn Auto-Resolve Off?

No. It depends on what kind of failure the alert is watching for.

Quick Reference: Auto-Resolve Decision Guide
Auto-Resolve OFF
  • Use when: the problem persists until fixed.
  • Examples: bad credentials, misconfigurations, CrashLoopBackOff, connection pool exhaustion, IOPS limits.
  • Merge behavior: all repeat firings merge into one thread.
  • Best for: the agent actively managing the alert lifecycle.
  • Tradeoff: alerts stay in "Fired/Acknowledged" state in Azure Monitor until the agent closes them.

Auto-Resolve ON
  • Use when: the problem is transient and self-correcting.
  • Examples: OOM kills during traffic spikes, brief latency from neighboring deployments, one-off job timeouts.
  • Merge behavior: each break-fix cycle creates a new thread.
  • Best for: occurrences that may each have a different root cause.
  • Tradeoff: more threads, but each gets a clean investigation.

Turn auto-resolve OFF when you want repeated firings from the same alert rule to stay in a single investigation thread until the alert is explicitly resolved or closed in Azure Monitor. This works best for persistent issues such as a Kubernetes deployment stuck in CrashLoopBackOff because of a bad image tag, a database connection pool exhausted due to a leaked connection, or a storage account hitting its IOPS limit under sustained load.

Turn auto-resolve ON when you want a new investigation thread after the previous occurrence has been resolved or closed in Azure Monitor. This works best for episodic or self-clearing issues such as a pod getting OOM-killed during a temporary traffic spike, a brief latency increase during a neighboring service's deployment, or a scheduled job that times out once due to short-lived resource contention.

The key question is: when this alert fires again, is it the same ongoing problem or a new one? If it's the same problem, turn auto-resolve off and let the merges do their job. If it's a new problem, leave auto-resolve on and let the agent investigate fresh.

Note: These behaviors describe how SRE Agent groups alert investigations and may differ from how Azure Monitor documents native alert state behavior.

A Few Things We Learned Along the Way

Design alert rules around symptoms, not components. Each alert rule maps to one investigation thread. We structured ours around failure categories — root cause signal (Redis errors, Sev1), blast radius signal (HTTP errors, Sev2), infrastructure signal (unhealthy pods, Sev2). This gave the agent focused threads without overlap.

Incident Response Plans let you tier your response. Not every alert needs the agent to go fix things immediately. We used a Sev1 filter in autonomous mode for the Redis alert, but you could set up a Sev2 filter in review mode — the agent investigates and provides analysis but waits for human approval before taking action.

Response Plans specialize the agent. For specific alert patterns, you can give the agent custom instructions, specialized tools, and a tailored system prompt. A Redis alert can route to a custom-agent loaded with Redis-specific runbooks; a Kubernetes alert can route to one with deep kubectl expertise.

Best Practices Checklist

Here's what we learned distilled into concrete actions:

Alert Rule Design
  • Do: design rules around failure categories (root cause, blast radius, infra health). Don't: create one alert per component; you'll get overlapping threads.
  • Do: set evaluation frequency and aggregation window to match the failure pattern. Don't: use the same frequency for everything; transient and persistent issues need different cadences.

Example rule structure from our test:

  • Root cause signal — Redis WRONGPASS/ECONNREFUSED errors → Sev1
  • Blast radius signal — HTTP 5xx response codes → Sev2
  • Infrastructure signal — KubeEvents Reason="Unhealthy" → Sev2

Incident Response Plan Setup
  • Do: create separate response plans per severity tier. Don't: use one catch-all filter for everything.
  • Do: start with review mode, especially for Sev0/Sev1 where wrong fixes are costly. Don't: jump straight to autonomous mode on critical alerts without validating agent behavior first.
  • Do: promote to autonomous mode once you've validated that the agent handles a specific failure pattern correctly. Don't: assume severity alone determines the right mode; it's about confidence in the remediation.
Response Plans
  • Do: attach custom response plans to specific alert patterns for specialized handling. Don't: leave every alert to the agent's general knowledge.
  • Do: include custom instructions, tools, and runbooks relevant to the failure type. Don't: write generic instructions; the more specific, the better the investigation.
  • Do: route Redis alerts to a Redis-specialized custom-agent and K8s alerts to one with kubectl expertise. Don't: assume one agent configuration fits all failure types.

Getting Started

  1. Head to sre.azure.com and open your agent
  2. Make sure the agent's managed identity has Monitoring Reader on your target subscriptions
  3. Go to Settings > Incident Platform > Azure Monitor and create your Incident Response Plans
  4. Review the auto-resolve setting on your alert rules — turn it off for persistent issues, leave it on for transient ones (see the decision guide above)
  5. Start with a test response plan using Title Contains to target a specific alert rule — validate agent behavior before broadening
  6. Watch the Incidents page and review the agent's investigation threads before expanding to more alert rules

Learn More

Updated Apr 07, 2026
Version 1.0