
Azure SRE Agent for Azure Monitor Alerts: Reduce Alert Fatigue, Investigate What Matters

Vineela-Suri
Apr 29, 2026

Azure Monitor alerts are one of the most widely adopted observability signals in the cloud. But for many teams, those alerts have become background noise. Here's how Azure SRE Agent helps your team manage Azure Monitor alerts more effectively.

The Alert Problem

Organizations running Azure Monitor tend to land in one of two situations:

  1. Alert fatigue has set in. Alert rules tend to grow over time — a CPU threshold from two years ago, a health probe check from a migration, a disk alert from an outage that never got cleaned up. These rules fire regularly, most auto-resolve, and nobody investigates them. But buried in that noise are real incidents that go unnoticed until they escalate.
  2. Teams respond, but the effort is repetitive. Engineers triage the same alerts repeatedly — running the same diagnostic queries, confirming the same "transient spike, no action needed" conclusion. They know the rule is noisy, but fixing it in Azure Monitor requires data they don't have readily available: What should the threshold be? What's the auto-resolution rate? Is it safe to change? So the noisy rule stays, and the manual toil continues.

Both situations share the same gap: there's no intelligent layer between Azure Monitor and the team. Azure SRE Agent fills that gap — it receives alert fires in real time, investigates them automatically, consolidates noisy ones, and surfaces the data your team needs to improve the rules at the source.


Here's how to set it up.

1. Intelligent Alert Handling: Cooldown and Response Plan Configuration

1.a. Alert Reinvestigation Cooldown

The most impactful configuration for Azure Monitor alerts is the new reinvestigation cooldown. This is a per-response-plan setting that controls how the agent handles repeated fires of the same alert rule.

When an alert rule fires and the agent already has an active thread for that rule, it merges the new fire into the existing thread — no new investigation, no duplicate work. What makes this especially useful: if the previous thread was resolved or closed within the cooldown window, the agent reopens it and appends the new fire rather than starting a fresh investigation. This catches the common "it fired, we resolved it, it fired again 30 minutes later" pattern that generates the most duplicate effort.
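In rough pseudocode terms, the cooldown decision looks like the sketch below. This is an illustrative sketch only; the thread shape and field names are hypothetical, not the agent's internal implementation:

```python
from datetime import datetime, timedelta

COOLDOWN = timedelta(hours=3)  # the default reinvestigation window

def handle_alert_fire(rule_id: str, fired_at: datetime, threads: dict):
    """Illustrative sketch of the reinvestigation cooldown (hypothetical shapes)."""
    thread = threads.get(rule_id)

    # Case 1: an investigation is already active -> merge, no duplicate work.
    if thread and thread["status"] == "active":
        thread["fires"].append(fired_at)
        return thread

    # Case 2: the thread was resolved/closed within the cooldown window
    # -> reopen it and append the new fire instead of reinvestigating.
    # (closed_at is set when the thread is resolved or closed.)
    if thread and fired_at - thread["closed_at"] < COOLDOWN:
        thread["status"] = "active"
        thread["fires"].append(fired_at)
        return thread

    # Case 3: no recent thread -> a genuinely new investigation.
    thread = {"status": "active", "fires": [fired_at], "closed_at": None}
    threads[rule_id] = thread
    return thread
```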

To configure it: Navigate to your AzMonitor response plan and look for the "Alert reinvestigation cooldown" section in the Save step. It's enabled by default with a 3-hour window — a default chosen because most noisy alert rules re-fire within a 1–3 hour cycle, making this window broad enough to catch recurring patterns while short enough that a genuinely new issue several hours later still gets a fresh investigation.

[Image: Response plan Save step with merge enabled and cooldown set]

To disable the cooldown entirely — for critical alerts where every fire demands a fresh investigation — uncheck the merge toggle:

[Image: Response plan Save step with merge disabled and warning banner]

You can adjust the window between 1 and 24 hours depending on the alert pattern:

| Alert Pattern | Recommended Window |
| --- | --- |
| Frequent polling-based alerts (health probes, heartbeats) | 1–2 hours |
| Recurring issues tied to daily batch jobs or deploy cycles | 6–12 hours |
| Intermittent failures with unpredictable recurrence | 12–24 hours |
| Critical alerts where every fire demands a fresh look | Disable the cooldown entirely |

1.b. Segmenting Alerts with Response Plans

The cooldown works best when paired with tiered response plans that route alerts by severity and title keyword. Rather than one catch-all plan for all alert types, create separate plans that match the right investigation depth to the right alerts. A sketch of this routing logic follows the list below.

  • Critical alerts (Sev0–1, titles containing "failover", "security", "data loss") — disable cooldown. Every fire gets a fresh investigation because a repeat fire here likely means the first remediation didn't hold.
  • Operational alerts (Sev2, titles containing "high CPU", "memory pressure", "latency") — set a 6-hour cooldown. These are real issues, but recurring fires within a few hours are almost always the same root cause. The agent consolidates them into one thread while still giving a genuinely new occurrence later in the day a fresh look.
  • Low-priority alerts (Sev3–4, titles containing "health probe", "availability test") — set a short 1-hour cooldown. These rarely require deep investigation. The agent captures context without spending effort on redundant analysis.
  • Informational alerts — don't create a response plan at all. These are telemetry, not incidents.
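To make the tiering concrete, here's a minimal sketch of the routing rules above. The plan names, severity sets, and keyword matching are illustrative assumptions; in the product you express these rules through response plan filters, not code:

```python
ROUTING = [
    # (severities, title keywords, plan name, cooldown hours; None = disabled)
    ({0, 1}, ("failover", "security", "data loss"), "critical-alerts", None),
    ({2},    ("high cpu", "memory pressure", "latency"), "operational-alerts", 6),
    ({3, 4}, ("health probe", "availability test"), "low-priority-alerts", 1),
]

def route(severity: int, title: str):
    """Return (response plan, cooldown hours), or None for informational alerts."""
    t = title.lower()
    for severities, keywords, plan, cooldown in ROUTING:
        if severity in severities and any(k in t for k in keywords):
            return plan, cooldown
    return None  # no response plan: treat as telemetry, not an incident

# route(2, "High CPU on web-frontend") -> ("operational-alerts", 6)
```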

This tiering works regardless of which agent mode (Autonomous or Review) your team uses. The value comes from the cooldown and severity segmentation — agent mode is a separate decision based on your team's comfort level with autonomous remediation.

To see the difference this makes in practice: we deployed a web app with Azure Monitor alert rules and induced real failures. Azure Monitor fired 9 alerts across three rule types over a few hours. The agent consolidated them based on each response plan's cooldown:

[Image: Azure Monitor Alerts page showing 9 fired alerts]
| Alert Rule | Response Plan | Merge Setting | AzMon Fires | Agent Threads | Total Alerts (in thread) | What Happened |
| --- | --- | --- | --- | --- | --- | --- |
| High Response Time (Sev3) | low-priority-alerts | Merge ON, 4h cooldown | 4 | 1 | 4 | All 4 fires merged into a single thread — the agent investigated once and appended recurring fires |
| HTTP 5xx Errors (Sev2) | critical-alerts-no-merge | Merge OFF | 3 | 3 | 1 each | Each fire created its own investigation — appropriate for critical alerts where every occurrence matters |
| High CPU (Sev2) | operational-alerts | Merge ON, 1h cooldown | 2 | 2 | 1 each | Fires were >1 hour apart (resolved at 12:05, re-fired at 3:37) — outside the cooldown window, so the agent correctly treated them as separate incidents |

The key insight: the same 9 Azure Monitor alerts produced different agent behavior depending on the response plan configuration. The High Response Time rule demonstrates the merge path saving 3 redundant investigations. The HTTP 5xx rule shows merge disabled for critical alerts. And the High CPU rule shows what happens when the cooldown window is too short for the alert's recurrence pattern — a signal to increase the window.
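If you're unsure which window fits a given rule, the rule's own fire history can suggest one. Here's a minimal sketch, assuming you've exported the rule's recent fire timestamps from Azure Monitor:

```python
from datetime import datetime
from statistics import median

def suggest_cooldown_hours(fire_times: list[datetime]) -> int:
    """Suggest a cooldown that covers the typical gap between repeat fires."""
    times = sorted(fire_times)
    gaps = [(b - a).total_seconds() / 3600 for a, b in zip(times, times[1:])]
    if not gaps:
        return 3  # no recurrence data: keep the 3-hour default
    # Round the median inter-fire gap up, clamped to the 1-24 hour range.
    return min(24, max(1, int(median(gaps)) + 1))
```

Applied to the High CPU rule above, whose fires were roughly 3.5 hours apart, this would suggest a window of at least 4 hours.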

2. Proactive Noise Monitoring: Let the Agent Analyze Its Own Patterns

Handling alerts intelligently is the first step. The next is having the agent proactively surface insights about your alert landscape so your team can improve the rules at the source — exactly the data that teams in the second situation from the intro are missing.

2.a. Weekly Alert Hygiene Report

Create a weekly scheduled task with instructions like:

Analyze all Azure Monitor alert threads from the past 7 days. For each alert rule that fired more than 3 times, produce a ranked report covering:

  • High Auto-Resolution Rules: Rules with high auto-resolution rates. Recommend threshold changes or suppression windows.
  • Rules with Recurring Root Causes: Rules where the same root cause recurs. Recommend permanent remediation actions. 
  • Miscategorized Severity: Rules where investigation concludes low impact but the alert is Sev1/Sev2. Recommend severity adjustment.
  • Cost Summary: Estimated effort consumed per alert rule this week.


This creates a compounding feedback loop. Week over week, your team has a concrete, data-backed list of which alert rules to adjust in Azure Monitor — complete with specific recommendations. The data that was too time-consuming to gather manually is now generated automatically.
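The core statistics behind the report are simple to compute. A minimal sketch, assuming alert records exported as plain dicts (the field names are illustrative, not an Azure Monitor schema):

```python
from collections import defaultdict

def hygiene_stats(alerts: list[dict]) -> list[dict]:
    """Rank alert rules by fire count, with each rule's auto-resolution rate."""
    by_rule = defaultdict(list)
    for a in alerts:  # e.g. {"rule": "High CPU", "auto_resolved": True}
        by_rule[a["rule"]].append(a)

    rows = []
    for rule, fires in by_rule.items():
        if len(fires) <= 3:  # report only rules that fired more than 3 times
            continue
        auto = sum(a["auto_resolved"] for a in fires) / len(fires)
        rows.append({"rule": rule, "fires": len(fires), "auto_resolution": auto})
    return sorted(rows, key=lambda r: r["fires"], reverse=True)
```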


[Image: Scheduled task producing a weekly alert hygiene report]

2.b. Monthly Threshold Audit

For a deeper analysis, schedule a monthly task:

Audit Azure Monitor alert rules for this agent's subscriptions. For each rule:

  1. Query the rule's metric history over 30 days
  2. Compare current threshold vs. actual P50, P90, and P99 values
  3. Flag rules with threshold below P50 (always firing) or above P99 (never firing)
  4. For high-frequency rules with high auto-resolution, recommend a threshold at P95 to reduce fires while still catching genuine anomalies

Produce: a threshold optimization table, a list of dormant rules (no fires in 30+ days), and specific Azure CLI commands to update each rule.


This is the highest-leverage outcome because it fixes noise at the source. A single threshold adjustment on one noisy rule can eliminate hundreds of alert fires per month — permanently. And the agent provides the data and specific commands to make it happen.
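To make the percentile comparison in steps 2–4 concrete, here's a sketch using Python's statistics module, assuming 30 days of metric samples for a rule that fires when the metric exceeds its threshold:

```python
from statistics import quantiles

def audit_threshold(samples: list[float], threshold: float) -> dict:
    """Compare an alert threshold against the metric's 30-day distribution."""
    # quantiles with n=100 returns the 1st..99th percentile cut points.
    pct = quantiles(samples, n=100)
    p50, p95, p99 = pct[49], pct[94], pct[98]

    if threshold < p50:
        verdict = "always firing - raise the threshold"
    elif threshold > p99:
        verdict = "never firing - dormant, or threshold too high"
    else:
        verdict = "in range"
    # P95 is a reasonable starting point for noisy, auto-resolving rules.
    return {"p50": p50, "p99": p99, "suggested": p95, "verdict": verdict}
```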

What This Means for Agent Costs

Each alert investigation consumes LLM tokens — for reasoning, querying, and building analysis. Without thoughtful configuration, a high-volume alert pipeline can lead to higher agent costs than expected.

The setup described in this post naturally keeps token usage in check: the cooldown prevents redundant investigations, tiered response plans match effort to alert importance, and low-priority alerts get minimal attention. For additional control, you can optionally add a PostToolUse hook that nudges the agent to include time-range filters in Log Analytics queries — preventing large, unbounded result sets from inflating the conversation context. Since this hook uses a simple regex check on the query text rather than an LLM call, it adds zero token cost of its own.
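As an illustration of how lightweight that check can be, consider the sketch below. The hook wiring is specific to your agent configuration, so treat the function shape as hypothetical; the regex idea is the point:

```python
import re

# KQL time filters we accept as evidence the query is time-bounded.
TIME_FILTER = re.compile(r"\b(ago\s*\(|between\s*\(|TimeGenerated\s*[<>])",
                         re.IGNORECASE)

def check_query(query: str) -> str | None:
    """Return a nudge for the agent if a Log Analytics query has no time bound."""
    if TIME_FILTER.search(query):
        return None  # already bounded; nothing to add
    return ("Consider adding a time-range filter such as "
            "'| where TimeGenerated > ago(1h)' to keep result sets small.")
```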

Getting Started

  1. Connect Azure Monitor as an incident source in your SRE Agent
  2. Enable the reinvestigation cooldown on your response plans (the 3-hour default is a sensible starting point)
  3. Create tiered response plans — at minimum, separate critical alerts (cooldown disabled) from operational alerts (cooldown 6h) and low-priority alerts (cooldown 1h)
  4. Set up a weekly alert hygiene report as a scheduled task to start building visibility into your alert patterns
  5. Add the monthly threshold audit once your weekly reports have a few weeks of data

Start with the first three — they take a few minutes each and begin working immediately.

Learn More


Updated Apr 29, 2026
Version 1.0