Apps on Azure Blog

Agent Hooks: Production-Grade Governance for Azure SRE Agent

Vineela-Suri
Microsoft
Mar 10, 2026

Introduction

Azure SRE Agent helps engineering teams automate incident response, diagnostics, and remediation tasks. But when you're giving an agent access to production systems—your databases, your Kubernetes clusters, your cloud resources—you need more than just automation. You need governance.

Today, we're diving deep into Agent Hooks, the built-in governance framework in Azure SRE Agent that lets you enforce quality standards, prevent dangerous operations, and maintain audit trails without writing custom middleware or proxies.

Agent Hooks work by intercepting your SRE Agent at critical execution points—before it responds to users (Stop hooks) or after it executes tools (PostToolUse hooks). You define the rules once in your custom agent configuration, and the SRE Agent runtime enforces them automatically across every conversation thread.

In this post, we'll show you how to configure Agent Hooks for a real production scenario: diagnosing and remediating PostgreSQL connection pool exhaustion while maintaining enterprise controls.

The Challenge: Autonomous Remediation with Guardrails

You're managing a production application backed by Azure PostgreSQL Flexible Server. Your on-call team frequently deals with connection pool exhaustion issues that cause latency spikes. You want your SRE Agent to diagnose and resolve these incidents autonomously, but you need to ensure:

  1. Quality Control: The agent provides thorough, evidence-based analysis instead of superficial guesses
  2. Safety: The agent can't accidentally execute dangerous commands, but can still perform necessary remediation
  3. Compliance: Every agent action is logged for security audits and post-mortems

Without Agent Hooks, you'd need to build custom middleware, write validation logic around the SRE Agent API, or settle for manual approval workflows. With Agent Hooks, you configure these controls once in your custom agent definition and the SRE Agent platform enforces them automatically.

The Scenario: PostgreSQL Connection Pool Exhaustion

For our demo, we'll use a real production application (octopets-prod-web) experiencing connection pool exhaustion. When this happens:

  • P95 latency spikes from ~120ms to 800ms+
  • Active connections reach the pool limit
  • New requests get queued or fail

The correct remediation is to restart the PostgreSQL Flexible Server to flush stale connections—but we want our agent to do this safely and with proper oversight.

Demo Setup: Three Hooks, Three Purposes

We'll configure three hooks that work together to create a robust governance framework:

Hook #1: Quality Gate (Stop Hook)

Ensures the agent provides structured, evidence-based responses before presenting them to users.

Hook #2: Safety Guardrails (PostToolUse Hook)

Blocks dangerous commands while allowing safe operations through an explicit allowlist.

Hook #3: Audit Trail (Global Hook)

Logs every tool execution across all agents for compliance and debugging.

Step-by-Step Implementation

Creating the Custom Agent

First, we create a specialized subagent in the Azure SRE Agent platform called sre_analyst_agent designed for PostgreSQL diagnostics. In the Agent Canvas, we configure the agent instructions:

You are an SRE agent responsible for diagnosing and remediating production issues for an application backed by an Azure PostgreSQL Flexible Server.

When investigating a problem:
- Use available tools to query Azure Monitor metrics, PostgreSQL logs, and connection statistics
- Look for patterns: latency spikes, connection counts, error rates, CPU/memory pressure
- Quantify findings with actual numbers where possible (e.g., P95 latency in ms, active connection count, error rate %)

When presenting your diagnosis, structure your response with these exact sections:

## Root Cause
A precise explanation of what is causing the issue.

## Evidence
Specific metrics and observations that support your root cause. Include actual numbers: latency values in ms, connection counts, error rates, timestamps.

## Recommended Actions
Numbered list of remediation steps ordered by priority. Be specific: include actual resource names and exact commands.

When executing a fix:
- Always verify the current state before acting
- Confirm the fix worked by re-checking the same metrics after the action
- Report before and after numbers to show impact

These instructions tell the agent how to investigate, how to structure its diagnosis, and how to verify that a fix worked.

Configuring Hook #1: Quality Gate

In the Agent Canvas' Hooks tab, we add our first agent-level hook—a Stop hook that fires before the SRE Agent presents its response. This hook uses the SRE Agent's own LLM to evaluate response quality:

Event Type: Stop
Hook Type: Prompt
Activation: Always

Hook Prompt:

You are a quality gate for an SRE agent that investigates database and app performance issues.

Review the agent's response below:

$ARGUMENTS

Evaluate whether the response meets ALL of the following criteria:

1. Has a "## Root Cause" section with a specific, clear explanation (not vague; must say specifically what failed, e.g., "connection pool exhaustion due to long-running queries holding connections" not just "database issue")
2. Has a "## Evidence" section that includes at least one concrete metric or data point with an actual number (e.g., "P95 latency spiked to 847ms", "active connections: 497/500", "error rate: 23% over last 15 minutes")
3. Has a "## Recommended Actions" section with numbered, specific steps (must include actual resource names or commands, not just "restart the database")

If ALL three criteria are met with substantive content, respond:
{"ok": true}

If ANY criterion is missing, vague, or uses placeholder text, respond:
{"ok": false, "reason": "Your response needs more depth before it reaches the user. Specifically: ## Root Cause must name the exact failure mechanism, ## Evidence must include real metric values with numbers (latency in ms, connection counts, error rates), ## Recommended Actions must reference actual resource names and specific commands. Go back and verify your findings."}

This hook acts as an automated quality gate built directly into the SRE Agent runtime, catching superficial responses before they reach your on-call engineers.
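The structural part of these criteria can also be checked deterministically, which is handy for trying sample responses before relying on the LLM-based gate. The following is a minimal sketch and not part of Azure SRE Agent: the names `split_sections` and `check_response` and the regexes are assumptions made for illustration. It only approximates the three criteria (it cannot judge whether a root cause is specific enough), and it returns the same `{"ok": ...}` shape that the hook prompt asks the LLM to produce.

```python
import re

REQUIRED_SECTIONS = ["Root Cause", "Evidence", "Recommended Actions"]

def split_sections(markdown: str) -> dict[str, str]:
    """Map each '## Heading' to the text that follows it."""
    sections = {}
    current = None
    for line in markdown.splitlines():
        match = re.match(r"^##\s+(.+?)\s*$", line)
        if match:
            current = match.group(1)
            sections[current] = ""
        elif current is not None:
            sections[current] += line + "\n"
    return sections

def check_response(markdown: str) -> dict:
    """Hypothetical structural pre-check mirroring the quality gate criteria."""
    sections = split_sections(markdown)
    for name in REQUIRED_SECTIONS:
        if not sections.get(name, "").strip():
            return {"ok": False, "reason": f"Missing or empty section: ## {name}"}
    # Criterion 2: at least one concrete number in the evidence.
    if not re.search(r"\d", sections["Evidence"]):
        return {"ok": False, "reason": "## Evidence must include at least one number"}
    # Criterion 3: at least one numbered step such as "1. ..." or "1) ...".
    if not re.search(r"^\s*\d+[.)]\s+\S", sections["Recommended Actions"], re.MULTILINE):
        return {"ok": False, "reason": "## Recommended Actions must be a numbered list"}
    return {"ok": True}

good = """## Root Cause
Connection pool exhaustion due to long-running queries holding connections.
## Evidence
P95 latency spiked to 847ms; active connections: 497/500.
## Recommended Actions
1. Restart the PostgreSQL Flexible Server.
"""
vague = "## Root Cause\nDatabase issue.\n## Evidence\nLatency is high.\n"

print(check_response(good))
print(check_response(vague))
```

A check like this complements the prompt hook rather than replacing it: the regexes catch missing structure cheaply, while the LLM judges whether the content is substantive.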

Configuring Hook #2: Safety Guardrails

Our second agent-level hook is a PostToolUse hook that fires after the SRE Agent executes Bash or Python tools. This implements an allowlist pattern to control what commands can actually run in production:

Event Type: PostToolUse
Hook Type: Command (Python)
Matcher: Bash|ExecuteShellCommand|ExecutePythonCode
Activation: Always

Hook Script:

#!/usr/bin/env python3
import sys, json, re

context = json.load(sys.stdin)
tool_input = context.get('tool_input', {})
command = ''
if isinstance(tool_input, dict):
    command = tool_input.get('command', '') or tool_input.get('code', '')

# Safe allowlist: check these FIRST before any blocking logic
# These are explicitly approved remediation actions for PostgreSQL issues
safe_allowlist = [
    r'az\s+postgres\s+flexible-server\s+restart',
]
for safe_pattern in safe_allowlist:
    if re.search(safe_pattern, command, re.IGNORECASE):
        print(json.dumps({
            'decision': 'allow',
            'hookSpecificOutput': {
                'additionalContext': '[SAFETY] ✅ PostgreSQL server restart approved: recognized as a safe remediation action for connection pool exhaustion.'
            }
        }))
        sys.exit(0)

# Destructive commands to block
dangerous = [
    (r'\baz\s+postgres\s+flexible-server\s+delete\b', 'az postgres flexible-server delete (permanent server deletion)'),
    (r'\baz\s+\S+\s+delete\b', 'az delete (Azure resource deletion)'),
    (r'\brm\s+-rf\b', 'rm -rf (recursive force delete)'),
    (r'\bsudo\b', 'sudo (privilege escalation)'),
    (r'\bdrop\s+(table|database)\b', 'DROP TABLE/DATABASE (irreversible data loss)'),
    (r'\btruncate\s+table\b', 'TRUNCATE TABLE (irreversible data wipe)'),
    (r'\bdelete\s+from\b(?!.*\bwhere\b)', 'DELETE FROM without WHERE clause (wipes entire table)'),
]
for pattern, label in dangerous:
    if re.search(pattern, command, re.IGNORECASE):
        print(json.dumps({
            'decision': 'block',
            'reason': f'🛑 BLOCKED: {label} is not permitted. Use safe, non-destructive alternatives. For PostgreSQL connection issues, prefer server restart or connection pool configuration changes.'
        }))
        sys.exit(0)

print(json.dumps({'decision': 'allow'}))

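The script explicitly approves the PostgreSQL restart and blocks the listed destructive patterns (Azure resource deletion, rm -rf, sudo, DROP, TRUNCATE, and DELETE without WHERE). Any command that matches neither list is allowed by default, so extend the deny patterns or switch to a default-deny allowlist if you need only pre-approved operations to run.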
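Note that the script above approves any command that contains the restart pattern, even when another command is chained after it. A stricter, default-deny variant can be sketched as follows. This is an illustration under stated assumptions, not the post's or the platform's implementation: the `decide` function, the `ARGS` and `SEPARATORS` regexes, and the choice of the read-only `az postgres flexible-server show` as a second approved command are hypothetical, and the sketch is not a full shell parser. The `decision` and `reason` keys mirror the script above.

```python
import json
import re

# Arguments are limited to flags with optional simple values, which rules out
# redirection, command substitution, and quoting tricks.
ARGS = r"(?:\s+--?[A-Za-z][\w-]*(?:[\s=]+[\w.-]+)?)*"

# Explicitly approved commands; anything else is blocked (default deny).
ALLOWLIST = [
    re.compile(r"az\s+postgres\s+flexible-server\s+restart" + ARGS, re.IGNORECASE),
    re.compile(r"az\s+postgres\s+flexible-server\s+show" + ARGS, re.IGNORECASE),
]

# Split on shell separators so each chained command is checked on its own.
SEPARATORS = re.compile(r"[;|&\n]+")

def decide(command: str) -> dict:
    """Allow only if every chained segment fully matches an approved pattern."""
    segments = [s.strip() for s in SEPARATORS.split(command) if s.strip()]
    if not segments:
        return {"decision": "block", "reason": "Empty command."}
    for segment in segments:
        if not any(p.fullmatch(segment) for p in ALLOWLIST):
            return {"decision": "block", "reason": f"Not on the allowlist: {segment}"}
    return {"decision": "allow"}

# Hypothetical resource names, for illustration only.
restart = "az postgres flexible-server restart --name my-server --resource-group my-rg"
print(json.dumps(decide(restart)))
print(json.dumps(decide(restart + "; rm -rf /tmp/x")))
```

The trade-off is maintenance: every new remediation command has to be added to the allowlist before the agent can use it, which is the point of a default-deny policy.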

Now that we've configured both agent-level hooks, here's what our custom agent looks like in the canvas:

[Figure: Custom Agent Configuration. Agent Canvas showing the sre_analyst_agent configuration with two agent-level hooks attached.]

Configuring Hook #3: Audit Trail

Finally, we create a Global hook using the Hooks management page in the Azure SRE Agent Portal. Global hooks apply across all custom agents in your organization, providing centralized governance:

[Figure: Creating the sre_audit_trail global hook. The Global Hooks management page showing the hook configuration with event type, activation mode, matcher pattern, and Python script editor.]

Event Type: PostToolUse
Hook Type: Command (Python)
Matcher: * (all tools)
Activation: On-demand

Hook Script:

#!/usr/bin/env python3
import sys, json

context = json.load(sys.stdin)
tool_name = context.get('tool_name', 'unknown')
agent_name = context.get('agent_name', 'unknown')
succeeded = context.get('tool_succeeded', False)
turn = context.get('current_turn', '?')

audit = f'[AUDIT] Turn {turn} | Agent: {agent_name} | Tool: {tool_name} | Success: {succeeded}'
print(audit, file=sys.stderr)

print(json.dumps({
    'decision': 'allow',
    'hookSpecificOutput': {
        'additionalContext': audit
    }
}))

By setting this as "on-demand," your SRE engineers can toggle this hook on/off per conversation thread from the chat interface—enabling detailed audit logging during incident investigations without overwhelming logs during routine queries.
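To see what the audit hook emits before attaching it, the same logic can be exercised locally. The sketch below wraps the script's body in a function that takes its streams as parameters and feeds it a sample payload. The payload values are made up for illustration, and the assumption that the runtime supplies these fields (`tool_name`, `agent_name`, `tool_succeeded`, `current_turn`) comes from the script above, not from a documented schema.

```python
import io
import json

def run_audit_hook(stdin, stdout, stderr):
    """Same logic as the audit script, with the streams passed in for testing."""
    context = json.load(stdin)
    audit = (
        f"[AUDIT] Turn {context.get('current_turn', '?')} | "
        f"Agent: {context.get('agent_name', 'unknown')} | "
        f"Tool: {context.get('tool_name', 'unknown')} | "
        f"Success: {context.get('tool_succeeded', False)}"
    )
    print(audit, file=stderr)
    print(json.dumps({
        "decision": "allow",
        "hookSpecificOutput": {"additionalContext": audit},
    }), file=stdout)

# Hypothetical payload for illustration only.
sample = {
    "tool_name": "ExecuteShellCommand",
    "agent_name": "sre_analyst_agent",
    "tool_succeeded": True,
    "current_turn": 3,
}
out, err = io.StringIO(), io.StringIO()
run_audit_hook(io.StringIO(json.dumps(sample)), out, err)
result = json.loads(out.getvalue())

print(err.getvalue().strip())
# [AUDIT] Turn 3 | Agent: sre_analyst_agent | Tool: ExecuteShellCommand | Success: True
```

Passing the streams in keeps the hook logic testable without a running agent; the deployed script would still read `sys.stdin` and write to `sys.stdout` and `sys.stderr` as shown above.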

Seeing Agent Hooks in Action

Now let's see how these hooks work together when our SRE Agent investigates a real production incident.

Activating Audit Trail

Before starting our investigation, we toggle on the audit trail hook from the chat interface:

[Figure: Hooks toggle UI. The "Manage hooks for this thread" menu showing the sre_audit_trail global hook toggled on for this conversation.]

This gives us visibility into every tool the agent executes during the investigation.

Starting the Investigation

We prompt our SRE Agent: "Can you check the octopets-prod-web application and diagnose any performance issues?"

The SRE Agent begins gathering metrics from Azure Monitor, and we immediately see our audit trail hook logging each tool execution.

This real-time visibility is invaluable for understanding what your SRE Agent is doing and debugging issues when things don't go as planned.

Quality Gate Rejection

The SRE Agent completes its initial analysis and attempts to respond. But our Stop hook intercepts it—the response doesn't meet our quality standards:

[Figure: Quality gate rejection. Stop hook rejection message: "Your response needs more depth and specificity..." forcing the agent to re-analyze with more evidence.]

The hook rejects the response and forces the SRE Agent to retry—gathering more evidence, querying additional metrics, and providing specific numbers. This self-correction happens automatically within the SRE Agent runtime, with no manual intervention required.
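Conceptually, the Stop hook turns response generation into a bounded feedback loop. The sketch below models that loop in plain Python to make the control flow concrete; it is not the SRE Agent runtime. `generate` and `gate` are hypothetical stand-ins for the agent and the Stop hook, and the `max_attempts` cap is an assumption, since the post does not say how many retries the runtime allows.

```python
from typing import Callable, Optional

def respond_with_gate(
    generate: Callable[[Optional[str]], str],
    gate: Callable[[str], dict],
    max_attempts: int = 3,
) -> str:
    """Regenerate until the gate returns {"ok": True} or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        draft = generate(feedback)
        verdict = gate(draft)
        if verdict.get("ok"):
            return draft
        # The rejection reason is fed back so the next draft can address it.
        feedback = verdict.get("reason", "Response rejected by quality gate.")
    raise RuntimeError(f"Gate still failing after {max_attempts} attempts: {feedback}")

# Toy stand-ins: the first draft is vague, the retry adds a metric.
def toy_generate(feedback):
    return "P95 latency is 847ms" if feedback else "Latency is high"

def toy_gate(draft):
    has_number = any(ch.isdigit() for ch in draft)
    return {"ok": True} if has_number else {"ok": False, "reason": "Include a metric value."}

print(respond_with_gate(toy_generate, toy_gate))  # prints: P95 latency is 847ms
```

The key design point is that the gate's `reason` becomes input to the next attempt, which is why the rejection message in the hook prompt spells out exactly what is missing.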

Structured Final Response

After re-verification, the SRE Agent presents a properly structured analysis that passes our quality gate:

[Figure: Structured agent response with the required sections: Root Cause with the connection pool exhaustion diagnosis, Evidence with specific metric numbers, and Recommended Actions with the exact restart command.]

  • Root Cause: Connection pool exhaustion
  • Evidence: Specific metrics (83 active connections, P95 latency 847ms)
  • Recommended Actions: Restart command with actual resource names

This is the level of rigor we expect from production-ready agents.

Safety Allowlist in Action

The SRE Agent determines it needs to restart the PostgreSQL server to remediate the connection pool exhaustion. Our PostToolUse hook intercepts the command execution and validates it against our allowlist:

[Figure: Command execution. Code execution output showing the PostgreSQL metrics query results and the az postgres flexible-server restart command being executed successfully.]

Because the az postgres flexible-server restart command matches our safety allowlist pattern, the hook allows it to proceed. If the SRE Agent had attempted a blocked operation (like DROP DATABASE or az postgres flexible-server delete), the safety hook would have blocked it immediately.

The Results

After the SRE Agent restarts the PostgreSQL server:

  • P95 latency drops from 847ms back to ~120ms
  • Active connections reset to healthy levels
  • Application performance returns to normal

But more importantly, we achieved autonomous remediation with enterprise governance:

  • ✅ Quality assurance: Every response met our evidence standards (enforced by Stop hooks)
  • ✅ Safety controls: Only pre-approved operations executed (enforced by PostToolUse hooks)
  • ✅ Complete audit trail: Every tool call logged for compliance (enforced by Global hooks)
  • ✅ Zero manual interventions: The SRE Agent self-corrected when quality standards weren't met

This is the power of Agent Hooks—governance that doesn't get in the way of automation.

Key Takeaways

Agent Hooks bring production-grade governance to Azure SRE Agent:

  1. Layered Governance: Combine agent-level hooks for custom agent-specific controls with global hooks for organization-wide policies
  2. Fail-Safe by Default: Use allowlist patterns in PostToolUse hooks rather than denylists—explicitly permit safe operations instead of trying to block every dangerous one
  3. Self-Correcting SRE Agents: Stop hooks with quality gates create feedback loops that improve response quality without human intervention
  4. Audit Without Overhead: On-demand global hooks let your engineers toggle detailed logging only during incident investigations
  5. No Custom Middleware: All governance logic lives in your custom agent configuration—no need to build validation proxies or wrapper services

Getting Started

Agent Hooks are available now in the Azure SRE Agent platform. You can configure them entirely through the UI—no API calls or tokens needed:

  1. Agent-Level Hooks: Navigate to the Agent Canvas → Hooks tab and add hooks directly to your custom agent
  2. Global Hooks: Use the Hooks management page to create organization-wide policies
  3. Thread-Level Control: Toggle on-demand hooks from the chat interface using the "Manage hooks" menu

Learn More

Ready to build safer, smarter agents? Start experimenting with Agent Hooks today at sre.azure.com.

Updated Mar 10, 2026
Version 1.0