An infrastructure engineer’s walkthrough of Azure SRE Agent—covering real‑world scenarios, portal access, prerequisites, and how AI‑driven investigation can simplify day‑to‑day Azure operations.
As Azure environments scale, infrastructure teams face a familiar challenge: operating reliably at scale.
Outages are no longer caused by a single VM or misconfigured service—they emerge from complex dependencies across compute, networking, storage, and platform services.
Azure SRE Agent is designed to help infrastructure and SRE teams meet this challenge by bringing AI‑assisted diagnostics and remediation directly into Azure operations.
This post focuses on:
- Infrastructure‑centric scenarios where Azure SRE Agent is most useful
- How infra teams can access it from the Azure portal
- Prerequisites required before onboarding
Azure SRE Agent is currently available in preview. Features, capabilities, and regional availability may change before general availability. This post reflects the product behavior at the time of writing.
What Is Azure SRE Agent (From an Infrastructure Lens)
Azure SRE Agent is an AI‑powered reliability assistant integrated into Azure.
It continuously observes telemetry from Azure Monitor, Log Analytics, and service APIs to help engineers diagnose, investigate, and remediate production issues.
From an infrastructure standpoint, the agent understands:
- Azure resource topology and dependencies
- Common failure patterns across Azure services
- Safe operational actions using Azure CLI and REST APIs
It can either recommend actions or execute remediation steps with appropriate guardrails and approvals.
The agent operates through multiple automation mechanisms, including:
- Built-in Azure knowledge: Preconfigured understanding of Azure services with optimized operational patterns
- Custom runbooks: Execute Azure CLI commands, and REST API calls for any Azure service
- Subagent extensibility: Build specialized agents for specific services like VMs, databases, or networking components
- External integrations: Connect to monitoring, incident management, and source control systems
Infrastructure Scenarios Where Azure SRE Agent Helps the Most
1. Incident Investigation & Production Outages
During an incident, infra engineers typically pivot between alerts, metrics, logs, and dashboards. Azure SRE Agent simplifies this by correlating telemetry automatically and summarizing the issue in natural language.
Typical infrastructure issues:
- Virtual machines becoming unresponsive
- App Service or Container App failures
- Network connectivity or NSG misconfigurations
- Storage throttling or capacity exhaustion
Instead of manually querying logs, engineers can ask the agent why something failed and get a reasoned response.
2. Log and Metric Driven Root Cause Analysis
Azure SRE Agent consumes Azure Monitor and Log Analytics data directly.
This allows it to perform context‑aware RCA without engineers needing to manually write KQL for common scenarios.
Example question:
“Why did my App Service start returning HTTP 500 errors after the last deployment?”
The agent correlates deployment activity, configuration changes, and telemetry to identify the most likely root cause.
3. Safe, Controlled Remediation for Infrastructure Issues
A key value for infra teams is controlled automation.
Azure SRE Agent supports:
- Review mode – actions are proposed and require explicit approval
- Autonomous mode – pre‑approved sub‑agents execute actions automatically
This is useful for repeatable infra tasks such as:
- Restarting unhealthy services
- Scaling compute resources
- Rolling back failed deployments
- Correcting known configuration drift
Automation is applied with guardrails, not blindly.
4. Infrastructure Guardrails & Operational Hygiene
Beyond incidents, Azure SRE Agent can continuously evaluate infrastructure posture and highlight operational risks.
Examples include:
- Detecting insecure network exposure
- Flagging unsupported SKUs or configurations
- Identifying operational anti‑patterns
This helps infra teams move from reactive firefighting to proactive reliability management.
5. Extending Infrastructure Automation with Subagents
Infrastructure teams can extend Azure SRE Agent using subagents tailored to specific domains such as networking, databases, or storage.
Using the Subagent Builder, teams can:
- Attach custom runbooks
- Integrate external observability tools
- Trigger actions on alerts or schedules
This enables gradual adoption—from advisory assistance to advanced automation.
How to Get Started with Azure SRE Agent
The following sections outline the prerequisites, connectivity considerations, and supported integrations required to onboard Azure SRE Agent in an enterprise Azure environment.
Prerequisites and Ownership Model
Azure SRE Agent introduces platform‑level prerequisites that span infrastructure, platform, security, and network teams. While infrastructure teams are the primary users, successful onboarding requires cross‑team alignment.
The sections below explicitly tag ownership for clarity.
1. Subscription & Region
Owner: Platform / Subscription Admin
- Dedicated Azure subscription or resource group recommended for evaluation or PoC
- During preview, the agent control plane must be created in an available location (Sweden Central, Australia East, US East 2), while monitored workloads can reside in any Azure region
- Subscription may need to be allow‑listed for preview access
Infra teams typically request this setup; platform teams implement it.
2. Identity & Access (Critical)
Owner: Platform + Security
Consumer: Infra / SRE
- Ability to create managed identities (system‑assigned or user‑assigned depending on scenario)
- Elevated RBAC permissions required during onboarding:
-
Microsoft.Authorization/roleAssignments/write at subscription scope
- Roles such as Owner, User Access Administrator, or RBAC Administrator
-
- Post‑onboarding, the SRE Agent identity should be granted least‑privilege RBAC:
- Read‑only for investigation scenarios
- Scoped write access only where remediation is approved
Infra teams use the identity; security and platform teams govern access.
3. Resource Provider Registration
Owner: Platform
- Required Azure resource providers must be registered in the subscription
- Includes providers used by the agent runtime and dependent services
Typically, a one‑time platform task.
4. Monitoring & Telemetry Baseline (Hard Dependency)
Owner: Infra / Platform (Shared)
- Azure Monitor enabled for target workloads
- Diagnostic settings configured to send logs and metrics to:
- Log Analytics
- Application Insights (where applicable)
- During agent creation, supporting resources may be deployed:
- Log Analytics workspace
- Application Insights
- Smart detector alert rules
Infra teams depend on this telemetry; platform teams often provide the shared foundation.
5. Network & Connectivity
Owner: Network / Security
- Outbound HTTPS connectivity required to:
- Azure management endpoints (ARM, Monitor, etc.)
- External systems when integrations are enabled (for example, ServiceNow or MCP servers)
- Custom MCP endpoints must be remote and HTTPS‑reachable (local endpoints not supported)
- IP allow‑listing scenarios must be explicitly validated; static egress IPs are not guaranteed
Required only if the organization enforces strict network controls.
6. Connector‑Specific Prerequisites (Conditional)
Owner: Security + Platform
Consumer: Infra / SRE
- Microsoft Teams / Outlook connectors
- OAuth consent for Microsoft 365 APIs
- User‑assigned managed identity required for connector authentication
- Custom MCP connectors
- MCP base URL
- Authentication material (API key, token, or OAuth)
- RBAC permissions to configure and manage connectors
Applies only when integrations are enabled.
7. Automation Readiness
Owner: Infra / SRE + Security
- Clear decision on recommendation‑only vs automated remediation
- Defined approval model:
- Human‑in‑the‑loop
- Scoped autonomy for well‑understood actions
- Agent identity granted write permissions only where automation is explicitly approved
Infra teams define operational intent; security teams validate boundaries.
8. Governance & Data Handling
Owner: Security / Governance
- Acceptance that prompts, responses, and analysis data are stored in the agent’s region
- Alignment with organizational policies for:
- Logging and retention
- Auditability
- Responsible AI usage and approvals
This is a governance prerequisite, not an infra task.
Azure SRE Agent operates as a managed control plane layered on Azure Resource Manager, Azure Monitor, Log Analytics, and managed identities. As a result, identity, telemetry, and governance foundations must be in place before infra teams can fully benefit from the agent.
Integration It Supports
Azure SRE Agent integrates with your operational ecosystem in the following ways:
- Monitoring and observability:
- Azure Monitor (metrics, logs, alerts, workbooks)
- Application Insights
- Log Analytics
- Grafana
- Incident management:
- Azure Monitor Alerts
- PagerDuty
- ServiceNow
- Source control and CI/CD:
- GitHub (repositories, issues)
- Azure DevOps (repos, work items)
- Data sources:
- Azure Data Explorer (Kusto) clusters
- Model Context Protocol (MCP) servers
Connectivity Matrix
1. Azure Control Plane Connectivity
|
Source |
Destination |
Direction |
Protocol / Port |
Authentication |
Purpose |
|
SRE Agent service |
Azure Resource Manager (ARM) |
Outbound |
HTTPS / 443 |
Managed Identity (OAuth 2.0) |
Read and (with approval) write operations on Azure resources. |
|
SRE Agent service |
Azure Monitor |
Outbound |
HTTPS / 443 |
Managed Identity |
Alert ingestion and metric queries. |
|
SRE Agent service |
Log Analytics Workspace |
Outbound |
HTTPS / 443 |
Managed Identity |
Log queries (KQL) for RCA. |
|
SRE Agent service |
Application Insights |
Outbound |
HTTPS / 443 |
Managed Identity |
Application telemetry analysis. |
- All Azure control‑plane communication is outbound only from the agent.
- No inbound connectivity to customer VNets is required.
2. Incident Management Integrations
|
Platform |
Connectivity Type |
Direction |
Protocol / Port |
Auth Mechanism |
Data Exchanged |
|
Azure Monitor Alerts |
Native |
Inbound → Agent |
HTTPS / 443 |
Azure‑native |
Alert payloads, severity, metadata |
|
ServiceNow |
Connector (Webhook/API) |
Outbound & Inbound |
HTTPS / 443 |
OAuth / API token |
Incident creation, updates, status sync |
|
PagerDuty |
Connector (Webhook/API) |
Outbound & Inbound |
HTTPS / 443 |
OAuth / API token |
Incident events, acknowledgements |
- Third‑party platforms require explicit connector configuration.
- Payloads are limited to incident metadata and diagnostics context.
3. Collaboration & Notification Channels
|
Tool |
Direction |
Protocol / Port |
Authentication |
Typical Usage |
|
Microsoft Teams |
Outbound |
HTTPS / 443 |
OAuth (User‑assigned Managed Identity) |
Post incident summaries, updates |
|
Outlook (Email) |
Outbound |
HTTPS / 443 |
OAuth (User‑assigned Managed Identity) |
Incident notifications, reports |
- Only user‑assigned managed identities are supported for Office 365 connectors.
- No mailbox‑level permissions are stored in the agent.
4. External & Custom Integrations
|
Integration Type |
Direction |
Protocol / Port |
Auth |
Example Use Cases |
|
Custom MCP Servers |
Outbound |
HTTPS / 443 |
OAuth / API key |
GitHub issues, Dynatrace, custom observability |
|
Python Execution Tool |
Outbound |
HTTPS / 443 |
As defined by script |
REST checks, custom diagnostics |
- Endpoints must be explicitly configured and approved.
- Secrets do not persist; credentials are injected securely at runtime.
5. Firewall & Network Considerations
- Add *.azuresre.ai to your firewall allow list. Some networking profiles might block access to this domain by default.
- Allow outbound HTTPS (443) to:
- Azure control‑plane endpoints
- *.azuresre.ai (SRE Agent service)
- Configured third‑party endpoints (ServiceNow, PagerDuty, MCP servers)
- No inbound firewall rules or private endpoint exposure is required.
- Compatible with private VNets and restricted outbound models when allow‑listed.
How to Access Azure SRE Agent
Azure SRE Agent can be accessed directly from the Azure portal or via its own SRE portal.
Adding few screenshots below:
Relatable links:
- Check how to use SRE agent, here.
- Understand how to automate workflow, here.
- Perform Azure environment diagnosis using SRE agent, here.
- Check about connectors used to extend reach to external system, here.
- To setup your first investigation, here.
- Learn about its pricing and billing, here.
- Check how to manage permissions for SRE agent, here.
- To setup the MCP connectors in SRE agent, here.
- Anthropic as a sub-processor in Azure SRE Agent
Why Azure SRE Agent Matters for Infrastructure Teams
For infrastructure and SRE teams managing large Azure estates, Azure SRE Agent provides a single, agentic reliability layer that unifies observability, incident management, and delivery workflows into a governed, intent‑driven operating model.
- Reduced Mean Time to Resolution (MTTR): By integrating natively with Azure Monitor and Log Analytics, the agent performs deep, multi‑signal investigation and root‑cause analysis without manual query building or correlation.
- Faster investigation without dashboard hopping: Azure SRE Agent reasons across monitoring, incident, and delivery systems from one interface, eliminating context‑switching across tools.
- Deep investigation & root-cause analysis: Performs multi‑signal correlation across logs, metrics, configuration state, and recent changes to identify true root causes rather than surface symptoms, with clear, explainable findings.
- Lower operational toil: Repetitive diagnostics and triage tasks are handled by the agent, allowing engineers to focus on higher‑value reliability and platform improvements.
- Consistent and auditable incident response: Through Azure Monitor, ServiceNow and PagerDuty integration, investigations are embedded directly into ITSM and on‑call workflows, ensuring traceability, consistency, and governance.
-
Scheduled tasks and proactive checks: Using scheduled workflows, teams can run daily or periodic health validations, drift checks, and post‑deployment verifications—shifting operations from reactive firefighting to proactive reliability management.
-
Extensible sub‑agents and skills: Infrastructure teams can create skills, subagents, and workflows to encode domain expertise into the agent, making reliability knowledge reusable and consistent.
- Delivery and code awareness: Integration with GitHub and Azure DevOps allows the agent to correlate infrastructure issues with source code, IaC definitions, pipelines, and work items—enabling actionable follow‑ups such as bug creation, PR recommendations, or release fixes.
- Safer, governed automation: Human‑in‑the‑loop controls ensure all recommendations and actions are auditable, reviewable, and aligned with enterprise governance, enabling progressive automation without compromising safety.
- Better reliability at infrastructure scale: By shifting teams from manual diagnostics to intent‑driven, agent‑assisted operations, Azure SRE Agent helps organizations move from reactive firefighting to systematic, scalable reliability engineering.
It shifts teams from manual diagnostics to intent‑driven operations.
Closing Thoughts
Azure SRE Agent is not just another troubleshooting tool—it represents a shift toward agent‑assisted infrastructure operations. By embedding AI‑driven reasoning directly into Azure, infrastructure teams can focus less on repetitive investigation and more on building resilient platforms.