serverless
# Introducing Azure Container Apps Express!
Three years ago, a 15-second cold start was industry-leading. Today, developers and AI agents expect sub-second. The speed bar has moved, and the tooling needs to move with it.

After running Azure Container Apps for years, we've learned something important: for most developers, the ACA environment is an unnecessary construct. It adds provisioning time, configuration surface, and cognitive overhead — when all you really want is to run your app with scaling, networking, and operations handled for you.

At the same time, a new class of workloads has emerged. Agent-first platforms — systems where AI agents deploy endpoints on demand, spin up tool-use APIs, and tear them down when work is done — demand an even more radical focus on speed and simplicity. Every second of provisioning delay is wasted agent productivity.

Today, we're launching Azure Container Apps Express in Public Preview — the fastest, simplest way to go from a container image to an internet-reachable app on Azure, ready for many production-style workloads.

## What Is ACA Express?

ACA Express removes the infrastructure decisions. There's no environment to provision, no networking to configure, no scaling rules to write. You bring a container image; Express handles everything else. Behind the scenes, Express runs your container on pre-provisioned capacity with sensible defaults baked in — so you skip environment setup without giving up ACA's serverless model. There's more coming in this space soon — keep watching.

Here's what that means in practice:

- **Instant provisioning** — your app is running in seconds, not minutes
- **Sub-second cold starts** — fast enough for interactive UIs and on-demand agent endpoints
- **Scale to and from zero** — automatic, no configuration required (full scaling controls coming soon)
- **Per-second billing** — pay only for what you use
- **Production-ready defaults** — ingress, secrets, environment variables, and observability are built in

Express is purpose-built for two audiences: developers who want to ship fast (SaaS apps, APIs, web dashboards, prototypes) and agents that deploy on demand (MCP servers, tool-use endpoints, multi-step workflow APIs, human-in-the-loop UIs). If you've ever waited for an ACA environment to provision, only to realize you didn't need half of the configuration options it asked you for — Express is your answer.

## What You Can Do Today

Note: West Central US is currently the only available region. We will expand to new regions over the coming days.

Express is in Public Preview starting today. It's a deliberate early ship — there's a meaningful feature gap compared to the existing Azure Container Apps offering, and we're filling it fast. New capabilities are landing on a rapid cadence throughout the preview, and by Microsoft Build in June, Express should be close to feature-complete. For the current list of supported features, known gaps, and what's on the way, see the Express documentation.

We'd rather put valuable technology in your hands early and iterate with you than wait behind closed doors for perfection.

## Who Is Express For?

| Scenario | Why Express |
| --- | --- |
| SaaS apps and APIs | Deploy and scale without infrastructure planning |
| AI app frontends | Chat UIs and copilot frontends that scale with usage spikes |
| MCP servers | Expose API endpoints for AI agents in seconds |
| Agent workflows | Spin up endpoints on demand, tear down when done |
| Prototypes and startups | Go from idea to production in minutes |
| Web dashboards | Internal tools with instant availability |

## Get Started

Express is available now in Public Preview.
Try it:

- **Azure Container Apps Express overview** — concepts, capabilities, and the current feature support matrix
- **Deploy your first app with the Azure CLI** — step-by-step quickstart
- **New Azure Container Apps Portal** — create and manage Express apps alongside your existing Container Apps resources

Have questions? Check the Azure Container Apps Express FAQ for answers to common questions about pricing, limits, regions, and the road to GA.

We're building Express in the open and we want to hear from you. Tell us what features matter most, what works, and what doesn't — reach out on the Azure Container Apps GitHub or in the comments below.

# Running Foundry Agent Service on Azure Container Apps
Microsoft's Customer Zero blog series gives an insider view of how Microsoft builds and operates Microsoft using our trusted, enterprise-grade agentic platform. Learn best practices from our engineering teams with real-world lessons, architectural patterns, and operational strategies for pressure-tested solutions in building, operating, and scaling AI apps and agent fleets across the organization.

## Challenge: Scaling agents to production changes the requirements

As teams move from experimenting with AI agents to running them in production, the questions they ask begin to change. Early prototypes often focus on whether an agent can reason to generate useful output. But once agents are placed into real systems where they continuously need to serve users and respond to events, new concerns quickly take center stage: reliability, scale, observability, security, and long‑running operations.

A common misconception at this stage is to think of an agent as a simple chatbot wrapped around an API. In practice, an AI agent is something very different. It is a service that listens, thinks, and acts, ingesting unstructured inputs, reasoning over context, and producing outputs that may span multiple phases. Treating agents as services means teams often need more than they initially expect: dependable compute, strong security, and real-time visibility to run agents safely and effectively at scale.

When we kick off an agent loop, we provide input that informs the context it recalls for the task, the data it connects to, the tools it calls, and the reasoning steps it outlines for itself to generate an output. Agent needs differ from those of traditional services in hosting, scaling, identity, security, and observability; an agent is probabilistic by nature, yet it requires secure, auditable access to many resources while delivering the same fast, responsive performance users expect from any software.

This isn't the first time the software industry has needed to evolve its thinking around infrastructure. When modern application architectures began shifting from monolithic apps toward microservices, existing infrastructure wasn't built with that model in mind. As systems were reconstructed into independent services, teams quickly discovered they needed new runtime architecture that properly accommodated microservice needs. The modern app era brought new levels of performance, reliability, and scalability, but it also required rebuilding app infrastructure around container orchestration and new operational patterns.

AI agents represent a similar inflection. Infrastructure designed for request‑response applications or stateless workloads wasn't built with long‑running, tool‑calling, AI‑driven workflows in mind. As the builders of Foundry Agent Service, we were very aware that traditional architectures wouldn't hold up to the bursty agentic workflows that needed to aggregate data across sources, connect to several simultaneous tools, and reason through execution plans for the output we needed. Rather than building new infrastructure from scratch, the clear choice was to build on Azure Container Apps. With over a million apps hosted on Azure Container Apps, it was the tried-and-true solution we needed to keep our team focused on building agent intelligence and behavior instead of the plumbing underneath.
## Solution: Building Foundry Agent Service on a resilient agent runtime foundation

Foundry Agent Service is Microsoft's fully managed platform for building, deploying, and scaling AI agents as production services. Builders start by choosing their preferred framework or immediately building an agent inside Foundry, while Foundry Agent Service handles the operational complexity required to run agents at scale.

Let's use the example of a sales agent in Foundry Agent Service. You might have a salesperson who prompts a sales agent with "Help me prepare for my upcoming meeting with customer Contoso." The agent is going to kick off several processes across data and tools to generate the best answer: Work IQ to understand Teams conversations with Contoso, Fabric IQ for current product usage and forecast trends, Foundry IQ to do an AI search over internal sales materials, and even GitHub Copilot SDK to generate and execute code that can draft PowerPoint and Word artifacts for the meeting. And this is just one agent; more than 20,000 customers rely on Foundry Agent Service.

At the core of Foundry Agent Service is a dedicated agent runtime through Azure Container Apps that explicitly meets our demands for production agents. Agent runtime through flexible cloud infrastructure allows builders to focus on making powerful agent experiences without worrying about under-the-hood compute and configurations. This runtime is built around five foundational pillars:

- **Fast startup and resume.** Agents are event‑driven and often bursty. Responsiveness depends on the ability to start or resume execution quickly when events arrive.
- **Built‑in agent tool execution.** Agents must securely execute tool calls like APIs, workflows, and services as part of their reasoning process, without fragile glue code or ad‑hoc orchestration.
- **State persistence and restore.** Many agent workflows are long‑running and multi‑phase. The runtime must allow agents to reason, pause, and resume with safely preserved state.
- **Strong isolation per agent task.** As agents execute code and tools dynamically, isolation is critical to prevent data leakage and contain blast radius.
- **Secure by default.** Identity, access, and execution controls are enforced at the runtime layer rather than bolted on after the fact.

Together, these pillars define what it means to run AI agents as first‑class production services.

## Impact: How Azure Container Apps powers agent runtime

Building and operating agent infrastructure from scratch introduces unnecessary complexity and risk. Azure Container Apps has been pressure‑tested at Microsoft scale, proving to be a powerful, serverless foundation for running AI workloads, and it aligns naturally with the needs of agent runtime. It provides serverless, event‑driven scaling with fast startup and scale‑to‑zero, which is critical for agents with unpredictable execution patterns. Execution is secure by default, with built‑in identity, isolation, and security boundaries enforced at the platform layer. Azure Container Apps natively supports running MCP servers and executing full agent workflows, while Container Apps jobs enable on‑demand tool execution for discrete units of work without custom orchestration. For scenarios involving AI‑generated or untrusted code, dynamic sessions allow execution in isolated sandboxes, keeping blast radius contained. Azure Container Apps also supports running model inference directly within the container boundary, helping preserve data residency and reduce unnecessary data movement.
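As a concrete illustration of the dynamic sessions capability mentioned above, here is a minimal Python sketch of sending a snippet of untrusted code to an Azure Container Apps dynamic sessions pool for isolated execution. Treat it as a sketch only: the pool endpoint is a placeholder, and the API version, request shape, and token scope are recalled from the preview documentation and may have changed, so check the current API reference before relying on them.

```python
import requests
from azure.identity import DefaultAzureCredential

# Placeholder: the pool management endpoint of your dynamic sessions pool.
POOL_ENDPOINT = "https://<region>.dynamicsessions.io/subscriptions/<sub-id>/resourceGroups/<rg>/sessionPools/<pool-name>"


def run_in_sandbox(code: str, session_id: str) -> dict:
    """Execute a code snippet in an isolated dynamic session (illustrative sketch)."""
    # Token scope for dynamic sessions per the preview docs; treat as an assumption.
    token = DefaultAzureCredential().get_token("https://dynamicsessions.io/.default").token
    response = requests.post(
        f"{POOL_ENDPOINT}/code/execute",
        params={"api-version": "2024-02-02-preview", "identifier": session_id},
        headers={"Authorization": f"Bearer {token}"},
        json={
            "properties": {
                "codeInputType": "inline",
                "executionType": "synchronous",
                "code": code,
            }
        },
        timeout=60,
    )
    response.raise_for_status()
    return response.json()


# Each session identifier maps to its own isolated sandbox, which is what keeps
# AI-generated code contained to a small blast radius.
# result = run_in_sandbox("print(2 + 2)", session_id="agent-task-42")
```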
## Learnings for your agent runtime foundation

- **Make infrastructure flexible with serverless architecture.** AI systems move too fast to create infrastructure from scratch. With bursty, unpredictable agent workloads, sub‑second startup times and serverless scaling are critical.
- **Simplify heavy lifting.** Developers should focus on agent behavior, tool invocation, and workflow design instead of infrastructure plumbing. Using trusted cloud infrastructure, pain points like making sure agents run in isolated sandboxes, properly applying security policy to agent IDs, and ensuring secure connections to virtual networks are already solved. When you simplify the operational overhead, you make it easier for developers to focus on meaningful innovation.
- **Invest in visibility and monitoring.** Strong observability enables faster iteration, safer evolution, and continuous self‑correction for both humans and agents as systems adapt over time.

Want to learn more?

- Learn about building and hosting agents with Foundry Agent Service
- Discover agent runtime through Azure Container Apps
- Read about best practices for managing agents

# Building a Scalable Contract Data Extraction Pipeline with Microsoft Foundry and Python
## Architecture Overview

*Architecture diagram: Blob Storage triggers an Azure Function, which calls Document Intelligence, transforms the output, and stores the result in Cosmos DB.*

Flow:

1. Upload contract files (PDF or ZIP) to Azure Blob Storage
2. An Azure Function triggers automatically on file upload
3. Azure AI Document Intelligence extracts layout and tables
4. A transformation layer converts the output into a canonical JSON format
5. Data is stored in Azure Cosmos DB

### Step 1: Trigger Processing with Azure Functions

An Azure Function with a Blob trigger enables automatic processing when a file is uploaded.

```python
import io
import logging
import zipfile

import azure.functions as func


def main(myblob: func.InputStream):
    logging.info(f"Processing blob: {myblob.name}")
    if myblob.name.endswith(".zip"):
        with zipfile.ZipFile(io.BytesIO(myblob.read())) as z:
            for file_name in z.namelist():
                logging.info(f"Extracting {file_name}")
                file_data = z.read(file_name)
                # Pass file_data to the extraction step
```

**Best Practices**

- Keep functions stateless and idempotent
- Handle retries for transient failures
- Store configuration in environment variables

### Step 2: Extract Layout Using Document Intelligence

The prebuilt layout model helps extract tables, text, and structure from documents.

```python
from azure.ai.documentintelligence import DocumentIntelligenceClient
from azure.core.credentials import AzureKeyCredential

client = DocumentIntelligenceClient(
    endpoint="<your-endpoint>",
    credential=AzureKeyCredential("<your-key>")
)

poller = client.begin_analyze_document(
    "prebuilt-layout",
    document=file_data
)
result = poller.result()
```

**Output Includes**

- Structured tables
- Paragraphs and text blocks
- Bounding regions for layout context

### Step 3: Handle Multi-Page Table Continuity

Contract documents often contain tables split across multiple pages. These need to be merged to preserve data integrity.

```python
def merge_tables(tables):
    # extract_rows(table) is assumed to return the table's non-header rows
    # as lists of cell text, in reading order.
    merged = []
    current = None
    for table in tables:
        headers = [cell.content for cell in table.cells if cell.row_index == 0]
        if current and headers == current["headers"]:
            # Same headers as the previous table: treat it as a continuation.
            current["rows"].extend(extract_rows(table))
        else:
            if current:
                merged.append(current)
            current = {
                "headers": headers,
                "rows": extract_rows(table)
            }
    if current:
        merged.append(current)
    return merged
```

**Key Considerations**

- Match headers to detect continuation
- Preserve row order
- Avoid duplicate headers

### Step 4: Transform to a Canonical JSON Schema

A consistent schema ensures compatibility across downstream systems.

```json
{
  "id": "contract_123",
  "documentType": "contract",
  "vendorName": "ABC Corp",
  "invoiceDate": "2023-05-05",
  "tables": [
    {
      "name": "Line Items",
      "headers": ["Item", "Qty", "Price"],
      "rows": [
        ["Service A", "2", "100"]
      ]
    }
  ],
  "metadata": {
    "sourceFile": "contract.pdf",
    "processedAt": "2026-04-22T10:00:00Z"
  }
}
```

**Design Tips**

- Keep the schema flexible and extensible
- Include metadata for traceability
- Avoid excessive nesting
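The walkthrough jumps from the merged tables straight to the finished canonical document, so here is a rough sketch of what that transformation step might look like. The function name, parameters, and the assumed shape of the merged tables are illustrative, not code from the original solution.

```python
from datetime import datetime, timezone


def to_canonical(doc_id: str, merged_tables: list, source_file: str,
                 vendor_name: str = None, invoice_date: str = None) -> dict:
    """Map merged table output into the canonical JSON schema shown above (illustrative sketch)."""
    return {
        "id": doc_id,
        "documentType": "contract",
        "vendorName": vendor_name,
        "invoiceDate": invoice_date,
        "tables": [
            {
                # A friendlier table name could come from a nearby heading;
                # here the tables are simply numbered.
                "name": f"Table {i + 1}",
                "headers": table["headers"],
                "rows": table["rows"],
            }
            for i, table in enumerate(merged_tables)
        ],
        "metadata": {
            "sourceFile": source_file,
            "processedAt": datetime.now(timezone.utc).isoformat(),
        },
    }
```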
### Step 5: Persist Data in Cosmos DB

Store the transformed data in a scalable NoSQL database.

```python
from azure.cosmos import CosmosClient

client = CosmosClient("<cosmos-uri>", "<key>")
database = client.get_database_client("contracts-db")
container = database.get_container_client("documents")

container.upsert_item(canonical_json)
```

**Best Practices**

- Choose an appropriate partition key (for example, documentType or vendorName)
- Optimize indexing policies
- Monitor request unit (RU) usage

## Observability and Monitoring

To ensure reliability:

- Enable logging with Application Insights
- Track processing time and failures
- Monitor document extraction accuracy

## Security Considerations

- Store secrets securely using Azure Key Vault
- Use Managed Identity for service authentication
- Apply role-based access control (RBAC) to storage resources

## Conclusion

This approach provides a scalable and maintainable solution for contract data extraction:

- Event-driven processing with Azure Functions
- Accurate extraction using Document Intelligence
- Clean transformation into a reusable schema
- Efficient storage with Cosmos DB

This foundation can be extended with validation layers, review workflows, or analytics dashboards depending on your business requirements.

## Resources

- Contract data extraction – Document Intelligence: Foundry Tools | Microsoft Learn
- microsoft/content-processing-solution-accelerator: Programmatically extract data and apply schemas to unstructured documents across text-based and multi-modal content using Azure AI Foundry, Azure OpenAI, Azure AI Content Understanding, and Cosmos DB.

# Performance Tuning and Scaling Optimization for Large-Scale Azure Workloads
## Summary

As cloud-native systems scale, performance challenges rarely stem from a single bottleneck. Instead, they emerge from the interaction between compute, orchestration, and data layers under load. This article captures a practical optimization journey of a high-volume Azure-based workload and highlights how controlled scaling, improved orchestration design, and proactive database maintenance can significantly outperform brute-force scaling.

## Introduction

Distributed systems are often designed with the assumption that scaling out will solve performance issues. However, for orchestration-heavy and database-intensive workloads, this approach can introduce more problems than it solves.

In this scenario, the system processed millions of transactional records through Azure Functions, Durable Functions, messaging pipelines, APIs, and SQL databases. As the workload grew, the platform began experiencing:

- CPU and memory spikes
- Slower SQL queries
- Service Bus throttling
- Increased retries and execution delays

What stood out was that these issues were not due to insufficient resources, but due to inefficient execution patterns at scale. The optimization effort therefore focused on controlling how the system scaled and executed, rather than simply increasing capacity.

## Understanding Workload Behavior

A critical early step was identifying the nature of the workload — specifically, whether it was CPU-heavy or data-heavy.

## Rethinking Scaling: More Is Not Always Better

One of the most important lessons was that scaling out aggressively can degrade performance. As more function instances processed messages in parallel:

- Database calls increased sharply
- API traffic surged
- Lock contention intensified
- Retry rates increased

This created a cascading effect where retries amplified load, further slowing down the system. To address this, scaling was intentionally controlled using:

- Concurrency limits on function execution
- Batch-based processing instead of full parallel fan-out
- Small delays to smooth traffic spikes
- Chunking of large datasets into manageable units

This shift from maximum parallelism to controlled throughput significantly improved system stability (a short illustrative sketch of this pattern appears later in this section).

## Compute Optimization: CPU and Memory

After stabilizing scaling behavior, the next step was optimizing compute usage.

### CPU Optimization

CPU spikes were largely caused by excessive parallel execution and orchestration overhead. Improvements included:

- Breaking large workloads into smaller units
- Reducing unnecessary fan-outs of processes
- Limiting concurrent executions

This resulted in more predictable CPU usage and improved execution consistency.

### Memory Optimization

Memory pressure was primarily driven by large payloads and batch processing. Optimizations focused on:

- Processing data in smaller chunks
- Avoiding large in-memory payloads and memory leaks
- Reducing orchestration state size

These changes improved system reliability and reduced execution failures under load.

## Scaling Approaches: Practical Trade-Offs

Both vertical and horizontal scaling were used, but with careful consideration.

### Scale Up (Vertical Scaling)

- Quick to implement
- No architectural changes required
- Useful for immediate stabilization

However, it had cost and scalability limits.
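Before looking at horizontal scaling, here is a short, generic sketch of the controlled-throughput pattern described above: chunking a large dataset and capping concurrency with a semaphore. The chunk size, concurrency limit, and process_record function are placeholders rather than values from the system discussed in this article.

```python
import asyncio

CHUNK_SIZE = 100        # placeholder: size of each manageable unit of work
MAX_CONCURRENCY = 10    # placeholder: cap on in-flight calls to downstream services


async def process_record(record: dict) -> None:
    """Placeholder for the real work (database call, API call, message handling)."""
    await asyncio.sleep(0)  # simulate I/O


async def process_dataset(records: list) -> None:
    semaphore = asyncio.Semaphore(MAX_CONCURRENCY)

    async def limited(record: dict) -> None:
        # The semaphore caps parallelism, which smooths load on SQL and APIs.
        async with semaphore:
            await process_record(record)

    # Chunking keeps memory bounded and gives natural pause points between batches.
    for start in range(0, len(records), CHUNK_SIZE):
        chunk = records[start:start + CHUNK_SIZE]
        await asyncio.gather(*(limited(r) for r in chunk))
        await asyncio.sleep(0.1)  # small delay to smooth traffic spikes

# Example: asyncio.run(process_dataset(all_records))
```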
### Scale Out (Horizontal Scaling)

- Better suited for long-term scalability
- Enables workload distribution

But without control, it can:

- Increase database contention
- Amplify retries
- Introduce instability

### Key Insight

The most effective approach was not choosing one over the other but combining both with strict control over concurrency and execution patterns.

## Durable Functions: Orchestration Optimization

Durable Functions were central to the system, making orchestration design a key factor in performance.

### Challenges Observed

The initial design relied heavily on nested sub-orchestrators, which introduced:

- High orchestration overhead
- Increased replay and persistence operations
- Slower execution at scale

### Key Improvements

Refactoring unnecessary sub-orchestrators into Activity Functions simplified execution and improved throughput. The benefits included:

- Reduced orchestration latency
- Faster execution cycles
- Lower infrastructure cost

Note: Sub-orchestrators remain the right choice when the design requires composing multiple dependent steps, managing scoped retry/error logic, or isolating orchestration history. The decision should be driven by the complexity and reuse requirements of each workflow segment and not applied as a blanket rule.

### Improved Retry Strategy

Retry behavior was also optimized by redefining execution boundaries.

Previously:

- One activity processed multiple records
- A single failure triggered a retry of the entire batch

After optimization:

- One activity handled one logical unit of work

This enabled:

- Granular retries
- Better failure isolation
- Reduced duplicate processing

## Database Hygiene: A Critical Foundation

The database emerged as a major bottleneck due to fragmentation and stale statistics caused by continuous high-volume operations.

### Issues Identified

- Fragmented indexes
- Inefficient query plans
- Increased query execution time

### Optimization Approach

A proactive maintenance strategy was implemented using scheduled jobs to:

- Update statistics regularly
- Rebuild indexes
- Maintain query performance consistency

### Controlled Database Load

For heavy, long-running workloads in a multi-tenant architecture, database-intensive processes were intentionally run in a singleton fashion at the tenant level to reduce contention. This approach:

- Prevented concurrent heavy operations
- Improved overall system stability
- Delivered more predictable throughput

## Observability: Finding the Real Problem

A major challenge during optimization was distinguishing between symptoms and root causes. For example:

- Slow APIs were often caused by database contention
- High retries were triggered by upstream throttling
- Orchestration delays originated from downstream dependencies

To address this, end-to-end observability was established using:

- Application-level tracing
- Load testing correlations
- Cross-service telemetry analysis

This enabled accurate root cause identification and prevented misdirected optimization efforts.

## Key Takeaways

Some key principles emerged from this optimization journey:

- Scaling more does not always mean performing better
- Controlled parallelism is more effective than unrestricted concurrency
- Orchestration design directly impacts system performance
- Database maintenance must be proactive
- Retry strategies should align with logical units of work
- Observability is essential for correct diagnosis

## Conclusion

Performance tuning in distributed systems is less about adding resources and more about using them efficiently.
By focusing on controlled scaling, simplifying orchestration, maintaining database health, and improving observability, the system achieved higher throughput, lower cost, and significantly improved stability. These lessons are broadly applicable to any Azure-based system handling large-scale, orchestration-heavy workloads and can help teams design more predictable and resilient architectures.

# Azure SRE Agent for Azure Monitor Alerts: Reduce Alert Fatigue, Investigate What Matters
## The Alert Problem

Organizations running Azure Monitor tend to land in one of two situations:

1. **Alert fatigue has set in.** Alert rules tend to grow over time — a CPU threshold from two years ago, a health probe check from a migration, a disk alert from an outage that never got cleaned up. These rules fire regularly, most auto-resolve, and nobody investigates them. But buried in that noise are real incidents that go unnoticed until they escalate.
2. **Teams respond, but the effort is repetitive.** Engineers triage the same alerts repeatedly — running the same diagnostic queries, confirming the same "transient spike, no action needed" conclusion. They know the rule is noisy, but fixing it in Azure Monitor requires data they don't have readily available: What should the threshold be? What's the auto-resolution rate? Is it safe to change? So the noisy rule stays, and the manual toil continues.

Both situations share the same gap: there's no intelligent layer between Azure Monitor and the team. Azure SRE Agent fills that gap — it receives alert fires in real time, investigates them automatically, consolidates noisy ones, and surfaces the data your team needs to improve the rules at the source. Here's how to set it up.

## 1. Intelligent Alert Handling: Cooldown and Response Plan Configuration

### 1.a. Alert Reinvestigation Cooldown

The most impactful configuration for Azure Monitor alerts is the new reinvestigation cooldown. This is a per-response-plan setting that controls how the agent handles repeated fires of the same alert rule. When an alert rule fires and the agent already has an active thread for that rule, it merges the new fire into the existing thread — no new investigation, no duplicate work.

What makes this especially useful: if the previous thread was resolved or closed within the cooldown window, the agent reopens it and appends the new fire rather than starting a fresh investigation. This catches the common "it fired, we resolved it, it fired again 30 minutes later" pattern that generates the most duplicate effort.

To configure it: navigate to your Azure Monitor response plan and look for the "Alert reinvestigation cooldown" section in the Save step. It's enabled by default with a 3-hour window — a default chosen because most noisy alert rules re-fire within a 1–3 hour cycle, making this window broad enough to catch recurring patterns while short enough that a genuinely new issue several hours later still gets a fresh investigation.

To disable the cooldown entirely — for critical alerts where every fire demands a fresh investigation — uncheck the merge toggle.

You can adjust the window between 1 and 24 hours depending on the alert pattern:

| Alert Pattern | Recommended Window |
| --- | --- |
| Frequent polling-based alerts (health probes, heartbeats) | 1–2 hours |
| Recurring issues tied to daily batch jobs or deploy cycles | 6–12 hours |
| Intermittent failures with unpredictable recurrence | 12–24 hours |
| Critical alerts where every fire demands a fresh look | Disable the cooldown entirely |

### 1.b. Segmenting Alerts with Response Plans

The cooldown works best when paired with tiered response plans that route alerts by severity and title keyword. Rather than one catch-all plan for all alert types, create separate plans that match the right investigation depth to the right alerts.

- **Critical alerts** (Sev0–1, titles containing "failover", "security", "data loss") — disable the cooldown. Every fire gets a fresh investigation because a repeat fire here likely means the first remediation didn't hold.
- **Operational alerts** (Sev2, titles containing "high CPU", "memory pressure", "latency") — set a 6-hour cooldown. These are real issues, but recurring fires within a few hours are almost always the same root cause. The agent consolidates them into one thread while still giving a genuinely new occurrence later in the day a fresh look.
- **Low-priority alerts** (Sev3–4, titles containing "health probe", "availability test") — set a short 1-hour cooldown. These rarely require deep investigation. The agent captures context without spending effort on redundant analysis.
- **Informational alerts** — don't create a response plan at all. These are telemetry, not incidents.

This tiering works regardless of which agent mode (Autonomous or Review) your team uses. The value comes from the cooldown and severity segmentation — agent mode is a separate decision based on your team's comfort level with autonomous remediation.

To see the difference this makes in practice: we deployed a web app with Azure Monitor alert rules and induced real failures. Azure Monitor fired 9 alerts across three rule types over a few hours. The agent consolidated them based on each response plan's cooldown:

| Alert Rule | Response Plan | Merge Setting | AzMon Fires | Agent Threads | Total Alerts (in thread) | What Happened |
| --- | --- | --- | --- | --- | --- | --- |
| High Response Time (Sev3) | low-priority-alerts | Merge ON, 4h cooldown | 3 | 1 | 4 | All 4 fires merged into a single thread — the agent investigated once and appended recurring fires |
| HTTP 5xx Errors (Sev2) | critical-alerts-no-merge | Merge OFF | 3 | 3 | 1 each | Each fire created its own investigation — appropriate for critical alerts where every occurrence matters |
| High CPU (Sev2) | operational-alerts | Merge ON, 1h cooldown | 2 | 2 | 1 each | Fires were >1 hour apart (resolved at 12:05, re-fired at 3:37) — outside the cooldown window, so the agent correctly treated them as separate incidents |

The key insight: the same 9 Azure Monitor alerts produced different agent behavior depending on the response plan configuration. The High Response Time rule demonstrates the merge path saving 3 redundant investigations. The HTTP 5xx rule shows merge disabled for critical alerts. And the High CPU rule shows what happens when the cooldown window is too short for the alert's recurrence pattern — a signal to increase the window.

## 2. Proactive Noise Monitoring: Let the Agent Analyze Its Own Patterns

Handling alerts intelligently is the first step. The next is having the agent proactively surface insights about your alert landscape so your team can improve the rules at the source — which is the data that Category 2 teams in our intro are missing.

### 2.a. Weekly Alert Hygiene Report

Create a weekly scheduled task with instructions like:

```
Analyze all Azure Monitor alert threads from the past 7 days. For each alert rule that
fired more than 3 times, produce a ranked report covering:

- High Auto-Resolution Rules: Rules with high auto-resolution rates. Recommend threshold
  changes or suppression windows.
- Rules with Recurring Root Causes: Rules where the same root cause recurs. Recommend
  permanent remediation actions.
- Miscategorized Severity: Rules where investigation concludes low impact but the alert
  is Sev1/Sev2. Recommend severity adjustment.
- Cost Summary: Estimated effort consumed per alert rule this week.
```

This creates a compounding feedback loop. Week over week, your team has a concrete, data-backed list of which alert rules to adjust in Azure Monitor — complete with specific recommendations. The data that was too time-consuming to gather manually is now generated automatically.
### 2.b. Monthly Threshold Audit

For a deeper analysis, schedule a monthly task:

```
Audit Azure Monitor alert rules for this agent's subscriptions. For each rule:

- Query the rule's metric history over 30 days
- Compare the current threshold vs. actual P50, P90, and P99 values
- Flag rules with a threshold below P50 (always firing) or above P99 (never firing)
- For high-frequency rules with high auto-resolution, recommend a threshold at P95 to
  reduce fires while still catching genuine anomalies

Produce: a threshold optimization table, dormant rules (no fires in 30+ days), and
specific Azure CLI commands to update each rule.
```

This is the highest-leverage outcome because it fixes noise at the source. A single threshold adjustment on one noisy rule can eliminate hundreds of alert fires per month — permanently. And the agent provides the data and specific commands to make it happen.

## What This Means for Agent Costs

Each alert investigation consumes LLM tokens — for reasoning, querying, and building analysis. Without thoughtful configuration, a high-volume alert pipeline can lead to higher agent costs than expected. The setup described in this post naturally keeps token usage in check: the cooldown prevents redundant investigations, tiered response plans match effort to alert importance, and low-priority alerts get minimal attention.

For additional control, you can optionally add a PostToolUse hook that nudges the agent to include time-range filters in Log Analytics queries — preventing large, unbounded result sets from inflating the conversation context. Since this hook uses a simple regex check on the query text rather than an LLM call, it adds zero token cost of its own.

## Getting Started

1. Connect Azure Monitor as an incident source in your SRE Agent
2. Enable the reinvestigation cooldown on your response plans (the 3-hour default is a sensible starting point)
3. Create tiered response plans — at minimum, separate critical alerts (cooldown disabled) from operational alerts (cooldown 6h) and low-priority alerts (cooldown 1h)
4. Set up a weekly alert hygiene report as a scheduled task to start building visibility into your alert patterns
5. Add the monthly threshold audit once your weekly reports have a few weeks of data

Start with the first three — they take a few minutes each and begin working immediately.

## Learn More

- Incident Response Overview — How SRE Agent handles incidents across platforms including Azure Monitor
- Incident Response Plans — Configuring response plans, filters, severity routing, and cooldown settings
- Setting Up a Response Plan — Step-by-step tutorial for creating your first response plan
- Scheduled Tasks — Creating weekly and monthly automated reports
- Agent Hooks — PostToolUse hooks, command hooks, and governance controls
- Monitor Agent Usage — Tracking token usage and agent activity
- Getting Started with Incident Response — Connecting Azure Monitor and configuring your first alert pipeline

# Plugin Marketplace for Azure SRE Agent: Build Once, Install Anywhere
## What's a Plugin?

A plugin bundles two things:

- **Skills** — Operational knowledge (triage runbooks, policy rules, known issues) the agent reads at runtime to guide its reasoning
- **MCP Connectors** — Live integrations to your internal APIs (deployment tracker, cost dashboard, CMDB) the agent can query during an investigation

This is the key distinction: a plugin doesn't just tell the agent what your policies are — it gives the agent tools to query your internal systems and apply those policies with real data.

*A plugin bundles skills and MCP connectors as a single installable unit.*

## The Marketplace Model: Create Once, Install Everywhere

The marketplace is a GitHub repository with a marketplace.json manifest. Any team pushes their plugin to the repo. Every SRE Agent in the org can discover it and install it with one click — no need for each team to manually recreate skills and configure connectors.

How it works:

1. A specialist team creates a plugin (skills + MCP connector config) and pushes it to the shared GitHub repo
2. Any SRE Agent user browses the marketplace, sees what's available, and clicks Install
3. The plugin's skills and connectors are deployed to that agent instance instantly

Contoso runs multiple SRE Agent instances — payments team, platform team, data team. The same marketplace serves all of them. Each team installs exactly the plugins they need.

*One marketplace, many agents. Teams publish plugins once — every agent in the org can install them.*

## The Scenario: AKS Incident Investigation with Plugins

Contoso runs a payment processing service on AKS. Three teams have contributed plugins to the company's internal marketplace:

| Plugin | Team | Skills | MCP Connector |
| --- | --- | --- | --- |
| AKS Runbooks | K8s Platform Team | aks-incident-triage, aks-deployment-analysis | Deployment Tracker API |
| Cost & Capacity | Cloud FinOps Team | cost-analysis, capacity-planning | Cost Dashboard API |
| Service Catalog | SRE Leadership | service-ownership-lookup, dependency-impact-analysis | CMDB API |

All three are installed on the payments team's SRE Agent. Let's see what happens when an incident hits.

## Building and Publishing the Plugins

Each team creates their plugin independently and pushes it to the shared marketplace repo.

### 1. AKS Runbooks (Kubernetes Platform Team)

The K8s Platform Team packages their triage procedures, node pool naming conventions, PDB policies, known issues registry, and deployment gates.

Skills:

- **aks-incident-triage** — Per-symptom triage procedures (OOMKill, NodePressure, CrashLoop), PDB-first policy checks, Tier-0 escalation rules, and a known issues registry
- **aks-deployment-analysis** — Correlates incidents with recent deployments, surfaces resource spec diffs and gate violations, provides a rollback decision tree

MCP Connector:

- **contoso-deploy-tracker** — Exposes get_deployments: recent deployments by namespace with deployer, image versions, resource diffs, and gate status

### 2. Cost & Capacity (Cloud FinOps Team)

The FinOps team packages their SKU approval matrix, team budget allocations, chargeback model, and scaling governance.

Skills:

- **cost-analysis** — Team budget tiers, cost dashboard API usage, incident cost impact calculations
- **capacity-planning** — "Scale-out before scale-up" rule (CCP-001), SKU approval matrix (B-series = team lead, D-series = director, E/N-series = VP/CTO), auto-scale thresholds

MCP Connector:

- **contoso-cost-dashboard** — Exposes get_team_spend (budget, burn rate) and get_resources_cost (resource-level cost with utilization)
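To give a sense of what one of these connectors could look like under the hood, here is a minimal sketch of an MCP server exposing the two FinOps tools named above, written with the open-source MCP Python SDK. The tool names mirror the example, but the implementation, data, and server wiring are illustrative assumptions; the actual Contoso connector is not shown in this post.

```python
# Minimal MCP server sketch (assumes the `mcp` Python SDK is installed).
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("contoso-cost-dashboard")


@mcp.tool()
def get_team_spend(team: str) -> dict:
    """Return budget and burn rate for a team (placeholder data for illustration)."""
    # A real connector would call the internal cost dashboard API here.
    return {"team": team, "monthly_budget_usd": 5000, "burn_rate_pct": 62}


@mcp.tool()
def get_resources_cost(resource_group: str) -> list:
    """Return resource-level cost with utilization (placeholder data for illustration)."""
    return [{"resource": "aks-payments-prod-eastus2", "cost_usd": 1240, "cpu_utilization_pct": 34}]


if __name__ == "__main__":
    mcp.run()  # serves the tools over the Model Context Protocol
```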
### 3. Service Catalog (SRE Leadership)

SRE Leadership packages service ownership, SLA tiers, escalation paths, and the dependency graph.

Skills:

- **service-ownership-lookup** — Maps namespaces to owning teams, on-call contacts, SLA tiers (Tier-0 through Tier-3), escalation policies
- **dependency-impact-analysis** — Dependency classification (hard/soft/async), blast radius assessment, security implications

MCP Connector:

- **contoso-cmdb** — Exposes get_service_info (ownership, SLA), get_service_dependencies, and get_blast_radius

## The Marketplace Manifest

All three plugins are described in a single marketplace.json that the SRE Agent discovers:

```json
{
  "name": "Contoso SRE Plugins",
  "description": "Internal plugin marketplace for Contoso SRE teams",
  "version": "1.0.0",
  "plugins": [
    {
      "id": "aks-runbooks",
      "name": "AKS Runbooks",
      "description": "Kubernetes Platform Team's operational runbooks and deployment correlation",
      "author": "K8s Platform Team",
      "source": "./aks-runbooks",
      "category": "Operations"
    },
    {
      "id": "cost-capacity",
      "name": "Cost & Capacity",
      "description": "FinOps team's cost governance, SKU approval matrix, and capacity planning",
      "author": "Cloud FinOps Team",
      "source": "./cost-capacity",
      "category": "Cost Management"
    },
    {
      "id": "service-catalog",
      "name": "Service Catalog",
      "description": "Service ownership, SLA tiers, dependency graphs, and escalation paths",
      "author": "SRE Leadership",
      "source": "./service-catalog",
      "category": "Governance"
    }
  ]
}
```

*The internal plugin marketplace on GitHub. Each directory is a plugin contributed by a different team. The marketplace.json manifest tells the SRE Agent what's available.*

## Registering the Marketplace and Installing Plugins

### Step 1: Add the Marketplace

In the SRE Agent, navigate to Builder → Plugins → Browse and click "Add Marketplace". Enter the GitHub repository path (contoso/sre-agent-plugins) and click Add. The agent fetches marketplace.json and displays the marketplace card with all three plugins discovered.

*Adding the internal marketplace — just point to the GitHub repo.*

### Step 2: Browse the Catalog

The Browse tab now shows the Contoso SRE Plugins marketplace. Clicking into it reveals three plugin cards — one from each contributing team — with descriptions, skill counts, and connector details.

*Three plugin cards — AKS Runbooks, Cost & Capacity, Service Catalog — with author teams, skill counts, and install buttons. Three plugins from three teams. Each one brings skills (organizational knowledge) and an MCP connector (internal API access).*

### Step 3: Install All Three Plugins

Click into each plugin to review what it installs — skills and MCP connectors — then click "Install Plugin" for each one. After installing all three:

- **6 skills loaded** (2 per plugin) — organizational knowledge documents the agent reads at runtime
- **3 MCP connectors registered** — internal API integrations the agent can call as tools

*Each plugin clearly shows what it installs — skills and connectors — before you commit.*

*All three plugin cards with green "Installed" badges, green borders, and skill/connector counts. All three plugins installed — each card shows its "Installed" status, the authoring team, and exactly what it brings (2 skills, 1 connector). The green border and badge make installed plugins immediately recognizable.*

## The Agent in Action

Now let's ask the question: "Pods are crashing in payments-prod on aks-payments-prod-eastus2 in sre-marketplace-demo-rg. Investigate and give me a full incident report."

The agent investigates — combining its native Kubernetes capabilities with the organizational context from all three plugins.
*The same strong Kubernetes diagnosis, now enriched with organizational context. Deployment correlation, policy violations, cost governance, blast radius, and escalation paths — all layered on top of the agent's native investigation.*

## Why This Matters: Different Teams, One Agent

The K8s Platform Team writes triage procedures and known issues. The FinOps Team writes budget governance and SKU rules. SRE Leadership defines service ownership and escalation paths. Each team packages their domain expertise independently. The SRE Agent combines all of it at runtime — producing a response no single team could have written alone, drawn from three internal systems and three bodies of institutional knowledge.

This is how organizational knowledge scales: composable plugins that the agent reasons with in real time, not longer wikis that nobody reads.

## Learn More

- Plugin Marketplace overview — How the marketplace works, manifest formats, and MCP config support
- Tutorial: Install a marketplace plugin — Step-by-step walkthrough of adding a marketplace, browsing plugins, and importing skills
- Skills in Azure SRE Agent — How skills work, how the agent loads them at runtime, and how they relate to custom agents and knowledge files
- MCP connectors and tools — Connecting your agent to external systems via the Model Context Protocol
- Tutorial: Set up an MCP connector — Configuring remote and local MCP servers as agent connectors

# Announcing general availability for the Azure SRE Agent
Today, we're excited to announce the General Availability (GA) of Azure SRE Agent — your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

# Stop Experimenting, Start Building: AI Apps & Agents Dev Days Has You Covered
The AI landscape has shifted. The question is no longer "Can we build AI applications?" It's "Can we build AI applications that actually work in production?" Demos are easy. Reliable, scalable, resilient AI systems that handle real-world complexity? That's where most teams struggle. If you're an AI developer, software engineer, or solution architect who's ready to move beyond prototypes and into production-grade AI, there's a series built specifically for you.

## What Is AI Apps & Agents Dev Days?

AI Apps & Agents Dev Days is a monthly technical series from Microsoft Reactor, delivered in partnership with Microsoft and NVIDIA. You can explore the full series at https://developer.microsoft.com/en-us/reactor/series/s-1590/

This isn't a slide deck marathon. The series tagline says it best: "It's not about slides, it's about building." Each session tackles real-world challenges, shares patterns that actually work, and digs into what's next in AI-driven app and agent design. You bring your curiosity, your code, and your questions. You leave with something you can ship.

The sessions are led by experienced engineers and advocates from both Microsoft and NVIDIA, people like Pamela Fox, Bruno Capuano, Anthony Shaw, Gwyneth Peña-Siguenza, and solutions architects from NVIDIA's Cloud AI team. These aren't theorists; they're practitioners who build and ship the tools you use every day.

## What You'll Learn

The series covers the full spectrum of building AI applications and agent-based systems. Here are the key themes:

### Building AI Applications with Azure, GitHub, and Modern Tooling

Sessions walk through how to wire up AI capabilities using Azure services, GitHub workflows, and the latest SDKs. The focus is always on code-first learning: you'll see real implementations, not abstract architecture diagrams.

### Designing and Orchestrating AI Agents

Agent development is one of the series' strongest threads. Sessions cover how to build agents that orchestrate long-running workflows, persist state automatically, recover from failures, and pause for human-in-the-loop input, without losing progress. For example, the session "AI Agents That Don't Break Under Pressure" demonstrates building durable, production-ready AI agents using the Microsoft Agent Framework, running on Azure Container Apps with NVIDIA serverless GPUs.

### Scaling LLM Inference and Deploying to Production

Moving from a working prototype to a production deployment means grappling with inference performance, GPU infrastructure, and cost management. The series covers how to leverage NVIDIA GPU infrastructure alongside Azure services to scale inference effectively, including patterns for serverless GPU compute.

### Real-World Architecture Patterns

Expect sessions on container-based deployments, distributed agent systems, and enterprise-grade architectures. You'll learn how to use services like Azure Container Apps to host resilient AI workloads, how Foundry IQ fits into agent architectures as a trusted knowledge source, and how to make architectural decisions that balance performance, cost, and scalability.

## Why This Matters for Your Day Job

There's a critical gap between what most AI tutorials teach and what production systems actually require. This series bridges that gap:

- **Production-ready patterns, not demos.** Every session focuses on code and architecture you can take directly into your projects. You'll learn patterns for state persistence, failure recovery, and durable execution — the things that break at 2 AM.
- **Enterprise applicability.** The scenarios covered — travel planning agents, multi-step workflows, GPU-accelerated inference — map directly to enterprise use cases. Whether you're building internal tooling or customer-facing AI features, the patterns transfer.
- **Honest trade-off discussions.** The speakers don't shy away from the hard questions: When do you need serverless GPUs versus dedicated compute? How do you handle agent failures gracefully? What does it actually cost to run these systems at scale?

## Watch On-Demand, Build at Your Own Pace

Every session is available on-demand. You can watch, pause, and build along at your own pace; no need to rearrange your schedule. The full playlist is available from the series page linked above.

This is particularly valuable for technical content. Pause a session while you replicate the architecture in your own environment. Rewind when you need to catch a configuration detail. Build alongside the presenters rather than just watching passively.

## What You'll Walk Away With

After working through the series, you'll have:

- **Practical agent development skills** — how to design, orchestrate, and deploy AI agents that handle real-world complexity, including state management, failure recovery, and human-in-the-loop patterns
- **Production architecture patterns** — battle-tested approaches for deploying AI workloads on Azure Container Apps, leveraging NVIDIA GPU infrastructure, and building resilient distributed systems
- **Infrastructure decision-making confidence** — a clearer understanding of when to use serverless GPUs, how to optimise inference costs, and how to choose the right compute strategy for your workload
- **Working code and reference implementations** — the sessions are built around live coding and sample applications (like the Travel Planner agent demo), giving you starting points you can adapt immediately
- **A framework for continuous learning** — with new sessions each month, you'll stay current as the AI platform evolves and new capabilities emerge

## Start Building

The AI applications that will matter most aren't the ones with the flashiest demos — they're the ones that work reliably, scale gracefully, and solve real problems. That's exactly what this series helps you build. Whether you're designing your first AI agent system or hardening an existing one for production, the AI Apps & Agents Dev Days sessions give you the patterns, tools, and practical knowledge to move forward with confidence.

Explore the series at https://developer.microsoft.com/en-us/reactor/series/s-1590/ and start watching the on-demand sessions at the link above. The best time to level up your AI engineering skills was yesterday. The second-best time is right now, and these sessions make it easy to start.

# Event-Driven IaC Operations with Azure SRE Agent: Terraform Drift Detection via HTTP Triggers
## What Happens After terraform plan Finds Drift?

If your team is like most, the answer looks something like this:

1. A nightly terraform plan runs and finds 3 drifted resources
2. A notification lands in Slack or Teams
3. Someone files a ticket
4. During the next sprint, an engineer opens 4 browser tabs — Terraform state, Azure Portal, Activity Log, Application Insights — and spends 30 minutes piecing together what happened
5. They discover the drift was caused by an on-call engineer who scaled up the App Service during a latency incident at 2 AM
6. They revert the drift with terraform apply
7. The app goes down because they just scaled it back down while the bug that caused the incident is still deployed

Step 7 is the one nobody talks about. Drift detection tooling has gotten remarkably good — scheduled plans, speculative runs, drift alerts — but the output is always the same: a list of differences. What changed. Not why. Not whether it's safe to fix. The gap isn't detection. It's everything that happens after detection.

HTTP Triggers in Azure SRE Agent close that gap. They turn the structured output that drift detection already produces — webhook payloads, plan summaries, run notifications — into the starting point of an autonomous investigation. Detection feeds the agent. The agent does the rest: correlates with incidents, reads source code, classifies severity, recommends context-aware remediation, notifies the team, and even ships a fix. Here's what that looks like end to end.

What you'll see in this blog:

- An agent that classifies drift as Benign, Risky, or Critical — not just "changed"
- Incident correlation that links a SKU change to a latency spike in Application Insights
- A remediation recommendation that says "Do NOT revert" — and why reverting would cause an outage
- A Teams notification with the full investigation summary
- An agent that reviews its own performance, finds gaps, and improves its own skill file
- A pull request the agent created on its own to fix the root cause

## The Pipeline: Detection to Resolution in One Webhook

The architecture is straightforward. Terraform Cloud (or any drift detection tool) sends a webhook when it finds drift. An Azure Logic App adds authentication. The SRE Agent's HTTP Trigger receives it and starts an autonomous investigation.

*The end-to-end pipeline: Terraform Cloud detects drift and sends a webhook. The Logic App adds Azure AD authentication via Managed Identity. The SRE Agent's HTTP Trigger fires and the agent autonomously investigates across 7 dimensions.*

## Setting Up the Pipeline

### Step 1: Deploy the Infrastructure with Terraform

We start with a simple Azure App Service running a Node.js application, deployed via Terraform. The Terraform configuration defines the desired state:

- App Service Plan: B1 (Basic) — single vCPU, ~$13/mo
- App Service: Node 20-lts with TLS 1.2
- Tags: environment: demo, managed_by: terraform, project: sre-agent-iac-blog

```hcl
resource "azurerm_service_plan" "demo" {
  name                = "iacdemo-plan"
  resource_group_name = azurerm_resource_group.demo.name
  location            = azurerm_resource_group.demo.location
  os_type             = "Linux"
  sku_name            = "B1"
}
```

A Logic App is also deployed to act as the authentication bridge between Terraform Cloud webhooks and the SRE Agent's HTTP Trigger endpoint, using Managed Identity to acquire Azure AD tokens. Learn more about HTTP Triggers here.

### Step 2: Create the Drift Analysis Skill

Skills are domain knowledge files that teach the agent how to approach a problem.
We create a terraform-drift-analysis skill with an 8-step workflow:

1. **Identify Scope** — Which resource group and resources to check
2. **Detect Drift** — Compare Terraform config against Azure reality
3. **Correlate with Incidents** — Check Activity Log and App Insights
4. **Classify Severity** — Benign, Risky, or Critical
5. **Investigate Root Cause** — Read source code from the connected repository
6. **Generate Drift Report** — Structured summary with severity-coded table
7. **Recommend Smart Remediation** — Context-aware: don't blindly revert
8. **Notify Team** — Post findings to Microsoft Teams

The key insight in the skill: "NEVER revert critical drift that is actively mitigating an incident." This teaches the agent to think like an experienced SRE, not just a diff tool.

### Step 3: Create the HTTP Trigger

In the SRE Agent UI, we create an HTTP Trigger named tfc-drift-handler with a 7-step agent prompt:

```
A Terraform Cloud run has completed and detected infrastructure drift.

Workspace: {payload.workspace_name}
Organization: {payload.organization_name}
Run ID: {payload.run_id}
Run Message: {payload.run_message}

STEP 1 — DETECT DRIFT: Compare Terraform configuration against actual Azure state...
STEP 2 — CORRELATE WITH INCIDENTS: Check Azure Activity Log and App Insights...
STEP 3 — CLASSIFY SEVERITY: Rate each drift item as Benign, Risky, or Critical...
STEP 4 — INVESTIGATE ROOT CAUSE: Read the application source code...
STEP 5 — GENERATE DRIFT REPORT: Produce a structured summary...
STEP 6 — RECOMMEND SMART REMEDIATION: Context-aware recommendations...
STEP 7 — NOTIFY TEAM: Post a summary to Microsoft Teams...
```

### Step 4: Connect GitHub and Teams

We connect two integrations in the SRE Agent Connectors settings:

- **Code Repository: GitHub** — so the agent can read application source code during investigations
- **Notification: Microsoft Teams** — so the agent can post drift reports to the team channel

## The Incident Story

### Act 1: The Latency Bug

Our demo app has a subtle but devastating bug. The /api/data endpoint calls processLargeDatasetSync() — a function that sorts an array on every iteration, creating an O(n² log n) blocking operation. On a B1 App Service Plan (single vCPU), this blocks the Node.js event loop entirely. Under load, response times spike from milliseconds to 25–58 seconds, with 502 Bad Gateway errors from the Azure load balancer.

### Act 2: The On-Call Response

An on-call engineer sees the latency alerts and responds — not through Terraform, but directly through the Azure Portal and CLI. They:

1. Add diagnostic tags — manual_update=True, changed_by=portal_user (benign)
2. Downgrade TLS from 1.2 to 1.0 while troubleshooting (risky — security regression)
3. Scale the App Service Plan from B1 to S1 to throw more compute at the problem (critical — cost increase from ~$13/mo to ~$73/mo)

The incident is partially mitigated — S1 has more compute, so latency drops from catastrophic to merely bad. Everyone goes back to sleep. Nobody updates Terraform.

### Act 3: The Drift Check Fires

The next morning, a nightly speculative Terraform plan runs and detects 3 drifted attributes. The notification webhook fires, flowing through the Logic App auth bridge to the SRE Agent HTTP Trigger. The agent wakes up and begins its investigation.
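(A side note for anyone rebuilding this pipeline: the Logic App in the middle of that flow exists purely to add authentication. Conceptually it does what the Python sketch below does: acquire an Azure AD token with a Managed Identity and forward the webhook payload. The trigger URL and token scope are placeholders, and the demo itself uses a Logic App rather than custom code.)

```python
import requests
from azure.identity import ManagedIdentityCredential

# Placeholders: the HTTP Trigger endpoint of your SRE Agent and the token scope it expects.
SRE_AGENT_TRIGGER_URL = "https://<your-sre-agent-http-trigger-endpoint>"
TOKEN_SCOPE = "<agent-app-id-uri>/.default"


def forward_drift_webhook(payload: dict) -> int:
    """Acquire a Managed Identity token and relay the Terraform Cloud payload (illustrative sketch)."""
    token = ManagedIdentityCredential().get_token(TOKEN_SCOPE).token
    response = requests.post(
        SRE_AGENT_TRIGGER_URL,
        json=payload,
        headers={"Authorization": f"Bearer {token}"},
        timeout=30,
    )
    return response.status_code
```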
## What the Agent Found

### Layer 1: Drift Detection

The agent compares Terraform configuration against Azure reality and produces a severity-classified drift report. Three drift items detected:

- **Critical:** App Service Plan SKU changed from B1 (~$13/mo) to S1 (~$73/mo) — a +462% cost increase
- **Risky:** Minimum TLS version downgraded from 1.2 to 1.0 — a security regression vulnerable to BEAST and POODLE attacks
- **Benign:** Additional tags (changed_by: portal_user, manual_update: True) — cosmetic, no functional impact

### Layer 2: Incident Correlation

Here's where the agent goes beyond simple drift detection. It queries Application Insights and discovers a performance incident correlated with the SKU change. Key findings from the incident correlation:

- 97.6% of requests (40 of 41) were impacted by high latency
- The /api/data endpoint does not exist in the repository source code — the deployed application has diverged from the codebase
- The endpoint likely contains a blocking synchronous pattern — Node.js runs on a single event loop, and any synchronous blocking call would explain 26–58s response times
- The SKU scale-up from B1→S1 was an attempt to mitigate latency by adding more compute, but scaling cannot fix application-level blocking code on a single-threaded Node.js server

### Layer 3: Smart Remediation

This is the insight that separates an autonomous agent from a reporting tool. Instead of blindly recommending "revert all drift," the agent produces context-aware remediation recommendations:

- **Tags (Benign)** → Safe to revert anytime via terraform apply -target
- **TLS 1.0 (Risky)** → Revert immediately — the TLS downgrade is a security risk unrelated to the incident
- **SKU S1 (Critical)** → DO NOT revert until the /api/data performance root cause is fixed

This is the logic an experienced SRE would apply. Blindly running terraform apply to revert all drift would scale the app back down to B1 while the blocking code is still deployed — turning a mitigated incident into an active outage.

### Layer 4: Investigation Summary

The agent produces a complete summary tying everything together. Key findings in the summary:

- **Actor:** surivineela@microsoft.com made all changes via Azure Portal at ~23:19 UTC
- **Performance incident:** /api/data averaging 25–57s latency, affecting 97.6% of requests
- **Code-infrastructure mismatch:** /api/data exists in production but not in the repository source code
- **Root cause:** The SKU scale-up was emergency incident response, not unauthorized drift

### Layer 5: Teams Notification

The agent posts a structured drift report to the team's Microsoft Teams channel. The on-call engineer opens Teams in the morning and sees everything they need: what drifted, why it drifted, and exactly what to do about it — without logging into any dashboard.

## The Payoff: A Self-Improving Agent

Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for.
The Payoff: A Self-Improving Agent

Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for.

The Agent Improved Its Own Skill

The agent performed an Execution Review — analyzing what worked and what didn't during its investigation — and found 5 gaps in its own terraform-drift-analysis.md skill file.

What worked well:

Drift detection via az CLI comparison against Terraform HCL was straightforward
Activity Log correlation identified the actor and timing
Application Insights telemetry revealed the performance incident driving the SKU change

Gaps it found and fixed:

No incident correlation guidance — the skill didn't instruct checking App Insights
No code-infrastructure mismatch detection — no guidance to verify deployed code matches the repository
No smart remediation logic — didn't warn against reverting critical drift during active incidents
Report template missing incident correlation column
No Activity Log integration guidance — didn't instruct checking who made changes and when

The agent then edited its own skill file to incorporate these learnings. Next time it runs a drift analysis, it will include incident correlation, code-infra mismatch checks, and smart remediation logic by default. This is a learning loop — every investigation makes the agent better at future investigations.

The Agent Created a PR

Without being asked, the agent identified the root cause code issue and proactively created a pull request to fix it. The PR includes:

App safety fixes: Adding MAX_DELAY_MS and SERVER_TIMEOUT_MS constants to prevent unbounded latency
Skill improvements: Incorporating incident correlation, code-infra mismatch detection, and smart remediation logic

From a single webhook: drift detected → incident correlated → root cause found → team notified → skill improved → fix shipped.

Key Takeaways

Drift detection is not enough. Knowing that B1 changed to S1 is table stakes. Knowing it changed because of a latency incident, and that reverting it would cause an outage — that's the insight that matters.

Context-aware remediation prevents outages. Blindly running terraform apply after drift would have scaled the app back to B1 while blocking code was still deployed. The agent's "DO NOT revert SKU" recommendation is the difference between fixing drift and causing a P1.

Skills create a learning loop. The agent's self-review and skill improvement means every investigation makes the next one better — without human intervention.

HTTP Triggers connect any platform. The auth bridge pattern (Logic App + Managed Identity) works for Terraform Cloud, but the same architecture applies to any webhook source: GitHub Actions, Jenkins, Datadog, PagerDuty, custom internal tools.

The agent acts, not just reports. From a single webhook: drift detected, incident correlated, root cause identified, team notified via Teams, skill improved, and PR created. End-to-end in one autonomous session.
Getting Started

HTTP Triggers are available now in Azure SRE Agent:

1. Create a Skill — Teach the agent your operational runbook (in this case, drift analysis with severity classification and smart remediation)
2. Create an HTTP Trigger — Define your agent prompt with {payload.X} placeholders and connect it to a skill
3. Set Up an Auth Bridge — Deploy a Logic App with Managed Identity to handle Azure AD token acquisition
4. Connect Your Source — Point Terraform Cloud (or any webhook-capable platform) at the Logic App URL
5. Connect GitHub + Teams — Give the agent access to source code and team notifications

Within minutes, you'll have an autonomous pipeline that turns infrastructure drift events into fully contextualized investigations — with incident correlation, root cause analysis, and smart remediation recommendations.

The full implementation guide, Terraform files, skill definitions, and demo scripts are available in this repository.

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself.

Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis.

First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives.

The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline.

That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent.

Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible.

Where we started

In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction.

But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start.

We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived.

We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale.

The Inversion: Three bets

The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability.
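As an aside, "prefix stability" simply means that the tokens at the front of every request stay identical across calls, so the provider's prompt cache can be reused. A minimal sketch of the idea, with hypothetical prompt assembly code rather than our actual pipeline:

```typescript
// Illustrative only; the field names and structure are assumptions.
interface PromptParts {
  systemPrompt: string;     // static: identical on every call
  skills: string[];         // changes rarely
  incidentContext: string;  // dynamic: different for every investigation
}

// Cache-friendly: static parts come first, so every call shares a common
// prefix and the provider's prompt cache keeps being hit.
function buildPromptStable(p: PromptParts): string {
  return [p.systemPrompt, ...p.skills, p.incidentContext].join("\n\n");
}

// The regression pattern: putting dynamic content ahead of static content
// (or reordering static parts per call) breaks the shared prefix, so every
// request misses the cache even though most of its tokens are unchanged.
function buildPromptUnstable(p: PromptParts): string {
  return [p.incidentContext, p.systemPrompt, ...p.skills].join("\n\n");
}
```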
We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself.

The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios.

Three architectural decisions made this possible – and each one compounded on the last.

Bet 1: The Filesystem as the Agent's World

Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint.

This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it.

When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents.

But interface design is only half the story. The other half is what you put inside it.

Code Repositories: the highest-leverage context

Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding.

The repo is the schema. Everything else is derived from it. When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible.

But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

Ground truth over documentation.
Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation.

Point-in-time investigation.
The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.

Reasoning even where telemetry is absent.
Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out.

But the deeper problem was retrieval. In an SRE context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance.

We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface. The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed. (A small sketch of this navigation pattern follows at the end of this section.)

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism.

Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with the gh CLI, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed. No bespoke integration required, just a shell.

The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.
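Here is the small sketch promised above, showing the navigation pattern from "Memory as a filesystem." The file names come from the post; the traversal code itself is illustrative and assumes a hypothetical workspace path, not our framework's real layout or APIs.

```typescript
import { existsSync, readFileSync } from "node:fs";
import { join } from "node:path";

// Read one memory file and return the relative .md links it mentions, so the
// model can decide which deeper files are worth opening next.
function readMemoryEntry(memoryRoot: string, name: string) {
  const path = join(memoryRoot, name);
  if (!existsSync(path)) return { text: "", links: [] as string[] };
  const text = readFileSync(path, "utf8");
  const links = [...text.matchAll(/\(([\w./-]+\.md)\)/g)].map((m) => m[1]);
  return { text, links };
}

// Navigation, not retrieval: start at the structured entry point and follow
// links only when the current hypothesis calls for it.
const root = "/workspace/memory/my-service"; // hypothetical path
const overview = readMemoryEntry(root, "overview.md");
for (const link of overview.links) {
  const entry = readMemoryEntry(root, link); // e.g. debugging.md or logs.md
  console.log(`${link}: ${entry.text.length} bytes of prior context`);
}
```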
Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it.

We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast. Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal.

Tool result compression via the filesystem

Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive.

Context Pruning and Auto Compact

Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies.

Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters.

Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work.

Parallel subagents

The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another.
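The fan-out and merge shape of that investigation is easy to picture in code. A minimal sketch: spawnSubagent is not a real API, it stands in for whatever launches an isolated sub-investigation in its own context window.

```typescript
interface SubagentResult {
  hypothesis: string;
  findings: string;
}

// Stub for the sketch: in the real system this would launch an isolated agent
// session with its own context window and tools.
async function spawnSubagent(task: string): Promise<SubagentResult> {
  return { hypothesis: task, findings: "(subagent findings would go here)" };
}

async function investigateCacheRegression(): Promise<string> {
  // Fan out: each hypothesis is explored in isolation, so neither one biases
  // the other and neither fills the parent's context with raw tool output.
  const [alertCheck, cacheCheck] = await Promise.all([
    spawnSubagent("Is the KV cache alert definition sound for every model we call?"),
    spawnSubagent("Has Opus cache hit rate actually regressed? Correlate with deployments."),
  ]);

  // Merge: only the conclusions flow back into the main investigation.
  return [alertCheck, cacheCheck]
    .map((r) => `${r.hypothesis}\n${r.findings}`)
    .join("\n\n");
}

investigateCacheRegression().then(console.log);
```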
The Feedback loop

These architectural bets have enabled us to close the original scaling gap. Instead of debugging the agent at human speed, we could finally start using it to fix itself.

As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts caused investigations to stall midway, and some conversations broke entirely. So we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

Analyzed our user churn rate and built dashboards we now review weekly.
Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
Ran security analysis and found vulnerabilities in the read path.
Helped fill out parts of its own Responsible AI review, with strict human review.
Handles customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem.

The inversion was simple but hard to accept: stop pre-computing the answer space. Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations.

The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one.

We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one.

Thanks to visagarwal for co-authoring this post.