serverless
The Durable Task Scheduler Consumption SKU is now Generally Available
Today, we're excited to announce that the Durable Task Scheduler Consumption SKU has reached General Availability. Developers can now run durable workflows and agents on Azure with pay-per-use pricing, no storage to manage, no capacity to plan, and no idle costs. Just create a scheduler, connect your app, and start orchestrating. Whether you're coordinating AI agent workflows, processing event-driven pipelines, or running background jobs, the Consumption SKU is ready to go. GET STARTED WITH THE DURABLE TASK SCHEDULER CONSUMPTION SKU Since launching the Consumption SKU in public preview last November, we've seen incredible adoption and have incorporated feedback from developers around the world to ensure the GA release is truly production ready. “The Durable Task Scheduler has become a foundational piece of what we call ‘workflows’. It gives us the reliability guarantees we need for processing financial documents and sensitive workflows, while keeping the programming model straightforward. The combination of durable execution, external event correlation, deterministic idempotency, and the local emulator experience has made it a natural fit for our event-driven architecture. We have been delighted with the consumption SKUs cost model for our lower environments.”– Emily Lewis, CarMax What is the Durable Task Scheduler? If you're new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background: Announcing Limited Early Access of the Durable Task Scheduler Announcing Workflow in Azure Container Apps with the Durable Task Scheduler Announcing Dedicated SKU GA & Consumption SKU Public Preview In brief, the Durable Task Scheduler is a fully managed orchestration backend for durable execution on Azure, meaning your workflows and agent sessions can reliably resume and run to completion, even through process failures, restarts, and scaling events. 
Whether you’re running workflows or orchestrating durable agents, it handles task scheduling, state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing their own execution engines and storage backends. The Durable Task Scheduler works across Azure compute environments: Azure Functions: Using the Durable Functions extension across all Function App SKUs, including Flex Consumption. Azure Container Apps: Using the Durable Functions or Durable Task SDKs with built-in workflow support and auto-scaling. Any compute: Azure Kubernetes Service, Azure App Service, or any environment where you can run the Durable Task SDKs (.NET, Python, Java, JavaScript). Why choose the Consumption SKU? With the Consumption SKU you’re charged only for actions dispatched, with no minimum commitments or idle costs. There’s no capacity to size or throughput to reserve. Create a scheduler, connect your app, and you’re running. The Consumption SKU is a natural fit for workloads with unpredictable or bursty usage patterns: AI agent orchestration: Multi-step agent workflows that call LLMs, retrieve data, and take actions. Users trigger these on demand, so volume is spiky and pay-per-use avoids idle costs between bursts. Event-driven pipelines: Processing events from queues, webhooks, or streams with reliable orchestration and automatic checkpointing, where volumes spike and dip unpredictably. API-triggered workflows: User signups, form submissions, payment flows, and other request-driven processing where volume varies throughout the day. Distributed transactions: Retries and compensation logic across microservices with durable sagas that survive failures and restarts. What's included in the Consumption SKU at GA The Consumption SKU has been hardened based on feedback and real-world usage during the public preview. 
Here's what's included at GA:

Performance
Up to 500 actions per second: Sufficient throughput for a wide range of workloads, with the option to move to the Dedicated SKU for higher-scale scenarios.
Up to 30 days of data retention: View and manage orchestration history, debug failures, and audit execution data for up to 30 days.

Built-in monitoring dashboard
Filter orchestrations by status, drill into execution history, view visual Gantt and sequence charts, and manage orchestrations (pause, resume, terminate, or raise events), all from the dashboard, secured with Role-Based Access Control (RBAC).

Identity-based security
The Consumption SKU uses Entra ID for authentication and RBAC for authorization. There are no SAS tokens or access keys to manage: just assign the appropriate role and connect.

Get started with the Durable Task Scheduler today
The Consumption SKU is now Generally Available. Provision a scheduler in the Azure portal, connect your app, and start orchestrating. You only pay for what you use.

Documentation, Getting started, Samples, Pricing, Consumption SKU docs

We'd love to hear your feedback. Reach out to us by filing an issue on our GitHub repository.

Building the agentic future together at JDConf 2026
JDConf 2026 is just weeks away, and I’m excited to welcome Java developers, architects, and engineering leaders from around the world for two days of learning and connection. Now in its sixth year, JDConf has become a place where the Java community compares notes on their real-world production experience: patterns, tooling, and hard-earned lessons you can take back to your team, while we keep moving the Java systems that run businesses and services forward in the AI era. This year’s program lines up with a shift many of us are seeing first-hand: delivery is getting more intelligent, more automated, and more tightly coupled to the systems and data we already own. Agentic approaches are moving from demos to backlog items, and that raises practical questions: what’s the right architecture, where do you draw trust boundaries, how do you keep secrets safe, and how do you ship without trading reliability for novelty? JDConf is for and by the people who build and manage the mission-critical apps powering organizations worldwide. Across three regional livestreams, you’ll hear from open source and enterprise practitioners who are making the same tradeoffs you are—velocity vs. safety, modernization vs. continuity, experimentation vs. operational excellence. Expect sessions that go beyond “what” and get into “how”: design choices, integration patterns, migration steps, and the guardrails that make AI features safe to run in production. You’ll find several practical themes for shipping Java in the AI era: connecting agents to enterprise systems with clear governance; frameworks and runtimes adapting to AI-native workloads; and how testing and delivery pipelines evolve as automation gets more capable. 
To make this more concrete, a sampling of sessions would include topics like Secrets of Agentic Memory Management (patterns for short- and long-term memory and safe retrieval), Modernizing a Java App with GitHub Copilot (end-to-end upgrade and migration with AI-powered technologies), and Docker Sandboxes for AI Agents (guardrails for running agent workflows without risking your filesystem or secrets). The goal is to help you adopt what’s new while hardening your long lived codebases. JDConf is built for community learning—free to attend, accessible worldwide, and designed for an interactive live experience in three time zones. You’ll not only get 23 practitioner-led sessions with production-ready guidance but also free on-demand access after the event to re-watch with your whole team. Pro tip: join live and get more value by discussing practical implications and ideas with your peers in the chat. This is where the “how” details and tradeoffs become clearer. JDConf 2026 Keynote Building the Agentic Future Together Rod Johnson, Embabel | Bruno Borges, Microsoft | Ayan Gupta, Microsoft The JDConf 2026 keynote features Rod Johnson, creator of the Spring Framework and founder of Embabel, joined by Bruno Borges and Ayan Gupta to explore where the Java ecosystem is headed in the agentic era. Expect a practitioner-level discussion on how frameworks like Spring continue to evolve, how MCP is changing the way agents interact with enterprise systems, and what Java developers should be paying attention to right now. Register. Attend. Earn. Register for JDConf 2026 to earn Microsoft Rewards points, which you can use for gift cards, sweepstakes entries, and more. Earn 1,000 points simply by signing up. When you register for any regional JDConf 2026 event with your Microsoft account, you'll automatically receive these points. Get 5,000 additional points for attending live (limited to the first 300 attendees per stream). 
On the day of your regional event, check in through the Reactor page or your email confirmation link to qualify. Disclaimer: Points are added to your Microsoft account within 60 days after the event. Must register with a Microsoft account email. Up to 10,000 developers eligible. Points will be applied upon registration and attendance and will not be counted multiple times for registering or attending at different events. Terms | Privacy JDConf 2026 Regional Live Streams Americas – April 8, 8:30 AM – 12:30 PM PDT (UTC -7) Bruno Borges hosts the Americas stream, discussing practical agentic Java topics like memory management, multi-agent system design, LLM integration, modernization with AI, and dependency security. Experts from Redis, IBM, Hammerspace, HeroDevs, AI Collective, Tekskills, and Microsoft share their insights. Register for Americas → Asia-Pacific – April 9, 10:00 AM – 2:00 PM SGT (UTC +8) Brian Benz and Ayan Gupta co-host the APAC stream, highlighting Java frameworks and practices for agentic delivery. Topics include Spring AI, multi-agent orchestration, spec-driven development, scalable DevOps, and legacy modernization, with speakers from Broadcom, Alibaba, CERN, MHP (A Porsche Company), and Microsoft. Register for Asia-Pacific → Europe, Middle East and Africa – April 9, 9:00 AM – 12:30 PM GMT (UTC +0) The EMEA stream, hosted by Sandra Ahlgrimm, will address the implementation of agentic Java in production environments. Topics include self-improving systems utilizing Spring AI, Docker sandboxes for agent workflow management, Retrieval-Augmented Generation (RAG) pipelines, modernization initiatives from a national tax authority, and AI-driven CI/CD enhancements. Presentations will feature experts from Broadcom, Docker, Elastic, Azul Systems, IBM, Team Rockstars IT, and Microsoft. 
Register for EMEA →

Make It Interactive: Join Live
Come prepared with an actual challenge you’re facing, whether you’re modernizing a legacy application, connecting agents to internal APIs, or refining CI/CD processes. Test your strategies by participating in live chats and Q&As with presenters and fellow professionals. If you’re attending with your team, schedule a debrief after the live stream to discuss how to quickly use key takeaways and insights in your pilots and projects.

Learning Resources
Java and AI for Beginners Video Series: Practical, episode-based walkthroughs on MCP, GenAI integration, and building AI-powered apps from scratch.
Modernize Java Apps Guide: Step-by-step guide using GitHub Copilot agent mode for legacy Java project upgrades, automated fixes, and cloud-ready migrations.
AI Agents for Java Webinar: Embedding AI Agent capabilities into Java applications using Microsoft Foundry, from project setup to production deployment.
Java Practitioner’s Guide: Learning plan for deploying, managing, and optimizing Java applications on Azure using modern cloud-native approaches.

Register Now
JDConf 2026 is a free global event for Java teams. Join live to ask questions, connect, and gain practical patterns. All 23 sessions will be available on-demand. Register now to earn Microsoft Rewards points for attending. Register at JDConf.com

Take Control of Every Message: Partial Failure Handling for Service Bus Triggers in Azure Functions
The Problem: All-or-Nothing Batch Processing in Azure Service Bus Azure Service Bus is one of the most widely used messaging services for building event-driven applications on Azure. When you use Azure Functions with a Service Bus trigger in batch mode, your function receives multiple messages at once for efficient, high-throughput processing. But what happens when one message in the batch fails? Your function receives a batch of 50 Service Bus messages. 49 process perfectly. 1 fails. What happens? In the default model, the entire batch fails. All 50 messages go back on the queue and get reprocessed, including the 49 that already succeeded. This leads to: Duplicate processing — messages that were already handled successfully get processed again Wasted compute — you pay for re-executing work that already completed Infinite retry loops — if that one "poison" message keeps failing, it blocks the entire batch indefinitely Idempotency burden — your downstream systems must handle duplicates gracefully, adding complexity to every consumer This is the classic all-or-nothing batch failure problem. Azure Functions solves it with per-message settlement. The Solution: Per-Message Settlement for Azure Service Bus Azure Functions gives you direct control over how each individual message is settled in real time, as you process it. Instead of treating the batch as all-or-nothing, you settle each message independently based on its processing outcome. 
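To make the difference concrete, here is a small simulation of the two models. It is a sketch that does not touch any Azure SDK: `BatchResult`, `all_or_nothing`, and `per_message` are hypothetical names introduced for illustration, and it assumes that a failed message is simply returned to the queue for retry.

```python
from dataclasses import dataclass


@dataclass
class BatchResult:
    returned_to_queue: int  # messages that will be delivered again
    completed: int          # messages removed from the queue


def all_or_nothing(outcomes: list[bool]) -> BatchResult:
    """Default model: a single failure sends the whole batch back."""
    if all(outcomes):
        return BatchResult(returned_to_queue=0, completed=len(outcomes))
    return BatchResult(returned_to_queue=len(outcomes), completed=0)


def per_message(outcomes: list[bool]) -> BatchResult:
    """Per-message settlement: each message is settled on its own outcome."""
    completed = sum(outcomes)  # True counts as 1
    return BatchResult(returned_to_queue=len(outcomes) - completed, completed=completed)


# A batch of 50 where one message (index 17, chosen arbitrarily) fails.
outcomes = [i != 17 for i in range(50)]

print(all_or_nothing(outcomes))  # all 50 go back on the queue
print(per_message(outcomes))     # only the failed message goes back
```

With the numbers from the example above, the all-or-nothing model returns all 50 messages to the queue, while per-message settlement returns only the one that failed.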
With Service Bus message settlement actions in Azure Functions, you can: Action What It Does Complete Remove the message from the queue (successfully processed) Abandon Release the lock so the message returns to the queue for retry, optionally modifying application properties Dead-letter Move the message to the dead-letter queue (poison message handling) Defer Keep the message in the queue but make it only retrievable by sequence number This means in a batch of 50 messages, you can: Complete 47 that processed successfully Abandon 2 that hit a transient error (with updated retry metadata) Dead-letter 1 that is malformed and will never succeed All in a single function invocation. No reprocessing of successful messages. No building failure response objects. No all-or-nothing. Why This Matters 1. Eliminates Duplicate Processing When you complete messages individually, successfully processed messages are immediately removed from the queue. There's no chance of them being redelivered, even if other messages in the same batch fail. 2. Enables Granular Error Handling Different failures deserve different treatments. A malformed message should be dead-lettered immediately. A message that failed due to a transient database timeout should be abandoned for retry. A message that requires manual intervention should be deferred. Per-message settlement gives you this granularity. 3. Implements Exponential Backoff Without External Infrastructure By combining abandon with modified application properties, you can track retry counts per message and implement exponential backoff patterns directly in your function code, no additional queues or Durable Functions required. 4. Reduces Cost You stop paying for redundant re-execution of already-successful work. In high-throughput systems processing millions of messages, this can be a material cost reduction. 5. 
Simplifies Idempotency Requirements
When successful messages are never redelivered, your downstream systems don't need to guard against duplicates as aggressively. This reduces architectural complexity and potential for bugs.

Before: One Message = One Function Invocation
Before batch support, when there was no cardinality option, Azure Functions processed each Service Bus message as a separate function invocation. If your queue had 50 messages, the runtime spun up 50 individual executions.

Single-Message Processing (The Old Way)

    import { app, InvocationContext } from '@azure/functions';

    interface Order { id: string; product: string; amount: number; }

    // Placeholder for your business logic
    async function processOrder(order: Order): Promise<void> {}

    async function handleOrderMessage(
      message: unknown, // ← One message at a time, no batch
      context: InvocationContext
    ): Promise<void> {
      try {
        const order = message as Order;
        await processOrder(order);
      } catch (error) {
        context.error('Failed to process message:', error);
        // Messages auto-complete by default; rethrowing abandons this one so it is retried.
        throw error;
      }
    }

    app.serviceBusQueue('processOrder', {
      connection: 'ServiceBusConnection',
      queueName: 'orders-queue',
      handler: handleOrderMessage,
    });

What this cost you:

50 messages on the queue | Old (single-message) | New (batch + settlement)
Function invocations | 50 separate invocations | 1 invocation
Connection overhead | 50 separate DB/API connections | 1 connection, reused across batch
Compute cost | 50× invocation overhead | 1× invocation overhead
Settlement control | Binary: throw or don't | 4 actions per message

Every message paid the full price of a function invocation: startup, connection setup, and teardown. At scale (millions of messages/day), this was a significant cost and latency penalty. And when a message failed, your only options were to throw (retry the whole message) or swallow the error (lose it silently).

Code Examples
Let's see how this looks across all three major Azure Functions language stacks.
Node.js (TypeScript with @azure/functions-extensions-servicebus)

    import '@azure/functions-extensions-servicebus';
    import { app, InvocationContext } from '@azure/functions';
    import { ServiceBusMessageContext, messageBodyAsJson } from '@azure/functions-extensions-servicebus';

    interface Order { id: string; product: string; amount: number; }

    // Placeholder for your business logic
    async function processOrder(order: Order): Promise<void> {}

    export async function processOrderBatch(
      sbContext: ServiceBusMessageContext,
      context: InvocationContext
    ): Promise<void> {
      const { messages, actions } = sbContext;
      for (const message of messages) {
        try {
          const order = messageBodyAsJson<Order>(message);
          await processOrder(order);
          await actions.complete(message); // ✅ Done
        } catch (error) {
          context.error(`Failed ${message.messageId}:`, error);
          await actions.deadletter(message); // ☠️ Poison
        }
      }
    }

    app.serviceBusQueue('processOrderBatch', {
      connection: 'ServiceBusConnection',
      queueName: 'orders-queue',
      sdkBinding: true,
      autoCompleteMessages: false,
      cardinality: 'many',
      handler: processOrderBatch,
    });

Key points:
- Enable sdkBinding: true and autoCompleteMessages: false to gain manual settlement control
- ServiceBusMessageContext provides both the messages array and actions object
- Settlement actions: complete(), abandon(), deadletter(), defer()
- Application properties can be passed to abandon() for retry tracking
- Built-in helpers like messageBodyAsJson<T>() handle Buffer-to-object parsing

Full sample: serviceBusSampleWithComplete

Python (V2 Programming Model)

    import json
    import logging
    from typing import List

    import azure.functions as func
    import azurefunctions.extensions.bindings.servicebus as servicebus

    app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

    @app.service_bus_queue_trigger(arg_name="messages",
                                   queue_name="orders-queue",
                                   connection="SERVICEBUS_CONNECTION",
                                   auto_complete_messages=False,
                                   cardinality="many")
    def process_order_batch(messages: List[servicebus.ServiceBusReceivedMessage],
                            message_actions: servicebus.ServiceBusMessageActions):
        for message in messages:
            try:
                order = json.loads(message.body)
                process_order(order)
                message_actions.complete(message)  # ✅ Done
            except Exception as e:
                logging.error(f"Failed {message.message_id}: {e}")
                message_actions.dead_letter(message)  # ☠️ Poison

    def process_order(order):
        logging.info(f"Processing order: {order['id']}")

Key points:
- Uses azurefunctions.extensions.bindings.servicebus for SDK-type bindings with ServiceBusReceivedMessage
- Supports both queue and topic triggers with cardinality="many" for batch processing
- Each message exposes SDK properties like body, enqueued_time_utc, lock_token, message_id, and sequence_number

Full sample: servicebus_samples_settlement

.NET (C# Isolated Worker)

    using Azure.Messaging.ServiceBus;
    using Microsoft.Azure.Functions.Worker;
    using Microsoft.Extensions.Logging;

    public class ServiceBusBatchProcessor(ILogger<ServiceBusBatchProcessor> logger)
    {
        [Function(nameof(ProcessOrderBatch))]
        public async Task ProcessOrderBatch(
            // IsBatched delivers an array of messages; AutoCompleteMessages = false
            // leaves settlement to the code below.
            [ServiceBusTrigger("orders-queue", Connection = "ServiceBusConnection",
                IsBatched = true, AutoCompleteMessages = false)]
            ServiceBusReceivedMessage[] messages,
            ServiceBusMessageActions messageActions)
        {
            foreach (var message in messages)
            {
                try
                {
                    var order = message.Body.ToObjectFromJson<Order>();
                    await ProcessOrder(order);
                    await messageActions.CompleteMessageAsync(message); // ✅ Done
                }
                catch (Exception ex)
                {
                    logger.LogError(ex, "Failed {MessageId}", message.MessageId);
                    await messageActions.DeadLetterMessageAsync(message); // ☠️ Poison
                }
            }
        }

        // Placeholder for your business logic
        private Task ProcessOrder(Order order) => Task.CompletedTask;
    }

    public record Order(string Id, string Product, decimal Amount);

Key points:
- Inject ServiceBusMessageActions directly alongside the message array
- Each message is individually settled with CompleteMessageAsync, DeadLetterMessageAsync, or AbandonMessageAsync
- Application properties can be modified on abandon to track retry metadata

Full sample: ServiceBusReceivedMessageFunctions.cs

HTTP Triggers in Azure SRE Agent: From Jira Ticket to Automated Investigation
Introduction Many teams run their observability, incident management, ticketing, and deployment on platforms outside of Azure—Jira, Opsgenie, Grafana, Zendesk, GitLab, Jenkins, Harness, or homegrown internal tools. These are the systems where alerts fire, tickets get filed, deployments happen, and operational decisions are made every day. HTTP Triggers make it easy to connect any of them to Azure SRE Agent—turning events from any platform into automated agent actions with a simple HTTP POST. No manual copy-paste, no context-switching, no delay between detection and response. In this blog, we'll demonstrate by connecting Jira to SRE Agent—so that every new incident ticket automatically triggers an investigation, and the agent posts its findings back to the Jira ticket when it's done. The Scenario: Jira Incident → Automated Investigation Your team manages production applications backed by Azure PostgreSQL Flexible Server. You use Jira for incident tracking. Today, when a P1 or P2 incident is filed, your on-call engineer has to manually triage—reading through the ticket, checking dashboards, querying logs, correlating recent deployments—before they can even begin working on a fix. Some teams have Jira automations that route or label tickets, but the actual investigation still starts with a human. HTTP Triggers let you bring SRE Agent directly into that existing workflow. Instead of adding another tool for engineers to check, the agent meets them where they already work. Jira ticket created → SRE Agent automatically investigates → Agent writes findings back to Jira The on-call engineer opens the Jira ticket and the investigation is already there—root cause analysis, evidence from logs and metrics, and recommended next steps—posted as a comment by the agent. Here's how to set this up. 
Architecture Overview Here's the end-to-end flow we'll build: Jira — A new issue is created in your project Logic App — The Jira connector detects the new issue, and the Logic App calls the SRE Agent HTTP Trigger, using Managed Identity for authentication HTTP Trigger — The agent prompt is rendered with the Jira ticket details (key, summary, priority, etc.) via payload placeholders Agent Investigation — The agent uses Jira MCP tools to read the ticket and search related issues, queries Azure logs, metrics, and recent deployments, then posts its findings back to the Jira ticket as a comment How HTTP Triggers Work Every HTTP Trigger you create in Azure SRE Agent exposes a unique webhook URL: https://<your-agent>.<instance>.azuresre.ai/api/v1/httptriggers/trigger/<trigger-id> When an external system sends a POST request to this URL with a JSON payload, the SRE Agent: Validates the trigger exists and is enabled Renders your agent prompt by injecting payload values into {payload.X} placeholders Creates a new investigation thread (or reuses an existing one) Executes the agent with the rendered prompt—autonomously or in review mode Records the execution in the trigger's history for auditing Payload Placeholders The real power of HTTP Triggers is in payload placeholders. When you configure a trigger, you write an agent prompt with {payload.X} tokens that get replaced at runtime with values from the incoming JSON. For example, a prompt like: Investigate Jira incident {payload.key}: {payload.summary} (Priority: {payload.priority}) Gets rendered with actual incident data before the agent sees it, giving it immediate context to begin investigating. If your prompt doesn't use any placeholders, the raw JSON payload is automatically appended to the prompt, so the agent always has access to the full context regardless. 
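The placeholder behavior described above can be approximated in a few lines of Python. This is a minimal sketch for building intuition, not the service's actual implementation: `render_prompt` and `PLACEHOLDER` are hypothetical names, and the handling of missing keys (rendered as an empty string) is an assumption the text does not specify.

```python
import json
import re

# Matches tokens such as {payload.key} or {payload.summary}
PLACEHOLDER = re.compile(r"\{payload\.([A-Za-z0-9_]+)\}")


def render_prompt(template: str, payload: dict) -> str:
    """Illustrative only: substitute {payload.X} tokens with payload values."""
    if not PLACEHOLDER.search(template):
        # No placeholders in the prompt: append the raw JSON payload instead
        return f"{template}\n\n{json.dumps(payload, indent=2)}"

    def substitute(match: re.Match) -> str:
        # Assumption: a key missing from the payload renders as an empty string
        return str(payload.get(match.group(1), ""))

    return PLACEHOLDER.sub(substitute, template)


payload = {"key": "KAN-16", "summary": "Elevated API response times", "priority": "High"}
template = "Investigate Jira incident {payload.key}: {payload.summary} (Priority: {payload.priority})"

print(render_prompt(template, payload))
# → Investigate Jira incident KAN-16: Elevated API response times (Priority: High)
```

A prompt without any `{payload.X}` token takes the first branch, so the payload still reaches the agent as appended JSON.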
Thread Modes HTTP Triggers support two thread modes: New Thread (recommended for incidents): Every trigger invocation creates a fresh investigation thread, giving each incident its own isolated workspace Same Thread: All invocations share a single thread, building up a continuous conversation—useful for accumulating alerts from a single source Authenticating External Platforms The HTTP Trigger endpoint is secured with Azure AD authentication, ensuring only authorized callers can create agent investigation threads. Every request requires a valid bearer token scoped to the SRE Agent's data plane. External platforms like Jira send standard HTTP webhooks and don't natively acquire Azure AD tokens. To bridge this, you can use any Azure service that supports Managed Identity as an intermediary—this approach means zero secrets to store or rotate in the external platform. Common options include: Approach Best For Azure Logic Apps Native connectors for many platforms, no code required, visual workflow designer Azure Functions Simple relay with ~15 lines of code, clean URL for any webhook source API Management (APIM) Enterprise environments needing rate limiting, IP filtering, or API key management All three support Managed Identity and can transparently acquire the Azure AD token before forwarding requests to the SRE Agent HTTP Trigger. In this walkthrough, we'll use Azure Logic Apps with the built-in Jira connector. Step-by-Step: Connecting Jira to SRE Agent Prerequisites An Azure SRE Agent resource deployed in your subscription A Jira Cloud project with API token access An Azure subscription for the Logic App Step 1: Set Up the Jira MCP Connector First, let's give the SRE Agent the ability to interact with Jira directly. 
In your agent's MCP Tool settings, add the Jira connector: Setting Value Package mcp-atlassian (npm, version 2.0.0) Transport STDIO Configure these environment variables: Variable Value ATLASSIAN_BASE_URL https://your-site.atlassian.net ATLASSIAN_EMAIL Your Jira account email ATLASSIAN_API_TOKEN Your Jira API token Once the connector is added, select the specific MCP tools you want the agent to use. The connector provides 18 Jira tools out of 80 available. For our incident investigation workflow, the key tools include: jira-mcp_read_jira_issue — Read details from a Jira issue by issue key jira-mcp_search_jira_issues — Search for Jira issues using JQL (Jira Query Language) jira-mcp_add_jira_comment — Add a comment to a Jira issue (post investigation findings back) jira-mcp_list_jira_projects — List available Jira projects jira-mcp_create_jira_issue — Create a new Jira issue This gives the SRE Agent bidirectional access to Jira—it can read ticket details, fetch comments, query related issues, and post investigation findings back as comments on the original ticket. This closes the loop so your on-call engineers see the agent's analysis directly in Jira without switching tools. Step 2: Create the HTTP Trigger Navigate to Builder → HTTP Triggers in the SRE Agent UI and click Create. 
Setting Value Name jira-incident-handler Agent Mode Autonomous Thread Mode New Thread (one investigation per incident) Sub-Agent (optional) Select a specialized incident response agent Agent Prompt: A new Jira incident has been filed that requires investigation: Jira Ticket: {payload.key} Summary: {payload.summary} Priority: {payload.priority} Reporter: {payload.reporter} Description: {payload.description} Jira URL: {payload.ticketUrl} Investigate this incident by: Identifying the affected Azure resources mentioned in the description Querying recent metrics and logs for anomalies Checking for recent deployments or configuration changes Providing a structured analysis with Root Cause, Evidence, and Recommended Actions Once your investigation is complete, use the Jira MCP tools to post a summary of your findings as a comment on the original ticket ({payload.key}). After saving, enable the trigger and open the trigger detail view. Copy the Trigger URL—you'll need it for the Logic App. Step 3: Create the Azure Logic App In the Azure Portal, create a new Logic App: Setting Value Type Consumption (Multi-tenant, Stateful) Name jira-sre-agent-bridge Region Same region as your SRE Agent (e.g., East US 2) Resource Group Same resource group as your SRE Agent (recommended for simplicity) Step 4: Enable Managed Identity In the Logic App → Identity → System assigned: Set Status to On Click Save Step 5: Assign the SRE Agent Admin Role Navigate to your SRE Agent resource → Access control (IAM) → Add role assignment: Setting Value Role SRE Agent Admin Assign to Managed Identity → select your Logic App This grants the Logic App's Managed Identity the data-plane permissions needed to invoke HTTP Triggers. Important: The Contributor role alone is not sufficient. Contributor covers the Azure control plane, but SRE Agent uses a separate data plane with its own RBAC. The SRE Agent Admin role provides the required data-plane permissions. 
Step 6: Create the Jira Connection Open the Logic App designer. When adding the Jira trigger, it will prompt you to create a connection: Setting Value Connection name jira-connection Jira instance https://your-site.atlassian.net Email Your Jira email API Token Your Jira API token Step 7: Configure the Logic App Workflow Switch to the Logic App Code view and paste this workflow definition: { "definition": { "$schema": "https://schema.management.azure.com/providers/Microsoft.Logic/schemas/2016-06-01/workflowdefinition.json#", "contentVersion": "1.0.0.0", "triggers": { "When_a_new_issue_is_created_(V2)": { "recurrence": { "interval": 3, "frequency": "Minute" }, "splitOn": "@triggerBody()", "type": "ApiConnection", "inputs": { "host": { "connection": { "name": "@parameters('$connections')['jira']['connectionId']" } }, "method": "get", "path": "/v2/new_issue_trigger/search", "queries": { "X-Request-Jirainstance": "https://YOUR-SITE.atlassian.net", "projectKey": "YOUR_PROJECT_ID" } } } }, "actions": { "Call_SRE_Agent_HTTP_Trigger": { "runAfter": {}, "type": "Http", "inputs": { "uri": "https://YOUR-AGENT.azuresre.ai/api/v1/httptriggers/trigger/YOUR-TRIGGER-ID", "method": "POST", "headers": { "Content-Type": "application/json" }, "body": { "key": "@{triggerBody()?['key']}", "summary": "@{triggerBody()?['fields']?['summary']}", "priority": "@{triggerBody()?['fields']?['priority']?['name']}", "reporter": "@{triggerBody()?['fields']?['reporter']?['displayName']}", "description": "@{triggerBody()?['fields']?['description']}", "ticketUrl": "@{concat('https://YOUR-SITE.atlassian.net/browse/', triggerBody()?['key'])}" }, "authentication": { "type": "ManagedServiceIdentity", "audience": "https://azuresre.dev" } } } }, "outputs": {}, "parameters": { "$connections": { "type": "Object", "defaultValue": {} } } }, "parameters": { "$connections": { "type": "Object", "value": { "jira": { "id": "/subscriptions/YOUR-SUB/providers/Microsoft.Web/locations/YOUR-REGION/managedApis/jira", 
"connectionId": "/subscriptions/YOUR-SUB/resourceGroups/YOUR-RG/providers/Microsoft.Web/connections/jira", "connectionName": "jira" } } } } } Replace the YOUR-* placeholders with your actual values. To find your Jira project ID, navigate to https://your-site.atlassian.net/rest/api/3/project/YOUR-PROJECT-KEY in your browser and find the "id" field in the JSON response. The critical piece is the authentication block: "authentication": { "type": "ManagedServiceIdentity", "audience": "https://azuresre.dev" } This tells the Logic App to automatically acquire an Azure AD token for the SRE Agent data plane and attach it as a Bearer token. No secrets, no expiration management, no manual token refresh. After pasting the JSON and clicking Save, switch back to the Designer view. The Logic App automatically generates the visual workflow from the code — you'll see the Jira trigger ("When a new issue is created (V2)") connected to the HTTP action ("Call SRE Agent HTTP Trigger") as a two-step flow, with all the field mappings and authentication settings already configured What Happens Inside the Agent When the HTTP Trigger fires, the SRE Agent receives a fully contextualized prompt with all the Jira incident data injected: A new Jira incident has been filed that requires investigation: Jira Ticket: KAN-16 Summary: Elevated API Response Times — PostgreSQL Table Lock Causing Request Blocking on Listings Service Priority: High Reporter: Vineela Suri Description: Severity: P2 — High. Affected Service: Production API (octopets-prod-postgres). Impact: End users experience slow or unresponsive listing pages. Jira URL: https://your-site.atlassian.net/browse/KAN-16 Investigate this incident by: Identifying the affected Azure resources mentioned in the description Querying recent metrics and logs for anomalies ... The agent then uses its configured tools to investigate—Azure CLI to query metrics, Kusto to analyze logs, and the Jira MCP connector to read the ticket for additional context. 
Once the investigation is complete, the agent posts its findings as a comment directly on the Jira ticket, closing the loop without any manual copy-paste. Each execution is recorded in the trigger's history with timestamp, thread ID, success status, duration, and an AI-generated summary, giving you full observability into your automated investigation pipeline.

Extending to Other Platforms

The pattern we built here works for any external platform that isn't natively supported by SRE Agent. The core architecture stays the same:

External Platform → Auth Bridge (Managed Identity) → SRE Agent HTTP Trigger

You only need to swap the inbound side of the bridge. For example:

| External Platform | Auth Bridge Configuration |
| --- | --- |
| Jira | Logic App with Jira V2 connector (polling) |
| OpsGenie | Logic App with OpsGenie connector, or Azure Function relay receiving OpsGenie webhooks |
| Datadog | Azure Function relay or APIM policy receiving Datadog webhook notifications |
| Grafana | Azure Function relay or APIM policy receiving Grafana alert webhooks |
| Splunk | APIM with webhook endpoint and Managed Identity forwarding |
| Custom / Internal tools | Logic App HTTP trigger, Azure Function relay, or APIM (any service that supports Managed Identity) |

The SRE Agent HTTP Trigger and the Managed Identity authentication remain the same regardless of the source platform. You configure the trigger once, set up the auth bridge, and connect as many external sources as needed. Each trigger can have its own tailored prompt, sub-agent, and thread mode optimized for the type of incoming event.
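For the Azure Function relay option, the inbound half of the bridge is mostly field mapping. The sketch below mirrors the null-safe body mapping used in the Logic App definition, assuming a Jira-style issue dictionary. `build_trigger_payload` and `site_url` are hypothetical names chosen for illustration, and acquiring the Managed Identity token and posting to the trigger are deliberately left out.

```python
from typing import Any, Mapping


def build_trigger_payload(issue: Mapping[str, Any], site_url: str) -> dict[str, Any]:
    """Map a Jira-style issue onto the body fields used in the Logic App example.

    Missing or null nested objects yield None instead of raising, which is
    the same behaviour the `?[...]` operator gives in the workflow expressions.
    """
    fields = issue.get("fields") or {}
    key = issue.get("key")
    return {
        "key": key,
        "summary": fields.get("summary"),
        "priority": (fields.get("priority") or {}).get("name"),
        "reporter": (fields.get("reporter") or {}).get("displayName"),
        "description": fields.get("description"),
        # Hypothetical helper behaviour: tolerate a trailing slash on the site URL.
        "ticketUrl": f"{site_url.rstrip('/')}/browse/{key}",
    }
```

Keeping the mapping in a pure function like this makes it easy to unit test separately from the authentication and HTTP code of the relay.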
Key Takeaways

HTTP Triggers extend Azure SRE Agent's reach to any external platform:

- Connect What You Use: If your incident platform isn't natively supported, HTTP Triggers provide the integration point, with no code changes to SRE Agent required.
- Secure by Design: Azure AD authentication with Managed Identity keeps the data plane protected while making integration straightforward through standard Azure services.
- Bidirectional with MCP: Combine HTTP Triggers (inbound) with MCP connectors (outbound) for full round-trip integration: receive incidents automatically and post findings back to the source platform.
- Full Observability: Every trigger execution is recorded with timestamps, thread IDs, duration, and AI-generated summaries.
- Flexible Context Injection: Payload placeholders let you craft precise investigation prompts from incident data, while raw payload passthrough ensures the agent always has full context.

Getting Started

HTTP Triggers are available now in the Azure SRE Agent platform:

- Create a Trigger: Navigate to Builder → HTTP Triggers → Create. Define your agent prompt with {payload.X} placeholders.
- Set Up an Auth Bridge: Use Logic Apps, Azure Functions, or APIM with Managed Identity to handle Azure AD authentication.
- Connect Your Platform: Point your external platform at the bridge and create a test event.

Within minutes, you'll have an automated pipeline that turns every incident ticket into an AI-driven investigation.

Learn More

- HTTP Triggers Documentation
- Agent Hooks Blog Post: Governance controls for automated investigations
- YAML Schema Reference
- SRE Agent Getting Started Guide

Ready to extend your SRE Agent to platforms it doesn't support natively? Set up your first HTTP Trigger today at sre.azure.com.

Migrating to the next generation of Virtual Nodes on Azure Container Instances (ACI)
What is ACI/Virtual Nodes? Azure Container Instances (ACI) is a fully-managed serverless container platform which gives you the ability to run containers on-demand without provisioning infrastructure. Virtual Nodes on ACI allows you to run Kubernetes pods managed by an AKS cluster in a serverless way on ACI instead of traditional VM‑backed node pools. From a developer’s perspective, Virtual Nodes look just like regular Kubernetes nodes, but under the hood the pods are executed on ACI’s serverless infrastructure, enabling fast scale‑out without waiting for new VMs to be provisioned. This makes Virtual Nodes ideal for bursty, unpredictable, or short‑lived workloads where speed and cost efficiency matter more than long‑running capacity planning. Introducing the next generation of Virtual Nodes on ACI The newer Virtual Nodes v2 implementation modernises this capability by removing many of the limitations of the original AKS managed add‑on and delivering a more Kubernetes‑native, flexible, and scalable experience when bursting workloads from AKS to ACI. In this article I will demonstrate how you can migrate an existing AKS cluster using the Virtual Nodes managed add-on (legacy), to the new generation of Virtual Nodes on ACI, which is deployed and managed via Helm. More information about Virtual Nodes on Azure Container Instances can be found here, and the GitHub repo is available here. Advanced documentation for Virtual Nodes on ACI is also available here, and includes topics such as node customisation, release notes and a troubleshooting guide. Please note that all code samples within this guide are examples only, and are provided without warranty/support. 
Background

Virtual Nodes on ACI has been rebuilt from the ground up and includes several fixes and enhancements, for instance:

Added support/features
- VNet peering, outbound traffic to the internet with network security groups
- Init containers
- Host aliases
- Arguments for exec in ACI
- Persistent Volumes and Persistent Volume Claims
- Container hooks
- Confidential containers (see supported regions list here)
- ACI standby pools
- Support for image pulling via Private Link and Managed Identity (MSI)

Planned future enhancements
- Kubernetes network policies
- Support for IPv6
- Windows containers
- Port Forwarding

Note: The new generation of the add-on is managed via Helm rather than as an AKS managed add-on.

Requirements & limitations
- Each Virtual Nodes on ACI deployment requires 3 vCPUs and 12 GiB memory on one of the AKS cluster’s VMs
- Each Virtual Nodes node supports up to 200 pods
- DaemonSets are not supported
- Virtual Nodes on ACI requires AKS clusters with Azure CNI networking (Kubenet is not supported, nor is overlay networking)

Migrating to the next generation of Virtual Nodes on Azure Container Instances via Helm chart

For this walkthrough, I'm using Bash via Windows Subsystem for Linux (WSL), along with the Azure CLI. Direct migration is not supported, so the steps below show an example of installing the Virtual Nodes on ACI Helm chart and removing the Virtual Nodes managed add-on and its resources. In this walkthrough I will explain how to delete and re-create the Virtual Nodes subnet; if you need to preserve the VNet and/or use a custom subnet name, refer to the Helm customisation steps here. Be sure to use a new subnet CIDR within the VNet address space that doesn't overlap with other subnets or with the AKS CIDRs for nodes/pods and ClusterIP services. To minimise disruption, we'll first install the Virtual Nodes on ACI Helm chart, and only then remove the legacy managed add-on and its resources.
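Before creating the new subnet, the non-overlap requirement above can be checked offline. This is a minimal sketch using Python's standard `ipaddress` module. The function name `check_new_subnet` and every CIDR value in the usage example are illustrative assumptions, so substitute the ranges from your own VNet and cluster.

```python
import ipaddress


def check_new_subnet(new_cidr: str, vnet_cidr: str, reserved_cidrs: list[str]) -> list[str]:
    """Return a list of problems; an empty list means the CIDR looks usable.

    reserved_cidrs should hold every range the new subnet must avoid:
    existing subnets plus the AKS node/pod and ClusterIP service CIDRs.
    """
    new_net = ipaddress.ip_network(new_cidr)
    vnet = ipaddress.ip_network(vnet_cidr)
    problems = []
    if not new_net.subnet_of(vnet):
        problems.append(f"{new_net} is not inside VNet {vnet}")
    for cidr in reserved_cidrs:
        other = ipaddress.ip_network(cidr)
        if new_net.overlaps(other):
            problems.append(f"{new_net} overlaps {other}")
    return problems


if __name__ == "__main__":
    # Hypothetical address plan, for illustration only.
    for problem in check_new_subnet("10.241.0.0/24", "10.240.0.0/12", ["10.240.0.0/16"]):
        print(problem)
```

An empty result only means the ranges do not collide on paper. Azure still validates the subnet when `az network vnet subnet create` runs.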
Prerequisites
- A recent version of the Azure CLI
- An Azure subscription with sufficient ACI quota for your selected region
- Helm

Deployment steps

Initialise environment variables:

    location=northeurope
    rg=rg-virtualnode-demo
    vnetName=vnet-virtualnode-demo
    clusterName=aks-virtualnode-demo
    aksSubnetName=subnet-aks
    vnSubnetName=subnet-vn

Create the new Virtual Nodes on ACI subnet with the specific name value of cg (a custom subnet can be used by following the steps here):

    vnSubnetId=$(az network vnet subnet create \
      --resource-group $rg \
      --vnet-name $vnetName \
      --name cg \
      --address-prefixes <your subnet CIDR> \
      --delegations Microsoft.ContainerInstance/containerGroups --query id -o tsv)

Assign the cluster's -kubelet identity Contributor access to the infrastructure resource group, and Network Contributor access to the ACI subnet:

    nodeRg=$(az aks show --resource-group $rg --name $clusterName --query nodeResourceGroup -o tsv)
    nodeRgId=$(az group show -n $nodeRg --query id -o tsv)
    agentPoolIdentityId=$(az aks show --resource-group $rg --name $clusterName --query "identityProfile.kubeletidentity.resourceId" -o tsv)
    agentPoolIdentityObjectId=$(az identity show --ids $agentPoolIdentityId --query principalId -o tsv)

    az role assignment create \
      --assignee-object-id "$agentPoolIdentityObjectId" \
      --assignee-principal-type ServicePrincipal \
      --role "Contributor" \
      --scope "$nodeRgId"

    az role assignment create \
      --assignee-object-id "$agentPoolIdentityObjectId" \
      --assignee-principal-type ServicePrincipal \
      --role "Network Contributor" \
      --scope "$vnSubnetId"

Download the cluster's kubeconfig file:

    az aks get-credentials -n $clusterName -g $rg

Clone the virtualnodesOnAzureContainerInstances GitHub repo:

    git clone https://github.com/microsoft/virtualnodesOnAzureContainerInstances.git

Install the Virtual Nodes on ACI Helm chart:

    helm install <yourReleaseName> <GitRepoRoot>/Helm/virtualnode

Confirm the Virtual Nodes node shows within the cluster and is in a Ready state (virtualnode-n):

    $ kubectl get node
    NAME                                STATUS   ROLES    AGE     VERSION
    aks-nodepool1-35702456-vmss000000   Ready    <none>   4h13m   v1.33.6
    aks-nodepool1-35702456-vmss000001   Ready    <none>   4h13m   v1.33.6
    virtualnode-0                       Ready    <none>   162m    v1.33.7

Scale down any running Virtual Nodes workloads (example below):

    kubectl scale deploy <deploymentName> -n <namespace> --replicas=0

Drain and cordon the legacy Virtual Nodes node:

    kubectl drain virtual-node-aci-linux

Disable the Virtual Nodes managed add-on (legacy):

    az aks disable-addons --resource-group $rg --name $clusterName --addons virtual-node

Export a backup of the original subnet configuration:

    az network vnet subnet show --resource-group $rg --vnet-name $vnetName --name $vnSubnetName > subnetConfigOriginal.json

Delete the original subnet (subnets cannot be renamed and therefore must be re-created):

    az network vnet subnet delete -g $rg -n $vnSubnetName --vnet-name $vnetName

Delete the previous (legacy) Virtual Nodes node from the cluster:

    kubectl delete node virtual-node-aci-linux

Test and confirm pod scheduling on Virtual Node:

    apiVersion: v1
    kind: Pod
    metadata:
      annotations:
      name: demo-pod
    spec:
      containers:
      - command:
        - /bin/bash
        - -c
        - 'counter=1; while true; do echo "Hello, World! Counter: $counter"; counter=$((counter+1)); sleep 1; done'
        image: mcr.microsoft.com/azure-cli
        name: hello-world-counter
        resources:
          limits:
            cpu: 2250m
            memory: 2256Mi
          requests:
            cpu: 100m
            memory: 128Mi
      nodeSelector:
        virtualization: virtualnode2
      tolerations:
      - effect: NoSchedule
        key: virtual-kubelet.io/provider
        operator: Exists

If the pod successfully starts on the Virtual Node, you should see output similar to the below:

    $ kubectl get pod -o wide demo-pod
    NAME       READY   STATUS    RESTARTS   AGE   IP           NODE                   NOMINATED NODE   READINESS GATES
    demo-pod   1/1     Running   0          95s   10.241.0.4   vnode2-virtualnode-0   <none>           <none>

Modify the nodeSelector and tolerations properties of your Virtual Nodes workloads to match the requirements of Virtual Nodes on ACI (see details below).

Modify your deployments to run on Virtual Nodes on ACI

For Virtual Nodes managed add-on (legacy), the following nodeSelector and tolerations are used to run pods on Virtual Nodes:

    nodeSelector:
      kubernetes.io/role: agent
      kubernetes.io/os: linux
      type: virtual-kubelet
    tolerations:
    - key: virtual-kubelet.io/provider
      operator: Exists
    - key: azure.com/aci
      effect: NoSchedule

For Virtual Nodes on ACI, the nodeSelector/tolerations are slightly different:

    nodeSelector:
      virtualization: virtualnode2
    tolerations:
    - effect: NoSchedule
      key: virtual-kubelet.io/provider
      operator: Exists

Troubleshooting

Check the virtual-node-admission-controller and virtualnode-n pods are running within the vn2 namespace:

    $ kubectl get pod -n vn2
    NAME                                                 READY   STATUS    RESTARTS        AGE
    virtual-node-admission-controller-54cb7568f5-b7hnr   1/1     Running   1 (5h21m ago)   5h21m
    virtualnode-0                                        6/6     Running   6 (4h48m ago)   4h51m

If these pods are in a Pending state, your node pool(s) may not have enough resources available to schedule them (use kubectl describe pod to validate).
If the virtualnode-n pod is crashing, check the logs of the proxycri container to see whether there are any Managed Identity permissions issues (the cluster's -agentpool MSI needs to have Contributor access on the infrastructure resource group):

    kubectl logs -n vn2 virtualnode-0 -c proxycri

Further troubleshooting guidance is available within the official documentation.

Support

If you have issues deploying or using Virtual Nodes on ACI, add a GitHub issue here.

The Swarm Diaries: What Happens When You Let AI Agents Loose on a Codebase
The Idea Single-agent coding assistants are impressive, but they have a fundamental bottleneck: they think serially. Ask one to build a full CLI app with a database layer, a command parser, pretty output, and tests, and it’ll grind through each piece one by one. Industry benchmarks bear this out: AIMultiple’s 2026 agentic coding benchmark measured Claude Code CLI completing full-stack tasks in ~12 minutes on average, with other CLI agents ranging from 3 to 14 minutes depending on the tool. A three-week real-world test by Render.com found single-agent coding workflows taking 10–30 minutes for multi-file feature work. But these subtasks don’t depend on each other. A storage agent doesn’t need to wait for the CLI agent. A test writer doesn’t need to watch the renderer work. What if they all ran at the same time? The hypothesis was straightforward: a swarm of specialized agents should beat a single generalist on at least two of three pillars — speed, quality, or cost. The architecture looked clean on a whiteboard: The reality was messier. But first, let me explain the machinery that makes this possible. How It’s Wired: Brains and Hands The system runs on a brains-and-hands split. The brain is an Azure Durable Task Scheduler (DTS) orchestration — a deterministic workflow that decomposes the goal into a task DAG, fans agents out in parallel, merges their branches, and runs quality gates. If the worker crashes mid-run, DTS replays from the last checkpoint. No work lost. Simple LLM calls — the planner that decomposes the goal, the judge that scores the output — run as lightweight DTS activities. One call, no tools, cheap. The hands are Microsoft Agent Framework (MAF) agents, each running in its own Docker container. One sandbox per agent, each with its own git clone, filesystem, and toolset. When an agent’s LLM decides to edit a file or run a build, the call routes through middleware to that agent’s isolated container. No two agents ever touch the same workspace. 
These complex agents (coders, researchers, the integrator) run as DTS durable entities with full agentic loops and turn-level checkpointing.

The split matters because LLM reasoning and code execution have completely different reliability profiles. The brain checkpoints and replays deterministically. The hands are ephemeral: if a container dies, spin up a new one and replay the agent’s last turn. This separation is what lets you run five agents in parallel without them stepping on each other’s git branches, build artifacts, or file handles. It’s also what made every bug I was about to encounter debuggable. When something broke, I always knew which side broke: the orchestration logic or the agent behavior. That distinction saved me more hours than any other design decision.

The First Run Produced Nothing

After hours of vibe-coding the foundation (Pydantic models, skill prompts, a prompt builder, a context store, sixteen architectural decisions documented in ADRs), I wired up the seven-phase orchestration and hit go. All five agents returned empty responses. Every single one. The logs showed agents “running” but producing zero output.

I stared at the code for an embarrassingly long time before I found it. The planner returned task IDs as integers: 1, 2, 3. The sandbox provisioner stored them as string keys: "1", "2", "3". When the orchestrator did sandbox_map.get(1), it got None. No sandbox meant no middleware. The agents were literally talking to thin air, making LLM calls with no tools attached, like a carpenter showing up to a job site with no hammer.

The fix was one line. The lesson was bigger: LLMs don’t respect type contracts. They’ll return an integer when you expect a string, a list when you expect a dict, and a confident hallucination when they have nothing to say. Every boundary between AI-generated data and deterministic systems needs defensive normalization. This would not be the last time I learned that lesson.
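As an illustration of that kind of defensive normalization, here is a minimal sketch in which every task ID is coerced to one canonical string form on both write and read. It also fails loudly on a missing sandbox instead of returning None. `normalize_task_id` and `SandboxMap` are hypothetical names for this example, not the project's actual code or its one-line fix.

```python
def normalize_task_id(raw: object) -> str:
    """Coerce a planner-supplied task ID (int, float, or str) to a string key."""
    if raw is None or isinstance(raw, bool):
        raise ValueError(f"unusable task id: {raw!r}")
    if isinstance(raw, float) and raw.is_integer():
        raw = int(raw)  # treat 1.0 the same as 1
    key = str(raw).strip()
    if not key:
        raise ValueError("empty task id")
    return key


class SandboxMap:
    """Task-ID-to-sandbox lookup that normalizes keys on every access."""

    def __init__(self) -> None:
        self._by_id: dict[str, object] = {}

    def register(self, task_id: object, sandbox: object) -> None:
        self._by_id[normalize_task_id(task_id)] = sandbox

    def get(self, task_id: object) -> object:
        key = normalize_task_id(task_id)
        if key not in self._by_id:
            # Raise rather than return None, so an agent never runs without tools.
            raise LookupError(f"no sandbox provisioned for task {key!r}")
        return self._by_id[key]
```

Raising on a miss is the design choice that matters here: a silent None is what let the agents run with no middleware attached.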
The Seven-Minute Merge Once agents actually ran and produced code, a new problem emerged. I watched the logs on a run that took twenty-one minutes total. Four agents finished their work in about twelve minutes. The remaining seven minutes were the LLM integrator merging four branches — eight to thirty tool calls per merge, using the premium model, to do what git merge --no-edit does in five seconds. I was paying for a premium LLM to run git diff , read both sides of every file, and write a merged version. For branches that merged cleanly. With zero conflicts. The fix was obvious in retrospect: try git merge first. If it succeeds — great, five seconds, done. Only call the LLM integrator when there are actual conflicts to resolve. Merge time dropped from seven minutes to under thirty seconds. I felt a little silly for not doing this from the start. When Agents Build Different Apps The merge speedup felt like a win until I looked at what was actually being merged. The storage agent had built a JSON-file backend. The CLI agent had written its commands against SQLite. Both modules were well-written. They compiled individually. Together, nothing worked — the CLI tried to import a Storage class that didn’t exist in the JSON backend. This was the moment I realized the agents weren’t really a team. They were strangers who happened to be assigned to the same project, each interpreting the goal in their own way. The fix was the single most impactful change in the entire project: contract-first planning. Instead of just decomposing the goal into tasks, the planner now generates API contracts — function signatures, class shapes, data model definitions — and injects them into every agent’s prompt. “Here’s what the Storage class looks like. Here’s what Task looks like. Build against these interfaces.” Before contracts, three of six branches conflicted and the quality score was 28. After contracts, zero of four branches conflicted and the score hit 68. 
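One way to picture such a contract, as a minimal sketch: the agreed shapes are written down as a dataclass and a `typing.Protocol`, and an implementation is checked against them before merging. The `Task` fields, the `Storage` methods, and the `conforms` helper are illustrative assumptions, not the project's actual contract format.

```python
from dataclasses import dataclass
from typing import Protocol, runtime_checkable


@dataclass
class Task:
    """Shared data model that every agent builds against (hypothetical fields)."""
    id: int
    description: str
    done: bool = False


@runtime_checkable
class Storage(Protocol):
    """Shared storage interface (hypothetical methods)."""

    def add(self, task: Task) -> None: ...

    def list_all(self) -> list[Task]: ...


class JsonStorage:
    """One agent's implementation; it never needs to import Storage itself."""

    def __init__(self) -> None:
        self._tasks: list[Task] = []

    def add(self, task: Task) -> None:
        self._tasks.append(task)

    def list_all(self) -> list[Task]:
        return list(self._tasks)


def conforms(candidate: object) -> bool:
    """Cheap pre-merge check that the contract's methods are present.

    runtime_checkable protocols only verify that the attributes exist,
    not their signatures, so this catches missing methods and nothing more.
    """
    return isinstance(candidate, Storage)
```

A check like this would flag a backend that lacks `list_all`, but a static type checker or tests are still needed to catch signature mismatches.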
It turns out the plan isn’t just a plan. In a multi-agent system, the plan is the product. A brilliant plan with mediocre agents produces working code. A vague plan with brilliant agents produces beautiful components that don’t fit together. The Agent Who Lied PR #4 came back with what looked like a solid result. The test writer reported three test files with detailed coverage summaries. The JSON output was meticulous — file names, function names, which modules each test covered. Then I checked tool_call_count: 0 . The test writer hadn’t written a single file. It hadn’t even opened a file. It received zero tools — because the skill loader normalized test_writer to underscores while the tool registry used test-writer with hyphens. The lookup failed silently. The agent got no tools, couldn’t do any work, and did what LLMs do when they can’t fulfill a request but feel pressure to answer: it made something up. Confidently. This happened in three of our first four evaluation runs. I called them “phantom agents” — they showed up to work, clocked in, filed a report, and went home without lifting a finger. The fix had two parts. First, obviously, fix the hyphen/underscore normalization. Second, and more importantly: add a zero-tool-call guard. If an agent that should be writing files reports success with zero tool calls, don’t believe it. Nudge it and retry. The deeper lesson stuck with me: agents will never tell you they failed. They’ll report success with elaborate detail. You have to verify what they actually did, not what they said they did. The Integrator Who Took Shortcuts Even with contracts preventing mismatched architectures, merge conflicts still happened when multiple agents touched the same files. The LLM integrator’s job was to resolve these conflicts intelligently, preserving logic from both sides. Instead, facing a gnarly conflict in models.py , it ran: git restore --source=HEAD -- models.py One command. 
Silently destroyed one agent’s entire implementation — the Task class, the constants, the schema version — gone. The integrator committed the lobotomized file and reported “merge resolved successfully.” The downstream damage was immediate. storage.py imported symbols that no longer existed. The judge scored 43 out of 100. The fixer agent had to spend five minutes reconstructing the data model from scratch. But that wasn’t even the worst shortcut. On other runs, the integrator replaced conflicting code with: def add_task(desc, priority=0): pass # TODO: implement storage layer When an LLM is asked to resolve a hard conflict, it’ll sometimes pick the easiest valid output — delete everything and write a placeholder. Technically valid Python. Functionally a disaster. Fixing this required explicit prompt guardrails: Never run git restore --source=HEAD Never replace implementations with pass # TODO placeholders When two implementations conflict, keep the more complete one After resolving each file, read it back and verify the expected symbols still exist The lesson: LLMs optimize for the path of least resistance. Under pressure, “valid” and “useful” diverge sharply. Demolishing the House for a Leaky Faucet When the judge scored a run below 70, the original retry strategy was: start over. Re-plan. Re-provision five sandboxes. Re-run all agents. Re-merge. Re-judge. Seven minutes and a non-trivial cloud bill, all because one agent missed an import statement. This was absurd. Most failures weren’t catastrophic — they were close. A missing model field. A broken import. An unhandled error case. The code was 90% right. Starting from scratch was like tearing down a house because the bathroom faucet leaks. So I built the fixer agent: a premium-tier model that receives the judge’s specific complaints and makes surgical edits directly on the integrator’s branch. No new sandboxes, no new branches, no merge step. The first time it ran, the score jumped from 43 to 89.5. 
Three minutes instead of seven. And it solved the problem that actually existed, rather than hoping a second roll of the dice would land better. Of course, the fixer’s first implementation had its own bug — it ran in a new sandbox, created a new branch, and occasionally conflicted with the code it was trying to fix. The fix to the fixer: just edit in place on the integrator’s existing sandbox. No branch, no merge, no drama. How Others Parallelize (and Why We Went Distributed) Most multi-agent coding frameworks today parallelize by spawning agents as local processes on a single developer machine. Depending on the framework, there’s typically a lead agent or orchestrator that breaks the task down into subtasks, spins up new agents to handle each piece, and combines their work when they finish — often through parallel TMux sessions or subprocess pools sharing a local filesystem. It’s simple, it’s fast to set up, and for many tasks it works. But local parallelization hits a ceiling. All agents share one machine’s CPU, memory, and disk I/O. Five agents each running npm install or cargo build compete for the same 32 GB of RAM. There’s no true filesystem isolation — two agents can clobber the same file if the orchestrator doesn’t carefully sequence writes. Recovery from a crash means restarting the entire local process tree. And scaling from 3 agents to 10 means buying a bigger machine. Our swarm takes a different approach: fully distributed execution. Each agent runs in its own Docker container with its own filesystem, git clone, and compute allocation — provisioned on AKS, ACA, or any container host. Four agents get four independent resource pools. If one container dies, DTS replays that agent from its last checkpoint in a fresh container without affecting the others. Git branch-per-agent isolation means zero filesystem conflicts by design. 
The trade-off is overhead: container provisioning, network latency, and the merge step add wall-clock time that a local TMux setup avoids. On a small two-agent task, local parallelization on a fast laptop probably wins. But for tasks with 4+ agents doing real work (cloning repos, installing dependencies, running builds and tests), independent resource pools and crash isolation matter. Our benchmarks on a 4-agent helpdesk system showed the swarm completing in ~8 minutes with zero resource contention, producing 1,029 lines across 14 files with 4 clean branch merges.

The Scorecard

After all of this, did the swarm actually beat a single agent? I ran head-to-head benchmarks: same prompt, same model (GPT-5-nano), solo agent vs. swarm, scored by a Sonnet 4.6 judge on a four-criterion rubric. There were two tasks: a simple URL shortener (Render.com’s benchmark prompt) and a complex helpdesk ticket system. All runs are public, so you can review every line of generated code:

| Task | Solo Agent PR | Swarm PR |
| --- | --- | --- |
| URL Shortener | PR #1 | PR #2 |
| Helpdesk System | PR #3 | PR #4 |

| Metric | URL Shortener (Simple) | Helpdesk System (Complex) |
| --- | --- | --- |
| Quality (rubric, /5) | Solo 1.9 → Swarm 2.5 (+32%) | Solo 2.3 → Swarm 2.95 (+28%) |
| Speed | Solo 2.5 min → Swarm 5.5 min (2.2×) | Solo 1.75 min → Swarm ~8 min (~4.5×) |
| Tokens | 7.7K → 30K (3.9×) | 11K → 39K (3.4×) |

The pattern held across both tasks: a 28–32% quality improvement, at the cost of roughly 2–4.5× more time and ~3.5× more tokens. On the complex task, the quality gains broadened: the swarm produced better code organization (3/5 vs 2/5), actually wrote tests (code:test ratio 0 → 0.15), and generated 5× more files with cleaner decomposition. On the simple task, the gap came entirely from security practices: environment variables, parameterized queries, and proper .gitignore rules that the solo agent skipped entirely. Industry benchmarks from AIMultiple and Render.com show single CLI agents averaging 10–15 minutes on comparable full-stack tasks.
Our swarm runs in 5–12 minutes depending on parallelizability — but the real win is quality, not speed. Specialized agents with a narrow, well-defined scope tend to be more thorough: the solo agent skipped tests and security practices entirely, while the swarm's dedicated agents actually addressed them. Two out of three pillars — with a caveat the size of your task. On small, tightly-coupled problems, just use one good agent. On larger, parallelizable work with three or more independent modules? The swarm earns its keep. What I Actually Learned The Rules That Stuck Contract-first planning. Define interfaces before writing implementations. The plan isn’t just a guide — it’s the product. Deterministic before LLM. Try git merge before calling the LLM integrator. Run ruff check before asking an agent to debug. Use code when you can; use AI when you must. Validate actions, not claims. An agent that reports “merge resolved successfully” may have deleted everything. Check tool call counts. Read the actual diff. Trust nothing. Cheap recovery over expensive retries. A fixer agent that patches one file beats re-running five agents from scratch. The cost of failure should be proportional to the failure. Not every problem needs a swarm. If the task fits in one agent’s context window, adding four more just adds overhead. The sweet spot is 3+ genuinely independent modules. The Bigger Picture The biggest surprise? Building a multi-agent AI system is more about software engineering than AI engineering. The hard problems weren’t prompt design or model selection — they were contracts between components, isolation of concerns, idempotent operations, observability, and recovery strategies. Principles that have been around since the 1970s. The agents themselves are almost interchangeable. Swap GPT for Claude, change the temperature, fine-tune the system prompt — it barely matters if your orchestration is broken. 
What matters is how you decompose work, how you share context, how you merge results, and how you recover from failure. Get the engineering right, and the AI just works. Get it wrong, and no model on earth will save you.

By the Numbers

The codebase is ~7,400 lines of Python, with 230 tests and 141 commits. Over 10+ evaluation runs, the swarm processed a combined ~200K+ tokens, merged 20+ branches, and resolved conflicts ranging from trivial (package.json version bumps) to gnarly (overlapping data models). It’s built on Azure Durable Task Scheduler, Microsoft Agent Framework, and containerized sandboxes that run anywhere Docker does: AKS, ACA, or a plain docker run on your laptop. And somewhere in those 141 commits is a one-line fix for an integer-vs-string bug that took me an embarrassingly long time to find.

References

- Azure Durable Task Scheduler: Deterministic workflow orchestration with replay, checkpointing, and fan-out/fan-in patterns.
- Microsoft Agent Framework (MAF): Python agent framework for tool-calling, middleware, and structured output.
- Azure Kubernetes Service (AKS): Managed Kubernetes for running containerized agent workloads at scale.
- Azure Container Apps (ACA): Serverless container platform for simpler deployments.
- Azure OpenAI Service: Hosts the GPT models used by planner, coder, and judge agents.

Built with Azure DTS, Microsoft Agent Framework, and containerized sandboxes (Docker, AKS, ACA, your choice). And a lot of grep through log files.

The Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself. Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis. First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives. The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline. That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible. Where we started In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction. 
But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start. We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale. The Inversion: Three bets The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself. The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. 
The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last. Bet 1: The Filesystem as the Agent's World Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it. Code Repositories: the highest-leverage context Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it. 
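As a rough illustration of the single-interface idea in Bet 1, the sketch below implements two of the tools named above, read_file and grep, over one workspace directory. The Workspace class, its path check, and the "path:line:text" result format are assumptions made for this example; the post does not describe how the real tools are built.

```python
import re
from pathlib import Path


class Workspace:
    """Hypothetical file-based tool surface rooted at a single directory."""

    def __init__(self, root: str) -> None:
        self.root = Path(root).resolve()

    def _resolve(self, relative: str) -> Path:
        # Refuse any path that escapes the workspace root.
        path = (self.root / relative).resolve()
        if not path.is_relative_to(self.root):
            raise PermissionError(f"{relative} is outside the workspace")
        return path

    def read_file(self, relative: str, max_chars: int = 4000) -> str:
        """Return at most max_chars characters of a file."""
        text = self._resolve(relative).read_text(encoding="utf-8", errors="replace")
        return text[:max_chars]

    def grep(self, pattern: str, glob: str = "**/*") -> list[str]:
        """Return 'path:line_no:text' for every line matching the regex."""
        regex = re.compile(pattern)
        hits = []
        for path in sorted(self.root.glob(glob)):
            if not path.is_file():
                continue
            text = path.read_text(encoding="utf-8", errors="replace")
            for line_no, line in enumerate(text.splitlines(), start=1):
                if regex.search(line):
                    hits.append(f"{path.relative_to(self.root)}:{line_no}:{line.strip()}")
        return hits
```

Source code, runbooks, and notes all go through the same two calls, so adding a new kind of resource means adding files, not endpoints.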
When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown and the conditions under which each path executes. Stack traces start making sense, and logs become legible.

But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide:

Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside the implementation.

Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors.

Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes, the ones most likely to be missed precisely because no one thought to instrument them.

Memory as a filesystem, not a vector store

Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In an SRE context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance.

We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface.
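A minimal sketch of what file-based memory can look like, assuming plain Markdown files in one directory. The helper names, the dated-bullet format, and the link pattern are hypothetical choices for illustration, not the product's implementation.

```python
import re
from datetime import date
from pathlib import Path

# Matches Markdown links that point at other .md files, e.g. [notes](debugging.md)
LINK = re.compile(r"\[[^\]]*\]\(([^)#\s]+\.md)\)")


def append_learning(memory_dir: Path, filename: str, note: str, today: date) -> None:
    """Append a dated bullet so later sessions can judge how fresh it is."""
    path = memory_dir / filename
    with path.open("a", encoding="utf-8") as handle:
        handle.write(f"- {today.isoformat()}: {note}\n")


def linked_files(memory_dir: Path, entry: str) -> list[str]:
    """List the .md files an entry-point file links to, in document order."""
    text = (memory_dir / entry).read_text(encoding="utf-8")
    return LINK.findall(text)
```

Reading is ordinary file access: the agent opens an entry-point file, sees which files it links to, and decides which one to open next, so no similarity query is needed.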
The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carries just enough context to orient the agent, with links to deeper files when needed.

The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically.

One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it.

The sandbox as epistemic boundary

The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism.

Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with the gh CLI, like the prompt-ordering fix from the KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed.
No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free.

Bet 2: Context Layering

Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin.

This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it.

We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action.

Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration.
Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions.
Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed.
Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope.

Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start.

Bet 3: Frugal Context Management

Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast.
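To make the budget pressure concrete, here is a small sketch of one such hook: serializing a repo tree with a depth cap and estimating how much context it would take. The function names, the indented-listing format, and the four-characters-per-token ratio are assumptions for illustration only; the post does not specify how trees are serialized or measured.

```python
from pathlib import Path


def serialize_tree(root: Path, max_depth: int = 2) -> str:
    """Render a directory as an indented listing, descending max_depth levels."""
    lines: list[str] = []

    def walk(directory: Path, depth: int) -> None:
        # Directories first, then files, each group in name order.
        entries = sorted(directory.iterdir(), key=lambda p: (p.is_file(), p.name))
        for entry in entries:
            suffix = "/" if entry.is_dir() else ""
            lines.append("  " * depth + entry.name + suffix)
            if entry.is_dir() and depth + 1 < max_depth:
                walk(entry, depth + 1)

    walk(root, 0)
    return "\n".join(lines)


def rough_tokens(text: str, chars_per_token: int = 4) -> int:
    """Crude size estimate; the characters-per-token ratio is an assumption."""
    return -(-len(text) // chars_per_token)  # ceiling division
```

Lowering max_depth is the simplest budget lever here: the agent still sees the top-level layout and can list deeper directories itself when an investigation leads there.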
Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal. Tool result compression via the filesystem Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive. Context Pruning and Auto Compact Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work. Parallel subagents The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another. The Feedback loop These architectural bets have enabled us to close the original scaling gap. 
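The tool result compression described above can be sketched roughly as follows: a large output is written to a file, the model receives only a short stub, and later processing happens outside the model. The function names, stub wording, and preview size are hypothetical; this is not the framework's actual interface.

```python
from pathlib import Path


def spill_tool_output(output: str, workdir: Path, name: str, preview_lines: int = 3) -> str:
    """Save a large tool result to a file and return a short stub for the model."""
    path = workdir / name
    path.write_text(output, encoding="utf-8")
    lines = output.splitlines()
    shown = lines[:preview_lines]
    header = f"Saved {len(lines)} lines to {name}; showing first {len(shown)}:"
    return "\n".join([header, *shown])


def count_matching(path: Path, needle: str) -> int:
    """Process the saved file outside the model; only the count enters context."""
    with path.open(encoding="utf-8") as handle:
        return sum(1 for line in handle if needle in line)
```

In this sketch a thousand-line log costs the model four lines of context, and a follow-up question such as "how many 429s?" costs a single number.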
Instead of debugging the agent at human speed, we could finally start using it to fix itself.

As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, and 400s from code bugs that produced malformed payloads. These paper cuts caused investigations to stall midway, and some conversations broke entirely. So we set up a daily monitoring task for these failures. The agent searches the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors dropped by more than 80%.

Over the last month, we have successfully used our agent across a wide range of scenarios:

Analyzed our user churn rate and built dashboards we now review weekly.
Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase.
Ran security analysis and found vulnerabilities in the read path.
Helped fill out parts of its own Responsible AI review, with strict human review.
Handled customer-reported issues and LiveSite alerts end to end.

Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again.

The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users.

What We Learned

We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem.

The inversion was simple but hard to accept: stop pre-computing the answer space.
Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations.

The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one.

We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one.

Thanks to visagarwal for co-authoring this post.

Announcing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent, your AI-powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.

A Practical Path Forward for Heroku Customers with Azure
On February 6, 2026, Heroku announced it is moving to a sustaining engineering model focused on stability, security, reliability, and ongoing support. Many customers are now reassessing how their application platforms will support today’s workloads and future innovation. Microsoft is committed to helping customers migrate and modernize applications from platforms like Heroku to Azure.

What It Takes to Give SRE Agent a Useful Starting Point
In our latest posts, The Agent that investigates itself and Azure SRE Agent Now Builds Expertise Like Your Best Engineer Introducing Deep Context, we wrote about a moment that changed how we think about agent systems. Azure SRE Agent investigated a regression in its own prompt cache, traced the drop to a specific PR, and proposed fixes. What mattered was not just the model. What mattered was the starting point. The agent had code, logs, deployment history, and a workspace it could use to discover the next piece of context. That lesson forced an uncomfortable question about onboarding. If a customer finishes setup and the agent still knows nothing about their app, we have not really onboarded them. We have only created a resource. So for the March 10 GA release, we rebuilt onboarding around a more practical bar: can a new agent become useful on day one? To test that, we used the new flow the way we expect customers to use it. We connected a real sample app, wired up live Azure Monitor alerts, attached code and logs, uploaded a knowledge file, and then pushed the agent through actual work. We asked it to inspect the app, explain a 401 path from the source, debug its own log access, and triage GitHub issues in the repo. This post walks through that experience. We connected everything we could because we wanted to see what the agent does when it has a real starting point, not a partial one. If your setup is shorter, the SRE Agent still works. It just knows less. The cold start we were trying to fix The worst version of an agent experience is familiar by now. You ask a concrete question about your system and get back a smart-sounding answer that is only loosely attached to reality. The model knows what a Kubernetes probe is. It knows what a 500 looks like. It may even know common Kusto table names. But it does not know your deployment, your repo, your auth flow, or the naming mistakes your team made six months ago and still lives with. 
We saw the same pattern again and again inside our own work. When the agent had real context, it could do deep investigations. When it started cold, it filled the gaps with general knowledge and good guesses.

The new onboarding is our attempt to close that gap up front. Instead of treating code, logs, incidents, and knowledge as optional extras, the flow is built around connecting the things the agent needs to reason well.

Walking through the new onboarding

Starting March 10, you can create and configure an SRE Agent at sre.azure.com. Here is what that looked like for us.

Step 1: Create the agent

You choose a subscription, resource group, name, and region. Azure provisions the runtime, managed identity, Application Insights, and Log Analytics workspace. In our run, the whole thing took about two minutes.

That first step matters more than it may look. We are not just spinning up a chatbot. We are creating the execution environment where the agent can actually work: run commands, inspect files, query services, and keep track of what it learns.

Step 2: Start adding context

Once provisioning finishes, you land on the setup page. The page is organized around the sources that make the agent useful: code, logs, incidents, Azure resources, and knowledge files.

Data source | Why it matters
Code | Lets the agent read the system it is supposed to investigate.
Logs | Gives it real tables, schemas, and data instead of guesses.
Incidents | Connects the agent to the place where operational pain actually shows up.
Azure resources | Gives it the right scope so it starts in the right subscription and resource group.
Knowledge files | Adds the team-specific context that never shows up cleanly in telemetry.

The page is blunt in a way we like. If you have not connected anything yet, it tells you the agent does not know enough about your app to answer useful questions. That is the right framing. The job of onboarding is to fix that.

Step 3: Connect logs

We started with Azure Data Explorer.
The wizard supports Azure Kusto, Datadog, Elasticsearch, Dynatrace, New Relic, Splunk, and Hawkeye. After choosing Kusto, it generated the MCP connector settings for us. We supplied the cluster details, tested the connection, and let it discover the tools. This step removes a whole class of bad agent behavior. The model no longer has to invent table names or hope the cluster it wants is the cluster that exists. It knows what it can query because the connection is explicit. Step 4: Connect the incident platform For incidents, we chose Azure Monitor. This part is simple by design. If incidents are where the agent proves its value, connecting them should feel like the most natural part of setup, not a side quest. PagerDuty and ServiceNow work too, but for this walkthrough we kept it on Azure Monitor so we could wire real alerts to a real app. Step 5: Connect code Then we connected the code repo. We used microsoft-foundry/foundry-agent-webapp, a React and ASP.NET Core sample app running on Azure Container Apps. This is still the highest-leverage source we give the agent. Once the repo is connected, the agent can stop treating the app as an abstract web service. It can read the auth flow. It can inspect how health probes are configured. It can compare logs against the exact code paths that produced them. It can even look at the commit that was live when an incident happened. That changes the quality of the investigation immediately. Step 6: Scope the Azure resources Next we told the agent which resources it was responsible for. We scoped it to the resource group that contained the sample Container App. The wizard then set the roles the agent needed to observe and investigate the environment. That sounds like a small step, but it fixes another common failure mode. Agents do better when they start from the right part of the world. Subscription and resource-group scope give them that boundary. 
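One way to picture the scope boundary from Step 6 is as a check on Azure resource IDs, which embed the subscription and resource group in their path. The sketch below is purely illustrative: the helper names are hypothetical, the comparison is assumed to be case-insensitive, and in the product the boundary is set through the roles the wizard assigns, not through code like this.

```python
def parse_scope(resource_id: str) -> tuple[str, str]:
    """Extract (subscription_id, resource_group) from an Azure resource ID."""
    parts = resource_id.strip("/").split("/")
    lowered = [part.lower() for part in parts]
    try:
        subscription = parts[lowered.index("subscriptions") + 1]
        group = parts[lowered.index("resourcegroups") + 1]
    except (ValueError, IndexError) as exc:
        raise ValueError(f"not a resource-group-scoped ID: {resource_id}") from exc
    return subscription, group


def in_scope(resource_id: str, subscription_id: str, resource_group: str) -> bool:
    """True when the resource sits in the given subscription and resource group."""
    subscription, group = parse_scope(resource_id)
    # Assumption: both values are compared without regard to case.
    return (
        subscription.lower() == subscription_id.lower()
        and group.lower() == resource_group.lower()
    )
```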
Step 7: Upload knowledge Last, we uploaded a Markdown knowledge file we wrote for the sample app. The file covered the app architecture, API endpoints, auth flow, likely failure modes, and the files we would expect an engineer to open first during debugging. We like Markdown here because it stays honest. It is easy for a human to read, easy for the agent to navigate, and easy to update as the system changes. All sources configured Once everything was connected, the setup panel turned green. At that point the agent had a repo, logs, incidents, Azure resources, and a knowledge file. That is the moment where onboarding stops being a checklist and starts being operational setup. The chat experience makes the setup visible When you open a new thread, the configuration panel stays at the top of the chat. If you expand it, you can see exactly what is connected and what is not. We built this because people should not have to guess what the agent knows. If code is connected and logs are not, that should be obvious. If incidents are wired up but knowledge files are missing, that should be obvious too. The panel makes the agent's working context visible in the same place where you ask it to think. It also makes partial setup less punishing. You do not have to finish every step before the agent becomes useful. But you can see, very clearly, what extra context would make the next answer better. What changed once the agent had context The easiest way to evaluate the onboarding is to look at the first questions we asked after setup. We started with a simple one: What do you know about the Container App in the rg-big-refactor resource group? The agent used Azure CLI to inspect the app, its revisions, and the system logs, then came back with a concise summary: image version, resource sizing, ingress, scale-to-zero behavior, and probe failures during cold start. It also correctly called out that the readiness probe noise was expected and not the root of a real outage. 
That answer was useful because it was grounded in the actual resource, not in generic advice about Container Apps.

Then we asked a harder question:

"Based on the connected repo, what authentication flow does this app use? If a user reports 401s, what should we check first?"

The agent opened authConfig.ts, Program.cs, useAuth.ts, postprovision.ps1, and entra-app.bicep, then traced the auth path end to end. The checklist it produced was exactly the kind of thing we hoped onboarding would unlock: client ID alignment, identifier URI issues, redirect URI mismatches, audience validation, missing scopes, token expiry handling, and the single-tenant assumption in the backend. It even pointed to the place in Program.cs where extra logging could be enabled.

Without the repo, this would have been a boilerplate answer about JWTs. With the repo, it read like advice from someone who had already been paged for this app before.

We did not stop at setup. We wired real monitoring.

A polished demo can make any agent look capable, so we pushed further. We set up live Azure Monitor alerts for the sample web app instead of leaving the incident side as dummy data. We created three alerts:

HTTP 5xx errors (Sev 1), for more than 3 server errors in 5 minutes
Container restarts (Sev 2), to catch crash loops and OOMs
High response latency (Sev 2), when average response time goes above 10 seconds

The high-latency alert fired almost immediately. The app was scaling from zero, and the cold start was slow enough to trip the threshold. That was perfect. It gave us a real incident to put through the system instead of a fictional one.

Incident response plans

From the Builder menu, we created a response plan targeted at incidents with foundry-webapp in the title and severity 1 or 2. The incident that had just fired showed up in the learning flow.
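The targeting rule for that response plan (title contains foundry-webapp, severity 1 or 2) can be written down as a small predicate. This is a hedged sketch: the Incident and PlanFilter types are hypothetical, and the case-insensitive title match is an assumption, not documented product behavior.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Incident:
    title: str
    severity: int


@dataclass(frozen=True)
class PlanFilter:
    title_contains: str
    severities: frozenset[int]

    def matches(self, incident: Incident) -> bool:
        """True when the title contains the text and the severity is targeted."""
        return (
            self.title_contains.lower() in incident.title.lower()
            and incident.severity in self.severities
        )


# The plan from the walkthrough: foundry-webapp in the title, severity 1 or 2.
plan = PlanFilter(title_contains="foundry-webapp", severities=frozenset({1, 2}))
```

Under this rule the Sev 2 latency alert that had just fired would match, while a Sev 3 incident on the same app would not.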
We used the actual codebase and deployment details to write the default plan: which files to inspect for failures, how to reason about health probes, and how to tell the difference between a cold start and a real crash. That felt like an important moment in the product. The response plan was not generic incident theater. It was anchored in the system we had just onboarded. One of the most useful demos was the agent debugging itself The sharpest proof point came when we tried to query the Log Analytics workspace from the agent. We expected it to query tables and summarize what it found. Instead, it hit insufficient_scope. That could have been a dead end. Instead, the agent turned the failure into the investigation. It identified the missing permissions, noticed there were two managed identities in play, told us which RBAC roles were required, and gave us the exact commands to apply them. After we fixed the access, it retried and ran a series of KQL queries against the workspace. That is where it found the next problem: Container Apps platform logs were present, but AppRequests, AppExceptions, and the rest of the App Insights-style tables were still empty. That was not a connector bug. It was a real observability gap in the sample app. The backend had OpenTelemetry packages, but the exporter configuration was not actually sending the telemetry we expected. The agent did not just tell us that data was missing. It explained which data was present, which data was absent, and why that difference mattered. That is the sort of thing we wanted this onboarding to set up: not just answering the first question, but exposing the next real thing that needs fixing. We also asked it to triage the repo backlog Once the repo was connected, it was natural to see how well the agent could read open issues against the code. We pointed it at the three open GitHub issues in the sample repo and asked it to triage them. 
It opened the relevant files, compared the code to the issue descriptions, and came back with a clear breakdown:

Issue #21, "@fluentui-copilot is not opensource?": Partially valid, low severity. The package is public and MIT licensed. The real concern is package maturity, not licensing.
Issue #20, "SDK fails to deserialize agent tool definitions": Confirmed, medium severity. The agent traced the problem to metadata handling in AgentFrameworkService.cs and suggested a safe fallback path.
Issue #19, "Create Preview experience from AI Foundry is incomplete": Confirmed, medium severity. The agent found the gap between the environment variables people are told to paste and the variables the app actually expects.

What stood out to us was not just that the output was correct. It was that the agent was careful. It did not overclaim. It separated a documentation concern from two real product bugs. Then it asked whether we wanted it to start implementing the fixes. That is the posture we want from an engineering agent: useful, specific, and a little humble.

What the onboarding is really doing

After working through the whole flow, we do not think of onboarding as a wizard anymore. We think of it as the process of giving the agent a fair shot. Each connection removes one reason for the model to bluff:

Code keeps it from guessing how the system works.
Logs keep it from guessing what data exists.
Incidents keep it close to operational reality.
Azure resource scope keeps it from wandering.
Knowledge files keep team-specific context from getting lost.

This is the same lesson we learned building the product itself. The agent does better when it can discover context progressively inside a world that is real and well-scoped. Good onboarding is how you create that world.

Closing

The main thing we learned from this work is simple: onboarding is not done when the resource exists. It is done when the agent can help with a real problem.
In one setup we were able to connect a real app, fire a real alert, create a real response plan, debug a real RBAC problem, inspect real logs, and triage real GitHub issues. That is a much better standard than "the wizard completed successfully."

If you try SRE Agent after GA, start there. Connect the things that make your system legible, then ask a question that would actually matter during a bad day. The answer will tell you very quickly whether the agent has a real starting point.

Create your SRE Agent ->

Azure SRE Agent is generally available starting March 10, 2026.