serverless
267 TopicsEvent-Driven IaC Operations with Azure SRE Agent: Terraform Drift Detection via HTTP Triggers
What Happens After terraform plan Finds Drift? If your team is like most, the answer looks something like this: A nightly terraform plan runs and finds 3 drifted resources A notification lands in Slack or Teams Someone files a ticket During the next sprint, an engineer opens 4 browser tabs — Terraform state, Azure Portal, Activity Log, Application Insights — and spends 30 minutes piecing together what happened They discover the drift was caused by an on-call engineer who scaled up the App Service during a latency incident at 2 AM They revert the drift with terraform apply The app goes down because they just scaled it back down while the bug that caused the incident is still deployed Step 7 is the one nobody talks about. Drift detection tooling has gotten remarkably good — scheduled plans, speculative runs, drift alerts — but the output is always the same: a list of differences. What changed. Not why. Not whether it's safe to fix. The gap isn't detection. It's everything that happens after detection. HTTP Triggers in Azure SRE Agent close that gap. They turn the structured output that drift detection already produces — webhook payloads, plan summaries, run notifications — into the starting point of an autonomous investigation. Detection feeds the agent. The agent does the rest: correlates with incidents, reads source code, classifies severity, recommends context-aware remediation, notifies the team, and even ships a fix. Here's what that looks like end to end. What you'll see in this blog: An agent that classifies drift as Benign, Risky, or Critical — not just "changed" Incident correlation that links a SKU change to a latency spike in Application Insights A remediation recommendation that says "Do NOT revert" — and why reverting would cause an outage A Teams notification with the full investigation summary An agent that reviews its own performance, finds gaps, and improves its own skill file A pull request the agent created on its own to fix the root cause The Pipeline: Detection to Resolution in One Webhook The architecture is straightforward. Terraform Cloud (or any drift detection tool) sends a webhook when it finds drift. An Azure Logic App adds authentication. The SRE Agent's HTTP Trigger receives it and starts an autonomous investigation. The end-to-end pipeline: Terraform Cloud detects drift and sends a webhook. The Logic App adds Azure AD authentication via Managed Identity. The SRE Agent's HTTP Trigger fires and the agent autonomously investigates across 7 dimensions. Setting Up the Pipeline Step 1: Deploy the Infrastructure with Terraform We start with a simple Azure App Service running a Node.js application, deployed via Terraform. The Terraform configuration defines the desired state: App Service Plan: B1 (Basic) — single vCPU, ~$13/mo App Service: Node 20-lts with TLS 1.2 Tags: environment: demo, managed_by: terraform, project: sre-agent-iac-blog resource "azurerm_service_plan" "demo" { name = "iacdemo-plan" resource_group_name = azurerm_resource_group.demo.name location = azurerm_resource_group.demo.location os_type = "Linux" sku_name = "B1" } A Logic App is also deployed to act as the authentication bridge between Terraform Cloud webhooks and the SRE Agent's HTTP Trigger endpoint, using Managed Identity to acquire Azure AD tokens. Learn more about HTTP Triggers here. Step 2: Create the Drift Analysis Skill Skills are domain knowledge files that teach the agent how to approach a problem. We create a terraform-drift-analysis skill with an 8-step workflow: Identify Scope — Which resource group and resources to check Detect Drift — Compare Terraform config against Azure reality Correlate with Incidents — Check Activity Log and App Insights Classify Severity — Benign, Risky, or Critical Investigate Root Cause — Read source code from the connected repository Generate Drift Report — Structured summary with severity-coded table Recommend Smart Remediation — Context-aware: don't blindly revert Notify Team — Post findings to Microsoft Teams The key insight in the skill: "NEVER revert critical drift that is actively mitigating an incident." This teaches the agent to think like an experienced SRE, not just a diff tool. Step 3: Create the HTTP Trigger In the SRE Agent UI, we create an HTTP Trigger named tfc-drift-handler with a 7-step agent prompt: A Terraform Cloud run has completed and detected infrastructure drift. Workspace: {payload.workspace_name} Organization: {payload.organization_name} Run ID: {payload.run_id} Run Message: {payload.run_message} STEP 1 — DETECT DRIFT: Compare Terraform configuration against actual Azure state... STEP 2 — CORRELATE WITH INCIDENTS: Check Azure Activity Log and App Insights... STEP 3 — CLASSIFY SEVERITY: Rate each drift item as Benign, Risky, or Critical... STEP 4 — INVESTIGATE ROOT CAUSE: Read the application source code... STEP 5 — GENERATE DRIFT REPORT: Produce a structured summary... STEP 6 — RECOMMEND SMART REMEDIATION: Context-aware recommendations... STEP 7 — NOTIFY TEAM: Post a summary to Microsoft Teams... Step 4: Connect GitHub and Teams We connect two integrations in the SRE Agent Connectors settings: Code Repository: GitHub — so the agent can read application source code during investigations Notification: Microsoft Teams — so the agent can post drift reports to the team channel The Incident Story Act 1: The Latency Bug Our demo app has a subtle but devastating bug. The /api/data endpoint calls processLargeDatasetSync() — a function that sorts an array on every iteration, creating an O(n² log n) blocking operation. On a B1 App Service Plan (single vCPU), this blocks the Node.js event loop entirely. Under load, response times spike from milliseconds to 25-58 seconds, with 502 Bad Gateway errors from the Azure load balancer. Act 2: The On-Call Response An on-call engineer sees the latency alerts and responds — not through Terraform, but directly through the Azure Portal and CLI. They: Add diagnostic tags — manual_update=True, changed_by=portal_user (benign) Downgrade TLS from 1.2 to 1.0 while troubleshooting (risky — security regression) Scale the App Service Plan from B1 to S1 to throw more compute at the problem (critical — cost increase from ~$13/mo to ~$73/mo) The incident is partially mitigated — S1 has more compute, so latency drops from catastrophic to merely bad. Everyone goes back to sleep. Nobody updates Terraform. Act 3: The Drift Check Fires The next morning, a nightly speculative Terraform plan runs and detects 3 drifted attributes. The notification webhook fires, flowing through the Logic App auth bridge to the SRE Agent HTTP Trigger. The agent wakes up and begins its investigation. What the Agent Found Layer 1: Drift Detection The agent compares Terraform configuration against Azure reality and produces a severity-classified drift report: Three drift items detected: Critical: App Service Plan SKU changed from B1 (~$13/mo) to S1 (~$73/mo) — a +462% cost increase Risky: Minimum TLS version downgraded from 1.2 to 1.0 — a security regression vulnerable to BEAST and POODLE attacks Benign: Additional tags (changed_by: portal_user, manual_update: True) — cosmetic, no functional impact Layer 2: Incident Correlation Here's where the agent goes beyond simple drift detection. It queries Application Insights and discovers a performance incident correlated with the SKU change: Key findings from the incident correlation: 97.6% of requests (40 of 41) were impacted by high latency The /api/data endpoint does not exist in the repository source code — the deployed application has diverged from the codebase The endpoint likely contains a blocking synchronous pattern — Node.js runs on a single event loop, and any synchronous blocking call would explain 26-58s response times The SKU scale-up from B1→S1 was an attempt to mitigate latency by adding more compute, but scaling cannot fix application-level blocking code on a single-threaded Node.js server Layer 3: Smart Remediation This is the insight that separates an autonomous agent from a reporting tool. Instead of blindly recommending "revert all drift," the agent produces context-aware remediation recommendations: The agent's remediation logic: Tags (Benign) → Safe to revert anytime via terraform apply -target TLS 1.0 (Risky) → Revert immediately — the TLS downgrade is a security risk unrelated to the incident SKU S1 (Critical) → DO NOT revert until the /api/data performance root cause is fixed This is the logic an experienced SRE would apply. Blindly running terraform apply to revert all drift would scale the app back down to B1 while the blocking code is still deployed — turning a mitigated incident into an active outage. Layer 4: Investigation Summary The agent produces a complete summary tying everything together: Key findings in the summary: Actor: surivineela@microsoft.com made all changes via Azure Portal at ~23:19 UTC Performance incident: /api/data averaging 25-57s latency, affecting 97.6% of requests Code-infrastructure mismatch: /api/data exists in production but not in the repository source code Root cause: SKU scale-up was emergency incident response, not unauthorized drift Layer 5: Teams Notification The agent posts a structured drift report to the team's Microsoft Teams channel: The on-call engineer opens Teams in the morning and sees everything they need: what drifted, why it drifted, and exactly what to do about it — without logging into any dashboard. The Payoff: A Self-Improving Agent Here's where the demo surprised us. After completing the investigation, the agent did two things we didn't explicitly ask for. The Agent Improved Its Own Skill The agent performed an Execution Review — analyzing what worked and what didn't during its investigation — and found 5 gaps in its own terraform-drift-analysis.md skill file: What worked well: Drift detection via az CLI comparison against Terraform HCL was straightforward Activity Log correlation identified the actor and timing Application Insights telemetry revealed the performance incident driving the SKU change Gaps it found and fixed: No incident correlation guidance — the skill didn't instruct checking App Insights No code-infrastructure mismatch detection — no guidance to verify deployed code matches the repository No smart remediation logic — didn't warn against reverting critical drift during active incidents Report template missing incident correlation column No Activity Log integration guidance — didn't instruct checking who made changes and when The agent then edited its own skill file to incorporate these learnings. Next time it runs a drift analysis, it will include incident correlation, code-infra mismatch checks, and smart remediation logic by default. This is a learning loop — every investigation makes the agent better at future investigations. The Agent Created a PR Without being asked, the agent identified the root cause code issue and proactively created a pull request to fix it: The PR includes: App safety fixes: Adding MAX_DELAY_MS and SERVER_TIMEOUT_MS constants to prevent unbounded latency Skill improvements: Incorporating incident correlation, code-infra mismatch detection, and smart remediation logic From a single webhook: drift detected → incident correlated → root cause found → team notified → skill improved → fix shipped. Key Takeaways Drift detection is not enough. Knowing that B1 changed to S1 is table stakes. Knowing it changed because of a latency incident, and that reverting it would cause an outage — that's the insight that matters. Context-aware remediation prevents outages. Blindly running terraform apply after drift would have scaled the app back to B1 while blocking code was still deployed. The agent's "DO NOT revert SKU" recommendation is the difference between fixing drift and causing a P1. Skills create a learning loop. The agent's self-review and skill improvement means every investigation makes the next one better — without human intervention. HTTP Triggers connect any platform. The auth bridge pattern (Logic App + Managed Identity) works for Terraform Cloud, but the same architecture applies to any webhook source: GitHub Actions, Jenkins, Datadog, PagerDuty, custom internal tools. The agent acts, not just reports. From a single webhook: drift detected, incident correlated, root cause identified, team notified via Teams, skill improved, and PR created. End-to-end in one autonomous session. Getting Started HTTP Triggers are available now in Azure SRE Agent: Create a Skill — Teach the agent your operational runbook (in this case, drift analysis with severity classification and smart remediation) Create an HTTP Trigger — Define your agent prompt with {payload.X} placeholders and connect it to a skill Set Up an Auth Bridge — Deploy a Logic App with Managed Identity to handle Azure AD token acquisition Connect Your Source — Point Terraform Cloud (or any webhook-capable platform) at the Logic App URL Connect GitHub + Teams — Give the agent access to source code and team notifications Within minutes, you'll have an autonomous pipeline that turns infrastructure drift events into fully contextualized investigations — with incident correlation, root cause analysis, and smart remediation recommendations. The full implementation guide, Terraform files, skill definitions, and demo scripts are available in this repository.208Views0likes0CommentsThe Agent that investigates itself
Azure SRE Agent handles tens of thousands of incident investigations each week for internal Microsoft services and external teams running it for their own systems. Last month, one of those incidents was about the agent itself. Our KV cache hit rate alert started firing. Cached token percentage was dropping across the fleet. We didn't open dashboards. We simply asked the agent. It spawned parallel subagents, searched logs, read through its own source code, and produced the analysis. First finding: Claude Haiku at 0% cache hits. The agent checked the input distribution and found that the average call was ~180 tokens, well below Anthropic’s 4,096-token minimum for Haiku prompt caching. Structurally, these requests could never be cached. They were false positives. The real regression was in Claude Opus: cache hit rate fell from ~70% to ~48% over a week. The agent correlated the drop against the deployment history and traced it to a single PR that restructured prompt ordering, breaking the common prefix that caching relies on. It submitted two fixes: one to exclude all uncacheable requests from the alert, and the other to restore prefix stability in the prompt pipeline. That investigation is how we develop now. We rarely start with dashboards or manual log queries. We start by asking the agent. Three months earlier, it could not have done any of this. The breakthrough was not building better playbooks. It was harness engineering: enabling the agent to discover context as the investigation unfolded. This post is about the architecture decisions that made it possible. Where we started In our last post, Context Engineering for Reliable AI Agents: Lessons from Building Azure SRE Agent, we described how moving to a single generalist agent unlocked more complex investigations. The resolution rates were climbing, and for many internal teams, the agent could now autonomously investigate and mitigate roughly 50% of incidents. We were moving in the right direction. But the scores weren't uniform, and when we dug into why, the pattern was uncomfortable. The high-performing scenarios shared a trait: they'd been built with heavy human scaffolding. They relied on custom response plans for specific incident types, hand-built subagents for known failure modes, and pre-written log queries exposed as opaque tools. We weren’t measuring the agent’s reasoning – we were measuring how much engineering had gone into the scenario beforehand. On anything new, the agent had nowhere to start. We found these gaps through manual review. Every week, engineers read through lower-scored investigation threads and pushed fixes: tighten a prompt, fix a tool schema, add a guardrail. Each fix was real. But we could only review fifty threads a week. The agent was handling ten thousand. We were debugging at human speed. The gap between those two numbers was where our blind spots lived. We needed an agent powerful enough to take this toil off us. An agent which could investigate itself. Dogfooding wasn't a philosophy - it was the only way to scale. The Inversion: Three bets The problem we faced was structural - and the KV cache investigation shows it clearly. The cache rate drop was visible in telemetry, but the cause was not. The agent had to correlate telemetry with deployment history, inspect the relevant code, and reason over the diff that broke prefix stability. We kept hitting the same gap in different forms: logs pointing in multiple directions, failure modes in uninstrumented paths, regressions that only made sense at the commit level. Telemetry showed symptoms, but not what actually changed. We'd been building the agent to reason over telemetry. We needed it to reason over the system itself. The instinct when agents fail is to restrict them: pre-write the queries, pre-fetch the context, pre-curate the tools. It feels like control. In practice, it creates a ceiling. The agent can only handle what engineers anticipated in advance. The answer is an agent that can discover what it needs as the investigation unfolds. In the KV cache incident, each step, from metric anomaly to deployment history to a specific diff, followed from what the previous step revealed. It was not a pre-scripted path. Navigating towards the right context with progressive discovery is key to creating deep agents which can handle novel scenarios. Three architectural decisions made this possible – and each one compounded on the last. Bet 1: The Filesystem as the Agent's World Our first bet was to give the agent a filesystem as its workspace instead of a custom API layer. Everything it reasons over – source code, runbooks, query schemas, past investigation notes – is exposed as files. It interacts with that world using read_file, grep, find, and shell. No SearchCodebase API. No RetrieveMemory endpoint. This is an old Unix idea: reduce heterogeneous resources to a single interface. Coding agents already work this way. It turns out the same pattern works for an SRE agent. Frontier models are trained on developer workflows: navigating repositories, grepping logs, patching files, running commands. The filesystem is not an abstraction layered on top of that prior. It matches it. When we materialized the agent’s world as a repo-like workspace, our human "Intent Met" score - whether the agent's investigation addressed the actual root cause as judged by the on-call engineer - rose from 45% to 75% on novel incidents. But interface design is only half the story. The other half is what you put inside it. Code Repositories: the highest-leverage context Teams had prewritten log queries because they did not trust the agent to generate correct ones. That distrust was justified. Models hallucinate table names, guess column schemas, and write queries against the wrong cluster. But the answer was not tighter restriction. It was better grounding. The repo is the schema. Everything else is derived from it. When the agent reads the code that produces the logs, query construction stops being guesswork. It knows the exact exceptions thrown, and the conditions under which each path executes. Stack traces start making sense, and logs become legible. But beyond query grounding, code access unlocked three new capabilities that telemetry alone could not provide: Ground truth over documentation. Docs drift and dashboards show symptoms. The code is what the service actually does. In practice, most investigations only made sense when logs were read alongside implementation. Point-in-time investigation. The agent checks out the exact commit at incident time, not current HEAD, so it can correlate the failure against the actual diffs. That's what cracked the KV cache investigation: a PR broke prefix stability, and the diff was the only place this was visible. Without commit history, you can't distinguish a code regression from external factors. Reasoning even where telemetry is absent. Some code paths are not well instrumented. The agent can still trace logic through source and explain behavior even when logs do not exist. This is especially valuable in novel failure modes – the ones most likely to be missed precisely because no one thought to instrument them. Memory as a filesystem, not a vector store Our first memory system used RAG over past session learnings. It had a circular dependency: a limited agent learned from limited sessions and produced limited knowledge. Garbage in, garbage out. But the deeper problem was retrieval. In SRE Context, embedding similarity is a weak proxy for relevance. “KV cache regression” and “prompt prefix instability” may be distant in embedding space yet still describe the same causal chain. We tried re-ranking, query expansion, and hybrid search. None fixed the core mismatch between semantic similarity and diagnostic relevance. We replaced RAG with structured Markdown files that the agent reads and writes through its standard tool interface. The model names each file semantically: overview.md for a service summary, team.md for ownership and escalation paths, logs.md for cluster access and query patterns, debugging.md for failure modes and prior learnings. Each carry just enough context to orient the agent, with links to deeper files when needed. The key design choice was to let the model navigate memory, not retrieve it through query matching. The agent starts from a structured entry point and follows the evidence toward what matters. RAG assumes you know the right query before you know what you need. File traversal lets relevance emerge as context accumulates. This removed chunking, overlap tuning, and re-ranking entirely. It also proved more accurate, because frontier models are better at following context than embeddings are at guessing relevance. As a side benefit, memory state can be snapshotted periodically. One problem remains unsolved: staleness. When two sessions write conflicting patterns to debugging.md, the model must reconcile them. When a service changes behavior, old entries can become misleading. We rely on timestamps and explicit deprecation notes, but we do not have a systemic solution yet. This is an active area of work, and anyone building memory at scale will run into it. The sandbox as epistemic boundary The filesystem also defines what the agent can see. If something is not in the sandbox, the agent cannot reason about it. We treat that as a feature, not a limitation. Security boundaries and epistemic boundaries are enforced by the same mechanism. Inside that boundary, the agent has full execution: arbitrary bash, python, jq, and package installs through pip or apt. That scope unlocks capabilities we never would have built as custom tools. It opens PRs with gh cli, like the prompt-ordering fix from KV cache incident. It pushes Grafana dashboards, like a cache-hit-rate dashboard we now track by model. It installs domain-specific CLI tools mid-investigation when needed. No bespoke integration required, just a shell. The recurring lesson was simple: a generally capable agent in the right execution environment outperforms a specialized agent with bespoke tooling. Custom tools accumulate maintenance costs. Shell commands compose for free. Bet 2: Context Layering Code access tells the agent what a service does. It does not tell the agent what it can access, which resources its tools are scoped to, or where an investigation should begin. This gap surfaced immediately. Users would ask "which team do you handle incidents for?" and the agent had no answer. Tools alone are not enough. An integration also needs ambient context so the model knows what exists, how it is configured, and when to use it. We fixed this with context hooks: structured context injected at prompt construction time to orient the agent before it takes action. Connectors - what can I access? A manifest of wired systems such as Log Analytics, Outlook, and Grafana, along with their configuration. Repositories - what does this system do? Serialized repo trees, plus files like AGENTS.md, Copilot.md, and CLAUDE.md with team-specific instructions. Knowledge map - what have I learned before? A two-tier memory index with a top-level file linking to deeper scenario-specific files, so the model can drill down only when needed. Azure resource topology - where do things live? A serialized map of relationships across subscriptions, resource groups, and regions, so investigations start in the right scope. Together, these context hooks turn a cold start into an informed one. That matters because a bad early choice does not just waste tokens. It sends the investigation down the wrong trajectory. A capable agent still needs to know what exists, what matters, and where to start. Bet 3: Frugal Context Management Layered context creates a new problem: budget. Serialized repo trees, resource topology, connector manifests, and a memory index fill context fast. Once the agent starts reading source files and logs, complex incidents hit context limits. We needed our context usage to be deliberately frugal. Tool result compression via the filesystem Large tool outputs are expensive because they consume context before the agent has extracted any value from them. In many cases, only a small slice or a derived summary of that output is actually useful. Our framework exposes these results as files to the agent. The agent can then use tools like grep, jq, or python to process them outside the model interface, so that only the final result enters context. The filesystem isn't just a capability abstraction - it's also a budget management primitive. Context Pruning and Auto Compact Long investigations accumulate dead weight. As hypotheses narrow, earlier context becomes noise. We handle this with two compaction strategies. Context Pruning runs mid-session. When context usage crosses a threshold, we trim or drop stale tool calls and outputs - keeping the window focused on what still matters. Auto-Compact kicks in when a session approaches its context limit. The framework summarizes findings and working hypotheses, then resumes from that summary. From the user's perspective, there's no visible limit. Long investigations just work. Parallel subagents The KV cache investigation required reasoning along two independent hypotheses: whether the alert definition was sound, and whether cache behavior had actually regressed. The agent spawned parallel subagents for each task, each operating in its own context window. Once both finished, it merged their conclusions. This pattern generalizes to any task with independent components. It speeds up the search, keeps intermediate work from consuming the main context window, and prevents one hypothesis from biasing another. The Feedback loop These architectural bets have enabled us to close the original scaling gap. Instead of debugging the agent at human speed, we could finally start using it to fix itself. As an example, we were hitting various LLM errors: timeouts, 429s (too many requests), failures in the middle of response streaming, 400s from code bugs that produced malformed payloads. These paper cuts would cause investigations to stall midway and some conversations broke entirely. So, we set up a daily monitoring task for these failures. The agent searches for the last 24 hours of errors, clusters the top hitters, traces each to its root cause in the codebase, and submits a PR. We review it manually before merging. Over two weeks, the errors were reduced by more than 80%. Over the last month, we have successfully used our agent across a wide range of scenarios: Analyzed our user churn rate and built dashboards we now review weekly. Correlated which builds needed the most hotfixes, surfacing flaky areas of the codebase. Ran security analysis and found vulnerabilities in the read path. Helped fill out parts of its own Responsible AI review, with strict human review. Handles customer-reported issues and LiveSite alerts end to end. Whenever it gets stuck, we talk to it and teach it, ask it to update its memory, and it doesn't fail that class of problem again. The title of this post is literal. The agent investigating itself is not a metaphor. It is a real workflow, driven by scheduled tasks, incident triggers, and direct conversations with users. What We Learned We spent months building scaffolding to compensate for what the agent could not do. The breakthrough was removing it. Every prewritten query was a place we told the model not to think. Every curated tool was a decision made on its behalf. Every pre-fetched context was a guess about what would matter before we understood the problem. The inversion was simple but hard to accept: stop pre-computing the answer space. Give the model a structured starting point, a filesystem it knows how to navigate, context hooks that tell it what it can access, and budget management that keeps it sharp through long investigations. The agent that investigates itself is both the proof and the product of this approach. It finds its own bugs, traces them to root causes in its own code, and submits its own fixes. Not because we designed it to. Because we designed it to reason over systems, and it happens to be one. We are still learning. Staleness is unsolved, budget tuning remains largely empirical, and we regularly discover assumptions baked into context that quietly constrain the agent. But we have crossed a new threshold: from an agent that follows your playbook to one that writes the next one. Thanks to visagarwal for co-authoring this post.13KViews6likes0CommentsAzure Functions Ignite 2025 Update
Azure Functions is redefining event-driven applications and high-scale APIs in 2025, accelerating innovation for developers building the next generation of intelligent, resilient, and scalable workloads. This year, our focus has been on empowering AI and agentic scenarios: remote MCP server hosting, bulletproofing agents with Durable Functions, and first-class support for critical technologies like OpenTelemetry, .NET 10 and Aspire. With major advances in serverless Flex Consumption, enhanced performance, security, and deployment fundamentals across Elastic Premium and Flex, Azure Functions is the platform of choice for building modern, enterprise-grade solutions. Remote MCP Model Context Protocol (MCP) has taken the world by storm, offering an agent a mechanism to discover and work deeply with the capabilities and context of tools. When you want to expose MCP/tools to your enterprise or the world securely, we recommend you think deeply about building remote MCP servers that are designed to run securely at scale. Azure Functions is uniquely optimized to run your MCP servers at scale, offering serverless and highly scalable features of Flex Consumption plan, plus two flexible programming model options discussed below. All come together using the hardened Functions service plus new authentication modes for Entra and OAuth using Built-in authentication. Remote MCP Triggers and Bindings Extension GA Back in April, we shared a new extension that allows you to author MCP servers using functions with the MCP tool trigger. That MCP extension is now generally available, with support for C#(.NET), Java, JavaScript (Node.js), Python, and Typescript (Node.js). The MCP tool trigger allows you to focus on what matters most: the logic of the tool you want to expose to agents. Functions will take care of all the protocol and server logistics, with the ability to scale out to support as many sessions as you want to throw at it. [Function(nameof(GetSnippet))] public object GetSnippet( [McpToolTrigger(GetSnippetToolName, GetSnippetToolDescription)] ToolInvocationContext context, [BlobInput(BlobPath)] string snippetContent ) { return snippetContent; } New: Self-hosted MCP Server (Preview) If you’ve built servers with official MCP SDKs and want to run them as remote cloud‑scale servers without re‑writing any code, this public preview is for you. You can now self‑host your MCP server on Azure Functions—keep your existing Python, TypeScript, .NET, or Java code and get rapid 0 to N scaling, built-in server authentication and authorization, consumption-based billing, and more from the underlying Azure Functions service. This feature complements the Azure Functions MCP extension for building MCP servers using the Functions programming model (triggers & bindings). Pick the path that fits your scenario—build with the extension or standard MCP SDKs. Either way you benefit from the same scalable, secure, and serverless platform. Use the official MCP SDKs: # MCP.tool() async def get_alerts(state: str) -> str: """Get weather alerts for a US state. Args: state: Two-letter US state code (e.g. CA, NY) """ url = f"{NWS_API_BASE}/alerts/active/area/{state}" data = await make_nws_request(url) if not data or "features" not in data: return "Unable to fetch alerts or no alerts found." if not data["features"]: return "No active alerts for this state." alerts = [format_alert(feature) for feature in data["features"]] return "\n---\n".join(alerts) Use Azure Functions Flex Consumption Plan's serverless compute using Custom Handlers in host.json: { "version": "2.0", "configurationProfile": "mcp-custom-handler", "customHandler": { "description": { "defaultExecutablePath": "python", "arguments": ["weather.py"] }, "http": { "DefaultAuthorizationLevel": "anonymous" }, "port": "8000" } } Learn more about MCPTrigger and self-hosted MCP servers at https://aka.ms/remote-mcp Built-in MCP server authorization (Preview) The built-in authentication and authorization feature can now be used for MCP server authorization, using a new preview option. You can quickly define identity-based access control for your MCP servers with Microsoft Entra ID or other OpenID Connect providers. Learn more at https://aka.ms/functions-mcp-server-authorization. Better together with Foundry agents Microsoft Foundry is the starting point for building intelligent agents, and Azure Functions is the natural next step for extending those agents with remote MCP tools. Running your tools on Functions gives you clean separation of concerns, reuse across multiple agents, and strong security isolation. And with built-in authorization, Functions enables enterprise-ready authentication patterns, from calling downstream services with the agent’s identity to operating on behalf of end users with their delegated permissions. Build your first remote MCP server and connect it to your Foundry agent at https://aka.ms/foundry-functions-mcp-tutorial. Agents Microsoft Agent Framework 2.0 (Public Preview Refresh) We’re excited about the preview refresh 2.0 release of Microsoft Agent Framework that builds on battle hardened work from Semantic Kernel and AutoGen. Agent Framework is an outstanding solution for building multi-agent orchestrations that are both simple and powerful. Azure Functions is a strong fit to host Agent Framework with the service’s extreme scale, serverless billing, and enterprise grade features like VNET networking and built-in auth. Durable Task Extension for Microsoft Agent Framework (Preview) The durable task extension for Microsoft Agent Framework transforms how you build production-ready, resilient and scalable AI agents by bringing the proven durable execution (survives crashes and restarts) and distributed execution (runs across multiple instances) capabilities of Azure Durable Functions directly into the Microsoft Agent Framework. Combined with Azure Functions for hosting and event-driven execution, you can now deploy stateful, resilient AI agents that automatically handle session management, failure recovery, and scaling, freeing you to focus entirely on your agent logic. Key features of the durable task extension include: Serverless Hosting: Deploy agents on Azure Functions with auto-scaling from thousands of instances to zero, while retaining full control in a serverless architecture. Automatic Session Management: Agents maintain persistent sessions with full conversation context that survives process crashes, restarts, and distributed execution across instances Deterministic Multi-Agent Orchestrations: Coordinate specialized durable agents with predictable, repeatable, code-driven execution patterns Human-in-the-Loop with Serverless Cost Savings: Pause for human input without consuming compute resources or incurring costs Built-in Observability with Durable Task Scheduler: Deep visibility into agent operations and orchestrations through the Durable Task Scheduler UI dashboard Create a durable agent: endpoint = os.getenv("AZURE_OPENAI_ENDPOINT") deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini") # Create an AI agent following the standard Microsoft Agent Framework pattern agent = AzureOpenAIChatClient( endpoint=endpoint, deployment_name=deployment_name, credential=AzureCliCredential() ).create_agent( instructions="""You are a professional content writer who creates engaging, well-structured documents for any given topic. When given a topic, you will: 1. Research the topic using the web search tool 2. Generate an outline for the document 3. Write a compelling document with proper formatting 4. Include relevant examples and citations""", name="DocumentPublisher", tools=[ AIFunctionFactory.Create(search_web), AIFunctionFactory.Create(generate_outline) ] ) # Configure the function app to host the agent with durable session management app = AgentFunctionApp(agents=[agent]) app.run() Durable Task Scheduler dashboard for agent and agent workflow observability and debugging For more information on the durable task extension for Agent Framework, see the announcement: https://aka.ms/durable-extension-for-af-blog. Flex Consumption Updates As you know, Flex Consumption means serverless without compromise. It combines elastic scale and pay‑for‑what‑you‑use pricing with the controls you expect: per‑instance concurrency, longer executions, VNet/private networking, and Always Ready instances to minimize cold starts. Since launching GA at Ignite 2024 last year, Flex Consumption has had tremendous growth with over 1.5 billion function executions per day and nearly 40 thousand apps. Here’s what’s new for Ignite 2025: 512 MB instance size (GA). Right‑size lighter workloads, scale farther within default quota. Availability Zones (GA). Distribute instances across zones. Rolling updates (Public Preview). Unlock zero-downtime deployments of code or config by setting a single configuration. See below for more information. Even more improvements including: new diagnostic settingsto route logs/metrics, use Key Vault App Config references, new regions, and Custom Handler support. To get started, review Flex Consumption samples, or dive into the documentation to see how Flex can support your workloads. Migrating to Azure Functions Flex Consumption Migrating to Flex Consumption is simple with our step-by-step guides and agentic tools. Move your Azure Functions apps or AWS Lambda workloads, update your code and configuration, and take advantage of new automation tools. With Linux Consumption retiring, now is the time to switch. For more information, see: Migrate Consumption plan apps to the Flex Consumption plan Migrate AWS Lambda workloads to Azure Functions Durable Functions Durable Functions introduces powerful new features to help you build resilient, production-ready workflows: Distributed Tracing: lets you track requests across components and systems, giving you deep visibility into orchestration and activities with support for App Insights and OpenTelemetry. Extended Sessions support in .NET isolated: improves performance by caching orchestrations in memory, ideal for fast sequential activities and large fan-out/fan-in patterns. Orchestration versioning (public preview): enables zero-downtime deployments and backward compatibility, so you can safely roll out changes without disrupting in-flight workflows Durable Task Scheduler Updates Durable Task Scheduler Dedicated SKU (GA): Now generally available, the Dedicated SKU offers advanced orchestration for complex workflows and intelligent apps. It provides predictable pricing for steady workloads, automatic checkpointing, state protection, and advanced monitoring for resilient, reliable execution. Durable Task Scheduler Consumption SKU (Public Preview): The new Consumption SKU brings serverless, pay-as-you-go orchestration to dynamic and variable workloads. It delivers the same orchestration capabilities with flexible billing, making it easy to scale intelligent applications as needed. For more information see: https://aka.ms/dts-ga-blog OpenTelemetry support in GA Azure Functions OpenTelemetry is now generally available, bringing unified, production-ready observability to serverless applications. Developers can now export logs, traces, and metrics using open standards—enabling consistent monitoring and troubleshooting across every workload. Key capabilities include: Unified observability: Standardize logs, traces, and metrics across all your serverless workloads for consistent monitoring and troubleshooting. Vendor-neutral telemetry: Integrate seamlessly with Azure Monitor or any OpenTelemetry-compliant backend, ensuring flexibility and choice. Broad language support: Works with .NET (isolated), Java, JavaScript, Python, PowerShell, and TypeScript. Start using OpenTelemetry in Azure Functions today to unlock standards-based observability for your apps. For step-by-step guidance on enabling OpenTelemetry and configuring exporters for your preferred backend, see the documentation. Deployment with Rolling Updates (Preview) Achieving zero-downtime deployments has never been easier. The Flex Consumption plan now offers rolling updates as a site update strategy. Set a single property, and all future code deployments and configuration changes will be released with zero-downtime. Instead of restarting all instances at once, the platform now drains existing instances in batches while scaling out the latest version to match real-time demand. This ensures uninterrupted in-flight executions and resilient throughput across your HTTP, non-HTTP, and Durable workloads – even during intensive scale-out scenarios. Rolling updates are now in public preview. Learn more at https://aka.ms/functions/rolling-updates. Secure Identity and Networking Everywhere By Design Security and trust are paramount. Azure Functions incorporates proven best practices by design, with full support for managed identity—eliminating secrets and simplifying secure authentication and authorization. Flex Consumption and other plans offer enterprise-grade networking features like VNETs, private endpoints, and NAT gateways for deep protection. The Azure Portal streamlines secure function creation, and updated scenarios and samples showcase these identity and networking capabilities in action. Built-in authentication (discussed above) enables inbound client traffic to use identity as well. Check out our updated Functions Scenarios page with quickstarts or our secure samples gallery to see these identity and networking best practices in action. .NET 10 Azure Functions now supports .NET 10, bringing in a great suite of new features and performance benefits for your code. .NET 10 is supported on the isolated worker model, and it’s available for all plan types except Linux Consumption. As a reminder, support ends for the legacy in-process model on November 10, 2026, and the in-process model is not being updated with .NET 10. To stay supported and take advantage of the latest features, migrate to the isolated worker model. Aspire Aspire is an opinionated stack that simplifies development of distributed applications in the cloud. The Azure Functions integration for Aspire enables you to develop, debug, and orchestrate an Azure Functions .NET project as part of an Aspire solution. Aspire publish directly deploys to your functions to Azure Functions on Azure Container Apps. Aspire 13 includes an updated preview version of the Functions integration that acts as a release candidate with go-live support. The package will be moved to GA quality with Aspire 13.1. Java 25, Node.js 24 Azure Functions now supports Java 25 and Node.js 24 in preview. You can now develop functions using these versions locally and deploy them to Azure Functions plans. Learn how to upgrade your apps to these versions here In Summary Ready to build what’s next? Update your Azure Functions Core Tools today and explore the latest samples and quickstarts to unlock new capabilities for your scenarios. The guided quickstarts run and deploy in under 5 minutes, and incorporate best practices—from architecture to security to deployment. We’ve made it easier than ever to scaffold, deploy, and scale real-world solutions with confidence. The future of intelligent, scalable, and secure applications starts now—jump in and see what you can create!3.5KViews0likes2CommentsMigrate Data Ingestion from Data Collector to Log Ingestion
HTTP Data Collector API in Log Analytics workspaces is being deprecated, and will be totally out of support in September 2026. Data Collector actions in logic app using already created API connections (which uses workspace Id & Key) would still work against old custom log tables, however, newly created table will not be able to ingest data, although the connector would still succeed in logic app, but no data will be populated in newly created custom logs. In case new API connection is created for Data Collector action (using workspace Id & Key); these will fail with 403 - Forbidden action. Users should start using the Log Ingestion API to send data to custom tables, and this document will guide users on how to use Log Ingestion API in logic apps. Note: Azure portal currently is update so it doesn't show the Workspace keys in Log Analytics workspace page, however, Az CLI will still get the keys, but as stated, actions will fail with 403 when using them in Data Collector Action. Creating DCE & DCRs: To utilize the Log Ingestion API, Data Collection Endpoint & Data Collection Rule should be created first. DCE Creation is simple, from azure portal search for DCE, and then create a new one: For DCR creation, it can be either created from the DCR page in Azure Portal, or upon creating the custom log in Log Analytics workspace. DCR Popup You need to upload sample data file, so the custom log table has a schema, it needs to be JSON array. In case the sample log doesn't have a TimeGenerated field, you can easily add it using the mapping function as below: Add the below code in the Transformation box, then click run. Once you complete the DCR creation, we need to get the DCE full endpoint. Getting DCE Log Ingestion Full URL To get the full endpoint URL, please follow the below: 1. Get the DCE Log Ingestion URL from the DCE overview page: 2. On the DCR Page, get the immutable id for the DCR., then click on the JSON view of the DCR resource: 3. From the JSON view, get the stream name from the streamDeclarations field Now the full Log Ingestion URL is: DCE_URL/dataCollectionRules/{immutable_id}/streams/{streamName}?api-version=2023-01-01 It would be similar to: https://mshbou****.westeurope-1.ingest.monitor.azure.com/dataCollectionRules/dcr-7*****4e988bef2995cd52ae/streams/Custom-mshboulLogAPI_CL?api-version=2023-01-01 Granting Logic App MI needed IAM Roles: To call the ingestion endpoint using Logic Apps MI, we need to grant logic apps MI the role "Monitoring Metrics Publisher" over the DCR resource. To do this, open the DCR, from the blade choose Access Control (IAM), and then grant the logic app MI the role "Monitoring Metrics Publisher" Calling Log Ingestion Endpoint from logic apps: To call the ingestion endpoint from logic apps, we need to use the HTTP action, as below, the URI is the full DCE Endpoint we created before. Add the content-type headers, and the json body that contains the log data you want to send. For the authentication, it will be as below: Once executed, it should succeed, with status code 204. For more details on the Log Ingestion API, and the migration, please see our documentation: Migrate from the HTTP Data Collector API to the Log Ingestion API - Azure Monitor | Microsoft Learn Logs Ingestion API in Azure Monitor - Azure Monitor | Microsoft Learn Thanks.250Views0likes0CommentsAnnouncing general availability for the Azure SRE Agent
Today, we’re excited to announce the General Availability (GA) of Azure SRE Agent— your AI‑powered operations teammate that helps organizations improve uptime, reduce incident impact, and cut operational toil by accelerating diagnosis and automating response workflows.12KViews1like1CommentAzure Monitor in Azure SRE Agent: Autonomous Alert Investigation and Intelligent Merging
Azure Monitor is great at telling you something is wrong. But once the alert fires, the real work begins — someone has to open the portal, triage it, dig into logs, and figure out what happened. That takes time. And while they're investigating, the same alert keeps firing every few minutes, stacking up duplicates of a problem that's already being looked at. This is exactly what Azure SRE Agent's Azure Monitor integration addresses. The agent picks up alerts as they fire, investigates autonomously, and remediates when it can — all without waiting for a human to get involved. And when that same alert fires again while the investigation is still underway, the agent merges it into the existing thread rather than creating a new one. In this blog, we'll walk through the full Azure Monitor experience in SRE Agent with a live AKS + Redis scenario — how alerts get picked up, what the agent does with them, how merging handles the noise, and why one often-overlooked setting (auto-resolve) makes a bigger difference than you'd expect. Key Takeaways Set up Incident Response Plans to scope which alerts the agent handles — filter by severity, title patterns, and resource type. Start with review mode, then promote to autonomous once you trust the agent's behavior for that failure pattern. Recurring alerts merge into one thread automatically — when the same alert rule fires repeatedly, the agent merges subsequent firings into the existing investigation instead of creating duplicates. Turn auto-resolve OFF for persistent failures (bad credentials, misconfigurations, resource exhaustion) so all firings merge into one thread. Turn it ON for transient issues (traffic spikes, brief timeouts) so each gets a fresh investigation. Design alert rules around failure categories, not components — one alert rule = one investigation thread. Structure rules by symptom (Redis errors, HTTP errors, pod health) to give the agent focused, non-overlapping threads. Attach Custom Response Plans for specialized handling — route specific alert patterns to custom-agents with custom instructions, tools, and runbooks. It Starts with Any Azure Monitor Alert Before we get to the demo, a quick note on what SRE Agent actually watches. The agent queries the Azure Alerts Management REST API, which returns every fired alert regardless of signal type. Log search alerts, metric alerts, activity log alerts, smart detection, service health, Prometheus — all of them come through the same API, and the agent processes them all the same way. You don't need to configure connectors or webhooks per alert type. If it fires in Azure Monitor, the agent can see it. What you do need to configure is which alerts the agent should care about. That's where Incident Response Plans come in. Setting Up: Incident Response Plans and Alert Rules We start by heading to Settings > Incident Platform > Azure Monitor and creating an Incident Response Plan. Response Plans et you scope the agent's attention by severity, alert name patterns, target resource types, and — importantly — whether the agent should act autonomously or wait for human approval. Action: Match the agent mode to your confidence in the remediation, not just the severity. Use autonomous mode for well-understood failure patterns where the fix is predictable and safe (e.g., rolling back a bad config, restarting a pod). Use review mode for anything where you want a human to validate before the agent acts — especially Sev0/Sev1 alerts that touch critical systems. You can always start in review mode and promote to autonomous once you've validated the agent's behavior. For our demo, we created a Sev1 response plan in autonomous mode — meaning the agent would pick up any Sev1 alert and immediately start investigating and remediating, no approval needed. On the Azure Monitor side, we set up three log-based alert rules against our AKS cluster's Log Analytics workspace. The star of the show was a Redis connection error alert — a custom log search query looking for WRONGPASS, ECONNREFUSED, and other Redis failure signatures in ContainerLog: Each rule evaluates every 5 minutes with a 15-minute aggregation window. If the query returns any results, the alert fires. Simple enough. Breaking Redis (On Purpose) Our test app is a Node.js journal app on AKS, backed by Azure Cache for Redis. To create a realistic failure scenario, we updated the Redis password in the Kubernetes secret to a wrong value. The app pods picked up the bad credential, Redis connections started failing, and error logs started flowing. Within minutes, the Redis connection error alert fired. What Happened Next Here's where it gets interesting. We didn't touch anything — we just watched. The agent's scanner polls the Azure Monitor Alerts API every 60 seconds. It spotted the new alert (state: "New", condition: "Fired"), matched it against our Sev1 Incident Response Plan, and immediately acknowledged it in Azure Monitor — flipping the state to "Acknowledged" so other systems and humans know someone's on it. Then it created a new investigation thread. The thread included everything the agent needed to get started: the alert ID, rule name, severity, description, affected resource, subscription, resource group, and a deep-link back to the Azure Portal alert. From there, the agent went to work autonomously. It queried container logs, identified the Redis WRONGPASS errors, traced them to the bad secret, retrieved the correct access key from Azure Cache for Redis, updated the Kubernetes secret, and triggered a pod rollout. By the time we checked the thread, it was already marked "Completed." No pages. No human investigation. No context-switching. But the Alert Kept Firing... Here's the thing — our alert rule evaluates every 5 minutes. Between the first firing and the agent completing the fix, the alert fired again. And again. Seven times total over 35 minutes. Without intelligent handling, that would mean seven separate investigation threads. Seven notifications. Seven disruptions. SRE Agent handles this with alert merging. When a subsequent firing comes in for the same alert rule, the agent checks: is there already an active thread for this rule, created within the last 7 days, that hasn't been resolved or closed? If yes, the new firing gets silently merged into the existing thread — the total alert count goes up, the "Last fired" timestamp updates, and that's it. No new thread, no new notification, no interruption to the ongoing investigation. How merging decides: new thread or merge? Condition Result Same alert rule, existing thread still active Merged — alert count increments, no new thread Same alert rule, existing thread resolved/closed New thread — fresh investigation starts Different alert rule New thread — always separate Five minutes after the first alert, the second firing came in and that continued. The agent finished the fix and closed the thread, and the final tally was one thread, seven merged alerts — spanning 35 minutes of continuous firings. On the Azure Portal side, you can see all seven individual alert instances. Each one was acknowledged by the agent. 7 Redis Connection Error Alert entries, all Sev1, Fired condition, Closed by user, spanning 8:50 PM to 9:21 PM Seven firings. One investigation. One fix. That's the merge in action. The Auto-Resolve Twist Now here's the part we didn't expect to matter as much as it did. Azure Monitor has a setting called "Automatically resolve alerts". When enabled, Azure Monitor automatically transitions an alert to "Resolved" once the underlying condition clears — for example, when the Redis errors stop because the pod restarted. For our first scenario above, we had auto-resolve turned off. That's why the alert stayed in "Fired" state across all seven evaluation cycles, and all seven firings merged cleanly into one thread. But what happens if auto-resolve is on? We turned it on and ran the same scenario again: Here's what happened: Redis broke. Alert fired. Agent picked it up and created a thread. The agent investigated, found the bad Redis password, fixed it. With Redis working again, error logs stopped. We noticed that the condition cleared and closed all the 7 alerts manually. We broke Redis a second time (simulating a recurrence). The alert fired again — but the previous alert was already closed/resolved. The merge check found no active thread. A brand-new thread was created, reinvestigated and mitigated. Two threads for the same alert rule, right there on the Incidents page: And on the Azure Monitor side, the newest alert shows "Resolved" condition — that's the auto-resolve doing its thing: For a persistent failure like a Redis misconfiguration, this is clearly worse. You get a new investigation thread every break-fix cycle instead of one continuous investigation. So, Should You Just Turn Auto-Resolve Off? No. It depends on what kind of failure the alert is watching for. Quick Reference: Auto-Resolve Decision Guide Auto-Resolve OFF Auto-Resolve ON Use when Problem persists until fixed Problem is transient and self-correcting Examples Bad credentials, misconfigurations, CrashLoopBackOff, connection pool exhaustion, IOPS limits OOM kills during traffic spikes, brief latency from neighboring deployments, one-off job timeouts Merge behavior All repeat firings merge into one thread Each break-fix cycle creates a new thread Best for Agent is actively managing the alert lifecycle Each occurrence may have a different root cause Tradeoff Alerts stay in "Fired/Acknowledged" state in Azure Monitor until the agent closes them More threads, but each gets a clean investigation Turn auto-resolve OFF when you want repeated firings from the same alert rule to stay in a single investigation thread until the alert is explicitly resolved or closed in Azure Monitor. This works best for persistent issues such as a Kubernetes deployment stuck in CrashLoopBackOff because of a bad image tag, a database connection pool exhausted due to a leaked connection, or a storage account hitting its IOPS limit under sustained load. Turn auto-resolve ON when you want a new investigation thread after the previous occurrence has been resolved or closed in Azure Monitor. This works best for episodic or self-clearing issues such as a pod getting OOM-killed during a temporary traffic spike, a brief latency increases during a neighboring service’s deployment, or a scheduled job that times out once due to short-lived resource contention. The key question is: when this alert fires again, is it the same ongoing problem or a new one? If it's the same problem, turn auto-resolve off and let the merges do their job. If it's a new problem, leave auto-resolve on and let the agent investigate fresh. Note: These behaviors describe how SRE Agent groups alert investigations and may differ from how Azure Monitor documents native alert state behavior. A Few Things We Learned Along the Way Design alert rules around symptoms, not components. Each alert rule maps to one investigation thread. We structured ours around failure categories — root cause signal (Redis errors, Sev1), blast radius signal (HTTP errors, Sev2), infrastructure signal (unhealthy pods, Sev2). This gave the agent focused threads without overlap. Incident Response Plans let you tier your response. Not every alert needs the agent to go fix things immediately. We used a Sev1 filter in autonomous mode for the Redis alert, but you could set up a Sev2 filter in review mode — the agent investigates and provides analysis but waits for human approval before taking action. Response Plans specialize the agent. For specific alert patterns, you can give the agent custom instructions, specialized tools, and a tailored system prompt. A Redis alert can route to a custom-agent loaded with Redis-specific runbooks; a Kubernetes alert can route to one with deep kubectl expertise. Best Practices Checklist Here's what we learned distilled into concrete actions: Alert Rule Design Do Don't Design rules around failure categories (root cause, blast radius, infra health) Create one alert per component — you'll get overlapping threads Set evaluation frequency and aggregation window to match the failure pattern Use the same frequency for everything — transient vs. persistent issues need different cadences Example rule structure from our test: Root cause signal — Redis WRONGPASS/ECONNREFUSED errors → Sev1 Blast radius signal — HTTP 5xx response codes → Sev2 Infrastructure signal — KubeEvents Reason="Unhealthy" → Sev2 Incident Response Plan Setup Do Don't Create separate response plans per severity tier Use one catch-all filter for everything Start with review mode — especially for Sev0/Sev1 where wrong fixes are costly Jump straight to autonomous mode on critical alerts without validating agent behavior first Promote to autonomous mode once you've validated the agent handles a specific failure pattern correctly Assume severity alone determines the right mode — it's about confidence in the remediation Response Plans Do Don't Attach custom response plans to specific alert patterns for specialized handling Leave every alert to the agent's general knowledge Include custom instructions, tools, and runbooks relevant to the failure type Write generic instructions — the more specific, the better the investigation Route Redis alerts to a Redis-specialized custom-agent; K8s alerts to one with kubectl expertise Assume one agent configuration fits all failure types Getting Started Head to sre.azure.com and open your agent Make sure the agent's managed identity has Monitoring Reader on your target subscriptions Go to Settings > Incident Platform > Azure Monitor and create your Incident Response Plans Review the auto-resolve setting on your alert rules — turn it off for persistent issues, leave it on for transient ones (see the decision guide above) Start with a test response plan using Title Contains to target a specific alert rule — validate agent behavior before broadening Watch the Incidents page and review the agent's investigation threads before expanding to more alert rules Learn More Azure SRE Agent Documentation Incident Response Guide Azure Monitor Alert Rules243Views0likes0CommentsGive your Foundry Agent Custom Tools with MCP Servers on Azure Functions
This blog post is for developers who have an MCP server deployed to Azure Functions and want to connect it to Microsoft Foundry agents. It walks through why you'd want to do this, the different authentication options available, and how to get your agent calling your MCP tools. Connect your MCP server on Azure Functions to Foundry Agent If you've been following along with this blog series, you know that Azure Functions is a great place to host remote MCP servers. You get scalable infrastructure, built-in auth, and serverless billing. All the good stuff. But hosting an MCP server is only half the picture. The real value comes when something actually uses those tools. Microsoft Foundry lets you build AI agents that can reason, plan, and take actions. By connecting your MCP server to an agent, you're giving it access to your custom tools, whether that's querying a database, calling an API, or running some business logic. The agent discovers your tools, decides when to call them, and uses the results to respond to the user. Why connect MCP servers to Foundry agents? You might already have an MCP server that works great with VS Code, VS, Cursor, or other MCP clients. Connecting that same server to a Foundry agent means you can reuse those tools in a completely different context, i.e. in an enterprise AI agent that your team or customers interact with. No need to rebuild anything. Your MCP server stays the same; you're just adding another consumer. Prerequisites Before proceeding, make sure you have the following: 1. An MCP server deployed to Azure Functions. If you don't have one yet, you can deploy one quickly by following one of the samples: Python TypeScript .NET 2. A Foundry project with a deployed model and a Foundry agent Authentication options Depending on where you are in development, you can pick what makes sense and upgrade later. Here's a summary: Method Description When to use Key-based (default) Agent authenticates by passing a shared function access key in the request header. This method is the default authentication for HTTP endpoints in Functions. Development, or when Entra auth isn't required. Microsoft Entra Agent authenticates using either its own identity (agent identity) or the shared identity of the Foundry project (project managed identity). Use agent identity for production scenarios, but limit shared identity to development. OAuth identity passthrough Agent prompts users to sign in and authorize access, using the provided token to authenticate. Production, when each user must authenticate individually. Unauthenticated Agent makes unauthenticated calls. Development only, or tools that access only public information. Connect your MCP server to your Foundry agent If your server uses key-based auth or is unauthenticated, it should be relatively straightforward to set up the connection from a Foundry agent. The Microsoft Entra and OAuth identity passthrough are options that require extra steps to set up. Check out detailed step-by-step instructions for each authentication method. At a high level, the process looks like this: Enable built-in MCP authentication : When you deploy a server to Azure Functions, key-based auth is the default. You'll need to disable that and enable built-in MCP auth instead. If you deployed one of the sample servers in the Prerequisite section, this step is already done for you. Get your MCP server endpoint URL: For MCP extension-based servers, it's https://<FUNCTION_APP_NAME>.azurewebsites.net/runtime/webhooks/mcp Get your credentials based on your chosen auth method: a managed identity configuration, OAuth credentials Add the MCP server as a tool in the Foundry portal by navigating to your agent, adding a new MCP tool, and providing the endpoint and credentials. Microsoft Entra connection required fields OAuth Identity required fields Once the server is configured as a tool, test it in the Agent Builder playground by sending a prompt that triggers one of your MCP tools. Closing thoughts What I find exciting about this is the composability. You build your MCP server once and it works everywhere: VS Code, VS, Cursor, ChatGPT, and now Foundry agents. The MCP protocol is becoming the universal interface for tool use in AI, and Azure Functions makes it easy to host these servers at scale and with security. Are you building agents with Foundry? Have you connected your MCP servers to other clients? I'd love to hear what tools you're exposing and how you're using them. Share with us your thoughts! What's next In the next blog post, we'll go deeper into other MCP topics and cover new MCP features and developments in Azure Functions. Stay tuned!432Views0likes0CommentsMCP Apps on Azure Functions: Quick Start with TypeScript
Azure Functions makes hosting MCP apps simple: build locally, create a secure endpoint, and deploy fast with Azure Developer CLI (azd). This guide shows you how using a weather app example. What Are MCP Apps? MCP Apps let MCP servers return interactive HTML interfaces such as data visualizations, forms, dashboards that render directly inside MCP-compatible hosts (Visual Studio Code Copilot, Claude, ChatGPT, etc.). Learn more about MCP Apps in the official documentation. Having an interactive UI removes many restrictions that plain texts have, such as if your scenario has: Interactive Data: Replacing lists with clickable maps or charts for deep exploration. Complex Setup: Use one-page forms instead of long, back-and-forth questioning. Rich Media: Embed native viewers to pan, zoom, or rotate 3D models and documents. Live Updates: Maintain real-time dashboards that refresh without new prompts. Workflow Management: Handle multi-step tasks like approvals with navigation buttons and persistent state. MCP App Hosting as a Feature Azure Functions provides an easy abstraction to help you build MCP servers without having to learn the nitty-gritty of the MCP protocol. When hosting your MCP App on Functions, you get: MCP tools (server logic): Handle client requests, call backend services, return structured data - Azure Functions manages the MCP protocol details for you MCP resources (UI payloads such as app widgets): Serve interactive HTML, JSON documents, or formatted content - just focus on your UI logic Secure HTTPS access: Built-in authentication using Azure Functions keys, plus built-in MCP authentication with OAuth support for enterprise-grade security Easy deployment with Bicep and azd: Infrastructure as Code for reliable deployments Local development: Test and debug locally before deploying Auto-scaling: Azure Functions handles scaling, retries, and monitoring automatically The weather app in this repo is an example of this feature, not the only use case. Architecture Overview Example: The classic Weather App The sample implementation includes: A GetWeather MCP tool that fetches weather by location (calls Open-Meteo geocoding and forecast APIs) A Weather Widget MCP resource that serves interactive HTML/JS code (runs in the client; fetches data via GetWeather tool) A TypeScript service layer that abstracts API calls and data transformation (runs on the server) Bidirectional communication: client-side UI calls server-side tools, receives data, renders locally Local and remote testing flow for MCP clients (via MCP Inspector, VS Code, or custom clients) How UI Rendering Works in MCP Apps In the Weather App example: Azure Functions serves getWeatherWidget as a resource → returns weather-app.ts compiled to HTML/JS Client renders the Weather Widget UI User interacts with the widget or requests are made internally The widget calls the getWeather tool → server processes and returns weather data The widget renders the weather data on the client side This architecture keeps the UI responsive locally while using server-side logic and data on demand. Quick Start Checkout repository: https://github.com/Azure-Samples/remote-mcp-functions-typescript Run locally: npm install npm run build func start Local endpoint: http://0.0.0.0:7071/runtime/webhooks/mcp Deploy to Azure: azd provision azd deploy Remote endpoint: https://.azurewebsites.net/runtime/webhooks/mcp TypeScript MCP Tools Snippet (Get Weather service) In Azure Functions, you define MCP tools using app.mcpTool(). The toolName and description tell clients what this tool does, toolProperties defines the input arguments (like location as a string), and handler points to your function that processes the request. app.mcpTool("getWeather", { toolName: "GetWeather", description: "Returns current weather for a location via Open-Meteo.", toolProperties: { location: arg.string().describe("City name to check weather for") }, handler: getWeather, }); Resource Trigger Snippet (Weather App Hook) MCP resources are defined using app.mcpResource(). The uri is how clients reference this resource, resourceName and description provide metadata, mimeType tells clients what type of content to expect, and handler is your function that returns the actual content (like HTML for a widget). app.mcpResource("getWeatherWidget", { uri: "ui://weather/index.html", resourceName: "Weather Widget", description: "Interactive weather display for MCP Apps", mimeType: "text/html;profile=mcp-app", handler: getWeatherWidget, }); Sample repos and references Complete sample repository with TypeScript implementation: https://github.com/Azure-Samples/remote-mcp-functions-typescript Official MCP extension documentation: https://learn.microsoft.com/azure/azure-functions/functions-bindings-mcp?pivots=programming-language-typescript Java sample: https://github.com/Azure-Samples/remote-mcp-functions-java .NET sample: https://github.com/Azure-Samples/remote-mcp-functions-dotnet Python sample: https://github.com/Azure-Samples/remote-mcp-functions-python MCP Inspector: https://github.com/modelcontextprotocol/inspector Final Takeaway MCP Apps are just MCP servers but they represent a paradigm shift by transforming the AI from a text-based chatbot into a functional interface. Instead of forcing users to navigate complex tasks through back-and-forth conversations, these apps embed interactive UIs and tools directly into the chat, significantly improving the user experience and the usefulness of MCP servers. Azure Functions allows developers to quickly build and host an MCP app by providing an easy abstraction and deployment experience. The platform also provides built-in features to secure and scale your MCP apps, plus a serverless pricing model so you can just focus on the business logic.365Views1like0CommentsThe Durable Task Scheduler Consumption SKU is now Generally Available
Today, we're excited to announce that the Durable Task Scheduler Consumption SKU has reached General Availability. Developers can now run durable workflows and agents on Azure with pay-per-use pricing, no storage to manage, no capacity to plan, and no idle costs. Just create a scheduler, connect your app, and start orchestrating. Whether you're coordinating AI agent workflows, processing event-driven pipelines, or running background jobs, the Consumption SKU is ready to go. GET STARTED WITH THE DURABLE TASK SCHEDULER CONSUMPTION SKU Since launching the Consumption SKU in public preview last November, we've seen incredible adoption and have incorporated feedback from developers around the world to ensure the GA release is truly production ready. “The Durable Task Scheduler has become a foundational piece of what we call ‘workflows’. It gives us the reliability guarantees we need for processing financial documents and sensitive workflows, while keeping the programming model straightforward. The combination of durable execution, external event correlation, deterministic idempotency, and the local emulator experience has made it a natural fit for our event-driven architecture. We have been delighted with the consumption SKUs cost model for our lower environments.”– Emily Lewis, CarMax What is the Durable Task Scheduler? If you're new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background: Announcing Limited Early Access of the Durable Task Scheduler Announcing Workflow in Azure Container Apps with the Durable Task Scheduler Announcing Dedicated SKU GA & Consumption SKU Public Preview In brief, the Durable Task Scheduler is a fully managed orchestration backend for durable execution on Azure, meaning your workflows and agent sessions can reliably resume and run to completion, even through process failures, restarts, and scaling events. Whether you’re running workflows or orchestrating durable agents, it handles task scheduling, state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing their own execution engines and storage backends. The Durable Task Scheduler works across Azure compute environments: Azure Functions: Using the Durable Functions extension across all Function App SKUs, including Flex Consumption. Azure Container Apps: Using the Durable Functions or Durable Task SDKs with built-in workflow support and auto-scaling. Any compute: Azure Kubernetes Service, Azure App Service, or any environment where you can run the Durable Task SDKs (.NET, Python, Java, JavaScript). Why choose the Consumption SKU? With the Consumption SKU you’re charged only for actions dispatched, with no minimum commitments or idle costs. There’s no capacity to size or throughput to reserve. Create a scheduler, connect your app, and you’re running. The Consumption SKU is a natural fit for workloads with unpredictable or bursty usage patterns: AI agent orchestration: Multi-step agent workflows that call LLMs, retrieve data, and take actions. Users trigger these on demand, so volume is spiky and pay-per-use avoids idle costs between bursts. Event-driven pipelines: Processing events from queues, webhooks, or streams with reliable orchestration and automatic checkpointing, where volumes spike and dip unpredictably. API-triggered workflows: User signups, form submissions, payment flows, and other request-driven processing where volume varies throughout the day. Distributed transactions: Retries and compensation logic across microservices with durable sagas that survive failures and restarts. What's included in the Consumption SKU at GA The Consumption SKU has been hardened based on feedback and real-world usage during the public preview. Here's what's included at GA: Performance Up to 500 actions per second: Sufficient throughput for a wide range of workloads, with the option to move to the Dedicated SKU for higher-scale scenarios. Up to 30 days of data retention: View and manage orchestration history, debug failures, and audit execution data for up to 30 days. Built-in monitoring dashboard Filter orchestrations by status, drill into execution history, view visual Gantt and sequence charts, and manage orchestrations (pause, resume, terminate, or raise events), all from the dashboard, secured with Role-Based Access Control (RBAC). Identity-based security The Consumption SKU uses Entra ID for authentication and RBAC for authorization. No SAS tokens or access keys to manage, just assign the appropriate role and connect. Get started with the Durable Task Scheduler today The Consumption SKU is available now Generally Available. Provision a scheduler in the Azure portal, connect your app, and start orchestrating. You only pay for what you use. Documentation Getting started Samples Pricing Consumption SKU docs We'd love to hear your feedback. Reach out to us by filing an issue on our GitHub repository442Views0likes0CommentsIntroducing Skills in Azure API Center
The problem Modern applications depend on more than APIs. A single AI workflow might call an LLM, invoke an MCP tool, integrate a third-party service, and reference a business capability spanning dozens of endpoints. Without a central inventory, these assets become impossible to discover, easy to duplicate, and painful to govern. Azure API Center — part of the Azure API Management platform — already catalogs models, agents, and MCP servers alongside traditional APIs. Skills extend that foundation to cover reusable AI capabilities. What is a Skill? As AI agents become more capable, organizations need a way to define and govern what those agents can actually do. Skills are the answer. A Skill in Azure API Center is a reusable, registered capability that AI agents can discover and consume to extend their functionality. Each skill is backed by source code — typically hosted in a Git repository — and describes what it does, what APIs or MCP servers it can access, and who owns it. Think of skills as the building blocks of AI agent behavior, promoted into a governed inventory alongside your APIs, MCP servers, models, and agents. Example: A "Code Review Skill" performs automated code reviews using static analysis. It is registered in API Center with a Source URL pointing to its GitHub repo, allowed to access your code analysis API, and discoverable by any AI agent in your organization. How Skills work in API Center Skills can be added to your inventory in two ways: registered manually through the Azure portal, or synchronized automatically from a connected Git repository. Both approaches end up in the same governed catalog, discoverable through the API Center portal. Option 1: Register a Skill manually Use the Azure portal to register a skill directly. Navigate to Inventory > Assets in your API center, select + Register an asset > Skill, and fill in the registration form. Figure 2: Register a skill form in the Azure portal. The form captures everything needed to make a skill discoverable and governable: Field Description Title Display name for the skill (e.g. Code Review Skill). Identification Auto-generated URL slug based on the title. Editable. Summary One-line description of what the skill does. Description Full detail on capabilities, use cases, and expected behavior. Lifecycle stage Current state: Design, Preview, Production, or Deprecated. Source URL Git repository URL for the skill source code. Allowed tools The APIs or MCP servers from your inventory this skill is permitted to access. Enforces governance at the capability level. License Licensing terms: MIT, Apache 2.0, Proprietary, etc. Contact information Owner or support contact for the skill. Governance note: The Allowed tools field is key for AI governance. It explicitly defines which APIs and MCP servers a skill can invoke — preventing uncontrolled access and making security review straightforward. Option 2: Sync Skills from a Git repository For teams managing skills in source control, API Center can integrate directly with a Git repository and synchronize skill information automatically. This is the recommended approach for teams practicing GitOps or managing many skills at scale. Figure 3: Integrating a Git repository to sync skills automatically into API Center. When you configure a Git integration, API Center: Creates an Environment representing the repository as a source of skills Scans for files matching the configured pattern (default: **/skill.md) Syncs matching skills into your inventory and keeps them current as the repo changes For private repositories, a Personal Access Token (PAT) stored in Azure Key Vault is used for authentication. API Center's managed identity retrieves the PAT securely — no credentials are stored in the service itself. Tip: Use the Automatically configure managed identity and assign permissions option when setting up the integration if you haven't pre-configured a managed identity. API Center handles the Key Vault permissions automatically. Discovering Skills in your catalog Once registered — manually or via Git — skills appear in the Inventory > Assets page alongside all other asset types. Linked skills (synced from Git) are visually identified with a link icon, so teams can see at a glance which skills are source-controlled. From the API Center portal, developers and other stakeholders can browse the full skill catalog, filter by lifecycle stage or type, and view detailed information about each skill — including its source URL, allowed tools, and contact information. Figure 4: Skills catalog in API Center portal, showing registered skills and the details related to the skill. Developer experience: The API Center portal gives teams a self-service way to discover approved skills without needing to ask around or search GitHub. The catalog becomes the authoritative source of what's available and what's allowed. Why this matters for AI development teams Skills close a critical gap in AI governance. As organizations deploy AI agents, they need to know — and control — what those agents can do. Without a governed skill registry, capability discovery is ad hoc, reuse is low, and security review is difficult. By bringing skills into Azure API Center alongside APIs, MCP servers, models, and agents, teams get: A single inventory for all the assets AI agents depend on Explicit governance over which resources each skill can access via Allowed tools Automated, source-controlled skill registration via Git integration Discoverability for developers and AI systems through the API Center portal Consistent lifecycle management — Design through Production to Deprecated API Center, as part of the Azure API Management platform and the broader AI Gateway vision, is evolving into the system of record for AI-ready development. Skills are the latest step in that direction. Available now Skills are available today in Azure API Center (preview). To register your first skill: Sign in to the Azure portal and navigate to your API Center instance In the sidebar, select Inventory > Assets Select + Register an asset > Skill Fill in the registration form and select Create → Register and discover skills in Azure API Center (docs) → Set up your API Center portal → Explore the Azure API Management platform1.6KViews0likes2Comments