MCP Apps on Azure Functions: Quick Start with TypeScript
Azure Functions makes hosting MCP apps simple: build locally, create a secure endpoint, and deploy fast with the Azure Developer CLI (azd). This guide shows you how, using a weather app example.

What Are MCP Apps?

MCP Apps let MCP servers return interactive HTML interfaces (data visualizations, forms, dashboards) that render directly inside MCP-compatible hosts such as Visual Studio Code Copilot, Claude, and ChatGPT. Learn more about MCP Apps in the official documentation. An interactive UI removes many of the limitations of plain text, especially when your scenario involves:

Interactive data: replace lists with clickable maps or charts for deep exploration.
Complex setup: use one-page forms instead of long, back-and-forth questioning.
Rich media: embed native viewers to pan, zoom, or rotate 3D models and documents.
Live updates: maintain real-time dashboards that refresh without new prompts.
Workflow management: handle multi-step tasks like approvals with navigation buttons and persistent state.

MCP App Hosting as a Feature

Azure Functions provides an easy abstraction to help you build MCP servers without having to learn the nitty-gritty of the MCP protocol. When hosting your MCP App on Functions, you get:

MCP tools (server logic): handle client requests, call backend services, and return structured data; Azure Functions manages the MCP protocol details for you.
MCP resources (UI payloads such as app widgets): serve interactive HTML, JSON documents, or formatted content; just focus on your UI logic.
Secure HTTPS access: built-in authentication using Azure Functions keys, plus built-in MCP authentication with OAuth support for enterprise-grade security.
Easy deployment with Bicep and azd: Infrastructure as Code for reliable deployments.
Local development: test and debug locally before deploying.
Auto-scaling: Azure Functions handles scaling, retries, and monitoring automatically.

The weather app in this repo is an example of this feature, not the only use case.
Architecture Overview

Example: The classic Weather App

The sample implementation includes:

A GetWeather MCP tool that fetches weather by location (calls the Open-Meteo geocoding and forecast APIs)
A Weather Widget MCP resource that serves interactive HTML/JS code (runs in the client; fetches data via the GetWeather tool)
A TypeScript service layer that abstracts API calls and data transformation (runs on the server)
Bidirectional communication: the client-side UI calls server-side tools, receives data, and renders locally
A local and remote testing flow for MCP clients (via MCP Inspector, VS Code, or custom clients)

How UI Rendering Works in MCP Apps

In the Weather App example:

Azure Functions serves getWeatherWidget as a resource → returns weather-app.ts compiled to HTML/JS
The client renders the Weather Widget UI
The user interacts with the widget, or requests are made internally
The widget calls the getWeather tool → the server processes the request and returns weather data
The widget renders the weather data on the client side

This architecture keeps the UI responsive locally while using server-side logic and data on demand.

Quick Start

Check out the repository: https://github.com/Azure-Samples/remote-mcp-functions-typescript

Run locally:

npm install
npm run build
func start

Local endpoint: http://0.0.0.0:7071/runtime/webhooks/mcp

Deploy to Azure:

azd provision
azd deploy

Remote endpoint: https://.azurewebsites.net/runtime/webhooks/mcp

TypeScript MCP Tools Snippet (Get Weather service)

In Azure Functions, you define MCP tools using app.mcpTool(). The toolName and description tell clients what the tool does, toolProperties defines the input arguments (like location as a string), and handler points to your function that processes the request.
app.mcpTool("getWeather", {
  toolName: "GetWeather",
  description: "Returns current weather for a location via Open-Meteo.",
  toolProperties: {
    location: arg.string().describe("City name to check weather for")
  },
  handler: getWeather,
});

Resource Trigger Snippet (Weather App Hook)

MCP resources are defined using app.mcpResource(). The uri is how clients reference this resource, resourceName and description provide metadata, mimeType tells clients what type of content to expect, and handler is your function that returns the actual content (like HTML for a widget).

app.mcpResource("getWeatherWidget", {
  uri: "ui://weather/index.html",
  resourceName: "Weather Widget",
  description: "Interactive weather display for MCP Apps",
  mimeType: "text/html;profile=mcp-app",
  handler: getWeatherWidget,
});

Sample repos and references

Complete sample repository with TypeScript implementation: https://github.com/Azure-Samples/remote-mcp-functions-typescript
Official MCP extension documentation: https://learn.microsoft.com/azure/azure-functions/functions-bindings-mcp?pivots=programming-language-typescript
Java sample: https://github.com/Azure-Samples/remote-mcp-functions-java
.NET sample: https://github.com/Azure-Samples/remote-mcp-functions-dotnet
Python sample: https://github.com/Azure-Samples/remote-mcp-functions-python
MCP Inspector: https://github.com/modelcontextprotocol/inspector

Final Takeaway

MCP Apps are just MCP servers, but they represent a paradigm shift: they transform the AI from a text-based chatbot into a functional interface. Instead of forcing users to navigate complex tasks through back-and-forth conversation, these apps embed interactive UIs and tools directly into the chat, significantly improving the user experience and the usefulness of MCP servers. Azure Functions lets developers quickly build and host an MCP app by providing an easy abstraction and deployment experience.
The platform also provides built-in features to secure and scale your MCP apps, plus a serverless pricing model so you can just focus on the business logic.

Azure Functions Ignite 2025 Update
Azure Functions is redefining event-driven applications and high-scale APIs in 2025, accelerating innovation for developers building the next generation of intelligent, resilient, and scalable workloads. This year, our focus has been on empowering AI and agentic scenarios: remote MCP server hosting, bulletproofing agents with Durable Functions, and first-class support for critical technologies like OpenTelemetry, .NET 10, and Aspire. With major advances in serverless Flex Consumption, enhanced performance, security, and deployment fundamentals across Elastic Premium and Flex, Azure Functions is the platform of choice for building modern, enterprise-grade solutions.

Remote MCP

Model Context Protocol (MCP) has taken the world by storm, offering an agent a mechanism to discover and work deeply with the capabilities and context of tools. When you want to expose MCP tools to your enterprise or the world securely, we recommend building remote MCP servers designed to run securely at scale. Azure Functions is uniquely optimized to run your MCP servers at scale, offering the serverless and highly scalable features of the Flex Consumption plan, plus the two flexible programming model options discussed below. All come together on the hardened Functions service, plus new authentication modes for Entra and OAuth using built-in authentication.

Remote MCP Triggers and Bindings Extension (GA)

Back in April, we shared a new extension that allows you to author MCP servers using functions with the MCP tool trigger. That MCP extension is now generally available, with support for C# (.NET), Java, JavaScript (Node.js), Python, and TypeScript (Node.js). The MCP tool trigger allows you to focus on what matters most: the logic of the tool you want to expose to agents. Functions takes care of all the protocol and server logistics, with the ability to scale out to support as many sessions as you want to throw at it.
[Function(nameof(GetSnippet))]
public object GetSnippet(
    [McpToolTrigger(GetSnippetToolName, GetSnippetToolDescription)] ToolInvocationContext context,
    [BlobInput(BlobPath)] string snippetContent
)
{
    return snippetContent;
}

New: Self-hosted MCP Server (Preview)

If you’ve built servers with official MCP SDKs and want to run them as remote cloud‑scale servers without rewriting any code, this public preview is for you. You can now self‑host your MCP server on Azure Functions—keep your existing Python, TypeScript, .NET, or Java code and get rapid 0-to-N scaling, built-in server authentication and authorization, consumption-based billing, and more from the underlying Azure Functions service. This feature complements the Azure Functions MCP extension for building MCP servers using the Functions programming model (triggers and bindings). Pick the path that fits your scenario—build with the extension or with the standard MCP SDKs. Either way you benefit from the same scalable, secure, and serverless platform.

Use the official MCP SDKs:

@mcp.tool()
async def get_alerts(state: str) -> str:
    """Get weather alerts for a US state.

    Args:
        state: Two-letter US state code (e.g. CA, NY)
    """
    url = f"{NWS_API_BASE}/alerts/active/area/{state}"
    data = await make_nws_request(url)
    if not data or "features" not in data:
        return "Unable to fetch alerts or no alerts found."
    if not data["features"]:
        return "No active alerts for this state."
    alerts = [format_alert(feature) for feature in data["features"]]
    return "\n---\n".join(alerts)

Use the Azure Functions Flex Consumption plan's serverless compute via Custom Handlers in host.json:

{
  "version": "2.0",
  "configurationProfile": "mcp-custom-handler",
  "customHandler": {
    "description": {
      "defaultExecutablePath": "python",
      "arguments": ["weather.py"]
    },
    "http": {
      "DefaultAuthorizationLevel": "anonymous"
    },
    "port": "8000"
  }
}

Learn more about MCPTrigger and self-hosted MCP servers at https://aka.ms/remote-mcp

Built-in MCP server authorization (Preview)

The built-in authentication and authorization feature can now be used for MCP server authorization, using a new preview option. You can quickly define identity-based access control for your MCP servers with Microsoft Entra ID or other OpenID Connect providers. Learn more at https://aka.ms/functions-mcp-server-authorization.

Better together with Foundry agents

Microsoft Foundry is the starting point for building intelligent agents, and Azure Functions is the natural next step for extending those agents with remote MCP tools. Running your tools on Functions gives you clean separation of concerns, reuse across multiple agents, and strong security isolation. And with built-in authorization, Functions enables enterprise-ready authentication patterns, from calling downstream services with the agent’s identity to operating on behalf of end users with their delegated permissions. Build your first remote MCP server and connect it to your Foundry agent at https://aka.ms/foundry-functions-mcp-tutorial.

Agents

Microsoft Agent Framework 2.0 (Public Preview Refresh)

We’re excited about the 2.0 preview refresh of Microsoft Agent Framework, which builds on battle-hardened work from Semantic Kernel and AutoGen. Agent Framework is an outstanding solution for building multi-agent orchestrations that are both simple and powerful.
Azure Functions is a strong fit for hosting Agent Framework, with the service’s extreme scale, serverless billing, and enterprise-grade features like VNet networking and built-in auth.

Durable Task Extension for Microsoft Agent Framework (Preview)

The durable task extension for Microsoft Agent Framework transforms how you build production-ready, resilient, and scalable AI agents by bringing the proven durable execution (survives crashes and restarts) and distributed execution (runs across multiple instances) capabilities of Azure Durable Functions directly into the Microsoft Agent Framework. Combined with Azure Functions for hosting and event-driven execution, you can now deploy stateful, resilient AI agents that automatically handle session management, failure recovery, and scaling, freeing you to focus entirely on your agent logic.

Key features of the durable task extension include:

Serverless hosting: deploy agents on Azure Functions with auto-scaling from thousands of instances to zero, while retaining full control in a serverless architecture.
Automatic session management: agents maintain persistent sessions with full conversation context that survives process crashes, restarts, and distributed execution across instances.
Deterministic multi-agent orchestrations: coordinate specialized durable agents with predictable, repeatable, code-driven execution patterns.
Human-in-the-loop with serverless cost savings: pause for human input without consuming compute resources or incurring costs.
Built-in observability with Durable Task Scheduler: deep visibility into agent operations and orchestrations through the Durable Task Scheduler UI dashboard.

Create a durable agent:

endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")

# Create an AI agent following the standard Microsoft Agent Framework pattern
agent = AzureOpenAIChatClient(
    endpoint=endpoint,
    deployment_name=deployment_name,
    credential=AzureCliCredential()
).create_agent(
    instructions="""You are a professional content writer who creates engaging,
    well-structured documents for any given topic. When given a topic, you will:
    1. Research the topic using the web search tool
    2. Generate an outline for the document
    3. Write a compelling document with proper formatting
    4. Include relevant examples and citations""",
    name="DocumentPublisher",
    tools=[
        AIFunctionFactory.Create(search_web),
        AIFunctionFactory.Create(generate_outline)
    ]
)

# Configure the function app to host the agent with durable session management
app = AgentFunctionApp(agents=[agent])
app.run()

Durable Task Scheduler dashboard for agent and agent-workflow observability and debugging.

For more information on the durable task extension for Agent Framework, see the announcement: https://aka.ms/durable-extension-for-af-blog.

Flex Consumption Updates

As you know, Flex Consumption means serverless without compromise.
It combines elastic scale and pay‑for‑what‑you‑use pricing with the controls you expect: per‑instance concurrency, longer executions, VNet/private networking, and Always Ready instances to minimize cold starts. Since reaching GA at Ignite 2024, Flex Consumption has seen tremendous growth, with over 1.5 billion function executions per day and nearly 40 thousand apps. Here’s what’s new for Ignite 2025:

512 MB instance size (GA): right‑size lighter workloads and scale farther within the default quota.
Availability Zones (GA): distribute instances across zones.
Rolling updates (Public Preview): unlock zero-downtime deployments of code or config by setting a single configuration. See below for more information.

Even more improvements, including new diagnostic settings to route logs and metrics, Key Vault and App Config references, new regions, and Custom Handler support.

To get started, review the Flex Consumption samples, or dive into the documentation to see how Flex can support your workloads.

Migrating to Azure Functions Flex Consumption

Migrating to Flex Consumption is simple with our step-by-step guides and agentic tools. Move your Azure Functions apps or AWS Lambda workloads, update your code and configuration, and take advantage of new automation tools. With Linux Consumption retiring, now is the time to switch. For more information, see:

Migrate Consumption plan apps to the Flex Consumption plan
Migrate AWS Lambda workloads to Azure Functions

Durable Functions

Durable Functions introduces powerful new features to help you build resilient, production-ready workflows:

Distributed tracing: track requests across components and systems, giving you deep visibility into orchestrations and activities, with support for App Insights and OpenTelemetry.
Extended sessions support in .NET isolated: improves performance by caching orchestrations in memory, ideal for fast sequential activities and large fan-out/fan-in patterns.
Orchestration versioning (public preview): enables zero-downtime deployments and backward compatibility, so you can safely roll out changes without disrupting in-flight workflows.

Durable Task Scheduler Updates

Durable Task Scheduler Dedicated SKU (GA): now generally available, the Dedicated SKU offers advanced orchestration for complex workflows and intelligent apps. It provides predictable pricing for steady workloads, automatic checkpointing, state protection, and advanced monitoring for resilient, reliable execution.

Durable Task Scheduler Consumption SKU (Public Preview): the new Consumption SKU brings serverless, pay-as-you-go orchestration to dynamic and variable workloads. It delivers the same orchestration capabilities with flexible billing, making it easy to scale intelligent applications as needed.

For more information, see: https://aka.ms/dts-ga-blog

OpenTelemetry support now GA

Azure Functions OpenTelemetry support is now generally available, bringing unified, production-ready observability to serverless applications. Developers can now export logs, traces, and metrics using open standards, enabling consistent monitoring and troubleshooting across every workload. Key capabilities include:

Unified observability: standardize logs, traces, and metrics across all your serverless workloads for consistent monitoring and troubleshooting.
Vendor-neutral telemetry: integrate seamlessly with Azure Monitor or any OpenTelemetry-compliant backend, ensuring flexibility and choice.
Broad language support: works with .NET (isolated), Java, JavaScript, Python, PowerShell, and TypeScript.

Start using OpenTelemetry in Azure Functions today to unlock standards-based observability for your apps. For step-by-step guidance on enabling OpenTelemetry and configuring exporters for your preferred backend, see the documentation.

Deployment with Rolling Updates (Preview)

Achieving zero-downtime deployments has never been easier.
The Flex Consumption plan now offers rolling updates as a site update strategy. Set a single property, and all future code deployments and configuration changes will be released with zero downtime. Instead of restarting all instances at once, the platform drains existing instances in batches while scaling out the latest version to match real-time demand. This ensures uninterrupted in-flight executions and resilient throughput across your HTTP, non-HTTP, and Durable workloads, even during intensive scale-out scenarios. Rolling updates are now in public preview. Learn more at https://aka.ms/functions/rolling-updates.

Secure Identity and Networking Everywhere by Design

Security and trust are paramount. Azure Functions incorporates proven best practices by design, with full support for managed identity—eliminating secrets and simplifying secure authentication and authorization. Flex Consumption and other plans offer enterprise-grade networking features like VNets, private endpoints, and NAT gateways for deep protection. The Azure portal streamlines secure function creation, and updated scenarios and samples showcase these identity and networking capabilities in action. Built-in authentication (discussed above) enables inbound client traffic to use identity as well.

Check out our updated Functions scenarios page with quickstarts, or our secure samples gallery, to see these identity and networking best practices in action.

.NET 10

Azure Functions now supports .NET 10, bringing a great suite of new features and performance benefits for your code. .NET 10 is supported on the isolated worker model, and it’s available for all plan types except Linux Consumption. As a reminder, support for the legacy in-process model ends on November 10, 2026, and the in-process model is not being updated with .NET 10. To stay supported and take advantage of the latest features, migrate to the isolated worker model.
Aspire

Aspire is an opinionated stack that simplifies development of distributed applications in the cloud. The Azure Functions integration for Aspire enables you to develop, debug, and orchestrate an Azure Functions .NET project as part of an Aspire solution. Aspire publish deploys your functions directly to Azure Functions on Azure Container Apps. Aspire 13 includes an updated preview version of the Functions integration that acts as a release candidate with go-live support. The package will move to GA quality with Aspire 13.1.

Java 25, Node.js 24

Azure Functions now supports Java 25 and Node.js 24 in preview. You can develop functions using these versions locally and deploy them to Azure Functions plans. Learn how to upgrade your apps to these versions here.

In Summary

Ready to build what’s next? Update your Azure Functions Core Tools today and explore the latest samples and quickstarts to unlock new capabilities for your scenarios. The guided quickstarts run and deploy in under 5 minutes and incorporate best practices—from architecture to security to deployment. We’ve made it easier than ever to scaffold, deploy, and scale real-world solutions with confidence. The future of intelligent, scalable, and secure applications starts now—jump in and see what you can create!

The Durable Task Scheduler Consumption SKU is now Generally Available
Today, we're excited to announce that the Durable Task Scheduler Consumption SKU has reached General Availability. Developers can now run durable workflows and agents on Azure with pay-per-use pricing, no storage to manage, no capacity to plan, and no idle costs. Just create a scheduler, connect your app, and start orchestrating. Whether you're coordinating AI agent workflows, processing event-driven pipelines, or running background jobs, the Consumption SKU is ready to go.

Since launching the Consumption SKU in public preview last November, we've seen incredible adoption and have incorporated feedback from developers around the world to ensure the GA release is truly production ready.

“The Durable Task Scheduler has become a foundational piece of what we call ‘workflows’. It gives us the reliability guarantees we need for processing financial documents and sensitive workflows, while keeping the programming model straightforward. The combination of durable execution, external event correlation, deterministic idempotency, and the local emulator experience has made it a natural fit for our event-driven architecture. We have been delighted with the consumption SKUs cost model for our lower environments.” – Emily Lewis, CarMax

What is the Durable Task Scheduler?

If you're new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background:

Announcing Limited Early Access of the Durable Task Scheduler
Announcing Workflow in Azure Container Apps with the Durable Task Scheduler
Announcing Dedicated SKU GA & Consumption SKU Public Preview

In brief, the Durable Task Scheduler is a fully managed orchestration backend for durable execution on Azure, meaning your workflows and agent sessions can reliably resume and run to completion, even through process failures, restarts, and scaling events.
Whether you’re running workflows or orchestrating durable agents, it handles task scheduling, state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing their own execution engines and storage backends. The Durable Task Scheduler works across Azure compute environments:

Azure Functions: using the Durable Functions extension across all Function App SKUs, including Flex Consumption.
Azure Container Apps: using the Durable Functions or Durable Task SDKs with built-in workflow support and auto-scaling.
Any compute: Azure Kubernetes Service, Azure App Service, or any environment where you can run the Durable Task SDKs (.NET, Python, Java, JavaScript).

Why choose the Consumption SKU?

With the Consumption SKU you’re charged only for actions dispatched, with no minimum commitments or idle costs. There’s no capacity to size or throughput to reserve. Create a scheduler, connect your app, and you’re running. The Consumption SKU is a natural fit for workloads with unpredictable or bursty usage patterns:

AI agent orchestration: multi-step agent workflows that call LLMs, retrieve data, and take actions. Users trigger these on demand, so volume is spiky and pay-per-use avoids idle costs between bursts.
Event-driven pipelines: processing events from queues, webhooks, or streams with reliable orchestration and automatic checkpointing, where volumes spike and dip unpredictably.
API-triggered workflows: user signups, form submissions, payment flows, and other request-driven processing where volume varies throughout the day.
Distributed transactions: retries and compensation logic across microservices, with durable sagas that survive failures and restarts.

What's included in the Consumption SKU at GA

The Consumption SKU has been hardened based on feedback and real-world usage during the public preview.
Here's what's included at GA:

Performance

Up to 500 actions per second: sufficient throughput for a wide range of workloads, with the option to move to the Dedicated SKU for higher-scale scenarios.
Up to 30 days of data retention: view and manage orchestration history, debug failures, and audit execution data for up to 30 days.

Built-in monitoring dashboard

Filter orchestrations by status, drill into execution history, view visual Gantt and sequence charts, and manage orchestrations (pause, resume, terminate, or raise events), all from the dashboard, secured with role-based access control (RBAC).

Identity-based security

The Consumption SKU uses Entra ID for authentication and RBAC for authorization. No SAS tokens or access keys to manage; just assign the appropriate role and connect.

Get started with the Durable Task Scheduler today

The Consumption SKU is now generally available. Provision a scheduler in the Azure portal, connect your app, and start orchestrating. You only pay for what you use.

Documentation
Getting started
Samples
Pricing
Consumption SKU docs

We'd love to hear your feedback. Reach out to us by filing an issue on our GitHub repository.

The Swarm Diaries: What Happens When You Let AI Agents Loose on a Codebase
The Idea

Single-agent coding assistants are impressive, but they have a fundamental bottleneck: they think serially. Ask one to build a full CLI app with a database layer, a command parser, pretty output, and tests, and it’ll grind through each piece one by one. Industry benchmarks bear this out: AIMultiple’s 2026 agentic coding benchmark measured Claude Code CLI completing full-stack tasks in ~12 minutes on average, with other CLI agents ranging from 3 to 14 minutes depending on the tool. A three-week real-world test by Render.com found single-agent coding workflows taking 10–30 minutes for multi-file feature work.

But these subtasks don’t depend on each other. A storage agent doesn’t need to wait for the CLI agent. A test writer doesn’t need to watch the renderer work. What if they all ran at the same time?

The hypothesis was straightforward: a swarm of specialized agents should beat a single generalist on at least two of three pillars — speed, quality, or cost. The architecture looked clean on a whiteboard. The reality was messier. But first, let me explain the machinery that makes this possible.

How It’s Wired: Brains and Hands

The system runs on a brains-and-hands split. The brain is an Azure Durable Task Scheduler (DTS) orchestration — a deterministic workflow that decomposes the goal into a task DAG, fans agents out in parallel, merges their branches, and runs quality gates. If the worker crashes mid-run, DTS replays from the last checkpoint. No work lost. Simple LLM calls — the planner that decomposes the goal, the judge that scores the output — run as lightweight DTS activities. One call, no tools, cheap.

The hands are Microsoft Agent Framework (MAF) agents, each running in its own Docker container. One sandbox per agent, each with its own git clone, filesystem, and toolset. When an agent’s LLM decides to edit a file or run a build, the call routes through middleware to that agent’s isolated container. No two agents ever touch the same workspace.
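To make the split concrete, here is a heavily simplified, pure-Python simulation of the fan-out/merge shape described above. It is not the DTS or MAF API, and every name in it is a hypothetical stand-in; it only illustrates the control flow: parallel agents, each confined to its own workspace, merged afterwards.

```python
from concurrent.futures import ThreadPoolExecutor

def run_agent(name: str, workspace: dict) -> dict:
    # Stand-in for a containerized agent: it may only write
    # inside its own workspace, mimicking one-sandbox-per-agent.
    workspace[f"{name}.py"] = f"# code produced by the {name} agent\n"
    return workspace

agents = ["storage", "cli", "renderer", "tests"]
workspaces = {name: {} for name in agents}  # one isolated "clone" each

# "Fan out": the brain schedules all agents in parallel.
with ThreadPoolExecutor(max_workers=len(agents)) as pool:
    done = list(pool.map(lambda n: run_agent(n, workspaces[n]), agents))

# "Merge": the integrator combines the isolated results afterwards.
merged = {path: src for ws in done for path, src in ws.items()}
```

Because no two simulated agents share a workspace dict, the merge step never sees write conflicts, which is exactly the property the per-agent containers buy in the real system.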
These complex agents — coders, researchers, the integrator — run as DTS durable entities with full agentic loops and turn-level checkpointing. The split matters because LLM reasoning and code execution have completely different reliability profiles. The brain checkpoints and replays deterministically. The hands are ephemeral — if a container dies, spin up a new one and replay the agent’s last turn. This separation is what lets you run five agents in parallel without them stepping on each other’s git branches, build artifacts, or file handles. It’s also what made every bug I was about to encounter debuggable. When something broke, I always knew which side broke — the orchestration logic, or the agent behavior. That distinction saved me more hours than any other design decision.

The First Run Produced Nothing

After hours of vibe-coding the foundation — Pydantic models, skill prompts, a prompt builder, a context store, sixteen architectural decisions documented in ADRs — I wired up the seven-phase orchestration and hit go.

All five agents returned empty responses. Every single one. The logs showed agents “running” but producing zero output. I stared at the code for an embarrassingly long time before I found it.

The planner returned task IDs as integers — 1, 2, 3. The sandbox provisioner stored them as string keys — "1", "2", "3". When the orchestrator did sandbox_map.get(1), it got None. No sandbox meant no middleware. The agents were literally talking to thin air — making LLM calls with no tools attached, like a carpenter showing up to a job site with no hammer.

The fix was one line. The lesson was bigger: LLMs don’t respect type contracts. They’ll return an integer when you expect a string, a list when you expect a dict, and a confident hallucination when they have nothing to say. Every boundary between AI-generated data and deterministic systems needs defensive normalization. This would not be the last time I learned that lesson.
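That one-line fix generalizes into a pattern worth keeping: coerce every LLM-produced key at the boundary before it reaches a deterministic lookup. A minimal sketch (normalize_task_id and the sandbox names are illustrative, not the project's actual code):

```python
def normalize_task_id(raw) -> str:
    """Coerce whatever the planner returned (int, float, padded string...)
    into the canonical string key used by the rest of the system."""
    return str(raw).strip()

# The provisioner stored string keys; the planner (an LLM) returned
# a mix of integers and padded strings.
sandbox_map = {"1": "sandbox-a", "2": "sandbox-b", "3": "sandbox-c"}
planner_task_ids = [1, 2, "3 "]  # mixed types straight from model output

resolved = [sandbox_map[normalize_task_id(t)] for t in planner_task_ids]
```

Without the normalization, `sandbox_map.get(1)` returns None exactly as described above; with it, every lookup lands on the intended sandbox.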
The Seven-Minute Merge

Once agents actually ran and produced code, a new problem emerged. I watched the logs on a run that took twenty-one minutes total. Four agents finished their work in about twelve minutes. The remaining seven minutes were the LLM integrator merging four branches — eight to thirty tool calls per merge, using the premium model, to do what git merge --no-edit does in five seconds.

I was paying for a premium LLM to run git diff, read both sides of every file, and write a merged version. For branches that merged cleanly. With zero conflicts.

The fix was obvious in retrospect: try git merge first. If it succeeds — great, five seconds, done. Only call the LLM integrator when there are actual conflicts to resolve. Merge time dropped from seven minutes to under thirty seconds. I felt a little silly for not doing this from the start.

When Agents Build Different Apps

The merge speedup felt like a win until I looked at what was actually being merged. The storage agent had built a JSON-file backend. The CLI agent had written its commands against SQLite. Both modules were well-written. They compiled individually. Together, nothing worked — the CLI tried to import a Storage class that didn’t exist in the JSON backend.

This was the moment I realized the agents weren’t really a team. They were strangers who happened to be assigned to the same project, each interpreting the goal in their own way.

The fix was the single most impactful change in the entire project: contract-first planning. Instead of just decomposing the goal into tasks, the planner now generates API contracts — function signatures, class shapes, data model definitions — and injects them into every agent’s prompt. “Here’s what the Storage class looks like. Here’s what Task looks like. Build against these interfaces.”

Before contracts, three of six branches conflicted and the quality score was 28. After contracts, zero of four branches conflicted and the score hit 68.
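The contract-first idea is simple to express in code. A hypothetical sketch, where the contract text and the build_agent_prompt helper are illustrative stand-ins rather than the project's real implementation:

```python
# Shared interface definitions the planner emits before any agent runs.
CONTRACTS = '''
class Task:
    id: int
    description: str
    priority: int = 0

class Storage:
    def add(self, task: "Task") -> int: ...
    def list_tasks(self) -> "list[Task]": ...
'''

def build_agent_prompt(role: str, subtask: str) -> str:
    # Every agent (storage, CLI, tests) sees the exact same shapes,
    # so independently written modules fit together at merge time.
    return (
        f"You are the {role} agent. Your subtask: {subtask}\n"
        "Build strictly against these shared API contracts:\n"
        f"{CONTRACTS}"
    )

cli_prompt = build_agent_prompt("cli", "implement the command parser")
storage_prompt = build_agent_prompt("storage", "implement persistence")
```

The key property is that the contract block is injected verbatim into every prompt, so a storage agent cannot quietly invent a different Storage shape than the one the CLI agent is coding against.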
It turns out the plan isn’t just a plan. In a multi-agent system, the plan is the product. A brilliant plan with mediocre agents produces working code. A vague plan with brilliant agents produces beautiful components that don’t fit together.

## The Agent Who Lied

PR #4 came back with what looked like a solid result. The test writer reported three test files with detailed coverage summaries. The JSON output was meticulous — file names, function names, which modules each test covered.

Then I checked `tool_call_count: 0`.

The test writer hadn’t written a single file. It hadn’t even opened a file. It received zero tools — because the skill loader normalized `test_writer` to underscores while the tool registry used `test-writer` with hyphens. The lookup failed silently. The agent got no tools, couldn’t do any work, and did what LLMs do when they can’t fulfill a request but feel pressure to answer: it made something up. Confidently.

This happened in three of our first four evaluation runs. I called them “phantom agents” — they showed up to work, clocked in, filed a report, and went home without lifting a finger.

The fix had two parts. First, obviously, fix the hyphen/underscore normalization. Second, and more importantly: add a zero-tool-call guard. If an agent that should be writing files reports success with zero tool calls, don’t believe it. Nudge it and retry.

The deeper lesson stuck with me: agents will never tell you they failed. They’ll report success with elaborate detail. You have to verify what they actually did, not what they said they did.

## The Integrator Who Took Shortcuts

Even with contracts preventing mismatched architectures, merge conflicts still happened when multiple agents touched the same files. The LLM integrator’s job was to resolve these conflicts intelligently, preserving logic from both sides.

Instead, facing a gnarly conflict in `models.py`, it ran:

```
git restore --source=HEAD -- models.py
```

One command.
Silently destroyed one agent’s entire implementation — the Task class, the constants, the schema version — gone. The integrator committed the lobotomized file and reported “merge resolved successfully.”

The downstream damage was immediate. `storage.py` imported symbols that no longer existed. The judge scored 43 out of 100. The fixer agent had to spend five minutes reconstructing the data model from scratch.

But that wasn’t even the worst shortcut. On other runs, the integrator replaced conflicting code with:

```python
def add_task(desc, priority=0):
    pass  # TODO: implement storage layer
```

When an LLM is asked to resolve a hard conflict, it’ll sometimes pick the easiest valid output — delete everything and write a placeholder. Technically valid Python. Functionally a disaster.

Fixing this required explicit prompt guardrails:

- Never run `git restore --source=HEAD`
- Never replace implementations with `pass # TODO` placeholders
- When two implementations conflict, keep the more complete one
- After resolving each file, read it back and verify the expected symbols still exist

The lesson: LLMs optimize for the path of least resistance. Under pressure, “valid” and “useful” diverge sharply.

## Demolishing the House for a Leaky Faucet

When the judge scored a run below 70, the original retry strategy was: start over. Re-plan. Re-provision five sandboxes. Re-run all agents. Re-merge. Re-judge. Seven minutes and a non-trivial cloud bill, all because one agent missed an import statement.

This was absurd. Most failures weren’t catastrophic — they were close. A missing model field. A broken import. An unhandled error case. The code was 90% right. Starting from scratch was like tearing down a house because the bathroom faucet leaks.

So I built the fixer agent: a premium-tier model that receives the judge’s specific complaints and makes surgical edits directly on the integrator’s branch. No new sandboxes, no new branches, no merge step. The first time it ran, the score jumped from 43 to 89.5.
Three minutes instead of seven. And it solved the problem that actually existed, rather than hoping a second roll of the dice would land better.

Of course, the fixer’s first implementation had its own bug — it ran in a new sandbox, created a new branch, and occasionally conflicted with the code it was trying to fix. The fix to the fixer: just edit in place on the integrator’s existing sandbox. No branch, no merge, no drama.

## How Others Parallelize (and Why We Went Distributed)

Most multi-agent coding frameworks today parallelize by spawning agents as local processes on a single developer machine. Depending on the framework, there’s typically a lead agent or orchestrator that breaks the task down into subtasks, spins up new agents to handle each piece, and combines their work when they finish — often through parallel TMux sessions or subprocess pools sharing a local filesystem. It’s simple, it’s fast to set up, and for many tasks it works.

But local parallelization hits a ceiling. All agents share one machine’s CPU, memory, and disk I/O. Five agents each running `npm install` or `cargo build` compete for the same 32 GB of RAM. There’s no true filesystem isolation — two agents can clobber the same file if the orchestrator doesn’t carefully sequence writes. Recovery from a crash means restarting the entire local process tree. And scaling from 3 agents to 10 means buying a bigger machine.

Our swarm takes a different approach: fully distributed execution. Each agent runs in its own Docker container with its own filesystem, git clone, and compute allocation — provisioned on AKS, ACA, or any container host. Four agents get four independent resource pools. If one container dies, DTS replays that agent from its last checkpoint in a fresh container without affecting the others. Git branch-per-agent isolation means zero filesystem conflicts by design.
The trade-off is overhead: container provisioning, network latency, and the merge step add wall-clock time that a local TMux setup avoids. On a small two-agent task, local parallelization on a fast laptop probably wins. But for tasks with 4+ agents doing real work — cloning repos, installing dependencies, running builds and tests — independent resource pools and crash isolation matter. Our benchmarks on a 4-agent helpdesk system showed the swarm completing in ~8 minutes with zero resource contention, producing 1,029 lines across 14 files with 4 clean branch merges.

## The Scorecard

After all of this, did the swarm actually beat a single agent? I ran head-to-head benchmarks: same prompt, same model (GPT-5-nano), solo agent vs. swarm, scored by a Sonnet 4.6 judge on a four-criterion rubric. Two tasks — a simple URL shortener (Render.com’s benchmark prompt) and a complex helpdesk ticket system. All runs are public — you can review every line of generated code:

| Task | Solo Agent PR | Swarm PR |
| --- | --- | --- |
| URL Shortener | PR #1 | PR #2 |
| Helpdesk System | PR #3 | PR #4 |

| Metric | URL Shortener (Simple) | Helpdesk System (Complex) |
| --- | --- | --- |
| Quality (rubric, /5) | Solo 1.9 → Swarm 2.5 (+32%) | Solo 2.3 → Swarm 2.95 (+28%) |
| Speed | Solo 2.5 min → Swarm 5.5 min (2.2×) | Solo 1.75 min → Swarm ~8 min (~4.5×) |
| Tokens | 7.7K → 30K (3.9×) | 11K → 39K (3.4×) |

The pattern held across both tasks: +28–32% quality improvement, at the cost of 2–4× more time and ~3.5× more tokens. On the complex task, the quality gains broadened — the swarm produced better code organization (3/5 vs 2/5), actually wrote tests (code:test ratio 0 → 0.15), and generated 5× more files with cleaner decomposition. On the simple task, the gap came entirely from security practices: environment variables, parameterized queries, and proper .gitignore rules that the solo agent skipped entirely.

Industry benchmarks from AIMultiple and Render.com show single CLI agents averaging 10–15 minutes on comparable full-stack tasks.
Our swarm runs in 5–12 minutes depending on parallelizability — but the real win is quality, not speed. Specialized agents with a narrow, well-defined scope tend to be more thorough: the solo agent skipped tests and security practices entirely, while the swarm's dedicated agents actually addressed them.

Two out of three pillars — with a caveat the size of your task. On small, tightly-coupled problems, just use one good agent. On larger, parallelizable work with three or more independent modules? The swarm earns its keep.

## What I Actually Learned

### The Rules That Stuck

1. **Contract-first planning.** Define interfaces before writing implementations. The plan isn’t just a guide — it’s the product.
2. **Deterministic before LLM.** Try `git merge` before calling the LLM integrator. Run `ruff check` before asking an agent to debug. Use code when you can; use AI when you must.
3. **Validate actions, not claims.** An agent that reports “merge resolved successfully” may have deleted everything. Check tool call counts. Read the actual diff. Trust nothing.
4. **Cheap recovery over expensive retries.** A fixer agent that patches one file beats re-running five agents from scratch. The cost of failure should be proportional to the failure.
5. **Not every problem needs a swarm.** If the task fits in one agent’s context window, adding four more just adds overhead. The sweet spot is 3+ genuinely independent modules.

### The Bigger Picture

The biggest surprise? Building a multi-agent AI system is more about software engineering than AI engineering. The hard problems weren’t prompt design or model selection — they were contracts between components, isolation of concerns, idempotent operations, observability, and recovery strategies. Principles that have been around since the 1970s.

The agents themselves are almost interchangeable. Swap GPT for Claude, change the temperature, fine-tune the system prompt — it barely matters if your orchestration is broken.
What matters is how you decompose work, how you share context, how you merge results, and how you recover from failure. Get the engineering right, and the AI just works. Get it wrong, and no model on earth will save you.

## By the Numbers

The codebase is ~7,400 lines of Python across 230 tests and 141 commits. Over 10+ evaluation runs, the swarm processed a combined ~200K+ tokens, merged 20+ branches, and resolved conflicts ranging from trivial (package.json version bumps) to gnarly (overlapping data models). It’s built on Azure Durable Task Scheduler, Microsoft Agent Framework, and containerized sandboxes that run anywhere Docker does — AKS, ACA, or a plain `docker run` on your laptop.

And somewhere in those 141 commits is a one-line fix for an integer-vs-string bug that took me an embarrassingly long time to find.

## References

- Azure Durable Task Scheduler — Deterministic workflow orchestration with replay, checkpointing, and fan-out/fan-in patterns.
- Microsoft Agent Framework (MAF) — Python agent framework for tool-calling, middleware, and structured output.
- Azure Kubernetes Service (AKS) — Managed Kubernetes for running containerized agent workloads at scale.
- Azure Container Apps (ACA) — Serverless container platform for simpler deployments.
- Azure OpenAI Service — Hosts the GPT models used by planner, coder, and judge agents.

Built with Azure DTS, Microsoft Agent Framework, and containerized sandboxes (Docker, AKS, ACA — your choice). And a lot of grep through log files.

# Rethinking Background Workloads with Azure Functions on Azure Container Apps
## Objective

Azure Container Apps provides a flexible platform for running background workloads, supporting multiple execution models to address different workload needs. Two commonly used models are:

- Azure Functions on Azure Container Apps - overview of Azure Functions
- Azure Container Apps Jobs – overview of Container App Jobs

Both are first‑class capabilities on the same platform and are designed for different types of background processing. This blog explores:

- Use cases where Azure Functions on Azure Container Apps are best suited
- Use cases where Container App Jobs provide advantages

## Use Cases where Azure Functions on Azure Container Apps Are Suited

Azure Functions on Azure Container Apps are particularly well suited for event‑driven and workflow‑oriented background workloads, where work is initiated by external signals and coordination is a core concern. The following use cases illustrate scenarios where the Functions programming model aligns naturally with the workload, allowing teams to focus on business logic while the platform handles triggering, scaling, and coordination.

### Event‑Driven Data Ingestion Pipelines

For ingestion pipelines where data arrives asynchronously and unpredictably.

Example: A retail company processes inventory updates from hundreds of suppliers. Files land in Blob Storage overnight, varying widely in size and arrival time. In this scenario:

- Each file is processed independently as it arrives
- Execution is driven by actual data arrival, not schedules
- Parallelism and retries are handled by the platform

```python
@app.blob_trigger(arg_name="blob", path="inventory-uploads/{name}",
                  connection="StorageConnection")
async def process_inventory(blob: func.InputStream):
    data = blob.read()
    # Transform and load to database
    await transform_and_load(data, blob.name)
```

### Multi‑Step, Event‑Driven Processing Workflows

Functions works well for workloads that involve multiple dependent steps, where each step can fail independently and must be retried or resumed safely.
Example: An order processing workflow that includes validation, inventory checks, payment capture, and fulfillment notifications. Using Durable Functions:

- Workflow state is persisted automatically
- Each step can be retried independently
- Execution resumes from the point of failure rather than restarting

Durable Functions on Container Apps solves this declaratively:

```python
@app.orchestration_trigger(context_name="context")
def order_workflow(context: df.DurableOrchestrationContext):
    order = context.get_input()
    # Each step is independently retryable with built-in checkpointing
    validated = yield context.call_activity("validate_order", order)
    inventory = yield context.call_activity("check_inventory", validated)
    payment = yield context.call_activity("capture_payment", inventory)
    yield context.call_activity("notify_fulfillment", payment)
    return {"status": "completed", "order_id": order["id"]}
```

### Scheduled, Recurring Background Tasks

For time‑based background work that runs on a predictable cadence and is closely tied to application logic.

Example: Daily financial summaries, weekly aggregations, or month‑end reconciliation reports. Timer‑triggered Functions allow:

- Schedules to be defined in code
- Logic to be versioned alongside application code
- Execution to run in the same Container Apps environment as other services

```python
@app.timer_trigger(schedule="0 0 6 * * *", arg_name="timer")
async def daily_financial_summary(timer: func.TimerRequest):
    if timer.past_due:
        logging.warning("Timer is running late!")
    await generate_summary(date.today() - timedelta(days=1))
    await send_to_stakeholders()
```

### Long‑Running, Parallelizable Workloads

Scenarios which require long‑running workloads to be decomposed into smaller units of work and coordinated as a workflow.

Example: A large data migration processing millions of records.
With Durable Functions:

- Work is split into independent batches
- Batches execute in parallel across multiple instances
- Progress is checkpointed automatically
- Failures are isolated to individual batches

```python
@app.orchestration_trigger(context_name="context")
def migration_orchestrator(context: df.DurableOrchestrationContext):
    batches = yield context.call_activity("get_migration_batches")
    # Process all batches in parallel across multiple instances
    tasks = [context.call_activity("migrate_batch", b) for b in batches]
    results = yield context.task_all(tasks)
    yield context.call_activity("generate_report", results)
```

## Use Cases where Container App Jobs Are a Best Fit

Azure Container Apps Jobs are well suited for workloads that require explicit execution control or full ownership of the runtime and lifecycle. Common examples include:

### Batch Processing Using Existing Container Images

Teams often have existing containerized batch workloads such as data processors, ETL tools, or analytics jobs that are already packaged and validated. When refactoring these workloads into a Functions programming model is not desirable, Container Apps Jobs allow them to run unchanged while integrating into the Container Apps environment.

### Large-Scale Data Migrations and One-Time Operations

Jobs are a natural fit for one‑time or infrequently run migrations, such as schema upgrades, backfills, or bulk data transformations. These workloads are typically:

- Explicitly triggered
- Closely monitored
- Designed to run to completion under controlled conditions

The ability to manage execution, retries, and shutdown behavior directly is often important in these scenarios.

### Custom Runtime or Specialized Dependency Workloads

Some workloads rely on:

- Specialized runtimes
- Native system libraries
- Third‑party tools or binaries

When these requirements fall outside the supported Functions runtimes, Container Apps Jobs provide the flexibility to define the runtime environment exactly as needed.
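To make the contrast with the trigger-based Functions model concrete, here is a minimal, hypothetical run-to-completion worker of the kind a Container Apps Job would execute. Note how the retry loop and lifecycle handling live in application code rather than in platform triggers (all names below are illustrative, not from any specific product sample):

```python
import time


def process_batch(records: list) -> int:
    # Placeholder for real work (an ETL step, a migration chunk, etc.).
    return len(records)


def run_job(batches: list, max_attempts: int = 3) -> int:
    # The job owns its own retries and exits when all work is done;
    # the platform only starts the container and waits for completion.
    processed = 0
    for batch in batches:
        for attempt in range(1, max_attempts + 1):
            try:
                processed += process_batch(batch)
                break
            except Exception:
                if attempt == max_attempts:
                    raise
                time.sleep(2 ** attempt)  # simple backoff before retrying
    return processed
```

This is exactly the boilerplate the comparison table below calls out: with Jobs you write the polling, retry, and shutdown logic yourself, in exchange for full control over the runtime.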
### Externally Orchestrated or Manually Triggered Workloads

In some architectures, execution is coordinated by an external system such as:

- A CI/CD pipeline
- An operations workflow
- A custom scheduler or control plane

Container Apps Jobs integrate well into these models, where execution is initiated explicitly rather than driven by platform‑managed triggers.

### Long-Running, Single-Instance Processing

For workloads that are intentionally designed to run as a single execution unit without fan‑out, trigger‑based scaling, or workflow orchestration, Jobs provide a straightforward execution model. This includes tasks where parallelism, retries, and state handling are implemented directly within the application.

## Making the Choice

| Consideration | Azure Functions on Azure Container Apps | Azure Container Apps Jobs |
| --- | --- | --- |
| Trigger model | Event‑driven (files, messages, timers, HTTP, events) | Explicit execution (manual, scheduled, or externally triggered) |
| Scaling behavior | Automatic scaling based on trigger volume / queue depth | Fixed or explicitly defined parallelism |
| Programming model | Functions programming model with triggers, bindings, Durable Functions | General container execution model |
| State management | Built‑in state, retries, and checkpointing via Durable Functions | Custom state management required |
| Workflow orchestration | Native support using Durable Functions | Must be implemented manually |
| Boilerplate required | Minimal (no polling, retry, or coordination code) | Higher (polling, retries, lifecycle handling) |
| Runtime flexibility | Limited to supported Functions runtimes | Full control over runtime and dependencies |

## Getting Started on Functions on Azure Container Apps

If you’re already running on Container Apps, adding Functions is straightforward: your Functions run alongside your existing apps, sharing the same networking, observability, and scaling infrastructure.
Check out the documentation for details - Getting Started on Functions on Azure Container Apps

```shell
# Create a Functions app in your existing Container Apps environment
az functionapp create \
  --name my-batch-processor \
  --storage-account mystorageaccount \
  --environment my-container-apps-env \
  --workload-profile-name "Consumption" \
  --runtime python \
  --functions-version 4
```

## Getting Started on Container App Jobs on Azure Container Apps

If you already have an Azure Container Apps environment, you can create a job using the Azure CLI. Check out the documentation for details - Jobs in Azure Container Apps

```shell
az containerapp job create \
  --name my-job \
  --resource-group my-resource-group \
  --environment my-container-apps-env \
  --trigger-type Manual \
  --image mcr.microsoft.com/k8se/quickstart-jobs:latest \
  --cpu 0.25 \
  --memory 0.5Gi
```

## Quick Links

- Azure Functions on Azure Container Apps overview
- Create your Azure Functions app through custom containers on Azure Container Apps
- Run event-driven and batch workloads with Azure Functions on Azure Container Apps

# Announcing Azure Functions Durable Task Scheduler Dedicated SKU GA & Consumption SKU Public Preview
Earlier this year, we introduced the Durable Task Scheduler, our orchestration engine designed for complex workflows and intelligent agents. It automatically checkpoints progress and protects your orchestration state, enabling resilient and reliable execution.

Today, we’re excited to announce a major milestone: Durable Task Scheduler is now Generally Available with the Dedicated SKU, and the Consumption SKU is entering Public Preview. These offerings provide advanced orchestration capabilities for cloud-native and AI applications, providing predictable pricing for steady workloads with the Dedicated SKU and flexible, pay-as-you-go billing for dynamic, variable workloads with the Consumption SKU.

> “The Durable Task Scheduler has been a game-changer for our projects. It keeps our workflows running reliably with minimal code, even as they grow in complexity. It automatically recovers from unexpected issues, so we don’t have to step in. It scales to handle millions of orchestrations, and the real-time dashboard makes it simple to monitor and manage everything as it happens.” – Pedram Rezaei, VP of Engineering for Copilot

## What is the Durable Task Scheduler?

If you’re new to the Durable Task Scheduler, we recommend checking out our previous blog posts for a detailed background on what it is and how/when to leverage it:

- https://aka.ms/dts-public-preview
- https://aka.ms/workflow-in-aca

In brief, the Durable Task Scheduler is a fully managed backend for durable execution on Azure. It can serve as the backend for a Durable Function App using the Durable Functions extension, or as the backend for an app leveraging the Durable Task SDKs in other compute environments, such as Azure Container Apps, Azure Kubernetes Services, or Azure App Service.
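For a Durable Functions app, pointing at a scheduler is primarily a host.json change: the `durableTask` extension is configured to use the managed backend instead of a storage account. The sketch below is hedged — the property names (`azureManaged`, `connectionStringName`) and the connection-string app setting reflect the documentation at the time of writing and should be verified against the links above, and the hub name is a placeholder:

```json
{
  "version": "2.0",
  "extensions": {
    "durableTask": {
      "hubName": "my-task-hub",
      "storageProvider": {
        "type": "azureManaged",
        "connectionStringName": "DURABLE_TASK_SCHEDULER_CONNECTION_STRING"
      }
    }
  }
}
```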
It simplifies the development and operation of complex, stateful, and long-running workflows by providing automatic orchestration state persistence, fault tolerance, and built-in monitoring, freeing developers from the operational overhead of managing orchestration storage and failure recovery.

The Durable Task Scheduler is designed to deliver the best possible experience by addressing the key challenges developers face when self-managing orchestration infrastructure, such as configuring storage accounts, checkpointing orchestration progress, troubleshooting unexpected orchestration behavior, and ensuring high reliability. As of today, the Durable Task Scheduler is available across all Function App SKUs and includes autoscaling support in options like Flex Consumption.

> “Durable Task Scheduler has significantly accelerated execution of complex business logic which requires orchestration. We are observing up to 10 times faster speed as compared to the blob storage backend. We also love the dashboard view for our taskhubs, which gives us great visibility and helps us monitor, time and manage our workflows.” – Roney Varghese, Software Engineer at Pinnacle Tech

## Dedicated and Consumption SKUs

### Dedicated SKU (GA)

The Dedicated SKU, which has been available in public preview since March of this year, has now graduated to General Availability. It delivers predictable performance and high reliability with dedicated infrastructure, high throughput, and up to 90 days of orchestration data retention. It’s ideal for mission-critical workloads requiring consistent, high-scale throughput and for organizations that prefer predictable billing.

Key features of the Dedicated SKU include:

- Dedicated Infrastructure: Runs on dedicated resources guaranteeing isolation.
- Custom Scaling: Configure Capacity Units (CUs) to match your workload needs.
- High Availability: High availability with multi-CU deployments.
- Data Retention: Up to 90 days.
- Performance: Each CU supports up to 2,000 actions per second and 50GB of orchestration data.

### What’s new in the Dedicated SKU?

**More Capacity Units.** As of today, the Dedicated SKU enables you to purchase additional capacity units for high performance and orchestration data storage.

**High Availability.** For applications requiring even higher availability for mission-critical scenarios, the Dedicated SKU now offers a High Availability feature. To enable high availability, you need at least 3 capacity units on your scheduler instance.

Learn more about the Dedicated SKU here: https://aka.ms/dts-dedicated-sku

### Consumption SKU (Public Preview)

We’ve heard your feedback loud and clear. We understand that the Dedicated SKU isn’t the right fit for every scenario. That’s why we’re introducing a new pricing plan: the Consumption SKU, a SKU tailored for workloads that run intermittently or scale dynamically, and for requirements where flexibility and cost efficiency matter most.

The Consumption SKU is perfect for variable workloads and development/test environments. It offers:

- Pay-Per-Use: Only pay for actions dispatched.
- No Upfront Costs: No minimum commitments.
- Data Retention: Up to 30 days.
- Performance: Up to 500 actions per second.

Learn more about the Consumption SKU here: https://aka.ms/dts-consumption-sku

## Roadmap

We’re excited to reach this milestone, but we also have many plans for the future. Here’s a glimpse of the features you can expect to see in the Durable Task Scheduler in the near future:

- Private Endpoints
- Zone Redundancy in the Dedicated SKU
- Export API – Need your orchestration data for longer than the max retention limit? Use the Export API to move data out of DTS into a storage provider of your choice.
- Dynamic Scaling of Capacity Units – Set a minimum and maximum and allow DTS to dynamically scale up and down depending on orchestration throughput.
- Ability to handle payloads larger than 1MB

## Get started with the Durable Task Scheduler today

- Documentation: https://aka.ms/dts-documentation
- Samples: https://aka.ms/dts-samples
- Getting Started: https://aka.ms/dts-getting-started

# OpenAI Agent SDK Integration with Azure Durable Functions
Picture this: Your agent authored with the OpenAI Agent SDK is halfway through analyzing 10,000 customer reviews when it hits a rate limit and dies. All that progress? Gone. Your multi-agent workflow that took 30 minutes to orchestrate? Back to square one because of a rate limit throttle.

If you've deployed AI agents in production, you probably know this frustration first-hand. Today, we're announcing a solution that makes your agents reliable: OpenAI Agent SDK Integration with Azure Durable Functions. This integration provides automatic state persistence, enabling your agents to survive any failure and continue exactly where they stopped. No more lost progress, no more starting over, just reliable agents that work.

## The Challenge with AI Agents

Building AI agents that work reliably in production environments has proven to be one of the most significant challenges in modern AI development. As agent sophistication increases with complex workflows involving multiple LLM calls, tool executions, and agent hand-offs, the likelihood of encountering failures increases. This creates a fundamental problem for production AI systems where reliability is essential.

Common failure scenarios include:

- Rate Limiting: Agents halt mid-process when hitting API rate limits during LLM calls
- Network Timeouts: Workflows terminate due to connectivity issues
- System Crashes: Multi-agent systems fail when individual components encounter errors
- State Loss: Complex workflows restart from the beginning after any interruption

Traditional approaches force developers to choose between building complex retry logic with significant code changes or accepting unreliable agent behavior. Neither option is suitable for production-grade AI systems that businesses depend on, and that’s why we’re introducing this integration.
## Key Benefits of the OpenAI Agent SDK Integration with Azure Durable Functions

Our solution leverages durable execution value propositions to address these reliability challenges while preserving the familiar OpenAI Agents Python SDK developer experience. The integration enables agent invocations hosted on Azure Functions to run within durable orchestration contexts where both agent LLM calls and tool calls are executed as durable operations.

This integration delivers significant advantages for production AI systems, such as:

- Enhanced Agent Resilience - Built-in retry mechanisms for LLM calls and tool executions enable agents to automatically recover from failures and continue from their last successful step
- Multi-Agent Orchestration Reliability - Individual agent failures don't crash entire multi-agent workflows, and complex orchestrations maintain state across system restarts
- Built-in Observability - Monitor agent progress through the Durable Task Scheduler dashboard with enhanced debugging and detailed execution tracking (only applicable when using the Durable Task Scheduler as the Durable Functions backend)
- Seamless Developer Experience - Keep using the OpenAI Agents SDK interface you already know, with minimal code changes required to add reliability
- Distributed Compute and Scalability - Agent workflows automatically scale across multiple compute instances

### Core Integration Components

These powerful capabilities are enabled through just a few simple additions to your AI application:

- `durable_openai_agent_orchestrator`: Decorator that enables durable execution for agent invocations
- `run_sync`: Uses an existing OpenAI Agents SDK API that executes your agent with built-in durability
- `create_activity_tool`: Wraps tool calls as durable activities with automatic retry capabilities
- State Persistence: Maintains agentic workflow state across failures and restarts

## Hello World Example

Let's see how this works in practice.
Here's what code written using the OpenAI Agent SDK looks like: import asyncio from agents import Agent, Runner async def main(): agent = Agent( name="Assistant", instructions="You only respond in haikus.", ) result = await Runner.run(agent, "Tell me about recursion in programming.") print(result.final_output) With our added durable integration, it becomes: from agents import Agent, Runner @app.orchestration_trigger(context_name="context") @app.durable_openai_agent_orchestrator # Runs the agent invocation in the context of a durable orchestration def hello_world(context): agent = Agent( name="Assistant", instructions="You only respond in haikus.", ) result = Runner.run_sync(agent, "Tell me about recursion in programming.") # Provides synchronous execution with built-in durability return result.final_output rable Task Scheduler dashboard showcasing the agent LLM call as a durable operation Notice how little actually changed. We added app.durable_openai_agent_orchestrator decorator but your core agent logic stays the same. The run_sync* method provides execution with built-in durability, enabling your agents to automatically recover from failures with minimal code changes. When using the Durable Task Scheduler as your Durable Functions backend, you gain access to a detailed monitoring dashboard that provides visibility into your agent executions. The dashboard displays detailed inputs and outputs for both LLM calls and tool invocations, along with clear success/failure indicators, making it straightforward to diagnose and troubleshoot any unexpected behavior in your agent processes. A note about 'run_sync' In Durable Functions, orchestrators don’t usually benefit from invoking code asynchronously because their role is to define the workflow—tracking state, scheduling activities, and so on—not to perform actual work. When you call an activity, the framework records the decision and suspends the orchestrator until the result is ready. 
For example, when you call `run_sync`, the deterministic part of the call completes almost instantly, and the LLM call activity is scheduled for asynchronous execution. Adding extra asynchronous code inside the orchestrator doesn’t improve performance; it only breaks determinism and complicates replay.

## Reliable Tool Invocation Example

For agents requiring tool interactions, there are two implementation approaches. The first option uses the `@function_tool` decorator from the OpenAI Agent SDK, which executes directly within the context of the durable orchestration. When using this approach, your tool functions must follow Durable Functions orchestration determinism constraints. Additionally, since these functions run within the orchestration itself, they may be replayed as part of normal operations, making cost-conscious implementation necessary.

```python
from pydantic import BaseModel
from agents import Agent, Runner, function_tool


class Weather(BaseModel):
    city: str
    temperature_range: str
    conditions: str


@function_tool
def get_weather(city: str) -> Weather:
    """Get the current weather information for a specified city."""
    print("[debug] get_weather called")
    return Weather(
        city=city, temperature_range="14-20C", conditions="Sunny with wind."
    )


@app.orchestration_trigger(context_name="context")
@app.durable_openai_agent_orchestrator
def tools(context):
    agent = Agent(
        name="Hello world",
        instructions="You are a helpful agent.",
        tools=[get_weather],
    )
    result = Runner.run_sync(agent, input="What's the weather in Tokyo?")
    return result.final_output
```

The second approach uses the `create_activity_tool` function, which is designed for non-deterministic code or scenarios where rerunning the tool is expensive (in terms of performance or cost). This approach executes the tool within the context of a durable orchestration activity, providing enhanced monitoring through the Durable Task Scheduler dashboard and ensuring that expensive operations are not unnecessarily repeated during orchestration replays.
```python
from pydantic import BaseModel

from agents import Agent, Runner


class Weather(BaseModel):
    city: str
    temperature_range: str
    conditions: str


@app.orchestration_trigger(context_name="context")
@app.durable_openai_agent_orchestrator
def weather_expert(context):
    agent = Agent(
        name="Hello world",
        instructions="You are a helpful agent.",
        tools=[
            context.create_activity_tool(get_weather)
        ],
    )
    result = Runner.run_sync(agent, "What is the weather in Tokyo?")
    return result.final_output


@app.activity_trigger(input_name="city")
async def get_weather(city: str) -> Weather:
    weather = Weather(
        city=city,
        temperature_range="14-20C",
        conditions="Sunny with wind.",
    )
    return weather
```

Leveraging Durable Functions Stateful App Patterns

Beyond basic durability of agents, this integration provides access to the full Durable Functions orchestration context, enabling developers to implement sophisticated stateful application patterns when needed, such as:

External Event Handling: Use context.wait_for_external_event() for human approvals, external system callbacks, or time-based triggers
Fan-out/Fan-in: Coordinate multiple tasks (including sub-orchestrations invoking agents) in parallel
Long-running Workflows: Implement workflows that span hours, days, or weeks with persistent state
Conditional Logic: Build dynamic agent workflows based on runtime decisions and external inputs

Human Interaction and Approval Workflows Example

For scenarios requiring human oversight, you can leverage the orchestration context to implement approval workflows:

```python
@app.orchestration_trigger(context_name="context")
@app.durable_openai_agent_orchestrator
def agent_with_approval(context):
    # Run initial agent analysis
    agent = Agent(name="DataAnalyzer", instructions="Analyze the provided dataset")
    initial_result = Runner.run_sync(agent, context.get_input())

    # Wait for human approval before proceeding
    approval_event = context.wait_for_external_event("approval_received")

    if approval_event.get("approved"):
        # Continue with next phase
        final_agent = Agent(name="Reporter", instructions="Generate final report")
        final_result = Runner.run_sync(final_agent, initial_result.final_output)
        return final_result.final_output
    else:
        return "Workflow cancelled by user"
```

This flexibility allows you to build sophisticated agentic applications that combine the power of AI agents with enterprise-grade workflow orchestration patterns, all while maintaining the familiar OpenAI Agents SDK experience.

Get Started Today

This article only scratches the surface of what's possible with the OpenAI Agents SDK integration for Durable Functions. The combination of familiar OpenAI Agents SDK patterns with added reliability opens new possibilities for building sophisticated AI systems that can handle real-world production workloads.

The integration is designed for a smooth onboarding experience. Begin by selecting one of your existing agents and applying the transformation patterns demonstrated above (often requiring just a few lines of code changes).
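As another starting point, the fan-out/fan-in pattern from the list above can be sketched with the standard Durable Functions Python orchestration APIs (context.call_activity and context.task_all). This is a minimal sketch, not part of the OpenAI Agents SDK integration itself; the function and input names are hypothetical, and as with the earlier examples it assumes an existing `app` function app instance:

```python
@app.orchestration_trigger(context_name="context")
def parallel_summaries(context):
    topics = context.get_input()  # e.g. ["Tokyo", "Seattle", "London"]
    # Fan out: schedule one activity per topic without awaiting each result
    tasks = [context.call_activity("summarize_topic", t) for t in topics]
    # Fan in: the orchestrator resumes once every parallel task has completed
    results = yield context.task_all(tasks)
    return results


@app.activity_trigger(input_name="topic")
def summarize_topic(topic: str) -> str:
    # Placeholder for real work, e.g. invoking an agent or a downstream API
    return f"Summary for {topic}"
```

The same shape extends to sub-orchestrations that each invoke an agent, which is the combination the pattern list describes.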
Documentation: https://aka.ms/openai-agents-with-reliability-docs
Sample Applications: https://aka.ms/openai-agents-with-reliability-samples

Announcing Native Azure Functions Support in Azure Container Apps
Azure Container Apps is introducing a new, streamlined method for running Azure Functions directly in Azure Container Apps (ACA). This integration allows you to leverage the full features and capabilities of Azure Container Apps while benefiting from the simplicity of auto-scaling provided by Azure Functions.

With the new native hosting model, you can deploy Azure Functions directly onto Azure Container Apps using the Microsoft.App resource provider by setting the “kind=functionapp” property on the container app resource. You can deploy Azure Functions using ARM templates, Bicep, Azure CLI, and the Azure portal.

Get started today and explore the complete feature set of Azure Container Apps, including multi-revision management, easy authentication, metrics and alerting, health probes, and many more. To learn more, visit: https://aka.ms/fnonacav2

Announcing the public preview launch of Azure Functions durable task scheduler
We are excited to roll out the public preview of the Azure Functions durable task scheduler. This new Azure-managed backend is designed to provide high performance, improve reliability, reduce operational overhead, and simplify the monitoring of your stateful orchestrations. If you missed the initial announcement of the private preview, see this blog post.

Durable Task Scheduler

Durable functions simplifies the development of complex, stateful, and long-running apps in a serverless environment. It allows developers to orchestrate multiple function calls without having to handle fault tolerance. It's great for scenarios like orchestrating multiple agents, distributed transactions, big data processing, batch processing like ETL (extract, transform, load), asynchronous APIs, and essentially any scenario that requires chaining function calls with state persistence.

The durable task scheduler is a new storage provider for durable functions, designed to address the challenges and gaps identified by our durable customers with existing bring-your-own storage options. Over the past few months, since the initial limited early access launch of the durable task scheduler, we’ve been working closely with our customers to understand their requirements and ensure they are fully supported in using the durable task scheduler successfully. We’ve also dedicated significant effort to strengthening the fundamentals – expanding regional availability, solidifying APIs, and ensuring the durable task scheduler is reliable, secure, scalable, and can be leveraged from any of the supported durable functions programming languages. Now, we’re excited to open the gates and make the durable task scheduler available to the public.

Some notable capabilities and enhancements over the existing “bring your own storage” options include:

Azure Managed

Unlike the other existing storage providers for durable functions, the durable task scheduler offers dedicated resources that are fully managed by Azure.
You no longer need to bring your own storage account for storing orchestration and entity state, as it is completely built in. Looking ahead, the roadmap includes additional operational capabilities, such as auto-purging old execution history, handling failover, and other Business Continuity and Disaster Recovery (BCDR) capabilities.

Superior Performance and Scalability

Enhanced throughput for processing orchestrations and entities, ideal for demanding and high-scale applications. Efficiently manages sudden bursts of events, ensuring reliable and quick processing of your orchestrations across your function app instances.

The table below compares the throughput of the durable task scheduler provider and the Azure Storage provider. The function app used for this test runs on one to four Elastic Premium EP2 instances. The orchestration code was written in C# using the .NET Isolated worker model on .NET 8. The same app was used for all storage providers, and the only change was the backend storage provider configuration. The test is triggered using an HTTP trigger which starts 5,000 orchestrations concurrently. The benchmark used a standard orchestrator function calling five activity functions sequentially, each returning a "Hello, {cityName}!" string. This specific benchmark showed that the durable task scheduler is roughly five times faster than the Azure Storage provider.

Orchestration Debugging and Management Dashboard

Simplify the monitoring and management of orchestrations with an intuitive out-of-the-box UI. It offers clear visibility into orchestration errors and lifecycle events through detailed visual diagrams, providing essential information on exceptions and processing times. It also enables interactive orchestration management, allowing you to perform ad hoc actions such as suspending, resuming, raising events, and terminating orchestrations. Monitor the inputs and outputs between orchestration and activities.
Exceptions are surfaced, making it easy to identify where and why an orchestration may have failed.

Security Best Practices

Uses identity-based authentication with Role-Based Access Control (RBAC) for enterprise-grade authorization, eliminating the need for SAS tokens or access keys.

Local Emulator

To simplify the development experience, we are also launching a durable task scheduler emulator that can be run as a container on your development machine. The emulator supports the same durable task scheduler runtime APIs and stores data in local memory, enabling a completely offline debugging experience. The emulator also allows you to run the durable task scheduler management dashboard locally.

Pricing Plan

We’re excited to announce the initial launch of the durable task scheduler with a Dedicated, fixed pricing plan. One of the key pieces of feedback that we’ve consistently received from customers is the desire for more upfront billing transparency. To address this, we’ve introduced a fixed pricing model with the option to purchase a specific amount of performance and storage through an abstraction called a Capacity Unit (CU). A single CU provides:

Single tenancy with dedicated resources for predictable performance
Up to 2,000 work items* dispatched per second
50 GB of orchestration data storage

A Capacity Unit (CU) is a measure of the resources allocated to your durable task scheduler. Each CU represents a pre-allocated amount of CPU, memory, and storage resources. A single CU guarantees the dispatch of a certain number of work items and provides a defined amount of storage. If additional performance and/or storage are needed, more CUs can be purchased*.

A Work Item is a message dispatched by the durable task scheduler to your application, triggering the execution of orchestrator, activity, or entity functions. The number of work items that can be dispatched per second is determined by the Capacity Units allocated to the durable task scheduler.
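To make the capacity math concrete, the two per-CU figures quoted above (up to 2,000 dispatched work items per second and 50 GB of storage) translate into a simple sizing estimate. The helper below is our own illustration, not part of any SDK:

```python
import math

# Per-CU figures quoted above
WORK_ITEMS_PER_SECOND_PER_CU = 2000
STORAGE_GB_PER_CU = 50


def estimate_capacity_units(peak_work_items_per_second: int, storage_gb: float) -> int:
    """Estimate how many CUs a scheduler needs to cover both
    peak dispatch throughput and orchestration data storage."""
    by_throughput = math.ceil(peak_work_items_per_second / WORK_ITEMS_PER_SECOND_PER_CU)
    by_storage = math.ceil(storage_gb / STORAGE_GB_PER_CU)
    # Whichever dimension demands more capacity wins; at least one CU is required
    return max(by_throughput, by_storage, 1)


# For example, a workload peaking at 4,500 work items/sec with 60 GB of state:
print(estimate_capacity_units(4500, 60))  # → 3
```

Keep in mind that at the start of the public preview, schedulers are limited to a single CU, so this is forward-looking sizing guidance.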
For detailed instructions on determining the number of work items your application needs and the number of CUs you should purchase, please refer to the guidance provided here.

*At the beginning of the public preview phase, schedulers will be temporarily limited to a single CU.
*Billing for the durable task scheduler will begin on May 1st, 2025.

Under the Hood

The durable functions team has been continuously evolving the architecture of the backends that persist the state of orchestrations and entities; the durable task scheduler is the latest installment in this series, and it includes both the most successful characteristics of its predecessors, as well as some significant improvements of its own. In the next paragraph, we shed some light on what is new. Of course, it is not necessary to understand these internal implementation details, and they are subject to change as we will keep improving and optimizing the design.

Like the MSSQL provider, the durable task scheduler uses a SQL database as the storage foundation, to provide robustness and versatility. Like the Netherite provider, it uses a partitioned design to achieve scale-out, and a pipelining optimization to boost partition persistence. Unlike the previous backends, however, the durable task scheduler runs as a service, on its own compute nodes, to which workers are connected by gRPC. This significantly improves latency and load balancing. It strongly isolates the workflow management logic from the user application, allowing them to be scaled separately.

What can we expect next for the durable task scheduler?

One of the most exciting developments is the significant interest we’ve received in leveraging the durable task scheduler across other Azure compute offerings beyond Azure Functions, such as Azure Container Apps (ACA) and Azure Kubernetes Service (AKS).
As we continue to enhance the integration with durable functions, we have also integrated the durable task SDKs, which are the underlying technology behind the durable task framework and durable functions, to support the durable task scheduler directly. We refer to these durable task SDKs as the “portable SDKs” because they are client-only SDKs that connect directly to the durable task scheduler, where the managed orchestration engine resides, eliminating any dependency on the underlying compute platform, hence the name “portable”. By utilizing the portable SDKs to author your orchestrations as code, you can deploy your orchestrations across any Azure compute offering. This allows you to leverage the durable task scheduler as the backend, benefiting from its full set of capabilities. If you would like to discuss this further with our team or are interested in trying out the portable SDK yourself, please feel free to reach out to us at DurableTaskScheduler@microsoft.com. We welcome your questions and feedback.

We've also received feedback from customers requesting a versioning mechanism to facilitate zero-downtime deployments. This feature would enable you to manage breaking workflow changes by allowing all in-flight orchestrations using the older version to complete, while switching new orchestrations to the updated version. This is already in development and will be available in the near future.

Lastly, we are in the process of introducing critical enterprise features under the category of Business Continuity and Disaster Recovery (BCDR). We understand the importance of these capabilities as our customers rely on the durable task scheduler for critical production scenarios.

Get started with the durable task scheduler

Migrating to the durable task scheduler from an existing durable function application is a quick process. The transition is purely configuration changes, meaning your existing orchestrations and business logic remain unchanged.
The durable task scheduler is provided through a new Azure resource known as a scheduler. Each scheduler can contain one or multiple task hubs, which are sub-resources. A task hub, an established concept within durable functions, is responsible for managing the state and execution of orchestrations and activities. Think of a task hub as a logical way to separate your applications that require orchestration execution.

A Durable Task Scheduler resource in the Azure Portal includes a task hub named dts-github-agent

Once you have created a scheduler and task hub(s), simply add the library package to your project and update your host.json to point your function app to the durable task scheduler endpoint and task hub. That's all there is to it. With the correct authentication configuration applied, your applications can fully leverage the capabilities of the durable task scheduler. For more detailed information on how to migrate or start using the durable task scheduler, visit our getting started page here.

Get in touch to learn more

We are always interested in engaging with both existing and potential new customers. If any of the above interests you, if you have any questions, or if you simply want to discuss your scenarios and explore how you can leverage the durable task scheduler, feel free to reach out to us anytime. Our line is always open - DurableTaskScheduler@microsoft.com.