Bulletproof agents with the durable task extension for Microsoft Agent Framework

greenie-msft

Microsoft

Nov 13, 2025

What if your AI agents could survive crashes, execute across thousands of instances, wait weeks for human approval, and cost you nothing while idle, all automatically?

Today, we're thrilled to announce the public preview of the durable task extension for Microsoft Agent Framework. This extension transforms how you build production-ready, resilient and scalable AI agents by bringing the proven durable execution (survives crashes and restarts) and distributed execution (runs across multiple instances) capabilities of Azure Durable Functions directly into the Microsoft Agent Framework. Now you can deploy stateful, resilient AI agents to Azure that automatically handle session management, failure recovery, and scaling, freeing you to focus entirely on your agent logic.

Whether you're building customer service agents that maintain context across multi-day conversations, content pipelines with human-in-the-loop approval workflows, or fully automated multi-agent systems coordinating specialized AI models, the durable task extension gives you production-grade reliability, scalability and coordination with serverless simplicity.

Key features of the durable task extension include:

Serverless Hosting: Deploy agents on Azure Functions with auto-scaling from thousands of instances to zero, while retaining full control in a serverless architecture.
Automatic Session Management: Agents maintain persistent sessions with full conversation context that survives process crashes, restarts, and distributed execution across instances
Deterministic Multi-Agent Orchestrations: Coordinate specialized durable agents with predictable, repeatable, code-driven execution patterns
Human-in-the-Loop with Serverless Cost Savings: Pause for human input without consuming compute resources or incurring costs
Built-in Observability with Durable Task Scheduler: Deep visibility into agent operations and orchestrations through the Durable Task Scheduler UI dashboard

Click here to create and run a durable agent

# Python

endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")

# Create an AI agent following the standard Microsoft Agent Framework pattern
agent = AzureOpenAIChatClient(
    endpoint=endpoint,
    deployment_name=deployment_name,
    credential=AzureCliCredential()
).create_agent(
    instructions="""You are a professional content writer who creates engaging, 
    well-structured documents for any given topic. 
    
    When given a topic, you will:
    1. Research the topic using the web search tool
    2. Generate an outline for the document
    3. Write a compelling document with proper formatting
    4. Include relevant examples and citations""",
    name="DocumentPublisher",
    tools=[
        AIFunctionFactory.Create(search_web),
        AIFunctionFactory.Create(generate_outline)
    ]
)

# Configure the function app to host the agent with durable session management
app = AgentFunctionApp(agents=[agent])

app.run()

// C#

var endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");
var deploymentName = Environment.GetEnvironmentVariable("AZURE_OPENAI_DEPLOYMENT") ?? "gpt-4o-mini";

// Create an AI agent following the standard Microsoft Agent Framework pattern
AIAgent agent = new AzureOpenAIClient(new Uri(endpoint), new DefaultAzureCredential())
    .GetChatClient(deploymentName)
    .CreateAIAgent(
        instructions: """You are a professional content writer who creates engaging, 
        well-structured documents for any given topic. 
        
        When given a topic, you will:
        1. Research the topic using the web search tool
        2. Generate an outline for the document
        3. Write a compelling document with proper formatting
        4. Include relevant examples and citations""",
        name: "DocumentPublisher",
        tools: [
            AIFunctionFactory.Create(SearchWeb),
            AIFunctionFactory.Create(GenerateOutline)
        ]);

// Configure the function app to host the agent with durable thread management
// This automatically creates HTTP endpoints and manages state persistence
using IHost app = FunctionsApplication
    .CreateBuilder(args)
    .ConfigureFunctionsWebApplication()
    .ConfigureDurableAgents(options =>
        options.AddAIAgent(agent)
    )
    .Build();
app.Run();

Why the durable task extension?

As AI agents evolve from simple chatbots to sophisticated systems handling complex, long-running tasks, new challenges emerge:

Conversations span multiple days and weeks, requiring persistent state across process restarts, crashes, and disruptions.
Tool calls might take longer than typical timeouts allow, needing automatic checkpointing and recovery.
High-volume workloads require elastic scaling across distributed instances to handle thousands of concurrent agent conversations.
Multiple specialized agents need coordination with predictable, repeatable execution for reliable business processes.
Agents sometimes must wait for human approval before proceeding, ideally without consuming resources.

The Durable Extension addresses these challenges by extending Microsoft Agent Framework with capabilities from Azure Durable Functions, enabling you to build AI agents that survive failures, scale elastically, and execute predictably through durable and distributed execution.

The extension is built on four foundational value pillars, which we refer to as the 4D’s:

Durability

Every agent state change (messages, tool calls, decisions) is durably checkpointed automatically. Agents survive and automatically resume from infrastructure updates, crashes, and can be unloaded from memory during long waiting periods without losing context. This is essential for agents that orchestrate long-running operations or wait for external events.

Distributed

Agent execution is accessible across all instances, enabling elastic scaling and automatic failover. Healthy nodes seamlessly take over work from failed instances, ensuring continuous operation. This distributed execution model allows thousands of stateful agents to scale up and run in parallel.

Deterministic

Agent orchestrations execute predictably using imperative logic written as ordinary code. Define the execution path, enabling automated testing, verifiable guardrails, and business-critical workflows that stakeholders can trust. This complements agent-directed workflows by providing explicit control flow when needed.

Debuggability

Use familiar development tools (IDEs, debuggers, breakpoints, stack traces, and unit tests) and programming languages to develop and debug. Your agent and agent orchestrations are expressed as code, making them easily testable, debuggable, and maintainable.

Features in action

Serverless hosting

Deploy agents to Azure Functions (with expansion to other Azure computes soon) with automatic scaling to thousands of instances or down to zero when not in use. Pay only for the compute resources you consume. This code-first deployment approach gives you full control over the compute environment while maintaining the benefits of a serverless architecture.

# Python

endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")

# Create an AI agent following the standard Microsoft Agent Framework pattern
agent = AzureOpenAIChatClient(
    endpoint=endpoint,
    deployment_name=deployment_name,
    credential=AzureCliCredential()
).create_agent(
    instructions="""You are a professional content writer who creates engaging, 
    well-structured documents for any given topic. 
    
    When given a topic, you will:
    1. Research the topic using the web search tool
    2. Generate an outline for the document
    3. Write a compelling document with proper formatting
    4. Include relevant examples and citations""",
    name="DocumentPublisher",
    tools=[
        AIFunctionFactory.Create(search_web),
        AIFunctionFactory.Create(generate_outline)
    ]
)

# Configure the function app to host the agent with durable session management
app = AgentFunctionApp(agents=[agent])

app.run()

Automatic session management

Agent sessions are automatically checkpointed in durable storage that you configure in your function app, enabling durable and distributed execution across multiple instances. Any instance can resume an agent's execution after interruptions or process failures, ensuring continuous operation.

Under the hood, agents are implemented as durable entities. These are stateful objects that maintain their state across executions. This architecture enables each agent session to function as a reliable, long-lived entity with preserved conversation history and context.

Example scenario: A customer service agent handling a complex support case over multiple days and weeks. The conversation history, context, and progress are preserved even if the agent is redeployed or moves to a different instance.

# First interaction - start a new thread to create a document
curl -X POST https://your-function-app.azurewebsites.net/api/agents/DocumentPublisher/threads \
  -H "Content-Type: application/json" \
  -d '{"message": "Create a document about the benefits of Azure Functions"}'

# Response includes thread ID and initial document outline/draft
# {"threadId": "doc789", "response": "I'll create a comprehensive document about Azure Functions benefits. Let me search for the latest information... [Document Draft] # Benefits of Azure Functions\n\n## Introduction\nAzure Functions is a serverless compute service that enables you to run event-driven code without managing infrastructure...\n\n## Cost Efficiency\n- Pay only for execution time\n- No charges for idle resources\n- Automatic scaling reduces over-provisioning...\n\n## Developer Productivity\n- Multiple language support (C#, Python, JavaScript, Java)\n- Integrated development tools and CI/CD...\n\n## Scalability\n- Automatic scaling based on demand\n- Handle millions of requests seamlessly...\n\nReferences: [Azure documentation, serverless computing studies]"}

# Second interaction - continue the same thread to refine the document
curl -X POST https://your-function-app.azurewebsites.net/api/agents/DocumentPublisher/threads/doc789 \
  -H "Content-Type: application/json" \
  -d '{"message": "Can you add a section about integration with other Azure services?"}'

# Agent remembers the Azure Functions document context and adds the requested section
# {"threadId": "doc789", "response": "I've added a comprehensive integration section to your Azure Functions document:\n\n## Integration with Azure Services\n\n### Azure Storage\nSeamless triggers and bindings for Blob Storage, Queue Storage, and Table Storage enable event-driven architectures...\n\n### Azure Event Grid & Event Hubs\nProcess real-time event streams and implement publish-subscribe patterns at scale...\n\n### Azure Cosmos DB\nBuilt-in bindings for document database operations with automatic change feed processing...\n\n### Azure Service Bus\nReliable message processing with enterprise messaging features...\n\n### Azure AI Services\nEasily integrate OpenAI, Cognitive Services, and AI Search for intelligent applications...\n\nThe document now includes this section after the Scalability section. Would you like me to add use cases or deployment best practices?"}

Deterministic multi-agent orchestrations

Coordinate multiple specialized durable agents using imperative code where you define the control flow. This differs from agent-directed workflows where the agent decides the next steps. Deterministic Orchestrations provide predictable, repeatable execution patterns with automatic checkpointing and recovery.

Example scenario: An email processing system that uses a spam detection agent, then conditionally routes to different specialized agents based on the classification. The orchestration automatically recovers if any step fails and completed agent calls are not re-executed.

# Python

app.orchestration_trigger(context_name="context")
def document_publishing_orchestration(context: DurableOrchestrationContext):
    """Deterministic orchestration coordinating multiple specialized agents."""
    doc_request = context.get_input()

    # Get specialized agents from the orchestration context
    research_agent = context.get_agent("ResearchAgent") 
    writer_agent = context.get_agent("DocumentPublisherAgent")

    # Step 1: Research the topic using web search
    research_result = yield research_agent.run(
        messages=f"Research the following topic and gather key information: {doc_request.topic}",
        response_schema=ResearchResult
    )

    # Step 2: Generate outline based on research findings
    outline = yield context.call_activity("generate_outline", {
        "topic": doc_request.topic,
        "research_data": research_result.findings
    })
    
    # Step 3: Write the document with the research and outline
    document = yield writer_agent.run(
        messages=f"""Create a comprehensive document about {doc_request.topic}.
        
        Research findings: {research_result.findings}
        Outline: {outline}
        
        Write a well-structured, engaging document with proper formatting and citations.""",
        response_schema=DocumentResponse
    )
    
    # Step 4: Save and publish the generated document
    return yield context.call_activity("publish_document", {
        "title": doc_request.topic,
        "content": document.text,
        "citations": document.citations
    })

Human-in-the-loop

Orchestrations and agents can pause for human input, approval, or review without consuming compute resources. Durable execution enables orchestrations to wait for days or even weeks while waiting for human responses, even if the app crashes or restarts. When combined with serverless hosting, all compute resources are spun down during the wait period, eliminating compute costs until the human provides their input.

Example scenario: A content publishing agent that generates drafts, sends them to human reviewers, and waits days for approval without running (or paying for) compute resources during the review period. When the human response arrives, the orchestration automatically resumes with full conversation context and execution state intact.

# Python

app.orchestration_trigger(context_name="context")
def content_approval_workflow(context: DurableOrchestrationContext):
    """Human-in-the-loop workflow with zero-cost waiting."""
    topic = context.get_input()

    # Step 1: Generate content using an agent
    content_agent = context.get_agent("ContentGenerationAgent")
    draft_content = yield content_agent.run(f"Write an article about {topic}")

    # Step 2: Send for human review
    yield context.call_activity("notify_reviewer", draft_content)

    # Step 3: Wait for approval - no compute resources consumed while waiting
    approval_event = context.wait_for_external_event("ApprovalDecision")
    timeout_task = context.create_timer(context.current_utc_datetime + timedelta(hours=24))
    
    winner = yield context.task_any([approval_event, timeout_task])
    
    if winner == approval_event:
        timeout_task.cancel()
        approved = approval_event.result
        
        if approved:
            result = yield context.call_activity("publish_content", draft_content)
            return result
        else:
            return "Content rejected"
    else:
        # Timeout - escalate for review
        result = yield context.call_activity("escalate_for_review", draft_content)
        return result

Built-in agent observability

Configure your Function App with the Durable Task Scheduler as the durable backend (what persists agents and orchestration state). The Durable Task Scheduler is the recommended durable backend for your durable agents, offering the best throughput performance, fully managed infrastructure, and built-in observability through a UI dashboard.

The Durable Task Scheduler dashboard provides deep visibility into your agent operations:

Conversation history: View complete conversation threads for each agent session, including all messages, tool calls, and conversation context at any point in time
Multi-agent visualization: See the execution flow when calling multiple specialized agents with visual representation of agent handoffs, parallel executions, and conditional branching
Performance metrics: Monitor agent response times, token usage, and orchestration duration
Execution history: Access detailed execution logs with full replay capability for debugging