Blog Post

Apps on Azure Blog
7 MIN READ

Bulletproof agents with the durable task extension for Microsoft Agent Framework

greenie-msft's avatar
greenie-msft
Icon for Microsoft rankMicrosoft
Nov 13, 2025

What if your AI agents could survive crashes, execute across thousands of instances, wait weeks for human approval, and cost you nothing while idle, all automatically?

Today, we're thrilled to announce the public preview of the durable task extension for Microsoft Agent Framework. This extension transforms how you build production-ready, resilient and scalable AI agents by bringing the proven durable execution (survives crashes and restarts) and distributed execution (runs across multiple instances) capabilities of Azure Durable Functions directly into the Microsoft Agent Framework. Now you can deploy stateful, resilient AI agents to Azure that automatically handle session management, failure recovery, and scaling, freeing you to focus entirely on your agent logic.

Whether you're building customer service agents that maintain context across multi-day conversations, content pipelines with human-in-the-loop approval workflows, or fully automated multi-agent systems coordinating specialized AI models, the durable task extension gives you production-grade reliability, scalability and coordination with serverless simplicity.

Key features of the durable task extension include:

  • Serverless Hosting: Deploy agents on Azure Functions with auto-scaling from thousands of instances to zero, while retaining full control in a serverless architecture.
  • Automatic Session Management: Agents maintain persistent sessions with full conversation context that survives process crashes, restarts, and distributed execution across instances
  • Deterministic Multi-Agent Orchestrations: Coordinate specialized durable agents with predictable, repeatable, code-driven execution patterns
  • Human-in-the-Loop with Serverless Cost Savings: Pause for human input without consuming compute resources or incurring costs
  • Built-in Observability with Durable Task Scheduler: Deep visibility into agent operations and orchestrations through the Durable Task Scheduler UI dashboard

Click here to create and run a durable agent

 

# Python

endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")

# Create an AI agent following the standard Microsoft Agent Framework pattern
agent = AzureOpenAIChatClient(
    endpoint=endpoint,
    deployment_name=deployment_name,
    credential=AzureCliCredential()
).create_agent(
    instructions="You are good at telling jokes.",
    name="Joker"
)

# Configure the function app to host the agent with durable session management
app = AgentFunctionApp(agents=[agent])

app.run()
// C#

var endpoint = Environment.GetEnvironmentVariable("AZURE_OPENAI_ENDPOINT");
var deploymentName = Environment.GetEnvironmentVariable("AZURE_OPENAI_DEPLOYMENT") ?? "gpt-4o-mini";

// Create an AI agent following the standard Microsoft Agent Framework pattern
AIAgent agent = new AzureOpenAIClient(new Uri(endpoint), new AzureCliCredential())
    .GetChatClient(deploymentName)
    .CreateAIAgent(
        instructions: "You are good at telling jokes.",
        name: "Joker");

// Configure the function app to host the agent with durable thread management
// This automatically creates HTTP endpoints and manages state persistence
using IHost app = FunctionsApplication
    .CreateBuilder(args)
    .ConfigureFunctionsWebApplication()
    .ConfigureDurableAgents(options =>
        options.AddAIAgent(agent)
    )
    .Build();
app.Run();

Why the durable task extension?

As AI agents evolve from simple chatbots to sophisticated systems handling complex, long-running tasks, new challenges emerge:

  • Conversations span multiple days and weeks, requiring persistent state across process restarts, crashes, and disruptions.
  • Tool calls might take longer than typical timeouts allow, needing automatic checkpointing and recovery.
  • High-volume workloads require elastic scaling across distributed instances to handle thousands of concurrent agent conversations.
  • Multiple specialized agents need coordination with predictable, repeatable execution for reliable business processes.
  • Agents sometimes must wait for human approval before proceeding, ideally without consuming resources.

The Durable Extension addresses these challenges by extending Microsoft Agent Framework with capabilities from Azure Durable Functions, enabling you to build AI agents that survive failures, scale elastically, and execute predictably through durable and distributed execution.

The extension is built on four foundational value pillars, which we refer to as the 4D’s:

Durability

Every agent state change (messages, tool calls, decisions) is durably checkpointed automatically. Agents survive and automatically resume from infrastructure updates, crashes, and can be unloaded from memory during long waiting periods without losing context. This is essential for agents that orchestrate long-running operations or wait for external events.

Distributed

Agent execution is accessible across all instances, enabling elastic scaling and automatic failover. Healthy nodes seamlessly take over work from failed instances, ensuring continuous operation. This distributed execution model allows thousands of stateful agents to scale up and run in parallel.

Deterministic

Agent orchestrations execute predictably using imperative logic written as ordinary code. Define the execution path, enabling automated testing, verifiable guardrails, and business-critical workflows that stakeholders can trust. This complements agent-directed workflows by providing explicit control flow when needed.

Debuggability

Use familiar development tools (IDEs, debuggers, breakpoints, stack traces, and unit tests) and programming languages to develop and debug. Your agent and agent orchestrations are expressed as code, making them easily testable, debuggable, and maintainable.

Features in action

Serverless hosting

Deploy agents to Azure Functions (with expansion to other Azure computes soon) with automatic scaling to thousands of instances or down to zero when not in use. Pay only for the compute resources you consume. This code-first deployment approach gives you full control over the compute environment while maintaining the benefits of a serverless architecture.

# Python

endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
deployment_name = os.getenv("AZURE_OPENAI_DEPLOYMENT_NAME", "gpt-4o-mini")

# Create an AI agent following the standard Microsoft Agent Framework pattern
agent = AzureOpenAIChatClient(
    endpoint=endpoint,
    deployment_name=deployment_name,
    credential=AzureCliCredential()
).create_agent(
    instructions="You are good at telling jokes.",
    name="Joker"
)

# Configure the function app to host the agent with durable session management
app = AgentFunctionApp(agents=[agent])

app.run()

Automatic session management

Agent sessions are automatically checkpointed in durable storage that you configure in your function app, enabling durable and distributed execution across multiple instances. Any instance can resume an agent's execution after interruptions or process failures, ensuring continuous operation.

Under the hood, agents are implemented as durable entities. These are stateful objects that maintain their state across executions. This architecture enables each agent session to function as a reliable, long-lived entity with preserved conversation history and context.

 

Example scenario: A customer service agent handling a complex support case over multiple days and weeks. The conversation history, context, and progress are preserved even if the agent is redeployed or moves to a different instance.

# First interaction - start a new thread
curl -X POST https://your-function-app.azurewebsites.net/api/agents/Joker/threads \
  -H "Content-Type: application/json" \
  -d '{"message": "Tell me a joke about pirates"}'

# Response includes thread ID and joke
# {"threadId": "abc123", "response": "Why do pirates make terrible singers? Because they hit the high Cs!"}

# Second interaction - continue the same thread with context
curl -X POST https://your-function-app.azurewebsites.net/api/agents/Joker/threads/abc123 \
  -H "Content-Type: application/json" \
  -d '{"message": "Tell me another one about the same topic"}'

# Agent remembers the pirate context from the first message
# {"threadId": "abc123", "response": "What's a pirate's favorite letter? You'd think it's R, but it's actually the C!"}

Deterministic multi-agent orchestrations

Coordinate multiple specialized durable agents using imperative code where you define the control flow. This differs from agent-directed workflows where the agent decides the next steps. Deterministic Orchestrations provide predictable, repeatable execution patterns with automatic checkpointing and recovery.

 

Example scenario: An email processing system that uses a spam detection agent, then conditionally routes to different specialized agents based on the classification. The orchestration automatically recovers if any step fails and completed agent calls are not re-executed.

# Python

app.orchestration_trigger(context_name="context")
def spam_detection_orchestration(context: DurableOrchestrationContext):
    """Deterministic orchestration coordinating multiple specialized agents."""
    email = context.get_input()

    # Get specialized agents from the orchestration context
    spam_agent = context.get_agent("SpamDetectionAgent") 
    email_agent = context.get_agent("EmailAssistantAgent")

    # Step 1: Check if the email is spam
    spam_result = yield spam_agent.run(
        messages=f"Analyze this email for spam: {email.content}",
        response_schema=SpamDetectionResult
    )

    # Step 2: Conditional logic based on spam detection
    if spam_result.is_spam:
        # Handle spam email
        return yield context.call_activity("handle_spam_email", spam_result.reason)
    
    # Step 3: Generate professional response for legitimate email
    email_response = yield email_agent.run(
        messages=f"Draft a professional response to: {email.content}",
        response_schema=EmailResponse
    )
    
    # Step 4: Send the generated response
    return yield context.call_activity("send_email", email_response.text)

Human-in-the-loop

Orchestrations and agents can pause for human input, approval, or review without consuming compute resources. Durable execution enables orchestrations to wait for days or even weeks while waiting for human responses, even if the app crashes or restarts. When combined with serverless hosting, all compute resources are spun down during the wait period, eliminating compute costs until the human provides their input.

Example scenario: A content publishing agent that generates drafts, sends them to human reviewers, and waits days for approval without running (or paying for) compute resources during the review period. When the human response arrives, the orchestration automatically resumes with full conversation context and execution state intact.

 

# Python

app.orchestration_trigger(context_name="context")
def content_approval_workflow(context: DurableOrchestrationContext):
    """Human-in-the-loop workflow with zero-cost waiting."""
    topic = context.get_input()

    # Step 1: Generate content using an agent
    content_agent = context.get_agent("ContentGenerationAgent")
    draft_content = yield content_agent.run(f"Write an article about {topic}")

    # Step 2: Send for human review
    yield context.call_activity("notify_reviewer", draft_content)

    # Step 3: Wait for approval - no compute resources consumed while waiting
    approval_event = context.wait_for_external_event("ApprovalDecision")
    timeout_task = context.create_timer(context.current_utc_datetime + timedelta(hours=24))
    
    winner = yield context.task_any([approval_event, timeout_task])
    
    if winner == approval_event:
        timeout_task.cancel()
        approved = approval_event.result
        
        if approved:
            result = yield context.call_activity("publish_content", draft_content)
            return result
        else:
            return "Content rejected"
    else:
        # Timeout - escalate for review
        result = yield context.call_activity("escalate_for_review", draft_content)
        return result

Built-in agent observability 

Configure your Function App with the Durable Task Scheduler as the durable backend (what persists agents and orchestration state). The Durable Task Scheduler is the recommended durable backend for your durable agents, offering the best throughput performance, fully managed infrastructure, and built-in observability through a UI dashboard.

The Durable Task Scheduler dashboard provides deep visibility into your agent operations:

  • Conversation history: View complete conversation threads for each agent session, including all messages, tool calls, and conversation context at any point in time
  • Multi-agent visualization: See the execution flow when calling multiple specialized agents with visual representation of agent handoffs, parallel executions, and conditional branching
  • Performance metrics: Monitor agent response times, token usage, and orchestration duration
  • Execution history: Access detailed execution logs with full replay capability for debugging

Demo Video

Language support

The Durable Extension supports:

  • C# (.NET 8.0+) with Azure Functions
  • Python (3.10+) with Azure Functions

Support for additional computes coming soon.

Get started today

Click here to create and run a durable agent

Learn more

Updated Nov 13, 2025
Version 8.0
No CommentsBe the first to comment