Microsoft Foundry Blog

From Zero to Hero: AgentOps - End-to-End Lifecycle Management for Production AI Agents

mrajguru
Microsoft
Jan 12, 2026

The shift from proof-of-concept AI agents to production-ready systems isn't just about better models—it's about building robust infrastructure that can develop, deploy, and maintain intelligent agents at enterprise scale. As organizations move beyond simple chatbots to agentic systems that plan, reason, and act autonomously, the need for comprehensive Agent LLMOps becomes critical.

This guide walks through the complete lifecycle for building production AI agents, from development through deployment to monitoring, with special focus on leveraging Azure AI Foundry's hosted agents infrastructure.

The Evolution: From Single-Turn Prompts to Agentic Workflows

Traditional AI applications operated on a simple request-response pattern. Modern AI agents, however, are fundamentally different. They maintain state across multiple interactions, orchestrate complex multi-step workflows, and dynamically adapt their approach based on intermediate results.

According to recent analysis, agentic workflows represent systems where language models and tools are orchestrated through a combination of predefined logic and dynamic decision-making. Unlike monolithic systems where a single model attempts everything, production agents break down complex tasks into specialized components that collaborate effectively.

The difference is profound. A simple customer service chatbot might answer questions from a knowledge base. An agentic customer service system, however, can search multiple data sources, escalate to specialized sub-agents for technical issues, draft response emails, schedule follow-up tasks, and learn from each interaction to improve future responses.

 

 

Stage 1: Development with Any Agentic Framework

Why LangGraph for Agent Development?

LangGraph has emerged as a leading framework for building stateful, multi-agent applications. Unlike traditional chain-based approaches, LangGraph uses a graph-based architecture where each node represents a unit of work and edges define the workflow paths between them.

The key advantages include:

Explicit State Management: LangGraph maintains persistent state across nodes, making it straightforward to track conversation history, intermediate results, and decision points. This is critical for debugging complex agent behaviors.

Visual Workflow Design: The graph structure provides an intuitive way to visualize and understand agent logic. When an agent misbehaves, you can trace execution through the graph to identify where things went wrong.

Flexible Control Flows: LangGraph supports diverse orchestration patterns—single agent, multi-agent, hierarchical, sequential—all within one framework. You can start simple and evolve as requirements grow.

Built-in Memory: Agents automatically store conversation histories and maintain context over time, enabling rich personalized interactions across sessions.

Core LangGraph Components

Nodes: Individual units of logic or action. A node might call an AI model, query a database, invoke an external API, or perform data transformation. Each node is a Python function that receives the current state and returns updates.

Edges: Define the workflow paths between nodes. These can be conditional (routing based on the node's output) or unconditional (always proceeding to the next step).

State: The data structure passed between nodes and updated through reducers. Proper state design is crucial—it should contain all information needed for decision-making while remaining manageable in size.

Checkpoints: LangGraph's checkpointing mechanism saves state at each node, enabling features like human-in-the-loop approval, retry logic, and debugging.
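A minimal sketch of checkpointing, using LangGraph's in-memory MemorySaver (a production system would use a durable backend); the single node and the thread_id value here are illustrative:

from typing import TypedDict
from langgraph.graph import StateGraph, END
from langgraph.checkpoint.memory import MemorySaver

class ChatState(TypedDict):
    message: str
    reply: str

def respond(state: ChatState):
    # Placeholder node; a real agent would call an LLM or a tool here.
    return {"reply": f"You said: {state['message']}"}

graph = StateGraph(ChatState)
graph.add_node("respond", respond)
graph.set_entry_point("respond")
graph.add_edge("respond", END)

# The checkpointer saves state after every node; thread_id keys the session,
# which is what enables human-in-the-loop approval, retries, and debugging.
app = graph.compile(checkpointer=MemorySaver())
config = {"configurable": {"thread_id": "session-42"}}
print(app.invoke({"message": "hello"}, config=config))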

Implementing the Agentic Workflow Pattern

A robust production agent typically follows a cyclical pattern of planning, execution, reflection, and adaptation:

  1. Planning Phase: The agent analyzes the user's request and creates a structured plan, breaking complex problems into manageable steps.
  2. Execution Phase: The agent carries out planned actions using appropriate tools—search engines, calculators, code interpreters, database queries, or API calls.
  3. Reflection Phase: After each action, the agent evaluates results against expected outcomes. This critical thinking step determines whether to proceed, retry with a different approach, or seek additional information.
  4. Decision Phase: Based on reflection, the agent decides the next course of action—continue to the next step, loop back to refine the approach, or conclude with a final response.

 

This pattern handles real-world complexity far better than simple linear workflows. When an agent encounters unexpected results, the reflection phase enables adaptive responses rather than brittle failure.

Example: Building a Research Agent with LangGraph

 

from langgraph.graph import StateGraph, END
from langchain_openai import ChatOpenAI
from typing import TypedDict, List

class AgentState(TypedDict):
    query: str
    plan: List[str]
    current_step: int
    research_results: List[dict]
    final_answer: str
    next: str  # routing decision set by the reflection node

def planning_node(state: AgentState):
    # Agent creates a research plan as a list of steps
    llm = ChatOpenAI(model="gpt-4")
    response = llm.invoke(f"Create a step-by-step research plan for: {state['query']}")
    plan = [line.strip() for line in response.content.splitlines() if line.strip()]
    return {"plan": plan, "current_step": 0, "research_results": []}

def research_node(state: AgentState):
    # Execute the current research step
    step = state['plan'][state['current_step']]
    # Perform web search, database query, etc. (perform_research is a placeholder)
    results = perform_research(step)
    return {"research_results": state['research_results'] + [results]}

def reflection_node(state: AgentState):
    # Evaluate whether we have enough information
    if len(state['research_results']) >= len(state['plan']):
        return {"next": "synthesize"}
    return {"next": "research", "current_step": state['current_step'] + 1}

def synthesize_node(state: AgentState):
    # Generate the final answer from all research
    llm = ChatOpenAI(model="gpt-4")
    answer = llm.invoke(f"Synthesize research: {state['research_results']}")
    return {"final_answer": answer.content}

# Build the graph
workflow = StateGraph(AgentState)
workflow.add_node("planning", planning_node)
workflow.add_node("research", research_node)
workflow.add_node("reflection", reflection_node)
workflow.add_node("synthesize", synthesize_node)

workflow.set_entry_point("planning")
workflow.add_edge("planning", "research")
workflow.add_edge("research", "reflection")
workflow.add_conditional_edges(
    "reflection",
    lambda s: s["next"],
    {"research": "research", "synthesize": "synthesize"}
)
workflow.add_edge("synthesize", END)

agent = workflow.compile()

This pattern scales from simple workflows to complex multi-agent systems with dozens of specialized nodes. 
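As a usage sketch (assuming the perform_research placeholder above is implemented), invoking the compiled graph is a single call:

# Hypothetical invocation of the compiled research agent
initial_state = {
    "query": "What are the trade-offs between RAG and fine-tuning?",
    "plan": [],
    "current_step": 0,
    "research_results": [],
    "final_answer": "",
    "next": "",
}
final_state = agent.invoke(initial_state)
print(final_state["final_answer"])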

Stage 2: CI/CD Pipeline for AI Agents

Traditional software CI/CD focuses on code quality, security, and deployment automation. Agent CI/CD must additionally handle model versioning, evaluation against behavioral benchmarks, and non-deterministic behavior.

Build Phase: Packaging Agent Dependencies

Unlike traditional applications, AI agents have unique packaging requirements:

  • Model artifacts: Fine-tuned models, embeddings, or model configurations
  • Vector databases: Pre-computed embeddings for knowledge retrieval
  • Tool configurations: API credentials, endpoint URLs, rate limits
  • Prompt templates: Versioned prompt engineering assets
  • Evaluation datasets: Test cases for agent behavior validation

Best practice is to containerize everything. Docker provides reproducibility across environments and simplifies dependency management:

FROM python:3.11-slim

WORKDIR /app

COPY . user_agent/
WORKDIR /app/user_agent

RUN if [ -f requirements.txt ]; then \
        pip install -r requirements.txt; \
    else \
        echo "No requirements.txt found"; \
    fi

EXPOSE 8088

CMD ["python", "main.py"]

Register Phase: Version Control Beyond Git

Code versioning is necessary but insufficient for AI agents. You need comprehensive artifact versioning:

Container Registry: Azure Container Registry stores Docker images with semantic versioning. Each agent version becomes an immutable artifact that can be deployed or rolled back at any time.

Prompt Registry: Version control your prompts separately from code. Prompt changes can dramatically impact agent behavior, so treating them as first-class artifacts enables A/B testing and rapid iteration.

Configuration Management: Store agent configurations (model selection, temperature, token limits, tool permissions) in version-controlled files. This ensures reproducibility and enables easy rollback.
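As an illustrative sketch (the file name, fields, and loader below are hypothetical, not a Foundry convention), a version-controlled configuration can be loaded at startup so every behavioral knob is reviewable and revertible in git:

import json
from dataclasses import dataclass

@dataclass
class AgentConfig:
    model: str
    temperature: float
    max_output_tokens: int
    allowed_tools: list[str]

def load_config(path: str = "agent_config.json") -> AgentConfig:
    # agent_config.json lives in the repo, so every change goes through review
    with open(path) as f:
        return AgentConfig(**json.load(f))

config = load_config()
print(config.model, config.temperature)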

Evaluate Phase: Testing Non-Deterministic Behavior

The biggest challenge in agent CI/CD is evaluation. Unlike traditional software where unit tests verify exact outputs, agents produce variable responses that must be evaluated holistically.

Behavioral Testing: Define test cases that specify desired agent behaviors rather than exact outputs. For example, "When asked about product pricing, the agent should query the pricing API, handle rate limits gracefully, and present information in a structured format."
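A hedged sketch of such a behavioral test (the run_agent harness and pricing_api tool name are hypothetical): the assertions target behavior, not exact wording.

def test_pricing_question_uses_pricing_api():
    # run_agent is a hypothetical test harness returning the final answer
    # plus a trace of the tool calls made during the run.
    answer, tool_calls = run_agent("How much does the Pro plan cost per month?")

    # The right tool was called and a non-empty, relevant answer came back;
    # the exact phrasing is deliberately not asserted.
    assert any(call.name == "pricing_api" for call in tool_calls)
    assert "Pro" in answer and len(answer) > 0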

Evaluation Metrics: Track multiple dimensions:

  • Task completion rate: Did the agent accomplish the goal?
  • Tool usage accuracy: Did it call the right tools with correct parameters?
  • Response quality: Measured via LLM-as-judge or human evaluation
  • Latency: Time to first token and total response time
  • Cost: Token usage and API call expenses

Adversarial Testing: Intentionally test edge cases—ambiguous requests, tool failures, rate limiting, conflicting information. Production agents will encounter these scenarios.

Recent research on CI/CD for AI agents emphasizes comprehensive instrumentation from day one. Track every input, output, API call, token usage, and decision point. After accumulating production data, patterns emerge showing which metrics actually predict failures versus noise.

Deploy Phase: Safe Production Rollouts

Never deploy agents directly to production. Implement progressive delivery:

Staging Environment: Deploy to a staging environment that mirrors production. Run automated tests and manual QA against real data (appropriately anonymized).

Canary Deployment: Route a small percentage of traffic (5-10%) to the new version. Monitor error rates, latency, user satisfaction, and cost metrics. Automatically rollback if any metric degrades beyond thresholds.

Blue-Green Deployment: Maintain two production environments. Deploy to the inactive environment, verify it's healthy, then switch traffic. Enables instant rollback by switching back.

Feature Flags: Deploy new agent capabilities behind feature flags. Gradually enable them for specific user segments, gather feedback, and iterate before full rollout.

Now that we know how to create an agent with LangGraph, the next step is understanding how to deploy that LangGraph agent to Azure AI Foundry.

Stage 3: Azure AI Foundry Hosted Agents

Hosted agents are containerized agentic AI applications that run on Microsoft's Foundry Agent Service. They represent a paradigm shift from traditional prompt-based agents to fully code-driven, production-ready AI systems.

When to Use Hosted Agents:

✅ Complex agentic workflows - Multi-step reasoning, branching logic, conditional execution

✅ Custom tool integration - External APIs, databases, internal systems

✅ Framework-specific features - LangGraph graphs, multi-agent orchestration

✅ Production scale - Enterprise deployments requiring autoscaling

✅ Identity and security - Authentication, identity, security, and compliance controls

✅ CI/CD integration - Automated testing and deployment pipelines

Why Hosted Agents Matter

Hosted agents bridge the gap between experimental AI prototypes and production systems:

For Developers:

  • Full control over agent logic via code
  • Use familiar frameworks and tools
  • Local testing before deployment
  • Version control for agent code

For Enterprises:

  • No infrastructure management overhead
  • Built-in security and compliance
  • Scalable pay-as-you-go pricing
  • Integration with existing Azure ecosystem

For AI Systems:

  • Complex reasoning patterns beyond prompts
  • Stateful conversations with persistence
  • Custom tool integration and orchestration
  • Multi-agent collaboration

Before you get started with hosted agents, deploy a Foundry project from the starter template using the Azure Developer CLI (azd):

 

# Initialize a new agent project 
azd init -t https://github.com/Azure-Samples/azd-ai-starter-basic 
# The template automatically provisions: 
# - Foundry resource and project 
# - Azure Container Registry 
# - Application Insights for monitoring 
# - Managed identities and RBAC 
# Deploy 
azd up

 

The extension significantly reduces the operational burden. What previously required extensive Azure knowledge and infrastructure-as-code expertise now works with a few CLI commands.


Local Development to Production Workflow

A streamlined workflow bridges development and production:

  1. Develop Locally: Build and test your LangGraph agent on your machine. Use the Foundry SDK to ensure compatibility with production APIs.
  2. Validate Locally: Run the agent locally against the Foundry Responses API to verify it works with production authentication and conversation management.
  3. Containerize: Package your agent in a Docker container with all dependencies.
  4. Deploy to Staging: Use azd deploy to push to a staging Foundry project. Run automated tests.
  5. Deploy to Production: Once validated, deploy to production. Foundry handles versioning, so you can maintain multiple agent versions and route traffic accordingly.
  6. Monitor and Iterate: Use Application Insights to monitor agent performance, identify issues, and plan improvements.

The Azure AI Toolkit for Visual Studio offers a great place to test your hosted agent.

 

You can also test this using REST.
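As a hedged sketch only (the exact route, api-version, and request body for invoking a hosted agent depend on your project and the current Foundry Responses API; check the REST reference for your API version), a raw HTTP call from Python might look like this:

import requests
from azure.identity import DefaultAzureCredential

# Hypothetical values - replace with your project endpoint and agent name.
PROJECT_ENDPOINT = "https://<your-resource>.services.ai.azure.com/api/projects/<project>"
API_VERSION = "2025-11-15-preview"  # assumption: same preview api-version used later in this post

# Same token acquisition flow as wait_for_container.py below.
token = DefaultAzureCredential().get_token("https://ml.azure.com/.default").token

# Assumption: the hosted agent is exposed through a Responses-style route on the project;
# consult the Foundry REST documentation for the exact path and payload schema.
url = f"{PROJECT_ENDPOINT}/responses?api-version={API_VERSION}"
payload = {"agent": "CalculatorAgent", "input": "What is 21 * 2?"}

resp = requests.post(url, headers={"Authorization": f"Bearer {token}"}, json=payload)
resp.raise_for_status()
print(resp.json())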

 

 

Once you can run the agent and test it in the local playground, the next step is registering, evaluating, and deploying the agent in Microsoft AI Foundry.

CI/CD with GitHub Actions

This repository includes a GitHub Actions workflow (`.github/workflows/mslearnagent-AutoDeployTrigger.yml`) that automatically builds and deploys the agent to Azure when changes are pushed to the main branch.

1. Set Up Service Principal

# Create service principal
az ad sp create-for-rbac \
  --name "github-actions-agent-deploy" \
  --role contributor \
  --scopes /subscriptions/$SUBSCRIPTION_ID/resourceGroups/$RESOURCE_GROUP

# Output will include:
# - appId (AZURE_CLIENT_ID)
# - tenant (AZURE_TENANT_ID)

 

2. Configure Federated Credentials

# For GitHub Actions OIDC
az ad app federated-credential create \
  --id $APP_ID \
  --parameters '{
    "name": "github-actions-deploy",
    "issuer": "https://token.actions.githubusercontent.com",
    "subject": "repo:YOUR_ORG/YOUR_REPO:ref:refs/heads/main",
    "audiences": ["api://AzureADTokenExchange"]
  }'

 

3. Set Required Permissions

Critical: the service principal needs the Azure AI User role on the AI Services resource:

# Get AI Services resource ID
AI_SERVICES_ID=$(az cognitiveservices account show \
  --name $AI_SERVICES_NAME \
  --resource-group $RESOURCE_GROUP \
  --query id -o tsv)

# Assign Azure AI User role
az role assignment create \
  --assignee $SERVICE_PRINCIPAL_ID \
  --role "Azure AI User" \
  --scope $AI_SERVICES_ID

 

4. Configure GitHub Secrets

Go to GitHub repository → Settings → Secrets and variables → Actions

Add the following secrets:

AZURE_CLIENT_ID=<from-service-principal>
AZURE_TENANT_ID=<from-service-principal>
AZURE_SUBSCRIPTION_ID=<your-subscription-id>
AZURE_AI_PROJECT_ENDPOINT=<your-project-endpoint>
ACR_NAME=<your-acr-name>
IMAGE_NAME=calculator-agent
AGENT_NAME=CalculatorAgent

5. Create GitHub Actions Workflow

Create .github/workflows/deploy-agent.yml:

 

name: Deploy Agent to Azure AI Foundry

on:
  push:
    branches:
      - main
    paths:
      - 'main.py'
      - 'custom_state_converter.py'
      - 'requirements.txt'
      - 'Dockerfile'
  workflow_dispatch:
    inputs:
      version_tag:
        description: 'Version tag (leave empty for auto-increment)'
        required: false
        type: string

permissions:
  id-token: write
  contents: read

jobs:
  deploy:
    runs-on: ubuntu-latest

    steps:
      - name: Checkout code
        uses: actions/checkout@v4

      - name: Generate version tag
        id: version
        run: |
          if [ -n "${{ github.event.inputs.version_tag }}" ]; then
            echo "VERSION=${{ github.event.inputs.version_tag }}" >> $GITHUB_OUTPUT
          else
            # Auto-increment version
            VERSION="v$(date +%Y%m%d-%H%M%S)"
            echo "VERSION=$VERSION" >> $GITHUB_OUTPUT
          fi

      - name: Azure Login (OIDC)
        uses: azure/login@v1
        with:
          client-id: ${{ secrets.AZURE_CLIENT_ID }}
          tenant-id: ${{ secrets.AZURE_TENANT_ID }}
          subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }}

      - name: Set up Python
        uses: actions/setup-python@v4
        with:
          python-version: '3.11'

      - name: Install Azure AI SDK
        run: |
          pip install azure-ai-projects azure-identity

      - name: Build and push Docker image
        run: |
          az acr build \
            --registry ${{ secrets.ACR_NAME }} \
            --image ${{ secrets.IMAGE_NAME }}:${{ steps.version.outputs.VERSION }} \
            --image ${{ secrets.IMAGE_NAME }}:latest \
            --file Dockerfile \
            .

      - name: Register agent version
        env:
          AZURE_AI_PROJECT_ENDPOINT: ${{ secrets.AZURE_AI_PROJECT_ENDPOINT }}
          ACR_NAME: ${{ secrets.ACR_NAME }}
          IMAGE_NAME: ${{ secrets.IMAGE_NAME }}
          AGENT_NAME: ${{ secrets.AGENT_NAME }}
          VERSION: ${{ steps.version.outputs.VERSION }}
        run: |
          python - <<EOF
          import os
          from azure.ai.projects import AIProjectClient
          from azure.identity import DefaultAzureCredential
          from azure.ai.projects.models import ImageBasedHostedAgentDefinition

          project_endpoint = os.environ["AZURE_AI_PROJECT_ENDPOINT"]
          credential = DefaultAzureCredential()
          project_client = AIProjectClient(
              endpoint=project_endpoint,
              credential=credential,
          )

          agent_name = os.environ["AGENT_NAME"]
          version = os.environ["VERSION"]
          image_uri = f"{os.environ['ACR_NAME']}.azurecr.io/{os.environ['IMAGE_NAME']}:{version}"

          agent_definition = ImageBasedHostedAgentDefinition(
              image=image_uri,
              cpu=1.0,
              memory="2Gi",
          )

          agent = project_client.agents.create_or_update(
              agent_id=agent_name,
              body=agent_definition
          )

          print(f"Agent version registered: {agent.version}")
          EOF

      - name: Start agent
        run: |
          echo "Agent deployed successfully with version ${{ steps.version.outputs.VERSION }}"

      - name: Deployment summary
        run: |
          echo "### Deployment Summary" >> $GITHUB_STEP_SUMMARY
          echo "- **Agent Name**: ${{ secrets.AGENT_NAME }}" >> $GITHUB_STEP_SUMMARY
          echo "- **Version**: ${{ steps.version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
          echo "- **Image**: ${{ secrets.ACR_NAME }}.azurecr.io/${{ secrets.IMAGE_NAME }}:${{ steps.version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
          echo "- **Status**: Deployed" >> $GITHUB_STEP_SUMMARY

 

6. Add Container Status Verification

To ensure deployments are truly successful, add a script to verify container startup before marking the pipeline as complete.

Create wait_for_container.py:

 

"""
Wait for agent container to be ready.

This script polls the agent container status until it's running successfully
or times out. Designed for use in CI/CD pipelines to verify deployment.
"""

import os
import sys
import time
import requests
from typing import Optional, Dict, Any
from azure.identity import DefaultAzureCredential


class ContainerStatusWaiter:
    """Polls agent container status until ready or timeout."""

    def __init__(
        self,
        project_endpoint: str,
        agent_name: str,
        agent_version: str,
        timeout_seconds: int = 600,
        poll_interval: int = 10,
    ):
        """
        Initialize the container status waiter.

        Args:
            project_endpoint: Azure AI Foundry project endpoint
            agent_name: Name of the agent
            agent_version: Version of the agent
            timeout_seconds: Maximum time to wait (default: 10 minutes)
            poll_interval: Seconds between status checks (default: 10s)
        """
        self.project_endpoint = project_endpoint.rstrip("/")
        self.agent_name = agent_name
        self.agent_version = agent_version
        self.timeout_seconds = timeout_seconds
        self.poll_interval = poll_interval
        self.api_version = "2025-11-15-preview"

        # Get Azure AD token
        credential = DefaultAzureCredential()
        token = credential.get_token("https://ml.azure.com/.default")
        self.headers = {
            "Authorization": f"Bearer {token.token}",
            "Content-Type": "application/json",
        }

    def _get_container_url(self) -> str:
        """Build the container status URL."""
        return (
            f"{self.project_endpoint}/agents/{self.agent_name}"
            f"/versions/{self.agent_version}/containers/default"
        )

    def get_container_status(self) -> Optional[Dict[str, Any]]:
        """Get current container status."""
        url = f"{self._get_container_url()}?api-version={self.api_version}"

        try:
            response = requests.get(url, headers=self.headers, timeout=30)

            if response.status_code == 200:
                return response.json()
            elif response.status_code == 404:
                return None
            else:
                print(f"⚠️  Warning: GET container returned {response.status_code}")
                return None

        except Exception as e:
            print(f"⚠️  Warning: Error getting container status: {e}")
            return None

    def wait_for_container_running(self) -> bool:
        """
        Wait for container to reach running state.

        Returns:
            True if container is running, False if timeout or error
        """
        print(f"\n🔍 Checking container status for {self.agent_name} v{self.agent_version}")
        print(f"⏱️  Timeout: {self.timeout_seconds}s | Poll interval: {self.poll_interval}s")
        print("-" * 70)

        start_time = time.time()
        iteration = 0

        while time.time() - start_time < self.timeout_seconds:
            iteration += 1
            elapsed = int(time.time() - start_time)

            container = self.get_container_status()

            if not container:
                print(f"[{iteration}] ({elapsed}s) ⏳ Container not found yet, waiting...")
                time.sleep(self.poll_interval)
                continue

            # Extract status information
            status = (
                container.get("status")
                or container.get("state")
                or container.get("provisioningState")
                or "Unknown"
            )

            # Check for replicas information
            replicas = container.get("replicas", {})
            ready_replicas = replicas.get("ready", 0)
            desired_replicas = replicas.get("desired", 0)

            print(f"[{iteration}] ({elapsed}s) 📊 Status: {status}")

            if replicas:
                print(f"              🔢 Replicas: {ready_replicas}/{desired_replicas} ready")

            # Check if container is running and ready
            if status.lower() in ["running", "succeeded", "ready"]:
                if desired_replicas == 0 or ready_replicas >= desired_replicas:
                    print("\n" + "=" * 70)
                    print("✅ Container is running and ready!")
                    print("=" * 70)
                    return True

            elif status.lower() in ["failed", "error", "cancelled"]:
                print("\n" + "=" * 70)
                print(f"❌ Container failed to start: {status}")
                print("=" * 70)
                return False

            time.sleep(self.poll_interval)

        # Timeout reached
        print("\n" + "=" * 70)
        print(f"⏱️  Timeout reached after {self.timeout_seconds}s")
        print("=" * 70)
        return False


def main():
    """Main entry point for CLI usage."""
    project_endpoint = os.getenv("AZURE_AI_PROJECT_ENDPOINT")
    agent_name = os.getenv("AGENT_NAME")
    agent_version = os.getenv("AGENT_VERSION")
    timeout = int(os.getenv("TIMEOUT_SECONDS", "600"))
    poll_interval = int(os.getenv("POLL_INTERVAL_SECONDS", "10"))

    if not all([project_endpoint, agent_name, agent_version]):
        print("❌ Error: Missing required environment variables")
        sys.exit(1)

    waiter = ContainerStatusWaiter(
        project_endpoint=project_endpoint,
        agent_name=agent_name,
        agent_version=agent_version,
        timeout_seconds=timeout,
        poll_interval=poll_interval,
    )

    success = waiter.wait_for_container_running()
    sys.exit(0 if success else 1)


if __name__ == "__main__":
    main()

Key Features:

  1. REST API Polling: Uses Azure AI Foundry REST API to check container status
  2. Timeout Handling: Configurable timeout (default 10 minutes)
  3. Progress Tracking: Shows iteration count and elapsed time
  4. Replica Checking: Verifies all desired replicas are ready
  5. Clear Output: Emoji-enhanced status messages for easy reading
  6. Exit Codes: Returns 0 for success, 1 for failure (CI/CD friendly)

Update workflow to include verification:

Add these steps after registering the agent version:

 

      - name: Start the new agent version
        id: start_agent
        env:
          FOUNDRY_ACCOUNT: ${{ steps.foundry_details.outputs.FOUNDRY_ACCOUNT }}
          PROJECT_NAME: ${{ steps.foundry_details.outputs.PROJECT_NAME }}
          AGENT_NAME: ${{ secrets.AGENT_NAME }}
        run: |
          LATEST_VERSION=$(az cognitiveservices agent show \
            --account-name "$FOUNDRY_ACCOUNT" \
            --project-name "$PROJECT_NAME" \
            --name "$AGENT_NAME" \
            --query "versions.latest.version" -o tsv)

          echo "AGENT_VERSION=$LATEST_VERSION" >> $GITHUB_OUTPUT

          az cognitiveservices agent start \
            --account-name "$FOUNDRY_ACCOUNT" \
            --project-name "$PROJECT_NAME" \
            --name "$AGENT_NAME" \
            --agent-version $LATEST_VERSION

      - name: Wait for container to be ready
        env:
          AZURE_AI_PROJECT_ENDPOINT: ${{ secrets.AZURE_AI_PROJECT_ENDPOINT }}
          AGENT_NAME: ${{ secrets.AGENT_NAME }}
          AGENT_VERSION: ${{ steps.start_agent.outputs.AGENT_VERSION }}
          TIMEOUT_SECONDS: 600
          POLL_INTERVAL_SECONDS: 15
        run: |
          echo "⏳ Waiting for container to be ready..."
          python wait_for_container.py

      - name: Deployment Summary
        if: success()
        run: |
          echo "## Deployment Complete! 🚀" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "- **Agent**: ${{ secrets.AGENT_NAME }}" >> $GITHUB_STEP_SUMMARY
          echo "- **Version**: ${{ steps.version.outputs.VERSION }}" >> $GITHUB_STEP_SUMMARY
          echo "- **Status**: ✅ Container running and ready" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "### Deployment Timeline" >> $GITHUB_STEP_SUMMARY
          echo "1. ✅ Image built and pushed to ACR" >> $GITHUB_STEP_SUMMARY
          echo "2. ✅ Agent version registered" >> $GITHUB_STEP_SUMMARY
          echo "3. ✅ Container started" >> $GITHUB_STEP_SUMMARY
          echo "4. ✅ Container verified as running" >> $GITHUB_STEP_SUMMARY

      - name: Deployment Failed Summary
        if: failure()
        run: |
          echo "## Deployment Failed ❌" >> $GITHUB_STEP_SUMMARY
          echo "" >> $GITHUB_STEP_SUMMARY
          echo "Please check the logs above for error details." >> $GITHUB_STEP_SUMMARY

 

Benefits of Container Status Verification:

  1. Deployment Confidence: Know for certain that the container started successfully
  2. Early Failure Detection: Catch startup errors before users are affected
  3. CI/CD Gate: Pipeline only succeeds when container is actually ready
  4. Debugging Aid: Clear logs show container startup progress
  5. Timeout Protection: Prevents infinite waits with configurable timeout

REST API Endpoints Used:

GET {endpoint}/agents/{agent_name}/versions/{agent_version}/containers/default?api-version=2025-11-15-preview

Response includes:

  • status or state: Container state (Running, Failed, etc.)
  • replicas.ready: Number of ready replicas
  • replicas.desired: Target number of replicas
  • error: Error details if failed

Container States:

  • Running/Ready: Container is operational
  • InProgress: Container is starting up
  • Failed/Error: Container failed to start
  • Stopped: Container was stopped

7. Trigger Deployment

# Automatic trigger - push to main
git add .
git commit -m "Update agent implementation"
git push origin main

# Manual trigger - via GitHub UI
# Go to Actions → Deploy Agent to Azure AI Foundry → Run workflow

 

This will trigger the workflow as soon as you check in the implementation code.

You can play with the Agent in Foundry UI:

 

Evaluation is now part of the workflow

 

You can also visualize the evaluation results in AI Foundry

 

Best Practices for Production Agent LLMOps

1. Start with Simple Workflows, Add Complexity Gradually

Don't build a complex multi-agent system on day one. Start with a single agent that does one task well. Once that's stable in production, add additional capabilities:

  • Single agent with basic tool calling
  • Add memory/state for multi-turn conversations
  • Introduce specialized sub-agents for complex tasks
  • Implement multi-agent collaboration

This incremental approach reduces risk and enables learning from real usage before investing in advanced features.

2. Instrument Everything from Day One

The worst time to add observability is after you have a production incident. Comprehensive instrumentation should be part of your initial development:

  • Log every LLM call with inputs, outputs, token usage
  • Track all tool invocations
  • Record decision points in agent reasoning
  • Capture timing metrics for every operation
  • Log errors with full context

After accumulating production data, you'll identify which metrics matter most. But you can't retroactively add logging for incidents that already occurred.
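To make the logging list above concrete, here is a minimal sketch of an instrumentation decorator (plain Python logging; in production you would route these records to Application Insights or an OpenTelemetry exporter):

import functools
import json
import logging
import time

logger = logging.getLogger("agent.instrumentation")

def instrument(step_name: str):
    """Log outcome, duration, and errors for any agent step."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                result = func(*args, **kwargs)
                logger.info(json.dumps({
                    "step": step_name,
                    "status": "ok",
                    "duration_ms": round((time.perf_counter() - start) * 1000, 1),
                }))
                return result
            except Exception:
                logger.exception("step=%s failed after %.1f ms",
                                 step_name, (time.perf_counter() - start) * 1000)
                raise
        return wrapper
    return decorator

@instrument("research")
def research_step(state):
    ...  # existing node logic goes here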

3. Build Evaluation into the Development Process

Don't wait until deployment to evaluate agent quality. Integrate evaluation throughout development:

  • Maintain a growing set of test conversations
  • Run evaluations on every code change
  • Track metrics over time to identify regressions
  • Include diverse scenarios—happy path, edge cases, adversarial inputs

Use LLM-as-judge for scalable automated evaluation, supplemented with periodic human review of sample outputs.
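A hedged sketch of LLM-as-judge scoring (reusing the ChatOpenAI client from the agent code above; the rubric, evaluation set, and pass threshold are illustrative):

from langchain_openai import ChatOpenAI

JUDGE_PROMPT = """You are grading an AI agent's answer.
Question: {question}
Answer: {answer}
Score the answer from 1 (poor) to 5 (excellent) for correctness and helpfulness.
Reply with only the number."""

def judge_answer(question: str, answer: str) -> int:
    judge = ChatOpenAI(model="gpt-4", temperature=0)
    reply = judge.invoke(JUDGE_PROMPT.format(question=question, answer=answer))
    return int(reply.content.strip())

# Illustrative evaluation set; in practice these pairs come from a recorded agent run.
agent_outputs = [("What is our refund policy?", "Refunds are available within 30 days of purchase.")]
scores = [judge_answer(q, a) for q, a in agent_outputs]
assert sum(scores) / len(scores) >= 4.0  # illustrative CI gate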

4. Embrace Non-Determinism, But Set Boundaries

Agents are inherently non-deterministic, but that doesn't mean anything goes:

  • Set acceptable ranges for variability in testing
  • Use temperature and sampling controls to manage randomness
  • Implement retry logic with exponential backoff (see the sketch after this list)
  • Add fallback behaviors for when primary approaches fail
  • Use assertions to verify critical invariants (e.g., "agent must never perform destructive actions without confirmation")
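A minimal sketch of retry with exponential backoff and jitter (the call_llm parameter stands in for any flaky LLM or tool call):

import random
import time

def call_with_retries(call_llm, prompt, max_attempts=4, base_delay=1.0):
    """Retry a transiently failing call with exponential backoff and jitter."""
    for attempt in range(1, max_attempts + 1):
        try:
            return call_llm(prompt)
        except Exception as exc:  # in practice, catch only transient error types
            if attempt == max_attempts:
                raise
            delay = base_delay * (2 ** (attempt - 1)) + random.uniform(0, 0.5)
            print(f"Attempt {attempt} failed ({exc}); retrying in {delay:.1f}s")
            time.sleep(delay)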

5. Prioritize Security and Governance from Day One

Security shouldn't be an afterthought:

  • Use managed identities and RBAC for all resource access
  • Implement least-privilege principles—agents get only necessary permissions
  • Add content filtering for inputs and outputs
  • Monitor for prompt injection and jailbreak attempts
  • Maintain audit logs for compliance
  • Regularly review and update security policies

6. Design for Failure

Your agents will fail. Design systems that degrade gracefully:

  • Implement retry logic for transient failures
  • Provide clear error messages to users
  • Include fallback behaviors (e.g., escalate to human support)
  • Never leave users stuck—always provide a path forward
  • Log failures with full context for post-incident analysis

7. Balance Automation with Human Oversight

Fully autonomous agents are powerful but risky. Consider human-in-the-loop workflows for high-stakes decisions (a LangGraph sketch follows this list):

  • Draft responses that require approval before sending
  • Request confirmation before executing destructive actions
  • Escalate ambiguous situations to human operators
  • Provide clear audit trails of agent actions
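A hedged LangGraph sketch of this gating (assuming a graph like the research agent above, with a hypothetical "execute_action" node that performs side effects): compiling with interrupt_before pauses execution at that node until a human resumes the thread.

from langgraph.checkpoint.memory import MemorySaver

# A checkpointer is required so the paused run can be persisted and resumed.
app = workflow.compile(
    checkpointer=MemorySaver(),
    interrupt_before=["execute_action"],  # hypothetical node name
)

config = {"configurable": {"thread_id": "ticket-1234"}}
app.invoke({"query": "Refund order #42"}, config=config)  # stops before execute_action

# A reviewer inspects the pending state (app.get_state(config)) and approves;
# resuming with input=None continues the run from the saved checkpoint.
app.invoke(None, config=config)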

8. Manage Costs Proactively

LLM API costs can escalate quickly at scale:

  • Monitor token usage per conversation
  • Set per-conversation token limits (see the sketch after this list)
  • Use caching for repeated queries
  • Choose appropriate models (not always the largest)
  • Consider local models for suitable use cases
  • Alert on cost anomalies that indicate runaway loops
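A minimal sketch of a per-conversation token budget using the tiktoken tokenizer (the 50,000-token limit and class names are illustrative):

import tiktoken

class ConversationBudgetExceeded(Exception):
    """Raised when a conversation exceeds its token budget."""

class TokenBudget:
    def __init__(self, limit: int = 50_000, model: str = "gpt-4"):
        self.limit = limit
        self.used = 0
        self.encoding = tiktoken.encoding_for_model(model)

    def charge(self, text: str) -> None:
        # Count tokens for every prompt and response; fail fast on runaway loops.
        self.used += len(self.encoding.encode(text))
        if self.used > self.limit:
            raise ConversationBudgetExceeded(
                f"Conversation used {self.used} tokens (limit {self.limit})"
            )

budget = TokenBudget()
budget.charge("user prompt goes here")
budget.charge("model response goes here")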

9. Plan for Continuous Learning

Agents should improve over time:

  • Collect feedback on agent responses (thumbs up/down)
  • Analyze conversations that required escalation
  • Identify common failure patterns
  • Fine-tune models on production interaction data (with appropriate consent)
  • Iterate on prompts based on real usage
  • Share learnings across the team

10. Document Everything

Comprehensive documentation is critical as teams scale:

  • Agent architecture and design decisions
  • Tool configurations and API contracts
  • Deployment procedures and runbooks
  • Incident response procedures
  • Version migration guides
  • Evaluation methodologies

Conclusion

You now have a complete, production-ready AI agent deployed to Azure AI Foundry with:

  • LangGraph-based agent orchestration
  • Tool-calling capabilities
  • Multi-turn conversation support
  • Containerized deployment
  • CI/CD automation
  • Evaluation framework
  • Multiple client implementations

Key Takeaways

  1. LangGraph provides flexible agent orchestration with state management
  2. Azure AI Agent Server SDK simplifies deployment to Azure AI Foundry
  3. Custom state converter is critical for production deployments with tool calls
  4. CI/CD automation enables rapid iteration and deployment
  5. Evaluation framework ensures agent quality and performance


Thanks

Manoranjan Rajguru

https://www.linkedin.com/in/manoranjan-rajguru/

Updated Jan 12, 2026
Version 1.0