agents
106 TopicsFrom Requirement to Production Code, How Engineering Squad Automates the Full Dev Lifecycle
I started wondering: what if instead of one AI assistant generating code snippets, you had an entire squad of specialized AI agents. Each owning a single stage of the delivery pipeline, they could collaborate, self-correct, and produce a complete, traceable output from a plain-text requirement? That's Engineering Squad: an open-source, multi-agent framework built with LangGraph, Azure OpenAI, and Foundry Local. Nine agents. One pipeline. Zero manual handoffs. You give it a requirement. It gives you back: - User stories with acceptance criteria - Technical design (API contracts, data models, architecture) - Full implementation code (written into real files, not markdown) - Unit tests and Playwright E2E tests - Automated code review with a self-correcting feedback loop When the Code Reviewer finds a bug, it doesn't just flag it, it routes the work back to the exact agent that needs to fix it. When the Spec Agent hits ambiguity, it stops and asks you rather than guessing. The loop runs up to 5 iterations, and every run is versioned under a unique Run ID for full traceability. It runs on Azure OpenAI for heavy reasoning, Foundry Local for lightweight tasks or entirely offline with --local-only mode. No cloud required. How It Works The squad is a directed graph of 9 specialized agents. Each agent has a single responsibility and a tuned system prompt. The orchestration is handled by LangGraph's StateGraph, which routes work through the pipeline and handles feedback loops. The Agents Agent Model Responsibility Product Owner Azure OpenAI gpt-4.1 Reads requirements, classifies impact scope Story Agent Foundry Local (qwen2.5-7b) Converts requirements → structured user stories Spec Agent Azure OpenAI o3 Resolves ambiguity — asks the user interactively Technical Design Azure OpenAI gpt-4.1 Architecture, API contracts, data models, error handling Developer Azure OpenAI gpt-4.1 Writes code directly into the codebase Unit Tester Azure OpenAI gpt-4.1 Writes unit tests and evaluates them against implementation Test Writer Foundry Local (qwen2.5-7b) Writes Playwright E2E tests using Page Object Model Tester Azure OpenAI o3 Final evaluation of code against all specs and tests Code Reviewer Azure OpenAI o3 Reviews everything, decides: approve or route back The Self-Correcting Loop This is where it gets interesting. The Code Reviewer doesn't just say "approved" or "rejected" — it makes a routing decision using structured output: class ReviewDecision(BaseModel): decision: Literal[ "approved", # Ship it "requirement_confusion", # → Spec Agent (clarify ambiguity) "clarity_missing", # → Technical Design (refine design) "code_missing", # → Developer (fix implementation) "bug_found", # → Developer (fix bugs) "test_case_missing", # → Test Writer (add coverage) ] feedback: str # Actionable feedback for the target agent LangGraph's conditional edges route the workflow back to the exact agent that needs to act. The loop runs up to 5 iterations with a hard stop to prevent infinite cycles. workflow.add_conditional_edges( "code_reviewer", route_review, { END: END, "spec_agent": "spec_agent", "technical_design": "technical_design", "developer": "developer", "test_writer": "test_writer", }, ) Key Design Decisions 1. Impact Classification — Don't Run What You Don't Need Not every change needs the full pipeline. The squad classifies scope first: Scope What Runs config Impact Analysis → Developer → Unit Tester → Reviewer bugfix Impact Analysis → Developer → Unit Tester → Tester → Reviewer enhancement Stories → Design (if needed) → Developer → All Tests → Reviewer feature Stories → Design → Developer → All Tests → Reviewer refactor Impact Analysis → Developer → Unit Tester → Reviewer A config change doesn't need user stories. A bugfix doesn't need a full architectural design. This keeps runs fast and focused. 2. Code Goes Into Real Files, Not Markdown This was a deliberate choice. The Developer Agent edits actual source files in your project — it doesn't dump code into a markdown artifact. The code_changes.md artifact is a change log that records what was modified and why, for traceability. 3. Existing Projects vs. Greenfield Set PROJECT_TYPE: existing in requirements_input.txt, point it at your repos, and the squad will: Scan your codebase for patterns, conventions, and architecture Make targeted changes only — no rewriting from scratch Preserve your existing coding style, error handling, and naming conventions 4. Two LLM Tiers — Cloud + Local The framework uses a hybrid model strategy: Azure OpenAI (gpt-4.1, o3) for complex reasoning: code generation, technical design, code review Foundry Local (qwen2.5-7b, phi-3.5-mini) for lightweight tasks: user stories, test writing This keeps costs down while maintaining quality where it matters. And with --local-only mode, you can run the entire squad on Foundry Local with zero cloud dependencies. Running It Locally with Foundry Local One of my favorite features: the entire squad can run 100% locally using Foundry Local. No Azure subscription, no API keys, no internet required. Setup # Install Foundry Local CLI (one-time) winget install Microsoft.FoundryLocal # Install Python dependencies pip install foundry-local-sdk openai langchain-openai langgraph python-dotenv # Run in local-only mode python main.py --local-only When --local-only is set, every agent that would normally call Azure OpenAI gets redirected to Foundry Local: def get_azure_llm(deployment: str, temperature: float = 0.1): # Local-only mode: redirect to Foundry Local if os.getenv("SQUAD_LOCAL_ONLY", "").lower() in ("true", "1", "yes"): from models.local_llm import get_local_llm return get_local_llm(temperature=temperature) # Otherwise: use Azure OpenAI with DefaultAzureCredential ... The foundry-local-sdk (v1.1.0+) handles everything — initializing the runtime, downloading models, and loading them: from foundry_local_sdk import FoundryLocalManager, Configuration # Initialize once (singleton) config = Configuration(app_name="my-app") manager = FoundryLocalManager(config) # Start OpenAI-compatible web service manager.start_web_service() print(manager.urls[0]) # SDK auto-discovers the endpoint # Download & load a model model = manager.catalog.get_model("qwen2.5-7b") model.download() model.load() # Chat directly — no web service needed chat = model.get_chat_client() response = chat.complete_chat([{"role": "user", "content": "Hello!"}]) Jupyter Notebook The repo includes a Jupyter notebook (foundry_local.ipynb) that walks you through: Installing Foundry Local Loading a model Sending chat completions (streaming and non-streaming) Running the full Engineering Squad in local-only mode Traceability — Every Run Is Versioned Every squad execution gets a unique Run ID and produces a structured artifact set: output/ runs/ 20260524_a3f9b1/ run_metadata.json ← run ID, timestamp, requirement hash, decision impact_classification.md user_stories.md technical_design.md code_changes.md ← change log (code is in real files) unit_test_results.md tests.md test_results.md review_feedback.md latest/ ← symlink to most recent approved run The run_metadata.json is structured for future Azure DevOps integration — auto-creating work items, tasks, and test cases from squad output. Two Ways to Run Mode Best For GitHub Copilot Agent Mode Existing codebases — Copilot has full workspace context via #codebase Python CLI (python main.py) New projects, CI pipelines, fully automated runs Running with GitHub Copilot Agent Mode This is the recommended way to run the squad on existing projects. Copilot has full access to your workspace — it can read files, write code, and run terminal commands — so it naturally understands your architecture, patterns, and conventions. Prerequisites VS Code with the GitHub Copilot and GitHub Copilot Chat extensions installed A Copilot subscription that supports Agent Mode (Copilot Pro, Business, or Enterprise) Setup Clone the repo and open it in VS Code: git clone https://github.com/prasunagga/engineeringSquad.git code engineeringSquad Switch to Agent Mode — In the Copilot Chat panel, click the mode dropdown (top of the chat input) and select "Agent". This is required — Ask and Edit modes don't have tool access. Enable tools — Click the 🔧 tools icon (or gear/settings icon) at the bottom of the chat input area. Make sure the following tools are enabled: File operations (read, create, edit files) Terminal (run commands) Code search / workspace context Without these enabled, the squad can't read your codebase or write code into files. Edit your requirement — Open requirements_input.txt and write your requirement: PROJECT_TYPE: existing FRONTEND_PATH: plant-catalog BACKEND_PATH: Build a cart page where users can add plants, adjust quantities, and see totals. Running the Squad In Copilot Chat (Agent Mode), type: /run-squad This triggers the .github/prompts/run-squad.prompt.md file — a prompt file with mode: agent in its YAML frontmatter that orchestrates the full workflow: --- mode: agent description: Run the full Engineering Squad workflow tools: - read_file - create_file - replace_in_file - insert_text - delete_file_range --- Copilot will then execute the full pipeline: read requirements → classify impact → generate stories → design → write code → write tests → run tests → code review → approve or loop back. How It Differs from Python CLI Copilot Agent Mode Python CLI Context Full workspace awareness via #codebase Reads files from paths in requirements_input.txt Human-in-loop Spec Agent asks you directly in chat Spec Agent prints questions to stdout Code editing Uses VS Code's file editing tools Writes files via Python open() Test execution Runs npm test / playwright test in VS Code terminal Runs via subprocess Model Uses whichever model is selected in Copilot Uses Azure OpenAI / Foundry Local Individual Agent Prompts The .github/prompts/ directory also contains standalone prompt files for running individual agents: Prompt Purpose run-squad.prompt.md Full orchestrated pipeline developer.prompt.md Developer agent only code-reviewer.prompt.md Code review only story-agent.prompt.md Generate user stories only technical-design.prompt.md Technical design only test-writer.prompt.md Write E2E tests only Extending the Framework The squad is designed to be modular. Here are the most common extension points: Add a New Agent Every agent follows the same pattern — a function that takes SquadState, calls an LLM, and returns updated fields: # agents/my_agent.py from langchain_core.prompts import ChatPromptTemplate from graph.state import SquadState from models.azure_llm import get_azure_llm, DEPLOYMENT_DEVELOPER PROMPT = ChatPromptTemplate.from_messages([ ("system", "You are a security review specialist."), ("human", "Review this code for vulnerabilities:\n{code}"), ]) def my_agent_node(state: SquadState) -> dict: llm = get_azure_llm(deployment=DEPLOYMENT_DEVELOPER) result = (PROMPT | llm).invoke({"code": state["code"]}) return {"security_review": result.content} Then wire it in: Add state fields in graph/state.py Register the node and edges in graph/workflow.py Add artifact output in main.py Swap the LLM for Any Agent Each agent calls get_azure_llm(deployment=...) or get_local_llm(). You can: Change the model — edit .env (e.g., AZURE_DEPLOYMENT_DEVELOPER=gpt-5.4) Go fully local — python main.py --local-only Use a different provider — replace get_azure_llm() with any LangChain-compatible LLM (Anthropic, Ollama, Groq, etc.) Customize Agent Prompts Each agent's system prompt is defined as a ChatPromptTemplate at the top of its file in agents/. Edit the prompt directly — no configuration layer to navigate. Change the Review Loop The routing logic lives in graph/workflow.py → route_review(). Add new decision strings, change the routing map, or adjust MAX_ITERATIONS (default: 5). VS Code Copilot Agent Mode The .github/prompts/ directory contains prompt files for running individual agents in VS Code Copilot Agent Mode. Edit these to customize agent behavior when running through Copilot. What I Learned Building This Structured output is essential for routing. Without Pydantic models for review decisions, the conditional edge routing would be fragile and string-matching-dependent. Impact classification saves significant time. Running 9 agents for a one-line config change is wasteful. Classifying scope first makes the system practical. The self-correcting loop works — but needs a hard stop. Left unchecked, agents can ping-pong feedback indefinitely. The 5-iteration cap is a pragmatic safety net. Hybrid local + cloud models are the right balance. Not every task needs GPT-4.1. User story generation and test writing work well on smaller local models, cutting costs without sacrificing quality. "Ask, don't guess" is the single most important principle. When the Spec Agent encounters ambiguous requirements, it stops and asks the user rather than hallucinating assumptions. This one rule prevents the most costly category of errors. Try It Yourself The framework is open source and designed to be extensible: git clone https://github.com/prasunagga/engineeringSquad.git cd engineeringSquad pip install -r requirements.txt # Edit your requirement notepad requirements_input.txt # Run (local-only, no Azure needed) python main.py --local-only Requirements: Python 3.10+ Windows, macOS, or Linux For local-only: Foundry Local (winget install Microsoft.FoundryLocal) For cloud mode: Azure OpenAI endpoint + az login What's Next Azure DevOps MCP integration — Auto-sync stories, tasks, and test cases to ADO boards CI/CD trigger — Auto-run the squad on PR creation or work item assignment Multi-repo support — Frontend, backend, and infra in separate repositories Cost estimation — Estimate effort and cloud costs from the technical design Links GitHub: github.com/prasunagga/engineeringSquad Foundry Local docs: learn.microsoft.com/en-us/azure/foundry-local/what-is-foundry-local LangGraph docs: langchain.com/langgraph Azure OpenAI docs: azure.microsoft.com/en-us/products/ai-foundry/models/openaiBuilding Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I’ve been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF) Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach. Approach 1: Agents SDK Agents SDK provides a straightforward way to create agents with integrated tools and models. Example: Creating an Agent from azure.ai.projects import AIProjectClient from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType from azure.identity import DefaultAzureCredential client = AIProjectClient(credential=DefaultAzureCredential(), endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT")) # Configure tools ai_search = AzureAISearchTool( index_connection_id=conn_id, index_name="my-index", query_type=AzureAISearchQueryType.SEMANTIC, ) # Create agent (persisted in Foundry portal) agent = client.agents.create_agent( model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"), name="MyAgent", instructions="You are a helpful assistant.", tool_resources=ai_search.resources, tools=ai_search.definitions, ) # Run conversation thread = client.agents.threads.create() client.agents.messages.create(thread_id=thread.id, role="user", content="Hello") run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id) What this approach provides Native integration with Azure AI services (OpenAI, AI Search, MCP) Managed execution environment Simple and quick agent setup Conceptually, this approach can be summarized as: Model + Tools + Execution Strengths ✅ Rapid development and onboarding ✅ Strong integration within the Azure ecosystem ✅ Well-suited for single-agent or tool-driven use cases ✅ Minimal infrastructure overhead Challenges observed in practice As the complexity of scenarios increases, certain limitations become more visible: Multi-agent workflows require custom orchestration logic Agent handoffs must be implemented manually Context sharing across agents requires additional design effort While this approach offers flexibility, it shifts orchestration complexity to the developer. Approach 2: Microsoft Agent Framework (MAF) Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design. Creating an Agent from agent_framework import Agent, WorkflowBuilder, Message from agent_framework.foundry import FoundryChatClient from azure.identity import DefaultAzureCredential client = FoundryChatClient( project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"), model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"), credential=DefaultAzureCredential(), ) # Create agents (in-process only, not persisted in portal) researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.") writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.") # Build and run multi-agent workflow workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build() async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True): print(event.content) What this approach provides Built-in orchestration capabilities Native support for multi-agent workflows Structured agent lifecycle management Context and memory handling Conceptually, this can be viewed as: Agents + Orchestration + System Design Observations from implementation When implementing similar use cases using MAF: Agent responsibilities became clearly defined Routing and delegation patterns were significantly simplified Overall system architecture became easier to maintain and scale This approach encourages thinking in terms of agent ecosystems rather than isolated agents. Architecture Comparison Agents SDK Microsoft Agent Framework (MAF) Choosing the Right Approach Use Agents SDK when: You need rapid development for a single-agent use case The workflow is relatively straightforward You prefer flexibility and lower-level control Use Microsoft Agent Framework when: You are designing multi-agent systems Your solution requires routing, delegation, or handoffs Long-term scalability and maintainability are essential Pros and Cons Summary Agents SDK Pros Easy to get started Strong Azure integration Flexible design Cons Manual orchestration required Limited native multi-agent support Complexity increases as scenarios grow Microsoft Agent Framework (MAF) Pros Built-in orchestration Native multi-agent support Scalable and structured architecture Cons Learning curve for new developers More opinionated framework design Reduced low-level control compared to SDK-based approach References and Repositories 🔗 Microsoft Agent Framework (MAF) Microsoft Agent Framework – GitHub Repository Microsoft Agent Framework Samples – Tutorials & Examples Workflow Samples (Multi-agent patterns) FoundryChatClient sample (Python) Agent Framework demos - GitHub Source 📘 Documentation Microsoft Agent Framework Overview (Microsoft Learn) Agent Framework + Microsoft Foundry provider docs 🔗 Azure AI Projects / Agents SDK Azure AI Projects SDK – Python (GitHub Source) Azure AI Projects Agents (.NET SDK repo) 📘 Documentation Azure AI Projects SDK (Python) – Microsoft Learn Azure AI Agents SDK – Microsoft Learn Conclusion Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape. Agents SDK enables quick and flexible agent development Microsoft Agent Framework enables structured, scalable agent systems In practice, the choice depends on whether you are building a single agent feature or a multi-agent system. Final Thought Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence In a follow-up blog, I’ll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.Building an End-to-End Azure RAG Strategy Agent with MS Foundry
High-Level Architecture This architecture represents an end-to-end Retrieval-Augmented Generation (RAG) pipeline where raw documents are ingested from Azure Blob Storage, processed using Document Intelligence, transformed into embeddings via Azure OpenAI, and indexed in Azure AI Search for hybrid retrieval. A Foundry/MAF-based agent orchestrates query processing by combining user input with relevant search results and generates contextual responses, which are exposed through a FastAPI or CLI interface. This solution is composed of two main layers: 1. Data Ingestion Layer (RAG Pipeline) This layer transforms raw enterprise documents into searchable knowledge. Flow: Raw documents stored in Azure Blob Storage Supported formats: PDF, DOCX, PPTX, images, etc. Document Intelligence extraction Extracts: Text Tables Key-value pairs Structure Writes output as structured JSON back to Blob (processed/) Chunking + Embedding Documents are split into chunks Each chunk is embedded using Azure OpenAI (text-embedding-*) Indexing into Azure AI Search Creates a hybrid index: Keyword search Semantic ranking Vector search Enables flexible retrieval strategies 2. Query Layer (Strategy Agents) This layer enables intelligent query answering. Flow: User sends a query via: FastAPI endpoint CLI interface Query is handled by: Microsoft Agent Framework (MAF) agent Running on Azure AI Foundry Agent: Queries Azure AI Search Retrieves top relevant chunks Injects them into LLM prompt LLM generates grounded response This follows the standard RAG pattern: Retrieval → Augmentation → Generation End-to-End Flow Key Azure Services Used Service Purpose Azure Blob Storage Raw + processed document storage Azure AI Document Intelligence Extract structured content Azure OpenAI Embeddings + LLM generation Azure AI Search Hybrid retrieval engine Azure AI Foundry Agent orchestration Microsoft Agent Framework Agent execution layer Why this Architecture Matters This solution goes beyond basic RAG and provides: Hybrid Retrieval Combines keyword + semantic + vector search Improves recall and accuracy Structured Document Parsing Handles complex enterprise documents Extracts tables and metadata Agent-Based Orchestration Enables reasoning over retrieval results Extensible for multi-agent workflows Scalable Data Pipeline Supports continuous ingestion Works with large document collections Enterprise Considerations Use Managed Identity for secure service access Apply RBAC on Cosmos DB / Search / Storage Enable Private Endpoints for network isolation Use Guardrails + Evaluations in Foundry Summary This repository demonstrates a production-ready Azure RAG architecture: Ingest → Extract → Chunk → Embed → Index Retrieve → Reason → Generate Powered by Azure AI Foundry + Agent Framework By combining data engineering + AI orchestration, it enables enterprise AI systems that are: Accurate Grounded Extensible Repo: https://github.com/snd94/azure-rag-strategy-agent Please refer to the Microsoft Learn Documentation for further information: Azure AI Search documentation - Azure AI Search | Microsoft Learn Document Intelligence documentation - Quickstarts, Tutorials, API Reference - Foundry Tools | Microsoft Learn How to generate embeddings with Azure OpenAI in Microsoft Foundry Models - Microsoft Foundry | Microsoft Learn How to generate embeddings with Azure OpenAI in Microsoft Foundry Models - Microsoft Foundry | Microsoft Learn Microsoft Agent Framework Overview | Microsoft Learn What is Microsoft Foundry? - Microsoft Foundry | Microsoft LearnLearn how to host your agents on Microsoft Foundry
We just concluded Host your agents on Foundry, a three-part livestream series where we explored how to deploy and host Python AI agents on Microsoft Foundry: Deploying Python agents to Foundry Hosted agents using the Azure Developer CLI Building hosted agents with Microsoft Agent Framework, including Foundry IQ integration and multi-agent workflows Building hosted agents with LangChain + LangGraph, including built-in tools like Bing Web Search Running quality and safety evaluations: bulk, scheduled, and continuous evals, guardrails, and red-teaming All of the materials from our series are available for you to keep learning from, and linked below: Video recordings of each stream PowerPoint slides that you can use for reviewing or even teaching the material to your own community Open-source code samples you can run yourself in your own Microsoft Foundry project Spanish speaker? Check out the Spanish version of the series. 🙋🏽♂️ Have follow up questions? Join the weekly Python+AI office hours on Foundry Discord. Host your agents on Foundry: Microsoft Agent Framework 📺 Watch YouTube recording In our first session, we deploy agents built with Microsoft Agent Framework (the successor of Autogen and Semantic Kernel). Starting with a simple agent, we add Foundry tools like Code Interpreter, ground the agent in enterprise data with Foundry IQ, and finally deploy multi-agent workflows. Along the way, we use the Foundry UI to interact with the hosted agent, testing it out in the playground and observing the traces from the reasoning and tool calls. 🖼️ Slides for this session 💻 Code repository with examples: foundry-hosted-agentframework-demos 📝 Write-up for this session Host your agents on Foundry: LangChain + LangGraph 📺 Watch YouTube recording In our second session, we deploy agents built with the popular open-source libraries LangChain and LangGraph. Starting with a simple agent, we add Foundry tools like Bing Web Search, ground the agent in Foundry IQ, then deploy more complex agents using the LangGraph orchestration framework. Along the way, we use the Foundry UI to interact with the hosted agent, testing it out in the playground and observing the traces from the reasoning and tool calls. 🖼️ Slides for this session 💻 Code repository with examples: foundry-hosted-langchain-demos 📝 Write-up for this session Host your agents on Foundry: Quality & safety evaluations 📺 Watch YouTube recording In our third session, we ensure that our AI agents are producing high-quality outputs and operating safely and responsibly. First we explore what it means for agent outputs to be high quality, using built-in evaluators to check overall task adherence and then building custom evaluators for domain-specific checks. With Foundry hosted agents, we run bulk evaluations on demand, set up scheduled evaluations, and even enable continuous evaluation on a subset of live agent traces. Next we discuss safety systems that can be layered on top of agents and audit agents for potential safety risks. To improve compliance with an organization's goals, we configure custom policies and guardrails that can be shared across agents. Finally, we ensure that adversarial inputs can't produce unsafe outputs by running automated red-teaming scans on agents, and even schedule those to run regularly as well. 🖼️ Slides for this session 💻 Code repository with examples: foundry-hosted-agentframework-demos 📝 Write-up for this sessionBuilding AI Agents with Microsoft Foundry: A Progressive Lab from Hello World to Self-Hosted
AI agent development has a steep on-ramp. The combination of new SDKs, tool-calling patterns, model selection decisions, retrieval-augmented generation, and deployment concerns means most developers spend more time wiring things together than actually building anything useful. The Microsoft Foundry Agent Lab is a structured, open-source demo series designed to change that — nine self-contained demos, each adding exactly one new concept, all built on the same Microsoft Foundry SDK and a single model deployment. This post walks through what the lab contains, how each demo works under the hood, and the architectural decisions that make it a useful reference for AI engineers building production agents. Why a Progressive Lab? Agent frameworks can be overwhelming. A developer who opens a rich example with RAG, tool-calling, streaming, and a custom UI all at once has no clear line of sight to which parts are essential and which are embellishments. The Foundry Agent Lab takes the opposite approach: start with the absolute minimum and introduce one new primitive per demo. By the time you reach Demo 8, you have seen every major capability — not in one monolithic sample, but in a layered sequence where each addition is visible and understandable. # Demo New Concept Tool Used UX 0 hello-demo Agent creation, Responses API, conversations None Terminal 1 tools-demo Function calling, tool-calling loop, live API FunctionTool Terminal 2 desktop-demo UI decoupling — same agent, different surface None Desktop (Tkinter) 3 websearch-demo Server-side built-in tools, no client loop WebSearchTool Terminal 4 code-demo Code execution in sandbox, Gradio web UI CodeInterpreterTool Web (Gradio) 5 rag-demo Document upload, vector stores, RAG grounding FileSearchTool Terminal 6 mcp-demo MCP servers, human-in-the-loop approval MCPTool Terminal 7 toolbox-demo Centralized tool governance, Toolbox versioning Toolbox Terminal 8 hosted-demo Self-hosted agent with Responses protocol Custom server Terminal + Agent Inspector The Model Router: One Deployment to Rule Them All Before diving into the demos, it is worth understanding the one architectural decision that ties the entire lab together: every agent uses model-router as its model deployment. MODEL_DEPLOYMENT=model-router Model Router is a Microsoft Foundry capability that inspects each request at inference time and routes it to the optimal available model — weighing task complexity, cost, and latency. A simple factual question goes to a fast, cheap model. A complex tool-calling chain with code generation gets routed to a frontier model. You write zero routing logic. The lab's MODEL-ROUTER.md file contains empirical observations from running all nine demos. A sample of what the router selected: Demo Query Task Type Model Selected hello "What's the capital of WA state?" Factual recall grok-4-1-fast-reasoning hello "Summarize our conversation" Summarization gpt-5.2-chat-2025-12-11 tools "What's the weather in Seattle?" Tool-using gpt-5.4-mini-2026-03-17 code Data analysis with code generation Code generation + execution gpt-5.4-2026-03-05 rag HR policy document question Retrieval + synthesis gpt-5.3-chat-2026-03-03 This is the strongest signal in the lab: you do not need to reason about model selection. You declare what your agent needs to do; the router handles the rest, and it chooses correctly. Demo 0: The Minimum Viable Agent The hello-demo establishes the baseline pattern used by every subsequent demo. Two files: one to register the agent, one to chat with it. Registering the agent from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient from azure.ai.projects.models import PromptAgentDefinition credential = DefaultAzureCredential() project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, instructions="You are a helpful, friendly assistant.", ), ) Authentication uses DefaultAzureCredential , which works with az login locally and with managed identity in production — no API keys anywhere in the code. Chatting with the agent # Create a server-side conversation (persists history across turns) conversation = openai.conversations.create() # Each turn sends the user message; the agent sees full history response = openai.responses.create( input=user_input, conversation=conversation.id, extra_body={"agent_reference": {"name": AGENT_NAME, "type": "agent_reference"}}, ) print(response.output_text) The conversation object is server-side. You pass its ID on every turn; the history lives in Foundry, not in a local list. This is the Responses API pattern — distinct from the older Completions or Chat Completions APIs. Demo 1: Function Tools and the Tool-Calling Loop Demo 1 adds function calling against a real weather API. The key insight here is that the model does not execute the function — it requests the execution, and your code executes it locally, then feeds the result back. Declaring a function tool from azure.ai.projects.models import FunctionTool, PromptAgentDefinition func_tool = FunctionTool( name="get_weather", description="Get the current weather for a given city.", parameters={ "type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"], }, strict=True, ) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[func_tool], instructions="You are a weather assistant...", ), ) The tool-calling loop response = openai.responses.create(input=user_input, conversation=conversation.id, ...) # Loop while the model is requesting tool calls while any(item.type == "function_call" for item in response.output): input_list = [] for item in response.output: if item.type == "function_call": args = json.loads(item.arguments) result = get_weather(args["city"]) # execute locally input_list.append(FunctionCallOutput(call_id=item.call_id, output=result)) # Send results back to the agent response = openai.responses.create(input=input_list, conversation=conversation.id, ...) print(response.output_text) The strict=True parameter on FunctionTool enforces structured outputs — the model must return arguments that match the declared JSON schema exactly. This eliminates argument parsing errors in production. Demo 2: UI Is Not Your Agent Demo 2 runs the exact same agent as Demo 1 but surfaces it in a Tkinter desktop window. The point is pedagogical: your agent definition, conversation management, and tool-calling logic are entirely independent of your UI layer. Swapping from terminal to desktop requires changing only the presentation code — nothing in the agent or conversation path changes. This is a principle worth internalising early: agent logic and UI logic should never be entangled. The lab enforces this separation structurally. Demo 3: Server-Side Built-In Tools The web search demo introduces a sharp contrast with Demo 1. With WebSearchTool , the tool-calling loop disappears entirely from client code: from azure.ai.projects.models import WebSearchTool agent = project.agents.create_version( agent_name="Search-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[WebSearchTool()], instructions="You are a research assistant...", ), ) The agent decides when to search, executes the search server-side, and returns a grounded response with citations. Your client code looks identical to Demo 0 — a simple responses.create() call with no tool loop. The distinction matters architecturally: Function tools (Demo 1) — tool execution happens on your client; you control the code, the API call, the error handling. Built-in tools (Demo 3+) — tool execution happens inside Foundry; you get results without managing execution. Demo 4: Code Interpreter and the Gradio Web UI Demo 4 attaches CodeInterpreterTool , which gives the agent a sandboxed Python execution environment inside Foundry. The agent can write code, run it, observe output, and iterate — all server-side. Combined with a Gradio web interface, this demo shows an agent that can perform data analysis, generate charts, and explain results through a browser UI. Model Router is particularly interesting here: the empirical data shows it selects a more capable frontier model ( gpt-5.4-2026-03-05 ) for code-generation tasks, while simpler conversational turns stay on lighter models. Demo 5: Retrieval-Augmented Generation with FileSearchTool Demo 5 introduces RAG. The setup phase uploads a document, creates a vector store, and attaches it to the agent: # Upload document and create a vector store vector_store = openai.vector_stores.create(name="employee-handbook-store") with open("data/employee-handbook.md", "rb") as f: openai.vector_stores.files.upload_and_poll( vector_store_id=vector_store.id, file=f ) # Attach the vector store to the agent agent = project.agents.create_version( agent_name="RAG-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[FileSearchTool(vector_store_ids=[vector_store.id])], instructions="Answer questions using only the provided documents...", ), ) At query time, the agent embeds the question, searches the vector store semantically, retrieves matching chunks, and generates an answer grounded in the retrieved content — entirely server-side. The client code remains a plain responses.create() call. An important detail: the .vector_store_id file is written to disk during setup and read back during the chat session, so the demo survives process restarts without re-uploading the document. The .gitignore excludes this file from source control. Demo 6: Model Context Protocol Demo 6 connects the agent to a GitHub MCP server, giving it access to repository and issue data via the open Model Context Protocol standard. MCP servers expose tools over a standardised wire protocol; the agent discovers and calls them without any client-side function declarations. The demo also demonstrates human-in-the-loop approval: before executing any MCP tool call, the agent surfaces the proposed action and waits for the user to confirm. This is an important safety pattern for agents that can trigger side effects on external systems. Demo 7: Toolbox — Centralised Tool Governance Where Demo 6 connects to a single MCP server directly, Demo 7 uses a Toolbox — a managed Microsoft Foundry resource that bundles multiple tools into a single, versioned, MCP-compatible endpoint. The Toolbox in this demo exposes both GitHub Issues and GitHub Repos tools, curated into an immutable versioned snapshot. This pattern is significant for production multi-agent systems: Centralised governance — one team owns the tool definitions; all agents consume them via a single endpoint. Versioned snapshots — promoting a new Toolbox version is explicit; agents pin to a version and upgrade intentionally. MCP compatibility — any MCP-capable agent or framework can connect, not just Foundry SDK agents. from azure.ai.projects.models import McpTool toolbox_tool = McpTool( server_label="toolbox", server_url=TOOLBOX_ENDPOINT, allowed_tools=[], # empty = all tools in the Toolbox version headers={"Authorization": f"Bearer {token}"}, ) Demo 8: Self-Hosted Agent with the Responses Protocol The final demo departs from the prompt-agent pattern. Instead of registering a declarative agent in Foundry, Demo 8 implements a custom agent server using the Responses protocol. The server exposes a streaming HTTP endpoint; Foundry's Agent Inspector can connect to it and route user turns to it just as it would to a hosted prompt agent. This demo includes a Dockerfile and an agent.yaml , enabling deployment to Foundry's container hosting service. It uses gpt-4.1-mini directly rather than the model router, because the custom server owns the entire inference path. When to consider this pattern: Your agent requires custom pre- or post-processing logic that cannot be expressed in a system prompt. You need to integrate with infrastructure that is not reachable through MCP or built-in tools. You want to own the inference call for cost control, A/B testing, or compliance reasons. You are building a multi-agent orchestrator that needs to expose itself as an agent to other orchestrators. Getting Started The lab requires Python 3.10 or higher, an Azure subscription with a Microsoft Foundry project, and the Azure CLI. 1. Clone and set up the virtual environment git clone https://github.com/microsoft-foundry/Foundry-Agent-Lab.git cd Foundry-Agent-Lab # Create and activate the virtual environment python -m venv .venv # Windows Command Prompt .venv\Scripts\activate.bat # Windows PowerShell .venv\Scripts\Activate.ps1 # macOS / Linux source .venv/bin/activate pip install -r requirements.txt 2. Configure a demo copy hello-demo\.env.sample hello-demo\.env # Edit hello-demo\.env and set PROJECT_ENDPOINT Your PROJECT_ENDPOINT is on the Overview page of your Foundry project in the Azure portal. It takes the form https://your-resource.ai.azure.com/api/projects/your-project . 3. Run the demo az login 0-hello-demo Each numbered batch file at the root activates the virtual environment, runs create_agent.py , and launches chat.py . Append log to capture the full session transcript: 0-hello-demo log Reset between runs hello-demo\reset.bat Every demo includes a reset.bat that deletes the registered agent and any associated resources (vector stores, uploaded files). Demos are fully repeatable. Architecture Principles Demonstrated Across the nine demos, the lab illustrates a set of design principles that apply directly to production agent systems: Keyless authentication throughout Every demo uses DefaultAzureCredential . No API keys appear anywhere in the code. Locally, az login provides credentials. In production, managed identity takes over automatically — same code, no secrets to rotate. Server-side conversation state The Responses API stores conversation history server-side. Your application passes a conversation ID; Foundry maintains the thread. This eliminates the common bug of truncating history due to local list management and makes multi-process or multi-instance deployments straightforward. Client-side vs server-side tool execution The lab makes the distinction explicit. Function tools execute in your process — you control the code, the external call, and the error handling. Built-in tools (WebSearch, CodeInterpreter, FileSearch) execute inside Foundry — you get results without managing execution infrastructure. MCP tools (Demo 6, 7) fall between these: they execute in a separately deployed server, with the protocol mediating the call. Progressive tool introduction Each demo's create_agent.py registers the agent once. The chat.py file handles the conversation loop. These two responsibilities are always separate, making it easy to update agent definitions without modifying conversation logic, and vice versa. Security Considerations When building agents for production, keep the following in mind: Never commit .env files. The .gitignore excludes them, but verify this before pushing. Use Azure Key Vault or environment variable injection in CI/CD pipelines. Use managed identity in production. DefaultAzureCredential automatically picks up managed identity when deployed to Azure, eliminating the need for any stored credentials. Apply human-in-the-loop for side-effecting tools. Demo 6 demonstrates this pattern for MCP tool calls. Any agent that can modify external state (create issues, send emails, write files) should surface proposed actions for confirmation. Validate tool outputs before use. Treat data returned by external tools (weather APIs, search results, document retrieval) as untrusted input. Prompt injection through tool results is a real attack surface; grounding instructions in your system prompt reduce but do not eliminate this risk. Scope Toolbox permissions narrowly. When using a Toolbox (Demo 7), use allowed_tools to restrict which tools the agent can call, rather than granting access to all tools in a Toolbox version. Key Takeaways Start with the minimum. A prompt agent with no tools requires fewer than 30 lines of code using the Foundry SDK. Add tools only when the use case demands them. Use model-router unless you have a specific reason not to. The empirical data in the lab shows the router selects appropriate models across all task types — factual, creative, tool-calling, RAG, and code generation. Understand the client/server tool boundary. Function tools give you control; built-in tools give you simplicity. MCP and Toolbox give you governance and interoperability. Choose based on where you need control and where you need scale. Conversation state belongs on the server. Do not maintain conversation history in application memory if you can avoid it. The Responses API conversation object is designed for this. The hosted-demo pattern is for when you need to own the inference path. For most use cases, a declarative prompt agent is sufficient and far simpler to operate. Next Steps Explore the repo: github.com/microsoft-foundry/Foundry-Agent-Lab Microsoft Foundry SDK documentation: learn.microsoft.com/azure/ai-studio/ Responses API quickstart: Prompt agent quickstart Model Router conceptual documentation: Model Router for Microsoft Foundry Model Context Protocol: modelcontextprotocol.io Azure Identity SDK (DefaultAzureCredential): azure-identity Python SDK The Foundry Agent Lab is open source under the MIT licence. Contributions, bug reports, and feature requests are welcome through GitHub Issues. See CONTRIBUTING.md for guidelines.Agents League: The Esports-Inspired Hackathon Where AI Agents Battle for Glory
Ready to put your AI skills to the ultimate test? Agents League is here, a dynamic, esports-inspired developer challenge that brings the thrill of live competition to the world of agentic AI. Whether you're a seasoned AI developer or just getting started, this is your chance to build, compete, and win. What is Agents League? Agents League is a week-long hackathon running as part of AI Skills Fest (June 4–14, 2026). Unlike traditional hackathons, Agents League combines live AI coding battles, asynchronous project submissions, and a thriving Discord community all competing for a total prize pool of $55,000 USD. This isn't just about building it's about showcasing what's possible with agentic AI in a format that's fast, competitive, and globally accessible. Three Challenge Tracks Pick One or Compete in All 1. Creative Apps Build innovative applications using GitHub Copilot for AI-assisted development. Show off your creativity and demonstrate how AI can accelerate app creation from concept to code. 2. Reasoning Agents Create intelligent agents using Microsoft Foundry that solve complex problems through multi-step reasoning. This track is all about building agents that can think, plan, and execute. 3. Enterprise Agents Build business-ready knowledge agents integrated with Microsoft 365 Copilot, authored in Copilot Studio. Perfect for developers focused on real-world enterprise solutions. Live Microsoft Reactor Events—Don't Miss the Battles! The heart of Agents League beats through live Microsoft Reactor events. Watch experts go head-to-head in live coding battles, learn cutting-edge techniques, and get inspired for your own submissions: Event What You'll Learn Creative Apps Battle See GitHub Copilot in action as experts build innovative apps live Reasoning Agents Battle Watch multi-step reasoning agents come to life with Microsoft Foundry Enterprise Agents Battle Learn to build M365-integrated agents with Copilot Studio 👉 View the full event series Key Dates Registration Deadline: June 12, 2026, 12:00 PM PT Hacking Period: June 4–14, 2026 Submission Deadline: June 14, 2026, 11:59 PM PT What You Get Live coding battles with expert demonstrations Curated technical experiences and on-demand content Learning resources on Microsoft Learn and AI Skills Navigator Community support through Discord GitHub-based submissions for transparent, collaborative judging Why Participate? Agents League isn't just another hackathon. It's designed as a streamlined, competitive format that: ✅ Fits into your schedule with focused, time-boxed challenges ✅ Provides real-world product innovation experience ✅ Offers global accessibility—participate from anywhere ✅ Demonstrates the latest capabilities of agentic AI, including new IQ tools ✅ Connects you with a passionate developer community Ready to Enter the Arena? Register Now for Agents League Before you register: Review the Hackathon Rules and Regulations for prize categories and judging criteria Join the Microsoft Reactor event series for live battles and learning Check out the Microsoft Event Code of Conduct Join the Conversation Have questions? Want to connect with fellow competitors? Join the Agents League community on Discord and start strategizing with developers from around the world. Whether you're building creative apps, reasoning agents, or enterprise solutions—the arena awaits. May the best agent win! 🏆 Agents League hackathon is open to the public and offered at no cost. Government employees should check with their employers to ensure participation is permitted in accordance with applicable policies. Related Links: Agents League Hackathon Registration Microsoft Reactor Series AI Skills FestHow to Visualize Your Azure AI Workloads Usage for Observability
This article assumes you already have an Azure Foundry project and resource deployed in Microsoft Foundry. The options referenced here are documented in detail in the linked articles; this post serves as a consolidated step by step guide bringing them all together and explaining where each option is most useful. A Summary: Need Best Option Quick day-over-day visual, minimal setup Grafana Dashboard (Option 3) Custom growth % calculations App Insights + KQL in Log Analytics (Option 4) Shareable, interactive report Azure Workbooks (Option 5) Per-user/per-agent granularity APIM + App Insights (Option 6) Quick one-off chart, export to Excel Microsoft Foundry Monitor tab or App Insights Metrics Explorer (Option 1 and 2) Option 1. Within the Microsoft Foundry Portal (Quickest, No Setup) If you have models deployed in Microsoft Foundry and would like to monitor its usage, go to the New Foundry Portal → Build → Models → Monitor tab. View metrics such as: Estimated cost Total token usage Input vs. output tokens Number of requests This is the simplest way to monitor both model and agent usage. For PAYG plans: You can also view your total allocated quota (and figure out which Tier you are on) using the Quota Management Screen (New Foundry Portal → Operate → Quota tab). This screen shows how much your total allocated quota is, per model in a given subscription + region + Deployment Type (Global, Data Zones or Regional). For eg., in the image below, for gpt-4o, I am allocated 7M total TPM in my subscription. I am only using 150K TPM of the allocated 7M TPM amount. Which means, my requests will get throttled if I exceed the 150K TPM limit. To avoid throttling, I would need to increase my shared allocation limit. NOTE: you are charged for usage, so if you allow more capacity, you use more, so you pay more. Option 2: Azure Monitor Metrics Explorer This is already built into the Azure Portal and gives you time-series charts out of the box. Go to Azure Portal → your Azure OpenAI / Foundry resource → Monitoring → Metrics Select a metric like AzureOpenAIRequests or TokenTransaction Set Aggregation to Sum (total) or Max and Time granularity to 1 day Split by ModelDeploymentName to see per-model trends Adjust the time range (e.g., last 30 days) — you'll see day-over-day bars/lines Tip: You can pin these charts to an Azure Dashboard for a persistent view, or click Share → Download to Excel to get the raw data for your own analysis. Option 3: Azure Managed Grafana (Best Pre-Built Dashboard) This is the best option for a polished, real-time, day-over-day dashboard with no custom code. There's a pre-built AI Foundry dashboard ready to import. [grafana.com], [Create a M...ed Grafana] How to set it up: Create an Azure Managed Grafana workspace (if you don't have one) In Grafana, go to Dashboards → New → Import → enter dashboard ID 24039 (for Foundry) Select your Azure Monitor data source and point it to your Foundry resource Tip: You can also import this directly from the Azure Portal: Monitor → Dashboards with Grafana → AI Foundry. That's it — the dashboard gives you (per model deployment): Token trends over time (inference, prompt, completion — day over day) Request trends over time (AzureOpenAIRequests as a time series) Latency trends (bonus) NOTE: Default time range is 7 days — adjust to 30/60/90 days for growth trends Option 4: Application Insights + KQL Queries (Most Flexible, Custom Reports) If you want fully custom day-over-day growth calculations (e.g., % change day-to-day), this is the way. [azurefeeds.com] Setup: Ensure your Foundry project is connected to an Application Insights resource (Foundry → Settings → Connected Resources). Open up App Insights resource → Logs → New Query or choose a sample query. In the images below, we simply ran 'requests' and set the time range to 24 hours. There is also a Kusto Query Language (KQL) mode or Simple mode on the right-hand side: Simple mode will let you run out of the box samples. KQL mode will open up a query window for you to enter custom queries. Below are the results in grid view. Same view but showing a chart: Export options: Another way to get the above graphs are via Log Analytics. Simply enable Diagnostic Settings on your Azure OpenAI resource → send to a Log Analytics workspace. Open Log Analytics → Logs and try our your sample queries. Sample KQL for day-over-day token usage (adjust to your needs): AzureMetrics | where MetricName in ("TokenTransaction", "ProcessedPromptTokens", "GeneratedTokens") | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by bin(TimeGenerated, 1d), MetricName | order by TimeGenerated asc | render timechart Result: Sample KQL for day-over-day growth % (adjust to your needs): AzureMetrics | where MetricName == "TokenTransaction" | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by Day = bin(TimeGenerated, 1d) | sort by Day asc | extend PrevDay = prev(DailyTokens) | extend GrowthPct = round((DailyTokens - PrevDay) / PrevDay * 100, 2) | project Day, DailyTokens, GrowthPct Option 5: Azure Monitor Workbooks (Custom Dashboards, Shareable) Workbooks let you build interactive, parameterized dashboards that combine metrics and KQL logs. What's more, you can select resources from multiple subscriptions and visualize them all in one place using Workbooks! Go to Azure Portal → Monitor → Workbooks → New Add a Metrics query panel → select your Log Analytics or App Insights or Foundry resource -> Enter the same query you used in Option 4. Do a test run and view the graphs (this can be viewed as charts or a list (grid view)): 4. Save and share with your team. Option 6: APIM + Application Insights (Granular Per-Caller/Per-Agent Tracking) 1. If your app routes requests through Azure API Management, you can use the azure-openai-emit-token-metric policy to send per-request token metrics to Application Insights with custom dimensions (User ID, Subscription ID, Agent, etc.). [Azure API...osoft Docs] This is ideal for scenarios like: "Which agent consumed the most tokens last week?" "What's the token usage per API consumer/team?" NOTE: Microsoft Foundry resources do not track usage by users. So, fronting your Foundry resource with an APIM could be a way to track users provided you pass the username/id in the request context. How you implement this is upto your app design. Ref: AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHub Bonus: Check out all other APIM + AI related policies here: AI-Gateway/labs/semantic-caching at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-rate-limiting at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHubMicrosoft Foundry Toolkit for VS Code is Now Generally Available
We are thrilled to announce that the Microsoft Foundry Toolkit for VS Code, formerly AI Toolkit, is now Generally Available (GA)! From first model prompt to production‑grade AI agents, Foundry Toolkit lets you build, debug, and ship AI end to end without ever leaving VS Code. Same Product. New Name. You may know this extension as AI Toolkit — and we thank you for using it in the past year and for the continuous feedback that has shaped the product. With this GA release, we’re rebranding AI Toolkit to Microsoft Foundry Toolkit. The new name reflects where we’re headed: a single, unified developer experience for building AI apps and agents on the Microsoft AI platform. Rest assured, this is a name change only — there are no plans to remove or deprecate any existing features. Empower AI Development from Idea to Production with Foundry Toolkit The GA release brings together the most requested features into a high-performance workflow: 🧪 Curated Model Playground: Don’t waste time with setup. Browse and chat with over 100+ state-of-the-art models from Microsoft Foundry, GitHub, OpenAI, Anthropic, Ollama, and more. Compare performance side-by-side and export production-ready code in seconds. 🤖 Agent Builder (No-Code/low code): Experiment with agent ideas or build sophisticated agents without writing boilerplate code. Define instructions, link tools from the Foundry catalog, or connect local MCP (Model Context Protocol) servers to have a functional agent running in minutes. ✨GitHub Copilot powered agent development: With Foundry tools and skills built into the Toolkit, GitHub Copilot is equipped with deep context to jumpstart agent creation using the Microsoft Agent Framework - often from a single prompt. 🛠️ Deep-Cycle Debugging: Move beyond black-box AI. The Agent Inspector provides real-time workflow visualization, breakpoints, and full local tracing across tool calls and agent chains. ⚡ Edge-Optimized Performance: Specialized support for the Phi model family. Fine-tune Phi Silica on your data, quantize for NPU/GPU targets, and profile on-device performance to ensure your models run lean and fast. 🚀 Seamless Scale: Transition from local to cloud with one click. Deploy directly to the Microsoft Foundry Agent Service and run continuous evaluations using familiar pytest syntax within the VS Code Test Explorer. Get Started Today Install: Microsoft Foundry Toolkit on VS Code Marketplace. Quick Start: Follow our 3-Minute Getting Started Tutorial to build your first AI agent. Deep Dive: Check out documentations, Samples, and workshop. Join the Community Join us on Model Monday event on 4/20 where we will talk through Building Foundry Agents using VS Code and GitHub Copilot. We can’t wait to see what you build. Share your projects, file issues, or suggest features on our GitHub repository. Welcome to the next chapter of AI development!Six Coding Agents, One Production System: A Field Guide to AgenticOps with AKS-Lab-GitHubCopilot
The shift: from "AI helps me code" to "AI authors my repo" For two years we've been talking about GitHub Copilot as an inline pair programmer — a clever autocomplete that lives in your editor. That framing is officially out of date. The new reality is agentic delivery: a team of named, scoped AI agents owns slices of your repository, each with its own tools, skills, and refusal rules. They produce pull requests. They run tests. They roll deployments. And when one finishes its turn, it hands off to the next. The microsoft/AKS-Lab-GitHubCopilot's five labs you ship ZavaShop — a multi-agent retail supply-chain control plane running on AKS + Azure Container Apps — and along the way you internalize an operating model you can carry to any project. Everything in the repo (specs, agents, MCP servers, tests, Bicep, Helm, GitHub Actions) is authored by six GitHub Copilot Custom Coding Agents working from your IDE, plus the remote GitHub Copilot Coding Agent that closes the PR loop on GitHub. This is what AgenticOps looks like in practice. Two layers of agents — don't confuse them The first cognitive hurdle in this lab is keeping two very different agent populations straight: Layer What it is When it lives Examples Application agents The product you ship — the runtime ZavaShop fleet that solves a business problem Production (AKS + ACA) InventoryAgent, SupplierAgent, LogisticsAgent, PricingAgent, OrchestratorAgent Coding agents The dev-time team that writes the application agents Your IDE + GitHub requirements-analyst, mcp-builder, agent-builder, orchestrator-architect, test-author, deploy-engineer Both are built with the Microsoft Agent Framework (MAF). Both use the GitHub Copilot SDK as their model provider. But they exist at different layers of the development lifecycle, and the entire lab is structured around that distinction. If you only remember one thing from this post: the coding agents are how you build the application agents. That is the whole AgenticOps loop, compressed into one sentence. GitHub Copilot Coding Agent vs. Custom Coding Agents There are two flavors of "coding agent" in the GitHub Copilot ecosystem, and this lab uses both. 1. The remote GitHub Copilot Coding Agent This is the GitHub-side, asynchronous, PR-driven agent. You assign it an issue, it spins up a sandboxed environment, writes the code, runs the tests, and opens a PR for human review. You don't watch it work — you review what it produces. In ZavaShop, Lab 04 (Testing) explicitly uses this agent: you take a failing eval scenario, file it as an issue, assign it to Copilot, and the agent comes back with a PR. Your job is the human bar, not the keystrokes. Important governance choice from AGENTS.md: the remote Coding Agent is allowed to open PRs against src/ and tests/ only — never against infra/ without human review. That single rule is a textbook example of agent-aware policy. 2. The local Custom Coding Agents These are scoped, in-IDE specialist agents you select <agent name> in Copilot Chat. They live as *.agent.md files inside .github/agents/ and are discovered by VS Code on reload. Each one owns exactly one slice of the repository. Six of them ship in this lab: Phase Agent Owns Refusal rule Requirements requirements-analyst specs/*.md Refuses to write code MCP tools mcp-builder src/mcp_servers/* One server per turn Specialist agents agent-builder src/agents/<specialist>/* One specialist per turn Orchestration orchestrator-architect src/agents/orchestrator/*, src/shared/*, docker-compose.yml Owns wiring, not business logic Tests test-author tests/** Never edits src/ Deploy deploy-engineer infra/**, .github/workflows/** Won't touch application code The pattern that matters here isn't just "we made some custom agents." It's that every agent declares what it owns and what it refuses to do. That refusal envelope is what makes the system safe to delegate to. Without it, you'd just have a noisier autocomplete. Three workflow prompts in .github/prompts/ chain the agents together so you don't have to remember the sequence: /feature-from-issue — issue → spec → code → tests → PR → deploy /spec-to-code — drive an existing spec through code + tests /ship-it — quality gate → build → push → ACR/ACA/AKS rollout → smoke + evals This is the closest thing I've seen to a programmable software development lifecycle. Where AgenticOps fits in DevOps gave us repeatable infrastructure. MLOps gave us repeatable model lifecycles. AgenticOps is what you need when the thing you're operating is itself a fleet of autonomous agents — both at build time and at runtime. The lab makes the four pillars of AgenticOps concrete: Specs as the contract. /requirements-analyst produces specs/<slug>.md files with goals, contracts, and eval scenarios. Nothing else in the repo is built until that spec exists. Specs are the source of truth that human reviewers actually read. Skills as living documentation. .github/skills/<skill>/SKILL.md files hold shared, agent-agnostic knowledge — Python conventions, Kubernetes patterns, MAF idioms. Every coding agent declares which skills it must consult before writing code. This is how you stop drift: knowledge lives in one place and is pulled in on demand. Evals as the quality gate. The repo runs a four-layer test pyramid plus five golden eval scenarios (S1–S5). uv run poe check runs locally and in GitHub Actions. Copilot-authored PRs must pass the same bar a human does — no exceptions. Observability tied to agent identity. Every agent emits agent.name, agent.run_id, and agent.span_id through structlog. When something misbehaves in production, you can trace the line from "this evaluation failed" all the way back to "this version of this agent, on this run, called this tool with these arguments." These four pillars aren't ZavaShop-specific. They're the contract for any AgenticOps system: scoped ownership, contracts as code, evals as gates, identity in every span. Walking through the workshop: which agent does what, when The five labs are five chapters of one story — ZavaShop going from an empty Azure subscription to a live retail control plane. Each lab activates a different subset of coding agents. Lab 01 — Environment Setup (no coding agents yet) You provision the platform: AKS cluster, ACA environment, Azure Container Registry, Key Vault, and the Workload Identity that every agent will wear. Then you install the six Custom Coding Agents into your IDE. Think of this as hiring the development team and giving them their badges. Lab 02 — Agent Creation (four agents in play) This is where it clicks. You start by requirements-analyst in Copilot Chat to produce the spec for each ZavaShop application agent. Then mcp-builder is invoked four times to scaffold the four MCP servers — one per domain (inventory DB, supplier API, shipping API, pricing API). Then agent-builder runs four more times to build the typed ChatAgent specialists. Finally orchestrator-architect wires them together with a MAF Workflow. What's stunning about this lab is the handoff discipline. Every coding agent ends its turn with a line naming the next agent to invoke. You're not orchestrating the work — the agents are. Lab 03 — Multi-Agent Orchestration & Config (two agents) The orchestrator stops being a one-shot LLM call and becomes a deterministic Workflow. Secrets move from .env to Key Vault. The whole fleet boots locally with Docker Compose. This is orchestrator-architect's star turn — wiring A2A endpoints, MCP tool registration, Key Vault hydration, OpenTelemetry. Specs come from requirements-analyst; the rest is orchestration. Lab 04 — Testing (both coding agent flavors) /test-author writes the four-layer pyramid (unit, MCP contract, integration, eval). Then you switch gears: take a failing eval scenario, file it as a GitHub issue, and assign it to the remote GitHub Copilot Coding Agent. The agent works asynchronously, opens a PR, and uv run poe check decides whether it passes. This is the lab where the local-vs-remote distinction stops being abstract and starts being operational. Lab 05 — Deployment & Run (deployment specialist) /deploy-engineer writes the Helm chart for the AKS orchestrator and the Bicep modules for the ACA specialists. The /ship-it workflow prompt then runs the full pipeline: quality gate → ACR build → ACA deploy → AKS rollout → smoke tests → evals. GitHub Actions OIDC re-runs the same pipeline on every main push. Notice the pattern across all five labs: at no point does a human write production code from scratch. Humans set goals, review specs, approve PRs, and run quality gates. The keystrokes belong to agents. How Coding Agents transform the DevOps pipeline Take a step back from the lab and ask: what actually changes in your DevOps flow when you adopt this model? The atomic unit of work shifts. In classic DevOps the unit is the commit. In AgenticOps the unit is the spec. A spec drives one or more agents; agents produce commits; commits trigger CI; CI gates promotion. The commit becomes a derived artifact, not the starting point. Code review changes shape. You're no longer reviewing "did this human understand the codebase?" — you're reviewing "did this agent follow its refusal rules, consult its skills, and produce something that passes the evals?" Reviewers spend less time on style and more time on intent. The diff is often less interesting than the spec it came from. Governance becomes structural, not procedural. Instead of writing a wiki page that says "don't touch infra without review," you encode that rule in AGENTS.md and refuse to let the agent's tool set include infra paths. Policy becomes part of the agent definition, not a checklist humans hopefully remember. The CI pipeline expands. Beyond build/test/deploy, you now have an eval stage that asks "does the system still behave correctly on the golden scenarios?" — and a Copilot-authored PR has to pass the same eval stage as a human-authored one. The pipeline is the great equalizer. Onboarding compresses. A new engineer doesn't need to read 50 wiki pages to be productive. They read AGENTS.md, select the relevant agent walks them through. Institutional knowledge lives in .agent.md and SKILL.md files instead of senior engineers' heads. The net effect is a pipeline that's faster, more uniform, and easier to audit. Faster because agents parallelize what humans serialize. More uniform because every change goes through the same six-agent template. Easier to audit because every artifact has a named author and a refusal rule it had to respect. What to take away The AKS-Lab-GitHubCopilot workshop teaches three things at once. The surface lesson is "how to build a multi-agent retail system on AKS." The middle lesson is "how to use GitHub Copilot Custom Agents and the remote Coding Agent." The deepest lesson — and the one I'd argue matters most — is how to design a development process where AI agents are first-class citizens with bounded responsibilities, not free-form copilots. If you take the model and walk away from the lab, three patterns are worth keeping: Scope before capability. Don't give an agent every tool; give it the smallest surface that makes it useful. Specs are the API between humans and agents. Invest in requirements-analyst-style flows even if the rest of your stack isn't there yet. Evals are non-negotiable. The moment an agent can open a PR, you need a quality gate that doesn't care who the author is. Clone the repo microsoft/AKS-Lab-GitHubCopilot , hit Developer: Reload Window, select agents in Copilot Chat, and watch six teammates show up. That's the future of the DevOps pipeline — and it's already shipping. Resources microsoft/AKS-Lab-GitHubCopilot — The repository this post is built on. Best practices for using Copilot to work on tasks — Governance patterns for delegating issues to Copilot. GitHub Copilot SDK (Python) — The provider used by every agent in this lab.558Views0likes0CommentsHow to Test AI Agents with LangSmith: A Complete Guide
Testing AI agents is crucial for ensuring reliability and accuracy in production. Evaluation is a technique to evaluate your agents. Different type of evaluation are # Evaluation Type 1 Task Success (Pass / Fail) 2 Instruction Adherence 3 Correctness / Accuracy 4 Relevance 5 Groundedness (Hallucination) 6 Coherence / Fluency 7 Tool‑Use Accuracy 8 Safety / Harmfulness LangSmith provides powerful tools for creating datasets, running evaluations, and using LLM-as-judge techniques. This guide walks through the complete workflow using a practical example. Prerequistes : 1) create your account under langsmith. 2) generate langsmith key and store in .env file and load whenever a reference made for datacreation or doing evaluation or from command prompt use set LANGCHAIN_API_KEY = <your_api_key_here> Part 1: Creating Your Test Dataset The foundation of any good evaluation is a quality dataset. LangSmith allows you to create datasets programmatically with input-output pairs that serve as ground truth. from langsmith import Client def create_evaluation_dataset(): client = Client() # Create a new dataset dataset = client.create_dataset( dataset_name="Sample dataset", description="A sample dataset in LangSmith." ) # Define your test examples examples = [ { "inputs": {"question": "Which country is Mount Kilimanjaro located in?"}, "outputs": {"answer": "Mount Kilimanjaro is located in Tanzania."}, }, { "inputs": {"question": "What is Earth's lowest point?"}, "outputs": {"answer": "Earth's lowest point is The Dead Sea."}, }, ] # Add examples to the dataset client.create_examples(dataset_id=dataset.id, examples=examples) print(f"Created dataset: {dataset.name}") return dataset Best Practices for Dataset Creation Diverse Examples: Include edge cases and various question types Clear Ground Truth: Ensure reference answers are accurate and complete Sufficient Volume: Create enough examples to get statistically meaningful results Consistent Format: Maintain consistent input/output structure Part 2: Setting Up LLM-as-Judge Evaluation LLM-as-judge is a powerful technique where you use a language model to evaluate the quality of another model's responses. This approach scales well and can assess subjective qualities like correctness and hallucinations. import os from dotenv import load_dotenv from langsmith import Client, wrappers from openai import AzureOpenAI from openevals.llm import create_llm_as_judge from openevals.prompts import CORRECTNESS_PROMPT load_dotenv() # Wrap your AI client for LangSmith tracing openai_client = wrappers.wrap_openai(AzureOpenAI( azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"], api_key=os.environ["AZURE_OPENAI_API_KEY"], api_version="2025-04-01-preview", )) DEPLOYMENT_NAME = os.environ.get("AZURE_OPENAI_DEPLOYMENT", "gpt-5-mini") Defining Your Target Function The target function represents the AI agent you want to test: def target(inputs: dict) -> dict: """The AI agent being evaluated""" response = openai_client.chat.completions.create( model=DEPLOYMENT_NAME, messages=[ {"role": "system", "content": "Answer the following question accurately"}, {"role": "user", "content": inputs["question"]}, ], ) return {"answer": response.choices[0].message.content.strip()} Creating Custom Evaluators 1. Correctness Evaluator def correctness_evaluator(inputs: dict, outputs: dict, reference_outputs: dict): """Evaluates how correct the answer is compared to the reference""" evaluator = create_llm_as_judge( prompt=CORRECTNESS_PROMPT, # Pre-built prompt for correctness model="azure_openai:" + DEPLOYMENT_NAME, feedback_key="correctness", ) return evaluator( inputs=inputs, outputs=outputs, reference_outputs=reference_outputs ) 2. Hallucination Evaluator def hallucination_evaluator(inputs: dict, outputs: dict, reference_outputs: dict): """Detects if the answer contains unsupported claims""" evaluator = create_llm_as_judge( prompt="""You are an expert judge evaluating AI responses for hallucinations. <question> {inputs} </question> <answer> {outputs} </answer> <reference_answer> {reference_outputs} </reference_answer> Does the answer contain any claims or information that are not supported by the question or reference answer? Respond with true if the answer is free of hallucinations, false if it contains hallucinated information. You must also provide a brief explanation of your reasoning.""", model="azure_openai:" + DEPLOYMENT_NAME, feedback_key="hallucination", ) return evaluator( inputs=inputs, outputs=outputs, reference_outputs=reference_outputs ) Part 3: Running the Evaluation Execute the Complete Evaluation Pipeline def run_evaluation(): client = Client() # Run the evaluation experiment_results = client.evaluate( target, # Function to test data="Sample dataset", # Dataset name evaluators=[ # List of evaluators correctness_evaluator, hallucination_evaluator, ], experiment_prefix="first-eval-in-langsmith", max_concurrency=2, # Control API rate limits ) print("Evaluation Results:") print(experiment_results) return experiment_results if __name__ == "__main__": run_evaluation() Understanding Your Results When the evaluation completes, you'll get detailed metrics including: Individual Scores: Per-example results for each evaluator Aggregate Metrics: Overall performance across the dataset Trace Links: Deep links to view exact model interactions Comparison Views: Side-by-side comparisons of outputs vs. references Key Benefits of This Approach Automated Testing: Run comprehensive evaluations without manual review Scalable Assessment: Evaluate subjective qualities at scale Continuous Monitoring: Track performance changes over time Rich Analytics: Get detailed insights into failure modes