foundry local
7 TopicsBuilding a Local Research Desk: Multi-Agent Orchestration
Introduction Multi-agent systems represent the next evolution of AI applications. Instead of a single model handling everything, specialised agents collaborate—each with defined responsibilities, passing context to one another, and producing results that no single agent could achieve alone. But building these systems typically requires cloud infrastructure, API keys, usage tracking, and the constant concern about what data leaves your machine. What if you could build sophisticated multi-agent workflows entirely on your local machine, with no cloud dependencies? The Local Research & Synthesis Desk demonstrates exactly this. Using Microsoft Agent Framework (MAF) for orchestration and Foundry Local for on-device inference, this demo shows how to create a four-agent research pipeline that runs entirely on your hardware—no API keys, no data leaving your network, and complete control over every step. This article walks through the architecture, implementation patterns, and practical code that makes multi-agent local AI possible. You'll learn how to bootstrap Foundry Local from Python, create specialised agents with distinct roles, wire them into sequential, concurrent, and feedback loop orchestration patterns, and implement tool calling for extended functionality. Whether you're building research tools, internal analysis systems, or simply exploring what's possible with local AI, this architecture provides a production-ready foundation. Why Multi-Agent Architecture Matters Single-agent AI systems hit limitations quickly. Ask one model to research a topic, analyse findings, identify gaps, and write a comprehensive report—and you'll get mediocre results. The model tries to do everything at once, with no opportunity for specialisation, review, or iterative refinement. Multi-agent systems solve this by decomposing complex tasks into specialised roles. Each agent focuses on what it does best: Planners break ambiguous questions into concrete sub-tasks Retrievers focus exclusively on finding and extracting relevant information Critics review work for gaps, contradictions, and quality issues Writers synthesise everything into coherent, well-structured output This separation of concerns mirrors how human teams work effectively. A research team doesn't have one person doing everything—they have researchers, fact-checkers, editors, and writers. Multi-agent AI systems apply the same principle to AI workflows, with each agent receiving the output of previous agents as context for their own specialised task. The Local Research & Synthesis Desk implements this pattern with four primary agents, plus an optional ToolAgent for utility functions. Here's how user questions flow through the system: This architecture demonstrates three essential orchestration patterns: sequential pipelines where each agent builds on the previous output, concurrent fan-out where independent tasks run in parallel to save time, and feedback loops where the Critic can send work back to the Retriever for iterative refinement. The Technology Stack: MAF + Foundry Local Before diving into implementation, let's understand the two core technologies that make this architecture possible and why they work so well together. Microsoft Agent Framework (MAF) The Microsoft Agent Framework provides building blocks for creating AI agents in Python and .NET. Unlike frameworks that require specific cloud providers, MAF works with any OpenAI-compatible API—which is exactly what Foundry Local provides. The key abstraction in MAF is the ChatAgent . Each agent has: Instructions: A system prompt that defines the agent's role and behaviour Chat client: An OpenAI-compatible client for making inference calls Tools: Optional functions the agent can invoke during execution Name: An identifier for logging and observability MAF handles message threading, tool execution, and response parsing automatically. You focus on designing agent behaviour rather than managing low-level API interactions. Foundry Local Foundry Local brings Azure AI Foundry's model catalog to your local machine. It automatically selects the best hardware acceleration available (GPU, NPU, or CPU) and exposes models through an OpenAI-compatible API. Models run entirely on-device with no data leaving your machine. The foundry-local-sdk Python package provides programmatic control over the Foundry Local service. You can start the service, download models, and retrieve connection information—all from your Python code. This is the "control plane" that manages the local AI infrastructure. The combination is powerful: MAF handles agent logic and orchestration, while Foundry Local provides the underlying inference. No cloud dependencies, no API keys, complete data privacy: Bootstrapping Foundry Local from Python The first practical challenge is starting Foundry Local programmatically. The FoundryLocalBootstrapper class handles this, encapsulating all the setup logic so the rest of the application can focus on agent behaviour. The bootstrap process follows three steps: start the Foundry Local service if it's not running, download the requested model if it's not cached, and return connection information that MAF agents can use. Here's the core implementation: from dataclasses import dataclass from foundry_local import FoundryLocalManager @dataclass class FoundryConnection: """Holds endpoint, API key, and model ID after bootstrap.""" endpoint: str api_key: str model_id: str model_alias: str This dataclass carries everything needed to connect MAF agents to Foundry Local. The endpoint is typically http://localhost:<port>/v1 (the port is assigned dynamically), and the API key is managed internally by Foundry Local. class FoundryLocalBootstrapper: def __init__(self, alias: str | None = None) -> None: self.alias = alias or os.getenv("MODEL_ALIAS", "qwen2.5-0.5b") def bootstrap(self) -> FoundryConnection: """Start service, download & load model, return connection info.""" from foundry_local import FoundryLocalManager manager = FoundryLocalManager() model_info = manager.download_and_load_model(self.alias) return FoundryConnection( endpoint=manager.endpoint, api_key=manager.api_key, model_id=model_info.id, model_alias=self.alias, ) Key design decisions in this implementation: Lazy import: The foundry_local import happens inside bootstrap() so the application can provide helpful error messages if the SDK isn't installed Environment configuration: Model alias comes from MODEL_ALIAS environment variable or defaults to qwen2.5-0.5b Automatic hardware selection: Foundry Local picks GPU, NPU, or CPU automatically—no configuration needed The qwen2.5 model family is recommended because it supports function/tool calling, which the ToolAgent requires. For higher quality outputs, larger variants like qwen2.5-7b or qwen2.5-14b are available via the --model flag. Creating Specialised Agents With Foundry Local bootstrapped, the next step is creating agents with distinct roles. Each agent is a ChatAgent instance with carefully crafted instructions that focus it on a specific task. The Planner Agent The Planner receives a user question and available documents, then breaks the research task into concrete sub-tasks. Its instructions emphasise structured output—a numbered list of specific tasks rather than prose: from agent_framework import ChatAgent from agent_framework.openai import OpenAIChatClient def _make_client(conn: FoundryConnection) -> OpenAIChatClient: """Create an MAF OpenAIChatClient pointing at Foundry Local.""" return OpenAIChatClient( api_key=conn.api_key, base_url=conn.endpoint, model_id=conn.model_id, ) def create_planner(conn: FoundryConnection) -> ChatAgent: return ChatAgent( chat_client=_make_client(conn), name="Planner", instructions=( "You are a planning agent. Given a user's research question and a list " "of document snippets (if any), break the question into 2-4 concrete " "sub-tasks. Output ONLY a numbered list of tasks. Each task should state:\n" " • What information is needed\n" " • Which source documents might help (if known)\n" "Keep it concise — no more than 6 lines total." ), ) Notice how the instructions are explicit about output format. Multi-agent systems work best when each agent produces structured, predictable output that downstream agents can parse reliably. The Retriever Agent The Retriever receives the Planner's task list plus raw document content, then extracts and cites relevant passages. Its instructions emphasise citation format—a specific pattern that the Writer can reference later: def create_retriever(conn: FoundryConnection) -> ChatAgent: return ChatAgent( chat_client=_make_client(conn), name="Retriever", instructions=( "You are a retrieval agent. You receive a research plan AND raw document " "text from local files. Your job:\n" " 1. Identify the most relevant passages for each task in the plan.\n" " 2. Output extracted snippets with citations in the format:\n" " [filename.ext, lines X-Y]: \"quoted text…\"\n" " 3. If no relevant content exists, say so explicitly.\n" "Be precise — quote only what is relevant, keep each snippet under 100 words." ), ) The citation format [filename.ext, lines X-Y] creates a consistent contract. The Writer knows exactly how to reference source material, and human reviewers can verify claims against original documents. The Critic Agent The Critic reviews the Retriever's work, identifying gaps and contradictions. This agent serves as a quality gate before the final report and can trigger feedback loops for iterative improvement: def create_critic(conn: FoundryConnection) -> ChatAgent: return ChatAgent( chat_client=_make_client(conn), name="Critic", instructions=( "You are a critical review agent. You receive a plan and extracted snippets. " "Your job:\n" " 1. Check for gaps — are any plan tasks unanswered?\n" " 2. Check for contradictions between snippets.\n" " 3. Suggest 1-2 specific improvements or missing details.\n" "Start your response with 'GAPS FOUND' if issues exist, or 'NO GAPS' if satisfied.\n" "Then output a short numbered list of issues (or say 'No issues found')." ), ) The Critic is instructed to output GAPS FOUND or NO GAPS at the start of its response. This structured output enables the orchestrator to detect when gaps exist and trigger the feedback loop—sending the gaps back to the Retriever for additional retrieval before re-running the Critic. This iterates up to 2 times before the Writer takes over, ensuring higher quality reports. Critics are essential for production systems. Without this review step, the Writer might produce confident-sounding reports with missing information or internal contradictions. The Writer Agent The Writer receives everything—original question, plan, extracted snippets, and critic review—then produces the final report: def create_writer(conn: FoundryConnection) -> ChatAgent: return ChatAgent( chat_client=_make_client(conn), name="Writer", instructions=( "You are the final report writer. You receive:\n" " • The original question\n" " • A plan, extracted snippets with citations, and a critic review\n\n" "Produce a clear, well-structured answer (3-5 paragraphs). " "Requirements:\n" " • Cite sources using [filename.ext, lines X-Y] notation\n" " • Address any gaps the critic raised (note if unresolvable)\n" " • End with a one-sentence summary\n" "Do NOT fabricate citations — only use citations provided by the Retriever." ), ) The final instruction—"Do NOT fabricate citations"—is crucial for responsible AI. The Writer has access only to citations the Retriever provided, preventing hallucinated references that plague single-agent research systems. Implementing Sequential Orchestration With agents defined, the orchestrator connects them into a workflow. Sequential orchestration is the simpler pattern: each agent runs after the previous one completes, passing its output as input to the next agent. The implementation uses Python's async/await for clean asynchronous execution: import asyncio import time from dataclasses import dataclass, field @dataclass class StepResult: """Captures one agent step for observability.""" agent_name: str input_text: str output_text: str elapsed_sec: float @dataclass class WorkflowResult: """Final result of the entire orchestration run.""" question: str steps: list[StepResult] = field(default_factory=list) final_report: str = "" async def _run_agent(agent: ChatAgent, prompt: str) -> tuple[str, float]: """Execute a single agent and measure elapsed time.""" start = time.perf_counter() response = await agent.run(prompt) elapsed = time.perf_counter() - start return response.content, elapsed The StepResult dataclass captures everything needed for observability: what went in, what came out, and how long it took. This information is invaluable for debugging and optimisation. The sequential pipeline chains agents together, building context progressively: async def run_sequential_workflow( question: str, docs: LoadedDocuments, conn: FoundryConnection, ) -> WorkflowResult: wf = WorkflowResult(question=question) doc_block = docs.combined_text if docs.chunks else "(no documents provided)" # Step 1 — Plan planner = create_planner(conn) planner_prompt = f"User question: {question}\n\nAvailable documents:\n{doc_block}" plan_text, elapsed = await _run_agent(planner, planner_prompt) wf.steps.append(StepResult("Planner", planner_prompt, plan_text, elapsed)) # Step 2 — Retrieve retriever = create_retriever(conn) retriever_prompt = f"Plan:\n{plan_text}\n\nDocuments:\n{doc_block}" snippets_text, elapsed = await _run_agent(retriever, retriever_prompt) wf.steps.append(StepResult("Retriever", retriever_prompt, snippets_text, elapsed)) # Step 3 — Critique critic = create_critic(conn) critic_prompt = f"Plan:\n{plan_text}\n\nExtracted snippets:\n{snippets_text}" critique_text, elapsed = await _run_agent(critic, critic_prompt) wf.steps.append(StepResult("Critic", critic_prompt, critique_text, elapsed)) # Step 4 — Write writer = create_writer(conn) writer_prompt = ( f"Original question: {question}\n\n" f"Plan:\n{plan_text}\n\n" f"Extracted snippets:\n{snippets_text}\n\n" f"Critic review:\n{critique_text}" ) report_text, elapsed = await _run_agent(writer, writer_prompt) wf.steps.append(StepResult("Writer", writer_prompt, report_text, elapsed)) wf.final_report = report_text return wf Each step receives all relevant context from previous steps. The Writer gets the most comprehensive prompt—original question, plan, snippets, and critique—enabling it to produce a well-informed final report. Adding Concurrent Fan-Out and Feedback Loops Sequential orchestration works well but can be slow. When tasks are independent—neither needs the other's output—running them in parallel saves time. The demo implements this with asyncio.gather . Consider the Retriever and ToolAgent: both need the Planner's output, but neither depends on the other. Running them concurrently cuts the wait time roughly in half: async def run_concurrent_retrieval( plan_text: str, docs: LoadedDocuments, conn: FoundryConnection, ) -> tuple[str, str]: """Run Retriever and ToolAgent in parallel.""" retriever = create_retriever(conn) tool_agent = create_tool_agent(conn) doc_block = docs.combined_text if docs.chunks else "(no documents)" retriever_prompt = f"Plan:\n{plan_text}\n\nDocuments:\n{doc_block}" tool_prompt = f"Analyse the following documents for word count and keywords:\n{doc_block}" # Execute both agents concurrently (snippets_text, r_elapsed), (tool_text, t_elapsed) = await asyncio.gather( _run_agent(retriever, retriever_prompt), _run_agent(tool_agent, tool_prompt), ) return snippets_text, tool_text The asyncio.gather function runs both coroutines concurrently and returns when both complete. If the Retriever takes 3 seconds and the ToolAgent takes 1.5 seconds, the total wait is approximately 3 seconds rather than 4.5 seconds. Implementing the Feedback Loop The most sophisticated orchestration pattern is the Critic–Retriever feedback loop. When the Critic identifies gaps in the retrieved information, the orchestrator sends them back to the Retriever for additional retrieval, then re-evaluates: async def run_critic_with_feedback( plan_text: str, snippets_text: str, docs: LoadedDocuments, conn: FoundryConnection, max_iterations: int = 2, ) -> tuple[str, str]: """ Run Critic with feedback loop to Retriever. Returns (final_snippets, final_critique). """ critic = create_critic(conn) retriever = create_retriever(conn) current_snippets = snippets_text for iteration in range(max_iterations): # Run Critic critic_prompt = f"Plan:\n{plan_text}\n\nExtracted snippets:\n{current_snippets}" critique_text, _ = await _run_agent(critic, critic_prompt) # Check if gaps were found if not critique_text.upper().startswith("GAPS FOUND"): return current_snippets, critique_text # Gaps found — send back to Retriever for more extraction gap_fill_prompt = ( f"Previous snippets:\n{current_snippets}\n\n" f"Gaps identified:\n{critique_text}\n\n" f"Documents:\n{docs.combined_text}\n\n" "Extract additional relevant passages to fill these gaps." ) additional_snippets, _ = await _run_agent(retriever, gap_fill_prompt) current_snippets = f"{current_snippets}\n\n--- Gap-fill iteration {iteration + 1} ---\n{additional_snippets}" # Max iterations reached — run final critique final_critique, _ = await _run_agent(critic, f"Plan:\n{plan_text}\n\nExtracted snippets:\n{current_snippets}") return current_snippets, final_critique This feedback loop pattern significantly improves output quality. The Critic acts as a quality gate, and when standards aren't met, the system iteratively improves rather than producing incomplete results. The full workflow combines all three patterns—sequential where dependencies require it, concurrent where independence allows it, and feedback loops for quality assurance: async def run_full_workflow( question: str, docs: LoadedDocuments, conn: FoundryConnection, ) -> WorkflowResult: """ End-to-end workflow showcasing THREE orchestration patterns: 1. Planner runs first (sequential — must happen before anything else). 2. Retriever + ToolAgent run concurrently (fan-out on independent tasks). 3. Critic reviews with feedback loop (iterates with Retriever if gaps found). 4. Writer produces final report (sequential — needs everything above). """ wf = WorkflowResult(question=question) # Step 1: Planner (sequential) plan_text, elapsed = await _run_agent(create_planner(conn), planner_prompt) wf.steps.append(StepResult("Planner", planner_prompt, plan_text, elapsed)) # Step 2: Concurrent fan-out (Retriever + ToolAgent) snippets_text, tool_text = await run_concurrent_retrieval(plan_text, docs, conn) # Step 3: Critic with feedback loop final_snippets, critique_text = await run_critic_with_feedback( plan_text, snippets_text, docs, conn ) # Step 4: Writer (sequential — needs everything) writer_prompt = ( f"Original question: {question}\n\n" f"Plan:\n{plan_text}\n\n" f"Snippets:\n{final_snippets}\n\n" f"Stats:\n{tool_text}\n\n" f"Critique:\n{critique_text}" ) report_text, elapsed = await _run_agent(create_writer(conn), writer_prompt) wf.final_report = report_text return wf This hybrid approach maximises both correctness and performance. Dependencies are respected, independent work happens in parallel, and quality is ensured through iterative feedback. Implementing Tool Calling Some agents benefit from deterministic tools rather than relying entirely on LLM generation. The ToolAgent demonstrates this pattern with two utility functions: word counting and keyword extraction. MAF supports tool calling through function declarations with Pydantic type annotations: from typing import Annotated from pydantic import Field def word_count( text: Annotated[str, Field(description="The text to count words in")] ) -> int: """Count words in a text string.""" return len(text.split()) def extract_keywords( text: Annotated[str, Field(description="The text to extract keywords from")], top_n: Annotated[int, Field(description="Number of keywords to return", default=5)] ) -> list[str]: """Extract most frequent words (simple implementation).""" words = text.lower().split() # Filter common words, count frequencies, return top N word_counts = {} for word in words: if len(word) > 3: # Skip short words word_counts[word] = word_counts.get(word, 0) + 1 sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True) return [word for word, count in sorted_words[:top_n]] The Annotated type with Field descriptions provides metadata that MAF uses to generate function schemas for the LLM. When the model needs to count words, it invokes the word_count tool rather than attempting to count in its response (which LLMs notoriously struggle with). The ToolAgent receives these functions in its constructor: def create_tool_agent(conn: FoundryConnection) -> ChatAgent: return ChatAgent( chat_client=_make_client(conn), name="ToolHelper", instructions=( "You are a utility agent. Use the provided tools to compute " "word counts or extract keywords when asked. Return the tool " "output directly — do not embellish." ), tools=[word_count, extract_keywords], ) This pattern—combining LLM reasoning with deterministic tools—produces more reliable results. The LLM decides when to use tools and how to interpret results, but the actual computation happens in Python where precision is guaranteed. Running the Demo With the architecture explained, here's how to run the demo yourself. Setup takes about five minutes. Prerequisites You'll need Python 3.10 or higher and Foundry Local installed on your machine. Install Foundry Local by following the instructions at github.com/microsoft/Foundry-Local, then verify it works: foundry --help Installation Clone the repository and set up a virtual environment: git clone https://github.com/leestott/agentframework--foundrylocal.git cd agentframework--foundrylocal python -m venv .venv # Windows .venv\Scripts\activate # macOS / Linux source .venv/bin/activate pip install -r requirements.txt copy .env.example .env CLI Usage Run the research workflow from the command line: python -m src.app "What are the key features of Foundry Local and how does it compare to cloud inference?" --docs ./data You'll see agent-by-agent progress with timing information: Web Interface For a visual experience, launch the Flask-based web UI: python -m src.app.web Open http://localhost:5000 in your browser. The web UI provides real-time streaming of agent progress, a visual pipeline showing both orchestration patterns, and an interactive demos tab showcasing tool calling capabilities. CLI Options The CLI supports several options for customisation: --docs: Folder of local documents to search (default: ./data) --model: Foundry Local model alias (default: qwen2.5-0.5b) --mode: full for sequential + concurrent, or sequential for simpler pipeline --log-level: DEBUG, INFO, WARNING, or ERROR For higher quality output, try larger models: python -m src.app "Explain multi-agent benefits" --docs ./data --model qwen2.5-7b Validate Tool/Function Calling Run the dedicated tool calling demo to verify function calling works: python -m src.app.tool_demo This tests direct tool function calls ( word_count , extract_keywords ), LLM-driven tool calling via the ToolAgent, and multi-tool requests in a single prompt. Run Tests Run the smoke tests to verify your setup: pip install pytest pytest-asyncio pytest tests/ -v The smoke tests check document loading, tool functions, and configuration—they do not require a running Foundry Local service. Interactive Demos: Exploring MAF Capabilities Beyond the research workflow, the web UI includes five interactive demos showcasing different MAF capabilities. Each demonstrates a specific pattern with suggested prompts and real-time results. Weather Tools demonstrates multi-tool calling with an agent that provides weather information, forecasts, city comparisons, and activity recommendations. The agent uses four different tools to construct comprehensive responses. Math Calculator shows precise calculation through tool calling. The agent uses arithmetic, percentage, unit conversion, compound interest, and statistics tools instead of attempting mental math—eliminating the calculation errors that plague LLM-only approaches. Sentiment Analyser performs structured text analysis, detecting sentiment, emotions, key phrases, and word frequency through lexicon-based tools. The results are deterministic and verifiable. Code Reviewer analyses code for style issues, complexity problems, potential bugs, and improvement opportunities. This demonstrates how tool calling can extend AI capabilities into domain-specific analysis. Multi-Agent Debate showcases sequential orchestration with interdependent outputs. Three agents—one arguing for a position, one against, and a moderator—debate a topic. Each agent receives the previous agent's output, demonstrating how multi-agent systems can explore topics from multiple perspectives. Troubleshooting Common issues and their solutions: foundry: command not found : Install Foundry Local from github.com/microsoft/Foundry-Local foundry-local-sdk is not installed : Run pip install foundry-local-sdk Model download is slow: First download can be large. It's cached for future runs. No documents found warning: Add .txt or .md files to the --docs folder Agent output is low quality: Try a larger model alias, e.g. --model phi-3.5-mini Web UI won't start: Ensure Flask is installed: pip install flask Port 5000 in use: The web UI uses port 5000. Stop other services or set PORT=8080 environment variable Key Takeaways Multi-agent systems decompose complex tasks: Specialised agents (Planner, Retriever, Critic, Writer) produce better results than single-agent approaches by focusing each agent on what it does best Local AI eliminates cloud dependencies: Foundry Local provides on-device inference with automatic hardware acceleration, keeping all data on your machine MAF simplifies agent development: The ChatAgent abstraction handles message threading, tool execution, and response parsing, letting you focus on agent behaviour Three orchestration patterns serve different needs: Sequential pipelines maintain dependencies; concurrent fan-out parallelises independent work; feedback loops enable iterative quality improvement Feedback loops improve quality: The Critic–Retriever feedback loop catches gaps and contradictions, iterating until quality standards are met rather than producing incomplete results Tool calling adds precision: Deterministic functions for counting, calculation, and analysis complement LLM reasoning for more reliable results The same patterns scale to production: This demo architecture—bootstrapping, agent creation, orchestration—applies directly to real-world research and analysis systems Conclusion and Next Steps The Local Research & Synthesis Desk demonstrates that sophisticated multi-agent AI systems don't require cloud infrastructure. With Microsoft Agent Framework for orchestration and Foundry Local for inference, you can build production-quality workflows that run entirely on your hardware. The architecture patterns shown here—specialised agents with clear roles, sequential pipelines for dependent tasks, concurrent fan-out for independent work, feedback loops for quality assurance, and tool calling for precision—form a foundation for building more sophisticated systems. Consider extending this demo with: Additional agents for fact-checking, summarisation, or domain-specific analysis Richer tool integrations connecting to databases, APIs, or local services Human-in-the-loop approval gates before producing final reports Different model sizes for different agents based on task complexity Start with the demo, understand the patterns, then apply them to your own research and analysis challenges. The future of AI isn't just cloud models—it's intelligent systems that run wherever your data lives. Resources Local Research & Synthesis Desk Repository – Full source code with documentation and examples Foundry Local – Official site for on-device AI inference Foundry Local GitHub Repository – Installation instructions and CLI reference Foundry Local SDK Documentation – Python SDK reference on Microsoft Learn Microsoft Agent Framework Documentation – Official MAF tutorials and user guides MAF Orchestrations Overview – Deep dive into workflow patterns agent-framework-core on PyPI – Python package for MAF Agent Framework Samples – Additional MAF examples and patterns278Views2likes2CommentsDeploying Custom Models with Microsoft Olive and Foundry Local
Over the past few weeks, we've been on quite a journey together. We started by exploring what makes Phi-4 and small language models so compelling, then got our hands dirty running models locally with Foundry Local. We leveled up with function calling, and most recently built a complete multi-agent quiz application with an orchestrator coordinating specialist agents. Our quiz app works great locally, but it relies on Foundry Local's catalog models — pre-optimized and ready to go. What happens when you want to deploy a model that isn't in the catalog? Maybe you've fine-tuned a model on domain-specific quiz data, or a new model just dropped on Hugging Face that you want to use. Today we'll take a model from Hugging Face, optimize it with Microsoft Olive, register it with Foundry Local, and run our quiz app against it. The same workflow applies to any model you might fine-tune for your specific use case. Understanding Deployment Options Before we dive in, let's understand the landscape of deployment options for SLM applications. There are several routes to deploying SLM applications depending on your target environment. The Three Main Paths vLLM is the industry standard for cloud deployments — containerized, scalable, handles many concurrent users. Great for Azure VMs or Kubernetes. Ollama offers a middle ground — simpler than vLLM but still provides Docker support for easy sharing and deployment. Foundry Local + Olive is Microsoft's edge-first approach. Optimize your model with Olive, serve with Foundry Local or a custom server. Perfect for on-premise, offline, or privacy-focused deployments. In keeping with the edge-first theme that's run through this series, we'll focus on the Foundry Local path. We'll use Qwen 2.5-0.5B-Instruct — small enough to optimize quickly and demonstrate the full workflow. Think of it as a stand-in for a model you've fine-tuned on your own quiz data. Prerequisites You'll need: Foundry Local version 0.8.117 or later Python 3.10+ for the quiz app (the foundry-local-sdk requires it) A separate Python 3.9 environment for Olive (Olive 0.9.x has this requirement) The quiz app from the previous article Having two Python versions might seem odd, but it mirrors a common real-world setup: you optimize models in one environment and serve them in another. The optimization is a one-time step. Installing Olive Dependencies In your Python 3.9 environment: pip install olive-ai onnxruntime onnxruntime-genai pip install transformers>=4.45.0,<5.0.0 Important: Olive is not compatible with Transformers 5.x. You must use version 4.x. Model Optimization with Olive Microsoft Olive is the bridge between a Hugging Face model and something Foundry Local can serve. It handles ONNX conversion, graph optimization, and quantization in a single command. Understanding Quantization Quantization reduces model size by converting weights from high-precision floating point to lower-precision integers: Precision Size Reduction Quality Best For FP32 Baseline Best Development, debugging FP16 50% smaller Excellent GPU inference with plenty of VRAM INT8 75% smaller Very Good Balanced production INT4 87.5% smaller Good Edge devices, resource-constrained We'll use INT4 to demonstrate the maximum compression. For production with better quality, consider INT8 — simply change --precision int4 to --precision int8 in the commands below. Running the Optimization The optimization script at scripts/optimize_model.py handles two things: downloading the model locally (to avoid authentication issues), then running Olive. The download step is important. The ONNX Runtime GenAI model builder internally requests HuggingFace authentication even for public models. Rather than configuring tokens, we download the model first with token=False, then point Olive at the local path: from huggingface_hub import snapshot_download local_path = snapshot_download("Qwen/Qwen2.5-0.5B-Instruct", token=False) Then the Olive command runs against the local copy: cmd = [ sys.executable, "-m", "olive", "auto-opt", "--model_name_or_path", local_path, "--trust_remote_code", "--output_path", "models/qwen2.5-0.5b-int4", "--device", "cpu", "--provider", "CPUExecutionProvider", "--precision", "int4", "--use_model_builder", "--use_ort_genai", "--log_level", "1", ] Key flags: --precision int4 quantizes weights to 4-bit integers, --use_model_builder reads each transformer layer and exports it to ONNX, and --use_ort_genai outputs in the format Foundry Local consumes. Run it: python scripts/optimize_model.py This process takes about a minute. When complete, you'll see the output directory structure. models/qwen2.5-0.5b-int4/model/ ├── model.onnx # ONNX graph (162 KB) ├── model.onnx.data # Quantized INT4 weights (823 MB) ├── genai_config.json # ONNX Runtime GenAI config ├── tokenizer.json # Tokenizer vocabulary (11 MB) ├── vocab.json # Token-to-ID map (2.7 MB) ├── merges.txt # BPE merges (1.6 MB) ├── tokenizer_config.json ├── config.json ├── generation_config.json ├── special_tokens_map.json └── added_tokens.json Total size: approximately 838MB — a significant reduction from the original, while maintaining usable quality for structured tasks like quiz generation. Registering with Foundry Local With the model optimized, we need to register it with Foundry Local. Unlike cloud model registries, there's no CLI command — you place files in the right directory and Foundry discovers them automatically. Foundry's Model Registry foundry cache cd # Windows: C:\Users\<username>\.foundry\cache\ # macOS/Linux: ~/.foundry/cache/ Foundry organizes models by publisher: .foundry/cache/models/ ├── foundry.modelinfo.json ← catalog of official models ├── Microsoft/ ← pre-optimized Microsoft models │ ├── qwen2.5-7b-instruct-cuda-gpu-4/ │ ├── Phi-4-cuda-gpu-1/ │ └── ... └── Custom/ ← your models go here The Registration Script The script at scripts/register_model.sh does two things: copies all model files into the Foundry cache, and creates the inference_model.json configuration file. The critical file is inference_model.json — without it, Foundry won't recognize your model: { "Name": "qwen-quiz-int4", "PromptTemplate": { "system": "<|im_start|>system\n{Content}<|im_end|>", "user": "<|im_start|>user\n{Content}<|im_end|>", "assistant": "<|im_start|>assistant\n{Content}<|im_end|>", "prompt": "<|im_start|>user\n{Content}<|im_end|>\n<|im_start|>assistant" } } The PromptTemplate defines the ChatML format that Qwen 2.5 expects. The {Content} placeholder is where Foundry injects the actual message content at runtime. If you were deploying a Llama or Phi model, you'd use their respective prompt templates. Run the registration: scripts/register_model.sh Verify Registration foundry cache ls Test the Model foundry model run qwen-quiz-int4 The model loads via ONNX Runtime on CPU. Try a simple prompt to verify it responds. Integrating with the Quiz App Here's where things get interesting. The application-level change is one line in utils/foundry_client.py: # Before: DEFAULT_MODEL_ALIAS = "qwen2.5-7b-instruct-cuda-gpu" # After: DEFAULT_MODEL_ALIAS = "qwen-quiz-int4" But that one line raised some issues worth understanding. Issue 1: The SDK Can't See Custom Models The Foundry Local Python SDK resolves models by looking them up in the official catalog — a JSON file of Microsoft-published models. Custom models in the Custom/ directory aren't in that catalog. So FoundryLocalManager("qwen-quiz-int4") throws a "model not found" error, despite foundry cache ls and foundry model run both working perfectly. The fix in foundry_client.py is a dual code path. It tries the SDK first (works for catalog models), and when that fails with a "not found in catalog" error, it falls back to discovering the running service endpoint directly: def _discover_endpoint(): """Discover running Foundry service endpoint via CLI.""" result = subprocess.run( ["foundry", "service", "status"], capture_output=True, text=True, timeout=10 ) match = re.search(r"(http://\S+?)(?:/openai)?/status", result.stdout) if not match: raise ConnectionError( "Foundry service is not running.\n" f"Start it with: foundry model run {DEFAULT_MODEL_ALIAS}" ) return match.group(1) The workflow becomes two terminals: Terminal 1: foundry model run qwen-quiz-int4 Terminal 2: python main.py The client auto-discovers the endpoint and connects. For catalog models, the existing FoundryLocalManager path works unchanged. Issue 2: Tool Calling Format For catalog models, Foundry's server-side middleware intercepts <tool_call> tags in the model's output and converts them into structured tool_calls objects in the API response. This is configured via metadata in foundry.modelinfo.json. For custom models, those metadata fields aren't recognized — Foundry ignores them in inference_model.json. The <tool_call> tags pass through as raw text in response.choices[0].message.content. Since our custom model outputs the exact same <tool_call> format, we added a small fallback parser in agents/base_agent.py — the same pattern we explored in our function calling article. After each model response, if tool_calls is None, we scan the content for tags: def _parse_text_tool_calls(content: str) -> list: """Parse <tool_call>...</tool_call> tags from model output.""" blocks = re.findall(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", content, re.DOTALL) calls = [] for block in blocks: try: data = json.loads(block) calls.append(_TextToolCall(data["name"], json.dumps(data.get("arguments", {})))) except (json.JSONDecodeError, KeyError): continue return calls The model's behavior is identical; only the parsing location changes — from server-side (Foundry middleware) to client-side (our code). Part 7: Testing the Deployment With the model running in one terminal, start the quiz app in another: Terminal 1: foundry model run qwen-quiz-int4 Terminal 2: cd multi_agents_slm && python main.py Now test the full flow. Generate a quiz: Test the Full Flow Generate a quiz: Example output: The orchestrator successfully calls the generate_new_quiz tool, and the QuizGeneratorAgent produces well-structured quiz JSON. Model Limitations The 0.5B INT4 model occasionally struggles with complex reasoning or basic arithmetic. This is expected from such a small, heavily quantized model. For production use cases requiring higher accuracy, use Qwen 2.5-1.5B or Qwen 2.5-7B for better quality, or use INT8 quantization instead of INT4. The deployment workflow remains identical — just change the model name and precision in the optimization script. What You've Accomplished Take a moment to appreciate the complete journey across this series: Article What You Learned 1. Phi-4 Introduction Why SLMs matter, performance vs size tradeoffs 2. Running Locally Foundry Local setup, basic inference 3. Function Calling Tool use, external API integration 4. Multi-Agent Systems Orchestration, specialist agents 5. Deployment Olive optimization, Foundry Local registration, custom model deployment You now have end-to-end skills for building production SLM applications: understanding the landscape, local development with Foundry Local, agentic applications with function calling, multi-agent architectures, model optimization with Olive, and deploying custom models to the edge. Where to Go From Here The logical next step is fine-tuning for your domain. Medical quiz tutors trained on USMLE questions, legal assistants trained on case law, company onboarding bots trained on internal documentation — use the same Olive workflow to optimize and deploy your fine-tuned model. The same ONNX model we registered with Foundry Local could also run on mobile devices via ONNX Runtime Mobile, or be containerized for server-side edge deployment. The full source code, including the optimization and registration scripts, is available in the GitHub repository. Resources: Microsoft Olive — Model optimization toolkit Foundry Local Documentation — Setup and CLI reference Compiling Hugging Face models for Foundry Local — Official guide ONNX Runtime GenAI — Powers Foundry Local's inference Edge AI for Beginners — Microsoft's 8-module Edge AI curriculum Quiz App Source Code — Full repository with deployment scripts This series has been a joy to write. I'd love to see what you build — share your projects in the comments, and don't hesitate to open issues on the GitHub repo if you encounter challenges. Until next time — keep building, keep optimizing, and keep pushing what's possible with local AI.195Views0likes0CommentsAdvanced Function Calling and Multi-Agent Systems with Small Language Models in Foundry Local
Advanced Function Calling and Multi-Agent Systems with Small Language Models in Foundry Local In our previous exploration of function calling with Small Language Models, we demonstrated how to enable local SLMs to interact with external tools using a text-parsing approach with regex patterns. While that method worked, it required manual extraction of function calls from the model's output; functional but fragile. Today, I'm excited to show you something far more powerful: Foundry Local now supports native OpenAI-compatible function calling with select models. This update transforms how we build agentic AI systems locally, making it remarkably straightforward to create sophisticated multi-agent architectures that rival cloud-based solutions. What once required careful prompt engineering and brittle parsing now works seamlessly through standardized API calls. We'll build a complete multi-agent quiz application that demonstrates both the elegance of modern function calling and the power of coordinated agent systems. The full source code is available in this GitHub repository, but rather than walking through every line of code, we'll focus on how the pieces work together and what you'll see when you run it. What's New: Native Function Calling in Foundry Local As we explored in our guide to running Phi-4 locally with Foundry Local, we ran powerful language models on our local machine. The latest version now support native function calling for models specifically trained with this capability. The key difference is architectural. In our weather assistant example, we manually parsed JSON strings from the model's text output using regex patterns and frankly speaking, meticulously testing and tweaking the system prompt for the umpteenth time 🙄. Now, when you provide tool definitions to supported models, they return structured tool_calls objects that you can directly execute. Currently, this native function calling capability is available for the Qwen 2.5 family of models in Foundry Local. For this tutorial, we're using the 7B variant, which strikes a great balance between capability and resource requirements. Quick Setup Getting started requires just a few steps. First, ensure you have Foundry Local installed. On Windows, use winget install Microsoft.FoundryLocal , and on macOS, use bash brew install microsoft/foundrylocal/foundrylocal You'll need version 0.8.117 or later. Install the Python dependencies in the requirements file, then start your model. The first run will download approximately 4GB: foundry model run qwen2.5-7b-instruct-cuda-gpu If you don't have a compatible GPU, use the CPU version instead, or you can specify any other Qwen 2.5 variant that suits your hardware. I have set a DEFAULT_MODEL_ALIAS variable you can modify to use different models in utils/foundry_client.py file. Keep this terminal window open. The model needs to stay running while you develop and test your application. Understanding the Architecture Before we dive into running the application, let's understand what we're building. Our quiz system follows a multi-agent architecture where specialized agents handle distinct responsibilities, coordinated by a central orchestrator. The flow works like this: when you ask the system to generate a quiz about photosynthesis, the orchestrator agent receives your message, understands your intent, and decides which tool to invoke. It doesn't try to generate the quiz itself, instead, it calls a tool that creates a specialist QuizGeneratorAgent focused solely on producing well-structured quiz questions. Then there's another agent, reviewAgent, that reviews the quiz with you. The project structure reflects this architecture: quiz_app/ ├── agents/ # Base agent + specialist agents ├── tools/ # Tool functions the orchestrator can call ├── utils/ # Foundry client connection ├── data/ ├── quizzes/ # Generated quiz JSON files │── responses/ # User response JSON files └── main.py # Application entry point The orchestrator coordinates three main tools: generate_new_quiz, launch_quiz_interface, and review_quiz_interface. Each tool either creates a specialist agent or launches an interactive interface (Gradio), handling the complexity so the orchestrator can focus on routing and coordination. How Native Function Calling Works When you initialize the orchestrator agent in main.py, you provide two things: tool schemas that describe your functions to the model, and a mapping of function names to actual Python functions. The schemas follow the OpenAI function calling specification, describing each tool's purpose, parameters, and when it should be used. Here's what happens when you send a message to the orchestrator: The agent calls the model with your message and the tool schemas. If the model determines a tool is needed, it returns a structured tool_calls attribute containing the function name and arguments as a proper object—not as text to be parsed. Your code executes the tool, creates a message with "role": "tool" containing the result, and sends everything back to the model. The model can then either call another tool or provide its final response. The critical insight is that the model itself controls this flow through a while loop in the base agent. Each iteration represents the model examining the current state, deciding whether it needs more information, and either proceeding with another tool call or providing its final answer. You're not manually orchestrating when tools get called; the model makes those decisions based on the conversation context. Seeing It In Action Let's walk through a complete session to see how these pieces work together. When you run python main.py, you'll see the application connect to Foundry Local and display a welcome banner: Now type a request like "Generate a 5 question quiz about photosynthesis." Watch what happens in your console: The orchestrator recognized your intent, selected the generate_new_quiz tool, and extracted the topic and number of questions from your natural language request. Behind the scenes, this tool instantiated a QuizGeneratorAgent with a focused system prompt designed specifically for creating quiz JSON. The agent used a low temperature setting to ensure consistent formatting and generated questions that were saved to the data/quizzes folder. This demonstrates the first layer of the multi-agent architecture: the orchestrator doesn't generate quizzes itself. It recognizes that this task requires specialized knowledge about quiz structure and delegates to an agent built specifically for that purpose. Now request to take the quiz by typing "Take the quiz." The orchestrator calls a different tool and Gradio server is launched. Click the link to open in a browser window displaying your quiz questions. This tool demonstrates how function calling can trigger complex interactions—it reads the quiz JSON, dynamically builds a user interface with radio buttons for each question, and handles the submission flow. After you answer the questions and click submit, the interface saves your responses to the data/responses folder and closes the Gradio server. The orchestrator reports completion: The system now has two JSON files: one containing the quiz questions with correct answers, and another containing your responses. This separation of concerns is important—the quiz generation phase doesn't need to know about response collection, and the response collection doesn't need to know how quizzes are created. Each component has a single, well-defined responsibility. Now request a review. The orchestrator calls the third tool: A new chat interface opens, and here's where the multi-agent architecture really shines. The ReviewAgent is instantiated with full context about both the quiz questions and your answers. Its system prompt includes a formatted view of each question, the correct answer, your answer, and whether you got it right. This means when the interface opens, you immediately see personalized feedback: The Multi-Agent Pattern Multi-agent architectures solve complex problems by coordinating specialized agents rather than building monolithic systems. This pattern is particularly powerful for local SLMs. A coordinator agent routes tasks to specialists, each optimized for narrow domains with focused system prompts and specific temperature settings. You can use a 1.7B model for structured data generation, a 7B model for conversations, and a 4B model for reasoning, all orchestrated by a lightweight coordinator. This is more efficient than requiring one massive model for everything. Foundry Local's native function calling makes this straightforward. The coordinator reliably invokes tools that instantiate specialists, with structured responses flowing back through proper tool messages. The model manages the coordination loop—deciding when it needs another specialist, when it has enough information, and when to provide a final answer. In our quiz application, the orchestrator routes user requests but never tries to be an expert in quiz generation, interface design, or tutoring. The QuizGeneratorAgent focuses solely on creating well-structured quiz JSON using constrained prompts and low temperature. The ReviewAgent handles open-ended educational dialogue with embedded quiz context and higher temperature for natural conversation. The tools abstract away file management, interface launching, and agent instantiation, the orchestrator just knows "this tool launches quizzes" without needing implementation details. This pattern scales effortlessly. If you wanted to add a new capability like study guides or flashcards, you could just easily create a new tool or specialists. The orchestrator gains these capabilities automatically by having the tool schemas you have defined without modifying core logic. This same pattern powers production systems with dozens of specialists handling retrieval, reasoning, execution, and monitoring, each excelling in its domain while the coordinator ensures seamless collaboration. Why This Matters The transition from text-parsing to native function calling enables a fundamentally different approach to building AI applications. With text parsing, you're constantly fighting against the unpredictability of natural language output. A model might decide to explain why it's calling a function before outputting the JSON, or it might format the JSON slightly differently than your regex expects, or it might wrap it in markdown code fences. Native function calling eliminates this entire class of problems. The model is trained to output tool calls as structured data, separate from its conversational responses. The multi-agent aspect builds on this foundation. Because function calling is reliable, you can confidently delegate to specialist agents knowing they'll integrate smoothly with the orchestrator. You can chain tool calls—the orchestrator might generate a quiz, then immediately launch the interface to take it, based on a single user request like "Create and give me a quiz about machine learning." The model handles this orchestration intelligently because the tool results flow back as structured data it can reason about. Running everything locally through Foundry Local adds another dimension of value and I am genuinely excited about this (hopefully, the phi models get this functionality soon). You can experiment freely, iterate quickly, and deploy solutions that run entirely on your infrastructure. For educational applications like our quiz system, this means students can interact with the AI tutor as much as they need without cost concerns. Getting Started With Your Own Multi-Agent System The complete code for this quiz application is available in the GitHub repository, and I encourage you to clone it and experiment. Try modifying the tool schemas to see how the orchestrator's behavior changes. Add a new specialist agent for a different task. Adjust the system prompts to see how agent personalities and capabilities shift. Think about the problems you're trying to solve. Could they benefit from having different specialists handling different aspects? A customer service system might have agents for order lookup, refund processing, and product recommendations. A research assistant might have agents for web search, document summarization, and citation formatting. A coding assistant might have agents for code generation, testing, and documentation. Start small, perhaps with two or three specialist agents for a specific domain. Watch how the orchestrator learns to route between them based on the tool descriptions you provide. You'll quickly see opportunities to add more specialists, refine the existing ones, and build increasingly sophisticated systems that leverage the unique strengths of each agent while presenting a unified, intelligent interface to your users. In the next entry, we will be deploying our quizz app which will mark the end of our journey in Foundry and SLMs these past few weeks. I hope you are as excited as I am! Thanks for reading.291Views0likes0CommentsFunction Calling with Small Language Models
In our previous article on running Phi-4 locally, we built a web-enhanced assistant that could search the internet and provide informed answers. Here's what that implementation looked like: def web_enhanced_query(question): # 1. ALWAYS search (hardcoded decision) search_results = search_web(question) # 2. Inject results into prompt prompt = f"""Here are recent search results: {search_results} Question: {question} Using only the information above, give a clear answer.""" # 3. Model just summarizes what it reads return ask_phi4(endpoint, model_id, prompt) Today, we're upgrading to true function calling. With this, we have ability to transform small language models from passive text generators into intelligent agents that can: Decide when to use external tools Reason which tool bests fit each task Execute real-world actions thrugh apis Function calling represents a significant evolution in AI capabilities. Let's understand where this positions our small language models: Agent Classification Framework Simple Reflex Agents (Basic) React to immediate input with predefined rules Example: Thermostat, basic chatbot Without function calling, models operate here Model-Based Agents (Intermediate) Maintain internal state and context Example: Robot vacuum with room mapping Function calling enables this level Goal-Based Agents (Advanced) Plan multi-step sequences to achieve objectives Example: Route planner, task scheduler Function calling + reasoning enables this Learning Agents (Expert) Adapt and improve over time Example: Recommendation systems Future: Function calling + fine-tuning As usual with these articles, let's get ready to get our hands dirty! Project Setup Let's set up our environment for building function-calling assistants. Prerequisites First, ensure you have Foundry Local installed and a model running. We'll use Qwen 2.5-7B for this tutorial as it has excellent function calling support. Important: Not all small language models support function calling equally. Qwen 2.5 was specifically trained for this capability and provides a reliable experience through Foundry Local. # 1. Check Foundry Local is installed foundry --version # 2. Start the Foundry Local service foundry service start # 3. Download and run Qwen 2.5-7B foundry model run qwen2.5-7b Python Environment Setup # 1. Create Python virtual environment python -m venv venv source venv/bin/activate # Windows: venv\Scripts\activate # 2. Install dependencies pip install openai requests python-dotenv # 3. Get a free OpenWeatherMap API key # Sign up at: https://openweathermap.org/api ``` Create `.env` file: ``` OPENWEATHER_API_KEY=your_api_key_here ``` Building a Weather-Aware Assistant So in this scenario, a user wants to plan outdoor activities but needs weather context. Without function calling, You will get something like this: User: "Should I schedule my team lunch outside at 2pm in Birmingham?" Model: "That depends on weather conditions. Please check the forecast for rain and temperature." However, with fucntion-calling you get an answer that is able to look up the weather and reply with the needed context. We will do that now. Understanding Foundry Local's Function Calling Implementation Before we start coding, there's an important implementation detail to understand. Foundry Local uses a non-standard function calling format. Instead of returning function calls in the standard OpenAI tool_calls field, Qwen models return the function call as JSON text in the response content. For example, when you ask about weather, instead of: # Standard OpenAI format message.tool_calls = [ {"name": "get_weather", "arguments": {"location": "Birmingham"}} ] You get: # Foundry Local format message.content = '{"name": "get_weather", "arguments": {"location": "Birmingham"}}' This means we need to parse the JSON from the content ourselves. Don't worry—this is straightforward, and I'll show you exactly how to handle it! Step 1: Define the Weather Tool Create weather_assistant.py: import os from openai import OpenAI import requests import json import re from dotenv import load_dotenv load_dotenv() # Initialize Foundry Local client client = OpenAI( base_url="http://127.0.0.1:59752/v1/", api_key="not-needed" ) # Define weather tool tools = [ { "type": "function", "function": { "name": "get_weather", "description": "Get current weather information for a location", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "The city or location name" }, "units": { "type": "string", "description": "Temperature units", "enum": ["celsius", "fahrenheit"], "default": "celsius" } }, "required": ["location"] } } } ] A tool is necessary because it provides the model with a structured specification of what external functions are available and how to use them. The tool definition contains the function name, description, parameters schema, and information returned. Step 2: Implement the Weather Function def get_weather(location: str, units: str = "celsius") -> dict: """Fetch weather data from OpenWeatherMap API""" api_key = os.getenv("OPENWEATHER_API_KEY") url = "http://api.openweathermap.org/data/2.5/weather" params = { "q": location, "appid": api_key, "units": "metric" if units == "celsius" else "imperial" } response = requests.get(url, params=params, timeout=5) response.raise_for_status() data = response.json() temp_unit = "°C" if units == "celsius" else "°F" return { "location": data["name"], "temperature": f"{round(data['main']['temp'])}{temp_unit}", "feels_like": f"{round(data['main']['feels_like'])}{temp_unit}", "conditions": data["weather"][0]["description"], "humidity": f"{data['main']['humidity']}%", "wind_speed": f"{round(data['wind']['speed'] * 3.6)} km/h" } The model calls this function to get the weather data. it contacts OpenWeatherMap API, gets real weather data and returns it as a python dictionary Step 3: Parse Function Calls from Content This is the crucial step where we handle Foundry Local's non-standard format: def parse_function_call(content: str): """Extract function call JSON from model response""" if not content: return None json_pattern = r'\{"name":\s*"get_weather",\s*"arguments":\s*\{[^}]+\}\}' match = re.search(json_pattern, content) if match: try: return json.loads(match.group()) except json.JSONDecodeError: pass try: parsed = json.loads(content.strip()) if isinstance(parsed, dict) and "name" in parsed: return parsed except json.JSONDecodeError: pass return None Step 4: Main Chat Function with Function Calling and lastly, calling the model. Notice the tools and tool_choice parameter. Tools tells the model it is allowed to output a tool_call requesting that the function be executed. While tool_choice instructs the model how to decide whether to call a tool. def chat(user_message: str) -> str: """Process user message with function calling support""" messages = [ {"role": "user", "content": user_message} ] response = client.chat.completions.create( model="qwen2.5-7b-instruct-generic-cpu:4", messages=messages, tools=tools, tool_choice="auto", temperature=0.3, max_tokens=500 ) message = response.choices[0].message if message.content: function_call = parse_function_call(message.content) if function_call and function_call.get("name") == "get_weather": print(f"\n[Function Call] {function_call.get('name')}({function_call.get('arguments')})") args = function_call.get("arguments", {}) weather_data = get_weather(**args) print(f"[Result] {weather_data}\n") final_prompt = f"""User asked: "{user_message}" Weather data: {json.dumps(weather_data, indent=2)} Provide a natural response based on this weather information.""" final_response = client.chat.completions.create( model="qwen2.5-7b-instruct-generic-cpu:4", messages=[{"role": "user", "content": final_prompt}], max_tokens=200, temperature=0.7 ) return final_response.choices[0].message.content return message.content Step 5: Run the script Now put all the above together and run the script def main(): """Interactive weather assistant""" print("\nWeather Assistant") print("=" * 50) print("Ask about weather or general questions.") print("Type 'exit' to quit\n") while True: user_input = input("You: ").strip() if user_input.lower() in ['exit', 'quit']: print("\nGoodbye!") break if user_input: response = chat(user_input) print(f"Assistant: {response}\n") if __name__ == "__main__": if not os.getenv("OPENWEATHER_API_KEY"): print("Error: OPENWEATHER_API_KEY not set") print("Set it with: export OPENWEATHER_API_KEY='your_key_here'") exit(1) main() Note: Make sure Qwen 2.5 is running in Foundry Local in a new terminal Now let's talk about Model Context Protocol! Our weather assistant works beautifully with a single function, but what happens when you need dozens of tools? Database queries, file operations, calendar integration, email—each would require similar setup code. This is where Model Context Protocol (MCP) comes in. MCP is an open standard that provides pre-built, standardized servers for common tools. Instead of writing custom integration code for every capability, you can connect to MCP servers that handle the complexity for you. With MCP, You only need one command to enable weather, database, and file access npx @modelcontextprotocol/server-weather npx @modelcontextprotocol/server-sqlite npx @modelcontextprotocol/server-filesystem Your model automatically discovers and uses these tools without custom integration code. Learn more: Model Context Protocol Documentation EdgeAI Course - Module 03: MCP Integration Key Takeaways Function calling transforms models into agents - From passive text generators to active problem-solvers Qwen 2.5 has excellent function calling support - Specifically trained for reliable tool use Foundry Local uses non-standard format - Parse JSON from content instead of tool_calls field Start simple, then scale with MCP - Build one tool to understand the pattern, then leverage standards Documentation Running Phi-4 Locally with Foundry Local Phi-4: Small Language Models That Pack a Punch Microsoft Foundry Local GitHub EdgeAI for Beginners Course OpenWeatherMap API Documentation Model Context Protocol Qwen 2.5 Documentation Thank you for reading! I hope this article helps you build more capable AI agents with small language models. Function calling opens up incredible possibilities—from simple weather assistants to complex multi-tool workflows. Start with one tool, understand the pattern, and scale from there.585Views1like0CommentsPrivyDoc: Building a Zero-Data-Leak AI with Foundry Local & Microsoft's Agent Framework
Tired of choosing between powerful AI insights and sacrificing your data's privacy? PrivyDoc offers a groundbreaking solution. In this article, Microsoft MVP in AI, Shivam Goyal, introduces his innovative project that brings robust AI document analysis directly to your local machine, ensuring zero data ever leaves your device. Discover how PrivyDoc leverages two cutting-edge Microsoft technologies: Foundry Local: The secret sauce for 100% on-device AI processing, allowing advanced models to run securely without cloud dependency. Microsoft Agent Framework: The intelligent orchestrator that builds a sophisticated multi-agent pipeline, handling everything from text extraction and entity recognition to summarization and sentiment analysis. Learn about PrivyDoc's intuitive web UI, its multi-format support, and crucial features that make it perfect for sensitive industries like legal, healthcare, and finance. Say goodbye to privacy concerns and hello to AI-powered document intelligence without compromise.404Views3likes0CommentsEdge AI for Beginners : Getting Started with Foundry Local
In Module 08 of the EdgeAI for Beginners course, Microsoft introduces Foundry Local a toolkit that helps you deploy and test Small Language Models (SLMs) completely offline. In this blog, I’ll share how I installed Foundry Local, ran the Phi-3.5-mini model on my windows laptop, and what I learned through the process. What Is Foundry Local? Foundry Local allows developers to run AI models locally on their own hardware. It supports text generation, summarization, and code completion — all without sending data to the cloud. Unlike cloud-based systems, everything happens on your computer, so your data never leaves your device. Prerequisites Before starting, make sure you have: Windows 10 or 11 Python 3.10 or newer Git Internet connection (for the first-time model download) Foundry Local installed Step 1 — Verify Installation After installing Foundry Local, open Command Prompt and type: foundry --version If you see a version number, Foundry Local is installed correctly. Step 2 — Start the Service Start the Foundry Local service using: foundry service start You should see a confirmation message that the service is running. Step 3 — List Available Models To view the models supported by your system, run: foundry model list You’ll get a list of locally available SLMs. Here’s what I saw on my machine: Note: Model availability depends on your device’s hardware. For most laptops, phi-3.5-mini works smoothly on CPU. Step 4 — Run the Phi-3.5 Model Now let’s start chatting with the model: foundry model run phi-3.5-mini-instruct-generic-cpu:1 Once it loads, you’ll enter an interactive chat mode. Try a simple prompt: Hello! What can you do? The model replies instantly — right from your laptop, no cloud needed. To exit, type: /exit How It Works Foundry Local loads the model weights from your device and performs inference locally.This means text generation happens using your CPU (or GPU, if available). The result: complete privacy, no internet dependency, and instant responses. Benefits for Students For students beginning their journey in AI, Foundry Local offers several key advantages: No need for high-end GPUs or expensive cloud subscriptions. Easy setup for experimenting with multiple models. Perfect for class assignments, AI workshops, and offline learning sessions. Promotes a deeper understanding of model behavior by allowing step-by-step local interaction. These factors make Foundry Local a practical choice for learning environments, especially in universities and research institutions where accessibility and affordability are important. Why Use Foundry Local Running models locally offers several practical benefits compared to using AI Foundry in the cloud. With Foundry Local, you do not need an internet connection, and all computations happen on your personal machine. This makes it faster for small models and more private since your data never leaves your device. In contrast, AI Foundry runs entirely on the cloud, requiring internet access and charging based on usage. For students and developers, Foundry Local is ideal for quick experiments, offline testing, and understanding how models behave in real-time. On the other hand, AI Foundry is better suited for large-scale or production-level scenarios where models need to be deployed at scale. In summary, Foundry Local provides a flexible and affordable environment for hands-on learning, especially when working with smaller models such as Phi-3, Qwen2.5, or TinyLlama. It allows you to experiment freely, learn efficiently, and better understand the fundamentals of Edge AI development. Optional: Restart Later Next time you open your laptop, you don’t have to reinstall anything. Just run these two commands again: foundry service start foundry model run phi-3.5-mini-instruct-generic-cpu:1 What I Learned Following the EdgeAI for Beginners Study Guide helped me understand: How edge AI applications work How small models like Phi 3.5 can run on a local machine How to test prompts and build chat apps with zero cloud usage Conclusion Running the Phi-3.5-mini model locally with Foundry Localgave me hands-on insight into edge AI. It’s an easy, private, and cost-free way to explore generative AI development. If you’re new to Edge AI, start with the EdgeAI for Beginners course and follow its Study Guide to get comfortable with local inference and small language models. Resources: EdgeAI for Beginners GitHub Repo Foundry Local Official Site Phi Model Link630Views1like0CommentsEdge AI for Student Developers: Learn to Run AI Locally
AI isn’t just for the cloud anymore. With the rise of Small Language Models (SLMs) and powerful local inference tools, developers can now run intelligent applications directly on laptops, phones, and edge devices—no internet required. If you're a student developer curious about building AI that works offline, privately, and fast, Microsoft’s Edge AI for Beginners course is your perfect starting point. What Is Edge AI? Edge AI refers to running AI models directly on local hardware—like your laptop, mobile device, or embedded system—without relying on cloud servers. This approach offers: ⚡ Real-time performance 🔒 Enhanced privacy (no data leaves your device) 🌐 Offline functionality 💸 Reduced cloud costs Whether you're building a chatbot that works without Wi-Fi or optimizing AI for low-power devices, Edge AI is the future of intelligent, responsive apps. About the Course Edge AI for Beginners is a free, open-source curriculum designed to help you: Understand the fundamentals of Edge AI and local inference Explore Small Language Models like Phi-2, Mistral-7B, and Gemma Deploy models using tools like Llama.cpp, Olive, MLX, and OpenVINO Build cross-platform apps that run AI locally on Windows, macOS, Linux, and mobile The course is hosted on GitHub and includes hands-on labs, quizzes, and real-world examples. You can fork it, remix it, and contribute to the community. What You’ll Learn Module Focus 01. Introduction What is Edge AI and why it matters 02. SLMs Overview of small language models 03. Deployment Running models locally with various tools 04. Optimization Speeding up inference and reducing memory 05. Applications Building real-world Edge AI apps Each module is beginner-friendly and includes practical exercises to help you build and deploy your own local AI solutions. Who Should Join? Student developers curious about AI beyond the cloud Hackathon participants looking to build offline-capable apps Makers and builders interested in privacy-first AI Anyone who wants to explore the future of on-device intelligence No prior AI experience required just a willingness to learn and experiment. Why It Matters Edge AI is a game-changer for developers. It enables smarter, faster, and more private applications that work anywhere. By learning how to deploy AI locally, you’ll gain skills that are increasingly in demand across industries—from healthcare to robotics to consumer tech. Plus, the course is: 💯 Free and open-source 🧠 Backed by Microsoft’s best practices 🧪 Hands-on and project-based 🌐 Continuously updated Ready to Start? Head to aka.ms/edgeai-for-beginners and dive into the modules. Whether you're coding in your dorm room or presenting at your next hackathon, this course will help you build smarter AI apps that run right where you need them on the edge.537Views1like0Comments