foundry
24 Topics8 Architectural Pillars to Boost GenAI LLM Accuracy and Performance in Low Cost
Smarter AI architecture, not bigger LLM models - how engineering teams push LLM accuracy and high performance in low cost. Enterprises using LLM (Large Language Models) hits the same ceiling and paying big price! A raw API call to a frontier model- GPT-4, Claude, Gemini delivers only 35-40% accuracy on structured output tasks like code generation, NL to DAX query generation, domain-specific reasoning. Prompt engineering pushes that to ~60%. But the final 35+ percentage points? Those come from system architecture, not model upgrades. This guide presents 8 architectural pillars, distilled from production Gen AI systems, that compound to close the accuracy gap. These patterns are model-agnostic and domain-agnostic, they apply equally to chatbots, coding assistants, content/query generators, automation agents, and any application where an LLM produces structured or semi-structured output. It’s based on my recent Gen AI projects. The key takeaway: use the LLM as one component in a larger system, not as the system itself. Surround it with deterministic guardrails, verified knowledge, and feedback loops. Pillar 1: Enhance Prompts with Verified Knowledge Context Impact: +35–40% accuracy (based on production use cases; may vary by domain) Top source of LLM errors in production is hallucinated identifiers knowledgebase, the model invents names, references, or structures that don't exist in the target system. This happens because LLMs are trained on general knowledge but deployed against specific, private enterprise systems they've never seen local database and knowledgebase. The fix is straightforward: inject verified, system-specific context (type definitions, API specs, ontologies, configuration schemas, entity catalogues) directly into the prompt so the model composes from known-good elements rather than recalling from training data. Use Knowledge Graph for better sematic knowledge. How to Implement Provide explicit context, never implicit- Whatever the LLM needs to reference identifiers, valid values, semantic knowledge, structures must appear verbatim in the prompt or retrieved context window. Filter aggressively. A full knowledge base with thousands of entities overwhelms the context window and confuses the model. Use intelligent filtering to surface only needed 5-10 most relevant elements per request. Store structured semantic knowledge in a graph or searchable index. This enables relationship-aware retrieval: "given entity X, what related entities, attributes, and constraints are also needed?" Include rich Semantic metadata. Names alone are insufficient. Include types, constraints, valid value ranges, relationships, and usage notes to minimize ambiguity. Keep context fresh. Stale context causes a different class of hallucination the model generates valid-looking output that references outdated structures. Sync your knowledge store with your source of truth. Why This Works LLMs excel at composition and reasoning combining elements, applying logic, following patterns. They are unreliable at recall of specific identifiers exact names, valid values, structural constraints. By offloading recall to a deterministic retrieval system and giving the LLM only composition tasks, you play to each system's strengths. Pillar 2: Tiered LLM Approach: Route Deterministically First, Use LLMs Last Impact: 80% cost reduction, 85% latency reduction, eliminates non-deterministic errors for most traffic. The most impactful architectural insight: most production requests don't need an LLM at all. A well-designed system handles 60-70% of traffic with deterministic logic templates, composition rules, cached results and reserves expensive, non-deterministic LLM calls only for genuinely novel inputs. The Three-Tier Model These metrics are from a real use case to convert NLP to Power BI DAX query. Tier Strategy Uses LLM ? Latency Accuracy Tier 0 Template slot-filling - handles requests that match known patterns exactly the system fills slots in a pre-built template with extracted parameters. No LLM, no non-determinism, near-perfect accuracy, sub-100ms response. No ~50ms 95-98% Tier 1 Compose from pre-validated fragments- handles requests that combine known patterns in new ways. The system retrieves pre-validated building blocks via search, composes them using deterministic rules, and validates the result. Still no LLM call. No ~200ms 90-95% Tier 2 Full LLM generation with enriched context- is reserved for genuinely novel requests that can't be served deterministically. Even here, the LLM receives maximum support: filtered context, relevant examples, explicit rules, and structured planning. Yes (1 call) 2-5s 88-93% Complexity-Based Routing A lightweight scoring function (evaluated in <1ms) routes each incoming request: Factors: reasoning depth, number of components, cross-references, constraints, nesting depth, novelty (distance from known patterns) Score 0-39: Tier 0 (deterministic template) Score 40-59: Tier 1 if confidence ≥ 85%, else Tier 2 Score 60+: Tier 2 (LLM generation) This routing achieves 96%+ accuracy in tier assignment and ensures the expensive path is only taken when necessary. Why This Matters Cost: 70-80% of requests cost zero LLM tokens Latency: Majority of responses in <200ms instead of 2-5s Reliability: Deterministic tiers produce identical output for identical input. Scalability: Deterministic tiers scale horizontally with trivial compute Pillar 3: Encode Prompt Anti-Patterns as Explicit Rules Impact: +8-10% accuracy, ~80% reduction in common structural errors LLM mistakes are patterned, not random. In any domain, 80% of errors cluster around a small set of 6-13 recurring structural mistakes. Instead of hoping the model avoids them through general instruction-following, compile these mistakes into explicit WRONG => CORRECT rules embedded directly in the system prompt. How to Implement Collect error data. Run 100+ requests through your system and categorize the failures. You'll find the same 6-13 patterns appearing repeatedly. Write concrete rules. For each pattern, show the exact wrong output and the exact correct alternative, with a one-line explanation of why. Embed in system prompt. Place rules prominently after the task description, before examples. Use formatting that's hard to ignore (headers, bold, explicit "NEVER" language). Keep the list short. 6-13 rules maximum. Beyond that, attention dilutes and the model starts ignoring rules. Prioritize by frequency. Refresh continuously. As the system improves (via other pillars), some errors disappear. New error types emerge. Update the rule set quarterly. Why This Works LLMs respond strongly to explicit negative examples. A generic instruction like "be careful with X" has minimal impact. But showing the exact wrong output the model tends to produce, paired with the correction, creates a strong avoidance signal. It's analogous to unit tests. Pillar 4: Retrieve Few-Shot Examples Dynamically Impact: +5-15% accuracy depending on domain complexity Static examples hardcoded in a prompt become stale, irrelevant of context tokens. Dynamic few-shot retrieval selects the 3-5 most relevant examples for each specific request, maximizing the signal-to-noise ratio in the prompt. Hybrid Retrieval Architecture The most effective approach combines two search strategies for intent search to understand natural language (NL) context: Keyword search (BM25) Finds examples with exact matching terms, identifiers, and domain vocabulary Vector search (semantic similarity) Finds examples with similar intent and structure, even if wording differs Rank fusion Merges results from both strategies, re-ranking by combined relevance This hybrid approach outperforms either strategy alone because keyword search catches exact identifier matches that vector search dilutes, while vector search captures semantic similarity that keyword search misses entirely. Best Gen AI Architectural Practices Match complexity to complexity. Simple requests should see simple examples. Complex requests should see complex examples. Mismatched examples confuse the model. Include negative examples. For the detected request type, include 1-2 "wrong => correct" pairs alongside positive examples. This reinforces Pillar 3's anti-pattern rules with concrete, contextually relevant demonstrations. Pre-compute embeddings. Generate vector embeddings at indexing time, not at query time. Cache retrieval results for repeated patterns. Curate quality over quantity. 3 excellent, diverse examples beat 10 mediocre ones. Each example should demonstrate a distinct pattern or edge case. Keep examples current. As your system evolves, old examples may demonstrate outdated patterns. Review and refresh the example store periodically. Pillar 5: Feedback Loop- Validate and Auto-Fix Every Output Deterministically Impact: +3-5% accuracy as a safety net, plus continuous improvement via feedback No matter how well-prompted, LLMs will occasionally produce outputs with minor structural errors - wrong casing, missing delimiters, references to slightly-incorrect identifiers, or subtle format violations. A deterministic post-processing pipeline catches and fixes these without any additional LLM calls. The Validation Pipeline LLM Output => Parse (grammar/AST) => Rule-Based Fixes => Compliance Check/validation => Final Output Each stage is fully deterministic: Parsing: Use a formal grammar or AST parser (ANTLR, tree-sitter, language-native parsers) to structurally analyse the output. Never regex-parse structured output - it's fragile and misses edge cases. Rule-based fixes: 10-20 deterministic transformation rules that correct known error patterns - name normalization, casing fixes, missing delimiters, structural repairs. Compliance check: Verify every identifier referenced in the output actually exists in the provided context. Flag unknown references. Design Principles Zero LLM calls in the fix pipeline. Every fix is a regex, an AST transformation, or a lookup table operation. Instant, free, deterministic, 100% reliable. Fail safe. If a fix is ambiguous (multiple valid corrections possible), pass through rather than corrupt. A minor error is better than a confident wrong "fix." Log everything. Track every fix applied, categorized by type. This data drives the feedback loop. The Critical Feedback Loop- The validation pipeline's most important function isn't fixing outputs, it's generating improvement signals: This creates a feedback loop: the auto-fix catches errors → the errors get promoted to upstream prevention → fewer errors reach the auto-fix → the system continuously tightens. Pillar 6: Multi-Agent Orchestration with Fewer Agents and Clear Contracts Impact: Reduced latency, clearer debugging, fewer failure modes The multi-agent pattern is powerful but commonly over-applied. The counter-intuitive lesson from production systems: fewer agents with well-defined responsibilities outperform many fine-grained agents. Why Fewer Is Better Each agent handoff introduces: Latency - serialization, network calls, context assembly Context loss - information dropped between boundaries Failure modes - each handoff is a potential error point Debugging complexity - tracing issues across many agents is exponentially harder Multi-Agent Orchestration Principles Merge agents that always run sequentially. If Agent A always feeds into Agent B with no branching or conditional logic, they should be one agent with two internal steps. Parallelize independent operations. Context retrieval and example lookup are independent, run them concurrently to halve retrieval latency. Route sub-tasks to cheaper models. Decomposed sub-problems are simpler by design. Use a smaller, faster, cheaper model (3x cost savings, 2x speed improvement). Define strict contracts. Each agent boundary should have an explicit schema defining inputs and outputs. No implicit assumptions about what crosses the boundary. Only 2 of 4 agents should call an LLM. The rest are purely deterministic. This minimizes non-deterministic behavior and cost. Pillar 7: Multi-Agent Cache at Multiple Hierarchical Levels Impact: 40-50% faster responses, 85%+ combined hit rate, significant cost reduction A single cache layer captures only one type of repetition. Production systems need hierarchical caching where multiple levels catch different repetition patterns , from exact duplicates to semantic near-misses. with -> A single cache layer captures only one type of repetition. Production systems need multi-level caching to handle exact matches, similar requests, and reusable fragments. or -> with Production systems need hierarchical caching where multiple levels handle exact matches, similar requests, and reusable fragments. Pillar 8: Measure Everything, Learn Continuously Impact: Enables data-driven iteration and prevents accuracy regressions. Architecture without observability is guesswork. The final pillar ensures every other pillar stays effective over time through comprehensive metrics and automated feedback loops. This isn't a one-time setup; it's a perpetual feedback loop. Every week, the top error patterns shift slightly. The auto-fix metrics tell you exactly where to focus next. Over months, this flywheel compounds into dramatic accuracy gains that no single prompt rewrite could achieve. Auto-Learning for New Domains When extending your system to new domains or knowledge areas: Auto-classify elements using naming conventions, type analysis, and structural patterns Auto-generate templates from universal patterns (transformations, comparisons, compositions, sequences) Bootstrap few-shot examples from successful template outputs Monitor for the first 100 requests, then curate only the edge cases manually This reduces domain onboarding from days of manual work to minutes of automated bootstrapping plus focused human review of outliers. Key Takeaways Architecture beats model size. A well-architected system with a smaller model outperforms a raw frontier model call on structured tasks at a fraction of the cost. Deterministic systems should do the heavy lifting. Reserve LLMs for genuinely novel, creative tasks. 70-80% of production requests should never touch an LLM. Verified knowledge is your top accuracy lever. Ground every prompt in context the model can trust. Errors are patterned, not random- Track them, compile them, and explicitly forbid them. Build feedback loops, not static systems- Every auto-fix, every cache miss, every routing decision is a signal for improvement. Fewer agents, done well- Fewer agents with strict contracts outperform 9 agents with fuzzy boundaries in accuracy, latency, and debuggability. Measure what matters and iterates- The system that wins isn't the one with the best day-one prompt, it's the one that improves fastest over time. Production-grade GenAI isn't about finding the perfect prompt or waiting for the next LLM model release. It's about building architectural guardrails that make failure nearly impossible and when failure does occur, the system learns from it automatically. These 8 pillars, applied together, transform any LLM from an unreliable black box into a precise, efficient, and continuously improving production system. -> Production Gen AI success is not about perfect prompts or waiting for the next LLM release. It comes from designing strong system guardrails that reduce failures and ensure consistent output. Even when failures happen, the system learns and improves automatically. When applied together, these 8 pillars turn an LLM into a reliable, efficient, and continuously improving production system.Student Devs: Build AI Agents, Compete for $55K in Prizes
Student Devs: Build AI Agents, Compete for $55K in Prizes 🎮 AI Skills Fest • June 4–14, 2026 • Free to Enter $55K Prize Pool 3 Challenge Tracks 10 Days of Hacking Free To Enter Whether you're a first-year CS student or a final-year senior with a portfolio full of projects, Agents League is the best way to gain hands-on experience with agentic AI this summer and walk away with real skills employers are hiring for right now. What You'll Actually Learn Forget passive tutorials. Agents League is project-based learning at full speed. By the end of the hackathon, you'll have built a working AI agent and gained practical experience with the tools shaping the future of software development. 🤖 AI-Assisted Development Use GitHub Copilot to accelerate your coding workflow — from scaffolding to debugging — the way professional developers do today. 🧩 Multi-Step Reasoning Build agents with Microsoft Foundry that can plan, reason, and execute complex tasks — the core of agentic AI. 🏢 Enterprise AI Patterns Learn to build production-ready agents that integrate with Microsoft 365 and Copilot Studio — skills that translate directly to industry jobs. 🔧 Prompt Engineering Design effective prompts and orchestration flows that make AI agents reliable and useful in the real world. 📦 GitHub Workflows Submit your project through GitHub — practising version control, README writing, and open-source collaboration. 🎯 Competitive Problem-Solving Work under real constraints with deadlines, judging criteria, and peer competition — just like industry hackathons and sprints. Pick Your Track (or Try All Three) Agents League has three challenge tracks, each using different Microsoft AI tools. Choose based on your interests or stretch yourself by competing in multiple tracks. Track 01. Creative Apps Build an innovative application with AI-assisted development. This track rewards creativity, dream big and let GitHub Copilot help you bring ideas to life faster than ever. Tool: GitHub Copilot Track 02. Reasoning Agents Create intelligent agents that solve complex problems through multi-step reasoning. Think: agents that can research, plan, and act. This is the cutting edge of AI. Tool: Microsoft Foundry Track 03. Enterprise Agents Build knowledge agents that integrate with Microsoft 365 Copilot. Learn how businesses are deploying AI today and add enterprise AI to your skillset. Tool: Copilot Studio • M365 Opportunities You Won't Want to Miss Agents League isn't just a competition, it's a launchpad. Here's what's in it for you beyond the code: 💰 Win from a $55,000 USD Prize Pool Prizes are awarded across all three tracks smaller teams and solo hackers have a real shot. 📺 Watch Live Coding Battles at Microsoft Reactor See industry experts go head-to-head building AI agents live. Learn advanced techniques you can apply immediately to your own project. 🎓 Free Learning Resources on Microsoft Learn Access curated learning paths and the AI Skills Navigator, structured content designed to get you from zero to submission-ready. 🌍 Join a Global Developer Community Connect with thousands of developers on the Agents League Discord. Find teammates, ask questions, and build your professional network. 📂 Build Your Portfolio with a Real Project Every submission lives on GitHub. Walk away with a polished, public project that demonstrates your AI skills to future employers and grad schools. 🏆 Gain Recognition from Microsoft and the Community Top projects get visibility across the Microsoft developer ecosystem. Stand out from the crowd in internship and job applications. Key Dates to Remember Event Date Hacking Period Opens June 4, 2026 Registration Deadline June 12, 2026 — 12:00 PM PT Submission Deadline June 14, 2026 — 11:59 PM PT How to Get Started (Right Now) You don't have to wait until June 4th to start preparing. Here's your pre-hackathon game plan: Register for the hackathon it's free and open to everyone. Pick a track that matches your interests or curiosity. Explore the learning resources on Microsoft Learn and the AI Skills Navigator. Join the Discord community to find teammates and get early tips. Watch the Reactor event series for live coding battles and expert walkthroughs. Set up your GitHub repo and start experimenting before the hacking window opens. Helpful Links Register for Agents League Free entry, sign up now Microsoft Reactor Events Live coding battles & workshops AI Skills Fest The broader event Microsoft Learn Free learning paths The Arena Awaits 🏆 Ten days. Three tracks. $55K in prizes. Whether you go solo or squad up, this is your chance to build something real with AI and have a blast doing it. Register Now It's Free | Watch Reactor Events Agents League is part of AI Skills Fest and is open to the public at no cost. Review the Hackathon Rules and Regulations and the Microsoft Event Code of Conduct before participating.94Views0likes0CommentsSigning in to Microsoft Foundry from OpenClaw using Azure AD: a smoother way to bring your models in
This post is a quick update to walk through the new flow. If you read the previous one, think of this as the easier path I wish I had the first time round. If you have not seen the original, you can find it here: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Pre-requisite: You will need the Azure CLI (azure-cli) installed on your machine. The official install guide for Linux is here: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest I am on Linux so I went the Homebrew route, which keeps things simple. The formula is here: https://formulae.brew.sh/formula/azure-cli Microsoft also has official docs covering the Homebrew/Linuxbrew install: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Once Homebrew is ready, run this in your terminal: brew install azure-cli Why this matters: Before this update, every Foundry model you wanted to use in OpenClaw needed its own API key and endpoint pasted into the config. It worked, but it was tedious, and keys are easy to leak if you are copying them around. The Azure AD path solves both problems. You authenticate as yourself (or a service principal), OpenClaw asks Azure for the list of Foundry resources you have access to, and it brings the models in automatically. Signing in to Microsoft Foundry from OpenClaw via Azure AD A device-code OAuth handshake replaces the old static-API-key flow. OpenClaw delegates auth to the local Azure CLI; the CLI handles the browser-side sign-in, holds the resulting tokens, and refreshes them silently. OpenClaw then walks the Azure resource graph, subscriptions → Foundry resources → model deployments and registers each model into its own config. No API keys move through OpenClaw at any point. Sequence diagram of the OAuth 2.0 device-authorization flow as orchestrated by OpenClaw. Phases 1–3 establish identity (the developer authenticates once, in a real browser, against Azure AD). Phases 4–5 perform service discovery (OpenClaw walks the ARM resource hierarchy, subscriptions → Foundry accounts → model deployments and persists the result to a local provider config). After registration, every model call OpenClaw makes against Foundry reuses the same Azure-CLI-managed token cache: tokens refresh transparently, and access is gated by the Foundry resource's RBAC assignments rather than a static API key. Dashed lines denote return values; the teal line in step 7 marks the single token-issuance event the rest of the system pivots on. Walking through the new flow: Start with the command to onboard openclaw as if you were setting up OpenClaw for the first time: openclaw onboard Kick things off with the OpenClaw onboard command, the same one you would use when setting up OpenClaw for the first time. When it prompts you, choose update values. Next, you will be asked to configure your models. Scroll down a little and you will see Microsoft Foundry listed as a supported provider. Pick it. From here, you have two options. You can sign in with an API key, which is what I covered in the previous blog post, or you can sign in through Azure AD. The Azure AD path is easier and more secure, so that is the one we will use. OpenClaw will give you a URL and a device code. Copy the URL into your browser and use the code to complete the sign in. (This is where the az CLI from the pre-requisite section earns its keep.) If everything worked, you should see a success prompt similar to this: Once you are signed in, OpenClaw will ask you to pick the Azure subscription that your Microsoft Foundry resource lives in. Pick the subscription, then pick the Foundry resource where your models are deployed. And that is pretty much it. All the models you have deployed to that Foundry resource get pulled into OpenClaw automatically. Compared to the old way of pasting API keys and endpoints one by one, this is a huge time saver, and you do not have to babysit any keys. From here you can start using your Foundry-deployed models inside OpenClaw straight away: Wrapping up The Azure AD sign-in option in OpenClaw is one of those small updates that quietly removes a real pain point. If you have ever juggled multiple Foundry endpoints and rotated keys across them, you already know why. With this flow, you sign in once, your models show up, and you can get back to actually building. If you have not tried OpenClaw with Microsoft Foundry yet, this is a good time to give it a go. And if you were holding off because of the key management overhead, that excuse is gone now. References Previous post on integrating Microsoft Foundry with OpenClaw using API keys: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Install the Azure CLI on Linux: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest Install the Azure CLI on macOS: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Homebrew formula for azure-cli: https://formulae.brew.sh/formula/azure-cli107Views0likes0CommentsFrom Demo to Production: Building Microsoft Foundry Hosted Agents with .NET
The Gap Between a Demo and a Production Agent Let's be honest. Getting an AI agent to work in a demo takes an afternoon. Getting it to work reliably in production, tested, containerised, deployed, observable, and maintainable by a team. is a different problem entirely. Most tutorials stop at the point where the agent prints a response in a terminal. They don't show you how to structure your code, cover your tools with tests, wire up CI, or deploy to a managed runtime with a proper lifecycle. That gap between prototype and production is where developer teams lose weeks. Microsoft Foundry Hosted Agents close that gap with a managed container runtime for your own custom agent code. And the Hosted Agents Workshop for .NET gives you a complete, copy-paste-friendly path through the entire journey. from local run to deployed agent to chat UI, in six structured labs using .NET 10. This post walks you through what the workshop delivers, what you will build, and why the patterns it teaches matter far beyond the workshop itself. What Is a Microsoft Foundry Hosted Agent? Microsoft Foundry supports two distinct agent types, and understanding the difference is the first decision you will make as an agent developer. Prompt agents are lightweight agents backed by a model deployment and a system prompt. No custom code required. Ideal for simple Q&A, summarisation, or chat scenarios where the model's built-in reasoning is sufficient. Hosted agents are container-based agents that run your own code .NET, Python, or any framework you choose inside Foundry's managed runtime. You control the logic, the tools, the data access, and the orchestration. When your scenario requires custom tool integrations, deterministic business logic, multi-step workflow orchestration, or private API access, a hosted agent is the right choice. The Foundry runtime handles the managed infrastructure; you own the code. For the official deployment reference, see Deploy a hosted agent to Foundry Agent Service on Microsoft Learn. What the Workshop Delivers The Hosted Agents Workshop for .NET is a beginner-friendly, hands-on workshop that takes you through the full development and deployment path for a real hosted agent. It is structured around a concrete scenario: a Hosted Agent Readiness Coach that helps delivery teams answer questions like: Should this use case start as a prompt agent or a hosted agent? What should a pilot launch checklist include? How should a team troubleshoot common early setup problems? The scenario is purposefully practical. It is not a toy chatbot. It is the kind of tool a real team would build and hand to other engineers, which means it needs to be testable, deployable, and extensible. The workshop covers: Local development and validation with .NET 10 Copilot-assisted coding with repo-specific instructions Deterministic tool implementation with xUnit test coverage CI pipeline validation with GitHub Actions Secure deployment to Azure Container Registry and Microsoft Foundry Chat UI integration using Blazor What You Will Build By the end of the workshop, you will have a code-based hosted agent that exposes an OpenAI Responses-compatible /responses endpoint on port 8088 . The agent is backed by three deterministic local tools, implemented in WorkshopLab.Core : RecommendImplementationShape — analyses a scenario and recommends hosted or prompt agent based on its requirements BuildLaunchChecklist — generates a pilot launch checklist for a given use case TroubleshootHostedAgent — returns structured troubleshooting guidance for common setup problems These tools are deterministic by design, no LLM call required to return a result. That choice makes them fast, predictable, and fully testable, which is the right architecture for business logic in a production agent. The end-to-end architecture looks like this: The Hands-On Journey: Lab by Lab The workshop follows a deliberate build → validate → ship progression. Each lab has a clear outcome. You do not move forward until the previous checkpoint passes. Lab 0 — Setup and Local Run Open the repo in VS Code or a GitHub Codespace, configure your Microsoft Foundry project endpoint and model deployment name, then run the agent locally. By the end of Lab 0, your agent is listening on http://localhost:8088/responses and responding to test requests. dotnet restore dotnet build dotnet run --project src/WorkshopLab.AgentHost Test it with a single PowerShell call: Invoke-RestMethod -Method Post ` -Uri "http://localhost:8088/responses" ` -ContentType "application/json" ` -Body '{"input":"Should we start with a hosted agent or a prompt agent?"}' Lab 0 instructions → Lab 1 — Copilot Customisation Configure repo-specific GitHub Copilot instructions so that Copilot understands the hosted-agent patterns used in this project. You will also add a Copilot review skill tailored to hosted agent code reviews. This step means every code suggestion you receive from Copilot is contextualised to the workshop scenario rather than giving generic .NET advice. Lab 1 instructions → Lab 2 — Tool Implementation Extend one of the deterministic tools in WorkshopLab.Core with a real feature change. The suggested change adds a stronger recommendation path to RecommendImplementationShape for scenarios that require all three hosted-agent strengths simultaneously. // In RecommendImplementationShape — add before the final return: if (requiresCode && requiresTools && requiresWorkflow) { return string.Join(Environment.NewLine, [ $"Recommended implementation: Hosted agent (full-stack)", $"Scenario goal: {goal}", "Why: the scenario requires custom code, external tool access, and " + "multi-step orchestration — all three hosted-agent strengths.", "Suggested next step: start with a code-based hosted agent, register " + "local tools for each integration, and add a workflow layer." ]); } You then write an xUnit test to cover it, run dotnet test , and validate the change against a live /responses call. This is the workshop's most important teaching moment: every tool change is covered by a test before it ships. Lab 2 instructions → Lab 3 — CI Validation Wire up a GitHub Actions workflow that builds the solution, runs the test suite, and validates that the agent container builds cleanly. No manual steps — if a change breaks the build or a test, CI catches it before any deployment happens. Lab 3 instructions → Lab 4 — Deployment to Microsoft Foundry Use the Azure Developer CLI ( azd ) to provision an Azure Container Registry, publish the agent image, and deploy the hosted agent to Microsoft Foundry. The workshop separates provisioning from deployment deliberately: azd owns the Azure resources; the Foundry control plane deployment is an explicit, intentional final step that depends on your real project endpoint and agent.yaml manifest values. Lab 4 instructions → Lab 5 — Chat UI Integration Connect a Blazor chat UI to the deployed hosted agent and validate end-to-end responses. By the end of Lab 5, you have a fully functioning agent accessible through a real UI, calling your deterministic tools via the Foundry control plane. Lab 5 instructions → Key Concepts to Take Away The workshop teaches concrete patterns that apply well beyond this specific scenario. Code-first agent design Prompt-only agents are fast to build but hard to test and reason about at scale. A hosted agent with code-backed tools gives you something you can unit test, refactor, and version-control like any other software. Deterministic tools and testability The workshop explicitly avoids LLM calls inside tool implementations. Deterministic tools return predictable outputs for a given input, which means you can write fast, reliable unit tests for them. This is the right pattern for business logic. Reserve LLM calls for the reasoning layer, not the execution layer. CI/CD for agent systems AI agents are software. They deserve the same build-test-deploy discipline as any other service. Lab 3 makes this concrete: you cannot ship without passing CI, and CI validates the container as well as the unit tests. Deployment separation The workshop's split between azd provisioning and Foundry control-plane deployment is not arbitrary. It reflects the real operational boundary: your Azure resources are long-lived infrastructure; your agent deployment is a lifecycle event tied to your project's specific endpoint and manifest. Keeping them separate reduces accidents and makes rollbacks easier. Observability and the validation mindset Every lab ends with an explicit checkpoint. The culture the workshop builds is: prove it works before moving on. That mindset is more valuable than any specific tool or command in the labs. Why Hosted Agents Are Worth the Investment The managed runtime in Microsoft Foundry removes the infrastructure overhead that makes custom agent deployment painful. You do not manage Kubernetes clusters, configure ingress rules, or handle TLS termination. Foundry handles the hosting; you handle the code. This matters most for teams making the transition from demo to production. A prompt agent is an afternoon's work. A hosted agent with proper CI, tested tools, and a deployment pipeline is a week's work done properly once, instead of several weeks of firefighting done poorly repeatedly. The Foundry agent lifecycle —> create, update, version, deploy —>also gives you the controls you need to manage agents in a real environment: staged rollouts, rollback capability, and clear separation between agent versions. For the full deployment guide, see Deploy a hosted agent to Foundry Agent Service. From Workshop to Real Project This workshop is not just a learning exercise. The repository structure, the tooling choices, and the CI/CD patterns are a reference implementation. The patterns you can lift directly into a production project include: The WorkshopLab.Core / WorkshopLab.AgentHost separation between business logic and agent hosting The agent.yaml manifest pattern for declarative Foundry deployment The GitHub Actions workflow structure for build, test, and container validation The azd + ACR pattern for image publishing without requiring Docker Desktop locally The Blazor chat UI as a starting point for internal tooling or developer-facing applications The scenario, a readiness coach for hosted agents. This is also something teams evaluating Microsoft Foundry will find genuinely useful. It answers exactly the questions that come up when onboarding a new team to the platform. Common Mistakes When Building Hosted Agents Having run workshops and spoken with developer teams building on Foundry, a few patterns come up repeatedly: Skipping local validation before containerising. Always validate the /responses endpoint locally first. Debugging inside a container is slower and harder than debugging locally. Putting business logic inside the LLM call. If the answer to a user query can be determined by code, use code. Reserve the model for reasoning, synthesis, and natural language output. Treating CI as optional. Agent code changes break things just like any other code change. If you do not have CI catching regressions, you will ship them. Conflating provisioning and deployment. Recreating Azure resources on every deploy is slow and error-prone. Provision once with azd ; deploy agent versions as needed through the Foundry control plane. Not having a rollback plan. The Foundry agent lifecycle supports versioning. Use it. Know how to roll back to a previous version before you deploy to production. Get Started The workshop is open source, beginner-friendly, and designed to be completed in a single day. You need a .NET 10 SDK, an Azure subscription, access to a Microsoft Foundry project, and a GitHub account. Clone the repository, follow the labs in order, and by the end you will have a production-ready reference implementation that your team can extend and adapt for real scenarios. Clone the workshop repository → Here is the quick start to prove the solution works locally before you begin the full lab sequence: git clone https://github.com/microsoft/Hosted_Agents_Workshop_dotNET.git cd Hosted_Agents_Workshop_dotNET # Set your Foundry project endpoint and model deployment $env:AZURE_AI_PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>" $env:MODEL_DEPLOYMENT_NAME = "gpt-4.1-mini" # Build and run dotnet restore dotnet build dotnet run --project src/WorkshopLab.AgentHost Then send your first request: Invoke-RestMethod -Method Post ` -Uri "http://localhost:8088/responses" ` -ContentType "application/json" ` -Body '{"input":"Should we start with a hosted agent or a prompt agent?"}' When the agent answers as a Hosted Agent Readiness Coach, you are ready to begin the labs. Key Takeaways Hosted agents in Microsoft Foundry let you run custom .NET code in a managed container runtime — you own the logic, Foundry owns the infrastructure. Deterministic tools are the right pattern for business logic in production agents: fast, testable, and predictable. CI/CD is not optional for agent systems. Build it in from the start, not as an afterthought. Separate your provisioning ( azd ) from your deployment (Foundry control plane) — it reduces accidents and simplifies rollbacks. The workshop is a reference implementation, not just a tutorial. The patterns are production-grade and ready to adapt. References Hosted Agents Workshop for .NET — GitHub Repository Workshop Lab Guide Deploy a Hosted Agent to Foundry Agent Service — Microsoft Learn Microsoft Foundry Portal Azure Developer CLI (azd) — Microsoft Learn .NET 10 SDK Download376Views0likes0CommentsMicrosoft Foundry Model Router: A Developer's Guide to Smarter AI Routing
Introduction When building AI-powered applications on Azure, one of the most impactful decisions you make isn't about which model to use, it's about how your application selects models at runtime. Microsoft Foundry Model Router, available through Microsoft Foundry, automatically routes your inference requests to the best available model based on prompt complexity, latency targets, and cost efficiency. But how do you know it's actually routing correctly? And how do you compare its behavior across different API paths? That's exactly the problem RouteLens solves. It's an open-source Node.js CLI and web-based testing tool that sends configurable prompts through two distinct Azure AI runtime paths and produces a detailed comparison of routing decisions, latency profiles, and reliability metrics. In this post, we'll walk through what Model Router does, why it matters, how to use the validator tool, and best practices for designing applications that get the most out of intelligent model routing. What Is Microsoft Founry Model Router? Microsoft Foundry Model Router is a deployment option in Microsoft Foundry that sits between your application and a pool of AI models. Instead of hard-coding a specific model like gpt-4o or gpt-4o-mini , you deploy a Model Router endpoint and let Azure decide which underlying model serves each request. How It Works Your application sends an inference request to the Model Router deployment. Model Router analyzes the request (prompt complexity, token count, required capabilities). It selects the most appropriate model from the available pool. The response is returned transparently — your application code doesn't change. Why This Matters Cost optimization — Simple prompts get routed to smaller, cheaper models. Complex prompts go to more capable (and expensive) ones. Latency reduction — Lightweight prompts complete faster when they don't need a heavyweight model. Resilience — If one model is experiencing high load or throttling, traffic can shift to alternatives. Simplified application code — No need to build your own model-selection logic. The Two Runtime Paths Microsoft Foundry offers two distinct endpoint configurations for hitting Model Router. Even though both use the Chat Completions API, they may have different routing behaviour: Path SDK Endpoint AOAI + Chat Completions OpenAI JS SDK https://.cognitiveservices.azure.com/openai/deployments/ Foundry Project + Chat Completions OpenAI JS SDK (separate client) https://.cognitiveservices.azure.com/openai/deployments/ Understanding whether these two paths produce the same routing decisions is critical for production applications. If the same prompt routes to different models depending on which endpoint you use, that's a signal you need to investigate. Introducing RouteLens RouteLens is a Node.js tool that automates this comparison. It: Sends a configurable set of prompts across categories (echo, summarize, code, reasoning) through both paths. Logs every response to structured JSONL files for post-hoc analysis. Computes statistics including p50/p95 latency, error rates, and model-choice distribution. Highlights routing differences — where the same prompt was served by different models across paths. Provides a web dashboard for interactive testing and real-time result visualization. The Web Dashboard The built-in web UI makes it easy to run tests and explore results without parsing log files: The dashboard includes: KPI Dashboard — Key metrics at a glance: Success Rate, Avg TPS, Gen TPS, Peak TPS, Fastest Response, p50/p95 Latency, Most Reliable Path, Total Tokens Summary view — Per-path/per-category stats with success rate, TPS, and latency Model Comparison — Side-by-side view of which models were selected by each path Latency Charts — Visual bar charts comparing p50 and p95 latencies Error Analysis — Error distribution and detailed error messages Live Feed — Real-time streaming of results as they come in Log Viewer — Browse and inspect historical JSONL log files Model Comparison — See which models were selected by each routing path for every prompt: Live Feed — Real-time streaming of results as they come in: Log Viewer — Browse and inspect historical JSONL log files with parsed table views: Mobile Responsive — The UI adapts to smaller screens: Getting Started Prerequisites Node.js 18+ (LTS recommended) An Azure subscription with a Foundry project Model Router deployed in your Foundry project An API key from your Azure OpenAI / Foundry resource The API version (e.g. 2024-05-01-preview ) Setup # Clone and install git clone https://github.com/leestott/modelrouter-routelens/ cd modelrouter-routelens npm install # Configure your endpoints cp .env.example .env # Edit .env with your Azure endpoints (see below) Configuration The .env file needs these key settings: # Your Foundry / Cognitive Services deployment endpoint # Format: https://<resource>.cognitiveservices.azure.com/openai/deployments/<deployment> # Do NOT include /chat/completions or ?api-version FOUNDRY_PROJECT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router AOAI_BASE_URL=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router # API key from your Azure OpenAI / Foundry resource AOAI_API_KEY=your-api-key-here # Azure OpenAI API version AOAI_API_VERSION=2024-05-01-preview</resource></resource></deployment></resource> Running Tests # Full test matrix — sends all prompts through both paths npm run run:matrix # 408 timeout diagnostic — focuses on the Responses path timeout issue npm run run:repro408 # Web UI — interactive dashboard npm run ui # Then open http://localhost:3002 (or the port set in UI_PORT) Understanding the Results Latency Comparison The latency charts show p50 (median) and p95 (tail) latency for each path and prompt category: Key things to look for: Large p50 differences between paths suggest one path has consistently higher overhead. High p95 values indicate tail latency problems — possibly timeouts or retries. Category-specific patterns — If code prompts are slow on one path but fast on another, that's a routing difference worth investigating. Model Comparison The model comparison view shows which models were selected for each prompt: When both paths select the same model, you see a green "Match" indicator. When they differ, it's flagged in red — these are the cases you want to investigate. Error Analysis The errors view helps diagnose reliability issues: Common error patterns: 408 Timeout — The Responses path may take longer for certain prompt categories 401 Unauthorized — Authentication configuration issues 429 Rate Limited — You're hitting throughput limits 500 Internal Server Error — Backend model issues Best Practices for Designing Applications with Model Router 1. Design Prompts with Routing in Mind Model Router makes decisions based on prompt characteristics. To get the best routing: Keep prompts focused — A clear, single-purpose prompt is easier for the router to classify than a multi-part prompt that spans multiple complexity levels. Use system messages effectively — A well-structured system message helps the router understand the task complexity. Separate complex chains — If you have a multi-step workflow, make each step a separate API call rather than one massive prompt. This lets the router use a cheaper model for simple steps. 2. Set Appropriate Timeouts Different models have different latency profiles. Your timeout settings should account for the slowest model the router might select: // Too aggressive — may timeout when routed to a larger model const TIMEOUT = 5000; // 5s // Better — allows headroom for model variation const TIMEOUT = 30000; // 30s // Best — use different timeouts based on expected complexity function getTimeout(category) { switch (category) { case 'echo': return 10000; case 'summarize': return 20000; case 'code': return 45000; case 'reasoning': return 60000; default: return 30000; } } 3. Implement Robust Retry Logic Because the router may select different models on retry, transient failures can resolve themselves: async function callWithRetry(prompt, maxRetries = 3) { for (let attempt = 0; attempt < maxRetries; attempt++) { try { return await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); } catch (err) { if (attempt === maxRetries - 1) throw err; // Exponential backoff await new Promise(r => setTimeout(r, 1000 * Math.pow(2, attempt))); } } } 4. Monitor Model Selection in Production Log which model was selected for each request so you can track routing patterns over time: const response = await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); // The model field in the response tells you which model was actually used console.log(`Routed to: ${response.model}`); console.log(`Tokens: ${response.usage.total_tokens}`); 5. Use the Right API Path for Your Use Case Based on our testing with RouteLens, consider: Chat Completions path — The standard path for chat-style interactions. Uses the openai SDK directly. Foundry Project path — Uses the same Chat Completions API but through the Foundry project endpoint. Useful for comparing routing behaviour across different endpoint configurations. Note: The Responses API ( /responses ) is not currently available on cognitiveservices.azure.com Model Router deployments. Both paths in RouteLens use Chat Completions. 6. Test Before You Ship Run RouteLens as part of your pre-production validation: # In your CI/CD pipeline or pre-deployment check npm run run:matrix -- --runs 10 --concurrency 4 This helps you: Catch routing regressions when Azure updates model pools Verify that your prompt changes don't cause unexpected model selection shifts Establish latency baselines for alerting Architecture Overview RouteLens sends configurable prompts through two distinct Azure AI runtime paths and compares routing decisions, latency, and reliability. The Matrix Runner dispatches prompts to both the Chat Completions Client (OpenAI JS SDK → AOAI endpoint) and the Project Responses Client ( azure/ai-projects → Foundry endpoint). Both paths converge at Azure Model Router, which intelligently selects the optimal backend model. Results are logged to JSONL files and rendered in the web dashboard. Key Benefits of Model Router Benefit Description Cost savings Automatically routes simple prompts to cheaper models, reducing spend by 30-50% in typical workloads Lower latency Simple prompts complete faster on lightweight models Zero code changes Same API contract as a standard model deployment — just change the deployment name Future-proof As Azure adds new models to the pool, your application benefits automatically Built-in resilience Routing adapts to model availability and load conditions Conclusion Azure Model Router represents a shift from "pick a model" to "describe your task and let the platform decide." This is a natural evolution for AI applications — just as cloud platforms abstract away server selection, Model Router abstracts away model selection. RouteLens gives you the visibility to trust that abstraction. By systematically comparing routing behavior across API paths and prompt categories, you can deploy Model Router with confidence and catch issues before your users do. The tool is open source under the MIT license. Try it out, file issues, and contribute improvements: GitHub Repository Model Router Documentation Microsoft Foundry791Views1like0CommentsMicrosoft Olive & Olive Recipes: A Practical Guide to Model Optimization for Real-World Deployment
Why your model runs great on your laptop but fails in the real world You have trained a model. It scores well on your test set. It runs fine on your development machine with a beefy GPU. Then someone asks you to deploy it to a customer's edge device, a cloud endpoint with a latency budget, or a laptop with no discrete GPU at all. Suddenly the model is too large, too slow, or simply incompatible with the target runtime. You start searching for quantisation scripts, conversion tools, and hardware-specific compiler flags. Each target needs a different recipe, and the optimisation steps interact in ways that are hard to predict. This is the deployment gap. It is not a knowledge gap; it is a tooling gap. And it is exactly the problem that Microsoft Olive is designed to close. What is Olive? Olive is an easy-to-use, hardware-aware model optimisation toolchain that composes techniques across model compression, optimisation, and compilation. Rather than asking you to string together separate conversion scripts, quantisation utilities, and compiler passes by hand, Olive lets you describe what you have and what you need, then handles the pipeline. In practical terms, Olive takes a model source, such as a PyTorch model or an ONNX model (and other supported formats), plus a configuration that describes your production requirements and target hardware accelerator. It then runs the appropriate optimisation passes and produces a deployment-ready artefact. You can think of it as a build system for model optimisation: you declare the intent, and Olive figures out the steps. Official repo: github.com/microsoft/olive Documentation: microsoft.github.io/Olive Key advantages: why Olive matters for your workflow A. Optimise once, deploy across many targets One of the hardest parts of deploying models in production is that "production" is not one thing. Your model might need to run on a cloud GPU, an edge CPU, or a Windows device with an NPU. Each target has different memory constraints, instruction sets, and runtime expectations. Olive supports targeting CPU, GPU, and NPU through its optimisation workflow. This means a single toolchain can produce optimised artefacts for multiple deployment targets, expanding the number of platforms you can serve without maintaining separate optimisation scripts for each one. The conceptual workflow is straightforward: Olive can download, convert, quantise, and optimise a model using an auto-optimisation style approach where you specify the target device (cpu, gpu, or npu). This keeps the developer experience consistent even as the underlying optimisation strategy changes per target. B. ONNX as the portability layer If you have heard of ONNX but have not used it in anger, here is why it matters: ONNX gives your model a common representation that multiple runtimes understand. Instead of being locked to one framework's inference path, an ONNX model can run through ONNX Runtime and take advantage of whatever hardware is available. Olive supports ONNX conversion and optimisation, and can generate a deployment-ready model package along with sample inference code in languages like C#, C++, or Python. That package is not just the model weights; it includes the configuration and code needed to load and run the model on the target platform. For students and early-career engineers, this is a meaningful capability: you can train in PyTorch (the ecosystem you already know) and deploy through ONNX Runtime (the ecosystem your production environment needs). C. Hardware-specific acceleration and execution providers When Olive targets a specific device, it does not just convert the model format. It optimises for the execution provider (EP) that will actually run the model on that hardware. Execution providers are the bridge between the ONNX Runtime and the underlying accelerator. Olive can optimise for a range of execution providers, including: Vitis AI EP (AMD) – for AMD accelerator hardware OpenVINO EP (Intel) – for Intel CPUs, integrated GPUs, and VPUs QNN EP (Qualcomm) – for Qualcomm NPUs and SoCs DirectML EP (Windows) – for broad GPU support on Windows devices Why does EP targeting matter? Because the difference between a generic model and one optimised for a specific execution provider can be significant in terms of latency, throughput, and power efficiency. On battery-powered devices especially, the right EP optimisation can be the difference between a model that is practical and one that drains the battery in minutes. D. Quantisation and precision options Quantisation is one of the most powerful levers you have for making models smaller and faster. The core idea is reducing the numerical precision of model weights and activations: FP32 (32-bit floating point) – full precision, largest model size, highest fidelity FP16 (16-bit floating point) – roughly half the memory, usually minimal quality loss for most tasks INT8 (8-bit integer) – significant size and speed gains, moderate risk of quality degradation depending on the model INT4 (4-bit integer) – aggressive compression for the most constrained deployment scenarios Think of these as a spectrum. As you move from FP32 towards INT4, models get smaller and faster, but you trade away some numerical fidelity. The practical question is always: how much quality can I afford to lose for this use case? Practical heuristics for choosing precision: FP16 is often a safe default for GPU deployment. In practice, you might start here and only go lower if you need to. INT8 is a strong choice for CPU-based inference where memory and compute are constrained but accuracy requirements are still high (e.g., classification, embeddings, many NLP tasks). INT4 is worth exploring when you are deploying large language models to edge or consumer devices and need aggressive size reduction. Expect to validate quality carefully, as some tasks and model architectures tolerate INT4 better than others. Olive handles the mechanics of applying these quantisation passes as part of the optimisation pipeline, so you do not need to write custom quantisation scripts from scratch. Showcase: model conversion stories To make this concrete, here are three plausible optimisation scenarios that illustrate how Olive fits into real workflows. Story 1: PyTorch classification model → ONNX → quantised for cloud CPU inference Starting point: A PyTorch image classification model fine-tuned on a domain-specific dataset. Target hardware: Cloud CPU instances (no GPU budget for inference). Optimisation intent: Reduce latency and cost by quantising to INT8 whilst keeping accuracy within acceptable bounds. Output: An ONNX model optimised for CPU execution, packaged with configuration and sample inference code ready for deployment behind an API endpoint. Story 2: Hugging Face language model → optimised for edge NPU Starting point: A Hugging Face transformer model used for text summarisation. Target hardware: A laptop with an integrated NPU (e.g., a Qualcomm-based device). Optimisation intent: Shrink the model to INT4 to fit within NPU memory limits, and optimise for the QNN execution provider to leverage the neural processing unit. Output: A quantised ONNX model configured for QNN EP, with packaging that includes the model, runtime configuration, and sample code for local inference. Story 3: Same model, two targets – GPU vs. NPU Starting point: A single PyTorch generative model used for content drafting. Target hardware: (A) Cloud GPU for batch processing, (B) On-device NPU for interactive use. Optimisation intent: For GPU, optimise at FP16 for throughput. For NPU, quantise to INT4 for size and power efficiency. Output: Two separate optimised packages from the same source model, one targeting DirectML EP for GPU, one targeting QNN EP for NPU, each with appropriate precision, runtime configuration, and sample inference code. In each case, Olive handles the multi-step pipeline: conversion, optimisation passes, quantisation, and packaging. The developer's job is to define the target and validate the output quality. Introducing Olive Recipes If you are new to model optimisation, staring at a blank configuration file can be intimidating. That is where Olive Recipes comes in. The Olive Recipes repository complements Olive by providing recipes that demonstrate features and use cases. You can use them as a reference for optimising publicly available models or adapt them for your own proprietary models. The repository also includes a selection of ONNX-optimised models that you can study or use as starting points. Think of recipes as worked examples: each one shows a complete optimisation pipeline for a specific scenario, including the configuration, the target hardware, and the expected output. Instead of reinventing the pipeline from scratch, you can find a recipe close to your use case and modify it. For students especially, recipes are a fast way to learn what good optimisation configurations look like in practice. Taking it further: adding custom models to Foundry Local Once you have optimised a model with Olive, you may want to serve it locally for development, testing, or fully offline use. Foundry Local is a lightweight runtime that downloads, manages, and serves language models entirely on-device via an OpenAI-compatible API, with no cloud dependency and no API keys required. Important: Foundry Local only supports specific model templates. At present, these are the chat template (for conversational and text-generation models) and the whisper template (for speech-to-text models based on the Whisper architecture). If your model does not fit one of these two templates, it cannot currently be loaded into Foundry Local. Compiling a Hugging Face model for Foundry Local If your optimised model uses a supported architecture, you can compile it from Hugging Face for use with Foundry Local. The high-level process is: Choose a compatible Hugging Face model. The model must match one of Foundry Local's supported templates (chat or whisper). For chat models, this typically means decoder-only transformer architectures that support the standard chat format. Use Olive to convert and optimise. Olive handles the conversion from the Hugging Face source format into an ONNX-based, quantised artefact that Foundry Local can serve. This is where your Olive skills directly apply. Register the model with Foundry Local. Once compiled, you register the model so that Foundry Local's catalogue recognises it and can serve it through the local API. For the full step-by-step guide, including exact commands and configuration details, refer to the official documentation: How to compile Hugging Face models for Foundry Local. For a hands-on lab that walks through the complete workflow, see Foundry Local Lab, specifically Lab 10 which covers bringing custom models into Foundry Local. Why does this matter? The combination of Olive and Foundry Local gives you a complete local workflow: optimise your model with Olive, then serve it with Foundry Local for rapid iteration, privacy-sensitive workloads, or environments without internet connectivity. Because Foundry Local exposes an OpenAI-compatible API, your application code can switch between local and cloud inference with minimal changes. Keep in mind the template constraint. If you are planning to bring a custom model into Foundry Local, verify early that it fits the chat or whisper template. Attempting to load an unsupported architecture will not work, regardless of how well the model has been optimised. Contributing: how to get involved The Olive ecosystem is open source, and contributions are welcome. There are two main ways to contribute: A. Contributing recipes If you have built an optimisation pipeline that works well for a specific model, hardware target, or use case, consider contributing it as a recipe. Recipes are repeatable pipeline configurations that others can learn from and adapt. B. Sharing optimised model outputs and configurations If you have produced an optimised model that might be useful to others, sharing the optimisation configuration and methodology (and, where licensing permits, the model itself) helps the community build on proven approaches rather than starting from zero. Contribution checklist Reproducibility: Can someone else run your recipe or configuration and get comparable results? Licensing: Are the base model weights, datasets, and any dependencies properly licensed for sharing? Hardware target documented: Have you specified which device and execution provider the optimisation targets? Runtime documented: Have you noted the ONNX Runtime version and any EP-specific requirements? Quality validation: Have you included at least a basic accuracy or quality check for the optimised output? If you are a student or early-career developer, contributing a recipe is a great way to build portfolio evidence that you understand real deployment concerns, not just training. Try it yourself: a minimal workflow Here is a conceptual walkthrough of the optimisation workflow using Olive. The idea is to make the mental model concrete. For exact CLI flags and options, refer to the official Olive documentation. Choose a model source. Start with a PyTorch or Hugging Face model you want to optimise. This is your input. Choose a target device. Decide where the model will run: cpu , gpu , or npu . Choose an execution provider. Pick the EP that matches your hardware, for example DirectML for Windows GPU, QNN for Qualcomm NPU, or OpenVINO for Intel. Choose a precision. Select the quantisation level: fp16 , int8 , or int4 , based on your size, speed, and quality requirements. Run the optimisation. Olive will convert, quantise, optimise, and package the model for your target. The output is a deployment-ready artefact with model files, configuration, and sample inference code. A conceptual command might look like this: # Conceptual example – refer to official docs for exact syntax olive auto-opt --model-id my-model --device cpu --provider onnxruntime --precision int8 After optimisation, validate the output. Run your evaluation benchmark on the optimised model and compare quality, latency, and model size against the original. If INT8 drops quality below your threshold, try FP16. If the model is still too large for your device, explore INT4. Iteration is expected. Key takeaways Olive bridges training and deployment by providing a single, hardware-aware optimisation toolchain that handles conversion, quantisation, optimisation, and packaging. One source model, many targets: Olive lets you optimise the same model for CPU, GPU, and NPU, expanding your deployment reach without maintaining separate pipelines. ONNX is the portability layer that decouples your training framework from your inference runtime, and Olive leverages it to generate deployment-ready packages. Precision is a design choice: FP16, INT8, and INT4 each serve different deployment constraints. Start conservative, measure quality, and compress further only when needed. Olive Recipes are your starting point: Do not build optimisation pipelines from scratch when worked examples exist. Learn from recipes, adapt them, and contribute your own. Foundry Local extends the workflow: Once your model is optimised, Foundry Local can serve it on-device via a standard API, but only if it fits a supported template (chat or whisper). Resources Microsoft Olive – GitHub repository Olive documentation Olive Recipes – GitHub repository How to compile Hugging Face models for Foundry Local Foundry Local Lab – hands-on labs (see Lab 10 for custom models) Foundry Local documentation355Views0likes0CommentsFoundry IQ: Give Your AI Agents a Knowledge Upgrade
If you’re learning to build AI agents, you’ve probably hit a familiar wall: your agent can generate text, but it doesn’t actually know anything about your data. It can’t look up your documents, search across your files, or pull facts from multiple sources to answer a real question. That’s the gap Foundry IQ fills. It gives your AI agents structured access to knowledge, so they can retrieve, reason over, and synthesize information from real data sources instead of relying on what’s baked into the model. Why Should You Care? As a student or early-career developer, understanding how AI systems work with external knowledge is one of the most valuable skills you can build right now. Retrieval-Augmented Generation (RAG), knowledge bases, and multi-source querying are at the core of every production AI application, from customer support bots to research assistants to enterprise copilots. Foundry IQ gives you a hands-on way to learn these patterns without having to build all the plumbing yourself. You define knowledge bases, connect data sources, and let your agents query them. The concepts you learn here transfer directly to real-world AI engineering roles. What is Foundry IQ? Foundry IQ is a service within Azure AI Foundry that lets you create knowledge bases, collections of connected data sources that your AI agents can query through a single endpoint. Instead of writing custom retrieval logic for every app you build, you: Define knowledge sources — connect documents, data stores, or web content (SharePoint, Azure Blob Storage, Azure AI Search, Fabric OneLake, and more). Organize them into a knowledge base — group multiple sources behind one queryable endpoint. Query from your agent — your AI agent calls the knowledge base to get the context it needs before generating a response. This approach means the knowledge layer is reusable. Build it once, and any agent or app in your project can tap into it. The IQ Series: A Three-Part Learning Path The IQ Series is a set of three weekly episodes that walk you through Foundry IQ from concept to code. Each episode includes a tech talk, visual doodle summaries, and a companion cookbook with sample code you can run yourself. 👉 Get started: https://aka.ms/iq-series Episode 1: Unlocking Knowledge for Your Agents (March 18, 2026) Start here. This episode introduces the core architecture of Foundry IQ and explains how AI agents interact with knowledge. You’ll learn what knowledge bases are, why they matter, and how the key components fit together. What you’ll learn: The difference between model knowledge and retrieved knowledge How Foundry IQ structures the retrieval layer The building blocks: knowledge sources, knowledge bases, and agent queries Episode 2: Building the Data Pipeline with Knowledge Sources (March 25, 2026) This episode goes deeper into knowledge sources, the connectors that bring data into Foundry IQ. You’ll see how different content types flow into the system and how to wire up sources from services you may already be using. What you’ll learn: How to connect sources like Azure Blob Storage, Azure AI Search, SharePoint, Fabric OneLake, and the web How content is ingested and indexed for retrieval Patterns for combining multiple source types Episode 3: Querying Multi-Source Knowledge Bases (April 1, 2026) The final episode shows you how to bring it all together. You’ll learn how agents query across multiple knowledge sources through a single knowledge base endpoint and how to synthesize answers from diverse data. What you’ll learn: How to query a knowledge base from your agent code How retrieval works across multiple connected sources Techniques for synthesizing information to answer complex questions Get Hands-On with the Cookbooks Every episode comes with a companion cookbook in the GitHub repo, complete with sample code you can clone, run, and modify. This is the fastest way to go from watching to building. 👉 Explore the repo: https://aka.ms/iq-series Inside you’ll find: Episode links — watch the tech talks and doodle recaps Cookbooks — step-by-step code samples for each episode Documentation links — official Foundry IQ docs and additional learning resources What to Build Next Once you’ve worked through the series, try applying what you’ve learned: Study assistant — connect your course materials as knowledge sources and build an agent that can answer questions across all your notes and readings. Project documentation bot — index your team’s project docs and READMEs into a knowledge base so everyone can query them naturally. Research synthesizer — connect multiple data sources (papers, web content, datasets) and build an agent that can cross-reference and summarize findings. Start Learning The IQ Series is designed to take you from zero to building knowledge-driven AI agents. Watch the episodes, run the cookbooks, and start experimenting with your own knowledge bases. 👉 https://aka.ms/iq-series446Views0likes0CommentsNasdaq builds thoughtfully designed AI for board governance with PostgreSQL on Azure
Authored by: Charles Federssen, Partner Director of Product Management for PostgreSQL at Microsoft and Mohsin Shafqat, Senior Manager, Software Engineering at Nasdaq When people think of Nasdaq, they usually think of markets, trading floors, and financial data moving at extraordinary speed. But behind the scenes, Nasdaq also plays an equally critical role in how boards of directors govern, deliberate, and make decisions. Nasdaq Boardvantage® is the company’s governance platform, used by more than 4,400 organizations worldwide—including nearly half of the Fortune 100. It’s where directors review board books, collaborate in an environment designed with robust security, and prepare for meetings that often involve some of the most sensitive information a company has. In recent years, Nasdaq set out to modernize Nasdaq Boardvantage with AI, without compromising security and reliability. That journey was featured in a Microsoft Ignite session, “Nasdaq Boardvantage: AI-Driven Governance on PostgreSQL and Foundry.” It offers a practical look at how Azure Database for PostgreSQL can support AI-driven applications where precision, isolation, and data control are non-negotiable. Introducing AI where trust is everything Board governance isn’t a typical productivity workload. Board packets can run 400 to 600 pages, meeting minutes are legal records, and any AI-generated insight must be confined to a customer’s own data. “Our customers trust us with some of their most strategic, sensitive data,” said Mohsin Shafqat, Senior Manager of Software Development at Nasdaq. That trust meant tackling several core challenges upfront, including: How do you minimize AI hallucinations in a governance context? How do you guarantee tenant isolation at scale? How do you keep data regional across a global customer base? A cloud foundation built for governance Before adding intelligence, Nasdaq decided to re-architect Nasdaq Boardvantage on Microsoft Azure, using Azure Kubernetes Service (AKS) to run containerized, multi-tenant workloads with strong isolation boundaries. Microsoft Foundry provides the managed foundation for deploying, governing, and operating AI models across this architecture, adding consistency, security, and control as intelligence is introduced. At the data layer, Azure Database for PostgreSQL and Azure Database for MySQL became the backbone for governance data. PostgreSQL, in particular, plays a central role in managing structured governance information alongside vector embeddings that support AI-driven features. Together, these services give Nasdaq the performance, security, and operational control required for a highly regulated, multi-tenant environment, while still moving quickly. Key architectural choices included: Tenant isolation by design, with separate databases and storage Regional deployments to align with data residency requirements High availability and managed operations, so teams could focus on product innovation instead of infrastructure maintenance PostgreSQL and pgvector: Powering context-aware AI With that foundation in place, Nasdaq was ready to carefully introduce AI. One of the first AI capabilities was intelligent document summarization. Board materials that once took hours to review could now be condensed into concise, contextually accurate summaries. Under the hood, this required more than just calling an LLM. Nasdaq uses pgvector, natively supported in Azure Database for PostgreSQL, to store and query embeddings generated from board documents. This allows the platform to perform hybrid searches that combine traditional SQL queries with vector similarity to retrieve the most relevant context before sending anything to a language model. Instead of treating AI as a black box, the team built a pipeline where: Documents are processed with Azure Document Intelligence to preserve structure and meaning Content is chunked and embedded Embeddings are stored in PostgreSQL with pgvector Vector similarity searches retrieve precise context for each AI task Because this runs inside PostgreSQL, the same database benefits from Azure’s built-in high availability, security controls, and operational tooling – delivering tangible results, including a 25% reduction in overall board preparation time and internal testing shows 91–97% accuracy for AI-generated summaries and meeting minutes. From summaries to an AI Board Assistant With summarization working in production, Nasdaq expanded further. The team is now building an AI-powered Board Assistant that will help directors prepare for upcoming meetings by surfacing trends, risks, and insights from prior discussions. This introduces a new level of scale. Years of board data across thousands of customers translate into millions of embeddings. PostgreSQL continues to anchor this architecture, storing vectors for semantic retrieval while MySQL supports complementary non-vector workloads. Across Nasdaq Boardvantage, users are advised to always review AI outputs, and no customer data is shared or used to train external models. “We designed AI for governance, not the other way around,” Shafqat said. More importantly, customers trust the system because security, isolation, and data control were engineered in from day one. Looking ahead Nasdaq’s work shows how Azure Database for PostgreSQL can support AI workloads that demand both intelligence and integrity. With PostgreSQL at the core, Nasdaq has built a governance platform that scales globally, respects regulatory boundaries, and introduces AI in a way that feels dependable and not experimental. What started as a modernization of Nasdaq Boardvantage is now influencing how Nasdaq approaches AI across the enterprise. To dive deeper into the architecture and hear directly from the engineers behind it, watch the Ignite session and check out these resources: Watch the Ignite breakout session for a technical walkthrough of how Nasdaq Boardvantage is built, including PostgreSQL on Azure, pgvector, and Microsoft Foundry in production. Read the case study to see how Nasdaq introduced AI into board governance and what changed for directors, administrators, and decision-making. Watch the Ignite broadcast for a candid discussion on Azure Database for PostgreSQL, Azure HorizonDB, and what it takes to scale AI-driven governance.Agentic Code Fixing with GitHub Copilot SDK and Foundry Local
Introduction AI-powered coding assistants have transformed how developers write and review code. But most of these tools require sending your source code to cloud services, a non-starter for teams working with proprietary codebases, air-gapped environments, or strict compliance requirements. What if you could have an intelligent coding agent that finds bugs, fixes them, runs your tests, and produces PR-ready summaries, all without a single byte leaving your machine? The Local Repo Patch Agent demonstrates exactly this. By combining the GitHub Copilot SDK for agent orchestration with Foundry Local for on-device inference, this project creates a fully autonomous coding workflow that operates entirely on your hardware. The agent scans your repository, identifies bugs and code smells, applies fixes, verifies them through your test suite, and generates a comprehensive summary of all changes, completely offline and secure. This article explores the architecture behind this integration, walks through the key implementation patterns, and shows you how to run the agent yourself. Whether you're building internal developer tools, exploring agentic workflows, or simply curious about what's possible when you combine GitHub's SDK with local AI, this project provides a production-ready foundation to build upon. Why Local AI Matters for Code Analysis Cloud-based AI coding tools have proven their value—GitHub Copilot has fundamentally changed how millions of developers work. But certain scenarios demand local-first approaches where code never leaves the organisation's network. Consider these real-world constraints that teams face daily: Regulatory compliance: Financial services, healthcare, and government projects often prohibit sending source code to external services, even for analysis Intellectual property protection: Proprietary algorithms and trade secrets can't risk exposure through cloud API calls Air-gapped environments: Secure facilities and classified projects have no internet connectivity whatsoever Latency requirements: Real-time code analysis in IDEs benefits from zero network roundtrip Cost control: High-volume code analysis without per-token API charges The Local Repo Patch Agent addresses all these scenarios. By running the AI model on-device through Foundry Local and using the GitHub Copilot SDK for orchestration, you get the intelligence of agentic coding workflows with complete data sovereignty. The architecture proves that "local-first" doesn't mean "capability-limited." The Technology Stack Two core technologies make this architecture possible, working together through a clever integration called BYOK (Bring Your Own Key). Understanding how they complement each other reveals the elegance of the design. GitHub Copilot SDK The GitHub Copilot SDK provides the agent runtime, the scaffolding that handles planning, tool invocation, streaming responses, and the orchestration loop that makes agentic behaviour possible. Rather than managing raw LLM API calls, developers define tools (functions the agent can call) and system prompts, and the SDK handles everything else. Key capabilities the SDK brings to this project: Session management: Maintains conversation context across multiple agent interactions Tool orchestration: Automatically invokes defined tools when the model requests them Streaming support: Real-time response streaming for responsive user interfaces Provider abstraction: Works with any OpenAI-compatible API through the BYOK configuration Foundry Local Foundry Local brings Azure AI Foundry's model catalog to your local machine. It automatically selects the best available hardware acceleration—GPU, NPU, or CP, and exposes models through an OpenAI-compatible API on localhost. Models run entirely on-device with no telemetry or data transmission. For this project, Foundry Local provides: On-device inference: All AI processing happens locally, ensuring complete data privacy Dynamic port allocation: The SDK auto-detects the Foundry Local endpoint, eliminating configuration hassle Model flexibility: Swap between models like qwen2.5-coder-1.5b , phi-3-mini , or larger variants based on your hardware OpenAI API compatibility: Standard API format means the GitHub Copilot SDK works without modification The BYOK Integration The entire connection between the GitHub Copilot SDK and Foundry Local happens through a single configuration object. This BYOK (Bring Your Own Key) pattern tells the SDK to route all inference requests to your local model instead of cloud services: const session = await client.createSession({ model: modelId, provider: { type: "openai", // Foundry Local speaks OpenAI's API format baseUrl: proxyBaseUrl, // Streaming proxy → Foundry Local apiKey: manager.apiKey, wireApi: "completions", // Chat Completions API }, streaming: true, tools: [ /* your defined tools */ ], }); This configuration is the key insight: with one config object, you've redirected an entire agent framework to run on local hardware. No code changes to the SDK, no special adapters—just standard OpenAI-compatible API communication. Architecture Overview The Local Repo Patch Agent implements a layered architecture where each component has a clear responsibility. Understanding this flow helps when extending or debugging the system. ┌─────────────────────────────────────────────────────────┐ │ Your Terminal / Web UI │ │ npm run demo / npm run ui │ └──────────────┬──────────────────────────────────────────┘ │ ┌──────────────▼──────────────────────────────────────────┐ │ src/agent.ts (this project) │ │ │ │ ┌────────────────────────────┐ ┌──────────────────┐ │ │ │ GitHub Copilot SDK │ │ Agent Tools │ │ │ │ (CopilotClient) │ │ list_files │ │ │ │ BYOK → Foundry │ │ read_file │ │ │ └────────┬───────────────────┘ │ write_file │ │ │ │ │ run_command │ │ └────────────┼───────────────────────┴──────────────────┘ │ │ │ │ JSON-RPC │ ┌────────────▼─────────────────────────────────────────────┐ │ GitHub Copilot CLI (server mode) │ │ Agent orchestration layer │ └────────────┬─────────────────────────────────────────────┘ │ POST /v1/chat/completions (BYOK) ┌────────────▼─────────────────────────────────────────────┐ │ Foundry Local (on-device inference) │ │ Model: qwen2.5-coder-1.5b via ONNX Runtime │ │ Endpoint: auto-detected (dynamic port) │ └───────────────────────────────────────────────────────────┘ The data flow works as follows: your terminal or web browser sends a request to the agent application. The agent uses the GitHub Copilot SDK to manage the conversation, which communicates with the Copilot CLI running in server mode. The CLI, configured with BYOK, sends inference requests to Foundry Local running on localhost. Responses flow back up the same path, with tool invocations happening in the agent.ts layer. The Four-Phase Workflow The agent operates through a structured four-phase loop, each phase building on the previous one's output. This decomposition transforms what would be an overwhelming single prompt into manageable, verifiable steps. Phase 1: PLAN The planning phase scans the repository and produces a numbered fix plan. The agent reads every source and test file, identifies potential issues, and outputs specific tasks to address: // Phase 1 system prompt excerpt const planPrompt = ` You are a code analysis agent. Scan the repository and identify: 1. Bugs that cause test failures 2. Code smells and duplication 3. Style inconsistencies Output a numbered list of fixes, ordered by priority. Each item should specify: file path, line numbers, issue type, and proposed fix. `; The tools available during this phase are list_files and read_file —the agent explores the codebase without modifying anything. This read-only constraint prevents accidental changes before the plan is established. Phase 2: EDIT With a plan in hand, the edit phase applies each fix by rewriting affected files. The agent receives the plan from Phase 1 and systematically addresses each item: // Phase 2 adds the write_file tool const editTools = [ { name: "write_file", description: "Write content to a file, creating or overwriting it", parameters: { type: "object", properties: { path: { type: "string", description: "File path relative to repo root" }, content: { type: "string", description: "Complete file contents" } }, required: ["path", "content"] } } ]; The write_file tool is sandboxed to the demo-repo directory, path traversal attempts are blocked, preventing the agent from modifying files outside the designated workspace. Phase 3: VERIFY After making changes, the verification phase runs the project's test suite to confirm fixes work correctly. If tests fail, the agent attempts to diagnose and repair the issue: // Phase 3 adds run_command with an allowlist const allowedCommands = ["npm test", "npm run lint", "npm run build"]; const runCommandTool = { name: "run_command", description: "Execute a shell command (npm test, npm run lint, npm run build only)", execute: async (command: string) => { if (!allowedCommands.includes(command)) { throw new Error(`Command not allowed: ${command}`); } // Execute and return stdout/stderr } }; The command allowlist is a critical security measure. The agent can only run explicitly permitted commands—no arbitrary shell execution, no data exfiltration, no system modification. Phase 4: SUMMARY The final phase produces a PR-style Markdown report documenting all changes. This summary includes what was changed, why each change was necessary, test results, and recommended follow-up actions: ## Summary of Changes ### Bug Fix: calculateInterest() in account.js - **Issue**: Division instead of multiplication caused incorrect interest calculations - **Fix**: Changed `principal / annualRate` to `principal * (annualRate / 100)` - **Tests**: 3 previously failing tests now pass ### Refactor: Duplicate formatCurrency() removed - **Issue**: Identical function existed in account.js and transaction.js - **Fix**: Both files now import from utils.js - **Impact**: Reduced code duplication, single source of truth ### Test Results - **Before**: 6/9 passing - **After**: 9/9 passing This structured output makes code review straightforward, reviewers can quickly understand what changed and why without digging through diffs. The Demo Repository: Intentional Bugs The project includes a demo-repo directory containing a small banking utility library with intentional problems for the agent to find and fix. This provides a controlled environment to demonstrate the agent's capabilities. Bug 1: Calculation Error in calculateInterest() The account.js file contains a calculation bug that causes test failures: // BUG: should be principal * (annualRate / 100) function calculateInterest(principal, annualRate) { return principal / annualRate; // Division instead of multiplication! } This bug causes 3 of 9 tests to fail. The agent identifies it during the PLAN phase by correlating test failures with the implementation, then fixes it during EDIT. Bug 2: Code Duplication The formatCurrency() function is copy-pasted in both account.js and transaction.js, even though a canonical version exists in utils.js. This duplication creates maintenance burden and potential inconsistency: // In account.js (duplicated) function formatCurrency(amount) { return '$' + amount.toFixed(2); } // In transaction.js (also duplicated) function formatCurrency(amount) { return '$' + amount.toFixed(2); } // In utils.js (canonical, but unused) export function formatCurrency(amount) { return '$' + amount.toFixed(2); } The agent identifies this duplication during planning and refactors both files to import from utils.js, eliminating redundancy. Handling Foundry Local Streaming Quirks One technical challenge the project solves is Foundry Local's behaviour with streaming requests. As of version 0.5, Foundry Local can hang on stream: true requests. The project includes a streaming proxy that works around this limitation transparently. The Streaming Proxy The streaming-proxy.ts file implements a lightweight HTTP proxy that converts streaming requests to non-streaming, then re-encodes the single response as SSE (Server-Sent Events) chunks—the format the OpenAI SDK expects: // streaming-proxy.ts simplified logic async function handleRequest(req: Request): Promise { const body = await req.json(); // If it's a streaming chat completion, convert to non-streaming if (body.stream === true && req.url.includes('/chat/completions')) { body.stream = false; const response = await fetch(foundryEndpoint, { method: 'POST', body: JSON.stringify(body), headers: { 'Content-Type': 'application/json' } }); const data = await response.json(); // Re-encode as SSE stream for the SDK return createSSEResponse(data); } // Non-streaming and non-chat requests pass through unchanged return fetch(foundryEndpoint, req); } This proxy runs on port 8765 by default and sits between the GitHub Copilot SDK and Foundry Local. The SDK thinks it's talking to a streaming-capable endpoint, while the actual inference happens non-streaming. The conversion is transparent, no changes needed to SDK configuration. Text-Based Tool Call Detection Small on-device models like qwen2.5-coder-1.5b sometimes output tool calls as JSON text rather than using OpenAI-style function calling. The SDK won't fire tool.execution_start events for these text-based calls, so the agent includes a regex-based detector: // Pattern to detect tool calls in model output const toolCallPattern = /\{[\s\S]*"name":\s*"(list_files|read_file|write_file|run_command)"[\s\S]*\}/; function detectToolCall(text: string): ToolCall | null { const match = text.match(toolCallPattern); if (match) { try { return JSON.parse(match[0]); } catch { return null; } } return null; } This fallback ensures tool calls are captured regardless of whether the model uses native function calling or text output, keeping the dashboard's tool call counter and CLI log accurate. Security Considerations Running an AI agent that can read and write files and execute commands requires careful security design. The Local Repo Patch Agent implements multiple layers of protection: 100% local execution: No code, prompts, or responses leave your machine—complete data sovereignty Command allowlist: The agent can only run npm test , npm run lint , and npm run build —no arbitrary shell commands Path sandboxing: File tools are locked to the demo-repo/ directory; path traversal attempts like ../../../etc/passwd are rejected File size limits: The read_file tool rejects files over 256 KB, preventing memory exhaustion attacks Recursion limits: Directory listing caps at 20 levels deep, preventing infinite traversal These constraints demonstrate responsible AI agent design. The agent has enough capability to do useful work but not enough to cause harm. When extending this project for your own use cases, maintain similar principles, grant minimum necessary permissions, validate all inputs, and fail closed on unexpected conditions. Running the Agent Getting the Local Repo Patch Agent running on your machine takes about five minutes. The project includes setup scripts that handle prerequisites automatically. Prerequisites Before running the setup, ensure you have: Node.js 18 or higher: Download from nodejs.org (LTS version recommended) Foundry Local: Install via winget install Microsoft.FoundryLocal (Windows) or brew install foundrylocal (macOS) GitHub Copilot CLI: Follow the GitHub Copilot CLI install guide Verify your installations: node --version # Should print v18.x.x or higher foundry --version copilot --version One-Command Setup The easiest path uses the provided setup scripts that install dependencies, start Foundry Local, and download the AI model: # Clone the repository git clone https://github.com/leestott/copilotsdk_foundrylocal.git cd copilotsdk_foundrylocal # Windows (PowerShell) .\setup.ps1 # macOS / Linux chmod +x setup.sh ./setup.sh When setup completes, you'll see: ━━━ Setup complete! ━━━ You're ready to go. Run one of these commands: npm run demo CLI agent (terminal output) npm run ui Web dashboard (http://localhost:3000) Manual Setup If you prefer step-by-step control: # Install npm packages npm install cd demo-repo && npm install --ignore-scripts && cd .. # Start Foundry Local and download the model foundry service start foundry model run qwen2.5-coder-1.5b # Copy environment configuration cp .env.example .env # Run the agent npm run demo The first model download takes a few minutes depending on your connection. After that, the model runs from cache with no internet required. Using the Web Dashboard For a visual experience with real-time streaming, launch the web UI: npm run ui Open http://localhost:3000 in your browser. The dashboard provides: Phase progress sidebar: Visual indication of which phase is running, completed, or errored Live streaming output: Model responses appear in real-time via WebSocket Tool call log: Every tool invocation logged with phase context Phase timing table: Performance metrics showing how long each phase took Environment info: Current model, endpoint, and repository path at a glance Configuration Options The agent supports several environment variables for customisation. Edit the .env file or set them directly: Variable Default Description FOUNDRY_LOCAL_ENDPOINT auto-detected Override the Foundry Local API endpoint FOUNDRY_LOCAL_API_KEY auto-detected Override the API key FOUNDRY_MODEL qwen2.5-coder-1.5b Which model to use from the Foundry Local catalog FOUNDRY_TIMEOUT_MS 180000 (3 min) How long each agent phase can run before timing out FOUNDRY_NO_PROXY — Set to 1 to disable the streaming proxy PORT 3000 Port for the web dashboard Using Different Models To try a different model from the Foundry Local catalog: # Use phi-3-mini instead FOUNDRY_MODEL=phi-3-mini npm run demo # Use a larger model for higher quality (requires more RAM/VRAM) FOUNDRY_MODEL=qwen2.5-7b npm run demo Adjusting for Slower Hardware If you're running on CPU-only or limited hardware, increase the timeout to give the model more time per phase: # 5 minutes per phase instead of 3 FOUNDRY_TIMEOUT_MS=300000 npm run demo Troubleshooting Common Issues When things don't work as expected, these solutions address the most common problems: Problem Solution foundry: command not found Install Foundry Local—see Prerequisites section copilot: command not found Install GitHub Copilot CLI—see Prerequisites section Agent times out on every phase Increase FOUNDRY_TIMEOUT_MS (e.g., 300000 for 5 min). CPU-only machines are slower. Port 3000 already in use Set PORT=3001 npm run ui Model download is slow First download can take 5-10 min. Subsequent runs use the cache. Cannot find module errors Run npm install again, then cd demo-repo && npm install --ignore-scripts Tests still fail after agent runs The agent edits files in demo-repo/. Reset with git checkout demo-repo/ and run again. PowerShell blocks setup.ps1 Run Set-ExecutionPolicy -Scope Process Bypass first, then .\setup.ps1 Diagnostic Test Scripts The src/tests/ folder contains standalone scripts for debugging SDK and Foundry Local integration issues. These are invaluable when things go wrong: # Debug-level SDK event logging npx tsx src/tests/test-debug.ts # Test non-streaming inference (bypasses streaming proxy) npx tsx src/tests/test-nostream.ts # Raw fetch to Foundry Local (bypasses SDK entirely) npx tsx src/tests/test-stream-direct.ts # Start the traffic-inspection proxy npx tsx src/tests/test-proxy.ts These scripts isolate different layers of the stack, helping identify whether issues lie in Foundry Local, the streaming proxy, the SDK, or your application code. Key Takeaways BYOK enables local-first AI: A single configuration object redirects the entire GitHub Copilot SDK to use on-device inference through Foundry Local Phased workflows improve reliability: Breaking complex tasks into PLAN → EDIT → VERIFY → SUMMARY phases makes agent behaviour predictable and debuggable Security requires intentional design: Allowlists, sandboxing, and size limits constrain agent capabilities to safe operations Local models have quirks: The streaming proxy and text-based tool detection demonstrate how to work around on-device model limitations Real-time feedback matters: The web dashboard with WebSocket streaming makes agent progress visible and builds trust in the system The architecture is extensible: Add new tools, change models, or modify phases to adapt the agent to your specific needs Conclusion and Next Steps The Local Repo Patch Agent proves that sophisticated agentic coding workflows don't require cloud infrastructure. By combining the GitHub Copilot SDK's orchestration capabilities with Foundry Local's on-device inference, you get intelligent code analysis that respects data sovereignty completely. The patterns demonstrated here, BYOK integration, phased execution, security sandboxing, and streaming workarounds, transfer directly to production systems. Consider extending this foundation with: Custom tool sets: Add database queries, API calls to internal services, or integration with your CI/CD pipeline Multiple repository support: Scan and fix issues across an entire codebase or monorepo Different model sizes: Use smaller models for quick scans, larger ones for complex refactoring Human-in-the-loop approval: Add review steps before applying fixes to production code Integration with Git workflows: Automatically create branches and PRs from agent-generated fixes Clone the repository, run through the demo, and start building your own local-first AI coding tools. The future of developer AI isn't just cloud—it's intelligent systems that run wherever your code lives. Resources Local Repo Patch Agent Repository – Full source code with setup scripts and documentation Foundry Local – Official site for on-device AI inference Foundry Local GitHub Repository – Installation instructions and CLI reference Foundry Local Get Started Guide – Official Microsoft Learn documentation Foundry Local SDK Reference – Python and JavaScript SDK documentation GitHub Copilot SDK – Official SDK repository GitHub Copilot SDK BYOK Documentation – Bring Your Own Key integration guide GitHub Copilot SDK Getting Started – SDK setup and first agent tutorial Microsoft Sample: Copilot SDK + Foundry Local – Official integration sample from Microsoft1.9KViews0likes0CommentsMissing equivalent for Python MemorySearchTool and AgentMemorySettings in C# SDK
Hi Team, I am currently working with the Azure AI Foundry Agent Service (preview). I’ve been reviewing the documentation for managed long-term memory, specifically the "Automatic User Memory" features demonstrated in the Python SDK here: https://learn.microsoft.com/en-us/azure/ai-foundry/agents/how-to/memory-usage?view=foundry&tabs=python. In Python, it is very straightforward to attach a MemorySearchTool to an agent and use AgentMemorySettings(scope="user_123") during a run. This allows the service to automatically extract, consolidate, and retrieve memories without manual intervention. However, in the https://github.com/Azure/azure-sdk-for-net/tree/main/sdk/ai/Azure.AI.Projects#memory-store-operations, I only see the low-level MemoryStoreClient which appears to require manual CRUD operations on memory items. My Questions: Is there an equivalent high-level AgentMemorySearchTool or similar abstraction in the current C# NuGet package (Azure.AI.Projects) that handles automatic extraction and retrieval? If not currently available, is this feature on the immediate roadmap for the .NET SDK? Are there any samples showing how to achieve "automatic" memory (where the Agent extracts facts itself) using the C# SDK without having to build a custom orchestration layer or call REST APIs directly? Any guidance on the timeline for feature parity between the Python and .NET SDKs regarding Agent Memory would be greatly appreciated. SDK Version: Azure.AI.Projects 1.2.0-beta.574Views0likes1Comment