<?xml version="1.0" encoding="UTF-8"?>
<rss xmlns:content="http://purl.org/rss/1.0/modules/content/" xmlns:dc="http://purl.org/dc/elements/1.1/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#" xmlns:taxo="http://purl.org/rss/1.0/modules/taxonomy/" version="2.0">
  <channel>
    <title>Educator Developer Blog articles</title>
    <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/bg-p/EducatorDeveloperBlog</link>
    <description>Educator Developer Blog articles</description>
    <pubDate>Thu, 16 Apr 2026 15:32:43 GMT</pubDate>
    <dc:creator>EducatorDeveloperBlog</dc:creator>
    <dc:date>2026-04-16T15:32:43Z</dc:date>
    <item>
      <title>Build and Deploy a Microsoft Foundry Hosted Agent: A Hands-On Workshop</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/build-and-deploy-a-microsoft-foundry-hosted-agent-a-hands-on/ba-p/4508426</link>
      <description>&lt;ARTICLE&gt;
&lt;SECTION&gt;
&lt;P&gt;Agents are easy to demo, hard to ship.&lt;/P&gt;
&lt;P&gt;Most teams can put together a convincing prototype quickly. The harder part starts afterwards: shaping deterministic tools, validating behaviour with tests, building a CI path, packaging for deployment, and proving the experience through a user-facing interface. That is where many promising projects slow down.&lt;/P&gt;
&lt;P&gt;This workshop helps you close that gap without unnecessary friction. You get a guided path from local run to deployment handoff, then complete the journey with a working chat UI that calls your deployed hosted agent through the project endpoint.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;What You Will Build&lt;/H2&gt;
&lt;P&gt;This is a hands-on, end-to-end workshop for building and deploying AI agents with Microsoft Foundry. It provides a guided, practical journey through hosted-agent development, including deterministic tool design, prompt-guided workflows, CI validation, deployment preparation, and UI integration, with a ready-to-run setup that keeps friction low.&lt;/P&gt;
&lt;P&gt;The lab is built around prompt-based development, using Copilot guidance and MCP-assisted workflow options during deployment. The implementation is .NET 10 and covers local development, Copilot-assisted coding, CI, secure deployment to Azure, and a working chat UI. By the end, you will have:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A local hosted agent that responds through the Responses API contract&lt;/LI&gt;
&lt;LI&gt;Deterministic tool improvements in core logic with xUnit coverage&lt;/LI&gt;
&lt;LI&gt;A GitHub Actions CI workflow for restore, build, test, and container validation&lt;/LI&gt;
&lt;LI&gt;An Azure-ready deployment path using azd, ACR image publishing, and Foundry manifest apply&lt;/LI&gt;
&lt;LI&gt;A Blazor chat UI that calls openai/v1/responses with agent_reference&lt;/LI&gt;
&lt;LI&gt;A repeatable implementation shape that teams can adapt to real projects&lt;/LI&gt;
&lt;/UL&gt;
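&lt;P&gt;To make the last integration step concrete: the chat UI sends a small JSON body to the project endpoint. The sketch below is Python rather than the workshop's .NET, and while the article confirms the &lt;CODE&gt;openai/v1/responses&lt;/CODE&gt; path and the &lt;CODE&gt;agent_reference&lt;/CODE&gt; field, the surrounding payload layout here is an assumption; treat the Foundry docs as authoritative.&lt;/P&gt;

```python
import json

def build_responses_payload(agent_name: str, user_text: str) -> str:
    # Hypothetical request body for the openai/v1/responses endpoint.
    # The endpoint and agent_reference come from the article; the exact
    # field layout is an assumption, not the documented schema.
    payload = {
        "agent": {"type": "agent_reference", "name": agent_name},
        "input": user_text,
    }
    return json.dumps(payload)

print(build_responses_payload("workshop-agent", "Summarise the readiness report."))
```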
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Who This Lab Is For&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;AI developers and software engineers who prefer learning by building&lt;/LI&gt;
&lt;LI&gt;Motivated beginners who want a guided, step-by-step path&lt;/LI&gt;
&lt;LI&gt;Experienced developers who want a practical hosted-agent reference implementation&lt;/LI&gt;
&lt;LI&gt;Architects evaluating deployment shape, validation strategy, and operational readiness&lt;/LI&gt;
&lt;LI&gt;Technical decision-makers who need to see how demos become deployable systems&lt;/LI&gt;
&lt;/UL&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Why Hosted Agents&lt;/H2&gt;
&lt;P&gt;Hosted agents run your code in a managed environment. That matters because it reduces the amount of infrastructure plumbing you need to manage directly, while giving you a clearer path to secure, observable, team-friendly deployments.&lt;/P&gt;
&lt;P&gt;Prompt-only demos are still useful. They are quick, excellent for ideation, and often the right place to start. Hosted agents complement that approach when you need custom code, tool-backed logic, and a deployment process that can be repeated by a team.&lt;/P&gt;
&lt;P&gt;Think of this lab as the bridge: you keep the speed of prompt-based iteration, then layer in the real-world patterns needed to run reliably.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;What You Will Learn&lt;/H2&gt;
&lt;H3&gt;1) Orchestration&lt;/H3&gt;
&lt;P&gt;You will practise workflow-oriented reasoning through implementation-shape recommendations and multi-step readiness scenarios. The lab introduces orchestration concepts at a practical level, rather than as a dedicated orchestration framework deep dive.&lt;/P&gt;
&lt;H3&gt;2) Tool Integration&lt;/H3&gt;
&lt;P&gt;You will connect deterministic tools and understand how tool calls fit into predictable execution paths. This is a core focus of the workshop and is backed by tests in the solution.&lt;/P&gt;
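&lt;P&gt;The workshop's tools are written in C# and verified with xUnit; the same idea fits in a few lines of any language. The tool name and scoring rule below are invented for illustration, expressed in Python for brevity, but the point carries over: a deterministic tool returns the same output for the same input, so a plain assertion can pin its behaviour.&lt;/P&gt;

```python
def readiness_score(checks: dict) -> int:
    # Invented deterministic tool: same input always gives the same output.
    # The workshop expresses this pattern in C# with xUnit tests.
    if not checks:
        return 0
    passed = sum(1 for ok in checks.values() if ok)
    return round(100 * passed / len(checks))

# A plain assertion pins the behaviour, as the xUnit suite does in the lab.
assert readiness_score({"ci": True, "tests": True, "secrets": False}) == 67
```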
&lt;H3&gt;3) Retrieval Patterns (What This Lab Covers Today)&lt;/H3&gt;
&lt;P&gt;This workshop does not include a full RAG implementation with embeddings and vector search. Instead, it focuses on deterministic local tools and hosted-agent response flow, giving you a strong foundation before adding retrieval infrastructure in a follow-on phase.&lt;/P&gt;
&lt;H3&gt;4) Observability&lt;/H3&gt;
&lt;P&gt;You will see light observability foundations through OpenTelemetry usage in the host and practical verification during local and deployed checks. This is introductory coverage intended to support debugging and confidence building.&lt;/P&gt;
&lt;H3&gt;5) Responsible AI&lt;/H3&gt;
&lt;P&gt;You will apply production-minded safety basics, including secure secret handling and review hygiene. A full Responsible AI policy and evaluation framework is not the primary goal of this workshop, but the workflow does encourage safe habits from the start.&lt;/P&gt;
&lt;H3&gt;6) Secure Deployment Path&lt;/H3&gt;
&lt;P&gt;You will move from local implementation to Azure deployment with a secure, practical workflow: azd provisioning, ACR publishing, manifest deployment, hosted-agent start, status checks, and endpoint validation.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;The Learning Journey&lt;/H2&gt;
&lt;P&gt;The overall flow is simple and memorable: clone, open, run, iterate, deploy, observe.&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;clone -&amp;gt; open -&amp;gt; run -&amp;gt; iterate -&amp;gt; deploy -&amp;gt; observe&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;You are not expected to memorize every command. The lab is structured to help you learn through small, meaningful wins that build confidence.&lt;/P&gt;
&lt;H3&gt;Your First 15 Minutes: Quick Wins&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Open the repo and understand the lab structure in a few minutes&lt;/LI&gt;
&lt;LI&gt;Set project endpoint and model deployment environment variables&lt;/LI&gt;
&lt;LI&gt;Run the host locally and validate the responses endpoint&lt;/LI&gt;
&lt;LI&gt;Inspect the deterministic tools in WorkshopLab.Core&lt;/LI&gt;
&lt;LI&gt;Run tests and see how behaviour changes are verified&lt;/LI&gt;
&lt;LI&gt;Review the deployment path so local work maps to Azure steps&lt;/LI&gt;
&lt;LI&gt;Understand how the UI validates end-to-end behaviour after deployment&lt;/LI&gt;
&lt;LI&gt;Leave the first session with a working baseline and a clear next step&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;That first checkpoint is important. Once you see a working loop on your own machine, the rest of the workshop becomes much easier to finish.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Using Copilot and MCP in the Workflow&lt;/H2&gt;
&lt;P&gt;This lab emphasises prompt-based development patterns that help you move faster while still learning the underlying architecture. You are not only writing code, you are learning to describe intent clearly, inspect generated output, and iterate with discipline.&lt;/P&gt;
&lt;P&gt;Copilot supports implementation and review in the coding labs. MCP appears as a practical deployment option for hosted-agent lifecycle actions, provided your tools are authenticated to the correct tenant and project context.&lt;/P&gt;
&lt;P&gt;Together, this creates a development rhythm that is especially useful for learning:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Define intent with clear prompts&lt;/LI&gt;
&lt;LI&gt;Generate or adjust implementation details&lt;/LI&gt;
&lt;LI&gt;Validate behaviour through tests and UI checks&lt;/LI&gt;
&lt;LI&gt;Deploy and observe outcomes in Azure&lt;/LI&gt;
&lt;LI&gt;Refine based on evidence, not guesswork&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;That same rhythm transfers well to real projects. Even if your production environment differs, the patterns from this workshop are adaptable.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Production-Minded Tips&lt;/H2&gt;
&lt;P&gt;As you complete the lab, keep a production mindset from day one:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Reliability: keep deterministic logic small, testable, and explicit&lt;/LI&gt;
&lt;LI&gt;Security: treat secrets, identity, and access boundaries as first-class concerns&lt;/LI&gt;
&lt;LI&gt;Observability: use telemetry and status checks to speed up debugging&lt;/LI&gt;
&lt;LI&gt;Governance: keep deployment steps explicit so teams can review and repeat them&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;You do not need to solve everything in one pass. The goal is to build habits that make your agent projects safer and easier to evolve.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Start Today&lt;/H2&gt;
&lt;P&gt;If you have been waiting for the right time to move from “interesting demo” to “practical implementation”, this is the moment. The workshop is structured for self-study, and the steps are designed to keep your momentum high.&lt;/P&gt;
&lt;P&gt;Start here: &lt;A href="https://github.com/microsoft/Hosted_Agents_Workshop_Lab" target="_blank" rel="noopener"&gt;https://github.com/microsoft/Hosted_Agents_Workshop_Lab&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Want deeper documentation while you go? These official guides are great companions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/foundry/agents/quickstarts/quickstart-hosted-agent" target="_blank" rel="noopener"&gt;Hosted agent quickstart&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/foundry/agents/how-to/deploy-hosted-agent" target="_blank" rel="noopener"&gt;Hosted agent deployment guide&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;When you finish, share what you built. Post a screenshot or short write-up in a GitHub issue or discussion, on social media, or in the comments, along with one lesson learned. Your example can help the next developer get unstuck faster.&lt;/P&gt;
&lt;H3&gt;Copy/Paste Progress Checklist&lt;/H3&gt;
&lt;PRE&gt;&lt;CODE&gt;[ ] Clone the workshop repo
[ ] Complete local setup and run the agent
[ ] Make one prompt-based behaviour change
[ ] Validate with tests and chat UI
[ ] Run CI checks
[ ] Provision and deploy via Azure and Foundry workflow
[ ] Review observability signals and refine
[ ] Share what I built + one takeaway&lt;/CODE&gt;&lt;/PRE&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Common Questions&lt;/H2&gt;
&lt;H3&gt;How long does it take?&lt;/H3&gt;
&lt;P&gt;Most developers can complete a meaningful pass in a few focused sessions of 60-75 minutes each. You can reach the first local success quickly, then continue through deployment and refinement at your own pace.&lt;/P&gt;
&lt;H3&gt;Do I need an Azure subscription?&lt;/H3&gt;
&lt;P&gt;Yes, for provisioning and deployment steps. You can still begin local development and testing before completing all Azure activities.&lt;/P&gt;
&lt;H3&gt;Is it beginner-friendly?&lt;/H3&gt;
&lt;P&gt;Yes. The labs are written for beginners, run in sequence, and include expected outcomes for each stage.&lt;/P&gt;
&lt;H3&gt;Can I adapt it beyond .NET?&lt;/H3&gt;
&lt;P&gt;Yes. The implementation in this workshop is .NET 10, but the architecture and development patterns can be adapted to other stacks.&lt;/P&gt;
&lt;H3&gt;What if I am evaluating for a team?&lt;/H3&gt;
&lt;P&gt;This lab is a strong team evaluation asset because it demonstrates end-to-end flow: local dev, integration patterns, CI, secure deployment, and operational visibility.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Closing&lt;/H2&gt;
&lt;P&gt;This workshop gives you more than theory. It gives you a practical path from first local run to deployed hosted agent, backed by tests, CI, and a user-facing UI validation loop. If you want a build-first route into Microsoft Foundry hosted-agent development, this is an excellent place to start.&lt;/P&gt;
&lt;P&gt;Begin now: &lt;A href="https://github.com/microsoft/Hosted_Agents_Workshop_Lab" target="_blank" rel="noopener"&gt;https://github.com/microsoft/Hosted_Agents_Workshop_Lab&lt;/A&gt;&lt;/P&gt;
&lt;/SECTION&gt;
&lt;/ARTICLE&gt;</description>
      <pubDate>Fri, 03 Apr 2026 11:25:45 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/build-and-deploy-a-microsoft-foundry-hosted-agent-a-hands-on/ba-p/4508426</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-04-03T11:25:45Z</dc:date>
    </item>
    <item>
      <title>Getting Started with Foundry Local: A Student Guide to the Microsoft Foundry Local Lab</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/getting-started-with-foundry-local-a-student-guide-to-the/ba-p/4503604</link>
<description>&lt;P&gt;If you want to start building AI applications on your own machine, the&amp;nbsp;&lt;A href="https://github.com/microsoft-foundry/foundry-local-lab" target="_blank" rel="noopener"&gt;Microsoft Foundry Local Lab&lt;/A&gt; is one of the most useful places to begin. It is a practical workshop that takes you from first-time setup through to agents, retrieval, evaluation, speech transcription, tool calling, and a browser-based interface. The material is hands-on, cross-language, and designed to show how modern AI apps can run locally rather than depending on a cloud service for every step.&lt;/P&gt;
&lt;P&gt;This blog post is aimed at students, self-taught developers, and anyone learning how AI applications are put together in practice. Instead of treating large language models as a black box, the lab shows you how to install and manage local models, connect to them with code, structure tasks into workflows, and test whether the results are actually good enough. If you have been looking for a learning path that feels more like building real software and less like copying isolated snippets, this workshop is a strong starting point.&lt;/P&gt;
&lt;H2&gt;What Is Foundry Local?&lt;/H2&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://foundrylocal.ai" target="_blank" rel="noopener"&gt;Foundry Local&lt;/A&gt; is a local runtime for downloading, managing, and serving AI models on your own hardware. It exposes an OpenAI-compatible interface, which means you can work with familiar SDK patterns while keeping execution on your device. For learners, that matters for three reasons. First, it lowers the barrier to experimentation because you can run projects without setting up a cloud account for every test. Second, it helps you understand the moving parts behind AI applications, including model lifecycle, local inference, and application architecture. Third, it encourages privacy-aware development because the examples are designed to keep data on the machine wherever possible.&lt;/P&gt;
&lt;P&gt;The Foundry Local Lab uses that local-first approach to teach the full journey from simple prompts to multi-agent systems. It includes examples in Python, JavaScript, and C#, so you can follow the language that fits your course, your existing skills, or the platform you want to build on.&lt;/P&gt;
&lt;H2&gt;Why This Lab Works Well for Learners&lt;/H2&gt;
&lt;P&gt;A lot of AI tutorials stop at the moment a model replies to a prompt. That is useful for a first demo, but it does not teach you how to build a proper application. The Foundry Local Lab goes further. It is organised as a sequence of parts, each one adding a new idea and giving you working code to explore. You do not just ask a model to respond. You learn how to manage the service, choose a language SDK, construct retrieval pipelines, build agents, evaluate outputs, and expose the result through a usable interface.&lt;/P&gt;
&lt;P&gt;That sequence is especially helpful for students because the parts build on each other. Early labs focus on confidence and setup. Middle labs focus on architecture and patterns. Later labs move into more advanced ideas that are common in real projects, such as tool calling, evaluation, and custom model packaging. By the end, you have seen not just what a local AI app looks like, but how its different layers fit together.&lt;/P&gt;
&lt;H2&gt;Before You Start&lt;/H2&gt;
&lt;P&gt;The workshop expects a reasonably modern machine and at least one programming language environment. The core prerequisites are straightforward: install Foundry Local, clone the repository, and choose whether you want to work in Python, JavaScript, or C#. You do not need to master all three. In fact, most learners will get more value by picking one language first, completing the full path in that language, and only then comparing how the same patterns look elsewhere.&lt;/P&gt;
&lt;P&gt;If you are new to AI development, do not be put off by the number of parts. The early sections are accessible, and the later ones become much easier once you have completed the foundations. Think of the lab as a structured course rather than a single tutorial.&lt;/P&gt;
&lt;H2&gt;What You Learn in Each Lab&lt;/H2&gt;
&lt;P&gt;Repository: &lt;A class="lia-external-url" href="https://github.com/microsoft-foundry/foundry-local-lab" target="_blank" rel="noopener"&gt;https://github.com/microsoft-foundry/foundry-local-lab&lt;/A&gt;&lt;/P&gt;
&lt;H3&gt;Part 1: Getting Started with Foundry Local&lt;/H3&gt;
&lt;P&gt;The first part introduces the basics of Foundry Local and gets you up and running. You learn how to install the CLI, inspect the model catalogue, download a model, and run it locally. This part also introduces practical details such as model aliases and dynamic service ports, which are small but important pieces of real development work.&lt;/P&gt;
&lt;P&gt;For students, the value of this part is confidence. You prove that local inference works on your machine, you see how the service behaves, and you learn the operational basics before writing any application code. By the end of Part 1, you should understand what Foundry Local does, how to start it, and how local model serving fits into an application workflow.&lt;/P&gt;
&lt;H3&gt;Part 2: Foundry Local SDK Deep Dive&lt;/H3&gt;
&lt;P&gt;Once the CLI makes sense, the workshop moves into the SDK. This part explains why application developers often use the SDK instead of relying only on terminal commands. You learn how to manage the service programmatically, browse available models, control model download and loading, and understand model metadata such as aliases and hardware-aware selection.&lt;/P&gt;
&lt;P&gt;This is where learners start to move from using a tool to building with a platform. You begin to see the difference between running a model manually and integrating it into software. By the end of this section, you should understand the API surface you will use in your own projects and know how to bootstrap the SDK in Python, JavaScript, or C#.&lt;/P&gt;
&lt;H3&gt;Part 3: SDKs and APIs&lt;/H3&gt;
&lt;P&gt;Part 3 turns the SDK concepts into a working chat application. You connect code to the local inference server and use the OpenAI-compatible API for streaming chat completions. The lab includes examples in all three supported languages, which makes it especially useful if you are comparing ecosystems or learning how the same idea is expressed through different syntax and libraries.&lt;/P&gt;
&lt;P&gt;The key learning outcome here is not just that you can get a response from a model. It is that you understand the boundary between your application and the local model service. You learn how messages are structured, how streaming works, and how to write the sort of integration code that becomes the foundation for every later lab.&lt;/P&gt;
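&lt;P&gt;The message structure and streaming assembly can be sketched without a running service. The snippet below is plain Python with no network calls, and the content strings are invented; it only shows the shape of an OpenAI-style conversation and how a client accumulates streamed deltas:&lt;/P&gt;

```python
# OpenAI-style chat messages: a list of role/content dictionaries.
messages = [
    {"role": "system", "content": "You are a concise assistant."},
    {"role": "user", "content": "What is Foundry Local?"},
]

def assemble_stream(chunks: list) -> str:
    # Streaming replies arrive as small text deltas; the client
    # concatenates them into the final answer as they come in.
    reply = ""
    for delta in chunks:
        reply += delta
    return reply

print(assemble_stream(["Foundry ", "Local ", "runs models on-device."]))
```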
&lt;H3&gt;Part 4: Retrieval-Augmented Generation&lt;/H3&gt;
&lt;P&gt;This is where the workshop starts to feel like modern AI engineering rather than basic prompting. In the retrieval-augmented generation lab, you build a simple RAG pipeline that grounds answers in supplied data. You work with an in-memory knowledge base, apply retrieval logic, score matches, and compose prompts that include grounded context.&lt;/P&gt;
&lt;P&gt;For learners, this part is important because it demonstrates a core truth of AI app development: a model on its own is often not enough. Useful applications usually need access to documents, notes, or structured information. By the end of Part 4, you understand why retrieval matters, how to pass retrieved context into a prompt, and how a pipeline can make answers more relevant and reliable.&lt;/P&gt;
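&lt;P&gt;A minimal sketch of that pipeline, assuming simple keyword-overlap scoring in place of the embeddings a production system would use (the knowledge-base strings are invented):&lt;/P&gt;

```python
import re

def tokens(text: str) -> set:
    # Lowercased word tokens; a real pipeline would use embeddings instead.
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def retrieve(query: str, docs: list, top_k: int = 1) -> list:
    # Score each document by keyword overlap with the query.
    q = tokens(query)
    scored = sorted(docs, key=lambda d: len(q.intersection(tokens(d))), reverse=True)
    return scored[:top_k]

def grounded_prompt(query: str, docs: list) -> str:
    # Compose a prompt that includes the retrieved context, so the model
    # answers from supplied data rather than from memory alone.
    context = "\n".join(retrieve(query, docs))
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

kb = [
    "Foundry Local serves models on-device.",
    "Paris is the capital of France.",
]
print(grounded_prompt("What is Foundry Local?", kb))
```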
&lt;H3&gt;Part 5: Building AI Agents&lt;/H3&gt;
&lt;P&gt;Part 5 introduces the concept of an agent. Instead of a one-off prompt and response, you begin to define behaviour through system instructions, roles, and conversation state. The lab uses the ChatAgent pattern and the Microsoft Agent Framework to show how an agent can maintain a purpose, respond with a persona, and return structured output such as JSON.&lt;/P&gt;
&lt;P&gt;This part helps learners understand the difference between a raw model call and a reusable application component. You learn how to design instructions that shape behaviour, how multi-turn interaction differs from single prompts, and why structured output matters when an AI component has to work inside a broader system.&lt;/P&gt;
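&lt;P&gt;The shape of that pattern can be sketched in plain Python. This is an illustrative stand-in, not the Microsoft Agent Framework API; it only shows how fixed instructions, accumulated conversation state, and structured output fit together:&lt;/P&gt;

```python
import json
from dataclasses import dataclass, field

@dataclass
class ChatAgentSketch:
    # Stand-in for the ChatAgent idea: fixed instructions, accumulated
    # conversation state, structured output. Not the real framework API.
    instructions: str
    history: list = field(default_factory=list)

    def run(self, user_text: str, model_reply: str) -> dict:
        # A real agent would call the model here; the reply is passed in
        # so the state handling stays visible and testable.
        self.history.append({"role": "user", "content": user_text})
        self.history.append({"role": "assistant", "content": model_reply})
        return {"answer": model_reply, "turns": len(self.history) // 2}

agent = ChatAgentSketch(instructions="Reply as a helpful tutor, in JSON.")
print(json.dumps(agent.run("What is an agent?", "A reusable AI component.")))
```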
&lt;H3&gt;Part 6: Multi-Agent Workflows&lt;/H3&gt;
&lt;P&gt;Once a single agent makes sense, the workshop expands the idea into a multi-agent workflow. The example pipeline uses roles such as researcher, writer, and editor, with outputs passed from one stage to the next. You explore sequential orchestration, shared configuration, and feedback loops between specialised components.&lt;/P&gt;
&lt;P&gt;For students, this lab is a very clear introduction to decomposition. Instead of asking one model to do everything at once, you break a task into smaller responsibilities. That pattern is useful well beyond AI. By the end of Part 6, you should understand why teams build multi-agent systems, how hand-offs are structured, and what trade-offs appear when more components are added to a workflow.&lt;/P&gt;
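&lt;P&gt;The hand-off structure can be sketched as a sequence of stages, each consuming the previous stage's output. In the lab each stage is an agent backed by a local model; here plain functions stand in so the orchestration shape stays visible:&lt;/P&gt;

```python
def researcher(topic: str) -> str:
    return f"notes on {topic}"

def writer(notes: str) -> str:
    return f"draft based on {notes}"

def editor(draft: str) -> str:
    return f"polished {draft}"

def pipeline(topic: str) -> str:
    # Sequential orchestration: each specialised stage consumes the
    # previous stage's output, mirroring the researcher/writer/editor flow.
    result = topic
    for stage in (researcher, writer, editor):
        result = stage(result)
    return result

print(pipeline("local AI inference"))
```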
&lt;H3&gt;Part 7: Zava Creative Writer Capstone Application&lt;/H3&gt;
&lt;P&gt;The Zava Creative Writer is the capstone project that brings the earlier ideas together into a more production-style application. It uses multiple specialised agents, structured JSON hand-offs, product catalogue search, streaming output, and evaluation-style feedback loops. Rather than showing an isolated feature, this part shows how separate patterns combine into a complete system.&lt;/P&gt;
&lt;P&gt;This is one of the most valuable parts of the workshop for learner developers because it narrows the gap between tutorial code and real application design. You can see how orchestration, agent roles, and practical interfaces fit together. By the end of Part 7, you should be able to recognise the architecture of a serious local AI app and understand how the earlier labs support it.&lt;/P&gt;
&lt;H3&gt;Part 8: Evaluation-Led Development&lt;/H3&gt;
&lt;P&gt;Many beginner AI projects stop once the output looks good once or twice. This lab teaches a much stronger habit: evaluation-led development. You work with golden datasets, rule-based checks, and LLM-as-judge scoring to compare prompt or agent variants systematically. The goal is to move from anecdotal testing to repeatable assessment.&lt;/P&gt;
&lt;P&gt;This matters enormously for students because evaluation is one of the clearest differences between a classroom demo and dependable software. By the end of Part 8, you should understand how to define success criteria, compare outputs at scale, and use evidence rather than intuition when improving an AI component.&lt;/P&gt;
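&lt;P&gt;A minimal sketch of a rule-based check over a golden dataset (both the examples and the stand-in answer function are invented; the lab adds LLM-as-judge scoring on top of this foundation):&lt;/P&gt;

```python
golden = [
    {"input": "2+2", "must_contain": "4"},
    {"input": "capital of France", "must_contain": "Paris"},
]

def evaluate(answer_fn, dataset) -> float:
    # Rule-based check: each golden example names a substring the answer
    # must contain; the score is the pass rate across the whole set.
    passed = sum(1 for ex in dataset if ex["must_contain"] in answer_fn(ex["input"]))
    return passed / len(dataset)

# An invented, deliberately weak answer function that only handles arithmetic:
score = evaluate(lambda q: "4" if q == "2+2" else "unsure", golden)
print(score)  # 0.5 -- evidence, not intuition, shows where to improve
```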
&lt;H3&gt;Part 9: Voice Transcription with Whisper&lt;/H3&gt;
&lt;P&gt;Part 9 broadens the workshop beyond text generation by introducing speech-to-text with Whisper running locally. You use the Foundry Local SDK to download and load the model, then transcribe local audio files through the compatible API surface. The emphasis is on privacy-first processing, with audio kept on-device.&lt;/P&gt;
&lt;P&gt;This section is a useful reminder that local AI development is not limited to chatbots. Learners see how a different modality fits into the same ecosystem and how local execution supports sensitive workloads. By the end of this lab, you should understand the transcription flow, the relevant client methods, and how speech features can be integrated into broader applications.&lt;/P&gt;
&lt;H3&gt;Part 10: Using Custom or Hugging Face Models&lt;/H3&gt;
&lt;P&gt;After learning the standard path, the workshop shows how to work with custom or Hugging Face models. This includes compiling models into optimised ONNX format with ONNX Runtime GenAI, choosing hardware-specific options, applying quantisation strategies, creating configuration files, and adding compiled models to the Foundry Local cache.&lt;/P&gt;
&lt;P&gt;For learner developers, this part opens the door to model engineering rather than simple model consumption. You begin to understand that model choice, optimisation, and packaging affect performance and usability. By the end of Part 10, you should have a clearer picture of how models move from an external source into a runnable local setup and why deployment format matters.&lt;/P&gt;
&lt;H3&gt;Part 11: Tool Calling with Local Models&lt;/H3&gt;
&lt;P&gt;Tool calling is one of the most practical patterns in current AI development, and this lab covers it directly. You define tool schemas, allow the model to request function calls, handle the multi-turn interaction loop, execute the tools locally, and return results back to the model. The examples include practical scenarios such as weather and population tools.&lt;/P&gt;
&lt;P&gt;This lab teaches learners how to move beyond generation into action. A model is no longer limited to producing text. It can decide when external data or a function is needed and incorporate that result into a useful answer. By the end of Part 11, you should understand the tool-calling flow and how AI systems connect reasoning with deterministic software behaviour.&lt;/P&gt;
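&lt;P&gt;The host side of that loop can be sketched as a dispatch table: the model emits a structured request naming a tool, and the application executes the matching local function. The request format below is simplified for illustration; the real API uses the OpenAI tool-calling schema, and the weather tool is an invented stand-in:&lt;/P&gt;

```python
import json

def get_weather(city: str) -> str:
    # Invented deterministic stand-in for a real weather lookup.
    return f"sunny in {city}"

TOOLS = {"get_weather": get_weather}

def handle_tool_call(model_message: str) -> str:
    # The model emits a structured request naming a tool; the host runs
    # the matching local function and returns the result for the next turn.
    request = json.loads(model_message)
    tool = TOOLS[request["name"]]
    return tool(**request["arguments"])

print(handle_tool_call('{"name": "get_weather", "arguments": {"city": "Leeds"}}'))
```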
&lt;H3&gt;Part 12: Building a Web UI for the Zava Creative Writer&lt;/H3&gt;
&lt;P&gt;Part 12 adds a browser-based front end to the capstone application. You learn how to serve a shared interface from Python, JavaScript, or C#, stream updates to the browser, consume NDJSON with the Fetch API and ReadableStream, and show live agent status as content is produced in real time.&lt;/P&gt;
&lt;P&gt;This part is especially good for students who want to build portfolio projects. It turns backend orchestration into something visible and interactive. By the end of Part 12, you should understand how to connect a local AI backend to a web interface and how streaming changes the user experience compared with waiting for one final response.&lt;/P&gt;
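&lt;P&gt;The lab's browser code consumes NDJSON with the Fetch API and ReadableStream; the framing itself, one JSON object per line, is easy to parse in any language. A Python sketch with invented event fields:&lt;/P&gt;

```python
import json

def parse_ndjson(stream_text: str) -> list:
    # NDJSON framing: one JSON object per line. The lab's browser code
    # reads this with fetch and ReadableStream; any language can parse it.
    events = []
    for line in stream_text.splitlines():
        if line.strip():
            events.append(json.loads(line))
    return events

raw = '{"agent": "writer", "status": "working"}\n{"agent": "writer", "status": "done"}\n'
for event in parse_ndjson(raw):
    print(event["agent"], event["status"])
```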
&lt;H3&gt;Part 13: Workshop Complete&lt;/H3&gt;
&lt;P&gt;The final part is a summary and extension point. It reviews what you have built across the previous sections and suggests ways to continue. Although it is not a new technical lab in the same way as the earlier parts, it plays an important role in learning. It helps you consolidate the architecture, the terminology, and the development patterns you have encountered.&lt;/P&gt;
&lt;P&gt;For learners, reflection matters. By the end of Part 13, you should be able to describe the full stack of a local AI application, from model management to user interface, and identify which area you want to deepen next.&lt;/P&gt;
&lt;H2&gt;What Students Gain from the Full Workshop&lt;/H2&gt;
&lt;P&gt;Taken together, these labs do more than teach Foundry Local itself. They teach how AI applications are built. You learn operational basics such as model setup and service management. You learn application integration through SDKs and APIs. You learn system design through RAG, agents, multi-agent orchestration, and web interfaces. You learn engineering discipline through evaluation. You also see how text, speech, custom models, and tool calling all fit into one local-first development workflow.&lt;/P&gt;
&lt;P&gt;That breadth makes the workshop useful in several settings. A student can use it as a self-study path. A lecturer can use it as source material for practical sessions. A learner developer can use it to build portfolio pieces and to understand which AI patterns are worth learning next. Because the repository includes Python, JavaScript, and C#, it also works well for comparing how architectural ideas transfer across languages.&lt;/P&gt;
&lt;H2&gt;How to Approach the Lab as a Beginner&lt;/H2&gt;
&lt;P&gt;If you are starting from scratch, the best route is simple. Complete Parts 1 to 3 in your preferred language first. That gives you the essential setup and integration skills. Then move into Parts 4 to 6 to understand how AI application patterns are composed. After that, use Parts 7 and 8 to learn how larger systems and evaluation fit together. Finally, explore Parts 9 to 12 based on your interests, whether that is speech, tooling, model customisation, or front-end work.&lt;/P&gt;
&lt;P&gt;It is also worth keeping notes as you go. Record what each part adds to your understanding, what code files matter, and what assumptions each example makes. That habit will help you move from following the labs to adapting the patterns in your own projects.&lt;/P&gt;
&lt;H2&gt;Final Thoughts&lt;/H2&gt;
&lt;P&gt;The &lt;A class="lia-external-url" href="https://github.com/microsoft-foundry/foundry-local-lab" target="_blank" rel="noopener"&gt;Microsoft Foundry Local Lab&lt;/A&gt; is a strong introduction to local AI development because it treats learners like developers rather than spectators. You install, run, connect, orchestrate, evaluate, and present working systems. That makes it far more valuable than a short demo that only proves a model can answer a question.&lt;/P&gt;
&lt;P&gt;If you are a student or learner developer who wants to understand how AI applications are really built, this lab gives you a clear path. Start with the basics, pick one language, and work through the parts in order. By the time you finish, you will not just have used Foundry Local. You will have a practical foundation for building local AI applications with far more confidence and much better judgement.&lt;/P&gt;</description>
      <pubDate>Mon, 30 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/getting-started-with-foundry-local-a-student-guide-to-the/ba-p/4503604</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-03-30T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Langchain Multi-Agent Systems with Microsoft Agent Framework and Hosted Agents</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/langchain-multi-agent-systems-with-microsoft-agent-framework-and/ba-p/4504863</link>
      <description>&lt;P&gt;If you have been building AI agents with LangChain, you already know how powerful its tool and chain abstractions are. But when it comes to deploying those agents to production — with real infrastructure, managed identity, live web search, and container orchestration — you need something more.&lt;/P&gt;
&lt;P&gt;This post walks through how to combine &lt;STRONG&gt;LangChain&lt;/STRONG&gt; with the &lt;STRONG&gt;Microsoft Agent Framework&lt;/STRONG&gt; (&lt;CODE&gt;azure-ai-agents&lt;/CODE&gt;) and deploy the result as a &lt;STRONG&gt;Microsoft Foundry Hosted Agent&lt;/STRONG&gt;. We will build a multi-agent incident triage copilot that uses LangChain locally and seamlessly upgrades to cloud-hosted capabilities on Microsoft Foundry.&lt;/P&gt;
&lt;H2&gt;Why combine LangChain with Microsoft Agent Framework?&lt;/H2&gt;
&lt;P&gt;As a LangChain developer, you get excellent abstractions for building agents: the &lt;CODE&gt;@tool&lt;/CODE&gt; decorator, &lt;CODE&gt;RunnableLambda&lt;/CODE&gt; chains, and composable pipelines. But production deployment raises questions that LangChain alone does not answer:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Where do your agents run?&lt;/STRONG&gt; Containers, serverless, or managed infrastructure?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;How do you add live web search or code execution?&lt;/STRONG&gt; Bing Grounding and Code Interpreter are not LangChain built-ins.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;How do you handle authentication?&lt;/STRONG&gt; Managed identity, API keys, or tokens?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;How do you observe agents in production?&lt;/STRONG&gt; Distributed tracing across multiple agents?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The Microsoft Agent Framework fills these gaps. It provides &lt;CODE&gt;AgentsClient&lt;/CODE&gt; for creating and managing agents on Microsoft Foundry, built-in tools like &lt;CODE&gt;BingGroundingTool&lt;/CODE&gt; and &lt;CODE&gt;CodeInterpreterTool&lt;/CODE&gt;, and a thread-based conversation model. Combined with Hosted Agents, you get a fully managed container runtime with health probes, auto-scaling, and the OpenAI Responses API protocol.&lt;/P&gt;
&lt;P&gt;The key insight: &lt;STRONG&gt;LangChain handles local logic and chain composition; the Microsoft Agent Framework handles cloud-hosted orchestration and tooling.&lt;/STRONG&gt;&lt;/P&gt;
&lt;H2&gt;Architecture overview&lt;/H2&gt;
&lt;P&gt;The incident triage copilot uses a coordinator pattern with three specialist agents:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/hosted-agents-langchain-samples/main/screenshots/01-ui-homepage-foundry-connected.png" alt="UI Homepage showing Foundry connected status" /&gt;&lt;/P&gt;
&lt;PRE&gt;&lt;CODE&gt;User Query
    |
    v
Coordinator Agent
    |
    +--&amp;gt; LangChain Triage Chain    (routing decision)
    +--&amp;gt; LangChain Synthesis Chain  (combine results)
    |
    +---+---+---+
    |   |       |
    v   v       v
Research  Diagnostics  Remediation
 Agent      Agent        Agent&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;Each specialist agent has two execution modes:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Mode&lt;/th&gt;&lt;th&gt;LangChain Role&lt;/th&gt;&lt;th&gt;Microsoft Agent Framework Role&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Local&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;@tool&lt;/CODE&gt; functions provide heuristic analysis&lt;/td&gt;&lt;td&gt;Not used&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Foundry&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Chains handle routing and synthesis&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;AgentsClient&lt;/CODE&gt; with &lt;CODE&gt;BingGroundingTool&lt;/CODE&gt;, &lt;CODE&gt;CodeInterpreterTool&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;This dual-mode design means you can develop and test locally with zero cloud dependencies, then deploy to Foundry for production capabilities.&lt;/P&gt;
&lt;H2&gt;Step 1: Define your LangChain tools&lt;/H2&gt;
&lt;P&gt;Start with what you know. Define typed, documented tools using LangChain’s &lt;CODE&gt;@tool&lt;/CODE&gt; decorator:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;from langchain_core.tools import tool

@tool
def classify_incident_severity(query: str) -&amp;gt; str:
    """Classify the severity and priority of an incident based on keywords.

    Args:
        query: The incident description text.

    Returns:
        Severity classification with priority level.
    """
    query_lower = query.lower()

    critical_keywords = [
        "production down", "all users", "outage", "breach",
    ]
    high_keywords = [
        "503", "500", "timeout", "latency", "slow",
    ]

    if any(kw in query_lower for kw in critical_keywords):
        return "severity=critical, priority=P1"
    if any(kw in query_lower for kw in high_keywords):
        return "severity=high, priority=P2"
    return "severity=low, priority=P4"&lt;/LI-CODE&gt;
&lt;P&gt;These tools work identically in local mode and serve as fallbacks when Foundry is unavailable.&lt;/P&gt;
&lt;H2&gt;Step 2: Build routing with LangChain chains&lt;/H2&gt;
&lt;P&gt;Use &lt;CODE&gt;RunnableLambda&lt;/CODE&gt; to create a routing chain that classifies the incident and selects which specialists to invoke:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;from langchain_core.runnables import RunnableLambda
from enum import Enum

class AgentRole(str, Enum):
    RESEARCH = "research"
    DIAGNOSTICS = "diagnostics"
    REMEDIATION = "remediation"

DIAGNOSTICS_KEYWORDS = {
    "log", "error", "exception", "timeout", "500", "503",
    "crash", "oom", "root cause",
}

REMEDIATION_KEYWORDS = {
    "fix", "remediate", "runbook", "rollback", "hotfix",
    "patch", "resolve", "action plan",
}

def _route(inputs: dict) -&amp;gt; dict:
    query = inputs["query"].lower()
    specialists = [AgentRole.RESEARCH]  # always included

    if any(kw in query for kw in DIAGNOSTICS_KEYWORDS):
        specialists.append(AgentRole.DIAGNOSTICS)

    if any(kw in query for kw in REMEDIATION_KEYWORDS):
        specialists.append(AgentRole.REMEDIATION)

    return {**inputs, "specialists": specialists}

triage_routing_chain = RunnableLambda(_route)&lt;/LI-CODE&gt;
&lt;P&gt;This is pure LangChain — no cloud dependency. The chain analyses the query and returns which specialists should handle it.&lt;/P&gt;
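&lt;P&gt;Stripped of the &lt;CODE&gt;RunnableLambda&lt;/CODE&gt; wrapper, the routing decision is plain keyword matching. The following self-contained sketch (keyword sets copied from the chain above, role names as plain strings) shows the behaviour on a mixed query:&lt;/P&gt;

```python
# Standalone version of the routing heuristic -- no LangChain required.
DIAGNOSTICS_KEYWORDS = {
    "log", "error", "exception", "timeout", "500", "503",
    "crash", "oom", "root cause",
}
REMEDIATION_KEYWORDS = {
    "fix", "remediate", "runbook", "rollback", "hotfix",
    "patch", "resolve", "action plan",
}

def route(query: str) -> list[str]:
    """Return the specialist roles for a query; research is always included."""
    q = query.lower()
    specialists = ["research"]
    if any(kw in q for kw in DIAGNOSTICS_KEYWORDS):
        specialists.append("diagnostics")
    if any(kw in q for kw in REMEDIATION_KEYWORDS):
        specialists.append("remediation")
    return specialists

print(route("Seeing 503 errors on checkout, need a rollback plan"))
# ['research', 'diagnostics', 'remediation']
```

A query mentioning "503" pulls in diagnostics, "rollback" pulls in remediation, and research always participates.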
&lt;H2&gt;Step 3: Create specialist agents with dual-mode execution&lt;/H2&gt;
&lt;P&gt;Each specialist agent extends a base class. In local mode, it uses LangChain tools. In Foundry mode, it delegates to the Microsoft Agent Framework:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;from abc import ABC, abstractmethod
from pathlib import Path

class BaseSpecialistAgent(ABC):
    role: AgentRole
    prompt_file: str

    def __init__(self):
        prompt_path = Path(__file__).parent.parent / "prompts" / self.prompt_file
        self.system_prompt = prompt_path.read_text(encoding="utf-8")

    async def run(self, query, shared_context, correlation_id, client=None):
        if client is not None:
            return await self._run_on_foundry(query, shared_context, correlation_id, client)
        return await self._run_locally(query, shared_context, correlation_id)

    async def _run_on_foundry(self, query, shared_context, correlation_id, client):
        """Use Microsoft Agent Framework for cloud-hosted execution."""
        from azure.ai.agents.models import BingGroundingTool

        agent = await client.agents.create_agent(
            model=shared_context.get("model_deployment", "gpt-4o"),
            name=f"{self.role.value}-{correlation_id}",
            instructions=self.system_prompt,
            tools=self._get_foundry_tools(shared_context),
        )

        thread = await client.agents.threads.create()
        await client.agents.messages.create(
            thread_id=thread.id,
            role="user",
            content=self._build_prompt(query, shared_context),
        )

        run = await client.agents.runs.create_and_process(
            thread_id=thread.id,
            agent_id=agent.id,
        )
        # Extract and return the agent’s response...

    async def _run_locally(self, query, shared_context, correlation_id):
        """Use LangChain tools for local heuristic analysis."""
        # Each subclass implements this with its specific tools
        ...&lt;/LI-CODE&gt;
&lt;P&gt;The key pattern here:&amp;nbsp;&lt;STRONG&gt;same interface, different backends&lt;/STRONG&gt;. Your coordinator does not care whether a specialist ran locally or on Foundry.&lt;/P&gt;
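&lt;P&gt;The coordinator's side of that contract can be sketched with nothing but the standard library. This is an illustrative reduction, not the sample's actual &lt;CODE&gt;Coordinator&lt;/CODE&gt; class:&lt;/P&gt;

```python
import asyncio

class EchoSpecialist:
    """Stand-in for a specialist: same run() contract, trivial body."""

    def __init__(self, role: str):
        self.role = role

    async def run(self, query: str, client=None) -> str:
        # The coordinator never needs to know which backend actually ran.
        backend = "foundry" if client is not None else "local"
        return f"{self.role} analysed {query!r} via {backend}"

async def triage(query: str, specialists, client=None) -> dict:
    # Fan out to every routed specialist concurrently, then collect results.
    results = await asyncio.gather(
        *(s.run(query, client=client) for s in specialists)
    )
    return {s.role: r for s, r in zip(specialists, results)}

report = asyncio.run(
    triage(
        "503 errors on /api/orders",
        [EchoSpecialist("research"), EchoSpecialist("diagnostics")],
    )
)
print(report["diagnostics"])
```

Because `run()` takes an optional client, the same fan-out works unchanged whether the specialists execute locally or on Foundry.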
&lt;H2&gt;Step 4: Wire it up with FastAPI&lt;/H2&gt;
&lt;P&gt;Expose the multi-agent pipeline through a FastAPI endpoint. The &lt;CODE&gt;/triage&lt;/CODE&gt; endpoint accepts incident descriptions and returns structured reports:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;from fastapi import FastAPI
from agents.coordinator import Coordinator
from models import TriageRequest

app = FastAPI(title="Incident Triage Copilot")
coordinator = Coordinator()

@app.post("/triage")
async def triage(request: TriageRequest):
    return await coordinator.triage(
        request=request,
        client=app.state.foundry_client,
        max_turns=10,
    )&lt;/LI-CODE&gt;
&lt;P&gt;The application also implements the&amp;nbsp;&lt;CODE&gt;/responses&lt;/CODE&gt; endpoint, which follows the OpenAI Responses API protocol. This is the contract Microsoft Foundry Hosted Agents expects when routing traffic to your container.&lt;/P&gt;
&lt;H2&gt;Step 5: Deploy as a Hosted Agent&lt;/H2&gt;
&lt;P&gt;This is where Microsoft Foundry Hosted Agents shines. Your multi-agent system becomes a managed, auto-scaling service with a single command:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;# Install the azd AI agent extension
azd extension install azure.ai.agents

# Provision infrastructure and deploy
azd up&lt;/LI-CODE&gt;
&lt;P&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/hosted-agents-langchain-samples/main/screenshots/02-ui-triage-running.png" alt="Triage pipeline running with Research, Diagnostics, and Remediation agents" /&gt;&lt;/P&gt;
&lt;P&gt;The Azure Developer CLI (&lt;CODE&gt;azd&lt;/CODE&gt;) provisions everything:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Azure Container Registry&lt;/STRONG&gt; for your Docker image&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Container App&lt;/STRONG&gt; with health probes and auto-scaling&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;User-Assigned Managed Identity&lt;/STRONG&gt; for secure authentication&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Foundry Hub and Project&lt;/STRONG&gt; with model deployments&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Application Insights&lt;/STRONG&gt; for distributed tracing&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Your &lt;CODE&gt;agent.yaml&lt;/CODE&gt; defines what tools the hosted agent has access to:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;name: incident-triage-copilot-langchain
kind: hosted
model:
  deployment: gpt-4o
identity:
  type: managed
tools:
  - type: bing_grounding
    enabled: true
  - type: code_interpreter
    enabled: true&lt;/LI-CODE&gt;
&lt;H2&gt;What you gain over pure LangChain&lt;/H2&gt;
&lt;P&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/hosted-agents-langchain-samples/main/screenshots/03-ui-triage-report.png" alt="Triage report showing coordinator summary and specialist results" /&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Capability&lt;/th&gt;&lt;th&gt;LangChain Only&lt;/th&gt;&lt;th&gt;LangChain + Microsoft Agent Framework&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Local development&lt;/td&gt;&lt;td&gt;Yes&lt;/td&gt;&lt;td&gt;Yes (identical experience)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Live web search&lt;/td&gt;&lt;td&gt;Requires custom integration&lt;/td&gt;&lt;td&gt;Built-in &lt;CODE&gt;BingGroundingTool&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Code execution&lt;/td&gt;&lt;td&gt;Requires sandboxing&lt;/td&gt;&lt;td&gt;Built-in &lt;CODE&gt;CodeInterpreterTool&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Managed hosting&lt;/td&gt;&lt;td&gt;DIY containers&lt;/td&gt;&lt;td&gt;Foundry Hosted Agents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Authentication&lt;/td&gt;&lt;td&gt;DIY&lt;/td&gt;&lt;td&gt;Managed Identity (zero secrets)&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Observability&lt;/td&gt;&lt;td&gt;DIY&lt;/td&gt;&lt;td&gt;OpenTelemetry + Application Insights&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;One-command deploy&lt;/td&gt;&lt;td&gt;No&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;azd up&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Testing locally&lt;/H2&gt;
&lt;P&gt;The dual-mode architecture means you can test the full pipeline without any cloud resources:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/hosted-agents-langchain-samples/main/screenshots/04-ui-specialist-agents.png" alt="Research Agent with Bing Grounding and Diagnostics Agent with Code Interpreter" /&gt;&lt;/P&gt;
&lt;LI-CODE lang=""&gt;# Create virtual environment and install dependencies
python -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt

# Run locally (agents use LangChain tools)
python -m src&lt;/LI-CODE&gt;
&lt;P&gt;Then open &lt;CODE&gt;http://localhost:8080&lt;/CODE&gt; in your browser to use the built-in web UI, or call the API directly:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;curl -X POST http://localhost:8080/triage \
  -H "Content-Type: application/json" \
  -d '{"message": "Getting 503 errors on /api/orders since 2pm"}'&lt;/LI-CODE&gt;
&lt;P&gt;The response includes a coordinator summary, specialist results with confidence scores, and the tools each agent used.&lt;/P&gt;
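&lt;P&gt;That report can be modelled roughly as follows. The field names here are illustrative, inferred from the description above rather than taken from the sample's actual schema:&lt;/P&gt;

```python
from dataclasses import dataclass, field

@dataclass
class SpecialistResult:
    role: str
    summary: str
    confidence: float              # heuristic score in [0.0, 1.0]
    tools_used: list[str] = field(default_factory=list)

@dataclass
class TriageReport:
    coordinator_summary: str
    results: list[SpecialistResult]

# Hypothetical report for the 503 example above.
report = TriageReport(
    coordinator_summary="severity=high, priority=P2: 503s on /api/orders",
    results=[
        SpecialistResult(
            role="diagnostics",
            summary="Upstream timeouts correlate with the 2pm deploy",
            confidence=0.8,
            tools_used=["classify_incident_severity"],
        ),
    ],
)
print(report.results[0].role)  # diagnostics
```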
&lt;H2&gt;Running the tests&lt;/H2&gt;
&lt;P&gt;The project includes a comprehensive test suite covering routing logic, tool behaviour, agent execution, and HTTP endpoints:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;# Run the full test suite in local mode (pytest is assumed as the test runner)
pytest&lt;/LI-CODE&gt;
&lt;P&gt;Tests run entirely in local mode, so no cloud credentials are needed.&lt;/P&gt;
&lt;H2&gt;Key takeaways for LangChain developers&lt;/H2&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Keep your LangChain abstractions.&lt;/STRONG&gt; The &lt;CODE&gt;@tool&lt;/CODE&gt; decorator, &lt;CODE&gt;RunnableLambda&lt;/CODE&gt; chains, and composable pipelines all work exactly as you expect.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Add cloud capabilities incrementally.&lt;/STRONG&gt; Start local, then enable Bing Grounding, Code Interpreter, and managed hosting when you are ready.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Use the dual-mode pattern.&lt;/STRONG&gt; Every agent should work locally with LangChain tools and on Foundry with the Microsoft Agent Framework. This makes development fast and deployment seamless.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Let &lt;CODE&gt;azd&lt;/CODE&gt; handle infrastructure.&lt;/STRONG&gt; One command provisions everything: containers, identity, monitoring, and model deployments.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Security comes free.&lt;/STRONG&gt; Managed Identity means no API keys in your code. Non-root containers, RBAC, and disabled ACR admin are all configured by default.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2&gt;Get started&lt;/H2&gt;
&lt;P&gt;Clone the sample repository and try it yourself:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;git clone https://github.com/leestott/hosted-agents-langchain-samples
cd hosted-agents-langchain-samples
python -m venv .venv &amp;amp;&amp;amp; source .venv/bin/activate
pip install -r requirements.txt
python -m src&lt;/LI-CODE&gt;
&lt;P&gt;Open&amp;nbsp;&lt;CODE&gt;http://localhost:8080&lt;/CODE&gt; to interact with the copilot through the web UI. When you are ready for production, run &lt;CODE&gt;azd up&lt;/CODE&gt; and your multi-agent system is live on Microsoft Foundry.&lt;/P&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/ai-services/agents/" target="_blank"&gt;Microsoft Agent Framework for Python documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/ai-services/agents/concepts/hosted-agents" target="_blank"&gt;Microsoft Foundry Hosted Agents&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/developer/azure-developer-cli/" target="_blank"&gt;Azure Developer CLI (azd)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://python.langchain.com/" target="_blank"&gt;LangChain documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/azure/ai-foundry/" target="_blank"&gt;Microsoft Foundry documentation&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 26 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/langchain-multi-agent-systems-with-microsoft-agent-framework-and/ba-p/4504863</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-03-26T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Build an Offline Hybrid RAG Stack with ONNX and Foundry Local</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/build-an-offline-hybrid-rag-stack-with-onnx-and-foundry-local/ba-p/4503589</link>
      <description>&lt;MAIN&gt;
&lt;ARTICLE&gt;&lt;HEADER&gt;
&lt;P class="lead"&gt;If you are building local AI applications, basic retrieval augmented generation is often only the starting point. This sample shows a more practical pattern: combine lexical retrieval, ONNX based semantic embeddings, and a Foundry Local chat model so the assistant stays grounded, remains offline, and degrades cleanly when the semantic path is unavailable.&lt;/P&gt;
&lt;/HEADER&gt;
&lt;SECTION&gt;
&lt;H2&gt;Why this sample is worth studying&lt;/H2&gt;
&lt;P&gt;Many local RAG samples rely on a single retrieval strategy. That is usually enough for a proof of concept, but it breaks down quickly in production. Exact keywords, acronyms, and document codes behave differently from natural language questions and paraphrased requests.&lt;/P&gt;
&lt;P&gt;This repository keeps the original lexical retrieval path, adds local ONNX embeddings for semantic search, and fuses both signals in a hybrid ranking mode. The generation step runs through Foundry Local, so the entire assistant can remain on device.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Lexical mode handles exact terms and structured vocabulary.&lt;/LI&gt;
&lt;LI&gt;Semantic mode handles paraphrases and more natural language phrasing.&lt;/LI&gt;
&lt;LI&gt;Hybrid mode combines both and is usually the best default.&lt;/LI&gt;
&lt;LI&gt;Lexical fallback protects the user experience if the embedding pipeline cannot start.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Architectural overview&lt;/H2&gt;
&lt;P&gt;The sample has two main flows: an offline ingestion pipeline and a local query pipeline.&lt;/P&gt;
&lt;FIGURE&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/local-hybrid-retrival-onnx/main/screenshots/07-architecture-diagram.png" alt="Architecture diagram showing the ingestion pipeline and local query pipeline" /&gt;
&lt;FIGCAPTION&gt;The architecture splits cleanly into offline ingestion at the top and runtime query handling at the bottom.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;H3&gt;Offline ingestion pipeline&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;Read Markdown files from &lt;CODE&gt;docs/&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI&gt;Parse front matter and split each document into overlapping chunks.&lt;/LI&gt;
&lt;LI&gt;Generate dense embeddings when the ONNX model is available.&lt;/LI&gt;
&lt;LI&gt;Store chunks in SQLite with both sparse lexical features and optional dense vectors.&lt;/LI&gt;
&lt;/OL&gt;
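&lt;P&gt;Step 2 can be sketched as a simple sliding word window. The sample's defaults are a chunk size of 200 with an overlap of 25; treating both as word counts here is an assumption made for illustration:&lt;/P&gt;

```javascript
// Split text into overlapping word-based chunks (sizes are word counts here).
function chunkText(text, size = 200, overlap = 25) {
  const words = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  for (let start = 0; start < words.length; start += size - overlap) {
    chunks.push(words.slice(start, start + size).join(" "));
    if (start + size >= words.length) break; // last window reached the end
  }
  return chunks;
}

const doc = Array.from({ length: 400 }, (_, i) => `w${i}`).join(" ");
console.log(chunkText(doc).length); // 3 overlapping chunks for 400 words
```

The overlap means the last 25 words of one chunk reappear at the start of the next, so a sentence split across a chunk boundary is still retrievable as a whole.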
&lt;H3&gt;Local query pipeline&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;The browser posts a question to the Express API.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;ChatEngine&lt;/CODE&gt; resolves the requested retrieval mode.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;VectorStore&lt;/CODE&gt; retrieves lexical, semantic, or hybrid results.&lt;/LI&gt;
&lt;LI&gt;The prompt is assembled with the retrieved context and sent to a Foundry Local chat model.&lt;/LI&gt;
&lt;LI&gt;The answer is returned with source references and retrieval metadata.&lt;/LI&gt;
&lt;/OL&gt;
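&lt;P&gt;Condensed into one function, the query pipeline looks roughly like this. The function names are illustrative; the real logic is split across &lt;CODE&gt;ChatEngine&lt;/CODE&gt; and &lt;CODE&gt;VectorStore&lt;/CODE&gt;:&lt;/P&gt;

```javascript
// Illustrative condensation of steps 2-5 of the query pipeline.
async function answer(question, { resolveMode, retrieve, generate }) {
  const mode = resolveMode("hybrid");            // step 2: resolve retrieval mode
  const chunks = await retrieve(question, mode); // step 3: fetch grounding chunks
  const prompt = [
    "Answer using only the context below.",
    ...chunks.map((c) => c.text),
    `Question: ${question}`,
  ].join("\n");
  const text = await generate(prompt);           // step 4: Foundry Local chat model
  return { text, sources: chunks, mode };        // step 5: answer plus metadata
}

// Demo with stub implementations:
answer("What is the vent limit?", {
  resolveMode: () => "lexical",
  retrieve: async () => [{ text: "The vent limit is 5 ppm." }],
  generate: async () => "The vent limit is 5 ppm.",
}).then((r) => console.log(r.mode, r.sources.length)); // lexical 1
```

Returning the sources and mode alongside the answer is what lets the UI display grounding evidence and retrieval metadata.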
&lt;FIGURE&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/local-hybrid-retrival-onnx/main/screenshots/08-rag-flow-sequence.png" alt="Sequence diagram showing lexical and hybrid retrieval flow" /&gt;
&lt;FIGCAPTION&gt;The sequence diagram shows the difference between lexical retrieval and hybrid retrieval. In hybrid mode, the query is embedded first, then lexical and semantic scores are fused before prompt assembly.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Repository structure and core components&lt;/H2&gt;
&lt;P&gt;The implementation is compact and readable. The main files to understand are listed below.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;CODE&gt;src/config.js&lt;/CODE&gt;: retrieval defaults, paths, and model settings.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;src/embeddingEngine.js&lt;/CODE&gt;: local ONNX embedding generation through Transformers.js.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;src/vectorStore.js&lt;/CODE&gt;: SQLite storage plus lexical, semantic, and hybrid ranking.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;src/chatEngine.js&lt;/CODE&gt;: retrieval mode resolution, prompt assembly, and Foundry Local model execution.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;src/ingest.js&lt;/CODE&gt;: document ingestion and embedding generation during indexing.&lt;/LI&gt;
&lt;LI&gt;&lt;CODE&gt;src/server.js&lt;/CODE&gt;: REST endpoints, streaming endpoints, upload support, and health reporting.&lt;/LI&gt;
&lt;/UL&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Getting started&lt;/H2&gt;
&lt;P&gt;To run the sample, you need Node.js 20 or newer, Foundry Local, and a local ONNX embedding model. The default model path is &lt;CODE&gt;models/embeddings/bge-small-en-v1.5&lt;/CODE&gt;.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;# Clone the repository and install dependencies
git clone https://github.com/leestott/local-hybrid-retrival-onnx
cd local-hybrid-retrival-onnx
npm install

# Download the embedding model (requires the Hugging Face CLI, e.g. pip install "huggingface_hub[cli]")
huggingface-cli download BAAI/bge-small-en-v1.5 --local-dir models/embeddings/bge-small-en-v1.5

# Build the index, then start the server
npm run ingest
npm start&lt;/LI-CODE&gt;
&lt;P&gt;Ingestion writes the local SQLite database to &lt;CODE&gt;data/rag.db&lt;/CODE&gt;. If the embedding model is available, each chunk gets a dense vector as well as lexical features. If the embedding model is missing, ingestion still succeeds and the application remains usable in lexical mode.&lt;/P&gt;
&lt;DIV class="note"&gt;Best practice: local AI applications should treat model files, SQLite data, and native runtime compatibility as part of the deployable system, not as optional developer conveniences.&lt;/DIV&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Code walkthrough&lt;/H2&gt;
&lt;H3&gt;1. Retrieval configuration&lt;/H3&gt;
&lt;P&gt;The sample makes its retrieval behaviour explicit in configuration. That is useful for testing and for operator visibility.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;export const config = {
  model: "phi-3.5-mini",
  docsDir: path.join(ROOT, "docs"),
  dbPath: path.join(ROOT, "data", "rag.db"),
  chunkSize: 200,
  chunkOverlap: 25,
  topK: 3,
  retrievalMode: process.env.RETRIEVAL_MODE || "hybrid",
  retrievalModes: ["lexical", "semantic", "hybrid"],
  fallbackRetrievalMode: "lexical",
  retrievalWeights: {
    lexical: 0.45,
    semantic: 0.55,
  },
};&lt;/LI-CODE&gt;&lt;BR /&gt;
&lt;P&gt;Those defaults tell you a lot about the intended operating profile. Chunks are small, the number of returned chunks is low, and the fallback path is explicit.&lt;/P&gt;
&lt;H3&gt;2. Local ONNX embeddings&lt;/H3&gt;
&lt;P&gt;The embedding engine disables remote model loading and only uses local files. That matters for privacy, repeatability, and air gapped operation.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;env.allowLocalModels = true;
env.allowRemoteModels = false;

this.extractor = await pipeline("feature-extraction", resolvedPath, {
  local_files_only: true,
});

const output = await this.extractor(text, {
  pooling: "mean",
  normalize: true,
});&lt;/LI-CODE&gt;
&lt;P&gt;The mean pooling and normalisation steps make the vectors suitable for cosine-similarity ranking.&lt;/P&gt;
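&lt;P&gt;Because the vectors come out L2-normalised, cosine similarity reduces to a plain dot product, which is what keeps the ranking cheap:&lt;/P&gt;

```javascript
// For unit-length vectors, cosine similarity is just the dot product.
function dot(a, b) {
  let sum = 0;
  for (let i = 0; i < a.length; i++) sum += a[i] * b[i];
  return sum;
}

function normalize(v) {
  const norm = Math.hypot(...v);
  return v.map((x) => x / norm);
}

const q = normalize([1, 2, 2]); // both example vectors have norm exactly 3
const d = normalize([2, 1, 2]);
console.log(dot(q, d).toFixed(3)); // 0.889
```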
&lt;H3&gt;3. Hybrid storage and ranking in SQLite&lt;/H3&gt;
&lt;P&gt;Instead of adding a separate vector database, the sample stores lexical and semantic representations in the same SQLite table. That keeps the local footprint low and the implementation easy to debug.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;searchHybrid(query, queryEmbedding, topK = 5, weights = { lexical: 0.45, semantic: 0.55 }) {
  const lexicalResults = this.searchLexical(query, topK * 3);
  const semanticResults = this.searchSemantic(queryEmbedding, topK * 3);

  // Fallback: if the semantic path produced nothing, serve lexical results only.
  if (semanticResults.length === 0) {
    return lexicalResults.slice(0, topK).map((row) =&amp;gt; ({
      ...row,
      retrievalMode: "lexical",
    }));
  }

  // Merge both result sets by chunk id so each row carries both scores.
  const combined = new Map();
  for (const row of lexicalResults) {
    combined.set(row.id, { ...row, lexicalScore: row.score, semanticScore: 0 });
  }
  for (const row of semanticResults) {
    const existing = combined.get(row.id);
    if (existing) {
      existing.semanticScore = row.score;
    } else {
      combined.set(row.id, { ...row, lexicalScore: 0, semanticScore: row.score });
    }
  }

  const fused = [...combined.values()].map((row) =&amp;gt; ({
    ...row,
    score: (row.lexicalScore * weights.lexical) + (row.semanticScore * weights.semantic),
  }));

  fused.sort((a, b) =&amp;gt; b.score - a.score);
  return fused.slice(0, topK);
}&lt;/LI-CODE&gt;
&lt;P&gt;The important point is not just the weighted fusion. It is the fallback behaviour. If semantic retrieval cannot provide results, the user still gets lexical grounding instead of an empty context window.&lt;/P&gt;
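&lt;P&gt;With the default weights, the fused score is a straightforward weighted sum. For example, a chunk that scores 0.9 lexically and 0.5 semantically still ranks well, and a purely semantic match stays competitive:&lt;/P&gt;

```javascript
// Weighted fusion using the sample's default weights.
const weights = { lexical: 0.45, semantic: 0.55 };
const fuse = (lex, sem) => lex * weights.lexical + sem * weights.semantic;

console.log(fuse(0.9, 0.5).toFixed(2)); // 0.68 -- strong lexical hit still ranks highly
console.log(fuse(0.2, 0.8).toFixed(2)); // 0.53 -- mostly-semantic match stays competitive
```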
&lt;H3&gt;4. Retrieval mode resolution in ChatEngine&lt;/H3&gt;
&lt;P&gt;&lt;CODE&gt;ChatEngine&lt;/CODE&gt; keeps the runtime behaviour predictable. It validates the requested mode and falls back to lexical search when semantic retrieval is unavailable.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;resolveRetrievalMode(requestedMode) {
  const desiredMode = config.retrievalModes.includes(requestedMode)
    ? requestedMode
    : config.retrievalMode;

  if ((desiredMode === "semantic" || desiredMode === "hybrid") &amp;amp;&amp;amp; !this.semanticAvailable) {
    return config.fallbackRetrievalMode;
  }

  return desiredMode;
}&lt;/LI-CODE&gt;
&lt;P&gt;This is a sensible production design because local runtime failures are common. Missing model files or native dependency mismatches should reduce quality, not crash the entire assistant.&lt;/P&gt;
&lt;H3&gt;5. Foundry Local model management&lt;/H3&gt;
&lt;P&gt;The sample uses &lt;CODE&gt;FoundryLocalManager&lt;/CODE&gt; to discover, download, cache, and load the configured chat model.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;const manager = FoundryLocalManager.create({ appName: "gas-field-local-rag" });
const catalog = manager.catalog;

this.model = await catalog.getModel(config.model);

if (!this.model.isCached) {
  await this.model.download((progress) =&amp;gt; {
    const pct = Math.round(progress * 100);
    this._emitStatus("download", `Downloading ${this.modelAlias}... ${pct}%`, progress);
  });
}

await this.model.load();
this.chatClient = this.model.createChatClient();
this.chatClient.settings.temperature = 0.1;&lt;/LI-CODE&gt;
&lt;P&gt;This gives the app a better local startup experience. The server can expose a status stream while the model initialises in the background.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;User experience and screenshots&lt;/H2&gt;
&lt;P&gt;The client is intentionally simple, which makes it useful during evaluation. You can switch retrieval mode, test questions quickly, and inspect the retrieved sources.&lt;/P&gt;
&lt;FIGURE&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/local-hybrid-retrival-onnx/main/screenshots/01-landing-page.png" alt="Landing page showing the gas field support agent UI in hybrid mode" /&gt;
&lt;FIGCAPTION&gt;The landing page exposes retrieval mode directly in the UI. That makes it easy to compare lexical, semantic, and hybrid behaviour during testing.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;FIGURE&gt;&lt;IMG src="https://raw.githubusercontent.com/leestott/local-hybrid-retrival-onnx/main/screenshots/04-sources-panel.png" alt="Chat response showing sources panel and hybrid retrieval scores" /&gt;
&lt;FIGCAPTION&gt;The sources panel shows grounding evidence and retrieval scores, which is useful when validating whether better answers are coming from better retrieval or just model phrasing.&lt;/FIGCAPTION&gt;
&lt;/FIGURE&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Best practices for ONNX RAG and Foundry Local&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;Keep lexical fallback alive. Exact identifiers and runtime failures both make this necessary.&lt;/LI&gt;
&lt;LI&gt;Persist sparse and dense features together where possible. It simplifies debugging and operational reasoning.&lt;/LI&gt;
&lt;LI&gt;Use small chunks and conservative &lt;CODE&gt;topK&lt;/CODE&gt; values for local context budgets.&lt;/LI&gt;
&lt;LI&gt;Expose health and status endpoints so users can see when the model is still loading or embeddings are unavailable.&lt;/LI&gt;
&lt;LI&gt;Test retrieval quality separately from generation quality.&lt;/LI&gt;
&lt;LI&gt;Pin and validate native runtime dependencies, especially ONNX Runtime, before tuning prompts.&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="note"&gt;Practical warning: this repository already shows why runtime validation matters. A local app can ingest documents successfully and still fail at model initialisation if the native runtime stack is misaligned.&lt;/DIV&gt;
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;How this compares with RAG and CAG&lt;/H2&gt;
&lt;P&gt;The strongest value in this sample comes from where it sits between a basic local RAG baseline and a curated CAG design.&lt;BR /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Dimension&lt;/th&gt;&lt;th&gt;Classic local RAG&lt;/th&gt;&lt;th&gt;This hybrid ONNX RAG sample&lt;/th&gt;&lt;th&gt;CAG&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;Context assembly&lt;/td&gt;&lt;td&gt;Retrieve chunks at query time, often lexically, then inject them into the prompt.&lt;/td&gt;&lt;td&gt;Retrieve chunks at query time with lexical, semantic, or fused scoring, then inject the strongest results into the prompt.&lt;/td&gt;&lt;td&gt;Use a prepared or cached context pack instead of fresh retrieval for every request.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Main strength&lt;/td&gt;&lt;td&gt;Easy to implement and easy to explain.&lt;/td&gt;&lt;td&gt;Better recall for paraphrases without giving up exact match behaviour or offline execution.&lt;/td&gt;&lt;td&gt;Predictable prompts and low query time overhead.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Main weakness&lt;/td&gt;&lt;td&gt;Misses synonyms and natural language reformulations.&lt;/td&gt;&lt;td&gt;More moving parts, larger local asset footprint, and native runtime compatibility to manage.&lt;/td&gt;&lt;td&gt;Coverage depends on curation quality and goes stale more easily.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Failure behaviour&lt;/td&gt;&lt;td&gt;Weak retrieval leads to weak grounding.&lt;/td&gt;&lt;td&gt;Semantic failure can degrade to lexical retrieval if designed properly, which this sample does.&lt;/td&gt;&lt;td&gt;Prepared context can be too narrow for new or unexpected questions.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Best fit&lt;/td&gt;&lt;td&gt;Simple local assistants and proof of concept systems.&lt;/td&gt;&lt;td&gt;Offline copilots and technical assistants that need stronger recall across varied phrasing.&lt;/td&gt;&lt;td&gt;Stable workflows with tightly bounded, curated 
knowledge.&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H3&gt;Samples&lt;/H3&gt;
&lt;P&gt;Related samples:&lt;/P&gt;
&lt;P&gt;- Foundry Local RAG - &lt;A class="lia-external-url" href="https://github.com/leestott/local-rag" target="_blank"&gt;https://github.com/leestott/local-rag&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;- Foundry Local CAG - &lt;A class="lia-external-url" href="https://github.com/leestott/local-cag" target="_blank"&gt;https://github.com/leestott/local-cag&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;- Foundry Local hybrid-retrival-onnx - &lt;A class="lia-external-url" href="https://github.com/leestott/local-hybrid-retrival-onnx" target="_blank"&gt;https://github.com/leestott/local-hybrid-retrival-onnx&lt;/A&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Specific benefits of this hybrid approach over classic RAG&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;It captures paraphrased questions that lexical search would often miss.&lt;/LI&gt;
&lt;LI&gt;It still preserves exact match performance for codes, terms, and product names.&lt;/LI&gt;
&lt;LI&gt;It gives operators a controlled degradation path when the semantic stack is unavailable.&lt;/LI&gt;
&lt;LI&gt;It stays local and inspectable without introducing a separate hosted vector service.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Specific differences from CAG&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;CAG shifts effort into context curation before the request. This sample retrieves evidence dynamically at runtime.&lt;/LI&gt;
&lt;LI&gt;CAG can be faster for fixed workflows, but it is usually less flexible when the document set changes.&lt;/LI&gt;
&lt;LI&gt;This hybrid RAG design is better suited to open-ended knowledge search and growing document collections.&lt;/LI&gt;
&lt;/UL&gt;
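&lt;P&gt;The fused scoring described above can be made concrete with reciprocal rank fusion, a common way to combine a lexical ranking and a semantic ranking. This is an illustrative sketch, not code from the sample repository; the function name, document ids, and the k constant are invented for the example.&lt;/P&gt;

```python
# Reciprocal rank fusion (RRF): combine a lexical and a semantic ranking.
# All names here are illustrative, not taken from the sample repo.
def rrf_fuse(lexical_ranking, semantic_ranking, k=60):
    """Fuse two ranked lists of document ids into one fused ranking."""
    scores = {}
    for ranking in (lexical_ranking, semantic_ranking):
        for rank, doc_id in enumerate(ranking, start=1):
            # Each ranker contributes 1/(k + rank); higher ranks score more.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

# d2 appears near the top of both rankings, so it wins the fused ranking.
fused = rrf_fuse(["d1", "d2", "d3"], ["d2", "d4", "d1"])
```

&lt;P&gt;The k constant dampens the influence of any single ranker (60 is a conventional default), so documents found by both retrieval paths float to the top without either path dominating.&lt;/P&gt;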
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;What to validate before shipping&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;Measure retrieval quality in each mode using exact term, acronym, and paraphrase queries.&lt;/LI&gt;
&lt;LI&gt;Check that sources shown in the UI reflect genuinely distinct evidence, not repeated chunks.&lt;/LI&gt;
&lt;LI&gt;Confirm the application remains usable when semantic retrieval is unavailable.&lt;/LI&gt;
&lt;LI&gt;Verify ONNX Runtime compatibility on the real target machines, not only on the development laptop.&lt;/LI&gt;
&lt;LI&gt;Test model download, cache, and startup behaviour with a clean environment.&lt;/LI&gt;
&lt;/UL&gt;
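&lt;P&gt;The "remains usable when semantic retrieval is unavailable" check above can be exercised with a small degradation harness. The function and corpus names below are illustrative placeholders, not the sample's actual API.&lt;/P&gt;

```python
# Graceful degradation: fall back to lexical retrieval when the semantic
# stack (embedding model, ONNX runtime, model assets) is unavailable.
def retrieve(query, lexical_search, semantic_search=None):
    """Return (results, mode) so the UI can surface which path was used."""
    if semantic_search is not None:
        try:
            return semantic_search(query), "semantic"
        except RuntimeError:
            pass  # e.g. missing ONNX runtime or unavailable embedding model
    return lexical_search(query), "lexical"

def lexical(q):
    corpus = ["error code E42 guide", "setup notes"]
    return [d for d in corpus if q.lower() in d.lower()]

def broken_semantic(q):
    raise RuntimeError("embedding model unavailable")

# The semantic path fails, so the query silently falls back to lexical.
results, mode = retrieve("E42", lexical, broken_semantic)
```

&lt;P&gt;Returning the mode alongside the results makes the validation step testable: an automated check can assert that queries still succeed, and via which path, after the semantic stack is deliberately disabled.&lt;/P&gt;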
&lt;/SECTION&gt;
&lt;SECTION&gt;
&lt;H2&gt;Final take&lt;/H2&gt;
&lt;P&gt;For developers getting started with ONNX RAG and Foundry Local, this sample is a good technical reference because it demonstrates a realistic local architecture rather than a minimal demo. It shows how to build a grounded assistant that remains offline, supports multiple retrieval modes, and fails gracefully.&lt;/P&gt;
&lt;P&gt;Compared with classic local RAG, the hybrid design provides better recall and better resilience. Compared with CAG, it remains more flexible for changing document sets and less dependent on pre-curated context packs. If you want a practical starting point for offline grounded AI on developer workstations or edge devices, this is the most balanced pattern in the repository set.&lt;/P&gt;
&lt;/SECTION&gt;
&lt;/ARTICLE&gt;
&lt;/MAIN&gt;</description>
      <pubDate>Thu, 26 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/build-an-offline-hybrid-rag-stack-with-onnx-and-foundry-local/ba-p/4503589</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-03-26T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Step-by-Step: Deploy the Architecture Review Agent Using AZD AI CLI</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/step-by-step-deploy-the-architecture-review-agent-using-azd-ai/ba-p/4504460</link>
      <description>&lt;P&gt;Hey everyone! I am &lt;A class="lia-external-url" href="https://linkedin.com/in/shivam2003" target="_blank" rel="noopener"&gt;Shivam Goyal&lt;/A&gt;, a Microsoft MVP, and I am super excited to share a project that will save you a massive amount of time.&lt;/P&gt;
&lt;P&gt;Have you ever built a brilliant AI agent in an afternoon, only to spend the next two weeks fighting with Docker containers, memory persistence, and cloud deployment scripts to get it running?&lt;/P&gt;
&lt;P&gt;In our &lt;A href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/stop-drawing-architecture-diagrams-manually-meet-the-open-source-ai-architecture/4496271" target="_blank" rel="noopener"&gt;previous post&lt;/A&gt;, we introduced the &lt;STRONG&gt;Architecture Review Agent&lt;/STRONG&gt;, an open-source tool built on Microsoft Foundry that automatically converts messy architectural notes into structured risk assessments and interactive Excalidraw diagrams.&lt;/P&gt;
&lt;P&gt;But building an AI agent is only half the battle. &lt;EM&gt;Iterating&lt;/EM&gt; on one and actually getting it running in a production-grade environment without losing your mind over infrastructure is a completely different story.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;The Problem with Agentic Development Loops&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;The typical agent development loop is painful: you write your agent code, test it by copy-pasting inputs into a local REPL, manually build a container, push it to a registry, configure RBAC, deploy to your cloud target, realize you need to tweak three lines of logic, and start the whole cycle over again.&lt;/P&gt;
&lt;P&gt;You often end up with an agent that is 100 lines of clean Python, surrounded by 400 lines of Bicep and a 12-step deployment guide.&lt;/P&gt;
&lt;P&gt;The azd ai extension for the &lt;STRONG&gt;Azure Developer CLI (AZD)&lt;/STRONG&gt; completely changes this equation. For the Architecture Review Agent, the entire workflow, from zero infrastructure to a live hosted agent you can invoke from the command line, is just a few simple commands. And moving from local testing to a live cloud deployment is a single azd up.&lt;/P&gt;
&lt;P&gt;Here is how you can set up, invoke, and deploy your own Architecture Review Agent, and even publish it to Microsoft Teams, without needing a tenant admin.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;Step 1: The Setup (No heavy lifting required)&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;First, make sure you have the Azure Developer CLI installed and grab the AI Agents extension.&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;# Install AZD
winget install microsoft.azd 

# Install the AI Agents extension 
azd extension install azure.ai.agents&lt;/LI-CODE&gt;
&lt;P&gt;Next, clone the repository and set up your local Python environment:&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;git clone https://github.com/Azure-Samples/agent-architecture-review-sample
cd agent-architecture-review-sample

python -m venv .venv
.\.venv\Scripts\Activate.ps1
pip install -r requirements.txt&lt;/LI-CODE&gt;
&lt;P&gt;Finally, authenticate and tell AZD where your Microsoft Foundry project lives:&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;azd auth login
azd env new arch-review-dev

# Point it to your Foundry Project and Model
azd env set AZURE_AI_PROJECT_ENDPOINT "https://&amp;lt;your-resource&amp;gt;.services.ai.azure.com/api/projects/&amp;lt;your-project&amp;gt;"
azd env set AZURE_AI_MODEL_DEPLOYMENT_NAME "gpt-4.1"&lt;/LI-CODE&gt;
&lt;H4&gt;&lt;STRONG&gt;Step 2: Run and Invoke Locally&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;With the &lt;A class="lia-external-url" href="https://marketplace.visualstudio.com/items?itemName=ms-azuretools.azure-dev" target="_blank" rel="noopener"&gt;AZD AI extension&lt;/A&gt;, you get a local server that behaves &lt;EM&gt;identically&lt;/EM&gt; to a deployed Foundry-hosted agent. It uses the same localhost:8088 endpoint, the same OpenAI Responses API protocol, and the same conversation persistence.&lt;/P&gt;
&lt;P&gt;Open your first terminal and start the runtime:&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;azd ai agent run&lt;/LI-CODE&gt;
&lt;P&gt;Now, open a second terminal. This is where the magic happens. The agent is completely format-agnostic. There is no schema you have to memorize. You can pass it a file, or just type out a whiteboard brain-dump inline.&lt;/P&gt;
&lt;P&gt;Here is what the terminal experience looks like when running these commands and getting the structured report back:&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;[Image: Styled terminal showing all 3 azd ai agent invoke commands + full structured report output]&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;Here are the three ways you can invoke it using "azd ai agent invoke --local":&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Pattern A: The Structured YAML&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;If your team uses formal definitions, just point the agent to the file. The rule-based parser handles this instantly without an LLM call.&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;azd ai agent invoke --local "scenarios/ecommerce.yaml"&lt;/LI-CODE&gt;&lt;img&gt;&lt;EM&gt;12-component ecommerce architecture diagram generated from YAML input.&lt;/EM&gt;&lt;/img&gt;
&lt;H5&gt;&lt;STRONG&gt;Pattern B: The Whiteboard Brain-Dump (Inline Arrow Notation)&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Arrow notation (A -&amp;gt; B -&amp;gt; C) is how engineers actually communicate on whiteboards and in Slack. Until now, it was rarely a valid input for architecture tools.&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;azd ai agent invoke --local "LB -&amp;gt; 3 API servers -&amp;gt; PostgreSQL primary with read replica -&amp;gt; Redis cache"&lt;/LI-CODE&gt;
&lt;P&gt;The parser automatically extracts the replica count, infers the component types (LB becomes a Gateway), and builds a valid connection graph, surfacing single points of failure instantly.&lt;/P&gt;
&lt;H5&gt;&lt;STRONG&gt;Pattern C: The Markdown Design Doc&lt;/STRONG&gt;&lt;/H5&gt;
&lt;P&gt;Just point it to your existing READMEs or design docs.&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;azd ai agent invoke --local "scenarios/event_driven.md"&lt;/LI-CODE&gt;&lt;img&gt;&lt;EM&gt;8-component event-driven streaming architecture generated from Markdown input&lt;/EM&gt;&lt;/img&gt;
&lt;P&gt;For all three patterns, the agent returns a structured Markdown report in your terminal and generates an interactive architecture.excalidraw file and a high-res PNG right in your local /output folder.&lt;/P&gt;
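&lt;P&gt;Because the local runtime speaks the OpenAI Responses API on localhost:8088, you can also invoke the agent programmatically. The sketch below only constructs the request; the endpoint path and payload fields are assumptions based on the public Responses API shape, so check the azd ai documentation before relying on them.&lt;/P&gt;

```python
import json
import urllib.request

# Minimal sketch of calling the locally running agent over its
# OpenAI Responses API endpoint. The path "/responses" and the model
# name are assumptions, not documented values from the azd ai runtime.
payload = {
    "model": "architecture-review-agent",  # illustrative name
    "input": "LB -> 3 API servers -> PostgreSQL primary with read replica",
}
req = urllib.request.Request(
    "http://localhost:8088/responses",     # assumed path; verify in docs
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
    method="POST",
)
# urllib.request.urlopen(req) would send it once `azd ai agent run` is up.
```

&lt;P&gt;Building the request this way means the same payload works unchanged against the deployed project endpoint later; only the base URL and authentication differ.&lt;/P&gt;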
&lt;H4&gt;&lt;STRONG&gt;Step 3: One Command to the Cloud&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;When you are happy with how your agent performs locally, it's time to deploy. Forget manual Docker builds and complex credential management.&lt;/P&gt;
&lt;LI-CODE lang="powershell"&gt;azd up&lt;/LI-CODE&gt;
&lt;P&gt;This single command orchestrates everything:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Provisions Infrastructure&lt;/STRONG&gt;: Creates your Foundry AI Services account, ACR, App Insights, and managed identities with proper RBAC.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Builds and Pushes&lt;/STRONG&gt;: Packages your Dockerfile and pushes the container image to ACR.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Deploys the Agent&lt;/STRONG&gt;: Registers the image and creates a hosted agent version in Foundry Agent Service.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The output will hand you a live Agent Playground URL and a production-ready API endpoint. Your agent now automatically scales from 0 to 5 replicas, manages its own conversation state, and authenticates securely via Managed Identity.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;Step 4: Publish to Teams and M365 Copilot (Zero Admin Required!)&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Having an API is great, but agents are most powerful when they live where your users collaborate. You can publish this agent directly to Microsoft Teams and M365 Copilot natively from the Foundry portal.&lt;/P&gt;
&lt;P&gt;The best part? You can use the &lt;STRONG&gt;Individual Scope&lt;/STRONG&gt;.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Go to the Microsoft Foundry portal and find your deployed agent.&lt;/LI&gt;
&lt;LI&gt;Click &lt;STRONG&gt;Publish to Teams and Microsoft 365 Copilot&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;LI&gt;Fill out the basic metadata (Name, Description).&lt;/LI&gt;
&lt;LI&gt;Select the &lt;STRONG&gt;Individual scope&lt;/STRONG&gt;.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;Because you are using an individual scope, &lt;STRONG&gt;no M365 admin approval is required&lt;/STRONG&gt;. The portal automatically provisions the Azure Bot Service, packages the metadata, and registers the app. Within minutes, your agent will appear in your Teams Copilot agent store. You can generate a share link and instantly send it to your team for a workshop or demo.&lt;/P&gt;
&lt;H4&gt;&lt;STRONG&gt;What I Learned Building This Workflow&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Shifting from custom deployment scripts to the azd ai CLI taught me three things:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;The declarative contract is beautifully clean.&lt;/STRONG&gt; Our azure.yaml declares the agent and infrastructure in about 30 lines. azd up translates that into a fully secure, production-grade Foundry environment.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The local-to-cloud gap is finally gone.&lt;/STRONG&gt; The azd ai agent run command behaves exactly like the cloud runtime. The invocation you write locally works identically against the deployed endpoint.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Teams publishing is remarkably simple.&lt;/STRONG&gt; I expected bot registration nightmares and tenant admin blockers. Instead, I filled out a form, waited two minutes, and was chatting with my architecture agent in Teams.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H4&gt;&lt;STRONG&gt;Resources &amp;amp; Next Steps&lt;/STRONG&gt;&lt;/H4&gt;
&lt;P&gt;Now that we have a streamlined, single-hosted agent deployment, the natural next step is &lt;STRONG&gt;multi-agent orchestration&lt;/STRONG&gt;. Imagine a triage agent that routes your design doc to a dedicated Security Reviewer Agent and a Scalability Reviewer Agent.&lt;/P&gt;
&lt;P&gt;Try it yourself: clone the repository, run azd up, and let me know what you build!&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;GitHub Repository:&lt;/STRONG&gt; &lt;A href="https://github.com/Azure-Samples/agent-architecture-review-sample" target="_blank" rel="noopener"&gt;Azure-Samples/agent-architecture-review-sample&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Previous Article:&lt;/STRONG&gt; &lt;A href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/stop-drawing-architecture-diagrams-manually-meet-the-open-source-ai-architecture/4496271" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Stop Drawing Architecture Diagrams Manually: Meet the Open-Source AI Architecture Review Agents | Microsoft Community Hub"&gt;Stop Drawing Architecture Diagrams Manually: Meet the Open-Source AI Architecture Review Agents | Microsoft Community Hub&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Learn:&lt;/STRONG&gt; &lt;A href="https://learn.microsoft.com/azure/developer/azure-developer-cli/install-azd" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Install the Azure Developer CLI"&gt;Install the Azure Developer CLI&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Microsoft Foundry Documentation:&lt;/STRONG&gt; &lt;A href="https://learn.microsoft.com/azure/ai-foundry/agents/concepts/hosted-agents?view=foundry" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="Hosted agents in Foundry Agent Service (preview) - Microsoft Foundry"&gt;Hosted agents in Foundry Agent Service (preview) - Microsoft Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Tue, 24 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/step-by-step-deploy-the-architecture-review-agent-using-azd-ai/ba-p/4504460</guid>
      <dc:creator>ShivamGoyal</dc:creator>
      <dc:date>2026-03-24T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Microsoft Olive &amp; Olive Recipes: A Practical Guide to Model Optimization for Real-World Deployment</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/microsoft-olive-olive-recipes-a-practical-guide-to-model/ba-p/4502531</link>
      <description>&lt;H2&gt;Why your model runs great on your laptop but fails in the real world&lt;/H2&gt;
&lt;P&gt;You have trained a model. It scores well on your test set. It runs fine on your development machine with a beefy GPU. Then someone asks you to deploy it to a customer's edge device, a cloud endpoint with a latency budget, or a laptop with no discrete GPU at all.&lt;/P&gt;
&lt;P&gt;Suddenly the model is too large, too slow, or simply incompatible with the target runtime. You start searching for quantisation scripts, conversion tools, and hardware-specific compiler flags. Each target needs a different recipe, and the optimisation steps interact in ways that are hard to predict.&lt;/P&gt;
&lt;P&gt;This is the deployment gap. It is not a knowledge gap; it is a tooling gap. And it is exactly the problem that &lt;A href="https://github.com/microsoft/olive" target="_blank" rel="noopener"&gt;Microsoft Olive&lt;/A&gt; is designed to close.&lt;/P&gt;
&lt;H2&gt;What is Olive?&lt;/H2&gt;
&lt;P&gt;Olive is an easy-to-use, hardware-aware model optimisation toolchain that composes techniques across model compression, optimisation, and compilation. Rather than asking you to string together separate conversion scripts, quantisation utilities, and compiler passes by hand, Olive lets you describe what you have and what you need, then handles the pipeline.&lt;/P&gt;
&lt;P&gt;In practical terms, Olive takes a model source, such as a PyTorch model or an ONNX model (and other supported formats), plus a configuration that describes your production requirements and target hardware accelerator. It then runs the appropriate optimisation passes and produces a deployment-ready artefact.&lt;/P&gt;
&lt;P&gt;You can think of it as a build system for model optimisation: you declare the intent, and Olive figures out the steps.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Official repo:&lt;/STRONG&gt; &lt;A href="https://github.com/microsoft/olive" target="_blank" rel="noopener"&gt;github.com/microsoft/olive&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Documentation:&lt;/STRONG&gt; &lt;A href="https://microsoft.github.io/Olive/" target="_blank" rel="noopener"&gt;microsoft.github.io/Olive&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Key advantages: why Olive matters for your workflow&lt;/H2&gt;
&lt;H3&gt;A. Optimise once, deploy across many targets&lt;/H3&gt;
&lt;P&gt;One of the hardest parts of deploying models in production is that "production" is not one thing. Your model might need to run on a cloud GPU, an edge CPU, or a Windows device with an NPU. Each target has different memory constraints, instruction sets, and runtime expectations.&lt;/P&gt;
&lt;P&gt;Olive supports targeting CPU, GPU, and NPU through its optimisation workflow. This means a single toolchain can produce optimised artefacts for multiple deployment targets, expanding the number of platforms you can serve without maintaining separate optimisation scripts for each one.&lt;/P&gt;
&lt;P&gt;The conceptual workflow is straightforward: Olive can download, convert, quantise, and optimise a model using an auto-optimisation style approach where you specify the target device (cpu, gpu, or npu). This keeps the developer experience consistent even as the underlying optimisation strategy changes per target.&lt;/P&gt;
&lt;H3&gt;B. ONNX as the portability layer&lt;/H3&gt;
&lt;P&gt;If you have heard of ONNX but have not used it in anger, here is why it matters: ONNX gives your model a common representation that multiple runtimes understand. Instead of being locked to one framework's inference path, an ONNX model can run through ONNX Runtime and take advantage of whatever hardware is available.&lt;/P&gt;
&lt;P&gt;Olive supports ONNX conversion and optimisation, and can generate a deployment-ready model package along with sample inference code in languages like C#, C++, or Python. That package is not just the model weights; it includes the configuration and code needed to load and run the model on the target platform.&lt;/P&gt;
&lt;P&gt;For students and early-career engineers, this is a meaningful capability: you can train in PyTorch (the ecosystem you already know) and deploy through ONNX Runtime (the ecosystem your production environment needs).&lt;/P&gt;
&lt;H3&gt;C. Hardware-specific acceleration and execution providers&lt;/H3&gt;
&lt;P&gt;When Olive targets a specific device, it does not just convert the model format. It optimises for the execution provider (EP) that will actually run the model on that hardware. Execution providers are the bridge between the ONNX Runtime and the underlying accelerator.&lt;/P&gt;
&lt;P&gt;Olive can optimise for a range of execution providers, including:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Vitis AI EP&lt;/STRONG&gt; (AMD) – for AMD accelerator hardware&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;OpenVINO EP&lt;/STRONG&gt; (Intel) – for Intel CPUs, integrated GPUs, and VPUs&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;QNN EP&lt;/STRONG&gt; (Qualcomm) – for Qualcomm NPUs and SoCs&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;DirectML EP&lt;/STRONG&gt; (Windows) – for broad GPU support on Windows devices&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Why does EP targeting matter? Because the difference between a generic model and one optimised for a specific execution provider can be significant in terms of latency, throughput, and power efficiency. On battery-powered devices especially, the right EP optimisation can be the difference between a model that is practical and one that drains the battery in minutes.&lt;/P&gt;
&lt;H3&gt;D. Quantisation and precision options&lt;/H3&gt;
&lt;P&gt;Quantisation is one of the most powerful levers you have for making models smaller and faster. The core idea is reducing the numerical precision of model weights and activations:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;FP32&lt;/STRONG&gt; (32-bit floating point) – full precision, largest model size, highest fidelity&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;FP16&lt;/STRONG&gt; (16-bit floating point) – roughly half the memory, usually minimal quality loss for most tasks&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;INT8&lt;/STRONG&gt; (8-bit integer) – significant size and speed gains, moderate risk of quality degradation depending on the model&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;INT4&lt;/STRONG&gt; (4-bit integer) – aggressive compression for the most constrained deployment scenarios&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Think of these as a spectrum. As you move from FP32 towards INT4, models get smaller and faster, but you trade away some numerical fidelity. The practical question is always: &lt;EM&gt;how much quality can I afford to lose for this use case?&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Practical heuristics for choosing precision:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;FP16&lt;/STRONG&gt; is often a safe default for GPU deployment. In practice, you might start here and only go lower if you need to.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;INT8&lt;/STRONG&gt; is a strong choice for CPU-based inference where memory and compute are constrained but accuracy requirements are still high (e.g., classification, embeddings, many NLP tasks).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;INT4&lt;/STRONG&gt; is worth exploring when you are deploying large language models to edge or consumer devices and need aggressive size reduction. Expect to validate quality carefully, as some tasks and model architectures tolerate INT4 better than others.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Olive handles the mechanics of applying these quantisation passes as part of the optimisation pipeline, so you do not need to write custom quantisation scripts from scratch.&lt;/P&gt;
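&lt;P&gt;A quick back-of-envelope calculation shows why this spectrum matters. The sketch below estimates raw weight size per precision; real quantised files also store scales and zero points, so treat the numbers as lower bounds.&lt;/P&gt;

```python
# Approximate model weight size at each precision, in bytes per parameter:
# FP32 = 4, FP16 = 2, INT8 = 1, INT4 = 0.5.
BYTES_PER_PARAM = {"FP32": 4.0, "FP16": 2.0, "INT8": 1.0, "INT4": 0.5}

def approx_size_gb(n_params, precision):
    """Lower-bound estimate of weight storage in gigabytes."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

# A 7-billion-parameter model at each precision level.
sizes = {p: approx_size_gb(7e9, p) for p in BYTES_PER_PARAM}
# FP32 ≈ 28 GB, FP16 ≈ 14 GB, INT8 ≈ 7 GB, INT4 ≈ 3.5 GB.
```

&lt;P&gt;The same arithmetic explains the edge-device constraint: a 7B model that cannot fit in memory at FP16 often becomes practical at INT4, which is exactly the trade Olive's quantisation passes automate.&lt;/P&gt;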
&lt;H2&gt;Showcase: model conversion stories&lt;/H2&gt;
&lt;P&gt;To make this concrete, here are three plausible optimisation scenarios that illustrate how Olive fits into real workflows.&lt;/P&gt;
&lt;H3&gt;Story 1: PyTorch classification model → ONNX → quantised for cloud CPU inference&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Starting point:&lt;/STRONG&gt; A PyTorch image classification model fine-tuned on a domain-specific dataset.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Target hardware:&lt;/STRONG&gt; Cloud CPU instances (no GPU budget for inference).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Optimisation intent:&lt;/STRONG&gt; Reduce latency and cost by quantising to INT8 whilst keeping accuracy within acceptable bounds.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Output:&lt;/STRONG&gt; An ONNX model optimised for CPU execution, packaged with configuration and sample inference code ready for deployment behind an API endpoint.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Story 2: Hugging Face language model → optimised for edge NPU&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Starting point:&lt;/STRONG&gt; A Hugging Face transformer model used for text summarisation.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Target hardware:&lt;/STRONG&gt; A laptop with an integrated NPU (e.g., a Qualcomm-based device).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Optimisation intent:&lt;/STRONG&gt; Shrink the model to INT4 to fit within NPU memory limits, and optimise for the QNN execution provider to leverage the neural processing unit.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Output:&lt;/STRONG&gt; A quantised ONNX model configured for QNN EP, with packaging that includes the model, runtime configuration, and sample code for local inference.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Story 3: Same model, two targets – GPU vs. NPU&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Starting point:&lt;/STRONG&gt; A single PyTorch generative model used for content drafting.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Target hardware:&lt;/STRONG&gt; (A) Cloud GPU for batch processing, (B) On-device NPU for interactive use.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Optimisation intent:&lt;/STRONG&gt; For GPU, optimise at FP16 for throughput. For NPU, quantise to INT4 for size and power efficiency.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Output:&lt;/STRONG&gt; Two separate optimised packages from the same source model, one targeting DirectML EP for GPU, one targeting QNN EP for NPU, each with appropriate precision, runtime configuration, and sample inference code.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;In each case, Olive handles the multi-step pipeline: conversion, optimisation passes, quantisation, and packaging. The developer's job is to define the target and validate the output quality.&lt;/P&gt;
&lt;H2&gt;Introducing Olive Recipes&lt;/H2&gt;
&lt;P&gt;If you are new to model optimisation, staring at a blank configuration file can be intimidating. That is where &lt;A href="https://github.com/microsoft/olive-recipes" target="_blank" rel="noopener"&gt;Olive Recipes&lt;/A&gt; comes in.&lt;/P&gt;
&lt;P&gt;The Olive Recipes repository complements Olive by providing recipes that demonstrate features and use cases. You can use them as a reference for optimising publicly available models or adapt them for your own proprietary models. The repository also includes a selection of ONNX-optimised models that you can study or use as starting points.&lt;/P&gt;
&lt;P&gt;Think of recipes as worked examples: each one shows a complete optimisation pipeline for a specific scenario, including the configuration, the target hardware, and the expected output. Instead of reinventing the pipeline from scratch, you can find a recipe close to your use case and modify it.&lt;/P&gt;
&lt;P&gt;For students especially, recipes are a fast way to learn what good optimisation configurations look like in practice.&lt;/P&gt;
&lt;H2&gt;Taking it further: adding custom models to Foundry Local&lt;/H2&gt;
&lt;P&gt;Once you have optimised a model with Olive, you may want to serve it locally for development, testing, or fully offline use. &lt;A href="https://learn.microsoft.com/en-us/azure/foundry-local/" target="_blank" rel="noopener"&gt;Foundry Local&lt;/A&gt; is a lightweight runtime that downloads, manages, and serves language models entirely on-device via an OpenAI-compatible API, with no cloud dependency and no API keys required.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Important: Foundry Local only supports specific model templates.&lt;/STRONG&gt; At present, these are the &lt;STRONG&gt;chat&lt;/STRONG&gt; template (for conversational and text-generation models) and the &lt;STRONG&gt;whisper&lt;/STRONG&gt; template (for speech-to-text models based on the Whisper architecture). If your model does not fit one of these two templates, it cannot currently be loaded into Foundry Local.&lt;/P&gt;
&lt;H3&gt;Compiling a Hugging Face model for Foundry Local&lt;/H3&gt;
&lt;P&gt;If your optimised model uses a supported architecture, you can compile it from Hugging Face for use with Foundry Local. The high-level process is:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Choose a compatible Hugging Face model.&lt;/STRONG&gt; The model must match one of Foundry Local's supported templates (chat or whisper). For chat models, this typically means decoder-only transformer architectures that support the standard chat format.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Use Olive to convert and optimise.&lt;/STRONG&gt; Olive handles the conversion from the Hugging Face source format into an ONNX-based, quantised artefact that Foundry Local can serve. This is where your Olive skills directly apply.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Register the model with Foundry Local.&lt;/STRONG&gt; Once compiled, you register the model so that Foundry Local's catalogue recognises it and can serve it through the local API.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;For the full step-by-step guide, including exact commands and configuration details, refer to the official documentation: &lt;A href="https://learn.microsoft.com/en-us/azure/foundry-local/how-to/how-to-compile-hugging-face-models?tabs=Bash" target="_blank" rel="noopener"&gt;How to compile Hugging Face models for Foundry Local&lt;/A&gt;. For a hands-on lab that walks through the complete workflow, see &lt;A href="https://github.com/microsoft-foundry/Foundry-Local-Lab" target="_blank" rel="noopener"&gt;Foundry Local Lab&lt;/A&gt;, specifically Lab 10 which covers bringing custom models into Foundry Local.&lt;/P&gt;
&lt;H3&gt;Why does this matter?&lt;/H3&gt;
&lt;P&gt;The combination of Olive and Foundry Local gives you a complete local workflow: optimise your model with Olive, then serve it with Foundry Local for rapid iteration, privacy-sensitive workloads, or environments without internet connectivity. Because Foundry Local exposes an OpenAI-compatible API, your application code can switch between local and cloud inference with minimal changes.&lt;/P&gt;
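&lt;P&gt;In application code, that switch can be as small as a configuration function that changes only the base URL. The endpoint values, port, and model names below are illustrative assumptions, not documented defaults.&lt;/P&gt;

```python
# Sketch of switching between Foundry Local and a cloud endpoint by
# swapping the base URL. All values here are illustrative placeholders.
def inference_config(use_local: bool) -> dict:
    if use_local:
        return {
            "base_url": "http://localhost:5273/v1",  # assumed local port
            "api_key": "not-needed",  # Foundry Local requires no API key
            "model": "my-custom-chat-model",  # your Olive-compiled model
        }
    return {
        "base_url": "https://example.openai.azure.com/v1",  # placeholder
        "api_key": "<load-from-secret-store>",
        "model": "gpt-4.1",
    }

cfg = inference_config(use_local=True)
```

&lt;P&gt;Because both sides expose an OpenAI-compatible surface, the rest of the application, including prompts, request payloads, and response handling, stays identical across local and cloud inference.&lt;/P&gt;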
&lt;P&gt;Keep in mind the template constraint. If you are planning to bring a custom model into Foundry Local, verify early that it fits the chat or whisper template. Attempting to load an unsupported architecture will not work, regardless of how well the model has been optimised.&lt;/P&gt;
&lt;H2&gt;Contributing: how to get involved&lt;/H2&gt;
&lt;P&gt;The Olive ecosystem is open source, and contributions are welcome. There are two main ways to contribute:&lt;/P&gt;
&lt;H3&gt;A. Contributing recipes&lt;/H3&gt;
&lt;P&gt;If you have built an optimisation pipeline that works well for a specific model, hardware target, or use case, consider contributing it as a recipe. Recipes are repeatable pipeline configurations that others can learn from and adapt.&lt;/P&gt;
&lt;H3&gt;B. Sharing optimised model outputs and configurations&lt;/H3&gt;
&lt;P&gt;If you have produced an optimised model that might be useful to others, sharing the optimisation configuration and methodology (and, where licensing permits, the model itself) helps the community build on proven approaches rather than starting from zero.&lt;/P&gt;
&lt;H3&gt;Contribution checklist&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Reproducibility:&lt;/STRONG&gt; Can someone else run your recipe or configuration and get comparable results?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Licensing:&lt;/STRONG&gt; Are the base model weights, datasets, and any dependencies properly licensed for sharing?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Hardware target documented:&lt;/STRONG&gt; Have you specified which device and execution provider the optimisation targets?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Runtime documented:&lt;/STRONG&gt; Have you noted the ONNX Runtime version and any EP-specific requirements?&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Quality validation:&lt;/STRONG&gt; Have you included at least a basic accuracy or quality check for the optimised output?&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;If you are a student or early-career developer, contributing a recipe is a great way to build portfolio evidence that you understand real deployment concerns, not just training.&lt;/P&gt;
&lt;H2&gt;Try it yourself: a minimal workflow&lt;/H2&gt;
&lt;P&gt;Here is a conceptual walkthrough of the optimisation workflow using Olive. The idea is to make the mental model concrete. For exact CLI flags and options, refer to the &lt;A href="https://microsoft.github.io/Olive/" target="_blank" rel="noopener"&gt;official Olive documentation&lt;/A&gt;.&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Choose a model source.&lt;/STRONG&gt; Start with a PyTorch or Hugging Face model you want to optimise. This is your input.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Choose a target device.&lt;/STRONG&gt; Decide where the model will run: &lt;CODE&gt;cpu&lt;/CODE&gt;, &lt;CODE&gt;gpu&lt;/CODE&gt;, or &lt;CODE&gt;npu&lt;/CODE&gt;.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Choose an execution provider.&lt;/STRONG&gt; Pick the EP that matches your hardware, for example DirectML for Windows GPU, QNN for Qualcomm NPU, or OpenVINO for Intel.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Choose a precision.&lt;/STRONG&gt; Select the quantisation level: &lt;CODE&gt;fp16&lt;/CODE&gt;, &lt;CODE&gt;int8&lt;/CODE&gt;, or &lt;CODE&gt;int4&lt;/CODE&gt;, based on your size, speed, and quality requirements.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Run the optimisation.&lt;/STRONG&gt; Olive will convert, quantise, optimise, and package the model for your target. The output is a deployment-ready artefact with model files, configuration, and sample inference code.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;A conceptual command might look like this:&lt;/P&gt;
&lt;PRE class="language-bash" tabindex="0" contenteditable="false" data-lia-code-value="# Conceptual example – refer to official docs for exact syntax
olive auto-opt --model-id my-model --device cpu --provider onnxruntime --precision int8
"&gt;&lt;CODE&gt;# Conceptual example – refer to official docs for exact syntax
olive auto-opt --model-id my-model --device cpu --provider onnxruntime --precision int8
&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;After optimisation, validate the output. Run your evaluation benchmark on the optimised model and compare quality, latency, and model size against the original. If INT8 drops quality below your threshold, try FP16. If the model is still too large for your device, explore INT4. Iteration is expected.&lt;/P&gt;
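&lt;P&gt;That validation loop is easy to script. The sketch below is model-agnostic: the two stand-in callables and the input list are placeholders for real ONNX Runtime sessions over your original and Olive-optimised artefacts, and the quality check is left to your own evaluation metric:&lt;/P&gt;

```python
import time
import statistics

def benchmark(predict, inputs):
    """Measure per-call latency in milliseconds for a predict callable."""
    latencies = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        latencies.append((time.perf_counter() - start) * 1000.0)
    latencies.sort()
    p95_index = max(0, int(round(0.95 * (len(latencies) - 1))))
    return {
        "p50_ms": statistics.median(latencies),
        "p95_ms": latencies[p95_index],
    }

def compare_models(baseline, optimised, inputs):
    """Report the latency profile of each variant side by side."""
    return {
        "baseline": benchmark(baseline, inputs),
        "optimised": benchmark(optimised, inputs),
    }

# Stand-in workloads; swap in session.run(...) calls for real models.
slow = lambda x: sum(i * i for i in range(20000))
fast = lambda x: sum(i * i for i in range(2000))
report = compare_models(slow, fast, inputs=list(range(10)))
```

&lt;P&gt;Pair the latency report with a file-size comparison and your accuracy metric, and you have the three numbers the iteration loop above asks you to weigh.&lt;/P&gt;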
&lt;H2&gt;Key takeaways&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Olive bridges training and deployment&lt;/STRONG&gt; by providing a single, hardware-aware optimisation toolchain that handles conversion, quantisation, optimisation, and packaging.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;One source model, many targets:&lt;/STRONG&gt; Olive lets you optimise the same model for CPU, GPU, and NPU, expanding your deployment reach without maintaining separate pipelines.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;ONNX is the portability layer&lt;/STRONG&gt; that decouples your training framework from your inference runtime, and Olive leverages it to generate deployment-ready packages.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Precision is a design choice:&lt;/STRONG&gt; FP16, INT8, and INT4 each serve different deployment constraints. Start conservative, measure quality, and compress further only when needed.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Olive Recipes are your starting point:&lt;/STRONG&gt; Do not build optimisation pipelines from scratch when worked examples exist. Learn from recipes, adapt them, and contribute your own.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Foundry Local extends the workflow:&lt;/STRONG&gt; Once your model is optimised, Foundry Local can serve it on-device via a standard API, but only if it fits a supported template (chat or whisper).&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/olive" target="_blank" rel="noopener"&gt;Microsoft Olive – GitHub repository&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://microsoft.github.io/Olive/" target="_blank" rel="noopener"&gt;Olive documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/olive-recipes" target="_blank" rel="noopener"&gt;Olive Recipes – GitHub repository&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/foundry-local/how-to/how-to-compile-hugging-face-models?tabs=Bash" target="_blank" rel="noopener"&gt;How to compile Hugging Face models for Foundry Local&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft-foundry/Foundry-Local-Lab" target="_blank" rel="noopener"&gt;Foundry Local Lab – hands-on labs (see Lab 10 for custom models)&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/foundry-local/" target="_blank" rel="noopener"&gt;Foundry Local documentation&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Mon, 23 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/microsoft-olive-olive-recipes-a-practical-guide-to-model/ba-p/4502531</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-03-23T07:00:00Z</dc:date>
    </item>
    <item>
      <title>ProvePresent: Ending Proxy Attendance with Azure Serverless &amp; Azure OpenAI</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/provepresent-ending-proxy-attendance-with-azure-serverless-azure/ba-p/4501830</link>
      <description>&lt;H1&gt;Problem&lt;/H1&gt;
&lt;img /&gt;
&lt;P&gt;Most schools use a smart‑card‑based attendance system where students tap their cards on a reader. However, this method is unreliable because students can give their cards to friends or simply tap and leave immediately. Teachers cannot accurately assess real student performance—whether high‑performing students are genuinely attending class or whether poor performance is due to actual absence. Another issue is that even if students are physically present in a lecture, teachers still cannot tell whether they are paying attention to the projector or actually learning.&lt;/P&gt;
&lt;P&gt;The current workaround is for teachers to override the attendance record by calling each student one by one, which is time‑consuming in large lectures and adds little educational value. It is also only a one‑time check, meaning students can still leave the lecture room immediately afterwards.&lt;/P&gt;
&lt;P&gt;Another issue is that we run many out‑of‑school activities, such as site visits, and the school needs to confirm everyone’s presence promptly at each checkpoint.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;This kind of problem isn’t unique to schools. It’s a common challenge for event organizers, where verifying attendee presence is essential but often slow, causing long queues. Organizers usually rely on a few mobile scanners to check in attendees one by one.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H1&gt;Solution&lt;/H1&gt;
&lt;P&gt;ProvePresent is an AI tool designed to verify attendance and create real‑time challenges for participants, ensuring that attendance records are authentic and that attendees remain focused on the presentation. Sign‑in uses a one‑time password (OTP) sent to the student’s school email.&lt;/P&gt;
&lt;img /&gt;
&lt;H2&gt;Check-in and Check-out With a Real‑time QR Code&lt;/H2&gt;
&lt;img /&gt;
&lt;P&gt;The code refreshes every 25 seconds, and the presenter can display it on the projector for everyone to scan when checking in at the beginning and checking out at the end of the session.&lt;/P&gt;
&lt;P&gt;However, this alone cannot prevent someone from capturing the code and sending it to others who are not in the room, or from using two devices to help someone else scan for attendance—even if geolocation checks are enabled. We will explain this next.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;This check‑in and check‑out process is highly scalable, and no one needs to queue while waiting for someone to scan their QR code!&lt;/STRONG&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Organizers can simply set geolocation restrictions to prevent anyone from checking in remotely.&lt;/P&gt;
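&lt;P&gt;One common way to implement such short‑lived codes, sketched here as an illustration rather than the project's actual scheme, is an HMAC over the current time window: every QR payload then expires on its own schedule, and a captured screenshot goes stale within seconds:&lt;/P&gt;

```python
import hmac
import hashlib
import time

WINDOW_SECONDS = 25  # matches the on-screen refresh interval

def rotating_code(secret, session_id, now=None):
    """Derive a short-lived check-in code for the current 25 s window."""
    if now is None:
        now = time.time()
    window = int(now) // WINDOW_SECONDS
    msg = (session_id + ":" + str(window)).encode("utf-8")
    digest = hmac.new(secret, msg, hashlib.sha256).hexdigest()
    return digest[:12]  # short enough to embed in a QR code URL

def verify(secret, session_id, code, now=None):
    """Accept only the code belonging to the current time window."""
    return hmac.compare_digest(code, rotating_code(secret, session_id, now))

secret = b"server-side-secret"  # placeholder; keep this on the server
code = rotating_code(secret, "lecture-42", now=1_700_000_000)
assert verify(secret, "lecture-42", code, now=1_700_000_010)  # same window
```

&lt;P&gt;As the post notes, rotation alone does not stop relaying in real time, which is why the live challenges described next are needed on top.&lt;/P&gt;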
&lt;H2&gt;Keep Attendees Engaged with SignalR&lt;/H2&gt;
&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;The SignalR live connection allows the presenter to create real‑time challenges for attendees, helping to verify their presence and ensure they are genuinely focused on the presentation.&lt;/P&gt;
&lt;H2&gt;AI Powered Live Quiz&lt;/H2&gt;
&lt;P&gt;The presenter shares their presentation screen, and two Microsoft Foundry agents backed by Azure OpenAI ChatGPT 5.3—ImageAnalysisAgent, which extracts key information from the shared screen, and QuizQuestionGenerator, which generates simple questions based on the current slide—work together to create challenges. Each question is broadcast to all online attendees, who must answer within 20 seconds.&lt;/P&gt;
&lt;P&gt;This feature keeps attendees on the webpage and prevents them from doing anything unrelated to the presentation.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;Left is the attendee view and right is the presenter view before screen capture.&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;Presentation screen.&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;Left is the attendee view and right is the presenter view during screen capture.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;A detailed report can be downloaded for further analysis.&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;Download report.&lt;/EM&gt;&lt;/P&gt;
&lt;div data-video-id="https://www.youtube.com/watch?v=hNY9OLbPcZE/1773387781244" data-video-remote-vid="https://www.youtube.com/watch?v=hNY9OLbPcZE/1773387781244" class="lia-video-container lia-media-is-center lia-media-size-large"&gt;&lt;iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FhNY9OLbPcZE%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DhNY9OLbPcZE&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FhNY9OLbPcZE%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" allowfullscreen="" style="max-width: 100%"&gt;&lt;/iframe&gt;&lt;span class="lia-media-caption-text"&gt;A complete demo&lt;/span&gt;&lt;/div&gt;
&lt;H2&gt;Attendee Photo Capture&lt;/H2&gt;
&lt;P&gt;All online students are asked to capture and upload photos of their view of the venue. The system then analyzes the images with a Microsoft Foundry agent backed by Azure OpenAI ChatGPT 5.3, the PositionEstimationAgent, to estimate seating positions and complete an image challenge.&lt;/P&gt;
&lt;P&gt;When the presenter clicks Capture Attendee Photos, all online attendees are prompted to take a photo and upload it to blob storage. The PositionEstimationAgent then analyzes the image to estimate their seating location, which can provide insights into student performance.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;&lt;img /&gt;
&lt;P&gt;&lt;STRONG&gt;Analysis Notes:&lt;/STRONG&gt;&amp;nbsp;Analyzed 13 students in 2 overlapping batches. Batch 1: The venue is a computer lab with the projector screen at the front center, whiteboards on the left, and cabinets on the right. Relative depth was estimated mainly from screen size and number of monitor rows visible ahead. Column estimates were inferred from screen angle and side-room features, with lower confidence for the rotated side-view image. Batch 2: These six photos appear to come from the same computer lab with the projector at the front center. Relative depth was estimated mainly from projector size and number of visible desk/monitor rows ahead. Left-right placement was inferred from projector skew and side-wall visibility. Within this batch, 240124734 and 240167285 seem closest to the front, 240286514 and 240158424 are slightly farther back, 240293498 is farther back again, and 240160364 appears furthest.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;H2&gt;Pass Around the QR Code Attendance Sheet&lt;/H2&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Traditionally, the attendance sheet is circulated for attendees to sign, but this method is unreliable because no one monitors the signing process, allowing one attendee to sign for someone who is absent. It is also slow and not scalable for large groups.&lt;/P&gt;
&lt;P&gt;The QR Code attendance sheet functions as a chain. The presenter randomly distributes a short‑lived, one‑time QR code—representing a virtual attendance sheet—to any number of attendees, just like handing out multiple physical sheets. Each attendee must find another participant to scan their code to record attendance, continuing the chain until the final group of attendees. The presenter then verifies the last group’s presence.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;div data-video-id="https://youtu.be/VkF9vhuukfM/1773385256873" data-video-remote-vid="https://youtu.be/VkF9vhuukfM/1773385256873" class="lia-video-container lia-media-is-center lia-media-size-large"&gt;&lt;iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FVkF9vhuukfM%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DVkF9vhuukfM&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FVkF9vhuukfM%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" allowfullscreen="" style="max-width: 100%"&gt;&lt;/iframe&gt;&lt;/div&gt;&lt;img /&gt;
&lt;P&gt;The first chain is a dead chain because that student left the venue and could not find another student to scan their QR code. The second chain contains 20 student attendance records. It also provides useful insights into friendship and seating patterns.&lt;/P&gt;
&lt;H2&gt;Architecture&lt;/H2&gt;
&lt;img /&gt;
&lt;P&gt;This project is built using Vibe Coding, so we will not share highly technical details in this post. If you'd like to learn more, leave a comment, and we will write another blog to cover the specifics.&lt;/P&gt;
&lt;H3&gt;GitHub Repo&lt;/H3&gt;
&lt;P&gt;&lt;A href="https://github.com/wongcyrus/ProvePresent" target="_blank" rel="noopener"&gt;https://github.com/wongcyrus/ProvePresent&lt;/A&gt;&lt;/P&gt;
&lt;H1&gt;Conclusion&lt;/H1&gt;
&lt;P&gt;ProvePresent demonstrates how Azure serverless technology and Azure OpenAI can work together to solve a long‑standing problem in education: verifying genuine student presence and engagement. By combining real‑time QR code verification, SignalR‑powered live interactions, AI‑generated quizzes, and intelligent photo‑based seating analysis, we created a system where “being present” is no longer just a checkbox—it becomes a verifiable, interactive, and meaningful part of the learning experience.&lt;/P&gt;
&lt;P&gt;Instead of relying on outdated smart‑card systems or manual roll calls, educators gain a dynamic tool that keeps students attentive, provides insight into classroom behavior, and produces useful analytics for improving teaching outcomes. Students, in turn, benefit from an engaging, modern attendance experience that aligns with how digital‑native learners expect classes to operate.&lt;/P&gt;
&lt;P&gt;This is only the beginning. With Microsoft Foundry agents and the flexibility of Azure Functions, there are many opportunities to extend ProvePresent further—richer analytics, smarter engagement models, and seamless integration with LMS platforms. If there’s interest, we’re happy to share more technical details, architectural deep dives, and future roadmap ideas in a follow‑up post.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Thank you to the contributing Microsoft Student Ambassadors from the&amp;nbsp;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;&lt;A class="lia-external-url" href="https://hkiit.edu.hk/" target="_blank" rel="noopener"&gt;Hong Kong Institute of Information Technology (HKIIT)&lt;/A&gt;: &lt;/SPAN&gt;&lt;A class="lia-external-url" href="https://hk.linkedin.com/in/wing-ho-wong-0772a83b7" target="_blank" rel="noopener"&gt;Wong Wing Ho&lt;/A&gt;, &lt;A class="lia-external-url" href="https://www.linkedin.com/in/sham-jayson-chan-566a57326/" target="_blank" rel="noopener"&gt;CHAN Sham Jayson&lt;/A&gt;, &lt;A class="lia-external-url" href="https://www.linkedin.com/in/phoebe-pang-1aab99155" target="_blank" rel="noopener"&gt;Pang Ho Shum&lt;/A&gt;, and &lt;A class="lia-external-url" href="https://www.linkedin.com/in/ka-chun-chan-6115513b5" target="_blank" rel="noopener"&gt;Chan Ka Chun&lt;/A&gt;. They are majoring in the&amp;nbsp;&lt;A class="lia-external-url" href="https://www.vtc.edu.hk/admission/en/programme/it114115-higher-diploma-in-cloud-and-data-centre-administration/" target="_blank" rel="noopener"&gt;Higher Diploma in Cloud and Data Centre Administration&lt;/A&gt;.&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;H3&gt;&lt;STRONG&gt;About the Author&lt;/STRONG&gt;&lt;/H3&gt;
&lt;img /&gt;
&lt;P data-selectable-paragraph=""&gt;&lt;A href="https://www.linkedin.com/in/cyruswong/" target="_blank" rel="noopener"&gt;Cyrus Wong&lt;/A&gt;&amp;nbsp;is a senior lecturer at the&amp;nbsp;&lt;A href="https://hkiit.edu.hk/" target="_blank" rel="noopener"&gt;Hong Kong Institute of Information Technology (HKIIT)&lt;/A&gt;&amp;nbsp;@&amp;nbsp;&lt;A href="http://lwit.vtc.edu.hk/" target="_blank" rel="noopener"&gt;IVE (Lee Wai Lee)&lt;/A&gt;, where he focuses on teaching public cloud technologies. He is a passionate advocate for the adoption of cloud technology across various media and events. With his extensive knowledge and expertise, he has earned prestigious recognitions such as&amp;nbsp;&lt;A href="https://aws.amazon.com/developer/community/heroes/cyrus-wong/" target="_blank" rel="noopener" data-lia-auto-title-active="0" data-lia-auto-title="AWS Builder Center"&gt;AWS Builder Center&lt;/A&gt;,&amp;nbsp;&lt;A href="https://mvp.microsoft.com/en-US/mvp/profile/86da86ff-8786-ed11-aad1-000d3a197333" target="_blank" rel="noopener"&gt;Microsoft MVP - Microsoft Foundry&lt;/A&gt;, and&amp;nbsp;&lt;A href="https://developers.google.com/profile/u/cyruswong" target="_blank" rel="noopener"&gt;Google Developer Expert for Google Cloud Platform &amp;amp; AI&lt;/A&gt;.&lt;/P&gt;</description>
      <pubDate>Thu, 19 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/provepresent-ending-proxy-attendance-with-azure-serverless-azure/ba-p/4501830</guid>
      <dc:creator>cyruswong</dc:creator>
      <dc:date>2026-03-19T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Foundry IQ: Give Your AI Agents a Knowledge Upgrade</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/foundry-iq-give-your-ai-agents-a-knowledge-upgrade/ba-p/4502615</link>
      <description>&lt;P&gt;If you’re learning to build AI agents, you’ve probably hit a familiar wall: your agent can generate text, but it doesn’t actually&amp;nbsp;&lt;EM&gt;know&lt;/EM&gt; anything about your data. It can’t look up your documents, search across your files, or pull facts from multiple sources to answer a real question.&lt;/P&gt;
&lt;P&gt;That’s the gap Foundry IQ fills. It gives your AI agents structured access to knowledge, so they can retrieve, reason over, and synthesize information from real data sources instead of relying on what’s baked into the model.&lt;/P&gt;
&lt;H2&gt;Why Should You Care?&lt;/H2&gt;
&lt;P&gt;As a student or early-career developer, understanding how AI systems work with external knowledge is one of the most valuable skills you can build right now. Retrieval-Augmented Generation (RAG), knowledge bases, and multi-source querying are at the core of every production AI application, from customer support bots to research assistants to enterprise copilots.&lt;/P&gt;
&lt;P&gt;Foundry IQ gives you a hands-on way to learn these patterns without having to build all the plumbing yourself. You define knowledge bases, connect data sources, and let your agents query them. The concepts you learn here transfer directly to real-world AI engineering roles.&lt;/P&gt;
&lt;H2&gt;What is Foundry IQ?&lt;/H2&gt;
&lt;P&gt;Foundry IQ is a service within Azure AI Foundry that lets you create &lt;STRONG&gt;knowledge bases&lt;/STRONG&gt;, collections of connected data sources that your AI agents can query through a single endpoint.&lt;/P&gt;
&lt;P&gt;Instead of writing custom retrieval logic for every app you build, you:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Define knowledge sources&lt;/STRONG&gt; — connect documents, data stores, or web content (SharePoint, Azure Blob Storage, Azure AI Search, Fabric OneLake, and more).&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Organize them into a knowledge base&lt;/STRONG&gt; — group multiple sources behind one queryable endpoint.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Query from your agent&lt;/STRONG&gt; — your AI agent calls the knowledge base to get the context it needs before generating a response.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;This approach means the knowledge layer is reusable. Build it once, and any agent or app in your project can tap into it.&lt;/P&gt;
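&lt;P&gt;The pattern is worth internalizing even before you touch the service. Here is a deliberately tiny, illustrative version of "many sources behind one queryable endpoint"; plain keyword matching stands in for Foundry IQ's real retrieval, and every name in it is made up for the sketch:&lt;/P&gt;

```python
class KnowledgeBase:
    """Toy multi-source knowledge base: one query, many sources."""

    def __init__(self):
        self.sources = {}  # source name mapped to a list of documents

    def add_source(self, name, documents):
        self.sources[name] = documents

    def query(self, question):
        """Return matching snippets from every connected source."""
        terms = set(question.lower().split())
        hits = []
        for source, docs in self.sources.items():
            for doc in docs:
                overlap = terms.intersection(doc.lower().split())
                if overlap:
                    hits.append({"source": source, "text": doc})
        return hits

kb = KnowledgeBase()
kb.add_source("lecture-notes", ["RAG retrieves context before generation"])
kb.add_source("readings", ["retrieval grounds answers in real data"])
results = kb.query("how does retrieval ground generation")
# An agent would pass these snippets to the model as grounding context.
```

&lt;P&gt;Foundry IQ replaces the toy matching with real ingestion, indexing, and semantic retrieval, but the shape of the interaction, one query fanned out across connected sources, is the same.&lt;/P&gt;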
&lt;H2&gt;The IQ Series: A Three-Part Learning Path&lt;/H2&gt;
&lt;P&gt;The &lt;STRONG&gt;IQ Series&lt;/STRONG&gt; is a set of three weekly episodes that walk you through Foundry IQ from concept to code. Each episode includes a tech talk, visual doodle summaries, and a companion cookbook with sample code you can run yourself.&lt;/P&gt;
&lt;P&gt;👉 &lt;STRONG&gt;Get started:&lt;/STRONG&gt; &lt;A href="https://aka.ms/iq-series" target="_blank" rel="noopener"&gt;https://aka.ms/iq-series&lt;/A&gt;&lt;/P&gt;
&lt;H3&gt;Episode 1: Unlocking Knowledge for Your Agents (March 18, 2026)&lt;/H3&gt;
&lt;P&gt;Start here. This episode introduces the core architecture of Foundry IQ and explains how AI agents interact with knowledge. You’ll learn what knowledge bases are, why they matter, and how the key components fit together.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What you’ll learn:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;The difference between model knowledge and retrieved knowledge&lt;/LI&gt;
&lt;LI&gt;How Foundry IQ structures the retrieval layer&lt;/LI&gt;
&lt;LI&gt;The building blocks: knowledge sources, knowledge bases, and agent queries&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Episode 2: Building the Data Pipeline with Knowledge Sources (March 25, 2026)&lt;/H3&gt;
&lt;P&gt;This episode goes deeper into &lt;STRONG&gt;knowledge sources&lt;/STRONG&gt;, the connectors that bring data into Foundry IQ. You’ll see how different content types flow into the system and how to wire up sources from services you may already be using.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What you’ll learn:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;How to connect sources like Azure Blob Storage, Azure AI Search, SharePoint, Fabric OneLake, and the web&lt;/LI&gt;
&lt;LI&gt;How content is ingested and indexed for retrieval&lt;/LI&gt;
&lt;LI&gt;Patterns for combining multiple source types&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Episode 3: Querying Multi-Source Knowledge Bases (April 1, 2026)&lt;/H3&gt;
&lt;P&gt;The final episode shows you how to bring it all together. You’ll learn how agents query across multiple knowledge sources through a single knowledge base endpoint and how to synthesize answers from diverse data.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;What you’ll learn:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;How to query a knowledge base from your agent code&lt;/LI&gt;
&lt;LI&gt;How retrieval works across multiple connected sources&lt;/LI&gt;
&lt;LI&gt;Techniques for synthesizing information to answer complex questions&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Get Hands-On with the Cookbooks&lt;/H2&gt;
&lt;P&gt;Every episode comes with a companion cookbook in the GitHub repo, complete with sample code you can clone, run, and modify. This is the fastest way to go from watching to building.&lt;/P&gt;
&lt;P&gt;👉 &lt;STRONG&gt;Explore the repo:&lt;/STRONG&gt; &lt;A href="https://aka.ms/iq-series" target="_blank" rel="noopener"&gt;https://aka.ms/iq-series&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Inside you’ll find:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Episode links&lt;/STRONG&gt; — watch the tech talks and doodle recaps&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cookbooks&lt;/STRONG&gt; — step-by-step code samples for each episode&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Documentation links&lt;/STRONG&gt; — official Foundry IQ docs and additional learning resources&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;What to Build Next&lt;/H2&gt;
&lt;P&gt;Once you’ve worked through the series, try applying what you’ve learned:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Study assistant&lt;/STRONG&gt; — connect your course materials as knowledge sources and build an agent that can answer questions across all your notes and readings.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Project documentation bot&lt;/STRONG&gt; — index your team’s project docs and READMEs into a knowledge base so everyone can query them naturally.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Research synthesizer&lt;/STRONG&gt; — connect multiple data sources (papers, web content, datasets) and build an agent that can cross-reference and summarize findings.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Start Learning&lt;/H2&gt;
&lt;P&gt;The IQ Series is designed to take you from zero to building knowledge-driven AI agents. Watch the episodes, run the cookbooks, and start experimenting with your own knowledge bases.&lt;/P&gt;
&lt;P&gt;👉 &lt;A href="https://aka.ms/iq-series" target="_blank" rel="noopener"&gt;https://aka.ms/iq-series&lt;/A&gt;&lt;/P&gt;
&lt;div data-video-id="https://youtu.be/G1LN2TQGI1M/1773645220255" data-video-remote-vid="https://youtu.be/G1LN2TQGI1M/1773645220255" class="lia-video-container lia-media-is-center lia-media-size-large"&gt;&lt;iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FG1LN2TQGI1M%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DG1LN2TQGI1M&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FG1LN2TQGI1M%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" allowfullscreen="" style="max-width: 100%"&gt;&lt;/iframe&gt;&lt;/div&gt;</description>
      <pubDate>Tue, 17 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/foundry-iq-give-your-ai-agents-a-knowledge-upgrade/ba-p/4502615</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-03-17T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Microsoft Foundry Model Router: A Developer's Guide to Smarter AI Routing</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/microsoft-foundry-model-router-a-developer-s-guide-to-smarter-ai/ba-p/4502133</link>
      <description>&lt;H2&gt;Introduction&lt;/H2&gt;
&lt;P&gt;When building AI-powered applications on Azure, one of the most impactful decisions you make isn't which model to use; it's how your application &lt;EM&gt;selects&lt;/EM&gt; models at runtime. &lt;STRONG&gt;Microsoft Foundry Model Router&lt;/STRONG&gt;, available through Microsoft Foundry, automatically routes your inference requests to the best available model based on prompt complexity, latency targets, and cost efficiency. But how do you know it's actually routing correctly? And how do you compare its behavior across different API paths?&lt;/P&gt;
&lt;P&gt;That's exactly the problem &lt;STRONG&gt;RouteLens&lt;/STRONG&gt; solves. It's an open-source Node.js CLI and web-based testing tool that sends configurable prompts through two distinct Azure AI runtime paths and produces a detailed comparison of routing decisions, latency profiles, and reliability metrics.&lt;/P&gt;
&lt;P&gt;In this post, we'll walk through what Model Router does, why it matters, how to use the validator tool, and best practices for designing applications that get the most out of intelligent model routing.&lt;/P&gt;
&lt;H2&gt;What Is Microsoft Foundry Model Router?&lt;/H2&gt;
&lt;P&gt;Microsoft Foundry Model Router is a deployment option in Microsoft Foundry that sits between your application and a pool of AI models. Instead of hard-coding a specific model like&amp;nbsp;&lt;CODE&gt;gpt-4o&lt;/CODE&gt; or &lt;CODE&gt;gpt-4o-mini&lt;/CODE&gt;, you deploy a &lt;STRONG&gt;Model Router endpoint&lt;/STRONG&gt; and let Azure decide which underlying model serves each request.&lt;/P&gt;
&lt;H3&gt;How It Works&lt;/H3&gt;
&lt;OL&gt;
&lt;LI&gt;Your application sends an inference request to the Model Router deployment.&lt;/LI&gt;
&lt;LI&gt;Model Router analyzes the request (prompt complexity, token count, required capabilities).&lt;/LI&gt;
&lt;LI&gt;It selects the most appropriate model from the available pool.&lt;/LI&gt;
&lt;LI&gt;The response is returned transparently — your application code doesn't change.&lt;/LI&gt;
&lt;/OL&gt;
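&lt;P&gt;In code, the only thing that changes from a fixed deployment is the deployment name you target. The sketch below is a hedged illustration, not the tool's source: the endpoint, deployment name, and api-version are placeholders, and the &lt;CODE&gt;model&lt;/CODE&gt; field of the response is what a tool like RouteLens inspects to see which underlying model actually served the call:&lt;/P&gt;

```python
import json
import urllib.request

def route_request(endpoint, api_key, deployment, prompt):
    """Send one chat completion through a Model Router deployment."""
    url = (endpoint.rstrip("/") + "/openai/deployments/" + deployment
           + "/chat/completions?api-version=2024-10-21")
    payload = {"messages": [{"role": "user", "content": prompt}]}
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json", "api-key": api_key},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

def served_model(response):
    """Read which underlying model the router selected for this call."""
    return response.get("model", "unknown")

# Example response shape (abridged) and how you would read it:
sample = {"model": "gpt-4o-mini", "choices": [{"message": {"content": "hi"}}]}
assert served_model(sample) == "gpt-4o-mini"
```

&lt;P&gt;Logging &lt;CODE&gt;served_model(...)&lt;/CODE&gt; per request is the simplest way to start observing routing decisions before reaching for a full comparison tool.&lt;/P&gt;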
&lt;H3&gt;Why This Matters&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Cost optimization&lt;/STRONG&gt; — Simple prompts get routed to smaller, cheaper models. Complex prompts go to more capable (and expensive) ones.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Latency reduction&lt;/STRONG&gt; — Lightweight prompts complete faster when they don't need a heavyweight model.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Resilience&lt;/STRONG&gt; — If one model is experiencing high load or throttling, traffic can shift to alternatives.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Simplified application code&lt;/STRONG&gt; — No need to build your own model-selection logic.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;The Two Runtime Paths&lt;/H2&gt;
&lt;P&gt;Microsoft Foundry offers two distinct endpoint configurations for reaching Model Router. Even though both use the Chat Completions API, they may exhibit different routing behavior:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Path&lt;/th&gt;&lt;th&gt;SDK&lt;/th&gt;&lt;th&gt;Endpoint&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;AOAI + Chat Completions&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;OpenAI JS SDK&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;https://.cognitiveservices.azure.com/openai/deployments/&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Foundry Project + Chat Completions&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;OpenAI JS SDK (separate client)&lt;/td&gt;&lt;td&gt;&lt;CODE&gt;https://.cognitiveservices.azure.com/openai/deployments/&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;Understanding whether these two paths produce the same routing decisions is critical for production applications. If the same prompt routes to different models depending on which endpoint you use, that's a signal you need to investigate.&lt;/P&gt;
&lt;H2&gt;Introducing RouteLens&lt;/H2&gt;
&lt;P&gt;RouteLens is a Node.js tool that automates this comparison. It:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Sends a configurable set of prompts&lt;/STRONG&gt; across categories (echo, summarize, code, reasoning) through both paths.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Logs every response&lt;/STRONG&gt; to structured JSONL files for post-hoc analysis.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Computes statistics&lt;/STRONG&gt; including p50/p95 latency, error rates, and model-choice distribution.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Highlights routing differences&lt;/STRONG&gt; — where the same prompt was served by different models across paths.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Provides a web dashboard&lt;/STRONG&gt; for interactive testing and real-time result visualization.&lt;/LI&gt;
&lt;/OL&gt;
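&lt;P&gt;The p50/p95 figures in step 3 can be computed with a simple nearest-rank percentile. The following is an illustrative sketch of that calculation, not RouteLens's actual implementation (the &lt;CODE&gt;latencyMs&lt;/CODE&gt; and &lt;CODE&gt;error&lt;/CODE&gt; field names are assumptions):&lt;/P&gt;

```javascript
// Nearest-rank percentile over an array of latency samples (ms).
function percentile(samples, p) {
  if (samples.length === 0) return NaN;
  const sorted = [...samples].sort((a, b) => a - b);
  const idx = Math.ceil((p / 100) * sorted.length) - 1;
  return sorted[Math.max(0, idx)];
}

// Summarize one path/category bucket of results.
function summarize(results) {
  const ok = results.filter((r) => !r.error);
  return {
    errorRate: (results.length - ok.length) / results.length,
    p50: percentile(ok.map((r) => r.latencyMs), 50),
    p95: percentile(ok.map((r) => r.latencyMs), 95),
  };
}
```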
&lt;H3&gt;The Web Dashboard&lt;/H3&gt;
&lt;P&gt;The built-in web UI makes it easy to run tests and explore results without parsing log files:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/01-dashboard-overview.png" alt="Dashboard Overview" /&gt;&lt;/P&gt;
&lt;P&gt;The dashboard includes:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;KPI Dashboard&lt;/STRONG&gt; — Key metrics at a glance: Success Rate, Avg TPS, Gen TPS, Peak TPS, Fastest Response, p50/p95 Latency, Most Reliable Path, Total Tokens&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Summary view&lt;/STRONG&gt; — Per-path/per-category stats with success rate, TPS, and latency&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Model Comparison&lt;/STRONG&gt; — Side-by-side view of which models were selected by each path&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Latency Charts&lt;/STRONG&gt; — Visual bar charts comparing p50 and p95 latencies&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Error Analysis&lt;/STRONG&gt; — Error distribution and detailed error messages&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Live Feed&lt;/STRONG&gt; — Real-time streaming of results as they come in&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Log Viewer&lt;/STRONG&gt; — Browse and inspect historical JSONL log files&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Model Comparison&lt;/STRONG&gt; — See which models were selected by each routing path for every prompt:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/02-model-comparison.png" alt="Model Comparison" /&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Live Feed&lt;/STRONG&gt; — Real-time streaming of results as they come in:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/05-live-feed.png" alt="Live Feed" /&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Log Viewer&lt;/STRONG&gt; — Browse and inspect historical JSONL log files with parsed table views:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/03-log-viewer.png" alt="Log Viewer" /&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Mobile Responsive&lt;/STRONG&gt; — The UI adapts to smaller screens:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/06-mobile-responsive.png" alt="Mobile Responsive View" width="400" /&gt;&lt;/P&gt;
&lt;H2&gt;Getting Started&lt;/H2&gt;
&lt;H3&gt;Prerequisites&lt;/H3&gt;
&lt;UL&gt;
&lt;LI&gt;Node.js 18+ (LTS recommended)&lt;/LI&gt;
&lt;LI&gt;An Azure subscription with a &lt;A class="lia-external-url" href="https://ai.azure.com" target="_blank" rel="noopener"&gt;Foundry project&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Model Router deployed in your Foundry project&lt;/LI&gt;
&lt;LI&gt;An API key from your Azure OpenAI / Foundry resource&lt;/LI&gt;
&lt;LI&gt;The API version (e.g. &lt;CODE&gt;2024-05-01-preview&lt;/CODE&gt;)&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Setup&lt;/H3&gt;
&lt;LI-CODE lang=""&gt;# Clone and install
git clone https://github.com/leestott/modelrouter-routelens/
cd modelrouter-routelens
npm install

# Configure your endpoints
cp .env.example .env
# Edit .env with your Azure endpoints (see below)&lt;/LI-CODE&gt;
&lt;H3&gt;Configuration&lt;/H3&gt;
&lt;P&gt;The &lt;CODE&gt;.env&lt;/CODE&gt; file needs these key settings:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;# Your Foundry / Cognitive Services deployment endpoint
# Format: https://&amp;lt;resource&amp;gt;.cognitiveservices.azure.com/openai/deployments/&amp;lt;deployment&amp;gt;
# Do NOT include /chat/completions or ?api-version
FOUNDRY_PROJECT_ENDPOINT=https://&amp;lt;resource&amp;gt;.cognitiveservices.azure.com/openai/deployments/model-router
AOAI_BASE_URL=https://&amp;lt;resource&amp;gt;.cognitiveservices.azure.com/openai/deployments/model-router

# API key from your Azure OpenAI / Foundry resource
AOAI_API_KEY=your-api-key-here

# Azure OpenAI API version
AOAI_API_VERSION=2024-05-01-preview&lt;/LI-CODE&gt;
&lt;H3&gt;Running Tests&lt;/H3&gt;
&lt;LI-CODE lang=""&gt;# Full test matrix — sends all prompts through both paths
npm run run:matrix

# 408 timeout diagnostic — focuses on the Responses path timeout issue
npm run run:repro408

# Web UI — interactive dashboard
npm run ui
# Then open http://localhost:3002 (or the port set in UI_PORT)&lt;/LI-CODE&gt;
&lt;H2&gt;Understanding the Results&lt;/H2&gt;
&lt;H3&gt;Latency Comparison&lt;/H3&gt;
&lt;P&gt;The latency charts show p50 (median) and p95 (tail) latency for each path and prompt category:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/03-latency-charts.png" alt="Latency Charts" /&gt;&lt;/P&gt;
&lt;P&gt;Key things to look for:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Large p50 differences&lt;/STRONG&gt; between paths suggest one path has consistently higher overhead.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;High p95 values&lt;/STRONG&gt; indicate tail latency problems — possibly timeouts or retries.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Category-specific patterns&lt;/STRONG&gt; — If code prompts are slow on one path but fast on another, that's a routing difference worth investigating.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Model Comparison&lt;/H3&gt;
&lt;P&gt;The model comparison view shows which models were selected for each prompt:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/02-model-comparison.png" alt="Model Comparison" /&gt;&lt;/P&gt;
&lt;P&gt;When both paths select the same model, you see a green "Match" indicator. When they differ, it's flagged in red — these are the cases you want to investigate.&lt;/P&gt;
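&lt;P&gt;Flagging these mismatches is a simple grouping exercise over the JSONL logs. A minimal sketch, assuming each record carries &lt;CODE&gt;prompt&lt;/CODE&gt;, &lt;CODE&gt;path&lt;/CODE&gt;, and &lt;CODE&gt;model&lt;/CODE&gt; fields (an assumption about the log schema, not RouteLens's actual code):&lt;/P&gt;

```javascript
// Group log records by prompt and report cases where the two
// runtime paths selected different backend models.
function findRoutingMismatches(records) {
  const byPrompt = new Map();
  for (const r of records) {
    if (!byPrompt.has(r.prompt)) byPrompt.set(r.prompt, {});
    byPrompt.get(r.prompt)[r.path] = r.model;
  }
  const mismatches = [];
  for (const [prompt, models] of byPrompt) {
    const unique = new Set(Object.values(models));
    if (unique.size > 1) mismatches.push({ prompt, models });
  }
  return mismatches;
}
```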
&lt;H3&gt;Error Analysis&lt;/H3&gt;
&lt;P&gt;The errors view helps diagnose reliability issues:&lt;/P&gt;
&lt;P&gt;&lt;IMG src="https://github.com/leestott/modelrouter-routelens/raw/main/screenshots/04-errors-view.png" alt="Error Analysis" /&gt;&lt;/P&gt;
&lt;P&gt;Common error patterns:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;408 Timeout&lt;/STRONG&gt; — The Responses path may take longer for certain prompt categories&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;401 Unauthorized&lt;/STRONG&gt; — Authentication configuration issues&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;429 Rate Limited&lt;/STRONG&gt; — You're hitting throughput limits&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;500 Internal Server Error&lt;/STRONG&gt; — Backend model issues&lt;/LI&gt;
&lt;/UL&gt;
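&lt;P&gt;A tiny classifier along these lines (an illustrative sketch, not RouteLens's code) is enough to bucket failures for an error view:&lt;/P&gt;

```javascript
// Bucket an HTTP status code into the error categories listed above.
function classifyError(status) {
  switch (status) {
    case 408: return "timeout";
    case 401: return "auth";
    case 429: return "rate-limit";
    default:  return status >= 500 ? "server" : "other";
  }
}
```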
&lt;H2&gt;Best Practices for Designing Applications with Model Router&lt;/H2&gt;
&lt;H3&gt;1. Design Prompts with Routing in Mind&lt;/H3&gt;
&lt;P&gt;Model Router makes decisions based on prompt characteristics. To get the best routing:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Keep prompts focused&lt;/STRONG&gt; — A clear, single-purpose prompt is easier for the router to classify than a multi-part prompt that spans multiple complexity levels.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Use system messages effectively&lt;/STRONG&gt; — A well-structured system message helps the router understand the task complexity.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Separate complex chains&lt;/STRONG&gt; — If you have a multi-step workflow, make each step a separate API call rather than one massive prompt. This lets the router use a cheaper model for simple steps.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;2. Set Appropriate Timeouts&lt;/H3&gt;
&lt;P&gt;Different models have different latency profiles. Your timeout settings should account for the &lt;EM&gt;slowest&lt;/EM&gt; model the router might select:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;// Too aggressive — may timeout when routed to a larger model
const TIMEOUT = 5000;  // 5s

// Better — allows headroom for model variation
const TIMEOUT = 30000; // 30s

// Best — use different timeouts based on expected complexity
function getTimeout(category) {
  switch (category) {
    case 'echo': return 10000;
    case 'summarize': return 20000;
    case 'code': return 45000;
    case 'reasoning': return 60000;
    default: return 30000;
  }
}&lt;/LI-CODE&gt;
&lt;H3&gt;3. Implement Robust Retry Logic&lt;/H3&gt;
&lt;P&gt;Because the router may select different models on retry, transient failures can resolve themselves:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;async function callWithRetry(prompt, maxRetries = 3) {
  for (let attempt = 0; attempt &amp;lt; maxRetries; attempt++) {
    try {
      return await client.chat.completions.create({
        model: 'model-router',
        messages: [{ role: 'user', content: prompt }],
      });
    } catch (err) {
      if (attempt === maxRetries - 1) throw err;
      // Exponential backoff
      await new Promise(r =&amp;gt; setTimeout(r, 1000 * Math.pow(2, attempt)));
    }
  }
}&lt;/LI-CODE&gt;
&lt;H3&gt;4. Monitor Model Selection in Production&lt;/H3&gt;
&lt;P&gt;Log which model was selected for each request so you can track routing patterns over time:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;const response = await client.chat.completions.create({
  model: 'model-router',
  messages: [{ role: 'user', content: prompt }],
});

// The model field in the response tells you which model was actually used
console.log(`Routed to: ${response.model}`);
console.log(`Tokens: ${response.usage.total_tokens}`);&lt;/LI-CODE&gt;
&lt;H3&gt;5. Use the Right API Path for Your Use Case&lt;/H3&gt;
&lt;P&gt;Based on our testing with RouteLens, consider:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Chat Completions path&lt;/STRONG&gt; — The standard path for chat-style interactions. Uses the &lt;CODE&gt;openai&lt;/CODE&gt; SDK directly.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Foundry Project path&lt;/STRONG&gt; — Uses the same Chat Completions API but through the Foundry project endpoint. Useful for comparing routing behaviour across different endpoint configurations.&lt;/LI&gt;
&lt;/UL&gt;
&lt;BLOCKQUOTE&gt;&lt;STRONG&gt;Note:&lt;/STRONG&gt; The Responses API (&lt;CODE&gt;/responses&lt;/CODE&gt;) is not currently available on &lt;CODE&gt;cognitiveservices.azure.com&lt;/CODE&gt; Model Router deployments. Both paths in RouteLens use Chat Completions.&lt;/BLOCKQUOTE&gt;
&lt;H3&gt;6. Test Before You Ship&lt;/H3&gt;
&lt;P&gt;Run RouteLens as part of your pre-production validation:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;# In your CI/CD pipeline or pre-deployment check
npm run run:matrix -- --runs 10 --concurrency 4&lt;/LI-CODE&gt;
&lt;P&gt;This helps you:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;Catch routing regressions when Azure updates model pools&lt;/LI&gt;
&lt;LI&gt;Verify that your prompt changes don't cause unexpected model selection shifts&lt;/LI&gt;
&lt;LI&gt;Establish latency baselines for alerting&lt;/LI&gt;
&lt;/UL&gt;
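&lt;P&gt;In CI, those checks reduce to comparing the run's summary against baseline thresholds and failing the build on regression. A sketch of that gate (the &lt;CODE&gt;mismatchRate&lt;/CODE&gt;/&lt;CODE&gt;p95&lt;/CODE&gt; field names and threshold values are illustrative assumptions):&lt;/P&gt;

```javascript
// Compare a run's summary stats against baseline limits.
// A non-empty return value means the build should fail
// (e.g. print the messages and call process.exit(1)).
function checkBaselines(stats, limits) {
  const failures = [];
  if (stats.mismatchRate > limits.maxMismatchRate)
    failures.push(`mismatch rate ${stats.mismatchRate} > ${limits.maxMismatchRate}`);
  if (stats.p95 > limits.maxP95Ms)
    failures.push(`p95 ${stats.p95}ms > ${limits.maxP95Ms}ms`);
  return failures;
}
```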
&lt;H2&gt;Architecture Overview&lt;/H2&gt;
&lt;P&gt;RouteLens sends configurable prompts through two distinct Azure AI runtime paths and compares routing decisions, latency, and reliability. The &lt;STRONG&gt;Matrix Runner&lt;/STRONG&gt; dispatches prompts to both the &lt;STRONG&gt;Chat Completions Client&lt;/STRONG&gt; (OpenAI JS SDK → AOAI endpoint) and the &lt;STRONG&gt;Project Responses Client&lt;/STRONG&gt; (&lt;CODE&gt;@azure/ai-projects&lt;/CODE&gt; → Foundry endpoint). Both paths converge at &lt;STRONG&gt;Azure Model Router&lt;/STRONG&gt;, which intelligently selects the optimal backend model. Results are logged to JSONL files and rendered in the web dashboard.&lt;/P&gt;
&lt;H2&gt;Key Benefits of Model Router&lt;/H2&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Benefit&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Cost savings&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Automatically routes simple prompts to cheaper models, reducing spend by 30-50% in typical workloads&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Lower latency&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Simple prompts complete faster on lightweight models&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Zero code changes&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Same API contract as a standard model deployment — just change the deployment name&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Future-proof&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;As Azure adds new models to the pool, your application benefits automatically&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Built-in resilience&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Routing adapts to model availability and load conditions&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Conclusion&lt;/H2&gt;
&lt;P&gt;Azure Model Router represents a shift from "pick a model" to "describe your task and let the platform decide." This is a natural evolution for AI applications — just as cloud platforms abstract away server selection, Model Router abstracts away model selection.&lt;/P&gt;
&lt;P&gt;RouteLens gives you the visibility to trust that abstraction. By systematically comparing routing behavior across API paths and prompt categories, you can deploy Model Router with confidence and catch issues before your users do.&lt;/P&gt;
&lt;P&gt;The tool is open source under the MIT license. Try it out, file issues, and contribute improvements:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://github.com/leestott/modelrouter-routelens" target="_blank"&gt;GitHub Repository&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/foundry/openai/how-to/model-router" target="_blank" rel="noopener"&gt;Model Router Documentation&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://ai.azure.com" target="_blank" rel="noopener"&gt;Microsoft Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;HR /&gt;
</description>
      <pubDate>Mon, 23 Mar 2026 15:17:48 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/microsoft-foundry-model-router-a-developer-s-guide-to-smarter-ai/ba-p/4502133</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-03-23T15:17:48Z</dc:date>
    </item>
    <item>
      <title>Build a Fully Offline RAG App with Foundry Local: No Cloud Required</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/build-a-fully-offline-rag-app-with-foundry-local-no-cloud/ba-p/4499964</link>
      <description>&lt;HEADER class="hero"&gt;
&lt;P class="subtitle"&gt;A practical guide to building an on-device AI support agent using Retrieval-Augmented Generation, JavaScript, and Microsoft Foundry Local.&lt;/P&gt;
&lt;/HEADER&gt;
&lt;ARTICLE&gt;&lt;!-- Intro --&gt;
&lt;H2&gt;The Problem: AI That Can't Go Offline&lt;/H2&gt;
&lt;P&gt;Most AI-powered applications today are firmly tethered to the cloud. They assume stable internet, low-latency API calls, and the comfort of a managed endpoint. But what happens when your users are in an environment with &lt;STRONG&gt;zero connectivity&lt;/STRONG&gt;: a gas pipeline in a remote field, a factory floor, an underground facility?&lt;/P&gt;
&lt;P&gt;That's exactly the scenario that motivated this project: a &lt;STRONG&gt;fully offline RAG-powered support agent&lt;/STRONG&gt; that runs entirely on a laptop. No cloud. No API keys. No outbound network calls. Just a local model, a local vector store, and domain-specific documents, all accessible from a browser on any device.&lt;/P&gt;
&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/01-landing-page.png" alt="Landing page of the Gas Field Support Agent showing a dark-themed UI with quick-action buttons and chat input" /&gt;
&lt;P class="img-caption"&gt;The Gas Field Support Agent - running entirely on-device&lt;/P&gt;
&lt;!-- What is RAG --&gt;
&lt;H2&gt;What is RAG and Why Should You Care?&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;Retrieval-Augmented Generation (RAG)&lt;/STRONG&gt; is a pattern that makes language models genuinely useful for domain-specific tasks. Instead of hoping the model "knows" the answer from pre-training, you:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Retrieve&lt;/STRONG&gt; relevant chunks from your own documents&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Augment&lt;/STRONG&gt; the model's prompt with those chunks as context&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Generate&lt;/STRONG&gt; a response grounded in your actual data&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;The result: fewer hallucinations, traceable answers, and an AI that works with &lt;EM&gt;your&lt;/EM&gt; content. If you're building internal tools, customer support bots, field manuals, or knowledge bases, RAG is the pattern you want.&lt;/P&gt;
&lt;DIV class="callout callout-green"&gt;&lt;STRONG&gt;Why fully offline?&lt;/STRONG&gt; Data sovereignty, air-gapped environments, field operations, latency-sensitive workflows, and regulatory constraints all demand AI that doesn't phone home. Running everything locally gives you complete control over your data and eliminates any external dependency.&lt;/DIV&gt;
&lt;!-- The Stack --&gt;
&lt;H2&gt;The Tech Stack&lt;/H2&gt;
&lt;P&gt;This project is deliberately simple — no frameworks, no build steps, no Docker:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="stack-table" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Layer&lt;/th&gt;&lt;th&gt;Technology&lt;/th&gt;&lt;th&gt;Why&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;AI Model&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;A href="https://foundrylocal.ai" target="_blank" rel="noopener"&gt;Foundry Local&lt;/A&gt; + Phi-3.5 Mini&lt;/td&gt;&lt;td&gt;Runs locally, OpenAI-compatible API, no GPU needed&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Backend&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Node.js + Express&lt;/td&gt;&lt;td&gt;Lightweight, fast, universally known&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Vector Store&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;SQLite via &lt;CODE&gt;better-sqlite3&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;Zero infrastructure, single file on disk&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Retrieval&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;TF-IDF + cosine similarity&lt;/td&gt;&lt;td&gt;No embedding model required, fully offline&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;Frontend&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Single HTML file with inline CSS&lt;/td&gt;&lt;td&gt;No build step, mobile-responsive, field-ready&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;The total dependency footprint is just &lt;STRONG&gt;four npm packages&lt;/STRONG&gt;: &lt;CODE&gt;express&lt;/CODE&gt;, &lt;CODE&gt;openai&lt;/CODE&gt;, &lt;CODE&gt;foundry-local-sdk&lt;/CODE&gt;, and &lt;CODE&gt;better-sqlite3&lt;/CODE&gt;.&lt;/P&gt;
&lt;!-- Architecture --&gt;
&lt;H2&gt;Architecture Overview&lt;/H2&gt;
&lt;P&gt;The system has five layers — all running on a single machine:&lt;/P&gt;
&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/07-architecture-diagram.png" alt="Architecture diagram showing Client, Server, RAG Pipeline, Data, and AI layers" /&gt;
&lt;P class="img-caption"&gt;Five-layer architecture: Client → Server → RAG Pipeline → Data → AI Model&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Client Layer&lt;/STRONG&gt; — A single HTML file served by Express, with quick-action buttons and responsive chat&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Server Layer&lt;/STRONG&gt; — Express.js handles API routes for chat (streaming + non-streaming), document upload, and health checks&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;RAG Pipeline&lt;/STRONG&gt; — The chat engine orchestrates retrieval and generation; the chunker handles TF-IDF vectorization&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Data Layer&lt;/STRONG&gt; — SQLite stores document chunks and their TF-IDF vectors; source docs live as &lt;CODE&gt;.md&lt;/CODE&gt; files&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;AI Layer&lt;/STRONG&gt; — Foundry Local runs Phi-3.5 Mini Instruct on CPU/NPU, exposing an OpenAI-compatible API&lt;/LI&gt;
&lt;/UL&gt;
&lt;!-- Getting Started --&gt;
&lt;H2&gt;Getting Started in 5 Minutes&lt;/H2&gt;
&lt;P&gt;You need two prerequisites:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Node.js 20+&lt;/STRONG&gt; — &lt;A href="https://nodejs.org/" target="_blank" rel="noopener"&gt;nodejs.org&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Foundry Local&lt;/STRONG&gt; — Microsoft's on-device AI runtime:&lt;/LI&gt;
&lt;/OL&gt;
&lt;DIV class="code-label"&gt;Terminal&lt;/DIV&gt;
&lt;LI-CODE lang=""&gt;winget install Microsoft.FoundryLocal&lt;/LI-CODE&gt;
&lt;P&gt;Then clone, install, ingest, and run:&lt;/P&gt;
&lt;DIV class="code-label"&gt;&lt;LI-CODE lang=""&gt;git clone https://github.com/leestott/local-rag.git
cd local-rag
npm install
npm run ingest   # Index the 20 gas engineering documents
npm start        # Start the server + Foundry Local&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/DIV&gt;
&lt;P&gt;Open &lt;CODE&gt;http://127.0.0.1:3000&lt;/CODE&gt; and start chatting. Foundry Local auto-downloads Phi-3.5 Mini (~2 GB) on first run.&lt;/P&gt;
&lt;!-- RAG Pipeline Deep Dive --&gt;
&lt;H2&gt;How the RAG Pipeline Works&lt;/H2&gt;
&lt;P&gt;Let's trace what happens when a user asks: &lt;STRONG&gt;"How do I detect a gas leak?"&lt;/STRONG&gt;&lt;/P&gt;
&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/08-rag-flow-sequence.png" alt="Sequence diagram showing the RAG query flow from browser to model" /&gt;
&lt;P class="img-caption"&gt;RAG query flow: Browser → Server → Vector Store → Model → Streaming response&lt;/P&gt;
&lt;H3&gt;Step 1: Document Ingestion&lt;/H3&gt;
&lt;P&gt;Before any queries happen, &lt;CODE&gt;npm run ingest&lt;/CODE&gt; reads every &lt;CODE&gt;.md&lt;/CODE&gt; file from the &lt;CODE&gt;docs/&lt;/CODE&gt; folder, splits each into overlapping chunks (~200 tokens, 25-token overlap), computes a TF-IDF vector for each chunk, and stores everything in SQLite.&lt;/P&gt;
&lt;DIV class="code-label"&gt;Chunking example&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE&gt;docs/01-gas-leak-detection.md
  → Chunk 1: "Gas Leak Detection – Safety Warnings: Ensure all ignition..."
  → Chunk 2: "...sources are eliminated. Step-by-step: 1. Perform visual..."
  → Chunk 3: "...inspection of all joints. 2. Check calibration date..."&lt;/CODE&gt;&lt;/PRE&gt;
&lt;P&gt;The overlap ensures no information falls between chunk boundaries — a critical detail in any RAG system.&lt;/P&gt;
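&lt;P&gt;The overlapping split can be sketched as a token-window walk. A simplified version, where "tokens" are whitespace-separated words (an approximation; the real chunker may tokenize differently):&lt;/P&gt;

```javascript
// Split text into windows of `size` tokens, each window overlapping
// the previous one by `overlap` tokens so no sentence is cut off
// without context on either side.
function chunkText(text, size = 200, overlap = 25) {
  const tokens = text.split(/\s+/).filter(Boolean);
  const chunks = [];
  const step = size - overlap;
  for (let start = 0; start < tokens.length; start += step) {
    chunks.push(tokens.slice(start, start + size).join(" "));
    if (start + size >= tokens.length) break; // last window reached the end
  }
  return chunks;
}
```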
&lt;H3&gt;Step 2: Query → Retrieval&lt;/H3&gt;
&lt;P&gt;When the user sends a question, the server converts it into a TF-IDF vector, compares it against every stored chunk using cosine similarity, and returns the top-K most relevant results. For 20 documents (~200 chunks), this executes in &lt;STRONG&gt;under 10ms&lt;/STRONG&gt;.&lt;/P&gt;
&lt;DIV class="code-label"&gt;src/vectorStore.js&lt;/DIV&gt;
&lt;LI-CODE lang=""&gt;/** Retrieve top-K most relevant chunks for a query. */
search(query, topK = 5) {
  const queryTf = termFrequency(query);
  const rows = this.db.prepare("SELECT * FROM chunks").all();

  const scored = rows.map((row) =&amp;gt; {
    const chunkTf = new Map(JSON.parse(row.tf_json));
    const score = cosineSimilarity(queryTf, chunkTf);
    return { ...row, score };
  });

  scored.sort((a, b) =&amp;gt; b.score - a.score);
  return scored.slice(0, topK).filter((r) =&amp;gt; r.score &amp;gt; 0);
}&lt;/LI-CODE&gt;
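&lt;P&gt;The &lt;CODE&gt;termFrequency&lt;/CODE&gt; and &lt;CODE&gt;cosineSimilarity&lt;/CODE&gt; helpers referenced above aren't shown in the excerpt. A minimal sketch of plausible implementations over sparse &lt;CODE&gt;Map&lt;/CODE&gt; vectors (an assumption — the repo's versions may differ, e.g. by applying IDF weighting):&lt;/P&gt;

```javascript
// Term-frequency vector: lowercase word -> count, stored sparsely in a Map.
function termFrequency(text) {
  const tf = new Map();
  for (const word of text.toLowerCase().split(/\W+/).filter(Boolean)) {
    tf.set(word, (tf.get(word) || 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term-frequency Maps.
function cosineSimilarity(a, b) {
  let dot = 0, normA = 0, normB = 0;
  for (const [term, wa] of a) {
    normA += wa * wa;
    const wb = b.get(term);
    if (wb) dot += wa * wb; // only shared terms contribute to the dot product
  }
  for (const wb of b.values()) normB += wb * wb;
  return normA && normB ? dot / Math.sqrt(normA * normB) : 0;
}
```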
&lt;H3&gt;Step 3: Prompt Construction&lt;/H3&gt;
&lt;P&gt;The retrieved chunks are injected into the prompt alongside system instructions:&lt;/P&gt;
&lt;DIV class="code-label"&gt;Prompt structure&lt;/DIV&gt;
&lt;PRE&gt;&lt;CODE&gt;System: You are an offline gas field support agent. Safety-first...
Context:
  [Chunk 1: Gas Leak Detection – Safety Warnings...]
  [Chunk 2: Gas Leak Detection – Step-by-step...]
  [Chunk 3: Purging Procedures – Related safety...]
User: How do I detect a gas leak?&lt;/CODE&gt;&lt;/PRE&gt;
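&lt;P&gt;Assembling that structure takes only a few lines. A hypothetical &lt;CODE&gt;buildMessages&lt;/CODE&gt; helper (illustrative, not the repo's actual function):&lt;/P&gt;

```javascript
// Build the chat messages array from the system prompt, retrieved
// chunks, and the user's question, matching the structure above.
function buildMessages(systemPrompt, chunks, question) {
  const context = chunks
    .map((c, i) => `[Chunk ${i + 1}: ${c.text}]`)
    .join("\n");
  return [
    { role: "system", content: `${systemPrompt}\nContext:\n${context}` },
    { role: "user", content: question },
  ];
}
```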
&lt;H3&gt;Step 4: Generation + Streaming&lt;/H3&gt;
&lt;P&gt;The prompt is sent to Foundry Local via the OpenAI-compatible API. The response streams back token-by-token through Server-Sent Events (SSE) to the browser:&lt;/P&gt;
&lt;DIV class="two-col"&gt;
&lt;DIV&gt;&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/03-chat-response.png" alt="Chat response showing safety warnings and step-by-step guidance" /&gt;
&lt;P class="img-caption"&gt;Safety-first response with structured guidance&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/04-sources-panel.png" alt="Sources panel showing retrieved documents and relevance scores" /&gt;
&lt;P class="img-caption"&gt;Expandable sources with relevance scores&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
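&lt;P&gt;On the wire, each streamed token is wrapped in an SSE frame: a &lt;CODE&gt;data:&lt;/CODE&gt; line followed by a blank line. A sketch of the framing (an assumed helper, not taken from the repo):&lt;/P&gt;

```javascript
// Wrap one token delta as a Server-Sent Events frame.
function sseFrame(payload) {
  return `data: ${JSON.stringify(payload)}\n\n`;
}

// Inside the Express route, each streamed completion chunk would be
// forwarded roughly like this (sketch):
//   for await (const part of stream) {
//     const token = part.choices[0]?.delta?.content ?? "";
//     if (token) res.write(sseFrame({ token }));
//   }
//   res.write("data: [DONE]\n\n");
```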
&lt;!-- Foundry Local --&gt;
&lt;H2&gt;Foundry Local: Your Local AI Runtime&lt;/H2&gt;
&lt;P&gt;&lt;A href="https://foundrylocal.ai" target="_blank" rel="noopener"&gt;Foundry Local&lt;/A&gt; is what makes the "offline" part possible. It's a runtime from Microsoft that runs small language models (SLMs) on CPU or NPU — no GPU required. It exposes an &lt;STRONG&gt;OpenAI-compatible API&lt;/STRONG&gt; and manages model downloads, caching, and lifecycle automatically.&lt;/P&gt;
&lt;P&gt;The integration code is minimal. If you've used the OpenAI SDK before, this will feel instantly familiar:&lt;/P&gt;
&lt;DIV class="code-label"&gt;src/chatEngine.js&lt;/DIV&gt;
&lt;LI-CODE lang=""&gt;import { FoundryLocalManager } from "foundry-local-sdk";
import { OpenAI } from "openai";

// Start Foundry Local and load the model
const manager = new FoundryLocalManager();
const modelInfo = await manager.init("phi-3.5-mini");

// Use the standard OpenAI client — pointed at the local endpoint
const client = new OpenAI({
  baseURL: manager.endpoint,
  apiKey: manager.apiKey,
});

// Chat completions work exactly like the cloud API
const stream = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [
    { role: "system", content: "You are a helpful assistant." },
    { role: "user", content: "How do I detect a gas leak?" }
  ],
  stream: true,
});&lt;/LI-CODE&gt;
&lt;DIV class="callout"&gt;&lt;STRONG&gt;Portability matters&lt;/STRONG&gt; Because Foundry Local uses the OpenAI API format, any code you write here can be ported to Azure OpenAI or OpenAI's cloud API with a single config change. You're not locked in.&lt;/DIV&gt;
&lt;!-- Why TF-IDF --&gt;
&lt;H2&gt;Why TF-IDF Instead of Embeddings?&lt;/H2&gt;
&lt;P&gt;Most RAG tutorials use embedding models for retrieval. We chose TF-IDF for this project because:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Fully offline&lt;/STRONG&gt; — no embedding model to download or run&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Zero latency&lt;/STRONG&gt; — vectorization is instantaneous (just math on word frequencies)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Good enough&lt;/STRONG&gt; — for a curated collection of 20 domain-specific documents, TF-IDF retrieves the right chunks reliably&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Transparent&lt;/STRONG&gt; — you can inspect the vocabulary and weights, unlike neural embeddings&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;For larger collections (thousands of documents) or when semantic similarity matters more than keyword overlap, you'd swap in an embedding model. But for this use case, TF-IDF keeps the stack simple and dependency-free.&lt;/P&gt;
&lt;!-- Mobile-Responsive --&gt;
&lt;H2&gt;Mobile-Responsive Field UI&lt;/H2&gt;
&lt;P&gt;Field engineers use this app on phones and tablets, often while wearing gloves. The UI is designed for harsh conditions with a dark, high-contrast theme, large touch targets (minimum 48px), and horizontally scrollable quick-action buttons.&lt;/P&gt;
&lt;DIV class="two-col"&gt;
&lt;DIV&gt;&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/01-landing-page.png" alt="Desktop view of the app" /&gt;
&lt;P class="img-caption"&gt;Desktop view&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV&gt;&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/02-mobile-view.png" alt="Mobile view of the app" /&gt;
&lt;P class="img-caption"&gt;Mobile view&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;P&gt;The entire frontend is a &lt;STRONG&gt;single &lt;CODE&gt;index.html&lt;/CODE&gt; file&lt;/STRONG&gt; — no React, no build step, no bundler. This keeps the project accessible and easy to deploy anywhere.&lt;/P&gt;
&lt;!-- Runtime Upload --&gt;
&lt;H2&gt;Runtime Document Upload&lt;/H2&gt;
&lt;P&gt;Users can upload new documents without restarting the server. The upload endpoint receives markdown content, chunks it, computes TF-IDF vectors, and inserts the chunks into SQLite — all in memory, immediately available for retrieval.&lt;/P&gt;
&lt;IMG src="https://github.com/leestott/local-rag/raw/main/screenshots/05-upload-document.png" alt="Upload document modal showing the file selection and indexed document list" /&gt;
&lt;P class="img-caption"&gt;Drag-and-drop document upload with instant indexing&lt;/P&gt;
&lt;!-- Adapt to your domain --&gt;
&lt;H2&gt;Adapt This for Your Own Domain&lt;/H2&gt;
&lt;P&gt;This project is a &lt;STRONG&gt;scenario sample&lt;/STRONG&gt;, designed to be forked and customized. Here's the three-step process:&lt;/P&gt;
&lt;H3&gt;1. Replace the Documents&lt;/H3&gt;
&lt;P&gt;Delete the gas engineering docs in &lt;CODE&gt;docs/&lt;/CODE&gt; and add your own &lt;CODE&gt;.md&lt;/CODE&gt; files with optional YAML front-matter:&lt;/P&gt;
&lt;DIV class="code-label"&gt;docs/my-procedure.md&lt;/DIV&gt;
&lt;LI-CODE lang=""&gt;---
title: Troubleshooting Widget Errors
category: Support
id: KB-001
---

# Troubleshooting Widget Errors
...your content here...&lt;/LI-CODE&gt;
&lt;H3&gt;2. Edit the System Prompt&lt;/H3&gt;
&lt;P&gt;Open &lt;CODE&gt;src/prompts.js&lt;/CODE&gt; and rewrite the instructions for your domain:&lt;/P&gt;
&lt;DIV class="code-label"&gt;src/prompts.js&lt;/DIV&gt;
&lt;LI-CODE lang=""&gt;export const SYSTEM_PROMPT = `You are an offline support agent for [YOUR DOMAIN].

Rules:
- Only answer using the retrieved context
- If the answer isn't in the context, say so
- Use structured responses: Summary → Details → Reference
`;&lt;/LI-CODE&gt;
&lt;H3&gt;3. Tune the Retrieval&lt;/H3&gt;
&lt;P&gt;Adjust chunking and retrieval parameters in &lt;CODE&gt;src/config.js&lt;/CODE&gt;:&lt;/P&gt;
&lt;DIV class="code-label"&gt;src/config.js&lt;BR /&gt;&lt;BR /&gt;&lt;LI-CODE lang=""&gt;export const config = {
  model: "phi-3.5-mini",
  chunkSize: 200,      // smaller = more precise, less context per chunk
  chunkOverlap: 25,    // prevents info from falling between chunks
  topK: 3,             // chunks per query (more = richer context, slower)
};&lt;/LI-CODE&gt;&lt;/DIV&gt;
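&lt;P&gt;To make the &lt;CODE&gt;topK&lt;/CODE&gt; knob concrete, here is one way retrieval could rank chunks. This is a sketch under stated assumptions (cosine similarity over raw term frequencies, no IDF weighting, a local &lt;CODE&gt;config&lt;/CODE&gt; stand-in), not the repository's actual implementation.&lt;/P&gt;

```javascript
// Sketch: score every chunk against the query, keep the topK best as context.
// Assumed shapes throughout — not the repo's actual functions.

const config = { topK: 3 }; // stand-in for the value in src/config.js

// Sparse term-frequency vector for a piece of text.
function vec(text) {
  const tf = new Map();
  for (const w of text.toLowerCase().match(/[a-z0-9]+/g) ?? []) {
    tf.set(w, (tf.get(w) ?? 0) + 1);
  }
  return tf;
}

// Cosine similarity between two sparse term-frequency Maps.
function cosine(a, b) {
  let dot = 0, na = 0, nb = 0;
  for (const [t, v] of a) { na += v * v; if (b.has(t)) dot += v * b.get(t); }
  for (const v of b.values()) nb += v * v;
  return dot ? dot / (Math.sqrt(na) * Math.sqrt(nb)) : 0;
}

// Rank chunks by similarity to the query and return the topK.
function retrieve(query, chunks) {
  const q = vec(query);
  return chunks
    .map((c) => ({ chunk: c, score: cosine(q, vec(c)) }))
    .sort((x, y) => y.score - x.score)
    .slice(0, config.topK)
    .map((r) => r.chunk);
}
```

Raising &lt;CODE&gt;topK&lt;/CODE&gt; widens the context window handed to the model at the cost of longer prompts and slower generation.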
&lt;!-- Multi-Agent Extension --&gt;
&lt;H2&gt;Extending to Multi-Agent Architectures&lt;/H2&gt;
&lt;P&gt;Once you have a working RAG agent, the natural next step is &lt;STRONG&gt;multi-agent orchestration&lt;/STRONG&gt;, where specialized agents collaborate to handle complex workflows. With Foundry Local's OpenAI-compatible API, you can compose multiple agent roles on the same machine:&lt;/P&gt;
&lt;DIV class="code-label"&gt;Multi-agent concept&lt;/DIV&gt;
&lt;LI-CODE lang=""&gt;// Each agent is just a different system prompt + RAG scope
const agents = {
  safety:    { prompt: safetyPrompt,    docs: "safety/*.md" },
  diagnosis: { prompt: diagnosisPrompt, docs: "faults/*.md" },
  procedure: { prompt: procedurePrompt, docs: "procedures/*.md" },
};

// Router determines which agent handles the query
function route(query) {
  if (query.match(/safety|warning|hazard/i)) return agents.safety;
  if (query.match(/fault|error|code/i))      return agents.diagnosis;
  return agents.procedure;
}

// Pick the agent for this query, then call the same Foundry Local model endpoint
const selectedAgent = route(userQuery);
const response = await client.chat.completions.create({
  model: modelInfo.id,
  messages: [
    { role: "system", content: selectedAgent.prompt },
    { role: "system", content: `Context:\n${retrievedChunks}` },
    { role: "user", content: userQuery }
  ],
  stream: true,
});&lt;/LI-CODE&gt;
&lt;P&gt;This pattern lets you build &lt;STRONG&gt;specialized agent pipelines&lt;/STRONG&gt;: a triage agent routes to the right specialist, each with its own document scope and system prompt, all running on the same local Foundry instance. For production multi-agent systems, explore &lt;A href="https://learn.microsoft.com/azure/ai-foundry/" target="_blank" rel="noopener"&gt;Microsoft Foundry&lt;/A&gt; for cloud-scale orchestration when connectivity is available.&lt;/P&gt;
&lt;DIV class="callout callout-orange"&gt;&lt;STRONG&gt;Local-first, cloud-ready&lt;/STRONG&gt; Start with Foundry Local for development and offline scenarios. When your agents need cloud scale, swap to Azure AI Foundry with the same OpenAI-compatible API&amp;nbsp; your agent code stays the same.&lt;/DIV&gt;
&lt;!-- Key Takeaways --&gt;
&lt;H2&gt;Key Takeaways&lt;/H2&gt;
&lt;DIV class="takeaway-grid"&gt;
&lt;DIV class="takeaway-card"&gt;
&lt;DIV class="num"&gt;&lt;STRONG&gt;1 RAG = Retrieve + Augment + Generate&lt;/STRONG&gt;&lt;/DIV&gt;
&lt;P&gt;Ground your AI in real documents — dramatically reducing hallucination and making answers traceable.&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV class="takeaway-card"&gt;
&lt;DIV class="num"&gt;&lt;STRONG&gt;2 Foundry Local makes local AI accessible&lt;/STRONG&gt;&lt;/DIV&gt;
&lt;P&gt;OpenAI-compatible API running on CPU/NPU. No GPU required. No cloud dependency.&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV class="takeaway-card"&gt;
&lt;DIV class="num"&gt;&lt;STRONG&gt;3 TF-IDF + SQLite is viable&lt;/STRONG&gt;&lt;/DIV&gt;
&lt;P&gt;For small-to-medium document collections, you don't need a dedicated vector database.&lt;/P&gt;
&lt;/DIV&gt;
&lt;DIV class="takeaway-card"&gt;
&lt;DIV class="num"&gt;&lt;STRONG&gt;4 Same API, local or cloud&lt;/STRONG&gt;&lt;/DIV&gt;
&lt;P&gt;Build locally with Foundry Local, deploy with Azure OpenAI — zero code changes.&lt;/P&gt;
&lt;/DIV&gt;
&lt;/DIV&gt;
&lt;!-- What's Next --&gt;
&lt;H2&gt;What's Next?&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Embedding-based retrieval&lt;/STRONG&gt; — swap TF-IDF for a local embedding model for better semantic matching&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Conversation memory&lt;/STRONG&gt; — persist chat history across sessions&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Multi-agent routing&lt;/STRONG&gt; — specialized agents for safety, diagnostics, and procedures&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;PWA packaging&lt;/STRONG&gt; — make it installable as a standalone app on mobile devices&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Hybrid retrieval&lt;/STRONG&gt; — combine keyword search with semantic embeddings for best results&lt;/LI&gt;
&lt;/UL&gt;
&lt;DIV class="callout callout-green"&gt;&lt;STRONG&gt;Get the code&lt;/STRONG&gt; Clone the repo, swap in your own documents, and start building:&lt;BR /&gt;&lt;BR /&gt;&lt;CODE&gt;git clone https://github.com/leestott/local-rag.git&lt;/CODE&gt;&lt;BR /&gt;&lt;BR /&gt;&lt;A href="https://github.com/leestott/local-rag" target="_blank" rel="noopener"&gt;github.com/leestott/local-rag&lt;/A&gt; — MIT licensed, contributions welcome.&lt;/DIV&gt;
&lt;/ARTICLE&gt;
&lt;FOOTER&gt;
&lt;P&gt;Open source under the &lt;A href="https://github.com/leestott/local-rag/blob/main/LICENSE" target="_blank" rel="noopener"&gt;MIT License&lt;/A&gt;. Built with &lt;A href="https://foundrylocal.ai" target="_blank" rel="noopener"&gt;Foundry Local&lt;/A&gt; and &lt;A href="https://nodejs.org/" target="_blank" rel="noopener"&gt;Node.js&lt;/A&gt;.&lt;/P&gt;
&lt;/FOOTER&gt;</description>
      <pubDate>Tue, 10 Mar 2026 07:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/build-a-fully-offline-rag-app-with-foundry-local-no-cloud/ba-p/4499964</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-03-10T07:00:00Z</dc:date>
    </item>
    <item>
      <title>Data Driven Analytics for Responsible Business Solutions, learning how to work with Power BI</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/data-driven-analytics-for-responsible-business-solutions/ba-p/4497001</link>
      <description>&lt;div data-video-id="https://youtu.be/oskcDEDyOP4/1772013327059" data-video-remote-vid="https://youtu.be/oskcDEDyOP4/1772013327059" class="lia-video-container lia-media-is-center lia-media-size-large"&gt;&lt;iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FoskcDEDyOP4%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DoskcDEDyOP4&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FoskcDEDyOP4%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" allowfullscreen="" style="max-width: 100%"&gt;&lt;/iframe&gt;&lt;/div&gt;
&lt;P&gt;&lt;STRONG&gt;Introduction&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In this blog post, we will be showcasing the project that we have worked on for the last couple of weeks. Here, we analysed a dataset using Power BI and its machine learning capabilities. For this, we were given the fictitious case of VenturaGear. The company was faced with the challenge of new competition, and it was our job to provide a data-driven insight into customer behaviour, feedback, and preferences. The objective was to support more effective customer targeting by identifying patterns and segments that could inform strategic decision-making, while ensuring ethical and responsible use of data.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Before we jump into the course and our final results, we would like to introduce ourselves and the roles we had.&lt;/P&gt;
&lt;P&gt;Product Owner: Kylie Eggen&lt;/P&gt;
&lt;P&gt;Hello everyone! My name is Kylie, and I'm currently busy finishing my Master Responsible Digitalisation. During the DARBS course, I had the role of the product owner. This allowed me to develop a deeper understanding of both data analysis and the ethics of handling sensitive data. The course provides you with skills that could be useful in your future career, which is very nice. I liked the learning experience a lot and will definitely use it in the future! &lt;A href="https://www.linkedin.com/in/kylie-eggen-966a902b9?utm_source=share&amp;amp;utm_campaign=share_via&amp;amp;utm_content=profile&amp;amp;utm_medium=android_app" target="_blank"&gt;Kylie Eggen | LinkedIn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Data Analyst: Ha Nguyen&lt;/P&gt;
&lt;P&gt;I am currently in the final stage of my Master’s degree in Responsible Digitalisation, focusing on the ethical and strategic use of data-driven technologies. With five years of experience using Excel for data analysis, I have developed a strong foundation in data handling and visualisation. This course allows me to expand my skills by learning to create interactive dashboards and generate actionable insights using Power BI. These competencies strengthen my ability to support responsible, data-driven decision-making in my future professional career. &lt;A href="https://www.linkedin.com/in/ha-nguyen-b18671116/" target="_blank"&gt;Ha Nguyen | LinkedIn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Data Analyst: Rianne van Ee&lt;/P&gt;
&lt;P&gt;Hello! My name is Rianne, and I am currently in the process of completing my Master’s degree in Responsible Digitalisation. I chose this specialisation because I am very interested in new technologies and different perspectives. I am very interested in data analysis and learning about new software, so the DARBS course was very interesting to me. I am excited to apply my new skills in a professional environment. &lt;A href="https://www.linkedin.com/in/rianne-van-ee-7b2785214/" target="_blank"&gt;Rianne van Ee | LinkedIn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Data Visualisation Consultant: Aya Torqui&lt;/P&gt;
&lt;P&gt;&amp;nbsp;Hello! My name is Aya Torqui, and I am a Master’s student in Responsible Digitalisation at Radboud University. One of the reasons I chose this specialisation is my strong interest in how companies transform raw and sometimes ambiguous data into valuable business decisions. The DARBS course, therefore, provided the perfect opportunity for me to gain new and deeper insights into this process. In my role as a Data Visualisation Consultant, I developed new skills not only in designing visually attractive and interesting dashboards, but also in communicating a meaningful and coherent story through them. I am grateful for the opportunity to have developed these skills during the course, and I look forward to further broadening and strengthening them in my future career. &lt;A href="https://www.linkedin.com/in/aya-torqui-11189124b/" target="_blank"&gt;Aya Torqui | LinkedIn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Data Visualisation Consultant: Ting Yu&lt;/P&gt;
&lt;P&gt;&amp;nbsp;Hi! My name is Ting Yu. I am currently a Master’s student of Civil Law and Responsible Digitalisation. I found the DARBS course quite interesting, and it was a whole new experience for me, because I learned that numbers are not boring. With a dashboard, it is possible to tell a story and help organisations. What I also really liked about this course was the creative side. Not only was it fun to play around with different charts and colour schemes for the dashboard, but also the video we had to make! I am curious to see what the future possibilities are. &lt;A href="http://www.linkedin.com/in/ting-yu-169418215" target="_blank"&gt;Ting Yu | LinkedIn&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Project Overview&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The goal of this project was to provide data-driven managerial recommendations to the fictitious company, VenturaGear. Eventually, it was our task to deliver a final report and a video blog in which we discussed their data and gave them recommendations on how to improve. Our focus was on supporting more effective customer targeting by identifying patterns and segments that could inform strategic decision-making. During the process, one of our main goals was to keep the data analysis responsible and ethical.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Project Journey&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The course followed a clear structure, allowing us to learn Power BI gradually and expand our skills and knowledge over several weeks. We started off by completing lab work: every week we completed several online courses and spent one lecture applying the knowledge from those courses in a lab assignment. After a few weeks, we applied our knowledge in a milestone assignment. This was the first time we really applied our newfound skills in a practical manner, and a good opportunity to see whether we could actually apply what we had learned.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This also came with a machine learning aspect. Even though we had a short introduction to the topic in class, none of us had worked with machine learning before. We were able to apply the knowledge we gathered about learning how to use a new system, like Power BI, on another system, in this case, machine learning. While we really struggled here at the start, after some time we figured it out and were able to work with the technology.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;This milestone assignment was the perfect preparation for the actual final assignment, which also had this machine learning aspect. We now knew where to start, what data to include, etc. We now also knew what to consider when looking at the ethical side of things. Like what information needs to be anonymised, or left out completely. Eventually, all our newfound knowledge was combined into making the final assignment and video blog.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Technical Details&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Microsoft Power BI served as the main analytical environment throughout the project. We began by importing multiple CSV datasets into Power BI and preparing the data using Power Query. This involved cleaning duplicate records, correcting formatting inconsistencies, and transforming variables to ensure accurate calculations and reliable analysis.&lt;/P&gt;
&lt;P&gt;We then created a relational data model connecting key tables such as sales transactions, product information, customer behaviour, and sales reasons. Establishing these relationships allowed us to analyse data across multiple dimensions and generate deeper insights into customer activity and online purchasing patterns.&lt;/P&gt;
&lt;P&gt;Interactive dashboards were developed using Power BI’s visualisation tools, accessible colour themes, and slicers, allowing users to explore insights dynamically. Rather than presenting static results, the dashboard encouraged managers to interact with the data and investigate patterns independently.&lt;/P&gt;
&lt;P&gt;In addition to descriptive analytics, we applied a machine learning model (XGBoost) to identify factors influencing the sales of the top revenue-generating products. This introduced us to predictive analytics and highlighted the importance of feature selection, handling missing values, and critically interpreting model outputs. Combining visualisation with machine learning enabled us to move beyond reporting toward data-driven decision support.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Results and Outcomes&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Before we could analyse our data, we ran into a few problems. Firstly, the unit prices in the dataset appeared inflated: the decimal separator had been dropped, producing unreasonably high prices. To solve this, we recalculated the LineTotal from the corrected unit price and the order quantity.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another problem we ran into was that we seemed to have a lot of missing data. We noticed this while looking at the sales reasons. A third of the data ended up blank. We ended up excluding the blank values, so that we were still able to analyse the remaining data.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;To target customers effectively, we felt it was important to analyse the reasons people made their purchases. Through our analysis, we found that for VenturaGear, the biggest contributor was price.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We found that VenturaGear mainly made its sales in Australia.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;BR /&gt;&lt;STRONG&gt;Lesson Learned&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Working with new systems&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The main lesson that we learned is how to start using a new system. The way in which we were taught how to use Power BI showed us a nice way of approaching new things. We believe this can be useful in other areas of our professional lives.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Data analysis&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Most of us were a little intimidated when we first heard that we were going to be analysing data through a new program. However, once we started, we noticed that when we all put our minds to it, it is quite manageable. We have all gained some understanding of data analysis and how to visualise this.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;3. Teamwork&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;A big factor during this project was teamwork. Our team was divided up into different roles. That meant that there was teamwork between the two data analysts and data visualisation consultants, but also between different roles. We found it to be really important to have teamwork between all these actors. We noticed that the further we got into the project, the smoother this interaction went.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Collaboration and Teamwork&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;On this project, we worked as a team. Our team consists of five people. Kylie Eggen was the Product Owner. Her role was to take care of the overview of the project. Ha Nguyen and Rianne van Ee were the Data Analysts for this project. Aya Torqui and Ting Yu were the Data Visualisation Consultants.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;We mostly stuck to our roles, but noticed that everything needed to happen in collaboration. So even though we were each mainly busy with our own role, we all stayed involved in each other's work as well. We noticed this really helped in making the project a coherent whole.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Future Development&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;While this project generated valuable insights, there are several opportunities for further development. A potential next step would be integrating real-time data into Power BI. Expanding the dashboard with automated data refresh will allow managers to track performance continuously and respond more quickly to changing customer behaviour.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;Another area for future development involves extending the machine learning component. Rather than focusing only on identifying predictors of key revenue-generating products, the model could be expanded to include customer segmentation, such as grouping customers into categories like high-value customers, discount-sensitive buyers, or frequent online shoppers.&amp;nbsp; In addition, the model could be developed further to support purchase prediction, enabling forecasts of seasonal demand, identifying customers likely to make repeat purchases, and determining which products are most preferred by specific customer groups. These enhancements would provide a more dynamic understanding of customer behaviour and support more targeted, data-driven decision-making.&lt;/P&gt;
&lt;P&gt;Incorporating more complete behavioural data or improving survey participation rates would also help reduce missing values and increase the reliability of insights. And finally, for future research, the organisation could consider introducing clear consent options on the web shop to help customers better understand what data is being collected. These options would also allow customers to choose what information they want to share, improving transparency and strengthening customer trust.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Conclusion&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This project allowed us to learn how data analytics can help organisations make smarter and more responsible business decisions. Using Power BI, we transformed complex customer and sales data into clear, interactive insights that help managers better understand online behaviour, purchasing motivations, and performance trends. Beyond building technical skills, we also learned how important data quality, transparency, and ethical considerations are when working with sensitive customer data. Throughout the project, we discovered that data analysis is an iterative process that requires continuous evaluation, critical thinking, and careful interpretation of results. Most importantly, we realised that meaningful analytics is never an individual effort but a collaborative process, where teamwork and shared problem-solving play a key role in turning data into valuable insights.&lt;/P&gt;
&lt;P&gt;Overall, this project strengthened our ability to bridge technical analytics with responsible digitalisation principles. By combining business understanding, visualisation skills, and ethical awareness, we gained a clearer perspective on how tools like Power BI can enable professionals to create meaningful, data-driven solutions that are both impactful and responsible.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Call to Action&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;After experiencing this learning journey, we encourage you to engage with tools such as Power BI. As our teacher told us, "You are going to hit a wall." That is exactly what happened to us, but pushing through those moments allowed us to develop a deeper understanding and new skills. At the same time, we tried to stay aware of the ethical implications of working with data, and throughout the project we made sure to stay transparent and responsible in our analysis. We encourage you to challenge yourself! Experiment with new technologies and step outside of your comfort zone. Remember, too, that a strong analysis does not depend on technical skills alone: it is also about staying transparent, responsible, and trustworthy.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;On behalf of group 3, thank you for taking the time to read our summary. We hope it has been useful. Feel free to reach out with any remaining questions!&lt;/P&gt;
      <pubDate>Thu, 05 Mar 2026 08:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/data-driven-analytics-for-responsible-business-solutions/ba-p/4497001</guid>
      <dc:creator>RiannevanEe</dc:creator>
      <dc:date>2026-03-05T08:00:00Z</dc:date>
    </item>
    <item>
      <title>The Hidden Architecture of Nano Architectures</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/the-hidden-architecture-of-nano-architectures/ba-p/4493391</link>
      <description>&lt;P data-start="94" data-end="412"&gt;&lt;STRONG data-start="94" data-end="255"&gt;Why does the same prompt, on the same checkpoint, with temperature set to zero, sometimes produce a different answer only when the system is under real load?&lt;/STRONG&gt;&lt;BR data-start="255" data-end="258" /&gt;If you have ever watched token three flip and then watched the whole completion diverge, you already know this is not a product bug.&amp;nbsp;It is a systems fact.&lt;/P&gt;
&lt;P data-start="414" data-end="550"&gt;Here is the thing. In production, you did not deploy a model.&lt;BR data-start="475" data-end="478" /&gt;You deployed a runtime that selects an execution plan under constraints.&lt;/P&gt;
&lt;P data-start="552" data-end="611"&gt;The weights are inside that plan. The behavior is the plan.&lt;/P&gt;
&lt;P data-start="34" data-end="140"&gt;I’m&amp;nbsp;&lt;A href="https://www.linkedin.com/in/drhazemali" target="_blank" rel="noopener"&gt;Hazem Ali&lt;/A&gt;&amp;nbsp;—&amp;nbsp;&lt;A href="https://mvp.microsoft.com/en-US/MVP/profile/4865c7ae-cb5b-4eb5-b128-608b1f9a6ebc" target="_blank" rel="noopener"&gt;Microsoft AI MVP&lt;/A&gt;, Distinguished AI and ML Engineer and Architect, and Founder and CEO of Skytells.&lt;/P&gt;
&lt;P data-start="101" data-end="377"&gt;I’ve built and led engineering work that turns deep learning research into production systems that survive real-world constraints. I speak at major conferences and technical communities, and I regularly deliver deep technical sessions on enterprise AI and agent architectures.&lt;/P&gt;
&lt;P data-start="379" data-end="555"&gt;If there’s one thing you’ll notice about me, it’s that I’m drawn to the deepest layers of engineering, the parts most teams only discover when systems are under real pressure. My specialization spans the full AI stack, from deep learning and system design to enterprise architecture and security.&lt;/P&gt;
&lt;P data-start="835" data-end="885"&gt;A rule I repeat in every serious review is simple.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="887" data-end="967"&gt;If you cannot explain the runtime, you do not understand the model you deployed.&lt;/P&gt;
&lt;P data-start="887" data-end="967"&gt;— Hazem Ali&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="969" data-end="1227"&gt;This is the next layer after my earlier deep dive on memory, KV cache, paging, and trust boundaries in &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/the-hidden-memory-architecture-of-llms/4485367" target="_blank" rel="noopener" data-lia-auto-title="The Hidden Memory Architecture of LLMs" data-lia-auto-title-active="0"&gt;&lt;STRONG data-start="1072" data-end="1114"&gt;The Hidden Memory Architecture of LLMs&lt;/STRONG&gt;&lt;/A&gt;&lt;/P&gt;
&lt;P data-start="1229" data-end="1409"&gt;I also break down the memory-and-paging failure modes in &lt;A class="lia-external-url" href="https://drhazemali.com/blog/when-your-llm-trips-the-mmu" target="_blank"&gt;When Your LLM Trips the MMU&lt;/A&gt;&lt;/P&gt;
&lt;P data-start="1229" data-end="1409"&gt;This one goes lower, into the execution that decides which math actually runs.&lt;/P&gt;
&lt;H2 data-start="1229" data-end="1409"&gt;When I Had to Prove It Live&lt;/H2&gt;
&lt;P data-start="0" data-end="97"&gt;I still remember the first time I had to make this concrete in front of a room full of engineers.&lt;/P&gt;
&lt;P data-start="99" data-end="353"&gt;It was during a technical session I gave, and the question came up in the exact form you’ve probably heard before:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="99" data-end="353"&gt;Why does the same prompt on the same checkpoint, with temperature set to zero, sometimes produce a different answer only under real load?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="355" data-end="429"&gt;So I answered it the only way that holds up in a serious engineering room.&lt;/P&gt;
&lt;P data-start="431" data-end="489"&gt;&lt;STRONG&gt;I didn’t frame it as randomness. I framed it as execution.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="60" data-end="375" data-is-last-node="" data-is-only-node=""&gt;Not because it sounds cleaner,&lt;/P&gt;
&lt;P data-start="60" data-end="375" data-is-last-node="" data-is-only-node=""&gt;but because it is the only framing that survives scrutiny: under load, the system is not evaluating the same computation.&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Hazem Ali speaking at an AI conference, discussing Zero-Trust Enterprise AI Architecture, governance, and production-ready AI systems.&lt;/EM&gt;&lt;/P&gt;
&lt;P data-start="491" data-end="925"&gt;In production, you don’t deploy weights in isolation. You deploy a runtime that selects an execution plan under constraints. Under load, the constraints change at token cadence: microbatch membership shifts, shapes shift, workspace feasibility tightens, and kernels or algorithms that were legal in the calm regime can become infeasible in the pressured regime. The runtime stays correct by contract, but it executes a different plan.&lt;/P&gt;
&lt;P data-start="927" data-end="1259"&gt;And once the executed plan changes, reduction staging can change. When reduction staging changes, rounding happens at different points. That can move last bits. In decoding, last bits can become different tokens when early logit margins are thin. After the first token flips, divergence is expected because the context is different.&lt;/P&gt;
&lt;P data-start="1261" data-end="1378"&gt;That’s what I mean throughout this article when I say:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="1261" data-end="1378"&gt;The weights are inside the plan, but the behavior is the plan.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2 data-start="1416" data-end="1450"&gt;What is Happening in Runtime&lt;/H2&gt;
&lt;P data-start="0" data-end="90"&gt;Let’s start with the part most teams skip: the runtime pipeline from admission to a token.&lt;/P&gt;
&lt;P data-start="92" data-end="324"&gt;A production LLM server is not a function call. It is a control plane. And under real load, it behaves like one.&lt;/P&gt;
&lt;P data-start="92" data-end="324"&gt;It is not asking “what does the model say.” It is asking “what can I execute right now without breaking my guarantees.”&lt;/P&gt;
&lt;P data-start="326" data-end="552"&gt;Right now matters. Not in theory, in milliseconds. Because every decode step is a new scheduling event. The system does not commit to a single plan for the entire completion. It keeps re-evaluating feasibility as state shifts.&lt;/P&gt;
&lt;P data-start="554" data-end="720"&gt;What can I execute at this moment, with the VRAM I still have, on the hardware state I am currently in, while staying inside isolation boundaries and latency targets.&lt;/P&gt;
&lt;P data-start="722" data-end="999"&gt;That question is not answered once per request. It is answered repeatedly, at token cadence. The queue changes. The batch changes. Memory headroom changes. Cache residency changes. Workspace availability changes. The set of legal kernel and algorithm choices changes with them.&lt;/P&gt;
&lt;P data-start="1001" data-end="1222"&gt;And that is the point most people miss. The runtime is not just running your weights. It is continuously selecting an execution plan under constraint. The weights are inside that plan, but behavior lives in the selection.&lt;/P&gt;
&lt;P data-start="1224" data-end="1586" data-is-last-node="" data-is-only-node=""&gt;That selection is layered. Admission shapes the effective request. Scheduling forms the batch for this step. Kernel and algorithm choice binds the math that will actually run. Memory residency and allocation decide what is feasible. Isolation rules decide what sharing is allowed. Each layer contributes to the final plan, and the plan is what you are deploying.&lt;/P&gt;
&lt;H4 data-start="1854" data-end="1879"&gt;Admission and shaping&lt;/H4&gt;
&lt;P data-start="1881" data-end="1939"&gt;Before your prompt ever reaches the model, it gets shaped.&lt;/P&gt;
&lt;P data-start="1941" data-end="2076"&gt;Truncation, policy injection, tool schema expansion, routing metadata, tenant tags, prefix reuse decisions, and safety transformations.&lt;/P&gt;
&lt;P data-start="2078" data-end="2253"&gt;If you do not know what I mean by effective request, I mean the exact token sequence that the model saw after shaping. That is the only input that matters for reproducibility.&lt;/P&gt;
&lt;H4 data-start="2255" data-end="2293"&gt;Batching and step level scheduling&lt;/H4&gt;
&lt;P data-start="2295" data-end="2361"&gt;Modern servers do not just batch requests. They batch token steps.&lt;/P&gt;
&lt;P data-start="2363" data-end="2601"&gt;In a continuous batching system, token step timing feeds back into batching decisions. A slightly slower step changes who joins the next step. Who joins the next step changes shapes. Shapes change kernels. Kernels change numeric pathways.&lt;/P&gt;
&lt;P data-start="2603" data-end="2938"&gt;This is not an opinion. It is why vLLM exists. The PagedAttention &lt;A class="lia-external-url" href="https://arxiv.org/abs/2309.06180" target="_blank" rel="noopener"&gt;paper&lt;/A&gt; describes serving as a batching problem where KV cache grows dynamically, wastes memory through fragmentation, and limits batch size. It introduces block level KV management and builds vLLM on top of it as an LLM serving system.&lt;/P&gt;
&lt;H4 data-start="2940" data-end="2986"&gt;Kernel plan selection and library behavior&lt;/H4&gt;
&lt;P data-start="2988" data-end="3143"&gt;Once shapes are known, the runtime selects kernel variants and library algorithms that are feasible for those shapes and the workspace currently available.&lt;/P&gt;
&lt;P data-start="3145" data-end="3382"&gt;This is the part people underestimate. The same operator can have multiple valid implementations. The chosen implementation can change when workspace is tight, when shapes change, or when the engine wants to trade latency for throughput.&lt;/P&gt;
&lt;H4 data-start="3384" data-end="3419"&gt;Memory allocation and residency&lt;/H4&gt;
&lt;P data-start="3421" data-end="3531"&gt;KV cache, activations, temporary buffers, workspace, graph memory, and communication buffers compete for VRAM.&lt;/P&gt;
&lt;P data-start="3533" data-end="3711"&gt;Under pressure, allocation patterns change. Fragmentation changes. Residency changes. Cache locality changes. All of that changes the system timeline and the feasible plan space.&lt;/P&gt;
&lt;P data-start="3713" data-end="3802"&gt;If you want a one line summary that is accurate in 2026 production inference, it is this.&lt;/P&gt;
&lt;P data-start="3804" data-end="3900"&gt;Inference is a scheduling problem plus a memory residency problem, and the model is inside that.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="3907" data-end="3954"&gt;The Scope&lt;/H2&gt;
&lt;P data-start="0" data-end="25"&gt;First, Let me put it very clear.&lt;/P&gt;
&lt;P data-start="27" data-end="176"&gt;&lt;EM&gt;I am not claiming every deployment is nondeterministic.&lt;/EM&gt;&lt;BR data-start="82" data-end="85" /&gt;&lt;EM&gt;I am not claiming every kernel variant flips tokens.&lt;/EM&gt;&lt;BR data-start="137" data-end="140" /&gt;&lt;EM&gt;I am not claiming seeds are useless.&lt;/EM&gt;&lt;/P&gt;
&lt;P data-start="178" data-end="274"&gt;I am making a narrower claim, the kind you can defend in an incident review without hand waving.&lt;/P&gt;
&lt;P data-start="276" data-end="643"&gt;Floating point math is not associative. Order matters. When you parallelize, you change the order of operations, and it is therefore valid for parallel results to differ from a sequential evaluation. NVIDIA states this directly in the &lt;A class="lia-external-url" href="https://developer.download.nvidia.com/compute/DevZone/docs/html/C/doc/CUDA_C_Best_Practices_Guide.pdf" target="_blank" rel="noopener"&gt;CUDA C Best Practices Guide&lt;/A&gt;.&lt;/P&gt;
&lt;P data-start="645" data-end="990"&gt;CUDA also makes a foundational guarantee to the hardware and scheduler, not to your intuition. Thread blocks must be able to execute independently, in any order, in parallel or in series. That freedom is part of the programming model, not an edge case (&lt;A class="lia-external-url" href="https://docs.nvidia.com/cuda/cuda-programming-guide/01-introduction/programming-model.html" target="_blank" rel="noopener"&gt;ref&lt;/A&gt;).&lt;BR data-start="897" data-end="900" /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P data-start="992" data-end="1312"&gt;Now connect those two facts. If accumulation order changes, the last bits can change even when every operation is correct, because floating point addition is not associative. NVIDIA explicitly calls this out as well.&lt;BR data-start="1208" data-end="1211" /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P data-start="1314" data-end="1634"&gt;Then layer in what serving stacks actually do. Production systems intentionally reshape execution through continuous batching and KV memory management. &lt;A class="lia-external-url" href="https://arxiv.org/abs/2309.06180" target="_blank" rel="noopener"&gt;vLLM&lt;/A&gt; is a published example of this co design, where serving throughput is achieved by dynamic batching and memory-aware KV handling.&lt;BR data-start="1599" data-end="1602" /&gt;&lt;BR /&gt;&lt;/P&gt;
&lt;P data-start="1636" data-end="1833"&gt;Finally, bridge the nano to the semantic.&lt;/P&gt;
&lt;P data-start="1636" data-end="1833"&gt;When early logit margins are small, tiny numeric deltas can reorder the top candidates, and a single token flip is enough to diverge the entire completion.&lt;/P&gt;
&lt;P data-start="1835" data-end="1937"&gt;Here is the part that should feel a little scary, because it changes what you think you are operating.&lt;/P&gt;
&lt;P data-start="1939" data-end="2353"&gt;Under real load, the system is not just slower. It can enter a different execution regime. Batch composition shifts, shapes shift, workspace and residency shift, and the runtime is forced into a different set of legal kernel and algorithm choices. Nothing “breaks.” No bug is required. The system is still correct by contract. But your output is now a property of the regime you are in, not the demo you validated.&lt;/P&gt;
&lt;P data-start="2355" data-end="2721"&gt;That means you can pass every determinism test at idle and still ship a system that drifts only when it matters, at p95 and p99, when queues are long and memory headroom is tight. The first time you notice is often a user screenshot, an audit question, or an incident report where two replicas disagree on the same request because the runtime state was not the same.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="0" data-end="57"&gt;The equation principals should use in incident reviews&lt;/H2&gt;
&lt;P data-start="59" data-end="102"&gt;Most teams ship with the demo mental model.&lt;/P&gt;
&lt;LI-CODE lang="markup"&gt;y = f(x, θ)&lt;/LI-CODE&gt;
&lt;P data-start="117" data-end="240"&gt;One prompt in, one checkpoint, one output. If the output changes, someone concludes the weights changed, or “AI is random.”&lt;/P&gt;
&lt;P data-start="242" data-end="374"&gt;That is not how production inference behaves, because production inference is not just a function. It is execution under constraint.&lt;/P&gt;
&lt;P data-start="376" data-end="414"&gt;Production behavior is closer to this.&lt;/P&gt;
&lt;LI-CODE lang=""&gt;y = Decode( Exec(θ, x; s) )&lt;/LI-CODE&gt;
&lt;P data-start="445" data-end="696"&gt;θ is still the same weights. But the thing you actually shipped is &lt;STRONG data-start="512" data-end="520"&gt;Exec&lt;/STRONG&gt;, and &lt;STRONG data-start="526" data-end="544"&gt;Exec is chosen&lt;/STRONG&gt;. It is chosen per step, under the current state of the system. The behavior you observe is the behavior of the executed plan, not the abstract weights.&lt;/P&gt;
&lt;img&gt;Demo vs production mental models. In production, y depends on (θ, x, s) because the runtime selects an execution plan under constraints.&lt;/img&gt;
&lt;H3 data-start="927" data-end="979"&gt;X is not the prompt. X is the effective request.&lt;/H3&gt;
&lt;P data-start="981" data-end="1200"&gt;X is the exact token sequence the model saw after shaping. Truncation, policy injection, tool schema expansion, routing metadata, prefix reuse, safety transforms. All of that can change what the model actually receives.&lt;/P&gt;
&lt;P data-start="1202" data-end="1301"&gt;If you cannot reconstruct x, you are not replaying the request. You are replaying an approximation.&lt;/P&gt;
&lt;P data-start="1303" data-end="1379"&gt;Here is the minimum you should log for x, even if you cannot store raw text:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# minimal "x" record: enough to reproduce or prove you cannot
trace_x = {
  "req_id": req_id,
  "raw_prompt_sha256": sha256(raw_prompt),
  "effective_text_sha256": sha256(effective_text),
  "effective_tokens": len(effective_tokens),
  "truncated": truncated,
  "trunc_reason": trunc_reason,      # e.g., "latency_guard", "context_cap"
  "decode_cfg_applied": decode_cfg,   # temperature/top_p/max_tokens, etc.
  "shaping_events": events,           # ["policy_inject:v3", "tool_schema:v2", ...]
}&lt;/LI-CODE&gt;
&lt;H3 data-start="1892" data-end="1960"&gt;S is not a vibe. S is the execution state that decides the math.&lt;/H3&gt;
&lt;P data-start="1962" data-end="2098"&gt;S is what principals should demand in a postmortem, because this is what turns “it drifted” into “this plan executed under this regime.”&lt;/P&gt;
&lt;P data-start="2100" data-end="2123"&gt;At minimum, s includes:&lt;/P&gt;
&lt;UL data-start="2125" data-end="2439"&gt;
&lt;LI data-start="2125" data-end="2171"&gt;per-step batch composition and shape class&lt;/LI&gt;
&lt;LI data-start="2172" data-end="2212"&gt;queue delays and scheduling outcomes&lt;/LI&gt;
&lt;LI data-start="2213" data-end="2257"&gt;VRAM headroom and workspace availability&lt;/LI&gt;
&lt;LI data-start="2258" data-end="2284"&gt;cache pressure signals&lt;/LI&gt;
&lt;LI data-start="2285" data-end="2324"&gt;precision path and engine fallbacks&lt;/LI&gt;
&lt;LI data-start="2325" data-end="2392"&gt;distributed timeline signals (TP/PP latency, collective stalls)&lt;/LI&gt;
&lt;LI data-start="2393" data-end="2439"&gt;isolation posture (what batching is allowed)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2441" data-end="2792"&gt;Why this matters: in continuous batching, &lt;STRONG data-start="2483" data-end="2517"&gt;time becomes part of semantics&lt;/STRONG&gt;. A few milliseconds of delay changes who gets co-scheduled at the next token step. That changes shapes. Shapes change kernel/algorithm feasibility. Feasibility changes the numeric pathway. When early logit margins are thin, a tiny pathway delta is enough to flip the argmax.&lt;/P&gt;
&lt;P data-start="2794" data-end="2861"&gt;Here is a short, practical “s” record you can emit per decode step:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# per-step "s" record: what plan ran, under what pressure
step_s = {
  "req_id": req_id,
  "step": t,
  "batch_fp": sha256(",".join(sorted(batch_req_ids)))[:12],
  "shape": f"q=1,k={klen},h={heads},d={hidden},tp={tp}",
  "queue_ms": queue_ms,
  "gpu_ms": gpu_ms,
  "vram_free_mb": vram_free_mb,
  "workspace_free_mb": workspace_free_mb,
  "kv_regime": kv_regime,            # "normal" | "pressured" | "paged"
  "precision_path": precision_path,  # "bf16" | "fp16" | "tf32" | "fp32"
  "algo_id": algo_id,                # backend/engine specific
  "kernel_variant": kernel_variant,  # if available
  "isolation_mode": isolation_mode,  # "shared" | "strict"
}&lt;/LI-CODE&gt;
&lt;H4 data-start="3536" data-end="3571"&gt;The incident-review translation&lt;/H4&gt;
&lt;P data-start="3573" data-end="3753"&gt;If you only ask “what prompt did the user send” and “what weights did we run,” you are using the demo equation.&lt;/P&gt;
&lt;P data-start="3573" data-end="3753"&gt;You will argue about seeds, debate “randomness,” and never converge.&lt;/P&gt;
&lt;P data-start="3755" data-end="3804"&gt;The production equation forces the real question.&lt;/P&gt;
&lt;P data-start="3806" data-end="3892"&gt;Which plan executed, under which constraints, and what state pushed us into that plan.&lt;/P&gt;
&lt;P data-start="3894" data-end="3965"&gt;The line principals should repeat until teams internalize it is simple.&lt;/P&gt;
&lt;P data-start="3967" data-end="4071"&gt;Weights are static. Behavior is a property of the executed plan. And the executed plan depends on state.&lt;/P&gt;
&lt;P data-start="4073" data-end="4223"&gt;If you want one more operational layer that makes this feel real, add a regime marker.&lt;/P&gt;
&lt;P data-start="4073" data-end="4223"&gt;Regime changes are where “stability” collapses without any bug:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def regime(vram_free_mb, paging_on, isolation_strict, queue_p95_ms):
    if isolation_strict: return "isolation_strict"
    if paging_on:        return "paging"
    if vram_free_mb &amp;lt; 1024: return "memory_pressured"
    if queue_p95_ms &amp;gt; 50:   return "queue_degraded"
    return "normal"&lt;/LI-CODE&gt;
&lt;P data-start="4527" data-end="4720" data-is-last-node="" data-is-only-node=""&gt;When the regime changes, the feasible plan space changes. When the plan space changes, the executed math can change. That is the production reality your incident review must be able to explain.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="6198" data-end="6255"&gt;Floating point order is where small deltas are born&lt;/H2&gt;
&lt;P data-start="0" data-end="40"&gt;Let’s break it down without hand waving.&lt;/P&gt;
&lt;P data-start="42" data-end="104"&gt;&lt;STRONG&gt;Finite precision makes rounding part of the computation&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="105" data-end="359"&gt;Floating point math is not real-number math. Every add and multiply is followed by rounding to the representable format you are using. That rounding is not “noise.” It is part of the computation. Once you accept that, one consequence becomes unavoidable.&lt;/P&gt;
&lt;P data-start="361" data-end="375"&gt;&lt;STRONG&gt;Order matters.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="377" data-end="633"&gt;NVIDIA states the rule clearly: floating point involves rounding, and when you parallelize you can change operation order, so parallel results may not match sequential results.&lt;/P&gt;
&lt;H4 data-start="635" data-end="701"&gt;Why LLM inference is a perfect storm: reductions everywhere&lt;/H4&gt;
&lt;P data-start="702" data-end="757"&gt;Now connect that to what an LLM does at inference time.&lt;/P&gt;
&lt;P data-start="759" data-end="1102"&gt;LLM inference is reduction-heavy by design. Dot products in GEMMs, attention score accumulation, softmax normalization, layer norm statistics, even top-k selection pathways. These are not single operations. They are many partial operations combined into a final scalar or vector. In floating point, the way you combine partials is the outcome.&lt;/P&gt;
&lt;H4 data-start="1104" data-end="1163"&gt;GPU reductions are staged: partial sums, then merges&lt;/H4&gt;
&lt;P data-start="1164" data-end="1236"&gt;A reduction on GPU is not “a sum.” It is a staged reduction of partials.&lt;/P&gt;
&lt;P data-start="1238" data-end="1293"&gt;On a CPU, you can imagine a left-to-right accumulation:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;((((a1 + a2) + a3) + a4) + ...)&lt;/LI-CODE&gt;
&lt;P data-start="1328" data-end="1584"&gt;On a GPU, that mental model is wrong. The GPU is built to run thousands of threads. So it computes partial sums in parallel and then merges them in stages. The staging pattern is determined by kernel design and how the backend maps the problem to hardware.&lt;/P&gt;
&lt;P data-start="1586" data-end="1642"&gt;Put the figure here, right after the staging idea lands.&lt;/P&gt;
&lt;img&gt;Parallel reductions form partial sums and merge them in stages. Different legal staging orders can shift the last bits under finite precision.&lt;/img&gt;
&lt;P data-start="1863" data-end="1935"&gt;The staging depends on decisions you do not control at the prompt layer:&lt;/P&gt;
&lt;UL data-start="1937" data-end="2174"&gt;
&lt;LI data-start="1937" data-end="1970"&gt;how data is tiled into blocks&lt;/LI&gt;
&lt;LI data-start="1971" data-end="2003"&gt;how each block maps to warps&lt;/LI&gt;
&lt;LI data-start="2004" data-end="2043"&gt;how many partials each warp reduces&lt;/LI&gt;
&lt;LI data-start="2044" data-end="2126"&gt;whether it uses warp-level primitives, shared memory, or tensor core fragments&lt;/LI&gt;
&lt;LI data-start="2127" data-end="2174"&gt;how the final merge is staged across blocks&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2176" data-end="2452"&gt;Change the tile size, or the block shape, or the occupancy, and you often change the staging order. Change the staging order, and you change when rounding happens. You can get two results that are both correct under IEEE floating point rules, and they differ in the last bits.&lt;/P&gt;
&lt;P data-start="2454" data-end="2544"&gt;This is not a bug. It is the contract of finite-precision parallel math, applied at scale.&lt;/P&gt;
&lt;H4 data-start="2546" data-end="2593"&gt;Why the last bits move at the core level&lt;/H4&gt;
&lt;P data-start="2594" data-end="2803"&gt;Floating point addition is not associative under rounding because rounding happens after each operation. The error introduced at each step depends on the magnitude and sign of what you are adding at that step.&lt;/P&gt;
&lt;P data-start="2805" data-end="2851"&gt;When you change the staging order, you change:&lt;/P&gt;
&lt;UL data-start="2853" data-end="3098"&gt;
&lt;LI data-start="2853" data-end="2895"&gt;which numbers get added together early&lt;/LI&gt;
&lt;LI data-start="2896" data-end="2936"&gt;which partial sums get rounded early&lt;/LI&gt;
&lt;LI data-start="2937" data-end="3007"&gt;how cancellation behaves when positive and negative terms interact&lt;/LI&gt;
&lt;LI data-start="3008" data-end="3098"&gt;when large and small magnitudes meet, where small values can lose representable impact&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3100" data-end="3187"&gt;That is the core mechanism behind “small deltas.” It is not mystical. It is mechanical.&lt;/P&gt;
&lt;H4 data-start="3189" data-end="3253"&gt;Why this shows up in production serving, not in your demo&lt;/H4&gt;
&lt;P data-start="3254" data-end="3450"&gt;LLM inference is dominated by massive matrix operations and attention. Under the hood, those paths accumulate across large dimensions. An accumulation is exactly where rounding order matters most.&lt;/P&gt;
&lt;P data-start="3452" data-end="3820"&gt;And the server does not always run the same kernel variant for those ops. Under load, shape shifts and workspace pressure can push the backend into different implementations. Different implementations often imply different tiling. Different tiling implies different staging. Different staging implies different rounding. Different rounding implies different last bits.&lt;/P&gt;
&lt;P data-start="3822" data-end="3955"&gt;So even with an identical prompt, identical checkpoint, and temperature set to zero, you can still see tiny numeric differences when:&lt;/P&gt;
&lt;UL data-start="3957" data-end="4268"&gt;
&lt;LI data-start="3957" data-end="4026"&gt;batch composition changes and produces different effective shapes&lt;/LI&gt;
&lt;LI data-start="4027" data-end="4098"&gt;the engine picks a different algorithm because workspace is tighter&lt;/LI&gt;
&lt;LI data-start="4099" data-end="4176"&gt;the kernel selects a different tile path due to shape class and occupancy&lt;/LI&gt;
&lt;LI data-start="4177" data-end="4268"&gt;the GPU is in a different pressure regime, changing feasibility and scheduling behavior&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="4270" data-end="4312"&gt;Those deltas are small, but they are real.&lt;/P&gt;
&lt;P data-start="4314" data-end="4351"&gt;And in decoding, small can be enough.&lt;/P&gt;
&lt;H4 data-start="4353" data-end="4420"&gt;The bridge from ulps to language: logits, argmax, divergence&lt;/H4&gt;
&lt;P data-start="4421" data-end="4468"&gt;A tiny last-bit difference is often irrelevant, Until it hits a decision boundary.&lt;/P&gt;
&lt;P data-start="4506" data-end="4797"&gt;At decode step t, greedy decoding chooses an argmax. If the top logits are close, a small delta can swap the ordering. Once token t changes, the context changes, and the completion diverges. That is not randomness. That is deterministic branching from a slightly different numerical pathway.&lt;/P&gt;
&lt;P data-start="4799" data-end="4861"&gt;So the actionable takeaway is not “GPUs are nondeterministic.”&lt;/P&gt;
&lt;P data-start="4863" data-end="4874"&gt;It is this.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="4876" data-end="5034" data-is-last-node="" data-is-only-node=""&gt;Parallel math is allowed to produce multiple correct last-bit outcomes, and LLM decoding can amplify those outcomes into different text when margins are thin.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="4876" data-end="5034" data-is-last-node="" data-is-only-node=""&gt;&amp;nbsp;&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="7358" data-end="7418"&gt;CUDA scheduling makes ordering a form of runtime state&lt;/H2&gt;
&lt;P data-start="7420" data-end="7477"&gt;CUDA makes a stronger statement than most people realize.&lt;/P&gt;
&lt;P data-start="7479" data-end="7698"&gt;Thread blocks must be able to run independently. It must be possible to execute blocks in any order, in parallel or in series.&lt;/P&gt;
&lt;P data-start="7700" data-end="7827"&gt;That is why the same kernel can execute with different inter block ordering depending on occupancy, contention, and scheduling.&lt;/P&gt;
&lt;P data-start="7829" data-end="7864"&gt;Now bring atomics into the picture.&lt;/P&gt;
&lt;P data-start="7866" data-end="8250"&gt;Atomics guarantee correctness of each update. They do not guarantee the arrival order of updates across threads and blocks. When floating point updates arrive in different legal orders, the final sum can differ in the last bits, because floating point addition is not associative.&lt;/P&gt;
&lt;P data-start="8252" data-end="8324"&gt;If you do not know what atomic add means, here is the useful definition.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="8326" data-end="8431"&gt;Atomic add ensures updates do not overwrite each other. It does not ensure which thread gets there first.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="8433" data-end="8650"&gt;This is the nano architecture layer that explains a lot of weirdness. Many engineers assume determinism is a property of weights. In practice, determinism is constrained by the legal reorderings of parallel execution.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="8657" data-end="8711"&gt;Logit margin is the bridge from ulps to language&lt;/H2&gt;
&lt;P data-start="8713" data-end="8764"&gt;Now we connect the last bits to a changed sentence.&lt;/P&gt;
&lt;P data-start="8766" data-end="8829"&gt;At decode step t, greedy decoding picks the argmax over logits.&lt;/P&gt;
&lt;P data-start="8831" data-end="8887"&gt;Let the top two logits be ℓₐ and ℓ_b. Define the margin:&lt;/P&gt;
&lt;P data-start="8889" data-end="8902"&gt;&lt;SPAN class="lia-text-color-14"&gt;&lt;STRONG&gt;mₜ = ℓₐ − ℓ_b&lt;/STRONG&gt;&lt;/SPAN&gt;&lt;/P&gt;
&lt;P data-start="8904" data-end="8989"&gt;A token flip happens when a small perturbation changes the ordering of these top two.&lt;/P&gt;
&lt;P data-start="8991" data-end="9042"&gt;If you want an operational translation, it is this.&lt;/P&gt;
&lt;P data-start="9044" data-end="9136"&gt;If the model barely prefers token A over token B, a tiny numeric delta can make it prefer B.&lt;/P&gt;
&lt;P data-start="9138" data-end="9245"&gt;Once token t changes, the rest of the completion evolves under a different context. Divergence is expected.&lt;/P&gt;
&lt;P data-start="9247" data-end="9336"&gt;This is why I keep pushing one instrumentation idea that sounds boring until you need it.&lt;/P&gt;
&lt;P data-start="9338" data-end="9365"&gt;Measure early step margins.&lt;/P&gt;
&lt;P data-start="9367" data-end="9451"&gt;You cannot manage stability if you never measure how close the decision boundary is.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="0" data-end="72"&gt;The effective request problem, the quiet killer of reproducibility&lt;/H2&gt;
&lt;P data-start="74" data-end="149"&gt;Here is the pattern I see in almost every serious production investigation.&lt;/P&gt;
&lt;img&gt;The user prompt is not the executed input. The shaping pipeline produces the effective request x, and under load it can change length, semantics, and decode configuration. Log the contract, not the story.&lt;/img&gt;
&lt;P data-start="151" data-end="295"&gt;The team replays the user prompt, cannot reproduce the output, and concludes the model is nondeterministic. Then the incident dies in ambiguity.&lt;/P&gt;
&lt;P data-start="297" data-end="369"&gt;And then, usually too late, someone asks the only question that matters.&lt;/P&gt;
&lt;P data-start="371" data-end="403"&gt;What did the model actually see.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="407" data-end="578"&gt;“In every postmortem, I ask one question before I look at weights, kernels, or seeds: what did the model actually see. If we cannot answer that, nothing else is evidence.” - Hazem Ali&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="580" data-end="649"&gt;In production, the user prompt is not the input. It is an ingredient.&lt;/P&gt;
&lt;P data-start="651" data-end="946"&gt;By the time a request reaches the model, it has passed through a shaping pipeline that exists to keep the system safe, fast, and multi-tenant. That pipeline is not cosmetic. It can change semantics, length, and even decode behavior. The result is the only input that matters for reproducibility.&lt;/P&gt;
&lt;P data-start="948" data-end="970"&gt;The effective request.&lt;/P&gt;
&lt;P data-start="972" data-end="1045"&gt;This is the same thesis you have already accepted earlier in the article.&lt;/P&gt;
&lt;P data-start="1047" data-end="1074"&gt;y = Decode( Exec(θ, x; s) )&lt;/P&gt;
&lt;P data-start="1076" data-end="1237"&gt;If you do not know x, your replay is not valid. If you do not know s, your replay is not comparable. And if you only log the raw prompt, you are logging neither.&lt;/P&gt;
&lt;H4 data-start="1239" data-end="1285"&gt;Shaping changes semantics, not just length&lt;/H4&gt;
&lt;P data-start="1287" data-end="1547"&gt;Truncation is the obvious one. Under load, systems often cap context length to protect latency and GPU memory. Same prompt, different truncation boundary, different effective context, different output. Nothing “random” happened. You executed a different input.&lt;/P&gt;
&lt;P data-start="1549" data-end="1586"&gt;But truncation is only the beginning.&lt;/P&gt;
&lt;P data-start="1588" data-end="2111"&gt;Policy injection can prepend or append system text that changes intent. Tool schema expansion can add hundreds or thousands of tokens and push the request over a context boundary. Routing metadata can select a different template. Prefix caching can reconstruct parts of context from cached state rather than raw text. Safety transformations can rewrite or neutralize content. Even small differences here can shift early logits when margins are thin, and this article already showed how small deltas become different tokens.&lt;/P&gt;
&lt;P data-start="2113" data-end="2162"&gt;The worst part is that this is silent by default.&lt;/P&gt;
&lt;P data-start="2164" data-end="2334"&gt;The user sees their prompt. Engineers see the prompt in logs. The model sees a different token sequence. Then everyone argues about reproducibility using the wrong input.&lt;/P&gt;
&lt;H4 data-start="2336" data-end="2390"&gt;Why this interacts with load, not just correctness&lt;/H4&gt;
&lt;P data-start="2392" data-end="2461"&gt;Under low load, your system often has enough headroom to be generous.&lt;/P&gt;
&lt;P data-start="2463" data-end="2556"&gt;Longer context, fewer cutoffs, stable routing, more consistent batching, and fewer fallbacks.&lt;/P&gt;
&lt;P data-start="2558" data-end="2601"&gt;Under real load, shaping becomes defensive.&lt;/P&gt;
&lt;P data-start="2603" data-end="2899"&gt;Dynamic truncation thresholds kick in. Tool schema expansions collide with context limits. Prefix reuse behavior changes. Safety gates can become stricter. The same user text can produce a different effective request, and therefore a different output, precisely when the system is under pressure.&lt;/P&gt;
&lt;P data-start="2901" data-end="3016"&gt;So if you are only validating reproducibility at idle, you are validating a different system than the one you ship.&lt;/P&gt;
&lt;H4 data-start="3018" data-end="3065"&gt;What principals should require in telemetry&lt;/H4&gt;
&lt;P data-start="3067" data-end="3180"&gt;If you want strict reproducibility, you must log the execution contract per request. Not the story. The contract.&lt;/P&gt;
&lt;P data-start="3182" data-end="3193"&gt;At minimum:&lt;/P&gt;
&lt;UL data-start="3195" data-end="3492"&gt;
&lt;LI data-start="3195" data-end="3234"&gt;effective token count after shaping&lt;/LI&gt;
&lt;LI data-start="3235" data-end="3269"&gt;truncation boundary and reason&lt;/LI&gt;
&lt;LI data-start="3270" data-end="3317"&gt;final merged decode config actually applied&lt;/LI&gt;
&lt;LI data-start="3318" data-end="3370"&gt;policy gates that modified prompt or decode path&lt;/LI&gt;
&lt;LI data-start="3371" data-end="3439"&gt;whether prefix cache was used, and what cache key was referenced&lt;/LI&gt;
&lt;LI data-start="3440" data-end="3492"&gt;routing template version and system message hash&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3494" data-end="3671"&gt;If you are privacy constrained, you still can log hashes and structural facts. You do not need raw prompts to diagnose effective request drift. You need verifiable fingerprints.&lt;/P&gt;
&lt;P data-start="3673" data-end="3711"&gt;Here is the short version in one line.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="3713" data-end="3807"&gt;If you only log the user prompt, you have not logged x. You have logged an approximation of x.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="3809" data-end="3883"&gt;And without x, you cannot claim reproducibility. You can only hope for it.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="10479" data-end="10540"&gt;Continuous batching, why time becomes part of semantics&lt;/H2&gt;
&lt;P data-start="10542" data-end="10589"&gt;This is where principal level thinking matters.&lt;/P&gt;
&lt;P data-start="10591" data-end="10698"&gt;Continuous batching does not just increase throughput. It changes the execution context at each token step.&lt;/P&gt;
&lt;P data-start="10700" data-end="10862"&gt;Batch composition changes shapes. Shapes influence kernel selection and workspace feasibility. Those choices can change reduction structure and rounding pathways.&lt;/P&gt;
&lt;P data-start="10864" data-end="10905"&gt;If you want a published anchor, use vLLM.&lt;/P&gt;
&lt;P data-start="10907" data-end="11252"&gt;The PagedAttention paper frames high throughput serving as a need to batch many requests, but KV cache grows dynamically and wastes memory through fragmentation. It proposes PagedAttention and builds vLLM on top of it, with block level memory management and flexible sharing of KV cache to reduce memory usage. (&lt;A class="lia-external-url" href="https://arxiv.org/abs/2309.06180" target="_blank" rel="noopener"&gt;arxiv&lt;/A&gt;)&lt;/P&gt;
&lt;P data-start="11254" data-end="11299"&gt;Here is what this really means in production.&lt;/P&gt;
&lt;P data-start="11301" data-end="11500"&gt;The server is selecting which requests share a step. That changes the math shapes. That changes the executed plan. That is why the same prompt behaves differently under load even at temperature zero.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="0" data-end="87"&gt;Algorithm selection and engine fallback&lt;/H2&gt;
&lt;H4 data-start="0" data-end="87"&gt;The hidden variability people forget about&lt;/H4&gt;
&lt;P data-start="0" data-end="87"&gt;If you have ever tried to reproduce a drift across replicas and felt like you were chasing ghosts, this is usually the layer you were missing.&lt;/P&gt;
&lt;P data-start="233" data-end="262"&gt;Libraries and engines choose, Not in a philosophical sense. In a literal, per-operator, per-shape sense.&lt;/P&gt;
&lt;P data-start="233" data-end="262"&gt;The same &lt;EM&gt;&lt;STRONG&gt;attention&lt;/STRONG&gt;&lt;/EM&gt; call is a fork in the road between multiple legal tactics, each with different tiling, different reduction staging, different fusion boundaries, and different temporary memory requirements. Your checkpoint is the same, your prompt is the same, your temperature is zero, and the output still moves because the executed plan moved.&lt;/P&gt;
&lt;P data-start="691" data-end="1146"&gt;PyTorch says the quiet part directly. Disabling cuDNN benchmarking makes cuDNN deterministically select an algorithm, and PyTorch stresses this is different from the deterministic setting. That is the whole story in one sentence: one switch affects &lt;EM data-start="940" data-end="978"&gt;how the backend selects an algorithm&lt;/EM&gt;, another affects &lt;EM data-start="996" data-end="1047"&gt;whether the selected algorithms are deterministic&lt;/EM&gt;. Those are separate layers, and under load they can diverge.&lt;/P&gt;
&lt;P data-start="1148" data-end="1184"&gt;Now go down to the core of the core.&lt;/P&gt;
&lt;P data-start="1186" data-end="1321"&gt;A tactic is not&amp;nbsp;&lt;SPAN class="lia-text-color-11"&gt;fast&lt;/SPAN&gt;&amp;nbsp;or&amp;nbsp;&lt;SPAN class="lia-text-color-8"&gt;slow&lt;/SPAN&gt;. In production serving, a tactic is &lt;STRONG data-start="1255" data-end="1264"&gt;legal&lt;/STRONG&gt; or &lt;STRONG data-start="1268" data-end="1279"&gt;illegal&lt;/STRONG&gt; under the constraints of this token step.&lt;/P&gt;
&lt;P data-start="1323" data-end="1772"&gt;The constraint that forces most plan switches is not compute. It is &lt;STRONG data-start="1391" data-end="1416"&gt;workspace feasibility&lt;/STRONG&gt;. Many high-performance kernels need scratch buffers. Some need enough contiguous space to stage tiles, reorder operands, hold partials, or run fused epilogues. When VRAM is fragmented or headroom drops, a tactic becomes impossible even if it is the tactic you validated at idle. The engine does not throw a warning. It simply selects another legal tactic.&lt;/P&gt;
&lt;P data-start="1774" data-end="1812"&gt;&lt;STRONG&gt;That is the first uncomfortable point.&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="1814" data-end="1902"&gt;The second uncomfortable point is what makes this align perfectly with the next section.&lt;/P&gt;
&lt;P data-start="1904" data-end="2016"&gt;The constraint is not only “how many MB are free.” The constraint is the &lt;STRONG data-start="1977" data-end="2003"&gt;memory hierarchy state&lt;/STRONG&gt; of the chip.&lt;/P&gt;
&lt;P data-start="2018" data-end="2450"&gt;Under load, two replicas can have the same free VRAM and still be in a different regime because the chip is not one pool of memory. It is HBM plus an on-die L2, plus TLBs, plus page tables, plus a fabric that is arbitrating traffic between SMs, L2 slices, and HBM controllers. When that hierarchy shifts, latency per token step shifts. And in continuous batching, a few milliseconds is not a timing detail, it is a scheduling input.&lt;/P&gt;
&lt;P data-start="2452" data-end="2525"&gt;This is how a performance event becomes a behavior event without any bug.&lt;/P&gt;
&lt;P data-start="2527" data-end="2784"&gt;The engine’s planner sees a world where a tactic that was “best” at idle is no longer best, or no longer feasible, because the chip is in a different pressure state. Your runtime is still correct. It is just operating a different plan in a different regime.&lt;/P&gt;
&lt;img /&gt;
&lt;P data-start="432" data-end="557"&gt;&lt;EM data-start="466" data-end="557"&gt;One op, multiple legal kernels. The chosen tactic depends on shape class and feasibility.&lt;/EM&gt;&lt;/P&gt;
&lt;P class="lia-clear-both" data-start="2949" data-end="3036"&gt;Now bring TensorRT into the picture, because it makes the precision dimension explicit.&lt;/P&gt;
&lt;P data-start="3038" data-end="3215"&gt;TensorRT states TF32 Tensor Core usage is not guaranteed and it can fall back to FP32, and it documents configuration controls around TF32.&lt;/P&gt;
&lt;P data-start="3217" data-end="3588"&gt;That statement is not about “precision preference.” It is about the reality that &lt;STRONG data-start="3298" data-end="3339"&gt;precision is part of tactic selection&lt;/STRONG&gt;. Precision changes which instructions execute and how accumulation is staged. When your early logit margins are thin, a small pathway delta can swap the argmax at one step. One token flips, and the rest of the completion deterministically diverges.&lt;/P&gt;
&lt;P data-start="3590" data-end="3712"&gt;&lt;U&gt;So “temperature zero” is not a determinism guarantee. Temperature governs sampling. It does not pin the execution pathway.&lt;/U&gt;&lt;/P&gt;
&lt;P data-start="3714" data-end="4055"&gt;If you want a more mechanical anchor, treat matmul the way NVIDIA exposes it: cuBLASLt has a preference descriptor for applying algorithm search preferences and fine-tuning the heuristic function. That is not marketing. That is the API admitting that algorithm selection is a constrained search problem.&lt;/P&gt;
&lt;P data-start="4057" data-end="4127"&gt;Now the part that gets rare, and the part most teams never write down.&lt;/P&gt;
&lt;P data-start="4129" data-end="4308"&gt;CUDA’s programming model requires that thread blocks be able to execute independently and may execute in any order, in parallel or in series.&lt;/P&gt;
&lt;P data-start="4310" data-end="4674"&gt;This matters here because tactic switches often change block geometry and tiling. Different block geometry changes reduction staging. Reduction staging changes where rounding happens. Even if every operation is correct, last bits can move because you legally changed the staging of partial sums. You do not need randomness. You need a different legal staging tree.&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;Two reduction trees, both legal, merging partial sums in different orders; the final ULPs differ.&lt;/EM&gt;&lt;/P&gt;
&lt;P data-start="4813" data-end="4901"&gt;Now pull security into the same frame, because it is not a separate layer in production.&lt;/P&gt;
&lt;P data-start="4903" data-end="5416"&gt;Security posture changes what the scheduler is allowed to do. Isolation constraints reduce batching freedom. Reduced batching freedom increases tail latency. Tail latency pushes you toward tighter admission controls and more aggressive memory behavior. That shrinks the feasible tactic set sooner. In other words, security decisions can move you across regime boundaries faster, which increases plan switching frequency. Stability becomes an SLO dimension of your security posture, not a property of your weights.&lt;/P&gt;
&lt;P data-start="5418" data-end="5491"&gt;This is the business consequence that shows up in the worst possible way.&lt;/P&gt;
&lt;LI-SPOILER label="Note"&gt;
&lt;P data-start="5493" data-end="5784"&gt;At idle, you look stable. At p95 and p99, you drift. Two replicas disagree. You cannot reproduce because you logged prompts and weights, not the executed plan. An enterprise buyer does not care whether the drift came from “a tactic fallback.” They care that the system cannot explain itself.&lt;/P&gt;
&lt;/LI-SPOILER&gt;
&lt;P data-start="5786" data-end="5835"&gt;So here is the operational rule I use in reviews.&lt;/P&gt;
&lt;P data-start="5837" data-end="5906"&gt;If you cannot prove which plan ran, you cannot claim reproducibility.&lt;/P&gt;
&lt;P data-start="5908" data-end="6038"&gt;And that leads to the only practical addition that belongs in this section before we move into VRAM bandwidth and cache residency.&lt;/P&gt;
&lt;P data-start="11592" data-end="11675"&gt;&lt;SPAN style="color: rgb(30, 30, 30); font-size: 32px;"&gt;VRAM bandwidth, cache residency, and why memory hierarchy becomes control plane input&lt;/SPAN&gt;&lt;/P&gt;
&lt;P data-start="91" data-end="165"&gt;Let’s talk about the performance facts that quietly become behavior facts.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="167" data-end="514"&gt;And yes, I know how complex this gets. I have watched strong staff and principal engineers get lost here, not because they are weak, but because the system crosses too many layers at once: GPU microarchitecture, allocator behavior, kernel tactics, batching policy, and SLO-driven control loops. No single dashboard shows you the full causal chain.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="516" data-end="722"&gt;That is exactly why I frame it this way. It is not “performance tuning.” It is a coupled control system. So let me break it down cleanly, from the chip outward, until the behavior change becomes inevitable.&lt;/P&gt;
&lt;P data-start="724" data-end="957"&gt;NVIDIA &lt;A class="lia-external-url" href="https://developer.nvidia.com/blog/nvidia-hopper-architecture-in-depth/" target="_blank" rel="noopener"&gt;describes&lt;/A&gt; H100 SXM5 as having HBM3 bandwidth around &lt;STRONG data-start="783" data-end="793"&gt;3 TB/s&lt;/STRONG&gt; and an &lt;STRONG data-start="801" data-end="822"&gt;L2 cache of 50 MB&lt;/STRONG&gt; designed to reduce trips to HBM by caching repeated accesses.&lt;/P&gt;
&lt;P data-start="959" data-end="1001"&gt;Most teams read that as “the GPU is fast.”&lt;/P&gt;
&lt;P data-start="1003" data-end="1170"&gt;In serving, it is more precise to say: the GPU gives you a memory hierarchy with regimes, and your runtime is forced to adapt to whichever regime you are currently in.&lt;/P&gt;
&lt;H4 data-start="1177" data-end="1232"&gt;The chip-level model you should carry in your head&lt;/H4&gt;
&lt;img /&gt;
&lt;P data-start="1234" data-end="1324"&gt;Decode is not one big matmul. It is a loop that repeatedly touches a shifting working set:&lt;/P&gt;
&lt;UL data-start="1326" data-end="1516"&gt;
&lt;LI data-start="1326" data-end="1362"&gt;KV blocks for the active sequences&lt;/LI&gt;
&lt;LI data-start="1363" data-end="1418"&gt;attention metadata (block tables, indirection, masks)&lt;/LI&gt;
&lt;LI data-start="1419" data-end="1470"&gt;sampling buffers (logits, top-k/top-p structures)&lt;/LI&gt;
&lt;LI data-start="1471" data-end="1516"&gt;runtime bookkeeping for continuous batching&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1518" data-end="1722"&gt;Those accesses are not purely streaming. They are pointer-heavy, and their locality depends on how your KV is laid out, which requests are co-scheduled, and how fragmented your memory becomes under churn.&lt;/P&gt;
&lt;P data-start="1724" data-end="1779"&gt;Here is the simplest mental model that is still honest:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;UL data-start="2055" data-end="2285"&gt;
&lt;LI data-start="74" data-end="151"&gt;&lt;STRONG data-start="76" data-end="85"&gt;B_HBM&lt;/STRONG&gt; is the number of bytes actually read from HBM during this step.&lt;/LI&gt;
&lt;LI data-start="152" data-end="248"&gt;&lt;STRONG data-start="154" data-end="166"&gt;B_L2miss&lt;/STRONG&gt; is the number of bytes that missed L2 and therefore had to be fetched from HBM.&lt;/LI&gt;
&lt;LI data-start="249" data-end="347"&gt;&lt;STRONG data-start="251" data-end="266"&gt;t_translate&lt;/STRONG&gt; is the address-translation tax: extra time from TLB misses and page-table walks.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="2287" data-end="2372"&gt;That last term is the one that surprises people. It’s “invisible” until it dominates.&lt;/P&gt;
&lt;H5 data-start="0" data-end="50"&gt;Why L2 residency becomes a control-plane input&lt;/H5&gt;
&lt;P data-start="52" data-end="79"&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P data-start="52" data-end="79"&gt;Now connect that to decode, Decode repeatedly reads KV state. If L2 hit rate drops, HBM traffic rises. When HBM traffic rises, stalls rise. When stalls rise, token-step latency shifts.&lt;/P&gt;
&lt;P data-start="52" data-end="79"&gt;When token-step latency shifts, the server changes batching decisions.&lt;/P&gt;
&lt;P data-start="310" data-end="364"&gt;This is the control loop you should keep in your head:&lt;/P&gt;
&lt;P&gt;&lt;SPAN class="lia-text-color-14"&gt;L2 hit rate ↓ → t_step ↑ → Δt ↑ → batch composition changes → shape class changes → tactic set changes&lt;/SPAN&gt;&lt;/P&gt;
&lt;P data-start="366" data-end="472"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-start="474" data-end="540"&gt;That is the bridge from “cache miss” to “different plan executed.”&lt;/P&gt;
&lt;P data-start="542" data-end="965"&gt;In continuous batching, time is not just an output metric. Time is an input into the next scheduling decision. A few milliseconds can change who gets co-scheduled at the next token step. That changes shapes. Shapes change feasible kernels and algorithms. That changes the executed math. And if early logit margins are thin, a small pathway delta can flip a token and send the rest of the completion down a different branch.&lt;/P&gt;
&lt;H4 data-start="3681" data-end="3757"&gt;Rare but matters: the translation tax that breaks the “free VRAM” illusion&lt;/H4&gt;
&lt;P data-start="3759" data-end="3835"&gt;Two replicas can report similar free VRAM and still be in different regimes.&lt;/P&gt;
&lt;P data-start="3837" data-end="4002"&gt;Why? Because the chip is not “a pool of memory.” It is an on-die cache, translation structures, page tables, and a fabric that is arbitrating traffic under pressure.&lt;/P&gt;
&lt;P data-start="4004" data-end="4102"&gt;When KV is stored in blocks (or pages) and those blocks are scattered due to churn, you often get:&lt;/P&gt;
&lt;UL data-start="4104" data-end="4206"&gt;
&lt;LI data-start="4104" data-end="4128"&gt;worse spatial locality&lt;/LI&gt;
&lt;LI data-start="4129" data-end="4168"&gt;more distinct memory regions per step&lt;/LI&gt;
&lt;LI data-start="4169" data-end="4188"&gt;more TLB pressure&lt;/LI&gt;
&lt;LI data-start="4189" data-end="4206"&gt;more page walks&lt;/LI&gt;
&lt;/UL&gt;
&lt;img /&gt;
&lt;P data-start="4208" data-end="4358"&gt;&lt;STRONG&gt;&lt;EM&gt;Page walks are not abstract.&lt;/EM&gt;&lt;/STRONG&gt;&amp;nbsp;They are memory reads.&lt;/P&gt;
&lt;P data-start="4208" data-end="4358"&gt;They compete with your payload reads. Under real load, this turns into self-inflicted HBM traffic.&lt;/P&gt;
&lt;P data-start="4360" data-end="4491"&gt;So you can be “bandwidth rich” on paper and still be “latency poor” in practice because the working set became translation-hostile.&lt;/P&gt;
&lt;P data-start="4493" data-end="4566"&gt;This is how a performance event becomes a behavior event without any bug.&lt;/P&gt;
&lt;H4 data-start="4712" data-end="4753"&gt;A concrete KV bandwidth sanity check&lt;/H4&gt;
&lt;P data-start="0" data-end="107"&gt;If you want a back-of-the-envelope check for why decode becomes memory-shaped, use a conservative estimate.&lt;/P&gt;
&lt;P data-start="109" data-end="211"&gt;Per token step, you often need to read a large portion of KV for the active context. A rough model is:&lt;/P&gt;
&lt;P data-start="213" data-end="258"&gt;&lt;STRONG data-start="213" data-end="258"&gt;&lt;SPAN class="lia-text-color-14"&gt;KV bytes per step&lt;/SPAN&gt; ≈ 2 × B × L × H × D × s&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="260" data-end="266"&gt;Where:&lt;/P&gt;
&lt;UL data-start="268" data-end="571"&gt;
&lt;LI data-start="268" data-end="338"&gt;&lt;STRONG data-start="270" data-end="275"&gt;B&lt;/STRONG&gt; is batch size (number of sequences co-scheduled in the step)&lt;/LI&gt;
&lt;LI data-start="339" data-end="397"&gt;&lt;STRONG data-start="341" data-end="346"&gt;L&lt;/STRONG&gt; is current context length (tokens already in KV)&lt;/LI&gt;
&lt;LI data-start="398" data-end="478"&gt;&lt;STRONG data-start="400" data-end="405"&gt;H&lt;/STRONG&gt; is the number of attention heads (or KV heads, depending on the model)&lt;/LI&gt;
&lt;LI data-start="479" data-end="506"&gt;&lt;STRONG data-start="481" data-end="486"&gt;D&lt;/STRONG&gt; is head dimension&lt;/LI&gt;
&lt;LI data-start="507" data-end="571"&gt;&lt;STRONG data-start="509" data-end="514"&gt;s&lt;/STRONG&gt; is bytes per element (2 for fp16/bf16, 1 for int8, etc.)&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="573" data-end="615"&gt;The factor &lt;STRONG data-start="584" data-end="589"&gt;2&lt;/STRONG&gt; accounts for &lt;STRONG data-start="603" data-end="614"&gt;K &lt;/STRONG&gt;and&lt;STRONG data-start="603" data-end="614"&gt; V&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-start="617" data-end="780"&gt;Even if your kernel is compute-efficient, you are still moving a lot of bytes. If locality collapses and L2 misses rise, you shift into an HBM-limited regime fast.&lt;/P&gt;
&lt;P data-start="782" data-end="899"&gt;That is the mechanical reason your p95/p99 step time moves under load, even with the same checkpoint and temperature.&lt;/P&gt;
&lt;H4 data-start="7619" data-end="7655"&gt;Business impact, stated plainly&lt;/H4&gt;
&lt;P data-start="7657" data-end="7712"&gt;This is why drift shows up where it hurts: p95 and p99.&lt;/P&gt;
&lt;P data-start="7714" data-end="8076"&gt;At idle, L2 residency is generous, fragmentation is lower, translation pressure is calmer, and step time is stable. Under load, residency collapses, translation tax rises, allocator feasibility tightens, step time stretches, and your control plane adapts by changing batching and shapes. That can move you into different execution plans without any model change.&lt;/P&gt;
&lt;P data-start="8078" data-end="8234"&gt;An enterprise buyer does not care whether you call it “L2 miss driven plan churn.” They care that two identical requests disagree and you cannot explain it.&lt;/P&gt;
&lt;P data-start="8236" data-end="8295"&gt;So the takeaway I want principals to internalize is simple:&lt;/P&gt;
&lt;P data-start="8297" data-end="8501"&gt;In continuous batching, memory hierarchy state is control-plane state.&lt;BR data-start="8367" data-end="8370" /&gt;It shapes latency. Latency shapes batching. Batching shapes shapes. Shapes shape feasibility. Feasibility shapes the executed plan.&lt;/P&gt;
&lt;P data-start="8503" data-end="8548"&gt;That is how “performance” becomes “behavior.”&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="0" data-end="76"&gt;Multi node tensor parallel, the execution plan extends across the fabric&lt;/H2&gt;
&lt;P data-start="78" data-end="181"&gt;Once you go multi-node tensor parallel, you add a second execution plane that most teams underestimate.&lt;/P&gt;
&lt;P data-start="183" data-end="230"&gt;You are no longer operating only a GPU runtime. You are operating a&amp;nbsp;&lt;STRONG data-start="252" data-end="276"&gt;distributed timeline&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P data-start="279" data-end="450"&gt;And the timeline is not a background detail. In continuous batching, the timeline becomes a control input that reshapes batching, shapes, and eventually the executed plan.&lt;/P&gt;
&lt;P data-start="452" data-end="514"&gt;Let me be precise about what I am claiming, and what I am not.&lt;/P&gt;
&lt;P data-start="516" data-end="609"&gt;I am &lt;STRONG&gt;not&lt;/STRONG&gt; going to claim collectives reorder arithmetic inside a kernel. That would be sloppy.&lt;/P&gt;
&lt;P data-start="611" data-end="637"&gt;The correct claim is this:&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="639" data-end="799"&gt;&lt;STRONG data-start="639" data-end="799"&gt;Distributed synchronization changes the timeline. &lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="639" data-end="799"&gt;&lt;STRONG data-start="639" data-end="799"&gt;The timeline changes admission and batching. Batching changes shapes. Shapes change which plans are legal.&lt;/STRONG&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="801" data-end="904"&gt;That’s enough to explain why the “same prompt, same checkpoint, temp=0” can drift only under real load.&lt;/P&gt;
&lt;H4 data-start="911" data-end="953"&gt;The minimal equation you should carry&lt;/H4&gt;
&lt;P data-start="955" data-end="1013"&gt;At each decode step, your latency is no longer “GPU time.”&lt;/P&gt;
&lt;P data-start="1015" data-end="1046"&gt;It’s GPU time plus fabric time:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;t_step ≈ t_compute + t_comm + t_sync&lt;/LI-CODE&gt;
&lt;P data-start="1090" data-end="1228"&gt;And the part that hurts is that &lt;STRONG data-start="1122" data-end="1158"&gt;t_comm and t_sync are not stable&lt;/STRONG&gt;. They are affected by contention, queueing, stragglers, and topology.&lt;/P&gt;
&lt;P data-start="1230" data-end="1318"&gt;A useful mental model for the communication piece is the classic latency–bandwidth form:&lt;/P&gt;
&lt;P data-start="1320" data-end="1357"&gt;&lt;STRONG data-start="1320" data-end="1357"&gt;t_comm(message) ≈ α + (n / β_eff)&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-start="1359" data-end="1525"&gt;
&lt;LI data-start="1359" data-end="1427"&gt;&lt;STRONG data-start="1361" data-end="1366"&gt;α&lt;/STRONG&gt; is the per-collective startup and synchronization overhead&lt;/LI&gt;
&lt;LI data-start="1428" data-end="1452"&gt;&lt;STRONG data-start="1430" data-end="1435"&gt;n&lt;/STRONG&gt; is bytes moved&lt;/LI&gt;
&lt;LI data-start="1453" data-end="1525"&gt;&lt;STRONG data-start="1455" data-end="1464"&gt;β_eff&lt;/STRONG&gt; is the effective bandwidth you actually get under contention&lt;/LI&gt;
&lt;/UL&gt;
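&lt;P&gt;In code, that model is one line, and the interesting part is how β_eff moves under contention (all numbers below are illustrative):&lt;/P&gt;

```python
def t_comm(n_bytes, alpha, beta_eff):
    """Latency-bandwidth model for one collective: alpha + n / beta_eff."""
    return alpha + n_bytes / beta_eff

# One per-layer all-reduce of ~16 MB of activations (illustrative numbers).
n = 16 * 2**20
calm = t_comm(n, alpha=15e-6, beta_eff=300e9)       # quiet fabric
contended = t_comm(n, alpha=15e-6, beta_eff=120e9)  # same bytes, congested fabric
print(f"calm ~{calm * 1e6:.0f} us, contended ~{contended * 1e6:.0f} us")
```

&lt;P&gt;Multiply that delta by the number of collective boundaries per token step, and by tokens per response, and the “performance math” is already large enough to reshape admission and batching.&lt;/P&gt;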
&lt;P data-start="1527" data-end="1574"&gt;In isolation, this looks like performance math.&lt;/P&gt;
&lt;P data-start="1576" data-end="1697"&gt;In a continuous batching server, this becomes behavior math, because t_step feeds back into the next scheduling decision.&lt;/P&gt;
&lt;H4 data-start="1828" data-end="1888"&gt;What actually happens in multi-node TP at token cadence&lt;/H4&gt;
&lt;P data-start="1890" data-end="2104"&gt;Tensor parallelism shards the model across devices. Every token step requires cross-device coordination for some portion of the layer execution. In practice, this means collectives become part of the critical path.&lt;/P&gt;
&lt;P data-start="2106" data-end="2443"&gt;NCCL’s &lt;A class="lia-external-url" href="https://docs.nvidia.com/deeplearning/nccl/user-guide/docs/usage/collectives.html" target="_blank" rel="noopener"&gt;collective ops&lt;/A&gt; are explicit about the semantics: for example, AllReduce reduces values across ranks and returns identical results to all ranks. That tells you what the runtime must do: it must wait for coordination across ranks before progressing.&lt;/P&gt;
&lt;P data-start="2445" data-end="2472"&gt;So the decode loop becomes:&lt;/P&gt;
&lt;OL data-start="2474" data-end="2627"&gt;
&lt;LI data-start="2474" data-end="2514"&gt;execute local compute for this step&lt;/LI&gt;
&lt;LI data-start="2515" data-end="2545"&gt;hit a collective boundary&lt;/LI&gt;
&lt;LI data-start="2546" data-end="2616"&gt;wait for the slowest rank to finish and for the fabric to deliver&lt;/LI&gt;
&lt;LI data-start="2617" data-end="2627"&gt;proceed&lt;/LI&gt;
&lt;/OL&gt;
&lt;P data-start="2629" data-end="2697"&gt;That “slowest rank” detail is the piece people feel but rarely name.&lt;/P&gt;
&lt;P data-start="2699" data-end="2906"&gt;In distributed inference, &lt;STRONG data-start="2725" data-end="2759"&gt;p99 is often a straggler story&lt;/STRONG&gt;. A single congested link, a slightly delayed rank, or a transient fabric stall turns into a global stall because collectives synchronize progress.&lt;/P&gt;
&lt;P data-start="2908" data-end="3039"&gt;In other words, a multi-node TP system behaves like a coupled oscillator: the fastest GPU is still gated by the slowest collective.&lt;/P&gt;
&lt;H4 data-start="3159" data-end="3220"&gt;Why this changes the executed plan, not just the latency&lt;/H4&gt;
&lt;P data-start="3222" data-end="3275"&gt;Here’s the bridge to the thesis of the whole article.&lt;/P&gt;
&lt;P data-start="3277" data-end="3398"&gt;In a continuous batching server, you do not just execute requests. You continuously reform microbatches at token cadence.&lt;/P&gt;
&lt;P data-start="3400" data-end="3453"&gt;That means step time affects who joins the next step.&lt;/P&gt;
&lt;P data-start="3455" data-end="3546"&gt;And in multi-node TP, fabric jitter is one of the biggest sources of step-time variability.&lt;/P&gt;
&lt;P data-start="3548" data-end="3606"&gt;So when comm jitter shifts t_step, it shifts the schedule:&lt;/P&gt;
&lt;UL data-start="3608" data-end="3757"&gt;
&lt;LI data-start="3608" data-end="3631"&gt;queue delay changes&lt;/LI&gt;
&lt;LI data-start="3632" data-end="3665"&gt;microbatch membership changes&lt;/LI&gt;
&lt;LI data-start="3666" data-end="3699"&gt;effective shape class changes&lt;/LI&gt;
&lt;LI data-start="3700" data-end="3733"&gt;workspace feasibility changes&lt;/LI&gt;
&lt;LI data-start="3734" data-end="3757"&gt;tactic choice changes&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="3759" data-end="3959"&gt;You already established earlier that a changed shape class can force a different tactic set. Multi-node TP adds a new reason shape churn happens: not only GPU pressure, but &lt;STRONG data-start="3932" data-end="3958"&gt;fabric timing pressure&lt;/STRONG&gt;.&lt;/P&gt;
&lt;img /&gt;
&lt;P data-start="3961" data-end="4001"&gt;So the claim stays clean and defensible:&lt;/P&gt;
&lt;P data-start="4003" data-end="4147"&gt;&lt;STRONG data-start="4003" data-end="4147"&gt;Distributed synchronization doesn’t need to change arithmetic to change behavior. It only needs to change the timeline that drives batching.&lt;/STRONG&gt;&lt;/P&gt;
&lt;H4 data-start="4276" data-end="4368"&gt;Chip-to-fabric reality: why infrastructure details belong in the reproducibility record&lt;/H4&gt;
&lt;P data-start="4370" data-end="4427"&gt;At this scale, the infrastructure is part of the runtime.&lt;/P&gt;
&lt;P data-start="4429" data-end="4727"&gt;According to &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/azure/virtual-machines/sizes/gpu-accelerated/ndh100v5-series" target="_blank" rel="noopener"&gt;Azure Docs&lt;/A&gt;, Azure’s ND H100 v5 series is explicitly positioned for tightly coupled scale-up and scale-out Generative AI and HPC workloads, and it’s built around the idea that the fabric matters, not just the GPUs:&lt;/P&gt;
&lt;P data-start="4729" data-end="4923"&gt;If you are running multi-node TP in production, treat fabric telemetry as part of your reproducibility record. Not because it is fun. Because it changes the system timeline that drives batching.&lt;/P&gt;
&lt;P data-start="4925" data-end="4966"&gt;A practical minimum is to track per-step:&lt;/P&gt;
&lt;UL data-start="4968" data-end="5312"&gt;
&lt;LI data-start="4968" data-end="5040"&gt;collective type on the critical path (e.g., all-reduce / all-gather)&lt;/LI&gt;
&lt;LI data-start="5041" data-end="5095"&gt;comm time and jitter (p50/p95/p99 per step window)&lt;/LI&gt;
&lt;LI data-start="5096" data-end="5143"&gt;rank skew (max(rank_time) − min(rank_time))&lt;/LI&gt;
&lt;LI data-start="5144" data-end="5189"&gt;effective bandwidth estimate (n / t_comm)&lt;/LI&gt;
&lt;LI data-start="5190" data-end="5252"&gt;retransmit / congestion signals if your stack exposes them&lt;/LI&gt;
&lt;LI data-start="5253" data-end="5312"&gt;a “fabric regime” marker: normal vs congested vs degraded&lt;/LI&gt;
&lt;/UL&gt;
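&lt;P&gt;A minimal shape for that record, with illustrative field names you would map onto whatever telemetry your stack actually exposes:&lt;/P&gt;

```python
from dataclasses import dataclass

@dataclass
class FabricStepRecord:
    """Per-step fabric telemetry for the reproducibility record.

    Field names are illustrative; map them to what your stack exposes.
    """
    step_id: int
    collective: str        # e.g. "all_reduce", "all_gather"
    rank_times_s: list     # per-rank time on the critical-path collective
    bytes_moved: int
    regime: str = "normal" # "normal" | "congested" | "degraded"

    @property
    def rank_skew_s(self) -> float:
        # max(rank_time) - min(rank_time): the straggler signal
        return max(self.rank_times_s) - min(self.rank_times_s)

    @property
    def beta_eff(self) -> float:
        # effective bandwidth estimate: bytes over the slowest rank's time
        return self.bytes_moved / max(self.rank_times_s)

rec = FabricStepRecord(
    step_id=1042,
    collective="all_reduce",
    rank_times_s=[0.0011, 0.0012, 0.0030, 0.0011],  # one straggler
    bytes_moved=16 * 2**20,
)
print(f"skew {rec.rank_skew_s * 1e3:.1f} ms, beta_eff {rec.beta_eff / 1e9:.2f} GB/s")
```

&lt;P&gt;One record per step window is enough to tag every drift report with the fabric regime it happened in, which is the difference between “we cannot reproduce it” and “it only happens when rank skew exceeds a millisecond.”&lt;/P&gt;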
&lt;H4 data-start="5426" data-end="5481"&gt;When drift becomes expensive&lt;/H4&gt;
&lt;P data-start="5483" data-end="5575"&gt;This is one of the reasons enterprise teams report the most confusing failures only at load.&lt;/P&gt;
&lt;P data-start="5577" data-end="5699"&gt;At idle, your timeline is stable, your microbatches are stable, your shapes are stable, and your plan selection is stable.&lt;/P&gt;
&lt;P data-start="5701" data-end="5934"&gt;Under real load, the fabric introduces jitter, jitter reshapes batching, batching reshapes shapes, and shapes reshape the executed plan. Now two replicas can disagree, not because the model changed, but because the timeline differed.&lt;/P&gt;
&lt;P data-start="5936" data-end="5953"&gt;That shows up as:&lt;/P&gt;
&lt;UL data-start="5955" data-end="6261"&gt;
&lt;LI data-start="5955" data-end="6020"&gt;inconsistent answers across replicas in high-stakes workflows&lt;/LI&gt;
&lt;LI data-start="6021" data-end="6084"&gt;reproducibility failures during audits and incident reviews&lt;/LI&gt;
&lt;LI data-start="6085" data-end="6160"&gt;“regressions” after scaling out, even with the same checkpoint and code&lt;/LI&gt;
&lt;LI data-start="6161" data-end="6261"&gt;support costs and credibility loss because you cannot explain why behavior changed only at p95/p99&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="6263" data-end="6336"&gt;So the operational sentence I want you to carry into your postmortems is:&lt;/P&gt;
&lt;P data-start="6338" data-end="6547"&gt;&lt;STRONG data-start="6338" data-end="6547"&gt;In multi-node tensor parallel inference, the execution plan extends across the fabric. If you do not log the fabric timeline, you are missing part of the runtime state that decides which plan was feasible.&lt;/STRONG&gt;&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="0" data-end="57"&gt;Where Infrastructure Stops Being “Just Infrastructure”&lt;/H2&gt;
&lt;P data-start="59" data-end="276"&gt;Once you accept the thesis of this article, one conclusion becomes unavoidable: &lt;STRONG data-start="139" data-end="276"&gt;cloud choices are not just cost and convenience decisions. &lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="59" data-end="276"&gt;&lt;EM&gt;&lt;STRONG data-start="139" data-end="276"&gt;They shape which execution regimes your runtime will enter under pressure.&lt;/STRONG&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;P data-start="278" data-end="336"&gt;At scale, you are no longer buying “GPUs.” You are buying:&lt;/P&gt;
&lt;UL data-start="338" data-end="1001"&gt;
&lt;LI data-start="338" data-end="421"&gt;&lt;STRONG data-start="340" data-end="365"&gt;A fabric and topology&lt;/STRONG&gt; that holds up under synchronized token-step collectives&lt;/LI&gt;
&lt;LI data-start="422" data-end="559"&gt;&lt;STRONG data-start="424" data-end="472"&gt;A VM family with predictable characteristics&lt;/STRONG&gt; for tightly coupled scale-out workloads (the kind multi-node inference actually is)&lt;/LI&gt;
&lt;LI data-start="560" data-end="708"&gt;&lt;STRONG data-start="562" data-end="586"&gt;An isolation posture&lt;/STRONG&gt; that can be enforced in hardware when your threat model requires it, without hand-waving away the runtime implications&lt;/LI&gt;
&lt;LI data-start="709" data-end="1001"&gt;&lt;STRONG data-start="711" data-end="740"&gt;First-class observability&lt;/STRONG&gt; for GPU behavior, not just CPU and request traces, so you can correlate drift with the state variables that caused it (for example, exporting NVIDIA DCGM metrics into managed Prometheus and Azure Managed Grafana on AKS).&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1003" data-end="1283"&gt;This is the quiet reason certain platforms feel “more stable” in production.&lt;/P&gt;
&lt;P data-start="1003" data-end="1283"&gt;Not because the model is different, but because the&amp;nbsp;&lt;STRONG data-start="1132" data-end="1194"&gt;runtime state is easier to constrain, measure, and explain&lt;/STRONG&gt; when the underlying infrastructure is designed for the exact regime you’re operating in.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;Quantization effects on execution paths and causal stragglers in multi-node TP&lt;/H2&gt;
&lt;P&gt;Let me be direct about what most articles miss when they discuss distributed inference at scale.&lt;/P&gt;
&lt;P&gt;The conversation typically stops at "how many GPUs" and "what's the bandwidth." That's not wrong. It's just incomplete. What's missing is the interaction between &lt;STRONG&gt;quantization-induced plan churn&lt;/STRONG&gt; and &lt;STRONG&gt;straggler amplification&lt;/STRONG&gt; in the collective path, two forces that quietly reshape your execution regime under VRAM pressure and fabric contention.&lt;/P&gt;
&lt;P&gt;These are not theoretical curiosities. They are production realities at 100+ GPU scale, the kind of scale where you can no longer afford to treat quantization as a "precision choice" or stragglers as a "latency outlier." At that scale, they become &lt;STRONG&gt;causal inputs&lt;/STRONG&gt; to your runtime's decision surface.&lt;/P&gt;
&lt;H4&gt;Quantization variability: not just precision, but plan selection&lt;/H4&gt;
&lt;P&gt;When teams talk about INT8 or FP8 quantization, the conversation usually centers on memory savings and throughput gains. That's the marketing layer.&lt;/P&gt;
&lt;P&gt;The execution layer is more nuanced: quantization changes &lt;STRONG&gt;which kernels are legal&lt;/STRONG&gt;, &lt;STRONG&gt;where fusion boundaries land&lt;/STRONG&gt;, and &lt;STRONG&gt;how reduction trees are staged&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;Here's what I mean in concrete terms. Under VRAM pressure, your serving stack may need to requantize activations mid-forward-pass to stay within memory bounds. That requant step is not "free" in the plan sense. It introduces:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;dequant/requant cycles&lt;/STRONG&gt; that break fusion opportunities you had in the FP16 path&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;new non-associative operations&lt;/STRONG&gt; in the reduction tree, where rounding happens at different stages&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;fallback paths&lt;/STRONG&gt; when the quantized kernel variant lacks workspace or doesn't support the current shape class&lt;/LI&gt;
&lt;/UL&gt;
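&lt;P&gt;A toy numeric sketch makes that concrete. Nothing below is a real kernel; &lt;EM&gt;fake_quant_int8&lt;/EM&gt; is a stand-in for a runtime requant step. The point it demonstrates is narrow and real: add one requant cycle and change the accumulation order, and the last bits of a reduction move.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import numpy as np

def fake_quant_int8(x, scale):
    # symmetric INT8 quantize/dequantize round trip; a toy stand-in for a
    # runtime requant step (real kernels fuse and stage this per tactic)
    q = np.clip(np.round(x / scale), -127, 127)
    return (q * scale).astype(np.float32)

rng = np.random.default_rng(0)
x = rng.standard_normal(4096).astype(np.float32)
scale = float(np.abs(x).max() / 127)

# Path A: one quant round trip, sequential FP32 accumulation
a = np.float32(0.0)
for v in fake_quant_int8(x, scale):
    a = a + v

# Path B: an extra requant cycle with a slightly shifted scale (as a
# fallback path might introduce), then pairwise accumulation
y = fake_quant_int8(fake_quant_int8(x, scale), scale * 1.0001)
b = np.float32(y.reshape(-1, 2).sum(axis=1).sum())

print(abs(float(a) - float(b)))  # nonzero: same tensor, different pathway&lt;/LI-CODE&gt;
&lt;P&gt;Same input tensor, two legal pathways, two different sums. In a decode loop with thin logit margins, that difference is exactly the kind that becomes a different token.&lt;/P&gt;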
&lt;P&gt;Let me state this in the language of the article's thesis: &lt;EM&gt;quantization is not a data type. It is a tactic constraint that reshapes the feasible plan space.&lt;/EM&gt;&lt;/P&gt;
&lt;img&gt;&lt;STRONG data-start="442" data-end="513"&gt;Quantization-induced plan divergence under VRAM pressure.&lt;/STRONG&gt;&lt;BR data-start="513" data-end="516" /&gt;Memory pressure can force dequant/requant cycles, change fusion boundaries, and trigger fallback kernels with different reduction staging, producing last-bit differences that can flip tokens during decoding.&lt;/img&gt;
&lt;P&gt;&lt;STRONG&gt;The practical consequence? &lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Two replicas running "the same quantized model" can execute different kernel variants when one is memory-pressured and the other is not. The memory-pressured replica may be forced into a fallback path with different reduction staging. Different staging means different rounding order. Different rounding order means different last bits. And in decoding, last bits can become different tokens.&lt;/P&gt;
&lt;P&gt;I've watched incident reviews where teams assumed INT8 was "deterministic" because they set the quantization scheme once at export time.&lt;/P&gt;
&lt;P&gt;What they missed is that the&amp;nbsp;&lt;STRONG&gt;runtime's quantization pathway&lt;/STRONG&gt; depends on the state of VRAM fragmentation, workspace availability, and kernel preference histograms, exactly the regime-dependent variables we've been building toward throughout this article.&lt;/P&gt;
&lt;P&gt;If you're operating at scale, instrument this. Track:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;per-step kernel selection via cuBLASLt preference descriptors&lt;/LI&gt;
&lt;LI&gt;dequant/requant cycle counts when memory pressure rises&lt;/LI&gt;
&lt;LI&gt;fallback events when preferred quantized tactics become infeasible&lt;/LI&gt;
&lt;LI&gt;whether the executed plan matched the "expected" quantization pathway&lt;/LI&gt;
&lt;/UL&gt;
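&lt;P&gt;A minimal sketch of what that tracking can look like. The class and counters below are hypothetical; in a real deployment the signals would come from cuBLASLt algorithm IDs, allocator statistics, and engine logs, not from hand-fed calls:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from collections import Counter

class PlanTelemetry:
    """Toy per-replica counters for the signals listed above (illustrative;
    the real sources are cuBLASLt algo IDs, allocator stats, engine logs)."""
    def __init__(self, expected_algo):
        self.expected_algo = expected_algo
        self.algo_hist = Counter()     # per-step kernel/algorithm selection
        self.requant_cycles = 0        # dequant/requant cycles under pressure
        self.fallbacks = 0             # preferred tactic infeasible
        self.mismatches = 0            # executed plan differed from expected

    def record_step(self, algo_id, requant_cycles, fell_back):
        self.algo_hist[algo_id] += 1
        self.requant_cycles += requant_cycles
        self.fallbacks += int(fell_back)
        self.mismatches += int(algo_id != self.expected_algo)

    def churn_ratio(self):
        total = sum(self.algo_hist.values())
        if total == 0:
            return 0.0
        return 1.0 - self.algo_hist[self.expected_algo] / total

t = PlanTelemetry(expected_algo=42)
t.record_step(42, requant_cycles=0, fell_back=False)
t.record_step(17, requant_cycles=3, fell_back=True)   # memory-pressured step
print(t.churn_ratio())  # 0.5: half the steps left the expected pathway&lt;/LI-CODE&gt;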
&lt;P&gt;This is rare telemetry. Most teams never see it because they're not running large enough clusters under sustained pressure.&lt;/P&gt;
&lt;P&gt;But once you cross into 100+ GPU inference workloads, quantization-induced plan churn becomes visible in your p99 drift signatures.&lt;/P&gt;
&lt;H4&gt;Causal stragglers: when one rank's fallback stalls the collective&lt;/H4&gt;
&lt;P&gt;Now let's talk about the fabric-scale pathology that couples with everything we just discussed: &lt;STRONG&gt;head-of-line blocking in distributed tensor parallelism&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;You already know from the multi-node TP section that collectives synchronize progress. The fastest rank waits for the slowest. That's the contract.&lt;/P&gt;
&lt;P&gt;What's less documented—and what I've only seen formalized in internal NVIDIA serving postmortem templates—is how a &lt;STRONG&gt;single rank's kernel fallback&lt;/STRONG&gt; can become a &lt;STRONG&gt;collective-wide straggler&lt;/STRONG&gt;, and how that straggler amplifies through the batching feedback loop.&lt;/P&gt;
&lt;P&gt;Here's the causal chain:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;One rank enters memory pressure.&lt;/STRONG&gt; Maybe fragmentation is worse on that device, maybe it's handling a slightly different KV layout due to request assignment.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;That rank falls back to a slower tactic.&lt;/STRONG&gt; The preferred kernel requires workspace. Workspace isn't available. The engine selects a legal fallback.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The fallback kernel takes longer.&lt;/STRONG&gt; Not by seconds—by milliseconds. But in a collective, milliseconds matter.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The collective waits.&lt;/STRONG&gt; AllReduce can't proceed until all ranks contribute. The straggler becomes the bottleneck.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Step time stretches.&lt;/STRONG&gt; The stretched step reshapes the next batch in continuous batching. Different batch, different shapes, different feasibility.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The cycle repeats.&lt;/STRONG&gt; Now multiple ranks may be in fallback paths. The p99 drift you're seeing isn't random—it's a feedback loop.&lt;/LI&gt;
&lt;/OL&gt;
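&lt;P&gt;The arithmetic of step 4 is worth seeing in isolation. A toy simulation (the latencies are illustrative, not measured): because step time is the max over ranks, the collective pays the straggler's penalty on essentially every step, not occasionally:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import random

def step_time(rank_times):
    # a token-step collective synchronizes: the step takes as long
    # as the slowest rank
    return max(rank_times)

def mean_step_ms(steps=200, ranks=8, fallback_penalty_ms=1.2, seed=7):
    rng = random.Random(seed)
    base = 20.0                    # ms of per-rank compute per token step
    pressured = {5}                # one rank under VRAM pressure
    total = 0.0
    for _ in range(steps):
        rank_times = []
        for r in range(ranks):
            t = base + rng.uniform(0.0, 0.3)      # normal jitter
            if r in pressured:
                t += fallback_penalty_ms          # slower fallback tactic
            rank_times.append(t)
        total += step_time(rank_times)
    return total / steps

healthy = mean_step_ms(fallback_penalty_ms=0.0)
stalled = mean_step_ms(fallback_penalty_ms=1.2)
print(stalled - healthy)  # most of the full 1.2 ms, paid on every step&lt;/LI-CODE&gt;
&lt;P&gt;One rank's millisecond-scale fallback becomes the whole group's step time. That stretched step is what feeds back into the batch scheduler in step 5.&lt;/P&gt;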
&lt;P&gt;This is what I call a &lt;STRONG&gt;causal straggler&lt;/STRONG&gt;: not just a slow rank, but a rank whose performance degradation causally reshapes the execution regime of the entire TP group.&lt;/P&gt;
&lt;P&gt;And here's where quantization and stragglers intersect. If one rank is under more VRAM pressure and is forced into more frequent dequant/requant cycles, it becomes the straggler. Its quantization pathway differs from the other ranks—not because the model changed, but because the memory regime changed. That difference in pathway becomes a difference in step time. That difference in step time becomes a collective stall. That stall becomes a batching change. That batching change becomes a new plan.&lt;/P&gt;
&lt;P&gt;The output drifts, and you're left wondering why "the same checkpoint at temperature zero" produced different text only under load.&lt;/P&gt;
&lt;P&gt;The answer is: you weren't in the same execution regime. You were in a regime where one rank's memory pressure caused a straggler, the straggler caused a timeline shift, and the timeline shift caused a plan change.&lt;/P&gt;
&lt;H4&gt;Rarity value: why this knowledge comes from production battle scars&lt;/H4&gt;
&lt;P&gt;Let me be honest about why these gaps are rare.&lt;/P&gt;
&lt;P&gt;Most teams never operate at the scale where these effects dominate. If you're running inference on 8 GPUs, you might see hints of this. At 100+ GPUs with multi-node TP and continuous batching under sustained load, it's no longer a hint—it's the signature.&lt;/P&gt;
&lt;P&gt;The teams that &lt;EM&gt;do&lt;/EM&gt; operate at this scale track:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;cuBLASLt preference histograms&lt;/STRONG&gt; to detect when algorithm selection is churning across steps&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;NCCL timeline traces&lt;/STRONG&gt; to identify straggler signatures and correlate them with per-rank memory state&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;per-rank kernel fallback events&lt;/STRONG&gt; to see when one device is operating a different plan than its peers&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;quantization pathway divergence&lt;/STRONG&gt; across ranks under pressure&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This is the telemetry that doesn't show up in tutorials. It shows up in postmortems at hyperscaler SLO thresholds, where p99 latency violations trigger incident reviews and someone finally asks: "Why did replica 3 disagree with replica 1 only during the peak load window?"&lt;/P&gt;
&lt;P&gt;The earlier sections of this article cover single-node memory regimes. What bridges them to fabric scale is this: &lt;STRONG&gt;fabric-scale causality&lt;/STRONG&gt;. In multi-node TP, your execution regime is not shaped only by your own GPU's memory state; it is shaped by the &lt;EM&gt;worst&lt;/EM&gt; GPU's memory state, because collectives couple everyone's timeline.&lt;/P&gt;
&lt;P&gt;That's the gap. That's the rarity value. And if you're building or operating inference at 100+ GPU scale, that's the layer where your next outage is hiding.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2&gt;Peak depth: wavefront divergence, tensor core fragmentation, NCCL backpressure, and ISR collision&lt;/H2&gt;
&lt;P&gt;Everything above operates at the principal and staff engineer level. What follows is the layer below that—the chip architect handoff, where you stop talking about "plans" in the abstract and start talking about &lt;STRONG&gt;warp stall cycles&lt;/STRONG&gt;, &lt;STRONG&gt;tensor core fragment occupancy&lt;/STRONG&gt;, &lt;STRONG&gt;NCCL retransmit chains&lt;/STRONG&gt;, and &lt;STRONG&gt;memory evaporation under replication pressure&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;I'm writing this section because it's the part I never see published outside internal design reviews, and because these are the exact pathologies that turn a well-architected inference cluster into a system that disagrees with itself only during peak traffic.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;"Most engineers debug the layer they understand. The system breaks at the layer they don't. In production inference, that layer is almost always the one where microarchitecture meets scheduling meets the fabric."&lt;/P&gt;
&lt;P&gt;— Hazem Ali&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H4&gt;Wavefront divergence in decode attention kernels&lt;/H4&gt;
&lt;P&gt;Let me take you inside the warp.&lt;/P&gt;
&lt;P&gt;In SIMT execution, a warp is 32 threads executing in lockstep. When all threads follow the same control path, you get full utilization. When they diverge—different threads take different branches—the warp must serialize both paths. That's textbook GPU architecture.&lt;/P&gt;
&lt;P&gt;What's not textbook is how this interacts with &lt;STRONG&gt;paged KV attention in production decode loops&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;In a paged KV system (the exact kind vLLM introduced), KV blocks are scattered across VRAM. Different sequences in the same microbatch may have their KV blocks in different residency states: some hot in L2, some cold in HBM, some partially evicted under paging pressure. When the attention kernel issues loads for KV blocks, &lt;STRONG&gt;threads within the same warp can stall at different rates&lt;/STRONG&gt; depending on which blocks they're accessing and where those blocks reside.&lt;/P&gt;
&lt;P&gt;This creates a subtle but measurable pathology:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Lane divergence inside the attention kernel.&lt;/STRONG&gt; Not control-flow divergence in the traditional sense, but &lt;STRONG&gt;memory-latency divergence&lt;/STRONG&gt;: some lanes return fast (L2 hit), some stall (HBM fetch), and the warp can't retire until the slowest lane completes.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Register pressure amplification.&lt;/STRONG&gt; When warps stall, the SM must keep their register state live. Under heavy stalling, register pressure rises, which can force the compiler to spill to local memory (which lives in L2/HBM). Spills create more memory traffic, which creates more stalls. It's a feedback loop at the microarchitectural level.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Measurable p99 step variance in identical-shape batches.&lt;/STRONG&gt; This is the part that confuses teams. Two consecutive decode steps with the same batch size and the same sequence lengths can have different step times, because the KV block residency pattern differed. The shape was identical. The &lt;EM&gt;memory topology&lt;/EM&gt; was not.&lt;/LI&gt;
&lt;/UL&gt;
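&lt;P&gt;The math behind the first bullet is simple but unforgiving. A toy model (the latencies below are illustrative round numbers, not measured H100 figures): if each of 32 lanes independently hits L2 with probability p, the warp retires at the slowest lane, so even a 99% per-lane hit rate leaves about 1 - 0.99^32, roughly 27% of warps, paying the full HBM latency:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import random

def warp_retire_ns(hit_rate, rng, lanes=32, l2_ns=30, hbm_ns=400):
    # each lane loads one KV block: fast on an L2 hit, slow on an HBM fetch;
    # in SIMT the warp cannot retire until the slowest lane returns
    lane_lat = rng.choices([l2_ns, hbm_ns],
                           weights=[hit_rate, 1.0 - hit_rate], k=lanes)
    return max(lane_lat)

def mean_retire(hit_rate, trials=2000, seed=3):
    rng = random.Random(seed)
    total = sum(warp_retire_ns(hit_rate, rng) for _ in range(trials))
    return total / trials

print(mean_retire(0.99))  # far above l2_ns: ~27% of warps pay HBM latency
print(mean_retire(0.70))  # effectively always HBM-bound&lt;/LI-CODE&gt;
&lt;P&gt;This is why small shifts in KV residency produce outsized step-time variance: the warp's retire time is an order statistic, and order statistics are brutal at 32 lanes.&lt;/P&gt;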
&lt;P&gt;If you want to see this in practice, the tool is Nsight Systems. What you're looking for:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;
# Nsight Systems trace analysis: partition warp stall cycles
# Look for these stall reasons in the GPU metrics view:
#   - smsp__warps_issue_stalled_long_scoreboard  → memory dependency stalls
#   - smsp__warps_issue_stalled_short_scoreboard → register dependency stalls  
#   - smsp__warps_issue_stalled_no_instruction   → instruction cache miss
#
# Correlate with:
#   - l1tex__t_sectors_pipe_lsu_mem_global_op_ld  → global load sectors (KV fetches)
#   - lts__t_sectors_srcunit_tex_op_read_hit_rate → L2 hit rate during attention
#
# The diagnostic signal: when stall_long_scoreboard spikes correlate with
# L2 hit rate drops, you're seeing KV residency divergence across warps.&lt;/LI-CODE&gt;
&lt;P&gt;The stall partition tells you &lt;EM&gt;why&lt;/EM&gt; the warp stalled. When you see &lt;STRONG&gt;long_scoreboard stalls&lt;/STRONG&gt; dominating during attention kernels—and you see them correlating with L2 miss rate fluctuations—you're observing exactly the KV residency divergence I'm describing. The warp is waiting for scattered KV blocks, and the scatter pattern changes with every batch because paging decisions are state-dependent.&lt;/P&gt;
&lt;P&gt;This is how "identical shapes" produce different timelines. The shape is the same. The KV block map is not. And the block map is a function of runtime allocation history—the same state-dependent variable that drives everything else in this article.&lt;/P&gt;
&lt;H4&gt;Tensor core fragment utilization collapse under shape churn&lt;/H4&gt;
&lt;P&gt;Now let's go inside the tensor cores themselves.&lt;/P&gt;
&lt;P&gt;H100 and Blackwell tensor cores operate on &lt;STRONG&gt;matrix fragments&lt;/STRONG&gt;—fixed-size tiles that map directly to the hardware's matrix multiply-accumulate units. On H100, the native fragment sizes for FP16 are typically 16×16×16 (m×n×k). When your operand dimensions align cleanly with fragment boundaries, you get full utilization. When they don't, you get &lt;STRONG&gt;fragment waste&lt;/STRONG&gt;: the hardware still executes full fragments, but some of the lanes carry padding zeros.&lt;/P&gt;
&lt;P&gt;In continuous batching, shape churn is the norm. Your microbatch dimensions change at token cadence. And this is where a subtle but devastating efficiency collapse hides.&lt;/P&gt;
&lt;P&gt;Consider two microbatches that arrive one step apart:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;
# Step t:   B=16, L=2048  →  GEMM shape aligns cleanly with 16×16 fragments
#           Fragment utilization: ~98%
#           cuBLASLt selects: WMMA-based kernel (tensor core native)
#
# Step t+1: B=17, L=2047  →  GEMM shape straddles fragment boundaries
#           Fragment utilization: drops below 25% on trailing tiles
#           cuBLASLt selects: fallback to non-WMMA FP16 kernel
#           (or WMMA with heavy padding, depending on heuristic)&lt;/LI-CODE&gt;
&lt;P&gt;The difference is one sequence in the batch and one token in context length. The performance consequence is that &lt;STRONG&gt;the runtime switches from tensor core native execution to a scalar FP16 path&lt;/STRONG&gt;. That's not a minor variant. That's a fundamentally different instruction mix, a different reduction tree, and a different accumulation order.&lt;/P&gt;
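&lt;P&gt;The fragment arithmetic behind that collapse is easy to compute yourself. A small sketch, assuming the 16-wide fragment tiling discussed above (the tile size is the only assumption here):&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import math

def tile_utilization(m, n, tile=16):
    # fraction of fragment lanes doing real work when an m-by-n GEMM
    # output is covered by fixed tile-by-tile fragments (padded with zeros)
    padded = (math.ceil(m / tile) * tile) * (math.ceil(n / tile) * tile)
    return (m * n) / padded

def trailing_row_utilization(m, tile=16):
    # utilization of just the last row of tiles along the batch dimension
    rem = m % tile
    return 1.0 if rem == 0 else rem / tile

print(tile_utilization(16, 4096))      # 1.0: aligned, no padding
print(tile_utilization(17, 4096))      # 0.53125: one extra row of tiles
print(trailing_row_utilization(17))    # 0.0625: trailing tiles are 1/16 full&lt;/LI-CODE&gt;
&lt;P&gt;One extra sequence in the batch halves overall fragment utilization and leaves the trailing tiles almost empty, which is exactly the condition under which a heuristic can prefer a non-WMMA path.&lt;/P&gt;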
&lt;P&gt;The ulp deltas that result from this switch don't stay contained in the GEMM output. They propagate forward through &lt;STRONG&gt;layer normalization&lt;/STRONG&gt;—which is itself a reduction over the hidden dimension. Layer norm amplifies small differences because it divides by a variance term computed from the same values. A tiny shift in the GEMM output becomes a slightly different variance, which becomes a slightly different normalization, which becomes a slightly different input to the next layer's attention.&lt;/P&gt;
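&lt;P&gt;You can watch that propagation in a few lines. The sketch below is a toy, not a transformer layer: it perturbs one element of a hidden vector by a last-bit-scale amount and counts how many layer-norm outputs change as a result:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;import numpy as np

def layer_norm(x, eps=1e-5):
    # a reduction over the hidden dimension: the mean and variance couple
    # every output element to every input element
    mu = x.mean()
    var = x.var()
    return (x - mu) / np.sqrt(var + eps)

rng = np.random.default_rng(1)
h = rng.standard_normal(4096).astype(np.float32)

h2 = h.copy()
h2[0] += np.float32(1e-3)   # a last-bit-scale difference in one GEMM output

changed = int((layer_norm(h2) != layer_norm(h)).sum())
print(changed)  # far more than 1: the perturbation spreads through mu and var&lt;/LI-CODE&gt;
&lt;P&gt;One perturbed element, thousands of changed outputs. That is the mechanism by which a single tactic switch in one GEMM becomes a slightly different input to every downstream layer.&lt;/P&gt;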
&lt;P&gt;You can observe this directly via cuBLASLt's algorithm preference reporting:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;
# cuBLASLt algorithm preference histogram (conceptual)
# Track per-step which algorithm ID was selected for the primary GEMM
#
# Healthy (stable shapes):
#   algo_id=42 (WMMA_TENSOR_OP_HMMA_16816)  → 99.2% of steps
#   algo_id=17 (SIMT_FP16_SPLITK)           →  0.8% of steps
#
# Under shape churn (continuous batching, mixed lengths):
#   algo_id=42 (WMMA_TENSOR_OP_HMMA_16816)  → 61.3% of steps
#   algo_id=17 (SIMT_FP16_SPLITK)           → 22.1% of steps
#   algo_id=31 (WMMA_TENSOR_OP_PAD16)       → 16.6% of steps
#
# When algo_id distribution churns, your reduction tree is churning.
# When your reduction tree churns, your last bits are churning.
# When your last bits churn under thin margins, your tokens can flip.&lt;/LI-CODE&gt;
&lt;P&gt;That histogram is the smoking gun. When you see algorithm preference distribution widening under load, you're watching the tensor cores get destabilized by shape churn. The fix isn't "use bigger batches." The fix is to understand that &lt;STRONG&gt;continuous batching creates a shape distribution, not a fixed shape&lt;/STRONG&gt;, and that shape distribution maps directly to a tactic distribution, which maps directly to a ulp distribution.&lt;/P&gt;
&lt;H4&gt;NCCL causal backpressure chains across TP+DP pods&lt;/H4&gt;
&lt;P&gt;Now scale this to the fabric.&lt;/P&gt;
&lt;P&gt;Take an 8×TP + 4×DP pod: 32 GPUs total, where every token step requires AllReduce across the 8-way TP group, and gradient synchronization (or KV redistribution in some architectures) across the 4-way DP group.&lt;/P&gt;
&lt;P&gt;Here's the causal backpressure chain I've traced in production, laid out as a timeline:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;Rank 5 (of 8 TP ranks) hits a quant/dequant stall.&lt;/STRONG&gt; Its KV blocks are fragmented, workspace is tight, and the runtime forces a dequant cycle mid-attention. That adds ~1.2ms to this rank's compute.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;AllReduce stalls on Rank 5.&lt;/STRONG&gt; The other 7 ranks complete their portion and issue their NCCL send. Rank 5 hasn't arrived yet. NCCL's ring/tree protocol can't progress past this rank. Effective t_sync inflates by 2× compared to the no-straggler baseline.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;P2P retransmit triggers.&lt;/STRONG&gt; Under some fabric topologies and congestion states, the delayed arrival from Rank 5 can cause NCCL to hit internal retry logic on the NVLink or InfiniBand path. This is not a "network error"—it's the transport protocol managing flow control under backpressure. But it adds latency jitter that is invisible unless you're tracing at the NCCL bootstrap level.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;vLLM scheduler reacts to the stretched step.&lt;/STRONG&gt; The scheduler sees that step t took 2× longer than expected. Under its latency-aware admission control, it drops batch size from 32 → 12 to protect SLO. Smaller batch means different shapes. Different shapes mean different tactics. The plan changes.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The batch size drop propagates.&lt;/STRONG&gt; With batch size at 12, queued requests wait longer. Queue pressure builds. When the scheduler recovers and re-admits, the burst creates shape churn. Shape churn destabilizes tensor core fragment utilization. The system is now in a different execution regime—triggered by one rank's memory fragmentation.&lt;/LI&gt;
&lt;/OL&gt;
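&lt;P&gt;Step 4 of that chain can be sketched as a toy proportional controller. This is hypothetical pseudologic, not vLLM's actual scheduler, which uses far richer signals than a single step-time ratio, but it shows how a stretched step mechanically changes the shape class:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def next_batch_size(current_bs, step_ms, target_ms, min_bs=1, max_bs=32):
    # toy latency-aware admission control: scale the batch so the next
    # step lands near the latency target (illustrative only)
    proposed = int(current_bs * target_ms / step_ms)
    return max(min_bs, min(max_bs, proposed))

print(next_batch_size(32, step_ms=55.0, target_ms=20.0))  # 11: shape class changes
print(next_batch_size(11, step_ms=12.0, target_ms=20.0))  # 18: re-admission churn&lt;/LI-CODE&gt;
&lt;P&gt;Notice that both directions cause churn: the shed on the stretched step and the burst on recovery. The scheduler is doing its job, and doing its job reshapes the plan space.&lt;/P&gt;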
&lt;P&gt;That is a &lt;STRONG&gt;causal backpressure chain&lt;/STRONG&gt;. Not a latency spike. Not a network blip. A causally connected sequence where a microarchitectural event on one device reshapes the execution plan across the entire pod.&lt;/P&gt;
&lt;P&gt;To trace this, you need NCCL bootstrap traces with NVTX domain annotations:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;
# NCCL tracing with NVTX domains for causal analysis
#
# Environment setup for trace collection:
#   NCCL_DEBUG=INFO
#   NCCL_DEBUG_SUBSYS=INIT,COLL,P2P
#   NSYS_NVTX_DOMAINS=nccl,cuda,cublas
#
# In Nsight Systems, correlate:
#   1. Per-rank kernel duration (cuda domain) — identify the straggler
#   2. NCCL collective start/end (nccl domain) — measure t_sync inflation
#   3. P2P transport events (nccl/P2P) — detect retransmit/backpressure
#   4. Scheduler batch decisions (application NVTX) — see batch size reaction
#
# The causal signal: when rank N's kernel duration spike aligns with
# NCCL collective inflation across all ranks, followed by batch size
# reduction in the scheduler, you have a causal backpressure chain.
#
# Conceptual filter for straggler events: export the trace (for example,
#   nsys export --type sqlite) and query for ncclAllReduce events whose
#   duration_us exceeds 2x the median, then correlate those timestamps
#   with scheduler batch_size change events&lt;/LI-CODE&gt;
&lt;P&gt;This is the telemetry that separates "we think there was network jitter" from "Rank 5's dequant stall caused a 2× collective inflation that forced the scheduler to halve batch size, which shifted the shape class into a non-WMMA tactic for the next 47 steps."&lt;/P&gt;
&lt;P&gt;The first is a guess. The second is a causal explanation. And in an incident review at scale, only the second one survives.&lt;/P&gt;
&lt;H4&gt;ISR + checkpoint overlap pathology: memory evaporation under replication pressure&lt;/H4&gt;
&lt;P&gt;This is the deepest pathology in this article, and it almost never surfaces below 512 sequences per second.&lt;/P&gt;
&lt;P&gt;Large-scale inference deployments use &lt;STRONG&gt;incremental state replication (ISR)&lt;/STRONG&gt; for fault tolerance: rather than checkpointing the entire model state, you replicate KV cache deltas and scheduler state to a standby node incrementally, so failover is fast.&lt;/P&gt;
&lt;P&gt;Separately, many systems run &lt;STRONG&gt;async checkpointing&lt;/STRONG&gt; for recovery: periodic snapshots of model and optimizer state written to persistent storage, overlapped with inference to avoid blocking the decode loop.&lt;/P&gt;
&lt;P&gt;Under normal load, these two systems coexist peacefully. ISR replicates small deltas. Checkpointing writes in the background. Memory headroom is sufficient for both.&lt;/P&gt;
&lt;P&gt;Under paging pressure—the exact regime we've been discussing throughout this article—they collide.&lt;/P&gt;
&lt;P&gt;Here's the pathological interaction:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;&lt;STRONG&gt;The system is under VRAM pressure.&lt;/STRONG&gt; KV blocks are being paged (allocated, evicted, re-allocated) at high frequency. Memory headroom is thin.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;ISR kicks in.&lt;/STRONG&gt; It needs to replicate recent KV deltas to the standby. To do this, it must pin certain KV blocks in memory while it serializes and transmits them.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Async checkpointing overlaps.&lt;/STRONG&gt; The checkpoint writer is also holding references to memory regions it's snapshotting. Under normal conditions, this is fine—there's enough headroom. Under paging pressure, the checkpoint's memory holds compete with ISR's memory holds.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Memory evaporation.&lt;/STRONG&gt; The combined pinning from ISR + checkpointing temporarily removes KV blocks from the pool available to the decode loop. The pager sees available blocks drop. It may be forced to evict &lt;EM&gt;active&lt;/EM&gt; KV blocks—blocks that are needed for in-flight sequences—to make room.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Evicted blocks must be recomputed.&lt;/STRONG&gt; When a sequence's KV is evicted mid-collective (during an AllReduce, for example), the rank that lost its KV must recompute it. That recompute makes this rank the straggler. And we already know what stragglers do to the collective timeline.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The straggler triggers the full backpressure chain.&lt;/STRONG&gt; Collective stall → batch size reduction → shape churn → tactic churn → output drift. All caused by a fault-tolerance mechanism designed to keep you safe.&lt;/LI&gt;
&lt;/OL&gt;
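&lt;P&gt;The accounting in steps 3 and 4 is worth writing down explicitly. A toy KV-pool model (the block counts are invented for illustration) shows how the very same pins are harmless at normal load and force active-block evictions at peak:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def forced_evictions(total_blocks, active_blocks, decode_demand,
                     isr_pinned, ckpt_pinned):
    """Toy KV-pool accounting (illustrative): blocks pinned by replication
    and checkpointing are unavailable to the decode loop, so demand that
    no longer fits must come from evicting active blocks."""
    available = total_blocks - active_blocks - isr_pinned - ckpt_pinned
    shortfall = decode_demand - available
    return max(0, shortfall)

# Normal load: plenty of headroom, the pins are harmless
print(forced_evictions(1000, active_blocks=600, decode_demand=80,
                       isr_pinned=40, ckpt_pinned=30))   # 0

# Peak load: the same pins now evict in-flight sequences' KV
print(forced_evictions(1000, active_blocks=880, decode_demand=80,
                       isr_pinned=40, ckpt_pinned=30))   # 30&lt;/LI-CODE&gt;
&lt;P&gt;Nothing about the pins changed between the two cases. Only the margin did, which is why the pathology is invisible until throughput collapses the margin.&lt;/P&gt;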
&lt;img&gt;&lt;STRONG data-start="692" data-end="779"&gt;ISR + checkpoint overlap causes “memory evaporation” under VRAM pressure.&lt;/STRONG&gt;&lt;BR data-start="779" data-end="782" /&gt;ISR pins KV deltas for replication while async checkpointing pins regions for snapshotting. Under paging pressure, the combined pinning shrinks the decode-available KV pool, forces evictions and recompute, creates stragglers, and cascades into collective stalls → batch reduction → shape/tactic churn → p99 output drift.&lt;/img&gt;
&lt;P&gt;I call this &lt;STRONG&gt;memory evaporation&lt;/STRONG&gt; because from the decode loop's perspective, VRAM that was available simply vanishes for a window of time. The blocks are still physically present—they're held by ISR and the checkpointer, but they're not available to the runtime.&lt;/P&gt;
&lt;P&gt;The effect is identical to a &lt;SPAN class="lia-text-color-13"&gt;sudden drop&lt;/SPAN&gt; in free VRAM, and the runtime reacts accordingly: it enters a pressured regime.&lt;/P&gt;
&lt;P&gt;This is why the pathology rarely surfaces below 512 seq/s. At lower throughput, there's enough headroom that ISR and checkpointing never compete meaningfully with the decode loop's memory needs. At high throughput under sustained load, the margins collapse, and the three systems—decode, ISR, checkpoint—start fighting over the same memory.&lt;/P&gt;
&lt;P&gt;The fix is not "turn off ISR." The fix is to &lt;STRONG&gt;coordinate memory budgets&lt;/STRONG&gt; across these three subsystems and to treat ISR and checkpointing as &lt;STRONG&gt;memory consumers that participate in the regime calculation&lt;/STRONG&gt;. If your regime function doesn't account for replication and checkpoint holds, it's underestimating pressure, and your system will surprise you at exactly the scale where fault tolerance matters most.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# extended regime function accounting for replication and checkpoint pressure
def regime_extended(vram_free_mb, paging_on, isolation_strict, queue_p95_ms,
                    isr_pinned_mb, ckpt_pinned_mb, kv_pool_total_mb):
    effective_free = vram_free_mb - isr_pinned_mb - ckpt_pinned_mb
    effective_ratio = effective_free / kv_pool_total_mb if kv_pool_total_mb &amp;gt; 0 else 1.0

    if isolation_strict:           return "isolation_strict"
    if effective_ratio &amp;lt; 0.05:     return "memory_evaporation"   # ISR+ckpt collision
    if paging_on:                  return "paging"
    if effective_free &amp;lt; 1024:      return "memory_pressured"
    if queue_p95_ms &amp;gt; 50:          return "queue_degraded"
    return "normal"&lt;/LI-CODE&gt;
&lt;P&gt;That &lt;STRONG&gt;"memory_evaporation"&lt;/STRONG&gt; regime is the one you never see at idle. It only appears when throughput is high enough that ISR frequency, checkpoint frequency, and decode memory demand all peak simultaneously. And when it appears, it doesn't show up as an OOM. It shows up as a straggler, which shows up as a collective stall, which shows up as a batch size drop, which shows up as a shape change, which shows up as output drift at p99.&lt;/P&gt;
&lt;P&gt;That's the full causal chain from fault tolerance to token flip.&lt;/P&gt;
&lt;H4&gt;The chip-architect handoff&lt;/H4&gt;
&lt;P&gt;These four pathologies (wavefront divergence, tensor core fragmentation, NCCL backpressure, and ISR collision) are what elevate principal-level operational insight into chip-architect-level systems thinking. They share a common structure:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;A microarchitectural or infrastructure event occurs that is invisible at the API layer.&lt;/LI&gt;
&lt;LI&gt;The event changes the timeline or the memory topology, not the "inputs."&lt;/LI&gt;
&lt;LI&gt;The changed timeline or topology feeds back into scheduling, shaping, or tactic selection.&lt;/LI&gt;
&lt;LI&gt;The feedback loop produces a different executed plan.&lt;/LI&gt;
&lt;LI&gt;The different plan produces a different result that is correct by contract but different by observation.&lt;/LI&gt;
&lt;/OL&gt;
&lt;P&gt;If you're instrumenting at this depth, you're not debugging anymore. You're operating a system where the observability itself is part of the architecture.&lt;/P&gt;
&lt;P&gt;And if you're carrying the thesis of this article to its logical conclusion: &lt;STRONG&gt;the executed plan is not just a function of the GPU state. It's a function of the warp state, the fragment state, the fabric state, and the replication state—all coupled through continuous batching at token cadence.&lt;/STRONG&gt;&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="15006" data-end="15058"&gt;Security is not a layer, it changes execution&lt;/H2&gt;
&lt;P data-start="15060" data-end="15143"&gt;Now let’s go deep, because this is where a lot of principal level reviews go wrong.&lt;/P&gt;
&lt;P data-start="15145" data-end="15268"&gt;Teams talk about security as confidentiality and correctness as something separate. In multi tenant inference, they couple.&lt;/P&gt;
&lt;H4 data-start="15270" data-end="15317"&gt;IOMMU based GPU isolation and DMA remapping&lt;/H4&gt;
&lt;P data-start="15319" data-end="15549"&gt;Microsoft &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/windows-hardware/drivers/display/iommu-based-gpu-isolation" target="_blank" rel="noopener"&gt;documents&lt;/A&gt; IOMMU based GPU isolation as a technique to manage how GPUs access system memory, improving security and stability:&lt;/P&gt;
&lt;P data-start="15551" data-end="15849"&gt;Microsoft also documents &lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/windows-hardware/drivers/display/iommu-dma-remapping" target="_blank" rel="noopener"&gt;IOMMU DMA&lt;/A&gt; remapping, describing how GPUs access memory through logical addresses that are no longer mapped one to one, enabling logically contiguous address ranges through translation:&lt;/P&gt;
&lt;P data-start="15851" data-end="15880"&gt;This matters for two reasons.&lt;/P&gt;
&lt;P data-start="15882" data-end="15952"&gt;First, it is a real hardware enforced boundary, not a policy checkbox.&lt;/P&gt;
&lt;P data-start="15954" data-end="16092"&gt;Second, boundaries introduce overhead and constraints. Constraints change what is allowed. Allowed execution choices shape the plan space.&lt;/P&gt;
&lt;H4 data-start="16094" data-end="16128"&gt;Confidential computing on H100&lt;/H4&gt;
&lt;P data-start="16130" data-end="16415"&gt;NVIDIA &lt;A class="lia-external-url" href="https://developer.nvidia.com/blog/confidential-computing-on-h100-gpus-for-secure-and-trustworthy-ai/" target="_blank" rel="noopener"&gt;states&lt;/A&gt; that H100 is the first GPU to introduce support for confidential computing and that it can be used in virtualized environments with VMs or Kubernetes based deployments.&lt;/P&gt;
&lt;P data-start="16417" data-end="16709"&gt;Azure has also published &lt;A class="lia-internal-link lia-internal-url lia-internal-url-content-type-blog" href="https://techcommunity.microsoft.com/blog/azureconfidentialcomputingblog/general-availability-azure-confidential-vms-with-nvidia-h100-tensor-core-gpus/4242644" target="_blank" rel="noopener" data-lia-auto-title="general availability" data-lia-auto-title-active="0"&gt;general availability&lt;/A&gt; of confidential VMs with H100, which is the practical deployment side of this posture:&lt;/P&gt;
&lt;P data-start="16711" data-end="16743"&gt;Now the key architectural point.&lt;/P&gt;
&lt;P data-start="16745" data-end="16809"&gt;When you turn on stronger isolation, you often restrict sharing.&lt;/P&gt;
&lt;P data-start="16811" data-end="17088"&gt;You restrict cross tenant microbatching. You add attestation requirements. You change how memory is mapped and protected. That can reduce throughput. Reduced throughput moves you closer to regime boundaries. When the system crosses a regime boundary, the executed plan changes.&lt;/P&gt;
&lt;P data-start="17090" data-end="17132"&gt;Security posture becomes an SLO dimension.&lt;/P&gt;
&lt;P data-start="17134" data-end="17201"&gt;If you do not test it, you do not know what system you are running.&lt;/P&gt;
&lt;H4 data-start="17203" data-end="17269"&gt;GPU cache side channels, why sharing is not a theoretical risk&lt;/H4&gt;
&lt;P data-start="17271" data-end="17525"&gt;There is &lt;A class="lia-external-url" href="https://www.usenix.org/system/files/usenixsecurity24-zhang-zhenkai.pdf" target="_blank" rel="noopener"&gt;published research&lt;/A&gt; that treats GPU caches as a leakage surface.&lt;/P&gt;
&lt;P data-start="17271" data-end="17525"&gt;The USENIX Security 2024 paper&amp;nbsp;&lt;STRONG data-start="17375" data-end="17402"&gt;Invalidate plus Compare&lt;/STRONG&gt; presents a timer free GPU cache attack primitive.&lt;/P&gt;
&lt;P data-start="17527" data-end="17612"&gt;I will not provide attack recipes. You do not need them to understand the conclusion.&lt;/P&gt;
&lt;P data-start="17614" data-end="17873"&gt;If your threat model includes untrusted co tenants, shared microarchitectural resources matter. If you respond by increasing isolation, your execution constraints change. That changes performance and can change the execution regimes your serving stack enters.&lt;/P&gt;
&lt;P data-start="17875" data-end="17917"&gt;Security and runtime behavior are coupled.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="17924" data-end="18001"&gt;State collapse, the phase transition that looks like model instability&lt;/H2&gt;
&lt;P data-start="76" data-end="175"&gt;If you don’t know what &lt;STRONG data-start="99" data-end="117"&gt;state collapse&lt;/STRONG&gt; is, imagine a highway that looks perfectly calm at 2 a.m.&lt;/P&gt;
&lt;P data-start="177" data-end="314"&gt;Every lane is open. Every car keeps its distance.&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-start="177" data-end="314"&gt;Your ETA is stable. You run the same route ten times and you get the same arrival time.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P data-start="316" data-end="336"&gt;Then 8:30 a.m. hits.&lt;/P&gt;
&lt;P data-start="338" data-end="738"&gt;Nothing “broke” in the highway. The asphalt is the same. The speed limit is the same. The cars are the same. But the system crosses a density threshold. One small brake tap becomes a shockwave. Lanes start interacting. Merges become bottlenecks. A single slow truck creates a queue that ripples backwards. Suddenly your ETA isn’t a property of your car anymore. It’s a property of the traffic regime.&lt;/P&gt;
&lt;P data-start="740" data-end="787"&gt;That is state collapse in production inference.&lt;/P&gt;
&lt;P data-start="789" data-end="959"&gt;At low load, the system behaves stable.&lt;BR data-start="828" data-end="831" /&gt;At high load, output drift appears.&lt;BR data-start="866" data-end="869" /&gt;And teams mislabel it as “model instability,” or “LLM randomness,” or “temperature drift.”&lt;/P&gt;
&lt;P data-start="961" data-end="998"&gt;Most of the time, it is none of that.&lt;/P&gt;
&lt;P data-start="1000" data-end="1044"&gt;It is a &lt;STRONG data-start="1008" data-end="1028"&gt;phase transition&lt;/STRONG&gt; in the runtime.&lt;/P&gt;
&lt;P data-start="1046" data-end="1103"&gt;You didn’t change weights. You crossed a regime boundary.&lt;/P&gt;
&lt;H4 data-start="1110" data-end="1138"&gt;What collapses, exactly&lt;/H4&gt;
&lt;P data-start="1140" data-end="1187"&gt;State collapse is not “everything gets slower.”&lt;/P&gt;
&lt;P data-start="1189" data-end="1293"&gt;It is when &lt;STRONG data-start="1200" data-end="1250"&gt;the control plane loses the degrees of freedom&lt;/STRONG&gt; it was using to keep execution consistent.&lt;/P&gt;
&lt;P data-start="1295" data-end="1333"&gt;Under low load, the runtime has slack:&lt;/P&gt;
&lt;UL data-start="1334" data-end="1656"&gt;
&lt;LI data-start="1334" data-end="1393"&gt;enough VRAM headroom to keep preferred tactics feasible&lt;/LI&gt;
&lt;LI data-start="1394" data-end="1451"&gt;enough cache residency to keep step times predictable&lt;/LI&gt;
&lt;LI data-start="1452" data-end="1523"&gt;enough scheduling flexibility to keep microbatch composition stable&lt;/LI&gt;
&lt;LI data-start="1524" data-end="1584"&gt;enough workspace contiguity to avoid algorithm fallbacks&lt;/LI&gt;
&lt;LI data-start="1585" data-end="1656"&gt;enough fabric stability (in multi-node TP) to keep step cadence tight&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="1658" data-end="1697"&gt;Under high load, that slack disappears.&lt;/P&gt;
&lt;P data-start="1699" data-end="1776"&gt;The runtime stops being a “fast executor” and becomes a “survival scheduler.”&lt;/P&gt;
&lt;P data-start="1778" data-end="1931"&gt;And once it crosses that boundary, it starts making different decisions that are all valid, all correct by contract, and all capable of shifting outputs.&lt;/P&gt;
&lt;P data-start="1933" data-end="2028"&gt;This is why it feels like instability: the model hasn’t changed, but the &lt;STRONG data-start="2006" data-end="2023"&gt;executed plan&lt;/STRONG&gt; has.&lt;/P&gt;
&lt;H4 data-start="2035" data-end="2097"&gt;Why this shows up as output drift, not just latency drift&lt;/H4&gt;
&lt;P data-start="2099" data-end="2139"&gt;Because decoding is a branching process.&lt;/P&gt;
&lt;P data-start="2141" data-end="2341"&gt;A small numerical difference that does nothing in a benchmark can flip a token if the margin is thin. One flip changes the context. The context changes the next logits. Now you’re on a different path.&lt;/P&gt;
&lt;P data-start="2343" data-end="2411"&gt;So the runtime doesn’t need to be “wrong” to produce different text.&lt;/P&gt;
&lt;P data-start="2413" data-end="2492"&gt;It just needs to execute a different legal plan under a different legal regime.&lt;/P&gt;
&lt;P data-start="2494" data-end="2564"&gt;That is the whole thesis of this article, condensed into one sentence:&lt;/P&gt;
&lt;P data-start="2566" data-end="2676"&gt;&lt;STRONG data-start="2566" data-end="2676"&gt;Weights are static. Behavior is a property of the executed plan. The executed plan is a function of state.&lt;/STRONG&gt;&lt;/P&gt;
&lt;H4 data-start="2683" data-end="2739"&gt;The common triggers that push systems into collapse&lt;/H4&gt;
&lt;P data-start="2741" data-end="2832"&gt;You can treat these as the usual “threshold crossings” that shrink the feasible plan space:&lt;/P&gt;
&lt;UL data-start="2834" data-end="3888"&gt;
&lt;LI data-start="2834" data-end="3036"&gt;&lt;STRONG data-start="2836" data-end="2893"&gt;Memory headroom shrinks → feasible tactic set shrinks&lt;/STRONG&gt;&lt;BR data-start="2893" data-end="2896" /&gt;Preferred kernels often require workspace. When headroom or contiguity drops, tactics become illegal and the engine selects other tactics.&lt;/LI&gt;
&lt;LI data-start="3038" data-end="3236"&gt;&lt;STRONG data-start="3040" data-end="3104"&gt;Cache residency collapses → stalls rise → step timing drifts&lt;/STRONG&gt;&lt;BR data-start="3104" data-end="3107" /&gt;L2 hit rate drops, HBM traffic rises, and decode steps stretch. In continuous batching, stretched steps reshape the next batch.&lt;/LI&gt;
&lt;LI data-start="3238" data-end="3423"&gt;&lt;STRONG data-start="3240" data-end="3289"&gt;Continuous batching shifts the mix and shapes&lt;/STRONG&gt;&lt;BR data-start="3289" data-end="3292" /&gt;Under load, microbatch membership changes at token cadence. Shape class changes are not cosmetic; they change kernel feasibility.&lt;/LI&gt;
&lt;LI data-start="3425" data-end="3669"&gt;&lt;STRONG data-start="3427" data-end="3501"&gt;Framework and engine algorithm selection changes depending on settings&lt;/STRONG&gt;&lt;BR data-start="3501" data-end="3504" /&gt;Autotuning, benchmarking, and backend heuristics mean the “same op” can legally choose different algorithms. Under pressure, the best choice can become infeasible.&lt;/LI&gt;
&lt;LI data-start="3671" data-end="3888"&gt;&lt;STRONG data-start="3673" data-end="3766"&gt;CUDA execution permits ordering freedom and floating point order sensitivity remains true&lt;/STRONG&gt;&lt;BR data-start="3766" data-end="3769" /&gt;Parallel staging and legal reordering can shift last bits. Under thin margins, last bits can become different tokens.&lt;/LI&gt;
&lt;/UL&gt;
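&lt;P&gt;The last trigger is easy to demonstrate even on a CPU: floating-point addition is not associative, so the same values summed in a different legal order can produce different results. A minimal illustration:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;a, b, c = 1.0, 1e16, -1e16

left = (a + b) + c   # a is absorbed into b's rounding, then b and c cancel
right = a + (b + c)  # b and c cancel exactly first, so a survives

print(left)   # 0.0
print(right)  # 1.0
&lt;/LI-CODE&gt;
&lt;P&gt;GPU reductions work with far more terms and far more ordering freedom, which is exactly why last bits can move between runs.&lt;/P&gt;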
&lt;P data-start="3890" data-end="3972"&gt;Nothing here requires a bug. This is what “execution under constraint” looks like.&lt;/P&gt;
&lt;H4 data-start="3979" data-end="4032"&gt;The incident question that stops the hand-waving&lt;/H4&gt;
&lt;P data-start="4034" data-end="4088"&gt;If you want a more honest incident question, use this:&lt;/P&gt;
&lt;P data-start="4090" data-end="4161"&gt;&lt;STRONG data-start="4090" data-end="4161"&gt;Which execution regime ran, and what constraints pushed us into it?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P data-start="4163" data-end="4268"&gt;Not “was the prompt the same.”&lt;BR data-start="4193" data-end="4196" /&gt;Not “were the weights the same.”&lt;BR data-start="4228" data-end="4231" /&gt;Not “did we set temperature to zero.”&lt;/P&gt;
&lt;P data-start="4270" data-end="4283"&gt;Regime first.&lt;/P&gt;
&lt;P data-start="4285" data-end="4464"&gt;Because state collapse is not a mystery. It’s a threshold. And once you learn to name the threshold, you can instrument it, test it, and stop being surprised by it at p95 and p99.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="18751" data-end="18821"&gt;A reproducibility protocol that works for principals, not demos&lt;/H2&gt;
&lt;P data-start="18823" data-end="18886"&gt;Logging prompts is not reproducibility. It is wishful thinking.&lt;/P&gt;
&lt;P data-start="18888" data-end="18975"&gt;If you want to be able to defend behavior, you need to reconstruct the execution state.&lt;/P&gt;
&lt;H4 data-start="18977" data-end="19015"&gt;Log the execution contract&lt;/H4&gt;
&lt;P data-start="19017" data-end="19034"&gt;Per request, log:&lt;/P&gt;
&lt;UL data-start="19036" data-end="19556"&gt;
&lt;LI data-start="19036" data-end="19076"&gt;effective input length after shaping&lt;/LI&gt;
&lt;LI data-start="19077" data-end="19111"&gt;truncation boundary and reason&lt;/LI&gt;
&lt;LI data-start="19112" data-end="19153"&gt;decode configuration actually applied&lt;/LI&gt;
&lt;LI data-start="19154" data-end="19194"&gt;admission time, queue time, GPU time&lt;/LI&gt;
&lt;LI data-start="19195" data-end="19270"&gt;per step batch fingerprint or at minimum batch identity and shape class&lt;/LI&gt;
&lt;LI data-start="19271" data-end="19354"&gt;memory headroom watermark and whether you were in a pressured allocation regime&lt;/LI&gt;
&lt;LI data-start="19355" data-end="19421"&gt;engine precision mode settings and any fallback relevant flags&lt;/LI&gt;
&lt;LI data-start="19422" data-end="19480"&gt;cuDNN benchmark and deterministic settings if relevant&lt;/LI&gt;
&lt;LI data-start="19481" data-end="19556"&gt;isolation posture, including whether cross tenant batching is permitted&lt;/LI&gt;
&lt;/UL&gt;
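&lt;P&gt;The contract above can be sketched as one structured record per request. The field names below are illustrative, not a standard schema from any serving stack:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from dataclasses import dataclass, asdict
from typing import Optional
import json

@dataclass
class ExecutionContract:
    # One record per request; every field answers "which regime ran?"
    request_id: str
    effective_input_len: int       # after shaping
    truncation_reason: Optional[str]
    decode_config: dict            # the sampling settings actually applied
    admission_ms: float
    queue_ms: float
    gpu_ms: float
    batch_shape_class: str         # e.g. "decode/bs32/seq2048"
    mem_headroom_bytes: int
    pressured_alloc: bool          # were we in a pressured allocation regime?
    precision_mode: str            # e.g. "fp16", plus any fallback flags
    cudnn_benchmark: bool
    deterministic_algos: bool
    cross_tenant_batching: bool    # isolation posture

record = ExecutionContract(
    request_id="req-123", effective_input_len=1842, truncation_reason=None,
    decode_config={"temperature": 0.0, "top_p": 1.0},
    admission_ms=1.2, queue_ms=18.5, gpu_ms=212.0,
    batch_shape_class="decode/bs32/seq2048", mem_headroom_bytes=3221225472,
    pressured_alloc=False, precision_mode="fp16", cudnn_benchmark=False,
    deterministic_algos=True, cross_tenant_batching=False,
)
print(json.dumps(asdict(record), indent=2))
&lt;/LI-CODE&gt;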
&lt;H4 data-start="19558" data-end="19589"&gt;Track margins early&lt;/H4&gt;
&lt;P data-start="19591" data-end="19665"&gt;Track top two logit margins for early steps. Use it as a stability budget.&lt;/P&gt;
&lt;P data-start="19667" data-end="19787"&gt;If the margin collapses under a certain prompt family, treat that as a risk surface. Not every prompt is equally stable.&lt;/P&gt;
&lt;H4 data-start="19789" data-end="19832"&gt;Test under regimes, not at idle&lt;/H4&gt;
&lt;P data-start="19834" data-end="19890"&gt;Do not run determinism tests at idle and call it solved.&lt;/P&gt;
&lt;P data-start="19892" data-end="19903"&gt;Test under:&lt;/P&gt;
&lt;UL data-start="19905" data-end="20038"&gt;
&lt;LI data-start="19905" data-end="19930"&gt;sustained concurrency&lt;/LI&gt;
&lt;LI data-start="19931" data-end="19957"&gt;mixed sequence lengths&lt;/LI&gt;
&lt;LI data-start="19958" data-end="19981"&gt;continuous batching&lt;/LI&gt;
&lt;LI data-start="19982" data-end="20011"&gt;realistic memory pressure&lt;/LI&gt;
&lt;LI data-start="20012" data-end="20038"&gt;real isolation posture&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="20040" data-end="20241"&gt;If you do not do this, you are validating a different system than the one you ship.&lt;/P&gt;
&lt;P data-start="20040" data-end="20241"&gt;vLLM’s &lt;A class="lia-external-url" href="https://arxiv.org/abs/2309.06180" target="_blank" rel="noopener"&gt;paper&lt;/A&gt; exists precisely because these conditions define the serving problem.&lt;/P&gt;
&lt;HR /&gt;
&lt;H2 data-start="20248" data-end="20258"&gt;Closing&lt;/H2&gt;
&lt;P data-start="20260" data-end="20359"&gt;If you want production LLM behavior to be explainable, stop treating the model as the whole system.&lt;/P&gt;
&lt;P data-start="20361" data-end="20455"&gt;Weights are static. Executed math is selected under constraint.&lt;BR data-start="20426" data-end="20429" /&gt;Behavior lives in the gap. You did not deploy weights. You deployed a physics constrained runtime that contains weights.&lt;/P&gt;
&lt;P data-start="20552" data-end="20900"&gt;And that runtime is allowed to change the executed plan, because floating point order matters, CUDA scheduling freedom is part of the contract, engines can choose precision pathways, and serving stacks intentionally reshape batching and memory.&lt;/P&gt;
&lt;P data-start="20552" data-end="20900"&gt;&amp;nbsp;&lt;/P&gt;
&lt;LI-SPOILER label="Quick Note"&gt;
&lt;P&gt;This piece may feel complex, and for most engineers it is, but it’s still a simplified pass. Not because the reality is simple, but because going deeper into these layers (GPU microarchitecture, allocator behavior, kernel tactics, distributed synchronization, and the control loops created by continuous batching) demands more attention than most readers want in one sitting. The full story gets more precise, more technical, and harder to digest quickly. I’ll publish a second, truly in-depth piece soon that formalizes the regimes, shows the exact plan-switch triggers, and lays out a reproducibility protocol you can actually use in real production postmortems.&lt;/P&gt;
&lt;/LI-SPOILER&gt;
&lt;H5 data-start="0" data-end="18"&gt;Acknowledgments&lt;/H5&gt;
&lt;P data-start="525" data-end="690"&gt;While this article dives into the hidden memory mechanics that shape LLM behavior under load,&lt;/P&gt;
&lt;P data-start="525" data-end="690"&gt;I’m grateful it was peer-reviewed and challenged before publishing.&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;A special thanks to &lt;A class="lia-external-url" href="https://cloudsecurityalliance.org/profiles/hammad-atta" target="_blank"&gt;Hammad Atta&lt;/A&gt; and &lt;A class="lia-external-url" href="https://mvp.microsoft.com/en-US/MVP/profile/b5e84baa-6cde-470f-8b69-4bb6614d6652" target="_blank"&gt;Abhilekh Verma&lt;/A&gt; for peer-reviewing this piece and challenging it from a security-and-systems angle.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-start="256" data-end="476"&gt;If this article resonated, it’s likely because it describes a reality many teams encounter only after an incident: production LLM behavior is a property of the executed plan, and the executed plan is a function of state.&lt;/P&gt;
&lt;P data-start="0" data-end="124" data-is-last-node="" data-is-only-node=""&gt;If you’re running production inference at scale and observing behavior shifts under load—especially in tail-latency regimes,&lt;/P&gt;
&lt;P data-start="478" data-end="753"&gt;I’m happy to connect on &lt;A class="lia-external-url" href="https://www.linkedin.com/in/drhazemali" target="_blank"&gt;LinkedIn&lt;/A&gt;. I’m open to substantive technical discussion.&lt;/P&gt;
&lt;P data-start="755" data-end="777"&gt;Thank you for reading.&lt;/P&gt;
&lt;P data-start="779" data-end="1060"&gt;I hope this helps you surface the hidden variables in serving and turn them into telemetry, controls, and repeatable postmortem evidence. And if you’re seeing similar regime transitions or plan churn in your own deployments, I’d be interested to hear how it presents in your stack.&lt;/P&gt;
&lt;P data-start="1062" data-end="1136" data-is-last-node="" data-is-only-node=""&gt;— Hazem Ali&lt;BR data-start="1073" data-end="1076" /&gt;Microsoft AI MVP, Distinguished AI &amp;amp; ML Engineer / Architect&lt;/P&gt;</description>
      <pubDate>Tue, 03 Mar 2026 08:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/the-hidden-architecture-of-nano-architectures/ba-p/4493391</guid>
      <dc:creator>hazem</dc:creator>
      <dc:date>2026-03-03T08:00:00Z</dc:date>
    </item>
    <item>
      <title>Creating a Fun Multi-Agent Content Strategy System with Microsoft Agent Framework</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/creating-a-fun-multi-agent-content-strategy-system-with/ba-p/4495105</link>
      <description>&lt;P&gt;That's what we're building in this tutorial. Using Microsoft Agent Framework, we'll create a multi-agent system where three specialised AI agents collaborate to help gaming content creators craft posts that actually perform. One agent generates platform-native content. Another evaluates it the way TikTok's, Twitter's, or YouTube's recommendation algorithm would. A third reacts as a real audience member, complete with the slang, biases, and short attention span of an actual person scrolling their feed.&lt;/P&gt;
&lt;P&gt;I have named the simulation app &lt;EM&gt;Viral or Fail&lt;/EM&gt;, and by the end of this tutorial you'll have a working tool that demonstrates some of the most important patterns in multi-agent system design: role specialisation, structured evaluation, iterative feedback loops, and tool integration with external data sources.&lt;/P&gt;
&lt;H2&gt;What We Will Cover&lt;/H2&gt;
&lt;P&gt;By the end of this tutorial, you'll understand how to design a multi-agent system where each agent has a distinct role and expertise, orchestrate agent communication using Agent Framework's Agent class and async sessions, integrate external tools (live Google Trends data) into an agent workflow, build iterative refinement pipelines where agents improve each other's output through structured feedback, and create evaluation rubrics that ground agent behaviour in real-world domain logic.&lt;/P&gt;
&lt;P&gt;These patterns can be applied to numerous other tasks as this is the same building block behind multi-agent customer support systems, automated code review pipelines, and any application where specialised agents need to collaborate on a shared task.&lt;/P&gt;
&lt;H2&gt;Prerequisites&lt;/H2&gt;
&lt;P&gt;You'll need Python 3.10 or higher, a GitHub account with a Personal Access Token (free tier — get one at &lt;A href="https://github.com/settings/tokens" target="_blank" rel="noopener"&gt;github.com/settings/tokens&lt;/A&gt;), and a basic understanding of what AI agents are. If you're new to agents, I'd recommend the &lt;A href="https://github.com/microsoft/ai-agents-for-beginners" target="_blank" rel="noopener"&gt;AI Agents for Beginners&lt;/A&gt; course; this project was inspired by and builds on concepts from that curriculum.&lt;/P&gt;
&lt;H2&gt;Why Multi-Agent? Why Not Just One Big Prompt?&lt;/H2&gt;
&lt;P&gt;You &lt;EM&gt;could&lt;/EM&gt; write a single prompt that says "generate a gaming post, score it, and react to it." But you'd get mediocre results across the board. A single LLM call tries to be creative, analytical, and authentic simultaneously, and it will probably end up being none of those things convincingly.&lt;/P&gt;
&lt;P&gt;Multi-agent systems solve this through role specialisation. When an agent's only job is to think like TikTok's recommendation algorithm, it does that job significantly better than a generalist prompt. And when agents with different objectives interact, natural tension emerges: a creator wants to be bold and viral, an algorithm wants measurable engagement signals, and an audience member just wants to feel something. That tension produces more realistic, more useful outputs than any monolithic approach.&lt;/P&gt;
&lt;P&gt;This is the same principle behind production multi-agent systems. Content moderation platforms use separate agents for classification, response generation, and quality assurance. Code review tools use one agent to identify issues and another to suggest fixes. The pattern scales because specialisation scales.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;System architecture: The Content Creator generates platform-native content from live trends, the Algorithm Simulator scores it against platform-specific rubrics, and a randomly selected Audience Persona reacts authentically. Feedback from both evaluators flows back to the Creator for iterative refinement.&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;System Design: Three Agents, Three Perspectives&lt;/H2&gt;
&lt;P&gt;The system's power comes from the fact that each agent represents a fundamentally different lens on the same piece of content. Let's break down each one.&lt;/P&gt;
&lt;H3&gt;The Content Creator Agent&lt;/H3&gt;
&lt;P&gt;This agent is the strategist: a trend-savvy gaming content creator who understands the nuances of each platform. It generates platform-native content that respects the conventions, formats, and cultural norms of TikTok, Twitter/X, YouTube, or Instagram.&lt;/P&gt;
&lt;P&gt;The key design decision here is in the system prompt. Rather than generic instructions, we encode platform-specific knowledge directly:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;CREATOR_SYSTEM_PROMPT = """You are the Content Creator — a trend-savvy gaming content
creator who lives and breathes internet culture. You know every platform inside out
and create content that feels native, not generic.

RULES:
- Be platform-native. A TikTok script should feel like a TikTok, not a blog post.
- Use gaming terminology correctly. Don't say "the game Valorant" — say "Valo" or "Val".
- For Twitter/X: Write punchy, provocative takes. Think ratio-worthy engagement bait.
- For YouTube: Focus on title + thumbnail concept + video structure outline.
- Be bold. Safe content doesn't go viral.

When given FEEDBACK from the Algorithm Simulator and Audience Persona, revise your
content to address their specific concerns while keeping the creative energy high.
Explain what you changed and why."""&lt;/LI-CODE&gt;
&lt;P&gt;That last instruction is important: it tells the Creator how to handle feedback from the other agents, which is what enables the iterative refinement loop we'll build later.&lt;/P&gt;
&lt;H3&gt;The Algorithm Simulator Agent&lt;/H3&gt;
&lt;P&gt;This is the most unusual agent in the system. Instead of acting as a generic critic, it role-plays as a social media platform's actual recommendation algorithm. It evaluates content the way an algorithm would through signals, weights, and distribution mechanics.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;ALGORITHM_SYSTEM_PROMPT = """You are the Algorithm Simulator — a cold, analytical system
that evaluates content exactly like a social media platform's recommendation algorithm
would. You think in signals, weights, and distribution mechanics. You have no feelings
about the content; only data.

RULES:
- Be specific. Don't say "the hook is weak" — say "the hook lacks a pattern interrupt
  in the first 1.5 seconds, which will drop initial retention below the 65% threshold
  needed for FYP promotion."
- Reference actual platform mechanics: completion rate, dwell time, engagement velocity,
  session time contribution...
- Think like an algorithm, not a human reviewer. The algorithm doesn't care if the take
  is "good" — it cares if the take drives engagement signals."""&lt;/LI-CODE&gt;
&lt;P&gt;This distinction between quality and distribution probability is the core insight. A beautifully written post can score poorly because it lacks the specific signals an algorithm needs to push it into wider circulation. Content creators deal with this disconnect every day — the Algorithm Simulator makes it visible and measurable.&lt;/P&gt;
&lt;P&gt;In a production context, this same pattern, an agent that simulates an external system's decision logic, has applications well beyond content creation. Imagine an agent that simulates a CI/CD pipeline's quality gates, or one that evaluates code the way a specific linter or reviewer would. The pattern is the same: encode the evaluation system's rules into the agent's prompt and let it reason within those constraints.&lt;/P&gt;
&lt;H3&gt;The Audience Persona Agent&lt;/H3&gt;
&lt;P&gt;The third agent brings the human element. Each session, it randomly becomes one of three gaming community personas — each with distinct tastes, language, and engagement patterns:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;PERSONAS = {
    "casual_mobile_gamer": {
        "name": "CasualChloe",
        "description": "Casual mobile gamer",
        "system_prompt": """You are CasualChloe — a casual mobile gamer...
- You use a lot of "lol", "ngl", "lowkey", "fr fr", and "no cap"
- You'll scroll past anything that feels too "sweaty" or try-hard
- You judge content in about 2 seconds — if it doesn't grab you, you're gone
...""",
    },
    "competitive_esports_fan": {
        "name": "TryHard_Tyler",
        "description": "Competitive esports fan",
        "system_prompt": """You are TryHard_Tyler — a hardcore competitive esports fan...
- You'll call out content that gets facts wrong or oversimplifies
- You'll ratio someone in the comments if their take is bad
...""",
    },
    "retro_indie_enthusiast": {
        "name": "PixelPete",
        "description": "Retro/indie game enthusiast",
        "system_prompt": """You are PixelPete — a retro and indie game enthusiast...
- You're tired of mainstream AAA hype and live-service games
- You appreciate craftsmanship and artistic vision over graphics
...""",
    },
}&lt;/LI-CODE&gt;
&lt;P&gt;The random persona selection is a deliberate design choice. It simulates the reality that you never know exactly who's going to see your content. A Valorant Champions post might get passionate engagement from TryHard_Tyler but complete indifference from PixelPete. That unpredictability mirrors real content distribution and it's the kind of insight that can emerge from a multi-agent system.&lt;/P&gt;
&lt;P&gt;This is essentially &lt;STRONG&gt;synthetic user testing&lt;/STRONG&gt;. Companies pay for focus groups and user research. Here, we're simulating it with agent personas, essentially using a lightweight version of the same concept that can run in seconds.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def create_audience_persona_agent(llm_config, persona=None):
    if persona is None:
        persona = get_random_persona()
    
    agent = Agent(
        name=persona["name"],
        instructions=persona["system_prompt"],
        client=client,
    )
    return agent, persona&lt;/LI-CODE&gt;
&lt;H2&gt;Grounding Evaluation with Platform Rubrics&lt;/H2&gt;
&lt;P&gt;One of the biggest challenges with AI agents is preventing vague, generic feedback. Left unguided, the Algorithm Simulator would default to hollow assessments like "this post is good" or "needs improvement." To prevent this, we give it structured scoring rubrics that mirror how each platform's algorithm actually prioritises content.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;PLATFORM_RULES = {
    "Twitter/X": {
        "description": "Text-first microblogging platform driven by engagement velocity",
        "criteria": {
            "hot_take_factor": {
                "weight": 0.30,
                "description": "Does the post have a strong, polarising opinion? "
                    "Twitter/X rewards engagement velocity — hot takes drive replies."
            },
            "quote_retweet_bait": {
                "weight": 0.25,
                "description": "Is the post structured to invite quote retweets? QRTs are "
                    "Twitter/X's most powerful distribution mechanic."
            },
            "timing_relevance": { "weight": 0.20, ... },
            "thread_potential": { "weight": 0.15, ... },
            "hashtag_strategy": { "weight": 0.10, ... },
        },
    },
    "TikTok": { ... },  # Prioritises hook_strength (30%) and trend_alignment (25%)
    "YouTube": { ... },  # Prioritises thumbnail_clickability (25%) and title_curiosity_gap (25%)
    "Instagram": { ... }, # Prioritises visual_appeal (30%) and caption_hook (20%)
}&lt;/LI-CODE&gt;
&lt;P&gt;Each platform has different criteria with different weights, and those weights are passed directly into the Algorithm Simulator's prompt at evaluation time. TikTok cares most about whether the first three seconds hook the viewer. YouTube cares about click-through rate. Twitter cares about whether your take is spicy enough to drive quote-retweets. The agent's evaluation is always anchored in platform-specific logic, not generic opinions.&lt;/P&gt;
&lt;P&gt;Providing structured evaluation criteria as grounding context is one of the most transferable patterns in this project. Whenever you need an agent to evaluate something consistently, give it a rubric. It works for content scoring, code review, proposal assessment, or any domain where you want structured, reproducible judgments.&lt;/P&gt;
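&lt;P&gt;As an illustration of how such a rubric collapses into a single number, here is a weighted sum over the Twitter/X criteria above; the per-criterion scores are hypothetical model outputs on a 0-10 scale:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;RUBRIC = {  # Twitter/X weights from the example above
    "hot_take_factor": 0.30,
    "quote_retweet_bait": 0.25,
    "timing_relevance": 0.20,
    "thread_potential": 0.15,
    "hashtag_strategy": 0.10,
}

def weighted_score(scores, rubric):
    assert abs(sum(rubric.values()) - 1.0) &amp;lt; 1e-9, "weights should sum to 1"
    return sum(rubric[k] * scores[k] for k in rubric)

# Hypothetical per-criterion scores the Algorithm Simulator might emit
scores = {"hot_take_factor": 8, "quote_retweet_bait": 6, "timing_relevance": 9,
          "thread_potential": 4, "hashtag_strategy": 7}
print(round(weighted_score(scores, RUBRIC), 2))  # 7.0
&lt;/LI-CODE&gt;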
&lt;H2&gt;Orchestrating with Microsoft Agent Framework&lt;/H2&gt;
&lt;P&gt;With the agents designed, let's wire them together. Agent Framework&amp;nbsp;makes this straightforward — each agent is an&amp;nbsp;Agent&amp;nbsp;with instructions and a chat client. We send messages directly using the async&amp;nbsp;agent.run()&amp;nbsp;method, with&amp;nbsp;sessions maintaining conversation context across rounds.&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;client = OpenAIChatClient(
    model_id="openai/gpt-4.1-mini",
    api_key=os.getenv("GITHUB_TOKEN"),
    base_url="https://models.github.ai/inference",
)

creator = create_content_creator_agent(client)
algorithm = create_algorithm_simulator_agent(client)
audience_agent, persona = create_audience_persona_agent(client)

# Sessions maintain conversation context across iteration rounds
creator_session = creator.create_session()
algorithm_session = algorithm.create_session()
audience_session = audience_agent.create_session()
&lt;/LI-CODE&gt;
&lt;P&gt;We're using GitHub Models as our LLM backend — free tier, no paid API keys, just a GitHub PAT. This is the same setup used in Microsoft's &lt;A href="https://github.com/microsoft/ai-agents-for-beginners" target="_blank" rel="noopener"&gt;AI Agents for Beginners&lt;/A&gt; course. The OpenAIChatClient connects directly to GitHub's inference endpoint. Each agent gets the same client instance, and create_session() gives each one a persistent memory so they can reference previous rounds during iteration.&lt;/P&gt;
&lt;P&gt;Communication between agents flows through agent.run():&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;async def get_agent_response(agent, message, session=None):
    result = await agent.run(message, session=session)
    return result.text or "No response generated."&lt;/LI-CODE&gt;
&lt;P&gt;Each&amp;nbsp;agent.run()&amp;nbsp;call gets a single response. The&amp;nbsp;&lt;STRONG&gt;session&lt;/STRONG&gt; parameter maintains conversation history across rounds so agents remember previous feedback. This gives us precise control over the pipeline: Creator generates -&amp;gt; Algorithm evaluates -&amp;gt; Persona reacts -&amp;gt; we decide whether to loop.&lt;/P&gt;
&lt;P&gt;This is a common pattern for &lt;STRONG&gt;application-controlled multi-agent orchestration&lt;/STRONG&gt;, as opposed to free-flowing agent conversation. Both approaches have their place, but when you need deterministic sequencing (as in any evaluation or pipeline scenario), controlling the loop yourself is more reliable.&lt;/P&gt;
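&lt;P&gt;As a sketch of what that application-controlled loop looks like with the agent calls stubbed out as plain callables (illustrative code, not the project's exact implementation):&lt;/P&gt;

```python
# Sketch of application-controlled orchestration: the application, not the
# agents, fixes the sequence and decides when to stop. Agent calls are
# injected as plain callables so the control flow runs without an LLM.
def run_pipeline(create, evaluate, react, should_iterate, max_rounds=3):
    content, feedback = None, None
    for round_no in range(1, max_rounds + 1):
        content = create(feedback)        # Creator: generate or revise
        score_report = evaluate(content)  # Algorithm Simulator: score vs rubric
        reaction = react(content)         # Audience Persona: authentic reaction
        feedback = (score_report, reaction)
        if not should_iterate(round_no, feedback):
            break
    return content

# Stub "agents" to make the sequencing visible and testable.
final = run_pipeline(
    create=lambda fb: "draft v1" if fb is None else "draft v2",
    evaluate=lambda c: f"score for {c}",
    react=lambda c: f"reaction to {c}",
    should_iterate=lambda round_no, fb: round_no == 1,  # lock in after round 2
)
```

&lt;P&gt;In the real system, each callable wraps an&amp;nbsp;agent.run()&amp;nbsp;call against a session, but the deterministic loop structure is the same.&lt;/P&gt;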
&lt;H2&gt;Integrating Live Data with Google Trends&lt;/H2&gt;
&lt;P&gt;What makes this system feel like a real tool is the live Google Trends integration — the agents work with whatever's actually trending in gaming right now, not canned example data.&lt;/P&gt;
&lt;P&gt;We use trendspy (a modern replacement for pytrends, which was archived in April 2025) to pull real-time trending searches:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from trendspy import Trends

def fetch_gaming_trends(count=10):
    try:
        tr = Trends()
        all_trends = tr.trending_now(geo="US")
        
        # Tier 1: Filter by Google's own Games topic tag
        gaming_trends = [
            t.keyword for t in all_trends
            if GAMES_TOPIC_ID in (t.topics or [])
        ]
        
        if len(gaming_trends) &amp;gt;= 5:
            return gaming_trends[:count]
        
        # Tier 2: Keyword matching as backup
        gaming_keywords = ["game", "valorant", "fortnite", "nintendo", ...]
        keyword_matches = [
            t.keyword for t in all_trends
            if any(kw in t.keyword.lower() for kw in gaming_keywords)
        ]
        gaming_trends.extend(keyword_matches)
        
        # Tier 3: Pad with curated sample data
        if len(gaming_trends) &amp;lt; 5:
            sample = _load_sample_trends()
            gaming_trends.extend([t for t in sample if t not in gaming_trends])
        
        return gaming_trends[:count]
    except Exception:
        return _load_sample_trends()[:count]&lt;/LI-CODE&gt;
&lt;P&gt;The three-tier fallback strategy here is worth highlighting because it's a pattern you'll use whenever you integrate external tools into agent workflows. On a day when a major game launches or a big esports tournament is running, Tier 1 will return a full list of gaming-specific trends. On a quiet day (as in this demo scenario, when Google Trends was dominated by the Winter Olympics and NBA All-Star weekend), Tier 2 catches gaming content that wasn't formally tagged, and Tier 3 ensures the system always has enough data to work with.&lt;/P&gt;
&lt;P&gt;This is the &lt;STRONG&gt;tool-use pattern&lt;/STRONG&gt; from Lesson 4 of the AI Agents for Beginners course in practice. The principle being established here is that external tools should enhance agent capabilities, but they should never be a single point of failure. Build in graceful degradation so the agent workflow completes regardless of what the external service does.&lt;/P&gt;
&lt;H2&gt;The Refinement Pipeline: Agents Improving Each Other&lt;/H2&gt;
&lt;P&gt;We want to take the system from just a "neat demo" to "actually useful." The pipeline runs for up to three rounds. Each round, the Content Creator either generates fresh content (round 1) or revises based on aggregated feedback (rounds 2-3). The Algorithm Simulator scores it against the platform rubric. The Audience Persona gives an authentic reaction. Then the user decides: iterate or lock in.&lt;/P&gt;
&lt;P&gt;The revision prompt is where the multi-agent magic happens:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;revision_prompt = (
    f"REVISION REQUEST (Round {iteration}/{MAX_ITERATIONS}):\n\n"
    f"The Algorithm Simulator and Audience Persona reviewed your "
    f"{platform} post about '{topic}'. Here's their feedback:\n\n"
    f"--- ALGORITHM FEEDBACK ---\n{algorithm_response}\n\n"
    f"--- AUDIENCE FEEDBACK ({persona['name']}) ---\n"
    f"{audience_response}\n\n"
    f"Revise your content to address their concerns. Keep what works, "
    f"fix what doesn't. Show what you changed and why."
)&lt;/LI-CODE&gt;
&lt;P&gt;The Creator receives two fundamentally different types of feedback: cold metrics from the Algorithm and subjective human reactions from the Persona. It now has to reconcile them. It might cut hashtags from six to two (addressing the Algorithm's scoring penalty on hashtag overuse) while simultaneously softening its "corporate esports" energy (addressing the Persona's disengagement with mainstream hype).&lt;/P&gt;
&lt;P&gt;This negotiation between competing feedback sources is one of the most powerful patterns in multi-agent design. In production systems, you see it everywhere: a coding agent balancing correctness feedback from a test runner with readability feedback from a style checker, or a customer support agent balancing policy compliance with empathy. The agents don't need to agree; they only need to provide different perspectives that the system (or a human) can synthesise.&lt;/P&gt;
&lt;H2&gt;Seeing It in Action&lt;/H2&gt;
&lt;P&gt;Here's what a real session looks like. We picked "Valorant Champions 2025" on Twitter/X, and PixelPete (the retro/indie enthusiast) was randomly selected as our audience persona.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;The Creator generated a bold take:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Valorant Champions 2025 is gonna be a BLOODBATH — here's why no org outside the top 3 will even sniff the finals. Sentinels, Fnatic, and LOUD have cracked the meta code so hard that every other team's strategy looks like a toddler's finger painting...&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;STRONG&gt;The Algorithm Simulator broke down the distribution probability:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;hot_take_factor (30%): 85/100 — The tweet delivers a strong polarizing opinion, likely to trigger debate and replies. The confident tone aligns with Twitter's engagement velocity mechanics...&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;hashtag_strategy (10%): 50/100 — Six hashtags is above Twitter's recommended 1-3 per tweet. Overuse reduces organic reach within Twitter's credibility filtering...&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Weighted Total: 75/100&lt;BR /&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;STRONG&gt;And PixelPete? He scrolled right past:&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;Eh, Valorant esports hype isn't really my cup of tea. This whole "bloodbath" and "top 3 orgs owning the meta" spiel feels like the usual corporate esports noise — all flash, little soul. I'll keep scrolling for something with more heart and craftsmanship.&lt;BR /&gt;&lt;/EM&gt;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;Three agents. Three completely different takes on the same content. The Algorithm says it'll perform well. The audience member says he doesn't care. And &lt;EM&gt;that mismatch&lt;/EM&gt; is exactly the kind of insight you'd never get from a single-agent system — and exactly the kind of insight that matters when you're planning a content strategy.&lt;/P&gt;
&lt;img /&gt;
&lt;H2&gt;Extending the System&lt;/H2&gt;
&lt;P&gt;The project is designed to be modular. Here are a few directions you can take it:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Add new platforms.&lt;/STRONG&gt; The rubric system in platform_rules.py is just a dictionary. Add a LinkedIn or Threads entry with appropriate criteria and weights, and the Algorithm Simulator will evaluate against those rules without any code changes.&lt;/P&gt;
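&lt;P&gt;A hypothetical sketch of what that extension looks like (the real key names and weights in platform_rules.py may differ):&lt;/P&gt;

```python
# Hypothetical shape of the platform rules dictionary; names and weights
# are illustrative, not copied from platform_rules.py.
PLATFORM_RULES = {
    "twitter": {
        "hot_take_factor": 0.30,
        "hashtag_strategy": 0.10,
        "reply_bait": 0.60,
    },
}

# Adding a platform is just a new entry; the evaluator code is untouched.
PLATFORM_RULES["linkedin"] = {
    "professional_insight": 0.40,
    "storytelling": 0.35,
    "call_to_discussion": 0.25,
}

# Keep each platform's weights summing to 1.0 so the weighted total
# stays on a 0-100 scale.
assert round(sum(PLATFORM_RULES["linkedin"].values()), 9) == 1.0
```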
&lt;P&gt;&lt;STRONG&gt;Create new audience personas.&lt;/STRONG&gt; Add a "Streamer_Sarah" who evaluates content from a Twitch creator's perspective, or a "ParentGamer_Pat" who only engages with family-friendly content. Each persona is a system prompt and a name, nothing else to change.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Swap the niche.&lt;/STRONG&gt; Replace the gaming trend fetcher with music, tech, or fitness trends. The agent architecture is niche-agnostic; only the trend tool and sample data need to change.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Register trends as an Agent Framework tool.&lt;/STRONG&gt; Right now, the application fetches trends and passes them as context. In a more advanced version, you could use the @tool decorator to register fetch_gaming_trends as a callable tool that agents invoke autonomously — moving from application-controlled to agent-controlled tool use.&lt;/P&gt;
&lt;H2&gt;What's Next: Evaluating the Evaluator&lt;/H2&gt;
&lt;P&gt;Here's the question this project intentionally leaves open: the Algorithm Simulator scored the post 75/100 — but how do we know the Simulator itself is any good?&lt;/P&gt;
&lt;P&gt;We built an agent that evaluates content, but we never evaluated the evaluator. How consistent are its scores? If you run the same post through it twice, does it give the same result? Do its predictions correlate with real-world engagement metrics? Would a human social media strategist agree with its rubric weights?&lt;/P&gt;
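&lt;P&gt;One cheap reproducibility check you can run today (a sketch, not part of the project's code) is to score the same post several times and measure the spread:&lt;/P&gt;

```python
import statistics

def consistency_report(scores):
    # Summarise repeated rubric scores for the *same* post. A small standard
    # deviation suggests a reproducible evaluator; a large one means its
    # judgments are dominated by sampling noise.
    stdev = 0.0 if len(scores) == 1 else statistics.stdev(scores)
    return {
        "mean": statistics.mean(scores),
        "stdev": stdev,
        "spread": max(scores) - min(scores),
    }

# e.g. five hypothetical runs of the Algorithm Simulator on the same post
report = consistency_report([75, 78, 74, 76, 77])
```

&lt;P&gt;Calibration against real engagement data is harder, but a noisy evaluator fails before you even get there.&lt;/P&gt;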
&lt;P&gt;This is the problem of &lt;STRONG&gt;agent evaluation&lt;/STRONG&gt; — one of the most important and underexplored challenges in building production agentic systems. We all know how to evaluate a model on a benchmark. But how do you evaluate an agent that's making subjective, multi-dimensional judgments within a larger system?&lt;/P&gt;
&lt;P&gt;In a follow-up article, we'll tackle exactly this: building evaluation frameworks for AI agents, testing for consistency and calibration, measuring inter-agent agreement, and determining whether your agents are actually doing what you think they're doing. The system we built here will serve as our running example — because when your system contains an agent whose entire job is evaluation, evaluating &lt;EM&gt;that&lt;/EM&gt; agent becomes the most important question you can ask.&lt;/P&gt;
&lt;H2&gt;Get the Code&lt;/H2&gt;
&lt;P&gt;The full project is on GitHub:&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&lt;A class="lia-external-url" href="https://github.com/HamidOna/viral-or-fail" target="_blank" rel="noopener"&gt;https://github.com/HamidOna/viral-or-fail&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Clone it, run pip install -r requirements.txt, add your GitHub token to .env, and run python viral_or_fail.py. Everything runs on GitHub Models' free tier — no paid API keys required.&lt;/P&gt;
&lt;H2&gt;References and Further Reading&lt;/H2&gt;
&lt;P&gt;&lt;STRONG&gt;Frameworks and Tools&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A class="lia-external-url" href="https://learn.microsoft.com/en-us/agent-framework/overview/" target="_blank" rel="noopener"&gt;Microsoft Agent Framework Documentation&lt;/A&gt; — Microsoft's production framework for multi-agent orchestration (successor to AutoGen), used throughout this project&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/ai-agents-for-beginners" target="_blank" rel="noopener"&gt;AI Agents for Beginners&lt;/A&gt; — Microsoft's 12-lesson course on building AI agents, which inspired this project. Particularly relevant: Lesson 4 (Tool Use), Lesson 8 (Multi-Agent Design Pattern), and Lesson 9 (Metacognition)&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/marketplace/models" target="_blank" rel="noopener"&gt;GitHub Models&lt;/A&gt; — Free-tier LLM access used in this project, no paid API keys required&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/supertypeai/trendspy" target="_blank" rel="noopener"&gt;trendspy&lt;/A&gt; — Lightweight Google Trends library replacing the archived pytrends&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;&lt;STRONG&gt;Concepts&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/ai-agents-for-beginners/tree/main/03-agentic-design-patterns" target="_blank" rel="noopener"&gt;Agentic Design Patterns&lt;/A&gt; — Overview of the core patterns (reflection, tool use, planning, multi-agent) that this project implements&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/ai-agents-for-beginners/tree/main/06-building-trustworthy-agents" target="_blank" rel="noopener"&gt;Building Trustworthy AI Agents&lt;/A&gt; — Relevant to thinking about how agent evaluation and guardrails connect to the system we built&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/ai-agents-for-beginners/tree/main/12-context-engineering" target="_blank" rel="noopener"&gt;Context Engineering for AI Agents&lt;/A&gt; — The rubric injection technique we used is a form of context engineering&lt;BR /&gt;&lt;BR /&gt;&lt;/LI&gt;
&lt;/UL&gt;
</description>
      <pubDate>Fri, 27 Feb 2026 08:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/creating-a-fun-multi-agent-content-strategy-system-with/ba-p/4495105</guid>
      <dc:creator>Abdulhamid_Onawole</dc:creator>
      <dc:date>2026-02-27T08:00:00Z</dc:date>
    </item>
    <item>
      <title>Stop Drawing Architecture Diagrams Manually: Meet the Open-Source AI Architecture Review Agents</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/stop-drawing-architecture-diagrams-manually-meet-the-open-source/ba-p/4496271</link>
      <description>&lt;P data-line="2"&gt;Hey everyone! I am&amp;nbsp;&lt;A href="https://www.linkedin.com/in/shivam2003/" target="_blank" rel="noopener"&gt;Shivam Goyal&lt;/A&gt;, a Microsoft MVP, and I am super excited to share a project that is going to save you a massive amount of time.&lt;/P&gt;
&lt;P data-line="4"&gt;Designing software architecture is arguably one of the most creative and enjoyable parts of engineering. Documenting it, reviewing it for security flaws, and keeping the diagrams updated as the system evolves? Not so much.&lt;/P&gt;
&lt;P data-line="6"&gt;We have all been there. You sketch out a brilliant microservices architecture on a whiteboard, take a blurry photo of it, and spend the next three hours wrestling with boxes, arrows, and alignment tools. By the time you finally get to the actual security and risk review, the architecture has already changed.&lt;/P&gt;
&lt;P data-line="8"&gt;What if you could just explain your system in plain English, or point a tool to a messy README, and instantly get a prioritized risk assessment, actionable recommendations, and an editable architecture diagram?&lt;/P&gt;
&lt;P data-line="10"&gt;Enter the&amp;nbsp;&lt;STRONG&gt;Architecture Review Agent&lt;/STRONG&gt;, an open-source AI sample my team and I built with the&amp;nbsp;&lt;A href="https://github.com/microsoft/agents" target="_blank" rel="noopener" data-href="https://github.com/microsoft/agents"&gt;Microsoft Agent Framework&lt;/A&gt;,&amp;nbsp;&lt;A href="https://learn.microsoft.com/azure/ai-services/openai/" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/ai-services/openai/"&gt;Azure OpenAI&lt;/A&gt;, and&amp;nbsp;&lt;A href="https://github.com/excalidraw/excalidraw-mcp" target="_blank" rel="noopener" data-href="https://github.com/excalidraw/excalidraw-mcp"&gt;Excalidraw MCP&lt;/A&gt;.&lt;/P&gt;
&lt;H2 data-line="14"&gt;What is the Architecture Review Agent?&lt;/H2&gt;
&lt;P data-line="16"&gt;At its core, the Architecture Review Agent is an automated pipeline that takes architectural descriptions in almost any format and transforms them into structured insights and visual maps.&lt;/P&gt;
&lt;P data-line="18"&gt;Whether you feed it a strictly formatted YAML file, a Markdown design doc, or just a brain dump like:&amp;nbsp;&lt;EM&gt;"We have a React frontend hitting a Kong gateway, which routes to three microservices, each with its own Postgres DB,"&lt;/EM&gt;&amp;nbsp;the agent processes it in seconds.&lt;/P&gt;
&lt;P data-line="20"&gt;Here is what you get back:&lt;/P&gt;
&lt;UL data-line="22"&gt;
&lt;LI data-line="22"&gt;&lt;STRONG&gt;An Interactive Excalidraw Diagram:&lt;/STRONG&gt;&amp;nbsp;No more static, uneditable images. The agent renders a fully interactive diagram via&amp;nbsp;&lt;A href="https://github.com/excalidraw/excalidraw-mcp" target="_blank" rel="noopener" data-href="https://github.com/excalidraw/excalidraw-mcp"&gt;Excalidraw MCP&lt;/A&gt;&amp;nbsp;that you can immediately tweak right in your browser.&lt;/LI&gt;
&lt;LI data-line="23"&gt;&lt;STRONG&gt;Prioritized Risk Analysis:&lt;/STRONG&gt;&amp;nbsp;An automated assessment of Single Points of Failure (SPOFs), scalability bottlenecks, security gaps, and architectural anti-patterns.&lt;/LI&gt;
&lt;LI data-line="24"&gt;&lt;STRONG&gt;Component Dependency Mapping:&lt;/STRONG&gt;&amp;nbsp;A detailed breakdown of fan-in and fan-out metrics, plus detection of orphaned components.&amp;nbsp;&lt;BR /&gt;&lt;img /&gt;&lt;/LI&gt;
&lt;/UL&gt;
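&lt;P&gt;Fan-in and fan-out are simple computations over the dependency edges. As a sketch of the idea (illustrative code, not the sample's actual implementation):&lt;/P&gt;

```python
from collections import Counter

def dependency_metrics(edges):
    # Fan-out = outgoing edges per component, fan-in = incoming edges.
    # Components that appear in no edge never show up here, so orphan
    # detection additionally needs the full declared component list.
    fan_out = Counter(src for src, _ in edges)
    fan_in = Counter(dst for _, dst in edges)
    nodes = {n for edge in edges for n in edge}
    return {n: {"fan_in": fan_in[n], "fan_out": fan_out[n]} for n in nodes}

# Hypothetical architecture: a frontend behind a gateway and two services.
edges = [
    ("frontend", "gateway"),
    ("gateway", "orders-svc"),
    ("gateway", "users-svc"),
    ("orders-svc", "orders-db"),
]
metrics = dependency_metrics(edges)
```

&lt;P&gt;High fan-in flags components whose failure ripples widest, which is exactly what you want surfaced in a risk review.&lt;/P&gt;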
&lt;P data-line="26"&gt;&lt;STRONG&gt;See it in action:&lt;/STRONG&gt; Check out this end-to-end review of an architecture, from file upload to risk detection and interactive diagram generation.&lt;/P&gt;
&lt;img /&gt;
&lt;H2 data-line="31"&gt;Why You Should Add It to Your Workflow&lt;/H2&gt;
&lt;P data-line="33"&gt;I wanted this agent to adapt to how developers actually work, rather than forcing you to learn a new proprietary diagramming language.&lt;/P&gt;
&lt;H3 data-line="35"&gt;1. Smart Input Intelligence&lt;/H3&gt;
&lt;P data-line="37"&gt;The agent works with what you already have. If you pass it structured YAML or Markdown, it uses a lightning-fast rule-based parser. If you pass it unstructured text, code files, or meeting notes, it automatically falls back to Azure OpenAI (we highly recommend GPT-4.1) to intelligently infer the components, their types, and how they connect.&lt;/P&gt;
&lt;img /&gt;
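&lt;P&gt;The tiering can be sketched as a simple guard (illustrative code with stub parsers, not the sample's implementation): try the cheap deterministic parser first, and fall back to the LLM only when it yields nothing:&lt;/P&gt;

```python
def parse_architecture(text, rule_based_parse, llm_parse):
    # Two-tier input handling: the deterministic parser runs first; the
    # expensive LLM path is used only when it cannot produce a graph.
    # Both parsers are injected so this sketch runs without Azure credentials.
    graph = rule_based_parse(text)
    if graph is not None:
        return {"parser": "rules", "graph": graph}
    return {"parser": "llm", "graph": llm_parse(text)}

# Stub rule-based parser: only understands lines like "frontend calls gateway".
def rules(text):
    edges = []
    for line in text.splitlines():
        parts = line.split()
        if len(parts) == 3 and parts[1] == "calls":
            edges.append((parts[0], parts[2]))
    return edges or None

structured = parse_architecture("frontend calls gateway", rules, lambda t: [])
freeform = parse_architecture("a brain dump in prose", rules, lambda t: [("a", "b")])
```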
&lt;H3 data-line="39"&gt;2. Actionable, Context-Aware Reviews&lt;/H3&gt;
&lt;P data-line="41"&gt;This isn't just about drawing boxes. The AI analyzes your data flow to flag real-world issues. It will warn you about shared database anti-patterns, highlight missing API gateways, or point out infrastructure components that lack redundancy. The risks are bucketed by severity (Critical to Low) so you know exactly what to tackle first.&lt;/P&gt;
&lt;img /&gt;
&lt;P data-line="43"&gt;&lt;STRONG&gt;A Quick Note on AI Recommendations:&lt;/STRONG&gt;&amp;nbsp;While the agent is incredibly powerful, it is designed to be a co-pilot for your architecture team, not a replacement for human expertise. Always treat the AI-generated risk assessments and recommendations as a starting point. They are an amazing tool to accelerate your review process, but you should always verify the findings and conduct formal security audits with your human experts!&lt;/P&gt;
&lt;H3 data-line="45"&gt;3. Exports That Actually Matter&lt;/H3&gt;
&lt;P data-line="47"&gt;Need a slide for your next architecture review board? Grab the high-res PNG export. Need your team to collaborate and refine the design? Download the .excalidraw JSON file or edit it directly in the React web UI.&lt;/P&gt;
&lt;img /&gt;
&lt;H2 data-line="51"&gt;Deploy It Your Way: Featuring Microsoft Foundry Hosted Agents&lt;/H2&gt;
&lt;P data-line="53"&gt;The repository ships with scripts to get you up and running immediately. You have two production-ready deployment paths: a traditional full-stack web app, or my absolute favourite approach, a &lt;STRONG&gt;Hosted Agent via Microsoft Foundry&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H3 data-line="55"&gt;Option A: Full-Stack Web App (Azure App Service)&lt;/H3&gt;
&lt;P data-line="57"&gt;This is perfect if your team wants a custom, drag-and-drop React web interface. This path deploys a FastAPI backend and a React frontend to Azure App Service, giving you full ownership over the API surface and the UI.&lt;/P&gt;
&lt;H3 data-line="59"&gt;Option B: The Future of Zero-Ops AI (Microsoft Foundry Hosted Agents)&lt;/H3&gt;
&lt;P data-line="61"&gt;If you want to build a scalable, enterprise-grade API without wrestling with infrastructure,&amp;nbsp;&lt;A href="https://learn.microsoft.com/azure/ai-foundry/agents/concepts/hosted-agents?view=foundry" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/ai-foundry/agents/concepts/hosted-agents?view=foundry" data-lia-auto-title-active="0" data-lia-auto-title="Hosted agents in Foundry Agent Service (preview) - Microsoft Foundry"&gt;Hosted agents in Foundry Agent Service (preview) - Microsoft Foundry&lt;/A&gt;&amp;nbsp;is the way to go.&lt;/P&gt;
&lt;P data-line="63"&gt;Recently introduced in preview, Hosted Agents allow you to bring your own agent code (built with the Microsoft Agent Framework) and run it as a fully managed containerized service. Microsoft Foundry handles the heavy lifting so you can focus purely on your agent's logic.&lt;/P&gt;
&lt;P data-line="65"&gt;Here is why deploying the Architecture Review Agent on Microsoft Foundry is a complete game changer:&lt;/P&gt;
&lt;UL data-line="67"&gt;
&lt;LI data-line="67"&gt;&lt;STRONG&gt;Zero-Ops Infrastructure:&lt;/STRONG&gt; The platform automatically builds your container via ACR Tasks and manages the compute. It scales seamlessly from 0 to 5 replicas, including scaling to 0 to save costs when idle.&lt;/LI&gt;
&lt;LI data-line="68"&gt;&lt;STRONG&gt;Built-in Conversation Persistence:&lt;/STRONG&gt;&amp;nbsp;You do not need to build your own database to remember chat history. The Foundry Agent Service natively manages conversation state across requests.&lt;/LI&gt;
&lt;LI data-line="69"&gt;&lt;STRONG&gt;Enterprise Security Out-of-the-Box:&lt;/STRONG&gt;&amp;nbsp;Say goodbye to hardcoding API keys. Hosted Agents use system-assigned Managed Identities (Entra ID) with Role-Based Access Control (RBAC).&lt;/LI&gt;
&lt;LI data-line="70"&gt;&lt;STRONG&gt;Publish Anywhere:&lt;/STRONG&gt;&amp;nbsp;Once deployed to Foundry, you can publish your agent directly to Microsoft Teams or Microsoft 365 Copilot with no extra code required. Your team can literally ask Copilot in Teams to review an architecture spec!&lt;/LI&gt;
&lt;LI data-line="71"&gt;&lt;STRONG&gt;Seamless VS Code Deployment:&lt;/STRONG&gt;&amp;nbsp;We have integrated this sample with the&amp;nbsp;&lt;A href="https://marketplace.visualstudio.com/items?itemName=TeamsDevApp.vscode-ai-foundry" target="_blank" rel="noopener" data-href="https://marketplace.visualstudio.com/items?itemName=TeamsDevApp.vscode-ai-foundry"&gt;Microsoft Foundry for VS Code extension&lt;/A&gt;. Deploying to the cloud is as simple as opening the Command Palette, running Microsoft Foundry: Deploy Hosted Agent, and following the prompts.&lt;BR /&gt;&lt;img /&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2 data-line="75"&gt;Get Started in 5 Minutes&lt;/H2&gt;
&lt;P data-line="77"&gt;The project is completely open-source and waiting for you to test it out. If you have Python 3.11+ and access to Azure OpenAI or a Microsoft Foundry project, you can generate your first architecture review right now.&lt;/P&gt;
&lt;P data-line="79"&gt;Just clone the repository, run the setup script, and try feeding it your messiest system architecture description.&lt;/P&gt;
&lt;P data-line="81"&gt;&lt;STRONG&gt;GitHub Repo:&amp;nbsp;&lt;/STRONG&gt;&lt;STRONG&gt;&lt;A href="https://github.com/Azure-Samples/agent-architecture-review-sample" target="_blank" rel="noopener" data-href="https://github.com/Azure-Samples/agent-architecture-review-sample"&gt;Azure-Samples/agent-architecture-review-sample&lt;/A&gt;&lt;/STRONG&gt;&lt;/P&gt;
&lt;H2 data-line="85"&gt;Learn More &amp;amp; Let's Connect!&lt;/H2&gt;
&lt;P data-line="87"&gt;Building this agent has been an incredible journey, and I truly believe tools like this are the future of how we design and review software. But this is just the beginning, and I would love for you to be a part of it.&lt;/P&gt;
&lt;P data-line="89"&gt;If you want to dive deeper into the technology stack powering the Architecture Review Agent, here are some fantastic resources to get you started:&lt;/P&gt;
&lt;UL data-line="91"&gt;
&lt;LI data-line="91"&gt;&lt;A href="https://github.com/Azure-Samples/agent-architecture-review-sample" target="_blank" rel="noopener" data-href="https://github.com/Azure-Samples/agent-architecture-review-sample"&gt;Azure-Samples/agent-architecture-review-sample&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="92"&gt;&lt;A href="https://github.com/excalidraw/excalidraw-mcp" target="_blank" rel="noopener" data-href="https://github.com/excalidraw/excalidraw-mcp" data-lia-auto-title-active="0" data-lia-auto-title="GitHub - excalidraw/excalidraw-mcp: Fast and streamable Excalidraw MCP App"&gt;GitHub - excalidraw/excalidraw-mcp: Fast and streamable Excalidraw MCP App&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="93"&gt;&lt;A href="https://learn.microsoft.com/azure/ai-foundry/agents/concepts/hosted-agents?view=foundry" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/ai-foundry/agents/concepts/hosted-agents?view=foundry" data-lia-auto-title-active="0" data-lia-auto-title="Hosted agents in Foundry Agent Service (preview) - Microsoft Foundry"&gt;Hosted agents in Foundry Agent Service (preview) - Microsoft Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="94"&gt;&lt;A href="https://learn.microsoft.com/azure/ai-foundry/agents/quickstarts/quickstart-hosted-agent?view=foundry" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/ai-foundry/agents/quickstarts/quickstart-hosted-agent?view=foundry" data-lia-auto-title-active="0" data-lia-auto-title="Quickstart: Deploy your first hosted agent - Microsoft Foundry"&gt;Quickstart: Deploy your first hosted agent - Microsoft Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="95"&gt;&lt;A href="https://learn.microsoft.com/azure/ai-foundry/agents/how-to/deploy-hosted-agent?view=foundry" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/ai-foundry/agents/how-to/deploy-hosted-agent?view=foundry" data-lia-auto-title-active="0" data-lia-auto-title="Deploy a hosted agent - Microsoft Foundry"&gt;Deploy a hosted agent - Microsoft Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="96"&gt;&lt;A href="https://learn.microsoft.com/azure/ai-foundry/agents/how-to/publish-agent?view=foundry" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/ai-foundry/agents/how-to/publish-agent?view=foundry" data-lia-auto-title-active="0" data-lia-auto-title="Publish agents in Microsoft Foundry - Microsoft Foundry"&gt;Publish agents in Microsoft Foundry - Microsoft Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="97"&gt;&lt;A href="https://learn.microsoft.com/azure/ai-foundry/agents/how-to/vs-code-agents-workflow-pro-code?view=foundry" target="_blank" rel="noopener" data-href="https://learn.microsoft.com/azure/ai-foundry/agents/how-to/vs-code-agents-workflow-pro-code?view=foundry" data-lia-auto-title-active="0" data-lia-auto-title="Create hosted agent workflows in Visual Studio Code - Microsoft Foundry"&gt;Create hosted agent workflows in Visual Studio Code - Microsoft Foundry&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="99"&gt;I want to hear from you. Whether you are deploying this for your enterprise, hacking on it over the weekend, or have a cool idea for a new feature, I would love to connect.&lt;/P&gt;
&lt;UL data-line="101"&gt;
&lt;LI data-line="101"&gt;Drop a star or open an issue on GitHub:&amp;nbsp;&lt;A href="https://github.com/Azure-Samples/agent-architecture-review-sample" target="_blank" rel="noopener" data-href="https://github.com/Azure-Samples/agent-architecture-review-sample"&gt;Architecture Review Agent Sample&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="102"&gt;Connect with me on LinkedIn:&amp;nbsp;&lt;A href="https://linkedin.com/in/shivam2003" target="_blank" rel="noopener" data-href="https://linkedin.com/in/shivam2003"&gt;linkedin.com/in/shivam2003&lt;/A&gt;&lt;/LI&gt;
&lt;LI data-line="103"&gt;Check out my other projects:&amp;nbsp;&lt;A href="https://github.com/ShivamGoyal03" target="_blank" rel="noopener" data-href="https://github.com/ShivamGoyal03"&gt;github.com/ShivamGoyal03&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-line="105"&gt;Let me know what you think in the comments below, and happy architecting!&lt;/P&gt;</description>
      <pubDate>Thu, 26 Feb 2026 08:00:00 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/stop-drawing-architecture-diagrams-manually-meet-the-open-source/ba-p/4496271</guid>
      <dc:creator>ShivamGoyal</dc:creator>
      <dc:date>2026-02-26T08:00:00Z</dc:date>
    </item>
    <item>
      <title>Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/integrating-microsoft-foundry-with-openclaw-step-by-step-model/ba-p/4495586</link>
      <description>&lt;H3 data-path-to-node="0"&gt;Step 1: Deploying Models on Microsoft Foundry&lt;/H3&gt;
&lt;P data-path-to-node="1"&gt;Let us kick things off in the Azure portal. To get our OpenClaw agent thinking like a genius, we need to deploy our models in Microsoft Foundry. For this guide, we are going to focus on deploying &lt;STRONG data-path-to-node="1" data-index-in-node="196"&gt;gpt-5.2-codex&lt;/STRONG&gt; on Microsoft Foundry with OpenClaw.&amp;nbsp;&lt;/P&gt;
&lt;P data-path-to-node="2"&gt;Navigate to your AI Hub, head over to the model catalog, choose the model you wish to use with OpenClaw and hit deploy. Once your deployment is successful, head to the endpoints section.&lt;/P&gt;
&lt;P data-path-to-node="2"&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;BLOCKQUOTE&gt;
&lt;P data-path-to-node="3,0"&gt;&lt;STRONG data-path-to-node="3,0" data-index-in-node="0"&gt;Important:&lt;/STRONG&gt; Grab your Endpoint URL and your API Keys right now and save them in a secure note. We will need these exact values to connect OpenClaw in a few minutes.&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H3 data-path-to-node="5"&gt;Step 2: Installing and Initializing OpenClaw&lt;/H3&gt;
&lt;P data-path-to-node="6"&gt;Next up, we need to get OpenClaw running on your machine. Open up your terminal and run the official installation script:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;curl -fsSL https://openclaw.ai/install.sh | bash&lt;/LI-CODE&gt;&lt;img /&gt;
&lt;P data-path-to-node="10"&gt;The wizard will walk you through a few prompts. Here is exactly how to answer them to link up with our Azure setup:&lt;/P&gt;
&lt;UL data-path-to-node="11"&gt;
&lt;LI&gt;&lt;STRONG data-path-to-node="11,0,0" data-index-in-node="0"&gt;First Page (Model Selection):&lt;/STRONG&gt; Choose "Skip for now".&lt;BR /&gt;&lt;BR /&gt;&lt;img /&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG data-path-to-node="11,1,0" data-index-in-node="0"&gt;Second Page (Provider):&lt;/STRONG&gt; Select azure-openai-responses.&lt;BR /&gt;&lt;BR /&gt;&lt;img /&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;UL data-path-to-node="11"&gt;
&lt;LI&gt;&lt;STRONG data-path-to-node="11,2,0" data-index-in-node="0"&gt;Model Selection:&lt;/STRONG&gt; Select gpt-5.2-codex , For now only the models listed (&lt;SPAN class="lia-text-color-8"&gt;hosted on Microsoft Foundry&lt;/SPAN&gt;) in the picture below are available to be used with OpenClaw.&lt;BR /&gt;&lt;img /&gt;&lt;/LI&gt;
&lt;LI&gt;Follow the rest of the standard prompts to finish the initial setup.&lt;BR /&gt;&lt;BR /&gt;&lt;img /&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-path-to-node="13"&gt;Step 3: Editing the OpenClaw Configuration File&lt;/H3&gt;
&lt;P data-path-to-node="14"&gt;Now for the fun part. We need to manually configure OpenClaw to talk to Microsoft Foundry. Open your configuration file located at &lt;SPAN class="lia-text-color-8"&gt;~/.openclaw/openclaw.json&lt;/SPAN&gt; in your favorite text editor.&lt;/P&gt;
&lt;P data-path-to-node="15"&gt;Replace the contents of the &lt;SPAN class="lia-text-color-8"&gt;models&lt;/SPAN&gt; and &lt;SPAN class="lia-text-color-8"&gt;agents&lt;/SPAN&gt; sections with the following code block:&lt;/P&gt;
&lt;LI-CODE lang="json"&gt;{
    "models": {
    "providers": {
      "azure-openai-responses": {
        "baseUrl": "https://&amp;lt;YOUR_RESOURCE_NAME&amp;gt;.openai.azure.com/openai/v1",
        "apiKey": "&amp;lt;YOUR_AZURE_OPENAI_API_KEY&amp;gt;",
        "api": "openai-responses",
        "authHeader": false,
        "headers": {
          "api-key": "&amp;lt;YOUR_AZURE_OPENAI_API_KEY&amp;gt;"
        },
        "models": [
          {
            "id": "gpt-5.2-codex",
            "name": "GPT-5.2-Codex (Azure)",
            "reasoning": true,
            "input": ["text", "image"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 400000,
            "maxTokens": 16384,
            "compat": { "supportsStore": false }
          },
          {
            "id": "gpt-5.2",
            "name": "GPT-5.2 (Azure)",
            "reasoning": false,
            "input": ["text", "image"],
            "cost": { "input": 0, "output": 0, "cacheRead": 0, "cacheWrite": 0 },
            "contextWindow": 272000,
            "maxTokens": 16384,
            "compat": { "supportsStore": false }
          }
        ]
      }
    }
  },
  "agents": {
    "defaults": {
      "model": {
        "primary": "azure-openai-responses/gpt-5.2-codex"
      },
      "models": {
        "azure-openai-responses/gpt-5.2-codex": {}
      },
      "workspace": "/home/&amp;lt;USERNAME&amp;gt;/.openclaw/workspace",
      "compaction": {
        "mode": "safeguard"
      },
      "maxConcurrent": 4,
      "subagents": {
        "maxConcurrent": 8
      }
    }
  }
}&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-path-to-node="17"&gt;You will notice a few placeholders in that JSON. Here is exactly what you need to swap out:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-21 lia-border-style-solid" border="1" style="width: 100%; height: 189px; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr style="height: 35px;"&gt;&lt;td class="lia-border-color-21" style="height: 35px;"&gt;&lt;STRONG&gt;Placeholder Variable&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 35px;"&gt;&lt;STRONG&gt;What It Is&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 35px;"&gt;&lt;STRONG&gt;Where to Find It&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr style="height: 59px;"&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="18,1,0,0"&gt;&amp;lt;YOUR_RESOURCE_NAME&amp;gt;&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="18,1,1,0"&gt;The unique name of your Azure OpenAI resource.&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="18,1,2,0"&gt;Found in your Azure Portal under the Azure OpenAI resource overview.&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 59px;"&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="18,2,0,0"&gt;&amp;lt;YOUR_AZURE_OPENAI_API_KEY&amp;gt;&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="18,2,1,0"&gt;The secret key required to authenticate your requests.&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="18,2,2,0"&gt;Found in Microsoft Foundry under your project endpoints or Azure Portal keys section.&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 36px;"&gt;&lt;td class="lia-border-color-21" style="height: 36px;"&gt;&lt;SPAN data-path-to-node="18,3,0,0"&gt;&amp;lt;USERNAME&amp;gt;&lt;/SPAN&gt;&lt;/td&gt;&lt;td 
class="lia-border-color-21" style="height: 36px;"&gt;&lt;SPAN data-path-to-node="18,3,1,0"&gt;Your local computer's user profile name.&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 36px;"&gt;&lt;SPAN data-path-to-node="18,3,2,0"&gt;Open your terminal and type whoami to find this.&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;col style="width: 33.33%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3 data-path-to-node="19"&gt;Step 4: Restart the Gateway&lt;/H3&gt;
&lt;P data-path-to-node="20"&gt;After saving the configuration file, you must restart the OpenClaw gateway for the new Foundry settings to take effect. Run this simple command:&lt;/P&gt;
&lt;LI-CODE lang=""&gt;openclaw gateway restart&lt;/LI-CODE&gt;
&lt;H3 data-path-to-node="23"&gt;Configuration Notes &amp;amp; Deep Dive&lt;/H3&gt;
&lt;P data-path-to-node="24"&gt;If you are curious about why we configured the JSON that way, here is a quick breakdown of the technical details.&lt;/P&gt;
&lt;P data-path-to-node="25"&gt;&lt;STRONG data-path-to-node="25" data-index-in-node="0"&gt;Authentication Differences&lt;/STRONG&gt; Azure OpenAI uses the api-key HTTP header for authentication. This is entirely different from the standard OpenAI &lt;SPAN class="lia-text-color-8"&gt;Authorization: Bearer&lt;/SPAN&gt; header. Our configuration file addresses this in two ways:&lt;/P&gt;
&lt;UL data-path-to-node="26"&gt;
&lt;LI&gt;Setting &lt;SPAN class="lia-text-color-8"&gt;"authHeader": false&lt;/SPAN&gt; completely disables the default Bearer header.&lt;/LI&gt;
&lt;LI&gt;Adding &lt;SPAN class="lia-text-color-8"&gt;"headers": { "api-key": "&amp;lt;key&amp;gt;" }&lt;/SPAN&gt; forces OpenClaw to send the API key via Azure's native header format.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P data-path-to-node="27,0"&gt;&lt;STRONG data-path-to-node="27,0" data-index-in-node="0"&gt;Important Note:&lt;/STRONG&gt; Your API key must appear in both the apiKey field AND the headers.api-key field within the JSON for this to work correctly.&lt;/P&gt;
&lt;P data-path-to-node="28"&gt;&lt;STRONG data-path-to-node="28" data-index-in-node="0"&gt;The Base URL&lt;/STRONG&gt; Azure OpenAI's v1-compatible endpoint follows this specific format: &lt;SPAN class="lia-text-color-8"&gt;https://&amp;lt;your_resource_name&amp;gt;.openai.azure.com/openai/v1&lt;/SPAN&gt;&lt;/P&gt;
&lt;P data-path-to-node="29"&gt;The beautiful thing about this v1 endpoint is that it is largely compatible with the standard OpenAI API and does not require you to manually pass an api-version query parameter.&lt;/P&gt;
&lt;P data-path-to-node="30"&gt;&lt;STRONG data-path-to-node="30" data-index-in-node="0"&gt;Model Compatibility Settings&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL data-path-to-node="31"&gt;
&lt;LI&gt;&lt;SPAN class="lia-text-color-8"&gt;"compat": { "supportsStore": false } &lt;/SPAN&gt;disables the store parameter since Azure OpenAI does not currently support it.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN class="lia-text-color-8"&gt;"reasoning": true&lt;/SPAN&gt; enables the thinking mode for GPT-5.2-Codex. This supports low, medium, high, and xhigh levels.&lt;/LI&gt;
&lt;LI&gt;&lt;SPAN class="lia-text-color-8"&gt;"reasoning": false&lt;/SPAN&gt; is set for GPT-5.2 because it is a standard, non-reasoning model.&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3 data-path-to-node="32"&gt;Model Specifications &amp;amp; Cost Tracking&lt;/H3&gt;
&lt;P data-path-to-node="33"&gt;If you want OpenClaw to accurately track your token usage costs, you can update the cost fields from 0 to the current Azure pricing. Here are the specs and costs for the models we just deployed:&lt;/P&gt;
&lt;P data-path-to-node="34"&gt;&lt;STRONG data-path-to-node="34" data-index-in-node="0"&gt;Model Specifications&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-21" border="1" style="width: 55.3704%; height: 153px; border-width: 1px;"&gt;&lt;thead&gt;&lt;tr style="height: 59px;"&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;STRONG&gt;Model&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;STRONG&gt;Context Window&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;STRONG&gt;Max Output Tokens&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;STRONG&gt;Image Input&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;STRONG&gt;Reasoning&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr style="height: 59px;"&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="35,1,0,0"&gt;gpt-5.2-codex&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="35,1,1,0"&gt;400,000 tokens&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="35,1,2,0"&gt;16,384 tokens&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="35,1,3,0"&gt;Yes&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 59px;"&gt;&lt;SPAN data-path-to-node="35,1,4,0"&gt;Yes&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr style="height: 35px;"&gt;&lt;td class="lia-border-color-21" style="height: 35px;"&gt;&lt;SPAN data-path-to-node="35,2,0,0"&gt;gpt-5.2&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 35px;"&gt;&lt;SPAN data-path-to-node="35,2,1,0"&gt;272,000 tokens&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 35px;"&gt;&lt;SPAN data-path-to-node="35,2,2,0"&gt;16,384 tokens&lt;/SPAN&gt;&lt;/td&gt;&lt;td 
class="lia-border-color-21" style="height: 35px;"&gt;&lt;SPAN data-path-to-node="35,2,3,0"&gt;Yes&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21" style="height: 35px;"&gt;&lt;SPAN data-path-to-node="35,2,4,0"&gt;No&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;col style="width: 20.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P data-path-to-node="36"&gt;&amp;nbsp;&lt;/P&gt;
&lt;P data-path-to-node="36"&gt;&lt;STRONG data-path-to-node="36" data-index-in-node="0"&gt;Current Cost (Adjust in JSON)&lt;/STRONG&gt;&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table class="lia-border-color-21 lia-border-style-solid" border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;td class="lia-border-color-21"&gt;&lt;STRONG&gt;Model&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;STRONG&gt;Input (per 1M tokens)&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;STRONG&gt;Output (per 1M tokens)&lt;/STRONG&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;STRONG&gt;Cached Input (per 1M tokens)&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,1,0,0"&gt;gpt-5.2-codex&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,1,1,0"&gt;$1.75&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,1,2,0"&gt;$14.00&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,1,3,0"&gt;$0.175&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,2,0,0"&gt;gpt-5.2&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,2,1,0"&gt;$2.00&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,2,2,0"&gt;$8.00&lt;/SPAN&gt;&lt;/td&gt;&lt;td class="lia-border-color-21"&gt;&lt;SPAN data-path-to-node="37,2,3,0"&gt;$0.50&lt;/SPAN&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3 data-path-to-node="0"&gt;Conclusion:&lt;/H3&gt;
&lt;P data-path-to-node="1"&gt;And there you have it! You have successfully bridged the gap between the enterprise-grade infrastructure of Microsoft Foundry and the local autonomy of OpenClaw. By following these steps, you are not just running a chatbot; you are running a sophisticated agent capable of reasoning, coding, and executing tasks with the full power of GPT-5.2-codex behind it.&lt;/P&gt;
&lt;P data-path-to-node="2"&gt;The combination of Azure's reliability and OpenClaw's flexibility opens up a world of possibilities. Whether you are building an automated devops assistant, a research agent, or just exploring the bleeding edge of AI, you now have a robust foundation to build upon.&lt;/P&gt;
&lt;img /&gt;
&lt;P data-path-to-node="3"&gt;Now it is time to let your agent loose on some real tasks. Go forth, experiment with different system prompts, and see what you can build. If you run into any interesting edge cases or come up with a unique configuration, let me know in the comments below. Happy coding!&lt;/P&gt;</description>
      <pubDate>Mon, 23 Feb 2026 09:39:15 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/integrating-microsoft-foundry-with-openclaw-step-by-step-model/ba-p/4495586</guid>
      <dc:creator>suzarilshah</dc:creator>
      <dc:date>2026-02-23T09:39:15Z</dc:date>
    </item>
    <item>
      <title>Learning Cost Efficient AI Agents Development on Azure</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/learning-cost-efficient-ai-agents-development-on-azure/ba-p/4493940</link>
      <description>&lt;P&gt;AI agents are increasingly central to building automated solutions, experimenting with data‑driven decision making, and bringing real‑world AI systems to life.&lt;/P&gt;
&lt;P&gt;But as AI adoption grows, so do important questions: &lt;EM&gt;How much does AI cost? How do design choices affect efficiency? And how can developers build AI solutions that are both innovative and sustainable?&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;The &lt;STRONG&gt;&lt;A href="https://developer.microsoft.com/en-us/reactor/events/26742/?wt.mc_id=blog2_26742_webpage_reactor" target="_blank" rel="noopener"&gt;Maximize the Cost Efficiency of AI Agents on Azure&lt;/A&gt;&lt;/STRONG&gt; webinar is designed to help answer these questions.&lt;/P&gt;
&lt;P&gt;This session provides practical guidance on designing and scaling AI agents on Azure while keeping cost efficiency in mind. Rather than focusing only on tools and services, the webinar helps learners and educators understand how architectural decisions, model choices, and usage patterns directly impact cost, performance, and outcomes. These are the same considerations students will encounter in real-world projects, research, and future careers.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Who should attend?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Whether you are introducing Agentic AI concepts in the classroom, working on student projects, or exploring AI agents as part of your learning journey, this webinar offers actionable insights you can apply immediately—both in teaching and hands-on experimentation.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Why attend the webinar?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;In this session, you’ll see how Agentic AI cost considerations translate from theory into real-world scenarios, using practical examples that are easy to relate to student projects and classroom use cases. The webinar also highlights common cost pitfalls and shows how thoughtful design decisions can help avoid them early.&lt;/P&gt;
&lt;P&gt;Most importantly, the session helps learners and educators connect technical choices to measurable outcomes—building a stronger understanding of how to evaluate, optimize, and govern AI systems responsibly. You’ll have the opportunity to ask questions live and leave with clearer guidance on how to build AI agents that scale efficiently.&lt;/P&gt;
&lt;P&gt;If you care about preparing students for real-world AI development—or building your own skills with a strong foundation in responsible and cost-aware design—this webinar is a valuable addition to your learning journey.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;&lt;A class="lia-external-url" href="https://developer.microsoft.com/reactor/events/26742/?wt.mc_id=blog1_26742_webpage_reactor" target="_blank" rel="noopener"&gt;Missed it? Watch it on demand!&lt;/A&gt;&lt;/STRONG&gt;&amp;nbsp;&lt;/P&gt;
&lt;div data-video-id="https://www.youtube.com/watch?v=9AOEAFsNSbU/1772812324387" data-video-remote-vid="https://www.youtube.com/watch?v=9AOEAFsNSbU/1772812324387" class="lia-video-container lia-media-is-center lia-media-size-large"&gt;&lt;iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2F9AOEAFsNSbU%3Ffeature%3Doembed&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3D9AOEAFsNSbU&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2F9AOEAFsNSbU%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" allowfullscreen="" style="max-width: 100%"&gt;&lt;/iframe&gt;&lt;/div&gt;
&lt;P&gt;&lt;STRONG&gt;Who will speak at the webinar?&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Your speakers will be:&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;Carlotta Castelluccio:&lt;/EM&gt; Carlotta is a Senior AI Advocate with the mission of helping every developer to succeed with AI, by building innovative solutions responsibly. To achieve this goal, she develops technical content, and she hosts skilling sessions, enabling her&amp;nbsp;&lt;SPAN style="color: rgb(30, 30, 30);"&gt;audience to take the most out of AI technologies and to have an impact on Microsoft AI products’ roadmap.&lt;/SPAN&gt;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;img /&gt;
&lt;P&gt;&lt;EM&gt;Nitya Narasimhan:&amp;nbsp;&lt;/EM&gt;Nitya is a PhD and Polyglot with 25+ years of software research &amp;amp; development experience spanning mobile, web, cloud and AI. She is an innovator (12+ patents), a visual storyteller (&lt;A href="https://sketchthedocs.dev/" target="_blank" rel="noopener"&gt;@sketchthedocs&lt;/A&gt;), and an experienced community builder in the Greater New York area. As a senior AI Advocate on the Core AI Developer Relations team, she acts as "developer 0" for the Microsoft Foundry platform, providing product feedback and empowering AI developers to build trustworthy AI solutions with code samples, open-source curricula and content-initiatives like&amp;nbsp;&lt;A href="https://aka.ms/model-mondays" target="_blank" rel="noopener"&gt;Model Mondays&lt;/A&gt;. Prior to joining Microsoft, she spent a decade in Motorola Labs working on ubiquitous &amp;amp; mobile computing research, founded Google Developer Groups in New York, and consulted for startups building real-time experiences for enterprise. Her current interests span Model understanding &amp;amp; customization, E2E Observability &amp;amp; Safety, and agentic AI workflows for maintainable software.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Useful resources&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Microsoft Learn Training Path: &lt;A href="https://aka.ms/maximize-cost-efficiency-ai-agents-training" target="_blank"&gt;https://aka.ms/maximize-cost-efficiency-ai-agents-training&lt;/A&gt;&lt;/P&gt;
&lt;P&gt;Session Deck: &lt;A href="https://aka.ms/maximize-cost-efficiency-ai-agents-deck" target="_blank"&gt;https://aka.ms/maximize-cost-efficiency-ai-agents-deck&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Mon, 09 Mar 2026 19:11:37 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/learning-cost-efficient-ai-agents-development-on-azure/ba-p/4493940</guid>
      <dc:creator>carlottacaste</dc:creator>
      <dc:date>2026-03-09T19:11:37Z</dc:date>
    </item>
    <item>
      <title>Building an AI Study Agent - How GitHub Copilot CLI &amp; SDK helped Reimagine LMS</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/building-an-ai-study-agent-how-github-copilot-cli-sdk-helped/ba-p/4495179</link>
      <description>&lt;BLOCKQUOTE&gt;
&lt;P&gt;&lt;EM&gt;What if your Learning Management System didn't just host lecture documents, assignments, and grades - but actually understood them?&lt;/EM&gt;&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;P&gt;Every time I sit through a lecture, a constant thought lingers:&amp;nbsp;&lt;EM&gt;"I love what I'm studying, don't get me wrong - but it's a lot!"&lt;/EM&gt; These are 3-hour lectures with a little too much content and piles of reference materials - how do I build efficient study routines beyond these lectures? With the world moving toward an agentic future, AI should help - but having read so many posts on AI personalization for education systems, in my experience that personalized support isn't here - YET!&lt;/P&gt;
&lt;P&gt;Here is the catch though! I don't have weeks to design an architecture, plan every component, and slowly build my way there. I have a problem and a rough idea of a solution, and I need a working prototype&amp;nbsp;&lt;EM&gt;fast!&lt;/EM&gt;&lt;/P&gt;
&lt;H2&gt;Enter GitHub Copilot CLI&lt;/H2&gt;
&lt;img&gt;GIF showing typing copilot --banner in the terminal&lt;/img&gt;
&lt;P class="lia-clear-both"&gt;Staring at an empty folder with a half-baked idea and not exactly sure where to start, I spun up the terminal and launched a Copilot Agent in&amp;nbsp;&lt;STRONG&gt;/plan&lt;/STRONG&gt;&amp;nbsp;mode for a brainstorming session.&amp;nbsp;&lt;EM&gt;You know - to help me think&lt;/EM&gt;.&lt;BR /&gt;This was less of a building session and more of an interactive brainstorm with the agent asking clarifying questions about features, stack preferences, and constraints, then returned a&amp;nbsp;&lt;STRONG&gt;comprehensive implementation plan in seconds&lt;/STRONG&gt;.&lt;/P&gt;
&lt;P&gt;That step alone was incredibly valuable: it didn't just give the agent a picture of what I wanted to build, it also surfaced scenarios I hadn't even thought of. Even without the full implementation, that step was enough to move my idea forward, and it has reshaped my normal ideation routine, which is now:&amp;nbsp;&lt;STRONG&gt;idea&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;brainstorm with Copilot /plan mode&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;save the plan&lt;/STRONG&gt;&amp;nbsp;→&amp;nbsp;&lt;STRONG&gt;iterate&lt;/STRONG&gt;.&lt;/P&gt;
&lt;H2&gt;The Solution&lt;/H2&gt;
&lt;P&gt;With the plan ready, you might tell the agent to “Start Implementation,” and it'll likely do a great job, but I prefer a five-phase workflow that balances speed, structure and my desired level of involvement in the project&amp;nbsp;&lt;EM&gt;(phases may vary by use case)&lt;/EM&gt;:&lt;/P&gt;
&lt;img&gt;5 Phases - Brainstorm, Research, Project Setup, Core Logic, Test &amp;amp; Frontend&lt;/img&gt;
&lt;P&gt;Here is how I think about the stages:&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;1. Brainstorm&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The goal here is to ensure the idea is crystal clear, not just to the builder (me), but to the agent(s), and more importantly - that you are on the same page.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;2. Research&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This phase surfaces the latest docs, announcements, and decision factors, so that even with most of the implementation delegated to the agent(s), builders (I) clearly understand why database/framework/provider X was chosen over Y.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;3. Project Setup&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;This is where the agent focuses on installs, project scaffolding, configuration, and defining how components in the architecture design communicate.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;4. Core functionality&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;The main goal here is to implement the core logic behind the system’s essential behavior, followed by a thorough validation that APIs and DB schemas map to the target features.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;5. Frontend&lt;/STRONG&gt;&lt;/P&gt;
&lt;P&gt;Language models rarely struggle with UI design work. The trick to getting the&amp;nbsp;&lt;EM&gt;perfect&lt;/EM&gt;&amp;nbsp;frontend with a single prompt, in my experience, is to save this task for last: the agent will not only factor in the features already implemented, but will also build a design that anticipates and accommodates the future enhancements you thought about and noted in the brainstorm notes (plan docs).&lt;/P&gt;
&lt;P&gt;With these phases documented, plus the plan docs stored in my project directory, I'm confident that when I switch to different agents working on my project, they'll all have a clear, common and referenceable north star and can work on whatever component or feature I delegate to them with the right context.&lt;/P&gt;
&lt;P&gt;After the first iteration of this workflow, in a matter of minutes I had a full-stack application with a beautiful UI: I could browse through the courses and upload notes (PDF and text files), which were stored in the database.&lt;/P&gt;
&lt;P&gt;Hooray! Happy that it worked, but - is there anything extraordinary about that? Not really, since most current LMS can already do this.&lt;/P&gt;
&lt;P&gt;But, here is where we step up the game.&lt;/P&gt;
&lt;P&gt;Instead of uploading school docs and having them sit there, a file upload kicks off an&amp;nbsp;&lt;STRONG&gt;ingestion pipeline&lt;/STRONG&gt; to build a knowledge base that language models can reason over.&lt;/P&gt;
&lt;img&gt;School Agent - Ingestion Pipeline for RAG&lt;/img&gt;
&lt;P&gt;So the backend:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;extracts the file content.&amp;nbsp;&lt;EM&gt;Output: one long text block&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;applies a chunking strategy to break the long text block into smaller groups.&amp;nbsp;&lt;EM&gt;Output: chunks of roughly 512 tokens each, with a 100-token overlap for context continuity&lt;/EM&gt;&lt;/LI&gt;
&lt;LI&gt;generates vector embeddings for each chunk.&amp;nbsp;&lt;EM&gt;Output: embeddings stored in a single DB (alongside my existing data) using the pgvector extension&lt;/EM&gt;&lt;/LI&gt;
&lt;/UL&gt;
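&lt;P&gt;The chunking step above can be sketched in a few lines of Python; the 512-token size and 100-token overlap come from the pipeline description, while the list of integers standing in for tokenizer output is purely illustrative:&lt;/P&gt;

```python
# Fixed-size windows with overlap, so context carries across chunk boundaries.
def chunk_tokens(tokens, size=512, overlap=100):
    step = size - overlap
    # Stop once a window has covered the final token; max() keeps
    # short inputs (under one window) producing a single chunk.
    stop = max(len(tokens) - overlap, 1)
    return [tokens[i:i + size] for i in range(0, stop, step)]

tokens = list(range(1200))  # stand-in for real tokenizer output
chunks = chunk_tokens(tokens)
print(len(chunks))          # 3 windows: 0-511, 412-923, 824-1199
print(chunks[0][-100:] == chunks[1][:100])  # True: 100-token overlap
```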
&lt;P&gt;Now that my data is in a format language models can understand, the next part involves adding an intelligent layer, which we achieve in two steps:&lt;/P&gt;
&lt;OL&gt;
&lt;LI&gt;Expose API endpoints in a format that language models can use (tools),&lt;/LI&gt;
&lt;LI&gt;Create an autonomous AI workflow that will handle the tool orchestration and determine when to use what.&lt;/LI&gt;
&lt;/OL&gt;
&lt;H2&gt;Enter Model Context Protocol (MCP)&lt;/H2&gt;
&lt;P&gt;APIs are designed with humans as the primary users: the discovery path is optimized for reading API docs to find endpoints and writing custom integrations to consume them. This doesn't work for language models, which instead need a more dynamic, self-discoverable, runtime approach encapsulated in a standardized interface for AI.&lt;/P&gt;
&lt;P&gt;This is what the Model Context Protocol provides: a standard that connects AI-native apps/agents to data and tools dynamically.&lt;/P&gt;
&lt;P&gt;In the steps above, Copilot CLI uses this very same protocol to pull data from external sources. With access to documentation on how the MCP architecture works and how to build and connect to MCP servers, it was able, in a single prompt, to extend my existing backend (API layer) into an MCP server with tools that let the agent perform actions dynamically: reading course material, generating question-answer pairs from the course content for quizzes, extracting coding exercises, and updating my completion progress, among other functions.&lt;/P&gt;
&lt;P&gt;The quickest way to test this MCP setup is with GitHub Copilot as the MCP client, since I'm yet to build any agentic workflows. I'm already on VS Code, so I simply (1) add the MCP server configuration to my&amp;nbsp;.vscode/mcp.json&amp;nbsp;and now the tools are (2) accessible within the Copilot chat window. I start testing with my (3) custom School Agent, comparing different prompts and (4) tool use accuracy to get a feel for how the agent experience would look in the app.&amp;nbsp;&lt;EM&gt;And of course you can use this through the Copilot CLI if you prefer working from the terminal.&lt;/EM&gt;&lt;/P&gt;
&lt;img&gt;Screenshot of VS Code with mcp.json configuration, configure tools view and GitHub Copilot using the getCourseDocuments tool to ground responses in school documents&lt;/img&gt;
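&lt;P&gt;For reference, a minimal &lt;SPAN class="lia-text-color-8"&gt;.vscode/mcp.json&lt;/SPAN&gt; entry looks roughly like the following; the server name, command, and arguments here are placeholders for however your own MCP server is started:&lt;/P&gt;

```json
{
  "servers": {
    "school-agent": {
      "type": "stdio",
      "command": "node",
      "args": ["./mcp-server/index.js"]
    }
  }
}
```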
&lt;P&gt;That's step 1. Step 2 is building the agent itself.&lt;/P&gt;
&lt;H2&gt;Enter GitHub Copilot SDK&lt;/H2&gt;
&lt;P&gt;When it comes to building agents, there are many Agent Development Kits and frameworks that make it easier to create and manage agent execution loops, but a recent (and exciting) announcement from GitHub is the new&amp;nbsp;&lt;STRONG&gt;GitHub Copilot SDK&lt;/STRONG&gt;. I'll link to the repo in the resources section, but basically it means you can let the existing infrastructure that powers today's GitHub Copilot handle all the building blocks of an agent - tool discovery and orchestration, session management, real-time streaming, multi-turn loops, etc. - and just programmatically call that agent workflow in your application.&lt;/P&gt;
&lt;P&gt;There wasn't much to go on in terms of documentation, as this is still very new, but from what I read in the announcement blog &amp;amp; SDK repo, this was mind-blowing. I had to try it!&lt;BR /&gt;I'll admit that when I started off with this project, I had a different idea of how to approach the agentic part of it, but luckily the SDK was announced before I got to it and I decided it was worth a try. I am proud to say that I wasn't disappointed.&lt;/P&gt;
&lt;P&gt;I jumped into a brainstorming session with my buddy Copilot CLI, who had context from the SDK repo, and settled on an approach built around specialization. Instead of having one agent handle all tasks, let's have smaller specialized agents for each. For example:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;I like to frequently quiz myself on topics - let's have an agent that does that one task PERFECTLY!&lt;/LI&gt;
&lt;LI&gt;I'm struggling to track the completion of exercises provided in a course text PDF document - let's have an agent specialized in extracting coding exercises from the eBook, tracking my progress, and helping out when I'm stuck or when I need a quick review of my code attempts.&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The beauty of using the Copilot SDK is that if you have such an idea, you won't have to worry about building it from scratch, because chances are someone has already thought it out and there is likely a feature or a Copilot-native pattern ready for you to use. This case is no exception: the idea of giving Copilot specialized capabilities for specific tasks is already implemented through&amp;nbsp;&lt;STRONG&gt;Agent Skills&lt;/STRONG&gt;.&lt;BR /&gt;So all I needed was to define a `SKILL.md` document for the specialized tasks I needed - flashcard generator `.github/skills/flashcard-generator/SKILL.md` &amp;amp; Java practice tracker `.github/skills/java-practice-tracker/SKILL.md` - and pass a `skill` property to the agent in code, which again the Copilot CLI implemented in minutes.&lt;/P&gt;
&lt;P&gt;In just a couple of hours,&amp;nbsp;&lt;EM&gt;(with so many breaks in between)&lt;/EM&gt;, I ended up with&amp;nbsp;&lt;STRONG&gt;School Agent&lt;/STRONG&gt;, an assistant that takes learning management systems to the next level - and this is just the beginning.&lt;/P&gt;
&lt;P&gt;&lt;EM&gt;School Agent working architecture: Frontend, Agent, Backend API, MCP Server and DB (PostgreSQL + pgvector) components&lt;/EM&gt;&lt;/P&gt;
&lt;P&gt;With tools like the Copilot CLI, the SDK, and other AI dev tools, experimentation has never been easier. I have so many ideas for making this system even more useful (I'm sure you do too), and I'm confident that before long I'll be back with the next set of features built out and working to perfection.&lt;/P&gt;
&lt;P&gt;I'm evolving School Agent into an architecture that is program-agnostic, and I hope to share it with you soon so you can try it out and make it your own.&lt;BR /&gt;Yes, things are moving fast in the AI space, but at least this way I have AI working&amp;nbsp;&lt;STRONG&gt;with me&lt;/STRONG&gt;&amp;nbsp;and&amp;nbsp;&lt;STRONG&gt;for me&lt;/STRONG&gt;&amp;nbsp;to improve an actual real-world experience (and so can you). I encourage you not to just take other people's word for it. Maybe you saw a cool demo on YouTube/X recently, or you enjoyed this post&amp;nbsp;&lt;EM&gt;(I hope you did)&lt;/EM&gt; - don't settle for that. Find an immediate problem you are having today and tinker around. Build something. Anything. Everything!&lt;/P&gt;
&lt;BLOCKQUOTE&gt;
&lt;P&gt;To students: Would you use School Agent? What does it need to do to be even more useful to you?&lt;BR /&gt;For educators: How can your students benefit from such a tool? What would you also like to see implemented to support you?&lt;/P&gt;
&lt;/BLOCKQUOTE&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;Check out this video walk-through of School Agent&lt;/LI&gt;
&lt;/UL&gt;
&lt;div data-video-id="https://www.youtube.com/watch?v=M2AqsalF14I&amp;amp;t=33s/1771234219349" data-video-remote-vid="https://www.youtube.com/watch?v=M2AqsalF14I&amp;amp;t=33s/1771234219349" class="lia-video-container lia-media-is-center lia-media-size-large"&gt;&lt;iframe src="https://cdn.embedly.com/widgets/media.html?src=https%3A%2F%2Fwww.youtube.com%2Fembed%2FM2AqsalF14I%3Fstart%3D33%26feature%3Doembed%26start%3D33&amp;amp;display_name=YouTube&amp;amp;url=https%3A%2F%2Fwww.youtube.com%2Fwatch%3Fv%3DM2AqsalF14I&amp;amp;image=https%3A%2F%2Fi.ytimg.com%2Fvi%2FM2AqsalF14I%2Fhqdefault.jpg&amp;amp;type=text%2Fhtml&amp;amp;schema=youtube" allowfullscreen="" style="max-width: 100%"&gt;&lt;/iframe&gt;&lt;/div&gt;
&lt;UL&gt;
&lt;LI&gt;Get started with&amp;nbsp;&lt;A href="https://github.com/github/copilot-cli" target="_blank" rel="noopener"&gt;Copilot CLI&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;Get started with the&amp;nbsp;&lt;A href="https://github.com/github/copilot-sdk" target="_blank" rel="noopener"&gt;Copilot SDK&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://docs.github.com/en/copilot/concepts/agents/about-agent-skills" target="_blank" rel="noopener"&gt;Agent Skills&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;If you enjoyed this post, let's connect on &lt;A href="https://www.linkedin.com/in/juliamuiruri/" target="_blank" rel="noopener"&gt;LinkedIn&lt;/A&gt;,&amp;nbsp;&lt;A href="https://x.com/juliamuiruri4" target="_blank" rel="noopener"&gt;X&lt;/A&gt;&amp;nbsp;and&amp;nbsp;&lt;A href="https://bsky.app/profile/juliamuiruri.bsky.social" target="_blank" rel="noopener"&gt;Bsky&lt;/A&gt;&lt;/P&gt;</description>
      <pubDate>Fri, 20 Feb 2026 11:50:14 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/building-an-ai-study-agent-how-github-copilot-cli-sdk-helped/ba-p/4495179</guid>
      <dc:creator>Julia_Muiruri</dc:creator>
      <dc:date>2026-02-20T11:50:14Z</dc:date>
    </item>
    <item>
      <title>Agentic Code Fixing with GitHub Copilot SDK and Foundry Local</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/agentic-code-fixing-with-github-copilot-sdk-and-foundry-local/ba-p/4493967</link>
      <description>&lt;H2&gt;Introduction&lt;/H2&gt;
&lt;P&gt;AI-powered coding assistants have transformed how developers write and review code. But most of these tools require sending your source code to cloud services, a non-starter for teams working with proprietary codebases, air-gapped environments, or strict compliance requirements. What if you could have an intelligent coding agent that finds bugs, fixes them, runs your tests, and produces PR-ready summaries, all without a single byte leaving your machine?&lt;/P&gt;
&lt;P&gt;The &lt;A href="https://github.com/leestott/copilotsdk_foundrylocal" target="_blank"&gt;Local Repo Patch Agent&lt;/A&gt; demonstrates exactly this. By combining the GitHub Copilot SDK for agent orchestration with Foundry Local for on-device inference, this project creates a fully autonomous coding workflow that operates entirely on your hardware. The agent scans your repository, identifies bugs and code smells, applies fixes, verifies them through your test suite, and generates a comprehensive summary of all changes, completely offline and secure.&lt;/P&gt;
&lt;P&gt;This article explores the architecture behind this integration, walks through the key implementation patterns, and shows you how to run the agent yourself. Whether you're building internal developer tools, exploring agentic workflows, or simply curious about what's possible when you combine GitHub's SDK with local AI, this project provides a production-ready foundation to build upon.&lt;/P&gt;
&lt;H2&gt;Why Local AI Matters for Code Analysis&lt;/H2&gt;
&lt;P&gt;Cloud-based AI coding tools have proven their value—GitHub Copilot has fundamentally changed how millions of developers work. But certain scenarios demand local-first approaches where code never leaves the organisation's network.&lt;/P&gt;
&lt;P&gt;Consider these real-world constraints that teams face daily:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Regulatory compliance&lt;/STRONG&gt;: Financial services, healthcare, and government projects often prohibit sending source code to external services, even for analysis&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Intellectual property protection&lt;/STRONG&gt;: Proprietary algorithms and trade secrets can't risk exposure through cloud API calls&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Air-gapped environments&lt;/STRONG&gt;: Secure facilities and classified projects have no internet connectivity whatsoever&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Latency requirements&lt;/STRONG&gt;: Real-time code analysis in IDEs benefits from zero network roundtrip&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Cost control&lt;/STRONG&gt;: High-volume code analysis without per-token API charges&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The Local Repo Patch Agent addresses all these scenarios. By running the AI model on-device through Foundry Local and using the GitHub Copilot SDK for orchestration, you get the intelligence of agentic coding workflows with complete data sovereignty. The architecture proves that "local-first" doesn't mean "capability-limited."&lt;/P&gt;
&lt;H2&gt;The Technology Stack&lt;/H2&gt;
&lt;P&gt;Two core technologies make this architecture possible, working together through a clever integration called BYOK (Bring Your Own Key). Understanding how they complement each other reveals the elegance of the design.&lt;/P&gt;
&lt;H3&gt;GitHub Copilot SDK&lt;/H3&gt;
&lt;P&gt;The &lt;A href="https://github.com/github/copilot-sdk" target="_blank"&gt;GitHub Copilot SDK&lt;/A&gt; provides the agent runtime, the scaffolding that handles planning, tool invocation, streaming responses, and the orchestration loop that makes agentic behaviour possible. Rather than managing raw LLM API calls, developers define tools (functions the agent can call) and system prompts, and the SDK handles everything else.&lt;/P&gt;
&lt;P&gt;Key capabilities the SDK brings to this project:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Session management&lt;/STRONG&gt;: Maintains conversation context across multiple agent interactions&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Tool orchestration&lt;/STRONG&gt;: Automatically invokes defined tools when the model requests them&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Streaming support&lt;/STRONG&gt;: Real-time response streaming for responsive user interfaces&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Provider abstraction&lt;/STRONG&gt;: Works with any OpenAI-compatible API through the BYOK configuration&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;Foundry Local&lt;/H3&gt;
&lt;P&gt;&lt;A href="https://foundrylocal.ai/" target="_blank"&gt;Foundry Local&lt;/A&gt; brings Azure AI Foundry's model catalog to your local machine. It automatically selects the best available hardware acceleration—GPU, NPU, or CP, and exposes models through an OpenAI-compatible API on localhost. Models run entirely on-device with no telemetry or data transmission.&lt;/P&gt;
&lt;P&gt;For this project, Foundry Local provides:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;On-device inference&lt;/STRONG&gt;: All AI processing happens locally, ensuring complete data privacy&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Dynamic port allocation&lt;/STRONG&gt;: The SDK auto-detects the Foundry Local endpoint, eliminating configuration hassle&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Model flexibility&lt;/STRONG&gt;: Swap between models like &lt;CODE&gt;qwen2.5-coder-1.5b&lt;/CODE&gt;, &lt;CODE&gt;phi-3-mini&lt;/CODE&gt;, or larger variants based on your hardware&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;OpenAI API compatibility&lt;/STRONG&gt;: Standard API format means the GitHub Copilot SDK works without modification&lt;/LI&gt;
&lt;/UL&gt;
&lt;H3&gt;The BYOK Integration&lt;/H3&gt;
&lt;P&gt;The entire connection between the GitHub Copilot SDK and Foundry Local happens through a single configuration object. This BYOK (Bring Your Own Key) pattern tells the SDK to route all inference requests to your local model instead of cloud services:&lt;/P&gt;
&lt;PRE&gt;const session = await client.createSession({
  model: modelId,
  provider: {
    type: "openai",               // Foundry Local speaks OpenAI's API format
    baseUrl: proxyBaseUrl,        // Streaming proxy → Foundry Local
    apiKey: manager.apiKey,
    wireApi: "completions",       // Chat Completions API
  },
  streaming: true,
  tools: [ /* your defined tools */ ],
});
&lt;/PRE&gt;
&lt;P&gt;This configuration is the key insight: with one config object, you've redirected an entire agent framework to run on local hardware. No code changes to the SDK, no special adapters—just standard OpenAI-compatible API communication.&lt;/P&gt;
&lt;H2&gt;Architecture Overview&lt;/H2&gt;
&lt;P&gt;The Local Repo Patch Agent implements a layered architecture where each component has a clear responsibility. Understanding this flow helps when extending or debugging the system.&lt;/P&gt;
&lt;PRE&gt;┌──────────────────────────────────────────────────────────┐
│                 Your Terminal / Web UI                   │
│                 npm run demo / npm run ui                │
└──────────────┬───────────────────────────────────────────┘
               │
┌──────────────▼───────────────────────────────────────────┐
│          src/agent.ts  (this project)                    │
│                                                          │
│  ┌───────────────────────────┐   ┌────────────────────┐  │
│  │  GitHub Copilot SDK       │   │  Agent Tools       │  │
│  │  (CopilotClient)          │   │  list_files        │  │
│  │  BYOK → Foundry           │   │  read_file         │  │
│  └───────────┬───────────────┘   │  write_file        │  │
│              │                   │  run_command       │  │
│              │                   └────────────────────┘  │
└──────────────┼───────────────────────────────────────────┘
               │ JSON-RPC
┌──────────────▼───────────────────────────────────────────┐
│          GitHub Copilot CLI  (server mode)               │
│          Agent orchestration layer                       │
└──────────────┬───────────────────────────────────────────┘
               │ POST /v1/chat/completions   (BYOK)
┌──────────────▼───────────────────────────────────────────┐
│          Foundry Local  (on-device inference)            │
│          Model: qwen2.5-coder-1.5b via ONNX Runtime      │
│          Endpoint: auto-detected (dynamic port)          │
└──────────────────────────────────────────────────────────┘
&lt;/PRE&gt;
&lt;P&gt;The data flow works as follows: your terminal or web browser sends a request to the agent application. The agent uses the GitHub Copilot SDK to manage the conversation, which communicates with the Copilot CLI running in server mode. The CLI, configured with BYOK, sends inference requests to Foundry Local running on localhost. Responses flow back up the same path, with tool invocations happening in the agent.ts layer.&lt;/P&gt;
&lt;H2&gt;The Four-Phase Workflow&lt;/H2&gt;
&lt;P&gt;The agent operates through a structured four-phase loop, each phase building on the previous one's output. This decomposition transforms what would be an overwhelming single prompt into manageable, verifiable steps.&lt;/P&gt;
&lt;H3&gt;Phase 1: PLAN&lt;/H3&gt;
&lt;P&gt;The planning phase scans the repository and produces a numbered fix plan. The agent reads every source and test file, identifies potential issues, and outputs specific tasks to address:&lt;/P&gt;
&lt;PRE&gt;// Phase 1 system prompt excerpt
const planPrompt = `
You are a code analysis agent. Scan the repository and identify:
1. Bugs that cause test failures
2. Code smells and duplication
3. Style inconsistencies

Output a numbered list of fixes, ordered by priority.
Each item should specify: file path, line numbers, issue type, and proposed fix.
`;
&lt;/PRE&gt;
&lt;P&gt;The tools available during this phase are &lt;CODE&gt;list_files&lt;/CODE&gt; and &lt;CODE&gt;read_file&lt;/CODE&gt;—the agent explores the codebase without modifying anything. This read-only constraint prevents accidental changes before the plan is established.&lt;/P&gt;
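&lt;P&gt;As a sketch of what that read-only constraint looks like in code (the tool names match the article, but the implementation details here are illustrative, not the project's exact source):&lt;/P&gt;

```typescript
import * as fs from "node:fs";
import * as path from "node:path";

// Assumed workspace root for the demo.
const REPO_ROOT = path.resolve("demo-repo");

// PLAN-phase tool set: exploration only. There is no write_file or
// run_command here, so the model cannot modify anything in this phase.
export const planTools = [
  {
    name: "list_files",
    description: "List entries in a directory, relative to the repo root",
    execute: async (dir: string) =>
      fs.readdirSync(path.resolve(REPO_ROOT, dir)),
  },
  {
    name: "read_file",
    description: "Return the contents of a file inside the repo",
    execute: async (file: string) =>
      fs.readFileSync(path.resolve(REPO_ROOT, file), "utf8"),
  },
];
```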
&lt;H3&gt;Phase 2: EDIT&lt;/H3&gt;
&lt;P&gt;With a plan in hand, the edit phase applies each fix by rewriting affected files. The agent receives the plan from Phase 1 and systematically addresses each item:&lt;/P&gt;
&lt;PRE&gt;// Phase 2 adds the write_file tool
const editTools = [
  {
    name: "write_file",
    description: "Write content to a file, creating or overwriting it",
    parameters: {
      type: "object",
      properties: {
        path: { type: "string", description: "File path relative to repo root" },
        content: { type: "string", description: "Complete file contents" }
      },
      required: ["path", "content"]
    }
  }
];
&lt;/PRE&gt;
&lt;P&gt;The &lt;CODE&gt;write_file&lt;/CODE&gt; tool is sandboxed to the &lt;CODE&gt;demo-repo&lt;/CODE&gt; directory; path traversal attempts are blocked, preventing the agent from modifying files outside the designated workspace.&lt;/P&gt;
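&lt;P&gt;The traversal check can be as small as resolving the requested path and confirming the result still lives under the sandbox root. A minimal sketch (the helper name is hypothetical, not the project's actual function):&lt;/P&gt;

```typescript
import * as path from "node:path";

// Assumed sandbox root, matching the demo workspace.
const SANDBOX_ROOT = path.resolve("demo-repo");

// Resolve a model-supplied path and refuse anything that escapes the
// sandbox root, covering both "../" traversal and absolute paths.
export function resolveSandboxedPath(requested: string): string {
  const resolved = path.resolve(SANDBOX_ROOT, requested);
  const inRoot =
    resolved === SANDBOX_ROOT ||
    resolved.startsWith(SANDBOX_ROOT + path.sep);
  if (!inRoot) {
    throw new Error("Blocked path outside sandbox: " + requested);
  }
  return resolved;
}
```

&lt;P&gt;With this guard in place, a request like &lt;CODE&gt;../../../etc/passwd&lt;/CODE&gt; resolves outside the root and is rejected, while &lt;CODE&gt;src/account.js&lt;/CODE&gt; resolves normally.&lt;/P&gt;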
&lt;H3&gt;Phase 3: VERIFY&lt;/H3&gt;
&lt;P&gt;After making changes, the verification phase runs the project's test suite to confirm fixes work correctly. If tests fail, the agent attempts to diagnose and repair the issue:&lt;/P&gt;
&lt;PRE&gt;// Phase 3 adds run_command with an allowlist
const allowedCommands = ["npm test", "npm run lint", "npm run build"];

const runCommandTool = {
  name: "run_command",
  description: "Execute a shell command (npm test, npm run lint, npm run build only)",
  execute: async (command: string) =&amp;gt; {
    if (!allowedCommands.includes(command)) {
      throw new Error(`Command not allowed: ${command}`);
    }
    // Execute and return stdout/stderr
  }
};
&lt;/PRE&gt;
&lt;P&gt;The command allowlist is a critical security measure. The agent can only run explicitly permitted commands—no arbitrary shell execution, no data exfiltration, no system modification.&lt;/P&gt;
&lt;H3&gt;Phase 4: SUMMARY&lt;/H3&gt;
&lt;P&gt;The final phase produces a PR-style Markdown report documenting all changes. This summary includes what was changed, why each change was necessary, test results, and recommended follow-up actions:&lt;/P&gt;
&lt;PRE&gt;## Summary of Changes

### Bug Fix: calculateInterest() in account.js
- **Issue**: Division instead of multiplication caused incorrect interest calculations
- **Fix**: Changed `principal / annualRate` to `principal * (annualRate / 100)`
- **Tests**: 3 previously failing tests now pass

### Refactor: Duplicate formatCurrency() removed
- **Issue**: Identical function existed in account.js and transaction.js
- **Fix**: Both files now import from utils.js
- **Impact**: Reduced code duplication, single source of truth

### Test Results
- **Before**: 6/9 passing
- **After**: 9/9 passing
&lt;/PRE&gt;
&lt;P&gt;This structured output makes code review straightforward: reviewers can quickly understand what changed and why without digging through diffs.&lt;/P&gt;
&lt;H2&gt;The Demo Repository: Intentional Bugs&lt;/H2&gt;
&lt;P&gt;The project includes a &lt;A class="lia-external-url" href="https://github.com/leestott/copilotsdk_foundrylocal" target="_blank"&gt;demo-repo directory containing a small banking utility library&lt;/A&gt; with intentional problems for the agent to find and fix. This provides a controlled environment to demonstrate the agent's capabilities.&lt;/P&gt;
&lt;H3&gt;Bug 1: Calculation Error in calculateInterest()&lt;/H3&gt;
&lt;P&gt;The account.js file contains a calculation bug that causes test failures:&lt;/P&gt;
&lt;PRE&gt;// BUG: should be principal * (annualRate / 100)
function calculateInterest(principal, annualRate) {
  return principal / annualRate;  // Division instead of multiplication!
}
&lt;/PRE&gt;
&lt;P&gt;This bug causes 3 of 9 tests to fail. The agent identifies it during the PLAN phase by correlating test failures with the implementation, then fixes it during EDIT.&lt;/P&gt;
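&lt;P&gt;The corrected implementation is exactly what the Phase 4 summary describes, shown here as a standalone sketch:&lt;/P&gt;

```typescript
// Fixed: interest is the principal times the rate expressed as a
// fraction, not the principal divided by the rate.
export function calculateInterest(principal: number, annualRate: number): number {
  return principal * (annualRate / 100);
}
```

&lt;P&gt;For example, &lt;CODE&gt;calculateInterest(1000, 5)&lt;/CODE&gt; now returns 50, where the buggy division returned 200.&lt;/P&gt;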
&lt;H3&gt;Bug 2: Code Duplication&lt;/H3&gt;
&lt;P&gt;The &lt;CODE&gt;formatCurrency()&lt;/CODE&gt; function is copy-pasted in both account.js and transaction.js, even though a canonical version exists in utils.js. This duplication creates maintenance burden and potential inconsistency:&lt;/P&gt;
&lt;PRE&gt;// In account.js (duplicated)
function formatCurrency(amount) {
  return '$' + amount.toFixed(2);
}

// In transaction.js (also duplicated)
function formatCurrency(amount) {
  return '$' + amount.toFixed(2);
}

// In utils.js (canonical, but unused)
export function formatCurrency(amount) {
  return '$' + amount.toFixed(2);
}
&lt;/PRE&gt;
&lt;P&gt;The agent identifies this duplication during planning and refactors both files to import from utils.js, eliminating redundancy.&lt;/P&gt;
&lt;H2&gt;Handling Foundry Local Streaming Quirks&lt;/H2&gt;
&lt;P&gt;One technical challenge the project solves is Foundry Local's behaviour with streaming requests. As of version 0.5, Foundry Local can hang on &lt;CODE&gt;stream: true&lt;/CODE&gt; requests. The project includes a streaming proxy that works around this limitation transparently.&lt;/P&gt;
&lt;H3&gt;The Streaming Proxy&lt;/H3&gt;
&lt;P&gt;The streaming-proxy.ts file implements a lightweight HTTP proxy that converts streaming requests to non-streaming, then re-encodes the single response as SSE (Server-Sent Events) chunks—the format the OpenAI SDK expects:&lt;/P&gt;
&lt;PRE&gt;// streaming-proxy.ts simplified logic
async function handleRequest(req: Request): Promise&lt;Response&gt; {
  const body = await req.json();
  
  // If it's a streaming chat completion, convert to non-streaming
  if (body.stream === true &amp;amp;&amp;amp; req.url.includes('/chat/completions')) {
    body.stream = false;
    
    const response = await fetch(foundryEndpoint, {
      method: 'POST',
      body: JSON.stringify(body),
      headers: { 'Content-Type': 'application/json' }
    });
    
    const data = await response.json();
    
    // Re-encode as SSE stream for the SDK
    return createSSEResponse(data);
  }
  
  // Non-streaming and non-chat requests pass through unchanged
  return fetch(foundryEndpoint, req);
}
&lt;/PRE&gt;
&lt;P&gt;This proxy runs on port 8765 by default and sits between the GitHub Copilot SDK and Foundry Local. The SDK thinks it's talking to a streaming-capable endpoint, while the actual inference happens non-streaming. The conversion is transparent; no changes to the SDK configuration are needed.&lt;/P&gt;
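&lt;P&gt;The re-encoding step can be sketched as a pure function: take the single non-streaming completion, rewrite each choice's &lt;CODE&gt;message&lt;/CODE&gt; as a streaming-style &lt;CODE&gt;delta&lt;/CODE&gt;, and terminate with the &lt;CODE&gt;[DONE]&lt;/CODE&gt; sentinel that SSE clients expect. This is an illustrative simplification of the project's &lt;CODE&gt;createSSEResponse&lt;/CODE&gt;, not its exact code:&lt;/P&gt;

```typescript
type ChatCompletion = {
  id: string;
  model: string;
  choices: { index: number; message: { role: string; content: string } }[];
};

// Convert one complete chat completion into a single SSE chunk body.
export function toSSEBody(completion: ChatCompletion): string {
  const chunk = {
    id: completion.id,
    model: completion.model,
    object: "chat.completion.chunk",
    choices: completion.choices.map((c) => ({
      index: c.index,
      delta: c.message, // streaming clients read "delta", not "message"
      finish_reason: "stop",
    })),
  };
  return "data: " + JSON.stringify(chunk) + "\n\ndata: [DONE]\n\n";
}
```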
&lt;H3&gt;Text-Based Tool Call Detection&lt;/H3&gt;
&lt;P&gt;Small on-device models like &lt;CODE&gt;qwen2.5-coder-1.5b&lt;/CODE&gt; sometimes output tool calls as JSON text rather than using OpenAI-style function calling. The SDK won't fire &lt;CODE&gt;tool.execution_start&lt;/CODE&gt; events for these text-based calls, so the agent includes a regex-based detector:&lt;/P&gt;
&lt;PRE&gt;// Pattern to detect tool calls in model output
const toolCallPattern = /\{[\s\S]*"name":\s*"(list_files|read_file|write_file|run_command)"[\s\S]*\}/;

function detectToolCall(text: string): ToolCall | null {
  const match = text.match(toolCallPattern);
  if (match) {
    try {
      return JSON.parse(match[0]);
    } catch {
      return null;
    }
  }
  return null;
}
&lt;/PRE&gt;
&lt;P&gt;This fallback ensures tool calls are captured regardless of whether the model uses native function calling or text output, keeping the dashboard's tool call counter and CLI log accurate.&lt;/P&gt;
&lt;H2&gt;Security Considerations&lt;/H2&gt;
&lt;P&gt;Running an AI agent that can read and write files and execute commands requires careful security design. The Local Repo Patch Agent implements multiple layers of protection:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;100% local execution&lt;/STRONG&gt;: No code, prompts, or responses leave your machine—complete data sovereignty&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Command allowlist&lt;/STRONG&gt;: The agent can only run &lt;CODE&gt;npm test&lt;/CODE&gt;, &lt;CODE&gt;npm run lint&lt;/CODE&gt;, and &lt;CODE&gt;npm run build&lt;/CODE&gt;—no arbitrary shell commands&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Path sandboxing&lt;/STRONG&gt;: File tools are locked to the &lt;CODE&gt;demo-repo/&lt;/CODE&gt; directory; path traversal attempts like &lt;CODE&gt;../../../etc/passwd&lt;/CODE&gt; are rejected&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;File size limits&lt;/STRONG&gt;: The &lt;CODE&gt;read_file&lt;/CODE&gt; tool rejects files over 256 KB, preventing memory exhaustion attacks&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Recursion limits&lt;/STRONG&gt;: Directory listing caps at 20 levels deep, preventing infinite traversal&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;These constraints demonstrate responsible AI agent design. The agent has enough capability to do useful work but not enough to cause harm. When extending this project for your own use cases, maintain similar principles: grant minimum necessary permissions, validate all inputs, and fail closed on unexpected conditions.&lt;/P&gt;
&lt;H2&gt;Running the Agent&lt;/H2&gt;
&lt;P&gt;Getting the Local Repo Patch Agent running on your machine takes about five minutes. The project includes setup scripts that handle prerequisites automatically.&lt;/P&gt;
&lt;H3&gt;Prerequisites&lt;/H3&gt;
&lt;P&gt;Before running the setup, ensure you have:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Node.js 18 or higher&lt;/STRONG&gt;: Download from &lt;A href="https://nodejs.org/" target="_blank"&gt;nodejs.org&lt;/A&gt; (LTS version recommended)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Foundry Local&lt;/STRONG&gt;: Install via &lt;CODE&gt;winget install Microsoft.FoundryLocal&lt;/CODE&gt; (Windows) or &lt;CODE&gt;brew install foundrylocal&lt;/CODE&gt; (macOS)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;GitHub Copilot CLI&lt;/STRONG&gt;: Follow the &lt;A href="https://docs.github.com/en/copilot/how-tos/set-up/install-copilot-cli" target="_blank"&gt;GitHub Copilot CLI install guide&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Verify your installations:&lt;/P&gt;
&lt;PRE&gt;node --version    # Should print v18.x.x or higher
foundry --version
copilot --version
&lt;/PRE&gt;
&lt;H3&gt;One-Command Setup&lt;/H3&gt;
&lt;P&gt;The easiest path uses the provided setup scripts that install dependencies, start Foundry Local, and download the AI model:&lt;/P&gt;
&lt;PRE&gt;# Clone the repository
git clone https://github.com/leestott/copilotsdk_foundrylocal.git
cd copilotsdk_foundrylocal

# Windows (PowerShell)
.\setup.ps1

# macOS / Linux
chmod +x setup.sh
./setup.sh
&lt;/PRE&gt;
&lt;P&gt;When setup completes, you'll see:&lt;/P&gt;
&lt;PRE&gt;━━━ Setup complete! ━━━

  You're ready to go. Run one of these commands:

    npm run demo     CLI agent (terminal output)
    npm run ui       Web dashboard (http://localhost:3000)
&lt;/PRE&gt;
&lt;H3&gt;Manual Setup&lt;/H3&gt;
&lt;P&gt;If you prefer step-by-step control:&lt;/P&gt;
&lt;PRE&gt;# Install npm packages
npm install
cd demo-repo &amp;amp;&amp;amp; npm install --ignore-scripts &amp;amp;&amp;amp; cd ..

# Start Foundry Local and download the model
foundry service start
foundry model run qwen2.5-coder-1.5b

# Copy environment configuration
cp .env.example .env

# Run the agent
npm run demo
&lt;/PRE&gt;
&lt;P&gt;The first model download takes a few minutes depending on your connection. After that, the model runs from cache with no internet required.&lt;/P&gt;
&lt;H3&gt;Using the Web Dashboard&lt;/H3&gt;
&lt;P&gt;For a visual experience with real-time streaming, launch the web UI:&lt;/P&gt;
&lt;PRE&gt;npm run ui
&lt;/PRE&gt;
&lt;P&gt;Open &lt;A href="http://localhost:3000" target="_blank"&gt;http://localhost:3000&lt;/A&gt; in your browser. The dashboard provides:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Phase progress sidebar&lt;/STRONG&gt;: Visual indication of which phase is running, completed, or errored&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Live streaming output&lt;/STRONG&gt;: Model responses appear in real-time via WebSocket&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Tool call log&lt;/STRONG&gt;: Every tool invocation logged with phase context&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Phase timing table&lt;/STRONG&gt;: Performance metrics showing how long each phase took&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Environment info&lt;/STRONG&gt;: Current model, endpoint, and repository path at a glance&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Configuration Options&lt;/H2&gt;
&lt;P&gt;The agent supports several environment variables for customisation. Edit the &lt;CODE&gt;.env&lt;/CODE&gt; file or set them directly:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Variable&lt;/th&gt;&lt;th&gt;Default&lt;/th&gt;&lt;th&gt;Description&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FOUNDRY_LOCAL_ENDPOINT&lt;/td&gt;&lt;td&gt;auto-detected&lt;/td&gt;&lt;td&gt;Override the Foundry Local API endpoint&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FOUNDRY_LOCAL_API_KEY&lt;/td&gt;&lt;td&gt;auto-detected&lt;/td&gt;&lt;td&gt;Override the API key&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FOUNDRY_MODEL&lt;/td&gt;&lt;td&gt;qwen2.5-coder-1.5b&lt;/td&gt;&lt;td&gt;Which model to use from the Foundry Local catalog&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FOUNDRY_TIMEOUT_MS&lt;/td&gt;&lt;td&gt;180000 (3 min)&lt;/td&gt;&lt;td&gt;How long each agent phase can run before timing out&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;FOUNDRY_NO_PROXY&lt;/td&gt;&lt;td&gt;—&lt;/td&gt;&lt;td&gt;Set to 1 to disable the streaming proxy&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PORT&lt;/td&gt;&lt;td&gt;3000&lt;/td&gt;&lt;td&gt;Port for the web dashboard&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
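&lt;P&gt;In code, the table above maps to a small configuration object. This sketch uses the documented variable names and defaults; the exact shape inside the project may differ:&lt;/P&gt;

```typescript
// Read configuration from the environment, falling back to the
// documented defaults; an undefined endpoint/key means auto-detect.
export const config = {
  endpoint: process.env.FOUNDRY_LOCAL_ENDPOINT,
  apiKey: process.env.FOUNDRY_LOCAL_API_KEY,
  model: process.env.FOUNDRY_MODEL ?? "qwen2.5-coder-1.5b",
  timeoutMs: Number(process.env.FOUNDRY_TIMEOUT_MS ?? 180000),
  useProxy: process.env.FOUNDRY_NO_PROXY !== "1",
  port: Number(process.env.PORT ?? 3000),
};
```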
&lt;H3&gt;Using Different Models&lt;/H3&gt;
&lt;P&gt;To try a different model from the Foundry Local catalog:&lt;/P&gt;
&lt;PRE&gt;# Use phi-3-mini instead
FOUNDRY_MODEL=phi-3-mini npm run demo

# Use a larger model for higher quality (requires more RAM/VRAM)
FOUNDRY_MODEL=qwen2.5-7b npm run demo
&lt;/PRE&gt;
&lt;H3&gt;Adjusting for Slower Hardware&lt;/H3&gt;
&lt;P&gt;If you're running on CPU-only or limited hardware, increase the timeout to give the model more time per phase:&lt;/P&gt;
&lt;PRE&gt;# 5 minutes per phase instead of 3
FOUNDRY_TIMEOUT_MS=300000 npm run demo
&lt;/PRE&gt;
&lt;H2&gt;Troubleshooting Common Issues&lt;/H2&gt;
&lt;P&gt;When things don't work as expected, these solutions address the most common problems:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table&gt;&lt;tbody&gt;&lt;tr&gt;&lt;th&gt;Problem&lt;/th&gt;&lt;th&gt;Solution&lt;/th&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;foundry: command not found&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;Install Foundry Local—see Prerequisites section&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;copilot: command not found&lt;/CODE&gt;&lt;/td&gt;&lt;td&gt;Install GitHub Copilot CLI—see Prerequisites section&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Agent times out on every phase&lt;/td&gt;&lt;td&gt;Increase &lt;CODE&gt;FOUNDRY_TIMEOUT_MS&lt;/CODE&gt; (e.g., 300000 for 5 min). CPU-only machines are slower.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Port 3000 already in use&lt;/td&gt;&lt;td&gt;Set &lt;CODE&gt;PORT=3001 npm run ui&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Model download is slow&lt;/td&gt;&lt;td&gt;First download can take 5-10 min. Subsequent runs use the cache.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;CODE&gt;Cannot find module&lt;/CODE&gt; errors&lt;/td&gt;&lt;td&gt;Run &lt;CODE&gt;npm install&lt;/CODE&gt; again, then &lt;CODE&gt;cd demo-repo &amp;amp;&amp;amp; npm install --ignore-scripts&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;Tests still fail after agent runs&lt;/td&gt;&lt;td&gt;The agent edits files in demo-repo/. Reset with &lt;CODE&gt;git checkout demo-repo/&lt;/CODE&gt; and run again.&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;PowerShell blocks setup.ps1&lt;/td&gt;&lt;td&gt;Run &lt;CODE&gt;Set-ExecutionPolicy -Scope Process Bypass&lt;/CODE&gt; first, then &lt;CODE&gt;.\setup.ps1&lt;/CODE&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;H2&gt;Diagnostic Test Scripts&lt;/H2&gt;
&lt;P&gt;The &lt;CODE&gt;src/tests/&lt;/CODE&gt; folder contains standalone scripts for debugging SDK and Foundry Local integration issues. These are invaluable when things go wrong:&lt;/P&gt;
&lt;PRE&gt;# Debug-level SDK event logging
npx tsx src/tests/test-debug.ts

# Test non-streaming inference (bypasses streaming proxy)
npx tsx src/tests/test-nostream.ts

# Raw fetch to Foundry Local (bypasses SDK entirely)
npx tsx src/tests/test-stream-direct.ts

# Start the traffic-inspection proxy
npx tsx src/tests/test-proxy.ts
&lt;/PRE&gt;
&lt;P&gt;These scripts isolate different layers of the stack, helping identify whether issues lie in Foundry Local, the streaming proxy, the SDK, or your application code.&lt;/P&gt;
&lt;H2&gt;Key Takeaways&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;BYOK enables local-first AI&lt;/STRONG&gt;: A single configuration object redirects the entire GitHub Copilot SDK to use on-device inference through Foundry Local&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Phased workflows improve reliability&lt;/STRONG&gt;: Breaking complex tasks into PLAN → EDIT → VERIFY → SUMMARY phases makes agent behaviour predictable and debuggable&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Security requires intentional design&lt;/STRONG&gt;: Allowlists, sandboxing, and size limits constrain agent capabilities to safe operations&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Local models have quirks&lt;/STRONG&gt;: The streaming proxy and text-based tool detection demonstrate how to work around on-device model limitations&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Real-time feedback matters&lt;/STRONG&gt;: The web dashboard with WebSocket streaming makes agent progress visible and builds trust in the system&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The architecture is extensible&lt;/STRONG&gt;: Add new tools, change models, or modify phases to adapt the agent to your specific needs&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Conclusion and Next Steps&lt;/H2&gt;
&lt;P&gt;The Local Repo Patch Agent proves that sophisticated agentic coding workflows don't require cloud infrastructure. By combining the GitHub Copilot SDK's orchestration capabilities with Foundry Local's on-device inference, you get intelligent code analysis that respects data sovereignty completely.&lt;/P&gt;
&lt;P&gt;The patterns demonstrated here (BYOK integration, phased execution, security sandboxing, and streaming workarounds) transfer directly to production systems. Consider extending this foundation with:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Custom tool sets&lt;/STRONG&gt;: Add database queries, API calls to internal services, or integration with your CI/CD pipeline&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Multiple repository support&lt;/STRONG&gt;: Scan and fix issues across an entire codebase or monorepo&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Different model sizes&lt;/STRONG&gt;: Use smaller models for quick scans, larger ones for complex refactoring&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Human-in-the-loop approval&lt;/STRONG&gt;: Add review steps before applying fixes to production code&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Integration with Git workflows&lt;/STRONG&gt;: Automatically create branches and PRs from agent-generated fixes&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Clone the &lt;A href="https://github.com/leestott/copilotsdk_foundrylocal" target="_blank"&gt;repository&lt;/A&gt;, run through the demo, and start building your own local-first AI coding tools. The future of developer AI isn't just cloud—it's intelligent systems that run wherever your code lives.&lt;/P&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://github.com/leestott/copilotsdk_foundrylocal" target="_blank"&gt;Local Repo Patch Agent Repository&lt;/A&gt; – Full source code with setup scripts and documentation&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://www.foundrylocal.ai/" target="_blank"&gt;Foundry Local&lt;/A&gt; – Official site for on-device AI inference&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/Foundry-Local" target="_blank"&gt;Foundry Local GitHub Repository&lt;/A&gt; – Installation instructions and CLI reference&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/get-started" target="_blank"&gt;Foundry Local Get Started Guide&lt;/A&gt; – Official Microsoft Learn documentation&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/reference/reference-sdk" target="_blank"&gt;Foundry Local SDK Reference&lt;/A&gt; – Python and JavaScript SDK documentation&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/github/copilot-sdk" target="_blank"&gt;GitHub Copilot SDK&lt;/A&gt; – Official SDK repository&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/github/copilot-sdk/blob/main/docs/auth/byok.md" target="_blank"&gt;GitHub Copilot SDK BYOK Documentation&lt;/A&gt; – Bring Your Own Key integration guide&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/github/copilot-sdk/blob/main/docs/getting-started.md" target="_blank"&gt;GitHub Copilot SDK Getting Started&lt;/A&gt; – SDK setup and first agent tutorial&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/Foundry-Local/tree/main/samples/js/copilot-sdk-foundry-local" target="_blank"&gt;Microsoft Sample: Copilot SDK + Foundry Local&lt;/A&gt; – Official integration sample from Microsoft&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Mon, 16 Feb 2026 09:28:40 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/agentic-code-fixing-with-github-copilot-sdk-and-foundry-local/ba-p/4493967</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-02-16T09:28:40Z</dc:date>
    </item>
    <item>
      <title>Building a Local Research Desk: Multi-Agent Orchestration</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/building-a-local-research-desk-multi-agent-orchestration/ba-p/4493965</link>
      <description>&lt;H2&gt;Introduction&lt;/H2&gt;
&lt;P&gt;Multi-agent systems represent the next evolution of AI applications. Instead of a single model handling everything, specialised agents collaborate—each with defined responsibilities, passing context to one another, and producing results that no single agent could achieve alone. But building these systems typically requires cloud infrastructure, API keys, usage tracking, and the constant concern about what data leaves your machine.&lt;/P&gt;
&lt;P&gt;What if you could build sophisticated multi-agent workflows entirely on your local machine, with no cloud dependencies? The &lt;A href="https://github.com/leestott/agentframework--foundrylocal" target="_blank" rel="noopener"&gt;Local Research &amp;amp; Synthesis Desk&lt;/A&gt; demonstrates exactly this. Using Microsoft Agent Framework (MAF) for orchestration and Foundry Local for on-device inference, this demo shows how to create a four-agent research pipeline that runs entirely on your hardware—no API keys, no data leaving your network, and complete control over every step.&lt;/P&gt;
&lt;P&gt;This article walks through the architecture, implementation patterns, and practical code that makes multi-agent local AI possible. You'll learn how to bootstrap Foundry Local from Python, create specialised agents with distinct roles, wire them into sequential, concurrent, and feedback loop orchestration patterns, and implement tool calling for extended functionality. Whether you're building research tools, internal analysis systems, or simply exploring what's possible with local AI, this architecture provides a production-ready foundation.&lt;/P&gt;
&lt;H2&gt;Why Multi-Agent Architecture Matters&lt;/H2&gt;
&lt;P&gt;Single-agent AI systems hit limitations quickly. Ask one model to research a topic, analyse findings, identify gaps, and write a comprehensive report—and you'll get mediocre results. The model tries to do everything at once, with no opportunity for specialisation, review, or iterative refinement.&lt;/P&gt;
&lt;P&gt;Multi-agent systems solve this by decomposing complex tasks into specialised roles. Each agent focuses on what it does best:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Planners&lt;/STRONG&gt; break ambiguous questions into concrete sub-tasks&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Retrievers&lt;/STRONG&gt; focus exclusively on finding and extracting relevant information&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Critics&lt;/STRONG&gt; review work for gaps, contradictions, and quality issues&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Writers&lt;/STRONG&gt; synthesise everything into coherent, well-structured output&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This separation of concerns mirrors how human teams work effectively. A research team doesn't have one person doing everything—they have researchers, fact-checkers, editors, and writers. Multi-agent AI systems apply the same principle to AI workflows, with each agent receiving the output of previous agents as context for their own specialised task.&lt;/P&gt;
&lt;P&gt;The Local Research &amp;amp; Synthesis Desk implements this pattern with four primary agents, plus an optional ToolAgent for utility functions.&lt;/P&gt;
&lt;P&gt;This architecture demonstrates three essential orchestration patterns: sequential pipelines where each agent builds on the previous output, concurrent fan-out where independent tasks run in parallel to save time, and feedback loops where the Critic can send work back to the Retriever for iterative refinement.&lt;/P&gt;
&lt;H2&gt;The Technology Stack: MAF + Foundry Local&lt;/H2&gt;
&lt;P&gt;Before diving into implementation, let's understand the two core technologies that make this architecture possible and why they work so well together.&lt;/P&gt;
&lt;H3&gt;Microsoft Agent Framework (MAF)&lt;/H3&gt;
&lt;P&gt;The &lt;A href="https://learn.microsoft.com/en-us/agent-framework/" target="_blank" rel="noopener"&gt;Microsoft Agent Framework&lt;/A&gt; provides building blocks for creating AI agents in Python and .NET. Unlike frameworks that require specific cloud providers, MAF works with any OpenAI-compatible API—which is exactly what Foundry Local provides.&lt;/P&gt;
&lt;P&gt;The key abstraction in MAF is the &lt;CODE&gt;ChatAgent&lt;/CODE&gt;. Each agent has:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Instructions&lt;/STRONG&gt;: A system prompt that defines the agent's role and behaviour&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Chat client&lt;/STRONG&gt;: An OpenAI-compatible client for making inference calls&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Tools&lt;/STRONG&gt;: Optional functions the agent can invoke during execution&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Name&lt;/STRONG&gt;: An identifier for logging and observability&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;MAF handles message threading, tool execution, and response parsing automatically. You focus on designing agent behaviour rather than managing low-level API interactions.&lt;/P&gt;
&lt;H3&gt;Foundry Local&lt;/H3&gt;
&lt;P&gt;&lt;A href="https://foundrylocal.ai/" target="_blank" rel="noopener"&gt;Foundry Local&lt;/A&gt; brings Azure AI Foundry's model catalog to your local machine. It automatically selects the best hardware acceleration available (GPU, NPU, or CPU) and exposes models through an OpenAI-compatible API. Models run entirely on-device with no data leaving your machine.&lt;/P&gt;
&lt;P&gt;The &lt;CODE&gt;foundry-local-sdk&lt;/CODE&gt; Python package provides programmatic control over the Foundry Local service. You can start the service, download models, and retrieve connection information—all from your Python code. This is the "control plane" that manages the local AI infrastructure.&lt;/P&gt;
&lt;P&gt;The combination is powerful: MAF handles agent logic and orchestration, while Foundry Local provides the underlying inference. No cloud dependencies, no API keys, complete data privacy.&lt;/P&gt;
&lt;H2&gt;Bootstrapping Foundry Local from Python&lt;/H2&gt;
&lt;P&gt;The first practical challenge is starting Foundry Local programmatically. The &lt;CODE&gt;FoundryLocalBootstrapper&lt;/CODE&gt; class handles this, encapsulating all the setup logic so the rest of the application can focus on agent behaviour.&lt;/P&gt;
&lt;P&gt;The bootstrap process follows three steps: start the Foundry Local service if it's not running, download the requested model if it's not cached, and return connection information that MAF agents can use. Here's the core implementation:&lt;/P&gt;
&lt;PRE&gt;from dataclasses import dataclass

@dataclass
class FoundryConnection:
    """Holds endpoint, API key, and model ID after bootstrap."""
    endpoint: str
    api_key: str
    model_id: str
    model_alias: str
&lt;/PRE&gt;
&lt;P&gt;This dataclass carries everything needed to connect MAF agents to Foundry Local. The endpoint is typically &lt;CODE&gt;http://localhost:&amp;lt;port&amp;gt;/v1&lt;/CODE&gt; (the port is assigned dynamically), and the API key is managed internally by Foundry Local.&lt;/P&gt;
&lt;PRE&gt;import os

class FoundryLocalBootstrapper:
    def __init__(self, alias: str | None = None) -&amp;gt; None:
        self.alias = alias or os.getenv("MODEL_ALIAS", "qwen2.5-0.5b")

    def bootstrap(self) -&amp;gt; FoundryConnection:
        """Start service, download &amp;amp; load model, return connection info."""
        from foundry_local import FoundryLocalManager
        
        manager = FoundryLocalManager()
        model_info = manager.download_and_load_model(self.alias)
        
        return FoundryConnection(
            endpoint=manager.endpoint,
            api_key=manager.api_key,
            model_id=model_info.id,
            model_alias=self.alias,
        )
&lt;/PRE&gt;
&lt;P&gt;Key design decisions in this implementation:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Lazy import&lt;/STRONG&gt;: The &lt;CODE&gt;foundry_local&lt;/CODE&gt; import happens inside &lt;CODE&gt;bootstrap()&lt;/CODE&gt; so the application can provide helpful error messages if the SDK isn't installed&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Environment configuration&lt;/STRONG&gt;: Model alias comes from &lt;CODE&gt;MODEL_ALIAS&lt;/CODE&gt; environment variable or defaults to &lt;CODE&gt;qwen2.5-0.5b&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Automatic hardware selection&lt;/STRONG&gt;: Foundry Local picks GPU, NPU, or CPU automatically—no configuration needed&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;The &lt;CODE&gt;qwen2.5&lt;/CODE&gt; model family is recommended because it supports function/tool calling, which the ToolAgent requires. For higher quality outputs, larger variants like &lt;CODE&gt;qwen2.5-7b&lt;/CODE&gt; or &lt;CODE&gt;qwen2.5-14b&lt;/CODE&gt; are available via the &lt;CODE&gt;--model&lt;/CODE&gt; flag.&lt;/P&gt;
&lt;H2&gt;Creating Specialised Agents&lt;/H2&gt;
&lt;P&gt;With Foundry Local bootstrapped, the next step is creating agents with distinct roles. Each agent is a &lt;CODE&gt;ChatAgent&lt;/CODE&gt; instance with carefully crafted instructions that focus it on a specific task.&lt;/P&gt;
&lt;H3&gt;The Planner Agent&lt;/H3&gt;
&lt;P&gt;The Planner receives a user question and available documents, then breaks the research task into concrete sub-tasks. Its instructions emphasise structured output—a numbered list of specific tasks rather than prose:&lt;/P&gt;
&lt;PRE&gt;from agent_framework import ChatAgent
from agent_framework.openai import OpenAIChatClient

def _make_client(conn: FoundryConnection) -&amp;gt; OpenAIChatClient:
    """Create an MAF OpenAIChatClient pointing at Foundry Local."""
    return OpenAIChatClient(
        api_key=conn.api_key,
        base_url=conn.endpoint,
        model_id=conn.model_id,
    )

def create_planner(conn: FoundryConnection) -&amp;gt; ChatAgent:
    return ChatAgent(
        chat_client=_make_client(conn),
        name="Planner",
        instructions=(
            "You are a planning agent. Given a user's research question and a list "
            "of document snippets (if any), break the question into 2-4 concrete "
            "sub-tasks. Output ONLY a numbered list of tasks. Each task should state:\n"
            "  • What information is needed\n"
            "  • Which source documents might help (if known)\n"
            "Keep it concise — no more than 6 lines total."
        ),
    )
&lt;/PRE&gt;
&lt;P&gt;Notice how the instructions are explicit about output format. Multi-agent systems work best when each agent produces structured, predictable output that downstream agents can parse reliably.&lt;/P&gt;
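&lt;P&gt;Because the output contract is a plain numbered list, downstream code can recover individual tasks with a few lines of standard Python. This parser is illustrative only (the demo passes the Planner's raw text straight to the Retriever), a minimal sketch:&lt;/P&gt;

```python
import re

def parse_numbered_tasks(plan_text: str) -> list[str]:
    """Split a Planner-style numbered list into individual task strings."""
    tasks = []
    for line in plan_text.splitlines():
        # Match lines like "1. Find X" or "2) Find Y"
        m = re.match(r"\s*\d+[.)]\s+(.*)", line)
        if m:
            tasks.append(m.group(1).strip())
    return tasks

plan = "1. Identify key features of Foundry Local\n2. Compare with cloud inference"
print(parse_numbered_tasks(plan))
# prints ['Identify key features of Foundry Local', 'Compare with cloud inference']
```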
&lt;H3&gt;The Retriever Agent&lt;/H3&gt;
&lt;P&gt;The Retriever receives the Planner's task list plus raw document content, then extracts and cites relevant passages. Its instructions emphasise citation format—a specific pattern that the Writer can reference later:&lt;/P&gt;
&lt;PRE&gt;def create_retriever(conn: FoundryConnection) -&amp;gt; ChatAgent:
    return ChatAgent(
        chat_client=_make_client(conn),
        name="Retriever",
        instructions=(
            "You are a retrieval agent. You receive a research plan AND raw document "
            "text from local files. Your job:\n"
            "  1. Identify the most relevant passages for each task in the plan.\n"
            "  2. Output extracted snippets with citations in the format:\n"
            "     [filename.ext, lines X-Y]: \"quoted text…\"\n"
            "  3. If no relevant content exists, say so explicitly.\n"
            "Be precise — quote only what is relevant, keep each snippet under 100 words."
        ),
    )
&lt;/PRE&gt;
&lt;P&gt;The citation format &lt;CODE&gt;[filename.ext, lines X-Y]&lt;/CODE&gt; creates a consistent contract. The Writer knows exactly how to reference source material, and human reviewers can verify claims against original documents.&lt;/P&gt;
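&lt;P&gt;To illustrate how firm that contract is, a verification step could pull every citation out of the Retriever's output with a single regular expression. This helper is not part of the repo, just a sketch of what a reviewer tool might do:&lt;/P&gt;

```python
import re

# Matches the Retriever's citation contract: [filename.ext, lines X-Y]
CITATION_RE = re.compile(r"\[([\w.\-]+), lines (\d+)-(\d+)\]")

def extract_citations(snippets: str) -> list[tuple[str, int, int]]:
    """Return (filename, start_line, end_line) for every citation found."""
    return [(name, int(a), int(b)) for name, a, b in CITATION_RE.findall(snippets)]

text = '[notes.md, lines 10-14]: "Foundry Local runs fully on-device."'
print(extract_citations(text))  # prints [('notes.md', 10, 14)]
```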
&lt;H3&gt;The Critic Agent&lt;/H3&gt;
&lt;P&gt;The Critic reviews the Retriever's work, identifying gaps and contradictions. This agent serves as a quality gate before the final report and can trigger feedback loops for iterative improvement:&lt;/P&gt;
&lt;PRE&gt;def create_critic(conn: FoundryConnection) -&amp;gt; ChatAgent:
    return ChatAgent(
        chat_client=_make_client(conn),
        name="Critic",
        instructions=(
            "You are a critical review agent. You receive a plan and extracted snippets. "
            "Your job:\n"
            "  1. Check for gaps — are any plan tasks unanswered?\n"
            "  2. Check for contradictions between snippets.\n"
            "  3. Suggest 1-2 specific improvements or missing details.\n"
            "Start your response with 'GAPS FOUND' if issues exist, or 'NO GAPS' if satisfied.\n"
            "Then output a short numbered list of issues (or say 'No issues found')."
        ),
    )
&lt;/PRE&gt;
&lt;P&gt;The Critic is instructed to output &lt;CODE&gt;GAPS FOUND&lt;/CODE&gt; or &lt;CODE&gt;NO GAPS&lt;/CODE&gt; at the start of its response. This structured output enables the orchestrator to detect when gaps exist and trigger the feedback loop—sending the gaps back to the Retriever for additional retrieval before re-running the Critic. This iterates up to 2 times before the Writer takes over, ensuring higher quality reports.&lt;/P&gt;
&lt;P&gt;Critics are essential for production systems. Without this review step, the Writer might produce confident-sounding reports with missing information or internal contradictions.&lt;/P&gt;
&lt;H3&gt;The Writer Agent&lt;/H3&gt;
&lt;P&gt;The Writer receives everything—original question, plan, extracted snippets, and critic review—then produces the final report:&lt;/P&gt;
&lt;PRE&gt;def create_writer(conn: FoundryConnection) -&amp;gt; ChatAgent:
    return ChatAgent(
        chat_client=_make_client(conn),
        name="Writer",
        instructions=(
            "You are the final report writer. You receive:\n"
            "  • The original question\n"
            "  • A plan, extracted snippets with citations, and a critic review\n\n"
            "Produce a clear, well-structured answer (3-5 paragraphs). "
            "Requirements:\n"
            "  • Cite sources using [filename.ext, lines X-Y] notation\n"
            "  • Address any gaps the critic raised (note if unresolvable)\n"
            "  • End with a one-sentence summary\n"
            "Do NOT fabricate citations — only use citations provided by the Retriever."
        ),
    )
&lt;/PRE&gt;
&lt;P&gt;The final instruction—"Do NOT fabricate citations"—is crucial for responsible AI. The Writer has access only to citations the Retriever provided, preventing hallucinated references that plague single-agent research systems.&lt;/P&gt;
&lt;H2&gt;Implementing Sequential Orchestration&lt;/H2&gt;
&lt;P&gt;With agents defined, the orchestrator connects them into a workflow. Sequential orchestration is the simpler pattern: each agent runs after the previous one completes, passing its output as input to the next agent.&lt;/P&gt;
&lt;P&gt;The implementation uses Python's &lt;CODE&gt;async/await&lt;/CODE&gt; for clean asynchronous execution:&lt;/P&gt;
&lt;PRE&gt;import asyncio
import time
from dataclasses import dataclass, field

@dataclass
class StepResult:
    """Captures one agent step for observability."""
    agent_name: str
    input_text: str
    output_text: str
    elapsed_sec: float

@dataclass
class WorkflowResult:
    """Final result of the entire orchestration run."""
    question: str
    steps: list[StepResult] = field(default_factory=list)
    final_report: str = ""

async def _run_agent(agent: ChatAgent, prompt: str) -&amp;gt; tuple[str, float]:
    """Execute a single agent and measure elapsed time."""
    start = time.perf_counter()
    response = await agent.run(prompt)
    elapsed = time.perf_counter() - start
    return response.content, elapsed
&lt;/PRE&gt;
&lt;P&gt;The &lt;CODE&gt;StepResult&lt;/CODE&gt; dataclass captures everything needed for observability: what went in, what came out, and how long it took. This information is invaluable for debugging and optimisation.&lt;/P&gt;
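&lt;P&gt;As one illustration of what the recorded steps enable (not code from the repo), a timing summary can be derived directly from the &lt;CODE&gt;StepResult&lt;/CODE&gt; list; the dataclass is re-declared here so the sketch runs standalone:&lt;/P&gt;

```python
from dataclasses import dataclass

@dataclass
class StepResult:  # re-declared so this sketch runs standalone
    agent_name: str
    input_text: str
    output_text: str
    elapsed_sec: float

def timing_summary(steps: list[StepResult]) -> str:
    """Render one line per agent plus a total, from the recorded steps."""
    lines = [f"{s.agent_name.ljust(10)} {s.elapsed_sec:6.2f}s" for s in steps]
    total = sum(s.elapsed_sec for s in steps)
    lines.append(f"{'TOTAL'.ljust(10)} {total:6.2f}s")
    return "\n".join(lines)

steps = [
    StepResult("Planner", "question", "plan", 1.52),
    StepResult("Retriever", "plan", "snippets", 3.08),
]
print(timing_summary(steps))
```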
&lt;P&gt;The sequential pipeline chains agents together, building context progressively:&lt;/P&gt;
&lt;PRE&gt;async def run_sequential_workflow(
    question: str,
    docs: LoadedDocuments,
    conn: FoundryConnection,
) -&amp;gt; WorkflowResult:
    wf = WorkflowResult(question=question)
    doc_block = docs.combined_text if docs.chunks else "(no documents provided)"
    
    # Step 1 — Plan
    planner = create_planner(conn)
    planner_prompt = f"User question: {question}\n\nAvailable documents:\n{doc_block}"
    plan_text, elapsed = await _run_agent(planner, planner_prompt)
    wf.steps.append(StepResult("Planner", planner_prompt, plan_text, elapsed))
    
    # Step 2 — Retrieve
    retriever = create_retriever(conn)
    retriever_prompt = f"Plan:\n{plan_text}\n\nDocuments:\n{doc_block}"
    snippets_text, elapsed = await _run_agent(retriever, retriever_prompt)
    wf.steps.append(StepResult("Retriever", retriever_prompt, snippets_text, elapsed))
    
    # Step 3 — Critique
    critic = create_critic(conn)
    critic_prompt = f"Plan:\n{plan_text}\n\nExtracted snippets:\n{snippets_text}"
    critique_text, elapsed = await _run_agent(critic, critic_prompt)
    wf.steps.append(StepResult("Critic", critic_prompt, critique_text, elapsed))
    
    # Step 4 — Write
    writer = create_writer(conn)
    writer_prompt = (
        f"Original question: {question}\n\n"
        f"Plan:\n{plan_text}\n\n"
        f"Extracted snippets:\n{snippets_text}\n\n"
        f"Critic review:\n{critique_text}"
    )
    report_text, elapsed = await _run_agent(writer, writer_prompt)
    wf.steps.append(StepResult("Writer", writer_prompt, report_text, elapsed))
    wf.final_report = report_text
    
    return wf
&lt;/PRE&gt;
&lt;P&gt;Each step receives all relevant context from previous steps. The Writer gets the most comprehensive prompt—original question, plan, snippets, and critique—enabling it to produce a well-informed final report.&lt;/P&gt;
&lt;H2&gt;Adding Concurrent Fan-Out and Feedback Loops&lt;/H2&gt;
&lt;P&gt;Sequential orchestration works well but can be slow. When tasks are independent—neither needs the other's output—running them in parallel saves time. The demo implements this with &lt;CODE&gt;asyncio.gather&lt;/CODE&gt;.&lt;/P&gt;
&lt;P&gt;Consider the Retriever and ToolAgent: both need the Planner's output, but neither depends on the other. Running them concurrently cuts the wait time roughly in half:&lt;/P&gt;
&lt;PRE&gt;async def run_concurrent_retrieval(
    plan_text: str,
    docs: LoadedDocuments,
    conn: FoundryConnection,
) -&amp;gt; tuple[str, str]:
    """Run Retriever and ToolAgent in parallel."""
    retriever = create_retriever(conn)
    tool_agent = create_tool_agent(conn)
    
    doc_block = docs.combined_text if docs.chunks else "(no documents)"
    
    retriever_prompt = f"Plan:\n{plan_text}\n\nDocuments:\n{doc_block}"
    tool_prompt = f"Analyse the following documents for word count and keywords:\n{doc_block}"
    
    # Execute both agents concurrently
    (snippets_text, r_elapsed), (tool_text, t_elapsed) = await asyncio.gather(
        _run_agent(retriever, retriever_prompt),
        _run_agent(tool_agent, tool_prompt),
    )
    
    return snippets_text, tool_text
&lt;/PRE&gt;
&lt;P&gt;The &lt;CODE&gt;asyncio.gather&lt;/CODE&gt; function runs both coroutines concurrently and returns when both complete. If the Retriever takes 3 seconds and the ToolAgent takes 1.5 seconds, the total wait is approximately 3 seconds rather than 4.5 seconds.&lt;/P&gt;
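&lt;P&gt;The timing claim is easy to verify with stub agents. This standalone sketch (arbitrary sleep durations standing in for inference calls) shows that &lt;CODE&gt;asyncio.gather&lt;/CODE&gt; waits for the slowest coroutine rather than the sum of both:&lt;/P&gt;

```python
import asyncio
import time

async def fake_agent(name: str, delay: float) -> str:
    """Stand-in for an agent call that takes `delay` seconds of inference."""
    await asyncio.sleep(delay)
    return name

async def main() -> float:
    start = time.perf_counter()
    # Both stub agents run concurrently: total wait tracks the slower one
    # (about 0.2s here), not the 0.3s sum.
    await asyncio.gather(fake_agent("Retriever", 0.2), fake_agent("ToolAgent", 0.1))
    return time.perf_counter() - start

elapsed = asyncio.run(main())
print(f"elapsed: {elapsed:.2f}s")
```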
&lt;H3&gt;Implementing the Feedback Loop&lt;/H3&gt;
&lt;P&gt;The most sophisticated orchestration pattern is the Critic–Retriever feedback loop. When the Critic identifies gaps in the retrieved information, the orchestrator sends them back to the Retriever for additional retrieval, then re-evaluates:&lt;/P&gt;
&lt;PRE&gt;async def run_critic_with_feedback(
    plan_text: str,
    snippets_text: str,
    docs: LoadedDocuments,
    conn: FoundryConnection,
    max_iterations: int = 2,
) -&amp;gt; tuple[str, str]:
    """
    Run Critic with feedback loop to Retriever.
    Returns (final_snippets, final_critique).
    """
    critic = create_critic(conn)
    retriever = create_retriever(conn)
    
    current_snippets = snippets_text
    
    for iteration in range(max_iterations):
        # Run Critic
        critic_prompt = f"Plan:\n{plan_text}\n\nExtracted snippets:\n{current_snippets}"
        critique_text, _ = await _run_agent(critic, critic_prompt)
        
        # Check if gaps were found
        if not critique_text.upper().startswith("GAPS FOUND"):
            return current_snippets, critique_text
        
        # Gaps found — send back to Retriever for more extraction
        gap_fill_prompt = (
            f"Previous snippets:\n{current_snippets}\n\n"
            f"Gaps identified:\n{critique_text}\n\n"
            f"Documents:\n{docs.combined_text}\n\n"
            "Extract additional relevant passages to fill these gaps."
        )
        additional_snippets, _ = await _run_agent(retriever, gap_fill_prompt)
        current_snippets = f"{current_snippets}\n\n--- Gap-fill iteration {iteration + 1} ---\n{additional_snippets}"
    
    # Max iterations reached — run final critique
    final_critique, _ = await _run_agent(critic, f"Plan:\n{plan_text}\n\nExtracted snippets:\n{current_snippets}")
    return current_snippets, final_critique
&lt;/PRE&gt;
&lt;P&gt;This feedback loop pattern significantly improves output quality. The Critic acts as a quality gate, and when standards aren't met, the system iteratively improves rather than producing incomplete results.&lt;/P&gt;
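&lt;P&gt;Because the loop's control flow depends only on the &lt;CODE&gt;GAPS FOUND&lt;/CODE&gt; / &lt;CODE&gt;NO GAPS&lt;/CODE&gt; prefix, its termination behaviour can be exercised in isolation with canned critic outputs. This standalone sketch mirrors the control flow above without any model calls:&lt;/P&gt;

```python
def run_feedback_loop(critic_responses: list[str], max_iterations: int = 2) -> tuple[int, str]:
    """Drive the Critic/Retriever loop with canned critic outputs.

    Returns (gap_fill_rounds, final_critique) so the termination
    behaviour can be checked without any model calls.
    """
    rounds = 0
    for i in range(max_iterations):
        critique = critic_responses[i]
        if not critique.upper().startswith("GAPS FOUND"):
            return rounds, critique  # Critic satisfied: exit early
        rounds += 1  # gaps found: a Retriever gap-fill pass would run here
    # Max iterations reached: one final critique, mirroring the real loop
    return rounds, critic_responses[max_iterations]

print(run_feedback_loop(["GAPS FOUND: missing pricing data", "NO GAPS"]))
# prints (1, 'NO GAPS')
```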
&lt;P&gt;The full workflow combines all three patterns—sequential where dependencies require it, concurrent where independence allows it, and feedback loops for quality assurance:&lt;/P&gt;
&lt;PRE&gt;async def run_full_workflow(
    question: str,
    docs: LoadedDocuments,
    conn: FoundryConnection,
) -&amp;gt; WorkflowResult:
    """
    End-to-end workflow showcasing THREE orchestration patterns:
      1. Planner runs first (sequential — must happen before anything else).
      2. Retriever + ToolAgent run concurrently (fan-out on independent tasks).
      3. Critic reviews with feedback loop (iterates with Retriever if gaps found).
      4. Writer produces final report (sequential — needs everything above).
    """
    wf = WorkflowResult(question=question)
    
    # Step 1: Planner (sequential)
    doc_block = docs.combined_text if docs.chunks else "(no documents provided)"
    planner_prompt = f"User question: {question}\n\nAvailable documents:\n{doc_block}"
    plan_text, elapsed = await _run_agent(create_planner(conn), planner_prompt)
    wf.steps.append(StepResult("Planner", planner_prompt, plan_text, elapsed))
    
    # Step 2: Concurrent fan-out (Retriever + ToolAgent)
    snippets_text, tool_text = await run_concurrent_retrieval(plan_text, docs, conn)
    
    # Step 3: Critic with feedback loop
    final_snippets, critique_text = await run_critic_with_feedback(
        plan_text, snippets_text, docs, conn
    )
    
    # Step 4: Writer (sequential — needs everything)
    writer_prompt = (
        f"Original question: {question}\n\n"
        f"Plan:\n{plan_text}\n\n"
        f"Snippets:\n{final_snippets}\n\n"
        f"Stats:\n{tool_text}\n\n"
        f"Critique:\n{critique_text}"
    )
    report_text, elapsed = await _run_agent(create_writer(conn), writer_prompt)
    wf.final_report = report_text
    
    return wf
&lt;/PRE&gt;
&lt;P&gt;This hybrid approach maximises both correctness and performance. Dependencies are respected, independent work happens in parallel, and quality is ensured through iterative feedback.&lt;/P&gt;
&lt;H2&gt;Implementing Tool Calling&lt;/H2&gt;
&lt;P&gt;Some agents benefit from deterministic tools rather than relying entirely on LLM generation. The ToolAgent demonstrates this pattern with two utility functions: word counting and keyword extraction.&lt;/P&gt;
&lt;P&gt;MAF supports tool calling through function declarations with Pydantic type annotations:&lt;/P&gt;
&lt;PRE&gt;from typing import Annotated
from pydantic import Field

def word_count(
    text: Annotated[str, Field(description="The text to count words in")]
) -&amp;gt; int:
    """Count words in a text string."""
    return len(text.split())

def extract_keywords(
    text: Annotated[str, Field(description="The text to extract keywords from")],
    top_n: Annotated[int, Field(description="Number of keywords to return")] = 5
) -&amp;gt; list[str]:
    """Extract most frequent words (simple implementation)."""
    words = text.lower().split()
    # Filter common words, count frequencies, return top N
    word_counts = {}
    for word in words:
        if len(word) &amp;gt; 3:  # Skip short words
            word_counts[word] = word_counts.get(word, 0) + 1
    sorted_words = sorted(word_counts.items(), key=lambda x: x[1], reverse=True)
    return [word for word, count in sorted_words[:top_n]]
&lt;/PRE&gt;
&lt;P&gt;The &lt;CODE&gt;Annotated&lt;/CODE&gt; type with &lt;CODE&gt;Field&lt;/CODE&gt; descriptions provides metadata that MAF uses to generate function schemas for the LLM. When the model needs to count words, it invokes the &lt;CODE&gt;word_count&lt;/CODE&gt; tool rather than attempting to count in its response (which LLMs notoriously struggle with).&lt;/P&gt;
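&lt;P&gt;The same introspection is available with nothing but the standard library. This stripped-down sketch (plain string metadata in place of a pydantic &lt;CODE&gt;Field&lt;/CODE&gt;, to stay dependency-free) shows how a framework can recover parameter descriptions from a signature alone:&lt;/P&gt;

```python
import typing

def word_count(text: typing.Annotated[str, "The text to count words in"]) -> int:
    """Count words deterministically in Python rather than in the model."""
    return len(text.split())

# include_extras=True preserves the Annotated metadata, which is what lets
# a framework build a tool schema from the function signature alone.
hints = typing.get_type_hints(word_count, include_extras=True)
print(hints["text"].__metadata__)  # prints ('The text to count words in',)
print(word_count("Foundry Local runs models on-device"))  # prints 5
```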
&lt;P&gt;The ToolAgent receives these functions in its constructor:&lt;/P&gt;
&lt;PRE&gt;def create_tool_agent(conn: FoundryConnection) -&amp;gt; ChatAgent:
    return ChatAgent(
        chat_client=_make_client(conn),
        name="ToolHelper",
        instructions=(
            "You are a utility agent. Use the provided tools to compute "
            "word counts or extract keywords when asked. Return the tool "
            "output directly — do not embellish."
        ),
        tools=[word_count, extract_keywords],
    )
&lt;/PRE&gt;
&lt;P&gt;This pattern—combining LLM reasoning with deterministic tools—produces more reliable results. The LLM decides when to use tools and how to interpret results, but the actual computation happens in Python where precision is guaranteed.&lt;/P&gt;
&lt;H2&gt;Running the Demo&lt;/H2&gt;
&lt;P&gt;With the architecture explained, here's how to run the demo yourself. Setup takes about five minutes.&lt;/P&gt;
&lt;H3&gt;Prerequisites&lt;/H3&gt;
&lt;P&gt;You'll need Python 3.10 or higher and Foundry Local installed on your machine. Install Foundry Local by following the instructions at &lt;A href="https://github.com/microsoft/Foundry-Local" target="_blank" rel="noopener"&gt;github.com/microsoft/Foundry-Local&lt;/A&gt;, then verify it works:&lt;/P&gt;
&lt;PRE&gt;foundry --help
&lt;/PRE&gt;
&lt;H3&gt;Installation&lt;/H3&gt;
&lt;P&gt;Clone the repository and set up a virtual environment:&lt;/P&gt;
&lt;PRE&gt;git clone https://github.com/leestott/agentframework--foundrylocal.git
cd agentframework--foundrylocal

python -m venv .venv

# Windows
.venv\Scripts\activate

# macOS / Linux
source .venv/bin/activate

pip install -r requirements.txt
# Windows
copy .env.example .env

# macOS / Linux
cp .env.example .env
&lt;/PRE&gt;
&lt;H3&gt;CLI Usage&lt;/H3&gt;
&lt;P&gt;Run the research workflow from the command line:&lt;/P&gt;
&lt;PRE&gt;python -m src.app "What are the key features of Foundry Local and how does it compare to cloud inference?" --docs ./data
&lt;/PRE&gt;
&lt;P&gt;You'll see agent-by-agent progress with timing information as each stage completes.&lt;/P&gt;
&lt;H3&gt;Web Interface&lt;/H3&gt;
&lt;P&gt;For a visual experience, launch the Flask-based web UI:&lt;/P&gt;
&lt;PRE&gt;python -m src.app.web
&lt;/PRE&gt;
&lt;P&gt;Open &lt;A href="http://localhost:5000" target="_blank" rel="noopener"&gt;http://localhost:5000&lt;/A&gt; in your browser. The web UI provides real-time streaming of agent progress, a visual pipeline showing both orchestration patterns, and an interactive demos tab showcasing tool calling capabilities.&lt;/P&gt;
&lt;H3&gt;CLI Options&lt;/H3&gt;
&lt;P&gt;The CLI supports several options for customisation:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;--docs&lt;/STRONG&gt;: Folder of local documents to search (default: ./data)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;--model&lt;/STRONG&gt;: Foundry Local model alias (default: qwen2.5-0.5b)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;--mode&lt;/STRONG&gt;: &lt;CODE&gt;full&lt;/CODE&gt; for sequential + concurrent, or &lt;CODE&gt;sequential&lt;/CODE&gt; for simpler pipeline&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;--log-level&lt;/STRONG&gt;: DEBUG, INFO, WARNING, or ERROR&lt;/LI&gt;
&lt;/UL&gt;
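&lt;P&gt;As a rough sketch of how these options map onto an &lt;CODE&gt;argparse&lt;/CODE&gt; surface (the names and defaults below simply mirror the list above; the repository's actual parser may differ in detail):&lt;/P&gt;

```python
import argparse

# Hypothetical sketch of the CLI described above; defaults mirror
# the option list, not the repository's actual source.
def build_parser():
    p = argparse.ArgumentParser(prog="src.app", description="Local research workflow")
    p.add_argument("question", help="Research question to investigate")
    p.add_argument("--docs", default="./data", help="Folder of local documents to search")
    p.add_argument("--model", default="qwen2.5-0.5b", help="Foundry Local model alias")
    p.add_argument("--mode", choices=["full", "sequential"], default="full",
                   help="full = sequential + concurrent; sequential = simpler pipeline")
    p.add_argument("--log-level", choices=["DEBUG", "INFO", "WARNING", "ERROR"],
                   default="INFO")
    return p

args = build_parser().parse_args(["Explain multi-agent benefits", "--mode", "sequential"])
print(args.model, args.mode)  # qwen2.5-0.5b sequential
```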
&lt;P&gt;For higher quality output, try larger models:&lt;/P&gt;
&lt;PRE&gt;python -m src.app "Explain multi-agent benefits" --docs ./data --model qwen2.5-7b
&lt;/PRE&gt;
&lt;H3&gt;Validate Tool/Function Calling&lt;/H3&gt;
&lt;P&gt;Run the dedicated tool calling demo to verify function calling works:&lt;/P&gt;
&lt;PRE&gt;python -m src.app.tool_demo
&lt;/PRE&gt;
&lt;P&gt;This tests direct tool function calls (&lt;CODE&gt;word_count&lt;/CODE&gt;, &lt;CODE&gt;extract_keywords&lt;/CODE&gt;), LLM-driven tool calling via the ToolAgent, and multi-tool requests in a single prompt.&lt;/P&gt;
&lt;H3&gt;Run Tests&lt;/H3&gt;
&lt;P&gt;Run the smoke tests to verify your setup:&lt;/P&gt;
&lt;PRE&gt;pip install pytest pytest-asyncio
pytest tests/ -v
&lt;/PRE&gt;
&lt;P&gt;The smoke tests check document loading, tool functions, and configuration—they do not require a running Foundry Local service.&lt;/P&gt;
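&lt;P&gt;Because the smoke tests exercise only deterministic pieces, they can run anywhere pytest does. A standalone sketch in the same spirit (an illustrative test file, not the repository's actual suite — &lt;CODE&gt;word_count&lt;/CODE&gt; is redefined inline to keep it self-contained):&lt;/P&gt;

```python
# Illustrative service-free smoke test: no Foundry Local required.
# word_count is redefined inline so the sketch stands alone.
def word_count(text):
    return len(text.split())

def test_word_count_basic():
    assert word_count("foundry local runs offline") == 4

def test_word_count_empty():
    assert word_count("") == 0

if __name__ == "__main__":
    test_word_count_basic()
    test_word_count_empty()
    print("smoke tests passed")
```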
&lt;H2&gt;Interactive Demos: Exploring MAF Capabilities&lt;/H2&gt;
&lt;P&gt;Beyond the research workflow, the web UI includes five interactive demos showcasing different MAF capabilities. Each demonstrates a specific pattern with suggested prompts and real-time results.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Weather Tools&lt;/STRONG&gt; demonstrates multi-tool calling with an agent that provides weather information, forecasts, city comparisons, and activity recommendations. The agent uses four different tools to construct comprehensive responses.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Math Calculator&lt;/STRONG&gt; shows precise calculation through tool calling. The agent uses arithmetic, percentage, unit conversion, compound interest, and statistics tools instead of attempting mental math—eliminating the calculation errors that plague LLM-only approaches.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Sentiment Analyser&lt;/STRONG&gt; performs structured text analysis, detecting sentiment, emotions, key phrases, and word frequency through lexicon-based tools. The results are deterministic and verifiable.&lt;/P&gt;
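&lt;P&gt;The deterministic flavour of such a tool is easy to picture. The lexicon and scoring below are invented for illustration, not the demo's actual word lists, but the shape is the point: the same input always yields the same counts, so the agent's claims can be checked against the tool output directly.&lt;/P&gt;

```python
# Minimal lexicon-based sentiment tool (illustrative lexicon, not the demo's).
POSITIVE = {"good", "great", "excellent", "love", "reliable"}
NEGATIVE = {"bad", "poor", "terrible", "hate", "buggy"}

def analyse_sentiment(text):
    """Deterministic sentiment: count lexicon hits, no LLM involved."""
    words = [w.strip(".,!?").lower() for w in text.split()]
    pos = sum(w in POSITIVE for w in words)
    neg = sum(w in NEGATIVE for w in words)
    if pos == neg:
        label = "neutral"
    elif pos == max(pos, neg):
        label = "positive"
    else:
        label = "negative"
    return {"label": label, "positive_hits": pos, "negative_hits": neg}

print(analyse_sentiment("The tooling is great and the docs are excellent."))
# {'label': 'positive', 'positive_hits': 2, 'negative_hits': 0}
```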
&lt;P&gt;&lt;STRONG&gt;Code Reviewer&lt;/STRONG&gt; analyses code for style issues, complexity problems, potential bugs, and improvement opportunities. This demonstrates how tool calling can extend AI capabilities into domain-specific analysis.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Multi-Agent Debate&lt;/STRONG&gt; showcases sequential orchestration with interdependent outputs. Three agents—one arguing for a position, one against, and a moderator—debate a topic. Each agent receives the previous agent's output, demonstrating how multi-agent systems can explore topics from multiple perspectives.&lt;/P&gt;
&lt;H2&gt;Troubleshooting&lt;/H2&gt;
&lt;P&gt;Common issues and their solutions:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;CODE&gt;foundry: command not found&lt;/CODE&gt;&lt;/STRONG&gt;: Install Foundry Local from &lt;A href="https://github.com/microsoft/Foundry-Local" target="_blank" rel="noopener"&gt;github.com/microsoft/Foundry-Local&lt;/A&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;CODE&gt;foundry-local-sdk is not installed&lt;/CODE&gt;&lt;/STRONG&gt;: Run &lt;CODE&gt;pip install foundry-local-sdk&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Model download is slow&lt;/STRONG&gt;: First download can be large. It's cached for future runs.&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;&lt;CODE&gt;No documents found&lt;/CODE&gt; warning&lt;/STRONG&gt;: Add &lt;CODE&gt;.txt&lt;/CODE&gt; or &lt;CODE&gt;.md&lt;/CODE&gt; files to the &lt;CODE&gt;--docs&lt;/CODE&gt; folder&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Agent output is low quality&lt;/STRONG&gt;: Try a larger model alias, e.g. &lt;CODE&gt;--model phi-3.5-mini&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Web UI won't start&lt;/STRONG&gt;: Ensure Flask is installed: &lt;CODE&gt;pip install flask&lt;/CODE&gt;&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Port 5000 in use&lt;/STRONG&gt;: Stop the conflicting service or set the &lt;CODE&gt;PORT&lt;/CODE&gt; environment variable, e.g. &lt;CODE&gt;PORT=8080&lt;/CODE&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Key Takeaways&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Multi-agent systems decompose complex tasks&lt;/STRONG&gt;: Specialised agents (Planner, Retriever, Critic, Writer) produce better results than single-agent approaches by focusing each agent on what it does best&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Local AI eliminates cloud dependencies&lt;/STRONG&gt;: Foundry Local provides on-device inference with automatic hardware acceleration, keeping all data on your machine&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;MAF simplifies agent development&lt;/STRONG&gt;: The &lt;CODE&gt;ChatAgent&lt;/CODE&gt; abstraction handles message threading, tool execution, and response parsing, letting you focus on agent behaviour&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Three orchestration patterns serve different needs&lt;/STRONG&gt;: Sequential pipelines maintain dependencies; concurrent fan-out parallelises independent work; feedback loops enable iterative quality improvement&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Feedback loops improve quality&lt;/STRONG&gt;: The Critic–Retriever feedback loop catches gaps and contradictions, iterating until quality standards are met rather than producing incomplete results&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Tool calling adds precision&lt;/STRONG&gt;: Deterministic functions for counting, calculation, and analysis complement LLM reasoning for more reliable results&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;The same patterns scale to production&lt;/STRONG&gt;: This demo architecture—bootstrapping, agent creation, orchestration—applies directly to real-world research and analysis systems&lt;/LI&gt;
&lt;/UL&gt;
&lt;H2&gt;Conclusion and Next Steps&lt;/H2&gt;
&lt;P&gt;The Local Research &amp;amp; Synthesis Desk demonstrates that sophisticated multi-agent AI systems don't require cloud infrastructure. With Microsoft Agent Framework for orchestration and Foundry Local for inference, you can build production-quality workflows that run entirely on your hardware.&lt;/P&gt;
&lt;P&gt;The architecture patterns shown here—specialised agents with clear roles, sequential pipelines for dependent tasks, concurrent fan-out for independent work, feedback loops for quality assurance, and tool calling for precision—form a foundation for building more sophisticated systems. Consider extending this demo with:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Additional agents&lt;/STRONG&gt; for fact-checking, summarisation, or domain-specific analysis&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Richer tool integrations&lt;/STRONG&gt; connecting to databases, APIs, or local services&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Human-in-the-loop&lt;/STRONG&gt; approval gates before producing final reports&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Different model sizes&lt;/STRONG&gt; for different agents based on task complexity&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Start with the demo, understand the patterns, then apply them to your own research and analysis challenges. The future of AI isn't just cloud models—it's intelligent systems that run wherever your data lives.&lt;/P&gt;
&lt;H2&gt;Resources&lt;/H2&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://github.com/leestott/agentframework--foundrylocal" target="_blank" rel="noopener"&gt;Local Research &amp;amp; Synthesis Desk Repository&lt;/A&gt; – Full source code with documentation and examples&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://foundrylocal.ai/" target="_blank" rel="noopener"&gt;Foundry Local&lt;/A&gt; – Official site for on-device AI inference&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/Foundry-Local" target="_blank" rel="noopener"&gt;Foundry Local GitHub Repository&lt;/A&gt; – Installation instructions and CLI reference&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/reference/reference-sdk?view=foundry-classic" target="_blank" rel="noopener"&gt;Foundry Local SDK Documentation&lt;/A&gt; – Python SDK reference on Microsoft Learn&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/agent-framework/" target="_blank" rel="noopener"&gt;Microsoft Agent Framework Documentation&lt;/A&gt; – Official MAF tutorials and user guides&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/agent-framework/user-guide/workflows/orchestrations/overview" target="_blank" rel="noopener"&gt;MAF Orchestrations Overview&lt;/A&gt; – Deep dive into workflow patterns&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://pypi.org/project/agent-framework-core/" target="_blank" rel="noopener"&gt;agent-framework-core on PyPI&lt;/A&gt; – Python package for MAF&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/Agent-Framework-Samples" target="_blank" rel="noopener"&gt;Agent Framework Samples&lt;/A&gt; – Additional MAF examples and patterns&lt;/LI&gt;
&lt;/UL&gt;</description>
      <pubDate>Thu, 12 Feb 2026 19:23:55 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/building-a-local-research-desk-multi-agent-orchestration/ba-p/4493965</guid>
      <dc:creator>Lee_Stott</dc:creator>
      <dc:date>2026-02-12T19:23:55Z</dc:date>
    </item>
    <item>
      <title>Deploying Custom Models with Microsoft Olive and Foundry Local</title>
      <link>https://techcommunity.microsoft.com/t5/educator-developer-blog/deploying-custom-models-with-microsoft-olive-and-foundry-local/ba-p/4489002</link>
      <description>&lt;P&gt;Over the past few weeks, we've been on quite a journey together. We started by exploring&amp;nbsp;&lt;A href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/phi-4-small-language-models-that-pack-a-punch/4464167" target="_blank" rel="noopener"&gt;what makes Phi-4 and small language models so compelling&lt;/A&gt;, then got our hands dirty &lt;A href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/running-phi-4-locally-with-microsoft-foundry-local-a-step-by-step-guide/4466304" target="_blank" rel="noopener"&gt;running models locally with Foundry Local&lt;/A&gt;. We leveled up with &lt;A href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/function-calling-with-small-language-models/4472720" target="_blank" rel="noopener"&gt;function calling&lt;/A&gt;, and most recently built a complete &lt;A href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/advanced-function-calling-and-multi-agent-systems-with-small-language-models-in-/4481180" target="_blank" rel="noopener"&gt;multi-agent quiz application&lt;/A&gt; with an orchestrator coordinating specialist agents.&lt;/P&gt;
&lt;P&gt;Our quiz app works great locally, but it relies on Foundry Local's catalog models — pre-optimized and ready to go. What happens when you want to deploy a model that isn't in the catalog? Maybe you've fine-tuned a model on domain-specific quiz data, or a new model just dropped on Hugging Face that you want to use. Today we'll take a model from Hugging Face, optimize it with Microsoft Olive, register it with Foundry Local, and run our quiz app against it. The same workflow applies to any model you might fine-tune for your specific use case.&lt;/P&gt;
&lt;H2&gt;Understanding Deployment Options&lt;/H2&gt;
&lt;P&gt;Before we dive in, let's understand the landscape of deployment options for SLM applications. There are several routes to deploying SLM applications depending on your target environment.&lt;/P&gt;
&lt;H3&gt;The Three Main Paths&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;vLLM&lt;/STRONG&gt; is the industry standard for cloud deployments — containerized, scalable, handles many concurrent users. Great for Azure VMs or Kubernetes.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Ollama&lt;/STRONG&gt; offers a middle ground — simpler than vLLM but still provides Docker support for easy sharing and deployment.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Foundry Local + Olive&lt;/STRONG&gt; is Microsoft's edge-first approach. Optimize your model with Olive, serve with Foundry Local or a custom server. Perfect for on-premise, offline, or privacy-focused deployments.&lt;/P&gt;
&lt;P&gt;In keeping with the edge-first theme that's run through this series, we'll focus on the Foundry Local path. We'll use Qwen 2.5-0.5B-Instruct — small enough to optimize quickly and demonstrate the full workflow. Think of it as a stand-in for a model you've fine-tuned on your own quiz data.&lt;/P&gt;
&lt;H2&gt;Prerequisites&lt;/H2&gt;
&lt;P&gt;You'll need:&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;STRONG&gt;Foundry Local&lt;/STRONG&gt; version 0.8.117 or later&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;Python 3.10+&lt;/STRONG&gt; for the quiz app (the foundry-local-sdk requires it)&lt;/LI&gt;
&lt;LI&gt;&lt;STRONG&gt;A separate Python 3.9 environment&lt;/STRONG&gt; for Olive (Olive 0.9.x has this requirement)&lt;/LI&gt;
&lt;LI&gt;The quiz app from &lt;A href="https://github.com/HamidOna/multi_agent_slm" target="_blank" rel="noopener"&gt;the previous article&lt;/A&gt;&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;Having two Python versions might seem odd, but it mirrors a common real-world setup: you optimize models in one environment and serve them in another. The optimization is a one-time step.&lt;/P&gt;
&lt;H3&gt;Installing Olive Dependencies&lt;/H3&gt;
&lt;P&gt;In your Python 3.9 environment:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;pip install olive-ai onnxruntime onnxruntime-genai pip install transformers&amp;gt;=4.45.0,&amp;lt;5.0.0&lt;/LI-CODE&gt;
&lt;P&gt;&lt;STRONG&gt;Important:&lt;/STRONG&gt;&amp;nbsp;Olive is not compatible with Transformers 5.x. You must use version 4.x.&lt;/P&gt;
&lt;H2&gt;Model Optimization with Olive&lt;/H2&gt;
&lt;P&gt;&lt;A href="https://github.com/microsoft/Olive" target="_blank" rel="noopener"&gt;Microsoft Olive&lt;/A&gt; is the bridge between a Hugging Face model and something Foundry Local can serve. It handles ONNX conversion, graph optimization, and quantization in a single command.&lt;/P&gt;
&lt;H3&gt;Understanding Quantization&lt;/H3&gt;
&lt;P&gt;Quantization reduces model size by converting weights from high-precision floating point to lower-precision integers:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Precision&lt;/th&gt;&lt;th&gt;Size Reduction&lt;/th&gt;&lt;th&gt;Quality&lt;/th&gt;&lt;th&gt;Best For&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;FP32&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;Baseline&lt;/td&gt;&lt;td&gt;Best&lt;/td&gt;&lt;td&gt;Development, debugging&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;FP16&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;50% smaller&lt;/td&gt;&lt;td&gt;Excellent&lt;/td&gt;&lt;td&gt;GPU inference with plenty of VRAM&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;INT8&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;75% smaller&lt;/td&gt;&lt;td&gt;Very Good&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Balanced production&lt;/STRONG&gt;&amp;nbsp;&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;INT4&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;87.5% smaller&lt;/td&gt;&lt;td&gt;Good&lt;/td&gt;&lt;td&gt;Edge devices, resource-constrained&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;col style="width: 25.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;We'll use &lt;STRONG&gt;INT4&lt;/STRONG&gt; to demonstrate the maximum compression. For production with better quality, consider &lt;STRONG&gt;INT8&lt;/STRONG&gt; — simply change &lt;CODE&gt;--precision int4&lt;/CODE&gt; to &lt;CODE&gt;--precision int8&lt;/CODE&gt; in the commands below.&lt;/P&gt;
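&lt;P&gt;A quick back-of-envelope check makes the table concrete for a 0.5B-parameter model. Note this counts raw weights only; real ONNX exports keep embeddings and some tensors at higher precision, so on-disk size runs larger than these figures:&lt;/P&gt;

```python
# Weights-only size estimate for a 0.5B-parameter model at each precision.
# Real exports are larger: some tensors stay at higher precision.
PARAMS = 0.5e9

def weights_gb(bits_per_weight):
    return PARAMS * bits_per_weight / 8 / 1e9

for name, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("INT4", 4)]:
    print(f"{name}: {weights_gb(bits):.2f} GB")
# FP32: 2.00 GB
# FP16: 1.00 GB
# INT8: 0.50 GB
# INT4: 0.25 GB
```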
&lt;H3&gt;Running the Optimization&lt;/H3&gt;
&lt;P&gt;The optimization script at &lt;CODE&gt;scripts/optimize_model.py&lt;/CODE&gt; handles two things: downloading the model locally (to avoid authentication issues), then running Olive.&lt;/P&gt;
&lt;P&gt;The download step is important. The ONNX Runtime GenAI model builder internally requests Hugging Face authentication even for public models. Rather than configuring tokens, we download the model first with &lt;CODE&gt;token=False&lt;/CODE&gt;, then point Olive at the local path:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;from huggingface_hub import snapshot_download local_path = snapshot_download("Qwen/Qwen2.5-0.5B-Instruct", token=False)&lt;/LI-CODE&gt;
&lt;P&gt;Then the Olive command runs against the local copy:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;cmd = [ sys.executable, "-m", "olive", "auto-opt", "--model_name_or_path", local_path, "--trust_remote_code", "--output_path", "models/qwen2.5-0.5b-int4", "--device", "cpu", "--provider", "CPUExecutionProvider", "--precision", "int4", "--use_model_builder", "--use_ort_genai", "--log_level", "1", ]&lt;/LI-CODE&gt;
&lt;P&gt;Key flags: --precision int4 quantizes weights to 4-bit integers, --use_model_builder reads each transformer layer and exports it to ONNX, and --use_ort_genai outputs in the format Foundry Local consumes.&lt;/P&gt;
&lt;P&gt;Run it:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;python scripts/optimize_model.py&lt;/LI-CODE&gt;
&lt;P&gt;This process takes about a minute. When complete, you'll see the output directory structure.&lt;/P&gt;
&lt;LI-CODE lang="markdown"&gt;models/qwen2.5-0.5b-int4/model/ ├── model.onnx # ONNX graph (162 KB) ├── model.onnx.data # Quantized INT4 weights (823 MB) ├── genai_config.json # ONNX Runtime GenAI config ├── tokenizer.json # Tokenizer vocabulary (11 MB) ├── vocab.json # Token-to-ID map (2.7 MB) ├── merges.txt # BPE merges (1.6 MB) ├── tokenizer_config.json ├── config.json ├── generation_config.json ├── special_tokens_map.json └── added_tokens.json&lt;/LI-CODE&gt;
&lt;P&gt;Total size: approximately &lt;STRONG&gt;838MB&lt;/STRONG&gt; — a significant reduction from the original, while maintaining usable quality for structured tasks like quiz generation.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Registering with Foundry Local&lt;/H2&gt;
&lt;P&gt;With the model optimized, we need to register it with Foundry Local. Unlike cloud model registries, there's no CLI command — you place files in the right directory and Foundry discovers them automatically.&lt;/P&gt;
&lt;H3&gt;Foundry's Model Registry&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;foundry cache cd # Windows: C:\Users\&amp;lt;username&amp;gt;\.foundry\cache\ # macOS/Linux: ~/.foundry/cache/&lt;/LI-CODE&gt;
&lt;P&gt;Foundry organizes models by publisher:&lt;/P&gt;
&lt;LI-CODE lang="markdown"&gt;.foundry/cache/models/ ├── foundry.modelinfo.json ← catalog of official models ├── Microsoft/ ← pre-optimized Microsoft models │ ├── qwen2.5-7b-instruct-cuda-gpu-4/ │ ├── Phi-4-cuda-gpu-1/ │ └── ... └── Custom/ ← your models go here&lt;/LI-CODE&gt;&lt;img /&gt;
&lt;H4&gt;The Registration Script&lt;/H4&gt;
&lt;P&gt;The script at &lt;CODE&gt;scripts/register_model.sh&lt;/CODE&gt; does two things: it copies all model files into the Foundry cache, and it creates the &lt;CODE&gt;inference_model.json&lt;/CODE&gt; configuration file.&lt;/P&gt;
&lt;P&gt;The critical file is &lt;CODE&gt;inference_model.json&lt;/CODE&gt; — without it, Foundry won't recognize your model:&lt;/P&gt;
&lt;LI-CODE lang="json"&gt;{ "Name": "qwen-quiz-int4", "PromptTemplate": { "system": "&amp;lt;|im_start|&amp;gt;system\n{Content}&amp;lt;|im_end|&amp;gt;", "user": "&amp;lt;|im_start|&amp;gt;user\n{Content}&amp;lt;|im_end|&amp;gt;", "assistant": "&amp;lt;|im_start|&amp;gt;assistant\n{Content}&amp;lt;|im_end|&amp;gt;", "prompt": "&amp;lt;|im_start|&amp;gt;user\n{Content}&amp;lt;|im_end|&amp;gt;\n&amp;lt;|im_start|&amp;gt;assistant" } }&lt;/LI-CODE&gt;
&lt;P&gt;The PromptTemplate defines the ChatML format that Qwen 2.5 expects. The {Content} placeholder is where Foundry injects the actual message content at runtime. If you were deploying a Llama or Phi model, you'd use their respective prompt templates.&lt;/P&gt;
&lt;P&gt;Run the registration:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;scripts/register_model.sh&lt;/LI-CODE&gt;
&lt;H3&gt;Verify Registration&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;foundry cache ls&lt;/LI-CODE&gt;&lt;img /&gt;
&lt;H3&gt;&amp;nbsp;&lt;/H3&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H3&gt;Test the Model&lt;/H3&gt;
&lt;LI-CODE lang="bash"&gt;foundry model run qwen-quiz-int4&lt;/LI-CODE&gt;&lt;img /&gt;
&lt;P&gt;The model loads via ONNX Runtime on CPU. Try a simple prompt to verify it responds.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;H2&gt;Integrating with the Quiz App&lt;/H2&gt;
&lt;P&gt;Here's where things get interesting. The application-level change is one line in &lt;CODE&gt;utils/foundry_client.py&lt;/CODE&gt;:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;# Before: DEFAULT_MODEL_ALIAS = "qwen2.5-7b-instruct-cuda-gpu" # After: DEFAULT_MODEL_ALIAS = "qwen-quiz-int4"&lt;/LI-CODE&gt;
&lt;P&gt;But that one line surfaced two issues worth understanding.&lt;/P&gt;
&lt;H3&gt;Issue 1: The SDK Can't See Custom Models&lt;/H3&gt;
&lt;P&gt;The Foundry Local Python SDK resolves models by looking them up in the official catalog — a JSON file of Microsoft-published models. Custom models in the &lt;CODE&gt;Custom/&lt;/CODE&gt; directory aren't in that catalog, so &lt;CODE&gt;FoundryLocalManager("qwen-quiz-int4")&lt;/CODE&gt; throws a "model not found" error despite &lt;CODE&gt;foundry cache ls&lt;/CODE&gt; and &lt;CODE&gt;foundry model run&lt;/CODE&gt; both working perfectly.&lt;/P&gt;
&lt;P&gt;The fix in &lt;CODE&gt;foundry_client.py&lt;/CODE&gt; is a dual code path. It tries the SDK first (which works for catalog models), and when that fails with a "not found in catalog" error, it falls back to discovering the running service endpoint directly:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def _discover_endpoint(): """Discover running Foundry service endpoint via CLI.""" result = subprocess.run( ["foundry", "service", "status"], capture_output=True, text=True, timeout=10 ) match = re.search(r"(http://\S+?)(?:/openai)?/status", result.stdout) if not match: raise ConnectionError( "Foundry service is not running.\n" f"Start it with: foundry model run {DEFAULT_MODEL_ALIAS}" ) return match.group(1)&lt;/LI-CODE&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;The workflow becomes two terminals:&lt;/P&gt;
&lt;P&gt;Terminal 1: &lt;CODE&gt;foundry model run qwen-quiz-int4&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;Terminal 2: &lt;CODE&gt;python main.py&lt;/CODE&gt;&lt;/P&gt;
&lt;P&gt;The client auto-discovers the endpoint and connects. For catalog models, the existing FoundryLocalManager path works unchanged.&lt;/P&gt;
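&lt;P&gt;Once the endpoint is known, any OpenAI-compatible client can talk to it. A minimal sketch of the request shape (the base URL and route here are illustrative; use the value the discovery step actually returns and check the path your service reports):&lt;/P&gt;

```python
# Sketch of an OpenAI-compatible chat request against the discovered endpoint.
# Base URL is illustrative; substitute the value _discover_endpoint() returns.
base = "http://localhost:5273"
url = base + "/v1/chat/completions"

payload = {
    "model": "qwen-quiz-int4",
    "messages": [{"role": "user", "content": "Write one quiz question about ONNX."}],
    "temperature": 0.7,
}
print(url)
```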
&lt;H3&gt;Issue 2: Tool Calling Format&lt;/H3&gt;
&lt;P&gt;For catalog models, Foundry's server-side middleware intercepts &lt;CODE&gt;&amp;lt;tool_call&amp;gt;&lt;/CODE&gt; tags in the model's output and converts them into structured &lt;CODE&gt;tool_calls&lt;/CODE&gt; objects in the API response. This is configured via metadata in &lt;CODE&gt;foundry.modelinfo.json&lt;/CODE&gt;.&lt;/P&gt;
&lt;P&gt;For custom models, those metadata fields aren't recognized — Foundry ignores them in &lt;CODE&gt;inference_model.json&lt;/CODE&gt;. The &lt;CODE&gt;&amp;lt;tool_call&amp;gt;&lt;/CODE&gt; tags pass through as raw text in &lt;CODE&gt;response.choices[0].message.content&lt;/CODE&gt;.&lt;/P&gt;
&lt;P&gt;Since our custom model outputs the exact same &lt;CODE&gt;&amp;lt;tool_call&amp;gt;&lt;/CODE&gt; format, we added a small fallback parser in &lt;CODE&gt;agents/base_agent.py&lt;/CODE&gt; — the same pattern we explored in our &lt;A href="https://techcommunity.microsoft.com/blog/educatordeveloperblog/function-calling-with-small-language-models/4472720" target="_blank" rel="noopener"&gt;function calling article&lt;/A&gt;. After each model response, if &lt;CODE&gt;tool_calls&lt;/CODE&gt; is &lt;CODE&gt;None&lt;/CODE&gt;, we scan the content for tags:&lt;/P&gt;
&lt;LI-CODE lang="python"&gt;def _parse_text_tool_calls(content: str) -&amp;gt; list: """Parse &amp;lt;tool_call&amp;gt;...&amp;lt;/tool_call&amp;gt; tags from model output.""" blocks = re.findall(r"&amp;lt;tool_call&amp;gt;\s*(\{.*?\})\s*&amp;lt;/tool_call&amp;gt;", content, re.DOTALL) calls = [] for block in blocks: try: data = json.loads(block) calls.append(_TextToolCall(data["name"], json.dumps(data.get("arguments", {})))) except (json.JSONDecodeError, KeyError): continue return calls&lt;/LI-CODE&gt;
&lt;P&gt;The model's behavior is identical; only the parsing location changes — from server-side (Foundry middleware) to client-side (our code).&lt;/P&gt;
&lt;H3&gt;Testing the Deployment&lt;/H3&gt;
&lt;P&gt;With the model running in one terminal, start the quiz app in another.&lt;/P&gt;
&lt;P&gt;Terminal 1:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;foundry model run qwen-quiz-int4&lt;/LI-CODE&gt;
&lt;P&gt;Terminal 2:&lt;/P&gt;
&lt;LI-CODE lang="bash"&gt;cd multi_agents_slm &amp;amp;&amp;amp; python main.py&lt;/LI-CODE&gt;
&lt;H3&gt;Test the Full Flow&lt;/H3&gt;
&lt;P&gt;&lt;STRONG&gt;Generate a quiz&lt;/STRONG&gt; and inspect the output.&lt;/P&gt;
&lt;P&gt;The orchestrator successfully calls the generate_new_quiz tool, and the QuizGeneratorAgent produces well-structured quiz JSON.&lt;/P&gt;
&lt;H3&gt;Model Limitations&lt;/H3&gt;
&lt;P&gt;The 0.5B INT4 model occasionally struggles with complex reasoning or basic arithmetic. This is expected from such a small, heavily quantized model. For production use cases requiring higher accuracy, use Qwen 2.5-1.5B or Qwen 2.5-7B for better quality, or use INT8 quantization instead of INT4. The deployment workflow remains identical — just change the model name and precision in the optimization script.&lt;/P&gt;
&lt;H2&gt;What You've Accomplished&lt;/H2&gt;
&lt;P&gt;Take a moment to appreciate the complete journey across this series:&lt;/P&gt;
&lt;DIV class="styles_lia-table-wrapper__h6Xo9 styles_table-responsive__MW0lN"&gt;&lt;table border="1" style="border-width: 1px;"&gt;&lt;thead&gt;&lt;tr&gt;&lt;th&gt;Article&lt;/th&gt;&lt;th&gt;What You Learned&lt;/th&gt;&lt;/tr&gt;&lt;/thead&gt;&lt;tbody&gt;&lt;tr&gt;&lt;td&gt;1. Phi-4 Introduction&lt;/td&gt;&lt;td&gt;Why SLMs matter, performance vs size tradeoffs&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;2. Running Locally&lt;/td&gt;&lt;td&gt;Foundry Local setup, basic inference&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;3. Function Calling&lt;/td&gt;&lt;td&gt;Tool use, external API integration&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;4. Multi-Agent Systems&lt;/td&gt;&lt;td&gt;Orchestration, specialist agents&lt;/td&gt;&lt;/tr&gt;&lt;tr&gt;&lt;td&gt;&lt;STRONG&gt;5. Deployment&lt;/STRONG&gt;&lt;/td&gt;&lt;td&gt;&lt;STRONG&gt;Olive optimization, Foundry Local registration, custom model deployment&lt;/STRONG&gt;&lt;/td&gt;&lt;/tr&gt;&lt;/tbody&gt;&lt;colgroup&gt;&lt;col style="width: 50.00%" /&gt;&lt;col style="width: 50.00%" /&gt;&lt;/colgroup&gt;&lt;/table&gt;&lt;/DIV&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;
&lt;P&gt;You now have end-to-end skills for building production SLM applications: understanding the landscape, local development with Foundry Local, agentic applications with function calling, multi-agent architectures, model optimization with Olive, and deploying custom models to the edge.&lt;/P&gt;
&lt;H2&gt;Where to Go From Here&lt;/H2&gt;
&lt;P&gt;The logical next step is fine-tuning for your domain. Medical quiz tutors trained on USMLE questions, legal assistants trained on case law, company onboarding bots trained on internal documentation — use the same Olive workflow to optimize and deploy your fine-tuned model. The same ONNX model we registered with Foundry Local could also run on mobile devices via ONNX Runtime Mobile, or be containerized for server-side edge deployment.&lt;/P&gt;
&lt;P&gt;The full source code, including the optimization and registration scripts, is available in &lt;A href="https://github.com/HamidOna/multi_agent_slm" target="_blank" rel="noopener"&gt;the GitHub repository&lt;/A&gt;.&lt;/P&gt;
&lt;P&gt;&lt;STRONG&gt;Resources:&lt;/STRONG&gt;&lt;/P&gt;
&lt;UL&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/Olive" target="_blank" rel="noopener"&gt;Microsoft Olive&lt;/A&gt; — Model optimization toolkit&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/" target="_blank" rel="noopener"&gt;Foundry Local Documentation&lt;/A&gt; — Setup and CLI reference&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://learn.microsoft.com/en-us/azure/ai-foundry/foundry-local/how-to/how-to-compile-hugging-face-models" target="_blank" rel="noopener"&gt;Compiling Hugging Face models for Foundry Local&lt;/A&gt; — Official guide&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/onnxruntime-genai" target="_blank" rel="noopener"&gt;ONNX Runtime GenAI&lt;/A&gt; — Powers Foundry Local's inference&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/microsoft/edgeai-for-beginners" target="_blank" rel="noopener"&gt;Edge AI for Beginners&lt;/A&gt; — Microsoft's 8-module Edge AI curriculum&lt;/LI&gt;
&lt;LI&gt;&lt;A href="https://github.com/HamidOna/multi_agent_slm" target="_blank" rel="noopener"&gt;Quiz App Source Code&lt;/A&gt; — Full repository with deployment scripts&lt;/LI&gt;
&lt;/UL&gt;
&lt;P&gt;This series has been a joy to write. I'd love to see what you build — share your projects in the comments, and don't hesitate to open issues on the GitHub repo if you encounter challenges.&lt;/P&gt;
&lt;P&gt;Until next time — keep building, keep optimizing, and keep pushing what's possible with local AI.&lt;/P&gt;
&lt;P&gt;&amp;nbsp;&lt;/P&gt;</description>
      <pubDate>Wed, 11 Feb 2026 16:55:06 GMT</pubDate>
      <guid>https://techcommunity.microsoft.com/t5/educator-developer-blog/deploying-custom-models-with-microsoft-olive-and-foundry-local/ba-p/4489002</guid>
      <dc:creator>Abdulhamid_Onawole</dc:creator>
      <dc:date>2026-02-11T16:55:06Z</dc:date>
    </item>
  </channel>
</rss>

