foundry
25 TopicsBuilding a hands-free voice concierge with Microsoft Foundry Voice Live and a Hosted Agent
This post walks through a small, working sample that wires the browser microphone to Azure AI Speech Voice Live, binds the realtime session to a Foundry hosted agent, and lets the agent answer travel questions using tool calls. The full source, infrastructure, and labs live in the repository linked at the end. Why this combination matters Voice user interfaces have historically been hard to build well. Streaming audio, partial transcripts, barge-in, voice activity detection, tool dispatch, and audio playback have traditionally meant stitching together five or six services. The combination of Voice Live and a Foundry hosted agent collapses that into one realtime WebSocket session with a single binding field. Voice Live owns the audio loop: speech to text, neural text to speech, semantic turn detection, noise suppression, and echo cancellation. The Foundry hosted agent owns the brain: instructions, memory, model selection, evaluators, and tool calling. The link between them is one query parameter on the WebSocket URL. What this means in practice: the browser never sees a model API key, never instantiates a tool, and never owns the agent prompt. The browser does microphone capture and audio playback. Everything else lives server-side. The scenario The sample is called Contoso Travel Concierge. The user is mid-journey, hands and eyes busy, and wants to ask things like: What is the weather in Tokyo this weekend? Is BA005 from Heathrow on time? What time is check-in at the Marriott Marquis? Each question triggers a tool call on the hosted agent. The reply is short, speakable, and synthesised back to the user in under a second on a warm connection. Architecture There are four moving parts. Three of them are managed Azure services. Only the broker is your code. Browser client – captures PCM16 audio at 24 kHz and streams it over a WebSocket to the broker. Plays back audio chunks the broker forwards from Voice Live. Session broker (FastAPI) – authenticates to Azure with DefaultAzureCredential , builds the Voice Live WebSocket URL with a short-lived bearer token, and relays frames in both directions. Voice Live – the Azure AI Speech realtime endpoint. Transcribes the user, hands the text to the bound agent, and synthesises the agent’s reply. Foundry hosted agent – a prompt-kind agent in Azure AI Foundry with instructions, tool definitions, and the microsoft.voice-live.enabled metadata flag set to true . Two design choices are worth calling out. The broker is small on purpose. It does authentication, URL construction, and WebSocket relay. It does not transcode audio, run business logic, or hold conversation state. Voice Live and the agent already do those things well. The agent binding is a URL query parameter, not an SDK call. There is no per-turn HTTP request to the agent runtime. Voice Live opens a session against the agent once and streams turns through it for the lifetime of the WebSocket. That is what keeps latency low. The Voice Live URL contract This is the single most important thing to get right. The public Microsoft sample that ships under liupeirong/ai-foundry-voice-agent targets a different URL shape ( services.ai.azure.com host, agent-id + agent-access-token parameters, an Authorization header). That shape is rejected by Foundry resources that expose voice-live-enabled agents. The shape below is the one the portal itself uses, and the one this sample dials. Three details cause most failures: The host must be <resource>.cognitiveservices.azure.com , not services.ai.azure.com . The broker rewrites this automatically from VOICE_LIVE_ENDPOINT . The bearer token travels in the authorization query parameter, URL-encoded, with a literal Bearer prefix and a + (or %20 ) before the token. No Authorization header is sent. agent-name and model are both the agent’s display name. agent-version is empty when you want the latest published version. Walkthrough: from clone to spoken reply Prerequisites Python 3.11 or later (the sample is developed on 3.13). The Azure CLI, signed in with az login --tenant <your-tenant-id> . An Azure AI Foundry project in a Voice Live region ( eastus2 , swedencentral , or westus2 ). A deployed prompt-kind agent in that project with Enable Voice Live turned on. The Cognitive Services User role on the Foundry resource for the identity the broker will use. Configure the broker Copy .env.sample to .env and fill in four values: AZURE_AI_PROJECT_ENDPOINT=https://<your-resource>.services.ai.azure.com AZURE_AI_PROJECT_NAME=<your-foundry-project-name> VOICE_LIVE_ENDPOINT=wss://<your-resource>.services.ai.azure.com/voice-live/realtime VOICE_LIVE_API_VERSION=2025-10-01 FOUNDRY_AGENT_ID=<your-agent-name> The agent name is what the Foundry portal shows on the agent card. The broker uses it for both the agent-name and model query parameters. Install and run python -m venv .venv .\.venv\Scripts\Activate.ps1 pip install -r requirements.txt .\scripts\start-local.ps1 The broker exposes three endpoints: GET /healthz – liveness probe. GET /config – returns the session.update the browser sends as its first frame. WS /ws – the bi-directional relay to Voice Live. Smoke test .\scripts\test-session.ps1 A successful run prints: [OK] /ws upgraded -> sent session.update <- {"type":"session.created",…} <- {"type":"session.updated",…} [OK] session.updated received -- E2E works This confirms the entire chain: local broker, DefaultAzureCredential token, Foundry Portal URL shape, Voice Live handshake, and the bound agent acknowledging the session. Open the browser UI Browse to http://localhost:8000/ , click Start talking, and ask one of the sample questions. Transcripts appear in real time and the spoken reply plays back through the audio context. Inside the broker The relay logic is tiny – the heavy lifting is the URL construction. The function below is the canonical reference; copy it if you are porting the pattern to another language. def build_voice_live_ws_url(agent_access_token: str) -> str: """ Build the Foundry Portal style Voice Live WebSocket URL. Auth lives in the query string only. No Authorization header is sent. """ host = _ws_host_from_endpoint(VOICE_LIVE_ENDPOINT) qs = urlencode( { "trafficType": "FoundryPortal", "agent-name": FOUNDRY_AGENT_ID, "agent-version": "", "agent-project-name": AZURE_AI_PROJECT_NAME, "api-version": VOICE_LIVE_API_VERSION, "model": FOUNDRY_AGENT_ID, "client-request-id": str(uuid.uuid4()), "authorization": f"Bearer {agent_access_token}", }, quote_via=quote, ) return f"wss://{host}/voice-live/realtime?{qs}" The relay itself is a pair of asyncio tasks: one forwarding browser frames upstream, one forwarding Voice Live frames back. Audio bytes are passed straight through – the broker never decodes them. Deploying the hosted agent The most reliable way to create a voice-live-enabled agent is the Foundry portal. Agents created via the Assistants v2 SDK do not carry the required metadata by default and will be rejected by the Voice Live URL shape above. The portal steps are: Open the Foundry project, go to Agents, and click New agent. Choose Prompt agent as the kind, name it (for example travel-concierge ), and pick a model deployment. Paste the contents of agent/src/prompts/system.txt into the instructions box. On the Voice tab, switch Enable Voice Live on. This is what sets the microsoft.voice-live.enabled = true metadata. Add the three tools ( get_weather , get_flight_status , get_hotel_info ) from agent/agent.yaml on the Tools tab. Publish the version and write the agent name back to .env as FOUNDRY_AGENT_ID . The full deployment guide, including how to host the broker on Azure Container Apps with a managed identity, is in docs/deployment.md in the repository. Three lessons from getting this to production 1. Voice output must be written for speech, not for screens Foundry agents tend to format answers in markdown with citations like ([data.jma.go.jp](https://…)) . When Voice Live synthesises that text, the user hears the URL read aloud, character by character. The fix is to write the agent instructions so the spoken text never contains URLs, markdown, or symbols. A short block at the end of the agent instructions does the job: Voice output rules - This output is read aloud by TTS. Never include URLs, domain names, or citation markers like "(source.com)" in your reply. Cite by speakable source name only. - Never use markdown for formatting. No asterisks, brackets, backticks, bullets, or hashes. Write in plain spoken sentences. - Keep numbers speakable: say "thirty degrees Celsius", not "30C / 86F". - Keep replies under about 40 words unless the user asks for detail. The browser transcript can still render markdown for the eyes. The sample does so with a small, escaping markdown renderer that whitelists bold, italic, code, and http(s) links only, so the same agent reply looks polished on screen even though the spoken version contains none of it. 2. Identity is simpler than it looks The broker uses DefaultAzureCredential and requests the https://ai.azure.com/.default scope. Locally that resolves to your az login credentials. In Azure Container Apps it resolves to the user-assigned managed identity. In both cases the only role assignment you need on the Foundry account is Cognitive Services User. There is no API key path on the working URL shape – it is bearer tokens all the way down. 3. The wrong sample wastes a day If you start from the public liupeirong/ai-foundry-voice-agent repository against a portal-provisioned voice-live agent, the WebSocket either returns HTTP 400 or closes silently with code 1006. The cause is the URL shape, not your code. The reference probe in scripts/probe_portal_shape.py is the single source of truth for the working contract – keep it as a regression test. Responsible AI and security notes Credentials never reach the browser. Tokens are minted server-side and travel only on the upstream Voice Live URL. No secrets in source. The .env file is gitignored. The .env.sample contains only placeholders. Markdown rendering is escape-first. The browser HTML-escapes the agent reply before applying its small markdown whitelist, and links are restricted to http(s) URLs so the rule cannot emit javascript: hrefs. Tool calls are auditable. Every turn shows up as a run in the Foundry portal under the agent, with the prompt, model output, and tool inputs and outputs visible for review. Voice biometric considerations. If you plan to handle account verification by voice, plug in dedicated speaker recognition rather than relying on the conversational model. Key takeaways Voice Live plus a Foundry hosted agent is a session-level integration, not an API integration. One URL, one binding field, one WebSocket. The browser is a thin client. Authentication, URL construction, and relay all live in a small FastAPI broker. Get the URL shape right ( cognitiveservices.azure.com , token in the query string, agent-name equals model equals the agent display name) and the rest is plumbing. Use the Foundry portal to create the agent so the voice-live metadata is set correctly. Write agent instructions for the ear, not the eye, then layer screen formatting on top in the browser. Get the code and try it Repository: github.com/microsoft/foundry-agent-voice-mode-sample Deployment guide: docs/deployment.md in the repository. Labs: three progressive workshops under labs/ – basic voice, adding tools, and binding a hosted agent. Reference docs: Voice Live in Azure AI Speech and Agents in Microsoft Foundry. If you build something on top of this pattern, open an issue or pull request on the repository. The sample is intentionally small so it stays easy to fork.207Views0likes0CommentsModel Mondays S2E11: Exploring Speech AI in Azure AI Foundry
1. Weekly Highlights This week’s top news in the Azure AI ecosystem included: Lakuna — Copilot Studio Agent for Product Teams: A hackathon project built with Copilot Studio and Azure AI Foundry, Lakuna analyzes your requirements and docs to surface hidden assumptions, helping teams reflect, test, and reduce bias in product planning. Azure ND H200 v5 VMs for AI: Azure Machine Learning introduced ND H200 v5 VMs, featuring NVIDIA H200 GPUs (over 1TB GPU memory per VM!) for massive models, bigger context windows, and ultra-fast throughput. Agent Factory Blog Series: The next wave of agentic AI is about extensibility: plug your agents into hundreds of APIs and services using Model Connector Protocol (MCP) for portable, reusable tool integrations. GPT-5 Tool Calling on Azure AI Foundry: GPT-5 models now support free-form tool calling—no more rigid JSON! Output SQL, Python, configs, and more in your preferred format for natural, flexible workflows. Microsoft a Leader in 2025 Gartner Magic Quadrant: Azure was again named a leader for Cloud Native Application Platforms—validating its end-to-end runway for AI, microservices, DevOps, and more. 2. Spotlight On: Azure AI Foundry Speech Playground The main segment featured a live demo of the new Azure AI Speech Playground (now part of Foundry), showing how developers can experiment with and deploy cutting-edge voice, transcription, and avatar capabilities. Key Features & Demos: Speech Recognition (Speech-to-Text): Try real-time transcription directly in the playground—recognizing natural speech, pauses, accents, and domain terms. Batch and Fast transcription options for large files and blob storage. Custom Speech: Fine-tune models for your industry, vocabulary, and noise conditions. Text to Speech (TTS): Instantly convert text into natural, expressive audio in 150+ languages with 600+ neural voices. Demo: Listen to pre-built voices, explore whispering, cheerful, angry, and more styles. Custom Neural Voice: Clone and train your own professional or personal voice (with strict Responsible AI controls). Avatars & Video Translation: Bring your apps to life with prebuilt avatars and video translation, which syncs voice-overs to speakers in multilingual videos. Voice Live API: Voice Live API (Preview) integrates all premium speech capabilities with large language models, enabling real-time, proactive voice agents and chatbots. Demo: Language learning agent with voice, avatars, and proactive engagement. One-click code export for deployment in your IDE. 3. Customer Story: Hilo Health This week’s customer spotlight featured Helo Health—a healthcare technology company using Azure AI to boost efficiency for doctors, staff, and patients. How Hilo Uses Azure AI: Document Management: Automates fax/document filing, splits multi-page faxes by patient, reduces staff effort and errors using Azure Computer Vision and Document Intelligence. Ambient Listening: Ambient clinical note transcription captures doctor-patient conversations and summarizes them for easy EHR documentation. Genie AI Contact Center: Agentic voice assistants handle patient calls, book appointments, answer billing/refill questions, escalate to humans, and assist human agents—using Azure Communication Services, Azure Functions, FastAPI (community), and Azure OpenAI. Conversational Campaigns: Outbound reminders, procedure preps, and follow-ups all handled by voice AI—freeing up human staff. Impact: Hilo reaches 16,000+ physician practices and 180,000 providers, automates millions of communications, and processes $2B+ in payments annually—demonstrating how multimodal AI transforms patient journeys from first call to post-visit care. 4. Key Takeaways Here’s what you need to know from S2E11: Speech AI is Accessible: The Azure AI Foundry Speech Playground makes experimenting with voice recognition, TTS, and avatars easy for everyone. From Playground to Production: Fine-tune, export code, and deploy speech models in your own apps with Azure Speech Service. Responsible AI Built-In: Custom Neural Voice and avatars require application and approval, ensuring ethical, secure use. Agentic AI Everywhere: Voice Live API brings real-time, multimodal voice agents to any workflow. Healthcare Example: Hilo’s use of Azure AI shows the real-world impact of speech and agentic AI, from patient intake to after-visit care. Join the Community: Keep learning and building—join the Discord and Forum. Sharda's Tips: How I Wrote This Blog I organize key moments from each episode, highlight product demos and customer stories, and use GitHub Copilot for structure. For this recap, I tested the Speech Playground myself, explored the docs, and summarized answers to common developer questions on security, dialects, and deployment. Here’s my favorite Copilot prompt this week: "Generate a technical blog post for Model Mondays S2E11 based on the transcript and episode details. Focus on Azure Speech Playground, TTS, avatars, Voice Live API, and healthcare use cases. Add practical links for developers and students!" Coming Up Next Week Next week: Observability! Learn how to monitor, evaluate, and debug your AI models and workflows using Azure and OpenAI tools. Register For The Livestream – Sep 1, 2025 Register For The AMA – Sep 5, 2025 Ask Questions & View Recaps – Discussion Forum About Model Mondays Model Mondays is your weekly Azure AI learning series: 5-Minute Highlights: Latest AI news and product updates 15-Minute Spotlight: Demos and deep dives with product teams 30-Minute AMA Fridays: Ask anything in Discord or the forum Start building: Register For Livestreams Watch Past Replays Register For AMA Recap Past AMAs Join The Community Don’t build alone! The Azure AI Developer Community is here for real-time chats, events, and support: Join the Discord Explore the Forum About Me I'm Sharda, a Gold Microsoft Learn Student Ambassador focused on cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. In this blog series, I share takeaways from each week’s Model Mondays livestream.350Views0likes0CommentsModel Mondays S2E12: Models & Observability
1. Weekly Highlights This week’s top news in the Azure AI ecosystem included: GPT Real Time (GA): Azure AI Foundry now offers GPT Real Time (GA)—lifelike voices, improved instruction following, audio fidelity, and function calling, with support for image context and lower pricing. Read the announcement and check out the model card for more details. Azure AI Translator API (Public Preview): Choose between fast Neural Machine Translation (NMT) or nuanced LLM-powered translations, with real-time flexibility for multilingual workflows. Read the announcement then check out the Azure AI Translator documentation for more details. Azure AI Foundry Agents Learning Plan: Build agents with autonomous goal pursuit, memory, collaboration, and deep fine-tuning (SFT, RFT, DPO) - on Azure AI Foundry. Read the announcement what Agentic AI involves - then follow this comprehensive learning plan with step-by-step guidance. CalcLM Agent Grid (Azure AI Foundry Labs): Project CalcLM: Agent Grid is a prototype and open-source experiment that illustrates how agents might live in a grid-like surface (like Excel). It's formula-first and lightweight - defining agentic workflows like calculations. Try the prototype and visit Foundry Labs to learn more. Agent Factory Blog: Observability in Agentic AI: Agentic AI tools and workflows are gaining rapid adoption in the enterprise. But delivering safe, reliable and performant agents requires foundation support for Observability. Read the 6-part Agent Factory series and check out the Top 5 agent observability best practices for reliable AI blog post for more details. 2. Spotlight On: Observability in Azure AI Foundry This week’s spotlight featured a deep dive and demo by Han Che (Senior PM, Core AI/ Microsoft ), showing observability end-to-end for agent workflows. Why Observability? Ensures AI quality, performance, and safety throughout the development lifecycle. Enables monitoring, root cause analysis, optimization, and governance for agents and models. Key Features & Demos: Development Lifecycle: Leaderboard: Pick the best model for your agent with real-time evaluation. Playground: Chat and prototype agents, view instant quality and safety metrics. Evaluators: Assess quality, risk, safety, intent resolution, tool accuracy, code vulnerability, and custom metrics. Governance: Integrate with partners like Cradle AI and SideDot for policy mapping and evidence archiving. Red Teaming Agent: Automatically test for vulnerabilities and unsafe behavior. CI/CD Integration: Automate evaluation in GitHub Actions and Azure DevOps pipelines. Azure DevOps GitHub Actions Monitoring Dashboard: Resource usage, application analytics, input/output tokens, request latency, cost breakdown, and evaluation scores. Azure Cost Management SDKs & Local Evaluation: Run evaluations locally or in the cloud with the Azure AI Evaluation SDK. Demo Highlights: Chat with a travel planning agent, view run metrics and tool usage. Drill into run details, debugging, and real-time safety/quality scores. Configure and run large-scale agent evaluations in CI/CD pipelines. Compare agents, review statistical analysis, and monitor in production dashboards 3. Customer Story: Saifr Saifr is a RegTech company that uses artificial intelligence to streamline compliance for marketing, communications, and creative teams in regulated industries. Incubated at Fidelity Labs (Fidelity Investments’ innovation arm), Saifr helps enterprises create, review, and approve content that meets regulatory standards—faster and with less manual effort. What Saifr Offers AI-Powered Compliance: Saifr’s platform leverages proprietary AI models trained on decades of regulatory expertise to automatically detect potential compliance risks in text, images, audio, and video. Automated Guardrails: The solution flags risky or non-compliant language, suggests compliant alternatives, and provides explanations—all in real time. Workflow Integration: Saifr seamlessly integrates with enterprise content creation and approval workflows, including cloud platforms and agentic AI systems like Azure AI Foundry. Multimodal Support: Goes beyond text to check images, videos, and audio for compliance risks, supporting modern marketing and communications teams. 4. Key Takeaways Observability is Essential: Azure AI Foundry offers complete monitoring, evaluation, tracing, and governance for agentic AI—making production safe, reliable, and compliant. Built-In Evaluation and Red Teaming: Use leaderboards, evaluators, and red teaming agents to assess and continuously improve model safety and quality. CI/CD and Dashboard Integration: Automate evaluations in GitHub Actions or Azure DevOps, then monitor and optimize agents in production with detailed dashboards. Compliance Made Easy: Safer’s agents and models help financial services and regulated industries proactively meet compliance standards for content and communications. Sharda's Tips: How I Wrote This Blog I focus on organizing highlights, summarizing customer stories, and linking to official Microsoft docs and real working resources. For this recap, I explored the Azure AI Foundry Observability docs, tested CI/CD pipeline integration, and watched the customer demo to share best practices for regulated industries. Here’s my Copilot prompt for this episode: "Generate a technical blog post for Model Mondays S2E12 based on the transcript and episode details. Focus on observability, agent dashboards, CI/CD, compliance, and customer stories. Add correct, working Microsoft links!" Coming Up Next Week Next week: Open Source Models! Join us for the final episode with Hugging Face VP of Product, live demos, and open model workflows. Register For The Livestream – Sep 15, 2025 About Model Mondays Model Mondays is your weekly Azure AI learning series: 5-Minute Highlights: Latest AI news and product updates 15-Minute Spotlight: Demos and deep dives with product teams 30-Minute AMA Fridays: Ask anything in Discord or the forum Start building: Watch Past Replays Register For AMA Recap Past AMAs Join The Community Don’t build alone! The Azure AI Developer Community is here for real-time chats, events, and support: Join the Discord Explore the Forum About Me I'm Sharda, a Gold Microsoft Learn Student Ambassador focused on cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. In this blog series, I share takeaways from each week’s Model Mondays livestream.286Views0likes0CommentsModel Mondays S2E13: Open Source Models (Hugging Face)
1. Weekly Highlights 1. Weekly Highlights Here are the key updates we covered in the Season 2 finale: O1 Mini Reinforcement Fine-Tuning (GA): Fine-tune models with as few as ~100 samples using built-in Python code graders. Azure Live Interpreter API (Preview): Real-time speech-to-speech translation supporting 76 input languages and 143 locales with near human-level latency. Agent Factory – Part 5: Connecting agents using open standards like MCP (Model Context Protocol) and A2A (Agent-to-Agent protocol). Ask Ralph by Ralph Lauren: A retail example of agentic AI for conversational styling assistance, built on Azure OpenAI and Foundry’s agentic toolset. VS Code August Release: Brings auto-model selection, stronger safety guards for sensitive edits, and improved agent workflows through new agents.md support. 2. Spotlight – Open Source Models in Azure AI Foundry Guest: Jeff Boudier, VP of Product at Hugging Face Jeff showcased the deep integration between the Hugging Face community and Azure AI Foundry, where developers can access over 10 000 open-source models across multiple modalities—LLMs, speech recognition, computer vision, and even specialized domains like protein modeling and robotics. Demo Highlights Discover models through Azure AI Foundry’s task-based catalog filters. Deploy directly from Hugging Face Hub to Azure with one-click deployment. Explore Use Cases such as multilingual speech recognition and vision-language-action models for robotics. Jeff also highlighted notable models, including: SmoLM3 – a 3 B-parameter model with hybrid reasoning capabilities Qwen 3 Coder – a mixture-of-experts model optimized for coding tasks Parakeet ASR – multilingual speech recognition Microsoft Research protein-modeling collection MAGMA – a vision-language-action model for robotics Integration extends beyond deployment to programmatic access through the Azure CLI and Python SDKs, plus local development via new VS Code extensions. 3. Customer Story – DraftWise (BUILD 2025 Segment) The finale featured a customer spotlight on DraftWise, where CEO James Ding shared how the company accelerates contract drafting with Azure AI Foundry. Problem Legal contract drafting is time-consuming and error-prone. Solution DraftWise uses Azure AI Foundry to fine-tune Hugging Face language models on legal data, generating contract drafts and redline suggestions. Impact Faster drafting cycles and higher consistency Easy model management and deployment with Foundry’s secure workflows Transparent evaluation for legal compliance 4. Community Story – Hugging Face & Microsoft The episode also celebrated the ongoing collaboration between Hugging Face and Microsoft and the impact of open-source AI on the global developer ecosystem. Community Benefits Access to State-of-the-Art Models without licensing barriers Transparent Performance through public leaderboards and benchmarks Rapid Innovation as improvements and bug fixes spread quickly Education & Empowerment via tutorials, docs, and active forums Responsible AI Practices encouraged through community oversight 5. Key Takeaways Open Source AI Is Here to Stay Azure AI Foundry and Hugging Face make deploying, fine-tuning, and benchmarking open models easier than ever. Community Drives Innovation: Collaboration accelerates progress, improves transparency, and makes AI accessible to everyone. Responsible AI and Transparency: Open-source models come with clear documentation, licensing, and community-driven best practices. Easy Deployment & Customization: Azure AI Foundry lets you deploy, automate, and customize open models from a single, unified platform. Learn, Build, Share: The open-model ecosystem is a great place for students, developers, and researchers to learn, build, and share their work. Sharda's Tips: How I Wrote This Blog For this final recap, I focused on capturing the energy of the open source AI movement and the practical impact of Hugging Face and Azure AI Foundry collaboration. I watched the livestream, took notes on the demos and interviews, and linked directly to official resources for models, docs, and community sites. Here’s my Copilot prompt for this episode: "Generate a technical blog post for Model Mondays S2E13 based on the transcript and episode details. Focus on open source models, Hugging Face, Azure AI Foundry, and community workflows. Include practical links and actionable insights for developers and students! Learn & Connect Explore Open Models in Azure AI Foundry Hugging Face Leaderboard Responsible AI in Azure Machine Learning Llama-3 by Meta Hugging Face Community Azure AI Documentation About Model Mondays Model Mondays is your weekly Azure AI learning series: 5-Minute Highlights: Latest AI news and product updates 15-Minute Spotlight: Demos and deep dives with product teams 30-Minute AMA Fridays: Ask anything in Discord or the forum Start building: Watch Past Replays Register For AMA Recap Past AMAs Join The Community Don’t build alone! The Azure AI Developer Community is here for real-time chats, events, and support: Join the Discord Explore the Forum About Me I'm Sharda, a Gold Microsoft Learn Student Ambassador focused on cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. In this blog series, I share takeaways from each week’s Model Mondays livestream.362Views0likes0Comments8 Architectural Pillars to Boost GenAI LLM Accuracy and Performance in Low Cost
Smarter AI architecture, not bigger LLM models - how engineering teams push LLM accuracy and high performance in low cost. Enterprises using LLM (Large Language Models) hits the same ceiling and paying big price! A raw API call to a frontier model- GPT-4, Claude, Gemini delivers only 35-40% accuracy on structured output tasks like code generation, NL to DAX query generation, domain-specific reasoning. Prompt engineering pushes that to ~60%. But the final 35+ percentage points? Those come from system architecture, not model upgrades. This guide presents 8 architectural pillars, distilled from production Gen AI systems, that compound to close the accuracy gap. These patterns are model-agnostic and domain-agnostic, they apply equally to chatbots, coding assistants, content/query generators, automation agents, and any application where an LLM produces structured or semi-structured output. It’s based on my recent Gen AI projects. The key takeaway: use the LLM as one component in a larger system, not as the system itself. Surround it with deterministic guardrails, verified knowledge, and feedback loops. Pillar 1: Enhance Prompts with Verified Knowledge Context Impact: +35–40% accuracy (based on production use cases; may vary by domain) Top source of LLM errors in production is hallucinated identifiers knowledgebase, the model invents names, references, or structures that don't exist in the target system. This happens because LLMs are trained on general knowledge but deployed against specific, private enterprise systems they've never seen local database and knowledgebase. The fix is straightforward: inject verified, system-specific context (type definitions, API specs, ontologies, configuration schemas, entity catalogues) directly into the prompt so the model composes from known-good elements rather than recalling from training data. Use Knowledge Graph for better sematic knowledge. How to Implement Provide explicit context, never implicit- Whatever the LLM needs to reference identifiers, valid values, semantic knowledge, structures must appear verbatim in the prompt or retrieved context window. Filter aggressively. A full knowledge base with thousands of entities overwhelms the context window and confuses the model. Use intelligent filtering to surface only needed 5-10 most relevant elements per request. Store structured semantic knowledge in a graph or searchable index. This enables relationship-aware retrieval: "given entity X, what related entities, attributes, and constraints are also needed?" Include rich Semantic metadata. Names alone are insufficient. Include types, constraints, valid value ranges, relationships, and usage notes to minimize ambiguity. Keep context fresh. Stale context causes a different class of hallucination the model generates valid-looking output that references outdated structures. Sync your knowledge store with your source of truth. Why This Works LLMs excel at composition and reasoning combining elements, applying logic, following patterns. They are unreliable at recall of specific identifiers exact names, valid values, structural constraints. By offloading recall to a deterministic retrieval system and giving the LLM only composition tasks, you play to each system's strengths. Pillar 2: Tiered LLM Approach: Route Deterministically First, Use LLMs Last Impact: 80% cost reduction, 85% latency reduction, eliminates non-deterministic errors for most traffic. The most impactful architectural insight: most production requests don't need an LLM at all. A well-designed system handles 60-70% of traffic with deterministic logic templates, composition rules, cached results and reserves expensive, non-deterministic LLM calls only for genuinely novel inputs. The Three-Tier Model These metrics are from a real use case to convert NLP to Power BI DAX query. Tier Strategy Uses LLM ? Latency Accuracy Tier 0 Template slot-filling - handles requests that match known patterns exactly the system fills slots in a pre-built template with extracted parameters. No LLM, no non-determinism, near-perfect accuracy, sub-100ms response. No ~50ms 95-98% Tier 1 Compose from pre-validated fragments- handles requests that combine known patterns in new ways. The system retrieves pre-validated building blocks via search, composes them using deterministic rules, and validates the result. Still no LLM call. No ~200ms 90-95% Tier 2 Full LLM generation with enriched context- is reserved for genuinely novel requests that can't be served deterministically. Even here, the LLM receives maximum support: filtered context, relevant examples, explicit rules, and structured planning. Yes (1 call) 2-5s 88-93% Complexity-Based Routing A lightweight scoring function (evaluated in <1ms) routes each incoming request: Factors: reasoning depth, number of components, cross-references, constraints, nesting depth, novelty (distance from known patterns) Score 0-39: Tier 0 (deterministic template) Score 40-59: Tier 1 if confidence ≥ 85%, else Tier 2 Score 60+: Tier 2 (LLM generation) This routing achieves 96%+ accuracy in tier assignment and ensures the expensive path is only taken when necessary. Why This Matters Cost: 70-80% of requests cost zero LLM tokens Latency: Majority of responses in <200ms instead of 2-5s Reliability: Deterministic tiers produce identical output for identical input. Scalability: Deterministic tiers scale horizontally with trivial compute Pillar 3: Encode Prompt Anti-Patterns as Explicit Rules Impact: +8-10% accuracy, ~80% reduction in common structural errors LLM mistakes are patterned, not random. In any domain, 80% of errors cluster around a small set of 6-13 recurring structural mistakes. Instead of hoping the model avoids them through general instruction-following, compile these mistakes into explicit WRONG => CORRECT rules embedded directly in the system prompt. How to Implement Collect error data. Run 100+ requests through your system and categorize the failures. You'll find the same 6-13 patterns appearing repeatedly. Write concrete rules. For each pattern, show the exact wrong output and the exact correct alternative, with a one-line explanation of why. Embed in system prompt. Place rules prominently after the task description, before examples. Use formatting that's hard to ignore (headers, bold, explicit "NEVER" language). Keep the list short. 6-13 rules maximum. Beyond that, attention dilutes and the model starts ignoring rules. Prioritize by frequency. Refresh continuously. As the system improves (via other pillars), some errors disappear. New error types emerge. Update the rule set quarterly. Why This Works LLMs respond strongly to explicit negative examples. A generic instruction like "be careful with X" has minimal impact. But showing the exact wrong output the model tends to produce, paired with the correction, creates a strong avoidance signal. It's analogous to unit tests. Pillar 4: Retrieve Few-Shot Examples Dynamically Impact: +5-15% accuracy depending on domain complexity Static examples hardcoded in a prompt become stale, irrelevant of context tokens. Dynamic few-shot retrieval selects the 3-5 most relevant examples for each specific request, maximizing the signal-to-noise ratio in the prompt. Hybrid Retrieval Architecture The most effective approach combines two search strategies for intent search to understand natural language (NL) context: Keyword search (BM25) Finds examples with exact matching terms, identifiers, and domain vocabulary Vector search (semantic similarity) Finds examples with similar intent and structure, even if wording differs Rank fusion Merges results from both strategies, re-ranking by combined relevance This hybrid approach outperforms either strategy alone because keyword search catches exact identifier matches that vector search dilutes, while vector search captures semantic similarity that keyword search misses entirely. Best Gen AI Architectural Practices Match complexity to complexity. Simple requests should see simple examples. Complex requests should see complex examples. Mismatched examples confuse the model. Include negative examples. For the detected request type, include 1-2 "wrong => correct" pairs alongside positive examples. This reinforces Pillar 3's anti-pattern rules with concrete, contextually relevant demonstrations. Pre-compute embeddings. Generate vector embeddings at indexing time, not at query time. Cache retrieval results for repeated patterns. Curate quality over quantity. 3 excellent, diverse examples beat 10 mediocre ones. Each example should demonstrate a distinct pattern or edge case. Keep examples current. As your system evolves, old examples may demonstrate outdated patterns. Review and refresh the example store periodically. Pillar 5: Feedback Loop- Validate and Auto-Fix Every Output Deterministically Impact: +3-5% accuracy as a safety net, plus continuous improvement via feedback No matter how well-prompted, LLMs will occasionally produce outputs with minor structural errors - wrong casing, missing delimiters, references to slightly-incorrect identifiers, or subtle format violations. A deterministic post-processing pipeline catches and fixes these without any additional LLM calls. The Validation Pipeline LLM Output => Parse (grammar/AST) => Rule-Based Fixes => Compliance Check/validation => Final Output Each stage is fully deterministic: Parsing: Use a formal grammar or AST parser (ANTLR, tree-sitter, language-native parsers) to structurally analyse the output. Never regex-parse structured output - it's fragile and misses edge cases. Rule-based fixes: 10-20 deterministic transformation rules that correct known error patterns - name normalization, casing fixes, missing delimiters, structural repairs. Compliance check: Verify every identifier referenced in the output actually exists in the provided context. Flag unknown references. Design Principles Zero LLM calls in the fix pipeline. Every fix is a regex, an AST transformation, or a lookup table operation. Instant, free, deterministic, 100% reliable. Fail safe. If a fix is ambiguous (multiple valid corrections possible), pass through rather than corrupt. A minor error is better than a confident wrong "fix." Log everything. Track every fix applied, categorized by type. This data drives the feedback loop. The Critical Feedback Loop- The validation pipeline's most important function isn't fixing outputs, it's generating improvement signals: This creates a feedback loop: the auto-fix catches errors → the errors get promoted to upstream prevention → fewer errors reach the auto-fix → the system continuously tightens. Pillar 6: Multi-Agent Orchestration with Fewer Agents and Clear Contracts Impact: Reduced latency, clearer debugging, fewer failure modes The multi-agent pattern is powerful but commonly over-applied. The counter-intuitive lesson from production systems: fewer agents with well-defined responsibilities outperform many fine-grained agents. Why Fewer Is Better Each agent handoff introduces: Latency - serialization, network calls, context assembly Context loss - information dropped between boundaries Failure modes - each handoff is a potential error point Debugging complexity - tracing issues across many agents is exponentially harder Multi-Agent Orchestration Principles Merge agents that always run sequentially. If Agent A always feeds into Agent B with no branching or conditional logic, they should be one agent with two internal steps. Parallelize independent operations. Context retrieval and example lookup are independent, run them concurrently to halve retrieval latency. Route sub-tasks to cheaper models. Decomposed sub-problems are simpler by design. Use a smaller, faster, cheaper model (3x cost savings, 2x speed improvement). Define strict contracts. Each agent boundary should have an explicit schema defining inputs and outputs. No implicit assumptions about what crosses the boundary. Only 2 of 4 agents should call an LLM. The rest are purely deterministic. This minimizes non-deterministic behavior and cost. Pillar 7: Multi-Agent Cache at Multiple Hierarchical Levels Impact: 40-50% faster responses, 85%+ combined hit rate, significant cost reduction A single cache layer captures only one type of repetition. Production systems need hierarchical caching where multiple levels catch different repetition patterns , from exact duplicates to semantic near-misses. with -> A single cache layer captures only one type of repetition. Production systems need multi-level caching to handle exact matches, similar requests, and reusable fragments. or -> with Production systems need hierarchical caching where multiple levels handle exact matches, similar requests, and reusable fragments. Pillar 8: Measure Everything, Learn Continuously Impact: Enables data-driven iteration and prevents accuracy regressions. Architecture without observability is guesswork. The final pillar ensures every other pillar stays effective over time through comprehensive metrics and automated feedback loops. This isn't a one-time setup; it's a perpetual feedback loop. Every week, the top error patterns shift slightly. The auto-fix metrics tell you exactly where to focus next. Over months, this flywheel compounds into dramatic accuracy gains that no single prompt rewrite could achieve. Auto-Learning for New Domains When extending your system to new domains or knowledge areas: Auto-classify elements using naming conventions, type analysis, and structural patterns Auto-generate templates from universal patterns (transformations, comparisons, compositions, sequences) Bootstrap few-shot examples from successful template outputs Monitor for the first 100 requests, then curate only the edge cases manually This reduces domain onboarding from days of manual work to minutes of automated bootstrapping plus focused human review of outliers. Key Takeaways Architecture beats model size. A well-architected system with a smaller model outperforms a raw frontier model call on structured tasks at a fraction of the cost. Deterministic systems should do the heavy lifting. Reserve LLMs for genuinely novel, creative tasks. 70-80% of production requests should never touch an LLM. Verified knowledge is your top accuracy lever. Ground every prompt in context the model can trust. Errors are patterned, not random- Track them, compile them, and explicitly forbid them. Build feedback loops, not static systems- Every auto-fix, every cache miss, every routing decision is a signal for improvement. Fewer agents, done well- Fewer agents with strict contracts outperform 9 agents with fuzzy boundaries in accuracy, latency, and debuggability. Measure what matters and iterates- The system that wins isn't the one with the best day-one prompt, it's the one that improves fastest over time. Production-grade GenAI isn't about finding the perfect prompt or waiting for the next LLM model release. It's about building architectural guardrails that make failure nearly impossible and when failure does occur, the system learns from it automatically. These 8 pillars, applied together, transform any LLM from an unreliable black box into a precise, efficient, and continuously improving production system. -> Production Gen AI success is not about perfect prompts or waiting for the next LLM release. It comes from designing strong system guardrails that reduce failures and ensure consistent output. Even when failures happen, the system learns and improves automatically. When applied together, these 8 pillars turn an LLM into a reliable, efficient, and continuously improving production system.Student Devs: Build AI Agents, Compete for $55K in Prizes
Student Devs: Build AI Agents, Compete for $55K in Prizes 🎮 AI Skills Fest • June 4–14, 2026 • Free to Enter $55K Prize Pool 3 Challenge Tracks 10 Days of Hacking Free To Enter Whether you're a first-year CS student or a final-year senior with a portfolio full of projects, Agents League is the best way to gain hands-on experience with agentic AI this summer and walk away with real skills employers are hiring for right now. What You'll Actually Learn Forget passive tutorials. Agents League is project-based learning at full speed. By the end of the hackathon, you'll have built a working AI agent and gained practical experience with the tools shaping the future of software development. 🤖 AI-Assisted Development Use GitHub Copilot to accelerate your coding workflow — from scaffolding to debugging — the way professional developers do today. 🧩 Multi-Step Reasoning Build agents with Microsoft Foundry that can plan, reason, and execute complex tasks — the core of agentic AI. 🏢 Enterprise AI Patterns Learn to build production-ready agents that integrate with Microsoft 365 and Copilot Studio — skills that translate directly to industry jobs. 🔧 Prompt Engineering Design effective prompts and orchestration flows that make AI agents reliable and useful in the real world. 📦 GitHub Workflows Submit your project through GitHub — practising version control, README writing, and open-source collaboration. 🎯 Competitive Problem-Solving Work under real constraints with deadlines, judging criteria, and peer competition — just like industry hackathons and sprints. Pick Your Track (or Try All Three) Agents League has three challenge tracks, each using different Microsoft AI tools. Choose based on your interests or stretch yourself by competing in multiple tracks. Track 01. Creative Apps Build an innovative application with AI-assisted development. This track rewards creativity, dream big and let GitHub Copilot help you bring ideas to life faster than ever. Tool: GitHub Copilot Track 02. Reasoning Agents Create intelligent agents that solve complex problems through multi-step reasoning. Think: agents that can research, plan, and act. This is the cutting edge of AI. Tool: Microsoft Foundry Track 03. Enterprise Agents Build knowledge agents that integrate with Microsoft 365 Copilot. Learn how businesses are deploying AI today and add enterprise AI to your skillset. Tool: Copilot Studio • M365 Opportunities You Won't Want to Miss Agents League isn't just a competition, it's a launchpad. Here's what's in it for you beyond the code: 💰 Win from a $55,000 USD Prize Pool Prizes are awarded across all three tracks smaller teams and solo hackers have a real shot. 📺 Watch Live Coding Battles at Microsoft Reactor See industry experts go head-to-head building AI agents live. Learn advanced techniques you can apply immediately to your own project. 🎓 Free Learning Resources on Microsoft Learn Access curated learning paths and the AI Skills Navigator, structured content designed to get you from zero to submission-ready. 🌍 Join a Global Developer Community Connect with thousands of developers on the Agents League Discord. Find teammates, ask questions, and build your professional network. 📂 Build Your Portfolio with a Real Project Every submission lives on GitHub. Walk away with a polished, public project that demonstrates your AI skills to future employers and grad schools. 🏆 Gain Recognition from Microsoft and the Community Top projects get visibility across the Microsoft developer ecosystem. Stand out from the crowd in internship and job applications. Key Dates to Remember Event Date Hacking Period Opens June 4, 2026 Registration Deadline June 12, 2026 — 12:00 PM PT Submission Deadline June 14, 2026 — 11:59 PM PT How to Get Started (Right Now) You don't have to wait until June 4th to start preparing. Here's your pre-hackathon game plan: Register for the hackathon it's free and open to everyone. Pick a track that matches your interests or curiosity. Explore the learning resources on Microsoft Learn and the AI Skills Navigator. Join the Discord community to find teammates and get early tips. Watch the Reactor event series for live coding battles and expert walkthroughs. Set up your GitHub repo and start experimenting before the hacking window opens. Helpful Links Register for Agents League Free entry, sign up now Microsoft Reactor Events Live coding battles & workshops AI Skills Fest The broader event Microsoft Learn Free learning paths The Arena Awaits 🏆 Ten days. Three tracks. $55K in prizes. Whether you go solo or squad up, this is your chance to build something real with AI and have a blast doing it. Register Now It's Free | Watch Reactor Events Agents League is part of AI Skills Fest and is open to the public at no cost. Review the Hackathon Rules and Regulations and the Microsoft Event Code of Conduct before participating.748Views0likes0CommentsSigning in to Microsoft Foundry from OpenClaw using Azure AD: a smoother way to bring your models in
This post is a quick update to walk through the new flow. If you read the previous one, think of this as the easier path I wish I had the first time round. If you have not seen the original, you can find it here: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Pre-requisite: You will need the Azure CLI (azure-cli) installed on your machine. The official install guide for Linux is here: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest I am on Linux so I went the Homebrew route, which keeps things simple. The formula is here: https://formulae.brew.sh/formula/azure-cli Microsoft also has official docs covering the Homebrew/Linuxbrew install: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Once Homebrew is ready, run this in your terminal: brew install azure-cli Why this matters: Before this update, every Foundry model you wanted to use in OpenClaw needed its own API key and endpoint pasted into the config. It worked, but it was tedious, and keys are easy to leak if you are copying them around. The Azure AD path solves both problems. You authenticate as yourself (or a service principal), OpenClaw asks Azure for the list of Foundry resources you have access to, and it brings the models in automatically. Signing in to Microsoft Foundry from OpenClaw via Azure AD A device-code OAuth handshake replaces the old static-API-key flow. OpenClaw delegates auth to the local Azure CLI; the CLI handles the browser-side sign-in, holds the resulting tokens, and refreshes them silently. OpenClaw then walks the Azure resource graph, subscriptions → Foundry resources → model deployments and registers each model into its own config. No API keys move through OpenClaw at any point. Sequence diagram of the OAuth 2.0 device-authorization flow as orchestrated by OpenClaw. Phases 1–3 establish identity (the developer authenticates once, in a real browser, against Azure AD). Phases 4–5 perform service discovery (OpenClaw walks the ARM resource hierarchy, subscriptions → Foundry accounts → model deployments and persists the result to a local provider config). After registration, every model call OpenClaw makes against Foundry reuses the same Azure-CLI-managed token cache: tokens refresh transparently, and access is gated by the Foundry resource's RBAC assignments rather than a static API key. Dashed lines denote return values; the teal line in step 7 marks the single token-issuance event the rest of the system pivots on. Walking through the new flow: Start with the command to onboard openclaw as if you were setting up OpenClaw for the first time: openclaw onboard Kick things off with the OpenClaw onboard command, the same one you would use when setting up OpenClaw for the first time. When it prompts you, choose update values. Next, you will be asked to configure your models. Scroll down a little and you will see Microsoft Foundry listed as a supported provider. Pick it. From here, you have two options. You can sign in with an API key, which is what I covered in the previous blog post, or you can sign in through Azure AD. The Azure AD path is easier and more secure, so that is the one we will use. OpenClaw will give you a URL and a device code. Copy the URL into your browser and use the code to complete the sign in. (This is where the az CLI from the pre-requisite section earns its keep.) If everything worked, you should see a success prompt similar to this: Once you are signed in, OpenClaw will ask you to pick the Azure subscription that your Microsoft Foundry resource lives in. Pick the subscription, then pick the Foundry resource where your models are deployed. And that is pretty much it. All the models you have deployed to that Foundry resource get pulled into OpenClaw automatically. Compared to the old way of pasting API keys and endpoints one by one, this is a huge time saver, and you do not have to babysit any keys. From here you can start using your Foundry-deployed models inside OpenClaw straight away: Wrapping up The Azure AD sign-in option in OpenClaw is one of those small updates that quietly removes a real pain point. If you have ever juggled multiple Foundry endpoints and rotated keys across them, you already know why. With this flow, you sign in once, your models show up, and you can get back to actually building. If you have not tried OpenClaw with Microsoft Foundry yet, this is a good time to give it a go. And if you were holding off because of the key management overhead, that excuse is gone now. References Previous post on integrating Microsoft Foundry with OpenClaw using API keys: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Install the Azure CLI on Linux: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest Install the Azure CLI on macOS: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Homebrew formula for azure-cli: https://formulae.brew.sh/formula/azure-cli226Views0likes0CommentsFrom Demo to Production: Building Microsoft Foundry Hosted Agents with .NET
The Gap Between a Demo and a Production Agent Let's be honest. Getting an AI agent to work in a demo takes an afternoon. Getting it to work reliably in production, tested, containerised, deployed, observable, and maintainable by a team. is a different problem entirely. Most tutorials stop at the point where the agent prints a response in a terminal. They don't show you how to structure your code, cover your tools with tests, wire up CI, or deploy to a managed runtime with a proper lifecycle. That gap between prototype and production is where developer teams lose weeks. Microsoft Foundry Hosted Agents close that gap with a managed container runtime for your own custom agent code. And the Hosted Agents Workshop for .NET gives you a complete, copy-paste-friendly path through the entire journey. from local run to deployed agent to chat UI, in six structured labs using .NET 10. This post walks you through what the workshop delivers, what you will build, and why the patterns it teaches matter far beyond the workshop itself. What Is a Microsoft Foundry Hosted Agent? Microsoft Foundry supports two distinct agent types, and understanding the difference is the first decision you will make as an agent developer. Prompt agents are lightweight agents backed by a model deployment and a system prompt. No custom code required. Ideal for simple Q&A, summarisation, or chat scenarios where the model's built-in reasoning is sufficient. Hosted agents are container-based agents that run your own code .NET, Python, or any framework you choose inside Foundry's managed runtime. You control the logic, the tools, the data access, and the orchestration. When your scenario requires custom tool integrations, deterministic business logic, multi-step workflow orchestration, or private API access, a hosted agent is the right choice. The Foundry runtime handles the managed infrastructure; you own the code. For the official deployment reference, see Deploy a hosted agent to Foundry Agent Service on Microsoft Learn. What the Workshop Delivers The Hosted Agents Workshop for .NET is a beginner-friendly, hands-on workshop that takes you through the full development and deployment path for a real hosted agent. It is structured around a concrete scenario: a Hosted Agent Readiness Coach that helps delivery teams answer questions like: Should this use case start as a prompt agent or a hosted agent? What should a pilot launch checklist include? How should a team troubleshoot common early setup problems? The scenario is purposefully practical. It is not a toy chatbot. It is the kind of tool a real team would build and hand to other engineers, which means it needs to be testable, deployable, and extensible. The workshop covers: Local development and validation with .NET 10 Copilot-assisted coding with repo-specific instructions Deterministic tool implementation with xUnit test coverage CI pipeline validation with GitHub Actions Secure deployment to Azure Container Registry and Microsoft Foundry Chat UI integration using Blazor What You Will Build By the end of the workshop, you will have a code-based hosted agent that exposes an OpenAI Responses-compatible /responses endpoint on port 8088 . The agent is backed by three deterministic local tools, implemented in WorkshopLab.Core : RecommendImplementationShape — analyses a scenario and recommends hosted or prompt agent based on its requirements BuildLaunchChecklist — generates a pilot launch checklist for a given use case TroubleshootHostedAgent — returns structured troubleshooting guidance for common setup problems These tools are deterministic by design, no LLM call required to return a result. That choice makes them fast, predictable, and fully testable, which is the right architecture for business logic in a production agent. The end-to-end architecture looks like this: The Hands-On Journey: Lab by Lab The workshop follows a deliberate build → validate → ship progression. Each lab has a clear outcome. You do not move forward until the previous checkpoint passes. Lab 0 — Setup and Local Run Open the repo in VS Code or a GitHub Codespace, configure your Microsoft Foundry project endpoint and model deployment name, then run the agent locally. By the end of Lab 0, your agent is listening on http://localhost:8088/responses and responding to test requests. dotnet restore dotnet build dotnet run --project src/WorkshopLab.AgentHost Test it with a single PowerShell call: Invoke-RestMethod -Method Post ` -Uri "http://localhost:8088/responses" ` -ContentType "application/json" ` -Body '{"input":"Should we start with a hosted agent or a prompt agent?"}' Lab 0 instructions → Lab 1 — Copilot Customisation Configure repo-specific GitHub Copilot instructions so that Copilot understands the hosted-agent patterns used in this project. You will also add a Copilot review skill tailored to hosted agent code reviews. This step means every code suggestion you receive from Copilot is contextualised to the workshop scenario rather than giving generic .NET advice. Lab 1 instructions → Lab 2 — Tool Implementation Extend one of the deterministic tools in WorkshopLab.Core with a real feature change. The suggested change adds a stronger recommendation path to RecommendImplementationShape for scenarios that require all three hosted-agent strengths simultaneously. // In RecommendImplementationShape — add before the final return: if (requiresCode && requiresTools && requiresWorkflow) { return string.Join(Environment.NewLine, [ $"Recommended implementation: Hosted agent (full-stack)", $"Scenario goal: {goal}", "Why: the scenario requires custom code, external tool access, and " + "multi-step orchestration — all three hosted-agent strengths.", "Suggested next step: start with a code-based hosted agent, register " + "local tools for each integration, and add a workflow layer." ]); } You then write an xUnit test to cover it, run dotnet test , and validate the change against a live /responses call. This is the workshop's most important teaching moment: every tool change is covered by a test before it ships. Lab 2 instructions → Lab 3 — CI Validation Wire up a GitHub Actions workflow that builds the solution, runs the test suite, and validates that the agent container builds cleanly. No manual steps — if a change breaks the build or a test, CI catches it before any deployment happens. Lab 3 instructions → Lab 4 — Deployment to Microsoft Foundry Use the Azure Developer CLI ( azd ) to provision an Azure Container Registry, publish the agent image, and deploy the hosted agent to Microsoft Foundry. The workshop separates provisioning from deployment deliberately: azd owns the Azure resources; the Foundry control plane deployment is an explicit, intentional final step that depends on your real project endpoint and agent.yaml manifest values. Lab 4 instructions → Lab 5 — Chat UI Integration Connect a Blazor chat UI to the deployed hosted agent and validate end-to-end responses. By the end of Lab 5, you have a fully functioning agent accessible through a real UI, calling your deterministic tools via the Foundry control plane. Lab 5 instructions → Key Concepts to Take Away The workshop teaches concrete patterns that apply well beyond this specific scenario. Code-first agent design Prompt-only agents are fast to build but hard to test and reason about at scale. A hosted agent with code-backed tools gives you something you can unit test, refactor, and version-control like any other software. Deterministic tools and testability The workshop explicitly avoids LLM calls inside tool implementations. Deterministic tools return predictable outputs for a given input, which means you can write fast, reliable unit tests for them. This is the right pattern for business logic. Reserve LLM calls for the reasoning layer, not the execution layer. CI/CD for agent systems AI agents are software. They deserve the same build-test-deploy discipline as any other service. Lab 3 makes this concrete: you cannot ship without passing CI, and CI validates the container as well as the unit tests. Deployment separation The workshop's split between azd provisioning and Foundry control-plane deployment is not arbitrary. It reflects the real operational boundary: your Azure resources are long-lived infrastructure; your agent deployment is a lifecycle event tied to your project's specific endpoint and manifest. Keeping them separate reduces accidents and makes rollbacks easier. Observability and the validation mindset Every lab ends with an explicit checkpoint. The culture the workshop builds is: prove it works before moving on. That mindset is more valuable than any specific tool or command in the labs. Why Hosted Agents Are Worth the Investment The managed runtime in Microsoft Foundry removes the infrastructure overhead that makes custom agent deployment painful. You do not manage Kubernetes clusters, configure ingress rules, or handle TLS termination. Foundry handles the hosting; you handle the code. This matters most for teams making the transition from demo to production. A prompt agent is an afternoon's work. A hosted agent with proper CI, tested tools, and a deployment pipeline is a week's work done properly once, instead of several weeks of firefighting done poorly repeatedly. The Foundry agent lifecycle —> create, update, version, deploy —>also gives you the controls you need to manage agents in a real environment: staged rollouts, rollback capability, and clear separation between agent versions. For the full deployment guide, see Deploy a hosted agent to Foundry Agent Service. From Workshop to Real Project This workshop is not just a learning exercise. The repository structure, the tooling choices, and the CI/CD patterns are a reference implementation. The patterns you can lift directly into a production project include: The WorkshopLab.Core / WorkshopLab.AgentHost separation between business logic and agent hosting The agent.yaml manifest pattern for declarative Foundry deployment The GitHub Actions workflow structure for build, test, and container validation The azd + ACR pattern for image publishing without requiring Docker Desktop locally The Blazor chat UI as a starting point for internal tooling or developer-facing applications The scenario, a readiness coach for hosted agents. This is also something teams evaluating Microsoft Foundry will find genuinely useful. It answers exactly the questions that come up when onboarding a new team to the platform. Common Mistakes When Building Hosted Agents Having run workshops and spoken with developer teams building on Foundry, a few patterns come up repeatedly: Skipping local validation before containerising. Always validate the /responses endpoint locally first. Debugging inside a container is slower and harder than debugging locally. Putting business logic inside the LLM call. If the answer to a user query can be determined by code, use code. Reserve the model for reasoning, synthesis, and natural language output. Treating CI as optional. Agent code changes break things just like any other code change. If you do not have CI catching regressions, you will ship them. Conflating provisioning and deployment. Recreating Azure resources on every deploy is slow and error-prone. Provision once with azd ; deploy agent versions as needed through the Foundry control plane. Not having a rollback plan. The Foundry agent lifecycle supports versioning. Use it. Know how to roll back to a previous version before you deploy to production. Get Started The workshop is open source, beginner-friendly, and designed to be completed in a single day. You need a .NET 10 SDK, an Azure subscription, access to a Microsoft Foundry project, and a GitHub account. Clone the repository, follow the labs in order, and by the end you will have a production-ready reference implementation that your team can extend and adapt for real scenarios. Clone the workshop repository → Here is the quick start to prove the solution works locally before you begin the full lab sequence: git clone https://github.com/microsoft/Hosted_Agents_Workshop_dotNET.git cd Hosted_Agents_Workshop_dotNET # Set your Foundry project endpoint and model deployment $env:AZURE_AI_PROJECT_ENDPOINT = "https://<resource>.services.ai.azure.com/api/projects/<project>" $env:MODEL_DEPLOYMENT_NAME = "gpt-4.1-mini" # Build and run dotnet restore dotnet build dotnet run --project src/WorkshopLab.AgentHost Then send your first request: Invoke-RestMethod -Method Post ` -Uri "http://localhost:8088/responses" ` -ContentType "application/json" ` -Body '{"input":"Should we start with a hosted agent or a prompt agent?"}' When the agent answers as a Hosted Agent Readiness Coach, you are ready to begin the labs. Key Takeaways Hosted agents in Microsoft Foundry let you run custom .NET code in a managed container runtime — you own the logic, Foundry owns the infrastructure. Deterministic tools are the right pattern for business logic in production agents: fast, testable, and predictable. CI/CD is not optional for agent systems. Build it in from the start, not as an afterthought. Separate your provisioning ( azd ) from your deployment (Foundry control plane) — it reduces accidents and simplifies rollbacks. The workshop is a reference implementation, not just a tutorial. The patterns are production-grade and ready to adapt. References Hosted Agents Workshop for .NET — GitHub Repository Workshop Lab Guide Deploy a Hosted Agent to Foundry Agent Service — Microsoft Learn Microsoft Foundry Portal Azure Developer CLI (azd) — Microsoft Learn .NET 10 SDK Download449Views0likes0CommentsMicrosoft Foundry Model Router: A Developer's Guide to Smarter AI Routing
Introduction When building AI-powered applications on Azure, one of the most impactful decisions you make isn't about which model to use, it's about how your application selects models at runtime. Microsoft Foundry Model Router, available through Microsoft Foundry, automatically routes your inference requests to the best available model based on prompt complexity, latency targets, and cost efficiency. But how do you know it's actually routing correctly? And how do you compare its behavior across different API paths? That's exactly the problem RouteLens solves. It's an open-source Node.js CLI and web-based testing tool that sends configurable prompts through two distinct Azure AI runtime paths and produces a detailed comparison of routing decisions, latency profiles, and reliability metrics. In this post, we'll walk through what Model Router does, why it matters, how to use the validator tool, and best practices for designing applications that get the most out of intelligent model routing. What Is Microsoft Founry Model Router? Microsoft Foundry Model Router is a deployment option in Microsoft Foundry that sits between your application and a pool of AI models. Instead of hard-coding a specific model like gpt-4o or gpt-4o-mini , you deploy a Model Router endpoint and let Azure decide which underlying model serves each request. How It Works Your application sends an inference request to the Model Router deployment. Model Router analyzes the request (prompt complexity, token count, required capabilities). It selects the most appropriate model from the available pool. The response is returned transparently — your application code doesn't change. Why This Matters Cost optimization — Simple prompts get routed to smaller, cheaper models. Complex prompts go to more capable (and expensive) ones. Latency reduction — Lightweight prompts complete faster when they don't need a heavyweight model. Resilience — If one model is experiencing high load or throttling, traffic can shift to alternatives. Simplified application code — No need to build your own model-selection logic. The Two Runtime Paths Microsoft Foundry offers two distinct endpoint configurations for hitting Model Router. Even though both use the Chat Completions API, they may have different routing behaviour: Path SDK Endpoint AOAI + Chat Completions OpenAI JS SDK https://.cognitiveservices.azure.com/openai/deployments/ Foundry Project + Chat Completions OpenAI JS SDK (separate client) https://.cognitiveservices.azure.com/openai/deployments/ Understanding whether these two paths produce the same routing decisions is critical for production applications. If the same prompt routes to different models depending on which endpoint you use, that's a signal you need to investigate. Introducing RouteLens RouteLens is a Node.js tool that automates this comparison. It: Sends a configurable set of prompts across categories (echo, summarize, code, reasoning) through both paths. Logs every response to structured JSONL files for post-hoc analysis. Computes statistics including p50/p95 latency, error rates, and model-choice distribution. Highlights routing differences — where the same prompt was served by different models across paths. Provides a web dashboard for interactive testing and real-time result visualization. The Web Dashboard The built-in web UI makes it easy to run tests and explore results without parsing log files: The dashboard includes: KPI Dashboard — Key metrics at a glance: Success Rate, Avg TPS, Gen TPS, Peak TPS, Fastest Response, p50/p95 Latency, Most Reliable Path, Total Tokens Summary view — Per-path/per-category stats with success rate, TPS, and latency Model Comparison — Side-by-side view of which models were selected by each path Latency Charts — Visual bar charts comparing p50 and p95 latencies Error Analysis — Error distribution and detailed error messages Live Feed — Real-time streaming of results as they come in Log Viewer — Browse and inspect historical JSONL log files Model Comparison — See which models were selected by each routing path for every prompt: Live Feed — Real-time streaming of results as they come in: Log Viewer — Browse and inspect historical JSONL log files with parsed table views: Mobile Responsive — The UI adapts to smaller screens: Getting Started Prerequisites Node.js 18+ (LTS recommended) An Azure subscription with a Foundry project Model Router deployed in your Foundry project An API key from your Azure OpenAI / Foundry resource The API version (e.g. 2024-05-01-preview ) Setup # Clone and install git clone https://github.com/leestott/modelrouter-routelens/ cd modelrouter-routelens npm install # Configure your endpoints cp .env.example .env # Edit .env with your Azure endpoints (see below) Configuration The .env file needs these key settings: # Your Foundry / Cognitive Services deployment endpoint # Format: https://<resource>.cognitiveservices.azure.com/openai/deployments/<deployment> # Do NOT include /chat/completions or ?api-version FOUNDRY_PROJECT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router AOAI_BASE_URL=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router # API key from your Azure OpenAI / Foundry resource AOAI_API_KEY=your-api-key-here # Azure OpenAI API version AOAI_API_VERSION=2024-05-01-preview</resource></resource></deployment></resource> Running Tests # Full test matrix — sends all prompts through both paths npm run run:matrix # 408 timeout diagnostic — focuses on the Responses path timeout issue npm run run:repro408 # Web UI — interactive dashboard npm run ui # Then open http://localhost:3002 (or the port set in UI_PORT) Understanding the Results Latency Comparison The latency charts show p50 (median) and p95 (tail) latency for each path and prompt category: Key things to look for: Large p50 differences between paths suggest one path has consistently higher overhead. High p95 values indicate tail latency problems — possibly timeouts or retries. Category-specific patterns — If code prompts are slow on one path but fast on another, that's a routing difference worth investigating. Model Comparison The model comparison view shows which models were selected for each prompt: When both paths select the same model, you see a green "Match" indicator. When they differ, it's flagged in red — these are the cases you want to investigate. Error Analysis The errors view helps diagnose reliability issues: Common error patterns: 408 Timeout — The Responses path may take longer for certain prompt categories 401 Unauthorized — Authentication configuration issues 429 Rate Limited — You're hitting throughput limits 500 Internal Server Error — Backend model issues Best Practices for Designing Applications with Model Router 1. Design Prompts with Routing in Mind Model Router makes decisions based on prompt characteristics. To get the best routing: Keep prompts focused — A clear, single-purpose prompt is easier for the router to classify than a multi-part prompt that spans multiple complexity levels. Use system messages effectively — A well-structured system message helps the router understand the task complexity. Separate complex chains — If you have a multi-step workflow, make each step a separate API call rather than one massive prompt. This lets the router use a cheaper model for simple steps. 2. Set Appropriate Timeouts Different models have different latency profiles. Your timeout settings should account for the slowest model the router might select: // Too aggressive — may timeout when routed to a larger model const TIMEOUT = 5000; // 5s // Better — allows headroom for model variation const TIMEOUT = 30000; // 30s // Best — use different timeouts based on expected complexity function getTimeout(category) { switch (category) { case 'echo': return 10000; case 'summarize': return 20000; case 'code': return 45000; case 'reasoning': return 60000; default: return 30000; } } 3. Implement Robust Retry Logic Because the router may select different models on retry, transient failures can resolve themselves: async function callWithRetry(prompt, maxRetries = 3) { for (let attempt = 0; attempt < maxRetries; attempt++) { try { return await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); } catch (err) { if (attempt === maxRetries - 1) throw err; // Exponential backoff await new Promise(r => setTimeout(r, 1000 * Math.pow(2, attempt))); } } } 4. Monitor Model Selection in Production Log which model was selected for each request so you can track routing patterns over time: const response = await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); // The model field in the response tells you which model was actually used console.log(`Routed to: ${response.model}`); console.log(`Tokens: ${response.usage.total_tokens}`); 5. Use the Right API Path for Your Use Case Based on our testing with RouteLens, consider: Chat Completions path — The standard path for chat-style interactions. Uses the openai SDK directly. Foundry Project path — Uses the same Chat Completions API but through the Foundry project endpoint. Useful for comparing routing behaviour across different endpoint configurations. Note: The Responses API ( /responses ) is not currently available on cognitiveservices.azure.com Model Router deployments. Both paths in RouteLens use Chat Completions. 6. Test Before You Ship Run RouteLens as part of your pre-production validation: # In your CI/CD pipeline or pre-deployment check npm run run:matrix -- --runs 10 --concurrency 4 This helps you: Catch routing regressions when Azure updates model pools Verify that your prompt changes don't cause unexpected model selection shifts Establish latency baselines for alerting Architecture Overview RouteLens sends configurable prompts through two distinct Azure AI runtime paths and compares routing decisions, latency, and reliability. The Matrix Runner dispatches prompts to both the Chat Completions Client (OpenAI JS SDK → AOAI endpoint) and the Project Responses Client ( azure/ai-projects → Foundry endpoint). Both paths converge at Azure Model Router, which intelligently selects the optimal backend model. Results are logged to JSONL files and rendered in the web dashboard. Key Benefits of Model Router Benefit Description Cost savings Automatically routes simple prompts to cheaper models, reducing spend by 30-50% in typical workloads Lower latency Simple prompts complete faster on lightweight models Zero code changes Same API contract as a standard model deployment — just change the deployment name Future-proof As Azure adds new models to the pool, your application benefits automatically Built-in resilience Routing adapts to model availability and load conditions Conclusion Azure Model Router represents a shift from "pick a model" to "describe your task and let the platform decide." This is a natural evolution for AI applications — just as cloud platforms abstract away server selection, Model Router abstracts away model selection. RouteLens gives you the visibility to trust that abstraction. By systematically comparing routing behavior across API paths and prompt categories, you can deploy Model Router with confidence and catch issues before your users do. The tool is open source under the MIT license. Try it out, file issues, and contribute improvements: GitHub Repository Model Router Documentation Microsoft Foundry891Views1like0CommentsMicrosoft Olive & Olive Recipes: A Practical Guide to Model Optimization for Real-World Deployment
Why your model runs great on your laptop but fails in the real world You have trained a model. It scores well on your test set. It runs fine on your development machine with a beefy GPU. Then someone asks you to deploy it to a customer's edge device, a cloud endpoint with a latency budget, or a laptop with no discrete GPU at all. Suddenly the model is too large, too slow, or simply incompatible with the target runtime. You start searching for quantisation scripts, conversion tools, and hardware-specific compiler flags. Each target needs a different recipe, and the optimisation steps interact in ways that are hard to predict. This is the deployment gap. It is not a knowledge gap; it is a tooling gap. And it is exactly the problem that Microsoft Olive is designed to close. What is Olive? Olive is an easy-to-use, hardware-aware model optimisation toolchain that composes techniques across model compression, optimisation, and compilation. Rather than asking you to string together separate conversion scripts, quantisation utilities, and compiler passes by hand, Olive lets you describe what you have and what you need, then handles the pipeline. In practical terms, Olive takes a model source, such as a PyTorch model or an ONNX model (and other supported formats), plus a configuration that describes your production requirements and target hardware accelerator. It then runs the appropriate optimisation passes and produces a deployment-ready artefact. You can think of it as a build system for model optimisation: you declare the intent, and Olive figures out the steps. Official repo: github.com/microsoft/olive Documentation: microsoft.github.io/Olive Key advantages: why Olive matters for your workflow A. Optimise once, deploy across many targets One of the hardest parts of deploying models in production is that "production" is not one thing. Your model might need to run on a cloud GPU, an edge CPU, or a Windows device with an NPU. Each target has different memory constraints, instruction sets, and runtime expectations. Olive supports targeting CPU, GPU, and NPU through its optimisation workflow. This means a single toolchain can produce optimised artefacts for multiple deployment targets, expanding the number of platforms you can serve without maintaining separate optimisation scripts for each one. The conceptual workflow is straightforward: Olive can download, convert, quantise, and optimise a model using an auto-optimisation style approach where you specify the target device (cpu, gpu, or npu). This keeps the developer experience consistent even as the underlying optimisation strategy changes per target. B. ONNX as the portability layer If you have heard of ONNX but have not used it in anger, here is why it matters: ONNX gives your model a common representation that multiple runtimes understand. Instead of being locked to one framework's inference path, an ONNX model can run through ONNX Runtime and take advantage of whatever hardware is available. Olive supports ONNX conversion and optimisation, and can generate a deployment-ready model package along with sample inference code in languages like C#, C++, or Python. That package is not just the model weights; it includes the configuration and code needed to load and run the model on the target platform. For students and early-career engineers, this is a meaningful capability: you can train in PyTorch (the ecosystem you already know) and deploy through ONNX Runtime (the ecosystem your production environment needs). C. Hardware-specific acceleration and execution providers When Olive targets a specific device, it does not just convert the model format. It optimises for the execution provider (EP) that will actually run the model on that hardware. Execution providers are the bridge between the ONNX Runtime and the underlying accelerator. Olive can optimise for a range of execution providers, including: Vitis AI EP (AMD) – for AMD accelerator hardware OpenVINO EP (Intel) – for Intel CPUs, integrated GPUs, and VPUs QNN EP (Qualcomm) – for Qualcomm NPUs and SoCs DirectML EP (Windows) – for broad GPU support on Windows devices Why does EP targeting matter? Because the difference between a generic model and one optimised for a specific execution provider can be significant in terms of latency, throughput, and power efficiency. On battery-powered devices especially, the right EP optimisation can be the difference between a model that is practical and one that drains the battery in minutes. D. Quantisation and precision options Quantisation is one of the most powerful levers you have for making models smaller and faster. The core idea is reducing the numerical precision of model weights and activations: FP32 (32-bit floating point) – full precision, largest model size, highest fidelity FP16 (16-bit floating point) – roughly half the memory, usually minimal quality loss for most tasks INT8 (8-bit integer) – significant size and speed gains, moderate risk of quality degradation depending on the model INT4 (4-bit integer) – aggressive compression for the most constrained deployment scenarios Think of these as a spectrum. As you move from FP32 towards INT4, models get smaller and faster, but you trade away some numerical fidelity. The practical question is always: how much quality can I afford to lose for this use case? Practical heuristics for choosing precision: FP16 is often a safe default for GPU deployment. In practice, you might start here and only go lower if you need to. INT8 is a strong choice for CPU-based inference where memory and compute are constrained but accuracy requirements are still high (e.g., classification, embeddings, many NLP tasks). INT4 is worth exploring when you are deploying large language models to edge or consumer devices and need aggressive size reduction. Expect to validate quality carefully, as some tasks and model architectures tolerate INT4 better than others. Olive handles the mechanics of applying these quantisation passes as part of the optimisation pipeline, so you do not need to write custom quantisation scripts from scratch. Showcase: model conversion stories To make this concrete, here are three plausible optimisation scenarios that illustrate how Olive fits into real workflows. Story 1: PyTorch classification model → ONNX → quantised for cloud CPU inference Starting point: A PyTorch image classification model fine-tuned on a domain-specific dataset. Target hardware: Cloud CPU instances (no GPU budget for inference). Optimisation intent: Reduce latency and cost by quantising to INT8 whilst keeping accuracy within acceptable bounds. Output: An ONNX model optimised for CPU execution, packaged with configuration and sample inference code ready for deployment behind an API endpoint. Story 2: Hugging Face language model → optimised for edge NPU Starting point: A Hugging Face transformer model used for text summarisation. Target hardware: A laptop with an integrated NPU (e.g., a Qualcomm-based device). Optimisation intent: Shrink the model to INT4 to fit within NPU memory limits, and optimise for the QNN execution provider to leverage the neural processing unit. Output: A quantised ONNX model configured for QNN EP, with packaging that includes the model, runtime configuration, and sample code for local inference. Story 3: Same model, two targets – GPU vs. NPU Starting point: A single PyTorch generative model used for content drafting. Target hardware: (A) Cloud GPU for batch processing, (B) On-device NPU for interactive use. Optimisation intent: For GPU, optimise at FP16 for throughput. For NPU, quantise to INT4 for size and power efficiency. Output: Two separate optimised packages from the same source model, one targeting DirectML EP for GPU, one targeting QNN EP for NPU, each with appropriate precision, runtime configuration, and sample inference code. In each case, Olive handles the multi-step pipeline: conversion, optimisation passes, quantisation, and packaging. The developer's job is to define the target and validate the output quality. Introducing Olive Recipes If you are new to model optimisation, staring at a blank configuration file can be intimidating. That is where Olive Recipes comes in. The Olive Recipes repository complements Olive by providing recipes that demonstrate features and use cases. You can use them as a reference for optimising publicly available models or adapt them for your own proprietary models. The repository also includes a selection of ONNX-optimised models that you can study or use as starting points. Think of recipes as worked examples: each one shows a complete optimisation pipeline for a specific scenario, including the configuration, the target hardware, and the expected output. Instead of reinventing the pipeline from scratch, you can find a recipe close to your use case and modify it. For students especially, recipes are a fast way to learn what good optimisation configurations look like in practice. Taking it further: adding custom models to Foundry Local Once you have optimised a model with Olive, you may want to serve it locally for development, testing, or fully offline use. Foundry Local is a lightweight runtime that downloads, manages, and serves language models entirely on-device via an OpenAI-compatible API, with no cloud dependency and no API keys required. Important: Foundry Local only supports specific model templates. At present, these are the chat template (for conversational and text-generation models) and the whisper template (for speech-to-text models based on the Whisper architecture). If your model does not fit one of these two templates, it cannot currently be loaded into Foundry Local. Compiling a Hugging Face model for Foundry Local If your optimised model uses a supported architecture, you can compile it from Hugging Face for use with Foundry Local. The high-level process is: Choose a compatible Hugging Face model. The model must match one of Foundry Local's supported templates (chat or whisper). For chat models, this typically means decoder-only transformer architectures that support the standard chat format. Use Olive to convert and optimise. Olive handles the conversion from the Hugging Face source format into an ONNX-based, quantised artefact that Foundry Local can serve. This is where your Olive skills directly apply. Register the model with Foundry Local. Once compiled, you register the model so that Foundry Local's catalogue recognises it and can serve it through the local API. For the full step-by-step guide, including exact commands and configuration details, refer to the official documentation: How to compile Hugging Face models for Foundry Local. For a hands-on lab that walks through the complete workflow, see Foundry Local Lab, specifically Lab 10 which covers bringing custom models into Foundry Local. Why does this matter? The combination of Olive and Foundry Local gives you a complete local workflow: optimise your model with Olive, then serve it with Foundry Local for rapid iteration, privacy-sensitive workloads, or environments without internet connectivity. Because Foundry Local exposes an OpenAI-compatible API, your application code can switch between local and cloud inference with minimal changes. Keep in mind the template constraint. If you are planning to bring a custom model into Foundry Local, verify early that it fits the chat or whisper template. Attempting to load an unsupported architecture will not work, regardless of how well the model has been optimised. Contributing: how to get involved The Olive ecosystem is open source, and contributions are welcome. There are two main ways to contribute: A. Contributing recipes If you have built an optimisation pipeline that works well for a specific model, hardware target, or use case, consider contributing it as a recipe. Recipes are repeatable pipeline configurations that others can learn from and adapt. B. Sharing optimised model outputs and configurations If you have produced an optimised model that might be useful to others, sharing the optimisation configuration and methodology (and, where licensing permits, the model itself) helps the community build on proven approaches rather than starting from zero. Contribution checklist Reproducibility: Can someone else run your recipe or configuration and get comparable results? Licensing: Are the base model weights, datasets, and any dependencies properly licensed for sharing? Hardware target documented: Have you specified which device and execution provider the optimisation targets? Runtime documented: Have you noted the ONNX Runtime version and any EP-specific requirements? Quality validation: Have you included at least a basic accuracy or quality check for the optimised output? If you are a student or early-career developer, contributing a recipe is a great way to build portfolio evidence that you understand real deployment concerns, not just training. Try it yourself: a minimal workflow Here is a conceptual walkthrough of the optimisation workflow using Olive. The idea is to make the mental model concrete. For exact CLI flags and options, refer to the official Olive documentation. Choose a model source. Start with a PyTorch or Hugging Face model you want to optimise. This is your input. Choose a target device. Decide where the model will run: cpu , gpu , or npu . Choose an execution provider. Pick the EP that matches your hardware, for example DirectML for Windows GPU, QNN for Qualcomm NPU, or OpenVINO for Intel. Choose a precision. Select the quantisation level: fp16 , int8 , or int4 , based on your size, speed, and quality requirements. Run the optimisation. Olive will convert, quantise, optimise, and package the model for your target. The output is a deployment-ready artefact with model files, configuration, and sample inference code. A conceptual command might look like this: # Conceptual example – refer to official docs for exact syntax olive auto-opt --model-id my-model --device cpu --provider onnxruntime --precision int8 After optimisation, validate the output. Run your evaluation benchmark on the optimised model and compare quality, latency, and model size against the original. If INT8 drops quality below your threshold, try FP16. If the model is still too large for your device, explore INT4. Iteration is expected. Key takeaways Olive bridges training and deployment by providing a single, hardware-aware optimisation toolchain that handles conversion, quantisation, optimisation, and packaging. One source model, many targets: Olive lets you optimise the same model for CPU, GPU, and NPU, expanding your deployment reach without maintaining separate pipelines. ONNX is the portability layer that decouples your training framework from your inference runtime, and Olive leverages it to generate deployment-ready packages. Precision is a design choice: FP16, INT8, and INT4 each serve different deployment constraints. Start conservative, measure quality, and compress further only when needed. Olive Recipes are your starting point: Do not build optimisation pipelines from scratch when worked examples exist. Learn from recipes, adapt them, and contribute your own. Foundry Local extends the workflow: Once your model is optimised, Foundry Local can serve it on-device via a standard API, but only if it fits a supported template (chat or whisper). Resources Microsoft Olive – GitHub repository Olive documentation Olive Recipes – GitHub repository How to compile Hugging Face models for Foundry Local Foundry Local Lab – hands-on labs (see Lab 10 for custom models) Foundry Local documentation399Views0likes0Comments