microsoft foundry
60 TopicsMicrosoft Foundry: Unlock Adaptive, Personalized Agents with User-Scoped Persistent Memory
From Knowledgeable to Personalized: Why Memory Matters Most AI agents today are knowledgeable — they ground responses in enterprise data sources and rely on short‑term, session‑based memory to maintain conversational coherence. This works well within a single interaction. But once the session ends, the context disappears. The agent starts fresh, unable to recall prior interactions, user preferences, or previously established context. In reality, enterprise users don’t interact with agents exclusively in one‑off sessions. Conversations can span days, weeks, evolving across multiple interactions rather than isolated sessions. Without a way to persist and safely reuse relevant context across interactions, AI agents remain efficient in the short term be being stateful within a session, but lose continuity over time due to their statelessness across sessions. Bridging this gap between short-term efficiency and long‑term adaptation exposes a deeper challenge. Persisting memory across sessions is not just a technical decision; in enterprise environments, it introduces legitimate concerns around privacy, data isolation, governance, and compliance — especially when multiple users interact with the same agent. What seems like an obvious next step quickly becomes a complex architectural problem, requiring organizations to balance the ability for agents to learn and adapt over time with the need to preserve trust, enforce isolation boundaries, and meet enterprise compliance requirements. In this post, I’ll walk through a practical design pattern for user‑scoped persistent memory, including a reference architecture and a deployable sample implementation that demonstrates how to apply this pattern in a real enterprise setting while preserving isolation, governance, and compliance. The Challenge of Persistent Memory in Enterprise AI Agents Extending memory beyond a single session seems like a natural way to make AI agents more adaptive. Retaining relevant context over time — such as preferences, prior decisions, or recurring patterns — would allow an agent to progressively tailor its behavior to each user, moving from simple responsiveness toward genuine adaptation. In enterprise environments, however, persistence introduces a different class of risk. Storing and reusing user context across interactions raises questions of privacy, data isolation, governance, and compliance — particularly when multiple users interact with shared systems. Without clear ownership and isolation boundaries, naïvely persisted memory can lead to cross‑user data leakage, policy violations, or unclear retention guarantees. As a result, many systems default to ephemeral, session‑only memory. This approach prioritizes safety and simplicity — but does so at the cost of long‑term personalization and continuity. The challenge, then, is not whether agents should remember, but how memory can be introduced without violating enterprise trust boundaries. Persistent Memory: Trade‑offs Between Abstraction and Control As AI agents evolve toward more adaptive behavior, several approaches to agent memory are emerging across the ecosystem. Each reflects a different set of trade-offs between abstraction, flexibility, and control — making it useful to briefly acknowledge these patterns before introducing the design presented here. Microsoft Foundry Agent Service includes a built‑in memory capability (currently in Preview) that enables agents to retain context beyond a single interaction. This approach integrates tightly with the Foundry runtime and abstracts much of the underlying memory management, making it well suited for scenarios that align closely with the managed agent lifecycle. Another notable approach combines Mem0 with Azure AI Search, where memory entries are stored and retrieved through vector search. In this model, memory is treated as an embedding‑centric store that emphasizes semantic recall and relevance. Mem0 is intentionally opinionated, defining how memory is structured, summarized, and retrieved to optimize for ease of use and rapid iteration. Both approaches represent meaningful progress. At the same time, some enterprises require an approach where user memory is explicitly owned, scoped, and governed within their existing data architecture — rather than implicitly managed by an agent framework or memory library. These requirements often stem from stricter expectations around data isolation, compliance, and long‑term control. User-Scoped Persistent Memory with Azure Cosmos DB The solution presented in this post provides a practical reference implementation for organizations that require explicit control over how user memory is stored, scoped, and governed. Rather than embedding long‑term memory implicitly within the agent runtime, this design models memory as a first‑class system component built on Azure Cosmos DB. At a high level, the architecture introduces user‑scoped persistent memory: a durable memory layer in which each user’s context is isolated and managed independently. Persistent memory is stored in Azure Cosmos DB containers partitioned by user identity and consists of curated, long‑lived signals — such as preferences, recurring intent, or summarized outcomes from prior interactions — rather than raw conversational transcripts. This keeps memory intentional, auditable, and easy to evolve over time. Short‑term, in‑session conversation state remains managed by Microsoft Foundry on the server side through its built‑in conversation and thread model. By separating ephemeral session context from durable user memory, the system preserves conversational coherence while avoiding uncontrolled accumulation of long‑term state within the agent runtime. This design enables continuity and personalization across sessions while deliberately avoiding the risks associated with shared or global memory models, including cross‑user data leakage, unclear ownership, and unintended reuse of context. Azure Cosmos DB provides enterprises with direct control over memory isolation, data residency, retention policies, and operational characteristics such as consistency, availability, and scale. In this architecture, knowledge grounding and memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data sources. User‑scoped persistent memory ensures relevance by tailoring interactions to the individual user over time. Together, they enable trustworthy, adaptive AI agents that improve with use — without compromising enterprise boundaries. Architecture Components and Responsibilities Identity and User Scoping Microsoft Entra ID (App Registrations) — provides the frontend a client ID and tenant ID so the Microsoft Authentication Library (MSAL) can authenticate users via browser redirect. The oid (Object ID) claim from the ID token is used as the user identifier throughout the system. Agent Runtime and Orchestration Microsoft Foundry — serves as the unified AI platform for hosting models, managing agents, and maintaining conversation state. Foundry manages in‑session and thread‑level memory on the server side, preserving conversational continuity while keeping ephemeral context separate from long‑term user memory. Backend Agent Service — implements the AI agent using Microsoft Foundry’s agent and conversation APIs. The agent is responsible for reasoning, tool‑calling decisions, and response generation, delegating memory and search operations to external MCP servers. Memory and Knowledge Services MCP‑Memory — MCP server that hosts tools for extracting structured memory signals from conversations, generating embeddings, and persisting user‑scoped memories. Memories are written to and retrieved from Azure Cosmos DB, enforcing strict per‑user isolation. MCP‑Search — MCP server exposing tools for querying enterprise knowledge sources via Azure AI Search. This separation ensures that knowledge grounding and memory retrieval remain distinct concerns. Azure Cosmos DB for NoSQL — provides the durable, serverless document store for user‑scoped persistent memory. Memory containers are partitioned by user ID, enabling isolation, auditable access, configurable retention policies, and predictable scalability. Vector search is used to support semantic recall over stored memory entries. Azure AI Search — supplies hybrid retrieval (keyword and vector) with semantic reranking over the enterprise knowledge index. An integrated vectorizer backed by an embedding model is used for query‑time vectorization. Models text‑embedding‑3‑large — used for generating vector embeddings for both user‑scoped memories and enterprise knowledge search. gpt‑5‑mini — used for lightweight analysis tasks, such as extracting structured memory facts from conversational context. gpt‑5.1 — powers the AI agent, handling multi‑turn conversations, tool invocation, and response synthesis. Application and Hosting Infrastructure Frontend Web Application — a React‑based web UI that handles user authentication and presents a conversational chat interface. Azure Container Apps Environment — provides a shared execution environment for all services, including networking, scaling, and observability. Azure Container Apps — hosts the frontend, backend agent service, and MCP servers as independently scalable containers. Azure Container Registry — stores container images for all application components. Try It Yourself Demonstration of user‑scoped persistent memory across sessions. To make these concepts concrete, I’ve published a working reference implementation that demonstrates the architecture and patterns described above. The complete solution is available in the Agent-Memory GitHub repository. The repository README includes prerequisites, environment setup notes, and configuration details. Start by cloning the repository and moving into the project directory: git clone https://github.com/mardianto-msft/azure-agent-memory.git cd azure-agent-memory Next, sign in to Azure using the Azure CLI: az login Then authenticate the Azure Developer CLI: azd auth login Once authenticated, deploy the solution: azd up After deployment is complete, sign in using the provided demo users and interact with the agent across multiple sessions. Each user’s preferences and prior context are retained independently, the interaction continues seamlessly after signing out and returning later, and user context remains fully isolated with no cross‑identity leakage. The solution also includes a knowledge index initialized with selected Microsoft Outlook Help documentation, which the agent uses for knowledge grounding. This index can be easily replaced or extended with your own publicly accessible URLs to adapt the solution to different domains. Looking Ahead: Personalized Memory as a Foundation for Adaptive Agents As enterprise AI agents evolve, many teams are looking beyond larger models and improved retrieval toward human‑centered personalization at scale — building agents that adapt to individual users while operating within clearly defined trust boundaries. User‑scoped persistent memory enables this shift. By treating memory as a first‑class, user‑owned component, agents can maintain continuity across sessions while preserving isolation, governance, and compliance. Personalization becomes an intentional design choice, aligning with Microsoft’s human‑centered approach to AI, where users retain control over how systems adapt to them. This solution demonstrates how knowledge grounding and personalized memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data. Personalized memory ensures relevance by tailoring interactions to the individual user. Together, they enable context‑aware, adaptive, and personalized agents — without compromising enterprise trust. Finally, this solution is intentionally presented as a reference design pattern, not a prescriptive architecture. It offers a practical starting point for enterprises designing adaptive, personalized agents, illustrating how user‑scoped memory can be modeled, governed, and integrated as a foundational capability for scalable enterprise AI.313Views1like1CommentIntroducing OpenAI’s GPT-image-1.5 in Microsoft Foundry
Developers building with visual AI can often run into the same frustrations: images that drift from the prompt, inconsistent object placement, text that renders unpredictably, and editing workflows that break when iterating on a single asset. That’s why we are excited to announce OpenAI's GPT Image 1.5 is now generally available in Microsoft Foundry. This model can bring sharper image fidelity, stronger prompt alignment, and faster image generation that supports iterative workflows. Starting today, customers can request access to the model and start building in the Foundry platform. Meet GPT Image 1.5 AI driven image generation began with early models like OpenAI's DALL-E, which introduced the ability to transform text prompts into visuals. Since then, image generation models have been evolving to enhance multimodal AI across industries. GPT Image 1.5 represents continuous improvement in enterprise-grade image generation. Building on the success of GPT Image 1 and GPT Image 1 mini, these enhanced models introduce advanced capabilities that cater to both creative and operational needs. The new image model offer: Text-to-image: Stronger instruction following and highly precise editing. Image-to-image: Transform existing images to iteratively refine specific regions Improved visual fidelity: More detailed scenes and realistic rendering. Accelerated creation times: Up to 4x faster generation speed. Enterprise integration: Deploy and scale securely in Microsoft Foundry. GPT Image 1.5 delivers stronger image preservation and editing capabilities, maintaining critical details like facial likeness, lighting, composition, and color tone across iterative changes. You’ll see more consistent preservation of branded logos and key visuals, making it especially powerful for marketing, brand design, and ecommerce workflows—from graphics and logo creation to generating full product catalogs (variants, environments, and angles) from a single source image. Benchmarks Based on an internal Microsoft dataset, GPT Image 1.5 performs higher than other image generation models in prompt alignment and infographics tasks. It focuses on making clear, strong edits – performing best on single-turn modification, delivering the higher visual quality in both single and multi-turn settings. The following results were found across image generation and editing: Text to image Prompt alignment Diagram / Flowchart GPT Image 1.5 91.2% 96.9% GPT Image 1 87.3% 90.0% Qwen Image 83.9% 33.9% Nano Banana Pro 87.9% 95.3% Image editing Evaluation Aspect Modification Preservation Visual Quality Face Preservation Metrics BinaryEval SC (semantic) DINO (Visual) BinaryEval AuraFace Single-turn GPT image 1 99.2% 51.0% 0.14 79.5% 0.30 Qwen image 81.9% 63.9% 0.44 76.0% 0.85 GPT Image 1.5 100% 56.77% 0.14 89.96% 0.39 Multi-turn GPT Image 1 93.5% 54.7% 0.10 82.8% 0.24 Qwen image 77.3% 68.2% 0.43 77.6% 0.63 GPT image 1.5 92.49% 60.55% 0.15 89.46% 0.28 Using GPT Image 1.5 across industries Whether you’re creating immersive visuals for campaigns, accelerating UI and product design, or producing assets for interactive learning GPT Image 1.5 gives modern enterprises the flexibility and scalability they need. Image models can allow teams to drive deeper engagement through compelling visuals, speed up design cycles for apps, websites, and marketing initiatives, and support inclusivity by generating accessible, high‑quality content for diverse audiences. Watch how Foundry enables developers to iterate with multimodal AI across Black Forest Labs, OpenAI, and more: Microsoft Foundry empowers organizations to deploy these capabilities at scale, integrating image generation seamlessly into enterprise workflows. Explore the use of AI image generation here across industries like: Retail: Generate product imagery for catalogs, e-commerce listings, and personalized shopping experiences. Marketing: Create campaign visuals and social media graphics. Education: Develop interactive learning materials or visual aids. Entertainment: Edit storyboards, character designs, and dynamic scenes for films and games. UI/UX: Accelerate design workflows for apps and websites. Microsoft Foundry provides security and compliance with built-in content safety filters, role-based access, network isolation, and Azure Monitor logging. Integrated governance via Azure Policy, Purview, and Sentinel gives teams real-time visibility and control, so privacy and safety are embedded in every deployment. Learn more about responsible AI at Microsoft. Pricing Model Pricing (per 1M tokens) - Global GPT-image-1.5 Input Tokens: $8 Cached Input Tokens: $2 Output Tokens: $32 Cost efficiency improves as well: image inputs and outputs are now cheaper compared to GPT Image 1, enabling organizations to generate and iterate on more creative assets within the same budget. For detailed pricing, refer here. Getting started Learn more about image generation, explore code samples, and read about responsible AI protections here. Try GPT Image 1.5 in Microsoft Foundry and start building multimodal experiences today. Whether you’re designing educational materials, crafting visual narratives, or accelerating UI workflows, these models deliver the flexibility and performance your organization needs.8.5KViews2likes1CommentBuild and Deploy a Microsoft Foundry Hosted Agent: A Hands-On Workshop
Agents are easy to demo, hard to ship. Most teams can put together a convincing prototype quickly. The harder part starts afterwards: shaping deterministic tools, validating behaviour with tests, building a CI path, packaging for deployment, and proving the experience through a user-facing interface. That is where many promising projects slow down. This workshop helps you close that gap without unnecessary friction. You get a guided path from local run to deployment handoff, then complete the journey with a working chat UI that calls your deployed hosted agent through the project endpoint. What You Will Build This is a hands-on, end-to-end learning experience for building and deploying AI agents with Microsoft Foundry. The lab provides a guided and practical journey through hosted-agent development, including deterministic tool design, prompt-guided workflows, CI validation, deployment preparation, and UI integration. It’s designed to reduce setup friction with a ready-to-run experience. It is a prompt-based development lab using Copilot guidance and MCP-assisted workflow options during deployment. It’s a .NET 10 workshop that includes local development, Copilot-assisted coding, CI, secure deployment to Azure, and a working chat UI. A local hosted agent that responds on the responses contract Deterministic tool improvements in core logic with xUnit coverage A GitHub Actions CI workflow for restore, build, test, and container validation An Azure-ready deployment path using azd, ACR image publishing, and Foundry manifest apply A Blazor chat UI that calls openai/v1/responses with agent_reference A repeatable implementation shape that teams can adapt to real projects Who This Lab Is For AI developers and software engineers who prefer learning by building Motivated beginners who want a guided, step-by-step path Experienced developers who want a practical hosted-agent reference implementation Architects evaluating deployment shape, validation strategy, and operational readiness Technical decision-makers who need to see how demos become deployable systems Why Hosted Agents Hosted agents run your code in a managed environment. That matters because it reduces the amount of infrastructure plumbing you need to manage directly, while giving you a clearer path to secure, observable, team-friendly deployments. Prompt-only demos are still useful. They are quick, excellent for ideation, and often the right place to start. Hosted agents complement that approach when you need custom code, tool-backed logic, and a deployment process that can be repeated by a team. Think of this lab as the bridge: you keep the speed of prompt-based iteration, then layer in the real-world patterns needed to run reliably. What You Will Learn 1) Orchestration You will practise workflow-oriented reasoning through implementation-shape recommendations and multi-step readiness scenarios. The lab introduces orchestration concepts at a practical level, rather than as a dedicated orchestration framework deep dive. 2) Tool Integration You will connect deterministic tools and understand how tool calls fit into predictable execution paths. This is a core focus of the workshop and is backed by tests in the solution. 3) Retrieval Patterns (What This Lab Covers Today) This workshop does not include a full RAG implementation with embeddings and vector search. Instead, it focuses on deterministic local tools and hosted-agent response flow, giving you a strong foundation before adding retrieval infrastructure in a follow-on phase. 4) Observability You will see light observability foundations through OpenTelemetry usage in the host and practical verification during local and deployed checks. This is introductory coverage intended to support debugging and confidence building. 5) Responsible AI You will apply production-minded safety basics, including secure secret handling and review hygiene. A full Responsible AI policy and evaluation framework is not the primary goal of this workshop, but the workflow does encourage safe habits from the start. 6) Secure Deployment Path You will move from local implementation to Azure deployment with a secure, practical workflow: azd provisioning, ACR publishing, manifest deployment, hosted-agent start, status checks, and endpoint validation. The Learning Journey The overall flow is simple and memorable: clone, open, run, iterate, deploy, observe. clone -> open -> run -> iterate -> deploy -> observe You are not expected to memorize every command. The lab is structured to help you learn through small, meaningful wins that build confidence. Your First 15 Minutes: Quick Wins Open the repo and understand the lab structure in a few minutes Set project endpoint and model deployment environment variables Run the host locally and validate the responses endpoint Inspect the deterministic tools in WorkshopLab.Core Run tests and see how behaviour changes are verified Review the deployment path so local work maps to Azure steps Understand how the UI validates end-to-end behaviour after deployment Leave the first session with a working baseline and a clear next step That first checkpoint is important. Once you see a working loop on your own machine, the rest of the workshop becomes much easier to finish. Using Copilot and MCP in the Workflow This lab emphasises prompt-based development patterns that help you move faster while still learning the underlying architecture. You are not only writing code, you are learning to describe intent clearly, inspect generated output, and iterate with discipline. Copilot supports implementation and review in the coding labs. MCP appears as a practical deployment option for hosted-agent lifecycle actions, provided your tools are authenticated to the correct tenant and project context. Together, this creates a development rhythm that is especially useful for learning: Define intent with clear prompts Generate or adjust implementation details Validate behaviour through tests and UI checks Deploy and observe outcomes in Azure Refine based on evidence, not guesswork That same rhythm transfers well to real projects. Even if your production environment differs, the patterns from this workshop are adaptable. Production-Minded Tips As you complete the lab, keep a production mindset from day one: Reliability: keep deterministic logic small, testable, and explicit Security: Treat secrets, identity, and access boundaries as first-class concerns Observability: use telemetry and status checks to speed up debugging Governance: keep deployment steps explicit so teams can review and repeat them You do not need to solve everything in one pass. The goal is to build habits that make your agent projects safer and easier to evolve. Start Today: If you have been waiting for the right time to move from “interesting demo” to “practical implementation”, this is the moment. The workshop is structured for self-study, and the steps are designed to keep your momentum high. Start here: https://github.com/microsoft/Hosted_Agents_Workshop_Lab Want deeper documentation while you go? These official guides are great companions: Hosted agent quickstart Hosted agent deployment guide When you finish, share what you built. Post a screenshot or short write-up in a GitHub issue/discussion, on social, or in comments with one lesson learned. Your example can help the next developer get unstuck faster. Copy/Paste Progress Checklist [ ] Clone the workshop repo [ ] Complete local setup and run the agent [ ] Make one prompt-based behaviour change [ ] Validate with tests and chat UI [ ] Run CI checks [ ] Provision and deploy via Azure and Foundry workflow [ ] Review observability signals and refine [ ] Share what I built + one takeaway Common Questions How long does it take? Most developers can complete a meaningful pass in a few focused sessions of 60-75 mins. You can get the first local success quickly, then continue through deployment and refinement at your own pace. Do I need an Azure subscription? Yes, for provisioning and deployment steps. You can still begin local development and testing before completing all Azure activities. Is it beginner-friendly? Yes. The labs are written for beginners, run in sequence, and include expected outcomes for each stage. Can I adapt it beyond .NET? Yes. The implementation in this workshop is .NET 10, but the architecture and development patterns can be adapted to other stacks. What if I am evaluating for a team? This lab is a strong team evaluation asset because it demonstrates end-to-end flow: local dev, integration patterns, CI, secure deployment, and operational visibility. Closing This workshop gives you more than theory. It gives you a practical path from first local run to deployed hosted agent, backed by tests, CI, and a user-facing UI validation loop. If you want a build-first route into Microsoft Foundry hosted-agent development, this is an excellent place to start. Begin now: https://github.com/microsoft/Hosted_Agents_Workshop_Lab104Views0likes0CommentsHow Do We Know AI Isn’t Lying? The Art of Evaluating LLMs in RAG Systems
🔍 1. Why Evaluating LLM Responses is Hard In classical programming, correctness is binary. Input Expected Result 2 + 2 4 ✔ Correct 2 + 2 5 ✘ Wrong Software is deterministic — same input → same output. LLMs are probabilistic. They generate one of many valid word combinations, like forming sentences from multiple possible synonyms and sentence structures. Example: Prompt: "Explain gravity like I'm 10" Possible responses: Response A Response B Gravity is a force that pulls everything to Earth. Gravity bends space-time causing objects to attract. Both are correct. Which is better? Depends on audience. So evaluation needs to look beyond text similarity. We must check: ✔ Is the answer meaningful? ✔ Is it correct? ✔ Is it easy to understand? ✔ Does it follow prompt intent? Testing LLMs is like grading essays — not checking numeric outputs. 🧠 2. Why RAG Evaluation is Even Harder RAG introduces an additional layer — retrieval. The model no longer answers from memory; it must first read context, then summarise it. Evaluation now has multi-dimensions: Evaluation Layer What we must verify Retrieval Did we fetch the right documents? Understanding Did the model interpret context correctly? Grounding Is the answer based on retrieved data? Generation Quality Is final response complete & clear? A simple story makes this intuitive: Teacher asks student to explain Photosynthesis. Student goes to library → selects a book → reads → writes explanation. We must evaluate: Did they pick the right book? → Retrieval Did they understand the topic? → Reasoning Did they copy facts correctly without inventing? → Faithfulness Is written explanation clear enough for another child to learn from? → Answer Quality One failure → total failure. 🧩 3. Two Types of Evaluation 🔹 Intrinsic Evaluation — Quality of the Response Itself Here we judge the answer, ignoring real-world impact. We check: ✔ Grammar & coherence ✔ Completeness of explanation ✔ No hallucination ✔ Logic flow & clarity ✔ Semantic correctness This is similar to checking how well the essay is written. Even if the result did not solve the real problem, the answer could still look good — that’s why intrinsic alone is not enough. 🔹 Extrinsic Evaluation — Did It Achieve the Goal? This measures task success. If a customer support bot writes a beautifully worded paragraph, but the user still doesn’t get their refund — it failed extrinsically. Examples: System Type Extrinsic Goal Banking RAG Bot Did user get correct KYC procedure? Medical RAG Was advice safe & factual? Legal search assistant Did it return the right section of the law? Technical summariser Did summary capture key meaning? Intrinsic = writing quality. Extrinsic = impact quality. A production-grade RAG system must satisfy both. 📏 4. Core RAG Evaluation Metrics (Explained with Very Simple Analogies) Metric Meaning Analogy Relevance Does answer match question? Ask who invented C++? → model talks about Java ❌ Faithfulness No invented facts Book says started 2004, response says 1990 ❌ Groundedness Answer traceable to sources Claims facts that don’t exist in context ❌ Completeness Covers all parts of question User asks Windows vs Linux → only explains Windows Context Recall / Precision Correct docs retrieved & used Student opens wrong chapter Hallucination Rate Degree of made-up info “Taj Mahal is in London” 😱 Semantic Similarity Meaning-level match “Engine died” = “Car stopped running” 💡 Good evaluation doesn’t check exact wording. It checks meaning + truth + usefulness. 🛠 5. Tools for RAG Evaluation 🔹 1. RAGAS — Foundation for RAG Scoring RAGAS evaluates responses based on: ✔ Faithfulness ✔ Relevance ✔ Context recall ✔ Answer similarity Think of RAGAS as a teacher grading with a rubric. It reads both answer + source documents, then scores based on truthfulness & alignment. 🔹 2. LangChain Evaluators LangChain offers multiple evaluation types: Type What it checks String or regex Basic keyword presence Embedding based Meaning similarity, not text match LLM-as-a-Judge AI evaluates AI (deep reasoning) LangChain = testing toolbox RAGAS = grading framework Together they form a complete QA ecosystem. 🔹 3. PyTest + CI for Automated LLM Testing Instead of manually validating outputs, we automate: Feed preset questions to RAG Capture answers Run RAGAS/LangChain scoring Fail test if hallucination > threshold This brings AI closer to software-engineering discipline. RAG systems stop being experiments — they become testable, trackable, production-grade products. 🚀 6. The Future: LLM-as-a-Judge The future of evaluation is simple: LLMs will evaluate other LLMs. One model writes an answer. Another model checks: ✔ Was it truthful? ✔ Was it relevant? ✔ Did it follow context? This enables: Benefit Why it matters Scalable evaluation No humans needed for every query Continuous improvement Model learns from mistakes Real-time scoring Detect errors before user sees them This is like autopilot for AI systems — not only navigating, but self-correcting mid-flight. And that is where enterprise AI is headed. 🎯 Final Summary Evaluating LLM responses is not checking if strings match. It is checking if the machine: ✔ Understood the question ✔ Retrieved relevant knowledge ✔ Avoided hallucination ✔ Provided complete, meaningful reasoning ✔ Grounded answer in real source text RAG evaluation demands multi-layer validation — retrieval, reasoning, grounding, semantics, safety. Frameworks like RAGAS + LangChain evaluators + PyTest pipelines are shaping the discipline of measurable, reliable AI — pushing LLM-powered RAG from cool demo → trustworthy enterprise intelligence. Useful Resources What is Retrieval-Augmented Generation (RAG) : https://azure.microsoft.com/en-in/resources/cloud-computing-dictionary/what-is-retrieval-augmented-generation-rag/ Retrieval-Augmented Generation concepts (Azure AI) : https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/retrieval-augmented-generation RAG with Azure AI Search – Overview : https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview Evaluate Generative AI Applications (Microsoft Learn – Learning Path) : https://learn.microsoft.com/en-us/training/paths/evaluate-generative-ai-apps/ Evaluate Generative AI Models in Microsoft Foundry Portal : https://learn.microsoft.com/en-us/training/modules/evaluate-models-azure-ai-studio/ RAG Evaluation Metrics (Relevance, Groundedness, Faithfulness) : https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/rag-evaluators RAGAS – Evaluation Framework for RAG Systems : https://docs.ragas.io/171Views0likes0CommentsIntroducing MAI-Transcribe-1, MAI-Voice-1, and MAI-Image-2 in Microsoft Foundry
Another Step Towards a Complete AI Platform Since inception, our goal with Microsoft Foundry has been to deliver the most complete AI and app agent factory; giving developers access to the latest frontier models, tools, infrastructure, security, and reliability to confidently build and scale their AI solutions. Today, we're taking another step towards that vision by announcing the public preview of three new models from Microsoft AI in Microsoft Foundry: MAI-Transcribe-1: Our first-generation speech recognition model, delivering enterprise-grade accuracy across 25 languages at approximately 50% lower GPU cost than leading alternatives. MAI-Voice-1: A high-fidelity speech generation model capable of producing 60 seconds of expressive audio in under one second on a single GPU. MAI-Image-2: Our highest-capability text-to-image model, which debuted on #3 on the Arena.ai leaderboard for image model families. These are the same models already powering our own products such as Copilot, Bing, PowerPoint, and Azure Speech, and now they're available exclusively on Foundry for developers to use. We can't wait to see what you create with these new multimedia AI models in public preview. Read on for a deeper look at each model's capabilities and how to start building with them in Foundry! MAI-Transcribe-1 & Voice-1: End-To-End Voice Experiences Voice and speech are rapidly becoming the primary interface for the next generation of AI agents, and building great voice experiences requires models that can both speak and listen with precision. With MAI-Voice-1 and MAI-Transcribe-1, Microsoft is delivering exactly that: a comprehensive, first-party audio AI stack purpose-built for developers. MAI-Voice-1 is a lightning-fast speech generation model capable of producing a full minute of audio in under a second on a single GPU; making it one of the most efficient speech systems available today. On the listening side, MAI-Transcribe-1 supports up to 25 languages and is engineered for enterprise-grade reliability across accents, languages, and real-world audio conditions. But what truly sets it apart is its efficiency: when benchmarked against leading transcription models, MAI-Transcribe-1 delivers competitive accuracy at nearly half the GPU cost; an advantage that translates directly into more predictable, scalable pricing for enterprises 1 . Use cases for MAI-Transcribe-1 and MAI-Voice-1 MAI-Voice-1 and MAI-Transcribe-1 are designed for production use across a broad set of real-world scenarios: Conversational AI & Agent Assist: Enable real‑time transcription for IVR systems, virtual assistants, and call‑center workflows to power voice‑driven interfaces, live agent assist, and post‑call summarization. Live Captioning & Accessibility: Deliver real‑time captions for large events, enterprise meetings, and digital communications to improve accessibility and inclusivity across spoken experiences. Media, Subtitling & Archiving: Automate video subtitling, dialogue indexing, and transcription to support scalable content production, searchability, and long‑term media archiving. Education & Training Platforms: Transcribe lectures, learning modules, and certification programs to enhance discoverability, reviewability, and knowledge retention in e‑learning environments. Customer & Market Insights: Convert spoken interactions across research interviews, focus groups, and support channels into structured data for downstream analytics and business intelligence. We're also applying these model capabilities inside Microsoft's own products. MAI-Voice-1 powers the expressive voice experiences in Copilot's Audio Expressions and podcast features. MAI-Transcribe-1 drives Copilot's Voice Mode transcriptions and the new dictation feature, connecting natural voice input with the generative power of Copilot's language models. Both models are available through Azure Speech, where developers can tap into first-party MAI model quality alongside the enterprise-grade reliability, scalability, and 700+ voice gallery of the Azure Speech ecosystem. Try MAI-Transcribe-1 & Voice-1 Today MAI-Transcribe-1 and Voice-1 are available now through Azure Speech. Here's how to get started: Experiment in MAI Playground: Speak, record, or upload audio to see the models in action at the MAI playground. Build in Foundry: deploy MAI-Transcribe-1 and MAI-Voice-1 in Azure Speech. MAI-Transcribe-1 starts at $0.36 USD per hour, while MAI-Voice-1 pricing starts at $22 USD per 1M characters. Developers looking to create custom voices using MAI-Voice-1 can do so through the Personal Voice feature in Azure Speech — including the ability to clone a voice from a short 10-second audio sample. Note that custom voice creation requires an approval process consistent with Microsoft's responsible AI policies. MAI-Image-2: Limitless Creativity For Every Builder Images are at the center of how developers build compelling AI-powered creative experiences; from marketing tools to content platforms to multimodal agents. MAI-Image-2 is Microsoft's answer to that demand. This model has been developed in close collaboration with photographers, designers, and visual storytellers and debuted in the top-3 text-to-image model families on the Arena.ai leaderboard. It raises the bar across the capabilities that matter most in real creative workflows; more natural, photorealistic image generation, stronger in-image text rendering for infographics and diagrams, and greater precision on complex layouts, detailed scenes, and cinematic visuals. Use cases for MAI-Image-2 Developers can integrate MAI-Image-2 across a range of high-impact workflows: Media & Creative Ideation: Designers, illustrators, and creative teams use text‑to‑image generation to explore visual directions, styles, and compositions early in the creative process—moving from concept to exploration faster. Enterprise Communications & Internal Branding: Organizations create custom visuals for internal campaigns, training materials, and executive communications directly from text, ensuring clarity, polish, and brand alignment without relying on stock imagery. UX & Product Concept Visualization: Product teams visualize interfaces, workflows, environments, and conceptual product scenarios from text descriptions, helping teams communicate ideas and align early—before engineering or design resources are engaged. WPP, one of the world's largest marketing and communications groups, is among the first enterprise partners building with MAI-Image-2 at scale, using it to power creative production workflows that previously required significant manual effort. "MAI-Image-2 is a genuine game-changer. It's a platform that not only responds to the intricate nuance of creative direction, but deeply respects the sheer craft involved in generating real-world, campaign-ready images. WPP has some of the best creative talent in the world and MAI-Image-2 is making them even better." -Rob Reilly, Global Chief Creative Officer, WPP We’re also implementing MAI-Image-2 to power image generation within Microsoft’s own products, including Copilot, Bing Image Creator, and PowerPoint, and now you have access to this powerful, cost effective model for your own apps. Try MAI-Image-2 Today Experiment in the MAI Playground: Preview MAI-Image-2 at MAI playground and share feedback directly with the team. Build in Foundry: deploy MAI-Image-2 via the API and start building your apps and agents! MAI-Image-2 starts at $5 USD per 1M tokens for text input and $33 USD per 1M tokens for image output. We look forward to your feedback on these models in Foundry. References: 1 1 st on overall WER on the FLEURS benchmark. Out of the top 25 global languages, MAI-Transcribe-1 ranks 1st by FLEURS in 11 core languages. It wins against Whisper-large-v3 on the remaining 14 and Gemini 3.1 Flash on 11 of those 14.6.4KViews1like0CommentsHow Veris AI and Lume Security built a self-improving AI agent with Microsoft Foundry
Introduction AI agents are slowly moving from demos into production, where real users, messy systems, and long-tail edge cases lead to new unseen failure modes. Production monitoring can surface issues, but converting those into reliable improvements is slow and manual, bottlenecked by the low volume of repeatable failures, engineering time, and risk of regression on previously working cases. We show how a high-fidelity simulation environment built by Veris AI on Microsoft Azure can expand production failures into families of realistic scenarios, generating enough targeted data to optimize agent behavior through automated context engineering and reinforcement learning, all while not regressing on any previous issue. We demonstrate this on a security agent built by Lume Security on Microsoft Foundry. Lume creates an institutionally grounded security intelligence graph that captures how organizations actually work—this intelligence graph then powers deterministic, policy-aligned agents that reason alongside the user and take trusted action across security, compliance, and IT workflows. Lume helps modern security teams scale expertise without scaling headcount. Its Security Intelligence Platform builds a continuously learning security intelligence graph that reflects how work actually gets done. It reasons over policies, prior tickets, security findings, tool configurations, documentation, and system activity to retain institutional memory and drive consistent response. On this foundation, Lume delivers context-aware security assistants that fetch the most relevant context at the right time, produce policy-aligned recommendations, and execute deterministic actions with full explainability and audit trails. These assistants are embedded in tools teams are already using like ServiceNow, Jira, Slack, and Teams so the experience is seamless and provides value from day one. Microsoft Foundry and Veris AI together provide Lume with a secure, repeatable control plane that makes model iteration, safety, and simulation-driven evaluation practical. Vendor flexibility. Swap or test different models from OpenAI, Anthropic, Meta, and many others, with no infra changes. Fast model rollout. Provide access to the latest models as soon as they are released, making experimentations and updates easy. Consistent safety. Built-in policy and guardrail tooling enforces the same checks across experiments, cutting bespoke guardrail work. Enterprise privacy. Models run in private Azure instances and are not trained on client data, which simplifies and shortens AI security reviews. Made evaluation practical. Centralized models, logging, and policies let Veris run repeatable simulations and feed targeted failure cases back into Lume’s improvement loop. Simulation-driven evaluation. Run repeatable, high-fidelity simulations to stress-test and automatically surface failure modes before production. Agent optimization. Turn the graded failures into upgrades through automatic prompt fixes and targeted fine-tuning/RFT. The Lume Solution: Contextual Intelligence for Security Team Members Security teams are burdened by the time-consuming process of gathering necessary context from various siloed systems, tools, and subject-matter experts. This fragmented approach creates significant latency in decision-making, leads to frequent escalations, and drives up operational costs. Furthermore, incomplete or inaccurate context often results in inconsistent responses and actions that fail to fully mitigate risks. Lume unifies an organization’s fragmented data sources into a security intelligence graph. It also provides purpose-built assistants, embedded in tooling the team already uses, that can fetch relevant content from across the org, produce explainable recommendations, execute actions with full audit trails, and update data sources when source context is missing or misleading. The result is faster, more consistent decisions and fewer avoidable escalations. With just a couple integrations into ticketing and collaboration tools, Lume begins to prioritize where teams can gain the most efficiencies and risk reduction from context intelligence and automation - and automatically builds out assistants that can help. Search. Aggregate prior requests, owners, org, access, policy, live intel, and SIEM. Plan. Produce an institutionally grounded action plan. Review. Present a single decision-ready view with full context to approvers, modifying the plan based on their input. Act. Execute pre-approved changes with full explainability and audit trail. Close the loop. Notify stakeholders, update the graph and docs, log outcomes. Validation with early customers has revealed Lume’s security intelligence and assistants potential to truly change the way enterprises deal with security analysis and tasks. 35–55% less time on routine requests. Measurements with early customers shows the assistant and access to institutional intelligence reduces the time security analysts spend on recurring intake and tactical tasks, freeing staff for higher-value work. Faster and more confident decision making. Qualitative feedback from security team members reveals they feel more confident and can act faster when using the assistant, while also feeling less burdened knowing Lume will help ensure the task is resolved and documented. Improved institutional memory. Every decision, rationale, and action is captured and surfaced in the security intelligence graph, increasing repeatability and reducing future rework—this captured information also updates and cleanses existing documentation to ensure institutional knowledge aligns with current practice. Proactively identify & prioritize opportunities. Lume first identifies and prioritizes the tasks and processes where intelligence and assistants can have the greatest impact—measuring the actual ROI for security teams. Meaningful deflection of routine requests. The assistant can respond, plan, and execute actions (with human review + approval), deflecting common escalations and reducing load on subject matter experts. Architected on Azure By implementing this system on Azure stack, Lume and Veris benefited from the flexibility and breadth of services enabling this large scale implementation. This architecture allows the agent to evolve independently across reasoning, retrieval, and action layers without destabilizing production behavior. Microsoft Foundry orchestrates model usage, prompt execution, safety policies, and evaluation hooks. Azure AI Search provides hybrid retrieval across structured documents, policies, and unstructured artifacts. Vector storage enables semantic retrieval of prior tickets, decisions, and organizational knowledge. Graph databases capture relationships between systems, controls, owners, and historical decisions, allowing the agent to reason over organizational structure rather than isolated facts. Azure Kubernetes Service (AKS) hosts the agent runtime and evaluation workloads, enabling horizontal scaling and isolated experiments. Azure Key Vault manages secrets, credentials, and API access securely. Azure Web App Services power customer-facing interfaces and internal dashboards for visibility and control. Evaluating and ultimately improving agents is still one of the hardest parts of the stack. Static datasets don’t solve this, because agents are not static predictors. They are dynamic, multi-turn systems that change the world as they act: they call tools, accumulate state, recover (or fail to recover) from errors, and face nondeterministic user behavior. A golden dataset might grade a single response as “correct,” while completely missing whether the agent actually achieved the right end state across systems. In other words: static evals grade answers; agent evals must grade outcomes. At the same time, getting enough real-world data to drive improvement is perpetually difficult. The most valuable failures are rare, long-tail, and hard to reproduce on demand. Iterating directly in production is not possible because of safety, compliance, and customer-experience risk. That’s why environments are essential: a place to test agents safely, generate repeatable experience, and create the signals that allows developers to improve behavior without gambling on live users. Veris AI is built around that premise. It is a high-fidelity simulation platform that models the world around an agent with mock tools, realistic state transitions, and simulated users with distinct behaviors. From a single observed production failure, Veris can reconstruct what happened, expand it into a family of realistic scenario variants, and then stress-test the agent across that scenario set. Those runs are evaluated with a mix of LLM-based judges and code-based verifiers that score full trajectories not just the final text. Crucially, Veris does not stop at measurement. Its optimization module uses those grader signals to improve the agent with automatically refining prompts and supporting reinforcement-learning style updates (e.g., RFT) in a closed loop. The result is a workflow where one production incident can become a repeatable training and regression suite, producing targeted improvements on the failure mode while protecting performance on everything that already worked. Veris AI environment is available on Azure Kubernetes Service (AKS) and can be easily deployed in a user Virtual Network. Under the Hood: Building a self-improving agent When a failure appears in production, the improvement loop starts from the trace. Veris takes the failed session logs from the observability pipeline, then an LLM-based evaluator pinpoints what actually went wrong and automatically writes a new targeted evaluation rubric for that specific failure mode. From there, Veris simulator expands the single incident into a scenario family, dozens of realistic variants with branching, so the agent can be trained and re-tested against the whole “class” of the failure, not just one example. This is important as failure modes are often sparse in the real-world. Those scenarios are executed in the simulation engine with mocked tools and simulated users, producing full interaction traces that are then scored by the evaluation engine. The scores become the training signal for optimization. Veris optimization engine uses the evaluation results, the original system prompt, and the new rubric to generate an updated prompt designed to fix the failure without breaking previously-good behavior. Then it validates the new prompt on both (a) the specialized scenario set created from the incident and (b) a broader regression held-out set suite to ensure the improvement generalizes and does not cause regressions. In this case study, we focused on a key failure mode of incorrect workflow classification at the triage step for approval-driven access requests. In these cases, tickets were routed into an escalation or in-progress workflow instead of the approval-validation path. The ticket often contained conflicting approval or rejection signals in human comments - manager approved but they required some additional information, a genuine occurrence in org related jira workflows. The triage agent failed to recognize them and misclassified the workflow state. Since triage determines what runs next, a single misclassification at the start was enough to bypass the Approval Agent and send the request down an incorrect downstream path. Results & Impact By running the optimization loop, we were able to improve agent accuracy on this issue by over 40%, while not regressing on any other correct behavior. We ran experiments on a dataset only containing scenarios with the issue and a more general dataset encompassing a variety of scenarios. The experiment results on both datasets and updated prompt is shown below. Continuing with this over any issues that arise in production or simulation will continuously improve the agent performance. Takeaways This collaboration shows a practical pattern for taking an enterprise agent from “works in a demo” to “improves safely in production”, pairing an orchestration layer that standardizes model usage, safety, logging, and evaluation with a simulation environment where failures can be reproduced and fixed without risking real users, then making it repeatable in practice. In this stack, Veris AI provides the simulations, trajectory grading, and optimization loop, while Microsoft Foundry operationalizes the workflow with vendor-flexible model iteration, consistent policy enforcement, enterprise privacy, and centralized evaluation hooks that turn testing into a first-class system instead of bespoke glue code. The result is an improved and reliable Lume assitant that can help enterprises spend 35–55% less time on routine requests, and meaningfully deflect repetitive escalations, requiring fewer clarification cycles, enabling faster response times, and a stronger institutional memory where decisions and rationales compound over time instead of getting lost across tools and tickets. The self-improving loop continuously improves the agent, starting from a production trace, by generating a targeted rubric, expanding the incident into a scenario family, running it end-to-end with mocked tools and simulated users, scoring full trajectories, then using those scores to produce a safer prompt/model update and validating it against both the failure set and a broader regression suite. This turns rare long-tail failures into repeatable training and regression assets. If you’re building an AI agent, the recommendation is straightforward. Invest early in an orchestration and safety layer, as well as an environment-driven evaluation that can create an improvement loop to ship fixes without regressions. This way, your production failures act as the highest-signal input to continuously harden the system. To learn more or start building, teams can explore the Veris Console for free or browse the Veris documentation . In this stack, Microsoft Foundry provides the orchestration and safety control plane, and Veris provides the simulation, evaluation, and optimization loop needed to make agents improve safely over time. Learn more about the Lume Security assistant here.161Views0likes0Commentso3-mini not returning reasoning tokens
Hi, I work on a service that leverages o3-mini via Microsoft Foundry. In the past few days, I've observed that when calling o3-mini via Microsoft Foundry, that completion_token_details always has the reasoning_tokens value set to 0, regardless of the reasoning setting being used. In my testing, it seems that the reasoning is still occurring, as increasing reasoning value causes the completion_tokens field to increase by a good amount, but none of the reasoning levels cause the reasoning_tokens value to be anything other than 0. Has anyone else encountered this issue? Thanks! Tom14Views0likes0CommentsIntroducing OpenAI’s GPT-5.4 mini and GPT-5.4 nano for low-latency AI
Imagine you’re a developer building a research assistant agent on top of GPT‑5.4. The agent retrieves documents, summarizes findings, and answers follow‑up questions across multiple turns. In early testing, the reasoning quality is strong, but as the agent chains together retrieval, tool calls, and generation, latency starts to add up. For interactive experiences, those delays matter—so many teams adopt a multi‑model approach, using a larger model to plan and smaller models to execute subtasks quickly at scale. This is where GPT‑5.4 mini and GPT‑5.4 nano come in. These smaller variants of GPT-5.4 are optimized for developer workloads where latency, cost savings, and agentic design are top of mind. GPT-5.4 mini and GPT-5.4 nano will be rolling out today in Microsoft Foundry, so you can evaluate them in the model catalog and deploy the right option for each workload. GPT-5.4 mini: efficient reasoning for production workflows GPT-5.4 mini distills GPT-5.4’s strengths into a smaller, more efficient model for developer workloads where responsiveness matters. It significantly improves over GPT-5 mini across coding, reasoning, multimodal understanding, and tool use while running about 2X faster. Text and image inputs: build multimodal experiences that combine prompts with screenshots or other images. Tool use and function calling: reliably invoke tools and APIs for agentic workflows. Web search and file search: ground responses in external or enterprise content as part of multi-step tasks. Computer use: support software-interaction loops where the model interprets UI state and takes well-scoped actions. Where GPT-5.4 mini thrives Developer copilots and coding assistants: latency-sensitive coding help, code review suggestions, and fast iteration loops where turnaround time matters. Multimodal developer workflows: applications that interpret screenshots, understand UI state, or process images as part of coding and debugging loops. Computer-use sub-agents: fast executors that take well-scoped actions in software (for example, navigating UIs or completing repetitive steps) within a larger agent loop coordinated by a planner model. GPT-5.4 nano: ultra-low latency automation at scale GPT-5.4 nano is the smallest and fastest model in the lineup, designed for low-latency and low-cost API usage at high throughput. It’s optimized for short-turn tasks like classification, extraction, and ranking, plus lightweight sub-agent work where speed and cost are the priority and extended multi-step reasoning isn’t required. Strong instruction following: consistent adherence to developer intent across short, well-defined interactions. Function and tool calling: dependable invocation of tools and APIs for lightweight agent and automation scenarios. Coding support: optimized performance for common coding tasks where fast turnaround is required. Image understanding: multimodal image input support for basic image interpretation alongside text. Low-latency, low-cost execution: designed to deliver responses quickly and efficiently at scale. Where GPT-5.4 nano thrives GPT-5.4 nano is a strong fit when you need predictable behavior at very high throughput and the task can be expressed as short, well-scoped instructions. Classification and intent detection: fast labeling and routing decisions for high-volume requests. Extraction and normalization: pull structured fields from text, validate formats, and standardize outputs. Ranking and triage: reorder candidates, prioritize tickets/leads, and select best-next actions under tight latency budgets. Guardrails and policy checks: lightweight safety and policy classification, prompt gating, and enforcement decisions before dispatching to tools or larger models. High-volume text processing pipelines: batch transformation, cleanup, deduping, and normalization steps where unit cost and throughput dominate. Routing and prioritization at the edge: select the right downstream workflow (template, queue, or model) for each request under tight latency budgets. Choosing the right GPT-5.4 model Microsoft Foundry makes it possible to deploy multiple GPT-5.4 variants side by side, so teams can route requests to the model that best fits each task. Here’s a practical way to think about the lineup: Model Best suited for Typical workloads GPT-5.4 Sustained, multi-step reasoning with reliable follow-through Agentic workflows, research assistants, document analysis, complex internal tools GPT-5.4 Pro Deeper, higher-reliability reasoning for complex production scenarios High-stakes agentic workflows, long-form analysis and synthesis, complex planning, advanced internal copilots GPT-5.4 mini Balanced reasoning with lower latency for interactive systems Real-time agents, developer tools, retrieval-augmented applications GPT-5.4 nano Ultra-low latency and high throughput High-volume request routing, real-time chat, lightweight automation Responsible AI in Microsoft Foundry At Microsoft, our mission to empower people and organizations remains constant. In the age of AI, trust is foundational to adoption, and earning that trust requires a commitment to transparency, safety, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy GPT-5.4 models responsibly in production environments, aligned with Microsoft's Responsible AI principles. Pricing Model Deployment Input (USD $/M tokens) Cached input (USD $/M tokens) Output (USD $/M tokens) GPT-5.4 mini Standard Global $0.75 $0.075 $4.5 GPT-5.4 nano Standard Global $0.20 $0.02 $1.25 The models are also available in Data Zone US. It is rolling out to Data Zone EU. Getting started Explore the models in Microsoft Foundry. Sign in to the Foundry portal and browse the model catalog to evaluate GPT-5.4 mini and GPT-5.4 nano alongside other options, then deploy the right model for each workload.9.2KViews0likes1CommentStep-by-Step: Deploy the Architecture Review Agent Using AZD AI CLI
Building an AI agent is easy; operating it is an infrastructure trap. Discover how to use the azd ai CLI extension to streamline your workflow. From local testing to deploying a live Microsoft Foundry hosted agent and publishing it to Microsoft Teams—learn how to do it all without writing complex deployment scripts or needing admin permissions.287Views0likes0CommentsMicrosoft Foundry Model Router: A Developer's Guide to Smarter AI Routing
Introduction When building AI-powered applications on Azure, one of the most impactful decisions you make isn't about which model to use, it's about how your application selects models at runtime. Microsoft Foundry Model Router, available through Microsoft Foundry, automatically routes your inference requests to the best available model based on prompt complexity, latency targets, and cost efficiency. But how do you know it's actually routing correctly? And how do you compare its behavior across different API paths? That's exactly the problem RouteLens solves. It's an open-source Node.js CLI and web-based testing tool that sends configurable prompts through two distinct Azure AI runtime paths and produces a detailed comparison of routing decisions, latency profiles, and reliability metrics. In this post, we'll walk through what Model Router does, why it matters, how to use the validator tool, and best practices for designing applications that get the most out of intelligent model routing. What Is Microsoft Founry Model Router? Microsoft Foundry Model Router is a deployment option in Microsoft Foundry that sits between your application and a pool of AI models. Instead of hard-coding a specific model like gpt-4o or gpt-4o-mini , you deploy a Model Router endpoint and let Azure decide which underlying model serves each request. How It Works Your application sends an inference request to the Model Router deployment. Model Router analyzes the request (prompt complexity, token count, required capabilities). It selects the most appropriate model from the available pool. The response is returned transparently — your application code doesn't change. Why This Matters Cost optimization — Simple prompts get routed to smaller, cheaper models. Complex prompts go to more capable (and expensive) ones. Latency reduction — Lightweight prompts complete faster when they don't need a heavyweight model. Resilience — If one model is experiencing high load or throttling, traffic can shift to alternatives. Simplified application code — No need to build your own model-selection logic. The Two Runtime Paths Microsoft Foundry offers two distinct endpoint configurations for hitting Model Router. Even though both use the Chat Completions API, they may have different routing behaviour: Path SDK Endpoint AOAI + Chat Completions OpenAI JS SDK https://.cognitiveservices.azure.com/openai/deployments/ Foundry Project + Chat Completions OpenAI JS SDK (separate client) https://.cognitiveservices.azure.com/openai/deployments/ Understanding whether these two paths produce the same routing decisions is critical for production applications. If the same prompt routes to different models depending on which endpoint you use, that's a signal you need to investigate. Introducing RouteLens RouteLens is a Node.js tool that automates this comparison. It: Sends a configurable set of prompts across categories (echo, summarize, code, reasoning) through both paths. Logs every response to structured JSONL files for post-hoc analysis. Computes statistics including p50/p95 latency, error rates, and model-choice distribution. Highlights routing differences — where the same prompt was served by different models across paths. Provides a web dashboard for interactive testing and real-time result visualization. The Web Dashboard The built-in web UI makes it easy to run tests and explore results without parsing log files: The dashboard includes: KPI Dashboard — Key metrics at a glance: Success Rate, Avg TPS, Gen TPS, Peak TPS, Fastest Response, p50/p95 Latency, Most Reliable Path, Total Tokens Summary view — Per-path/per-category stats with success rate, TPS, and latency Model Comparison — Side-by-side view of which models were selected by each path Latency Charts — Visual bar charts comparing p50 and p95 latencies Error Analysis — Error distribution and detailed error messages Live Feed — Real-time streaming of results as they come in Log Viewer — Browse and inspect historical JSONL log files Model Comparison — See which models were selected by each routing path for every prompt: Live Feed — Real-time streaming of results as they come in: Log Viewer — Browse and inspect historical JSONL log files with parsed table views: Mobile Responsive — The UI adapts to smaller screens: Getting Started Prerequisites Node.js 18+ (LTS recommended) An Azure subscription with a Foundry project Model Router deployed in your Foundry project An API key from your Azure OpenAI / Foundry resource The API version (e.g. 2024-05-01-preview ) Setup # Clone and install git clone https://github.com/leestott/modelrouter-routelens/ cd modelrouter-routelens npm install # Configure your endpoints cp .env.example .env # Edit .env with your Azure endpoints (see below) Configuration The .env file needs these key settings: # Your Foundry / Cognitive Services deployment endpoint # Format: https://<resource>.cognitiveservices.azure.com/openai/deployments/<deployment> # Do NOT include /chat/completions or ?api-version FOUNDRY_PROJECT_ENDPOINT=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router AOAI_BASE_URL=https://<resource>.cognitiveservices.azure.com/openai/deployments/model-router # API key from your Azure OpenAI / Foundry resource AOAI_API_KEY=your-api-key-here # Azure OpenAI API version AOAI_API_VERSION=2024-05-01-preview</resource></resource></deployment></resource> Running Tests # Full test matrix — sends all prompts through both paths npm run run:matrix # 408 timeout diagnostic — focuses on the Responses path timeout issue npm run run:repro408 # Web UI — interactive dashboard npm run ui # Then open http://localhost:3002 (or the port set in UI_PORT) Understanding the Results Latency Comparison The latency charts show p50 (median) and p95 (tail) latency for each path and prompt category: Key things to look for: Large p50 differences between paths suggest one path has consistently higher overhead. High p95 values indicate tail latency problems — possibly timeouts or retries. Category-specific patterns — If code prompts are slow on one path but fast on another, that's a routing difference worth investigating. Model Comparison The model comparison view shows which models were selected for each prompt: When both paths select the same model, you see a green "Match" indicator. When they differ, it's flagged in red — these are the cases you want to investigate. Error Analysis The errors view helps diagnose reliability issues: Common error patterns: 408 Timeout — The Responses path may take longer for certain prompt categories 401 Unauthorized — Authentication configuration issues 429 Rate Limited — You're hitting throughput limits 500 Internal Server Error — Backend model issues Best Practices for Designing Applications with Model Router 1. Design Prompts with Routing in Mind Model Router makes decisions based on prompt characteristics. To get the best routing: Keep prompts focused — A clear, single-purpose prompt is easier for the router to classify than a multi-part prompt that spans multiple complexity levels. Use system messages effectively — A well-structured system message helps the router understand the task complexity. Separate complex chains — If you have a multi-step workflow, make each step a separate API call rather than one massive prompt. This lets the router use a cheaper model for simple steps. 2. Set Appropriate Timeouts Different models have different latency profiles. Your timeout settings should account for the slowest model the router might select: // Too aggressive — may timeout when routed to a larger model const TIMEOUT = 5000; // 5s // Better — allows headroom for model variation const TIMEOUT = 30000; // 30s // Best — use different timeouts based on expected complexity function getTimeout(category) { switch (category) { case 'echo': return 10000; case 'summarize': return 20000; case 'code': return 45000; case 'reasoning': return 60000; default: return 30000; } } 3. Implement Robust Retry Logic Because the router may select different models on retry, transient failures can resolve themselves: async function callWithRetry(prompt, maxRetries = 3) { for (let attempt = 0; attempt < maxRetries; attempt++) { try { return await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); } catch (err) { if (attempt === maxRetries - 1) throw err; // Exponential backoff await new Promise(r => setTimeout(r, 1000 * Math.pow(2, attempt))); } } } 4. Monitor Model Selection in Production Log which model was selected for each request so you can track routing patterns over time: const response = await client.chat.completions.create({ model: 'model-router', messages: [{ role: 'user', content: prompt }], }); // The model field in the response tells you which model was actually used console.log(`Routed to: ${response.model}`); console.log(`Tokens: ${response.usage.total_tokens}`); 5. Use the Right API Path for Your Use Case Based on our testing with RouteLens, consider: Chat Completions path — The standard path for chat-style interactions. Uses the openai SDK directly. Foundry Project path — Uses the same Chat Completions API but through the Foundry project endpoint. Useful for comparing routing behaviour across different endpoint configurations. Note: The Responses API ( /responses ) is not currently available on cognitiveservices.azure.com Model Router deployments. Both paths in RouteLens use Chat Completions. 6. Test Before You Ship Run RouteLens as part of your pre-production validation: # In your CI/CD pipeline or pre-deployment check npm run run:matrix -- --runs 10 --concurrency 4 This helps you: Catch routing regressions when Azure updates model pools Verify that your prompt changes don't cause unexpected model selection shifts Establish latency baselines for alerting Architecture Overview RouteLens sends configurable prompts through two distinct Azure AI runtime paths and compares routing decisions, latency, and reliability. The Matrix Runner dispatches prompts to both the Chat Completions Client (OpenAI JS SDK → AOAI endpoint) and the Project Responses Client ( azure/ai-projects → Foundry endpoint). Both paths converge at Azure Model Router, which intelligently selects the optimal backend model. Results are logged to JSONL files and rendered in the web dashboard. Key Benefits of Model Router Benefit Description Cost savings Automatically routes simple prompts to cheaper models, reducing spend by 30-50% in typical workloads Lower latency Simple prompts complete faster on lightweight models Zero code changes Same API contract as a standard model deployment — just change the deployment name Future-proof As Azure adds new models to the pool, your application benefits automatically Built-in resilience Routing adapts to model availability and load conditions Conclusion Azure Model Router represents a shift from "pick a model" to "describe your task and let the platform decide." This is a natural evolution for AI applications — just as cloud platforms abstract away server selection, Model Router abstracts away model selection. RouteLens gives you the visibility to trust that abstraction. By systematically comparing routing behavior across API paths and prompt categories, you can deploy Model Router with confidence and catch issues before your users do. The tool is open source under the MIT license. Try it out, file issues, and contribute improvements: GitHub Repository Model Router Documentation Microsoft Foundry539Views1like0Comments