8 Architectural Pillars to Boost GenAI LLM Accuracy and Performance at Low Cost
Smarter AI architecture, not bigger LLMs: how engineering teams push LLM accuracy and performance up at low cost.

Enterprises using LLMs (Large Language Models) hit the same ceiling and pay a big price. A raw API call to a frontier model (GPT-4, Claude, Gemini) delivers only 35-40% accuracy on structured output tasks such as code generation, NL-to-DAX query generation, and domain-specific reasoning. Prompt engineering pushes that to ~60%. But the final 35+ percentage points? Those come from system architecture, not model upgrades.

This guide presents 8 architectural pillars, distilled from production GenAI systems, that compound to close the accuracy gap. These patterns are model-agnostic and domain-agnostic: they apply equally to chatbots, coding assistants, content/query generators, automation agents, and any application where an LLM produces structured or semi-structured output. The guide is based on my recent GenAI projects.

The key takeaway: use the LLM as one component in a larger system, not as the system itself. Surround it with deterministic guardrails, verified knowledge, and feedback loops.

Pillar 1: Enhance Prompts with Verified Knowledge Context

Impact: +35-40% accuracy, the single biggest improvement

The top source of LLM errors in production is hallucinated identifiers: the model invents names, references, or structures that don't exist in the target system. This happens because LLMs are trained on general knowledge but deployed against specific, private enterprise systems, such as local databases and knowledge bases, that they have never seen. The fix is straightforward: inject verified, system-specific context (type definitions, API specs, ontologies, configuration schemas, entity catalogues) directly into the prompt so the model composes from known-good elements rather than recalling from training data. Use a knowledge graph for richer semantic knowledge.
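As a rough illustration of injecting verified context, the sketch below filters a small entity catalogue down to the entries relevant to a request and renders them verbatim into the prompt. The `Entity` shape, the word-overlap filter, and the prompt wording are all assumptions made for this example; a production system would query a search index or knowledge graph instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    """One verified element from the knowledge store (hypothetical shape)."""
    name: str
    type: str
    description: str
    related: tuple[str, ...] = ()


def select_entities(question: str, catalog: list[Entity], limit: int = 10) -> list[Entity]:
    """Keep only entities that share at least one word with the question.

    Plain word overlap is a stand-in for real relevance filtering.
    """
    words = set(question.lower().split())

    def overlap(entity: Entity) -> int:
        text = f"{entity.name} {entity.description}".lower().replace("_", " ")
        return len(words & set(text.split()))

    ranked = sorted(catalog, key=overlap, reverse=True)
    return [e for e in ranked if overlap(e) > 0][:limit]


def build_prompt(question: str, entities: list[Entity]) -> str:
    """Render the selected entities verbatim so the model composes from them."""
    lines = [
        "Use ONLY the identifiers listed below. Do not invent names.",
        "",
        "Verified context:",
    ]
    for e in entities:
        related = f" (related: {', '.join(e.related)})" if e.related else ""
        lines.append(f"- {e.name} [{e.type}]: {e.description}{related}")
    lines += ["", f"Request: {question}"]
    return "\n".join(lines)


catalog = [
    Entity("Sales[Amount]", "measure", "sales amount in USD"),
    Entity("Customer[Region]", "column", "customer sales region", ("Customer[Country]",)),
    Entity("Product[Color]", "column", "product color name"),
]
question = "total sales amount by region"
picked = select_entities(question, catalog)
prompt = build_prompt(question, picked)
```

Here the unrelated `Product[Color]` entry never reaches the prompt, so the model has nothing to confuse it with.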
How to Implement

Provide explicit context, never implicit. Whatever the LLM needs to reference (identifiers, valid values, semantic knowledge, structures) must appear verbatim in the prompt or retrieved context window.

Filter aggressively. A full knowledge base with thousands of entities overwhelms the context window and confuses the model. Use intelligent filtering to surface only the 5-10 most relevant elements per request.

Store structured semantic knowledge in a graph or searchable index. This enables relationship-aware retrieval: "given entity X, what related entities, attributes, and constraints are also needed?"

Include rich semantic metadata. Names alone are insufficient. Include types, constraints, valid value ranges, relationships, and usage notes to minimize ambiguity.

Keep context fresh. Stale context causes a different class of hallucination: the model generates valid-looking output that references outdated structures. Sync your knowledge store with your source of truth.

Why This Works

LLMs excel at composition and reasoning: combining elements, applying logic, following patterns. They are unreliable at recall of specific identifiers: exact names, valid values, structural constraints. By offloading recall to a deterministic retrieval system and giving the LLM only composition tasks, you play to each system's strengths.

Pillar 2: Tiered LLM Approach: Route Deterministically First, Use LLMs Last

Impact: 80% cost reduction, 85% latency reduction, eliminates non-deterministic errors for most traffic.

The most impactful architectural insight: most production requests don't need an LLM at all. A well-designed system handles 60-70% of traffic with deterministic logic (templates, composition rules, cached results) and reserves expensive, non-deterministic LLM calls for genuinely novel inputs.

The Three-Tier Model

These metrics are from a real use case that converts natural-language requests to Power BI DAX queries.

Tier 0: Template slot-filling. Handles requests that match known patterns exactly; the system fills slots in a pre-built template with extracted parameters. No LLM, no non-determinism, near-perfect accuracy, sub-100ms response. Uses LLM: No. Latency: ~50ms. Accuracy: 95-98%.

Tier 1: Compose from pre-validated fragments. Handles requests that combine known patterns in new ways. The system retrieves pre-validated building blocks via search, composes them using deterministic rules, and validates the result. Still no LLM call. Uses LLM: No. Latency: ~200ms. Accuracy: 90-95%.

Tier 2: Full LLM generation with enriched context. Reserved for genuinely novel requests that can't be served deterministically. Even here, the LLM receives maximum support: filtered context, relevant examples, explicit rules, and structured planning. Uses LLM: Yes (1 call). Latency: 2-5s. Accuracy: 88-93%.

Complexity-Based Routing

A lightweight scoring function (evaluated in <1ms) routes each incoming request.

Factors: reasoning depth, number of components, cross-references, constraints, nesting depth, novelty (distance from known patterns)
Score 0-39: Tier 0 (deterministic template)
Score 40-59: Tier 1 if confidence ≥ 85%, else Tier 2
Score 60+: Tier 2 (LLM generation)

This routing achieves 96%+ accuracy in tier assignment and ensures the expensive path is only taken when necessary.

Why This Matters

Cost: 70-80% of requests cost zero LLM tokens
Latency: Majority of responses in <200ms instead of 2-5s
Reliability: Deterministic tiers produce identical output for identical input
Scalability: Deterministic tiers scale horizontally with trivial compute

Pillar 3: Encode Prompt Anti-Patterns as Explicit Rules

Impact: +8-10% accuracy, ~80% reduction in common structural errors

LLM mistakes are patterned, not random. In any domain, 80% of errors cluster around a small set of 6-13 recurring structural mistakes.
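Looking back at the complexity-based routing under Pillar 2, the thresholds quoted there (0-39, 40-59 with confidence ≥ 85%, 60+) can be sketched in a few lines. The article does not say how the factors are weighted, so the weights and the 0.0-1.0 normalisation in `complexity_score` are purely illustrative assumptions.

```python
from dataclasses import dataclass

# Thresholds taken from the routing description in the text.
TIER0_MAX_SCORE = 39
TIER1_MAX_SCORE = 59
TIER1_MIN_CONFIDENCE = 0.85

# Hypothetical weights (they sum to 100); the source does not specify any.
HYPOTHETICAL_WEIGHTS = {
    "reasoning_depth": 25,
    "components": 15,
    "cross_references": 15,
    "constraints": 15,
    "nesting_depth": 10,
    "novelty": 20,
}


@dataclass(frozen=True)
class RequestFeatures:
    """Each factor is assumed to be pre-normalised to the range 0.0-1.0."""
    reasoning_depth: float
    components: float
    cross_references: float
    constraints: float
    nesting_depth: float
    novelty: float


def complexity_score(features: RequestFeatures) -> int:
    """Weighted sum of the factors, clamped to 0-100."""
    total = sum(w * getattr(features, name) for name, w in HYPOTHETICAL_WEIGHTS.items())
    return round(min(max(total, 0), 100))


def route_tier(score: int, confidence: float) -> int:
    """Map a complexity score and a match confidence to Tier 0, 1, or 2."""
    if score <= TIER0_MAX_SCORE:
        return 0  # deterministic template
    if score <= TIER1_MAX_SCORE and confidence >= TIER1_MIN_CONFIDENCE:
        return 1  # composition from pre-validated fragments
    return 2      # full LLM generation


simple = RequestFeatures(0.2, 0.1, 0.0, 0.1, 0.0, 0.1)
tier_for_simple = route_tier(complexity_score(simple), confidence=0.9)
```

Keeping the thresholds as named constants makes it easy to retune them as routing metrics come in.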
Instead of hoping the model avoids these recurring mistakes through general instruction-following, compile them into explicit WRONG => CORRECT rules embedded directly in the system prompt.

How to Implement

Collect error data. Run 100+ requests through your system and categorize the failures. You'll find the same 6-13 patterns appearing repeatedly.

Write concrete rules. For each pattern, show the exact wrong output and the exact correct alternative, with a one-line explanation of why.

Embed in the system prompt. Place rules prominently: after the task description, before examples. Use formatting that's hard to ignore (headers, bold, explicit "NEVER" language).

Keep the list short. 6-13 rules maximum. Beyond that, attention dilutes and the model starts ignoring rules. Prioritize by frequency.

Refresh continuously. As the system improves (via other pillars), some errors disappear and new error types emerge. Update the rule set quarterly.

Why This Works

LLMs respond strongly to explicit negative examples. A generic instruction like "be careful with X" has minimal impact. But showing the exact wrong output the model tends to produce, paired with the correction, creates a strong avoidance signal. It's analogous to unit tests.

Pillar 4: Retrieve Few-Shot Examples Dynamically

Impact: +5-15% accuracy depending on domain complexity

Static examples hardcoded in a prompt become stale and irrelevant, wasting context tokens. Dynamic few-shot retrieval selects the 3-5 most relevant examples for each specific request, maximizing the signal-to-noise ratio in the prompt.
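One possible way to pick those 3-5 examples is to merge a keyword ranking and a vector ranking, the hybrid approach the next section describes. The sketch below uses reciprocal rank fusion (RRF) as the merge step. That choice is an assumption: the article only says "rank fusion" and does not name a method. The example ids and the two input rankings are made up, and in practice would come from a BM25 index and a vector index.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge several ranked id lists; each hit contributes 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, example_id in enumerate(ranking, start=1):
            scores[example_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id so output is deterministic.
    return sorted(scores, key=lambda ex: (-scores[ex], ex))


def select_few_shot(keyword_hits: list[str], vector_hits: list[str], top_n: int = 3) -> list[str]:
    """Return the ids of the top_n examples after fusing both rankings."""
    return reciprocal_rank_fusion([keyword_hits, vector_hits])[:top_n]


# Hypothetical result lists, best match first.
keyword_hits = ["ex_a", "ex_b", "ex_c"]
vector_hits = ["ex_b", "ex_d", "ex_a"]
chosen = select_few_shot(keyword_hits, vector_hits)
```

Examples that appear in both rankings (`ex_b`, `ex_a`) rise above those found by only one strategy, which is the behaviour a hybrid retriever is after.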
Hybrid Retrieval Architecture

The most effective approach combines two search strategies, plus a fusion step, for intent search over natural-language (NL) context:

Keyword search (BM25): Finds examples with exact matching terms, identifiers, and domain vocabulary
Vector search (semantic similarity): Finds examples with similar intent and structure, even if wording differs
Rank fusion: Merges results from both strategies, re-ranking by combined relevance

This hybrid approach outperforms either strategy alone because keyword search catches exact identifier matches that vector search dilutes, while vector search captures semantic similarity that keyword search misses entirely.

Best GenAI Architectural Practices

Match complexity to complexity. Simple requests should see simple examples. Complex requests should see complex examples. Mismatched examples confuse the model.

Include negative examples. For the detected request type, include 1-2 "wrong => correct" pairs alongside positive examples. This reinforces Pillar 3's anti-pattern rules with concrete, contextually relevant demonstrations.

Pre-compute embeddings. Generate vector embeddings at indexing time, not at query time. Cache retrieval results for repeated patterns.

Curate quality over quantity. 3 excellent, diverse examples beat 10 mediocre ones. Each example should demonstrate a distinct pattern or edge case.

Keep examples current. As your system evolves, old examples may demonstrate outdated patterns. Review and refresh the example store periodically.

Pillar 5: Feedback Loop: Validate and Auto-Fix Every Output Deterministically

Impact: +3-5% accuracy as a safety net, plus continuous improvement via feedback

No matter how well-prompted, LLMs will occasionally produce outputs with minor structural errors: wrong casing, missing delimiters, references to slightly incorrect identifiers, or subtle format violations. A deterministic post-processing pipeline catches and fixes these without any additional LLM calls.
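A minimal sketch of one such deterministic fix step is shown below. It assumes a proper parser has already extracted the identifiers referenced in the LLM output, so no parsing happens here. It then repairs casing and whitespace through a lookup table, leaves ambiguous or unknown names untouched, and logs every change. The `FixReport` structure and the rule name `normalize_name` are hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class FixReport:
    fixed: list[str] = field(default_factory=list)    # identifiers after repair
    flagged: list[str] = field(default_factory=list)  # unknown or ambiguous, left as-is
    log: list[tuple[str, str, str]] = field(default_factory=list)  # (rule, before, after)


def fix_identifiers(referenced: list[str], known: list[str]) -> FixReport:
    """Normalise identifier casing against the verified catalogue.

    `referenced` is assumed to come from an upstream grammar/AST parser.
    """
    # Lookup table: trimmed, case-folded name -> every canonical spelling.
    canonical: dict[str, list[str]] = {}
    for name in known:
        canonical.setdefault(name.strip().casefold(), []).append(name)

    report = FixReport()
    for ident in referenced:
        if ident in known:
            report.fixed.append(ident)  # already valid, nothing to do
            continue
        candidates = canonical.get(ident.strip().casefold(), [])
        if len(candidates) == 1:
            report.fixed.append(candidates[0])
            report.log.append(("normalize_name", ident, candidates[0]))
        else:
            # No match, or several possible corrections: pass through
            # unchanged and flag it rather than risk a wrong "fix".
            report.fixed.append(ident)
            report.flagged.append(ident)
    return report


known = ["Sales[Amount]", "Customer[Region]"]
report = fix_identifiers(["sales[amount]", "Customer[Region]", "Sales[Profit]"], known)
```

The `log` entries are the raw material for the feedback loop: fixes that recur often are candidates for promotion into upstream prompt rules.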
The Validation Pipeline

LLM Output => Parse (grammar/AST) => Rule-Based Fixes => Compliance Check/Validation => Final Output

Each stage is fully deterministic:

Parsing: Use a formal grammar or AST parser (ANTLR, tree-sitter, language-native parsers) to structurally analyse the output. Never regex-parse structured output; it's fragile and misses edge cases.
Rule-based fixes: 10-20 deterministic transformation rules that correct known error patterns, such as name normalization, casing fixes, missing delimiters, and structural repairs.
Compliance check: Verify that every identifier referenced in the output actually exists in the provided context. Flag unknown references.

Design Principles

Zero LLM calls in the fix pipeline. Every fix is a regex, an AST transformation, or a lookup table operation. Instant, free, deterministic, 100% reliable.

Fail safe. If a fix is ambiguous (multiple valid corrections possible), pass the output through rather than corrupt it. A minor error is better than a confident wrong "fix."

Log everything. Track every fix applied, categorized by type. This data drives the feedback loop.

The Critical Feedback Loop

The validation pipeline's most important function isn't fixing outputs; it's generating improvement signals. This creates a feedback loop: the auto-fix catches errors → the errors get promoted to upstream prevention → fewer errors reach the auto-fix → the system continuously tightens.

Pillar 6: Multi-Agent Orchestration with Fewer Agents and Clear Contracts

Impact: Reduced latency, clearer debugging, fewer failure modes

The multi-agent pattern is powerful but commonly over-applied. The counter-intuitive lesson from production systems: fewer agents with well-defined responsibilities outperform many fine-grained agents.
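Two of the orchestration principles discussed in this pillar, strict contracts at agent boundaries and running independent lookups concurrently, might look like the sketch below. The agent name, the dataclass contracts, and the two stand-in fetch functions are hypothetical; real implementations would call a knowledge store and an example index.

```python
import asyncio
from dataclasses import dataclass


@dataclass(frozen=True)
class RetrievalRequest:
    """Input contract: everything the retrieval agent may rely on."""
    question: str


@dataclass(frozen=True)
class RetrievalResult:
    """Output contract: everything downstream agents may rely on."""
    context: list[str]
    examples: list[str]


async def fetch_context(req: RetrievalRequest) -> list[str]:
    await asyncio.sleep(0.01)  # stands in for a knowledge-store lookup
    return [f"context for: {req.question}"]


async def fetch_examples(req: RetrievalRequest) -> list[str]:
    await asyncio.sleep(0.01)  # stands in for a few-shot example lookup
    return [f"example for: {req.question}"]


async def retrieval_agent(req: RetrievalRequest) -> RetrievalResult:
    """One deterministic agent with two internal steps, run concurrently."""
    context, examples = await asyncio.gather(fetch_context(req), fetch_examples(req))
    return RetrievalResult(context=context, examples=examples)


result = asyncio.run(retrieval_agent(RetrievalRequest(question="sales by region")))
```

Because the two lookups do not depend on each other, `asyncio.gather` lets the slower one set the total wait instead of the sum of both, and the frozen dataclasses make it explicit what crosses the agent boundary.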
Why Fewer Is Better

Each agent handoff introduces:

Latency - serialization, network calls, context assembly
Context loss - information dropped between boundaries
Failure modes - each handoff is a potential error point
Debugging complexity - tracing issues across many agents is exponentially harder

Multi-Agent Orchestration Principles

Merge agents that always run sequentially. If Agent A always feeds into Agent B with no branching or conditional logic, they should be one agent with two internal steps.

Parallelize independent operations. Context retrieval and example lookup are independent, so run them concurrently to halve retrieval latency.

Route sub-tasks to cheaper models. Decomposed sub-problems are simpler by design. Use a smaller, faster, cheaper model (3x cost savings, 2x speed improvement).

Define strict contracts. Each agent boundary should have an explicit schema defining inputs and outputs. No implicit assumptions about what crosses the boundary.

Only 2 of 4 agents should call an LLM. The rest are purely deterministic. This minimizes non-deterministic behavior and cost.

Pillar 7: Multi-Agent Cache at Multiple Hierarchical Levels

Impact: 40-50% faster responses, 85%+ combined hit rate, significant cost reduction

A single cache layer captures only one type of repetition. Production systems need hierarchical caching where multiple levels catch different repetition patterns, from exact duplicates to semantic near-misses.

Pillar 8: Measure Everything, Learn Continuously

Impact: Enables data-driven iteration and prevents accuracy regressions.

Architecture without observability is guesswork. The final pillar ensures every other pillar stays effective over time through comprehensive metrics and automated feedback loops. This isn't a one-time setup; it's a perpetual feedback loop. Every week, the top error patterns shift slightly. The auto-fix metrics tell you exactly where to focus next.
Over months, this flywheel compounds into dramatic accuracy gains that no single prompt rewrite could achieve.

Auto-Learning for New Domains

When extending your system to new domains or knowledge areas:

Auto-classify elements using naming conventions, type analysis, and structural patterns
Auto-generate templates from universal patterns (transformations, comparisons, compositions, sequences)
Bootstrap few-shot examples from successful template outputs
Monitor the first 100 requests, then curate only the edge cases manually

This reduces domain onboarding from days of manual work to minutes of automated bootstrapping plus focused human review of outliers.

Key Takeaways

Architecture beats model size. A well-architected system with a smaller model outperforms a raw frontier model call on structured tasks at a fraction of the cost.

Deterministic systems should do the heavy lifting. Reserve LLMs for genuinely novel, creative tasks. 70-80% of production requests should never touch an LLM.

Verified knowledge is your top accuracy lever. Ground every prompt in context the model can trust.

Errors are patterned, not random. Track them, compile them, and explicitly forbid them.

Build feedback loops, not static systems. Every auto-fix, every cache miss, every routing decision is a signal for improvement.

Fewer agents, done well. Fewer agents with strict contracts outperform 9 agents with fuzzy boundaries in accuracy, latency, and debuggability.

Measure what matters and iterate. The system that wins isn't the one with the best day-one prompt; it's the one that improves fastest over time.

Production-grade GenAI isn't about finding the perfect prompt or waiting for the next LLM release. It's about building architectural guardrails that make failure nearly impossible, and when failure does occur, the system learns from it automatically.
These 8 pillars, applied together, transform any LLM from an unreliable black box into a precise, efficient, and continuously improving production system.

A New Chapter for Realtime AI: Reasoning, Translation, and Real-Time Transcription
Voice can be one of the most direct and productive interfaces for AI, enabling customer support agents that may resolve issues without a single keystroke, live multilingual communication that can take on language barriers as conversations happen, and voice assistants capable of reasoning through complex requests in real time. Developers building these experiences need models that can keep pace with increasingly demanding latency, accuracy, and language coverage requirements. Starting today, OpenAI's GPT-realtime-translate, GPT-realtime-2, and GPT-realtime-whisper are rolling out into Microsoft Foundry, together representing a significant step forward for the realtime model lineup available to developers on the platform.

GPT-realtime-translate and GPT-realtime-whisper

GPT-realtime-translate and GPT-realtime-whisper together extend the realtime stack for live multilingual audio workflows. GPT-realtime-translate is built for continuous, real-time translation, producing translated output as speech unfolds without relying on segmented pipeline processing, while GPT-realtime-whisper provides low-latency streaming transcription of the original audio in parallel. Used together, they help developers support scenarios such as live events, cross-language customer experiences, captions, monitoring, and archival workflows that require both translated output and visibility into the source speech.

Continuous stream processing: This new model translates live audio without segmenting or buffering, allowing for more natural interactions.
New translation and transcription capabilities: Translate between languages in real time and observe faster text to speech.
Available via the Realtime API

GPT-realtime-2

GPT-realtime-2 is a generational upgrade to OpenAI's speech-to-speech model, bringing internal reasoning and an expanded context window to real-time voice applications.
Where previous speech-to-speech models responded immediately, GPT-realtime-2 can work through a problem before speaking, making it well suited for voice applications that need to handle complex, multi-step queries entirely in the audio layer without routing to a separate text pipeline.

Native reasoning capability: The newest realtime model introduces stronger reasoning capabilities. The model now thinks internally before responding.
Adjustable reasoning effort via {reasoning.effort}: Explicitly request the level of reasoning the model uses (minimal, low, medium, high) to save on cost and latency.
Audio in, audio out: No need for an intermediary text step, so the conversation stays fluid and natural.
Available via the Realtime API

These models are coming soon to Microsoft Foundry. Since May 6, the models have been rolling out into the model catalog. We are excited for you to explore and build with our evolving collection of frontier models.

Use cases

These models work independently, but they're designed to complement each other in real-world pipelines:

Live multilingual events. GPT-realtime-translate enables real-time translation of live audio, producing translated speech along with a transcript in the target language. GPT-realtime-whisper can be used in parallel to capture a transcription of the original speech for captions, monitoring, or archival purposes. Together, they enable multilingual live streaming with both translated experiences and visibility into the source language.

Global customer support. Route inbound calls through GPT-realtime-translate to translate conversations in real time and provide a translated transcript for agents. Use GPT-realtime-whisper alongside it to capture the original conversation as text for compliance, quality review, or analytics. Then pass the interaction to an agent built with GPT-realtime-2 using {reasoning.effort}: high for complex issue resolution, all within a continuous audio pipeline.

International voice assistants.
Build once and deploy across languages. GPT-realtime-translate enables multilingual interaction and provides translated output with a target-language transcript, while GPT-realtime-whisper can optionally capture the original user input as text. GPT-realtime-2 manages reasoning and conversational context, supporting more complex voice interactions.

Pricing

Pricing per 1M tokens unless an hourly rate is shown.

GPT-realtime-2 (Global Standard), Audio: Input $32.00, Cached Input $0.40, Output $64.00
GPT-realtime-2 (Global Standard), Text: Input $4.00, Cached Input $0.40, Output $24.00
GPT-realtime-2 (Global Standard), Image: Input $5.00, Cached Input $0.50, Output --
GPT-realtime-translate (Global Standard), Audio: Input --, Cached Input --, Output $2.04/hour
GPT-realtime-whisper (Global Standard), Audio: Input --, Cached Input --, Output $1.02/hour

*GPT-realtime-translate and GPT-realtime-whisper are priced by the hour.

Getting Started

Looking for ways to dive in? GPT-realtime-translate, GPT-realtime-whisper, and GPT-realtime-2 are rolling out into Microsoft Foundry today. Explore the model catalog and start building: https://ai.azure.com

Now in Foundry: IBM Granite 4.1, NVIDIA Nemotron Nano Omni, and Qwen3.6-35B-A3B
This week Microsoft Foundry adds two major model families alongside a reasoning powerhouse, spanning the full spectrum from specialized speech and vision to general-purpose coding and long-context analysis. IBM's Granite 4.1 is a family of 10: six LLMs across 3B, 8B, and 30B sizes in both full-precision and FP8 variants, plus a safety model, a vision-language model for document extraction, and a multilingual speech recognition model. NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning brings multimodal capability (video, audio, image, and text) to a 31B Mamba2-Transformer Hybrid Mixture-of-Experts (MoE) architecture that activates only 3B parameters per forward pass; three variants are available in Foundry (BF16, FP8, and NVFP4), with the FP8 variant featured here. Qwen3.6-35B-A3B is designed for agentic coding among open models, with thinking preservation across conversation turns and a context window extensible to 1 million tokens.

Models of the week

IBM: granite-4.1-30b

Model Specs

Parameters / size: 30B (flagship of the Granite 4.1 family)
Context length: 131,072 tokens
Primary task: Text generation (multilingual instruction following, RAG, tool calling, code, summarization)

Why it's interesting

The Granite 4.1 release brings 10 models to Microsoft Foundry. The LLM lineup covers granite-4.1-3b-instruct, granite-4.1-8b-instruct, and granite-4.1-30b-instruct with FP8 variants for each, plus granite-guardian-4.1-8b for safety, granite-vision-4.1-4b for document and chart understanding, and granite-speech-4.1-2b for multilingual speech recognition. This is a deployment-ready stack where teams can mix and match model sizes and modalities from a single provider.
Strong instruction following and reasoning at the 30B scale: granite-4.1-30b-instruct scores 80.16 on MMLU, 64.09 on MMLU-Pro, 83.74 on Big-Bench Hard (BBH), 77.80 on AGI Eval, 45.76 on GPQA (Graduate-Level Google-Proof Q&A, a graduate-level science reasoning benchmark), and 89.65 average on IFEval (instruction following). These results reflect SFT and reinforcement learning post-training focused specifically on instruction compliance, tool calling accuracy, and long-context retrieval. (View benchmarks here) Enhanced tool calling and 12-language support: Granite 4.1 models are trained for structured function calling and support 12 languages—Arabic, Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish—with dialog, extraction, and summarization capabilities. Safety and multimodal coverage within the same family: The inclusion of granite-guardian-4.1-8b (a safety classifier for detecting harmful content and prompt injections), granite-vision-4.1-4b (a Vision Language Model optimized for document extraction from PDFs, charts, and tables), and granite-speech-4.1-2b (a 2B multilingual Automatic Speech Recognition model) means teams can address safety, document parsing, and audio ingestion within the same model family—reducing integration complexity across a full pipeline. 
Try it

Use case: Multilingual RAG. Prompt pattern: Submit retrieved document passages in any of 12 supported languages; ask model to synthesize and cite sources
Use case: Agentic tool calling. Prompt pattern: Provide function definitions + user goal; model plans and executes tool calls in structured format
Use case: Document extraction (granite-vision-4.1-4b). Prompt pattern: Submit PDF page image; extract tables, key figures, or form fields as structured JSON
Use case: Safety classification (granite-guardian-4.1-8b). Prompt pattern: Pass user input or model output; receive structured risk assessment before serving response

Sample prompt for an enterprise document processing deployment:

You are building a multilingual document intelligence pipeline for a global financial institution. Using the granite-4.1-30b-instruct endpoint deployed in Microsoft Foundry, submit each incoming policy or regulatory document with the following system instruction: "You are a compliance analysis assistant. Review the document and extract: (1) all regulatory requirements described, (2) the entities to which each requirement applies, (3) any compliance deadlines mentioned, and (4) any penalties or consequences for non-compliance. Return the output as a structured JSON array with one entry per requirement." For documents that include scanned pages, first route them through granite-vision-4.1-4b to extract text and table content before passing to the 30B model for compliance analysis. Pass all user-facing outputs through granite-guardian-4.1-8b to screen for sensitive information before returning results.
NVIDIA: Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 Model Specs Parameters / size: 31B total, ~3B activated per forward pass (Mamba2-Transformer Hybrid Mixture-of-Experts) Context length: 256,000 tokens Primary task: Video-audio-image-text-to-text (Multimodal understanding, reasoning, tool calling) Why it's interesting Multimodal input from a single efficient endpoint: Nemotron-3-Nano-Omni-30B-A3B-Reasoning supports video (up to 2 minutes), audio (up to 1 hour), images (RGB), and text—all from a single model deployed in Microsoft Foundry. Three variants are available in Foundry: full-precision BF16, FP8, and NVFP4. Paper: Nemotron Nano Omni technical report. Strong results across vision, document, video, and audio benchmarks: With reasoning mode enabled, the model scores 82.8 on MathVista-MINI (visual math reasoning), 67.04 on OCRBenchV2-EN (document OCR), 63.6 on Charxiv Reasoning (chart understanding), 72.2 on Video MME (video Q&A), 74.52 on Daily Omni (video+audio omnimodal understanding), and 89.39 on VoiceBench (speech instruction following). On OSWorld (GUI agent benchmark measuring autonomous computer use), it scores 47.4—a notable result for a model at the 3B active parameter scale. (Please see above model cards for further benchmark data) Mamba2-Transformer Hybrid MoE for efficient long-context inference: The model's layers alternate between Mamba2 state-space blocks (which process sequences with linear rather than quadratic cost) and standard Transformer attention blocks, combined with Mixture-of-Experts feedforward layers. Only ~3B parameters are activated per token despite the 31B total, making the 256K context window practically usable at lower compute cost than a comparably sized dense model. Word-level timestamps, JSON output, and tool calling for structured media workflows: The model produces word-level timestamps from audio, enabling precise transcript-to-timecode alignment for review and indexing workflows. 
Combined with JSON-structured output, chain-of-thought reasoning, and native tool calling, it can serve as an agentic step that ingests raw media (meeting recordings, M&E assets, training videos) and produces structured outputs for downstream systems without requiring separate transcription or OCR preprocessing stages.

Try it

Use case: Meeting intelligence. Prompt pattern: Submit audio recording (up to 1 hr); extract transcript with word-level timestamps, action items, and decisions as structured JSON
Use case: Video content analysis. Prompt pattern: Submit video clip (up to 2 min) + query; retrieve timestamped summary of key events or spoken content
Use case: Document + audio joint analysis. Prompt pattern: Submit scanned document image alongside narrated walkthrough audio; extract and reconcile information from both modalities
Use case: Multimodal tool calling. Prompt pattern: Provide tool definitions + combined image/audio input; model reasons over content and executes structured tool calls

Sample prompt for a media and compliance deployment:

You are building a broadcast compliance review system for a media company. Using the Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 endpoint deployed in Microsoft Foundry, submit each recorded segment as video input with the following instruction: "Review this video segment and produce a compliance report as a JSON object with the following fields: transcript (full text with word-level timestamps), flagged_segments (array of objects with start_time, end_time, content, and reason for flagging), speaker_count (estimated number of distinct speakers), and compliance_summary (overall assessment). Flag any content that includes unverified factual claims, restricted product categories, or regulatory disclosures that may be incomplete." Use the word-level timestamps from the compliance report to route flagged segments directly to the editorial review queue with precise timecode references.
Qwen: Qwen3.6-35B-A3B Model Specs Parameters / size: 35B total, 3B activated (Mixture-of-Experts) Context length: 262,144 tokens natively, extensible to 1,010,000 tokens Primary task: Image-text-to-text (agentic coding, reasoning, vision) Why it's interesting Agentic coding improvements over Qwen3.5-35B-A3B: Qwen3.6-35B-A3B scores 73.4 on SWE-bench Verified (vs. 70.0 for Qwen3.5-35B-A3B and 52.0 for Gemma 4 31B), 67.2 on SWE-bench Multilingual (vs. 60.3 and 51.7), and 49.5 on SWE-bench Pro (vs. 44.6 and 35.7). Terminal-Bench 2.0 reaches 51.5 (vs. 40.5 and 42.9). The update targets frontend workflows and repository-level reasoning specifically, areas where earlier Qwen3.5 iterations showed gaps. Blog post: Qwen3.6-35B-A3B. Hybrid architecture: Gated DeltaNet and Mixture-of-Experts: The model's 40 layers alternate between Gated DeltaNet blocks (a form of linear attention that avoids the quadratic cost of standard self-attention), Gated Attention blocks (using Grouped Query Attention with 16 query heads and 2 key-value heads), and Mixture-of-Experts (MoE) feedforward layers with 256 experts (8 routed + 1 shared active per token). Only 3B parameters are activated per forward pass, keeping inference cost comparable to a 3B dense model while retaining the capacity of a 35B model for knowledge and specialization. Thinking preservation across conversation turns: Qwen3.6 introduces an option to retain reasoning context from previous messages in multi-turn conversations. In prior models, chain-of-thought traces were stripped between turns, requiring the model to re-derive context it had already reasoned through. With thinking preservation enabled, iterative coding workflows—such as debugging across multiple exchanges—benefit from accumulated reasoning without repeating earlier analysis. Natively extensible to 1 million token context: The 262K native context is already among the largest in open models at this size, and the architecture supports extension to 1,010,000 tokens. 
On GPQA Diamond (science reasoning), Qwen3.6-35B-A3B scores 86.0, above both Gemma 4 31B (84.3) and Qwen3.5-27B (85.5), while matching Gemma 4 31B on MMLU Pro (85.2) and LiveCodeBench v6 (80.4 vs. 80.0).

Try it

Use case: Repository-level code change. Prompt pattern: Provide repository structure + task description; model plans file edits and outputs unified diff
Use case: Multi-turn iterative debugging. Prompt pattern: Enable thinking preservation; submit failing test + code across multiple turns; accumulate reasoning context
Use case: Frontend code generation. Prompt pattern: Provide design spec or screenshot + existing codebase context; generate component implementation
Use case: Long-document reasoning. Prompt pattern: Submit technical specification (up to 262K tokens); ask model to identify ambiguities or implementation gaps

Sample prompt for a software engineering deployment:

You are building an automated code review and implementation assistant for a platform engineering team. Using the Qwen3.6-35B-A3B endpoint deployed in Microsoft Foundry, enable thinking preservation for multi-turn sessions. In the first turn, submit the repository file tree and a GitHub issue describing a required API endpoint change. Prompt the model: "Review the repository structure and describe your implementation plan, including which files need to change and why." In the second turn, submit the relevant source files and prompt: "Based on your earlier plan, implement the changes and produce a unified diff." In the third turn, submit the test suite and prompt: "Write additional unit tests for the new endpoint, covering edge cases identified in your reasoning." The thinking preservation feature ensures the model carries forward its understanding of the codebase across all three turns without re-explaining context.

Getting started

You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks.
You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using the Microsoft Foundry documentation.

Follow along with the Model Mondays series and access the GitHub to stay up to date on the latest
Read the Hugging Face on Azure docs
Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry
Explore models in Microsoft Foundry

Introducing OpenAI's newest chat model in Microsoft Foundry
OpenAI's GPT-5.5 Instant (or Chat-latest in the API) begins rolling out in Microsoft Foundry today as GPT-chat-latest. Built on GPT-5.4 and GPT-5.3-chat, the new model delivers measurable gains in factual accuracy, tool calling, and response efficiency. These improvements translate directly into more reliable production deployments. GPT-chat-latest is designed for the workflows builders are actually shipping: multi-turn assistants, agentic systems that orchestrate tools, and retrieval-grounded applications where precision and grounding matter as much as conversational quality.

Why the name is changing
In Microsoft Foundry, we are introducing GPT-chat-latest as the product name for this release, while the model continues to follow the existing Preview lifecycle and standard notice periods. We are also evaluating ways to simplify how customers access continuously updated models over time, but current behavior remains unchanged as that work continues.

Smarter, more factually reliable
GPT-chat-latest closes the factuality gap from prior iterations with significant reductions in hallucinations, especially in domains where accuracy matters most. According to OpenAI, the new model produces 52.5% fewer hallucinations and reduces hallucinated claims by 37.3% on conversations previously flagged for factual errors when compared to GPT-5.3-chat. These gains extend beyond text.
GPT-chat-latest shows improvements in visual reasoning, expert multimodal understanding, and STEM tasks, with measurable lifts across standard benchmarks:

Benchmark | GPT-5.3-chat | GPT-chat-latest
CharXiv-reasoning (scientific chart reasoning) | 75.0 | 81.6
MMMU-Pro (expert multimodal reasoning) | 69.2 | 76.0
GPQA (PhD-level science questions) | 78.5 | 85.6
AIME 2025 (competition math) | 65.4 | 81.2

*Data shown comes from OpenAI's testing.

For builders shipping into regulated workloads such as clinical decision support, legal research, financial advisory, and technical analysis, these improvements raise the bar on the kinds of applications GPT-chat-latest can assist with.

More efficient outputs
GPT-chat-latest produces responses that may be more to-the-point without losing substance. The model may reduce verbosity and over-formatting, ask fewer follow-up questions, and avoid cluttered output patterns that often require post-processing in production UIs. For builders, this can translate to two concrete benefits: lower output token costs at scale, and cleaner responses that drop into product surfaces with less downstream cleanup. In comparative testing from OpenAI, GPT-chat-latest produced roughly 25–30% fewer words than GPT-5.3-chat across a range of common prompts while preserving response quality, and in many cases improving it.

Improving intelligence and tool calling
GPT-chat-latest introduces measurable improvements in how the model interacts with tools, including better judgment about when and how to invoke them. The model produces more structured and context-aware tool invocation outputs, which is particularly relevant for workflows that rely on function calling, retrieval-augmented generation, and multi-step reasoning. Equally important, the model is better at deciding whether a tool is needed in the first place, reducing unnecessary tool calls in scenarios where it already has the information to answer directly.
Improved search and context handling GPT-chat-latest includes targeted improvements to how the model retrieves, interprets, and synthesizes information when search is involved, with enhancements to query formulation, result ranking, and filtering, plus more grounded synthesis of retrieved content into final responses. These changes improve handling of ambiguous or underspecified queries and reduce noise in answers that depend on retrieved content. The model also makes better use of the context developers pass in, including system prompts, conversation history, retrieved documents, and structured data. Applications that maintain long-running state or stitch together multiple retrieval steps produce more coherent, context-aware outputs without developers having to over-engineer prompt scaffolding. Use Cases: When to choose the chat model Developers typically choose a chat-optimized model like GPT-5.5-chat when the application needs to sustain multi-turn conversations while reliably following instructions and coordinating external tools. This is a fit for assistants and agentic workflows where the model must interpret user intent over time, decide when to retrieve additional context, and produce structured outputs for downstream systems rather than just generate free-form text. Customer support and contact centers: virtual agents that maintain conversational context across a case, retrieve policy or product documentation via search, and hand off to a ticketing or CRM system through tool calls when escalation is needed. Retail and e-commerce: shopping and service assistants that clarify preferences over multiple turns, reference catalogs and policies via retrieval, and generate structured actions such as returns, exchanges, and order lookups through integrated tools. Manufacturing and field service: technician-facing assistants that combine conversational guidance with retrieval of manuals and work instructions, plus structured task creation in maintenance systems. 
Use GPT-chat-latest | Use GPT-5.5 Reasoning
Multi-turn assistants and customer-facing chat experiences | Harder problems that benefit from more deliberate, step-by-step thinking
Agentic workflows that coordinate tools (search, retrieval, ticketing, CRM) and benefit from structured tool outputs | Complex analysis, planning, or decision support where correctness matters more than conversational flow
Interactive experiences where you want quick back-and-forth clarification and task completion | Tasks involving multi-constraint reasoning (policy interpretation, detailed requirements, long-horizon plans)
RAG-based apps where the model must decide when to retrieve and then synthesize grounded answers | Offline or low-tool scenarios where the main value is deeper reasoning over provided context

Pricing
Model | Input ($/1M tokens) | Cached input ($/1M tokens) | Output ($/1M tokens)
GPT-chat-latest | $5 | $0.50 | $30

Responsible AI in Microsoft Foundry
At Microsoft, our mission to empower people and organizations remains constant. In the age of AI, trust is foundational to adoption, and earning that trust requires a commitment to transparency, safety, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy models responsibly in production environments, aligned with Microsoft's Responsible AI principles.

Getting started
GPT-chat-latest is rolling out in Microsoft Foundry today.

Introducing OpenAI's GPT-image-2 in Microsoft Foundry
Take a small design team running a global social campaign. They have the creative vision to produce localized imagery for every market, but not the resources to reshoot, reformat, or outsource at that scale. Every asset needs to fit a different platform, a different dimension, a different cultural context, and they all need to ship at the same time. This is where flexible image generation comes in handy. OpenAI's GPT-image-2 is now generally available and rolling out today to Microsoft Foundry, introducing a step change in image generation. Developers and designers now get more control over image output, so a small team can execute with the reach and flexibility of a much larger one.

What is new in GPT-image-2?
GPT-image-2 brings real-world intelligence, multilingual understanding, improved instruction following, increased resolution support, and an intelligent routing layer, giving developers the tools to scale image generation for production workflows.

Real-world intelligence
GPT-image-2 has a knowledge cutoff of December 2025, meaning that it is able to give you more contextually relevant and accurate outputs. The model also comes with enhanced thinking capabilities that allow it to search the web, check its own outputs, and create multiple images from just one prompt. These enhancements shift image generation models away from being simple tools and turn them into creative sidekicks.

Multilingual understanding
GPT-image-2 includes increased language support across Japanese, Korean, Chinese, Hindi, and Bengali, as well as new thinking capabilities. This means the model can create images and render text that feels localized.

Increased resolution support
GPT-image-2 introduces 4K resolution support, giving developers the ability to generate rich, detailed, and photorealistic images at custom dimensions.
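The resolution guidelines for GPT-image-2 in this post (a total pixel budget between 655,360 and 8,294,400 pixels, and each dimension a multiple of 16) lend themselves to a quick client-side check before a request is sent. The following is a minimal sketch under those stated numbers only; the function names `check_size` and `snap_to_alignment` are hypothetical helpers, not part of any Foundry or OpenAI SDK, and the sketch does not try to reproduce how the service itself resizes oversized requests.

```python
MAX_PIXELS = 8_294_400  # maximum total pixels quoted in the post
MIN_PIXELS = 655_360    # minimum total pixels quoted in the post
ALIGNMENT = 16          # each dimension must be a multiple of 16


def check_size(width: int, height: int) -> list[str]:
    """Return a list of guideline violations; an empty list means the size fits."""
    problems = []
    for name, value in (("width", width), ("height", height)):
        if value <= 0 or value % ALIGNMENT != 0:
            problems.append(
                f"{name} {value} is not a positive multiple of {ALIGNMENT}"
            )
    pixels = width * height
    if pixels > MAX_PIXELS:
        problems.append(f"{pixels} pixels exceeds the maximum of {MAX_PIXELS}")
    elif pixels < MIN_PIXELS:
        problems.append(f"{pixels} pixels is below the minimum of {MIN_PIXELS}")
    return problems


def snap_to_alignment(value: int) -> int:
    """Round a dimension down to a multiple of 16 (hypothetical helper)."""
    return max(ALIGNMENT, (value // ALIGNMENT) * ALIGNMENT)
```

For example, `check_size(3840, 2160)` returns an empty list, because 3840 × 2160 is exactly 8,294,400 pixels and both dimensions divide evenly by 16, while `check_size(1000, 1000)` reports both dimensions as misaligned.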
Resolution guidelines to keep in mind:

Constraint | Detail
Total pixel budget | Maximum pixels in final image cannot exceed 8,294,400. Minimum pixels in final image cannot be less than 655,360. Requests exceeding this are automatically resized to fit.
Resolutions | 4K, 1024x1024, 1536x1024, and 1024x1536
Dimension alignment | Each dimension must be a multiple of 16

Note: If your requested resolution exceeds the pixel budget, the service will automatically resize it down.

Intelligent routing layer
GPT-image-2 also includes an expanded routing layer with two distinct modes, allowing the service to intelligently select the right generation configuration for a request without requiring an explicitly set size value.

Mode 1: Legacy size selection
In Mode 1, the routing layer selects one of the three legacy size tiers to use for generation:

Size tier | Description
smimage | Small image output
image | Standard image output
xlimage | Large image output

This mode is useful for teams already familiar with the legacy size tiers who want to benefit from automatic selection without making any manual changes.

Mode 2: Token size bucket selection
In Mode 2, the routing layer selects from six token size buckets (16, 24, 36, 48, 64, 96), which map roughly to the legacy size tiers:

Token bucket | Approximate legacy size
16, 24 | smimage
36, 48 | image
64, 96 | xlimage

This approach can allow for more flexibility in the number of tokens generated, which in turn helps to better optimize output quality and efficiency for a given prompt.

See it in action
GPT-image-2 shows improved image fidelity across visual styles, generating more detailed and refined images. But don't just take our word for it; let's see the model in action with a few prompts and edits. Here is the example we used:

Prompt: Interior of an empty subway car (no people). Wide-angle view looking down the aisle. Clean, modern subway car with seats, poles, route map strip, and ad frames above the windows.
Realistic lighting with a slight cool fluorescent tone, realistic materials (metal poles, vinyl seats, textured floor).

As you can see, when using the same base prompt, the image quality and realism improved with each model. Now let's take a look at adding incremental changes to the same image:

Prompt: Populate the ad frames with a cohesive ad campaign for "Zava Flower Delivery" and use an array of flower types.

And our subway is now full of ads for the new Zava flower delivery service. Let's ask for another small change:

Prompt: In all Zava Flower Delivery advertisements, change the flowers shown to roses (red and pink roses).

And in three simple prompts, we've created a mockup of a flower delivery ad. From marketing material to website creation to UX design, GPT-image-2 now allows developers to deliver production-grade assets for real business use cases.

Image generation across industries
These new capabilities open the door to richer, more production-ready image generation workflows across a range of enterprise scenarios:
Retail & e-commerce: Generate product imagery at exact platform-required dimensions, from square thumbnails to wide banners, without post-processing.
Marketing: Produce crisp, richly colored campaign visuals and social assets localized to different markets.
Media & entertainment: Generate storyboard panels and scenes at resolutions suited to production pipelines.
Education & training: Create visual learning aids and course materials formatted to exact display requirements across devices.
UI/UX design: Accelerate mockup and prototype workflows by generating interface assets at the precise dimensions your design system requires.

Trust and safety
At Microsoft, our mission to empower people and organizations remains constant. As part of this commitment, models made available through Foundry undergo internal reviews and are deployed with safeguards designed to support responsible use at scale. Learn more about responsible AI at Microsoft.
For GPT-image-2, Microsoft applied an in-depth safety approach that addresses disallowed content and misuse while maintaining human oversight. The deployment combines OpenAI's image generation safety mitigations with Azure AI Content Safety, including filters and classifiers for sensitive content.

Pricing
Model | Offer type | Pricing - Image | Pricing - Text
GPT-image-2 | Standard Global | Input Tokens: $8; Cached Input Tokens: $2; Output Tokens: $30 | Input Tokens: $5; Cached Input Tokens: $1.25

Note: All prices are per 1M tokens. There is no billing for output tokens for the GPT-image-2 model.

Getting started
Whether you're building a personalized retail experience, automating visual content pipelines, or accelerating design workflows, GPT-image-2 gives your team the resolution control and intelligent routing to generate images that fit your exact needs. Try GPT-image-2 in Microsoft Foundry today!
Deploy the model in Microsoft Foundry
Experiment with the model in the Image playground
Read the documentation to learn more

GPT Capability in Understanding Coordinates: How GPT-5.4 Transforms Spatial Precision
Why I Ran This Experiment
This work started not as a benchmarking exercise, but as a practical problem: I needed to automatically extract panel regions from PDF-format electrical Single-Line Diagram (SLD) drawings using OpenAI models. All experiments were conducted using OpenAI models in Microsoft Foundry, Microsoft's unified platform for building generative AI applications. The downstream goal was a pipeline that combines a GPT model with Azure Document Intelligence to generate Bills of Materials (BOMs), a project I wrote about separately in Extracting BOMs from Electrical Drawings with AI: Azure OpenAI GPT-5 + Azure Document Intelligence Pipeline. Before building that pipeline, I needed a clear-eyed answer to a deceptively simple question: how well can GPT actually understand and return pixel-level coordinates from an image? If the model can't reliably locate a panel bounding box, the rest of the pipeline doesn't matter.

When I first ran these tests against GPT-5.2, the results were mixed: good enough to be promising, but inconsistent enough to leave clear room for improvement. I tried many workarounds: feeding image dimensions explicitly, overlaying coordinate grids, enabling extended reasoning, and building iterative self-correction loops. Each helped, but required deliberate engineering effort. Then GPT-5.4 was released. Re-running the same benchmark revealed that most of those workarounds were no longer necessary.

Context: All experiments use a fixed CAD-style test image (847 × 783 px) with a known ground-truth bounding box at [135, 165, 687, 619]. Accuracy is measured by Intersection over Union (IoU); a score of 1.0 is a perfect match. Every test was run 5 times and the results averaged.

The Experiment Design
I designed experiments across two axes: prompt strategy (how spatial information is presented to the model) and reasoning mode (standard vs. extended reasoning).
Each combination was tested on both GPT-5.2 and GPT-5.4 under two reasoning modes (None vs. High), producing four conditions per test.

Single-Shot Strategies (Tests 1–5)
These tests have no iterative validation loop: the model gets one prompt and returns its answer. Each test was run 5 times and the results averaged, so the scores reflect consistency, not a single lucky attempt. The differences between tests lie in how spatial information is framed in the prompt.

Test 1 is a simple sanity check: can the model understand percentage-based coordinates at all? The model receives the clean image (no overlay) and is asked: "return the pixel coordinate at 30% width, 50% height." The expected answer is (254, 392). GPT-5.2 gets the X coordinate roughly right (~254–260), but the Y coordinate scatters widely: predictions range from 260 to 322, roughly 70 to 130 pixels above the correct position. GPT-5.4 returns (254, 392) on every single run. Even on this simple sanity check, the gap is stark: GPT-5.4 is pixel-perfect from the start, while GPT-5.2 shows a clear Y-axis bias. But a single-point test doesn't tell us how well the models handle real spatial tasks. The next question: can they detect a full bounding box?

Tests 2–5: Bounding Box Detection with Increasing Prompt Richness
Tests 2–5 move to the real task: detecting a bounding box drawn on the image. Each test sends a different version of the same base image, with progressively richer spatial context in the prompt.

Feedback Loop Strategies (Tests 6A–7B)
These tests add an iterative validation loop: the model's predicted bounding box is overlaid on the image and sent back for self-correction, for up to 5 iterations (early stop at IoU ≥ 0.99). All feedback tests share the same two-phase structure: an init step (first prediction) and a validation loop (iterative correction).
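To make the scoring and the init-plus-validation structure concrete, here is a minimal sketch of the IoU metric, the Test 1 percentage-to-pixel conversion, and a validation loop with the early stop at IoU ≥ 0.99. It is an illustration under stated assumptions, not the benchmark's actual code: the `predict` callable is a hypothetical stand-in for the model call (overlay rendering and prompting are omitted), and the function names are invented for this example. Only the image size, ground-truth box, iteration cap, and stop threshold come from the post.

```python
GROUND_TRUTH = (135, 165, 687, 619)  # [x1, y1, x2, y2] from the post
IMAGE_SIZE = (847, 783)              # width, height in pixels


def iou(a, b):
    """Intersection over Union of two [x1, y1, x2, y2] boxes."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union else 0.0


def percent_to_pixel(pct_x, pct_y, size=IMAGE_SIZE):
    """Test 1 style conversion: fraction of width/height to the nearest pixel."""
    return (int(size[0] * pct_x + 0.5), int(size[1] * pct_y + 0.5))


def validation_loop(predict, max_iters=5, stop_at=0.99):
    """Call `predict(previous_box)` up to max_iters times, keeping the best box.

    `predict` is a hypothetical stand-in for the model call. It receives the
    previous box (None on the init step) and returns a corrected box.
    """
    best_box, best_score, box = None, -1.0, None
    for _ in range(max_iters):
        box = predict(box)
        score = iou(box, GROUND_TRUTH)
        if score > best_score:
            best_box, best_score = box, score
        if score >= stop_at:  # early stop, as described in the post
            break
    return best_box, best_score
```

With this conversion, `percent_to_pixel(0.30, 0.50)` gives (254, 392), the expected answer quoted for Test 1. A model that is already at ≥ 0.99 on the init step leaves the loop after a single call.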
All feedback tests use the same two images (init + validation overlay), but differ in prompt strategy and color assignment. Image-wise, they fall into two groups:

Group A: Orange GT (Tests 6A, 6C, 7A)
Group B: Color Bias / Blue GT (Tests 6B, 6D, 7B)

Figure 4b: Feedback loop input images. Group B (bottom): colors swapped to test color-role priors.

What differs between tests in the same group: the images are identical, but the prompt changes. 6A/6B use holistic comparison ("compare and correct"). 6C/6D additionally send the full history of past predictions as multi-image input. 7A/7B ask for per-edge directional judgments ("move left/right/up/down/none" for each edge independently).

Results

1. Model version is the single biggest factor
Across every test, GPT-5.4 dramatically outperforms GPT-5.2. The gap is not incremental; it's the difference between a bounding box that roughly overlaps the target and one that is essentially pixel-perfect. GPT-5.4 achieved an IoU of 0.99 or above on its very first attempt on tests where GPT-5.2 had only scored between 0.76 and 0.88.

GPT-5.4 (green bars) consistently hits ≥0.99 regardless of prompt strategy or reasoning mode. GPT-5.2 (blue bars) ranges from 0.76 to 0.92.

2. GPT-5.2 is inconsistent; GPT-5.4 locks in
Raw averages only tell half the story. GPT-5.2 is unpredictable: on the exact same test with the exact same prompt and image, results fluctuate widely between runs. The standard deviation on Test 2 is ±0.084, and individual runs landed anywhere from 0.66 to 0.88. GPT-5.4 stays within ±0.003. The scatter plots below make this clear. Each dot is one API call; notice how GPT-5.2 dots spray across the IoU range while GPT-5.4 dots stack on top of each other:

Wide scatter on simpler prompts (Test 2: 0.66–0.88); reasoning mode (orange) provides a lift that shrinks with richer prompts (Δmean shown below each panel).
Production implication: With GPT-5.2, you couldn't rely on a single inference call — building a reliable pipeline would require multiple calls and majority voting, multiplying latency and cost. With GPT-5.4, a single call is sufficient. 3. Reasoning mode reduced variance for GPT-5.2; GPT-5.4 didn't need it For GPT-5.2, enabling extended reasoning ( reasoning: high ) provided a meaningful boost — especially when the prompt was sparse. On Test 2 (bare image, no spatial context), reasoning added +0.076 IoU and visibly tightened the spread of results across runs. As prompts got richer, the benefit shrank: with a grid overlay (Test 4), reasoning added only +0.007. In other words, reasoning mode acted as a compensating mechanism — filling in the gaps when the prompt alone didn't provide enough spatial scaffolding. For GPT-5.4, reasoning mode offered no additional benefit on this class of task. The base model already achieves 0.99+ IoU, so there was simply no room for improvement. In a few cases the reasoning runs showed marginal regressions (−0.005 to −0.015), likely within noise. The takeaway isn't that reasoning mode is harmful in general, but rather that a spatial-coordinate task at this complexity level doesn't require it when the underlying model already has strong coordinate understanding. Figure 7 — Effect of reasoning mode: GPT-5.2 gains +0.04–0.08 from reasoning (blue bars), largest on sparse prompts. GPT-5.4 shows no meaningful gain (green bars near zero). 4. Richer prompts close the gap (but only for GPT-5.2) For GPT-5.2, providing more spatial context in the prompt made a big difference: from 0.765 (Test 2, no info) to 0.910 (Test 4, grid overlay) — a +0.145 IoU gain just from adding visual reference rulers to the image. Telling the model the image dimensions (Test 3) was a "free win" that cost nothing. For GPT-5.4, all prompt variants produce essentially the same result (0.989–0.997). 
The model already understands spatial coordinates well enough that extra scaffolding adds no value. GPT-5.4 is flat at ≥0.99 regardless.

If you're still on GPT-5.2: Always inject image dimensions into the prompt (free). Use grid overlays for the biggest single-shot gain (+0.145 IoU). With GPT-5.4, none of this is needed.

5. Validation loops: essential for GPT-5.2, optional for GPT-5.4
The feedback loop tests (6A–7B) showed that iterative self-correction genuinely helped GPT-5.2 improve from its initial prediction. For example, in Test 7A (directional feedback), GPT-5.2 improved from an init IoU of 0.926 to a best of 0.969 over 5 iterations. For GPT-5.4, every single run hit IoU ≥ 0.99 on iteration 1 and early-stopped immediately. There was nothing left to correct. The validation loop infrastructure (overlay rendering, multi-turn prompting, iteration logic) becomes dead code you can remove from your pipeline.

GPT-5.4 (green) starts at ≥0.99 and early-stops at iteration 1.

6. Prompt instruction matters: holistic vs directional feedback
Comparing 6A/6B (holistic: "compare the two boxes and correct") with 7A/7B (directional: "for each edge, decide which direction to move"), the directional approach consistently reached higher best IoU for GPT-5.2. The per-edge structured output forced the model to reason about each boundary independently rather than making a holistic guess.

Separately, the color bias tests (6B, 7B, with GT drawn in blue instead of orange) revealed that swapping GT/prediction colors drops the initial accuracy significantly. In 6A (orange GT) the init IoU was 0.937, but in 6B (blue GT) it dropped to 0.850. This suggests GPT models have learned color-role priors: orange is "expected" as the ground truth color. However, the validation loop largely recovers this gap: after 5 iterations, 6A and 6B converge to similar best IoU (~0.96). The directional variants (7A, 7B) show the same pattern but converge faster.
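As an illustration of what per-edge directional feedback can look like on the client side, the sketch below applies one judgment per edge ("left", "right", "up", "down", or "none") to an [x1, y1, x2, y2] box. Everything about the mechanics here is an assumption made for the example: the post does not say how judgments were turned into new coordinates, so the fixed `step` size, the dictionary shape of `judgments`, and the function name `apply_directional_feedback` are all hypothetical.

```python
# Which coordinate each edge controls in an [x1, y1, x2, y2] box.
EDGE_INDEX = {"left": 0, "top": 1, "right": 2, "bottom": 3}
# Pixel direction for each judgment (x grows rightward, y grows downward).
DELTA = {"left": -1, "right": 1, "up": -1, "down": 1, "none": 0}
# Vertical edges move horizontally; horizontal edges move vertically.
ALLOWED = {
    "left": {"left", "right", "none"},
    "right": {"left", "right", "none"},
    "top": {"up", "down", "none"},
    "bottom": {"up", "down", "none"},
}


def apply_directional_feedback(box, judgments, step=8):
    """Shift each edge independently by `step` pixels (assumed step size)."""
    new_box = list(box)
    for edge, direction in judgments.items():
        if direction not in ALLOWED[edge]:
            raise ValueError(f"{direction!r} is not valid for the {edge} edge")
        new_box[EDGE_INDEX[edge]] += DELTA[direction] * step
    return tuple(new_box)
```

Validating each judgment against the edge it applies to is one way to get the benefit the post describes: each boundary is handled on its own, and a malformed answer fails loudly instead of silently distorting the box.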
Left: initial accuracy drops when GT is drawn in blue. Right: after the validation loop, the gap closes. Directional feedback (7A/7B) shows the same pattern. For GPT-5.4: Color bias has no measurable effect. All variants (6A/6B/7A/7B) hit 0.994–0.998 IoU on iteration 1 regardless of color assignment. Summary: What Changed from GPT-5.2 to GPT-5.4 The story of this benchmark is really about engineering workarounds that became unnecessary. Here's what we built for GPT-5.2 and whether you still need it: Grid overlays & image dimensions in prompt — Gave +0.05–0.15 IoU for GPT-5.2. Not needed for GPT-5.4 (already ≥0.99 without it). Extended reasoning mode — Gave +0.04–0.08 IoU for GPT-5.2. No benefit for GPT-5.4 on this task (already at ceiling without it). Validation loops (iterative self-correction) — Improved GPT-5.2 by +0.02–0.10 IoU over 5 iterations. Unnecessary for GPT-5.4 (early-stops at iteration 1). Multiple runs & voting — Required for GPT-5.2 due to ±0.08 variance. Not needed for GPT-5.4 (±0.003 variance, single call sufficient). Color convention management — GPT-5.2 showed color bias (−0.09 IoU when colors swapped). No effect on GPT-5.4. GPT-5.4 doesn't just perform better — it makes entire categories of pipeline engineering unnecessary. For clean, CAD-style images like the ones tested here, GPT-5.4 dramatically reduces prompt engineering overhead: grid overlays, image dimension injection, reasoning mode, and validation loops — all of which required deliberate effort with GPT-5.2 — are no longer necessary. This translates directly to simpler pipelines, lower latency, and lower cost. That said, for more complex scenarios — multiple overlapping panels, cluttered backgrounds, or ambiguous region boundaries — iterative validation loops could still prove valuable, and we plan to explore this in future work. 
This benchmark started as a sanity check and turned into a clear signal: GPT-5.4 represents a genuine leap in spatial coordinate understanding, not just a marginal iteration. The gap between 0.765 and 0.997 IoU on an identical task is the difference between a prototype experiment and a production-ready component.

Try It Yourself
Ready to explore GPT-5.4's spatial precision capabilities? Here are ways to get started:
Sample notebooks for the bounding box extraction test: github
Read the companion post: Extracting BOMs from Electrical Drawings with AI: Azure OpenAI GPT-5 + Azure Document Intelligence. See how this benchmark informed a production pipeline.

Automate Prior Authorization with AI Agents - Now Available as a Foundry Template
By
Amit Mukherjee · Principal Solutions Engineer, Microsoft Health & Life Sciences
Lindsey Craft-Goins · Technology Leader - Cloud & AI Platforms, Health & Life Sciences
Joel Borellis · Director Solutions Engineering - Cloud & AI Platforms, Health & Life Sciences

Prior authorization (PA) is one of the most expensive bottlenecks in U.S. healthcare. Physicians complete an average of 39 PA requests per week, spending roughly 13 hours of physician-and-staff time on PA-related work (AMA 2024 Prior Authorization Physician Survey). Turnaround averages 5–14 business days, and PA alone accounts for an estimated $35 billion in annual administrative spending (Sahni et al., Health Affairs Scholar, 2024).

The regulatory clock is now ticking. CMS-0057-F mandates electronic PA with 72-hour urgent response starting in 2026. Forty-nine states plus DC already have PA laws on the books, and at least half of all U.S. state legislatures introduced new PA reform bills this year, including laws specifically targeting AI use in PA decisions (KFF Health News, April 2026).

Today we're making the Prior Authorization Multi-Agent Solution Accelerator available as a Microsoft Foundry template. Health plan payers can deploy a working, four-agent PA review pipeline to Azure using the Azure Developer CLI ("azd") with a single command in supported environments, then customize it to their policies, workflows, and EHR environment.

Try it now: Find the template in the Foundry template gallery, or clone directly from github.com/microsoft/Prior-Authorization-Multi-Agent-Solution-Accelerator

What the template delivers
The accelerator deploys four specialist Foundry hosted agents (Compliance, Clinical Reviewer, Coverage, and Synthesis), each independently containerized and managed by Foundry. In internal testing with synthetic demo cases, the pipeline reduced the review workflow, from beginning to completion, to under 5 minutes per case.
Agent | Role | Key capability
Compliance | Documentation check | 10-item checklist with blocking/non-blocking flags
Clinical Reviewer | Clinical evidence | ICD-10 validation, PubMed + ClinicalTrials.gov search
Coverage | Policy matching | CMS NCD/LCD lookup, per-criterion MET/NOT_MET mapping
Synthesis | Decision rubric | 3-gate APPROVE/PEND with weighted confidence scoring

Compliance and Clinical run in parallel. Coverage runs after clinical findings are ready. Synthesis evaluates all three outputs through a three-gate rubric. The result is a structured recommendation with per-criterion confidence scores and a full audit trail, not a black-box answer.

Solution architecture
The accelerator runs entirely on Azure. The frontend and backend deploy as Azure Container Apps. The four specialist agents are hosted by Microsoft Foundry. Real-time healthcare data flows through third-party MCP servers.

Figure 1: Azure solution architecture

How the pipeline works
The four agents execute in a structured parallel-then-sequential pipeline. Compliance and Clinical run simultaneously in Phase 1. Coverage runs after clinical findings are ready. The Synthesis agent applies a three-gate decision rubric over all prior outputs.

Figure 2: Agentic architecture, hosted agent pipeline

Compliance and Clinical run in parallel via asyncio.gather, since neither depends on the other. Coverage runs sequentially after Clinical because it needs the structured clinical profile for criterion mapping. Synthesis evaluates all three outputs through a three-gate rubric (Provider, Codes, Medical Necessity) with weighted confidence scoring: 40% coverage criteria + 30% clinical extraction + 20% compliance + 10% policy match. The total pipeline time is bounded by the slowest parallel agent plus the sequential agents, not by the sum of all four. In internal testing with synthetic demo cases, this architecture showed materially reduced processing time compared to sequential manual workflows.
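The parallel-then-sequential flow and the weighted confidence formula described above can be sketched in a few lines. This is an illustrative outline only, not the accelerator's code: the three `run_*` coroutines are hypothetical stubs standing in for calls to Foundry hosted agents, their return shapes and confidence values are invented, the ICD-10 code is just sample input, and the synthesis step is reduced to the weighted score (the three-gate rubric is not modeled). Only the use of `asyncio.gather`, the ordering, and the 40/30/20/10 weights come from the post.

```python
import asyncio

# Weights from the post: coverage criteria, clinical extraction,
# compliance, policy match.
WEIGHTS = {"coverage": 0.40, "clinical": 0.30, "compliance": 0.20, "policy": 0.10}


def weighted_confidence(scores: dict) -> float:
    """Combine per-component confidences (0..1) using the stated weights."""
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical stubs: each would be a hosted-agent call in a real system.
async def run_compliance(case):
    await asyncio.sleep(0.01)
    return {"confidence": 0.9}


async def run_clinical(case):
    await asyncio.sleep(0.01)
    return {"confidence": 0.8, "profile": {"icd10": case["icd10"]}}


async def run_coverage(case, clinical):
    await asyncio.sleep(0.01)
    return {"confidence": 0.7, "policy_match": 1.0}


async def review(case):
    # Phase 1: Compliance and Clinical are independent, so run them together.
    compliance, clinical = await asyncio.gather(
        run_compliance(case), run_clinical(case)
    )
    # Phase 2: Coverage needs the structured clinical profile.
    coverage = await run_coverage(case, clinical)
    # Phase 3: synthesis, reduced here to the weighted confidence score.
    return weighted_confidence({
        "coverage": coverage["confidence"],
        "clinical": clinical["confidence"],
        "compliance": compliance["confidence"],
        "policy": coverage["policy_match"],
    })


score = asyncio.run(review({"icd10": "E11.9"}))
```

With the stub values above, the score works out to 0.4 × 0.7 + 0.3 × 0.8 + 0.2 × 0.9 + 0.1 × 1.0 = 0.80. Because the two Phase 1 stubs sleep concurrently, the wall-clock time is roughly two sleeps rather than three, which is the point of the design.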
Under the hood For the architect in the room, here are four design decisions worth knowing about: Foundry hosted agents: Each agent is independently containerized, versioned, and managed by Foundry’s runtime. The FastAPI backend is a pure HTTP dispatcher. All reasoning happens inside the agent containers, and there are no code changes between local (Docker Compose) and production (Foundry); the environment variable is the only switch. Structured output: Every agent uses MAF’s response_format enforcement to produce typed Pydantic schemas at the token level. No JSON parsing, no malformed fences, no free-form text. The orchestrator receives typed Python objects; the frontend receives a stable API contract. Keyless security: DefaultAzureCredential throughout, so no API keys are stored anywhere. Managed Identity handles production; azd tokens handle local development. Role assignments are provisioned automatically by Bicep at deploy time. Observability: All agents emit OpenTelemetry traces to Azure Application Insights. The Foundry portal shows per-agent spans correlated by case ID. End-to-end latency, per-agent contribution, and error rates are visible from day one with no additional configuration. For the full architecture documentation, agent specifications, Pydantic schemas, and extension guides, see the GitHub repository. Why this matters now Human-in-the-loop by design The system runs in LENIENT mode by default: it produces only APPROVE or PEND and is not designed to produce automated DENY outcomes in its default configuration. Every recommendation requires a clinician to Accept or Override with documented rationale before the decision is finalized. Override records flow to the audit PDF, notification letters, and downstream systems. This directly addresses the emerging wave of state legislation governing AI use in PA decisions. Domain experts own the rules Agent behavior is defined in markdown skill files, not Python code. 
When CMS updates a coverage determination or a plan changes its commercial policy, a clinician or compliance officer edits a text file and redeploys. No engineering PR required.

Real-time healthcare data via MCP

Agents connect to five MCP servers for real-time data: ICD-10 codes, NPI Registry, CMS Coverage policies, PubMed, and ClinicalTrials.gov. This incorporates real‑time clinical reference data sources to inform agent recommendations. Third-party MCP servers are included for demonstration with synthetic data only. Their inclusion does not constitute an endorsement by Microsoft. See the GitHub repository for production migration guidance.

Audit-ready from day one

Every case generates an 8-section audit justification PDF with per-criterion evidence, data source attribution, timestamps, and confidence breakdowns. Clinician overrides are recorded in Section 9. Notification letters (approval and pend) are generated automatically. These artifacts are designed to support CMS-0057-F documentation requirements.

Deploy in under 15 minutes

From the Foundry template gallery or from the command line:

git clone https://github.com/microsoft/Prior-Authorization-Multi-Agent-Solution-Accelerator
cd Prior-Authorization-Multi-Agent-Solution-Accelerator
azd up

That single command provisions Foundry, Azure Container Registry, Container Apps, builds all Docker images, registers the four agents, and runs health checks. The demo is live with a synthetic sample case as soon as deployment completes.
What’s included | What you customize
--- | ---
4 Foundry hosted agents | Payer-specific coverage policies
FastAPI orchestrator + Next.js frontend | EHR/FHIR integration for clinical notes
5 MCP healthcare data connections | Self-hosted MCP servers for production PHI
Audit PDF + notification letter generation | Authentication (Microsoft Entra ID)
Full Bicep infrastructure-as-code | Persistent storage (Cosmos DB / PostgreSQL)
OpenTelemetry + App Insights observability | Additional agents (Pharmacy, Financial)

Built on

Microsoft Foundry + Foundry hosted agents · Microsoft Agent Framework (MAF) · Azure OpenAI gpt-5.4 · Azure Container Apps · Azure Developer CLI + Bicep · OpenTelemetry + Azure Application Insights · DefaultAzureCredential (keyless, no secrets)

Full architecture documentation, agent specifications, and extension guides are in the GitHub repository.

Get started

Foundry template gallery: Search “AI-Powered Prior Authorization for Healthcare” in the Foundry template section
GitHub: github.com/microsoft/Prior-Authorization-Multi-Agent-Solution-Accelerator

Disclaimers

Not a medical device. This solution accelerator is not a medical device, is not FDA-cleared, and is not intended for autonomous clinical decision-making. All AI recommendations require qualified clinical review before any authorization decision is finalized.

Not production-ready software. This is an open-source reference architecture (MIT License), not a supported Microsoft product. Customers are solely responsible for testing, validation, regulatory compliance, security hardening, and production deployment.

Performance figures are illustrative. Metrics cited (including processing time reductions) are based on internal testing with synthetic demo data. Actual results will vary based on case complexity, infrastructure, and configuration.

Third-party services included for demonstration only; not endorsed by Microsoft. Customers should evaluate providers against their compliance and data residency requirements.
The demo uses synthetic data only. Customers deploying real patient data are responsible for HIPAA compliance and establishing appropriate Business Associate Agreements. This accelerator is intended to help customers align documentation workflows with CMS‑0057‑F requirements but has not been independently validated or certified for regulatory compliance.

Troubleshooting Microsoft Foundry Accessing On‑Premises APIs Over Private Networking
Audience: Azure solution architects, network engineers, and AI practitioners deploying Microsoft Foundry in enterprise, network‑isolated environments. Scenario: Foundry Agent Service must call an on‑premises (or privately hosted) API over VPN or ExpressRoute using on‑premises corporate DNS. Connectivity works from a virtual machine (VM) in the virtual network (VNet), but fails when the same call is made from a Foundry agent. This post consolidates common field patterns observed across customer engagements and maps them directly to official Microsoft guidance. It highlights a frequently missed prerequisite for private connectivity: Project and Agent Capability Hosts. The Repeating Enterprise Pattern The architecture is familiar: Microsoft Foundry account and project Customer‑managed VNet with VPN or ExpressRoute to on‑premises Corporate (on‑premises) DNS used for API name resolution A VM in the VNet can successfully resolve and call the on‑prem API Foundry agents are configured to call the same API (often via an OpenAPI tool) Despite this, the agent fails with one or more of the following: DNS resolution failures Connection timeouts HTTP 401 or 403 responses Unexpected backend or proxy URLs appearing in logs The consistent question is: Why does this work from a VM in the VNet but not from the Foundry agents? Key Principle: Foundry Agents Do Not Automatically Run in Your VNet This is the most important mental model to reset. Creating a private endpoint for a Foundry Agent Service does not place agent runtime traffic into your VNet. Private endpoints are inbound constructs. Outbound connectivity from an agent only flows through your VNet when specific requirements are met. The most critical of those requirements is a Capability Host. What Is a Capability Host? A Capability Host defines where Foundry capabilities are allowed to execute. 
In private networking scenarios, the capability host:
- Binds a Foundry project or agent to a customer‑managed subnet
- Enables platform‑managed container injection into that subnet
- Ensures outbound traffic follows VNet routing, security controls, and DNS configuration

Capability Host Scope

Project Capability Host
- Associated at the project level
- Applies to all agents in the project
- Defines the customer subnet the project is allowed to use

Agent Capability Host
- Associated at the individual agent level
- Can explicitly bind or override subnet placement
- Useful when agents require different isolation boundaries

Key field insight: If no capability host is associated, the agent runtime is not injected into the VNet, even if VPN, ExpressRoute, private endpoints, and on‑prem DNS are correctly configured.

Capability Host Lifecycle (Conceptual)

Microsoft Foundry Account
↓
Project
↓
Capability Host
↓
Delegated Agent Subnet (VNet)
↓
Agent Runtime (Container Injection)

Without the capability host step, the chain breaks and the agent executes outside the customer network boundary.

Why On‑Premises DNS Appears Correct but Still Fails

DNS is where most investigations stall. Teams typically confirm:
- On‑premises DNS resolves the API hostname
- VNet DNS settings forward queries to on‑prem DNS
- A VM in the subnet resolves and reaches the API

Yet the agent still fails. The reason is simple:
- The VM is unquestionably inside the VNet
- The agent may not be

Without a capability host, the agent runtime does not inherit:
- VNet DNS server settings
- Corporate DNS forwarding rules
- On‑premises name resolution paths

As a result, DNS fails even though the DNS design itself is correct.

Secondary Symptom: HTTP 401 Errors

After subnet injection and DNS are corrected, some customers encounter HTTP 401 responses.
This typically means: The API is now reachable The request is successfully routed through the private path Authentication or authorization is failing At this stage, troubleshooting moves from networking to identity: Validate the credential or token configured in the Foundry connection Confirm expected headers, audience, or auth flow Account for managed proxy hops in the request path A 401 at this point is progress—it confirms private connectivity is working. Permissions and Required Services for Capability Hosts Creating capability hosts requires both the correct Azure role-based access control (RBAC) permissions and the presence of specific dependent services. These prerequisites are frequently overlooked and can silently block capability host creation or leave hosts in a failed provisioning state. Required Permissions (RBAC) Microsoft documentation explicitly calls out the following minimum permissions: Contributor role on the Microsoft Foundry account to create capability hosts. User Access Administrator or Owner role on the subscription or resource group to grant the Foundry project’s managed identity access to dependent Azure resources when using standard agent setup. Without these roles, capability host creation may fail, or the host may be created without access to required downstream services. For details, see Role-based access control (RBAC) for Microsoft Foundry. Required Azure Services and Connections For standard and network-secured agent setups, capability hosts reference customer-owned Azure resources. The following services must exist and be connected to the Foundry project before creating the capability host: Azure Storage – for file uploads and artifacts Azure AI Search – for vector stores and retrieval Azure Cosmos DB – for thread and conversation storage Azure AI Services / Azure OpenAI – for model execution These services must be deployed in supported regions and, for private networking scenarios, in the same region as the virtual network. 
Capability hosts reference these resources through project connections, not raw resource IDs.

Networking-Specific Requirements

When using private networking:
- The agent subnet must already exist and be delegated appropriately.
- The capability host must reference the correct customer subnet at creation time.
- Required private endpoints and DNS resolution must be in place for dependent services.

If networking or connections change, capability hosts cannot be updated in-place and must be deleted and recreated with the corrected configuration.

Practical Fix Patterns

1. Create and Associate a Project Capability Host
- Bind the project to the intended delegated agent subnet
- Verify the customerSubnet reference
- Redeploy the agent after association
This aligns directly with the Standard Setup with Private Networking model documented by Microsoft.

2. Validate Agent Placement and Network Inheritance
- Confirm the agent is associated with the expected capability host
- Verify the capability host references the correct subnet
- Ensure network and DNS settings are applied at the subnet level
The agent inherits routing and DNS behavior only after successful subnet injection.

3. Validate DNS From the Agent Subnet
- Confirm VNet DNS settings point to on‑prem DNS
- Test name resolution from a VM in the same subnet
Once injected, the agent uses the same DNS behavior as other resources in that subnet.

4. Use a Supported Foundry Experience
Be aware of documented constraints:
- End‑to‑end network isolation is not supported in the newer Foundry portal experience
- Network‑isolated agent scenarios require the classic Foundry experience, SDK, or CLI
- Hosted agents do not support full isolation
Mismatch here can make a correct network design appear broken.

A Simple Checklist

When a Foundry agent cannot reach an on‑prem API:
1. Is a Project Capability Host associated?
2. Is the capability host bound to the correct subnet?
3. Is on‑prem DNS reachable from that subnet?
4. Is a supported Foundry experience in use?
5. If reachable, is authentication configured correctly?

If items 1 and 2 are not satisfied, all other troubleshooting is premature.

Closing

Most private networking issues with Microsoft Foundry are not caused by VPNs or DNS infrastructure. They result from an incomplete understanding of where the agent runtime actually executes. Capability Hosts are the control point. When they are correctly configured, Foundry agents behave exactly as described in Microsoft guidance: they inherit VNet routing, DNS, and security controls and can securely access on‑premises systems over private connectivity.

No capability host = no VNet injection = no on‑prem connectivity.

References

The following Microsoft‑published resources were referenced and aligned with throughout this article:
- Network‑secured agent setup (GitHub) – Reference implementation demonstrating a network‑secured Foundry Agent Service deployment with a customer‑managed virtual network, delegated agent subnet, and private connectivity patterns. This notebook illustrates how agent runtimes inherit network behavior only after subnet injection. ralacher/network-secured-agent
- Set up private networking for Foundry Agent Service - Microsoft Foundry | Microsoft Learn

Gemma 4 now available in Microsoft Foundry
Experimenting with open-source models has become a core part of how innovative AI teams stay competitive: experimenting with the latest architectures and often fine-tuning on proprietary data to achieve lower latencies and cost. Today, we’re happy to announce that the Gemma 4 family, Google DeepMind’s newest model family, is now available in Microsoft Foundry via the Hugging Face collection. Azure customers can now discover, evaluate, and deploy Gemma 4 inside their Azure environment with the same policies they rely on for every other workload.

Foundry is the only hyperscaler platform where developers can access OpenAI, Anthropic, Gemma, and more than 11,000 other models under a single control plane. Through our close collaboration with Hugging Face, Gemma 4 joining that collection continues Microsoft’s push to bring customers the widest selection of models from any cloud, in line with our enhanced investments in open-source development.

Frontier Intelligence, open-source weights

Released by Google DeepMind on April 2, 2026, Gemma 4 is built from the same research foundation as Gemini 3 and packaged as open weights under an Apache 2.0 license. Key capabilities across the Gemma 4 family:
- Native multimodal: Text + image + video inputs across all sizes; analyze video by processing sequences of frames; audio input on edge models (E2B, E4B)
- Enhanced reasoning & coding capabilities: Multi-step planning, deep logic, and improvements in math and instruction-following enabling autonomous agents
- Trained for global deployment: Pretrained on 140+ languages with support for 35+ languages out of the box
- Long context: Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B) allow developers to reason across extensive codebases, lengthy documents, or multi-session histories

Why choose Foundry?

Foundry is built to give developers breadth -- access to models from major model providers, open and proprietary, under one roof.
Stay within Azure to work with leading models. When you deploy through Foundry, models run inside your Azure environment and are subject to the same network policies, identity controls, and audit processes your organization already has in place.

Managed online endpoints handle serving, scaling, and monitoring without manually setting up and managing the underlying infrastructure.

Serverless deployment with Azure Container Apps allows developers to deploy and run containerized applications while reducing infrastructure management and saving costs.

Gated model access integrates directly with Hugging Face user tokens, so models that require license acceptance stay compliant and can be accessed without manual approvals.

Foundry Local lets you run optimized Hugging Face models directly on your own hardware using the same model catalog and SDK patterns as your cloud deployments. Read the documentation here: https://aka.ms/foundrylocal and https://aka.ms/HF/foundrylocal

Microsoft’s approach to Responsible AI is grounded in our AI principles of fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy new models responsibly in production environments.

What are teams building with Gemma 4 in Foundry?

Gemma 4’s combination of multimodal input, agentic function calling, and long context offers a wide range of production use cases:
- Document intelligence: Processing PDFs, charts, invoices, and complex tables using native vision capabilities
- Multilingual enterprise apps: 140+ natively trained languages, suited to multinational customer support and content platforms as well as language learning tools for grammar correction and writing practice
- Long-context analytics: Reasoning across entire codebases, legal documents, or multi-session conversation histories

Getting started

Try Gemma 4 in Microsoft Foundry today.
New models from Hugging Face continue to roll out to Foundry on a regular basis through our ongoing collaboration. If there's a model you want to see added, let us know here. Stay connected to our developer community on Discord and stay up to date on what is new in Foundry through the Model Mondays series.

Now in Foundry: Microsoft Harrier and NVIDIA EGM-8B
This week's Model Mondays edition highlights two models that share a common thread: each achieves results comparable to larger leading models, as a result of targeted training strategies rather than scale. Microsoft Research's harrier-oss-v1-0.6b achieves state-of-the-art results on the Multilingual MTEB v2 embedding benchmark at 0.6B parameters through contrastive learning and knowledge distillation. NVIDIA's EGM-8B scores 91.4 average IoU on the RefCOCO visual grounding benchmark by training a small Vision Language Model (VLM) with reinforcement learning to match the output quality of much larger models. Together they represent a practical argument for efficiency-first model development: the gap between small and large models continues to narrow when training methodology is the focus rather than parameter count alone.

Models of the week

Microsoft Research: harrier-oss-v1-0.6b

Model Specs
- Parameters / size: 0.6B
- Context length: 32,768 tokens
- Primary task: Text embeddings (retrieval, semantic similarity, classification, clustering, reranking)

Why it's interesting
- State-of-the-art on Multilingual MTEB v2 from Microsoft Research: harrier-oss-v1-0.6b is a new embedding model released by Microsoft Research, achieving a 69.0 score on the Multilingual MTEB v2 (Massive Text Embedding Benchmark) leaderboard, placing it at the top of its size class at release. It is part of the harrier-oss family spanning harrier-oss-v1-270m (66.5 MTEB v2), harrier-oss-v1-0.6b (69.0), and harrier-oss-v1-27b (74.3), with the 0.6B variant further trained with knowledge distillation from the larger family members. Benchmarks: Multilingual MTEB v2 Leaderboard.
- Decoder-only architecture with task-instruction queries: Unlike most embedding models that use encoder-only transformers, harrier-oss-v1-0.6b uses a decoder-only architecture with last-token pooling and L2 normalization.
Queries are prefixed with a one-sentence task instruction (e.g., "Instruct: Retrieve relevant passages that answer the query\nQuery: ...") while documents are encoded without instructions, allowing the same deployed model to be specialized for retrieval, classification, or similarity tasks through the prompt alone.
- Broad task coverage across six embedding scenarios: The model is trained and evaluated on retrieval, clustering, semantic similarity, classification, bitext mining, and reranking, making it suitable as a general embedding backbone for multi-task pipelines rather than a single-use retrieval model. One endpoint, consistent embeddings across the stack.
- 100+ language support: Trained on a large-scale mixture of multilingual data covering Arabic, Chinese, Japanese, Korean, and 100+ additional languages, with strong cross-lingual transfer for tasks that span language boundaries.

Try it

Use Case | Prompt Pattern
--- | ---
Multilingual semantic search | Prepend task instruction to query; encode documents without instruction; rank by cosine similarity
Cross-lingual document clustering | Embed documents across languages; apply clustering to group semantically related content
Text classification with embeddings | Encode labeled examples + new text; classify by nearest-neighbor similarity in embedding space
Bitext mining | Encode parallel corpora in source and target languages; align segments by embedding similarity

Sample prompt for a global enterprise knowledge base deployment:

You are building a multilingual internal knowledge base for a global professional services firm. Using the harrier-oss-v1-0.6b endpoint deployed in Microsoft Foundry, encode all internal documents (policy guides, project case studies, and technical documentation) across English, French, German, and Japanese. At query time, prepend the task instruction to each employee query: "Instruct: Retrieve relevant internal documents that answer the employee's question\nQuery: {question}".
Retrieve the top-5 most similar documents by cosine similarity and pass them to a language model with the instruction: "Using only the provided documents, answer the question and cite the source document title for each claim. If no document addresses the question, say so."

NVIDIA: EGM-8B

Model Specs
- Parameters / size: ~8.8B
- Context length: 262,144 tokens
- Primary task: Image-text-to-text (visual grounding)

Why it's interesting
- Performs well on visual grounding compared to larger models even at its small size: EGM-8B achieves 91.4 average Intersection over Union (IoU) on the RefCOCO benchmark, the standard measure of how accurately a model localizes a described region within an image. Compared to its base model Qwen3-VL-8B-Thinking (87.8 IoU), EGM-8B achieves a +3.6 IoU gain through targeted Reinforcement Learning (RL) fine-tuning. Benchmarks: EGM Project Page.
- 5.9x faster than larger models at inference: EGM-8B achieves 737ms average latency. The research demonstrates that test-time compute can be scaled horizontally across small models (generating many medium-quality responses and selecting the best) rather than relying on a single expensive forward pass through a large model.
- Two-stage training: EGM-8B is trained first with Supervised Fine-Tuning (SFT) on detailed chain-of-thought reasoning traces generated by a proprietary VLM, then refined with Group Relative Policy Optimization (GRPO) using a reward function combining IoU accuracy and task success. The intermediate SFT checkpoint is available as nvidia/EGM-8B-SFT for developers who want to experiment with the intermediate stage.
- Addresses a root cause of small model grounding errors: The EGM research identifies that 62.8% of small model errors on visual grounding stem from complex multi-relational descriptions, where a model must reason about spatial relationships, attributes, and context simultaneously.
By focusing test-time compute on reasoning through these complex prompts, EGM-8B closes the gap without increasing the underlying model size.

Try it

Use Case | Prompt Pattern
--- | ---
Object localization | Submit image + natural language description; receive bounding box coordinates
Document region extraction | Provide scanned document image + field description; extract specific regions
Visual quality control | Submit product image + defect description; localize defect region for downstream classification
Retail shelf analysis | Provide shelf image + product description; return location of specified SKU

Sample prompt for a retail and logistics deployment:

You are building a visual inspection system for a logistics warehouse. Using the EGM-8B endpoint deployed in Microsoft Foundry, submit each incoming package scan image along with a natural language grounding query describing the region of interest: "Please provide the bounding box coordinate of the region this sentence describes: {description}". For example: "the label on the upper-left side of the box", "the barcode on the bottom face", or "the damaged corner on the right side". Use the returned bounding box coordinates to route each package to the appropriate inspection station based on the identified region.

Getting started

You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured.
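For readers who want to check bounding boxes returned by a grounding model such as EGM-8B against reference annotations, the Intersection over Union (IoU) metric cited above can be computed in a few lines. This is a minimal sketch under one assumption: boxes are given as (x_min, y_min, x_max, y_max) tuples. The post does not specify the coordinate format the endpoint returns, so adapt the unpacking to the actual response.

```python
def iou(box_a, box_b):
    """Intersection over Union of two axis-aligned boxes.

    Boxes are assumed to be (x_min, y_min, x_max, y_max) tuples; this layout
    is an assumption for illustration, not the documented EGM-8B output format.
    """
    ax1, ay1, ax2, ay2 = box_a
    bx1, by1, bx2, by2 = box_b

    # Width and height of the overlap rectangle, clamped to zero when disjoint.
    inter_w = max(0.0, min(ax2, bx2) - max(ax1, bx1))
    inter_h = max(0.0, min(ay2, by2) - max(ay1, by1))
    intersection = inter_w * inter_h

    area_a = (ax2 - ax1) * (ay2 - ay1)
    area_b = (bx2 - bx1) * (by2 - by1)
    union = area_a + area_b - intersection

    # Degenerate (zero-area) inputs yield 0.0 instead of dividing by zero.
    return intersection / union if union > 0 else 0.0


if __name__ == "__main__":
    predicted = (0, 0, 2, 2)
    reference = (1, 1, 3, 3)
    print(iou(predicted, reference))
```

In the usage example the two 2×2 boxes overlap in a 1×1 square, so the IoU is 1 / (4 + 4 − 1) = 1/7. A benchmark figure such as "91.4 average IoU" is this ratio, expressed as a percentage and averaged over many image and description pairs.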
Learn how to discover models and deploy them using Microsoft Foundry documentation:
- Follow along the Model Mondays series and access the GitHub to stay up to date on the latest
- Read Hugging Face on Azure docs
- Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry
- Explore models in Microsoft Foundry
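The query pattern described earlier for harrier-oss-v1-0.6b (instruction-prefixed queries, plain documents, L2 normalization, ranking by cosine similarity) can be sketched without any SDK. The helper names format_query, l2_normalize, and rank_by_cosine are hypothetical, and the short vectors below are toy stand-ins for embeddings that would come from a deployed endpoint; only the "Instruct: ...\nQuery: ..." template comes from the post.

```python
import math


def format_query(task, query):
    """Build the instruction-prefixed query string; documents get no prefix."""
    return f"Instruct: {task}\nQuery: {query}"


def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)


def rank_by_cosine(query_vec, doc_vecs, top_k=5):
    """Return the top_k (doc_id, score) pairs by cosine similarity."""
    q = l2_normalize(query_vec)
    scored = []
    for doc_id, vec in doc_vecs.items():
        d = l2_normalize(vec)
        # For unit vectors the dot product equals the cosine similarity.
        scored.append((doc_id, sum(a * b for a, b in zip(q, d))))
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_k]


if __name__ == "__main__":
    text = format_query("Retrieve relevant passages that answer the query",
                        "How do I reset my password?")
    print(text)
    # Toy 2-dimensional vectors standing in for real endpoint embeddings.
    docs = {"a": [2.0, 0.0], "b": [0.0, 3.0], "c": [1.0, 1.0]}
    print(rank_by_cosine([1.0, 0.0], docs, top_k=2))
```

Because the vectors are normalized first, ranking by dot product and ranking by cosine similarity give the same order, which is why many vector stores only need an inner-product index for embeddings of this kind.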