ai
37 TopicsFoundry IQ: Improve recall by up to 54% with knowledge bases
Foundry IQ: Improve recall by up to 54% with knowledge bases. Foundry IQ (Azure AI Search) has improved its agentic retrieval engine resulting in better answer quality and improved token cost savings. We compared standalone retrieval tools to knowledge bases using the challenging BrowseComp-Plus benchmark and found: Replacing single-shot RAG with a knowledge base improves evidence recall by up to 46%. Combining a smaller agent model with agentic retrieval improves evidence recall by up to 54% while controlling costs and increasing agent responsiveness. In both cases, the amount of retrieval tool calls your agent makes is reduced, resulting in 34% token cost savings.854Views3likes0CommentsEvaluate before you ship: introducing the Voice Live Evaluation Harness
You've built a voice agent on Azure Voice Live. It demos beautifully. Then a teammate asks the question that keeps every voice-agent team up at night: "How do we know it's actually good — across 200 customer calls, not the three we just listened to?" Until today, the honest answer was: put on headphones. Manual listening. Subjective scoring in a spreadsheet. No baseline, no regression signal, no way to defend a model swap with data. We're releasing the Voice Live Evaluation Harness to change that. It's an open-source, deployable evaluation pipeline that runs pre-recorded multi-turn audio through your Voice Live agent and scores every turn with the same evaluators built into Microsoft Foundry — automatically, repeatably, and in parallel. TL;DR Two flavors, one repo. Run the CLI harness locally against a Foundry project for fast iteration, or deploy the evaluation agent into your Azure subscription with the Azure Developer CLI (azd) for a fully-hosted evaluation backend. 13 built-in evaluators score every turn — intent resolution, task adherence, task completion, response completeness, tool-call accuracy, groundedness, and more — viewable per-turn and in aggregate inside the Foundry portal. Supports the three Voice Live modes you actually ship in — Semantic VAD, Push-to-Talk, and Foundry Agent mode — including multi-turn conversations with tool calls and grounding. Grows with your agent. Start with the sample datasets, then layer in audio collected from user testing and production traffic so your evaluation set matures alongside the agent. 🔗 Repo: microsoft-foundry/voicelive-evaluation · Docs: Evaluate Voice Live agents (preview) Why systematic evaluation matters for voice agents Text agents have a mature evaluation story. Voice agents don't — and the gaps actually matter more, because every voice failure happens in real time, in front of a customer, on a phone line you can't easily replay. The Voice Live Evaluation Harness closes that gap with four concrete capabilities: Establish a quality baseline. Run a representative audio dataset through your agent and get scores you can publish as your launch bar. Compare configurations side-by-side. Swap the underlying model (GPT-Realtime 1.5, Azure-Realtime, MAI-Transcribe-1.5), change the voice, tune VAD thresholds — and see exactly which knobs moved which scores. Catch regressions before users do. Wire it into CI and fail the build when intent resolution drops below your threshold. Optimize with data, not vibes. When task-completion drops, drill into the per-turn scores to see whether the agent failed to call the right tool, misunderstood intent, or generated an incomplete response. Keep iterating as production data rolls in. Start with the sample datasets, then grow your evaluation set with audio captured from internal testing, pilot users, and real production traffic. Re-run after every prompt tweak or model swap so the harness becomes a continuous quality signal — not a one-time launch checklist. How it works The pipeline is a five-stage loop: Audio Dataset. Multi-turn audio + expected behaviors in a simple JSONL schema. Four sample datasets ship in the repo (travel planning, complex data analytics, tool-calling tests, batch multi-conversation) so you can run end-to-end on day one. Voice Live API. Pick your Voice Live mode (Semantic VAD, PTT, or Foundry Agent), model, voice, and turn-detection settings via a JSON config file, then stream each turn of audio through the API — locally with the CLI harness, or, if you've deployed the evaluation agent, via the hosted Container App for long-running batches in your own subscription. Transcript + Response. Every turn produces an agent transcript, the model's response, and any tool calls it made — captured automatically for scoring. Foundry Evaluators. 13 built-in evaluators — powered by the same Foundry evaluator models (GPT-4.1-mini and o4-mini) used across Microsoft Foundry — judge every turn on intent resolution, task adherence, tool-call accuracy, groundedness, and more. Quality Scores. Per-turn and aggregate scores land in the Microsoft Foundry portal under your project's Evaluation tab — sortable, filterable, comparable across runs. Then loop. Audio captured from internal testing, pilots, and production traffic feeds back into the dataset — each pass makes the next evaluation more representative of what users actually do. What gets measured The accelerator ships 13 built-in evaluators out of the box, covering the dimensions that matter most for production voice agents: Category Evaluators Intent & task quality Intent Resolution · Task Adherence · Task Completion · Response Completeness Tool calling Tool Call Accuracy · Tool Call Parameter Validity · Tool Result Usage · Tool Call Success Content quality Groundedness · Relevance · Fluency · Coherence Conversational dynamics Turn-taking quality Every evaluator runs against the same Foundry evaluator models (GPT-4.1-mini and o4-mini) that power evaluation across the rest of Microsoft Foundry — so your voice-agent scores are directly comparable to your text-agent scores. Run the CLI locally against your existing Voice Live endpoint If you already have a Voice Live agent deployed and just want fast iteration on a laptop: git clone https://github.com/microsoft-foundry/voicelive-evaluation.git cd voicelive-evaluation/evaluation_harness python -m venv .venv && source .venv/bin/activate pip install -r requirements.txt cp .sample_env .env # Edit .env with your AZURE_VOICELIVE_ENDPOINT python voice_agent_evaluation.py \ --config configs/sample_vad_realtime.json The full walkthrough — dataset schema, configuration reference, score interpretation, and troubleshooting — is in the documentation. Get started Repo: microsoft-foundry/voicelive-evaluation Docs: How to evaluate Voice Live agents (preview) We'd love your feedback — try it, file issues, and tell us which evaluators you wish you had.162Views0likes0CommentsImproved data processing features in Foundry IQ: Richer content extraction and data enrichment
Foundry IQ (Azure AI Search) introduces new capabilities in preview focused on improving enterprise data pipelines for RAG and agentic retrieval scenarios. The release expands SharePoint indexing support to include ASPX pages, SharePoint Lists, recursive subsite discovery, and source traceability, enabling broader access to enterprise knowledge across intranet content, operational lists, and document libraries. New integrations with Content Understanding in Foundry Tools improve document extraction, semantic chunking, structure preservation, and AI-generated image descriptions for complex documents such as PDFs. These capabilities help preserve layout, reading order, tables, and visual context during ingestion, improving grounding quality and retrieval accuracy in enterprise AI applications.362Views1like0CommentsFoundry IQ: New governance and enterprise AI security capabilities
Enterprise AI isn’t just about better retrieval—it’s about secure access to business‑critical content. Discover how Foundry IQ (Azure AI Search) enables governance, compliance, and private connectivity across agentic retrieval workflows. We are introducing the following features: - Incremental SharePoint permissions sync for indexed document content, SharePoint Lists and ASPX pages. - Purview sensitivity labels in Foundry IQ knowledge bases - Purview auditing for elevated admin queries - Private connectivity support between for Foundry IQ and Foundry resources via NSP370Views1like0CommentsWhat's New in Microsoft Foundry Labs – May 2026
Four new releases this month — a new benchmark for how agents interact, an experimental end-to-end agentic stack, a faster image model, and a first-party geospatial model. Last month we kicked off this series with a roundup of new Foundry Labs releases across speech, vision, and multimodal AI. This month, we're back with another update — read on to see learn what's new! SocialReasoning-Bench: measuring whether AI agents act in their user's best interest We are moving into a world where agents are interacting with other agents on behalf of their users, and thus, task completion is no longer a sufficient measure of usefulness. What matters is whether the agent advocates well for the person it represents. SocialReasoning-Bench, a new open-source benchmark from Microsoft Research AI Frontiers, measures exactly that. The benchmark currently supports two main scenarios — Calendar Coordination and Marketplace Negotiation — and scores them on two new metrics: Outcome Optimality (the share of available value the agent captures for its principal) and Due Diligence (the quality of the process used, scored against a deterministic reasonable-agent policy). Together they define an operational notion of duty of care. Learn more about SocialReasoning-Bench in Foundry Labs Try it on GitHub MagenticLite, Magentic Orchestrator & Fara 1.5: an end-to-end agentic stack Microsoft Research AI Frontiers also released a complete agentic stack: MagenticLite is the application layer — the next generation of Magentic-UI, with a redesigned chat-and-browser interface and a harness rebuilt for small models. It works across both your browser and your local file system in a single workflow, with browser sessions and code execution sandboxed by Quicksand, the project's open-source QEMU runtime. Transparency is baked in: you see what the agent is reasoning about, you can take direct control at any moment, and critical actions pause for explicit approval. MagenticBrain is the orchestrator of the stack — an orchestration model fine-tuned on Qwen 3 8B that plans, codes, and delegates. Critically, it was trained end-to-end inside the MagenticLite harness with the same tool schemas it sees at inference, eliminating the gap between training and execution. Fara1.5 is the next generation of Microsoft's computer-use model family — three models (4B, 9B, 27B) on Qwen 3.5, with the 9B as the recommended flagship. Fara1.5 sets a new state of the art among small computer-use models on the Online‑Mind2Web benchmark, nearly doubling the performance of the previously released Fara‑7B, and the 27B variant records 90+% on the same benchmark 1 . Together, they represent an open-source, end-to-end agentic stack that work together, so developers can build, plan, and run agents on infrastructure they control. Learn more about MagenticLite on Foundry Labs Try it on GitHub MAI-Image-2-Efficient: high-quality image generation at speed and scale MAI-Image-2-Efficient — Image‑2e for short — is Microsoft's latest text-to-image model, built on the same architecture as MAI-Image-2 (which debuted at #3 on the Arena.ai leaderboard for image model families) but engineered for the production workloads where every millisecond and every GPU hour matters. When normalized by latency and GPU usage, Image‑2e is up to 22% faster and 4x more efficient than MAI-Image-2 — and outpaces leading text-to-image models by 40% on average 1 . In short, it delivers more output for less compute, giving teams the headroom to iterate faster without blowing through their GPU budget. That efficiency unlocks new categories of work. E-commerce platforms, media companies, and marketing teams generating thousands of images per day for targeted ads, concept art, and mood boards translate it directly into larger batches at lower GPU cost. Chatbots, creative copilots, and AI-powered design tools translate it into latency low enough for real-time interaction. The model also has a distinct visual signature — sharp, defined lines that fit illustration, animation, and attention-grabbing photoreal imagery. Learn more about MAI-Image-2-Efficient in Foundry Labs Try it in Microsoft Foundry EO/OS Object Detection: production-grade earth observation Object detection on satellite and aerial imagery has historically required months of in-house computer vision engineering — bespoke models, custom labels, fragile pipelines. EO/OS Object Detection collapses that into a managed first-party endpoint in Microsoft Foundry. Built by the team behind Planetary Computer, EO/OS Object Detection is a model that identifies and localizes objects in overhead imagery and returns bounding-box detections optimized for batch processing of large image archives. It's part of a new GeoAI category in Microsoft Foundry, opening Microsoft's geospatial intelligence stack to anyone building on satellite or aerial data. Defense and intelligence teams analyzing satellite feeds, infrastructure operators monitoring assets at scale, agriculture and energy companies tracking change across vast landscapes, and disaster response teams triaging post-event imagery can all swap a custom one-off detector for a managed endpoint that fits inside their existing Foundry stack. Put simply, the work shifts from "build the detector" to "use the detector" — and the detection signal lands faster, more consistently, and inside the same Microsoft platform their broader AI work already runs on. Learn more about EO/OS Object Detection in Foundry Labs Try EO/OS Object Detection in Microsoft Foundry What's Next Foundry Labs is where Microsoft's most ambitious AI research becomes accessible to builders and where the products you'll rely on tomorrow are taking shape today. There's plenty more in the pipeline. Explore more AI innovations on Foundry Labs Join the Microsoft Foundry Discord community to shape the future of AI together References As tested on April 13, 2026. Compared to MAI-Image-2 when normalized by latency and GPU usage. Throughput per GPU vs MAI-Image-2 on NVIDIA H100 at 1024×1024; measured with optimized batch sizes and matched latency targets. Results vary with batch size, concurrency, and latency constraints.607Views2likes0Comments8 Architectural Pillars to Boost GenAI LLM Accuracy and Performance in Low Cost
Smarter AI architecture, not bigger LLM models - how engineering teams push LLM accuracy and high performance in low cost. Enterprises using LLM (Large Language Models) hits the same ceiling and paying big price! A raw API call to a frontier model- GPT-4, Claude, Gemini delivers only 35-40% accuracy on structured output tasks like code generation, NL to DAX query generation, domain-specific reasoning. Prompt engineering pushes that to ~60%. But the final 35+ percentage points? Those come from system architecture, not model upgrades. This guide presents 8 architectural pillars, distilled from production Gen AI systems, that compound to close the accuracy gap. These patterns are model-agnostic and domain-agnostic, they apply equally to chatbots, coding assistants, content/query generators, automation agents, and any application where an LLM produces structured or semi-structured output. It’s based on my recent Gen AI projects. The key takeaway: use the LLM as one component in a larger system, not as the system itself. Surround it with deterministic guardrails, verified knowledge, and feedback loops. Pillar 1: Enhance Prompts with Verified Knowledge Context Impact: +35–40% accuracy (based on production use cases; may vary by domain) Top source of LLM errors in production is hallucinated identifiers knowledgebase, the model invents names, references, or structures that don't exist in the target system. This happens because LLMs are trained on general knowledge but deployed against specific, private enterprise systems they've never seen local database and knowledgebase. The fix is straightforward: inject verified, system-specific context (type definitions, API specs, ontologies, configuration schemas, entity catalogues) directly into the prompt so the model composes from known-good elements rather than recalling from training data. Use Knowledge Graph for better sematic knowledge. How to Implement Provide explicit context, never implicit- Whatever the LLM needs to reference identifiers, valid values, semantic knowledge, structures must appear verbatim in the prompt or retrieved context window. Filter aggressively. A full knowledge base with thousands of entities overwhelms the context window and confuses the model. Use intelligent filtering to surface only needed 5-10 most relevant elements per request. Store structured semantic knowledge in a graph or searchable index. This enables relationship-aware retrieval: "given entity X, what related entities, attributes, and constraints are also needed?" Include rich Semantic metadata. Names alone are insufficient. Include types, constraints, valid value ranges, relationships, and usage notes to minimize ambiguity. Keep context fresh. Stale context causes a different class of hallucination the model generates valid-looking output that references outdated structures. Sync your knowledge store with your source of truth. Why This Works LLMs excel at composition and reasoning combining elements, applying logic, following patterns. They are unreliable at recall of specific identifiers exact names, valid values, structural constraints. By offloading recall to a deterministic retrieval system and giving the LLM only composition tasks, you play to each system's strengths. Pillar 2: Tiered LLM Approach: Route Deterministically First, Use LLMs Last Impact: 80% cost reduction, 85% latency reduction, eliminates non-deterministic errors for most traffic. The most impactful architectural insight: most production requests don't need an LLM at all. A well-designed system handles 60-70% of traffic with deterministic logic templates, composition rules, cached results and reserves expensive, non-deterministic LLM calls only for genuinely novel inputs. The Three-Tier Model These metrics are from a real use case to convert NLP to Power BI DAX query. Tier Strategy Uses LLM ? Latency Accuracy Tier 0 Template slot-filling - handles requests that match known patterns exactly the system fills slots in a pre-built template with extracted parameters. No LLM, no non-determinism, near-perfect accuracy, sub-100ms response. No ~50ms 95-98% Tier 1 Compose from pre-validated fragments- handles requests that combine known patterns in new ways. The system retrieves pre-validated building blocks via search, composes them using deterministic rules, and validates the result. Still no LLM call. No ~200ms 90-95% Tier 2 Full LLM generation with enriched context- is reserved for genuinely novel requests that can't be served deterministically. Even here, the LLM receives maximum support: filtered context, relevant examples, explicit rules, and structured planning. Yes (1 call) 2-5s 88-93% Complexity-Based Routing A lightweight scoring function (evaluated in <1ms) routes each incoming request: Factors: reasoning depth, number of components, cross-references, constraints, nesting depth, novelty (distance from known patterns) Score 0-39: Tier 0 (deterministic template) Score 40-59: Tier 1 if confidence ≥ 85%, else Tier 2 Score 60+: Tier 2 (LLM generation) This routing achieves 96%+ accuracy in tier assignment and ensures the expensive path is only taken when necessary. Why This Matters Cost: 70-80% of requests cost zero LLM tokens Latency: Majority of responses in <200ms instead of 2-5s Reliability: Deterministic tiers produce identical output for identical input. Scalability: Deterministic tiers scale horizontally with trivial compute Pillar 3: Encode Prompt Anti-Patterns as Explicit Rules Impact: +8-10% accuracy, ~80% reduction in common structural errors LLM mistakes are patterned, not random. In any domain, 80% of errors cluster around a small set of 6-13 recurring structural mistakes. Instead of hoping the model avoids them through general instruction-following, compile these mistakes into explicit WRONG => CORRECT rules embedded directly in the system prompt. How to Implement Collect error data. Run 100+ requests through your system and categorize the failures. You'll find the same 6-13 patterns appearing repeatedly. Write concrete rules. For each pattern, show the exact wrong output and the exact correct alternative, with a one-line explanation of why. Embed in system prompt. Place rules prominently after the task description, before examples. Use formatting that's hard to ignore (headers, bold, explicit "NEVER" language). Keep the list short. 6-13 rules maximum. Beyond that, attention dilutes and the model starts ignoring rules. Prioritize by frequency. Refresh continuously. As the system improves (via other pillars), some errors disappear. New error types emerge. Update the rule set quarterly. Why This Works LLMs respond strongly to explicit negative examples. A generic instruction like "be careful with X" has minimal impact. But showing the exact wrong output the model tends to produce, paired with the correction, creates a strong avoidance signal. It's analogous to unit tests. Pillar 4: Retrieve Few-Shot Examples Dynamically Impact: +5-15% accuracy depending on domain complexity Static examples hardcoded in a prompt become stale, irrelevant of context tokens. Dynamic few-shot retrieval selects the 3-5 most relevant examples for each specific request, maximizing the signal-to-noise ratio in the prompt. Hybrid Retrieval Architecture The most effective approach combines two search strategies for intent search to understand natural language (NL) context: Keyword search (BM25) Finds examples with exact matching terms, identifiers, and domain vocabulary Vector search (semantic similarity) Finds examples with similar intent and structure, even if wording differs Rank fusion Merges results from both strategies, re-ranking by combined relevance This hybrid approach outperforms either strategy alone because keyword search catches exact identifier matches that vector search dilutes, while vector search captures semantic similarity that keyword search misses entirely. Best Gen AI Architectural Practices Match complexity to complexity. Simple requests should see simple examples. Complex requests should see complex examples. Mismatched examples confuse the model. Include negative examples. For the detected request type, include 1-2 "wrong => correct" pairs alongside positive examples. This reinforces Pillar 3's anti-pattern rules with concrete, contextually relevant demonstrations. Pre-compute embeddings. Generate vector embeddings at indexing time, not at query time. Cache retrieval results for repeated patterns. Curate quality over quantity. 3 excellent, diverse examples beat 10 mediocre ones. Each example should demonstrate a distinct pattern or edge case. Keep examples current. As your system evolves, old examples may demonstrate outdated patterns. Review and refresh the example store periodically. Pillar 5: Feedback Loop- Validate and Auto-Fix Every Output Deterministically Impact: +3-5% accuracy as a safety net, plus continuous improvement via feedback No matter how well-prompted, LLMs will occasionally produce outputs with minor structural errors - wrong casing, missing delimiters, references to slightly-incorrect identifiers, or subtle format violations. A deterministic post-processing pipeline catches and fixes these without any additional LLM calls. The Validation Pipeline LLM Output => Parse (grammar/AST) => Rule-Based Fixes => Compliance Check/validation => Final Output Each stage is fully deterministic: Parsing: Use a formal grammar or AST parser (ANTLR, tree-sitter, language-native parsers) to structurally analyse the output. Never regex-parse structured output - it's fragile and misses edge cases. Rule-based fixes: 10-20 deterministic transformation rules that correct known error patterns - name normalization, casing fixes, missing delimiters, structural repairs. Compliance check: Verify every identifier referenced in the output actually exists in the provided context. Flag unknown references. Design Principles Zero LLM calls in the fix pipeline. Every fix is a regex, an AST transformation, or a lookup table operation. Instant, free, deterministic, 100% reliable. Fail safe. If a fix is ambiguous (multiple valid corrections possible), pass through rather than corrupt. A minor error is better than a confident wrong "fix." Log everything. Track every fix applied, categorized by type. This data drives the feedback loop. The Critical Feedback Loop- The validation pipeline's most important function isn't fixing outputs, it's generating improvement signals: This creates a feedback loop: the auto-fix catches errors → the errors get promoted to upstream prevention → fewer errors reach the auto-fix → the system continuously tightens. Pillar 6: Multi-Agent Orchestration with Fewer Agents and Clear Contracts Impact: Reduced latency, clearer debugging, fewer failure modes The multi-agent pattern is powerful but commonly over-applied. The counter-intuitive lesson from production systems: fewer agents with well-defined responsibilities outperform many fine-grained agents. Why Fewer Is Better Each agent handoff introduces: Latency - serialization, network calls, context assembly Context loss - information dropped between boundaries Failure modes - each handoff is a potential error point Debugging complexity - tracing issues across many agents is exponentially harder Multi-Agent Orchestration Principles Merge agents that always run sequentially. If Agent A always feeds into Agent B with no branching or conditional logic, they should be one agent with two internal steps. Parallelize independent operations. Context retrieval and example lookup are independent, run them concurrently to halve retrieval latency. Route sub-tasks to cheaper models. Decomposed sub-problems are simpler by design. Use a smaller, faster, cheaper model (3x cost savings, 2x speed improvement). Define strict contracts. Each agent boundary should have an explicit schema defining inputs and outputs. No implicit assumptions about what crosses the boundary. Only 2 of 4 agents should call an LLM. The rest are purely deterministic. This minimizes non-deterministic behavior and cost. Pillar 7: Multi-Agent Cache at Multiple Hierarchical Levels Impact: 40-50% faster responses, 85%+ combined hit rate, significant cost reduction A single cache layer captures only one type of repetition. Production systems need hierarchical caching where multiple levels catch different repetition patterns , from exact duplicates to semantic near-misses. with -> A single cache layer captures only one type of repetition. Production systems need multi-level caching to handle exact matches, similar requests, and reusable fragments. or -> with Production systems need hierarchical caching where multiple levels handle exact matches, similar requests, and reusable fragments. Pillar 8: Measure Everything, Learn Continuously Impact: Enables data-driven iteration and prevents accuracy regressions. Architecture without observability is guesswork. The final pillar ensures every other pillar stays effective over time through comprehensive metrics and automated feedback loops. This isn't a one-time setup; it's a perpetual feedback loop. Every week, the top error patterns shift slightly. The auto-fix metrics tell you exactly where to focus next. Over months, this flywheel compounds into dramatic accuracy gains that no single prompt rewrite could achieve. Auto-Learning for New Domains When extending your system to new domains or knowledge areas: Auto-classify elements using naming conventions, type analysis, and structural patterns Auto-generate templates from universal patterns (transformations, comparisons, compositions, sequences) Bootstrap few-shot examples from successful template outputs Monitor for the first 100 requests, then curate only the edge cases manually This reduces domain onboarding from days of manual work to minutes of automated bootstrapping plus focused human review of outliers. Key Takeaways Architecture beats model size. A well-architected system with a smaller model outperforms a raw frontier model call on structured tasks at a fraction of the cost. Deterministic systems should do the heavy lifting. Reserve LLMs for genuinely novel, creative tasks. 70-80% of production requests should never touch an LLM. Verified knowledge is your top accuracy lever. Ground every prompt in context the model can trust. Errors are patterned, not random- Track them, compile them, and explicitly forbid them. Build feedback loops, not static systems- Every auto-fix, every cache miss, every routing decision is a signal for improvement. Fewer agents, done well- Fewer agents with strict contracts outperform 9 agents with fuzzy boundaries in accuracy, latency, and debuggability. Measure what matters and iterates- The system that wins isn't the one with the best day-one prompt, it's the one that improves fastest over time. Production-grade GenAI isn't about finding the perfect prompt or waiting for the next LLM model release. It's about building architectural guardrails that make failure nearly impossible and when failure does occur, the system learns from it automatically. These 8 pillars, applied together, transform any LLM from an unreliable black box into a precise, efficient, and continuously improving production system. -> Production Gen AI success is not about perfect prompts or waiting for the next LLM release. It comes from designing strong system guardrails that reduce failures and ensure consistent output. Even when failures happen, the system learns and improves automatically. When applied together, these 8 pillars turn an LLM into a reliable, efficient, and continuously improving production system.Now in Foundry: Tongyi-MAI Z-Image-Turbo, with FLUX.1-schnell and SDXL base 1.0
This week's Model Mondays edition pairs three models available through the Hugging Face collection in Microsoft Foundry: Tongyi-MAI's Z-Image-Turbo, a new designed for lower latency on a single GPU and native bilingual text rendering; Black Forest Labs' FLUX.1-schnell, a 12B rectified flow transformer distilled to 1–4 step inference and one of the most adopted open-weight image models since its 2024 release; and Stability AI's stable-diffusion-xl-base-1.0 (SDXL), a latent diffusion research model that can be used to generate and modify images based on text prompts. Models of the week Tongyi-MAI: Z-Image-Turbo Model Specs Parameters / size: 6B (BF16) Resolution: Up to 1024×1024 native Primary task: Text-to-image generation (English and Chinese) Why it's interesting (Spotlight) Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture: Z-Image concatenates text tokens, visual semantic tokens, and image VAE tokens into a single unified input stream rather than running text and image through separate branches. This single-stream design can improve parameter efficiency relative to dual-stream DiT architectures at the same capacity. See the Z-Image technical report for details. 8-step inference at sub-second latency, fits in 16GB VRAM: Z-Image-Turbo is distilled with Decoupled Distribution Matching Distillation (Decoupled-DMD) and further refined with DMDR, a method that fuses DMD with reinforcement learning during post-training. The result is a model that runs 8 Number-of-Function-Evaluations (NFE) per image with no Classifier-Free Guidance (CFG)—which roughly halves the per-step compute compared to CFG-based inference. See the Decoupled-DMD and DMDR papers. Native bilingual text rendering and strong instruction adherence: Unlike most open-weight image models, which struggle with legible in-image text, Z-Image-Turbo renders complex English and Chinese text accurately which is useful for posters, signage, packaging mockups, and marketing creative. Try it Imagine you're a community programs coordinator at your city's parks department, planning a new summer event series — a "Cake Picnic in the Park" — designed to bring neighbors together over food in shared green space. The event is a few weeks out. You haven't booked bakery partners yet, so no actual cake exists, and you need marketing assets this week to start driving sign-ups: a hero image for the registration page, a flyer for community centers and libraries, social tiles for the city's channels. Use the prompt below and a photorealistic image, that can now be scaled to become additional assets like printed flyers or social images in minutes using image editing tools (or another model). Prompt: A round layered cake displayed on a white ceramic cake stand, topped with glossy fresh red cherries and smooth pastel pink buttercream frosting piped in delicate rosettes around the edge. One generous slice has been cleanly cut and removed from the front, revealing a perfect cross-section: four distinct horizontal layers alternating between soft pink sponge cake and fluffy white vanilla cream frosting. Professional bakery photography, soft natural window light from the left, shallow depth of field, marble countertop, warm and inviting atmosphere, photorealistic detail on the cake texture, cherry highlights, and frosting swirls. Black Forest Labs: FLUX.1-schnell Model Specs Parameters / size: 12B (rectified flow transformer) Resolution: Flexible up to 2 megapixels Primary task: Text-to-image generation Why it's interesting (Spotlight) Rectified flow transformer with adversarial distillation for 1–4 step inference: FLUX.1-schnell is the distilled, Apache 2.0 sibling of the FLUX.1 family. It uses a rectified flow formulation (a diffusion variant that learns straight-line probability paths between noise and data, reducing the number of solver steps needed) and is further compressed with latent adversarial diffusion distillation. The model generates high quality images in for latency-sensitive workloads. Permissive licensing for commercial use: Released under Apache 2.0, FLUX.1-schnell can be used for personal, scientific, and commercial purposes. This has driven broad adoption across product features that need an open, redistributable image backbone. Strong prompt adherence at its parameter range: At 12B parameters, FLUX.1-schnell sits between the SDXL family and frontier proprietary image models, and it remains a common reference point for evaluating open image generation prompt following—particularly for complex compositional prompts and longer captions—roughly two years after its initial release. Try it Hugging Face Spaces give developers the ability to experiment and try new models before deploying them. Test out a few prompts here: https://black-forest-labs-flux-1-schnell.hf.space then when you are ready, deploy the model in Microsoft Foundry. Stability AI: stable-diffusion-xl-base-1.0 stabilityai/stable-diffusion-xl-base-1.0 · Hugging Face Model Specs Parameters / size: 2.6B UNet (≈3.5B total with text encoders) Resolution: 1024×1024 native Primary task: Text-to-image generation Why it's interesting (Spotlight) Dual text encoder design and an ensemble-of-experts pipeline: SDXL uses two pretrained text encoders—OpenCLIP-ViT/G and CLIP-ViT/L—concatenated to capture both broad semantic alignment and finer-grained token-level cues. It can be run standalone or paired with the SDXL refiner in an ensemble-of-experts pipeline where the base model handles early denoising and the refiner specializes in the final steps. See the SDXL report for the original training and architecture details. CreativeML Open RAIL++-M licensing for managed deployments: SDXL is distributed under the CreativeML Open RAIL++-M license, which permits commercial use and downstream fine-tuning with documented use restrictions. Try it To go deeper on SDXL, take a look at Stability AI's generative-models GitHub repository, which implements the most popular diffusion frameworks for both training and inference and continues to expand with new capabilities like distillation. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry in two ways. The first by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. The second way is direct through the Hugging Face Hub, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure. Learn how to discover models and deploy them using Microsoft Foundry documentation: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry371Views0likes0CommentsA New Chapter for Realtime AI: Reasoning, Translation, and Real-Time Transcription
Voice can be one of the most direct and productive interfaces for AI — enabling customer support agents that may resolve issues without a single keystroke, live multilingual communication that can take on language barriers as conversations happen, and voice assistants capable of reasoning through complex requests in real time. Developers building these experiences need models that can keep pace with increasingly demanding latency, accuracy, and language coverage requirements. Today, OpenAI’s GPT-realtime-translate, GPT‑realtime‑2 and, GPT-realtime-whisper are rolling out into Microsoft Foundry starting today — together representing a significant step forward for the realtime model lineup available to developers on the platform. GPT-realtime-translate and GPT-realtime-whisper GPT-realtime-translate and GPT-realtime-whisper together extend the realtime stack for live multilingual audio workflows. GPT-realtime-translate is built for continuous, real-time translation, producing translated output as speech unfolds without relying on segmented pipeline processing, while GPT-realtime-whisper provides low-latency streaming transcription of the original audio in parallel. Used together, they help developers support scenarios such as live events, cross-language customer experiences, captions, monitoring, and archival workflows that require both translated output and visibility into the source speech. Continuous stream processing: This new model translates live audio without segmenting or buffering allowing for more natural interactions. New translation and transcription capabilities: Translate between languages in real time and observe faster text to speech. Available via the Realtime API GPT-realtime-2 GPT‑realtime‑2 is a generational upgrade to OpenAI's speech-to-speech model, bringing internal reasoning and an expanded context window to real-time voice applications. Where previous speech to speech models responded immediately, GPT‑realtime‑2 can work through a problem before speaking — making it well suited for voice applications that need to handle complex, multi-step queries entirely in the audio layer without routing to a separate text pipeline. Native reasoning capability: The newest realtime model introduces stronger reasoning capabilities. Now the model thinks internally before responding. Adjustable reasoning effort via {reasoning.effort}: Explicitly request the level of reasoning the model uses -- minimal, low, medium, high – to save on cost and latency. Audio in, audio out: No need for an intermediary text step, conversation stays fluid and natural. Available via the Realtime API This models is coming soon to Microsoft Foundry. Since, May 6, the models have been rolling out into the model catalog. We are excited for you to explore and build with our evolving collection of frontier models. Use cases These models work independently, but they're designed to complement each other in real-world pipelines: Live multilingual events. GPT-realtime-translate enables real-time translation of live audio, producing translated speech along with a transcript in the target language. GPT‑realtime‑whisper can be used in parallel to capture a transcription of the original speech for captions, monitoring, or archival purposes. Together, they enable multilingual live streaming with both translated experiences and visibility into the source language. Global customer support. Route inbound calls through GPT-realtime-translate to translate conversations in real time and provide a translated transcript for agents. Use GPT‑realtime‑whisper alongside it to capture the original conversation as text for compliance, quality review, or analytics. Then pass the interaction to an agent built with GPT‑realtime‑2 using {reasoning.effort}: high for complex issue resolution, all within a continuous audio pipeline. International voice assistants. Build once and deploy across languages. GPT-realtime-translate enables multilingual interaction and provides translated output with a target-language transcript, while GPT‑realtime‑whisper can optionally capture the original user input as text. GPT‑realtime‑2 manages reasoning and conversational context, supporting more complex voice interactions. Pricing Model Deployment Modality Pricing per 1M tokens Input Cached Input Output GPT-realtime-2 Global Standard Audio $32.00 $0.40 $64.00 Text $4.00 $0.40 $24.00 Image $5.00 $0.50 -- GPT-realtime-translate Global Standard Audio -- -- $2.04/hour GPT-realtime-whisper Global Standard Audio -- -- $1.02/hour *Pricing for GPT-realtime-translate and GPT-realtime-whisper will be done by the hour Getting Started Looking for ways to dive in? GPT-realtime-translate, GPT-realtime-whisper, and GPT‑realtime‑2 are rolling out into Microsoft Foundry today. Explore the model catalog and start building: https://ai.azure.com5.2KViews1like5CommentsNow in Foundry: IBM Granite 4.1, NVIDIA Nemotron Nano Omni, and Qwen3.6-35B-A3B
This week Microsoft Foundry adds two major model families alongside a reasoning powerhouse that spans the full spectrum from specialized speech and vision to general-purpose coding and long-context analysis. IBM's Granite 4.1 is a famiily of 10: six LLMs across 3B, 8B, and 30B sizes in both full-precision and FP8 variants, plus a safety model, a vision-language model for document extraction, and a multilingual speech recognition model. NVIDIA's Nemotron-3-Nano-Omni-30B-A3B-Reasoning brings multimodal capability—video, audio, image, and text—to a 31B Mamba2-Transformer Hybrid Mixture-of-Experts (MoE) architecture that activates only 3B parameters per forward pass; three variants are available in Foundry (BF16, FP8, and NVFP4), with the FP8 variant featured here. Qwen3.6-35B-A3B is designed for agentic coding among open models, with thinking preservation across conversation turns and a context window extensible to 1 million tokens. Models of the week IBM: granite-4.1-30b Model Specs Parameters / size: 30B (flagship of the Granite 4.1 family) Context length: 131,072 tokens Primary task: Text generation (multilingual instruction following, RAG, tool calling, code, summarization) Why it's interesting The Granite 4.1 release brings 10 models to Microsoft Foundry. The LLM lineup covers granite-4.1-3b-instruct, granite-4.1-8b-instruct, and granite-4.1-30b-instruct with FP8 variants for each, plus granite-guardian-4.1-8b for safety, granite-vision-4.1-4b for document and chart understanding, and granite-speech-4.1-2b for multilingual speech recognition. This is a deployment-ready stack where teams can mix and match model sizes and modalities from a single provider. Strong instruction following and reasoning at the 30B scale: granite-4.1-30b-instruct scores 80.16 on MMLU, 64.09 on MMLU-Pro, 83.74 on Big-Bench Hard (BBH), 77.80 on AGI Eval, 45.76 on GPQA (Graduate-Level Google-Proof Q&A, a graduate-level science reasoning benchmark), and 89.65 average on IFEval (instruction following). These results reflect SFT and reinforcement learning post-training focused specifically on instruction compliance, tool calling accuracy, and long-context retrieval. (View benchmarks here) Enhanced tool calling and 12-language support: Granite 4.1 models are trained for structured function calling and support 12 languages—Arabic, Chinese, Czech, Dutch, English, French, German, Italian, Japanese, Korean, Portuguese, and Spanish—with dialog, extraction, and summarization capabilities. Safety and multimodal coverage within the same family: The inclusion of granite-guardian-4.1-8b (a safety classifier for detecting harmful content and prompt injections), granite-vision-4.1-4b (a Vision Language Model optimized for document extraction from PDFs, charts, and tables), and granite-speech-4.1-2b (a 2B multilingual Automatic Speech Recognition model) means teams can address safety, document parsing, and audio ingestion within the same model family—reducing integration complexity across a full pipeline. Try it Use Case Prompt Pattern Multilingual RAG Submit retrieved document passages in any of 12 supported languages; ask model to synthesize and cite sources Agentic tool calling Provide function definitions + user goal; model plans and executes tool calls in structured format Document extraction (granite-vision-4.1-4b) Submit PDF page image; extract tables, key figures, or form fields as structured JSON Safety classification (granite-guardian-4.1-8b) Pass user input or model output; receive structured risk assessment before serving response Sample prompt for an enterprise document processing deployment: You are building a multilingual document intelligence pipeline for a global financial institution. Using the granite-4.1-30b-instruct endpoint deployed in Microsoft Foundry, submit each incoming policy or regulatory document with the following system instruction: "You are a compliance analysis assistant. Review the document and extract: (1) all regulatory requirements described, (2) the entities to which each requirement applies, (3) any compliance deadlines mentioned, and (4) any penalties or consequences for non-compliance. Return the output as a structured JSON array with one entry per requirement." For documents that include scanned pages, first route them through granite-vision-4.1-4b to extract text and table content before passing to the 30B model for compliance analysis. Pass all user-facing outputs through granite-guardian-4.1-8b to screen for sensitive information before returning results. NVIDIA: Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 Model Specs Parameters / size: 31B total, ~3B activated per forward pass (Mamba2-Transformer Hybrid Mixture-of-Experts) Context length: 256,000 tokens Primary task: Video-audio-image-text-to-text (Multimodal understanding, reasoning, tool calling) Why it's interesting Multimodal input from a single efficient endpoint: Nemotron-3-Nano-Omni-30B-A3B-Reasoning supports video (up to 2 minutes), audio (up to 1 hour), images (RGB), and text—all from a single model deployed in Microsoft Foundry. Three variants are available in Foundry: full-precision BF16, FP8, and NVFP4. Paper: Nemotron Nano Omni technical report. Strong results across vision, document, video, and audio benchmarks: With reasoning mode enabled, the model scores 82.8 on MathVista-MINI (visual math reasoning), 67.04 on OCRBenchV2-EN (document OCR), 63.6 on Charxiv Reasoning (chart understanding), 72.2 on Video MME (video Q&A), 74.52 on Daily Omni (video+audio omnimodal understanding), and 89.39 on VoiceBench (speech instruction following). On OSWorld (GUI agent benchmark measuring autonomous computer use), it scores 47.4—a notable result for a model at the 3B active parameter scale. (Please see above model cards for further benchmark data) Mamba2-Transformer Hybrid MoE for efficient long-context inference: The model's layers alternate between Mamba2 state-space blocks (which process sequences with linear rather than quadratic cost) and standard Transformer attention blocks, combined with Mixture-of-Experts feedforward layers. Only ~3B parameters are activated per token despite the 31B total, making the 256K context window practically usable at lower compute cost than a comparably sized dense model. Word-level timestamps, JSON output, and tool calling for structured media workflows: The model produces word-level timestamps from audio, enabling precise transcript-to-timecode alignment for review and indexing workflows. Combined with JSON-structured output, chain-of-thought reasoning, and native tool calling, it can serve as an agentic step that ingests raw media (meeting recordings, M&E assets, training videos) and produces structured outputs for downstream systems without requiring separate transcription or OCR preprocessing stages. Try it Use Case Prompt Pattern Meeting intelligence Submit audio recording (up to 1 hr); extract transcript with word-level timestamps, action items, and decisions as structured JSON Video content analysis Submit video clip (up to 2 min) + query; retrieve timestamped summary of key events or spoken content Document + audio joint analysis Submit scanned document image alongside narrated walkthrough audio; extract and reconcile information from both modalities Multimodal tool calling Provide tool definitions + combined image/audio input; model reasons over content and executes structured tool calls Sample prompt for a media and compliance deployment: You are building a broadcast compliance review system for a media company. Using the Nemotron-3-Nano-Omni-30B-A3B-Reasoning-FP8 endpoint deployed in Microsoft Foundry, submit each recorded segment as video input with the following instruction: "Review this video segment and produce a compliance report as a JSON object with the following fields: transcript (full text with word-level timestamps), flagged_segments (array of objects with start_time, end_time, content, and reason for flagging), speaker_count (estimated number of distinct speakers), and compliance_summary (overall assessment). Flag any content that includes unverified factual claims, restricted product categories, or regulatory disclosures that may be incomplete." Use the word-level timestamps from the compliance report to route flagged segments directly to the editorial review queue with precise timecode references. Qwen: Qwen3.6-35B-A3B Model Specs Parameters / size: 35B total, 3B activated (Mixture-of-Experts) Context length: 262,144 tokens natively, extensible to 1,010,000 tokens Primary task: Image-text-to-text (agentic coding, reasoning, vision) Why it's interesting Agentic coding improvements over Qwen3.5-35B-A3B: Qwen3.6-35B-A3B scores 73.4 on SWE-bench Verified (vs. 70.0 for Qwen3.5-35B-A3B and 52.0 for Gemma 4 31B), 67.2 on SWE-bench Multilingual (vs. 60.3 and 51.7), and 49.5 on SWE-bench Pro (vs. 44.6 and 35.7). Terminal-Bench 2.0 reaches 51.5 (vs. 40.5 and 42.9). The update targets frontend workflows and repository-level reasoning specifically, areas where earlier Qwen3.5 iterations showed gaps. Blog post: Qwen3.6-35B-A3B. Hybrid architecture: Gated DeltaNet and Mixture-of-Experts: The model's 40 layers alternate between Gated DeltaNet blocks (a form of linear attention that avoids the quadratic cost of standard self-attention), Gated Attention blocks (using Grouped Query Attention with 16 query heads and 2 key-value heads), and Mixture-of-Experts (MoE) feedforward layers with 256 experts (8 routed + 1 shared active per token). Only 3B parameters are activated per forward pass, keeping inference cost comparable to a 3B dense model while retaining the capacity of a 35B model for knowledge and specialization. Thinking preservation across conversation turns: Qwen3.6 introduces an option to retain reasoning context from previous messages in multi-turn conversations. In prior models, chain-of-thought traces were stripped between turns, requiring the model to re-derive context it had already reasoned through. With thinking preservation enabled, iterative coding workflows—such as debugging across multiple exchanges—benefit from accumulated reasoning without repeating earlier analysis. Natively extensible to 1 million token context: The 262K native context is already among the largest in open models at this size, and the architecture supports extension to 1,010,000 tokens. On GPQA Diamond (science reasoning), Qwen3.6-35B-A3B scores 86.0—above both Gemma 4 31B (84.3) and Qwen3.5-27B (85.5)—while matching Gemma 4 31B on MMLU Pro (85.2) and LiveCodeBench v6 (80.4 vs. 80.0). Try it Use Case Prompt Pattern Repository-level code change Provide repository structure + task description; model plans file edits and outputs unified diff Multi-turn iterative debugging Enable thinking preservation; submit failing test + code across multiple turns; accumulate reasoning context Frontend code generation Provide design spec or screenshot + existing codebase context; generate component implementation Long-document reasoning Submit technical specification (up to 262K tokens); ask model to identify ambiguities or implementation gaps Sample prompt for a software engineering deployment: You are building an automated code review and implementation assistant for a platform engineering team. Using the Qwen3.6-35B-A3B endpoint deployed in Microsoft Foundry, enable thinking preservation for multi-turn sessions. In the first turn, submit the repository file tree and a GitHub issue describing a required API endpoint change. Prompt the model: "Review the repository structure and describe your implementation plan, including which files need to change and why." In the second turn, submit the relevant source files and prompt: "Based on your earlier plan, implement the changes and produce a unified diff." In the third turn, submit the test suite and prompt: "Write additional unit tests for the new endpoint, covering edge cases identified in your reasoning." The thinking preservation feature ensures the model carries forward its understanding of the codebase across all three turns without re-explaining context. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation. Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry601Views0likes0CommentsIntroducing OpenAI's newest chat model in Microsoft Foundry
OpenAI's GPT-5.5 Instant (or Chat-latest in the API) begins rolling out in Microsoft Foundry today as GPT-chat-latest. Built on GPT-5.4 and GPT-5.3-chat, the new model delivers measurable gains in factual accuracy, tool calling, and response efficiency. These improvements translate directly into more reliable production deployments. GPT-chat-latest is designed for the workflows builders are actually shipping: multi-turn assistants, agentic systems that orchestrate tools, and retrieval-grounded applications where precision and grounding matter as much as conversational quality. Why the name is changing In Microsoft Foundry, we are introducing GPT-chat-latest as the product name for this release, while the model continues to follow the existing Preview lifecycle and standard notice periods. We are also evaluating ways to simplify how customers access continuously updated models over time, but current behavior remains unchanged as that work continue Smarter, more factually reliable GPT-chat-latest closes the factuality gap from prior iterations with significant reductions in hallucinations, especially in domains where accuracy matters most. According to OpenAI, the new model produces 52.5% fewer hallucinations and reduces hallucinated claims by 37.3% on conversations previously flagged for factual errors when compared to GPT-5.3-chat. These gains extend beyond text. GPT-chat-latest shows improvements in visual reasoning, expert multimodal understanding, and STEM tasks, with measurable lifts across standard benchmarks: Benchmark GPT-5.3-chat GPT-chat-latest CharXiv-reasoning Scientific Chart Reasoning 75.0 81.6 MMMU-Pro Expert multimodal reasoning 69.2 76.0 GPQA PhD-level science questions 78.5 85.6 AIME 2025 Competition math 65.4 81.2 *Data shown comes from OpenAI’s testing” For builders shipping into regulated workloads such as clinical decision support, legal research, financial advisory, and technical analysis, these improvements raise the bar on the kinds of applications GPT-chat-latest can assist with. More efficient outputs GPT-chat-latest produces responses that may be more to-the-point without losing substance. The model may reduce verbosity and over formatting, ask fewer follow-up questions, and avoid cluttered output patterns that often require post-processing in production UIs. For builders, this can translate to two concrete benefits: lower output token costs at scale, and cleaner responses that drop into product surfaces with less downstream cleanup. In comparative testing from OpenAI, GPT-chat-latest produced roughly 25–30% fewer words than GPT-5.3-chat across a range of common prompts while preserving response quality, and in many cases improving it. Improving intelligence and tool calling GPT-chat-latest introduces measurable improvements in how the model interacts with tools, including better judgment about when and how to invoke them. The model produces more structured and context-aware tool invocation outputs, which is particularly relevant for workflows that rely on function calling, retrieval-augmented generation, and multi-step reasoning. Equally important, the model is better at deciding whether a tool is needed in the first place, reducing unnecessary tool calls in scenarios where it already has the information to answer directly. Improved search and context handling GPT-chat-latest includes targeted improvements to how the model retrieves, interprets, and synthesizes information when search is involved, with enhancements to query formulation, result ranking, and filtering, plus more grounded synthesis of retrieved content into final responses. These changes improve handling of ambiguous or underspecified queries and reduce noise in answers that depend on retrieved content. The model also makes better use of the context developers pass in, including system prompts, conversation history, retrieved documents, and structured data. Applications that maintain long-running state or stitch together multiple retrieval steps produce more coherent, context-aware outputs without developers having to over-engineer prompt scaffolding. Use Cases: When to choose the chat model Developers typically choose a chat-optimized model like GPT-5.5-chat when the application needs to sustain multi-turn conversations while reliably following instructions and coordinating external tools. This is a fit for assistants and agentic workflows where the model must interpret user intent over time, decide when to retrieve additional context, and produce structured outputs for downstream systems rather than just generate free-form text. Customer support and contact centers: virtual agents that maintain conversational context across a case, retrieve policy or product documentation via search, and hand off to a ticketing or CRM system through tool calls when escalation is needed. Retail and e-commerce: shopping and service assistants that clarify preferences over multiple turns, reference catalogs and policies via retrieval, and generate structured actions such as returns, exchanges, and order lookups through integrated tools. Manufacturing and field service: technician-facing assistants that combine conversational guidance with retrieval of manuals and work instructions, plus structured task creation in maintenance systems. Use GPT-chat-latest Use GPT-5.5 Reasoning Multi-turn assistants and customer-facing chat experiences Harder problems that benefit from more deliberate, step-by-step thinking Agentic workflows that coordinate tools (search, retrieval, ticketing, CRM) and benefit from structured tool outputs Complex analysis, planning, or decision support where correctness matters more than conversational flow Interactive experiences where you want quick back-and-forth clarification and task completion Tasks involving multi-constraint reasoning (policy interpretation, detailed requirements, long-horizon plans) RAG-based apps where the model must decide when to retrieve and then synthesize grounded answers Offline or low-tool scenarios where the main value is deeper reasoning over provided context Pricing Model Input ($/1M tokens) Cached input ($/1M tokens) Output ($/1M tokens) GPT-chat-latest $5 $0.50 $30 Responsible AI in Microsoft Foundry At Microsoft, our mission to empower people and organizations remains constant. In the age of AI, trust is foundational to adoption, and earning that trust requires a commitment to transparency, safety, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy models responsibly in production environments, aligned with Microsoft's Responsible AI principles. Getting started GPT-chat-latest is rolling out in Microsoft Foundry today.6.3KViews1like0Comments