azure ai
147 TopicsNow in Foundry: Cohere Transcribe, Nanbeige 4.1-3B, and Octen Embedding
This week's Model Mondays edition spans three distinct layers of the AI application stack: Cohere's cohere-transcribe, a 2B Automatic Speech Recognition (ASR) model that ranks first on the Open ASR Leaderboard across 14 languages; Nanbeige's Nanbeige4.1-3B, a compact 3B reasoning model that outperforms models ten times its size on coding, math, and deep-search benchmarks; and Octen's Octen-Embedding-0.6B, a lightweight text embedding model that achieves strong retrieval scores across 100+ languages and industry-specific domains. Together, these three models illustrate how developers can build full AI pipelines—from audio ingestion to language reasoning to semantic retrieval—entirely with open-source models deployed through Microsoft Foundry. Each operates in a different modality and fills a distinct architectural role, making this week's selection especially well-suited for teams assembling production-grade systems across speech, text, and search. Models of the week Cohere's cohere-transcribe-03-2026 Model Specs Parameters / size: 2B Primary task: Automatic Speech Recognition (audio-to-text) Why it's interesting Top-ranked on the Open ASR Leaderboard: cohere-transcribe-03-2026 achieves a 5.42% average Word Error Rate (WER) across 8 English benchmark datasets as of March 26, 2026—placing it first among open models. It reaches 1.25% WER on LibriSpeech Clean and 8.15% on AMI (meeting transcription), demonstrating consistent accuracy across both clean speech and real-world, multi-speaker environments. Benchmarks: Open ASR Leaderboard. 14 languages with a dedicated encoder-decoder architecture: The model uses a large Conformer encoder for acoustic representation extraction paired with a lightweight Transformer decoder for token generation, trained from scratch on 14 languages covering European, East Asian (Chinese Mandarin, Japanese, Korean, Vietnamese), and Arabic. Unlike general-purpose models adapted for ASR, this dedicated architecture makes it efficient without sacrificing accuracy. Long-form audio with automatic chunking: Audio longer than 35 seconds is automatically split into overlapping chunks and reassembled into a coherent transcript—no manual preprocessing required. Batched inference, punctuation control, and per-language configuration are all supported through the standard API. Try it Click on the window above, upload an audio file, and watch how quickly the model transcribes it for you. Or click the link to experiment with the Cohere Transcribe Space and record audio directly from your device. Use Case Prompt Pattern Meeting transcription Submit recorded audio with language tag; retrieve timestamped transcript per speaker turn Call center quality review Batch-process customer call recordings, extract transcript, pass to classification model Medical documentation Transcribe clinical encounters; feed transcript into summarization or structured note pipeline Multilingual content indexing Process podcasts or video audio in any of 14 supported languages; store as searchable text Sample prompt for a legal services deployment: You are building a contract negotiation assistant. A client submits a recorded audio of a 45-minute supplier negotiation call. Using the cohere-transcribe-03-2026 endpoint deployed in Microsoft Foundry, transcribe the call with punctuation enabled for the English audio. Once the transcript is available, pass it to a downstream language model with the following instruction: "Identify all pricing commitments, delivery deadlines, and liability clauses mentioned in this negotiation transcript. For each, note the speaker's position (client or supplier) and flag any terms that appear ambiguous or require legal review." Nanbeige's Nanbeige4.1-3B Model Specs Parameters / size: 3B Context length: 131,072 tokens Primary task: Text generation (reasoning, coding, tool use, deep search) Why it's interesting Reasoning performance that exceeds its size class: Nanbeige4.1-3B scores 76.9 on LiveCodeBench-V6, these results suggest that targeted post-training using Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on a focused dataset can yield improvements that scale-based approaches cannot replicate at equivalent parameter counts. Read the technical report: https://huggingface.co/papers/2602.13367. Strong preference alignment at the 3B scale: On Arena-Hard-v2, Nanbeige4.1-3B scores 73.2, compared to 56.0 for Qwen3-32B and 60.2 for Qwen3-30B-A3B—both significantly larger models. This indicates that the model's outputs consistently match human preference for response quality and helpfulness, not just accuracy on structured tasks. Deep-search capability previously absent from small general models: On xBench-DeepSearch-2505, Nanbeige4.1-3B scores 75—matching search-specialized small agents. The model can sustain complex agentic tasks involving more than 500 sequential tool invocations, a capability gap that previously required either specialized search agents or significantly larger models. Native tool-use support: The model's chat template and generation pipeline natively support tool call formatting, making it straightforward to connect to external APIs and build multi-step agentic workflows without additional scaffolding. Try it Use Case Prompt Pattern Code review and fix Provide failing test + stack trace; ask model to diagnose root cause and write corrected implementation Competition-style math Submit problem as structured prompt; use temperature 0.6, top-p 0.95 for consistent reasoning steps Agentic task execution Provide tool definitions as JSON + goal; let model plan and execute tool calls sequentially Long-document Q&A Pass full document (up to 131K tokens) with targeted factual questions; extract structured answers Sample prompt for a software engineering deployment: You are automating pull request review for a backend engineering team. Using the Nanbeige4.1-3B endpoint deployed in Microsoft Foundry, provide the model with a unified diff of a proposed code change and the following system instruction: "You are a senior software engineer reviewing a pull request. For each modified function: (1) summarize what the change does, (2) identify any edge cases that are not handled, (3) flag any security or performance regressions relative to the original, and (4) suggest a specific improvement if one is warranted. Format your output as a structured list per function." Octen's Octen-Embedding-0.6B Model Specs Parameters / size: 0.6B Context length: 32,768 tokens Primary task: Text embeddings (semantic search, retrieval, similarity) Why it's interesting Retrieval performance above larger proprietary models at 0.6B: On the RTEB (Retrieval Text Embedding Benchmark) public leaderboard, Octen-Embedding-0.6B achieves a mean task score of 0.7241—above voyage-3.5 (0.7139), Cohere-embed-v4.0 (0.6534), and text-embedding-3-large (0.6110), despite being a fraction of their parameter count. The model is fine-tuned from Qwen3-Embedding-0.6B via Low-Rank Adaptation (LoRA), demonstrating that targeted fine-tuning on retrieval-specific data can close the gap with larger embedding models. Vertical domain coverage across legal, finance, healthcare, and code: Octen-Embedding-0.6B was trained with explicit coverage of domain-specific retrieval scenarios—legal document matching, financial report Q&A, clinical dialogue retrieval, and code search including SQL. This makes it suitable for regulated-industry applications where generic embedding models tend to underperform on specialized terminology. 32,768-token context for long-document retrieval: The extended context window supports encoding entire legal contracts, earnings reports, or clinical case notes as single embeddings—removing the need to chunk long documents and re-aggregate scores at query time, which can introduce ranking errors. 100+ language support with cross-lingual retrieval: The model handles multilingual and cross-lingual retrieval natively, with strong coverage across languages including English, Chinese, and other major languages via its Qwen3-based architecture—practical for global enterprise applications that span multiple languages. Use Case Prompt Pattern Semantic search Encode user query and document corpus; rank documents by cosine similarity to query embedding Legal precedent retrieval Embed case briefs and query with legal question; retrieve most semantically relevant precedents Cross-lingual document search Encode multilingual document set; submit query in any supported language for cross-lingual retrieval Financial Q&A pipeline Embed earnings reports or filings; retrieve relevant passages to ground downstream language model responses Sample prompt for a global enterprise knowledge base deployment: You are building a clinical decision support tool. Using the Octen-Embedding-0.6B endpoint deployed in Microsoft Foundry, embed a corpus of 10,000 clinical case notes at ingestion time and store the resulting 1024-dimensional vectors in a vector database. At query time, encode an incoming patient presentation summary and retrieve the 5 most semantically similar historical cases. Pass the retrieved cases and the current presentation to a language model with the following instruction: "Based on these five similar cases and their documented outcomes, summarize the most common treatment approaches and flag any cases where the outcome differed significantly from the initial prognosis." Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry273Views1like0CommentsTurn Enterprise Knowledge into Answers with Copilot Studio and Azure AI Search
From the Field: Why This Integration Works As an experienced AI Cloud Solution Architect working in Greater China Region (GCR), I’ve seen one emerging pattern that delivers quick wins for some of my customers: combining Microsoft Copilot Studio with an existing Azure AI Search index. Teams choose this approach because it delivers two outcomes immediately: business users get grounded, reliable answers, and enterprises avoid re-building pipelines or re-platforming knowledge stores. This guide shows exactly how to connect Copilot Studio to an Azure AI Search index that is already live, so your copilot can answer confidently using your enterprise documents. What We Assume Is Already Ready To stay focused on the integration step, we assume: You have an Azure AI Search service deployed You have an index containing vectorized content (manuals, PDFs, policies, FAQs) Your platform/data team already handled ingestion, embeddings, and indexing In short, your Azure AI Search endpoint and admin key are ready, and the index already contains chunked content with embeddings. Step 1 - Collect Your Azure AI Search Connection Details From the Azure AI Search resource: Endpoint URL Azure AI Search → Overview → Url: https://<your-search-service>.search.windows.net Admin Key Azure AI Search → Keys Use either the primary or secondary key. Governance tip: For production, rotate keys regularly and use managed identities when possible. Step 2 - Add Azure AI Search as Knowledge Inside Copilot Studio Open your Copilot Studio agent Go to the Knowledge tab Select Add knowledge, choose Azure AI Search Provide: Endpoint URL Admin key Create or select the connection Choose your existing index from the dropdown Select Add to agent Step 3 - Test a Grounded Response Open the Test copilot pane and ask a question your indexed content can answer, such as: “What are the different licensing options available for Power Platform?” Verify that: The Activity Map shows Azure AI Search being invoked The answer reflects the correct document in your index Citations or references appear where applicable Conclusion Business value: You can activate grounded, explainable answers in Copilot Studio immediately by reusing your existing Azure AI Search index - no re-platforming, no new pipelines. Team model: Data/Platform teams own ingestion, enrichment, and vectorization. Business teams build and refine the copilot experience in Copilot Studio. Scale and governance: All components stay inside Azure, with enterprise-grade security, RBAC, and operational monitoring, while enabling low‑code agility for makers. For the full end-to-end lab (storage setup, embeddings, index creation), see: 🔗 https://github.com/Azure/Copilot-Studio-and-Azure (Lab 1.4). Acknowledgements This tutorial builds on foundational work by my EMEA colleague Pablo Carceller, whose GitHub repo on Copilot Studio and Azure has helped teams worldwide accelerate real customer implementations. 👉 GitHub - Copilot Studio and Azure: https://github.com/Azure/Copilot-Studio-and-Azure I would also like to thank the broader Cloud Accelerate Factory GCR team for their contributions, insights, and active collaboration in validating this pattern across customer engagements. Special appreciation to our AI Architects Dr. Longyu Qi, Jian (Jason) Shao, Lei (Leo) Ma, and Ethan Tseng, as well as our PM partners Yunxi (Rayne) Jin and Emma Wang, whose feedback and field experiences helped shape and refine this guide. Image credits: demo visuals adapted from materials by Pablo Carceller (GitHub Lab 1.4).268Views0likes0CommentsMicrosoft Foundry: Unlock Adaptive, Personalized Agents with User-Scoped Persistent Memory
From Knowledgeable to Personalized: Why Memory Matters Most AI agents today are knowledgeable — they ground responses in enterprise data sources and rely on short‑term, session‑based memory to maintain conversational coherence. This works well within a single interaction. But once the session ends, the context disappears. The agent starts fresh, unable to recall prior interactions, user preferences, or previously established context. In reality, enterprise users don’t interact with agents exclusively in one‑off sessions. Conversations can span days, weeks, evolving across multiple interactions rather than isolated sessions. Without a way to persist and safely reuse relevant context across interactions, AI agents remain efficient in the short term be being stateful within a session, but lose continuity over time due to their statelessness across sessions. Bridging this gap between short-term efficiency and long‑term adaptation exposes a deeper challenge. Persisting memory across sessions is not just a technical decision; in enterprise environments, it introduces legitimate concerns around privacy, data isolation, governance, and compliance — especially when multiple users interact with the same agent. What seems like an obvious next step quickly becomes a complex architectural problem, requiring organizations to balance the ability for agents to learn and adapt over time with the need to preserve trust, enforce isolation boundaries, and meet enterprise compliance requirements. In this post, I’ll walk through a practical design pattern for user‑scoped persistent memory, including a reference architecture and a deployable sample implementation that demonstrates how to apply this pattern in a real enterprise setting while preserving isolation, governance, and compliance. The Challenge of Persistent Memory in Enterprise AI Agents Extending memory beyond a single session seems like a natural way to make AI agents more adaptive. Retaining relevant context over time — such as preferences, prior decisions, or recurring patterns — would allow an agent to progressively tailor its behavior to each user, moving from simple responsiveness toward genuine adaptation. In enterprise environments, however, persistence introduces a different class of risk. Storing and reusing user context across interactions raises questions of privacy, data isolation, governance, and compliance — particularly when multiple users interact with shared systems. Without clear ownership and isolation boundaries, naïvely persisted memory can lead to cross‑user data leakage, policy violations, or unclear retention guarantees. As a result, many systems default to ephemeral, session‑only memory. This approach prioritizes safety and simplicity — but does so at the cost of long‑term personalization and continuity. The challenge, then, is not whether agents should remember, but how memory can be introduced without violating enterprise trust boundaries. Persistent Memory: Trade‑offs Between Abstraction and Control As AI agents evolve toward more adaptive behavior, several approaches to agent memory are emerging across the ecosystem. Each reflects a different set of trade-offs between abstraction, flexibility, and control — making it useful to briefly acknowledge these patterns before introducing the design presented here. Microsoft Foundry Agent Service includes a built‑in memory capability (currently in Preview) that enables agents to retain context beyond a single interaction. This approach integrates tightly with the Foundry runtime and abstracts much of the underlying memory management, making it well suited for scenarios that align closely with the managed agent lifecycle. Another notable approach combines Mem0 with Azure AI Search, where memory entries are stored and retrieved through vector search. In this model, memory is treated as an embedding‑centric store that emphasizes semantic recall and relevance. Mem0 is intentionally opinionated, defining how memory is structured, summarized, and retrieved to optimize for ease of use and rapid iteration. Both approaches represent meaningful progress. At the same time, some enterprises require an approach where user memory is explicitly owned, scoped, and governed within their existing data architecture — rather than implicitly managed by an agent framework or memory library. These requirements often stem from stricter expectations around data isolation, compliance, and long‑term control. User-Scoped Persistent Memory with Azure Cosmos DB The solution presented in this post provides a practical reference implementation for organizations that require explicit control over how user memory is stored, scoped, and governed. Rather than embedding long‑term memory implicitly within the agent runtime, this design models memory as a first‑class system component built on Azure Cosmos DB. At a high level, the architecture introduces user‑scoped persistent memory: a durable memory layer in which each user’s context is isolated and managed independently. Persistent memory is stored in Azure Cosmos DB containers partitioned by user identity and consists of curated, long‑lived signals — such as preferences, recurring intent, or summarized outcomes from prior interactions — rather than raw conversational transcripts. This keeps memory intentional, auditable, and easy to evolve over time. Short‑term, in‑session conversation state remains managed by Microsoft Foundry on the server side through its built‑in conversation and thread model. By separating ephemeral session context from durable user memory, the system preserves conversational coherence while avoiding uncontrolled accumulation of long‑term state within the agent runtime. This design enables continuity and personalization across sessions while deliberately avoiding the risks associated with shared or global memory models, including cross‑user data leakage, unclear ownership, and unintended reuse of context. Azure Cosmos DB provides enterprises with direct control over memory isolation, data residency, retention policies, and operational characteristics such as consistency, availability, and scale. In this architecture, knowledge grounding and memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data sources. User‑scoped persistent memory ensures relevance by tailoring interactions to the individual user over time. Together, they enable trustworthy, adaptive AI agents that improve with use — without compromising enterprise boundaries. Architecture Components and Responsibilities Identity and User Scoping Microsoft Entra ID (App Registrations) — provides the frontend a client ID and tenant ID so the Microsoft Authentication Library (MSAL) can authenticate users via browser redirect. The oid (Object ID) claim from the ID token is used as the user identifier throughout the system. Agent Runtime and Orchestration Microsoft Foundry — serves as the unified AI platform for hosting models, managing agents, and maintaining conversation state. Foundry manages in‑session and thread‑level memory on the server side, preserving conversational continuity while keeping ephemeral context separate from long‑term user memory. Backend Agent Service — implements the AI agent using Microsoft Foundry’s agent and conversation APIs. The agent is responsible for reasoning, tool‑calling decisions, and response generation, delegating memory and search operations to external MCP servers. Memory and Knowledge Services MCP‑Memory — MCP server that hosts tools for extracting structured memory signals from conversations, generating embeddings, and persisting user‑scoped memories. Memories are written to and retrieved from Azure Cosmos DB, enforcing strict per‑user isolation. MCP‑Search — MCP server exposing tools for querying enterprise knowledge sources via Azure AI Search. This separation ensures that knowledge grounding and memory retrieval remain distinct concerns. Azure Cosmos DB for NoSQL — provides the durable, serverless document store for user‑scoped persistent memory. Memory containers are partitioned by user ID, enabling isolation, auditable access, configurable retention policies, and predictable scalability. Vector search is used to support semantic recall over stored memory entries. Azure AI Search — supplies hybrid retrieval (keyword and vector) with semantic reranking over the enterprise knowledge index. An integrated vectorizer backed by an embedding model is used for query‑time vectorization. Models text‑embedding‑3‑large — used for generating vector embeddings for both user‑scoped memories and enterprise knowledge search. gpt‑5‑mini — used for lightweight analysis tasks, such as extracting structured memory facts from conversational context. gpt‑5.1 — powers the AI agent, handling multi‑turn conversations, tool invocation, and response synthesis. Application and Hosting Infrastructure Frontend Web Application — a React‑based web UI that handles user authentication and presents a conversational chat interface. Azure Container Apps Environment — provides a shared execution environment for all services, including networking, scaling, and observability. Azure Container Apps — hosts the frontend, backend agent service, and MCP servers as independently scalable containers. Azure Container Registry — stores container images for all application components. Try It Yourself Demonstration of user‑scoped persistent memory across sessions. To make these concepts concrete, I’ve published a working reference implementation that demonstrates the architecture and patterns described above. The complete solution is available in the Agent-Memory GitHub repository. The repository README includes prerequisites, environment setup notes, and configuration details. Start by cloning the repository and moving into the project directory: git clone https://github.com/mardianto-msft/azure-agent-memory.git cd azure-agent-memory Next, sign in to Azure using the Azure CLI: az login Then authenticate the Azure Developer CLI: azd auth login Once authenticated, deploy the solution: azd up After deployment is complete, sign in using the provided demo users and interact with the agent across multiple sessions. Each user’s preferences and prior context are retained independently, the interaction continues seamlessly after signing out and returning later, and user context remains fully isolated with no cross‑identity leakage. The solution also includes a knowledge index initialized with selected Microsoft Outlook Help documentation, which the agent uses for knowledge grounding. This index can be easily replaced or extended with your own publicly accessible URLs to adapt the solution to different domains. Looking Ahead: Personalized Memory as a Foundation for Adaptive Agents As enterprise AI agents evolve, many teams are looking beyond larger models and improved retrieval toward human‑centered personalization at scale — building agents that adapt to individual users while operating within clearly defined trust boundaries. User‑scoped persistent memory enables this shift. By treating memory as a first‑class, user‑owned component, agents can maintain continuity across sessions while preserving isolation, governance, and compliance. Personalization becomes an intentional design choice, aligning with Microsoft’s human‑centered approach to AI, where users retain control over how systems adapt to them. This solution demonstrates how knowledge grounding and personalized memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data. Personalized memory ensures relevance by tailoring interactions to the individual user. Together, they enable context‑aware, adaptive, and personalized agents — without compromising enterprise trust. Finally, this solution is intentionally presented as a reference design pattern, not a prescriptive architecture. It offers a practical starting point for enterprises designing adaptive, personalized agents, illustrating how user‑scoped memory can be modeled, governed, and integrated as a foundational capability for scalable enterprise AI.422Views1like1CommentVector Drift in Azure AI Search: Three Hidden Reasons Your RAG Accuracy Degrades After Deployment
What Is Vector Drift? Vector drift occurs when embeddings stored in a vector index no longer accurately represent the semantic intent of incoming queries. Because vector similarity search depends on relative semantic positioning, even small changes in models, data distribution, or preprocessing logic can significantly affect retrieval quality over time. Unlike schema drift or data corruption, vector drift is subtle: The system continues to function Queries return results But relevance steadily declines Cause 1: Embedding Model Version Mismatch What Happens Documents are indexed using one embedding model, while query embeddings are generated using another. This typically happens due to: Model upgrades Shared Azure OpenAI resources across teams Inconsistent configuration between environments Why This Matters Embeddings generated by different models: Exist in different vector spaces Are not mathematically comparable Produce misleading similarity scores As a result, documents that were previously relevant may no longer rank correctly. Recommended Practice A single vector index should be bound to one embedding model and one dimension size for its entire lifecycle. If the embedding model changes, the index must be fully re-embedded and rebuilt. Cause 2: Incremental Content Updates Without Re-Embedding What Happens New documents are continuously added to the index, while existing embeddings remain unchanged. Over time, new content introduces: Updated terminology Policy changes New product or domain concepts Because semantic meaning is relative, the vector space shifts—but older vectors do not. Observable Impact Recently indexed documents dominate retrieval results Older but still valid content becomes harder to retrieve Recall degrades without obvious system errors Practical Guidance Treat embeddings as living assets, not static artifacts: Schedule periodic re-embedding for stable corpora Re-embed high-impact or frequently accessed documents Trigger re-embedding when domain vocabulary changes meaningfully Declining similarity scores or reduced citation coverage are often early signals of drift. Cause 3: Inconsistent Chunking Strategies What Happens Chunk size, overlap, or parsing logic is adjusted over time, but previously indexed content is not updated. The index ends up containing chunks created using different strategies. Why This Causes Drift Different chunking strategies produce: Different semantic density Different contextual boundaries Different retrieval behavior This inconsistency reduces ranking stability and makes retrieval outcomes unpredictable. Governance Recommendation Chunking strategy should be treated as part of the index contract: Use one chunking strategy per index Store chunk metadata (for example, chunk_version) Rebuild the index when chunking logic changes Design Principles Versioned embedding deployments Scheduled or event-driven re-embedding pipelines Standardized chunking strategy Retrieval quality observability Prompt and response evaluation Key Takeaways Vector drift is an architectural concern, not a service defect It emerges from model changes, evolving data, and preprocessing inconsistencies Long-lived RAG systems require embedding lifecycle management Azure AI Search provides the controls needed to mitigate drift effectively Conclusion Vector drift is an expected characteristic of production RAG systems. Teams that proactively manage embedding models, chunking strategies, and retrieval observability can maintain reliable relevance as their data and usage evolve. Recognizing and addressing vector drift is essential to building and operating robust AI solutions on Azure. Further Reading The following Microsoft resources provide additional guidance on vector search, embeddings, and production-grade RAG architectures on Azure. Azure AI Search – Vector Search Overview - https://learn.microsoft.com/azure/search/vector-search-overview Azure OpenAI – Embeddings Concepts - https://learn.microsoft.com/en-us/azure/ai-foundry/openai/how-to/embeddings?view=foundry-classic&tabs=csharp Retrieval-Augmented Generation (RAG) Pattern on Azure - https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview?tabs=videos Azure Monitor – Observability Overview - https://learn.microsoft.com/azure/azure-monitor/overview313Views0likes0CommentsIntroducing OpenAI’s GPT-image-1.5 in Microsoft Foundry
Developers building with visual AI can often run into the same frustrations: images that drift from the prompt, inconsistent object placement, text that renders unpredictably, and editing workflows that break when iterating on a single asset. That’s why we are excited to announce OpenAI's GPT Image 1.5 is now generally available in Microsoft Foundry. This model can bring sharper image fidelity, stronger prompt alignment, and faster image generation that supports iterative workflows. Starting today, customers can request access to the model and start building in the Foundry platform. Meet GPT Image 1.5 AI driven image generation began with early models like OpenAI's DALL-E, which introduced the ability to transform text prompts into visuals. Since then, image generation models have been evolving to enhance multimodal AI across industries. GPT Image 1.5 represents continuous improvement in enterprise-grade image generation. Building on the success of GPT Image 1 and GPT Image 1 mini, these enhanced models introduce advanced capabilities that cater to both creative and operational needs. The new image model offer: Text-to-image: Stronger instruction following and highly precise editing. Image-to-image: Transform existing images to iteratively refine specific regions Improved visual fidelity: More detailed scenes and realistic rendering. Accelerated creation times: Up to 4x faster generation speed. Enterprise integration: Deploy and scale securely in Microsoft Foundry. GPT Image 1.5 delivers stronger image preservation and editing capabilities, maintaining critical details like facial likeness, lighting, composition, and color tone across iterative changes. You’ll see more consistent preservation of branded logos and key visuals, making it especially powerful for marketing, brand design, and ecommerce workflows—from graphics and logo creation to generating full product catalogs (variants, environments, and angles) from a single source image. Benchmarks Based on an internal Microsoft dataset, GPT Image 1.5 performs higher than other image generation models in prompt alignment and infographics tasks. It focuses on making clear, strong edits – performing best on single-turn modification, delivering the higher visual quality in both single and multi-turn settings. The following results were found across image generation and editing: Text to image Prompt alignment Diagram / Flowchart GPT Image 1.5 91.2% 96.9% GPT Image 1 87.3% 90.0% Qwen Image 83.9% 33.9% Nano Banana Pro 87.9% 95.3% Image editing Evaluation Aspect Modification Preservation Visual Quality Face Preservation Metrics BinaryEval SC (semantic) DINO (Visual) BinaryEval AuraFace Single-turn GPT image 1 99.2% 51.0% 0.14 79.5% 0.30 Qwen image 81.9% 63.9% 0.44 76.0% 0.85 GPT Image 1.5 100% 56.77% 0.14 89.96% 0.39 Multi-turn GPT Image 1 93.5% 54.7% 0.10 82.8% 0.24 Qwen image 77.3% 68.2% 0.43 77.6% 0.63 GPT image 1.5 92.49% 60.55% 0.15 89.46% 0.28 Using GPT Image 1.5 across industries Whether you’re creating immersive visuals for campaigns, accelerating UI and product design, or producing assets for interactive learning GPT Image 1.5 gives modern enterprises the flexibility and scalability they need. Image models can allow teams to drive deeper engagement through compelling visuals, speed up design cycles for apps, websites, and marketing initiatives, and support inclusivity by generating accessible, high‑quality content for diverse audiences. Watch how Foundry enables developers to iterate with multimodal AI across Black Forest Labs, OpenAI, and more: Microsoft Foundry empowers organizations to deploy these capabilities at scale, integrating image generation seamlessly into enterprise workflows. Explore the use of AI image generation here across industries like: Retail: Generate product imagery for catalogs, e-commerce listings, and personalized shopping experiences. Marketing: Create campaign visuals and social media graphics. Education: Develop interactive learning materials or visual aids. Entertainment: Edit storyboards, character designs, and dynamic scenes for films and games. UI/UX: Accelerate design workflows for apps and websites. Microsoft Foundry provides security and compliance with built-in content safety filters, role-based access, network isolation, and Azure Monitor logging. Integrated governance via Azure Policy, Purview, and Sentinel gives teams real-time visibility and control, so privacy and safety are embedded in every deployment. Learn more about responsible AI at Microsoft. Pricing Model Pricing (per 1M tokens) - Global GPT-image-1.5 Input Tokens: $8 Cached Input Tokens: $2 Output Tokens: $32 Cost efficiency improves as well: image inputs and outputs are now cheaper compared to GPT Image 1, enabling organizations to generate and iterate on more creative assets within the same budget. For detailed pricing, refer here. Getting started Learn more about image generation, explore code samples, and read about responsible AI protections here. Try GPT Image 1.5 in Microsoft Foundry and start building multimodal experiences today. Whether you’re designing educational materials, crafting visual narratives, or accelerating UI workflows, these models deliver the flexibility and performance your organization needs.8.7KViews2likes1CommentHow Do We Know AI Isn’t Lying? The Art of Evaluating LLMs in RAG Systems
🔍 1. Why Evaluating LLM Responses is Hard In classical programming, correctness is binary. Input Expected Result 2 + 2 4 ✔ Correct 2 + 2 5 ✘ Wrong Software is deterministic — same input → same output. LLMs are probabilistic. They generate one of many valid word combinations, like forming sentences from multiple possible synonyms and sentence structures. Example: Prompt: "Explain gravity like I'm 10" Possible responses: Response A Response B Gravity is a force that pulls everything to Earth. Gravity bends space-time causing objects to attract. Both are correct. Which is better? Depends on audience. So evaluation needs to look beyond text similarity. We must check: ✔ Is the answer meaningful? ✔ Is it correct? ✔ Is it easy to understand? ✔ Does it follow prompt intent? Testing LLMs is like grading essays — not checking numeric outputs. 🧠 2. Why RAG Evaluation is Even Harder RAG introduces an additional layer — retrieval. The model no longer answers from memory; it must first read context, then summarise it. Evaluation now has multi-dimensions: Evaluation Layer What we must verify Retrieval Did we fetch the right documents? Understanding Did the model interpret context correctly? Grounding Is the answer based on retrieved data? Generation Quality Is final response complete & clear? A simple story makes this intuitive: Teacher asks student to explain Photosynthesis. Student goes to library → selects a book → reads → writes explanation. We must evaluate: Did they pick the right book? → Retrieval Did they understand the topic? → Reasoning Did they copy facts correctly without inventing? → Faithfulness Is written explanation clear enough for another child to learn from? → Answer Quality One failure → total failure. 🧩 3. Two Types of Evaluation 🔹 Intrinsic Evaluation — Quality of the Response Itself Here we judge the answer, ignoring real-world impact. We check: ✔ Grammar & coherence ✔ Completeness of explanation ✔ No hallucination ✔ Logic flow & clarity ✔ Semantic correctness This is similar to checking how well the essay is written. Even if the result did not solve the real problem, the answer could still look good — that’s why intrinsic alone is not enough. 🔹 Extrinsic Evaluation — Did It Achieve the Goal? This measures task success. If a customer support bot writes a beautifully worded paragraph, but the user still doesn’t get their refund — it failed extrinsically. Examples: System Type Extrinsic Goal Banking RAG Bot Did user get correct KYC procedure? Medical RAG Was advice safe & factual? Legal search assistant Did it return the right section of the law? Technical summariser Did summary capture key meaning? Intrinsic = writing quality. Extrinsic = impact quality. A production-grade RAG system must satisfy both. 📏 4. Core RAG Evaluation Metrics (Explained with Very Simple Analogies) Metric Meaning Analogy Relevance Does answer match question? Ask who invented C++? → model talks about Java ❌ Faithfulness No invented facts Book says started 2004, response says 1990 ❌ Groundedness Answer traceable to sources Claims facts that don’t exist in context ❌ Completeness Covers all parts of question User asks Windows vs Linux → only explains Windows Context Recall / Precision Correct docs retrieved & used Student opens wrong chapter Hallucination Rate Degree of made-up info “Taj Mahal is in London” 😱 Semantic Similarity Meaning-level match “Engine died” = “Car stopped running” 💡 Good evaluation doesn’t check exact wording. It checks meaning + truth + usefulness. 🛠 5. Tools for RAG Evaluation 🔹 1. RAGAS — Foundation for RAG Scoring RAGAS evaluates responses based on: ✔ Faithfulness ✔ Relevance ✔ Context recall ✔ Answer similarity Think of RAGAS as a teacher grading with a rubric. It reads both answer + source documents, then scores based on truthfulness & alignment. 🔹 2. LangChain Evaluators LangChain offers multiple evaluation types: Type What it checks String or regex Basic keyword presence Embedding based Meaning similarity, not text match LLM-as-a-Judge AI evaluates AI (deep reasoning) LangChain = testing toolbox RAGAS = grading framework Together they form a complete QA ecosystem. 🔹 3. PyTest + CI for Automated LLM Testing Instead of manually validating outputs, we automate: Feed preset questions to RAG Capture answers Run RAGAS/LangChain scoring Fail test if hallucination > threshold This brings AI closer to software-engineering discipline. RAG systems stop being experiments — they become testable, trackable, production-grade products. 🚀 6. The Future: LLM-as-a-Judge The future of evaluation is simple: LLMs will evaluate other LLMs. One model writes an answer. Another model checks: ✔ Was it truthful? ✔ Was it relevant? ✔ Did it follow context? This enables: Benefit Why it matters Scalable evaluation No humans needed for every query Continuous improvement Model learns from mistakes Real-time scoring Detect errors before user sees them This is like autopilot for AI systems — not only navigating, but self-correcting mid-flight. And that is where enterprise AI is headed. 🎯 Final Summary Evaluating LLM responses is not checking if strings match. It is checking if the machine: ✔ Understood the question ✔ Retrieved relevant knowledge ✔ Avoided hallucination ✔ Provided complete, meaningful reasoning ✔ Grounded answer in real source text RAG evaluation demands multi-layer validation — retrieval, reasoning, grounding, semantics, safety. Frameworks like RAGAS + LangChain evaluators + PyTest pipelines are shaping the discipline of measurable, reliable AI — pushing LLM-powered RAG from cool demo → trustworthy enterprise intelligence. Useful Resources What is Retrieval-Augmented Generation (RAG) : https://azure.microsoft.com/en-in/resources/cloud-computing-dictionary/what-is-retrieval-augmented-generation-rag/ Retrieval-Augmented Generation concepts (Azure AI) : https://learn.microsoft.com/en-us/azure/ai-services/content-understanding/concepts/retrieval-augmented-generation RAG with Azure AI Search – Overview : https://learn.microsoft.com/en-us/azure/search/retrieval-augmented-generation-overview Evaluate Generative AI Applications (Microsoft Learn – Learning Path) : https://learn.microsoft.com/en-us/training/paths/evaluate-generative-ai-apps/ Evaluate Generative AI Models in Microsoft Foundry Portal : https://learn.microsoft.com/en-us/training/modules/evaluate-models-azure-ai-studio/ RAG Evaluation Metrics (Relevance, Groundedness, Faithfulness) : https://learn.microsoft.com/en-us/azure/ai-foundry/concepts/evaluation-evaluators/rag-evaluators RAGAS – Evaluation Framework for RAG Systems : https://docs.ragas.io/277Views0likes0CommentsHow Veris AI and Lume Security built a self-improving AI agent with Microsoft Foundry
Introduction AI agents are slowly moving from demos into production, where real users, messy systems, and long-tail edge cases lead to new unseen failure modes. Production monitoring can surface issues, but converting those into reliable improvements is slow and manual, bottlenecked by the low volume of repeatable failures, engineering time, and risk of regression on previously working cases. We show how a high-fidelity simulation environment built by Veris AI on Microsoft Azure can expand production failures into families of realistic scenarios, generating enough targeted data to optimize agent behavior through automated context engineering and reinforcement learning, all while not regressing on any previous issue. We demonstrate this on a security agent built by Lume Security on Microsoft Foundry. Lume creates an institutionally grounded security intelligence graph that captures how organizations actually work—this intelligence graph then powers deterministic, policy-aligned agents that reason alongside the user and take trusted action across security, compliance, and IT workflows. Lume helps modern security teams scale expertise without scaling headcount. Its Security Intelligence Platform builds a continuously learning security intelligence graph that reflects how work actually gets done. It reasons over policies, prior tickets, security findings, tool configurations, documentation, and system activity to retain institutional memory and drive consistent response. On this foundation, Lume delivers context-aware security assistants that fetch the most relevant context at the right time, produce policy-aligned recommendations, and execute deterministic actions with full explainability and audit trails. These assistants are embedded in tools teams are already using like ServiceNow, Jira, Slack, and Teams so the experience is seamless and provides value from day one. Microsoft Foundry and Veris AI together provide Lume with a secure, repeatable control plane that makes model iteration, safety, and simulation-driven evaluation practical. Vendor flexibility. Swap or test different models from OpenAI, Anthropic, Meta, and many others, with no infra changes. Fast model rollout. Provide access to the latest models as soon as they are released, making experimentations and updates easy. Consistent safety. Built-in policy and guardrail tooling enforces the same checks across experiments, cutting bespoke guardrail work. Enterprise privacy. Models run in private Azure instances and are not trained on client data, which simplifies and shortens AI security reviews. Made evaluation practical. Centralized models, logging, and policies let Veris run repeatable simulations and feed targeted failure cases back into Lume’s improvement loop. Simulation-driven evaluation. Run repeatable, high-fidelity simulations to stress-test and automatically surface failure modes before production. Agent optimization. Turn the graded failures into upgrades through automatic prompt fixes and targeted fine-tuning/RFT. The Lume Solution: Contextual Intelligence for Security Team Members Security teams are burdened by the time-consuming process of gathering necessary context from various siloed systems, tools, and subject-matter experts. This fragmented approach creates significant latency in decision-making, leads to frequent escalations, and drives up operational costs. Furthermore, incomplete or inaccurate context often results in inconsistent responses and actions that fail to fully mitigate risks. Lume unifies an organization’s fragmented data sources into a security intelligence graph. It also provides purpose-built assistants, embedded in tooling the team already uses, that can fetch relevant content from across the org, produce explainable recommendations, execute actions with full audit trails, and update data sources when source context is missing or misleading. The result is faster, more consistent decisions and fewer avoidable escalations. With just a couple integrations into ticketing and collaboration tools, Lume begins to prioritize where teams can gain the most efficiencies and risk reduction from context intelligence and automation - and automatically builds out assistants that can help. Search. Aggregate prior requests, owners, org, access, policy, live intel, and SIEM. Plan. Produce an institutionally grounded action plan. Review. Present a single decision-ready view with full context to approvers, modifying the plan based on their input. Act. Execute pre-approved changes with full explainability and audit trail. Close the loop. Notify stakeholders, update the graph and docs, log outcomes. Validation with early customers has revealed Lume’s security intelligence and assistants potential to truly change the way enterprises deal with security analysis and tasks. 35–55% less time on routine requests. Measurements with early customers shows the assistant and access to institutional intelligence reduces the time security analysts spend on recurring intake and tactical tasks, freeing staff for higher-value work. Faster and more confident decision making. Qualitative feedback from security team members reveals they feel more confident and can act faster when using the assistant, while also feeling less burdened knowing Lume will help ensure the task is resolved and documented. Improved institutional memory. Every decision, rationale, and action is captured and surfaced in the security intelligence graph, increasing repeatability and reducing future rework—this captured information also updates and cleanses existing documentation to ensure institutional knowledge aligns with current practice. Proactively identify & prioritize opportunities. Lume first identifies and prioritizes the tasks and processes where intelligence and assistants can have the greatest impact—measuring the actual ROI for security teams. Meaningful deflection of routine requests. The assistant can respond, plan, and execute actions (with human review + approval), deflecting common escalations and reducing load on subject matter experts. Architected on Azure By implementing this system on Azure stack, Lume and Veris benefited from the flexibility and breadth of services enabling this large scale implementation. This architecture allows the agent to evolve independently across reasoning, retrieval, and action layers without destabilizing production behavior. Microsoft Foundry orchestrates model usage, prompt execution, safety policies, and evaluation hooks. Azure AI Search provides hybrid retrieval across structured documents, policies, and unstructured artifacts. Vector storage enables semantic retrieval of prior tickets, decisions, and organizational knowledge. Graph databases capture relationships between systems, controls, owners, and historical decisions, allowing the agent to reason over organizational structure rather than isolated facts. Azure Kubernetes Service (AKS) hosts the agent runtime and evaluation workloads, enabling horizontal scaling and isolated experiments. Azure Key Vault manages secrets, credentials, and API access securely. Azure Web App Services power customer-facing interfaces and internal dashboards for visibility and control. Evaluating and ultimately improving agents is still one of the hardest parts of the stack. Static datasets don’t solve this, because agents are not static predictors. They are dynamic, multi-turn systems that change the world as they act: they call tools, accumulate state, recover (or fail to recover) from errors, and face nondeterministic user behavior. A golden dataset might grade a single response as “correct,” while completely missing whether the agent actually achieved the right end state across systems. In other words: static evals grade answers; agent evals must grade outcomes. At the same time, getting enough real-world data to drive improvement is perpetually difficult. The most valuable failures are rare, long-tail, and hard to reproduce on demand. Iterating directly in production is not possible because of safety, compliance, and customer-experience risk. That’s why environments are essential: a place to test agents safely, generate repeatable experience, and create the signals that allows developers to improve behavior without gambling on live users. Veris AI is built around that premise. It is a high-fidelity simulation platform that models the world around an agent with mock tools, realistic state transitions, and simulated users with distinct behaviors. From a single observed production failure, Veris can reconstruct what happened, expand it into a family of realistic scenario variants, and then stress-test the agent across that scenario set. Those runs are evaluated with a mix of LLM-based judges and code-based verifiers that score full trajectories not just the final text. Crucially, Veris does not stop at measurement. Its optimization module uses those grader signals to improve the agent with automatically refining prompts and supporting reinforcement-learning style updates (e.g., RFT) in a closed loop. The result is a workflow where one production incident can become a repeatable training and regression suite, producing targeted improvements on the failure mode while protecting performance on everything that already worked. Veris AI environment is available on Azure Kubernetes Service (AKS) and can be easily deployed in a user Virtual Network. Under the Hood: Building a self-improving agent When a failure appears in production, the improvement loop starts from the trace. Veris takes the failed session logs from the observability pipeline, then an LLM-based evaluator pinpoints what actually went wrong and automatically writes a new targeted evaluation rubric for that specific failure mode. From there, Veris simulator expands the single incident into a scenario family, dozens of realistic variants with branching, so the agent can be trained and re-tested against the whole “class” of the failure, not just one example. This is important as failure modes are often sparse in the real-world. Those scenarios are executed in the simulation engine with mocked tools and simulated users, producing full interaction traces that are then scored by the evaluation engine. The scores become the training signal for optimization. Veris optimization engine uses the evaluation results, the original system prompt, and the new rubric to generate an updated prompt designed to fix the failure without breaking previously-good behavior. Then it validates the new prompt on both (a) the specialized scenario set created from the incident and (b) a broader regression held-out set suite to ensure the improvement generalizes and does not cause regressions. In this case study, we focused on a key failure mode of incorrect workflow classification at the triage step for approval-driven access requests. In these cases, tickets were routed into an escalation or in-progress workflow instead of the approval-validation path. The ticket often contained conflicting approval or rejection signals in human comments - manager approved but they required some additional information, a genuine occurrence in org related jira workflows. The triage agent failed to recognize them and misclassified the workflow state. Since triage determines what runs next, a single misclassification at the start was enough to bypass the Approval Agent and send the request down an incorrect downstream path. Results & Impact By running the optimization loop, we were able to improve agent accuracy on this issue by over 40%, while not regressing on any other correct behavior. We ran experiments on a dataset only containing scenarios with the issue and a more general dataset encompassing a variety of scenarios. The experiment results on both datasets and updated prompt is shown below. Continuing with this over any issues that arise in production or simulation will continuously improve the agent performance. Takeaways This collaboration shows a practical pattern for taking an enterprise agent from “works in a demo” to “improves safely in production”, pairing an orchestration layer that standardizes model usage, safety, logging, and evaluation with a simulation environment where failures can be reproduced and fixed without risking real users, then making it repeatable in practice. In this stack, Veris AI provides the simulations, trajectory grading, and optimization loop, while Microsoft Foundry operationalizes the workflow with vendor-flexible model iteration, consistent policy enforcement, enterprise privacy, and centralized evaluation hooks that turn testing into a first-class system instead of bespoke glue code. The result is an improved and reliable Lume assitant that can help enterprises spend 35–55% less time on routine requests, and meaningfully deflect repetitive escalations, requiring fewer clarification cycles, enabling faster response times, and a stronger institutional memory where decisions and rationales compound over time instead of getting lost across tools and tickets. The self-improving loop continuously improves the agent, starting from a production trace, by generating a targeted rubric, expanding the incident into a scenario family, running it end-to-end with mocked tools and simulated users, scoring full trajectories, then using those scores to produce a safer prompt/model update and validating it against both the failure set and a broader regression suite. This turns rare long-tail failures into repeatable training and regression assets. If you’re building an AI agent, the recommendation is straightforward. Invest early in an orchestration and safety layer, as well as an environment-driven evaluation that can create an improvement loop to ship fixes without regressions. This way, your production failures act as the highest-signal input to continuously harden the system. To learn more or start building, teams can explore the Veris Console for free or browse the Veris documentation . In this stack, Microsoft Foundry provides the orchestration and safety control plane, and Veris provides the simulation, evaluation, and optimization loop needed to make agents improve safely over time. Learn more about the Lume Security assistant here.209Views1like0CommentsThe Business Foundation: Why Most Companies Aren’t Ready for Agentic AI
Before agents can execute decisions, organizations must redesign how they structure responsibility, data, governance, and operational context before autonomy can scale. The enterprise AI landscape has shifted. Organizations are moving beyond chatbots and isolated predictive models toward systems that can plan, decide, and execute multi-step work across finance, engineering operations, supply chains, and customer service. Many analysts now expect agentic AI to unlock major productivity gains across knowledge work. But despite the momentum, adoption remains limited. As of 2025, only about 2% of organizations have deployed agent-based systems at real operational scale, while most remain stuck in pilots. The reason is not model capability. It is readiness. The Core Problem Most organizations still treat AI adoption as a technical rollout exercise and measure progress through deployment indicators such as copilots enabled, pilots launched, or models evaluated. These metrics reflect experimentation activity, but they do not show whether an organization is ready to operate systems that make decisions and execute actions inside business workflows. Agentic systems do more than generate insights; they participate directly in operational processes. The gap between deploying AI tools and safely delegating decision-making authority to them is where many transformation efforts begin to stall. True enterprise readiness for agentic AI is not defined by how many models an organization deploys or how many pilots it launches. It depends on whether the organization can safely delegate bounded decisions to autonomous systems. In practice, this requires: Strategy and decision scoping: identifying where autonomous execution creates value and where human oversight must remain in place Process and decision-system maturity: redesigning workflows for human-agent collaboration with clear escalation boundaries Context-ready data foundations: ensuring agents operate on consistent, policy-aware operational context rather than fragmented data silos Governance and accountability structures: defining what agents may recommend, execute, escalate, or never touch, supported by auditability and oversight Team readiness and lifecycle management: preparing teams to supervise autonomous execution and managing agents as ongoing operational participants rather than static tools Coordination architecture readiness: aligning multiple agents across domains so local optimization does not create organizational conflict This article explains why traditional enterprise environments are not yet prepared for autonomous agents, what true agentic readiness actually looks like in practice, and the sequence of organizational changes required before decision-capable systems can be deployed safely at scale. I. The Readiness Illusion and the Root Causes of Failure Most organizations are deploying agentic systems into environments designed exclusively for human execution. That mismatch produces predictable friction across five structural layers. 1. Fragmented Operational Context (The Data Problem) Enterprises have a lot of data. What they often lack is usable context. Traditional systems record what happened. Agents also need to understand why something happened, how systems are connected, and where policy limits apply. In most organizations, customer systems, telemetry platforms, identity services, and finance tools do not stay aligned in real time. As a result, agents operate across disconnected information rather than a shared operational picture. This creates real risk. With generative AI, poor data quality usually produces a weak answer. With agentic AI, poor data quality can produce the wrong action at scale. More APIs, more pipelines, and more dashboards do not fix this by themselves. Without a shared semantic context across systems, agents can still make decisions that are internally logical but operationally wrong. For example, an agent may see that a customer received a large discount and conclude that future discounts should be limited, while missing that the original discount was approved because of a service outage and a retention risk. The data is available, but the business meaning behind it is not. 2. Undocumented Decision Systems Most organizations document workflows. However, very few document decision authority clearly enough for autonomous execution. Agents need to know where they are allowed to act, when they must escalate, and which decisions remain human-only. Without these boundaries, organizations often follow the same pattern: the first unexpected situation appears, confidence drops, and the agent is switched off. This is not a model problem. It is a decision-structure problem. Before deploying agents, organizations must be able to explain which decisions can be delegated and who remains responsible for each step. Many cannot yet do this. 3. The Governance Paradox Agentic systems do not fit traditional governance models. Most organizations still assume a simple structure: user → application → resource Agent-based systems introduce a new layer: user → agent → tools → resource This change affects access control, compliance processes, and audit visibility. Organizations usually buy agents like software tools but must manage them more like team members. That gap is rarely addressed before deployment begins. This issue is already visible today. Many enterprises are using vendor copilots and embedded AI features inside business systems without clear ownership, audit coverage, or governance rules. This creates a growing “shadow AI” layer even before intentional agent programs start. 4. Identity and Accountability Ambiguity Many organizations cannot clearly answer a simple question: who is responsible when an agent makes a mistake? In practice, agents often receive permissions that are broader than necessary, execution traces are difficult to follow across multiple systems, and accountability is split between IT, compliance, and business teams. Without clear attribution, autonomy introduces hidden risk instead of efficiency. Delegation without accountability is not automation. It is unmanaged risk. 5. Organizational Misalignment Most transformation programs still assume employees will use AI as a tool. Agentic environments change the role of employees from operators to supervisors. People are expected to review outcomes, guide behavior, and manage exceptions instead of executing every step themselves. Research from BCG shows that around 70% of AI project challenges come from people and process issues rather than technology. Organizations that invest in change management are significantly more likely to see successful results. Organizational readiness is not something to address later. It is required before agents can operate safely. Common Failure Patterns at a Glance Common failure patterns like these are already visible in real deployments. The Klarna case illustrates the challenge well. After replacing several hundred customer service roles with AI, the company later reported lower resolution quality for complex cases, declining satisfaction scores, and higher escalation rates, which led to renewed hiring in support roles. The outcome did not point to a failure of the model itself. It highlighted what happens when autonomous systems are introduced without the supporting process, governance, and team structures required for sustained operation. II. Defining True Agentic Readiness Agentic readiness is not just about having the right tools in place. It is about whether the organization has the capability to use autonomous systems safely and effectively. Definition Agentic readiness is the ability to safely delegate bounded operational decisions to autonomous systems while maintaining accountability, observability, and policy alignment across the full execution chain. Research consistently shows that organizations benefit from AI only when multiple maturity layers advance together. The MIT CISR AI Maturity Model, based on data from 721 companies, demonstrates that financial performance improves as organizations progress through the stages. Companies in early stages often perform below industry averages, while those reaching later stages perform significantly better. The key insight is that maturity is cumulative. Organizations cannot skip foundational steps and still expect reliable outcomes. For agentic systems, those cumulative layers include strategy alignment, decision-ready processes, context-ready data, governance structures, organizational roles, and technical architecture. When only some of these elements are in place, organizations produce pilots. When they advance together, organizations produce transformation. From Activity Metrics to Outcome Metrics One of the clearest signs of readiness is how an organization measures progress. Organizations at an early stage usually focus on activity: Number of models deployed Pilots launched Features enabled User onboarding numbers and API call volume More mature organizations focus on outcomes: Better decision quality and fewer errors Higher throughput for clearly defined tasks Consistent operation within safe autonomy boundaries Complete audit trails and accurate escalation handling This is not a semantic distinction. Organizations measuring activity invest indefinitely in pilots because they have no signal telling them a pilot has succeeded or failed. The measurement framework is itself a prerequisite for the transformation sequence. III. The Transformation Sequence Most Organizations Skip Many organizations begin agent adoption in the wrong order. Platforms are procured before governance is defined. Models are evaluated before workflows are structured. Autonomy is introduced before decision authority is mapped. The result is not faster progress. It is earlier failure, followed by expensive cleanup later. In traditional cloud transformation, architecture precedes automation. Agentic transformation follows the same rule: decision structure must exist before delegation can scale. Step 1: Strategic Alignment and Decision Scoping Organizations should begin by identifying where autonomy creates value safely — not where it is technically possible and not where ambitions are highest. Strong early candidates usually share the same characteristics: structured decisions, bounded scope, reversible actions, and high execution frequency. Typical examples include incident triage and routing, capacity classification, environment status updates, and prioritization support. These are good starting points not because they are simple, but because failures are visible, recoverable, and useful for learning. Delegation should grow gradually from bounded decision spaces toward broader authority. Organizations that struggle often start with highly visible, high-risk use cases because the business case looks attractive. Organizations that succeed usually begin with frequent, lower-impact decisions where feedback loops are short and improvements can happen quickly. Step 2: Process Maturity and Boundary Setting Agents do not fix broken workflows. They execute them faster. If a process depends on informal judgment, tribal knowledge, or undocumented exception handling, an agent will reproduce those weaknesses at machine speed. Before introducing autonomy, organizations should establish structured runbooks with clear execution paths, explicit escalation logic an agent can evaluate, defined exception-handling rules that do not rely on intuition, and clear boundaries between decisions an agent may take and those that must remain with humans. This level of discipline requires documentation precision that many organizations have never needed before. A statement such as “the engineer uses judgment” is not a runbook step. It is an undocumented dependency that will later appear as an agent failure. This is also where leaders face a practical choice: add agents on top of fragile legacy workflows, or redesign those workflows so delegation can happen safely. In many cases, the second path is slower at first but far more durable. Step 3: Data Context and Decision Awareness Agents cannot operate reliably in fragmented environments. The solution is not simply collecting more data. What they require is decision-aware context: structured knowledge about relationships between systems, service dependencies, environment classification, policy boundaries, and operational intent. This is a different challenge from building analytics platforms. Analytics depends on broad visibility across large datasets. Agentic execution depends on precise, current, and consistent information at the moment a decision is made. A customer record that is accurate enough for reporting may not be reliable enough for an agent executing a contract action. Because of this difference, data readiness becomes a leadership concern rather than only an infrastructure task. Microsoft’s digital transformation guidance captures this clearly with the principle “no AI without data”: organizations should identify critical data sources, establish governance ownership, improve quality, and define controlled access before introducing agents into operational workflows. Step 4: Governance and Delegation Redesign Organizations must explicitly define four categories of agent authority before deployment: What agents may recommend (advisory boundary) What agents may execute autonomously (execution boundary) What requires human approval before execution (escalation boundary) What remains permanently restricted regardless of confidence (prohibition boundary) These policies cannot remain static. Agentic systems require continuous supervision, not periodic review. Research supports this shift. Studies of governance professionals working with autonomous systems show that adopting traditional Enterprise Risk Management frameworks alone does not significantly reduce governance incidents. What makes the difference is integrating human oversight into execution loops and strengthening machine identity security. In practice, this means organizations need a delegated-autonomy governance function: a cross-functional group with representation from IT, compliance, legal, and business teams that continuously defines and monitors the boundaries of agent behavior. This is different from extending existing approval committees. Governance must move from acting as a gate before deployment to operating as a supervision layer throughout the lifecycle of the agent. This creates a basic operational tension: organizations adopt agents to reduce manual work, but safe autonomy requires stronger supervision, better observability, and tighter control over identity and permissions — especially in the early stages. Step 5: Operating Model Redesign: Operationalizing Human-Agent Collaboration Agentic systems create responsibilities that do not yet exist in most organizations. This shift is not mainly about replacing people with agents. It is about redesigning how people work with them, supervise them, and remain accountable for outcomes. New operational roles typically include: Agent reliability engineers who monitor performance, detect degradation, and define retraining triggers Policy designers who translate business rules into machine-evaluable decision logic Workflow supervisors who oversee autonomous execution and handle escalations Context curators who maintain the data foundations agents depend on for accurate reasoning Organizations that succeed with agents do not treat them as static automation tools. They treat them as managed participants inside workflows. That is why they need an HR layer for agents. An HR layer for agents means applying the same lifecycle thinking used for people to autonomous systems. Before an agent is allowed to operate, it needs a clearly defined role, scope, level of authority, and access to the right systems. Once deployed, its performance must be reviewed over time, its behavior monitored, and its permissions adjusted when quality drops or risks increase. When the agent no longer fits the workflow, it should be retired or replaced instead of being left running by default. In practice, this means agent management should include: Onboarding, by defining scope, authority, and access boundaries Supervision, through observability, escalation paths, and performance review Retraining or re-scoping, when quality declines or conditions change Retirement, when the agent no longer fits the process or creates more risk than value In higher-risk workflows, this HR layer must also include graceful degradation. For example, an underperforming agent may automatically lose write access, be moved to read-only mode, and hand control back to a human supervisor until its behavior is corrected. This shift also requires leadership readiness. The Harvard 2025 Global Leadership Development Study found that 71% of senior leaders now see the ability to lead through continuous change as critical, yet only 36% say AI is fully integrated into their strategy. That gap between intention and execution is where many organizational transformation programs begin to stall. Step 6: Coordination Architecture Readiness As organizations deploy agents across multiple domains, a new challenge appears: agents begin optimizing locally instead of organizationally. An agent focused on cost efficiency in one area may conflict with another agent responsible for quality assurance elsewhere. Without coordination structures, these conflicts often remain invisible until they surface as operational failures. Coordination architecture helps align agent behavior across the organization. It ensures policy consistency between agents, maintains a shared understanding of the operational environment, prevents conflicts when actions intersect, and supports stable communication between agents working together across workflows. This capability is not required for the first agent deployment. It becomes important as soon as organizations begin operating multiple agents in parallel. Many organizations encounter coordination problems earlier than expected, which is why coordination readiness belongs in the transformation sequence even if its implementation happens later. Local optimization is rarely what enterprises intend. Coordination architecture is how you prevent it from becoming what they get. IV. The Regulatory Clock Is Already Running For organizations operating in or serving European markets, readiness is no longer only a strategic question. It is also a regulatory one. The EU AI Act’s high-risk provisions take effect in August 2026, with potential penalties reaching €35 million or 7% of global revenue. Colorado’s AI Act follows in June 2026, and a growing number of U.S. states now require documented AI governance programs for specific sectors and use cases. The governance and data foundations described earlier in this article are therefore not only best practice. For many organizations, they are becoming compliance prerequisites. Treating readiness as optional before deployment increasingly means accepting regulatory exposure before value is realized. The transformation sequence described here is not a slower path to deployment. It is the only path that avoids accumulating technical and legal risk at the same time. V. Conclusion: Shifting Toward Outcome-Based Pragmatism Agentic systems rarely fail because language models are incapable. They fail because they are introduced into environments designed for human execution, governed by frameworks built for deterministic software, and evaluated using metrics that cannot distinguish a promising pilot from a production-ready capability. The readiness gap is structural and, in many cases, self-inflicted. Organizations skip foundational steps because platform procurement is faster, more visible, and easier to justify internally than operating-model redesign. The result is earlier failure, higher remediation cost, and — in regulated industries — increasing legal exposure. What this means in practice Organizations should stop measuring readiness through activity indicators and start measuring it through decision quality, execution safety, throughput improvement, and bounded autonomy performance. Governance and data foundations must be established before platform rollout. Organizational transition planning must begin before deployment. Decision authority must be defined before the first agent workflow is introduced. Only then can enterprises safely unlock the productivity gains promised by agentic systems — not because the technology suddenly becomes capable, but because the organization becomes ready to use it. Up Next in This Series Part 2 looks at the cloud foundation needed for safe agent deployment, including identity-first architecture, observability, policy controls, and the platform constraints that often appear only after design decisions have been made. Part 3 focuses on how to design agents that work reliably in enterprise environments, including RAG maturity, loop design, multi-agent coordination, and human oversight built into the architecture from the start. References Weinberg, A. I. (2025). A Framework for the Adoption and Integration of Generative AI in Midsize Organizations and Enterprises (FAIGMOE). Patel, R. (2026). Agentic AI Frameworks: A Complete Enterprise Guide for 2026. Space-O Technologies. Microsoft Learn. Agentic AI maturity model. Keenan, K. (2026). How the right context can reshape agentic AI’s productivity output. Business Insider / Reltio. Ransbotham, S., Kiron, D., Khodabandeh, S., Iyer, S., & Das, A. (2025). The Emerging Agentic Enterprise: How Leaders Must Navigate a New Age of AI. MIT Sloan Management Review & Boston Consulting Group.104Views0likes0CommentsNow in Foundry: NVIDIA Nemotron-3-Super-120B-A12B, IBM Granite-4.0-1b-Speech, and Sarvam-105B
This week's Model Mondays edition highlights three models now available in Hugging Face collection on Microsoft Foundry: NVIDIA's Nemotron-3-Super-120B-A12B, a hybrid Latent Mixture-of-Experts (MOE) model with 12B active parameters and context handling up to 1 million tokens; IBM Granite's Granite-4.0-1b-Speech, a compact Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) model that achieves a 5.52% average Word Error Rate (WER) at 280× real-time speed with runtime keyword biasing for domain adaptation; and Sarvam's Sarvam-105B, a 105B Mixture-of-Experts (MoE) model with 10.3B active parameters optimized for complex reasoning and 22 Indian languages, with comparable agentic performance compared to other larger proprietary models on web search and task-planning benchmarks. Models of the week NVIDIA Nemotron-3-Super-120B-A12B Model Specs Parameters / size: 120B total with 12B active Context length: Up to 1M tokens Primary task: Text generation (reasoning, agentic workflows, long-context tasks, tool use, RAG) Why it's interesting (Spotlight) Hybrid Latent MoE architecture with selective attention: Nemotron-3-Super combines interleaved Mamba-2 state-space layers and sparse MoE layers with a select number of full attention layers—a design called Latent MoE. Tokens are routed into a smaller latent space for computation, which improves accuracy per parameter while keeping only 12B parameters active at inference time. Multi-Token Prediction (MTP) heads where the model simultaneously predicts multiple upcoming tokens during training enable native speculative decoding, reducing time-to-first-token on long outputs without a separate draft model. Configurable reasoning mode: The model supports toggling extended chain-of-thought reasoning on or off via the chat template flag enable_thinking. This lets developers suppress the reasoning trace for latency-sensitive tasks while keeping it available for high-stakes or multi-step agentic use cases without loading a separate model. Sustained 1M-token context reliability: On RULER, the standard long-context evaluation suite, Nemotron-3-Super achieves 91.75% at 1M tokens. This makes it practical for full-document retrieval-augmented generation (RAG), long-form code analysis, and extended agentic sessions without chunking or windowing strategies. Try it Use cases Best practices Ultra‑long document ingestion & consolidation (e.g., end‑to‑end review of massive specs, logs, or multi‑volume manuals without chunking) Use the native 1M‑token context to avoid windowing strategies; feed full corpora in one pass to reduce stitching errors. Prefer default decoding for general analysis (NVIDIA recommends temperature≈1.0, top_p≈0.95) before tuning; this aligns with the model’s training and MTP‑optimized generation path. Leverage MTP for throughput (multi‑token prediction improves output speed on long outputs), making single‑pass synthesis practical at scale. Latency‑sensitive chat & tool‑calling at scale (e.g., high‑volume enterprise assistants where response time matters) Toggle reasoning traces intentionally via the chat template (enable_thinking on/off): turn off for low‑latency interactions; on for harder prompts where accuracy benefits from explicit reasoning. Use model‑recommended sampling for tool calls (many guides tighten temperature for tool use) to improve determinism while keeping top_p near 0.95. Rely on the LatentMoE + MTP design to sustain high tokens/sec under load instead of adding a draft model for speculative decoding. IBM Granite-4.0-1b-Speech Model Specs Parameters / size: ~1B Context length: 128K tokens (LLM backbone; audio processed per utterance through the speech encoder) Primary task: Multilingual Automatic Speech Recognition (ASR) and bidirectional Automatic Speech Translation (AST) Why it's interesting (Spotlight) Compact ASR with speculative decoding at near-real-time speed: At roughly 1B parameters, Granite-4.0-1b-Speech achieves a 5.52% average WER across eight English benchmarks at 280× real-time speed (RTFx—the ratio of audio duration processed to wall-clock time) on the Open ASR Leaderboard. Runtime keyword biasing for domain adaptation without fine-tuning: Granite-4.0-1b-Speech accepts a runtime keyword list—proper nouns, brand names, technical terms, acronyms—that adjusts decoding probabilities toward those terms. This allows domain-specific vocabulary to be injected at inference time rather than requiring a fine-tuning run, practical for legal transcription, medical dictation, or financial meeting notes where terminology changes across clients. Bidirectional speech translation across 6 languages in one model: Beyond ASR, the model supports translation both to and from English for French, German, Spanish, Portuguese, and Japanese, plus English-to-Italian and English-to-Mandarin. A single deployed endpoint handles ASR and AST tasks without routing audio to separate models, reducing infrastructure surface area. Try it Test the model in the Hugging Face space before deploying in Foundry here: Sarvam’s Sarvam-105B Model Specs Parameters / size: 105B total with 10.3B active (Mixture of Experts, BF16) Context length: 128K tokens (with YaRN-based long-context extrapolation, scale factor 40) Primary task: Text generation (reasoning, coding, agentic tasks, Indian language understanding) Why it's interesting (Spotlight) Broad Indian language coverage at scale: Sarvam-105B supports English and 22 Indian languages—Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Urdu, Sanskrit, Maithili, Dogri, Manipuri, Santali, Kashmiri, Nepali, Sindhi, Konkani, and Tibetan—the broadest open-model coverage for this language set at this parameter range. Training explicitly prioritized the Indian context, resulting in reported state-of-the-art performance across these languages for models of comparable size. Strong agentic and web-search performance: Sarvam-105B scores 49.5% on BrowseComp (web research benchmark with search tool access)—substantially above GLM-4.5-Air (21.3%) and Qwen3-Next-80B-A3B-Thinking (38.0%). It also achieves 68.3% average on τ² Bench (multi-domain task-planning benchmark), above GPT-OSS-120B (65.8%) and GLM-4.5-Air (53.2%). This reflects training emphasis on multi-step agentic workflows in addition to standard reasoning. Try it Use cases Best practices Agentic web research & technical troubleshooting (multi-step reasoning, planning, troubleshooting) Use longer context when needed: the model is designed for long-context workflows (up to 128K context with YaRN-based extrapolation noted). Start from the model’s baseline decoding settings (as shown in the model’s sample usage) and adjust for your task: temperature ~0.8, top_p ~0.95, repetition_penalty ~1.0, and set an explicit max_new_tokens (sample shows 2048). Suggestion (general, not stated verbatim in the sources): For agentic tasks, keep the prompt structured (goal → constraints → tools available → required output format), and ask for a short plan + final answer to reduce wandering. Multilingual (Indic) customer support & content generation (English + 22 Indian languages; native-script / romanized / code-mixed inputs) Be explicit about the language/script you want back (e.g., Hindi in Devanagari vs romanized Hinglish), since training emphasized Indian languages and code-mixed/romanized inputs. Provide in-language examples (a short “good response” example in the target language/script) to anchor tone and terminology. (Suggestion—general best practice; not stated verbatim in sources.) Use the model’s baseline generation settings first (sample decoding params) and then tighten creativity for support use cases (e.g., lower temperature) if you see variability. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. Or start from the Hugging Face Hub and choose the "Deploy on Microsoft Foundry" option, which brings you straight into Foundry. Learn how to discover models and deploy them using Microsoft Foundry here: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry330Views0likes0CommentsFoundry IQ: Unlocking ubiquitous knowledge for agents
Introducing Foundry IQ by Azure AI Search in Microsoft Foundry. Foundry IQ is a centralized knowledge layer that connects agents to data with the next generation of retrieval-augmented generation (RAG). Foundry IQ includes the following features: Knowledge bases: Available directly in the new Foundry portal, knowledge bases are reusable, topic-centric collections that ground multiple agents and applications through a single API. Automated indexed and federated knowledge sources – Expand what data an agent can reach by connecting to both indexed and remote knowledge sources. For indexed sources, Foundry IQ delivers automatic indexing, vectorization, and enrichment for text, images, and complex documents. Agentic retrieval engine in knowledge bases – A self-reflective query engine that uses AI to plan, select sources, search, rank and synthesize answers across sources with configurable “retrieval reasoning effort.” Enterprise-grade security and governance – Support for document-level access control, alignment with existing permissions models, and options for both indexed and remote data. Foundry IQ is available in public preview through the new Foundry portal and Azure portal with Azure AI Search. Foundry IQ is part of Microsoft's intelligence layer with Fabric IQ and Work IQ.39KViews6likes4Comments