ai
25 TopicsIntroducing OpenAI's GPT-image-2 in Microsoft Foundry
Take a small design team running a global social campaign. They have the creative vision to produce localized imagery for every market, but not the resources to reshoot, reformat, or outsource that scale. Every asset needs to fit a different platform, a different dimension, a different cultural context, and they all need to ship at the same time. This is where flexible image generation comes in handy. OpenAI's GPT-image-2 is now generally available and rolling out today to Microsoft Foundry, introducing a step change in image generation. Developers and designers now get more control over image output, so a small team can execute with the reach and flexibility of a much larger one. What is new in GPT-image-2? GPT-image-2 brings real world intelligence, multilingual understanding, improved instruction following, increased resolution support, and an intelligent routing layer giving developers the tools to scale image generation for production workflows. Real world intelligence GPT-image-2 has a knowledge cut off of December 2025, meaning that it is able to give you more contextually relevant and accurate outputs. The model also comes with enhanced thinking capabilities that allow it to search the web, check its own outputs, and create multiple images from just one prompt. These enhancements shift image generation models away from being simple tools and runs them into creative sidekicks. Multilingual understanding GPT-image-2 includes increased language support across Japanese, Korean, Chinese, Hindi, and Bengali, as well as new thinking capabilities. This means the model can create images and render text that feels localized. Increased resolution support GPT-image-2 introduces 4K resolution support, giving developers the ability to generate rich, detailed, and photorealistic images at custom dimensions. Resolution guidelines to keep in mind: Constraint Detail Total pixel budget Maximum pixels in final image cannot exceed 8,294,400 Minimum pixels in final image cannot be less than 655,360 Requests exceeding this are automatically resized to fit. Resolutions 4K, 1024x1024, 1536x1024, and 1024x1536 Dimension alignment Each dimension must be a multiple of 16 Note: If your requested resolution exceeds the pixel budget, the service will automatically resize it down. Intelligent routing layer GPT-image-2 also includes an expanded routing layer with two distinct modes, allowing the service to intelligently select the right generation configuration for a request without requiring an explicitly set size value. Mode 1 — Legacy size selection In Mode 1, the routing layer selects one of the three legacy size tiers to use for generation: Size tier Description smimage Small image output image Standard image output xlimage Large image output This mode is useful for teams already familiar with the legacy size tiers who want to benefit from automatic selection without making any manual changes. Mode 2 — Token size bucket selection In Mode 2, the routing layer selects from six token size buckets — 16, 24, 36, 48, 64, 96 — which map roughly to the legacy size tiers: Token bucket Approximate legacy size 16, 24 smimage 36, 48 image 64, 96 xlimage This approach can allow for more flexibility in the number of tokens generated, which in turn helps to better optimize output quality and efficiency for a given prompt. See it in action GPT-image-2 shows improved image fidelity across visual styles, generating more detailed and refined images. But, don’t just take our word for it, let's see the model in action with a few prompts and edits. Here is the example we used: Prompt: Interior of an empty subway car (no people). Wide-angle view looking down the aisle. Clean, modern subway car with seats, poles, route map strip, and ad frames above the windows. Realistic lighting with a slight cool fluorescent tone, realistic materials (metal poles, vinyl seats, textured floor). As you can see, when using the same base prompt, the image quality and realism improved with each model. Now let’s take a look at adding incremental changes to the same image: Prompt: Populate the ad frames with a cohesive ad campaign for “Zava Flower Delivery” and use an array of flower types. And our subway is now full of ads for the new ZAVA flower delivery service. Let's ask for another small change: Prompt: In all Zava Flower Delivery advertisements, change the flowers shown to roses (red and pink roses). And in three simple prompts, we've created a mockup of a flower delivery ad. From marketing material to website creation to UX design, GPT-image-2 now allows developers to deliver production-grade assets for real business use cases. Image generation across industries These new capabilities open the door to richer, more production-ready image generation workflows across a range of enterprise scenarios: Retail & e-commerce: Generate product imagery at exact platform-required dimensions, from square thumbnails to wide banners, without post-processing. Marketing: Produce crisp, rich in color campaign visuals and social assets localized to different markets. Media & entertainment: Generate storyboard panels and scene at resolutions suited to production pipelines. Education & training: Create visual learning aids and course materials formatted to exact display requirements across devices. UI/UX design: Accelerate mockup and prototype workflows by generating interface assets at the precise dimensions your design system requires. Trust and safety At Microsoft, our mission to empower people and organizations remains constant. As part of this commitment, models made available through Foundry undergo internal reviews and are deployed with safeguards designed to support responsible use at scale. Learn more about responsible AI at Microsoft. For GPT-image-2, Microsoft applied an in-depth safety approach that addresses disallowed content and misuse while maintaining human oversight. The deployment combines OpenAI’s image generation safety mitigations with Azure AI Content Safety, including filters and classifiers for sensitive content. Pricing Model Offer type Pricing - Image Pricing - Text GPT-image-2 Standard Global Input Tokens: $8 Cached Input Tokens: $2 Output Tokens: $30 Input Tokens: $5 Cached Input Tokens: $1.25 Output Tokens: $10 Note: All prices are per 1M token. Getting started Whether you’re building a personalized retail experience, automating visual content pipelines or accelerating design workflows. GPT-image-2 gives your team the resolution control and intelligent routing to generate images that fit your exact needs. Try the GPT-image-2 in Microsoft Foundry today! Deploy the model in Microsoft Foundry Experiment with the model in the Image playground Read the documentation to learn more2.6KViews1like0CommentsTroubleshooting Microsoft Foundry Accessing On‑Premises APIs Over Private Networking
Audience: Azure solution architects, network engineers, and AI practitioners deploying Microsoft Foundry in enterprise, network‑isolated environments. Scenario: Foundry Agent Service must call an on‑premises (or privately hosted) API over VPN or ExpressRoute using on‑premises corporate DNS. Connectivity works from a virtual machine (VM) in the virtual network (VNet), but fails when the same call is made from a Foundry agent. This post consolidates common field patterns observed across customer engagements and maps them directly to official Microsoft guidance. It highlights a frequently missed prerequisite for private connectivity: Project and Agent Capability Hosts. The Repeating Enterprise Pattern The architecture is familiar: Microsoft Foundry account and project Customer‑managed VNet with VPN or ExpressRoute to on‑premises Corporate (on‑premises) DNS used for API name resolution A VM in the VNet can successfully resolve and call the on‑prem API Foundry agents are configured to call the same API (often via an OpenAPI tool) Despite this, the agent fails with one or more of the following: DNS resolution failures Connection timeouts HTTP 401 or 403 responses Unexpected backend or proxy URLs appearing in logs The consistent question is: Why does this work from a VM in the VNet but not from the Foundry agents? Key Principle: Foundry Agents Do Not Automatically Run in Your VNet This is the most important mental model to reset. Creating a private endpoint for a Foundry Agent Service does not place agent runtime traffic into your VNet. Private endpoints are inbound constructs. Outbound connectivity from an agent only flows through your VNet when specific requirements are met. The most critical of those requirements is a Capability Host. What Is a Capability Host? A Capability Host defines where Foundry capabilities are allowed to execute. In private networking scenarios, the capability host: Binds a Foundry project or agent to a customer‑managed subnet Enables platform‑managed container injection into that subnet Ensures outbound traffic follows VNet routing, security controls, and DNS configuration Capability Host Scope Project Capability Host Associated at the project level Applies to all agents in the project Defines the customer subnet the project is allowed to use Agent Capability Host Associated at the individual agent level Can explicitly bind or override subnet placement Useful when agents require different isolation boundaries Key field insight: If no capability host is associated, the agent runtime is not injected into the VNet—even if VPN, ExpressRoute, private endpoints, and on‑prem DNS are correctly configured. Capability Host Lifecycle (Conceptual) 1 Microsoft Foundry Account ↓ Project ↓ Capability Host ↓ Delegated Agent Subnet (VNet) ↓ Agent Runtime (Container Injection) Without the capability host step, the chain breaks and the agent executes outside the customer network boundary. Why On‑Premises DNS Appears Correct—but Still Fails DNS is where most investigations stall. Teams typically confirm: On‑premises DNS resolves the API hostname VNet DNS settings forward queries to on‑prem DNS A VM in the subnet resolves and reaches the API Yet the agent still fails. The reason is simple: The VM is unquestionably inside the VNet The agent may not be Without a capability host, the agent runtime does not inherit: VNet DNS server settings Corporate DNS forwarding rules On‑premises name resolution paths As a result, DNS fails even though the DNS design itself is correct. Secondary Symptom: HTTP 401 Errors After subnet injection and DNS are corrected, some customers encounter HTTP 401 responses. This typically means: The API is now reachable The request is successfully routed through the private path Authentication or authorization is failing At this stage, troubleshooting moves from networking to identity: Validate the credential or token configured in the Foundry connection Confirm expected headers, audience, or auth flow Account for managed proxy hops in the request path A 401 at this point is progress—it confirms private connectivity is working. Permissions and Required Services for Capability Hosts Creating capability hosts requires both the correct Azure role-based access control (RBAC) permissions and the presence of specific dependent services. These prerequisites are frequently overlooked and can silently block capability host creation or leave hosts in a failed provisioning state. Required Permissions (RBAC) Microsoft documentation explicitly calls out the following minimum permissions: Contributor role on the Microsoft Foundry account to create capability hosts. User Access Administrator or Owner role on the subscription or resource group to grant the Foundry project’s managed identity access to dependent Azure resources when using standard agent setup. Without these roles, capability host creation may fail, or the host may be created without access to required downstream services. For details, see Role-based access control (RBAC) for Microsoft Foundry. Required Azure Services and Connections For standard and network-secured agent setups, capability hosts reference customer-owned Azure resources. The following services must exist and be connected to the Foundry project before creating the capability host: Azure Storage – for file uploads and artifacts Azure AI Search – for vector stores and retrieval Azure Cosmos DB – for thread and conversation storage Azure AI Services / Azure OpenAI – for model execution These services must be deployed in supported regions and, for private networking scenarios, in the same region as the virtual network. Capability hosts reference these resources through project connections, not raw resource IDs . Networking-Specific Requirements When using private networking: The agent subnet must already exist and be delegated appropriately. The capability host must reference the correct customer subnet at creation time. Required private endpoints and DNS resolution must be in place for dependent services. If networking or connections change, capability hosts cannot be updated in-place and must be deleted and recreated with the corrected configuration . Practical Fix Patterns 1. Create and Associate a Project Capability Host Bind the project to the intended delegated agent subnet Verify the customerSubnet reference Redeploy the agent after association This aligns directly with the Standard Setup with Private Networking model documented by Microsoft. 2. Validate Agent Placement and Network Inheritance Confirm the agent is associated with the expected capability host Verify the capability host references the correct subnet Ensure network and DNS settings are applied at the subnet level The agent inherits routing and DNS behavior only after successful subnet injection. 3. Validate DNS From the Agent Subnet Confirm VNet DNS settings point to on‑prem DNS Test name resolution from a VM in the same subnet Once injected, the agent uses the same DNS behavior as other resources in that subnet. 4. Use a Supported Foundry Experience Be aware of documented constraints: End‑to‑end network isolation is not supported in the newer Foundry portal experience Network‑isolated agent scenarios require the classic Foundry experience, SDK, or CLI Hosted agents do not support full isolation Mismatch here can make a correct network design appear broken. A Simple Checklist When a Foundry agent cannot reach an on‑prem API: Is a **Project Capability **Host associated? Is the capability host bound to the correct subnet? Is on‑prem DNS reachable from that subnet? Is a supported Foundry experience in use? If reachable, is authentication configured correctly? If items 1 and 2 are not satisfied, all other troubleshooting is premature. Closing Most private networking issues with Microsoft Foundry are not caused by VPNs or DNS infrastructure. They result from an incomplete understanding of **where the agent ****runtime ****actually **executes. Capability Hosts are the control point. When they are correctly configured, Foundry agents behave exactly as described in Microsoft guidance: they inherit VNet routing, DNS, and security controls and can securely access on‑premises systems over private connectivity. No capability host = no VNet injection = no on‑prem connectivity. References The following Microsoft‑published resources were referenced and aligned with throughout this article: Network‑secured agent setup (GitHub) – Reference implementation demonstrating a network‑secured Foundry Agent Service deployment with a customer‑managed virtual network, delegated agent subnet, and private connectivity patterns. This notebook illustrates how agent runtimes inherit network behavior only after subnet injection ralacher/network-secured-agent Set up private networking for Foundry Agent Service - Microsoft Foundry | Microsoft Learns .170Views0likes0CommentsGemma 4 now available in Microsoft Foundry
Experimenting with open-source models has become a core part of how innovative AI teams stay competitive: experimenting with the latest architectures and often fine-tuning on proprietary data to achieve lower latencies and cost. Today, we’re happy to announce that the Gemma 4 family, Google DeepMind’s newest model family, is now available in Microsoft Foundry via the Hugging Face collection. Azure customers can now discover, evaluate, and deploy Gemma 4 inside their Azure environment with the same policies they rely on for every other workload. Foundry is the only hyperscaler platform where developers can access OpenAI, Anthropic, Gemma, and over 11,000+ models under a single control plane. Through our close collaboration with Hugging Face, Gemma 4 joining that collection continues Microsoft’s push to bring customers the widest selection of models from any cloud – and fits in line with our enhanced investments in open-source development. Frontier Intelligence, open-source weights Released by Google DeepMind on April 2, 2026, Gemma 4 is built from the same research foundation as Gemini 3 and packaged as open weights under an Apache 2.0 license. Key capabilities across the Gemma 4 family: Native multimodal: Text + image + video inputs across all sizes; analyze video by processing sequences of frames; audio input on edge models (E2B, E4B) Enhanced reasoning & coding capabilities: Multi-step planning, deep logic, and improvements in math and instruction-following enabling autonomous agents Trained for global deployment: Pretrained on 140+ languages with support for 35+ languages out of the box Long context: Context windows of up to 128K tokens (E2B/E4B) and 256K tokens (26B A4B/31B) allow developers to reason across extensive codebases, lengthy documents, or multi-session histories Why choose Foundry? Foundry is built to give developers breadth -- access to models from major model providers, open and proprietary, under one roof. Stay within Azure to work leading models. When you deploy through Foundry, models run inside your Azure environment and are subject to the same network policies, identity controls, and audit processes your organization already has in place. Managed online endpoints handle serving, scaling, and monitoring without manually setting up and managing the underlying infrastructure. Serverless deployment with Azure Container Apps allows developers to deploy and run containerized applications while reducing infrastructure management and saving costs. Gated model access integrates directly with Hugging Face user tokens, so models that require license acceptance stay compliant can be accessed without manual approvals. Foundry Local lets you run optimized Hugging Face models directly on your own hardware using the same model catalog and SDK patterns as your cloud deployments. Read the documentation here: https://aka.ms/foundrylocal and https://aka.ms/HF/foundrylocal Microsoft’s approach to Responsible AI is grounded in our AI principles of fairness, reliability and safety, privacy and security, inclusiveness, transparency, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy new models responsibly in production environments. What are teams building with Gemma 4 in Foundry Gemma 4’s combination of multimodal input, agentic function calling, and long context offers a wide range of production use cases: Document intelligence: Processing PDFs, charts, invoices, and complex tables using native vision capabilities Multilingual enterprise apps: 140+ natively trained languages — ideal for multinational customer support, content platforms as well as language learning tools for grammar correction and writing practice Long-context analytics: Reasoning across entire codebases, legal documents, or multi-session conversation histories Getting started Try Gemma 4 in Microsoft Foundry today. New models from Hugging Face continue to roll out to Foundry on a regular basis through our ongoing collaboration. If there's a model you want to see added, let us know here. Stay connected to our developer community on Discord and stay up to date on what is new in Foundry through the Model Mondays series.994Views1like0CommentsNow in Foundry: Microsoft Harrier and NVIDIA EGM-8B
This week's Model Mondays edition highlights three models that share a common thread: each achieves results comparable to larger leading models, as a result of targeted training strategies rather than scale. Microsoft Research's harrier-oss-v1-0.6b from achieves state-of-the-art results on the Multilingual MTEB v2 embedding benchmark at 0.6B parameters through contrastive learning and knowledge distillation. NVIDIA's EGM-8B scores 91.4 average IoU on the RefCOCO visual grounding benchmark by training a small Vision Language Model (VLM) with reinforcement learning to match the output quality of much larger models. Together they represent a practical argument for efficiency-first model development: the gap between small and large models continues to narrow when training methodology is the focus rather than parameter count alone. Models of the week Microsoft Research: harrier-oss-v1-0.6b Model Specs Parameters / size: 0.6B Context length: 32,768 tokens Primary task: Text embeddings (retrieval, semantic similarity, classification, clustering, reranking) Why it's interesting State-of-the-art on Multilingual MTEB v2 from Microsoft Research: harrier-oss-v1-0.6b is a new embedding model released by Microsoft Research, achieving a 69.0 score on the Multilingual MTEB v2 (Massive Text Embedding Benchmark) leaderboard—placing it at the top of its size class at release. It is part of the harrier-oss family spanning harrier-oss-v1-270m (66.5 MTEB v2), harrier-oss-v1-0.6b (69.0), and harrier-oss-v1-27b (74.3), with the 0.6B variant further trained with knowledge distillation from the larger family members. Benchmarks: Multilingual MTEB v2 Leaderboard. Decoder-only architecture with task-instruction queries: Unlike most embedding models that use encoder-only transformers, harrier-oss-v1-0.6b uses a decoder-only architecture with last-token pooling and L2 normalization. Queries are prefixed with a one-sentence task instruction (e.g., "Instruct: Retrieve relevant passages that answer the query\nQuery: ...") while documents are encoded without instructions—allowing the same deployed model to be specialized for retrieval, classification, or similarity tasks through the prompt alone. Broad task coverage across six embedding scenarios: The model is trained and evaluated on retrieval, clustering, semantic similarity, classification, bitext mining, and reranking—making it suitable as a general embedding backbone for multi-task pipelines rather than a single-use retrieval model. One endpoint, consistent embeddings across the stack. 100+ language support: Trained on a large-scale mixture of multilingual data covering Arabic, Chinese, Japanese, Korean, and 100+ additional languages, with strong cross-lingual transfer for tasks that span language boundaries. Try it Use Case Prompt Pattern Multilingual semantic search Prepend task instruction to query; encode documents without instruction; rank by cosine similarity Cross-lingual document clustering Embed documents across languages; apply clustering to group semantically related content Text classification with embeddings Encode labeled examples + new text; classify by nearest-neighbor similarity in embedding space Bitext mining Encode parallel corpora in source and target languages; align segments by embedding similarity Sample prompt for a global enterprise knowledge base deployment: You are building a multilingual internal knowledge base for a global professional services firm. Using the harrier-oss-v1-0.6b endpoint deployed in Microsoft Foundry, encode all internal documents—policy guides, project case studies, and technical documentation—across English, French, German, and Japanese. At query time, prepend the task instruction to each employee query: "Instruct: Retrieve relevant internal documents that answer the employee's question\nQuery: {question}". Retrieve the top-5 most similar documents by cosine similarity and pass them to a language model with the instruction: "Using only the provided documents, answer the question and cite the source document title for each claim. If no document addresses the question, say so." NVIDIA: EGM-8B Model Specs Parameters / size: ~8.8B Context length: 262,144 tokens Primary task: Image-text-to-text (visual grounding) Why it's interesting Preforms well on visual grounding compared to larger models even at its small size: EGM-8B achieves 91.4 average Intersection over Union (IoU) on the RefCOCO benchmark—the standard measure of how accurately a model localizes a described region within an image. Compared to its base model Qwen3-VL-8B-Thinking (87.8 IoU), EGM-8B achieves a +3.6 IoU gain through targeted Reinforcement Learning (RL) fine-tuning. Benchmarks: EGM Project Page. 5.9x faster than larger models at inference: EGM-8B achieves 737ms average latency. The research demonstrates that test-time compute can be scaled horizontally across small models—generating many medium-quality responses and selecting the best—rather than relying on a single expensive forward pass through a large model. Two-stage training: EGM-8B is trained first with Supervised Fine-Tuning (SFT) on detailed chain-of-thought reasoning traces generated by a proprietary VLM, then refined with Group Relative Policy Optimization (GRPO) using a reward function combining IoU accuracy and task success. The intermediate SFT checkpoint is available as nvidia/EGM-8B-SFT for developers who want to experiment with the intermediate stage. Addresses a root cause of small model grounding errors: The EGM research identifies that 62.8% of small model errors on visual grounding stem from complex multi-relational descriptions—where a model must reason about spatial relationships, attributes, and context simultaneously. By focusing test-time compute on reasoning through these complex prompts, EGM-8B closes the gap without increasing the underlying model size. Try it Use Case Prompt Pattern Object localization Submit image + natural language description; receive bounding box coordinates Document region extraction Provide scanned document image + field description; extract specific regions Visual quality control Submit product image + defect description; localize defect region for downstream classification Retail shelf analysis Provide shelf image + product description; return location of specified SKU Sample prompt for a retail and logistics deployment: You are building a visual inspection system for a logistics warehouse. Using the EGM-8B endpoint deployed in Microsoft Foundry, submit each incoming package scan image along with a natural language grounding query describing the region of interest: "Please provide the bounding box coordinate of the region this sentence describes: {description}". For example: "the label on the upper-left side of the box", "the barcode on the bottom face", or "the damaged corner on the right side". Use the returned bounding box coordinates to route each package to the appropriate inspection station based on the identified region. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry405Views0likes0CommentsNow in Foundry: Cohere Transcribe, Nanbeige 4.1-3B, and Octen Embedding
This week's Model Mondays edition spans three distinct layers of the AI application stack: Cohere's cohere-transcribe, a 2B Automatic Speech Recognition (ASR) model that ranks first on the Open ASR Leaderboard across 14 languages; Nanbeige's Nanbeige4.1-3B, a compact 3B reasoning model that outperforms models ten times its size on coding, math, and deep-search benchmarks; and Octen's Octen-Embedding-0.6B, a lightweight text embedding model that achieves strong retrieval scores across 100+ languages and industry-specific domains. Together, these three models illustrate how developers can build full AI pipelines—from audio ingestion to language reasoning to semantic retrieval—entirely with open-source models deployed through Microsoft Foundry. Each operates in a different modality and fills a distinct architectural role, making this week's selection especially well-suited for teams assembling production-grade systems across speech, text, and search. Models of the week Cohere's cohere-transcribe-03-2026 Model Specs Parameters / size: 2B Primary task: Automatic Speech Recognition (audio-to-text) Why it's interesting Top-ranked on the Open ASR Leaderboard: cohere-transcribe-03-2026 achieves a 5.42% average Word Error Rate (WER) across 8 English benchmark datasets as of March 26, 2026—placing it first among open models. It reaches 1.25% WER on LibriSpeech Clean and 8.15% on AMI (meeting transcription), demonstrating consistent accuracy across both clean speech and real-world, multi-speaker environments. Benchmarks: Open ASR Leaderboard. 14 languages with a dedicated encoder-decoder architecture: The model uses a large Conformer encoder for acoustic representation extraction paired with a lightweight Transformer decoder for token generation, trained from scratch on 14 languages covering European, East Asian (Chinese Mandarin, Japanese, Korean, Vietnamese), and Arabic. Unlike general-purpose models adapted for ASR, this dedicated architecture makes it efficient without sacrificing accuracy. Long-form audio with automatic chunking: Audio longer than 35 seconds is automatically split into overlapping chunks and reassembled into a coherent transcript—no manual preprocessing required. Batched inference, punctuation control, and per-language configuration are all supported through the standard API. Try it Click on the window above, upload an audio file, and watch how quickly the model transcribes it for you. Or click the link to experiment with the Cohere Transcribe Space and record audio directly from your device. Use Case Prompt Pattern Meeting transcription Submit recorded audio with language tag; retrieve timestamped transcript per speaker turn Call center quality review Batch-process customer call recordings, extract transcript, pass to classification model Medical documentation Transcribe clinical encounters; feed transcript into summarization or structured note pipeline Multilingual content indexing Process podcasts or video audio in any of 14 supported languages; store as searchable text Sample prompt for a legal services deployment: You are building a contract negotiation assistant. A client submits a recorded audio of a 45-minute supplier negotiation call. Using the cohere-transcribe-03-2026 endpoint deployed in Microsoft Foundry, transcribe the call with punctuation enabled for the English audio. Once the transcript is available, pass it to a downstream language model with the following instruction: "Identify all pricing commitments, delivery deadlines, and liability clauses mentioned in this negotiation transcript. For each, note the speaker's position (client or supplier) and flag any terms that appear ambiguous or require legal review." Nanbeige's Nanbeige4.1-3B Model Specs Parameters / size: 3B Context length: 131,072 tokens Primary task: Text generation (reasoning, coding, tool use, deep search) Why it's interesting Reasoning performance that exceeds its size class: Nanbeige4.1-3B scores 76.9 on LiveCodeBench-V6, these results suggest that targeted post-training using Supervised Fine-Tuning (SFT) and Reinforcement Learning (RL) on a focused dataset can yield improvements that scale-based approaches cannot replicate at equivalent parameter counts. Read the technical report: https://huggingface.co/papers/2602.13367. Strong preference alignment at the 3B scale: On Arena-Hard-v2, Nanbeige4.1-3B scores 73.2, compared to 56.0 for Qwen3-32B and 60.2 for Qwen3-30B-A3B—both significantly larger models. This indicates that the model's outputs consistently match human preference for response quality and helpfulness, not just accuracy on structured tasks. Deep-search capability previously absent from small general models: On xBench-DeepSearch-2505, Nanbeige4.1-3B scores 75—matching search-specialized small agents. The model can sustain complex agentic tasks involving more than 500 sequential tool invocations, a capability gap that previously required either specialized search agents or significantly larger models. Native tool-use support: The model's chat template and generation pipeline natively support tool call formatting, making it straightforward to connect to external APIs and build multi-step agentic workflows without additional scaffolding. Try it Use Case Prompt Pattern Code review and fix Provide failing test + stack trace; ask model to diagnose root cause and write corrected implementation Competition-style math Submit problem as structured prompt; use temperature 0.6, top-p 0.95 for consistent reasoning steps Agentic task execution Provide tool definitions as JSON + goal; let model plan and execute tool calls sequentially Long-document Q&A Pass full document (up to 131K tokens) with targeted factual questions; extract structured answers Sample prompt for a software engineering deployment: You are automating pull request review for a backend engineering team. Using the Nanbeige4.1-3B endpoint deployed in Microsoft Foundry, provide the model with a unified diff of a proposed code change and the following system instruction: "You are a senior software engineer reviewing a pull request. For each modified function: (1) summarize what the change does, (2) identify any edge cases that are not handled, (3) flag any security or performance regressions relative to the original, and (4) suggest a specific improvement if one is warranted. Format your output as a structured list per function." Octen's Octen-Embedding-0.6B Model Specs Parameters / size: 0.6B Context length: 32,768 tokens Primary task: Text embeddings (semantic search, retrieval, similarity) Why it's interesting Retrieval performance above larger proprietary models at 0.6B: On the RTEB (Retrieval Text Embedding Benchmark) public leaderboard, Octen-Embedding-0.6B achieves a mean task score of 0.7241—above voyage-3.5 (0.7139), Cohere-embed-v4.0 (0.6534), and text-embedding-3-large (0.6110), despite being a fraction of their parameter count. The model is fine-tuned from Qwen3-Embedding-0.6B via Low-Rank Adaptation (LoRA), demonstrating that targeted fine-tuning on retrieval-specific data can close the gap with larger embedding models. Vertical domain coverage across legal, finance, healthcare, and code: Octen-Embedding-0.6B was trained with explicit coverage of domain-specific retrieval scenarios—legal document matching, financial report Q&A, clinical dialogue retrieval, and code search including SQL. This makes it suitable for regulated-industry applications where generic embedding models tend to underperform on specialized terminology. 32,768-token context for long-document retrieval: The extended context window supports encoding entire legal contracts, earnings reports, or clinical case notes as single embeddings—removing the need to chunk long documents and re-aggregate scores at query time, which can introduce ranking errors. 100+ language support with cross-lingual retrieval: The model handles multilingual and cross-lingual retrieval natively, with strong coverage across languages including English, Chinese, and other major languages via its Qwen3-based architecture—practical for global enterprise applications that span multiple languages. Use Case Prompt Pattern Semantic search Encode user query and document corpus; rank documents by cosine similarity to query embedding Legal precedent retrieval Embed case briefs and query with legal question; retrieve most semantically relevant precedents Cross-lingual document search Encode multilingual document set; submit query in any supported language for cross-lingual retrieval Financial Q&A pipeline Embed earnings reports or filings; retrieve relevant passages to ground downstream language model responses Sample prompt for a global enterprise knowledge base deployment: You are building a clinical decision support tool. Using the Octen-Embedding-0.6B endpoint deployed in Microsoft Foundry, embed a corpus of 10,000 clinical case notes at ingestion time and store the resulting 1024-dimensional vectors in a vector database. At query time, encode an incoming patient presentation summary and retrieve the 5 most semantically similar historical cases. Pass the retrieved cases and the current presentation to a language model with the following instruction: "Based on these five similar cases and their documented outcomes, summarize the most common treatment approaches and flag any cases where the outcome differed significantly from the initial prognosis." Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. You can also start from the Hugging Face Hub. First, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure with secure, scalable inference already configured. Learn how to discover models and deploy them using Microsoft Foundry documentation: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry412Views1like0CommentsAnnouncing Fireworks AI on Microsoft Foundry
We’re excited to announce that starting today, Microsoft Foundry customers can access high performance, low latency inference performance of popular open models hosted on the Fireworks cloud from their Foundry projects, and even deploy their own customized versions, too! As part of the Public Preview launch, we’re offering the most popular open models for serverless inference in both pay-per-token (US Data Zone) and provisioned throughput (Global Provisioned Managed) deployments. This includes: Minimax M2.5 🆕 OpenAI’s gpt-oss-120b MoonshotAI’s Kimi-K2.5 DeepSeek-v3.2 For customers that have been looking for a path to production with models they’ve post-trained, you can now import your own fine-tuned versions of popular open models and deploy them at production scale with Fireworks AI on Microsoft Foundry. Serverless (pay-per-token) For customers wanting per-token pricing, we’re launching with Data Zone Standard in the United States. You can make model deployments for Foundry resources in the following regions: East US East US 2 Central US North Central US West US West US 3 Depending on your Azure subscription type, you’ll automatically receive either a 250K or 25K tokens per minute (TPM) quota limit per region and model. (Azure Student and Trial subscriptions will not receive quota at this time.) Per-token pricing rates include input, cached input, and output tokens priced per million tokens. Model Input Tokens ($/1M tokens) Cached Tokens ($/1M tokens) Output Tokens ($/1M tokens) gpt-oss-120b $0.17 $0.09 $0.66 kimi-k2.5 $0.66 $0.11 $3.30 deepseek-v3.2 $0.62 $0.31 $1.85 minimax-m2.5 $0.33 $0.03 $1.32 As we work together with Fireworks to launch the latest OSS models, the supported models will evolve as popular research labs push the frontier! Provisioned Throughput For customers looking to shift or scale production workloads on these models, we’re launching with support for Global provisioned throughput. (Data Zone support will be coming soon!) Provisioned throughput for Fireworks models works just like it does for Foundry models: PTUs are designed to deliver consistent performance in terms of time between token latency. Your existing quota for Global PTUs works as does any reservation commitments! gpt-oss-120b Kimi-K2.5 DeepSeek-v3.2 MiniMax-M2.5 Global provisioned minimum deployment 80 800 1,200 400 Global provisioned scale increment 40 400 600 200 Input TPM per PTU 13,500 530 1,500 3,000 Latency Target Value 99% > 50 Tokens Per Second^ 99% > 50 Tokens Per Second^ 99% > 50 Tokens Per Second^ 99% > 50 Tokens Per Second^ ^ Calculated as p50 request latency on a per 5 minute basis. Custom Models Have you post-trained a model like gpt-oss-120b for your particular use case? With Fireworks on Foundry you can deploy, govern, and scale your custom models all within your Foundry project. This means full fine-tuned versions of models from the following families can be imported and deployed as part of preview: Qwen3-14B OpenAI gpt-oss-120b Kimi K2 and K2.5 DeepSeek v3.1 and v3.2 The new Custom Models page in the Models experience lets you initiate the import process for copying your model weights into your Foundry project. > Models -> Custom Models. For performing a high-speed transfer of the files into Foundry, we’ve added a new feature to Azure Developer CLI (azd) for facilitating the transfer of a directory of model weights. The Foundry UI will give you cli arguments to copy and paste for quickly running azd ai models create pointed to your Foundry project. Enabling Fireworks AI on Microsoft Foundry in your Subscription While in preview, customers must opt-in to integrate their Microsoft Foundry resources with the Fireworks inference cloud to perform model deployments and send inference requests. Opt-in is self-service and available in the Preview features panel within your Azure portal. For additional details on finding and enabling the preview feature, please see the new product documentation for Fireworks on Foundry. Frequently Asked Questions How are Fireworks AI on Microsoft Foundry models different than Foundry Models? Models provided direct from Azure include some open-source models as well as proprietary models from labs like Black Forest Labs, Cohere, and xAI, and others. These models undergo rigorous model safety and risks assessments based on Microsoft’s Responsible AI standard. For customers needing the latest open-source models from emerging frontier labs, break-neck speed, or the ability to deploy their own post-trained custom models, Fireworks delivers best-in-class inference performance. Whether you’re focused on minimizing latency or just staying ahead of the trends, Fireworks AI on Microsoft Foundry gives you additional choice in the model catalog. Still need to quantify model safety and risk? Foundry provides a suite of observability tools with built-in risk and safety evaluators, letting you build AI systems confidently. How is model retirement handled? Customers using serverless per-token offers of models via Fireworks on Foundry will receive notice no less than 30 days before potential model retirement. You’ll be recommended to upgrade to either an equivalent, longer-term supported Azure Direct model or a newer model provided by Fireworks. For customers looking to use models beyond the retirement period, they may do so via Provisioned throughput deployments. How can I get more quota? For TPM quota, you may submit requests via our current Fireworks on Foundry quota form. For PTU quota, please contact your Microsoft account team. Can you support my custom model? Let’s talk! In general, if your model meets Fireworks’ current requirements, we have you covered. You can either reach out to your Microsoft account team or your contacts you may already have with Fireworks.2.1KViews1like1CommentMicrosoft Foundry: Unlock Adaptive, Personalized Agents with User-Scoped Persistent Memory
From Knowledgeable to Personalized: Why Memory Matters Most AI agents today are knowledgeable — they ground responses in enterprise data sources and rely on short‑term, session‑based memory to maintain conversational coherence. This works well within a single interaction. But once the session ends, the context disappears. The agent starts fresh, unable to recall prior interactions, user preferences, or previously established context. In reality, enterprise users don’t interact with agents exclusively in one‑off sessions. Conversations can span days, weeks, evolving across multiple interactions rather than isolated sessions. Without a way to persist and safely reuse relevant context across interactions, AI agents remain efficient in the short term be being stateful within a session, but lose continuity over time due to their statelessness across sessions. Bridging this gap between short-term efficiency and long‑term adaptation exposes a deeper challenge. Persisting memory across sessions is not just a technical decision; in enterprise environments, it introduces legitimate concerns around privacy, data isolation, governance, and compliance — especially when multiple users interact with the same agent. What seems like an obvious next step quickly becomes a complex architectural problem, requiring organizations to balance the ability for agents to learn and adapt over time with the need to preserve trust, enforce isolation boundaries, and meet enterprise compliance requirements. In this post, I’ll walk through a practical design pattern for user‑scoped persistent memory, including a reference architecture and a deployable sample implementation that demonstrates how to apply this pattern in a real enterprise setting while preserving isolation, governance, and compliance. The Challenge of Persistent Memory in Enterprise AI Agents Extending memory beyond a single session seems like a natural way to make AI agents more adaptive. Retaining relevant context over time — such as preferences, prior decisions, or recurring patterns — would allow an agent to progressively tailor its behavior to each user, moving from simple responsiveness toward genuine adaptation. In enterprise environments, however, persistence introduces a different class of risk. Storing and reusing user context across interactions raises questions of privacy, data isolation, governance, and compliance — particularly when multiple users interact with shared systems. Without clear ownership and isolation boundaries, naïvely persisted memory can lead to cross‑user data leakage, policy violations, or unclear retention guarantees. As a result, many systems default to ephemeral, session‑only memory. This approach prioritizes safety and simplicity — but does so at the cost of long‑term personalization and continuity. The challenge, then, is not whether agents should remember, but how memory can be introduced without violating enterprise trust boundaries. Persistent Memory: Trade‑offs Between Abstraction and Control As AI agents evolve toward more adaptive behavior, several approaches to agent memory are emerging across the ecosystem. Each reflects a different set of trade-offs between abstraction, flexibility, and control — making it useful to briefly acknowledge these patterns before introducing the design presented here. Microsoft Foundry Agent Service includes a built‑in memory capability (currently in Preview) that enables agents to retain context beyond a single interaction. This approach integrates tightly with the Foundry runtime and abstracts much of the underlying memory management, making it well suited for scenarios that align closely with the managed agent lifecycle. Another notable approach combines Mem0 with Azure AI Search, where memory entries are stored and retrieved through vector search. In this model, memory is treated as an embedding‑centric store that emphasizes semantic recall and relevance. Mem0 is intentionally opinionated, defining how memory is structured, summarized, and retrieved to optimize for ease of use and rapid iteration. Both approaches represent meaningful progress. At the same time, some enterprises require an approach where user memory is explicitly owned, scoped, and governed within their existing data architecture — rather than implicitly managed by an agent framework or memory library. These requirements often stem from stricter expectations around data isolation, compliance, and long‑term control. User-Scoped Persistent Memory with Azure Cosmos DB The solution presented in this post provides a practical reference implementation for organizations that require explicit control over how user memory is stored, scoped, and governed. Rather than embedding long‑term memory implicitly within the agent runtime, this design models memory as a first‑class system component built on Azure Cosmos DB. At a high level, the architecture introduces user‑scoped persistent memory: a durable memory layer in which each user’s context is isolated and managed independently. Persistent memory is stored in Azure Cosmos DB containers partitioned by user identity and consists of curated, long‑lived signals — such as preferences, recurring intent, or summarized outcomes from prior interactions — rather than raw conversational transcripts. This keeps memory intentional, auditable, and easy to evolve over time. Short‑term, in‑session conversation state remains managed by Microsoft Foundry on the server side through its built‑in conversation and thread model. By separating ephemeral session context from durable user memory, the system preserves conversational coherence while avoiding uncontrolled accumulation of long‑term state within the agent runtime. This design enables continuity and personalization across sessions while deliberately avoiding the risks associated with shared or global memory models, including cross‑user data leakage, unclear ownership, and unintended reuse of context. Azure Cosmos DB provides enterprises with direct control over memory isolation, data residency, retention policies, and operational characteristics such as consistency, availability, and scale. In this architecture, knowledge grounding and memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data sources. User‑scoped persistent memory ensures relevance by tailoring interactions to the individual user over time. Together, they enable trustworthy, adaptive AI agents that improve with use — without compromising enterprise boundaries. Architecture Components and Responsibilities Identity and User Scoping Microsoft Entra ID (App Registrations) — provides the frontend a client ID and tenant ID so the Microsoft Authentication Library (MSAL) can authenticate users via browser redirect. The oid (Object ID) claim from the ID token is used as the user identifier throughout the system. Agent Runtime and Orchestration Microsoft Foundry — serves as the unified AI platform for hosting models, managing agents, and maintaining conversation state. Foundry manages in‑session and thread‑level memory on the server side, preserving conversational continuity while keeping ephemeral context separate from long‑term user memory. Backend Agent Service — implements the AI agent using Microsoft Foundry’s agent and conversation APIs. The agent is responsible for reasoning, tool‑calling decisions, and response generation, delegating memory and search operations to external MCP servers. Memory and Knowledge Services MCP‑Memory — MCP server that hosts tools for extracting structured memory signals from conversations, generating embeddings, and persisting user‑scoped memories. Memories are written to and retrieved from Azure Cosmos DB, enforcing strict per‑user isolation. MCP‑Search — MCP server exposing tools for querying enterprise knowledge sources via Azure AI Search. This separation ensures that knowledge grounding and memory retrieval remain distinct concerns. Azure Cosmos DB for NoSQL — provides the durable, serverless document store for user‑scoped persistent memory. Memory containers are partitioned by user ID, enabling isolation, auditable access, configurable retention policies, and predictable scalability. Vector search is used to support semantic recall over stored memory entries. Azure AI Search — supplies hybrid retrieval (keyword and vector) with semantic reranking over the enterprise knowledge index. An integrated vectorizer backed by an embedding model is used for query‑time vectorization. Models text‑embedding‑3‑large — used for generating vector embeddings for both user‑scoped memories and enterprise knowledge search. gpt‑5‑mini — used for lightweight analysis tasks, such as extracting structured memory facts from conversational context. gpt‑5.1 — powers the AI agent, handling multi‑turn conversations, tool invocation, and response synthesis. Application and Hosting Infrastructure Frontend Web Application — a React‑based web UI that handles user authentication and presents a conversational chat interface. Azure Container Apps Environment — provides a shared execution environment for all services, including networking, scaling, and observability. Azure Container Apps — hosts the frontend, backend agent service, and MCP servers as independently scalable containers. Azure Container Registry — stores container images for all application components. Try It Yourself Demonstration of user‑scoped persistent memory across sessions. To make these concepts concrete, I’ve published a working reference implementation that demonstrates the architecture and patterns described above. The complete solution is available in the Agent-Memory GitHub repository. The repository README includes prerequisites, environment setup notes, and configuration details. Start by cloning the repository and moving into the project directory: git clone https://github.com/mardianto-msft/azure-agent-memory.git cd azure-agent-memory Next, sign in to Azure using the Azure CLI: az login Then authenticate the Azure Developer CLI: azd auth login Once authenticated, deploy the solution: azd up After deployment is complete, sign in using the provided demo users and interact with the agent across multiple sessions. Each user’s preferences and prior context are retained independently, the interaction continues seamlessly after signing out and returning later, and user context remains fully isolated with no cross‑identity leakage. The solution also includes a knowledge index initialized with selected Microsoft Outlook Help documentation, which the agent uses for knowledge grounding. This index can be easily replaced or extended with your own publicly accessible URLs to adapt the solution to different domains. Looking Ahead: Personalized Memory as a Foundation for Adaptive Agents As enterprise AI agents evolve, many teams are looking beyond larger models and improved retrieval toward human‑centered personalization at scale — building agents that adapt to individual users while operating within clearly defined trust boundaries. User‑scoped persistent memory enables this shift. By treating memory as a first‑class, user‑owned component, agents can maintain continuity across sessions while preserving isolation, governance, and compliance. Personalization becomes an intentional design choice, aligning with Microsoft’s human‑centered approach to AI, where users retain control over how systems adapt to them. This solution demonstrates how knowledge grounding and personalized memory serve complementary roles. Knowledge grounding ensures correctness by anchoring responses in trusted enterprise data. Personalized memory ensures relevance by tailoring interactions to the individual user. Together, they enable context‑aware, adaptive, and personalized agents — without compromising enterprise trust. Finally, this solution is intentionally presented as a reference design pattern, not a prescriptive architecture. It offers a practical starting point for enterprises designing adaptive, personalized agents, illustrating how user‑scoped memory can be modeled, governed, and integrated as a foundational capability for scalable enterprise AI.525Views1like1CommentHow Veris AI and Lume Security built a self-improving AI agent with Microsoft Foundry
Introduction AI agents are slowly moving from demos into production, where real users, messy systems, and long-tail edge cases lead to new unseen failure modes. Production monitoring can surface issues, but converting those into reliable improvements is slow and manual, bottlenecked by the low volume of repeatable failures, engineering time, and risk of regression on previously working cases. We show how a high-fidelity simulation environment built by Veris AI on Microsoft Azure can expand production failures into families of realistic scenarios, generating enough targeted data to optimize agent behavior through automated context engineering and reinforcement learning, all while not regressing on any previous issue. We demonstrate this on a security agent built by Lume Security on Microsoft Foundry. Lume creates an institutionally grounded security intelligence graph that captures how organizations actually work—this intelligence graph then powers deterministic, policy-aligned agents that reason alongside the user and take trusted action across security, compliance, and IT workflows. Lume helps modern security teams scale expertise without scaling headcount. Its Security Intelligence Platform builds a continuously learning security intelligence graph that reflects how work actually gets done. It reasons over policies, prior tickets, security findings, tool configurations, documentation, and system activity to retain institutional memory and drive consistent response. On this foundation, Lume delivers context-aware security assistants that fetch the most relevant context at the right time, produce policy-aligned recommendations, and execute deterministic actions with full explainability and audit trails. These assistants are embedded in tools teams are already using like ServiceNow, Jira, Slack, and Teams so the experience is seamless and provides value from day one. Microsoft Foundry and Veris AI together provide Lume with a secure, repeatable control plane that makes model iteration, safety, and simulation-driven evaluation practical. Vendor flexibility. Swap or test different models from OpenAI, Anthropic, Meta, and many others, with no infra changes. Fast model rollout. Provide access to the latest models as soon as they are released, making experimentations and updates easy. Consistent safety. Built-in policy and guardrail tooling enforces the same checks across experiments, cutting bespoke guardrail work. Enterprise privacy. Models run in private Azure instances and are not trained on client data, which simplifies and shortens AI security reviews. Made evaluation practical. Centralized models, logging, and policies let Veris run repeatable simulations and feed targeted failure cases back into Lume’s improvement loop. Simulation-driven evaluation. Run repeatable, high-fidelity simulations to stress-test and automatically surface failure modes before production. Agent optimization. Turn the graded failures into upgrades through automatic prompt fixes and targeted fine-tuning/RFT. The Lume Solution: Contextual Intelligence for Security Team Members Security teams are burdened by the time-consuming process of gathering necessary context from various siloed systems, tools, and subject-matter experts. This fragmented approach creates significant latency in decision-making, leads to frequent escalations, and drives up operational costs. Furthermore, incomplete or inaccurate context often results in inconsistent responses and actions that fail to fully mitigate risks. Lume unifies an organization’s fragmented data sources into a security intelligence graph. It also provides purpose-built assistants, embedded in tooling the team already uses, that can fetch relevant content from across the org, produce explainable recommendations, execute actions with full audit trails, and update data sources when source context is missing or misleading. The result is faster, more consistent decisions and fewer avoidable escalations. With just a couple integrations into ticketing and collaboration tools, Lume begins to prioritize where teams can gain the most efficiencies and risk reduction from context intelligence and automation - and automatically builds out assistants that can help. Search. Aggregate prior requests, owners, org, access, policy, live intel, and SIEM. Plan. Produce an institutionally grounded action plan. Review. Present a single decision-ready view with full context to approvers, modifying the plan based on their input. Act. Execute pre-approved changes with full explainability and audit trail. Close the loop. Notify stakeholders, update the graph and docs, log outcomes. Validation with early customers has revealed Lume’s security intelligence and assistants potential to truly change the way enterprises deal with security analysis and tasks. 35–55% less time on routine requests. Measurements with early customers shows the assistant and access to institutional intelligence reduces the time security analysts spend on recurring intake and tactical tasks, freeing staff for higher-value work. Faster and more confident decision making. Qualitative feedback from security team members reveals they feel more confident and can act faster when using the assistant, while also feeling less burdened knowing Lume will help ensure the task is resolved and documented. Improved institutional memory. Every decision, rationale, and action is captured and surfaced in the security intelligence graph, increasing repeatability and reducing future rework—this captured information also updates and cleanses existing documentation to ensure institutional knowledge aligns with current practice. Proactively identify & prioritize opportunities. Lume first identifies and prioritizes the tasks and processes where intelligence and assistants can have the greatest impact—measuring the actual ROI for security teams. Meaningful deflection of routine requests. The assistant can respond, plan, and execute actions (with human review + approval), deflecting common escalations and reducing load on subject matter experts. Architected on Azure By implementing this system on Azure stack, Lume and Veris benefited from the flexibility and breadth of services enabling this large scale implementation. This architecture allows the agent to evolve independently across reasoning, retrieval, and action layers without destabilizing production behavior. Microsoft Foundry orchestrates model usage, prompt execution, safety policies, and evaluation hooks. Azure AI Search provides hybrid retrieval across structured documents, policies, and unstructured artifacts. Vector storage enables semantic retrieval of prior tickets, decisions, and organizational knowledge. Graph databases capture relationships between systems, controls, owners, and historical decisions, allowing the agent to reason over organizational structure rather than isolated facts. Azure Kubernetes Service (AKS) hosts the agent runtime and evaluation workloads, enabling horizontal scaling and isolated experiments. Azure Key Vault manages secrets, credentials, and API access securely. Azure Web App Services power customer-facing interfaces and internal dashboards for visibility and control. Evaluating and ultimately improving agents is still one of the hardest parts of the stack. Static datasets don’t solve this, because agents are not static predictors. They are dynamic, multi-turn systems that change the world as they act: they call tools, accumulate state, recover (or fail to recover) from errors, and face nondeterministic user behavior. A golden dataset might grade a single response as “correct,” while completely missing whether the agent actually achieved the right end state across systems. In other words: static evals grade answers; agent evals must grade outcomes. At the same time, getting enough real-world data to drive improvement is perpetually difficult. The most valuable failures are rare, long-tail, and hard to reproduce on demand. Iterating directly in production is not possible because of safety, compliance, and customer-experience risk. That’s why environments are essential: a place to test agents safely, generate repeatable experience, and create the signals that allows developers to improve behavior without gambling on live users. Veris AI is built around that premise. It is a high-fidelity simulation platform that models the world around an agent with mock tools, realistic state transitions, and simulated users with distinct behaviors. From a single observed production failure, Veris can reconstruct what happened, expand it into a family of realistic scenario variants, and then stress-test the agent across that scenario set. Those runs are evaluated with a mix of LLM-based judges and code-based verifiers that score full trajectories not just the final text. Crucially, Veris does not stop at measurement. Its optimization module uses those grader signals to improve the agent with automatically refining prompts and supporting reinforcement-learning style updates (e.g., RFT) in a closed loop. The result is a workflow where one production incident can become a repeatable training and regression suite, producing targeted improvements on the failure mode while protecting performance on everything that already worked. Veris AI environment is available on Azure Kubernetes Service (AKS) and can be easily deployed in a user Virtual Network. Under the Hood: Building a self-improving agent When a failure appears in production, the improvement loop starts from the trace. Veris takes the failed session logs from the observability pipeline, then an LLM-based evaluator pinpoints what actually went wrong and automatically writes a new targeted evaluation rubric for that specific failure mode. From there, Veris simulator expands the single incident into a scenario family, dozens of realistic variants with branching, so the agent can be trained and re-tested against the whole “class” of the failure, not just one example. This is important as failure modes are often sparse in the real-world. Those scenarios are executed in the simulation engine with mocked tools and simulated users, producing full interaction traces that are then scored by the evaluation engine. The scores become the training signal for optimization. Veris optimization engine uses the evaluation results, the original system prompt, and the new rubric to generate an updated prompt designed to fix the failure without breaking previously-good behavior. Then it validates the new prompt on both (a) the specialized scenario set created from the incident and (b) a broader regression held-out set suite to ensure the improvement generalizes and does not cause regressions. In this case study, we focused on a key failure mode of incorrect workflow classification at the triage step for approval-driven access requests. In these cases, tickets were routed into an escalation or in-progress workflow instead of the approval-validation path. The ticket often contained conflicting approval or rejection signals in human comments - manager approved but they required some additional information, a genuine occurrence in org related jira workflows. The triage agent failed to recognize them and misclassified the workflow state. Since triage determines what runs next, a single misclassification at the start was enough to bypass the Approval Agent and send the request down an incorrect downstream path. Results & Impact By running the optimization loop, we were able to improve agent accuracy on this issue by over 40%, while not regressing on any other correct behavior. We ran experiments on a dataset only containing scenarios with the issue and a more general dataset encompassing a variety of scenarios. The experiment results on both datasets and updated prompt is shown below. Continuing with this over any issues that arise in production or simulation will continuously improve the agent performance. Takeaways This collaboration shows a practical pattern for taking an enterprise agent from “works in a demo” to “improves safely in production”, pairing an orchestration layer that standardizes model usage, safety, logging, and evaluation with a simulation environment where failures can be reproduced and fixed without risking real users, then making it repeatable in practice. In this stack, Veris AI provides the simulations, trajectory grading, and optimization loop, while Microsoft Foundry operationalizes the workflow with vendor-flexible model iteration, consistent policy enforcement, enterprise privacy, and centralized evaluation hooks that turn testing into a first-class system instead of bespoke glue code. The result is an improved and reliable Lume assitant that can help enterprises spend 35–55% less time on routine requests, and meaningfully deflect repetitive escalations, requiring fewer clarification cycles, enabling faster response times, and a stronger institutional memory where decisions and rationales compound over time instead of getting lost across tools and tickets. The self-improving loop continuously improves the agent, starting from a production trace, by generating a targeted rubric, expanding the incident into a scenario family, running it end-to-end with mocked tools and simulated users, scoring full trajectories, then using those scores to produce a safer prompt/model update and validating it against both the failure set and a broader regression suite. This turns rare long-tail failures into repeatable training and regression assets. If you’re building an AI agent, the recommendation is straightforward. Invest early in an orchestration and safety layer, as well as an environment-driven evaluation that can create an improvement loop to ship fixes without regressions. This way, your production failures act as the highest-signal input to continuously harden the system. To learn more or start building, teams can explore the Veris Console for free or browse the Veris documentation . In this stack, Microsoft Foundry provides the orchestration and safety control plane, and Veris provides the simulation, evaluation, and optimization loop needed to make agents improve safely over time. Learn more about the Lume Security assistant here.248Views1like0CommentsNow in Foundry: NVIDIA Nemotron-3-Super-120B-A12B, IBM Granite-4.0-1b-Speech, and Sarvam-105B
This week's Model Mondays edition highlights three models now available in Hugging Face collection on Microsoft Foundry: NVIDIA's Nemotron-3-Super-120B-A12B, a hybrid Latent Mixture-of-Experts (MOE) model with 12B active parameters and context handling up to 1 million tokens; IBM Granite's Granite-4.0-1b-Speech, a compact Automatic Speech Recognition (ASR) and Automatic Speech Translation (AST) model that achieves a 5.52% average Word Error Rate (WER) at 280× real-time speed with runtime keyword biasing for domain adaptation; and Sarvam's Sarvam-105B, a 105B Mixture-of-Experts (MoE) model with 10.3B active parameters optimized for complex reasoning and 22 Indian languages, with comparable agentic performance compared to other larger proprietary models on web search and task-planning benchmarks. Models of the week NVIDIA Nemotron-3-Super-120B-A12B Model Specs Parameters / size: 120B total with 12B active Context length: Up to 1M tokens Primary task: Text generation (reasoning, agentic workflows, long-context tasks, tool use, RAG) Why it's interesting (Spotlight) Hybrid Latent MoE architecture with selective attention: Nemotron-3-Super combines interleaved Mamba-2 state-space layers and sparse MoE layers with a select number of full attention layers—a design called Latent MoE. Tokens are routed into a smaller latent space for computation, which improves accuracy per parameter while keeping only 12B parameters active at inference time. Multi-Token Prediction (MTP) heads where the model simultaneously predicts multiple upcoming tokens during training enable native speculative decoding, reducing time-to-first-token on long outputs without a separate draft model. Configurable reasoning mode: The model supports toggling extended chain-of-thought reasoning on or off via the chat template flag enable_thinking. This lets developers suppress the reasoning trace for latency-sensitive tasks while keeping it available for high-stakes or multi-step agentic use cases without loading a separate model. Sustained 1M-token context reliability: On RULER, the standard long-context evaluation suite, Nemotron-3-Super achieves 91.75% at 1M tokens. This makes it practical for full-document retrieval-augmented generation (RAG), long-form code analysis, and extended agentic sessions without chunking or windowing strategies. Try it Use cases Best practices Ultra‑long document ingestion & consolidation (e.g., end‑to‑end review of massive specs, logs, or multi‑volume manuals without chunking) Use the native 1M‑token context to avoid windowing strategies; feed full corpora in one pass to reduce stitching errors. Prefer default decoding for general analysis (NVIDIA recommends temperature≈1.0, top_p≈0.95) before tuning; this aligns with the model’s training and MTP‑optimized generation path. Leverage MTP for throughput (multi‑token prediction improves output speed on long outputs), making single‑pass synthesis practical at scale. Latency‑sensitive chat & tool‑calling at scale (e.g., high‑volume enterprise assistants where response time matters) Toggle reasoning traces intentionally via the chat template (enable_thinking on/off): turn off for low‑latency interactions; on for harder prompts where accuracy benefits from explicit reasoning. Use model‑recommended sampling for tool calls (many guides tighten temperature for tool use) to improve determinism while keeping top_p near 0.95. Rely on the LatentMoE + MTP design to sustain high tokens/sec under load instead of adding a draft model for speculative decoding. IBM Granite-4.0-1b-Speech Model Specs Parameters / size: ~1B Context length: 128K tokens (LLM backbone; audio processed per utterance through the speech encoder) Primary task: Multilingual Automatic Speech Recognition (ASR) and bidirectional Automatic Speech Translation (AST) Why it's interesting (Spotlight) Compact ASR with speculative decoding at near-real-time speed: At roughly 1B parameters, Granite-4.0-1b-Speech achieves a 5.52% average WER across eight English benchmarks at 280× real-time speed (RTFx—the ratio of audio duration processed to wall-clock time) on the Open ASR Leaderboard. Runtime keyword biasing for domain adaptation without fine-tuning: Granite-4.0-1b-Speech accepts a runtime keyword list—proper nouns, brand names, technical terms, acronyms—that adjusts decoding probabilities toward those terms. This allows domain-specific vocabulary to be injected at inference time rather than requiring a fine-tuning run, practical for legal transcription, medical dictation, or financial meeting notes where terminology changes across clients. Bidirectional speech translation across 6 languages in one model: Beyond ASR, the model supports translation both to and from English for French, German, Spanish, Portuguese, and Japanese, plus English-to-Italian and English-to-Mandarin. A single deployed endpoint handles ASR and AST tasks without routing audio to separate models, reducing infrastructure surface area. Try it Test the model in the Hugging Face space before deploying in Foundry here: Sarvam’s Sarvam-105B Model Specs Parameters / size: 105B total with 10.3B active (Mixture of Experts, BF16) Context length: 128K tokens (with YaRN-based long-context extrapolation, scale factor 40) Primary task: Text generation (reasoning, coding, agentic tasks, Indian language understanding) Why it's interesting (Spotlight) Broad Indian language coverage at scale: Sarvam-105B supports English and 22 Indian languages—Hindi, Bengali, Tamil, Telugu, Marathi, Gujarati, Kannada, Malayalam, Punjabi, Odia, Assamese, Urdu, Sanskrit, Maithili, Dogri, Manipuri, Santali, Kashmiri, Nepali, Sindhi, Konkani, and Tibetan—the broadest open-model coverage for this language set at this parameter range. Training explicitly prioritized the Indian context, resulting in reported state-of-the-art performance across these languages for models of comparable size. Strong agentic and web-search performance: Sarvam-105B scores 49.5% on BrowseComp (web research benchmark with search tool access)—substantially above GLM-4.5-Air (21.3%) and Qwen3-Next-80B-A3B-Thinking (38.0%). It also achieves 68.3% average on τ² Bench (multi-domain task-planning benchmark), above GPT-OSS-120B (65.8%) and GLM-4.5-Air (53.2%). This reflects training emphasis on multi-step agentic workflows in addition to standard reasoning. Try it Use cases Best practices Agentic web research & technical troubleshooting (multi-step reasoning, planning, troubleshooting) Use longer context when needed: the model is designed for long-context workflows (up to 128K context with YaRN-based extrapolation noted). Start from the model’s baseline decoding settings (as shown in the model’s sample usage) and adjust for your task: temperature ~0.8, top_p ~0.95, repetition_penalty ~1.0, and set an explicit max_new_tokens (sample shows 2048). Suggestion (general, not stated verbatim in the sources): For agentic tasks, keep the prompt structured (goal → constraints → tools available → required output format), and ask for a short plan + final answer to reduce wandering. Multilingual (Indic) customer support & content generation (English + 22 Indian languages; native-script / romanized / code-mixed inputs) Be explicit about the language/script you want back (e.g., Hindi in Devanagari vs romanized Hinglish), since training emphasized Indian languages and code-mixed/romanized inputs. Provide in-language examples (a short “good response” example in the target language/script) to anchor tone and terminology. (Suggestion—general best practice; not stated verbatim in sources.) Use the model’s baseline generation settings first (sample decoding params) and then tighten creativity for support use cases (e.g., lower temperature) if you see variability. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. Or start from the Hugging Face Hub and choose the "Deploy on Microsoft Foundry" option, which brings you straight into Foundry. Learn how to discover models and deploy them using Microsoft Foundry here: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry382Views0likes0CommentsIntroducing OpenAI’s GPT-5.4 mini and GPT-5.4 nano for low-latency AI
Imagine you’re a developer building a research assistant agent on top of GPT‑5.4. The agent retrieves documents, summarizes findings, and answers follow‑up questions across multiple turns. In early testing, the reasoning quality is strong, but as the agent chains together retrieval, tool calls, and generation, latency starts to add up. For interactive experiences, those delays matter—so many teams adopt a multi‑model approach, using a larger model to plan and smaller models to execute subtasks quickly at scale. This is where GPT‑5.4 mini and GPT‑5.4 nano come in. These smaller variants of GPT-5.4 are optimized for developer workloads where latency, cost savings, and agentic design are top of mind. GPT-5.4 mini and GPT-5.4 nano will be rolling out today in Microsoft Foundry, so you can evaluate them in the model catalog and deploy the right option for each workload. GPT-5.4 mini: efficient reasoning for production workflows GPT-5.4 mini distills GPT-5.4’s strengths into a smaller, more efficient model for developer workloads where responsiveness matters. It significantly improves over GPT-5 mini across coding, reasoning, multimodal understanding, and tool use while running about 2X faster. Text and image inputs: build multimodal experiences that combine prompts with screenshots or other images. Tool use and function calling: reliably invoke tools and APIs for agentic workflows. Web search and file search: ground responses in external or enterprise content as part of multi-step tasks. Computer use: support software-interaction loops where the model interprets UI state and takes well-scoped actions. Where GPT-5.4 mini thrives Developer copilots and coding assistants: latency-sensitive coding help, code review suggestions, and fast iteration loops where turnaround time matters. Multimodal developer workflows: applications that interpret screenshots, understand UI state, or process images as part of coding and debugging loops. Computer-use sub-agents: fast executors that take well-scoped actions in software (for example, navigating UIs or completing repetitive steps) within a larger agent loop coordinated by a planner model. GPT-5.4 nano: ultra-low latency automation at scale GPT-5.4 nano is the smallest and fastest model in the lineup, designed for low-latency and low-cost API usage at high throughput. It’s optimized for short-turn tasks like classification, extraction, and ranking, plus lightweight sub-agent work where speed and cost are the priority and extended multi-step reasoning isn’t required. Strong instruction following: consistent adherence to developer intent across short, well-defined interactions. Function and tool calling: dependable invocation of tools and APIs for lightweight agent and automation scenarios. Coding support: optimized performance for common coding tasks where fast turnaround is required. Image understanding: multimodal image input support for basic image interpretation alongside text. Low-latency, low-cost execution: designed to deliver responses quickly and efficiently at scale. Where GPT-5.4 nano thrives GPT-5.4 nano is a strong fit when you need predictable behavior at very high throughput and the task can be expressed as short, well-scoped instructions. Classification and intent detection: fast labeling and routing decisions for high-volume requests. Extraction and normalization: pull structured fields from text, validate formats, and standardize outputs. Ranking and triage: reorder candidates, prioritize tickets/leads, and select best-next actions under tight latency budgets. Guardrails and policy checks: lightweight safety and policy classification, prompt gating, and enforcement decisions before dispatching to tools or larger models. High-volume text processing pipelines: batch transformation, cleanup, deduping, and normalization steps where unit cost and throughput dominate. Routing and prioritization at the edge: select the right downstream workflow (template, queue, or model) for each request under tight latency budgets. Choosing the right GPT-5.4 model Microsoft Foundry makes it possible to deploy multiple GPT-5.4 variants side by side, so teams can route requests to the model that best fits each task. Here’s a practical way to think about the lineup: Model Best suited for Typical workloads GPT-5.4 Sustained, multi-step reasoning with reliable follow-through Agentic workflows, research assistants, document analysis, complex internal tools GPT-5.4 Pro Deeper, higher-reliability reasoning for complex production scenarios High-stakes agentic workflows, long-form analysis and synthesis, complex planning, advanced internal copilots GPT-5.4 mini Balanced reasoning with lower latency for interactive systems Real-time agents, developer tools, retrieval-augmented applications GPT-5.4 nano Ultra-low latency and high throughput High-volume request routing, real-time chat, lightweight automation Responsible AI in Microsoft Foundry At Microsoft, our mission to empower people and organizations remains constant. In the age of AI, trust is foundational to adoption, and earning that trust requires a commitment to transparency, safety, and accountability. Microsoft Foundry provides governance controls, monitoring, and evaluation capabilities to help organizations deploy GPT-5.4 models responsibly in production environments, aligned with Microsoft's Responsible AI principles. Pricing Model Deployment Input (USD $/M tokens) Cached input (USD $/M tokens) Output (USD $/M tokens) GPT-5.4 mini Standard Global $0.75 $0.075 $4.5 GPT-5.4 nano Standard Global $0.20 $0.02 $1.25 The models are also available in Data Zone US. It is rolling out to Data Zone EU. Getting started Explore the models in Microsoft Foundry. Sign in to the Foundry portal and browse the model catalog to evaluate GPT-5.4 mini and GPT-5.4 nano alongside other options, then deploy the right model for each workload.12KViews0likes1Comment