NVIDIA NIM for NVIDIA Nemotron, Cosmos, & Microsoft Trellis: Now Available in Azure AI Foundry
We're excited to announce 7 new powerful NVIDIA NIM™ additions to Azure AI Foundry Models, now on Managed Compute. The latest wave of models—NVIDIA Nemotron Nano 9B v2, Llama 3.1 Nemotron Nano VL 8B, Llama 3.3 Nemotron Super 49B v1.5 (coming soon), Cosmos Reason1-7B, Cosmos Predict 2.5 (coming soon), Cosmos Transfer 2.5 (coming soon), and Microsoft Trellis—marks a significant leap forward in intelligent application development. Collectively, these models redefine what's possible in advanced instruction-following, vision-language understanding, and efficient language modeling, empowering developers to build multimodal, visually rich, and context-aware solutions. By combining robust reasoning, flexible input handling, and enterprise-grade deployment options, these additions accelerate innovation across industries—from robotics and autonomous vehicles to immersive retail and digital twins—enabling smarter, safer, and more adaptive experiences at scale.

Meet the Models

| Model Name | Size | Primary Use Cases |
|---|---|---|
| NVIDIA Nemotron Nano 9B v2 (Available Now) | 9B parameters | Multilingual Reasoning: multilingual and code-based reasoning tasks. Enterprise Agents: AI and productivity agents. Math/Science: scientific reasoning, advanced math. Coding: software engineering and tool calling |
| Llama 3.3 Nemotron Super 49B v1.5 (Coming Soon) | 49B | Enterprise Agents: AI and productivity agents. Math/Science: scientific reasoning, advanced math. Coding: software engineering and tool calling |
| Llama 3.1 Nemotron Nano VL 8B (Available Now) | 8B | Multimodal: multimodal vision-language tasks, document intelligence and understanding. Edge Agents: mobile and edge AI agents |
| Cosmos Reason1-7B (Available Now) | 7B | Robotics: planning and executing tasks with physical constraints. Autonomous Vehicles: understanding environments and making decisions. Video Analytics Agents: extracting insights and performing root-cause analysis from video data |
| Cosmos Predict 2.5 (Coming Soon) | 2B | Generalist Model: world state generation and prediction |
| Cosmos Transfer 2.5 (Coming Soon) | 2B | Structural Conditioning: physical AI |
| Microsoft TRELLIS by Microsoft Research (Available Now) | - | Digital Twins: generate accurate 3D assets from simple prompts. Immersive Retail Experiences: photorealistic product models for AR, virtual try-ons. Game and Simulation Development: turn creative ideas into production-ready 3D content |

Meet the NVIDIA Nemotron Family

NVIDIA Nemotron Nano 9B v2: Compact power for high-performance reasoning and agentic tasks

NVIDIA Nemotron Nano 9B v2 is a high-efficiency large language model built with a hybrid Mamba-Transformer architecture, designed to excel in both reasoning and non-reasoning tasks.

- Efficient architecture for high-performance reasoning: Combines Mamba-2 and Transformer components to deliver strong reasoning capabilities with higher throughput.
- Extensive multilingual and code capabilities: Trained on diverse language and programming data, it performs exceptionally well across tasks involving natural language (English, German, French, Italian, Spanish, and Japanese), code generation, and complex problem solving.
- Reasoning Budget Control: Supports runtime "thinking" budget control. During inference, the user can specify how many tokens the model is allowed to "think" for, helping balance speed, cost, and accuracy. For example, a user can tell the model to think for 1K tokens, 3K tokens, and so on for different use cases, with far better cost predictability (a short sketch follows the figure note below).

Fig 1. (Image provided by NVIDIA.)
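As a rough illustration of budget control in practice, the sketch below caps the model's thinking tokens when calling a managed-compute deployment. The endpoint URL, key, model name, the "/think" system message, and the max_thinking_tokens field are all assumptions for illustration; check the Nemotron Nano 9B v2 model card for the exact route and parameter names your deployment exposes.

```python
# Sketch only: calling a managed-compute NIM endpoint with a bounded "thinking" budget.
# ENDPOINT_URL, API_KEY, the model name, and the "max_thinking_tokens" field are
# assumptions for illustration, not confirmed values from this post.
import requests

ENDPOINT_URL = "https://<your-endpoint>.<region>.inference.ml.azure.com/v1/chat/completions"  # placeholder
API_KEY = "<your-endpoint-key>"  # placeholder

payload = {
    "model": "nvidia-nemotron-nano-9b-v2",          # illustrative deployment/model name
    "messages": [
        {"role": "system", "content": "/think"},     # assumed convention for enabling reasoning mode
        {"role": "user", "content": "Plan a 3-step rollout for a new API, briefly."},
    ],
    "max_tokens": 512,
    # Assumed field: caps how many tokens the model may spend "thinking"
    # before it answers, trading accuracy against latency and cost.
    "max_thinking_tokens": 1024,
}

resp = requests.post(
    ENDPOINT_URL,
    headers={"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"},
    json=payload,
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```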
Nemotron Nano 9B v2 is built from the ground up with training data spanning 15 languages and 43 programming languages, giving it broad multilingual and coding fluency. Its capabilities were sharpened through advanced post-training techniques like GRPO and DPO, enabling it to reason deeply, follow instructions precisely, and adapt dynamically to different tasks.

-> Explore the model card on Azure AI Foundry

Llama 3.3 Nemotron Super 49B v1.5: High-throughput reasoning at scale

Llama 3.3 Nemotron Super 49B v1.5 (coming soon) is a significantly upgraded version of Llama-3.3-Nemotron-Super-49B-v1. It is a large language model derived from Meta Llama-3.3-70B-Instruct (the reference model) and optimized for advanced reasoning, instruction following, and tool use across a wide range of tasks.

- Excels in applications such as chatbots, AI agents, and retrieval-augmented generation (RAG) systems
- Balances accuracy and compute efficiency for enterprise-scale workloads
- Designed to run efficiently on a single NVIDIA H100 GPU, making it practical for real-world applications

Llama-3.3-Nemotron-Super-49B-v1.5 was trained through a multi-phase process combining human expertise, synthetic data, and advanced reinforcement learning techniques to refine its reasoning and instruction-following abilities. Its impressive performance across benchmarks like MATH500 (97.4%) and AIME 2024 (87.5%) highlights its strength in tackling complex tasks with precision and depth.

Llama 3.1 Nemotron Nano VL 8B: Multimodal intelligence for edge deployments

Llama 3.1 Nemotron Nano VL 8B is a compact vision-language model that excels in tasks such as report generation, Q&A, visual understanding, and document intelligence. The model delivers low latency and high efficiency, reducing TCO. It was trained on a diverse mix of human-annotated and synthetic data, enabling robust performance across multimodal tasks such as document understanding and visual question answering, and it achieved strong results on evaluation benchmarks including DocVQA (91.2%), ChartQA (86.3%), AI2D (84.8%), and OCRBenchV2 English (60.1%).

-> Explore the model card on Azure AI Foundry

What Sets Nemotron Apart

NVIDIA Nemotron is a family of open models, datasets, recipes, and tools.

1. Open-source AI technologies: Open models, data, and recipes offer transparency, allowing developers to create trustworthy custom AI for their specific needs, from creating new agents to refining existing applications.
   - Open Weights: The NVIDIA Open Model License offers enterprises data control and flexible deployment.
   - Open Data: Models are trained with transparent, permissively licensed NVIDIA data, available on Hugging Face, ensuring confidence in use. It also allows developers to train their own high-accuracy custom models with these open datasets.
   - Open Recipe: NVIDIA shares development techniques, like NAS, hybrid architecture, and Minitron, as well as NeMo tools, enabling customization or creation of custom models.
2. Highest Accuracy & Efficiency: Engineered for efficiency, Nemotron delivers industry-leading accuracy in the least amount of time for reasoning, vision, and agentic tasks.
3. Run Anywhere On Cloud: Packaged as NVIDIA NIM for secure and reliable deployment of high-performance AI model inferencing across Azure platforms.

Meet the Cosmos Family

NVIDIA Cosmos™ is a world foundation model (WFM) development platform to advance physical AI.
At its core are Cosmos WFMs, openly available pretrained multimodal models that developers can use out of the box for generating world states as videos and physical AI reasoning, or post-train to develop specialized physical AI models.

Cosmos Reason1-7B: Physical AI

Cosmos Reason1-7B combines chain-of-thought reasoning, flexible input handling for images and video, a compact 7B-parameter architecture, and advanced physical world understanding, making it ideal for real-time robotics, video analytics, and AI agents that require contextual, step-by-step decision-making in complex environments. This model transforms how AI and robotics interact with the real world, giving your systems the power to not just see and describe, but truly understand, reason, and make decisions in complex environments like factories, cities, and autonomous vehicles. With its ability to analyze video, plan robot actions, and verify safety protocols, Cosmos Reason1-7B helps developers build smarter, safer, and more adaptive solutions for real-world challenges.

Cosmos Reason1-7B is physical AI for 4 embodiments:

Fig. 2. Physical AI.

Model Strengths

- Physical World Reasoning: Leverages prior knowledge, physics laws, and common sense to understand complex scenarios.
- Chain-of-Thought (CoT) Reasoning: Delivers contextual, step-by-step analysis for robust decision-making.
- Flexible Input: Handles images, video (up to 30 seconds, 1080p), and text with a 16k context window.
- Compact & Deployable: At 7B parameters, it runs efficiently from edge devices to the cloud.
- Production-Ready: Available via Hugging Face, GitHub, and NVIDIA NIM; integrates with industry-standard APIs.

Enterprise Use Cases

Cosmos Reason1-7B is more than a model; it's a catalyst for building intelligent, adaptive solutions that help enterprises shape a safer, more efficient, and truly connected physical world.

Fig. 3. Use Cases.

- Reimagine safety and efficiency by empowering AI agents to analyze millions of live streams and recorded videos, instantly verifying protocols and detecting risks in factories, cities, and industrial sites.
- Accelerate robotics innovation with advanced reasoning and planning, enabling robots to understand their environment, make methodical decisions, and perform complex tasks—from autonomous vehicles navigating busy streets to household robots assisting with daily chores.
- Transform data curation and annotation by automating the selection, labeling, and critiquing of massive, diverse datasets, fueling the next generation of AI with high-quality training data.
- Unlock smarter video analytics with chain-of-thought reasoning, allowing systems to summarize events, verify actions, and deliver actionable insights for security, compliance, and operational excellence.

-> Explore the model card on Azure AI Foundry

Also coming soon to Azure AI Foundry are two models of the Cosmos WFM, designed for world generation and data augmentation.

Cosmos Predict 2.5 2B

Cosmos Predict 2.5 is a next-generation world foundation model that generates realistic, controllable video worlds from text, images, or videos—all through a unified architecture. Trained on 200M+ high-quality clips and enhanced with reinforcement learning, it delivers stronger physics and prompt alignment while cutting compute cost and post-training time for faster physical AI workflows.
Cosmos Transfer 2.5 2B

While Predict 2.5 generates worlds, Transfer 2.5 transforms structured simulation inputs—like segmentation, depth, or LiDAR maps—into photorealistic synthetic data for physical AI training and development.

What Sets Cosmos Apart

- Built for Physical AI: Purpose-built for robotics, autonomous systems, and embodied agents that understand physics, motion, and spatial environments.
- Multimodal World Modeling: Combines images, video, depth, segmentation, LiDAR, and trajectories to create physics-aware, controllable world simulations.
- Scalable Synthetic Data Generation: Generates diverse, photorealistic data at scale using structured simulation inputs for faster Sim2Real training and adaptation.

Microsoft Trellis by Microsoft Research: Enterprise-ready 3D Generation

Microsoft Trellis is a cutting-edge 3D asset generation model developed by Microsoft Research, designed to create high-quality, versatile 3D assets, complete with shapes and textures, from text or image prompts. Seamlessly integrated within the NVIDIA NIM microservice, Trellis accelerates asset generation and empowers creators with flexible, production-ready outputs. Quickly generate high-fidelity 3D models from simple text or image prompts, perfect for industries like manufacturing, energy, and smart infrastructure looking to accelerate digital twin creation, predictive maintenance, and immersive training environments. From virtual try-ons in retail to production-ready assets in media, TRELLIS empowers teams to create stunning 3D content at scale, cutting down production time and unlocking new levels of interactivity and personalization.

-> Explore the model card on Azure AI Foundry

Pricing

The pricing breakdown consists of the Azure compute charges plus a flat fee per GPU for the NVIDIA AI Enterprise license that is required to use the NIM software. Pay-as-you-go (per GPU hour):

- NIM surcharge: $1 per GPU hour
- Azure compute charges also apply, based on deployment configuration

Why use Managed Compute?

Managed Compute is a deployment option within Azure AI Foundry Models that lets you run large language models (LLMs), SLMs, Hugging Face models, and custom models fully hosted on Azure infrastructure. Azure Managed Compute is a powerful deployment option for models not available via standard (pay-go) endpoints. It gives you:

- Custom model support: Deploy open-source or third-party models
- Infrastructure flexibility: Choose your own GPU SKUs (NVIDIA A10, A100, H100)
- Detailed control: Configure inference servers, protocols, and advanced settings
- Full integration: Works with Azure ML SDK, CLI, Prompt Flow, and REST APIs (a short SDK sketch appears below)
- Enterprise-ready: Supports VNet, private endpoints, quotas, and scaling policies

NVIDIA NIM Microservices on Azure

These models are available as NVIDIA NIM™ microservices on Azure AI Foundry. NVIDIA NIM, part of NVIDIA AI Enterprise, is a set of easy-to-use microservices designed for secure, reliable deployment of high-performance AI model inferencing. NIM microservices are pre-built, containerized AI endpoints that simplify deployment and scale across environments. They allow developers to run models securely and efficiently in the cloud environment. If you're ready to build smarter, more capable AI agents, start exploring Azure AI Foundry.
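Because managed compute integrates with the Azure ML SDK noted above, a deployed NIM endpoint can also be invoked programmatically. The following is a minimal sketch: the subscription, workspace, endpoint, and request file names are placeholders, and calling the endpoint's REST scoring URL with its key works equally well.

```python
# Minimal sketch (not from the original post): invoking a NIM deployment on
# Foundry managed compute through the Azure ML SDK. All names are placeholders.
from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    credential=DefaultAzureCredential(),
    subscription_id="<subscription-id>",      # placeholder
    resource_group_name="<resource-group>",   # placeholder
    workspace_name="<foundry-hub-project>",   # placeholder (hub-based project)
)

# request.json holds the chat payload expected by the deployed NIM, e.g.
# {"messages": [{"role": "user", "content": "Hello"}], "max_tokens": 256}
response = ml_client.online_endpoints.invoke(
    endpoint_name="nemotron-nano-9b-v2-endpoint",  # placeholder endpoint name
    request_file="request.json",
)
print(response)
```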
Build Trustworthy AI Solutions

Azure AI Foundry delivers managed compute designed for enterprise-grade security, privacy, and governance. Every deployment of NIM microservices through Azure AI Foundry is backed by Microsoft's Responsible AI principles and Secure Future Initiative, ensuring fairness, reliability, and transparency so organizations can confidently build and scale agentic AI workflows.

How to Get Started in Azure AI Foundry

1. Explore Azure AI Foundry: Begin by accessing the Azure AI Foundry portal and then following the steps below.
   - Navigate to ai.azure.com.
   - At the top left, select an existing project that is a (Hub) resource. If you do not have a Hub project, create a new one using the "+ Create New" link and choose the AI Hub resource.
2. Deploy with NIM microservices: Use NVIDIA's optimized containers for secure, scalable deployment.
   - Select Model Catalog from the left sidebar menu.
   - In the "Collections" filter, select NVIDIA to see all the NIM microservices that are available on Azure AI Foundry.
   - Select the NIM you want to use and click Deploy.
   - Choose the deployment name and virtual machine (VM) type that you would like to use for your deployment. VM SKUs that are supported for the selected NIM, and also specified within the model card, will be preselected. Note that this step requires having sufficient quota available in your Azure subscription for the selected VM type. If needed, follow the instructions to request a service quota increase.
3. Use the NVIDIA NeMo Agent Toolkit, designed to orchestrate, monitor, and optimize collaborative AI agents.

Note about the License

Users are responsible for compliance with the terms of the NVIDIA AI Product Agreement.

Learn More

- How to Deploy NVIDIA NIM Docs
- Learn more about accelerating agentic workflows with Azure AI Foundry, NVIDIA NIM, and NVIDIA NeMo Agent Toolkit
- Register for Microsoft Ignite 2025

Simplify Search Development with the New Azure AI Search Wizard
Azure AI Search has introduced the new "Import Data" wizard—a unified, modernized experience that streamlines index creation across keyword, RAG, and multimodal RAG workflows. By merging the legacy keyword search wizard with the vectorization flow used for advanced AI scenarios, this update simplifies how users connect to data sources, configure skillsets, and build query-ready indexes. The new wizard supports semantic ranking, integrated vectorization, and multimodal enrichment, with expanded connector options like Azure Queues, OneDrive for Business, and SharePoint Online via Logic Apps. During the phased rollout, both the classic and new wizards will coexist, but users are encouraged to switch early to take advantage of enhanced capabilities and prepare for the eventual retirement of the legacy experience. Whether you're building traditional search or intelligent retrieval systems, the new wizard offers a faster, more intuitive path to production-ready indexes.

Quick look at journey of Agentic Solutions, from No-code to Developer tools
Why this journey matters

My journey with bots, virtual agents, and personal assistants has been a long one, and in that time not only have the usage and user scenarios evolved, but the technology and platforms that fuel them have changed significantly as well. Agentic solutions are no longer just "chat with documents and knowledge bases" or hand-curating decision making into AI services. The bar has moved to systems that understand context, invoke tools, and complete workflows—with the governance and telemetry your business requires, and the new tools that are at our disposal.

In this article, I go through the notes I have made and the approaches I follow as I work on new AI solutions and AI projects. I have also added a checklist and a 90-day plan, in case you are lucky enough to launch an AI agentic project and want to start in a structured way, from small wins to big bang.

While navigating various scenarios and projects, I have developed and refined this practical progression. The methodology gradually evolved as I encountered different timeline constraints and use cases:

- No-code for rapid wins inside Microsoft 365
- Low-code for richer conversation design and workflow orchestration
- Pro-code for robust model choice, evaluation, safety, and operations on Azure

Use it as a blueprint to decide where to start, when to step up, and how to land production quality without over-engineering day one. With this approach, I have seen team formation evolve as well. While some use cases will reach fruition at the low-code stage itself, a few will be adopted for pro-code and involve a larger development team and more mature DevOps processes.

The spectrum at a glance

| Layer | Primary Builder | Best For | Integration Depth | Time-to-Value |
|---|---|---|---|---|
| Microsoft 365 Copilot – Agent Builder (No-code) | Smart users, business leads | Q&A, task helpers, quick pilots in Teams/Outlook | Connect org content and simple actions | Fastest |
| Microsoft Copilot Studio (Low-code) | Citizen developers, power users | Multi-turn conversations, API actions, enterprise data | Custom connectors, policies, orchestration | Weeks |
| Azure AI Foundry (Pro-code) | Developers, architects | Model selection, evaluation, safety, observability | Prompt flows, CI/CD, monitoring, scale | Project lifecycle |

Start: No-code with Microsoft 365 Copilot Agent Builder

When you need impact now, or there is something you want to automate quickly (including your daily routine or a quick business process), use embedded intelligence where people work every day.

What you can achieve

- Answer policy and product questions grounded in your internal content
- Automate simple tasks (drafts, reminders, status messages)
- Share quickly in Teams to capture user feedback
- Collaborate and share with your teammates

How to approach

1. Define one job to be done (e.g., "answer 80% of field FAQs").
2. Attach one high-quality content source (a structured SharePoint library beats scattered files).
3. Add one action that saves clicks (create a task, send a summary).
4. Pilot with a small group; measure deflection, satisfaction, and turnaround time.

Guardrails from day one

- Keep scope narrow, content curated, and responses concise.
- Document the agent's mandate and what it won't do (set expectations).

Level up: Low-code with Copilot Studio

Transition to this approach when your project requires designed conversations, conditional logic, and system actions—all without needing to move into full pro-code development.
This method is especially effective for quickly deploying agents across a department, particularly for straightforward use cases, simple automations, and workflows that require more extensive reach. It enables broader automation and process improvement while maintaining a low-code approach that remains accessible to a wider range of users.

What you can achieve

- Model topics/intents and multi-turn dialogues
- Call internal and external APIs via custom connectors
- Apply business rules before actions are carried out

Design tips

- Structure the conversation: greet → clarify → retrieve/act → confirm → summarize.
- Separate knowledge from behavior: keep content where it's governed; keep logic in Studio.
- Instrument outcomes: track successful task completion, not just messages exchanged, and build deep analytics into usage.

Integration patterns

- Internal systems (HR, finance, CRM) through connectors.
- Event-driven flows (create tickets, update records, trigger notifications).
- Approval handoffs when confidence is low.

Production grade: Pro-code with Azure AI Foundry

When correctness, safety, scale, and cost matter, graduate to developer tooling on Azure.

Why this layer

- Model choice: right-fit models (capability, latency, cost) for each task.
- Prompt orchestration: multi-step reasoning and tool calling.
- Evaluation: offline tests before release and live monitoring after.
- Safety: input/output filtering and policy enforcement.
- Operations: CI/CD, observability, and performance management.
- Standard development process and tooling: I emphasize AI models and Azure AI Foundry here; however, standard development practices such as code security, identity and access, compliance, and testing remain the same.

Engineering flow that works

1. Frame the objective: Define success metrics (quality, safety, and business KPIs).
2. Prototype prompt flows: Start small, version them, and add tool calls only where needed.
3. Evaluate before you ship: Use curated datasets for offline tests; include tricky edge cases.
4. Harden safety: Enable content filters, set thresholds, and log decisions for auditability.
5. Ship with telemetry: Track latency, cost per task, answer accuracy, and user feedback.
6. Continuously improve: Roll updates behind flags, watch for drift, and retrain when needed.

Reference architecture (conceptual)

- Experience → Teams/web/app
- Orchestration → Copilot Studio (dialog, routing, actions)
- AI Services → Azure AI Foundry (models, prompt flows, evaluation, safety, monitoring)
- Enterprise systems → Data platforms, line-of-business APIs, automation services

Key principles

- Separation of concerns: UI ≠ conversation logic ≠ model/runtime ≠ business systems.
- Least privilege: Only the permissions and scopes the agent truly needs.
- Observability first: Logs, traces, and quality events from day one.
- Human-in-the-loop: Escalation paths for low-confidence or sensitive requests.

My 90-day plan

Days 1–30: Prove value

- Ship two no-code agents for different teams.
- Measure deflection %, response helpfulness, and time saved.

Days 31–60: Orchestrate actions

- Rebuild one agent in Copilot Studio with a clear dialog flow.
- Add a secure API action and an approval fallback.

Days 61–90: Operationalize

- Port the highest-impact scenario to Foundry.
- Implement offline evaluation, enable safety filters, deploy to a controlled audience, and set up monitoring dashboards.
Design checklists (save for later)

No-code launch checklist
- ☐ One job to be done
- ☐ Single, high-quality knowledge source
- ☐ One user-visible action
- ☐ Pilot cohort & feedback channel

Low-code orchestration checklist
- ☐ Dialog flow defined (happy path + clarifications)
- ☐ Input validation before actions
- ☐ Connector secrets managed securely
- ☐ Outcome metrics (task completion, reengagement)

Pro-code readiness checklist
- ☐ Model fit (capability, latency, cost) documented
- ☐ Offline evaluation set with edge cases
- ☐ Safety filters configured and logged
- ☐ Monitoring, alerting, and rollback plan

Common pitfalls and how to avoid them

- Starting big: Begin with one clearly defined outcome; expand only after you see measurable impact.
- Over-indexing on chat: Instrument task completion, not just message counts.
- Hidden coupling: Don't bury business logic inside prompts; keep rules visible and testable.
- Skipping eval: Always gate releases with a small, representative test set.
- No feedback loop: Capture user feedback in-product and close the loop with updates.

Final take

Stay on the course and go progressive: 1) no-code for momentum and adoption, 2) low-code for richer conversations and actions, and 3) pro-code for the rigor that production demands. Treat evaluation, safety, and observability as core features and focus on them from day one, not as afterthoughts. That's how you build agentic solutions that are useful on day one and trustworthy on day one hundred.

These links cover the full journey from no-code to pro-code, including responsible AI practices:

- Microsoft 365 Copilot Agent Builder Overview: https://learn.microsoft.com/en-us/microsoft-365-copilot/extensibility/agents-overview
- Microsoft Copilot Studio Documentation: https://learn.microsoft.com/en-us/microsoft-copilot-studio/
- Azure AI Foundry Documentation: https://learn.microsoft.com/en-us/azure/ai-foundry/
- Responsible AI and Content Safety in Azure: https://learn.microsoft.com/en-us/azure/ai-services/content-safety/
- Introduction to Microsoft AI Agent Solutions (Microsoft Learn module): https://learn.microsoft.com/en-us/training/modules/introduction-microsoft-ai-agent-solutions/

Software development best practices and using AI in software development:

- AI in Software Development | Microsoft Copilot
- Architecture strategies for formalizing software development management practices - Microsoft Azure Well-Architected Framework | Microsoft Learn

About the Author

Dipanjan Ghosh is a seasoned technology leader at Microsoft with extensive experience in AI solutions, enterprise architecture, and modern developer practices. He enables organizations to adopt Microsoft AI platforms such as Copilot, Copilot Studio, and Azure AI Foundry, ensuring scalability, security, and operational excellence. With a strong foundation in cloud architecture and automation, Dipanjan bridges innovation with practical implementation. Passionate about evangelizing technology innovations, he simplifies complex concepts and inspires businesses to embrace responsible, cutting-edge solutions.

#SkilledByMTT, #MSLearn, #MTTBloggingGroup

The Future of AI: From Noise to Insight - An AI Agent for Customer Feedback
This post explores how Microsoft's AI Futures team built a multi-agent system to transform scattered customer feedback into actionable insights. The solution aggregates feedback from multiple channels, uses advanced language models to cluster themes, summarize content, and identify sentiment, and delivers prioritized insights directly in Microsoft Teams. With human-in-the-loop safeguards, the system accelerates triage, prioritization, and follow-ups while maintaining compliance and traceability. Future enhancements include richer automation, trend visualization, and expanded feedback sources.

Context-Aware RAG System with Azure AI Search to Cut Token Costs and Boost Accuracy
🚀 Introduction

As AI copilots and assistants become integral to enterprises, one question dominates architecture discussions: "How can we make large language models (LLMs) provide accurate, source-grounded answers — without blowing up token costs?"

Retrieval-Augmented Generation (RAG) is the industry's go-to strategy for this challenge. But traditional RAG pipelines often use static document chunking, which breaks semantic context and drives inefficiencies. To address this, we built a context-aware, cost-optimized RAG pipeline using Azure AI Search and Azure OpenAI, leveraging AI-driven semantic chunking and intelligent retrieval. The result: accurate answers with up to 85% lower token consumption.

This blog focuses primarily on:

- Tokenization
- Chunking

The Problem with Naive Chunking

Most RAG systems split documents by token or character count (e.g., every 1,000 tokens). This is easy to implement but introduces real-world problems:

- 🧩 Loss of context — sentences or concepts get split mid-idea.
- ⚙️ Retrieval noise — irrelevant fragments appear in top results.
- 💸 Higher cost — you often send 5× more text than necessary.

These issues degrade both accuracy and cost efficiency.

🧠 Context-Aware Chunking: Smarter Document Segmentation

Instead of breaking text arbitrarily, our system uses an LLM-powered preprocessor to identify semantic boundaries — meaning each chunk represents a complete and coherent concept.

Example

Naive chunking: "Azure OpenAI Service offers… [cut] …integrates with Azure AI Search for intelligent retrieval."

Context-aware chunking: "Azure OpenAI Service provides access to models like GPT-4o, enabling developers to integrate advanced natural language understanding and generation into their applications. It can be paired with Azure AI Search for efficient, context-aware information retrieval."

✅ The chunk is self-contained and semantically meaningful. This allows the retriever to match queries with conceptually complete information rather than partial sentences — leading to precision and fewer chunks needed per query.

Architecture Diagram

Chunking Service

Purpose: Transforms messy enterprise data (wikis, PDFs, transcripts, repos, images) into structured, model-friendly chunks for Retrieval-Augmented Generation (RAG).

| Challenge | Chunking Fix |
|---|---|
| LLM context limits | Breaks docs into smaller pieces |
| Embedding size | Keeps within token bounds |
| Retrieval accuracy | Granular, relevant sections only |
| Noise | Removes irrelevant blocks |
| Traceability | Chunk IDs for auditability |
| Cost/latency | Re-embed only changed chunks |

The Chunking Flow (End-to-End)

The Chunking Service sits in the ingestion pipeline and follows this sequence:

1. Ingestion: Raw text arrives from sources (wiki, repo, transcript, PDF, image description).
2. Token-aware splitting: Large text is cut into manageable pre-chunks with a 100-token overlap, ensuring no semantic drift across boundaries (see the sketch after this list).
3. Semantic segmentation: Each pre-chunk is passed to an Azure OpenAI chat model with a structured prompt. Output = JSON array of semantic chunks (sectiontitle, speaker, content).
4. Optional overlap injection: Character-level overlap can be applied across chunks for discourse-heavy text like meeting transcripts.
5. Embedding generation: Each chunk is passed to the Azure OpenAI Embeddings API (text-embedding-3-small), producing a 1536-dimension vector.
6. Indexing: Chunks (text + vectors) are uploaded to Azure AI Search.
7. Retrieval: During question answering or document generation, the system pulls the top-k chunks, concatenates them, and enriches the prompt for the LLM.
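As a rough illustration of the token-aware splitting in step 2, the sketch below produces overlapping pre-chunks with tiktoken. The encoding name and chunk size are assumptions rather than values from the production service, which layers LLM-based semantic segmentation on top of these pre-chunks.

```python
# Sketch of token-aware pre-chunking with a 100-token overlap (step 2 above).
# Assumes tiktoken for token counting; the encoding name and chunk size are
# illustrative, not taken from the original implementation.
import tiktoken

def pre_chunk(text: str, max_tokens: int = 1500, overlap_tokens: int = 100) -> list[str]:
    """Split text into token-bounded windows, each overlapping the previous one."""
    enc = tiktoken.get_encoding("cl100k_base")
    tokens = enc.encode(text)
    chunks, start = [], 0
    while start < len(tokens):
        window = tokens[start : start + max_tokens]
        chunks.append(enc.decode(window))
        if start + max_tokens >= len(tokens):
            break
        # Step forward while keeping `overlap_tokens` of shared context across boundaries.
        start += max_tokens - overlap_tokens
    return chunks

# Each pre-chunk would then be sent to the chat model for semantic segmentation.
```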
Resilience & Traceability

The service is built to handle real-world pipeline issues. It retries once on rate limits, validates JSON outputs, and fails fast on malformed data instead of silently dropping chunks. Each chunk is assigned a unique ID (chunk_<sequence>_<sourceTag>), making retrieval auditable and enabling selective re-embedding when only parts of a document change.

☁️ Why Azure AI Search Matters Here

Azure AI Search (formerly Cognitive Search) is the heart of the retrieval pipeline. Key roles:

- Vector Search Engine: Stores embeddings of chunks and performs semantic similarity search.
- Hybrid Search (Keyword + Vector): Combines lexical and semantic matching for high precision and recall (a minimal query sketch appears at the end of this section).
- Scalability: Supports millions of chunks with blazing-fast search latency.
- Metadata Filtering: Enables fine-grained retrieval (e.g., by document type, author, section).
- Native Integration with Azure OpenAI: Allows a seamless, end-to-end RAG pipeline without third-party dependencies.

In short, Azure AI Search provides the speed, scalability, and semantic intelligence to make your RAG pipeline enterprise-grade.

💡 Importance of Azure OpenAI

Azure OpenAI complements Azure AI Search by providing:

- High-quality embeddings (text-embedding-3-large) for accurate vector search.
- Powerful generative reasoning (GPT-4o or GPT-4.1) to craft contextually relevant answers.
- Security and compliance within your organization's Azure boundary — critical for regulated environments.

Together, these two services form the retrieval (Azure AI Search) and generation (Azure OpenAI) halves of your RAG system.

💰 Token Efficiency

By limiting the model's input to only the most relevant, semantically meaningful chunks, you drastically reduce prompt size and cost.

| Approach | Tokens per Query | Typical Cost | Accuracy |
|---|---|---|---|
| Full-document prompt | ~15,000–20,000 | Very high | Medium |
| Fixed-size RAG chunks | ~5,000–8,000 | Moderate | Medium-high |
| Context-aware RAG (this approach) | ~2,000–3,000 | Low | High |

💰 Token Cost Reduction Analysis

Let's quantify it:

| Step | Naive Approach (no RAG) | Your Approach (Context-Aware RAG) |
|---|---|---|
| Prompt context size | Entire document (e.g., 15,000 tokens) | Top 3 chunks (e.g., 2,000 tokens) |
| Tokens per query | ~16,000 (incl. user + system) | ~2,500 |
| Cost reduction | — | ~84% reduction in token usage |
| Accuracy | Often low (hallucinations) | Higher (targeted retrieval) |

That's roughly an 80–85% reduction in token usage while improving both accuracy and response speed.

🧱 Tech Stack Overview

| Component | Service | Purpose |
|---|---|---|
| Chunking Engine | Azure OpenAI (GPT models) | Generate context-aware chunks |
| Embedding Model | Azure OpenAI Embedding API | Create high-dimensional vectors |
| Retriever | Azure AI Search | Perform hybrid and vector search |
| Generator | Azure OpenAI GPT-4o | Produce final answer |
| Orchestration Layer | Python / FastAPI / .NET C# | Handle RAG pipeline |

🔍 The Bottom Line

By adopting context-aware chunking and Azure AI Search-powered RAG, you achieve:

- ✅ Higher accuracy (contextually complete retrievals)
- 💸 Lower cost (token-efficient prompts)
- ⚡ Faster latency (smaller context per call)
- 🧩 Scalable and secure architecture (fully Azure-native)

This is the same design philosophy powering Microsoft Copilot and other enterprise AI assistants today.
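Before the worked example below (which shows the chunking side in C#), here is what the retrieval half can look like on its own: a minimal hybrid keyword + vector query sketch using the azure-search-documents Python SDK. The index name, field names, and embedding deployment are placeholders rather than values from this implementation.

```python
# Minimal hybrid retrieval sketch (keyword + vector) against Azure AI Search.
# Endpoints, keys, index name, field names, and deployment names are placeholders.
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery
from openai import AzureOpenAI

openai_client = AzureOpenAI(
    azure_endpoint="https://<your-aoai>.openai.azure.com",  # placeholder
    api_key="<aoai-key>",                                    # placeholder
    api_version="2024-02-01",
)
search_client = SearchClient(
    endpoint="https://<your-search>.search.windows.net",     # placeholder
    index_name="semantic-chunks",                            # placeholder
    credential=AzureKeyCredential("<search-key>"),           # placeholder
)

def retrieve(query: str, k: int = 3) -> list[str]:
    """Embed the query, then combine keyword and vector matching in one call."""
    embedding = openai_client.embeddings.create(
        model="text-embedding-3-small", input=query
    ).data[0].embedding
    results = search_client.search(
        search_text=query,                                    # keyword (lexical) side
        vector_queries=[VectorizedQuery(
            vector=embedding, k_nearest_neighbors=k, fields="Embedding"
        )],
        top=k,
    )
    return [doc["Content"] for doc in results]
```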
🧪 Real-Life Example: Context-Aware RAG in Action

To bring this architecture to life, let's walk through a simple example of how documents can be chunked, embedded, stored in Azure AI Search, and then queried to generate accurate, cost-efficient answers. Imagine you want to build an internal knowledge assistant that answers developer questions from your company's Azure documentation.

⚙️ Step 1: Intelligent Document Chunking

We'll use a small LLM call to segment text into context-aware chunks, rather than fixed token counts.

```csharp
// Context-aware chunking
// "text" can be your retrieved text from any page/document
private async Task<List<SemanticChunk>> AzureOpenAIChunk(string text)
{
    try
    {
        string prompt = $@"
Divide the following text into logical, meaningful chunks. Each chunk should represent a coherent section, topic, or idea.
Return the result as a JSON array, where each object contains:
- sectiontitle
- speaker (if applicable, otherwise leave empty)
- content
Do not add any extra commentary or explanation. Only output the JSON array. Do not give content an array, try to keep all in string.

TEXT:
{text}";

        var client = GetAzureOpenAIClient();
        var chatCompletionsOptions = new ChatCompletionOptions
        {
            Temperature = 0,
            FrequencyPenalty = 0,
            PresencePenalty = 0
        };

        var Messages = new List<OpenAI.Chat.ChatMessage>
        {
            new SystemChatMessage("You are a text processing assistant."),
            new UserChatMessage(prompt)
        };

        var chatClient = client.GetChatClient(deploymentName: _appSettings.Agent.Model);
        var response = await chatClient.CompleteChatAsync(Messages, chatCompletionsOptions);
        string responseText = response.Value.Content[0].Text.ToString();

        string cleaned = Regex.Replace(responseText, @"```[\s\S]*?```", match =>
        {
            var match1 = match.Value.Replace("```json", "").Trim();
            return match1.Replace("```", "").Trim();
        });

        // Try to parse the response as a JSON array of chunks
        return CreateChunkArray(cleaned);
    }
    catch (JsonException ex)
    {
        _logger.LogError("Failed to parse GPT response: " + ex.Message);
        throw;
    }
    catch (Exception ex)
    {
        _logger.LogError("Error in AzureOpenAIChunk: " + ex.Message);
        throw;
    }
}
```

🧠 Step 2: Adding Overlaps for Better Results

We add overlap between chunks for better, more accurate answers. The overlap window can be adjusted based on the documents.

```csharp
public List<SemanticChunk> AddOverlap(List<SemanticChunk> chunks, string IDText, int overlapChars = 0)
{
    var overlappedChunks = new List<SemanticChunk>();

    for (int i = 0; i < chunks.Count; i++)
    {
        var current = chunks[i];
        string previousOverlap = i > 0
            ? chunks[i - 1].Content[^Math.Min(overlapChars, chunks[i - 1].Content.Length)..]
            : "";

        string combinedText = previousOverlap + "\n" + current.Content;
        var Id = $"chunk_{i + '_' + IDText}";

        overlappedChunks.Add(new SemanticChunk
        {
            Id = Regex.Replace(Id, @"[^A-Za-z0-9_\-=]", "_"),
            Content = combinedText,
            SectionTitle = current.SectionTitle
        });
    }

    return overlappedChunks;
}
```

🧠 Step 3: Generate and Store Embeddings in Azure AI Search

We convert each chunk into an embedding vector and push it to an Azure AI Search index.
```csharp
public async Task<List<SemanticChunk>> AddEmbeddings(List<SemanticChunk> chunks)
{
    var client = GetAzureOpenAIClient();
    var embeddingClient = client.GetEmbeddingClient("text-embedding-3-small");

    foreach (var chunk in chunks)
    {
        // Generate embedding using the EmbeddingClient
        var embeddingResult = await embeddingClient.GenerateEmbeddingAsync(chunk.Content).ConfigureAwait(false);
        chunk.Embedding = embeddingResult.Value.ToFloats();
    }

    return chunks;
}

public async Task UploadDocsAsync(List<SemanticChunk> chunks)
{
    try
    {
        var indexClient = GetSearchindexClient();
        var searchClient = indexClient.GetSearchClient(_indexName);
        var result = await searchClient.UploadDocumentsAsync(chunks);
    }
    catch (Exception ex)
    {
        _logger.LogError("Failed to upload documents: " + ex);
        throw;
    }
}
```

🤖 Step 4: Generate the Final Answer with Azure OpenAI

Now we combine the top chunks with the user query to create a cost-efficient, context-rich prompt.

P.S.: In this example we used a Semantic Kernel agent; in practice any agent can be used and the prompt can be updated as needed.

```csharp
var context = await _aiSearchService.GetSemanticSearchresultsAsync(UserQuery); // Gets chunks from Azure AI Search
// Here UserQuery is the query asked by the user / any question prompt that needs to be answered.

string questionWithContext = $@"Answer the question briefly in short relevant words based on the context provided.
Context : {context}. \n\n Question : {UserQuery}?";

var _agentModel = new AgentModel()
{
    Model = _appSettings.Agent.Model,
    AgentName = "Answering_Agent",
    Temperature = _appSettings.Agent.Temperature,
    TopP = _appSettings.Agent.TopP,
    AgentInstructions = $@"You are a cloud Migration Architect. " +
        "Analyze all the details from top to bottom in context based on the details provided for the Migration of APP app using Azure Services. Do not assume anything." +
        "There can be conflicting details for a question , please verify all details of the context. If there are any conflict please start your answer with word - **Conflict**." +
        "There might not be answers for all the questions, please verify all details of the context. If there are no answer for question just mention - **No Information**"
};

_agentModel = await _agentService.CreateAgentAsync(_agentModel);
_agentModel.QuestionWithContext = questionWithContext;
var modelWithResponse = await _agentService.GetAnswerAsync(_agentModel);
```

🧠 Final Thoughts

Context-aware RAG isn't just a performance optimization — it's an architectural evolution. It shifts the focus from feeding LLMs more data to feeding them the right data. By letting Azure AI Search handle intelligent retrieval and Azure OpenAI handle reasoning, you create an efficient, explainable, and scalable AI assistant. The outcome: smarter answers, lower costs, and a pipeline that scales with your enterprise.

- Wiki Link: Tokenization and Chunking
- IP Link: AI Migration Accelerator

Integrate Custom Azure AI Agents with CoPilot Studio and M365 CoPilot
Integrating Custom Agents with Copilot Studio and M365 Copilot

In today's fast-paced digital world, integrating custom agents with Copilot Studio and M365 Copilot can significantly enhance your company's digital presence and extend your Copilot platform to your enterprise applications and data. This blog will guide you through the integration steps of bringing your custom Azure AI Agent Service, hosted within an Azure Function App, into a Copilot Studio solution and publishing it to M365 and Teams applications.

When Might This Be Necessary

Integrating custom agents with Copilot Studio and M365 Copilot is necessary when you want to extend customization to automate tasks, streamline processes, and provide a better experience for your end users. This integration is particularly useful for organizations looking to streamline their AI platform, extend out-of-the-box functionality, and leverage existing enterprise data and applications to optimize their operations. Custom agents built on Azure allow you to achieve greater customization and flexibility than using Copilot Studio agents alone.

What You Will Need

To get started, you will need the following:

- Azure AI Foundry
- Azure OpenAI Service
- Copilot Studio Developer License
- Microsoft Teams Enterprise License
- M365 Copilot License

Steps to Integrate Custom Agents

1. Create a Project in Azure AI Foundry: Navigate to Azure AI Foundry and create a project. Select 'Agents' from the 'Build and Customize' menu pane on the left side of the screen and click the blue button to create a new agent.
2. Customize Your Agent: Your agent will automatically be assigned an Agent ID. Give your agent a name, assign the model your agent will use, and customize your agent with instructions.
3. Add Your Knowledge Source: You can connect to Azure AI Search, load files directly to your agent, link to Microsoft Fabric, or connect to third-party sources like Tripadvisor. In our example, we are only testing the Copilot integration steps of the AI Agent, so we did not build out additional options for providing grounding knowledge or function calling here.
4. Test Your Agent: Once you have created your agent, test it in the playground. If you are happy with it, you are ready to call the agent in an Azure Function.
5. Create and Publish an Azure Function: Use the sample function code from the GitHub repository to call the Azure AI Project and Agent, then publish your Azure Function to make it available for integration (a minimal sketch of the function's shape follows these steps): azure-ai-foundry-agent/function_app.py at main · azure-data-ai-hub/azure-ai-foundry-agent
6. Connect Your AI Agent to Your Function: Update the "AIProjectConnString" value to include your project connection string from the project overview page in the AI Foundry.
7. Role-Based Access Controls: We have to add a role for the Function App on the Azure OpenAI service (see Role-based access control for Azure OpenAI - Azure AI services | Microsoft Learn).
   - Enable managed identity on the Function App.
   - Grant the "Cognitive Services OpenAI Contributor" role to the Function App's system-assigned managed identity on the Azure OpenAI resource.
   - Grant the "Azure AI Developer" role to the Function App's system-assigned managed identity on the Azure AI Project resource from the AI Foundry.
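With the roles in place, the function itself can stay small: an HTTP trigger that receives the prompt from the flow, forwards it to the agent, and returns plain text. Below is a minimal Python sketch of that shape; the call_agent helper is hypothetical and left as a placeholder, since the linked function_app.py sample contains the actual Azure AI Projects SDK calls.

```python
# Minimal shape of the Azure Function described above (Python v2 programming model).
# The agent call itself is a hypothetical placeholder; the linked function_app.py
# shows the real SDK usage. AIProjectConnString is read from app settings.
import os

import azure.functions as func

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)

def call_agent(conn_str: str, prompt: str) -> str:
    """Hypothetical placeholder: wire this to the Azure AI Agent as in function_app.py."""
    raise NotImplementedError("Replace with the AIProjectClient calls from the sample repo.")

@app.route(route="ask_agent", methods=["POST"])
def ask_agent(req: func.HttpRequest) -> func.HttpResponse:
    prompt = req.params.get("prompt")
    if not prompt:
        try:
            prompt = (req.get_json() or {}).get("prompt")
        except ValueError:
            prompt = None
    if not prompt:
        return func.HttpResponse("Missing 'prompt'", status_code=400)

    conn_str = os.environ["AIProjectConnString"]  # project connection string from AI Foundry
    answer = call_agent(conn_str, prompt)

    # Plain-text response so the Power Automate flow can pass it straight to Copilot Studio.
    return func.HttpResponse(answer, mimetype="text/plain", status_code=200)
```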
Build a Flow in Power Platform

Before you begin, make sure you are working in the same environment you will use to create your Copilot Studio agent. To get started, navigate to the Power Platform (https://make.powerapps.com) to build out a flow that connects your Copilot Studio solution to your Azure Function App.

- When creating a new flow, select 'Build an instant cloud flow' and trigger the flow using 'Run a flow from Copilot'.
- Add an HTTP action to call the Function using its URL, and pass the message prompt from the end user with your URL.
- The output of your function is plain text, so you can pass the response from your Azure AI Agent directly to your Copilot Studio solution.

Create Your Copilot Studio Agent

- Navigate to Microsoft Copilot Studio and select 'Agents', then 'New Agent'. Make sure you are in the same environment you used to create your cloud flow, then select the 'Create' button at the top of the screen.
- From the top menu, navigate to 'Topics' and 'System', and open the 'Conversation boosting' topic.
- When you first open the Conversation boosting topic, you will see a template of connected nodes. Delete all but the initial 'Trigger' node.
- Now rebuild the conversation boosting topic to call the flow you built in the previous step: select 'Add an Action' and then select the option for an existing Power Automate flow. Pass the response from your custom agent to the end user and end the current topic.
- When the action menu pops up, you should see the option to run the flow you created. Mine does not have a very unique name, but you can see my flow 'Run a flow from Copilot' as a Basic actions menu item.
- If you do not see your cloud flow here, add the flow to the default solution in the environment: go to Solutions > select the All pill > Default Solution > then add the cloud flow you created to the solution. Then go back to Copilot Studio, refresh, and the flow will be listed there.
- Now complete building out the conversation boosting topic.

Make Agent Available in M365 Copilot

- Navigate to the 'Channels' menu and select 'Teams + Microsoft 365'. Be sure to select the box to 'Make agent available in M365 Copilot'.
- Save and re-publish your Copilot agent. It may take up to 24 hours for the Copilot agent to appear in the M365 Teams agents list.
- Once it has loaded, select the 'Get Agents' option from the side menu of Copilot and pin your Copilot Studio agent to your featured agent list.
- Now you can chat with your custom Azure AI Agent, directly from M365 Copilot!

Conclusion

By following these steps, you can successfully integrate custom Azure AI Agents with Copilot Studio and M365 Copilot, enhancing the utility of your existing platform and improving operational efficiency. This integration allows you to automate tasks, streamline processes, and provide a better experience for your end users. Give it a try!

Curious how to bring custom models from your AI Foundry to your Copilot Studio solutions? Check out this blog.

The Future of AI: An Intern's Adventure Turning Hours of Video into Minutes of Meaning
This blog post, part of The Future of AI series by Microsoft's AI Futures team, follows an intern's journey in developing AutoHighlight—a tool that transforms long-form video into concise, narrative-driven highlight reels. By combining Azure AI Content Understanding with OpenAI reasoning models, AutoHighlight bridges the gap between machine-detected moments and human storytelling. The post explores the challenges of video summarization, the technical architecture of the solution, and the lessons learned along the way.

Interactive AI Avatars: Building Voice Agents with Azure Voice Live API
Azure Voice Live API recently reached General Availability, marking a significant milestone in conversational AI technology. This unified API surface doesn't just enable speech-to-speech capabilities for AI agents—it revolutionizes the entire experience by streaming interactions through lifelike avatars. Built on the powerful speech-to-speech capabilities of the GPT-4 Realtime model, Azure Voice Live API offers developers unprecedented flexibility:

- Out-of-the-box or custom avatars from Azure AI Services
- A wide range of neural voices, including Indic languages like the one featured in this demo
- A single API interface that handles both audio processing and avatar streaming
- Real-time responsiveness with sub-second latency

In this post, I'll walk you through building a retail e-commerce voice agent that demonstrates this technology. While this implementation focuses on retail apparel, the architecture is entirely generic and can be adapted to any domain—healthcare, banking, education, or customer support—by simply changing the system prompt and implementing domain-specific tools integration.

The Challenge: Navigating Uncharted Territory

At the time of writing, documentation for implementing avatar features with Azure Voice Live API is minimal. The protocol-specific intricacies around avatar video streaming and the complex sequence of steps required to establish a live avatar connection were quite overwhelming. This is where Agent mode in GitHub Copilot in Visual Studio Code proved extremely useful. Through iterative conversations with the AI agent, I successfully discovered the approach to implement avatar streaming without getting lost in low-level protocol details. Here's how different AI models contributed to this solution:

- Claude Sonnet 4.5: Rapidly architected the application structure, designing the hybrid WebSocket + WebRTC architecture with a TypeScript/Vite frontend and FastAPI backend
- GPT-5-Codex (Preview): Instrumental in implementing the complex avatar streaming components, handling WebRTC peer connections, and managing the bidirectional audio flow

Architecture Overview: A Hybrid Approach

The architecture comprises the following components.

🐳 Container Application Architecture

- Vite Server: Node.js-based development server that serves the React application. In development, it provides hot module replacement and proxies API calls to FastAPI. In production, the React app is built into static files served by FastAPI.
- FastAPI with ASGI: Python web framework running on the uvicorn ASGI server. ASGI (Asynchronous Server Gateway Interface) enables handling multiple concurrent connections efficiently, which is crucial for WebSocket connections and real-time audio processing.
🤖 AI & Voice Services Integration

- Azure Voice Live API: Primary service that manages the connection to the GPT-4 Realtime model, provides avatar video generation, neural text-to-speech, and WebSocket gateway functionality
- GPT-4 Realtime Model: Accessed through Azure Voice Live API for real-time audio processing, function calling, and intelligent conversation management

🔄 Communication Flows

- Audio Flow: Browser → WebSocket → FastAPI → WebSocket → Azure Voice Live API → GPT-4 Realtime Model
- Video Flow: Browser ↔ WebRTC Direct Connection ↔ Azure Voice Live API (bypasses backend for performance)
- Function Calls: GPT-4 Realtime (via Voice Live) → FastAPI Tools → Business APIs → Response → GPT-4 Realtime (via Voice Live)

🤖 Business Process Automation Workflows / RAG

- Shipment Logic App Agent: Analyzes orders, validates data, creates shipping labels, and updates tracking information
- Conversation Analysis Agent (Azure Logic App): Reviews complete conversations, performs sentiment analysis, generates quality scores with justification, and stores insights for continuous improvement
- Knowledge Retrieval: Azure AI Search is used to reason over manuals and help respond to customer queries on policies and products

The solution implements a hybrid architecture that leverages both WebSocket proxying and direct WebRTC connections for optimal performance. This design ensures the conversational audio flow remains manageable and secure through the backend, while the bandwidth-intensive avatar video streams directly to the browser for optimal performance (a minimal relay sketch follows the flow below).

The flow used in the avatar communication (Frontend ↔ FastAPI Backend ↔ Azure Voice Live API):

1. The frontend requests a session from the FastAPI backend.
2. The backend creates a session with Azure Voice Live API.
3. The backend sends the session configuration (with avatar settings) to Azure Voice Live API.
4. Azure Voice Live API returns session.updated, which includes the ICE servers.
5. The backend passes the ICE servers to the frontend.
6. The user clicks "Start Avatar".
7. The frontend creates an RTCPeerConnection with the ICE servers.
8. The frontend generates an SDP offer.
9. The frontend POSTs the offer to /avatar-offer on the backend.
10. The backend encodes and sends the SDP offer to Azure Voice Live API.
11. Azure Voice Live API responds with session.avatar.connecting, containing the SDP answer.
12. The backend returns the SDP answer to the frontend.
13. The frontend calls setRemoteDescription with the answer.
14. The WebRTC handshake completes as a direct connection between the browser and Azure Voice Live API.
15. The video/audio stream flows directly to the browser, bypassing the backend.

For more technical details behind the implementation, refer to the GitHub repo shared in this post. Here is a video of the demo of the application in action.
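As a rough companion to the audio flow described above, the backend relay can be reduced to the following sketch. The upstream WebSocket URL, the api-key header, and the message handling are placeholders and assumptions; the real Voice Live session setup, event schema, and authentication are more involved and are handled in the linked repository.

```python
# Minimal sketch of the backend audio relay (browser <-> FastAPI <-> Voice Live).
# VOICE_LIVE_WS_URL and the api-key header are placeholders; the actual service
# protocol, session configuration, and event types differ (see the linked repo).
import asyncio
import os

import websockets
from fastapi import FastAPI, WebSocket

app = FastAPI()
VOICE_LIVE_WS_URL = os.environ["VOICE_LIVE_WS_URL"]   # placeholder endpoint
API_KEY = os.environ["AZURE_VOICE_LIVE_KEY"]          # placeholder credential

@app.websocket("/ws/audio")
async def audio_relay(browser_ws: WebSocket) -> None:
    """Accept the browser WebSocket and shuttle messages to/from the upstream service."""
    await browser_ws.accept()
    # Note: the header argument is named extra_headers in older websockets releases.
    async with websockets.connect(
        VOICE_LIVE_WS_URL, additional_headers={"api-key": API_KEY}
    ) as service_ws:
        async def browser_to_service():
            while True:
                msg = await browser_ws.receive_text()   # audio/control events from the client
                await service_ws.send(msg)

        async def service_to_browser():
            async for msg in service_ws:                # responses, transcripts, avatar events
                await browser_ws.send_text(msg)

        await asyncio.gather(browser_to_service(), service_to_browser())
```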
Announcing the Grok 4 Fast Models from xAI: Now Available in Azure AI Foundry

These models, grok-4-fast-reasoning and grok-4-fast-non-reasoning, empower developers with distinct approaches to suit their application needs. Each model brings advanced capabilities such as structured outputs, long-context processing, and seamless integration with enterprise-grade security and governance. This release marks a significant step toward scalable, agentic AI systems that orchestrate tools, APIs, and domain data with low latency. Leveraging the Grok 4 Fast models within Azure AI Foundry Models accelerates the development of intelligent applications that combine speed, flexibility, and compliance. The unified model experience, paired with Azure's enterprise controls, positions the Grok 4 Fast models as foundational technologies for next-generation AI-powered workflows.

Why use the Grok 4 Fast Models on Azure

Modern AI applications are increasingly agentic—capable of orchestrating tools, APIs, and domain data at low latency. The Grok 4 Fast models were designed for these patterns: fast, intelligent, and agent-ready, enabling parallel tool use, JSON-structured outputs, and image input for multimodal understanding. Azure AI Foundry enhances these models with enterprise controls (RBAC, private networking, customer-managed keys), observability and evaluations, and first-party hosting through Foundry Models—helping teams move confidently from prototype to production. Beyond that, using the Grok 4 Fast models on Azure offers the following:

- Global scalability and reliability – Azure's worldwide infrastructure supports resilient, high-availability deployments across multiple regions.
- Integrated security and compliance – Enterprise-grade identity management, network isolation, encryption at rest and in transit, and compliance certifications help safeguard sensitive data and comply with regulatory requirements.
- Unified management experience – Centralized monitoring, governance, and cost controls through the Azure portal and Azure Resource Manager simplify operations and oversight.
- Native integration across Azure services – Easily connect to data sources, analytics, and other services like Azure Synapse, Cosmos DB, and Logic Apps for end-to-end solutions.
- Enterprise support and SLAs – Azure delivers 24/7 support, service-level agreements, and best-in-class reliability for mission-critical workloads.

By deploying Grok 4 Fast models on Azure, organizations can build robust, secure, and scalable AI applications with confidence and agility.

Key capabilities

The Grok 4 Fast models introduce a suite of advanced features designed to enhance agentic workflows and multimodal integration. With flexible model choices and powerful context handling, the Grok 4 Fast models are engineered for efficiency, scalability, and seamless deployment.

- Choose the reasoning level by selecting which Grok 4 Fast model to use:
  - grok-4-fast-reasoning: Optimized for fast reasoning in agentic workflows.
  - grok-4-fast-non-reasoning: Uses the same underlying weights but is constrained by a non-reasoning system prompt, offering a streamlined approach for specific tasks.
- Multimodal: Provides image understanding when deployed with the Grok image tokenizer.
- Tool use & structured outputs: Enables parallel function calling and supports JSON schemas for predictable integration.
- Long context: Supports approximately 131K tokens for deep, comprehensive understanding.
- Efficient H100 performance: Designed to run efficiently on H100 GPUs for agentic search and real-time orchestration.
Collectively, these features make the Grok 4 Fast models a robust and versatile solution for developers and enterprises looking to push the boundaries of AI-powered workflows.

What you can do with the Grok 4 Fast models

Building on the advanced capabilities of the Grok 4 Fast models, developers can unlock innovative solutions across a wide variety of applications. The following use cases highlight how these models streamline complex workflows, maximize efficiency, and accelerate intelligent automation with robust, scalable AI.

- Real-time agentic task orchestration: Automate and coordinate multi-step processes across systems with fast, flexible reasoning for dynamic business operations.
- Multimodal document analysis: Extract insights and process information from both text and images for comprehensive, context-aware understanding.
- Enterprise search and knowledge retrieval: Leverage long-context support for enhanced semantic search, surfacing relevant information from massive data repositories.
- Parallel tool integration: Invoke multiple APIs and functions simultaneously, enabling sophisticated workflows with structured, predictable outputs.
- Scalable conversational AI: Deploy high-capacity virtual agents capable of handling extended dialogues and nuanced queries with low latency.
- Customizable decision support: Empower users with AI-driven recommendations and scenario analysis tailored to organizational needs and governance requirements.

With the Grok 4 Fast models, developers are equipped to build and iterate on next-generation AI solutions, leveraging powerful tools and streamlined deployment workflows. Start shaping the future of intelligent applications by harnessing the speed, scalability, and multimodal capabilities of the Grok 4 Fast models today. The Grok 4 Fast models offer developers the speed, scalability, and multimodal capabilities needed to advance intelligent applications, supporting complex workflows and innovative solutions across a range of use cases.

Pricing for Grok 4 Fast Models on Azure AI Foundry

| Model | Deployment | Price ($/1M tokens) |
|---|---|---|
| grok-4-fast-reasoning | Global Standard (PayGo) | Input – $0.43; Output – $1.73 |
| grok-4-fast-non-reasoning | | |

Get started in minutes

With the Grok 4 Fast models, developers gain access to cutting-edge AI with a massive context window, efficient GPU performance, and enterprise-grade governance. Start building the future of AI today: visit the Model Catalog in Azure AI Foundry and deploy grok-4-fast-reasoning and grok-4-fast-non-reasoning to accelerate your innovation.
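For orientation, a call to a deployed Grok 4 Fast model can be as small as the sketch below, which uses the azure-ai-inference Python SDK. The endpoint, key, and model name are placeholders taken from your own Foundry deployment, not values from this announcement.

```python
# Sketch: calling a Grok 4 Fast deployment with the azure-ai-inference SDK.
# Endpoint, key, and deployment/model name are placeholders from your Foundry project.
from azure.ai.inference import ChatCompletionsClient
from azure.ai.inference.models import SystemMessage, UserMessage
from azure.core.credentials import AzureKeyCredential

client = ChatCompletionsClient(
    endpoint="https://<your-resource>.services.ai.azure.com/models",  # placeholder
    credential=AzureKeyCredential("<your-api-key>"),                  # placeholder
)

response = client.complete(
    model="grok-4-fast-reasoning",  # or "grok-4-fast-non-reasoning"
    messages=[
        SystemMessage(content="You are a concise planning assistant."),
        UserMessage(content="Outline a three-step agentic workflow for triaging support tickets."),
    ],
    max_tokens=512,
)
print(response.choices[0].message.content)
```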
Introduction
Modern development workflows increasingly rely on secure integrations between tools and platforms. Copilot Studio, with its ability to extend functionality through MCP (Model Context Protocol) tools, offers developers powerful customization options. However, when these tools need to access sensitive APIs or user-specific data, robust authentication becomes essential.

OAuth 2.0, particularly the Authorization Code Flow, is the industry standard for secure delegated access. It enables applications to obtain tokens on behalf of users without exposing credentials, ensuring compliance with enterprise security policies. In this guide, we'll walk through how to configure MCP tools in Copilot Studio using the OAuth 2.0 Authorization Code Flow, covering prerequisites, configuration steps, token handling, and best practices for a seamless and secure setup.

Disclaimer & Context
This article focuses on configuring an MCP tool within a Copilot Studio agent using the OAuth 2.0 Authorization Code Flow. Every environment is unique, so the approach outlined here should be treated as a starting point rather than a one-size-fits-all solution. You can enhance this setup with additional security measures such as app roles, conditional access policies, or by extending the Python code for advanced scenarios.

We assume readers have a basic understanding of Python, MCP concepts, OAuth 2.0 flows, and some familiarity with Copilot Studio. For deeper dives into these individual technologies, refer to the official documentation linked throughout this article.

Please note: this solution reflects the state of the technology at the time of writing. Given the fast-paced nature of these platforms, minor adjustments may be required as features evolve. MCP tool integration in Copilot Studio is currently a preview feature, so expect changes and improvements over time.

What is the Authorization Code Flow?
The Authorization Code Flow is designed for applications that can securely store a client secret (such as server-side apps). It allows the app to obtain an access token on behalf of the user without exposing their credentials. This flow uses an intermediate authorization code that is exchanged for tokens, adding an extra layer of security.

Steps in the flow:
1. User authentication: The user is redirected to the authorization server (in this case, Azure AD) to log in and grant consent.
2. Authorization code issued: After a successful login, the authorization server sends an authorization code to the app via the redirect URI.
3. Token exchange: The app sends the authorization code (plus client credentials) to the token endpoint to receive an access token (for API calls) and a refresh token (to renew access without user interaction).
4. API access: The app uses the access token to call protected resources.

For a detailed diagram of this flow, see Microsoft identity platform and OAuth 2.0 authorization code flow — Microsoft identity platform | Microsoft Learn.

High Level Architecture
Figure: High-level architecture with the MCP server as the backend and Copilot Studio as the front-end client.

Develop MCP server in VS Code
Clone the following repository and open it in VS Code:

git clone https://github.com/mafzal786/mcp-server.git

Run the following to execute it locally:

cd mcp-server
uv venv
uv sync
uv run mcpserver.py
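For orientation, here is a minimal sketch of what a streamable HTTP MCP server such as mcpserver.py could look like, assuming the FastMCP helper from the MCP Python SDK. The tool names and the hard-coded sample data are purely illustrative placeholders; the actual repository's tools and implementation may differ.

```python
# A minimal sketch, assuming the MCP Python SDK (pip install "mcp[cli]").
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("streamable-mcp-server")

@mcp.tool()
def get_stock_price(symbol: str) -> str:
    """Return a (placeholder) stock price for the given ticker symbol."""
    # Illustrative only: a real implementation would call a market-data API here.
    sample_prices = {"TSLA": 245.30, "MSFT": 430.10}
    price = sample_prices.get(symbol.upper())
    if price is None:
        return f"No price available for {symbol}."
    return f"{symbol.upper()} is trading at ${price:.2f} (sample data)."

@mcp.tool()
def get_weather(city: str) -> str:
    """Return a (placeholder) weather summary for the given city."""
    return f"The weather in {city} is sunny, 72 F (sample data)."

if __name__ == "__main__":
    # The streamable HTTP transport serves the MCP endpoint under /mcp,
    # which matches the server URL used later in the Copilot Studio connector.
    mcp.run(transport="streamable-http")
```

Tools defined this way are what Copilot Studio later discovers when you refresh the MCP tool list for the agent.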
Deploy MCP Server as Azure Container App
Deploy the MCP server to Azure Container Apps by running the following command. It can also be deployed in various other ways, such as via VS Code or a CI/CD pipeline; the Azure CLI is used here for simplicity.

az containerapp up \
  --resource-group <RESOURCE_GROUP_NAME> \
  --name streamable-mcp-server2 \
  --environment mcp \
  --location <REGION> \
  --source .

Configure Authentication for Azure Container App
Sign in to the Azure portal, open the container app, and click "Authentication".
For more details, visit the following link: Enable authentication and authorization in Azure Container Apps with Microsoft Entra ID | Microsoft Learn
Click "Add identity provider", select Microsoft from the drop-down, and leave everything else as is. This creates a new app registration for the container app.
As soon as authentication is configured, the container app becomes inaccessible except through OAuth.

Review App Registration of Container App — Backend
Visit App registrations and click streamable-mcp-server2 (in this case).
Click the Authentication tab and verify the redirect URIs. You should see a redirect URI for the container app ending with /.auth/login/aad/callback.
Now click "Expose an API" and confirm the Application ID URI is configured with a scope. Its format is api://<client id>.
Verify the API permissions and make sure you grant admin consent for your tenant. More scopes can be created depending on the data-access requirements.

Create App Registration for Client App — Copilot Studio
In these steps, we configure the app registration for the client app; Copilot Studio acts as the client in this case, as shown in the high-level architecture diagram earlier in this article.
Launch the Azure portal, go to App registrations, and click "New registration" to create a new app registration. Leave the redirect URL empty for now; it will be configured later, because Copilot Studio provides it when the custom MCP connector is created.
Click "API permissions" and then "Add a permission". Choose Microsoft Graph, then "Delegated permissions", and select email, openid, and profile. Make sure to grant admin consent.
Create a secret: click "Certificates & secrets", then "New client secret". Store the value, as it will be masked after some time; if that happens, you can always delete it and create a new secret.
Capture the following, as you will need them when configuring the MCP tool in Copilot Studio:
Client ID, from the Overview tab of the app registration.
Client secret, from the "Certificates & secrets" tab.
Next, configure API permissions for the backend, which is the app registration of the Azure Container App (streamable-mcp-server2 in this case):
Click the "API permissions" tab, then "Add a permission".
Click the "My APIs" tab and select streamable-mcp-server2.
Select "Delegated permissions" and choose the permission created earlier when authentication was configured for the Azure Container App.
Click "Add permission".
Finally, you MUST grant admin consent for this permission as well. This step is essential; without it, nothing will work.
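To see what the connector does behind the scenes, here is a minimal sketch of the same authorization-code-for-token exchange using MSAL for Python. The tenant ID, client ID, secret, redirect URI, and scope values are placeholders for the items captured above, not literal values from this article, and this script is only an illustration; Copilot Studio performs the equivalent exchange for you once the connector is configured.

```python
# A minimal sketch using MSAL for Python (pip install msal); all values are placeholders.
import msal

TENANT_ID = "<your-tenant-id>"
CLIENT_ID = "<copilot-studio-client-app-id>"        # client app registration
CLIENT_SECRET = "<copilot-studio-client-secret>"
REDIRECT_URI = "<redirect-uri-from-copilot-studio>"
# The backend scope exposed by the container app's app registration ("Expose an API").
SCOPES = ["api://<backend-client-id>/.default"]

app = msal.ConfidentialClientApplication(
    CLIENT_ID,
    client_credential=CLIENT_SECRET,
    authority=f"https://login.microsoftonline.com/{TENANT_ID}",
)

# Step 1: build the URL the user visits to sign in and grant consent.
auth_url = app.get_authorization_request_url(SCOPES, redirect_uri=REDIRECT_URI)
print("Sign in at:", auth_url)

# Step 2: after sign-in, Azure AD redirects back with ?code=...; exchange it for tokens.
auth_code = input("Paste the authorization code from the redirect URL: ")
result = app.acquire_token_by_authorization_code(
    auth_code, scopes=SCOPES, redirect_uri=REDIRECT_URI
)

if "access_token" in result:
    print("Access token acquired; it can now be sent as a Bearer token to the MCP server.")
else:
    print("Token request failed:", result.get("error_description"))
```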
MCP Tool configuration in Copilot Studio
Launch Copilot Studio at https://copilotstudio.microsoft.com/. Configuration of the environment and agent is beyond the scope of this article; it is assumed you already have an environment set up and an agent created. The following link explains how to create an agent in Copilot Studio: Quickstart: Create and deploy an agent — Microsoft Copilot Studio | Microsoft Learn

Inside the agent configuration, click "Add tool", then "New tool", and select Model Context Protocol.
Provide the relevant information for the MCP server. Make sure your server URL ends with your MCP path; in this case, it is the Azure Container App URL with /mcp at the end. Provide the server name and server description.
Select the OAuth 2.0 radio button and provide the following in the OAuth 2.0 section:
Client ID of the client app registration (copilot-studio-client, as configured earlier).
Client secret of the copilot-studio-client app registration.
Authorization URL: https://login.microsoftonline.com/common/oauth2/v2.0/authorize
Token URL template and refresh URL: https://login.microsoftonline.com/common/oauth2/v2.0/token
Scopes: openid, profile, email (the Microsoft Graph permissions selected earlier).
Click "Create". This provides a redirect URL, which you need to configure in the client app registration (copilot-studio-client in this case).

Configure Redirect URL in Client App Registration
Visit the client app registration (copilot-studio-client), click the Authentication tab, and add the redirect URL under Web redirect URIs.

Modify MCP connector in PowerApps
Now visit https://make.powerapps.com and open the newly created connector.
Change the Resource URL to the backend Application ID URI (from "Expose an API" in the streamable-mcp-server2 app registration) and add .default to the scope.
Provide the secret of the client app registration, as the connector cannot be updated without it; this is an extra security measure for updating connectors in Power Apps.
Click "Update connector".

CORS Configuration
Congratulations on getting this far! We are getting close to making it work. CORS configuration is a must, because our Azure Container App is a remote MCP server on a completely different domain (origin).

Power Apps and CORS for External Domains: Brief Overview
When embedding or integrating Power Apps with external web applications or APIs, Cross-Origin Resource Sharing (CORS) becomes a critical consideration. CORS is a browser security feature that restricts web pages from making requests to a different domain than the one that served the page, unless explicitly allowed.
Key points:
Power Apps hosted on *.powerapps.com or within Microsoft 365 domains will block calls to external APIs unless those APIs include the proper CORS headers.
The external API must return:
Access-Control-Allow-Origin: https://apps.powerapps.com (or * for all origins, though not recommended for production)
Access-Control-Allow-Methods: GET, POST, OPTIONS (or as needed)
Access-Control-Allow-Headers: Content-Type, Authorization (and any custom headers)
If the API requires authentication (e.g., OAuth 2.0), ensure preflight OPTIONS requests are handled correctly.
For scenarios where you cannot modify the external API, consider using Power Automate flows as a proxy, or Azure API Management or Azure Functions to inject CORS headers.
Always validate the security implications before enabling wide-open CORS. A quick way to verify your CORS setup from the command line is shown in the sketch below.
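As a sanity check, the following sketch uses Python's requests library to send the same OPTIONS preflight request a browser would send from the Power Apps origin and prints the CORS headers that come back. The container app URL is a placeholder for your own deployment; if the headers come back empty, the CORS settings described in the next section are not yet in effect.

```python
# A minimal CORS preflight check, assuming the requests library (pip install requests).
import requests

# Placeholder: replace with your own container app URL.
MCP_URL = "https://streamable-mcp-server2.<region>.azurecontainerapps.io/mcp"

# Simulate the browser's preflight request from the Power Apps origin.
preflight_headers = {
    "Origin": "https://apps.powerapps.com",
    "Access-Control-Request-Method": "POST",
    "Access-Control-Request-Headers": "Content-Type, Authorization",
}

response = requests.options(MCP_URL, headers=preflight_headers, timeout=10)

print("Status:", response.status_code)
for header in (
    "Access-Control-Allow-Origin",
    "Access-Control-Allow-Methods",
    "Access-Control-Allow-Headers",
):
    # Missing values here typically mean the preflight will be blocked by the browser.
    print(f"{header}: {response.headers.get(header, '<missing>')}")
```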
If CORS is not set up, you will encounter an error in Copilot Studio; pressing F12 (browser developer tools) shows the CORS policy blocking the request to the container app.

Azure Container Apps provide a convenient way to configure CORS in the Azure portal:
Launch the Azure portal and open the Azure Container App (streamable-mcp-server2 in this case).
Click CORS under the Networking section.
In the "Allowed Origins" section, add the required origins. localhost is added here to allow testing from a local laptop, although it is not required for Copilot Studio.
Click the "Allowed Methods" tab and provide the required methods.
Provide the wildcard "*" in the "Allowed Headers" tab. This is done here for simplicity and is not recommended for production systems; restrict the headers for added security.
Click "Apply". This configures CORS for the remote application.

Test the connector
We are in the final stages of configuring the connector; it is time to test whether everything is configured correctly and works.
Launch https://make.powerapps.com, click "Custom connectors", select your configured connector, and click the "5. Test" tab.
If you are running it for the first time, the selected connection will be blank. Click "+ New connection". The new connection launches the authorization flow, and a browser dialog pops up to request the authorization code. Click "Create" and complete the login process. This creates a successful connection.
Click "Test operation". A 406 response means everything is configured correctly.

Test MCP Tool in Copilot Studio
Launch Copilot Studio, click the agent you created in the earlier steps, and open the "Tools" tab.
Select your MCP tool and make sure it is enabled; if other tools are attached to the same agent, disable them for now while testing.
Make sure the connection created while testing the custom connector is available. You can also initiate a fresh connection from the drop-down under "Connection".
Refreshing the tools shows all the tools available on this MCP server.
Provide a prompt such as "Give me the stock price of Tesla". This triggers the MCP server and calls the respective method to retrieve Tesla's stock price. Then try a weather-related question to see more.

Conclusion
Securing MCP tools in Copilot Studio with the OAuth 2.0 Authorization Code Flow is a critical step toward building enterprise-ready integrations. By leveraging this flow, you ensure that user credentials remain protected while enabling delegated access to sensitive APIs and resources.
The approach outlined here provides a solid foundation, but it is only the beginning. As environments differ, you should evaluate additional security enhancements such as app roles, conditional access policies, and token lifecycle management to meet organizational compliance standards.
Remember, MCP integration in Copilot Studio is still a preview feature, and the ecosystem evolves rapidly. Stay informed, revisit configurations periodically, and adapt to new best practices as they emerge. With a thoughtful implementation, you can unlock the full potential of MCP tools while maintaining robust security and user trust.