Azure AI Services

Seamlessly Integrating Azure Document Intelligence with Azure API Management (APIM)
In today's data-driven world, organizations are increasingly turning to AI for document understanding. Whether it's extracting invoices, contracts, ID cards, or complex forms, Azure Document Intelligence (formerly known as Form Recognizer) provides a robust, AI-powered solution for automated document processing. But what happens when you want to scale, secure, and load balance your document intelligence backend for high availability and enterprise-grade integration? Enter Azure API Management (APIM): your gateway to efficient, scalable API orchestration.

In this blog, we'll explore how to integrate Azure Document Intelligence with APIM using a load-balanced architecture that works seamlessly with the Document Intelligence SDK, without rewriting your application logic. The Document Intelligence SDKs simplify working with long-running document analysis operations, particularly asynchronous calls, by handling the polling and response parsing under the hood.

Why Use API Management with Document Intelligence?

While the SDK is great for client-side development, APIM adds essential capabilities for enterprise-scale deployments:

🔐 Security and authentication at the gateway level
⚖️ Load balancing across multiple backend instances
🔁 Circuit breakers, caching, and retries
📊 Monitoring and analytics
🔄 Response rewriting and dynamic routing

By routing all SDK and API calls through APIM, you get full control over traffic flow, visibility into usage patterns, and the ability to scale horizontally with multiple Document Intelligence backends.

SDK Behavior with Document Intelligence

When you use the Document Intelligence SDK (e.g., begin_analyze_document), it follows a two-step pattern:

1. A POST request to initiate document analysis
2. Polling (GET) requests to the operation-location URL until results are ready

This is an asynchronous pattern in which the SDK expects a polling URL in the response to the POST. If you're not careful, this polling can bypass APIM, which defeats the purpose of using APIM in the first place. So what do we do?

The Smart Rewrite Strategy

We use APIM to intercept and rewrite the response from the POST call.

POST flow:

1. The SDK sends a POST to: https://apim-host/analyze
2. APIM routes the request to one of the backend services: https://doc-intel-backend-1/analyze
3. The backend responds with: operation-location: https://doc-intel-backend-1/operations/123
4. APIM rewrites this header before returning to the client: operation-location: https://apim-host/operations/poller?backend=doc-intel-backend-1

Now the SDK automatically polls APIM, not the backend directly.

GET (polling) flow (the path should be set to /operations/123 in the GET operation of APIM):

1. The SDK polls: https://apim-host/operations/123?backend=doc-intel-backend-1
2. APIM extracts the query parameter backend=doc-intel-backend-1
3. APIM dynamically sets the backend URL for this request to: https://doc-intel-backend-1
4. APIM forwards the request to: https://doc-intel-backend-1/operations/123
5. The backend sends the status/result back to APIM, which APIM returns to the SDK.

All of this happens transparently to the SDK.
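To make the two-step pattern concrete, here is a minimal sketch of the same POST-then-poll flow done with raw HTTP against the APIM gateway. The host, key header, and api-version are placeholders for your own configuration; the analyze path and operation-location header match the operations used in the policies below.

```python
# A sketch of the POST + poll pattern the SDK performs under the hood,
# routed through a hypothetical APIM gateway. Adjust host, key header
# name, and api-version to your environment.
import time
import requests

APIM_HOST = "https://apim-host"  # hypothetical APIM gateway
headers = {
    "Ocp-Apim-Subscription-Key": "<apim-subscription-key>",
    "Content-Type": "application/json",
}

# Step 1: POST to start the analysis. The outbound policy shown below
# rewrites the operation-location header in this response so it points
# back at APIM instead of the backend.
resp = requests.post(
    f"{APIM_HOST}/documentintelligence/documentModels/prebuilt-read:analyze"
    "?api-version=2024-11-30",
    headers=headers,
    json={"urlSource": "https://example.com/sample-invoice.pdf"},
)
poll_url = resp.headers["operation-location"]  # now an APIM URL, not the backend

# Step 2: poll the rewritten URL until the analysis completes.
while True:
    result = requests.get(poll_url, headers=headers).json()
    if result.get("status") in ("succeeded", "failed"):
        break
    time.sleep(2)
print(result["status"])
```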
Sample policies

Outbound policy for the POST operation (/documentintelligence/documentModels/prebuilt-read:analyze):

```xml
<!--
    - Policies are applied in the order they appear.
    - Position <base/> inside a section to inherit policies from the outer scope.
    - Comments within policies are not preserved.
-->
<!-- Add policies as children to the <inbound>, <outbound>, <backend>, and <on-error> elements -->
<policies>
    <!-- Throttle, authorize, validate, cache, or transform the requests -->
    <inbound>
        <base />
    </inbound>
    <!-- Control if and how the requests are forwarded to services -->
    <backend>
        <base />
    </backend>
    <!-- Customize the responses -->
    <outbound>
        <base />
        <set-header name="operation-location" exists-action="override">
            <value>@{
                // Original operation-location from backend
                var originalOpLoc = context.Response.Headers.GetValueOrDefault("operation-location", "");
                // Encode original URL to pass as query parameter
                var encoded = System.Net.WebUtility.UrlEncode(originalOpLoc);
                // Construct APIM URL pointing to poller endpoint with backendUrl
                var apimUrl = $"https://tstmdapim.azure-api.net/document-intelligent/poller?backendUrl={encoded}";
                return apimUrl;
            }</value>
        </set-header>
    </outbound>
    <!-- Handle exceptions and customize error responses -->
    <on-error>
        <base />
    </on-error>
</policies>
```

Inbound policy for the GET operation (note: the path for the GET should be modified to /document-intelligent/poller):

```xml
<policies>
    <!-- Throttle, authorize, validate, cache, or transform the requests -->
    <inbound>
        <base />
        <choose>
            <when condition="@(context.Request.Url.Query.ContainsKey("backendUrl"))">
                <set-variable name="decodedUrl" value="@{
                    var backendUrlEncoded = context.Request.Url.Query.GetValueOrDefault("backendUrl", "");
                    // Make sure to decode the URL properly, potentially multiple times if needed
                    var decoded = System.Net.WebUtility.UrlDecode(backendUrlEncoded);
                    // Check if it's still encoded and decode again if necessary
                    while (decoded.Contains("%"))
                    {
                        decoded = System.Net.WebUtility.UrlDecode(decoded);
                    }
                    return decoded;
                }" />
                <!-- Log the decoded URL for debugging; remove if not needed -->
                <trace source="Decoded URL">@((string)context.Variables["decodedUrl"])</trace>
                <send-request mode="new" response-variable-name="backendResponse" timeout="30" ignore-error="false">
                    <set-url>@((string)context.Variables["decodedUrl"])</set-url>
                    <set-method>GET</set-method>
                    <authentication-managed-identity resource="https://cognitiveservices.azure.com/" />
                </send-request>
                <return-response response-variable-name="backendResponse" />
            </when>
            <otherwise>
                <return-response>
                    <set-status code="400" reason="Missing backendUrl query parameter" />
                    <set-body>{"error": "Missing backendUrl query parameter."}</set-body>
                </return-response>
            </otherwise>
        </choose>
    </inbound>
    <!-- Control if and how the requests are forwarded to services -->
    <backend>
        <base />
    </backend>
    <!-- Customize the responses -->
    <outbound>
        <base />
    </outbound>
    <!-- Handle exceptions and customize error responses -->
    <on-error>
        <base />
    </on-error>
</policies>
```

Load Balancing in APIM

You can configure multiple backend services in APIM and use built-in load-balancing policies to:

- Distribute POST requests across multiple Document Intelligence instances
- Use custom headers or variables to control backend selection
- Handle failure scenarios with circuit breakers and retries

Reference: Azure API Management backends – Microsoft Learn
Sample: Using APIM Circuit Breaker & Load Balancing – Microsoft Community Hub
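From the application's point of view, nothing changes: the SDK is simply pointed at the APIM endpoint. A minimal sketch using the azure-ai-documentintelligence package is shown below; the endpoint, key, and sample document URL are placeholders, and exact parameter shapes vary slightly between SDK versions, so check the package you have installed.

```python
# A sketch of pointing the Document Intelligence SDK at the APIM gateway
# instead of the backend. Because APIM rewrites operation-location, the
# SDK's poller transparently polls APIM with no application changes.
from azure.core.credentials import AzureKeyCredential
from azure.ai.documentintelligence import DocumentIntelligenceClient

client = DocumentIntelligenceClient(
    endpoint="https://tstmdapim.azure-api.net",  # APIM gateway, not the backend;
                                                 # include your API path if APIM uses one
    credential=AzureKeyCredential("<apim-subscription-key>"),
)

poller = client.begin_analyze_document(
    "prebuilt-read",
    {"urlSource": "https://example.com/sample-invoice.pdf"},  # placeholder document
)
result = poller.result()  # polling goes through APIM thanks to the header rewrite
print(result.content)
```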
Conclusion

By integrating Azure Document Intelligence with Azure API Management's native capabilities, such as load balancing, header rewriting, authentication, and rate-limiting policies, organizations can transform their document processing workflows into scalable, secure, and efficient systems.

Ground your AI agents with knowledge from Bing Search, Microsoft Fabric, SharePoint and more
Today, we are thrilled to announce the upcoming preview of Azure AI Agent Service, a comprehensive suite of capabilities designed to empower developers to securely build, deploy, and scale high-quality, extensible, and reliable AI agents. By leveraging an extensive ecosystem of models, tools, and capabilities from OpenAI, Microsoft, and industry-leading partners such as Meta, Azure AI Agent Service enables developers to efficiently create agents for a wide range of generative AI use cases.

In this blog, we will explore the knowledge integration capabilities of Azure AI Agent Service, designed not only to streamline the creation of retrieval-augmented generation (RAG) workflows, but also to empower developers to build intelligent, knowledge-driven AI agents.

Grounding AI Agents with Knowledge

Knowledge is the foundation of accurate, grounded responses, allowing Azure AI Agent Service to make informed decisions with confidence. By integrating comprehensive and accurate data, Azure AI Agent Service enhances precision and provides effective solutions, elevating the overall customer experience.

With the preview of Azure AI Agent Service, you can ground your agent's responses using data from Bing Search, Microsoft Fabric, SharePoint, Azure AI Search, Azure Blob Storage, your local files, and even your own licensed data. These data sources enable grounding with diverse data types, from enterprise private data and public web data to your own licensed data, structured or unstructured. Enterprise-grade security features, such as On-Behalf-Of (OBO) authorization, ensure your data is stored, retrieved, and accessed in a way that meets the highest standards of privacy and protection.

Key Capabilities

Leverage Real-Time Public Web Data with Grounding with Bing Search

LLMs can sometimes generate outdated content. By grounding your agent with Bing Search, you can overcome this limitation and create more reliable and trustworthy applications. Grounding with Bing Search allows your agents to integrate real-time public web data, ensuring their responses are accurate and up to date. By including supporting URLs and search query links, Grounding with Bing Search enhances trust and transparency, empowering users to verify responses against the original sources.
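In code, attaching Grounding with Bing Search to an agent looks roughly like the sketch below, based on the azure-ai-projects preview SDK. The connection name, model deployment, and instructions are placeholders, and parameter names can differ between preview releases.

```python
# A minimal sketch of grounding an agent with Bing Search, assuming the
# azure-ai-projects preview SDK and a Bing connection already configured
# on the Azure AI Foundry project.
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import BingGroundingTool

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)

# Look up the Bing connection configured on the project (name is hypothetical).
bing_connection = project_client.connections.get(connection_name="my-bing-connection")
bing = BingGroundingTool(connection_id=bing_connection.id)

agent = project_client.agents.create_agent(
    model="gpt-4o",  # your model deployment name
    name="news-agent",
    instructions="Answer with up-to-date information and cite your sources.",
    tools=bing.definitions,
)
```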
Empower Data-Driven Decisions with Microsoft Fabric

Integrate your Azure AI agent with Fabric AI Skill to unlock powerful data analysis capabilities. Fabric AI Skill transforms enterprise data into conversational Q&A systems, allowing users to interact with the data through chat and uncover data-driven, actionable insights effortlessly. With OBO authorization, this integration simplifies access to enterprise data in Fabric while maintaining robust security, ensuring proper access control and enterprise-grade protection.

Connect Private Data Securely with SharePoint

Azure AI Agent Service supports grounding responses with your data in SharePoint (coming soon). This integration makes your SharePoint content more accessible to your end users. Enterprise-grade security features, such as OBO authorization for SharePoint, ensure secure and controlled access for end users.

Ground Private Data with Azure AI Search, Azure Blob Storage, and Your Local Files

Azure AI Agent Service supports connecting private data from various sources, such as Azure AI Search, Azure Blob Storage, and local files, to enhance responses. Bring your existing Azure AI Search index or create a new one using the improved File Search tool. This tool leverages a built-in ingestion pipeline to process files from your local system or Azure Blob Storage. With the new File Search tool, your files remain in your own storage, and your Azure AI Search resource is used to ingest them, ensuring you maintain complete control over your data.

Enrich Responses with Your Licensed Data

Azure AI Agent Service also integrates your own licensed data from specialized data providers, such as Tripadvisor. Enhance the quality of your agent's responses with high-quality, fresh data, such as travel guidance and reviews. These insights empower your agents to deliver nuanced, informed solutions tailored to specific use cases.

"We're excited to partner with Microsoft as the first data and intelligence provider for its Azure AI Agent Service," said Rahul Todkar, Vice President, Head of Data and AI at Tripadvisor. "At Tripadvisor, we are focused on leveraging the power of Data and Generative AI to benefit all travelers and partners across the globe. With this new partnership we are making available to developers a set of APIs that provide access to granular Tripadvisor data, content and intelligence. This will allow developers and AI engineers/scientists to use our robust travel data across a broad array of AI and ML use cases including building AI agents, contextually relevant recommendations and drive increased personalization."

What's Next

- Sign up for the private preview: contact your account executive.
- Learn more: watch our breakout session on Azure AI Agent Service.
- See it in action: check out our demo session on building custom agents with models and tools.
- Start building: explore single and multi-agent solutions with our Azure AI Agent Service code samples.

Navigating AI Solutions: Microsoft Copilot Studio vs. Azure AI Foundry
Are you looking to build custom Copilots but unsure about the differences between Copilot Studio and Azure AI Foundry? As a Microsoft Technical Trainer with over a decade of experience, I've spent the last 18 months focusing on Azure AI solutions and Copilot. Through numerous workshops, I've seen firsthand how customers benefit from AI solutions beyond Microsoft Copilot.

Microsoft 365 Copilot Chat offers seamless integration with generative AI for tasks like document creation, content summarization, and insights from M365 solutions such as Email, OneDrive, SharePoint, and Teams. It ensures compliance with organizational security, governance, and privacy policies, making it ideal for immediate AI assistance without customization. On the other hand, platforms like Copilot Studio and Azure AI Foundry provide greater customization and flexibility, tailoring AI assistance to specific business processes, workflows, and data sources for more relevant support. In this blog, I'll share insights on building custom copilots and the tools Microsoft offers to support this journey.

Technical Insights into Two Leading AI Platforms

Copilot Studio and Azure AI Foundry are two flagship platforms within the Microsoft AI ecosystem, each tailored for distinct purposes. Both are integral to the development and deployment of AI-driven solutions. Let's dive into a comprehensive comparison to explore how they differ in scope, target audience, and use cases.

Target Audience

Copilot Studio: Copilot Studio is ideal for business users and developers looking to implement conversational AI with minimal setup. It is well suited for industries like retail, customer service, and human resources.

Azure AI Foundry: Azure AI Foundry caters to software developers, data scientists, and technical decision-makers focused on building complex, scalable AI solutions. It is commonly used by enterprises in healthcare, manufacturing, and finance.

Core Solution Focus

Copilot Studio: Copilot Studio is centered around creating and customizing conversational copilots and bots, often made available to users as "virtual assistants." It emphasizes a low-code/no-code environment, making it accessible to organizations looking to integrate AI-powered assistants into their workflows without developing or writing code. Its primary goal is to enable tailored conversational experiences through customizable plugins (offering both Microsoft and third-party connectors to interact with), generative AI, and integration with tools like Microsoft Teams, Power Platform, Slack, Facebook, and others. Copilot Studio is accessible from https://copilotstudio.microsoft.com and can be used through different licensing options.

Image 1: Copilot Studio interface with the different tabs to customize your copilot, as well as the testing pane.

Azure AI Foundry: Azure AI Foundry, conversely, is a robust platform designed for developing AI applications and solutions at scale. It focuses on foundational AI tools, including an extensive catalog of large language models that allow fine-tuning, tracing, evaluations, and observability. Targeted at developers and data scientists, Azure AI Foundry provides access to a suite of pre-trained models, a unified SDK, and deeper integration with Azure's cloud ecosystem. The Azure AI Foundry management center is available from https://ai.azure.com.
While there is no specific license cost for using Azure AI Foundry, note that the underlying Azure services, such as Azure OpenAI, Azure AI Search, and the LLMs themselves, incur consumption costs.

Image 2: Azure AI Foundry management center, allowing for model deployment, fine-tuning, AI Search index integration, and more.

Capabilities Overview

Customizability

Copilot Studio enables organizations to build conversational bots with extensive customization options. The best part is that users don't need developer skills; they can add plugins, integrate APIs, and tailor responses dynamically. For example, a retail company can create a chatbot using Copilot Studio to assist customers in real time, pull product data from SharePoint, and answer queries about pricing and availability. You could also build a virtual assistant that helps conference attendees with questions and provides info on speakers, schedules, travel information, and more.

Image 3: Conference virtual assistant responding to a prompt about the conference agenda and offering detailed information on titles, speakers, sessions, and timings.

Azure AI Foundry specializes in advanced AI capabilities like retrieval-augmented generation (RAG), model benchmarking, and multi-modal integrations. For instance, Azure AI Foundry allows a healthcare organization to use generative AI models to analyze large datasets and create research summaries while ensuring data compliance and security.

Image 4: Azure AI Foundry Safety + Security management options, following Microsoft Responsible AI Framework guidelines.

Ease of Use

Copilot Studio is designed with simplicity in mind. Its interface supports drag-and-drop functionality, prebuilt templates, and intuitive prompt creation. Users with minimal technical expertise can quickly deploy solutions without complex coding. Azure AI Foundry, while powerful, demands higher technical proficiency. Its SDKs and APIs are tailored for experienced developers seeking granular control over AI workflows. For example, Azure AI Foundry's model fine-tuning capabilities require an understanding of machine learning, while Copilot Studio abstracts much of this complexity.

Integration with Other Platforms and Tools

Copilot Studio Integration

Copilot Studio seamlessly integrates with Microsoft Office applications like Teams, Outlook, and OneDrive, offering conversational plugins that enhance productivity. For instance, organizations can extend Microsoft 365 Copilot with enterprise-specific scenarios, such as HR bots for employee onboarding.

Image 5: For example, Copilot Studio can integrate with email and Microsoft Dynamics.

Azure AI Foundry Integration

Azure AI Foundry connects deeply with the Azure ecosystem, including Azure Machine Learning, Azure OpenAI Service, and Azure AI Search. Developers and AI engineers can experiment with multiple models and deploy AI workflows, and its unified SDK supports integration into GitHub, Visual Studio, and Microsoft Fabric. It also integrates with other AI tools such as Prompt Flow, Semantic Kernel, and more.

Image 6: The VS Code Prompt Flow extension can be used by developers to build and validate chat functionality while connecting to Azure AI Foundry in the backend.

Use Case Examples

Real-Time Assistance with Copilot Studio

An airline can use Copilot Studio to create an interactive chatbot that assists travelers with flight details, weather forecasts, and booking management.
The platform's dynamic chaining capabilities allow the bot to call multiple APIs (e.g., weather and ticketing services) and provide contextual answers seamlessly.

Advanced AI Applications with Azure AI Foundry

A manufacturing company can leverage Azure AI Foundry to optimize production processes. By using multi-modal models, the company can analyze visual data from factory cameras alongside operational metrics to identify inefficiencies and recommend improvements.

Getting Started

I hope it is becoming clearer by now which path you could follow to start building your custom copilots. As a Learn expert, I also know that customers mostly learn best by doing. To get you started, I would personally recommend going through the following Microsoft Learn tutorials:

Copilot Studio:

- Create and deploy an agent: This tutorial guides you through creating and deploying an agent using Copilot Studio. It covers adding knowledge to your agent, testing content changes in real time, and deploying your agent to a test page. Link to tutorial.
- Building agents with generative AI: This tutorial helps you create an agent with generative AI capabilities. It provides a summary of available features and prerequisites for getting started. Link to tutorial.
- Create and publish agents: This module introduces key concepts for creating agents based on business scenarios that customers and employees can interact with. Link to tutorial.

Azure AI Foundry:

- Build a basic chat app in Python: This tutorial walks you through setting up your local development environment with the Azure AI Foundry SDK, writing prompts, running app code, tracing LLM calls, and running basic evaluations. Link to tutorial.
- Use the chat playground: This quickstart guides you through deploying a chat model and using it in the chat playground within the Azure AI Foundry portal. Link to tutorial.
- Azure AI Foundry documentation: This comprehensive documentation helps developers and organizations rapidly create intelligent applications with prebuilt and customizable APIs and models. Link to tutorial.

Conclusion

While Copilot Studio and Azure AI Foundry share Microsoft's vision for democratizing AI, they are typically used by different audiences and serve distinct purposes. Copilot Studio is the go-to platform for conversational AI and low-code deployments, making it accessible for businesses and their users aiming to enhance customer and employee interactions. Azure AI Foundry is a powerhouse for advanced AI application development, enabling organizations to leverage cutting-edge models and tools for data-driven insights and innovation, but it requires advanced development skills to build such AI-powered applications.

Choosing between Copilot Studio and Azure AI Foundry depends on the specific needs and technical expertise of the organization. If you are new to AI, a good place to start is with Copilot Studio, and then grow into more advanced scenarios with Azure AI Foundry.

Introducing Azure AI Content Understanding for Beginners
Enterprises today face several challenges in processing and extracting insights from multimodal data: managing diverse data formats, ensuring data quality, and streamlining workflows efficiently. Ensuring the accuracy and usability of extracted insights often requires advanced AI techniques, while inefficiencies in managing large data volumes increase costs and delay results. Azure AI Content Understanding addresses these pain points by offering a unified solution to transform unstructured data into actionable insights, improve data accuracy with schema extraction and confidence scoring, and integrate seamlessly with Azure's ecosystem to enhance efficiency and reduce costs.

Content Understanding makes it easy to extract custom, task-specific output without advanced GenAI skills. It enables a quick path to scale for retrieval-augmented generation (RAG) grounded by multimodal data, or transactional content processing for agent workflows and process automation.

We are excited to announce a new video series to help you get started with Azure AI Content Understanding and extract task-specific output for your business. Whether you're looking for a well-rounded overview, want to discover how to develop a RAG index over video content, or learn how to build a post-call analytics workflow, this series has something for everyone.

What is Azure AI Content Understanding?

Azure AI Content Understanding is a new Azure AI service, designed to process and transform content of any type, including documents, images, videos, audio, and text, into a user-defined output schema. This streamlined process allows developers to reason over large amounts of unstructured data, accelerating time-to-value by generating output that can be easily integrated into agentic, automation, and analytical workflows.

Video Series Highlights

1. Azure AI Content Understanding: How to Get Started - Vinod Kurpad, Principal GPM, AI Services, shows how you can process content of any modality (audio, video, documents, and text) in a unified workflow in Azure AI Foundry using Azure AI Content Understanding. It's simple, intuitive, and doesn't require any GenAI skills.

2. Post-call Analytics Using Azure AI Content Understanding - Jan Goergen, Senior Program Manager, AI Services, shows how to process any number of video or audio call recordings quickly in Azure AI Foundry by leveraging the Post-Call Analytics template powered by Content Understanding. The video also introduces the broader concept of templates, illustrating how you can embed Content Understanding into reusable templates that you can build, deploy, and share across projects.

3. RAG on Video Using Azure AI Content Understanding - Joe Filcik, Principal Product Manager, AI Services, shows how you can process videos and ground them on your data with multimodal retrieval-augmented generation (RAG) to derive insights that would otherwise take much longer. Joe demonstrates how this can be achieved using a single Azure AI Content Understanding API in Azure AI Foundry.

Why Azure AI Content Understanding?

The Azure AI Content Understanding service is ideal for enterprises and developers looking to process large amounts of multimodal content, such as call center recordings and videos for training and compliance, without requiring GenAI skills such as prompt engineering and model selection. Enjoy the video series and start exploring the possibilities with Azure AI Content Understanding.
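If you prefer to see the API shape before watching the videos, the rough sketch below shows calling an analyzer over REST. The route, api-version, analyzer id, and request body are assumptions based on the preview documentation at the time of writing; verify them against the current Content Understanding reference before use.

```python
# A rough sketch of invoking a Content Understanding analyzer over REST.
# The path, api-version, and analyzer id are placeholders/assumptions.
import time
import requests

ENDPOINT = "https://<your-ai-services-resource>.cognitiveservices.azure.com"
headers = {"Ocp-Apim-Subscription-Key": "<key>", "Content-Type": "application/json"}

# Start an analysis with a prebuilt or custom analyzer (id is hypothetical).
resp = requests.post(
    f"{ENDPOINT}/contentunderstanding/analyzers/my-call-analyzer:analyze"
    "?api-version=2024-12-01-preview",
    headers=headers,
    json={"url": "https://example.com/recording.mp3"},  # placeholder content URL
)
op_url = resp.headers["Operation-Location"]  # standard Azure async pattern

# Poll until the user-defined schema fields have been extracted.
while True:
    result = requests.get(op_url, headers=headers).json()
    if result.get("status", "").lower() in ("succeeded", "failed"):
        break
    time.sleep(2)
print(result)
```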
For additional resources:

- Watch the Video Series
- Try it in Azure AI Foundry
- Content Understanding documentation
- Content Understanding samples

Feedback? Contact us at cu_contact@microsoft.com

Dynamic Tool Discovery: Azure AI Agent Service + MCP Server Integration
At the time of this writing, Azure AI Agent Service does not offer turnkey integration with Model Context Protocol (MCP) servers. Discussed here is a solution that helps you leverage MCP's powerful capabilities while working within the Azure ecosystem. The integration approach piggybacks on the Function integration capability in the Azure AI Agent Service: by utilizing an MCP client to discover tools from an MCP server and register them as Functions with the Agent Service, we create a seamless integration layer between the two systems.

Built using the Microsoft Bot Framework, this application can be published as an AI assistant across numerous channels like Microsoft Teams, Slack, and others. For development and testing purposes, we've used the Bot Framework Emulator to run and validate the application locally.

Architecture Overview

The solution architecture consists of several key components:

- MCP Server: Hosted in Azure Container Apps, the MCP server connects to Azure Blob Storage using Managed Identity, providing secure, token-based authentication without the need for stored credentials.
- Azure AI Agent Service: The core intelligence platform that powers our AI assistant. It leverages various tools, including the native Bing Search tool for retrieving news content, dynamically registered MCP tools for storage operations, and the GPT-4o model for natural language understanding and generation.
- Custom AI Assistant Application: Built with the Microsoft Bot Framework, this application runs locally during development but could be hosted in Azure Container Apps for production use. It serves as the bridge between user interactions and the Azure AI Agent Service.
- Integration Layer: The MCP client within our application discovers available tools from the MCP server and registers them with the Azure AI Agent Service, enabling seamless function calling between these systems.

Technical Implementation

MCP Tool Discovery and Registration

The core of our integration lies in how we discover MCP tools and register them with the Azure AI Agent Service. Let's explore the key components of this process.

Tool Discovery Process

The agent.py file contains the logic for connecting to the MCP server, discovering available tools, and registering them with the Azure AI Agent Service:

```python
import asyncio

# FunctionTool comes from the Azure AI Projects SDK models;
# ServerConnection is a helper from the sample repo (see sketch below).

# Fetch tool schemas from MCP Server
async def fetch_tools():
    conn = ServerConnection(mcp_server_url)
    await conn.connect()
    tools = await conn.list_tools()
    await conn.cleanup()
    return tools

tools = asyncio.run(fetch_tools())

# Build a function for each tool
def make_tool_func(tool_name):
    def tool_func(**kwargs):
        async def call_tool():
            conn = ServerConnection(mcp_server_url)
            await conn.connect()
            result = await conn.execute_tool(tool_name, kwargs)
            await conn.cleanup()
            return result
        return asyncio.run(call_tool())
    tool_func.__name__ = tool_name
    return tool_func

functions_dict = {tool["name"]: make_tool_func(tool["name"]) for tool in tools}
mcp_function_tool = FunctionTool(functions=list(functions_dict.values()))
```

This approach dynamically creates Python function stubs for each MCP tool, which can then be registered with the Azure AI Agent Service.
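The ServerConnection helper above comes from the sample repo. A minimal equivalent using the official `mcp` Python package might look like the sketch below; it assumes an SSE transport, so adjust to whatever transport your server exposes.

```python
# A sketch of a ServerConnection-style helper built on the `mcp` package.
from mcp import ClientSession
from mcp.client.sse import sse_client

class ServerConnection:
    def __init__(self, url: str):
        self.url = url

    async def connect(self):
        # Open the SSE transport and initialize an MCP session.
        self._streams_ctx = sse_client(self.url)
        read, write = await self._streams_ctx.__aenter__()
        self._session_ctx = ClientSession(read, write)
        self.session = await self._session_ctx.__aenter__()
        await self.session.initialize()

    async def list_tools(self):
        result = await self.session.list_tools()
        # Normalize to the simple dict shape used by fetch_tools() above.
        return [{"name": t.name,
                 "description": t.description,
                 "input_schema": t.inputSchema} for t in result.tools]

    async def execute_tool(self, name: str, arguments: dict):
        return await self.session.call_tool(name, arguments)

    async def cleanup(self):
        await self._session_ctx.__aexit__(None, None, None)
        await self._streams_ctx.__aexit__(None, None, None)
```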
Agent Creation and Registration

Once we have our function stubs, we register them with the Azure AI Agent Service:

```python
# Initialize agent with tools
toolset = ToolSet()
toolset.add(mcp_function_tool)
toolset.add(bing)  # Adding the Bing Search tool

# Create or update agent with the toolset
agent = project_client.agents.create_agent(
    model=config.aoai_model_name,
    name=agent_name,
    instructions=agent_instructions,
    tools=toolset.definitions
)
```

The advantage of this approach is that it allows for dynamic discovery and registration. When the MCP server adds or updates tools, you can simply run the agent creation process again to update the registered functions. The picture below shows that the tool actions discovered from the MCP server are registered as Functions in the AI Agent Service upon agent creation or update.

Executing Requests Using the MCP Client

When a user interacts with the bot, state_management_bot.py handles the function calls and routes them to the appropriate handler:

```python
# Process each tool call
tool_outputs = []
for tool_call in tool_calls:
    if isinstance(tool_call, RequiredFunctionToolCall):
        # Get function name and arguments
        function_name = tool_call.function.name
        args_json = tool_call.function.arguments
        arguments = json.loads(args_json) if args_json else {}

        # Check if this is an MCP function
        if is_mcp_function(function_name):
            # Direct MCP execution using our specialized handler
            output = await execute_mcp_tool_async(function_name, arguments)
        else:
            # Use FunctionTool as fallback
            output = functions.execute(tool_call)
```

The system is designed to be loosely coupled: the agent only knows about the tool signatures and how to call them, while the MCP server handles the implementation details of interacting with Azure Storage.

Running the Application

The application workflow consists of two main steps.

1. Creating/Updating the Agent

This step discovers available tools from the MCP server and registers them with the Azure AI Agent Service:

```
python agent.py
```

This process:

- Connects to the MCP server
- Retrieves the schema of all available tools
- Creates function stubs for each tool
- Registers these stubs with the Azure AI Agent Service

2. Running the AI Assistant

Once the agent is configured with the appropriate tools, you can run the application:

```
python app.py
```

Users interact with the AI assistant through the Bot Framework Emulator using natural language. The assistant can:

- Search for news using Bing Search
- Summarize the findings
- Store and organize summaries in Azure Blob Storage via the MCP server

References

- Here is the GitHub repo for this app. It has references to relevant documentation on the subject.
- Here is the GitHub repo of the MCP server.
- Here is a video demonstrating this application in action.

Conclusion

This implementation demonstrates a practical approach to integrating Azure AI Agent Service with MCP servers. By leveraging the Function integration capability, we've created a bridge that allows these technologies to work together seamlessly. The architecture is:

- Flexible: New tools can be added to the MCP server and automatically discovered
- Maintainable: Changes to storage operations can be made without modifying the agent
- Scalable: Additional capabilities can be easily added through new MCP tools

As Azure AI Agent Service evolves, we may see native integration with MCP servers in the future. Until then, this approach provides a robust solution for developers looking to combine these powerful technologies.

From Complexity to Simplicity: The ASC and Azure AI Partnership
ASC Technologies, a leader in compliance recording and AI-driven data analytics, provides cutting-edge software solutions for capturing and analyzing communication channels. Their innovative technology empowers more than 500 customers worldwide to record communications legally while extracting valuable insights and helping to prevent fraudulent activities. Many of their customers operate in heavily regulated industries where compliance recording is mandatory. These organizations rely on the ability to consolidate and analyze information shared across multiple channels, including voice recordings, chat logs, speaker recognition, video analysis, and document and screen activity.

As ASC's customer base expanded and their clients accumulated millions of calls and vast amounts of conversation metadata, the sheer volume of recorded data grew exponentially. This surge in data led to increased costs for searching and retrieving critical information and made the process of quickly accessing relevant data more challenging.

Workshop leads to insights and AI-driven solutions

To address this issue, ASC collaborated with Microsoft in a workshop to discover and design solutions. They participated in a preview of Azure AI Content Understanding and saw the potential to simplify product development with an easy-to-integrate processing pipeline. By applying AI, they could bridge compliance gaps, enhance customer experiences, uncover actionable intelligence, and prevent fraudulent activities. The team decided to integrate Content Understanding into their platform, finding the implementation process relatively simple.

"For the new Content Understanding service, we just implemented a new AI Content Understanding actor. For each recorded Microsoft Teams conversation, or a chat conversation, or email, the actor is triggered. Because the result of the Content Understanding service is quite similar to the results of the Microsoft analytics services we used before, we could use the existing UI with just some add-ons to show the new extracted information," said Tobias Fengler, Chief Engineering Officer, ASC.

Fengler mentioned that the ASC team received prompt assistance from the Microsoft team, facilitating the implementation process. "We got a lot of help from the Microsoft team and a very fast direct answer from the product manager from Microsoft whenever we had an urgent question."

A win/win situation for the company and their customers

Since developing their new integrated platform, ASC has realized many positive impacts. Search queries are much faster, and the service is easier to maintain, requiring only one analytics service as opposed to the six different services ASC previously used. Additionally, ASC now empowers its solution to analyze all communication streams, giving customers a holistic view of all communication channels, including audio, video, screen sharing, chat, documents, and emails. Customers can easily extract information from unstructured data with a simple configuration, and the cost structure is straightforward and easy to understand, with a single consumption model.

Thanks to the success of the implementation and the positive results of the solution, Fengler says he would recommend Content Understanding to others.
"Compared to other solutions, the capability [in the ASC Recording Insights platform] powered by Content Understanding provides a single interface for both audio transcription and all analytics functionalities like summaries, sentiment analysis, and extracting information from unstructured data," he said.

With these positive outcomes and significant improvements, ASC is now looking to extend AI into new use cases. Content Understanding will be used to analyze email content, provide summaries, and detect compliance relevancy based on the user group and content. The company is also building an AI agent that will automate certain workflows, including automatically sending notification emails to compliance officers when needed. By introducing agentic AI into their solution, ASC is empowering their organization to drive new business, become more efficient, and accomplish more.

Get started:

- Learn more about Azure AI Content Understanding.
- Try Azure AI Content Understanding in Azure AI Foundry.

Azure OpenAI o-series & GPT-4.1 Models Now Available in Azure AI Agent Service
New Models Available!

We're excited to announce the preview availability of the following Azure OpenAI Service models for use in the Azure AI Agent Service, starting 5/7:

- o1
- o3-mini
- gpt-4.1
- gpt-4.1-mini
- gpt-4.1-nano

Azure OpenAI o-Series Models

Azure OpenAI o-series models are designed to tackle reasoning and problem-solving tasks with increased focus and capability. These models spend more time processing and understanding the user's request, making them exceptionally strong in areas like science, coding, and math compared to previous iterations.

- o1: The most capable model in the o1 series, offering enhanced reasoning abilities.
- o3 (coming soon): The most capable reasoning model in the o-model series, and the first to offer full tools support for agentic solutions.
- o3-mini: A faster and more cost-efficient option in the o3 series, ideal for coding tasks requiring speed and lower resource consumption.
- o4-mini (coming soon): The most efficient reasoning model in the o-model series, well suited for agentic solutions.

Azure OpenAI GPT-4.1 Model Series

We are excited to share the launch of agent support for the next iteration of the GPT model series: GPT-4.1, GPT-4.1-mini, and GPT-4.1-nano. The GPT-4.1 models bring improved capabilities and significant advancements in coding, instruction following, and long-context processing that are critical for developers.

What is GPT-4.1?

GPT-4.1 is the latest iteration of the GPT-4o model family, trained to excel at coding and instruction-following tasks. This model will improve the quality of agentic workflows and accelerate the productivity of developers across all scenarios.

Key features of GPT-4.1

GPT-4.1 brings several notable improvements:

- Enhanced coding and instruction following: The model is optimized for better handling of complex technical and coding problems. It generates cleaner, simpler front-end code, accurately identifies necessary changes in existing code, and consistently produces outputs that compile and run successfully.
- Long-context model: GPT-4.1 supports one million token inputs, allowing it to process and understand extensive context in a single interaction. This capability is particularly beneficial for tasks requiring detailed and nuanced understanding, as well as multi-step agents that accumulate context as they operate.
- Improved instruction following: The model excels at following detailed instructions, especially prompts containing multiple requests. It is more intuitive and collaborative, making it easier to work with for various applications.

Model capabilities

In addition to the post-training improvements and long-context support, GPT-4.1 retains the same API capabilities as the GPT-4o model family, including tool calling and structured outputs.

| Model | Reasoning & Accuracy | Cost & Efficiency | Context Length |
|---|---|---|---|
| GPT-4.1 | Highest | Higher cost | 1M |
| GPT-4.1-mini | Balanced | Balanced | 1M |
| GPT-4.1-nano | Lower | Lowest cost | 1M |

Explore GPT-4.1 today

GPT-4.1 is now available in the AI Foundry model catalog, bringing unparalleled advancements in AI capabilities. This release marks a significant leap forward, offering enhanced performance, efficiency, and versatility across a wide array of applications. Whether you're looking to improve your customer service chatbot, develop cutting-edge data analysis tools, or explore new frontiers in machine learning, GPT-4.1 has something to offer. We invite you to delve into the new features and discover how GPT-4.1 can revolutionize your workflows and applications. Explore, deploy, and build applications using these models today in Azure AI Foundry to access this powerful tool and stay ahead in the rapidly evolving world of AI.
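Because GPT-4.1 keeps the GPT-4o API surface, exercising tool calling is the same chat completions call you already know. A small sketch using the standard openai Python package against an Azure deployment is shown below; the deployment name and the example tool are placeholders.

```python
# A sketch of exercising GPT-4.1's tool-calling support via Azure OpenAI.
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-08-01-preview",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

# A hypothetical tool definition for illustration only.
tools = [{
    "type": "function",
    "function": {
        "name": "get_exchange_rate",
        "description": "Get the exchange rate between two currencies.",
        "parameters": {
            "type": "object",
            "properties": {
                "base": {"type": "string"},
                "quote": {"type": "string"},
            },
            "required": ["base", "quote"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1",  # your Azure deployment name
    messages=[{"role": "user", "content": "What is USD to EUR right now?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```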
How to use these models in Azure AI Agent Service

Models with Tool-Calling

To best support agentic scenarios, we recommend using models that support tool-calling. The Azure AI Agent Service currently supports all agent-compatible models from the Azure AI Foundry model catalog. To use these models, use the Azure AI Foundry portal to create a model deployment, then reference the deployment name in your agent. For example:

```python
agent = project_client.agents.create_agent(
    model="llama-3",
    name="my-agent",
    instructions="You are a helpful agent"
)
```

NOTE: This option should only be used for open-source models (e.g., Cepstral, Mistral, Llama) and not for OpenAI models, which are natively supported in the service. This option should also only be used for models that support tool-calling.

Models without Tool-Calling

Though tool-calling support is a core capability for agentic scenarios, we now provide the ability to use models that don't support tool-calling in our API and SDK. This option may be helpful when you have specific use cases that don't require tool-calling. The following steps allow you to utilize any chat-completion model that is available through a serverless API:

1. Deploy your desired model through the serverless API. The model will show up on your 'Models + Endpoints' page.
2. Click on the model name to see model details, where you'll find your model's target URI and key.
3. Create a new serverless connection on the 'Connected Resources' page, using the target URI and key.
4. The model can now be referenced in your code (target URI + '@' + model name), for example:

Model=https://Phi-4-mejco.eastus.models.ai.azure.com/@Phi-4-mejco

Further Exploration

- Azure AI Foundry Agent Service: Model Support
- Explore the models in Azure AI Foundry

Expand Azure AI Agent with New Knowledge Tools: Microsoft Fabric and Tripadvisor
To help AI agents make well-informed decisions with confidence, knowledge serves as the foundation for generating accurate and grounded responses. By integrating comprehensive and precise data, Azure AI Agent Service enhances accuracy and delivers effective solutions, thereby improving the overall customer experience. Azure AI Agent Service aims to provide a wide range of knowledge tools to address various customer use cases, encompassing unstructured text data, structured data, private data, licensed data, public web data, and more.

Today, we are thrilled to announce the public preview of two new knowledge tools, Microsoft Fabric and Tripadvisor, designed to further empower your AI agents. Alongside existing capabilities such as the Azure AI Search, File Search, and Grounding with Bing Search tools, these new integrations provide even broader coverage. This post will explore how these capabilities can empower your AI agents to be more intelligent and knowledge-driven, as well as the enterprise-ready features that ensure data security.

Key Capabilities

Empower Data-Driven Decisions with Microsoft Fabric

The Microsoft Fabric tool connects your AI agents with customized, conversational data agents built in Microsoft Fabric. Fabric data agents are AI-powered assistants that can learn, adapt, and deliver insights instantly, allowing you to reason over structured and semantic data. With Fabric data agents, you can easily create conversational experiences over enterprise data from multiple data sources and enhance your AI agents with analytics capabilities in just a few clicks.

At the heart of Fabric is OneLake, a single, unified, and governed data lake that connects departments, applications, and teams. Whether data is directly ingested or mirrored from external systems, like Snowflake or other third-party databases, OneLake consolidates it into a common, open data format. This unified repository not only simplifies data management but also serves as a comprehensive knowledge source for AI agents, ensuring that insights are grounded in a complete view of your organization's data. Fabric data agents can determine when to use specific data, how to combine it, and what insights matter most. This seamless integration between Fabric and Azure AI Agent Service enables organizations to develop agents with additional quantitative insights from data in OneLake, leading to data-driven decision-making.

Built into this integration are enterprise-ready features such as Identity Passthrough (On-Behalf-Of) authorization, which ensures end users only receive AI agent responses based on data they have access to. Azure AI Agent Service uses the end user's identity to authorize and retrieve the data they can access, and the Fabric data agent performs its analysis only over that accessible data before returning results to the agent. This gives you built-in access control and management over your private data.

"We see data agents as a conversational capability layer we can use to 'talk' to our data, understand it, and derive different insights in support of our daily decision making," says Genis Campa, head of Products Strategy at NTT DATA. "By significantly improving real-time actionable insights, Azure AI and Fabric help elevate business outcomes as well as human potential."
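In code, wiring a Fabric data agent into an Azure AI agent might look like the sketch below, based on the azure-ai-projects preview SDK. The connection name, model deployment, and instructions are placeholders, and class and parameter names can differ between preview releases, so verify against the SDK version you use.

```python
# A minimal sketch of attaching the Microsoft Fabric tool to an agent,
# assuming a Fabric connection is already configured on the project.
import os
from azure.identity import DefaultAzureCredential
from azure.ai.projects import AIProjectClient
from azure.ai.projects.models import FabricTool

project_client = AIProjectClient.from_connection_string(
    credential=DefaultAzureCredential(),
    conn_str=os.environ["PROJECT_CONNECTION_STRING"],
)

# Look up the Fabric connection on the project (name is hypothetical).
fabric_connection = project_client.connections.get(connection_name="my-fabric-connection")
fabric = FabricTool(connection_id=fabric_connection.id)

agent = project_client.agents.create_agent(
    model="gpt-4o",  # your model deployment name
    name="sales-analyst",
    instructions="Answer questions about sales data using the Fabric data agent.",
    tools=fabric.definitions,
)
```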
Enrich Responses with Your Licensed Data

In addition to private data and public web data, you can also ground your AI agent with specialized data from licensed data providers, such as Tripadvisor. Grounding with your licensed data improves AI agent response quality with fresh, well-maintained, and trusted data. As one of the first licensed data providers, Tripadvisor enables AI agents to utilize its comprehensive dataset to develop innovative solutions, such as travel booking assistance, that significantly enhance customer experiences. With millions of unique data points and reviews, Tripadvisor's information can be applied to various AI applications requiring reliable data about hotels, restaurants, experiences, and more.

"Trust and accuracy are critical in AI decision-making. By integrating Tripadvisor's extensive, high-quality, privacy-preserved data into Azure AI Agent Service, we're empowering businesses to deliver smarter and more personalized in-context travel recommendations based on real traveler insights. This collaboration strengthens AI's ability to provide meaningful guidance, helping travelers make informed travel planning and booking choices with confidence." - Rahul Todkar, Vice President, Head of Data and AI at Tripadvisor.

Get Started

Try out these new tools with your Azure AI agent:

- Microsoft Fabric tool
- Bring your own licensed data

Learn More

- Fabric data agent integration with Azure AI Agent Service
- Join the Microsoft Fabric Community Conference session "Get your data AI ready with Microsoft Fabric".
- Skill up with this Plan on Microsoft Learn to learn how to ingest, store, and monitor your data to ensure AI-readiness.

Use Azure OpenAI and APIM with the OpenAI Agents SDK
The OpenAI Agents SDK provides a powerful framework for building intelligent AI assistants with specialised capabilities. In this blog post, I'll demonstrate how to integrate Azure OpenAI Service and Azure API Management (APIM) with the OpenAI Agents SDK to create a banking assistant system with specialised agents.

Key Takeaways:

- Learn how to connect the OpenAI Agents SDK to Azure OpenAI Service
- Understand the differences between direct Azure OpenAI integration and using Azure API Management
- Implement tracing with the OpenAI Agents SDK for monitoring and debugging
- Create a practical banking application with specialized agents and handoff capabilities

The OpenAI Agents SDK

The OpenAI Agents SDK is a powerful toolkit that enables developers to create AI agents with specialised capabilities, tools, and the ability to work together through handoffs. It's designed to work seamlessly with OpenAI's models, but can be integrated with Azure services for enterprise-grade deployments.

Setting Up Your Environment

To get started with the OpenAI Agents SDK and Azure, you'll need to install the necessary packages:

```
pip install openai openai-agents python-dotenv
```

You'll also need to set up your environment variables. Create a .env file with your Azure OpenAI or APIM credentials.

For a direct Azure OpenAI connection:

```
# .env file for Azure OpenAI
AZURE_OPENAI_API_KEY=your_api_key
AZURE_OPENAI_API_VERSION=2024-08-01-preview
AZURE_OPENAI_ENDPOINT=https://your-resource-name.openai.azure.com/
AZURE_OPENAI_DEPLOYMENT=your-deployment-name
```

For an Azure API Management (APIM) connection:

```
# .env file for Azure APIM
AZURE_APIM_OPENAI_SUBSCRIPTION_KEY=your_subscription_key
AZURE_APIM_OPENAI_API_VERSION=2024-08-01-preview
AZURE_APIM_OPENAI_ENDPOINT=https://your-apim-name.azure-api.net/
AZURE_APIM_OPENAI_DEPLOYMENT=your-deployment-name
```

Connecting to Azure OpenAI Service

The OpenAI Agents SDK can be integrated with Azure OpenAI Service in two ways: direct connection or through Azure API Management (APIM).

Option 1: Direct Azure OpenAI Connection

```python
from openai import AsyncAzureOpenAI
from agents import set_default_openai_client
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create OpenAI client using Azure OpenAI
openai_client = AsyncAzureOpenAI(
    api_key=os.getenv("AZURE_OPENAI_API_KEY"),
    api_version=os.getenv("AZURE_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_OPENAI_DEPLOYMENT")
)

# Set the default OpenAI client for the Agents SDK
set_default_openai_client(openai_client)
```

Option 2: Azure API Management (APIM) Connection

```python
from openai import AsyncAzureOpenAI
from agents import set_default_openai_client
from dotenv import load_dotenv
import os

# Load environment variables
load_dotenv()

# Create OpenAI client using Azure APIM
openai_client = AsyncAzureOpenAI(
    api_key=os.getenv("AZURE_APIM_OPENAI_SUBSCRIPTION_KEY"),  # Note: using a subscription key
    api_version=os.getenv("AZURE_APIM_OPENAI_API_VERSION"),
    azure_endpoint=os.getenv("AZURE_APIM_OPENAI_ENDPOINT"),
    azure_deployment=os.getenv("AZURE_APIM_OPENAI_DEPLOYMENT")
)

# Set the default OpenAI client for the Agents SDK
set_default_openai_client(openai_client)
```

Key Difference: When using Azure API Management, you use a subscription key instead of an API key. This provides an additional layer of management, security, and monitoring for your OpenAI API access.
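Since the two options differ only in which credentials and endpoint they read, one convenient pattern is to toggle between them with a single environment flag. The sketch below uses a USE_APIM variable of my own invention, not anything from the SDK.

```python
# One possible way to switch between the direct and APIM connections
# with an environment flag (USE_APIM is an assumption, not part of the SDK).
import os
from openai import AsyncAzureOpenAI
from agents import set_default_openai_client
from dotenv import load_dotenv

load_dotenv()
use_apim = os.getenv("USE_APIM", "false").lower() == "true"

if use_apim:
    key = os.getenv("AZURE_APIM_OPENAI_SUBSCRIPTION_KEY")
    version = os.getenv("AZURE_APIM_OPENAI_API_VERSION")
    endpoint = os.getenv("AZURE_APIM_OPENAI_ENDPOINT")
    deployment = os.getenv("AZURE_APIM_OPENAI_DEPLOYMENT")
else:
    key = os.getenv("AZURE_OPENAI_API_KEY")
    version = os.getenv("AZURE_OPENAI_API_VERSION")
    endpoint = os.getenv("AZURE_OPENAI_ENDPOINT")
    deployment = os.getenv("AZURE_OPENAI_DEPLOYMENT")

openai_client = AsyncAzureOpenAI(
    api_key=key,
    api_version=version,
    azure_endpoint=endpoint,
    azure_deployment=deployment,
)
set_default_openai_client(openai_client)
```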
Creating Agents with the OpenAI Agents SDK

Once you've set up your Azure OpenAI or APIM connection, you can create agents using the OpenAI Agents SDK:

```python
from agents import Agent

# Create a banking assistant agent
banking_assistant = Agent(
    name="Banking Assistant",
    instructions="You are a helpful banking assistant. Be concise and professional.",
    model="gpt-4o",  # This will use the deployment specified in your Azure OpenAI/APIM client
    tools=[check_account_balance]  # A function tool defined elsewhere (see the sketch below)
)
```

The OpenAI Agents SDK automatically uses the Azure OpenAI or APIM client you've configured, making it seamless to switch between different Azure environments or configurations.
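The examples reference a check_account_balance tool "defined elsewhere." A minimal sketch using the SDK's function_tool decorator might look like this; the balance lookup is stubbed with fake data.

```python
# A sketch of the check_account_balance function tool used by the agents below.
from agents import function_tool

@function_tool
def check_account_balance(account_id: str) -> str:
    """Return the current balance for the given account."""
    # In a real application this would query your core banking system.
    fake_balances = {"12345": "$2,450.10", "67890": "$310.75"}
    return fake_balances.get(account_id, "Account not found.")
```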
Implementing Tracing with Azure OpenAI

The OpenAI Agents SDK includes powerful tracing capabilities that can help you monitor and debug your agents. When using Azure OpenAI or APIM, you can implement two types of tracing:

1. Console Tracing for Development

Console logging is rather verbose; if you would like to explore the spans, enable it like this:

```python
from agents import add_trace_processor
from agents.tracing.processors import ConsoleSpanExporter, BatchTraceProcessor

# Set up console tracing
console_exporter = ConsoleSpanExporter()
console_processor = BatchTraceProcessor(exporter=console_exporter)
add_trace_processor(console_processor)
```

2. OpenAI Dashboard Tracing

Currently the spans are sent to https://api.openai.com/v1/traces/ingest:

```python
import os
from agents import set_tracing_export_api_key

set_tracing_export_api_key(os.getenv("OPENAI_API_KEY"))
```

Tracing is particularly valuable when working with Azure deployments, as it helps you monitor usage, performance, and behavior across different environments. Note that at the time of writing this article, there is an ongoing bug where the OpenAI Agents SDK fetches the old input_tokens and output_tokens fields instead of the prompt_tokens and completion_tokens returned by newer chat completion APIs. You therefore need to manually update the agents/run.py file to make this work, per https://github.com/openai/openai-agents-python/pull/65/files

Running Agents with Azure OpenAI

To run your agents with Azure OpenAI or APIM, use the Runner class from the OpenAI Agents SDK:

```python
from agents import Runner
import asyncio

async def main():
    # Run the banking assistant
    result = await Runner.run(
        banking_assistant,
        input="Hi, I'd like to check my account balance."
    )
    print(f"Response: {result.response.content}")

if __name__ == "__main__":
    asyncio.run(main())
```

Practical Example: Banking Agents System

Let's look at how we can use Azure OpenAI or APIM with the OpenAI Agents SDK to create a banking system with specialized agents and handoff capabilities.

1. Define Specialized Banking Agents

We'll create several specialized agents:

- General Banking Assistant: Handles basic inquiries and account information
- Loan Specialist: Focuses on loan options and payment calculations
- Investment Specialist: Provides guidance on investment options
- Customer Service Agent: Routes inquiries to specialists

2. Implement Handoff Between Agents

```python
from agents import Agent, handoff, HandoffInputData
from agents.extensions import handoff_filters

# Define a filter for handoff messages
def banking_handoff_message_filter(handoff_message_data: HandoffInputData) -> HandoffInputData:
    # Remove any tool-related messages from the message history
    handoff_message_data = handoff_filters.remove_all_tools(handoff_message_data)
    return handoff_message_data

# Create customer service agent with handoffs
# (loan_specialist_agent and investment_specialist_agent are defined
# similarly to banking_assistant above)
customer_service_agent = Agent(
    name="Customer Service Agent",
    instructions="""You are a customer service agent at a bank.
    Help customers with general inquiries and direct them to specialists when needed.
    If the customer asks about loans or mortgages, handoff to the Loan Specialist.
    If the customer asks about investments or portfolio management, handoff to the Investment Specialist.""",
    handoffs=[
        handoff(loan_specialist_agent, input_filter=banking_handoff_message_filter),
        handoff(investment_specialist_agent, input_filter=banking_handoff_message_filter),
    ],
    tools=[check_account_balance],
)
```

3. Trace the Conversation Flow

```python
from agents import trace, Runner
import asyncio

async def main():
    # Trace the entire run as a single workflow
    with trace(workflow_name="Banking Assistant Demo"):
        # Run the customer service agent
        result = await Runner.run(
            customer_service_agent,
            input="I'm interested in taking out a mortgage loan. Can you help me understand my options?"
        )
        print(f"Response: {result.response.content}")

if __name__ == "__main__":
    asyncio.run(main())
```

Benefits of Using Azure OpenAI/APIM with the OpenAI Agents SDK

Integrating Azure OpenAI or APIM with the OpenAI Agents SDK offers several advantages:

- Enterprise-Grade Security: Azure provides robust security features, compliance certifications, and private networking options
- Scalability: Azure's infrastructure can handle high-volume production workloads
- Monitoring and Management: APIM provides additional monitoring, throttling, and API management capabilities
- Regional Deployment: Azure allows you to deploy models in specific regions to meet data residency requirements
- Cost Management: Azure provides detailed usage tracking and cost management tools

Conclusion

The OpenAI Agents SDK combined with Azure OpenAI Service or Azure API Management provides a powerful foundation for building intelligent, specialized AI assistants. By leveraging Azure's enterprise features and the OpenAI Agents SDK's capabilities, you can create robust, scalable, and secure AI applications for production environments.

Whether you choose direct Azure OpenAI integration or Azure API Management depends on your specific needs for API management, security, and monitoring. Both approaches work seamlessly with the OpenAI Agents SDK, making it easy to build sophisticated agent-based applications.

Repo: https://github.com/hieumoscow/azure-openai-agents
Video demo: https://www.youtube.com/watch?v=gJt-bt-vLJY

Real-time Speech Transcription with GPT-4o-transcribe and GPT-4o-mini-transcribe using WebSocket
Azure OpenAI has expanded its speech recognition capabilities with two powerful models: GPT-4o-transcribe and GPT-4o-mini-transcribe. These models leverage WebSocket connections to enable real-time transcription of audio streams, providing developers with cutting-edge tools for speech-to-text applications. In this technical blog, we'll explore how these models work and demonstrate a practical implementation using Python.

Understanding OpenAI's Realtime Transcription API

Unlike the regular REST API for audio transcription, Azure OpenAI's Realtime API enables continuous streaming of audio data through WebSocket or WebRTC connections. This approach is particularly valuable for applications requiring immediate transcription feedback, such as live captioning, meeting transcription, or voice assistants.

The key difference between the standard transcription API and the Realtime API is that transcription sessions typically don't contain responses from the model; they focus exclusively on converting speech to text in real time.

GPT-4o-transcribe and GPT-4o-mini-transcribe: Feature Overview

Azure OpenAI has introduced two specialized transcription models:

GPT-4o-transcribe: The full-featured transcription model with high accuracy
GPT-4o-mini-transcribe: A lighter, faster model with slightly reduced accuracy but lower latency

Both models connect through WebSockets, enabling developers to stream audio directly from microphones or other sources for immediate transcription. These models are designed specifically for the Realtime API infrastructure.

Setting Up the Environment

First, we need to set up our Python environment with the necessary libraries:

```python
import os
import json
import base64
import threading
import pyaudio
import websocket
from dotenv import load_dotenv

load_dotenv('azure.env')  # Load environment variables from azure.env

OPENAI_API_KEY = os.environ.get("AZURE_OPENAI_STT_TTS_KEY")
if not OPENAI_API_KEY:
    raise RuntimeError("❌ OPENAI_API_KEY is missing!")

# WebSocket endpoint for the Azure OpenAI Realtime API (transcription intent)
url = f"{os.environ.get('AZURE_OPENAI_STT_TTS_ENDPOINT').replace('https', 'wss')}/openai/realtime?api-version=2025-04-01-preview&intent=transcription"
headers = {"api-key": OPENAI_API_KEY}

# Audio stream parameters (16-bit PCM, 24kHz mono)
RATE = 24000
CHANNELS = 1
FORMAT = pyaudio.paInt16
CHUNK = 1024

audio_interface = pyaudio.PyAudio()
stream = audio_interface.open(format=FORMAT,
                              channels=CHANNELS,
                              rate=RATE,
                              input=True,
                              frames_per_buffer=CHUNK)
```
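PyAudio will raise an error on stream open if the default input device can't satisfy these parameters. As a quick sanity check, you can list the available input devices and their default sample rates first; this is a small optional helper, not part of the original walkthrough:

```python
import pyaudio

# Enumerate input-capable audio devices to confirm the microphone
# and its default sample rate before opening the stream
p = pyaudio.PyAudio()
for i in range(p.get_device_count()):
    info = p.get_device_info_by_index(i)
    if info.get("maxInputChannels", 0) > 0:
        print(f"[{i}] {info['name']} (default rate: {int(info['defaultSampleRate'])} Hz)")
p.terminate()
```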
}, "input_audio_noise_reduction": {"type": "near_field"}, "turn_detection": {"type": "server_vad", "threshold": 0.5, "prefix_padding_ms": 300, "silence_duration_ms": 200} } } ws.send(json.dumps(session_config)) def stream_microphone(): try: while ws.keep_running: audio_data = stream.read(CHUNK, exception_on_overflow=False) audio_base64 = base64.b64encode(audio_data).decode('utf-8') ws.send(json.dumps({ "type": "input_audio_buffer.append", "audio": audio_base64 })) except Exception as e: print("Audio streaming error:", e) ws.close() threading.Thread(target=stream_microphone, daemon=True).start() Processing Transcription Results This section handles the incoming WebSocket messages containing the transcription results: def on_message(ws, message): try: data = json.loads(message) event_type = data.get("type", "") print("Event type:", event_type) #print(data) # Stream live incremental transcripts if event_type == "conversation.item.input_audio_transcription.delta": transcript_piece = data.get("delta", "") if transcript_piece: print(transcript_piece, end=' ', flush=True) if event_type == "conversation.item.input_audio_transcription.completed": print(data["transcript"]) if event_type == "item": transcript = data.get("item", "") if transcript: print("\nFinal transcript:", transcript) except Exception: pass # Ignore unrelated events Error Handling and Cleanup To ensure proper resource management, we implement handlers for errors and connection closing: def on_error(ws, error): print("WebSocket error:", error) def on_close(ws, close_status_code, close_msg): print("Disconnected from server.") stream.stop_stream() stream.close() audio_interface.terminate() Running the WebSocket Client Finally, this code initiates the WebSocket connection and starts the transcription process: print("Connecting to OpenAI Realtime API...") ws_app = websocket.WebSocketApp( url, header=headers, on_open=on_open, on_message=on_message, on_error=on_error, on_close=on_close ) ws_app.run_forever() Analyzing the Implementation Details Session Configuration Let's break down the key components of the session configuration: input_audio_format: Specifies "pcm16" for 16-bit PCM audio input_audio_transcription: model: Specifies "gpt-4o-mini-transcribe" (could be replaced with "gpt-4o-transcribe" for higher accuracy) prompt: Provides instructions to the model ("Respond in English") language: specify the language like "hi" else you can set it null to default to all language. input_audio_noise_reduction: Type of noise reduction. near_field is for close-talking microphones such as headphones, far_field is for far-field microphones such as laptop or conference room microphones. turn_detection: Configures "server_vad" (Voice Activity Detection) to automatically detect speech turns. Configuration for turn detection, ether Server VAD or Semantic VAD. This can be set to null to turn off, in which case the client must manually trigger model response. Server VAD means that the model will detect the start and end of speech based on audio volume and respond at the end of user speech. Semantic VAD is more advanced and uses a turn detection model (in conjuction with VAD) to semantically estimate whether the user has finished speaking, then dynamically sets a timeout based on this probability. For example, if user audio trails off with "uhhm", the model will score a low probability of turn end and wait longer for the user to continue speaking. This can be useful for more natural conversations, but may have a higher latency. 
Audio Streaming

The implementation uses a threaded approach to continuously stream audio data from the microphone to the WebSocket connection. Each chunk of audio is:

Read from the microphone
Encoded to base64
Sent as a JSON message with the "input_audio_buffer.append" event type

Transcription Events

The system processes several types of events from the WebSocket connection:

conversation.item.input_audio_transcription.delta: Incremental updates to the transcription
conversation.item.input_audio_transcription.completed: Complete transcripts for a segment
item: Final transcription results

Customization Options

The example code can be customized in several ways:

Switch between models (gpt-4o-transcribe or gpt-4o-mini-transcribe)
Adjust audio parameters (sample rate, channels, chunk size)
Modify the prompt to provide context or language preferences
Configure noise reduction for different environments
Adjust turn detection for different speaking patterns

Deployment Considerations

When deploying this solution in production, consider:

Authentication: Securely store and retrieve API keys
Error handling: Implement robust reconnection logic
Performance: Optimize audio parameters for your use case
Rate limits: Be aware of Azure OpenAI's rate limits for the Realtime API
Fallback strategies: Implement fallbacks for connection drops

Conclusion

GPT-4o-transcribe and GPT-4o-mini-transcribe represent significant advances in real-time speech recognition technology. By leveraging WebSockets for continuous audio streaming, these models enable developers to build responsive speech-to-text applications with minimal latency.

The implementation showcased in this blog demonstrates how to quickly set up a real-time transcription system using Python. This foundation can be extended for various applications, from live captioning and meeting transcription to voice-controlled interfaces and accessibility tools. As these models continue to evolve, we can expect even better accuracy and performance, opening up new possibilities for speech recognition applications across industries.

Remember that when implementing these APIs in production environments, you should follow Azure OpenAI's best practices for API security, including proper authentication and keeping your API keys secure.

Here is the link to the end-to-end code.

Thanks,
Manoranjan Rajguru
https://www.linkedin.com/in/manoranjan-rajguru/