pamela fox
15 TopicsEntity extraction with Azure OpenAI Structured Outputs
📺 Tune into our live stream on this topic on December 3rd! Have you ever wanted to extract some details from a large block of text, like to figure out the topics of a blog post or the location of a news article? In the past, I've had to use specialized models and domain-specific packages for entity extraction. But now, we can do entity extraction with large language models and get equally impressive results. 🎉 When we use the OpenAI gpt-4o model along with the structured outputs mode, we can define a schema for the details we'd like to extract and get a response that conforms to that schema. Here's the most basic example from the Azure OpenAI tutorial about structured outputs: class CalendarEvent(BaseModel): name: str date: str participants: list[str] completion = client.beta.chat.completions.parse( model="MODEL_DEPLOYMENT_NAME", messages=[ {"role": "system", "content": "Extract the event information."}, {"role": "user", "content": "Alice and Bob are going to a science fair on Friday."}, ], response_format=CalendarEvent, ) output = completion.choices[0].message.parsed The code first defines the CalendarEvent class, an instance of a Pydantic model. Then it sends a request to the GPT model specifying a response_format of CalendarEvent . The parsed output will be a dictionary containing a name , date , and participants . We can even go a step farther and turn the parsed output into a CalendarEvent instance, using the Pydantic model_validate method: event = CalendarEvent.model_validate(event) With this structured outputs capability, it's easier than ever to use GPT models for "entity extraction" tasks: give it some data, tell it what sorts of entities to extract from that data, and constrain it as needed. Extracting from GitHub READMEs Let's see an example of a way that I actually used structured outputs, to help me summarize the submissions that we got to a recent hackathon. I can feed the README of a repository to the GPT model and ask for it to extract key details like project title and technologies used. First I define the Pydantic models: class Language(str, Enum): JAVASCRIPT = "JavaScript" PYTHON = "Python" DOTNET = ".NET" class Framework(str, Enum): LANGCHAIN = "Langchain" SEMANTICKERNEL = "Semantic Kernel" LLAMAINDEX = "Llamaindex" AUTOGEN = "Autogen" SPRINGBOOT = "Spring Boot" PROMPTY = "Prompty" class RepoOverview(BaseModel): name: str summary: str = Field(..., description="A 1-2 sentence description of the project") languages: list[Language] frameworks: list[Framework] In the code above, I asked for a list of a Python enum, which will constrain the model to return only options matching that list. I could have also asked for a list[str] to give it more flexibility, but I wanted to constrain it in this case. I also annoted the description using the Pydantic Field class so that I could specify the length of the description. Without that annotation, the descriptions are often much longer. We can use that description whenever we want to give additional guidance to the model about a field. Next, I fetch the GitHub readme, storing it as a string: url = "https://api.github.com/repos/shank250/CareerCanvas-msft-raghack/contents/README.md" response = requests.get(url) readme_content = base64.b64decode(response.json()["content"]).decode("utf-8") Finally, I send off the request and convert the result into a RepoOverview instance: completion = client.beta.chat.completions.parse( model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"), messages=[ { "role": "system", "content": "Extract info from the GitHub issue markdown about this hack submission.", }, {"role": "user", "content": readme_content}, ], response_format=RepoOverview, ) output = completion.choices[0].message.parsed repo_overview = RepoOverview.model_validate(output) You can see the full code in extract_github_repo.py That gives back an object like this one: RepoOverview( name='Job Finder Chatbot with RAG', description='This project is a chatbot application aimed at helping users find job opportunities and get relevant answers to questions about job roles, leveraging Retrieval-Augmented Generation (RAG) for personalized recommendations and answers.', languages=[<Language.JAVASCRIPT: 'JavaScript'>], azure_services=[<AzureService.AISEARCH: 'AI Search'>, <AzureService.POSTGRESQL: 'PostgreSQL'>], frameworks=[<Framework.SPRINGBOOT: 'Spring Boot'>] ) Extracting from PDFs I talk to many customers that want to extract details from PDF, like locations and dates, often to store as metadata in their RAG search index. The first step is to extract the PDF as text, and we have a few options: a hosted service like Azure Document Intelligence, or a local Python package like pymupdf. For this example, I'm using the latter, as I wanted to try out their specialized pymupdf4llm package that converts the PDF to LLM-friendly markdown. First I load in a PDF of an order receipt and convert it to markdown: md_text = pymupdf4llm.to_markdown("example_receipt.pdf") Then I define the Pydantic models for a receipt: class Item(BaseModel): product: str price: float quantity: int class Receipt(BaseModel): total: float shipping: float payment_method: str items: list[Item] order_number: int In this example, I'm using a nested Pydantic model Item for each item in the receipt, so that I can get detailed information about each item. And then, as before, I send the text off to the GPT model and convert the response back to a Receipt instance: completion = client.beta.chat.completions.parse( model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"), messages=[ {"role": "system", "content": "Extract the information from the blog post"}, {"role": "user", "content": md_text}, ], response_format=Receipt, ) output = completion.choices[0].message.parsed receipt = Receipt.model_validate(output) You can see the full code in extract_pdf_receipt.py Extracting from images Since the gpt-4o model is also a multimodal model, it can accept both images and text. That means that we can send it an image and ask it for a structured output that extracts details from that image. Pretty darn cool! First I load in a local image as a base-64 encoded data URI: def open_image_as_base64(filename): with open(filename, "rb") as image_file: image_data = image_file.read() image_base64 = base64.b64encode(image_data).decode("utf-8") return f"data:image/png;base64,{image_base64}" image_url = open_image_as_base64("example_graph_treecover.png") For this example, my image is a graph, so I'm going to have it extract details about the graph. Here are the Pydantic models: class Graph(BaseModel): title: str description: str = Field(..., description="1 sentence description of the graph") x_axis: str y_axis: str legend: list[str] Then I send off the base-64 image URI to the GPT model, inside a "image_url" type message, and convert the response back to a Graph object: completion = client.beta.chat.completions.parse( model=os.getenv("AZURE_OPENAI_GPT_DEPLOYMENT"), messages=[ {"role": "system", "content": "Extract the information from the graph"}, { "role": "user", "content": [ {"image_url": {"url": image_url}, "type": "image_url"}, ], }, ], response_format=Graph, ) output = completion.choices[0].message.parsed graph = Graph.model_validate(output) More examples You can use this same general approach for entity extraction across many file types, as long as they can be represented in either a text or image form. See more examples in my azure-openai-entity-extraction repository. As always, remember that large language models are probabilistic next-word-predictors that won't always get things right, so definitely evaluate the accuracy of the outputs before you use this approach for a business-critical task.Why your LLM-powered app needs concurrency
As part of the Python advocacy team, I help maintain several open-source sample AI applications, like our popular RAG chat demo. Through that work, I’ve learned a lot about what makes LLM-powered apps feel fast, reliable, and responsive. One of the most important lessons: use an asynchronous backend framework. Concurrency is critical for LLM apps, which often juggle multiple API calls, database queries, and user requests at the same time. Without async, your app may spend most of its time waiting — blocking one user’s request while another sits idle. The need for concurrency Why? Let’s imagine we’re using a synchronous framework like Flask. We deploy that to a server with gunicorn and several workers. One worker receives a POST request to the "/chat" endpoint, which in turn calls the Azure OpenAI Chat Completions API. That API call can take several seconds to complete — and during that time, the worker is completely tied up, unable to handle any other requests. We could scale out by adding more CPU cores, workers, or threads, but that’s often wasteful and expensive. Without concurrency, each request must be handled serially: When your app relies on long, blocking I/O operations — like model calls, database queries, or external API lookups — a better approach is to use an asynchronous framework. With async I/O, the Python runtime can pause a coroutine that’s waiting for a slow response and switch to handling another incoming request in the meantime. With concurrency, your workers stay busy and can handle new requests while others are waiting: Asynchronous Python backends In the Python ecosystem, there are several asynchronous backend frameworks to choose from: Quart: the asynchronous version of Flask FastAPI: an API-centric, async-only framework (built on Starlette) Litestar: a batteries-included async framework (also built on Starlette) Django: not async by default, but includes support for asynchronous views All of these can be good options depending on your project’s needs. I’ve written more about the decision-making process in another blog post. As an example, let's see what changes when we port a Flask app to a Quart app. First, our handlers now have async in front, signifying that they return a Python coroutine instead of a normal function: async def chat_handler(): request_message = (await request.get_json())["message"] When deploying these apps, I often still use the Gunicorn production web server—but with the Uvicorn worker, which is designed for Python ASGI applications. Alternatively, you can run Uvicorn or Hypercorn directly as standalone servers. Asynchronous API calls To fully benefit from moving to an asynchronous framework, your app’s API calls also need to be asynchronous. That way, whenever a worker is waiting for an external response, it can pause that coroutine and start handling another incoming request. Let's see what that looks like when using the official OpenAI Python SDK. First, we initialize the async version of the OpenAI client: openai_client = openai.AsyncOpenAI( base_url=os.environ["AZURE_OPENAI_ENDPOINT"] + "/openai/v1", api_key=token_provider ) Then, whenever we make API calls with methods on that client, we await their results: chat_coroutine = await openai_client.chat.completions.create( deployment_id=os.environ["AZURE_OPENAI_CHAT_DEPLOYMENT"], messages=[{"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": request_message}], stream=True, ) For the RAG sample, we also have calls to Azure services like Azure AI Search. To make those asynchronous, we first import the async variant of the credential and client classes in the aio module: from azure.identity.aio import DefaultAzureCredential from azure.search.documents.aio import SearchClient Then, like with the OpenAI async clients, we must await results from any methods that make network calls: r = await self.search_client.search(query_text) By ensuring that every outbound network call is asynchronous, your app can make the most of Python’s event loop — handling multiple user sessions and API requests concurrently, without wasting worker time waiting on slow responses. Sample applications We’ve already linked to several of our samples that use async frameworks, but here’s a longer list so you can find the one that best fits your tech stack: Repository App purpose Backend Frontend azure-search-openai-demo RAG with AI Search Python + Quart React rag-postgres-openai-python RAG with PostgreSQL Python + FastAPI React openai-chat-app-quickstart Simple chat with Azure OpenAI models Python + Quart plain JS openai-chat-backend-fastapi Simple chat with Azure OpenAI models Python + FastAPI plain JS deepseek-python Simple chat with Azure AI Foundry models Python + Quart plain JS1.6KViews4likes0CommentsLearn how to build MCP servers with Python and Azure
We just concluded Python + MCP, a three-part livestream series where we: Built MCP servers in Python using FastMCP Deployed them into production on Azure (Container Apps and Functions) Added authentication, including Microsoft Entra as the OAuth provider All of the materials from our series are available for you to keep learning from, and linked below: Video recordings of each stream Powerpoint slides Open-source code samples complete with Azure infrastructure and 1-command deployment If you're an instructor, feel free to use the slides and code examples in your own classes. Spanish speaker? We've got you covered- check out the Spanish version of the series. 🙋🏽♂️Have follow up questions? Join our weekly office hours on Foundry Discord: Tuesdays @ 11AM PT → Python + AI Thursdays @ 8:30 AM PT → All things MCP Building MCP servers with FastMCP 📺 Watch YouTube recording In the intro session of our Python + MCP series, we dive into the hottest technology of 2025: MCP (Model Context Protocol). This open protocol makes it easy to extend AI agents and chatbots with custom functionality, making them more powerful and flexible. We demonstrate how to use the Python FastMCP SDK to build an MCP server running locally. Then we consume that server from chatbots like GitHub Copilot in VS Code, using it's tools, resources, and prompts. Finally, we discover how easy it is to connect AI agent frameworks like Langchain and Microsoft agent-framework to the MCP server. Slides for this session Code repository with examples: python-mcp-demos Deploying MCP servers to the cloud 📺 Watch YouTube recording In our second session of the Python + MCP series, we deploy MCP servers to the cloud! We walk through the process of containerizing a FastMCP server with Docker and deploying to Azure Container Apps. Then we instrument the MCP server with OpenTelemetry and observe the tool calls using Azure Application Insights and Logfire. Finally, we explore private networking options for MCP servers, using virtual networks that restrict external access to internal MCP tools and agents. Slides for this session Code repository with examples: python-mcp-demos Authentication for MCP servers 📺 Watch YouTube recording In our third session of the Python + MCP series, we explore the best ways to build authentication layers on top of your MCP servers. We start off simple, with an API key to gate access, and demonstrate a key-restricted FastMCP server deployed to Azure Functions. Then we move on to OAuth-based authentication for MCP servers that provide user-specific data. We dive deep into MCP authentication, which is built on top of OAuth2 but with additional requirements like PRM and DCR/CIMD, which can make it difficult to implement fully. We demonstrate the full MCP auth flow in the open-souce identity provider KeyCloak, and show how to use an OAuth proxy pattern to implement MCP auth on top of Microsoft Entra. Slides for this session Code repository with Container Apps examples: python-mcp-demos Code repository with Functions examples: python-mcp-demos9.8KViews3likes2CommentsLearn MCP from our free livestream series in December
Our Python + MCP series is a three-part, hands-on journey into one of the most important emerging technologies of 2025: MCP (Model Context Protocol) — an open standard for extending AI agents and chat interfaces with real-world tools, data, and execution environments. Whether you're building custom GitHub Copilot tools, powering internal developer agents, or creating AI-augmented applications, MCP provides the missing interoperability layer between LLMs and the systems they need to act on. Across the series, we move from local prototyping, to cloud deployment, to enterprise-grade authentication and security, all powered by Python and the FastMCP SDK. Each session builds on the last, showing how MCP servers can evolve from simple localhost services to fully authenticated, production-ready services running in the cloud — and how agents built with frameworks like Langchain and Microsoft’s agent-framework can consume them at every stage. 🔗 Register for the entire series. You can also scroll down to learn about each live stream and register for individual sessions. In addition to the live streams, you can also join office hours after each session in Foundry Discord to ask any follow-up questions. To get started with your MCP learnings before the series, check out the free MCP-for-beginners course on GitHub. Building MCP servers with FastMCP 16 December, 2025 | 6:00 PM - 7:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In the intro session of our Python + MCP series, we dive into the hottest technology of 2025: MCP (Model Context Protocol). This open protocol makes it easy to extend AI agents and chatbots with custom functionality, making them more powerful and flexible. We demonstrate how to use the Python FastMCP SDK to build an MCP server running locally and consume that server from chatbots like GitHub Copilot. Then we build our own MCP client to consume the server. Finally, we discover how easy it is to connect AI agent frameworks like Langchain and Microsoft agent-framework to MCP servers. Deploying MCP servers to the cloud 17 December, 2025 | 6:00 PM - 7:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our second session of the Python + MCP series, we're deploying MCP servers to the cloud! We'll walk through the process of containerizing a FastMCP server with Docker and deploying to Azure Container Apps, and also demonstrate a FastMCP server running directly on Azure Functions. Then we'll explore private networking options for MCP servers, using virtual networks that restrict external access to internal MCP tools and agents. Authentication for MCP servers 18 December, 2025 | 6:00 PM - 7:00 PM (UTC) Coordinated Universal Time Register for the stream on Reactor In our third session of the Python + MCP series, we're exploring the best ways to build authentication layers on top of your MCP servers. That could be as simple as an API key to gate access, but for the servers that provide user-specific data, we need to use an OAuth2-based authentication flow. MCP authentication is built on top of OAuth2 but with additional requirements like PRM and DCR/CIMD, which can make it difficult to implement fully. In this session, we'll demonstrate the full MCP auth flow, and provide examples that implement MCP Auth on top of Microsoft Entra.1.1KViews2likes0CommentsGPT-5: Will it RAG?
OpenAI released the GPT-5 model family last week, with an emphasis on accurate tool calling and reduced hallucinations. For those of us working on RAG (Retrieval-Augmented Generation), it's particularly exciting to see a model specifically trained to reduce hallucination. There are five variants in the family: gpt-5 gpt-5-mini gpt-5-nano gpt-5-chat: Not a reasoning model, optimized for chat applications gpt-5-pro: Only available in ChatGPT, not via the API As soon as GPT-5 models were available in Azure AI Foundry, I deployed them and evaluated them inside our most popular open source RAG template. I was immediately impressed - not by the model's ability to answer a question, but by it's ability to admit it could not answer a question! You see, we have one test question for our sample data (HR documents for a fictional company's) that sounds like it should be an easy question: "What does a Product Manager do?" But, if you actually look at the company documents, there's no job description for "Product Manager", only related jobs like "Senior Manager of Product Management". Every other model, including the reasoning models, has still pretended that it could answer that question. For example, here's a response from o4-mini: However, the gpt-5 model realizes that it doesn't have the information necessary, and responds that it cannot answer the question: As I always say: I would much rather have an LLM admit that it doesn't have enough information instead of making up an answer. Bulk evaluation But that's just a single question! What we really need to know is whether the GPT-5 models will generally do a better job across the board, on a wide range of questions. So I ran bulk evaluations using the azure-ai-evaluations SDK, checking my favorite metrics: groundedness (LLM-judged), relevance (LLM-judged), and citations_matched (regex based off ground truth citations). I didn't bother evaluating gpt-5-nano, as I did some quick manual tests and wasn't impressed enough - plus, we've never used a nano sized model for our RAG scenarios. Here are the results for 50 Q/A pairs: metric stat gpt-4.1-mini gpt-4o-mini gpt-5-chat gpt-5 gpt-5-mini o3-mini groundedness pass % 94% 86% 96% 100% 🏆 94% 96% ↑ mean score 4.76 4.50 4.86 5.00 🏆 4.82 4.80 relevance pass % 94% 🏆 84% 90% 90% 74% 90% ↑ mean score 4.42 🏆 4.22 4.06 4.20 4.02 4.00 answer_length mean 829 919 549 844 940 499 latency mean 2.9 4.5 2.9 9.6 7.5 19.4 citations_matched % 52% 49% 52% 47% 49% 51% For the LLM-judged metrics of groundedness and relevance , the LLM awards a score of 1-5, and both 4 and 5 are considered passing scores. That's why you see both a "pass %" (percentage with 4 or 5 score) and an average score in the table above. For the groundedness metric, which measures whether an answer is grounded in the retrieved search results, the gpt-5 model does the best (100%), while the other gpt-5 models do quite well as well, on par with our current default model of gpt-4.1-mini. For the relevance metric, which measures whether an answer fully answers a question, the gpt-5 models don't score as highly as gpt-4.1-mini. I looked into the discrepancies there, and I think that's actually due to gpt-5 being less willing to give an answer when it's not fully confident in it - it would rather give a partial answer instead. That's a good thing for RAG apps, so I am comfortable with that metric being less than 100%. The latency metric is generally higher for the gpt-5 reasoning models, as would be expected, but is also variable based on on deployment region, region capacity, etc, assuming you're not using a "provisioned thoroughput" deployment. Also note that the latency here records the total time taken, from first token to last token, whereas the most important metric for a user-facing streaming chat is the time to first token. For the gpt-5 reasoning models in this evaluation, I set the reasoning_effort to "minimal", which means the model chooses whether to use reasoning tokens. I have never seen it actually use any reasoning tokens when I set the effort to minimal, so maybe that means that a higher reasoning effort is really only needed for longer or more complex tasks, and RAG answering is a non-reasoning task. A higher reasoning effort would definitely affect the latency and likely also affect the answer quality. I did not test that out, since the "minimal" effort setting already results in high quality answers. Answer differences Now that we've seen the overall metrics, let's dig into some actual answers, and see some of the key ways that GPT-5 model answers differ. Quicker to say "I don't know" 🤷🏻 Let's start with the first thing I noticed: gpt-5 was happy to admit when it couldn't fully answer a question. In the 50 questions evaluated, the gpt-5 model started off their answer with "I don't know" for 3 of the answers, while gpt-5-mini started that way for 10 of the answers. In table form: gpt-4.1-mini gpt-5 gpt-5-mini gpt-5-chat 1 (1.6%) 3 (6%) 10 (20%) 1 (1.6%) Here's an example of a gpt-4.1-mini answer versus a gpt-5 answer: Q: Explain the eligibility criteria for the Northwind Standard plan offered to Contoso employees? gpt-4.1-mini gpt-5 The eligibility criteria for the Northwind Standard plan for Contoso employees are not explicitly detailed in the provided sources. However, it is mentioned that prior-authorization may be required for certain services or treatments, which involves approval from Northwind Health based on medical necessity and coverage under the plan. Employees must have their physician submit a request for prior-authorization, and in some cases, meet specific criteria such as medical tests, treatments, or lifestyle changes to be approved. Preventive care, emergency services, mental health, and substance abuse coverage do not require prior-authorization. Out-of-network services do require prior-authorization and may incur additional costs [Northwind_Standard_Benefits_Details.pdf#page=63][Northwind_Standard_Benefits_Details.pdf#page=102]. If you need information on basic eligibility (such as employment status or tenure), that is not provided in these sources. I don’t know. The provided sources describe compliance (ERISA, COBRA, HIPAA, ACA) and note that Northwind Standard is a group plan covering “all eligible employees of Contoso and their spouses or dependents,” but they do not define the specific eligibility criteria (e.g., hours worked, waiting period, employment class). [Northwind_Standard_Benefits_Details.pdf#page=88][Northwind_Standard_Benefits_Details.pdf#page=85][Northwind_Standard_Benefits_Details.pdf#page=86] Both of the answers admit some degree of not-knowing, but gpt-5 comes right out with it at the start of the answer. Notably, our system prompt includes this instruction: Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. So what we may be seeing is a combination of two factors: 1) the gpt-5 reasoning models are generally better at following instructions, 2) the gpt-5 reasoning models went through a specific alignment process to reduce hallucination. Notably, however, the gpt-5-chat model rarely responded with "I don't know", so it is not as affected by whatever factors are increasing the likelihood of the reasoning models to answer that way. You may be wondering why I have questions in the evaluation set that aren't fully answerable - it's hard to generate a good set of ground truth for evaluation, and my ground truth data is not the optimal set of questions. Ideally, I would do a separate evaluation for all the fully knowable questions versus all the unknowable questions - and I even have a specific custom metric for "don't-know-ness" that I run on those - but a few partially knowable questions snuck into this ground truth data set. Requests for clarification 🙋🏻♀️ Our system prompt also encourages the model to ask clarifying questions when necessary: If asking a clarifying question to the user would help, ask the question. But I had never actually seen a clarifying question from a model until gpt-5-chat, which replied with a question to 2 of the 50 questions. Example: Q: What should I do if I need healthcare services outside Washington with Northwind Health? gpt-4.1-mini gpt-5-chat If you have Northwind Standard and need healthcare services outside Washington, you must meet these criteria for coverage: be outside Washington for no more than 12 consecutive months, be more than 100 miles away from your primary care provider, and have had your plan in effect for at least 6 months before receiving out-of-area care. You must contact Northwind Health in advance and obtain prior authorization for the care; otherwise, you may be responsible for the full cost. Emergency services outside Washington are covered, but non-emergency services are not unless these conditions are met [Northwind_Standard_Benefits_Details.pdf#page=59][Northwind_Standard_Benefits_Details.pdf#page=60]. If you have Northwind Health Plus, first ensure the provider is in the nationwide network by checking the online directory or calling customer service. Keep receipts and documentation for any out-of-state services, as you may need to submit claims for reimbursement. Contact Northwind Health Plus before receiving services to confirm coverage, as not all services outside Washington are covered [Northwind_Health_Plus_Benefits_Details.pdf#page=62]. Which plan do you have — Northwind Standard or Northwind Health Plus? Like in the example above, other models typically avoid asking a clarifying question by instead addressing every possible scenario (covering all their bases!), resulting in a fairly long answer. Arguably, it's a better user experience to ask the clarifying question, so that the user can get an answer tailored to their specific situation and not get confused by other possibilities. The primary drawback is the additional back-and-forth for the final answer. I did not personally see clarifying questions from the GPT-5 reasoning models, but my colleagues noticed an increase in such questions when using gpt-5 with GitHub Copilot Agent mode, so it's possible that every model in the family has an increased likelihood to ask for clarification when faced with ambiguity. Reasoning models ❤️ lists I saw a definite tendency of the GPT-5 reasoning models to output the answer as a list, even though there's nothing in the RAG system prompt that tells it to use lists. Out of the 50 questions evaluated, here are the number with answers formatted as a top-level bulleted list: gpt-4.1-mini gpt-5 gpt-5-mini gpt-5-chat 0 36 (72%) 26 (52%) 0 Here's an example of a gpt-4.1-mini answer (paragraph) versus a gpt-5 answer (list): Q: What are the coverage limitations of the Northwind Standard plan regarding clinical trials? gpt-4.1-mini gpt-5 The Northwind Standard plan covers certain clinical trial services such as diagnostic testing, treatment of the condition being studied, medications, lab services, and imaging services. However, it does not cover travel expenses associated with attending clinical trials. Additionally, any experimental treatments or services not part of the clinical trial are not covered. Coverage for other clinical trial services not explicitly listed may be considered on a case-by-case basis. Members should contact Northwind Health customer service for more details [Northwind_Standard_Benefits_Details.pdf#page=23][Northwind_Standard_Benefits_Details.pdf#page=24]. Travel expenses for attending clinical trials are not covered [Northwind_Standard_Benefits_Details.pdf#page=23][Northwind_Standard_Benefits_Details.pdf#page=24]. Experimental treatments or services that are not part of the clinical trial are not covered [Northwind_Standard_Benefits_Details.pdf#page=24]. Services or treatments deemed experimental or investigational are not covered by the plan [Northwind_Standard_Benefits_Details.pdf#page=53]. Now, is it a bad thing that the gpt-5 reasoning models use lists? Not necessarily! But if that's not the style you're looking for, then you either want to consider the gpt-5-chat model or add specific messaging in the system prompt to veer the model away from top level lists. Longer answers As we saw in overall metrics above, there was an impact of the answer length (measured in the number of characters, not tokens). Let's isolate those stats: gpt-4.1-mini gpt-5 gpt-5-mini gpt-5-chat 829 844 990 549 The gpt-5 reasoning models are generating answers of similar length to the current baseline of gpt-4.1-mini, though the gpt-5-mini model seems to be a bit more verbose. The API now has a new parameter to control verbosity for those models, which defaults to "medium". I did not try an evaluation with that set to "low" or "high", which would be an interesting evaluation to run. The gpt-5-chat model outputs relatively short answers, which are actually closer in length to the answer length that I used to see from gpt-3.5-turbo. What answer length is best? A longer answer will take longer to finish rendering to the user (even when streaming), and will cost the developer more tokens. However, sometimes answers are longer due to better formatting that is easier to skim, so longer does not always mean less readable. For the user-facing RAG chat scenario, I generally think that shorter answers are better. If I was putting these gpt-5 reasoning models in production, I'd probably try out the "low" verbosity value, or put instructions in the system prompt, so that users get their answers more quickly. They can always ask follow-up questions as needed. Fancy punctuation This is a weird difference that I discovered while researching the other differences: the GPT-5 models are more likely to use “smart” quotes instead of standard ASCII quotes. Specifically: Left single: ‘ (U+2018) Right single / apostrophe: ’ (U+2019) Left double: “ (U+201C) Right double: ” (U+201D) For example, the gpt-5 model actually responded with " I don’t know ", not with " I don't know ". It's a subtle difference, but if you are doing any sort of post-processing or analysis, it's good to know. I've also seen the models using the smart quotes incorrectly in coding contexts (like GitHub Copilot Agent mode), so that's another potential issue to look out for. I assumed that the models were trained on data that tended to use smart quotes more often, perhaps synthetic data or book text. I know that as a normal human, I rarely use them, given the extra effort required to type them. Query rewriting to the extreme Our RAG flow makes two LLM calls: the second answers the question, as you’d expect, but the first rewrites the user’s query into a strong search query. This step can fix spelling mistakes, but it’s even more important for filling in missing context in multi-turn conversations—like when the user simply asks, “what else?” A well-crafted rewritten query leads to better search results, and ultimately, a more complete and accurate answer. During my manual tests, I noticed that the rewritten queries from the GPT-5 models are much longer, filled to the brim with synonyms. For example: Q: What does a Product Manager do? gpt-4.1-mini gpt-5-mini product manager responsibilities Product Manager role responsibilities duties skills day-to-day tasks product management overview Are these new rewritten queries better or worse than the previous short ones? It's hard to tell, since they're just one factor in the overall answer output, and I haven't set up a retrieval-specific metric. The closest metric is citations_matched , since the new answer from the app can only match the citations in the ground truth if the app managed to retrieve all the same citations. That metric was generally high for these models, and when I looked into the cases where the citations didn't match, I typically thought the gpt-5 family of responses were still good answers. I suspect that the rewritten query does not have a huge effect either way, since our retrieval step uses hybrid search from Azure AI Search, and the combined power of both hybrid and vector search generally compensates for differences in search query wording. It's worth evaluating this further however, and considering using a different model for the query rewriting step. Developers often choose to use a smaller, faster model for that stage, since query rewriting is an easier task than answering a question. So, are the answers accurate? Even with a 100% groundedness score from an LLM judge, it's possible that a RAG app can be producing inaccurate answers, like if the LLM judge is biased or the retrieved context is incomplete. The only way to really know if RAG answers are accurate is to send them to a human expert. For the sample data in this blog, there is no human expert available, since they're based off synthetically generated documents. Despite two years of staring at those documents and running dozens of evaluations, I still am not an expert in the HR benefits of the fictional Contoso company. That's why I also ran the same evaluations on the same RAG codebase, but with data that I know intimately: my own personal blog. I looked through 200 answers from gpt-5, and did not notice any inaccuracies in the answers. Yes, there are times when it says "I don't know" or asks a clarifying question, but I consider those to be accurate answers, since they do not spread misinformation. I imagine that I could find some way to trick up the gpt-5 model, but on the whole, it looks like a model with a high likelihood of generating accurate answers when given relevant context. Evaluate for yourself! I share my evaluations on our sample RAG app as a way to share general learnings on model differences, but I encourage every developer to evaluate these models for your specific domain, alongside domain experts that can reason about the correctness of the answers. How can you evaluate? If you are using the same open source RAG project for Azure, deploy the GPT-5 models and follow the steps in the evaluation guide. If you have your own solution, you can use an open-source SDK for evaluating, like azure-ai-evaluation (the one that I use), DeepEval, promptfoo, etc. If you are using an observability platform like Langfuse, Arize, or Langsmith, they have evaluation strategies baked in. Or if you're using an agents framework like Pydantic AI, those also often have built-in eval mechanisms. If you can share what you learn from evaluations, please do! We are all learning about the strange new world of LLMs together.2.2KViews2likes0CommentsRAG Deep Dive: 10-part live stream series
Our most popular RAG solution for Azure has now been deployed thousands of times by developers using it across myriad domains, like meeting transcripts, research papers, HR documents, and industry manuals. Based on feedback from the community (and often, thanks to pull requests from the community!), we've added the most hotly requested features: support for multiple document types, chat history with Cosmos DB, user account and login, data access control, multimodal media ingestion, private deployment, and more. This open-source RAG solution is powerful, but it can be intimidating to dive into the code yourself, especially now that it has so many optional features. That's why we're putting on a 10-part live series in January/February, diving deep into the solution and showing you all the ways you can use it. Register for the whole series on Reactor or scroll down to learn about each session and register for individual sessions. We look forward to seeing you in the live chat and hearing how you're using the RAG solution for your own domain. See you in the streams! 👋🏻 The RAG solution for Azure 13 January, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor Join us for the kick-off session, where we'll do a live demo of the RAG solution and explain how it all works. We'll step through the RAG flow from Azure AI Search to Azure OpenAI, deploy the app to Azure, and discuss the Azure architecture. Customizing our RAG solution 15 January, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor In our second session, we'll show you how to customize the RAG solution for your own domain - adding your own data, modifying the prompts, and personalizing the UI. Plus, we'll give you tips for local development for faster feature iteration. Optimal retrieval with Azure AI Search 20 January, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor Our RAG solution uses Azure AI Search to find matching documents, using state-of-the-art retrieval mechanisms. We'll dive into the mechanics of vector embeddings, hybrid search with RRF, and semantic ranking. We'll also discuss the data ingestion process, highlighting the differences between manual ingestion and integrated vectorization Multimedia data ingestion 22 January, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor Do your documents contain images or charts? Our RAG solution has two different approaches to handling multimedia documents, and we'll dive into both approaches in this session. The first approach is purely during ingestion time, where it replaces media in the documents with LLM-generated descriptions. The second approach stores images of the media alongside vector embeddings of the images, and sends both text and images to a multimodal LLM for question answering. Learn about both approaches in this session so that you can decide what to use for your app. User login and data access control 27 January, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor In our RAG flow, the app first searches a knowledge base for relevant matches to a user's query, then sends the results to the LLM along with the original question. What if you have documents that should only be accessed by a subset of your users, like a group or a single user? Then you need data access controls to ensure that document visibility is respected during the RAG flow. In this session, we'll show an approach using Azure AI Search with data access controls to only search the documents that can be seen by the logged in user. We'll also demonstrate a feature for user-uploaded documents that uses data access controls along with Azure Data Lake Storage Gen2. Storing chat history 29 January, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor Learn how we store chat history using either IndexedDB for client-side storage or Azure Cosmos DB for persistent storage. We'll discuss the API architecture and data schema choices, doing both a live demo of the app and a walkthrough of the code. Adding speech input and output 3 February, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor Our RAG solution includes optional features for speech input and output, powered either by the free browser SDKs or by the powerful Azure Speech API. We also offer a tight integration with the VoiceRAG solution, for those of you who want a real-time voice interface. Learn about all the ways you can add speech to your RAG chat in this session! Private deployment 5 February, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor To ensure that the RAG app can only be accessed within your enterprise network, you can deploy it to an Azure virtual network with private endpoints for each Azure service used. In this session, we'll show how to deploy the app to a virtual network that includes AI Search, OpenAI, Document Intelligence, and Blob storage. Then we'll log in to the virtual network using Azure Bastion with a virtual machine to demonstrate that we can access the RAG app from inside the network, and only inside the network. Evaluating RAG answer quality 10 February, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor How can you be sure that the RAG chat app answers are accurate, clear, and well formatted? Evaluation! In this session, we'll show you how to generate synthetic data and run bulk evaluations on your RAG app, using the azure-ai-evaluation SDK. Learn about GPT metrics like groundedness and fluency, and custom metrics like citation matching. Plus, discover how you can run evaluations on CI/CD, to easily verify that new changes don't introduce quality regressions. Monitoring and tracing LLM calls 12 February, 2025 | 11:30 PM UTC | 3:30 PM PT Register for the stream on Reactor When your RAG app is in production, observability is crucial. You need to know about performance issues, runtime errors, and LLM-specific issues like Content Safety filter violations. In this session, learn how to use Azure Monitor along with OpenTelemetry SDKs to monitor the RAG application.1.6KViews2likes0CommentsAdd speech input & output to your app with the free browser APIs
One of the amazing benefits of modern machine learning is that computers can reliably turn text into speech, or transcribe speech into text, across multiple languages and accents. We can then use those capabilities to make our web apps more accessible for anyone who has a situational, temporary, or chronic issue that makes typing difficult. That describes so many people - for example, a parent holding a squirmy toddler in their hands, an athlete with a broken arm, or an individual with Parkinson's disease. There are two approaches we can use to add speech capabilities to our apps: Use the built-in browser APIs: the SpeechRecognition API and SpeechSynthesis API. Use a cloud-based service, like the Azure Speech API. Which one to use? The great thing about the browser APIs is that they're free and available in most modern browsers and operating systems. The drawback of the APIs is that they're often not as powerful and flexible as cloud-based services, and the speech output often sounds more robotic. There are also a few niche browser/OS combos where the built-in APIs don't work. That's why we decided to add both options to our most popular RAG chat solution, to give developers the option to decide for themselves. However, in this post, I'm going to show you how to add speech capabilities using the free built-in browser APIs, since free APIs are often easier to get started with and it's important to do what we can to improve the accessibility of our apps. The GIF below shows the end result, a chat app with both speech input and output buttons: All of the code described in this post is part of openai-chat-vision-quickstart, so you can grab the full code yourself after seeing how it works. Speech input with SpeechRecognition API To make it easier to add a speech input button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechInputButton . First I construct the speech input button element with an instance of the SpeechRecognition API, making sure to use the browser's preferred language if any are set: class SpeechInputButton extends HTMLElement { constructor() { super(); this.isRecording = false; const SpeechRecognition = window.SpeechRecognition || window.webkitSpeechRecognition; if (!SpeechRecognition) { this.dispatchEvent( new CustomEvent("speecherror", { detail: { error: "SpeechRecognition not supported" }, }) ); return; } this.speechRecognition = new SpeechRecognition(); this.speechRecognition.lang = navigator.language || navigator.userLanguage; this.speechRecognition.interimResults = false; this.speechRecognition.continuous = true; this.speechRecognition.maxAlternatives = 1; } Then I define the connectedCallback() method that will be called whenever this custom element has been added to the DOM. When that happens, I define the inner HTML to render a button and attach event listeners for both mouse and keyboard events. Since we want this to be fully accessible, keyboard support is important. connectedCallback() { this.innerHTML = ` <button class="btn btn-outline-secondary" type="button" title="Start recording (Shift + Space)"> <i class="bi bi-mic"></i> </button>`; this.recordButton = this.querySelector('button'); this.recordButton.addEventListener('click', () => this.toggleRecording()); document.addEventListener('keydown', this.handleKeydown.bind(this)); } handleKeydown(event) { if (event.key === 'Escape') { this.abortRecording(); } else if (event.key === ' ' && event.shiftKey) { // Shift + Space event.preventDefault(); this.toggleRecording(); } } toggleRecording() { if (this.isRecording) { this.stopRecording(); } else { this.startRecording(); } } The majority of the code is in the startRecording function. It sets up a listener for the "result" event from the SpeechRecognition instance, which contains the transcribed text. It also sets up a listener for the "end" event, which is triggered either automatically after a few seconds of silence (in some browsers) or when the user ends the recording by clicking the button. Finally, it sets up a listener for any "error" events. Once all listeners are ready, it calls start() on the SpeechRecognition instance and styles the button to be in an active state. startRecording() { if (this.speechRecognition == null) { this.dispatchEvent( new CustomEvent("speech-input-error", { detail: { error: "SpeechRecognition not supported" }, }) ); } this.speechRecognition.onresult = (event) => { let input = ""; for (const result of event.results) { input += result[0].transcript; } this.dispatchEvent( new CustomEvent("speech-input-result", { detail: { transcript: input }, }) ); }; this.speechRecognition.onend = () => { this.isRecording = false; this.renderButtonOff(); this.dispatchEvent(new Event("speech-input-end")); }; this.speechRecognition.onerror = (event) => { if (this.speechRecognition) { this.speechRecognition.stop(); if (event.error == "no-speech") { this.dispatchEvent( new CustomEvent("speech-input-error", { detail: {error: "No speech was detected. Please check your system audio settings and try again."}, })); } else if (event.error == "language-not-supported") { this.dispatchEvent( new CustomEvent("speech-input-error", { detail: {error: "The selected language is not supported. Please try a different language.", }})); } else if (event.error != "aborted") { this.dispatchEvent( new CustomEvent("speech-input-error", { detail: {error: "An error occurred while recording. Please try again: " + event.error}, })); } } }; this.speechRecognition.start(); this.isRecording = true; this.renderButtonOn(); } If the user stops the recording using the keyboard shortcut or button click, we call stop() on the SpeechRecognition instance. At that point, anything the user had said will be transcribed and become available via the "result" event. stopRecording() { if (this.speechRecognition) { this.speechRecognition.stop(); } } Alternatively, if the user presses the Escape keyboard shortcut, we instead call abort() on the SpeechRecognition instance, which stops the recording and does not send any previously untranscribed speech over. abortRecording() { if (this.speechRecognition) { this.speechRecognition.abort(); } } Once the custom HTML element is fully defined, we register it with the desired tag name, speech-input-button : customElements.define("speech-input-button", SpeechInputButton); To use the custom speech-input-button element in a chat application, we add it to the HTML for the chat form: <speech-input-button></speech-input-button> <input id="message" name="message" type="text" rows="1"></input> Then we attach an event listener for the custom events dispatched by the element, and we update the input text field with the transcribed text: const speechInputButton = document.querySelector("speech-input-button"); speechInputButton.addEventListener("speech-input-result", (event) => { messageInput.value += " " + event.detail.transcript.trim(); messageInput.focus(); }); You can see the full custom HTML element code in speech-input.js and the usage in index.html. There's also a fun pulsing animation for the button's active state in styles.css. Speech output with SpeechSynthesis API Once again, to make it easier to add a speech output button to any app, I'm wrapping the functionality inside a custom HTML element, SpeechOutputButton . When defining the custom element, we specify an observed attribute named "text", to store whatever text should be turned into speech when the button is clicked. class SpeechOutputButton extends HTMLElement { static observedAttributes = ["text"]; In the constructor, we check to make sure the SpeechSynthesis API is supported, and remember the browser's preferred language for later use. constructor() { super(); this.isPlaying = false; const SpeechSynthesis = window.speechSynthesis || window.webkitSpeechSynthesis; if (!SpeechSynthesis) { this.dispatchEvent( new CustomEvent("speech-output-error", { detail: { error: "SpeechSynthesis not supported" } })); return; } this.synth = SpeechSynthesis; this.lngCode = navigator.language || navigator.userLanguage; } When the custom element is added to the DOM, I define the inner HTML to render a button and attach mouse and keyboard event listeners: connectedCallback() { this.innerHTML = ` <button class="btn btn-outline-secondary" type="button"> <i class="bi bi-volume-up"></i> </button>`; this.speechButton = this.querySelector("button"); this.speechButton.addEventListener("click", () => this.toggleSpeechOutput() ); document.addEventListener('keydown', this.handleKeydown.bind(this)); } The majority of the code is in the toggleSpeechOutput function. If the speech is not yet playing, it creates a new SpeechSynthesisUtterance instance, passes it the "text" attribute, and sets the language and audio properties. It attempts to use a voice that's optimal for the desired language, but falls back to "en-US" if none is found. It attaches event listeners for the start and end events, which will change the button's style to look either active or unactive. Finally, it tells the SpeechSynthesis API to speak the utterance. toggleSpeechOutput() { if (!this.isConnected) { return; } const text = this.getAttribute("text"); if (this.synth != null) { if (this.isPlaying || text === "") { this.stopSpeech(); return; } // Create a new utterance and play it. const utterance = new SpeechSynthesisUtterance(text); utterance.lang = this.lngCode; utterance.volume = 1; utterance.rate = 1; utterance.pitch = 1; let voice = this.synth .getVoices() .filter((voice) => voice.lang === this.lngCode)[0]; if (!voice) { voice = this.synth .getVoices() .filter((voice) => voice.lang === "en-US")[0]; } utterance.voice = voice; if (!utterance) { return; } utterance.onstart = () => { this.isPlaying = true; this.renderButtonOn(); }; utterance.onend = () => { this.isPlaying = false; this.renderButtonOff(); }; this.synth.speak(utterance); } } When the user no longer wants to hear the speech output, indicated either via another press of the button or by pressing the Escape key, we call cancel() from the SpeechSynthesis API. stopSpeech() { if (this.synth) { this.synth.cancel(); this.isPlaying = false; this.renderButtonOff(); } } Once the custom HTML element is fully defined, we register it with the desired tag name, speech-output-button : customElements.define("speech-output-button", SpeechOutputButton); To use this custom speech-output-button element in a chat application, we construct it dynamically each time that we've received a full response from an LLM, and call setAttribute to pass in the text to be spoken: const speechOutput = document.createElement("speech-output-button"); speechOutput.setAttribute("text", answer); messageDiv.appendChild(speechOutput); You can see the full custom HTML element code in speech-output.js and the usage in index.html. This button also uses the same pulsing animation for the active state, defined in styles.css. Acknowledgments I want to give a huge shout-out to John Aziz for his amazing work adding speech input and output to the azure-search-openai-demo, as that was the basis for the code I shared in this blog post.1.4KViews2likes0CommentsBuilding MCP servers with Entra ID and pre-authorized clients
The Model Context Protocol (MCP) gives AI agents a standard way to call external tools, but things get more complicated when those tools need to know who the user is. In this post, I’ll show how to build an MCP server with the Python FastMCP package that authenticates users with Microsoft Entra ID when they connect from a pre-authorized client such as VS Code. If you need to build a server that works with any MCP clients, read my previous blog post. With Microsoft Entra as the authorization server, supporting arbitrary clients currently requires adding an OAuth proxy in front, which increases security risk. This post focuses on the simpler pre-authorized-client path instead. MCP auth Let’s start by digging into the MCP auth spec, since that explains both the shape of the flow and the constraints we run into with Entra. The MCP specification includes an authorization protocol based on OAuth 2.1, so an MCP client can send a request that includes a Bearer token from an authorization server, and the MCP server can validate that token. In OAuth 2.1 terms, the MCP client is acting as the OAuth client, the MCP server is the resource server, the signed-in user is the resource owner, and the authorization server issues an access token. In this case, Entra will be our authorization server. We can't necessarily use any OAuth-compatible authorization servers, as MCP auth requires more than just the core OAuth 2.1 functionality. In OAuth, the authorization server needs a relationship with the client. MCP auth describes three options: Pre-registration: the auth server has a pre-existing relationship and has the client ID in its database already CIMD (Client Identity Metadata Document): the MCP client sends the URL of its CIMD, a JSON document that describes its attributes, and the auth server bases its interactions on that information. DCR (Dynamic Client Registration): when the auth server sees a new client, it explicitly registers it and stores the client information in its own data. DCR is now considered a "legacy" path, as the hope is for CIMD to be the supported path in the future. For each MCP scenario - each combination of MCP server, MCP client, and authorization server - we need to determine which of those options are viable and optimal. Here's one way of thinking through it: VS Code supports all of MCP auth, so its MCP client includes both CIMD and DCR support. However, the Microsoft Entra authorization server does not support CIMD or DCR. That leaves us with only one official option: pre-registration. If we desperately need support for arbitrary clients, it is possible to put a CIMD/DCR proxy in front of Entra, as discussed in my previous blog post, but the Entra team discourages that approach due to increased security risks. When using pre-registration, the auth flow is relatively simple (but still complex, because hey, this is OAuth!): User asks to use auth-restricted MCP server MCP client makes a request to MCP server without a bearer token MCP server responds with an HTTP 401 and a pointer to its PRM (Protected Resource Metadata) document MCP client reads PRM to discover the authorization server and options MCP client redirects to authorization server, including its client ID User signs into authorization server Authorization server returns authorization code MCP client exchanges authorization code for access token Authorization server returns access token MCP client re-tries original request, but now with bearer token included MCP server validates bearer token and returns successfully Here's what that looks like: Now let's dig into the code for implementing MCP auth with the pre-registered VS Code client. Registering the MCP server with Entra Before the server can use Entra to authorize users, we need to register the server with Entra via an app registration. We can do registration using the Azure Portal, Azure CLI, Microsoft Graph SDK, or even Bicep. In this case, I use the Python MS Graph SDK as it allows me to specify everything programmatically. First, I create the Entra app registration, specifying the sign-in audience (single-tenant) and configuring the MCP server as a protected resource: scope_id = str(uuid.uuid4()) Application( display_name="Entra App for MCP server", sign_in_audience="AzureADMyOrg", api=ApiApplication( requested_access_token_version=2, oauth2_permission_scopes=[ PermissionScope( admin_consent_description="Allows access to the MCP server as the signed-in user.", admin_consent_display_name="Access MCP Server", id=scope_id, is_enabled=True, type="User", user_consent_description="Allow access to the MCP server on your behalf.", user_consent_display_name="Access MCP Server", value="user_impersonation") ], pre_authorized_applications=[ PreAuthorizedApplication( app_id=VSCODE_CLIENT_ID, delegated_permission_ids=[scope_id], )])) The api parameter is doing the heavy lifting, ensuring that other applications (like VS Code) can request permission to access the server on behalf of a user. Here's what each parameter does: requested_access_token_version=2: Entra ID has two token formats (v1.0 and v2.0). We need v2.0 because that's what FastMCP's token validator expects. oauth2_permission_scopes: This defines a permission called user_impersonation that MCP clients can request when connecting to your server. It's the server saying: "I accept tokens that let an MCP client act on behalf of a signed-in user." Without at least one scope defined, no MCP client can obtain a token for your server — Entra wouldn't know what permission to grant. The name user_impersonation is a convention (we could call it anything), but it clearly signals that the MCP client is accessing your server as the user, not as itself. pre_authorized_applications: This list tells Entra which client applications are pre-approved to request tokens for this server’s API without showing an extra consent prompt to the user. In this case, I list VS Code’s application ID and tie it to the user_impersonation scope, so VS Code can request a token for the MCP server as the signed-in user. Thanks to that configuration, when VS Code requests a token, it will request a token with the scope "api://{app_id}/user_impersonation" , and the FastMCP server will validate that incoming tokens contain that scope. Next, I create a Service Principal for that Entra app registration, which represents the Entra app in my tenant request_principal = ServicePrincipal(app_id=app.app_id, display_name=app.display_name) await graph_client.service_principals.post(request_principal) Securing credentials for Entra app registrations I also need a way for the server to prove that it can use that Entra app registration. There are three options: Client secret: Easiest to set up, but since it's a secret, it must be stored securely, protected carefully, and rotated regularly. Certificate: Stronger than a client secret and generally better suited for production, but it still requires certificate storage, renewal, and lifecycle management. Managed identity as Federated Identity Credential (MI-as-FIC): No stored secret, no certificate to manage, and usually the best choice when your app is hosted on Azure. No support for local development however. I wanted the best of both worlds: easy local development on my machine, but the most secure production story for deployment on Azure Container Apps. So I actually created two Entra app registrations, one for local with client secret, and one for production with managed identity. Here's how I set up the password for the local Entra app: password_credential = await graph_client.applications.by_application_id(app.id).add_password.post( AddPasswordPostRequestBody( password_credential=PasswordCredential(display_name="FastMCPSecret"))) It's a bit trickier to set up the MI-as-FIC, since we first need to provision the managed identity and associate that with our Azure Container Apps resource. I set all of that up in Bicep, and then after provisioning completes, I run this code to configure a FIC using the managed identity: fic = FederatedIdentityCredential( name="miAsFic", issuer=f"https://login.microsoftonline.com/{tenant_id}/v2.0", subject=managed_identity_principal_id, audiences=["api://AzureADTokenExchange"], ) await graph_client.applications.by_application_id( prod_app_id ).federated_identity_credentials.post(fic) Since I now have two Entra app registrations, I make sure that the environment variables in my local .env point to the secret-secured local Entra app registration, and the environment variables on my Azure Container App point to the FIC-secured prod Entra app registration. Granting admin consent This next step is only necessary if the MCP server uses the on-behalf-of (OBO) flow to exchange the incoming access token for a token to a downstream API, such as Microsoft Graph. In this case, my demo server uses OBO so it can query Microsoft Graph to check the signed-in user's group membership. The earlier code added VS Code as a pre-authorized application, but that only allows VS Code to obtain a token for the MCP server itself; it does not grant the MCP server permission to call Microsoft Graph on the user's behalf. Because the MCP sign-in flow in VS Code does not include a separate consent step for those downstream Graph scopes, I grant admin consent up front so the OBO exchange can succeed. This code grants the admin consent to the associated service principal for the Graph API resource and scopes: server_principal = await graph_client.service_principals_with_app_id(app.app_id).get() graph_principal = await graph_client.service_principals_with_app_id( "00000003-0000-0000-c000-000000000000" # Graph API ).get() await graph_client.oauth2_permission_grants.post( OAuth2PermissionGrant( client_id=server_principal.id, consent_type="AllPrincipals", resource_id=graph_principal.id, scope="User.Read email offline_access openid profile", ) ) If our MCP server needed to use an OBO flow with another resource server, we could request additional grants for those resources and scopes. Our Entra app registration is now ready for the MCP server, so let's move on to see the server code. Using FastMCP servers with Entra In our MCP server code, we configure FastMCP's RemoteAuthProvider based on the details from the Entra app registration process: from fastmcp.server.auth import RemoteAuthProvider from fastmcp.server.auth.providers.azure import AzureJWTVerifier verifier = AzureJWTVerifier( client_id=ENTRA_CLIENT_ID, tenant_id=AZURE_TENANT_ID, required_scopes=["user_impersonation"], ) auth = RemoteAuthProvider( token_verifier=verifier, authorization_servers=[f"https://login.microsoftonline.com/{AZURE_TENANT_ID}/v2.0"], base_url=base_url, ) Notice that we do not need to pass in a client secret at this point, even when using the local Entra app registration. FastMCP validates the tokens using Entra's public keys - no Entra app credentials needed. To make it easy for our MCP tools to access an identifier for the currently logged in user, we define a middleware that inspects the claims of the current token using FastMCP's get_access_token() and sets the "oid" (Entra object identifier) in the state: class UserAuthMiddleware(Middleware): def _get_user_id(self): token = get_access_token() if not (token and hasattr(token, "claims")): return None return token.claims.get("oid") async def on_call_tool(self, context: MiddlewareContext, call_next): user_id = self._get_user_id() if context.fastmcp_context is not None: await context.fastmcp_context.set_state("user_id", user_id) return await call_next(context) async def on_read_resource(self, context: MiddlewareContext, call_next): user_id = self._get_user_id() if context.fastmcp_context is not None: await context.fastmcp_context.set_state("user_id", user_id) return await call_next(context) When we initialize the FastMCP server, we set the auth provider and include that middleware: mcp = FastMCP("Expenses Tracker", auth=auth, middleware=[UserAuthMiddleware()]) Now, every request made to the MCP server will require authentication. The server will return a 401 if a valid token isn't provided, and that 401 will prompt the VS Code MCP client to kick off the MCP authorization flow. Inside each tool, we can grab the user id from the state, and use that to customize the response for the user, like to store or query items in a database. MCP.tool async def add_user_expense( date: Annotated[date, "Date of the expense in YYYY-MM-DD format"], amount: Annotated[float, "Positive numeric amount of the expense"], description: Annotated[str, "Human-readable description of the expense"], ctx: Context, ): """Add a new expense to Cosmos DB.""" user_id = await ctx.get_state("user_id") if not user_id: return "Error: Authentication required (no user_id present)" expense_item = { "id": str(uuid.uuid4()), "user_id": user_id, "date": date.isoformat(), "amount": amount, "description": description } await cosmos_container.create_item(body=expense_item) Using OBO flow in FastMCP server Remember when we granted admin consent for the Entra app registration earlier? That means we can use an OBO flow inside the MCP server, to make calls to the Graph API on behalf of the signed-in user. To make it easier to exchange and validate tokens, we use the Python MSAL SDK and configure a ConfidentialClientApplication . When using the local secret-secured Entra app registration, this is all we need to set it up: from msal import ConfidentialClientApplication confidential_client = ConfidentialClientApplication( client_id=entra_client_id, client_credential=os.environ["ENTRA_DEV_CLIENT_SECRET"], authority=f"https://login.microsoftonline.com/{os.environ['AZURE_TENANT_ID']}", token_cache=TokenCache(), ) When using the production FIC-secured Entra app registration, we need a function that returns tokens for the managed identity: from msal import ManagedIdentityClient, TokenCache, UserAssignedManagedIdentity mi_client = ManagedIdentityClient( UserAssignedManagedIdentity(client_id=os.environ["AZURE_CLIENT_ID"]), http_client=requests.Session(), token_cache=TokenCache()) def _get_mi_assertion(): result = mi_client.acquire_token_for_client(resource="api://AzureADTokenExchange") if "access_token" not in result: raise RuntimeError(f"Failed to get MI assertion: {result.get('error_description', 'unknown error')}") return result["access_token"] confidential_client = ConfidentialClientApplication( client_id=entra_client_id, client_credential={"client_assertion": _get_mi_assertion}, authority=f"https://login.microsoftonline.com/{os.environ['AZURE_TENANT_ID']}", token_cache=TokenCache()) Inside any code that requires OBO, we ask MSAL to exchange the MCP access token for a Graph API access token: graph_resource_access_token = confidential_client.acquire_token_on_behalf_of( user_assertion=access_token.token, scopes=["https://graph.microsoft.com/.default"] ) graph_token = graph_resource_access_token["access_token"] Once we successfully acquire the token, we can use that token with the Graph API, for any operations permitted by the scopes in the admin consent granted earlier. For this example, we call the Graph API to check whether the logged in user is a member of a particular Entra group: client = httpx.AsyncClient() url = ("https://graph.microsoft.com/v1.0/me/transitiveMemberOf/microsoft.graph.group" f"?$filter=id eq '{group_id}'&$count=true") response = await client.get( url, headers={ "Authorization": f"Bearer {graph_token}", "ConsistencyLevel": "eventual", }) data = response.json() membership_count = data.get("@odata.count", 0) is_admin = membership_count > 0 FastMCP 3.0 now provides a way to restrict tool visibility based on authorization checks, so I wrapped the above code in a function and set it as the auth constraint for the admin tool: async def require_admin_group(ctx: AuthContext) -> bool: graph_token = exchange_for_graph_token(ctx.token.token) return await check_user_in_group(graph_token, admin_group_id) @mcp.tool(auth=require_admin_group) async def get_expense_stats(ctx: Context): """Get expense statistics. Only accessible to admins.""" ... FastMCP will run that function both when an MCP client requests the list of tools, to determine which tools can be seen by the current user, and again when a user tries to use that tool, for an added just-in-time security check. This is just one way to use an OBO flow however. You can use it directly inside tools, like to query for more details from the Graph API, upload documents to OneDrive/SharePoint/Notes, send emails, etc. All together now For the full code, check out the open source azure-cosmosdb-identity-aware-mcp-server repository. The most relevant files for the Entra authentication setup are: auth_init.py: Creates the Entra app registrations for production and local development, defines the delegated user_impersonation scope, pre-authorizes VS Code, creates the service principal, and grants admin consent for the Microsoft Graph scopes used in the OBO flow. auth_postprovision.py: Adds the federated identity credential (FIC) after deployment so the container app's managed identity can act as the production Entra app without storing a client secret. main.py: Implements the MCP server using FastMCP's RemoteAuthProvider and AzureJWTVerifier for direct Entra authentication, plus OBO-based Microsoft Graph calls for admin group membership checks. As always, please let me know if you have further questions or ideas for other Entra integrations. Acknowledgements: Thank you to Matt Gotteiner for his guidance in implementing the OBO flow and review of the blog post.Learn how to build agents and workflows in Python
We just concluded Python + Agents, a six-part livestream series where we explored the foundational concepts behind building AI agents in Python using the Microsoft Agent Framework: Using agents with tools, MCP servers, and subagents Adding context to agents with database calls and long-term memory with Redis or Mem0 Monitoring using OpenTelemetry and evaluating quality with the Azure AI Evaluation SDK AI-driven workflows with conditional branching, structured outputs, and multi-agent orchestration Adding human-in-the-loop with tool approval and checkpoints All of the materials from our series are available for you to keep learning from, and linked below: Video recordings of each stream Powerpoint slides that you can use for reviewing or even teaching the material to your own community Open-source code samples you can run yourself using frontier LLMs from GitHub Models or Microsoft Foundry Models Spanish speaker? Check out the Spanish version of the series. 🙋🏽♂️ Have follow up questions? Join the weekly Python+AI office hours on Foundry Discord or the weekly Agent Framework office hours. Building your first agent in Python 📺 Watch YouTube recording In the first session of our Python + Agents series, we'll kick things off with the fundamentals: what AI agents are, how they work, and how to build your first one using the Microsoft Agent Framework. We'll start with the core anatomy of an agent, then walk through how tool calling works in practice—beginning with a single tool, expanding to multiple tools, and finally connecting to tools exposed through local MCP servers. We'll conclude with the supervisor agent pattern, where a single supervisor agent coordinates subtasks across multiple subagents, by treating each agent as a tool. Along the way, we'll share tips for debugging and inspecting agents, like using the DevUI interface from Microsoft Agent Framework for interacting with agent prototypes. 🖼️ Slides for this session 💻 Code repository with examples: python-agentframework-demos 📝 Write-up for this session Adding context and memory to agents 📺 Watch YouTube recording In the second session of our Python + Agents series, we'll extend agents built with the Microsoft Agent Framework by adding two essential capabilities: context and memory. We'll begin with context, commonly known as Retrieval‑Augmented Generation (RAG), and show how agents can ground their responses using knowledge retrieved from local data sources such as SQLite or PostgreSQL. This enables agents to provide accurate, domain‑specific answers based on real information rather than model hallucination. Next, we'll explore memory—both short‑term, thread‑level context and long‑term, persistent memory. You'll see how agents can store and recall information using solutions like Redis or open‑source libraries such as Mem0, enabling them to remember previous interactions, user preferences, and evolving tasks across sessions. By the end, you'll understand how to build agents that are not only capable but context‑aware and memory‑efficient, resulting in richer, more personalized user experiences. 🖼️ Slides for this session 💻 Code repository with examples: python-agentframework-demos 📝 Write-up for this session Monitoring and evaluating agents 📺 Watch YouTube recording In the third session of our Python + Agents series, we'll focus on two essential components of building reliable agents: observability and evaluation. We'll begin with observability, using OpenTelemetry to capture traces, metrics, and logs from agent actions. You'll learn how to instrument your agents and use a local Aspire dashboard to identify slowdowns and failures. From there, we'll explore how to evaluate agent behavior using the Azure AI Evaluation SDK. You'll see how to define evaluation criteria, run automated assessments over a set of tasks, and analyze the results to measure accuracy, helpfulness, and task success. By the end of the session, you'll have practical tools and workflows for monitoring, measuring, and improving your agents—so they're not just functional, but dependable and verifiably effective. 🖼️ Slides for this session 💻 Code repository with examples: python-agentframework-demos 📝 Write-up for this session Building your first AI-driven workflows 📺 Watch YouTube recording In Session 4 of our Python + Agents series, we'll explore the foundations of building AI‑driven workflows using the Microsoft Agent Framework: defining workflow steps, connecting them, passing data between them, and introducing simple ways to guide the path a workflow takes. We'll begin with a conceptual overview of workflows and walk through their core components: executors, edges, and events. You'll learn how workflows can be composed of simple Python functions or powered by full AI agents when a step requires model‑driven behavior. From there, we'll dig into conditional branching, showing how workflows can follow different paths depending on model outputs, intermediate results, or lightweight decision functions. We'll introduce structured outputs as a way to make branching more reliable and easier to maintain—avoiding vague string checks and ensuring that workflow decisions are based on clear, typed data. We'll discover how the DevUI interface makes it easier to develop workflows by visualizing the workflow graph and surfacing the streaming events during a workflow's execution. Finally, we'll dive into an E2E demo application that uses workflows inside a user-facing application with a frontend and backend. 🖼️ Slides for this session 💻 Code repository with examples: python-agentframework-demos 📝 Write-up for this session Orchestrating advanced multi-agent workflows 📺 Watch YouTube recording In Session 5 of our Python + Agents series, we'll go beyond workflow fundamentals and explore how to orchestrate advanced, multi‑agent workflows using the Microsoft Agent Framework. This session focuses on patterns that coordinate multiple steps or multiple agents at once, enabling more powerful and flexible AI‑driven systems. We'll begin by comparing sequential vs. concurrent execution, then dive into techniques for running workflow steps in parallel. You'll learn how fan‑out and fan‑in edges enable multiple branches to run at the same time, how to aggregate their results, and how concurrency allows workflows to scale across tasks efficiently. From there, we'll introduce two multi‑agent orchestration approaches that are built into the framework. We'll start with handoff, where control moves entirely from one agent to another based on workflow logic, which is useful for routing tasks to the right agent as the workflow progresses. We'll then look at Magentic, a planning‑oriented supervisor that generates a high‑level plan for completing a task and delegates portions of that plan to other agents. Finally, we'll wrap up with a demo of an E2E application that showcases a concurrent multi-agent workflow in action. 🖼️ Slides for this session 💻 Code repository with examples: python-agentframework-demos 📝 Write-up for this session Adding a human in the loop to agentic workflows 📺 Watch YouTube recording In the final session of our Python + Agents series, we'll explore how to incorporate human‑in‑the‑loop (HITL) interactions into agentic workflows using the Microsoft Agent Framework. This session focuses on adding points where a workflow can pause, request input or approval from a user, and then resume once the human has responded. HITL is especially important because LLMs can produce uncertain or inconsistent outputs, and human checkpoints provide an added layer of accuracy and oversight. We'll begin with the framework's requests‑and‑responses model, which provides a structured way for workflows to ask questions, collect human input, and continue execution with that data. We'll move onto tool approval, one of the most frequent reasons an agent requests input from a human, and see how workflows can surface pending tool calls for approval or rejection. Next, we'll cover checkpoints and resuming, which allow workflows to pause and be restarted later. This is especially important for HITL scenarios where the human may not be available immediately. We'll walk through examples that demonstrate how checkpoints store progress, how resuming picks up the workflow state, and how this mechanism supports longer‑running or multi‑step review cycles. This session brings together everything from the series—agents, workflows, branching, orchestration—and shows how to integrate humans thoughtfully into AI‑driven processes, especially when reliability and judgment matter most. 🖼️ Slides for this session 💻 Code repository with examples: python-agentframework-demos 📝 Write-up for this session