azure ai foundry
28 TopicsUnderstanding Small Language Modes
Small Language Models (SLMs) bring AI from the cloud to your device. Unlike Large Language Models that require massive compute and energy, SLMs run locally, offering speed, privacy, and efficiency. They’re ideal for edge applications like mobile, robotics, and IoT.How do I control how my agent responds?
Welcome to Agent Support—a developer advice column for those head-scratching moments when you’re building an AI agent! Each post answers a question inspired by real conversations in the AI developer community, offering practical advice and tips. This time, we’re talking about one of the most misunderstood ingredients in agent behavior: the system prompt. Let’s dive in! 💬 Dear Agent Support I’ve written a few different prompts to guide my agent’s responses, but the output still misses the mark—sometimes it’s too vague, other times too detailed. What’s the best way to structure the instructions so the results are more consistent? Great question! It gets right to the heart of prompt engineering. When the output feels inconsistent, it’s often because the instructions aren’t doing enough to guide the model’s behavior. That’s where prompt engineering can make a difference. By refining how you frame the instructions, you can guide the model toward more reliable, purpose-driven output. 🧠 What Is Prompt Engineering (and Why It Matters for Agents) Before we can fix the prompt, let’s define the craft. Prompt engineering is the practice of designing clear, structured input instructions that guide a model toward the behavior you want. In agent systems, this usually means writing the system prompt, a behind-the-scenes instruction that sets the tone, context, and boundaries for how the agent should act. While prompt engineering feels new, it’s rooted in decades of interface design, instruction tuning, and human-computer interaction research. The big shift? With large language models (LLMs), language becomes the interface. The better your instructions, the better your outcomes. 🧩 The Anatomy of a Good System Prompt Think of your system prompt as a blueprint for how the agent should operate. It sets the stage before the conversation starts. A strong system prompt should: Define the role: Who is this agent? What’s their tone, expertise, or purpose? Clarify the goal: What task should the agent help with? What should it avoid? Establish boundaries: Are there any constraints? Should it cite sources? Stay concise? Here’s a rough template you can build from: “You are a helpful assistant that specializes in [domain]. Your job is to [task]. Keep responses [format/length/tone]. If you’re unsure, respond with ‘I don’t know’ instead of guessing.” 🛠️ Why Prompts Fail (Even When They Sound Fine) Common issues we see: Too vague (“Be helpful” isn’t helpful.) Overloaded with logic (Treating the system prompt like a config file.) Conflicting instructions (“Be friendly” + “Use legal terminology precisely.”) Even well-written prompts can underperform if they’re mismatched with the model or task. That’s why we recommend testing and refining early and often! ✏️ Skip the Struggle— let the AI Toolkit Write It! Writing a great system prompt takes practice. And even then, it’s easy to overthink it! If you’re not sure where to start (or just want to speed things up), the AI Toolkit provides a built-in way to generate a system prompt for you. All you have to do is describe what the agent needs to do, and the AI Toolkit will generate a well-defined and detailed system prompt for your agent. Here's how to do it: Open the Agent Builder from the AI Toolkit panel in Visual Studio Code. Click the + New Agent button and provide a name for your agent. Select a Model for your agent. In the Prompts section, click Generate system prompt. In the Generate a prompt window that appears, provide basic details about your task and click Generate. After the AI Toolkit generates your agent’s system prompt, it’ll appear in the System prompt field. I recommend reviewing the system prompt and modifying any parts that may need revision! Heads up: System prompts aren’t just behind-the-scenes setup, they’re submitted along with the user prompt every time you send a request. That means they count toward your total token limit, so longer prompts can impact both cost and response length. 🧪 Test Before You Build Once you’ve written (or generated) a system prompt, don’t skip straight to wiring it into your agent. It’s worth testing how the model responds with the prompt in place first. You can do that right in the Agent Builder. Just submit a test prompt in the User Prompt field, click Run, and the model will generate a response using the system prompt behind the scenes. This gives you a quick read on whether the behavior aligns with your expectations before you start building around it. 🔁 Recap Here’s a quick rundown of what we covered: Prompt engineering helps guide your agent’s behavior through language. A good system prompt sets the tone, purpose, and guardrails for the agent. Test, tweak, and simplify—especially if responses seem inconsistent or off-target. You can use the Generate system prompt feature within the AI Toolkit to quickly generate instructions for your agent. 📺 Want to Go Deeper? Check out my latest video on how to define your agent’s behavior—it’s part of the Build an Agent Series, where I walk through the building blocks of turning an idea into a working AI agent. The Prompt Engineering Fundamentals chapter from our aka.ms/AITKGenAI curriculum overs all the essentials—prompt structure, common patterns, and ways to test and improve your outputs. It also includes exercises so you can get some hands-on practice. 👉 Explore the full curriculum: aka.ms/AITKGenAI And for all your general AI and AI agent questions, join us in the Azure AI Foundry Discord! You can find me hanging out there answering your questions about the AI Toolkit. I'm looking forward to chatting with you there! And remember, great agent behavior starts with great instructions—and now you’ve got the tools to write them.I want to show my agent a picture—Can I?
Welcome to Agent Support—a developer advice column for those head-scratching moments when you’re building an AI agent! Each post answers a question inspired by real conversations in the AI developer community, offering practical advice and tips. To kick things off, we’re tackling a common challenge for anyone experimenting with multimodal agents: working with image input. Let’s dive in! Dear Agent Support, I’m building an AI agent, and I’d like to include screenshots or product photos as part of the input. But I’m not sure if that’s even possible, or if I need to use a different kind of model altogether. Can I actually upload an image and have the agent process it? Great question, and one that trips up a lot of people early on! The short answer is: yes, some models can process images—but not all of them. Let’s break that down a bit. 🧠 Understanding Image Input When we talk about image input or image attachments, we’re talking about the ability to send a non-text file (like a .png, .jpg, or screenshot) into your prompt and have the model analyze or interpret it. That could mean describing what’s in the image, extracting text from it, answering questions about a chart, or giving feedback on a design layout. 🚫 Not All Models Support Image Input That said, this isn’t something every model can do. Most base language models are trained on text data only, they’re not designed to interpret non-text inputs like images. In most tools and interfaces, the option to upload an image only appears if the selected model supports it, since platforms typically hide or disable features that aren't compatible with a model's capabilities. So, if your current chat interface doesn’t mention anything about vision or image input, it’s likely because the model itself isn’t equipped to handle it. That’s where multimodal models come in. These are models that have been trained (or extended) to understand both text and images, and sometimes other data types too. Think of them as being fluent in more than one language, except in this case, one of those “languages” is visual. 🔎 How to Find Image-Supporting Models If you’re trying to figure out which models support images, the AI Toolkit is a great place to start! The extension includes a built-in Model Catalog where you can filter models by Feature—like Image Attachment—so you can skip the guesswork. Here’s how to do it: Open the Model Catalog from the AI Toolkit panel in Visual Studio Code. Click the Feature filter near the search bar. Select Image Attachment. Browse the filtered results to see which models can accept visual input. Once you've got your filtered list, you can check out the model details or try one in the Playground to test how it handles image-based prompts. 🧪 Test Before You Build Before you plug a model into your agent and start wiring things together, it’s a good idea to test how the model handles image input on its own. This gives you a quick feel for the model’s behavior and helps you catch any limitations before you're deep into building. You can do this in the Playground, where you can upload an image and pair it with a simple prompt like: “Describe the contents of this image.” OR “Summarize what’s happening in this screenshot.” If the model supports image input, you’ll be able to attach a file and get a response based on its visual analysis. If you don’t see the option to upload an image, double-check that the model you’ve selected has image capabilities—this is usually a model issue, not a UI bug. 🔁 Recap Here’s a quick rundown of what we covered: Not all models support image input—you’ll need a multimodal model specifically built to handle visual data. Most platforms won’t let you upload an image unless the model supports it, so if you don’t see that option, it’s probably a model limitation. You can use the AI Toolkit’s Model Catalog to filter models by capability—just check the box for Image Attachment. Test the model in the Playground before integrating it into your agent to make sure it behaves the way you expect. 📺 Want to Go Deeper? Check out my latest video on how to choose the right model for your agent—it’s part of the Build an Agent Series, where I walk through the building blocks of turning an idea into a working AI agent. And if you’re looking to sharpen your model instincts, don’t miss Model Mondays—a weekly series that helps developers like you build your Model IQ, one spotlight at a time. Whether you’re just starting out or already building AI-powered apps, it’s a great way to stay current and confident in your model choices. 👉 Explore the series and catch the next episode: aka.ms/model-mondays/rsvp If you're just getting started with building agents, check out our Agents for Beginners curriculum. And for all your general AI and AI agent questions, join us in the Azure AI Foundry Discord! You can find me hanging out there answering your questions about the AI Toolkit. I'm looking forward to chatting with you there! Whatever you're building, the right model is out there—and with the right tools, you'll know exactly how to find it.On‑Device AI with Windows AI Foundry
From “waiting” to “instant”- without sending data away AI is everywhere, but speed, privacy, and reliability are critical. Users expect instant answers without compromise. On-device AI makes that possible: fast, private and available, even when the network isn’t - empowering apps to deliver seamless experiences. Imagine an intelligent assistant that works in seconds, without sending a text to the cloud. This approach brings speed and data control to the places that need it most; while still letting you tap into cloud power when it makes sense. Windows AI Foundry: A Local Home for Models Windows AI Foundry is a developer toolkit that makes it simple to run AI models directly on Windows devices. It uses ONNX Runtime under the hood and can leverage CPU, GPU (via DirectML), or NPU acceleration, without requiring you to manage those details. The principle is straightforward: Keep the model and the data on the same device. Inference becomes faster, and data stays local by default unless you explicitly choose to use the cloud. Foundry Local Foundry Local is the engine that powers this experience. Think of it as local AI runtime - fast, private, and easy to integrate into an app. Why Adopt On‑Device AI? Faster, more responsive apps: Local inference often reduces perceived latency and improves user experience. Privacy‑first by design: Keep sensitive data on the device; avoid cloud round trips unless the user opts in. Offline capability: An app can provide AI features even without a network connection. Cost control: Reduce cloud compute and data costs for common, high‑volume tasks. This approach is especially useful in regulated industries, field‑work tools, and any app where users expect quick, on‑device responses. Hybrid Pattern for Real Apps On-device AI doesn’t replace the cloud, it complements it. Here’s how: Standalone On‑Device: Quick, private actions like document summarization, local search, and offline assistants. Cloud‑Enhanced (Optional): Large-context models, up-to-date knowledge, or heavy multimodal workloads. Design an app to keep data local by default and surface cloud options transparently with user consent and clear disclosures. Windows AI Foundry supports hybrid workflows: Use Foundry Local for real-time inference. Sync with Azure AI services for model updates, telemetry, and advanced analytics. Implement fallback strategies for resource-intensive scenarios. Application Workflow Code Example 1. Only On-Device: Tries Foundry Local first, falls back to ONNX if foundry_runtime.check_foundry_available(): # Use on-device Foundry Local models try: answer = foundry_runtime.run_inference(question, context) return answer, source="Foundry Local (On-Device)" except Exception as e: logger.warning(f"Foundry failed: {e}, trying ONNX...") if onnx_model.is_loaded(): # Fallback to local BERT ONNX model try: answer = bert_model.get_answer(question, context) return answer, source="BERT ONNX (On-Device)" except Exception as e: logger.warning(f"ONNX failed: {e}") return "Error: No local AI available" 2. Hybrid approach: On-device first, cloud as last resort def get_answer(question, context): """ Priority order: 1. Foundry Local (best: advanced + private) 2. ONNX Runtime (good: fast + private) 3. Cloud API (fallback: requires internet, less private) # in case of Hybrid approach, based on real-time scenario """ if foundry_runtime.check_foundry_available(): # Use on-device Foundry Local models try: answer = foundry_runtime.run_inference(question, context) return answer, source="Foundry Local (On-Device)" except Exception as e: logger.warning(f"Foundry failed: {e}, trying ONNX...") if onnx_model.is_loaded(): # Fallback to local BERT ONNX model try: answer = bert_model.get_answer(question, context) return answer, source="BERT ONNX (On-Device)" except Exception as e: logger.warning(f"ONNX failed: {e}, trying cloud...") # Last resort: Cloud API (requires internet) if network_available(): try: import requests response = requests.post( '{BASE_URL_AI_CHAT_COMPLETION}', headers={'Authorization': f'Bearer {API_KEY}'}, json={ 'model': '{MODEL_NAME}', 'messages': [{ 'role': 'user', 'content': f'Context: {context}\n\nQuestion: {question}' }] }, timeout=10 ) answer = response.json()['choices'][0]['message']['content'] return answer, source="Cloud API (Online)" except Exception as e: return "Error: No AI runtime available", source="Failed" else: return "Error: No internet and no local AI available", source="Offline" Demo Project Output: Foundry Local answering context-based questions offline : The Foundry Local engine ran the Phi-4-mini model offline and retrieved context-based data. : The Foundry Local engine ran the Phi-4-mini model offline and mentioned that there is no answer. Practical Use Cases Privacy-First Reading Assistant: Summarize documents locally without sending text to the cloud. Healthcare Apps: Analyze medical data on-device for compliance. Financial Tools: Risk scoring without exposing sensitive financial data. IoT & Edge Devices: Real-time anomaly detection without network dependency. Conclusion On-device AI isn’t just a trend - it’s a shift toward smarter, faster, and more secure applications. With Windows AI Foundry and Foundry Local, developers can deliver experiences that respect user specific data, reduce latency, and work even when connectivity fails. By combining local inference with optional cloud enhancements, you get the best of both worlds: instant performance and scalable intelligence. Whether you’re creating document summarizers, offline assistants, or compliance-ready solutions, this approach ensures your apps stay responsive, reliable, and user-centric. References Get started with Foundry Local - Foundry Local | Microsoft Learn What is Windows AI Foundry? | Microsoft Learn https://devblogs.microsoft.com/foundry/unlock-instant-on-device-ai-with-foundry-local/GPT-5: Will it RAG?
OpenAI released the GPT-5 model family last week, with an emphasis on accurate tool calling and reduced hallucinations. For those of us working on RAG (Retrieval-Augmented Generation), it's particularly exciting to see a model specifically trained to reduce hallucination. There are five variants in the family: gpt-5 gpt-5-mini gpt-5-nano gpt-5-chat: Not a reasoning model, optimized for chat applications gpt-5-pro: Only available in ChatGPT, not via the API As soon as GPT-5 models were available in Azure AI Foundry, I deployed them and evaluated them inside our most popular open source RAG template. I was immediately impressed - not by the model's ability to answer a question, but by it's ability to admit it could not answer a question! You see, we have one test question for our sample data (HR documents for a fictional company's) that sounds like it should be an easy question: "What does a Product Manager do?" But, if you actually look at the company documents, there's no job description for "Product Manager", only related jobs like "Senior Manager of Product Management". Every other model, including the reasoning models, has still pretended that it could answer that question. For example, here's a response from o4-mini: However, the gpt-5 model realizes that it doesn't have the information necessary, and responds that it cannot answer the question: As I always say: I would much rather have an LLM admit that it doesn't have enough information instead of making up an answer. Bulk evaluation But that's just a single question! What we really need to know is whether the GPT-5 models will generally do a better job across the board, on a wide range of questions. So I ran bulk evaluations using the azure-ai-evaluations SDK, checking my favorite metrics: groundedness (LLM-judged), relevance (LLM-judged), and citations_matched (regex based off ground truth citations). I didn't bother evaluating gpt-5-nano, as I did some quick manual tests and wasn't impressed enough - plus, we've never used a nano sized model for our RAG scenarios. Here are the results for 50 Q/A pairs: metric stat gpt-4.1-mini gpt-4o-mini gpt-5-chat gpt-5 gpt-5-mini o3-mini groundedness pass % 94% 86% 96% 100% 🏆 94% 96% ↑ mean score 4.76 4.50 4.86 5.00 🏆 4.82 4.80 relevance pass % 94% 🏆 84% 90% 90% 74% 90% ↑ mean score 4.42 🏆 4.22 4.06 4.20 4.02 4.00 answer_length mean 829 919 549 844 940 499 latency mean 2.9 4.5 2.9 9.6 7.5 19.4 citations_matched % 52% 49% 52% 47% 49% 51% For the LLM-judged metrics of groundedness and relevance , the LLM awards a score of 1-5, and both 4 and 5 are considered passing scores. That's why you see both a "pass %" (percentage with 4 or 5 score) and an average score in the table above. For the groundedness metric, which measures whether an answer is grounded in the retrieved search results, the gpt-5 model does the best (100%), while the other gpt-5 models do quite well as well, on par with our current default model of gpt-4.1-mini. For the relevance metric, which measures whether an answer fully answers a question, the gpt-5 models don't score as highly as gpt-4.1-mini. I looked into the discrepancies there, and I think that's actually due to gpt-5 being less willing to give an answer when it's not fully confident in it - it would rather give a partial answer instead. That's a good thing for RAG apps, so I am comfortable with that metric being less than 100%. The latency metric is generally higher for the gpt-5 reasoning models, as would be expected, but is also variable based on on deployment region, region capacity, etc, assuming you're not using a "provisioned thoroughput" deployment. Also note that the latency here records the total time taken, from first token to last token, whereas the most important metric for a user-facing streaming chat is the time to first token. For the gpt-5 reasoning models in this evaluation, I set the reasoning_effort to "minimal", which means the model chooses whether to use reasoning tokens. I have never seen it actually use any reasoning tokens when I set the effort to minimal, so maybe that means that a higher reasoning effort is really only needed for longer or more complex tasks, and RAG answering is a non-reasoning task. A higher reasoning effort would definitely affect the latency and likely also affect the answer quality. I did not test that out, since the "minimal" effort setting already results in high quality answers. Answer differences Now that we've seen the overall metrics, let's dig into some actual answers, and see some of the key ways that GPT-5 model answers differ. Quicker to say "I don't know" 🤷🏻 Let's start with the first thing I noticed: gpt-5 was happy to admit when it couldn't fully answer a question. In the 50 questions evaluated, the gpt-5 model started off their answer with "I don't know" for 3 of the answers, while gpt-5-mini started that way for 10 of the answers. In table form: gpt-4.1-mini gpt-5 gpt-5-mini gpt-5-chat 1 (1.6%) 3 (6%) 10 (20%) 1 (1.6%) Here's an example of a gpt-4.1-mini answer versus a gpt-5 answer: Q: Explain the eligibility criteria for the Northwind Standard plan offered to Contoso employees? gpt-4.1-mini gpt-5 The eligibility criteria for the Northwind Standard plan for Contoso employees are not explicitly detailed in the provided sources. However, it is mentioned that prior-authorization may be required for certain services or treatments, which involves approval from Northwind Health based on medical necessity and coverage under the plan. Employees must have their physician submit a request for prior-authorization, and in some cases, meet specific criteria such as medical tests, treatments, or lifestyle changes to be approved. Preventive care, emergency services, mental health, and substance abuse coverage do not require prior-authorization. Out-of-network services do require prior-authorization and may incur additional costs [Northwind_Standard_Benefits_Details.pdf#page=63][Northwind_Standard_Benefits_Details.pdf#page=102]. If you need information on basic eligibility (such as employment status or tenure), that is not provided in these sources. I don’t know. The provided sources describe compliance (ERISA, COBRA, HIPAA, ACA) and note that Northwind Standard is a group plan covering “all eligible employees of Contoso and their spouses or dependents,” but they do not define the specific eligibility criteria (e.g., hours worked, waiting period, employment class). [Northwind_Standard_Benefits_Details.pdf#page=88][Northwind_Standard_Benefits_Details.pdf#page=85][Northwind_Standard_Benefits_Details.pdf#page=86] Both of the answers admit some degree of not-knowing, but gpt-5 comes right out with it at the start of the answer. Notably, our system prompt includes this instruction: Answer ONLY with the facts listed in the list of sources below. If there isn't enough information below, say you don't know. So what we may be seeing is a combination of two factors: 1) the gpt-5 reasoning models are generally better at following instructions, 2) the gpt-5 reasoning models went through a specific alignment process to reduce hallucination. Notably, however, the gpt-5-chat model rarely responded with "I don't know", so it is not as affected by whatever factors are increasing the likelihood of the reasoning models to answer that way. You may be wondering why I have questions in the evaluation set that aren't fully answerable - it's hard to generate a good set of ground truth for evaluation, and my ground truth data is not the optimal set of questions. Ideally, I would do a separate evaluation for all the fully knowable questions versus all the unknowable questions - and I even have a specific custom metric for "don't-know-ness" that I run on those - but a few partially knowable questions snuck into this ground truth data set. Requests for clarification 🙋🏻♀️ Our system prompt also encourages the model to ask clarifying questions when necessary: If asking a clarifying question to the user would help, ask the question. But I had never actually seen a clarifying question from a model until gpt-5-chat, which replied with a question to 2 of the 50 questions. Example: Q: What should I do if I need healthcare services outside Washington with Northwind Health? gpt-4.1-mini gpt-5-chat If you have Northwind Standard and need healthcare services outside Washington, you must meet these criteria for coverage: be outside Washington for no more than 12 consecutive months, be more than 100 miles away from your primary care provider, and have had your plan in effect for at least 6 months before receiving out-of-area care. You must contact Northwind Health in advance and obtain prior authorization for the care; otherwise, you may be responsible for the full cost. Emergency services outside Washington are covered, but non-emergency services are not unless these conditions are met [Northwind_Standard_Benefits_Details.pdf#page=59][Northwind_Standard_Benefits_Details.pdf#page=60]. If you have Northwind Health Plus, first ensure the provider is in the nationwide network by checking the online directory or calling customer service. Keep receipts and documentation for any out-of-state services, as you may need to submit claims for reimbursement. Contact Northwind Health Plus before receiving services to confirm coverage, as not all services outside Washington are covered [Northwind_Health_Plus_Benefits_Details.pdf#page=62]. Which plan do you have — Northwind Standard or Northwind Health Plus? Like in the example above, other models typically avoid asking a clarifying question by instead addressing every possible scenario (covering all their bases!), resulting in a fairly long answer. Arguably, it's a better user experience to ask the clarifying question, so that the user can get an answer tailored to their specific situation and not get confused by other possibilities. The primary drawback is the additional back-and-forth for the final answer. I did not personally see clarifying questions from the GPT-5 reasoning models, but my colleagues noticed an increase in such questions when using gpt-5 with GitHub Copilot Agent mode, so it's possible that every model in the family has an increased likelihood to ask for clarification when faced with ambiguity. Reasoning models ❤️ lists I saw a definite tendency of the GPT-5 reasoning models to output the answer as a list, even though there's nothing in the RAG system prompt that tells it to use lists. Out of the 50 questions evaluated, here are the number with answers formatted as a top-level bulleted list: gpt-4.1-mini gpt-5 gpt-5-mini gpt-5-chat 0 36 (72%) 26 (52%) 0 Here's an example of a gpt-4.1-mini answer (paragraph) versus a gpt-5 answer (list): Q: What are the coverage limitations of the Northwind Standard plan regarding clinical trials? gpt-4.1-mini gpt-5 The Northwind Standard plan covers certain clinical trial services such as diagnostic testing, treatment of the condition being studied, medications, lab services, and imaging services. However, it does not cover travel expenses associated with attending clinical trials. Additionally, any experimental treatments or services not part of the clinical trial are not covered. Coverage for other clinical trial services not explicitly listed may be considered on a case-by-case basis. Members should contact Northwind Health customer service for more details [Northwind_Standard_Benefits_Details.pdf#page=23][Northwind_Standard_Benefits_Details.pdf#page=24]. Travel expenses for attending clinical trials are not covered [Northwind_Standard_Benefits_Details.pdf#page=23][Northwind_Standard_Benefits_Details.pdf#page=24]. Experimental treatments or services that are not part of the clinical trial are not covered [Northwind_Standard_Benefits_Details.pdf#page=24]. Services or treatments deemed experimental or investigational are not covered by the plan [Northwind_Standard_Benefits_Details.pdf#page=53]. Now, is it a bad thing that the gpt-5 reasoning models use lists? Not necessarily! But if that's not the style you're looking for, then you either want to consider the gpt-5-chat model or add specific messaging in the system prompt to veer the model away from top level lists. Longer answers As we saw in overall metrics above, there was an impact of the answer length (measured in the number of characters, not tokens). Let's isolate those stats: gpt-4.1-mini gpt-5 gpt-5-mini gpt-5-chat 829 844 990 549 The gpt-5 reasoning models are generating answers of similar length to the current baseline of gpt-4.1-mini, though the gpt-5-mini model seems to be a bit more verbose. The API now has a new parameter to control verbosity for those models, which defaults to "medium". I did not try an evaluation with that set to "low" or "high", which would be an interesting evaluation to run. The gpt-5-chat model outputs relatively short answers, which are actually closer in length to the answer length that I used to see from gpt-3.5-turbo. What answer length is best? A longer answer will take longer to finish rendering to the user (even when streaming), and will cost the developer more tokens. However, sometimes answers are longer due to better formatting that is easier to skim, so longer does not always mean less readable. For the user-facing RAG chat scenario, I generally think that shorter answers are better. If I was putting these gpt-5 reasoning models in production, I'd probably try out the "low" verbosity value, or put instructions in the system prompt, so that users get their answers more quickly. They can always ask follow-up questions as needed. Fancy punctuation This is a weird difference that I discovered while researching the other differences: the GPT-5 models are more likely to use “smart” quotes instead of standard ASCII quotes. Specifically: Left single: ‘ (U+2018) Right single / apostrophe: ’ (U+2019) Left double: “ (U+201C) Right double: ” (U+201D) For example, the gpt-5 model actually responded with " I don’t know ", not with " I don't know ". It's a subtle difference, but if you are doing any sort of post-processing or analysis, it's good to know. I've also seen the models using the smart quotes incorrectly in coding contexts (like GitHub Copilot Agent mode), so that's another potential issue to look out for. I assumed that the models were trained on data that tended to use smart quotes more often, perhaps synthetic data or book text. I know that as a normal human, I rarely use them, given the extra effort required to type them. Query rewriting to the extreme Our RAG flow makes two LLM calls: the second answers the question, as you’d expect, but the first rewrites the user’s query into a strong search query. This step can fix spelling mistakes, but it’s even more important for filling in missing context in multi-turn conversations—like when the user simply asks, “what else?” A well-crafted rewritten query leads to better search results, and ultimately, a more complete and accurate answer. During my manual tests, I noticed that the rewritten queries from the GPT-5 models are much longer, filled to the brim with synonyms. For example: Q: What does a Product Manager do? gpt-4.1-mini gpt-5-mini product manager responsibilities Product Manager role responsibilities duties skills day-to-day tasks product management overview Are these new rewritten queries better or worse than the previous short ones? It's hard to tell, since they're just one factor in the overall answer output, and I haven't set up a retrieval-specific metric. The closest metric is citations_matched , since the new answer from the app can only match the citations in the ground truth if the app managed to retrieve all the same citations. That metric was generally high for these models, and when I looked into the cases where the citations didn't match, I typically thought the gpt-5 family of responses were still good answers. I suspect that the rewritten query does not have a huge effect either way, since our retrieval step uses hybrid search from Azure AI Search, and the combined power of both hybrid and vector search generally compensates for differences in search query wording. It's worth evaluating this further however, and considering using a different model for the query rewriting step. Developers often choose to use a smaller, faster model for that stage, since query rewriting is an easier task than answering a question. So, are the answers accurate? Even with a 100% groundedness score from an LLM judge, it's possible that a RAG app can be producing inaccurate answers, like if the LLM judge is biased or the retrieved context is incomplete. The only way to really know if RAG answers are accurate is to send them to a human expert. For the sample data in this blog, there is no human expert available, since they're based off synthetically generated documents. Despite two years of staring at those documents and running dozens of evaluations, I still am not an expert in the HR benefits of the fictional Contoso company. That's why I also ran the same evaluations on the same RAG codebase, but with data that I know intimately: my own personal blog. I looked through 200 answers from gpt-5, and did not notice any inaccuracies in the answers. Yes, there are times when it says "I don't know" or asks a clarifying question, but I consider those to be accurate answers, since they do not spread misinformation. I imagine that I could find some way to trick up the gpt-5 model, but on the whole, it looks like a model with a high likelihood of generating accurate answers when given relevant context. Evaluate for yourself! I share my evaluations on our sample RAG app as a way to share general learnings on model differences, but I encourage every developer to evaluate these models for your specific domain, alongside domain experts that can reason about the correctness of the answers. How can you evaluate? If you are using the same open source RAG project for Azure, deploy the GPT-5 models and follow the steps in the evaluation guide. If you have your own solution, you can use an open-source SDK for evaluating, like azure-ai-evaluation (the one that I use), DeepEval, promptfoo, etc. If you are using an observability platform like Langfuse, Arize, or Langsmith, they have evaluation strategies baked in. Or if you're using an agents framework like Pydantic AI, those also often have built-in eval mechanisms. If you can share what you learn from evaluations, please do! We are all learning about the strange new world of LLMs together.1.7KViews2likes0CommentsImpariamo a conoscere MCP: Introduzione al Model Context Protocol (MCP)
Non perderti il prossimo evento “Let’s Learn – MCP” su Microsoft Reactor il 24 di Luglio, pensato per chiunque voglia conoscere meglio il nuovo standard per agenti intelligenti (il Model Context Protocol) e imparare a metterlo in pratica. La sessione è in Italiano e le demo sono in Python, ma fa parte di una serie di live-streaming disponibili in tantissime lingue.AI Toolkit Extension Pack for Visual Studio Code: Ignite 2025 Update
Unlock the Latest Agentic App Capabilities The Ignite 2025 update delivers a major leap forward for the AI Toolkit extension pack in VS Code, introducing a unified, end-to-end environment for building, visualizing, and deploying agentic applications to Microsoft Foundry, and the addition of Anthropic’s frontier Claude models in the Model Catalog! This release enables developers to build and debug locally in VS Code, then deploy to the cloud with a single click. Seamlessly switch between VS Code and the Foundry portal for visualization, orchestration, and evaluation, creating a smooth roundtrip workflow that accelerates innovation and delivers a truly unified AI development experience. Download the http://aka.ms/aitoolkit today and start building next-generation agentic apps in VS Code! What Can You Do with the AI Toolkit Extension Pack? Access Anthropic models in the Model Catalog Following the Microsoft, NVIDIA and Anthropic strategic partnerships announcement today, we are excited to share that Anthropic’s frontier Claude models including Claude Sonnet 4.5, Claude Opus 4.1, and Claude Haiku 4.5, are now integrated into the AI Toolkit, providing even more choices and flexibility when building intelligent applications and AI agents. Build AI Agents Using GitHub Copilot Scaffold agent applications using best-practice patterns, tool-calling examples, tracing hooks, and test scaffolds, all powered by Copilot and aligned with the Microsoft Agent Framework. Generate agent code in Python or .NET, giving you flexibility to target your preferred runtime. Build and Customize YAML Workflows Design YAML-based workflows in the Foundry portal, then continue editing and testing directly in VS Code. To customize your YAML-based workflows, instantly convert it to Agent Framework code using GitHub Copilot. Upgrade from declarative design to code-first customization without starting from scratch. Visualize Multi-Agent Workflows Envision your code-based agent workflows with an interactive graph visualizer that reveals each component and how they connect Watch in real-time how each node lights up as you run your agent. Use the visualizer to understand and debug complex agent graphs, making iteration fast and intuitive. Experiment, Debug, and Evaluate Locally Use the Hosted Agents Playground to quickly interact with your agents on your development machine. Leverage local tracing support to debug reasoning steps, tool calls, and latency hotspots—so you can quickly diagnose and fix issues. Define metrics, tasks, and datasets for agent evaluation, then implement metrics using the Foundry Evaluation SDK and orchestrate evaluations runs with the help of Copilot. Seamless Integration Across Environments Jump from Foundry Portal to VS Code Web for a development environment in your preferred code editor setting. Open YAML workflows, playgrounds, and agent templates directly in VS Code for editing and deployment. How to Get Started Install the AI Toolkit extension pack from the VS Code marketplace. Check out documentation. Get started with building workflows with Microsoft Foundry in VS Code 1. Work with Hosted (Pro-code) Agent workflows in VS Code 2. Work with Declarative (Low-code) Agent workflows in VS Code Feedback & Support Try out the extensions and let us know what you think! File issues or feedback on our GitHub repo for Foundry extension and AI Toolkit extension. Your input helps us make continuous improvements.From Cloud to Chip: Building Smarter AI at the Edge with Windows AI PCs
As AI engineers, we’ve spent years optimizing models for the cloud, scaling inference, wrangling latency, and chasing compute across clusters. But the frontier is shifting. With the rise of Windows AI PCs and powerful local accelerators, the edge is no longer a constraint it’s now a canvas. Whether you're deploying vision models to industrial cameras, optimizing speech interfaces for offline assistants, or building privacy-preserving apps for healthcare, Edge AI is where real-world intelligence meets real-time performance. Why Edge AI, Why Now? Edge AI isn’t just about running models locally, it’s about rethinking the entire lifecycle: - Latency: Decisions in milliseconds, not round-trips to the cloud. - Privacy: Sensitive data stays on-device, enabling HIPAA/GDPR compliance. - Resilience: Offline-first apps that don’t break when the network does. - Cost: Reduced cloud compute and bandwidth overhead. With Windows AI PCs powered by Intel and Qualcomm NPUs and tools like ONNX Runtime, DirectML, and Olive, developers can now optimize and deploy models with unprecedented efficiency. What You’ll Learn in Edge AI for Beginners The Edge AI for Beginners curriculum is a hands-on, open-source guide designed for engineers ready to move from theory to deployment. Multi-Language Support This content is available in over 48 languages, so you can read and study in your native language. What You'll Master This course takes you from fundamental concepts to production-ready implementations, covering: Small Language Models (SLMs) optimized for edge deployment Hardware-aware optimization across diverse platforms Real-time inference with privacy-preserving capabilities Production deployment strategies for enterprise applications Why EdgeAI Matters Edge AI represents a paradigm shift that addresses critical modern challenges: Privacy & Security: Process sensitive data locally without cloud exposure Real-time Performance: Eliminate network latency for time-critical applications Cost Efficiency: Reduce bandwidth and cloud computing expenses Resilient Operations: Maintain functionality during network outages Regulatory Compliance: Meet data sovereignty requirements Edge AI Edge AI refers to running AI algorithms and language models locally on hardware, close to where data is generated without relying on cloud resources for inference. It reduces latency, enhances privacy, and enables real-time decision-making. Core Principles: On-device inference: AI models run on edge devices (phones, routers, microcontrollers, industrial PCs) Offline capability: Functions without persistent internet connectivity Low latency: Immediate responses suited for real-time systems Data sovereignty: Keeps sensitive data local, improving security and compliance Small Language Models (SLMs) SLMs like Phi-4, Mistral-7B, Qwen and Gemma are optimized versions of larger LLMs, trained or distilled for: Reduced memory footprint: Efficient use of limited edge device memory Lower compute demand: Optimized for CPU and edge GPU performance Faster startup times: Quick initialization for responsive applications They unlock powerful NLP capabilities while meeting the constraints of: Embedded systems: IoT devices and industrial controllers Mobile devices: Smartphones and tablets with offline capabilities IoT Devices: Sensors and smart devices with limited resources Edge servers: Local processing units with limited GPU resources Personal Computers: Desktop and laptop deployment scenarios Course Modules & Navigation Course duration. 10 hours of content Module Topic Focus Area Key Content Level Duration 📖 00 Introduction to EdgeAI Foundation & Context EdgeAI Overview • Industry Applications • SLM Introduction • Learning Objectives Beginner 1-2 hrs 📚 01 EdgeAI Fundamentals Cloud vs Edge AI comparison EdgeAI Fundamentals • Real World Case Studies • Implementation Guide • Edge Deployment Beginner 3-4 hrs 🧠 02 SLM Model Foundations Model families & architecture Phi Family • Qwen Family • Gemma Family • BitNET • μModel • Phi-Silica Beginner 4-5 hrs 🚀 03 SLM Deployment Practice Local & cloud deployment Advanced Learning • Local Environment • Cloud Deployment Intermediate 4-5 hrs ⚙️ 04 Model Optimization Toolkit Cross-platform optimization Introduction • Llama.cpp • Microsoft Olive • OpenVINO • Apple MLX • Workflow Synthesis Intermediate 5-6 hrs 🔧 05 SLMOps Production Production operations SLMOps Introduction • Model Distillation • Fine-tuning • Production Deployment Advanced 5-6 hrs 🤖 06 AI Agents & Function Calling Agent frameworks & MCP Agent Introduction • Function Calling • Model Context Protocol Advanced 4-5 hrs 💻 07 Platform Implementation Cross-platform samples AI Toolkit • Foundry Local • Windows Development Advanced 3-4 hrs 🏭 08 Foundry Local Toolkit Production-ready samples Sample applications (see details below) Expert 8-10 hrs Each module includes Jupyter notebooks, code samples, and deployment walkthroughs, perfect for engineers who learn by doing. Developer Highlights - 🔧 Olive: Microsoft's optimization toolchain for quantization, pruning, and acceleration. - 🧩 ONNX Runtime: Cross-platform inference engine with support for CPU, GPU, and NPU. - 🎮 DirectML: GPU-accelerated ML API for Windows, ideal for gaming and real-time apps. - 🖥️ Windows AI PCs: Devices with built-in NPUs for low-power, high-performance inference. Local AI: Beyond the Edge Local AI isn’t just about inference, it’s about autonomy. Imagine agents that: - Learn from local context - Adapt to user behavior - Respect privacy by design With tools like Agent Framework, Azure AI Foundry and Windows Copilot Studio, and Foundry Local developers can orchestrate local agents that blend LLMs, sensors, and user preferences, all without cloud dependency. Try It Yourself Ready to get started? Clone the Edge AI for Beginners GitHub repo, run the notebooks, and deploy your first model to a Windows AI PC or IoT devices Whether you're building smart kiosks, offline assistants, or industrial monitors, this curriculum gives you the scaffolding to go from prototype to production.How do I give my agent access to tools?
Welcome back to Agent Support—a developer advice column for those head-scratching moments when you’re building an AI agent! Each post answers a real question from the community with simple, practical guidance to help you build smarter agents. Today’s question comes from someone trying to move beyond chat-only agents into more capable, action-driven ones: 💬 Dear Agent Support I want my agent to do more than just respond with text. Ideally, it could look up information, call APIs, or even run code—but I’m not sure where to start. How do I give my agent access to tools? This is exactly where agents start to get interesting! Giving your agent tools is one of the most powerful ways to expand what it can do. But before we get into the “how,” let’s talk about what tools actually mean in this context—and how Model Context Protocol (MCP) helps you use them in a standardized, agent-friendly way. 🛠️ What Do We Mean by “Tools”? In agent terms, a tool is any external function or capability your agent can use to complete a task. That might be: A web search function A weather lookup API A calculator A database query A custom Python script When you give an agent tools, you’re giving it a way to take action—not just generate text. Think of tools as buttons the agent can press to interact with the outside world. ⚡ Why Give Your Agent Tools? Without tools, an agent is limited to what it “knows” from its training data and prompt. It can guess, summarize, and predict, but it can’t do. Tools change that! With the right tools, your agent can: Pull live data from external systems Perform dynamic calculations Trigger workflows in real time Make decisions based on changing conditions It’s the difference between an assistant that can answer trivia questions vs. one that can book your travel or manage your calendar. 🧩 So… How Does This Work? Enter Model Context Protocol (MCP). MCP is a simple, open protocol that helps agents use tools in a consistent way—whether those tools live in your app, your cloud project, or on a server you built yourself. Here’s what MCP does: Describes tools in a standardized format that models can understand Wraps the function call + input + output into a predictable schema Lets agents request tools as needed (with reasoning behind their choices) This makes it much easier to plug tools into your agent workflow without reinventing the wheel every time! 🔌 How to Connect an Agent to Tools Wiring tools into your agent might sound complex, but it doesn’t have to be! If you’ve already got a MCP server in mind, there’s a straightforward way within the AI Toolkit to expose it as a tool your agent can use. Here’s how to do it: Open the Agent Builder from the AI Toolkit panel in Visual Studio Code. Click the + New Agent button and provide a name for your agent. Select a Model for your agent. Within the Tools section, click + MCP Server. In the wizard that appears, click + Add Server. From there, you can select one of the MCP servers built my Microsoft, connect to an existing server that’s running, or even create your own using a template! After giving the server a Server ID, you’ll be given the option to select which tools from the server to add for your agent. Once connected, your agent can call tools dynamically based on the task at hand. 🧪 Test Before You Build Once you’ve connected your agent to an MCP server and added tools, don’t jump straight into full integration. It’s worth taking time to test whether the agent is calling the right tool for the job. You can do this directly in the Agent Builder: enter a test prompt that should trigger a tool in the User Prompt field, click Run, and observe how the model responds. This gives you a quick read on tool call accuracy. If the agent selects the wrong tool, it’s a sign that your system prompt might need tweaking before you move forward. However, if the agent calls the correct tool but the output still doesn’t look right, take a step back and check both sides of the interaction. It might be that the system prompt isn’t clearly guiding the agent on how to use or interpret the tool’s response. But it could also be an issue with the tool itself—whether that’s a bug in the logic, unexpected behavior, or a mismatch between input and expected output. Testing both the tool and the prompt in isolation can help you pinpoint where things are going wrong before you move on to full integration. 🔁 Recap Here’s a quick rundown of what we covered: Tools = external functions your agent can use to take action MCP = a protocol that helps your agent discover and use those tools reliably If the agent calls the wrong tool—or uses the right tool incorrectly—check your system prompt and test the tool logic separately to isolate the issue. 📺 Want to Go Deeper? Check out my latest video on how to connect your agent to a MCP server—it’s part of the Build an Agent Series, where I walk through the building blocks of turning an idea into a working AI agent. The MCP for Beginners curriculum covers all the essentials—MCP architecture, creating and debugging servers, and best practices for developing, testing, and deploying MCP servers and features in production environments. It also includes several hands-on exercises across .NET, Java, TypeScript, JavaScript and Python. 👉 Explore the full curriculum: aka.ms/AITKmcp And for all your general AI and AI agent questions, join us in the Azure AI Foundry Discord! You can find me hanging out there answering your questions about the AI Toolkit. I'm looking forward to chatting with you there! Whether you’re building a productivity agent, a data assistant, or a game bot—tools are how you turn your agent from smart to useful.