Build AI Agents with MCP Tool Use in Minutes with AI Toolkit for VSCode
We’re excited to announce Agent Builder, the newest evolution of what was formerly known as Prompt Builder, now reimagined and supercharged for intelligent app development. This powerful tool in AI Toolkit enables you to create, iterate, and optimize agents, from prompt engineering to tool integration, all in one seamless workflow. Whether you're designing simple chat interactions or complex task-performing agents with tool access, Agent Builder simplifies the journey from idea to integration.

Why Agent Builder?

Agent Builder is designed to empower developers and prompt engineers to:

🚀 Generate starter prompts with natural language
🔁 Iterate and refine prompts based on model responses
🧩 Break down tasks with prompt chaining and structured outputs
🧪 Test integrations with real-time runs and tool use such as MCP servers
💻 Generate production-ready code for rapid app development

And more features are coming soon; stay tuned for:

📝 Use variables in prompts
🧪 Run your agent against test cases with ease
📊 Evaluate the accuracy and performance of your agent with built-in or custom metrics
☁️ Deploy your agent to the cloud

Build Smart Agents with Tool Use (MCP Servers)

Agents can now connect to external tools through MCP (Model Context Protocol) servers, enabling them to perform real-world actions like querying a database, accessing APIs, or executing custom logic.

Connect to an Existing MCP Server

To use an existing MCP server in Agent Builder:

1. In the Tools section, select + MCP Server.
2. Choose a connection type: Command (stdio), which runs a local command that implements the MCP protocol, or HTTP (server-sent events), which connects to a remote server implementing the MCP protocol.
3. If the MCP server supports multiple tools, select the specific tool you want to use.
4. Enter your prompts and click Run to test the agent's interaction with the tool.

This integration allows your agents to fetch live data or trigger custom backend services as part of the conversation flow.
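The stdio connection type above expects a local process that speaks the protocol over standard input and output. As a rough mental model only, here is a minimal sketch of such a process: the real protocol is JSON-RPC 2.0 with an initialize handshake and tool discovery, which the official MCP SDKs (and Agent Builder's scaffold) implement for you, and the get_weather tool with its stub data is hypothetical.

```python
# Minimal sketch of an MCP-style stdio tool server. Illustrative only: the
# real protocol includes an initialization handshake and tool discovery,
# handled by the official MCP SDKs. The get_weather tool is hypothetical.
import json
import sys

def get_weather(location: str) -> str:
    """Hypothetical tool: return a canned forecast for a location."""
    forecasts = {"Shanghai": "22°C, partly cloudy"}
    return forecasts.get(location, "No forecast available")

def handle_request(request: dict) -> dict:
    """Dispatch one JSON-RPC-shaped tool call to the matching tool."""
    if request.get("method") == "tools/call":
        params = request.get("params", {})
        if params.get("name") == "get_weather":
            location = params.get("arguments", {}).get("location", "")
            return {
                "jsonrpc": "2.0",
                "id": request.get("id"),
                "result": {"content": [{"type": "text", "text": get_weather(location)}]},
            }
    return {
        "jsonrpc": "2.0",
        "id": request.get("id"),
        "error": {"code": -32601, "message": "Method not found"},
    }

if __name__ == "__main__":
    # stdio transport: read one JSON message per line, answer on stdout.
    for line in sys.stdin:
        if line.strip():
            print(json.dumps(handle_request(json.loads(line))), flush=True)
```

When you scaffold a project through Agent Builder (described below), this plumbing is generated for you; you only fill in the tool logic.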
Build and Scaffold a New MCP Server

Want to create your own tool? Agent Builder helps you scaffold a new MCP server project:

1. In the Tools section, select + MCP Server.
2. Choose MCP server project.
3. Select your preferred programming language: Python or TypeScript.
4. Pick a folder in which to create your server project.
5. Name your project and click Create.

Agent Builder generates a scaffolded implementation of the MCP protocol that you can extend. Use the built-in VS Code debugger: press F5 or click Debug in Agent Builder, then test with prompts like:

System: You are a weather forecast professional that can tell weather information based on a given location.
User: What is the weather in Shanghai?

Agent Builder will automatically connect to your running server and show the response, making it easy to test and refine the tool-agent interaction.

AI Sparks: from Prototype to Production with AI Toolkit

Building AI-powered applications from scratch or infusing intelligence into existing systems? AI Sparks is your go-to webinar series for mastering the AI Toolkit (AITK), from foundational concepts to cutting-edge techniques. In this bi-weekly, hands-on series, we cover:

🚀 SLMs & Local Models – Test and deploy AI models and applications efficiently, on your own terms: locally, to edge devices, or to the cloud
🔍 Embedding Models & RAG – Supercharge retrieval for smarter applications using existing data
🎨 Multi-Modal AI – Work with images, text, and beyond
🤖 Agentic Frameworks – Build autonomous, decision-making AI systems

Watch on Demand

Share your feedback

Get started with the latest version, share your feedback, and let us know how these new features help you in your AI development journey. As always, we're here to listen, collaborate, and grow alongside our amazing user community. Thank you for being a part of this journey; let's build the future of AI together! Join our Microsoft Azure AI Foundry Discord channel to continue the discussion 🚀

Quest 1 – I Want to Build a Local Gen AI Prototype
In this quest, you'll build a local Gen AI app prototype using JavaScript or TypeScript. You'll explore open-source models via GitHub, test them in a visual playground, and use them in real code, all from the comfort of VS Code with the AI Toolkit. It's fast, hands-on, and sets you up to build real AI apps, starting with a sketch.

Managing Token Consumption with GitHub Copilot for Azure
Introduction

AI Engineers often face challenges that require creative solutions. One such challenge is managing token consumption when using large language models. For example, you may observe heavy token consumption from a single client app or user and determine that, with that usage pattern, the shared quota for other client applications relying on the same OpenAI backend will be depleted quickly. To prevent this, we need a solution that doesn't involve spending hours reading documentation or watching tutorials. Enter GitHub Copilot for Azure.

GitHub Copilot for Azure

Instead of diving into extensive documentation, we can leverage GitHub Copilot for Azure directly within VS Code. By invoking Copilot with @azure, we can describe our issue in natural language. For our example, we might say: "Some users of my app are consuming too many tokens, which will affect tokens left for my other services. I need to limit the number of tokens a user can consume." Refer to the video above for more context.

GitHub Copilot in Action

GitHub Copilot pulls relevant information from https://learn.microsoft.com/ and suggests Azure services that can help. We can engage in a chat conversation with follow-up questions like "What happens if a user exceeds their token limit?" and so on. Copilot's response accurately describes the specific feature we need, along with the expected behavior: requests from users who exceed the limit are blocked from reaching the backend, and those users receive a "too many requests" warning, exactly what we need. At this point, it felt like I was having a 1:1 chat with the docs 🙃

Implementation

To implement this, we ask GitHub Copilot for an example of enforcing the Azure token limit policy. It references the docs on Learn and provides a policy statement. Since we're not fully conversant with the product, we continue using Copilot to help with the implementation.
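The feature Copilot surfaces here is Azure API Management's token limit policy. A hedged sketch of what an inbound policy section applying it might look like (the counter key and the 5000-tokens-per-minute cap are illustrative values, not the ones from the video):

```xml
<policies>
  <inbound>
    <base />
    <!-- Illustrative: cap each subscription at 5000 tokens per minute -->
    <azure-openai-token-limit
        counter-key="@(context.Subscription.Id)"
        tokens-per-minute="5000"
        estimate-prompt-tokens="true" />
  </inbound>
</policies>
```

Once a caller identified by the counter key exhausts the per-minute budget, API Management rejects further requests with a 429 response instead of forwarding them to the OpenAI backend.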
Although GitHub Copilot chat cannot directly update our code, we can switch to GitHub Copilot Edits, provide some custom instructions in natural language, and watch as GitHub Copilot makes the necessary changes, which we review and accept or decline.

Testing and Deployment

After implementing the policy, we redeploy our application using the Azure Developer CLI (azd) and restart the application and API to test. We now see that if a user sends another prompt after hitting the applied token limit, their request is terminated with a warning that the allocated limit is exceeded, along with instructions on what to do next.

Conclusion

Managing token consumption effectively is just one of the many ways GitHub Copilot for Azure can assist developers. Download and install the extension today to try it out yourself. If you have any scenarios you'd like to see us cover, drop them in the comments, and we'll feature them. See you in the next blog!

JS AI Build-a-thon Setup in 5 Easy Steps
🔥 TL;DR — You're 5 Steps Away from an AI Adventure

Set up your project repo, follow the quests, build cool stuff, and level up. Everything's automated, community-backed, and designed to help you actually learn AI, using the skills you already have. Let's build the future, one quest at a time. 👉 Join the Build-a-thon | Chat on Discord

🚨 Introducing the JS AI Build-a-thon 🚨
We're entering a future where AI-first and agentic developer experiences will shape how we build, and you don't want to be left behind. This isn't your average hackathon. It's a hands-on, quest-driven learning experience designed for developers, packed with:

- Interactive quests that guide you step by step, from your first prototype to production-ready apps
- Community-powered support via our dedicated Discord and local, community-led study jams
- Showcase moments to share your journey, get inspired, and celebrate what you build

Whether you're just starting your AI journey, sharpening your skills with frameworks like LangChain.js and tools like the Azure AI Foundry and AI Toolkit extensions, or diving deeper into agentic app design, this is your moment to start building.

Improve LLM backend resiliency with load balancer and circuit breaker rules in Azure API Management
This article is part of a series of articles on Azure API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. We previously covered the hidden risks of AI APIs in today's AI-driven technological landscape. In this article, we dive deeper into one of the supported Gen AI policies in API Management, which allows your applications to change the effective Gen AI backend based on unexpected and specified events.

In Azure API Management, you can set up your different LLMs as backends, define structures to route requests to prioritized backends, and add automatic circuit breaker rules to protect backends from too many requests. Without such a setup, if your Azure OpenAI service fails, users of your application will keep receiving error messages until the backend issue is resolved and the service is ready to serve requests again. Similarly, managing multiple Azure OpenAI resources can be cumbersome, as manual URL changes are required in your API settings to switch between backend entities. This approach lacks efficiency and does not account for dynamic conditions, preventing seamless switching to the optimal backend service for better performance and reliability.

How load balancing works

First, configure your Azure OpenAI resources as referenceable backends, defining the base-url and assigning a backend-id for each. As an example, let's assume we have three different Azure OpenAI resources: openai1, openai2, and openai3. To set up load balancing across the backends, you can use one of the supported strategies, or a combination of two, to ensure optimal use of your Azure OpenAI resources.

1. Round Robin

As the name suggests, API Management will evenly distribute requests across the available backends in the pool.

2. Priority-based

For this approach, you organize multiple backends into priority groups, and API Management will assign requests to these backends in order of priority. Back to our example: we assign openai1 the top priority (priority 1), openai2 priority 2, and openai3 priority 3. Requests will be forwarded to openai1 (priority 1), but if the service is unreachable, calls will reroute to openai2 in the next priority group, and so on.

3. Weighted

Here, you assign weights to your backends, and requests will be distributed based on these relative weights. For our example, we want to be even more specific: while all requests default to openai1, in the event of its failure we now want requests to be distributed equally between our priority-2 backends (specified by a 50/50 weight allocation).

Now, configure your circuit breaker rules

The next step is to define rules that listen to events in your API and trip when specified conditions are met. Let's look at an example to see how this works. Inside your circuitBreaker property configuration, you define an array that can hold multiple rules. The failure condition defines what must happen for the circuit breaker to trip:

a. The circuit breaker will trip if there is at least one failure (a count of 1).
b. The number of failures specified in count is monitored within 5-minute intervals.
c. We are looking for errors that return a status code of 429 (Too Many Requests); you can also define a range of codes here.

The circuit will remain tripped for 1 minute, after which it will reset and resume routing traffic to the backend.

Alright, so what should be my next steps?

This article just introduced you to one of the many Generative AI capabilities supported in Azure API Management. We have more policies that you can use to better manage your AI APIs, covered in other articles in this series. Do check them out.
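Putting the pieces together, a hedged sketch of how the pooled backends and circuit breaker rule described above might look as API Management backend resources in an ARM/Bicep-style definition (resource names, URLs, and ids are illustrative; consult the backends documentation for the exact schema):

```json
[
  {
    "type": "Microsoft.ApiManagement/service/backends",
    "name": "openai1",
    "properties": {
      "url": "https://openai1.openai.azure.com/openai",
      "protocol": "http",
      "circuitBreaker": {
        "rules": [
          {
            "name": "openAIBreakerRule",
            "failureCondition": {
              "count": 1,
              "interval": "PT5M",
              "statusCodeRanges": [ { "min": 429, "max": 429 } ]
            },
            "tripDuration": "PT1M"
          }
        ]
      }
    }
  },
  {
    "type": "Microsoft.ApiManagement/service/backends",
    "name": "openai-backend-pool",
    "properties": {
      "type": "Pool",
      "pool": {
        "services": [
          { "id": "/backends/openai1", "priority": 1 },
          { "id": "/backends/openai2", "priority": 2, "weight": 50 },
          { "id": "/backends/openai3", "priority": 2, "weight": 50 }
        ]
      }
    }
  }
]
```

The pool backend encodes the priority and weight strategy, while the per-backend circuitBreaker rule trips on a single 429 within a 5-minute window and stays tripped for 1 minute.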
Do you have any resources I can look at in the meantime to learn more? Absolutely! Check out:

- https://learn.microsoft.com/en-us/azure/api-management/set-backend-service-policy
- https://learn.microsoft.com/en-us/azure/api-management/backends?tabs=bicep
- https://github.com/Azure-Samples/AI-Gateway/tree/main/labs/backend-pool-load-balancing

Quest 6 - I want to build an AI Agent
Quest 6 of the JS AI Build-a-thon marks a major milestone: building your first intelligent AI agent using the Azure AI Foundry VS Code extension. In this quest, you'll design, test, and integrate an agent that can use tools like Bing Search, respond to goals, and adapt in real time. With updated instructions, real-world workflows, and powerful tooling, this is where your AI app gets truly smart.

Use Prompty with Foundry Local
Prompty is a powerful tool for managing prompts in AI applications. Not only does it allow you to easily test your prompts during development, but it also provides observability, understandability, and portability. Here's how to use Prompty with Foundry Local to support your AI applications with on-device inference.

Foundry Local

At the Build '25 conference, Microsoft announced Foundry Local, a new tool that allows developers to run AI models locally on their devices. Foundry Local offers developers several benefits, including performance, privacy, and cost savings.

Why Prompty?

When you build AI applications with Foundry Local, or with other language model hosts, consider using Prompty to manage your prompts. With Prompty, you store your prompts in separate files, making it easy to test and adjust them without changing your code. Prompty also supports templating, allowing you to create dynamic prompts that adapt to different contexts or user inputs.

Using Prompty with Foundry Local

The most convenient way to use Prompty with Foundry Local is to create a new configuration for Foundry Local. Using a separate configuration allows you to seamlessly test your prompts without having to repeat the configuration for every prompt. It also allows you to easily switch between different configurations, such as Foundry Local and other language model hosts.

Install Prompty and Foundry Local

To get started, install the Prompty Visual Studio Code extension and Foundry Local. Start Foundry Local from the command line by running foundry service start and note the URL on which it listens for requests, such as http://localhost:5272 or http://localhost:5273.

Create a new Prompty configuration for Foundry Local

If you don't have a Prompty file yet, create one to easily access Prompty settings. In Visual Studio Code, open Explorer, right-click to open the context menu, and select New Prompty. This creates a basic.prompty file in your workspace.
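A .prompty file pairs YAML front matter with a templated prompt body. A minimal illustrative example (the field names follow the Prompty format, but treat the exact schema as an assumption and check it against your extension version):

```yaml
---
name: Weather Question
description: A sample prompt with a templated input
model:
  api: chat
sample:
  location: Shanghai
---
system:
You are a weather forecast professional that can tell weather information based on a given location.

user:
What is the weather in {{location}}?
```

When you run the file, the {{location}} placeholder is filled from the sample values (or your own inputs), so the same prompt can be tested against different contexts without touching code.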
Create the Foundry Local configuration

From the status bar, select default to open the Prompty configuration picker. When prompted to select the configuration, choose Add or Edit.... In the settings pane, choose Edit in settings.json. In the settings.json file, add a new configuration for Foundry Local to the prompty.modelConfigurations collection, for example:

```jsonc
{
  // Foundry Local model ID that you want to use
  "name": "Phi-4-mini-instruct-generic-gpu",
  // API type; Foundry Local exposes OpenAI-compatible APIs
  "type": "openai",
  // API key; required by the OpenAI SDK, but not used by Foundry Local
  "api_key": "local",
  // The URL where Foundry Local exposes its API
  "base_url": "http://localhost:5272/v1"
}
```

Important: be sure that you use the correct URL for Foundry Local. If you started Foundry Local on a different port, adjust the URL accordingly. Save your changes and go back to the .prompty file. Once again, select the default configuration from the status bar, and choose Phi-4-mini-instruct-generic-gpu from the list. Since the model and API are now configured centrally, you can remove them from the .prompty file.

Test your prompts

With the newly created Foundry Local configuration selected, press F5 in the .prompty file to test the prompt. The first time you run the prompt, it may take a few seconds because Foundry Local needs to load the model. Eventually, you should see the response from Foundry Local in the output pane.

Summary

Using Prompty with Foundry Local allows you to easily manage and test your prompts while running AI models locally. By creating a dedicated Prompty configuration for Foundry Local, you can conveniently test your prompts with Foundry Local models and switch between different model hosts and models if needed.

Cut Costs and Speed Up AI API Responses with Semantic Caching in Azure API Management
This article is part of a series of articles on API Management and Generative AI. We believe that adding Azure API Management to your AI projects can help you scale your AI models and make them more secure and easier to manage. We previously covered the hidden risks of AI APIs in today's AI-driven technological landscape. In this article, we dive deeper into one of the supported Gen AI policies in API Management, which allows you to minimize Azure OpenAI costs and make your applications more performant by reducing the number of calls sent to your LLM service.

How does it currently work, without the semantic caching policy?

For simplicity, let's look at a scenario with a single client app, a single user, and a single model deployment. This of course does not represent most real-world use cases, as you often have multiple users talking to different services. Take the following cases into consideration:

- A user lands on your application and sends in a query (query 1).
- They then send essentially the same query again, with similar verbiage, in the same session (query 2).
- The user changes the wording of the query, but it is still relevant and related to the original query (query 3).
- The last query (query 4) is completely different and unrelated to the previous queries.

In a normal implementation, all these queries will cost you tokens (TPM), driving up your bill. Your users are also likely to experience some latency as they wait for the LLM to build a response with each call. As the user base grows, you can expect expenses to grow exponentially, eventually making your system more expensive to run.

How does semantic caching in Azure API Management fix this?

Let's look at the same scenario as described above, at a high level first, with a flow diagram representing how you can cut costs and boost your app's performance with the semantic cache policy.
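Conceptually, the cache decision reduces to a vector-similarity check against previously stored queries. A toy sketch, illustrative only: Azure API Management and the Redis cache implement this for you, real queries are embedded by an embedding model rather than hand-written 3-d vectors, and this sketch treats the threshold as a similarity (higher = closer), which may differ from the policy's score_threshold semantics.

```python
# Toy sketch of a semantic-cache lookup with a similarity threshold.
import math

def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Plays the role of score_threshold: how close a new query's embedding
# must be to a cached one to count as a hit (illustrative value).
SCORE_THRESHOLD = 0.9

cache: dict[tuple[float, ...], str] = {}

def lookup_or_compute(embedding: list[float], call_llm) -> str:
    """Return a cached response for a similar query, or call the LLM."""
    for cached_embedding, cached_response in cache.items():
        if cosine_similarity(embedding, list(cached_embedding)) >= SCORE_THRESHOLD:
            return cached_response          # cache hit: no LLM call
    response = call_llm()                   # cache miss: generate a response
    cache[tuple(embedding)] = response      # store it for future lookups
    return response
```

With this sketch, query 1 populates the cache, a near-identical query 2 returns the cached answer without invoking call_llm, and an unrelated query 4 falls through to the LLM, mirroring the flow described in this article.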
When the user sends in the first query, the LLM will be used to generate a response, which will then be stored in the cache. Queries 2 and 3 are related to query 1: the relationship could be semantic similarity, an exact match, or the presence of a specified keyword, e.g., "price". In all these cases, a lookup is performed and the appropriate response is retrieved from the cache, without waiting on the LLM to regenerate a response. Query 4, which is different from the previous prompts, requires the call to be passed through to the LLM; the generated response is then stored in the cache for future searches.

Okay, tell me more. How does this work, and how do I set it up?

Think about this: how likely are your users to ask related or exactly comparable questions in your app? I'd argue that the odds are quite high.

Semantic caching for Azure OpenAI API requests

To start, you will need to add Azure OpenAI Service APIs to your Azure API Management instance with semantic caching enabled. Luckily, this has been reduced to a one-click step. I'll link a tutorial on this in the 'Resources' section. Before you configure the policies, you first need to set up a backend for the embeddings API. As part of your deployments, you will need an embedding model to convert your input to the corresponding vector representation, allowing Azure Redis cache to perform the vector similarity search. This step also allows you to set a score_threshold, a parameter used to determine how similar user queries need to be to retrieve responses from the cache. Next, add the two policies that you need: azure-openai-semantic-cache-store (or llm-semantic-cache-store) and azure-openai-semantic-cache-lookup (or llm-semantic-cache-lookup). The azure-openai-semantic-cache-store policy caches the completions and requests to the configured cache service.
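In policy XML, the store/lookup pair might look roughly like this (a hedged sketch: the embeddings backend id, score threshold, vary-by expression, and cache duration are illustrative values, not prescriptions):

```xml
<policies>
  <inbound>
    <base />
    <!-- Illustrative: embeddings-backend-id must match the embeddings backend you created -->
    <azure-openai-semantic-cache-lookup
        score-threshold="0.05"
        embeddings-backend-id="embeddings-backend"
        embeddings-backend-auth="system-assigned">
      <vary-by>@(context.Subscription.Id)</vary-by>
    </azure-openai-semantic-cache-lookup>
  </inbound>
  <outbound>
    <base />
    <!-- Illustrative: cache responses for 60 seconds -->
    <azure-openai-semantic-cache-store duration="60" />
  </outbound>
</policies>
```

The lookup runs on the inbound path so a sufficiently similar query never reaches the model, while the store runs on the outbound path to capture fresh completions for future lookups.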
You can use the internal Azure Redis Enterprise cache or any other external cache in Azure API Management, as long as it's Redis-compatible. The second policy, azure-openai-semantic-cache-lookup, performs a cache lookup across the cached requests and completions based on the proximity result of the similarity search and the score_threshold. In addition to the score_threshold attribute, you will also specify the id of the embeddings backend created in an earlier step, and you can choose to omit system messages from the prompt at this step. These two policies enhance your system's efficiency and performance by reusing completions, increasing response speed, and making your API calls much cheaper.

Alright, so what should be my next steps?

This article just introduced you to one of the many Generative AI capabilities supported in Azure API Management. We have more policies that you can use to better manage your AI APIs, covered in other articles in this series. Do check them out.

Do you have any resources I can look at in the meantime to learn more? Absolutely! Check out:

- Using external Redis-compatible cache in Azure API Management documentation
- Use Azure Cache for Redis as a semantic cache tutorial
- Enable semantic caching for Azure OpenAI APIs in Azure API Management article
- Improve the performance of an API by adding a caching policy in Azure API Management Learn module