vs code
148 TopicsMCP for Beginners: Why Every AI Engineer and Developer Should Learn the Model Context Protocol
If you have spent any time building with large language models in the last year, you have hit the same wall everyone hits: your model is brilliant at reasoning but blind to the real world. It cannot read your database, call your internal API, search your documents, or trigger a deployment unless you hand-write glue code for every single integration. The Model Context Protocol (MCP) exists to tear that wall down, and Microsoft's open-source MCP for Beginners curriculum (reachable via the short link https://aka.ms/mcp-for-beginners) is the most complete, hands-on way to learn it. This post explains what MCP is, walks through the latest updates to the course, shows real code, and makes the case for why MCP belongs on your learning roadmap right now. Whether you are an AI engineer shipping agents to production, a developer wiring tools into Copilot, or a student trying to build a standout portfolio project. What is MCP, and why does it matter? Think of MCP as a universal translator for AI applications. Just as a USB-C port lets you connect any peripheral to any laptop without a custom cable per device, MCP lets an AI model connect to any tool or data source through one standardized protocol. The course uses exactly this analogy, and it holds up well. Before MCP, integrations were an M × N problem: every one of your M AI applications needed bespoke code to talk to each of your N tools. MCP turns that into an M + N problem. Build a tool once as an MCP server, and any MCP-compatible client, Claude Desktop, VS Code, Cursor, GitHub Copilot, and many others — can use it immediately. The protocol is built on a clean client–server model with a small set of primitives: Tools — functions the model can call (query a database, send an email, run code). Resources — data the server exposes for context (files, records, documents). Prompts — reusable, parameterized prompt templates. Sampling — a server asking the client's LLM to generate a completion, enabling collaborative workflows. Elicitation — a server requesting structured input from the user mid-task. Roots — boundaries that tell a server which directories or resources it is allowed to operate on. Communication runs over JSON-RPC, with transports for local processes ( stdio ) and remote servers (streamable HTTP). That standardization is the whole point: write to the spec, and you interoperate with the entire ecosystem. What's new: the latest updates to the course The MCP for Beginners curriculum is actively maintained, and the public changelog reads like a release log for a living product. Here are the most important recent changes, drawn directly from that changelog. 1. Aligned to MCP Specification The biggest update: the entire curriculum has been validated against the current MCP Specification 2025-11-25 and the latest official SDKs. Stale references to older spec revisions (2025-03-26 and 2025-06-18) were corrected across the security, transport, real-time search, sampling, and stdio-server modules, with links repointed to the canonical modelcontextprotocol.io spec paths. A gap analysis confirmed the course already covers every primitive introduced or expanded in the latest spec: Sampling — covered in lesson 3.14 and Advanced Topics. Elicitation (including URL mode) — in Core Concepts and Protocol Features. Roots — in the Introduction, Core Concepts, and Root Contexts. Tasks (experimental, long-running operations) — in Core Concepts and Protocol Features. Tool Annotations ( readOnlyHint / destructiveHint ) — in Core Concepts and Protocol Features. 2. Samples validated against current SDKs Code that does not run is worse than no code at all, so the maintainers re-validated the core samples: TypeScript: @modelcontextprotocol/sdk resolved to 1.29.0 ; a tsc --noEmit type-check passed with no errors — the McpServer and StdioServerTransport APIs remain valid. Python: validated in an isolated virtual environment with mcp[cli] (1.27.2); FastMCP.list_tools() correctly returned the sample add and subtract tools. SDK version pins across labs were bumped (for example mcp>=1.26.0 ) and lockfiles regenerated so every sample tracks the current release. 3. A serious security pass Security is treated as a first-class concern, not an afterthought. A full audit across every dependency manifest and the sample source code was run, and npm audit now reports 0 vulnerabilities in every audited directory. Highlights: Transitive npm advisories (in the MCP Inspector dev tool, the OpenAI client, and the SDK) were remediated by bumping @modelcontextprotocol/inspector to 0.22.0 and pinning a patched shell-quote . A real code-level command-injection fix (OWASP A03): an open_in_vscode tool that used subprocess.run(..., shell=True) was rewritten to launch the resolved executable directly with no shell — closing a metacharacter-injection vector. Python dependencies were audited with pip-audit , and a vulnerable transitive werkzeug was pinned to a patched >=3.1.6 . For anyone learning to ship agents, this is gold: the course demonstrates the whole secure-development loop, not just the happy path. 4. New lessons and a growing curriculum The curriculum keeps expanding with practical, modern lessons: 5.17 Adversarial Multi-Agent Reasoning — two agents argue opposite sides of a question using shared MCP tools ( web_search + run_python ), judged by a third agent. Includes a Mermaid architecture diagram, orchestrators in Python, TypeScript, and C#, and use cases like hallucination detection, threat modeling, and API design review. 3.12 MCP Hosts — configuration for Claude Desktop, VS Code, Cursor, Cline, and Windsurf, with JSON templates and a transport comparison table. 3.13 MCP Inspector — a debugging guide for testing tools, resources, and prompts. 4.1 Pagination — cursor-based pagination patterns in Python, TypeScript, and Java. 5.16 Protocol Features — progress notifications, request cancellation, resource templates, and lifecycle management. 5. Microsoft product rebranding Content was updated to reflect Microsoft's rebranding: Azure AI Foundry → Microsoft Foundry, and the AI Toolkit (AITK) → Microsoft Foundry Toolkit Extension for VS Code. If you have seen older tutorials referencing the previous names, the curriculum is now current. Your first MCP server: see how little code it takes The course's "first server" lesson builds a simple calculator. Here is the shape of a minimal MCP server in Python using FastMCP , which mirrors the validated sample in the repo. Notice how the protocol plumbing disappears — you just decorate functions. # server.py — a minimal MCP server with two tools from mcp.server.fastmcp import FastMCP # Name your server; this identifies it to MCP clients mcp = FastMCP("Calculator") @mcp.tool() def add(a: int, b: int) -> int: """Add two numbers and return the result.""" return a + b @mcp.tool() def subtract(a: int, b: int) -> int: """Subtract b from a and return the result.""" return a - b if __name__ == "__main__": # Run over stdio so local hosts (VS Code, Claude Desktop) can connect mcp.run() The same idea in TypeScript, using the official SDK validated at version 1.29.0 : // server.ts — minimal MCP server in TypeScript import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js"; import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js"; import { z } from "zod"; const server = new McpServer({ name: "Calculator", version: "1.0.0" }); // Register a tool with a typed input schema server.tool( "add", { a: z.number(), b: z.number() }, async ({ a, b }) => ({ content: [{ type: "text", text: String(a + b) }], }) ); // Connect over stdio and start listening const transport = new StdioServerTransport(); await server.connect(transport); That is a complete, runnable server. The docstrings and schemas matter: MCP exposes them to the model so it knows when and how to call each tool. Clear descriptions are effectively prompt engineering for your tools — a common pitfall is leaving them vague, which leads to the model misusing or ignoring the tool. Connecting it in VS Code Once your server runs, an MCP host connects to it. A typical VS Code / host configuration looks like this: { "servers": { "calculator": { "command": "python", "args": ["server.py"] } } } Lesson 3.12 (MCP Hosts) covers the equivalent JSON for Claude Desktop, Cursor, Cline, and Windsurf, and lesson 3.13 shows how to use the MCP Inspector to test your tools before wiring them into a host — the single best debugging habit you can build early. How the course is structured The curriculum is organized as a progressive journey with hands-on code in C#, Java, JavaScript, Python, Rust, and TypeScript. It is grouped into phases: Foundations (Modules 0–2): Introduction, Core Concepts, and Security. Building (Module 3): Getting Started — 15 lessons covering your first server and client, LLM clients, VS Code integration, stdio and HTTP streaming, testing, deployment, auth, hosts, the Inspector, sampling, and MCP Apps. Growing (Modules 4–5): Practical Implementation and Advanced Topics — 17 advanced lessons including Azure integration, OAuth2, Entra ID auth, scaling, multi-modality, context engineering, custom transports, and adversarial multi-agent reasoning. Mastery (Modules 6–11): Community Contributions, Lessons from Early Adoption, Best Practices, Case Studies, a Microsoft Foundry Toolkit workshop, and an end-to-end 13-lab PostgreSQL capstone. That final module is the standout for portfolio building: a complete, production-flavored path that takes you from architecture and row-level security through database design, a FastMCP server, semantic search with pgvector and Azure OpenAI, testing, Docker deployment to Azure Container Apps, and monitoring with Application Insights. Why developers should learn MCP now For AI engineers MCP is becoming the default integration layer for agents. Instead of re-implementing tool calling for every framework, you write to one open protocol and your tools work everywhere. The advanced modules — sampling, roots, elicitation, scaling, routing, and adversarial multi-agent patterns — are exactly the techniques you need to move agents from demo to production. For developers MCP is already wired into tools you use daily: VS Code, GitHub Copilot, Claude Desktop, Cursor, and more. Learning to build an MCP server means you can expose your systems — internal APIs, databases, CI/CD — to AI assistants safely. The security-first approach in the course (OAuth2, Entra ID, RBAC, dependency auditing) teaches you to do this the right way from day one. For students MCP is a rare opportunity to learn a technology while it is still early, with a free, beginner-friendly, Microsoft-maintained curriculum and code in six languages. The 13-lab capstone alone is a genuine portfolio project. And with content translated into 50+ languages, the barrier to entry is low no matter where you are. Responsible and secure by design A recurring theme worth calling out: the course does not treat security and governance as optional extras. It models real practices you should carry into your own work: Least privilege via roots — constrain what a server can touch. Tool annotations — mark tools readOnlyHint or destructiveHint so clients can warn users before destructive actions. No shells for user input — the command-injection fix is a textbook example of why you never pass untrusted input through a shell. Dependency hygiene — audit with npm audit and pip-audit , and pin patched releases. Proper auth — dedicated lessons on OAuth2 and Microsoft Entra ID. Key takeaways MCP standardizes how AI connects to tools and data, turning a combinatorial integration problem into a simple, reusable one. The course is current, validated against MCP Specification 2025-11-25 with SDKs at TypeScript 1.29.0 and Python mcp 1.27.2 . Samples actually run, and the repo demonstrates a full secure-development loop with 0 reported vulnerabilities after auditing. It is broad and deep: from a 10-line calculator server to a 13-lab production capstone, in six languages. It is the fastest credible path to MCP fluency for AI engineers, developers, and students alike. Get started today Open the course: https://aka.ms/mcp-for-beginners (redirects to the GitHub repository). Fork and clone it — use a sparse checkout to skip translations for a faster download: git clone --filter=blob:none --sparse https://github.com/microsoft/mcp-for-beginners.git cd mcp-for-beginners git sparse-checkout set --no-cone "/*" "!translations" "!translated_images" Build your first server with lesson 3.1 in your language of choice. Debug it with the MCP Inspector, then connect it in VS Code. Go deep with the 13-lab database capstone, and read the official spec at modelcontextprotocol.io. Track what's new in the changelog and join the community discussions. MCP is quietly becoming the connective tissue of the AI ecosystem. The earlier you learn it, the more leverage you will have — and Microsoft's MCP for Beginners is the clearest on-ramp available. Star the repo, build a server this week, and start connecting your AI to the world.MCP Server Authorization with Azure API Management: From Simple to Advanced
Why put API Management in front of your MCP servers The Model Context Protocol (MCP) has quickly become the standard way for AI agents, such as GitHub Copilot in VS Code, to reach external tools and data. As soon as an MCP server does anything meaningful, the same questions that govern any API resurface: who is allowed to call it, what are they allowed to do, and how do you enforce that consistently across many servers without rewriting each one. Azure API Management (APIM) answers those questions for MCP. It sits between the MCP client and the tool backend and applies the controls you already trust for REST APIs: identity validation, OAuth, rate limiting, IP filtering, and observability. Crucially, APIM speaks the MCP authorization specification, which is built on OAuth 2.1 and Protected Resource Metadata (PRM, RFC 9728). That means APIM can do more than block bad requests. It can actively drive an interactive sign-in from the IDE, so the user logs in with their own identity and the agent acts on their behalf. This article walks through a progression of authorization scenarios, each one building on the last: The simple case: validate a token and block everything else. Triggering an interactive sign-in from VS Code for an MCP server that APIM hosts from your own APIs. Going beyond "is this a tenant user" to "does this user have the right attribute" with Entra app roles. Fronting an existing external MCP server and letting it drive its own OAuth flow (GitHub as the example). Governing which tools of an existing MCP server an agent is actually allowed to invoke. APIM MCP capabilities and the basic authorization options API Management exposes MCP servers in two distinct ways, and the authorization story differs slightly for each. Expose a REST API as an MCP server. APIM takes an API it already manages and projects selected operations as MCP tools. You own the operations, so you choose exactly which ones become tools at configuration time. This is the right mode when the capability you want to expose is an API you control. Expose an existing MCP server (passthrough). APIM fronts a remote MCP-compatible server (LangChain, an Azure Function, GitHub's remote MCP server, your own container) and relays the MCP protocol to it. APIM governs access, but the upstream server still owns its tool catalog. On top of either mode, you have a spectrum of authorization options: Subscription keys for simple, machine-to-machine access where a shared secret in a header is acceptable. Token validation with Microsoft Entra ID, where APIM acts as the protected resource and verifies a bearer token on every call. Interactive OAuth 2.1 sign-in, where APIM advertises Protected Resource Metadata so an MCP client can discover the authorization server, log the user in, and retry with a user token. Authorization passthrough, where an external MCP server presents its own authorization challenge and APIM relays it faithfully so the client authenticates directly against the upstream's identity provider. The rest of the article works through these options in increasing order of capability. The example setup The walkthroughs in the first three scenarios all use the same backend so you can reproduce them without standing up anything of your own: the publicly available Star Wars API at Star Wars API. It is a simple, read-friendly REST API (characters, films, planets, starships, and so on) imported into API Management as a normal API and then projected as an MCP server. The reason this single API is enough to illustrate the whole progression is that, in API Management, one underlying API can back several independent MCP servers, each exposing a different slice of its operations. For example, you can create: A read-only MCP server that exposes only the GET operations, for agents that should be able to query data but never change it. A write-capable MCP server that exposes the POST, PUT, or DELETE operations, for trusted automation that is allowed to mutate state. Same backend API, two MCP servers, two different tool surfaces. Each of these servers is an independent resource in APIM, so each one can carry its own authorization. Both can require an authenticated user (Scenarios 1 and 2), and you can go further by protecting only the sensitive one: gate the write-capable server behind an Entra app role so that, even among authenticated users, only those who carry a specific claim can reach the mutating tools. That app-role mechanism is the subject of Scenario 3, and it composes naturally with the multi-server split described here. Registering the MCP API in Microsoft Entra ID Before any of the policies below can validate a token, you need an application registration in Microsoft Entra ID that represents the MCP API. This registration is what defines the audience and scope that tokens are issued for, and it is the source of the mcp-audience, mcp-scope, and (indirectly) mcp-client-id values that the policies reference. Create it once and reuse it across all the MCP servers in this article. In the Azure portal, open Microsoft Entra ID, then App registrations, then New registration. Name it (for example, star-wars-mcp-api), choose single-tenant, and register. Record the Application (client) ID and the Directory (tenant) ID. Open Expose an API and add an Application ID URI. Accept the default api://<app-id>. This URI is your token audience. Still under Expose an API, add a delegated scope named MCP.Access, set its consent display name and description, set the state to Enabled, and save. Authorize the client that will request the scope. Under Expose an API, select Add a client application and enter the client ID of the MCP client. For VS Code, this is the built-in Microsoft authentication client aebc6443-996d-45c2-90f0-388ff96faa56. Check the MCP.Access scope and save. These steps produce the four constants the validation policy needs: Named value Comes from Example entra-tenant-id The Directory (tenant) ID from step 1 11111111-1111-1111-1111-111111111111 mcp-audience The Application ID URI from step 2 api://22222222-2222-2222-2222-222222222222 mcp-scope The scope name from step 3 MCP.Access mcp-client-id The client ID of the calling app from step 4 aebc6443-996d-45c2-90f0-388ff96faa56 [!NOTE] mcp-client-id is the identity of the application calling the MCP server, not the MCP API itself. For VS Code it is the built-in Microsoft authentication client, and its value lands in the token's appid claim, which is why the validation policy lists it under client-application-ids. If your tenant blocks the first-party VS Code client, register your own public client application and use its client ID instead. [!TIP] For the privileged-access feature in Scenario 3, you will also declare an app role on this same registration. You do not need it yet, but it is convenient to know that all identity configuration for these servers lives on this one app registration. With that backend and structure in mind, the scenarios below build up the authorization model one capability at a time. Scenario 1: The simple case, validate the token and block unauthorized access The most basic protection is to require a valid Entra ID token on every MCP request and reject anything that fails validation. No interactive flow, no roles, just a gate. APIM does this with the validate-azure-ad-token policy. The policy checks the issuing tenant, the audience (your MCP API), the calling client application, and the required scope. Anything that does not satisfy all four is rejected with a 401. <policies> <inbound> <base /> <validate-azure-ad-token tenant-id="{{entra-tenant-id}}" header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized. Access token is missing or invalid."> <client-application-ids> <application-id>{{mcp-client-id}}</application-id> </client-application-ids> <audiences> <audience>{{mcp-audience}}</audience> </audiences> <required-claims> <claim name="scp" match="any"> <value>{{mcp-scope}}</value> </claim> </required-claims> </validate-azure-ad-token> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> </on-error> </policies> The values in double braces are APIM named values: centralized constants, defined once and shared by every MCP server. They map directly to the four values produced by the Entra app registration in the example setup (entra-tenant-id, mcp-audience, mcp-scope, and mcp-client-id). Storing them as named values keeps the policy free of hardcoded identifiers and lets every server reuse the same configuration. This gets you a server that nobody can call without a properly minted token. What it does not do is help a fresh client obtain that token in the first place. That is the next scenario. Scenario 2: Driving an interactive sign-in from VS Code for an APIM-hosted MCP server When you expose one of your own APIs as an MCP server, you usually want a developer to open VS Code, connect to the server, and be prompted to sign in with their Microsoft account. No pre-shared key, no manual token handling. APIM achieves this by behaving as a well-mannered OAuth 2.1 protected resource. Using the Star Wars MCP server from the example setup, each selected operation becomes a tool the agent can call, so an agent can answer "which films featured the character named Leia" by calling the underlying API through APIM. How the sign-in flow works The protocol choreography is what turns a plain 401 into an interactive login: Two ingredients make this work: a 401 challenge that points to a metadata document, and the metadata document itself. The challenge: a 401 that points the client to its metadata Instead of a bare 401, APIM returns a WWW-Authenticate header carrying the URL of the server's Protected Resource Metadata. This is what tells the client "you need a token, and here is where to learn how to get one." Keeping this logic in a shared policy fragment means every MCP server reuses it. Notice the mcpResourceMetadataUrl reference in the fragment below. It is not hardcoded; it is a context variable that each MCP server sets in its own server-level policy before including this fragment (you will see that wiring in the per-server policy later in this scenario). The fragment simply reads whatever value the calling server provided. This indirection is what keeps the fragment pluggable: the same shared challenge-and-validate logic serves every MCP server, while each server supplies its own PRM URL. In most deployments the PRM endpoint is a single, dynamic one (built in the next section) that derives the resource from the request path, so the variable just carries that server's path. But because the URL is configurable per server rather than baked into the fragment, you retain flexibility for the cases that need it. <fragment> <!-- No token: challenge with the per-server PRM URL set by the caller --> <choose> <when condition="@(!context.Request.Headers.ContainsKey("Authorization"))"> <return-response> <set-status code="401" reason="Unauthorized" /> <set-header name="WWW-Authenticate" exists-action="override"> <value>@("Bearer resource_metadata=\"" + (string)context.Variables.GetValueOrDefault("mcpResourceMetadataUrl", "") + "\"")</value> </set-header> </return-response> </when> </choose> <!-- Token present: validate against shared named values --> <validate-azure-ad-token tenant-id="{{entra-tenant-id}}" header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized. Access token is missing or invalid."> <client-application-ids> <application-id>{{mcp-client-id}}</application-id> </client-application-ids> <audiences> <audience>{{mcp-audience}}</audience> </audiences> <required-claims> <claim name="scp" match="any"> <value>{{mcp-scope}}</value> </claim> </required-claims> </validate-azure-ad-token> </fragment> Creating the /.well-known PRM endpoint in APIM with a policy This is the part that often surprises people: APIM itself serves the metadata document. There is no separate identity service to stand up. You publish one small anonymous API at the service root that answers GET /.well-known/oauth-protected-resource/*, derives the resource value from the requested path, and returns a JSON document pointing at Microsoft Entra ID as the authorization server. Create a blank HTTP API named well-known with an empty API URL suffix so it resolves at the service root, add a GET operation with the template /.well-known/oauth-protected-resource/*, clear the subscription requirement so it is reachable anonymously, and apply this policy: <policies> <inbound> <base /> <!-- Build the resource URL from the requested PRM sub-path --> <set-variable name="resourceUrl" value="@{ var prefix = "/.well-known/oauth-protected-resource"; var path = context.Request.OriginalUrl.Path; var resourcePath = path.Length > prefix.Length ? path.Substring(prefix.Length) : ""; return "https://" + context.Request.OriginalUrl.Host + resourcePath; }" /> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ return new JObject( new JProperty("resource", (string)context.Variables["resourceUrl"]), new JProperty("authorization_servers", new JArray( "https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0")), new JProperty("scopes_supported", new JArray("{{mcp-prm-scope}}")), new JProperty("bearer_methods_supported", new JArray("header")) ).ToString(); }</set-body> </return-response> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> </on-error> </policies> The {{mcp-prm-scope}} named value populates the scopes_supported array of the metadata document. It tells the client which delegated scope to request when it goes to the authorization server, so it must be the fully qualified scope value: the token audience (the Application ID URI from the app registration) followed by the scope name. With the example values that is api://22222222-2222-2222-2222-222222222222/MCP.Access. In other words, it is the combination of the mcp-audience and mcp-scope values defined in the example setup. Named value Value to set Example mcp-prm-scope <mcp-audience>/<mcp-scope> api://22222222-2222-2222-2222-222222222222/MCP.Access [!NOTE] Keep mcp-prm-scope in sync with the scope the validation fragment requires. The PRM document advertises this scope so the client requests it, and validate-azure-ad-token then checks for it in the scp claim. A mismatch means the client obtains a token without the scope APIM expects, and validation fails. Because the policy builds the resource value from the request path, this single endpoint serves metadata for every MCP server you ever add. The Star Wars server, a future inventory server, and anything else all share it. Wiring it onto the MCP server Each MCP server only needs to declare its own metadata URL and include the shared fragment: <policies> <inbound> <base /> <set-variable name="mcpResourceMetadataUrl" value="https://apim-contoso-mcp.azure-api.net/.well-known/oauth-protected-resource/star-wars-mcp/mcp" /> <include-fragment fragment-id="mcp-entra-auth" /> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> <include-fragment fragment-id="mcp-auth-challenge-onerror" /> </on-error> </policies> On the VS Code side, the configuration is deliberately plain. With no subscription-key header present, the client falls straight into the OAuth flow: { "servers": { "star-wars-mcp": { "url": "https://apim-contoso-mcp.azure-api.net/star-wars-mcp/mcp", "type": "http" } } } Restart the server in VS Code, and it detects the 401, reads the metadata, opens a browser sign-in, requests consent on first use, and then loads the tools using the user's token. [!CAUTION] Do not read the response body with context.Response.Body inside MCP server policies. It forces response buffering and breaks the MCP streaming transport. If global diagnostic logging is enabled, set the Frontend Response payload bytes to log to 0 at the All APIs scope. Scenario 3: Beyond tenant membership, authorize on a user attribute with app roles Validating a token confirms the caller is a signed-in user in your tenant with the right scope. That is often not enough. Some MCP servers expose sensitive tools that only a subset of users should reach. You want to express "this user is not only part of the tenant, but has a specific attribute that permits this server." Microsoft Entra app roles are the optimal mechanism for this. You declare a role on the MCP API app registration, assign it to specific users or to a security group, and Entra ID emits a roles claim in the access token whenever your API is the audience. APIM then authorizes on that claim. App roles beat the groups claim here because they avoid the group overage problem, they are scoped to the application, and they travel with the app. Declaring and assigning the role On the MCP API app registration, under App roles, create a role: Setting Value Display name Privileged Access Allowed member types Users/Groups Value Privileged.Access Description Access to privileged MCP servers Then, on the matching enterprise application, under Users and groups, assign the users (or, better, a security group) to the Privileged Access role. The Value field is the exact string that lands in the token roles claim, so it cannot contain spaces. [!TIP] Keep User assignment required set to No on the enterprise application. Unassigned users still obtain a valid token with the MCP.Access scope and keep access to the non-privileged servers. They simply do not carry the roles claim, so the privileged servers reject them. Enforcing the claim in the per-server policy The shared mcp-entra-auth fragment is used by every server, so the role requirement must not live there. Place the check in the privileged server's own policy, right after the fragment include. The token is already validated at that point, so this step is pure authorization. Because the caller is authenticated but not authorized, return 403, not 401, and do not emit a challenge: re-authenticating will not grant a role the user does not have. <policies> <inbound> <base /> <set-variable name="mcpResourceMetadataUrl" value="https://apim-contoso-mcp.azure-api.net/.well-known/oauth-protected-resource/star-wars-mcp/mcp" /> <include-fragment fragment-id="mcp-entra-auth" /> <!-- Privileged guardrail: require the Privileged.Access app role --> <choose> <when condition="@(!context.Request.Headers.GetValueOrDefault("Authorization","").Replace("Bearer ","").AsJwt().Claims.GetValueOrDefault("roles", new string[0]).Contains("Privileged.Access"))"> <return-response> <set-status code="403" reason="Forbidden" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>{"error":"forbidden","message":"You lack the Privileged.Access role required for this MCP server."}</set-body> </return-response> </when> </choose> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> <include-fragment fragment-id="mcp-auth-challenge-onerror" /> </on-error> </policies> One operational detail worth calling out: app-role assignments only appear in newly issued tokens. A user who is granted the role after they signed in must obtain a fresh token. In VS Code, run MCP: Reset Cached Tokens (or sign out of the Microsoft account from the Accounts menu), then restart the server and sign in again. You can confirm the result by pasting the access token into https://jwt.ms and checking for "roles": ["Privileged.Access"]. Scenario 4: Fronting an existing external MCP server that drives its own sign-in So far APIM has been the authorization resource. But many valuable MCP servers already exist and run their own identity. GitHub publishes a remote MCP server with dozens of tools, and it authenticates users against GitHub's own OAuth authorization server. You do not want to re-implement that. You want APIM to govern access (rate limits, IP rules, logging, a single managed endpoint) while letting the upstream own the login. This is the "expose an existing MCP server" passthrough mode. When you register GitHub's remote MCP server behind APIM, the gateway relays the upstream's own authorization challenge. The client never authenticates against Entra here. It authenticates directly against GitHub. The flow, confirmed by probing the gateway: A call to the APIM endpoint with no token returns GitHub's own 401 with a WWW-Authenticate header, relayed through APIM. The Protected Resource Metadata that GitHub serves advertises authorization_servers: ["https://github.com/login/oauth"], so the client knows to log in at GitHub. The PRM resource reflects the APIM host, because GitHub builds it from the forwarded Host header. The client trusts the APIM endpoint while still logging in at GitHub. VS Code completes the GitHub sign-in and the full tool catalog loads. In the proof of concept this surfaced all 47 GitHub tools through the single APIM endpoint. The client configuration is again just a URL pointing at APIM: { "servers": { "github-via-apim": { "url": "https://apim-contoso-mcp.azure-api.net/github-mcp/mcp", "type": "http" } } } The key insight is that APIM transparently relays the backend's authentication challenge. GitHub remains the authorization server, GitHub tolerates being fronted by APIM, and you get a governed, centrally managed entry point without owning the identity flow. [!NOTE] Passthrough only relays what the upstream advertises. If the backend's PRM resource value and the actual MCP transport endpoint differ by a path segment, some clients fall back to deriving the metadata location from the server URL and can miss it. When you onboard a custom self-authenticating server, verify that the resource it advertises matches the exact URL the client connects to. Scenario 5: Restricting which tools of an existing MCP server an agent may call Passthrough raises a governance question that token validation alone cannot answer. A developer may legitimately have permission to merge a pull request through GitHub, but you may not want their AI agent to perform that action autonomously. You want to allow the read and discovery tools while blocking the destructive write tools, at the gateway, regardless of what the client tries. What is and is not possible for an external server It is important to be precise here, because the capability differs from the REST-as-MCP mode: For a REST-API-exposed-as-MCP server, you pick which operations become tools at creation time. That is native tool selection and the cleanest possible filter. For an existing/external MCP server, APIM does not enumerate the upstream's tools. The portal Tools blade explicitly states that tools are not visible for external MCP servers, and there is no allow-list property for them. APIM also cannot safely rewrite the tools/list response, because reading the response body breaks the streaming transport and the list may arrive as text/event-stream. What APIM can do reliably, and server-agnostically, is block the invocation. Every tool call arrives as a JSON-RPC tools/call request in the request body, which APIM can inspect safely. The deny-listed tools remain visible in the catalog, but any attempt to invoke one is intercepted at the gateway and returned a JSON-RPC error before it ever reaches the upstream. The reusable deny-list fragment The block is driven by a per-server named value (a comma-separated list of tool names), so the same fragment governs every external server. Only the named value changes. <!-- Fragment: mcp-tool-filter (include after the auth fragment) --> <fragment> <choose> <when condition="@(context.Request.Body != null)"> <set-variable name="mcpMethod" value="@{ try { var body = context.Request.Body.As<JObject>(preserveContent: true); return (string)body?["method"] ?? string.Empty; } catch { return string.Empty; } }" /> <choose> <when condition="@(((string)context.Variables["mcpMethod"]).Equals("tools/call", StringComparison.OrdinalIgnoreCase))"> <set-variable name="mcpToolName" value="@{ var body = context.Request.Body.As<JObject>(preserveContent: true); return (string)body?["params"]?["name"] ?? string.Empty; }" /> <!-- mcpBlockedTools is a comma-separated deny-list set by the per-server policy before this include --> <set-variable name="mcpBlocked" value="@{ var tool = ((string)context.Variables["mcpToolName"]).Trim().ToLowerInvariant(); var deny = ((string)context.Variables.GetValueOrDefault("mcpBlockedTools", "")).ToLowerInvariant().Split(',').Select(t => t.Trim()); return deny.Contains(tool); }" /> <choose> <when condition="@((bool)context.Variables["mcpBlocked"])"> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ var id = "null"; try { var body = context.Request.Body.As<JObject>(preserveContent: true); id = body?["id"]?.ToString(Newtonsoft.Json.Formatting.None) ?? "null"; } catch {} return "{\"jsonrpc\":\"2.0\",\"id\":" + id + ",\"error\":{\"code\":-32602,\"message\":\"Unknown tool: " + ((string)context.Variables["mcpToolName"]) + "\"}}"; }</set-body> </return-response> </when> </choose> </when> </choose> </when> </choose> </fragment> The deny-list itself lives in a named value, one per server: APIM named value. Comma-separated, case-insensitive. mcp-blocked-tools-github = merge_pull_request,create_repository,delete_repository,push_files,create_or_update_file,issue_write,label_write # <policies> <inbound> <base /> <set-variable name="mcpResourceMetadataUrl" value="https://apim-contoso-mcp.azure-api.net/.well-known/oauth-protected-resource/github-mcp/mcp" /> <include-fragment fragment-id="mcp-entra-auth" /> <set-variable name="mcpBlockedTools" value="{{mcp-blocked-tools-github}}" /> <include-fragment fragment-id="mcp-tool-filter" /> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> <include-fragment fragment-id="mcp-auth-challenge-onerror" /> </on-error> </policies> Generic per-server pattern: mcp-blocked-tools-<server> = <comma,separated,tool,names> Wiring it onto the GitHub passthrough server <policies> <inbound> <base /> <set-variable name="mcpResourceMetadataUrl" value="https://apim-contoso-mcp.azure-api.net/.well-known/oauth-protected-resource/github-mcp/mcp" /> <include-fragment fragment-id="mcp-entra-auth" /> <set-variable name="mcpBlockedTools" value="{{mcp-blocked-tools-github}}" /> <include-fragment fragment-id="mcp-tool-filter" /> </inbound> <backend> <base /> </backend> <outbound> <base /> </outbound> <on-error> <base /> <include-fragment fragment-id="mcp-auth-challenge-onerror" /> </on-error> </policies> Now when the agent tries to merge a pull request, the gateway returns a clean -32602 Unknown tool error and the upstream is never touched. Read and discovery tools continue to work. The tool still appears in the client's catalog. Adding governance for another external server is just one more named value plus the same fragment include. No new policy logic. Key takeaways API Management turns MCP servers into governed resources, applying the same identity, traffic, and observability controls you already use for APIs. Start simple with validate-azure-ad-token to gate access, then graduate to a full interactive sign-in by serving Protected Resource Metadata from a single APIM policy. You can publish multiple MCP servers from one underlying API, for example a read-only server and a read-write server, by selecting different operations. App roles let you authorize on a user attribute, not just tenant membership, and the check belongs in the per-server policy so shared logic stays clean. For existing external servers, APIM relays the upstream's own OAuth flow, so a server like GitHub keeps owning its identity while you keep central governance. When an external server's full tool surface is too broad, APIM can block specific tool invocations at the gateway with a reusable, named-value-driven policy, so a user's agent cannot perform actions the user could perform manually. References About MCP servers in Azure API Management Secure access to MCP servers in API Management Expose REST API in API Management as an MCP server Expose and govern an existing MCP server validate-azure-ad-token policy reference Policy fragments in API Management RFC 9728: OAuth 2.0 Protected Resource Metadata MCP authorization specification Star Wars API (example backend) MCP for BeginnersFoundry Toolkit for VS Code at //build: Hosted Agents End-to-End, a Smarter Toolbox, and More
We’re excited to share what’s new for Foundry Toolkit for Visual Studio Code at //build 2026. Since going generally available, the toolkit has kept moving fast, and this release is a big one. The headline: a complete, end-to-end Hosted Agent experience, scaffold, run, deploy, and observe without ever leaving VS Code. On top of that, we’ve expanded the Toolbox with native enterprise integrations and shipped a wave of LangGraph samples so every developer has a clear path from idea to production. From your first prompt to a production-grade, observable agent, Foundry Toolkit meets you where you are. Hosted Agents, End to End Building an agent is the easy part; getting it from a first draft to a production-grade, observable service is what matters. This release makes the full Hosted Agent lifecycle available in VS Code, and it follows the way you actually work — scaffold, run, deploy, observe. Scaffold — start from a rich set of samples Hosted Agent creation now opens with a refreshed scaffolding experience and a rich sample selection, so you start from a working, framework-appropriate template instead of a blank file. Creation is smarter, too: we auto-select your subscription when there’s only one, gate tabs more clearly, and tightened spacing for a cleaner setup flow. Run (F5) — inspect as you build Press F5 and your agent runs locally with the Agent Inspector, now aligned with the rest of the extension and featuring Copilot SDK visualization so you can see what the Inspector visualizes as the agent executes. It’s the fastest loop from change to verification before anything leaves your machine. Deploy — a new UX and new ways to ship Different teams ship differently, so deployment got a refreshed UX and two new options for Hosted Agents: ZIP Code Deploy: Package your agent source as a ZIP and deploy it directly to Microsoft Foundry Agent Service. Bring-Your-Own-Image (BYOI): Already have a pre-built container in your own Azure Container Registry? Deploy straight from it. Observe — know it works in production Once deployed, the full observability story is now available: Hosted Agent Tracing: Inspect end-to-end traces of Hosted Agent invocations directly from VS Code — tool calls, delegation chains, and timing for real debugging instead of guesswork. Continuous Evaluation Settings: A new page to configure ongoing evaluation for deployed Hosted Agents, so quality is measured continuously — not just at ship time. Evaluations Node: One-click access to evaluation runs and results right from the Foundry project tree. A Smarter, More Connected Toolbox What it is, and why it matters A Toolbox is how your agent gets its capabilities — the curated set of tools, knowledge sources, and integrations it can call at runtime. Instead of hand-wiring each connection, you assemble a Toolbox once and your agent consumes it consistently across local runs and production. The result: agents that can act on real enterprise data and systems, with the connections managed in one place. From what to how: create, connect, consume Create: Start a new Toolbox from the Foundry Toolkit sidebar “Tools Catalog” and pick the capabilities your agent needs. Connect: Configure and wire in enterprise systems through native, first-class connections once, and use it for all your agents. Consume: Reference the Toolbox from your Hosted Agent so its tools are available the moment the agent runs, locally (F5) and once deployed. New this release Building on that flow, the Toolbox is now richer and more enterprise-ready: WorkIQ as a Built-in Tool: A first-class WorkIQ experience powered by A2A connections — no MCP fallback required. End-to-end toolbox creation with WorkIQ works out of the box. Fabric IQ (OneLake Catalog) Integration: Connect your agents to Microsoft Fabric OneLake catalogs directly from the Toolbox. Toolbox Guardrails: Apply content-safety guardrails to your Toolbox for safer agent execution. Faster discovery: A new Toolbox Search Toggle and Agent Tool Multi-Select let you find and wire in multiple tools in a single action. LangGraph Reaches Parity LangGraph developers, this one is for you. We’ve added five new Hosted Agent samples that bring LangGraph to full parity with the Agent Framework Responses learning path — so you get an equivalent, end-to-end walkthrough no matter which framework you prefer: MCP — tool loading from a remote MCP server (defaults to GitHub Copilot MCP) via MultiServerMCPClient. Workflows — a custom StateGraph chaining three specialized LLM nodes: slogan writer, legal reviewer, and formatter. Files — local filesystem tools plus the Foundry-Toolbox code_interpreter working over session-uploaded files. Human-in-the-Loop — a StateGraph that drafts a proposal and pauses for approval via langgraph.types.interrupt. Observability — GenAI OpenTelemetry tracing with enable_auto_tracing(); spans, metrics, and logs flow to Application Insights. We’ve also refreshed the existing bring-your-own LangGraph samples against the new hosting layer (chat with local tools, Foundry-managed Toolbox loading, and SSE-streamed multi-turn sessions backed by a MemorySaver checkpointer), so every sample reflects how Hosted Agents work today. Polish Across the Board A release is more than headline features. This one also includes a redesigned Prompt Builder “Improve an Instruction” dialog for faster iteration, fixes for MCP toolbox tool icons, clearer ZIP-deploy error surfacing, and assorted Agent Builder and Playground regression fixes — the whole experience feels tighter end to end. Get Started Today Install: Foundry Toolkit on the VS Code Marketplace Quick Start: Follow our getting-started tutorial to build your first Hosted Agent Deep Dive: Explore the documentation, samples, and LangGraph parity walkthroughs Join the Community Share your projects, file issues, or suggest features on our GitHub repository. We can’t wait to see what you build. Welcome to the next chapter of AI development!247Views0likes0CommentsAzure HorizonDB: Enterprise-Ready Postgres, Engineered for the AI Era
Affan Dar, Vice President of Engineering, PostgreSQL at Microsoft Charles Feddersen, Partner Director of Program Management, PostgreSQL at Microsoft Today at Microsoft Build, we’re pleased to announce the public preview of Azure HorizonDB, a new enterprise-ready Postgres-compatible database service designed to meet the needs of modern AI applications, alongside a set of enhancements to our PostgreSQL tooling in Visual Studio Code to further streamline the developer experience. Postgres is rapidly solidifying its role as a foundational layer in modern data architectures, with accelerating adoption across industries. For developers, it has become the preferred platform for new application development, driven by its extensible architecture, mature extension ecosystem, and adherence to open standards and APIs. At the same time, enterprises are choosing Postgres to re-platform and modernize existing systems, taking advantage of its ability to support a broad range of operational workloads while enabling advanced capabilities such as vector-based data access all within a single, interoperable platform. A Postgres Platform Grounded in Security, Resilience, Scale, and Performance Azure HorizonDB is purpose-built to meet these demands, combining the flexibility developers expect from Postgres with the operational rigor enterprises require. It extends the core Postgres engine with cloud-native capabilities such as integrated identity, fine-grained network and security controls, and seamless lifecycle management, while preserving full compatibility with the open ecosystem of extensions and tools. At the same time, HorizonDB introduces advanced, natively integrated capabilities like vector data support and AI model management, enabling new classes of intelligent applications without sacrificing transactional integrity or developer productivity. These capabilities are backed by a platform designed for enterprise performance and scale. HorizonDB supports databases up to 128 TB, scales out with up to 15 read replicas for high-throughput workloads, and delivers sub-millisecond commit latency across availability zones for low-latency transactions and high availability. This combination is critical for modern applications that require consistent performance under load, including high-concurrency transactional systems, real-time AI-driven interactions, and globally distributed services. The result is a unified platform that scales from the first line of code to globally distributed, mission-critical systems. Enterprise adoption ultimately depends on trust in the platform itself. Azure HorizonDB delivers this with native integration into Microsoft Entra ID for centralized identity and access control, private endpoints for network isolation, and built-in encryption to protect data at rest and in transit. These capabilities are essential for meeting compliance requirements and enabling organizations to run mission-critical workloads with confidence, without added complexity. This foundation is critical for any application, but it becomes indispensable for AI, where secure access to data and controlled model interaction underpin every intelligent experience. Building on this, HorizonDB introduces a set of integrated AI capabilities designed to bring intelligence directly into the database. Run Fast, Memory-Efficient Vector Search with DiskANN HorizonDB brings high-performance vector search directly into Postgres through DiskANN with spherical quantization. This enables efficient, low-latency similarity search at scale while significantly reducing memory and storage overhead. Spherical quantization works by normalizing vectors and encoding them into compact representations that preserve angular distance, allowing the system to compare vectors efficiently with minimal loss in accuracy. The result is the ability to index and query large embedding datasets within the transactional engine itself, making vector search a first-class capability rather than an external dependency. "HorizonDB is compelling because it brings a PostgreSQL-compatible foundation, AI-native capabilities and enterprise-grade controls closer to the operational data layer." Jennings Balavari, Founder, Opsen AI Build Smarter Apps with Hybrid Search in Postgres HorizonDB supports hybrid search by combining vector similarity through pgvector with full-text search enabled via the pg_textsearch extension, allowing applications to match both semantic meaning and precise keyword relevance in a single query. This enables more accurate, context-aware results, such as blending intent-driven retrieval with exact term matching for search, recommendations, or RAG scenarios. By unifying these capabilities within Postgres, HorizonDB improves result quality while simplifying application design without the need for external search systems. Operationalize AI with Built-In AI Model Management Working with vectors requires models to generate, interpret, and evolve embeddings, making model lifecycle a core part of the application stack. HorizonDB introduces integrated AI model management to simplify how models are registered, versioned, and governed alongside data, including built-in support for generative GPT models and ranking models. For example, GPT models can be used to generate summaries, responses, or structured outputs directly from application data, while ranking models enable relevance scoring for search results or recommendations over vector results. By managing these models alongside the data they operate on, HorizonDB ensures consistency, traceability, and control, creating a unified environment where models and data evolve together. “As we build a multi-tenant, AI-driven commerce platform, HorizonDB has been particularly compelling in two areas: scale and how close AI capabilities are to the data itself. Running vector search, filtering, and model-driven workflows directly inside the database removes a lot of the complexity we’d normally manage across separate services." James Frawley, CIAO, ReFiBuy Bring AI into SQL with AI Functions With models managed in place, AI Functions provide a direct way to invoke them from within SQL and application logic. These functions are implemented through the azure_ai extension, which brings model invocation directly into the Postgres engine. This allows developers to embed inference into queries and transactions, eliminating the need for external orchestration. By bringing model execution closer to the data, AI Functions reduce latency, simplify application design, and make intelligent behavior a natural extension of existing Postgres workloads. "What stood out with HorizonDB is that it aligns closely with how we already think about the problem. Instead of stitching together multiple components, it brings transactional data, vector search, and AI capabilities into a single platform, which simplifies the architecture without forcing a complete rethink." Mohsin Shafqat, Director Software Engineering, Nasdaq Run Reliable, Event-Driven Workflows with AI Pipelines Finally, AI Pipelines operationalize these capabilities through reliable, event-driven workflows for model execution and data processing. Pipelines execute on data changes, enabling real-time asynchronous reactions without external orchestration and ensuring consistent, repeatable behavior as data evolves. Combined with model management and AI Functions, they turn embedded intelligence into something that can be run, scaled, and trusted in production, while inheriting the database’s high availability and failover characteristics for resilience. Pipelines can also be visualized and observed in real time through the Visual Studio Code extension for PostgreSQL, giving developers and operators immediate visibility into execution flow, state, and outcomes Modern Unified Experience for Data, AI, and Operations in VS Code As intelligence becomes a core part of the data platform, the developer and operator experience becomes equally critical. HorizonDB extends seamlessly into Visual Studio Code with enhanced PostgreSQL tooling that works across any Postgres deployment, not just HorizonDB. Features like AI-assisted query plans and integrated monitoring enable faster debugging and optimization, helping teams understand both database performance and AI-driven behaviors. At the same time, for Azure-based deployments, the experience is deeply integrated with platform capabilities, enabling management of networking configuration, server parameters, and server logs directly from the development environment, streamlining operations across application and infrastructure layers. Azure HorizonDB brings together enterprise-grade security, deep Postgres compatibility, and a modern AI-native data platform, all engineered for developers. It scales efficiently across workloads, from transactional systems to intelligent applications, while delivering a world-class, Azure-integrated experience in Visual Studio Code for both developers and operators. Ready to get started with Azure HorizonDB? Azure HorizonDB is now available in public preview in Australia East, Central US, Sweden Central, West US 2, and West US 3 regions. Additionally, East US, Canada Central, Indonesia Central, Italy North, Japan East, Korea Central, and Poland Central will be available in the coming weeks. You can get started today by creating a new HorizonDB instance using the Azure portal, API’s, or the Visual Studio Code extension for PostgreSQL to begin exploring these capabilities firsthand. To learn more, dive deeper into our documentation and sign-up today to try AI model management in a limited preview.Building a GitHub Copilot Agent Usage Dashboard
Introduction Working with organisations that are attempting to create GitHub Copilot custom agents, take-up of these agents by their community becomes important to know. Some questions quickly emerge are "how well are we actually using it?", "which agents are getting used and which have not had that much traction?". Native metrics provide high-level insights into adoption, but they lack the depth needed to answer more granular questions—such as which agent workflows are most used, or how behaviour evolves over time. In this post, I’ll walk through how to build an enterprise-grade GitHub Copilot usage dashboard that captures detailed telemetry from VS Code using OpenTelemetry, processes it in Azure Monitor, and visualises insights in Grafana—all using a reproducible, infrastructure-as-code approach. The dashboard can be made available to anyone that needs it. Architecture VS Code can be configured to emit metrics using Open Telemetry as a standard. This is a configuration item in VS Code and you essentially point it to an Open Telemetry Collector. The collector is an endpoint that can consume the telemetry. In this implementation, it is a container image that is hosted in Azure and I have chosen Azure Container Apps (ACA) for this purpose as it is an easy to use managed environment - but it could also run in Azure Kubernetes Service (AKS) with a little more effort. There is a prebuilt image opentelemetry collector for this and this has been adapted to inject configuration to send the telemetry to Azure Application Insights. For defining and hosting the dashboard, I have chosen another Azure managed service Azure Managed Grafana Sample Dashboard The sample dashboard is one that contains a collection of visualisations derived from the collected data in Application Insights. Azure Managed Grafana allows you to visually author these dashboards or they can be implemented as a JSON file and adapted from there. Note that the telemetry generated by VS Code gives the location of the users - city, region and country, but does not include any personally-identifying information (PII) and so cannot be used to track individuals. As I understand it, this is by design. Managed Grafana has its own permission structure, which may then be used to give users access to the dashboard. Implementation Details There is a GitHub repo Copilot Usage Dashboard that contains details of how to implement this together with instructions for either "click-ops" 🙂creation or via Terraform. So I suggest you follow the link to my repo to look at the details. In summary, there needs to be in Azure: Azure Container App (ACA) that hosts the collector - this needs to have public ingress Azure Container Registry (ACR) that hosts the docker image that is customised via the Dockerfile Key Vault that hosts the Application Insights connection string that ACA references Application Insights - this needs to be created with a flag to allow it to work with Grafana data Log Analytics Workspace that works with Application Insights Azure Managed Grafana to host the Grafana dashboard The main thing to bear in mind is that VS Code needs to be configured to emit OpenTelemetry { "github.copilot.nextEditSuggestions.enabled": true, "github.copilot.chat.otel.enabled": true, "github.copilot.chat.otel.exporterType": "otlp-http", "github.copilot.chat.otel.otlpEndpoint": "https://<fqdn>" } where the FQDN is the URL of the public ingress to the Azure Container App. There is a Dockerfile in this repo that just injects the correct configuration file into the OpenTelemetry collector image. It is this configuration file that tells the collector to emit to Application Insights. It is of the form below: receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 processors: batch: attributes: actions: - key: environment value: "prod" action: upsert exporters: azuremonitor: connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}" debug: verbosity: detailed service: pipelines: traces: receivers: [otlp] processors: [batch, attributes] exporters: [azuremonitor, debug] metrics: receivers: [otlp] processors: [batch, attributes] exporters: [azuremonitor] As can be seen above, there is a placeholder for the Application Insights connection string - in the ACA configuration this is an environment variable that then points to a secret which is in key vault. If all is well, VS Code will emit telemetry to the container image running in ACA and this will use its configuration to send to Application Insights. The Grafana dashboard then using this data. Troubleshooting The GitHub repo goes into the detail of troubleshooting, but the overall steps to troubleshoot are: If there is no data in Grafana, check that Grafana has access to Application Insights check whether there is telemetry being pushed into Application Insights by looking at the logs and looking for the contents of the table Dependencies. If there is telemetry there, then it is Grafana permissions. If not, look to ACA Look at ACA logs to see if it is healthy and look to see if there is any logs being received Use a curl request to send a fake log to ACA (a sample is in the repo) to see if the ACA is accepting logs Check the connection to Application Insights is correct and is being pulled from key vault or replace the environment variable value with the connection string directly If all good so far, then it may be that the configuration in VS Code is not correct or in the correct place. Hopefully the more detailed steps will resolve any issues quickly. Further thoughts and enhancements This implementation attempts to build a dashboard showing GitHub Copilot agent usage using a standard set of security controls, but more may be needed. Here is a list of possible enhancements: A more refined dashboard. This should be easy as there are samples for all sorts of visualisations and few of these may allow more focus on agent and model usage. the ACA-hosted OpenTelemetry collector has a public-facing ingress. This may need to be locked-down at the network level by address restriction or by a non-public ingress. Care would need to be taken to make sure that this is then visible/reachable to the intended VS Code user audience The ACA collector endpoint is not authenticated in of itself. This could be achieved at the container level by putting an authenticating proxy in the Dockerfile or at the ACA ingress level. Some investigation would be needed to see how the VS Code configuration could work with this and this may dictate largely what form this authentication can take. How the VS Code configuration changes can be automated for a user base has not been investigated as part of this work. It is assumed that an organisation may be able to roll-out these changes using their application deployment automation. Summary This approach provides a means by which an organisation can track the usage of GitHub Copilot agents (and their models), that is not provided by GitHub Enterprise dashboards. This will provide insights into the take up of custom agents and their underlying models - allowing an organisation to test whether their investments on custom agents are being used effectively. Additionally, the dashboards themselves can easily be rolled-out to a wider community than GitHub Enterprise one.Claude Code on Microsoft Foundry in VS Code — A Practical Setup Guide (with the gotchas)
Enables enterprise-grade governance without changing your developer workflow. The official Microsoft Learn article (Configure Claude Code for Microsoft Foundry) gets you ~80% of the way there. The remaining 20%—VS Code settings shape, tenant mismatches, and configuration conflicts like "baseURL and resource are mutually exclusive"—is where most setups fail in practice. This guide walks the full path end-to-end, with the exact JSON that validates, working CLI configuration, and a troubleshooting matrix based on real-world failures. This guidance is based on repeated customer deployments and internal testing across both CLI and VS Code scenarios. TL;DR Setup - Deploy claude-sonnet-4-6 (optionally Haiku + Opus) in a supported region - Grant Cognitive Services User + Foundry User - az login --tenant <tenant> , then launch VS Code via code . Config - CLI: - CLAUDE_CODE_USE_FOUNDRY=1 - ANTHROPIC_FOUNDRY_RESOURCE=<name> - Do NOT set ANTHROPIC_FOUNDRY_BASE_URL at the same time - VS Code: - Use [{ "name": "...", "value": "..." }] format Validate - claude → /status - Expect: API provider: Microsoft Foundry Why run Claude Code on Foundry? Anthropic's Claude Code is a top-tier agentic coding assistant. Running it through Microsoft Foundry instead of Anthropic's public API gives you: Data residency & compliance: prompts and completions stay inside your Azure tenant. Entra ID auth: no API keys to rotate; centralized RBAC. Private networking: works behind VNets/Private Endpoints. Unified billing & quotas: usage shows up on your Azure invoice and in Foundry monitoring. Same model, same CLI, enterprise-grade plumbing underneath. Prerequisites checklist Requirement How to verify Azure subscription with pay-as-you-go billing az account show Foundry resource in supported regions Check your region's model availability in Foundry portal Contributor/Owner on the resource group (for deployments) Azure Portal → IAM Cognitive Services User + Foundry User on the resource (for invoking) Azure Portal → IAM Azure CLI installed and logged in az --version , az login Claude Code CLI installed claude --version VS Code (current) with the Anthropic Claude Code extension Help → About Windows only: Git Bash (from Git for Windows) or WSL2 — Claude Code's runtime requires a POSIX shell bash --version in Git Bash / WSL ⚠️ Claude models in Foundry are currently available in select regions. Check the Foundry portal model catalog for your region's availability (commonly East US 2 and Sweden Central). Step 1 — Deploy the Claude models Claude Code uses three model roles, and it expects a deployment for each: Role Default deployment name Used for Primary claude-sonnet-4-6 general coding (balanced) Fast claude-haiku-4-5 quick edits, file reads Extended thinking claude-opus-4-6 complex reasoning Deploy at least Sonnet to get started. Add Haiku and Opus when you need them — Claude Code will route automatically. If a role-specific model isn't deployed, Claude Code may fall back or fail depending on the task. Deployment names in this guide follow the current Claude 4.x naming exposed in Foundry. Exact versions change over time — check the Foundry model catalog in your region for what's currently available. Foundry Portal: AI Foundry → your project → Build → Models + endpoints → + Deploy model → pick the Anthropic Claude model → Global Standard deployment → name it exactly as above (or remember the name to override later). To discover the current model version before deploying (replace eastus2 with your Foundry region): az cognitiveservices model list -l eastus2 ` --query "[?contains(model.name,'claude')].{name:model.name, version:model.version, format:model.format}" -o table Azure CLI: az cognitiveservices account deployment create ` --name <foundry-resource> ` --resource-group <rg> ` --deployment-name claude-sonnet-4-6 ` --model-name claude-sonnet-4-6 ` --model-version <version> ` --model-format Anthropic ` --sku-name GlobalStandard ` --sku-capacity 1 ✍️ Figure 1: Foundry portal “Models + endpoints” showing the three Claude deployments. Step 2 — Grant yourself the right roles This is the #1 source of silent failures. You need both: Role Role ID Purpose Cognitive Services User a97b65f3-24c7-4388-baec-2e87135dc908 data-plane invocation Foundry User (formerly Azure AI User) 53ca6127-db72-4b80-b1b0-d745d6d5456d Foundry-native permissions $me = az ad signed-in-user show --query id -o tsv $scope = az cognitiveservices account show -n <foundry-resource> -g <rg> --query id -o tsv # Use role IDs — rename-proof (works whether the display name is "Azure AI User" or "Foundry User") az role assignment create --assignee $me --role a97b65f3-24c7-4388-baec-2e87135dc908 --scope $scope # Cognitive Services User az role assignment create --assignee $me --role 53ca6127-db72-4b80-b1b0-d745d6d5456d --scope $scope # Foundry User (formerly Azure AI User) The Foundry RBAC rename (Azure AI User → Foundry User) is rolling out; both role names map to the same role definition (same role ID), depending on tenant rollout state. Use whichever role name your tenant exposes — or use the role IDs above to avoid ambiguity. Step 3 — Install the Claude Code CLI Use the official installer from Anthropic (auto-updates in the background): irm https://claude.ai/install.ps1 | iex claude --version If claude isn't on PATH, restart your shell. The installer drops it under %USERPROFILE%\.local\bin . Step 4 — Sign in to the right tenant If your Foundry resource lives in a tenant different from your default, an az login to the wrong tenant produces the cryptic error: ValueError: Unable to get authority configuration for https://login.microsoftonline.com/<bad-guid>. Authority would typically be in a format of https://login.microsoftonline.com/your_tenant Fix: az login --tenant <foundry-tenant-guid> az account set --subscription <foundry-subscription-guid> az account show # confirm tenant & subscription 💡 You can list every tenant you have access to with: az account list --query "[].{name:name, tenantId:tenantId}" -o table Step 5 — Configure the CLI Set these in the same PowerShell session you'll launch claude from: $env:CLAUDE_CODE_USE_FOUNDRY = "1" $env:ANTHROPIC_FOUNDRY_RESOURCE = "<your-foundry-resource-name>" # Optional: only if your deployment names differ from the defaults $env:ANTHROPIC_DEFAULT_SONNET_MODEL = "claude-sonnet-4-6" $env:ANTHROPIC_DEFAULT_HAIKU_MODEL = "claude-haiku-4-5" $env:ANTHROPIC_DEFAULT_OPUS_MODEL = "claude-opus-4-6" To make them persistent: setx CLAUDE_CODE_USE_FOUNDRY 1 (and so on), then sign out and back in (or restart Explorer). GUI apps like VS Code launched from the Start menu only pick up new user-env vars after the user session refreshes — opening a fresh terminal isn't enough. 🚫 The "mutually exclusive" trap API Error: baseURL and resource are mutually exclusive You'll hit this if you set both ANTHROPIC_FOUNDRY_RESOURCE and ANTHROPIC_FOUNDRY_BASE_URL . Pick one: Most users → ANTHROPIC_FOUNDRY_RESOURCE=<name> (Claude Code builds the URL). Custom subdomain / private endpoint → use ANTHROPIC_FOUNDRY_BASE_URL instead. Step 6 — Verify the CLI claude > /status Expected output: API provider: Microsoft Foundry Microsoft Foundry base URL: https://<resource>.services.ai.azure.com/anthropic Microsoft Foundry resource: <resource> Model: Default (claude-sonnet-4-6) ✍️ Figure 2: /status output confirming API provider: Microsoft Foundry . If you instead see "Anthropic" or it prompts for an Anthropic login, CLAUDE_CODE_USE_FOUNDRY isn't being inherited — see troubleshooting below. Step 7 — Configure the VS Code extension Install Claude Code from the VS Code Marketplace (publisher: Anthropic). Open user settings.json ( Ctrl+Shift+P → Preferences: Open User Settings (JSON)) and add: "claudeCode.environmentVariables": [ { "name": "CLAUDE_CODE_USE_FOUNDRY", "value": "1" }, { "name": "ANTHROPIC_FOUNDRY_RESOURCE", "value": "<your-foundry-resource-name>" } ] 🪤 Schema gotcha. The MS Learn doc currently shows this as a plain {KEY: VALUE} object under the UI label "Claude Code: Environment Variables" . In recent extension versions the actual JSON key is claudeCode.environmentVariables and the value must be an array of {name, value} objects. If you paste the doc's snippet verbatim, VS Code will flag "Missing property name", "Colon expected", "Unknown configuration setting". Use the array form above. Make the extension see your az login The extension inherits environment & credentials from the process that launches VS Code. After az login : # In the same PowerShell where az login succeeded: code . If VS Code was already running, fully quit it (not just close the window) and relaunch from the terminal. Developer: Reload Window is not enough to refresh inherited Azure CLI credentials. ✍️ Figure 3: settings.json with the claudeCode.environmentVariables array form. Step 8 — Try it In VS Code, click the Claude Code (Spark) icon in the sidebar to open the panel. Type: Summarize the structure of this project. You should get a response within a few seconds, and the panel should indicate it's routing through Microsoft Foundry. Run /status inside the panel to confirm API provider: Microsoft Foundry if you want certainty. ✍️ Figure 4: Claude Code panel in VS Code responding through Microsoft Foundry. Troubleshooting matrix Symptom Where it shows up Likely cause Fix API Error: baseURL and resource are mutually exclusive claude CLI on first request Both ANTHROPIC_FOUNDRY_BASE_URL and ANTHROPIC_FOUNDRY_RESOURCE set Unset one. Prefer ANTHROPIC_FOUNDRY_RESOURCE . Unable to get authority configuration for https://login.microsoftonline.com/<guid> claude CLI startup or VS Code panel Wrong tenant ID in az login az login --tenant <correct-guid> ; verify with az account show Failed to get token from azureADTokenProvider: ChainedTokenCredential authentication failed VS Code Claude Code panel Extension didn't inherit az login session Quit VS Code entirely; relaunch with code . from the authed shell Token tenant does not match resource tenant claude CLI or VS Code panel CLI logged into a different tenant than the Foundry resource az login --tenant <foundry-tenant> The model <name> is not available on your foundry deployment claude CLI first use or VS Code model selector Deployment name mismatch Either rename the Foundry deployment, or set ANTHROPIC_DEFAULT_*_MODEL to the actual name 401 / 403 on first request claude CLI or VS Code panel Missing RBAC on the resource Assign Cognitive Services User and Foundry User on the resource scope Claude Code prompts for Anthropic login VS Code Claude Code panel CLAUDE_CODE_USE_FOUNDRY not set in the process Set the env var before launching claude / code . VS Code shows "Unknown Configuration Setting" for claudeCode.environmentVariables VS Code Settings tab Wrong JSON shape Use the array of {name,value} objects form 429 Too Many Requests claude CLI or VS Code panel TPM/RPM exhausted Foundry portal → Operate → Quotas; request increase or reduce parallelism Works in CLI, fails in VS Code extension VS Code Claude Code panel only Env vars set per-shell, not visible to GUI VS Code Use setx (persistent user env) or move them into claudeCode.environmentVariables "Model is not available in region" Foundry portal model deployment step Foundry resource not in a supported region Deploy a new Foundry resource in a supported region, or check model availability Best practices Auth & secrets - Prefer Entra ID over API keys. If you must use a key for CI, store it as a secret (GitHub Actions secret, Key Vault) — never in settings.json (it may sync via Settings Sync). - Scope RBAC at the resource level, not the subscription, for least privilege. Project context - Create a CLAUDE.md at your repo root with stack, conventions, and entry-point commands. Claude Code reads it automatically and the quality jump is significant. - Use .claude/rules/*.md for per-area rules (e.g., test conventions, security rules). Cost & latency - Let Claude Code's auto-routing pick the right role (Sonnet/Haiku/Opus). Don't pin everything to Opus. - Cap context with ANTHROPIC_MAX_TOKENS if you have a strict budget. (Note: not honored by every Claude Code version — check the Claude Code docs for your version.) - Watch token spend in Foundry → Operate → Metrics weekly. Reliability - For team use, deploy all three model roles even if you don't think you need them — silent role-routing failures are confusing. - Tag your Foundry resource ( env=dev|prod , team=... ) for chargeback. Reproducibility - Document the exact env vars and az login --tenant GUID in your team README. - Pin Claude Code CLI version in onboarding docs ( claude --version ) so new joiners hit the same behavior. A note on the MS Learn doc The doc is accurate but skips three things that caused the most friction in real-world deployments: VS Code extension settings shape — the example uses the UI label as a JSON key and an object instead of the array form the schema actually expects. Process inheritance — it says "set the env vars" but doesn't emphasize that the VS Code window must be launched from a shell where both az login and the env vars are live. Reloading the window doesn't help. Mutually exclusive RESOURCE vs BASE_URL — listed in passing, but the error message only appears at first request, after you think everything is configured. If the Microsoft Learn page is updated, treat this post as a companion — same destination, fewer dead ends. What you've got now Claude Code running locally on your machine, talking to your Foundry resource. Entra ID auth — no API keys to manage. Full Foundry telemetry, quotas, and billing. VS Code panel + CLI, both backed by the same setup. Drop a CLAUDE.md in your repo and start shipping. When to Use RESOURCE vs BASE_URL Use RESOURCE (default) - Standard public deployments - No custom networking Use BASE_URL - Private endpoints - Custom DNS / VNet routing Never set both.757Views0likes0CommentsPower Pages & SPA: Deploying a Custom React SPA to Microsoft Power Pages
Build modern client-side experiences with React and deploy them securely on Power Pages using Power Platform CLI. Learn how to build a React Single Page Application using Vite and deploy it to Microsoft Power Pages with Power Platform CLI while using Power Pages authentication, Web API, and Dataverse security.Building Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I’ve been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF) Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach. Approach 1: Agents SDK Agents SDK provides a straightforward way to create agents with integrated tools and models. Example: Creating an Agent from azure.ai.projects import AIProjectClient from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType from azure.identity import DefaultAzureCredential client = AIProjectClient(credential=DefaultAzureCredential(), endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT")) # Configure tools ai_search = AzureAISearchTool( index_connection_id=conn_id, index_name="my-index", query_type=AzureAISearchQueryType.SEMANTIC, ) # Create agent (persisted in Foundry portal) agent = client.agents.create_agent( model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"), name="MyAgent", instructions="You are a helpful assistant.", tool_resources=ai_search.resources, tools=ai_search.definitions, ) # Run conversation thread = client.agents.threads.create() client.agents.messages.create(thread_id=thread.id, role="user", content="Hello") run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id) What this approach provides Native integration with Azure AI services (OpenAI, AI Search, MCP) Managed execution environment Simple and quick agent setup Conceptually, this approach can be summarized as: Model + Tools + Execution Strengths ✅ Rapid development and onboarding ✅ Strong integration within the Azure ecosystem ✅ Well-suited for single-agent or tool-driven use cases ✅ Minimal infrastructure overhead Challenges observed in practice As the complexity of scenarios increases, certain limitations become more visible: Multi-agent workflows require custom orchestration logic Agent handoffs must be implemented manually Context sharing across agents requires additional design effort While this approach offers flexibility, it shifts orchestration complexity to the developer. Approach 2: Microsoft Agent Framework (MAF) Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design. Creating an Agent from agent_framework import Agent, WorkflowBuilder, Message from agent_framework.foundry import FoundryChatClient from azure.identity import DefaultAzureCredential client = FoundryChatClient( project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"), model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"), credential=DefaultAzureCredential(), ) # Create agents (in-process only, not persisted in portal) researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.") writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.") # Build and run multi-agent workflow workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build() async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True): print(event.content) What this approach provides Built-in orchestration capabilities Native support for multi-agent workflows Structured agent lifecycle management Context and memory handling Conceptually, this can be viewed as: Agents + Orchestration + System Design Observations from implementation When implementing similar use cases using MAF: Agent responsibilities became clearly defined Routing and delegation patterns were significantly simplified Overall system architecture became easier to maintain and scale This approach encourages thinking in terms of agent ecosystems rather than isolated agents. Architecture Comparison Agents SDK Microsoft Agent Framework (MAF) Choosing the Right Approach Use Agents SDK when: You need rapid development for a single-agent use case The workflow is relatively straightforward You prefer flexibility and lower-level control Use Microsoft Agent Framework when: You are designing multi-agent systems Your solution requires routing, delegation, or handoffs Long-term scalability and maintainability are essential Pros and Cons Summary Agents SDK Pros Easy to get started Strong Azure integration Flexible design Cons Manual orchestration required Limited native multi-agent support Complexity increases as scenarios grow Microsoft Agent Framework (MAF) Pros Built-in orchestration Native multi-agent support Scalable and structured architecture Cons Learning curve for new developers More opinionated framework design Reduced low-level control compared to SDK-based approach References and Repositories 🔗 Microsoft Agent Framework (MAF) Microsoft Agent Framework – GitHub Repository Microsoft Agent Framework Samples – Tutorials & Examples Workflow Samples (Multi-agent patterns) FoundryChatClient sample (Python) Agent Framework demos - GitHub Source 📘 Documentation Microsoft Agent Framework Overview (Microsoft Learn) Agent Framework + Microsoft Foundry provider docs 🔗 Azure AI Projects / Agents SDK Azure AI Projects SDK – Python (GitHub Source) Azure AI Projects Agents (.NET SDK repo) 📘 Documentation Azure AI Projects SDK (Python) – Microsoft Learn Azure AI Agents SDK – Microsoft Learn Conclusion Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape. Agents SDK enables quick and flexible agent development Microsoft Agent Framework enables structured, scalable agent systems In practice, the choice depends on whether you are building a single agent feature or a multi-agent system. Final Thought Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence In a follow-up blog, I’ll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.Building an On-Device Voice Assistant with Microsoft Foundry Local
Why on-device voice still matters Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works — but it also means three round trips, three per-token bills, and three places your user's voice gets logged. The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini , you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress. This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable. What we're building A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000 . The page captures microphone audio, posts it to /api/transcribe , then streams the chat reply back over Server-Sent Events from /api/chat . All inference runs locally through two Foundry Local models loaded into the same process. The shape of the pipeline: Microphone (browser MediaRecorder) │ WebM/Opus blob ▼ Client-side WAV encoder (16 kHz, mono, PCM-16) │ multipart/form-data ▼ FastAPI /api/transcribe │ ▼ Nemotron Speech Streaming En 0.6B (Foundry Local audio client) │ transcript text ▼ Chat LLM e.g. qwen2.5-0.5b (Foundry Local chat client) │ streamed tokens ▼ FastAPI /api/chat → SSE → browser bubble The version that bit us: foundry-local-sdk >= 1.1.0 Before any code, the single most important fact about this project: The Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found . The module name also changed in 1.1.0 — it is now foundry_local_sdk (with the underscore- sdk suffix), not foundry_local . The pip wheel for foundry-local-core is bundled, so there is no separate MSI / winget install to worry about. Pin it explicitly: pip install --upgrade "foundry-local-sdk>=1.1.0,<2" And verify before anything else: python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))" # expect: sdk 1.1.0 Loading both models from one manager The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client — get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client. The wrapper used in the repo ( src/foundry_client.py ) does this: from foundry_local_sdk import Configuration, FoundryLocalManager FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron")) manager = FoundryLocalManager.instance chat_model = manager.load_model("qwen2.5-0.5b") stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b") chat_client = chat_model.get_chat_client() audio_client = stt_model.get_audio_client() Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB. The transcription surprise The first naive approach was the obvious one: with open(wav_path, "rb") as f: result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b") That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime. The fix is to use the streaming session API instead — a different native entry point ( core_interop.start_audio_stream ) that the streaming model does support. The repo isolates this in src/_nemotron_live.py : def transcribe_wav_live(audio_client, wav_path, *, language="en"): with wave.open(str(wav_path), "rb") as w: sample_rate = w.getframerate() channels = w.getnchannels() sample_width = w.getsampwidth() pcm = w.readframes(w.getnframes()) session = audio_client.create_live_transcription_session() session.settings.sample_rate = sample_rate session.settings.channels = channels session.settings.bits_per_sample = sample_width * 8 session.settings.language = language session.start() # Feed PCM in ~100 ms chunks from a worker thread, then stop. bytes_per_sec = sample_rate * channels * sample_width chunk_bytes = max(bytes_per_sec // 10, 1024) def _pusher(): try: for offset in range(0, len(pcm), chunk_bytes): session.append(pcm[offset:offset + chunk_bytes]) finally: session.stop() threading.Thread(target=_pusher, daemon=True).start() parts = [] for resp in session.get_stream(): for cp in getattr(resp, "content", []) or []: text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or "" if text: parts.append(text) return " ".join(p.strip() for p in parts if p.strip()).strip() Two things to notice: Push from a thread, read from the main coroutine. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel — otherwise you deadlock the session. Chunk to ~100 ms. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency. Always session.stop() . Without it the generator never terminates and the request hangs. The other transcription surprise: browsers don't send WAV Inside the browser, MediaRecorder defaults to audio/webm; codecs=opus . That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency — which is exactly the kind of friction this project exists to remove. The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus, so the page can decode the recording, resample to 16 kHz, mix to mono, and emit a PCM-16 WAV blob in 30 lines of JavaScript: // Inside src/static/index.html async function webmToWav(blob) { const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 }); const buf = await ctx.decodeAudioData(await blob.arrayBuffer()); // Mix to mono const ch = buf.numberOfChannels; const mono = new Float32Array(buf.length); for (let c = 0; c < ch; c++) { const data = buf.getChannelData(c); for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch; } return encodeWav(mono, 16000); } function encodeWav(samples, sampleRate) { const buffer = new ArrayBuffer(44 + samples.length * 2); const view = new DataView(buffer); // RIFF header writeStr(view, 0, "RIFF"); view.setUint32(4, 36 + samples.length * 2, true); writeStr(view, 8, "WAVE"); // fmt chunk writeStr(view, 12, "fmt "); view.setUint32(16, 16, true); // PCM chunk size view.setUint16(20, 1, true); // PCM format view.setUint16(22, 1, true); // mono view.setUint32(24, sampleRate, true); view.setUint32(28, sampleRate * 2, true); // byte rate view.setUint16(32, 2, true); // block align view.setUint16(34, 16, true); // bits per sample // data chunk writeStr(view, 36, "data"); view.setUint32(40, samples.length * 2, true); // PCM-16 samples let o = 44; for (let i = 0; i < samples.length; i++, o += 2) { const s = Math.max(-1, Math.min(1, samples[i])); view.setInt16(o, s < 0 ? s * 0x8000 : s * 0x7FFF, true); } return new Blob([view], { type: "audio/wav" }); } Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands them to transcribe_wav_live() — no audio decoding libraries on the Python side. Wiring it into FastAPI The server ( src/app.py ) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request: @app.post("/api/transcribe") async def transcribe(audio: UploadFile = File(...)): data = await audio.read() with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f: f.write(data); path = f.name text = _ai_client.transcribe(path) return {"text": text} @app.post("/api/chat") async def chat(req: ChatRequest): if req.stream: return StreamingResponse( _sse(_ai_client.stream_completion(req.messages)), media_type="text/event-stream", ) return {"text": _ai_client.chat_completion(req.messages)} Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost . What it looks like The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone. Performance, honestly This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation: First model load (cold cache): tens of seconds — downloads ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b . Subsequent loads (warm cache): a few seconds per model. End-to-end transcription of a 5-second utterance: well under a second after warm-up. First chat token from qwen2.5-0.5b : typically 200–500 ms; full short reply within 1–2 s. On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model. Trade-offs to know about Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning — the wrapper accepts any chat alias in the Foundry Local catalog. STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b . Multilingual variants are likely to follow. Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real. No agent framework yet. This sample is deliberately a single-turn loop over a chat client — there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour. Responsible AI considerations Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility: Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red ⏹ button and a "Recording…" banner for that reason. Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device. Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters. Try it Clone github.com/leestott/fl-nemotron. ./setup.ps1 (or ./setup.sh ) to create a virtualenv and install the pinned SDK. python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models. .venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 Open http://127.0.0.1:8000 in a real browser and click the 🎤 button. Where to go next Foundry Local documentation — official docs for the runtime, catalog, and SDK. microsoft/Foundry-Local — upstream samples and issue tracker. NVIDIA Nemotron model family — background on the speech and language models being published into the catalog. leestott/fl-nemotron — the full source for this post. Key takeaways Pin foundry-local-sdk >= 1.1.0 . Earlier SDKs cannot see the Nemotron Speech Streaming model. Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe() . Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS. Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks. A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process — no cloud, no keys, no data egress.