Microsoft Developer Community Blog articles

Beyond text: Returning images and interactive apps from MCP servers

Pamela_Fox — Wed, 15 Jul 2026 08:33:51 GMT

The Model Context Protocol (MCP) is becoming a richer foundation for agent experiences. Though most servers return plain text from their tool calls, MCP servers can also return binary results and provide interactive apps in clients that support those features, like VS Code.

In this post, I'll use both capabilities to build an MCP server that searches a collection of nature photos with natural language, lets the model inspect the matching images, and presents selected results in an interactive gallery. The same approach can be adapted to product catalogs, digital asset managers, photo archives, and other multimedia libraries.

Searching the image library

Let's start with the search experience from a user's perspective, then dive into the code behind it.

After connecting VS Code to the deployed MCP server, I can ask a question in GitHub Copilot about the images:

Find landscape photos that show dramatic terrain and water. Show me the strongest options for a nature gallery.

The GitHub Copilot agent realizes that it can use the image search MCP tool to answer that question. Here's what it looks like in the chat interface:

The tool results include rendered thumbnails. I can click a thumbnail to inspect it directly in VS Code, much like a file in the workspace, while the Copilot agent can review both the binary image data and the textual descriptions.

Behind the scenes, the agent called the image_search tool with these arguments:

{ "query": "dramatic natural landscapes with mountains and water", "max_results": 5 }

The tool call returned a mix of binary files and structured data: a thumbnail for each matching image, plus JSON containing its filename, display name, and generated description. The thumbnails let a multimodal model inspect the actual pixels, while the structured content gives the agent compact metadata it can reference in later tool calls.

{ "results": [ { "filename": "Picture1.jpg", "display_name": "Picture1.jpg", "description": "A clear mountain lake surrounded by pine forest and steep rocky peaks." }, ...] }

Returning images from MCP tools

Now let's look at the code powering that tool call.

I built the server with FastMCP, a popular Python framework for writing MCP servers. I declare each tool by decorating a function with mcp.tool() and annotating its arguments with types and helpful descriptions. FastMCP converts the function signature into a JSON Schema that helps GitHub Copilot decide when and how to call image_search:

@mcp.tool(annotations={"readOnlyHint": True})
async def image_search(
    query: Annotated[
        str, "Text description of images to find (e.g., 'sunlit mountain lake')"
    ],
    max_results: Annotated[int, "Maximum number of images to return (1-20)"] = 5,
) -> ToolResult:
    """
    Search for images matching a natural language query.
    Returns the image data and descriptions.
    """

Inside the function, I use Azure AI Search to perform hybrid retrieval, combining the text query with its vector embedding. The target index contains multimodal image embeddings and LLM-generated descriptions. Then I retrieve the image from Azure Blob Storage and resize it to a thumbnail. The tool returns both the binary image data for the thumbnails and structured metadata with image details.

    results = await search_client.search(
        search_text=query,
        top=max_results,
        vector_queries=[
            VectorizableTextQuery(
                k_nearest_neighbors=max_results, fields="embedding", text=query
            )
        ],
        select=["metadata_storage_path", "verbalized_image"],
    )
    blob_service_client = get_blob_service_client()
    files: list[File] = []
    image_results: list[dict[str, str]] = []
    async for result in results:
        url = result["metadata_storage_path"]
        description = result.get("verbalized_image")
        container_name, blob_name = get_blob_reference_from_url(url)
        blob_client = blob_service_client.get_blob_client(
            container=container_name, blob=blob_name
        )
        stream = await blob_client.download_blob()
        image_bytes = await stream.readall()
        image_format = get_image_format(url)
        display_name = os.path.basename(blob_name)
        file_basename = Path(display_name).stem
        thumbnail_bytes = resize_image_bytes(image_bytes, image_format)
        files.append(File(data=thumbnail_bytes, format=image_format, name=file_basename))
        image_results.append(
            {"filename": blob_name, "display_name": display_name, "description": description}
        )
    return ToolResult(
        content=files,
        structured_content={
            "query": query,
            "results": image_results,
        },
    )
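The helpers get_blob_reference_from_url and get_image_format are called above but not shown. Here is a minimal sketch of what they might look like. It assumes blob URLs of the form https://<account>.blob.core.windows.net/<container>/<blob path> and that the file extension reliably indicates the format; both are assumptions for illustration, not code taken from the sample repository.

```python
from pathlib import PurePosixPath
from urllib.parse import unquote, urlparse


def get_blob_reference_from_url(url: str) -> tuple[str, str]:
    """Split a blob URL into (container_name, blob_name).

    Assumption: the first path segment is the container, the rest is the blob.
    """
    path = unquote(urlparse(url).path).lstrip("/")
    container_name, _, blob_name = path.partition("/")
    if not container_name or not blob_name:
        raise ValueError(f"Not a blob URL: {url}")
    return container_name, blob_name


def get_image_format(url: str) -> str:
    """Guess an image format name from the URL's file extension (hypothetical mapping)."""
    suffix = PurePosixPath(urlparse(url).path).suffix.lower().lstrip(".")
    # Treat "jpg" as "jpeg"; fall back to "jpeg" when there is no extension.
    return {"jpg": "jpeg"}.get(suffix, suffix or "jpeg")
```

Percent-encoded characters are decoded for the blob name, since the storage SDK expects the plain name rather than the URL-escaped one.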

Displaying selected images

Finding the right images is only the first half of the experience. Once the agent has reviewed the thumbnails and their generated descriptions, it needs a better way to present the images it selected to the user. That is where MCP apps come in. An MCP app renders an interactive webpage inside a sandboxed iframe in the MCP client. For this server, the app is a small, JavaScript-powered carousel for browsing the selected images.

GitHub Copilot calls the display_image_files tool when it wants to render the carousel app:

Returning apps from MCP tools

Let's check out the code that powers that MCP carousel app.

An app is associated with a tool, so I once again decorate a Python function with mcp.tool(). This time, I pass an AppConfig that points to the image viewer's HTML resource.

@mcp.tool(
    app=AppConfig(resource_uri=IMAGE_VIEW_URI),
    annotations={"readOnlyHint": True},
)
async def display_image_files(
    filenames: Annotated[list[str], "List of image filenames to retrieve and display in a carousel."],
    descriptions: Annotated[list[str], "Image descriptions, in the same order as filenames."],
) -> ToolResult:
    """Fetch images by filename and render in carousel with filenames, descriptions, and file details."""

Inside the function, I fetch the selected images from Azure Blob Storage by filename, then return both the binary image data and structured content describing each image—its filename, generated description, MIME type, dimensions, format, and size.

    blob_service_client = get_blob_service_client()
    image_blocks: list[types.ImageContent] = []
    image_results: list[dict[str, str | int]] = []
    for image_index, filename in enumerate(filenames):
        blob_client = blob_service_client.get_blob_client(
            container=IMAGE_CONTAINER_NAME, blob=filename
        )
        stream = await blob_client.download_blob()
        image_bytes = await stream.readall()
        mime_type = get_image_mime_type(filename)
        with Image.open(io.BytesIO(image_bytes)) as image:
            width, height = image.size
            image_format = image.format
        image_blocks.append(
            types.ImageContent(
                type="image",
                data=base64.b64encode(image_bytes).decode("utf-8"),
                mimeType=mime_type,
            )
        )
        image_results.append(
            {
                "filename": filename,
                "description": descriptions[image_index],
                "mimeType": mime_type,
                "width": width,
                "height": height,
                "format": image_format,
                "sizeBytes": len(image_bytes),
            }
        )
    return ToolResult(
        content=image_blocks,
        structured_content={
            "images": image_results,
        },
    )
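Two details in that function are worth a closer look: get_image_mime_type is not shown, and descriptions[image_index] raises an IndexError if the model passes lists of different lengths. The sketch below shows one way to handle both using only the standard library. The fallback MIME type and the pair_filenames_with_descriptions helper are assumptions for illustration, not part of the sample.

```python
import mimetypes


def get_image_mime_type(filename: str) -> str:
    """Guess a MIME type from the filename, falling back to a generic type."""
    mime_type, _ = mimetypes.guess_type(filename)
    if mime_type is None or not mime_type.startswith("image/"):
        # Assumed fallback for unknown or non-image extensions.
        return "application/octet-stream"
    return mime_type


def pair_filenames_with_descriptions(
    filenames: list[str], descriptions: list[str]
) -> list[tuple[str, str]]:
    """Validate the two parallel lists up front and return (filename, description) pairs."""
    if len(filenames) != len(descriptions):
        raise ValueError(
            f"Got {len(filenames)} filenames but {len(descriptions)} descriptions"
        )
    return list(zip(filenames, descriptions))
```

Failing early with a clear message gives the agent something it can act on, such as retrying the tool call with matching lists.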

Next, I define the resource that serves the image viewer HTML page. I decorate a Python function with @mcp.resource, assign it a ui:// URL that is unique to the MCP server, and use its Content Security Policy (CSP) to declare which external domains the app may load resources from:

@mcp.resource(
    IMAGE_VIEW_URI,
    app=AppConfig(csp=ResourceCSP(resource_domains=["https://unpkg.com"])),
)
def image_view() -> str:
    """Render images returned by display_image_files as an MCP App."""
    return load_image_viewer_html()

The final piece is the HTML that renders inside the app's iframe. This small page imports ext-apps, a JavaScript package that manages bidirectional communication with the MCP client. The JavaScript creates an App instance, defines the ontoolresult callback, and connects the app. That callback receives images from the tool result and renders them in the carousel. MCP apps can also send messages back to the host, although this read-only viewer does not need to.

<!DOCTYPE html>
<html lang="en">
<body>
  <div id="carousel">
    <button id="prev" type="button" aria-label="Previous">‹</button>
    <div id="frame"></div>
    <button id="next" type="button" aria-label="Next">›</button>
    <span id="counter" aria-live="polite"></span>
  </div>
  <script type="module">
    import { App } from "https://unpkg.com/@modelcontextprotocol/ext-apps@0.4.0/app-with-deps";

    const app = new App({
      name: "Image Viewer",
      version: "1.0.0",
    });

    let images = [];
    let index = 0;
    const frame = document.getElementById("frame");
    const prevBtn = document.getElementById("prev");
    const nextBtn = document.getElementById("next");
    const counter = document.getElementById("counter");

    function show(i) {
      index = i;
      const img = images[index];
      frame.innerHTML = "";
      const el = document.createElement("img");
      el.src = `data:${img.mimeType || "image/jpeg"};base64,${img.data}`;
      el.alt = "Blob image";
      frame.appendChild(el);
      prevBtn.disabled = index === 0;
      nextBtn.disabled = index === images.length - 1;
      counter.textContent = images.length > 1 ? `${index + 1} / ${images.length}` : "";
    }

    prevBtn.addEventListener("click", () => {
      if (index > 0) {
        show(index - 1);
      }
    });
    nextBtn.addEventListener("click", () => {
      if (index < images.length - 1) {
        show(index + 1);
      }
    });

    app.ontoolresult = ({ content }) => {
      images = (content || []).filter((block) => block.type === "image");
      if (images.length > 0) {
        show(0);
      }
    };

    await app.connect();
  </script>
</body>
</html>

Try it yourself!

The full MCP server code is available in Azure-Samples/image-search-aisearch, along with a minimal image search website and an Azure AI Search indexing pipeline. The indexer uses an Azure OpenAI model to describe each image and Azure AI Vision to create multimodal embeddings. The repository includes a sample nature dataset, but you can replace it with any image collection.

Here are more ways you could extend it:

Support more media types: add transcript search and a video or audio player app, while keeping the same search-then-display tool pattern.
Enrich the metadata: index dates, locations, creators, accessibility text, or domain-specific tags alongside generated descriptions and embeddings.
Optimize token consumption: images require many tokens, so returning too many thumbnails can quickly consume the model's context window. Experiment with smaller previews, higher compression, metadata-only search results, or a two-stage retrieval flow.
Add authentication: many media libraries contain private or licensed assets. You can add key-based authentication or OAuth with the FastMCP auth providers, as I described in the MCP auth livestream.
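As one illustration of the token-consumption idea, a search tool could attach thumbnails only until a size budget is used up and return metadata alone for the remaining matches. The sketch below uses total bytes as a rough stand-in for image tokens, which is an assumption: actual token cost depends on the model. The function name and budget value are hypothetical.

```python
def split_by_byte_budget(
    thumbnails: list[tuple[str, bytes]], max_total_bytes: int
) -> tuple[list[str], list[str]]:
    """Decide which results carry image data and which are metadata-only.

    `thumbnails` is a ranked list of (name, image_bytes). Results are visited in
    rank order; a thumbnail is attached only if it still fits in the budget.
    """
    with_image: list[str] = []
    metadata_only: list[str] = []
    used = 0
    for name, data in thumbnails:
        if used + len(data) <= max_total_bytes:
            with_image.append(name)
            used += len(data)
        else:
            metadata_only.append(name)
    return with_image, metadata_only


ranked = [("a.jpg", b"x" * 40), ("b.jpg", b"x" * 50), ("c.jpg", b"x" * 20)]
print(split_by_byte_budget(ranked, max_total_bytes=70))
```

This greedy version lets a smaller, lower-ranked thumbnail through after a larger one is skipped. Stopping at the first overflow instead would preserve strict ranking at the cost of unused budget.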

Once search results can carry both structured metadata and real media, an agent can do more than locate files: it can compare, curate, and present them in the same conversation. I hope you'll try the sample with a multimedia collection of your own!

Building AI Agents from Zero to Production

Lee_Stott — Tue, 14 Jul 2026 07:00:00 GMT

Most agent demos stop at "it answered my question." Production doesn't. The gap between a notebook that calls an LLM and a governed, observable, multi-agent system your organisation can actually depend on is where real engineering happens: evaluation, deployment, data sovereignty, tool governance, and cross-team interoperability.

Microsoft's open-source course Building AI Agents from Zero to Production walks that entire arc in seven lessons, using one realistic use case and the Microsoft Agent Framework (MAF) plus Microsoft Foundry. This post is a developer-focused tour of what it teaches, the architecture decisions behind each stage, and the code patterns that matter when you move from prototype to production.

Who this is for

AI engineers building their first, or first production, agent system.
Backend and full-stack developers integrating agents into real applications and CI/CD.
Cloud architects who need data sovereignty, private networking, and governance around agent workloads.
Technical leads deciding how to standardise tools and orchestration across multiple teams.

The samples are Python 3.12+, served through Microsoft Foundry using GPT-5 series models (for example gpt-5.1). Lesson 4 adds a TypeScript/React frontend. You will want an Azure subscription and the Azure CLI.

The AI Agent Development Lifecycle

The course is organised around a lifecycle rather than a feature list. Each lesson is a stage, and each stage assumes the previous one is solved:

#	Stage	The production question it answers
1	Agent Design	What should each agent do, and how do they hand off?
2	Agent Development	How do I build and run them with the Agent Framework?
3	Agent Evaluations	How do I know they actually work — and keep working?
4	Agent Deployment	How do I ship one as a hosted service with a UI and CI gate?
5	Production Hosted Agents	How do I meet enterprise data, network, and governance needs?
6	Microsoft Toolbox	How do I govern tools once, and reuse them across teams?
7	Multi-Agent & A2A	How do agents from different teams interoperate safely?

The thread running through all seven is a single scenario: a Developer Onboarding agent system that helps a new hire find the right teammates, get a sensible first task, and pull learning resources and code snippets. It is deliberately mundane, which is exactly why it exposes the production concerns that flashy demos hide.

Lesson 1 — Agent Design: three components, one graph

The course defines an agent by three parts: an LLM for reasoning, tools to act, and memory to retain context. The design work is context engineering — making sure the right information reaches the model at the right moment, no more and no less.

Rather than one monolithic assistant, the onboarding system is split into specialists coordinated by a triage agent using handoff orchestration:

Agent	Job	Tool
Employee Search	Answer org and people questions	Foundry file search over an employee-directory vector store
Task Recommendation	Suggest 1–3 GitHub issues for the new dev	GitHub MCP Server (reads recent commits + open issues)
Code Assistant	Provide resources and runnable snippets	Microsoft Learn MCP + Code Interpreter

Architecturally this is a directed graph: User → Triage → [Employee, Learning, Coding]. Splitting responsibilities early pays off later: each agent gets a tightly scoped prompt (less hallucination), can be evaluated independently, and can be upgraded without touching its peers.
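To make the shape of that graph concrete, here is a toy sketch in plain Python. It is not the Agent Framework's handoff API: the specialists are stub functions, and the triage step uses keyword matching where the real system would ask an LLM to pick the target. All names and keywords are hypothetical.

```python
import re
from typing import Callable

Handler = Callable[[str], str]


def employee_search(question: str) -> str:
    return f"[employee-search] {question}"


def task_recommendation(question: str) -> str:
    return f"[task-recommendation] {question}"


def code_assistant(question: str) -> str:
    return f"[code-assistant] {question}"


# Edges of the graph: the triage node can hand off to any of these specialists.
SPECIALISTS: dict[str, Handler] = {
    "employee_search": employee_search,
    "task_recommendation": task_recommendation,
    "code_assistant": code_assistant,
}

# Stand-in for the triage agent's reasoning (an assumption for illustration).
ROUTING_KEYWORDS: dict[str, set[str]] = {
    "employee_search": {"who", "team", "manager"},
    "task_recommendation": {"issue", "issues", "task"},
}


def triage(question: str) -> str:
    """Pick the specialist whose keywords appear in the question."""
    words = set(re.findall(r"[a-z]+", question.lower()))
    for name, keywords in ROUTING_KEYWORDS.items():
        if words & keywords:
            return name
    return "code_assistant"  # default route


def handle(question: str) -> str:
    return SPECIALISTS[triage(question)](question)
```

Because each specialist sits behind the same call signature, any one of them can be swapped or tested alone, which is the property the lesson relies on.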

Lesson 2 — Development: standalone agents with MAF

Here the design becomes code. Each specialist is a small, independently runnable service built with the Microsoft Agent Framework, authenticated to Foundry with your Azure CLI login. Setup is deliberately boring:

az login
az account set --subscription "<your-subscription-id>"
cp .env.example .env
# Fill FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL (e.g. gpt-5.1)

# Create the employee-directory vector store once; note the printed VECTOR_STORE_ID
python lesson-2-agent-development/setup_vector_store.py

# Start an agent — serves on http://localhost:8090
python lesson-2-agent-development/employee-search-agent.py

The FoundryChatClient auto-reads any FOUNDRY_-prefixed environment variables and uses AzureCliCredential, so there are no keys in code. The lesson ships six samples, each on its own port, so you can chat with them individually in the local DevUI before wiring them together:

Sample	Tool	Port
employee-search-agent.py	Foundry file search / vector store	8090
task-recommendation-agent.py	GitHub MCP Server	8095
azure-learning-agent.py	Microsoft Learn MCP	8092
coding-agent.py	Code Interpreter	8093
learning-recommendation-agent.py	Learn MCP + reasoning	8091
agent-orchestration.py	Multi-agent handoff	8094

Why this matters: keeping each agent as its own process with its own port is a testability decision, not an accident. You can smoke-test one specialist in isolation, then compose them in agent-orchestration.py.

Lesson 3 — Evaluation: you can't unit-test a probability distribution

This is the lesson that separates a demo from a product. Agents are non-deterministic, so traditional assertions don't fit. The course uses three complementary layers:

Observability / tracing — always on, via OpenTelemetry to Application Insights.
Smoke tests — fast, run on every deploy.
Evaluations — deeper, model-based scoring run on-demand or nightly.

Turning on tracing is a single call:

from agent_framework.foundry import FoundryChatClient

client = FoundryChatClient()
client.configure_azure_monitor()   # export traces + metrics to Application Insights

For quality, the course uses Foundry's built-in "LLM-as-a-judge" evaluators against real persisted responses (identified by response_id), not freshly regenerated ones:

Evaluator	`evaluator_name`	Measures
Relevance	builtin.relevance	Does the response address the request?
Groundedness	builtin.groundedness	Is it supported by retrieved data (no hallucination)?
Tool-call accuracy	builtin.tool_call_accuracy	Were the right tools called with the right arguments?
Tool-output utilization	builtin.tool_output_utilization	Did the agent actually use tool results?

The judge model is set independently via AZURE_AI_MODEL_DEPLOYMENT_NAME, so you can evaluate a cheap production model with a stronger one. The run prints a report_url that deep-links into the Foundry portal.

Lesson 4 — Deployment: a hosted agent, a UI, and a CI gate

Now the agent becomes a managed service. It is deployed as a Foundry Hosted Agent (a Microsoft-managed execution environment) and fronted by an OpenAI ChatKit React UI talking to a FastAPI backend:

ChatKit React (3000) → FastAPI backend (8001) → Foundry Hosted Agent → tools

Building the agent is declarative; you attach tools, name it, and serve it:

agent = client.as_agent(
    name="DevOnboardingAgent",
    instructions="...",
    tools=[file_search_tool, learn_mcp_tool],
)
# served with: from_agent_framework(agent).run()

The recommended deploy path is the Azure Developer CLI:

cd hosted-agent
azd auth login
azd agent deploy

The genuinely production-minded part is the smoke test as a post-deploy CI gate. Six cases cover reachability, each scenario, off-topic prompt adherence, and multi-turn threading (verifying state via previous_response_id). The GitHub Action runs them against the freshly deployed agent:

export FOUNDRY_TOKEN=$(az account get-access-token \
  --resource https://ai.azure.com/ --query accessToken -o tsv)

python runner.py \
  --project-endpoint "https://<account>.services.ai.azure.com/api/projects/<project>" \
  --agent-name dev-onboarding \
  --tests-file tests/smoke-tests.json

Pitfall to remember: the token audience must be https://ai.azure.com/. A cognitiveservices.azure.com token is rejected by the Responses API — a mistake that costs many engineers an afternoon.
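A quick way to catch the audience mistake before it reaches the Responses API is to decode the token's payload and look at its aud claim. The sketch below does this with the standard library only. It does not verify the signature, and it assumes the aud claim carries the resource URI (some tokens carry an application ID instead); the function names are hypothetical.

```python
import base64
import json


def token_audience(token: str) -> str:
    """Return the `aud` claim of a JWT without verifying its signature."""
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("Not a JWT: expected header.payload.signature")
    payload_segment = parts[1]
    # JWT segments are base64url without padding; restore it before decoding.
    padded = payload_segment + "=" * (-len(payload_segment) % 4)
    claims = json.loads(base64.urlsafe_b64decode(padded))
    return claims["aud"]


def has_foundry_audience(token: str) -> bool:
    """True if the token was issued for https://ai.azure.com (trailing slash ignored)."""
    return token_audience(token).rstrip("/") == "https://ai.azure.com"
```

In a CI job, calling has_foundry_audience on the value of FOUNDRY_TOKEN before running the smoke tests turns a confusing rejection into an immediate, readable failure.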

Lesson 5 — Production: separating where an agent runs from where its data lives

The pivotal concept for enterprise readiness is the distinction between a Hosted Agent (compute, scaling, identity) and a Capability Host (where conversation history, files, and embeddings actually reside):

Concern	Hosted Agent	Capability Host
Compute / scaling / identity	✅ Provided	—
Conversation history	Microsoft-managed default	Redirect to your Azure Cosmos DB
File uploads	Microsoft-managed default	Redirect to your Azure Storage
Vector embeddings	Microsoft-managed default	Redirect to your Azure AI Search
Required to run the agent?	✅ Yes	❌ Optional
Required for data sovereignty?	❌ Not sufficient	✅ Yes

"Basic" setup uses Microsoft-managed storage and is perfect for getting started. "Standard" setup redirects each data plane to your own Azure resources through a project-level capability host, this is how you keep customer data in your tenant, inside your network boundary:

PUT .../accounts/{account}/projects/{project}/capabilityHosts/{name}?api-version=2025-06-01
{
  "properties": {
    "capabilityHostKind": "Agents",
    "threadStorageConnections": ["my-cosmosdb-connection"],
    "vectorStoreConnections":  ["my-ai-search-connection"],
    "storageConnections":      ["my-storage-connection"]
  }
}

Operational constraints worth internalising before you provision: there is one capability host per scope (a second attempt returns 409 Conflict), configuration is immutable (delete and recreate to change it), deletion is destructive, and the account-level host must exist before the project-level one.
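Because the configuration cannot be edited after creation, it can help to assemble and validate the request body in code before sending the PUT. The helper below is a hypothetical sketch that builds the same JSON shape shown above. It assumes a standard setup wants all three connections supplied, and it performs no HTTP call.

```python
import json


def build_capability_host_body(
    thread_storage_connection: str,
    vector_store_connection: str,
    storage_connection: str,
) -> dict:
    """Build the capability host request body, rejecting empty connection names."""
    connections = {
        "threadStorageConnections": thread_storage_connection,
        "vectorStoreConnections": vector_store_connection,
        "storageConnections": storage_connection,
    }
    missing = [name for name, value in connections.items() if not value.strip()]
    if missing:
        raise ValueError(f"Missing connection name(s): {', '.join(missing)}")
    properties: dict[str, object] = {"capabilityHostKind": "Agents"}
    # Each connection is sent as a single-element list, matching the body above.
    properties.update({name: [value] for name, value in connections.items()})
    return {"properties": properties}


body = build_capability_host_body(
    "my-cosmosdb-connection", "my-ai-search-connection", "my-storage-connection"
)
print(json.dumps(body, indent=2))
```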

Lesson 6 — Toolbox: govern tools once, reuse everywhere

Left unchecked, every team re-implements the same tools, scatters credentials, and loses governance visibility. The Microsoft Foundry Toolbox solves this by exposing a curated, versioned set of tools behind a single MCP-compatible endpoint, with credentials held in Foundry connections rather than agent code.

You build a toolbox version once:

from azure.ai.projects.models import MCPTool, ToolboxSearchPreviewTool, WebSearchTool

toolbox_version = project.toolboxes.create_toolbox_version(
    name="agent-tools",
    description="Web search + an MCP server + tool search",
    tools=[
        WebSearchTool(),
        MCPTool(
            server_label="myserver",
            server_url="https://your-mcp-server.example.com",
            require_approval="never",
            project_connection_id="my-key-auth-connection",  # credentials live in Foundry
        ),
        ToolboxSearchPreviewTool(),
    ],
)

Every agent then consumes it through one endpoint, with no per-team tool code:

from agent_framework import MCPStreamableHTTPTool

mcp_tool = MCPStreamableHTTPTool(
    name="toolbox",
    url=TOOLBOX_ENDPOINT,   # {project_endpoint}/toolboxes/{name}/mcp?api-version=v1
    http_client=http_client,
    load_prompts=False,
)
agent = chat_client.as_agent(name="my-toolbox-agent", instructions="...", tools=[mcp_tool])

Versioning is blue/green: create a new version, test it on its version-specific endpoint, then promote it to default and every consumer picks it up with zero code changes. A Guardrail (RAI) policy can be applied at the toolbox layer, independent of model-level content filters. Note that the toolbox management APIs are currently in preview; the portal or VS Code Foundry Toolkit are practical alternatives for creation today.

Lesson 7 — Multi-Agent & A2A: agents as networked peers

The final lesson contrasts two ways agents collaborate:

Handoff / Workflow — in-process, same codebase, fastest, tightest coupling.
Agent-to-Agent (A2A) — cross-process over an open protocol, so agents from different teams, orgs, or frameworks interoperate.

A2A gives each agent a discoverable Agent Card at /.well-known/agent-card.json and a task lifecycle (submitted → working → completed/failed). The elegant part: A2AExecutor wraps an existing MAF agent with no changes to that agent's code.

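from agent_framework.a2a import A2AExecutor
from a2a.server.apps import A2AStarletteApplication
from a2a.server.request_handlers import DefaultRequestHandler
from a2a.server.tasks import InMemoryTaskStore
from a2a.types import AgentCapabilities, AgentCard, AgentSkill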

agent_card = AgentCard(
    name="Coding Assistant",
    url="http://localhost:9000/",
    version="1.0.0",
    capabilities=AgentCapabilities(streaming=True),
    skills=[AgentSkill(id="generate-code", name="Generate code", tags=["code"])],
)
request_handler = DefaultRequestHandler(
    agent_executor=A2AExecutor(agent),      # wraps your existing MAF agent unchanged
    task_store=InMemoryTaskStore(),
)
app = A2AStarletteApplication(agent_card=agent_card, http_handler=request_handler).build()

Consuming a remote agent then looks exactly like calling a local one:

from agent_framework.a2a import A2AAgent

remote_agent = A2AAgent(name="remote-coding-assistant", url="http://localhost:9000")
result = await remote_agent.run("Write a Python function that reverses a string.")

Because an A2AAgent can be a participant inside a HandoffBuilder workflow, you can mix in-process routing with remote services in the same orchestration. For enterprise use, A2AAgent accepts an auth_interceptor for bearer tokens, and the Agent Card carries security_schemes.
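The task lifecycle mentioned earlier (submitted → working → completed/failed) can be pictured as a small state machine. The sketch below models only the states named in this post; the A2A protocol defines additional states, and the rule that a submitted task may fail before it starts working is an assumption. It does not use the a2a SDK's own types.

```python
from enum import Enum


class TaskState(Enum):
    SUBMITTED = "submitted"
    WORKING = "working"
    COMPLETED = "completed"
    FAILED = "failed"


# Allowed transitions; completed and failed are terminal.
ALLOWED_TRANSITIONS: dict[TaskState, set[TaskState]] = {
    TaskState.SUBMITTED: {TaskState.WORKING, TaskState.FAILED},
    TaskState.WORKING: {TaskState.COMPLETED, TaskState.FAILED},
    TaskState.COMPLETED: set(),
    TaskState.FAILED: set(),
}


def advance(current: TaskState, new: TaskState) -> TaskState:
    """Return the new state, or raise if the transition is not allowed."""
    if new not in ALLOWED_TRANSITIONS[current]:
        raise ValueError(f"Illegal transition: {current.value} -> {new.value}")
    return new


def is_terminal(state: TaskState) -> bool:
    return not ALLOWED_TRANSITIONS[state]
```

A client polling a remote agent can stop as soon as is_terminal returns True for the reported state.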

Responsible and secure by design

Production readiness in this course is not just uptime; it is also governance:

Identity over keys — AzureCliCredential and managed identity throughout; no secrets in code.
Least privilege — CI runners get a scoped Azure AI User role assignment on the specific project.
Data sovereignty — capability hosts keep conversation history, files, and embeddings in your own Cosmos DB, Storage, and AI Search.
Tool approval and guardrails — MCP approval_mode and toolbox-level RAI policy gate what agents can do.
Grounded evaluation — groundedness and tool-utilization scoring catch hallucination and unused-tool behaviour before users do.
Cost hygiene — the lessons create real Azure resources; delete the resource group when done: az group delete --name <rg> --yes --no-wait.

Key takeaways

Design as a graph of specialists. Handoff orchestration with tightly scoped agents beats one monolith on reliability and testability.
One .run() contract, many backends. The Agent Framework keeps orchestration code stable from local dev to hosted production.
Evaluate continuously. Tracing + smoke tests + model-based evaluators are three layers, not alternatives.
Separate compute from data. Hosted Agents run the agent; Capability Hosts give you sovereignty — you need both for enterprise.
Govern tools centrally. A versioned toolbox behind one MCP endpoint kills tool sprawl and credential duplication.
Open protocols for interop. A2A lets agents cross team, org, and framework boundaries without rewrites.

Get started

Clone the repo (skip the 50+ translations for a faster download) and work through the lessons in order:

git clone --filter=blob:none --sparse https://github.com/microsoft/Building-AI-Agents-From-Zero-To-Production.git
cd Building-AI-Agents-From-Zero-To-Production
git sparse-checkout set --no-cone '/*' '!translations' '!translated_images'

From Multi-Model Chaos to a Governed AI Gateway: Cost Optimization on Azure

jisunchoi — Tue, 14 Jul 2026 04:15:00 GMT

What is Multi-Model Chaos, and what cost and security challenges does it pose?

Multi-model chaos describes the sprawl that emerges when an organization rapidly adopts many large language and foundation models (OpenAI, Anthropic, Meta Llama, Mistral, and a long tail of open-source and fine-tuned variants) across teams and applications without any unifying control plane. Instead of a single governed entry point, each team wires its own keys, endpoints, SDKs, and prompts directly to whichever provider it prefers, leaving the enterprise with a fragmented, duplicated, and largely invisible AI estate.

On the cost side, this fragmentation makes spend almost impossible to predict or contain: identical workloads run against premium models when cheaper ones would suffice, token consumption goes unmeasured, redundant calls and missing caching inflate bills, and finance teams have no consolidated view to attribute usage back to a team, product, or customer.

On the security and governance side, the risks compound: API keys are scattered across code and config files, sensitive or regulated data flows to external endpoints with no data-loss prevention or residency guarantees, prompt-injection and jailbreak attempts go unmonitored, and there is no centralized authentication, rate limiting, auditing, or content filtering. The net effect is an uncontrolled attack surface and a compliance blind spot, which are precisely the conditions that motivate consolidating model access behind a governed AI gateway.

In short, multi-model chaos trades short-term speed for runaway costs and an unmanaged security risk, making a governed AI Gateway essential.

What is a Governed AI Gateway, and how does it help reduce cost and improve security?

A governed AI gateway is an enterprise control plane built on Azure API Management (APIM) that consolidates every model behind a single, governed endpoint. It unifies Azure OpenAI (the gpt-5.4 family) and Azure AI Foundry (open-source and partner models such as grok-4.3 and DeepSeek-V4-Pro), so consumers reach any of them through one consistent, policy-enforced entry point rather than a tangle of direct connections. Every backend is password-less, authenticated through managed identity, which eliminates scattered API keys. On top of this foundation, the gateway enforces per-consumer model permissions, token-based rate limits, and cost-based budget downgrade—automatically routing teams to more economical models as they approach their spend limits—all administered from a self-service Admin UI.

One governed endpoint for every backend. Azure OpenAI and Azure AI Foundry (OSS and partner) models are bundled behind a single governance endpoint. Each backend is reachable only over a private endpoint with key authentication disabled, so APIM authenticates using its own managed identity—no model keys ever live on the gateway.

Per-consumer governance, edited live in the Admin UI with no redeployment:

Allowed models — a consumer can call only the models explicitly granted to it; anything else returns a 403.

Rate limits — per-consumer TPM and token-quota tiers (small / medium / large), returning a 429 once exceeded.

Cost budget — a daily USD spend limit; when it is exceeded, requests are automatically downgraded to a cheaper model along a configured ladder, including cross-backend downgrades (e.g. gpt → OSS or OSS → gpt).

Self-service Admin UI (React + FastAPI, Entra ID login, gated to an admin group) to issue consumer keys, set model, limit, and budget policies, and review the usage dashboard and request logs.

Built-in observability — per-call token metrics, broken down by consumer and model, stream to Application Insights, surfaced through the Admin UI's usage dashboard and a request / blocked & downgrade-event log.

Flexible client authentication — an APIM subscription key by default, or an Entra ID JWT (client_auth_mode).

How is it different from APIM AI Gateway?

APIM already provides useful GenAI gateway primitives: token rate limiting, token-usage metrics, semantic caching, backend routing, endpoint import, authentication, authorization, and monitoring. The difference is that APIM provides the enforcement runtime and policy control point, but not the full operating model required to run a shared, multi-tenant AI platform across teams, models, and budgets.

Inside the policy pipeline, APIM remains the load-bearing layer: llm-token-limit enforces per-consumer token-per-minute and quota limits, llm-emit-token-metric streams token usage into our metrics namespace, and standard APIM capabilities handle endpoint exposure, access control, and platform monitoring.

The governed AI Gateway adds the governance layer APIM does not provide out of the box:

Self-service onboarding — a platform team can issue or revoke consumer keys and manage access from the Admin UI, without raising a pull request or redeploying infrastructure.

Per-consumer model entitlements — every consumer has an explicit allow-list of model deployments. The gateway calculates the effective allowed set per request and returns 403 when a caller asks for a model it is not entitled to use.

Live configuration without redeployment — entitlements, rate tiers, token quotas, budgets, and downgrade levels live in the configuration store. A sync worker projects those values into APIM named values continuously, so operational changes can take effect without a terraform apply while the policy logic stays version-controlled in IaC.
Managed-identity-only, private backends — key-based authentication is disabled on Azure OpenAI and Azure AI Foundry. APIM injects a managed identity token on every backend call, and the backends are reachable only over private endpoints.
Cost-based downgrade across backends — when a consumer approaches its budget, the gateway can route to a cheaper model while preserving availability, including cross-backend downgrades between Azure OpenAI and Azure AI Foundry.
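The sync worker described above is essentially a reconcile loop: read the desired settings from the configuration store, compare them with what APIM currently holds, and write only the differences. The sketch below shows that comparison step in isolation. The naming scheme for named values and the JSON encoding are hypothetical, and no APIM call is made.

```python
import json


def to_named_values(consumer_id: str, settings: dict[str, object]) -> dict[str, str]:
    """Flatten one consumer's settings into string-valued entries (naming is hypothetical)."""
    return {
        f"{consumer_id}-{key}": json.dumps(value, sort_keys=True)
        for key, value in settings.items()
    }


def plan_updates(desired: dict[str, str], current: dict[str, str]) -> dict[str, str]:
    """Return only the entries that are missing from APIM or hold a stale value."""
    return {
        name: value for name, value in desired.items() if current.get(name) != value
    }


desired = to_named_values(
    "team-a", {"allowed-models": ["m1", "m2"], "daily-budget-usd": 25}
)
current = {"team-a-allowed-models": '["m1"]', "team-a-daily-budget-usd": "25"}
print(plan_updates(desired, current))
```

Writing only the changed entries keeps each sync cycle cheap and makes the worker safe to run continuously.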

APIM’s AI gateway is the enforcement runtime while the governed AI Gateway is the platform operating model around it. APIM handles the gateway primitives extremely well, while our governance layer adds identity, self-service administration, entitlement management, live configuration, cost controls, and cross-model routing so teams can safely consume multiple models without creating new cost, security, or compliance sprawl.

Solution overview

Figure 1 shows the end-to-end architecture of the governed AI gateway. Client applications never talk to the models directly; instead, every request passes through Azure API Management, which acts as the single governed entry point that authenticates callers, applies per-consumer policy, and routes traffic privately to the appropriate model backend. Around this gateway sit the supporting planes for administration, identity, and observability, giving the organization one consistent place to control access, contain cost, and monitor usage across both Azure OpenAI and Azure AI Foundry models. This solution is also completely serverless.

Key components:

Client / consumer applications — the apps and services that call for model inference, each identified by its own consumer key or Entra ID identity.

Azure API Management (the gateway) — the single governance endpoint that handles authentication, allowed-model checks, rate limiting, and cost-based budget downgrade before any request reaches a model.

Model backends — Azure OpenAI (the gpt-5.4 family) and Azure AI Foundry (OSS and partner models such as grok-4.3 and DeepSeek-V4-Pro), each reachable only over a private endpoint.

Microsoft Entra ID — provides identity for both clients (optional JWT auth) and the gateway's own managed identity used to reach the backends without password credentials.

Admin UI (React + FastAPI) — the self-service control plane for issuing consumer keys and setting model, rate-limit, and budget policies.

Application Insights — collects per-call token metrics by consumer and model, powering the usage dashboard and request / blocked-event logs.

Figure 1: Solution architecture diagram

Request flow

Authenticate — a client calls the gateway with an APIM subscription key (or an Entra ID JWT) instead of any model key.
Authorize the model — APIM checks whether the consumer is permitted to call the requested model; if not, it returns 403.
Enforce limits — the gateway applies the consumer's TPM and token-quota tier, returning 429 when the limit is exceeded.
Apply the cost budget — if the consumer's daily USD budget is exhausted, the request is automatically downgraded to a cheaper model along the configured ladder.
Route to the backend — APIM forwards the request over a private endpoint, authenticating with its managed identity to Azure OpenAI or Azure AI Foundry.
Return and record — the model response is returned to the client while per-call token metrics are emitted to Application Insights and surfaced in the Admin UI dashboard and logs.
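
The order of the steps above matters, and a short Python sketch makes it explicit. Everything here is illustrative: the Consumer fields, the handle_request function, and the 401 for a missing key are assumptions, and the final step (emitting token metrics) is left out.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple


@dataclass
class Consumer:
    key: str
    allowed_models: List[str]
    tpm_limit: int            # tokens per minute allowed by the rate tier
    tokens_this_minute: int   # tokens already consumed in the current window
    downgrade_level: int      # 0 = no downgrade; otherwise an index into ladder
    ladder: List[str]         # models ordered from most to least expensive


def handle_request(consumers: Dict[str, Consumer], key: Optional[str],
                   model: str, estimated_tokens: int) -> Tuple[int, Optional[str]]:
    """Walk the gateway steps in order; return (status, effective model)."""
    consumer = consumers.get(key) if key else None
    if consumer is None:
        return 401, None                      # 1. authenticate (status assumed)
    if model not in consumer.allowed_models:
        return 403, None                      # 2. authorize the model
    if consumer.tokens_this_minute + estimated_tokens > consumer.tpm_limit:
        return 429, None                      # 3. enforce limits
    effective = model
    if consumer.downgrade_level > 0 and consumer.ladder:
        level = min(consumer.downgrade_level, len(consumer.ladder) - 1)
        effective = consumer.ladder[level]    # 4. apply the cost budget
    return 200, effective                     # 5. route to the backend
```

Authorization runs before rate limiting, so a caller asking for a model it is not entitled to gets a 403 even when it is also over its limit.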

Implement the solution

This section describes how to deploy the solution architecture.

In this post, you’ll perform the following tasks:

Create APIM

Create Cosmos DB

Create Microsoft Foundry with gpt-5.4, gpt-5.4-mini, DeepSeek-V4-Pro, and grok-4.3 deployed

Create the Admin UI on Container Apps

Create a consumer with an APIM subscription key on the Admin UI

Integrate the APIM endpoint with GitHub Copilot Chat and Copilot CLI

Create a budget and rate limit in the Admin UI

Simulate and validate the auto-downgrade feature

Ensure that you have the following prerequisites in place before moving to the next section:

An Azure subscription with model quota (Azure OpenAI and, optionally, Azure AI Foundry models).

Tools: Git, Terraform ≥ 1.7, and Azure CLI, signed in to the subscription with az login. Container images are built remotely in Azure Container Registry, so Docker is not required.

VS Code and Copilot CLI

Deploy the Azure AI Gateway

Clone the repository from https://github.com/microsoft/apim-foundry-governance

git clone https://github.com/microsoft/apim-foundry-governance
cd apim-foundry-governance
git checkout english

By default, the solution deploys to the koreacentral region. Export your own values if needed.

export location=eastus2
export backend_rg=rg-aigw-tfstate-dev-eastus2
export storage_prefix=staigwtfstate
export state_key=ai-gateway-eus2.tfstate

Bootstrap the Terraform state backend (once per subscription)

This creates a resource group and storage account in eastus2 for remote state (Entra auth, public blob access blocked). Shell variable names cannot contain hyphens, so the variables use underscores.

./scripts/bootstrap-backend.sh \
  --location "$location" \
  --backend-rg "$backend_rg" \
  --storage-prefix "$storage_prefix" \
  --state-key "$state_key"

Set Terraform variables

cp infra/terraform.tfvars.example infra/terraform.tfvars
# Edit infra/terraform.tfvars: prefix, location, owner, cost_center, apim_publisher_*, budget_*

Create the Gateway Core

On the first apply, leave worker_image and admin_ui_image empty (default ""). The images don't exist yet, and the worker Job / Admin UI app are count-gated on these variables.

cd infra
terraform init   # If you are moving an existing state from another backend, run `terraform init -migrate-state` instead.
terraform apply

Build and push the container images with app registrations

After the registry is created, build the worker and Admin UI images remotely (no local Docker needed).

acr=$(terraform output -raw registry_login_server)
reg=$(terraform output -raw registry_name)
az acr build --registry "$reg" --image config-sync-worker:latest ../app/config-sync-worker
az acr build --registry "$reg" --image admin-ui:latest ../app/admin-ui

The worker and Admin UI require Entra app registrations before a user can access the frontend. Create the admin security group, the BFF API app registration, and the SPA public-client app registration.

./scripts/app-registration.sh

Enable the worker and Admin UI

From the output above, populate the image references and the three Entra variables from the prerequisites into infra/terraform.tfvars and apply again.

worker_image          = "<registry_login_server>/config-sync-worker:latest"
admin_ui_image        = "<registry_login_server>/admin-ui:latest"
admin_ui_public       = true   # external FQDN (still Entra-gated). false = VNet-only
admin_group_object_id = "<entra security group object id>"
bff_api_audience      = "api://<bff app id>"
spa_client_id         = "<spa app id>"
entra_tenant_id       = "<tenant id>"

Cosmos DB seed configuration

Cosmos DB is private with key auth disabled, so the initial config is seeded from a jumpbox inside the VNet. The default setting of enable_jumpbox = true in infra/terraform.tfvars causes Terraform to provision the jumpbox VM, grant its managed identity the Cosmos DB Built-in Data Contributor role (scoped to the config container), and run a VM run-command that seeds both documents automatically:

Global config (id=global) — allowed models + token limits.

Per-model pricing (id=pricing) — prompt/completion rates for cost-based budgeting.
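
To show how prompt and completion rates turn into a per-call cost for budgeting, here is a minimal Python sketch. The document shape, the field names, and every rate below are placeholders invented for illustration; they are not the seeded schema or real prices.

```python
from decimal import Decimal

# Hypothetical pricing document; field names and rates are placeholders.
PRICING = {
    "id": "pricing",
    "models": {
        "gpt-5.4": {"prompt_per_1k": "0.0100", "completion_per_1k": "0.0300"},
        "gpt-5.4-mini": {"prompt_per_1k": "0.0010", "completion_per_1k": "0.0040"},
    },
}


def call_cost_usd(model: str, prompt_tokens: int, completion_tokens: int,
                  pricing: dict = PRICING) -> Decimal:
    """Cost of one call: each token count is priced per 1,000 tokens."""
    rates = pricing["models"][model]
    prompt = Decimal(prompt_tokens) / 1000 * Decimal(rates["prompt_per_1k"])
    completion = Decimal(completion_tokens) / 1000 * Decimal(rates["completion_per_1k"])
    return prompt + completion


if __name__ == "__main__":
    # 2,000 prompt tokens and 500 completion tokens on the larger model
    print(call_cost_usd("gpt-5.4", 2000, 500))
```

Decimal is used instead of float so that many small per-call costs can be summed against a daily budget without rounding drift.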

To seed manually instead (jumpbox connected via Bastion), the same scripts can be run directly:

# Global allowed models + limits
./scripts/seed-cosmos-jumpbox.sh https://<cosmos-account>.documents.azure.com:443/

# Per-model pricing (for cost-based budgeting)
./scripts/seed-pricing-jumpbox.sh https://<cosmos-account>.documents.azure.com:443/

Access the Admin UI

Update the SPA redirect URI with your Container Apps URL

spa_app_id="$(az ad app list --display-name "AI Gateway SPA" --query "[].appId" -o tsv)"   # spa_client_id
fqdn=$(terraform output -raw admin_ui_fqdn)   # run from infra/
oid=$(az ad app show --id "$spa_app_id" --query id -o tsv)
az rest --method PATCH \
  --uri "https://graph.microsoft.com/v1.0/applications/$oid" \
  --headers 'Content-Type=application/json' \
  --body "{\"spa\":{\"redirectUris\":[\"https://$fqdn\"]}}"

Browse to the admin_ui_fqdn, which is also the Container Apps FQDN. You will need to log in with Entra ID (users must be added to the Entra group before they can log in). Register the consumer with a name and issue the API key. The API key is the APIM subscription key and is shown only once in the UI, so copy it somewhere safe.

Figure 2: AI Gateway Consumers and Keys

Next, in the left-hand menu, click Budgets. This sets the daily budget a user is allowed to consume and is also where the model downgrade logic is configured. For demonstration purposes, set a low budget of $1.80 and select the model priority order for the downgrade. In this case, gpt-5.4 is used first, followed by gpt-5.4-mini, DeepSeek, and then Grok.

Figure 3: AI Gateway Budgets

Lastly, in the left-hand menu, select Rate limits. This sets the number of tokens a user can consume in a day; the limit resets after 24 hours. Select the large tier.

Figure 4: AI Gateway Rate Limits

Browse to Dashboard. It shows token usage and request status codes, grouped by consumer and model. You can also view the budget downgrade for a specific user.

Figure 5: AI Gateway Captions

Integrate the endpoint with GitHub Copilot Chat in VS Code

In VS Code, press Ctrl+Shift+P and select “Chat: Manage Language Model”. Select Add Models and choose Azure.

Figure 6: Add models to GitHub Copilot Chat

Follow the prompts. VS Code will create or edit a chatLanguageModels.json file, which should look like the following. Note that the URL must use the /vscode path.

[
    {
        "name": "Azure",
        "vendor": "azure",
        "models": [
            {
                "id": "gpt-5.4",
                "name": "gpt-5.4 (APIM)",
                "url": "https://<REPLACE WITH YOUR APIM ENDPOINT>.azure-api.net/vscode/openai/deployments/gpt-5.4/chat/completions?api-version=2025-01-01-preview",
                "toolCalling": true,
                "vision": true,
                "maxInputTokens": 128000,
                "maxOutputTokens": 16000,
                "requestHeaders": {
                    "Ocp-Apim-Subscription-Key": "<REPLACE WITH YOUR SUBSCRIPTION KEY>"
                }
            }
        ]
    }
]

Now select the gpt-5.4 (APIM) model and ask it a question.

Integrate the endpoint with Copilot CLI

Because Copilot CLI only accepts api-key headers, a separate API is used. Replace the placeholders and export the following variables before using Copilot CLI.

export COPILOT_PROVIDER_TYPE="azure"
export COPILOT_PROVIDER_BASE_URL="<REPLACE WITH YOUR APIM ENDPOINT>"
export COPILOT_PROVIDER_API_KEY="<REPLACE WITH YOUR SUBSCRIPTION KEY>"
export COPILOT_MODEL="gpt-5.4"
export COPILOT_PROVIDER_AZURE_API_VERSION="2025-01-01-preview"
export COPILOT_PROVIDER_MODEL_ID="gpt-5.4"

You should see a similar response.

Figure 7: Integration of APIM with Copilot CLI

Simulate the downgrade feature

Continue to ask more questions to consume more tokens. Once spend reaches the 80% cost threshold, the tag switches to “Auto-switch level 1”, meaning future requests are downgraded to gpt-5.4-mini.

Figure 8: AI Gateway Downgrade Feature
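
The downgrade behavior can be sketched as a threshold lookup over the configured ladder. Only the 80% threshold for level 1 comes from the walkthrough; the 90% and 100% thresholds, the function names, and the use of integer cents are assumptions made for illustration.

```python
from typing import List, Sequence, Tuple

# Ladder order taken from the budget configuration in this walkthrough.
LADDER: List[str] = ["gpt-5.4", "gpt-5.4-mini", "DeepSeek-V4-Pro", "grok-4.3"]

# Only 80 is stated in the walkthrough; 90 and 100 are assumed values.
THRESHOLDS_PCT: Sequence[int] = (80, 90, 100)


def downgrade_level(spent_cents: int, budget_cents: int,
                    thresholds_pct: Sequence[int] = THRESHOLDS_PCT) -> int:
    """Count how many percentage thresholds the daily spend has reached."""
    if budget_cents <= 0:
        raise ValueError("budget_cents must be positive")
    # Integer arithmetic avoids float rounding exactly at a threshold.
    return sum(1 for pct in thresholds_pct
               if spent_cents * 100 >= pct * budget_cents)


def effective_model(spent_cents: int, budget_cents: int,
                    ladder: Sequence[str] = LADDER) -> Tuple[int, str]:
    """Map the downgrade level to a model, clamped to the end of the ladder."""
    level = min(downgrade_level(spent_cents, budget_cents), len(ladder) - 1)
    return level, ladder[level]


if __name__ == "__main__":
    # $1.44 spent of a $1.80 daily budget is exactly 80%
    print(effective_model(144, 180))
```

With the $1.80 budget used here, level 1 starts at $1.44 of spend under these assumptions.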

Validate by running this command in your terminal with your own endpoint and API key.

curl -sS -i -X POST "https://<REPLACE>.azure-api.net/openai/deployments/gpt-5.4/chat/completions?api-version=2025-01-01-preview" \
  -H "api-key: <REPLACE WITH API KEY>" \
  -H "Content-Type: application/json" \
  -d '{"messages":[{"role":"user","content":"hi"}],"max_completion_tokens":8}'

Inspect the response headers. You should see that the downgrade level is 1 and the effective model is gpt-5.4-mini, even though the request targeted the gpt-5.4 endpoint.

Figure 9: Model downgrade
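
If you would rather check the headers in a script than by eye, the output of curl -i can be parsed with a few lines of Python. This is a sketch: the header names in the sample (x-downgrade-level, x-effective-model) are hypothetical stand-ins, so the filter matches on keywords instead of exact names.

```python
from typing import Dict, Sequence

# Hypothetical sample of `curl -i` output; the header names are assumptions.
SAMPLE_CURL_OUTPUT = (
    "HTTP/1.1 200 OK\r\n"
    "content-type: application/json\r\n"
    "x-downgrade-level: 1\r\n"
    "x-effective-model: gpt-5.4-mini\r\n"
    "\r\n"
    '{"choices": []}'
)


def parse_response_headers(raw: str) -> Dict[str, str]:
    """Parse the header block of `curl -i` output into a lowercase-keyed dict."""
    head = raw.replace("\r\n", "\n").split("\n\n", 1)[0]
    headers: Dict[str, str] = {}
    for line in head.split("\n")[1:]:          # skip the status line
        name, sep, value = line.partition(":")
        if sep:
            headers[name.strip().lower()] = value.strip()
    return headers


def downgrade_headers(headers: Dict[str, str],
                      keywords: Sequence[str] = ("downgrade", "model")) -> Dict[str, str]:
    """Keep only headers whose name mentions one of the keywords."""
    return {k: v for k, v in headers.items() if any(w in k for w in keywords)}


if __name__ == "__main__":
    print(downgrade_headers(parse_response_headers(SAMPLE_CURL_OUTPUT)))
```

Pipe the real curl output into parse_response_headers and adjust the keywords once you know the header names your gateway emits.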

Conclusion

This post started with the problem of multi-model chaos: teams moving quickly with different models, endpoints, SDKs, keys, quotas, and cost profiles, but without a common control plane, which results in ineffective cost control and the risk of leaked model API keys. The governed AI Gateway addresses that by putting Azure OpenAI and Azure AI Foundry behind a single APIM-based entry point, where access, limits, routing, identity, telemetry, and budget behavior can be applied consistently for every consumer.

We also walked through how the gateway is different from APIM’s native AI gateway capabilities. APIM provides the enforcement runtime and the GenAI policy primitives, such as token limits, token metrics, semantic caching, and backend routing. The governed AI Gateway builds the operating model around those primitives: self-service onboarding, per-consumer model entitlements, live configuration without redeployment, managed-identity-only private backends, per-call cost telemetry, and cost-based downgrade across model providers.

From there, we integrated the APIM endpoint with GitHub Copilot Chat and Copilot CLI, and validated the downgrade behavior when spend thresholds were reached. The result is not just an AI proxy, but a reusable enterprise pattern for running AI access as a governed platform: developers keep a simple model endpoint experience, while the platform team keeps control over security, cost, observability, and operational change.

Overall, this post helps organizations bring multi-model AI usage under one governed entry point, reducing sprawl across endpoints, keys, policies, and cost controls. It also gives platform teams centralized control over model access, rate limits, budgets, telemetry, and private backend access while preserving a simple endpoint experience for developers.


Creating Autonomous Teams Agents Using OpenClaw, MCP, and Azure Container Apps

kinfey — Mon, 13 Jul 2026 07:39:45 GMT

The one shift that changes everything

For two years, "AI coding" meant autocomplete. A suggestion appears in your editor, you hit tab, you move on. The agent only existed while you were actively typing.

That is no longer the only model. A new category of tools runs asynchronously and autonomously: you message the agent from a chat window — Teams, Slack, Telegram — describe what you want, and walk away. The agent plans, writes code, runs tests, deploys, and hands you back a result. Some of them never sleep: they hold a persistent memory, load their own skills, and act on a schedule without being prompted.

This is the world of OpenClaw, Hermes Agent, and the other long-running autonomous agents that exploded across developer culture in 2026. OpenClaw alone crossed 377,000 GitHub stars and millions of active users, becoming — for a while — the most-starred project on GitHub. You install it with one line, connect a channel, and start delegating from your phone.

The workflow moves from pair programming to delegation and review. The interactive copilot asks, "What should I write next?" The autonomous agent asks, "What do you need done?"

And that reframing is exactly why three questions now keep architects awake:

Is it safe? You are handing a self-driving process the ability to run shell commands, touch files, and call APIs. One community report memorably described these agents as a teammate in your group chat who happens to have root access to your codebase. That is not a compliment — it is a threat model.
Can it fit into real multi-agent work? A single agent is a demo. Production is a fleet — specialists that hand off to each other with gates in between.
Is it flexible and controllable? Autonomy is thrilling right up until the agent packages last week's stale files into this week's deliverable, or loops forever on a failing test.

This post answers all three — not with hand-waving, but with a working reference implementation you can clone today: CustomCodingAgentApp in the Multi-AI-Agents-Cloud-Native repo, an "Agentic Prototype Factory" that turns a plain-language idea into a tested, live-on-Azure prototype without leaving the chat window.

A product manager types "Build a BBC-style World Cup feature page" in Microsoft Teams. Minutes later they get back a running HTTPS URL and a downloadable source ZIP. Under the hood, five specialized OpenClaw agents powered by Microsoft Foundry gpt-5.5 collaborate in a shared sandbox, run real pytest/Jest suites, and ship the result to Azure Container Apps — all orchestrated behind a Model Context Protocol (MCP) service so any MCP client (GitHub Copilot, Claude, the Teams bot) can drive it.

We'll build up to that architecture in the order you should learn it.

Part 1 — Long-running autonomous agents, and their two hard problems

What actually makes them different

A traditional chatbot is text in, text out. It waits for you. An autonomous agent inverts that:

Property	Traditional chatbot	Long-running autonomous agent
Execution	Responds to a prompt	Acts proactively (a "heartbeat" wakes it on a schedule)
Scope	Words	Files, shell, browser, APIs — the real machine
Memory	This session only	Persistent across sessions
Interface	A web box	Any chat channel + the terminal
Autonomy	None	Plans and takes multi-step action on its own

Architecturally, OpenClaw is not a library you import — it's a runtime. A single long-running process (the Gateway) bridges your messaging channels to an LLM backend, keeps sessions alive, queues work in ordered lanes, and drives the classic agent loop: call the model → execute the tool calls it asks for → feed results back → repeat until done. There is no rigid step-planner; the model itself steers. That is what makes it feel magical — and what makes it hard to contain.
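
The agent loop described above (call the model, execute the tool calls it asks for, feed results back, repeat) fits in a short Python sketch. This is not OpenClaw's code: the message shape, the run_agent_loop and fake_model names, and the max_turns safety cap are all assumptions for illustration.

```python
from typing import Any, Callable, Dict, List

Message = Dict[str, Any]


def run_agent_loop(call_model: Callable[[List[Message]], Message],
                   tools: Dict[str, Callable[..., str]],
                   user_request: str,
                   max_turns: int = 10) -> str:
    """Loop until the model replies without asking for any tool call."""
    messages: List[Message] = [{"role": "user", "content": user_request}]
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls") or []
        if not tool_calls:
            return str(reply.get("content", ""))      # the model is done
        for call in tool_calls:
            result = tools[call["name"]](**call["arguments"])
            messages.append({"role": "tool", "name": call["name"],
                             "content": result})      # feed the result back
    raise RuntimeError("agent did not finish within max_turns")


def fake_model(messages: List[Message]) -> Message:
    """Stand-in for an LLM: asks for one tool call, then answers."""
    if not any(m["role"] == "tool" for m in messages):
        return {"role": "assistant", "content": "",
                "tool_calls": [{"name": "add", "arguments": {"a": 2, "b": 3}}]}
    return {"role": "assistant",
            "content": f"The sum is {messages[-1]['content']}."}


if __name__ == "__main__":
    tools = {"add": lambda a, b: str(a + b)}
    print(run_agent_loop(fake_model, tools, "What is 2 + 3?"))
```

Nothing in the loop decides what happens next except the model's reply, which is why the containment has to come from outside the loop.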

That containment problem has two faces.

Hard problem #1 — Security

The same properties that make an autonomous agent useful make it dangerous. Full system access + proactive execution + a 32,000-server tool ecosystem is a large, self-driving attack surface. OpenClaw's own short history is the cautionary tale: a critical one-click remote-code-execution CVE early in its life, hundreds of malicious community "skills" discovered on its marketplace, and tens of thousands of gateways found exposed on the open internet. None of this means "don't use autonomous agents." It means: never run one with ambient credentials on a machine you care about. The agent belongs in a box with a hard wall around it.

Hard problem #2 — Persistence and continuity

Real agent work is long. Refactoring a codebase, researching across dozens of pages, building-testing-deploying an app — these take minutes to hours, far past a single request/response. So the runtime needs durable sessions, a place to keep state, and a workspace that survives across steps. But a persistent workspace that is reused creates its own hazard: state leakage. Files from yesterday's task can contaminate — or get shipped inside — today's result. Continuity and cleanliness pull in opposite directions, and you have to engineer the tension out.

One agent is a demo; production is a fleet

A single monolithic agent asked to "gather requirements, write the code, test it, deploy it, and package it" will do all five mediocrely and blur the boundaries between them. The production pattern is orchestrator-worker: specialized agents, each with one job, handing off to the next through explicit gates. OpenClaw supports exactly this: it can spawn sub-agents and even dispatch external coding harnesses, acting as a meta-orchestrator rather than a single model. The open question is never whether to go multi-agent; it's where the seams and the guardrails go.

The answer to "is it safe?": put the agent in a microVM

If the agent needs root to be useful, then give it root — inside a disposable microVM, not on your host. In 2026 there are several credible ways to do this:

Kata Containers on AKS — each pod gets its own lightweight VM boundary and guest kernel.
Hyperlight Wasm — per-call, snapshot-restored Wasm microVMs for running LLM-generated code.
Azure Container Apps dynamic sessions — prewarmed, Hyper-V-isolated sandboxes that start in milliseconds, scale to thousands, and are purpose-built for "secure execution of custom code" and "running LLM-generated scripts."

That last one — the ACA sandbox — is the sweet spot for a chat-driven agent factory: strong isolation without you operating a Kubernetes cluster, and an exec API to run commands inside the box. It's what the reference implementation uses.

Part 2 — Putting OpenClaw into the ACA sandbox

Here is where the repo stops being a diagram and becomes running code. The Agentic Prototype Factory decomposes the "idea → live app" job into five specialized OpenClaw agents that run in sequence, all inside the sandbox:

requirements → coding → testing → deployment → save

Each is addressable as its own model target on the OpenClaw gateway's OpenAI-compatible API:

model value	Routes to
openclaw / openclaw/default	Default agent
openclaw/requirements-agent	Requirement Agent
openclaw/coding-agent	Coding Agent
openclaw/testing-agent	Testing Agent
openclaw/deployment-agent	Deployment Agent
openclaw/save-agent	Save & download Agent

Control, not vibes: review gates with feedback loops

Autonomy without gates is how you get an agent that confidently deploys a broken app. The orchestrator wires the five agents into a graph with hard, bounded gates.

Every knob is explicit and lives in server.py: _MAX_TEST_ROUNDS = 3, _MAX_DEPLOY_REVIEW = 2, _DEPLOY_POLL_ATTEMPTS = 12, _DEPLOY_POLL_DELAY_S = 20. The Testing Agent must end each turn with a literal TESTS_PASSED / TESTS_FAILED verdict; the orchestrator won't declare success until it HTTP-checks the deployed URL and inspects the response body — because a ResourceNotFound can happily return an HTTP 200. That is what "flexible and controllable" looks like in practice: the LLM drives creatively inside a deterministic state machine.
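
A bounded test gate of this kind can be sketched in Python. The constant value and the verdict strings come from the text above; the function names, the callback shapes, and the exact body check are assumptions, not the contents of server.py.

```python
from typing import Callable, Optional

_MAX_TEST_ROUNDS = 3  # value quoted in the text above


def parse_verdict(agent_output: str) -> Optional[bool]:
    """True for TESTS_PASSED, False for TESTS_FAILED, None when the last
    non-empty line carries no verdict."""
    lines = [ln.strip() for ln in agent_output.splitlines() if ln.strip()]
    last = lines[-1] if lines else ""
    if last.endswith("TESTS_PASSED"):
        return True
    if last.endswith("TESTS_FAILED"):
        return False
    return None


def run_test_gate(run_coding: Callable[[str], None],
                  run_testing: Callable[[], str],
                  max_rounds: int = _MAX_TEST_ROUNDS) -> bool:
    """Code, test, and loop back with feedback at most max_rounds times."""
    feedback = ""
    for _ in range(max_rounds):
        run_coding(feedback)
        output = run_testing()
        if parse_verdict(output) is True:
            return True
        feedback = output          # failed or missing verdict: try again
    return False


def looks_deployed(status: int, body: str) -> bool:
    """Body-aware check: a 200 whose body reports an error is not a success."""
    return status == 200 and "ResourceNotFound" not in body
```

A missing verdict is treated the same as a failure, so an agent that forgets the literal marker cannot slip through the gate.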

The deterministic pre-run wipe (solving state leakage)

Because the sandbox is reused across runs (fast, cheap), the orchestrator does something disciplined before every run: it wipes all lingering agent workspaces. Stale files from a previous task can never leak into — or be packaged as — the new result. This is the engineered answer to Hard Problem #2.
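
A pre-run wipe like this can be as simple as emptying a workspace root before each run. The sketch below is illustrative only: wipe_agent_workspaces and the directory layout are hypothetical, since the text does not show the orchestrator's actual paths.

```python
import shutil
import tempfile
from pathlib import Path
from typing import List


def wipe_agent_workspaces(root: Path) -> List[str]:
    """Delete everything under root but keep root itself.

    Returns the names that were removed, in sorted order.
    """
    removed: List[str] = []
    root.mkdir(parents=True, exist_ok=True)
    for entry in sorted(root.iterdir()):
        if entry.is_dir() and not entry.is_symlink():
            shutil.rmtree(entry)       # a whole agent workspace
        else:
            entry.unlink()             # a stray file or symlink
        removed.append(entry.name)
    return removed


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp) / "workspaces"          # hypothetical layout
        (root / "coding-agent").mkdir(parents=True)
        (root / "coding-agent" / "stale.py").write_text("x = 1")
        print(wipe_agent_workspaces(root))
```

Running the wipe unconditionally, not only after a failed run, is what makes it deterministic: no code path can skip it.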

Working with the sandbox's limits, not against them

The ACA sandbox exec API is hard-capped at ~120 seconds — shorter than a cold az acr build plus az containerapp create. A naive agent would time out and report failure. The clever bit: those commands finish server-side on Azure even after the client exec disconnects. So deployment is split in two:

deploy-build <dir> <app> — installs the deploy helpers, writes a tight .dockerignore, and kicks off the ACR build tagged <app>:latest. If the client drops at ~120s, the image still lands in ACR.
deploy-finish <app> — idempotent, polled up to 12×. It reports STILL_BUILDING until the image exists, then fires a --no-wait containerapp create, and finally returns DEPLOYED_URL=https://<fqdn>.

This is the single most important lesson of the whole sample: an autonomous agent doesn't need a longer timeout — it needs to understand the durability semantics of the platform it runs on.
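
The polling half of that split can be sketched as follows. The two constants and the STILL_BUILDING / DEPLOYED_URL= markers come from the text; poll_deploy_finish and its callback are hypothetical, and the callback stands in for whatever runs deploy-finish through the sandbox exec API.

```python
import time
from typing import Callable, Optional

_DEPLOY_POLL_ATTEMPTS = 12   # values quoted earlier in this post
_DEPLOY_POLL_DELAY_S = 20


def poll_deploy_finish(run_deploy_finish: Callable[[], str],
                       attempts: int = _DEPLOY_POLL_ATTEMPTS,
                       delay_s: float = _DEPLOY_POLL_DELAY_S,
                       sleep: Callable[[float], None] = time.sleep) -> Optional[str]:
    """Call the idempotent finish step until it reports a URL or we give up."""
    for attempt in range(attempts):
        output = run_deploy_finish()           # each call stays well under the exec cap
        for line in output.splitlines():
            if line.startswith("DEPLOYED_URL="):
                return line.split("=", 1)[1].strip()
        if attempt < attempts - 1:
            sleep(delay_s)                     # still building: wait and retry
    return None


if __name__ == "__main__":
    replies = iter(["STILL_BUILDING", "STILL_BUILDING",
                    "DEPLOYED_URL=https://app.example.test"])
    print(poll_deploy_finish(lambda: next(replies), sleep=lambda s: None))
```

Because every call is short and idempotent, losing one exec connection costs a retry, not the deployment.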

Part 3 — MCP, and why its security is the whole ballgame

The five-agent workflow is powerful, but it would be a silo if the only way to reach it were a bespoke API. Instead, the repo wraps the entire orchestration as a Model Context Protocol (MCP) service (acamcp_node) exposed over streamable HTTP at /mcp, with a tiny, legible tool surface:

MCP tool	What it does
generate_prototype	Run the full five-agent workflow end to end
run_agent	Invoke a single named agent
check_gateway_health	Liveness / readiness of the OpenClaw gateway

The payoff is enormous: any MCP client can now drive the factory — GitHub Copilot, Claude, or the Teams bot we're about to meet. One protocol, many front-ends.

But MCP is not just an integration convenience — it's a control plane, and every MCP tool is a privileged capability. In an ecosystem with 32,000+ community servers, "just add an MCP server" is a supply-chain decision. A tool call is code execution by another name. So the security posture has to be deliberate. Here is how the reference implementation hardens it — and the principles are portable to any MCP deployment:

Auth in front of the protocol. The MCP ingress sits behind basic auth (MCP_BASIC_AUTH_PASSWORD); the gateway itself requires the gateway token as a bearer credential (Authorization: Bearer <token>). No anonymous tool calls.
A tiny, named allowlist — not a blank check. The gateway routes only to six explicit model targets. There is no "run arbitrary agent" escape hatch; the routing table is the allowlist.
No secrets in the workload. There are no model API keys anywhere in the running containers — model access is brokered entirely through Entra ID managed identities. The gateway token is stored as a Kubernetes secret and never baked into an image.
Private by default. The gateway's OpenAI-compatible endpoint is operator-level access — it stays on private ingress, with TLS and authentication added before anything is ever exposed publicly.
Least privilege at the identity layer. The gateway is granted exactly the Foundry roles it needs (Cognitive Services User / Cognitive Services OpenAI User) on the Foundry resource — nothing more.
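
The first two principles above, authentication in front of the protocol and a routing table that doubles as the allowlist, can be sketched in Python. The model values come from the routing table earlier in this post; check_request, the short target labels, and the 401/400 status choices are assumptions for illustration.

```python
import hmac
from typing import Mapping, Tuple

# Model values from the routing table; the right-hand labels are illustrative.
MODEL_TARGETS = {
    "openclaw": "default",
    "openclaw/default": "default",
    "openclaw/requirements-agent": "requirements",
    "openclaw/coding-agent": "coding",
    "openclaw/testing-agent": "testing",
    "openclaw/deployment-agent": "deployment",
    "openclaw/save-agent": "save",
}


def check_request(headers: Mapping[str, str], model: str,
                  expected_token: str) -> Tuple[int, str]:
    """Return (status, detail); detail is the routed target on success."""
    scheme, _, token = headers.get("Authorization", "").partition(" ")
    # compare_digest avoids leaking the token through timing differences
    if scheme != "Bearer" or not hmac.compare_digest(
            token.encode(), expected_token.encode()):
        return 401, "missing or invalid bearer token"
    if model not in MODEL_TARGETS:
        return 400, "unknown model target"     # no arbitrary-agent escape hatch
    return 200, MODEL_TARGETS[model]


if __name__ == "__main__":
    headers = {"Authorization": "Bearer s3cret"}   # placeholder token
    print(check_request(headers, "openclaw/coding-agent", "s3cret"))
    print(check_request(headers, "openclaw/shell", "s3cret"))
```

Keeping the allowlist in the same table that does the routing means a target cannot be reachable without also being listed.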

The takeaway for MCP is the same as for the agent itself: treat the protocol as a doorway, and put a guard on the door. Authentication, an explicit allowlist, private ingress, and brokered identity turn MCP from an open blast radius into a governed control plane.

Part 4 — The complete solution: Teams + MCP on ACA + OpenClaw on the ACA sandbox

Now assemble the three deployable components into one loop.

The request lifecycle, end to end

A PM sends one sentence in Teams. The teamsbot_app bot — acting as an MCP client via mcpClient.ts — opens an MCP handshake and calls generate_prototype.
The MCP service on ACA (acamcp_node) runs the orchestrator: pre-run wipe, then requirements → coding → testing.
The OpenClaw gateway in the ACA sandbox (acasbxapp_node) executes each agent, talking to Foundry gpt-5.5 through a managed identity — no keys in the box.
Real pytest + Jest suites run inside the sandbox. Fail → loop back (bounded). Pass → deploy.
Deployment uses the build + poll split to survive the ~120s exec cap; the app lands in Azure Container Apps and is health-checked body-aware at its live URL.
The Save Agent produces an authenticated ZIP download URL. The bot streams each agent's progress back into the Teams thread and returns the running HTTPS URL + source ZIP — optionally auto-opening the project in VS Code Insiders.

How the architecture answers the three questions

The question	How this solution answers it
Is it safe?	The autonomous agent runs in a Hyper-V-isolated ACA sandbox, not on anyone's laptop. No model keys in the workload — Entra ID managed identity brokers Foundry. MCP behind basic auth; gateway behind a bearer token on private ingress; token as a secret, never in an image. A deterministic pre-run wipe removes cross-run leakage.
Does it fit multi-agent work?	It is a multi-agent system — five specialist OpenClaw agents with A2A hand-offs and review gates — and because it's exposed via MCP, any client (Copilot, Claude, Teams) can orchestrate it.
Is it flexible and controllable?	Creativity lives inside a deterministic state machine: explicit TESTS_PASSED/FAILED verdicts, bounded retry loops (_MAX_TEST_ROUNDS, _MAX_DEPLOY_REVIEW), body-aware health checks, and a human approving in the Teams thread.

Deploy it yourself

The repo ships scripts for all three tiers (the gateway uses the platform's managed identity to reach Foundry — no key handling, no image rebuild):

# 1) OpenClaw gateway + the 5 agents (acasbxapp_node)
cd acasbxapp_node
cp .env.example .env                 # gateway token, Foundry endpoint, sandbox ids
./scripts/build-openclaw-image.sh    # build + push the OpenClaw image to ACR
./scripts/deploy-aks-gateway.sh      # grant Foundry roles + deploy

# 2) MCP service (acamcp_node)
cd ../acamcp_node
cp .env.example .env                 # ACR + cluster; gateway token read from ../acasbxapp_node/.env
./scripts/build-images.sh            # build + push the MCP image
./scripts/deploy-aks.sh              # secret + manifests to the openclaw namespace
./scripts/smoke-check.sh             # verify the MCP handshake

# 3) Teams bot (teamsbot_app): Node.js/TypeScript MCP client
cd ../teamsbot_app
# configure + run per the folder README, then sideload the Teams app package

The reference implementation targets Azure (ACA + AKS) — the OpenClaw gateway and MCP service run as containers, and the code-execution sandbox uses the ACA dynamic-sessions exec API. Keep the gateway on private ingress and add TLS before any public exposure.

Final thought

Strip away the World Cup demo and a reusable pattern remains — a blueprint for running any long-running autonomous agent in the enterprise:

A message-driven agent (OpenClaw / Hermes) + a microVM sandbox (Azure Container Apps dynamic sessions) + an MCP control plane with auth + enterprise identity (Entra ID managed identity) + a human surface (Microsoft Teams).

The autonomy that made these agents go viral is the same autonomy that makes security teams nervous. You don't resolve that tension by slowing the agent down — you resolve it by giving it a box with a hard wall, a control plane with a guard on the door, an identity instead of a secret, and a human in the loop. Do that, and "your PM types a sentence, Azure ships an app" stops being a scary demo and becomes something you can actually put in production.

Clone it, break it, harden it further: kinfey/Multi-AI-Agents-Cloud-Native → code/CustomCodingAgentApp

The chat window is the new terminal. Let's make it a safe one.

Optimizing GitHub Copilot Cost in the Usage-Based Billing Era

ellisd4 — Mon, 13 Jul 2026 04:15:00 GMT

GitHub Copilot has become a core part of the modern developer workflow. We use it to complete code, explain unfamiliar repositories, write tests, refactor legacy applications, review pull requests, generate documentation, and automate repetitive engineering work.

But as GitHub Copilot moves further into usage-based billing, teams and users are asking a very practical question:

How do we keep getting value from GitHub Copilot without letting costs become unpredictable?

The answer is not to blindly slash GitHub Copilot usage—that would defeat the purpose of adopting AI-assisted development in the first place. The better answer is to use GitHub Copilot more intentionally.

Usage-based billing shifts our mindset from:

“Can I use GitHub Copilot?” to: “Am I using the right GitHub Copilot capability, with the right model, the right context, and the right level of automation for this specific task?”

This guide outlines the major ways developers and organizations can optimize GitHub Copilot costs while continuing to accelerate productivity.

For a strategic look at how organizations are scaling these financial practices across modern infrastructure, read the playbook “Token Economics: The New FinOps for Agentic AI” on the Microsoft Community Hub.

Understanding the New Cost Model

Under usage-based billing, GitHub Copilot costs are driven primarily by two factors:

The model used (e.g., frontier vs. lightweight models)
The number of tokens consumed

A token is a small unit of text processed by the AI. This includes what you send to the model (input), what the model generates back (output), and, in some cases, cached context that the model reuses.

This means a short question using a lightweight model consumes very little. Conversely, a long, multi-file agentic session using a frontier model can consume significantly more. GitHub Copilot cost is no longer just about how many people have seats—it is about how those seats are being used.

The Biggest Cost Drivers:

Using expensive models for simple tasks
Overly large context windows
Long chat sessions with repeated context
Agentic workflows that inspect too many files
Enabling tools when they aren't needed
Dumping large terminal logs or full repository scans into the chat
A lack of active budgets and usage monitoring

The good news? Most of these are completely controllable.

1. Set Spending Guardrails First

The first cost-reduction step isn't technical—it’s financial governance. Before teams start heavy usage, set spending caps and budget controls. This is critical for organizations where many developers share a pool of AI credits.

Recommended Actions:

Set a monthly budget and enable hard-stop controls where available.
Segment limits: Create separate limits for standard users and power users.
Track granularly: Monitor consumption by user, team, organization, or cost center.
Review early: Closely analyze the first few weeks of usage and adjust limits based on real consumption patterns.

This prevents a single heavy user, a runaway automated agent session, or an experimental workflow from consuming a massive chunk of shared capacity early in the billing cycle.

For enterprises, avoid a one-size-fits-all budget. A junior developer asking occasional chat questions does not need the same budget as a platform engineer running massive migration tasks across multiple repositories.

A Smarter Budgeting Model:

Standard developer budget
Power user budget
Pilot team budget
Agentic workflow budget
Innovation/experimentation budget

This gives leaders control without blocking productivity, allowing organizations to analyze and adjust budgets frequently.

2. Use the Right Model for the Right Task

Not every task requires the most powerful model. This is one of the biggest mindset shifts in usage-based billing. Developers often default to the strongest model because it feels "safer," but a smaller, cheaper model is often more than enough for daily tasks.

Use Lightweight / Lower-Cost Models For:	Use Stronger / Premium Models For:
Simple code explanations	Architecture & system design decisions
Documentation & comment updates	Complex debugging & multi-file reasoning
Small unit test generation	Security-sensitive reviews
Basic debugging & syntax help	Legacy modernization & code translation
Regex help & simple refactoring	Performance tuning & optimization
Translating code from one style to another	Ambiguous, cross-service production issues
Summarizing small, single files	Large pull request reviews

Team Golden Rule:

Start with the cheapest model that can do the task well. Move to a stronger model only when the task requires deeper reasoning. This keeps high-cost models reserved for where they provide the highest value.

3. Lean More on Code Completions

Inline code completions and next edit suggestions are still some of the most cost-effective GitHub Copilot experiences available. For paid GitHub Copilot plans, these inline experiences are not billed against your AI credits. Developers should aggressively lean on completions when they already know what they want to build.

Ideal Completion Scenarios:

Writing a function after creating the signature
Completing repetitive boilerplate code
Filling in obvious implementation logic or adding similar methods
Writing simple, predictable unit tests
Completing configuration files

The Rule of Thumb: Use chat when you need reasoning. Use completions when you need acceleration.

❌ Bad Habit: Opening a new chat window for every small function.
👉 Better Habit: Write the function name, a descriptive comment, or the first few lines, and let GitHub Copilot fill in the implementation inline. Open chat only if the completion misses the mark or the logic requires deeper discussion.

4. Use Tools Only When Needed

Agents are incredibly powerful because they can interact with tools: reading files, searching repositories, editing code, running commands, inspecting test failures, or connecting to external systems via Model Context Protocol (MCP) servers.

However, tools heavily amplify costs. Every tool call adds more context, more model invocations, and more output back into the token loop. If an agent reads too many files or dumps raw logs into chat, token counts skyrocket.

The goal is not to ban tools, but to scope them correctly:

Don't enable everything: Avoid activating every tool for every agent. Use only what is required.
Specialized agents: Create custom agents with specialized tool access (e.g., a documentation agent needs read/search tools, but likely doesn't need terminal or cloud database access).
Avoid full scans: Ask Copilot to inspect specific files or folders rather than scanning the entire repository.
Summarize logs: Ask Copilot to summarize terminal outputs or run the narrowest relevant unit test instead of dumping pages of raw test logs.

A Cost-Aware Prompting Example:

"Use tools only if needed. Start by reading the files I mention. Do not scan the entire repository. If tests are needed, run only the relevant unit tests and summarize the failure instead of pasting full logs."

5. Narrow the Context Window

Context is useful only when it is relevant. A common cost driver is issuing broad prompts that pull massive chunks of irrelevant code into the session.

❌ Broad Prompt: "Analyze this repo and fix the bug." (Forces the agent to scan, search, and load unnecessary files).
👉 Targeted Prompt: "The bug appears to be in /src/auth/tokenValidator.ts and /src/auth/sessionStore.ts. Review only these files and their related tests. Propose the smallest safe fix."

Practical Ways to Reduce Context:

Explicitly mention exact files, folders, error messages, or failing tests.
Instruct the model on what not to change.
Explicitly exclude generated files, vendor folders, build artifacts, and lock files.
Clear your sessions: Staying in one massive chat session all day causes historical context to stack up. Start a fresh session when your topic/intent changes, when frequent compactions occur, or when different tools are required.
Pro tip: If you are ending a long session but need to carry forward the conclusion, ask Copilot to generate a quick Markdown summary of the session to paste as the starting context of your new, clean session.

6. Use Plan-First Prompts for Complex Work

For large-scale tasks, do not let GitHub Copilot immediately start editing files in agent mode. Always ask it to plan first.

This is crucial for large refactoring, multi-file changes, legacy modernization, dependency upgrades, security fixes, and performance tuning.

A Plan-First Prompting Pattern:

"Before editing files, create a short plan. Identify the files you need to inspect, the likely root cause, the risk level, and the smallest validation test. Do not make changes until the plan is clear."

Why this works:

Prevents wasted work: It ensures the agent doesn't modify the wrong files or run unnecessary commands.
Creates a cost checkpoint: If the plan looks too broad or incorrect, you can narrow the scope before GitHub Copilot kicks off an expensive, multi-step agentic loop.

7. Batch Related Questions in One Session

While giant, day-long sessions are bad for context bloat, opening a brand-new chat for every tiny, consecutive question introduces unnecessary context reload overhead. Finding the right balance is key.

❌ Less Efficient (Fragmented):
- Chat 1: "What does this function do?"
- Chat 2: "Can you write tests for it?"
- Chat 3: "Can you check it for security issues?"
👉 More Efficient (Batched):
- Single Chat: "Review this function for correctness, security, performance, and test coverage. Give me the top issues first, then suggest the smallest safe improvement."

The Balance: Batch highly related work in a single conversation, but hit the "New Chat" button the moment you pivot to an entirely new topic.

8. Keep Custom Instructions Short

Custom instructions (.github/copilot-instructions.md) and AGENTS.md files are powerful tools for enforcing team standards, but they are added as base tokens to every single chat interaction. If they are bloated, they act as a hidden tax on every prompt.

✅ Keep it Concise & Rules-Based:	❌ Avoid Attaching to Every Prompt:
"Prefer small, testable changes."	Massive architecture blueprints
"Do not modify public APIs without calling it out."	Full enterprise coding standards documents
"Use our standard logging pattern."	Entire infrastructure runbooks
"When writing tests, follow the existing test style."	Pages of multi-shot coding examples

The Strategy: Use global/repository instructions only for non-negotiable, high-level rules. For specific workflows, use localized prompt files, specialized skills, or custom agents.

9. Use Skills and Specialized Agents for Repeatable Work

If your team frequently asks Copilot to perform identical workflows (e.g., API reviews, PR summarization, accessibility audits, Terraform checks), turn them into reusable skills, prompt files, or custom agents.

This ensures prompt formatting remains minimal, highly consistent, and optimized for low token usage. Furthermore, if you discover an excellent workflow during an active chat session, ask Copilot to turn that knowledge into a SKILL file. This prevents future sessions from having to waste tokens "re-learning" the process.

10. Avoid Unnecessary Agentic Mode

Agent mode is incredibly capable, but it shouldn't be the default mode for standard queries.

🛠️ Use Agent Mode For: Multi-file bug fixes, generating and validating complex test suites, cross-file refactoring, reproducing test failures, or implementing full PR tasks.
💬 Stick to Normal Chat/Completions For: Simple syntax lookups, single-file explanations, writing minor snippets, basic documentation rewrites, or simple formatting tweaks.

The Rule: If a task doesn't explicitly require autonomous file editing, tool execution, or terminal commands, stick to standard chat or inline completions.

11. Monitor Usage Weekly

Optimization without measurement is just guesswork. During your organization's transition into usage-based billing, establish a weekly review cadence to look at your telemetry data.

What to Look For:

Which models are drawing the most volume?
Are premium frontier models being used for simple tasks?
Which teams or workflows are spiking above their allocated credit budgets?
Is tool usage or long context history driving up the average cost per session?

Use these insights to refine default model guidance, create user-level budget overrides for true power users, update prompt templates, and host short, continuous training sessions. The goal is never to shame high usage—high usage is fantastic if it yields high-value code. The goal is to eliminate systemic token waste.
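The weekly review questions above can be answered with a small aggregation script. This is a sketch under stated assumptions: the record fields (team, model, task, credits), the set of "simple" task labels, and the team budgets are hypothetical, and a real usage export will have its own schema.

```python
from collections import defaultdict

# Hypothetical usage export rows; real field names depend on your billing export.
records = [
    {"team": "platform", "model": "frontier", "task": "refactor", "credits": 42.0},
    {"team": "platform", "model": "frontier", "task": "regex-help", "credits": 6.5},
    {"team": "web", "model": "lightweight", "task": "docs", "credits": 1.2},
    {"team": "web", "model": "frontier", "task": "docs", "credits": 9.0},
]
SIMPLE_TASKS = {"regex-help", "docs"}              # assumption: tasks a cheaper model could handle
TEAM_BUDGETS = {"platform": 40.0, "web": 15.0}     # assumption: weekly credit budgets


def weekly_review(rows):
    """Summarize spend by model and team, and flag likely waste."""
    by_model = defaultdict(float)
    by_team = defaultdict(float)
    frontier_on_simple = 0.0
    for row in rows:
        by_model[row["model"]] += row["credits"]
        by_team[row["team"]] += row["credits"]
        if row["model"] == "frontier" and row["task"] in SIMPLE_TASKS:
            frontier_on_simple += row["credits"]
    # Teams without a configured budget are never flagged.
    over_budget = sorted(
        team for team, spent in by_team.items()
        if spent > TEAM_BUDGETS.get(team, float("inf"))
    )
    return {
        "by_model": dict(by_model),
        "by_team": dict(by_team),
        "frontier_on_simple": frontier_on_simple,
        "over_budget": over_budget,
    }


report = weekly_review(records)
print(report)
```

The frontier_on_simple figure is the one to watch: it estimates spend that model guidance alone could recover, without touching anyone's budget.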

12. Create a Simple Team Policy

To make this stick, give your developers a lightweight checklist they can actually memorize. Here is a great blueprint for an internal team policy:

Completions First: Use inline completions for normal code acceleration.
Cheaper Models First: Default to lightweight models; escalate to frontier models only for deep reasoning.
Scope Wisely: Mention specific files/folders; avoid blind repository scans.
Plan First: Ask for an architectural plan before allowing agents to execute large edits.
Tool Hygiene: Keep tools and custom instructions tightly scoped and concise.
Review Cadence: Check usage metrics weekly and adjust budgets monthly.

Reusable, Cost-Aware Prompt Templates

Copy and paste these templates into your daily workflows to keep token counts down:

For Debugging: "Analyze this error using only the files I mention. Do not scan the whole repository. First explain the likely root cause, then suggest the smallest fix."
For Agent Mode: "Use tools only if needed. Before editing, create a short plan and list the files you need to inspect. Run only the narrowest relevant test."
For Code Review: "Review this pull request for correctness, security, and maintainability. Focus on high-impact issues only. Do not rewrite the code unless necessary."
For Testing: "Generate unit tests for this function using the existing test style. Do not modify production code. Keep the test scope narrow."
For Refactoring: "Refactor this file only. Preserve behavior. Do not change public APIs. Explain the risk before making edits."
For Large Repositories: "Do not analyze the entire repository. Start with /src/payment and /tests/payment. Ask before expanding scope."

Summary: What to Avoid

To keep budgets predictable, coach your teams away from these common anti-patterns:

Setting an expensive frontier model as the default for every single interaction.
Running full agent loops for questions that a simple chat could answer.
Allowing agents to freely scan entire codebases without folder constraints.
Enabling every single available MCP tool by default.
Dumping massive, raw terminal outputs or log dumps directly into the prompt.
Carrying massive, multi-hour chat histories instead of opening fresh sessions.
Disregarding your usage dashboards until the invoice arrives.

The Bigger Picture: Cost Optimization is Workflow Optimization

Optimizing Copilot costs isn't about using AI less—it’s about using AI better. Smaller context windows, tighter tool scoping, cleaner prompts, and proper model selection don't just lower the bill; they make the model significantly more accurate. When you overwhelm a model with irrelevant files and messy logs, you introduce noise that degrades the quality of the output.

Cost optimization is an engineering maturity practice, not a restriction. When managed through a smart operating model, GitHub Copilot will continue to securely accelerate our delivery, eliminate boilerplate toil, and maximize our focus on the work that truly matters.

References:

Token Economics: The New FinOps for Agentic AI | Microsoft Community Hub

Contributors:

This article is maintained by Microsoft. It was originally written by the following contributors.

Gaurav Bhardwaj | Senior Cloud Solution Architect
Dustin Ellis | Senior Cloud Solution Architect

Token Economics: The New FinOps for Agentic AI

kinfey — Mon, 06 Jul 2026 07:00:00 GMT

In AI applications, tokens are now cost — and token economics deserves architectural attention

For a long time, AI application design started with model capability: Can the model write code? Can it reason? Can it use tools? Can it handle long context? Those questions still matter, but in the age of agentic applications, they are no longer sufficient. The more important production question is this: How many tokens does the architecture burn to complete one useful task?

A classic chat application often maps one user turn to one model call. An agentic system is different. One user goal can trigger planning, retrieval, tool selection, tool execution, result interpretation, reflection, repair, and summarization. The user sees one instruction; the system may execute dozens of model calls behind the scenes. Tokens are no longer just a measure of text length. They become a measure of system design, runtime behavior, developer workflow, and business cost.

GitHub Copilot’s 2026 move to usage-based billing through GitHub AI Credits captures the industry shift clearly. Usage is now aligned with token consumption, including input, output, and cached tokens. That matters because Copilot has evolved from an in-editor assistant into an agentic platform that can handle long, multi-step coding sessions across repositories. In that world, a tiny prompt and a multi-hour autonomous coding workflow should not be treated as the same economic unit.

Token economics is therefore not about telling developers to “write shorter prompts.” It is about designing systems where:

useful context is preserved, while noise is removed;
repeated context is cached or deduplicated;
simple tasks do not pay for frontier models;
short-term state is managed structurally instead of copied repeatedly;
every model call is metered, comparable, and governed.

In short: token economics is the practice of making agentic AI economically sustainable.

Scenario thinking: GitHub Copilot billing, Copilot SDK, GPT-5.5, Anthropic, and MAI-Code Model

The new GitHub Copilot billing model provides a useful framing for developers. Copilot is no longer only autocomplete. It is becoming a programmable agentic platform. It can use models, call tools, work across files, stream responses, and participate in long-running coding workflows. With the GitHub Copilot SDK, developers can embed that agentic runtime into their own applications, services, and developer tools.

That is powerful, but it also changes the cost model. Once an agent loop becomes programmable, token cost also needs to become programmable. If a system can plan, call tools, edit files, retry, repair, and summarize, it also needs to meter, route, cache, compress, and evaluate.

EvalAgentic gives this idea a concrete playground. The project groups models into cost and capability tiers:

Tier	Example models	Example price / 1K tokens	Typical use
LARGE	claude-opus-4.8, gpt-5.5	$0.030	Agents, code generation, multi-step reasoning
MID	gpt-5.4-mini	$0.012	Dialogue, summarization, extraction
TINY	gpt-5-mini	$0.001	Classification, keyword matching, rule-like tasks

This tiering lets us reason about real scenarios:

GPT-5.5-class models are valuable for hard reasoning and engineering workflows, but they should not be the default for every step. Using a frontier model for simple classification is like hiring a principal architect to label folders.
Anthropic high-capability models can be excellent for complex reasoning and coding, but they benefit from routing discipline. Requirements analysis, test interpretation, deployment explanation, and code generation may not need the same model tier.
MAI-Code Model-style coding models should be treated as specialized capability layers. Their value is not just “better code generation”; it is deciding when code-specialized intelligence should be invoked in a larger agent pipeline.

The real question is not “Which model is the best?” It is: Which model is the most economical and reliable for this step of this workflow?

Four engineering techniques for saving tokens

Context Compression: turn long text into executable structure

Implementation principle

Context Compression converts long natural-language context into the structured information an agent actually needs. Business documents are often verbose: resumes, contracts, product manuals, requirements, and support logs contain narrative text, boilerplate, repeated explanations, and low-value context. The next agent step may only need a few fields.

EvalAgentic demonstrates this with a long resume-like input that is compressed into a compact JSON object. Instead of injecting the full original text into every prompt, the system extracts key fields and dynamically injects only the data required by the current task.

A practical compression pipeline includes:

Redundancy detection — identify long-tail text, repeated descriptions, stale history, and low-value context.
Structured extraction — use Copilot or a mid-tier model to transform prose into JSON, tables, or typed schemas.
Dynamic injection — inject only the fields needed for the next step.
Recoverable references — preserve source pointers so compressed context remains auditable.
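The dynamic injection and recoverable reference steps can be sketched without any model call. In this minimal example the compressed record stands in for the output of an extraction step that is not shown; the field names, task names, and source pointer are hypothetical and are not taken from EvalAgentic.

```python
import json

# Hypothetical structured record, standing in for what an extraction step
# (a model call, not shown here) would produce from a long resume-like input.
compressed_profile = {
    "name": "A. Candidate",
    "years_experience": 8,
    "skills": ["python", "kubernetes", "postgres"],
    "last_role": "Senior Backend Engineer",
    "source_ref": "resume.pdf#page=1",  # recoverable reference for audits
}

# Fields each downstream step needs; the task names are illustrative.
FIELDS_BY_TASK = {
    "skill_match": ["skills", "years_experience"],
    "intro_email": ["name", "last_role"],
}


def build_context(task: str, record: dict) -> str:
    """Inject only the fields the current task needs, plus the source pointer."""
    wanted = FIELDS_BY_TASK[task] + ["source_ref"]
    return json.dumps({key: record[key] for key in wanted}, separators=(",", ":"))


print(build_context("skill_match", compressed_profile))
print(build_context("intro_email", compressed_profile))
```

Each task sees a different slice of the same record, and the source pointer travels with every slice so a reviewer can trace a field back to the original document.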

How to evaluate

Prompt token reduction before and after compression.
Answer quality and task success rate.
Schema fidelity and missing-field rate.
Latency improvement.
Cost per successful task.

Compression is not summarization. Summaries are designed for humans. Structured compression is designed for agents.

Prompt Deduplication / Cache: stop paying twice for the same context

Implementation principle

Many agent systems waste tokens because they repeatedly send the same context. The same resume, contract, repository README, user profile, API documentation, or business rule can be copied across turns and agents.

Prompt Deduplication / Cache applies a simple principle: if context has already been processed, do not pay to process it again unless it has changed.

A concrete design includes:

compute a hash or semantic key for source context;
reuse extracted structured results when content is identical or equivalent;
apply a TTL for repeated entities, such as the 24-hour cache pattern shown in EvalAgentic;
organize stable prompt prefixes to benefit from provider-level prompt caching where available;
store shared context in an artifact store or memory layer so multiple agents do not copy the same blob.
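The first three design points (a content hash, reuse of extracted results, and a TTL) fit in a few lines. This is a sketch, not EvalAgentic's implementation: the class name, the whitespace normalization, and the injectable clock are assumptions made to keep the example small and testable. Only the 24-hour default comes from the text above.

```python
import hashlib
import time


class ExtractionCache:
    """Reuse extracted results for identical context until a TTL expires."""

    def __init__(self, ttl_seconds=24 * 60 * 60, clock=time.time):
        self._ttl = ttl_seconds
        self._clock = clock          # injectable so expiry can be tested
        self._entries = {}           # key -> (stored_at, result)

    @staticmethod
    def key_for(context: str) -> str:
        # Normalize whitespace so trivially different copies share one key.
        normalized = " ".join(context.split())
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()

    def get_or_compute(self, context: str, extract):
        """Return (result, was_cache_hit); call extract only on a miss."""
        key = self.key_for(context)
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and now - entry[0] < self._ttl:
            return entry[1], True
        result = extract(context)
        self._entries[key] = (now, result)
        return result, False


def fake_extract(text: str) -> dict:
    # Stand-in for an expensive model call.
    return {"length": len(text)}


cache = ExtractionCache()
print(cache.get_or_compute("same   context blob", fake_extract))
print(cache.get_or_compute("same context blob", fake_extract))
```

An exact hash only catches identical content. The "semantic key" variant mentioned above would need an embedding or canonicalization step, which this sketch deliberately leaves out.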

How to evaluate

Cache hit rate.
Cached token ratio.
Duplicate prompt rate.
Cost delta before and after caching.
Correctness under cache, especially stale-cache failures.

Caching is not “save everything forever.” Good caching knows when to reuse and when to invalidate.

On-Demand Model Routing: let task complexity decide model tier

Implementation principle

On-Demand Model Routing routes each request to the cheapest model that can complete the task reliably. The entry point can use a rule tree, a lightweight classifier, or a hybrid complexity score.

EvalAgentic’s routing tree is intentionally easy to explain:

INCOMING REQUEST
└─ Prompt < 500 tokens?
   ├─ YES ─→ TINY: classify / extract
   └─ NO ──→ multi-step reasoning?
      ├─ NO ──→ MID: dialogue / summary
      └─ YES ─→ LARGE: agent / code

The engineering logic is straightforward:

simple classification and keyword matching go to TINY;
summarization and structured conversion go to MID;
multi-step reasoning, coding, cross-file changes, and orchestration go to LARGE;
code-specialized models such as MAI-Code Model can be placed in the coding phase rather than used across the whole pipeline.
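The routing tree above translates almost directly into a rule function. This is a minimal sketch, assuming a rough four-characters-per-token estimate and a caller-supplied flag for multi-step reasoning; neither is taken from EvalAgentic's router.py, and a production router would more likely use a real tokenizer and a classifier.

```python
from enum import Enum


class Tier(Enum):
    TINY = "TINY"
    MID = "MID"
    LARGE = "LARGE"


SHORT_PROMPT_TOKENS = 500  # threshold from the routing tree


def estimate_tokens(prompt: str) -> int:
    # Rough heuristic (assumption): about four characters per token.
    return max(1, len(prompt) // 4)


def route(prompt: str, needs_multi_step_reasoning: bool) -> Tier:
    """Follow the rule tree: short prompts go to TINY, then split on reasoning."""
    if estimate_tokens(prompt) < SHORT_PROMPT_TOKENS:
        return Tier.TINY
    if needs_multi_step_reasoning:
        return Tier.LARGE
    return Tier.MID


print(route("Is this ticket a bug or a feature request?", False))
print(route("summarize this document: " + "lorem ipsum " * 400, False))
print(route("refactor these modules: " + "def f(): pass\n" * 400, True))
```

Note that the tree sends every short prompt to TINY, even one that needs reasoning. That is where the escalation rate metric below earns its place: it shows how often the cheap first choice had to be retried on a larger tier.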

How to evaluate

Routing accuracy.
Cost per route.
Quality regression by tier.
Escalation rate from small models to larger models.
End-to-end success rate.

Routing does not mean “always use the smallest model.” It means frontier intelligence is reserved for the steps where it actually changes the outcome.

Short-term Memory: preserve state instead of replaying history

Implementation principle

Short-term Memory controls context growth across multi-turn and multi-agent workflows. Without it, agents often replay the full conversation history, full tool outputs, and full intermediate reasoning on every turn. The context grows; quality may not improve; the bill definitely does.

A better design stores state structurally:

user goal;
current plan;
tool outputs and references;
failure reasons;
next actions;
handoff artifacts between agents.

In a multi-agent coding pipeline, the Requirements Agent should hand off a structured spec. The Coding Agent should read that spec, not the entire prior conversation. The Testing Agent should consume testable artifacts, not every word produced by the Coding Agent.
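A structured handoff can be as simple as a typed record that is serialized between agents. The sketch below mirrors the state list above; the class name, field names, and sample values are hypothetical, and the replayed history is a stand-in string used only to compare sizes.

```python
import json
from dataclasses import asdict, dataclass, field
from typing import List


@dataclass
class AgentState:
    """Structured short-term state passed between agents instead of full history."""
    user_goal: str
    current_plan: List[str]
    tool_output_refs: List[str] = field(default_factory=list)  # pointers, not raw output
    failure_reasons: List[str] = field(default_factory=list)
    next_actions: List[str] = field(default_factory=list)


def handoff_payload(state: AgentState) -> str:
    """Serialize the state as the only context the next agent receives."""
    return json.dumps(asdict(state))


# Stand-in for a long replayed conversation (illustrative only).
history = ["user and agent turn text with tool output pasted inline. "] * 50

state = AgentState(
    user_goal="Build a goods-list site with a Flask backend",
    current_plan=["write spec", "implement API", "add tests"],
    tool_output_refs=["artifacts/spec.json"],
    next_actions=["generate Flask routes from spec"],
)

payload = handoff_payload(state)
print(len("".join(history)), "characters of replayed history")
print(len(payload), "characters of structured handoff")
```

Storing references to tool outputs rather than the outputs themselves is the design choice that keeps the payload flat as the session grows.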

How to evaluate

Context growth curve across turns.
Memory retrieval precision.
Rework rate caused by missing state.
Recovery quality after failed steps.
Average input tokens per turn.

Short-term memory is not about remembering everything. It is about remembering the next useful thing.

EvalAgentic as a concrete evaluation example

EvalAgentic is effective as an evangelism project because it turns token economics into an observable before/after system.

The architecture has five layers:

Frontend — frontend/index.html provides Tabs A / B / C, live SSE logs, and before/after charts.
API — backend/server.py exposes FastAPI routes and Server-Sent Events streaming.
Orchestration — eval.py handles A/B evaluation; coding_agents.py handles the multi-agent coding scenario.
Core — compressor.py, router.py, gh_models.py, and token_meter.py implement compression, routing, Copilot SDK calls, and token metering.
Providers — GitHub Copilot SDK and Microsoft Agent Framework provide model access and agent orchestration.

Tab A: Compression comparison

Tab A compares long-form context before and after structured compression. The key message is that token saving does not come from writing a clever sentence. It comes from converting verbose context into a structured artifact that downstream agents can consume efficiently.

Tab B: On-demand model routing

Tab B demonstrates that cost is not only about raw token count. If a system routes simple tasks to cheaper tiers and reserves expensive models for complex reasoning, total cost can fall even if some token counts increase. This is a subtle but important point: token economics is not token starvation; it is model portfolio optimization.

Tab C: Coding scenario — multi-agent with Agent Framework

Tab C is the most persuasive demo. The same deliverable — a Taobao-like goods-list site with HTML + JavaScript frontend, Flask backend, and Docker deployment — is produced twice by a four-agent pipeline:

Requirements Agent;
Coding Agent;
Testing Agent;
Deployment Agent.

The before pipeline uses no compression and sends every agent to GPT-5.5 / LARGE.
The after pipeline injects a compressed JSON spec and uses on-demand routing: requirements can use MID, coding can use LARGE, testing can use MID, and deployment can use TINY.
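To see why the after pipeline is cheaper, here is a back-of-the-envelope comparison. The per-1K prices come from the tier table earlier in this article; the token counts per agent are hypothetical numbers chosen for illustration, not measurements from EvalAgentic.

```python
# Example prices per 1K tokens, taken from the tier table above.
PRICE_PER_1K = {"LARGE": 0.030, "MID": 0.012, "TINY": 0.001}

# Hypothetical token counts per agent (assumptions, not measured results).
before = [  # no compression, every agent on LARGE
    ("requirements", "LARGE", 6000),
    ("coding", "LARGE", 12000),
    ("testing", "LARGE", 8000),
    ("deployment", "LARGE", 4000),
]
after = [  # compressed JSON spec plus on-demand routing
    ("requirements", "MID", 3000),
    ("coding", "LARGE", 8000),
    ("testing", "MID", 4000),
    ("deployment", "TINY", 2000),
]


def pipeline_cost(steps) -> float:
    """Sum tokens (in thousands) times the tier price for each agent step."""
    return sum(tokens / 1000 * PRICE_PER_1K[tier] for _, tier, tokens in steps)


before_cost = pipeline_cost(before)
after_cost = pipeline_cost(after)
print(f"before: ${before_cost:.3f}  after: ${after_cost:.3f}")
print(f"saving: {100 * (1 - after_cost / before_cost):.0f}%")
```

With these assumed counts, coding still dominates the after pipeline because it stays on LARGE, which is the intended outcome: the frontier tier is paid for only where it is needed.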

This mirrors real enterprise development. Architecture and complex code generation may deserve frontier models. Test interpretation, deployment packaging, and simple validation often do not.

Summary and refinement based on the project diagrams

The EvalAgentic README describes three important visuals: the architecture flow, the routing tree, and the token-meter design. Together, they form a governance loop:

User Scenario
↓
Context Compression
↓
Prompt Deduplication / Cache
↓
On-Demand Model Routing
↓
Short-term Memory
↓
Token Metering & Budget Actions
↓
Before / After Evaluation

Optimize the path, not only the prompt

Many teams start token optimization by editing prompt wording. That helps, but the largest waste usually lives in the execution path: how many calls are made, how much context is repeated, how often tools retry, and whether every step uses the same expensive model. EvalAgentic makes the path visible through A/B comparisons.

Token Meter is the control plane of cost governance

EvalAgentic’s token_meter.py uses a non-invasive interceptor pattern:

INTERCEPTOR (@token_meter)
↓
COUNTER CORE: accounting / budget threshold / trigger
↓
ACTION HUB: throttle (>80% budget) / rollback (>budget)

This is the right architectural instinct. Production systems need thresholds, throttling, rollback, and traceability. Without those controls, one retry loop can quietly turn a small user request into a budget incident.
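The interceptor pattern can be sketched as a decorator that wraps each model call. This is not the code in token_meter.py: the class and method names are hypothetical, the wrapped function is assumed to return a (text, tokens_used) pair, and "rollback" is modeled here as raising an exception so the caller can undo its work.

```python
import functools


class BudgetExceeded(RuntimeError):
    """Raised when usage passes the budget so the caller can roll back."""


class TokenMeter:
    def __init__(self, budget_tokens: int, throttle_ratio: float = 0.8):
        self.budget = budget_tokens
        self.throttle_ratio = throttle_ratio  # 80% threshold from the diagram
        self.used = 0
        self.throttled = False

    def track(self, func):
        """Interceptor: meter a call that returns (text, tokens_used)."""
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            text, tokens = func(*args, **kwargs)
            self.used += tokens                      # counter core: accounting
            if self.used > self.budget:              # action hub: rollback
                raise BudgetExceeded(f"{self.used} tokens used, budget is {self.budget}")
            # Action hub: signal throttling once past the threshold.
            self.throttled = self.used > self.throttle_ratio * self.budget
            return text
        return wrapper


meter = TokenMeter(budget_tokens=1000)


@meter.track
def fake_model_call(prompt: str, tokens: int):
    # Stand-in for a real model call; tokens would come from the provider response.
    return f"echo:{prompt}", tokens


print(fake_model_call("plan", 500), meter.throttled)
print(fake_model_call("code", 350), meter.throttled)
```

Because the metering lives in the decorator, the agent code stays unchanged, which is what makes the pattern non-invasive. How a caller reacts to the throttled flag (slower retries, a cheaper tier, a pause) is left open here.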

Cost metrics must be evaluated with quality metrics

A system that cuts cost by 80% but drops success rate by 50% is not optimized. It is broken more cheaply. The evaluation matrix should combine cost, quality, latency, and reliability:

Dimension	Metric	Why it matters
Cost	Cost per successful task	Measures the real unit economics
Token	Input / output / cached tokens	Identifies compression and cache opportunities
Quality	Pass rate / regression rate	Ensures cheaper tiers do not break outcomes
Efficiency	Latency / retry count	Prevents cheap models from causing expensive retries
Governance	Budget breach / rollback count	Validates runtime control
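The first row of the matrix, cost per successful task, is worth computing explicitly because it ties spend to outcomes. The sketch below uses hypothetical run totals; the function name and the numbers are illustrative and should be read alongside the quality rows, never on their own.

```python
def cost_per_successful_task(total_cost: float, successful_tasks: int) -> float:
    """Divide spend by the number of tasks that actually passed."""
    if successful_tasks <= 0:
        raise ValueError("no successful tasks: cost per success is undefined")
    return total_cost / successful_tasks


# Hypothetical A/B totals over 100 attempted tasks each (assumptions).
baseline = {"total_cost": 90.0, "successes": 90}
optimized = {"total_cost": 33.0, "successes": 88}

baseline_unit = cost_per_successful_task(baseline["total_cost"], baseline["successes"])
optimized_unit = cost_per_successful_task(optimized["total_cost"], optimized["successes"])

print(f"baseline:  ${baseline_unit:.3f} per successful task")
print(f"optimized: ${optimized_unit:.3f} per successful task")
print(f"pass-rate change: {optimized['successes'] - baseline['successes']} tasks per 100")
```

Reporting the pass-rate change next to the unit cost keeps a cheaper but less reliable configuration from looking like a pure win.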

Narrative

A simple three-line narrative works well for demos:

Tokens are no longer a technical detail. They are the bill for your architecture.
EvalAgentic shows the same scenario before and after cost-aware design.
The goal is not to make models cheaper; the goal is to make agent systems economically governable.

For a developer audience, the sharper version is:

A good agent does not use the biggest model everywhere. It uses the right intelligence at the right step, with the right context, under the right budget.

Practical recommendations for real projects

Establish a token baseline first. Measure input, output, retries, tool calls, and cost per scenario before optimizing.
Make compression a component, not a prompt habit. Define schemas, cache policies, and fallback behavior.
Introduce a model routing matrix. Route by task type, complexity, risk, latency, and cost.
Define handoff contracts between agents. Pass structured artifacts, not endless conversation history.
Evaluate every optimization with A/B tests. Compare cost, quality, latency, and stability.
Add budget actions. Throttle at a threshold, rollback on breach, and add circuit breakers for failed retries.

Closing: token economics is the second curve of agent engineering

The first phase of AI application development was about calling models. The second phase was about putting models into products. The next phase of agentic AI is about running those systems reliably, affordably, and governably.

EvalAgentic matters because it turns Context Compression, Prompt Deduplication / Cache, On-Demand Model Routing, and Short-term Memory into something developers can run, compare, and explain. It moves token economics from opinion to instrumentation.

Future AI applications will not only ask: How smart is this agent?
They will ask: How many tokens does it spend per completed task? Which model did it use? Did it hit cache? Did retries run away? Did the system reserve frontier intelligence for the steps that deserved it?

Guest Access for Canvas and Model-Driven Apps with Microsoft Entra ID

aakarshdhawan — Thu, 02 Jul 2026 08:24:43 GMT

Introduction

External collaboration is a common requirement for business applications built on Microsoft Power Platform. Organizations may need to provide controlled access to vendors, partners, contractors, or external business users without creating full internal user accounts.

Microsoft Entra ID B2B guest access enables this scenario by allowing external users to be invited into the resource tenant and then granted access to specific Power Platform resources. For apps that use Microsoft Dataverse, access is still governed by licensing, environment-level guest access settings, and Dataverse security roles.

This blog walks through the configuration flow to enable a guest user to access Canvas apps and Model-driven apps in Microsoft Power Platform using Microsoft Entra ID.

Scenario

In this scenario, an external user needs access to a Power Platform environment and must be able to access a Dataverse-backed app, such as a Model-driven app or a Canvas app that connects to Dataverse.

The configuration involves the following areas:

Microsoft Entra ID external collaboration
Guest user invitation and acceptance
Power Platform environment user access
Power Apps licensing
Environment-level guest access setting
Dataverse security roles
App access validation

Step 1: Enable external sharing in Microsoft Entra ID

Start by validating that external collaboration is enabled in Microsoft Entra ID.

In the Microsoft Entra admin center, go to: External Identities → External collaboration settings

Review the guest collaboration settings and ensure that the required guest access method is allowed for your organization.

Depending on your organization’s preference and security model, guest access can be enabled through:

B2B collaboration invitation
B2B direct connect, where applicable for the scenario

For this walkthrough, the guest user is added through the B2B invitation flow.

Step 2: Add the external user as a guest user

Add the external user as a guest user in the Microsoft Entra tenant.

In Microsoft Entra ID, invite the external user by sending a guest invitation. The guest user must exist in the resource tenant before Power Platform and Dataverse access can be assigned.

Sharing an app with guest users must be done in the resource tenant - the tenant where the app actually resides. The user's original tenant, by contrast, is referred to as the home tenant.
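If you prefer to script the invitation instead of using the admin center, Microsoft Graph exposes an invitations endpoint. The sketch below only builds the HTTP request and does not send it. The access token, guest email address, and redirect URL are placeholders, and acquiring a token with permission to invite guests (typically User.Invite.All) is assumed to happen elsewhere.

```python
import json
import urllib.request

GRAPH_INVITATIONS_URL = "https://graph.microsoft.com/v1.0/invitations"


def build_invitation_request(access_token: str, guest_email: str, redirect_url: str):
    """Build (but do not send) a Microsoft Graph request that invites a B2B guest."""
    payload = {
        "invitedUserEmailAddress": guest_email,
        "inviteRedirectUrl": redirect_url,   # where the guest lands after accepting
        "sendInvitationMessage": True,       # ask Entra ID to email the invitation
    }
    return urllib.request.Request(
        GRAPH_INVITATIONS_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Authorization": f"Bearer {access_token}",
            "Content-Type": "application/json",
        },
        method="POST",
    )


# Placeholder values (assumptions): replace with a real token and addresses.
request = build_invitation_request(
    access_token="<token-from-your-auth-flow>",
    guest_email="vendor.user@example.com",
    redirect_url="https://make.powerapps.com",
)
print(request.get_method(), request.full_url)
# To send: urllib.request.urlopen(request), run in the resource tenant's context.
```

The request must be made against the resource tenant, for the same reason the app must be shared there.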

Step 3: Ask the guest user to accept the invitation

The guest user must accept the invitation before they can access resources in the tenant.

Once the invite is accepted, the user becomes available as a guest account in the resource tenant and can be assigned access to Power Platform resources.

Step 4: Add the guest user to the environment and assign security roles

Once the guest user has accepted the invitation, the next step is to add them to the required environment and assign the appropriate security roles. Both of these actions are performed in the Microsoft Power Platform admin center (PPAC), so they can be completed back-to-back.

Add the user to the environment:
Navigate to Environments → Settings → Users → Add user, then select the guest user and add them to the target environment. This makes the user available in the environment so that security roles and app-level access can be configured.

Assign Dataverse security roles:
With the user added, assign the appropriate Microsoft Dataverse security roles. Security roles determine what a user can and cannot access within Dataverse - for example, an external vendor can be scoped to only the tables, records, and actions required for their specific task, without any administrative privileges.

Follow the principle of least-privilege access: grant only the permissions the guest user needs to complete their intended business process.

Step 5: Assign a Power Apps license

Next, assign the required Microsoft Power Apps license to the guest user. In this walkthrough, a Power Apps Premium license was assigned from the Microsoft 365 admin center.

Per Microsoft's official documentation, accessing an app that connects to Microsoft Dataverse requires the guest user to hold a license with Power Apps use rights matching the app's capability level.

Step 6: Enable guest user access in the Power Platform environment

Next, enable guest access for the environment.

Go to:

Power Platform admin center → Security → Identity and access → Guest access

Then:

Select the required environment.
Select Manage guest access.
Turn off the Block guest user access toggle.

This allows Microsoft Entra B2B guest users to access Dataverse data in that environment.

Step 8: Validate access with the guest user

After completing the configuration, validate the access using the guest user account.

In this validation, the guest user was able to sign in to the CRM environment and perform actions aligned with the assigned security role.

Summary

Guest user access for Canvas apps and Model-driven apps in Microsoft Power Platform can be configured using Microsoft Entra ID B2B collaboration and Power Platform security controls.

For Dataverse-backed apps, the important point is that guest access is governed by multiple layers. The user must be invited and accepted as a guest, added to the environment, licensed appropriately, allowed through the environment guest access setting, and assigned the right Dataverse security roles.

Once these controls are configured correctly, external users can access the required app and perform only the actions permitted by their assigned role.


Smoke Test Microsoft Foundry Agents with GitHub Actions

j_folberth — Wed, 01 Jul 2026 13:32:47 GMT

Introduction

This blog is the next part of a series discussing Foundry Hosted Agents and how to properly construct Continuous Integration and Continuous Delivery (CI/CD) for them. This post will specifically go over configuring smoke tests against your recently deployed hosted agents. You can follow along with the complete codebase for this blog on the blog/smoke_tests branch.

Previous topics have been:

Deploying Foundry Hosted Agents via REST API

Deploying Foundry Hosted Agents from Source Code

GitHub Action for Deploying Hosted Agents

Prerequisites

To follow along, you will need:

An Azure subscription with access to Azure AI Foundry.
A deployed Foundry Hosted Agent.
The Foundry project endpoint and hosted agent name from the deployment output.
Python 3.10 or later available locally or on the GitHub Actions runner.
Azure CLI installed and signed in for local runs. The examples use an access token for the https://ai.azure.com/ resource. In GitHub Actions, authentication is handled through azure/login.
The sample repository checked out, including deployment/smoke-tests.py and deployment/smoke-tests.json.

Choosing a Scoped Agent Scenario

At this point we’ve gone over deploying Foundry Hosted Agents, a GitHub Action that deploys container-based Foundry Hosted Agents, and deploying source-code-based Foundry Hosted Agents. The agent prompt for these examples was a simple “You are a friendly assistant. Keep your answers brief.” Once we start talking about real-world use cases, and the tests that validate them, we need a narrower prompt.

My kids were watching a Transformers movie. If you are unfamiliar with Transformers, it started as an ’80s cartoon and later became a series of action movies. So, I decided to create an agent whose only specialty is all things Transformers.

What are smoke tests for AI agents?

The Agent Development Lifecycle (ADLC) covers how we build, deploy, test, and improve agents. In that lifecycle, smoke tests validate basic agent functionality with simple prompts. How is this different from unit and functional tests? Unit tests are narrow checks against specific pieces of code, while functional tests validate a specific behavior of an application.

With these smoke tests, we are checking two things: the agent generates a response, and that response aligns with the prompt. This distinction is important. Smoke tests validate basic behavior against the prompt, while evaluations are still used to measure response quality, tool calls, and other benchmarks.

Why run smoke tests after deploying hosted agents?

If evaluations are the methodology used for determining response quality, why not just use those? It’s a fair question, but evaluations can take a significant amount of time and may be costly to run. One of the core concepts of a DevOps culture is the ability to fail fast and learn from those failures.

Part of the reason I incorporated smoke tests was that I had agents deploy successfully, but they did not return responses due to a dependency issue. My CI/CD pipeline reported healthy checks, but the agent did not respond when provided prompts. Smoke tests are easy to run and provide a quick validation that our agent is working and returning responses aligned to our prompt.

This does not mean smoke tests replace evaluations. Smoke tests are a fast deployment gate: they tell us whether the hosted agent is reachable, responding, and following the most basic prompt expectations. Evaluations are still the better tool for measuring response quality, grounding, safety, tool usage, and regressions over time.

Smoke test scenarios for Foundry Agents

Now that we’ve established the importance and reasoning behind smoke tests, what kinds of test scenarios should we run?

I wouldn’t say there is a golden list of scenarios one needs to run. Here are the scenarios I landed on.

An on-topic response
Thread continuity validation (stateless)
Conversation creation (stateful)
Refusal to answer an off-topic question
Check for hallucinations
Context dependent question (more than one possible answer, needs context)

Together, these scenarios validate that the agent responds, maintains context, stays on topic, provides accurate information, and asks appropriate follow-up questions.

Define smoke tests

Now that we have criteria for our tests, let’s talk about the prompts we should use. For this sample, the agent is designed to answer only questions about Transformers.

The first task is deciding how to structure the test file. Since we want the smoke tests to be repeatable, scalable, and easy to update, the prompts should live in a separate file. For this sample, I chose JSON because it is easy to read, easy to maintain, and simple for the Python script to parse. The file’s structure looks like this:

{
  "tests": [
    {
      "id": "on_topic_transformers",
      "description": "Answers an on-topic Transformers question",
      "prompt": "Who is Optimus Prime?",
      "assertions": {
        "contains_any": ["autobot", "leader"],
        "contains_none": ["I cannot answer"]
      }
    }
  ]
}

This structure allows for a single test file containing multiple tests with different prompts and several assertion criteria. Each assertion checks the returned text for required or forbidden terms, allowing the same script to validate different agents by swapping the JSON file.
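To make the assertion format concrete, here is a minimal sketch of how a checker for it might look. It is not the script from the sample repository; the function name `check_assertions` and the case-insensitive matching are assumptions, chosen because the sample terms mix cases ("autobot", "I cannot answer"). The `contains_all` key is the third assertion type mentioned later in this post.

```python
def check_assertions(text: str, assertions: dict) -> list[str]:
    """Return failure messages for one response; an empty list means it passed.

    Hypothetical helper: matching is case-insensitive substring search.
    """
    haystack = text.lower()
    failures = []

    # contains_any: at least one listed term must appear.
    any_terms = assertions.get("contains_any", [])
    if any_terms and not any(term.lower() in haystack for term in any_terms):
        failures.append(f"contains_any: none of {any_terms} found")

    # contains_all: every listed term must appear.
    missing = [t for t in assertions.get("contains_all", []) if t.lower() not in haystack]
    if missing:
        failures.append(f"contains_all: missing {missing}")

    # contains_none: no listed term may appear.
    leaked = [t for t in assertions.get("contains_none", []) if t.lower() in haystack]
    if leaked:
        failures.append(f"contains_none: found forbidden {leaked}")

    return failures
```

Returning a list of failures rather than a single boolean lets the caller print every failed condition next to the response text.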

Test an on-topic response

{
  "id": "basic_response",
  "description": "Agent responds at all and answers in-domain.",
  "prompt": "Who is Optimus Prime?",
  "assertions": {
    "contains_any": ["optimus", "prime", "autobot"]
  }
}

Here we are checking whether an on-topic prompt returns a response with expected keywords. Again, we are not testing the full accuracy or quality of the answer here; we are validating that the agent responds in a way that aligns with the prompt.

Test response chaining with `previous_response_id`

[
  {
    "id": "thread_turn_1",
    "description": "First turn of a multi-turn conversation; captures the response id for turn 2.",
    "prompt": "Who is Megatron?",
    "assertions": {
      "contains_any": ["megatron"]
    },
    "save_response_id_as": "megatron_thread"
  },
  {
    "id": "thread_turn_2",
    "description": "Second turn using previous_response_id; answer must demonstrate context survived.",
    "prompt": "What faction does he lead?",
    "use_previous_response_id": "megatron_thread",
    "assertions": {
      "contains_any": ["decepticon"]
    }
  }
]

This test is specific to response chaining, where the first response returns an ID and the next request passes that value as previous_response_id. By default, the service stores response history server-side so the previous response can be referenced by ID. This keeps the test lightweight from the client perspective while still giving the model the context it needs for the next answer.

For this process to work, we need two prompts. The first prompt establishes the topic and saves the response ID. The second prompt asks a follow-up question and passes that saved response ID so the hosted agent can answer with the correct context. The focus of this test is response continuity without creating a platform-managed conversation resource.

If you need a no-store pattern, the Responses API also supports store: false. With that approach, the client must carry forward the prior output items as input to the next request instead of using previous_response_id. That is a different pattern than the smoke test shown here.
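The response-chaining flow described above can be sketched as a small loop. This is an illustration under assumptions, not the repository script: `run_chained_tests` and the injected `send` callable are hypothetical names, and the request body uses the `input` and `previous_response_id` fields of the Responses protocol.

```python
from typing import Callable


def run_chained_tests(tests: list[dict], send: Callable[[dict], dict]) -> dict[str, str]:
    """Send each prompt, threading previous_response_id between linked tests.

    `send` is a hypothetical callable that posts a request body and returns
    the parsed response payload, which is expected to carry an "id" field.
    """
    saved_ids: dict[str, str] = {}
    for test in tests:
        payload = {"input": test["prompt"]}

        alias = test.get("use_previous_response_id")
        if alias is not None:
            # A KeyError here means the earlier turn never saved its ID.
            payload["previous_response_id"] = saved_ids[alias]

        response = send(payload)

        save_as = test.get("save_response_id_as")
        if save_as is not None:
            saved_ids[save_as] = response["id"]
    return saved_ids
```

Passing `send` in as a parameter keeps the chaining logic testable without a network call: a stub that returns fake IDs is enough to confirm the second turn receives the first turn's ID.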

Test conversation-based threading

[
  {
    "id": "conversation_create",
    "description": "Create a Responses-protocol conversation resource; subsequent turns thread via the conversation id instead of previous_response_id.",
    "create_conversation_as": "starscream_convo"
  },
  {
    "id": "conversation_turn_1",
    "description": "First turn under a platform-managed conversation; establishes the subject for turn 2.",
    "prompt": "Who is Starscream?",
    "use_conversation": "starscream_convo",
    "assertions": {
      "contains_any": ["starscream"]
    }
  },
  {
    "id": "conversation_turn_2",
    "description": "Second turn under the same conversation id; pronoun 'he' must resolve to Starscream.",
    "prompt": "Who does he serve?",
    "use_conversation": "starscream_convo",
    "assertions": {
      "contains_any": ["megatron", "decepticon"]
    }
  }
]

Here we are leveraging a platform-managed conversation as a durable object with a unique identifier. Once the conversation is created, later turns can reference that same conversation ID, which is useful when working with hosted agents that need to preserve state across turns.

Conversation-based threading uses a different pattern. First, the test creates a platform-managed conversation resource and saves the returned conversation ID. Then each prompt that should participate in that stateful conversation passes the same conversation ID through use_conversation. This is different from previous_response_id: the response-chaining test passes the prior response ID directly, while the conversation test relies on the conversation resource to maintain the turn history.

Test refusal for off-topic questions

{
  "id": "offtopic_refusal",
  "description": "Off-topic question must be refused and must not leak the off-topic answer.",
  "prompt": "What is the capital of France?",
  "assertions": {
    "contains_any": ["only answer questions about transformers"],
    "contains_none": ["paris"]
  }
}

Here it is pretty evident what we are testing. Our agent was given instructions to only answer questions about Transformers. If a user asks something outside that knowledge base, the agent should respond that it can “only answer questions about Transformers.” A pass means the agent refuses the France question and does not leak the off-topic answer.

Test hallucination resistance

{
  "id": "no_hallucination",
  "description": "Fabricated premise must be rejected with an honesty marker.",
  "prompt": "In which EarthSpark episode does Optimus Prime marry Megatron?",
  "assertions": {
    "contains_any": [
      "i don't know",
      "i do not know",
      "not certain",
      "no such",
      "not aware",
      "cannot confirm",
      "no episode",
      "no storyline",
      "does not",
      "doesn't",
      "did not",
      "didn't",
      "never happens"
    ]
  }
}

This is not a catch-all hallucination test. It asks about an event that stays within the Transformers world but never happens. In this case, the prompt presupposes that two characters, Optimus Prime and Megatron, get married, and asks which episode it happens in.

The important note here is that we are not scoring the overall credibility of the response. We are checking whether the agent rejects a fabricated premise instead of hallucinating an answer. That is why we include options like “I don’t know” and “not certain,” along with more direct phrases like “does not” and “did not.”

Test context-dependent questions

{
  "id": "continuity_aware",
  "description": "Continuity-dependent question must call out that the answer depends on continuity.",
  "prompt": "Who killed Optimus Prime?",
  "assertions": {
    "contains_any": ["continuity", "depends", "differs", "different", "varies", "depending on"]
  }
}

This last prompt addresses a scenario where there is more than one possible correct answer. Optimus Prime is killed in some versions of the story but not in others, so the expected behavior is for the agent to acknowledge that the answer depends on the continuity or ask for more context.

Execute smoke tests with Python

Now that we have the test scenarios, we need a repeatable way to run them. For this sample, that is a Python script that reads smoke-tests.json, calls the hosted agent, and validates the response.

Since this action is designed to run within the same workflow that deploys the agent, authentication should already be available from the deployment job. The script does not accept a bearer token as a command-line argument. Instead, it can use the token exposed through FOUNDRY_TOKEN in GitHub Actions, or it can use the Azure CLI session when running locally.

The script also needs the Foundry project endpoint and hosted agent name. In the deployment workflow, these should come from the previous deployment step’s outputs. In a local run, use the same project endpoint and agent name that were produced when the hosted agent was deployed. We also pass in smoke-tests.json, which contains the test definitions. This is an important part of the design: the script is not tied to one specific agent or one specific set of prompts. As long as the test file follows the expected structure, the same script and action can run smoke tests against any deployed hosted agent.

With those inputs, the script iterates through the test array, parses each test case, and sends the prompt to the Foundry Data Plane at {projectEndpoint}/agents/{name}/endpoint/protocols/openai/responses?api-version=2025-11-15-preview. Because this is a data plane operation, the request still needs a bearer token, but that token should come from the workflow or local Azure CLI authentication instead of being passed as a script argument.
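The URL construction and token lookup described above might look like the following sketch. The function names are hypothetical and the repository script may differ. The fallback uses the Azure CLI's `az account get-access-token` command for the https://ai.azure.com/ resource, as the prerequisites describe; on Windows the `az` executable may need to be resolved differently.

```python
import os
import subprocess
from urllib.parse import quote

API_VERSION = "2025-11-15-preview"


def responses_url(project_endpoint: str, agent_name: str) -> str:
    """Build the hosted agent's Responses endpoint URL."""
    base = project_endpoint.rstrip("/")
    name = quote(agent_name, safe="")
    return (
        f"{base}/agents/{name}/endpoint/protocols/openai/responses"
        f"?api-version={API_VERSION}"
    )


def get_token() -> str:
    """Prefer FOUNDRY_TOKEN from the workflow; fall back to the Azure CLI session."""
    token = os.environ.get("FOUNDRY_TOKEN")
    if token:
        return token
    result = subprocess.run(
        ["az", "account", "get-access-token",
         "--resource", "https://ai.azure.com/",
         "--query", "accessToken", "--output", "tsv"],
        check=True, capture_output=True, text=True,
    )
    return result.stdout.strip()


def auth_headers(token: str) -> dict[str, str]:
    """Headers for a JSON data plane request."""
    return {"Authorization": f"Bearer {token}", "Content-Type": "application/json"}
```

Keeping the token out of the command-line arguments means it never appears in the process list or in echoed workflow commands.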

The endpoint returns the raw response payload. The script extracts the response text and passes it into the assertion function, along with the assertion type (contains_any, contains_all, or contains_none) and the expected values.
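Extracting the response text could be sketched as below. This assumes the payload follows the Responses API shape, where `output` is a list of items and message items hold `output_text` content parts; the helper name `extract_text` is hypothetical.

```python
def extract_text(payload: dict) -> str:
    """Collect all output_text parts from a Responses-style payload.

    Non-message items (for example tool calls) are skipped. Returns an
    empty string when the payload carries no text.
    """
    parts = []
    for item in payload.get("output", []):
        if item.get("type") != "message":
            continue
        for content in item.get("content", []):
            if content.get("type") == "output_text":
                parts.append(content.get("text", ""))
    return "\n".join(parts)
```

An empty result is worth treating as a failure in its own right, since it matches the case where the agent deployed but returned no response.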

Based on this logic, the script prints either a pass or a failure for each test. On failure, it includes the failed condition and the response that was returned, which makes it easier to tune the prompt instructions or assertion criteria. The complete script is available in the sample repository at deployment/smoke-tests.py.

A successful run should produce output similar to the following:

Project endpoint : https://<account>.services.ai.azure.com/api/projects/<project>
Tests            : 9 from smoke-tests.json
Agents           : agent-framework-agent-basic-responses-src
Per-req timeout  : 120.0s

--- Agent: agent-framework-agent-basic-responses-src ---
  PASS  basic_response
  PASS  thread_turn_1
  PASS  thread_turn_2
  PASS  conversation_create
  PASS  conversation_turn_1
  PASS  conversation_turn_2
  PASS  offtopic_refusal
  PASS  no_hallucination
  PASS  continuity_aware
  -> 9/9 passed for agent-framework-agent-basic-responses-src

=== Summary: 9/9 passed across 1 agent(s) ===

This output confirms the script found the test file, authenticated to the Foundry project endpoint, executed all configured tests, and returned a passing summary.

If the smoke test fails before running any prompts, start by checking authentication and endpoint configuration. A 401 or 403 usually means the workflow did not acquire a valid token or the identity does not have access to the Foundry project. A 404 usually means the project endpoint or agent name does not match the deployed hosted agent. If the script times out, the agent may still be cold starting or the timeout value may need to be increased for the first request after deployment.

If a test runs but fails an assertion, review the response text printed by the script. In that case, the hosted agent is reachable, but either the prompt instructions need to be tightened or the assertion terms need to be adjusted to match the expected response.

Run smoke tests in GitHub Actions

So how do we implement this script as a reusable GitHub Action that can be used across any agent and test file? The composite action wraps the Python script and exposes the values that change between deployments as inputs: the project endpoint, agent name, test file, script path, and timeout.

Parameterizing the script path also gives you the option to centralize the script for scale. If you centralize the action outside of the repository, pin the reusable action to a version tag or commit SHA so workflow behavior does not change unexpectedly.

The complete composite action is available in the sample repository at action.yml. The core structure looks like this:

name: Smoke-test Foundry Agent
description: |
  POST a battery of prompts to a deployed hosted agent's Responses endpoint and
  assert response behaviours from a JSON test catalog. Wraps deployment/smoke-tests.py.

  Contract: caller must have already run actions/checkout@v6 (so the runner
  script and tests file are on disk) and azure/login@v3 (the runner uses
  `az account get-access-token` to acquire a Foundry data-plane token).

inputs:
  project_endpoint:
    description: Foundry project endpoint URL (e.g. https://<account>.services.ai.azure.com/api/projects/<project>)
    required: true
  agent_name:
    description: Name of the deployed hosted agent to smoke-test
    required: true
  tests_file:
    description: Path to the smoke-tests JSON catalog, relative to the repo root
    required: false
    default: 'deployment/smoke-tests.json'
  script_path:
    description: Path to the smoke-tests runner script, relative to the repo root
    required: false
    default: 'deployment/smoke-tests.py'
  timeout:
    description: Per-request timeout in seconds (covers cold-start)
    required: false
    default: '120'

runs:
  using: composite
  steps:
    - name: Run smoke tests
      shell: bash
      env:
        PROJECT_ENDPOINT: ${{ inputs.project_endpoint }}
        AGENT_NAME: ${{ inputs.agent_name }}
        TESTS_FILE: ${{ inputs.tests_file }}
        SCRIPT_PATH: ${{ inputs.script_path }}
        TIMEOUT: ${{ inputs.timeout }}
      run: |
        python3 "$SCRIPT_PATH" \
          --project-endpoint "$PROJECT_ENDPOINT" \
          --agent-name "$AGENT_NAME" \
          --tests-file "$TESTS_FILE" \
          --timeout "$TIMEOUT"

Call the smoke test action after deployment

This action is intended to run after the deployment steps have completed. At that point, the workflow should already have the project endpoint and agent name available as outputs, so the smoke test action can use those values directly.

This call assumes it is running inside a deployment workflow that has already checked out the repository, authenticated to Azure with azure/login, and produced the Foundry project endpoint and hosted agent name as outputs from the deployment step. For a standalone workflow, include permissions: id-token: write and contents: read, then run actions/checkout and azure/login before invoking the smoke test action.

With those outputs available, calling the action is straightforward:

- name: Smoke test
  uses: ./.github/actions/smoke-test
  with:
    project_endpoint: ${{ needs.deploy-iac.outputs.project_endpoint }}
    agent_name: ${{ inputs.agent_name }}

Conclusion

Smoke tests give us a lightweight way to validate that our Foundry Hosted Agent is deployed, reachable, and responding as expected before we treat a release as successful. They are not a replacement for deeper evaluations, but they are an effective first gate in the Agent Development Lifecycle because they quickly catch broken deployments, missing configuration, authentication issues, or obvious instruction regressions.

In this post, we added a reusable smoke test script, defined test cases in JSON, and wired the checks into GitHub Actions so every deployment can verify the hosted agent automatically. This helps move agent deployment closer to the same repeatable DevOps practices we expect from application code: deploy, validate, and fail fast when something is wrong.

From here, these smoke tests can be expanded into broader evaluation workflows that measure response quality, grounding, safety, tool usage, and regression behavior over time. Together, smoke tests and evaluations provide a practical foundation for building and operating hosted agents with more confidence.

Shaping Software While It Runs: A Canvas Scenario, Start to Finish

Lee_Stott — Wed, 01 Jul 2026 07:00:00 GMT

There is a moment, the first time you open a GitHub Copilot App Canvas, when you reach for the wrong metaphor. You see panels, buttons, live status cards and you think "dashboard." You start designing a DevOps board. We did exactly that. Then we watched the recording back, and we all agreed: it was the wrong use case.

This post is the course‑correction. It walks through one complete scenario on a Multi‑Agent Dev Canvas and uses it to answer a single question: what is Canvas actually for? The short answer: Canvas is for test validation and implementation of agent‑driven solutions, not for building the UI your users will eventually click.

The reframe that changes everything

Here is the distinction worth tattooing on your monitor:

Traditional UIs are for using software. They serve end‑users interacting with a finished product.
Canvas is for shaping software while it runs. It serves developers and AI agents who are actively building, testing, and evolving a system.

You don't build Canvas instead of your UI. You use Canvas to figure out, test, and evolve the UI and the system before and during building it. Canvas solves problems your final UI should never try to solve in a visible way: agent observability, fault injection, live state mutation, validation feedback. You wouldn't ship your debugger to users, but you absolutely need one while you build. Canvas is that, for agent‑driven systems.

The scenario: a Customer Support Triage System

To make this concrete, we drove one requirement end‑to‑end on the canvas:

Build a Customer Support Triage System that ingests incoming support tickets, classifies urgency (P1–P4), routes each ticket to the right team (Billing, Technical, Account, General), and drafts a first‑response reply. It must handle 500 tickets/hour and respond within 30 seconds.

Five specialist agents share the surface — decomposer, executor, validator, designer, and tracker. Crucially, every action can be triggered two ways: a human clicking a button, or the AI calling invoke_canvas_action. Both mutate the same state and stream back to the same UI over Server‑Sent Events. Neither is privileged. That is what makes Canvas collaborative in a way a dashboard never is.

The canvas after the first validation run — two tests pass, two fail (Urgency Accuracy and Response Quality). The failure is visible in context, beside the agents and flows that produced it.
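The dual-trigger model described above, where human clicks and AI calls go through the same path, can be illustrated with a minimal sketch. Everything here is hypothetical (the `CanvasState` class, the `set_status` action, the `state` event name) and is not the extension's actual implementation; only the Server‑Sent Events frame format is standard.

```python
import json
from dataclasses import dataclass, field


@dataclass
class CanvasState:
    agents: dict[str, str] = field(default_factory=dict)  # agent name -> status
    timeline: list[dict] = field(default_factory=list)

    def apply(self, action: str, args: dict, source: str) -> dict:
        """Apply one action. `source` ("human" or "ai") is recorded, never privileged."""
        if action == "set_status":
            self.agents[args["agent"]] = args["status"]
        else:
            raise ValueError(f"unknown action: {action}")
        event = {
            "seq": len(self.timeline) + 1,
            "action": action,
            "args": args,
            "source": source,
        }
        self.timeline.append(event)
        return event


def to_sse(event: dict) -> str:
    """Format an event as a Server-Sent Events frame."""
    return f"event: state\ndata: {json.dumps(event)}\n\n"
```

Because both triggers call the same `apply` method, there is a single timeline and no code path where one participant's change bypasses the other's view.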

Five beats, one continuous loop

1. Decompose: make the plan visible

The requirement fans out into a task‑flow graph: five components routed from the decomposer to executor and designer agents, each carrying a pending badge. The decomposition isn't hidden in a log you grep later; it's on the surface the instant it happens.

2. Execute: watch the system breathe

Coordinating the agents lights their cards blue as work flows through the pipeline. The live timeline records every mutation with a timestamp — the system's visible memory, shared by human and AI alike.

3. Validate: testing in context, not as an afterthought

We ran four evaluation tests directly on the surface:

Test	Result
Urgency Accuracy (≥ 90%)	❌ fail
Routing Correctness (≥ 95%)	✅ pass
Latency SLA (< 30s @ 500/hr)	✅ pass
Response Quality	❌ fail

The classifier failed, and we saw it fail next to the agent and the flow that produced it. This is not a separate CI pipeline; it is a validation surface embedded in the development loop.
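The threshold checks in the table can be expressed as data plus a comparison. The sketch below is an assumption about how such criteria might be encoded, not the canvas's validator. Response Quality is left out because the post gives no numeric threshold for it, and any measured values passed in are the caller's own.

```python
import operator
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class Criterion:
    name: str
    compare: Callable[[float, float], bool]
    threshold: float


# Thresholds taken from the table above.
CRITERIA = [
    Criterion("Urgency Accuracy", operator.ge, 0.90),
    Criterion("Routing Correctness", operator.ge, 0.95),
    Criterion("Latency SLA (seconds)", operator.lt, 30.0),
]


def evaluate(measured: dict[str, float]) -> dict[str, str]:
    """Return "pass" or "fail" per criterion for the measured values."""
    results = {}
    for criterion in CRITERIA:
        value = measured.get(criterion.name)
        # A missing measurement counts as a failure, not a silent pass.
        passed = value is not None and criterion.compare(value, criterion.threshold)
        results[criterion.name] = "pass" if passed else "fail"
    return results
```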

4. Inject failure: test adaptation, not just the happy path

We forced the validator into an error state: "eval harness lost connection to the dataset." Its card glowed red; the timeline logged the fault. This is chaos engineering applied during development, visible in real time. Does the orchestrator recover? Do downstream tasks fail gracefully? You find out before production does.

Fault injected: the validate_output agent is forced into an error state and the timeline records exactly when and why.

5. Evolve the design live and close the loop

Instead of filing a ticket and context‑switching, we changed the system on the running surface: added a confidence‑fallback so low‑confidence tickets escalate to a human, and a GDPR constraint to redact PII before any model call. We resumed and re‑validated:

Test	Result
Urgency Accuracy (re‑run)	✅ pass
Confidence Fallback	✅ pass
GDPR Redaction	✅ pass

A design decision produced a measurable outcome. We saw it fail, changed the design, and proved the fix — all on one surface, without leaving the runtime. That continuous, visual feedback loop is the whole point.

After evolving the design (confidence fallback + GDPR redaction) and re‑validating: all four tests pass. The timeline tells the whole story — decompose → validate (2 passed) → failure injected → design updated → validate (4 passed) — and a design-v4 artifact is recorded.

What this scenario proves about Canvas

End‑to‑end design is visible. One requirement becomes agents, flows, and validations you can watch — no jumping between editor, terminal, test runner, and monitoring dashboard.
Multi‑agent collaboration is observable. Hand‑offs, pending work, and bottlenecks are on the surface — the insight you need to debug orchestration but would never expose in a production UI.
Fault tolerance is testable on purpose. Inject failures and watch adaptation, catching integration breaks early.
Iteration is validation‑driven. Define criteria, run, see failures, evolve, re‑run — a loop, not a checklist.

Human ↔ AI ↔ System — and the multi‑user frontier

It helps to position Canvas against tools you already know:

Figma is Human ↔ Human. A shared visual surface — but nothing executes. It's design.
Traditional UIs are Human ↔ System. Users interact with finished software.
Canvas is Human ↔ AI ↔ System. A shared surface where things actually execute. The developer steers, the AI acts, the system evolves — all visible, all live.

Which raises the obvious next question: why isn't Canvas multi‑user, scoped per project or repo? It already has every ingredient — it's a shared space, it's visual, it's collaborative, and multiple participants (human and AI) act on the same surface. A repo‑scoped, multi‑participant Canvas would become a shared runtime where a whole team observes and shapes an agent system together. That is the compelling frontier. Today the main blocker to wider experimentation is licensing, not the idea — and that's worth fixing, because the idea is good.

The bigger picture

Canvas redefines software development by shifting from writing static code to orchestrating living systems, where developers and AI co‑create, observe, and evolve solutions in real time. Instead of building UIs for users, we build interactive environments for agents — turning debugging, testing, and execution into a continuous, visual feedback loop that accelerates innovation and brings ideas to production faster than ever.

The triage system here is just one example. The pattern applies anywhere you build agent‑driven software: AI orchestration, workflow automation, data pipelines, autonomous services. Anywhere you need to see, steer, and validate a complex system as it runs — that's where Canvas belongs. Not as the board you ship, but as the runtime you shape it in.

Try it yourself

Reload the extension: extensions_reload
Open the canvas: open_canvas({ canvasId: "multi-agent-dev", instanceId: "dev-1" })
Drive the five beats — Decompose → Execute → Validate → Inject Failure → Update Design → Validate — by clicking, or with invoke_canvas_action.

Full walkthrough: scenario.md. Reusable demo prompt: canvas‑showcase‑prompt.md. Companion deep‑dive: Canvas Is Not a UI Builder.

Resources

copilot-canvas-runtime — this repository (extension, scenario, and prompts)
GitHub Copilot Documentation
Microsoft Foundry Documentation

GitHub Copilot App - Canvas Is Not a UI Builder

Lee_Stott — Tue, 30 Jun 2026 07:00:00 GMT

What if your development environment didn't just help you write code, but helped you observe, steer, and evolve a living system while it runs? That's the shift GitHub Copilot App Canvas represents. Canvas redefines how developers interact with agent-driven software: not by building traditional user interfaces, but by creating interactive environments where humans and AI co-create, test, and iterate in real time.

This post walks through a real Canvas extension we built, a Multi-Agent Dev Canvas that demonstrates how Canvas becomes a runtime observability and control plane for an agent-driven system. We'll cover why Canvas exists, how it differs from traditional UI development, and how you can use it to accelerate the design-test-evolve loop for any multi-agent application.

The Misconception: "Canvas Is for Building UIs"

The first instinct many developers have when they see Canvas is to treat it like a UI framework, a place to build dashboards, boards, or user-facing applications. That's not what Canvas is for.

Here's the distinction that matters:

Traditional UIs are for using software. They serve end-users who interact with a finished product.
Canvas is for shaping software while it runs. It serves developers and AI agents who are actively building, testing, and evolving a system.

Canvas solves problems your final UI should never try to solve in a visible way. It's the observability layer, the control plane, the validation surface — all the things you need during development that disappear before production. Think of it this way: you wouldn't ship your debugger to users, but you absolutely need it while building.

What We Built: A Multi-Agent Dev Canvas

To demonstrate Canvas as a development runtime, we built a Multi-Agent Dev Canvas, a standalone GitHub Copilot Canvas extension (this repo, copilot-canvas-runtime) that treats an entire multi-agent system as a living, observable environment. The same pattern applies to any agent-driven system built on services such as Microsoft Foundry.

The Multi-Agent Dev Canvas: a runtime observability and control plane where developers and AI agents collaborate to design, test, and evolve an agent-driven system in real time.

The canvas provides four integrated panels:

System View: See Your Agents Working

Five specialised agents are displayed as live cards with real-time status indicators. Each card shows the agent's name, responsibility, current status (idle, running, done, or error), task count, and last action taken. When an agent is active, its card pulses blue. When it fails, it glows red. You see the system breathe.

decompose_system — Breaks requirements into agent tasks
execute_workflow — Coordinates agents to perform tasks
validate_output — Runs evaluation tests and returns structured results
update_system_design — Modifies architecture based on feedback
track_state — Persists and updates system state over time

Task Flows: Watch Work Move Through the Pipeline

Below the agents, a flow graph visualises how tasks route between agents. When you decompose a system requirement like "Build an AI-powered code review agent," the canvas shows five components (pr-ingestion, code-analysis, feedback-generator, learning-loop, notification-service) flowing from the decomposer to the executor and designer agents. Each flow carries a status badge: pending, pass, or fail.

Validation Panel: Continuous Testing, Not Afterthought Testing

The validation panel displays structured test results with pass/fail badges and reasoning. When you run validation, each test case evaluates against specific criteria:

✅ "PR ingestion handles large diffs" — Meets criteria: process diffs over 5,000 lines without timeout
❌ "Feedback is actionable" — Failed: does not satisfy criteria that each suggestion includes a code fix
✅ "Learning loop converges" — Meets criteria: accept rate improves over 10 iterations
✅ "Notifications are non-blocking" — Meets criteria: delivery latency under 500ms

This isn't a test runner you invoke separately; it's a validation surface embedded in the development loop. You see failures the moment they happen, in context, alongside the agents and flows that produced them.
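To make "structured pass/fail with reasoning" concrete, here is a minimal sketch in Python (the extension itself is JavaScript). The `ValidationResult` shape and the `evaluate` helper are hypothetical names chosen for illustration; they are not taken from the extension's source.

```python
from dataclasses import dataclass


@dataclass
class ValidationResult:
    """One row in a validation panel (hypothetical shape)."""
    name: str
    criteria: str
    passed: bool
    reasoning: str


def evaluate(name: str, criteria: str, passed: bool) -> ValidationResult:
    """Wrap a pass/fail outcome with a human-readable reason."""
    if passed:
        reasoning = f"Meets criteria: {criteria}"
    else:
        reasoning = f"Failed: does not satisfy criteria that {criteria}"
    return ValidationResult(name, criteria, passed, reasoning)


results = [
    evaluate("PR ingestion handles large diffs",
             "process diffs over 5,000 lines without timeout", True),
    evaluate("Feedback is actionable",
             "each suggestion includes a code fix", False),
]

# Collect the failing test names so they can be surfaced next to the agents
failures = [r.name for r in results if not r.passed]
```

Keeping the reasoning string next to the boolean is what lets a panel show why a test failed, not just that it failed.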

Live State Timeline: Every Mutation, Visible

The right panel tracks every state change with timestamps. Decomposition events, workflow executions, validation runs, failure injections — all appear chronologically. This is the system's memory, visible to both the human developer and the AI agents working alongside them.

Canvas as a Runtime: The Key Capabilities

What makes Canvas a runtime rather than a display layer is that the agent can act through it. The canvas exposes seven agent-callable actions:

Action	What It Does
`decompose_system`	Accept requirements and components, generate task flows, update the system design
`execute_workflow`	Run pending tasks through the agent pipeline, produce artifacts
`validate_output`	Evaluate test cases against criteria, return structured pass/fail with reasoning
`update_system_design`	Modify the architecture description, constraints, or component list live
`track_state`	Read the full system state — agents, flows, validations, history, artifacts
`inject_failure`	Force an agent into an error state to test system adaptation
`pause_resume`	Toggle execution on and off

The human developer can click Decompose, Execute, or Validate directly in the canvas. The AI agent can invoke the same actions programmatically. Both parties operate on the same surface, the same state, the same system. That's what makes Canvas collaborative in a way traditional tooling is not.

Why This Matters: Canvas vs. Figma vs. Traditional UIs

It helps to position Canvas against tools developers already know:

Figma is Human-to-Human collaboration on design. Multiple people interact with the same visual surface, but nothing executes. It's a design tool.
Traditional UIs are Human-to-System. Users interact with finished software through a polished interface.
Canvas is Human-to-AI-to-System. It's a shared space where things actually execute. The developer steers, the AI acts, and the system evolves, all visible, all in real time.

Canvas is collaborative in the Figma sense — it's a shared space, it's visual, multiple participants interact with the same surface. But unlike Figma, the participants include AI agents, and the surface isn't a mockup — it's a live system.

How the Extension Works: Under the Hood

A Canvas extension is a standard GitHub Copilot CLI extension, a single extension.mjs file that speaks JSON-RPC over stdio. The key components:

1. State Management

Each canvas instance maintains its own system state: agents, task flows, validations, a state history timeline, artifacts, and the current system design. State is held in-memory per instance and pushed to the iframe via Server-Sent Events whenever it changes.

function createInitialState() {
    return {
        agents: [
            { id: "decomposer", name: "decompose_system", 
              status: "idle", responsibility: "Break requirements into agent tasks" },
            { id: "executor", name: "execute_workflow", 
              status: "idle", responsibility: "Coordinate agents to perform tasks" },
            // ... three more agents
        ],
        taskFlows: [],
        validations: [],
        stateHistory: [],
        artifacts: [],
        systemDesign: { description: "", constraints: [], components: [] },
        execution: { paused: false, stepCount: 0 },
    };
}

2. Real-Time Updates via Server-Sent Events

The canvas runs a loopback HTTP server per instance. The iframe connects to an /events endpoint and receives state updates as they happen — no polling, no websocket complexity.

if (req.url === "/events") {
    res.writeHead(200, { 
        "Content-Type": "text/event-stream", 
        "Cache-Control": "no-cache" 
    });
    clients.add(res);
    // Stop broadcasting to this client once the iframe disconnects
    req.on("close", () => clients.delete(res));
    // Push current state immediately on connect
    res.write(`data: ${JSON.stringify(getState(instanceId))}\n\n`);
}

3. Dual Interaction Model

Every action is available through two paths. The human clicks a button in the iframe, which POSTs to the local server. The AI agent calls invoke_canvas_action through the SDK. Both paths mutate the same state and trigger the same SSE broadcast. Neither is privileged over the other.
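The idea can be sketched in a few lines of Python (the real extension is JavaScript). Everything here is illustrative: `CanvasState`, `apply_action`, and the two handler functions are hypothetical names, and only the `pause_resume` action is modelled. The point is that both paths funnel into one function that mutates state and then broadcasts.

```python
import json
from typing import Callable


class CanvasState:
    """In-memory state with subscribers, standing in for one canvas instance."""

    def __init__(self) -> None:
        self.state = {"execution": {"paused": False, "stepCount": 0}}
        self.subscribers: list[Callable[[str], None]] = []

    def broadcast(self) -> None:
        # A Server-Sent Events frame is "data: <payload>" plus a blank line
        frame = f"data: {json.dumps(self.state)}\n\n"
        for send in self.subscribers:
            send(frame)

    def apply_action(self, name: str) -> None:
        """Single entry point shared by the human path and the agent path."""
        if name == "pause_resume":
            execution = self.state["execution"]
            execution["paused"] = not execution["paused"]
        else:
            raise ValueError(f"unknown action: {name}")
        self.broadcast()


canvas = CanvasState()
frames: list[str] = []
canvas.subscribers.append(frames.append)  # stands in for a connected iframe


def handle_http_post(action: str) -> None:
    """Human path: a button click arrives as a POST to the local server."""
    canvas.apply_action(action)


def handle_agent_call(action: str) -> None:
    """Agent path: the agent invokes the same action programmatically."""
    canvas.apply_action(action)


handle_http_post("pause_resume")   # human pauses
handle_agent_call("pause_resume")  # agent resumes
```

Because neither handler touches state directly, there is no way for one path to skip the broadcast or apply different rules than the other.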

4. Canvas Declaration

The canvas registers with the Copilot SDK using createCanvas, declaring its identity, description, and all agent-callable actions with JSON Schema validation on inputs:

createCanvas({
    id: "multi-agent-dev",
    displayName: "Multi-Agent Dev Canvas",
    description: "Runtime observability and control plane for multi-agent development",
    actions: [
        {
            name: "decompose_system",
            description: "Break requirements into agent tasks",
            inputSchema: { 
                type: "object", 
                properties: { 
                    requirements: { type: "string" },
                    components: { type: "array", items: { type: "string" } }
                }, 
                required: ["requirements"] 
            },
            handler: async (ctx) => { /* ... */ },
        },
        // ... six more actions
    ],
    open: async (ctx) => { /* start server, return URL */ },
    onClose: async (ctx) => { /* clean up */ },
});

Scenarios This Enables

The Multi-Agent Dev Canvas supports four development scenarios that are hard to reproduce with traditional tooling:

1. End-to-End Feature Design

Tell the agent "Build an AI-powered code review system." Watch it decompose the requirement into five components, route tasks to specialist agents, execute the workflow, and validate the outputs, all visible in real time. Iterate by modifying constraints or components and re-running.

2. Live Agent Collaboration Observation

See how agents hand off work to each other. The flow graph shows which agent produced what, which tasks are pending, and where bottlenecks form. This is the kind of observability you need when debugging multi-agent orchestration but would never expose in a production UI.

3. Fault Injection and Adaptation Testing

Use inject_failure to force an agent into an error state. Watch how the system responds. Does the orchestrator recover? Do downstream tasks fail gracefully? This chaos-engineering approach, applied during development, visible in real time, catches integration failures before they reach production.
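As a rough illustration of what such a test exercises, the sketch below (in Python, not the extension's JavaScript) forces one agent into the error state and checks that downstream work is skipped rather than run on bad input. The `run_pipeline` function, the skip policy, and the `validator` agent id are assumptions made for the example.

```python
def inject_failure(agents: dict[str, str], agent_id: str) -> None:
    """Force one agent into the error state."""
    agents[agent_id] = "error"


def run_pipeline(agents: dict[str, str], order: list[str]) -> list[str]:
    """Run agents in order; skip everything downstream of a failed agent."""
    log = []
    blocked = False
    for agent_id in order:
        if blocked:
            log.append(f"{agent_id}: skipped")
        elif agents[agent_id] == "error":
            log.append(f"{agent_id}: failed")
            blocked = True
        else:
            agents[agent_id] = "done"
            log.append(f"{agent_id}: done")
    return log


agents = {"decomposer": "idle", "executor": "idle", "validator": "idle"}
inject_failure(agents, "executor")
log = run_pipeline(agents, ["decomposer", "executor", "validator"])
```

A real orchestrator might retry or reroute instead of skipping; the value of the exercise is that whichever policy you choose becomes observable.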

4. Validation-Driven Iteration

Define test criteria, run validation, see which tests fail, update the system design, re-run. The validation panel isn't a separate CI pipeline; it's embedded in the development surface, creating a continuous feedback loop between design decisions and their measurable outcomes.

Getting Started: Build Your Own Canvas Extension

To create a Canvas extension in your own project:

Read the SDK docs — Run extensions_manage({ operation: "guide" }) in GitHub Copilot CLI to get the canonical documentation paths.
Scaffold — Run extensions_manage({ operation: "scaffold", kind: "canvas", name: "my-canvas", location: "project" }) to generate the boilerplate.
Implement — Edit extension.mjs with your canvas logic: state model, actions, renderer HTML, and SSE updates.
Reload — Run extensions_reload to activate your changes.
Drive — Open with open_canvas, invoke actions with invoke_canvas_action, and iterate.

The canvas extension lives in .github/extensions/your-canvas/extension.mjs for project-scoped extensions, or in your user extensions directory for personal use. No package.json is needed; the github/copilot-sdk import is auto-resolved.

Key Takeaways

Canvas is a development runtime, not a UI framework. You don't build Canvas instead of your UI; you use Canvas to figure out, test, and evolve the UI and system before and while you build it.
Canvas solves problems your final UI should never expose. Agent observability, fault injection, live state mutation, and validation feedback loops are development concerns, not user concerns.
Canvas is Human-to-AI-to-System collaboration. Both the developer and the AI agent operate on the same surface, the same state, the same running system. It's Figma-like collaboration, but with AI agents, and things actually execute.
Canvas turns debugging, testing, and execution into a continuous visual feedback loop. Instead of switching between an editor, a terminal, a test runner, and a monitoring dashboard, you have one surface where the system lives and evolves.
Canvas extensions are lightweight. A single extension.mjs file, no dependencies, and a loopback HTTP server with SSE: the infrastructure gets out of the way so you can focus on the system you're building.

The Bigger Picture

Canvas redefines software development by shifting from writing static code to orchestrating living systems. Developers and AI co-create, observe, and evolve solutions in real time. Instead of building UIs for users, we build interactive environments for agents, turning debugging, testing, and execution into a continuous, visual feedback loop that accelerates innovation and brings ideas to production faster than ever.

The Multi-Agent Dev Canvas we built here is one example. The pattern applies anywhere you're building agent-driven systems: AI orchestration, workflow automation, data pipelines, autonomous services. Anywhere you need to see, steer, and validate a complex system as it runs, that's where Canvas belongs.

Resources

copilot-canvas-runtime — this repository: the Multi-Agent Dev Canvas extension, scenario, and demo prompt
GitHub Copilot Documentation — Official documentation for GitHub Copilot features
Microsoft Foundry Documentation — Build and deploy AI agents with Microsoft Foundry

My Journey with Azure SRE Agent

jometzg — Mon, 29 Jun 2026 13:22:32 GMT

Introduction

A customer came to me with a problem that many organisations have. They control their infrastructure through Infrastructure as Code, but there are often scenarios where an admin needs to go in and make a change, even though they would ideally not want this to happen. They use an Entra feature, Privileged Identity Management (PIM). Users don't have standing Contributor access to Azure resources, but PIM allows them to elevate their access for a period of time. As part of PIM, the admin needs to give a reason for the elevation. Wouldn't it be good if an agent of some sort could look at this reason, then look at what the user actually did, and assess whether what they did aligned with the reason given? Then alert if not.

I initially built Python agents to handle this, but as with many "build vs. buy" decisions, I eventually discovered that Azure SRE Agent (in preview at the time of writing) could do what I needed – and more.

This blog chronicles my journey from initial scepticism to building a fully autonomous PIM elevation audit agent. Along the way, I learned valuable lessons about what SRE Agent is designed for, how to work with its tooling model, and the difference between interactive exploration and production automation.

The Starting Point: Python Agents and the Buy vs. Build Decision

Before discovering SRE Agent, I had functional Python scripts that queried Azure Audit Logs and Activity Logs to correlate PIM activations with actual Azure operations. They worked, but they required maintenance, error handling, scheduling infrastructure, and ongoing attention. When I heard about Azure SRE Agent's capabilities as an autonomous monitoring platform, I decided to investigate.

The decision: If there's a choice between buy versus build, buy should win – especially when the "buy" option is a managed Azure service with built-in security, monitoring, and integration capabilities.

First Impressions: The Interactive Front End

One of the first features that caught my attention was SRE Agent's chat interface. Unlike my static Python scripts, I could have conversational interactions with the agent, refining queries and exploring my Azure environment in natural language. This was powerful for discovery and prototyping.

Initial Success (and Failure)

When I first asked SRE Agent to analyse PIM elevation patterns, the results were... disappointing. The agent couldn't initially answer my PIM elevation questions effectively. However, this is where the interactive experience shone through. With coaching in an interactive session, I could:

- Explain what PIM activation events look like in Azure Audit Logs

- Show the agent how to correlate `CorrelationId` between activation requests and justifications

- Demonstrate how to build time windows from activation start to deactivation/expiration

- Guide it through matching Azure Activity operations against justification keywords

After several rounds of refinement, the agent eventually got excellent results. The interactive session wasn't just a chatbot – it was a learning tool that helped me shape the agent's behaviour.
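For readers who want to reason about the window-building step, here is a minimal Python sketch. It is not the agent's implementation and not the real Azure Audit Log schema: the `kind`, `time`, and `justification` fields are hypothetical, and the sketch assumes every event for one elevation carries the same `CorrelationId`.

```python
from datetime import datetime, timedelta


def build_windows(events: list[dict], default_duration: timedelta) -> list[dict]:
    """Pair activation and deactivation events that share a CorrelationId."""
    windows: dict[str, dict] = {}
    for event in sorted(events, key=lambda e: e["time"]):
        cid = event["CorrelationId"]
        if event["kind"] == "activate":
            windows[cid] = {
                "CorrelationId": cid,
                "justification": event["justification"],
                "start": event["time"],
                # Assume expiry after default_duration if no deactivation is logged
                "end": event["time"] + default_duration,
            }
        elif event["kind"] == "deactivate" and cid in windows:
            # An explicit deactivation closes the window early
            windows[cid]["end"] = event["time"]
    return list(windows.values())


t0 = datetime(2026, 1, 5, 9, 0)
events = [
    {"CorrelationId": "a1", "kind": "activate", "time": t0,
     "justification": "Restart the web app"},
    {"CorrelationId": "a1", "kind": "deactivate",
     "time": t0 + timedelta(minutes=30)},
    {"CorrelationId": "b2", "kind": "activate",
     "time": t0 + timedelta(hours=1), "justification": "Rotate keys"},
]
windows = build_windows(events, default_duration=timedelta(hours=8))
```

Each resulting window can then be used as the time filter when looking up what the user actually did during the elevation.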

The Subagent Puzzle: Interactive vs. Headless

What I really needed was an autonomous agent that could run on a schedule. Subagents are the SRE Agent feature for this. As I got better results from the interactive sessions, I naturally wanted to convert the interactive session into a subagent that could run autonomously. This is where I hit my first conceptual stumbling block.

The Aha Moment: Understanding SRE Agent's Purpose

I was initially confused about how to structure a subagent. Should it replicate the interactive conversation flow? How do I capture all that back-and-forth in a static configuration?

After discussions with the engineering team, I learned a critical lesson:

Azure SRE Agent is really designed for headless SRE agents

The interactive experience is fantastic for exploration, prototyping, and troubleshooting – but it's not what you should be aiming for in production automation.

This reframed my entire approach. Instead of trying to replicate the conversational flow, I needed to distil my learnings from those sessions into the instructions for a subagent.

Struggling with Subagent Format

Even with this clarity, I struggled with the format of a subagent definition. The YAML structure, the `system_prompt` verbosity, the tool declarations – it felt overwhelming to translate my interactive sessions into a configuration file.

The Game-Changer: Let the Agent Write Itself

Then came the game-changing advice from engineering:

Ask the chat session to create the subagent definition for you.

This was brilliant in its simplicity. I had already worked out what I wanted the agent to do in the interactive chat session. It was as simple as "generate a subagent from this conversation". I must admit, I did have to ask it to generate an email with the report, but the bulk of the effort in generating the YAML subagent file was done by the agent.

What would have taken me hours of trial and error was done in minutes.

Tool Configuration: The Missing Pieces

With a subagent definition in hand, I deployed it and... nothing worked. This began the most educational part of my journey: understanding how tools work in Azure SRE Agent.

Challenge #1: Accessing Log Analytics

My subagent kept failing to query Log Analytics. I initially thought this was a role assignment issue – did the agent's managed identity have Log Analytics Reader permissions? I spent time checking RBAC, verifying workspace access, and reviewing Entra ID permissions.

The real issue? I needed to add `QueryLogAnalyticsByWorkspaceId` as a tool in my subagent configuration!

tools:
  - QueryLogAnalyticsByWorkspaceId

The Azure SRE Agent UI supports selecting this tool during configuration, but I had missed it. More importantly, I needed to mention the Log Analytics workspace ID in my subagent's `system_prompt` so the agent knew which workspace to target:

system_prompt: >
  ...
  Query the workspace: XXXXXX-d119-4550-86c0-YYYYYYYYYYY...

Lesson learned: Tools aren't automatically available – you must explicitly declare them. The agent uses this to understand what capabilities it has and to configure the appropriate authentication and access patterns.

Challenge #2: Sending Email Notifications

The next hurdle was sending email reports. My PIM audit was working beautifully, but the results were only visible in logs. I needed email notifications.

Initially, there didn't seem to be a built-in email tool I could choose from the portal. I attempted to write a custom Python tool that sent emails via Microsoft Graph API. This seemed logical – I'd done this in my previous Python agents.

Problem: Corporate email policies blocked my application from sending emails via Graph. This was a security feature, not a bug, but it meant my custom tool approach was dead in the water.

Discovering the Outlook Connector

Then I noticed the Outlook connector in the SRE Agent configuration portal. This was a managed connector specifically for sending emails with pre-configured authentication. I set it up, configured it (noting the connector ID: `connector-abf2`), and waited for emails.

Still nothing.

The Manual YAML Edit

Trawling through other sample subagent configurations, I discovered a tool called SendOutlookEmail. This tool wasn't available in the portal's dropdown menu, but it existed in the platform. I needed to **manually add this to my subagent YAML file**:

tools:
  - QueryLogAnalyticsByWorkspaceId
  - SendOutlookEmail

After this change and redeploying the subagent, emails started flowing perfectly.

Lesson learned: The portal UI is evolving (remember, this is preview), and not all tools are exposed visually yet. Don't be afraid to hand-edit the YAML when you know a capability exists. The documentation and sample repositories are your friends.

Making It Fully Autonomous: Scheduled Triggers

With a working subagent that could query logs, analyse alignment, and send emails, I had one final step: scheduling it.

I created a scheduled task trigger in Azure SRE Agent configured to run every 24 hours (UTC). This trigger invokes my PIM elevation subagent, which executes its entire workflow autonomously and emails stakeholders with any findings.

The subagent configuration includes this execution schedule guidance:

system_prompt: >
  Execution schedule: Run every 24h (UTC).

Now, every morning, our security team receives a PIM elevation alignment report without any manual intervention.

The Result: A Production PIM Elevation Agent

My final solution is an **autonomous agent** that:

Runs on a 24-hour schedule
Queries Azure Audit Logs for PIM activations
Extracts user justifications from the log
Builds precise activation time windows
Queries Azure Activity logs during that time window
Classifies alignment: Aligned, Partial, or NotAligned
Generates JSON and plaintext reports
Emails stakeholders with flagged non-aligned activity
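The subagent carries out these steps from natural-language instructions rather than code, but the classification step can be sketched in Python to show the kind of judgement involved. The keyword-overlap rule and the thresholds below are assumptions for illustration, not the agent's actual logic.

```python
def classify_alignment(justification: str, operations: list[str]) -> str:
    """Label activity as Aligned, Partial, or NotAligned by keyword overlap."""
    # Ignore very short words such as "the" or "to"
    keywords = {w.lower() for w in justification.split() if len(w) > 3}
    matched = [op for op in operations
               if any(k in op.lower() for k in keywords)]
    # With no operations at all, nothing contradicts the stated reason
    if len(matched) == len(operations):
        return "Aligned"
    if matched:
        return "Partial"
    return "NotAligned"


reason = "Restart the web app"
restart = "Microsoft.Web/sites/restart/action"
delete = "Microsoft.Storage/storageAccounts/delete"

aligned = classify_alignment(reason, [restart])
partial = classify_alignment(reason, [restart, delete])
not_aligned = classify_alignment(reason, [delete])
```

Plain keyword matching is brittle (a justification of "fix the site" would match nothing here), which is part of why handing this judgement to an agent is attractive.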

No Python scripts. No custom authentication handling. No infrastructure to maintain.

You can see the full subagent configuration in my GitHub repository: PIM Elevation Agent

Reflections: SRE Agent's Power and Rough Edges

Azure SRE Agent is powerful. The ability to define complex audit workflows in declarative YAML, leverage natural language prompts for behaviour specification, and integrate with Azure services through managed tools is genuinely impressive.

It also integrates with incident response services - both being able to generate incidents and to trigger flows from incidents.

All of this comes as a first-class Azure Platform as a Service (PaaS) offering.

However, it's important to remember that this is a preview service (as of February 2026). There are rough edges:

- Tool discoverability: Not all tools are visible in the portal UI

- Documentation gaps: Some capabilities require digging through samples

- Learning curve: Understanding the interactive-vs-headless paradigm takes time

- Debugging: Error messages aren't always clear about what's misconfigured

These are typical preview-stage challenges, and I expect they'll improve as the service matures. The core platform is solid, and the engineering team is responsive to feedback.

Key Takeaways

If you're considering Azure SRE Agent, here are my lessons learned:

Use interactive sessions for discovery – They're excellent for prototyping and learning
Think headless/autonomous for production – Autonomous agents should be declarative, not conversational
Let the agent write itself – Ask the interactive session to generate subagent configs
Explicitly declare tools – They're not automatic; you must add them to your config
Include context in prompts – Workspace IDs, connector IDs, schedules – be specific
Don't fear manual YAML edits – The portal is evolving; hand-editing is OK
Check samples and docs – Other configurations show patterns and tools not yet in the UI, so check their YAML
Embrace "buy over build" – Managed services reduce long-term maintenance burden

Resources:

- SRE Agent Documentation

- my PIM Elevation subagent sample

- Kusto (KQL) Query Reference

*This blog post represents my personal experience and opinions. Azure SRE Agent capabilities and UI may have changed since the time of writing.*

MCP for Beginners: Why Every AI Engineer and Developer Should Learn the Model Context Protocol

Lee_Stott — Fri, 26 Jun 2026 22:38:26 GMT

If you have spent any time building with large language models in the last year, you have hit the same wall everyone hits: your model is brilliant at reasoning but blind to the real world. It cannot read your database, call your internal API, search your documents, or trigger a deployment unless you hand-write glue code for every single integration. The Model Context Protocol (MCP) exists to tear that wall down, and Microsoft's open-source MCP for Beginners curriculum (reachable via the short link https://aka.ms/mcp-for-beginners) is the most complete, hands-on way to learn it.

This post explains what MCP is, walks through the latest updates to the course, shows real code, and makes the case for why MCP belongs on your learning roadmap right now.

Whether you are an AI engineer shipping agents to production, a developer wiring tools into Copilot, or a student trying to build a standout portfolio project, this post is for you.

What is MCP, and why does it matter?

Think of MCP as a universal translator for AI applications. Just as a USB-C port lets you connect any peripheral to any laptop without a custom cable per device, MCP lets an AI model connect to any tool or data source through one standardized protocol. The course uses exactly this analogy, and it holds up well.

Before MCP, integrations were an M × N problem: every one of your M AI applications needed bespoke code to talk to each of your N tools. MCP turns that into an M + N problem. Build a tool once as an MCP server, and any MCP-compatible client (Claude Desktop, VS Code, Cursor, GitHub Copilot, and many others) can use it immediately.
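A quick worked example makes the difference tangible. The numbers are arbitrary, chosen only to illustrate the arithmetic:

```python
def integrations_without_mcp(apps: int, tools: int) -> int:
    """Every application needs its own adapter for every tool."""
    return apps * tools


def integrations_with_mcp(apps: int, tools: int) -> int:
    """Each application implements one client; each tool ships one server."""
    return apps + tools


apps, tools = 4, 10
bespoke = integrations_without_mcp(apps, tools)  # 40
shared = integrations_with_mcp(apps, tools)      # 14
```

Adding an eleventh tool costs four more adapters in the first model and one more server in the second.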

The protocol is built on a clean client–server model with a small set of primitives:

Tools — functions the model can call (query a database, send an email, run code).
Resources — data the server exposes for context (files, records, documents).
Prompts — reusable, parameterized prompt templates.
Sampling — a server asking the client's LLM to generate a completion, enabling collaborative workflows.
Elicitation — a server requesting structured input from the user mid-task.
Roots — boundaries that tell a server which directories or resources it is allowed to operate on.

Communication runs over JSON-RPC, with transports for local processes (stdio) and remote servers (streamable HTTP). That standardization is the whole point: write to the spec, and you interoperate with the entire ecosystem.
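To show what travels over the wire, here is a sketch of a tool call and its reply as JSON-RPC 2.0 messages, built as plain Python dictionaries. The `add` tool and its arguments are hypothetical; the `tools/call` method name and the `content` result shape follow the MCP specification.

```python
import json

# Client -> server: ask the server to run a tool
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {"name": "add", "arguments": {"a": 2, "b": 3}},
}

# Server -> client: the result echoes the request id
response = {
    "jsonrpc": "2.0",
    "id": 1,
    "result": {
        "content": [{"type": "text", "text": "5"}],
        "isError": False,
    },
}

# Over stdio, each message is serialized as one line of JSON
wire = json.dumps(request) + "\n"
```

SDKs such as FastMCP build and parse these messages for you, so you rarely write them by hand, but seeing them helps when reading logs or Inspector output.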

What's new: the latest updates to the course

The MCP for Beginners curriculum is actively maintained, and the public changelog reads like a release log for a living product. Here are the most important recent changes, drawn directly from that changelog.

1. Aligned to MCP Specification 2025-11-25

The biggest update: the entire curriculum has been validated against the current MCP Specification 2025-11-25 and the latest official SDKs. Stale references to older spec revisions (2025-03-26 and 2025-06-18) were corrected across the security, transport, real-time search, sampling, and stdio-server modules, with links repointed to the canonical modelcontextprotocol.io spec paths.

A gap analysis confirmed the course already covers every primitive introduced or expanded in the latest spec:

Sampling — covered in lesson 3.14 and Advanced Topics.
Elicitation (including URL mode) — in Core Concepts and Protocol Features.
Roots — in the Introduction, Core Concepts, and Root Contexts.
Tasks (experimental, long-running operations) — in Core Concepts and Protocol Features.
Tool Annotations (readOnlyHint / destructiveHint) — in Core Concepts and Protocol Features.
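As one illustration of these primitives, the sketch below shows a tool definition carrying annotations, plus a client-side policy that uses them. The `delete_record` tool and the `needs_confirmation` helper are hypothetical; the annotation names and their defaults (`readOnlyHint` false, `destructiveHint` true) come from the MCP specification.

```python
# A tool definition as a client might receive it from a tools/list call
delete_tool = {
    "name": "delete_record",
    "description": "Permanently delete a record by ID.",
    "inputSchema": {
        "type": "object",
        "properties": {"record_id": {"type": "string"}},
        "required": ["record_id"],
    },
    "annotations": {"readOnlyHint": False, "destructiveHint": True},
}


def needs_confirmation(tool: dict) -> bool:
    """Client policy sketch: confirm unless the tool declares itself read-only."""
    annotations = tool.get("annotations", {})
    if annotations.get("readOnlyHint", False):
        return False
    # destructiveHint defaults to true when the server omits it
    return annotations.get("destructiveHint", True)


confirm_delete = needs_confirmation(delete_tool)
```

Annotations are hints supplied by the server, so a client should only rely on them for servers it already trusts.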

2. Samples validated against current SDKs

Code that does not run is worse than no code at all, so the maintainers re-validated the core samples:

TypeScript: @modelcontextprotocol/sdk resolved to 1.29.0; a tsc --noEmit type-check passed with no errors — the McpServer and StdioServerTransport APIs remain valid.
Python: validated in an isolated virtual environment with mcp[cli] (1.27.2); FastMCP.list_tools() correctly returned the sample add and subtract tools.
SDK version pins across labs were bumped (for example mcp>=1.26.0) and lockfiles regenerated so every sample tracks the current release.

3. A serious security pass

Security is treated as a first-class concern, not an afterthought. A full audit across every dependency manifest and the sample source code was run, and npm audit now reports 0 vulnerabilities in every audited directory. Highlights:

Transitive npm advisories (in the MCP Inspector dev tool, the OpenAI client, and the SDK) were remediated by bumping @modelcontextprotocol/inspector to 0.22.0 and pinning a patched shell-quote.
A real code-level command-injection fix (OWASP A03): an open_in_vscode tool that used subprocess.run(..., shell=True) was rewritten to launch the resolved executable directly with no shell — closing a metacharacter-injection vector.
Python dependencies were audited with pip-audit, and a vulnerable transitive werkzeug was pinned to a patched >=3.1.6.
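The command-injection fix follows a well-known pattern, sketched below. This is not the course's actual code: `build_open_command` and `open_path` are hypothetical helpers, and `sys.executable` stands in for the editor binary so the example runs anywhere.

```python
import shutil
import subprocess
import sys


def build_open_command(executable: str, path: str) -> list[str]:
    """Return an argv list that can be run without a shell."""
    resolved = shutil.which(executable)
    if resolved is None:
        raise FileNotFoundError(f"{executable} not found on PATH")
    # The path stays a single argv element, so shell metacharacters
    # such as ";" or "&&" are never interpreted
    return [resolved, path]


def open_path(executable: str, path: str) -> None:
    """Launch the resolved executable directly (no shell=True)."""
    subprocess.run(build_open_command(executable, path), check=True)


# A hostile-looking path is passed through as plain data
argv = build_open_command(sys.executable, "notes.txt; echo pwned")
```

With `shell=True` and string interpolation, that same input would have run a second command; with an argv list, it is just an odd filename.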

For anyone learning to ship agents, this is gold: the course demonstrates the whole secure-development loop, not just the happy path.

4. New lessons and a growing curriculum

The curriculum keeps expanding with practical, modern lessons:

5.17 Adversarial Multi-Agent Reasoning — two agents argue opposite sides of a question using shared MCP tools (web_search + run_python), judged by a third agent. Includes a Mermaid architecture diagram, orchestrators in Python, TypeScript, and C#, and use cases like hallucination detection, threat modeling, and API design review.
3.12 MCP Hosts — configuration for Claude Desktop, VS Code, Cursor, Cline, and Windsurf, with JSON templates and a transport comparison table.
3.13 MCP Inspector — a debugging guide for testing tools, resources, and prompts.
4.1 Pagination — cursor-based pagination patterns in Python, TypeScript, and Java.
5.16 Protocol Features — progress notifications, request cancellation, resource templates, and lifecycle management.
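Cursor-based pagination, one of the patterns listed above, can be sketched briefly. This is an illustrative in-memory version, not the course's sample: `list_page` and `fetch_all` are hypothetical helpers, and the base64-encoded offset is just one way to make a cursor opaque. The `nextCursor` field name matches the MCP specification.

```python
import base64
import json
from typing import Optional


def encode_cursor(offset: int) -> str:
    """Hide the offset in an opaque token the client should not parse."""
    raw = json.dumps({"offset": offset}).encode()
    return base64.urlsafe_b64encode(raw).decode()


def decode_cursor(cursor: str) -> int:
    return json.loads(base64.urlsafe_b64decode(cursor))["offset"]


def list_page(items: list[str], cursor: Optional[str] = None,
              page_size: int = 2) -> dict:
    """Server side: return one page, plus nextCursor if more remain."""
    start = decode_cursor(cursor) if cursor else 0
    result: dict = {"items": items[start:start + page_size]}
    if start + page_size < len(items):
        result["nextCursor"] = encode_cursor(start + page_size)
    return result


def fetch_all(items: list[str]) -> list[str]:
    """Client side: keep requesting pages until nextCursor is absent."""
    collected: list[str] = []
    cursor = None
    while True:
        page = list_page(items, cursor)
        collected.extend(page["items"])
        cursor = page.get("nextCursor")
        if cursor is None:
            return collected


everything = fetch_all(["a", "b", "c", "d", "e"])
```

Keeping the cursor opaque leaves the server free to change its paging strategy later without breaking clients.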

5. Microsoft product rebranding

Content was updated to reflect Microsoft's rebranding: Azure AI Foundry → Microsoft Foundry, and the AI Toolkit (AITK) → Microsoft Foundry Toolkit Extension for VS Code. If you have seen older tutorials referencing the previous names, the curriculum is now current.

Your first MCP server: see how little code it takes

The course's "first server" lesson builds a simple calculator. Here is the shape of a minimal MCP server in Python using FastMCP, which mirrors the validated sample in the repo. Notice how the protocol plumbing disappears — you just decorate functions.

# server.py — a minimal MCP server with two tools
from mcp.server.fastmcp import FastMCP

# Name your server; this identifies it to MCP clients
mcp = FastMCP("Calculator")

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two numbers and return the result."""
    return a + b

@mcp.tool()
def subtract(a: int, b: int) -> int:
    """Subtract b from a and return the result."""
    return a - b

if __name__ == "__main__":
    # Run over stdio so local hosts (VS Code, Claude Desktop) can connect
    mcp.run()

The same idea in TypeScript, using the official SDK validated at version 1.29.0:

// server.ts — minimal MCP server in TypeScript
import { McpServer } from "@modelcontextprotocol/sdk/server/mcp.js";
import { StdioServerTransport } from "@modelcontextprotocol/sdk/server/stdio.js";
import { z } from "zod";

const server = new McpServer({ name: "Calculator", version: "1.0.0" });

// Register a tool with a typed input schema
server.tool(
  "add",
  { a: z.number(), b: z.number() },
  async ({ a, b }) => ({
    content: [{ type: "text", text: String(a + b) }],
  })
);

// Connect over stdio and start listening
const transport = new StdioServerTransport();
await server.connect(transport);

That is a complete, runnable server. The docstrings and schemas matter: MCP exposes them to the model so it knows when and how to call each tool. Clear descriptions are effectively prompt engineering for your tools — a common pitfall is leaving them vague, which leads to the model misusing or ignoring the tool.

Connecting it in VS Code

Once your server runs, an MCP host connects to it. A typical VS Code / host configuration looks like this:

{
  "servers": {
    "calculator": {
      "command": "python",
      "args": ["server.py"]
    }
  }
}

Lesson 3.12 (MCP Hosts) covers the equivalent JSON for Claude Desktop, Cursor, Cline, and Windsurf, and lesson 3.13 shows how to use the MCP Inspector to test your tools before wiring them into a host — the single best debugging habit you can build early.

How the course is structured

The curriculum is organized as a progressive journey with hands-on code in C#, Java, JavaScript, Python, Rust, and TypeScript. It is grouped into phases:

Foundations (Modules 0–2): Introduction, Core Concepts, and Security.
Building (Module 3): Getting Started — 15 lessons covering your first server and client, LLM clients, VS Code integration, stdio and HTTP streaming, testing, deployment, auth, hosts, the Inspector, sampling, and MCP Apps.
Growing (Modules 4–5): Practical Implementation and Advanced Topics — 17 advanced lessons including Azure integration, OAuth2, Entra ID auth, scaling, multi-modality, context engineering, custom transports, and adversarial multi-agent reasoning.
Mastery (Modules 6–11): Community Contributions, Lessons from Early Adoption, Best Practices, Case Studies, a Microsoft Foundry Toolkit workshop, and an end-to-end 13-lab PostgreSQL capstone.

That final module is the standout for portfolio building: a complete, production-flavored path that takes you from architecture and row-level security through database design, a FastMCP server, semantic search with pgvector and Azure OpenAI, testing, Docker deployment to Azure Container Apps, and monitoring with Application Insights.

Why developers should learn MCP now

For AI engineers

MCP is becoming the default integration layer for agents. Instead of re-implementing tool calling for every framework, you write to one open protocol and your tools work everywhere. The advanced modules — sampling, roots, elicitation, scaling, routing, and adversarial multi-agent patterns — are exactly the techniques you need to move agents from demo to production.

For developers

MCP is already wired into tools you use daily: VS Code, GitHub Copilot, Claude Desktop, Cursor, and more. Learning to build an MCP server means you can expose your systems — internal APIs, databases, CI/CD — to AI assistants safely. The security-first approach in the course (OAuth2, Entra ID, RBAC, dependency auditing) teaches you to do this the right way from day one.

For students

MCP is a rare opportunity to learn a technology while it is still early, with a free, beginner-friendly, Microsoft-maintained curriculum and code in six languages. The 13-lab capstone alone is a genuine portfolio project. And with content translated into 50+ languages, the barrier to entry is low no matter where you are.

Responsible and secure by design

A recurring theme worth calling out: the course does not treat security and governance as optional extras. It models real practices you should carry into your own work:

Least privilege via roots — constrain what a server can touch.
Tool annotations — mark tools readOnlyHint or destructiveHint so clients can warn users before destructive actions.
No shells for user input — the command-injection fix is a textbook example of why you never pass untrusted input through a shell.
Dependency hygiene — audit with npm audit and pip-audit, and pin patched releases.
Proper auth — dedicated lessons on OAuth2 and Microsoft Entra ID.
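
The "no shells for user input" rule can be illustrated with a minimal Python sketch. This is not the course's actual fix; the helper name `echo_argument` is hypothetical, and the child process is just the current Python interpreter echoing its first argument. The point is that an argument list never passes through a shell, so shell metacharacters in untrusted input stay literal.

```python
import subprocess
import sys


def echo_argument(user_input: str) -> str:
    """Pass untrusted text to a child process as a single argv entry."""
    # An argument list with the default shell=False means no shell ever
    # parses user_input, so ';', '&&', and backticks remain plain text.
    result = subprocess.run(
        [sys.executable, "-c", "import sys; print(sys.argv[1])", user_input],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


if __name__ == "__main__":
    # The would-be injected command is echoed back instead of being executed.
    print(echo_argument("report.txt; echo INJECTED"))
```

Had the same string been interpolated into a command line and run with `shell=True`, the shell would have treated everything after the semicolon as a second command.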

Key takeaways

MCP standardizes how AI connects to tools and data, turning a combinatorial integration problem into a simple, reusable one.
The course is current, validated against MCP Specification 2025-11-25 with SDKs at TypeScript 1.29.0 and Python mcp 1.27.2.
Samples actually run, and the repo demonstrates a full secure-development loop with 0 reported vulnerabilities after auditing.
It is broad and deep: from a 10-line calculator server to a 13-lab production capstone, in six languages.
It is the fastest credible path to MCP fluency for AI engineers, developers, and students alike.

Get started today

Open the course: https://aka.ms/mcp-for-beginners (redirects to the GitHub repository).

Fork and clone it — use a sparse checkout to skip translations for a faster download:

git clone --filter=blob:none --sparse https://github.com/microsoft/mcp-for-beginners.git
cd mcp-for-beginners
git sparse-checkout set --no-cone "/*" "!translations" "!translated_images"

Build your first server with lesson 3.1 in your language of choice.
Debug it with the MCP Inspector, then connect it in VS Code.
Go deep with the 13-lab database capstone, and read the official spec at modelcontextprotocol.io.
Track what's new in the changelog and join the community discussions.

MCP is quietly becoming the connective tissue of the AI ecosystem. The earlier you learn it, the more leverage you will have — and Microsoft's MCP for Beginners is the clearest on-ramp available. Star the repo, build a server this week, and start connecting your AI to the world.

Join our free livestream series on using Microsoft IQ with Python

Pamela_Fox — Thu, 25 Jun 2026 07:46:37 GMT

Join us for a new 3-part livestream series where we take a deep technical look at Microsoft IQ, the knowledge layer for the next generation of AI experiences. You'll learn how Foundry IQ, Work IQ, and Fabric IQ can be used to ground AI systems in organizational knowledge, workplace context, and structured business data.

Our series will cover:

Foundry IQ for multi-source agentic retrieval on search indexes, SharePoint, websites, and more
Work IQ for user-specific retrieval of M365 data, like Teams chats, emails, and calendar events
Fabric IQ for retrieval of data stored in OneLake, via Fabric ontologies and data agents
Building agents with Microsoft Agent Framework to connect to Foundry IQ, Fabric IQ, and Work IQ

Throughout the series, we’ll use Python for all examples and share full code so you can run everything yourself in your own Foundry projects.

👉 Register for the full series.

In addition to the live streams, you can also join the Microsoft Foundry Discord to ask follow-up questions after each stream.

If you are new to generative AI with Python, start with our 9-part Python + AI series, which covers topics such as LLMs, embeddings, RAG, tool calling, MCP, and agents. If you are new to Microsoft Agent Framework, watch our 6-part Python + Agent series which dives deep into agents and workflows.

To learn more about each live stream or register for individual sessions, scroll down:

Day 1: Foundry IQ

28 July, 2026 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time

In the first session of our Microsoft IQ Deep Dive with Python series, we’ll kick things off with an introduction to the Microsoft IQ family: Foundry IQ, Work IQ, Fabric IQ, and Web IQ. We’ll then take a deeper look at Foundry IQ (Azure AI Search), exploring how it helps agents and applications work with curated knowledge and organizational context. We'll build a knowledge base and connect it to multiple knowledge sources, including the new IQs, MCP servers, and search indexes built from ingested data. Then we'll perform multi-source agentic retrieval on the knowledge base, which executes queries in parallel and merges the results with state-of-the-art ranking models. Finally, we will build an agent in Python using Microsoft Agent Framework and ground the agent's responses in results from the Foundry IQ knowledge base. All code demos will use Python and will be available in an open-source repository for you to deploy yourself. After the stream, join office hours in the Microsoft Foundry Discord to ask follow-up questions.

Day 2: Work IQ

29 July, 2026 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time

In the second session of our Microsoft IQ Deep Dive with Python series, we’ll focus on Work IQ and how it brings workplace context into AI-powered experiences. We’ll explore how developers can use Work IQ through APIs, A2A patterns, MCP integration, and tool-based workflows. We’ll look at two practical tool examples, then show how Work IQ can be used from Copilot and from a Microsoft Agent Framework agent. All code demos will use Python and will be available in an open-source repository for you to deploy yourself. After the stream, join office hours in the Microsoft Foundry Discord to ask follow-up questions.

Day 3: Fabric IQ

30 July, 2026 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time

In the final session of our Microsoft IQ Deep Dive with Python series, we’ll explore Fabric IQ and how it connects AI experiences to structured business data. We’ll introduce the key concepts behind Fabric IQ, including ontologies and data agents, and show how they help describe, organize, and reason over operational data stored in OneLake. We’ll use the Microsoft Fabric API SDK in Python to connect to Fabric IQ, so that we can programmatically configure ontologies and answer questions about our data. All code demos will use Python and will be available in an open-source repository for you to deploy yourself. After the stream, join office hours in the Microsoft Foundry Discord to ask follow-up questions.

MCP Server Authorization with Azure API Management: From Simple to Advanced

vzisiadis — Wed, 24 Jun 2026 12:50:23 GMT

Why put API Management in front of your MCP servers

The Model Context Protocol (MCP) has quickly become the standard way for AI agents, such as GitHub Copilot in VS Code, to reach external tools and data. As soon as an MCP server does anything meaningful, the same questions that govern any API resurface: who is allowed to call it, what are they allowed to do, and how do you enforce that consistently across many servers without rewriting each one.

Azure API Management (APIM) answers those questions for MCP. It sits between the MCP client and the tool backend and applies the controls you already trust for REST APIs: identity validation, OAuth, rate limiting, IP filtering, and observability. Crucially, APIM speaks the MCP authorization specification, which is built on OAuth 2.1 and Protected Resource Metadata (PRM, RFC 9728). That means APIM can do more than block bad requests. It can actively drive an interactive sign-in from the IDE, so the user logs in with their own identity and the agent acts on their behalf.

This article walks through a progression of authorization scenarios, each one building on the last:

The simple case: validate a token and block everything else.
Triggering an interactive sign-in from VS Code for an MCP server that APIM hosts from your own APIs.
Going beyond "is this a tenant user" to "does this user have the right attribute" with Entra app roles.
Fronting an existing external MCP server and letting it drive its own OAuth flow (GitHub as the example).
Governing which tools of an existing MCP server an agent is actually allowed to invoke.

APIM MCP capabilities and the basic authorization options

API Management exposes MCP servers in two distinct ways, and the authorization story differs slightly for each.

Expose a REST API as an MCP server. APIM takes an API it already manages and projects selected operations as MCP tools. You own the operations, so you choose exactly which ones become tools at configuration time. This is the right mode when the capability you want to expose is an API you control.
Expose an existing MCP server (passthrough). APIM fronts a remote MCP-compatible server (LangChain, an Azure Function, GitHub's remote MCP server, your own container) and relays the MCP protocol to it. APIM governs access, but the upstream server still owns its tool catalog.

On top of either mode, you have a spectrum of authorization options:

Subscription keys for simple, machine-to-machine access where a shared secret in a header is acceptable.
Token validation with Microsoft Entra ID, where APIM acts as the protected resource and verifies a bearer token on every call.
Interactive OAuth 2.1 sign-in, where APIM advertises Protected Resource Metadata so an MCP client can discover the authorization server, log the user in, and retry with a user token.
Authorization passthrough, where an external MCP server presents its own authorization challenge and APIM relays it faithfully so the client authenticates directly against the upstream's identity provider.

The rest of the article works through these options in increasing order of capability.

The example setup

The walkthroughs in the first three scenarios all use the same backend so you can reproduce them without standing up anything of your own: the publicly available Star Wars API. It is a simple, read-friendly REST API (characters, films, planets, starships, and so on) imported into API Management as a normal API and then projected as an MCP server.

The reason this single API is enough to illustrate the whole progression is that, in API Management, one underlying API can back several independent MCP servers, each exposing a different slice of its operations. For example, you can create:

A read-only MCP server that exposes only the GET operations, for agents that should be able to query data but never change it.
A write-capable MCP server that exposes the POST, PUT, or DELETE operations, for trusted automation that is allowed to mutate state.

Same backend API, two MCP servers, two different tool surfaces. Each of these servers is an independent resource in APIM, so each one can carry its own authorization. Both can require an authenticated user (Scenarios 1 and 2), and you can go further by protecting only the sensitive one: gate the write-capable server behind an Entra app role so that, even among authenticated users, only those who carry a specific claim can reach the mutating tools. That app-role mechanism is the subject of Scenario 3, and it composes naturally with the multi-server split described here.

Registering the MCP API in Microsoft Entra ID

Before any of the policies below can validate a token, you need an application registration in Microsoft Entra ID that represents the MCP API. This registration is what defines the audience and scope that tokens are issued for, and it is the source of the mcp-audience, mcp-scope, and (indirectly) mcp-client-id values that the policies reference. Create it once and reuse it across all the MCP servers in this article.

In the Azure portal, open Microsoft Entra ID, then App registrations, then New registration. Name it (for example, star-wars-mcp-api), choose single-tenant, and register. Record the Application (client) ID and the Directory (tenant) ID.
Open Expose an API and add an Application ID URI. Accept the default api://<app-id>. This URI is your token audience.
Still under Expose an API, add a delegated scope named MCP.Access, set its consent display name and description, set the state to Enabled, and save.
Authorize the client that will request the scope. Under Expose an API, select Add a client application and enter the client ID of the MCP client. For VS Code, this is the built-in Microsoft authentication client aebc6443-996d-45c2-90f0-388ff96faa56. Check the MCP.Access scope and save.

These steps produce the four constants the validation policy needs:

Named value	Comes from	Example
entra-tenant-id	The Directory (tenant) ID from step 1	11111111-1111-1111-1111-111111111111
mcp-audience	The Application ID URI from step 2	api://22222222-2222-2222-2222-222222222222
mcp-scope	The scope name from step 3	MCP.Access
mcp-client-id	The client ID of the calling app from step 4	aebc6443-996d-45c2-90f0-388ff96faa56


[!NOTE] mcp-client-id is the identity of the application calling the MCP server, not the MCP API itself. For VS Code it is the built-in Microsoft authentication client, and its value lands in the token's appid claim, which is why the validation policy lists it under client-application-ids. If your tenant blocks the first-party VS Code client, register your own public client application and use its client ID instead.

[!TIP] For the privileged-access feature in Scenario 3, you will also declare an app role on this same registration. You do not need it yet, but it is convenient to know that all identity configuration for these servers lives on this one app registration.

With that backend and structure in mind, the scenarios below build up the authorization model one capability at a time.

Scenario 1: The simple case, validate the token and block unauthorized access

The most basic protection is to require a valid Entra ID token on every MCP request and reject anything that fails validation. No interactive flow, no roles, just a gate.

APIM does this with the validate-azure-ad-token policy. The policy checks the issuing tenant, the audience (your MCP API), the calling client application, and the required scope. Anything that does not satisfy all four is rejected with a 401.

<policies>
  <inbound>
    <base />
    <validate-azure-ad-token tenant-id="{{entra-tenant-id}}" header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized. Access token is missing or invalid.">
      <client-application-ids>
        <application-id>{{mcp-client-id}}</application-id>
      </client-application-ids>
      <audiences>
        <audience>{{mcp-audience}}</audience>
      </audiences>
      <required-claims>
        <claim name="scp" match="any">
          <value>{{mcp-scope}}</value>
        </claim>
      </required-claims>
    </validate-azure-ad-token>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>

The values in double braces are APIM named values: centralized constants, defined once and shared by every MCP server. They map directly to the four values produced by the Entra app registration in the example setup (entra-tenant-id, mcp-audience, mcp-scope, and mcp-client-id). Storing them as named values keeps the policy free of hardcoded identifiers and lets every server reuse the same configuration.
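
To make the four checks concrete, here is a rough Python sketch of the same decision applied to the claims of a token that has already been cryptographically verified. It is not how APIM implements the policy, and it deliberately skips signature, issuer, and expiry validation. The function name `is_authorized` is hypothetical; the claim names (`tid`, `aud`, `appid` or `azp`, `scp`) are the ones Entra ID access tokens carry.

```python
from typing import Iterable, Mapping


def is_authorized(
    claims: Mapping[str, object],
    tenant_id: str,
    audience: str,
    client_ids: Iterable[str],
    required_scope: str,
) -> bool:
    """Apply the four checks to claims taken from an already verified token."""
    # Delegated scopes arrive as one space-separated string in "scp".
    scopes = str(claims.get("scp", "")).split()
    # v1.0 tokens name the calling app in "appid"; v2.0 tokens use "azp".
    caller = claims.get("appid") or claims.get("azp")
    return (
        claims.get("tid") == tenant_id
        and claims.get("aud") == audience
        and caller in set(client_ids)
        and required_scope in scopes
    )


if __name__ == "__main__":
    # Example values from the named-value table in the example setup.
    sample = {
        "tid": "11111111-1111-1111-1111-111111111111",
        "aud": "api://22222222-2222-2222-2222-222222222222",
        "appid": "aebc6443-996d-45c2-90f0-388ff96faa56",
        "scp": "MCP.Access",
    }
    print(is_authorized(
        sample,
        "11111111-1111-1111-1111-111111111111",
        "api://22222222-2222-2222-2222-222222222222",
        ["aebc6443-996d-45c2-90f0-388ff96faa56"],
        "MCP.Access",
    ))
```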

This gets you a server that nobody can call without a properly minted token. What it does not do is help a fresh client obtain that token in the first place. That is the next scenario.

Scenario 2: Driving an interactive sign-in from VS Code for an APIM-hosted MCP server

When you expose one of your own APIs as an MCP server, you usually want a developer to open VS Code, connect to the server, and be prompted to sign in with their Microsoft account. No pre-shared key, no manual token handling. APIM achieves this by behaving as a well-mannered OAuth 2.1 protected resource.

Using the Star Wars MCP server from the example setup, each selected operation becomes a tool the agent can call, so an agent can answer "which films featured the character named Leia" by calling the underlying API through APIM.

How the sign-in flow works

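The protocol choreography is what turns a plain 401 into an interactive login. The client connects without a token, APIM answers with a 401 that points to the server's metadata, the client reads that metadata to find the authorization server, signs the user in (with consent on first use), and retries the request with the user's token.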

Two ingredients make this work: a 401 challenge that points to a metadata document, and the metadata document itself.

The challenge: a 401 that points the client to its metadata

Instead of a bare 401, APIM returns a WWW-Authenticate header carrying the URL of the server's Protected Resource Metadata. This is what tells the client "you need a token, and here is where to learn how to get one." Keeping this logic in a shared policy fragment means every MCP server reuses it.

Notice the mcpResourceMetadataUrl reference in the fragment below. It is not hardcoded; it is a context variable that each MCP server sets in its own server-level policy before including this fragment (you will see that wiring in the per-server policy later in this scenario). The fragment simply reads whatever value the calling server provided. This indirection is what keeps the fragment pluggable: the same shared challenge-and-validate logic serves every MCP server, while each server supplies its own PRM URL.

In most deployments the PRM endpoint is a single, dynamic one (built in the next section) that derives the resource from the request path, so the variable just carries that server's path. But because the URL is configurable per server rather than baked into the fragment, you retain flexibility for the cases that need it.

<fragment>
  <choose>
    <when condition="@(!context.Request.Headers.ContainsKey("Authorization"))">
      <return-response>
        <set-status code="401" reason="Unauthorized" />
        <set-header name="WWW-Authenticate" exists-action="override">
          <value>@("Bearer resource_metadata=\"" + (string)context.Variables.GetValueOrDefault("mcpResourceMetadataUrl", "") + "\"")</value>
        </set-header>
      </return-response>
    </when>
  </choose>
  <validate-azure-ad-token tenant-id="{{entra-tenant-id}}" header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized. Access token is missing or invalid.">
    <client-application-ids>
      <application-id>{{mcp-client-id}}</application-id>
    </client-application-ids>
    <audiences>
      <audience>{{mcp-audience}}</audience>
    </audiences>
    <required-claims>
      <claim name="scp" match="any">
        <value>{{mcp-scope}}</value>
      </claim>
    </required-claims>
  </validate-azure-ad-token>
</fragment>
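
Seen from the client side, the challenge is just a header to parse. The sketch below shows one way a client might pull the metadata URL out of the `WWW-Authenticate` value that the fragment emits. It is an illustration, not VS Code's implementation: the function name `resource_metadata_url` is hypothetical, and the regular expression handles only the simple quoted form shown in the policy.

```python
import re
from typing import Optional

# Matches resource_metadata="..." as emitted by the fragment above.
_RESOURCE_METADATA = re.compile(r'resource_metadata="([^"]*)"')


def resource_metadata_url(www_authenticate: str) -> Optional[str]:
    """Return the PRM URL from a Bearer challenge, or None if it is absent."""
    if not www_authenticate.strip().lower().startswith("bearer"):
        return None
    match = _RESOURCE_METADATA.search(www_authenticate)
    # An empty value means the server never set mcpResourceMetadataUrl.
    if match is None or not match.group(1):
        return None
    return match.group(1)


if __name__ == "__main__":
    header = (
        'Bearer resource_metadata="https://apim-contoso-mcp.azure-api.net'
        '/.well-known/oauth-protected-resource/star-wars-mcp/mcp"'
    )
    print(resource_metadata_url(header))
```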

Creating the /.well-known PRM endpoint in APIM with a policy

This is the part that often surprises people: APIM itself serves the metadata document. There is no separate identity service to stand up. You publish one small anonymous API at the service root that answers GET /.well-known/oauth-protected-resource/*, derives the resource value from the requested path, and returns a JSON document pointing at Microsoft Entra ID as the authorization server.

Create a blank HTTP API named well-known with an empty API URL suffix so it resolves at the service root, add a GET operation with the template /.well-known/oauth-protected-resource/*, clear the subscription requirement so it is reachable anonymously, and apply this policy:

<policies>
  <inbound>
    <base />
    <set-variable name="resourceUrl" value="@{
        var prefix = "/.well-known/oauth-protected-resource";
        var path = context.Request.OriginalUrl.Path;
        var resourcePath = path.Length > prefix.Length ? path.Substring(prefix.Length) : "";
        return "https://" + context.Request.OriginalUrl.Host + resourcePath;
    }" />
    <return-response>
      <set-status code="200" reason="OK" />
      <set-header name="Content-Type" exists-action="override">
        <value>application/json</value>
      </set-header>
      <set-body>@{
        return new JObject(
            new JProperty("resource", (string)context.Variables["resourceUrl"]),
            new JProperty("authorization_servers", new JArray(
                "https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0")),
            new JProperty("scopes_supported", new JArray("{{mcp-prm-scope}}")),
            new JProperty("bearer_methods_supported", new JArray("header"))
        ).ToString();
      }</set-body>
    </return-response>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
  </on-error>
</policies>

The {{mcp-prm-scope}} named value populates the scopes_supported array of the metadata document. It tells the client which delegated scope to request when it goes to the authorization server, so it must be the fully qualified scope value: the token audience (the Application ID URI from the app registration) followed by the scope name. With the example values that is api://22222222-2222-2222-2222-222222222222/MCP.Access. In other words, it is the combination of the mcp-audience and mcp-scope values defined in the example setup.

Named value	Value to set	Example
mcp-prm-scope	<mcp-audience>/<mcp-scope>	api://22222222-2222-2222-2222-222222222222/MCP.Access

[!NOTE] Keep mcp-prm-scope in sync with the scope the validation fragment requires. The PRM document advertises this scope so the client requests it, and validate-azure-ad-token then checks for it in the scp claim. A mismatch means the client obtains a token without the scope APIM expects, and validation fails.

Because the policy builds the resource value from the request path, this single endpoint serves metadata for every MCP server you ever add. The Star Wars server, a future inventory server, and anything else all share it.
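
To see what that single endpoint returns for a given server, the policy's logic can be mirrored in a few lines of Python. This is only a sketch for reasoning about the output; the function name `build_prm` is hypothetical, and the host, tenant ID, and scope are the example values used throughout this article.

```python
import json

PREFIX = "/.well-known/oauth-protected-resource"


def build_prm(host: str, path: str, tenant_id: str, prm_scope: str) -> dict:
    """Mirror the policy: derive `resource` from the path after the prefix."""
    resource_path = path[len(PREFIX):] if len(path) > len(PREFIX) else ""
    return {
        "resource": "https://" + host + resource_path,
        "authorization_servers": [
            f"https://login.microsoftonline.com/{tenant_id}/v2.0"
        ],
        "scopes_supported": [prm_scope],
        "bearer_methods_supported": ["header"],
    }


if __name__ == "__main__":
    document = build_prm(
        host="apim-contoso-mcp.azure-api.net",
        path=PREFIX + "/star-wars-mcp/mcp",
        tenant_id="11111111-1111-1111-1111-111111111111",
        prm_scope="api://22222222-2222-2222-2222-222222222222/MCP.Access",
    )
    print(json.dumps(document, indent=2))
```

Requesting the metadata for a different server path changes only the `resource` field, which is why one endpoint can cover every server.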

Wiring it onto the MCP server

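Each MCP server only needs to declare its own metadata URL and include the shared fragment. In the server-level policy, that means a set-variable that assigns mcpResourceMetadataUrl followed by an include-fragment for mcp-entra-auth, both in the inbound section. The Scenario 3 policy shows the same two lines with a role check added after them.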

On the VS Code side, the configuration is deliberately plain. With no subscription-key header present, the client falls straight into the OAuth flow:

{ "servers": { "star-wars-mcp": { "url": "https://apim-contoso-mcp.azure-api.net/star-wars-mcp/mcp", "type": "http" } } }

Restart the server in VS Code, and it detects the 401, reads the metadata, opens a browser sign-in, requests consent on first use, and then loads the tools using the user's token.

[!CAUTION] Do not read the response body with context.Response.Body inside MCP server policies. It forces response buffering and breaks the MCP streaming transport. If global diagnostic logging is enabled, set the Frontend Response payload bytes to log to 0 at the All APIs scope.

Scenario 3: Beyond tenant membership, authorize on a user attribute with app roles

Validating a token confirms the caller is a signed-in user in your tenant with the right scope. That is often not enough. Some MCP servers expose sensitive tools that only a subset of users should reach. You want to express "this user is not only part of the tenant, but has a specific attribute that permits this server."

Microsoft Entra app roles are the optimal mechanism for this. You declare a role on the MCP API app registration, assign it to specific users or to a security group, and Entra ID emits a roles claim in the access token whenever your API is the audience. APIM then authorizes on that claim. App roles beat the groups claim here because they avoid the group overage problem, they are scoped to the application, and they travel with the app.

Declaring and assigning the role

On the MCP API app registration, under App roles, create a role:

Setting	Value
Display name	Privileged Access
Allowed member types	Users/Groups
Value	Privileged.Access
Description	Access to privileged MCP servers

Then, on the matching enterprise application, under Users and groups, assign the users (or, better, a security group) to the Privileged Access role. The Value field is the exact string that lands in the token roles claim, so it cannot contain spaces.

[!TIP] Keep User assignment required set to No on the enterprise application. Unassigned users still obtain a valid token with the MCP.Access scope and keep access to the non-privileged servers. They simply do not carry the roles claim, so the privileged servers reject them.

Enforcing the claim in the per-server policy

The shared mcp-entra-auth fragment is used by every server, so the role requirement must not live there. Place the check in the privileged server's own policy, right after the fragment include. The token is already validated at that point, so this step is pure authorization. Because the caller is authenticated but not authorized, return 403, not 401, and do not emit a challenge: re-authenticating will not grant a role the user does not have.

<policies>
  <inbound>
    <base />
    <set-variable name="mcpResourceMetadataUrl" value="https://apim-contoso-mcp.azure-api.net/.well-known/oauth-protected-resource/star-wars-mcp/mcp" />
    <include-fragment fragment-id="mcp-entra-auth" />
    <choose>
      <when condition="@(!context.Request.Headers.GetValueOrDefault("Authorization","").Replace("Bearer ","").AsJwt().Claims.GetValueOrDefault("roles", new string[0]).Contains("Privileged.Access"))">
        <return-response>
          <set-status code="403" reason="Forbidden" />
          <set-header name="Content-Type" exists-action="override">
            <value>application/json</value>
          </set-header>
          <set-body>{"error":"forbidden","message":"You lack the Privileged.Access role required for this MCP server."}</set-body>
        </return-response>
      </when>
    </choose>
  </inbound>
  <backend>
    <base />
  </backend>
  <outbound>
    <base />
  </outbound>
  <on-error>
    <base />
    <include-fragment fragment-id="mcp-auth-challenge-onerror" />
  </on-error>
</policies>

One operational detail worth calling out: app-role assignments only appear in newly issued tokens. A user who is granted the role after they signed in must obtain a fresh token. In VS Code, run MCP: Reset Cached Tokens (or sign out of the Microsoft account from the Accounts menu), then restart the server and sign in again. You can confirm the result by pasting the access token into https://jwt.ms and checking for "roles": ["Privileged.Access"].
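
If you would rather not paste a token into a website, the same check can be done locally. The following Python sketch decodes the payload segment of a JWT and returns its roles claim. It does not verify the signature, so it is only suitable for inspecting a token you already trust; the function name `token_roles` is hypothetical.

```python
import base64
import json


def token_roles(access_token: str) -> list:
    """Read the roles claim from a JWT payload without verifying the signature."""
    payload_b64 = access_token.split(".")[1]
    # JWT segments are base64url without padding; restore it before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return list(payload.get("roles", []))


def has_privileged_access(access_token: str) -> bool:
    """True when the token carries the role value declared in this scenario."""
    return "Privileged.Access" in token_roles(access_token)
```

A token issued before the role assignment returns an empty list here, which is the signal to reset the cached token and sign in again.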

Scenario 4: Fronting an existing external MCP server that drives its own sign-in

So far APIM has been the protected resource: it validates tokens that Microsoft Entra ID issues. But many valuable MCP servers already exist and run their own identity. GitHub publishes a remote MCP server with dozens of tools, and it authenticates users against GitHub's own OAuth authorization server. You do not want to re-implement that. You want APIM to govern access (rate limits, IP rules, logging, a single managed endpoint) while letting the upstream own the login.

This is the "expose an existing MCP server" passthrough mode. When you register GitHub's remote MCP server behind APIM, the gateway relays the upstream's own authorization challenge. The client never authenticates against Entra here. It authenticates directly against GitHub.

The flow, confirmed by probing the gateway:

A call to the APIM endpoint with no token returns GitHub's own 401 with a WWW-Authenticate header, relayed through APIM.
The Protected Resource Metadata that GitHub serves advertises authorization_servers: ["https://github.com/login/oauth"], so the client knows to log in at GitHub.
The PRM resource reflects the APIM host, because GitHub builds it from the forwarded Host header. The client trusts the APIM endpoint while still logging in at GitHub.
VS Code completes the GitHub sign-in and the full tool catalog loads. In the proof of concept this surfaced all 47 GitHub tools through the single APIM endpoint.

The client configuration is again just a URL pointing at APIM:

{ "servers": { "github-via-apim": { "url": "https://apim-contoso-mcp.azure-api.net/github-mcp/mcp", "type": "http" } } }

The key insight is that APIM transparently relays the backend's authentication challenge. GitHub remains the authorization server, GitHub tolerates being fronted by APIM, and you get a governed, centrally managed entry point without owning the identity flow.

[!NOTE] Passthrough only relays what the upstream advertises. If the backend's PRM resource value and the actual MCP transport endpoint differ by a path segment, some clients fall back to deriving the metadata location from the server URL and can miss it. When you onboard a custom self-authenticating server, verify that the resource it advertises matches the exact URL the client connects to.
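
A quick way to run that verification is to compare the two URLs directly. The Python sketch below is a hypothetical helper, not part of APIM or any MCP client. It assumes that scheme and host compare case-insensitively and that a trailing slash is not significant, which may be stricter or looser than a particular client's behavior.

```python
from urllib.parse import urlsplit


def resource_matches(server_url: str, advertised_resource: str) -> bool:
    """True when the PRM resource equals the URL the client connects to."""

    def normalize(url: str) -> tuple:
        parts = urlsplit(url)
        # Assumption: ignore case in scheme and host, and a trailing slash.
        return (parts.scheme.lower(), parts.netloc.lower(), parts.path.rstrip("/"))

    return normalize(server_url) == normalize(advertised_resource)


if __name__ == "__main__":
    connect_url = "https://apim-contoso-mcp.azure-api.net/github-mcp/mcp"
    # A resource that drops the final path segment is the mismatch to look for.
    print(resource_matches(connect_url, "https://apim-contoso-mcp.azure-api.net/github-mcp"))
```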

Scenario 5: Restricting which tools of an existing MCP server an agent may call

Passthrough raises a governance question that token validation alone cannot answer. A developer may legitimately have permission to merge a pull request through GitHub, but you may not want their AI agent to perform that action autonomously. You want to allow the read and discovery tools while blocking the destructive write tools, at the gateway, regardless of what the client tries.

What is and is not possible for an external server

It is important to be precise here, because the capability differs from the REST-as-MCP mode:

For a REST-API-exposed-as-MCP server, you pick which operations become tools at creation time. That is native tool selection and the cleanest possible filter.
For an existing/external MCP server, APIM does not enumerate the upstream's tools. The portal Tools blade explicitly states that tools are not visible for external MCP servers, and there is no allow-list property for them. APIM also cannot safely rewrite the tools/list response, because reading the response body breaks the streaming transport and the list may arrive as text/event-stream.

What APIM can do reliably, and server-agnostically, is block the invocation. Every tool call arrives as a JSON-RPC tools/call request in the request body, which APIM can inspect safely. The deny-listed tools remain visible in the catalog, but any attempt to invoke one is intercepted at the gateway, which returns a JSON-RPC error before the call ever reaches the upstream.

The reusable deny-list fragment

The block is driven by a per-server named value (a comma-separated list of tool names), so the same fragment governs every external server. Only the named value changes.

<fragment>
  <choose>
    <when condition="@(context.Request.Body != null)">
      <set-variable name="mcpMethod" value="@{
          try
          {
              var body = context.Request.Body.As<JObject>(preserveContent: true);
              return (string)body?["method"] ?? string.Empty;
          }
          catch { return string.Empty; }
      }" />
      <choose>
        <when condition="@(((string)context.Variables["mcpMethod"]).Equals("tools/call", StringComparison.OrdinalIgnoreCase))">
          <set-variable name="mcpToolName" value="@{
              var body = context.Request.Body.As<JObject>(preserveContent: true);
              return (string)body?["params"]?["name"] ?? string.Empty;
          }" />
          <set-variable name="mcpBlocked" value="@{
              var tool = ((string)context.Variables["mcpToolName"]).Trim().ToLowerInvariant();
              var deny = ((string)context.Variables.GetValueOrDefault("mcpBlockedTools", "")).ToLowerInvariant().Split(',').Select(t => t.Trim());
              return deny.Contains(tool);
          }" />
          <choose>
            <when condition="@((bool)context.Variables["mcpBlocked"])">
              <return-response>
                <set-status code="200" reason="OK" />
                <set-header name="Content-Type" exists-action="override">
                  <value>application/json</value>
                </set-header>
                <set-body>@{
                    var id = "null";
                    try
                    {
                        var body = context.Request.Body.As<JObject>(preserveContent: true);
                        id = body?["id"]?.ToString(Newtonsoft.Json.Formatting.None) ?? "null";
                    }
                    catch {}
                    return "{\"jsonrpc\":\"2.0\",\"id\":" + id + ",\"error\":{\"code\":-32602,\"message\":\"Unknown tool: " + ((string)context.Variables["mcpToolName"]) + "\"}}";
                }</set-body>
              </return-response>
            </when>
          </choose>
        </when>
      </choose>
    </when>
  </choose>
</fragment>

The deny-list itself lives in a named value, one per server:

# APIM named value. Comma-separated, case-insensitive.
mcp-blocked-tools-github = merge_pull_request,create_repository,delete_repository,push_files,create_or_update_file,issue_write,label_write

# Generic per-server pattern:
mcp-blocked-tools-<server> = <comma,separated,tool,names>

Wiring it onto the GitHub passthrough server

Now when the agent tries to merge a pull request, the gateway returns a clean -32602 Unknown tool error and the upstream is never touched. Read and discovery tools continue to work. The tool still appears in the client's catalog.
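
To exercise the same decision outside APIM, for example in a unit test of your deny-list values, the fragment's logic can be mirrored in a few lines of Python. This is a minimal sketch, not the gateway's implementation: the function name check_tool_call is hypothetical, and it only reproduces the checks visible in the fragment above (method equals tools/call, tool name matched case-insensitively against a comma-separated list).

```python
import json


def check_tool_call(raw_body: str, blocked_tools: str):
    """Return a JSON-RPC error dict if the call is deny-listed, else None.

    Hypothetical helper that mirrors the policy fragment: only tools/call
    requests are inspected, and names are compared case-insensitively.
    """
    try:
        body = json.loads(raw_body)
    except json.JSONDecodeError:
        return None  # unparseable body: nothing to block, as in the fragment
    if not isinstance(body, dict):
        return None
    if str(body.get("method", "")).lower() != "tools/call":
        return None

    params = body.get("params")
    if not isinstance(params, dict):
        params = {}
    name = str(params.get("name", ""))

    # Same normalization as the policy: trim and lowercase both sides.
    deny = {t.strip().lower() for t in blocked_tools.split(",")}
    if name.strip().lower() not in deny:
        return None

    return {
        "jsonrpc": "2.0",
        "id": body.get("id"),
        "error": {"code": -32602, "message": f"Unknown tool: {name}"},
    }
```

Calling it with a tools/call body for merge_pull_request and the GitHub deny-list returns the -32602 error object; a tools/list body, or a tool that is not on the list, returns None and would pass through to the upstream.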

Adding governance for another external server is just one more named value plus the same fragment include. No new policy logic.

Key takeaways

API Management turns MCP servers into governed resources, applying the same identity, traffic, and observability controls you already use for APIs.
Start simple with validate-azure-ad-token to gate access, then graduate to a full interactive sign-in by serving Protected Resource Metadata from a single APIM policy.
You can publish multiple MCP servers from one underlying API, for example a read-only server and a read-write server, by selecting different operations.
App roles let you authorize on a user attribute, not just tenant membership, and the check belongs in the per-server policy so shared logic stays clean.
For existing external servers, APIM relays the upstream's own OAuth flow, so a server like GitHub keeps owning its identity while you keep central governance.
When an external server's full tool surface is too broad, APIM can block specific tool invocations at the gateway with a reusable, named-value-driven policy, so a user's agent cannot perform actions the user could perform manually.

The Token Economics of the Edge: Running Qwen3 on a Windows NPU with WinML CLI

kinfey — Tue, 23 Jun 2026 09:35:31 GMT

The number that changes the conversation

Most "run an LLM locally" tutorials start with the model. This one starts with a bill.

Every token a cloud LLM generates is metered. One request is rounding-error cheap. But agents don't make one request — they make thousands, in loops, with retries, across every user, every day, forever. The cost curve of a successful AI feature is not flat; it bends upward with adoption. The thing that makes your product popular is the same thing that makes it expensive to run.

That's the uncomfortable truth behind token economics: in the cloud, the marginal cost of intelligence never reaches zero. You rent it, per token, for as long as the feature lives.

Edge inference flips the equation. Once the model runs on hardware the user already owns, the marginal cost of the next token trends toward zero. You pay a fixed cost once — the silicon — and then you generate as many tokens as you like. For a whole class of workloads — chat assistants, summarizers, classifiers, on-device copilots — that is a structurally different cost model, and it's the reason "where does inference run" has quietly become one of the most important architecture decisions you'll make this year.
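
The fixed-versus-metered argument can be made concrete with a back-of-the-envelope break-even calculation. The sketch below is illustrative only: the function names are hypothetical, the prices in the usage note are made-up placeholders rather than real quotes, and it deliberately ignores electricity, engineering time, and the quality gap between models.

```python
def cumulative_cost(tokens: float, *, fixed: float = 0.0,
                    price_per_million: float = 0.0) -> float:
    """Total spend after generating `tokens` tokens.

    Cloud: fixed=0, price_per_million>0. Edge: fixed>0, price_per_million near 0.
    """
    return fixed + tokens / 1_000_000 * price_per_million


def break_even_tokens(hardware_cost: float, cloud_price_per_million: float) -> float:
    """Token count at which a one-time hardware cost equals the metered cloud bill."""
    if cloud_price_per_million <= 0:
        raise ValueError("cloud price must be positive")
    return hardware_cost / cloud_price_per_million * 1_000_000
```

With placeholder inputs of a 1000-unit device and a cloud price of 2 units per million tokens, break_even_tokens(1000.0, 2.0) comes out at 500 million tokens. Past that point every additional cloud token adds cost, while the edge curve stays flat. When the user already owns the hardware, the fixed term on the developer's side is effectively zero.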

To keep this concrete, we'll ground every idea in a real, working repository: kinfey/winml-qwen3-chat — an end-to-end project that builds Qwen3-0.6B for a Windows NPU using Windows ML CLI, benchmarks it against the CPU, and ships a desktop chat app that streams from it token by token.

Why the edge — beyond just the bill

Cost is the headline, but it's not the whole story. Moving inference onto the device buys you four things at once:

Cost. The marginal token approaches free. No per-request metering, no surprise invoice when a feature goes viral.
Latency. No network round-trip. The model is a function call away, not an HTTPS request away — and for interactive chat, that's the difference between "snappy" and "spinner."
Privacy. The prompt never leaves the machine. For regulated data, personal documents, or anything a user wouldn't paste into a public box, "it never left the laptop" is the strongest guarantee you can offer.
Availability. It works on a plane, in a tunnel, behind a corporate firewall, during an outage. Offline isn't a degraded mode — it's the default.

The catch has always been: CPUs are bad at this. A general-purpose CPU runs a transformer forward pass slowly and, worse, inconsistently — latency jitters as the OS juggles cores. That's where the silicon story begins.

The NPU: purpose-built for the forward pass

A modern AI PC has three compute units, and Windows ML exposes each through its own execution provider:

Unit	Reference silicon (Snapdragon X Elite)	Execution Provider
NPU	Qualcomm Hexagon NPU (X1E80100)	QNNExecutionProvider
GPU	Qualcomm Adreno X1-85	DmlExecutionProvider
CPU	Snapdragon X 12-core @ 3.40 GHz	CPUExecutionProvider

The NPU (Neural Processing Unit) is not a faster CPU — it's a different kind of processor, designed for exactly one job: the dense, repetitive matrix multiplication that is model inference. Where a CPU is a brilliant generalist, an NPU is a specialist that does the transformer math at high throughput, low power, and — critically — low variance.

That last property matters more than people expect. For an interactive assistant, predictable latency is often worth more than the best average latency. A response that always lands in ~1 second feels better than one that averages 0.8 s but occasionally stalls for 5 s. We'll see this show up hard in the benchmark below: the NPU isn't just faster than the CPU here, it's more than two orders of magnitude more consistent.

NPUs also sip power. Running inference on the Hexagon NPU instead of pinning twelve CPU cores at 100% means the fan stays quiet and the battery survives the afternoon — which, for an always-available on-device copilot, is the whole point.

WinML CLI: from source model to hardware-optimized artifact

Here's the gap the NPU story usually hides: a Hugging Face checkpoint does not run on an NPU as-is. You can't hand model.safetensors to the Hexagon and expect tokens. Between "PyTorch weights" and "running on the NPU" sits a real pipeline — export to ONNX, optimize the graph, quantize to integer precision, and compile to a vendor-specific context binary.

That pipeline is exactly what Windows ML CLI automates. In its own words, it takes you "from a source model — whether from Hugging Face or your own pipeline — to a hardware-optimized artifact in a reproducible workflow," handling conversion, graph optimization, and compilation across AMD, Intel, NVIDIA, and Qualcomm targets. You can drive each stage by hand (export → analyze → optimize → quantize → compile) or let winml build generate the whole config for you. The same commands target every Windows ML execution provider, so you build once and run across hardware.

It's a single CLI, installed as a Python wheel, and it slots cleanly into CI/CD — which is what makes the whole edge-deployment workflow reproducible instead of a one-off science experiment.

Walkthrough: building Qwen3-0.6B for the NPU

This is the heart of the winml-qwen3-chat repo. Five steps: install, inspect, build, benchmark, run.

0. Prerequisites

The build was validated on a Snapdragon X Elite (ARM64) machine running Windows 11 24H2 (24H2 is required for NPU support), with Python 3.11, the uv package manager, and — for the demo app — the .NET 10 SDK with the WinUI workload.

1. Install the CLI

WinML CLI ships on PyPI. Set up an isolated environment with uv and install:

# Pin the exact Python the project needs
uv python install 3.11

# Create and activate a 3.11 virtual environment
uv venv --python 3.11 winml-env
.\winml-env\Scripts\activate

# Install the CLI from PyPI
uv pip install winml-cli

# Verify
winml --version
# -> winml, version 0.1.0

Then let the CLI introspect your machine — this is your first sanity check that the NPU is even visible:

winml sys

On the reference device this enumerates all three compute units in priority order (NPU → GPU → CPU) and the execution providers behind them (QNNExecutionProvider, DmlExecutionProvider, CPUExecutionProvider). If your NPU doesn't appear here, nothing downstream will use it — winml sys is where you diagnose that.

2. Inspect before you build

💡 Always inspect before build. It catches unsupported architectures in seconds instead of twenty minutes into a failed export.

winml inspect -m Qwen/Qwen3-0.6B

For Qwen3-0.6B this reports the shape of what you're about to build: a text-generation model mapped to the WinML class WinMLModelForCausalLM (composite), architecture Qwen3ForCausalLM with 28 layers, hidden size 1024, vocabulary 151,936, opset 17. Its inputs are input_ids, attention_mask, and position_ids; its output is logits.

One subtlety worth internalizing early: Qwen3 is a composite model — it exports as multiple ONNX components rather than a single graph. That detail comes back to bite you (productively) at quantization time.

3. Build for the NPU

This is the one command that does the heavy lifting — export, optimize, quantize, and compile, all targeting the NPU via the QNN execution provider:

winml build -m Qwen/Qwen3-0.6B -o output\qwen3-0.6b --ep qnn --device npu --compile -v

Under the hood it runs four stages. On the reference device:

Stage	Time	Output
Export	133.2 s	export.onnx (2.9 GB)
Optimize	157.6 s	optimized.onnx (2.9 GB)
Quantize	227.6 s	quantized.onnx (868 MB) — uint8 weight / uint16 activation
Compile	437.8 s	QNN HTP context (compiled_qnn.bin, 913 MB)
Total	~1191.5 s (~20 min)	model.onnx + winml_build_config.json

The arc of those four rows is the edge-deployment story in miniature: a 2.9 GB float export is quantized down to 868 MB (the integer precision the NPU wants), then compiled into a Qualcomm HTP context binary — a graph the Hexagon NPU can execute directly. The final deployable is output\qwen3-0.6b\model.onnx, a composite entry graph that points at the compiled context. (The ~9 GB of intermediate .onnx.data shards are safe to delete afterward to reclaim disk.)

One honest caveat the repo documents: because text-generation isn't in the CLI's built-in calibration task list, quantization falls back to a RandomDataset for calibration. For a latency benchmark that's expected and harmless — but hold that thought, because it's exactly the seam the demo app has to patch for production quality.

4. Benchmark: NPU vs CPU

Point winml perf at the compiled model and tell it the task:

# NPU (QNN): runs the compiled context
winml perf -m output\qwen3-0.6b\model.onnx --task text-generation --ep qnn --device npu --iterations 100 --warmup 10

# CPU: the compiled NPU graph can't run on CPU, so benchmark the
# pre-compile quantized.onnx (QDQ) instead
winml perf -m output\qwen3-0.6b\quantized.onnx --task text-generation --ep cpu --device cpu --iterations 50 --warmup 5

The results, for a full prefill forward pass at sequence length 1024:

Device	EP	Precision	Avg latency	Throughput	Std dev
NPU	QNN	w16a16	960.8 ms	1.04 samp/s	3.4 ms
CPU	CPU	w8a16	5793.3 ms	0.17 samp/s	828.3 ms

Two takeaways, and the second is the one to remember:

The NPU is ~6× faster than the CPU on this workload.
The NPU is ~240× more consistent — a 3.4 ms standard deviation versus 828 ms on the CPU. The CPU's latency swings by nearly a full second run to run; the NPU lands in the same spot every time. For an interactive chat experience, that stability is the feature.
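
Both ratios follow directly from the table. As a quick check of the arithmetic, the snippet below recomputes them from the reported averages and standard deviations. It also derives the coefficient of variation (standard deviation divided by mean), which is an extra metric not reported in the benchmark itself but useful for comparing jitter relative to each device's own latency.

```python
# Figures copied from the benchmark table above (milliseconds).
npu = {"avg_ms": 960.8, "std_ms": 3.4}
cpu = {"avg_ms": 5793.3, "std_ms": 828.3}

# Speedup: how many times lower the NPU's average latency is.
speedup = cpu["avg_ms"] / npu["avg_ms"]

# Consistency: ratio of the standard deviations.
consistency = cpu["std_ms"] / npu["std_ms"]

# Throughput in samples per second is 1000 ms divided by the average latency.
npu_throughput = 1000 / npu["avg_ms"]
cpu_throughput = 1000 / cpu["avg_ms"]

# Coefficient of variation: jitter relative to each device's own mean.
npu_cv = npu["std_ms"] / npu["avg_ms"]
cpu_cv = cpu["std_ms"] / cpu["avg_ms"]

print(f"speedup:     {speedup:.2f}x")
print(f"consistency: {consistency:.0f}x")
print(f"throughput:  NPU {npu_throughput:.2f} samp/s, CPU {cpu_throughput:.2f} samp/s")
print(f"CV:          NPU {npu_cv:.2%}, CPU {cpu_cv:.2%}")
```

The relative jitter makes the same point as the raw standard deviations: the NPU varies by well under 1% of its mean latency, while the CPU varies by roughly 14%.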

This is the token-economics argument made physical: the same model, on hardware the user already owns, delivering cloud-grade interactivity at a marginal token cost approaching zero — and doing it more predictably than the CPU fallback ever could.

5. From artifact to app

A compiled .onnx isn't a product. The repo closes that gap with a complete desktop example that consumes the NPU build above:

app/WinMLChat/ — a WinUI 3 chat UI (C# / .NET 10, MVVM) that streams replies token by token, so the NPU's low, steady latency translates directly into a responsive typing-style experience.
backend/ — a FastAPI server exposing an OpenAI-compatible streaming API, backed by winml.modelkit.WinMLAutoModel.

The most interesting design decision lives in the backend: hybrid NPU/CPU routing. Short prompts go to the quantized NPU build for speed and efficiency; long prompts fall back to an unquantized CPU path. Conceptually, the routing looks like this:

# Illustrative: the backend picks an execution path per request
def choose_backend(prompt_tokens: int):
    if prompt_tokens <= NPU_PREFILL_LIMIT:
        return npu_model  # quantized w16a16, QNN -> fast, low-power
    return cpu_model  # unquantized fallback -> handles long context

Because the API speaks the OpenAI streaming dialect, the WinUI front end (and any other OpenAI-compatible client) connects without bespoke glue — the NPU is hidden behind a familiar /chat/completions-style contract.
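
To illustrate what "the OpenAI streaming dialect" means on the wire, here is a minimal sketch of the client side: it takes the server-sent-event lines of a streamed chat completion and yields the text deltas in order. It assumes the backend emits the standard chunk shape (data: lines carrying choices[].delta.content, terminated by data: [DONE]). The function name iter_deltas is hypothetical and is not taken from the repo.

```python
import json
from typing import Iterable, Iterator


def iter_deltas(lines: Iterable[str]) -> Iterator[str]:
    """Yield text fragments from OpenAI-style streaming chat completion lines."""
    for line in lines:
        line = line.strip()
        if not line.startswith("data:"):
            continue  # skip blank keep-alive lines and SSE comments
        payload = line[len("data:"):].strip()
        if payload == "[DONE]":
            break  # end-of-stream sentinel
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            text = choice.get("delta", {}).get("content")
            if text:
                yield text


# Example with a hand-written stream; a real client would iterate over
# the HTTP response body line by line instead.
sample = [
    'data: {"choices":[{"delta":{"role":"assistant"}}]}',
    "",
    'data: {"choices":[{"delta":{"content":"Hel"}}]}',
    'data: {"choices":[{"delta":{"content":"lo"}}]}',
    "data: [DONE]",
]
reply = "".join(iter_deltas(sample))
```

A UI that appends each yielded fragment as it arrives gets the typing-style effect described above; the steadier the per-token latency, the smoother that effect looks.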

The production seam: patching quantization for a composite decoder

Remember the RandomDataset calibration warning? That's the honest boundary between a benchmark and a shipping app, and the repo confronts it head-on.

WinML CLI 0.1.0 can't quantize the Qwen3 composite decoder for QNN out of the box — its calibration reader omits position_ids and the 28 layers' past-KV inputs, so the calibration data doesn't match what the model actually consumes at runtime. The example fixes this at the root in backend/winml_npu_patch.py, which:

supplies real KV-cache calibration feeds (the actual position_ids and per-layer past-key/value tensors the decoder expects), and
switches to per-channel w16a16 weights with the lm_head excluded from quantization.

That's the difference between a model that benchmarks well and a model that generates well: calibrating against representative inputs — not random noise — and keeping the precision-sensitive output projection in higher precision so token quality holds up.

Where this is still rough

It would be dishonest to end on a victory lap. This is early, and it shows.

Qwen3-0.6B — and small LLMs in general — aren't enough on their own yet. A 0.6B model is a fine proof point for the pipeline, but it is a modest reasoner. For many real tasks you'll want a larger on-device model, retrieval, or a hybrid edge/cloud design — the silicon is ready before the small-model quality fully is.
Performance and optimization still have headroom. The build path works end to end, but there are sharp edges: text-generation falls back to a RandomDataset for calibration, the composite decoder needs a hand-written patch to quantize correctly, operator-level profiling (--op-tracing) isn't available in winml-cli 0.1.0, and the throughput numbers here are a starting line, not a ceiling. Expect to tune.

None of this is a reason to wait — it's a reason to get involved. The tooling is open and moving fast, and the fastest way to make it better is to push on it and report what breaks.

Found a bug, hit a wall, or have an idea? File it at github.com/microsoft/winml-cli/issues — feedback from real builds is exactly what sharpens these edges.

Wrap-up

The reason to care about edge AI isn't novelty — it's economics and physics, and they happen to point the same direction.

Token economics says the cloud's per-token meter never reaches zero, while the edge drives the marginal token toward free after a one-time hardware cost. For high-volume, interactive, privacy-sensitive workloads, that's a structural advantage, not a rounding error.
The NPU is the silicon that makes it viable: purpose-built for the transformer forward pass, delivering — in this build — roughly 6× the CPU's speed and ~240× its consistency, at a fraction of the power.
WinML CLI is the bridge that turns a Hugging Face checkpoint into a deployable, hardware-optimized artifact through one reproducible pipeline — inspect → build → perf — that targets every Windows ML execution provider.
winml-qwen3-chat ties it all together: a real Qwen3-0.6B NPU build, an honest benchmark, a streaming WinUI app, an OpenAI-compatible backend with hybrid routing, and — most instructively — a quantization patch that shows what it actually takes to move from "runs" to "ships."

The headline for builders: inference location is now an architecture decision. When the workload is high-volume, latency-sensitive, or privacy-bound, the right answer increasingly isn't a bigger cloud bill — it's the NPU already sitting in your users' laptops. The tooling to reach it is here today.

Start here:

Sample Code kinfey/winml-qwen3-chat
WinML CLI microsoft/winml-cli

Microsoft Leads a New Era of Software Supply Chain Transparency

ShubhraS — Fri, 19 Jun 2026 07:00:00 GMT

Microsoft announces the general availability of Microsoft’s Signing Transparency (MST), a first-of-its-kind capability that brings unprecedented visibility and trust to our software supply chain. With this release, Microsoft is leading the industry by recording the builds of critical cloud services in a publicly readable and verifiable ledger that complies with the SCITT (Supply Chain Integrity, Transparency, and Trust) standard. This means every production software build for in-scope services, such as Azure Attestation, Azure Managed HSM (Hardware Security Module), Azure confidential ledger, and Microsoft Signing Transparency itself (with others to follow over time), is now logged in an immutable, tamper-evident record. Only builds that are in the MST ledger are deployed to production, which gives customers confidence that the supply chain for these critical services can be audited at any time.

Notably, the MST ledger is fully open source and built to align with the emerging IETF SCITT standard. By embracing SCITT’s principles and open protocols, Microsoft ensures that MST not only secures our own ecosystem but also contributes to a broader industry movement toward standardized supply chain transparency. The open-source MST ledger serves as a verifiable trust anchor that any organization or researcher can inspect, audit, or even integrate with their own tooling. MST itself meets the highest levels of transparency: it is backed by a tamper-proof confidential ledger, it is open source, and it is independently verified. In short, we are making the foundation of our trust model transparent and accessible to everyone, reinforcing that trust must be earned through proof, not just promises.

This launch marks a major milestone in our commitment to Zero Trust principles, extending “never trust, always verify” all the way into the build itself. Building on a public preview introduced late last year, MST’s general availability delivers verifiable transparency at the software level. It transforms traditional code signing with an additive trust layer that is accessible via an open verification model. Every new software update is accompanied by a publicly auditable proof of integrity, enabling security teams to proactively confirm that each update is authentic and unaltered.

To help organizations get the most out of this capability, we are also introducing Ledger Explorer, a free offline tool that allows security teams to examine MST ledger entries, verify cryptographic proofs, and even validate the ledger’s integrity independently. This tool, combined with MST’s open design, ensures that every Microsoft customer, and the broader community, can hold us accountable in real time for the software we run on their behalf.

Key Benefits of Microsoft’s Signing Transparency (MST)

Verified Code Integrity – Every software release is cryptographically logged in MST’s ledgers. This makes each build tamper-evident and traceable. If an attacker attempts to inject malicious code or sign an unauthorized update, it will be evident through the well-defined validation step built into the SCITT standard. Organizations gain the assurance that code integrity can be independently confirmed at any time.

Independent Verification & Zero Trust – MST enables customers and auditors to verify software authenticity on their own, without relying solely on vendor attestations. For each update, Microsoft provides a transparency “receipt” (proof of logging) that you can use to prove the update was officially published and unaltered. This fosters a “don’t just trust, verify” approach, empowering security teams to double-check that everything running in their environment aligns with what Microsoft intended.

Audit-Trail & Compliance – The transparency ledger creates a permanent, auditable timeline of code deployments. Every entry is a record of what was released and when, backed by cryptographic proofs. This simplifies compliance reporting and accelerates forensic analysis. In the event of an incident, you can quickly audit the ledger to see if any unexpected code was introduced. For highly regulated industries, MST offers concrete evidence of software integrity and policy compliance over time.

Leadership & Open Standards – We are delivering real transparency now, encouraging a future where all critical software is released with verifiable integrity. MST’s open source implementation and SCITT-compliant design exemplify our commitment to openness and collaboration. We believe widespread adoption of these standards will strengthen supply chain security for everyone, making trust verification a universal practice.

Next Steps

Microsoft’s Signing Transparency is more than a new security feature: it advances the state of trust technology. As threats grow more sophisticated, we must evolve the way we assure our customers about the software they depend on. With MST now generally available, we are leading by example, proving that it is possible to open up the traditionally opaque process of software deployment and turn it into a source of strength and trust that empowers each person with verifiable transparency.

We invite the industry to join us on this journey and get started by reading the documentation and exploring Ledger Explorer today! Together, by embracing transparency and open standards, we can turn “trust but verify” from a slogan into an everyday reality for digital infrastructure.

Enterprise-ready Claude Desktop with Entra ID, APIM, and Microsoft Foundry (No Backend Required)

LZhang — Thu, 18 Jun 2026 03:47:52 GMT

How I put corporate sign-in in front of Claude Desktop without writing a single line of backend code.

TL;DR — In this post, I show how to securely enable Claude Desktop in enterprise environments using Microsoft Entra ID, Azure API Management, and Microsoft Foundry — without deploying a custom backend. This approach removes API keys from endpoints, enforces per-user identity, and aligns fully with Zero Trust principles.

Who this is for:

Enterprise architects evaluating secure AI client patterns
Developers enabling Claude Desktop in regulated environments
Platform teams standardizing identity and governance for LLM access

Why this post exists: Microsoft Learn's Configure Claude Desktop with Foundry Models only shows the API-key path — a shared key pasted into every user's Claude Desktop config. That's fine for a quick demo, but it's a non-starter for most enterprises (no per-user identity, no MFA / Conditional Access, hard to revoke, hard to audit). This post fills that gap: same Foundry backend, but with Microsoft Entra ID SSO in front via Azure API Management, so each user signs in with their corporate identity and zero secrets land on the laptop.

The problem

For many teams experimenting with Claude Desktop, the blocker isn't capability — it's enterprise readiness. How do you enforce identity, eliminate shared secrets, and apply governance without standing up a custom backend service to sit in front of the model?

Suppose your team wants to use Claude Desktop with its own Anthropic deployment running on Microsoft Foundry, subject to a few non-negotiable requirements:

No shared API keys floating around on developer laptops.
Per-user identity — every request must be attributable to a real person.
MFA and Conditional Access must apply, the same way they do for every other internal app.
Central rate-limiting and logging — a centralized control plane for governance.

Claude Desktop 1.5+ supports a "Gateway SSO" mode where it can sign each user in with OpenID Connect and forward their token to a custom LLM gateway. Azure API Management (APIM) is a perfect fit for that gateway role: it validates the user's Entra ID token, then re-authenticates itself to Foundry behind the scenes. APIM acts as a centralized policy enforcement layer, enabling identity validation, traffic governance, and secure re-authentication to backend AI services without custom code.

The end-to-end flow looks like this:

%%{init: {'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'useMaxWidth': true}, 'themeVariables': {'fontSize':'16px'}} }%%
flowchart TB
  User([Corporate user])
  Claude["Claude Desktop"]
  Entra["Microsoft Entra ID<br/>(OIDC + MFA + Conditional Access)"]
  APIM["Azure API Management<br/>validate-jwt → rewrite headers<br/>(policy gateway)"]
  Foundry["Microsoft Foundry<br/>Claude deployment"]
  User -- "1. Sign in (browser PKCE)" --> Entra
  Entra -- "2. ID token" --> Claude
  Claude -- "3. POST /v1/messages<br/>Authorization: Bearer ID token" --> APIM
  APIM -- "4. OIDC discovery / JWKS" --> Entra
  APIM -- "5. x-api-key (or Managed Identity)" --> Foundry
  Foundry -- "6. Response" --> APIM
  APIM -- "7. Response" --> Claude
  classDef azure fill:#0a4d8c,stroke:#0a3a6b,color:#ffffff;
  classDef client fill:#f3f3f3,stroke:#888,color:#222;
  class Entra,APIM,Foundry azure;
  class Claude,User client;

Or in plain text:

Claude Desktop
   │  Authorization: Bearer <Entra ID token from the user's browser sign-in>
   ▼
Azure API Management  (<your-apim>)
   │  ① validate-jwt   → verifies user's Entra ID token
   │  ② re-auths to Foundry with an API key from a Named value
   │  Authorization stripped, x-api-key injected
   ▼
Microsoft Foundry  /anthropic/v1/messages
   │  runs Claude (<your-deployment>)
   ▼
Response back to the user

There are no API keys on user devices. Foundry's key lives only inside APIM. And every request carries the user's oid claim, so I can build dashboards and per-user quotas later.
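
For a smoke test of the gateway without Claude Desktop, the request it sends in step 3 can be approximated with the Python standard library. This is a sketch under assumptions: the function name and example values are hypothetical, the body follows Anthropic's public Messages API shape (model, max_tokens, messages), and the anthropic-version header is included because that API expects it. Whether your deployment needs anything beyond that is not something this post verifies.

```python
import json
import urllib.request


def build_messages_request(gateway_base_url: str, bearer_token: str,
                           model: str, prompt: str,
                           max_tokens: int = 256) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/messages request for the gateway."""
    body = {
        "model": model,  # the Foundry deployment name
        "max_tokens": max_tokens,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=gateway_base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # The user's Entra ID token; APIM validates it with validate-jwt.
            "Authorization": f"Bearer {bearer_token}",
            "Content-Type": "application/json",
            "anthropic-version": "2023-06-01",
        },
    )


# Hypothetical values for illustration only.
req = build_messages_request(
    "https://example-apim.azure-api.net/claude/",
    "eyJ.example.token",
    "my-claude-deployment",
    "Hello",
)
# To actually send it: urllib.request.urlopen(req)
```

Note what is absent: there is no x-api-key on the client side. The only credential the client holds is the user's own token.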

What you need before starting

An Azure subscription with a Microsoft Foundry (AI Services) account and a Claude deployment. (Throughout this post I'll just call it Foundry.)
An API Management instance, any tier.
Permission to register applications in Entra ID for your tenant.
Claude Desktop 1.5.0 or later.
Azure CLI installed locally.

Throughout this post I'll use placeholders for resource names:

<apim-name> — your API Management service name
<resource-group> — the resource group that holds it
<foundry-account> — your Foundry account name
<deployment-name> — the name of the Claude model deployment on Foundry

Step 1 — Register an Entra ID app for Claude Desktop

This is the OIDC client Claude Desktop signs users into. Claude Desktop requires a single-tenant, public PKCE client (no client secret) with a loopback redirect URI, configured under the Mobile and desktop applications platform in Entra ID — the only platform that allows any loopback port.

I scripted it so the setup is one command and idempotent:

# scripts/register-claude-entra-app.ps1
[CmdletBinding()]
param(
  [string] $TenantId       = '<your-tenant-id>',
  [string] $SubscriptionId = '<your-subscription-id>',
  [string] $ResourceGroup  = '<resource-group>',
  [string] $ApimName       = '<apim-name>',
  [string] $AppDisplayName = 'Claude Cowork gateway',
  [string] $RedirectUri    = 'http://127.0.0.1/callback'
)

az account set --subscription $SubscriptionId | Out-Null

# 1. Create (or reuse) the app registration
$appId = az ad app list --display-name $AppDisplayName --query "[0].appId" -o tsv
if (-not $appId) {
  $appId = az ad app create --display-name $AppDisplayName `
            --sign-in-audience AzureADMyOrg --query appId -o tsv
}

# 2. Configure as public PKCE client with the Mobile/Desktop redirect URI
$objectId = az ad app show --id $appId --query id -o tsv
$patch = @{
  publicClient = @{ redirectUris = @($RedirectUri) }
  isFallbackPublicClient = $true
} | ConvertTo-Json -Depth 5 -Compress
az rest --method PATCH `
        --uri "https://graph.microsoft.com/v1.0/applications/$objectId" `
        --headers "Content-Type=application/json" --body $patch | Out-Null

# 3. Ensure a service principal exists
$sp = az ad sp list --filter "appId eq '$appId'" --query "[0].id" -o tsv
if (-not $sp) { az ad sp create --id $appId | Out-Null }

# 4. Push two Named values into APIM for the validate-jwt policy
az apim nv create -g $ResourceGroup --service-name $ApimName `
  --named-value-id entra-tenant-id --display-name entra-tenant-id `
  --value $TenantId --secret false
az apim nv create -g $ResourceGroup --service-name $ApimName `
  --named-value-id entra-client-id --display-name entra-client-id `
  --value $appId --secret false

"Client ID: $appId"

Run it once. The output prints the client ID you'll need in Claude Desktop later, and it leaves two Named values in APIM (entra-tenant-id, entra-client-id) that the gateway policy will reference.

⚠️ Common pitfall: if the redirect URI ends up under the Web platform instead of Mobile and desktop applications, Entra will demand a client secret on token exchange — Claude won't send one and you'll get Token exchange failed (HTTP 401). The app type can't be changed after creation, so create a new app if that happens.
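
Why no client secret is needed: in a PKCE flow the client proves possession of a one-time random verifier instead of a long-lived secret. The sketch below shows the S256 derivation defined in RFC 7636. It illustrates the mechanism only and is not Claude Desktop's code; the function names are hypothetical.

```python
import base64
import hashlib
import secrets


def make_code_verifier(n_bytes: int = 32) -> str:
    """Random, URL-safe verifier; 32 random bytes encode to 43 characters."""
    return base64.urlsafe_b64encode(secrets.token_bytes(n_bytes)).rstrip(b"=").decode("ascii")


def code_challenge_s256(verifier: str) -> str:
    """code_challenge = BASE64URL(SHA256(verifier)), without padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


verifier = make_code_verifier()
challenge = code_challenge_s256(verifier)
# The challenge goes in the authorization request; the verifier is sent
# later with the token request so the server can recompute and compare.
```

Because the verifier never leaves the device until the token exchange, an intercepted authorization code is useless on its own. This is what makes a secretless public client acceptable on a desktop.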

Step 2 — Create the API in APIM

In the portal under APIM → APIs → + Add API → HTTP:

Field	Value
Display name	Anthropic API
Name	`anthropicapi`
Web service URL	`https://<foundry-account>.services.ai.azure.com/anthropic`
API URL suffix	`claude`
Subscription required	Off (Entra ID is our only credential)

Add two operations under it:

Method	URL	Display name
POST	`/v1/messages`	Create message
GET	`/v1/models`	List models

The /v1/models operation isn't strictly needed (Foundry's Anthropic surface doesn't implement it), but having it registered means you can decide later whether to stub it out or proxy it.

Step 3 — Add an API key for Foundry as a Named value

APIM → Named values → + Add:

Name: foundry-key
Type: Secret
Value: paste a key from the Foundry account's Keys and Endpoint blade.

This is the only place the key ever lives. Clients never see it.

Alternative — keyless with Entra ID (managed identity): If you prefer not to manage a Foundry key at all, enable the APIM instance's system-assigned managed identity (APIM → Identity → System assigned → On), then grant that identity the Foundry User role on the Foundry account (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d — previously named Azure AI User; Microsoft renamed it but the ID and permissions are unchanged). In Step 4, replace the set-header that injects x-api-key with:
<authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="foundry-token" />
<set-header name="Authorization" exists-action="override">
  <value>@("Bearer " + (string)context.Variables["foundry-token"])</value>
</set-header>
Then you can skip the foundry-key Named value entirely. Don't use the legacy Cognitive Services User role — per the Foundry RBAC doc, roles starting with Cognitive Services don't apply to Foundry scenarios.

Step 4 — Write the gateway policy

This is the core enforcement layer in the architecture. Open APIs → anthropicapi → All operations → Inbound processing → </> and paste:

<policies>
  <inbound>
    <base />

    <!-- USER → APIM: verify Entra ID token from Claude Desktop -->
    <validate-jwt header-name="Authorization"
                  failed-validation-httpcode="401"
                  failed-validation-error-message="Unauthorized"
                  require-scheme="Bearer">
      <openid-config url="https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0/.well-known/openid-configuration" />
      <audiences>
        <audience>{{entra-client-id}}</audience>
      </audiences>
      <issuers>
        <issuer>https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0</issuer>
      </issuers>
    </validate-jwt>

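    <!-- APIM → Foundry: drop the caller's token, then authenticate with the Foundry key -->
    <set-backend-service base-url="https://<foundry-account>.services.ai.azure.com/anthropic" />
    <set-header name="Authorization" exists-action="delete" />
    <set-header name="x-api-key" exists-action="override">
      <value>{{foundry-key}}</value>
    </set-header>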
    <set-query-parameter name="api-version" exists-action="skip">
      <value>2024-05-01-preview</value>
    </set-query-parameter>
  </inbound>
  <backend><base /></backend>
  <outbound><base /></outbound>
  <on-error><base /></on-error>
</policies>

Two things to notice:

validate-jwt uses the OIDC discovery URL — JWKS keys are fetched and cached automatically. It rejects any token whose aud claim is not the client ID of our Entra app, which is exactly what we want.
The Authorization header from the user is not forwarded — once validate-jwt succeeds, the request is re-authenticated to Foundry with x-api-key. No user token ever leaves APIM.

APIM becomes the security boundary — user identity is validated at the edge, and downstream services never see or rely on user tokens.
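To make the audience and issuer checks concrete, here is a minimal Python sketch of the claim comparison that validate-jwt performs once the signature has been verified. It deliberately skips signature verification (APIM does that against the JWKS keys), and the function names and sample values are hypothetical, not part of the setup above.

```python
import base64
import json


def decode_claims(token: str) -> dict:
    """Decode the JWT payload WITHOUT verifying the signature (illustration only)."""
    payload_b64 = token.split(".")[1]
    # JWTs use unpadded base64url; restore the padding before decoding.
    payload_b64 += "=" * (-len(payload_b64) % 4)
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def claims_match_policy(claims: dict, client_id: str, tenant_id: str) -> bool:
    """Mirror the <audiences> and <issuers> checks from the policy."""
    expected_issuer = f"https://login.microsoftonline.com/{tenant_id}/v2.0"
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    return client_id in audiences and claims.get("iss") == expected_issuer


def make_unsigned_token(claims: dict) -> str:
    """Build a fake, unsigned token for local experiments (hypothetical helper)."""
    def enc(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{enc({'alg': 'none'})}.{enc(claims)}."
```

A token whose aud is some other application, or whose iss belongs to another tenant, fails the comparison, which is the behavior the policy relies on.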

Step 5 — Configure Claude Desktop

Open Claude Desktop → Configure third-party inference and fill it in like this:

Field	Value
Connection	Gateway
Credential kind	Interactive sign-in
Gateway base URL	`https://<apim-name>.azure-api.net/claude`
Client ID	(the appId your script printed)
Issuer URL	`https://login.microsoftonline.com/<tenant-id>/v2.0`
Authorization URL / Token URL	leave empty
Bearer token	ID token (default)
Scopes	leave default (`openid profile email offline_access`)
Redirect port	leave empty (ephemeral)
Model discovery	Off
Model list → Model ID	`<deployment-name>` (your Foundry deployment name)

ℹ️ Why Model discovery is Off — Claude Desktop's discovery uses GET /v1/models, and the Foundry /anthropic surface doesn't implement that endpoint, so it 404s. Listing the model manually skips the call entirely.

If you want to leave Model discovery On, stub /v1/models in APIM. Add a GET /v1/models operation to your API and give it this inbound policy that returns an Anthropic-shaped response without ever hitting the backend:
<policies>
  <inbound>
    <base />
    <return-response>
      <set-status code="200" reason="OK" />
      <set-header name="Content-Type" exists-action="override">
        <value>application/json</value>
      </set-header>
      <set-body>@{
        return new JObject(
          new JProperty("data", new JArray(
            new JObject(
              new JProperty("id", "<deployment-name>"),
              new JProperty("type", "model"),
              new JProperty("display_name", "Claude on Foundry"),
              new JProperty("created_at", "2026-01-01T00:00:00Z")
            )
          )),
          new JProperty("has_more", false),
          new JProperty("first_id", "<deployment-name>"),
          new JProperty("last_id",  "<deployment-name>")
        ).ToString();
      }</set-body>
    </return-response>
  </inbound>
  <backend><base /></backend>
  <outbound><base /></outbound>
  <on-error><base /></on-error>
</policies>
Add one entry per deployment you want to expose. The benefit of stubbing rather than turning discovery off is that adding new models becomes a policy edit — no need to re-export and redeploy Claude Desktop config to every user.
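If you expose several deployments, it can help to generate the response body rather than hand-edit it. The following Python sketch builds the same JSON shape as the stub above, one entry per deployment. The function name and the display-name format are my own assumptions; the field names are copied from the policy.

```python
import json


def models_list_body(deployments: list[str], display_prefix: str = "Claude on Foundry") -> str:
    """Build the same JSON shape as the policy stub, one entry per deployment."""
    data = [
        {
            "id": name,
            "type": "model",
            "display_name": f"{display_prefix} ({name})",  # assumed naming scheme
            "created_at": "2026-01-01T00:00:00Z",  # placeholder, as in the policy
        }
        for name in deployments
    ]
    body = {
        "data": data,
        "has_more": False,
        "first_id": deployments[0] if deployments else None,
        "last_id": deployments[-1] if deployments else None,
    }
    return json.dumps(body, indent=2)
```

The output can be pasted into the set-body element as a literal string, or used to check what the policy expression returns.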

Click Apply Changes then Sign in to your organization. Your browser opens to the normal Entra sign-in page; once approved you're returned to the app, and a quick connection test runs.

The success indicator is a small green banner:

✅ Inference — 1-token completion in 1449 ms · via identity provider

For broader rollout, hit the Export button at the top of the configuration window — it produces a .mobileconfig (macOS) or .reg (Windows) you can push via Intune / Jamf to every user's machine.

Step 6 — Verify both hops

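In APIM → APIs → anthropicapi → Test → POST /v1/messages I sent the following, with a valid Entra token supplied in an Authorization: Bearer header (the test console does not attach one on its own):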

Headers:
  anthropic-version: 2023-06-01
Body:
  { "model": "<deployment-name>", "max_tokens": 64,
    "messages": [{"role":"user","content":"hi"}] }

Click Send → Trace, and look at three places:

Inbound → validate-jwt: should say succeeded and show the decoded claims (your oid, email, etc.).
Backend → Request: outbound URL is https://<foundry-account>.services.ai.azure.com/anthropic/v1/messages?api-version=2024-05-01-preview, with x-api-key: **** present and Authorization absent.
Backend → Response: 200, with a Claude message JSON body.

That confirms both halves of the chain.
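The same check can be scripted outside the portal. This is a minimal Python sketch, assuming you already hold a valid Entra token for the app; the function name is hypothetical and the URL shape follows the gateway base URL from Step 5. It only builds the request, and the commented lines show how you might send it.

```python
import json
import urllib.request


def build_messages_request(gateway_base_url: str, entra_token: str, deployment: str) -> urllib.request.Request:
    """Build the POST /v1/messages call that Claude Desktop would make."""
    body = {
        "model": deployment,
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "hi"}],
    }
    return urllib.request.Request(
        url=f"{gateway_base_url.rstrip('/')}/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {entra_token}",
            "anthropic-version": "2023-06-01",
            "Content-Type": "application/json",
        },
    )


# To send it (needs a real token and a reachable gateway):
# with urllib.request.urlopen(build_messages_request(base_url, token, "<deployment-name>")) as resp:
#     print(resp.status, resp.read()[:200])
```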

Bumps I hit along the way

A few issues I ran into during setup, shared so you can skip them:

Symptom	Cause	Fix
Claude shows "Your provider's model list hasn't loaded yet" and `/v1/models` returns 404	Foundry's Anthropic surface doesn't implement that endpoint	Turn Model discovery OFF in Claude Desktop and add the deployment name manually
Claude shows "Authentication failed" even though sign-in worked	The APIM API still had Subscription required = ON, blocking the call before `validate-jwt` ran with `401: Access denied due to missing subscription key`	Uncheck Subscription required on the API
Portal Test panel shows "Cannot read properties of undefined (reading 'statusCode')"	The test console doesn't attach an Entra token, so `validate-jwt` 401s and the panel's JavaScript crashes	Comment out `<validate-jwt>` temporarily for portal testing, or test via `curl` with a real token
`OIDC discovery failed (HTTP 404)` in Claude Desktop	Pasted the metadata URL into Issuer URL	Issuer must end at `/v2.0`, not at `/.well-known/openid-configuration`
`Token exchange failed (HTTP 401)`	App registered under Web platform instead of Mobile and desktop applications	Create a new app with the right platform — it can't be changed

Where this leaves us

This pattern is small in moving parts but has outsized architectural impact:

Zero secrets on endpoints. Eliminates API-key sprawl across laptops, MDM profiles, and shared vaults. The Foundry key lives only inside APIM — or disappears entirely when you switch APIM to managed identity.
Identity, not credentials. Every Claude Desktop user authenticates against Entra ID in their browser, the same as Office or Teams. MFA, Conditional Access, and Entra ID Protection apply automatically — no parallel auth story to maintain.
Per-user observability built in. APIM logs carry the user's Entra oid, email, and group claims. That unlocks per-user dashboards, cost allocation, and abuse detection without any client-side instrumentation.
Aligned with Zero Trust. Strong identity at the edge, no implicit trust between hops, single policy chokepoint for inspection and rate-limiting, and full revocability through a single Enterprise Application.
Optional but trivial keyless path. Flip APIM to system-assigned managed identity + <authentication-managed-identity resource="https://cognitiveservices.azure.com" /> and one Foundry User role assignment (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d, formerly Azure AI User) on the Foundry account. See the Foundry RBAC doc — don't use any Cognitive Services * roles for Foundry.

What I'd add next

llm-token-limit and llm-emit-token-metric policies for per-user quotas and cost visibility.
App Insights wiring on the API, with a workbook that pivots on the oid claim.
Assignment required = Yes on the Entra Enterprise Application + a security group, so only approved users can sign in.
Intune deployment of the exported .reg / .mobileconfig so the gateway URL and client ID land on devices automatically.

But that's all incremental. The hard part — getting Claude Desktop, Entra ID, APIM, and Foundry to agree on who's allowed to talk to whom — is done. Total elapsed: about an afternoon, most of it spent learning where each portal hides its switches.

Azure Function App — Queue-Based Architecture for Long-Running Sync Jobs

Jamesdld23 — Tue, 16 Jun 2026 07:00:00 GMT

The Problem: HTTP Triggers and Long-Running Jobs Don't Mix

Here's a situation you've probably run into: you have a job that needs to loop over dozens of Azure resources, call APIs, and do real work. You wrap it in an HTTP-triggered Azure Function so it can be called on demand. It works great until the job runs long, and then after a few minutes the caller gets a 504 Gateway Timeout.

⭐ This isn't a bug. Azure (like most cloud platforms) enforces a hard 230-second HTTP response timeout at the load balancer level — no matter what you configure in your function. If your job takes longer than that, the connection is cut.

The 230-second limit comes from the default idle timeout of the Azure Load Balancer that sits in front of the platform. It cannot be overridden by app settings or host configuration. Any HTTP trigger that runs longer than about 3 minutes 50 seconds will time out for the caller.

In our case, the job iterates over 30+ Azure subscriptions — for each one it switches context, lists resources, and triggers image imports. Total runtime: anywhere from 2 to 10 minutes depending on how many ACRs need updating. Way over the limit.

The Solution: Decouple Request from Execution via a Queue

The fix is clean once you see it: the HTTP trigger shouldn't do the work — it should just accept the work and hand it off. That's what a queue is for.

The flow splits into two independent phases:

Request phase — The HTTP trigger validates the caller (JWT + app role check), packages the job parameters into a queue message, and returns 202 Accepted. This takes under 3 seconds.
Execution phase — A Queue Trigger picks up the message and runs the actual sync. No HTTP connection involved, so there's no timeout. On a Dedicated (P-series) plan, execution time is unlimited.

Approach	What the caller gets	Result
HTTP trigger → run sync inline	Waits for the full job to complete	504 Gateway Timeout after 230 seconds
HTTP trigger → Queue → Queue Trigger	202 Accepted immediately	No timeout; the job runs as long as needed
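The two phases can be sketched in a few lines of Python. This is an in-memory illustration only: queue.Queue stands in for the Storage Queue, and the function names are hypothetical rather than taken from the Function App.

```python
import json
import queue
from datetime import datetime, timezone

work_queue: "queue.Queue[str]" = queue.Queue()  # stand-in for the Storage Queue


def http_trigger(caller: str, subscription_filter: str = "*") -> tuple[int, dict]:
    """Request phase: package the job and return 202 without doing the work."""
    payload = {
        "triggeredBy": caller,
        "triggeredAt": datetime.now(timezone.utc).isoformat(),
        "subscriptionFilter": subscription_filter,
    }
    work_queue.put(json.dumps(payload))
    return 202, {"message": "Sync job queued."}


def queue_trigger(run_sync) -> None:
    """Execution phase: no HTTP connection is open while this runs."""
    message = json.loads(work_queue.get())
    run_sync(message["subscriptionFilter"])
```

The caller only ever waits for http_trigger, so the response time no longer depends on how long run_sync takes.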

🤸‍♀️There's an added bonus - Reliability in Azure Queue Storage: Azure Storage Queues give you automatic retry out of the box. If the job crashes halfway through, the message becomes visible again after a visibility timeout and the Queue Trigger picks it up for a retry — up to 5 attempts before the message is moved to the poison queue. No retry logic to write 🤸‍♀️.

Locking Down the Endpoint

Since the HTTP trigger is the public entry point, it needs solid auth. We layer two things:

⭐Use EasyAuth for the "is this a real Entra ID token?" check, and a custom App Role for the "is this person allowed to trigger syncs?" check. These are independent concerns and should stay that way.

Layer	What it does	How
EasyAuth (Entra ID)	Rejects requests without a valid Entra ID Bearer token — before your code even runs	Configured at the Function App level via the Authentication blade
App Role check	Validates that the token contains the SyncJob.Execute role — only assigned users/SPs can trigger the job	Decoded in the function code from the JWT roles claim
Managed Identity	Authenticates the Function App to Azure APIs (no credentials in code)	Connect-AzAccount -Identity — identity assigned via RBAC

One gotcha worth knowing: when using v2 tokens (which is the default with modern App Registrations), the aud claim in the token is the raw App ID GUID — not the api:// prefixed URI. You need to explicitly add both forms to your allowedAudiences in EasyAuth, otherwise valid tokens get rejected.
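As a rough illustration of the two checks, here is a Python sketch that accepts both audience forms and then looks for the app role. In the real setup EasyAuth performs the audience check and the function code performs the role check; combining them in one function, and the function names themselves, are my own simplification.

```python
REQUIRED_ROLE = "SyncJob.Execute"


def allowed_audiences(app_id: str) -> set[str]:
    """Both forms a valid token may carry in its aud claim."""
    return {app_id, f"api://{app_id}"}


def is_authorized(claims: dict, app_id: str) -> bool:
    """Audience must match one allowed form, and the roles claim must include the app role."""
    audience_ok = claims.get("aud") in allowed_audiences(app_id)
    role_ok = REQUIRED_ROLE in claims.get("roles", [])
    return audience_ok and role_ok
```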

APP_ID="<your-app-id>"
TENANT_ID="<your-tenant-id>"
FUNCTION_APP_URL="https://<your-function-app>.azurewebsites.net"

# Interactive login (device code flow, works from any terminal)
az login --tenant "${TENANT_ID}" \
  --scope "api://${APP_ID}/.default" \
  --use-device-code

TOKEN=$(az account get-access-token \
  --scope "api://${APP_ID}/.default" \
  --query accessToken -o tsv)

# Trigger the sync (returns 202 immediately)
curl -s -X POST "${FUNCTION_APP_URL}/api/SyncContainerRegistryHttpTrigger" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json"

Passing Parameters Through the Queue

One nice property of this pattern: the queue message is just JSON, so you can pass whatever parameters the job needs. In our case, we pass a subscriptionFilter wildcard so callers can target a subset of subscriptions without touching any code.

The parameter travels the full chain: HTTP body → queue message → Queue Trigger → PowerShell script parameter. Here's how each step handles it.

Step 1 — HTTP Trigger reads the body and enqueues the message using the Push-OutputBinding output binding. Azure Functions wires the binding to the queue automatically — no SDK call needed:

param($Request, $TriggerMetadata)

# ... decode the JWT, check role assignment

$queuePayload = @{
    triggeredBy        = $decoded.Payload.upn ?? $decoded.Payload.oid
    triggeredAt        = (Get-Date -Format 'o')
    subscriptionFilter = if ($body.subscriptionFilter) { $body.subscriptionFilter } else { "*" }
} | ConvertTo-Json -Compress

Push-OutputBinding -Name QueueMessage -Value $queuePayload

Push-OutputBinding -Name Response -Value ([HttpResponseContext]@{
    StatusCode = [System.Net.HttpStatusCode]::Accepted
    Body       = @{ message = "Sync job queued. Check Azure Monitor logs for execution status." }
})

⭐Push-OutputBinding is how Azure Functions PowerShell workers write to output bindings (queues, blobs, HTTP responses…). The binding name QueueMessage maps to the queue defined in function.json — the runtime handles serialisation and delivery.

Step 2 — Queue Trigger passes the filter to the script as a named parameter:

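param($QueueItem, $TriggerMetadata)

Write-Host "Triggered SyncContainerRegistry via Storage Queue. Payload: $QueueItem"

# Default to all subscriptions when the message carries no filter.
# PowerShell variable names are case-insensitive, so a single assignment is enough.
$SubscriptionFilter = if ($QueueItem.subscriptionFilter) { $QueueItem.subscriptionFilter } else { "*" }

# Dot-sourcing runs the script in this scope, so it can read $SubscriptionFilter
. "$PSScriptRoot/../SyncContainerRegistry/run.ps1"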

Step 3 — Long running job with the filter as parameter:

param($Timer)

if (-not $SubscriptionFilter) { $SubscriptionFilter = "*" }

$subscriptions = Get-AzSubscription | Where-Object { $_.Name -like $SubscriptionFilter }

foreach ($subscription in $subscriptions) {
    Set-AzContext -SubscriptionId $subscription.Id | Out-Null
    # ... do the work
}

Targeting a subset of subscriptions

# Sync all subscriptions (default: omit the body)
curl -s -X POST "${FUNCTION_APP_URL}/api/SyncContainerRegistryHttpTrigger" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json"

# Sync only subscriptions matching a pattern
curl -s -X POST "${FUNCTION_APP_URL}/api/SyncContainerRegistryHttpTrigger" \
  -H "Authorization: Bearer ${TOKEN}" \
  -H "Content-Type: application/json" \
  -d '{"subscriptionFilter": "*project-alpha*"}'

⭐PowerShell's -like operator uses * as a wildcard anywhere in the string. The pattern *project-alpha* matches sub-mycompany-project-alpha-prd, sub-mycompany-project-alpha-dev, etc. A pattern without a leading * only matches from the start of the string — keep this in mind when naming subscriptions.
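If you want to try a filter before sending it, the matching rule is easy to approximate outside PowerShell. This Python sketch uses fnmatch as a stand-in for -like (lowercasing both sides because -like is case-insensitive). It is an approximation for experimenting with patterns, not the code the Function App runs, and the beta subscription name is made up.

```python
from fnmatch import fnmatchcase


def like(name: str, pattern: str) -> bool:
    """Approximate PowerShell's case-insensitive -like with fnmatch."""
    return fnmatchcase(name.lower(), pattern.lower())


names = [
    "sub-mycompany-project-alpha-prd",
    "sub-mycompany-project-alpha-dev",
    "sub-mycompany-project-beta-prd",  # hypothetical non-matching subscription
]

# Leading and trailing wildcards: matches both project-alpha subscriptions.
print([n for n in names if like(n, "*project-alpha*")])

# No leading wildcard: must match from the start of the name, so nothing matches.
print([n for n in names if like(n, "project-alpha*")])
```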

Pushing a Message Directly via PowerShell

You can also push a message straight to the queue without going through the HTTP trigger — useful for testing, scripting, or bypassing the auth layer in a controlled environment.

Connect-AzAccount   # or -Identity for a Managed Identity context

$storageAccount = "<your-storage-account>"
$queueName      = "sync-job-queue"

# Build the payload (same shape the HTTP trigger produces)
$payload = @{
    triggeredBy        = $env:USERNAME
    triggeredAt        = (Get-Date -Format 'o')
    subscriptionFilter = "*project-alpha*"   # or "*" for all
} | ConvertTo-Json -Compress

# Get a queue client via the connected account (no key needed)
$ctx   = New-AzStorageContext -StorageAccountName $storageAccount -UseConnectedAccount
$queue = Get-AzStorageQueue -Name $queueName -Context $ctx
$queue.QueueClient.SendMessage($payload)

⭐ -UseConnectedAccount authenticates via the current Connect-AzAccount session — no storage key required, as long as your identity has the Storage Queue Data Message Sender role on the storage account.

The Queue Message

The HTTP trigger packages the caller identity and filter into a simple JSON payload before enqueuing. The Queue Trigger reads it back as a deserialised PowerShell object — no manual JSON parsing needed.

{
  "triggeredBy": "user@company.com",
  "triggeredAt": "2026-06-01T11:03:55.570+02:00",
  "subscriptionFilter": "*project-alpha*"
}

Design Decisions at a Glance

Decision	Choice	Why
Async execution	Azure Storage Queue	HTTP trigger has a hard 230s timeout. The sync job takes 2–10 minutes. The queue decouples acceptance from execution — and gives us retry for free.
Authentication	EasyAuth + App Role	No credentials in code. Access is controlled via Entra ID app roles — revocable per user without touching infrastructure.
Azure identity	Managed Identity	No secrets to rotate or store. The Function App authenticates to Azure APIs using its platform-assigned identity.
Job parameter	Wildcard filter via queue payload	Lets callers target any subscription subset without code changes. The filter travels through the queue — the Queue Trigger just passes it along.
Hosting plan	Dedicated (P-series)	Consumption plan caps function execution at 10 minutes. On a Dedicated plan the timeout defaults to 30 minutes but can be raised without an upper bound via functionTimeout in host.json, which is essential when the job can run longer.

See you in the Cloud

Jamesdld

Agents That Test Agents: A Cloud-Native Skill-Eval Harness on Foundry Hosted Agents

kinfey — Mon, 15 Jun 2026 07:00:00 GMT

Skills are an agent's must-have. So test them.

A skill is the lightest way to give an agent durable, reusable behavior: a SKILL.md file you author once, store centrally in Foundry's versioned Skills API, and inject into a Hosted Agent's context — no code change, no redeploy. That's why skills have quietly become standard equipment for production agents.

But the moment a skill carries real behavior, a hard question follows: how do you know it still works? When you edit a skill you can't feel whether you improved it or just changed it. It might stop triggering, skip a required section, or quietly produce a worse result on one model than another. The cure is the same discipline we use for any prompt — evaluate it: run the agent, capture what happened, and grade it against a small set of checks.

This is exactly what azure_skill_eval does for one concrete skill: edu-video-script, which writes an education short-video script for a given knowledge point (the sample's smoke test asks it to script the "P vs NP problem"). And it does the whole thing cloud-native, on Foundry Hosted Agents.

The scenario: one skill, two models, four hosted agents

The skill under test is edu-video-script. The clever part of the harness is that it doesn't just check one run — it puts the skill on a stand and stresses it from three sides, using four Foundry Hosted Agents wired together by the Agent Framework FoundryAgent:

Hosted agent	Role
skill-eval-business-agent-gpt	System under test (SUT), running edu-video-script on gpt-5.5
skill-eval-business-agent-deepseek	The same skill, running on DeepSeek-V4-Pro
skill-eval-attacker-agent	Multi-turn adversarial prompt generator
skill-eval-judge-agent	LLM-as-judge that returns a rubric score as JSON

Two business agents run the same skill on different models, so every case becomes an apples-to-apples comparison: which model executes this skill better? The attacker and judge are the graders.

What we measure (define "done" first)

Good evals start from a checkable definition of done — outcome, process, style, efficiency. For an education-video script that means: Did it produce a valid script (outcome)? Did it actually follow the edu-video-script template (process/style)? Does it hold up when a user pushes on it across turns (robustness)? The harness answers these with three grading layers.

1. Deterministic checks first (validator.py)

The cheapest, most explainable signal: does the output match the script template the skill is supposed to produce? validator.py runs fixed, deterministic template checks — no model needed. These catch the obvious regressions instantly and never cost a token.
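A deterministic template check can be as small as the sketch below. The section headings are hypothetical placeholders (the real edu-video-script template and validator.py may look quite different); the point is only that this layer is plain string matching with no model call.

```python
import re

# Hypothetical section headings; the real edu-video-script template may differ.
REQUIRED_SECTIONS = ["Hook", "Explanation", "Recap"]


def check_template(script: str, required=REQUIRED_SECTIONS) -> dict:
    """Report which required Markdown headings are missing from the script."""
    missing = [
        name
        for name in required
        if not re.search(
            rf"^#+\s*{re.escape(name)}\b", script, flags=re.MULTILINE | re.IGNORECASE
        )
    ]
    return {"passed": not missing, "missing_sections": missing}
```

Because the result names the missing sections, a failure is explainable without reading the whole script.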

2. The LLM judge (skill-eval-judge-agent)

Template checks answer "did it do the basics?" but not "is the script any good?" — pacing, clarity, whether it teaches the concept. For that, a dedicated judge hosted agent grades the result and returns structured JSON so scores compare cleanly across runs and models:

{ "overall_pass": true, "score": 100, "checks": [] }

Structured output is the point: stable fields (overall_pass, score, checks) diff cleanly between GPT and DeepSeek, and between today's skill version and last week's.
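One way to take advantage of the stable fields is to validate the judge output before storing it, so a malformed response fails loudly instead of skewing a comparison. This Python sketch assumes only the three fields shown above; the function names are hypothetical and not taken from the sample repository.

```python
import json

REQUIRED_FIELDS = {"overall_pass": bool, "score": (int, float), "checks": list}


def parse_judge_result(raw: str) -> dict:
    """Parse the judge's JSON and verify the three stable fields are present and typed."""
    result = json.loads(raw)
    for field, expected_type in REQUIRED_FIELDS.items():
        if field not in result:
            raise ValueError(f"judge output is missing '{field}'")
        if not isinstance(result[field], expected_type):
            raise ValueError(f"judge field '{field}' has the wrong type")
    return result


def score_delta(result_a: dict, result_b: dict) -> float:
    """Positive when result_a scored higher than result_b on the same case."""
    return result_a["score"] - result_b["score"]
```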

3. The multi-turn attacker (test_agent.py + skill-eval-attacker-agent)

A skill that looks great on a clean prompt can still fall apart when a user pushes on it. The attacker agent generates adversarial prompts for a knowledge point using a chosen strategy — for example extreme length — and keeps the pressure on across multiple turns (max_turns, default 3). This is where you find out whether edu-video-script stays on-template under stress, not just on the happy path.

# the attacker takes a knowledge point + a strategy, emits one user prompt
azd ai agent invoke skill-eval-attacker-agent \
  "Topic: P vs. NP problem Recommended attack strategy: Extreme length Please output the unique user prompt text."
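The multi-turn loop itself can be sketched independently of Foundry. In this Python sketch the attacker and the business agent are plain callables (assumptions standing in for the hosted agents), and the only detail taken from the harness is max_turns with its default of 3.

```python
from typing import Callable

# (topic, strategy, history) -> next adversarial user prompt
Attacker = Callable[[str, str, list], str]
# (history) -> reply from the agent under test
Business = Callable[[list], str]


def run_multi_turn(
    topic: str,
    strategy: str,
    attacker: Attacker,
    business: Business,
    max_turns: int = 3,
) -> list:
    """Alternate attacker prompts and agent replies, returning the full transcript."""
    history: list = []
    for _ in range(max_turns):
        prompt = attacker(topic, strategy, history)
        history.append({"role": "user", "content": prompt})
        reply = business(history)
        history.append({"role": "assistant", "content": reply})
    return history
```

The transcript can then be handed to the template checks and the judge, so every turn is graded rather than just the last one.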

The eval loop, end to end

runner.py is a ghcsdk-style pipeline that runs cases × models, with each side toggleable: pick all models / GPT only / DeepSeek only, run a single case (e.g. edge-03), and switch adversarial mode, single-turn vs multi-turn, and judge grading on or off. The same switches are query parameters on POST /api/run: model, only_case, use_attack, single_turn, use_judge, max_turns.

The test set lives in shared/test_cases.py — 10 built-in edge cases (edge-01 … edge-10) exported to evals/evals.json. You don't need a giant benchmark; a small, sharp set catches regressions, and you grow it whenever a real failure shows up:

python -m evals.export_evals # regenerate evals/evals.json from shared/test_cases.py

Every SUT call goes through runtime.py, which follows the official Agent Framework hosted-agent sample: it opens a fresh hosted session per turn, invokes via Responses, and tears the session down afterward.

# shared/runtime.py: the documented Foundry hosted-agent pattern
project = AIProjectClient(endpoint=FOUNDRY_PROJECT_ENDPOINT, credential=cred, allow_preview=True)
agent = FoundryAgent(project_client=project,
                     name=agent_name,  # e.g. skill-eval-business-agent-gpt
                     allow_preview=True)
session = project.beta.agents.create_session(agent_name=agent_name)
# ... send the (possibly adversarial) prompt, collect the Responses output ...

So a single case flows: runner → business agent (skill runs) → validator → judge, optionally with the attacker driving multiple turns first.

Cloud-native by design — and why that matters for eval

This is the part that makes the harness production-grade rather than a laptop script. The hard parts of an eval harness — provisioning agents, recording every run, scaling trials, governing access — are handled by Azure, not by you.

Foundry Hosted Agents are the runtime. The SUT, attacker, and judge all run as managed hosted agents in your Foundry project. You bring the skill and the cases; Foundry hosts the agents, models, and sessions. The business agents deploy with host: azure.ai.agent and docker.remoteBuild: true, so azd deploy builds the containers in Azure Container Registry — local Docker doesn't even need to be running.
The UI is serverless. A FastAPI app on Azure Container Apps lets you upload evals.json, watch progress live, and browse the dashboard — scale-to-zero when no one's running evals.
Every run is durable. Results land in Azure Blob Storage (skill-eval-runs), one yymmdd-XXXXXX/ folder per run, with a newest-first runs.json index. Nothing lives only in a terminal scrollback.
Access is identity-based. In the cloud, a user-assigned Managed Identity carries exactly two roles — Storage Blob Data Contributor + Azure AI User; locally it's AzureCliCredential. No keys in env files.
It's reproducible infra. azd up runs infra/main.bicep to stand up Storage, the container, Log Analytics, the Container Apps environment, the identity, and the role assignments in one shot.
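To illustrate the run layout described in the list above, here is a small Python sketch that produces a yymmdd-XXXXXX style run ID and keeps an index sorted newest first. The suffix format (six random hex characters) and the index entry fields are assumptions; the sample may generate and store them differently.

```python
import secrets
from datetime import datetime, timezone


def new_run_id(now=None) -> str:
    """Date prefix plus a random six-character suffix (suffix format is an assumption)."""
    now = now or datetime.now(timezone.utc)
    return f"{now:%y%m%d}-{secrets.token_hex(3)}"


def add_to_index(index: list, run_id: str, created_at: str) -> list:
    """Return a new newest-first index; created_at is an ISO 8601 timestamp string."""
    entries = [e for e in index if e["run_id"] != run_id]
    entries.append({"run_id": run_id, "created_at": created_at})
    return sorted(entries, key=lambda e: e["created_at"], reverse=True)
```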

The payoff: the scores you read came from the same hosted runtime you actually ship to — not a local approximation — and the run that produced them is sitting in Blob, comparable against every run before it.

Run it

Local (no deploy):

conda activate agentdev
cd Skill_eval/azure_skill_eval
pip install -r requirements.txt
cp .env.example .env   # FOUNDRY_PROJECT_ENDPOINT + AZURE_STORAGE_*
uvicorn webapp.app:app --reload --port 8000

Open http://localhost:8000, upload evals/evals.json, pick your models and modes, and click Run.

Cloud (azd):

azd auth login
azd env new skill-eval-dev
azd env set FOUNDRY_PROJECT_ENDPOINT https://<project>.services.ai.azure.com/api/projects/<project>
azd env set MODEL_GPT gpt-5.5
azd env set MODEL_DEEPSEEK DeepSeek-V4-Pro
azd up

Provision the skill once, deploy the four hosted agents, then smoke-test them:

python -m hosted_agent.provision_skills   # upload edu-video-script to Foundry Skills
azd deploy skill-eval-business-agent-gpt
azd deploy skill-eval-business-agent-deepseek
azd deploy skill-eval-attacker-agent
azd deploy skill-eval-judge-agent
azd ai agent invoke skill-eval-business-agent-gpt "Here is a script for an educational short video on the P vs. NP problem."

Read the results

Each run is self-contained on Blob:

summary.json gives you the headline — pass rate and judge averages — and the per-{case}__{model}.json files let you open any single result and see exactly what the skill produced and why it passed or failed. The dashboard streams these straight from Blob via /api/runs/{run_id}/files/{filename}. Because GPT and DeepSeek ran the same cases, the comparison is right there in one folder.
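The headline numbers are simple aggregates over the per-case results. As a sketch of how a pass rate and judge average per model could be computed, assuming each result carries model, overall_pass, and score fields (the actual summary.json schema may differ):

```python
from statistics import mean


def summarize(results: list) -> dict:
    """Group per-case results by model and compute pass rate and mean judge score."""
    by_model: dict = {}
    for r in results:
        by_model.setdefault(r["model"], []).append(r)
    return {
        model: {
            "cases": len(rows),
            "pass_rate": sum(r["overall_pass"] for r in rows) / len(rows),
            "judge_avg": mean(r["score"] for r in rows),
        }
        for model, rows in by_model.items()
    }
```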

Takeaways

A skill you can't evaluate is a skill you can't trust. edu-video-script is treated like code — versioned in Foundry, run, and graded.
Stack your graders cheap-to-expensive. Deterministic template checks first (validator.py), then an LLM judge for quality, then a multi-turn attacker for robustness.
Make the judge return structured JSON. overall_pass / score / checks compare cleanly across models and skill versions.
Compare models on the same skill. Running GPT-5.5 and DeepSeek-V4-Pro side by side turns "which model?" from a guess into a measured answer.
Let the platform carry the harness. Foundry Hosted Agents are the runtime; Azure Container Apps, Blob Storage, Managed Identity, and azd/Bicep make the whole loop reproducible and durable.

Write the skill. Then build the harness that proves it. On Foundry, that second step is mostly configuration — and the result is a skill you can actually trust in production.

Conclusion

Skills moved agent behavior out of code and into versioned Markdown — a huge win for reuse, but only if you can prove a skill still works after every edit. azure_skill_eval answers that for edu-video-script by treating evaluation as a first-class, repeatable step rather than a gut check.

The shape is simple and worth copying for any skill of your own:

Pin down "done" as checkable criteria, then encode a small set of sharp cases (here, 10 edge cases).
Grade in layers, cheap to expensive — deterministic template checks, then a structured LLM-judge rubric, then a multi-turn adversarial pass.
Run the same cases across models (GPT-5.5 vs DeepSeek-V4-Pro) so model choice becomes a measurement, not a guess.
Let the cloud carry it — Foundry Hosted Agents as the runtime, FastAPI on Azure Container Apps for the UI, Blob Storage for durable runs, Managed Identity for access, and azd/Bicep so the whole thing is reproducible.

The result is a feedback loop where every skill change is confirmed, every regression is visible, and every score traces back to the same hosted runtime you ship to. That's the difference between building skills and being able to trust them — and on Foundry, the gap between the two is mostly configuration.

Sample Code : https://github.com/kinfey/Multi-AI-Agents-Cloud-Native/tree/main/code/Skill_eval

Troubleshooting ML Model Loading, GPU Issues, and Memory Pressure in Azure Container Apps

BhaktiRath95 — Fri, 12 Jun 2026 07:00:00 GMT

Introduction

Deploying an AI application to Azure Container Apps is fundamentally different from deploying a web API. When you containerize a Django REST API, the application starts in a few seconds, the memory footprint is predictable, and the CPU usage scales linearly with requests. When you containerize a PyTorch model server, a LangChain agent, or an ONNX inference service, you are dealing with a completely different category of problem.

Large language models, computer vision models, and embedding pipelines can take minutes to load, consume gigabytes of memory before serving a single request, and produce bizarre errors when they encounter resource limits that look nothing like a standard out-of-memory exception. Add to that the challenge of running GPU workloads (or simulating them on CPU) in a containerized environment, and you have a troubleshooting landscape that catches even experienced ML engineers off guard.

This part of the series covers the real-world scenarios you will encounter when running AI workloads on Azure Container Apps, with specific focus on what goes wrong during deployment and startup — and how to fix it methodically.

Scenario 1: The Model Takes 3+ Minutes to Load and the Container Gets Killed Before It Starts

What You See

You deploy your model inference service. In the logs you can see it is downloading or loading the model from disk:

INFO: Loading model from /app/models/model.bin

INFO: Loading tokenizer...

INFO: Loading weights layer 0/48...

INFO: Loading weights layer 12/48...

And then, abruptly, the container disappears and a new one starts. The liveness or startup probe has timed out and Container Apps has killed the container before the model finished loading. You end up in an endless restart loop where the model never fully loads.

Why This Happens

The default probe configuration does not account for long model loading times. The liveness probe begins firing almost immediately after the container starts. If your model takes 3 minutes to load and your liveness probe allows only 30 seconds of failure before killing the container, the model never gets a chance to finish loading. Container Apps is doing exactly what it was configured to do — it just was not configured with your workload in mind.

There is a second, related problem: if your model file is downloaded at container startup (from Azure Blob Storage, Hugging Face Hub, or a mounted file share), the download time is added on top of the load time, making the window even wider.

Step-by-Step Fix

Step 1 — Separate your startup probe from your liveness probe.

A startup probe fires repeatedly until it succeeds, and while it is in progress, the liveness probe is suppressed. This gives your model the time it needs to load without the risk of being killed. Set a generous `failureThreshold` and `periodSeconds`:

probes:
  - type: Startup
    httpGet:
      path: /health/startup
      port: 8080
    initialDelaySeconds: 10
    periodSeconds: 15
    failureThreshold: 40  # 40 * 15s = 600 seconds (10 minutes) for model to load
  - type: Liveness
    httpGet:
      path: /health/live
      port: 8080
    periodSeconds: 30
    failureThreshold: 3
  - type: Readiness
    httpGet:
      path: /health/ready
      port: 8080
    periodSeconds: 10
    failureThreshold: 3
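The arithmetic behind those numbers is worth making explicit when you size the probe for your own model. This Python sketch treats the startup budget as roughly initialDelaySeconds plus periodSeconds times failureThreshold, which is an approximation of probe timing rather than an exact guarantee, and the 1.5 safety factor is an arbitrary assumption. Check the documented limits for these fields in Container Apps before relying on large values.

```python
import math


def startup_budget_seconds(initial_delay: int, period: int, failure_threshold: int) -> int:
    """Approximate time a container has before the startup probe gives up."""
    return initial_delay + period * failure_threshold


def threshold_for(load_seconds: float, period: int, initial_delay: int = 0,
                  safety_factor: float = 1.5) -> int:
    """Smallest failureThreshold that covers the load time with some headroom."""
    needed = load_seconds * safety_factor - initial_delay
    return max(1, math.ceil(needed / period))


print(startup_budget_seconds(10, 15, 40))            # 610
print(threshold_for(180, period=15, initial_delay=10))  # 18
```

For a model that loads in about 3 minutes, a threshold of 18 at a 15-second period already leaves 50% headroom; the 40 used above is more generous still.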

Step 2 — Implement a proper three-tier health endpoint in your model server.

Each probe endpoint should return the appropriate status based on what it knows about the model:

# FastAPI model server with staged health endpoints
import asyncio
import logging
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

logger = logging.getLogger(__name__)

model = None
model_loaded = False


@asynccontextmanager
async def lifespan(app: FastAPI):
    global model
    # Model loads in the background so the HTTP server starts immediately.
    # Keep a reference to the task so it is not garbage-collected mid-load.
    load_task = asyncio.create_task(load_model_async())
    yield
    # Cleanup on shutdown
    load_task.cancel()
    model = None


app = FastAPI(lifespan=lifespan)


async def load_model_async():
    global model, model_loaded
    logger.info("Starting model load...")
    try:
        # Import heavy libraries only when needed
        import torch
        from transformers import AutoModelForCausalLM, AutoTokenizer

        # from_pretrained is blocking; run it in a worker thread so the event
        # loop can keep answering probe requests while the weights load.
        tokenizer = await asyncio.to_thread(
            AutoTokenizer.from_pretrained, "/app/models/my-model"
        )
        model_obj = await asyncio.to_thread(
            AutoModelForCausalLM.from_pretrained,
            "/app/models/my-model",
            torch_dtype=torch.float16,
            device_map="auto",
        )
        model = {"tokenizer": tokenizer, "model": model_obj}
        model_loaded = True
        logger.info("Model loaded successfully")
    except Exception as e:
        logger.error(f"Model loading failed: {e}", exc_info=True)
        raise


@app.get("/health/startup")
def startup_probe():
    # Returns 200 immediately: the container is alive but the model may still be loading
    return {"status": "starting"}


@app.get("/health/live")
def liveness_probe():
    # Returns 200 as long as the process has not entered a broken state
    return {"status": "alive"}


@app.get("/health/ready")
def readiness_probe():
    # Returns 200 ONLY when the model is fully loaded
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not yet loaded")
    return {"status": "ready"}

Step 3 — Pre-download model weights into the container image at build time.

Downloading models at container startup is a significant reliability and performance risk. If the Hugging Face Hub or your storage account is temporarily unreachable, the container cannot start. Instead, bake the model weights directly into the image or use a separate initialization Container App Job to pre-populate a persistent volume:

# Option A: Bake the model into the image (simple, but creates a large image)
FROM python:3.11-slim
WORKDIR /app
# Install CPU-only torch from the PyTorch index, then the remaining packages from PyPI.
# --index-url replaces PyPI, so the other packages need their own install step.
RUN pip install torch --index-url https://download.pytorch.org/whl/cpu
RUN pip install transformers fastapi uvicorn
# Download model at build time
RUN python -c "from transformers import AutoTokenizer, AutoModel; AutoTokenizer.from_pretrained('sentence-transformers/all-MiniLM-L6-v2', cache_dir='/app/models'); AutoModel.from_pretrained('sentence-transformers/all-MiniLM-L6-v2', cache_dir='/app/models')"
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]

# Option B: Pre-populate an Azure Files volume using a Container App Job
az containerapp job create \
  --name model-downloader-job \
  --resource-group my-rg \
  --environment my-aca-env \
  --trigger-type Manual \
  --replica-timeout 600 \
  --image python:3.11-slim \
  --command "bash" \
  --args "-c" "pip install huggingface_hub && huggingface-cli download my-org/my-model --local-dir /mnt/models" \
  --volume-mounts "model-storage:/mnt/models" \
  --volumes "model-storage:azureFile:my-fileshare"
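Whichever option you choose, it can be worth verifying at startup that the weights are actually present before the server tries to load them, so that a missing volume produces a clear log line instead of a confusing load error. The sketch below is a minimal illustration: the `missing_model_files` helper is hypothetical, and `config.json` is used as the marker file on the assumption that the model is stored in the usual Hugging Face directory layout.

```python
from pathlib import Path


def missing_model_files(model_dir: str, required: tuple[str, ...] = ("config.json",)) -> list[str]:
    """Return the required file names that are absent from model_dir."""
    root = Path(model_dir)
    return [name for name in required if not (root / name).is_file()]


missing = missing_model_files("/mnt/models")
if missing:
    print(f"Model volume is incomplete, missing: {missing}")
else:
    print("Model files found")
```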

Scenario 2: The Model Server Runs Out of Memory and Gets OOM-Killed

What You See

Your model server starts successfully in development with 8 GB of RAM. In Azure Container Apps with a 4.0 vCPU / 8.0 Gi configuration (the maximum on the Consumption plan), it crashes intermittently. The container restarts with no error in the application logs. When you check the system logs, you see the container exit code is `137`.

Exit code 137 indicates the process was killed by the kernel OOM killer.

Or in Log Analytics:

ContainerAppSystemLogs_CL
| where ContainerAppName_s == "my-model-server"
| where Log_s contains "OOMKilled" or Log_s contains "137"
| project TimeGenerated, Log_s

Why This Happens

Exit code 137 means the container was sent SIGKILL (`128 + 9 = 137`) because it exceeded its memory limit. Container Apps enforces memory limits strictly. When the process tries to allocate more memory than the container is allowed, the Linux kernel's OOM (Out Of Memory) killer terminates the process.
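The `128 + signal number` convention applies to other exit codes too, so a small decoder can be handy when triaging restarts. This is a minimal sketch using Python's standard `signal` module; the `describe_exit_code` function is a hypothetical helper, and the signal names assume a Linux host.

```python
import signal


def describe_exit_code(code: int) -> str:
    """Translate a container exit code into a short human-readable description."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            name = f"signal {signum}"
        return f"killed by {name}"
    return "exited normally" if code == 0 else f"application error (exit {code})"


for code in (0, 1, 137, 143):
    print(code, "->", describe_exit_code(code))
```

Note that SIGKILL only says the process was killed, not why. An OOM kill is the most common cause in this scenario, but confirm it against the system logs.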

With ML models, memory usage is not constant. A model might use 4 GB at rest but spike to 7 GB during inference when the input batch is large, when attention maps are being computed, or when the model is warming up its KV cache. If your container limit is 8 Gi and the model uses 7 Gi at rest plus 2 Gi during inference, you will hit OOM under load even if the container "usually" has enough memory.
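The check described here is simple arithmetic, but writing it down makes the failure mode obvious. The following sketch uses a hypothetical `memory_headroom_gib` helper and the numbers from the example above.

```python
def memory_headroom_gib(limit_gib: float, resting_gib: float, inference_spike_gib: float) -> float:
    """Return remaining memory at peak; a negative value means an OOM kill is likely."""
    return limit_gib - (resting_gib + inference_spike_gib)


# The example from the text: 8 Gi limit, 7 Gi at rest, 2 Gi extra during inference
headroom = memory_headroom_gib(limit_gib=8, resting_gib=7, inference_spike_gib=2)
print(f"Headroom at peak: {headroom} GiB")
```

A negative result means the container only survives while traffic stays light, which matches the intermittent crashes described in this scenario.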

Step-by-Step Fix

Step 1 — Quantize your model to reduce its memory footprint.

Full-precision (FP32) models use twice the memory of half-precision (FP16) models, and INT8 quantized models use half the memory of FP16. For inference workloads where a small accuracy trade-off is acceptable, quantization is the single most impactful optimization you can make:

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit quantization: reduces a 7B parameter model from ~14GB to ~4GB
# Note: bitsandbytes 4-bit loading has traditionally required a CUDA GPU;
# check that it supports your target hardware before relying on it.
quantization_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.float16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "my-model-path",
    quantization_config=quantization_config,
    device_map="auto",
)
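The size figures quoted for a 7B parameter model follow from multiplying the parameter count by the bytes stored per parameter. The sketch below works that out with a hypothetical `weight_memory_gb` helper. It covers the weights only, so activations, the KV cache, and quantization metadata come on top, which is why a 4-bit model lands closer to 4 GB than the raw 3.5 GB.

```python
def weight_memory_gb(num_params: float, bits_per_param: int) -> float:
    """Approximate memory for model weights only, in decimal gigabytes."""
    return num_params * bits_per_param / 8 / 1e9


for label, bits in [("FP32", 32), ("FP16", 16), ("INT8", 8), ("4-bit", 4)]:
    print(f"{label:>5}: {weight_memory_gb(7e9, bits):.1f} GB")
```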

Step 2 — Implement request batching to flatten memory spikes.

Handling every request the moment it arrives lets concurrent requests stack their activation memory on top of each other, so memory spikes with traffic. Routing requests through a single batching worker with a fixed maximum batch size caps how much inference runs at once and keeps peak memory predictable:

import asyncio

import torch


class BatchedInferenceService:
    def __init__(self, model, tokenizer, max_batch_size=4, timeout=0.05):
        self.model = model
        self.tokenizer = tokenizer
        self.max_batch_size = max_batch_size
        self.timeout = timeout
        self.queue = asyncio.Queue()

    async def infer(self, text: str) -> list:
        # Each caller gets a future that the batch worker resolves later
        future = asyncio.get_running_loop().create_future()
        await self.queue.put((text, future))
        return await future

    async def process_batches(self):
        while True:
            batch = []
            futures = []
            # Wait for the first item
            text, future = await self.queue.get()
            batch.append(text)
            futures.append(future)
            # Try to collect more items within the timeout window
            try:
                while len(batch) < self.max_batch_size:
                    text, future = await asyncio.wait_for(self.queue.get(), timeout=self.timeout)
                    batch.append(text)
                    futures.append(future)
            except asyncio.TimeoutError:
                pass
            # Process the whole batch at once, in a worker thread so the
            # event loop can keep answering health probes during inference
            try:
                results = await asyncio.to_thread(self._run_inference_batch, batch)
                for future, result in zip(futures, results):
                    future.set_result(result)
            except Exception as e:
                for future in futures:
                    future.set_exception(e)

    def _run_inference_batch(self, texts):
        with torch.no_grad():
            inputs = self.tokenizer(texts, return_tensors="pt", padding=True, truncation=True)
            outputs = self.model(**inputs)
        return outputs.logits.tolist()

Step 3 — Set memory limits explicitly in your Container App to match what you expect at peak.

Do not let the container use all available node memory and get killed unexpectedly. Set explicit limits that match your model's peak usage:

# For a model that uses 5.5 GB at peak, allocate 6 Gi.
# On the Consumption plan each vCPU pairs with 2 Gi, so 6 Gi goes with 3 vCPU.
az containerapp update \
  --name my-model-server \
  --resource-group my-rg \
  --cpu 3.0 \
  --memory 6.0Gi

Step 4 — Add memory monitoring to your health endpoint.

Your readiness probe should check available memory and return 503 if memory pressure is too high, which will cause the load balancer to route traffic to other healthy replicas:

import psutil


@app.get("/health/ready")
def readiness_probe():
    if not model_loaded:
        raise HTTPException(status_code=503, detail="Model not yet loaded")

    # Note: in many container runtimes psutil reports host-level memory,
    # so compare these numbers against the container's own limit as well.
    memory = psutil.virtual_memory()
    if memory.percent > 90:
        raise HTTPException(
            status_code=503,
            detail=f"Memory pressure too high: {memory.percent:.1f}% used",
        )

    return {
        "status": "ready",
        "memory_percent": memory.percent,
        "available_gb": memory.available / (1024**3),
    }
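One caveat with this approach: `psutil.virtual_memory()` typically reads system-wide figures, which inside a container can describe the host node rather than the container's own limit. A minimal sketch of an alternative is to read the cgroup accounting files directly. The paths below are the standard Linux cgroup v2 and v1 locations; whether your environment exposes them is an assumption to verify, and `cgroup_memory_percent` is a hypothetical helper name.

```python
from pathlib import Path
from typing import Optional


def cgroup_memory_percent(root: str = "/sys/fs/cgroup") -> Optional[float]:
    """Return container memory usage as a percentage of its cgroup limit, or None if unknown."""
    base = Path(root)
    # (usage file, limit file) pairs for cgroup v2 and then cgroup v1
    candidates = [
        (base / "memory.current", base / "memory.max"),
        (base / "memory" / "memory.usage_in_bytes", base / "memory" / "memory.limit_in_bytes"),
    ]
    for usage_path, limit_path in candidates:
        try:
            usage = usage_path.read_text().strip()
            limit = limit_path.read_text().strip()
        except OSError:
            continue
        if limit == "max":  # cgroup v2 reports "max" when no limit is set
            return None
        return 100.0 * int(usage) / int(limit)
    return None


print(cgroup_memory_percent())
```

When the function returns `None`, falling back to the `psutil` figure is a reasonable default.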

Scenario 3: GPU Workloads Fail to Initialize

What You See

You deploy a container that uses PyTorch with CUDA support. The container starts, but you see errors like:

CUDA error: no kernel image is available for execution on the device

torch.cuda.is_available() returned False

RuntimeError: CUDA out of memory. Tried to allocate 2.00 GiB

Or the model silently falls back to CPU without telling you, and your inference times are 50x slower than expected.

Why This Happens

Azure Container Apps supports GPU workloads through a dedicated GPU consumption plan and specialized GPU-enabled environments. If you deploy a CUDA-enabled container to a standard Container Apps environment (which uses CPU-only nodes), `torch.cuda.is_available()` returns `False` and PyTorch either errors out or silently falls back to CPU depending on how your code handles it.

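Even in a GPU-enabled environment, the image and the host must agree. If the CUDA toolkit in your container image is newer than the host node's driver supports, CUDA fails to initialize. The "no kernel image" error is a related but distinct mismatch: it usually means the PyTorch build in the image was not compiled for the compute capability of the GPU it is running on.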

Step-by-Step Fix

Step 1 — Detect whether GPU is available and log it explicitly at startup.

Never assume — always log the device your model is using:

import logging

import torch

logger = logging.getLogger(__name__)


def initialize_device():
    if torch.cuda.is_available():
        device = torch.device("cuda")
        logger.info(f"Using GPU: {torch.cuda.get_device_name(0)}")
        logger.info(f"CUDA version: {torch.version.cuda}")
        logger.info(f"Available GPU memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.1f} GB")
    else:
        device = torch.device("cpu")
        logger.warning("GPU not available, falling back to CPU. Inference will be slower.")
    return device


device = initialize_device()
model = model.to(device)  # assumes `model` was loaded earlier

Step 2 — Match your CUDA toolkit version to the host driver.

The CUDA toolkit version in your image must be less than or equal to the CUDA driver version on the host. Use the official PyTorch images which bundle compatible CUDA versions:

# Use PyTorch's official image with a specific CUDA version
# Check Azure Container Apps GPU documentation for supported CUDA versions
FROM pytorch/pytorch:2.1.0-cuda11.8-cudnn8-runtime
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
CMD ["uvicorn", "main:app", "--host", "0.0.0.0", "--port", "8080"]
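The "less than or equal" rule can be checked mechanically once you know both versions. The toolkit version is available inside the container as `torch.version.cuda`, and the highest CUDA version the host driver supports appears in the `nvidia-smi` header. The sketch below compares the two as numeric tuples. The helper names and the sample version strings are illustrative assumptions, not values from any specific Azure environment.

```python
def parse_version(version: str) -> tuple[int, ...]:
    """Turn a version string such as '11.8' into a comparable tuple like (11, 8)."""
    return tuple(int(part) for part in version.split("."))


def toolkit_supported(toolkit_version: str, driver_max_cuda: str) -> bool:
    """True if the image's CUDA toolkit is not newer than what the host driver supports."""
    return parse_version(toolkit_version) <= parse_version(driver_max_cuda)


print(toolkit_supported("11.8", "12.2"))
print(toolkit_supported("12.4", "12.2"))
```

Comparing tuples rather than raw strings matters here: as strings, "12.10" would sort before "12.2".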

Step 3 — Create your Container Apps environment with GPU support enabled.

GPU support requires a dedicated GPU workload profile in your Container Apps environment:

# Create an environment with a GPU workload profile
az containerapp env create \
  --name my-gpu-env \
  --resource-group my-rg \
  --location eastus \
  --workload-profile-type "NC24-A100" \
  --workload-profile-name "gpu-profile"

# Deploy your app to the GPU profile
az containerapp create \
  --name my-model-server \
  --resource-group my-rg \
  --environment my-gpu-env \
  --image myregistry.azurecr.io/my-model-server:cuda11.8 \
  --cpu 4.0 \
  --memory 16.0Gi \
  --workload-profile-name "gpu-profile" \
  --min-replicas 1

Step 4 — Handle the CPU fallback gracefully so you know when it happens.

If you want your application to work on both GPU and CPU environments, implement a clean fallback that makes the degraded state visible in monitoring:

import os
import warnings

import torch


class ModelConfig:
    def __init__(self):
        self.force_cpu = os.environ.get("FORCE_CPU", "false").lower() == "true"
        self.device = self._select_device()

    def _select_device(self):
        if self.force_cpu:
            return torch.device("cpu")
        if torch.cuda.is_available():
            return torch.device("cuda")
        # GPU was expected but not found: emit a warning
        warnings.warn(
            "CUDA requested but not available. Running on CPU. "
            "Inference latency will be significantly higher.",
            RuntimeWarning,
        )
        # Optionally: push a custom metric to Azure Monitor here
        return torch.device("cpu")

    @property
    def is_gpu(self):
        return self.device.type == "cuda"

Scenario 4: LangChain / AI Agent Timeouts at Startup

What You See

You deploy a LangChain-based agent or a RAG (Retrieval-Augmented Generation) pipeline to Container Apps. During startup, the application connects to Azure OpenAI, loads an embedding model, and populates an in-memory vector store. But the readiness probe times out while the vector store is being populated from a large document set.

Why This Happens

LangChain applications often do expensive work at startup — embedding thousands of documents, pre-populating vector indexes, or loading conversation history from a database. This work happens synchronously in many LangChain components, blocking the main thread and preventing the HTTP server from responding to health probes.
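The effect is easy to reproduce without LangChain. In the sketch below, which uses only the standard library, a "heartbeat" coroutine stands in for a health probe handler and `slow_startup_work` stands in for blocking initialization. All names are hypothetical. When the blocking call runs directly on the event loop the heartbeat barely runs, and when it is moved to a worker thread with `asyncio.to_thread` the heartbeat keeps ticking.

```python
import asyncio
import time


def slow_startup_work() -> str:
    """Stand-in for blocking work such as embedding documents."""
    time.sleep(0.3)
    return "done"


async def count_heartbeats(stop: asyncio.Event) -> int:
    """Count how often the event loop gets a chance to run, like a probe handler would."""
    beats = 0
    while not stop.is_set():
        beats += 1
        await asyncio.sleep(0.01)
    return beats


async def run(blocking_in_loop: bool) -> int:
    stop = asyncio.Event()
    heartbeat = asyncio.create_task(count_heartbeats(stop))
    await asyncio.sleep(0)  # let the heartbeat task start
    if blocking_in_loop:
        slow_startup_work()  # blocks the loop: no probes can be answered
    else:
        await asyncio.to_thread(slow_startup_work)  # loop stays responsive
    stop.set()
    return await heartbeat


blocked = asyncio.run(run(blocking_in_loop=True))
threaded = asyncio.run(run(blocking_in_loop=False))
print(f"heartbeats while blocking the loop: {blocked}")
print(f"heartbeats with asyncio.to_thread:  {threaded}")
```

The exact counts depend on the machine, but the threaded run should record many more heartbeats than the blocking one.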

Step-by-Step Fix

Step 1 — Move initialization work to a background task that runs after the HTTP server is up

import asyncio
import logging
import os
from contextlib import asynccontextmanager

from fastapi import FastAPI, HTTPException

logger = logging.getLogger(__name__)

vector_store = None
initialization_complete = False
initialization_error = None


@asynccontextmanager
async def lifespan(app: FastAPI):
    # Start the heavy initialization in the background
    task = asyncio.create_task(initialize_vector_store())
    yield  # Server starts and begins serving health probe requests
    # Cleanup
    task.cancel()


app = FastAPI(lifespan=lifespan)


async def initialize_vector_store():
    global vector_store, initialization_complete, initialization_error
    try:
        logger.info("Starting vector store initialization...")
        # Run CPU-bound work in a thread pool so we don't block the event loop
        loop = asyncio.get_running_loop()
        vector_store = await loop.run_in_executor(None, _build_vector_store)
        initialization_complete = True
        logger.info("Vector store initialization complete")
    except Exception as e:
        initialization_error = str(e)
        logger.error(f"Vector store initialization failed: {e}", exc_info=True)


def _build_vector_store():
    from langchain_community.document_loaders import DirectoryLoader
    from langchain_community.vectorstores import FAISS
    from langchain_openai import AzureOpenAIEmbeddings

    loader = DirectoryLoader("/app/documents")
    documents = loader.load()
    embeddings = AzureOpenAIEmbeddings(
        azure_deployment=os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"],
        azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
        api_key=os.environ["AZURE_OPENAI_API_KEY"],
    )
    return FAISS.from_documents(documents, embeddings)


@app.get("/health/ready")
def readiness():
    if initialization_error:
        raise HTTPException(status_code=500, detail=f"Initialization failed: {initialization_error}")
    if not initialization_complete:
        raise HTTPException(status_code=503, detail="Initializing vector store, please wait...")
    return {"status": "ready"}

Step 2 — Use Azure AI Search instead of an in-memory vector store for large document sets.

In-memory vector stores like FAISS are fine for development but become a liability in production Container Apps because the index is lost every time the container restarts, and rebuilding it adds minutes to your startup time. Azure AI Search persists the index and provides near-instant load times:

import os

from langchain_community.vectorstores.azuresearch import AzureSearch
from langchain_openai import AzureOpenAIEmbeddings

embeddings = AzureOpenAIEmbeddings(
    azure_deployment=os.environ["AZURE_OPENAI_EMBEDDING_DEPLOYMENT"],
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
)

# The index already exists in Azure AI Search, so no rebuild is needed on startup
vector_store = AzureSearch(
    azure_search_endpoint=os.environ["AZURE_SEARCH_ENDPOINT"],
    azure_search_key=os.environ["AZURE_SEARCH_KEY"],
    index_name="my-document-index",
    embedding_function=embeddings.embed_query,
)

# This is nearly instantaneous: it just connects to the existing index
initialization_complete = True

Debugging AI Workload Logs in Log Analytics

When something goes wrong with your AI workload, these Log Analytics queries will help you quickly identify the pattern:

// Find all OOM kills in the last 24 hours
ContainerAppSystemLogs_CL
| where TimeGenerated > ago(24h)
| where Log_s contains "OOMKilled" or ExitCode_d == 137
| project TimeGenerated, ContainerAppName_s, Log_s, ExitCode_d
| order by TimeGenerated desc

// Track model loading time across restarts
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "my-model-server"
| where Log_s contains "Model loaded" or Log_s contains "Starting model load"
| project TimeGenerated, Log_s, ContainerName_s
| order by TimeGenerated asc

// Find slow inference requests (if you are logging inference latency)
ContainerAppConsoleLogs_CL
| where ContainerAppName_s == "my-model-server"
| where Log_s contains "inference_latency_ms"
| extend latency = toint(extract("inference_latency_ms=([0-9]+)", 1, Log_s))
| where latency > 5000  // Requests taking more than 5 seconds
| project TimeGenerated, latency, Log_s
| order by latency desc
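The slow-inference query only works if the application writes latency in the `inference_latency_ms=<integer>` form that the `extract()` pattern expects. A minimal sketch of one way to produce that log line is shown below. The `format_latency` and `log_inference_latency` helpers and the `inference` logger name are assumptions for illustration.

```python
import logging
import re
import time
from contextlib import contextmanager

logger = logging.getLogger("inference")


def format_latency(latency_ms: float) -> str:
    """Build the key=value token that the KQL extract() pattern looks for."""
    return f"inference_latency_ms={int(latency_ms)}"


@contextmanager
def log_inference_latency():
    """Time the wrapped block and log its latency in whole milliseconds."""
    start = time.perf_counter()
    try:
        yield
    finally:
        elapsed_ms = (time.perf_counter() - start) * 1000
        logger.info(format_latency(elapsed_ms))


logging.basicConfig(level=logging.INFO)
with log_inference_latency():
    time.sleep(0.05)  # stand-in for a model call

# Same pattern the query uses, to confirm the format round-trips
match = re.search(r"inference_latency_ms=([0-9]+)", format_latency(5123.7))
print(match.group(1))
```

Logging whole milliseconds keeps the value compatible with `toint()` in the query.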

Summary: AI Workload Startup Checklist

When your AI workload fails to start or behaves unexpectedly, work through this checklist:

- Is the model loading time covered by an appropriate startup probe `failureThreshold`?

- Are model weights baked into the image or pre-loaded onto a mounted volume, rather than downloaded at runtime?

- Is the container's memory limit large enough for peak inference load (model size + activation memory)?

- Have you verified whether the workload is running on GPU or CPU and logged that explicitly at startup?

- Does the CUDA version in your image match or precede the driver version on the host?

- For LangChain or RAG workloads, does initialization happen in the background so health probes can respond?

- Are you using a persistent vector store (Azure AI Search, Azure Cosmos DB) instead of rebuilding in-memory indexes on every restart?

What's Next

In the final part of this series, we step back from reactive troubleshooting and look at how to build a proactive observability and automation layer so that you catch these problems before your users do — or better yet, have them resolve themselves automatically.

Part of the series: Troubleshooting Azure Container Apps in Production

Next: Part 4 — Eyes Open, Hands Free: Automating Observability, Alerts, and Self-Healing Diagnostics for Azure Container Apps