ai foundry


Building Agents in Production with Toolbox, Skills, and Tool Search
If you are shipping AI agents beyond a demo, you have felt the pain: every agent needs the same tools, each with its own authentication, and the tool list keeps growing until your prompt is bloated and the model picks the wrong one. On 22 July 2026 at 5:00 PM BST, the Microsoft Foundry community is running a 40-minute Discord round table to talk about exactly this, and to gather your feedback on three capabilities built to fix it: Toolbox, Skills, and Tool Search. This is a discussion, not a slideshow. Bring your real projects, including tool sprawl, duplicated skills, and authentication headaches, and help shape where these features go next. Join us in the Microsoft Foundry Discord community.

Event at a glance

- What: Microsoft Foundry Discord Community Round Table: Building Agents in Production with Foundry Toolbox, Skills, and Tool Search
- When: 22 July 2026, 5:00 PM BST (40 minutes)
- Where: https://aka.ms/foundry/discord
- Event link: https://discord.gg/Z8JZsrP5P5?event=1527379174061379584
- Format: Interactive discussion with voice and chat, live polls, and a short prioritisation exercise
- Who it's for: AI engineers and developers building and scaling agents in production

Opening question we'll start with: "As your agents grow, how do you decide which tools to give them, and how do they pick the right one at runtime?"

The problem: agents don't scale by hard-wiring every tool

When several agents, or a mix of Foundry hosted agents, Microsoft Agent Framework, LangGraph, and Copilot SDK apps, need the same governed set of tools, you do not want to re-wire those tools and their authentication into every one. Two things break as you grow:

- Integration sprawl: the same tool gets wired, authenticated, and versioned separately in every agent.
- Tool overload: sending every tool definition to the model on every turn is slow, expensive, and hurts selection accuracy.
The pattern that scales: package the tools once behind a single versioned, governed MCP endpoint, make them discoverable, and let every runtime consume them from the same URL. That is what Toolbox, Skills, and Tool Search deliver together.

The three concepts we'll discuss

Toolbox: build once, govern centrally, consume anywhere

A Toolbox is a reusable, centrally managed bundle of tools exposed through a single MCP-compatible endpoint. Because it is a managed resource and every agent connects to the same endpoint, you can add, remove, or reconfigure tools without changing agent code. Immutable versioning gives you safe, atomic rollouts: build and test a new version on its pinned URL, then promote it to default, and every consumer picks it up with no redeployment.

Skills: reusable, composable capabilities

A Skill is a reusable, published set of behavioural instructions (a SKILL.md file following the open Agent Skills spec) that is registered once and reused across toolboxes and agents, for example "summarize document" or "create calendar event". In a toolbox, a skill is not a callable tool: it surfaces as an MCP Resource on the same endpoint, so clients discover and read it with plain resources/list and resources/read, with no Foundry SDK required.

Tool Search: runtime discovery instead of hard-wiring

A real toolbox can hold dozens or hundreds of tools. Tool Search keeps that cheap for the model through progressive disclosure: instead of listing every tool, Foundry shows the model just two meta-tools, tool_search and call_tool, plus any pinned tools. The model searches for a capability by intent, Foundry ranks the toolbox's tools by match on name and description, and returns only the hits. The prompt stays small no matter how many tools the toolbox holds.

How they fit together: Skills and tools live in a Toolbox; Tool Search lets agents scale to many tools without prompt bloat or manual wiring.
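To make the progressive-disclosure idea concrete, here is a minimal, self-contained Python sketch. It is not Foundry's ranking algorithm, which the post does not describe: the `Tool` dataclass, the `tool_search` function, the sample catalogue, and the naive word-overlap scoring are all assumptions chosen for illustration. The point is only that the caller sees the matching tools, not the whole catalogue.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Tool:
    # Hypothetical stand-in for a toolbox entry: just a name and a description.
    name: str
    description: str


def tool_search(query: str, catalogue: list[Tool], top_k: int = 2) -> list[Tool]:
    """Rank tools by naive word overlap between the query and name + description."""
    query_words = set(query.lower().split())

    def score(tool: Tool) -> int:
        text = f"{tool.name.replace('_', ' ')} {tool.description}".lower()
        return len(query_words & set(text.split()))

    ranked = sorted(catalogue, key=score, reverse=True)
    # Return only tools that matched at least one word, capped at top_k.
    return [tool for tool in ranked if score(tool) > 0][:top_k]


# Illustrative catalogue; a real toolbox could hold hundreds of entries.
CATALOGUE = [
    Tool("email_summarize", "Summarize the body of an email message"),
    Tool("calendar_schedule", "Schedule a calendar event or follow-up meeting"),
    Tool("web_search", "Search the public web for pages"),
]

hits = tool_search("schedule a follow-up meeting", CATALOGUE)
print([tool.name for tool in hits])
```

A production ranker would more plausibly use embeddings or a search index; the sketch only shows why the prompt size depends on `top_k` rather than on the catalogue size.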
You manage all of it from the Foundry portal or the Foundry Toolkit extension in VS Code.

The scenario we'll walk through

A user asks an agent to "summarize this email and schedule a follow-up." The agent uses Tool Search to find the right tools ("email summarization" and "calendar scheduling") from a shared Toolbox, chains them, and returns a result with no per-agent integration and no hard-coded tool list.

Discussion prompt: "At what point does the number of tools in your agent start to hurt, and would Tool Search help?"

A peek at the code (so you arrive ready)

The full, runnable walkthrough lives in the Mastering Foundry Toolbox notebook in the microsoft-foundry/forgebook repo. The core spine is short. First, build a versioned toolbox from typed tool objects plus an optional list of skills:

    # Build an immutable toolbox version from typed tools + skills
    version = project.toolboxes.create_version(
        name=TOOLBOX_NAME,
        description="Search, code, knowledge, and connection-backed tools.",
        tools=tools,            # e.g. WebSearchToolboxTool(...), AzureAISearchToolboxTool(...)
        skills=skills or None,  # ToolboxSkillReference(name=..., version=...) - SEPARATE from tools
    )
    print(f"Created {TOOLBOX_NAME} version {version.version}")

Then turn the same tools into a search-first toolbox by adding the Tool Search meta-tool and pinning only your one or two hottest tools:

    from azure.ai.projects.models import ToolboxSearchPreviewToolboxTool, ToolConfig

    # Pin the hottest tool so it's always exposed; everything else is search-gated.
    tool_configs = {"web_search": ToolConfig(pin=True)}
    search_version = project.toolboxes.create_version(
        name=TOOLBOX_NAME,
        tools=tools + [ToolboxSearchPreviewToolboxTool(tool_configs=tool_configs)],
        skills=skills or None,
    )

Every consumer talks to the same MCP endpoint: one URL, any framework:

    # The default (promoted) version is served from one stable consumer URL
    def consumer_mcp_url(name):
        return f"{PROJECT_ENDPOINT.rstrip('/')}/toolboxes/{name}/mcp?api-version=v1"

    # Microsoft Agent Framework speaks MCP natively - just point it at the URL.
    # LangGraph (AzureAIProjectToolbox) and the GitHub Copilot SDK consume the same endpoint.

How Toolbox simplifies the auth and identity flow

This is one of the most important things to understand before you scale agents, and it is a great topic to bring questions on. A toolbox tool reaches a downstream system through a project connection, and the connection's authentication type decides whose identity is used. Get this right once and every consumer inherits correct, least-privilege access automatically, without writing OAuth or token-exchange plumbing in your agent code.

Running a toolbox behind a hosted agent puts two identities in play, and the platform wires them together for you:

- Agent -> Toolbox (the trust boundary). The hosted agent authenticates to the toolbox MCP endpoint with its own agent identity, which holds the Foundry user role on the project. If the agent doesn't have access, the toolbox rejects the agent. This gates access to the toolbox itself, independent of any single tool.
- Toolbox -> Tool (the end-user passthrough). For oauth2 authentication, the agent forwards the caller's end-user Entra token, and the toolbox uses that token (on-behalf-of) to reach the downstream tool. The tool then acts on behalf of the real end user, providing per-user, least-privilege access with correct downstream audit.
For non-passthrough authentication types (none, custom-keys, project-managed-identity, and agentic-identity), the toolbox authenticates using the connection's configured identity, and the agent never sees the secret.

That is the "better-together" story: a stable, governable managed identity to the toolbox, plus true end-user identity on the downstream data call.

What we'll cover in the 40 minutes

- Welcome & framing (0:00-0:08): what Toolbox, Skills, and Tool Search are, and how they fit together.
- Scenario walkthrough (0:08-0:13): the "summarize email, schedule follow-up" flow, end to end.
- Use cases & opportunities (0:13-0:22): which capabilities you would package as reusable Skills, how many tools your agents carry, and where Tool Search would help.
- Trust, security & governance (0:22-0:31): what you are comfortable exposing through a shared endpoint, how to scope which tools an agent may discover, authentication models, and the observability you need.
- Developer experience feedback (0:31-0:36): your biggest adoption blockers, missing docs, and the SDK samples and end-to-end demos you would prioritise.
- Prioritisation & next steps (0:36-0:40): vote live on the top use cases, challenges, and feature requests.

Come prepared to talk about

- What tools and skills your agents use today, and how they're wired up.
- Which capabilities you'd turn into reusable, composable Skills shared across agents.
- How many tools your agents carry, and whether you hit prompt-size or tool-selection accuracy issues.
- Which tools you'd expose through a shared endpoint, and which need tighter scoping.
- How you want to control what an agent is allowed to discover and invoke with Tool Search.
- The examples, samples, and tutorials that would help you get started fastest.

Responsible and secure by design

Because these features let agents discover and invoke tools dynamically, governance is a first-class part of the conversation.
Foundry toolboxes are governed by default: you can screen every tool's inputs and outputs with an RAI guardrail, front your MCP servers with a bring-your-own AI gateway (APIM), scope which tools are discoverable, and use least-privilege identity passthrough so downstream calls carry the real user's permissions and audit trail. Bring your enterprise safeguard requirements because they directly shape the roadmap.

Note: Toolbox, Tool Search, and Skills are in preview; APIs and headers may change.

Key takeaways

- Toolbox packages tools once behind a single governed, versioned MCP endpoint, so you can build once and consume from any framework.
- Skills are reusable, composable capabilities you register once and chain across agents.
- Tool Search uses two meta-tools and progressive disclosure so agents scale to hundreds of tools without prompt bloat.
- Auth is simplified: the agent's managed identity gates the toolbox, while end-user token passthrough gives correct, least-privilege downstream access with no OAuth plumbing in your code.
- Your feedback shapes the product because this round table feeds directly into the engineering and product teams.

Save your spot

- Add it to your calendar: 22 July 2026, 5:00 PM BST.
- Join the community: https://aka.ms/foundry/discord
- Prep with the sample: run the Mastering Foundry Toolbox notebook to build, search, and consume a toolbox end to end.
- Read the docs: Toolbox, Tool Search, and Skills.

Agents get more capable as they gain access to more tools and skills, but only if you can build, govern, and scale those capabilities without drowning in integration and prompt bloat. Come and share how you are doing it today, and help shape how Foundry does it next. See you on 22 July.
Lee_Stott
Jul 16, 2026, Microsoft Developer Community Blog
Building AI Agents from Zero to Production
Most agent demos stop at "it answered my question." Production doesn't. The gap between a notebook that calls an LLM and a governed, observable, multi-agent system your organisation can actually depend on is where real engineering happens: evaluation, deployment, data sovereignty, tool governance, and cross-team interoperability. Microsoft's open-source course Building AI Agents from Zero to Production walks that entire arc in seven lessons, using one realistic use case and the Microsoft Agent Framework (MAF) plus Microsoft Foundry. This post is a developer-focused tour of what it teaches, the architecture decisions behind each stage, and the code patterns that matter when you move from prototype to production.

Who this is for

- AI engineers building their first, or first production, agent system.
- Backend and full-stack developers integrating agents into real applications and CI/CD.
- Cloud architects who need data sovereignty, private networking, and governance around agent workloads.
- Technical leads deciding how to standardise tools and orchestration across multiple teams.

The samples are Python 3.12+, served through Microsoft Foundry using GPT-5 series models (for example gpt-5.1). Lesson 4 adds a TypeScript/React frontend. You will want an Azure subscription and the Azure CLI.

The AI Agent Development Lifecycle

The course is organised around a lifecycle rather than a feature list. Each lesson is a stage, and each stage assumes the previous one is solved:

| # | Stage | The production question it answers |
|---|-------|------------------------------------|
| 1 | Agent Design | What should each agent do, and how do they hand off? |
| 2 | Agent Development | How do I build and run them with the Agent Framework? |
| 3 | Agent Evaluations | How do I know they actually work, and keep working? |
| 4 | Agent Deployment | How do I ship one as a hosted service with a UI and CI gate? |
| 5 | Production Hosted Agents | How do I meet enterprise data, network, and governance needs? |
| 6 | Microsoft Toolbox | How do I govern tools once, and reuse them across teams? |
| 7 | Multi-Agent & A2A | How do agents from different teams interoperate safely? |

The thread running through all seven is a single scenario: a Developer Onboarding agent system that helps a new hire find the right teammates, get a sensible first task, and pull learning resources and code snippets. It is deliberately mundane, which is exactly why it exposes the production concerns that flashy demos hide.

Lesson 1 - Agent Design: three components, one graph

The course defines an agent by three parts: an LLM for reasoning, tools to act, and memory to retain context. The design work is context engineering: making sure the right information reaches the model at the right moment, no more and no less. Rather than one monolithic assistant, the onboarding system is split into specialists coordinated by a triage agent using handoff orchestration:

| Agent | Job | Tool |
|-------|-----|------|
| Employee Search | Answer org and people questions | Foundry file search over an employee-directory vector store |
| Task Recommendation | Suggest 1–3 GitHub issues for the new dev | GitHub MCP Server (reads recent commits + open issues) |
| Code Assistant | Provide resources and runnable snippets | Microsoft Learn MCP + Code Interpreter |

Architecturally this is a directed graph: User → Triage → [Employee, Learning, Coding]. Splitting responsibilities early pays off later: each agent gets a tightly scoped prompt (less hallucination), can be evaluated independently, and can be upgraded without touching its peers.

Lesson 2 - Development: standalone agents with MAF

Here the design becomes code. Each specialist is a small, independently runnable service built with the Microsoft Agent Framework, authenticated to Foundry with your Azure CLI login. Setup is deliberately boring:

    az login
    az account set --subscription "<your-subscription-id>"
    cp .env.example .env
    # Fill FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MODEL (e.g. gpt-5.1)

    # Create the employee-directory vector store once; note the printed VECTOR_STORE_ID
    python lesson-2-agent-development/setup_vector_store.py

    # Start an agent - serves on http://localhost:8090
    python lesson-2-agent-development/employee-search-agent.py

The FoundryChatClient auto-reads any FOUNDRY_-prefixed environment variables and uses AzureCliCredential, so there are no keys in code. The lesson ships six samples, each on its own port, so you can chat with them individually in the local DevUI before wiring them together:

| Sample | Tool | Port |
|--------|------|------|
| employee-search-agent.py | Foundry file search / vector store | 8090 |
| task-recommendation-agent.py | GitHub MCP Server | 8095 |
| azure-learning-agent.py | Microsoft Learn MCP | 8092 |
| coding-agent.py | Code Interpreter | 8093 |
| learning-recommendation-agent.py | Learn MCP + reasoning | 8091 |
| agent-orchestration.py | Multi-agent handoff | 8094 |

Why this matters: keeping each agent as its own process with its own port is a testability decision, not an accident. You can smoke-test one specialist in isolation, then compose them in agent-orchestration.py.

Lesson 3 - Evaluation: you can't unit-test a probability distribution

This is the lesson that separates a demo from a product. Agents are non-deterministic, so traditional assertions don't fit. The course uses three complementary layers:

- Observability / tracing: always on, via OpenTelemetry to Application Insights.
- Smoke tests: fast, run on every deploy.
- Evaluations: deeper, model-based scoring run on-demand or nightly.

Turning on tracing is a single call:

    from agent_framework.foundry import FoundryChatClient

    client = FoundryChatClient()
    client.configure_azure_monitor()  # export traces + metrics to Application Insights

For quality it uses Foundry's built-in "LLM-as-a-judge" evaluators against real persisted responses (identified by response_id), not freshly regenerated ones:

| Evaluator | evaluator_name | Measures |
|-----------|----------------|----------|
| Relevance | builtin.relevance | Does the response address the request? |
| Groundedness | builtin.groundedness | Is it supported by retrieved data (no hallucination)? |
| Tool-call accuracy | builtin.tool_call_accuracy | Were the right tools called with the right arguments? |
| Tool-output utilization | builtin.tool_output_utilization | Did the agent actually use tool results? |

The judge model is set independently via AZURE_AI_MODEL_DEPLOYMENT_NAME, so you can evaluate a cheap production model with a stronger one. The run prints a report_url that deep-links into the Foundry portal.

Lesson 4 - Deployment: a hosted agent, a UI, and a CI gate

Now the agent becomes a managed service. It is deployed as a Foundry Hosted Agent, a Microsoft-managed execution environment, and fronted by an OpenAI ChatKit React UI talking to a FastAPI backend:

    ChatKit React (3000) → FastAPI backend (8001) → Foundry Hosted Agent → tools

Building the agent is declarative: attach tools, name it, serve it:

    agent = client.as_agent(
        name="DevOnboardingAgent",
        instructions="...",
        tools=[file_search_tool, learn_mcp_tool],
    )
    # served with: from_agent_framework(agent).run()

The recommended deploy path is the Azure Developer CLI:

    cd hosted-agent
    azd auth login
    azd agent deploy

The genuinely production-minded part is the smoke test as a post-deploy CI gate. Six cases cover reachability, each scenario, off-topic prompt adherence, and multi-turn threading (verifying state via previous_response_id). The GitHub Action runs them against the freshly deployed agent:

    export FOUNDRY_TOKEN=$(az account get-access-token \
        --resource https://ai.azure.com/ --query accessToken -o tsv)
    python runner.py \
        --project-endpoint "https://<account>.services.ai.azure.com/api/projects/<project>" \
        --agent-name dev-onboarding \
        --tests-file tests/smoke-tests.json

Pitfall to remember: the token audience must be https://ai.azure.com/. A cognitiveservices.azure.com token is rejected by the Responses API, a mistake that costs many engineers an afternoon.
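A quick way to catch the audience pitfall before the smoke tests run is to inspect the token's `aud` claim. The sketch below is a diagnostic aid only, under stated assumptions: it decodes the JWT payload without verifying the signature, the helper names (`jwt_audience`, `has_expected_audience`, `fake_token`) are hypothetical, and ignoring a trailing slash when comparing audiences is my assumption rather than documented behaviour. The fake tokens exist only so the example runs without Azure.

```python
import base64
import json

EXPECTED_AUDIENCE = "https://ai.azure.com/"  # audience named in the pitfall above


def jwt_audience(token: str) -> str:
    """Return the `aud` claim of a JWT without verifying its signature (diagnostics only)."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64url padding
    payload = json.loads(base64.urlsafe_b64decode(payload_b64))
    return payload["aud"]


def has_expected_audience(token: str) -> bool:
    # Assumption: treat audiences that differ only by a trailing slash as equal.
    return jwt_audience(token).rstrip("/") == EXPECTED_AUDIENCE.rstrip("/")


def fake_token(aud: str) -> str:
    """Build an unsigned, demo-only token so the sketch runs offline."""
    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode()
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode()

    return f"{encode({'alg': 'none'})}.{encode({'aud': aud})}."


good = has_expected_audience(fake_token("https://ai.azure.com/"))
bad = has_expected_audience(fake_token("https://cognitiveservices.azure.com"))
print(good, bad)
```

In a CI job you would pass the real `FOUNDRY_TOKEN` value to `has_expected_audience` and fail fast if it returns `False`, instead of waiting for the API to reject the call.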
Lesson 5 - Production: separating where an agent runs from where its data lives

The pivotal concept for enterprise readiness is the distinction between a Hosted Agent (compute, scaling, identity) and a Capability Host (where conversation history, files, and embeddings actually reside):

| Concern | Hosted Agent | Capability Host |
|---------|--------------|-----------------|
| Compute / scaling / identity | ✅ Provided | - |
| Conversation history | Microsoft-managed default | Redirect to your Azure Cosmos DB |
| File uploads | Microsoft-managed default | Redirect to your Azure Storage |
| Vector embeddings | Microsoft-managed default | Redirect to your Azure AI Search |
| Required to run the agent? | ✅ Yes | ❌ Optional |
| Required for data sovereignty? | ❌ Not sufficient | ✅ Yes |

"Basic" setup uses Microsoft-managed storage and is perfect for getting started. "Standard" setup redirects each data plane to your own Azure resources through a project-level capability host. This is how you keep customer data in your tenant, inside your network boundary:

    PUT .../accounts/{account}/projects/{project}/capabilityHosts/{name}?api-version=2025-06-01
    {
      "properties": {
        "capabilityHostKind": "Agents",
        "threadStorageConnections": ["my-cosmosdb-connection"],
        "vectorStoreConnections": ["my-ai-search-connection"],
        "storageConnections": ["my-storage-connection"]
      }
    }

Operational constraints worth internalising before you provision: there is one capability host per scope (a second attempt returns 409 Conflict), configuration is immutable (delete and recreate to change it), deletion is destructive, and the account-level host must exist before the project-level one.

Lesson 6 - Toolbox: govern tools once, reuse everywhere

Left unchecked, every team re-implements the same tools, scatters credentials, and loses governance visibility. The Microsoft Foundry Toolbox solves this by exposing a curated, versioned set of tools behind a single MCP-compatible endpoint, with credentials held in Foundry connections rather than agent code.
You build a toolbox version once:

    from azure.ai.projects.models import MCPTool, ToolboxSearchPreviewTool, WebSearchTool

    toolbox_version = project.toolboxes.create_toolbox_version(
        name="agent-tools",
        description="Web search + an MCP server + tool search",
        tools=[
            WebSearchTool(),
            MCPTool(
                server_label="myserver",
                server_url="https://your-mcp-server.example.com",
                require_approval="never",
                project_connection_id="my-key-auth-connection",  # credentials live in Foundry
            ),
            ToolboxSearchPreviewTool(),
        ],
    )

And every agent consumes it through one endpoint, with no per-team tool code:

    from agent_framework import MCPStreamableHTTPTool

    mcp_tool = MCPStreamableHTTPTool(
        name="toolbox",
        url=TOOLBOX_ENDPOINT,  # {project_endpoint}/toolboxes/{name}/mcp?api-version=v1
        http_client=http_client,
        load_prompts=False,
    )
    agent = chat_client.as_agent(name="my-toolbox-agent", instructions="...", tools=[mcp_tool])

Versioning is blue/green: create a new version, test it on its version-specific endpoint, then promote it to default and every consumer picks it up with zero code changes. A Guardrail (RAI) policy can be applied at the toolbox layer, independent of model-level content filters. Note that the toolbox management APIs are currently in preview; the portal or the VS Code Foundry Toolkit are practical alternatives for creation today.

Lesson 7 - Multi-Agent & A2A: agents as networked peers

The final lesson contrasts two ways agents collaborate:

- Handoff / Workflow: in-process, same codebase, fastest, tightest coupling.
- Agent-to-Agent (A2A): cross-process over an open protocol, so agents from different teams, orgs, or frameworks interoperate.

A2A gives each agent a discoverable Agent Card at /.well-known/agent-card.json and a task lifecycle (submitted → working → completed/failed). The elegant part: A2AExecutor wraps an existing MAF agent with no changes to that agent's code.
    from agent_framework.a2a import A2AExecutor
    from a2a.server.apps import A2AStarletteApplication
    from a2a.server.request_handlers import DefaultRequestHandler  # import added; assumed a2a-python path
    from a2a.server.tasks import InMemoryTaskStore
    from a2a.types import AgentCapabilities, AgentCard, AgentSkill  # import added; assumed a2a-python path

    # Abbreviated card: the real AgentCard and AgentSkill models may require more fields.
    agent_card = AgentCard(
        name="Coding Assistant",
        url="http://localhost:9000/",
        version="1.0.0",
        capabilities=AgentCapabilities(streaming=True),
        skills=[AgentSkill(id="generate-code", name="Generate code", tags=["code"])],
    )
    request_handler = DefaultRequestHandler(
        agent_executor=A2AExecutor(agent),  # wraps your existing MAF agent unchanged
        task_store=InMemoryTaskStore(),
    )
    app = A2AStarletteApplication(agent_card=agent_card, http_handler=request_handler).build()

Consuming a remote agent then looks exactly like calling a local one:

    from agent_framework.a2a import A2AAgent

    remote_agent = A2AAgent(name="remote-coding-assistant", url="http://localhost:9000")
    result = await remote_agent.run("Write a Python function that reverses a string.")

Because an A2AAgent can be a participant inside a HandoffBuilder workflow, you can mix in-process routing with remote services in the same orchestration. For enterprise use, A2AAgent accepts an auth_interceptor for bearer tokens, and the Agent Card carries security_schemes.

Responsible and secure by design

Production readiness in this course is not just uptime; it is governance:

- Identity over keys: AzureCliCredential and managed identity throughout; no secrets in code.
- Least privilege: CI runners get a scoped Azure AI User role assignment on the specific project.
- Data sovereignty: capability hosts keep conversation history, files, and embeddings in your own Cosmos DB, Storage, and AI Search.
- Tool approval and guardrails: MCP approval_mode and toolbox-level RAI policy gate what agents can do.
- Grounded evaluation: groundedness and tool-utilization scoring catch hallucination and unused-tool behaviour before users do.
- Cost hygiene: the lessons create real Azure resources; delete the resource group when done: az group delete --name <rg> --yes --no-wait
Key takeaways

- Design as a graph of specialists. Handoff orchestration with tightly scoped agents beats one monolith on reliability and testability.
- One .run() contract, many backends. The Agent Framework keeps orchestration code stable from local dev to hosted production.
- Evaluate continuously. Tracing + smoke tests + model-based evaluators are three layers, not alternatives.
- Separate compute from data. Hosted Agents run the agent; Capability Hosts give you sovereignty. You need both for enterprise.
- Govern tools centrally. A versioned toolbox behind one MCP endpoint kills tool sprawl and credential duplication.
- Open protocols for interop. A2A lets agents cross team, org, and framework boundaries without rewrites.

Get started

Clone the repo (skip the 50+ translations for a faster download) and work through the lessons in order:

    git clone --filter=blob:none --sparse https://github.com/microsoft/Building-AI-Agents-From-Zero-To-Production.git
    cd Building-AI-Agents-From-Zero-To-Production
    git sparse-checkout set --no-cone '/*' '!translations' '!translated_images'

References

- Building AI Agents from Zero to Production (course repo)
- Microsoft Agent Framework
- Microsoft Foundry documentation
- Agent-to-Agent (A2A) protocol specification
- a2a-python SDK
- AI Agents for Beginners
- MCP for Beginners
- Microsoft Foundry Discord
Lee_Stott
Jul 14, 2026, Microsoft Developer Community Blog
Agents League: The Esports-Inspired Hackathon Where AI Agents Battle for Glory
Ready to put your AI skills to the ultimate test? Agents League is here: a dynamic, esports-inspired developer challenge that brings the thrill of live competition to the world of agentic AI. Whether you're a seasoned AI developer or just getting started, this is your chance to build, compete, and win.

What is Agents League?

Agents League is a week-long hackathon running as part of AI Skills Fest (June 4–14, 2026). Unlike traditional hackathons, Agents League combines live AI coding battles, asynchronous project submissions, and a thriving Discord community, all competing for a total prize pool of $55,000 USD. This isn't just about building; it's about showcasing what's possible with agentic AI in a format that's fast, competitive, and globally accessible.

Three Challenge Tracks: Pick One or Compete in All

1. Creative Apps. Build innovative applications using GitHub Copilot for AI-assisted development. Show off your creativity and demonstrate how AI can accelerate app creation from concept to code.
2. Reasoning Agents. Create intelligent agents using Microsoft Foundry that solve complex problems through multi-step reasoning. This track is all about building agents that can think, plan, and execute.
3. Enterprise Agents. Build business-ready knowledge agents integrated with Microsoft 365 Copilot, authored in Copilot Studio. Perfect for developers focused on real-world enterprise solutions.

Live Microsoft Reactor Events: Don't Miss the Battles!

The heart of Agents League beats through live Microsoft Reactor events. Watch experts go head-to-head in live coding battles, learn cutting-edge techniques, and get inspired for your own submissions:

| Event | What You'll Learn |
|-------|-------------------|
| Creative Apps Battle | See GitHub Copilot in action as experts build innovative apps live |
| Reasoning Agents Battle | Watch multi-step reasoning agents come to life with Microsoft Foundry |
| Enterprise Agents Battle | Learn to build M365-integrated agents with Copilot Studio |

👉 View the full event series

Key Dates

- Registration Deadline: June 12, 2026, 12:00 PM PT
- Hacking Period: June 4–14, 2026
- Submission Deadline: June 14, 2026, 11:59 PM PT

What You Get

- Live coding battles with expert demonstrations
- Curated technical experiences and on-demand content
- Learning resources on Microsoft Learn and AI Skills Navigator
- Community support through Discord
- GitHub-based submissions for transparent, collaborative judging

Why Participate?

Agents League isn't just another hackathon. It's designed as a streamlined, competitive format that:

✅ Fits into your schedule with focused, time-boxed challenges
✅ Provides real-world product innovation experience
✅ Offers global accessibility: participate from anywhere
✅ Demonstrates the latest capabilities of agentic AI, including new IQ tools
✅ Connects you with a passionate developer community

Ready to Enter the Arena?

Register Now for Agents League

Before you register:

- Review the Hackathon Rules and Regulations for prize categories and judging criteria
- Join the Microsoft Reactor event series for live battles and learning
- Check out the Microsoft Event Code of Conduct

Join the Conversation

Have questions? Want to connect with fellow competitors? Join the Agents League community on Discord and start strategizing with developers from around the world. Whether you're building creative apps, reasoning agents, or enterprise solutions, the arena awaits. May the best agent win! 🏆

Agents League hackathon is open to the public and offered at no cost. Government employees should check with their employers to ensure participation is permitted in accordance with applicable policies.

Related Links:

- Agents League Hackathon Registration
- Microsoft Reactor Series
- AI Skills Fest
Lee_Stott
Jul 12, 2026, Microsoft Developer Community Blog
Enterprise-ready Claude Desktop with Entra ID, APIM, and Microsoft Foundry (No Backend Required)
How I put corporate sign-in in front of Claude Desktop without writing a single line of backend code.

TL;DR: In this post, I show how to securely enable Claude Desktop in enterprise environments using Microsoft Entra ID, Azure API Management, and Microsoft Foundry, without deploying a custom backend. This approach removes API keys from endpoints, enforces per-user identity, and aligns fully with Zero Trust principles.

Who this is for:

- Enterprise architects evaluating secure AI client patterns
- Developers enabling Claude Desktop in regulated environments
- Platform teams standardizing identity and governance for LLM access

Why this post exists: Microsoft Learn's Configure Claude Desktop with Foundry Models only shows the API-key path, where a shared key is pasted into every user's Claude Desktop config. That's fine for a quick demo, but it's a non-starter for most enterprises (no per-user identity, no MFA / Conditional Access, hard to revoke, hard to audit). This post fills that gap: same Foundry backend, but with Microsoft Entra ID SSO in front via Azure API Management, so each user signs in with their corporate identity and zero secrets land on the laptop.

The problem

For many teams experimenting with Claude Desktop, the blocker isn't capability; it's enterprise readiness. How do you enforce identity, eliminate shared secrets, and apply governance without standing up a custom backend service to sit in front of the model?

Suppose your team wants to use Claude Desktop with its own Anthropic deployment running on Microsoft Foundry, subject to a few non-negotiable requirements:

- No shared API keys floating around on developer laptops.
- Per-user identity: every request must be attributable to a real person.
- MFA and Conditional Access must apply, the same way they do for every other internal app.
- Central rate-limiting and logging: a centralized control plane for governance.
Claude Desktop 1.5+ supports a "Gateway SSO" mode where it can sign each user in with OpenID Connect and forward their token to a custom LLM gateway. Azure API Management (APIM) is a perfect fit for that gateway role: it validates the user's Entra ID token, then re-authenticates itself to Foundry behind the scenes. APIM acts as a centralized policy enforcement layer, enabling identity validation, traffic governance, and secure re-authentication to backend AI services without custom code.

The end-to-end flow looks like this:

    %%{init: {'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'useMaxWidth': true}, 'themeVariables': {'fontSize':'16px'}} }%%
    flowchart TB
        User([Corporate user])
        Claude["Claude Desktop"]
        Entra["Microsoft Entra ID<br/>(OIDC + MFA + Conditional Access)"]
        APIM["Azure API Management<br/>validate-jwt → rewrite headers<br/>(policy gateway)"]
        Foundry["Microsoft Foundry<br/>Claude deployment"]

        User -- "1. Sign in (browser PKCE)" --> Entra
        Entra -- "2. ID token" --> Claude
        Claude -- "3. POST /v1/messages<br/>Authorization: Bearer ID token" --> APIM
        APIM -- "4. OIDC discovery / JWKS" --> Entra
        APIM -- "5. x-api-key (or Managed Identity)" --> Foundry
        Foundry -- "6. Response" --> APIM
        APIM -- "7. Response" --> Claude

        classDef azure fill:#0a4d8c,stroke:#0a3a6b,color:#ffffff;
        classDef client fill:#f3f3f3,stroke:#888,color:#222;
        class Entra,APIM,Foundry azure;
        class Claude,User client;

Or in plain text:

    Claude Desktop
        │  Authorization: Bearer <Entra ID token from the user's browser sign-in>
        ▼
    Azure API Management (<your-apim>)
        │  ① validate-jwt → verifies user's Entra ID token
        │  ② re-auths to Foundry with an API key from a Named value
        │     Authorization stripped, x-api-key injected
        ▼
    Microsoft Foundry /anthropic/v1/messages
        │  runs Claude (<your-deployment>)
        ▼
    Response back to the user

There are no API keys on user devices. Foundry's key lives only inside APIM. And every request carries the user's oid claim, so I can build dashboards and per-user quotas later.
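The header rewrite in step ② of the flow above is done by APIM policy, not by application code, but the transformation itself is easy to picture as a small Python sketch. Everything here is illustrative: `rewrite_headers` is a hypothetical helper, the claims are assumed to come from a token that has already been validated, and the placeholder values stand in for the real key held in the APIM Named value.

```python
def rewrite_headers(
    incoming: dict[str, str],
    claims: dict[str, str],
    foundry_key: str,
) -> tuple[dict[str, str], str]:
    """Mimic the gateway's rewrite: drop the user's bearer token, inject the backend key.

    `claims` are assumed to come from an already-validated Entra ID token.
    Returns the outbound headers plus the caller's `oid` for per-user logging.
    """
    # Strip Authorization (case-insensitively) so the user's token never reaches the backend.
    outbound = {k: v for k, v in incoming.items() if k.lower() != "authorization"}
    # Inject the backend credential that only the gateway holds.
    outbound["x-api-key"] = foundry_key
    return outbound, claims["oid"]


headers, user = rewrite_headers(
    {"Authorization": "Bearer <user-token>", "Content-Type": "application/json"},
    {"oid": "00000000-0000-0000-0000-000000000001"},
    foundry_key="<key-from-named-value>",
)
print(headers)
print(user)
```

The design point is that the two credentials never coexist on one hop: the user's token stops at the gateway and the Foundry key starts there, while the `oid` survives as the handle for attribution.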
What you need before starting An Azure subscription with a Microsoft Foundry (AI Services) account and a Claude deployment. (Throughout this post I'll just call it Foundry.) An API Management instance, any tier. Permission to register applications in Entra ID for your tenant. Claude Desktop 1.5.0 or later. Azure CLI installed locally. Throughout this post I'll use placeholders for resource names: <apim-name> — your API Management service name <resource-group> — the resource group that holds it <foundry-account> — your Foundry account name <deployment-name> — the name of the Claude model deployment on Foundry Step 1 — Register an Entra ID app for Claude Desktop This is the OIDC client Claude Desktop signs users into. Claude Desktop requires a single-tenant, public PKCE client (no client secret) with a loopback redirect URI, configured under the Mobile and desktop applications platform in Entra ID — the only platform that allows any loopback port. I scripted it so the setup is one command and idempotent: # scripts/register-claude-entra-app.ps1 [CmdletBinding()] param( [string] $TenantId = '<your-tenant-id>', [string] $SubscriptionId = '<your-subscription-id>', [string] $ResourceGroup = '<resource-group>', [string] $ApimName = '<apim-name>', [string] $AppDisplayName = 'Claude Cowork gateway', [string] $RedirectUri = 'http://127.0.0.1/callback' ) az account set --subscription $SubscriptionId | Out-Null # 1. Create (or reuse) the app registration $appId = az ad app list --display-name $AppDisplayName --query "[0].appId" -o tsv if (-not $appId) { $appId = az ad app create --display-name $AppDisplayName ` --sign-in-audience AzureADMyOrg --query appId -o tsv } # 2. 
Configure as public PKCE client with the Mobile/Desktop redirect URI $objectId = az ad app show --id $appId --query id -o tsv $patch = @{ publicClient = @{ redirectUris = @($RedirectUri) } isFallbackPublicClient = $true } | ConvertTo-Json -Depth 5 -Compress az rest --method PATCH ` --uri "https://graph.microsoft.com/v1.0/applications/$objectId" ` --headers "Content-Type=application/json" --body $patch | Out-Null # 3. Ensure a service principal exists $sp = az ad sp list --filter "appId eq '$appId'" --query "[0].id" -o tsv if (-not $sp) { az ad sp create --id $appId | Out-Null } # 4. Push two Named values into APIM for the validate-jwt policy az apim nv create -g $ResourceGroup --service-name $ApimName ` --named-value-id entra-tenant-id --display-name entra-tenant-id ` --value $TenantId --secret false az apim nv create -g $ResourceGroup --service-name $ApimName ` --named-value-id entra-client-id --display-name entra-client-id ` --value $appId --secret false "Client ID: $appId" Run it once. The output prints the client ID you'll need in Claude Desktop later, and it leaves two Named values in APIM ( entra-tenant-id , entra-client-id ) that the gateway policy will reference. ⚠️ Common pitfall: if the redirect URI ends up under the Web platform instead of Mobile and desktop applications, Entra will demand a client secret on token exchange — Claude won't send one and you'll get Token exchange failed (HTTP 401) . The app type can't be changed after creation, so create a new app if that happens. 
Step 2 — Create the API in APIM In the portal under APIM → APIs → + Add API → HTTP: Field Value Display name Anthropic API Name anthropicapi Web service URL https://<foundry-account>.services.ai.azure.com/anthropic API URL suffix claude Subscription required Off (Entra ID is our only credential) Add two operations under it: Method URL Display name POST /v1/messages Create message GET /v1/models List models The /v1/models operation isn't strictly needed (Foundry's Anthropic surface doesn't implement it), but having it registered means you can decide later whether to stub it out or proxy it. Step 3 — Add an API key for Foundry as a Named value APIM → Named values → + Add: Name: foundry-key Type: Secret Value: paste a key from the Foundry account's Keys and Endpoint blade. This is the only place the key ever lives. Clients never see it. Alternative — keyless with Entra ID (managed identity): If you prefer not to manage a Foundry key at all, enable the APIM instance's system-assigned managed identity (APIM → Identity → System assigned → On), then grant that identity the Foundry User role on the Foundry account (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d — previously named Azure AI User; Microsoft renamed it but the ID and permissions are unchanged). In Step 4, replace the set-header that injects x-api-key with: <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="foundry-token" /> <set-header name="Authorization" exists-action="override"> <value>@("Bearer " + (string)context.Variables["foundry-token"])</value> </set-header> Then you can skip the foundry-key Named value entirely. Don't use the legacy Cognitive Services User role — per the Foundry RBAC doc, roles starting with Cognitive Services don't apply to Foundry scenarios. Step 4 — Write the gateway policy This is the core enforcement layer in the architecture. 
Open APIs → anthropicapi → All operations → Inbound processing → </> and paste:
<policies>
  <inbound>
    <base />
    <validate-jwt header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized" require-scheme="Bearer">
      <openid-config url="https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0/.well-known/openid-configuration" />
      <audiences>
        <audience>{{entra-client-id}}</audience>
      </audiences>
      <issuers>
        <issuer>https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0</issuer>
      </issuers>
    </validate-jwt>
    <set-backend-service base-url="https://<foundry-account>.services.ai.azure.com/anthropic" />
    <set-header name="Authorization" exists-action="delete" />
    <set-header name="x-api-key" exists-action="override">
      <value>{{foundry-key}}</value>
    </set-header>
    <set-query-parameter name="api-version" exists-action="skip">
      <value>2024-05-01-preview</value>
    </set-query-parameter>
  </inbound>
  <backend><base /></backend>
  <outbound><base /></outbound>
  <on-error><base /></on-error>
</policies>
Two things to notice:
- validate-jwt uses the OIDC discovery URL, so JWKS keys are fetched and cached automatically. It rejects any token whose aud claim is not the client ID of our Entra app, which is exactly what we want.
- The Authorization header from the user is not forwarded. Once validate-jwt succeeds, the policy deletes that header (APIM would otherwise pass it through to the backend) and re-authenticates the request to Foundry with x-api-key . No user token ever leaves APIM.
APIM becomes the security boundary: user identity is validated at the edge, and downstream services never see or rely on user tokens.
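To make the audience and issuer checks concrete, here is a minimal Python sketch of the claim comparison that validate-jwt performs. It is an illustration under stated assumptions, not APIM's implementation: it deliberately skips signature verification (which APIM does against the JWKS keys), and the helper names `check_claims` and `make_unsigned_token` are hypothetical.

```python
import base64
import json
import time
from typing import Optional


def b64url_decode(segment: str) -> bytes:
    """Decode a base64url segment, restoring the padding that JWTs omit."""
    return base64.urlsafe_b64decode(segment + "=" * (-len(segment) % 4))


def check_claims(token: str, tenant_id: str, client_id: str,
                 now: Optional[float] = None) -> dict:
    """Return the payload if aud, iss and exp look right; raise ValueError otherwise.

    NOTE: no signature check happens here. This only mirrors the claim
    comparison, so never use it as a real authentication step.
    """
    parts = token.split(".")
    if len(parts) != 3:
        raise ValueError("not a JWT")
    payload = json.loads(b64url_decode(parts[1]))
    expected_issuer = f"https://login.microsoftonline.com/{tenant_id}/v2.0"
    if payload.get("aud") != client_id:
        raise ValueError("audience mismatch")
    if payload.get("iss") != expected_issuer:
        raise ValueError("issuer mismatch")
    if payload.get("exp", 0) <= (time.time() if now is None else now):
        raise ValueError("token expired")
    return payload


def make_unsigned_token(payload: dict) -> str:
    """Build a fake, unsigned token for local experiments (hypothetical helper)."""
    def encode(obj: dict) -> str:
        raw = json.dumps(obj).encode("utf-8")
        return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")
    return f"{encode({'alg': 'none'})}.{encode(payload)}.sig"
```

A token issued for a different app registration fails the `aud` comparison, which is the same reason the gateway would answer 401 for it.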
Step 5 — Configure Claude Desktop Open Claude Desktop → Configure third-party inference and fill it in like this: Field Value Connection Gateway Credential kind Interactive sign-in Gateway base URL https://<apim-name>.azure-api.net/claude Client ID (the appId your script printed) Issuer URL https://login.microsoftonline.com/<tenant-id>/v2.0 Authorization URL / Token URL leave empty Bearer token ID token (default) Scopes leave default ( openid profile email offline_access ) Redirect port leave empty (ephemeral) Model discovery Off Model list → Model ID <deployment-name> (your Foundry deployment name) ℹ️ Why Model discovery is Off — Claude Desktop's discovery uses GET /v1/models , and the Foundry /anthropic surface doesn't implement that endpoint, so it 404s. Listing the model manually skips the call entirely. If you want to leave Model discovery On, stub /v1/models in APIM. Add a GET /v1/models operation to your API and give it this inbound policy that returns an Anthropic-shaped response without ever hitting the backend: <policies> <inbound> <base /> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ return new JObject( new JProperty("data", new JArray( new JObject( new JProperty("id", "<deployment-name>"), new JProperty("type", "model"), new JProperty("display_name", "Claude on Foundry"), new JProperty("created_at", "2026-01-01T00:00:00Z") ) )), new JProperty("has_more", false), new JProperty("first_id", "<deployment-name>"), new JProperty("last_id", "<deployment-name>") ).ToString(); }</set-body> </return-response> </inbound> <backend><base /></backend> <outbound><base /></outbound> <on-error><base /></on-error> </policies> Add one entry per deployment you want to expose. 
The benefit of stubbing rather than turning discovery off is that adding new models becomes a policy edit, with no need to re-export and redeploy the Claude Desktop config to every user. Click Apply Changes, then Sign in to your organization. Your browser opens to the normal Entra sign-in page; once approved you're returned to the app, and a quick connection test runs. The success indicator is a small green banner: ✅ Inference — 1-token completion in 1449 ms · via identity provider
For broader rollout, hit the Export button at the top of the configuration window. It produces a .mobileconfig (macOS) or .reg (Windows) file you can push via Intune / Jamf to every user's machine.
Step 6 - Verify both hops
In APIM → APIs → anthropicapi → Test → POST /v1/messages I sent:
Headers: anthropic-version: 2023-06-01
Body: { "model": "<deployment-name>", "max_tokens": 64, "messages": [{"role":"user","content":"hi"}] }
Click Send → Trace, and look at three places:
- Inbound → validate-jwt: should say succeeded and show the decoded claims (your oid , email , etc.).
- Backend → Request: outbound URL is https://<foundry-account>.services.ai.azure.com/anthropic/v1/messages?api-version=2024-05-01-preview , with x-api-key: **** present and Authorization absent.
- Backend → Response: 200, with a Claude message JSON body.
That confirms both halves of the chain.
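The same request can also be assembled outside the portal. The following is a minimal sketch that only builds the request object and does not send it; the function name `build_messages_request` is hypothetical, and obtaining a real Entra ID token is left out of scope.

```python
import json
import urllib.request


def build_messages_request(gateway_base_url: str, id_token: str,
                           deployment: str, prompt: str) -> urllib.request.Request:
    """Build (but do not send) a POST /v1/messages request for the gateway."""
    body = {
        "model": deployment,  # the Foundry deployment name
        "max_tokens": 64,
        "messages": [{"role": "user", "content": prompt}],
    }
    return urllib.request.Request(
        url=gateway_base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # The user's token; the gateway validates it and swaps in its own credential.
            "Authorization": f"Bearer {id_token}",
            "anthropic-version": "2023-06-01",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would send it; with a valid token the trace should show the same validate-jwt and backend entries described above.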
Bumps I hit along the way
A few common issues encountered during setup, shared so you can skip them:

| Symptom | Cause | Fix |
| --- | --- | --- |
| Claude shows "Your provider's model list hasn't loaded yet" and /v1/models returns 404 | Foundry's Anthropic surface doesn't implement that endpoint | Turn Model discovery OFF in Claude Desktop and add the deployment name manually |
| Claude shows "Authentication failed" even though sign-in worked | The APIM API still had Subscription required = ON, blocking the call before validate-jwt ran with 401: Access denied due to missing subscription key | Uncheck Subscription required on the API |
| Portal Test panel shows "Cannot read properties of undefined (reading 'statusCode')" | The test console doesn't attach an Entra token, so validate-jwt 401s and the panel's JavaScript crashes | Comment out <validate-jwt> temporarily for portal testing, or test via curl with a real token |
| OIDC discovery failed (HTTP 404) in Claude Desktop | Pasted the metadata URL into Issuer URL | Issuer must end at /v2.0 , not at /.well-known/openid-configuration |
| Token exchange failed (HTTP 401) | App registered under Web platform instead of Mobile and desktop applications | Create a new app with the right platform; it can't be changed |

Where this leaves us
This pattern is small in moving parts but has outsized architectural impact:
- Zero secrets on endpoints. Eliminates API-key sprawl across laptops, MDM profiles, and shared vaults. The Foundry key lives only inside APIM, or disappears entirely when you switch APIM to managed identity.
- Identity, not credentials. Every Claude Desktop user authenticates against Entra ID in their browser, the same as Office or Teams. MFA, Conditional Access, and Entra ID Protection apply automatically, with no parallel auth story to maintain.
- Per-user observability built in. APIM logs carry the user's Entra oid , email , and group claims. That unlocks per-user dashboards, cost allocation, and abuse detection without any client-side instrumentation.
- Aligned with Zero Trust.
Strong identity at the edge, no implicit trust between hops, single policy chokepoint for inspection and rate-limiting, and full revocability through a single Enterprise Application. Optional but trivial keyless path. Flip APIM to system-assigned managed identity + <authentication-managed-identity resource="https://cognitiveservices.azure.com" /> and one Foundry User role assignment (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d , formerly Azure AI User) on the Foundry account. See the Foundry RBAC doc — don't use any Cognitive Services * roles for Foundry. What I'd add next llm-token-limit and llm-emit-token-metric policies for per-user quotas and cost visibility. App Insights wiring on the API, with a workbook that pivots on the oid claim. Assignment required = Yes on the Entra Enterprise Application + a security group, so only approved users can sign in. Intune deployment of the exported .reg / .mobileconfig so the gateway URL and client ID land on devices automatically. But that's all incremental. The hard part — getting Claude Desktop, Entra ID, APIM, and Foundry to agree on who's allowed to talk to whom — is done. Total elapsed: about an afternoon, most of it spent learning where each portal hides its switches. Useful links Gateway single sign-on with your identity provider — Claude.ai Documentation Configure Claude Desktop with Foundry Models — Microsoft Learn Role-based access control for Microsoft Foundry — Microsoft Learn
LZhang
Jun 30, 2026, Microsoft Developer Community Blog
Mastering Query Fields in Azure AI Document Intelligence with C#
Introduction
Azure AI Document Intelligence simplifies document data extraction, with features like query fields enabling targeted data retrieval. However, using these features with the C# SDK can be tricky. This guide highlights a real-world issue, provides a corrected implementation, and shares best practices for efficient usage.
Use case scenario
In the course of development work or code review with Azure AI Document Intelligence, many developers have hit an error while trying to extract fields like "FullName," "CompanyName," and "JobTitle" using `AnalyzeDocumentAsync`. The error looks similar to:
Inner Error: The parameter urlSource or base64Source is required.
It comes down to how the request parameters are passed, which changed between SDK versions. The problematic C# code typically looks like this:
BinaryData data = BinaryData.FromBytes(Content);
var queryFields = new List<string> { "FullName", "CompanyName", "JobTitle" };
var operation = await client.AnalyzeDocumentAsync( WaitUntil.Completed, modelId, data, "1-2", queryFields: queryFields, features: new List<DocumentAnalysisFeature> { DocumentAnalysisFeature.QueryFields } );
One reason this fails is that the developer is using `Azure.AI.DocumentIntelligence v1.0.0`, where `base64Source` and `urlSource` are handled internally by the SDK. Older examples that use `AnalyzeDocumentContent` no longer apply and lead to errors.
Practical solutions
- Using AnalyzeDocumentOptions.
- Alternative method using a manual JSON payload.
Using AnalyzeDocumentOptions
The correct method uses AnalyzeDocumentOptions, which streamlines request construction in the following steps:
Prepare the document content:
BinaryData data = BinaryData.FromBytes(Content);
Create AnalyzeDocumentOptions:
var analyzeOptions = new AnalyzeDocumentOptions(modelId, data) { Pages = "1-2", Features = { DocumentAnalysisFeature.QueryFields }, QueryFields = { "FullName", "CompanyName", "JobTitle" } };
- `modelId`: Your trained model’s ID.
- `Pages`: Specify pages to analyze (e.g., "1-2").
- `Features`: Enable `QueryFields`.
- `QueryFields`: Define which fields to extract.
Run the analysis:
Operation<AnalyzeResult> operation = await client.AnalyzeDocumentAsync( WaitUntil.Completed, analyzeOptions );
AnalyzeResult result = operation.Value;
Why this works:
- The SDK manages `base64Source` automatically.
- This approach matches the current SDK conventions.
- It results in cleaner, more maintainable code.
Alternative method using manual JSON payload
For advanced use cases where more control over the request is needed, you can create the JSON payload manually. For example:
var queriesPayload = new { queryFields = new[] { new { key = "FullName" }, new { key = "CompanyName" }, new { key = "JobTitle" } } };
string jsonPayload = JsonSerializer.Serialize(queriesPayload);
BinaryData requestData = BinaryData.FromString(jsonPayload);
var operation = await client.AnalyzeDocumentAsync( WaitUntil.Completed, modelId, requestData, "1-2", features: new List<DocumentAnalysisFeature> { DocumentAnalysisFeature.QueryFields } );
When to use this approach:
- Custom request formats
- Non-standard data source integration
Key points to remember
- Check your SDK version: breaking changes exist between the preview versions and v1.0.0.
- Prefer `AnalyzeDocumentOptions` and the other built-in classes for simpler, less error-prone integration.
- Wrap your content in `BinaryData`, or pass a direct URL, so the document input reaches the service correctly.
Conclusion
Using AnalyzeDocumentOptions provides a cleaner and more reliable way to work with query fields in Azure AI Document Intelligence using C#. By aligning with the latest SDK approach, developers can simplify implementation, reduce common errors, and improve code maintainability. Keeping up with SDK enhancements and recommended practices ensures more accurate and efficient document data extraction.
As Azure AI capabilities continue to evolve, adopting modern integration patterns will help you build scalable and future-ready document processing solutions with greater confidence. Reference Official AnalyzeDocumentAsync Documentation. Official Azure SDK documentation. Azure Document Intelligence C# SDK support add-on query field.
sasina
Jun 19, 2026, Microsoft Developer Community Blog
Agents That Test Agents: A Cloud-Native Skill-Eval Harness on Foundry Hosted Agents
Skills are an agent's must-have. So test them.
A skill is the lightest way to give an agent durable, reusable behavior: a SKILL.md file you author once, store centrally in Foundry's versioned Skills API, and inject into a Hosted Agent's context, with no code change and no redeploy. That's why skills have quietly become standard equipment for production agents. But the moment a skill carries real behavior, a hard question follows: how do you know it still works? When you edit a skill you can't feel whether you improved it or just changed it. It might stop triggering, skip a required section, or quietly produce a worse result on one model than another. The cure is the same discipline we use for any prompt: evaluate it. Run the agent, capture what happened, and grade it against a small set of checks. This is exactly what azure_skill_eval does for one concrete skill: edu-video-script, which writes an education short-video script for a given knowledge point (the sample's smoke test asks it to script the "P vs NP problem"). And it does the whole thing cloud-native, on Foundry Hosted Agents.
The scenario: one skill, two models, four hosted agents
The skill under test is edu-video-script. The clever part of the harness is that it doesn't just check one run. It puts the skill on a stand and stresses it from three sides, using four Foundry Hosted Agents wired together by the Agent Framework FoundryAgent:

| Hosted agent | Role |
| --- | --- |
| skill-eval-business-agent-gpt | System under test (SUT), running edu-video-script on gpt-5.5 |
| skill-eval-business-agent-deepseek | The same skill, running on DeepSeek-V4-Pro |
| skill-eval-attacker-agent | Multi-turn adversarial prompt generator |
| skill-eval-judge-agent | LLM-as-judge that returns a rubric score as JSON |

Two business agents run the same skill on different models, so every case becomes an apples-to-apples comparison: which model executes this skill better? The attacker and judge are the graders.
What we measure (define "done" first) Good evals start from a checkable definition of done — outcome, process, style, efficiency. For an education-video script that means: Did it produce a valid script (outcome)? Did it actually follow the edu-video-script template (process/style)? Does it hold up when a user pushes on it across turns (robustness)? The harness answers these with three grading layers. 1. Deterministic checks first (validator.py) The cheapest, most explainable signal: does the output match the script template the skill is supposed to produce? validator.py runs fixed, deterministic template checks — no model needed. These catch the obvious regressions instantly and never cost a token. 2. The LLM judge (skill-eval-judge-agent) Template checks answer "did it do the basics?" but not "is the script any good?" — pacing, clarity, whether it teaches the concept. For that, a dedicated judge hosted agent grades the result and returns structured JSON so scores compare cleanly across runs and models: { "overall_pass": true, "score": 100, "checks": [] } Structured output is the point: stable fields (overall_pass, score, checks) diff cleanly between GPT and DeepSeek, and between today's skill version and last week's. 3. The multi-turn attacker (test_agent.py + skill-eval-attacker-agent) A skill that looks great on a clean prompt can still fall apart when a user pushes on it. The attacker agent generates adversarial prompts for a knowledge point using a chosen strategy — for example extreme length — and keeps the pressure on across multiple turns (max_turns, default 3). This is where you find out whether edu-video-script stays on-template under stress, not just on the happy path. # the attacker takes a knowledge point + a strategy, emits one user prompt azd ai agent invoke skill-eval-attacker-agent \ "Topic: P vs. NP problem Recommended attack strategy: Extreme length Please output the unique user prompt text." 
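The first two grading layers can be sketched in a few lines of Python. This is an illustration under stated assumptions, not the sample's validator.py: the section names in `REQUIRED_SECTIONS` are hypothetical placeholders for whatever the edu-video-script template actually requires, and the function names are invented for this example. Only the three judge fields (overall_pass, score, checks) come from the text above.

```python
import json

# Hypothetical section markers; the real edu-video-script template may differ.
REQUIRED_SECTIONS = ("Hook", "Explanation", "Summary")


def template_check(script: str, required=REQUIRED_SECTIONS) -> list:
    """Deterministic layer: return the required sections missing from the script."""
    lowered = script.lower()
    return [name for name in required if name.lower() not in lowered]


def parse_judge(raw: str) -> dict:
    """Parse the judge's JSON and insist on the three stable fields."""
    verdict = json.loads(raw)
    for field in ("overall_pass", "score", "checks"):
        if field not in verdict:
            raise ValueError(f"judge output missing '{field}'")
    return verdict


def grade(script: str, judge_raw: str) -> dict:
    """Combine both layers into one comparable record per case and model."""
    missing = template_check(script)
    verdict = parse_judge(judge_raw)
    return {
        "template_ok": not missing,
        "missing_sections": missing,
        "judge_score": verdict["score"],
        "passed": not missing and bool(verdict["overall_pass"]),
    }
```

Because the template check costs nothing, a harness could also skip the judge call entirely when sections are missing; this sketch always parses both so the record has the same shape for every case.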
The eval loop, end to end runner.py is a ghcsdk-style pipeline that runs cases × models, with each side toggleable: pick all models / GPT only / DeepSeek only, run a single case (e.g. edge-03), and switch adversarial mode, single-turn vs multi-turn, and judge grading on or off. The same switches are query parameters on POST /api/run: model, only_case, use_attack, single_turn, use_judge, max_turns. The test set lives in shared/test_cases.py — 10 built-in edge cases (edge-01 … edge-10) exported to evals/evals.json. You don't need a giant benchmark; a small, sharp set catches regressions, and you grow it whenever a real failure shows up: python -m evals.export_evals # regenerate evals/evals.json from shared/test_cases.py Every SUT call goes through runtime.py, which follows the official Agent Framework hosted-agent sample: it opens a fresh hosted session per turn, invokes via Responses, and tears the session down afterward. # shared/runtime.py — the documented Foundry hosted-agent pattern project = AIProjectClient(endpoint=FOUNDRY_PROJECT_ENDPOINT, credential=cred, allow_preview=True) agent = FoundryAgent(project_client=project, name=agent_name, # e.g. skill-eval-business-agent-gpt allow_preview=True) session = project.beta.agents.create_session(agent_name=agent_name) # ... send the (possibly adversarial) prompt, collect the Responses output ... So a single case flows: runner → business agent (skill runs) → validator → judge, optionally with the attacker driving multiple turns first. Cloud-native by design — and why that matters for eval This is the part that makes the harness production-grade rather than a laptop script. The hard parts of an eval harness — provisioning agents, recording every run, scaling trials, governing access — are handled by Azure, not by you. Foundry Hosted Agents are the runtime. The SUT, attacker, and judge all run as managed hosted agents in your Foundry project. 
You bring the skill and the cases; Foundry hosts the agents, models, and sessions. The business agents deploy with host: azure.ai.agent and docker.remoteBuild: true, so azd deploy builds the containers in Azure Container Registry — local Docker doesn't even need to be running. The UI is serverless. A FastAPI app on Azure Container Apps lets you upload evals.json, watch progress live, and browse the dashboard — scale-to-zero when no one's running evals. Every run is durable. Results land in Azure Blob Storage (skill-eval-runs), one yymmdd-XXXXXX/ folder per run, with a newest-first runs.json index. Nothing lives only in a terminal scrollback. Access is identity-based. In the cloud, a user-assigned Managed Identity carries exactly two roles — Storage Blob Data Contributor + Azure AI User; locally it's AzureCliCredential. No keys in env files. It's reproducible infra. azd up runs infra/main.bicep to stand up Storage, the container, Log Analytics, the Container Apps environment, the identity, and the role assignments in one shot. The payoff: the scores you read came from the same hosted runtime you actually ship to — not a local approximation — and the run that produced them is sitting in Blob, comparable against every run before it. Run it Local (no deploy): conda activate agentdev cd Skill_eval/azure_skill_eval pip install -r requirements.txt cp .env.example .env # FOUNDRY_PROJECT_ENDPOINT + AZURE_STORAGE_* uvicorn webapp.app:app --reload --port 8000 Open http://localhost:8000, upload evals/evals.json, pick your models and modes, and click Run. 
Cloud (azd): azd auth login azd env new skill-eval-dev azd env set FOUNDRY_PROJECT_ENDPOINT https://<project>.services.ai.azure.com/api/projects/<project> azd env set MODEL_GPT gpt-5.5 azd env set MODEL_DEEPSEEK DeepSeek-V4-Pro azd up Provision the skill once, deploy the four hosted agents, then smoke-test them: python -m hosted_agent.provision_skills # upload edu-video-script to Foundry Skills azd deploy skill-eval-business-agent-gpt azd deploy skill-eval-business-agent-deepseek azd deploy skill-eval-attacker-agent azd deploy skill-eval-judge-agent azd ai agent invoke skill-eval-business-agent-gpt "Here is a script for an educational short video on the P vs. NP problem." Read the results Each run is self-contained on Blob: summary.json gives you the headline — pass rate and judge averages — and the per-{case}__{model}.json files let you open any single result and see exactly what the skill produced and why it passed or failed. The dashboard streams these straight from Blob via /api/runs/{run_id}/files/{filename}. Because GPT and DeepSeek ran the same cases, the comparison is right there in one folder. Takeaways A skill you can't evaluate is a skill you can't trust. edu-video-script is treated like code — versioned in Foundry, run, and graded. Stack your graders cheap-to-expensive. Deterministic template checks first (validator.py), then an LLM judge for quality, then a multi-turn attacker for robustness. Make the judge return structured JSON. overall_pass / score / checks compare cleanly across models and skill versions. Compare models on the same skill. Running GPT-5.5 and DeepSeek-V4-Pro side by side turns "which model?" from a guess into a measured answer. Let the platform carry the harness. Foundry Hosted Agents are the runtime; Azure Container Apps, Blob Storage, Managed Identity, and azd/Bicep make the whole loop reproducible and durable. Write the skill. Then build the harness that proves it. 
On Foundry, that second step is mostly configuration — and the result is a skill you can actually trust in production. Conclusion Skills moved agent behavior out of code and into versioned Markdown — a huge win for reuse, but only if you can prove a skill still works after every edit. azure_skill_eval answers that for edu-video-script by treating evaluation as a first-class, repeatable step rather than a gut check. The shape is simple and worth copying for any skill of your own: Pin down "done" as checkable criteria, then encode a small set of sharp cases (here, 10 edge cases). Grade in layers, cheap to expensive — deterministic template checks, then a structured LLM-judge rubric, then a multi-turn adversarial pass. Run the same cases across models (GPT-5.5 vs DeepSeek-V4-Pro) so model choice becomes a measurement, not a guess. Let the cloud carry it — Foundry Hosted Agents as the runtime, FastAPI on Azure Container Apps for the UI, Blob Storage for durable runs, Managed Identity for access, and azd/Bicep so the whole thing is reproducible. The result is a feedback loop where every skill change is confirmed, every regression is visible, and every score traces back to the same hosted runtime you ship to. That's the difference between building skills and being able to trust them — and on Foundry, the gap between the two is mostly configuration. Sample Code : https://github.com/kinfey/Multi-AI-Agents-Cloud-Native/tree/main/code/Skill_eval
kinfey
Jun 15, 2026, Microsoft Developer Community Blog
Deploying Foundry Hosted Agents from Source Code
Introduction
At Microsoft Build, it was announced that Foundry Hosted Agents now support source-code deployments. Previously, Hosted Agents required application code to be packaged in a container for deployment. This new functionality allows you to deploy the agent from a `.zip` file instead of from a container image. This post walks through the process of deploying a source-code Hosted Agent, briefly compares that approach to container-based Hosted Agent deployment, and provides a reusable GitHub Action for CI/CD deployments. It is part of a series of posts whose source code is housed in the simple-hosted-agent-responses repository. If Hosted Agents are new to you, read the previous posts, "Deploying Foundry Hosted Agents via REST API" and "GitHub Actions for Deploying Hosted Agents."
Background
A Foundry Hosted Agent abstracts away the management of the compute tier for your agent. It runs in a self-contained Micro-VM sandbox, which provides the CPU and memory allocation used to run your agent. Previously, this Micro-VM would download your code from an Azure Container Registry (ACR) and run it on the virtualized platform. Not all customers use container-based workloads today and, let's face it, not everything needs to be a container. So how do those customers and platforms take advantage of Foundry Hosted Agents? The answer is source-code deployments of Foundry Hosted Agents.
What is a Source Code Agent?
Source Code Agents are like other Foundry Hosted Agents. The key deployment difference is that the code asset is a .zip file instead of a container image. This also changes the Agent Development Lifecycle compared with the containerized version of Foundry Hosted Agents. An important point of clarity: the way the agent is configured is a data plane operation.
As such, taking advantage of Source Code Agent functionality does not require changes to the Foundry infrastructure itself when your Infrastructure as Code (IaC) is only provisioning the supporting resources in Bicep, Terraform, or PowerShell. The deployment change happens through the Foundry data plane. First, let's look at a container-based Foundry Hosted Agent: Now, let's compare it to the source-code version:
Deployment Process
Now that we've looked at the end result, let's talk through the steps required to deploy a Foundry Hosted Agent via source code. In Foundry, what is the difference between a container-based and a source-code-based Foundry Hosted Agent? The Microsoft Learn docs outline this well: Every source-code deployment follows the same sequence: package -> create or update -> poll until active -> invoke. The source-code path uses `code_configuration` in the agent definition; the image-based path uses `container_configuration` instead--the two are mutually exclusive on a single version. To confirm this and see more detail, refer to the Foundry Agent REST API documentation. The source layout can stay familiar, but the deployed artifact changes to a `.zip` file. Packaging the source code into a ZIP is the piece that differs from the container-image flow. The agent deployment to Foundry is also slightly different because it uses source-code configuration instead of container configuration. You can run this via `azd` with a command structured like the following:
azd ai agent init --no-prompt --project-id "<project-resource-id>" --deploy-mode code --runtime python_3_13 --entry-point main.py
This assumes `azd` is installed and authenticated, and that the authenticated identity has access to the Foundry project. The command initializes a code deployment for the project. However, we recognize that the majority of enterprise organizations will want to use other deployment methods.
As such, REST API deployments are supported, as are the Python and C# SDKs for creating the agent. Taking this a step further, and similar to "GitHub Actions for Deploying Hosted Agents," let's create a reusable GitHub Action for deploying source-code-based Hosted Agents.

GitHub Action

If you want to see the entire action, it is part of the simple-hosted-agent-responses repository, which contains source code, IaC, and deployment options.

Background

First, we need to understand that we cannot reuse the GitHub Action from "GitHub Actions for Deploying Hosted Agents" because, as noted above, the REST API uses mutually exclusive options. In theory, we could add conditional logic across the parameters; however, it is cleaner to create a separate action.

Before invoking this action, the workflow must authenticate to Azure because the action calls `az account get-access-token` to acquire a token for the Foundry data plane.

Inputs

    inputs:
      project_endpoint:
        description: Foundry project endpoint URL
        required: true
      agent_name:
        description: Name of the hosted agent
        required: true
      source_code_zip:
        description: Path to the local source-code zip artifact
        required: true
      model_deployment_name:
        description: Name of the AI model deployment
        required: true
      cpu:
        description: CPU allocation for the hosted agent container
        required: false
        default: '0.25'
      memory:
        description: Memory allocation for the hosted agent container
        required: false
        default: '0.5Gi'
      runtime:
        description: Source-code runtime for the hosted agent
        required: false
        default: 'python_3_13'
      entry_point:
        description: Source-code entry point command for the hosted agent
        required: false
        default: '["python", "main.py"]'
      dependency_resolution:
        description: How Agent Service resolves dependencies for the source-code deployment
        required: false
        default: 'remote_build'
      max_polling_seconds:
        description: Maximum time to wait for the source-code deployment to reach active status
        required: false
        default: '600'

For our inputs, `project_endpoint`, `agent_name`, `source_code_zip`, and `model_deployment_name` are required. The CPU, memory, runtime, entry point, dependency resolution, and max polling values are configurable properties with defaults set in the action.

The source-code-specific inputs populate the `code_configuration` properties of the REST payload. These include `source_code_zip`, `runtime`, `entry_point`, and `dependency_resolution`. This information tells Foundry how to run the code from the `.zip` package.

Outputs

We should output values that make sense for downstream workflows. Not every workflow will use them, but it is useful to expose non-secret values when they can support later steps. In this case, we are creating a new version of the agent, so let's output that version ID.

    outputs:
      agent_version:
        description: Version ID returned by the Foundry data plane
        value: ${{ steps.post.outputs.agent_version }}

Action

The action first maps the inputs to environment variables and prepares the metadata body of the call. It then gets an access token from Azure, posts to the REST API endpoint, and polls until the new version is active. Verify the body against the API documentation for all valid properties. For this example, I chose not to set `rai_config` and `tools` to keep things simple.
    runs:
      using: composite
      steps:
        - name: Create source-code metadata
          id: metadata
          shell: bash
          env:
            AGENT_NAME: ${{ inputs.agent_name }}
            MODEL_DEPLOYMENT_NAME: ${{ inputs.model_deployment_name }}
            CPU: ${{ inputs.cpu }}
            MEMORY: ${{ inputs.memory }}
            RUNTIME: ${{ inputs.runtime }}
            ENTRY_POINT: ${{ inputs.entry_point }}
            DEPENDENCY_RESOLUTION: ${{ inputs.dependency_resolution }}
          run: |
            METADATA_FILE=$(mktemp)
            ENTRY_POINT_JSON=$(python3 -c 'import json,sys; print(json.dumps(json.loads(sys.argv[1])))' "$ENTRY_POINT")

            jq -n \
              --arg model "$MODEL_DEPLOYMENT_NAME" \
              --arg cpu "$CPU" \
              --arg memory "$MEMORY" \
              --arg runtime "$RUNTIME" \
              --arg dep_resolution "$DEPENDENCY_RESOLUTION" \
              --argjson entry_point "$ENTRY_POINT_JSON" \
              '{
                description: "Hosted agent deployed from source code",
                definition: {
                  kind: "hosted",
                  protocol_versions: [{protocol: "responses", version: "1.0.0"}],
                  cpu: $cpu,
                  memory: $memory,
                  code_configuration: {
                    runtime: $runtime,
                    entry_point: $entry_point,
                    dependency_resolution: $dep_resolution
                  },
                  environment_variables: {AZURE_AI_MODEL_DEPLOYMENT_NAME: $model}
                }
              }' > "$METADATA_FILE"

            echo "metadata_file=${METADATA_FILE}" >> "$GITHUB_OUTPUT"
            echo "Metadata file created at ${METADATA_FILE}"

        - name: Post source-code agent deployment to Foundry data plane
          id: post
          shell: bash
          env:
            PROJECT_ENDPOINT: ${{ inputs.project_endpoint }}
            AGENT_NAME: ${{ inputs.agent_name }}
            SOURCE_CODE_ZIP: ${{ inputs.source_code_zip }}
            METADATA_FILE: ${{ steps.metadata.outputs.metadata_file }}
            MAX_POLLING_SECONDS: ${{ inputs.max_polling_seconds }}
          run: |
            if [[ ! -f "$SOURCE_CODE_ZIP" ]]; then
              echo "Error: Source code zip not found at ${SOURCE_CODE_ZIP}"
              exit 1
            fi

            CODE_ZIP_SHA256=$(sha256sum "$SOURCE_CODE_ZIP" | awk '{print $1}')
            echo "Source code SHA256: ${CODE_ZIP_SHA256}"

            FOUNDRY_TOKEN=$(az account get-access-token \
              --resource "https://ai.azure.com/" \
              --query accessToken -o tsv)

            # POST /agents/{name}/versions auto-creates the agent if it doesn't
            # exist and adds a new version if it does, so a single call covers
            # both first-deploy and update scenarios (matches update-agent).
            HTTP_STATUS=$(curl -s -o /tmp/source_code_response.json \
              -w "%{http_code}" \
              -X POST \
              "${PROJECT_ENDPOINT}/agents/${AGENT_NAME}/versions?api-version=2025-11-15-preview" \
              -H "Authorization: Bearer ${FOUNDRY_TOKEN}" \
              -H "Accept: application/json" \
              -H "Foundry-Features: CodeAgents=V1Preview,HostedAgents=V1Preview" \
              -H "x-ms-agent-name: ${AGENT_NAME}" \
              -H "x-ms-code-zip-sha256: ${CODE_ZIP_SHA256}" \
              -F "metadata=@${METADATA_FILE};type=application/json" \
              -F "code=@${SOURCE_CODE_ZIP};type=application/zip;filename=${AGENT_NAME}.zip")

            echo "HTTP ${HTTP_STATUS}: $(cat /tmp/source_code_response.json)"
            if [[ "$HTTP_STATUS" -lt 200 || "$HTTP_STATUS" -ge 300 ]]; then
              echo "Error: Foundry data plane returned HTTP ${HTTP_STATUS}"
              exit 1
            fi

            RESPONSE=$(cat /tmp/source_code_response.json)
            AGENT_VERSION=$(echo "$RESPONSE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["version"])')
            echo "agent_version=${AGENT_VERSION}" >> "$GITHUB_OUTPUT"
            echo "Agent version resolved as ${AGENT_VERSION}"

            START_TIME=$(date +%s)
            while true; do
              ELAPSED=$(($(date +%s) - START_TIME))
              if [[ $ELAPSED -gt $MAX_POLLING_SECONDS ]]; then
                echo "Error: Agent version did not reach active state within ${MAX_POLLING_SECONDS} seconds"
                exit 1
              fi

              VERSION_STATUS=$(curl -s \
                -X GET \
                "${PROJECT_ENDPOINT}/agents/${AGENT_NAME}/versions/${AGENT_VERSION}?api-version=2025-11-15-preview" \
                -H "Authorization: Bearer ${FOUNDRY_TOKEN}" \
                -H "Accept: application/json" \
                -H "Foundry-Features: CodeAgents=V1Preview,HostedAgents=V1Preview" \
                | python3 -c 'import sys,json; data=json.load(sys.stdin); print(data.get("status", "unknown"))' 2>/dev/null)

              echo "Current status: ${VERSION_STATUS} (elapsed ${ELAPSED}s)"
              if [[ "$VERSION_STATUS" == "active" ]]; then
                echo "Agent version ${AGENT_VERSION} is active"
                break
              fi
              if [[ "$VERSION_STATUS" == "failed" ]]; then
                echo "Error: Agent version reached failed status"
                exit 1
              fi
              sleep 5
            done

Building the Source-Code Artifact

Before calling the source-code Hosted Agent action, create the ZIP artifact that will be passed into `source_code_zip`.

    source-code:
      name: Build source-code artifact
      runs-on: ubuntu-latest
      permissions:
        contents: read
      steps:
        - name: Checkout
          uses: actions/checkout@v6
        - name: Create source-code zip artifact
          run: |
            git archive --format=zip --output=source-code.zip HEAD:src/agent-framework/responses/basic
        - name: Upload source-code artifact
          uses: actions/upload-artifact@v7
          with:
            name: source-code
            path: source-code.zip

Calling the Action

Now that we have the action, how can we scale this across multiple workflows? We pass in the required parameters and the ZIP artifact path.

    - name: Update agent with source code
      uses: ./.github/actions/update-agent-source-code
      with:
        project_endpoint: ${{ needs.deploy-iac.outputs.project_endpoint }}
        # Source-code agent shares the same Foundry project as the image-based
        # agent; the `-src` suffix keeps them as distinct agent versions.
        agent_name: ${{ inputs.agent_name }}-src
        source_code_zip: ./.artifacts/source-code/source-code.zip
        model_deployment_name: ${{ needs.deploy-iac.outputs.model_deployment_name }}

And just to show we can call the same action multiple times, here are two examples that do just that: Deploy (Bicep) and Deploy (Terraform).

Conclusion

Source-code deployments give Foundry Hosted Agents another deployment path for teams that do not want, or do not need, to package every agent as a container image.
By using a .zip artifact, teams can keep a familiar source-code packaging flow while still taking advantage of the managed compute abstraction that Hosted Agents provide. The reusable GitHub Action shown in this post turns that deployment process into a repeatable CI/CD step: package the source code, post the deployment to the Foundry data plane, poll until the new version is active, and expose the resulting agent version for downstream workflow steps. This keeps the deployment flexible while fitting into existing enterprise pipeline patterns. For organizations already using container-based Hosted Agents, source-code deployments do not replace that model; they expand the options available. Choose the deployment approach that best fits how your teams package, govern, and operate their agent workloads.
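The package and poll steps of that sequence can also be pictured outside of GitHub Actions. The following is a minimal Python sketch, not the post's action: it zips a source folder, computes the same kind of SHA-256 digest the action sends in the `x-ms-code-zip-sha256` header, and polls a status callback until it reports `active` or `failed`. The function names (`package_source`, `sha256_of`, `wait_until_active`) and the `get_status` callback are hypothetical; in a real pipeline the callback would wrap the GET on the agent version endpoint.

```python
import hashlib
import time
import zipfile
from pathlib import Path
from typing import Callable


def package_source(src_dir: str, zip_path: str) -> str:
    """Zip every file under src_dir, storing paths relative to src_dir."""
    root = Path(src_dir)
    target = Path(zip_path).resolve()
    with zipfile.ZipFile(zip_path, "w", zipfile.ZIP_DEFLATED) as archive:
        for path in sorted(root.rglob("*")):
            # Skip directories and the archive itself if it lives under src_dir.
            if path.is_file() and path.resolve() != target:
                archive.write(path, path.relative_to(root).as_posix())
    return zip_path


def sha256_of(path: str) -> str:
    """Hex SHA-256 of a file, the same value `sha256sum` prints first."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(65536), b""):
            digest.update(chunk)
    return digest.hexdigest()


def wait_until_active(
    get_status: Callable[[], str],
    max_seconds: float = 600,
    interval: float = 5,
) -> None:
    """Poll get_status() until 'active'; raise on 'failed' or on timeout."""
    start = time.monotonic()
    while True:
        status = get_status()
        if status == "active":
            return
        if status == "failed":
            raise RuntimeError("agent version reached failed status")
        if time.monotonic() - start > max_seconds:
            raise TimeoutError(f"not active within {max_seconds} seconds")
        time.sleep(interval)
```

The defaults for `max_seconds` and `interval` mirror the action's `max_polling_seconds` default and its `sleep 5`; passing the status check in as a callback keeps the loop testable without calling the Foundry data plane.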
j_folberth
Jun 10, 2026 | Microsoft Developer Community Blog
We Gave Ourselves 20 Minutes to Build an AI Agent for a Lumber Company. The Timer's Still on Screen.
Here's a confession: most "build with AI" webinars are 60 minutes of slides, 5 minutes of a polished demo someone rehearsed for a week, and a closing CTA. You leave inspired but not really sure what you saw.

So we tried something different. We put a visible countdown timer on the screen and gave ourselves 20 minutes to do two things, live:

- Build an AI agent that solves a real business problem
- Deploy a working AI application to Azure

No edits to hide the awkward parts. No "and here's one I prepared earlier." Just the timer, the screen, and a working app at the end.

The on-demand recording is up now. Here's what's in it and why you should carve out 20 minutes for it this week.

The setup: why lumber? 🏘️

We needed a real business problem, not a toy one. So for the demo, we role-play as the owner of Contoso Lumber, a regional lumber business with a very specific, very real headache:

Should we sell our inventory now, or hold it longer?

Sell too early, miss a better price. Hold too long, eat storage costs. Lumber prices fluctuate with global competition, macro shifts, even the weather. In the past, decisions like this came from morning meetings and gut instinct, or maybe the occasional ad-hoc spreadsheet that nobody could reuse a month later.

It's the kind of decision that should have an analyst behind it, except most growing businesses can't afford to hire one full-time. So we build the AI agent that does.

(Yes, lumber. We know. Stick with us: the boring industry is exactly the point. If it works here, it works for your business too.)

What we actually build (in 20 minutes flat)

The webinar walks through the entire flow, end to end:

Part 1: The agent.
We open Microsoft Foundry at ai.azure.com, browse the model leaderboard (there are over 11,000 models to choose from; we compare a few on the cost-vs-quality chart), pick one, write a plain-English instruction for the agent, upload a CSV of historical lumber pricing, and ask it a real question:

"If I cannot sell one of my products today unless I offer my clients a 35% discount, and knowing the historical pricing data, should I still sell it?"

The agent runs a break-even analysis and comes back with a reasoned recommendation: hold for 3–6 months, here's the math on why, here's where storage costs start eating the upside.

Then we add voice mode (now you can ask the agent for pricing recs from a coffee shop on your phone), and lock down guardrails to block jailbreaks, prompt injection, data leakage, and (because we're feeling fancy) profanity in responses.

Part 2: The app. With the agent done, we pivot to deploying a full AI chat application to Azure. From scratch. Using exactly five commands in Azure Cloud Shell:

    azd auth login
    git clone <repo>
    cd <folder>
    azd up
    azd down   # (this one's for when you're done: kills everything to avoid surprise bills)

That's it. The template handles the Container Apps setup, the architecture-aligned-to-Well-Architected-Framework stuff, all the boilerplate that usually eats half a sprint. By the end of the segment, there's a working AI chatbot running on a real Azure URL.

We even pause the timer when we're just explaining things, so you know the 20-minute clock is honest about build time, not talk time.

Why this format is more useful than another slide deck

A few things this webinar shows that a written tutorial can't:

- The Foundry UI is super navigable. You watch someone do it. You see where the buttons are. You see what the leaderboard looks like when you're comparing GPT-5.3 Codex against Kimi K2.5 on a cost-to-quality chart. (Spoiler: Kimi wins this particular trio. Your mileage will vary depending on your workload.)
- The "no-stitching" claim is real. Models, data, agents, guardrails, deployment: all in one place. You don't need to leave Foundry to wire seven products together. The webinar makes that concrete by showing you the actual flow without cutting.
- Five commands really is five commands. This is the part people are most skeptical about until they see it. azd up does the work. The infrastructure provisioning, the container app, the AI service hookup: all of it.
- You can delete it just as fast. azd down tears everything back down. Useful when you're experimenting and don't want a $40 surprise on your Azure bill next month.

What's on screen at the end

By the 20-minute mark:

- A published AI agent named for the lumber business, with guardrails, voice mode enabled, ready to be called from Teams, Microsoft 365 Copilot, or any application via endpoint
- A separate AI chat application deployed to Azure Container Apps, with a live URL
- Logs, observability, the full Foundry control plane, all available out of the box

And in the closing minutes, four very concrete next steps if this sparked an idea for your own business, including Azure Accelerate (if you want Microsoft experts in the room with you), the partner network, and the Microsoft marketplace if you'd rather buy than build.

Watch the recording

The on-demand recording is available now. Block 20 minutes (that's literally all it takes) and ideally watch with your Azure portal open in another tab so you can follow along. If you're the kind of person who learns by doing, pause at the agent-building section and try it yourself in parallel. Foundry is free to explore; the agent we build in the webinar costs cents to run.

→ Watch the on-demand webinar

A few things we'd love feedback on

If you watch it, we'd genuinely love to know:

- Did the timer help or distract? (We thought it would feel gimmicky. It turned out to be the most-mentioned thing in early feedback.)
- What use case from your business would you want to see in the next one? We're picking the next demo problem from comments.
- Was the lumber thing weirdly compelling or were you just here for the Azure parts?

Drop a comment, tag us, or grab a partner and try building your own version this week. The timer's reset. Your 20 minutes start whenever you press play.

Want to go deeper than the webinar? Two companion reads: From Idea to Impact: How Growing Businesses Scale with Azure (five real customer stories with the full architectures) and AI Made Simple: 3 Practical Moves for Growing Businesses (the structured playbook for figuring out what to build first).
JoshuaHuang
Jun 10, 2026 | Microsoft Developer Community Blog
Building Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I’ve been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF).

Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach.

Approach 1: Agents SDK

Agents SDK provides a straightforward way to create agents with integrated tools and models.

Example: Creating an Agent

    import os

    from azure.ai.projects import AIProjectClient
    from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType
    from azure.identity import DefaultAzureCredential

    client = AIProjectClient(
        credential=DefaultAzureCredential(),
        endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
    )

    # Configure tools (conn_id is the ID of your Azure AI Search connection, defined elsewhere)
    ai_search = AzureAISearchTool(
        index_connection_id=conn_id,
        index_name="my-index",
        query_type=AzureAISearchQueryType.SEMANTIC,
    )

    # Create agent (persisted in Foundry portal)
    agent = client.agents.create_agent(
        model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"),
        name="MyAgent",
        instructions="You are a helpful assistant.",
        tool_resources=ai_search.resources,
        tools=ai_search.definitions,
    )

    # Run conversation
    thread = client.agents.threads.create()
    client.agents.messages.create(thread_id=thread.id, role="user", content="Hello")
    run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id)

What this approach provides

- Native integration with Azure AI services (OpenAI, AI Search, MCP)
- Managed execution environment
- Simple and quick agent setup

Conceptually, this approach can be summarized as: Model + Tools + Execution

Strengths

✅ Rapid development and onboarding
✅ Strong integration within the Azure ecosystem
✅ Well-suited for single-agent or tool-driven use cases
✅ Minimal infrastructure overhead

Challenges observed in practice

As the complexity of scenarios increases, certain limitations become more visible:

- Multi-agent workflows require custom orchestration logic
- Agent handoffs must be implemented manually
- Context sharing across agents requires additional design effort

While this approach offers flexibility, it shifts orchestration complexity to the developer.

Approach 2: Microsoft Agent Framework (MAF)

Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design.

Creating an Agent

    import asyncio
    import os

    from agent_framework import Agent, WorkflowBuilder, Message
    from agent_framework.foundry import FoundryChatClient
    from azure.identity import DefaultAzureCredential

    client = FoundryChatClient(
        project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"),
        model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"),
        credential=DefaultAzureCredential(),
    )

    # Create agents (in-process only, not persisted in portal)
    researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.")
    writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.")

    # Build the multi-agent workflow
    workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build()

    async def main():
        # Run the workflow and print events as they stream in
        async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True):
            print(event.content)

    asyncio.run(main())

What this approach provides

- Built-in orchestration capabilities
- Native support for multi-agent workflows
- Structured agent lifecycle management
- Context and memory handling

Conceptually, this can be viewed as: Agents + Orchestration + System Design

Observations from implementation

When implementing similar use cases using MAF:

- Agent responsibilities became clearly defined
- Routing and delegation patterns were significantly simplified
- Overall system architecture became easier to maintain and scale

This approach encourages thinking in terms of agent ecosystems rather than isolated agents.

Architecture Comparison

[Figures: Agents SDK architecture; Microsoft Agent Framework (MAF) architecture]

Choosing the Right Approach

Use Agents SDK when:

- You need rapid development for a single-agent use case
- The workflow is relatively straightforward
- You prefer flexibility and lower-level control

Use Microsoft Agent Framework when:

- You are designing multi-agent systems
- Your solution requires routing, delegation, or handoffs
- Long-term scalability and maintainability are essential

Pros and Cons Summary

Agents SDK

Pros
- Easy to get started
- Strong Azure integration
- Flexible design

Cons
- Manual orchestration required
- Limited native multi-agent support
- Complexity increases as scenarios grow

Microsoft Agent Framework (MAF)

Pros
- Built-in orchestration
- Native multi-agent support
- Scalable and structured architecture

Cons
- Learning curve for new developers
- More opinionated framework design
- Reduced low-level control compared to SDK-based approach

References and Repositories

🔗 Microsoft Agent Framework (MAF)
- Microsoft Agent Framework – GitHub Repository
- Microsoft Agent Framework Samples – Tutorials & Examples
- Workflow Samples (Multi-agent patterns)
- FoundryChatClient sample (Python)
- Agent Framework demos - GitHub Source

📘 Documentation
- Microsoft Agent Framework Overview (Microsoft Learn)
- Agent Framework + Microsoft Foundry provider docs

🔗 Azure AI Projects / Agents SDK
- Azure AI Projects SDK – Python (GitHub Source)
- Azure AI Projects Agents (.NET SDK repo)

📘 Documentation
- Azure AI Projects SDK (Python) – Microsoft Learn
- Azure AI Agents SDK – Microsoft Learn

Conclusion

Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape.

- Agents SDK enables quick and flexible agent development
- Microsoft Agent Framework enables structured, scalable agent systems

In practice, the choice depends on whether you are building a single agent feature or a multi-agent system.
Final Thought

Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence.

In a follow-up blog, I’ll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.
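The "manual orchestration" listed among the Agents SDK challenges can be pictured with a framework-free sketch. Nothing here uses either SDK: `AgentFn`, `Handoff`, `run_pipeline`, and the two stand-in agents are hypothetical names, and each "agent" is just a function from text to text. The sketch only shows the routing and context-passing code a developer ends up owning when handoffs are not built in.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

# An "agent" here is any function that takes text and returns text.
AgentFn = Callable[[str], str]


@dataclass
class Handoff:
    """One step of the pipeline: who ran, what they received, what they produced."""
    agent: str
    received: str
    produced: str


@dataclass
class PipelineResult:
    final: str
    trace: List[Handoff] = field(default_factory=list)


def run_pipeline(agents: List[Tuple[str, AgentFn]], user_message: str) -> PipelineResult:
    """Call each agent in order, handing the previous output to the next agent."""
    result = PipelineResult(final=user_message)
    for name, agent in agents:
        produced = agent(result.final)
        result.trace.append(Handoff(name, result.final, produced))
        result.final = produced
    return result


# Stand-in agents; a real version would call a model here.
def researcher(text: str) -> str:
    return f"notes({text})"


def writer(text: str) -> str:
    return f"summary({text})"


if __name__ == "__main__":
    outcome = run_pipeline(
        [("ResearcherAgent", researcher), ("WriterAgent", writer)],
        "migration best practices",
    )
    print(outcome.final)
    for step in outcome.trace:
        print(f"{step.agent}: {step.received!r} -> {step.produced!r}")
```

Even this linear case needs its own trace and context plumbing; branching, retries, and shared memory would all be added by hand in the same way, which is the work an orchestration framework is meant to absorb.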
ChaitanyaThalloory
Jun 08, 2026 | Microsoft Developer Community Blog
Harness-Driven Agents: Secure Podcast Pipeline in Hyperlight MicroVM Sandbox
The moment the agent reached for rm -rf

For most of 2024 and 2025, "agents" were a demo word. By 2026 they are something you run: autonomously, in a loop, executing code they wrote themselves a second ago.

I was watching one work late one night. I had given it a goal, a handful of tools, and the freedom to write and run its own Python. For twenty minutes it was magic: read a file, reason about it, write a script, run it, inspect the output, correct itself, try again. Then it produced this:

    import shutil
    shutil.rmtree("/")  # "cleaning up temporary files"

It was trying to be helpful: it had decided the workspace was cluttered and wanted a clean start. The "workspace," as far as that process was concerned, was my entire machine.

I killed it in time. But the lesson is the one every agent builder eventually arrives at: the model is not the dangerous part; the execution is. A chatbot that answers wrong is annoying. An agent that fetches a web page, runs code, and writes files has a blast radius. The bounding box has to come from infrastructure, not from a system prompt.

harnessagent_sandbox_demo is a concrete build that puts that bounding box in exactly the right place, and it does it in service of a real, charming little product: a daily five-minute Mandarin podcast about the FIFA World Cup 2026.

The scenario: a daily World Cup podcast, written by agents

Strip away the infrastructure for a second and look at what this thing actually does. Every day it produces a fresh Mandarin podcast script about the FIFA World Cup 2026. Three LLM agents run in sequence:

- SearchAgent: goes out and gathers the day's World Cup news.
- ContentAgent: turns that raw material into structured podcast content.
- GenScriptAgent: writes the final, readable five-minute script.

The output is two text files, one in Simplified Chinese and one in Traditional Chinese:

    ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt
    ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt

That's the whole product.
It sounds simple, and the point of the project is that making it safe is the hard part. SearchAgent has to reach the open internet. All three agents write and run code. If you wire that naively, you have just built the exact machine that types shutil.rmtree("/") for you.

So the entire architecture is organized around one principle: the agents get to do real work, but every dangerous capability is fenced behind a hardware boundary.

Why the obvious sandboxes fall short for agents

An agent is defined by an act-observe-correct loop running untrusted, model-generated code over and over. That single property breaks most conventional isolation choices.

Option | Why it falls short for agents
No sandbox | One rm -rf, one leaked .env, one rogue network call: the blast radius is the whole machine.
Container | Great for shipping apps, but a coding agent wants to build and run its own container, which means Docker-in-Docker and elevated privileges that quietly undo the isolation.
WASM / V8 isolate | Fast to start, but you isolate a language runtime, not an OS: no system packages, no arbitrary shell, and hardening the engine is a moving target.
Full VM | Rock-solid isolation, but cold starts in seconds and heavy memory: exactly the friction that pushes developers to skip isolation entirely.

Each option trades away safety, speed, or compatibility. A podcast pipeline that runs every day, spinning agents up and down, needs all three at once:

- A real environment, to fetch URLs, run shells, call tools.
- A hard boundary, so a bad step can't reach the host.
- Near-instant lifecycle, because a slow sandbox is a sandbox developers skip, and an unused safety feature protects nobody.

The MicroVM answer, embedded as a library: Hyperlight

A MicroVM gives each workload its own kernel and a hardware-enforced boundary (the isolation strength of a full VM), stripped down to start in milliseconds and tear down just as fast. Misbehave inside, and you hit a wall; there is no path back to the host.
And it is disposable by design: when an agent goes off the rails, you delete the sandbox and reopen in milliseconds, with nothing to clean up.

Most MicroVM runtimes (Firecracker and friends) are cloud infrastructure, running server-side. Hyperlight is different: a lightweight Virtual Machine Manager (a CNCF sandbox project) designed to be embedded inside your application, like a library.

- MicroVMs that boot in milliseconds, with guest function calls completing in microseconds.
- No guest kernel, no OS: the guest is a purpose-built no_std Rust/C binary. Nothing in there to attack.
- Sandboxed by default: no filesystem, no network, nothing, unless explicitly granted.
- Typed function calls across the VM boundary, and snapshot/restore to rewind to a clean state between calls.
- Runs on KVM, MSHV (Microsoft Hypervisor), and Windows Hypervisor Platform.

This project uses the Wasm backend: the three agents share a single HyperlightRuntime, and the guest is reset to a clean snapshot before every code execution. That detail is what makes a daily, many-step pipeline cheap: you capture the sandbox state once and rewind to it, instead of rebuilding a VM hundreds of times.

Agent = Model + Harness

The community has converged on a simple equation: Agent = Model + Harness.

The model is a brain in a jar: text in, text out, no memory between calls, no loop, no hands. It can express the intent to call a tool; it cannot actually call it.

The harness is the execution layer: it calls the model, handles its tool calls, and decides when to stop. As the Hugging Face glossary puts it, "if you're not the model, you're the harness."

That reframes the safety problem precisely. When my agent emitted shutil.rmtree("/"), the model deleted nothing; it merely suggested. The harness would have run it. The harness is where reasoning meets reality, so it is exactly where safety must live. The question stops being "how do I make the model safer?" and becomes: how do I build a harness that executes the model's intent inside a boundary it cannot escape?

The Microsoft Agent Framework answers that with first-class agent harness capabilities in Python and .NET, and it ships with one security note stated plainly:

For local shell execution, we recommend running this logic in an isolated environment and keeping explicit approval in place before commands are allowed to run.

The harness is the steering wheel; it does not pretend to be the seatbelt and the crumple zone. For that, it points you outward: run this somewhere isolated. Hyperlight is that isolated somewhere. This project snaps the two pieces together.

The architecture: two planes, one bridge

Here is the heart of the design. Two planes run together every episode:

- An orchestration plane on the host: the WorkflowBuilder graph, the LLM clients, and the deterministic save step.
- An execution plane inside one Hyperlight Wasm sandbox: the only place LLM-generated code is allowed to run.

The single bridge between them is one call: call_tool("fetch_url", ...).
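The two-plane contract can be illustrated with a toy model in plain Python. This is not the Hyperlight or Agent Framework API: `HostBridge`, `register`, and the fake `fetch_url` below are hypothetical stand-ins, and `exec` with a restricted namespace is NOT a security boundary (in the project, isolation comes from the MicroVM). The sketch only shows the shape of the design: guest code receives a single `call_tool` entry point, and the host decides which tool names exist behind it.

```python
import io
from contextlib import redirect_stdout
from typing import Any, Callable, Dict


class HostBridge:
    """Toy host plane: owns the tools and exposes one entry point to guest code."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.call_counts: Dict[str, int] = {}

    def register(self, name: str, func: Callable[..., Any]) -> None:
        """Grant a host-side tool under a given name."""
        self._tools[name] = func

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        """The single bridge: only explicitly granted tool names can be reached."""
        if name not in self._tools:
            raise PermissionError(f"tool not granted: {name}")
        self.call_counts[name] = self.call_counts.get(name, 0) + 1
        return self._tools[name](**kwargs)

    def execute_code(self, source: str) -> str:
        """Run guest code with only print and call_tool visible; return its stdout.

        A fresh namespace per call loosely mirrors the clean-snapshot reset.
        This is an illustration only; exec is not a sandbox.
        """
        namespace = {"__builtins__": {"print": print}, "call_tool": self.call_tool}
        buffer = io.StringIO()
        with redirect_stdout(buffer):
            exec(source, namespace)
        return buffer.getvalue()


if __name__ == "__main__":
    bridge = HostBridge()
    # Fake tool for the demo; no network access happens here.
    bridge.register("fetch_url", lambda url: {"STATUS": "ok", "URL": url})
    guest_code = (
        'page = call_tool("fetch_url", url="https://www.bbc.com/sport")\n'
        'print(page["STATUS"])\n'
    )
    print(bridge.execute_code(guest_code), end="")
    print(bridge.call_counts)
```

Counting calls inside `call_tool` rather than around `execute_code` is deliberate: guest-initiated tool calls never pass through whatever wraps the outer call, so they need their own observation point on the host side.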
The mapping to layers:

Layer | Component | Role
Model | Azure AI Foundry via FoundryChatClient (AzureCliCredential) | The reasoning brain behind each harness agent
Agent runtime | Microsoft Agent Framework create_harness_agent | Drives the model, advertises skills, handles tool calls, decides when to stop
Orchestration | WorkflowBuilder graph | prepare → SearchAgent → adapt → ContentAgent → adapt → GenScriptAgent → save_scripts
Code execution | CodeAct provider | Runs model-written code via the one execute_code tool, inside the MicroVM, never on the host
Isolation | Hyperlight Wasm MicroVM | One shared HyperlightRuntime; clean snapshot restored before every execute_code
Host tool | fetch_url (sandbox/podcast_tools.py) | The only network path; urllib + a BBC-only allow-list
Persistence | save_scripts Executor | Deterministic, no LLM; parses two fenced blocks and writes the two output files

The four invariants that make it safe

The README is explicit about what the diagram guarantees. These four invariants are the whole security argument.

- The model never sees the network. Its only tool is execute_code. Network access happens only when the guest itself runs call_tool("fetch_url", ...) from inside the sandbox. The model cannot reach the internet directly; it can only ask the guest to, and the guest can only reach BBC.
- One sandbox per run, snapshot per call. All three agents share the same HyperlightRuntime. Before every execute_code, the guest is reset to a clean snapshot, so nothing one step does can leak into the next, and there is no VM to rebuild.
- Two counter paths, and why there are two. The function_middleware (make_tool_call_recorder) sees the model-direct execute_code calls. But the inner, guest-initiated fetch_url is dispatched by Hyperlight straight to the FunctionTool, bypassing the middleware entirely. So a second counter, make_call_tool_counter(on_call=), bumps state["tool_call_counts"][<agent>]["fetch_url"] on every guest invocation.
Two observation points, because the architecture has two genuinely different call surfaces.

Deterministic save, with no LLM in the persistence step. GenScriptAgent only emits text. The save_scripts Executor parses the two fenced code blocks out of that text and writes the simplified and traditional files itself. There is no model in the loop when bytes hit disk, so the output path is fully predictable.

Now let's look at the real code surface

The README documents the API the demo is built on. The snippets below reflect that surface.

1. Install and environment

    pip install agent-framework-hyperlight --pre

    # Hyperlight needs a hypervisor: KVM on Linux, WHP on Windows. macOS is not yet supported.
    # The model runs on Azure AI Foundry; FoundryChatClient authenticates via AzureCliCredential.
    az login
    export HYPERLIGHT_PYTHON_GUEST_PATH="/path/to/python_guest"

2. A harness agent that carries only a stub; skills do the rest

Each of the three agents is built with create_harness_agent + FoundryChatClient. The agents themselves carry only a tiny stub instruction; their real role prompts and the shared sandbox/CodeAct guardrails live as file-based Agent Skills under skills/. The harness's built-in SkillsProvider advertises those SKILL.md packages, and the model loads them at runtime via load_skill.

    from agent_framework import create_harness_agent
    from agent_framework.foundry import FoundryChatClient
    from azure.identity import AzureCliCredential

    # Model on Azure AI Foundry, not Azure OpenAI directly.
    client = FoundryChatClient(credential=AzureCliCredential())

    # The agent carries a tiny stub. Its real persona ("you gather World Cup
    # news", "you write the script") lives in a SKILL.md package under skills/,
    # advertised by the harness SkillsProvider and pulled in via load_skill.
    search_agent = create_harness_agent(
        chat_client=client,
        name="SearchAgent",
        instructions="You are a harness agent. Load your skill, then begin.",
    )

3. The CodeAct surface: one tool the model can see

This is the CodeAct pattern from 02-agents/context_providers/code_act/code_act.py. The model sees exactly one tool: execute_code. Any extra capability (here, only fetch_url) is reachable from inside the guest via call_tool(...).

    # What the MODEL sees and writes: one script, not ten tool round-trips.
    #
    # Inside execute_code, running in the Hyperlight Wasm guest:
    page = call_tool("fetch_url", url="https://www.bbc.com/sport/football/world-cup")
    # ... parse page["BODY"], pull out today's stories ...
    print(top_stories)
    #
    # execute_code is the ONLY tool on the model's surface.
    # call_tool("fetch_url", ...) is reachable only from inside the sandbox.

4. The one host tool, with a BBC-only allow-list

fetch_url lives on the host (sandbox/podcast_tools.py). It is the single bridge across the boundary, and it is deliberately narrow.

    import urllib.request
    from urllib.parse import urlparse

    ALLOWED_DOMAINS = {"bbc.com", "www.bbc.com"}  # allow-list: BBC only

    # The _extract_* helpers used below are not shown in this snippet.

    def fetch_url(url: str) -> dict:
        """The ONLY network path out of the sandbox. Host-side, allow-listed."""
        host = urlparse(url).netloc
        if host not in ALLOWED_DOMAINS:
            return {"STATUS": "blocked", "URL": url}
        with urllib.request.urlopen(url, timeout=20) as resp:
            body = resp.read(8192).decode("utf-8", "ignore")  # BODY capped at ~8 KB
        return {
            "STATUS": "ok",
            "URL": url,
            "TITLE": _extract_title(body),
            "DESCRIPTION": _extract_description(body),
            "LINKS": _extract_links(body),
            "BODY": body,
        }

Notice what this buys you: even if SearchAgent writes hostile code, the worst it can do over the network is read BBC, 8 KB at a time. The allow-list is host-side and the model never sees it, so it cannot be prompt-injected away.

5. Wiring the graph and the deterministic save

    from agent_framework import WorkflowBuilder

    workflow = (
        WorkflowBuilder()
        .add_node("prepare", prepare)
        .add_node("SearchAgent", search_agent)
        .add_node("adapt_1", adapt)
        .add_node("ContentAgent", content_agent)
        .add_node("adapt_2", adapt)
        .add_node("GenScriptAgent", genscript_agent)
        .add_node("save_scripts", save_scripts)  # deterministic Executor, NO LLM
        .build()
    )

    # GenScriptAgent emits text containing two fenced blocks (simplified +
    # traditional). save_scripts parses them and writes the files itself;
    # there is no model in the persistence step.
    # Note: await must run inside an async function.
    await workflow.run()
    # -> ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt
    # -> ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt

6. The payoff

Run that shutil.rmtree("/") inside this pipeline now and the result is delightfully boring: the agent deletes its own throwaway sandbox, the host never notices, and the next execute_code starts from a clean snapshot.

Two things to call out:

Snapshot/restore means every code execution starts from a clean, reusable baseline: capture state once, rewind between calls, instead of rebuilding the whole VM. For a daily pipeline that runs the act-observe-correct loop many times, that is the difference between "fast enough to always use" and "slow enough to skip."

Because each agent writes one script instead of ten round-tripped tool calls, the CodeAct approach keeps both latency and token usage down: the model reasons once and lets the guest do the busywork behind the boundary.

Where it fits, and the one idea to keep

harnessagent_sandbox_demo lives inside Multi-AI-Agents-Cloud-Native, a gallery of patterns for running agent systems safely on Azure: A2A multi-agent orchestration, the Kubernetes sidecar pattern, hardened pipelines, and a sibling sample that runs Copilot agents on AKS inside Kata Containers MicroVMs at the pod level.
And the README is explicit that this design is cloud-native: running it in-cluster on AKS changes nothing about the architecture. It is the same WorkflowBuilder graph, the same Hyperlight sandbox, the same deterministic save_scripts executor. The local build and the in-cluster build are the same shape.

The two MicroVM samples are two ends of one spectrum. The Kata sample puts the boundary around the whole pod, which is a deployment topology. This Hyperlight demo pulls the boundary all the way into the agent process itself, so the sandbox becomes a library call. Same question (where do you place the hardware boundary in an agent stack?), answered at two different altitudes.

The old pitch for sandboxing always carried an asterisk: yes, it's safer, but you'll pay in speed, compatibility, or friction. MicroVMs erase the asterisk: VM-grade isolation, cold starts fast enough that there's no reason to skip it, and a real environment your agents can actually work in. Enough of a real environment, in fact, to write you a World Cup podcast every morning.

The one idea to internalize: the harness decides, the MicroVM contains. Give your agent a room where it is allowed to fail, then let it be brilliant.

References

Project: harnessagent_sandbox_demo · Multi-AI-Agents-Cloud-Native
Hyperlight: hyperlight-dev/hyperlight · hyperlight-dev/hyperlight-sandbox
Agent Framework: Agent Harness in Microsoft Agent Framework
Background: Why MicroVMs (Docker) · Harness vs. Scaffold glossary (Hugging Face)
Install: pip install agent-framework-hyperlight --pre · .NET: dotnet add package Microsoft.Agents.AI.Hyperlight --prerelease
Requirements: KVM (Linux) or WHP (Windows); macOS not yet supported.
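
Appendix sketch: the deterministic save. The article describes save_scripts as a step that parses two fenced blocks out of GenScriptAgent's text and writes two files with no model in the loop. The following is a minimal sketch of that idea, not the demo's actual code: the function names (extract_fenced_blocks, save_two_scripts), the regular expression, and the suffixes parameter are all assumptions made for illustration.

```python
import re
from pathlib import Path

# Build the fence marker at runtime so this example can itself live inside a fence.
FENCE = "`" * 3

# Opening fence with an optional language tag, then a lazy capture up to the
# next closing fence. DOTALL lets "." span newlines.
BLOCK_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def extract_fenced_blocks(text: str) -> list[str]:
    """Return the body of every fenced block, in order of appearance."""
    return [body.strip() for body in BLOCK_RE.findall(text)]


def save_two_scripts(
    text: str, out_dir: Path, stamp: str, suffixes: tuple[str, str]
) -> tuple[Path, Path]:
    """Write exactly two fenced blocks to <out_dir>/<stamp>.<suffix>.

    Hypothetical stand-in for the save step: plain parsing and file I/O,
    so the same input text always produces the same files.
    """
    blocks = extract_fenced_blocks(text)
    if len(blocks) != 2:
        # Refuse to guess: anything other than two blocks is treated as an error.
        raise ValueError(f"expected exactly 2 fenced blocks, found {len(blocks)}")
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = tuple(out_dir / f"{stamp}.{suffix}" for suffix in suffixes)
    for path, block in zip(paths, blocks):
        path.write_text(block + "\n", encoding="utf-8")
    return paths[0], paths[1]
```

Failing loudly when the block count is not two is a design choice in this sketch: a deterministic step that silently wrote one file or picked the "best" two of three would bring guesswork back into the persistence path.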
kinfey
Jun 04, 2026 Place Microsoft Developer Community Blog