azure ai foundry
230 Topics

Enterprise-ready Claude Desktop with Entra ID, APIM, and Microsoft Foundry (No Backend Required)
How I put corporate sign-in in front of Claude Desktop without writing a single line of backend code.

TL;DR: In this post, I show how to securely enable Claude Desktop in enterprise environments using Microsoft Entra ID, Azure API Management, and Microsoft Foundry, without deploying a custom backend. This approach removes API keys from endpoints, enforces per-user identity, and aligns with Zero Trust principles.

Who this is for:
- Enterprise architects evaluating secure AI client patterns
- Developers enabling Claude Desktop in regulated environments
- Platform teams standardizing identity and governance for LLM access

Why this post exists: Microsoft Learn's Configure Claude Desktop with Foundry Models only shows the API-key path: a shared key pasted into every user's Claude Desktop config. That's fine for a quick demo, but it's a non-starter for most enterprises (no per-user identity, no MFA / Conditional Access, hard to revoke, hard to audit). This post fills that gap: same Foundry backend, but with Microsoft Entra ID SSO in front via Azure API Management, so each user signs in with their corporate identity and zero secrets land on the laptop.

The problem

For many teams experimenting with Claude Desktop, the blocker isn't capability, it's enterprise readiness. How do you enforce identity, eliminate shared secrets, and apply governance without standing up a custom backend service to sit in front of the model?

Suppose your team wants to use Claude Desktop with your own Anthropic deployment running on Microsoft Foundry, but with a few non-negotiable requirements:
- No shared API keys floating around on developer laptops.
- Per-user identity: every request must be attributable to a real person.
- MFA and Conditional Access must apply, the same way they do for every other internal app.
- Central rate-limiting and logging: a centralized control plane for governance.
Claude Desktop 1.5+ supports a "Gateway SSO" mode where it can sign each user in with OpenID Connect and forward their token to a custom LLM gateway. Azure API Management (APIM) is a perfect fit for that gateway role: it validates the user's Entra ID token, then re-authenticates itself to Foundry behind the scenes. APIM acts as a centralized policy enforcement layer, enabling identity validation, traffic governance, and secure re-authentication to backend AI services without custom code. The end-to-end flow looks like this: %%{init: {'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'useMaxWidth': true}, 'themeVariables': {'fontSize':'16px'}} }%% flowchart TB User([Corporate user]) Claude["Claude Desktop"] Entra["Microsoft Entra ID<br/>(OIDC + MFA + Conditional Access)"] APIM["Azure API Management<br/>validate-jwt → rewrite headers<br/>(policy gateway)"] Foundry["Microsoft Foundry<br/>Claude deployment"] User -- "1. Sign in (browser PKCE)" --> Entra Entra -- "2. ID token" --> Claude Claude -- "3. POST /v1/messages<br/>Authorization: Bearer ID token" --> APIM APIM -- "4. OIDC discovery / JWKS" --> Entra APIM -- "5. x-api-key (or Managed Identity)" --> Foundry Foundry -- "6. Response" --> APIM APIM -- "7. Response" --> Claude classDef azure fill:#0a4d8c,stroke:#0a3a6b,color:#ffffff; classDef client fill:#f3f3f3,stroke:#888,color:#222; class Entra,APIM,Foundry azure; class Claude,User client; Or in plain text: Claude Desktop │ Authorization: Bearer <Entra ID token from the user's browser sign-in> ▼ Azure API Management (<your-apim>) │ ① validate-jwt → verifies user's Entra ID token │ ② re-auths to Foundry with an API key from a Named value │ Authorization stripped, x-api-key injected ▼ Microsoft Foundry /anthropic/v1/messages │ runs Claude (<your-deployment>) ▼ Response back to the user There are no API keys on user devices. Foundry's key lives only inside APIM. And every request carries the user's oid claim, so I can build dashboards and per-user quotas later. 
What you need before starting An Azure subscription with a Microsoft Foundry (AI Services) account and a Claude deployment. (Throughout this post I'll just call it Foundry.) An API Management instance, any tier. Permission to register applications in Entra ID for your tenant. Claude Desktop 1.5.0 or later. Azure CLI installed locally. Throughout this post I'll use placeholders for resource names: <apim-name> — your API Management service name <resource-group> — the resource group that holds it <foundry-account> — your Foundry account name <deployment-name> — the name of the Claude model deployment on Foundry Step 1 — Register an Entra ID app for Claude Desktop This is the OIDC client Claude Desktop signs users into. Claude Desktop requires a single-tenant, public PKCE client (no client secret) with a loopback redirect URI, configured under the Mobile and desktop applications platform in Entra ID — the only platform that allows any loopback port. I scripted it so the setup is one command and idempotent: # scripts/register-claude-entra-app.ps1 [CmdletBinding()] param( [string] $TenantId = '<your-tenant-id>', [string] $SubscriptionId = '<your-subscription-id>', [string] $ResourceGroup = '<resource-group>', [string] $ApimName = '<apim-name>', [string] $AppDisplayName = 'Claude Cowork gateway', [string] $RedirectUri = 'http://127.0.0.1/callback' ) az account set --subscription $SubscriptionId | Out-Null # 1. Create (or reuse) the app registration $appId = az ad app list --display-name $AppDisplayName --query "[0].appId" -o tsv if (-not $appId) { $appId = az ad app create --display-name $AppDisplayName ` --sign-in-audience AzureADMyOrg --query appId -o tsv } # 2. 
Configure as public PKCE client with the Mobile/Desktop redirect URI $objectId = az ad app show --id $appId --query id -o tsv $patch = @{ publicClient = @{ redirectUris = @($RedirectUri) } isFallbackPublicClient = $true } | ConvertTo-Json -Depth 5 -Compress az rest --method PATCH ` --uri "https://graph.microsoft.com/v1.0/applications/$objectId" ` --headers "Content-Type=application/json" --body $patch | Out-Null # 3. Ensure a service principal exists $sp = az ad sp list --filter "appId eq '$appId'" --query "[0].id" -o tsv if (-not $sp) { az ad sp create --id $appId | Out-Null } # 4. Push two Named values into APIM for the validate-jwt policy az apim nv create -g $ResourceGroup --service-name $ApimName ` --named-value-id entra-tenant-id --display-name entra-tenant-id ` --value $TenantId --secret false az apim nv create -g $ResourceGroup --service-name $ApimName ` --named-value-id entra-client-id --display-name entra-client-id ` --value $appId --secret false "Client ID: $appId" Run it once. The output prints the client ID you'll need in Claude Desktop later, and it leaves two Named values in APIM ( entra-tenant-id , entra-client-id ) that the gateway policy will reference. ⚠️ Common pitfall: if the redirect URI ends up under the Web platform instead of Mobile and desktop applications, Entra will demand a client secret on token exchange — Claude won't send one and you'll get Token exchange failed (HTTP 401) . The app type can't be changed after creation, so create a new app if that happens. 
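To make "public PKCE client (no client secret)" concrete, here is a minimal sketch of the S256 verifier/challenge pair defined by RFC 7636. It is the generic algorithm, not Claude Desktop's actual code (which the post does not show); the function names are made up for illustration.

```python
import base64
import hashlib
import secrets


def b64url(data: bytes) -> str:
    """Base64url-encode without '=' padding, as RFC 7636 requires."""
    return base64.urlsafe_b64encode(data).rstrip(b"=").decode("ascii")


def make_code_verifier() -> str:
    """Random verifier built from 32 bytes of entropy (43 characters)."""
    return b64url(secrets.token_bytes(32))


def code_challenge_s256(verifier: str) -> str:
    """S256 challenge: BASE64URL(SHA256(ASCII(verifier)))."""
    return b64url(hashlib.sha256(verifier.encode("ascii")).digest())
```

The client sends the challenge with the authorization request and the verifier with the token request, so the exchange is bound to the original caller without any stored secret. That is why a Web-platform registration, which expects a client secret, breaks the flow.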
Step 2 — Create the API in APIM In the portal under APIM → APIs → + Add API → HTTP: Field Value Display name Anthropic API Name anthropicapi Web service URL https://<foundry-account>.services.ai.azure.com/anthropic API URL suffix claude Subscription required Off (Entra ID is our only credential) Add two operations under it: Method URL Display name POST /v1/messages Create message GET /v1/models List models The /v1/models operation isn't strictly needed (Foundry's Anthropic surface doesn't implement it), but having it registered means you can decide later whether to stub it out or proxy it. Step 3 — Add an API key for Foundry as a Named value APIM → Named values → + Add: Name: foundry-key Type: Secret Value: paste a key from the Foundry account's Keys and Endpoint blade. This is the only place the key ever lives. Clients never see it. Alternative — keyless with Entra ID (managed identity): If you prefer not to manage a Foundry key at all, enable the APIM instance's system-assigned managed identity (APIM → Identity → System assigned → On), then grant that identity the Foundry User role on the Foundry account (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d — previously named Azure AI User; Microsoft renamed it but the ID and permissions are unchanged). In Step 4, replace the set-header that injects x-api-key with: <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="foundry-token" /> <set-header name="Authorization" exists-action="override"> <value>@("Bearer " + (string)context.Variables["foundry-token"])</value> </set-header> Then you can skip the foundry-key Named value entirely. Don't use the legacy Cognitive Services User role — per the Foundry RBAC doc, roles starting with Cognitive Services don't apply to Foundry scenarios. Step 4 — Write the gateway policy This is the core enforcement layer in the architecture. 
Open APIs → anthropicapi → All operations → Inbound processing → </> and paste: <policies> <inbound> <base /> <!-- USER → APIM: verify Entra ID token from Claude Desktop --> <validate-jwt header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized" require-scheme="Bearer"> <openid-config url="https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0/.well-known/openid-configuration" /> <audiences> <audience>{{entra-client-id}}</audience> </audiences> <issuers> <issuer>https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0</issuer> </issuers> </validate-jwt> <!-- APIM → Foundry --> <set-backend-service base-url="https://<foundry-account>.services.ai.azure.com/anthropic" /> <set-header name="x-api-key" exists-action="override"> <value>{{foundry-key}}</value> </set-header> <set-query-parameter name="api-version" exists-action="skip"> <value>2024-05-01-preview</value> </set-query-parameter> </inbound> <backend><base /></backend> <outbound><base /></outbound> <on-error><base /></on-error> </policies> Two things to notice: validate-jwt uses the OIDC discovery URL — JWKS keys are fetched and cached automatically. It rejects any token whose aud claim is not the client ID of our Entra app, which is exactly what we want. The Authorization header from the user is not forwarded — once validate-jwt succeeds, the request is re-authenticated to Foundry with x-api-key . No user token ever leaves APIM. APIM becomes the security boundary — user identity is validated at the edge, and downstream services never see or rely on user tokens. 
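As a rough mental model of what the policy above enforces on the token's claims, the following sketch mirrors the audience, issuer, and expiry checks in plain Python. It is only an illustration: it deliberately omits signature verification against the JWKS keys (which validate-jwt does perform), and `check_claims` is a hypothetical helper, not part of APIM.

```python
import time
from typing import Optional


def check_claims(claims: dict, tenant_id: str, client_id: str,
                 now: Optional[float] = None) -> list:
    """Return the reasons a token would be rejected (empty list = accepted).

    Covers claim checks only; signature validation is intentionally left out.
    """
    now = time.time() if now is None else now
    expected_issuer = f"https://login.microsoftonline.com/{tenant_id}/v2.0"
    problems = []
    if claims.get("aud") != client_id:
        problems.append("aud does not match the Entra app's client ID")
    if claims.get("iss") != expected_issuer:
        problems.append("iss is not this tenant's v2.0 issuer")
    if claims.get("exp", 0) <= now:
        problems.append("token has expired")
    return problems
```

When a sign-in succeeds but the gateway still answers 401, comparing the decoded `aud` and `iss` values against the two Named values is usually the quickest way to find the mismatch.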
Step 5 — Configure Claude Desktop Open Claude Desktop → Configure third-party inference and fill it in like this: Field Value Connection Gateway Credential kind Interactive sign-in Gateway base URL https://<apim-name>.azure-api.net/claude Client ID (the appId your script printed) Issuer URL https://login.microsoftonline.com/<tenant-id>/v2.0 Authorization URL / Token URL leave empty Bearer token ID token (default) Scopes leave default ( openid profile email offline_access ) Redirect port leave empty (ephemeral) Model discovery Off Model list → Model ID <deployment-name> (your Foundry deployment name) ℹ️ Why Model discovery is Off — Claude Desktop's discovery uses GET /v1/models , and the Foundry /anthropic surface doesn't implement that endpoint, so it 404s. Listing the model manually skips the call entirely. If you want to leave Model discovery On, stub /v1/models in APIM. Add a GET /v1/models operation to your API and give it this inbound policy that returns an Anthropic-shaped response without ever hitting the backend: <policies> <inbound> <base /> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ return new JObject( new JProperty("data", new JArray( new JObject( new JProperty("id", "<deployment-name>"), new JProperty("type", "model"), new JProperty("display_name", "Claude on Foundry"), new JProperty("created_at", "2026-01-01T00:00:00Z") ) )), new JProperty("has_more", false), new JProperty("first_id", "<deployment-name>"), new JProperty("last_id", "<deployment-name>") ).ToString(); }</set-body> </return-response> </inbound> <backend><base /></backend> <outbound><base /></outbound> <on-error><base /></on-error> </policies> Add one entry per deployment you want to expose. 
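To show what "one entry per deployment" means for the response body, here is a small sketch that builds the same shape as the stub policy for several deployments. Python is used only to illustrate the payload; the real stub stays in the APIM policy, and the deployment names in the usage note are placeholders.

```python
def models_list(deployments: list) -> dict:
    """Build an Anthropic-shaped /v1/models body with one entry per deployment."""
    if not deployments:
        raise ValueError("at least one deployment name is required")
    data = [
        {
            "id": name,
            "type": "model",
            "display_name": "Claude on Foundry",
            "created_at": "2026-01-01T00:00:00Z",  # fixed value, as in the stub policy
        }
        for name in deployments
    ]
    return {
        "data": data,
        "has_more": False,
        "first_id": deployments[0],
        "last_id": deployments[-1],
    }
```

Calling `models_list(["claude-a", "claude-b"])` yields two `data` entries, with `first_id` and `last_id` pointing at the first and last names.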
The benefit of stubbing rather than turning discovery off is that adding new models becomes a policy edit — no need to re-export and redeploy Claude Desktop config to every user. Click Apply Changes then Sign in to your organization. Your browser opens to the normal Entra sign-in page; once approved you're returned to the app, and a quick connection test runs. The success indicator is a small green banner: ✅ Inference — 1-token completion in 1449 ms · via identity provider For broader rollout, hit the Export button at the top of the configuration window — it produces a .mobileconfig (macOS) or .reg (Windows) you can push via Intune / Jamf to every user's machine. Step 6 — Verify both hops In APIM → APIs → anthropicapi → Test → POST /v1/messages I sent: Headers: anthropic-version: 2023-06-01 Body: { "model": "<deployment-name>", "max_tokens": 64, "messages": [{"role":"user","content":"hi"}] } Click Send → Trace, and look at two places: Inbound → validate-jwt: should say succeeded and show the decoded claims (your oid , email , etc.). Backend → Request: outbound URL is https://<foundry-account>.services.ai.azure.com/anthropic/v1/messages?api-version=2024-05-01-preview , with x-api-key: **** present and Authorization absent. Backend → Response: 200, with a Claude message JSON body. That confirms both halves of the chain. 
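The same verification request can also be assembled outside the portal. The sketch below only builds the request object and does not send it. It assumes you already hold a valid Entra token for the app (obtaining one is out of scope here); `build_messages_request` and the sample argument values are hypothetical.

```python
import json
import urllib.request


def build_messages_request(apim_name: str, deployment: str,
                           bearer_token: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/messages call to the APIM gateway."""
    url = f"https://{apim_name}.azure-api.net/claude/v1/messages"
    body = {
        "model": deployment,
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "hi"}],
    }
    return urllib.request.Request(
        url,
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            "Authorization": f"Bearer {bearer_token}",
            "anthropic-version": "2023-06-01",
            "Content-Type": "application/json",
        },
    )
```

Passing the result to `urllib.request.urlopen` would perform the call; a 401 at that point comes from the gateway's validate-jwt step rather than from Foundry.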
Bumps I hit along the way

A few common issues I hit during setup, shared so you can skip them:

Symptom: Claude shows "Your provider's model list hasn't loaded yet" and /v1/models returns 404
Cause: Foundry's Anthropic surface doesn't implement that endpoint
Fix: Turn Model discovery OFF in Claude Desktop and add the deployment name manually

Symptom: Claude shows "Authentication failed" even though sign-in worked
Cause: The APIM API still had Subscription required = ON, blocking the call before validate-jwt ran with 401: Access denied due to missing subscription key
Fix: Uncheck Subscription required on the API

Symptom: Portal Test panel shows "Cannot read properties of undefined (reading 'statusCode')"
Cause: The test console doesn't attach an Entra token, so validate-jwt 401s and the panel's JavaScript crashes
Fix: Comment out <validate-jwt> temporarily for portal testing, or test via curl with a real token

Symptom: OIDC discovery failed (HTTP 404) in Claude Desktop
Cause: Pasted the metadata URL into Issuer URL
Fix: Issuer must end at /v2.0, not at /.well-known/openid-configuration

Symptom: Token exchange failed (HTTP 401)
Cause: App registered under Web platform instead of Mobile and desktop applications
Fix: Create a new app with the right platform; it can't be changed

Where this leaves us

This pattern is small in moving parts but has outsized architectural impact:
- Zero secrets on endpoints. Eliminates API-key sprawl across laptops, MDM profiles, and shared vaults. The Foundry key lives only inside APIM, or disappears entirely when you switch APIM to managed identity.
- Identity, not credentials. Every Claude Desktop user authenticates against Entra ID in their browser, the same as Office or Teams. MFA, Conditional Access, and Entra ID Protection apply automatically, with no parallel auth story to maintain.
- Per-user observability built in. APIM logs carry the user's Entra oid, email, and group claims. That unlocks per-user dashboards, cost allocation, and abuse detection without any client-side instrumentation.
- Aligned with Zero Trust.
  Strong identity at the edge, no implicit trust between hops, a single policy chokepoint for inspection and rate-limiting, and full revocability through a single Enterprise Application.
- Optional but trivial keyless path. Flip APIM to system-assigned managed identity + <authentication-managed-identity resource="https://cognitiveservices.azure.com" /> and one Foundry User role assignment (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d, formerly Azure AI User) on the Foundry account. See the Foundry RBAC doc; don't use any Cognitive Services * roles for Foundry.

What I'd add next
- llm-token-limit and llm-emit-token-metric policies for per-user quotas and cost visibility.
- App Insights wiring on the API, with a workbook that pivots on the oid claim.
- Assignment required = Yes on the Entra Enterprise Application + a security group, so only approved users can sign in.
- Intune deployment of the exported .reg / .mobileconfig so the gateway URL and client ID land on devices automatically.

But that's all incremental. The hard part (getting Claude Desktop, Entra ID, APIM, and Foundry to agree on who's allowed to talk to whom) is done. Total elapsed: about an afternoon, most of it spent learning where each portal hides its switches.

Useful links
- Gateway single sign-on with your identity provider (Claude.ai Documentation)
- Configure Claude Desktop with Foundry Models (Microsoft Learn)
- Role-based access control for Microsoft Foundry (Microsoft Learn)

904 Views, 0 likes, 3 Comments

Data Visualisation / Charting in Azure Foundry
Hi Foundry community,

We are working on an agent that can query internal data sources, and we are looking for ways to visualise data (think pie charts, bar charts, etc.). This would be consumed by end users through Copilot/Teams. However, we are unable to find a way to do so, which is surprising given that you can easily create charts through M365 Copilot Chat and through Copilot Studio.

We have tried using the 'Code Interpreter' tool, but the Teams/Copilot client UIs do not render the results inline, either as an interactive chart or as an embedded image. They also do not give any option to download them.

Has anyone tackled this before? How have you been able to generate charts?

Many thanks!

13 Views, 0 likes, 0 Comments

Azure AI Foundry Agent Unable to Use Credentials Stored in Key Vault Through Playwright MCP Tool
Hello everyone,

I am trying to understand how Azure AI Foundry agents interact with Azure Key Vault when using custom MCP tools, and I would appreciate any guidance from the community.

My Setup
- Created an Azure AI Foundry agent.
- Created an Azure Key Vault and configured all permissions according to Microsoft's official documentation.
- Stored the required website credentials (username and password) in the Key Vault.
- Deployed the official Playwright MCP Docker image.
- Exposed the MCP server using ngrok and verified that the endpoint is accessible.
- Connected the MCP endpoint as a Custom MCP Tool in Azure AI Foundry.
- Performed all configuration through the Azure portal, Foundry UI, and Playground only (no SDK or custom application code involved).

The Issue

The agent can access and use the Playwright MCP tool. However, when I ask it to log in to a website using credentials that are already stored in Key Vault, it does not populate the username and password fields. My expectation was that the agent would be able to retrieve the secrets from Key Vault and provide them to the Playwright tool during execution.

Questions
- Is there currently a supported mechanism for Azure AI Foundry agents to automatically retrieve Key Vault secrets and pass them to a Custom MCP tool?
- Does the Playwright MCP Docker image have any built-in integration with Azure Key Vault?
- When using only the Foundry UI (without SDK code), can a Foundry agent securely inject Key Vault secrets into MCP tool calls?
- Are additional configurations required beyond Key Vault permissions and agent connections?
- Has anyone successfully implemented a similar setup where a Foundry agent uses credentials stored in Key Vault to perform browser automation through Playwright MCP?

Any clarification on the expected architecture and whether this scenario is currently supported in Azure AI Foundry would be greatly appreciated.

Thank you.

Unable to Connect Localhost MCP Server from Azure AI Foundry Hosted Agent (o4-mini)
I'm using the Azure AI Foundry Toolkit in VS Code and have configured an MCP server running on my local machine (localhost). When I run my Azure AI Foundry-hosted agent (o4-mini), it fails to connect to the MCP server.

Based on the error logs, it appears that the hosted agent cannot reach the localhost endpoint. My understanding is that the MCP server is running correctly locally, but the hosted agent seems unable to access services running on my machine.

Has anyone successfully connected a locally hosted MCP server to an Azure AI Foundry-hosted agent while using the Foundry Toolkit in VS Code?

48 Views, 0 likes, 1 Comment

Failed to add tool to agent - Preview Feature Required?
Hi,

We've recently run into an issue where we're no longer able to add tools to our Foundry agent. This was previously working without problems in our development environment, but now every attempt results in the following error:

"Failed to add tool to agent Request failed with status code 403."

After inspecting the request in the browser's developer console, we noticed an additional message:

"This operation requires the following opt-in preview feature(s): AgentEndpoints=V1Preview. Include the 'Foundry-Features: AgentEndpoints=V1Preview' header in your request."

How can we opt in to this Foundry preview feature, and when was this change introduced? We are unsure whether the issue is related to the missing preview feature or to some other cause of the 403.

Any help would be very much appreciated.

Kind regards,
Arne

293 Views, 1 like, 2 Comments

Mastering Query Fields in Azure AI Document Intelligence with C#
Introduction

Azure AI Document Intelligence simplifies document data extraction, with features like query fields enabling targeted data retrieval. However, using these features with the C# SDK can be tricky. This guide highlights a real-world issue, provides a corrected implementation, and shares best practices for efficient usage.

Use case scenario

In the course of Azure AI Document Intelligence development or code review, many developers have hit an error while trying to extract fields like "FullName," "CompanyName," and "JobTitle" using `AnalyzeDocumentAsync`. The error looks similar to:

    Inner Error: The parameter urlSource or base64Source is required.

This class of failure comes down to parameter errors and SDK changes. The problematic C# code typically looks like this:

    BinaryData data = BinaryData.FromBytes(Content);
    var queryFields = new List<string> { "FullName", "CompanyName", "JobTitle" };
    var operation = await client.AnalyzeDocumentAsync(
        WaitUntil.Completed,
        modelId,
        data,
        "1-2",
        queryFields: queryFields,
        features: new List<DocumentAnalysisFeature> { DocumentAnalysisFeature.QueryFields }
    );

One reason this fails is that the developer is using `Azure.AI.DocumentIntelligence v1.0.0`, where `base64Source` and `urlSource` are handled internally by the SDK. Older examples that use `AnalyzeDocumentContent` no longer apply and lead to errors.

Practical Solution
- Using AnalyzeDocumentOptions.
- Alternative method using a manual JSON payload.

Using AnalyzeDocumentOptions

The correct method uses AnalyzeDocumentOptions, which streamlines request construction, in the following steps.

Prepare the document content:

    BinaryData data = BinaryData.FromBytes(Content);

Create AnalyzeDocumentOptions:

    var analyzeOptions = new AnalyzeDocumentOptions(modelId, data)
    {
        Pages = "1-2",
        Features = { DocumentAnalysisFeature.QueryFields },
        QueryFields = { "FullName", "CompanyName", "JobTitle" }
    };

- `modelId`: Your trained model's ID.
- `Pages`: Specify pages to analyze (e.g., "1-2").
- `Features`: Enable `QueryFields`.
- `QueryFields`: Define which fields to extract.

Run the analysis:

    Operation<AnalyzeResult> operation = await client.AnalyzeDocumentAsync(
        WaitUntil.Completed,
        analyzeOptions
    );
    AnalyzeResult result = operation.Value;

Why this works:
- The SDK manages `base64Source` automatically.
- This approach matches the latest SDK standards.
- It results in cleaner, more maintainable code.

Alternative method using manual JSON payload

For advanced use cases where more control over the request is needed, you can manually create the JSON payload. For example:

    var queriesPayload = new
    {
        queryFields = new[]
        {
            new { key = "FullName" },
            new { key = "CompanyName" },
            new { key = "JobTitle" }
        }
    };
    string jsonPayload = JsonSerializer.Serialize(queriesPayload);
    BinaryData requestData = BinaryData.FromString(jsonPayload);
    var operation = await client.AnalyzeDocumentAsync(
        WaitUntil.Completed,
        modelId,
        requestData,
        "1-2",
        features: new List<DocumentAnalysisFeature> { DocumentAnalysisFeature.QueryFields }
    );

When to use the above:
- Custom request formats
- Non-standard data source integration

Key points to remember
- Check the SDK version: breaking changes exist between the preview versions and v1.0.0.
- Use built-in classes: prefer `AnalyzeDocumentOptions` for simpler, less error-prone integration.
- Provide correct document input: wrap your content in `BinaryData` or use a direct URL.

Conclusion

Using AnalyzeDocumentOptions provides a cleaner and more reliable way to work with query fields in Azure AI Document Intelligence using C#. By aligning with the latest SDK approach, developers can simplify implementation, reduce common errors, and improve code maintainability. Keeping up with SDK enhancements and recommended practices ensures more accurate and efficient document data extraction.
As Azure AI capabilities continue to evolve, adopting modern integration patterns will help you build scalable and future-ready document processing solutions with greater confidence.

Reference
- Official AnalyzeDocumentAsync Documentation.
- Official Azure SDK documentation.
- Azure Document Intelligence C# SDK support add-on query field.

477 Views, 0 likes, 0 Comments

Agents That Test Agents: A Cloud-Native Skill-Eval Harness on Foundry Hosted Agents
Skills are an agent's must-have. So test them. A skill is the lightest way to give an agent durable, reusable behavior: a SKILL.md file you author once, store centrally in Foundry's versioned Skills API, and inject into a Hosted Agent's context — no code change, no redeploy. That's why skills have quietly become standard equipment for production agents. But the moment a skill carries real behavior, a hard question follows: how do you know it still works? When you edit a skill you can't feel whether you improved it or just changed it. It might stop triggering, skip a required section, or quietly produce a worse result on one model than another. The cure is the same discipline we use for any prompt — evaluate it: run the agent, capture what happened, and grade it against a small set of checks. This is exactly what azure_skill_eval does for one concrete skill: edu-video-script, which writes an education short-video script for a given knowledge point (the sample's smoke test asks it to script the "P vs NP problem"). And it does the whole thing cloud-native, on Foundry Hosted Agents. The scenario: one skill, two models, four hosted agents The skill under test is edu-video-script. The clever part of the harness is that it doesn't just check one run — it puts the skill on a stand and stresses it from three sides, using four Foundry Hosted Agents wired together by the Agent Framework FoundryAgent: Hosted agent Role skill-eval-business-agent-gpt System under test (SUT), running edu-video-script on gpt-5.5 skill-eval-business-agent-deepseek The same skill, running on DeepSeek-V4-Pro skill-eval-attacker-agent Multi-turn adversarial prompt generator skill-eval-judge-agent LLM-as-judge that returns a rubric score as JSON Two business agents run the same skill on different models, so every case becomes an apples-to-apples comparison: which model executes this skill better? The attacker and judge are the graders. 
What we measure (define "done" first) Good evals start from a checkable definition of done — outcome, process, style, efficiency. For an education-video script that means: Did it produce a valid script (outcome)? Did it actually follow the edu-video-script template (process/style)? Does it hold up when a user pushes on it across turns (robustness)? The harness answers these with three grading layers. 1. Deterministic checks first (validator.py) The cheapest, most explainable signal: does the output match the script template the skill is supposed to produce? validator.py runs fixed, deterministic template checks — no model needed. These catch the obvious regressions instantly and never cost a token. 2. The LLM judge (skill-eval-judge-agent) Template checks answer "did it do the basics?" but not "is the script any good?" — pacing, clarity, whether it teaches the concept. For that, a dedicated judge hosted agent grades the result and returns structured JSON so scores compare cleanly across runs and models: { "overall_pass": true, "score": 100, "checks": [] } Structured output is the point: stable fields (overall_pass, score, checks) diff cleanly between GPT and DeepSeek, and between today's skill version and last week's. 3. The multi-turn attacker (test_agent.py + skill-eval-attacker-agent) A skill that looks great on a clean prompt can still fall apart when a user pushes on it. The attacker agent generates adversarial prompts for a knowledge point using a chosen strategy — for example extreme length — and keeps the pressure on across multiple turns (max_turns, default 3). This is where you find out whether edu-video-script stays on-template under stress, not just on the happy path. # the attacker takes a knowledge point + a strategy, emits one user prompt azd ai agent invoke skill-eval-attacker-agent \ "Topic: P vs. NP problem Recommended attack strategy: Extreme length Please output the unique user prompt text." 
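The first two grading layers can be sketched in a few lines of Python. This is not the sample's validator.py or its runner. The section names are hypothetical (the post does not show the real edu-video-script template), and short-circuiting before the judge is a design choice of this sketch. Only the judge fields (overall_pass, score, checks) come from the post.

```python
import json

# Hypothetical section names; the real template is not shown in the post.
REQUIRED_SECTIONS = ("Hook", "Explanation", "Summary")


def template_check(script: str, required=REQUIRED_SECTIONS) -> list:
    """Deterministic layer: return required sections missing from the script."""
    lowered = script.lower()
    return [name for name in required if name.lower() not in lowered]


def parse_judge(raw: str) -> dict:
    """Parse the judge's JSON and insist on its three stable fields."""
    verdict = json.loads(raw)
    missing = {"overall_pass", "score", "checks"} - verdict.keys()
    if missing:
        raise ValueError(f"judge output missing fields: {sorted(missing)}")
    return verdict


def grade(script: str, judge_raw: str) -> dict:
    """Cheap check first; only consult the judge verdict if the template passes."""
    missing = template_check(script)
    if missing:
        return {"passed": False, "stage": "template", "missing": missing}
    verdict = parse_judge(judge_raw)
    return {"passed": bool(verdict["overall_pass"]), "stage": "judge",
            "score": verdict["score"]}
```

In a real harness the judge agent would simply not be invoked when the template check fails, which is where the token savings come from.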
The eval loop, end to end runner.py is a ghcsdk-style pipeline that runs cases × models, with each side toggleable: pick all models / GPT only / DeepSeek only, run a single case (e.g. edge-03), and switch adversarial mode, single-turn vs multi-turn, and judge grading on or off. The same switches are query parameters on POST /api/run: model, only_case, use_attack, single_turn, use_judge, max_turns. The test set lives in shared/test_cases.py — 10 built-in edge cases (edge-01 … edge-10) exported to evals/evals.json. You don't need a giant benchmark; a small, sharp set catches regressions, and you grow it whenever a real failure shows up: python -m evals.export_evals # regenerate evals/evals.json from shared/test_cases.py Every SUT call goes through runtime.py, which follows the official Agent Framework hosted-agent sample: it opens a fresh hosted session per turn, invokes via Responses, and tears the session down afterward. # shared/runtime.py — the documented Foundry hosted-agent pattern project = AIProjectClient(endpoint=FOUNDRY_PROJECT_ENDPOINT, credential=cred, allow_preview=True) agent = FoundryAgent(project_client=project, name=agent_name, # e.g. skill-eval-business-agent-gpt allow_preview=True) session = project.beta.agents.create_session(agent_name=agent_name) # ... send the (possibly adversarial) prompt, collect the Responses output ... So a single case flows: runner → business agent (skill runs) → validator → judge, optionally with the attacker driving multiple turns first. Cloud-native by design — and why that matters for eval This is the part that makes the harness production-grade rather than a laptop script. The hard parts of an eval harness — provisioning agents, recording every run, scaling trials, governing access — are handled by Azure, not by you. Foundry Hosted Agents are the runtime. The SUT, attacker, and judge all run as managed hosted agents in your Foundry project. 
You bring the skill and the cases; Foundry hosts the agents, models, and sessions. The business agents deploy with host: azure.ai.agent and docker.remoteBuild: true, so azd deploy builds the containers in Azure Container Registry — local Docker doesn't even need to be running. The UI is serverless. A FastAPI app on Azure Container Apps lets you upload evals.json, watch progress live, and browse the dashboard — scale-to-zero when no one's running evals. Every run is durable. Results land in Azure Blob Storage (skill-eval-runs), one yymmdd-XXXXXX/ folder per run, with a newest-first runs.json index. Nothing lives only in a terminal scrollback. Access is identity-based. In the cloud, a user-assigned Managed Identity carries exactly two roles — Storage Blob Data Contributor + Azure AI User; locally it's AzureCliCredential. No keys in env files. It's reproducible infra. azd up runs infra/main.bicep to stand up Storage, the container, Log Analytics, the Container Apps environment, the identity, and the role assignments in one shot. The payoff: the scores you read came from the same hosted runtime you actually ship to — not a local approximation — and the run that produced them is sitting in Blob, comparable against every run before it. Run it Local (no deploy): conda activate agentdev cd Skill_eval/azure_skill_eval pip install -r requirements.txt cp .env.example .env # FOUNDRY_PROJECT_ENDPOINT + AZURE_STORAGE_* uvicorn webapp.app:app --reload --port 8000 Open http://localhost:8000, upload evals/evals.json, pick your models and modes, and click Run. 
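With the local app running, the run switches that the UI exposes are also query parameters on POST /api/run. The sketch below only assembles such a URL. The parameter names come from the post, while the value formats (lowercase true/false for booleans, the case ID string) are assumptions, and `run_url` is a hypothetical helper.

```python
from urllib.parse import urlencode

ALLOWED = {"model", "only_case", "use_attack", "single_turn", "use_judge", "max_turns"}


def run_url(base: str = "http://localhost:8000", **switches) -> str:
    """Build the POST /api/run URL from the documented query parameters."""
    unknown = set(switches) - ALLOWED
    if unknown:
        raise ValueError(f"unknown run switches: {sorted(unknown)}")
    # Assumed encoding: booleans become lowercase 'true'/'false'.
    params = {
        key: (str(value).lower() if isinstance(value, bool) else value)
        for key, value in sorted(switches.items())
    }
    return f"{base}/api/run?{urlencode(params)}"
```

For example, `run_url(only_case="edge-03", use_judge=True, max_turns=3)` targets a single edge case with judge grading enabled and the default turn limit.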
Cloud (azd): azd auth login azd env new skill-eval-dev azd env set FOUNDRY_PROJECT_ENDPOINT https://<project>.services.ai.azure.com/api/projects/<project> azd env set MODEL_GPT gpt-5.5 azd env set MODEL_DEEPSEEK DeepSeek-V4-Pro azd up Provision the skill once, deploy the four hosted agents, then smoke-test them: python -m hosted_agent.provision_skills # upload edu-video-script to Foundry Skills azd deploy skill-eval-business-agent-gpt azd deploy skill-eval-business-agent-deepseek azd deploy skill-eval-attacker-agent azd deploy skill-eval-judge-agent azd ai agent invoke skill-eval-business-agent-gpt "Here is a script for an educational short video on the P vs. NP problem." Read the results Each run is self-contained on Blob: summary.json gives you the headline — pass rate and judge averages — and the per-{case}__{model}.json files let you open any single result and see exactly what the skill produced and why it passed or failed. The dashboard streams these straight from Blob via /api/runs/{run_id}/files/{filename}. Because GPT and DeepSeek ran the same cases, the comparison is right there in one folder. Takeaways A skill you can't evaluate is a skill you can't trust. edu-video-script is treated like code — versioned in Foundry, run, and graded. Stack your graders cheap-to-expensive. Deterministic template checks first (validator.py), then an LLM judge for quality, then a multi-turn attacker for robustness. Make the judge return structured JSON. overall_pass / score / checks compare cleanly across models and skill versions. Compare models on the same skill. Running GPT-5.5 and DeepSeek-V4-Pro side by side turns "which model?" from a guess into a measured answer. Let the platform carry the harness. Foundry Hosted Agents are the runtime; Azure Container Apps, Blob Storage, Managed Identity, and azd/Bicep make the whole loop reproducible and durable. Write the skill. Then build the harness that proves it. 
On Foundry, that second step is mostly configuration, and the result is a skill you can actually trust in production.

Conclusion

Skills moved agent behavior out of code and into versioned Markdown: a huge win for reuse, but only if you can prove a skill still works after every edit. azure_skill_eval answers that for edu-video-script by treating evaluation as a first-class, repeatable step rather than a gut check. The shape is simple and worth copying for any skill of your own:

- Pin down "done" as checkable criteria, then encode a small set of sharp cases (here, 10 edge cases).
- Grade in layers, cheap to expensive: deterministic template checks, then a structured LLM-judge rubric, then a multi-turn adversarial pass.
- Run the same cases across models (GPT-5.5 vs DeepSeek-V4-Pro) so model choice becomes a measurement, not a guess.
- Let the cloud carry it: Foundry Hosted Agents as the runtime, FastAPI on Azure Container Apps for the UI, Blob Storage for durable runs, Managed Identity for access, and azd/Bicep so the whole thing is reproducible.

The result is a feedback loop where every skill change is confirmed, every regression is visible, and every score traces back to the same hosted runtime you ship to. That's the difference between building skills and being able to trust them, and on Foundry the gap between the two is mostly configuration.

Sample code: https://github.com/kinfey/Multi-AI-Agents-Cloud-Native/tree/main/code/Skill_eval

Deploying Foundry Hosted Agents from Source Code
Introduction

At Microsoft Build, it was announced that Foundry Hosted Agents now support source-code deployments. Previously, Hosted Agents required application code to be packaged in a container for deployment. This new functionality allows you to deploy the agent from a `.zip` file instead of from a container image.

This post walks through the process of deploying a source-code Hosted Agent, briefly compares that approach to container-based Hosted Agent deployment, and provides a reusable GitHub Action for CI/CD deployments. It is part of a series of posts whose source code is housed in the simple-hosted-agent-responses repository. If Hosted Agents are new to you, read the previous posts, "Deploying Foundry Hosted Agents via REST API" and "GitHub Actions for Deploying Hosted Agents."

Background

A Foundry Hosted Agent abstracts the management of the compute tier for your agent. It runs in a self-contained Micro-VM sandbox, meaning the Hosted Agent sandbox provides the CPU and memory allocation used to run your agent. Previously, this Micro-VM would download your code from an Azure Container Registry (ACR) and run it on the virtualized platform.

Not all customers use container-based workloads today and, let's face it, not everything needs to be a container. So how do those customers and platforms take advantage of Foundry Hosted Agents? The answer is through source-code deployments of Foundry Hosted Agents.

What is a Source Code Agent?

Source Code Agents are like other Foundry Hosted Agents. The key deployment difference is that the code asset is a `.zip` file instead of a container image. This also changes the Agent Development Lifecycle compared with the containerized version of Foundry Hosted Agents.

An important point of clarity: the way the agent is configured is a data plane operation. As such, taking advantage of Source Code Agent functionality does not require changes to the Foundry infrastructure itself when your Infrastructure as Code (IaC) is only provisioning the supporting resources in Bicep, Terraform, or PowerShell. The deployment change happens through the Foundry data plane.

First, let's look at a container-based Foundry Hosted Agent:

Now, let's compare it to the source-code version:

Deployment Process

Now that we've looked at the end result, let's talk through the steps required to deploy a Foundry Hosted Agent via source code. So in Foundry, what does the difference between a container-based and a source-code-based Foundry Hosted Agent look like? The Microsoft Learn docs outline this well:

Every source-code deployment follows the same sequence: package -> create or update -> poll until active -> invoke. The source-code path uses `code_configuration` in the agent definition; the image-based path uses `container_configuration` instead. The two are mutually exclusive on a single version.

To confirm this and see more detail, refer to the Foundry Agent REST API documentation.

The source layout can stay familiar, but the deployed artifact changes to a `.zip` file. Packaging the source code into a ZIP is the piece that differs from the container-image flow. The agent deployment to Foundry is also slightly different because it uses source-code configuration instead of container configuration.

You can run this via `azd` with a command structured like the following:

    azd ai agent init --no-prompt --project-id "<project-resource-id>" --deploy-mode code --runtime python_3_13 --entry-point main.py

This assumes `azd` is installed and authenticated, and that the authenticated identity has access to the Foundry project. The command initializes a code deployment for the project.

However, we recognize that the majority of enterprise organizations will want to use other deployment methods.
As such, REST API deployments are supported, as are the Python and C# SDKs for creating the agent. Taking this a step further, and similar to "GitHub Actions for Deploying Hosted Agents," let's create a reusable GitHub Action for deploying source-code-based Hosted Agents.

GitHub Action

If you want to see the entire action, it is part of the simple-hosted-agent-responses repository, which contains source code, IaC, and deployment options.

Background

First, we need to understand that we cannot reuse the GitHub Action from "GitHub Actions for Deploying Hosted Agents" because, as noted above, the REST API uses mutually exclusive options. In theory, we could add conditional logic across the parameters; however, it is cleaner to create a separate action.

Before invoking this action, the workflow must authenticate to Azure because the action calls `az account get-access-token` to acquire a token for the Foundry data plane.

Inputs

    inputs:
      project_endpoint:
        description: Foundry project endpoint URL
        required: true
      agent_name:
        description: Name of the hosted agent
        required: true
      source_code_zip:
        description: Path to the local source-code zip artifact
        required: true
      model_deployment_name:
        description: Name of the AI model deployment
        required: true
      cpu:
        description: CPU allocation for the hosted agent container
        required: false
        default: '0.25'
      memory:
        description: Memory allocation for the hosted agent container
        required: false
        default: '0.5Gi'
      runtime:
        description: Source-code runtime for the hosted agent
        required: false
        default: 'python_3_13'
      entry_point:
        description: Source-code entry point command for the hosted agent
        required: false
        default: '["python", "main.py"]'
      dependency_resolution:
        description: How Agent Service resolves dependencies for the source-code deployment
        required: false
        default: 'remote_build'
      max_polling_seconds:
        description: Maximum time to wait for the source-code deployment to reach active status
        required: false
        default: '600'

For our inputs, `project_endpoint`, `agent_name`, `source_code_zip`, and `model_deployment_name` are required. The CPU, memory, runtime, entry point, dependency resolution, and max polling values are configurable properties with defaults set in the action.

The source-code-specific inputs populate the `code_configuration` properties of the REST payload. These include `source_code_zip`, `runtime`, `entry_point`, and `dependency_resolution`. This information tells Foundry how to run the code from the `.zip` package.

Outputs

We should output values that make sense for downstream workflows. Not every workflow will use them, but it is useful to expose non-secret values when they can support later steps. In this case, we are creating a new version of the agent, so let's output that version ID.

    outputs:
      agent_version:
        description: Version ID returned by the Foundry data plane
        value: ${{ steps.post.outputs.agent_version }}

Action

The action maps the inputs to environment variables as the first step. After that, it gets an access token from Azure and calls the REST API endpoint. Once we have this, we prepare the body of the call. Verify against the API for all valid properties. For this example, I chose not to set `rai_config` and `tools` to keep things simple.
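The request body that the action assembles with jq can be pictured in Python first. This is a sketch that mirrors the field names and defaults used in the action that follows; the `build_metadata` helper and the `my-model-deployment` value are hypothetical, and the set of valid properties should still be checked against the Foundry Agent REST API reference.

```python
import json


def build_metadata(model_deployment_name: str,
                   cpu: str = "0.25",
                   memory: str = "0.5Gi",
                   runtime: str = "python_3_13",
                   entry_point: str = '["python", "main.py"]',
                   dependency_resolution: str = "remote_build") -> dict:
    """Return the metadata part of the multipart request (hypothetical helper)."""
    # The action input is a JSON string, so parse it into a real list.
    entry_point_list = json.loads(entry_point)
    if not isinstance(entry_point_list, list):
        raise ValueError("entry_point must be a JSON array")
    return {
        "description": "Hosted agent deployed from source code",
        "definition": {
            "kind": "hosted",
            "protocol_versions": [{"protocol": "responses", "version": "1.0.0"}],
            "cpu": cpu,
            "memory": memory,
            # Source-code path: code_configuration only, no container_configuration.
            "code_configuration": {
                "runtime": runtime,
                "entry_point": entry_point_list,
                "dependency_resolution": dependency_resolution,
            },
            "environment_variables": {
                "AZURE_AI_MODEL_DEPLOYMENT_NAME": model_deployment_name,
            },
        },
    }


if __name__ == "__main__":
    print(json.dumps(build_metadata("my-model-deployment"), indent=2))
```

Parsing `entry_point` before building the document serves the same purpose as the `--argjson` step in the action: the entry point reaches the API as an array, not as a quoted string.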
    runs:
      using: composite
      steps:
        - name: Create source-code metadata
          id: metadata
          shell: bash
          env:
            AGENT_NAME: ${{ inputs.agent_name }}
            MODEL_DEPLOYMENT_NAME: ${{ inputs.model_deployment_name }}
            CPU: ${{ inputs.cpu }}
            MEMORY: ${{ inputs.memory }}
            RUNTIME: ${{ inputs.runtime }}
            ENTRY_POINT: ${{ inputs.entry_point }}
            DEPENDENCY_RESOLUTION: ${{ inputs.dependency_resolution }}
          run: |
            METADATA_FILE=$(mktemp)
            ENTRY_POINT_JSON=$(python3 -c 'import json,sys; print(json.dumps(json.loads(sys.argv[1])))' "$ENTRY_POINT")
            jq -n \
              --arg model "$MODEL_DEPLOYMENT_NAME" \
              --arg cpu "$CPU" \
              --arg memory "$MEMORY" \
              --arg runtime "$RUNTIME" \
              --arg dep_resolution "$DEPENDENCY_RESOLUTION" \
              --argjson entry_point "$ENTRY_POINT_JSON" \
              '{
                description: "Hosted agent deployed from source code",
                definition: {
                  kind: "hosted",
                  protocol_versions: [{protocol: "responses", version: "1.0.0"}],
                  cpu: $cpu,
                  memory: $memory,
                  code_configuration: {
                    runtime: $runtime,
                    entry_point: $entry_point,
                    dependency_resolution: $dep_resolution
                  },
                  environment_variables: {AZURE_AI_MODEL_DEPLOYMENT_NAME: $model}
                }
              }' > "$METADATA_FILE"
            echo "metadata_file=${METADATA_FILE}" >> "$GITHUB_OUTPUT"
            echo "Metadata file created at ${METADATA_FILE}"

        - name: Post source-code agent deployment to Foundry data plane
          id: post
          shell: bash
          env:
            PROJECT_ENDPOINT: ${{ inputs.project_endpoint }}
            AGENT_NAME: ${{ inputs.agent_name }}
            SOURCE_CODE_ZIP: ${{ inputs.source_code_zip }}
            METADATA_FILE: ${{ steps.metadata.outputs.metadata_file }}
            MAX_POLLING_SECONDS: ${{ inputs.max_polling_seconds }}
          run: |
            if [[ ! -f "$SOURCE_CODE_ZIP" ]]; then
              echo "Error: Source code zip not found at ${SOURCE_CODE_ZIP}"
              exit 1
            fi

            CODE_ZIP_SHA256=$(sha256sum "$SOURCE_CODE_ZIP" | awk '{print $1}')
            echo "Source code SHA256: ${CODE_ZIP_SHA256}"

            FOUNDRY_TOKEN=$(az account get-access-token \
              --resource "https://ai.azure.com/" \
              --query accessToken -o tsv)

            # POST /agents/{name}/versions auto-creates the agent if it doesn't
            # exist and adds a new version if it does, so a single call covers
            # both first-deploy and update scenarios (matches update-agent).
            HTTP_STATUS=$(curl -s -o /tmp/source_code_response.json \
              -w "%{http_code}" \
              -X POST \
              "${PROJECT_ENDPOINT}/agents/${AGENT_NAME}/versions?api-version=2025-11-15-preview" \
              -H "Authorization: Bearer ${FOUNDRY_TOKEN}" \
              -H "Accept: application/json" \
              -H "Foundry-Features: CodeAgents=V1Preview,HostedAgents=V1Preview" \
              -H "x-ms-agent-name: ${AGENT_NAME}" \
              -H "x-ms-code-zip-sha256: ${CODE_ZIP_SHA256}" \
              -F "metadata=@${METADATA_FILE};type=application/json" \
              -F "code=@${SOURCE_CODE_ZIP};type=application/zip;filename=${AGENT_NAME}.zip")

            echo "HTTP ${HTTP_STATUS}: $(cat /tmp/source_code_response.json)"
            if [[ "$HTTP_STATUS" -lt 200 || "$HTTP_STATUS" -ge 300 ]]; then
              echo "Error: Foundry data plane returned HTTP ${HTTP_STATUS}"
              exit 1
            fi

            RESPONSE=$(cat /tmp/source_code_response.json)
            AGENT_VERSION=$(echo "$RESPONSE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["version"])')
            echo "agent_version=${AGENT_VERSION}" >> "$GITHUB_OUTPUT"
            echo "Agent version resolved as ${AGENT_VERSION}"

            START_TIME=$(date +%s)
            while true; do
              ELAPSED=$(($(date +%s) - START_TIME))
              if [[ $ELAPSED -gt $MAX_POLLING_SECONDS ]]; then
                echo "Error: Agent version did not reach active state within ${MAX_POLLING_SECONDS} seconds"
                exit 1
              fi

              VERSION_STATUS=$(curl -s \
                -X GET \
                "${PROJECT_ENDPOINT}/agents/${AGENT_NAME}/versions/${AGENT_VERSION}?api-version=2025-11-15-preview" \
                -H "Authorization: Bearer ${FOUNDRY_TOKEN}" \
                -H "Accept: application/json" \
                -H "Foundry-Features: CodeAgents=V1Preview,HostedAgents=V1Preview" \
                | python3 -c 'import sys,json; data=json.load(sys.stdin); print(data.get("status", "unknown"))' 2>/dev/null)

              echo "Current status: ${VERSION_STATUS} (elapsed ${ELAPSED}s)"
              if [[ "$VERSION_STATUS" == "active" ]]; then
                echo "Agent version ${AGENT_VERSION} is active"
                break
              fi
              if [[ "$VERSION_STATUS" == "failed" ]]; then
                echo "Error: Agent version reached failed status"
                exit 1
              fi
              sleep 5
            done

Building the Source-Code Artifact

Before calling the source-code Hosted Agent action, create the ZIP artifact that will be passed into `source_code_zip`.

    source-code:
      name: Build source-code artifact
      runs-on: ubuntu-latest
      permissions:
        contents: read
      steps:
        - name: Checkout
          uses: actions/checkout@v6
        - name: Create source-code zip artifact
          run: |
            git archive --format=zip --output=source-code.zip HEAD:src/agent-framework/responses/basic
        - name: Upload source-code artifact
          uses: actions/upload-artifact@v7
          with:
            name: source-code
            path: source-code.zip

Calling the Action

Now that we have the action, how can we scale this across multiple workflows? We pass in the required parameters and the ZIP artifact path.

    - name: Update agent with source code
      uses: ./.github/actions/update-agent-source-code
      with:
        project_endpoint: ${{ needs.deploy-iac.outputs.project_endpoint }}
        # Source-code agent shares the same Foundry project as the image-based
        # agent; the `-src` suffix keeps them as distinct agent versions.
        agent_name: ${{ inputs.agent_name }}-src
        source_code_zip: ./.artifacts/source-code/source-code.zip
        model_deployment_name: ${{ needs.deploy-iac.outputs.model_deployment_name }}

And just to show we can call the same action multiple times, here are two examples that do just that: Deploy (Bicep) and Deploy (Terraform).

Conclusion

Source-code deployments give Foundry Hosted Agents another deployment path for teams that do not want, or do not need, to package every agent as a container image.
By using a .zip artifact, teams can keep a familiar source-code packaging flow while still taking advantage of the managed compute abstraction that Hosted Agents provide.

The reusable GitHub Action shown in this post turns that deployment process into a repeatable CI/CD step: package the source code, post the deployment to the Foundry data plane, poll until the new version is active, and expose the resulting agent version for downstream workflow steps. This keeps the deployment flexible while fitting into existing enterprise pipeline patterns.

For organizations already using container-based Hosted Agents, source-code deployments do not replace that model; they expand the options available. Choose the deployment approach that best fits how your teams package, govern, and operate their agent workloads.

Building Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I’ve been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF).

Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach.

Approach 1: Agents SDK

Agents SDK provides a straightforward way to create agents with integrated tools and models.

Example: Creating an Agent

    import os

    from azure.ai.projects import AIProjectClient
    from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType
    from azure.identity import DefaultAzureCredential

    client = AIProjectClient(
        credential=DefaultAzureCredential(),
        endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
    )

    # Configure tools
    # conn_id is the ID of the Azure AI Search connection; it is defined elsewhere.
    ai_search = AzureAISearchTool(
        index_connection_id=conn_id,
        index_name="my-index",
        query_type=AzureAISearchQueryType.SEMANTIC,
    )

    # Create agent (persisted in Foundry portal)
    agent = client.agents.create_agent(
        model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"),
        name="MyAgent",
        instructions="You are a helpful assistant.",
        tool_resources=ai_search.resources,
        tools=ai_search.definitions,
    )

    # Run conversation
    thread = client.agents.threads.create()
    client.agents.messages.create(thread_id=thread.id, role="user", content="Hello")
    run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id)

What this approach provides

- Native integration with Azure AI services (OpenAI, AI Search, MCP)
- Managed execution environment
- Simple and quick agent setup

Conceptually, this approach can be summarized as: Model + Tools + Execution

Strengths

✅ Rapid development and onboarding
✅ Strong integration within the Azure ecosystem
✅ Well-suited for single-agent or tool-driven use cases
✅ Minimal infrastructure overhead

Challenges observed in practice

As the complexity of scenarios increases, certain limitations become more visible:

- Multi-agent workflows require custom orchestration logic
- Agent handoffs must be implemented manually
- Context sharing across agents requires additional design effort

While this approach offers flexibility, it shifts orchestration complexity to the developer.

Approach 2: Microsoft Agent Framework (MAF)

Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design.

Creating an Agent

    import asyncio
    import os

    from agent_framework import Agent, WorkflowBuilder, Message
    from agent_framework.foundry import FoundryChatClient
    from azure.identity import DefaultAzureCredential

    client = FoundryChatClient(
        project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"),
        model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"),
        credential=DefaultAzureCredential(),
    )

    # Create agents (in-process only, not persisted in portal)
    researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.")
    writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.")

    # Build and run multi-agent workflow
    workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build()

    async def main():
        # async for is only valid inside a coroutine, so the loop lives in main().
        async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True):
            print(event.content)

    asyncio.run(main())

What this approach provides

- Built-in orchestration capabilities
- Native support for multi-agent workflows
- Structured agent lifecycle management
- Context and memory handling

Conceptually, this can be viewed as: Agents + Orchestration + System Design

Observations from implementation

When implementing similar use cases using MAF:

- Agent responsibilities became clearly defined
- Routing and delegation patterns were significantly simplified
- Overall system architecture became easier to maintain and scale

This approach encourages thinking in terms of agent ecosystems rather than isolated agents.

Architecture Comparison

Agents SDK

Microsoft Agent Framework (MAF)

Choosing the Right Approach

Use Agents SDK when:

- You need rapid development for a single-agent use case
- The workflow is relatively straightforward
- You prefer flexibility and lower-level control

Use Microsoft Agent Framework when:

- You are designing multi-agent systems
- Your solution requires routing, delegation, or handoffs
- Long-term scalability and maintainability are essential

Pros and Cons Summary

Agents SDK

Pros
- Easy to get started
- Strong Azure integration
- Flexible design

Cons
- Manual orchestration required
- Limited native multi-agent support
- Complexity increases as scenarios grow

Microsoft Agent Framework (MAF)

Pros
- Built-in orchestration
- Native multi-agent support
- Scalable and structured architecture

Cons
- Learning curve for new developers
- More opinionated framework design
- Reduced low-level control compared to SDK-based approach

References and Repositories

🔗 Microsoft Agent Framework (MAF)
- Microsoft Agent Framework – GitHub Repository
- Microsoft Agent Framework Samples – Tutorials & Examples
- Workflow Samples (Multi-agent patterns)
- FoundryChatClient sample (Python)
- Agent Framework demos - GitHub Source

📘 Documentation
- Microsoft Agent Framework Overview (Microsoft Learn)
- Agent Framework + Microsoft Foundry provider docs

🔗 Azure AI Projects / Agents SDK
- Azure AI Projects SDK – Python (GitHub Source)
- Azure AI Projects Agents (.NET SDK repo)

📘 Documentation
- Azure AI Projects SDK (Python) – Microsoft Learn
- Azure AI Agents SDK – Microsoft Learn

Conclusion

Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape.

- Agents SDK enables quick and flexible agent development
- Microsoft Agent Framework enables structured, scalable agent systems

In practice, the choice depends on whether you are building a single agent feature or a multi-agent system.
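To make that choice more concrete, the plain-Python sketch below shows the kind of glue code that "manual orchestration" implies on the SDK path: the developer fixes the order of agents and carries context from one to the next. The agents here are ordinary functions standing in for model calls, and none of the names (`run_pipeline`, `researcher`, `writer`) belong to either SDK.

```python
from typing import Callable, List, Tuple

# An "agent" here is just a function from input text to output text.
AgentFn = Callable[[str], str]


def researcher(task: str) -> str:
    # Stand-in for a model call that gathers notes on the task.
    return f"notes on: {task}"


def writer(notes: str) -> str:
    # Stand-in for a model call that turns notes into a summary.
    return f"summary of ({notes})"


def run_pipeline(task: str, steps: List[Tuple[str, AgentFn]]) -> Tuple[str, List[str]]:
    """Hand the output of each agent to the next and record who ran."""
    trace: List[str] = []
    payload = task
    for name, agent in steps:
        payload = agent(payload)  # manual handoff: output becomes the next input
        trace.append(name)
    return payload, trace


if __name__ == "__main__":
    result, trace = run_pipeline(
        "migration best practices",
        [("ResearcherAgent", researcher), ("WriterAgent", writer)],
    )
    print(trace)
    print(result)
```

Everything beyond this linear loop, such as branching, retries, or shared memory, is code the team writes and maintains itself, which is the cost the comparison above weighs against a framework's declared workflow graph.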
Final Thought

Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence.

In a follow-up blog, I’ll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.

Harness-Driven Agents: Secure Podcast Pipeline in Hyperlight MicroVM Sandbox
The moment the agent reached for rm -rf For most of 2024 and 2025, "agents" were a demo word. By 2026 they are something you run — autonomously, in a loop, executing code they wrote themselves a second ago. I was watching one work late one night. I had given it a goal, a handful of tools, and the freedom to write and run its own Python. For twenty minutes it was magic: read a file, reason about it, write a script, run it, inspect the output, correct itself, try again. Then it produced this: import shutil shutil.rmtree("/") # "cleaning up temporary files" It was trying to be helpful — it had decided the workspace was cluttered and wanted a clean start. The "workspace," as far as that process was concerned, was my entire machine. I killed it in time. But the lesson is the one every agent builder eventually arrives at: the model is not the dangerous part — the execution is. A chatbot that answers wrong is annoying. An agent that fetches a web page, runs code, and writes files has a blast radius. The bounding box has to come from infrastructure, not from a system prompt. harnessagent_sandbox_demo is a concrete build that puts that bounding box in exactly the right place — and it does it in service of a real, charming little product: a daily five-minute Mandarin podcast about the FIFA World Cup 2026. The scenario: a daily World Cup podcast, written by agents Strip away the infrastructure for a second and look at what this thing actually does. Every day it produces a fresh Mandarin podcast script about the FIFA World Cup 2026. Three LLM agents run in sequence: SearchAgent — goes out and gathers the day's World Cup news. ContentAgent — turns that raw material into structured podcast content. GenScriptAgent — writes the final, readable five-minute script. The output is two text files — one in Simplified Chinese, one in Traditional Chinese: ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt That's the whole product. 
It sounds simple, and the point of the project is that making it safe is the hard part. SearchAgent has to reach the open internet. All three agents write and run code. If you wire that naively, you have just built the exact machine that types shutil.rmtree("/") for you.

So the entire architecture is organized around one principle: the agents get to do real work, but every dangerous capability is fenced behind a hardware boundary.

Why the obvious sandboxes fall short for agents

An agent is defined by an act-observe-correct loop running untrusted, model-generated code over and over. That single property breaks most conventional isolation choices.

| Option | Why it falls short for agents |
| --- | --- |
| No sandbox | One rm -rf, one leaked .env, one rogue network call: the blast radius is the whole machine. |
| Container | Great for shipping apps, but a coding agent wants to build and run its own container, which means Docker-in-Docker and elevated privileges that quietly undo the isolation. |
| WASM / V8 isolate | Fast to start, but you isolate a language runtime, not an OS: no system packages, no arbitrary shell, and hardening the engine is a moving target. |
| Full VM | Rock-solid isolation, but cold starts in seconds and heavy memory, exactly the friction that pushes developers to skip isolation entirely. |

Each option trades away safety, speed, or compatibility. A podcast pipeline that runs every day, spinning agents up and down, needs all three at once:

- A real environment, to fetch URLs, run shells, call tools.
- A hard boundary, so a bad step can't reach the host.
- Near-instant lifecycle, because a slow sandbox is a sandbox developers skip, and an unused safety feature protects nobody.

The MicroVM answer, embedded as a library: Hyperlight

A MicroVM gives each workload its own kernel and a hardware-enforced boundary (the isolation strength of a full VM), stripped down to start in milliseconds and tear down just as fast. Misbehave inside, and you hit a wall; there is no path back to the host.
And it is disposable by design: when an agent goes off the rails, you delete the sandbox and reopen in milliseconds, with nothing to clean up. Most MicroVM runtimes (Firecracker and friends) are cloud infrastructure — server-side. Hyperlight is different: a lightweight Virtual Machine Manager (a CNCF sandbox project) designed to be embedded inside your application, like a library. MicroVMs that boot in milliseconds, with guest function calls completing in microseconds. No guest kernel, no OS — the guest is a purpose-built no_std Rust/C binary. Nothing in there to attack. Sandboxed by default — no filesystem, no network, nothing, unless explicitly granted. Typed function calls across the VM boundary, and snapshot/restore to rewind to a clean state between calls. Runs on KVM, MSHV (Microsoft Hypervisor), and Windows Hypervisor Platform. This project uses the Wasm backend: the three agents share a single HyperlightRuntime, and the guest is reset to a clean snapshot before every code execution. That detail is what makes a daily, many-step pipeline cheap — you capture the sandbox state once and rewind to it, instead of rebuilding a VM hundreds of times. Agent = Model + Harness The community has converged on a simple equation: Agent = Model + Harness. The model is a brain in a jar — text in, text out, no memory between calls, no loop, no hands. It can express the intent to call a tool; it cannot actually call it. The harness is the execution layer: it calls the model, handles its tool calls, and decides when to stop. As the Hugging Face glossary puts it, "if you're not the model, you're the harness." That reframes the safety problem precisely. When my agent emitted shutil.rmtree("/"), the model deleted nothing — it merely suggested. The harness would have run it. The harness is where reasoning meets reality, so it is exactly where safety must live. The question stops being "how do I make the model safer?" 
and becomes: how do I build a harness that executes the model's intent inside a boundary it cannot escape? The Microsoft Agent Framework answers that with first-class agent harness capabilities in Python and .NET, and it ships with one security note stated plainly: For local shell execution, we recommend running this logic in an isolated environment and keeping explicit approval in place before commands are allowed to run. The harness is the steering wheel — it does not pretend to be the seatbelt and the crumple zone. For that, it points you outward: run this somewhere isolated. Hyperlight is that isolated somewhere. This project snaps the two pieces together. The architecture: two planes, one bridge Here is the heart of the design. Two planes run together every episode: An orchestration plane on the host — the WorkflowBuilder graph, the LLM clients, and the deterministic save step. An execution plane inside one Hyperlight Wasm sandbox — the only place LLM-generated code is allowed to run. The single bridge between them is one call: call_tool("fetch_url", ...). 
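The "one bridge" idea can be pictured without any Hyperlight machinery. The sketch below is a plain-Python stand-in, not the Hyperlight or Agent Framework API: guest code receives a single `call_tool` function, and the host decides which tool names exist behind it. `make_call_tool`, `fetch_url_stub`, `guest_script`, and the call counter are illustrative names, and nothing here provides real isolation.

```python
from collections import Counter
from typing import Any, Callable, Dict


def make_call_tool(tools: Dict[str, Callable[..., Any]], counts: Counter) -> Callable[..., Any]:
    """Return the only function the guest is given to reach the host."""
    def call_tool(name: str, **kwargs: Any) -> Any:
        if name not in tools:
            # Anything the host did not register simply does not exist.
            raise PermissionError(f"tool not exposed to guest: {name}")
        counts[name] += 1  # host-side count of guest-initiated calls
        return tools[name](**kwargs)
    return call_tool


def fetch_url_stub(url: str) -> dict:
    # Stand-in for the real host tool; performs no network access.
    return {"STATUS": "ok", "URL": url, "BODY": "<html>stub</html>"}


def guest_script(call_tool: Callable[..., Any]) -> str:
    # Plays the role of model-written code running inside the sandbox.
    page = call_tool("fetch_url", url="https://www.bbc.com/sport/football/world-cup")
    return page["STATUS"]


if __name__ == "__main__":
    counts: Counter = Counter()
    bridge = make_call_tool({"fetch_url": fetch_url_stub}, counts)
    print(guest_script(bridge), dict(counts))
```

In the real project the boundary is enforced by the MicroVM rather than by a Python closure; the sketch only shows the shape of the contract, in which the host owns the tool table and the guest can do nothing except name an entry in it.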
The mapping to layers:

| Layer | Component | Role |
| --- | --- | --- |
| Model | Azure AI Foundry via FoundryChatClient (AzureCliCredential) | The reasoning brain behind each harness agent |
| Agent runtime | Microsoft Agent Framework create_harness_agent | Drives the model, advertises skills, handles tool calls, decides when to stop |
| Orchestration | WorkflowBuilder graph | prepare → SearchAgent → adapt → ContentAgent → adapt → GenScriptAgent → save_scripts |
| Code execution | CodeAct provider | Runs model-written code via the one execute_code tool, inside the MicroVM, never on the host |
| Isolation | Hyperlight Wasm MicroVM | One shared HyperlightRuntime; clean snapshot restored before every execute_code |
| Host tool | fetch_url (sandbox/podcast_tools.py) | The only network path; urllib + a BBC-only allow-list |
| Persistence | save_scripts Executor | Deterministic, no LLM; parses two fenced blocks and writes the two output files |

The four invariants that make it safe

The README is explicit about what the diagram guarantees. These four invariants are the whole security argument.

- The model never sees the network. Its only tool is execute_code. Network access happens only when the guest itself runs call_tool("fetch_url", ...) from inside the sandbox. The model cannot reach the internet directly: it can only ask the guest to, and the guest can only reach BBC.
- One sandbox per run, snapshot per call. All three agents share the same HyperlightRuntime. Before every execute_code, the guest is reset to a clean snapshot, so nothing one step does can leak into the next, and there is no VM to rebuild.
- Two counter paths, and why there are two. The function_middleware (make_tool_call_recorder) sees the model-direct execute_code calls. But the inner, guest-initiated fetch_url is dispatched by Hyperlight straight to the FunctionTool, bypassing the middleware entirely. So a second counter, make_call_tool_counter(on_call=), bumps state["tool_call_counts"][<agent>]["fetch_url"] on every guest invocation.
Two observation points, because the architecture has two genuinely different call surfaces.
- Deterministic save: no LLM in the persistence step. GenScriptAgent only emits text. The save_scripts Executor parses the two fenced code blocks out of that text and writes the simplified and traditional files itself. There is no model in the loop when bytes hit disk, so the output path is fully predictable.

Now let's look at the real code surface

The README documents the API the demo is built on. The snippets below reflect that surface.

1. Install and environment

    pip install agent-framework-hyperlight --pre

    # Hyperlight needs a hypervisor: KVM on Linux, WHP on Windows. macOS is not yet supported.
    # The model runs on Azure AI Foundry; FoundryChatClient authenticates via AzureCliCredential.
    az login
    export HYPERLIGHT_PYTHON_GUEST_PATH="/path/to/python_guest"

2. A harness agent that carries only a stub; skills do the rest

Each of the three agents is built with create_harness_agent + FoundryChatClient. The agents themselves carry only a tiny stub instruction; their real role prompts and the shared sandbox/CodeAct guardrails live as file-based Agent Skills under skills/. The harness's built-in SkillsProvider advertises those SKILL.md packages, and the model loads them at runtime via load_skill.

    from agent_framework import create_harness_agent
    from agent_framework.foundry import FoundryChatClient
    from azure.identity import AzureCliCredential

    # Model on Azure AI Foundry, not Azure OpenAI directly.
    client = FoundryChatClient(credential=AzureCliCredential())

    # The agent carries a tiny stub. Its real persona ("you gather World Cup
    # news", "you write the script") lives in a SKILL.md package under skills/,
    # advertised by the harness SkillsProvider and pulled in via load_skill.
    search_agent = create_harness_agent(
        chat_client=client,
        name="SearchAgent",
        instructions="You are a harness agent. Load your skill, then begin.",
    )

3. The CodeAct surface: one tool the model can see

This is the CodeAct pattern from 02-agents/context_providers/code_act/code_act.py. The model sees exactly one tool: execute_code. Any extra capability (here, only fetch_url) is reachable from inside the guest via call_tool(...).

    # What the MODEL sees and writes: one script, not ten tool round-trips.
    # This runs inside execute_code, in the Hyperlight Wasm guest.
    page = call_tool("fetch_url", url="https://www.bbc.com/sport/football/world-cup")

    # ... parse page["BODY"], pull out today's stories ...
    print(top_stories)

    # execute_code is the ONLY tool on the model's surface.
    # call_tool("fetch_url", ...) is reachable only from inside the sandbox.

4. The one host tool, with a BBC-only allow-list

fetch_url lives on the host (sandbox/podcast_tools.py). It is the single bridge across the boundary, and it is deliberately narrow.

    import urllib.request
    from urllib.parse import urlparse

    ALLOWED_DOMAINS = {"bbc.com", "www.bbc.com"}  # allow-list: BBC only

    # _extract_title, _extract_description and _extract_links are helper
    # functions that are not shown in this excerpt.

    def fetch_url(url: str) -> dict:
        """The ONLY network path out of the sandbox. Host-side, allow-listed."""
        host = urlparse(url).netloc
        if host not in ALLOWED_DOMAINS:
            return {"STATUS": "blocked", "URL": url}
        with urllib.request.urlopen(url, timeout=20) as resp:
            body = resp.read(8192).decode("utf-8", "ignore")  # BODY capped at ~8 KB
        return {
            "STATUS": "ok",
            "URL": url,
            "TITLE": _extract_title(body),
            "DESCRIPTION": _extract_description(body),
            "LINKS": _extract_links(body),
            "BODY": body,
        }

Notice what this buys you: even if SearchAgent writes hostile code, the worst it can do over the network is read BBC, 8 KB at a time. The allow-list is host-side and the model never sees it, so it cannot be prompt-injected away.

5. Wiring the graph and the deterministic save

    from agent_framework import WorkflowBuilder

    workflow = (
        WorkflowBuilder()
        .add_node("prepare", prepare)
        .add_node("SearchAgent", search_agent)
        .add_node("adapt_1", adapt)
        .add_node("ContentAgent", content_agent)
        .add_node("adapt_2", adapt)
        .add_node("GenScriptAgent", genscript_agent)
        .add_node("save_scripts", save_scripts)  # deterministic Executor, NO LLM
        .build()
    )

    # GenScriptAgent emits text containing two fenced blocks (simplified +
    # traditional). save_scripts parses them and writes the files itself;
    # there is no model in the persistence step.
    # Note: await must be called from within an async function.
    await workflow.run()
    # -> ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt
    # -> ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt

6. The payoff

Run that shutil.rmtree("/") inside this pipeline now and the result is delightfully boring: the agent deletes its own throwaway sandbox, the host never notices, and the next execute_code starts from a clean snapshot.

Two things to call out:

- Snapshot/restore means every code execution starts from a clean, reusable baseline: capture state once, rewind between calls, instead of rebuilding the whole VM. For a daily pipeline that runs the act-observe-correct loop many times, that is the difference between "fast enough to always use" and "slow enough to skip."
- Because each agent writes one script instead of ten round-tripped tool calls, the CodeAct approach keeps both latency and token usage down: the model reasons once and lets the guest do the busywork behind the boundary.

Where it fits, and the one idea to keep

harnessagent_sandbox_demo lives inside Multi-AI-Agents-Cloud-Native, a gallery of patterns for running agent systems safely on Azure: A2A multi-agent orchestration, the Kubernetes sidecar pattern, hardened pipelines, and a sibling sample that runs Copilot agents on AKS inside Kata Containers MicroVMs at the pod level.
And the README is explicit that this design is cloud-native: running it in-cluster on AKS changes nothing about the architecture. It is the same WorkflowBuilder graph, the same Hyperlight sandbox, the same deterministic save_scripts executor. The local build and the in-cluster build are the same shape.

The two MicroVM samples are two ends of one spectrum. The Kata sample puts the boundary around the whole pod, which is a deployment topology. This Hyperlight demo pulls the boundary all the way into the agent process itself, so the sandbox becomes a library call. Same question (where do you place the hardware boundary in an agent stack?), answered at two different altitudes.

The old pitch for sandboxing always carried an asterisk: yes, it's safer, but you'll pay in speed, compatibility, or friction. MicroVMs erase the asterisk: VM-grade isolation, cold starts fast enough that there's no reason to skip it, and a real environment your agents can actually work in. Enough of a real environment, in fact, to write you a World Cup podcast every morning.

The one idea to internalize: the harness decides, the MicroVM contains. Give your agent a room where it is allowed to fail, then let it be brilliant.

References

- Project: harnessagent_sandbox_demo · Multi-AI-Agents-Cloud-Native
- Hyperlight: hyperlight-dev/hyperlight · hyperlight-dev/hyperlight-sandbox
- Agent Framework: Agent Harness in Microsoft Agent Framework
- Background: Why MicroVMs (Docker) · Harness vs. Scaffold glossary (Hugging Face)
- Install: pip install agent-framework-hyperlight --pre · .NET: dotnet add package Microsoft.Agents.AI.Hyperlight --prerelease
- Requirements: KVM (Linux) or WHP (Windows); macOS not yet supported.