microsoft foundry
Join our free livestream series on using Microsoft IQ with Python
Join us for a new 3-part livestream series where we take a deep technical look at Microsoft IQ, the knowledge layer for the next generation of AI experiences. You'll learn how Foundry IQ, Work IQ, and Fabric IQ can be used to ground AI systems in organizational knowledge, workplace context, and structured business data.

Our series will cover:

- Foundry IQ for multi-source agentic retrieval on search indexes, SharePoint, websites, and more
- Work IQ for user-specific retrieval of M365 data, like Teams chats, emails, and calendar events
- Fabric IQ for retrieval of data stored in OneLake, via Fabric ontologies and data agents
- Building agents with Microsoft Agent Framework to connect to Foundry IQ, Fabric IQ, and Work IQ

Throughout the series, we'll use Python for all examples and share full code so you can run everything yourself in your own Foundry projects.

Register for the full series.

In addition to the live streams, you can also join the Microsoft Foundry Discord to ask follow-up questions after each stream.

If you are new to generative AI with Python, start with our 9-part Python + AI series, which covers topics such as LLMs, embeddings, RAG, tool calling, MCP, and agents. If you are new to Microsoft Agent Framework, watch our 6-part Python + Agent series, which dives deep into agents and workflows.

To learn more about each live stream or register for individual sessions, scroll down:

Day 1: Foundry IQ
28 July, 2026 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time
Register for the stream on Reactor

In the first session of our Microsoft IQ Deep Dive with Python series, we'll kick things off with an introduction to the Microsoft IQ family: Foundry IQ, Work IQ, Fabric IQ, and Web IQ. We'll then take a deeper look at Foundry IQ (Azure AI Search), exploring how it helps agents and applications work with curated knowledge and organizational context.
We'll build a knowledge base and connect it to multiple knowledge sources, including the new IQs, MCP servers, and search indexes built from ingested data. Then we'll perform multi-source agentic retrieval on the knowledge base, which executes queries in parallel and merges the results with state-of-the-art ranking models. Finally, we will build an agent in Python using Microsoft Agent Framework and ground the agent's responses in results from the Foundry IQ knowledge base.

All code demos will use Python and will be available in an open-source repository for you to deploy yourself. After the stream, join office hours in the Microsoft Foundry Discord to ask follow-up questions.

Day 2: Work IQ
29 July, 2026 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time
Register for the stream on Reactor

In the second session of our Microsoft IQ Deep Dive with Python series, we'll focus on Work IQ and how it brings workplace context into AI-powered experiences. We'll explore how developers can use Work IQ through APIs, A2A patterns, MCP integration, and tool-based workflows. We'll look at two practical tool examples, then show how Work IQ can be used from Copilot and from a Microsoft Agent Framework agent.

All code demos will use Python and will be available in an open-source repository for you to deploy yourself. After the stream, join office hours in the Microsoft Foundry Discord to ask follow-up questions.

Day 3: Fabric IQ
30 July, 2026 | 5:00 PM - 6:00 PM (UTC) Coordinated Universal Time
Register for the stream on Reactor

In the final session of our Microsoft IQ Deep Dive with Python series, we'll explore Fabric IQ and how it connects AI experiences to structured business data. We'll introduce the key concepts behind Fabric IQ, including ontologies and data agents, and show how they help describe, organize, and reason over operational data stored in OneLake.
We'll use the Microsoft Fabric API SDK in Python to connect to Fabric IQ, so that we can programmatically configure ontologies and answer questions about our data.

All code demos will use Python and will be available in an open-source repository for you to deploy yourself. After the stream, join office hours in the Microsoft Foundry Discord to ask follow-up questions.

Enterprise-ready Claude Desktop with Entra ID, APIM, and Microsoft Foundry (No Backend Required)
How I put corporate sign-in in front of Claude Desktop without writing a single line of backend code.

TL;DR: In this post, I show how to securely enable Claude Desktop in enterprise environments using Microsoft Entra ID, Azure API Management, and Microsoft Foundry, without deploying a custom backend. This approach removes API keys from endpoints, enforces per-user identity, and aligns fully with Zero Trust principles.

Who this is for:

- Enterprise architects evaluating secure AI client patterns
- Developers enabling Claude Desktop in regulated environments
- Platform teams standardizing identity and governance for LLM access

Why this post exists: Microsoft Learn's Configure Claude Desktop with Foundry Models only shows the API-key path, in which a shared key is pasted into every user's Claude Desktop config. That's fine for a quick demo, but it's a non-starter for most enterprises (no per-user identity, no MFA / Conditional Access, hard to revoke, hard to audit). This post fills that gap: same Foundry backend, but with Microsoft Entra ID SSO in front via Azure API Management, so each user signs in with their corporate identity and zero secrets land on the laptop.

The problem

For many teams experimenting with Claude Desktop, the blocker isn't capability; it's enterprise readiness. How do you enforce identity, eliminate shared secrets, and apply governance without standing up a custom backend service to sit in front of the model?

Suppose your team wants to use Claude Desktop with your own Anthropic deployment running on Microsoft Foundry, with a few non-negotiable requirements:

- No shared API keys floating around on developer laptops.
- Per-user identity: every request must be attributable to a real person.
- MFA and Conditional Access must apply, the same way they do for every other internal app.
- Central rate-limiting and logging: a centralized control plane for governance.
Claude Desktop 1.5+ supports a "Gateway SSO" mode where it can sign each user in with OpenID Connect and forward their token to a custom LLM gateway. Azure API Management (APIM) is a perfect fit for that gateway role: it validates the user's Entra ID token, then re-authenticates itself to Foundry behind the scenes. APIM acts as a centralized policy enforcement layer, enabling identity validation, traffic governance, and secure re-authentication to backend AI services without custom code.

The end-to-end flow looks like this:

    %%{init: {'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'useMaxWidth': true}, 'themeVariables': {'fontSize':'16px'}} }%%
    flowchart TB
        User([Corporate user])
        Claude["Claude Desktop"]
        Entra["Microsoft Entra ID<br/>(OIDC + MFA + Conditional Access)"]
        APIM["Azure API Management<br/>validate-jwt → rewrite headers<br/>(policy gateway)"]
        Foundry["Microsoft Foundry<br/>Claude deployment"]
        User -- "1. Sign in (browser PKCE)" --> Entra
        Entra -- "2. ID token" --> Claude
        Claude -- "3. POST /v1/messages<br/>Authorization: Bearer ID token" --> APIM
        APIM -- "4. OIDC discovery / JWKS" --> Entra
        APIM -- "5. x-api-key (or Managed Identity)" --> Foundry
        Foundry -- "6. Response" --> APIM
        APIM -- "7. Response" --> Claude
        classDef azure fill:#0a4d8c,stroke:#0a3a6b,color:#ffffff;
        classDef client fill:#f3f3f3,stroke:#888,color:#222;
        class Entra,APIM,Foundry azure;
        class Claude,User client;

Or in plain text:

    Claude Desktop
       │  Authorization: Bearer <Entra ID token from the user's browser sign-in>
       ▼
    Azure API Management (<your-apim>)
       │  ① validate-jwt → verifies user's Entra ID token
       │  ② re-auths to Foundry with an API key from a Named value
       │     Authorization stripped, x-api-key injected
       ▼
    Microsoft Foundry /anthropic/v1/messages
       │  runs Claude (<your-deployment>)
       ▼
    Response back to the user

There are no API keys on user devices. Foundry's key lives only inside APIM. And every request carries the user's oid claim, so I can build dashboards and per-user quotas later.
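To make the "Authorization stripped, x-api-key injected" step concrete, here is a minimal Python sketch of the header rewrite the gateway performs. It only illustrates the idea; it is not how APIM is implemented, and the function name `rewrite_headers` and the placeholder values are mine, not part of any Azure SDK.

```python
def rewrite_headers(incoming: dict, foundry_key: str) -> dict:
    """Illustrative only: drop the caller's bearer token, add the backend key."""
    # HTTP header names are case-insensitive, so compare in lower case.
    outgoing = {
        name: value
        for name, value in incoming.items()
        if name.lower() != "authorization"
    }
    # The backend credential is added by the gateway, never by the client.
    outgoing["x-api-key"] = foundry_key
    return outgoing


if __name__ == "__main__":
    request_headers = {
        "Authorization": "Bearer <user-id-token>",  # placeholder value
        "anthropic-version": "2023-06-01",
    }
    print(rewrite_headers(request_headers, "<foundry-key>"))
```

The point of the sketch is the direction of trust: the user's token stops at the gateway, and the only credential the backend ever sees is one the gateway owns.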
What you need before starting

- An Azure subscription with a Microsoft Foundry (AI Services) account and a Claude deployment. (Throughout this post I'll just call it Foundry.)
- An API Management instance, any tier.
- Permission to register applications in Entra ID for your tenant.
- Claude Desktop 1.5.0 or later.
- Azure CLI installed locally.

Throughout this post I'll use placeholders for resource names:

- <apim-name>: your API Management service name
- <resource-group>: the resource group that holds it
- <foundry-account>: your Foundry account name
- <deployment-name>: the name of the Claude model deployment on Foundry

Step 1: Register an Entra ID app for Claude Desktop

This is the OIDC client Claude Desktop signs users into. Claude Desktop requires a single-tenant, public PKCE client (no client secret) with a loopback redirect URI, configured under the Mobile and desktop applications platform in Entra ID, the only platform that allows any loopback port. I scripted it so the setup is one command and idempotent:

    # scripts/register-claude-entra-app.ps1
    [CmdletBinding()]
    param(
        [string] $TenantId       = '<your-tenant-id>',
        [string] $SubscriptionId = '<your-subscription-id>',
        [string] $ResourceGroup  = '<resource-group>',
        [string] $ApimName       = '<apim-name>',
        [string] $AppDisplayName = 'Claude Cowork gateway',
        [string] $RedirectUri    = 'http://127.0.0.1/callback'
    )

    az account set --subscription $SubscriptionId | Out-Null

    # 1. Create (or reuse) the app registration
    $appId = az ad app list --display-name $AppDisplayName --query "[0].appId" -o tsv
    if (-not $appId) {
        $appId = az ad app create --display-name $AppDisplayName `
            --sign-in-audience AzureADMyOrg --query appId -o tsv
    }

    # 2. Configure as public PKCE client with the Mobile/Desktop redirect URI
    $objectId = az ad app show --id $appId --query id -o tsv
    $patch = @{
        publicClient           = @{ redirectUris = @($RedirectUri) }
        isFallbackPublicClient = $true
    } | ConvertTo-Json -Depth 5 -Compress
    az rest --method PATCH `
        --uri "https://graph.microsoft.com/v1.0/applications/$objectId" `
        --headers "Content-Type=application/json" --body $patch | Out-Null

    # 3. Ensure a service principal exists
    $sp = az ad sp list --filter "appId eq '$appId'" --query "[0].id" -o tsv
    if (-not $sp) { az ad sp create --id $appId | Out-Null }

    # 4. Push two Named values into APIM for the validate-jwt policy
    az apim nv create -g $ResourceGroup --service-name $ApimName `
        --named-value-id entra-tenant-id --display-name entra-tenant-id `
        --value $TenantId --secret false
    az apim nv create -g $ResourceGroup --service-name $ApimName `
        --named-value-id entra-client-id --display-name entra-client-id `
        --value $appId --secret false

    "Client ID: $appId"

Run it once. The output prints the client ID you'll need in Claude Desktop later, and it leaves two Named values in APIM (entra-tenant-id, entra-client-id) that the gateway policy will reference.

⚠️ Common pitfall: if the redirect URI ends up under the Web platform instead of Mobile and desktop applications, Entra will demand a client secret on token exchange. Claude won't send one and you'll get Token exchange failed (HTTP 401). The app type can't be changed after creation, so create a new app if that happens.
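If "public PKCE client (no client secret)" is unfamiliar, the sketch below shows the S256 verifier/challenge pair that any PKCE client generates, following RFC 7636. This is a generic illustration of the mechanism, not Claude Desktop's actual code; the function names are my own.

```python
import base64
import hashlib
import secrets


def make_code_verifier() -> str:
    """Random, URL-safe secret kept by the client for one sign-in attempt."""
    # 32 random bytes encode to 43 base64url characters once padding is removed.
    return base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier: str) -> str:
    """S256 challenge: base64url(SHA-256(verifier)) without '=' padding."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    verifier = make_code_verifier()
    # The challenge goes in the authorization request; the verifier is only
    # revealed later, on the token request, which proves the same client
    # started and finished the flow.
    print("code_challenge:", make_code_challenge(verifier))
```

Because the proof is the verifier rather than a stored secret, nothing sensitive has to ship inside the desktop app, which is why the Web platform's demand for a client secret breaks the flow.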
Step 2: Create the API in APIM

In the portal under APIM → APIs → + Add API → HTTP:

| Field | Value |
| --- | --- |
| Display name | Anthropic API |
| Name | anthropicapi |
| Web service URL | https://<foundry-account>.services.ai.azure.com/anthropic |
| API URL suffix | claude |
| Subscription required | Off (Entra ID is our only credential) |

Add two operations under it:

| Method | URL | Display name |
| --- | --- | --- |
| POST | /v1/messages | Create message |
| GET | /v1/models | List models |

The /v1/models operation isn't strictly needed (Foundry's Anthropic surface doesn't implement it), but having it registered means you can decide later whether to stub it out or proxy it.

Step 3: Add an API key for Foundry as a Named value

APIM → Named values → + Add:

- Name: foundry-key
- Type: Secret
- Value: paste a key from the Foundry account's Keys and Endpoint blade.

This is the only place the key ever lives. Clients never see it.

Alternative, keyless with Entra ID (managed identity): If you prefer not to manage a Foundry key at all, enable the APIM instance's system-assigned managed identity (APIM → Identity → System assigned → On), then grant that identity the Foundry User role on the Foundry account (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d, previously named Azure AI User; Microsoft renamed it but the ID and permissions are unchanged). In Step 4, replace the set-header that injects x-api-key with:

    <authentication-managed-identity resource="https://cognitiveservices.azure.com"
        output-token-variable-name="foundry-token" />
    <set-header name="Authorization" exists-action="override">
        <value>@("Bearer " + (string)context.Variables["foundry-token"])</value>
    </set-header>

Then you can skip the foundry-key Named value entirely. Don't use the legacy Cognitive Services User role: per the Foundry RBAC doc, roles starting with Cognitive Services don't apply to Foundry scenarios.

Step 4: Write the gateway policy

This is the core enforcement layer in the architecture.
Open APIs → anthropicapi → All operations → Inbound processing → </> and paste:

    <policies>
        <inbound>
            <base />
            <!-- USER → APIM: verify Entra ID token from Claude Desktop -->
            <validate-jwt header-name="Authorization"
                          failed-validation-httpcode="401"
                          failed-validation-error-message="Unauthorized"
                          require-scheme="Bearer">
                <openid-config url="https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0/.well-known/openid-configuration" />
                <audiences>
                    <audience>{{entra-client-id}}</audience>
                </audiences>
                <issuers>
                    <issuer>https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0</issuer>
                </issuers>
            </validate-jwt>
            <!-- APIM → Foundry -->
            <set-backend-service base-url="https://<foundry-account>.services.ai.azure.com/anthropic" />
            <!-- Remove the user's token explicitly so it is not forwarded to the backend -->
            <set-header name="Authorization" exists-action="delete" />
            <set-header name="x-api-key" exists-action="override">
                <value>{{foundry-key}}</value>
            </set-header>
            <set-query-parameter name="api-version" exists-action="skip">
                <value>2024-05-01-preview</value>
            </set-query-parameter>
        </inbound>
        <backend><base /></backend>
        <outbound><base /></outbound>
        <on-error><base /></on-error>
    </policies>

Two things to notice:

- validate-jwt uses the OIDC discovery URL, so JWKS keys are fetched and cached automatically. It rejects any token whose aud claim is not the client ID of our Entra app, which is exactly what we want.
- The Authorization header from the user is not forwarded. Once validate-jwt succeeds, the header is deleted and the request is re-authenticated to Foundry with x-api-key. No user token ever leaves APIM.

APIM becomes the security boundary: user identity is validated at the edge, and downstream services never see or rely on user tokens.
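For intuition about what the audience and issuer checks amount to, here is a small Python sketch that applies the same comparisons to an already-decoded claims dictionary. It deliberately skips signature verification against the JWKS keys, which the real policy does and which you must never skip in production. The function name `check_claims` and the sample values are hypothetical.

```python
import time
from typing import Optional


def check_claims(claims: dict, tenant_id: str, client_id: str,
                 now: Optional[float] = None) -> list:
    """Return a list of problems; an empty list means the claims look acceptable.

    Assumption: `claims` is the decoded payload of a token whose signature
    has ALREADY been verified elsewhere. This sketch does not verify it.
    """
    problems = []
    expected_issuer = f"https://login.microsoftonline.com/{tenant_id}/v2.0"
    if claims.get("iss") != expected_issuer:
        problems.append("issuer mismatch")

    # 'aud' may be a single string or a list of strings.
    aud = claims.get("aud")
    audiences = aud if isinstance(aud, list) else [aud]
    if client_id not in audiences:
        problems.append("audience mismatch")

    current = time.time() if now is None else now
    if claims.get("exp", 0) <= current:
        problems.append("token expired")
    return problems


if __name__ == "__main__":
    sample = {
        "iss": "https://login.microsoftonline.com/<tenant-id>/v2.0",
        "aud": "<client-id>",
        "exp": time.time() + 3600,
        "oid": "<user-object-id>",
    }
    print(check_claims(sample, "<tenant-id>", "<client-id>"))
```

A token minted for a different app in the same tenant passes the issuer check but fails the audience check, which is why pinning aud to the Claude Desktop client ID matters.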
Step 5: Configure Claude Desktop

Open Claude Desktop → Configure third-party inference and fill it in like this:

| Field | Value |
| --- | --- |
| Connection | Gateway |
| Credential kind | Interactive sign-in |
| Gateway base URL | https://<apim-name>.azure-api.net/claude |
| Client ID | (the appId your script printed) |
| Issuer URL | https://login.microsoftonline.com/<tenant-id>/v2.0 |
| Authorization URL / Token URL | leave empty |
| Bearer token | ID token (default) |
| Scopes | leave default (openid profile email offline_access) |
| Redirect port | leave empty (ephemeral) |
| Model discovery | Off |
| Model list → Model ID | <deployment-name> (your Foundry deployment name) |

ℹ️ Why Model discovery is Off: Claude Desktop's discovery uses GET /v1/models, and the Foundry /anthropic surface doesn't implement that endpoint, so it 404s. Listing the model manually skips the call entirely.

If you want to leave Model discovery On, stub /v1/models in APIM. Add a GET /v1/models operation to your API and give it this inbound policy that returns an Anthropic-shaped response without ever hitting the backend:

    <policies>
        <inbound>
            <base />
            <return-response>
                <set-status code="200" reason="OK" />
                <set-header name="Content-Type" exists-action="override">
                    <value>application/json</value>
                </set-header>
                <set-body>@{
                    return new JObject(
                        new JProperty("data", new JArray(
                            new JObject(
                                new JProperty("id", "<deployment-name>"),
                                new JProperty("type", "model"),
                                new JProperty("display_name", "Claude on Foundry"),
                                new JProperty("created_at", "2026-01-01T00:00:00Z")
                            )
                        )),
                        new JProperty("has_more", false),
                        new JProperty("first_id", "<deployment-name>"),
                        new JProperty("last_id", "<deployment-name>")
                    ).ToString();
                }</set-body>
            </return-response>
        </inbound>
        <backend><base /></backend>
        <outbound><base /></outbound>
        <on-error><base /></on-error>
    </policies>

Add one entry per deployment you want to expose.
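When there are several deployments, it can help to generate the stub body rather than hand-edit it. The following Python sketch produces the same JSON shape as the policy above for a list of deployment names. The helper name `models_response` is my own, and reusing the deployment name as display_name plus the fixed created_at value are placeholder choices.

```python
import json


def models_response(deployments: list) -> str:
    """Build a /v1/models-style body with one entry per deployment name."""
    data = [
        {
            "id": name,
            "type": "model",
            "display_name": name,                  # placeholder: reuse the name
            "created_at": "2026-01-01T00:00:00Z",  # placeholder timestamp
        }
        for name in deployments
    ]
    body = {
        "data": data,
        "has_more": False,
        "first_id": deployments[0] if deployments else None,
        "last_id": deployments[-1] if deployments else None,
    }
    return json.dumps(body, indent=2)


if __name__ == "__main__":
    # Hypothetical deployment names, for illustration only.
    print(models_response(["<deployment-a>", "<deployment-b>"]))
```

The printed JSON can be pasted into the policy as a literal body, or used as a reference while extending the JObject expression.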
The benefit of stubbing rather than turning discovery off is that adding new models becomes a policy edit, with no need to re-export and redeploy Claude Desktop config to every user.

Click Apply Changes then Sign in to your organization. Your browser opens to the normal Entra sign-in page; once approved you're returned to the app, and a quick connection test runs. The success indicator is a small green banner:

    ✓ Inference: 1-token completion in 1449 ms · via identity provider

For broader rollout, hit the Export button at the top of the configuration window. It produces a .mobileconfig (macOS) or .reg (Windows) you can push via Intune / Jamf to every user's machine.

Step 6: Verify both hops

In APIM → APIs → anthropicapi → Test → POST /v1/messages I sent:

    Headers:
      anthropic-version: 2023-06-01
    Body:
      {
        "model": "<deployment-name>",
        "max_tokens": 64,
        "messages": [{"role":"user","content":"hi"}]
      }

Click Send → Trace, and look at three places:

- Inbound → validate-jwt: should say succeeded and show the decoded claims (your oid, email, etc.).
- Backend → Request: outbound URL is https://<foundry-account>.services.ai.azure.com/anthropic/v1/messages?api-version=2024-05-01-preview, with x-api-key: **** present and Authorization absent.
- Backend → Response: 200, with a Claude message JSON body.

That confirms both halves of the chain.
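The same test request can be sent from outside the portal. Below is a hedged Python sketch that only builds the request using the standard library; sending it requires a real Entra ID token, and how you obtain that token is out of scope here. The function name and the placeholder arguments are mine.

```python
import json
import urllib.request


def build_messages_request(gateway_base_url: str, id_token: str,
                           deployment_name: str) -> urllib.request.Request:
    """Build (but do not send) the POST /v1/messages call against the gateway."""
    body = {
        "model": deployment_name,
        "max_tokens": 64,
        "messages": [{"role": "user", "content": "hi"}],
    }
    return urllib.request.Request(
        url=gateway_base_url.rstrip("/") + "/v1/messages",
        data=json.dumps(body).encode("utf-8"),
        method="POST",
        headers={
            # The user's token goes to APIM only; APIM swaps in its own credential.
            "Authorization": f"Bearer {id_token}",
            "anthropic-version": "2023-06-01",
            "Content-Type": "application/json",
        },
    )


if __name__ == "__main__":
    req = build_messages_request(
        "https://<apim-name>.azure-api.net/claude",  # placeholder gateway URL
        "<entra-id-token>",                          # placeholder token
        "<deployment-name>",
    )
    print(req.get_method(), req.full_url)
    # To actually send it, supply real values and then call:
    # with urllib.request.urlopen(req) as resp:
    #     print(resp.status, resp.read().decode("utf-8"))
```

A 401 from this call with a valid-looking token usually points at the audience or issuer settings rather than at Foundry, since the backend is never reached when validate-jwt fails.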
Bumps I hit along the way

A few common issues encountered during setup, shared so you can skip them:

| Symptom | Cause | Fix |
| --- | --- | --- |
| Claude shows "Your provider's model list hasn't loaded yet" and /v1/models returns 404 | Foundry's Anthropic surface doesn't implement that endpoint | Turn Model discovery OFF in Claude Desktop and add the deployment name manually |
| Claude shows "Authentication failed" even though sign-in worked | The APIM API still had Subscription required = ON, blocking the call before validate-jwt ran with 401: Access denied due to missing subscription key | Uncheck Subscription required on the API |
| Portal Test panel shows "Cannot read properties of undefined (reading 'statusCode')" | The test console doesn't attach an Entra token, so validate-jwt 401s and the panel's JavaScript crashes | Comment out <validate-jwt> temporarily for portal testing, or test via curl with a real token |
| OIDC discovery failed (HTTP 404) in Claude Desktop | Pasted the metadata URL into Issuer URL | Issuer must end at /v2.0, not at /.well-known/openid-configuration |
| Token exchange failed (HTTP 401) | App registered under Web platform instead of Mobile and desktop applications | Create a new app with the right platform; it can't be changed |

Where this leaves us

This pattern is small in moving parts but has outsized architectural impact:

- Zero secrets on endpoints. Eliminates API-key sprawl across laptops, MDM profiles, and shared vaults. The Foundry key lives only inside APIM, or disappears entirely when you switch APIM to managed identity.
- Identity, not credentials. Every Claude Desktop user authenticates against Entra ID in their browser, the same as Office or Teams. MFA, Conditional Access, and Entra ID Protection apply automatically, with no parallel auth story to maintain.
- Per-user observability built in. APIM logs carry the user's Entra oid, email, and group claims. That unlocks per-user dashboards, cost allocation, and abuse detection without any client-side instrumentation.
- Aligned with Zero Trust. Strong identity at the edge, no implicit trust between hops, single policy chokepoint for inspection and rate-limiting, and full revocability through a single Enterprise Application.
- Optional but trivial keyless path. Flip APIM to system-assigned managed identity + <authentication-managed-identity resource="https://cognitiveservices.azure.com" /> and one Foundry User role assignment (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d, formerly Azure AI User) on the Foundry account. See the Foundry RBAC doc, and don't use any Cognitive Services * roles for Foundry.

What I'd add next

- llm-token-limit and llm-emit-token-metric policies for per-user quotas and cost visibility.
- App Insights wiring on the API, with a workbook that pivots on the oid claim.
- Assignment required = Yes on the Entra Enterprise Application + a security group, so only approved users can sign in.
- Intune deployment of the exported .reg / .mobileconfig so the gateway URL and client ID land on devices automatically.

But that's all incremental. The hard part (getting Claude Desktop, Entra ID, APIM, and Foundry to agree on who's allowed to talk to whom) is done. Total elapsed: about an afternoon, most of it spent learning where each portal hides its switches.

Useful links

- Gateway single sign-on with your identity provider (Claude.ai Documentation)
- Configure Claude Desktop with Foundry Models (Microsoft Learn)
- Role-based access control for Microsoft Foundry (Microsoft Learn)

Auto-Generated Rubric Evaluators: Building Context-Aware Evaluators for AI Agents
Authors: Shuo Qiu, Sydney Lister, Ilya Matiach, Ali Mahmoudzadeh, Salma Elshafey, José Santos, Vivek Bhadauria, Morteza Ziyadi, April Kwong

Why Your Agent Needs a Task-Specific Evaluator

Picture a customer-service agent for a telecom company. A customer messages in asking to switch plans and get a refund for last month's overcharge. The agent needs to verify the customer's identity and confirm the new plan before ending the conversation. Miss the verification step and you have a security incident. Those success criteria are specific to this one scenario.

The auto-generated rubric evaluator is designed to help address this: use the context you already have to generate a task-specific rubric evaluator that returns a weighted score with per-dimension explanations and can then be reused across iterations.

How We Validated Evaluator Quality

We validate auto-generated rubric evaluators across four aspects:

- Verdict Validity: whether judgments on real cases reflect what a competent reviewer would conclude.
- Rubric Validity: whether generated rubrics capture the task requirements and failure modes.
- Manual Quality Inspection: whether judgments on real cases look right to a human reviewer.
- Reliability and Separability: whether judgments are stable across repeated runs and distinguish stronger from weaker candidate agents.

Validation Results

1. Agreement with Trusted Reference Signals

We first validate the auto-generated rubric evaluator end-to-end: we use the rubric generator to produce the rubric's dimensions, then the rubric evaluator scores each case against them. We use GPT-5.4 for both rubric generator and rubric evaluator. The first question is whether those end-to-end scores move with signals teams already trust. For example, does the rubric evaluator give lower scores to failed cases, and higher scores to successful ones?
We start by choosing benchmarks the community already uses as reference points:

| Dataset | What It Tests |
| --- | --- |
| JSON Editing | Deterministic structured-editing tasks where outputs can be checked exactly. |
| TauBench Telecom | Customer-service agent tasks requiring policy following, tool use, and task completion. |
| The Agent Company | Long-horizon workplace-agent tasks with multi-step tool use. We use InspectAI's 10-case subset. |
| BFCL Multi-Turn Tool Calling | Multi-turn function-calling behavior across realistic tool-use scenarios. |
| LiveClawBench | Open-ended web-agent tasks that require browsing, interaction, and judgment. |
| Retail-Agent Customer Service | Real production-style retail support conversations. |

We then ask the generation pipeline to generate rubric evaluators for each scenario, and measure the correlation between the evaluator's scores and the trusted reference signals. For the three datasets with per-case reference signals, we can directly check whether the evaluator gives higher scores to successful cases than failed ones.

We then create traces from different candidate agents. In these experiments, each candidate agent uses the same task setup and prompt but a different underlying model, which gives us a controlled range of stronger and weaker agent behaviors.

Because the evaluator returns a continuous score, we use receiver operating characteristic area under the curve (ROC AUC) when the trusted case-level signal can be read as success versus failure. It measures how often, when comparing a successful case with a failed case, the evaluator assigns the successful case the higher score. In these experiments, generated rubric evaluators align well with trusted signals at the case level, with ROC AUC of 0.794 on TauBench Telecom, 0.869 on The Agent Company, and 0.972 on JSON Editing.

An important goal of evaluation is for candidate agents that perform better on the reference signal to also receive higher scores from the evaluator.
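The pairwise reading of case-level ROC AUC given above can be computed directly. The Python sketch below does exactly that comparison over every (successful, failed) pair, counting ties as half a win, which is the usual convention. The scores and labels are made-up illustration data, not results from these experiments, and this is not the evaluation code used by the authors.

```python
def pairwise_auc(scores, labels):
    """ROC AUC as the fraction of (success, failure) pairs ranked correctly.

    scores: evaluator scores, one per case.
    labels: True for a successful case, False for a failed one.
    """
    successes = [s for s, ok in zip(scores, labels) if ok]
    failures = [s for s, ok in zip(scores, labels) if not ok]
    if not successes or not failures:
        raise ValueError("need at least one successful and one failed case")

    wins = 0.0
    for good in successes:
        for bad in failures:
            if good > bad:
                wins += 1.0
            elif good == bad:
                wins += 0.5  # a tie counts as half a correct ordering
    return wins / (len(successes) * len(failures))


if __name__ == "__main__":
    # Made-up example: three successful cases, two failed ones.
    example_scores = [0.9, 0.8, 0.4, 0.6, 0.3]
    example_labels = [True, True, True, False, False]
    print(round(pairwise_auc(example_scores, example_labels), 3))
```

An AUC of 0.5 means the evaluator orders success/failure pairs no better than chance, and 1.0 means every successful case outscored every failed one.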
Ranking candidate agents in the same order as the reference signal is more directly relevant when choosing among candidate agents, and it is a more forgiving test of alignment because aggregated scores are less sensitive to noise in individual judgments. We measure this with aggregate candidate-agent Spearman ρ, which checks whether the evaluator ranks candidate agents the same way as the oracle: a ρ of 1.0 means the evaluator's ranking is perfectly aligned with the oracle's, while 0 means no relationship. For BFCL and LiveClawBench, the oracle ranking comes from their official leaderboard scores.

At the aggregate candidate-agent level, Spearman ρ ranges from 0.69 on The Agent Company to 0.98 on JSON Editing across all five benchmarks. Aggregation reduces per-case noise, so the candidate-agent ranking is the more relevant view when the goal is agent selection.

2. Rubric Quality on GDPVal

GDPVal is a benchmark that measures how well AI models perform real-world, economically valuable work in sectors such as government, manufacturing, and technical services. This benchmark includes a rubric for each task, authored by a domain expert, which is useful for rubric-validity measurement. We ask the rubric generator to produce a rubric for each test case, then use a separate matching judge to match the generated dimensions to the expert dimensions. This gives us two metrics for rubric quality:

- Recall. For each annotated dimension, did at least one generated dimension express a similar requirement?
- Precision. For each generated dimension, did at least one annotated dimension express a similar requirement?

Under this setup, the generated rubric achieved 72.1% recall and 86.4% precision against the expert dimensions on GDPVal tasks.

3. Manual Quality on Retail-Agent Conversations

For a real-world retail-agent customer-service dataset, we generated a rubric with six dimensions, then graded 12 conversations over those dimensions, and manually inspected every case-by-dimension judgment.
In this small sample (12 conversations), the reviewer disagreed with only one of the 72 case-by-dimension judgments. Most neutral cases involved applicability questions that the evaluator flagged inconsistently.

Reliability and Separability

Another key question is how reliable the evaluator's scores are. We look at two things: reliability (does the same case get the same score next time?) and separability (can the evaluator confidently rank two candidate agents against each other?).

Reliability

If you re-grade the same case tomorrow, do you get the same score? We measure this two ways. Single-measure intraclass correlation, ICC(3,1), measures how much of the score variance comes from real case differences rather than repeat noise. Kendall's W measures rank reliability across repeats: 1.0 means the evaluator ranks cases in the same order every time.

On JSON Editing, ICC(3,1) is 0.852 and Kendall's W is 0.767, which means re-running the evaluator on the same case gives similar scores in this experimental setup. TauBench Telecom shows similarly strong reliability, with ICC(3,1) of 0.85 and Kendall's W of 0.89 under the same recommended configuration.

Separability

Separability measures whether the score is decisive: when you put two candidate agents side by side, can the evaluator confidently say which one is better? We report mean pairwise bootstrap confidence, which measures ranking stability. For each pair of candidate agents, we resample cases and recompute each agent's mean evaluator score. The pair confidence is the fraction of bootstrap samples supporting the more common ordering: a value near 0.5 means the ordering is unstable, while a value near 1.0 means the evaluator consistently separates that pair. We average this across all candidate-agent pairs.

The candidate-agent intervals are tight on JSON Editing and TauBench Telecom.
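The bootstrap procedure described above can be sketched in a few lines of Python. This version assumes every candidate agent was scored on the same ordered list of cases and resamples case indices jointly for both agents in a pair; that pairing, the tie handling, and the function names are my assumptions, not details confirmed by the post.

```python
import random
from itertools import combinations
from statistics import mean


def pair_confidence(scores_a, scores_b, n_boot=1000, seed=0):
    """Fraction of bootstrap resamples that support the more common ordering."""
    if len(scores_a) != len(scores_b):
        raise ValueError("both agents must be scored on the same cases")
    rng = random.Random(seed)
    n_cases = len(scores_a)
    a_ahead = 0.0
    for _ in range(n_boot):
        # Resample case indices with replacement, shared by both agents.
        idx = [rng.randrange(n_cases) for _ in range(n_cases)]
        mean_a = mean(scores_a[i] for i in idx)
        mean_b = mean(scores_b[i] for i in idx)
        if mean_a > mean_b:
            a_ahead += 1.0
        elif mean_a == mean_b:
            a_ahead += 0.5  # assumption: a tie supports neither ordering
    fraction = a_ahead / n_boot
    return max(fraction, 1.0 - fraction)


def mean_pairwise_confidence(scores_by_agent, n_boot=1000, seed=0):
    """Average pair confidence over all pairs of candidate agents."""
    pairs = combinations(sorted(scores_by_agent), 2)
    return mean(
        pair_confidence(scores_by_agent[a], scores_by_agent[b], n_boot, seed)
        for a, b in pairs
    )


if __name__ == "__main__":
    # Made-up per-case scores for three hypothetical agents.
    demo = {
        "agent_x": [0.9, 0.8, 0.85, 0.95, 0.7],
        "agent_y": [0.6, 0.5, 0.65, 0.55, 0.6],
        "agent_z": [0.3, 0.4, 0.2, 0.35, 0.3],
    }
    print(mean_pairwise_confidence(demo, n_boot=200))
```

By construction the value lies between 0.5 and 1.0, so a pair the evaluator cannot tell apart contributes about 0.5 rather than dragging the average toward zero.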
Mean pairwise bootstrap confidence is 0.96 on the JSON Editing dataset and 0.95 on the TauBench Telecom dataset.

Get Started

The auto-generated rubric evaluator's results may vary depending on task design, input quality, and evaluation setup. Start with a clear, well-defined description for your evaluation in the prompt field, include as much high-quality context as possible, such as the agent definition and examples, and review the generated rubric carefully before using it. Run it against a small set of known-good and known-bad cases to understand how the score reflects different failure modes.

Try the workflow in the Foundry portal and follow the rubric evaluator tutorial. For a demo that covers Rubric in the broader observability workflow, watch the Build breakout session From observability to ROI for AI agents on any framework. For the full set of Build observability announcements, read Build 2026: From observability to ROI for AI agents on any framework.

Deploying Foundry Hosted Agents from Source Code
Introduction

At Microsoft Build, it was announced that Foundry Hosted Agents now support source-code deployments. Previously, Hosted Agents required application code to be packaged in a container for deployment. This new functionality allows you to deploy the agent from a `.zip` file instead of from a container image.

This post walks through the process of deploying a source-code Hosted Agent, briefly compares that approach to container-based Hosted Agent deployment, and provides a reusable GitHub Action for CI/CD deployments. It is part of a series of posts whose source code is housed in the simple-hosted-agent-responses repository. If Hosted Agents are new to you, read the previous posts, "Deploying Foundry Hosted Agents via REST API" and "GitHub Actions for Deploying Hosted Agents."

Background

A Foundry Hosted Agent helps abstract the management of the compute tier for your agent. It runs in a self-contained Micro-VM sandbox, meaning the Hosted Agent sandbox provides the CPU and memory allocation used to run your agent. Previously, this Micro-VM would download your code from an Azure Container Registry (ACR) and run it on the virtualized platform.

Not all customers use container-based workloads today and, let's face it, not everything needs to be a container. So how do those customers and platforms take advantage of Foundry Hosted Agents? The answer is through source-code deployments of Foundry Hosted Agents.

What is a Source Code Agent?

Source Code Agents are like other Foundry Hosted Agents. The key deployment difference is that the code asset is a `.zip` file instead of a container image. This also changes the Agent Development Lifecycle compared with the containerized version of Foundry Hosted Agents.

An important point of clarity: the way the agent is configured is a data plane operation.
As such, taking advantage of Source Code Agent functionality does not require changes to the Foundry infrastructure itself when your Infrastructure as Code (IaC) is only provisioning the supporting resources in Bicep, Terraform, or PowerShell. The deployment change happens through the Foundry data plane.

First, let's look at a container-based Foundry Hosted Agent:

Now, let's compare it to the source-code version:

Deployment Process

Now that we've looked at the end result, let's talk through the steps required to deploy a Foundry Hosted Agent via source code. So in Foundry, what does the difference between a container-based and a source-code-based Foundry Hosted Agent look like? The Microsoft Learn docs outline this well:

Every source-code deployment follows the same sequence: package -> create or update -> poll until active -> invoke. The source-code path uses `code_configuration` in the agent definition; the image-based path uses `container_configuration` instead--the two are mutually exclusive on a single version.

To confirm this and see more detail, refer to the Foundry Agent REST API documentation. The source layout can stay familiar, but the deployed artifact changes to a `.zip` file. Packaging the source code into a ZIP is the piece that differs from the container-image flow. The agent deployment to Foundry is also slightly different because it uses source-code configuration instead of container configuration. You can run this via `azd` with a command structured like the following:

    azd ai agent init --no-prompt --project-id "<project-resource-id>" --deploy-mode code --runtime python_3_13 --entry-point main.py

This assumes `azd` is installed and authenticated, and that the authenticated identity has access to the Foundry project. The command initializes a code deployment for the project. However, we recognize that the majority of enterprise organizations will want to use other deployment methods.
As such, REST API deployments are supported, as are the Python and C# SDKs for creating the agent. Taking this a step further, and similar to "GitHub Actions for Deploying Hosted Agents," let's create a reusable GitHub Action for deploying source-code-based Hosted Agents.

GitHub Action

If you want to see the entire action, it is in the simple-hosted-agent-responses repository, which contains source code, IaC, and deployment options.

Background

First, we need to understand that we cannot reuse the GitHub Action from "GitHub Actions for Deploying Hosted Agents" because, as noted above, the REST API uses mutually exclusive options. In theory, we could add conditional logic across the parameters; however, it is cleaner to create a separate action. Before invoking this action, the workflow must authenticate to Azure because the action calls `az account get-access-token` to acquire a token for the Foundry data plane.

Inputs

    inputs:
      project_endpoint:
        description: Foundry project endpoint URL
        required: true
      agent_name:
        description: Name of the hosted agent
        required: true
      source_code_zip:
        description: Path to the local source-code zip artifact
        required: true
      model_deployment_name:
        description: Name of the AI model deployment
        required: true
      cpu:
        description: CPU allocation for the hosted agent container
        required: false
        default: '0.25'
      memory:
        description: Memory allocation for the hosted agent container
        required: false
        default: '0.5Gi'
      runtime:
        description: Source-code runtime for the hosted agent
        required: false
        default: 'python_3_13'
      entry_point:
        description: Source-code entry point command for the hosted agent
        required: false
        default: '["python", "main.py"]'
      dependency_resolution:
        description: How Agent Service resolves dependencies for the source-code deployment
        required: false
        default: 'remote_build'
      max_polling_seconds:
        description: Maximum time to wait for the source-code deployment to reach active status
        required: false
        default: '600'

For our inputs, `project_endpoint`, `agent_name`, `source_code_zip`, and `model_deployment_name` are required. The CPU, memory, runtime, entry point, dependency resolution, and max polling values are configurable properties with defaults set in the action. The source-code-specific inputs populate the `code_configuration` properties of the REST payload. These include `source_code_zip`, `runtime`, `entry_point`, and `dependency_resolution`. This information tells Foundry how to run the code from the `.zip` package.

Outputs

We should output values that make sense for downstream workflows. Not every workflow will use them, but it is useful to expose non-secret values when they can support later steps. In this case, we are creating a new version of the agent, so let's output that version ID.

    outputs:
      agent_version:
        description: Version ID returned by the Foundry data plane
        value: ${{ steps.post.outputs.agent_version }}

Action

The action maps the inputs to environment variables as the first step. After that, it gets an access token from Azure and calls the REST API endpoint. Once we have this, we prepare the body of the call. Check the REST API reference for the full list of valid properties. For this example, I chose not to set `rai_config` and `tools` to keep things simple.
    runs:
      using: composite
      steps:
        - name: Create source-code metadata
          id: metadata
          shell: bash
          env:
            AGENT_NAME: ${{ inputs.agent_name }}
            MODEL_DEPLOYMENT_NAME: ${{ inputs.model_deployment_name }}
            CPU: ${{ inputs.cpu }}
            MEMORY: ${{ inputs.memory }}
            RUNTIME: ${{ inputs.runtime }}
            ENTRY_POINT: ${{ inputs.entry_point }}
            DEPENDENCY_RESOLUTION: ${{ inputs.dependency_resolution }}
          run: |
            METADATA_FILE=$(mktemp)
            ENTRY_POINT_JSON=$(python3 -c 'import json,sys; print(json.dumps(json.loads(sys.argv[1])))' "$ENTRY_POINT")
            jq -n \
              --arg model "$MODEL_DEPLOYMENT_NAME" \
              --arg cpu "$CPU" \
              --arg memory "$MEMORY" \
              --arg runtime "$RUNTIME" \
              --arg dep_resolution "$DEPENDENCY_RESOLUTION" \
              --argjson entry_point "$ENTRY_POINT_JSON" \
              '{
                description: "Hosted agent deployed from source code",
                definition: {
                  kind: "hosted",
                  protocol_versions: [{protocol: "responses", version: "1.0.0"}],
                  cpu: $cpu,
                  memory: $memory,
                  code_configuration: {
                    runtime: $runtime,
                    entry_point: $entry_point,
                    dependency_resolution: $dep_resolution
                  },
                  environment_variables: {AZURE_AI_MODEL_DEPLOYMENT_NAME: $model}
                }
              }' > "$METADATA_FILE"
            echo "metadata_file=${METADATA_FILE}" >> "$GITHUB_OUTPUT"
            echo "Metadata file created at ${METADATA_FILE}"

        - name: Post source-code agent deployment to Foundry data plane
          id: post
          shell: bash
          env:
            PROJECT_ENDPOINT: ${{ inputs.project_endpoint }}
            AGENT_NAME: ${{ inputs.agent_name }}
            SOURCE_CODE_ZIP: ${{ inputs.source_code_zip }}
            METADATA_FILE: ${{ steps.metadata.outputs.metadata_file }}
            MAX_POLLING_SECONDS: ${{ inputs.max_polling_seconds }}
          run: |
            if [[ ! -f "$SOURCE_CODE_ZIP" ]]; then
              echo "Error: Source code zip not found at ${SOURCE_CODE_ZIP}"
              exit 1
            fi

            CODE_ZIP_SHA256=$(sha256sum "$SOURCE_CODE_ZIP" | awk '{print $1}')
            echo "Source code SHA256: ${CODE_ZIP_SHA256}"

            FOUNDRY_TOKEN=$(az account get-access-token \
              --resource "https://ai.azure.com/" \
              --query accessToken -o tsv)

            # POST /agents/{name}/versions auto-creates the agent if it doesn't
            # exist and adds a new version if it does, so a single call covers
            # both first-deploy and update scenarios (matches update-agent).
            HTTP_STATUS=$(curl -s -o /tmp/source_code_response.json \
              -w "%{http_code}" \
              -X POST \
              "${PROJECT_ENDPOINT}/agents/${AGENT_NAME}/versions?api-version=2025-11-15-preview" \
              -H "Authorization: Bearer ${FOUNDRY_TOKEN}" \
              -H "Accept: application/json" \
              -H "Foundry-Features: CodeAgents=V1Preview,HostedAgents=V1Preview" \
              -H "x-ms-agent-name: ${AGENT_NAME}" \
              -H "x-ms-code-zip-sha256: ${CODE_ZIP_SHA256}" \
              -F "metadata=@${METADATA_FILE};type=application/json" \
              -F "code=@${SOURCE_CODE_ZIP};type=application/zip;filename=${AGENT_NAME}.zip")

            echo "HTTP ${HTTP_STATUS}: $(cat /tmp/source_code_response.json)"
            if [[ "$HTTP_STATUS" -lt 200 || "$HTTP_STATUS" -ge 300 ]]; then
              echo "Error: Foundry data plane returned HTTP ${HTTP_STATUS}"
              exit 1
            fi

            RESPONSE=$(cat /tmp/source_code_response.json)
            AGENT_VERSION=$(echo "$RESPONSE" | python3 -c 'import sys,json; print(json.load(sys.stdin)["version"])')
            echo "agent_version=${AGENT_VERSION}" >> "$GITHUB_OUTPUT"
            echo "Agent version resolved as ${AGENT_VERSION}"

            START_TIME=$(date +%s)
            while true; do
              ELAPSED=$(($(date +%s) - START_TIME))
              if [[ $ELAPSED -gt $MAX_POLLING_SECONDS ]]; then
                echo "Error: Agent version did not reach active state within ${MAX_POLLING_SECONDS} seconds"
                exit 1
              fi

              VERSION_STATUS=$(curl -s \
                -X GET \
                "${PROJECT_ENDPOINT}/agents/${AGENT_NAME}/versions/${AGENT_VERSION}?api-version=2025-11-15-preview" \
                -H "Authorization: Bearer ${FOUNDRY_TOKEN}" \
                -H "Accept: application/json" \
                -H "Foundry-Features: CodeAgents=V1Preview,HostedAgents=V1Preview" \
                | python3 -c 'import sys,json; data=json.load(sys.stdin); print(data.get("status", "unknown"))' 2>/dev/null)
              echo "Current status: ${VERSION_STATUS} (elapsed ${ELAPSED}s)"

              if [[ "$VERSION_STATUS" == "active" ]]; then
                echo "Agent version ${AGENT_VERSION} is active"
                break
              fi
              if [[ "$VERSION_STATUS" == "failed" ]]; then
                echo "Error: Agent version reached failed status"
                exit 1
              fi
              sleep 5
            done

Building the Source-Code Artifact

Before calling the source-code Hosted Agent action, create the ZIP artifact that will be passed into `source_code_zip`.

    source-code:
      name: Build source-code artifact
      runs-on: ubuntu-latest
      permissions:
        contents: read
      steps:
        - name: Checkout
          uses: actions/checkout@v6
        - name: Create source-code zip artifact
          run: |
            git archive --format=zip --output=source-code.zip HEAD:src/agent-framework/responses/basic
        - name: Upload source-code artifact
          uses: actions/upload-artifact@v7
          with:
            name: source-code
            path: source-code.zip

Calling the Action

Now that we have the action, how can we scale this across multiple workflows? We pass in the required parameters and the ZIP artifact path.

    - name: Update agent with source code
      uses: ./.github/actions/update-agent-source-code
      with:
        project_endpoint: ${{ needs.deploy-iac.outputs.project_endpoint }}
        # Source-code agent shares the same Foundry project as the image-based
        # agent; the `-src` suffix keeps them as distinct agent versions.
        agent_name: ${{ inputs.agent_name }}-src
        source_code_zip: ./.artifacts/source-code/source-code.zip
        model_deployment_name: ${{ needs.deploy-iac.outputs.model_deployment_name }}

And just to show we can call the same action multiple times, here are two examples that do just that: Deploy (Bicep) and Deploy (Terraform).

Conclusion

Source-code deployments give Foundry Hosted Agents another deployment path for teams that do not want, or do not need, to package every agent as a container image.
By using a `.zip` artifact, teams can keep a familiar source-code packaging flow while still taking advantage of the managed compute abstraction that Hosted Agents provide. The reusable GitHub Action shown in this post turns that deployment process into a repeatable CI/CD step: package the source code, post the deployment to the Foundry data plane, poll until the new version is active, and expose the resulting agent version for downstream workflow steps. This keeps the deployment flexible while fitting into existing enterprise pipeline patterns. For organizations already using container-based Hosted Agents, source-code deployments do not replace that model; they expand the options available. Choose the deployment approach that best fits how your teams package, govern, and operate their agent workloads.

Foundry Toolkit for VS Code at //build: Hosted Agents End-to-End, a Smarter Toolbox, and More
We're excited to share what's new for Foundry Toolkit for Visual Studio Code at //build 2026. Since going generally available, the toolkit has kept moving fast, and this release is a big one. The headline: a complete, end-to-end Hosted Agent experience (scaffold, run, deploy, and observe) without ever leaving VS Code. On top of that, we've expanded the Toolbox with native enterprise integrations and shipped a wave of LangGraph samples so every developer has a clear path from idea to production. From your first prompt to a production-grade, observable agent, Foundry Toolkit meets you where you are.

Hosted Agents, End to End

Building an agent is the easy part; getting it from a first draft to a production-grade, observable service is what matters. This release makes the full Hosted Agent lifecycle available in VS Code, and it follows the way you actually work: scaffold, run, deploy, observe.

Scaffold: start from a rich set of samples

Hosted Agent creation now opens with a refreshed scaffolding experience and a rich sample selection, so you start from a working, framework-appropriate template instead of a blank file. Creation is smarter, too: we auto-select your subscription when there's only one, gate tabs more clearly, and have tightened spacing for a cleaner setup flow.

Run (F5): inspect as you build

Press F5 and your agent runs locally with the Agent Inspector, now aligned with the rest of the extension and featuring Copilot SDK visualization so you can see what the agent is doing as it executes. It's the fastest loop from change to verification before anything leaves your machine.

Deploy: a new UX and new ways to ship

Different teams ship differently, so deployment got a refreshed UX and two new options for Hosted Agents:

- ZIP Code Deploy: Package your agent source as a ZIP and deploy it directly to Microsoft Foundry Agent Service.
- Bring-Your-Own-Image (BYOI): Already have a pre-built container in your own Azure Container Registry?
Deploy straight from it.

Observe: know it works in production

Once deployed, the full observability story is now available:

- Hosted Agent Tracing: Inspect end-to-end traces of Hosted Agent invocations directly from VS Code, including tool calls, delegation chains, and timing, for real debugging instead of guesswork.
- Continuous Evaluation Settings: A new page to configure ongoing evaluation for deployed Hosted Agents, so quality is measured continuously, not just at ship time.
- Evaluations Node: One-click access to evaluation runs and results right from the Foundry project tree.

A Smarter, More Connected Toolbox

What it is, and why it matters

A Toolbox is how your agent gets its capabilities: the curated set of tools, knowledge sources, and integrations it can call at runtime. Instead of hand-wiring each connection, you assemble a Toolbox once and your agent consumes it consistently across local runs and production. The result: agents that can act on real enterprise data and systems, with the connections managed in one place.

From what to how: create, connect, consume

- Create: Start a new Toolbox from the "Tools Catalog" in the Foundry Toolkit sidebar and pick the capabilities your agent needs.
- Connect: Configure and wire in enterprise systems through native, first-class connections once, and use them for all your agents.
- Consume: Reference the Toolbox from your Hosted Agent so its tools are available the moment the agent runs, locally (F5) and once deployed.

New this release

Building on that flow, the Toolbox is now richer and more enterprise-ready:

- WorkIQ as a Built-in Tool: A first-class WorkIQ experience powered by A2A connections, with no MCP fallback required. End-to-end toolbox creation with WorkIQ works out of the box.
- Fabric IQ (OneLake Catalog) Integration: Connect your agents to Microsoft Fabric OneLake catalogs directly from the Toolbox.
- Toolbox Guardrails: Apply content-safety guardrails to your Toolbox for safer agent execution.
- Faster discovery: A new Toolbox Search Toggle and Agent Tool Multi-Select let you find and wire in multiple tools in a single action.

LangGraph Reaches Parity

LangGraph developers, this one is for you. We've added five new Hosted Agent samples that bring LangGraph to full parity with the Agent Framework Responses learning path, so you get an equivalent, end-to-end walkthrough no matter which framework you prefer:

- MCP: tool loading from a remote MCP server (defaults to GitHub Copilot MCP) via MultiServerMCPClient.
- Workflows: a custom StateGraph chaining three specialized LLM nodes: slogan writer, legal reviewer, and formatter.
- Files: local filesystem tools plus the Foundry-Toolbox code_interpreter working over session-uploaded files.
- Human-in-the-Loop: a StateGraph that drafts a proposal and pauses for approval via langgraph.types.interrupt.
- Observability: GenAI OpenTelemetry tracing with enable_auto_tracing(); spans, metrics, and logs flow to Application Insights.

We've also refreshed the existing bring-your-own LangGraph samples against the new hosting layer (chat with local tools, Foundry-managed Toolbox loading, and SSE-streamed multi-turn sessions backed by a MemorySaver checkpointer), so every sample reflects how Hosted Agents work today.

Polish Across the Board

A release is more than headline features. This one also includes a redesigned Prompt Builder "Improve an Instruction" dialog for faster iteration, fixes for MCP toolbox tool icons, clearer ZIP-deploy error surfacing, and assorted Agent Builder and Playground regression fixes, so the whole experience feels tighter end to end.

Get Started Today

- Install: Foundry Toolkit on the VS Code Marketplace
- Quick Start: Follow our getting-started tutorial to build your first Hosted Agent
- Deep Dive: Explore the documentation, samples, and LangGraph parity walkthroughs

Join the Community

Share your projects, file issues, or suggest features on our GitHub repository. We can't wait to see what you build.
Welcome to the next chapter of AI development!

Foundry IQ: Improve recall by up to 54% with knowledge bases
Foundry IQ (Azure AI Search) has improved its agentic retrieval engine, resulting in better answer quality and improved token cost savings. We compared standalone retrieval tools to knowledge bases using the challenging BrowseComp-Plus benchmark and found:

- Replacing single-shot RAG with a knowledge base improves evidence recall by up to 46%.
- Combining a smaller agent model with agentic retrieval improves evidence recall by up to 54% while controlling costs and increasing agent responsiveness.

In both cases, the number of retrieval tool calls your agent makes is reduced, resulting in 34% token cost savings.

DevOps for Microsoft Hosted Agents: From Terraform Apply to Production-Grade Agent Delivery
A companion piece to Infrastructure as Code for AI: Building and Deploying Microsoft Hosted Agents with Terraform.

Just announced: source-code deploy (preview). Foundry has just added a second Hosted Agent deploy path alongside the container path this post covers. Instead of a container image, you upload a .zip of your source plus a requirements.txt (Python 3.13 / 3.14) or a .csproj (.NET 10), and the Agent Service either builds dependencies for you (remote_build) or runs a prebuilt bundle (bundled). The version definition uses code_configuration instead of container_configuration; the two are mutually exclusive on a given version. Versioning is content-addressable on the zip's SHA-256, so the dedup behaviour described below still applies. Required roles shift slightly: deploying the agent needs Foundry Project Manager at project scope, and the platform-assigned agent identity gets Foundry User (both handled automatically by azd and the Foundry VS Code Toolkit). The DevOps loop in this post (immutable versions, eval gating, manifest-driven promotion, traffic-split canary, per-version observability) transfers directly; only the build-and-push stage changes (no Dockerfile, no ACR for remote_build). The container path covered here remains fully supported and is still the right choice if you need custom base images, system packages, or non-Python/.NET runtimes. Full details: Deploy a hosted agent from source code (preview).

What this post assumes. It describes recommended enterprise DevOps patterns on top of Microsoft Foundry Hosted Agents. Some patterns (evaluation gating, traffic-based rollout, manifest-driven promotion) are best practices and may not be enforced by the platform itself. Hosted Agents and several related capabilities (A2A, certain deployment and routing controls) are in preview and may evolve.

TL;DR

- Terraform provisions the platform: Foundry account, project, model deployment, ACR, App Insights, RBAC.
- DevOps pipelines ship agent versions, not source branches: the deploy artifact is a container image digest plus an immutable version spec.
- Evaluation should be treated as a release gate, not a dashboard. Quality regressions should fail the build the same way unit-test failures do.
- Traffic split between versions is the rollout and rollback primitive. Rollback typically avoids rebuilding or redeploying artifacts.
- Observability is sliced per version: during canary, two versions serve simultaneously and aggregate metrics lie.

The Delivery Pipeline at a Glance

    Terraform ---> Foundry project (AIServices) + model deployment + ACR + App Insights
        |
        v
    PR opened ---> docker build ---> push to ACR ---> capture image digest
        |
        v
    Foundry SDK: create agent version (image digest + cpu/mem + env + protocols)
        |
        v
    Evaluation gate ---> fail -> stop
        |
        v  pass
    Promote via manifest -> staging -> prod
        |
        v
    Traffic-split canary (0% -> 10% -> 100%)
        |
        v
    App Insights: per-version latency, cost, sampled quality, sandbox sizing

Infrastructure as Code gets the platform stood up. It does not, on its own, ship an agent. The gap between terraform apply succeeding and a customer-facing agent reliably serving requests in production is where DevOps lives, and for Microsoft Hosted Agents on Microsoft Foundry, that gap has its own shape.

A Hosted Agent is not a prompt and a tool list. It is your own code, packaged as a container image, pushed to Azure Container Registry, and deployed to a Foundry project. The Foundry Agent Service pulls the image, provisions an isolated execution environment per agent session, assigns the agent its own dedicated Microsoft Entra ID (agent identity), and exposes a dedicated endpoint. An agent supports up to four protocols, any of which can be combined in a single deployment:

- Responses (.../protocols/openai/responses): OpenAI-compatible chat-style API. Implemented in the container.
- Invocations (.../protocols/invocations): arbitrary JSON in / arbitrary JSON out for webhook receivers and non-conversational workloads. Implemented in the container.
- A2A (.../protocols/a2a, preview): the open Agent2Agent protocol for agent-to-agent delegation across frameworks and vendors. Surfaced on its own endpoint path by the platform.
- Activity: the Teams / M365 channel protocol. The platform bridges Responses to Activity automatically when an agent is published to a Microsoft 365 channel.

Microsoft manages the runtime, scaling, session state, and lifecycle. You ship the image and the version definition.

Important: Foundry version compatibility. Hosted Agents are supported on the new Microsoft Foundry project resource model (azurerm_cognitive_account_project under a Cognitive Services account of kind = "AIServices"). The older Azure AI Foundry Hub model (azurerm_ai_foundry / azurerm_ai_foundry_project, kind = "Hub"), the Azure ML-derived workspace surface, does not expose Hosted Agent capabilities. They are two distinct Azure resource types with different APIs. Everything in this post assumes the new Foundry project.

That shape drives three things every DevOps loop for Hosted Agents has to handle:

- The deploy artifact is a container image plus an immutable agent version. A version snapshots the image digest, CPU/memory, environment variables, and protocol configuration. To change anything, you create a new version. The platform supports weighted traffic between versions, which is your blue/green and canary primitive.
- The agent identity is created for you, per agent. You don't pick one or wire managed-identity references manually. Each agent is assigned a dedicated Microsoft Entra ID (agent identity) at deploy time; RBAC to downstream resources is granted to that identity.
- Quality is non-deterministic. Two terraform apply runs against the same configuration produce identical resources.
Two agent runs against the same input can produce different outputs. Your pipeline has to gate on evaluation, not only on tests passing and HTTP 200s.

This post lays out an end-to-end DevOps loop on top of that shape: how to structure the repository, what runs in CI versus CD, how to gate releases on evaluation, how to promote across environments, how to use version traffic split for safe rollouts and instant rollback, and what observability is worth wiring beyond the defaults.

A Quick Tour of Microsoft Foundry

If you've spent more time in Azure OpenAI or AI Studio than in Foundry, a short orientation helps before the DevOps patterns make sense. Microsoft Foundry is Microsoft's unified platform for building, evaluating, deploying, and operating AI applications and agents. It consolidates what used to be spread across Azure OpenAI, Azure AI Studio, and the AI Hub model into a single resource and a single portal at ai.azure.com. Three pieces are worth knowing up front.

The resource model

Foundry is built on two Azure resources:

- Foundry account: an azurerm_cognitive_account with kind = "AIServices", project_management_enabled = true, a custom_subdomain_name, and a managed identity. This is the top-level container: it holds your model deployments (Azure OpenAI and the broader Foundry model catalog), connections to backing services, and the Foundry-managed Toolbox MCP endpoint.
- Foundry project: an azurerm_cognitive_account_project under that account. A project is the scope for agents, evaluations, conversation history, indexes, and per-app connections. One project per app or per environment is the usual shape.

This is the new Foundry model, and it is the only model that supports Hosted Agents. The older Azure AI Foundry Hub (azurerm_ai_foundry + azurerm_ai_foundry_project, kind = "Hub") is a separate Azure ML-derived workspace and cannot host Hosted Agents.
The two surfaces look superficially similar in the portal but are distinct Azure resource types with different APIs and feature sets. If a tutorial, sample, or piece of Terraform you find online creates an azurerm_ai_foundry Hub, it is targeting the classic surface and the Hosted Agents APIs (/agents, agent versions, traffic split, dedicated endpoints) will not be available against it. To use Hosted Agents you must provision a new Foundry account + project as described above. There is no in-place upgrade from a Hub.

What Foundry gives you

A Foundry project is more than a container. Out of the box it provides:

- A model catalog and deployment surface: Azure OpenAI models (GPT-4.1, GPT-4o, o-series, embeddings), plus open and partner models, all deployed and invoked through the same project endpoint with the same auth model.
- Two agent execution modes: prompt-based agents (defined entirely by instructions + tool configuration in the portal, suitable for conversational assistants) and Hosted Agents (your own containerized code, the subject of this post).
- A managed Toolbox: a project-level MCP endpoint that exposes Foundry-curated tools (Code Interpreter, Web Search, Azure AI Search, OpenAPI, custom MCP, A2A) with consolidated auth. Hosted Agent code connects to the Toolbox using standard MCP client libraries.
- First-class evaluation: datasets, graders (similarity, LLM-as-judge, safety, groundedness), and evaluation runs as a built-in concept, not a bolt-on.
- Built-in tracing: OpenTelemetry traces from agents land in a linked Application Insights resource automatically. No manual instrumentation needed to get the basics.
- Per-agent identity: when you deploy a Hosted Agent, the platform creates a dedicated Microsoft Entra ID (agent identity) for it and gives it a dedicated endpoint. RBAC to downstream resources is granted to that identity.
How the pieces line up for Hosted Agents

For the rest of this post, the mental model is:

    Resource group
    └── Foundry account (Cognitive Services, kind=AIServices)
        ├── Model deployments (e.g. gpt-4.1)
        └── Foundry project
            ├── Hosted Agent: customer-support
            │   ├── Version v1 (image digest A, 100% traffic)
            │   └── Version v2 (image digest B, 0% traffic, canary)
            ├── Hosted Agent: webhook-handler
            ├── Evaluations
            ├── Connections (ACR, AI Search, Key Vault…)
            └── Toolbox (MCP)

Terraform provisions the account, project, model deployments, ACR, App Insights, and RBAC. Hosted Agents (images, versions, traffic weights) are managed through azd or the Foundry SDK. That boundary is what the rest of this post automates.

The minimal Terraform shape

For Hosted Agents you need the new-model shape. The skeleton below is the minimum that lets you deploy a Hosted Agent on top of it; storage, Key Vault, monitoring, networking, and OIDC for CI live alongside it. For more details, see Infrastructure as Code for AI: Building and Deploying Microsoft Hosted Agents with Terraform.
    # Foundry account (new model, required for Hosted Agents)
    resource "azurerm_cognitive_account" "foundry" {
      name                       = "ai-${local.name}"
      resource_group_name        = azurerm_resource_group.main.name
      location                   = azurerm_resource_group.main.location
      kind                       = "AIServices"
      sku_name                   = "S0"
      project_management_enabled = true
      custom_subdomain_name      = "ai-${local.name}" # required for AAD auth

      identity {
        type = "SystemAssigned"
      }
    }

    # Model deployment the agent will call
    resource "azurerm_cognitive_deployment" "gpt" {
      name                 = "gpt-4.1" # stable name; agents pin to this
      cognitive_account_id = azurerm_cognitive_account.foundry.id

      model {
        format  = "OpenAI"
        name    = "gpt-4.1"
        version = "2025-04-14"
      }

      sku {
        name     = "GlobalStandard"
        capacity = 10
      }
    }

    # Foundry project: the scope for Hosted Agents, evals, conversations
    resource "azurerm_cognitive_account_project" "main" {
      name                 = "proj-${local.name}"
      cognitive_account_id = azurerm_cognitive_account.foundry.id
      location             = azurerm_resource_group.main.location

      identity {
        type = "SystemAssigned"
      }
    }

    # Container registry the agent image is pushed to and pulled from
    resource "azurerm_container_registry" "acr" {
      name                = "acr${replace(local.name, "-", "")}"
      resource_group_name = azurerm_resource_group.main.name
      location            = azurerm_resource_group.main.location
      sku                 = "Standard"
      admin_enabled       = false # use RBAC, not admin user
    }

    # The project's managed identity needs to pull the agent image
    resource "azurerm_role_assignment" "project_acr_pull" {
      scope                = azurerm_container_registry.acr.id
      role_definition_name = "AcrPull" # use Container Registry Repository Reader if the ACR has ABAC enabled
      principal_id         = azurerm_cognitive_account_project.main.identity[0].principal_id
    }

A few things worth calling out:

- kind = "AIServices" + project_management_enabled = true + custom_subdomain_name are what make this a new-model Foundry account.
Omit project_management_enabled and azurerm_cognitive_account_project will not provision; omit custom_subdomain_name and you lose the Foundry endpoint shape that Entra-authenticated access depends on.
- azurerm_cognitive_account_project is the new-Foundry project resource. Do not use azurerm_ai_foundry_project; that targets the Hub model and does not host agents.
- Keep the model deployment name stable. Agent code (and your agent.yaml) pins to the deployment name, not the model version. Changing the version is safe; changing the name forces a new agent version.
- The project MI needs ACR pull, not push. CI pushes the image (via its own identity); the platform pulls it on the project's behalf when the agent runs. ABAC-enabled ACR is supported but requires --source-acr-auth-id [caller] on az acr build in your CI script, which is a common gotcha.

A note on the provider. Everything above uses the hashicorp/azurerm provider. Foundry's surface evolves quickly, and you will occasionally hit a property or child resource that AzureRM hasn't caught up with yet; project connections, capability hosts, and some newer agent-related fields are common examples. When that happens, reach for azure/azapi: use azapi_update_resource to patch a missing property on an AzureRM-owned resource, and azapi_resource for resources AzureRM doesn't model at all. Keep AzureRM as the default and use AzAPI as a targeted gap-filler, so you don't fork ownership of mainstream resources.

The Hosted Agent Delivery Loop

A working delivery loop has five stages. Each maps to a specific artifact, a specific tool, and a specific failure mode.

| Stage | Artifact | Tool | Primary failure mode |
|---|---|---|---|
| Infra provisioning | Terraform state | terraform apply | Quota, RBAC propagation, ACR not reachable |
| Image build & push | OCI image in ACR (ACR must remain publicly reachable today) | docker build / az acr build | Image too large, base image CVEs |
| Agent version create | Immutable version (image digest + config) | azd or Foundry SDK | Bad env var, wrong protocol declared |
| Evaluation | Eval dataset + grader | Foundry evaluators | Quality / safety regression |
| Traffic shift & observe | Version weights, App Insights traces | Foundry SDK + Azure Monitor | Silent quality decay, sandbox over/under-sizing |

The first stage is where the prior post left off. The remaining four are this post. Infra provisioning assumes the standard pattern: terraform plan runs on every PR as a review gate (posted as a PR comment) and terraform apply runs only on merge to the environment branch. Everything below assumes the platform is already applied.

Repository Shape

A repository that supports the loop end-to-end looks roughly like this:

    agent-platform/
    ├── infra/                        # Terraform from the prior post (AIServices + project)
    │   ├── modules/foundry-project/
    │   └── environments/
    │       ├── dev.tfvars
    │       ├── staging.tfvars
    │       └── prod.tfvars
    ├── agents/
    │   ├── customer-support/
    │   │   ├── Dockerfile
    │   │   ├── src/                  # Agent code (Python or C#)
    │   │   ├── agent.yaml            # Version spec: image, cpu/memory, protocols, env
    │   │   ├── evals/
    │   │   │   ├── dataset.jsonl
    │   │   │   └── graders.yaml
    │   │   └── README.md
    │   └── webhook-handler/
    │       └── ...
    ├── scripts/
    │   ├── deploy_agent_version.py   # Build → push → create version → optional weight shift
    │   ├── run_evals.py
    │   └── promote_version.py        # Shifts traffic between versions
    └── .github/workflows/
        ├── infra.yml                 # Terraform plan/apply
        ├── agent-pr.yml              # Build, push to ACR, deploy candidate version, run evals
        └── agent-release.yml         # Promote a tested version to staging / prod

Two deliberate choices.
First, infrastructure and agents live in the same repo but in separate top-level directories with separate pipelines. They have different cadences and different reviewers. Second, each agent is its own folder with its own Dockerfile, code, version spec, and eval suite. A single PR touches one agent's directory cleanly; a code-review diff stays focused.

The Agent Version as the Deploy Unit

A Hosted Agent is deployed as a version. A version is immutable. Once created, it captures:

- the container image digest (not just the tag; the digest, so it cannot drift),
- CPU and memory allocation for the per-session sandbox (e.g. 1 vCPU / 2 GiB),
- the container protocols the image implements (responses, invocations, or both),
- environment variables passed to the container at runtime,
- any other version-scoped configuration (e.g. base model deployment name).

The container's container_protocol_versions only declares responses and/or invocations, the two protocols the container itself implements. A2A (preview) is surfaced by the platform on its own endpoint path, and Activity is bridged from Responses automatically when the agent is published to a Microsoft 365 channel. Under the hood, agent versions run on Azure Container Apps with VM-isolated sandboxes, which is also why you may see the term revision in some Container Apps-surfaced APIs and limits; a Hosted Agent version corresponds to one such revision.

To change any of those, you create a new version. The platform keeps the old one and shifts traffic between them by weight. This is the primitive you use for canary rollouts and for rollback: both reduce to a traffic-weight change, not a redeploy.
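To make the "both reduce to a traffic-weight change" point concrete, here is a minimal sketch in plain Python. It models weights as a dictionary and makes no Foundry SDK calls; the function name `set_weights` and the version labels `v41` / `v42` are hypothetical, chosen only for illustration.

```python
def set_weights(previous: str, candidate: str, candidate_percent: int) -> dict[str, int]:
    """Return traffic weights for two versions; the weights always sum to 100.

    Hypothetical helper: it only computes the desired weights. Applying them
    is a separate control-plane call that is not shown here.
    """
    if previous == candidate:
        raise ValueError("previous and candidate must be different versions")
    if not 0 <= candidate_percent <= 100:
        raise ValueError("candidate_percent must be between 0 and 100")
    return {previous: 100 - candidate_percent, candidate: candidate_percent}


# Canary: route 10% of new sessions to the candidate version.
canary = set_weights("v41", "v42", 10)

# Rollback: the same operation with the candidate set back to 0%.
rollback = set_weights("v41", "v42", 0)

print(canary)
print(rollback)
```

Under this model, rollout and rollback are the same operation with different numbers, which is why neither needs a rebuild.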
An agent.yaml per agent makes the version reproducible from source:

    # agents/customer-support/agent.yaml
    name: customer-support
    container:
      image: ${ACR_LOGIN_SERVER}/customer-support   # digest resolved at deploy time
      cpu: 1
      memory: 2Gi
    protocols:   # container_protocol_versions
      - responses
      # add `invocations` here if the container also handles webhook-style payloads
    env:
      # The platform automatically injects FOUNDRY_PROJECT_ENDPOINT,
      # AZURE_AI_MODEL_DEPLOYMENT_NAME, and APPLICATIONINSIGHTS_CONNECTION_STRING;
      # you only set what's specific to your agent.
      LOG_LEVEL: info
    metadata:
      owner: support-team
      source_commit: ${GITHUB_SHA}

scripts/deploy_agent_version.py is the executable form of this spec. Its job per agent is:

1. Build the container image (docker build locally, or az acr build server-side for ABAC ACRs).
2. Push to ACR and capture the resulting image digest, not the :latest tag.
3. Resolve environment variables from the target environment's config.
4. Call the Foundry SDK to create a new agent version pinned to that digest.
5. Emit a deployment-manifest.json containing the agent name, version ID, image digest, source commit SHA, and the eval dataset hash used.

One gotcha: the platform deduplicates. A create version call with no change to the version parameters (same image digest, same env, same CPU/memory, same protocols) will not produce a new version object. Write the script to treat "no new version returned" as success and reuse the existing version ID in the manifest, not as a failure to retry.

That manifest is the cross-pipeline contract. PR pipelines produce one. Promotion pipelines consume one. Rollback consumes a previous one.

Evaluation as a Release Gate

Foundry ships evaluators (datasets, graders, evaluation runs) as a first-class platform feature. Whether to block a release on their results is a team decision, not a platform mandate, but it is the recommended pattern for any agent serving real users.
A pipeline that promotes an agent because the image built, the container started, and the version was created with HTTP 200 will eventually ship a regression that an integration test cannot catch. Treat the eval suite the way you treat unit tests: failures stop the pipeline.

A minimal but honest evaluation setup has three pieces.

A reference dataset. Twenty to fifty representative scenarios is enough to start. Each row is an input plus either a reference answer, a set of must-include facts, or a rubric. Store as JSONL alongside the agent:

    {"id":"refund-1","input":"How do I get a refund for order 12345?","must_include":["return window","14 days","original payment method"]}
    {"id":"escalate-1","input":"This is the third time my package is late.","rubric":"Agent should acknowledge, apologize, offer escalation, not promise compensation."}

Graders. Foundry's evaluators library ships templates: exact match, similarity, LLM-as-judge for rubric scoring, and built-in safety and groundedness graders. Pick what matches your dataset shape. LLM-as-judge is the workhorse for open-ended responses; pin its model deployment explicitly so the grader itself does not drift between runs.

Thresholds. Decide what "passing" means before the first run. A common pattern:

- Hard floor on safety / groundedness: any regression fails the build.
- Relative threshold on quality: no more than X% drop versus the last known-good version.
- Absolute floor on must-include coverage, for example ≥ 90%.
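The three threshold rules above can be sketched as a small gate function. This is an illustration under stated assumptions, not the post's run_evals.py: the `Scores` fields, the 0.0 to 1.0 score scale, the 5% default for the relative drop, and the simple substring match for must-include facts are all hypothetical choices.

```python
from dataclasses import dataclass


@dataclass
class Scores:
    """Aggregate scores for one agent version (assumed scale: 0.0 to 1.0)."""
    safety: float
    groundedness: float
    quality: float        # e.g. mean LLM-as-judge rubric score
    must_include: float   # fraction of must_include facts found


def must_include_coverage(response: str, facts: list[str]) -> float:
    """Fraction of required facts that appear (case-insensitively) in the response."""
    if not facts:
        return 1.0
    text = response.lower()
    return sum(fact.lower() in text for fact in facts) / len(facts)


def gate(candidate: Scores, baseline: Scores,
         max_quality_drop: float = 0.05, min_coverage: float = 0.90) -> list[str]:
    """Return the failed checks; an empty list means the gate passes."""
    failures = []
    # Hard floor: any regression on safety or groundedness fails the build.
    if candidate.safety < baseline.safety:
        failures.append("safety regressed")
    if candidate.groundedness < baseline.groundedness:
        failures.append("groundedness regressed")
    # Relative threshold: quality may drop at most max_quality_drop vs. baseline.
    if candidate.quality < baseline.quality * (1 - max_quality_drop):
        failures.append("quality dropped more than allowed")
    # Absolute floor on must-include coverage.
    if candidate.must_include < min_coverage:
        failures.append("must-include coverage below floor")
    return failures


baseline = Scores(safety=1.0, groundedness=0.9, quality=0.8, must_include=0.95)
print(gate(Scores(1.0, 0.9, 0.78, 0.92), baseline))   # small quality dip, within tolerance
print(gate(Scores(0.99, 0.9, 0.8, 0.95), baseline))   # safety regression
```

Returning the list of failed checks, rather than a bare boolean, lets the pipeline print why a candidate was blocked.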
Wire it into the PR pipeline:

    # .github/workflows/agent-pr.yml (excerpt)
    - name: Build, push, and create candidate version
      run: |
        python scripts/deploy_agent_version.py \
          --agent customer-support \
          --project $EVAL_PROJECT \
          --version-suffix pr-${{ github.event.number }} \
          --traffic 0   # create the version, do not route traffic yet

    - name: Run evaluations against candidate endpoint
      run: |
        python scripts/run_evals.py \
          --agent customer-support \
          --version pr-${{ github.event.number }} \
          --baseline last-known-good \
          --fail-on-regression

The PR creates a candidate version with zero traffic weight against a long-lived "eval" Foundry project, runs evaluations against the candidate version's dedicated endpoint, and then deletes the candidate version on PR close. A standing eval project beats a per-PR Foundry project: provisioning a project per PR is slow and adds RBAC overhead that does not earn its keep.

Environment Promotion

Three environments is the floor: dev, staging, prod. Each is its own Foundry project, ideally its own Foundry account in its own resource group. What promotes between them is the image digest and the version spec, not source code, and not "redeploy from main."

A workable model:

- dev: every push to a feature branch builds an image and creates a dev version. Loose evaluation thresholds. Used for human poking and end-to-end debugging.
- staging: merges to main create a staging version. Full eval suite, strict thresholds. Same sandbox sizing, same env vars, same protocols as prod.
- prod: manually approved promotion from staging. Promotion script reads the staging manifest, finds the image digest that passed, and creates the prod version pointed at that exact digest. No rebuild.

The "same digest" rule is the recommended pattern for safe promotion. If staging passed evaluations on customer-support@sha256:abc… running gpt-4.1, prod should get that exact image.
Re-building from main in the prod pipeline reintroduces the risk you spent staging trying to eliminate (a different base-image patch level, a different transitive dependency, a different build clock) even though nothing in your source changed.

GitHub Actions environments make the approval concrete:

    jobs:
      promote-prod:
        needs: deploy-staging
        environment: production   # requires reviewer approval
        runs-on: ubuntu-latest
        steps:
          - name: Create prod version from staging manifest
            run: |
              python scripts/deploy_agent_version.py \
                --agent customer-support \
                --project $PROD_PROJECT \
                --from-manifest staging-manifest.json \
                --traffic 10   # canary at 10%

The canary weight is the second half of safe promotion: create the prod version, give it a small fraction of traffic, watch the App Insights traces, then shift the rest with promote_version.py.

Traffic-Split Rollout and Instant Rollback

Weighted version traffic changes the rollback model entirely. Rollback typically avoids rebuilding or redeploying artifacts: the previous version is still there, ready to take traffic.

A typical canary flow:

1. Create new version v42 at 0% traffic. Endpoint exists; no production calls reach it.
2. Shift to 10%. Observe for an hour or a day, depending on traffic volume.
3. Shift to 50%, then 100%. Old version stays at 0% but is not deleted.
4. After a stability window (commonly a week), delete the previous version to free quota.

Rollback is the reverse: shift weights back to the previous version. It is a control-plane call, not a deploy. The agent's endpoint URL does not change, sessions in flight continue on whichever version they started on, and new sessions land on whatever the weights say.

Two consequences worth internalizing:

- Keep at least the last two known-good versions live. Rollback is only as fast as your ability to flip weights to a version that already exists.
- Do not skip the canary step under deadline pressure. A 0%→100% cutover gives you the same blast radius as a non-canaried deploy.
  The platform supports incremental rollout; use it.

For a destructive change (a removed protocol, a renamed agent, an env var the previous version cannot tolerate) rollback may not be safe. Forward-fix is the answer. Identify those changes in PR review and require an explicit "rollback path: forward-fix" note in the PR.

Handling Model Version Changes

A model deployment bump is the highest-blast-radius runtime change you can make to a Hosted Agent: the agent's behaviour on every input can shift. Treat it like a dependency upgrade.

1. Open a PR that changes only the AZURE_AI_MODEL_DEPLOYMENT_NAME (or the model version on the deployment, via Terraform).
2. Build a new image if needed, create a new agent version, run the full eval suite at 0% traffic.
3. Run a larger regression dataset if you have one.
4. Require a human reviewer who is not the PR author.
5. Promote through staging, then canary in prod for at least one business day before shifting full traffic.

If the new model is faster or cheaper, the temptation is to skip steps. Don't. A quality regression in prod almost always costs more than a careful upgrade.

The Terraform side is small: openai_model_version is a variable on the azurerm_cognitive_deployment. Terraform recreates the deployment if the version changes. The Hosted Agent picks up the new deployment the next time it calls the model, provided you kept the deployment name stable, which is your contract with the agent code. If you change the deployment name as well, the agent needs a new version that knows the new name.

Observability That Actually Tells You Something

The platform injects an Application Insights connection string into every Hosted Agent container as an environment variable. Agents that use the protocol libraries emit OpenTelemetry traces by default. That gives you per-request latency, token counts, tool invocations, and conversation IDs out of the box.

That is the floor. Add to it:

Custom span attributes on every request.
Agent name, agent version ID, image digest (short), model deployment name. Without these, post-incident analysis cannot tell you which version was live when a problem started, especially during a traffic-split rollout where two versions are serving simultaneously.

Quality signal capture. Sample a percentage of production conversations into a queue for offline grading. Run the same graders you used in CI against that sample on a schedule. This is your drift detector for response quality.

Sandbox right-sizing signals. Hosted Agents bill on the CPU/memory you allocate per session. Oversizing multiplies cost by your concurrency. Track CPU and available memory inside the sandbox and compare against the version's allocation: if peaks stay below ~50%, the next version should drop a tier; if they push above ~70%, raise it. Right-sizing is a per-version decision because versions are immutable.

Per-version error and latency. Slice every standard metric by version ID. A canary that looks fine in aggregate can be quietly worse than the previous version on specific request shapes.

Cost dimensions. Tag traces with customer_id or tenant_id if you have multi-tenancy. Aggregating session cost by tenant in App Insights is straightforward once the dimension is on the span.

Alerts on shape, not just rate. A doubling in average response length or a sudden drop in tool invocation frequency often precedes a quality regression that error-rate alerts will miss entirely.

A weekly "agent health" report in your team channel, pulling these App Insights queries together, beats a perfect dashboard nobody opens.

A Pragmatic Maturity Path

Most teams cannot build the whole loop on day one. A reasonable order:

1. Infrastructure in Terraform. AIServices account, project, model deployment, ACR, App Insights, role assignment so the project MI can pull from ACR.
2. First agent deployed manually with azd. Just to prove the round trip end to end.
3. agent.yaml plus a deploy script that builds, pushes by digest, and creates a version. One environment.
4. Three environments with manual promotion by manifest.
5. A 20-row eval dataset with one grader, run on every PR. Advisory only at first.
6. Eval as a blocking gate. Thresholds tuned from the advisory phase.
7. Canary rollout via traffic split. Versions held live for a stability window before deletion.
8. Production sampling into offline evaluation. Drift detection.
9. Model version upgrade playbook. Documented, exercised once on a low-risk agent.
10. Tested rollback via weight shift. The first time you discover a rollback bug should not be during an incident.

Each step is independently useful. Skipping ahead, particularly to step 6 without time in step 5, produces thresholds that block legitimate changes and erode trust in the pipeline.

Where This Is Heading

The platform is moving. A few things to watch as you build:

- Declarative Hosted Agent versions in Terraform. AzureRM coverage of Hosted Agents and agent versions is expanding. Parts of the deploy script will collapse into Terraform as that lands. The script-driven approach in this post is the bridge, not the destination.
- Continuous evaluation as a first-class platform feature. Sampling production traffic into scheduled evals, which you wire by hand today, is moving into the Foundry control plane.
- Multi-agent composition over A2A. As the A2A endpoint moves from preview to general availability and more frameworks ship A2A clients, multi-agent workflows become a first-class deployment shape. The DevOps loop extends (version pinning between agents, eval at the workflow level, observability across the agent graph), but the manifest grows accordingly.
- Toolbox-managed tool surfaces. As more tool integrations move behind the project Toolbox MCP endpoint, the agent image gets smaller and the tool configuration becomes a project-level concern. That changes what belongs in agent.yaml versus what belongs in Terraform.

The throughline: the more the platform absorbs, the more your job shifts from wiring plumbing to defining policy. What "good" means for your agent, what the quality floor is, who can approve a model upgrade, how fast you can roll back. Those decisions do not get automated away. The pipeline just makes them executable.

Conclusion

Terraform provisions the Foundry project, model deployment, ACR, and observability. The DevOps loop on top of it (container builds pinned by digest, immutable agent versions, evaluation as a release gate, manifest-driven promotion across environments, traffic-split canary and rollback, and observability sliced by version) gets Hosted Agents to production and keeps them there.

Build it incrementally. Treat the image digest and the version spec as the deploy artifact, not the source branch. Make evaluation a check the pipeline cares about. Use version weights as your rollout and rollback primitive. And design for the day the platform absorbs the next layer of plumbing, so that when it does, your work moves up the stack instead of getting thrown away.

Evaluate before you ship: introducing the Voice Live Evaluation Harness
You've built a voice agent on Azure Voice Live. It demos beautifully. Then a teammate asks the question that keeps every voice-agent team up at night: "How do we know it's actually good, across 200 customer calls, not the three we just listened to?"

Until today, the honest answer was: put on headphones. Manual listening. Subjective scoring in a spreadsheet. No baseline, no regression signal, no way to defend a model swap with data.

We're releasing the Voice Live Evaluation Harness to change that. It's an open-source, deployable evaluation pipeline that runs pre-recorded multi-turn audio through your Voice Live agent and scores every turn with the same evaluators built into Microsoft Foundry: automatically, repeatably, and in parallel.

TL;DR

- Two flavors, one repo. Run the CLI harness locally against a Foundry project for fast iteration, or deploy the evaluation agent into your Azure subscription with the Azure Developer CLI (azd) for a fully-hosted evaluation backend.
- 13 built-in evaluators score every turn (intent resolution, task adherence, task completion, response completeness, tool-call accuracy, groundedness, and more), viewable per-turn and in aggregate inside the Foundry portal.
- Supports the three Voice Live modes you actually ship in (Semantic VAD, Push-to-Talk, and Foundry Agent mode), including multi-turn conversations with tool calls and grounding.
- Grows with your agent. Start with the sample datasets, then layer in audio collected from user testing and production traffic so your evaluation set matures alongside the agent.

Repo: microsoft-foundry/voicelive-evaluation · Docs: Evaluate Voice Live agents (preview)

Why systematic evaluation matters for voice agents

Text agents have a mature evaluation story. Voice agents don't, and the gaps actually matter more, because every voice failure happens in real time, in front of a customer, on a phone line you can't easily replay.
The Voice Live Evaluation Harness closes that gap with four concrete capabilities:

- Establish a quality baseline. Run a representative audio dataset through your agent and get scores you can publish as your launch bar.
- Compare configurations side-by-side. Swap the underlying model (GPT-Realtime 1.5, Azure-Realtime, MAI-Transcribe-1.5), change the voice, tune VAD thresholds, and see exactly which knobs moved which scores.
- Catch regressions before users do. Wire it into CI and fail the build when intent resolution drops below your threshold.
- Optimize with data, not vibes. When task-completion drops, drill into the per-turn scores to see whether the agent failed to call the right tool, misunderstood intent, or generated an incomplete response.

Keep iterating as production data rolls in. Start with the sample datasets, then grow your evaluation set with audio captured from internal testing, pilot users, and real production traffic. Re-run after every prompt tweak or model swap so the harness becomes a continuous quality signal, not a one-time launch checklist.

How it works

The pipeline is a five-stage loop:

1. Audio Dataset. Multi-turn audio + expected behaviors in a simple JSONL schema. Four sample datasets ship in the repo (travel planning, complex data analytics, tool-calling tests, batch multi-conversation) so you can run end-to-end on day one.
2. Voice Live API. Pick your Voice Live mode (Semantic VAD, PTT, or Foundry Agent), model, voice, and turn-detection settings via a JSON config file, then stream each turn of audio through the API: locally with the CLI harness, or, if you've deployed the evaluation agent, via the hosted Container App for long-running batches in your own subscription.
3. Transcript + Response. Every turn produces an agent transcript, the model's response, and any tool calls it made, all captured automatically for scoring.
4. Foundry Evaluators.
   13 built-in evaluators, powered by the same Foundry evaluator models (GPT-4.1-mini and o4-mini) used across Microsoft Foundry, judge every turn on intent resolution, task adherence, tool-call accuracy, groundedness, and more.
5. Quality Scores. Per-turn and aggregate scores land in the Microsoft Foundry portal under your project's Evaluation tab: sortable, filterable, comparable across runs.

Then loop. Audio captured from internal testing, pilots, and production traffic feeds back into the dataset; each pass makes the next evaluation more representative of what users actually do.

What gets measured

The accelerator ships 13 built-in evaluators out of the box, covering the dimensions that matter most for production voice agents:

| Category | Evaluators |
|---|---|
| Intent & task quality | Intent Resolution · Task Adherence · Task Completion · Response Completeness |
| Tool calling | Tool Call Accuracy · Tool Call Parameter Validity · Tool Result Usage · Tool Call Success |
| Content quality | Groundedness · Relevance · Fluency · Coherence |
| Conversational dynamics | Turn-taking quality |

Every evaluator runs against the same Foundry evaluator models (GPT-4.1-mini and o4-mini) that power evaluation across the rest of Microsoft Foundry, so your voice-agent scores are directly comparable to your text-agent scores.

Run the CLI locally against your existing Voice Live endpoint

If you already have a Voice Live agent deployed and just want fast iteration on a laptop:

    git clone https://github.com/microsoft-foundry/voicelive-evaluation.git
    cd voicelive-evaluation/evaluation_harness
    python -m venv .venv && source .venv/bin/activate
    pip install -r requirements.txt
    cp .sample_env .env
    # Edit .env with your AZURE_VOICELIVE_ENDPOINT
    python voice_agent_evaluation.py \
      --config configs/sample_vad_realtime.json

The full walkthrough (dataset schema, configuration reference, score interpretation, and troubleshooting) is in the documentation.
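For the CI regression gate mentioned earlier ("fail the build when intent resolution drops below your threshold"), here is a minimal sketch of the aggregation step. It assumes you have already exported per-turn scores into a list of dictionaries. That input shape, the evaluator key names, and the score scale are hypothetical and are not the harness's documented output format.

```python
from statistics import mean


def aggregate(turn_scores: list[dict[str, float]]) -> dict[str, float]:
    """Average each evaluator's score across the turns that reported it."""
    names = {name for turn in turn_scores for name in turn}
    return {
        name: mean(turn[name] for turn in turn_scores if name in turn)
        for name in sorted(names)
    }


def below_threshold(aggregates: dict[str, float],
                    thresholds: dict[str, float]) -> list[str]:
    """Return evaluators whose aggregate is under its floor (missing counts as failing)."""
    return [
        name for name, floor in thresholds.items()
        if aggregates.get(name, float("-inf")) < floor
    ]


# Hypothetical per-turn scores for a two-turn conversation.
turns = [
    {"intent_resolution": 4.0, "task_adherence": 5.0},
    {"intent_resolution": 3.0},
]
aggregates = aggregate(turns)
failed = below_threshold(aggregates, {"intent_resolution": 4.0})
print(aggregates, failed)
```

In a CI job, a non-empty `failed` list would translate into a non-zero exit code.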

Get started

Repo: microsoft-foundry/voicelive-evaluation
Docs: How to evaluate Voice Live agents (preview)

We'd love your feedback: try it, file issues, and tell us which evaluators you wish you had.

GitHub Action for Deploying Hosted Agents
Introduction

Microsoft's introduction of Hosted Agents raises the next logical question: how do you implement this? Organizations need a method that is quick, repeatable, and requires minimal adjustments to their existing tooling and processes. This post walks through how to deploy a Hosted Agent through a repeatable GitHub Action. If this is new to you, this blog is a follow-up to Deploying Foundry Hosted Agents via REST API | Microsoft Community Hub.

Before You Start

This action assumes the following are already in place in the workflow that calls it:

- An existing Microsoft Foundry project with a deployed model.
- A container image already pushed to Azure Container Registry (ACR).
- An identity with the **Foundry User** role on the Foundry project. See [hosted agent permissions](https://learn.microsoft.com/en-us/azure/foundry/agents/concepts/hosted-agent-permissions) for the full permissions reference.
- A runner with `az`, `jq`, and `python3` installed. This is true on `ubuntu-latest`; if you self-host, install them explicitly.
- `azure/login` configured in the caller workflow **before** this action runs.

⚠️ **Identity prerequisite.** This action assumes `azure/login` has already run in the caller workflow and that the resulting identity holds a Foundry data-plane role (e.g., Foundry User). Without that, `az account get-access-token` will fail before the REST call is made.

Requirements

To meet the requirements above (a quick, repeatable deployment process with minimal adjustments to existing tooling), we will use a GitHub Action and Bash. The Bash script takes a series of arguments that are used to call the REST API.

The action requires four inputs: `project_endpoint`, `agent_name`, `image`, and `model_deployment_name`. The example pipeline wires these from the outputs of a preceding IaC step, but the action itself takes plain strings. These strings can come from any tool that can hand them off as workflow inputs.
This keeps it flexible and limits adjustments to existing CI/CD processes.

If interested, you can use the Azure Developer CLI (`azd up`) command, which is documented in Microsoft's official examples and on MS Learn. This blog does not cover it because most enterprise customers already rely on tooling other than `azd`.

You could also use the `azure.ai.projects` library to create an agent. This blog does not go down that route because not all organizations have adopted the philosophy of allowing application code to create underlying compute infrastructure. Additionally, some organizations want teams other than developers to control and set the size of the Micro VM (referred to as the "sandbox" in the Foundry docs) that the Hosted Agent runs on.

If your organization does not use GitHub Actions, this step should be reproducible in Azure DevOps using the Bash task.

Deployment Steps

To do this properly, let's take a step back and look at a CI/CD workflow for an agent whose definition is stored in a container. Ideally a pipeline should follow the steps outlined in CI/CD for AI Agents on Microsoft Foundry. Those pipelines typically take the shape build/push → IaC → update agent → smoke test.

Since we are focusing on Hosted Agent deployment via REST API, we will concentrate on the repeatable GitHub Action that deploys the agent. In that workflow, this is the step called "Update agent → Foundry data plane POST `agents/NAME/versions`".

Depending on organizational preference, there may be a need to break out the update agent step into a separate workflow. We traditionally don't recommend this, as keeping everything in one pipeline means one set of failures to triage, one history to read, and one CI/CD surface to keep current. That said, this action is structured to support a split if your release process requires it.

Hosted Agent REST Deployment Action

This is the crux of why the article exists. If you've followed my style of repeatable DevOps process for YAML Pipelines, this action follows similar principles. We will parametrize with defaults to allow minimal configuration while also optimizing for flexibility. To view the full example, check out the Update Foundry Agent action. The Inputs, Outputs, and `runs:` blocks shown below all live in a single file: `.github/actions/update-agent/action.yml`.

Inputs

Here are those parameters with descriptions and defaults:

    inputs:
      project_endpoint:
        description: Foundry project endpoint URL
        required: true
      agent_name:
        description: Name of the hosted agent
        required: true
      image:
        description: Full container image reference (registry/name:tag)
        required: true
      model_deployment_name:
        description: Name of the AI model deployment
        required: true
      cpu:
        description: CPU allocation for the agent container
        required: false
        default: '0.25'
      memory:
        description: Memory allocation for the agent container
        required: false
        default: '0.5Gi'

Verify the latest sandbox sizes at hosted-agents#sandbox-sizes. There is also guidance on right-sizing your Micro VMs. At the time of this writing here are the available combinations:

Outputs

We should output values that make sense for subsequent steps in the workflow. Not every workflow that calls this action will use them, but it's always good to expose non-secret values just in case. In our case we are creating a new version of the agent, so let's output that agent version:

    outputs:
      agent_version:
        description: Version ID returned by the Foundry data plane
        value: ${{ steps.post.outputs.agent_version }}

`agent_version` is the version identifier returned by the data plane. Capture this in your pipeline (artifact, release tag, etc.) so you have an audit trail and a target to re-deploy against if a future version needs to be rolled back.
Subsequent steps in the workflow can reference it via `${{ steps.<step-id>.outputs.agent_version }}`.

Action

As its first step, the action maps the inputs to environment variables. After that we need to get an access token from Azure so we can call the REST API endpoint. Once we have this, we prepare the body of our call. Verify against the API for all valid properties. For our example I chose not to set `rai_config` (Responsible AI overview) and `tools` (function/tool bindings) to keep things simple.

    runs:
      using: composite
      steps:
        - name: Post agent version to Foundry data plane
          id: post
          shell: bash
          env:
            PROJECT_ENDPOINT: ${{ inputs.project_endpoint }}
            AGENT_NAME: ${{ inputs.agent_name }}
            IMAGE: ${{ inputs.image }}
            MODEL_DEPLOYMENT_NAME: ${{ inputs.model_deployment_name }}
            CPU: ${{ inputs.cpu }}
            MEMORY: ${{ inputs.memory }}
          run: |
            FOUNDRY_TOKEN=$(az account get-access-token \
              --resource "https://ai.azure.com/" \
              --query accessToken -o tsv)

            AGENT_REQUEST_BODY=$(jq -n \
              --arg cpu "$CPU" \
              --arg memory "$MEMORY" \
              --arg model "$MODEL_DEPLOYMENT_NAME" \
              --arg image "$IMAGE" \
              '{
                definition: {
                  kind: "hosted",
                  container_protocol_versions: [{protocol: "responses", version: "1.0.0"}],
                  cpu: $cpu,
                  memory: $memory,
                  environment_variables: {AZURE_AI_MODEL_DEPLOYMENT_NAME: $model},
                  image: $image

⚠️ **Heads up on logs.** The line that echoes `HTTP ${HTTP_STATUS}: $(cat /tmp/agent_response.json)` dumps the full response body to the job log. If your request body contains sensitive `environment_variables`, the API may return them in the response, where they will appear in plain text in the workflow log. Either scrub the response before echoing, or echo only the `version` field on success.

A 2xx response confirms the data plane accepted the new agent version. Confirming the agent behaves as intended is a separate step, typically done with a smoke test against the deployed agent in a later workflow job.
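The log-scrubbing advice above can be sketched in Python, which is already on the runner. This is a hedged illustration, not part of the published action: the function name `summarize_response` is hypothetical, and it assumes the response mirrors the request shape, with a top-level `version` field and variables under `definition.environment_variables`.

```python
import json


def summarize_response(raw_body: str) -> str:
    """Return a log-safe, one-line summary of an agent-version response.

    Assumption: the response carries `version` at the top level and echoes
    `definition.environment_variables` back, as the request body does.
    """
    body = json.loads(raw_body)
    env = body.get("definition", {}).get("environment_variables", {})
    safe = {
        "version": body.get("version"),
        # Keep variable names for debugging; never log their values.
        "environment_variables": sorted(env),
    }
    return json.dumps(safe)


# Hypothetical response body containing a sensitive value.
raw = json.dumps({
    "version": "3",
    "definition": {
        "environment_variables": {"API_KEY": "secret", "LOG_LEVEL": "info"},
    },
})
summary = summarize_response(raw)
print(summary)
```

Logging the variable names but not their values keeps enough context to debug a misconfigured version without leaking secrets into the job log.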

If something goes wrong, the most common failures are:

- 401/403: `azure/login` didn't run, the identity is missing a Foundry data-plane role, or the wrong subscription is selected. Check the `azure/login` step and confirm the identity holds **Foundry User** (or higher) on the Foundry project (see the *Before You Start* callout above).
- 404: wrong `project_endpoint`, or the agent named in `agent_name` does not yet exist on the project. The agent must exist before posting a new version.
- 400: body or model issue, such as an invalid `cpu` / `memory` shape, a required field missing, or `model_deployment_name` pointing at a deployment that isn't reachable from this project.

Calling the Action

Now that we have the action, how can we scale this across multiple workflows? Simple: we just need to pass in the required parameters. Here is an example, with a stubbed `deploy-iac` step so you can see its outputs passed into the action as inputs:

    - name: Deploy Bicep infrastructure
      id: deploy-iac
      uses: ./.github/actions/deploy-bicep
      with:
        environment_name: ${{ inputs.environment_name || 'main' }}
        location: ${{ inputs.location || 'swedencentral' }}

    - name: Update agent
      uses: ./.github/actions/update-agent
      with:
        project_endpoint: ${{ steps.deploy-iac.outputs.project_endpoint }}
        agent_name: ${{ inputs.agent_name }}
        image: ${{ steps.deploy-iac.outputs.acr_endpoint }}/${{ inputs.image_name }}:${{ inputs.image_tag }}
        model_deployment_name: ${{ steps.deploy-iac.outputs.model_deployment_name }}

And just to show we can call the same action multiple times, here are two examples that do just that: Deploy (Bicep) and Deploy (Terraform).

Conclusion

The composite action shown above gives organizations what the introduction called for: a quick, repeatable way to deploy a Hosted Agent that requires minimal adjustments to the GitHub Actions tooling and processes already in use. With it wired into a workflow, deploying a new Hosted Agent version becomes a standard step in your pipeline.