azure
226 TopicsFoundry Toolkit for VS Code at //build: Hosted Agents End-to-End, a Smarter Toolbox, and More
We’re excited to share what’s new for Foundry Toolkit for Visual Studio Code at //build 2026. Since going generally available, the toolkit has kept moving fast, and this release is a big one. The headline: a complete, end-to-end Hosted Agent experience, scaffold, run, deploy, and observe without ever leaving VS Code. On top of that, we’ve expanded the Toolbox with native enterprise integrations and shipped a wave of LangGraph samples so every developer has a clear path from idea to production. From your first prompt to a production-grade, observable agent, Foundry Toolkit meets you where you are. Hosted Agents, End to End Building an agent is the easy part; getting it from a first draft to a production-grade, observable service is what matters. This release makes the full Hosted Agent lifecycle available in VS Code, and it follows the way you actually work — scaffold, run, deploy, observe. Scaffold — start from a rich set of samples Hosted Agent creation now opens with a refreshed scaffolding experience and a rich sample selection, so you start from a working, framework-appropriate template instead of a blank file. Creation is smarter, too: we auto-select your subscription when there’s only one, gate tabs more clearly, and tightened spacing for a cleaner setup flow. Run (F5) — inspect as you build Press F5 and your agent runs locally with the Agent Inspector, now aligned with the rest of the extension and featuring Copilot SDK visualization so you can see what the Inspector visualizes as the agent executes. It’s the fastest loop from change to verification before anything leaves your machine. Deploy — a new UX and new ways to ship Different teams ship differently, so deployment got a refreshed UX and two new options for Hosted Agents: ZIP Code Deploy: Package your agent source as a ZIP and deploy it directly to Microsoft Foundry Agent Service. Bring-Your-Own-Image (BYOI): Already have a pre-built container in your own Azure Container Registry? Deploy straight from it. Observe — know it works in production Once deployed, the full observability story is now available: Hosted Agent Tracing: Inspect end-to-end traces of Hosted Agent invocations directly from VS Code — tool calls, delegation chains, and timing for real debugging instead of guesswork. Continuous Evaluation Settings: A new page to configure ongoing evaluation for deployed Hosted Agents, so quality is measured continuously — not just at ship time. Evaluations Node: One-click access to evaluation runs and results right from the Foundry project tree. A Smarter, More Connected Toolbox What it is, and why it matters A Toolbox is how your agent gets its capabilities — the curated set of tools, knowledge sources, and integrations it can call at runtime. Instead of hand-wiring each connection, you assemble a Toolbox once and your agent consumes it consistently across local runs and production. The result: agents that can act on real enterprise data and systems, with the connections managed in one place. From what to how: create, connect, consume Create: Start a new Toolbox from the Foundry Toolkit sidebar “Tools Catalog” and pick the capabilities your agent needs. Connect: Configure and wire in enterprise systems through native, first-class connections once, and use it for all your agents. Consume: Reference the Toolbox from your Hosted Agent so its tools are available the moment the agent runs, locally (F5) and once deployed. New this release Building on that flow, the Toolbox is now richer and more enterprise-ready: WorkIQ as a Built-in Tool: A first-class WorkIQ experience powered by A2A connections — no MCP fallback required. End-to-end toolbox creation with WorkIQ works out of the box. Fabric IQ (OneLake Catalog) Integration: Connect your agents to Microsoft Fabric OneLake catalogs directly from the Toolbox. Toolbox Guardrails: Apply content-safety guardrails to your Toolbox for safer agent execution. Faster discovery: A new Toolbox Search Toggle and Agent Tool Multi-Select let you find and wire in multiple tools in a single action. LangGraph Reaches Parity LangGraph developers, this one is for you. We’ve added five new Hosted Agent samples that bring LangGraph to full parity with the Agent Framework Responses learning path — so you get an equivalent, end-to-end walkthrough no matter which framework you prefer: MCP — tool loading from a remote MCP server (defaults to GitHub Copilot MCP) via MultiServerMCPClient. Workflows — a custom StateGraph chaining three specialized LLM nodes: slogan writer, legal reviewer, and formatter. Files — local filesystem tools plus the Foundry-Toolbox code_interpreter working over session-uploaded files. Human-in-the-Loop — a StateGraph that drafts a proposal and pauses for approval via langgraph.types.interrupt. Observability — GenAI OpenTelemetry tracing with enable_auto_tracing(); spans, metrics, and logs flow to Application Insights. We’ve also refreshed the existing bring-your-own LangGraph samples against the new hosting layer (chat with local tools, Foundry-managed Toolbox loading, and SSE-streamed multi-turn sessions backed by a MemorySaver checkpointer), so every sample reflects how Hosted Agents work today. Polish Across the Board A release is more than headline features. This one also includes a redesigned Prompt Builder “Improve an Instruction” dialog for faster iteration, fixes for MCP toolbox tool icons, clearer ZIP-deploy error surfacing, and assorted Agent Builder and Playground regression fixes — the whole experience feels tighter end to end. Get Started Today Install: Foundry Toolkit on the VS Code Marketplace Quick Start: Follow our getting-started tutorial to build your first Hosted Agent Deep Dive: Explore the documentation, samples, and LangGraph parity walkthroughs Join the Community Share your projects, file issues, or suggest features on our GitHub repository. We can’t wait to see what you build. Welcome to the next chapter of AI development!DevOps for Microsoft Hosted Agents: From Terraform Apply to Production-Grade Agent Delivery
A companion piece to Infrastructure as Code for AI: Building and Deploying Microsoft Hosted Agents with Terraform. Just announced — source-code deploy (preview). Foundry has just added a second Hosted Agent deploy path alongside the container path this post covers. Instead of a container image, you upload a .zip of your source plus a requirements.txt (Python 3.13 / 3.14) or a .csproj (.NET 10), and the Agent Service either builds dependencies for you ( remote_build ) or runs a prebuilt bundle ( bundled ). The version definition uses code_configuration instead of container_configuration — the two are mutually exclusive on a given version. Versioning is content-addressable on the zip's SHA-256, so the dedup behaviour described below still applies. Required roles shift slightly: deploying the agent needs Foundry Project Manager at project scope, and the platform-assigned agent identity gets Foundry User (both handled automatically by azd and the Foundry VS Code Toolkit). The DevOps loop in this post — immutable versions, eval gating, manifest-driven promotion, traffic-split canary, per-version observability — transfers directly; only the build-and-push stage changes (no Dockerfile, no ACR for remote_build ). The container path covered here remains fully supported and is still the right choice if you need custom base images, system packages, or non-Python/.NET runtimes. Full details: Deploy a hosted agent from source code (preview). What this post assumes. It describes recommended enterprise DevOps patterns on top of Microsoft Foundry Hosted Agents. Some patterns — evaluation gating, traffic-based rollout, manifest-driven promotion — are best practices and may not be enforced by the platform itself. Hosted Agents and several related capabilities (A2A, certain deployment and routing controls) are in preview and may evolve. TL;DR Terraform provisions the platform: Foundry account, project, model deployment, ACR, App Insights, RBAC. DevOps pipelines ship agent versions, not source branches — the deploy artifact is a container image digest plus an immutable version spec. Evaluation should be treated as a release gate, not a dashboard. Quality regressions should fail the build the same way unit-test failures do. Traffic split between versions is the rollout and rollback primitive. Rollback typically avoids rebuilding or redeploying artifacts. Observability is sliced per version — during canary, two versions serve simultaneously and aggregate metrics lie. The Delivery Pipeline at a Glance Terraform ───► Foundry project (AIServices) + model deployment + ACR + App Insights │ PR opened ▼ └─► docker build ───► push to ACR ───► capture image digest │ ▼ Foundry SDK: create agent version (image digest + cpu/mem + env + protocols) │ ▼ Evaluation gate ────► fail → stop │ ▼ pass Promote via manifest → staging → prod │ ▼ Traffic-split canary (0% → 10% → 100%) │ ▼ App Insights: per-version latency, cost, sampled quality, sandbox sizing Infrastructure as Code gets the platform stood up. It does not, on its own, ship an agent. The gap between terraform apply succeeding and a customer-facing agent reliably serving requests in production is where DevOps lives — and for Microsoft Hosted Agents on Microsoft Foundry, that gap has its own shape. A Hosted Agent is not a prompt and a tool list. It is your own code, packaged as a container image, pushed to Azure Container Registry, and deployed to a Foundry project. The Foundry Agent Service pulls the image, provisions an isolated execution environment per agent session, assigns the agent its own dedicated Microsoft Entra ID (agent identity), and exposes a dedicated endpoint. An agent supports up to four protocols, any of which can be combined in a single deployment: Responses ( .../protocols/openai/responses ) — OpenAI-compatible chat-style API. Implemented in the container. Invocations ( .../protocols/invocations ) — arbitrary JSON in / arbitrary JSON out for webhook receivers and non-conversational workloads. Implemented in the container. A2A ( .../protocols/a2a , preview) — the open Agent2Agent protocol for agent-to-agent delegation across frameworks and vendors. Surfaced on its own endpoint path by the platform. Activity — the Teams / M365 channel protocol. The platform bridges Responses to Activity automatically when an agent is published to a Microsoft 365 channel. Microsoft manages the runtime, scaling, session state, and lifecycle. You ship the image and the version definition. Important — Foundry version compatibility. Hosted Agents are supported on the new Microsoft Foundry project resource model ( azurerm_cognitive_account_project under a Cognitive Services account of kind = "AIServices" ). The older Azure AI Foundry Hub model ( azurerm_ai_foundry / azurerm_ai_foundry_project , kind = "Hub" ) — the Azure ML–derived workspace surface — does not expose Hosted Agent capabilities. They are two distinct Azure resource types with different APIs. Everything in this post assumes the new Foundry project. That shape drives three things every DevOps loop for Hosted Agents has to handle: The deploy artifact is a container image plus an immutable agent version. A version snapshots the image digest, CPU/memory, environment variables, and protocol configuration. To change anything, you create a new version. The platform supports weighted traffic between versions, which is your blue/green and canary primitive. The agent identity is created for you, per agent. You don't pick one or wire managed-identity references manually. Each agent is assigned a dedicated Microsoft Entra ID (agent identity) at deploy time; RBAC to downstream resources is granted to that identity. Quality is non-deterministic. Two terraform apply runs against the same configuration produce identical resources. Two agent runs against the same input can produce different outputs. Your pipeline has to gate on evaluation, not only on tests passing and HTTP 200s. This post lays out an end-to-end DevOps loop on top of that shape: how to structure the repository, what runs in CI versus CD, how to gate releases on evaluation, how to promote across environments, how to use version traffic split for safe rollouts and instant rollback, and what observability is worth wiring beyond the defaults. A Quick Tour of Microsoft Foundry If you've spent more time in Azure OpenAI or AI Studio than in Foundry, a short orientation helps before the DevOps patterns make sense. Microsoft Foundry is Microsoft's unified platform for building, evaluating, deploying, and operating AI applications and agents. It consolidates what used to be spread across Azure OpenAI, Azure AI Studio, and the AI Hub model into a single resource and a single portal at ai.azure.com. Three pieces are worth knowing up front. The resource model Foundry is built on two Azure resources: Foundry account — an azurerm_cognitive_account with kind = "AIServices" , project_management_enabled = true , a custom_subdomain_name , and a managed identity. This is the top-level container: it holds your model deployments (Azure OpenAI and the broader Foundry model catalog), connections to backing services, and the Foundry-managed Toolbox MCP endpoint. Foundry project — an azurerm_cognitive_account_project under that account. A project is the scope for agents, evaluations, conversation history, indexes, and per-app connections. One project per app or per environment is the usual shape. This is the new Foundry model — and it is the only model that supports Hosted Agents. The older Azure AI Foundry Hub ( azurerm_ai_foundry + azurerm_ai_foundry_project , kind = "Hub" ) is a separate Azure ML–derived workspace and cannot host Hosted Agents. The two surfaces look superficially similar in the portal but are distinct Azure resource types with different APIs and feature sets. If a tutorial, sample, or piece of Terraform you find online creates an azurerm_ai_foundry Hub, it is targeting the classic surface and the Hosted Agents APIs ( /agents , agent versions, traffic split, dedicated endpoints) will not be available against it. To use Hosted Agents you must provision a new Foundry account + project as described above. There is no in-place upgrade from a Hub. What Foundry gives you A Foundry project is more than a container. Out of the box it provides: A model catalog and deployment surface — Azure OpenAI models (GPT-4.1, GPT-4o, o-series, embeddings), plus open and partner models, all deployed and invoked through the same project endpoint with the same auth model. Two agent execution modes — prompt-based agents (defined entirely by instructions + tool configuration in the portal, suitable for conversational assistants) and Hosted Agents (your own containerized code, the subject of this post). A managed Toolbox — a project-level MCP endpoint that exposes Foundry-curated tools (Code Interpreter, Web Search, Azure AI Search, OpenAPI, custom MCP, A2A) with consolidated auth. Hosted Agent code connects to the Toolbox using standard MCP client libraries. First-class evaluation — datasets, graders (similarity, LLM-as-judge, safety, groundedness), and evaluation runs as a built-in concept, not a bolt-on. Built-in tracing — OpenTelemetry traces from agents land in a linked Application Insights resource automatically. No manual instrumentation needed to get the basics. Per-agent identity — when you deploy a Hosted Agent, the platform creates a dedicated Microsoft Entra ID (agent identity) for it and gives it a dedicated endpoint. RBAC to downstream resources is granted to that identity. How the pieces line up for Hosted Agents For the rest of this post, the mental model is: Resource group └── Foundry account (Cognitive Services, kind=AIServices) ├── Model deployments (e.g. gpt-4.1) └── Foundry project ├── Hosted Agent: customer-support │ ├── Version v1 (image digest A, 100% traffic) │ └── Version v2 (image digest B, 0% traffic — canary) ├── Hosted Agent: webhook-handler ├── Evaluations ├── Connections (ACR, AI Search, Key Vault…) └── Toolbox (MCP) Terraform provisions the account, project, model deployments, ACR, App Insights, and RBAC. Hosted Agents — images, versions, traffic weights — are managed through azd or the Foundry SDK. That boundary is what the rest of this post automates. The minimal Terraform shape For Hosted Agents you need the new-model shape instead. The skeleton below is the minimum that lets you deploy a Hosted Agent on top of it — storage, Key Vault, monitoring, networking, and OIDC for CI live alongside for more details see Infrastructure as Code for AI: Building and Deploying Microsoft Hosted Agents with Terraform | Microsoft Community Hub. # Foundry account (new model — required for Hosted Agents) resource "azurerm_cognitive_account" "foundry" { name = "ai-${local.name}" resource_group_name = azurerm_resource_group.main.name location = azurerm_resource_group.main.location kind = "AIServices" sku_name = "S0" project_management_enabled = true custom_subdomain_name = "ai-${local.name}" # required for AAD auth identity { type = "SystemAssigned" } } # Model deployment the agent will call resource "azurerm_cognitive_deployment" "gpt" { name = "gpt-4.1" # stable name — agents pin to this cognitive_account_id = azurerm_cognitive_account.foundry.id model { format = "OpenAI" name = "gpt-4.1" version = "2025-04-14" } sku { name = "GlobalStandard" capacity = 10 } } # Foundry project — the scope for Hosted Agents, evals, conversations resource "azurerm_cognitive_account_project" "main" { name = "proj-${local.name}" cognitive_account_id = azurerm_cognitive_account.foundry.id location = azurerm_resource_group.main.location identity { type = "SystemAssigned" } } # Container registry the agent image is pushed to and pulled from resource "azurerm_container_registry" "acr" { name = "acr${replace(local.name, "-", "")}" resource_group_name = azurerm_resource_group.main.name location = azurerm_resource_group.main.location sku = "Standard" admin_enabled = false # use RBAC, not admin user } # The project's managed identity needs to pull the agent image resource "azurerm_role_assignment" "project_acr_pull" { scope = azurerm_container_registry.acr.id role_definition_name = "AcrPull" # use Container Registry Repository Reader if the ACR has ABAC enabled principal_id = azurerm_cognitive_account_project.main.identity[0].principal_id } A few things worth calling out: kind = "AIServices" + project_management_enabled = true + custom_subdomain_name are what make this a new-model Foundry account. Omit project_management_enabled and azurerm_cognitive_account_project will not provision; omit custom_subdomain_name and you lose the Foundry endpoint shape that Entra-authenticated access depends on. azurerm_cognitive_account_project is the new-Foundry project resource. Do not use azurerm_ai_foundry_project — that targets the Hub model and does not host agents. Keep the model deployment name stable. Agent code (and your agent.yaml ) pins to the deployment name, not the model version. Changing the version is safe; changing the name forces a new agent version. The project MI needs ACR pull, not push. CI pushes the image (via its own identity); the platform pulls it on the project's behalf when the agent runs. ABAC-enabled ACR is supported but requires --source-acr-auth-id [caller] on az acr build in your CI script — a common gotcha. A note on the provider. Everything above uses the hashicorp/azurerm provider. Foundry's surface evolves quickly, and you will occasionally hit a property or child resource that AzureRM hasn't caught up with yet — project connections, capability hosts, and some newer agent-related fields are common examples. When that happens, reach for azure/azapi: use azapi_update_resource to patch a missing property on an AzureRM-owned resource, and azapi_resource for resources AzureRM doesn't model at all. Keep AzureRM as the default and use AzAPI as a targeted gap-filler, so you don't fork ownership of mainstream resources. The Hosted Agent Delivery Loop A working delivery loop has five stages. Each maps to a specific artifact, a specific tool, and a specific failure mode. Stage Artifact Tool Primary failure mode Infra provisioning Terraform state terraform apply Quota, RBAC propagation, ACR not reachable Image build & push OCI image in ACR (ACR must remain publicly reachable today) docker build / az acr build Image too large, base image CVEs Agent version create Immutable version (image digest + config) azd or Foundry SDK Bad env var, wrong protocol declared Evaluation Eval dataset + grader Foundry evaluators Quality / safety regression Traffic shift & observe Version weights, App Insights traces Foundry SDK + Azure Monitor Silent quality decay, sandbox over/under-sizing The first stage is where the prior post left off. The remaining four are this post. Infra provisioning assumes the standard pattern: terraform plan runs on every PR as a review gate (posted as a PR comment) and terraform apply runs only on merge to the environment branch. Everything below assumes the platform is already applied. Repository Shape A repository that supports the loop end-to-end looks roughly like this: agent-platform/ ├── infra/ # Terraform from the prior post (AIServices + project) │ ├── modules/foundry-project/ │ └── environments/ │ ├── dev.tfvars │ ├── staging.tfvars │ └── prod.tfvars ├── agents/ │ ├── customer-support/ │ │ ├── Dockerfile │ │ ├── src/ # Agent code (Python or C#) │ │ ├── agent.yaml # Version spec: image, cpu/memory, protocols, env │ │ ├── evals/ │ │ │ ├── dataset.jsonl │ │ │ └── graders.yaml │ │ └── README.md │ └── webhook-handler/ │ └── ... ├── scripts/ │ ├── deploy_agent_version.py # Build → push → create version → optional weight shift │ ├── run_evals.py │ └── promote_version.py # Shifts traffic between versions └── .github/workflows/ ├── infra.yml # Terraform plan/apply ├── agent-pr.yml # Build, push to ACR, deploy candidate version, run evals └── agent-release.yml # Promote a tested version to staging / prod Two deliberate choices. First, infrastructure and agents live in the same repo but in separate top-level directories with separate pipelines. They have different cadences and different reviewers. Second, each agent is its own folder with its own Dockerfile , code, version spec, and eval suite. A single PR touches one agent's directory cleanly; a code-review diff stays focused. The Agent Version as the Deploy Unit A Hosted Agent is deployed as a version. A version is immutable — once created it captures: the container image digest (not just the tag — the digest, so it cannot drift), CPU and memory allocation for the per-session sandbox (e.g. 1 vCPU / 2 GiB), the container protocols the image implements — responses , invocations , or both, environment variables passed to the container at runtime, any other version-scoped configuration (e.g. base model deployment name). The container's container_protocol_versions only declares responses and/or invocations — the two protocols the container itself implements. A2A (preview) is surfaced by the platform on its own endpoint path, and Activity is bridged from Responses automatically when the agent is published to a Microsoft 365 channel. Under the hood, agent versions run on Azure Container Apps with VM-isolated sandboxes, which is also why you may see the term revision in some Container Apps–surfaced APIs and limits — a Hosted Agent version corresponds to one such revision. To change any of those, you create a new version. The platform keeps the old one and shifts traffic between them by weight. This is the primitive you use for canary rollouts and for rollback — both reduce to a traffic-weight change, not a redeploy. An agent.yaml per agent makes the version reproducible from source: # agents/customer-support/agent.yaml name: customer-support container: image: ${ACR_LOGIN_SERVER}/customer-support # digest resolved at deploy time cpu: 1 memory: 2Gi protocols: # container_protocol_versions - responses # add `invocations` here if the container also handles webhook-style payloads env: # The platform automatically injects FOUNDRY_PROJECT_ENDPOINT, # AZURE_AI_MODEL_DEPLOYMENT_NAME, and APPLICATIONINSIGHTS_CONNECTION_STRING # — you only set what's specific to your agent. LOG_LEVEL: info metadata: owner: support-team source_commit: ${GITHUB_SHA} scripts/deploy_agent_version.py is the executable form of this spec. Its job per agent is: Build the container image ( docker build locally, or az acr build server-side for ABAC ACRs). Push to ACR and capture the resulting image digest — not the :latest tag. Resolve environment variables from the target environment's config. Call the Foundry SDK to create a new agent version pinned to that digest. Emit a deployment-manifest.json containing the agent name, version ID, image digest, source commit SHA, and the eval dataset hash used. One gotcha: the platform deduplicates. A create version call with no change to the version parameters (same image digest, same env, same CPU/memory, same protocols) will not produce a new version object. Write the script to treat "no new version returned" as success and reuse the existing version ID in the manifest, not as a failure to retry. That manifest is the cross-pipeline contract. PR pipelines produce one. Promotion pipelines consume one. Rollback consumes a previous one. Evaluation as a Release Gate Foundry ships evaluators (datasets, graders, evaluation runs) as a first-class platform feature. Whether to block a release on their results is a team decision, not a platform mandate — but it is the recommended pattern for any agent serving real users. A pipeline that promotes an agent because the image built, the container started, and the version was created with HTTP 200 will eventually ship a regression that an integration test cannot catch. Treat the eval suite the way you treat unit tests: failures stop the pipeline. A minimal but honest evaluation setup has three pieces. A reference dataset. Twenty to fifty representative scenarios is enough to start. Each row is an input plus either a reference answer, a set of must-include facts, or a rubric. Store as JSONL alongside the agent: {"id":"refund-1","input":"How do I get a refund for order 12345?","must_include":["return window","14 days","original payment method"]} {"id":"escalate-1","input":"This is the third time my package is late.","rubric":"Agent should acknowledge, apologize, offer escalation, not promise compensation."} Graders. Foundry's evaluators library ships templates — exact match, similarity, LLM-as-judge for rubric scoring, and built-in safety and groundedness graders. Pick what matches your dataset shape. LLM-as-judge is the workhorse for open-ended responses; pin its model deployment explicitly so the grader itself does not drift between runs. Thresholds. Decide what "passing" means before the first run. A common pattern: Hard floor on safety / groundedness — any regression fails the build. Relative threshold on quality — no more than X% drop versus the last known-good version. Absolute floor on must-include coverage — for example ≥ 90%. Wire it into the PR pipeline: # .github/workflows/agent-pr.yml (excerpt) - name: Build, push, and create candidate version run: | python scripts/deploy_agent_version.py \ --agent customer-support \ --project $EVAL_PROJECT \ --version-suffix pr-${{ github.event.number }} \ --traffic 0 # create the version, do not route traffic yet - name: Run evaluations against candidate endpoint run: | python scripts/run_evals.py \ --agent customer-support \ --version pr-${{ github.event.number }} \ --baseline last-known-good \ --fail-on-regression The PR creates a candidate version with zero traffic weight against a long-lived "eval" Foundry project, runs evaluations against the candidate version's dedicated endpoint, and then deletes the candidate version on PR close. A standing eval project beats a per-PR Foundry project — provisioning a project per PR is slow and adds RBAC overhead that does not earn its keep. Environment Promotion Three environments is the floor: dev , staging , prod . Each is its own Foundry project, ideally its own Foundry account in its own resource group. What promotes between them is the image digest and the version spec — not source code, and not "redeploy from main." A workable model: dev — every push to a feature branch builds an image and creates a dev version. Loose evaluation thresholds. Used for human poking and end-to-end debugging. staging — merges to main create a staging version. Full eval suite, strict thresholds. Same sandbox sizing, same env vars, same protocols as prod. prod — manually approved promotion from staging. Promotion script reads the staging manifest, finds the image digest that passed, and creates the prod version pointed at that exact digest. No rebuild. The "same digest" rule is the recommended pattern for safe promotion. If staging passed evaluations on customer-support@sha256:abc… running gpt-4.1 , prod should get that exact image. Re-building from main in the prod pipeline reintroduces the risk you spent staging trying to eliminate — a different base-image patch level, a different transitive dependency, a different build clock — even though nothing in your source changed. GitHub Actions environments make the approval concrete: jobs: promote-prod: needs: deploy-staging environment: production # requires reviewer approval runs-on: ubuntu-latest steps: - name: Create prod version from staging manifest run: | python scripts/deploy_agent_version.py \ --agent customer-support \ --project $PROD_PROJECT \ --from-manifest staging-manifest.json \ --traffic 10 # canary at 10% The canary weight is the second half of safe promotion: create the prod version, give it a small fraction of traffic, watch the App Insights traces, then shift the rest with promote_version.py . Traffic-Split Rollout and Instant Rollback Weighted version traffic changes the rollback model entirely. Rollback typically avoids rebuilding or redeploying artifacts — the previous version is still there, ready to take traffic. A typical canary flow: Create new version v42 at 0% traffic. Endpoint exists; no production calls reach it. Shift to 10%. Observe for an hour or a day, depending on traffic volume. Shift to 50%, then 100%. Old version stays at 0% but is not deleted. After a stability window (commonly a week), delete the previous version to free quota. Rollback is the reverse: shift weights back to the previous version. It is a control-plane call, not a deploy. The agent's endpoint URL does not change, sessions in flight continue on whichever version they started on, and new sessions land on whatever the weights say. Two consequences worth internalizing: Keep at least the last two known-good versions live. Rollback is only as fast as your ability to flip weights to a version that already exists. Do not skip the canary step under deadline pressure. A 0%→100% cutover gives you the same blast radius as a non-canaried deploy. The platform supports incremental rollout; use it. For a destructive change — a removed protocol, a renamed agent, an env var the previous version cannot tolerate — rollback may not be safe. Forward-fix is the answer. Identify those changes in PR review and require an explicit "rollback path: forward-fix" note in the PR. Handling Model Version Changes A model deployment bump is the highest-blast-radius runtime change you can make to a Hosted Agent: the agent's behaviour on every input can shift. Treat it like a dependency upgrade. Open a PR that changes only the AZURE_AI_MODEL_DEPLOYMENT_NAME (or the model version on the deployment, via Terraform). Build a new image if needed, create a new agent version, run the full eval suite at 0% traffic. Run a larger regression dataset if you have one. Require a human reviewer who is not the PR author. Promote through staging, then canary in prod for at least one business day before shifting full traffic. If the new model is faster or cheaper, the temptation is to skip steps. Don't. A quality regression in prod almost always costs more than a careful upgrade. The Terraform side is small: openai_model_version is a variable on the azurerm_cognitive_deployment . Terraform recreates the deployment if the version changes. The Hosted Agent picks up the new deployment the next time it calls the model — if you kept the deployment name stable, which is your contract with the agent code. If you change the deployment name as well, the agent needs a new version that knows the new name. Observability That Actually Tells You Something The platform injects an Application Insights connection string into every Hosted Agent container as an environment variable. Agents that use the protocol libraries emit OpenTelemetry traces by default. That gives you per-request latency, token counts, tool invocations, and conversation IDs out of the box. That is the floor. Add to it: Custom span attributes on every request. Agent name, agent version ID, image digest (short), model deployment name. Without these, post-incident analysis cannot tell you which version was live when a problem started — especially during a traffic-split rollout where two versions are serving simultaneously. Quality signal capture. Sample a percentage of production conversations into a queue for offline grading. Run the same graders you used in CI against that sample on a schedule. This is your drift detector for response quality. Sandbox right-sizing signals. Hosted Agents bill on the CPU/memory you allocate per session. Oversizing multiplies cost by your concurrency. Track CPU and available memory inside the sandbox and compare against the version's allocation — if peaks stay below ~50%, the next version should drop a tier; if they push above ~70%, raise it. Right-sizing is a per-version decision because versions are immutable. Per-version error and latency. Slice every standard metric by version ID. A canary that looks fine in aggregate can be quietly worse than the previous version on specific request shapes. Cost dimensions. Tag traces with customer_id or tenant_id if you have multi-tenancy. Aggregating session cost by tenant in App Insights is straightforward once the dimension is on the span. Alerts on shape, not just rate. A doubling in average response length or a sudden drop in tool invocation frequency often precedes a quality regression that error-rate alerts will miss entirely. A weekly "agent health" report in your team channel — pulling these App Insights queries together — beats a perfect dashboard nobody opens. A Pragmatic Maturity Path Most teams cannot build the whole loop on day one. A reasonable order: Infrastructure in Terraform. AIServices account, project, model deployment, ACR, App Insights, role assignment so the project MI can pull from ACR. First agent deployed manually with azd . Just to prove the round trip end to end. agent.yaml plus a deploy script that builds, pushes by digest, and creates a version. One environment. Three environments with manual promotion by manifest. A 20-row eval dataset with one grader, run on every PR. Advisory only at first. Eval as a blocking gate. Thresholds tuned from the advisory phase. Canary rollout via traffic split. Versions held live for a stability window before deletion. Production sampling into offline evaluation. Drift detection. Model version upgrade playbook. Documented, exercised once on a low-risk agent. Tested rollback via weight shift. The first time you discover a rollback bug should not be during an incident. Each step is independently useful. Skipping ahead — particularly to step 6 without time in step 5 — produces thresholds that block legitimate changes and erode trust in the pipeline. Where This Is Heading The platform is moving. A few things to watch as you build: Declarative Hosted Agent versions in Terraform. AzureRM coverage of Hosted Agents and agent versions is expanding. Parts of the deploy script will collapse into Terraform as that lands. The script-driven approach in this post is the bridge, not the destination. Continuous evaluation as a first-class platform feature. Sampling production traffic into scheduled evals — what you wire by hand today — is moving into the Foundry control plane. Multi-agent composition over A2A. As the A2A endpoint moves from preview to general availability and more frameworks ship A2A clients, multi-agent workflows become a first-class deployment shape. The DevOps loop extends — version pinning between agents, eval at the workflow level, observability across the agent graph — but the manifest grows accordingly. Toolbox-managed tool surfaces. As more tool integrations move behind the project Toolbox MCP endpoint, the agent image gets smaller and the tool configuration becomes a project-level concern. That changes what belongs in agent.yaml versus what belongs in Terraform. The throughline: the more the platform absorbs, the more your job shifts from wiring plumbing to defining policy. What "good" means for your agent, what the quality floor is, who can approve a model upgrade, how fast you can roll back. Those decisions do not get automated away. The pipeline just makes them executable. Conclusion Terraform provisions the Foundry project, model deployment, ACR, and observability. The DevOps loop on top of it — container builds pinned by digest, immutable agent versions, evaluation as a release gate, manifest-driven promotion across environments, traffic-split canary and rollback, and observability sliced by version — gets Hosted Agents to production and keeps them there. Build it incrementally. Treat the image digest and the version spec as the deploy artifact, not the source branch. Make evaluation a check the pipeline cares about. Use version weights as your rollout and rollback primitive. And design for the day the platform absorbs the next layer of plumbing, so that when it does, your work moves up the stack instead of getting thrown away.178Views0likes0CommentsBuilding a GitHub Copilot Agent Usage Dashboard
Introduction Working with organisations that are attempting to create GitHub Copilot custom agents, take-up of these agents by their community becomes important to know. Some questions quickly emerge are "how well are we actually using it?", "which agents are getting used and which have not had that much traction?". Native metrics provide high-level insights into adoption, but they lack the depth needed to answer more granular questions—such as which agent workflows are most used, or how behaviour evolves over time. In this post, I’ll walk through how to build an enterprise-grade GitHub Copilot usage dashboard that captures detailed telemetry from VS Code using OpenTelemetry, processes it in Azure Monitor, and visualises insights in Grafana—all using a reproducible, infrastructure-as-code approach. The dashboard can be made available to anyone that needs it. Architecture VS Code can be configured to emit metrics using Open Telemetry as a standard. This is a configuration item in VS Code and you essentially point it to an Open Telemetry Collector. The collector is an endpoint that can consume the telemetry. In this implementation, it is a container image that is hosted in Azure and I have chosen Azure Container Apps (ACA) for this purpose as it is an easy to use managed environment - but it could also run in Azure Kubernetes Service (AKS) with a little more effort. There is a prebuilt image opentelemetry collector for this and this has been adapted to inject configuration to send the telemetry to Azure Application Insights. For defining and hosting the dashboard, I have chosen another Azure managed service Azure Managed Grafana Sample Dashboard The sample dashboard is one that contains a collection of visualisations derived from the collected data in Application Insights. Azure Managed Grafana allows you to visually author these dashboards or they can be implemented as a JSON file and adapted from there. Note that the telemetry generated by VS Code gives the location of the users - city, region and country, but does not include any personally-identifying information (PII) and so cannot be used to track individuals. As I understand it, this is by design. Managed Grafana has its own permission structure, which may then be used to give users access to the dashboard. Implementation Details There is a GitHub repo Copilot Usage Dashboard that contains details of how to implement this together with instructions for either "click-ops" 🙂creation or via Terraform. So I suggest you follow the link to my repo to look at the details. In summary, there needs to be in Azure: Azure Container App (ACA) that hosts the collector - this needs to have public ingress Azure Container Registry (ACR) that hosts the docker image that is customised via the Dockerfile Key Vault that hosts the Application Insights connection string that ACA references Application Insights - this needs to be created with a flag to allow it to work with Grafana data Log Analytics Workspace that works with Application Insights Azure Managed Grafana to host the Grafana dashboard The main thing to bear in mind is that VS Code needs to be configured to emit OpenTelemetry { "github.copilot.nextEditSuggestions.enabled": true, "github.copilot.chat.otel.enabled": true, "github.copilot.chat.otel.exporterType": "otlp-http", "github.copilot.chat.otel.otlpEndpoint": "https://<fqdn>" } where the FQDN is the URL of the public ingress to the Azure Container App. There is a Dockerfile in this repo that just injects the correct configuration file into the OpenTelemetry collector image. It is this configuration file that tells the collector to emit to Application Insights. It is of the form below: receivers: otlp: protocols: grpc: endpoint: 0.0.0.0:4317 http: endpoint: 0.0.0.0:4318 processors: batch: attributes: actions: - key: environment value: "prod" action: upsert exporters: azuremonitor: connection_string: "${APPLICATIONINSIGHTS_CONNECTION_STRING}" debug: verbosity: detailed service: pipelines: traces: receivers: [otlp] processors: [batch, attributes] exporters: [azuremonitor, debug] metrics: receivers: [otlp] processors: [batch, attributes] exporters: [azuremonitor] As can be seen above, there is a placeholder for the Application Insights connection string - in the ACA configuration this is an environment variable that then points to a secret which is in key vault. If all is well, VS Code will emit telemetry to the container image running in ACA and this will use its configuration to send to Application Insights. The Grafana dashboard then using this data. Troubleshooting The GitHub repo goes into the detail of troubleshooting, but the overall steps to troubleshoot are: If there is no data in Grafana, check that Grafana has access to Application Insights check whether there is telemetry being pushed into Application Insights by looking at the logs and looking for the contents of the table Dependencies. If there is telemetry there, then it is Grafana permissions. If not, look to ACA Look at ACA logs to see if it is healthy and look to see if there is any logs being received Use a curl request to send a fake log to ACA (a sample is in the repo) to see if the ACA is accepting logs Check the connection to Application Insights is correct and is being pulled from key vault or replace the environment variable value with the connection string directly If all good so far, then it may be that the configuration in VS Code is not correct or in the correct place. Hopefully the more detailed steps will resolve any issues quickly. Further thoughts and enhancements This implementation attempts to build a dashboard showing GitHub Copilot agent usage using a standard set of security controls, but more may be needed. Here is a list of possible enhancements: A more refined dashboard. This should be easy as there are samples for all sorts of visualisations and few of these may allow more focus on agent and model usage. the ACA-hosted OpenTelemetry collector has a public-facing ingress. This may need to be locked-down at the network level by address restriction or by a non-public ingress. Care would need to be taken to make sure that this is then visible/reachable to the intended VS Code user audience The ACA collector endpoint is not authenticated in of itself. This could be achieved at the container level by putting an authenticating proxy in the Dockerfile or at the ACA ingress level. Some investigation would be needed to see how the VS Code configuration could work with this and this may dictate largely what form this authentication can take. How the VS Code configuration changes can be automated for a user base has not been investigated as part of this work. It is assumed that an organisation may be able to roll-out these changes using their application deployment automation. Summary This approach provides a means by which an organisation can track the usage of GitHub Copilot agents (and their models), that is not provided by GitHub Enterprise dashboards. This will provide insights into the take up of custom agents and their underlying models - allowing an organisation to test whether their investments on custom agents are being used effectively. Additionally, the dashboards themselves can easily be rolled-out to a wider community than GitHub Enterprise one.Building and Operating a Microsoft Foundry Hosted Agent with GitOps and GitHub Tasks
The Gap Between Prototype and Production Most AI engineering teams can build a working agent in a day. The hard part is not building it; the hard part is operating it. Prompts drift. Tool configurations change without review. Deployments happen from someone's laptop. There is no audit trail, no rollback plan, and no consistent way to promote a change from a development environment to production. GitOps closes that gap. By treating your agent definition, configuration, and infrastructure as version-controlled source code, you get the same delivery discipline that software engineering teams have applied to application code for years. Every change is reviewed, every deployment is automated, and every environment state is traceable to a specific commit. This post shows you how to apply GitOps principles to a Microsoft Foundry Hosted Agent using GitHub as the source of truth and GitHub Tasks and Actions as the automation layer. The result is a repeatable, governed, production-ready delivery model for AI agents. What Is a Microsoft Foundry Hosted Agent? Microsoft Foundry is Microsoft's platform for building, deploying, and operating AI applications and agents. A Hosted Agent is an agent runtime managed by the Foundry platform rather than self-hosted by your team. You supply the agent logic, configuration, and tools; Foundry handles the runtime lifecycle, scaling, and managed infrastructure. In practical terms, a Foundry Hosted Agent is a containerised agent application. You package your agent code, prompt definitions, tool bindings, and environment configuration into a container image. Foundry deploys and manages that container within a Foundry project, connected to models, tools, and observability infrastructure that the platform provides. Teams choose Hosted Agents over self-hosting because: The platform manages runtime infrastructure, patching, and scaling Integration with Azure AI models, managed identity, and observability is built in You can focus engineering effort on agent logic rather than cluster management Foundry projects provide environment and resource isolation without requiring you to provision and manage separate Azure resources for each environment Hosted Agents are a good fit when your team wants strong operational support with minimal platform overhead, when you need clear separation between environments, and when your agents depend on Azure AI capabilities such as Azure OpenAI Service, Azure AI Search, or Model Context Protocol integrations. Why GitOps Matters Specifically for AI Agents GitOps is straightforward for stateless web services: the code changes, the pipeline runs, the container is deployed. AI agents are more complex because there are multiple distinct artefacts that all affect agent behaviour: System prompts and instruction files Tool definitions and external integrations Model selection and configuration (temperature, max tokens, safety settings) Model Context Protocol (MCP) server definitions Orchestration logic and agent workflow code Safety and policy settings Infrastructure and deployment configuration Any one of these can change the behaviour of your agent in ways that are difficult to detect without structured review. A prompt change that looks harmless can alter tone, scope, or factual grounding. A tool configuration change can expose data to unintended callers. A model upgrade can shift response quality unpredictably. Git gives you a single place to version, review, and approve all of these artefacts together. Pull requests give you a structured review gate. Workflow automation gives you validation before anything reaches a deployed environment. Tags and releases give you deployment markers you can roll back to. The discipline of GitOps turns what is often an ad-hoc AI delivery process into a repeatable engineering practice. Reference Architecture The following diagram shows a practical reference architecture for delivering a Microsoft Foundry Hosted Agent through a GitOps model using GitHub. +---------------------------+ | GitHub Repository | | /src /agents /tools | | /prompts /infra | | /.github/workflows | +---------------------------+ | | Pull Request / Push to main v +---------------------------+ | GitHub Actions | | 1. Validate agent config | | 2. Lint and scan code | | 3. Run unit tests | | 4. Build container image | | 5. Push to registry | +---------------------------+ | | Image tag (SHA or semver) v +---------------------------+ | Azure Container Registry | | myregistry.azurecr.io | | my-agent:<sha> | +---------------------------+ | +------+------+ | | v v +----------+ +----------+ | Foundry | | Foundry | | Dev | | Test | | Project | | Project | +----------+ +----------+ | Approval gate (GitHub env) | v +----------+ | Foundry | | Prod | | Project | +----------+ | v +---------------------------+ | Observability | | Azure Monitor / App | | Insights / Foundry Logs | +---------------------------+ Key design decisions in this architecture: The GitHub repository is the single source of truth for all agent artefacts No human deploys directly to any Foundry project; all changes flow through automation Environment promotion requires a GitHub environment approval, creating a governance gate The container image is built once and promoted across environments; the image is not rebuilt per environment Secrets are stored in Azure Key Vault and accessed by the Foundry agent at runtime via managed identity Figure: GitOps delivery pipeline stages from commit to production Repository Structure A well-structured repository separates agent logic from infrastructure and tooling from prompts. The following structure works well in practice: my-foundry-agent/ ├── .github/ │ ├── workflows/ │ │ ├── validate.yml # Runs on every PR │ │ ├── build-deploy.yml # Runs on merge to main │ │ └── rollback.yml # Manual trigger workflow │ └── CODEOWNERS # Review assignments by path ├── src/ │ ├── agents/ │ │ ├── agent.py # Agent entry point and orchestration │ │ └── agent_config.json # Agent metadata and settings │ ├── tools/ │ │ ├── search_tool.py # Tool implementations │ │ └── data_tool.py │ └── prompts/ │ ├── system.txt # System prompt (versioned as plain text) │ └── instructions.txt # Supplementary instructions ├── tests/ │ ├── unit/ # Unit tests for tools and logic │ ├── integration/ # Integration tests against a running agent │ └── smoke/ # Post-deployment smoke tests ├── infra/ │ ├── main.bicep # Foundry project and resource definitions │ └── environments/ │ ├── dev.parameters.json │ ├── test.parameters.json │ └── prod.parameters.json ├── scripts/ │ ├── validate_agent.py # Config validation script │ └── smoke_test.py # Smoke test runner ├── Dockerfile # Container image definition └── docs/ └── architecture.md # Architecture and runbook documentation What belongs where and why: /src/prompts - System prompts as plain text files. Versioning prompts as files means every change goes through a pull request with a diff review, just as code does. /src/agents - Agent orchestration logic and configuration. Keeps the entry point and agent metadata co-located. /src/tools - Tool implementations separated from agent logic. Tool logic changes independently and should be reviewable in isolation. /infra - Infrastructure as code with per-environment parameter files. Environment-specific values live here, never in source files. /tests - Three layers of testing: unit tests for tools, integration tests for the full agent, and smoke tests that run against a deployed environment. /.github/workflows - All automation defined as code. There should be no manual deployment steps that live outside this directory. GitHub Tasks Across the Delivery Lifecycle GitHub Tasks and Issues provide the work tracking layer on top of the GitOps delivery model. Used well, they connect the intention behind a change to its implementation and deployment history. Practical patterns for using GitHub Tasks with agent delivery: Prompt change task - Open an issue to describe why the system prompt is changing. The pull request that changes system.txt closes that issue, creating a permanent link between the rationale and the diff. Tool integration task - When adding a new MCP server or external tool integration, create a task that captures the design decision, security review outcome, and test evidence before the pull request is merged. Model upgrade task - When upgrading the underlying model version, create a task that includes evaluation results and comparison data. The task becomes part of your change audit trail. Rollback task - If a deployment causes quality regressions, create a task to track the rollback, root cause investigation, and corrective action. Automation can open this task automatically when a deployment fails health checks. Dependency on approval - GitHub Tasks can be linked to environment approvals in GitHub Actions. A task in a specific milestone or project column can gate a promotion workflow. The key insight is that GitHub Tasks are not just work management; they are part of your audit trail. A regulatory or security reviewer can follow the chain from a production deployment back through workflow runs, pull request reviews, and the original task that described the intent of the change. End-to-End GitOps Flow The following walk-through describes a realistic developer experience for changing an agent prompt and promoting it to production. A developer opens a GitHub Issue describing the prompt change required and the expected behaviour improvement. The developer creates a feature branch, edits src/prompts/system.txt , and updates any related unit tests. A pull request is opened. The validate workflow runs immediately, checking prompt length, configuration schema, and lint rules. Unit tests run against the changed files. A code reviewer approves the pull request. The CODEOWNERS file ensures that prompt changes require review from the AI engineering team, not just any contributor. On merge to main, the build workflow runs: the container image is built with the new prompt baked in, tagged with the commit SHA, and pushed to Azure Container Registry. The deployment workflow deploys the new image to the Foundry Dev project automatically. Integration and smoke tests run against the deployed dev agent. If tests pass, the workflow pauses at the Test environment gate and requests approval from a named reviewer. After approval, the same image is deployed to Foundry Test. Smoke tests run again. A second approval gate controls promotion to Foundry Prod. If at any point a health check or smoke test fails, the rollback workflow redeploys the previous image tag from the registry. The image tag of the last known-good deployment is stored as a GitHub environment variable. This flow means that no human ever deploys directly to any environment. Every environment state is traceable to a specific commit, image tag, and workflow run. Security and Governance AI agents often have access to sensitive data and external systems. Security and governance cannot be an afterthought. Identity and Access Use managed identity for the Foundry Hosted Agent to access Azure resources. Avoid service principal secrets where Microsoft Entra Workload Identity or managed identity is available. Apply the principle of least privilege: the agent identity should have read access to data sources and limited write access only where the use case requires it. Tool integrations that require API keys or external credentials should retrieve them from Azure Key Vault at runtime, never from environment variables baked into the image. Secrets and Configuration Store secrets in Azure Key Vault. Reference them in your Foundry project configuration using Key Vault references. Store GitHub Actions secrets using repository or environment-scoped secrets. Never echo secrets in workflow logs. Separate environment configuration (endpoints, resource names, capacity settings) from agent logic. Use the /infra/environments/ parameter files for this. Auditability and Review Enforce pull request reviews for all changes to /src/prompts , /src/agents , and /infra via CODEOWNERS. Require status checks to pass before merging. Blocked merges prevent untested changes reaching production. GitHub's workflow run history gives you a complete deployment audit trail. You can answer "what was deployed to prod on Tuesday and who approved it" in seconds. For regulated environments, consider branch protection rules that require signed commits. Safe Rollout Use canary or blue-green patterns where Foundry supports them for high-traffic agents. Always keep the previous image tag available in the registry. Do not delete images on deployment. Document and test your rollback procedure before you need it in production. Observability and Operational Readiness A deployed agent that you cannot observe is an agent you cannot operate. Build observability in from the start. What to Monitor Deployment health - Track whether each Foundry deployment succeeded and the agent is responding. Wire deployment outcomes back to GitHub workflow run status. Model and tool errors - Log tool call failures, model timeout errors, and safety filter activations. Aggregate these in Azure Monitor or Application Insights. Latency - Track end-to-end response latency per agent version. A latency increase after a model or prompt change is an early signal of a quality regression. Token consumption - Monitor token usage per request and per session. Unexpected increases can indicate prompt injection or runaway orchestration loops. Traceability - Log which agent version handled each request. Correlation between the image tag and request traces is essential for debugging production issues. Debugging and Alerting Use structured logging with a consistent schema. Include fields for agent version, session ID, tool called, and outcome. Set up alerts for error rate thresholds and latency percentiles. Alert before users notice the problem. For failed agent runs, ensure logs capture the full conversation context (within your data retention policy) so that developers can reproduce and diagnose the failure. Microsoft Foundry Toolboxes One of the most important additions to the Foundry platform is Toolboxes, currently in Public Preview. If you have ever seen an agent codebase where three different agents each wire the same search tool with their own credentials and slightly different configurations, you already understand the problem Toolboxes solve. A Toolbox is a named, versioned bundle of tools managed centrally in Microsoft Foundry. You define the tools once, configure authentication and access centrally, and publish a single MCP-compatible endpoint. Any agent in any runtime consumes that endpoint without per-tool wiring, custom SDK integration, or duplicated credential management. Figure: Before and after Foundry Toolboxes. Each agent previously managed its own tool connections. With Toolboxes, agents connect to one governed endpoint. The Four Pillars Discover (coming soon) - Find approved tools without browsing long catalogues. Reduces duplication by surfacing what already exists before developers build something new. Build (available today) - Select tools into a named toolbox. Supported types include built-in tools (Web Search, Code Interpreter, File Search, Azure AI Search), MCP servers, Agent-to-Agent (A2A) endpoints, and OpenAPI-defined services. Consume (available today) - A single MCP-compatible endpoint exposes every tool in the toolbox to any agent runtime. Agents that can speak MCP can use a Foundry Toolbox without any Foundry-specific SDK dependency. Govern (coming soon) - Centralised authentication and observability applied to every tool call flowing through the toolbox. Security and platform teams get consistent controls without asking developers to bolt governance onto every agent individually. Toolboxes and GitOps: A Natural Fit Toolboxes are particularly well-suited to a GitOps delivery model because the toolbox definition is a discrete, versioned artefact. Instead of credentials and tool configuration scattered across agent codebases, the toolbox becomes its own managed entity with its own version history. The key design property is that the toolbox endpoint URL is stable. When you promote a new toolbox version to be the default, agents consuming the endpoint pick up the update without any code changes. This means you can update tool configuration, add a new MCP server, or rotate credentials in the toolbox without redeploying every agent that uses it. Figure: Toolbox versioning in a GitOps model. Commits trigger CI validation and deployment of new toolbox versions. The stable endpoint URL allows agents to consume updates without redeployment. Adding a Toolbox to Your Repository In your GitOps repository, toolbox definitions belong in /src/tools/toolbox_config.py or as a declarative configuration file checked into version control. The following example creates a toolbox that combines web search, Azure AI Search over internal documentation, and a GitHub MCP server: # src/tools/toolbox_config.py # Run this via CI to create or update a toolbox version in Foundry. from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient import os client = AIProjectClient( endpoint=os.environ["FOUNDRY_PROJECT_ENDPOINT"], credential=DefaultAzureCredential() ) toolbox_version = client.beta.toolboxes.create_toolbox_version( toolbox_name="customer-feedback-toolbox", description="Tools for triaging customer feedback: search, docs, and GitHub.", tools=[ { "type": "web_search", "description": "Search approved public documentation sites.", "custom_search_configuration": { "project_connection_id": os.environ["BING_CONNECTION_NAME"], "instance_name": os.environ["BING_INSTANCE_NAME"] } }, { "type": "azure_ai_search", "name": "product-manuals-search", "description": "Search internal product documentation.", "azure_ai_search": { "indexes": [ { "index_name": os.environ["SEARCH_INDEX_NAME"], "project_connection_id": os.environ["SEARCH_CONNECTION_ID"] } ] } }, { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp", "project_connection_id": os.environ["GITHUB_CONNECTION_ID"] } ], ) print(f"Toolbox version created: {toolbox_version.version}") print(f"MCP endpoint: {toolbox_version.mcp_endpoint}") To promote a toolbox version to be the default (the endpoint agents use without specifying a version), add this to your deployment workflow: # Promote toolbox version to default after validation toolbox = client.beta.toolboxes.update( toolbox_name="customer-feedback-toolbox", default_version=toolbox_version.version, ) print(f"Default version is now: {toolbox.default_version}") The stable endpoint for agents consuming this toolbox is: https://<your-project>.services.ai.azure.com/api/projects/<project>/toolbox/customer-feedback-toolbox/mcp?api-version=v1 Attaching the Toolbox to Your Hosted Agent In your agent code, connect to the toolbox via a single MCP tool definition. The agent gains access to every tool in the toolbox without knowing their individual configurations: # src/agents/agent.py (relevant excerpt) from agent_framework import MCPStreamableHTTPTool import httpx, os toolbox_endpoint = os.environ["FOUNDRY_TOOLBOX_ENDPOINT"] http_client = httpx.AsyncClient( auth=_ToolboxAuth(token_provider), # Microsoft Entra bearer token timeout=120.0, ) mcp_tool = MCPStreamableHTTPTool( name="toolbox", url=toolbox_endpoint, http_client=http_client, load_prompts=False, ) # Agent now has access to web search, AI Search, and GitHub MCP # through one tool definition and one authenticated connection. GitOps Workflow Extension for Toolboxes Add a dedicated job to your build-deploy workflow to create and promote toolbox versions as part of the same CI/CD pipeline: deploy-toolbox: name: Deploy Toolbox Version needs: validate runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Create toolbox version in Foundry env: FOUNDRY_PROJECT_ENDPOINT: ${{ vars.FOUNDRY_PROJECT_ENDPOINT_DEV }} BING_CONNECTION_NAME: ${{ vars.BING_CONNECTION_NAME }} BING_INSTANCE_NAME: ${{ vars.BING_INSTANCE_NAME }} SEARCH_INDEX_NAME: ${{ vars.SEARCH_INDEX_NAME }} SEARCH_CONNECTION_ID: ${{ vars.SEARCH_CONNECTION_ID }} GITHUB_CONNECTION_ID: ${{ vars.GITHUB_CONNECTION_ID }} run: python src/tools/toolbox_config.py Key points to note: Toolbox configuration is Python code in source control, reviewed through pull requests like any other change Connection IDs and index names are environment variables from GitHub Actions variables, not hardcoded in the script The same script runs for dev, test, and prod with different environment variable bindings Toolbox version promotion is a separate step from agent deployment, so you can update tools independently of the agent container Because the toolbox endpoint is stable, rolling back a toolbox version does not require rolling back the agent image Common Pitfalls Teams adopting this pattern commonly make the following mistakes. Identifying them early saves significant operational pain later. Treating prompts as unmanaged text. If your system prompt lives in a portal text box rather than a versioned file, you have no history, no review process, and no rollback capability. Move prompts into source control on day one. Deploying manually from the portal. Even one manual deployment breaks the GitOps contract. Your repository no longer reflects the true state of the environment. Automate everything and remove portal deployment permissions from individuals. Mixing environment configuration into source files. Hardcoded endpoint URLs or model deployment names in agent_config.json mean your dev and prod configurations diverge at the source level. Use parameter files and environment variables resolved at deployment time. Poor separation between agent logic and tool logic. When agents and tools are tightly coupled in a single file, a tool change requires a full agent review and redeployment. Keep them separate so they can evolve independently. Not versioning your Toolbox definition. Defining a Foundry Toolbox interactively through the portal gives you no audit trail and no rollback path. The toolbox configuration script belongs in source control alongside your agent code. Skipping evaluation before promotion. Deploying a prompt change without running a structured evaluation against a representative test set is how regressions reach production. Build evaluation into the pull request workflow, not just the deployment workflow. No rollback plan. If your first rollback is unplanned and urgent, it will be slow and stressful. Test your rollback procedure in a non-production environment and document the steps. Ignoring token and cost signals. AI workloads have variable cost profiles. A change that doubles average token consumption per request may be functionally correct but economically unsustainable. Monitor consumption as a first-class signal. Example GitHub Actions Workflow The following workflow runs on pull request validation and on merge to main. It covers the core delivery lifecycle: validate, build, deploy to dev, and smoke test. # .github/workflows/build-deploy.yml name: Build and Deploy Foundry Hosted Agent on: push: branches: - main pull_request: branches: - main env: REGISTRY: myregistry.azurecr.io IMAGE_NAME: my-foundry-agent jobs: validate: name: Validate Agent Configuration runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Set up Python uses: actions/setup-python@v5 with: python-version: "3.12" - name: Install dependencies run: pip install -r requirements.txt - name: Validate agent config schema run: python scripts/validate_agent.py - name: Run unit tests run: pytest tests/unit/ -v - name: Lint code run: ruff check src/ build: name: Build and Push Container Image needs: validate runs-on: ubuntu-latest if: github.ref == 'refs/heads/main' permissions: id-token: write contents: read outputs: image_tag: ${{ steps.meta.outputs.version }} steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Log in to Azure Container Registry run: az acr login --name ${{ env.REGISTRY }} - name: Extract metadata id: meta uses: docker/metadata-action@v5 with: images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} tags: | type=sha,format=short - name: Build and push image uses: docker/build-push-action@v7 with: context: . push: true tags: ${{ steps.meta.outputs.tags }} deploy-dev: name: Deploy to Foundry Dev needs: build runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Dev project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_DEV }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment dev - name: Run smoke tests against dev run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_DEV }} deploy-test: name: Deploy to Foundry Test needs: deploy-dev runs-on: ubuntu-latest environment: test permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_TEST }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Test project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_TEST }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment test - name: Run smoke tests against test run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_TEST }} Key decisions in this workflow: Validation runs on every pull request, not just on merge. Fast feedback catches problems before review. The container image is built once and the image tag is passed forward to deployment jobs. The same artefact is promoted across environments. Authentication uses OIDC federated credentials via azure/login@v3 with id-token: write permissions. No long-lived secrets are stored in GitHub for Azure authentication. The environment: test directive in the deploy-test job triggers a GitHub environment approval gate. A named reviewer must approve before the job runs. Smoke tests run after every deployment. A failed smoke test prevents further promotion. Best Practices Checklist Use this checklist when adopting the GitOps pattern for a Microsoft Foundry Hosted Agent: All agent artefacts, including prompts, tool definitions, model configuration, and Toolbox configuration scripts, are committed to source control No manual deployments to any environment; all changes flow through GitHub Actions workflows Pull request reviews are enforced for all changes to agent logic, prompts, and infrastructure via CODEOWNERS Unit tests cover tool logic; integration tests cover end-to-end agent behaviour; smoke tests cover deployed environments Container images are built once per commit and promoted across environments; images are not rebuilt per environment Environment configuration (endpoints, resource names) lives in parameter files, never in source code Secrets are stored in Azure Key Vault and accessed via managed identity at runtime GitHub environment approval gates control promotion from dev to test to prod Foundry Toolboxes are used to centralise tool definitions, credentials, and access governance across all agents; the toolbox configuration script is version-controlled and deployed through CI/CD Toolbox versions are promoted via the update default_version API step in the deployment workflow, not manually through the portal Latency, error rate, and token consumption are monitored with alerting thresholds The rollback procedure is documented, automated, and has been tested in a non-production environment GitHub Issues are used to record the intent behind significant changes and link to the pull requests that implement them Branch protection rules prevent direct pushes to main and require status checks to pass before merge The previous image tag is retained in the registry and stored as a GitHub environment variable for rollback Conclusion A Microsoft Foundry Hosted Agent is not something you deploy once and forget. Prompts evolve, tools change, models are upgraded, and policy requirements shift. Every one of those changes has the potential to alter agent behaviour in ways that affect users, costs, and compliance posture. GitOps, implemented through GitHub and GitHub Tasks, gives you the operational discipline to manage that complexity. Source control for all artefacts. Pull request review for every change. Automated validation, build, and deployment. Environment promotion gates. A complete audit trail from task to production. These are not bureaucratic overhead; they are the foundation of reliable, trustworthy AI agent operations. The teams that operate AI agents well are the ones that treat them like production software from the start. The investment in pipeline, structure, and governance pays back every time a change goes smoothly, every time a rollback takes minutes rather than hours, and every time a security or compliance reviewer can answer their question from a pull request history rather than a support ticket. Build the discipline in early. Your future self, and your production environment, will benefit from it. References Microsoft Foundry documentation Microsoft Foundry Agent Service documentation Microsoft Foundry Toolboxes documentation Introducing Toolboxes in Foundry (Microsoft Developer Blog) GitHub Actions documentation GitHub Projects and Tasks documentation Azure Container Registry documentation Azure Key Vault documentation Microsoft Entra Managed Identities documentation OpenGitOps PrinciplesBuilding AI Agents with Microsoft Foundry: A Progressive Lab from Hello World to Self-Hosted
AI agent development has a steep on-ramp. The combination of new SDKs, tool-calling patterns, model selection decisions, retrieval-augmented generation, and deployment concerns means most developers spend more time wiring things together than actually building anything useful. The Microsoft Foundry Agent Lab is a structured, open-source demo series designed to change that — nine self-contained demos, each adding exactly one new concept, all built on the same Microsoft Foundry SDK and a single model deployment. This post walks through what the lab contains, how each demo works under the hood, and the architectural decisions that make it a useful reference for AI engineers building production agents. Why a Progressive Lab? Agent frameworks can be overwhelming. A developer who opens a rich example with RAG, tool-calling, streaming, and a custom UI all at once has no clear line of sight to which parts are essential and which are embellishments. The Foundry Agent Lab takes the opposite approach: start with the absolute minimum and introduce one new primitive per demo. By the time you reach Demo 8, you have seen every major capability — not in one monolithic sample, but in a layered sequence where each addition is visible and understandable. # Demo New Concept Tool Used UX 0 hello-demo Agent creation, Responses API, conversations None Terminal 1 tools-demo Function calling, tool-calling loop, live API FunctionTool Terminal 2 desktop-demo UI decoupling — same agent, different surface None Desktop (Tkinter) 3 websearch-demo Server-side built-in tools, no client loop WebSearchTool Terminal 4 code-demo Code execution in sandbox, Gradio web UI CodeInterpreterTool Web (Gradio) 5 rag-demo Document upload, vector stores, RAG grounding FileSearchTool Terminal 6 mcp-demo MCP servers, human-in-the-loop approval MCPTool Terminal 7 toolbox-demo Centralized tool governance, Toolbox versioning Toolbox Terminal 8 hosted-demo Self-hosted agent with Responses protocol Custom server Terminal + Agent Inspector The Model Router: One Deployment to Rule Them All Before diving into the demos, it is worth understanding the one architectural decision that ties the entire lab together: every agent uses model-router as its model deployment. MODEL_DEPLOYMENT=model-router Model Router is a Microsoft Foundry capability that inspects each request at inference time and routes it to the optimal available model — weighing task complexity, cost, and latency. A simple factual question goes to a fast, cheap model. A complex tool-calling chain with code generation gets routed to a frontier model. You write zero routing logic. The lab's MODEL-ROUTER.md file contains empirical observations from running all nine demos. A sample of what the router selected: Demo Query Task Type Model Selected hello "What's the capital of WA state?" Factual recall grok-4-1-fast-reasoning hello "Summarize our conversation" Summarization gpt-5.2-chat-2025-12-11 tools "What's the weather in Seattle?" Tool-using gpt-5.4-mini-2026-03-17 code Data analysis with code generation Code generation + execution gpt-5.4-2026-03-05 rag HR policy document question Retrieval + synthesis gpt-5.3-chat-2026-03-03 This is the strongest signal in the lab: you do not need to reason about model selection. You declare what your agent needs to do; the router handles the rest, and it chooses correctly. Demo 0: The Minimum Viable Agent The hello-demo establishes the baseline pattern used by every subsequent demo. Two files: one to register the agent, one to chat with it. Registering the agent from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient from azure.ai.projects.models import PromptAgentDefinition credential = DefaultAzureCredential() project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, instructions="You are a helpful, friendly assistant.", ), ) Authentication uses DefaultAzureCredential , which works with az login locally and with managed identity in production — no API keys anywhere in the code. Chatting with the agent # Create a server-side conversation (persists history across turns) conversation = openai.conversations.create() # Each turn sends the user message; the agent sees full history response = openai.responses.create( input=user_input, conversation=conversation.id, extra_body={"agent_reference": {"name": AGENT_NAME, "type": "agent_reference"}}, ) print(response.output_text) The conversation object is server-side. You pass its ID on every turn; the history lives in Foundry, not in a local list. This is the Responses API pattern — distinct from the older Completions or Chat Completions APIs. Demo 1: Function Tools and the Tool-Calling Loop Demo 1 adds function calling against a real weather API. The key insight here is that the model does not execute the function — it requests the execution, and your code executes it locally, then feeds the result back. Declaring a function tool from azure.ai.projects.models import FunctionTool, PromptAgentDefinition func_tool = FunctionTool( name="get_weather", description="Get the current weather for a given city.", parameters={ "type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"], }, strict=True, ) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[func_tool], instructions="You are a weather assistant...", ), ) The tool-calling loop response = openai.responses.create(input=user_input, conversation=conversation.id, ...) # Loop while the model is requesting tool calls while any(item.type == "function_call" for item in response.output): input_list = [] for item in response.output: if item.type == "function_call": args = json.loads(item.arguments) result = get_weather(args["city"]) # execute locally input_list.append(FunctionCallOutput(call_id=item.call_id, output=result)) # Send results back to the agent response = openai.responses.create(input=input_list, conversation=conversation.id, ...) print(response.output_text) The strict=True parameter on FunctionTool enforces structured outputs — the model must return arguments that match the declared JSON schema exactly. This eliminates argument parsing errors in production. Demo 2: UI Is Not Your Agent Demo 2 runs the exact same agent as Demo 1 but surfaces it in a Tkinter desktop window. The point is pedagogical: your agent definition, conversation management, and tool-calling logic are entirely independent of your UI layer. Swapping from terminal to desktop requires changing only the presentation code — nothing in the agent or conversation path changes. This is a principle worth internalising early: agent logic and UI logic should never be entangled. The lab enforces this separation structurally. Demo 3: Server-Side Built-In Tools The web search demo introduces a sharp contrast with Demo 1. With WebSearchTool , the tool-calling loop disappears entirely from client code: from azure.ai.projects.models import WebSearchTool agent = project.agents.create_version( agent_name="Search-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[WebSearchTool()], instructions="You are a research assistant...", ), ) The agent decides when to search, executes the search server-side, and returns a grounded response with citations. Your client code looks identical to Demo 0 — a simple responses.create() call with no tool loop. The distinction matters architecturally: Function tools (Demo 1) — tool execution happens on your client; you control the code, the API call, the error handling. Built-in tools (Demo 3+) — tool execution happens inside Foundry; you get results without managing execution. Demo 4: Code Interpreter and the Gradio Web UI Demo 4 attaches CodeInterpreterTool , which gives the agent a sandboxed Python execution environment inside Foundry. The agent can write code, run it, observe output, and iterate — all server-side. Combined with a Gradio web interface, this demo shows an agent that can perform data analysis, generate charts, and explain results through a browser UI. Model Router is particularly interesting here: the empirical data shows it selects a more capable frontier model ( gpt-5.4-2026-03-05 ) for code-generation tasks, while simpler conversational turns stay on lighter models. Demo 5: Retrieval-Augmented Generation with FileSearchTool Demo 5 introduces RAG. The setup phase uploads a document, creates a vector store, and attaches it to the agent: # Upload document and create a vector store vector_store = openai.vector_stores.create(name="employee-handbook-store") with open("data/employee-handbook.md", "rb") as f: openai.vector_stores.files.upload_and_poll( vector_store_id=vector_store.id, file=f ) # Attach the vector store to the agent agent = project.agents.create_version( agent_name="RAG-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[FileSearchTool(vector_store_ids=[vector_store.id])], instructions="Answer questions using only the provided documents...", ), ) At query time, the agent embeds the question, searches the vector store semantically, retrieves matching chunks, and generates an answer grounded in the retrieved content — entirely server-side. The client code remains a plain responses.create() call. An important detail: the .vector_store_id file is written to disk during setup and read back during the chat session, so the demo survives process restarts without re-uploading the document. The .gitignore excludes this file from source control. Demo 6: Model Context Protocol Demo 6 connects the agent to a GitHub MCP server, giving it access to repository and issue data via the open Model Context Protocol standard. MCP servers expose tools over a standardised wire protocol; the agent discovers and calls them without any client-side function declarations. The demo also demonstrates human-in-the-loop approval: before executing any MCP tool call, the agent surfaces the proposed action and waits for the user to confirm. This is an important safety pattern for agents that can trigger side effects on external systems. Demo 7: Toolbox — Centralised Tool Governance Where Demo 6 connects to a single MCP server directly, Demo 7 uses a Toolbox — a managed Microsoft Foundry resource that bundles multiple tools into a single, versioned, MCP-compatible endpoint. The Toolbox in this demo exposes both GitHub Issues and GitHub Repos tools, curated into an immutable versioned snapshot. This pattern is significant for production multi-agent systems: Centralised governance — one team owns the tool definitions; all agents consume them via a single endpoint. Versioned snapshots — promoting a new Toolbox version is explicit; agents pin to a version and upgrade intentionally. MCP compatibility — any MCP-capable agent or framework can connect, not just Foundry SDK agents. from azure.ai.projects.models import McpTool toolbox_tool = McpTool( server_label="toolbox", server_url=TOOLBOX_ENDPOINT, allowed_tools=[], # empty = all tools in the Toolbox version headers={"Authorization": f"Bearer {token}"}, ) Demo 8: Self-Hosted Agent with the Responses Protocol The final demo departs from the prompt-agent pattern. Instead of registering a declarative agent in Foundry, Demo 8 implements a custom agent server using the Responses protocol. The server exposes a streaming HTTP endpoint; Foundry's Agent Inspector can connect to it and route user turns to it just as it would to a hosted prompt agent. This demo includes a Dockerfile and an agent.yaml , enabling deployment to Foundry's container hosting service. It uses gpt-4.1-mini directly rather than the model router, because the custom server owns the entire inference path. When to consider this pattern: Your agent requires custom pre- or post-processing logic that cannot be expressed in a system prompt. You need to integrate with infrastructure that is not reachable through MCP or built-in tools. You want to own the inference call for cost control, A/B testing, or compliance reasons. You are building a multi-agent orchestrator that needs to expose itself as an agent to other orchestrators. Getting Started The lab requires Python 3.10 or higher, an Azure subscription with a Microsoft Foundry project, and the Azure CLI. 1. Clone and set up the virtual environment git clone https://github.com/microsoft-foundry/Foundry-Agent-Lab.git cd Foundry-Agent-Lab # Create and activate the virtual environment python -m venv .venv # Windows Command Prompt .venv\Scripts\activate.bat # Windows PowerShell .venv\Scripts\Activate.ps1 # macOS / Linux source .venv/bin/activate pip install -r requirements.txt 2. Configure a demo copy hello-demo\.env.sample hello-demo\.env # Edit hello-demo\.env and set PROJECT_ENDPOINT Your PROJECT_ENDPOINT is on the Overview page of your Foundry project in the Azure portal. It takes the form https://your-resource.ai.azure.com/api/projects/your-project . 3. Run the demo az login 0-hello-demo Each numbered batch file at the root activates the virtual environment, runs create_agent.py , and launches chat.py . Append log to capture the full session transcript: 0-hello-demo log Reset between runs hello-demo\reset.bat Every demo includes a reset.bat that deletes the registered agent and any associated resources (vector stores, uploaded files). Demos are fully repeatable. Architecture Principles Demonstrated Across the nine demos, the lab illustrates a set of design principles that apply directly to production agent systems: Keyless authentication throughout Every demo uses DefaultAzureCredential . No API keys appear anywhere in the code. Locally, az login provides credentials. In production, managed identity takes over automatically — same code, no secrets to rotate. Server-side conversation state The Responses API stores conversation history server-side. Your application passes a conversation ID; Foundry maintains the thread. This eliminates the common bug of truncating history due to local list management and makes multi-process or multi-instance deployments straightforward. Client-side vs server-side tool execution The lab makes the distinction explicit. Function tools execute in your process — you control the code, the external call, and the error handling. Built-in tools (WebSearch, CodeInterpreter, FileSearch) execute inside Foundry — you get results without managing execution infrastructure. MCP tools (Demo 6, 7) fall between these: they execute in a separately deployed server, with the protocol mediating the call. Progressive tool introduction Each demo's create_agent.py registers the agent once. The chat.py file handles the conversation loop. These two responsibilities are always separate, making it easy to update agent definitions without modifying conversation logic, and vice versa. Security Considerations When building agents for production, keep the following in mind: Never commit .env files. The .gitignore excludes them, but verify this before pushing. Use Azure Key Vault or environment variable injection in CI/CD pipelines. Use managed identity in production. DefaultAzureCredential automatically picks up managed identity when deployed to Azure, eliminating the need for any stored credentials. Apply human-in-the-loop for side-effecting tools. Demo 6 demonstrates this pattern for MCP tool calls. Any agent that can modify external state (create issues, send emails, write files) should surface proposed actions for confirmation. Validate tool outputs before use. Treat data returned by external tools (weather APIs, search results, document retrieval) as untrusted input. Prompt injection through tool results is a real attack surface; grounding instructions in your system prompt reduce but do not eliminate this risk. Scope Toolbox permissions narrowly. When using a Toolbox (Demo 7), use allowed_tools to restrict which tools the agent can call, rather than granting access to all tools in a Toolbox version. Key Takeaways Start with the minimum. A prompt agent with no tools requires fewer than 30 lines of code using the Foundry SDK. Add tools only when the use case demands them. Use model-router unless you have a specific reason not to. The empirical data in the lab shows the router selects appropriate models across all task types — factual, creative, tool-calling, RAG, and code generation. Understand the client/server tool boundary. Function tools give you control; built-in tools give you simplicity. MCP and Toolbox give you governance and interoperability. Choose based on where you need control and where you need scale. Conversation state belongs on the server. Do not maintain conversation history in application memory if you can avoid it. The Responses API conversation object is designed for this. The hosted-demo pattern is for when you need to own the inference path. For most use cases, a declarative prompt agent is sufficient and far simpler to operate. Next Steps Explore the repo: github.com/microsoft-foundry/Foundry-Agent-Lab Microsoft Foundry SDK documentation: learn.microsoft.com/azure/ai-studio/ Responses API quickstart: Prompt agent quickstart Model Router conceptual documentation: Model Router for Microsoft Foundry Model Context Protocol: modelcontextprotocol.io Azure Identity SDK (DefaultAzureCredential): azure-identity Python SDK The Foundry Agent Lab is open source under the MIT licence. Contributions, bug reports, and feature requests are welcome through GitHub Issues. See CONTRIBUTING.md for guidelines.OIDC vs SPN: Securing Azure Deployments with GitHub Actions & Terraform
From Secrets to Trust: Modernizing CI/CD Authentication When building infrastructure pipelines on Microsoft Azure using GitHub Actions and Terraform, one design choice quietly determines your entire security posture: How does your pipeline authenticate to Azure? For years, the answer was simple: Use a Service Principal (SPN) Store a client secret in GitHub Authenticate using credentials It works—but it doesn’t scale securely. This article walks through a real, production-ready implementation comparing: SPN (Client Secret – legacy pattern) OIDC (Federated Identity – modern standard) Backed by a working repo: WorkFlowBasedDeployment Architecture Overview This repository implements a workflow-driven Terraform deployment model with modular Azure infrastructure. Repository Structure .github/workflows/ deploy-infrastructure.yml # OIDC deployment deploy-infrastructure-spn.yml # SPN deployment destroy-infrastructure.yml # OIDC destroy destroy-infrastructure-spn.yml # SPN destroy Deployment/ main.tf providers.tf variables.tf terraform.tfvars modules/ Azure Resources Provisioned Resource Module Resource Group Virtual Network + NSGs vnet rg-network Storage Account sa rg-data Container Apps containerapps rg-compute AI Foundry aifoundry rg-data AI Search aisearch rg-data Azure Container Registry acr rg-compute Key Vault azkeyvault rg-data Monitoring azmonitor rg-compute Private Endpoints private_endpoints rg-network Authentication Models Service Principal (SPN) – The Traditional Way How it works Create App Registration Generate client secret Store it in GitHubTerraform authenticates using environment variables env: ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }} ARM_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }} ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }} The problem Risk Impact Long-lived secrets Can be leaked Manual rotation Operational burden Repo compromise Full environment exposure This model is still supported—but increasingly considered legacy for secure pipelines. OIDC (OpenID Connect) – The Modern Approach How it works GitHub Actions generates a short-lived identity token Microsoft Entra ID validates it Azure issues a temporary access token Terraform executes using that token No secrets. No storage. No rotation. Authentication Models Compared OIDC Flow (Mental Model) Think of OIDC like this: GitHub → Identity Provider Azure → Trust Authority Workflow → Temporary Identity OIDC Implementation (From the Repo) Workflow Configuration permissions: id-token: write contents: read env: ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }} ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }} ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }} ARM_USE_OIDC: true Azure Login - name: Azure Login (OIDC) uses: azure/login@v2 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} Backend (Terraform State with OIDC) terraform init \ -backend-config="use_oidc=true" Even your state storage is secretless Azure Setup for OIDC Create App Registration No client secret required Configure Federated Credential Example: Issuer: https://token.actions.githubusercontent.com Subject: repo:<org>/<repo>:ref:refs/heads/master You can restrict by: Branch Environment Repository Assign RBAC: Grant roles like: Contributor Or scoped resource-level access CI/CD Workflow Design Both SPN and OIDC pipelines follow a 2-stage pattern: Plan Stage terraform fmt terraform validate terraform plan Upload plan artifact Apply Stage Triggered only on main Downloads plan Runs apply -auto-approve Protected via environment approvals This ensures safe, auditable deployments OIDC vs SPN — Real Comparison Feature SPN OIDC Secrets Stored in GitHub None Token lifetime Long-lived Short-lived Rotation Manual Not required Security Medium High Setup Simple Slightly complex Recommended No Yes Common Pitfalls (Real-World Lessons) Missing id-token permission Without this, OIDC fails silently. Federated credential mismatch Wrong branch Incorrect repo name Case sensitivity issues Azure rejects the token completely. RBAC delay Role assignments can take time → causes confusing failures. Backend misconfiguration Forgetting use_oidc=true breaks Terraform state auth. Debugging Tips Enable debug logs in GitHub Actions Check Sign-in logs in Microsoft Entra ID Validate federated credential subject format Always isolate: Identity issue vs Permission issue Migration Strategy (SPN → OIDC) A safe transition looks like this: Keep SPN as fallback Add OIDC alongside Test in DEV environment Remove client secret Revoke old credentials No downtime, no risk. Where This Fits in Modern Azure Architecture This pattern integrates naturally with: Azure Container Apps AI/ML workloads (AI Foundry, Search) Multi-environment deployments Zero-trust enterprise architectures Authentication becomes identity-driven, not secret-driven When NOT to Use OIDC Legacy CI/CD systems without OIDC support Organisations with strict identity federation constraints Cross-tenant scenarios with limited trust setup Note: These cases are becoming increasingly rare in modern cloud setups. Security Perspective Threat SPN Risk OIDC Risk Secret leak High None Credential reuse High Low Token replay Possible Limited Repo compromise Full access Scoped Final Takeaway This repository demonstrates a key shift in modern DevOps: Secrets were a workaround for identity. OIDC replaces that workaround with trust. By combining: GitHub Actions OIDC federation Azure RBAC You get: Secure pipelines Scalable deployments Zero secret management In enterprise environments, moving to OIDC can eliminate secret rotation pipelines entirely, reducing operational overhead and significantly lowering breach risk. Reference Implementation GitHub Repository: WorkFlowBasedDeployment Closing Thought OIDC doesn’t just improve authentication, it fundamentally changes how trust is established in cloud systems. In a world moving toward zero-trust architectures, identity is the new perimeter and OIDC is how you enforce it.How to Visualize Your Azure AI Workloads Usage for Observability
This article assumes you already have an Azure Foundry project and resource deployed in Microsoft Foundry. The options referenced here are documented in detail in the linked articles; this post serves as a consolidated step by step guide bringing them all together and explaining where each option is most useful. A Summary: Need Best Option Quick day-over-day visual, minimal setup Grafana Dashboard (Option 3) Custom growth % calculations App Insights + KQL in Log Analytics (Option 4) Shareable, interactive report Azure Workbooks (Option 5) Per-user/per-agent granularity APIM + App Insights (Option 6) Quick one-off chart, export to Excel Microsoft Foundry Monitor tab or App Insights Metrics Explorer (Option 1 and 2) Option 1. Within the Microsoft Foundry Portal (Quickest, No Setup) If you have models deployed in Microsoft Foundry and would like to monitor its usage, go to the New Foundry Portal → Build → Models → Monitor tab. View metrics such as: Estimated cost Total token usage Input vs. output tokens Number of requests This is the simplest way to monitor both model and agent usage. For PAYG plans: You can also view your total allocated quota (and figure out which Tier you are on) using the Quota Management Screen (New Foundry Portal → Operate → Quota tab). This screen shows how much your total allocated quota is, per model in a given subscription + region + Deployment Type (Global, Data Zones or Regional). For eg., in the image below, for gpt-4o, I am allocated 7M total TPM in my subscription. I am only using 150K TPM of the allocated 7M TPM amount. Which means, my requests will get throttled if I exceed the 150K TPM limit. To avoid throttling, I would need to increase my shared allocation limit. NOTE: you are charged for usage, so if you allow more capacity, you use more, so you pay more. Option 2: Azure Monitor Metrics Explorer This is already built into the Azure Portal and gives you time-series charts out of the box. Go to Azure Portal → your Azure OpenAI / Foundry resource → Monitoring → Metrics Select a metric like AzureOpenAIRequests or TokenTransaction Set Aggregation to Sum (total) or Max and Time granularity to 1 day Split by ModelDeploymentName to see per-model trends Adjust the time range (e.g., last 30 days) — you'll see day-over-day bars/lines Tip: You can pin these charts to an Azure Dashboard for a persistent view, or click Share → Download to Excel to get the raw data for your own analysis. Option 3: Azure Managed Grafana (Best Pre-Built Dashboard) This is the best option for a polished, real-time, day-over-day dashboard with no custom code. There's a pre-built AI Foundry dashboard ready to import. [grafana.com], [Create a M...ed Grafana] How to set it up: Create an Azure Managed Grafana workspace (if you don't have one) In Grafana, go to Dashboards → New → Import → enter dashboard ID 24039 (for Foundry) Select your Azure Monitor data source and point it to your Foundry resource Tip: You can also import this directly from the Azure Portal: Monitor → Dashboards with Grafana → AI Foundry. That's it — the dashboard gives you (per model deployment): Token trends over time (inference, prompt, completion — day over day) Request trends over time (AzureOpenAIRequests as a time series) Latency trends (bonus) NOTE: Default time range is 7 days — adjust to 30/60/90 days for growth trends Option 4: Application Insights + KQL Queries (Most Flexible, Custom Reports) If you want fully custom day-over-day growth calculations (e.g., % change day-to-day), this is the way. [azurefeeds.com] Setup: Ensure your Foundry project is connected to an Application Insights resource (Foundry → Settings → Connected Resources). Open up App Insights resource → Logs → New Query or choose a sample query. In the images below, we simply ran 'requests' and set the time range to 24 hours. There is also a Kusto Query Language (KQL) mode or Simple mode on the right-hand side: Simple mode will let you run out of the box samples. KQL mode will open up a query window for you to enter custom queries. Below are the results in grid view. Same view but showing a chart: Export options: Another way to get the above graphs are via Log Analytics. Simply enable Diagnostic Settings on your Azure OpenAI resource → send to a Log Analytics workspace. Open Log Analytics → Logs and try our your sample queries. Sample KQL for day-over-day token usage (adjust to your needs): AzureMetrics | where MetricName in ("TokenTransaction", "ProcessedPromptTokens", "GeneratedTokens") | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by bin(TimeGenerated, 1d), MetricName | order by TimeGenerated asc | render timechart Result: Sample KQL for day-over-day growth % (adjust to your needs): AzureMetrics | where MetricName == "TokenTransaction" | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by Day = bin(TimeGenerated, 1d) | sort by Day asc | extend PrevDay = prev(DailyTokens) | extend GrowthPct = round((DailyTokens - PrevDay) / PrevDay * 100, 2) | project Day, DailyTokens, GrowthPct Option 5: Azure Monitor Workbooks (Custom Dashboards, Shareable) Workbooks let you build interactive, parameterized dashboards that combine metrics and KQL logs. What's more, you can select resources from multiple subscriptions and visualize them all in one place using Workbooks! Go to Azure Portal → Monitor → Workbooks → New Add a Metrics query panel → select your Log Analytics or App Insights or Foundry resource -> Enter the same query you used in Option 4. Do a test run and view the graphs (this can be viewed as charts or a list (grid view)): 4. Save and share with your team. Option 6: APIM + Application Insights (Granular Per-Caller/Per-Agent Tracking) 1. If your app routes requests through Azure API Management, you can use the azure-openai-emit-token-metric policy to send per-request token metrics to Application Insights with custom dimensions (User ID, Subscription ID, Agent, etc.). [Azure API...osoft Docs] This is ideal for scenarios like: "Which agent consumed the most tokens last week?" "What's the token usage per API consumer/team?" NOTE: Microsoft Foundry resources do not track usage by users. So, fronting your Foundry resource with an APIM could be a way to track users provided you pass the username/id in the request context. How you implement this is upto your app design. Ref: AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHub Bonus: Check out all other APIM + AI related policies here: AI-Gateway/labs/semantic-caching at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-rate-limiting at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHubPrivacy proxy in Agents with Microsoft Agent Framework Middleware
In the first part - Introducing PII Shield: A Privacy Proxy for Every LLM Call - we introduced PII-Shield as a privacy proxy that detects sensitive data, applies configurable anonymization strategies per entity, and reverses that anonymization once the LLM has completed its task. While this approach is valuable in any AI application, it becomes essential in agentic systems. Consider an agentic chatbot - BankBuddy - designed to help customers with tasks such as unblocking cards or requesting replacements. As part of its workflow, the agent collects sensitive information including the customer’s name, Aadhaar number, card details, and date of birth. This data is used to validate the customer, invoke backend APIs to perform the requested action, and ultimately communicate the outcome - what was completed, what could not be processed, and any next steps required. The sequence diagram below illustrates how this interaction is orchestrated behind the scenes. If we carefully observe the interactions between the agent and LLM, some information about the customer is sent across so that the LLMs can perform tool calling, frame accurate responses etc. For regulated industries, this could be a red flag. If not handled well, actual customer information (which may be PII) could inadvertently surface in LLM inference logs or downstream processing layers. Rather than placing this responsibility on individual AI developers, the burden should be absorbed by the enterprise architecture. Developers should remain focused on delivering business functionality, while a centralized, enforced privacy boundary ensures that all LLM and tool interactions are consistently governed. This boundary must be non-optional - every request and response flows through it, with no possibility of bypass. This approach removes the need for per-agent privacy engineering and guarantees uniform protection across applications - an essential requirement in highly regulated sectors such as BFSI, healthcare, and government. In the next section, we explore how this challenge can be addressed within the Microsoft Agent Framework. Middleware in Microsoft Agent Framework The Microsoft Agent Framework provides two middleware extensibility points that are particularly relevant in this context (for more refer here): Middleware Purpose Where PII-shield gets hooked ChatMiddleware Wraps every LLM call Anonymise outgoing messages, de-anonymise the response FunctionMiddleware Every @tool invocation De-anonymise tool args, re-anonymise tool results Together, these form a generalized implementation of the “Anonymize → LLM → De-anonymize” pattern introduced in the first post - now systematically applied to every LLM interaction and every tool call within an agent’s execution loop. The example below demonstrates how PII Shield is integrated into a Microsoft Agent Framework ChatAgent using two coordinated middleware components that share a common mapping store. The PiiMappingStore serves as per-conversation memory, maintaining the association between placeholders (for example, {{PERSON_1}}, {{IN_AADHAAR_1}}) and their corresponding real values. This ensures consistent coreference across turns and tool invocations. PiiShieldChatMiddleware operates at the LLM boundary, sanitizing every outbound prompt - including user input and accumulated context before it reaches the model, and restoring original values in the model’s response on the return path. Complementing this, PiiShieldFunctionMiddleware wraps all tool calls: it resolves placeholders into real values before execution (ensuring functions such as get_customer_info receive valid inputs) and then re-anonymizes the results before they are passed back into the agent loop. By registering both middleware components in the agent’s middleware=[…] pipeline, the privacy boundary becomes comprehensive and non-bypassable. Every prompt, every tool invocation, and every response is automatically governed - eliminating opt-out paths and ensuring consistent protection by design. from agent_framework import ChatAgent from bankingbuddy.middleware import ( PiiShieldChatMiddleware, PiiShieldFunctionMiddleware, PiiMappingStore, ) mapping_store = PiiMappingStore() # per-conversation chat_mw = PiiShieldChatMiddleware( mode="api", # or "library" api_url="http://pii-shield:8000", api_app_id="bankingbuddy-prod", # received from registering the app in PII-shield mapping_store=mapping_store, ) fn_mw = PiiShieldFunctionMiddleware(chat_middleware=chat_mw) agent = ChatAgent( chat_client=foundry_chat_client, instructions="You are BankingBuddy …", tools=[get_customer_info, unblock_card, order_card, report_lost_card], middleware=[chat_mw, fn_mw], ) The middleware can talk to PII Shield in two ways: api mode (default) - Invokes the PII Shield REST endpoints (such as /anonymize_unique and /deanonymize). This approach decouples the NLP layer from the agent runtime, making it well-suited for production environments where PII Shield is deployed as a shared service across multiple teams. library mode - Uses the in-process pii_shield.PiiShieldEngine to perform anonymization locally. This reduces latency and is particularly useful for local development, batch processing scenarios, and edge deployments. A complete reference implementation of an agent incorporating this middleware pattern is available in the following GitHub repository: github.com/vikasgautam18/pii-agent-demos Once this middleware approach is in place, the execution flow changes fundamentally. The LLM operates exclusively on anonymized placeholders, while the middleware transparently handles the substitution of real values during tool execution and response handling. This ensures that sensitive data never enters the model context, without requiring any additional effort from the agent developer. The sequence diagram below depicts how additional hops ChatMiddleware, FunctionMiddleare and PII-Shield ensure the LLM does not receive PII values. High-level design The diagram below illustrates the high-level architecture of this solution. From the user’s perspective, interaction remains unchanged - they communicate with a standard Microsoft Agent Framework ChatAgent, sending raw text and receiving raw responses. Within the agent, however, two middleware components enforce strict privacy boundaries at the points where PII could otherwise be exposed. PiiShieldChatMiddleware operates at the LLM boundary, transforming every outbound prompt into anonymized placeholders and rehydrating the model’s response before it is returned. In parallel, PiiShieldFunctionMiddleware governs the tool boundary, performing the inverse operation: it de-anonymizes arguments so backend systems receive real values, and re-anonymizes tool outputs before they are fed back into the next LLM turn. Both middleware components delegate the underlying detection and mapping logic to PII Shield - deployed either as a sidecar service or consumed as a library - and share a common PiiMappingStore scoped to the conversation. This shared state ensures consistency, allowing placeholders such as {{PERSON_3}} to reliably represent the same individual (for example, Vikas Gautam) across multiple turns, tool invocations, and final responses. The net effect is a clean separation of concerns: the LLM operates exclusively on anonymized placeholders, backend systems operate on real values, and neither side needs awareness of the other. Privacy is enforced transparently and consistently, without adding complexity to the agent’s core logic. How does it work - a turn in the life of an agent Imagine the user types: "Hi, I'm Vikas Gautam, customer ID 246813579. My card 4532-1234-5678-9012 is blocked, can you unblock it?" ChatMiddleware: inbound The middleware grabs the last user message, calls POST /anonymize, gets the following response: "Hi, I'm {{PERSON_1}}, customer ID {{IN_CUSTOMER_ID_1}}. My card {{CREDIT_CARD_1}} is blocked, can you unblock it?" And the mapping { {{PERSON_1}}: "Vikas Gautam", {{IN_CUSTOMER_ID_1}}: "246813579", {{CREDIT_CARD_1}}: "4532-1234-5678-9012" } is stored in the run state and merged into the long-lived PiiMappingStore. LLM The LLM sees only placeholders. It decides to call a tool: { "name": "unblock_card", "arguments": { "customer_id": "{{IN_CUSTOMER_ID_1}}", "card_number": "{{CREDIT_CARD_1}}" } } FunctionMiddleware: before the tool The function middleware reads the run state, walks the arguments, and replaces every placeholder with its real value { "customer_id": "246813579", "card_number": "4532-1234-5678-9012" } The unblock_card tool runs against the real CRM with real IDs. No changes or configurations required in the tool itself. FunctionMiddleware: after the tool The tool returns the following: { "status": "unblocked", "card_holder": "Vikas Gautam", "card_number": "4532-1234-5678-9012", "confirmation_email_sent_to": "vikas@example.com" } The middleware then re-anonymises the result through PII Shield before handing it back to the LLM, so the next LLM hop sees only known placeholders (plus a brand-new {{EMAIL_ADDRESS_1}} for the email). ChatMiddleware: outbound After the agent loop finishes, the LLM's final answer might be: "All done, {{PERSON_1}}. Your card ending in {{CREDIT_CARD_1}} is now active. A confirmation has been sent to {{EMAIL_ADDRESS_1}}." The chat middleware walks every assistant message in the result, applies local string replacement (longest placeholder first) using the accumulated PiiMappingStore, and the user sees: "All done, Vikas Gautam. Your card ending in 4532-1234-5678-9012 is now active. A confirmation has been sent to vikas@example.com." Let's just assume at this point, the user asks: "What's my balance?" There's no PII in this message, so anonymisation is a no-op. But when the LLM calls get_balance(customer_id="{{IN_CUSTOMER_ID_1}}") - using a placeholder it remembers from turn 1 - the function middleware still resolves it from the PiiMappingStore. Memory is preserved across the entire conversation with no extra plumbing. Benchmark To evaluate the impact of the middleware, we designed a paired-sample benchmark that executes the same workload through two paths: one without the middleware and one with PiiShieldChatMiddleware enabled. Each pair is run back-to-back for every message, with execution order randomized to minimize the effects of transient LLM and network variability. This allows us to isolate the per-pair latency difference with high fidelity. Please refer the below GitHub repository for more details on the benchmark implementation: pii-agent-demos/benchmark We report the median, 95th percentile, and 99th percentile of these differences across 1,000 runs. All tests were conducted against a locally deployed PII Shield service running in Docker. The benchmark harness also includes a no-op mode, where the LLM call is bypassed, enabling precise measurement of middleware overhead in isolation. To keep total runtime practical, the framework supports concurrent execution of request pairs. For transparency and reproducibility, all raw samples, aggregated results, and a complete environment fingerprint are persisted to disk for every run. Our findings show that PiiShieldChatMiddleware introduces a median latency overhead of approximately 70 ms per interaction, with a 99th percentile overhead of approximately 200 ms. The majority of this cost is attributable to the PII Shield service’s inference time rather than the middleware layer itself. It adds roughly one-tenth of a second per turn - a modest and predictable trade-off for ensuring that raw personally identifiable information is never exposed to the model context. Metric Overhead per turn Median 71.27 ms Mean 84.09 ms 95th percentile 189.29 ms 99th percentile 207.44 ms Conclusion Agents multiply every privacy-sensitive boundary by the number of LLM hops and tool invocations. Relying on developers to consistently route calls through a downstream PII Shield is not a robust strategy. By integrating PII Shield directly into the Microsoft Agent Framework as a ChatMiddleware + FunctionMiddleware pair, these concerns are addressed systematically: Automatic anonymization on every LLM call - even during multi-step agent loops. Seamless restoration of real values for tools - placeholders are transparently resolved before execution. Automatic re-anonymization of tool outputs - ensuring that raw PII from systems like CRM never re-enters the model context. Consistent identity mapping across turns - placeholders such as {{PERSON_1}} remain stable from the first interaction through to the seventieth. Flexible deployment model - a single switch allows you to move between in-process (library) and shared-service (API) modes. This post focused on protections inside the application - embedding PII Shield within the Agent Framework so that a single agent, or even an entire fleet within a shared codebase, is secure by default. In the next post, we zoom out one layer to explore PII Shield in combination with Azure API Management. Here, the privacy boundary shifts to the gateway, ensuring that every call to an LLM backend - regardless of application, language, or team - is intercepted, anonymized, forwarded, and de-anonymized through APIM policies. The principle remains the same, but enforcement moves from the agent runtime to the network edge.Six Coding Agents, One Production System: A Field Guide to AgenticOps with AKS-Lab-GitHubCopilot
The shift: from "AI helps me code" to "AI authors my repo" For two years we've been talking about GitHub Copilot as an inline pair programmer — a clever autocomplete that lives in your editor. That framing is officially out of date. The new reality is agentic delivery: a team of named, scoped AI agents owns slices of your repository, each with its own tools, skills, and refusal rules. They produce pull requests. They run tests. They roll deployments. And when one finishes its turn, it hands off to the next. The microsoft/AKS-Lab-GitHubCopilot's five labs you ship ZavaShop — a multi-agent retail supply-chain control plane running on AKS + Azure Container Apps — and along the way you internalize an operating model you can carry to any project. Everything in the repo (specs, agents, MCP servers, tests, Bicep, Helm, GitHub Actions) is authored by six GitHub Copilot Custom Coding Agents working from your IDE, plus the remote GitHub Copilot Coding Agent that closes the PR loop on GitHub. This is what AgenticOps looks like in practice. Two layers of agents — don't confuse them The first cognitive hurdle in this lab is keeping two very different agent populations straight: Layer What it is When it lives Examples Application agents The product you ship — the runtime ZavaShop fleet that solves a business problem Production (AKS + ACA) InventoryAgent, SupplierAgent, LogisticsAgent, PricingAgent, OrchestratorAgent Coding agents The dev-time team that writes the application agents Your IDE + GitHub requirements-analyst, mcp-builder, agent-builder, orchestrator-architect, test-author, deploy-engineer Both are built with the Microsoft Agent Framework (MAF). Both use the GitHub Copilot SDK as their model provider. But they exist at different layers of the development lifecycle, and the entire lab is structured around that distinction. If you only remember one thing from this post: the coding agents are how you build the application agents. That is the whole AgenticOps loop, compressed into one sentence. GitHub Copilot Coding Agent vs. Custom Coding Agents There are two flavors of "coding agent" in the GitHub Copilot ecosystem, and this lab uses both. 1. The remote GitHub Copilot Coding Agent This is the GitHub-side, asynchronous, PR-driven agent. You assign it an issue, it spins up a sandboxed environment, writes the code, runs the tests, and opens a PR for human review. You don't watch it work — you review what it produces. In ZavaShop, Lab 04 (Testing) explicitly uses this agent: you take a failing eval scenario, file it as an issue, assign it to Copilot, and the agent comes back with a PR. Your job is the human bar, not the keystrokes. Important governance choice from AGENTS.md: the remote Coding Agent is allowed to open PRs against src/ and tests/ only — never against infra/ without human review. That single rule is a textbook example of agent-aware policy. 2. The local Custom Coding Agents These are scoped, in-IDE specialist agents you select <agent name> in Copilot Chat. They live as *.agent.md files inside .github/agents/ and are discovered by VS Code on reload. Each one owns exactly one slice of the repository. Six of them ship in this lab: Phase Agent Owns Refusal rule Requirements requirements-analyst specs/*.md Refuses to write code MCP tools mcp-builder src/mcp_servers/* One server per turn Specialist agents agent-builder src/agents/<specialist>/* One specialist per turn Orchestration orchestrator-architect src/agents/orchestrator/*, src/shared/*, docker-compose.yml Owns wiring, not business logic Tests test-author tests/** Never edits src/ Deploy deploy-engineer infra/**, .github/workflows/** Won't touch application code The pattern that matters here isn't just "we made some custom agents." It's that every agent declares what it owns and what it refuses to do. That refusal envelope is what makes the system safe to delegate to. Without it, you'd just have a noisier autocomplete. Three workflow prompts in .github/prompts/ chain the agents together so you don't have to remember the sequence: /feature-from-issue — issue → spec → code → tests → PR → deploy /spec-to-code — drive an existing spec through code + tests /ship-it — quality gate → build → push → ACR/ACA/AKS rollout → smoke + evals This is the closest thing I've seen to a programmable software development lifecycle. Where AgenticOps fits in DevOps gave us repeatable infrastructure. MLOps gave us repeatable model lifecycles. AgenticOps is what you need when the thing you're operating is itself a fleet of autonomous agents — both at build time and at runtime. The lab makes the four pillars of AgenticOps concrete: Specs as the contract. /requirements-analyst produces specs/<slug>.md files with goals, contracts, and eval scenarios. Nothing else in the repo is built until that spec exists. Specs are the source of truth that human reviewers actually read. Skills as living documentation. .github/skills/<skill>/SKILL.md files hold shared, agent-agnostic knowledge — Python conventions, Kubernetes patterns, MAF idioms. Every coding agent declares which skills it must consult before writing code. This is how you stop drift: knowledge lives in one place and is pulled in on demand. Evals as the quality gate. The repo runs a four-layer test pyramid plus five golden eval scenarios (S1–S5). uv run poe check runs locally and in GitHub Actions. Copilot-authored PRs must pass the same bar a human does — no exceptions. Observability tied to agent identity. Every agent emits agent.name, agent.run_id, and agent.span_id through structlog. When something misbehaves in production, you can trace the line from "this evaluation failed" all the way back to "this version of this agent, on this run, called this tool with these arguments." These four pillars aren't ZavaShop-specific. They're the contract for any AgenticOps system: scoped ownership, contracts as code, evals as gates, identity in every span. Walking through the workshop: which agent does what, when The five labs are five chapters of one story — ZavaShop going from an empty Azure subscription to a live retail control plane. Each lab activates a different subset of coding agents. Lab 01 — Environment Setup (no coding agents yet) You provision the platform: AKS cluster, ACA environment, Azure Container Registry, Key Vault, and the Workload Identity that every agent will wear. Then you install the six Custom Coding Agents into your IDE. Think of this as hiring the development team and giving them their badges. Lab 02 — Agent Creation (four agents in play) This is where it clicks. You start by requirements-analyst in Copilot Chat to produce the spec for each ZavaShop application agent. Then mcp-builder is invoked four times to scaffold the four MCP servers — one per domain (inventory DB, supplier API, shipping API, pricing API). Then agent-builder runs four more times to build the typed ChatAgent specialists. Finally orchestrator-architect wires them together with a MAF Workflow. What's stunning about this lab is the handoff discipline. Every coding agent ends its turn with a line naming the next agent to invoke. You're not orchestrating the work — the agents are. Lab 03 — Multi-Agent Orchestration & Config (two agents) The orchestrator stops being a one-shot LLM call and becomes a deterministic Workflow. Secrets move from .env to Key Vault. The whole fleet boots locally with Docker Compose. This is orchestrator-architect's star turn — wiring A2A endpoints, MCP tool registration, Key Vault hydration, OpenTelemetry. Specs come from requirements-analyst; the rest is orchestration. Lab 04 — Testing (both coding agent flavors) /test-author writes the four-layer pyramid (unit, MCP contract, integration, eval). Then you switch gears: take a failing eval scenario, file it as a GitHub issue, and assign it to the remote GitHub Copilot Coding Agent. The agent works asynchronously, opens a PR, and uv run poe check decides whether it passes. This is the lab where the local-vs-remote distinction stops being abstract and starts being operational. Lab 05 — Deployment & Run (deployment specialist) /deploy-engineer writes the Helm chart for the AKS orchestrator and the Bicep modules for the ACA specialists. The /ship-it workflow prompt then runs the full pipeline: quality gate → ACR build → ACA deploy → AKS rollout → smoke tests → evals. GitHub Actions OIDC re-runs the same pipeline on every main push. Notice the pattern across all five labs: at no point does a human write production code from scratch. Humans set goals, review specs, approve PRs, and run quality gates. The keystrokes belong to agents. How Coding Agents transform the DevOps pipeline Take a step back from the lab and ask: what actually changes in your DevOps flow when you adopt this model? The atomic unit of work shifts. In classic DevOps the unit is the commit. In AgenticOps the unit is the spec. A spec drives one or more agents; agents produce commits; commits trigger CI; CI gates promotion. The commit becomes a derived artifact, not the starting point. Code review changes shape. You're no longer reviewing "did this human understand the codebase?" — you're reviewing "did this agent follow its refusal rules, consult its skills, and produce something that passes the evals?" Reviewers spend less time on style and more time on intent. The diff is often less interesting than the spec it came from. Governance becomes structural, not procedural. Instead of writing a wiki page that says "don't touch infra without review," you encode that rule in AGENTS.md and refuse to let the agent's tool set include infra paths. Policy becomes part of the agent definition, not a checklist humans hopefully remember. The CI pipeline expands. Beyond build/test/deploy, you now have an eval stage that asks "does the system still behave correctly on the golden scenarios?" — and a Copilot-authored PR has to pass the same eval stage as a human-authored one. The pipeline is the great equalizer. Onboarding compresses. A new engineer doesn't need to read 50 wiki pages to be productive. They read AGENTS.md, select the relevant agent walks them through. Institutional knowledge lives in .agent.md and SKILL.md files instead of senior engineers' heads. The net effect is a pipeline that's faster, more uniform, and easier to audit. Faster because agents parallelize what humans serialize. More uniform because every change goes through the same six-agent template. Easier to audit because every artifact has a named author and a refusal rule it had to respect. What to take away The AKS-Lab-GitHubCopilot workshop teaches three things at once. The surface lesson is "how to build a multi-agent retail system on AKS." The middle lesson is "how to use GitHub Copilot Custom Agents and the remote Coding Agent." The deepest lesson — and the one I'd argue matters most — is how to design a development process where AI agents are first-class citizens with bounded responsibilities, not free-form copilots. If you take the model and walk away from the lab, three patterns are worth keeping: Scope before capability. Don't give an agent every tool; give it the smallest surface that makes it useful. Specs are the API between humans and agents. Invest in requirements-analyst-style flows even if the rest of your stack isn't there yet. Evals are non-negotiable. The moment an agent can open a PR, you need a quality gate that doesn't care who the author is. Clone the repo microsoft/AKS-Lab-GitHubCopilot , hit Developer: Reload Window, select agents in Copilot Chat, and watch six teammates show up. That's the future of the DevOps pipeline — and it's already shipping. Resources microsoft/AKS-Lab-GitHubCopilot — The repository this post is built on. Best practices for using Copilot to work on tasks — Governance patterns for delegating issues to Copilot. GitHub Copilot SDK (Python) — The provider used by every agent in this lab.574Views0likes0Comments