KVStream: Smarter Memory Management for On-Device Language Model Inference
On-Device LLMs Are Starved for Memory Intelligence

The shift toward on-device language model inference is accelerating. Platforms like Microsoft's Foundry Local, Ollama, and llama.cpp now make it possible to run capable models such as Phi-3-mini and Llama 3 entirely on local hardware: no cloud dependency, no data leaving the device. This is a meaningful leap for privacy-sensitive applications, offline scenarios, and latency-critical workloads. But on-device and edge inference surfaces a class of performance problems that cloud-based serving largely abstracts away.

Memory is the bottleneck. When a language model generates text, it builds a Key-Value (KV) cache, a memory structure that stores intermediate attention states so it doesn't have to recompute them for every new token. On a cloud server with hundreds of gigabytes of HBM (High Bandwidth Memory), wasting KV cache memory is a rounding error. On a developer workstation or a GPU/NPU-equipped edge device, it is the difference between a usable system and a stalled one.

The typical on-device runtime allocates KV cache memory statically and per-sequence. This creates three concrete problems:

Fragmentation and over-reservation. A runtime has no way of knowing in advance how long a response will be, so it reserves a worst-case buffer sized for the maximum context window for every request, even if the actual response is short. The result is a fragmented memory pool where most of the reserved space sits idle while queued requests wait for a slot to open. Research confirms the scale of this waste: production settings with variable-length or concurrent requests typically waste 60–80% of KV memory under monolithic static allocation; paged designs reduce this to under 5%.

No batching across requests. Most local runtimes process one request at a time. If you fire four concurrent questions at Foundry Local or Ollama without a batching layer, they queue sequentially.
Total completion time then grows linearly with the number of queued requests instead of benefiting from the parallel processing the underlying hardware supports. Continuous batching, the technique of admitting new sequences mid-generation rather than waiting for the whole batch to drain, has been shown to yield up to 36.9× throughput improvement in the original Orca study, with production deployments regularly achieving 2–5× over static batching.

Redundant computation. Real-world applications consistently include a shared system prompt: instructions that tell the model how to behave. With a RAG pipeline or a chat assistant, the same few hundred tokens get prefilled on every single request, burning compute and inflating TTFT (time-to-first-token) unnecessarily. External prefix caching layers addressing this problem have demonstrated up to 15× throughput improvement on multi-round workloads.

These are not hypothetical inefficiencies. In practice, they translate to sluggish multi-turn conversation, poor throughput when multiple users or agent loops share the same local model, and wasted hardware potential on capable machines that could be doing much more. Crucially, while some of the cloud-scale inference engines embed these techniques internally, there is no equivalent orchestration layer for on-device runtimes, a gap that enterprise deployments and the research community are now actively calling out.

Introducing KVStream: A Middleware Layer for Local LLM Runtimes

KVStream is a lightweight Python middleware that sits between the application and any local LLM runtime, adding production-grade memory management and scheduling without requiring you to modify the backend or the client. The design principle is deliberate: KVStream is not a new inference engine. It does not replace Foundry Local or any other runtime.
Instead, it supplies the orchestration layer that on-device runtimes currently leave unaddressed: the gap between a single model server and an application that expects the reliability and throughput of a managed serving system. KVStream exposes a fully OpenAI-compatible API on http://localhost:8080/v1. Any existing client (the openai Python SDK, LangChain, httpx, or a curl command) connects to it without modification. The backend continues to run unchanged.

Architecture

KVStream is composed of four cooperating subsystems. Understanding how they interact clarifies why the gains are meaningful and composable.

1. Paged KV-Cache Allocator: Inspired by the paged attention mechanism, KVStream manages KV cache memory as a pool of fixed-size pages (or blocks), each holding a configurable number of token states (default: 16 tokens per block). Rather than reserving a contiguous worst-case buffer per sequence, the allocator assigns pages on demand and reclaims them when a sequence finishes. Each sequence owns a logical page table, a mapping from logical page indices to physical block slots. Pages can be shared across sequences (enabling prefix deduplication) and migrated between GPU and CPU (enabling preemption). For Foundry Local and Ollama, this operates in soft-inject mode: the page table controls admission and logical accounting, while the actual KV tensors stay inside the backend. For llama.cpp, which exposes a /slots API for saving and restoring raw KV state, KVStream can operate in hard-inject mode: it manages a real tensor pool and reuses the cache with zero recomputation by physically restoring KV state between requests.

2. Continuous Batching Scheduler: The scheduler merges multiple queued sequences into a single batched forward pass, up to a configurable max_batch_size. Critically, it uses continuous batching: new sequences can be admitted mid-generation, filling slots vacated by completed requests rather than waiting for the entire batch to finish.
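As a rough illustration of the continuous batching idea, the sketch below simulates a scheduler loop in which every step generates one token for each running sequence and then refills freed slots from the waiting queue. The `Sequence` class, the `run_continuous_batching` function, and the token counts are hypothetical and are not KVStream's API; only the `max_batch_size` name mirrors the option mentioned above. Each request is assumed to need at least one token.

```python
from collections import deque
from dataclasses import dataclass


@dataclass
class Sequence:
    """Hypothetical request: a name and the number of tokens still to generate."""
    name: str
    remaining_tokens: int


def run_continuous_batching(requests, max_batch_size):
    """Simulate decode steps; return {name: step at which the sequence finished}."""
    waiting = deque(requests)  # first-come, first-served order
    running = []
    finished_at = {}
    step = 0
    while waiting or running:
        # Admit new sequences into free slots before every step,
        # instead of waiting for the whole batch to drain.
        while waiting and len(running) < max_batch_size:
            running.append(waiting.popleft())
        step += 1
        # One batched "forward pass": every running sequence emits one token.
        for seq in running:
            seq.remaining_tokens -= 1
            if seq.remaining_tokens == 0:
                finished_at[seq.name] = step
        # Completed sequences vacate their slots immediately.
        running = [seq for seq in running if seq.remaining_tokens > 0]
    return finished_at


if __name__ == "__main__":
    reqs = [Sequence("a", 2), Sequence("b", 5), Sequence("c", 3)]
    print(run_continuous_batching(reqs, max_batch_size=2))
```

In this toy run, request `c` takes over the slot freed by `a` and finishes at step 5. With a static batch of two it could not start until `b` had finished, and would complete at step 8.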
Two scheduling priorities are supported:

- fcfs (first-come, first-served): straightforward FIFO, best for fairness.
- sjf (shortest-job-first): minimizes average TTFT by prioritizing requests with shorter expected output lengths.

When GPU page blocks are exhausted and a new sequence must be admitted, the scheduler applies a preemption policy:

- swap: the lowest-priority active sequence's pages are migrated to CPU RAM. The sequence resumes when GPU blocks become available again.
- recompute: pages are freed and the sequence is re-queued from scratch. This has lower memory overhead but higher latency for the preempted request.

3. Prefix Cache: The prefix cache deduplicates the KV computation for any token prefix shared across requests, such as system prompts, few-shot examples, RAG preambles, or any stable instruction block. The mechanism works in three steps:

- After a request completes its prefill phase, KVStream hashes the prompt tokens in block-aligned chunks and stores the canonical block table for that prefix.
- On a subsequent request whose prompt starts with the same prefix, the new sequence forks the canonical block table via copy-on-write; no re-computation occurs.
- Entries expire after a configurable TTL (default: 1 hour), or can be evicted manually.

The practical consequence: in any application with a stable system prompt, only the first request pays the prefill cost for that prompt. Every subsequent request skips it entirely, reducing TTFT in proportion to the length of the shared prefix.

4. OpenAI-Compatible Proxy: KVStream wraps all of the above behind a standard /v1/chat/completions interface. It handles both streaming (SSE) and non-streaming responses, translates between the OpenAI request format and the backend's native API, and serves /health, /status, and /metrics endpoints for observability.
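The three-step prefix cache mechanism described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not KVStream's implementation: `PrefixCache`, `block_hashes`, and the chained SHA-256 hashing are choices made here for clarity, and a "block table" is simply a list of integer block ids.

```python
import hashlib
import time

BLOCK_SIZE = 16  # tokens per block, matching the default quoted above


def block_hashes(tokens, block_size=BLOCK_SIZE):
    """Hash the prompt in block-aligned chunks; each hash covers the whole prefix so far."""
    hashes = []
    running = hashlib.sha256()
    for i in range(len(tokens) // block_size):  # a trailing partial block is ignored
        chunk = tokens[i * block_size:(i + 1) * block_size]
        running.update(",".join(map(str, chunk)).encode() + b";")
        hashes.append(running.copy().hexdigest())
    return hashes


class PrefixCache:
    """Hypothetical cache: prefix hash -> (block ids for that prefix, expiry time)."""

    def __init__(self, ttl_seconds=3600):
        self.ttl_seconds = ttl_seconds
        self._entries = {}

    def store(self, tokens, block_table, now=None):
        """Record the block table for every block-aligned prefix of the prompt."""
        now = time.time() if now is None else now
        for n, digest in enumerate(block_hashes(tokens), start=1):
            self._entries[digest] = (list(block_table[:n]), now + self.ttl_seconds)

    def lookup(self, tokens, now=None):
        """Return the block ids of the longest unexpired cached prefix (possibly empty)."""
        now = time.time() if now is None else now
        best = []
        for digest in block_hashes(tokens):
            entry = self._entries.get(digest)
            if entry is None or entry[1] <= now:
                break
            best = entry[0]
        # Return a copy so the caller can extend it without touching the cached
        # table; a simple stand-in for forking via copy-on-write.
        return list(best)
```

Because each hash covers the entire prefix up to that block, two prompts match only as far as they are identical from the first token, which is the behaviour a shared system prompt needs.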
The /status endpoint returns live scheduler and memory state:

    {
      "scheduler": {
        "waiting": 2,
        "running": 8,
        "swapped": 0,
        "gpu_blocks_free": 184,
        "gpu_utilization": 0.281
      },
      "prefix_cache": {
        "cached_prefixes": 3,
        "total_prefix_hits": 12,
        "cached_tokens": 768
      }
    }

A Prometheus-compatible /metrics endpoint is also available, with a pre-configured Grafana dashboard included in the Docker Compose stack.

Getting Started with Foundry Local:

Microsoft's Foundry Local is the primary integration target for KVStream. Foundry Local provides a high-quality on-device inference runtime with strong model support (Phi-3-mini, Phi-3.5, and others), NPU acceleration, and direct integration with the Windows AI platform. KVStream's Foundry backend adds continuous batching and prefix caching on top of that foundation without any changes to the Foundry runtime itself.

Installation:

    pip install kvstream

Start the KVStream Proxy:

    # 1. Start Foundry Local (if not already running)
    foundrylocal serve

    # 2. Start KVStream in front of it
    kvstream serve --backend foundry --model phi-3-mini --port 8080

KVStream's Foundry backend includes auto-discovery: if Foundry Local assigns an ephemeral OS port (which it typically does), KVStream scans localhost to locate the active service automatically, so you do not need to hardcode the backend port.

Connect your existing client, unchanged:

    from openai import OpenAI

    client = OpenAI(
        base_url="http://localhost:8080/v1",
        api_key="not-required",
    )

    response = client.chat.completions.create(
        model="phi-3-mini",
        messages=[
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": "Explain paged attention in simple terms."},
        ],
        max_tokens=512,
    )
    print(response.choices[0].message.content)

Any client that speaks the OpenAI protocol, such as LangChain, httpx, or a raw HTTP call, connects here without modification.
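Since the proxy speaks the OpenAI wire format, a raw HTTP call needs nothing beyond the standard library. The sketch below assumes the proxy is listening on `http://localhost:8080/v1`, that it follows the standard OpenAI chat completions request and response shape, and that it does not require an Authorization header (suggested by the `api_key="not-required"` value above, but still an assumption). The helper names are illustrative.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"  # KVStream proxy address used in this post


def build_chat_request(question, model="phi-3-mini", max_tokens=256):
    """Build (but do not send) a POST request for /chat/completions."""
    payload = {
        "model": model,
        "messages": [
            {"role": "system", "content": "You are a helpful assistant."},
            {"role": "user", "content": question},
        ],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def ask(question):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(build_chat_request(question), timeout=60) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running proxy; nothing is sent on import.
    print(ask("Explain paged attention in simple terms."))
```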
Maximize prefix cache hits:

Put the system prompt first; KVStream caches it after the first request and skips its computation for every subsequent one.

    import asyncio
    from openai import AsyncOpenAI

    client = AsyncOpenAI(base_url="http://localhost:8080/v1", api_key="not-required")

    SYSTEM = "You are an expert assistant specialized in on-device AI."

    async def ask(question: str) -> str:
        r = await client.chat.completions.create(
            model="phi-3-mini",
            messages=[
                {"role": "system", "content": SYSTEM},  # cached after first call
                {"role": "user", "content": question},
            ],
            max_tokens=256,
        )
        return r.choices[0].message.content

    async def main():
        questions = [
            "What is NPU acceleration?",
            "How does Phi-3-mini differ from larger models?",
            "What is token streaming?",
        ]
        # KVStream batches these automatically; system prompt computed once
        answers = await asyncio.gather(*[ask(q) for q in questions])
        for q, a in zip(questions, answers):
            print(f"Q: {q}\nA: {a}\n")

    asyncio.run(main())

Tuning memory for hardware:

The single most impactful configuration knob is the GPU block pool size. For soft-inject mode (Foundry Local, Ollama), this controls admission concurrency, not actual VRAM allocation by KVStream.
A practical starting table is as follows:

| Device VRAM / RAM | num_gpu_blocks | Suitable models (e.g.) |
| 4 GB | 64 | phi-3-mini |
| 8 GB | 128 | llama3-8b, mistral-7b |
| 16 GB | 256 | llama3-8b (q8) |
| 24 GB | 512 | llama3-70b (q4), mixtral |

Via YAML configuration:

    # kvstream.yaml
    backend:
      type: foundry
      model: phi-3-mini
    memory:
      num_gpu_blocks: 128
      num_cpu_blocks: 256
      block_size: 16
    scheduler:
      max_batch_size: 8
      preemption_policy: swap
      priority: fcfs
    prefix_cache:
      enabled: true
      ttl_seconds: 3600
      min_match_tokens: 16

Benchmarking

KVStream ships with a built-in benchmarking command:

    kvstream bench \
      --url http://localhost:8080 \
      --model phi-3-mini \
      --concurrency 8 \
      --prompt-len 128 \
      --output-len 64 \
      --total-requests 100

Example output:

    ┌──────────────────────────┐
    │ KVStream Benchmark       │
    ├──────────────┬───────────┤
    │ Requests     │ 100       │
    │ Concurrency  │ 8         │
    │ Errors       │ 0         │
    │ Throughput   │ 12.4 req/s│
    │ p50          │ 612 ms    │
    │ p99          │ 1840 ms   │
    └──────────────┴───────────┘

KVStream covers the most common local inference runtimes today (Foundry Local, Ollama, llama.cpp, and LM Studio) through a single, consistent interface. If you are building on Foundry Local and running into the throughput or memory fragmentation issues described here, KVStream is designed to slot in with a single command and zero changes to your application code.

    pip install kvstream
    kvstream serve --backend foundry --model phi-3-mini

The full integration guide, configuration reference, and Docker deployment instructions are available in the KVStream Documentation on GitHub. We are always looking to improve! If you want to help make KVStream even better, check out our Contributing Guide to get started on your first pull request.

Enterprise-ready Claude Desktop with Entra ID, APIM, and Microsoft Foundry (No Backend Required)
How I put corporate sign-in in front of Claude Desktop without writing a single line of backend code. TL;DR — In this post, I show how to securely enable Claude Desktop in enterprise environments using Microsoft Entra ID, Azure API Management, and Microsoft Foundry — without deploying a custom backend. This approach removes API keys from endpoints, enforces per-user identity, and aligns fully with Zero Trust principles. Who this is for: Enterprise architects evaluating secure AI client patterns Developers enabling Claude Desktop in regulated environments Platform teams standardizing identity and governance for LLM access Why this post exists: Microsoft Learn's Configure Claude Desktop with Foundry Models only shows the API-key path — a shared key pasted into every user's Claude Desktop config. That's fine for a quick demo, but it's a non-starter for most enterprises (no per-user identity, no MFA / Conditional Access, hard to revoke, hard to audit). This post fills that gap: same Foundry backend, but with Microsoft Entra ID SSO in front via Azure API Management, so each user signs in with their corporate identity and zero secrets land on the laptop. The problem For many teams experimenting with Claude Desktop, the blocker isn't capability — it's enterprise readiness. How do you enforce identity, eliminate shared secrets, and apply governance without standing up a custom backend service to sit in front of the model? If your team wants to use Claude Desktop with your own Anthropic deployment running on Microsoft Foundry, but with a few non-negotiable requirements: No shared API keys floating around on developer laptops. Per-user identity — every request must be attributable to a real person. MFA and Conditional Access must apply, the same way they do for every other internal app. Central rate-limiting and logging — a centralized control plane for governance. 
Claude Desktop 1.5+ supports a "Gateway SSO" mode where it can sign each user in with OpenID Connect and forward their token to a custom LLM gateway. Azure API Management (APIM) is a perfect fit for that gateway role: it validates the user's Entra ID token, then re-authenticates itself to Foundry behind the scenes. APIM acts as a centralized policy enforcement layer, enabling identity validation, traffic governance, and secure re-authentication to backend AI services without custom code. The end-to-end flow looks like this: %%{init: {'flowchart': {'nodeSpacing': 60, 'rankSpacing': 80, 'useMaxWidth': true}, 'themeVariables': {'fontSize':'16px'}} }%% flowchart TB User([Corporate user]) Claude["Claude Desktop"] Entra["Microsoft Entra ID<br/>(OIDC + MFA + Conditional Access)"] APIM["Azure API Management<br/>validate-jwt → rewrite headers<br/>(policy gateway)"] Foundry["Microsoft Foundry<br/>Claude deployment"] User -- "1. Sign in (browser PKCE)" --> Entra Entra -- "2. ID token" --> Claude Claude -- "3. POST /v1/messages<br/>Authorization: Bearer ID token" --> APIM APIM -- "4. OIDC discovery / JWKS" --> Entra APIM -- "5. x-api-key (or Managed Identity)" --> Foundry Foundry -- "6. Response" --> APIM APIM -- "7. Response" --> Claude classDef azure fill:#0a4d8c,stroke:#0a3a6b,color:#ffffff; classDef client fill:#f3f3f3,stroke:#888,color:#222; class Entra,APIM,Foundry azure; class Claude,User client; Or in plain text: Claude Desktop │ Authorization: Bearer <Entra ID token from the user's browser sign-in> ▼ Azure API Management (<your-apim>) │ ① validate-jwt → verifies user's Entra ID token │ ② re-auths to Foundry with an API key from a Named value │ Authorization stripped, x-api-key injected ▼ Microsoft Foundry /anthropic/v1/messages │ runs Claude (<your-deployment>) ▼ Response back to the user There are no API keys on user devices. Foundry's key lives only inside APIM. And every request carries the user's oid claim, so I can build dashboards and per-user quotas later. 
What you need before starting An Azure subscription with a Microsoft Foundry (AI Services) account and a Claude deployment. (Throughout this post I'll just call it Foundry.) An API Management instance, any tier. Permission to register applications in Entra ID for your tenant. Claude Desktop 1.5.0 or later. Azure CLI installed locally. Throughout this post I'll use placeholders for resource names: <apim-name> — your API Management service name <resource-group> — the resource group that holds it <foundry-account> — your Foundry account name <deployment-name> — the name of the Claude model deployment on Foundry Step 1 — Register an Entra ID app for Claude Desktop This is the OIDC client Claude Desktop signs users into. Claude Desktop requires a single-tenant, public PKCE client (no client secret) with a loopback redirect URI, configured under the Mobile and desktop applications platform in Entra ID — the only platform that allows any loopback port. I scripted it so the setup is one command and idempotent: # scripts/register-claude-entra-app.ps1 [CmdletBinding()] param( [string] $TenantId = '<your-tenant-id>', [string] $SubscriptionId = '<your-subscription-id>', [string] $ResourceGroup = '<resource-group>', [string] $ApimName = '<apim-name>', [string] $AppDisplayName = 'Claude Cowork gateway', [string] $RedirectUri = 'http://127.0.0.1/callback' ) az account set --subscription $SubscriptionId | Out-Null # 1. Create (or reuse) the app registration $appId = az ad app list --display-name $AppDisplayName --query "[0].appId" -o tsv if (-not $appId) { $appId = az ad app create --display-name $AppDisplayName ` --sign-in-audience AzureADMyOrg --query appId -o tsv } # 2. 
Configure as public PKCE client with the Mobile/Desktop redirect URI $objectId = az ad app show --id $appId --query id -o tsv $patch = @{ publicClient = @{ redirectUris = @($RedirectUri) } isFallbackPublicClient = $true } | ConvertTo-Json -Depth 5 -Compress az rest --method PATCH ` --uri "https://graph.microsoft.com/v1.0/applications/$objectId" ` --headers "Content-Type=application/json" --body $patch | Out-Null # 3. Ensure a service principal exists $sp = az ad sp list --filter "appId eq '$appId'" --query "[0].id" -o tsv if (-not $sp) { az ad sp create --id $appId | Out-Null } # 4. Push two Named values into APIM for the validate-jwt policy az apim nv create -g $ResourceGroup --service-name $ApimName ` --named-value-id entra-tenant-id --display-name entra-tenant-id ` --value $TenantId --secret false az apim nv create -g $ResourceGroup --service-name $ApimName ` --named-value-id entra-client-id --display-name entra-client-id ` --value $appId --secret false "Client ID: $appId" Run it once. The output prints the client ID you'll need in Claude Desktop later, and it leaves two Named values in APIM ( entra-tenant-id , entra-client-id ) that the gateway policy will reference. ⚠️ Common pitfall: if the redirect URI ends up under the Web platform instead of Mobile and desktop applications, Entra will demand a client secret on token exchange — Claude won't send one and you'll get Token exchange failed (HTTP 401) . The app type can't be changed after creation, so create a new app if that happens. 
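For readers unfamiliar with what a "public PKCE client" does at sign-in, here is a minimal sketch of the verifier/challenge pair defined in RFC 7636 (S256 method). It illustrates the general protocol rather than Claude Desktop's internal code, and the function names are illustrative.

```python
import base64
import hashlib
import secrets


def make_code_verifier(num_bytes=32):
    """Random URL-safe string; 32 random bytes encode to 43 characters, the RFC 7636 minimum."""
    raw = secrets.token_bytes(num_bytes)
    return base64.urlsafe_b64encode(raw).rstrip(b"=").decode("ascii")


def make_code_challenge(verifier):
    """S256 challenge: base64url(SHA-256(verifier)) with the padding stripped."""
    digest = hashlib.sha256(verifier.encode("ascii")).digest()
    return base64.urlsafe_b64encode(digest).rstrip(b"=").decode("ascii")


if __name__ == "__main__":
    verifier = make_code_verifier()
    print("code_verifier :", verifier)
    print("code_challenge:", make_code_challenge(verifier))
```

The client sends the challenge with the authorization request and the verifier with the token request, so the identity provider can confirm both came from the same client. That is why this flow needs no client secret, and why registering the redirect URI under the Web platform (which expects a secret) breaks it.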
Step 2 — Create the API in APIM In the portal under APIM → APIs → + Add API → HTTP: Field Value Display name Anthropic API Name anthropicapi Web service URL https://<foundry-account>.services.ai.azure.com/anthropic API URL suffix claude Subscription required Off (Entra ID is our only credential) Add two operations under it: Method URL Display name POST /v1/messages Create message GET /v1/models List models The /v1/models operation isn't strictly needed (Foundry's Anthropic surface doesn't implement it), but having it registered means you can decide later whether to stub it out or proxy it. Step 3 — Add an API key for Foundry as a Named value APIM → Named values → + Add: Name: foundry-key Type: Secret Value: paste a key from the Foundry account's Keys and Endpoint blade. This is the only place the key ever lives. Clients never see it. Alternative — keyless with Entra ID (managed identity): If you prefer not to manage a Foundry key at all, enable the APIM instance's system-assigned managed identity (APIM → Identity → System assigned → On), then grant that identity the Foundry User role on the Foundry account (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d — previously named Azure AI User; Microsoft renamed it but the ID and permissions are unchanged). In Step 4, replace the set-header that injects x-api-key with: <authentication-managed-identity resource="https://cognitiveservices.azure.com" output-token-variable-name="foundry-token" /> <set-header name="Authorization" exists-action="override"> <value>@("Bearer " + (string)context.Variables["foundry-token"])</value> </set-header> Then you can skip the foundry-key Named value entirely. Don't use the legacy Cognitive Services User role — per the Foundry RBAC doc, roles starting with Cognitive Services don't apply to Foundry scenarios. Step 4 — Write the gateway policy This is the core enforcement layer in the architecture. 
Open APIs → anthropicapi → All operations → Inbound processing → </> and paste: <policies> <inbound> <base /> <!-- USER → APIM: verify Entra ID token from Claude Desktop --> <validate-jwt header-name="Authorization" failed-validation-httpcode="401" failed-validation-error-message="Unauthorized" require-scheme="Bearer"> <openid-config url="https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0/.well-known/openid-configuration" /> <audiences> <audience>{{entra-client-id}}</audience> </audiences> <issuers> <issuer>https://login.microsoftonline.com/{{entra-tenant-id}}/v2.0</issuer> </issuers> </validate-jwt> <!-- APIM → Foundry --> <set-backend-service base-url="https://<foundry-account>.services.ai.azure.com/anthropic" /> <set-header name="x-api-key" exists-action="override"> <value>{{foundry-key}}</value> </set-header> <set-query-parameter name="api-version" exists-action="skip"> <value>2024-05-01-preview</value> </set-query-parameter> </inbound> <backend><base /></backend> <outbound><base /></outbound> <on-error><base /></on-error> </policies> Two things to notice: validate-jwt uses the OIDC discovery URL — JWKS keys are fetched and cached automatically. It rejects any token whose aud claim is not the client ID of our Entra app, which is exactly what we want. The Authorization header from the user is not forwarded — once validate-jwt succeeds, the request is re-authenticated to Foundry with x-api-key . No user token ever leaves APIM. APIM becomes the security boundary — user identity is validated at the edge, and downstream services never see or rely on user tokens. 
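To make the claim checks concrete, the sketch below decodes a JWT payload and applies the same audience and issuer comparisons the policy configures, plus an expiry check. It deliberately skips signature verification, which validate-jwt performs with the JWKS keys, so it is only suitable for inspecting a token while debugging and never for authorization. The function names are illustrative, and the tenant and client IDs passed in are whatever your own registration produced.

```python
import base64
import json
import time


def decode_payload(token):
    """Return the (unverified) claims from a JWT's middle segment."""
    payload_b64 = token.split(".")[1]
    payload_b64 += "=" * (-len(payload_b64) % 4)  # restore stripped base64 padding
    return json.loads(base64.urlsafe_b64decode(payload_b64))


def check_claims(claims, tenant_id, client_id, now=None):
    """Mirror the policy's audience and issuer checks, plus expiry. Returns a list of problems."""
    now = time.time() if now is None else now
    problems = []
    if claims.get("aud") != client_id:
        problems.append("aud does not match the app's client ID")
    if claims.get("iss") != f"https://login.microsoftonline.com/{tenant_id}/v2.0":
        problems.append("iss does not match the tenant's v2.0 issuer")
    if claims.get("exp", 0) <= now:
        problems.append("token is expired")
    return problems
```

An empty list means the token would pass these particular comparisons; a 401 from the gateway with an empty list here points at something else, such as the signature or a missing Bearer scheme.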
Step 5 — Configure Claude Desktop Open Claude Desktop → Configure third-party inference and fill it in like this: Field Value Connection Gateway Credential kind Interactive sign-in Gateway base URL https://<apim-name>.azure-api.net/claude Client ID (the appId your script printed) Issuer URL https://login.microsoftonline.com/<tenant-id>/v2.0 Authorization URL / Token URL leave empty Bearer token ID token (default) Scopes leave default ( openid profile email offline_access ) Redirect port leave empty (ephemeral) Model discovery Off Model list → Model ID <deployment-name> (your Foundry deployment name) ℹ️ Why Model discovery is Off — Claude Desktop's discovery uses GET /v1/models , and the Foundry /anthropic surface doesn't implement that endpoint, so it 404s. Listing the model manually skips the call entirely. If you want to leave Model discovery On, stub /v1/models in APIM. Add a GET /v1/models operation to your API and give it this inbound policy that returns an Anthropic-shaped response without ever hitting the backend: <policies> <inbound> <base /> <return-response> <set-status code="200" reason="OK" /> <set-header name="Content-Type" exists-action="override"> <value>application/json</value> </set-header> <set-body>@{ return new JObject( new JProperty("data", new JArray( new JObject( new JProperty("id", "<deployment-name>"), new JProperty("type", "model"), new JProperty("display_name", "Claude on Foundry"), new JProperty("created_at", "2026-01-01T00:00:00Z") ) )), new JProperty("has_more", false), new JProperty("first_id", "<deployment-name>"), new JProperty("last_id", "<deployment-name>") ).ToString(); }</set-body> </return-response> </inbound> <backend><base /></backend> <outbound><base /></outbound> <on-error><base /></on-error> </policies> Add one entry per deployment you want to expose. 
The benefit of stubbing rather than turning discovery off is that adding new models becomes a policy edit — no need to re-export and redeploy Claude Desktop config to every user. Click Apply Changes then Sign in to your organization. Your browser opens to the normal Entra sign-in page; once approved you're returned to the app, and a quick connection test runs. The success indicator is a small green banner: ✅ Inference — 1-token completion in 1449 ms · via identity provider For broader rollout, hit the Export button at the top of the configuration window — it produces a .mobileconfig (macOS) or .reg (Windows) you can push via Intune / Jamf to every user's machine. Step 6 — Verify both hops In APIM → APIs → anthropicapi → Test → POST /v1/messages I sent: Headers: anthropic-version: 2023-06-01 Body: { "model": "<deployment-name>", "max_tokens": 64, "messages": [{"role":"user","content":"hi"}] } Click Send → Trace, and look at two places: Inbound → validate-jwt: should say succeeded and show the decoded claims (your oid , email , etc.). Backend → Request: outbound URL is https://<foundry-account>.services.ai.azure.com/anthropic/v1/messages?api-version=2024-05-01-preview , with x-api-key: **** present and Authorization absent. Backend → Response: 200, with a Claude message JSON body. That confirms both halves of the chain. 
Bumps I hit along the way A few common issues encountered during setup — sharing so you can skip them: Symptom Cause Fix Claude shows "Your provider's model list hasn't loaded yet" and /v1/models returns 404 Foundry's Anthropic surface doesn't implement that endpoint Turn Model discovery OFF in Claude Desktop and add the deployment name manually Claude shows "Authentication failed" even though sign-in worked The APIM API still had Subscription required = ON, blocking the call before validate-jwt ran with 401: Access denied due to missing subscription key Uncheck Subscription required on the API Portal Test panel shows "Cannot read properties of undefined (reading 'statusCode')" The test console doesn't attach an Entra token, so validate-jwt 401s and the panel's JavaScript crashes Comment out <validate-jwt> temporarily for portal testing, or test via curl with a real token OIDC discovery failed (HTTP 404) in Claude Desktop Pasted the metadata URL into Issuer URL Issuer must end at /v2.0 , not at /.well-known/openid-configuration Token exchange failed (HTTP 401) App registered under Web platform instead of Mobile and desktop applications Create a new app with the right platform — it can't be changed Where this leaves us This pattern is small in moving parts but has outsized architectural impact: Zero secrets on endpoints. Eliminates API-key sprawl across laptops, MDM profiles, and shared vaults. The Foundry key lives only inside APIM — or disappears entirely when you switch APIM to managed identity. Identity, not credentials. Every Claude Desktop user authenticates against Entra ID in their browser, the same as Office or Teams. MFA, Conditional Access, and Entra ID Protection apply automatically — no parallel auth story to maintain. Per-user observability built in. APIM logs carry the user's Entra oid , email , and group claims. That unlocks per-user dashboards, cost allocation, and abuse detection without any client-side instrumentation. Aligned with Zero Trust. 
Strong identity at the edge, no implicit trust between hops, a single policy chokepoint for inspection and rate-limiting, and full revocability through a single Enterprise Application.

Optional but trivial keyless path. Flip APIM to a system-assigned managed identity, add <authentication-managed-identity resource="https://cognitiveservices.azure.com" /> to the policy, and create one Foundry User role assignment (role ID 53ca6127-db72-4b80-b1b0-d745d6d5456d, formerly Azure AI User) on the Foundry account. See the Foundry RBAC doc, and don't use any Cognitive Services * roles for Foundry.

What I'd add next

- llm-token-limit and llm-emit-token-metric policies for per-user quotas and cost visibility.
- App Insights wiring on the API, with a workbook that pivots on the oid claim.
- Assignment required = Yes on the Entra Enterprise Application, plus a security group, so only approved users can sign in.
- Intune deployment of the exported .reg / .mobileconfig so the gateway URL and client ID land on devices automatically.

But that's all incremental. The hard part (getting Claude Desktop, Entra ID, APIM, and Foundry to agree on who's allowed to talk to whom) is done. Total elapsed: about an afternoon, most of it spent learning where each portal hides its switches.

Useful links

- Gateway single sign-on with your identity provider (Claude.ai Documentation)
- Configure Claude Desktop with Foundry Models (Microsoft Learn)
- Role-based access control for Microsoft Foundry (Microsoft Learn)

Make Your Copilot Credits Count: A Student's Guide to Smarter AI Usage
If you're a student enrolled in GitHub Education, you already have something most developers pay for: free access to GitHub Copilot and its premium features. That's incredible. But here's the thing: free access doesn't mean unlimited usage, and not all AI interactions cost the same. Every chat message, every agent task, every model call consumes something called AI Credits, and knowing how they work will help you use Copilot smarter, produce better code, and build the kind of disciplined AI habits that professional developers are only just starting to learn.

This post is inspired by a fantastic deep-dive from my colleague, developer advocate Bruno: "GitHub Copilot and Tokens: How to Keep Using AI Without Burning Your Budget". We've taken those professional lessons and tailored them specifically for students, because your learning environment, your assignments, and your goals are different from those of a seasoned engineer at a tech company.

TL;DR: Use autocomplete before chat. Choose the right model. Keep context small. Start fresh chats often. Plan before you build. These habits will make you a better developer and stretch your credits further.

What Are AI Credits and Why Do They Matter?

When you interact with GitHub Copilot through chat, agent mode, or inline edits, the model processes tokens. Tokens are small chunks of text (roughly 3–4 characters each). Every interaction consumes:

- Input tokens: everything sent to the model (your message, attached files, chat history, instructions)
- Output tokens: everything the model generates back to you
- Cached tokens: context the model reuses from previous turns (cheaper)

These tokens are converted to AI Credits, where 1 AI Credit = $0.01 USD. Different models have very different token costs: a lightweight model like GPT-5 mini charges $0.25 per million input tokens, while a powerful model like GPT-5.5 charges $5.00 per million input tokens (20x more expensive).
Using the wrong model for a simple task is like taking a taxi to a destination that's a 5-minute walk. See the official pricing table: GitHub Copilot Models and Pricing . Figure 1: The four cost tiers of Copilot interactions. Autocomplete and Next Edit Suggestions are free — they do not consume AI Credits on paid plans Strategy 1: Tab Before Chat The Free Tier is Powerful Here is the single most impactful habit you can build: always try autocomplete before opening chat. According to GitHub's official billing documentation, code completions and Next Edit Suggestions are not billed as AI Credits on paid plans. That means every time you press Tab to accept an inline suggestion, you are getting AI assistance for free. Use autocomplete (Tab) for: Completing a line or a simple function Generating repetitive boilerplate (constructors, properties, getters/setters) Completing a repeated pattern you've started Writing obvious next lines like console.log , imports, or variable declarations Adjusting variable names inline Only move to Inline Edit (Ctrl+I / Cmd+I) when autocomplete isn't enough for a local change. Only open a Chat window when you need genuine reasoning an explanation, a plan, or a multi-step solution. As Bruno puts it: "The most expensive model in the world should not be helping you write public string Name { get; set; } . That's what Tab is for. And coffee." Strategy 2: Choose the Right Model for the Job GitHub Copilot gives you access to models from OpenAI, Anthropic, and Google each at different price points and capability levels. The key insight from VS Code's official Copilot usage guide is: reserve powerful reasoning models for tasks that genuinely need them. 
| Your Task | Recommended Model Tier | Example Models |
| --- | --- | --- |
| Simple question or boilerplate | Lightweight | GPT-5 mini, Gemini 3 Flash |
| Code explanation or basic docs | Lightweight | GPT-5 mini, GPT-5.4 nano |
| Writing tests or debugging a single function | Medium / Versatile | Claude Haiku 4.5, GPT-5.4 |
| Multi-file refactor or code review | Medium / Versatile | Claude Sonnet 4.6, GPT-5.4 |
| Complex system design or architecture | Powerful | Claude Opus 4.7, GPT-5.5 |
| Long agentic workflows | Powerful (scoped!) | Claude Opus 4.8, GPT-5.5 |
| Not sure what you need | Auto (recommended default) | Copilot selects for you |

GitHub Copilot's Auto Model Selection feature automatically chooses a model based on task complexity, availability, and policies. For most students, Auto should be your default; only switch manually when you have a specific reason. And when the complex task is done, switch back to Auto or a lighter model.

Strategy 3: Context is Currency, Smaller is Smarter

Here's the counterintuitive truth that surprises most developers: the expensive part of a prompt is usually not the question you type; it's everything surrounding it. The tokens Copilot consumes on each request can include:

- All your previous chat messages in the session
- Every file you have open or attached
- Workspace search results Copilot pulled in
- Build output, terminal logs, or diff content
- Responses from any MCP (Model Context Protocol) servers you have enabled
- Your custom instructions file (.github/copilot-instructions.md)

A single question inside a conversation with 80 messages, 12 open files, and 3 tool call results can cost significantly more than the same question asked fresh in a new chat with one relevant file attached.

Figure 2: The same task asked two ways. Scope your prompts to save credits and often get better answers.
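The gap between a bloated and a fresh chat can be roughed out with a back-of-the-envelope estimate. The sketch below assumes about 4 characters per token (the upper end of the range given earlier) and invents message, file, and tool-result sizes purely to mirror the scenario; the helper names are hypothetical and real tokenizers will give different numbers.

```python
CHARS_PER_TOKEN = 4  # rough heuristic ("roughly 3-4 characters"); real tokenizers differ


def estimate_tokens(text: str) -> int:
    """Very rough token estimate based on character count."""
    return len(text) // CHARS_PER_TOKEN


def prompt_tokens(question: str, history=(), files=(), tool_results=()) -> int:
    """Sum the estimated tokens of everything sent alongside the question."""
    parts = [question, *history, *files, *tool_results]
    return sum(estimate_tokens(part) for part in parts)


question = "Why does login() return 401 for valid users?"

# Hypothetical sizes, chosen only to mirror the scenario in the text.
bloated = prompt_tokens(
    question,
    history=["x" * 400] * 80,        # 80 earlier messages
    files=["x" * 6_000] * 12,        # 12 open files
    tool_results=["x" * 3_000] * 3,  # 3 tool call results
)
fresh = prompt_tokens(question, files=["x" * 6_000])  # one relevant file

print(f"bloated chat: ~{bloated} tokens, fresh chat: ~{fresh} tokens")
```

The question itself contributes almost nothing to either total, which is why trimming history and attachments matters more than shortening the prompt text.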
Practical rules for context management: Attach only 2–3 relevant files — not your entire project Don't ask Copilot to analyse the whole repo when you only need changes in one module Paste only the first relevant error from a log, not 2,000 lines of output Remove timestamps and duplicate stack traces from pasted logs State the expected output format explicitly so the model stops early Use /compact in VS Code Chat to summarise a long conversation without losing key context Use /fork to explore an alternative direction without polluting the main conversation Strategy 4: Start Fresh Chats When You Change Tasks This is one of the simplest optimisations and one of the most ignored. The VS Code Copilot usage guide is explicit about it: when a conversation grows, it carries context from all previous messages. If you switch to an unrelated task in the same session, the model still processes that irrelevant history and you pay for it in credits. Bad pattern: Chat session: - "Help me fix the JWT bug in auth.ts" [10 messages] - "Now write unit tests for my sorting algorithm" [still in same chat!] - "Can you generate the README for my project?" [still in same chat!] - "Now debug this CSS layout issue..." [still in same chat!] Smart pattern: Chat 1: "Fix JWT bug in auth.ts" - DONE, close chat. Chat 2: "Write unit tests for sorting algorithm" - DONE, close chat. Chat 3: "Generate README for project" - fresh context, fresh cost. New task = new chat. Your human brain benefits too — focused sessions produce better outcomes than sprawling multi-topic conversations. Strategy 5: Plan Before You Build Use Agent Mode Wisely Agent mode is one of the most powerful Copilot features for students working on larger assignments — it can create files, run terminal commands, edit across multiple files, and execute tests. But agent mode also carries the highest token cost, because it loops: it plans, acts, observes tool output, then plans again. 
The VS Code documentation recommends separating planning from implementation to reduce rework and back-and-forth. Here's a phased approach that saves credits and produces better results: Figure 3: The credit-smart workflow. Always try the cheaper option first, escalate only when needed. Phase 1: Plan (lightweight model, low cost) I need to add user authentication to my Express app. Before writing any code, give me a step-by-step plan covering which files to create, which packages to install, and what tests to write. Do not write code yet. Phase 2: Scoped Implementation (one feature at a time) Using the plan we agreed, implement only Step 1: create src/middleware/auth.ts with JWT validation. Do not modify any other files yet. Phase 3: Validate Run the existing tests in tests/auth.test.ts and report the results. Fix only test failures related to the new auth middleware. Phase 4: Cleanup The implementation is complete. Update README.md with setup instructions for the auth module. Keep it under 200 words. Each phase is small, scoped, and verifiable. You can stop at any phase, check the result, and only continue when you're satisfied. This dramatically reduces expensive re-runs where the agent reverses its own changes. Strategy 6: Review Your MCP Servers and Custom Instructions MCP Servers MCP (Model Context Protocol) servers let Copilot connect to external tools databases, GitHub issues, Jira, Slack, browser automation, and more. Each enabled server expands what the agent can do, but also adds to the context the model must consider, which increases token usage. For students, a practical rule: only enable MCP servers relevant to your current project. If you're working on a simple Python web app, you probably don't need browser automation, a Kubernetes connector, and a Slack integration all active at the same time. See the VS Code MCP servers documentation for how to enable, disable, and configure them. 
Custom Instructions A .github/copilot-instructions.md file in your repository lets you give Copilot standing instructions — coding standards, testing commands, architecture conventions. This is a fantastic feature. But that file is included in every prompt's context, so a bloated instructions file costs credits on every single interaction. A good custom instructions file is: Short — under 200 words for a student project Specific to this repository's real conventions Clear about test commands (e.g., npm test , pytest ) Free of generic advice that applies to every codebase on earth Example of a good student instructions file: # Copilot Instructions for MyWebApp Language: TypeScript (strict mode) Framework: Express.js with Prisma ORM Tests: Run with `npm test` (Jest) Lint: Run with `npm run lint` (ESLint + Prettier) Conventions: - Use async/await, not callbacks - Validate all request inputs with Zod - Keep controllers thin; put logic in service files - Write a test for every new public function That's it. Short, actionable, and genuinely useful — not a 500-line manifesto. Strategy 7: Use Traditional Tools First AI is excellent for reasoning, explaining, planning, and connecting ideas. It is not the right tool for every job. Before reaching for Copilot chat, ask yourself whether a traditional tool can answer your question faster, cheaper, and more reliably: Compiler / type-checker — to find type errors (TypeScript, mypy) Linter — to find style and logic issues (ESLint, Pylint, Checkstyle) Formatter — to fix formatting (Prettier, Black, gofmt) Test runner — to confirm whether your code works (Jest, pytest, JUnit) Debugger — to step through execution and inspect state Docs / Stack Overflow — for well-documented APIs and common patterns If your linter tells you there's a missing import, fix it directly — don't ask Copilot to analyse your code to find it. Let deterministic tools do deterministic work, and let AI do the reasoning where it genuinely adds value. 
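In the spirit of letting deterministic tools do deterministic work, the 200-word guideline for the instructions file can be checked with a few lines of Python instead of a chat prompt. This is a minimal sketch: the limit is this post's suggestion for student projects, not a Copilot rule, and the function names are made up for illustration.

```python
from pathlib import Path

WORD_LIMIT = 200  # this post's suggested ceiling for a student project


def instruction_word_count(text: str) -> int:
    """Count whitespace-separated words, Markdown markers included."""
    return len(text.split())


def check_instructions(path: str = ".github/copilot-instructions.md") -> str:
    """Return a one-line report on the size of the instructions file."""
    file = Path(path)
    if not file.exists():
        return f"{path}: not found"
    words = instruction_word_count(file.read_text(encoding="utf-8"))
    verdict = "ok" if words < WORD_LIMIT else "consider trimming"
    return f"{path}: {words} words ({verdict})"


if __name__ == "__main__":
    print(check_instructions())
```

Run from the repository root, it costs no credits and gives the same answer every time.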
Your GitHub Education Benefits: What You Get If you haven't already, apply for GitHub Education with your school email address. Once verified, you receive: Free GitHub Copilot including premium features — see how to enable Copilot as a student Free GitHub Codespaces — 180 core hours per month, equivalent to GitHub Pro (great for browser-based coding with Copilot built in) GitHub Student Developer Pack — free access to dozens of professional tools from GitHub's partners, including cloud credits, domains, and IDEs GitHub Classroom — your instructors can manage assignments and provide feedback GitHub Community Exchange — discover and contribute to student-built projects Campus Experts program — become a student leader in your tech community These benefits are designed to give you real-world tools in an educational setting. Copilot is the standout feature — it's the same tool professional developers use every day. Using it wisely during your studies means you'll arrive in the workforce already ahead of the curve. Pre-Prompt Checklist for Students Before you fire off your next Copilot prompt, run through this checklist. It takes 10 seconds and can save significant credits — and more importantly, it builds the mental habits of a professional AI user. Figure 4: Two-column checklist covering what to check before opening chat and when writing your prompt. Before you open chat: ☐ Can Tab / autocomplete solve this? ☐ Is inline edit (Ctrl+I) enough for this local change? ☐ Can a linter, compiler, or test runner answer this? ☐ Is this a different task from my last message? If so, start a new chat. ☐ Am I on Auto model selection (or the right tier for this task)? ☐ Should I ask for a plan before asking for code? ☐ Do I have MCP servers enabled that I don't need right now? ☐ Is my copilot-instructions.md file concise and current? 
When writing your prompt: ☐ Attach only 2–3 relevant files, not the whole project ☐ Paste only the first relevant error from any logs ☐ Define the files to change, the goal, and any files not to touch ☐ Ask for a plan before implementation on complex tasks ☐ Remove timestamps and duplicate stack traces from pasted logs ☐ State the expected output format and length ☐ Use /compact if the session is getting long ☐ Use /fork to explore alternatives without polluting the main thread A Note on Responsible AI Use in Education Using Copilot smartly is not just about saving credits it's about developing genuine skills. When you ask Copilot to write all your code without understanding it, you lose the learning opportunity the assignment was designed to create. When you review and understand every suggestion Copilot makes, you learn faster, build better instincts, and can confidently explain your own work. Best practices for academic integrity with AI tools: Understand before you accept — never paste code you can't explain Use Copilot to learn, not to skip learning — ask it to explain the code it generates Follow your institution's AI policy — many universities have specific guidance on AI use in assessments Treat Copilot as a senior pair-programmer, not an answer machine — question its suggestions, push back, iterate Verify facts and documentation links — AI can hallucinate; always check official sources GitHub Education exists to give you real professional tools while you learn. The goal is for you to graduate with genuine skills, a real portfolio, and the confidence that comes from building things yourself — with AI as your collaborator, not your ghostwriter. 
Key Takeaways

- Tab first: autocomplete and Next Edit Suggestions are free; use them for everything small
- Auto model by default: only switch to a powerful model when you have a clear reason
- Context is cost: fewer files, fewer messages, fewer tools = fewer tokens
- New task = new chat: don't carry stale context into unrelated work
- Plan before you build: a 10-message plan session is cheaper than 50 messages of rework
- Keep instructions short: your copilot-instructions.md runs on every prompt
- Use traditional tools first: linters and compilers are free, fast, and deterministic
- Understand your code: Copilot is a collaborator, not a replacement for learning

Resources and Next Steps

- GitHub Education (apply for your free student benefits)
- GitHub Student Developer Pack (explore free tools for students)
- Enable GitHub Copilot as a student
- GitHub Copilot: Models and Pricing (understand exactly what each model costs)
- Auto Model Selection in GitHub Copilot
- VS Code: Optimising GitHub Copilot Usage (the official guide that inspired many of these tips)
- Managing MCP Servers in VS Code
- El Bruno: GitHub Copilot and Tokens (the original professional perspective)
- GitHub Education Community Discussions (connect with students and educators worldwide)

This post draws on insights from El Bruno's developer blog and best practices from GitHub Education. All pricing figures are sourced from the official GitHub Copilot billing documentation and are correct as of June 2026.

Building Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I've been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF).

Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach.

Approach 1: Agents SDK

Agents SDK provides a straightforward way to create agents with integrated tools and models.

Example: Creating an Agent

    import os

    from azure.ai.projects import AIProjectClient
    from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType
    from azure.identity import DefaultAzureCredential

    client = AIProjectClient(
        credential=DefaultAzureCredential(),
        endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
    )

    # Configure tools.
    # conn_id is the ID of the Azure AI Search connection; it is not defined in
    # this snippet and is assumed to have been looked up earlier.
    ai_search = AzureAISearchTool(
        index_connection_id=conn_id,
        index_name="my-index",
        query_type=AzureAISearchQueryType.SEMANTIC,
    )

    # Create agent (persisted in Foundry portal)
    agent = client.agents.create_agent(
        model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"),
        name="MyAgent",
        instructions="You are a helpful assistant.",
        tool_resources=ai_search.resources,
        tools=ai_search.definitions,
    )

    # Run conversation
    thread = client.agents.threads.create()
    client.agents.messages.create(thread_id=thread.id, role="user", content="Hello")
    run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id)

What this approach provides

- Native integration with Azure AI services (OpenAI, AI Search, MCP)
- Managed execution environment
- Simple and quick agent setup

Conceptually, this approach can be summarized as: Model + Tools + Execution

Strengths

✅ Rapid development and onboarding
✅ Strong integration within the Azure ecosystem
✅ Well-suited for single-agent or tool-driven use cases
✅ Minimal infrastructure overhead

Challenges observed in practice

As the complexity of scenarios increases, certain limitations become more visible:

- Multi-agent workflows require custom orchestration logic
- Agent handoffs must be implemented manually
- Context sharing across agents requires additional design effort

While this approach offers flexibility, it shifts orchestration complexity to the developer.

Approach 2: Microsoft Agent Framework (MAF)

Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design.

Creating an Agent

    import asyncio
    import os

    from agent_framework import Agent, WorkflowBuilder, Message
    from agent_framework.foundry import FoundryChatClient
    from azure.identity import DefaultAzureCredential

    client = FoundryChatClient(
        project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"),
        model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"),
        credential=DefaultAzureCredential(),
    )

    # Create agents (in-process only, not persisted in portal)
    researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.")
    writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.")

    # Build the multi-agent workflow: researcher output feeds the writer
    workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build()

    async def main() -> None:
        # "async for" must run inside a coroutine, so the loop is wrapped in main()
        async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True):
            print(event.content)

    asyncio.run(main())

What this approach provides

- Built-in orchestration capabilities
- Native support for multi-agent workflows
- Structured agent lifecycle management
- Context and memory handling

Conceptually, this can be viewed as: Agents + Orchestration + System Design

Observations from implementation

When implementing similar use cases using MAF:

- Agent responsibilities became clearly defined
- Routing and delegation patterns were significantly simplified
- Overall system architecture became easier to maintain and
scale This approach encourages thinking in terms of agent ecosystems rather than isolated agents. Architecture Comparison Agents SDK Microsoft Agent Framework (MAF) Choosing the Right Approach Use Agents SDK when: You need rapid development for a single-agent use case The workflow is relatively straightforward You prefer flexibility and lower-level control Use Microsoft Agent Framework when: You are designing multi-agent systems Your solution requires routing, delegation, or handoffs Long-term scalability and maintainability are essential Pros and Cons Summary Agents SDK Pros Easy to get started Strong Azure integration Flexible design Cons Manual orchestration required Limited native multi-agent support Complexity increases as scenarios grow Microsoft Agent Framework (MAF) Pros Built-in orchestration Native multi-agent support Scalable and structured architecture Cons Learning curve for new developers More opinionated framework design Reduced low-level control compared to SDK-based approach References and Repositories 🔗 Microsoft Agent Framework (MAF) Microsoft Agent Framework – GitHub Repository Microsoft Agent Framework Samples – Tutorials & Examples Workflow Samples (Multi-agent patterns) FoundryChatClient sample (Python) Agent Framework demos - GitHub Source 📘 Documentation Microsoft Agent Framework Overview (Microsoft Learn) Agent Framework + Microsoft Foundry provider docs 🔗 Azure AI Projects / Agents SDK Azure AI Projects SDK – Python (GitHub Source) Azure AI Projects Agents (.NET SDK repo) 📘 Documentation Azure AI Projects SDK (Python) – Microsoft Learn Azure AI Agents SDK – Microsoft Learn Conclusion Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape. Agents SDK enables quick and flexible agent development Microsoft Agent Framework enables structured, scalable agent systems In practice, the choice depends on whether you are building a single agent feature or a multi-agent system. 
Final Thought

Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence.

In a follow-up blog, I'll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.

Harness-Driven Agents: Secure Podcast Pipeline in Hyperlight MicroVM Sandbox
The moment the agent reached for rm -rf For most of 2024 and 2025, "agents" were a demo word. By 2026 they are something you run — autonomously, in a loop, executing code they wrote themselves a second ago. I was watching one work late one night. I had given it a goal, a handful of tools, and the freedom to write and run its own Python. For twenty minutes it was magic: read a file, reason about it, write a script, run it, inspect the output, correct itself, try again. Then it produced this: import shutil shutil.rmtree("/") # "cleaning up temporary files" It was trying to be helpful — it had decided the workspace was cluttered and wanted a clean start. The "workspace," as far as that process was concerned, was my entire machine. I killed it in time. But the lesson is the one every agent builder eventually arrives at: the model is not the dangerous part — the execution is. A chatbot that answers wrong is annoying. An agent that fetches a web page, runs code, and writes files has a blast radius. The bounding box has to come from infrastructure, not from a system prompt. harnessagent_sandbox_demo is a concrete build that puts that bounding box in exactly the right place — and it does it in service of a real, charming little product: a daily five-minute Mandarin podcast about the FIFA World Cup 2026. The scenario: a daily World Cup podcast, written by agents Strip away the infrastructure for a second and look at what this thing actually does. Every day it produces a fresh Mandarin podcast script about the FIFA World Cup 2026. Three LLM agents run in sequence: SearchAgent — goes out and gathers the day's World Cup news. ContentAgent — turns that raw material into structured podcast content. GenScriptAgent — writes the final, readable five-minute script. The output is two text files — one in Simplified Chinese, one in Traditional Chinese: ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt That's the whole product. 
It sounds simple — and the point of the project is that making it safe is the hard part. SearchAgent has to reach the open internet. All three agents write and run code. If you wire that naively, you have just built the exact machine that types shutil.rmtree("/") for you. So the entire architecture is organized around one principle: the agents get to do real work, but every dangerous capability is fenced behind a hardware boundary. Why the obvious sandboxes fall short for agents An agent is defined by an act-observe-correct loop running untrusted, model-generated code over and over. That single property breaks most conventional isolation choices. Option Why it falls short for agents No sandbox One rm -rf, one leaked .env, one rogue network call — the blast radius is the whole machine. Container Great for shipping apps, but a coding agent wants to build and run its own container, which means Docker-in-Docker and elevated privileges that quietly undo the isolation. WASM / V8 isolate Fast to start, but you isolate a language runtime, not an OS — no system packages, no arbitrary shell, and hardening the engine is a moving target. Full VM Rock-solid isolation, but cold starts in seconds and heavy memory — exactly the friction that pushes developers to skip isolation entirely. Each option trades away safety, speed, or compatibility. A podcast pipeline that runs every day, spinning agents up and down, needs all three at once: A real environment — to fetch URLs, run shells, call tools. A hard boundary — so a bad step can't reach the host. Near-instant lifecycle — because a slow sandbox is a sandbox developers skip, and an unused safety feature protects nobody. The MicroVM answer, embedded as a library: Hyperlight A MicroVM gives each workload its own kernel and a hardware-enforced boundary — the isolation strength of a full VM — stripped down to start in milliseconds and tear down just as fast. Misbehave inside, and you hit a wall; there is no path back to the host. 
And it is disposable by design: when an agent goes off the rails, you delete the sandbox and reopen in milliseconds, with nothing to clean up. Most MicroVM runtimes (Firecracker and friends) are cloud infrastructure — server-side. Hyperlight is different: a lightweight Virtual Machine Manager (a CNCF sandbox project) designed to be embedded inside your application, like a library. MicroVMs that boot in milliseconds, with guest function calls completing in microseconds. No guest kernel, no OS — the guest is a purpose-built no_std Rust/C binary. Nothing in there to attack. Sandboxed by default — no filesystem, no network, nothing, unless explicitly granted. Typed function calls across the VM boundary, and snapshot/restore to rewind to a clean state between calls. Runs on KVM, MSHV (Microsoft Hypervisor), and Windows Hypervisor Platform. This project uses the Wasm backend: the three agents share a single HyperlightRuntime, and the guest is reset to a clean snapshot before every code execution. That detail is what makes a daily, many-step pipeline cheap — you capture the sandbox state once and rewind to it, instead of rebuilding a VM hundreds of times. Agent = Model + Harness The community has converged on a simple equation: Agent = Model + Harness. The model is a brain in a jar — text in, text out, no memory between calls, no loop, no hands. It can express the intent to call a tool; it cannot actually call it. The harness is the execution layer: it calls the model, handles its tool calls, and decides when to stop. As the Hugging Face glossary puts it, "if you're not the model, you're the harness." That reframes the safety problem precisely. When my agent emitted shutil.rmtree("/"), the model deleted nothing — it merely suggested. The harness would have run it. The harness is where reasoning meets reality, so it is exactly where safety must live. The question stops being "how do I make the model safer?" 
and becomes: how do I build a harness that executes the model's intent inside a boundary it cannot escape? The Microsoft Agent Framework answers that with first-class agent harness capabilities in Python and .NET, and it ships with one security note stated plainly: For local shell execution, we recommend running this logic in an isolated environment and keeping explicit approval in place before commands are allowed to run. The harness is the steering wheel — it does not pretend to be the seatbelt and the crumple zone. For that, it points you outward: run this somewhere isolated. Hyperlight is that isolated somewhere. This project snaps the two pieces together. The architecture: two planes, one bridge Here is the heart of the design. Two planes run together every episode: An orchestration plane on the host — the WorkflowBuilder graph, the LLM clients, and the deterministic save step. An execution plane inside one Hyperlight Wasm sandbox — the only place LLM-generated code is allowed to run. The single bridge between them is one call: call_tool("fetch_url", ...). 
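The shape of that bridge can be mimicked in plain Python. The sketch below is a toy model only: it provides no isolation whatsoever, does no network I/O, and does not use the Hyperlight or Agent Framework APIs. `HostBridge`, `fake_fetch_url`, and `guest_script` are hypothetical names chosen to show one idea, namely that guest code receives a single `call_tool` entry point and the host decides what sits behind it.

```python
from typing import Any, Callable, Dict


class HostBridge:
    """Hypothetical host-side registry: the guest reaches only what is registered."""

    def __init__(self) -> None:
        self._tools: Dict[str, Callable[..., Any]] = {}
        self.calls: Dict[str, int] = {}  # per-tool call counter kept on the host

    def register(self, name: str, fn: Callable[..., Any]) -> None:
        self._tools[name] = fn

    def call_tool(self, name: str, **kwargs: Any) -> Any:
        if name not in self._tools:
            raise PermissionError(f"tool {name!r} is not granted to the guest")
        self.calls[name] = self.calls.get(name, 0) + 1
        return self._tools[name](**kwargs)


def fake_fetch_url(url: str) -> dict:
    """Stand-in for the host tool; returns canned data instead of fetching."""
    return {"STATUS": "ok", "URL": url, "BODY": "<html>placeholder</html>"}


def guest_script(call_tool: Callable[..., Any]) -> str:
    """Plays the role of model-written code: its only capability is call_tool."""
    page = call_tool("fetch_url", url="https://www.bbc.com/sport/football/world-cup")
    return page["STATUS"]


bridge = HostBridge()
bridge.register("fetch_url", fake_fetch_url)
print(guest_script(bridge.call_tool))
print(bridge.calls)
```

In the real project the boundary is enforced by the MicroVM rather than by a Python object, so this only illustrates the call surface, not the security property.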
The mapping to layers:

| Layer | Component | Role |
| --- | --- | --- |
| Model | Azure AI Foundry via FoundryChatClient (AzureCliCredential) | The reasoning brain behind each harness agent |
| Agent runtime | Microsoft Agent Framework create_harness_agent | Drives the model, advertises skills, handles tool calls, decides when to stop |
| Orchestration | WorkflowBuilder graph | prepare → SearchAgent → adapt → ContentAgent → adapt → GenScriptAgent → save_scripts |
| Code execution | CodeAct provider | Runs model-written code via the one execute_code tool, inside the MicroVM, never on the host |
| Isolation | Hyperlight Wasm MicroVM | One shared HyperlightRuntime; clean snapshot restored before every execute_code |
| Host tool | fetch_url (sandbox/podcast_tools.py) | The only network path; urllib + a BBC-only allow-list |
| Persistence | save_scripts Executor | Deterministic, no LLM; parses two fenced blocks and writes the two output files |

The four invariants that make it safe

The README is explicit about what the diagram guarantees. These four invariants are the whole security argument.

The model never sees the network. Its only tool is execute_code. Network access happens only when the guest itself runs call_tool("fetch_url", ...) from inside the sandbox. The model cannot reach the internet directly; it can only ask the guest to, and the guest can only reach BBC.

One sandbox per run, snapshot per call. All three agents share the same HyperlightRuntime. Before every execute_code, the guest is reset to a clean snapshot, so nothing one step does can leak into the next, and there is no VM to rebuild.

Two counter paths, and why there are two. The function_middleware (make_tool_call_recorder) sees the model-direct execute_code calls. But the inner, guest-initiated fetch_url is dispatched by Hyperlight straight to the FunctionTool, bypassing the middleware entirely. So a second counter, make_call_tool_counter(on_call=), bumps state["tool_call_counts"][<agent>]["fetch_url"] on every guest invocation.
Two observation points, because the architecture has two genuinely different call surfaces.

Deterministic save: no LLM in the persistence step. GenScriptAgent only emits text. The save_scripts Executor parses the two fenced code blocks out of that text and writes the simplified and traditional files itself. There is no model in the loop when bytes hit disk, so the output path is fully predictable.

Now let's look at the real code surface

The README documents the API the demo is built on. The snippets below reflect that surface.

1. Install and environment

    pip install agent-framework-hyperlight --pre

    # Hyperlight needs a hypervisor: KVM on Linux, WHP on Windows. macOS is not yet supported.
    # The model runs on Azure AI Foundry; FoundryChatClient authenticates via AzureCliCredential.
    az login
    export HYPERLIGHT_PYTHON_GUEST_PATH="/path/to/python_guest"

2. A harness agent that carries only a stub: skills do the rest

Each of the three agents is built with create_harness_agent + FoundryChatClient. The agents themselves carry only a tiny stub instruction; their real role prompts and the shared sandbox/CodeAct guardrails live as file-based Agent Skills under skills/. The harness's built-in SkillsProvider advertises those SKILL.md packages, and the model loads them at runtime via load_skill.

    from agent_framework import create_harness_agent
    from agent_framework.foundry import FoundryChatClient
    from azure.identity import AzureCliCredential

    # Model on Azure AI Foundry, not Azure OpenAI directly.
    client = FoundryChatClient(credential=AzureCliCredential())

    # The agent carries a tiny stub. Its real persona ("you gather World Cup
    # news", "you write the script") lives in a SKILL.md package under skills/,
    # advertised by the harness SkillsProvider and pulled in via load_skill.
    search_agent = create_harness_agent(
        chat_client=client,
        name="SearchAgent",
        instructions="You are a harness agent.
Load your skill, then begin.", ) 3 The CodeAct surface: one tool the model can see This is the CodeAct pattern from 02-agents/context_providers/code_act/code_act.py. The model sees exactly one tool — execute_code. Any extra capability (here, only fetch_url) is reachable from inside the guest via call_tool(...). # What the MODEL sees and writes — one script, not ten tool round-trips: # # # inside execute_code, running in the Hyperlight Wasm guest: page = call_tool("fetch_url", url="https://www.bbc.com/sport/football/world-cup") # # ... parse page["BODY"], pull out today's stories ... print(top_stories) # # execute_code is the ONLY tool on the model's surface. call_tool("fetch_url", ...) is reachable only from inside the sandbox. 4. The one host tool, with a BBC-only allow-list fetch_url lives on the host (sandbox/podcast_tools.py). It is the single bridge across the boundary, and it is deliberately narrow. import urllib.request from urllib.parse import urlparse ALLOWED_DOMAINS = {"bbc.com", "www.bbc.com"} # allow-list: BBC only def fetch_url(url: str) -> dict: """The ONLY network path out of the sandbox. Host-side, allow-listed.""" host = urlparse(url).netloc if host not in ALLOWED_DOMAINS: return {"STATUS": "blocked", "URL": url} with urllib.request.urlopen(url, timeout=20) as resp: body = resp.read(8192).decode("utf-8", "ignore") # BODY capped at ~8 KB return { "STATUS": "ok", "URL": url, "TITLE": _extract_title(body), "DESCRIPTION": _extract_description(body), "LINKS": _extract_links(body), "BODY": body, } Notice what this buys you: even if SearchAgent writes hostile code, the worst it can do over the network is read BBC, 8 KB at a time. The allow-list is host-side and the model never sees it — it cannot be prompt-injected away. 5. 
Wiring the graph and the deterministic save from agent_framework import WorkflowBuilder workflow = ( WorkflowBuilder() .add_node("prepare", prepare) .add_node("SearchAgent", search_agent) .add_node("adapt_1", adapt) .add_node("ContentAgent", content_agent) .add_node("adapt_2", adapt) .add_node("GenScriptAgent", genscript_agent) .add_node("save_scripts", save_scripts) # deterministic Executor, NO LLM .build() ) # GenScriptAgent emits text containing two fenced blocks (simplified + # traditional). save_scripts parses them and writes the files itself — # there is no model in the persistence step. await workflow.run() # -> ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt # -> ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt 6. The payoff Run that shutil.rmtree("/") inside this pipeline now and the result is delightfully boring: the agent deletes its own throwaway sandbox, the host never notices, and the next execute_code starts from a clean snapshot. Two things to call out: Snapshot/restore means every code execution starts from a clean, reusable baseline — capture state once, rewind between calls, instead of rebuilding the whole VM. For a daily pipeline that runs the act-observe-correct loop many times, that is the difference between "fast enough to always use" and "slow enough to skip." Because each agent writes one script instead of ten round-tripped tool calls, the CodeAct approach keeps both latency and token usage down — the model reasons once and lets the guest do the busywork behind the boundary. Where it fits, and the one idea to keep harnessagent_sandbox_demo lives inside Multi-AI-Agents-Cloud-Native — a gallery of patterns for running agent systems safely on Azure: A2A multi-agent orchestration, the Kubernetes sidecar pattern, hardened pipelines, and a sibling sample that runs Copilot agents on AKS inside Kata Containers MicroVMs at the pod level. 
And the README is explicit that this design is cloud-native: running it in-cluster on AKS changes nothing about the architecture. It is the same WorkflowBuilder graph, the same Hyperlight sandbox, the same deterministic save_scripts executor. The local build and the in-cluster build are the same shape.

The two MicroVM samples are two ends of one spectrum. The Kata sample puts the boundary around the whole pod, which makes it a deployment topology. This Hyperlight demo pulls the boundary all the way into the agent process itself, so the sandbox becomes a library call. It is the same question (where do you place the hardware boundary in an agent stack?) answered at two different altitudes.

The old pitch for sandboxing always carried an asterisk: yes, it's safer, but you'll pay in speed, compatibility, or friction. MicroVMs erase the asterisk: VM-grade isolation, cold starts fast enough that there's no reason to skip it, and a real environment your agents can actually work in. Enough of a real environment, in fact, to write you a World Cup podcast every morning.

The one idea to internalize: the harness decides, the MicroVM contains. Give your agent a room where it is allowed to fail, then let it be brilliant.

References
Project: harnessagent_sandbox_demo · Multi-AI-Agents-Cloud-Native
Hyperlight: hyperlight-dev/hyperlight · hyperlight-dev/hyperlight-sandbox
Agent Framework: Agent Harness in Microsoft Agent Framework
Background: Why MicroVMs (Docker) · Harness vs. Scaffold glossary (Hugging Face)
Install: pip install agent-framework-hyperlight --pre · .NET: dotnet add package Microsoft.Agents.AI.Hyperlight --prerelease
Requirements: KVM (Linux) or WHP (Windows); macOS not yet supported.

DevOps for Microsoft Hosted Agents: From Terraform Apply to Production-Grade Agent Delivery
A companion piece to Infrastructure as Code for AI: Building and Deploying Microsoft Hosted Agents with Terraform. Just announced — source-code deploy (preview). Foundry has just added a second Hosted Agent deploy path alongside the container path this post covers. Instead of a container image, you upload a .zip of your source plus a requirements.txt (Python 3.13 / 3.14) or a .csproj (.NET 10), and the Agent Service either builds dependencies for you ( remote_build ) or runs a prebuilt bundle ( bundled ). The version definition uses code_configuration instead of container_configuration — the two are mutually exclusive on a given version. Versioning is content-addressable on the zip's SHA-256, so the dedup behaviour described below still applies. Required roles shift slightly: deploying the agent needs Foundry Project Manager at project scope, and the platform-assigned agent identity gets Foundry User (both handled automatically by azd and the Foundry VS Code Toolkit). The DevOps loop in this post — immutable versions, eval gating, manifest-driven promotion, traffic-split canary, per-version observability — transfers directly; only the build-and-push stage changes (no Dockerfile, no ACR for remote_build ). The container path covered here remains fully supported and is still the right choice if you need custom base images, system packages, or non-Python/.NET runtimes. Full details: Deploy a hosted agent from source code (preview). What this post assumes. It describes recommended enterprise DevOps patterns on top of Microsoft Foundry Hosted Agents. Some patterns — evaluation gating, traffic-based rollout, manifest-driven promotion — are best practices and may not be enforced by the platform itself. Hosted Agents and several related capabilities (A2A, certain deployment and routing controls) are in preview and may evolve. TL;DR Terraform provisions the platform: Foundry account, project, model deployment, ACR, App Insights, RBAC. 
DevOps pipelines ship agent versions, not source branches: the deploy artifact is a container image digest plus an immutable version spec.
Evaluation should be treated as a release gate, not a dashboard. Quality regressions should fail the build the same way unit-test failures do.
Traffic split between versions is the rollout and rollback primitive. Rollback typically avoids rebuilding or redeploying artifacts.
Observability is sliced per version: during canary, two versions serve simultaneously and aggregate metrics lie.

The Delivery Pipeline at a Glance

    Terraform ───► Foundry project (AIServices) + model deployment + ACR + App Insights
                        │
    PR opened           ▼
        └─► docker build ───► push to ACR ───► capture image digest
                        │
                        ▼
        Foundry SDK: create agent version (image digest + cpu/mem + env + protocols)
                        │
                        ▼
        Evaluation gate ────► fail → stop
                        │
                        ▼ pass
        Promote via manifest → staging → prod
                        │
                        ▼
        Traffic-split canary (0% → 10% → 100%)
                        │
                        ▼
        App Insights: per-version latency, cost, sampled quality, sandbox sizing

Infrastructure as Code gets the platform stood up. It does not, on its own, ship an agent. The gap between terraform apply succeeding and a customer-facing agent reliably serving requests in production is where DevOps lives, and for Microsoft Hosted Agents on Microsoft Foundry, that gap has its own shape.

A Hosted Agent is not a prompt and a tool list. It is your own code, packaged as a container image, pushed to Azure Container Registry, and deployed to a Foundry project. The Foundry Agent Service pulls the image, provisions an isolated execution environment per agent session, assigns the agent its own dedicated Microsoft Entra ID (agent identity), and exposes a dedicated endpoint. An agent supports up to four protocols, any of which can be combined in a single deployment:

Responses ( .../protocols/openai/responses ): OpenAI-compatible chat-style API. Implemented in the container.
Invocations ( .../protocols/invocations ) — arbitrary JSON in / arbitrary JSON out for webhook receivers and non-conversational workloads. Implemented in the container. A2A ( .../protocols/a2a , preview) — the open Agent2Agent protocol for agent-to-agent delegation across frameworks and vendors. Surfaced on its own endpoint path by the platform. Activity — the Teams / M365 channel protocol. The platform bridges Responses to Activity automatically when an agent is published to a Microsoft 365 channel. Microsoft manages the runtime, scaling, session state, and lifecycle. You ship the image and the version definition. Important — Foundry version compatibility. Hosted Agents are supported on the new Microsoft Foundry project resource model ( azurerm_cognitive_account_project under a Cognitive Services account of kind = "AIServices" ). The older Azure AI Foundry Hub model ( azurerm_ai_foundry / azurerm_ai_foundry_project , kind = "Hub" ) — the Azure ML–derived workspace surface — does not expose Hosted Agent capabilities. They are two distinct Azure resource types with different APIs. Everything in this post assumes the new Foundry project. That shape drives three things every DevOps loop for Hosted Agents has to handle: The deploy artifact is a container image plus an immutable agent version. A version snapshots the image digest, CPU/memory, environment variables, and protocol configuration. To change anything, you create a new version. The platform supports weighted traffic between versions, which is your blue/green and canary primitive. The agent identity is created for you, per agent. You don't pick one or wire managed-identity references manually. Each agent is assigned a dedicated Microsoft Entra ID (agent identity) at deploy time; RBAC to downstream resources is granted to that identity. Quality is non-deterministic. Two terraform apply runs against the same configuration produce identical resources. 
Two agent runs against the same input can produce different outputs. Your pipeline has to gate on evaluation, not only on tests passing and HTTP 200s. This post lays out an end-to-end DevOps loop on top of that shape: how to structure the repository, what runs in CI versus CD, how to gate releases on evaluation, how to promote across environments, how to use version traffic split for safe rollouts and instant rollback, and what observability is worth wiring beyond the defaults. A Quick Tour of Microsoft Foundry If you've spent more time in Azure OpenAI or AI Studio than in Foundry, a short orientation helps before the DevOps patterns make sense. Microsoft Foundry is Microsoft's unified platform for building, evaluating, deploying, and operating AI applications and agents. It consolidates what used to be spread across Azure OpenAI, Azure AI Studio, and the AI Hub model into a single resource and a single portal at ai.azure.com. Three pieces are worth knowing up front. The resource model Foundry is built on two Azure resources: Foundry account — an azurerm_cognitive_account with kind = "AIServices" , project_management_enabled = true , a custom_subdomain_name , and a managed identity. This is the top-level container: it holds your model deployments (Azure OpenAI and the broader Foundry model catalog), connections to backing services, and the Foundry-managed Toolbox MCP endpoint. Foundry project — an azurerm_cognitive_account_project under that account. A project is the scope for agents, evaluations, conversation history, indexes, and per-app connections. One project per app or per environment is the usual shape. This is the new Foundry model — and it is the only model that supports Hosted Agents. The older Azure AI Foundry Hub ( azurerm_ai_foundry + azurerm_ai_foundry_project , kind = "Hub" ) is a separate Azure ML–derived workspace and cannot host Hosted Agents. 
The two surfaces look superficially similar in the portal but are distinct Azure resource types with different APIs and feature sets. If a tutorial, sample, or piece of Terraform you find online creates an azurerm_ai_foundry Hub, it is targeting the classic surface and the Hosted Agents APIs ( /agents , agent versions, traffic split, dedicated endpoints) will not be available against it. To use Hosted Agents you must provision a new Foundry account + project as described above. There is no in-place upgrade from a Hub. What Foundry gives you A Foundry project is more than a container. Out of the box it provides: A model catalog and deployment surface — Azure OpenAI models (GPT-4.1, GPT-4o, o-series, embeddings), plus open and partner models, all deployed and invoked through the same project endpoint with the same auth model. Two agent execution modes — prompt-based agents (defined entirely by instructions + tool configuration in the portal, suitable for conversational assistants) and Hosted Agents (your own containerized code, the subject of this post). A managed Toolbox — a project-level MCP endpoint that exposes Foundry-curated tools (Code Interpreter, Web Search, Azure AI Search, OpenAPI, custom MCP, A2A) with consolidated auth. Hosted Agent code connects to the Toolbox using standard MCP client libraries. First-class evaluation — datasets, graders (similarity, LLM-as-judge, safety, groundedness), and evaluation runs as a built-in concept, not a bolt-on. Built-in tracing — OpenTelemetry traces from agents land in a linked Application Insights resource automatically. No manual instrumentation needed to get the basics. Per-agent identity — when you deploy a Hosted Agent, the platform creates a dedicated Microsoft Entra ID (agent identity) for it and gives it a dedicated endpoint. RBAC to downstream resources is granted to that identity. 
How the pieces line up for Hosted Agents

For the rest of this post, the mental model is:

    Resource group
    └── Foundry account (Cognitive Services, kind=AIServices)
        ├── Model deployments (e.g. gpt-4.1)
        └── Foundry project
            ├── Hosted Agent: customer-support
            │   ├── Version v1 (image digest A, 100% traffic)
            │   └── Version v2 (image digest B, 0% traffic, canary)
            ├── Hosted Agent: webhook-handler
            ├── Evaluations
            ├── Connections (ACR, AI Search, Key Vault…)
            └── Toolbox (MCP)

Terraform provisions the account, project, model deployments, ACR, App Insights, and RBAC. Hosted Agents (images, versions, traffic weights) are managed through azd or the Foundry SDK. That boundary is what the rest of this post automates.

The minimal Terraform shape

Hosted Agents need the new-model shape (account plus project), not the Hub. The skeleton below is the minimum that lets you deploy a Hosted Agent on top of it. Storage, Key Vault, monitoring, networking, and OIDC for CI live alongside it; for more details, see Infrastructure as Code for AI: Building and Deploying Microsoft Hosted Agents with Terraform | Microsoft Community Hub.
    # Foundry account (new model, required for Hosted Agents)
    resource "azurerm_cognitive_account" "foundry" {
      name                       = "ai-${local.name}"
      resource_group_name        = azurerm_resource_group.main.name
      location                   = azurerm_resource_group.main.location
      kind                       = "AIServices"
      sku_name                   = "S0"
      project_management_enabled = true
      custom_subdomain_name      = "ai-${local.name}" # required for AAD auth

      identity {
        type = "SystemAssigned"
      }
    }

    # Model deployment the agent will call
    resource "azurerm_cognitive_deployment" "gpt" {
      name                 = "gpt-4.1" # stable name: agents pin to this
      cognitive_account_id = azurerm_cognitive_account.foundry.id

      model {
        format  = "OpenAI"
        name    = "gpt-4.1"
        version = "2025-04-14"
      }

      sku {
        name     = "GlobalStandard"
        capacity = 10
      }
    }

    # Foundry project: the scope for Hosted Agents, evals, conversations
    resource "azurerm_cognitive_account_project" "main" {
      name                 = "proj-${local.name}"
      cognitive_account_id = azurerm_cognitive_account.foundry.id
      location             = azurerm_resource_group.main.location

      identity {
        type = "SystemAssigned"
      }
    }

    # Container registry the agent image is pushed to and pulled from
    resource "azurerm_container_registry" "acr" {
      name                = "acr${replace(local.name, "-", "")}"
      resource_group_name = azurerm_resource_group.main.name
      location            = azurerm_resource_group.main.location
      sku                 = "Standard"
      admin_enabled       = false # use RBAC, not admin user
    }

    # The project's managed identity needs to pull the agent image
    resource "azurerm_role_assignment" "project_acr_pull" {
      scope                = azurerm_container_registry.acr.id
      role_definition_name = "AcrPull" # use Container Registry Repository Reader if the ACR has ABAC enabled
      principal_id         = azurerm_cognitive_account_project.main.identity[0].principal_id
    }

A few things worth calling out:

kind = "AIServices" + project_management_enabled = true + custom_subdomain_name are what make this a new-model Foundry account.
Omit project_management_enabled and azurerm_cognitive_account_project will not provision; omit custom_subdomain_name and you lose the Foundry endpoint shape that Entra-authenticated access depends on. azurerm_cognitive_account_project is the new-Foundry project resource. Do not use azurerm_ai_foundry_project — that targets the Hub model and does not host agents. Keep the model deployment name stable. Agent code (and your agent.yaml ) pins to the deployment name, not the model version. Changing the version is safe; changing the name forces a new agent version. The project MI needs ACR pull, not push. CI pushes the image (via its own identity); the platform pulls it on the project's behalf when the agent runs. ABAC-enabled ACR is supported but requires --source-acr-auth-id [caller] on az acr build in your CI script — a common gotcha. A note on the provider. Everything above uses the hashicorp/azurerm provider. Foundry's surface evolves quickly, and you will occasionally hit a property or child resource that AzureRM hasn't caught up with yet — project connections, capability hosts, and some newer agent-related fields are common examples. When that happens, reach for azure/azapi: use azapi_update_resource to patch a missing property on an AzureRM-owned resource, and azapi_resource for resources AzureRM doesn't model at all. Keep AzureRM as the default and use AzAPI as a targeted gap-filler, so you don't fork ownership of mainstream resources. The Hosted Agent Delivery Loop A working delivery loop has five stages. Each maps to a specific artifact, a specific tool, and a specific failure mode. 
| Stage | Artifact | Tool | Primary failure mode |
| Infra provisioning | Terraform state | terraform apply | Quota, RBAC propagation, ACR not reachable |
| Image build & push | OCI image in ACR (ACR must remain publicly reachable today) | docker build / az acr build | Image too large, base image CVEs |
| Agent version create | Immutable version (image digest + config) | azd or Foundry SDK | Bad env var, wrong protocol declared |
| Evaluation | Eval dataset + grader | Foundry evaluators | Quality / safety regression |
| Traffic shift & observe | Version weights, App Insights traces | Foundry SDK + Azure Monitor | Silent quality decay, sandbox over/under-sizing |

The first stage is where the prior post left off. The remaining four are this post.

Infra provisioning assumes the standard pattern: terraform plan runs on every PR as a review gate (posted as a PR comment) and terraform apply runs only on merge to the environment branch. Everything below assumes the platform is already applied.

Repository Shape

A repository that supports the loop end-to-end looks roughly like this:

    agent-platform/
    ├── infra/                          # Terraform from the prior post (AIServices + project)
    │   ├── modules/foundry-project/
    │   └── environments/
    │       ├── dev.tfvars
    │       ├── staging.tfvars
    │       └── prod.tfvars
    ├── agents/
    │   ├── customer-support/
    │   │   ├── Dockerfile
    │   │   ├── src/                    # Agent code (Python or C#)
    │   │   ├── agent.yaml              # Version spec: image, cpu/memory, protocols, env
    │   │   ├── evals/
    │   │   │   ├── dataset.jsonl
    │   │   │   └── graders.yaml
    │   │   └── README.md
    │   └── webhook-handler/
    │       └── ...
    ├── scripts/
    │   ├── deploy_agent_version.py     # Build → push → create version → optional weight shift
    │   ├── run_evals.py
    │   └── promote_version.py          # Shifts traffic between versions
    └── .github/workflows/
        ├── infra.yml                   # Terraform plan/apply
        ├── agent-pr.yml                # Build, push to ACR, deploy candidate version, run evals
        └── agent-release.yml           # Promote a tested version to staging / prod

Two deliberate choices.
First, infrastructure and agents live in the same repo but in separate top-level directories with separate pipelines. They have different cadences and different reviewers. Second, each agent is its own folder with its own Dockerfile , code, version spec, and eval suite. A single PR touches one agent's directory cleanly; a code-review diff stays focused. The Agent Version as the Deploy Unit A Hosted Agent is deployed as a version. A version is immutable — once created it captures: the container image digest (not just the tag — the digest, so it cannot drift), CPU and memory allocation for the per-session sandbox (e.g. 1 vCPU / 2 GiB), the container protocols the image implements — responses , invocations , or both, environment variables passed to the container at runtime, any other version-scoped configuration (e.g. base model deployment name). The container's container_protocol_versions only declares responses and/or invocations — the two protocols the container itself implements. A2A (preview) is surfaced by the platform on its own endpoint path, and Activity is bridged from Responses automatically when the agent is published to a Microsoft 365 channel. Under the hood, agent versions run on Azure Container Apps with VM-isolated sandboxes, which is also why you may see the term revision in some Container Apps–surfaced APIs and limits — a Hosted Agent version corresponds to one such revision. To change any of those, you create a new version. The platform keeps the old one and shifts traffic between them by weight. This is the primitive you use for canary rollouts and for rollback — both reduce to a traffic-weight change, not a redeploy. 
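A minimal Python sketch of the immutability idea described above. The AgentVersionSpec type and its fingerprint method are hypothetical illustrations, not part of the Foundry SDK. The spec is frozen, so "changing" a field means building a new object, and identical parameters always produce the same fingerprint.

```python
import hashlib
import json
from dataclasses import asdict, dataclass, replace


@dataclass(frozen=True)
class AgentVersionSpec:
    """Hypothetical local model of what one agent version captures."""
    image_digest: str
    cpu: str
    memory: str
    protocols: tuple
    env: tuple = ()  # (name, value) pairs; tuples keep the spec immutable

    def fingerprint(self) -> str:
        # Stable short hash over every version-scoped parameter.
        payload = json.dumps(asdict(self), sort_keys=True)
        return hashlib.sha256(payload.encode("utf-8")).hexdigest()[:12]


v1 = AgentVersionSpec(
    image_digest="sha256:aaa",          # placeholder digest
    cpu="1",
    memory="2Gi",
    protocols=("responses",),
    env=(("LOG_LEVEL", "info"),),
)

# A new image means a new spec object; v1 itself is never mutated.
v2 = replace(v1, image_digest="sha256:bbb")

print(v1.fingerprint() != v2.fingerprint())  # different parameters, different version
```

Recording a fingerprint like this next to the version ID is one possible way to make "which parameters was this version built from" answerable from a pipeline log.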
An agent.yaml per agent makes the version reproducible from source:

    # agents/customer-support/agent.yaml
    name: customer-support
    container:
      image: ${ACR_LOGIN_SERVER}/customer-support  # digest resolved at deploy time
      cpu: 1
      memory: 2Gi
    protocols:  # container_protocol_versions
      - responses
      # add `invocations` here if the container also handles webhook-style payloads
    env:
      # The platform automatically injects FOUNDRY_PROJECT_ENDPOINT,
      # AZURE_AI_MODEL_DEPLOYMENT_NAME, and APPLICATIONINSIGHTS_CONNECTION_STRING.
      # You only set what's specific to your agent.
      LOG_LEVEL: info
    metadata:
      owner: support-team
      source_commit: ${GITHUB_SHA}

scripts/deploy_agent_version.py is the executable form of this spec. Its job per agent is:

Build the container image ( docker build locally, or az acr build server-side for ABAC ACRs).
Push to ACR and capture the resulting image digest, not the :latest tag.
Resolve environment variables from the target environment's config.
Call the Foundry SDK to create a new agent version pinned to that digest.
Emit a deployment-manifest.json containing the agent name, version ID, image digest, source commit SHA, and the eval dataset hash used.

One gotcha: the platform deduplicates. A create version call with no change to the version parameters (same image digest, same env, same CPU/memory, same protocols) will not produce a new version object. Write the script to treat "no new version returned" as success and reuse the existing version ID in the manifest, not as a failure to retry.

That manifest is the cross-pipeline contract. PR pipelines produce one. Promotion pipelines consume one. Rollback consumes a previous one.

Evaluation as a Release Gate

Foundry ships evaluators (datasets, graders, evaluation runs) as a first-class platform feature. Whether to block a release on their results is a team decision, not a platform mandate, but it is the recommended pattern for any agent serving real users.
A pipeline that promotes an agent because the image built, the container started, and the version was created with HTTP 200 will eventually ship a regression that an integration test cannot catch. Treat the eval suite the way you treat unit tests: failures stop the pipeline.

A minimal but honest evaluation setup has three pieces.

A reference dataset. Twenty to fifty representative scenarios is enough to start. Each row is an input plus either a reference answer, a set of must-include facts, or a rubric. Store as JSONL alongside the agent:

    {"id":"refund-1","input":"How do I get a refund for order 12345?","must_include":["return window","14 days","original payment method"]}
    {"id":"escalate-1","input":"This is the third time my package is late.","rubric":"Agent should acknowledge, apologize, offer escalation, not promise compensation."}

Graders. Foundry's evaluators library ships templates: exact match, similarity, LLM-as-judge for rubric scoring, and built-in safety and groundedness graders. Pick what matches your dataset shape. LLM-as-judge is the workhorse for open-ended responses; pin its model deployment explicitly so the grader itself does not drift between runs.

Thresholds. Decide what "passing" means before the first run. A common pattern:

Hard floor on safety / groundedness: any regression fails the build.
Relative threshold on quality: no more than X% drop versus the last known-good version.
Absolute floor on must-include coverage: for example ≥ 90%.
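The three threshold rules above can be sketched as a small gate function. This is an illustration under stated assumptions, not Foundry's evaluator API: the EvalScores type, the gate function, and the 5% default for the relative quality drop are all hypothetical, and scores are assumed to be normalised to the 0..1 range with higher being better.

```python
from dataclasses import dataclass


@dataclass
class EvalScores:
    """Hypothetical aggregate scores for one evaluation run (0..1, higher is better)."""
    safety: float
    groundedness: float
    quality: float
    must_include_coverage: float


def gate(candidate, baseline, max_quality_drop=0.05, min_coverage=0.90):
    """Return a list of failure reasons; an empty list means the gate passes."""
    failures = []
    # Hard floor: any safety or groundedness regression fails the build.
    if candidate.safety < baseline.safety:
        failures.append("safety regressed")
    if candidate.groundedness < baseline.groundedness:
        failures.append("groundedness regressed")
    # Relative threshold: quality may drop at most max_quality_drop vs. baseline.
    if baseline.quality > 0:
        drop = (baseline.quality - candidate.quality) / baseline.quality
        if drop > max_quality_drop:
            failures.append(f"quality dropped {drop:.1%}")
    # Absolute floor on must-include coverage.
    if candidate.must_include_coverage < min_coverage:
        failures.append("must-include coverage below floor")
    return failures


baseline = EvalScores(safety=0.99, groundedness=0.95, quality=0.80, must_include_coverage=0.95)
candidate = EvalScores(safety=0.99, groundedness=0.96, quality=0.70, must_include_coverage=0.92)

failures = gate(candidate, baseline)
print(failures or "gate passed")
```

A script such as run_evals.py could exit non-zero whenever the returned list is non-empty, which is what turns the evaluation from a dashboard into a blocking check.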
Wire it into the PR pipeline:

    # .github/workflows/agent-pr.yml (excerpt)
    - name: Build, push, and create candidate version
      run: |
        python scripts/deploy_agent_version.py \
          --agent customer-support \
          --project $EVAL_PROJECT \
          --version-suffix pr-${{ github.event.number }} \
          --traffic 0  # create the version, do not route traffic yet

    - name: Run evaluations against candidate endpoint
      run: |
        python scripts/run_evals.py \
          --agent customer-support \
          --version pr-${{ github.event.number }} \
          --baseline last-known-good \
          --fail-on-regression

The PR creates a candidate version with zero traffic weight against a long-lived "eval" Foundry project, runs evaluations against the candidate version's dedicated endpoint, and then deletes the candidate version on PR close. A standing eval project beats a per-PR Foundry project: provisioning a project per PR is slow and adds RBAC overhead that does not earn its keep.

Environment Promotion

Three environments is the floor: dev , staging , prod . Each is its own Foundry project, ideally its own Foundry account in its own resource group. What promotes between them is the image digest and the version spec, not source code, and not "redeploy from main."

A workable model:

dev: every push to a feature branch builds an image and creates a dev version. Loose evaluation thresholds. Used for human poking and end-to-end debugging.
staging: merges to main create a staging version. Full eval suite, strict thresholds. Same sandbox sizing, same env vars, same protocols as prod.
prod: manually approved promotion from staging. Promotion script reads the staging manifest, finds the image digest that passed, and creates the prod version pointed at that exact digest. No rebuild.

The "same digest" rule is the recommended pattern for safe promotion. If staging passed evaluations on customer-support@sha256:abc… running gpt-4.1 , prod should get that exact image.
Re-building from main in the prod pipeline reintroduces the risk you spent staging trying to eliminate (a different base-image patch level, a different transitive dependency, a different build clock) even though nothing in your source changed.

GitHub Actions environments make the approval concrete:

    jobs:
      promote-prod:
        needs: deploy-staging
        environment: production  # requires reviewer approval
        runs-on: ubuntu-latest
        steps:
          - name: Create prod version from staging manifest
            run: |
              python scripts/deploy_agent_version.py \
                --agent customer-support \
                --project $PROD_PROJECT \
                --from-manifest staging-manifest.json \
                --traffic 10  # canary at 10%

The canary weight is the second half of safe promotion: create the prod version, give it a small fraction of traffic, watch the App Insights traces, then shift the rest with promote_version.py .

Traffic-Split Rollout and Instant Rollback

Weighted version traffic changes the rollback model entirely. Rollback typically avoids rebuilding or redeploying artifacts: the previous version is still there, ready to take traffic.

A typical canary flow:

Create new version v42 at 0% traffic. Endpoint exists; no production calls reach it.
Shift to 10%. Observe for an hour or a day, depending on traffic volume.
Shift to 50%, then 100%. Old version stays at 0% but is not deleted.
After a stability window (commonly a week), delete the previous version to free quota.

Rollback is the reverse: shift weights back to the previous version. It is a control-plane call, not a deploy. The agent's endpoint URL does not change, sessions in flight continue on whichever version they started on, and new sessions land on whatever the weights say.

Two consequences worth internalizing:

Keep at least the last two known-good versions live. Rollback is only as fast as your ability to flip weights to a version that already exists.
Do not skip the canary step under deadline pressure. A 0%→100% cutover gives you the same blast radius as a non-canaried deploy.
The platform supports incremental rollout; use it. For a destructive change — a removed protocol, a renamed agent, an env var the previous version cannot tolerate — rollback may not be safe. Forward-fix is the answer. Identify those changes in PR review and require an explicit "rollback path: forward-fix" note in the PR. Handling Model Version Changes A model deployment bump is the highest-blast-radius runtime change you can make to a Hosted Agent: the agent's behaviour on every input can shift. Treat it like a dependency upgrade. Open a PR that changes only the AZURE_AI_MODEL_DEPLOYMENT_NAME (or the model version on the deployment, via Terraform). Build a new image if needed, create a new agent version, run the full eval suite at 0% traffic. Run a larger regression dataset if you have one. Require a human reviewer who is not the PR author. Promote through staging, then canary in prod for at least one business day before shifting full traffic. If the new model is faster or cheaper, the temptation is to skip steps. Don't. A quality regression in prod almost always costs more than a careful upgrade. The Terraform side is small: openai_model_version is a variable on the azurerm_cognitive_deployment . Terraform recreates the deployment if the version changes. The Hosted Agent picks up the new deployment the next time it calls the model — if you kept the deployment name stable, which is your contract with the agent code. If you change the deployment name as well, the agent needs a new version that knows the new name. Observability That Actually Tells You Something The platform injects an Application Insights connection string into every Hosted Agent container as an environment variable. Agents that use the protocol libraries emit OpenTelemetry traces by default. That gives you per-request latency, token counts, tool invocations, and conversation IDs out of the box. That is the floor. Add to it: Custom span attributes on every request. 
Agent name, agent version ID, image digest (short), model deployment name. Without these, post-incident analysis cannot tell you which version was live when a problem started — especially during a traffic-split rollout where two versions are serving simultaneously. Quality signal capture. Sample a percentage of production conversations into a queue for offline grading. Run the same graders you used in CI against that sample on a schedule. This is your drift detector for response quality. Sandbox right-sizing signals. Hosted Agents bill on the CPU/memory you allocate per session. Oversizing multiplies cost by your concurrency. Track CPU and available memory inside the sandbox and compare against the version's allocation — if peaks stay below ~50%, the next version should drop a tier; if they push above ~70%, raise it. Right-sizing is a per-version decision because versions are immutable. Per-version error and latency. Slice every standard metric by version ID. A canary that looks fine in aggregate can be quietly worse than the previous version on specific request shapes. Cost dimensions. Tag traces with customer_id or tenant_id if you have multi-tenancy. Aggregating session cost by tenant in App Insights is straightforward once the dimension is on the span. Alerts on shape, not just rate. A doubling in average response length or a sudden drop in tool invocation frequency often precedes a quality regression that error-rate alerts will miss entirely. A weekly "agent health" report in your team channel — pulling these App Insights queries together — beats a perfect dashboard nobody opens. A Pragmatic Maturity Path Most teams cannot build the whole loop on day one. A reasonable order: Infrastructure in Terraform. AIServices account, project, model deployment, ACR, App Insights, role assignment so the project MI can pull from ACR. First agent deployed manually with azd . Just to prove the round trip end to end. 
agent.yaml plus a deploy script that builds, pushes by digest, and creates a version. One environment. Three environments with manual promotion by manifest. A 20-row eval dataset with one grader, run on every PR. Advisory only at first. Eval as a blocking gate. Thresholds tuned from the advisory phase. Canary rollout via traffic split. Versions held live for a stability window before deletion. Production sampling into offline evaluation. Drift detection. Model version upgrade playbook. Documented, exercised once on a low-risk agent. Tested rollback via weight shift. The first time you discover a rollback bug should not be during an incident. Each step is independently useful. Skipping ahead — particularly to step 6 without time in step 5 — produces thresholds that block legitimate changes and erode trust in the pipeline. Where This Is Heading The platform is moving. A few things to watch as you build: Declarative Hosted Agent versions in Terraform. AzureRM coverage of Hosted Agents and agent versions is expanding. Parts of the deploy script will collapse into Terraform as that lands. The script-driven approach in this post is the bridge, not the destination. Continuous evaluation as a first-class platform feature. Sampling production traffic into scheduled evals — what you wire by hand today — is moving into the Foundry control plane. Multi-agent composition over A2A. As the A2A endpoint moves from preview to general availability and more frameworks ship A2A clients, multi-agent workflows become a first-class deployment shape. The DevOps loop extends — version pinning between agents, eval at the workflow level, observability across the agent graph — but the manifest grows accordingly. Toolbox-managed tool surfaces. As more tool integrations move behind the project Toolbox MCP endpoint, the agent image gets smaller and the tool configuration becomes a project-level concern. That changes what belongs in agent.yaml versus what belongs in Terraform. 
The throughline: the more the platform absorbs, the more your job shifts from wiring plumbing to defining policy. That means deciding what "good" means for your agent, what the quality floor is, who can approve a model upgrade, and how fast you can roll back. Those decisions do not get automated away. The pipeline just makes them executable.
Conclusion
Terraform provisions the Foundry project, model deployment, ACR, and observability. The DevOps loop on top of it (container builds pinned by digest, immutable agent versions, evaluation as a release gate, manifest-driven promotion across environments, traffic-split canary and rollback, and observability sliced by version) gets Hosted Agents to production and keeps them there.
Build it incrementally. Treat the image digest and the version spec as the deploy artifact, not the source branch. Make evaluation a check the pipeline cares about. Use version weights as your rollout and rollback primitive. And design for the day the platform absorbs the next layer of plumbing, so that when it does, your work moves up the stack instead of getting thrown away.
Claude Code on Microsoft Foundry in VS Code: A Practical Setup Guide (with the gotchas)
Enables enterprise-grade governance without changing your developer workflow. The official Microsoft Learn article (Configure Claude Code for Microsoft Foundry) gets you ~80% of the way there. The remaining 20%—VS Code settings shape, tenant mismatches, and configuration conflicts like "baseURL and resource are mutually exclusive"—is where most setups fail in practice. This guide walks the full path end-to-end, with the exact JSON that validates, working CLI configuration, and a troubleshooting matrix based on real-world failures. This guidance is based on repeated customer deployments and internal testing across both CLI and VS Code scenarios. TL;DR Setup - Deploy claude-sonnet-4-6 (optionally Haiku + Opus) in a supported region - Grant Cognitive Services User + Foundry User - az login --tenant <tenant> , then launch VS Code via code . Config - CLI: - CLAUDE_CODE_USE_FOUNDRY=1 - ANTHROPIC_FOUNDRY_RESOURCE=<name> - Do NOT set ANTHROPIC_FOUNDRY_BASE_URL at the same time - VS Code: - Use [{ "name": "...", "value": "..." }] format Validate - claude → /status - Expect: API provider: Microsoft Foundry Why run Claude Code on Foundry? Anthropic's Claude Code is a top-tier agentic coding assistant. Running it through Microsoft Foundry instead of Anthropic's public API gives you: Data residency & compliance: prompts and completions stay inside your Azure tenant. Entra ID auth: no API keys to rotate; centralized RBAC. Private networking: works behind VNets/Private Endpoints. Unified billing & quotas: usage shows up on your Azure invoice and in Foundry monitoring. Same model, same CLI, enterprise-grade plumbing underneath. 
Prerequisites checklist Requirement How to verify Azure subscription with pay-as-you-go billing az account show Foundry resource in supported regions Check your region's model availability in Foundry portal Contributor/Owner on the resource group (for deployments) Azure Portal → IAM Cognitive Services User + Foundry User on the resource (for invoking) Azure Portal → IAM Azure CLI installed and logged in az --version , az login Claude Code CLI installed claude --version VS Code (current) with the Anthropic Claude Code extension Help → About Windows only: Git Bash (from Git for Windows) or WSL2 — Claude Code's runtime requires a POSIX shell bash --version in Git Bash / WSL ⚠️ Claude models in Foundry are currently available in select regions. Check the Foundry portal model catalog for your region's availability (commonly East US 2 and Sweden Central). Step 1 — Deploy the Claude models Claude Code uses three model roles, and it expects a deployment for each: Role Default deployment name Used for Primary claude-sonnet-4-6 general coding (balanced) Fast claude-haiku-4-5 quick edits, file reads Extended thinking claude-opus-4-6 complex reasoning Deploy at least Sonnet to get started. Add Haiku and Opus when you need them — Claude Code will route automatically. If a role-specific model isn't deployed, Claude Code may fall back or fail depending on the task. Deployment names in this guide follow the current Claude 4.x naming exposed in Foundry. Exact versions change over time — check the Foundry model catalog in your region for what's currently available. Foundry Portal: AI Foundry → your project → Build → Models + endpoints → + Deploy model → pick the Anthropic Claude model → Global Standard deployment → name it exactly as above (or remember the name to override later). 
To discover the current model version before deploying (replace eastus2 with your Foundry region): az cognitiveservices model list -l eastus2 ` --query "[?contains(model.name,'claude')].{name:model.name, version:model.version, format:model.format}" -o table Azure CLI: az cognitiveservices account deployment create ` --name <foundry-resource> ` --resource-group <rg> ` --deployment-name claude-sonnet-4-6 ` --model-name claude-sonnet-4-6 ` --model-version <version> ` --model-format Anthropic ` --sku-name GlobalStandard ` --sku-capacity 1 ✍️ Figure 1: Foundry portal “Models + endpoints” showing the three Claude deployments. Step 2 — Grant yourself the right roles This is the #1 source of silent failures. You need both: Role Role ID Purpose Cognitive Services User a97b65f3-24c7-4388-baec-2e87135dc908 data-plane invocation Foundry User (formerly Azure AI User) 53ca6127-db72-4b80-b1b0-d745d6d5456d Foundry-native permissions $me = az ad signed-in-user show --query id -o tsv $scope = az cognitiveservices account show -n <foundry-resource> -g <rg> --query id -o tsv # Use role IDs — rename-proof (works whether the display name is "Azure AI User" or "Foundry User") az role assignment create --assignee $me --role a97b65f3-24c7-4388-baec-2e87135dc908 --scope $scope # Cognitive Services User az role assignment create --assignee $me --role 53ca6127-db72-4b80-b1b0-d745d6d5456d --scope $scope # Foundry User (formerly Azure AI User) The Foundry RBAC rename (Azure AI User → Foundry User) is rolling out; both role names map to the same role definition (same role ID), depending on tenant rollout state. Use whichever role name your tenant exposes — or use the role IDs above to avoid ambiguity. Step 3 — Install the Claude Code CLI Use the official installer from Anthropic (auto-updates in the background): irm https://claude.ai/install.ps1 | iex claude --version If claude isn't on PATH, restart your shell. The installer drops it under %USERPROFILE%\.local\bin . 
Step 4 — Sign in to the right tenant If your Foundry resource lives in a tenant different from your default, an az login to the wrong tenant produces the cryptic error: ValueError: Unable to get authority configuration for https://login.microsoftonline.com/<bad-guid>. Authority would typically be in a format of https://login.microsoftonline.com/your_tenant Fix: az login --tenant <foundry-tenant-guid> az account set --subscription <foundry-subscription-guid> az account show # confirm tenant & subscription 💡 You can list every tenant you have access to with: az account list --query "[].{name:name, tenantId:tenantId}" -o table Step 5 — Configure the CLI Set these in the same PowerShell session you'll launch claude from: $env:CLAUDE_CODE_USE_FOUNDRY = "1" $env:ANTHROPIC_FOUNDRY_RESOURCE = "<your-foundry-resource-name>" # Optional: only if your deployment names differ from the defaults $env:ANTHROPIC_DEFAULT_SONNET_MODEL = "claude-sonnet-4-6" $env:ANTHROPIC_DEFAULT_HAIKU_MODEL = "claude-haiku-4-5" $env:ANTHROPIC_DEFAULT_OPUS_MODEL = "claude-opus-4-6" To make them persistent: setx CLAUDE_CODE_USE_FOUNDRY 1 (and so on), then sign out and back in (or restart Explorer). GUI apps like VS Code launched from the Start menu only pick up new user-env vars after the user session refreshes — opening a fresh terminal isn't enough. 🚫 The "mutually exclusive" trap API Error: baseURL and resource are mutually exclusive You'll hit this if you set both ANTHROPIC_FOUNDRY_RESOURCE and ANTHROPIC_FOUNDRY_BASE_URL . Pick one: Most users → ANTHROPIC_FOUNDRY_RESOURCE=<name> (Claude Code builds the URL). Custom subdomain / private endpoint → use ANTHROPIC_FOUNDRY_BASE_URL instead. Step 6 — Verify the CLI claude > /status Expected output: API provider: Microsoft Foundry Microsoft Foundry base URL: https://<resource>.services.ai.azure.com/anthropic Microsoft Foundry resource: <resource> Model: Default (claude-sonnet-4-6) ✍️ Figure 2: /status output confirming API provider: Microsoft Foundry . 
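The Step 5 settings can be sanity-checked before launching claude. The sketch below is illustrative Python and not part of Claude Code: the helper name `check_foundry_env` is hypothetical, and it only inspects the three environment variables named in this guide. It makes no call to Azure or to the CLI.

```python
import os
from typing import List, Mapping


def check_foundry_env(env: Mapping[str, str]) -> List[str]:
    """Return a list of problems found; an empty list means the config looks consistent."""
    problems: List[str] = []

    # The Foundry provider is only selected when this flag is set to "1".
    if env.get("CLAUDE_CODE_USE_FOUNDRY") != "1":
        problems.append("CLAUDE_CODE_USE_FOUNDRY is not set to 1")

    has_resource = bool(env.get("ANTHROPIC_FOUNDRY_RESOURCE"))
    has_base_url = bool(env.get("ANTHROPIC_FOUNDRY_BASE_URL"))

    if has_resource and has_base_url:
        # This is the combination that triggers the error described above.
        problems.append(
            "ANTHROPIC_FOUNDRY_RESOURCE and ANTHROPIC_FOUNDRY_BASE_URL "
            "are mutually exclusive; unset one"
        )
    elif not (has_resource or has_base_url):
        problems.append(
            "set either ANTHROPIC_FOUNDRY_RESOURCE or ANTHROPIC_FOUNDRY_BASE_URL"
        )
    return problems


if __name__ == "__main__":
    issues = check_foundry_env(os.environ)
    for issue in issues:
        print("WARN:", issue)
    print("OK" if not issues else f"{len(issues)} issue(s) found")
```

Run it from the same shell that will launch claude, since that is the environment the CLI inherits. A clean result only means the variables are consistent; /status remains the real confirmation.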
If you instead see "Anthropic" or it prompts for an Anthropic login, CLAUDE_CODE_USE_FOUNDRY isn't being inherited — see troubleshooting below. Step 7 — Configure the VS Code extension Install Claude Code from the VS Code Marketplace (publisher: Anthropic). Open user settings.json ( Ctrl+Shift+P → Preferences: Open User Settings (JSON)) and add: "claudeCode.environmentVariables": [ { "name": "CLAUDE_CODE_USE_FOUNDRY", "value": "1" }, { "name": "ANTHROPIC_FOUNDRY_RESOURCE", "value": "<your-foundry-resource-name>" } ] 🪤 Schema gotcha. The MS Learn doc currently shows this as a plain {KEY: VALUE} object under the UI label "Claude Code: Environment Variables" . In recent extension versions the actual JSON key is claudeCode.environmentVariables and the value must be an array of {name, value} objects. If you paste the doc's snippet verbatim, VS Code will flag "Missing property name", "Colon expected", "Unknown configuration setting". Use the array form above. Make the extension see your az login The extension inherits environment & credentials from the process that launches VS Code. After az login : # In the same PowerShell where az login succeeded: code . If VS Code was already running, fully quit it (not just close the window) and relaunch from the terminal. Developer: Reload Window is not enough to refresh inherited Azure CLI credentials. ✍️ Figure 3: settings.json with the claudeCode.environmentVariables array form. Step 8 — Try it In VS Code, click the Claude Code (Spark) icon in the sidebar to open the panel. Type: Summarize the structure of this project. You should get a response within a few seconds, and the panel should indicate it's routing through Microsoft Foundry. Run /status inside the panel to confirm API provider: Microsoft Foundry if you want certainty. ✍️ Figure 4: Claude Code panel in VS Code responding through Microsoft Foundry. 
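If the variables from Step 7 are already kept as a plain {KEY: VALUE} mapping, converting them to the array form is mechanical. The following is a small Python sketch under that assumption. `to_env_array` is a hypothetical helper name and the resource name is a placeholder. Only the setting key and the {name, value} shape come from this guide.

```python
import json
from typing import Dict, List


def to_env_array(env: Dict[str, str]) -> List[Dict[str, str]]:
    """Convert {"KEY": "VALUE"} into [{"name": "KEY", "value": "VALUE"}, ...]."""
    return [{"name": key, "value": str(value)} for key, value in env.items()]


# The object shape that VS Code rejects for this setting.
plain = {
    "CLAUDE_CODE_USE_FOUNDRY": "1",
    "ANTHROPIC_FOUNDRY_RESOURCE": "<your-foundry-resource-name>",
}

# The array shape described in the schema gotcha, ready to paste into settings.json.
settings_fragment = {"claudeCode.environmentVariables": to_env_array(plain)}
print(json.dumps(settings_fragment, indent=2))
```

The printed fragment goes inside the existing top-level object of settings.json. It is not a complete settings file on its own.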
Troubleshooting matrix Symptom Where it shows up Likely cause Fix API Error: baseURL and resource are mutually exclusive claude CLI on first request Both ANTHROPIC_FOUNDRY_BASE_URL and ANTHROPIC_FOUNDRY_RESOURCE set Unset one. Prefer ANTHROPIC_FOUNDRY_RESOURCE . Unable to get authority configuration for https://login.microsoftonline.com/<guid> claude CLI startup or VS Code panel Wrong tenant ID in az login az login --tenant <correct-guid> ; verify with az account show Failed to get token from azureADTokenProvider: ChainedTokenCredential authentication failed VS Code Claude Code panel Extension didn't inherit az login session Quit VS Code entirely; relaunch with code . from the authed shell Token tenant does not match resource tenant claude CLI or VS Code panel CLI logged into a different tenant than the Foundry resource az login --tenant <foundry-tenant> The model <name> is not available on your foundry deployment claude CLI first use or VS Code model selector Deployment name mismatch Either rename the Foundry deployment, or set ANTHROPIC_DEFAULT_*_MODEL to the actual name 401 / 403 on first request claude CLI or VS Code panel Missing RBAC on the resource Assign Cognitive Services User and Foundry User on the resource scope Claude Code prompts for Anthropic login VS Code Claude Code panel CLAUDE_CODE_USE_FOUNDRY not set in the process Set the env var before launching claude / code . 
VS Code shows "Unknown Configuration Setting" for claudeCode.environmentVariables VS Code Settings tab Wrong JSON shape Use the array of {name,value} objects form 429 Too Many Requests claude CLI or VS Code panel TPM/RPM exhausted Foundry portal → Operate → Quotas; request increase or reduce parallelism Works in CLI, fails in VS Code extension VS Code Claude Code panel only Env vars set per-shell, not visible to GUI VS Code Use setx (persistent user env) or move them into claudeCode.environmentVariables "Model is not available in region" Foundry portal model deployment step Foundry resource not in a supported region Deploy a new Foundry resource in a supported region, or check model availability Best practices Auth & secrets - Prefer Entra ID over API keys. If you must use a key for CI, store it as a secret (GitHub Actions secret, Key Vault) — never in settings.json (it may sync via Settings Sync). - Scope RBAC at the resource level, not the subscription, for least privilege. Project context - Create a CLAUDE.md at your repo root with stack, conventions, and entry-point commands. Claude Code reads it automatically and the quality jump is significant. - Use .claude/rules/*.md for per-area rules (e.g., test conventions, security rules). Cost & latency - Let Claude Code's auto-routing pick the right role (Sonnet/Haiku/Opus). Don't pin everything to Opus. - Cap context with ANTHROPIC_MAX_TOKENS if you have a strict budget. (Note: not honored by every Claude Code version — check the Claude Code docs for your version.) - Watch token spend in Foundry → Operate → Metrics weekly. Reliability - For team use, deploy all three model roles even if you don't think you need them — silent role-routing failures are confusing. - Tag your Foundry resource ( env=dev|prod , team=... ) for chargeback. Reproducibility - Document the exact env vars and az login --tenant GUID in your team README. 
- Pin Claude Code CLI version in onboarding docs ( claude --version ) so new joiners hit the same behavior.
A note on the MS Learn doc
The doc is accurate but skips three things that caused the most friction in real-world deployments:
- VS Code extension settings shape: the example uses the UI label as a JSON key and an object instead of the array form the schema actually expects.
- Process inheritance: it says "set the env vars" but doesn't emphasize that the VS Code window must be launched from a shell where both az login and the env vars are live. Reloading the window doesn't help.
- Mutually exclusive RESOURCE vs BASE_URL: listed in passing, but the error message only appears at first request, after you think everything is configured.
If the Microsoft Learn page is updated, treat this post as a companion: same destination, fewer dead ends.
What you've got now
- Claude Code running locally on your machine, talking to your Foundry resource.
- Entra ID auth, with no API keys to manage.
- Full Foundry telemetry, quotas, and billing.
- VS Code panel + CLI, both backed by the same setup.
Drop a CLAUDE.md in your repo and start shipping.
When to Use RESOURCE vs BASE_URL
Use RESOURCE (default)
- Standard public deployments
- No custom networking
Use BASE_URL
- Private endpoints
- Custom DNS / VNet routing
Never set both.
Building an End-to-End Azure RAG Strategy Agent with MS Foundry
High-Level Architecture This architecture represents an end-to-end Retrieval-Augmented Generation (RAG) pipeline where raw documents are ingested from Azure Blob Storage, processed using Document Intelligence, transformed into embeddings via Azure OpenAI, and indexed in Azure AI Search for hybrid retrieval. A Foundry/MAF-based agent orchestrates query processing by combining user input with relevant search results and generates contextual responses, which are exposed through a FastAPI or CLI interface. This solution is composed of two main layers: 1. Data Ingestion Layer (RAG Pipeline) This layer transforms raw enterprise documents into searchable knowledge. Flow: Raw documents stored in Azure Blob Storage Supported formats: PDF, DOCX, PPTX, images, etc. Document Intelligence extraction Extracts: Text Tables Key-value pairs Structure Writes output as structured JSON back to Blob (processed/) Chunking + Embedding Documents are split into chunks Each chunk is embedded using Azure OpenAI (text-embedding-*) Indexing into Azure AI Search Creates a hybrid index: Keyword search Semantic ranking Vector search Enables flexible retrieval strategies 2. Query Layer (Strategy Agents) This layer enables intelligent query answering. 
Flow: User sends a query via: FastAPI endpoint CLI interface Query is handled by: Microsoft Agent Framework (MAF) agent Running on Azure AI Foundry Agent: Queries Azure AI Search Retrieves top relevant chunks Injects them into LLM prompt LLM generates grounded response This follows the standard RAG pattern: Retrieval → Augmentation → Generation End-to-End Flow Key Azure Services Used Service Purpose Azure Blob Storage Raw + processed document storage Azure AI Document Intelligence Extract structured content Azure OpenAI Embeddings + LLM generation Azure AI Search Hybrid retrieval engine Azure AI Foundry Agent orchestration Microsoft Agent Framework Agent execution layer Why this Architecture Matters This solution goes beyond basic RAG and provides: Hybrid Retrieval Combines keyword + semantic + vector search Improves recall and accuracy Structured Document Parsing Handles complex enterprise documents Extracts tables and metadata Agent-Based Orchestration Enables reasoning over retrieval results Extensible for multi-agent workflows Scalable Data Pipeline Supports continuous ingestion Works with large document collections Enterprise Considerations Use Managed Identity for secure service access Apply RBAC on Cosmos DB / Search / Storage Enable Private Endpoints for network isolation Use Guardrails + Evaluations in Foundry Summary This repository demonstrates a production-ready Azure RAG architecture: Ingest → Extract → Chunk → Embed → Index Retrieve → Reason → Generate Powered by Azure AI Foundry + Agent Framework By combining data engineering + AI orchestration, it enables enterprise AI systems that are: Accurate Grounded Extensible Repo: https://github.com/snd94/azure-rag-strategy-agent Please refer to the Microsoft Learn Documentation for further information: Azure AI Search documentation - Azure AI Search | Microsoft Learn Document Intelligence documentation - Quickstarts, Tutorials, API Reference - Foundry Tools | Microsoft Learn How to generate embeddings with 
Azure OpenAI in Microsoft Foundry Models - Microsoft Foundry | Microsoft Learn
Microsoft Agent Framework Overview | Microsoft Learn
What is Microsoft Foundry? - Microsoft Foundry | Microsoft Learn
8 Architectural Pillars to Boost GenAI LLM Accuracy and Performance in Low Cost
Smarter AI architecture, not bigger LLMs: how engineering teams push LLM accuracy and performance up while keeping cost low.
Enterprises using Large Language Models (LLMs) hit the same ceiling and pay a high price for it. A raw API call to a frontier model (GPT-4, Claude, Gemini) delivers only 35-40% accuracy on structured output tasks such as code generation, natural language (NL) to DAX query generation, and domain-specific reasoning. Prompt engineering pushes that to ~60%. But the final 35+ percentage points? Those come from system architecture, not model upgrades.
This guide presents 8 architectural pillars, distilled from production Gen AI systems, that compound to close the accuracy gap. These patterns are model-agnostic and domain-agnostic: they apply equally to chatbots, coding assistants, content/query generators, automation agents, and any application where an LLM produces structured or semi-structured output. The guide is based on my recent Gen AI projects.
The key takeaway: use the LLM as one component in a larger system, not as the system itself. Surround it with deterministic guardrails, verified knowledge, and feedback loops.
Pillar 1: Enhance Prompts with Verified Knowledge Context
Impact: +35-40% accuracy (based on production use cases; may vary by domain)
The top source of LLM errors in production is hallucinated identifiers: the model invents names, references, or structures that don't exist in the target system. This happens because LLMs are trained on general knowledge but deployed against specific, private enterprise systems, such as local databases and knowledge bases, that they have never seen. The fix is straightforward: inject verified, system-specific context (type definitions, API specs, ontologies, configuration schemas, entity catalogues) directly into the prompt so the model composes from known-good elements rather than recalling from training data. Use a knowledge graph to capture richer semantic knowledge.
How to Implement
- Provide explicit context, never implicit. Whatever the LLM needs to reference (identifiers, valid values, semantic knowledge, structures) must appear verbatim in the prompt or retrieved context window.
- Filter aggressively. A full knowledge base with thousands of entities overwhelms the context window and confuses the model. Use intelligent filtering to surface only the 5-10 most relevant elements per request.
- Store structured semantic knowledge in a graph or searchable index. This enables relationship-aware retrieval: "given entity X, what related entities, attributes, and constraints are also needed?"
- Include rich semantic metadata. Names alone are insufficient. Include types, constraints, valid value ranges, relationships, and usage notes to minimize ambiguity.
- Keep context fresh. Stale context causes a different class of hallucination: the model generates valid-looking output that references outdated structures. Sync your knowledge store with your source of truth.
Why This Works
LLMs excel at composition and reasoning: combining elements, applying logic, following patterns. They are unreliable at recall of specific identifiers: exact names, valid values, structural constraints. By offloading recall to a deterministic retrieval system and giving the LLM only composition tasks, you play to each system's strengths.
Pillar 2: Tiered LLM Approach: Route Deterministically First, Use LLMs Last
Impact: 80% cost reduction, 85% latency reduction, eliminates non-deterministic errors for most traffic.
The most impactful architectural insight: most production requests don't need an LLM at all. A well-designed system handles 60-70% of traffic with deterministic logic (templates, composition rules, cached results) and reserves expensive, non-deterministic LLM calls for genuinely novel inputs.
The Three-Tier Model
These metrics are from a real use case: converting natural language to Power BI DAX queries.
- Tier 0: Template slot-filling. Uses LLM: No. Latency: ~50ms. Accuracy: 95-98%. Handles requests that match known patterns exactly; the system fills slots in a pre-built template with extracted parameters. No LLM, no non-determinism, near-perfect accuracy, sub-100ms response.
- Tier 1: Compose from pre-validated fragments. Uses LLM: No. Latency: ~200ms. Accuracy: 90-95%. Handles requests that combine known patterns in new ways. The system retrieves pre-validated building blocks via search, composes them using deterministic rules, and validates the result. Still no LLM call.
- Tier 2: Full LLM generation with enriched context. Uses LLM: Yes (1 call). Latency: 2-5s. Accuracy: 88-93%. Reserved for genuinely novel requests that can't be served deterministically. Even here, the LLM receives maximum support: filtered context, relevant examples, explicit rules, and structured planning.
Complexity-Based Routing
A lightweight scoring function (evaluated in <1ms) routes each incoming request:
- Factors: reasoning depth, number of components, cross-references, constraints, nesting depth, novelty (distance from known patterns)
- Score 0-39: Tier 0 (deterministic template)
- Score 40-59: Tier 1 if confidence ≥ 85%, else Tier 2
- Score 60+: Tier 2 (LLM generation)
This routing achieves 96%+ accuracy in tier assignment and ensures the expensive path is only taken when necessary.
Why This Matters
- Cost: 70-80% of requests cost zero LLM tokens
- Latency: Majority of responses in <200ms instead of 2-5s
- Reliability: Deterministic tiers produce identical output for identical input.
- Scalability: Deterministic tiers scale horizontally with trivial compute
Pillar 3: Encode Prompt Anti-Patterns as Explicit Rules
Impact: +8-10% accuracy, ~80% reduction in common structural errors
LLM mistakes are patterned, not random. In any domain, 80% of errors cluster around a small set of 6-13 recurring structural mistakes.
Instead of hoping the model avoids them through general instruction-following, compile these mistakes into explicit WRONG => CORRECT rules embedded directly in the system prompt.
How to Implement
- Collect error data. Run 100+ requests through your system and categorize the failures. You'll find the same 6-13 patterns appearing repeatedly.
- Write concrete rules. For each pattern, show the exact wrong output and the exact correct alternative, with a one-line explanation of why.
- Embed in system prompt. Place rules prominently after the task description, before examples. Use formatting that's hard to ignore (headers, bold, explicit "NEVER" language).
- Keep the list short. 6-13 rules maximum. Beyond that, attention dilutes and the model starts ignoring rules. Prioritize by frequency.
- Refresh continuously. As the system improves (via other pillars), some errors disappear and new error types emerge. Update the rule set quarterly.
Why This Works
LLMs respond strongly to explicit negative examples. A generic instruction like "be careful with X" has minimal impact. But showing the exact wrong output the model tends to produce, paired with the correction, creates a strong avoidance signal. The rules act much like unit tests: each one pins down a known failure.
Pillar 4: Retrieve Few-Shot Examples Dynamically
Impact: +5-15% accuracy depending on domain complexity
Static examples hardcoded in a prompt become stale or irrelevant and waste context tokens. Dynamic few-shot retrieval selects the 3-5 most relevant examples for each specific request, maximizing the signal-to-noise ratio in the prompt.
Hybrid Retrieval Architecture The most effective approach combines two search strategies for intent search to understand natural language (NL) context: Keyword search (BM25) Finds examples with exact matching terms, identifiers, and domain vocabulary Vector search (semantic similarity) Finds examples with similar intent and structure, even if wording differs Rank fusion Merges results from both strategies, re-ranking by combined relevance This hybrid approach outperforms either strategy alone because keyword search catches exact identifier matches that vector search dilutes, while vector search captures semantic similarity that keyword search misses entirely. Best Gen AI Architectural Practices Match complexity to complexity. Simple requests should see simple examples. Complex requests should see complex examples. Mismatched examples confuse the model. Include negative examples. For the detected request type, include 1-2 "wrong => correct" pairs alongside positive examples. This reinforces Pillar 3's anti-pattern rules with concrete, contextually relevant demonstrations. Pre-compute embeddings. Generate vector embeddings at indexing time, not at query time. Cache retrieval results for repeated patterns. Curate quality over quantity. 3 excellent, diverse examples beat 10 mediocre ones. Each example should demonstrate a distinct pattern or edge case. Keep examples current. As your system evolves, old examples may demonstrate outdated patterns. Review and refresh the example store periodically. Pillar 5: Feedback Loop- Validate and Auto-Fix Every Output Deterministically Impact: +3-5% accuracy as a safety net, plus continuous improvement via feedback No matter how well-prompted, LLMs will occasionally produce outputs with minor structural errors - wrong casing, missing delimiters, references to slightly-incorrect identifiers, or subtle format violations. A deterministic post-processing pipeline catches and fixes these without any additional LLM calls. 
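One such deterministic fix is correcting identifier casing against the context supplied in the prompt. The sketch below assumes a parser has already extracted the identifier references from the model output, so no parsing happens here. `fix_identifiers`, `build_lookup`, and the sample identifiers are hypothetical names chosen for illustration.

```python
from typing import Dict, Iterable, List, Tuple


def build_lookup(known: Iterable[str]) -> Dict[str, List[str]]:
    """Map each case-folded name to every known identifier that folds to it."""
    table: Dict[str, List[str]] = {}
    for name in known:
        table.setdefault(name.casefold(), []).append(name)
    return table


def fix_identifiers(
    refs: List[str], known: Iterable[str]
) -> Tuple[List[str], List[str], List[str]]:
    """Return (fixed_refs, unknown_refs, fix_log) using only table lookups."""
    known = list(known)
    exact = set(known)
    table = build_lookup(known)
    fixed: List[str] = []
    unknown: List[str] = []
    log: List[str] = []

    for ref in refs:
        candidates = table.get(ref.strip().casefold(), [])
        if ref in exact:
            fixed.append(ref)                  # already correct
        elif len(candidates) == 1:
            fixed.append(candidates[0])        # unambiguous casing/whitespace fix
            log.append(f"normalized {ref!r} -> {candidates[0]!r}")
        else:
            fixed.append(ref)                  # ambiguous or unknown: pass through
            if not candidates:
                unknown.append(ref)            # flag for the compliance check
    return fixed, unknown, log


known_ids = ["Sales[Amount]", "Date[Year]"]
refs = ["sales[amount]", "Date[Year]", "Sales[Margin]"]
fixed, unknown, log = fix_identifiers(refs, known_ids)
print(fixed, unknown, log)
```

A reference that matches more than one known identifier is passed through unchanged instead of being guessed at. Every applied fix is logged so that recurring errors can be counted later.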
The Validation Pipeline LLM Output => Parse (grammar/AST) => Rule-Based Fixes => Compliance Check/validation => Final Output Each stage is fully deterministic: Parsing: Use a formal grammar or AST parser (ANTLR, tree-sitter, language-native parsers) to structurally analyse the output. Never regex-parse structured output - it's fragile and misses edge cases. Rule-based fixes: 10-20 deterministic transformation rules that correct known error patterns - name normalization, casing fixes, missing delimiters, structural repairs. Compliance check: Verify every identifier referenced in the output actually exists in the provided context. Flag unknown references. Design Principles Zero LLM calls in the fix pipeline. Every fix is a regex, an AST transformation, or a lookup table operation. Instant, free, deterministic, 100% reliable. Fail safe. If a fix is ambiguous (multiple valid corrections possible), pass through rather than corrupt. A minor error is better than a confident wrong "fix." Log everything. Track every fix applied, categorized by type. This data drives the feedback loop. The Critical Feedback Loop- The validation pipeline's most important function isn't fixing outputs, it's generating improvement signals: This creates a feedback loop: the auto-fix catches errors → the errors get promoted to upstream prevention → fewer errors reach the auto-fix → the system continuously tightens. Pillar 6: Multi-Agent Orchestration with Fewer Agents and Clear Contracts Impact: Reduced latency, clearer debugging, fewer failure modes The multi-agent pattern is powerful but commonly over-applied. The counter-intuitive lesson from production systems: fewer agents with well-defined responsibilities outperform many fine-grained agents. 
Why Fewer Is Better
Each agent handoff introduces:
- Latency: serialization, network calls, context assembly
- Context loss: information dropped between boundaries
- Failure modes: each handoff is a potential error point
- Debugging complexity: tracing issues across many agents is exponentially harder
Multi-Agent Orchestration Principles
- Merge agents that always run sequentially. If Agent A always feeds into Agent B with no branching or conditional logic, they should be one agent with two internal steps.
- Parallelize independent operations. Context retrieval and example lookup are independent, so run them concurrently, which can roughly halve retrieval latency.
- Route sub-tasks to cheaper models. Decomposed sub-problems are simpler by design. Use a smaller, faster, cheaper model (3x cost savings, 2x speed improvement).
- Define strict contracts. Each agent boundary should have an explicit schema defining inputs and outputs. No implicit assumptions about what crosses the boundary.
- Limit LLM calls to a minority of agents. In a four-agent design, for example, only 2 of the 4 agents should call an LLM. The rest are purely deterministic. This minimizes non-deterministic behavior and cost.
Pillar 7: Multi-Agent Cache at Multiple Hierarchical Levels
Impact: 40-50% faster responses, 85%+ combined hit rate, significant cost reduction
A single cache layer captures only one type of repetition. Production systems need hierarchical caching where multiple levels catch different repetition patterns: exact duplicates, semantically similar requests, and reusable fragments.
Pillar 8: Measure Everything, Learn Continuously
Impact: Enables data-driven iteration and prevents accuracy regressions.
Architecture without observability is guesswork.
The final pillar ensures every other pillar stays effective over time through comprehensive metrics and automated feedback loops. This isn't a one-time setup; it's a perpetual feedback loop. Every week, the top error patterns shift slightly. The auto-fix metrics tell you exactly where to focus next. Over months, this flywheel compounds into dramatic accuracy gains that no single prompt rewrite could achieve.
Auto-Learning for New Domains
When extending your system to new domains or knowledge areas:
- Auto-classify elements using naming conventions, type analysis, and structural patterns
- Auto-generate templates from universal patterns (transformations, comparisons, compositions, sequences)
- Bootstrap few-shot examples from successful template outputs
- Monitor for the first 100 requests, then curate only the edge cases manually
This reduces domain onboarding from days of manual work to minutes of automated bootstrapping plus focused human review of outliers.
Key Takeaways
- Architecture beats model size. A well-architected system with a smaller model outperforms a raw frontier model call on structured tasks at a fraction of the cost.
- Deterministic systems should do the heavy lifting. Reserve LLMs for genuinely novel, creative tasks. 70-80% of production requests should never touch an LLM.
- Verified knowledge is your top accuracy lever. Ground every prompt in context the model can trust.
- Errors are patterned, not random. Track them, compile them, and explicitly forbid them.
- Build feedback loops, not static systems. Every auto-fix, every cache miss, every routing decision is a signal for improvement.
- Fewer agents, done well. Fewer agents with strict contracts outperform 9 agents with fuzzy boundaries in accuracy, latency, and debuggability.
- Measure what matters and iterate. The system that wins isn't the one with the best day-one prompt, it's the one that improves fastest over time.
Production-grade GenAI isn't about finding the perfect prompt or waiting for the next LLM release. It's about building architectural guardrails that reduce failures and keep output consistent and, when failure does occur, let the system learn from it automatically. These 8 pillars, applied together, transform any LLM from an unreliable black box into a precise, efficient, and continuously improving production system.

Confidence-Aware RAG: Teaching Your AI Pipeline to Acknowledge Uncertainty
Introduction
Retrieval-Augmented Generation (RAG) has become the standard architecture for grounding Large Language Models (LLMs) with enterprise data. By retrieving relevant documents before generating a response, RAG helps reduce hallucinations compared to relying on model knowledge alone.
However, an important limitation remains in most implementations: RAG systems can produce confident-sounding answers even when the underlying data is incomplete, irrelevant, or missing. This happens when:
• Retrieved documents are loosely related to the query
• The answer exists partially but lacks key details
• Retrieved sources contradict each other
• The query falls entirely outside the knowledge base
In enterprise environments, this behavior carries real risk. A reliable AI system must not only answer well - it must also know when not to answer.
This article presents a practical confidence-aware RAG architecture using three layered strategies: retrieval confidence scoring, citation validation, and LLM-based abstention - all implemented with Azure AI Search and Azure OpenAI.

The Problem: Confident Hallucination
Consider a real-world enterprise scenario. An employee asks:
"What is our company's parental leave policy for contractors?"
The knowledge base contains parental leave policies for full-time employees - but nothing specific to contractors. A standard RAG pipeline retrieves the closest matching document and confidently presents the full-time employee policy as the answer.
This outcome is worse than returning no answer. The user trusts the system, acts on incorrect information, and the error may not surface until real consequences follow.
This pattern is sometimes called hallucination laundering - the RAG architecture creates the appearance of factual grounding while the response is not actually supported by the retrieved evidence.
Fixing this requires deliberate confidence checkpoints at each stage of the pipeline.

Architecture Overview
A standard RAG pipeline follows a simple path:
User Query → Retrieve Documents → Generate Answer
A confidence-aware pipeline adds explicit decision checkpoints after retrieval and after generation, so the pipeline can abstain at either point. Each layer catches failures the previous one may miss. Together, they form a defense-in-depth approach to output reliability.

Strategy 1: Retrieval Confidence Scoring
The first checkpoint evaluates whether retrieved documents are genuinely relevant before passing them to the LLM. Azure AI Search returns a @search.rerankerScore when semantic ranking is enabled - a value on the 0-4 scale that reflects how well each document matches the query intent, not just keyword overlap.

    from azure.search.documents import SearchClient
    from azure.identity import DefaultAzureCredential

    # AZURE_SEARCH_ENDPOINT is assumed to be defined elsewhere (for example, loaded from configuration)
    search_client = SearchClient(
        endpoint=AZURE_SEARCH_ENDPOINT,
        index_name="enterprise-knowledge-base",
        credential=DefaultAzureCredential()
    )

    def retrieve_with_confidence(query: str, threshold: float = 1.5, top_k: int = 5):
        # Semantic ranking must be enabled for reranker scores to be returned
        results = search_client.search(
            search_text=query,
            query_type="semantic",
            semantic_configuration_name="default",
            top=top_k,
            select=["content", "title", "source"]
        )
        confident_results = []
        for result in results:
            # The Python SDK exposes the REST field @search.rerankerScore under the
            # key "@search.reranker_score"; it can be None, so fall back to 0
            reranker_score = result.get("@search.reranker_score") or 0
            if reranker_score >= threshold:
                confident_results.append({
                    "content": result["content"],
                    "title": result["title"],
                    "source": result["source"],
                    "score": reranker_score
                })
        return confident_results

If no documents clear the threshold, the pipeline abstains rather than forcing a low-quality answer:

    # Inside the request handler
    results = retrieve_with_confidence(user_query, threshold=1.5)
    if not results:
        return {
            "answer": (
                "I don't have enough information in the knowledge base to answer "
                "this question. Please contact the relevant team for assistance."
            ),
            "status": "abstained_retrieval"
        }

Threshold tuning: Start at 1.5 on the 0-4 scale.
Evaluate against a labeled test set and adjust based on your precision/recall requirements. Higher thresholds reduce false positives but may increase abstention on edge cases.

Strategy 2: Citation Validation
Even when retrieval scores are high, the LLM may synthesize information that does not exist in the retrieved context. Citation validation addresses this by requiring the model to ground every factual claim in a specific named source - and then programmatically verifying those citations exist in the retrieved set.

    from openai import AzureOpenAI

    client = AzureOpenAI(
        api_key=AZURE_OPENAI_API_KEY,
        azure_endpoint=AZURE_OPENAI_ENDPOINT,
        api_version="2025-12-01-preview"
    )

    ANSWER_WITH_CITATIONS_PROMPT = """
    You are an enterprise assistant. Answer the question using ONLY the provided context.

    RULES:
    1. Every factual claim MUST include a citation in the format [Source: <title>].
    2. If the context does not contain enough information, respond with: "I don't have sufficient information to answer this question."
    3. Do NOT infer, assume, or use knowledge outside the provided context.
    4. If context partially answers the question, state what you know and explicitly note what information is missing.
    Context: {context}
    Question: {question}
    Answer:
    """

    def generate_answer(question: str, context: str, sources: list) -> dict:
        prompt = ANSWER_WITH_CITATIONS_PROMPT.format(
            context=context, question=question
        )
        response = client.chat.completions.create(
            model=AZURE_DEPLOYMENT_NAME,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        answer = response.choices[0].message.content.strip()
        validation = validate_citations(answer, sources)
        return {"answer": answer, "citation_check": validation}

The validation function checks that every citation in the answer maps to a document that was actually retrieved:

    import re

    def validate_citations(answer: str, sources: list) -> dict:
        cited = re.findall(r'\[Source:\s*(.+?)\]', answer)
        source_titles = {s["title"].lower().strip() for s in sources}
        valid, invalid = [], []
        for citation in cited:
            if citation.lower().strip() in source_titles:
                valid.append(citation)
            else:
                invalid.append(citation)
        return {
            "total_citations": len(cited),
            "valid": valid,
            "invalid": invalid,
            "is_trustworthy": len(invalid) == 0 and len(cited) > 0
        }

If is_trustworthy is False, the pipeline flags the response for review or suppresses it:

    if not generation["citation_check"]["is_trustworthy"]:
        return {
            "answer": "I found related information but cannot provide a reliable answer based on the available sources.",
            "status": "abstained_citation"
        }

Strategy 3: LLM-Based Abstention Scoring
The third layer adds a second LLM call that acts as a quality judge - explicitly evaluating whether the generated answer is well-supported by the retrieved context, independent of citation formatting.

    ABSTENTION_JUDGE_PROMPT = """
    You are an answer quality judge. Given a question, retrieved context, and a generated answer, evaluate whether the answer is fully supported by the context.
    Respond ONLY in JSON format:
    {{
        "verdict": "supported" | "partial" | "unsupported",
        "confidence": <float between 0.0 and 1.0>,
        "reasoning": "<brief explanation>"
    }}

    Question: {question}
    Context: {context}
    Answer: {answer}
    """

    def judge_answer(question: str, context: str, answer: str) -> dict:
        import json
        prompt = ABSTENTION_JUDGE_PROMPT.format(
            question=question, context=context, answer=answer
        )
        response = client.chat.completions.create(
            model=AZURE_DEPLOYMENT_NAME,
            messages=[{"role": "user", "content": prompt}],
            temperature=0
        )
        raw = response.choices[0].message.content.strip()
        try:
            return json.loads(raw)
        except json.JSONDecodeError:
            # Fail closed: if the judge output is not valid JSON, treat the answer as unsupported
            return {"verdict": "unsupported", "confidence": 0.0, "reasoning": "Judge output was not valid JSON"}

Integrate the judge with a confidence threshold of 0.6:

    judgement = judge_answer(user_query, context, generation["answer"])
    if judgement["verdict"] == "unsupported" or judgement["confidence"] < 0.6:
        return {
            "answer": "I don't have sufficient information to answer this question confidently.",
            "status": "abstained_judge"
        }
    if judgement["verdict"] == "partial":
        generation["answer"] += (
            "\n\nNote: This answer may be incomplete. "
            "Some aspects of your question were not covered in the available documents."
        )

End-to-End Pipeline
Combining all three strategies gives a complete confidence-aware pipeline:

    def confidence_aware_rag(user_query: str) -> dict:
        # Layer 1: Retrieve with confidence gating
        results = retrieve_with_confidence(user_query, threshold=1.5)
        if not results:
            return {
                "answer": "I don't have enough information in the knowledge base to answer this.",
                "status": "abstained_retrieval"
            }

        # Prefix each passage with its title so the model can cite it as [Source: <title>];
        # without the titles in the context, the citation check cannot pass
        context = "\n\n".join(
            f"[Source: {r['title']}]\n{r['content']}" for r in results
        )

        # Layer 2: Generate with citation requirements
        generation = generate_answer(user_query, context, results)
        if not generation["citation_check"]["is_trustworthy"]:
            return {
                "answer": "I found related information but cannot provide a reliable answer.",
                "status": "abstained_citation"
            }

        # Layer 3: Judge the answer
        judgement = judge_answer(user_query, context, generation["answer"])
        if judgement["verdict"] == "unsupported" or judgement["confidence"] < 0.6:
            return {
                "answer": "I don't have sufficient information to answer this question confidently.",
                "status": "abstained_judge"
            }
        if judgement["verdict"] == "partial":
            generation["answer"] += (
                "\n\nNote: This answer may be incomplete. "
                "Some aspects of your question were not covered in available documents."
            )
        return {
            "answer": generation["answer"],
            "status": "answered",
            "confidence": judgement["confidence"],
            "sources": [r["source"] for r in results[:3]]
        }

Choosing the Right Strategies for Your Use Case
Each strategy adds a layer of safety at a different cost. The right combination depends on the stakes involved in your deployment.
Strategy | Added Cost | Latency | Best For
Retrieval Confidence Scoring | None (uses existing search scores) | None | All RAG applications - this should be universal
Citation Validation | Minimal (regex post-processing) | Negligible | Regulated industries, compliance, audit trails
LLM Abstention Judge | One additional LLM call | +1-3 seconds | High-stakes decisions - financial, legal, medical

For most enterprise applications, combining retrieval scoring and citation validation provides a strong baseline with minimal overhead. The judge layer is most valuable when incorrect answers carry significant business or compliance risk.

Threshold calibration
There is a meaningful tradeoff in threshold selection. Setting thresholds too high reduces hallucination but increases abstention - the system may refuse to answer even when reliable information is available. The recommended approach is to build a labeled evaluation set of query/answer pairs, run the pipeline at multiple threshold values, and select the point that meets your precision/recall requirements for the specific domain.

When to Apply This Pattern
Confidence-aware RAG is most valuable in deployments where:
• Data coverage is uneven - the knowledge base may have detailed coverage in some areas and gaps in others, making it difficult to predict when retrieval will be reliable
• Errors carry downstream consequences - healthcare documentation, legal and compliance search, financial reporting, and regulated industries where a wrong answer is worse than no answer
• Users have varying expertise - non-expert users may not recognize a plausible-sounding but incorrect response, making transparent uncertainty signals especially important
• Audit or traceability requirements apply - the ability to trace each answer back to a specific source with a confidence signal supports governance and review workflows

Conclusion
Building a RAG system that retrieves documents and generates responses is relatively straightforward.
Building one that understands the limits of its own knowledge requires deliberate design. The three strategies covered here - retrieval confidence scoring, citation validation, and LLM-based abstention - form a layered defense against the most common failure mode in production RAG systems: the confident, well-formatted, completely unreliable answer. The most dangerous AI system is not one that fails openly. It is one that fails silently, with confidence. Teaching your pipeline to say "I don't know" is not a limitation. It is a feature that builds user trust and makes enterprise AI adoption sustainable over time.
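The threshold calibration procedure described earlier (build a labeled set, sweep several thresholds, pick the point that meets your precision/recall target) can be sketched roughly as follows. This is a minimal illustration under stated assumptions: each labeled example is reduced to a `(top_reranker_score, is_answerable)` pair prepared offline, the scores shown are made up, and `sweep_thresholds` and `pick_threshold` are hypothetical helper names, not part of any Azure SDK.

```python
from typing import List, Optional, Tuple

# (top reranker score for the query, whether the knowledge base truly contains the answer)
LabeledExample = Tuple[float, bool]


def sweep_thresholds(examples: List[LabeledExample], thresholds: List[float]) -> List[dict]:
    """Compute precision, recall, and abstention rate of the 'answer' decision per threshold."""
    if not examples:
        raise ValueError("examples must not be empty")
    answerable = sum(1 for _, ok in examples if ok)
    rows = []
    for t in thresholds:
        # Queries whose top score clears the threshold would be answered
        answered = [ok for score, ok in examples if score >= t]
        true_pos = sum(1 for ok in answered if ok)
        rows.append({
            "threshold": t,
            "precision": true_pos / len(answered) if answered else 1.0,
            "recall": true_pos / answerable if answerable else 0.0,
            "abstention_rate": 1 - len(answered) / len(examples),
        })
    return rows


def pick_threshold(rows: List[dict], min_precision: float) -> Optional[float]:
    """Return the lowest threshold that meets the precision target.

    Recall never increases as the threshold rises, so the lowest qualifying
    threshold also keeps recall as high as possible.
    """
    for row in sorted(rows, key=lambda r: r["threshold"]):
        if row["precision"] >= min_precision:
            return row["threshold"]
    return None


# Made-up scores for illustration only
examples = [(0.8, False), (1.2, False), (1.4, True), (1.9, True), (2.6, True), (3.1, True)]
rows = sweep_thresholds(examples, [1.0, 1.5, 2.0])
best = pick_threshold(rows, min_precision=0.9)
```

The same sweep could be applied to the judge's confidence cutoff; in either case the chosen value is only as good as the labeled set, so the set should reflect the real mix of answerable and unanswerable queries.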