developer
329 TopicsEnhancing Data Security and Digital Trust in the Cloud using Azure Services.
Enhancing Data Security and Digital Trust in the Cloud by Implementing Client-Side Encryption (CSE) using Azure Apps, Azure Storage and Azure Key Vault. Think of Client-Side Encryption (CSE) as a strategy that has proven to be most effective in augmenting data security and modern precursor to traditional approaches. CSE can provide superior protection for your data, particularly if an authentication and authorization account is compromised.We Gave Ourselves 20 Minutes to Build an AI Agent for a Lumber Company. The Timer's Still on Screen.
Here's a confession: most "build with AI" webinars are 60 minutes of slides, 5 minutes of a polished demo someone rehearsed for a week, and a closing CTA. You leave inspired but not really sure what you saw. So we tried something different. We put a visible countdown timer on the screen and gave ourselves 20 minutes to do two things, live: Build an AI agent that solves a real business problem Deploy a working AI application to Azure No edits to hide the awkward parts. No "and here's one I prepared earlier." Just the timer, the screen, and a working app at the end. The on-demand recording is up now. Here's what's in it and why you should carve out 20 minutes for it this week. The setup: why lumber? 🏘️ We needed a real business problem, not a toy one. So for the demo, we role-play as the owner of Contoso Lumber — a regional lumber business with a very specific, very real headache: Should we sell our inventory now, or hold it longer? Sell too early, miss a better price. Hold too long, eat storage costs. Lumber prices fluctuate with global competition, macro shifts, even the weather. In the past, decisions like this came from morning meetings and gut instinct, or maybe the occasional ad-hoc spreadsheet that nobody could reuse a month later. It's the kind of decision that should have an analyst behind it — except most growing businesses can't afford to hire one full-time. So we build the AI agent that does. (Yes, lumber. We know. Stick with us — the boring industry is exactly the point. If it works here, it works for your business too.) What we actually build (in 20 minutes flat) The webinar walks through the entire flow, end to end: Part 1 — The agent. We open Microsoft Foundry at ai.azure.com, browse the model leaderboard (there are over 11,000 models to choose from — we compare a few on the cost-vs-quality chart), pick one, write a plain-English instruction for the agent, upload a CSV of historical lumber pricing, and ask it a real question: "If I cannot sell one of my products today unless I offer my clients a 35% discount, and knowing the historical pricing data, should I still sell it?" The agent runs a break-even analysis and comes back with a reasoned recommendation — hold for 3–6 months, here's the math on why, here's where storage costs start eating the upside. Then we add voice mode (now you can ask the agent for pricing recs from a coffee shop on your phone), and lock down guardrails to block jailbreaks, prompt injection, data leakage, and — because we're feeling fancy — profanity in responses. Part 2 — The app. With the agent done, we pivot to deploying a full AI chat application to Azure. From scratch. Using exactly five commands in Azure Cloud Shell: azd auth login git clone <repo> cd <folder> azd up azd down # (this one's for when you're done — kills everything to avoid surprise bills) That's it. The template handles the Container Apps setup, the architecture-aligned-to-Well-Architected-Framework stuff, all the boilerplate that usually eats half a sprint. By the end of the segment, there's a working AI chatbot running on a real Azure URL. We even pause the timer when we're just explaining things, so you know the 20-minute clock is honest about build time, not talk time. Why this format is more useful than another slide deck A few things this webinar shows that a written tutorial can't: The Foundry UI is super navigable. You watch someone do it. You see where the buttons are. You see what the leaderboard looks like when you're comparing GPT-5.3 Codex against Kimi K2.5 on a cost-to-quality chart. (Spoiler: Kimi wins this particular trio. Your mileage will vary depending on your workload.) The "no-stitching" claim is real. Models, data, agents, guardrails, deployment — all in one place. You don't need to leave Foundry to wire seven products together. The webinar makes that concrete by showing you the actual flow without cutting. Five commands really is five commands. This is the part people are most skeptical about until they see it. azd up does the work. The infrastructure provisioning, the container app, the AI service hookup — all of it. You can delete it just as fast. azd down tears everything back down. Useful when you're experimenting and don't want a $40 surprise on your Azure bill next month. What's on screen at the end By the 20-minute mark: A published AI agent named for the lumber business, with guardrails, voice mode enabled, ready to be called from Teams, Microsoft 365 Copilot, or any application via endpoint A separate AI chat application deployed to Azure Container Apps, with a live URL Logs, observability, the full Foundry control plane — all available out of the box And in the closing minutes, four very concrete next steps for what you do next if this sparked an idea for your own business — including Azure Accelerate (if you want Microsoft experts in the room with you), the partner network, and the Microsoft marketplace if you'd rather buy than build. Watch the recording The on-demand recording is available now. Block 20 minutes — that's literally all it takes — and ideally watch with your Azure portal open in another tab so you can follow along. If you're the kind of person who learns by doing, pause at the agent-building section and try it yourself in parallel. Foundry is free to explore; the agent we build in the webinar costs cents to run. → Watch the on-demand webinar A few things we'd love feedback on If you watch it, we'd genuinely love to know: Did the timer help or distract? (We thought it would feel gimmicky. It turned out to be the most-mentioned thing in early feedback.) What use case from your business would you want to see in the next one? We're picking the next demo problem from comments. Was the lumber thing weirdly compelling or were you just here for the Azure parts? Drop a comment, tag us, or grab a partner and try building your own version this week. The timer's reset. Your 20 minutes start whenever you press play. Want to go deeper than the webinar? Two companion reads: From Idea to Impact: How Growing Businesses Scale with Azure (five real customer stories with the full architectures) and AI Made Simple: 3 Practical Moves for Growing Businesses (the structured playbook for figuring out what to build first).From AI Suggestions to Autonomous CRM Actions in Dynamics 365
Modern CRM AI solutions often stop at case summarization—but real transformation requires more. This blog introduces a CRM Copilot Agent Accelerator built on Microsoft Power Platform, designed to evolve AI from simple insights to predictive intelligence and ultimately to autonomous actions. By combining Dynamics 365, Dataverse, Power Automate, and AI Builder, and extending capabilities through modular add-on packs, this approach enables organizations to reduce manual effort, improve decision-making, and scale service operations efficiently—without additional Copilot licensing.Claude Code on Microsoft Foundry in VS Code — A Practical Setup Guide (with the gotchas)
Enables enterprise-grade governance without changing your developer workflow. The official Microsoft Learn article (Configure Claude Code for Microsoft Foundry) gets you ~80% of the way there. The remaining 20%—VS Code settings shape, tenant mismatches, and configuration conflicts like "baseURL and resource are mutually exclusive"—is where most setups fail in practice. This guide walks the full path end-to-end, with the exact JSON that validates, working CLI configuration, and a troubleshooting matrix based on real-world failures. This guidance is based on repeated customer deployments and internal testing across both CLI and VS Code scenarios. TL;DR Setup - Deploy claude-sonnet-4-6 (optionally Haiku + Opus) in a supported region - Grant Cognitive Services User + Foundry User - az login --tenant <tenant> , then launch VS Code via code . Config - CLI: - CLAUDE_CODE_USE_FOUNDRY=1 - ANTHROPIC_FOUNDRY_RESOURCE=<name> - Do NOT set ANTHROPIC_FOUNDRY_BASE_URL at the same time - VS Code: - Use [{ "name": "...", "value": "..." }] format Validate - claude → /status - Expect: API provider: Microsoft Foundry Why run Claude Code on Foundry? Anthropic's Claude Code is a top-tier agentic coding assistant. Running it through Microsoft Foundry instead of Anthropic's public API gives you: Data residency & compliance: prompts and completions stay inside your Azure tenant. Entra ID auth: no API keys to rotate; centralized RBAC. Private networking: works behind VNets/Private Endpoints. Unified billing & quotas: usage shows up on your Azure invoice and in Foundry monitoring. Same model, same CLI, enterprise-grade plumbing underneath. Prerequisites checklist Requirement How to verify Azure subscription with pay-as-you-go billing az account show Foundry resource in supported regions Check your region's model availability in Foundry portal Contributor/Owner on the resource group (for deployments) Azure Portal → IAM Cognitive Services User + Foundry User on the resource (for invoking) Azure Portal → IAM Azure CLI installed and logged in az --version , az login Claude Code CLI installed claude --version VS Code (current) with the Anthropic Claude Code extension Help → About Windows only: Git Bash (from Git for Windows) or WSL2 — Claude Code's runtime requires a POSIX shell bash --version in Git Bash / WSL ⚠️ Claude models in Foundry are currently available in select regions. Check the Foundry portal model catalog for your region's availability (commonly East US 2 and Sweden Central). Step 1 — Deploy the Claude models Claude Code uses three model roles, and it expects a deployment for each: Role Default deployment name Used for Primary claude-sonnet-4-6 general coding (balanced) Fast claude-haiku-4-5 quick edits, file reads Extended thinking claude-opus-4-6 complex reasoning Deploy at least Sonnet to get started. Add Haiku and Opus when you need them — Claude Code will route automatically. If a role-specific model isn't deployed, Claude Code may fall back or fail depending on the task. Deployment names in this guide follow the current Claude 4.x naming exposed in Foundry. Exact versions change over time — check the Foundry model catalog in your region for what's currently available. Foundry Portal: AI Foundry → your project → Build → Models + endpoints → + Deploy model → pick the Anthropic Claude model → Global Standard deployment → name it exactly as above (or remember the name to override later). To discover the current model version before deploying (replace eastus2 with your Foundry region): az cognitiveservices model list -l eastus2 ` --query "[?contains(model.name,'claude')].{name:model.name, version:model.version, format:model.format}" -o table Azure CLI: az cognitiveservices account deployment create ` --name <foundry-resource> ` --resource-group <rg> ` --deployment-name claude-sonnet-4-6 ` --model-name claude-sonnet-4-6 ` --model-version <version> ` --model-format Anthropic ` --sku-name GlobalStandard ` --sku-capacity 1 ✍️ Figure 1: Foundry portal “Models + endpoints” showing the three Claude deployments. Step 2 — Grant yourself the right roles This is the #1 source of silent failures. You need both: Role Role ID Purpose Cognitive Services User a97b65f3-24c7-4388-baec-2e87135dc908 data-plane invocation Foundry User (formerly Azure AI User) 53ca6127-db72-4b80-b1b0-d745d6d5456d Foundry-native permissions $me = az ad signed-in-user show --query id -o tsv $scope = az cognitiveservices account show -n <foundry-resource> -g <rg> --query id -o tsv # Use role IDs — rename-proof (works whether the display name is "Azure AI User" or "Foundry User") az role assignment create --assignee $me --role a97b65f3-24c7-4388-baec-2e87135dc908 --scope $scope # Cognitive Services User az role assignment create --assignee $me --role 53ca6127-db72-4b80-b1b0-d745d6d5456d --scope $scope # Foundry User (formerly Azure AI User) The Foundry RBAC rename (Azure AI User → Foundry User) is rolling out; both role names map to the same role definition (same role ID), depending on tenant rollout state. Use whichever role name your tenant exposes — or use the role IDs above to avoid ambiguity. Step 3 — Install the Claude Code CLI Use the official installer from Anthropic (auto-updates in the background): irm https://claude.ai/install.ps1 | iex claude --version If claude isn't on PATH, restart your shell. The installer drops it under %USERPROFILE%\.local\bin . Step 4 — Sign in to the right tenant If your Foundry resource lives in a tenant different from your default, an az login to the wrong tenant produces the cryptic error: ValueError: Unable to get authority configuration for https://login.microsoftonline.com/<bad-guid>. Authority would typically be in a format of https://login.microsoftonline.com/your_tenant Fix: az login --tenant <foundry-tenant-guid> az account set --subscription <foundry-subscription-guid> az account show # confirm tenant & subscription 💡 You can list every tenant you have access to with: az account list --query "[].{name:name, tenantId:tenantId}" -o table Step 5 — Configure the CLI Set these in the same PowerShell session you'll launch claude from: $env:CLAUDE_CODE_USE_FOUNDRY = "1" $env:ANTHROPIC_FOUNDRY_RESOURCE = "<your-foundry-resource-name>" # Optional: only if your deployment names differ from the defaults $env:ANTHROPIC_DEFAULT_SONNET_MODEL = "claude-sonnet-4-6" $env:ANTHROPIC_DEFAULT_HAIKU_MODEL = "claude-haiku-4-5" $env:ANTHROPIC_DEFAULT_OPUS_MODEL = "claude-opus-4-6" To make them persistent: setx CLAUDE_CODE_USE_FOUNDRY 1 (and so on), then sign out and back in (or restart Explorer). GUI apps like VS Code launched from the Start menu only pick up new user-env vars after the user session refreshes — opening a fresh terminal isn't enough. 🚫 The "mutually exclusive" trap API Error: baseURL and resource are mutually exclusive You'll hit this if you set both ANTHROPIC_FOUNDRY_RESOURCE and ANTHROPIC_FOUNDRY_BASE_URL . Pick one: Most users → ANTHROPIC_FOUNDRY_RESOURCE=<name> (Claude Code builds the URL). Custom subdomain / private endpoint → use ANTHROPIC_FOUNDRY_BASE_URL instead. Step 6 — Verify the CLI claude > /status Expected output: API provider: Microsoft Foundry Microsoft Foundry base URL: https://<resource>.services.ai.azure.com/anthropic Microsoft Foundry resource: <resource> Model: Default (claude-sonnet-4-6) ✍️ Figure 2: /status output confirming API provider: Microsoft Foundry . If you instead see "Anthropic" or it prompts for an Anthropic login, CLAUDE_CODE_USE_FOUNDRY isn't being inherited — see troubleshooting below. Step 7 — Configure the VS Code extension Install Claude Code from the VS Code Marketplace (publisher: Anthropic). Open user settings.json ( Ctrl+Shift+P → Preferences: Open User Settings (JSON)) and add: "claudeCode.environmentVariables": [ { "name": "CLAUDE_CODE_USE_FOUNDRY", "value": "1" }, { "name": "ANTHROPIC_FOUNDRY_RESOURCE", "value": "<your-foundry-resource-name>" } ] 🪤 Schema gotcha. The MS Learn doc currently shows this as a plain {KEY: VALUE} object under the UI label "Claude Code: Environment Variables" . In recent extension versions the actual JSON key is claudeCode.environmentVariables and the value must be an array of {name, value} objects. If you paste the doc's snippet verbatim, VS Code will flag "Missing property name", "Colon expected", "Unknown configuration setting". Use the array form above. Make the extension see your az login The extension inherits environment & credentials from the process that launches VS Code. After az login : # In the same PowerShell where az login succeeded: code . If VS Code was already running, fully quit it (not just close the window) and relaunch from the terminal. Developer: Reload Window is not enough to refresh inherited Azure CLI credentials. ✍️ Figure 3: settings.json with the claudeCode.environmentVariables array form. Step 8 — Try it In VS Code, click the Claude Code (Spark) icon in the sidebar to open the panel. Type: Summarize the structure of this project. You should get a response within a few seconds, and the panel should indicate it's routing through Microsoft Foundry. Run /status inside the panel to confirm API provider: Microsoft Foundry if you want certainty. ✍️ Figure 4: Claude Code panel in VS Code responding through Microsoft Foundry. Troubleshooting matrix Symptom Where it shows up Likely cause Fix API Error: baseURL and resource are mutually exclusive claude CLI on first request Both ANTHROPIC_FOUNDRY_BASE_URL and ANTHROPIC_FOUNDRY_RESOURCE set Unset one. Prefer ANTHROPIC_FOUNDRY_RESOURCE . Unable to get authority configuration for https://login.microsoftonline.com/<guid> claude CLI startup or VS Code panel Wrong tenant ID in az login az login --tenant <correct-guid> ; verify with az account show Failed to get token from azureADTokenProvider: ChainedTokenCredential authentication failed VS Code Claude Code panel Extension didn't inherit az login session Quit VS Code entirely; relaunch with code . from the authed shell Token tenant does not match resource tenant claude CLI or VS Code panel CLI logged into a different tenant than the Foundry resource az login --tenant <foundry-tenant> The model <name> is not available on your foundry deployment claude CLI first use or VS Code model selector Deployment name mismatch Either rename the Foundry deployment, or set ANTHROPIC_DEFAULT_*_MODEL to the actual name 401 / 403 on first request claude CLI or VS Code panel Missing RBAC on the resource Assign Cognitive Services User and Foundry User on the resource scope Claude Code prompts for Anthropic login VS Code Claude Code panel CLAUDE_CODE_USE_FOUNDRY not set in the process Set the env var before launching claude / code . VS Code shows "Unknown Configuration Setting" for claudeCode.environmentVariables VS Code Settings tab Wrong JSON shape Use the array of {name,value} objects form 429 Too Many Requests claude CLI or VS Code panel TPM/RPM exhausted Foundry portal → Operate → Quotas; request increase or reduce parallelism Works in CLI, fails in VS Code extension VS Code Claude Code panel only Env vars set per-shell, not visible to GUI VS Code Use setx (persistent user env) or move them into claudeCode.environmentVariables "Model is not available in region" Foundry portal model deployment step Foundry resource not in a supported region Deploy a new Foundry resource in a supported region, or check model availability Best practices Auth & secrets - Prefer Entra ID over API keys. If you must use a key for CI, store it as a secret (GitHub Actions secret, Key Vault) — never in settings.json (it may sync via Settings Sync). - Scope RBAC at the resource level, not the subscription, for least privilege. Project context - Create a CLAUDE.md at your repo root with stack, conventions, and entry-point commands. Claude Code reads it automatically and the quality jump is significant. - Use .claude/rules/*.md for per-area rules (e.g., test conventions, security rules). Cost & latency - Let Claude Code's auto-routing pick the right role (Sonnet/Haiku/Opus). Don't pin everything to Opus. - Cap context with ANTHROPIC_MAX_TOKENS if you have a strict budget. (Note: not honored by every Claude Code version — check the Claude Code docs for your version.) - Watch token spend in Foundry → Operate → Metrics weekly. Reliability - For team use, deploy all three model roles even if you don't think you need them — silent role-routing failures are confusing. - Tag your Foundry resource ( env=dev|prod , team=... ) for chargeback. Reproducibility - Document the exact env vars and az login --tenant GUID in your team README. - Pin Claude Code CLI version in onboarding docs ( claude --version ) so new joiners hit the same behavior. A note on the MS Learn doc The doc is accurate but skips three things that caused the most friction in real-world deployments: VS Code extension settings shape — the example uses the UI label as a JSON key and an object instead of the array form the schema actually expects. Process inheritance — it says "set the env vars" but doesn't emphasize that the VS Code window must be launched from a shell where both az login and the env vars are live. Reloading the window doesn't help. Mutually exclusive RESOURCE vs BASE_URL — listed in passing, but the error message only appears at first request, after you think everything is configured. If the Microsoft Learn page is updated, treat this post as a companion — same destination, fewer dead ends. What you've got now Claude Code running locally on your machine, talking to your Foundry resource. Entra ID auth — no API keys to manage. Full Foundry telemetry, quotas, and billing. VS Code panel + CLI, both backed by the same setup. Drop a CLAUDE.md in your repo and start shipping. When to Use RESOURCE vs BASE_URL Use RESOURCE (default) - Standard public deployments - No custom networking Use BASE_URL - Private endpoints - Custom DNS / VNet routing Never set both.541Views0likes0CommentsPower Pages & SPA: Deploying a Custom React SPA to Microsoft Power Pages
Build modern client-side experiences with React and deploy them securely on Power Pages using Power Platform CLI. Learn how to build a React Single Page Application using Vite and deploy it to Microsoft Power Pages with Power Platform CLI while using Power Pages authentication, Web API, and Dataverse security.Multi-Tenant Architecture: Real Challenges and an Azure Design Walkthrough
Azure Multi-Tenant Architecture (B2C Scenario) Let’s start with a reference design commonly used in Azure-based systems. A pretty standard setup looks something like this: Microsoft Entra External ID (Azure AD B2C) for authentication Azure API Management as the entry layer App Service or Functions for the compute layer Cosmos DB or SQL for storage Redis for caching Service Bus for async processing Application Insights for monitoring If you’ve worked on Azure systems, nothing here is surprising. On paper, this architecture is clean, scalable, and “multi-tenant ready”. But once traffic starts flowing and tenants behave differently, things start breaking in subtle ways. 1. Tenant Context Propagation Across Services A request doesn’t stay in one place. It moves across: API layer queues/topics background workers What I’ve seen happen multiple times: tenant ID is present in the API, but missing in async flows background jobs process data without knowing which tenant it belongs to logs become useless because you can’t tie actions back to a tenant The fix is simple in theory, but often missed in implementation: Every message should carry tenant context. No exceptions. If you rely on “it will be available somewhere”, it won’t be, especially in distributed systems. Ensure tenant context is explicitly carried everywhere: public class TenantMessage { public string TenantId { get; set; } public string Payload { get; set; } } Every message, event, and async operation should include tenant scope. 2. Data Isolation in Shared Databases Most teams start with a shared database model with tenant-based partitioning. It works well initially. Problems start creeping in later: someone forgets to add a tenant filter in a query a query suddenly scans across partitions one large tenant starts slowing down others A simple query like this becomes critical: var query = container.GetItemQueryIterator<Order>( new QueryDefinition("SELECT * FROM c WHERE c.tenantId = @tenantId") .WithParameter("@tenantId", tenantId) ); The tricky part is not writing it once, it’s making sure it’s applied everywhere, every time. 3. Authorization Beyond Tenant Boundaries At the beginning, access control is simple: “Users can access data from their own tenant.” But then requirements grow: admin access cross-tenant visibility reporting across firms or regions And this is where things usually get messy. Different services start implementing their own logic, and over time you end up with inconsistent behavior. A simple check: public bool CanAccess(string userTenant, string resourceTenant, bool isGlobalAdmin) { if (isGlobalAdmin) return true; return userTenant == resourceTenant; } becomes much harder to manage when duplicated across multiple services. One thing that helps a lot here is centralizing authorization logic early. 4. Caching as a Hidden Risk Caching is usually added later for performance. And that’s exactly why it becomes risky. I’ve seen scenarios where: cached data from one tenant is returned to another because the cache key didn’t include tenant information Fixing it is straightforward: public string BuildCacheKey(string tenantId, string key) { return $"{tenantId}:{key}"; } Cache keys must always include tenant boundaries 5. Resource Contention (Noisy Neighbor Problem) All tenants share resources: compute database throughput messaging What happens in practice: one high-load tenant impacts others latency becomes unpredictable system behavior differs per tenant You start adding controls like: if (RequestsPerTenant[tenantId] > 100) { return StatusCode(429); } And gradually move towards: throttling workload isolation prioritization This is less of a design problem and more of an operational reality. 6. Observability in Multi-Tenant Systems Logging works great, until you scale. Then suddenly: logs from all tenants are mixed debugging becomes slow it’s hard to answer basic questions like “which tenant failed?” A small change makes a huge difference: _logger.LogInformation( "Tenant={TenantId} Action=ProcessOrder OrderId={OrderId}", tenantId, orderId ); It sounds obvious, but it’s often inconsistent across services. 7. Backup and Restore Considerations Taking backups is easy. Restoring a single tenant isn’t. In most shared database setups: restore is done at database level which affects all tenants So if one tenant has a problem, recovery is not straightforward. This is one of those areas where decisions made early in design matter a lot later. Final Thoughts Designing a multi-tenant system is not just about choosing Azure services. The real challenges come from: how tenant context flows how isolation is enforced how systems behave under uneven load Most issues don’t show up on day one. They appear gradually as tenants grow, scale, and behave differently. References and Further Reading If you want to explore these concepts in more depth, here are some useful official resources: Microsoft Entra External ID (Azure AD B2C) Azure API Management Azure App Service Azure Cosmos DB and multitenant design Azure Service BusBuilding an On-Device Voice Assistant with Microsoft Foundry Local
Why on-device voice still matters Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works — but it also means three round trips, three per-token bills, and three places your user's voice gets logged. The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini , you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress. This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable. What we're building A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000 . The page captures microphone audio, posts it to /api/transcribe , then streams the chat reply back over Server-Sent Events from /api/chat . All inference runs locally through two Foundry Local models loaded into the same process. The shape of the pipeline: Microphone (browser MediaRecorder) │ WebM/Opus blob ▼ Client-side WAV encoder (16 kHz, mono, PCM-16) │ multipart/form-data ▼ FastAPI /api/transcribe │ ▼ Nemotron Speech Streaming En 0.6B (Foundry Local audio client) │ transcript text ▼ Chat LLM e.g. qwen2.5-0.5b (Foundry Local chat client) │ streamed tokens ▼ FastAPI /api/chat → SSE → browser bubble The version that bit us: foundry-local-sdk >= 1.1.0 Before any code, the single most important fact about this project: The Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found . The module name also changed in 1.1.0 — it is now foundry_local_sdk (with the underscore- sdk suffix), not foundry_local . The pip wheel for foundry-local-core is bundled, so there is no separate MSI / winget install to worry about. Pin it explicitly: pip install --upgrade "foundry-local-sdk>=1.1.0,<2" And verify before anything else: python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))" # expect: sdk 1.1.0 Loading both models from one manager The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client — get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client. The wrapper used in the repo ( src/foundry_client.py ) does this: from foundry_local_sdk import Configuration, FoundryLocalManager FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron")) manager = FoundryLocalManager.instance chat_model = manager.load_model("qwen2.5-0.5b") stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b") chat_client = chat_model.get_chat_client() audio_client = stt_model.get_audio_client() Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB. The transcription surprise The first naive approach was the obvious one: with open(wav_path, "rb") as f: result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b") That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime. The fix is to use the streaming session API instead — a different native entry point ( core_interop.start_audio_stream ) that the streaming model does support. The repo isolates this in src/_nemotron_live.py : def transcribe_wav_live(audio_client, wav_path, *, language="en"): with wave.open(str(wav_path), "rb") as w: sample_rate = w.getframerate() channels = w.getnchannels() sample_width = w.getsampwidth() pcm = w.readframes(w.getnframes()) session = audio_client.create_live_transcription_session() session.settings.sample_rate = sample_rate session.settings.channels = channels session.settings.bits_per_sample = sample_width * 8 session.settings.language = language session.start() # Feed PCM in ~100 ms chunks from a worker thread, then stop. bytes_per_sec = sample_rate * channels * sample_width chunk_bytes = max(bytes_per_sec // 10, 1024) def _pusher(): try: for offset in range(0, len(pcm), chunk_bytes): session.append(pcm[offset:offset + chunk_bytes]) finally: session.stop() threading.Thread(target=_pusher, daemon=True).start() parts = [] for resp in session.get_stream(): for cp in getattr(resp, "content", []) or []: text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or "" if text: parts.append(text) return " ".join(p.strip() for p in parts if p.strip()).strip() Two things to notice: Push from a thread, read from the main coroutine. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel — otherwise you deadlock the session. Chunk to ~100 ms. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency. Always session.stop() . Without it the generator never terminates and the request hangs. The other transcription surprise: browsers don't send WAV Inside the browser, MediaRecorder defaults to audio/webm; codecs=opus . That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency — which is exactly the kind of friction this project exists to remove. The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus, so the page can decode the recording, resample to 16 kHz, mix to mono, and emit a PCM-16 WAV blob in 30 lines of JavaScript: // Inside src/static/index.html async function webmToWav(blob) { const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 }); const buf = await ctx.decodeAudioData(await blob.arrayBuffer()); // Mix to mono const ch = buf.numberOfChannels; const mono = new Float32Array(buf.length); for (let c = 0; c < ch; c++) { const data = buf.getChannelData(c); for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch; } return encodeWav(mono, 16000); } function encodeWav(samples, sampleRate) { const buffer = new ArrayBuffer(44 + samples.length * 2); const view = new DataView(buffer); // RIFF header writeStr(view, 0, "RIFF"); view.setUint32(4, 36 + samples.length * 2, true); writeStr(view, 8, "WAVE"); // fmt chunk writeStr(view, 12, "fmt "); view.setUint32(16, 16, true); // PCM chunk size view.setUint16(20, 1, true); // PCM format view.setUint16(22, 1, true); // mono view.setUint32(24, sampleRate, true); view.setUint32(28, sampleRate * 2, true); // byte rate view.setUint16(32, 2, true); // block align view.setUint16(34, 16, true); // bits per sample // data chunk writeStr(view, 36, "data"); view.setUint32(40, samples.length * 2, true); // PCM-16 samples let o = 44; for (let i = 0; i < samples.length; i++, o += 2) { const s = Math.max(-1, Math.min(1, samples[i])); view.setInt16(o, s < 0 ? s * 0x8000 : s * 0x7FFF, true); } return new Blob([view], { type: "audio/wav" }); } Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands them to transcribe_wav_live() — no audio decoding libraries on the Python side. Wiring it into FastAPI The server ( src/app.py ) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request: @app.post("/api/transcribe") async def transcribe(audio: UploadFile = File(...)): data = await audio.read() with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f: f.write(data); path = f.name text = _ai_client.transcribe(path) return {"text": text} @app.post("/api/chat") async def chat(req: ChatRequest): if req.stream: return StreamingResponse( _sse(_ai_client.stream_completion(req.messages)), media_type="text/event-stream", ) return {"text": _ai_client.chat_completion(req.messages)} Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost . What it looks like The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone. Performance, honestly This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation: First model load (cold cache): tens of seconds — downloads ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b . Subsequent loads (warm cache): a few seconds per model. End-to-end transcription of a 5-second utterance: well under a second after warm-up. First chat token from qwen2.5-0.5b : typically 200–500 ms; full short reply within 1–2 s. On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model. Trade-offs to know about Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning — the wrapper accepts any chat alias in the Foundry Local catalog. STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b . Multilingual variants are likely to follow. Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real. No agent framework yet. This sample is deliberately a single-turn loop over a chat client — there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour. Responsible AI considerations Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility: Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red ⏹ button and a "Recording…" banner for that reason. Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device. Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters. Try it Clone github.com/leestott/fl-nemotron. ./setup.ps1 (or ./setup.sh ) to create a virtualenv and install the pinned SDK. python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models. .venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 Open http://127.0.0.1:8000 in a real browser and click the 🎤 button. Where to go next Foundry Local documentation — official docs for the runtime, catalog, and SDK. microsoft/Foundry-Local — upstream samples and issue tracker. NVIDIA Nemotron model family — background on the speech and language models being published into the catalog. leestott/fl-nemotron — the full source for this post. Key takeaways Pin foundry-local-sdk >= 1.1.0 . Earlier SDKs cannot see the Nemotron Speech Streaming model. Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe() . Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS. Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks. A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process — no cloud, no keys, no data egress.Building AI Agents with Microsoft Foundry: A Progressive Lab from Hello World to Self-Hosted
AI agent development has a steep on-ramp. The combination of new SDKs, tool-calling patterns, model selection decisions, retrieval-augmented generation, and deployment concerns means most developers spend more time wiring things together than actually building anything useful. The Microsoft Foundry Agent Lab is a structured, open-source demo series designed to change that — nine self-contained demos, each adding exactly one new concept, all built on the same Microsoft Foundry SDK and a single model deployment. This post walks through what the lab contains, how each demo works under the hood, and the architectural decisions that make it a useful reference for AI engineers building production agents. Why a Progressive Lab? Agent frameworks can be overwhelming. A developer who opens a rich example with RAG, tool-calling, streaming, and a custom UI all at once has no clear line of sight to which parts are essential and which are embellishments. The Foundry Agent Lab takes the opposite approach: start with the absolute minimum and introduce one new primitive per demo. By the time you reach Demo 8, you have seen every major capability — not in one monolithic sample, but in a layered sequence where each addition is visible and understandable. # Demo New Concept Tool Used UX 0 hello-demo Agent creation, Responses API, conversations None Terminal 1 tools-demo Function calling, tool-calling loop, live API FunctionTool Terminal 2 desktop-demo UI decoupling — same agent, different surface None Desktop (Tkinter) 3 websearch-demo Server-side built-in tools, no client loop WebSearchTool Terminal 4 code-demo Code execution in sandbox, Gradio web UI CodeInterpreterTool Web (Gradio) 5 rag-demo Document upload, vector stores, RAG grounding FileSearchTool Terminal 6 mcp-demo MCP servers, human-in-the-loop approval MCPTool Terminal 7 toolbox-demo Centralized tool governance, Toolbox versioning Toolbox Terminal 8 hosted-demo Self-hosted agent with Responses protocol Custom server Terminal + Agent Inspector The Model Router: One Deployment to Rule Them All Before diving into the demos, it is worth understanding the one architectural decision that ties the entire lab together: every agent uses model-router as its model deployment. MODEL_DEPLOYMENT=model-router Model Router is a Microsoft Foundry capability that inspects each request at inference time and routes it to the optimal available model — weighing task complexity, cost, and latency. A simple factual question goes to a fast, cheap model. A complex tool-calling chain with code generation gets routed to a frontier model. You write zero routing logic. The lab's MODEL-ROUTER.md file contains empirical observations from running all nine demos. A sample of what the router selected: Demo Query Task Type Model Selected hello "What's the capital of WA state?" Factual recall grok-4-1-fast-reasoning hello "Summarize our conversation" Summarization gpt-5.2-chat-2025-12-11 tools "What's the weather in Seattle?" Tool-using gpt-5.4-mini-2026-03-17 code Data analysis with code generation Code generation + execution gpt-5.4-2026-03-05 rag HR policy document question Retrieval + synthesis gpt-5.3-chat-2026-03-03 This is the strongest signal in the lab: you do not need to reason about model selection. You declare what your agent needs to do; the router handles the rest, and it chooses correctly. Demo 0: The Minimum Viable Agent The hello-demo establishes the baseline pattern used by every subsequent demo. Two files: one to register the agent, one to chat with it. Registering the agent from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient from azure.ai.projects.models import PromptAgentDefinition credential = DefaultAzureCredential() project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, instructions="You are a helpful, friendly assistant.", ), ) Authentication uses DefaultAzureCredential , which works with az login locally and with managed identity in production — no API keys anywhere in the code. Chatting with the agent # Create a server-side conversation (persists history across turns) conversation = openai.conversations.create() # Each turn sends the user message; the agent sees full history response = openai.responses.create( input=user_input, conversation=conversation.id, extra_body={"agent_reference": {"name": AGENT_NAME, "type": "agent_reference"}}, ) print(response.output_text) The conversation object is server-side. You pass its ID on every turn; the history lives in Foundry, not in a local list. This is the Responses API pattern — distinct from the older Completions or Chat Completions APIs. Demo 1: Function Tools and the Tool-Calling Loop Demo 1 adds function calling against a real weather API. The key insight here is that the model does not execute the function — it requests the execution, and your code executes it locally, then feeds the result back. Declaring a function tool from azure.ai.projects.models import FunctionTool, PromptAgentDefinition func_tool = FunctionTool( name="get_weather", description="Get the current weather for a given city.", parameters={ "type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"], }, strict=True, ) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[func_tool], instructions="You are a weather assistant...", ), ) The tool-calling loop response = openai.responses.create(input=user_input, conversation=conversation.id, ...) # Loop while the model is requesting tool calls while any(item.type == "function_call" for item in response.output): input_list = [] for item in response.output: if item.type == "function_call": args = json.loads(item.arguments) result = get_weather(args["city"]) # execute locally input_list.append(FunctionCallOutput(call_id=item.call_id, output=result)) # Send results back to the agent response = openai.responses.create(input=input_list, conversation=conversation.id, ...) print(response.output_text) The strict=True parameter on FunctionTool enforces structured outputs — the model must return arguments that match the declared JSON schema exactly. This eliminates argument parsing errors in production. Demo 2: UI Is Not Your Agent Demo 2 runs the exact same agent as Demo 1 but surfaces it in a Tkinter desktop window. The point is pedagogical: your agent definition, conversation management, and tool-calling logic are entirely independent of your UI layer. Swapping from terminal to desktop requires changing only the presentation code — nothing in the agent or conversation path changes. This is a principle worth internalising early: agent logic and UI logic should never be entangled. The lab enforces this separation structurally. Demo 3: Server-Side Built-In Tools The web search demo introduces a sharp contrast with Demo 1. With WebSearchTool , the tool-calling loop disappears entirely from client code: from azure.ai.projects.models import WebSearchTool agent = project.agents.create_version( agent_name="Search-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[WebSearchTool()], instructions="You are a research assistant...", ), ) The agent decides when to search, executes the search server-side, and returns a grounded response with citations. Your client code looks identical to Demo 0 — a simple responses.create() call with no tool loop. The distinction matters architecturally: Function tools (Demo 1) — tool execution happens on your client; you control the code, the API call, the error handling. Built-in tools (Demo 3+) — tool execution happens inside Foundry; you get results without managing execution. Demo 4: Code Interpreter and the Gradio Web UI Demo 4 attaches CodeInterpreterTool , which gives the agent a sandboxed Python execution environment inside Foundry. The agent can write code, run it, observe output, and iterate — all server-side. Combined with a Gradio web interface, this demo shows an agent that can perform data analysis, generate charts, and explain results through a browser UI. Model Router is particularly interesting here: the empirical data shows it selects a more capable frontier model ( gpt-5.4-2026-03-05 ) for code-generation tasks, while simpler conversational turns stay on lighter models. Demo 5: Retrieval-Augmented Generation with FileSearchTool Demo 5 introduces RAG. The setup phase uploads a document, creates a vector store, and attaches it to the agent: # Upload document and create a vector store vector_store = openai.vector_stores.create(name="employee-handbook-store") with open("data/employee-handbook.md", "rb") as f: openai.vector_stores.files.upload_and_poll( vector_store_id=vector_store.id, file=f ) # Attach the vector store to the agent agent = project.agents.create_version( agent_name="RAG-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[FileSearchTool(vector_store_ids=[vector_store.id])], instructions="Answer questions using only the provided documents...", ), ) At query time, the agent embeds the question, searches the vector store semantically, retrieves matching chunks, and generates an answer grounded in the retrieved content — entirely server-side. The client code remains a plain responses.create() call. An important detail: the .vector_store_id file is written to disk during setup and read back during the chat session, so the demo survives process restarts without re-uploading the document. The .gitignore excludes this file from source control. Demo 6: Model Context Protocol Demo 6 connects the agent to a GitHub MCP server, giving it access to repository and issue data via the open Model Context Protocol standard. MCP servers expose tools over a standardised wire protocol; the agent discovers and calls them without any client-side function declarations. The demo also demonstrates human-in-the-loop approval: before executing any MCP tool call, the agent surfaces the proposed action and waits for the user to confirm. This is an important safety pattern for agents that can trigger side effects on external systems. Demo 7: Toolbox — Centralised Tool Governance Where Demo 6 connects to a single MCP server directly, Demo 7 uses a Toolbox — a managed Microsoft Foundry resource that bundles multiple tools into a single, versioned, MCP-compatible endpoint. The Toolbox in this demo exposes both GitHub Issues and GitHub Repos tools, curated into an immutable versioned snapshot. This pattern is significant for production multi-agent systems: Centralised governance — one team owns the tool definitions; all agents consume them via a single endpoint. Versioned snapshots — promoting a new Toolbox version is explicit; agents pin to a version and upgrade intentionally. MCP compatibility — any MCP-capable agent or framework can connect, not just Foundry SDK agents. from azure.ai.projects.models import McpTool toolbox_tool = McpTool( server_label="toolbox", server_url=TOOLBOX_ENDPOINT, allowed_tools=[], # empty = all tools in the Toolbox version headers={"Authorization": f"Bearer {token}"}, ) Demo 8: Self-Hosted Agent with the Responses Protocol The final demo departs from the prompt-agent pattern. Instead of registering a declarative agent in Foundry, Demo 8 implements a custom agent server using the Responses protocol. The server exposes a streaming HTTP endpoint; Foundry's Agent Inspector can connect to it and route user turns to it just as it would to a hosted prompt agent. This demo includes a Dockerfile and an agent.yaml , enabling deployment to Foundry's container hosting service. It uses gpt-4.1-mini directly rather than the model router, because the custom server owns the entire inference path. When to consider this pattern: Your agent requires custom pre- or post-processing logic that cannot be expressed in a system prompt. You need to integrate with infrastructure that is not reachable through MCP or built-in tools. You want to own the inference call for cost control, A/B testing, or compliance reasons. You are building a multi-agent orchestrator that needs to expose itself as an agent to other orchestrators. Getting Started The lab requires Python 3.10 or higher, an Azure subscription with a Microsoft Foundry project, and the Azure CLI. 1. Clone and set up the virtual environment git clone https://github.com/microsoft-foundry/Foundry-Agent-Lab.git cd Foundry-Agent-Lab # Create and activate the virtual environment python -m venv .venv # Windows Command Prompt .venv\Scripts\activate.bat # Windows PowerShell .venv\Scripts\Activate.ps1 # macOS / Linux source .venv/bin/activate pip install -r requirements.txt 2. Configure a demo copy hello-demo\.env.sample hello-demo\.env # Edit hello-demo\.env and set PROJECT_ENDPOINT Your PROJECT_ENDPOINT is on the Overview page of your Foundry project in the Azure portal. It takes the form https://your-resource.ai.azure.com/api/projects/your-project . 3. Run the demo az login 0-hello-demo Each numbered batch file at the root activates the virtual environment, runs create_agent.py , and launches chat.py . Append log to capture the full session transcript: 0-hello-demo log Reset between runs hello-demo\reset.bat Every demo includes a reset.bat that deletes the registered agent and any associated resources (vector stores, uploaded files). Demos are fully repeatable. Architecture Principles Demonstrated Across the nine demos, the lab illustrates a set of design principles that apply directly to production agent systems: Keyless authentication throughout Every demo uses DefaultAzureCredential . No API keys appear anywhere in the code. Locally, az login provides credentials. In production, managed identity takes over automatically — same code, no secrets to rotate. Server-side conversation state The Responses API stores conversation history server-side. Your application passes a conversation ID; Foundry maintains the thread. This eliminates the common bug of truncating history due to local list management and makes multi-process or multi-instance deployments straightforward. Client-side vs server-side tool execution The lab makes the distinction explicit. Function tools execute in your process — you control the code, the external call, and the error handling. Built-in tools (WebSearch, CodeInterpreter, FileSearch) execute inside Foundry — you get results without managing execution infrastructure. MCP tools (Demo 6, 7) fall between these: they execute in a separately deployed server, with the protocol mediating the call. Progressive tool introduction Each demo's create_agent.py registers the agent once. The chat.py file handles the conversation loop. These two responsibilities are always separate, making it easy to update agent definitions without modifying conversation logic, and vice versa. Security Considerations When building agents for production, keep the following in mind: Never commit .env files. The .gitignore excludes them, but verify this before pushing. Use Azure Key Vault or environment variable injection in CI/CD pipelines. Use managed identity in production. DefaultAzureCredential automatically picks up managed identity when deployed to Azure, eliminating the need for any stored credentials. Apply human-in-the-loop for side-effecting tools. Demo 6 demonstrates this pattern for MCP tool calls. Any agent that can modify external state (create issues, send emails, write files) should surface proposed actions for confirmation. Validate tool outputs before use. Treat data returned by external tools (weather APIs, search results, document retrieval) as untrusted input. Prompt injection through tool results is a real attack surface; grounding instructions in your system prompt reduce but do not eliminate this risk. Scope Toolbox permissions narrowly. When using a Toolbox (Demo 7), use allowed_tools to restrict which tools the agent can call, rather than granting access to all tools in a Toolbox version. Key Takeaways Start with the minimum. A prompt agent with no tools requires fewer than 30 lines of code using the Foundry SDK. Add tools only when the use case demands them. Use model-router unless you have a specific reason not to. The empirical data in the lab shows the router selects appropriate models across all task types — factual, creative, tool-calling, RAG, and code generation. Understand the client/server tool boundary. Function tools give you control; built-in tools give you simplicity. MCP and Toolbox give you governance and interoperability. Choose based on where you need control and where you need scale. Conversation state belongs on the server. Do not maintain conversation history in application memory if you can avoid it. The Responses API conversation object is designed for this. The hosted-demo pattern is for when you need to own the inference path. For most use cases, a declarative prompt agent is sufficient and far simpler to operate. Next Steps Explore the repo: github.com/microsoft-foundry/Foundry-Agent-Lab Microsoft Foundry SDK documentation: learn.microsoft.com/azure/ai-studio/ Responses API quickstart: Prompt agent quickstart Model Router conceptual documentation: Model Router for Microsoft Foundry Model Context Protocol: modelcontextprotocol.io Azure Identity SDK (DefaultAzureCredential): azure-identity Python SDK The Foundry Agent Lab is open source under the MIT licence. Contributions, bug reports, and feature requests are welcome through GitHub Issues. See CONTRIBUTING.md for guidelines.OIDC vs SPN: Securing Azure Deployments with GitHub Actions & Terraform
From Secrets to Trust: Modernizing CI/CD Authentication When building infrastructure pipelines on Microsoft Azure using GitHub Actions and Terraform, one design choice quietly determines your entire security posture: How does your pipeline authenticate to Azure? For years, the answer was simple: Use a Service Principal (SPN) Store a client secret in GitHub Authenticate using credentials It works—but it doesn’t scale securely. This article walks through a real, production-ready implementation comparing: SPN (Client Secret – legacy pattern) OIDC (Federated Identity – modern standard) Backed by a working repo: WorkFlowBasedDeployment Architecture Overview This repository implements a workflow-driven Terraform deployment model with modular Azure infrastructure. Repository Structure .github/workflows/ deploy-infrastructure.yml # OIDC deployment deploy-infrastructure-spn.yml # SPN deployment destroy-infrastructure.yml # OIDC destroy destroy-infrastructure-spn.yml # SPN destroy Deployment/ main.tf providers.tf variables.tf terraform.tfvars modules/ Azure Resources Provisioned Resource Module Resource Group Virtual Network + NSGs vnet rg-network Storage Account sa rg-data Container Apps containerapps rg-compute AI Foundry aifoundry rg-data AI Search aisearch rg-data Azure Container Registry acr rg-compute Key Vault azkeyvault rg-data Monitoring azmonitor rg-compute Private Endpoints private_endpoints rg-network Authentication Models Service Principal (SPN) – The Traditional Way How it works Create App Registration Generate client secret Store it in GitHubTerraform authenticates using environment variables env: ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }} ARM_CLIENT_SECRET: ${{ secrets.AZURE_CLIENT_SECRET }} ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }} The problem Risk Impact Long-lived secrets Can be leaked Manual rotation Operational burden Repo compromise Full environment exposure This model is still supported—but increasingly considered legacy for secure pipelines. OIDC (OpenID Connect) – The Modern Approach How it works GitHub Actions generates a short-lived identity token Microsoft Entra ID validates it Azure issues a temporary access token Terraform executes using that token No secrets. No storage. No rotation. Authentication Models Compared OIDC Flow (Mental Model) Think of OIDC like this: GitHub → Identity Provider Azure → Trust Authority Workflow → Temporary Identity OIDC Implementation (From the Repo) Workflow Configuration permissions: id-token: write contents: read env: ARM_CLIENT_ID: ${{ secrets.AZURE_CLIENT_ID }} ARM_SUBSCRIPTION_ID: ${{ secrets.AZURE_SUBSCRIPTION_ID }} ARM_TENANT_ID: ${{ secrets.AZURE_TENANT_ID }} ARM_USE_OIDC: true Azure Login - name: Azure Login (OIDC) uses: azure/login@v2 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} Backend (Terraform State with OIDC) terraform init \ -backend-config="use_oidc=true" Even your state storage is secretless Azure Setup for OIDC Create App Registration No client secret required Configure Federated Credential Example: Issuer: https://token.actions.githubusercontent.com Subject: repo:<org>/<repo>:ref:refs/heads/master You can restrict by: Branch Environment Repository Assign RBAC: Grant roles like: Contributor Or scoped resource-level access CI/CD Workflow Design Both SPN and OIDC pipelines follow a 2-stage pattern: Plan Stage terraform fmt terraform validate terraform plan Upload plan artifact Apply Stage Triggered only on main Downloads plan Runs apply -auto-approve Protected via environment approvals This ensures safe, auditable deployments OIDC vs SPN — Real Comparison Feature SPN OIDC Secrets Stored in GitHub None Token lifetime Long-lived Short-lived Rotation Manual Not required Security Medium High Setup Simple Slightly complex Recommended No Yes Common Pitfalls (Real-World Lessons) Missing id-token permission Without this, OIDC fails silently. Federated credential mismatch Wrong branch Incorrect repo name Case sensitivity issues Azure rejects the token completely. RBAC delay Role assignments can take time → causes confusing failures. Backend misconfiguration Forgetting use_oidc=true breaks Terraform state auth. Debugging Tips Enable debug logs in GitHub Actions Check Sign-in logs in Microsoft Entra ID Validate federated credential subject format Always isolate: Identity issue vs Permission issue Migration Strategy (SPN → OIDC) A safe transition looks like this: Keep SPN as fallback Add OIDC alongside Test in DEV environment Remove client secret Revoke old credentials No downtime, no risk. Where This Fits in Modern Azure Architecture This pattern integrates naturally with: Azure Container Apps AI/ML workloads (AI Foundry, Search) Multi-environment deployments Zero-trust enterprise architectures Authentication becomes identity-driven, not secret-driven When NOT to Use OIDC Legacy CI/CD systems without OIDC support Organisations with strict identity federation constraints Cross-tenant scenarios with limited trust setup Note: These cases are becoming increasingly rare in modern cloud setups. Security Perspective Threat SPN Risk OIDC Risk Secret leak High None Credential reuse High Low Token replay Possible Limited Repo compromise Full access Scoped Final Takeaway This repository demonstrates a key shift in modern DevOps: Secrets were a workaround for identity. OIDC replaces that workaround with trust. By combining: GitHub Actions OIDC federation Azure RBAC You get: Secure pipelines Scalable deployments Zero secret management In enterprise environments, moving to OIDC can eliminate secret rotation pipelines entirely, reducing operational overhead and significantly lowering breach risk. Reference Implementation GitHub Repository: WorkFlowBasedDeployment Closing Thought OIDC doesn’t just improve authentication, it fundamentally changes how trust is established in cloud systems. In a world moving toward zero-trust architectures, identity is the new perimeter and OIDC is how you enforce it.Building a Controllable Inference Platform on Kubernetes with AI Runway
When enterprises move generative AI from demos to real business workloads, the hardest question is usually not whether a model can answer a prompt. The harder question is whether the whole system can run reliably, predictably, securely, and economically over time. This becomes especially important as major model providers continue to adjust token pricing, context-window pricing, batching discounts, and model tiering. That is where AI Runway becomes valuable. It turns model deployment into a Kubernetes-native platform capability. Instead of binding every application to a specific inference runtime, AI Runway lets teams describe model-serving intent through a unified ModelDeployment resource, while the platform selects or delegates to the right provider and engine underneath. For teams already using Kubernetes, AKS, or cloud-native platform engineering practices, AI Runway offers a practical path from “calling an external model API” to “operating an enterprise inference platform.” Why do we need a self-hosted inference platform? Many teams have already proven the value of LLMs in knowledge assistants, code generation, content creation, customer support, document processing, and agentic workflows. But once usage grows, several platform-level issues appear quickly. 1. Token cost becomes an engineering problem In a proof of concept, token usage often looks like a small budget line. In production, it becomes an architectural concern. A single RAG request may include system prompts, user input, retrieved context, tool outputs, and the final answer. An agentic workflow may call models many times for planning, routing, summarization, validation, and generation. An internal Copilot used by hundreds of employees can generate token consumption at a scale that surprises the original project team. External model API cost is also affected by model versions, input/output token ratios, context length, caching policies, batch processing, and provider pricing strategy. When model vendors change pricing, enterprises without an alternative path become price takers. Self-hosted inference does not mean replacing every external model. It means creating a controllable platform layer for high-frequency, predictable, localized, or privacy-sensitive workloads. Scenario Why self-hosted inference helps High-frequency internal Q&A Large request volume can be served by smaller or quantized models Document summarization and extraction Stable task pattern, suitable for specialized local models Agent intermediate steps Planning, classification, and rewriting may not require the strongest closed model Edge or private-network workloads Data may need to stay inside a controlled boundary Cost-sensitive applications CPU/GPU resource pools, batching, and model tiering can reduce unit cost 2. Data boundaries and compliance become clearer Many enterprises are willing to use cloud-hosted models, but they also need clear controls for data classification, access boundaries, logging, and auditing. A self-hosted inference platform allows sensitive documents, internal knowledge bases, customer interactions, and business context to remain inside a governed network and operational model. 3. Teams should not be locked into one engine Inference engines are evolving quickly. vLLM, SGLang, TensorRT-LLM, and llama.cpp serve different needs. Some are optimized for high-throughput GPU serving. Some are better for structured serving or NVIDIA GPU acceleration. Others make GGUF quantized models practical on CPU or lightweight GPU environments. A platform should not force every team into one runtime. It should provide a unified entry point and absorb runtime differences underneath. 4. Production AI requires model operations, not just endpoints Production workloads need deployment lifecycle management, status, logs, metrics, scaling, debugging, progressive rollout, resource quotas, and secure ingress. A self-hosted inference platform should prevent every team from handcrafting runtime-specific YAML and instead provide these capabilities as shared platform primitives. What is AI Runway? AI Runway is a Kubernetes-native platform for deploying and managing large language models. Its core idea is to describe model deployment intent through a unified Kubernetes CRD called ModelDeployment. The AI Runway Controller then selects or delegates to provider-specific controllers based on provider capabilities. The project describes itself as: Deploy and manage large language models on Kubernetes — no YAML required. AI Runway supports a Web UI, REST API, Headlamp Plugin, and direct CRD management with kubectl. The UI is optional and replaceable; the core platform capability lives in the controller, CRDs, and provider abstraction. Key capabilities Capability Value Unified ModelDeployment CRD One API for model, engine, resources, scaling, and gateway configuration Multiple providers Supports KAITO, NVIDIA Dynamo, KubeRay, llm-d, and provider shims Multiple engines Supports vLLM, SGLang, TensorRT-LLM, and llama.cpp Automatic provider and engine selection Matches CPU/GPU requirements, serving mode, and provider capability Web UI and Headlamp Plugin Simplifies model discovery, deployment, and monitoring Hugging Face integration Enables model catalog browsing and deployment Observability Surfaces deployment status, logs, and Prometheus metrics Gateway API integration Enables unified OpenAI-compatible routing through a gateway Cost and capacity guidance Helps with GPU fit, pricing, and capacity decisions Multi-engine support is the key differentiator AI Runway is not just another model deployment tool. Its most important value is decoupling application developers from inference runtime decisions. Applications can call an OpenAI-compatible endpoint or a unified gateway, while the platform decides which engine and provider should serve a particular model. Engine Typical use case Resource target vLLM High-throughput general LLM serving GPU SGLang Complex inference workflows and structured serving GPU TensorRT-LLM Highly optimized inference on NVIDIA GPUs GPU llama.cpp GGUF quantized models and lightweight inference CPU / GPU For teams, this is an important story: instead of forcing every team into the same runtime, AI Runway creates a common platform where different workloads can choose different engines while keeping the developer experience consistent. AI Runway architecture overview The following Mermaid diagram shows a simplified view of the AI Runway platform layers. Three design points matter most: Unified control plane: users submit ModelDeployment resources instead of handcrafting YAML for each runtime. Out-of-tree providers: KAITO, Dynamo, KubeRay, and llm-d declare their capabilities through provider shims and controllers. Replaceable runtime layer: the same platform can serve CPU-based llama.cpp models and GPU-based vLLM or TensorRT-LLM workloads. Solution 1: Local Kubernetes with AI Runway, KAITO, and CPU Local Kubernetes is ideal for learning, demos, development validation, and small-model prototyping. The goal is not maximum throughput. The goal is to prove that AI Runway + KAITO + llama.cpp can expose an OpenAI-compatible model service without requiring a GPU. When to use this pattern Scenario Description Local developer experiments Use kind, minikube, k3d, or Docker Desktop Kubernetes Platform demos Show the ModelDeployment, provider, and OpenAI-compatible API flow CPU-only validation No GPU or cloud resource required SLM / GGUF testing Use llama.cpp to serve quantized models For local CPU inference, allocate at least 4 vCPU and 12 GiB memory. Even small models need memory for runtime startup, model loading, KV cache, and context windows. Local architecture The local KAITO + CPU pattern is powerful for education and adoption: Developers learn the ModelDeployment abstraction without needing a GPU. The application does not need to know whether the backend is LocalAI, llama.cpp, or KAITO Workspace. CPU-only environments can still run lightweight and quantized models. Teams can validate models, prompts, and API behavior locally before moving to AKS or production clusters. Sample Guideline - https://gist.github.com/kinfey/28b2338845cc63139aee2ea462a3c466 Solution 2: Azure with AKS, AI Runway, KAITO, and CPU After local validation, the next step is usually a cloud-hosted inference platform. AKS provides managed Kubernetes control plane, node pools, networking, identity, monitoring, and Azure ecosystem integration. It is a natural foundation for AI Runway in production or pre-production environments. The example below uses CPU-only AKS + KAITO + Qwen3-0.6B GGUF to build a cloud-hosted inference service without GPU nodes. Azure architecture Production recommendations for AKS Area Recommendation Secure ingress Do not expose plain HTTP 80 directly; add TLS, API keys, OAuth2 Proxy, WAF, or internal LoadBalancer Model governance Pin model versions, image versions, and GGUF filenames Cost governance Use CPU for lightweight tasks and GPU for high-throughput large models Observability Integrate Azure Monitor, Prometheus, logs, and request-level metrics Quota planning Check regional vCPU/GPU quota before deployment Caching Use PVCs or model cache volumes to reduce repeated downloads GitOps Manage ModelDeployment, providers, and ingress through GitOps Access control Use namespaces, RBAC, and NetworkPolicy for team isolation Sample Guideline - https://gist.github.com/kinfey/d439a545d8c93e15d8a2854b65f03d4d How to evangelize AI Runway inside an engineering organization When introducing AI Runway, I would avoid starting with “we are building our own model platform.” A more effective narrative is: Start with cost predictability: high-frequency workloads should not all depend on the most expensive external model tier. Emphasize technical optionality: teams can use different models and engines while keeping a unified platform entry point. Highlight Kubernetes-native operations: existing AKS, RBAC, monitoring, GitOps, networking, and security practices can be reused. Use CPU demos to lower the barrier: local KAITO + CPU lets developers understand the full flow without GPUs. Use Azure as the production landing zone: AKS carries the same abstraction into cloud environments and can evolve toward GPU, gateway, monitoring, and multi-tenant governance. This path avoids starting with GPU procurement, complex scheduling, or full-scale platform governance. Start small, prove the abstraction, then add higher-performance engines and stronger governance as the platform matures. Closing thoughts As AI applications enter production, enterprises need more than a model that can answer prompts. They need an inference platform that is controllable, observable, scalable, and evolvable. AI Runway brings this problem back into the Kubernetes platform engineering world: use ModelDeployment to standardize model deployment, use providers to hide runtime differences, and use multiple engines to match different cost and performance goals. From a local Kubernetes KAITO + CPU demo to a Qwen3-0.6B CPU inference service on AKS, AI Runway provides a clear adoption path: start with a low-barrier setup, then evolve toward multi-model, multi-engine, multi-provider, unified-gateway, enterprise-governed inference. In a world where token pricing changes frequently and model ecosystems evolve rapidly, a self-hosted inference platform is not about rejecting external models. It is about giving engineering teams more control over cost, architecture, and technical choice. References AI Runway GitHub: https://github.com/kaito-project/airunway AI Runway Architecture: https://github.com/kaito-project/airunway/blob/main/docs/architecture.md AI Runway Providers: https://github.com/kaito-project/airunway/blob/main/docs/providers.md AI Runway CRD Reference: https://github.com/kaito-project/airunway/blob/main/docs/crd-reference.md KAITO: https://github.com/kaito-project/kaito LocalAI: https://localai.io AKS Application Routing: https://learn.microsoft.com/azure/aks/app-routing Qwen3-0.6B GGUF: https://huggingface.co/Qwen/Qwen3-0.6B-GGUF228Views0likes0Comments