azure ai foundry

76 Topics

Building Agentic Systems on Azure: Microsoft Foundry Agents SDK vs Microsoft Agent Framework
In my recent experience as a Senior Consultant at Microsoft, I’ve been actively involved in designing and delivering AI-driven solutions, with a strong focus on building intelligent agents using modern frameworks. Along the way, I've built agents using both Microsoft Foundry Agents SDK (hereafter "Agents SDK") and Microsoft Agent Framework (MAF). Both approaches are powerful and capable. However, once you move beyond simple proofs of concept, the developer experience and architectural patterns start to differ significantly. This article provides a practical comparison based on real implementation experience and aims to help developers choose the right approach.

Approach 1: Agents SDK
Agents SDK provides a straightforward way to create agents with integrated tools and models.

Example: Creating an Agent
import os

from azure.ai.projects import AIProjectClient
from azure.ai.agents.models import AzureAISearchTool, AzureAISearchQueryType
from azure.identity import DefaultAzureCredential

client = AIProjectClient(
    credential=DefaultAzureCredential(),
    endpoint=os.getenv("AZURE_AI_PROJECT_ENDPOINT"),
)

# Configure tools
# conn_id is the ID of your Azure AI Search connection; it is not defined in this snippet.
ai_search = AzureAISearchTool(
    index_connection_id=conn_id,
    index_name="my-index",
    query_type=AzureAISearchQueryType.SEMANTIC,
)

# Create agent (persisted in Foundry portal)
agent = client.agents.create_agent(
    model=os.getenv("AZURE_AI_AGENT_DEPLOYMENT_NAME"),
    name="MyAgent",
    instructions="You are a helpful assistant.",
    tool_resources=ai_search.resources,
    tools=ai_search.definitions,
)

# Run conversation
thread = client.agents.threads.create()
client.agents.messages.create(thread_id=thread.id, role="user", content="Hello")
run = client.agents.runs.create(thread_id=thread.id, agent_id=agent.id)

What this approach provides
Native integration with Azure AI services (OpenAI, AI Search, MCP)
Managed execution environment
Simple and quick agent setup

Conceptually, this approach can be summarized as: Model + Tools + Execution

Strengths
✅ Rapid development and onboarding
✅ Strong
integration within the Azure ecosystem ✅ Well-suited for single-agent or tool-driven use cases ✅ Minimal infrastructure overhead Challenges observed in practice As the complexity of scenarios increases, certain limitations become more visible: Multi-agent workflows require custom orchestration logic Agent handoffs must be implemented manually Context sharing across agents requires additional design effort While this approach offers flexibility, it shifts orchestration complexity to the developer. Approach 2: Microsoft Agent Framework (MAF) Microsoft Agent Framework introduces a higher-level abstraction, focused on agent orchestration and system design. Creating an Agent from agent_framework import Agent, WorkflowBuilder, Message from agent_framework.foundry import FoundryChatClient from azure.identity import DefaultAzureCredential client = FoundryChatClient( project_endpoint=os.getenv("FOUNDRY_PROJECT_ENDPOINT"), model=os.getenv("FOUNDRY_MODEL_DEPLOYMENT_NAME"), credential=DefaultAzureCredential(), ) # Create agents (in-process only, not persisted in portal) researcher = Agent(client, name="ResearcherAgent", instructions="Research topics thoroughly.") writer = Agent(client, name="WriterAgent", instructions="Write concise summaries.") # Build and run multi-agent workflow workflow = WorkflowBuilder(start_executor=researcher).add_edge(researcher, writer).build() async for event in workflow.run(Message("user", "Summarize migration best practices"), stream=True): print(event.content) What this approach provides Built-in orchestration capabilities Native support for multi-agent workflows Structured agent lifecycle management Context and memory handling Conceptually, this can be viewed as: Agents + Orchestration + System Design Observations from implementation When implementing similar use cases using MAF: Agent responsibilities became clearly defined Routing and delegation patterns were significantly simplified Overall system architecture became easier to maintain and 
scale This approach encourages thinking in terms of agent ecosystems rather than isolated agents. Architecture Comparison Agents SDK Microsoft Agent Framework (MAF) Choosing the Right Approach Use Agents SDK when: You need rapid development for a single-agent use case The workflow is relatively straightforward You prefer flexibility and lower-level control Use Microsoft Agent Framework when: You are designing multi-agent systems Your solution requires routing, delegation, or handoffs Long-term scalability and maintainability are essential Pros and Cons Summary Agents SDK Pros Easy to get started Strong Azure integration Flexible design Cons Manual orchestration required Limited native multi-agent support Complexity increases as scenarios grow Microsoft Agent Framework (MAF) Pros Built-in orchestration Native multi-agent support Scalable and structured architecture Cons Learning curve for new developers More opinionated framework design Reduced low-level control compared to SDK-based approach References and Repositories 🔗 Microsoft Agent Framework (MAF) Microsoft Agent Framework – GitHub Repository Microsoft Agent Framework Samples – Tutorials & Examples Workflow Samples (Multi-agent patterns) FoundryChatClient sample (Python) Agent Framework demos - GitHub Source 📘 Documentation Microsoft Agent Framework Overview (Microsoft Learn) Agent Framework + Microsoft Foundry provider docs 🔗 Azure AI Projects / Agents SDK Azure AI Projects SDK – Python (GitHub Source) Azure AI Projects Agents (.NET SDK repo) 📘 Documentation Azure AI Projects SDK (Python) – Microsoft Learn Azure AI Agents SDK – Microsoft Learn Conclusion Azure AI Projects and Microsoft Agent Framework both play important roles in the modern agent development landscape. Agents SDK enables quick and flexible agent development Microsoft Agent Framework enables structured, scalable agent systems In practice, the choice depends on whether you are building a single agent feature or a multi-agent system. 
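To make the orchestration difference concrete, here is a minimal, framework-free sketch of the hand-written handoff logic that the Agents SDK approach leaves to the developer. It calls neither SDK: `researcher` and `writer` are hypothetical stand-ins for real agent invocations, and `Context` and `run_pipeline` are illustrative names rather than part of any Microsoft library.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class Context:
    """Shared state that the developer must thread between agents by hand."""
    user_request: str
    notes: list[str] = field(default_factory=list)


# Hypothetical stand-ins for real agent invocations (no SDK calls here).
def researcher(ctx: Context) -> str:
    return f"findings about: {ctx.user_request}"


def writer(ctx: Context) -> str:
    return f"summary based on {len(ctx.notes)} note(s): {ctx.notes[-1]}"


def run_pipeline(request: str, steps: list[Callable[[Context], str]]) -> str:
    """Run agents in order, appending each output to the shared context."""
    ctx = Context(user_request=request)
    output = ""
    for step in steps:
        output = step(ctx)
        ctx.notes.append(output)  # manual context sharing between agents
    return output


result = run_pipeline("migration best practices", [researcher, writer])
print(result)
```

Even in this toy form, ordering, context passing, and the handoff contract are all the developer's responsibility. That is the work a workflow graph such as the WorkflowBuilder example shown earlier is meant to absorb.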
Final Thought
Agents SDK helps you get started quickly. Microsoft Agent Framework helps you scale with confidence.

In a follow-up blog, I’ll dive into how the M365 Agents SDK compares with Microsoft Agent Framework, especially in the context of enterprise productivity and Copilot experiences.
ChaitanyaThalloory
Jun 08, 2026 · Microsoft Developer Community Blog
Harness-Driven Agents: Secure Podcast Pipeline in Hyperlight MicroVM Sandbox
The moment the agent reached for rm -rf For most of 2024 and 2025, "agents" were a demo word. By 2026 they are something you run — autonomously, in a loop, executing code they wrote themselves a second ago. I was watching one work late one night. I had given it a goal, a handful of tools, and the freedom to write and run its own Python. For twenty minutes it was magic: read a file, reason about it, write a script, run it, inspect the output, correct itself, try again. Then it produced this: import shutil shutil.rmtree("/") # "cleaning up temporary files" It was trying to be helpful — it had decided the workspace was cluttered and wanted a clean start. The "workspace," as far as that process was concerned, was my entire machine. I killed it in time. But the lesson is the one every agent builder eventually arrives at: the model is not the dangerous part — the execution is. A chatbot that answers wrong is annoying. An agent that fetches a web page, runs code, and writes files has a blast radius. The bounding box has to come from infrastructure, not from a system prompt. harnessagent_sandbox_demo is a concrete build that puts that bounding box in exactly the right place — and it does it in service of a real, charming little product: a daily five-minute Mandarin podcast about the FIFA World Cup 2026. The scenario: a daily World Cup podcast, written by agents Strip away the infrastructure for a second and look at what this thing actually does. Every day it produces a fresh Mandarin podcast script about the FIFA World Cup 2026. Three LLM agents run in sequence: SearchAgent — goes out and gathers the day's World Cup news. ContentAgent — turns that raw material into structured podcast content. GenScriptAgent — writes the final, readable five-minute script. The output is two text files — one in Simplified Chinese, one in Traditional Chinese: ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt That's the whole product. 
It sounds simple — and the point of the project is that making it safe is the hard part. SearchAgent has to reach the open internet. All three agents write and run code. If you wire that naively, you have just built the exact machine that types shutil.rmtree("/") for you. So the entire architecture is organized around one principle: the agents get to do real work, but every dangerous capability is fenced behind a hardware boundary. Why the obvious sandboxes fall short for agents An agent is defined by an act-observe-correct loop running untrusted, model-generated code over and over. That single property breaks most conventional isolation choices. Option Why it falls short for agents No sandbox One rm -rf, one leaked .env, one rogue network call — the blast radius is the whole machine. Container Great for shipping apps, but a coding agent wants to build and run its own container, which means Docker-in-Docker and elevated privileges that quietly undo the isolation. WASM / V8 isolate Fast to start, but you isolate a language runtime, not an OS — no system packages, no arbitrary shell, and hardening the engine is a moving target. Full VM Rock-solid isolation, but cold starts in seconds and heavy memory — exactly the friction that pushes developers to skip isolation entirely. Each option trades away safety, speed, or compatibility. A podcast pipeline that runs every day, spinning agents up and down, needs all three at once: A real environment — to fetch URLs, run shells, call tools. A hard boundary — so a bad step can't reach the host. Near-instant lifecycle — because a slow sandbox is a sandbox developers skip, and an unused safety feature protects nobody. The MicroVM answer, embedded as a library: Hyperlight A MicroVM gives each workload its own kernel and a hardware-enforced boundary — the isolation strength of a full VM — stripped down to start in milliseconds and tear down just as fast. Misbehave inside, and you hit a wall; there is no path back to the host. 
And it is disposable by design: when an agent goes off the rails, you delete the sandbox and reopen in milliseconds, with nothing to clean up. Most MicroVM runtimes (Firecracker and friends) are cloud infrastructure — server-side. Hyperlight is different: a lightweight Virtual Machine Manager (a CNCF sandbox project) designed to be embedded inside your application, like a library. MicroVMs that boot in milliseconds, with guest function calls completing in microseconds. No guest kernel, no OS — the guest is a purpose-built no_std Rust/C binary. Nothing in there to attack. Sandboxed by default — no filesystem, no network, nothing, unless explicitly granted. Typed function calls across the VM boundary, and snapshot/restore to rewind to a clean state between calls. Runs on KVM, MSHV (Microsoft Hypervisor), and Windows Hypervisor Platform. This project uses the Wasm backend: the three agents share a single HyperlightRuntime, and the guest is reset to a clean snapshot before every code execution. That detail is what makes a daily, many-step pipeline cheap — you capture the sandbox state once and rewind to it, instead of rebuilding a VM hundreds of times. Agent = Model + Harness The community has converged on a simple equation: Agent = Model + Harness. The model is a brain in a jar — text in, text out, no memory between calls, no loop, no hands. It can express the intent to call a tool; it cannot actually call it. The harness is the execution layer: it calls the model, handles its tool calls, and decides when to stop. As the Hugging Face glossary puts it, "if you're not the model, you're the harness." That reframes the safety problem precisely. When my agent emitted shutil.rmtree("/"), the model deleted nothing — it merely suggested. The harness would have run it. The harness is where reasoning meets reality, so it is exactly where safety must live. The question stops being "how do I make the model safer?" 
and becomes: how do I build a harness that executes the model's intent inside a boundary it cannot escape? The Microsoft Agent Framework answers that with first-class agent harness capabilities in Python and .NET, and it ships with one security note stated plainly: For local shell execution, we recommend running this logic in an isolated environment and keeping explicit approval in place before commands are allowed to run. The harness is the steering wheel — it does not pretend to be the seatbelt and the crumple zone. For that, it points you outward: run this somewhere isolated. Hyperlight is that isolated somewhere. This project snaps the two pieces together. The architecture: two planes, one bridge Here is the heart of the design. Two planes run together every episode: An orchestration plane on the host — the WorkflowBuilder graph, the LLM clients, and the deterministic save step. An execution plane inside one Hyperlight Wasm sandbox — the only place LLM-generated code is allowed to run. The single bridge between them is one call: call_tool("fetch_url", ...). 
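The two-plane shape can be sketched in plain Python. This is only an illustration of the call surface, under loud assumptions: `HOST_TOOLS`, `host_tool`, and this `execute_code` are hypothetical names, the `fetch_url` body is a placeholder, and `exec` with empty builtins is NOT a security boundary. In the actual demo the boundary comes from the Hyperlight sandbox, not from anything shown here.

```python
from typing import Any, Callable

# Host-side registry: the only functions the guest may reach.
HOST_TOOLS: dict[str, Callable[..., Any]] = {}


def host_tool(name: str):
    """Register a host function under a name the guest can call."""
    def register(fn):
        HOST_TOOLS[name] = fn
        return fn
    return register


@host_tool("fetch_url")
def fetch_url(url: str) -> dict:
    # Placeholder body: a real tool would perform an allow-listed request.
    return {"STATUS": "ok", "URL": url}


def call_tool(name: str, **kwargs: Any) -> Any:
    """The single bridge: dispatch a guest request to a registered host tool."""
    if name not in HOST_TOOLS:
        raise PermissionError(f"tool not granted: {name}")
    return HOST_TOOLS[name](**kwargs)


def execute_code(source: str) -> dict:
    """Stand-in for the execution plane: guest code is handed only call_tool."""
    guest_globals = {"__builtins__": {}, "call_tool": call_tool}
    exec(source, guest_globals)  # illustration only; not real isolation
    return guest_globals


guest = execute_code('page = call_tool("fetch_url", url="https://www.bbc.com/sport")')
print(guest["page"]["STATUS"])
```

The design point is that the guest never receives a network client, only a name-based dispatcher whose registry lives on the host. Granting a new capability is then an explicit host-side decision.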
The mapping to layers:

Layer | Component | Role
Model | Azure AI Foundry via FoundryChatClient (AzureCliCredential) | The reasoning brain behind each harness agent
Agent runtime | Microsoft Agent Framework create_harness_agent | Drives the model, advertises skills, handles tool calls, decides when to stop
Orchestration | WorkflowBuilder graph | prepare → SearchAgent → adapt → ContentAgent → adapt → GenScriptAgent → save_scripts
Code execution | CodeAct provider | Runs model-written code via the one execute_code tool, inside the MicroVM, never on the host
Isolation | Hyperlight Wasm MicroVM | One shared HyperlightRuntime; clean snapshot restored before every execute_code
Host tool | fetch_url (sandbox/podcast_tools.py) | The only network path; urllib + a BBC-only allow-list
Persistence | save_scripts Executor | Deterministic, no LLM; parses two fenced blocks and writes the two output files

The four invariants that make it safe
The README is explicit about what the diagram guarantees. These four invariants are the whole security argument.

The model never sees the network. Its only tool is execute_code. Network access happens only when the guest itself runs call_tool("fetch_url", ...) from inside the sandbox. The model cannot reach the internet directly: it can only ask the guest to, and the guest can only reach BBC.

One sandbox per run, snapshot per call. All three agents share the same HyperlightRuntime. Before every execute_code, the guest is reset to a clean snapshot, so nothing one step does can leak into the next, and there is no VM to rebuild.

Two counter paths, and why there are two. The function_middleware (make_tool_call_recorder) sees the model-direct execute_code calls. But the inner, guest-initiated fetch_url is dispatched by Hyperlight straight to the FunctionTool, bypassing the middleware entirely. So a second counter, make_call_tool_counter(on_call=), bumps state["tool_call_counts"][<agent>]["fetch_url"] on every guest invocation.
Two observation points, because the architecture has two genuinely different call surfaces.

Deterministic save, with no LLM in the persistence step. GenScriptAgent only emits text. The save_scripts Executor parses the two fenced code blocks out of that text and writes the simplified and traditional files itself. There is no model in the loop when bytes hit disk, so the output path is fully predictable.

Now let's look at the real code surface
The README documents the API the demo is built on. The snippets below reflect that surface.

1. Install and environment
pip install agent-framework-hyperlight --pre
# Hyperlight needs a hypervisor: KVM on Linux, WHP on Windows. macOS is not yet supported.
# The model runs on Azure AI Foundry; FoundryChatClient authenticates via AzureCliCredential.
az login
export HYPERLIGHT_PYTHON_GUEST_PATH="/path/to/python_guest"

2. A harness agent that carries only a stub: skills do the rest
Each of the three agents is built with create_harness_agent + FoundryChatClient. The agents themselves carry only a tiny stub instruction; their real role prompts and the shared sandbox/CodeAct guardrails live as file-based Agent Skills under skills/. The harness's built-in SkillsProvider advertises those SKILL.md packages, and the model loads them at runtime via load_skill.

from agent_framework import create_harness_agent
from agent_framework.foundry import FoundryChatClient
from azure.identity import AzureCliCredential

# Model on Azure AI Foundry, not Azure OpenAI directly.
client = FoundryChatClient(credential=AzureCliCredential())

# The agent carries a tiny stub. Its real persona ("you gather World Cup
# news", "you write the script") lives in a SKILL.md package under skills/,
# advertised by the harness SkillsProvider and pulled in via load_skill.
search_agent = create_harness_agent(
    chat_client=client,
    name="SearchAgent",
    instructions="You are a harness agent. 
Load your skill, then begin.",
)

3. The CodeAct surface: one tool the model can see
This is the CodeAct pattern from 02-agents/context_providers/code_act/code_act.py. The model sees exactly one tool, execute_code. Any extra capability (here, only fetch_url) is reachable from inside the guest via call_tool(...).

# What the MODEL sees and writes: one script, not ten tool round-trips.
#
# Inside execute_code, running in the Hyperlight Wasm guest:
page = call_tool("fetch_url", url="https://www.bbc.com/sport/football/world-cup")
# ... parse page["BODY"], pull out today's stories ...
print(top_stories)
#
# execute_code is the ONLY tool on the model's surface.
# call_tool("fetch_url", ...) is reachable only from inside the sandbox.

4. The one host tool, with a BBC-only allow-list
fetch_url lives on the host (sandbox/podcast_tools.py). It is the single bridge across the boundary, and it is deliberately narrow.

import urllib.request
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"bbc.com", "www.bbc.com"}  # allow-list: BBC only

def fetch_url(url: str) -> dict:
    """The ONLY network path out of the sandbox. Host-side, allow-listed."""
    host = urlparse(url).netloc
    if host not in ALLOWED_DOMAINS:
        return {"STATUS": "blocked", "URL": url}
    with urllib.request.urlopen(url, timeout=20) as resp:
        body = resp.read(8192).decode("utf-8", "ignore")  # BODY capped at ~8 KB
    # The _extract_* helpers are defined elsewhere in the project and are not shown here.
    return {
        "STATUS": "ok",
        "URL": url,
        "TITLE": _extract_title(body),
        "DESCRIPTION": _extract_description(body),
        "LINKS": _extract_links(body),
        "BODY": body,
    }

Notice what this buys you: even if SearchAgent writes hostile code, the worst it can do over the network is read BBC, 8 KB at a time. The allow-list is host-side and the model never sees it, so it cannot be prompt-injected away.

5. 
Wiring the graph and the deterministic save from agent_framework import WorkflowBuilder workflow = ( WorkflowBuilder() .add_node("prepare", prepare) .add_node("SearchAgent", search_agent) .add_node("adapt_1", adapt) .add_node("ContentAgent", content_agent) .add_node("adapt_2", adapt) .add_node("GenScriptAgent", genscript_agent) .add_node("save_scripts", save_scripts) # deterministic Executor, NO LLM .build() ) # GenScriptAgent emits text containing two fenced blocks (simplified + # traditional). save_scripts parses them and writes the files itself — # there is no model in the persistence step. await workflow.run() # -> ./outputs/<YYMMDD>/<YYMMDD>.simple.zh.txt # -> ./outputs/<YYMMDD>/<YYMMDD>.tranditional.zh.txt 6. The payoff Run that shutil.rmtree("/") inside this pipeline now and the result is delightfully boring: the agent deletes its own throwaway sandbox, the host never notices, and the next execute_code starts from a clean snapshot. Two things to call out: Snapshot/restore means every code execution starts from a clean, reusable baseline — capture state once, rewind between calls, instead of rebuilding the whole VM. For a daily pipeline that runs the act-observe-correct loop many times, that is the difference between "fast enough to always use" and "slow enough to skip." Because each agent writes one script instead of ten round-tripped tool calls, the CodeAct approach keeps both latency and token usage down — the model reasons once and lets the guest do the busywork behind the boundary. Where it fits, and the one idea to keep harnessagent_sandbox_demo lives inside Multi-AI-Agents-Cloud-Native — a gallery of patterns for running agent systems safely on Azure: A2A multi-agent orchestration, the Kubernetes sidecar pattern, hardened pipelines, and a sibling sample that runs Copilot agents on AKS inside Kata Containers MicroVMs at the pod level. 
And the README is explicit that this design is cloud-native: running it in-cluster on AKS changes nothing about the architecture — the same WorkflowBuilder graph, the same Hyperlight sandbox, the same deterministic save_scripts executor. The local build and the in-cluster build are the same shape. The two MicroVM samples are two ends of one spectrum. The Kata sample puts the boundary around the whole pod — a deployment topology. This Hyperlight demo pulls the boundary all the way into the agent process itself — the sandbox becomes a library call. Same question — where do you place the hardware boundary in an agent stack? — answered at two different altitudes. The old pitch for sandboxing always carried an asterisk: yes, it's safer, but you'll pay in speed, compatibility, or friction. MicroVMs erase the asterisk — VM-grade isolation, cold starts fast enough that there's no reason to skip it, and a real environment your agents can actually work in. Enough of a real environment, in fact, to write you a World Cup podcast every morning. The one idea to internalize: the harness decides, the MicroVM contains. Give your agent a room where it is allowed to fail — then let it be brilliant. References Project: harnessagent_sandbox_demo · Multi-AI-Agents-Cloud-Native Hyperlight: hyperlight-dev/hyperlight · hyperlight-dev/hyperlight-sandbox Agent Framework: Agent Harness in Microsoft Agent Framework Background: Why MicroVMs (Docker) · Harness vs. Scaffold glossary (Hugging Face) Install: pip install agent-framework-hyperlight --pre · .NET: dotnet add package Microsoft.Agents.AI.Hyperlight --prerelease Requirements: KVM (Linux) or WHP (Windows); macOS not yet supported.
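The deterministic save step described above (parse two fenced blocks, write two files, no model involved) can be sketched as follows. This is a minimal illustration, not the project's save_scripts Executor: the function signature, the regular expression, and the assumption that the simplified block comes first are all mine. The file names follow the article's output listing verbatim.

```python
import re
import tempfile
from datetime import date
from pathlib import Path

FENCE = "`" * 3  # built programmatically so the example is easy to embed
BLOCK_RE = re.compile(FENCE + r"[^\n]*\n(.*?)" + FENCE, re.DOTALL)


def save_scripts(agent_text: str, out_root: Path, day: date) -> list[Path]:
    """Parse exactly two fenced blocks and write them to dated files."""
    blocks = BLOCK_RE.findall(agent_text)
    if len(blocks) != 2:
        raise ValueError(f"expected 2 fenced blocks, found {len(blocks)}")
    stamp = day.strftime("%y%m%d")
    out_dir = out_root / stamp
    out_dir.mkdir(parents=True, exist_ok=True)
    # Assumed order: simplified first, traditional second.
    names = [f"{stamp}.simple.zh.txt", f"{stamp}.tranditional.zh.txt"]
    paths = []
    for name, body in zip(names, blocks):
        path = out_dir / name
        path.write_text(body.strip() + "\n", encoding="utf-8")
        paths.append(path)
    return paths


sample = f"{FENCE}text\n世界杯快讯\n{FENCE}\nnotes\n{FENCE}text\n世界盃快訊\n{FENCE}\n"
written = save_scripts(sample, Path(tempfile.mkdtemp()), date(2026, 6, 4))
print([p.name for p in written])
```

Failing loudly when the block count is wrong keeps the step predictable: malformed model output produces an error instead of a partially written episode.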
kinfey
Jun 04, 2026 · Microsoft Developer Community Blog
Deploying Foundry Hosted Agents via REST API
Learn how to deploy Microsoft Foundry Hosted Agents via REST API, with a practical walkthrough of the request flow, deployment pattern, and key implementation considerations for integrating hosted agents into real-world applications.
j_folberth
Jun 02, 2026 · Microsoft Developer Community Blog
Building and Operating a Microsoft Foundry Hosted Agent with GitOps and GitHub Tasks
The Gap Between Prototype and Production Most AI engineering teams can build a working agent in a day. The hard part is not building it; the hard part is operating it. Prompts drift. Tool configurations change without review. Deployments happen from someone's laptop. There is no audit trail, no rollback plan, and no consistent way to promote a change from a development environment to production. GitOps closes that gap. By treating your agent definition, configuration, and infrastructure as version-controlled source code, you get the same delivery discipline that software engineering teams have applied to application code for years. Every change is reviewed, every deployment is automated, and every environment state is traceable to a specific commit. This post shows you how to apply GitOps principles to a Microsoft Foundry Hosted Agent using GitHub as the source of truth and GitHub Tasks and Actions as the automation layer. The result is a repeatable, governed, production-ready delivery model for AI agents. What Is a Microsoft Foundry Hosted Agent? Microsoft Foundry is Microsoft's platform for building, deploying, and operating AI applications and agents. A Hosted Agent is an agent runtime managed by the Foundry platform rather than self-hosted by your team. You supply the agent logic, configuration, and tools; Foundry handles the runtime lifecycle, scaling, and managed infrastructure. In practical terms, a Foundry Hosted Agent is a containerised agent application. You package your agent code, prompt definitions, tool bindings, and environment configuration into a container image. Foundry deploys and manages that container within a Foundry project, connected to models, tools, and observability infrastructure that the platform provides. 
Teams choose Hosted Agents over self-hosting because: The platform manages runtime infrastructure, patching, and scaling Integration with Azure AI models, managed identity, and observability is built in You can focus engineering effort on agent logic rather than cluster management Foundry projects provide environment and resource isolation without requiring you to provision and manage separate Azure resources for each environment Hosted Agents are a good fit when your team wants strong operational support with minimal platform overhead, when you need clear separation between environments, and when your agents depend on Azure AI capabilities such as Azure OpenAI Service, Azure AI Search, or Model Context Protocol integrations. Why GitOps Matters Specifically for AI Agents GitOps is straightforward for stateless web services: the code changes, the pipeline runs, the container is deployed. AI agents are more complex because there are multiple distinct artefacts that all affect agent behaviour: System prompts and instruction files Tool definitions and external integrations Model selection and configuration (temperature, max tokens, safety settings) Model Context Protocol (MCP) server definitions Orchestration logic and agent workflow code Safety and policy settings Infrastructure and deployment configuration Any one of these can change the behaviour of your agent in ways that are difficult to detect without structured review. A prompt change that looks harmless can alter tone, scope, or factual grounding. A tool configuration change can expose data to unintended callers. A model upgrade can shift response quality unpredictably. Git gives you a single place to version, review, and approve all of these artefacts together. Pull requests give you a structured review gate. Workflow automation gives you validation before anything reaches a deployed environment. Tags and releases give you deployment markers you can roll back to. 
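Because several distinct artefacts shape agent behaviour, one way to make any change visible to a pipeline is a single fingerprint over all of them. The sketch below is an assumption-laden illustration: the `fingerprint` helper, the glob patterns, and the demo directory layout are hypothetical and not part of Foundry or GitHub tooling.

```python
import hashlib
import tempfile
from pathlib import Path


def fingerprint(root: Path, patterns: list[str]) -> str:
    """Hash the relative path and bytes of every matching file, in sorted order."""
    digest = hashlib.sha256()
    files = sorted({p for pat in patterns for p in root.glob(pat) if p.is_file()})
    for path in files:
        digest.update(path.relative_to(root).as_posix().encode("utf-8"))
        digest.update(b"\0")
        digest.update(path.read_bytes())
        digest.update(b"\0")
    return digest.hexdigest()


# Demo against a throwaway directory; the layout is an assumption.
root = Path(tempfile.mkdtemp())
(root / "prompts").mkdir()
(root / "prompts" / "system.txt").write_text("You are a helpful assistant.")
(root / "agent_config.json").write_text('{"temperature": 0.2}')

patterns = ["prompts/*.txt", "*.json"]
before = fingerprint(root, patterns)
(root / "prompts" / "system.txt").write_text("You are a terse assistant.")
after = fingerprint(root, patterns)
print(before != after)
```

A value like this could be recorded next to the image tag at deployment time, so that a one-word prompt edit is as traceable as a code change.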
The discipline of GitOps turns what is often an ad-hoc AI delivery process into a repeatable engineering practice. Reference Architecture The following diagram shows a practical reference architecture for delivering a Microsoft Foundry Hosted Agent through a GitOps model using GitHub. +---------------------------+ | GitHub Repository | | /src /agents /tools | | /prompts /infra | | /.github/workflows | +---------------------------+ | | Pull Request / Push to main v +---------------------------+ | GitHub Actions | | 1. Validate agent config | | 2. Lint and scan code | | 3. Run unit tests | | 4. Build container image | | 5. Push to registry | +---------------------------+ | | Image tag (SHA or semver) v +---------------------------+ | Azure Container Registry | | myregistry.azurecr.io | | my-agent:<sha> | +---------------------------+ | +------+------+ | | v v +----------+ +----------+ | Foundry | | Foundry | | Dev | | Test | | Project | | Project | +----------+ +----------+ | Approval gate (GitHub env) | v +----------+ | Foundry | | Prod | | Project | +----------+ | v +---------------------------+ | Observability | | Azure Monitor / App | | Insights / Foundry Logs | +---------------------------+ Key design decisions in this architecture: The GitHub repository is the single source of truth for all agent artefacts No human deploys directly to any Foundry project; all changes flow through automation Environment promotion requires a GitHub environment approval, creating a governance gate The container image is built once and promoted across environments; the image is not rebuilt per environment Secrets are stored in Azure Key Vault and accessed by the Foundry agent at runtime via managed identity Figure: GitOps delivery pipeline stages from commit to production Repository Structure A well-structured repository separates agent logic from infrastructure and tooling from prompts. 
The following structure works well in practice: my-foundry-agent/ ├── .github/ │ ├── workflows/ │ │ ├── validate.yml # Runs on every PR │ │ ├── build-deploy.yml # Runs on merge to main │ │ └── rollback.yml # Manual trigger workflow │ └── CODEOWNERS # Review assignments by path ├── src/ │ ├── agents/ │ │ ├── agent.py # Agent entry point and orchestration │ │ └── agent_config.json # Agent metadata and settings │ ├── tools/ │ │ ├── search_tool.py # Tool implementations │ │ └── data_tool.py │ └── prompts/ │ ├── system.txt # System prompt (versioned as plain text) │ └── instructions.txt # Supplementary instructions ├── tests/ │ ├── unit/ # Unit tests for tools and logic │ ├── integration/ # Integration tests against a running agent │ └── smoke/ # Post-deployment smoke tests ├── infra/ │ ├── main.bicep # Foundry project and resource definitions │ └── environments/ │ ├── dev.parameters.json │ ├── test.parameters.json │ └── prod.parameters.json ├── scripts/ │ ├── validate_agent.py # Config validation script │ └── smoke_test.py # Smoke test runner ├── Dockerfile # Container image definition └── docs/ └── architecture.md # Architecture and runbook documentation What belongs where and why: /src/prompts - System prompts as plain text files. Versioning prompts as files means every change goes through a pull request with a diff review, just as code does. /src/agents - Agent orchestration logic and configuration. Keeps the entry point and agent metadata co-located. /src/tools - Tool implementations separated from agent logic. Tool logic changes independently and should be reviewable in isolation. /infra - Infrastructure as code with per-environment parameter files. Environment-specific values live here, never in source files. /tests - Three layers of testing: unit tests for tools, integration tests for the full agent, and smoke tests that run against a deployed environment. /.github/workflows - All automation defined as code. 
There should be no manual deployment steps that live outside this directory. GitHub Tasks Across the Delivery Lifecycle GitHub Tasks and Issues provide the work tracking layer on top of the GitOps delivery model. Used well, they connect the intention behind a change to its implementation and deployment history. Practical patterns for using GitHub Tasks with agent delivery: Prompt change task - Open an issue to describe why the system prompt is changing. The pull request that changes system.txt closes that issue, creating a permanent link between the rationale and the diff. Tool integration task - When adding a new MCP server or external tool integration, create a task that captures the design decision, security review outcome, and test evidence before the pull request is merged. Model upgrade task - When upgrading the underlying model version, create a task that includes evaluation results and comparison data. The task becomes part of your change audit trail. Rollback task - If a deployment causes quality regressions, create a task to track the rollback, root cause investigation, and corrective action. Automation can open this task automatically when a deployment fails health checks. Dependency on approval - GitHub Tasks can be linked to environment approvals in GitHub Actions. A task in a specific milestone or project column can gate a promotion workflow. The key insight is that GitHub Tasks are not just work management; they are part of your audit trail. A regulatory or security reviewer can follow the chain from a production deployment back through workflow runs, pull request reviews, and the original task that described the intent of the change. End-to-End GitOps Flow The following walk-through describes a realistic developer experience for changing an agent prompt and promoting it to production. A developer opens a GitHub Issue describing the prompt change required and the expected behaviour improvement. 
The developer creates a feature branch, edits src/prompts/system.txt , and updates any related unit tests. A pull request is opened. The validate workflow runs immediately, checking prompt length, configuration schema, and lint rules. Unit tests run against the changed files. A code reviewer approves the pull request. The CODEOWNERS file ensures that prompt changes require review from the AI engineering team, not just any contributor. On merge to main, the build workflow runs: the container image is built with the new prompt baked in, tagged with the commit SHA, and pushed to Azure Container Registry. The deployment workflow deploys the new image to the Foundry Dev project automatically. Integration and smoke tests run against the deployed dev agent. If tests pass, the workflow pauses at the Test environment gate and requests approval from a named reviewer. After approval, the same image is deployed to Foundry Test. Smoke tests run again. A second approval gate controls promotion to Foundry Prod. If at any point a health check or smoke test fails, the rollback workflow redeploys the previous image tag from the registry. The image tag of the last known-good deployment is stored as a GitHub environment variable. This flow means that no human ever deploys directly to any environment. Every environment state is traceable to a specific commit, image tag, and workflow run. Security and Governance AI agents often have access to sensitive data and external systems. Security and governance cannot be an afterthought. Identity and Access Use managed identity for the Foundry Hosted Agent to access Azure resources. Avoid service principal secrets where Microsoft Entra Workload Identity or managed identity is available. Apply the principle of least privilege: the agent identity should have read access to data sources and limited write access only where the use case requires it. 
Tool integrations that require API keys or external credentials should retrieve them from Azure Key Vault at runtime, never from environment variables baked into the image. Secrets and Configuration Store secrets in Azure Key Vault. Reference them in your Foundry project configuration using Key Vault references. Store GitHub Actions secrets using repository or environment-scoped secrets. Never echo secrets in workflow logs. Separate environment configuration (endpoints, resource names, capacity settings) from agent logic. Use the /infra/environments/ parameter files for this. Auditability and Review Enforce pull request reviews for all changes to /src/prompts , /src/agents , and /infra via CODEOWNERS. Require status checks to pass before merging. Blocked merges prevent untested changes reaching production. GitHub's workflow run history gives you a complete deployment audit trail. You can answer "what was deployed to prod on Tuesday and who approved it" in seconds. For regulated environments, consider branch protection rules that require signed commits. Safe Rollout Use canary or blue-green patterns where Foundry supports them for high-traffic agents. Always keep the previous image tag available in the registry. Do not delete images on deployment. Document and test your rollback procedure before you need it in production. Observability and Operational Readiness A deployed agent that you cannot observe is an agent you cannot operate. Build observability in from the start. What to Monitor Deployment health - Track whether each Foundry deployment succeeded and the agent is responding. Wire deployment outcomes back to GitHub workflow run status. Model and tool errors - Log tool call failures, model timeout errors, and safety filter activations. Aggregate these in Azure Monitor or Application Insights. Latency - Track end-to-end response latency per agent version. A latency increase after a model or prompt change is an early signal of a quality regression. 
Token consumption - Monitor token usage per request and per session. Unexpected increases can indicate prompt injection or runaway orchestration loops. Traceability - Log which agent version handled each request. Correlation between the image tag and request traces is essential for debugging production issues. Debugging and Alerting Use structured logging with a consistent schema. Include fields for agent version, session ID, tool called, and outcome. Set up alerts for error rate thresholds and latency percentiles. Alert before users notice the problem. For failed agent runs, ensure logs capture the full conversation context (within your data retention policy) so that developers can reproduce and diagnose the failure. Microsoft Foundry Toolboxes One of the most important additions to the Foundry platform is Toolboxes, currently in Public Preview. If you have ever seen an agent codebase where three different agents each wire the same search tool with their own credentials and slightly different configurations, you already understand the problem Toolboxes solve. A Toolbox is a named, versioned bundle of tools managed centrally in Microsoft Foundry. You define the tools once, configure authentication and access centrally, and publish a single MCP-compatible endpoint. Any agent in any runtime consumes that endpoint without per-tool wiring, custom SDK integration, or duplicated credential management. Figure: Before and after Foundry Toolboxes. Each agent previously managed its own tool connections. With Toolboxes, agents connect to one governed endpoint. The Four Pillars Discover (coming soon) - Find approved tools without browsing long catalogues. Reduces duplication by surfacing what already exists before developers build something new. Build (available today) - Select tools into a named toolbox. 
Supported types include built-in tools (Web Search, Code Interpreter, File Search, Azure AI Search), MCP servers, Agent-to-Agent (A2A) endpoints, and OpenAPI-defined services. Consume (available today) - A single MCP-compatible endpoint exposes every tool in the toolbox to any agent runtime. Agents that can speak MCP can use a Foundry Toolbox without any Foundry-specific SDK dependency. Govern (coming soon) - Centralised authentication and observability applied to every tool call flowing through the toolbox. Security and platform teams get consistent controls without asking developers to bolt governance onto every agent individually. Toolboxes and GitOps: A Natural Fit Toolboxes are particularly well-suited to a GitOps delivery model because the toolbox definition is a discrete, versioned artefact. Instead of credentials and tool configuration scattered across agent codebases, the toolbox becomes its own managed entity with its own version history. The key design property is that the toolbox endpoint URL is stable. When you promote a new toolbox version to be the default, agents consuming the endpoint pick up the update without any code changes. This means you can update tool configuration, add a new MCP server, or rotate credentials in the toolbox without redeploying every agent that uses it. Figure: Toolbox versioning in a GitOps model. Commits trigger CI validation and deployment of new toolbox versions. The stable endpoint URL allows agents to consume updates without redeployment. Adding a Toolbox to Your Repository In your GitOps repository, toolbox definitions belong in /src/tools/toolbox_config.py or as a declarative configuration file checked into version control. The following example creates a toolbox that combines web search, Azure AI Search over internal documentation, and a GitHub MCP server: # src/tools/toolbox_config.py # Run this via CI to create or update a toolbox version in Foundry. 
from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient import os client = AIProjectClient( endpoint=os.environ["FOUNDRY_PROJECT_ENDPOINT"], credential=DefaultAzureCredential() ) toolbox_version = client.beta.toolboxes.create_toolbox_version( toolbox_name="customer-feedback-toolbox", description="Tools for triaging customer feedback: search, docs, and GitHub.", tools=[ { "type": "web_search", "description": "Search approved public documentation sites.", "custom_search_configuration": { "project_connection_id": os.environ["BING_CONNECTION_NAME"], "instance_name": os.environ["BING_INSTANCE_NAME"] } }, { "type": "azure_ai_search", "name": "product-manuals-search", "description": "Search internal product documentation.", "azure_ai_search": { "indexes": [ { "index_name": os.environ["SEARCH_INDEX_NAME"], "project_connection_id": os.environ["SEARCH_CONNECTION_ID"] } ] } }, { "type": "mcp", "server_label": "github", "server_url": "https://api.githubcopilot.com/mcp", "project_connection_id": os.environ["GITHUB_CONNECTION_ID"] } ], ) print(f"Toolbox version created: {toolbox_version.version}") print(f"MCP endpoint: {toolbox_version.mcp_endpoint}") To promote a toolbox version to be the default (the endpoint agents use without specifying a version), add this to your deployment workflow: # Promote toolbox version to default after validation toolbox = client.beta.toolboxes.update( toolbox_name="customer-feedback-toolbox", default_version=toolbox_version.version, ) print(f"Default version is now: {toolbox.default_version}") The stable endpoint for agents consuming this toolbox is: https://<your-project>.services.ai.azure.com/api/projects/<project>/toolbox/customer-feedback-toolbox/mcp?api-version=v1 Attaching the Toolbox to Your Hosted Agent In your agent code, connect to the toolbox via a single MCP tool definition. 
The agent gains access to every tool in the toolbox without knowing their individual configurations: # src/agents/agent.py (relevant excerpt) from agent_framework import MCPStreamableHTTPTool import httpx, os toolbox_endpoint = os.environ["FOUNDRY_TOOLBOX_ENDPOINT"] http_client = httpx.AsyncClient( auth=_ToolboxAuth(token_provider), # Microsoft Entra bearer token timeout=120.0, ) mcp_tool = MCPStreamableHTTPTool( name="toolbox", url=toolbox_endpoint, http_client=http_client, load_prompts=False, ) # Agent now has access to web search, AI Search, and GitHub MCP # through one tool definition and one authenticated connection. GitOps Workflow Extension for Toolboxes Add a dedicated job to your build-deploy workflow to create and promote toolbox versions as part of the same CI/CD pipeline: deploy-toolbox: name: Deploy Toolbox Version needs: validate runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Create toolbox version in Foundry env: FOUNDRY_PROJECT_ENDPOINT: ${{ vars.FOUNDRY_PROJECT_ENDPOINT_DEV }} BING_CONNECTION_NAME: ${{ vars.BING_CONNECTION_NAME }} BING_INSTANCE_NAME: ${{ vars.BING_INSTANCE_NAME }} SEARCH_INDEX_NAME: ${{ vars.SEARCH_INDEX_NAME }} SEARCH_CONNECTION_ID: ${{ vars.SEARCH_CONNECTION_ID }} GITHUB_CONNECTION_ID: ${{ vars.GITHUB_CONNECTION_ID }} run: python src/tools/toolbox_config.py Key points to note: Toolbox configuration is Python code in source control, reviewed through pull requests like any other change Connection IDs and index names are environment variables from GitHub Actions variables, not hardcoded in the script The same script runs for dev, test, and prod with different environment variable bindings Toolbox version promotion is a separate step from agent deployment, so you 
can update tools independently of the agent container Because the toolbox endpoint is stable, rolling back a toolbox version does not require rolling back the agent image Common Pitfalls Teams adopting this pattern commonly make the following mistakes. Identifying them early saves significant operational pain later. Treating prompts as unmanaged text. If your system prompt lives in a portal text box rather than a versioned file, you have no history, no review process, and no rollback capability. Move prompts into source control on day one. Deploying manually from the portal. Even one manual deployment breaks the GitOps contract. Your repository no longer reflects the true state of the environment. Automate everything and remove portal deployment permissions from individuals. Mixing environment configuration into source files. Hardcoded endpoint URLs or model deployment names in agent_config.json mean your dev and prod configurations diverge at the source level. Use parameter files and environment variables resolved at deployment time. Poor separation between agent logic and tool logic. When agents and tools are tightly coupled in a single file, a tool change requires a full agent review and redeployment. Keep them separate so they can evolve independently. Not versioning your Toolbox definition. Defining a Foundry Toolbox interactively through the portal gives you no audit trail and no rollback path. The toolbox configuration script belongs in source control alongside your agent code. Skipping evaluation before promotion. Deploying a prompt change without running a structured evaluation against a representative test set is how regressions reach production. Build evaluation into the pull request workflow, not just the deployment workflow. No rollback plan. If your first rollback is unplanned and urgent, it will be slow and stressful. Test your rollback procedure in a non-production environment and document the steps. Ignoring token and cost signals. 
AI workloads have variable cost profiles. A change that doubles average token consumption per request may be functionally correct but economically unsustainable. Monitor consumption as a first-class signal. Example GitHub Actions Workflow The following workflow runs on pull request validation and on merge to main. It covers the core delivery lifecycle: validate, build, deploy to dev, and smoke test. # .github/workflows/build-deploy.yml name: Build and Deploy Foundry Hosted Agent on: push: branches: - main pull_request: branches: - main env: REGISTRY: myregistry.azurecr.io IMAGE_NAME: my-foundry-agent jobs: validate: name: Validate Agent Configuration runs-on: ubuntu-latest steps: - uses: actions/checkout@v4 - name: Set up Python uses: actions/setup-python@v5 with: python-version: "3.12" - name: Install dependencies run: pip install -r requirements.txt - name: Validate agent config schema run: python scripts/validate_agent.py - name: Run unit tests run: pytest tests/unit/ -v - name: Lint code run: ruff check src/ build: name: Build and Push Container Image needs: validate runs-on: ubuntu-latest if: github.ref == 'refs/heads/main' permissions: id-token: write contents: read outputs: image_tag: ${{ steps.meta.outputs.version }} steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Log in to Azure Container Registry run: az acr login --name ${{ env.REGISTRY }} - name: Extract metadata id: meta uses: docker/metadata-action@v5 with: images: ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }} tags: | type=sha,format=short - name: Build and push image uses: docker/build-push-action@v7 with: context: . 
push: true tags: ${{ steps.meta.outputs.tags }} deploy-dev: name: Deploy to Foundry Dev needs: build runs-on: ubuntu-latest environment: dev permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_DEV }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Dev project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_DEV }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment dev - name: Run smoke tests against dev run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_DEV }} deploy-test: name: Deploy to Foundry Test needs: deploy-dev runs-on: ubuntu-latest environment: test permissions: id-token: write contents: read steps: - uses: actions/checkout@v4 - name: Azure login (OIDC) uses: azure/login@v3 with: client-id: ${{ secrets.AZURE_CLIENT_ID_TEST }} tenant-id: ${{ secrets.AZURE_TENANT_ID }} subscription-id: ${{ secrets.AZURE_SUBSCRIPTION_ID }} - name: Deploy agent to Foundry Test project run: | az ai foundry agent deploy \ --project ${{ vars.FOUNDRY_PROJECT_TEST }} \ --image ${{ env.REGISTRY }}/${{ env.IMAGE_NAME }}:${{ needs.build.outputs.image_tag }} \ --environment test - name: Run smoke tests against test run: pytest tests/smoke/ -v --base-url ${{ vars.AGENT_URL_TEST }} Key decisions in this workflow: Validation runs on every pull request, not just on merge. Fast feedback catches problems before review. The container image is built once and the image tag is passed forward to deployment jobs. The same artefact is promoted across environments. Authentication uses OIDC federated credentials via azure/login@v3 with id-token: write permissions. No long-lived secrets are stored in GitHub for Azure authentication. 
The environment: test directive in the deploy-test job triggers a GitHub environment approval gate. A named reviewer must approve before the job runs. Smoke tests run after every deployment. A failed smoke test prevents further promotion. Best Practices Checklist Use this checklist when adopting the GitOps pattern for a Microsoft Foundry Hosted Agent: All agent artefacts, including prompts, tool definitions, model configuration, and Toolbox configuration scripts, are committed to source control No manual deployments to any environment; all changes flow through GitHub Actions workflows Pull request reviews are enforced for all changes to agent logic, prompts, and infrastructure via CODEOWNERS Unit tests cover tool logic; integration tests cover end-to-end agent behaviour; smoke tests cover deployed environments Container images are built once per commit and promoted across environments; images are not rebuilt per environment Environment configuration (endpoints, resource names) lives in parameter files, never in source code Secrets are stored in Azure Key Vault and accessed via managed identity at runtime GitHub environment approval gates control promotion from dev to test to prod Foundry Toolboxes are used to centralise tool definitions, credentials, and access governance across all agents; the toolbox configuration script is version-controlled and deployed through CI/CD Toolbox versions are promoted via the update default_version API step in the deployment workflow, not manually through the portal Latency, error rate, and token consumption are monitored with alerting thresholds The rollback procedure is documented, automated, and has been tested in a non-production environment GitHub Issues are used to record the intent behind significant changes and link to the pull requests that implement them Branch protection rules prevent direct pushes to main and require status checks to pass before merge The previous image tag is retained in the registry and stored as a 
GitHub environment variable for rollback Conclusion A Microsoft Foundry Hosted Agent is not something you deploy once and forget. Prompts evolve, tools change, models are upgraded, and policy requirements shift. Every one of those changes has the potential to alter agent behaviour in ways that affect users, costs, and compliance posture. GitOps, implemented through GitHub and GitHub Tasks, gives you the operational discipline to manage that complexity. Source control for all artefacts. Pull request review for every change. Automated validation, build, and deployment. Environment promotion gates. A complete audit trail from task to production. These are not bureaucratic overhead; they are the foundation of reliable, trustworthy AI agent operations. The teams that operate AI agents well are the ones that treat them like production software from the start. The investment in pipeline, structure, and governance pays back every time a change goes smoothly, every time a rollback takes minutes rather than hours, and every time a security or compliance reviewer can answer their question from a pull request history rather than a support ticket. Build the discipline in early. Your future self, and your production environment, will benefit from it. References Microsoft Foundry documentation Microsoft Foundry Agent Service documentation Microsoft Foundry Toolboxes documentation Introducing Toolboxes in Foundry (Microsoft Developer Blog) GitHub Actions documentation GitHub Projects and Tasks documentation Azure Container Registry documentation Azure Key Vault documentation Microsoft Entra Managed Identities documentation OpenGitOps Principles
Lee_Stott
Jun 02, 2026 · Microsoft Developer Community Blog
Building an On-Device Voice Assistant with Microsoft Foundry Local
Why on-device voice still matters

Most "voice AI" tutorials assume your audio leaves the machine. You ship a WAV to Whisper-API, your transcript to GPT-4, and a synthesized response back over the wire. That works, but it also means three round trips, three per-token bills, and three places your user's voice gets logged.

The new wave of small, hardware-optimised models changes the trade-off. NVIDIA's Nemotron Speech Streaming En 0.6B is a 600M-parameter streaming ASR model published into the Microsoft Foundry Local catalog. Paired with a small chat model like qwen2.5-0.5b or phi-4-mini, you can run the entire capture → transcribe → reason → respond loop in-process on a developer laptop, with no API keys and no network egress.

This post walks through how the fl-nemotron sample does it, the SDK pitfalls we hit on the way, and the design decisions that made the pipeline reliable.

What we're building

A browser-hosted assistant served by FastAPI at http://127.0.0.1:8000. The page captures microphone audio, posts it to /api/transcribe, then streams the chat reply back over Server-Sent Events from /api/chat. All inference runs locally through two Foundry Local models loaded into the same process.

The shape of the pipeline:

```text
Microphone (browser MediaRecorder)
        │  WebM/Opus blob
        ▼
Client-side WAV encoder (16 kHz, mono, PCM-16)
        │  multipart/form-data
        ▼
FastAPI /api/transcribe
        │
        ▼
Nemotron Speech Streaming En 0.6B (Foundry Local audio client)
        │  transcript text
        ▼
Chat LLM e.g. qwen2.5-0.5b (Foundry Local chat client)
        │  streamed tokens
        ▼
FastAPI /api/chat → SSE → browser bubble
```

The version that bit us: foundry-local-sdk >= 1.1.0

Before any code, the single most important fact about this project: the Nemotron Speech Streaming model only appears in the Foundry Local 1.1.x catalog. Older SDKs (0.5.x / 0.6.x) cannot resolve the alias nemotron-speech-streaming-en-0.6b and fail with model not found. The module name also changed in 1.1.0: it is now foundry_local_sdk (with the _sdk suffix), not foundry_local. The pip wheel for foundry-local-core is bundled, so there is no separate MSI / winget install to worry about.

Pin it explicitly:

```shell
pip install --upgrade "foundry-local-sdk>=1.1.0,<2"
```

And verify before anything else:

```shell
python -c "import importlib.metadata as m; print('sdk', m.version('foundry-local-sdk'))"
# expect: sdk 1.1.0
```

Loading both models from one manager

The 1.1.x SDK exposes a single FoundryLocalManager that owns the runtime. Each loaded model gives you back a per-model OpenAI-compatible client: get_chat_client() for text models and get_audio_client() for ASR. There is no need to bring your own openai Python package; the SDK ships its own thin client.

The wrapper used in the repo (src/foundry_client.py) does this:

```python
from foundry_local_sdk import Configuration, FoundryLocalManager

FoundryLocalManager.initialize(Configuration(app_name="fl-nemotron"))
manager = FoundryLocalManager.instance

chat_model = manager.load_model("qwen2.5-0.5b")
stt_model = manager.load_model("nemotron-speech-streaming-en-0.6b")

chat_client = chat_model.get_chat_client()
audio_client = stt_model.get_audio_client()
```

Both models are downloaded on first use into the Foundry Local cache and stay resident for the lifetime of the process. On a laptop with 16 GB RAM, the combined working set sits comfortably under 4 GB.

The transcription surprise

The first naive approach was the obvious one:

```python
with open(wav_path, "rb") as f:
    result = audio_client.transcribe(file=f, model="nemotron-speech-streaming-en-0.6b")
```

That call fails on Nemotron. The bundled ONNX Runtime GenAI in foundry-local-core does not register the nemotron_speech multi-modal model type that the standard AudioClient.transcribe() path tries to instantiate. The error surfaces as a cryptic model-type registration failure deep inside the native runtime.
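Because both the missing catalog alias and the renamed module trace back to the installed SDK version, it can help to fail fast at start-up rather than meet a cryptic runtime error later. The sketch below is one way to do that using only the standard library. The helper names (parse_version, is_supported, check_sdk) are hypothetical and not part of the fl-nemotron repo, and the bounds simply mirror the >=1.1.0,<2 pin.

```python
import re
from importlib import metadata

MIN_VERSION = (1, 1, 0)  # assumption: mirrors the ">=1.1.0" lower bound of the pin
MAX_MAJOR = 2            # assumption: mirrors the "<2" upper bound of the pin


def parse_version(text: str) -> tuple:
    """Turn '1.1.0' or '1.2.3rc1' into (major, minor, patch), ignoring suffixes."""
    numbers = []
    for piece in text.split(".")[:3]:
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as '1.1' so tuple comparison behaves as expected.
    while len(numbers) < 3:
        numbers.append(0)
    return tuple(numbers)


def is_supported(version_text: str) -> bool:
    """True when the version falls inside the pinned range."""
    version = parse_version(version_text)
    return version >= MIN_VERSION and version[0] < MAX_MAJOR


def check_sdk(package: str = "foundry-local-sdk") -> str:
    """Return the installed version, or raise RuntimeError if it is unusable."""
    try:
        installed = metadata.version(package)
    except metadata.PackageNotFoundError:
        raise RuntimeError(f"{package} is not installed") from None
    if not is_supported(installed):
        raise RuntimeError(
            f"{package} {installed} is outside the supported range >=1.1.0,<2"
        )
    return installed
```

Calling check_sdk() before any model is loaded turns a version mismatch into a single readable error message.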
The fix is to use the streaming session API instead, a different native entry point (core_interop.start_audio_stream) that the streaming model does support. The repo isolates this in src/_nemotron_live.py:

```python
import threading
import wave


def transcribe_wav_live(audio_client, wav_path, *, language="en"):
    # Read the WAV header fields and the raw PCM frames.
    with wave.open(str(wav_path), "rb") as w:
        sample_rate = w.getframerate()
        channels = w.getnchannels()
        sample_width = w.getsampwidth()
        pcm = w.readframes(w.getnframes())

    session = audio_client.create_live_transcription_session()
    session.settings.sample_rate = sample_rate
    session.settings.channels = channels
    session.settings.bits_per_sample = sample_width * 8
    session.settings.language = language
    session.start()

    # Feed PCM in ~100 ms chunks from a worker thread, then stop.
    bytes_per_sec = sample_rate * channels * sample_width
    chunk_bytes = max(bytes_per_sec // 10, 1024)

    def _pusher():
        try:
            for offset in range(0, len(pcm), chunk_bytes):
                session.append(pcm[offset:offset + chunk_bytes])
        finally:
            # Always stop, even on error, so the reader below can finish.
            session.stop()

    threading.Thread(target=_pusher, daemon=True).start()

    # Drain results on the calling thread while the worker keeps pushing.
    parts = []
    for resp in session.get_stream():
        for cp in getattr(resp, "content", []) or []:
            text = getattr(cp, "text", "") or getattr(cp, "transcript", "") or ""
            if text:
                parts.append(text)
    return " ".join(p.strip() for p in parts if p.strip()).strip()
```

Three things to notice:

- Push from a worker thread, read from the calling thread. session.append() is a blocking write into the native stream and session.get_stream() is a blocking generator. Run one in a worker thread so the other can drain in parallel; otherwise you deadlock the session.
- Chunk to ~100 ms. Smaller chunks (e.g. 10 ms) spend more time crossing the FFI boundary than transcribing; larger chunks (e.g. 1 s) hold back partial results and hurt perceived latency.
- Always call session.stop(). Without it the generator never terminates and the request hangs.

The other transcription surprise: browsers don't send WAV

Inside the browser, MediaRecorder defaults to audio/webm; codecs=opus.
That's great for size but bad for our STT model, which expects a 16-bit mono PCM WAV at a known sample rate. Decoding WebM/Opus server-side would require ffmpeg as a runtime dependency, which is exactly the kind of friction this project exists to remove.

The cleaner solution is to encode WAV on the client. AudioContext.decodeAudioData already understands WebM/Opus, so the page can decode the recording, resample to 16 kHz, mix to mono, and emit a PCM-16 WAV blob in about 30 lines of JavaScript:

```javascript
// Inside src/static/index.html
async function webmToWav(blob) {
  // Creating the context at 16 kHz makes decodeAudioData resample for us.
  const ctx = new (window.AudioContext || window.webkitAudioContext)({ sampleRate: 16000 });
  const buf = await ctx.decodeAudioData(await blob.arrayBuffer());
  // Mix to mono
  const ch = buf.numberOfChannels;
  const mono = new Float32Array(buf.length);
  for (let c = 0; c < ch; c++) {
    const data = buf.getChannelData(c);
    for (let i = 0; i < data.length; i++) mono[i] += data[i] / ch;
  }
  return encodeWav(mono, 16000);
}

// Helper not shown in the original excerpt: writes an ASCII tag into the header.
function writeStr(view, offset, str) {
  for (let i = 0; i < str.length; i++) view.setUint8(offset + i, str.charCodeAt(i));
}

function encodeWav(samples, sampleRate) {
  const buffer = new ArrayBuffer(44 + samples.length * 2);
  const view = new DataView(buffer);
  // RIFF header
  writeStr(view, 0, "RIFF");
  view.setUint32(4, 36 + samples.length * 2, true);
  writeStr(view, 8, "WAVE");
  // fmt chunk
  writeStr(view, 12, "fmt ");
  view.setUint32(16, 16, true);             // PCM chunk size
  view.setUint16(20, 1, true);              // PCM format
  view.setUint16(22, 1, true);              // mono
  view.setUint32(24, sampleRate, true);
  view.setUint32(28, sampleRate * 2, true); // byte rate
  view.setUint16(32, 2, true);              // block align
  view.setUint16(34, 16, true);             // bits per sample
  // data chunk
  writeStr(view, 36, "data");
  view.setUint32(40, samples.length * 2, true);
  // PCM-16 samples
  let o = 44;
  for (let i = 0; i < samples.length; i++, o += 2) {
    const s = Math.max(-1, Math.min(1, samples[i]));
    view.setInt16(o, s < 0 ? s * 0x8000 : s * 0x7FFF, true);
  }
  return new Blob([view], { type: "audio/wav" });
}
```

Now the server's /api/transcribe endpoint just writes the bytes to a temp file and hands the path to transcribe_wav_live(), with no audio decoding libraries on the Python side.

Wiring it into FastAPI

The server (src/app.py) is deliberately small. The notable detail is that the same process holds both Foundry Local model handles for its entire lifetime, so there is no warm-up cost per request:

```python
# src/app.py (excerpt): app, ChatRequest, _sse and _ai_client are defined
# elsewhere in the module.
import tempfile

from fastapi import File, UploadFile
from fastapi.responses import StreamingResponse


@app.post("/api/transcribe")
async def transcribe(audio: UploadFile = File(...)):
    data = await audio.read()
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(data)
        path = f.name
    text = _ai_client.transcribe(path)
    return {"text": text}


@app.post("/api/chat")
async def chat(req: ChatRequest):
    if req.stream:
        return StreamingResponse(
            _sse(_ai_client.stream_completion(req.messages)),
            media_type="text/event-stream",
        )
    return {"text": _ai_client.chat_completion(req.messages)}
```

Streaming uses Server-Sent Events because they are trivially supported in both fetch() and the FastAPI runtime, and they don't require a WebSocket upgrade through any proxy a developer might have in front of localhost.

What it looks like

The repo includes screenshots of the running UI: a welcome screen with both models loaded, a streamed haiku reply, an inline code block with copy-to-clipboard, and the recording state for the microphone.

Performance, honestly

This is a small-model, CPU-friendly stack. On an Arm64 Surface running the x64 SDK under emulation:

- First model load (cold cache): tens of seconds, including downloads of ~600 MB for Nemotron and ~400 MB for qwen2.5-0.5b.
- Subsequent loads (warm cache): a few seconds per model.
- End-to-end transcription of a 5-second utterance: well under a second after warm-up.
- First chat token from qwen2.5-0.5b: typically 200–500 ms; full short reply within 1–2 s.
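Figures like these are straightforward to reproduce on your own hardware. As a minimal sketch (not code from the repo), the wrapper below records time to first chunk and total time for any iterator of text chunks. It assumes the chat stream yields plain strings; timed_stream, fake_stream and the stats keys are illustrative names.

```python
import time
from typing import Iterable, Iterator


def timed_stream(chunks: Iterable[str], stats: dict) -> Iterator[str]:
    """Yield chunks unchanged while recording latency figures into `stats`.

    `stats` is filled in only once the stream has been fully consumed.
    """
    start = time.perf_counter()
    first_chunk_s = None
    count = 0
    for chunk in chunks:
        if first_chunk_s is None:
            # Time from starting to read until the first chunk arrived.
            first_chunk_s = time.perf_counter() - start
        count += 1
        yield chunk
    stats["first_chunk_s"] = first_chunk_s
    stats["total_s"] = time.perf_counter() - start
    stats["chunks"] = count


def fake_stream(delay_s: float = 0.01) -> Iterator[str]:
    """Stand-in for a model stream so the sketch runs without Foundry Local."""
    for word in ("hello", " ", "world"):
        time.sleep(delay_s)
        yield word
```

In the /api/chat handler this could wrap the generator returned by _ai_client.stream_completion(req.messages) before it is handed to _sse, assuming that call yields text chunks.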
On x64 silicon with a recent CPU the numbers improve substantially, and the SDK will pick the best execution provider it finds (CPU / DirectML / CUDA) for each model. Trade-offs to know about Model quality. qwen2.5-0.5b is a 500M-parameter model. It is fast and small enough to ship on a laptop, but it is not GPT-4. Swap in phi-4-mini or mistral-nemo-12b-instruct if you have the RAM and want better reasoning — the wrapper accepts any chat alias in the Foundry Local catalog. STT is English-only here. The current Nemotron streaming model in the catalog is ...-en-0.6b . Multilingual variants are likely to follow. Browser microphone needs a real browser. Headless / automated browsers (Playwright, Puppeteer) deny getUserMedia by default. Open the page in Edge / Chrome / Firefox to grant the permission and capture audio for real. No agent framework yet. This sample is deliberately a single-turn loop over a chat client — there is no tool calling, planning, or multi-agent orchestration. Adding the Microsoft Agent Framework on top would be a natural next step for richer behaviour. Responsible AI considerations Running locally removes the cloud-egress class of privacy concerns, but it does not remove responsibility: Disclose recording. The browser prompts for mic permission; your UI should make it obvious when capture is active. The sample shows a red ⏹ button and a "Recording…" banner for that reason. Don't log raw audio. The sample writes audio to a per-request NamedTemporaryFile and deletes it after transcription. Treat the WAV as sensitive data even when it never leaves the device. Small models hallucinate. A 0.5B chat model is great for snappy local replies, but unsuitable for high-stakes answers. Pair it with retrieval, ground it on your own data, or escalate to a larger model when accuracy matters. Try it Clone github.com/leestott/fl-nemotron. ./setup.ps1 (or ./setup.sh ) to create a virtualenv and install the pinned SDK. 
python scripts/prefetch.py nemotron-speech-streaming-en-0.6b qwen2.5-0.5b to download both models. .venv\Scripts\uvicorn.exe app:app --app-dir src --port 8000 Open http://127.0.0.1:8000 in a real browser and click the 🎤 button. Where to go next Foundry Local documentation — official docs for the runtime, catalog, and SDK. microsoft/Foundry-Local — upstream samples and issue tracker. NVIDIA Nemotron model family — background on the speech and language models being published into the catalog. leestott/fl-nemotron — the full source for this post. Key takeaways Pin foundry-local-sdk >= 1.1.0 . Earlier SDKs cannot see the Nemotron Speech Streaming model. Use the LiveAudioTranscriptionSession API for Nemotron, not AudioClient.transcribe() . Encode WAV in the browser. It eliminates a heavy server-side ffmpeg dependency for a few lines of JS. Push audio chunks on a worker thread and drain the response generator on the main one to avoid deadlocks. A small Foundry Local chat model plus Nemotron STT gives you a credible local voice loop in a single Python process — no cloud, no keys, no data egress.
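The "Don't log raw audio" point under Responsible AI considerations depends on the temporary WAV file actually being removed once transcription finishes, including when transcription fails. The following is a minimal sketch of one way to guarantee that, not the sample's own code. It assumes a hypothetical `transcribe_fn` callable that takes a file path and returns text; that name and signature are illustrative only.

```python
import os
import tempfile
from typing import Callable


def transcribe_bytes(data: bytes, transcribe_fn: Callable[[str], str]) -> str:
    """Write WAV bytes to a temp file, transcribe it, and always delete the file.

    `transcribe_fn` is a hypothetical stand-in for any function that accepts a
    file path and returns text; it is not part of any SDK.
    """
    # delete=False lets us close the handle before another reader opens the
    # file (needed on Windows); we take over deletion ourselves below.
    with tempfile.NamedTemporaryFile(suffix=".wav", delete=False) as f:
        f.write(data)
        path = f.name
    try:
        return transcribe_fn(path)
    finally:
        # Runs on success and on exceptions, so raw audio does not linger on disk.
        os.unlink(path)
```

The `try`/`finally` is the design choice that matters here: cleanup is tied to the function's exit rather than to the happy path.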
Lee_Stott
May 26, 2026 Place Microsoft Developer Community Blog
327 Views
0 likes
0 Comments
Building an End-to-End Azure RAG Strategy Agent with MS Foundry
High-Level Architecture This architecture represents an end-to-end Retrieval-Augmented Generation (RAG) pipeline where raw documents are ingested from Azure Blob Storage, processed using Document Intelligence, transformed into embeddings via Azure OpenAI, and indexed in Azure AI Search for hybrid retrieval. A Foundry/MAF-based agent orchestrates query processing by combining user input with relevant search results and generates contextual responses, which are exposed through a FastAPI or CLI interface. This solution is composed of two main layers: 1. Data Ingestion Layer (RAG Pipeline) This layer transforms raw enterprise documents into searchable knowledge. Flow: Raw documents stored in Azure Blob Storage Supported formats: PDF, DOCX, PPTX, images, etc. Document Intelligence extraction Extracts: Text Tables Key-value pairs Structure Writes output as structured JSON back to Blob (processed/) Chunking + Embedding Documents are split into chunks Each chunk is embedded using Azure OpenAI (text-embedding-*) Indexing into Azure AI Search Creates a hybrid index: Keyword search Semantic ranking Vector search Enables flexible retrieval strategies 2. Query Layer (Strategy Agents) This layer enables intelligent query answering. 
Flow: User sends a query via: FastAPI endpoint CLI interface Query is handled by: Microsoft Agent Framework (MAF) agent Running on Azure AI Foundry Agent: Queries Azure AI Search Retrieves top relevant chunks Injects them into LLM prompt LLM generates grounded response This follows the standard RAG pattern: Retrieval → Augmentation → Generation End-to-End Flow Key Azure Services Used Service Purpose Azure Blob Storage Raw + processed document storage Azure AI Document Intelligence Extract structured content Azure OpenAI Embeddings + LLM generation Azure AI Search Hybrid retrieval engine Azure AI Foundry Agent orchestration Microsoft Agent Framework Agent execution layer Why this Architecture Matters This solution goes beyond basic RAG and provides: Hybrid Retrieval Combines keyword + semantic + vector search Improves recall and accuracy Structured Document Parsing Handles complex enterprise documents Extracts tables and metadata Agent-Based Orchestration Enables reasoning over retrieval results Extensible for multi-agent workflows Scalable Data Pipeline Supports continuous ingestion Works with large document collections Enterprise Considerations Use Managed Identity for secure service access Apply RBAC on Cosmos DB / Search / Storage Enable Private Endpoints for network isolation Use Guardrails + Evaluations in Foundry Summary This repository demonstrates a production-ready Azure RAG architecture: Ingest → Extract → Chunk → Embed → Index Retrieve → Reason → Generate Powered by Azure AI Foundry + Agent Framework By combining data engineering + AI orchestration, it enables enterprise AI systems that are: Accurate Grounded Extensible Repo: https://github.com/snd94/azure-rag-strategy-agent Please refer to the Microsoft Learn Documentation for further information: Azure AI Search documentation - Azure AI Search | Microsoft Learn Document Intelligence documentation - Quickstarts, Tutorials, API Reference - Foundry Tools | Microsoft Learn How to generate embeddings with 
Azure OpenAI in Microsoft Foundry Models - Microsoft Foundry | Microsoft Learn Microsoft Agent Framework Overview | Microsoft Learn What is Microsoft Foundry? - Microsoft Foundry | Microsoft Learn
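The "Chunking + Embedding" step and the Retrieval → Augmentation → Generation pattern described above can be sketched in a few lines. This is an illustrative sketch only, not the repository's implementation: the function names, the character-based window, and the prompt wording are all assumptions, and the embedding and search calls are deliberately left out.

```python
from typing import List


def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> List[str]:
    """Split text into fixed-size character windows that overlap."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # the last window already reached the end of the text
    return chunks


def build_grounded_prompt(question: str, retrieved: List[str]) -> str:
    """Augmentation step: place retrieved chunks ahead of the user question."""
    context = "\n\n".join(f"[{i + 1}] {c}" for i, c in enumerate(retrieved))
    return (
        "Answer using only the sources below. "
        "If they do not contain the answer, say so.\n\n"
        f"Sources:\n{context}\n\nQuestion: {question}"
    )
```

Overlapping windows reduce the chance that a sentence split across a chunk boundary becomes unretrievable; numbering the sources makes it easier for the model to cite which chunk it used.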
SHAILESHDEVADIGA
May 25, 2026 Place Microsoft Developer Community Blog
482 Views
0 likes
0 Comments
Building AI Agents with Microsoft Foundry: A Progressive Lab from Hello World to Self-Hosted
AI agent development has a steep on-ramp. The combination of new SDKs, tool-calling patterns, model selection decisions, retrieval-augmented generation, and deployment concerns means most developers spend more time wiring things together than actually building anything useful. The Microsoft Foundry Agent Lab is a structured, open-source demo series designed to change that — nine self-contained demos, each adding exactly one new concept, all built on the same Microsoft Foundry SDK and a single model deployment. This post walks through what the lab contains, how each demo works under the hood, and the architectural decisions that make it a useful reference for AI engineers building production agents. Why a Progressive Lab? Agent frameworks can be overwhelming. A developer who opens a rich example with RAG, tool-calling, streaming, and a custom UI all at once has no clear line of sight to which parts are essential and which are embellishments. The Foundry Agent Lab takes the opposite approach: start with the absolute minimum and introduce one new primitive per demo. By the time you reach Demo 8, you have seen every major capability — not in one monolithic sample, but in a layered sequence where each addition is visible and understandable. 
# Demo New Concept Tool Used UX 0 hello-demo Agent creation, Responses API, conversations None Terminal 1 tools-demo Function calling, tool-calling loop, live API FunctionTool Terminal 2 desktop-demo UI decoupling — same agent, different surface None Desktop (Tkinter) 3 websearch-demo Server-side built-in tools, no client loop WebSearchTool Terminal 4 code-demo Code execution in sandbox, Gradio web UI CodeInterpreterTool Web (Gradio) 5 rag-demo Document upload, vector stores, RAG grounding FileSearchTool Terminal 6 mcp-demo MCP servers, human-in-the-loop approval MCPTool Terminal 7 toolbox-demo Centralized tool governance, Toolbox versioning Toolbox Terminal 8 hosted-demo Self-hosted agent with Responses protocol Custom server Terminal + Agent Inspector The Model Router: One Deployment to Rule Them All Before diving into the demos, it is worth understanding the one architectural decision that ties the entire lab together: every agent uses model-router as its model deployment. MODEL_DEPLOYMENT=model-router Model Router is a Microsoft Foundry capability that inspects each request at inference time and routes it to the optimal available model — weighing task complexity, cost, and latency. A simple factual question goes to a fast, cheap model. A complex tool-calling chain with code generation gets routed to a frontier model. You write zero routing logic. The lab's MODEL-ROUTER.md file contains empirical observations from running all nine demos. A sample of what the router selected: Demo Query Task Type Model Selected hello "What's the capital of WA state?" Factual recall grok-4-1-fast-reasoning hello "Summarize our conversation" Summarization gpt-5.2-chat-2025-12-11 tools "What's the weather in Seattle?" 
Tool-using gpt-5.4-mini-2026-03-17 code Data analysis with code generation Code generation + execution gpt-5.4-2026-03-05 rag HR policy document question Retrieval + synthesis gpt-5.3-chat-2026-03-03 This is the strongest signal in the lab: you do not need to reason about model selection. You declare what your agent needs to do; the router handles the rest, and it chooses correctly. Demo 0: The Minimum Viable Agent The hello-demo establishes the baseline pattern used by every subsequent demo. Two files: one to register the agent, one to chat with it. Registering the agent from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient from azure.ai.projects.models import PromptAgentDefinition credential = DefaultAzureCredential() project = AIProjectClient(endpoint=PROJECT_ENDPOINT, credential=credential) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, instructions="You are a helpful, friendly assistant.", ), ) Authentication uses DefaultAzureCredential , which works with az login locally and with managed identity in production — no API keys anywhere in the code. Chatting with the agent # Create a server-side conversation (persists history across turns) conversation = openai.conversations.create() # Each turn sends the user message; the agent sees full history response = openai.responses.create( input=user_input, conversation=conversation.id, extra_body={"agent_reference": {"name": AGENT_NAME, "type": "agent_reference"}}, ) print(response.output_text) The conversation object is server-side. You pass its ID on every turn; the history lives in Foundry, not in a local list. This is the Responses API pattern — distinct from the older Completions or Chat Completions APIs. Demo 1: Function Tools and the Tool-Calling Loop Demo 1 adds function calling against a real weather API. 
The key insight here is that the model does not execute the function — it requests the execution, and your code executes it locally, then feeds the result back. Declaring a function tool from azure.ai.projects.models import FunctionTool, PromptAgentDefinition func_tool = FunctionTool( name="get_weather", description="Get the current weather for a given city.", parameters={ "type": "object", "properties": {"city": {"type": "string", "description": "City name"}}, "required": ["city"], }, strict=True, ) agent = project.agents.create_version( agent_name=AGENT_NAME, definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[func_tool], instructions="You are a weather assistant...", ), ) The tool-calling loop response = openai.responses.create(input=user_input, conversation=conversation.id, ...) # Loop while the model is requesting tool calls while any(item.type == "function_call" for item in response.output): input_list = [] for item in response.output: if item.type == "function_call": args = json.loads(item.arguments) result = get_weather(args["city"]) # execute locally input_list.append(FunctionCallOutput(call_id=item.call_id, output=result)) # Send results back to the agent response = openai.responses.create(input=input_list, conversation=conversation.id, ...) print(response.output_text) The strict=True parameter on FunctionTool enforces structured outputs — the model must return arguments that match the declared JSON schema exactly. This eliminates argument parsing errors in production. Demo 2: UI Is Not Your Agent Demo 2 runs the exact same agent as Demo 1 but surfaces it in a Tkinter desktop window. The point is pedagogical: your agent definition, conversation management, and tool-calling logic are entirely independent of your UI layer. Swapping from terminal to desktop requires changing only the presentation code — nothing in the agent or conversation path changes. 
This is a principle worth internalising early: agent logic and UI logic should never be entangled. The lab enforces this separation structurally. Demo 3: Server-Side Built-In Tools The web search demo introduces a sharp contrast with Demo 1. With WebSearchTool , the tool-calling loop disappears entirely from client code: from azure.ai.projects.models import WebSearchTool agent = project.agents.create_version( agent_name="Search-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[WebSearchTool()], instructions="You are a research assistant...", ), ) The agent decides when to search, executes the search server-side, and returns a grounded response with citations. Your client code looks identical to Demo 0 — a simple responses.create() call with no tool loop. The distinction matters architecturally: Function tools (Demo 1) — tool execution happens on your client; you control the code, the API call, the error handling. Built-in tools (Demo 3+) — tool execution happens inside Foundry; you get results without managing execution. Demo 4: Code Interpreter and the Gradio Web UI Demo 4 attaches CodeInterpreterTool , which gives the agent a sandboxed Python execution environment inside Foundry. The agent can write code, run it, observe output, and iterate — all server-side. Combined with a Gradio web interface, this demo shows an agent that can perform data analysis, generate charts, and explain results through a browser UI. Model Router is particularly interesting here: the empirical data shows it selects a more capable frontier model ( gpt-5.4-2026-03-05 ) for code-generation tasks, while simpler conversational turns stay on lighter models. Demo 5: Retrieval-Augmented Generation with FileSearchTool Demo 5 introduces RAG. 
The setup phase uploads a document, creates a vector store, and attaches it to the agent: # Upload document and create a vector store vector_store = openai.vector_stores.create(name="employee-handbook-store") with open("data/employee-handbook.md", "rb") as f: openai.vector_stores.files.upload_and_poll( vector_store_id=vector_store.id, file=f ) # Attach the vector store to the agent agent = project.agents.create_version( agent_name="RAG-Agent", definition=PromptAgentDefinition( model=MODEL_DEPLOYMENT, tools=[FileSearchTool(vector_store_ids=[vector_store.id])], instructions="Answer questions using only the provided documents...", ), ) At query time, the agent embeds the question, searches the vector store semantically, retrieves matching chunks, and generates an answer grounded in the retrieved content — entirely server-side. The client code remains a plain responses.create() call. An important detail: the .vector_store_id file is written to disk during setup and read back during the chat session, so the demo survives process restarts without re-uploading the document. The .gitignore excludes this file from source control. Demo 6: Model Context Protocol Demo 6 connects the agent to a GitHub MCP server, giving it access to repository and issue data via the open Model Context Protocol standard. MCP servers expose tools over a standardised wire protocol; the agent discovers and calls them without any client-side function declarations. The demo also demonstrates human-in-the-loop approval: before executing any MCP tool call, the agent surfaces the proposed action and waits for the user to confirm. This is an important safety pattern for agents that can trigger side effects on external systems. Demo 7: Toolbox — Centralised Tool Governance Where Demo 6 connects to a single MCP server directly, Demo 7 uses a Toolbox — a managed Microsoft Foundry resource that bundles multiple tools into a single, versioned, MCP-compatible endpoint. 
The Toolbox in this demo exposes both GitHub Issues and GitHub Repos tools, curated into an immutable versioned snapshot. This pattern is significant for production multi-agent systems: Centralised governance — one team owns the tool definitions; all agents consume them via a single endpoint. Versioned snapshots — promoting a new Toolbox version is explicit; agents pin to a version and upgrade intentionally. MCP compatibility — any MCP-capable agent or framework can connect, not just Foundry SDK agents. from azure.ai.projects.models import McpTool toolbox_tool = McpTool( server_label="toolbox", server_url=TOOLBOX_ENDPOINT, allowed_tools=[], # empty = all tools in the Toolbox version headers={"Authorization": f"Bearer {token}"}, ) Demo 8: Self-Hosted Agent with the Responses Protocol The final demo departs from the prompt-agent pattern. Instead of registering a declarative agent in Foundry, Demo 8 implements a custom agent server using the Responses protocol. The server exposes a streaming HTTP endpoint; Foundry's Agent Inspector can connect to it and route user turns to it just as it would to a hosted prompt agent. This demo includes a Dockerfile and an agent.yaml , enabling deployment to Foundry's container hosting service. It uses gpt-4.1-mini directly rather than the model router, because the custom server owns the entire inference path. When to consider this pattern: Your agent requires custom pre- or post-processing logic that cannot be expressed in a system prompt. You need to integrate with infrastructure that is not reachable through MCP or built-in tools. You want to own the inference call for cost control, A/B testing, or compliance reasons. You are building a multi-agent orchestrator that needs to expose itself as an agent to other orchestrators. Getting Started The lab requires Python 3.10 or higher, an Azure subscription with a Microsoft Foundry project, and the Azure CLI. 1. 
Clone and set up the virtual environment git clone https://github.com/microsoft-foundry/Foundry-Agent-Lab.git cd Foundry-Agent-Lab # Create and activate the virtual environment python -m venv .venv # Windows Command Prompt .venv\Scripts\activate.bat # Windows PowerShell .venv\Scripts\Activate.ps1 # macOS / Linux source .venv/bin/activate pip install -r requirements.txt 2. Configure a demo copy hello-demo\.env.sample hello-demo\.env # Edit hello-demo\.env and set PROJECT_ENDPOINT Your PROJECT_ENDPOINT is on the Overview page of your Foundry project in the Azure portal. It takes the form https://your-resource.ai.azure.com/api/projects/your-project . 3. Run the demo az login 0-hello-demo Each numbered batch file at the root activates the virtual environment, runs create_agent.py , and launches chat.py . Append log to capture the full session transcript: 0-hello-demo log Reset between runs hello-demo\reset.bat Every demo includes a reset.bat that deletes the registered agent and any associated resources (vector stores, uploaded files). Demos are fully repeatable. Architecture Principles Demonstrated Across the nine demos, the lab illustrates a set of design principles that apply directly to production agent systems: Keyless authentication throughout Every demo uses DefaultAzureCredential . No API keys appear anywhere in the code. Locally, az login provides credentials. In production, managed identity takes over automatically — same code, no secrets to rotate. Server-side conversation state The Responses API stores conversation history server-side. Your application passes a conversation ID; Foundry maintains the thread. This eliminates the common bug of truncating history due to local list management and makes multi-process or multi-instance deployments straightforward. Client-side vs server-side tool execution The lab makes the distinction explicit. Function tools execute in your process — you control the code, the external call, and the error handling. 
Built-in tools (WebSearch, CodeInterpreter, FileSearch) execute inside Foundry — you get results without managing execution infrastructure. MCP tools (Demo 6, 7) fall between these: they execute in a separately deployed server, with the protocol mediating the call. Progressive tool introduction Each demo's create_agent.py registers the agent once. The chat.py file handles the conversation loop. These two responsibilities are always separate, making it easy to update agent definitions without modifying conversation logic, and vice versa. Security Considerations When building agents for production, keep the following in mind: Never commit .env files. The .gitignore excludes them, but verify this before pushing. Use Azure Key Vault or environment variable injection in CI/CD pipelines. Use managed identity in production. DefaultAzureCredential automatically picks up managed identity when deployed to Azure, eliminating the need for any stored credentials. Apply human-in-the-loop for side-effecting tools. Demo 6 demonstrates this pattern for MCP tool calls. Any agent that can modify external state (create issues, send emails, write files) should surface proposed actions for confirmation. Validate tool outputs before use. Treat data returned by external tools (weather APIs, search results, document retrieval) as untrusted input. Prompt injection through tool results is a real attack surface; grounding instructions in your system prompt reduce but do not eliminate this risk. Scope Toolbox permissions narrowly. When using a Toolbox (Demo 7), use allowed_tools to restrict which tools the agent can call, rather than granting access to all tools in a Toolbox version. Key Takeaways Start with the minimum. A prompt agent with no tools requires fewer than 30 lines of code using the Foundry SDK. Add tools only when the use case demands them. Use model-router unless you have a specific reason not to. 
The empirical data in the lab shows the router selects appropriate models across all task types — factual, creative, tool-calling, RAG, and code generation. Understand the client/server tool boundary. Function tools give you control; built-in tools give you simplicity. MCP and Toolbox give you governance and interoperability. Choose based on where you need control and where you need scale. Conversation state belongs on the server. Do not maintain conversation history in application memory if you can avoid it. The Responses API conversation object is designed for this. The hosted-demo pattern is for when you need to own the inference path. For most use cases, a declarative prompt agent is sufficient and far simpler to operate. Next Steps Explore the repo: github.com/microsoft-foundry/Foundry-Agent-Lab Microsoft Foundry SDK documentation: learn.microsoft.com/azure/ai-studio/ Responses API quickstart: Prompt agent quickstart Model Router conceptual documentation: Model Router for Microsoft Foundry Model Context Protocol: modelcontextprotocol.io Azure Identity SDK (DefaultAzureCredential): azure-identity Python SDK The Foundry Agent Lab is open source under the MIT licence. Contributions, bug reports, and feature requests are welcome through GitHub Issues. See CONTRIBUTING.md for guidelines.
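Demo 5 notes that a `.vector_store_id` file is written during setup and read back at chat time so the demo survives restarts. A minimal sketch of that persist-and-reuse pattern follows. It is not the lab's code: the helper names are hypothetical, and `create` stands in for whatever call actually creates the vector store and returns its ID.

```python
from pathlib import Path
from typing import Callable, Optional

ID_FILE = Path(".vector_store_id")  # file name taken from the article


def load_vector_store_id(path: Path = ID_FILE) -> Optional[str]:
    """Return the saved ID, or None if setup has not run yet."""
    if not path.exists():
        return None
    value = path.read_text(encoding="utf-8").strip()
    return value or None


def get_or_create_vector_store_id(create: Callable[[], str], path: Path = ID_FILE) -> str:
    """Reuse the saved ID if present; otherwise call `create` once and save it."""
    existing = load_vector_store_id(path)
    if existing is not None:
        return existing
    new_id = create()  # hypothetical: e.g. upload the document, return the store ID
    path.write_text(new_id, encoding="utf-8")
    return new_id
```

Because the file holds a resource identifier tied to one project, keeping it out of source control (as the lab's `.gitignore` does) avoids pointing other clones at a store they cannot access.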
Lee_Stott
May 21, 2026 Place Microsoft Developer Community Blog
4.2K Views
1 like
0 Comments
How to Visualize Your Azure AI Workloads Usage for Observability
This article assumes you already have a Foundry project and resource deployed in Microsoft Foundry. The options referenced here are documented in detail in the linked articles; this post serves as a consolidated step-by-step guide bringing them all together and explaining where each option is most useful. A Summary: Need Best Option Quick day-over-day visual, minimal setup Grafana Dashboard (Option 3) Custom growth % calculations App Insights + KQL in Log Analytics (Option 4) Shareable, interactive report Azure Workbooks (Option 5) Per-user/per-agent granularity APIM + App Insights (Option 6) Quick one-off chart, export to Excel Microsoft Foundry Monitor tab or App Insights Metrics Explorer (Options 1 and 2) Option 1: Within the Microsoft Foundry Portal (Quickest, No Setup) If you have models deployed in Microsoft Foundry and would like to monitor their usage, go to the New Foundry Portal → Build → Models → Monitor tab. View metrics such as: Estimated cost Total token usage Input vs. output tokens Number of requests This is the simplest way to monitor both model and agent usage. For PAYG plans: You can also view your total allocated quota (and figure out which Tier you are on) using the Quota Management Screen (New Foundry Portal → Operate → Quota tab). This screen shows your total allocated quota per model for a given subscription + region + Deployment Type (Global, Data Zones or Regional). For example, in the image below, for gpt-4o, I am allocated 7M total TPM in my subscription. I am only using 150K TPM of the allocated 7M TPM. This means my requests will be throttled if I exceed the 150K TPM limit. To avoid throttling, I would need to increase my shared allocation limit. NOTE: You are charged for usage, so allowing more capacity can lead to more usage and a higher bill. Option 2: Azure Monitor Metrics Explorer This is already built into the Azure Portal and gives you time-series charts out of the box.
Go to Azure Portal → your Azure OpenAI / Foundry resource → Monitoring → Metrics Select a metric like AzureOpenAIRequests or TokenTransaction Set Aggregation to Sum (total) or Max and Time granularity to 1 day Split by ModelDeploymentName to see per-model trends Adjust the time range (e.g., last 30 days) — you'll see day-over-day bars/lines Tip: You can pin these charts to an Azure Dashboard for a persistent view, or click Share → Download to Excel to get the raw data for your own analysis. Option 3: Azure Managed Grafana (Best Pre-Built Dashboard) This is the best option for a polished, real-time, day-over-day dashboard with no custom code. There's a pre-built AI Foundry dashboard ready to import. [grafana.com], [Create a M...ed Grafana] How to set it up: Create an Azure Managed Grafana workspace (if you don't have one) In Grafana, go to Dashboards → New → Import → enter dashboard ID 24039 (for Foundry) Select your Azure Monitor data source and point it to your Foundry resource Tip: You can also import this directly from the Azure Portal: Monitor → Dashboards with Grafana → AI Foundry. That's it — the dashboard gives you (per model deployment): Token trends over time (inference, prompt, completion — day over day) Request trends over time (AzureOpenAIRequests as a time series) Latency trends (bonus) NOTE: Default time range is 7 days — adjust to 30/60/90 days for growth trends Option 4: Application Insights + KQL Queries (Most Flexible, Custom Reports) If you want fully custom day-over-day growth calculations (e.g., % change day-to-day), this is the way. [azurefeeds.com] Setup: Ensure your Foundry project is connected to an Application Insights resource (Foundry → Settings → Connected Resources). Open up App Insights resource → Logs → New Query or choose a sample query. In the images below, we simply ran 'requests' and set the time range to 24 hours. 
You can switch between Kusto Query Language (KQL) mode and Simple mode on the right-hand side: Simple mode lets you run out-of-the-box samples. KQL mode opens a query window where you can enter custom queries. Below are the results in grid view. Same view but showing a chart: Export options: Another way to get the above graphs is via Log Analytics. Simply enable Diagnostic Settings on your Azure OpenAI resource → send to a Log Analytics workspace. Open Log Analytics → Logs and try out your sample queries. Sample KQL for day-over-day token usage (adjust to your needs): AzureMetrics | where MetricName in ("TokenTransaction", "ProcessedPromptTokens", "GeneratedTokens") | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by bin(TimeGenerated, 1d), MetricName | order by TimeGenerated asc | render timechart Result: Sample KQL for day-over-day growth % (adjust to your needs): AzureMetrics | where MetricName == "TokenTransaction" | where TimeGenerated > ago(30d) | summarize DailyTokens = sum(Total) by Day = bin(TimeGenerated, 1d) | sort by Day asc | extend PrevDay = prev(DailyTokens) | extend GrowthPct = round((DailyTokens - PrevDay) / PrevDay * 100, 2) | project Day, DailyTokens, GrowthPct Option 5: Azure Monitor Workbooks (Custom Dashboards, Shareable) Workbooks let you build interactive, parameterized dashboards that combine metrics and KQL logs. What's more, you can select resources from multiple subscriptions and visualize them all in one place using Workbooks! Go to Azure Portal → Monitor → Workbooks → New Add a Metrics query panel → select your Log Analytics or App Insights or Foundry resource → enter the same query you used in Option 4. Do a test run and view the graphs (these can be viewed as charts or as a list (grid view)): Save and share with your team. Option 6: APIM + Application Insights (Granular Per-Caller/Per-Agent Tracking)
If your app routes requests through Azure API Management, you can use the azure-openai-emit-token-metric policy to send per-request token metrics to Application Insights with custom dimensions (User ID, Subscription ID, Agent, etc.). [Azure API...osoft Docs] This is ideal for scenarios like: "Which agent consumed the most tokens last week?" "What's the token usage per API consumer/team?" NOTE: Microsoft Foundry resources do not track usage per user. Fronting your Foundry resource with APIM is one way to track users, provided you pass the username/ID in the request context. How you implement this is up to your app design. Ref: AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHub Bonus: Check out all other APIM + AI related policies here: AI-Gateway/labs/semantic-caching at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-rate-limiting at main · Azure-Samples/AI-Gateway AI-Gateway/labs/token-metrics-emitting/token-metrics-emitting.ipynb at main · Azure-Samples/AI-Gateway · GitHub
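If you export daily totals (for example via Download to Excel in Option 2) and want to reproduce the growth-percentage calculation from Option 4 outside KQL, the same arithmetic can be sketched in Python. This is an illustrative sketch; the function name and the input shape (a list of day/total pairs already sorted by day) are assumptions. Unlike the KQL sample, it returns `None` when the previous day's total is zero instead of dividing by zero.

```python
from typing import List, Optional, Tuple


def growth_pct(
    daily_totals: List[Tuple[str, float]],
) -> List[Tuple[str, float, Optional[float]]]:
    """Return (day, total, growth %) rows; growth is None when undefined."""
    rows = []
    prev: Optional[float] = None
    for day, total in daily_totals:  # input must already be sorted by day
        if prev is None or prev == 0:
            pct = None  # first day, or the previous day had no usage
        else:
            pct = round((total - prev) / prev * 100, 2)
        rows.append((day, total, pct))
        prev = total
    return rows
```

For instance, totals of 100 then 150 tokens on consecutive days give a growth of 50.0 for the second day.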
juneesingh
May 20, 2026 Place Microsoft Developer Community Blog
437 Views
0 likes
0 Comments
Genie in a Bot: Databricks AI/BI Meets Microsoft Teams
The Use Case: Why Genie Needs to Live in Teams Every organization has business users who need data answers — fast. Marketing managers want to know which campaign drove the most conversions last quarter. Finance teams need spend breakdowns by channel. Executives want real-time KPIs before a board meeting. Databricks Genie (part of AI/BI) is a brilliant solution to this: it lets you ask natural-language questions against your Data Lakehouse and get SQL-backed answers instantly. No dashboards to navigate, no SQL to write, no tickets to the analytics team. But there's a friction problem: Genie lives in the Databricks workspace. Your business users live in Microsoft Teams. Asking them to context-switch out of their primary collaboration tool, log into Databricks, navigate to a Genie Space, and type a question — that's a workflow that looks good in demos but dies in the field. The real unlock is bringing Genie into Teams: a bot that business users can message directly, in the same place they chat with colleagues, and get instant, data-backed answers. No portal, no login, no context switch. This blog talks about exactly this integration. We will explore how to connect Microsoft Teams to a Databricks Genie Space via an Azure AI Foundry agent. Users ask natural-language questions about campaign data — spend, clicks, conversions, ROI, audience segments — and the system translates these into SQL queries executed against a Databricks SQL warehouse. A user asks a question in Teams. The Bot Service relays it to our App Service, which uses an AI Foundry agent to reason over the question and queries Databricks Genie for SQL-backed data. The answer flows back the same path — arriving in the Teams chat within seconds. But getting here was far from straightforward. Why This Is Hard — Especially for Regulated Industries If you're reading this and thinking "just connect the services together," you're in for a surprise. 
Private networking, multi-hop authentication, and the server-side tool problem combine to make this quite complicated. Here's why this problem is deceptively complex: The Private Networking Requirement In regulated industries (financial services, healthcare, government), you can't expose your AI Services resources to the public internet. Azure AI Foundry (the Cognitive Services / AI Services resource) must be locked down with: defaultAction: Deny on the network ACL: blocks all traffic by default Private Endpoints: the only way to reach the service is through your Virtual Network No public network access: fully private deployment These controls are baseline requirements for meeting SOC 2, HIPAA, PCI-DSS, and most enterprise security standards. The MCP / Server-Side Tool Problem Azure AI Foundry supports MCP (Model Context Protocol) tools, executed server-side, that let your agent call external services (like Databricks Genie) seamlessly. When it works, it's powerful: you register a tool in the Foundry portal, and the platform handles authentication, execution, and response marshaling automatically. For deployments with public network access, this is the fastest path to a working agent, often just a few clicks. However, in private-networked deployments, server-side tool execution faces a networking constraint. Here's the problem: when you lock down your AI Services resource with defaultAction: Deny and private endpoints (as enterprise security policies require), the AI Services infrastructure has no outbound path to external services like Databricks. This isn't a bug in Foundry; it's an inherent trade-off of full network isolation. The same restriction applies to any Azure service calling out from a network-locked resource. Microsoft is actively addressing this with newer features like Standard Agent Setup with dedicated MCP subnets, which give the agent infrastructure its own outbound-capable subnet within your VNet.
As these capabilities mature and become generally available, the server-side MCP approach will work in private deployments too.

The Multi-Hop Authentication Challenge

Even once you solve the networking problem, you face a five-hop authentication chain:

1. Teams → Bot Service: Bot Framework channel tokens
2. Bot Service → App Service: User-Assigned Managed Identity (UAMI)
3. App Service → AI Foundry: Managed Identity + Private Endpoint
4. App Service → Entra ID: Token acquisition for the Databricks resource
5. Entra ID → Databricks: OAuth token federation (OIDC token exchange)

Each hop has its own identity model, token format, and failure modes. Getting one wrong means a silent 401 somewhere in the chain with no useful error message.

The Teams Bot Framework Constraint

Azure Bot Service adds its own constraints:

- It requires a public HTTPS endpoint, so you can't make the bot App Service fully private
- It uses UAMI (User-Assigned Managed Identity) for passwordless authentication, not the simpler system-assigned MI
- The Bot Framework SDK validates inbound channel tokens before your code even runs
- You need both MicrosoftAppId and MICROSOFT_APP_ID (PascalCase and UPPER_SNAKE aliases) because different SDK layers read different env var names

Our Approach: Client-Side Function Tool Pattern

The challenges above - private networking, multi-hop authentication, Teams constraints - aren't individual problems to solve in isolation. They're interconnected design constraints that need a cohesive architectural answer. The answer is to introduce a component that sits in the same network as Databricks Genie, takes the required actions there, and follows instructions from the Foundry AI agent.
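To make the client-side function tool idea concrete, here is a minimal sketch of the dispatch step such a bridge component might run. It assumes the agent hands back tool calls as dictionaries with `id`, `name`, and JSON-encoded `arguments`; that shape, the `ask_genie` stub, and `run_tool_calls` are hypothetical names for illustration, not the Foundry SDK's API or the reference implementation's code.

```python
import json
from typing import Any, Callable, Dict, List

def ask_genie(question: str) -> Dict[str, Any]:
    """Hypothetical stub. A real bridge would call Databricks Genie from inside the VNet."""
    return {"question": question, "answer": "stubbed result"}

# Registry of functions the bridge is willing to execute on the agent's behalf.
TOOLS: Dict[str, Callable[..., Dict[str, Any]]] = {"ask_genie": ask_genie}

def run_tool_calls(tool_calls: List[Dict[str, str]],
                   tools: Dict[str, Callable[..., Dict[str, Any]]] = TOOLS) -> List[Dict[str, str]]:
    """Execute each requested tool locally and return outputs keyed by call id."""
    outputs = []
    for call in tool_calls:
        fn = tools.get(call["name"])
        if fn is None:
            result: Dict[str, Any] = {"error": f"unknown tool: {call['name']}"}
        else:
            try:
                result = fn(**json.loads(call["arguments"]))
            except Exception as exc:  # report the failure back to the agent instead of crashing
                result = {"error": str(exc)}
        outputs.append({"tool_call_id": call["id"], "output": json.dumps(result)})
    return outputs

if __name__ == "__main__":
    calls = [{"id": "call_1", "name": "ask_genie",
              "arguments": '{"question": "Top campaign by ROI last quarter?"}'}]
    print(run_tool_calls(calls))
```

The point of the pattern is that the agent only decides which tool to call; the call itself executes in the bridge's process, which already has a network path to Databricks.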
Reference Implementation A complete, working implementation of the architecture described in this article is available as an open-source project: github.com/vikasgautam18/foundry-genie Azure AI Foundry excels at what it was built for - orchestrating large language models, managing conversation threads, and deciding when tools should be called. Databricks Genie excels at what it was built for - translating natural language into SQL and querying governed data. This new application in the middle will simply be the bridge between them. The diagram below explains this with a WebApp as the bridge. The diagram above shows the end-to-end execution path of a user query flowing from Microsoft Teams through a bot, into an orchestration layer, and finally into a data platform before returning a synthesized response. A Teams message is received by the Bot Service and forwarded to an App Service via POST /api/messages. The App Service validates the request using Bot Framework SDK's CloudAdapter and immediately initializes an AI Foundry thread to manage the interaction state, but importantly, this interaction stays within a private, service-to-service trust boundary, typically enforced via managed identity or private endpoints—no user-level credentials are propagated downstream. A detect step determines whether the input requires tool invocation, effectively acting as a lightweight router between pure LLM response generation and downstream system calls. Azure Bot Service requires a specific app registration (the MicrosoftAppId). With UAMI, the same managed identity serves as both the bot's app ID and the credential for calling AI Foundry. When the workflow requires data access, the App Service explicitly crosses into a new security boundary: Databricks. Instead of reusing upstream identity, the service acquires a scoped Databricks access token (often via OAuth, service principal, or managed identity federation). 
This token is short-lived and purpose-specific, limiting blast radius. The call to the Databricks Genie API initiates execution inside the data platform boundary, where the LLM translates intent into SQL and runs it against a SQL warehouse. Crucially, this isolates data-plane access from the application layer—App Service never directly queries the warehouse. The return path reinforces the separation of concerns. Databricks responses (SQL text, schema, and result sets) are treated as untrusted external input when re-entering the App Service boundary and are explicitly passed into AI Foundry as tool output. Within Foundry, the LLM operates inside its own controlled environment, synthesizing a response without gaining direct access to credentials or underlying systems. The App Service polls both Databricks and Foundry using their respective tokens, respecting independent authentication domains and time-bound sessions. Finally, the response is sent back through the Bot Service to Teams, completing a flow where each hop enforces validation, uses scoped credentials, and maintains strict isolation between user interaction, orchestration logic, AI reasoning, and data execution layers. One of the most interesting hops from an authentication point-of-view is the App Service to Azure Databricks. There are three options available for you to implement: PAT (Personal Access Token) approach uses a Databricks Personal Access Token - a long-lived bearer token issued to a user or service principal as the credential for every API/SQL call. These must be rotated manually, can't carry per-user identity (every call looks like the same principal), and a leaked PAT grants full workspace access until someone revokes it. M2M Mode (Machine-to-Machine OAuth a.k.a. workload identity federation) uses an OAuth 2.0 client-credentials flow between your app's managed identity and Databricks. 
Instead of carrying a static PAT, the app authenticates to a token endpoint with its credentials and receives a short-lived access token (usually ~1 hour) that it then sends in the Authorization: Bearer header for Databricks API/SQL calls. The token is refreshed automatically by the SDK when it expires. It does not carry per-user identity, though - every call to Databricks looks like the service principal, so row-level / Unity Catalog policies based on the end user won't apply.

U2M Mode (User-to-Machine OAuth) uses an OAuth 2.0 authorization-code flow so that calls to Databricks are made as the actual end user, not as the application's service principal. The user signs in once (in Foundry Genie, via an Entra ID / Microsoft consent prompt surfaced in Teams or the web UI), the app receives an access token + refresh token scoped to that user's Databricks identity, and every subsequent SQL/Genie call carries those tokens. The access token is short-lived (~1 hour); the refresh token lets the app mint new ones silently in the background until the user revokes consent or the refresh token expires.

The most important property is identity propagation: when the app queries Databricks on behalf of User1, Databricks sees User1 - so Unity Catalog row/column-level security, table grants, audit logs, and Genie Space permissions all evaluate against User1's entitlements. Two analysts asking the same question can legitimately get different answers (or one can get an "access denied") based on the data they're each allowed to see.

This comes at the cost of slightly higher operational complexity: you now have per-user tokens to store (Foundry Genie keeps them in Redis, keyed by Teams/AAD user ID, encrypted at rest), an interactive consent step the first time a user connects, and token refresh logic running in the background.
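The per-user token handling that U2M mode requires can be sketched as follows. This is an illustration only: it uses an in-memory dictionary where the article describes Redis, and `UserToken`, `UserTokenStore`, the 60-second skew, and the injected `refresh` callable are all assumptions rather than the reference implementation's code.

```python
import time
from dataclasses import dataclass
from typing import Callable, Dict, Optional

@dataclass
class UserToken:
    access_token: str
    refresh_token: str
    expires_at: float  # epoch seconds

class UserTokenStore:
    """In-memory stand-in for a per-user token store keyed by Teams/AAD user ID."""

    def __init__(self, refresh: Callable[[str], UserToken],
                 skew_seconds: float = 60.0,
                 clock: Callable[[], float] = time.time) -> None:
        self._tokens: Dict[str, UserToken] = {}
        self._refresh = refresh    # exchanges a refresh token for a new token pair
        self._skew = skew_seconds  # refresh slightly before the real expiry
        self._clock = clock

    def save(self, user_id: str, token: UserToken) -> None:
        self._tokens[user_id] = token

    def get_access_token(self, user_id: str) -> Optional[str]:
        """Return a usable access token, or None if the user has not consented yet."""
        token = self._tokens.get(user_id)
        if token is None:
            return None  # caller should start the interactive consent flow
        if token.expires_at - self._skew <= self._clock():
            token = self._refresh(token.refresh_token)
            self._tokens[user_id] = token
        return token.access_token
```

Returning `None` for an unknown user keeps the "first-time consent" branch explicit in the caller, and injecting `refresh` and `clock` makes the expiry logic testable without a real identity provider.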
Closing Thoughts: When “Just Ask the Question” Isn’t So Simple The original ask was quite simple: let a marketing director ask, “What was our top campaign by ROI last quarter?” in Microsoft Teams and get a real, trustworthy answer. No dashboards. No exports. No side channels. Just a question and a response. But what we ended up building tells a very different story. Behind that single question sits a five-hop authentication chain spanning three platforms, a fully private network topology with multiple DNS zones, a per-user token store backed by Redis, and a deliberately enforced separation between AI reasoning and tool execution. None of that complexity is accidental. Every layer exists because regulated enterprise environments demand it. And that’s the first lesson worth calling out: enterprise-grade AI systems are shaped more by networking, identity, and governance than by prompts or models. If you’re building something similar - a conversational interface over governed data, running inside a network-isolated environment, with per-user authorization - you’re not alone. The patterns here are reusable, and we hope they save you a few weeks of head-scratching.
vikas_gautam
May 14, 2026 - Microsoft Developer Community Blog
Fixing Broken Markdown in AI Translation: Hardening a Production Pipeline
By Minseok Song and Hiroshi Yoshioka (Microsoft MVPs) TL;DR Recent community feedback, especially from Japanese translations, revealed that many translation failures were not semantic, but structural. Through detailed issue reports and discussions, we identified recurring patterns such as broken links, malformed code fences, inconsistent list structures, and CJK-specific formatting issues. In response, Co-op Translator has undergone a series of structural improvements across multiple releases, culminating in v0.18.1 with enhancements such as parser-based code fence handling, list-aware chunking, language-specific Markdown templates, safer CJK emphasis normalization, more robust image migration, and improved internal anchor consistency. These changes were directly informed by real-world community feedback. We would like to especially thank Hiroshi Yoshioka (Microsoft MVP), whose many detailed reports not only uncovered several of these systemic issues but also made this community report possible. The result is not just improved Japanese translations, but a more reliable and resilient translation pipeline for any repository that depends on Markdown fidelity. Introduction Most translation bugs are not actually translation bugs. They are structural failures. They show up as broken links, missing bold markers, unclosed code fences, skipped content, or images that quietly point to the wrong place. To a learner reading translated technical documentation, those issues can make a page feel untrustworthy. To a maintainer localizing documentation at scale, they reveal something deeper: the translation pipeline is not preserving structure as carefully as it preserves meaning. That insight became much clearer over the past several months through community feedback on Co-op Translator. Co-op Translator helps maintain educational GitHub content across many languages while keeping Markdown, images, and notebooks synchronized as the source evolves. 
As Hiroshi Yoshioka reported a series of Japanese translation issues across real Microsoft learning repositories, each issue looked narrow on the surface: a broken link here, a skipped line there, bold markers not surviving around linked text, HTML image tags not being rewritten, or code fences breaking after chunking. Example of a real community-reported issue where a code block was broken during translation, causing structural corruption in the output. But taken together, those reports exposed a broader pattern: The hardest problem was not “translate this sentence.” The hardest problem was “translate this document without damaging its structure.” This post is a community report on the hardening work that followed, especially in the recent run-up to v0.18.1, and what we learned from those real-world cases. Why these reports mattered One of the most useful things about community feedback is that it reveals failure modes that synthetic tests often miss. These were not edge cases found in toy Markdown samples. The reports came from real translated content in active educational repositories. That meant we were dealing with the kinds of files maintainers actually have to ship: nested lists fenced code blocks inline HTML relative links translated headings migrated image assets CJK punctuation and emphasis edge cases In other words, we were seeing the kinds of Markdown that break when a translation system is only mostly correct. 1) We stopped treating code fences like a regex problem Code fences are not a regex problem—they are a structural one. Left: Regex-based handling breaks code fences and list structure across chunks. Right: Parser-based processing preserves code blocks and their surrounding context as atomic units. One of the earliest recurring themes was code fence integrity. 
A report on incorrectly handled triple backticks highlighted a classic failure mode: if fenced blocks are detected or split incorrectly, placeholders can fall out of sync, chunk boundaries can be corrupted, and the translated file can come back structurally damaged. A later report showed a closely related issue: list items and indented code placeholders could be split into separate chunks, which then caused broken fences downstream. The right fix was not another regex patch. Instead, Co-op Translator moved to a parser-based approach using markdown-it-py for fenced code block detection. This made code block handling spec-aware and more resilient to cases like unmatched fences, variable fence lengths, and info strings. More importantly, it ensured code sections were treated as atomic units during chunking and placeholder restoration. This same principle was extended to list-aware chunking. Rather than splitting Markdown line by line and hoping the model would preserve structure, the pipeline now groups list items together with their continuation lines and indented placeholders such as @@CODE_BLOCK_X@@. This prevents bullets and their associated code content from being separated into different translation chunks. This was not just a better heuristic. It changed the unit of chunking itself. In practice, this required modifying the chunking pipeline to detect and preserve list-item blocks before token-based splitting. Instead of treating each line independently, we introduced a grouping step that keeps the entire list context intact, including nested indentation and code placeholders. The change was implemented directly in the chunking logic: lines = _group_lines_preserving_list_items(part_text) This helper ensures that list items and their associated code blocks are processed as a single unit, preventing structural corruption during translation. 
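The grouping idea can be illustrated with a small sketch. This is not the project's `_group_lines_preserving_list_items` helper, whose implementation is not shown here; `group_list_items` is a hypothetical simplification that attaches indented continuation lines (including nested items and code placeholders) to the list item above them.

```python
import re
from typing import List

# Matches the start of a bullet or numbered list item at any indentation.
LIST_ITEM = re.compile(r"^\s*(?:[-*+]|\d+[.)])\s+")

def group_list_items(text: str) -> List[str]:
    """Group each list item with the indented lines that follow it."""
    groups: List[List[str]] = []
    in_item = False
    for line in text.splitlines():
        indented = line[:1] in (" ", "\t") and line.strip() != ""
        if in_item and indented:
            groups[-1].append(line)  # continuation, nested item, or placeholder
            continue
        groups.append([line])
        in_item = bool(LIST_ITEM.match(line))
    return ["\n".join(group) for group in groups]

if __name__ == "__main__":
    sample = "Intro\n- Step one\n    @@CODE_BLOCK_0@@\n- Step two\n  - nested\nOutro"
    for unit in group_list_items(sample):
        print(repr(unit))
```

In this sketch a blank line ends the current item, so loose lists with blank lines between an item and its code would need extra handling. A token-based splitter would then operate on the returned units instead of on raw lines.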
Why this mattered Technical documentation frequently embeds code examples directly under list items or step-by-step instructions. When these relationships are broken during translation, the issue is not just cosmetic. It results in structurally invalid Markdown and misplaced code blocks that can confuse readers and make examples unusable. These were not edge cases. They appeared in real production documentation where: Fenced code blocks became malformed after chunking List items and their associated code placeholders were separated into different segments Placeholder ordering drifted, breaking reconstruction of the original structure In practice, this meant that even when the translated text was correct, the document itself could no longer be trusted as a working technical resource. What changed in practice Before: Code samples could leak out of their list context List items and code blocks were split across chunks Placeholder ordering could drift, breaking reconstruction After: Code blocks are preserved as atomic units during chunking List-bound code samples remain intact Placeholder ordering is stable across the pipeline 2) We restored internal link consistency across translation chunks Even when each chunk appears locally correct, internal links can break at the document level. Left: Anchor links drift out of sync because headings and links are translated independently across chunks. Right: After document-level normalization, links correctly resolve to their corresponding translated headings. Another cluster of issues surfaced when translating longer Markdown documents: internal links would silently break once the content was processed in chunks. Co-op Translator splits large documents into multiple chunks to fit within model constraints. While this works well for translation itself, it introduces a structural problem. Internal links such as [Go to section](#section-name) depend on heading-derived anchor slugs, and those slugs can change during translation. 
When each chunk is translated independently, links and headings can drift out of sync. In practice, this meant that even when translated headings and links looked correct locally within a chunk, they no longer matched at the document level. Tables of contents, section jump links, and cross-references inside the same file could silently break. The right fix was not to rely on chunk-level correctness. Instead, Co-op Translator introduced a document-level normalization step for internal anchor links. The pipeline now parses both the source and translated Markdown using markdown-it, extracts headings, generates GitHub-style slugs from the translated headings, and then realigns internal anchor links so they correctly point to their corresponding translated sections. Rather than trusting fragment identifiers produced during chunk-level translation, links are reconciled against the final translated document structure. This was not just a small post-processing tweak. It changed where consistency is enforced. In practice, this required introducing a normalization step that runs after all chunks are merged back into a single document. Instead of assuming each chunk is self-consistent, the system now treats the entire document as the source of truth and rebinds internal links accordingly. The change was implemented as a dedicated normalization pass: normalize_internal_anchor_links(source_markdown, translated_markdown) This function aligns fragment identifiers with translated heading slugs, ensuring that internal navigation remains valid even when content has been translated in multiple independent chunks. Why this mattered Technical documentation relies heavily on internal navigation such as tables of contents, section links, and cross-references within the same file. When anchor links drift out of sync with translated headings, the document becomes difficult to navigate even if the translation itself is accurate. 
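The document-level realignment described above can be sketched roughly as follows. This is a simplified illustration, not the project's `normalize_internal_anchor_links`: it finds ATX headings with a regular expression for brevity (the article's pipeline uses a Markdown parser), approximates GitHub's slug rules, and assumes source and translated headings appear in the same order. `github_slug` and `realign_anchors` are hypothetical names.

```python
import re

HEADING = re.compile(r"^#{1,6}\s+(.*)$", re.MULTILINE)
ANCHOR_LINK = re.compile(r"\]\(#([^)\s]+)\)")

def github_slug(heading: str) -> str:
    """Approximate GitHub-style slug: lowercase, drop punctuation, spaces to hyphens."""
    text = heading.strip().lower()
    text = re.sub(r"[^\w\- ]", "", text)
    return text.replace(" ", "-")

def realign_anchors(source_md: str, translated_md: str) -> str:
    """Point internal links at the slugs of the translated headings."""
    src = [github_slug(h) for h in HEADING.findall(source_md)]
    dst = [github_slug(h) for h in HEADING.findall(translated_md)]
    if len(src) != len(dst):
        return translated_md  # structure diverged; leave links untouched
    mapping = dict(zip(src, dst))
    valid = set(dst)

    def fix(match: "re.Match[str]") -> str:
        fragment = match.group(1)
        if fragment in valid:
            return match.group(0)  # already resolves to a translated heading
        return "](#" + mapping.get(fragment, fragment) + ")"

    return ANCHOR_LINK.sub(fix, translated_md)
```

Because the mapping is built from the merged document rather than from individual chunks, a link translated in one chunk can still be matched to a heading translated in another.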
Readers may click on links that lead to incorrect sections or nowhere at all, which significantly reduces trust in the content. These issues surfaced in real-world usage where: Internal links no longer matched translated heading slugs Tables of contents pointed to incorrect or missing sections Cross-references silently broke across chunk boundaries This highlighted that correctness at the chunk level was not enough. Consistency had to be enforced at the document level. What changed in practice Before: Internal links could drift out of sync with translated headings Tables of contents pointed to incorrect or missing sections Cross-references silently broke across chunk boundaries Long documents behaved like fragmented outputs rather than a single unit After: Internal links are realigned with translated heading slugs at the document level Tables of contents correctly resolve to translated sections Cross-references remain consistent across the entire document Long Markdown documents behave as a single coherent unit 3) We fixed CJK emphasis the safe way Bold and italic rendering around CJK text was a recurring and subtle failure point. Issues like “Markdown bold not handled correctly” may look minor, but they reveal a deeper compatibility problem: many Markdown renderers do not consistently apply emphasis when markers sit directly next to CJK characters. To address this, we introduced a dedicated normalization step for emphasis markers. Instead of relying on each renderer to interpret `*`, `**`, and `***` correctly in CJK-adjacent cases, Co-op Translator converts them into equivalent HTML tags such as `<em>` and `<strong>` when the target language is Japanese, Korean, or Chinese. This shifts emphasis rendering from renderer-dependent behavior to deterministic output. What mattered was not just fixing it, but fixing it safely. The normalization is strictly scoped to CJK languages and carefully designed to avoid overmatching. 
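A scoped normalization of this kind might look like the sketch below. It is an assumption-laden illustration rather than Co-op Translator's code: `normalize_emphasis` and the language check are hypothetical, the patterns cover only asterisk-based emphasis on a single line, and inline code spans are skipped by splitting on backticks.

```python
import re

CJK_LANGS = {"ja", "ko", "zh"}
CODE_SPAN = re.compile(r"(`[^`\n]*`)")

# Longest markers first so *** is not consumed by the ** or * rules.
RULES = [
    (re.compile(r"\*\*\*(?!\s)(.+?)(?<!\s)\*\*\*"), r"<strong><em>\1</em></strong>"),
    (re.compile(r"\*\*(?!\s)(.+?)(?<!\s)\*\*"), r"<strong>\1</strong>"),
    (re.compile(r"(?<!\*)\*(?![\s*])(.+?)(?<![\s*])\*(?!\*)"), r"<em>\1</em>"),
]

def normalize_emphasis(text: str, target_lang: str) -> str:
    """Convert asterisk emphasis to HTML tags for CJK targets, skipping inline code."""
    if target_lang.split("-")[0].lower() not in CJK_LANGS:
        return text  # strictly scoped: other languages are left untouched
    parts = CODE_SPAN.split(text)  # odd indices hold the inline code spans
    for i in range(0, len(parts), 2):
        for pattern, replacement in RULES:
            parts[i] = pattern.sub(replacement, parts[i])
    return "".join(parts)

if __name__ == "__main__":
    print(normalize_emphasis("**例**は `a ** b` です", "ja"))
```

Requiring a non-space character right after the opening marker and right before the closing one is one way to avoid treating bullets or stray asterisks as emphasis.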
It does not mutate inline code spans or unrelated fragments. This is critical, because overly aggressive formatting fixes can easily break code, identifiers, or underscore-heavy technical text.

Unlike whitespace-delimited languages, Japanese, Korean, and Chinese often place characters directly adjacent to emphasis markers without clear boundaries. For example, a phrase like:

`**example** is ...`

may be translated into Japanese as:

`**例**は ...`

Here, the particle は is attached directly to the emphasized word. In some Markdown renderers, this breaks the expected boundary around `**...**`, causing the emphasis to render incorrectly or not at all. This pattern is not limited to Japanese. Similar boundary issues can appear across CJK languages due to the absence of whitespace between words.

Why this mattered

Formatting bugs around emphasis may look cosmetic, but they affect readability, hierarchy, and trust, especially in instructional documentation where emphasis often signals warnings, key concepts, or required steps.

What changed in practice

Before:

- Emphasis markers could render inconsistently when adjacent to CJK characters
- Bold and italic formatting could break depending on the Markdown renderer
- Fixes risked overmatching and corrupting code or inline technical content

After:

- Emphasis rendering is deterministic across CJK languages using HTML tags
- Bold and italic formatting remains consistent regardless of renderer behavior
- Normalization is safely scoped, avoiding unintended mutations in code and inline content

Next steps

With the recent release, Co-op Translator now exposes a programmatic API that allows the translation pipeline to be executed directly from Python, not only through the CLI. This is an important step, but it is not the end state. The immediate focus is improving adoption. Documentation and usage patterns are being developed so that the API can be reliably integrated across different environments and workflows. More fundamentally, the direction is shifting.
Co-op Translator is evolving from a repository-specific tool into a reusable translation engine that can operate as part of larger content pipelines. This enables broader use cases, including: Long-form content such as eBooks and technical blogs Developer documentation and static site projects (for example, Docusaurus or Astro) Continuous documentation pipelines that track and update translations as source content evolves Multilingual SDK, API documentation, and knowledge base systems The long-term goal is to treat translation as infrastructure rather than a one-time task. Instead of generating static outputs, the system is being designed to support continuous updates, structural guarantees, and seamless integration into real-world documentation workflows. Why community feedback mattered so much here One of the most encouraging parts of this work is that the most useful reports were not always long reports. Sometimes a single repository link, a screenshot, and one concrete example of broken output were enough to reveal a structural weakness in the translation engine. That feedback created a valuable loop between people reading translated docs and people maintaining the translation tooling. Hiroshi's reports did not just identify isolated defects. They helped surface recurring categories of failure: code fence integrity chunk boundary safety link preservation CJK emphasis compatibility image path migration anchor normalization Once those patterns became visible, the fixes could be implemented in the core and covered with tests so that the broader ecosystem not just one file or one repo would benefit. Why this matters for learners worldwide Co-op Translator is used in educational repositories where translated documentation can lower the barrier to learning for people around the world. That raises the quality bar. A learner should not have to wonder whether a missing bold marker changed the meaning of a sentence. 
A learner should not hit a broken anchor halfway through a tutorial. A learner should not lose trust in a translated page because a code block or image path was corrupted during processing. Improving those details is not cosmetic. It is part of making global technical education more reliable.

Closing thoughts

This community report comes down to a simple truth: Translation quality depends on structural quality. Community feedback helped Co-op Translator get better at preserving the things technical documents depend on most: code fences, lists, links, emphasis, images, and anchors. The result is a more dependable foundation for multilingual documentation, not only for Japanese, but for any repository that needs translated content to behave like a maintained technical artifact rather than a plain text dump.

To everyone who has opened an issue, shared a screenshot, submitted a PR, or stress-tested translated docs in the real world: thank you. That feedback is helping Co-op Translator become a stronger tool for maintainers and a more trustworthy experience for learners.

If you are maintaining multilingual Markdown content, I hope these lessons are useful beyond this project too: use parsers where you can, make structure a first-class concern, and treat community bug reports as design input, not just support tickets.
If you are working on multilingual documentation, you can explore Co-op Translator here: https://github.com/Azure/co-op-translator Selected GitHub references Repository: https://github.com/Azure/co-op-translator Issue #221: https://github.com/Azure/co-op-translator/issues/221 PR #226: https://github.com/Azure/co-op-translator/pull/226 Issue #234: https://github.com/Azure/co-op-translator/issues/234 PR #237: https://github.com/Azure/co-op-translator/pull/237 Issue #235: https://github.com/Azure/co-op-translator/issues/235 Issue #239: https://github.com/Azure/co-op-translator/issues/239 Issue #357: https://github.com/Azure/co-op-translator/issues/357 Issue #362: https://github.com/Azure/co-op-translator/issues/362 Issue #363: https://github.com/Azure/co-op-translator/issues/363 PR #370: https://github.com/Azure/co-op-translator/pull/370 PR #372: https://github.com/Azure/co-op-translator/pull/372 PR #377: https://github.com/Azure/co-op-translator/pull/377 PR #378: https://github.com/Azure/co-op-translator/pull/378 PR #379: https://github.com/Azure/co-op-translator/pull/379 PR #364: https://github.com/Azure/co-op-translator/pull/364 About the authors Minseok Song (Microsoft MVP) is an OSS maintainer of Co-op Translator focusing on GitHub-native multilingual automation. Hiroshi Yoshioka (Microsoft MVP) is a community contributor who has played a key role in improving translation quality through detailed real-world feedback.
MinseokSong
Apr 30, 2026 - Microsoft Developer Community Blog