microsoft foundry
80 TopicsBuilding a hands-free voice concierge with Microsoft Foundry Voice Live and a Hosted Agent
This post walks through a small, working sample that wires the browser microphone to Azure AI Speech Voice Live, binds the realtime session to a Foundry hosted agent, and lets the agent answer travel questions using tool calls. The full source, infrastructure, and labs live in the repository linked at the end. Why this combination matters Voice user interfaces have historically been hard to build well. Streaming audio, partial transcripts, barge-in, voice activity detection, tool dispatch, and audio playback have traditionally meant stitching together five or six services. The combination of Voice Live and a Foundry hosted agent collapses that into one realtime WebSocket session with a single binding field. Voice Live owns the audio loop: speech to text, neural text to speech, semantic turn detection, noise suppression, and echo cancellation. The Foundry hosted agent owns the brain: instructions, memory, model selection, evaluators, and tool calling. The link between them is one query parameter on the WebSocket URL. What this means in practice: the browser never sees a model API key, never instantiates a tool, and never owns the agent prompt. The browser does microphone capture and audio playback. Everything else lives server-side. The scenario The sample is called Contoso Travel Concierge. The user is mid-journey, hands and eyes busy, and wants to ask things like: What is the weather in Tokyo this weekend? Is BA005 from Heathrow on time? What time is check-in at the Marriott Marquis? Each question triggers a tool call on the hosted agent. The reply is short, speakable, and synthesised back to the user in under a second on a warm connection. Architecture There are four moving parts. Three of them are managed Azure services. Only the broker is your code. Browser client – captures PCM16 audio at 24 kHz and streams it over a WebSocket to the broker. Plays back audio chunks the broker forwards from Voice Live. Session broker (FastAPI) – authenticates to Azure with DefaultAzureCredential , builds the Voice Live WebSocket URL with a short-lived bearer token, and relays frames in both directions. Voice Live – the Azure AI Speech realtime endpoint. Transcribes the user, hands the text to the bound agent, and synthesises the agent’s reply. Foundry hosted agent – a prompt-kind agent in Azure AI Foundry with instructions, tool definitions, and the microsoft.voice-live.enabled metadata flag set to true . Two design choices are worth calling out. The broker is small on purpose. It does authentication, URL construction, and WebSocket relay. It does not transcode audio, run business logic, or hold conversation state. Voice Live and the agent already do those things well. The agent binding is a URL query parameter, not an SDK call. There is no per-turn HTTP request to the agent runtime. Voice Live opens a session against the agent once and streams turns through it for the lifetime of the WebSocket. That is what keeps latency low. The Voice Live URL contract This is the single most important thing to get right. The public Microsoft sample that ships under liupeirong/ai-foundry-voice-agent targets a different URL shape ( services.ai.azure.com host, agent-id + agent-access-token parameters, an Authorization header). That shape is rejected by Foundry resources that expose voice-live-enabled agents. The shape below is the one the portal itself uses, and the one this sample dials. Three details cause most failures: The host must be <resource>.cognitiveservices.azure.com , not services.ai.azure.com . The broker rewrites this automatically from VOICE_LIVE_ENDPOINT . The bearer token travels in the authorization query parameter, URL-encoded, with a literal Bearer prefix and a + (or %20 ) before the token. No Authorization header is sent. agent-name and model are both the agent’s display name. agent-version is empty when you want the latest published version. Walkthrough: from clone to spoken reply Prerequisites Python 3.11 or later (the sample is developed on 3.13). The Azure CLI, signed in with az login --tenant <your-tenant-id> . An Azure AI Foundry project in a Voice Live region ( eastus2 , swedencentral , or westus2 ). A deployed prompt-kind agent in that project with Enable Voice Live turned on. The Cognitive Services User role on the Foundry resource for the identity the broker will use. Configure the broker Copy .env.sample to .env and fill in four values: AZURE_AI_PROJECT_ENDPOINT=https://<your-resource>.services.ai.azure.com AZURE_AI_PROJECT_NAME=<your-foundry-project-name> VOICE_LIVE_ENDPOINT=wss://<your-resource>.services.ai.azure.com/voice-live/realtime VOICE_LIVE_API_VERSION=2025-10-01 FOUNDRY_AGENT_ID=<your-agent-name> The agent name is what the Foundry portal shows on the agent card. The broker uses it for both the agent-name and model query parameters. Install and run python -m venv .venv .\.venv\Scripts\Activate.ps1 pip install -r requirements.txt .\scripts\start-local.ps1 The broker exposes three endpoints: GET /healthz – liveness probe. GET /config – returns the session.update the browser sends as its first frame. WS /ws – the bi-directional relay to Voice Live. Smoke test .\scripts\test-session.ps1 A successful run prints: [OK] /ws upgraded -> sent session.update <- {"type":"session.created",…} <- {"type":"session.updated",…} [OK] session.updated received -- E2E works This confirms the entire chain: local broker, DefaultAzureCredential token, Foundry Portal URL shape, Voice Live handshake, and the bound agent acknowledging the session. Open the browser UI Browse to http://localhost:8000/ , click Start talking, and ask one of the sample questions. Transcripts appear in real time and the spoken reply plays back through the audio context. Inside the broker The relay logic is tiny – the heavy lifting is the URL construction. The function below is the canonical reference; copy it if you are porting the pattern to another language. def build_voice_live_ws_url(agent_access_token: str) -> str: """ Build the Foundry Portal style Voice Live WebSocket URL. Auth lives in the query string only. No Authorization header is sent. """ host = _ws_host_from_endpoint(VOICE_LIVE_ENDPOINT) qs = urlencode( { "trafficType": "FoundryPortal", "agent-name": FOUNDRY_AGENT_ID, "agent-version": "", "agent-project-name": AZURE_AI_PROJECT_NAME, "api-version": VOICE_LIVE_API_VERSION, "model": FOUNDRY_AGENT_ID, "client-request-id": str(uuid.uuid4()), "authorization": f"Bearer {agent_access_token}", }, quote_via=quote, ) return f"wss://{host}/voice-live/realtime?{qs}" The relay itself is a pair of asyncio tasks: one forwarding browser frames upstream, one forwarding Voice Live frames back. Audio bytes are passed straight through – the broker never decodes them. Deploying the hosted agent The most reliable way to create a voice-live-enabled agent is the Foundry portal. Agents created via the Assistants v2 SDK do not carry the required metadata by default and will be rejected by the Voice Live URL shape above. The portal steps are: Open the Foundry project, go to Agents, and click New agent. Choose Prompt agent as the kind, name it (for example travel-concierge ), and pick a model deployment. Paste the contents of agent/src/prompts/system.txt into the instructions box. On the Voice tab, switch Enable Voice Live on. This is what sets the microsoft.voice-live.enabled = true metadata. Add the three tools ( get_weather , get_flight_status , get_hotel_info ) from agent/agent.yaml on the Tools tab. Publish the version and write the agent name back to .env as FOUNDRY_AGENT_ID . The full deployment guide, including how to host the broker on Azure Container Apps with a managed identity, is in docs/deployment.md in the repository. Three lessons from getting this to production 1. Voice output must be written for speech, not for screens Foundry agents tend to format answers in markdown with citations like ([data.jma.go.jp](https://…)) . When Voice Live synthesises that text, the user hears the URL read aloud, character by character. The fix is to write the agent instructions so the spoken text never contains URLs, markdown, or symbols. A short block at the end of the agent instructions does the job: Voice output rules - This output is read aloud by TTS. Never include URLs, domain names, or citation markers like "(source.com)" in your reply. Cite by speakable source name only. - Never use markdown for formatting. No asterisks, brackets, backticks, bullets, or hashes. Write in plain spoken sentences. - Keep numbers speakable: say "thirty degrees Celsius", not "30C / 86F". - Keep replies under about 40 words unless the user asks for detail. The browser transcript can still render markdown for the eyes. The sample does so with a small, escaping markdown renderer that whitelists bold, italic, code, and http(s) links only, so the same agent reply looks polished on screen even though the spoken version contains none of it. 2. Identity is simpler than it looks The broker uses DefaultAzureCredential and requests the https://ai.azure.com/.default scope. Locally that resolves to your az login credentials. In Azure Container Apps it resolves to the user-assigned managed identity. In both cases the only role assignment you need on the Foundry account is Cognitive Services User. There is no API key path on the working URL shape – it is bearer tokens all the way down. 3. The wrong sample wastes a day If you start from the public liupeirong/ai-foundry-voice-agent repository against a portal-provisioned voice-live agent, the WebSocket either returns HTTP 400 or closes silently with code 1006. The cause is the URL shape, not your code. The reference probe in scripts/probe_portal_shape.py is the single source of truth for the working contract – keep it as a regression test. Responsible AI and security notes Credentials never reach the browser. Tokens are minted server-side and travel only on the upstream Voice Live URL. No secrets in source. The .env file is gitignored. The .env.sample contains only placeholders. Markdown rendering is escape-first. The browser HTML-escapes the agent reply before applying its small markdown whitelist, and links are restricted to http(s) URLs so the rule cannot emit javascript: hrefs. Tool calls are auditable. Every turn shows up as a run in the Foundry portal under the agent, with the prompt, model output, and tool inputs and outputs visible for review. Voice biometric considerations. If you plan to handle account verification by voice, plug in dedicated speaker recognition rather than relying on the conversational model. Key takeaways Voice Live plus a Foundry hosted agent is a session-level integration, not an API integration. One URL, one binding field, one WebSocket. The browser is a thin client. Authentication, URL construction, and relay all live in a small FastAPI broker. Get the URL shape right ( cognitiveservices.azure.com , token in the query string, agent-name equals model equals the agent display name) and the rest is plumbing. Use the Foundry portal to create the agent so the voice-live metadata is set correctly. Write agent instructions for the ear, not the eye, then layer screen formatting on top in the browser. Get the code and try it Repository: github.com/microsoft/foundry-agent-voice-mode-sample Deployment guide: docs/deployment.md in the repository. Labs: three progressive workshops under labs/ – basic voice, adding tools, and binding a hosted agent. Reference docs: Voice Live in Azure AI Speech and Agents in Microsoft Foundry. If you build something on top of this pattern, open an issue or pull request on the repository. The sample is intentionally small so it stays easy to fork.82Views0likes0CommentsAutomate evaluations | Microsoft Foundry
Trace every run end-to-end, generate synthetic datasets to stress-test on demand, fire automated Red Team attacks at your own agents, and pin down why evaluations fail — all from the Microsoft Foundry control plane. Lock in guardrails that inspect every tool call at runtime, define the risks once, and enforce them across every agent run. Mohammad Abuomar, Responsible AI Principal Architect, shares how to turn a coding agent into production-ready software inside Foundry. Describe the agent, set the row count, confirm. Your test set lands in seconds. Microsoft Foundry’s synthetic dataset generator builds eval data on demand. Get started. Pin down why your agent fails evaluations. Foundry’s Analyze Results uses AI to cluster failures, name the root cause, and recommend specific fixes. Check it out. Lock down agent behavior with the Task Adherence Guardrail. It inspects every tool call against the original task and blocks the off-script ones. Try it in Microsoft Foundry. QUICK LINKS: 00:00 — Microsoft Foundry control plane 00:33 — See a finished agent 02:30 — See where the agent started 03:19 — Traces 04:04 — Built-in monitoring 04:34 — Evaluation types 05:51 — Red team evaluations 07:08 — Evaluation results 08:14 — Built-in Guardrails 08:14 — Wrap up Link References Get everything you need in Microsoft Foundry at https://ai.azure.com Unfamiliar with Microsoft Mechanics? As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast Keep getting this insider knowledge, join us on social: Follow us on Twitter: https://twitter.com/MSFTMechanics Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics Video Transcript: -If you want to build agents that meet your expectations for output quality, performance, safety, and cost, it’s not just about the model or framework you select. The testing, evaluation, and the controls surrounding your agent matter. And that’s what the Microsoft Foundry control plane is designed to do, with tools you can use during development to make sure your agents raise the bar across every important dimension. Today, I’m going to walk through the process of building a coding agent and demonstrate where the controls in Foundry come in to make it better. I’ll start by showing my finished agent, then after that, I’ll show you the steps I took using Foundry to make it production ready. This agent is designed to take a simple user prompt, then find what it needs to build apps automatically. -So, I’ll paste in my prompt asking it to generate a Windows desktop app for personal cashflow management. It needs to be fast, use WebUI, and easy to use for broad appeal. I’m also asking it to make it safe, secure, and follow privacy best practices. And it needs to be easy for a developer to read, maintain, and to add to it. I’ll submit the request and it gets to work, with its reasoning on the left and code on the right. This process takes several minutes to complete with a few interactions in between, so to save a little time, I’ll skip to the result. We can see the agent’s reasoning and plan, with its technology stack, approach and initial action. Then, we can follow all of the steps it performed to author and configure the app and its dependencies. -Below that, is the React code and JavaScript. It asked whether to proceed writing this as an Electron and React setup, and I confirmed. Then it started to write, test and iterate on the app, followed by another question whether to implement more features or focus on security. And I responded to do both. It then finished writing the app and finally it outlined the steps to run the app locally. -So, let’s test it out. I’ll move over to my terminal window running PowerShell and start it. And here is the generated app. It’s fully functional with user authentication. I can enter my first item, Travel Expenses, and the amount, and there’s a Category dropdown menu with pre-configured options, and I’ll choose “Transportation”. And it writes that record into the local data store. This is a simple, production-ready app that the agent was able to create in just a few minutes. But it didn’t start out this way, and if you’ve built agents or apps yourself, you’ll know a lot of what doesn’t get shown is the testing, iteration, and refinement work to end up with production-ready code. Let’s change that. -Let’s go back in time to where this agent started. I’m in Visual Studio Code and this is my agent, which I built using the Foundry SDK. Here are the defined tools for it to use, WebSearch and CodeInterpreter. And on the left, we can see the full list of local tools. Like interacting with the file system, as well as git, patching, registry, local search and running shell scripts. And here in the center is the key SDK line that creates the agent, adds the tools, deployment name and so on. -So, the agent is functional and I’ve also started manual testing. And this is where Foundry controls let me stress‑test the agent to see what works and what doesn’t and see the details for each run. In the Microsoft Foundry portal, I have my agent open and the Traces tab. These are OTel traces of all of the runs for this agent, with the newest runs on top, everything here is backed by Azure Monitor. And I can click into any conversation or Trace ID to view the Input + Output turns for that session. They’re easier to parse than standard logs, speeding up reviews. We can also see the system message, user input, and what the agent did. Along with the agent’s reasoning, the technology stack it used, and the app features. Below that, we can see the development process as well as tool outputs Beyond that, with built-in monitoring, you can get a roll-up view of all activities for our agent with key metrics I’m in the Monitor tab. It shows me the estimated cost and token usage so far. This agent is new so I haven’t configured Evaluations yet, but we’ll get to those in a moment. -Next, you’ll see Operational metrics like the number of agent runs and how many successfully completed or failed, token consumption, tool calls made by the agent, and the error rate over time. Evaluations are where a lot more testing automation comes in to help you improve agent faster. I’m in the Evaluations tab, I need to create my first one. The options are: Automatic Evaluation, where you can automate the process using AI; Human Evaluation, where someone tests the agent and completes surveys; and Red team, where an agent runs automated attacks to expose vulnerabilities. I’ll start with Automatic Evaluation and hit Create. It starts with defining a target. My agent and the version I want are already selected. For data, I can upload an existing dataset or save time by creating a synthetic dataset, which is very cool. This generates data automatically, you just select the number of rows you want. I’ll guide it with a prompt, “Create a dataset for evaluating a coding agent.” I’ll skip the reference file and just Confirm. That automatically generates 90 rows of data to test with. -Next, I’ll choose the evaluation Criteria. There are several built-in evaluators for Agents. Below that are evaluators for Quality. These are editable, so I’ll remove Coherence, Fluency, and Groundedness because my agent doesn’t need them. For Safety, there are seven evaluators, and I’ll keep them as-is and move on to Review, then Submit it. These Automatic Evaluations can take several minutes to complete, so while it’s working, I’ll move into Red Teaming, which is now becoming a core part of AI testing to spot vulnerabilities early on. I’ve started creating my first red team evaluation. Let’s look at the standard configuration for risk categories. You can modify these. It can check for unsafe categories plus ungrounded attributes, code vulnerabilities, and task adherence. It shows the tools that the agent can access. I’ll provide descriptions for web_search, to search the internet for relevant SDKs, and the code_interpreter to run code for the coding agent. Then I’ll Save it. -Next, I’ll change Seed queries from 5 to 10 per category for more testing. In the Attack strategies, I can see exactly what the red teaming agents will try to do and select the ones most relevant to my agent. Each tile describes the attack type that will be tested. I’ll choose AsciiSmuggler, Base64, Jailbreak, StringJoin, UnicodeSubstitution, and IndirectJailbreak. Now, I can review the prohibited actions, including things like attempts to change password, and more. These are all things attackers might try to do with your agent, and we’re automating those tests for you. I’ll hit Submit to get everything started. Now, with two evaluations running, to save a little time, I’ll fast forward to the results of the evaluations. -Here, we can see the two runs. I’ll open the Automatic Evaluation first. Then clicking into the Run shows the details for each evaluator. If I scroll to the right, you’ll see that we’re green almost across the board. One glaring exception is the TaskCompletion score at 59%, which is below my bar, so it’s something to fix. One of my favorite capabilities in evaluation is using AI to analyze the results. I’ll start the analysis, and it creates a nice cluster analysis showing the main issues. I mentioned TaskCompletion before. Here, you can see “incomplete resolution” and “action plan issues”. Drilling in, looks that there is a “lack of actionable output” and the AI suggests specific ways to fix it. This saved me time to find ways to improve my agent. -Now, let’s review our Red Teaming evaluation. I’m at the top level view and I’ll click in to see the issues. Immediately, I can see that the Task adherence is red, which is also related to TaskCompletion. We can fix this using a built-in guardrail to check for task adherence. Guardrails define what risks to detect, from which point in the process, and how to respond. Let’s go to the agent playground. Scrolling down to Guardrails, I can see only the default model guardrail is set. Let’s add another by clicking Manage guardrails and Create. Here, I can define the risks and controls I want to enforce. I’ll start with Risk, and these are the types of risks we can detect and mitigate. There’s an option for “Task adherence” that I’ll choose. This guardrail checks any tool call made by the agent to ensure it’s used appropriately to “adhere” to the task. -Now, I just need hit Submit to activate this guardrail. And the TaskCompletion issue should now be fixed. In fact, here I’ve run another evaluation, and we can see that TaskCompletion is now green and everything meets our overall quality goals. With that, my agent is ready for broader use. And while I focused today on a single agent and using Foundry controls to test it, expose vulnerabilities, and make it better, Foundry also provides fleet-wide performance visibility across all agents and enables centrally applied and enforced policies and configurations to keep agents compliant. -To find out more and get started with these and other controls, you’ll find everything you need in Microsoft Foundry at ai.azure.com. Subscribe to Mechanics for the latest tech updates, and thanks for watching.121Views0likes0CommentsHybrid AI Agents in Python: Routing Between Foundry Local and Microsoft Foundry
Why hybrid, and why now If you build AI features today, you are caught between three forces. Users want low latency and strong privacy. Product teams want frontier reasoning capability. Finance teams want predictable cost. No single model satisfies all three. Run everything on a small on-device model and you bottleneck on complex questions. Send everything to a frontier cloud model and you pay for trivial requests, leak sensitive data across a network boundary, and add hundreds of milliseconds of latency to greetings. The pragmatic answer is hybrid inference: a lightweight local model classifies every request first, simple or sensitive ones stay on the device, and only the genuinely hard or frontier-capability requests escalate to the cloud. Microsoft now ships both halves of that pattern as supported Python SDKs — foundry-local-sdk for on-device inference and azure-ai-projects for Microsoft Foundry cloud models. This post walks through a working reference implementation that combines them behind a single ask() call. The full source is at github.com/leestott/fl-mixedmodel. It is Python-only, secretless by design, and ships with a Gradio diagnostics UI, a CLI demo mode, and a full pytest suite. The contract: one schema, two paths The most important architectural decision is that callers never know which path served a request. Every response, local or cloud, returns the same dataclass: class InferencePath(str, Enum): LOCAL = "local" CLOUD = "cloud" LOCAL_FALLBACK = "local_fallback" # cloud attempted, fell back to local CLOUD_FALLBACK = "cloud_fallback" # local attempted, fell back to cloud @dataclass class AgentResponse: answer: str path: InferencePath model: str reason: str confidence: float latency_ms: float correlation_id: str prompt_tokens: Optional[int] = None completion_tokens: Optional[int] = None fallback: bool = False fallback_reason: Optional[str] = None metadata: dict = field(default_factory=dict) This is what makes the design honest. The router can change, the cloud model can be swapped from gpt-4o to gpt-5.4 , fallback policies can flip — and the calling code never breaks. The four InferencePath values give you full observability without leaking implementation details into the API surface. Architecture in one diagram ┌─────────────┐ prompt ┌──────────────────────────┐ │ caller │ ──────────► │ HybridAgentService │ └─────────────┘ │ .ask(prompt) │ └────────────┬─────────────┘ │ ┌────────────▼─────────────┐ │ RoutingPolicy │ │ 1. Heuristic gate │ │ 2. Local router LLM │ │ 3. Hard policy gates │ └─────┬─────────────┬──────┘ │ │ LOCAL ◄┘ └► CLOUD │ │ ┌──────────▼──┐ ┌──────▼───────┐ │ Foundry │ │ Microsoft │ │ Local SDK │ │ Foundry │ │ (phi-4-mini)│ │ (gpt-5.4) │ └─────────────┘ └──────────────┘ Best practice: the two-stage router pattern Before walking through the implementation, it is worth stating the design pattern explicitly, because it is the part that generalises beyond this specific repo. The cleanest design for hybrid inference is a two-stage router. Stage 1 — local router. A small local model performs intent and complexity classification first. It does not answer the question; it decides where the question should go. Stage 2 — route the answer. If the prompt is simple, private, latency-sensitive, or clearly within local capability, route to a local task model on the device. If the prompt is complex, needs deeper reasoning, a larger context window, or a capability unavailable locally, escalate to a cloud frontier model in Microsoft Foundry. Microsoft's current guidance for the cloud side is to use the Responses API and choose one of two control modes: Pass a specific deployment name (for example gpt-5.4 ) when you want deterministic control over which model serves the request, which is the right choice for regulated workloads, repeatable evaluations, or cost ceilings. Pass model-router as the deployment when you want Microsoft Foundry to automatically select the best available cloud model for each request. This is a sensible default for general-purpose agents where you would rather let the platform optimise the model choice as new ones are released. The reference repo exposes both as environment variables so you can switch without code changes: # .env.example FOUNDRY_CLOUD_MODEL_DEPLOYMENT=gpt-5.4 # deterministic FOUNDRY_CLOUD_ROUTER_DEPLOYMENT=model-router # auto-select Best practice: pin the right SDK versions Two SDKs do the heavy lifting and both have had recent breaking changes, so version discipline matters. Local development — foundry-local-sdk . The current public guidance is to use the Foundry Local SDK package foundry-local-sdk , which provides model discovery, download, cache, load, unload, chat completions, embeddings, audio transcription, and an optional built-in web service. Use version 1.1.0, released on 5 May 2026. Earlier versions used an OpenAI-compatible client surface that has since been replaced by the FoundryLocalManager → load_model → get_chat_client → complete_chat chain shown above. Pin it explicitly: # requirements.txt foundry-local-sdk>=1.1.0 Cloud orchestration and agents — azure-ai-projects . For cloud-side orchestration, Microsoft's current Python guidance is to use azure-ai-projects , which the docs describe as part of the Microsoft Foundry SDK and as the entry point for agents, deployments, connections, datasets, evaluations, and an OpenAI-compatible client returned by get_openai_client() . The current PyPI listing shows azure-ai-projects 2.1.0. Pin it explicitly: # requirements.txt azure-ai-projects>=2.1.0 azure-identity>=1.17.0 If you find yourself reading old samples that import azure.ai.inference as the cloud entry point, or that initialise Foundry Local through a raw openai.OpenAI(base_url=...) client, you are looking at pre-2026 patterns. The current shape is what the reference repo uses: FoundryLocalManager.initialize(Configuration(...)) for the device and AIProjectClient(...).get_openai_client() for the cloud. Stage 1: a deterministic privacy gate Before any model touches a prompt, a deterministic heuristic classifier scans for sensitive patterns — passwords, API keys, SSN/NHS numbers, PII signals, explicit "do not share" flags. If the heuristic returns PrivacyClass.RESTRICTED , the prompt is forced local. The router LLM is not called. The cloud provider is not called. The decision is auditable from a single regex pass. # app/routing/policy.py def decide(self, prompt: str, correlation_id: str = "") -> RoutingDecision: hint, privacy, complexity, h_reason = self._heuristic.classify(prompt) # Hard gate: restricted content never leaves the device if privacy == PrivacyClass.RESTRICTED: return self._make_decision( target=RouteTarget.LOCAL, confidence=1.0, reason=f"Policy hard-gate: {h_reason}", privacy=privacy, complexity=complexity, deterministic=True, correlation_id=correlation_id, ) # Hard gate: very high complexity always goes to cloud if complexity == ComplexityBand.VERY_HIGH: return self._make_decision( target=RouteTarget.CLOUD, confidence=1.0, reason="Policy hard-gate: very_high complexity requires frontier model", ... ) This is the most important responsible-AI control in the whole system. If your privacy review depends on an LLM correctly classifying every prompt, you do not have a privacy control — you have a probability distribution. Deterministic gates first, model judgement second. Stage 2: a local LLM as the router For everything that passes the privacy gate, a small local model classifies whether the prompt needs frontier capability. This is the bit that surprises most engineers: you can do useful routing with a 4B parameter model running on a laptop CPU. The router does not need to answer the question. It only needs to classify it. The reference implementation uses phi-4-mini via Foundry Local. Initialising it is two lines: # app/providers/local_provider.py (excerpt) from foundry_local import FoundryLocalManager from foundry_local.models import Configuration self._manager = FoundryLocalManager.initialize( Configuration(app_name="hybrid-agent") ) self._router_model = self._manager.load_model(self._config.local_router_alias) self._chat_client = self._router_model.get_chat_client() response = self._chat_client.complete_chat( messages=[ {"role": "system", "content": ROUTER_SYSTEM_PROMPT}, {"role": "user", "content": prompt}, ], ) The router prompt asks for a strict JSON response: { "target": "local|cloud", "confidence": 0.0-1.0, "complexity": "low|medium|high|very_high", "reason": "..." } . The application parses it, applies the confidence threshold from config (default 0.6), and falls back to the heuristic decision if the router LLM is unsure or its JSON is malformed. The router never blocks the answer path — that is a deliberate reliability choice. Cloud inference via Microsoft Foundry When the policy returns RouteTarget.CLOUD , the request goes through AIProjectClient , which gives you an openai.OpenAI -compatible client wired to your Foundry project with DefaultAzureCredential . No API keys. No secrets in .env . # app/providers/cloud_provider.py (excerpt) from azure.ai.projects import AIProjectClient from azure.identity import DefaultAzureCredential self._project = AIProjectClient( endpoint=self._config.foundry_project_endpoint, credential=DefaultAzureCredential(), ) self._openai_client = self._project.get_openai_client() response = self._openai_client.chat.completions.create( model=self._config.foundry_cloud_model_deployment, # e.g. "gpt-5.4" messages=messages, max_completion_tokens=max_tokens, ) A subtle gotcha worth flagging: gpt-5 and o-series deployments reject the legacy max_tokens parameter and require max_completion_tokens . They also reject custom temperature values. The reference repo handles this by trying the new parameter first and falling back to the legacy one only when the API returns the specific unsupported parameter error. That keeps the same code working against older deployments without forking the provider. Graceful degradation: the fallback paths Hybrid systems fail in interesting ways. The cloud can be down. The local model can throw because the GPU ran out of memory. A reasoning model can return an empty completion. The service handles all of these by attempting the alternative path and labelling the response so observability stays honest: Cloud route fails → local fallback. The response carries path=LOCAL_FALLBACK , fallback=true , and a populated fallback_reason . The user gets an answer instead of an error. Local route fails → cloud fallback, but only if privacy class is not RESTRICTED. A sensitive prompt that the local model could not handle never leaks to the cloud as a fallback. It returns a clear error instead. This is the second hard gate in the system. Both fail. A structured error response with a correlation ID, never a stack trace. That last rule — fallback respects privacy class — is the kind of decision that is easy to skip and impossible to bolt on later. Encode it once in the service layer and your privacy reviewers will thank you. What it looks like in practice The diagnostics panel in the Gradio UI shows the routing decision live: path, model, confidence, latency, privacy class, complexity band, and the full JSON response. Five canonical scenarios shake out the entire decision tree: "hello" → path=local, confidence=1.0, complexity=low . Heuristic only. No router LLM call. ~3 seconds end-to-end with phi-4-mini cached. "explain transformer self-attention in depth with maths" → path=cloud, model=gpt-5.4, complexity=high . Router LLM classifies, hard gate confirms. "my password is hunter2, suggest a stronger one" → path=local, privacy=restricted, deterministic=true . Privacy gate fires before any model sees it. "summarise this 8 KB document" with cloud unavailable → path=cloud_fallback (local handles it, response is labelled). Complex prompt with local model error → path=local_fallback , fallback_reason populated. You can reproduce all five without any models installed by running python -m app.main --demo . The demo mode swaps the providers for deterministic stubs so you can validate the routing logic and the response schema in under a second on any machine. Operational lessons learned Some things the reference implementation only gets right because it got them wrong first: Pick a non-reasoning model for the router. Reasoning-tuned local models (Phi-4-reasoning, o-style) wrap their output in <think> blocks and blow your JSON parser. phi-4-mini is faster and more reliable for classification. Cache the local model. First load can take 30–60 seconds while Foundry Local downloads weights. Initialise the service once at process startup, not per request. Use correlation IDs everywhere. The service attaches one per request and the structured JSON logger emits it on every event. When you are debugging a fallback path across two model providers, this is the difference between five minutes and five hours. Run the privacy heuristic on every fallback path too. A naive implementation might route locally, fail, and then send the same sensitive prompt to the cloud as a "graceful" fallback. That is not graceful, it is a data leak. Keep configuration in .env and out of code. Privacy mode, fallback toggles, confidence threshold, model aliases — all environment-driven. The config.py module is the only place that reads them. Responsible AI in a hybrid topology Hybrid does not make responsible AI harder, but it does make it different. Three controls earn their keep: Data residency by default. The local path keeps prompts and answers on the device. For RESTRICTED content this is mandatory; for everything else it is a free latency and cost win. Auditability. Every routing decision is logged with the deterministic reason, the heuristic class, the router LLM output, the confidence, and the correlation ID. You can answer "why did this prompt go to the cloud?" months later. Keyless auth. DefaultAzureCredential means there is no API key to leak, rotate, or commit by accident. The repo's .gitignore , SECURITY.md , and pre-push checklist enforce this end-to-end. Try it Five minutes, no Azure account needed for the demo: git clone https://github.com/leestott/fl-mixedmodel.git cd fl-mixedmodel python -m venv .venv .venv\Scripts\activate # Windows # source .venv/bin/activate # macOS / Linux pip install -r requirements.txt python -m app.main --demo # all five scenarios, no models required To run with real models, install Foundry Local, copy .env.example to .env , set your FOUNDRY_PROJECT_ENDPOINT , then: az login python -m app.main --ui --port 7860 Where to go next Repository: github.com/leestott/fl-mixedmodel — full source, tests, specification, screenshots. Foundry Local SDK: pypi.org/project/foundry-local-sdk and the Foundry Local docs. Azure AI Projects SDK: pypi.org/project/azure-ai-projects and the Microsoft Foundry docs. Azure Identity: DefaultAzureCredential reference. Phi-4-mini: Phi-4-mini on Hugging Face. Key takeaways The best-practice pattern is a two-stage router: local model classifies first, then either a local task model or a Microsoft Foundry cloud model answers. For cloud control, use the Responses API with either a named deployment (deterministic) or model-router (auto-select). Pin foundry-local-sdk >= 1.1.0 (5 May 2026) and azure-ai-projects >= 2.1.0 . The 2026 SDK surfaces are not backwards-compatible with pre-2026 samples. Hybrid inference is a routing problem, not a model problem. A small local model is enough to classify the request. Deterministic privacy gates beat probabilistic ones. Code the rules; let the LLM judge only what is left. Return the same response schema from every path. Label fallbacks honestly. Carry a correlation ID everywhere. Keep auth keyless with DefaultAzureCredential and your .env out of git. Test the routing decisions, not just the model outputs. Demo mode and a strong pytest suite pay back every time you swap a model. Hybrid AI is not a compromise between local and cloud. It is the supervisor pattern applied to inference — fast and private where you can be, frontier where you have to be, observable everywhere. The hard part is the contract, not the models.157Views1like0CommentsBuilding the Solution Teams Need to Secure AI Against Prompt Injection
As artificial intelligence continues to evolve, teams are prioritising rapid advancements and deployment of applications while often overlooking security considerations. Emerging threats such as prompt injection remain poorly understood, and this is putting systems, users, and infrastructure at serious risk. Much of the expertise required to mitigate these risks is currently fragmented and inaccessible, concentrated among a small group of cybersecurity specialists. Meanwhile, developers, under pressure to ship quickly, often lack both the tools and frameworks needed to systematically test their AI systems for vulnerabilities. This disconnect is creating a significant gap between the development and security assurance of AI applications. To address this gap, we developed a unified Prompt Injection Testing Platform and knowledge base, powered by Microsoft Foundry, designed to make LLM security testing accessible, structured, and understandable for developers. Project Overview Developers are rapidly integrating LLMs and agents into applications, but: Security testing is not standardised Prompt injection risks are increasingly understood in research, but poorly mitigated in practice by developers There is a lack of accessible, actionable tooling This creates a dangerous gap: applications are being deployed faster than they are being secured. As part of our UCL Industry Exchange Network (IXN) project in collaboration with Avanade, we built a Prompt Injection Testing Platform designed to solve this exact issue by: Providing a knowledge base of vulnerabilities and mitigations Helping teams identify vulnerabilities within their AI systems Enabling custom and automated testing pipelines Integrating tools like Garak for adversarial testing With this, we aim to make prompt injection testing accessible, standard, and understandable. Project Journey We divided our project into several phases: Phase 1: Understanding our Users’ Needs. We began by identifying the core users of our platform: AI developers and broader stakeholders across development, security, and safety disciplines integrating LLMs into their applications. By meeting with them, we uncovered a few key challenges: Developers have limited awareness of prompt injection risks There is a generalised lack of accessible tools for testing This first exploration set a core principle: We must build a developer-first solution which does not depend on extensive technical knowledge to be used. We concluded that to be as useful as possible, our solution should not require prior prompt injection knowledge. In order to solve the two challenges presented by our users, we concluded a platform would be the best approach, as it enables us to centralise fragmented knowledge while providing a structured, scalable environment for testing LLM vulnerabilities in practice. Phase 2: Understanding the Threat Landscape Building on our user research, we focused on developing a deep understanding of the prompt injection threat landscape to inform the design of our platform. This phase involved researching: Different types of prompt injection vulnerabilities Common attack scenarios and override techniques Existing mitigation strategies used in practice Tools and methodologies for prompt injection security testing The most widely used models to ensure our platform would be compatible with real-world systems. We consolidated these findings into a structured technical report, designed to be shared with developers, security testers, and semi-technical stakeholders. The goal was not only to guide our own implementation, but also to contribute to making prompt injection more standard and understandable. From our research, we realised prompt injection is not a single vulnerability, but a rapidly evolving attack surface that requires continuous, scalable testing rather than one-time validation. Phase 3: Building the Platform Guided by both our user insights and the threat landscape analysis, we moved to designing and developing a unified prompt injection testing platform and knowledge base. To do this, we defined three core principles: Developer first: no deep security knowledge would be required Unified: combines education (knowledge base) and execution (testing tools) Scalable: Expert users could extend the platform by bringing their own models, tests, and mitigations. During this stage, we built a platform which allows teams to: Connect their own LLM endpoints Run custom prompt injection tests Execute automated adversarial testing through Garak Access a centralised knowledge base of vulnerabilities and mitigation strategies. Export knowledge base information and test results as PDFs. By the end, we had developed a unified platform that enables developers to systematically test, understand, and mitigate prompt injection vulnerabilities in their AI applications. To understand how our platform works in practice, you can view our demo video. Platform home interface presenting an overview of prompt injection concepts and a structured vulnerability catalogue for exploring attack types and mitigation strategies. Key Features Model Integration and Configuration Users can use models included in the platform or connect their own LLM endpoints, allowing the platform to work across different providers: Supports multiple model providers through Microsoft Foundry Supports custom model integration via HTTP endpoints Enables model configurations such as custom system prompts and mitigation layers. Ensures flexibility as new models and mitigations emerge Testing Suite The platform allows users to create and run custom prompt injection tests tailored to their applications. This involves: Creating and executing targeted prompts Simulating real-world attack scenarios Running predefined adversarial testing suites (integrating NVIDIA Garak) Testing interface showing configuration of prompt injection tests and execution of automated scans, with results and risk evaluation displayed. Knowledge Base A core component of our platform is a structured knowledge base, which is designed to make prompt injection concepts accessible and understandable. This is divided into two key areas: Vulnerabilities: Provides information on different types of prompt injection attacks, including explanations of how each vulnerability works, with real-world examples and scenarios, as well as references to reputable external sources Mitigations: Focuses on how to defend against these vulnerabilities, and it includes clear implementation strategies and code examples demonstrating how to integrate each mitigation. To support exploration, we also included a chatbot interface, which answers questions using knowledge base data and trusted sources. This helps users quickly navigate vulnerabilities and mitigation strategies by providing contextual, reliable information and redirecting users to the appropriate page of our platform. Figure 3: Direct prompt injection analysis view, where users can explore attack techniques, observe unsafe model responses, and review corresponding mitigation approaches. Prompt Enhancer In addition to testing and learning, our platform integrates a prompt enhancer, designed to help users actively improve the security of their system prompts. It works in the following way: Takes an existing prompt as input Draws on the knowledge base insights and best practices Restructures the prompt to improve clarity and robustness Incorporates selected prompt-layer mitigations to reduce prompt injection risk Prompt Enhancer interface showing the application of prompt-layer mitigations (e.g. delimiter tokens, instruction hierarchy enforcement) to restructure and secure a system prompt against prompt injection attacks. Technical Details To support a flexible and scalable testing system, we designed our platform with a modular, layered architecture. This allows different components to operate independently while remaining integrated through clearly defined interfaces, ensuring both extensibility and maintainability. System Architecture We divided our platform into four main layers: Frontend Layer An interactive user interface that allows developers to: Explore the prompt injection knowledge base Configure and run tests View results and vulnerability analysis API Layer The API layer acts as the orchestration and communication layer between the frontend and the core system. Handles requests from the frontend to create and run tests. Provides frontend with available models, mitigations, and configurations. Ensures any newly added models and mitigations can be automatically reflected in the frontend without requiring manual updates. Domain Layer The layer which defines the core structure and logic of the system: Defines interfaces for key components such as mitigations, models, and test runners Establishes the test structure and data models Encapsulates logic to ensure consistency Integration Layer The layer which implements the abstractions defined in the domain layer and connects the platform to external services Implements model providers such as OpenAI, Anthropic, and other external HTTP-based endpoints Implements test runners, including custom prompt runners and external tools such as Garak. Implements database connections and repository classes. Results and Outcomes Through the research and development of our platform, we were able to gain several key insights into the behaviour and security of LLM-based applications: Prompt injection vulnerabilities are more prevalent than expected. Even simple prompts with carefully crafted inputs can unsafely manipulate a model’s behaviour. Lack of structured testing leads to hidden risks. Without a systematic approach, many vulnerabilities remain undetected. It is sometimes time consuming to manually craft unsafe prompts. Combining custom testing with framework-based testing improves coverage. Using both custom prompts (targeted and application-specific scenarios) and framework-driven testing (e.g. Garak) enables a more comprehensive evaluation of model safety, as both expected and unexpected vulnerabilities can be captured Structured prompts can significantly improve robustness. We observed that prompts with a clear structure and embedded mitigations are less susceptible to injection attacks. By the end of our project, we successfully developed a platform that: Bridges the gap between prompt injection knowledge and practical testing. Enables repeatable and structured testing of prompt injection vulnerabilities Provides a unified workflow for learning, testing, and improving prompt security. Supports multiple models and testing approaches, to cover the entire vulnerability safety. We demonstrated that prompt injection risks can be systematically identified, tested, and mitigated through a structured and repeatable approach. Lessons Learned Throughout the project, we identified several key insights that shaped both our technical approach and our understanding of AI security. AI is rapidly evolving, and systems must be designed accordingly. AI models and attack techniques are advancing extremely fast. As a result, static solutions are quickly becoming obsolete. We learned that it is essential to design a platform that is modular, extensible and adaptable. Through well-defined interfaces and generic services, we ensured our platform can evolve alongside attacks and mitigations. Security must be built into development, not considered at testing. Many developers are focusing on functionality first and security often takes a backseat. In the context of LLMs, vulnerabilities can fundamentally affect the security of the system and its users. As such, security should be treated as a core part of the development cycle. Models and external tools should only be connected if their safety is guaranteed. Bridging the gap between developers and security testers is necessary. We identified a major disconnect between developers building AI applications and the security testers evaluating them. These groups often operate with different priorities and levels of knowledge. We are bridging this gap by making prompt injection knowledge more accessible and creating workflows that are usable by developers while still grounded in robust security practices. Further Development While our platform provides a strong foundation for prompt injection testing and knowledge, there are several areas for future exploration: Expanding our testing framework integrations, by adding a broader coverage of attack techniques Integration with MCP servers and external systems, supporting interactions with tools, APIs and external data sources. Addressing additional indirect prompt injection vulnerabilities, including file uploads, website scraping, and multi-step workflows. Looking ahead, we also aim to integrate our platform more deeply into development workflows by introducing CI/CD integrations for continuous security testing and versioned tracking of model robustness over time. Our goal is to evolve the platform into a comprehensive security layer, capable of testing entire AI-driven systems in dynamic, real-world contexts. Conclusion As AI becomes increasingly integrated into real-world applications, ensuring their security is essential. As our research highlights, current practices have not kept pace with the rapid evolution of AI systems and attack techniques. Through our work, we demonstrated that prompt injection risks can be systematically identified, tested, and mitigated using a structured approach. By combining a unified knowledge base with a flexible testing platform powered by Microsoft Foundry, we are taking a step towards making AI systems safer and more reliable. More importantly, our project reinforces a broader idea: a developer-first approach to security, supported by collaboration across development, security, and safety disciplines, is essential for building AI at scale. Security should not remain confined to specialist teams but should be embedded directly into the development process, alongside practices such as red-teaming and continuous testing. Our project empowers teams with the knowledge and tools they need to build safer and more reliable AI systems. If you’re interested in building more secure AI systems or exploring prompt injection in practice, we invite you to join us through the Foundry Community on the 3rd of June at 2pm BST, when we will be showcasing our platform live, walking through real-world examples, and discussing how teams can integrate prompt injection testing into their development workflows. Team Teo Montero Bonet, UCL Computer Science Mario Mojarro Ruiz, UCL Computer Science David Thomas Garcia, UCL Computer Science Nathaniel Gibbon, UCL Computer Science With support from Josh McDonald, Avanade260Views0likes0CommentsCI/CD for AI Agents on Microsoft Foundry
Introduction Building an AI agent is the straightforward part. Shipping it reliably to production with version control, evaluation-driven quality gates, multi-environment promotion, and enterprise governance is where most teams run into friction. Microsoft Foundry changes this. It is Microsoft's AI app and agent factory: a fully managed platform for building, deploying, and governing AI agents at scale. It provides a first-class agent runtime with built-in lifecycle management, making it possible to apply the same CI/CD rigour you already use for application software to AI agents — regardless of whether you are building containerised hosted agents or declarative prompt-based agents. This post walks through a complete, production-ready reference architecture for doing exactly that. You will find the GitHub Actions workflow, the Azure DevOps pipeline YAML, and the architecture diagram linked throughout. Reference implementation repository: foundry-agents-lifecycle and CI/CD for AI Agents on Microsoft Foundry Why Agent CI/CD Is Different Traditional software pipelines gate releases on test pass/fail. Agent pipelines require an additional, critical layer: evaluation-driven quality gates. Before any agent version can be promoted to the next environment, it must pass three categories of evaluation: Quality — answer correctness, task completion rate, hallucination rate Safety — grounded responses, policy compliance, tool usage validation Performance — token usage per query, p95 response latency A second key difference is the deployment unit. You are not deploying a binary or a container tag in isolation. You are deploying an agent version — an immutable artefact that bundles the model selection, system instructions, tool definitions, and configuration together. This is what enables deterministic promotion and full auditability across environments. "Agents follow a standard CI/CD pattern, but with a critical shift: promotion happens at the agent version level, and release gates are driven by evaluation outcomes, not just test results." Reference Architecture Figure 1: End-to-end CI/CD reference architecture for hosted and prompt-based agents on Microsoft Foundry. The architecture has five logical layers, flowing from developer commit to production monitoring: Layer 1 — Developer Layer The developer layer is a standard source-controlled repository in GitHub or Azure DevOps. It contains: Agent code written in Python or .NET agent.yaml or prompt definition files for prompt-based agents Tool configurations: MCP servers, REST API connectors, or other integrations Infrastructure as Code: Bicep or ARM templates for provisioning the Foundry project and dependencies Layer 2 — CI Pipeline (Build · Validate · Evaluate) Every push or pull request triggers the CI pipeline. It performs five steps: Docker build — for hosted agents, build and tag the container image Static checks — lint with ruff , security scan with bandit , agent YAML schema validation Unit and tool tests — pytest suites covering agent logic and tool integrations Evaluation gate — run evaluation datasets; fail the pipeline if thresholds are breached Image push — push the validated container to Azure Container Registry (ACR) Prompt-based agents skip the Docker build step. Instead, the YAML definition and prompt bundle are validated against schema and evaluated against golden datasets. Layer 3 — CD Pipeline (Multi-stage Promotion) A single agent version is promoted through three Foundry project environments: Stage Environment Activities Gate Stage 1 Dev Foundry Project Deploy vNext version, smoke tests, developer evals Eval quality thresholds Stage 2 Test / QA Foundry Project Scenario tests, HITL validation, safety evaluation Eval gates + human approval Stage 3 Production Foundry Project Promote version, enable endpoint, post-deploy smoke test Required reviewer approval Rollback is straightforward: switch the active version pointer back to the previous agent version. No re-deployment is needed. Layer 4 — Microsoft Foundry Agent Service The Foundry Agent Service runtime provides: Hosted agent runtime — managed container execution supporting Agent Framework, LangGraph, Semantic Kernel, or custom code Prompt-based agent runtime — declarative agent definitions, no container required Built-in lifecycle operations — version, start, stop, rollback Entra Agent Identity — each deployed version receives a dedicated Microsoft Entra managed identity RBAC and policy enforcement — Azure role-based access controls per project Observability — distributed traces, structured logs, and evaluation signals Layer 5 — Monitoring, Governance, and Control Plane Foundry control plane: agent registry, environment configuration, version history OpenTelemetry forwarded to Azure Monitor and Application Insights Continuous evaluation pipelines for ongoing quality, grounding, and safety monitoring Azure Policy and RBAC enforcement at the platform level Environment Topology There are two topology options. We recommend Option A for all production workloads: Option Structure Best for Trade-off A — Recommended Dev Project → Test Project → Prod Project (separate Foundry projects) Enterprise workloads Full isolation, clean RBAC boundaries, easier governance B — Lightweight Single Foundry project with agent version tags (dev/test/prod) Small teams, prototyping Simpler setup, but weaker environment separation Separate projects mean separate RBAC policies, separate connection strings, and separate evaluation signals. A developer service principal has access only to the Dev project; the CI/CD identity has restricted access to promote to Test and Production. Evaluation Gates — The Core Difference Evaluation gates transform a standard software pipeline into an AI-safe deployment pipeline. They run at two points: pre-merge (CI) and pre-promotion (CD). Defining the Gates Category Metric CI threshold Prod threshold Quality Hallucination rate < 5% < 3% Quality Task completion rate > 90% > 95% Safety Grounded response rate > 95% > 98% Safety Policy violations 0 0 Performance p95 latency < 4 000 ms < 3 000 ms Cost Token usage per query Track only Alert on > 20% regression Gate Enforcement (Python) import json import sys def check_gates(results_path: str) -> None: with open(results_path) as f: results = json.load(f) failures = [] if results["hallucination_rate"] > 0.05: failures.append(f"Hallucination rate {results['hallucination_rate']:.1%} exceeds 5% threshold") if results["task_completion_rate"] < 0.90: failures.append(f"Task completion {results['task_completion_rate']:.1%} below 90% threshold") if results["latency_p95_ms"] > 4000: failures.append(f"p95 latency {results['latency_p95_ms']}ms exceeds 4000ms threshold") if results.get("policy_violations", 0) > 0: failures.append(f"Policy violations detected: {results['policy_violations']}") if failures: for f in failures: print(f"GATE FAILED: {f}", file=sys.stderr) sys.exit(1) print("All evaluation gates passed — proceeding to deployment") if __name__ == "__main__": check_gates(sys.argv[1]) Hosted vs Prompt-Based Agents — Pipeline Differences Capability Hosted Agents Prompt-Based Agents Deployment unit Container image + agent definition YAML / prompt configuration bundle Build step required Yes — Docker build + ACR push No — YAML validation only Supported frameworks Agent Framework, LangGraph, Semantic Kernel, custom Foundry declarative runtime Promotion artefact Versioned agent with container image reference Versioned prompt/config bundle CI focus Code quality, tool tests, evaluation Prompt schema validation, evaluation Rollback mechanism Switch active agent version Switch active agent version Runtime management Foundry manages container lifecycle Foundry manages declarative runtime CI Pipeline Walkthrough The following steps are representative of the full GitHub Actions workflow available in github-actions-pipeline.yml alongside this post. Hosted Agent CI # 1. Static checks ruff check . bandit -r src/ -ll python scripts/validate_agent_config.py --config agent.yaml # 2. Tests pytest tests/unit/ -v --tb=short pytest tests/tools/ -v --tb=short # 3. Evaluation gate python scripts/run_evaluations.py \ --dataset eval/datasets/golden_set.jsonl \ --output eval/results/results.json python scripts/check_eval_gates.py \ --results eval/results/results.json \ --max-hallucination 0.05 \ --min-task-completion 0.90 \ --max-latency-p95 4000 # 4. Push container image az acr build \ --registry myregistry.azurecr.io \ --image "myagent:$SHA" \ --file Dockerfile . Prompt-Based Agent CI # Validate YAML / prompt definitions python scripts/validate_agent_config.py --config agent.yaml # Evaluation against golden dataset python scripts/run_evaluations.py \ --dataset eval/datasets/golden_set.jsonl \ --output eval/results/results.json python scripts/check_eval_gates.py \ --results eval/results/results.json CD Pipeline Walkthrough Stage 1 — Dev Deployment python scripts/deploy_agent.py \ --env dev \ --image "myregistry.azurecr.io/myagent:$SHA" \ --foundry-endpoint $FOUNDRY_ENDPOINT_DEV \ --agent-config agent.yaml # Returns the new agent version ID, stored for promotion AGENT_VERSION=$(python scripts/get_active_version.py --env dev) Stage 2 — Promote to Test (after approval gate) python scripts/promote_agent.py \ --from-env dev \ --to-env test \ --agent-version $AGENT_VERSION \ --foundry-endpoint $FOUNDRY_ENDPOINT_TEST # Run scenario tests and safety evaluation python scripts/run_evaluations.py \ --dataset eval/datasets/scenario_set.jsonl \ --output eval/results/test-results.json python scripts/check_eval_gates.py \ --results eval/results/test-results.json \ --max-hallucination 0.03 \ --min-task-completion 0.95 Stage 3 — Promote to Production (after required reviewer approval) python scripts/promote_agent.py \ --from-env test \ --to-env prod \ --agent-version $AGENT_VERSION \ --foundry-endpoint $FOUNDRY_ENDPOINT_PROD # Enable the production endpoint python scripts/enable_agent_endpoint.py \ --agent-version $AGENT_VERSION \ --foundry-endpoint $FOUNDRY_ENDPOINT_PROD Rollback # Switch the active version to the previous known-good version python scripts/promote_agent.py \ --from-env prod \ --to-env prod \ --agent-version $PREVIOUS_AGENT_VERSION \ --foundry-endpoint $FOUNDRY_ENDPOINT_PROD # OR delete the failing version python scripts/delete_agent_version.py \ --agent-version $AGENT_VERSION \ --foundry-endpoint $FOUNDRY_ENDPOINT_PROD Deployment Using the Azure AI Projects SDK The azure-ai-projects SDK provides programmatic control over the full agent lifecycle. This is the recommended approach for CI/CD scripts where you need deterministic, scriptable deployment. from azure.identity import DefaultAzureCredential from azure.ai.projects import AIProjectClient # Connect to the Foundry project client = AIProjectClient( endpoint=FOUNDRY_PROJECT_ENDPOINT, credential=DefaultAzureCredential() ) # List existing agents (useful for idempotent deploy scripts) for agent in client.agents.list(): print(f"Agent: {agent.name} version: {agent.id}") # Create a new agent version (hosted agent) agent = client.agents.create_agent( model="gpt-4o", name="my-enterprise-agent", instructions="You are a helpful assistant ...", tools=[...], # tool definitions metadata={"version": GIT_SHA, "environment": "dev"} ) print(f"Created agent version: {agent.id}") For hosted agents, the SDK call also references the container image pushed to ACR. Refer to the Deploy a hosted agent — Microsoft Foundry documentation for the full SDK flow including container image registration and version polling. Reference Implementation Stack Concern Technology Source control and pipelines GitHub Actions or Azure DevOps Pipelines Infrastructure and agent deployment Azure Developer CLI ( azd up ) Programmatic agent lifecycle azure-ai-projects Python SDK Agent evaluation azure-ai-evaluation Python SDK Agent runtime Microsoft Foundry Agent Service Container registry Azure Container Registry (hosted agents only) Observability OpenTelemetry, Azure Monitor, Application Insights Identity and access Microsoft Entra (Agent ID, OIDC workload identity federation) Governance Azure Policy, RBAC, Foundry control plane Governance and Responsible AI Shipping AI agents at enterprise scale requires governance beyond what a traditional CI/CD pipeline provides. Microsoft Foundry addresses this at the platform level: RBAC per environment — each Foundry project has independent access controls. Developers deploy to Dev; only CI/CD service principals (with audited OIDC tokens) can promote to Test and Production. Agent registry and audit trail — the Foundry control plane records which agent version is active in each environment, who deployed it, and when. This satisfies enterprise audit requirements without additional tooling. Content safety and policy enforcement — Azure Policy governs model access, data handling, and content safety rules at the infrastructure level, not just at the application code level. Policy violations block deployment automatically. Entra Agent Identity — each deployed agent version receives a dedicated, short-lived managed identity. Agents authenticate to downstream services using least-privilege credentials scoped to that specific deployment. Continuous evaluation in production — evaluation pipelines run on sampled production traffic, alerting when quality, safety, or cost metrics drift from their baseline. A key trade-off to be transparent about: evaluation datasets must be maintained and updated as the agent's tasks evolve. Stale datasets produce misleading pass/fail signals. Treat your golden evaluation set as a first-class engineering artefact alongside the agent code itself. Pipeline Files Two pipeline files accompany this reference architecture. Both implement the same four-stage pipeline (CI Build, CI Evaluate, CD Dev, CD Test, CD Production) with environment-appropriate approval gates. github-actions-pipeline.yml — GitHub Actions workflow. Uses GitHub Environments for approval gates and OIDC Workload Identity Federation for passwordless Azure authentication. No stored Azure credentials required. azure-devops-pipeline.yml — Azure DevOps multi-stage YAML pipeline. Uses ADO Environments with required approvers and variable groups per environment. Both pipelines share these security practices: OIDC / Workload Identity Federation — no long-lived Azure credentials stored in pipeline secrets Per-environment variable groups, each with scoped connection strings and endpoints Evaluation quality gates enforced before every promotion step Mandatory human approval before production deployment Summary The full pipeline in one view: Developer commit | CI Pipeline ├── Docker build (hosted agents) / YAML validation (prompt agents) ├── Static checks + unit tests + tool tests └── Evaluation gate ← quality · safety · performance | Agent Version created ← immutable, versioned artefact | CD Pipeline ├── Deploy to Dev → smoke tests + eval gate ├── Promote to Test → scenario tests + HITL + approval gate └── Promote to Prod → enable endpoint + monitoring | Microsoft Foundry Agent Service └── Versioned runtime · Entra identity · RBAC · Observability | Control Plane └── Agent registry · Governance · Continuous evaluation Microsoft Foundry provides the platform primitives — versioned agent deployments, multi-environment Foundry projects, built-in lifecycle management, and an enterprise observability stack — needed to operate AI agents with the same confidence as any production software system. The key takeaway: treat the agent version as your deployment artefact, and evaluation outcomes as your release gate. The rest follows familiar CI/CD patterns you already know and trust. Next Steps Clone the CI/CD Repo at leestott/foundry-cicd Clone the reference demo: foundry-agents-lifecycle on GitHub Set up your environment: Set up your environment for Foundry Agent Service Deploy your first hosted agent: Quickstart: Deploy your first hosted agent Understand hosted agent concepts: Foundry Hosted Agents concepts Automate deployments in CI/CD: Automate deployment of Microsoft Foundry agents Manage agent versions: Manage hosted agents — Microsoft Foundry Deploy via SDK: Deploy a hosted agent — Microsoft Foundry SDK and endpoint reference: Microsoft Foundry SDK and Endpoints reference Azure AI Projects SDK: azure-ai-projects Python SDK Azure Developer CLI: Azure Developer CLI (azd) overview Microsoft Foundry documentation hub: Microsoft Foundry on Microsoft Learn9.7KViews7likes0CommentsWhat's New in Microsoft Foundry Labs – May 2026
Four new releases this month — a new benchmark for how agents interact, an experimental end-to-end agentic stack, a faster image model, and a first-party geospatial model. Last month we kicked off this series with a roundup of new Foundry Labs releases across speech, vision, and multimodal AI. This month, we're back with another update — read on to see learn what's new! SocialReasoning-Bench: measuring whether AI agents act in their user's best interest We are moving into a world where agents are interacting with other agents on behalf of their users, and thus, task completion is no longer a sufficient measure of usefulness. What matters is whether the agent advocates well for the person it represents. SocialReasoning-Bench, a new open-source benchmark from Microsoft Research AI Frontiers, measures exactly that. The benchmark currently supports two main scenarios — Calendar Coordination and Marketplace Negotiation — and scores them on two new metrics: Outcome Optimality (the share of available value the agent captures for its principal) and Due Diligence (the quality of the process used, scored against a deterministic reasonable-agent policy). Together they define an operational notion of duty of care. Learn more about SocialReasoning-Bench in Foundry Labs Try it on GitHub MagenticLite, Magentic Orchestrator & Fara 1.5: an end-to-end agentic stack Microsoft Research AI Frontiers also released a complete agentic stack: MagenticLite is the application layer — the next generation of Magentic-UI, with a redesigned chat-and-browser interface and a harness rebuilt for small models. It works across both your browser and your local file system in a single workflow, with browser sessions and code execution sandboxed by Quicksand, the project's open-source QEMU runtime. Transparency is baked in: you see what the agent is reasoning about, you can take direct control at any moment, and critical actions pause for explicit approval. MagenticBrain is the orchestrator of the stack — an orchestration model fine-tuned on Qwen 3 8B that plans, codes, and delegates. Critically, it was trained end-to-end inside the MagenticLite harness with the same tool schemas it sees at inference, eliminating the gap between training and execution. Fara1.5 is the next generation of Microsoft's computer-use model family — three models (4B, 9B, 27B) on Qwen 3.5, with the 9B as the recommended flagship. Fara1.5 sets a new state of the art among small computer-use models on the Online‑Mind2Web benchmark, nearly doubling the performance of the previously released Fara‑7B, and the 27B variant records 90+% on the same benchmark 1 . Together, they represent an open-source, end-to-end agentic stack that work together, so developers can build, plan, and run agents on infrastructure they control. Learn more about MagenticLite on Foundry Labs Try it on GitHub MAI-Image-2-Efficient: high-quality image generation at speed and scale MAI-Image-2-Efficient — Image‑2e for short — is Microsoft's latest text-to-image model, built on the same architecture as MAI-Image-2 (which debuted at #3 on the Arena.ai leaderboard for image model families) but engineered for the production workloads where every millisecond and every GPU hour matters. When normalized by latency and GPU usage, Image‑2e is up to 22% faster and 4x more efficient than MAI-Image-2 — and outpaces leading text-to-image models by 40% on average 1 . In short, it delivers more output for less compute, giving teams the headroom to iterate faster without blowing through their GPU budget. That efficiency unlocks new categories of work. E-commerce platforms, media companies, and marketing teams generating thousands of images per day for targeted ads, concept art, and mood boards translate it directly into larger batches at lower GPU cost. Chatbots, creative copilots, and AI-powered design tools translate it into latency low enough for real-time interaction. The model also has a distinct visual signature — sharp, defined lines that fit illustration, animation, and attention-grabbing photoreal imagery. Learn more about MAI-Image-2-Efficient in Foundry Labs Try it in Microsoft Foundry EO/OS Object Detection: production-grade earth observation Object detection on satellite and aerial imagery has historically required months of in-house computer vision engineering — bespoke models, custom labels, fragile pipelines. EO/OS Object Detection collapses that into a managed first-party endpoint in Microsoft Foundry. Built by the team behind Planetary Computer, EO/OS Object Detection is a model that identifies and localizes objects in overhead imagery and returns bounding-box detections optimized for batch processing of large image archives. It's part of a new GeoAI category in Microsoft Foundry, opening Microsoft's geospatial intelligence stack to anyone building on satellite or aerial data. Defense and intelligence teams analyzing satellite feeds, infrastructure operators monitoring assets at scale, agriculture and energy companies tracking change across vast landscapes, and disaster response teams triaging post-event imagery can all swap a custom one-off detector for a managed endpoint that fits inside their existing Foundry stack. Put simply, the work shifts from "build the detector" to "use the detector" — and the detection signal lands faster, more consistently, and inside the same Microsoft platform their broader AI work already runs on. Learn more about EO/OS Object Detection in Foundry Labs Try EO/OS Object Detection in Microsoft Foundry What's Next Foundry Labs is where Microsoft's most ambitious AI research becomes accessible to builders and where the products you'll rely on tomorrow are taking shape today. There's plenty more in the pipeline. Explore more AI innovations on Foundry Labs Join the Microsoft Foundry Discord community to shape the future of AI together References As tested on April 13, 2026. Compared to MAI-Image-2 when normalized by latency and GPU usage. Throughput per GPU vs MAI-Image-2 on NVIDIA H100 at 1024×1024; measured with optimized batch sizes and matched latency targets. Results vary with batch size, concurrency, and latency constraints.544Views2likes0CommentsSigning in to Microsoft Foundry from OpenClaw using Azure AD: a smoother way to bring your models in
This post is a quick update to walk through the new flow. If you read the previous one, think of this as the easier path I wish I had the first time round. If you have not seen the original, you can find it here: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Pre-requisite: You will need the Azure CLI (azure-cli) installed on your machine. The official install guide for Linux is here: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest I am on Linux so I went the Homebrew route, which keeps things simple. The formula is here: https://formulae.brew.sh/formula/azure-cli Microsoft also has official docs covering the Homebrew/Linuxbrew install: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Once Homebrew is ready, run this in your terminal: brew install azure-cli Why this matters: Before this update, every Foundry model you wanted to use in OpenClaw needed its own API key and endpoint pasted into the config. It worked, but it was tedious, and keys are easy to leak if you are copying them around. The Azure AD path solves both problems. You authenticate as yourself (or a service principal), OpenClaw asks Azure for the list of Foundry resources you have access to, and it brings the models in automatically. Signing in to Microsoft Foundry from OpenClaw via Azure AD A device-code OAuth handshake replaces the old static-API-key flow. OpenClaw delegates auth to the local Azure CLI; the CLI handles the browser-side sign-in, holds the resulting tokens, and refreshes them silently. OpenClaw then walks the Azure resource graph, subscriptions → Foundry resources → model deployments and registers each model into its own config. No API keys move through OpenClaw at any point. Sequence diagram of the OAuth 2.0 device-authorization flow as orchestrated by OpenClaw. Phases 1–3 establish identity (the developer authenticates once, in a real browser, against Azure AD). Phases 4–5 perform service discovery (OpenClaw walks the ARM resource hierarchy, subscriptions → Foundry accounts → model deployments and persists the result to a local provider config). After registration, every model call OpenClaw makes against Foundry reuses the same Azure-CLI-managed token cache: tokens refresh transparently, and access is gated by the Foundry resource's RBAC assignments rather than a static API key. Dashed lines denote return values; the teal line in step 7 marks the single token-issuance event the rest of the system pivots on. Walking through the new flow: Start with the command to onboard openclaw as if you were setting up OpenClaw for the first time: openclaw onboard Kick things off with the OpenClaw onboard command, the same one you would use when setting up OpenClaw for the first time. When it prompts you, choose update values. Next, you will be asked to configure your models. Scroll down a little and you will see Microsoft Foundry listed as a supported provider. Pick it. From here, you have two options. You can sign in with an API key, which is what I covered in the previous blog post, or you can sign in through Azure AD. The Azure AD path is easier and more secure, so that is the one we will use. OpenClaw will give you a URL and a device code. Copy the URL into your browser and use the code to complete the sign in. (This is where the az CLI from the pre-requisite section earns its keep.) If everything worked, you should see a success prompt similar to this: Once you are signed in, OpenClaw will ask you to pick the Azure subscription that your Microsoft Foundry resource lives in. Pick the subscription, then pick the Foundry resource where your models are deployed. And that is pretty much it. All the models you have deployed to that Foundry resource get pulled into OpenClaw automatically. Compared to the old way of pasting API keys and endpoints one by one, this is a huge time saver, and you do not have to babysit any keys. From here you can start using your Foundry-deployed models inside OpenClaw straight away: Wrapping up The Azure AD sign-in option in OpenClaw is one of those small updates that quietly removes a real pain point. If you have ever juggled multiple Foundry endpoints and rotated keys across them, you already know why. With this flow, you sign in once, your models show up, and you can get back to actually building. If you have not tried OpenClaw with Microsoft Foundry yet, this is a good time to give it a go. And if you were holding off because of the key management overhead, that excuse is gone now. References Previous post on integrating Microsoft Foundry with OpenClaw using API keys: Integrating Microsoft Foundry with OpenClaw: Step by Step Model Configuration | Microsoft Community Hub Install the Azure CLI on Linux: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-linux?view=azure-cli-latest Install the Azure CLI on macOS: https://learn.microsoft.com/en-us/cli/azure/install-azure-cli-macos?view=azure-cli-latest#install-with-homebrew Homebrew formula for azure-cli: https://formulae.brew.sh/formula/azure-cli154Views0likes0CommentsIntroducing Grok 4.3 on Microsoft Foundry: Latest Generation Agentic Capabilities
Customers building advanced AI systems increasingly need models that can reason deeply, act autonomously, and integrate reliably into real‑world workflows—all without compromising on governance or cost efficiency. Grok 4.3, xAI’s latest flagship model, is now available in Microsoft Foundry, giving developers and enterprises access to latest agentic intelligence within a production‑ready environment designed for scale. With Grok 4.3 on Microsoft Foundry, customers can more easily experiment with, evaluate, and deploy a powerful new option for agent‑based and domain‑specific applications—while benefiting from the safety controls, monitoring, and operational tooling needed to move from prototype to production with confidence. About Grok 4.3 Grok 4.3 is xAI’s latest flagship model, designed to support agent-based and productivity-focused workflows across a wide range of professional scenarios. Based on information provided by xAI and independent research conducted by Artificial Analysis, Grok 4.3 demonstrates strong performance across multiple benchmarks, reflecting a favorable balance between model capability and reported benchmark cost. *Benchmark data and cost metrics are provided by xAI and independently analyzed by Artificial Analysis. Source: https://artificialanalysis.ai Improved agentic capabilities Grok 4.3 is purpose‑built for agentic systems, improving in tool calling, instruction following, and lower hallucination, as reported by xAI. Grok 4.3 also enables policy‑aware support agents with reliable tool use and consistent behavior across extended conversations. On Microsoft Foundry, Grok 4.3 supports up to a 200k token context window, enabling extended multi‑turn reasoning and agent workflows. Multi-modal and domain‑specific strengths Grok 4.3 delivers strong performance across a range of professional and technical domains: Multimodal analysis: Native understanding across text, images, diagrams, and mixed data sources, enabling synthesis of visual and textual information for complex reasoning tasks. Web development: Excels in full‑stack web development, producing clean, production‑ready code with minimal guidance. Legal reasoning: supports interpretation of contracts, case law, and regulatory documents. Finance agents: supports financial analysis, modeling, and human decisions Built‑In Native Capabilities Grok 4.3 includes powerful native capabilities that simplify real‑world application development: Web search and X search for real‑time context Python code execution for analysis and automation File search (RAG) for enterprise knowledge grounding Excel, PDF, and PowerPoint generation for end‑to‑end workflows Together, these capabilities allow Grok 4.3 to function as a powerful agentic productivity engine, not just a language mode. Why Grok 4.3 on Microsoft Foundry Bringing Grok 4.3 to Microsoft Foundry delivers value beyond raw model performance. When deployed through Foundry, Azure AI Content Safety is enabled by default, adding an additional layer of protection for enterprise use. Customers can review the Microsoft Foundry model card for detailed safety and usage considerations. Microsoft Foundry also provides tools to support our customers with their responsible AI efforts, including model cards during selection, configurable guardrails such as jailbreak detection and content filtering, pre‑deployment evaluations and red teaming, and post‑deployment monitoring and governance. These capabilities help customers maintain output quality and deploy Grok 4.3 responsibly at scale. Pricing Model Deployment Input/1M Tokens Output/1M Tokens Availability Grok 4.3 Global Standard $1.25 $2.50 Public Preview Getting started Grok 4.3 is now available in Microsoft Foundry. Explore the model details in the Foundry model catalog, evaluate it using your own datasets, and start building and deployment in minutes.876Views0likes0CommentsNow in Foundry: Tongyi-MAI Z-Image-Turbo, with FLUX.1-schnell and SDXL base 1.0
This week's Model Mondays edition pairs three models available through the Hugging Face collection in Microsoft Foundry: Tongyi-MAI's Z-Image-Turbo, a new designed for lower latency on a single GPU and native bilingual text rendering; Black Forest Labs' FLUX.1-schnell, a 12B rectified flow transformer distilled to 1–4 step inference and one of the most adopted open-weight image models since its 2024 release; and Stability AI's stable-diffusion-xl-base-1.0 (SDXL), a latent diffusion research model that can be used to generate and modify images based on text prompts. Models of the week Tongyi-MAI: Z-Image-Turbo Model Specs Parameters / size: 6B (BF16) Resolution: Up to 1024×1024 native Primary task: Text-to-image generation (English and Chinese) Why it's interesting (Spotlight) Scalable Single-Stream Diffusion Transformer (S3-DiT) architecture: Z-Image concatenates text tokens, visual semantic tokens, and image VAE tokens into a single unified input stream rather than running text and image through separate branches. This single-stream design can improve parameter efficiency relative to dual-stream DiT architectures at the same capacity. See the Z-Image technical report for details. 8-step inference at sub-second latency, fits in 16GB VRAM: Z-Image-Turbo is distilled with Decoupled Distribution Matching Distillation (Decoupled-DMD) and further refined with DMDR, a method that fuses DMD with reinforcement learning during post-training. The result is a model that runs 8 Number-of-Function-Evaluations (NFE) per image with no Classifier-Free Guidance (CFG)—which roughly halves the per-step compute compared to CFG-based inference. See the Decoupled-DMD and DMDR papers. Native bilingual text rendering and strong instruction adherence: Unlike most open-weight image models, which struggle with legible in-image text, Z-Image-Turbo renders complex English and Chinese text accurately which is useful for posters, signage, packaging mockups, and marketing creative. Try it Imagine you're a community programs coordinator at your city's parks department, planning a new summer event series — a "Cake Picnic in the Park" — designed to bring neighbors together over food in shared green space. The event is a few weeks out. You haven't booked bakery partners yet, so no actual cake exists, and you need marketing assets this week to start driving sign-ups: a hero image for the registration page, a flyer for community centers and libraries, social tiles for the city's channels. Use the prompt below and a photorealistic image, that can now be scaled to become additional assets like printed flyers or social images in minutes using image editing tools (or another model). Prompt: A round layered cake displayed on a white ceramic cake stand, topped with glossy fresh red cherries and smooth pastel pink buttercream frosting piped in delicate rosettes around the edge. One generous slice has been cleanly cut and removed from the front, revealing a perfect cross-section: four distinct horizontal layers alternating between soft pink sponge cake and fluffy white vanilla cream frosting. Professional bakery photography, soft natural window light from the left, shallow depth of field, marble countertop, warm and inviting atmosphere, photorealistic detail on the cake texture, cherry highlights, and frosting swirls. Black Forest Labs: FLUX.1-schnell Model Specs Parameters / size: 12B (rectified flow transformer) Resolution: Flexible up to 2 megapixels Primary task: Text-to-image generation Why it's interesting (Spotlight) Rectified flow transformer with adversarial distillation for 1–4 step inference: FLUX.1-schnell is the distilled, Apache 2.0 sibling of the FLUX.1 family. It uses a rectified flow formulation (a diffusion variant that learns straight-line probability paths between noise and data, reducing the number of solver steps needed) and is further compressed with latent adversarial diffusion distillation. The model generates high quality images in for latency-sensitive workloads. Permissive licensing for commercial use: Released under Apache 2.0, FLUX.1-schnell can be used for personal, scientific, and commercial purposes. This has driven broad adoption across product features that need an open, redistributable image backbone. Strong prompt adherence at its parameter range: At 12B parameters, FLUX.1-schnell sits between the SDXL family and frontier proprietary image models, and it remains a common reference point for evaluating open image generation prompt following—particularly for complex compositional prompts and longer captions—roughly two years after its initial release. Try it Hugging Face Spaces give developers the ability to experiment and try new models before deploying them. Test out a few prompts here: https://black-forest-labs-flux-1-schnell.hf.space then when you are ready, deploy the model in Microsoft Foundry. Stability AI: stable-diffusion-xl-base-1.0 stabilityai/stable-diffusion-xl-base-1.0 · Hugging Face Model Specs Parameters / size: 2.6B UNet (≈3.5B total with text encoders) Resolution: 1024×1024 native Primary task: Text-to-image generation Why it's interesting (Spotlight) Dual text encoder design and an ensemble-of-experts pipeline: SDXL uses two pretrained text encoders—OpenCLIP-ViT/G and CLIP-ViT/L—concatenated to capture both broad semantic alignment and finer-grained token-level cues. It can be run standalone or paired with the SDXL refiner in an ensemble-of-experts pipeline where the base model handles early denoising and the refiner specializes in the final steps. See the SDXL report for the original training and architecture details. CreativeML Open RAIL++-M licensing for managed deployments: SDXL is distributed under the CreativeML Open RAIL++-M license, which permits commercial use and downstream fine-tuning with documented use restrictions. Try it To go deeper on SDXL, take a look at Stability AI's generative-models GitHub repository, which implements the most popular diffusion frameworks for both training and inference and continues to expand with new capabilities like distillation. Getting started You can deploy open-source Hugging Face models directly in Microsoft Foundry in two ways. The first by browsing the Hugging Face collection in the Foundry model catalog and deploying to managed endpoints in just a few clicks. The second way is direct through the Hugging Face Hub, select any supported model and then choose "Deploy on Microsoft Foundry", which brings you straight into Azure. Learn how to discover models and deploy them using Microsoft Foundry documentation: Follow along the Model Mondays series and access the GitHub to stay up to date on the latest Read Hugging Face on Azure docs Learn about one-click deployments from the Hugging Face Hub on Microsoft Foundry Explore models in Microsoft Foundry322Views0likes0Comments"Not Available in Your Region" Isn't a Dead End: A Security Assessment of Global Deployments
You want to build with the latest Microsoft Foundry model. You checked the regional availability, and it isn't there yet — only Global Standard. Now you're weighing the capability you actually need against your instinct to keep everything in a regional SKU. This post is for that moment. This is a more common situation than people realise. Microsoft typically releases new and preview models on Global first, then expands into specific regions over time as capacity is built out. It isn't an oversight. It's how Microsoft makes new capabilities available to the broadest set of customers as quickly as possible. If you want those capabilities, Global is the path. The good news is that the path is well-paved. Microsoft Foundry Global Standard is a secure, enterprise-grade deployment type backed by the same Azure controls you already rely on, with explicit contractual commitments on how your data is used. The data protection guarantees don't change because the model is newer or because regional capacity hasn't caught up — they're the same on day one of a new model on Global as they are on a model that's been deployed regionally for a year. The rest of this post walks through what Microsoft commits to, what you get out of the box, what you add on top, and the small number of cases where Global is genuinely the wrong choice. It's written for three audiences: Developers who want to know if they're allowed to ship on Global. Solution architects weighing the model choice against latency, quota, and resilience. Security architects who need to map Foundry's behaviour to enterprise controls before they sign off. Where does my data actually go? This is the question that drives most of the concern, and the answer has two parts. Mixing them up is what causes the confusion. Data at rest stays in the Azure geography of your Foundry resource. That includes your configuration, uploaded files, stored artifacts, and logs. This is true for Global deployments, exactly the same as it is for regional ones. Microsoft commits to this in the Azure data residency page. Data in processing is different. When you send a prompt, the model processes it in memory for a few hundred milliseconds and returns a response. For Global deployments, that processing can happen in any Azure region where the model is hosted. This is how Microsoft gives you the highest available capacity and the broadest model access. The prompt and response are not persisted as part of inference processing in the region that processed them. Once you separate "where my data lives" from "where the request runs," the residency picture becomes much clearer. Your customer data lives where you put it. The model that processes that data runs on Microsoft's global fleet. You can read the official description on the Microsoft Foundry deployment types page. What Microsoft commits These commitments are contractual, not marketing language — they sit inside Microsoft's Product Terms and Data Protection Addendum. According to the data privacy page for Azure Direct Models, your prompts and completions are not used to train Microsoft or OpenAI models, and your fine-tuned models are exclusively yours. Microsoft is also explicit that your data does not touch consumer OpenAI services: "Microsoft hosts the Azure Direct Models in Microsoft's Azure environment and Azure Direct Models do NOT interact with any services operated by Azure Direct Model providers, for example, OpenAI (e.g. ChatGPT, or the OpenAI API)." For partner and community models served through serverless APIs, the model catalog data privacy page confirms that those models are stateless and that Microsoft does not use prompts or outputs to train any model. What Global does NOT do A Global deployment does not replicate your stored data into other regions, does not expose your prompts to consumer OpenAI services, and does not use your inputs or outputs for training. The only cross‑region behavior is the transient execution of model inference, which is stateless and not customer‑addressable. What Global gives you on day one Before you configure anything yourself, a Global Standard deployment already includes the following: Encryption at rest using FIPS 140-2 compliant 256-bit AES with Microsoft-managed keys, applied transparently. See the Microsoft Foundry architecture page. Encryption in transit using TLS 1.2 or higher, enforced by the platform. Microsoft Entra ID authentication with Azure RBAC. Foundry separates control-plane actions (like creating deployments) from data-plane actions (like invoking models), so you can grant least privilege without writing custom roles. Tenant isolation. Your Foundry resource lives in your subscription, your data lives in your tenant, and any fine-tuned models you create are exclusively yours. Compliance inheritance. Foundry runs on Azure and inherits Azure's compliance controls, including ISO 27001, SOC 1/2/3, HIPAA, PCI DSS, FedRAMP, and many others. The current authoritative list is in the Azure compliance offerings catalogue and the Microsoft Trust Center. This baseline, with no extra configuration, already meets the security posture most enterprise teams target for new workloads. The controls you already know Securing Microsoft Foundry uses the same building blocks as securing any other Azure PaaS service. If your team already knows how to lock down Azure Storage or Azure SQL, you already know how to lock down Foundry. Developers see familiar patterns. Architects get a clean fit into the landing zone. Security architects review the same control surfaces they review elsewhere. The controls you'd apply are exactly what you'd expect: Private networking: Map the Foundry resource to a private IP using Private Link, back it with Private DNS, disable public network access, and route egress through Azure Firewall or an NVA. For agent workloads, Microsoft publishes a private networking template for Foundry Agent Service you can deploy with Bicep or Terraform. Note that Private Link secures the path to the endpoint, not the routing of requests inside the model fleet — you get a private network path without giving up Global's capacity benefits. Azure APIM GenAI gateway: Put Azure API Management's GenAI gateway in front of your Foundry Global deployments to control who can call models, how much they can use, and under what policies, independent of where inference runs. It enforces central auth, per‑consumer token limits, logging, and policy controls, turning Global deployments from “globally available” into centrally governed and auditable services. Identity and secrets: Use Managed Identity for application-to-model calls and avoid embedding API keys in code. Apply Conditional Access to admin sign-in and use Privileged Identity Management for just-in-time elevation on admin roles. Customer-managed keys: If your compliance regime requires key ownership, enable CMK on the Foundry resource via Azure Key Vault for rotation, revocation, and separation of duties. Logging and monitoring: Send diagnostics to a customer-owned Log Analytics workspace, enable the Azure Activity Log, and alert on token-usage spikes, unusual source IPs, and repeated authentication failures. Governance at scale: Use Azure Policy to enforce baselines (allowed locations, mandatory diagnostics, required private access) across your tenant, and pair it with Microsoft Defender for Cloud for continuous posture management. The risk that deserves attention: Data Exfiltration The most common security risk in any LLM deployment, on any SKU, is not Microsoft's infrastructure. It's the application layer. Examples include over-broad RAG retrieval pulling data the user shouldn't see, a tool-calling agent reaching an unintended destination, or a prompt that quietly echoes PII into a downstream log. These risks exist on Global, Data Zone, and Regional deployments equally. Choosing a more restrictive SKU does not mitigate them. The good news is that the mitigations are well understood and entirely under your control: Use Private Endpoints for Storage, AI Search, Cosmos DB, and any other backing services your application uses for RAG, so retrieval traffic stays off the public internet. For tool-calling and agent scenarios, route outbound traffic through Azure Firewall with FQDN filtering, and keep an explicit allowlist of destinations the agent is permitted to reach. Apply DLP and redaction at the application layer for high-risk data classes, before that data ever becomes part of a prompt. Treat prompts and completions as transient. Don't persist them unless you have a specific, auditable reason to do so. Doing this work on a Global deployment gives you exactly the same protection as doing it on a regional one. Is Global Deployment right for you? For most teams building on Microsoft Foundry, the answer is yes. Global Standard gives you: The highest default quotas and the broadest model availability in the catalogue. First access to new models and features, often weeks or months ahead of regional rollouts. Elastic absorption of demand spikes through Microsoft's global capacity pool. A simpler architecture, with no regional duplication or custom failover logic. The full Azure security stack: Entra ID, RBAC, Private Link, CMK, Azure Policy, Defender for Cloud, and Monitor. Contractual guarantees that your data isn't used for training and isn't shared with consumer OpenAI services. Global is not the right choice when a specific regulation explicitly requires inference processing to occur within a named country or zone. Note the word "processing" there: not data at rest, but the transient processing of the prompt itself. These cases do exist, particularly in some government, healthcare, and financial sector contexts, and Microsoft Foundry offers Data Zone (US or EU) and Regional SKUs for exactly those situations. But unless someone has pointed you at a specific clause in a specific regulation that names processing locality, you most likely don't need to step down from Global. Summary Microsoft Foundry Global deployments are secure, compliant, and enterprise‑ready. Data at rest remains in your chosen Azure geography. Prompts and completions are not used for training and do not interact with consumer AI services. Encryption, identity, networking, logging, governance, and monitoring are all first‑class Azure controls. Modified Abuse Monitoring is available for qualifying enterprise customers where required. A short summary for each audience: Developers: you can build on Global with confidence, using the Azure patterns you already know. Solution architects: Global is a sensible default unless a regulatory requirement specifically rules it out. Data Zone and Regional remain available for the cases that need them. Security architects: the control surfaces are familiar, the contractual commitments are explicit, and Global can be approved on the same basis as any other Azure PaaS service handling equivalent data classifications. If you've been defaulting to a regional SKU "just to be safe," it's worth taking a fresh look at whether Global actually fits your workload. In most cases, it will.369Views1like0Comments