tags: `microsoft foundry` `ai agents` `agent framework` `multi-agent` `responsible ai`


Gamifying World Improvement: A Reasoning-Agent RPG on Microsoft Foundry

Building a multi-agent demo is the straightforward part. Building one where you can prove, live, with judges watching, that the agents reasoned, retrieved grounded context, called tools, and deferred to a human before awarding anything: that is where most teams run into friction. We found out on stage. This project was demoed live at the Agents League Reasoning Agents battle on Microsoft Reactor: real-time narration, real Foundry calls, real latency, and a host watching the terminal.

For Microsoft Agents League Battle #2 (Reasoning Agents) the brief was the classic Game Master pattern: an orchestrator decomposes a goal, specialist agents execute, shared state tracks the world. We reskinned it with the biggest stakes we could defend.

The game opens in a world that terraforms the Sahara and automates basic needs: food, water, energy, shelter. A vision that size is never commanded into existence; it has to be aligned. The player enters the story by founding a company on one front of the mission and taking the CEO's chair.

From there the loop is concrete. The player pitches the idea (or pastes a real company URL); a Master Narrator turns it into a quest line; a digital workforce, designed per company rather than hardcoded, produces real launch artefacts: positioning, landing-page structure, launch copy. Nothing earns XP until the human CEO approves it at a verification gate. That is the game's one law, and its title mechanic: your company is the dungeon.

A mission too big to command needs a workforce you can verify: reason on Foundry, ground with Foundry IQ, validate with deterministic tools, ask the human before anything counts.

Why a reasoning-agent demo is different

A normal chatbot demo gates on "did it answer." A reasoning-agent demo has to gate on something harder: can you show the work, and can you stop the work? That reframes the whole build. The deployment unit is not "a model."
It is an orchestrated run with a visible decomposition, logged tool calls, cited retrieval, and a human decision point. Borrowing Lee Stott's framing from CI/CD for AI Agents on Microsoft Foundry: release gates should be driven by evaluation outcomes, not just test results. We applied that idea one level down, to each artefact, at runtime.

The architecture in one diagram

Two properties to read off the diagram. First, the loop is closed: the CEO's gate decision is written to agent memory and recalled into the next chapter's brief. Second, every cloud arrow has a keyless fallback. Clone the repo with no credentials and the whole game still plays in simulation mode.

The pipeline: four agents before any work happens

Most multi-agent demos hardcode their cast. We do not. The business defines the workforce; the workforce defines the quests. Four agents run before any artifact work starts:

- Company Analyst. Scrapes a real URL, reasons about the business, seeds the brief.
- Org Designer. Designs a digital-workforce blueprint for this specific company.
- World Designer. Decomposes the pitch into a chapter quest line.
- Worker Factory. Binds each chapter to a worker and builds it as an Agent Framework agent with tools.

Pitch a bakery and you get a different org chart, and different quests, than a dev-tools startup. That is the difference between a scripted demo and a system.

Workers are real Agent Framework agents on Foundry

Each designed worker is built at runtime with the Microsoft Agent Framework, with inference through the Foundry project Responses endpoint under AAD auth.
Keyless, no secrets in .env:

    # agents/maf_runtime.py
    from agent_framework.foundry import FoundryChatClient

    client = FoundryChatClient(
        project_endpoint=foundry_project_endpoint(),
        model=deployment,              # gpt-5 family deployment
        credential=_aad_credential(),  # DefaultAzureCredential - no API key
    )

Our deterministic validators are exposed to the model as real FunctionTools, capped so a stuck model cannot loop, and every mid-run call writes a receipt (args, result, latency) into the replay log:

    @tool(name=tool_name,
          description=f"Run the deterministic '{tool_name}' check on a draft artifact...",
          max_invocations=2)
    def _t(artifact_json: str) -> str:
        meta["maf_tools_called"].append(tool_name)
        receipt = {"tool": tool_name, "source": "maf-midrun",
                   "args": {}, "result": "", "ms": 0.0}
        ...

So the model can check its own draft mid-run, but the gate score never comes from the model alone.

Best practice: deterministic gates first, model judgement second

This is the single most important reliability decision, and it is the same principle Lee Stott states in the hybrid agents post: code the rules, and let the LLM judge only what is left. Our gate score is a weighted rubric with a deterministic floor. Structural validators (does the landing page have a headline, a CTA, a hero section; is the email parseable; do URLs resolve) set the minimum, and the narrator's rubric judgement can only move the score above that floor, never below the facts.

Three scoring layers, and only one of them can award XP:

- Mid-run tool calls. Scored by deterministic validators. Advisory to the model only; cannot award XP.
- rubric_evaluate. A Foundry model judges weighted dimensions, floored by the validators. Cannot award XP.
- The CEO gate. The human. The only path to XP.

Four proof points are logged on every single invocation, including simulation mode: iq_hits, memory_injected, tools_called, inference_usage. Every claim in the demo is a logged number in the replay log.
It is the same discipline as carrying a correlation ID through every path.

Operational lessons learned

- Stream the reasoning, not just the answer. The UI is a reasoning theater fed by SSE. Every decomposition beat, tool receipt, and retrieval hit arrives as an event. If a user cannot see the handoff, the system feels like one big chatbot.
- Give every cloud arrow a dashed twin. Foundry IQ falls back to local markdown retrieval; the Agent Service memory store falls back to a local JSON file; the model path falls back to scripted simulation. A live demo that depends on perfect connectivity is a demo waiting to fail.
- Cap tool invocations. max_invocations=2 on every FunctionTool. Without it, a model in a tight spot calls the validator in a loop.
- Reasoning models and strict JSON do not mix. Anything that must emit JSON gets a tolerant extractor (_extract_json) that survives think-blocks and markdown fences. This is the same think-block gotcha Lee Stott flags for local router models.
- Scrub secrets at the sink. Captured reasoning traces run through a secret scrubber before they touch the replay log.

Responsible AI

The verification gate is the responsible-AI story, and it is also the lore's law: in the game's fiction, a human holds the seal, because a vision too big to command must keep a human at the root of every result. No artefact becomes progress without explicit human approval; every approval is logged with the full reasoning chain; rejected work goes back for rework with the rejection written into agent memory as binding direction. Deterministic validators bound what the model can claim about its own output, and the replay log preserves the whole chain for audit. Auth is keyless via DefaultAzureCredential. Nothing to leak, rotate, or commit by accident.
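
As a rough illustration of how the scoring rules described above could fit together, the Python sketch below parses a model's rubric judgement tolerantly, applies a deterministic floor, and pays out XP only on human approval. It is not the repository's code. `extract_json` is a hypothetical stand-in for the `_extract_json` helper named earlier, and the rubric weights, required fields, `<think>` tag format, and XP value are all assumptions chosen for the example. Taking the maximum of floor and rubric is one reading of "the rubric judgement can only move the score above that floor".

```python
import json
import re

# Assumed values for illustration; the post does not publish the real rubric.
RUBRIC_WEIGHTS = {"clarity": 0.5, "fit": 0.3, "tone": 0.2}
REQUIRED_FIELDS = ("headline", "cta", "hero")
XP_PER_ARTEFACT = 100


def extract_json(text: str) -> dict:
    """Return the first JSON object in model output, skipping think-blocks and fences."""
    # Remove <think>...</think> blocks first; they may contain stray braces.
    text = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    decoder = json.JSONDecoder()
    # Try to decode from each "{" in turn, so surrounding fences or prose are ignored.
    for match in re.finditer(r"\{", text):
        try:
            obj, _end = decoder.raw_decode(text, match.start())
        except json.JSONDecodeError:
            continue
        return obj
    raise ValueError("no JSON object found in model output")


def validator_floor(artefact: dict) -> float:
    """Deterministic check: fraction of required fields that are non-empty."""
    present = sum(1 for field in REQUIRED_FIELDS if artefact.get(field))
    return present / len(REQUIRED_FIELDS)


def rubric_score(judgement: dict) -> float:
    """Weighted sum of the model's per-dimension scores, each clamped to [0, 1]."""
    total = 0.0
    for dimension, weight in RUBRIC_WEIGHTS.items():
        value = float(judgement.get(dimension, 0.0))
        total += weight * min(max(value, 0.0), 1.0)
    return total


def gate_score(artefact: dict, model_output: str) -> float:
    """The model's judgement can raise the score above the floor, never below it."""
    judgement = extract_json(model_output)
    return max(validator_floor(artefact), rubric_score(judgement))


def award_xp(score: float, ceo_approved: bool) -> int:
    """Only an explicit human approval converts a score into XP."""
    return round(XP_PER_ARTEFACT * score) if ceo_approved else 0
```

For a draft with a headline and a CTA but no hero section, the floor in this sketch is 2/3, so a harsh rubric judgement cannot pull the gate score lower, and `award_xp` still returns 0 until the CEO approves.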

Try it

Five minutes, no Azure account needed for the full game loop:

    git clone https://github.com/princepspolycap/agentsleague-afterbuild
    cd agentsleague-afterbuild
    python3 -m venv .venv && source .venv/bin/activate
    pip install -r submission/requirements.txt
    python3 submission/tools/run_quest_simulation.py --pitch "Your idea here"

For live Foundry runs, copy submission/.env.example to submission/.env and point it at your Foundry project endpoint and a gpt-5 family deployment, then az login.

Where this goes next

Asked on stream where the project goes after the battle, the answer was already on the roadmap:

"I'll deploy what I have as a web app that people can just use... I want to do some local models, make it even more accessible. Something people can play with even after the fact."

That is the rest of this series. Part 2 covers how the agents remember the CEO's decisions, and Part 3 covers the fallback architecture and routing toward Foundry Local. The dungeon stays open after the competition ends.

Useful links:

- Watch the live battle: Agents League - Reasoning Agents on Microsoft Reactor
- Clone the repo: github.com/princepspolycap/agentsleague-afterbuild
- The official challenge (submissions close 14 June 2026): Agents League registration
- Microsoft Foundry docs

Key takeaways

- Map your domain onto a proven orchestration pattern (Game Master) instead of inventing one.
- Design the workforce per input. The business defines the org, the org defines the quests.
- Deterministic gates first, model judgement second. A validator floor stops a model talking its way past a broken artefact.
- LLMs create, tools validate, humans approve, replay logs preserve.
- Ship a simulation fallback for every cloud dependency. Forkability is reliability.
- Log proof points on every invocation so every claim is an auditable number.

The dungeon is not a metaphor for difficulty. It is a metaphor for structure: a mission too big to command gets aligned one company at a time.
Rooms you cannot skip, gates a human must pass, and a logged map of every step you took.
Princeps
Jun 14, 2026, Educator Developer Blog