Gamifying World Improvement: Shipping a Reasoning-Agent RPG on Microsoft Foundry
Part 1 of 5

Building a multi-agent demo is the straightforward part. Building one where you can prove - live, with judges watching - that the agents reasoned, retrieved grounded context, called tools, and deferred to a human before awarding anything: that is where most teams run into friction. We found out on stage.

This project was demoed live at the Agents League Reasoning Agents battle on Microsoft Reactor - real-time narration, real Foundry calls, real latency. Since then it has changed a great deal: it is now a public web app you can play in your browser, with a developer console that exposes every model call, every retrieval, and every logged event.

This series is the build, told properly, one subsystem at a time. This first post is the map: what the game is, the four-agent pipeline behind it, and the one law that makes it safe to demo - nothing counts until a human approves it.

The premise: your company is the dungeon

For Microsoft Agents League Battle #2 (Reasoning Agents) the brief was the classic Game Master pattern - an orchestrator decomposes a goal, specialist agents execute, shared state tracks the world. We reskinned it with the biggest stakes we could defend.

The game opens in a world that terraforms the Sahara and automates basic needs - food, water, energy, shelter. A vision that size is never commanded into existence; it has to be aligned. You enter the story by founding a company on one front of that mission and taking the CEO's chair. From there the loop is concrete:

1. You describe your edge - a real skill, or a public profile URL.
2. A pipeline of Foundry agents reasons about who you are and what venture fits.
3. An Org Designer designs the digital workforce that venture needs.
4. A World Designer decomposes the venture into a chapter quest line.
5. Each chapter is a real Agent Framework worker run that produces an artifact.
6. Nothing earns XP until you, the human CEO, approve it at a verification gate.

That last line is the whole reliability story, and the title mechanic: your company is the dungeon, and you clear it room by room. A mission too big to command needs a workforce you can verify: reason on Foundry, ground with Foundry IQ, validate with deterministic tools, ask the human before anything counts.

The architecture in one diagram

Two properties to read off the diagram. First, the loop is closed: the CEO's gate decision is written to agent memory and recalled into the next chapter's brief. Second, every cloud arrow has a keyless fallback - clone the repo with no credentials and the whole game still plays in simulation mode. We will spend a whole post (Part 4) on that fallback architecture, because forkability is reliability.

The pipeline: agents before any work happens

Most multi-agent demos hardcode their cast. We do not. The business defines the workforce; the workforce defines the quests. You can watch it happen - the preflight runs as a visible pipeline of reasoning steps, not a spinner.

[Figure: The preflight pipeline: Mission Analyst, Profile Analyst, Org Designer, and Antagonist Forge reasoning in sequence]

| Stage | Agent | Output |
| --- | --- | --- |
| 1 | Mission / Profile Analyst | Reads the pitch or profile, frames the world it improves, casts your founder seat |
| 2 | Org Designer | Designs a digital-workforce blueprint for this specific venture |
| 3 | World Designer | Decomposes the venture into a chapter quest line |
| 4 | Worker Factory | Binds each chapter to a worker, built as an Agent Framework agent with tools |
| 5 | Antagonist Forge | Generates a rival counter-org that pressures the run |

Pitch a solar-cell venture and you get a different org chart - and different quests, and a different rival - than a dev-tools startup. That is the difference between a scripted demo and a system.

The result is a "ready" card that shows the whole shape before you commit: your founder seat, your workforce size, your leverage ratio, and the rival who will contest you.
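
To make the "business defines the workforce, workforce defines the quests" chain concrete, here is a minimal sketch of a sequential preflight over shared state. It is not the project's code: `PreflightState`, the stage functions, the five role names, and the one-chapter-per-worker rule are all hypothetical stand-ins, and each stage returns a fixed value where the real game makes a model call. The leverage figure simply divides the workforce size by the one human CEO, as on the ready card.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PreflightState:
    """Shared state that every preflight stage reads from and writes to."""
    pitch: str
    outputs: dict = field(default_factory=dict)  # stage name -> stage output
    trace: list = field(default_factory=list)    # stage names in the order they ran


def mission_analyst(state: PreflightState) -> str:
    # Stand-in for a model call that frames the venture and casts the founder seat.
    return f"CEO of a venture built on: {state.pitch}"


def org_designer(state: PreflightState) -> list:
    # Stand-in: the real stage designs roles per venture; these five are made up.
    return ["researcher", "engineer", "analyst", "marketer", "operator"]


def world_designer(state: PreflightState) -> list:
    # Assumption: one chapter per designed worker, so the org shapes the quest line.
    roles = state.outputs["org_designer"]
    return [f"Chapter {i}: deliver the {role}'s artifact"
            for i, role in enumerate(roles, start=1)]


# Order matters: each stage may read what earlier stages wrote.
STAGES: list[tuple[str, Callable[[PreflightState], object]]] = [
    ("mission_analyst", mission_analyst),
    ("org_designer", org_designer),
    ("world_designer", world_designer),
]


def run_preflight(pitch: str) -> PreflightState:
    state = PreflightState(pitch=pitch)
    for name, stage in STAGES:
        state.outputs[name] = stage(state)
        state.trace.append(name)  # a UI could render each step as it completes
    return state


state = run_preflight("solar-cell venture")
workers = state.outputs["org_designer"]
leverage = len(workers) / 1  # digital workers per human CEO
```

Recording the stage name in `trace` after each step is what would let a front end show the pipeline as visible steps rather than a spinner.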

[Figure: The ready card: founder seat, 5 digital workers behind one human, 5x leverage, and a named rival]

Workers are real Agent Framework agents on Foundry

Each designed worker is built at runtime with the Microsoft Agent Framework, with inference through the Foundry project under AAD auth - keyless, no secrets in .env:

    # agents/maf_runtime.py
    from agent_framework.foundry import FoundryChatClient

    client = FoundryChatClient(
        project_endpoint=foundry_project_endpoint(),
        model=deployment,              # gpt-5 family deployment
        credential=_aad_credential(),  # DefaultAzureCredential - no API key
    )

Our deterministic validators are exposed to the model as real @tool FunctionTools, capped so a stuck model cannot loop, and every mid-run call writes a receipt (args, result, latency) into the replay log:

    @tool(name=tool_name,
          description=f"Run the deterministic '{tool_name}' check on a draft artifact...",
          max_invocations=2)
    def _t(artifact_json: str) -> str:
        meta["maf_tools_called"].append(tool_name)
        ...

So the model can check its own draft mid-run - but the gate score never comes from the model alone. That is the subject of Part 2.

The play loop: a card-stacking roguelike with a CEO chair

Once you descend, the game becomes a roguelike. Each chapter is a "room" on a hero's-journey quest graph - YOU, NEED, GO, SEARCH, FIND, TAKE, RETURN, CHANGE. You issue a move, a worker executes it, and the artifact it produces is scored at a gate. Clear the room and you draw a reward card for your run deck; the rival's threat meter climbs the whole time.

[Figure: The world board: the hero's-journey quest graph, the digital workforce, the Game Masters, and the live economy]

This is where the reasoning becomes legible. When a worker delivers, you see the artifact - here a positioning brief, a rendered org chart, and Q1 OKRs - alongside the line that matters: the deterministic validator scored it 100 of 100; it passes the gate and the company graph grows.
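
As a rough illustration of how a deterministic score and a human decision can combine at a gate, consider the sketch below. The real rubric is the subject of Part 2 and is not shown in this post, so everything here is an assumption: the required fields, the `PASS_FLOOR` value, and the function names are hypothetical. The point is only the shape: the score comes from plain code, and passing requires both the floor and an explicit approval.

```python
import json

# Hypothetical rubric: fields an artifact must contain to score full marks.
REQUIRED_FIELDS = ("title", "summary", "okrs")
# Hypothetical floor: the minimum deterministic score needed to reach the human.
PASS_FLOOR = 70


def validate_artifact(artifact_json: str) -> int:
    """Return a 0-100 score: the share of required fields present and non-empty."""
    try:
        artifact = json.loads(artifact_json)
    except json.JSONDecodeError:
        return 0  # unparseable output earns nothing
    if not isinstance(artifact, dict):
        return 0
    present = sum(1 for name in REQUIRED_FIELDS if artifact.get(name))
    return round(100 * present / len(REQUIRED_FIELDS))


def gate(artifact_json: str, human_approved: bool) -> dict:
    """An artifact counts only if it clears the floor AND the human approves."""
    score = validate_artifact(artifact_json)
    return {"score": score, "passed": score >= PASS_FLOOR and human_approved}


complete = json.dumps({"title": "Positioning brief",
                       "summary": "Who we serve and why",
                       "okrs": ["Sign 3 pilot customers"]})
approved_result = gate(complete, human_approved=True)
unapproved_result = gate(complete, human_approved=False)
```

Because the model never supplies `score` or `human_approved`, nothing it writes about its own draft can move the gate by itself.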

[Figure: A worker delivers a real artifact and a rendered org chart; the deterministic validator scores it and it passes the gate]

Every run carries a live economy - treasury, daily burn, runway, market share, paying customers - and a rival counter-org that reacts to your decisions. The stakes are not cosmetic: spend your runway and the run ends. We will cover the antagonist system and the economy in a later part.

Best practice: make the reasoning legible, not just present

A normal chatbot demo gates on "did it answer." A reasoning-agent demo has to gate on something harder: can you show the work, and can you stop the work? Borrowing Lee Stott's framing from CI/CD for AI Agents on Microsoft Foundry, release gates should be driven by evaluation outcomes, not just test results. We applied that idea one level down - to each artifact, at runtime, with a human at the final gate.

Four proof points are logged on every single invocation, including in simulation mode: iq_hits, memory_injected, tools_called, inference_usage. Every claim the game makes is a number in a log you can open - which is exactly what the developer console (Part 4) exists to show.

Responsible AI

The verification gate is the responsible-AI story - and it is also the lore's law: in the game's fiction, a human holds the seal, because a vision too big to command must keep a human at the root of every result. No artifact becomes progress without explicit human approval; every approval is logged with the full reasoning chain; rejected work goes back for rework with the rejection written into agent memory as binding direction. Deterministic validators bound what the model can claim about its own output. Auth is keyless via DefaultAzureCredential - nothing to leak, rotate, or commit by accident.
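
The two logging behaviours described above, the four proof points per invocation and a rejection becoming binding direction for the next brief, can be sketched together. This is an illustration under assumptions, not the project's schema: only the four field names come from the post, while their value types, the in-memory lists, and the function names are hypothetical.

```python
import json

replay_log: list[str] = []    # stand-in replay log: one JSON object per entry
agent_memory: list[str] = []  # stand-in agent memory: CEO directions, oldest first


def log_invocation(chapter: str, iq_hits: int, memory_injected: int,
                   tools_called: list, inference_usage: dict) -> dict:
    """Append the four proof points for one worker invocation to the replay log."""
    entry = {
        "chapter": chapter,
        "iq_hits": iq_hits,                  # assumed: count of retrieval hits
        "memory_injected": memory_injected,  # assumed: count of memories in the brief
        "tools_called": tools_called,        # assumed: names of tools the model called
        "inference_usage": inference_usage,  # assumed: token counts; zeros in simulation
    }
    replay_log.append(json.dumps(entry, sort_keys=True))
    return entry


def record_gate_decision(chapter: str, approved: bool, note: str) -> None:
    """A rejection is written to memory so later briefs must honour it."""
    if not approved:
        agent_memory.append(f"CEO rejected {chapter}: {note}")


def build_brief(task: str) -> str:
    """Recall every stored direction into the next worker's brief."""
    if not agent_memory:
        return task
    directions = "\n".join(f"- {item}" for item in agent_memory)
    return f"{task}\nBinding direction from the CEO:\n{directions}"


record_gate_decision("NEED", approved=False, note="Cut the jargon")
brief = build_brief("Rewrite the positioning brief")
entry = log_invocation("NEED", iq_hits=0, memory_injected=len(agent_memory),
                       tools_called=[], inference_usage={"total_tokens": 0})
```

Logging `memory_injected` alongside the brief is what turns "the agent remembered the boss" from a claim into a number someone can check.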

Try it

The game is live - play it in your browser, no install: worldforge-game.mangowater-fa8b860a.eastus2.azurecontainerapps.io

Or run it locally - five minutes, no Azure account needed for the full loop:

    git clone https://github.com/princepspolycap/agentsleague-afterbuild
    cd agentsleague-afterbuild
    python3 -m venv .venv && source .venv/bin/activate
    pip install -r submission/requirements.txt
    python3 submission/tools/run_quest_simulation.py --pitch "Your idea here"

For live Foundry runs, copy submission/.env.example to submission/.env, point it at your Foundry project endpoint and a gpt-5 family deployment, then run az login.

What the rest of the series covers

Part 2 - The gate is the product. Deterministic validators, the rubric floor, and why no agent grades itself.

Part 3 - Agents that remember the boss. How a CEO decision becomes binding memory and visibly changes the next artifact.

Part 4 - Local-first, and the replay log. The settings console, routing between Ollama / Foundry Local / cloud Foundry / simulation, and how every action is traced.

Part 5 - The Org Designer bridge. Exporting the game's designed workforce as a portable bundle a real digital-worker platform can provision.

Key takeaways

- Map your domain onto a proven orchestration pattern (Game Master) instead of inventing one.
- Design the workforce per-input - the business defines the org, the org defines the quests.
- Make the reasoning legible: a visible preflight pipeline beats a spinner; a logged proof point beats a claim.
- LLMs create, tools validate, humans approve, replay logs preserve.
- Ship a simulation fallback for every cloud dependency, then deploy the real thing so people can actually play it.

The dungeon is not a metaphor for difficulty. It is a metaphor for structure: a mission too big to command gets aligned one company at a time - rooms you cannot skip, gates a human must pass, and a logged map of every step you took.

Part 1 of 5.
Next: the verification gate - how to let an LLM create while a deterministic floor and a human decide what counts.

Play it: worldforge-game...azurecontainerapps.io
Code: github.com/princepspolycap/agentsleague-afterbuild
Live battle replay: Agents League - Reasoning Agents on Microsoft Reactor
Microsoft Agent Framework and Microsoft Foundry docs