The Gate Is the Product: Human-Verified Artifacts in a Foundry Multi-Agent Game
Part 2 of 5. In Part 1 the loop ended at a verification gate. This post is about why that gate is not a confirmation dialog - it is the core mechanic, the reliability story, and the thing that lets a reasoning-agent system be demoed live without praying.

Most multi-agent demos gate on "did the model produce something." That is a vibe check. A reasoning-agent system that touches anything real needs a harder question: who is allowed to say this artifact is good enough - and can a human stop it? A model can create. A model must not be the thing that certifies its own creation. This post is a code walkthrough of how that rule is enforced.

We build the three-layer scoring ladder from the bottom up - deterministic validators, a model rubric floored by them, and a human gate - then look at the parts that make it survive contact with real reasoning models: tolerant JSON parsing, capped self-check tools, a failure-degradation ladder, and four proof points that are tested on every run. Every snippet is from the shipped repo, and a file map at the end tells you where to read the rest.

Three layers, one rule: no agent grades itself

When a worker finishes a chapter, the artifact passes through three layers before it can become progress. Each layer has a different scorer, and only the last one - a human - can award XP.

| Layer | Who scores | Can it award XP? |
|---|---|---|
| Mid-run tool calls | Deterministic validators | No - advisory to the model |
| rubric_evaluate | Foundry model judging weighted dimensions, floored by validators | No |
| CEO gate | The human | Yes - the only path to XP |

The order is the whole point. Deterministic validators set a floor the score can never fall below. The Foundry model's rubric judgement can move the score above that floor - reward genuine quality - but it can never talk the score below the facts the validators established. Then the work stops and waits for a person.

Everything below is in the repo, so this post reads as a code walkthrough as much as an essay.
The three layers live in three files: the validators in submission/tools/code_interpreter_wrappers.py , the rubric and floor in submission/agents/worker_factory.py , and the human gate that awards XP in submission/tools/server.py . Open them alongside this post. The thing being judged: what a worker actually returns Before you can score an artifact you have to agree on its shape. A worker does not return prose - it returns a typed JSON object whose keys the validators know how to read. A designer returns a landing page ( hero_headline , cta_text , url , features ); a strategist returns positioning ( target_audience , core_problem , value_proposition , primary_benefit ) and an org chart ( org_chart , okrs_q1 ); a marketer returns a financial plan ( gtm_channels , financial_plan ) and an email ( subject , body ). This contract is the hinge of the whole reliability story. Because the shape is fixed, the validator that reads it can be a dumb, fast, deterministic function - not a second model trying to interpret freeform text. In simulation mode the very same shapes are produced by _mock_artifact in worker_factory.py , which is why the gate behaves identically after a fresh clone with zero credentials: the artifact the validator reads looks the same whether a Foundry model wrote it or the simulator did. There is one wrinkle worth calling out, because it is where most "the validator says zero but the page looks fine" bugs come from: models are inconsistent about keys. One run returns hero_headline , another headline , another a nested hero.headline . 
Rather than make the validator guess, a small adapter - _landing_payload - coalesces those variants into the canonical shape before the validator ever sees them: # submission/agents/worker_factory.py - _landing_payload (excerpt) return { "hero_headline": page.get("hero_headline") or page.get("headline") or hero.get("headline") or artifact.get("headline") or "", "cta_text": page.get("cta_text") or cta.get("text") or artifact.get("cta") or "", # ... features, url ... } Normalising at the boundary keeps the validators pure: they assume one schema, and the messy job of mapping a model's many spellings onto that schema lives in one adapter, not smeared through every check. Layer 1: deterministic validators you could unit-test Layer 1 is a handful of pure functions in submission/tools/code_interpreter_wrappers.py . Each takes the artifact dict, runs structural checks, accumulates a 0-100 score, and returns (success, results) where results carries a per-check breakdown and human-readable feedback. Here is the landing-page validator, trimmed to its spine, because it is representative: # submission/tools/code_interpreter_wrappers.py def validate_landing_page(data): results = {"checks": {}, "score": 0, "feedback": []} if len(data.get("hero_headline", "").strip()) >= 15: results["checks"]["hero_headline_valid"] = True results["score"] += 30 cta = data.get("cta_text", "") if len(cta.strip()) >= 3: results["checks"]["cta_valid"] = True results["score"] += 20 # ... url format (+30) and a simulated http_status_200 (+20) ... success = results["score"] >= 70 return success, results There is no model anywhere in that function. A headline shorter than fifteen characters earns zero points and a line of feedback; a missing call to action earns zero points and a line of feedback. The thresholds are explicit and the points are explicit, so the score is reproducible to the digit - run it a thousand times, get the same number a thousand times. 
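To watch the adapter and the validator cooperate, here is a minimal, self-contained sketch. It is not the repo's code: `normalise_landing` and `score_landing` are hypothetical stand-ins, the sample artifacts are made up, and only the headline and CTA rules quoted above are reproduced (the URL and HTTP-status checks are left out, so 50 is the highest score this sketch can return).

```python
def normalise_landing(artifact):
    """Hypothetical adapter: map a model's key variants onto one canonical shape."""
    hero = artifact.get("hero") if isinstance(artifact.get("hero"), dict) else {}
    raw_cta = artifact.get("cta")
    cta = raw_cta if isinstance(raw_cta, dict) else {}
    return {
        "hero_headline": artifact.get("hero_headline")
        or artifact.get("headline")
        or hero.get("headline")
        or "",
        "cta_text": artifact.get("cta_text")
        or cta.get("text")
        or (raw_cta if isinstance(raw_cta, str) else ""),
    }


def score_landing(data):
    """Hypothetical validator: only the two checks shown in the excerpt above."""
    results = {"checks": {}, "score": 0, "feedback": []}
    if len(data.get("hero_headline", "").strip()) >= 15:
        results["checks"]["hero_headline_valid"] = True
        results["score"] += 30
    else:
        results["feedback"].append("hero_headline shorter than 15 characters")
    if len(data.get("cta_text", "").strip()) >= 3:
        results["checks"]["cta_valid"] = True
        results["score"] += 20
    else:
        results["feedback"].append("cta_text missing or too short")
    return results


# Three spellings of the same page, as a model might return them across runs.
variants = [
    {"hero_headline": "Solar power for rural clinics", "cta_text": "Join the pilot"},
    {"headline": "Solar power for rural clinics", "cta": {"text": "Join the pilot"}},
    {"hero": {"headline": "Solar power for rural clinics"}, "cta_text": "Join the pilot"},
]
scores = [score_landing(normalise_landing(v))["score"] for v in variants]
print(scores)  # [50, 50, 50]
```

Because the adapter runs first, all three spellings score the same; without it, the second and third variants would lose the headline points even though the page content is identical.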
The other validators follow the same grammar, and each one encodes a small piece of domain judgement as code rather than as a prompt: validate_positioning requires four fields - target audience, core problem, value proposition, primary benefit - each non-trivial (more than ten characters). Four fields, twenty-five points each, pass at seventy-five. validate_org_chart wants a non-empty org chart that contains a Founder role, and OKRs where each objective carries at least two key results. A chart with no founder, or objectives with no measurable key results, simply does not score. validate_marketing_email checks a subject of real length, body copy past a hundred characters, and the literal presence of a call-to-action marker ( [CTA] , a link, "Sign up"). validate_financial_plan is the most opinionated, and the best illustration of the principle. It checks that the MRR ramp is monotonically increasing and that the breakeven month lands in a sane 1-24 range: # submission/tools/code_interpreter_wrappers.py is_monotonic = all(nums[i] <= nums[i + 1] for i in range(len(nums) - 1)) # ... be = fp.get("breakeven_month") if isinstance(be, (int, float)) and 1 <= be <= 24: results["checks"]["breakeven_sane"] = True results["score"] += 15 A model can write a beautiful narrative around a revenue plan that quietly shrinks in month four, or that breaks even in month ninety. A monotonicity check catches the first; a range check catches the second. Neither needs a second opinion from a language model - they are arithmetic. That is the whole thesis of Layer 1: anything you can state as a rule, state as a rule. One set of validators, two jobs The same pure functions do double duty, and that is deliberate. 
Through _score_artifact they compute the floor - the role's validators run, and the floor is the highest score any of them returns, plus a richness heuristic so an off-schema artifact never floors at a flat zero. Through _maf_tool_fns the same functions are wrapped as the mid-run FunctionTools, whose invocation cap is covered later in this post. One implementation, one source of truth for "is this artifact structurally sound" - exposed both as the gate's floor and as the tool the model calls on its own draft. When you change a validator, the model's self-check and the gate's floor move together; they can never drift apart.

The max is a deliberate choice, not a shortcut. A strategist artifact carries both an org chart and a positioning block; taking the best validator score means a strong org chart is not dragged down by a thin positioning section, while a worker that nails neither still cannot fake a floor. It is a forgiving floor for partial work and an honest one for empty work.

Layer 2: rubric_evaluate, floored

Layer 1 tells you whether an artifact is well-formed. It cannot tell you whether the positioning is sharp or the OKRs are ambitious. That nuance is what Layer 2 is for, and it is where a Foundry model finally gets to judge - inside a cage. rubric_evaluate in submission/agents/worker_factory.py scores the artifact on four weighted dimensions:

    # submission/agents/worker_factory.py
    RUBRIC_DIMENSIONS = [
        ("Relevance to goal", 30),
        ("Completeness", 25),
        ("Actionability", 25),
        ("Clarity & structure", 20),
    ]

In live mode it asks the narrator deployment - the same Foundry model that powers the Master Narrator - to score each dimension 0-100 and return strict JSON, with a generous token budget because reasoning models spend tokens thinking before they answer:

# submission/agents/worker_factory.py - inside rubric_evaluate resp = create_chat_completion( deployment, [ {"role": "system", "content": ( "You are a strict rubric evaluator for business artifacts. 
" "Score the artifact 0-100 on each dimension: " + dims_spec + ". " "Return ONLY JSON: {dimensions: [...], verdict: one sentence}.")}, {"role": "user", "content": ( f"Venture brief: {brief[:600]}\nStage goal: {stage.goal}\n" f"Artifact JSON:\n{json.dumps(artifact)[:4000]}")}, ], max_completion_tokens=2500, ) parsed = _extract_json(resp.choices[0].message.content or "") or {} The prompt names the dimensions and their weights inline (via dims_spec ), demands JSON only, and caps the artifact at 4000 characters so a sprawling object cannot blow the context window. The response still goes through the same _extract_json we will meet in a moment - because even a "return ONLY JSON" instruction is a request, not a guarantee. Then it does something important: it does not trust the model's structure. It re-anchors the model's answer to our own spec - our dimension names, our weights - and keeps only the model's scores: # submission/agents/worker_factory.py - inside rubric_evaluate by_name = {str(d.get("name", "")).strip().lower(): d for d in dims if isinstance(d, dict)} dimensions = [] for name, weight in RUBRIC_DIMENSIONS: d = by_name.get(name.lower(), {}) dimensions.append({ "name": name, "weight": weight, "score": max(0, min(100, int(d.get("score", floor)))), "note": str(d.get("note", ""))[:120], }) This is a small piece of defensive engineering with a large payoff. A model asked for four dimensions might return three, invent a fifth, or quietly reweight them so its favourite one dominates. By looping over our RUBRIC_DIMENSIONS and pulling scores by name - defaulting any missing dimension to the validator floor, clamping every score to 0-100 - the weighting stays ours. The model colours inside lines it did not draw. Then the two layers meet in one line: rubric["final"] = max(floor, rubric["weighted_total"]) . 
The final score is one line of math

That single line - final = max(floor, weighted_total) - is the entire trust model, and it is worth spelling out, because it is only three numbers: the validator floor, the model's weighted total, and the larger of the two. If the artifact is structurally sound, the floor is high and the model can only push the score higher by recognising genuine quality. If the model is having a bad day and lowballs a perfectly valid artifact, the floor protects it. The model's judgement is additive, never subtractive. You get the nuance of a model evaluator with the safety of a deterministic one - and you can explain, down to the point, exactly why any score is what it is.

Why a deterministic floor, not just a model judge

This is the single most important reliability decision, and it is the same principle Lee Stott states in his Hybrid AI Agents in Python post: code the rules, and let the LLM judge only what is left. As he puts it about privacy controls - if your check depends on an LLM correctly classifying every input, you do not have a control, you have a probability distribution.

We apply the identical principle to artifact quality. The validators are boring on purpose. Does the landing page have a headline, a CTA, a hero section? Is the marketing email parseable? Do the URLs resolve? Does the org chart have a Founder and key results on every objective? These are structural facts, checked in code, that no amount of confident prose can override. A model that has just written a landing page is the worst-placed party to certify it - so it is graded by something it cannot sweet-talk.

In the game you can watch this happen. A worker delivers, and the report names the score and the verdict in plain language: the deterministic validator scored it 100 of 100 - it passes the gate and the company graph grows.
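The floor-then-rubric arithmetic described above can be sketched end to end. This is an illustration under stated assumptions, not the repo's implementation: `final_score` and `clamp` are hypothetical names, the floor here is simply the best validator score (the richness heuristic is omitted), and `weighted_total` is assumed to be the weight-normalised average of the dimension scores, which the post does not show.

```python
RUBRIC_DIMENSIONS = [
    ("Relevance to goal", 30),
    ("Completeness", 25),
    ("Actionability", 25),
    ("Clarity & structure", 20),
]


def clamp(value, default):
    """Coerce a model-supplied score to an int in 0..100, else use the default."""
    try:
        return max(0, min(100, int(value)))
    except (TypeError, ValueError):
        return default


def final_score(validator_scores, model_dims):
    # Floor: the best deterministic validator score for this role.
    floor = max(validator_scores, default=0)
    # Index the model's answer by dimension name, ignoring anything malformed.
    by_name = {
        str(d.get("name", "")).strip().lower(): d
        for d in model_dims
        if isinstance(d, dict)
    }
    weighted = 0
    for name, weight in RUBRIC_DIMENSIONS:  # our names, our weights
        raw = by_name.get(name.lower(), {}).get("score", floor)
        weighted += clamp(raw, floor) * weight
    # Assumption: weighted_total is the weight-normalised average.
    weighted_total = round(weighted / sum(w for _, w in RUBRIC_DIMENSIONS))
    return {
        "floor": floor,
        "weighted_total": weighted_total,
        "final": max(floor, weighted_total),
    }


# A model lowballs a structurally perfect artifact: the floor protects it.
lowball = final_score([100, 75], [{"name": n, "score": 40} for n, _ in RUBRIC_DIMENSIONS])
print(lowball)  # {'floor': 100, 'weighted_total': 40, 'final': 100}

# A model rewards genuine quality above a modest floor.
generous = final_score([70], [
    {"name": "Relevance to goal", "score": 90},
    {"name": "Completeness", "score": 80},
    {"name": "Actionability", "score": 80},
    {"name": "Clarity & structure", "score": 70},
])
print(generous)  # {'floor': 70, 'weighted_total': 81, 'final': 81}

# A model skips three dimensions, overshoots one, and invents a fifth.
messy = final_score([60], [
    {"name": "Relevance to goal", "score": 150},
    {"name": "Vibes", "score": 100},
])
print(messy)  # {'floor': 60, 'weighted_total': 72, 'final': 72}
```

In the third case the invented "Vibes" dimension is ignored, the 150 is clamped to 100, and the three missing dimensions default to the floor of 60, which is why the total lands at 72 rather than anything the model chose.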
A worker delivers a positioning brief, an org chart, and Q1 OKRs; the deterministic validator scores it 100/100 and it passes the gate Reasoning models and strict JSON do not mix There is a sharp edge hiding in Layer 2 that bites everyone who puts a reasoning model behind a JSON contract: the model wraps its answer in think-blocks, prepends a sentence of preamble, or fences the JSON in markdown - and json.loads throws. If your rubric evaluator crashes on a stray backtick, your "deterministic floor" was never deterministic; it was one parse error away from no score at all. So every agent that must read JSON out of a model goes through a tolerant extractor. The same _extract_json shape appears in worker_factory.py , org_designer.py , founder_analyst.py , and world_designer.py - kept local to each module on purpose, so every agent is self-contained. It tries, in order: strip a Markdown code fence; parse the whole string; parse the substring from the first { to the last } ; then scan character by character and let json.JSONDecoder().raw_decode find the first valid object: # submission/agents/worker_factory.py decoder = json.JSONDecoder() for index, char in enumerate(text): if char != "{": continue try: parsed, _ = decoder.raw_decode(text[index:]) return parsed if isinstance(parsed, dict) else None except Exception: continue return None When all four strategies fail, the function returns None and the caller falls through to _rubric_from_floor - the deterministic rubric derived from the validator floor. There is no path where a malformed model response yields no score; the floor is always there to catch it. Tolerant parsing plus a deterministic fallback is what lets you run reasoning models inside a scoring loop without the loop ever dropping a frame. The receipts panel: scoring you can audit, not trust Every worker exposes a receipts panel - the artifact's scoring proof, broken out so a skeptic can audit it instead of taking the number on faith. 
Status, the model that ran, token usage, estimated call cost, latency, and how many tool calls the worker made out of its capped budget. The receipts panel: status, model, tokens, est. call cost, latency, and tool-call count for a worker run This is the runtime equivalent of carrying a correlation ID through every path. Four proof points are emitted on every invocation, in live mode and simulation mode alike, and they are written straight into the STAGE_EXECUTED replay event in submission/tools/server.py : # submission/tools/server.py - the STAGE_EXECUTED event payload "iq_hits": invocation.iq_sources, "memory_injected": invocation.maf_memory, "tools_called": invocation.maf_tools_called, "inference_usage": {"client": invocation.maf_client or "openai-direct", "fallback_reason": invocation.maf_fallback_reason, "tokens_in": invocation.tokens_in, "tokens_out": invocation.tokens_out, "reasoning_tokens": invocation.reasoning_tokens}, "rubric": stage.rubric, iq_hits - which Foundry IQ sources grounded the work memory_injected - whether the CEO's prior decisions entered the brief tools_called - which deterministic tools the worker actually ran inference_usage - the client used, tokens in and out, and reasoning tokens These are not decoration; they are enforced. submission/tools/demo_smoke_test.py walks every invocation in a simulated run and fails the build if any proof point is absent: # submission/tools/demo_smoke_test.py _require(bool(p.get("iq_hits")), f"{cid}: iq_hits empty - IQ recall not evidenced") _require(bool(p.get("memory_injected")), f"{cid}: memory_injected empty") _require("tools_called" in p, f"{cid}: tools_called missing") usage = p.get("inference_usage") or {} _require(bool(usage.get("client")), f"{cid}: inference_usage.client missing") When a judge asks "did the agent actually do anything, or is this theatre?", the answer is a panel you open, not a sentence you say - backed by a test that would have gone red if the panel were empty. 
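The evidence check above can be tried in isolation. The sketch below is an assumption-laden illustration: `missing_proof_points` is a hypothetical helper, its local `_require` collects messages instead of failing a build, and the sample payloads (including the value types of each field) are invented to match the field names in the STAGE_EXECUTED excerpt.

```python
def missing_proof_points(cid, payload):
    """Return human-readable problems; an empty list means all evidence is present."""
    problems = []

    def _require(ok, message):
        # Hypothetical stand-in: collect the failure rather than abort the run.
        if not ok:
            problems.append(message)

    _require(bool(payload.get("iq_hits")), f"{cid}: iq_hits empty - IQ recall not evidenced")
    _require(bool(payload.get("memory_injected")), f"{cid}: memory_injected empty")
    # tools_called only has to exist; a worker may legitimately call no tools.
    _require("tools_called" in payload, f"{cid}: tools_called missing")
    usage = payload.get("inference_usage") or {}
    _require(bool(usage.get("client")), f"{cid}: inference_usage.client missing")
    return problems


complete = {
    "iq_hits": ["source-1"],
    "memory_injected": ["prior CEO decision"],
    "tools_called": [],
    "inference_usage": {"client": "simulation"},
}
broken = {
    "iq_hits": [],
    "memory_injected": ["prior CEO decision"],
    "inference_usage": {},
}

print(missing_proof_points("stage-1", complete))     # []
print(len(missing_proof_points("stage-2", broken)))  # 3
```

Note the asymmetry the original checks encode: three fields must be non-empty, while `tools_called` only has to be present, so an empty list still counts as evidence.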
Tools the model can call - but capped On the Agent Framework path the validators are not just a post-hoc gate; they are @tool FunctionTools the worker can call mid-run to test its own draft. That is good - a model that can check itself produces better artifacts. But an unbounded self-check is a failure mode: a model in a tight spot will call the same validator in a loop and burn your budget. So in submission/agents/maf_runtime.py every tool is wrapped with a hard cap, and every call leaves a receipt carrying its arguments, result, and latency: # submission/agents/maf_runtime.py @tool(name=tool_name, description=f"Run the deterministic '{tool_name}' check on a draft artifact " "(pass the artifact as a JSON string). Call at most once.", max_invocations=2) def _t(artifact_json: str) -> str: meta["maf_tools_called"].append(tool_name) receipt = {"tool": tool_name, "source": "maf-midrun", "args": {}, "result": "", "ms": 0.0} meta["maf_tool_trace"].append(receipt) t0 = time.perf_counter() # ... parse artifact, run fn(payload), summarise ... receipt["result"] = f"score={score} checks {passed}/{len(checks)}" receipt["ms"] = round((time.perf_counter() - t0) * 1000, 1) The model may check its draft twice. It may not certify it. max_invocations=2 is enforced by the runtime, not by a polite instruction the model can choose to ignore. And because every call appends to maf_tool_trace , the receipts panel can show you the exact tool, the artifact keys it inspected, the score it got back, and how many milliseconds it took. The certification is the human's, at the gate; the tool is only ever advisory. The same proof in simulation mode Forkability is a rubric criterion, so the gate cannot depend on Azure. After a git clone with zero credentials the system runs in simulation mode - and crucially, the same three layers run. 
_mock_artifact produces a well-formed artifact, the real validators score it, and rubric_evaluate falls through to _rubric_from_floor - a deterministic breakdown anchored to the validator score with a tiny, fixed spread: # submission/agents/worker_factory.py def _rubric_from_floor(floor): spread = [5, 0, -5, 0] # mild, deterministic variation around the floor dimensions = [ {"name": name, "weight": weight, "score": max(0, min(100, floor + spread[i])), "note": "derived from deterministic validators"} for i, (name, weight) in enumerate(RUBRIC_DIMENSIONS) ] # ... returns the same shape rubric_evaluate returns in live mode ... The four proof points are emitted on this path too. inference_usage.client reads "openai-direct" or a simulation marker instead of FoundryChatClient , but the field is present - which is exactly why demo_smoke_test.py passes offline. The rule we hold to: if your simulation mode does not emit the same evidence as live, you are testing a different program than the one you ship. When the model fails the artifact A model in a scoring loop will, eventually, hand you garbage: JSON truncated because it hit the token ceiling, prose where you asked for an object, or an outright exception from the endpoint. The gate cannot ship a blank stage, so the worker degrades down a fixed ladder rather than failing open. First, a weak Agent Framework run falls through. After the MAF path returns, the artifact is scored; if it is unparseable or the floor is below 40, the code raises and the worker retries on the direct OpenAI-compatible path. A half-formed artifact never reaches the gate: # submission/agents/worker_factory.py artifact = _extract_json(content) floor = _score_artifact(role, artifact) if not artifact or floor < 40: raise ValueError(f"MAF artifact too weak (floor={floor})") Second, an empty live artifact degrades to the deterministic mock. 
If even the direct call returns nothing parseable, the worker substitutes _mock_artifact , records a maf_fallback_reason , and the validators score the mock - so every stage still produces a real, gradeable artifact instead of a blank one. This is the same "simulation fallback for everything" law the rest of the repo follows. Third, a thrown exception becomes a receipt, not a crash. When the endpoint itself fails, the invocation is marked status="failed" with the error string captured, and the STAGE_EXECUTED event carries status , error , and whatever partial tool_trace accumulated straight into the replay log: # submission/agents/worker_factory.py except Exception as e: invocation.status = "failed" invocation.error = f"{type(e).__name__}: {e}" invocation.completed_at = time.time() return invocation, None, 0 The point is not that failures never happen - it is that a failure is visible and bounded. The floor still applies, the receipt still renders, the replay log still carries the error. A failed run is auditable; it is not a blank space where progress silently appeared. Layer 3: the gate has a third option - redirect Approve and reject are obvious. The interesting one is redirect - the gate can present a genuine strategic fork, two defensible options, and the CEO's pick becomes binding direction for the next worker. That turns the verification gate from a quality checkpoint into a steering wheel. A decision gate: two strategist proposals - Depth versus Breadth - each with tradeoffs, grounded in IQ sources and memory items Notice the metadata on that fork: the proposals are grounded in 2 IQ sources and 7 memory items in brief, and the worker reached them by calling recall , web_search , and calculate_consequence - a tool that previews the org and economic consequence before the CEO commits. The human is not rubber-stamping; they are choosing between options the workforce reasoned out and priced. 
Where XP is actually awarded The whole architecture funnels into one function. Approval - and only approval - calls approve_current_step in submission/tools/server.py , which awards XP and advances the campaign: # submission/tools/server.py - approve_current_step xp_earned = 10 + (score // 10) There is no other code path that mints XP. Not the validators, not the rubric, not the model. A high gate score enables a large reward, but a human pressing approve is what releases it. That is the responsible-AI guarantee expressed as control flow: the model can make the case, the validators can vouch for the structure, but the only function that turns work into progress is gated behind a person. How the approve, reject, or redirect decision is then written to memory and visibly changes the next chapter's brief is the subject of Part 3. Operational lessons learned Floor first, judge second. A model rubric on top of a deterministic floor gives you nuance without giving up safety. A model rubric alone gives you a confident scoreboard with no foundation. Re-anchor the model's structure to yours. Keep the model's scores, throw away its shape. Loop over your dimensions and weights so the model cannot reweight the rubric in its own favour. Cap every tool. max_invocations=2 is not a performance tweak; it is a containment boundary the runtime enforces. Log the proof on every path, then test for it. If your simulation mode does not emit the same proof points as live, you are testing a different program than you ship - so write a smoke test that fails the build when a proof point goes missing. Make the gate diegetic. Players (and judges) trust a control they can see. A score with a visible floor and an audit panel reads as engineering; a score from nowhere reads as marketing. Reasoning models and strict JSON do not mix. Anything that must emit JSON gets a tolerant extractor that survives think-blocks and markdown fences, with a deterministic fallback when every strategy fails. 
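The last lesson is the easiest to try on your own machine. Below is a minimal sketch of a four-strategy tolerant extractor in the order the post describes: fenced block, whole string, first-brace-to-last-brace substring, then a `raw_decode` scan. It is written from that description, not copied from the repo; `extract_json` and `FENCE` are names chosen for this example.

```python
import json
import re

FENCE = "`" * 3  # a Markdown code fence, built here to keep this example readable
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def extract_json(text):
    """Return the first JSON object found in a model reply, or None."""
    if not text:
        return None

    candidates = []
    fenced = _FENCED.search(text)
    if fenced:                                   # 1. inside a Markdown fence
        candidates.append(fenced.group(1).strip())
    candidates.append(text.strip())              # 2. the whole string
    start, end = text.find("{"), text.rfind("}")
    if 0 <= start < end:                         # 3. first "{" to last "}"
        candidates.append(text[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except ValueError:                       # JSONDecodeError subclasses ValueError
            continue
        if isinstance(parsed, dict):
            return parsed

    decoder = json.JSONDecoder()                 # 4. scan for the first valid object
    for index, char in enumerate(text):
        if char != "{":
            continue
        try:
            parsed, _ = decoder.raw_decode(text[index:])
        except ValueError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


fenced_reply = "Here is the rubric:\n" + FENCE + 'json\n{"score": 90}\n' + FENCE
noisy_reply = '<think>maybe {not json}</think> Result: {"verdict": "ok"} Thanks!'

print(extract_json(fenced_reply))    # {'score': 90}
print(extract_json(noisy_reply))     # {'verdict': 'ok'}
print(extract_json("no json here"))  # None
```

A caller would typically write `parsed = extract_json(reply) or {}` and fall back to the deterministic rubric when the result is empty, which is the behaviour the post attributes to `_rubric_from_floor`.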
Responsible AI

The gate is the responsible-AI architecture, stated as a game rule: nothing becomes progress without explicit human approval. Every approval is logged with the full reasoning chain in the replay log. Every rejection is written to memory as binding direction, so the same mistake is not made twice. Deterministic validators bound what the model can claim about its own output, the rubric re-anchors the model's judgement to weights it does not control, and the replay log preserves the whole chain for audit. The human's authority is not a courtesy - it is the only function that awards XP, enforced in code.

Where this lives in the repo

If you want to read the implementation, every piece of this post is in five files:

| Concern | File | Key symbol |
|---|---|---|
| Deterministic validators (Layer 1) | submission/tools/code_interpreter_wrappers.py | validate_landing_page, validate_positioning, validate_org_chart, validate_financial_plan, validate_marketing_email |
| Rubric + floor + tolerant JSON (Layer 2) | submission/agents/worker_factory.py | rubric_evaluate, _rubric_from_floor, _extract_json |
| Capped mid-run tools + receipts | submission/agents/maf_runtime.py | _wrap, max_invocations=2, maf_tool_trace |
| Proof points, human gate, XP award (Layer 3) | submission/tools/server.py | approve_current_step, the STAGE_EXECUTED payload |
| Build-fails-without-evidence test | submission/tools/demo_smoke_test.py | the _require evidence checks |

Try it

Play a chapter and reject the first artifact on purpose - watch the rejection reshape the next brief. Play the live app, or run it locally:

    git clone https://github.com/princepspolycap/agentsleague-afterbuild
    cd agentsleague-afterbuild && python3 -m venv .venv && source .venv/bin/activate
    pip install -r submission/requirements.txt
    python3 submission/tools/run_quest_simulation.py --pitch "Your idea here"

The simulator runs the same three layers with zero Azure credentials, so you can step through approve, reject, and redirect offline before you ever wire up Foundry.
Key takeaways Three scoring layers, one rule: no agent grades itself; only a human awards progress. Deterministic validators set a floor the model's rubric can raise but never lower - final = max(floor, weighted_total) . Re-anchor the model's rubric to your own dimensions and weights; keep its scores, not its structure. Expose receipts - model, tokens, cost, latency, tool calls - so scores are audited, not trusted, and test for them so the build fails without them. Cap every tool the model can call; an uncapped self-check is a budget leak. The gate's third option, redirect, turns verification into steering. A gate that only ever says "yes" is a dialog box. A gate that can say no, can redirect, floors a model with facts, re-anchors its judgement, and logs every decision is a reliability architecture. That is the difference between a demo and a system you would let touch something real. Part 2 of 5. Next: agents that remember the boss - how a gate decision becomes binding memory and visibly changes the next artifact on Microsoft Foundry. About this project I built a reasoning-agent game where you play a founder solving a world-improvement mission, and an AI workforce does the work while you make the calls. How it plays: You pitch a mission (like "solar microgrids for rural clinics") A Foundry-powered Master Narrator breaks it into an 8-stage quest graph An Org Designer agent builds you a custom digital workforce (Strategist, Designer, Marketer, Ops) Each stage runs as a real agent on the Microsoft Agent Framework, grounded in Foundry IQ You play tactical cards, counter a rival antagonist, and approve every artifact at a verification gate before it counts Why it's different: the reasoning IS the gameplay. Decomposition, IQ citations, memory, tool calls, and a deterministic validator floor are all visible as cards and receipts. Every reasoning agent runs on Microsoft Foundry. Runs after git clone with zero credentials (simulation mode), so it's fully forkable and MIT. 
Try it / check it out:
Live app (hosted on Azure): worldforge-game.mangowater-fa8b860a.eastus2.azurecontainerapps.io
GitHub (public, MIT): github.com/princepspolycap/agentsleague-afterbuild
Demo video: youtu.be/ElGXboGh6NE
Live battle replay: Agents League - Reasoning Agents on Microsoft Reactor

Would love any feedback, pull requests, or ideas. Built for the Reasoning Agents track.

Microsoft Agent Framework and Microsoft Foundry docs

Building ShadowQuest: A Multi-Agent RPG
Artificial Intelligence is rapidly evolving beyond traditional chatbots. Today, developers are building intelligent systems where multiple AI agents collaborate, retrieve knowledge, and solve problems together. Microsoft's Agents League Hackathon provided the perfect opportunity to explore this new approach through the Reasoning Agents challenge. For this challenge, I built ShadowQuest, a fantasy role-playing game (RPG) powered by Microsoft Foundry, Foundry IQ, Azure AI Search, GPT-4.1, and GitHub Copilot. The project demonstrates how specialized AI agents can work together while using Retrieval-Augmented Generation (RAG) to deliver accurate and context-aware responses. About the Challenge Microsoft Agents League is a global developer challenge designed to encourage developers to build intelligent AI applications using Microsoft's latest AI technologies. Participants could choose from three tracks: Creative Apps, Reasoning Agents, and Enterprise Agents. I selected the Reasoning Agents track because I wanted to explore how multiple AI agents could collaborate instead of relying on a single large language model. Another important requirement for this year's challenge was integrating at least one Microsoft Intelligence Layer. For ShadowQuest, I chose Foundry IQ as the project's intelligence layer. The Idea Behind ShadowQuest Fantasy RPGs are built around storytelling, exploration, and collaboration between different characters. Every character usually has a unique role, whether it's a warrior protecting the team, a mage interpreting magical knowledge, or a rogue discovering hidden paths. I wanted to recreate this experience using AI. Instead of building one AI assistant responsible for everything, I designed a system where multiple specialized agents collaborate to create a richer and more immersive adventure. ShadowQuest is set in a fantasy world filled with magical artifacts, forgotten kingdoms, mysterious locations, and story-driven quests. 
Players can ask questions about the world, explore different locations, and learn about the game's lore through conversations with AI agents. Building the Multi-Agent Architecture The architecture follows a simple but scalable design. At the center of the system is the Game Master Agent, which acts as the orchestrator. Every player interaction starts with the Game Master. It receives the player's request, determines what information is needed, retrieves additional knowledge when required, and generates the final response. Supporting the Game Master are three specialized agents: Warrior Agent – Focuses on combat strategy and tactical decisions. Mage Agent – Provides magical knowledge, world lore, and information about ancient artifacts. Rogue Agent – Specializes in exploration, investigation, and discovering hidden information. Each agent has a clearly defined responsibility, making the system easier to understand, maintain, and extend in the future. Using Foundry IQ as the Knowledge Layer One of the most important parts of the project was integrating Foundry IQ. Instead of storing every piece of game information inside prompts, I created a dedicated knowledge base containing information about characters, magical artifacts, locations, quests, and the history of the ShadowQuest world. This approach separates knowledge from reasoning. Whenever a player asks a question, the Game Master Agent first retrieves relevant information from the knowledge base before generating a response. This ensures that answers remain consistent with the game's world while reducing hallucinations. Foundry IQ became the central source of truth for the entire project, making it easy to manage and expand the game world without constantly modifying prompts. Azure AI Search and Retrieval-Augmented Generation To enable intelligent retrieval, I connected Foundry IQ with Azure AI Search. The RPG documents were indexed, and vector embeddings were generated using Microsoft's embedding models. 
This enables semantic search, allowing the system to understand the meaning behind a player's question instead of relying only on keyword matching. For example, if a player asks about a magical relic without mentioning its exact name, Azure AI Search can still retrieve the correct information based on semantic similarity. The complete workflow looks like this: The player submits a question. The Game Master Agent receives the request. Foundry IQ queries Azure AI Search. Relevant documents are retrieved. GPT-4.1 generates a grounded response using the retrieved context. This Retrieval-Augmented Generation (RAG) approach significantly improves the quality and reliability of responses. Accelerating Development with GitHub Copilot GitHub Copilot played an important role throughout the development process. It helped generate Python classes, improve documentation, create helper functions, and speed up repetitive coding tasks. During the live demonstration, I also showed how Copilot could quickly generate a new Healer Agent, demonstrating how AI-assisted development makes it easier to extend a multi-agent application while maintaining a consistent architecture. Rather than replacing the developer, Copilot acted as an intelligent coding assistant, allowing me to focus more on architecture and design decisions. Demonstrating ShadowQuest During the Microsoft Agents League Reasoning Agents Battle, I demonstrated the Game Master Agent by asking questions about the ShadowQuest world, magical artifacts, and game lore. One of the most interesting parts of the demonstration was observing the retrieval process. Before generating a response, the Game Master Agent called the knowledge retrieval function through Foundry IQ. This confirmed that the system was retrieving relevant information from the indexed knowledge base rather than relying only on GPT-4.1's internal knowledge. This demonstrated how RAG can create more grounded, reliable, and context-aware AI experiences. 
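The orchestration and retrieval flow described above can be sketched in a few lines of Python. This is a minimal illustration, not the ShadowQuest code: the `LORE` dictionary is a made-up in-memory stand-in for the Foundry IQ knowledge base, `retrieve` uses plain keyword matching instead of semantic search, and the `Specialist` and `GameMaster` classes, their keyword lists, and the lore text are all hypothetical names invented for this example. No model call is made; the specialist simply echoes the retrieved context.

```python
import re
from dataclasses import dataclass

# Hypothetical lore entries standing in for the indexed knowledge base.
LORE = {
    "moonblade": "The Moonblade is a silver sword forged in Eldmere.",
    "eldmere": "Eldmere is a forgotten kingdom beneath the Ashen Marsh.",
    "ashen marsh": "The Ashen Marsh hides a path known only to rogues.",
}


def retrieve(question: str, top_k: int = 2) -> list[str]:
    """Toy keyword retrieval; the real project uses semantic search."""
    q = question.lower()
    hits = [text for key, text in LORE.items() if key in q]
    return hits[:top_k]


@dataclass
class Specialist:
    name: str
    keywords: tuple[str, ...]

    def advise(self, question: str, context: list[str]) -> str:
        # A real agent would send the question plus context to the model.
        facts = " ".join(context) if context else "No lore found."
        return f"[{self.name}] {facts}"


SPECIALISTS = [
    Specialist("Warrior", ("fight", "combat", "attack", "defend")),
    Specialist("Mage", ("magic", "artifact", "relic", "lore", "spell")),
    Specialist("Rogue", ("hidden", "secret", "path", "explore")),
]


class GameMaster:
    """Orchestrator: retrieve first, then hand off to one specialist."""

    def route(self, question: str) -> Specialist:
        # Match whole words so "explore" does not trigger "lore".
        words = set(re.findall(r"[a-z]+", question.lower()))
        for specialist in SPECIALISTS:
            if any(keyword in words for keyword in specialist.keywords):
                return specialist
        return SPECIALISTS[1]  # assumption: general lore goes to the Mage

    def answer(self, question: str) -> str:
        context = retrieve(question)        # retrieval happens first
        specialist = self.route(question)   # then pick the responsible agent
        return specialist.advise(question, context)


if __name__ == "__main__":
    gm = GameMaster()
    print(gm.answer("What magic does the Moonblade hold?"))
```

Keeping retrieval in the orchestrator rather than in each specialist mirrors the separation of knowledge from reasoning described above: a new agent only needs a name and a responsibility, not its own copy of the lore.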
Lessons Learned Building ShadowQuest taught me that designing multi-agent systems is as much about architecture as it is about AI models. Clearly defining responsibilities for each agent made the application easier to maintain and opened the door for future expansion. I also learned how valuable Retrieval-Augmented Generation can be for applications that depend on structured knowledge. Separating reasoning from knowledge allows AI systems to remain accurate while making it easier to update information over time. Finally, participating in the Microsoft Agents League was an incredible opportunity to experiment with Microsoft's latest AI technologies, learn from other developers, and share ideas with a global community passionate about agentic AI. Looking Ahead ShadowQuest is only the beginning. In future iterations, I plan to expand the project by introducing additional agents such as a Merchant Agent and Healer Agent, implementing persistent player memory, adding dynamic quest generation, improving combat mechanics, and enabling deeper collaboration between agents. These improvements will make the game world more immersive while continuing to explore the possibilities of agent-based AI systems. Conclusion ShadowQuest demonstrates how Microsoft Foundry, Foundry IQ, Azure AI Search, GPT-4.1, and GitHub Copilot can be combined to build intelligent multi-agent applications. More importantly, the project reinforced an important idea: the future of AI is not a single assistant performing every task, but a team of specialized agents collaborating with shared knowledge to solve increasingly complex problems. Participating in the Microsoft Agents League was an inspiring experience that allowed me to explore the next generation of AI development while building a project that combines storytelling, reasoning, and knowledge retrieval. 
I look forward to continuing this journey and discovering new ways to build intelligent applications using Microsoft's growing AI ecosystem.

Foundry IQ: Give Your AI Agents a Knowledge Upgrade
If you’re learning to build AI agents, you’ve probably hit a familiar wall: your agent can generate text, but it doesn’t actually know anything about your data. It can’t look up your documents, search across your files, or pull facts from multiple sources to answer a real question. That’s the gap Foundry IQ fills. It gives your AI agents structured access to knowledge, so they can retrieve, reason over, and synthesize information from real data sources instead of relying on what’s baked into the model. Why Should You Care? As a student or early-career developer, understanding how AI systems work with external knowledge is one of the most valuable skills you can build right now. Retrieval-Augmented Generation (RAG), knowledge bases, and multi-source querying are at the core of every production AI application, from customer support bots to research assistants to enterprise copilots. Foundry IQ gives you a hands-on way to learn these patterns without having to build all the plumbing yourself. You define knowledge bases, connect data sources, and let your agents query them. The concepts you learn here transfer directly to real-world AI engineering roles. What is Foundry IQ? Foundry IQ is a service within Azure AI Foundry that lets you create knowledge bases, collections of connected data sources that your AI agents can query through a single endpoint. Instead of writing custom retrieval logic for every app you build, you: Define knowledge sources — connect documents, data stores, or web content (SharePoint, Azure Blob Storage, Azure AI Search, Fabric OneLake, and more). Organize them into a knowledge base — group multiple sources behind one queryable endpoint. Query from your agent — your AI agent calls the knowledge base to get the context it needs before generating a response. This approach means the knowledge layer is reusable. Build it once, and any agent or app in your project can tap into it. 
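To make the "many sources behind one endpoint" idea concrete, here is a small Python sketch. It is an illustration of the pattern only and does not use the Foundry IQ API: `KnowledgeSource`, `KnowledgeBase`, `keyword_source`, and the sample documents are hypothetical names and data invented for this example, and the keyword overlap search is a stand-in for real indexing and retrieval.

```python
import re
from dataclasses import dataclass, field
from typing import Callable


def _words(text: str) -> set[str]:
    """Lowercased word set used by the toy keyword matcher."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


@dataclass
class KnowledgeSource:
    name: str
    search: Callable[[str], list[str]]  # returns matching passages


@dataclass
class KnowledgeBase:
    """Groups several sources behind a single query method."""
    sources: list[KnowledgeSource] = field(default_factory=list)

    def query(self, question: str) -> list[dict]:
        results = []
        for source in self.sources:
            for passage in source.search(question):
                # Keep the source name so the agent can cite it.
                results.append({"source": source.name, "text": passage})
        return results


def keyword_source(name: str, docs: list[str]) -> KnowledgeSource:
    """Hypothetical source that matches documents by shared words."""
    def search(question: str) -> list[str]:
        wanted = _words(question)
        return [doc for doc in docs if wanted & _words(doc)]
    return KnowledgeSource(name, search)


if __name__ == "__main__":
    kb = KnowledgeBase([
        keyword_source("course-notes", ["Recursion calls a function from itself."]),
        keyword_source("project-readme", ["Run pytest before every commit."]),
    ])
    for hit in kb.query("what is recursion"):
        print(hit["source"], "->", hit["text"])
```

The agent code only ever calls `kb.query(...)`. Adding or swapping a source changes the knowledge base definition, not the agent, which is the reuse benefit described above.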
The IQ Series: A Three-Part Learning Path The IQ Series is a set of three weekly episodes that walk you through Foundry IQ from concept to code. Each episode includes a tech talk, visual doodle summaries, and a companion cookbook with sample code you can run yourself. 👉 Get started: https://aka.ms/iq-series Episode 1: Unlocking Knowledge for Your Agents (March 18, 2026) Start here. This episode introduces the core architecture of Foundry IQ and explains how AI agents interact with knowledge. You’ll learn what knowledge bases are, why they matter, and how the key components fit together. What you’ll learn: The difference between model knowledge and retrieved knowledge How Foundry IQ structures the retrieval layer The building blocks: knowledge sources, knowledge bases, and agent queries Episode 2: Building the Data Pipeline with Knowledge Sources (March 25, 2026) This episode goes deeper into knowledge sources, the connectors that bring data into Foundry IQ. You’ll see how different content types flow into the system and how to wire up sources from services you may already be using. What you’ll learn: How to connect sources like Azure Blob Storage, Azure AI Search, SharePoint, Fabric OneLake, and the web How content is ingested and indexed for retrieval Patterns for combining multiple source types Episode 3: Querying Multi-Source Knowledge Bases (April 1, 2026) The final episode shows you how to bring it all together. You’ll learn how agents query across multiple knowledge sources through a single knowledge base endpoint and how to synthesize answers from diverse data. What you’ll learn: How to query a knowledge base from your agent code How retrieval works across multiple connected sources Techniques for synthesizing information to answer complex questions Get Hands-On with the Cookbooks Every episode comes with a companion cookbook in the GitHub repo, complete with sample code you can clone, run, and modify. This is the fastest way to go from watching to building. 
👉 Explore the repo: https://aka.ms/iq-series Inside you’ll find: Episode links - watch the tech talks and doodle recaps Cookbooks - step-by-step code samples for each episode Documentation links - official Foundry IQ docs and additional learning resources What to Build Next Once you’ve worked through the series, try applying what you’ve learned: Study assistant - connect your course materials as knowledge sources and build an agent that can answer questions across all your notes and readings. Project documentation bot - index your team’s project docs and READMEs into a knowledge base so everyone can query them naturally. Research synthesizer - connect multiple data sources (papers, web content, datasets) and build an agent that can cross-reference and summarize findings. Start Learning The IQ Series is designed to take you from zero to building knowledge-driven AI agents. Watch the episodes, run the cookbooks, and start experimenting with your own knowledge bases. 👉 https://aka.ms/iq-series

Foundry IQ for Multi-Source AI Knowledge Bases
Pull from multiple sources at once, connect the dots automatically, and get accurate, context-rich answers without doing manual orchestration with Foundry IQ in Microsoft Foundry. Navigate complex, distributed data across Azure stores, SharePoint, OneLake, MCP servers, and even the web, all through a single knowledge base that handles query planning and iteration for you. Reuse the Azure AI Search assets you already have, build new knowledge bases with minimal setup, and control how much reasoning effort your agents apply. As you develop, you can rely on iterative retrieval only when it improves results, saving time, tokens, and development complexity. Pablo Castro, Azure AI Search CVP and Distinguished Engineer, joins Jeremy Chapman to share how to build smarter, more capable AI agents, with higher-quality grounded answers and less engineering overhead. Smart, accurate responses. Give your agents the ability to search across multiple sources automatically without extra development work. Check out Foundry IQ in Microsoft Foundry. Build AI agents fast. Organize your data, handle query planning, and orchestrate retrieval automatically. Get started using Foundry IQ knowledge bases. Save time and resources while keeping answers accurate. Foundry IQ decides when to iterate or exit, optimizing efficiency. Take a look. QUICK LINKS: 00:00 - Foundry IQ in Microsoft Foundry 01:02 - How it’s evolved 03:02 - Knowledge bases in Foundry IQ 04:37 - Azure AI Search and retrieval stack 05:51 - How it works 06:52 - Visualization tool demo 08:07 - Build a knowledge base 10:10 - Evaluating results 13:11 - Wrap up Link References To learn more check out https://aka.ms/FoundryIQ For more details on the evaluation metric discussed on this show, read our blog at https://aka.ms/kb-evals For more on Microsoft Foundry go to https://ai.azure.com/nextgen Unfamiliar with Microsoft Mechanics? 
As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast Keep getting this insider knowledge, join us on social: Follow us on Twitter: https://twitter.com/MSFTMechanics Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics Video Transcript: - If you research any topic, do you stop after one knowledge source? That’s how most AI will typically work today to generate responses. Instead, now with Foundry IQ in Microsoft Foundry, built-in AI powered query decomposition and orchestration make it easy for your agents to find and retrieve the right information across multiple sources, autonomously iterating as much as required to generate smarter and more relevant responses than previously possible. And the good news is, as a developer, this all just works out of the box. And joining me to unpack everything and also show a few demonstrations of how it works is Pablo Castro, distinguished engineer and also CVP. He’s also the architect of Azure AI Search. So welcome back to the show. - It’s great to be back. - And you’ve been at the forefront really for AI knowledge retrieval really since the beginning, where Azure AI Search is Microsoft’s state-of-the-art search engine for vector and hybrid retrieval, and this is really key to building out things like RAG-based agentic services and applications. So how have things evolved since then? - Things are changing really fast. 
Now, AI and agents in particular, are expected to navigate the reality of enterprise information. They need to pull data across multiple sources and connect the dots as they automate tasks. This data is all over the place, some in Azure stores, some in SharePoint, some is public data on the web, anywhere you can think of. Up until now, AI applications that needed to ground agents on external knowledge typically used a single index. If they needed to use multiple data sources, it was up to the developer to orchestrate them. With Foundry IQ and the underlying Azure AI Search retrieval stack, we tackled this whole problem. Let me show you. Here is a technician support agent that I built. It’s pointed at a knowledge base with information from different sources that we pull together in Foundry IQ. It provides our agent with everything it needs to know as it provides support to onsite technicians. Let’s try it. I’ll ask a really convoluted question, more of a stream of thought that someone might ask when working on a problem. I’ll paste in: “Equipment not working, CTL11 light is red, maybe power supply problem? Label on equipment says P4324. The cord has another label UL 817. Okay to replace the part?” From here, the agent will give the question to the knowledge base, and the knowledge base will figure out which knowledge sources to consult before coming back with a comprehensive answer. So how did it answer this particular question? Well, we can see it went across three different data sources. The functionality of the CTL11 indicator is from the machine manuals. We received them from different machine vendors, and we have them all stored in OneLake. Then, the company policy for repairs, which our company regularly edits, lives in SharePoint. And finally, the agent retrieved public information from the web to determine electrical standards. - And really, the secret sauce behind all of this is the knowledge base. So can you explain what that is and how that works? 
- So yeah, knowledge bases are first class artifacts in Foundry IQ. Think of a knowledge base as the encapsulation of an information domain, such as technical support in our example. A knowledge base comprises one or more data sources that can live anywhere. And it has its own AI models for retrieval orchestration against those sources. When a query comes in, a planning step is run. Here, the query is deconstructed. The AI model refers to the source description or retrieval instructions provided, and it connects the different parts of the query to the appropriate knowledge source. It then runs the queries, and it looks at the results. A fast, fine-tuned SLM then assesses whether we have enough information to exit or if we need more information and should iterate by running the planning step again. Once it has a high level of confidence in the response, it’ll return the results to the agent along with the source information for citations. Let’s open the knowledge base for our technician support agent. And at the bottom, you can see our three different knowledge sources. Again, machine specs pulls markdown files from OneLake with all the equipment manuals. And notice the source description which Foundry IQ uses during query planning. Policies points at our SharePoint site with our company repair policies. And here’s the web source for public information. And above, I’ve also provided retrieval instructions in natural language. Here, for example, I explicitly call out using web for electrical and industry standards. - And you’re in Microsoft Foundry, but you also mentioned that Azure AI Search and the retrieval stack are really the underpinnings for Foundry IQ. So, what if I already have some Azure AI Search running in my case? - Sure. Knowledge bases are actually AI search artifacts. You can still use standalone AI search and access these capabilities. Let me show you what it looks like in the Azure portal and in code. Here, I’m in my Azure AI Search service. 
We can see existing knowledge bases, and here’s the knowledge base we were using in Foundry IQ. Flipping to VS Code, we have a new KnowledgeBaseRetrievalClient. And if you’ve used Azure AI Search before, this is similar to the existing search client but focused on the agentic retrieval functionality. Let me run the retrieve step. The retrieve method takes a set of queries or a list of messages from a conversation and returns a response along with references. And here are the results in detail, this time purely using the Azure AI Search API. If you’re already using Azure AI Search, you can create knowledge bases in your existing services and even reuse your existing indexes. Layering things this way lets us deliver the state-of-the-art retrieval quality that Azure AI Search is known for, combined with the power of knowledge bases and agentic retrieval. - Now that we understand some of the core concepts behind knowledge bases, how does it actually work then under the covers? - Well, unlike the classic RAG technique, where we typically use one source with one index, we can use one or more indexes as well as remote sources. When you construct a knowledge base, passive data sources, such as files in OneLake or Azure Blob Storage, are indexed, meaning that Azure AI Search creates vector and keyword indexes by ingesting and processing the data from the source. We also give you the option to create indexes for specific SharePoint sites that you define while propagating permissions and labels. On the other hand, data sources like the web or MCP servers are accessed remotely, and we support remote access mode for SharePoint too. In these cases, we’ll effectively use the connected source’s own index for retrieval. Surrounding those knowledge sources, we have an agentic retrieval engine powered by an ensemble of models to run the end-to-end query process that is used to find information. 
I wrote a small visualization tool to show you what’s going on during the retrieval process. Let me show you. I’ll paste the same query we used before and just hit run. This uses the Azure AI Search knowledge base API directly to run retrieval and return both the results and details of each step. Now in the return result, we can see it did two iterations and issued 15 queries total across three knowledge sources. This is work a person would’ve had to do manually while researching. In this first iteration, we can see it broke the question apart into three aspects, equipment details, the meaning of the label, and the associated policy, and it ran those three as queries against a selected set of knowledge sources. Then, the retrieval engine assessed that some information was missing, so it iterated and issued a second round of searches to complete the picture. Finally, we can see a summary of how much effort we put in, in tokens, along with an answer synthesis step, where it provided a complete answer along with references. And at the bottom, we can see all the reference data used to produce the answer was also returned. This is all very powerful, because as a developer, you just need to create a knowledge base with the data sources you need, connect your agent to it, and Foundry IQ takes care of the rest. - So, how easy is it then to build a knowledge base out like this? - This is something we’ve worked really hard on to reduce the complexity. We built a powerful and simplified experience in Foundry. Starting in the Foundry portal, I’ll go to Build, then to Knowledge in the left nav and see all the knowledge bases I already created. Just to show you the options, I’ll create a new one. Here, you can choose from different knowledge sources. In this case, I’ll cancel out of this and create a new one from scratch. We’ll give it a name, say repairs, and choose a model that’s used for planning and synthesis and define the retrieval reasoning effort. 
This allows you to control the time and effort the system will put into information retrieval, from minimum where we just retrieve from all the sources without planning to higher levels of effort, where we’ll do multiple iterations assessing whether we got the right results. Next, I’ll set the output mode to answer synthesis, which tells the knowledge base to take the grounding information it’s collected and compose a consolidated answer. Then I can add the knowledge sources we created earlier, and for example, I’ll reuse the machine specs source that contains the manuals that are in OneLake and our policies from SharePoint. If I want to create a new knowledge source, I can choose supported stores in this list. For example, if I choose blob storage, I just need to point at the storage account and container, and Foundry IQ will pull all the documents and handle the chunking, vectorization, and everything needed to make it ready to use. We’ll leave things as is for now. Instead, something really cool is how we also support MCP servers as knowledge sources. Let’s create a quick one. Let’s say we want to pull software issues from GitHub. All I need to do is point it to the GitHub MCP server address and set search_issues as the tool name. At this point, I’m all set, and I just need to save my changes. If data needs to be indexed for some of my knowledge sources, that will happen in the background, and indexes are continually updated with fresh information. - And to be clear, this is hiding a ton of complexity, but how do we know it’s actually working better than previous ways for retrieval? - Well, as usual, we’ve done a ton of work on evaluations. First, we measured whether the agentic approach is better than just searching for all the sources and combining the results. 
In this study, the grey lines represent the various data sets we used in this evaluation, and when using query planning and iterative search, we saw an average 36% gain in answer score as represented by this green line. We also tested how effective it is to combine multiple private knowledge sources and also a mix of private sources with web search where public data can fill in the gaps when internal information falls short. We first spread information across nine knowledge sources and measured the answer score, which landed at 90%, showing just how effective multi-source retrieval is. We then removed three of the nine sources, and as expected, the answer score dropped to about 50%. Then, we added a web knowledge source to compensate for where our six internal sources were lacking, which in this case was publicly available information, and that boosted results significantly. We achieved a 24-point increase for low retrieval reasoning effort and 34 points for medium effort. Finally, we wanted to make sure we only iterate if it’ll make things better. Otherwise, we want to exit the agentic retrieval loop. Again, under the covers, Foundry IQ uses two models to check whether we should exit, a fine-tuned SLM to do a fast check with a high bar, and if there is doubt, then we’ll use a full LLM to reassess the situation. In this table, on the left, we can see the various data sets used in our evaluation along with the type of knowledge source we used. The fast check and the full check columns indicate the number of times as a percentage that each of the models decided that we should exit the agentic retrieval loop. We need to know if it was a good idea to actually exit. So the last column has the answer score you would get if you use the minimal retrieval effort setting, where there is no iteration or query planning. If this score is high, iteration isn’t needed, and if it’s low, iteration could have improved the answer score. 
You can see, for example, in the first row, the answer score is great without iteration. Both fast and full checks show a high percentage of exits. In each of these, we saved time and tokens. The middle three rows are cases where the fast check defers to the full check, and the full check predicts that we should exit at reasonably high percentages, which is consistent with the relatively high answer scores for minimal effort. Finally, the last two rows show both models wanting to iterate again most of the time, consistent with the low answer score you would’ve seen without iteration. So as you saw, the exit assessment approach in Foundry IQ orchestration is effective, saving time and tokens while ensuring high quality results. - Foundry IQ then is great for connecting the dots then across scattered information while keeping your agents simple to build, and there’s no orchestration required. It’s all done for you. So, how can people try Foundry IQ for themselves right now? - It’s available now in public preview. You can check it out at aka.ms/FoundryIQ. - Thanks so much again for joining us today, Pablo, and thank you for watching. Be sure to subscribe to Microsoft Mechanics for more updates, and we’ll see you again soon.
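The retrieval loop described in the transcript (plan the sub-queries, run them against the chosen sources, then decide whether to exit or iterate using a fast check backed by a fuller check) can be sketched as follows. This is a toy illustration under stated assumptions, not Foundry IQ's implementation or API: `agentic_retrieve`, `plan`, `fast_check`, and `full_check` are hypothetical names, the checks are simple functions rather than models, and the source passages are placeholder strings.

```python
from typing import Callable

Source = Callable[[str], list[str]]
Plan = Callable[[str, list[str]], list[tuple[str, str]]]
Check = Callable[[str, list[str]], bool]


def agentic_retrieve(
    question: str,
    plan: Plan,
    sources: dict[str, Source],
    fast_check: Check,
    full_check: Check,
    max_iterations: int = 3,
) -> dict:
    """Plan, query, then exit or iterate. All callables are stand-ins."""
    gathered: list[str] = []
    iterations = 0
    queries = 0
    while iterations < max_iterations:
        iterations += 1
        # Planning step: map parts of the question to knowledge sources.
        for source_name, sub_query in plan(question, gathered):
            gathered.extend(sources[source_name](sub_query))
            queries += 1
        # Fast check with a high bar; only when it is not confident
        # does the more expensive full check get a say.
        if fast_check(question, gathered) or full_check(question, gathered):
            break
    return {"results": gathered, "iterations": iterations, "queries": queries}


# Hypothetical sources returning placeholder passages.
SOURCES: dict[str, Source] = {
    "manuals": lambda q: ["manual passage about the CTL11 indicator"],
    "policies": lambda q: ["policy passage about on-site repairs"],
    "web": lambda q: ["public passage about the cord label standard"],
}


def toy_plan(question: str, gathered: list[str]) -> list[tuple[str, str]]:
    # First round: internal sources. Later rounds: fill gaps from the web.
    if not gathered:
        return [("manuals", question), ("policies", question)]
    return [("web", question)]


if __name__ == "__main__":
    outcome = agentic_retrieve(
        "CTL11 light is red, okay to replace the part?",
        plan=toy_plan,
        sources=SOURCES,
        fast_check=lambda q, g: len(g) >= 3,  # confident only with 3+ passages
        full_check=lambda q, g: False,        # toy: never overrules the doubt
    )
    print(outcome["iterations"], outcome["queries"])
```

Putting the cheap check first means most easy questions exit after one round without paying for the fuller assessment, which is the time and token saving the evaluation discussion refers to.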