The Gate Is the Product: Human-Verified Artifacts in a Foundry Multi-Agent Game
Part 2 of 5. In Part 1 the loop ended at a verification gate. This post is about why that gate is not a confirmation dialog - it is the core mechanic, the reliability story, and the thing that lets a reasoning-agent system be demoed live without praying.

Most multi-agent demos gate on "did the model produce something." That is a vibe check. A reasoning-agent system that touches anything real needs a harder question: who is allowed to say this artifact is good enough - and can a human stop it?

A model can create. A model must not be the thing that certifies its own creation.

This post is a code walkthrough of how that rule is enforced. We build the three-layer scoring ladder from the bottom up - deterministic validators, a model rubric floored by them, and a human gate - then look at the parts that make it survive contact with real reasoning models: tolerant JSON parsing, capped self-check tools, a failure-degradation ladder, and four proof points that are tested on every run. Every snippet is from the shipped repo, and a file map at the end tells you where to read the rest.

Three layers, one rule: no agent grades itself

When a worker finishes a chapter, the artifact passes through three layers before it can become progress. Each layer has a different scorer, and only the last one - a human - can award XP.

| Layer | Who scores | Can it award XP? |
| --- | --- | --- |
| Mid-run tool calls | Deterministic validators | No - advisory to the model |
| rubric_evaluate | Foundry model judging weighted dimensions, floored by validators | No |
| CEO gate | The human | Yes - the only path to XP |

The order is the whole point. Deterministic validators set a floor the score can never fall below. The Foundry model's rubric judgement can move the score above that floor - reward genuine quality - but it can never talk the score below the facts the validators established. Then the work stops and waits for a person.

Everything below is in the repo, so this post reads as a code walkthrough as much as an essay.
The three layers live in three files: the validators in submission/tools/code_interpreter_wrappers.py, the rubric and floor in submission/agents/worker_factory.py, and the human gate that awards XP in submission/tools/server.py. Open them alongside this post.

The thing being judged: what a worker actually returns

Before you can score an artifact you have to agree on its shape. A worker does not return prose - it returns a typed JSON object whose keys the validators know how to read. A designer returns a landing page (hero_headline, cta_text, url, features); a strategist returns positioning (target_audience, core_problem, value_proposition, primary_benefit) and an org chart (org_chart, okrs_q1); a marketer returns a financial plan (gtm_channels, financial_plan) and an email (subject, body).

This contract is the hinge of the whole reliability story. Because the shape is fixed, the validator that reads it can be a dumb, fast, deterministic function - not a second model trying to interpret freeform text. In simulation mode the very same shapes are produced by _mock_artifact in worker_factory.py, which is why the gate behaves identically after a fresh clone with zero credentials: the artifact the validator reads looks the same whether a Foundry model wrote it or the simulator did.

There is one wrinkle worth calling out, because it is where most "the validator says zero but the page looks fine" bugs come from: models are inconsistent about keys. One run returns hero_headline, another headline, another a nested hero.headline.
Rather than make the validator guess, a small adapter - _landing_payload - coalesces those variants into the canonical shape before the validator ever sees them:

```python
# submission/agents/worker_factory.py - _landing_payload (excerpt)
return {
    "hero_headline": page.get("hero_headline") or page.get("headline")
                     or hero.get("headline") or artifact.get("headline") or "",
    "cta_text": page.get("cta_text") or cta.get("text") or artifact.get("cta") or "",
    # ... features, url ...
}
```

Normalising at the boundary keeps the validators pure: they assume one schema, and the messy job of mapping a model's many spellings onto that schema lives in one adapter, not smeared through every check.

Layer 1: deterministic validators you could unit-test

Layer 1 is a handful of pure functions in submission/tools/code_interpreter_wrappers.py. Each takes the artifact dict, runs structural checks, accumulates a 0-100 score, and returns (success, results) where results carries a per-check breakdown and human-readable feedback. Here is the landing-page validator, trimmed to its spine, because it is representative:

```python
# submission/tools/code_interpreter_wrappers.py
def validate_landing_page(data):
    results = {"checks": {}, "score": 0, "feedback": []}
    if len(data.get("hero_headline", "").strip()) >= 15:
        results["checks"]["hero_headline_valid"] = True
        results["score"] += 30
    cta = data.get("cta_text", "")
    if len(cta.strip()) >= 3:
        results["checks"]["cta_valid"] = True
        results["score"] += 20
    # ... url format (+30) and a simulated http_status_200 (+20) ...
    success = results["score"] >= 70
    return success, results
```

There is no model anywhere in that function. A headline shorter than fifteen characters earns zero points and a line of feedback; a missing call to action earns zero points and a line of feedback. The thresholds are explicit and the points are explicit, so the score is reproducible to the digit - run it a thousand times, get the same number a thousand times.
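To make the "reproducible to the digit" claim concrete, here is a minimal, self-contained sketch in the same style. It is not the repo's function: the name validate_landing_page_sketch is hypothetical, the headline and CTA thresholds come from the excerpt above, and the URL and status checks (elided in the post) are replaced by labelled stand-ins.

```python
def validate_landing_page_sketch(data):
    """Hypothetical stand-in for the repo's validator, for illustration only."""
    results = {"checks": {}, "score": 0, "feedback": []}

    # Thresholds and points below match the excerpt in the post.
    headline = str(data.get("hero_headline", "")).strip()
    ok = len(headline) >= 15
    results["checks"]["hero_headline_valid"] = ok
    if ok:
        results["score"] += 30
    else:
        results["feedback"].append("hero_headline shorter than 15 characters")

    cta = str(data.get("cta_text", "")).strip()
    ok = len(cta) >= 3
    results["checks"]["cta_valid"] = ok
    if ok:
        results["score"] += 20
    else:
        results["feedback"].append("cta_text missing or shorter than 3 characters")

    # ASSUMPTION: the post elides the real URL check; a scheme-prefix test stands in.
    url = str(data.get("url", "")).strip()
    ok = url.startswith(("http://", "https://"))
    results["checks"]["url_format_valid"] = ok
    if ok:
        results["score"] += 30
        # ASSUMPTION: the simulated status check is granted whenever the format passes.
        results["checks"]["http_status_200"] = True
        results["score"] += 20
    else:
        results["feedback"].append("url must start with http:// or https://")

    return results["score"] >= 70, results


good = {"hero_headline": "Solar microgrids for rural clinics",
        "cta_text": "Join the pilot", "url": "https://example.org"}
weak = {"hero_headline": "Hi", "cta_text": "", "url": "https://example.org"}

print(validate_landing_page_sketch(good)[1]["score"])  # 100
print(validate_landing_page_sketch(weak)[1]["score"])  # 50, below the pass mark of 70
```

Because the function reads only its input and has no randomness, calling it repeatedly on the same dict always returns an equal result, which is the property that makes it unit-testable.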
The other validators follow the same grammar, and each one encodes a small piece of domain judgement as code rather than as a prompt:

- validate_positioning requires four fields - target audience, core problem, value proposition, primary benefit - each non-trivial (more than ten characters). Four fields, twenty-five points each, pass at seventy-five.
- validate_org_chart wants a non-empty org chart that contains a Founder role, and OKRs where each objective carries at least two key results. A chart with no founder, or objectives with no measurable key results, simply does not score.
- validate_marketing_email checks a subject of real length, body copy past a hundred characters, and the literal presence of a call-to-action marker ([CTA], a link, "Sign up").
- validate_financial_plan is the most opinionated, and the best illustration of the principle. It checks that the MRR ramp never decreases from one month to the next and that the breakeven month lands in a sane 1-24 range:

```python
# submission/tools/code_interpreter_wrappers.py
is_monotonic = all(nums[i] <= nums[i + 1] for i in range(len(nums) - 1))
# ...
be = fp.get("breakeven_month")
if isinstance(be, (int, float)) and 1 <= be <= 24:
    results["checks"]["breakeven_sane"] = True
    results["score"] += 15
```

A model can write a beautiful narrative around a revenue plan that quietly shrinks in month four, or that breaks even in month ninety. A monotonicity check catches the first; a range check catches the second. Neither needs a second opinion from a language model - they are arithmetic. That is the whole thesis of Layer 1: anything you can state as a rule, state as a rule.

One set of validators, two jobs

The same pure functions do double duty, and that is deliberate.
Through _score_artifact they compute the floor - the role's validators run, and the floor is the highest score any of them returns, plus a richness heuristic so an off-schema artifact never floors at a flat zero.

Through _maf_tool_fns the same functions are wrapped as the capped mid-run FunctionTools covered later in this post. One implementation, one source of truth for "is this artifact structurally sound" - exposed both as the gate's floor and as the tool the model calls on its own draft. When you change a validator, the model's self-check and the gate's floor move together; they can never drift apart.

The max is a deliberate choice, not a shortcut. A strategist artifact carries both an org chart and a positioning block; taking the best validator score means a strong org chart is not dragged down by a thin positioning section, while a worker that nails neither still cannot fake a floor. It is a forgiving floor for partial work and an honest one for empty work.

Layer 2: rubric_evaluate, floored

Layer 1 tells you whether an artifact is well-formed. It cannot tell you whether the positioning is sharp or the OKRs are ambitious. That nuance is what Layer 2 is for, and it is where a Foundry model finally gets to judge - inside a cage. rubric_evaluate in submission/agents/worker_factory.py scores the artifact on four weighted dimensions:

```python
# submission/agents/worker_factory.py
RUBRIC_DIMENSIONS = [
    ("Relevance to goal", 30),
    ("Completeness", 25),
    ("Actionability", 25),
    ("Clarity & structure", 20),
]
```

In live mode it asks the narrator deployment - the same Foundry model that powers the Master Narrator - to score each dimension 0-100 and return strict JSON, with a generous token budget because reasoning models spend tokens thinking before they answer:

```python
# submission/agents/worker_factory.py - inside rubric_evaluate
resp = create_chat_completion(
    deployment,
    [
        {"role": "system", "content": (
            "You are a strict rubric evaluator for business artifacts. "
            "Score the artifact 0-100 on each dimension: " + dims_spec + ". "
            "Return ONLY JSON: {dimensions: [...], verdict: one sentence}.")},
        {"role": "user", "content": (
            f"Venture brief: {brief[:600]}\nStage goal: {stage.goal}\n"
            f"Artifact JSON:\n{json.dumps(artifact)[:4000]}")},
    ],
    max_completion_tokens=2500,
)
parsed = _extract_json(resp.choices[0].message.content or "") or {}
```

The prompt names the dimensions and their weights inline (via dims_spec), demands JSON only, and caps the serialised artifact at 4000 characters so a sprawling object cannot blow the context window. The response still goes through the same _extract_json we will meet in a moment - because even a "return ONLY JSON" instruction is a request, not a guarantee.

Then it does something important: it does not trust the model's structure. It re-anchors the model's answer to our own spec - our dimension names, our weights - and keeps only the model's scores:

```python
# submission/agents/worker_factory.py - inside rubric_evaluate
by_name = {str(d.get("name", "")).strip().lower(): d
           for d in dims if isinstance(d, dict)}
dimensions = []
for name, weight in RUBRIC_DIMENSIONS:
    d = by_name.get(name.lower(), {})
    dimensions.append({
        "name": name,
        "weight": weight,
        "score": max(0, min(100, int(d.get("score", floor)))),
        "note": str(d.get("note", ""))[:120],
    })
```

This is a small piece of defensive engineering with a large payoff. A model asked for four dimensions might return three, invent a fifth, or quietly reweight them so its favourite one dominates. By looping over our RUBRIC_DIMENSIONS and pulling scores by name - defaulting any missing dimension to the validator floor, clamping every score to 0-100 - the weighting stays ours. The model colours inside lines it did not draw.

Then the two layers meet in one line: rubric["final"] = max(floor, rubric["weighted_total"]).
The final score is one line of math

That single line - final = max(floor, weighted_total) - is the entire trust model, and it is worth seeing as a picture, because it is only three numbers: the floor, the weighted total, and the final score.

If the artifact is structurally sound, the floor is high and the model can only push the score higher by recognising genuine quality. If the model is having a bad day and lowballs a perfectly valid artifact, the floor protects it. The model's judgement is additive, never subtractive. You get the nuance of a model evaluator with the safety of a deterministic one - and you can explain, down to the point, exactly why any score is what it is.

Why a deterministic floor, not just a model judge

This is the single most important reliability decision, and it is the same principle Lee Stott states in his Hybrid AI Agents in Python post: code the rules, and let the LLM judge only what is left. As he puts it about privacy controls - if your check depends on an LLM correctly classifying every input, you do not have a control, you have a probability distribution.

We apply the identical principle to artifact quality. The validators are boring on purpose. Does the landing page have a headline, a CTA, a hero section? Is the marketing email parseable? Do the URLs resolve? Does the org chart have a Founder and key results on every objective? These are structural facts, checked in code, that no amount of confident prose can override. A model that has just written a landing page is the worst-placed party to certify it - so it is graded by something it cannot sweet-talk.

In the game you can watch this happen. A worker delivers, and the report names the score and the verdict in plain language: the deterministic validator scored it 100 of 100 - it passes the gate and the company graph grows.
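The arithmetic from the sections above can be sketched end to end. This is a minimal illustration, not the repo's rubric_evaluate: the function name final_score_sketch is hypothetical, and the formula for weighted_total (a weight-normalised mean, rounded to an integer) is an assumption, since the post only shows the final max.

```python
RUBRIC_DIMENSIONS = [
    ("Relevance to goal", 30),
    ("Completeness", 25),
    ("Actionability", 25),
    ("Clarity & structure", 20),
]


def final_score_sketch(floor, model_dims):
    """Re-anchor model scores to our dimensions, then apply the floor.

    model_dims is the (untrusted) list of {"name", "score"} dicts the model returned.
    """
    by_name = {str(d.get("name", "")).strip().lower(): d
               for d in model_dims if isinstance(d, dict)}
    dimensions = []
    for name, weight in RUBRIC_DIMENSIONS:
        d = by_name.get(name.lower(), {})
        # Missing dimensions default to the floor; every score is clamped to 0-100.
        score = max(0, min(100, int(d.get("score", floor))))
        dimensions.append({"name": name, "weight": weight, "score": score})

    # ASSUMPTION: weighted_total is the weight-normalised mean of the scores.
    total_weight = sum(w for _, w in RUBRIC_DIMENSIONS)
    weighted_total = round(
        sum(d["score"] * d["weight"] for d in dimensions) / total_weight)
    return {"dimensions": dimensions,
            "weighted_total": weighted_total,
            "final": max(floor, weighted_total)}


# The model recognises quality: the score rises above the floor.
generous = [{"name": "Relevance to goal", "score": 90},
            {"name": "Completeness", "score": 85},
            {"name": "Actionability", "score": 70},
            {"name": "Clarity & structure", "score": 95}]
print(final_score_sketch(80, generous)["final"])   # 85

# The model lowballs a valid artifact: the floor holds.
lowball = [{"name": n, "score": 40} for n, _ in RUBRIC_DIMENSIONS]
print(final_score_sketch(80, lowball)["final"])    # 80

# The model invents a dimension and skips three: extras are ignored,
# missing ones fall back to the floor.
odd = [{"name": "Relevance to goal", "score": 100}, {"name": "Vibes", "score": 100}]
print(final_score_sketch(60, odd)["final"])        # 72
```

The third case is the reason for looping over our own dimension list: an invented "Vibes" dimension never enters the sum, so the model cannot reweight the rubric by adding to it.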
Figure: A worker delivers a positioning brief, an org chart, and Q1 OKRs; the deterministic validator scores it 100/100 and it passes the gate.

Reasoning models and strict JSON do not mix

There is a sharp edge hiding in Layer 2 that bites everyone who puts a reasoning model behind a JSON contract: the model wraps its answer in think-blocks, prepends a sentence of preamble, or fences the JSON in markdown - and json.loads throws. If your rubric evaluator crashes on a stray backtick, your "deterministic floor" was never deterministic; it was one parse error away from no score at all.

So every agent that must read JSON out of a model goes through a tolerant extractor. The same _extract_json shape appears in worker_factory.py, org_designer.py, founder_analyst.py, and world_designer.py - kept local to each module on purpose, so every agent is self-contained. It tries, in order: strip a Markdown code fence; parse the whole string; parse the substring from the first { to the last }; then scan character by character and let json.JSONDecoder().raw_decode find the first valid object:

```python
# submission/agents/worker_factory.py
decoder = json.JSONDecoder()
for index, char in enumerate(text):
    if char != "{":
        continue
    try:
        parsed, _ = decoder.raw_decode(text[index:])
        return parsed if isinstance(parsed, dict) else None
    except Exception:
        continue
return None
```

When all four strategies fail, the function returns None and the caller falls through to _rubric_from_floor - the deterministic rubric derived from the validator floor. There is no path where a malformed model response yields no score; the floor is always there to catch it. Tolerant parsing plus a deterministic fallback is what lets you run reasoning models inside a scoring loop without the loop ever dropping a frame.

The receipts panel: scoring you can audit, not trust

Every worker exposes a receipts panel - the artifact's scoring proof, broken out so a skeptic can audit it instead of taking the number on faith.
It shows status, the model that ran, token usage, estimated call cost, latency, and how many tool calls the worker made out of its capped budget.

Figure: The receipts panel: status, model, tokens, est. call cost, latency, and tool-call count for a worker run.

This is the runtime equivalent of carrying a correlation ID through every path. Four proof points are emitted on every invocation, in live mode and simulation mode alike, and they are written straight into the STAGE_EXECUTED replay event in submission/tools/server.py:

```python
# submission/tools/server.py - the STAGE_EXECUTED event payload
"iq_hits": invocation.iq_sources,
"memory_injected": invocation.maf_memory,
"tools_called": invocation.maf_tools_called,
"inference_usage": {"client": invocation.maf_client or "openai-direct",
                    "fallback_reason": invocation.maf_fallback_reason,
                    "tokens_in": invocation.tokens_in,
                    "tokens_out": invocation.tokens_out,
                    "reasoning_tokens": invocation.reasoning_tokens},
"rubric": stage.rubric,
```

- iq_hits - which Foundry IQ sources grounded the work
- memory_injected - whether the CEO's prior decisions entered the brief
- tools_called - which deterministic tools the worker actually ran
- inference_usage - the client used, tokens in and out, and reasoning tokens

These are not decoration; they are enforced. submission/tools/demo_smoke_test.py walks every invocation in a simulated run and fails the build if any proof point is absent:

```python
# submission/tools/demo_smoke_test.py
_require(bool(p.get("iq_hits")), f"{cid}: iq_hits empty - IQ recall not evidenced")
_require(bool(p.get("memory_injected")), f"{cid}: memory_injected empty")
_require("tools_called" in p, f"{cid}: tools_called missing")
usage = p.get("inference_usage") or {}
_require(bool(usage.get("client")), f"{cid}: inference_usage.client missing")
```

When a judge asks "did the agent actually do anything, or is this theatre?", the answer is a panel you open, not a sentence you say - backed by a test that would have gone red if the panel were empty.
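For readers who want to try the evidence check without the repo, the four conditions can be collected into one pure function. This is a sketch under assumptions: missing_proof_points and the sample payloads are hypothetical, and the repo's _require helper (whose body the post does not show) is replaced by returning a list of problems.

```python
def missing_proof_points(cid, payload):
    """Return one message per absent proof point; an empty list means the stage is evidenced.

    Mirrors the four conditions in the smoke-test excerpt. Note that
    tools_called only has to be present (it may be an empty list),
    while the other three must be non-empty.
    """
    problems = []
    if not payload.get("iq_hits"):
        problems.append(f"{cid}: iq_hits empty - IQ recall not evidenced")
    if not payload.get("memory_injected"):
        problems.append(f"{cid}: memory_injected empty")
    if "tools_called" not in payload:
        problems.append(f"{cid}: tools_called missing")
    usage = payload.get("inference_usage") or {}
    if not usage.get("client"):
        problems.append(f"{cid}: inference_usage.client missing")
    return problems


# Hypothetical payloads, shaped like the STAGE_EXECUTED excerpt above.
evidenced = {
    "iq_hits": ["positioning-playbook"],
    "memory_injected": ["CEO chose Depth"],
    "tools_called": [],
    "inference_usage": {"client": "openai-direct", "tokens_in": 0, "tokens_out": 0},
}
theatre = {"iq_hits": [], "memory_injected": ["CEO chose Depth"],
           "inference_usage": {"client": "openai-direct"}}

print(missing_proof_points("stage-1", evidenced))  # []
print(missing_proof_points("stage-2", theatre))
```

A build gate then reduces to asserting that the returned list is empty for every stage in a simulated run.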
Tools the model can call - but capped

On the Agent Framework path the validators are not just a post-hoc gate; they are @tool FunctionTools the worker can call mid-run to test its own draft. That is good - a model that can check itself produces better artifacts. But an unbounded self-check is a failure mode: a model in a tight spot will call the same validator in a loop and burn your budget. So in submission/agents/maf_runtime.py every tool is wrapped with a hard cap, and every call leaves a receipt carrying its arguments, result, and latency:

```python
# submission/agents/maf_runtime.py
@tool(name=tool_name,
      description=f"Run the deterministic '{tool_name}' check on a draft artifact "
                  "(pass the artifact as a JSON string). Call at most once.",
      max_invocations=2)
def _t(artifact_json: str) -> str:
    meta["maf_tools_called"].append(tool_name)
    receipt = {"tool": tool_name, "source": "maf-midrun", "args": {}, "result": "", "ms": 0.0}
    meta["maf_tool_trace"].append(receipt)
    t0 = time.perf_counter()
    # ... parse artifact, run fn(payload), summarise ...
    receipt["result"] = f"score={score} checks {passed}/{len(checks)}"
    receipt["ms"] = round((time.perf_counter() - t0) * 1000, 1)
```

The model may check its draft twice. It may not certify it. The description asks the model for a single call; max_invocations=2 is the hard limit, enforced by the runtime, not by a polite instruction the model can choose to ignore. And because every call appends to maf_tool_trace, the receipts panel can show you the exact tool, the artifact keys it inspected, the score it got back, and how many milliseconds it took. The certification is the human's, at the gate; the tool is only ever advisory.

The same proof in simulation mode

Forkability is a rubric criterion, so the gate cannot depend on Azure. After a git clone with zero credentials the system runs in simulation mode - and crucially, the same three layers run.
_mock_artifact produces a well-formed artifact, the real validators score it, and rubric_evaluate falls through to _rubric_from_floor - a deterministic breakdown anchored to the validator score with a tiny, fixed spread:

```python
# submission/agents/worker_factory.py
def _rubric_from_floor(floor):
    spread = [5, 0, -5, 0]  # mild, deterministic variation around the floor
    dimensions = [
        {"name": name, "weight": weight,
         "score": max(0, min(100, floor + spread[i])),
         "note": "derived from deterministic validators"}
        for i, (name, weight) in enumerate(RUBRIC_DIMENSIONS)
    ]
    # ... returns the same shape rubric_evaluate returns in live mode ...
```

The four proof points are emitted on this path too. inference_usage.client reads "openai-direct" or a simulation marker instead of FoundryChatClient, but the field is present - which is exactly why demo_smoke_test.py passes offline. The rule we hold to: if your simulation mode does not emit the same evidence as live, you are testing a different program than the one you ship.

When the model fails the artifact

A model in a scoring loop will, eventually, hand you garbage: JSON truncated because it hit the token ceiling, prose where you asked for an object, or an outright exception from the endpoint. The gate cannot ship a blank stage, so the worker degrades down a fixed ladder rather than failing open.

First, a weak Agent Framework run falls through. After the MAF path returns, the artifact is scored; if it is unparseable or the floor is below 40, the code raises and the worker retries on the direct OpenAI-compatible path. A half-formed artifact never reaches the gate:

```python
# submission/agents/worker_factory.py
artifact = _extract_json(content)
floor = _score_artifact(role, artifact)
if not artifact or floor < 40:
    raise ValueError(f"MAF artifact too weak (floor={floor})")
```

Second, an empty live artifact degrades to the deterministic mock.
If even the direct call returns nothing parseable, the worker substitutes _mock_artifact, records a maf_fallback_reason, and the validators score the mock - so every stage still produces a real, gradeable artifact instead of a blank one. This is the same "simulation fallback for everything" law the rest of the repo follows.

Third, a thrown exception becomes a receipt, not a crash. When the endpoint itself fails, the invocation is marked status="failed" with the error string captured, and the STAGE_EXECUTED event carries status, error, and whatever partial tool_trace accumulated straight into the replay log:

```python
# submission/agents/worker_factory.py
except Exception as e:
    invocation.status = "failed"
    invocation.error = f"{type(e).__name__}: {e}"
    invocation.completed_at = time.time()
    return invocation, None, 0
```

The point is not that failures never happen - it is that a failure is visible and bounded. The floor still applies, the receipt still renders, the replay log still carries the error. A failed run is auditable; it is not a blank space where progress silently appeared.

Layer 3: the gate has a third option - redirect

Approve and reject are obvious. The interesting one is redirect - the gate can present a genuine strategic fork, two defensible options, and the CEO's pick becomes binding direction for the next worker. That turns the verification gate from a quality checkpoint into a steering wheel.

Figure: A decision gate: two strategist proposals - Depth versus Breadth - each with tradeoffs, grounded in IQ sources and memory items.

Notice the metadata on that fork: the proposals are grounded in 2 IQ sources and 7 memory items in brief, and the worker reached them by calling recall, web_search, and calculate_consequence - a tool that previews the org and economic consequence before the CEO commits. The human is not rubber-stamping; they are choosing between options the workforce reasoned out and priced.
Where XP is actually awarded

The whole architecture funnels into one function. Approval - and only approval - calls approve_current_step in submission/tools/server.py, which awards XP and advances the campaign:

```python
# submission/tools/server.py - approve_current_step
xp_earned = 10 + (score // 10)
```

There is no other code path that mints XP. Not the validators, not the rubric, not the model. A high gate score enables a large reward, but a human pressing approve is what releases it. That is the responsible-AI guarantee expressed as control flow: the model can make the case, the validators can vouch for the structure, but the only function that turns work into progress is gated behind a person.

How the approve, reject, or redirect decision is then written to memory and visibly changes the next chapter's brief is the subject of Part 3.

Operational lessons learned

- Floor first, judge second. A model rubric on top of a deterministic floor gives you nuance without giving up safety. A model rubric alone gives you a confident scoreboard with no foundation.
- Re-anchor the model's structure to yours. Keep the model's scores, throw away its shape. Loop over your dimensions and weights so the model cannot reweight the rubric in its own favour.
- Cap every tool. max_invocations=2 is not a performance tweak; it is a containment boundary the runtime enforces.
- Log the proof on every path, then test for it. If your simulation mode does not emit the same proof points as live, you are testing a different program than you ship - so write a smoke test that fails the build when a proof point goes missing.
- Make the gate diegetic. Players (and judges) trust a control they can see. A score with a visible floor and an audit panel reads as engineering; a score from nowhere reads as marketing.
- Reasoning models and strict JSON do not mix. Anything that must emit JSON gets a tolerant extractor that survives think-blocks and markdown fences, with a deterministic fallback when every strategy fails.
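The last lesson is the easiest to lift into another project. Below is a minimal, standalone sketch of a tolerant extractor that follows the four strategies the post lists (strip a Markdown fence, parse the whole string, parse from the first { to the last }, then scan with raw_decode). It is not the repo's _extract_json: the name extract_json_sketch and the fence regex are assumptions made for illustration.

```python
import json
import re

FENCE = "`" * 3  # built at runtime so this example contains no literal fence
_FENCED = re.compile(FENCE + r"(?:json)?\s*(.*?)" + FENCE, re.DOTALL | re.IGNORECASE)


def _loads_dict(candidate):
    """json.loads that returns a dict or None, never raises."""
    try:
        parsed = json.loads(candidate)
    except ValueError:  # json.JSONDecodeError subclasses ValueError
        return None
    return parsed if isinstance(parsed, dict) else None


def extract_json_sketch(text):
    """Best-effort extraction of the first JSON object from a model reply."""
    if not text:
        return None
    text = text.strip()

    # 1. Strip a Markdown code fence if one is present.
    match = _FENCED.search(text)
    candidate = match.group(1).strip() if match else text

    # 2. Parse the whole string.
    parsed = _loads_dict(candidate)
    if parsed is not None:
        return parsed

    # 3. Parse the substring from the first "{" to the last "}".
    start, end = candidate.find("{"), candidate.rfind("}")
    if 0 <= start < end:
        parsed = _loads_dict(candidate[start:end + 1])
        if parsed is not None:
            return parsed

    # 4. Scan for the first position where a valid object begins.
    decoder = json.JSONDecoder()
    for index, char in enumerate(candidate):
        if char != "{":
            continue
        try:
            parsed, _ = decoder.raw_decode(candidate[index:])
        except ValueError:
            continue
        if isinstance(parsed, dict):
            return parsed
    return None


reply = '<think>maybe {not json}</think> The answer is {"score": 75} thanks'
print(extract_json_sketch(reply))           # {'score': 75}
print(extract_json_sketch("no json here"))  # None
```

A caller would pair it with the deterministic fallback, for example `parsed = extract_json_sketch(reply) or {}`, so that a None result degrades to the floor-derived rubric instead of raising.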
Responsible AI

The gate is the responsible-AI architecture, stated as a game rule: nothing becomes progress without explicit human approval. Every approval is logged with the full reasoning chain in the replay log. Every rejection is written to memory as binding direction, so the same mistake is not made twice. Deterministic validators bound what the model can claim about its own output, the rubric re-anchors the model's judgement to weights it does not control, and the replay log preserves the whole chain for audit. The human's authority is not a courtesy - it is the only function that awards XP, enforced in code.

Where this lives in the repo

If you want to read the implementation, every piece of this post is in five files:

| Concern | File | Key symbol |
| --- | --- | --- |
| Deterministic validators (Layer 1) | submission/tools/code_interpreter_wrappers.py | validate_landing_page, validate_positioning, validate_org_chart, validate_financial_plan, validate_marketing_email |
| Rubric + floor + tolerant JSON (Layer 2) | submission/agents/worker_factory.py | rubric_evaluate, _rubric_from_floor, _extract_json |
| Capped mid-run tools + receipts | submission/agents/maf_runtime.py | _wrap, max_invocations=2, maf_tool_trace |
| Proof points, human gate, XP award (Layer 3) | submission/tools/server.py | approve_current_step, the STAGE_EXECUTED payload |
| Build-fails-without-evidence test | submission/tools/demo_smoke_test.py | the _require evidence checks |

Try it

Play a chapter and reject the first artifact on purpose - watch the rejection reshape the next brief. Play the live app or run it locally:

```bash
git clone https://github.com/princepspolycap/agentsleague-afterbuild
cd agentsleague-afterbuild && python3 -m venv .venv && source .venv/bin/activate
pip install -r submission/requirements.txt
python3 submission/tools/run_quest_simulation.py --pitch "Your idea here"
```

The simulator runs the same three layers with zero Azure credentials, so you can step through approve, reject, and redirect offline before you ever wire up Foundry.
Key takeaways

- Three scoring layers, one rule: no agent grades itself; only a human awards progress.
- Deterministic validators set a floor the model's rubric can raise but never lower - final = max(floor, weighted_total).
- Re-anchor the model's rubric to your own dimensions and weights; keep its scores, not its structure.
- Expose receipts - model, tokens, cost, latency, tool calls - so scores are audited, not trusted, and test for them so the build fails without them.
- Cap every tool the model can call; an uncapped self-check is a budget leak.
- The gate's third option, redirect, turns verification into steering.

A gate that only ever says "yes" is a dialog box. A gate that can say no, can redirect, floors a model with facts, re-anchors its judgement, and logs every decision is a reliability architecture. That is the difference between a demo and a system you would let touch something real.

Part 2 of 5. Next: agents that remember the boss - how a gate decision becomes binding memory and visibly changes the next artifact on Microsoft Foundry.

About this project

I built a reasoning-agent game where you play a founder solving a world-improvement mission, and an AI workforce does the work while you make the calls.

How it plays:

- You pitch a mission (like "solar microgrids for rural clinics")
- A Foundry-powered Master Narrator breaks it into an 8-stage quest graph
- An Org Designer agent builds you a custom digital workforce (Strategist, Designer, Marketer, Ops)
- Each stage runs as a real agent on the Microsoft Agent Framework, grounded in Foundry IQ
- You play tactical cards, counter a rival antagonist, and approve every artifact at a verification gate before it counts

Why it's different: the reasoning IS the gameplay. Decomposition, IQ citations, memory, tool calls, and a deterministic validator floor are all visible as cards and receipts. Every reasoning agent runs on Microsoft Foundry. Runs after git clone with zero credentials (simulation mode), so it's fully forkable and MIT.
Try it / check it out:

- Live app (hosted on Azure): worldforge-game.mangowater-fa8b860a.eastus2.azurecontainerapps.io
- GitHub (public, MIT): github.com/princepspolycap/agentsleague-afterbuild
- Demo video: youtu.be/ElGXboGh6NE
- Live battle replay: Agents League - Reasoning Agents on Microsoft Reactor

Would love any feedback, pull requests, or ideas. Built for the Reasoning Agents track.

Microsoft Agent Framework and Microsoft Foundry docs

Agents That Remember the Boss: Closing the Loop with Foundry Agent Service Memory
Part 3 of 5. Most multi-agent demos have a quiet secret: the human's decisions change nothing. You approve an artifact, reject another, pick a direction at a fork - and the next agent runs with the same prompt it would have run with anyway. The loop is open.

In Part 2, every artifact stops at a CEO gate. This post is about what happens after the gate: how the decision is written to the Foundry Agent Service memory store, recalled into the next worker's brief, and enforced as binding direction - so a choice in chapter 2 visibly changes the artifact in chapter 3.

Memory is not a transcript. Memory is the set of decisions the agent is no longer allowed to ignore.

The memory loop in one diagram

Every piece below is in two files: the store and its keyless twin in submission/agents/memory.py, and the injection hook in submission/agents/maf_runtime.py. The writes happen at the gate in submission/tools/server.py. Open them alongside this post.

Knowledge and memory are different rails

Before the loop makes sense you have to separate two things that look alike and are not. Foundry IQ is what the company knows - stable, curated, cited playbooks that ground the work (Part 2's retrieval). Agent memory is what the agents learn about this specific CEO - the decisions they made and the operating patterns behind them. IQ is shared and durable; memory is personal and accumulating. The opening docstring of memory.py states the line plainly:

```python
# submission/agents/memory.py
"""Agent memory: what the workers learn from the player across a venture.

Memory is NOT Foundry IQ. IQ answers from stable, curated source knowledge
(the playbooks in submission/knowledge/). Memory holds what the agents learn
from the CEO during play: gate decisions and the operating patterns behind
them, the founder/company profile, and short summaries of shipped artifacts.
"""
```

The game keeps them on separate rails - and separate panels - so the two are never confused. IQ grounds the work; memory personalises it.
Mix them and you get a model that cannot tell a durable fact from a fresh instruction.

The decision receipt: the closed loop, made visible

When you commit a choice at a fork, the game does not just continue - it shows you the loop closing, step by step. This single screen is the entire thesis of this post.

Figure: A decision receipt: 1 You decided, 2 Consequence applied with before-and-after deltas, 3 Workers learned, 4 the next brief carries the decision as binding direction.

Read it top to bottom:

1. You decided - "Depth: own one similar customer segment niche end to end," with the tradeoff you accepted spelled out.
2. Consequence applied - the company state changes deterministically: workers, monthly burn, leverage, proof, trust, velocity, autonomy - each shown as a before -> after delta. The decision moved real numbers, not just narration.
3. Workers learned - a procedural memory is written (local-memory here, the Foundry store when configured): "CEO chose 'Depth...' at the 'NEED' gate accepting tradeoff: stabilized loop, slower reach."
4. Next brief - the following stage names the worker that will execute it and the binding line: "executes this with your decision as binding direction."

Decision -> consequence -> memory -> recall -> changed next artifact. That loop is the game.

Consequences are deterministic, not narrated

The decision receipt's middle row - the before/after deltas - is not the narrator improvising numbers. Every dilemma choice maps to an explicit rule in submission/state/consequences.py, and the rule, not the model, mutates company state:

```python
# submission/state/consequences.py
RULES = {
    "strategist.depth": {
        "match": ("depth", "niche", "painful workflow", "moat"),
        "economics_delta": {"proof": 9, "trust": 7, "velocity": -4,
                            "burn_pressure": 3, "autonomy": 1, "runway_months": -1},
        "revenue_delta": 400,
        "consolidates": True,
        "role": { ... a new worker this choice adds to the org ... },
    },
    # ...
}
```

The narrator may phrase the fork a hundred different ways in live mode, but strategist.depth always moves proof +9, trust +7, velocity -4, and can even add a specific worker to the org. That separation - a model for the words, a rule for the mechanics - is why the same decision produces the same company-state delta every time, and why the receipt can show a before/after you can trust. It is the Part 2 principle (code the rules, let the model judge the prose) applied to game economics. The consequence moves the numbers; the memory write is what makes sure the next worker knows which way the CEO just steered.

Where memory is written: two points, not every message

The quickest way to ruin a memory system is to write everything to it. A store full of chit-chat is a retrieval problem; a store of decisions is a steering wheel. So memory is written at exactly two moments, both of them load-bearing: a gate decision (approve, reject, or fork - with the CEO's reason) and a chapter completion (a compact summary of what shipped). In the replay log those land as a MEMORY_WRITTEN event right next to the CEO_DECISION and CONSEQUENCE_APPLIED events that caused them.

The write itself is one call, and it does three quiet but important things - clamp, dedupe, and record provenance:

```python
# submission/agents/memory.py
def remember(kind, text, payload=None):
    if kind not in _KINDS:  # _KINDS = user_profile / procedural / chat_summary
        kind = "procedural"
    text = (text or "").strip()[:400]  # clamp - memory entries are short by law
    if not text:
        return {}
    sent = _foundry_add(kind, text, payload)  # try the Foundry store
    entry = {"kind": kind, "text": text, "payload": payload or {},
             "ts": time.time(),
             "origin": "foundry-memory" if sent else "local-memory"}
    # dedupe on (kind, text) so replays/idempotent endpoints don't pile up
    items = [m for m in _load_local()
             if not (m["kind"] == kind and m["text"] == text)]
    items.append(entry)
    _save_local(items)
    return entry
```

Notice origin. Every entry records which store accepted it - foundry-memory when the Agent Service store took it, local-memory when it fell back to the on-disk ledger. That single field is what lets the UI and the replay log tell you, honestly, where a memory lives. It is the same degradation-is-labelled discipline the whole repo follows.

What the gate write actually looks like

The first write point lives in submission/tools/server.py, in the handler that records a CEO choice. One choice produces the procedural memory and the events that make the loop auditable - written in the same breath:

```python
# submission/tools/server.py - recording a gate decision
mem_entry = remember_for_run(state, "procedural",
    f"CEO chose '{choice['option']}' at the '{stage.title}' gate"
    + (f" accepting tradeoff: {choice['tradeoff']}" if choice['tradeoff'] else "")
    + f". Consequence: {choice['consequence_summary']}",
    {"stage_id": stage.id})
if mem_entry:
    store.log_event("MEMORY_WRITTEN", "memory",
        f"Procedural memory stored ({mem_entry['origin']}): {mem_entry['text'][:120]}")
store.log_event("CEO_DECISION", "founder",
    f"Gate decision after '{stage.title}': {choice['option']}")
store.log_event("CONSEQUENCE_APPLIED", "system",
    f"{consequence['rule_id']} changed the company: {consequence['summary']}")
```

The procedural memory is the operating pattern ("chose Depth, accepted slower reach"); the MEMORY_WRITTEN event records that it was stored and which store took it; the CEO_DECISION and CONSEQUENCE_APPLIED events sit beside it so the whole cause-and-effect is one readable strip in the replay log (Part 4). The memory is not a side effect bolted onto the decision - it is recorded in the same place the decision is logged, with the same run_id scope.

The second write point: shipping a chapter

The other write happens when a chapter ships.
_remember_stage in worker_factory.py records a chat_summary - a one-line note of what was delivered and how well it scored - so the narrator keeps continuity without the workers re-reading every artifact: # submission/agents/worker_factory.py def _remember_stage(stage, worker_title, score, *, run_id=None): try: remember("chat_summary", f"Stage '{stage.title}' shipped by {worker_title} (score {score}/100). " f"Goal: {stage.goal[:120]}", {"stage_id": stage.id, "run_id": run_id}) except Exception: pass # memory must never break the game loop It is called on every execution path - live, Agent Framework, and simulation - so a chapter that shipped offline leaves the same memory trail as one that ran on Foundry. And, like every write in this module, it is wrapped in a best-effort try/except : a summary that fails to store is a missing note, never a failed chapter. The store and its keyless twin The preferred store is the Foundry Agent Service memory store on the project endpoint. It is reached over plain HTTPS with an AAD bearer token - no exotic SDK - against the preview memory API: # submission/agents/memory.py resp = httpx.post( f"{cfg['endpoint']}/memoryStores/{cfg['store']}/memories", params={"api-version": "2025-11-15-preview"}, headers=_foundry_headers(), # DefaultAzureCredential bearer json={"kind": kind, "content": text, "metadata": payload or {}}, timeout=8.0, ) Two environment variables turn it on - FOUNDRY_PROJECT_ENDPOINT and FOUNDRY_MEMORY_STORE . When they are absent, or the call fails once, a module-level flag ( _FOUNDRY_MEM_AVAILABLE ) flips to False and the process stops trying for the rest of its life - one failed probe should not tax every subsequent write. From then on remember writes only to the local ledger at submission/state/memory.json , which uses the identical schema. 
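The "fail once, stop trying" behaviour can be sketched in isolation. This is a minimal illustration, not the repo's code: MemoryWriter, remote_add, remote_attempts and demo are hypothetical names, and the remote store is modelled as any callable that may raise rather than the real HTTPS call.

```python
import json
import tempfile
import time
from pathlib import Path


class MemoryWriter:
    """Hypothetical stand-in for the module-level availability flag."""

    def __init__(self, ledger_path, remote_add=None):
        self.ledger_path = Path(ledger_path)
        self.remote_add = remote_add            # assumed: callable(kind, text, payload)
        self.remote_available = remote_add is not None
        self.remote_attempts = 0

    def _try_remote(self, kind, text, payload):
        if not self.remote_available:
            return False                        # flag already flipped: skip the network
        self.remote_attempts += 1
        try:
            self.remote_add(kind, text, payload)
            return True
        except Exception:
            self.remote_available = False       # one failure disables the remote for good
            return False

    def _load(self):
        try:
            return json.loads(self.ledger_path.read_text(encoding="utf-8"))
        except (OSError, ValueError):
            return []                           # missing or corrupt ledger reads as empty

    def remember(self, kind, text, payload=None):
        sent = self._try_remote(kind, text, payload or {})
        entry = {"kind": kind, "text": text, "payload": payload or {},
                 "ts": time.time(),
                 "origin": "foundry-memory" if sent else "local-memory"}
        items = self._load()
        items.append(entry)                     # the local ledger is written either way
        self.ledger_path.write_text(json.dumps(items), encoding="utf-8")
        return entry


def demo():
    def broken_remote(kind, text, payload):
        raise ConnectionError("store unreachable")

    with tempfile.TemporaryDirectory() as tmp:
        writer = MemoryWriter(Path(tmp) / "memory.json", remote_add=broken_remote)
        first = writer.remember("procedural", "CEO chose Depth")
        second = writer.remember("chat_summary", "Stage shipped")
        return writer.remote_attempts, [first["origin"], second["origin"]], len(writer._load())
```

In demo() the broken remote is probed on the first write only; the second write goes straight to the ledger. Either way the entry that lands on disk has the same keys.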
That is the part that matters for forkability: a keyless clone exercises the same remember / recall_memories code path with the same entry shape, so simulation is never a different program - only a different backend. The local ledger is even written when Foundry accepts the item, so the replay log and UI can read memory without a network hop. Memory must never break the game loop There is a small reliability rule worth stating because it shapes the whole module: a memory subsystem must never be able to crash the game. Persistence is therefore written atomically and failure is swallowed by design - a corrupt write, a full disk, a permissions error must degrade to "no memory this turn," never to a 500: # submission/agents/memory.py def _save_local(items): with _lock: try: fd, temp_path = tempfile.mkstemp(dir=str(MEMORY_FILE.parent), prefix="memory_tmp_") with os.fdopen(fd, "w", encoding="utf-8") as f: json.dump(items[-200:], f, indent=1) # bounded: last 200 entries os.replace(temp_path, str(MEMORY_FILE)) # atomic swap except Exception: pass # memory must never break the game loop Three decisions hide in those ten lines. The write goes to a temp file and is swapped in with os.replace , so a reader never sees a half-written ledger. The ledger is bounded to the last 200 entries, so a long campaign cannot grow it without limit. And every exception is swallowed, because a steering wheel that can stall the engine is worse than no steering wheel. Memory is load-bearing for direction, never for liveness. Run-scoped, so two ventures never bleed One more guarantee keeps the loop honest across replays: every memory carries a run_id in its payload, and recall filters on it through _matches_run . 
A second venture - or the simulation test bench running beside the live demo - reads only its own decisions, never the previous run's:

# submission/agents/memory.py
def _matches_run(item, run_id):
    if not run_id:
        return True
    payload = item.get("payload") or {}
    return str(payload.get("run_id") or "") == run_id or payload.get("scope") == "global"

Starting a fresh venture also calls forget_all() , which empties the ledger outright. Between run-scoping and the reset, one CEO's operating patterns can never leak into another's company - which is exactly what you want when you run the same pitch twice to prove the choices are load-bearing.

What we store, and why only three kinds

Apart from the founder profile, which is written once at onboarding, memory is written at exactly two moments - gate decisions (approve / reject / fork + reason) and chapter completion. Not every message; decisions. That keeps the store small and every entry load-bearing.

Kind | Example | Injected when
user_profile | "CEO prefers premium positioning over volume pricing" | Every worker brief
procedural | "Landing pages rejected twice for weak CTAs - lead with the CTA" | Every worker brief
chat_summary | Chapter-completion summaries | Narrator context

These three map directly to the kinds in Microsoft's Agent Service memory preview, and each earns its keep:

user_profile - durable facts about the founder and company (the pitch, the name, the stage). Written once at onboarding, injected into every brief, rarely changed.
procedural - the operating patterns learned from gate choices: "prefers organic growth over paid," "accepts scope cuts to hold the date." These are the binding ones - the newest procedural memory always rides into the next brief.
chat_summary - compact summaries of completed chapters, fed to the narrator for continuity rather than to every worker.

One safety step sits between a human's words and the store.
Gate reasons are free text a CEO typed, so they run through the same scrub_secrets() redactor (in agents/model_config.py ) that cleans reasoning traces, before anything is persisted - a memory ledger is the last place you want an API key someone pasted into a justification box. Injection: a ContextProvider, not prompt soup On the Agent Framework path, memory arrives through a ContextProvider that runs before the agent - the framework's first-class hook for exactly this, instead of string-concatenating into the system prompt. The provider turns recalled decisions, IQ hits, and memories into one labelled block and hands it to the agent as instructions: # submission/agents/maf_runtime.py class CampaignMemory(ContextProvider): async def before_run(self, *, agent, session, context, state) -> None: lines = [] for d in decisions: # CEO gate decisions + consequences lines.append(f"CEO decision after '{d['stage_title']}': chose \"{d['option']}\"" + econ) # + the company-state delta it caused meta["maf_memory"].append({"kind": "ceo_decision", "text": ...}) for h in retrieval_hits[:2]: # Foundry IQ, capped lines.append(f"Knowledge base ({h['source']}): {h['content'][:400]}") for m in memories[:3]: # agent memory, capped lines.append(f"Agent memory ({m['kind']}): {m['text'][:300]}") if lines: context.extend_instructions( "campaign-memory", "Session memory (binding direction - the artifact must visibly follow " "the most recent CEO decision):\n- " + "\n- ".join(lines), ) Note the wording: binding direction. We do not ask the model to "consider" the memory - the rubric evaluation at the next gate (Part 2) checks whether the artifact actually followed it. And note what the provider appends to meta["maf_memory"] as it builds the block: that list becomes the memory_injected proof point, so the receipts panel can show precisely which memories entered this brief. A trade-off to be transparent about: injecting memory into every brief costs tokens on every invocation. 
We cap injection at the three most recent memories per kind (and two IQ hits) and truncate each to 300-400 characters. For direction-following, recency beats completeness. Recall: semantic first, then the binding procedural Getting the right memories into that block is its own small problem. recall_memories tries the Foundry store's semantic search first; if there are no credentials it falls back to the local ledger ranked by keyword overlap and recency. Either way it enforces one rule that makes "binding direction" real - the newest procedural memory always rides along, even if semantic search would not have surfaced it: # submission/agents/memory.py - recall_memories (local fallback) ranked = sorted(items, key=score, reverse=True)[:limit] # Binding rule: the newest procedural memory always rides along. procedural = [m for m in items if m.get("kind") == "procedural"] if procedural and procedural[-1] not in ranked: ranked = [procedural[-1]] + ranked[: max(limit - 1, 1)] return ranked That one guarantee is the difference between memory that informs and memory that binds. The CEO's latest operating pattern is not a candidate for retrieval - it is always in the brief. Watch it bind: the live standup Memory is not only a between-stages mechanism. Mid-run, the workforce holds a live group chat - a Microsoft Agent Framework sequential standup - where each worker reads the prior turns and the CEO's last decision as context. You can watch one worker carry the decision forward and hand off to the next. The live group chat: three workers reason in sequence, each running a real tool and handing off, ending at the CEO's turn In that standup the strategist runs Web search , the discovery analyst runs Read memory and literally says "My next brief starts from strategist.depth, current proof 46...", and the ops worker runs Watch burn and refuses to back a plan that burns faster than it earns. Then it hands to you. 
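What a Read memory step like the analyst's gets back depends on the recall rule described earlier. The following is a minimal sketch of that rule under stated assumptions: the keyword-overlap scoring, the recall function name and the sample entries are illustrative, not the repo's implementation. Unlike the repo snippet, the sketch trims to limit - 1 ranked items so the result never exceeds limit.

```python
def recall(items, query, limit=3):
    """Rank by keyword overlap then recency, and always include the newest procedural."""
    words = set(query.lower().split())

    def score(memory):
        overlap = len(words & set(memory["text"].lower().split()))
        return (overlap, memory["ts"])          # recency breaks ties

    ranked = sorted(items, key=score, reverse=True)[:limit]

    # Binding rule: the newest procedural memory rides along even if it ranked low.
    procedural = [m for m in items if m.get("kind") == "procedural"]
    if procedural:
        newest = max(procedural, key=lambda m: m["ts"])
        if newest not in ranked:
            ranked = [newest] + ranked[: max(limit - 1, 0)]
    return ranked


# Hypothetical ledger: two on-topic summaries and one off-topic decision.
SAMPLE = [
    {"kind": "chat_summary", "text": "pricing page shipped", "ts": 1},
    {"kind": "procedural", "text": "CEO chose Depth accepting slower reach", "ts": 2},
    {"kind": "chat_summary", "text": "pricing experiment page scored 80", "ts": 3},
]

result = recall(SAMPLE, "pricing page", limit=2)
```

Here the decision shares no keywords with the query, so plain ranking would drop it; the binding rule puts it first and keeps the best-ranked summary behind it.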
The CEO's decision is not a fact in a database; it is the thing the agents are visibly reasoning from. Inspect what they remember Memory you cannot see is memory you cannot trust. The whole store is inspectable through one read-only endpoint, GET /api/memory , which returns memory_snapshot() - every entry grouped by kind, with counts and the active store label: # submission/agents/memory.py def memory_snapshot(run_id=None): ... return { "store": "foundry-memory" if (cfg and _FOUNDRY_MEM_AVAILABLE) else "local-memory", "configured": bool(cfg), "counts": {k: len(v) for k, v in grouped.items()}, "memories": grouped, } The story UI's learning panel renders that snapshot live, and the developer console (Part 4) has a tab that lists every MEMORY_WRITTEN event. Starting a new venture calls forget_all() , which clears the ledger - new company, blank memory - so one CEO's operating patterns never bleed into another's run. Best practice: make memory falsifiable Memory the model is free to ignore is decoration. "Binding direction" only means something because two things check it: The next gate's rubric checks compliance - did the artifact actually follow the decision? Every invocation logs memory_injected alongside iq_hits , tools_called , and inference_usage - in live mode and simulation (where the store falls back to a local state/memory.json using the identical schema). The story UI's evidence rail shows exactly which memories entered the brief, and the developer console (Part 4) has a tab that lists every memory write as a logged event. If you cannot point to the check that enforces a memory, the memory is not a control - it is a comment. The next artifact, visibly bent It is worth being concrete about what "binding" buys you, because the payoff shows up in the artifact, not just the prompt. When a worker runs, the recalled decision is not only injected as instructions - it is also applied to the artifact's own fields. 
worker_factory calls _apply_decision_context_to_artifact(artifact, decisions) on every path, so after a Depth fork the org chart narrows, the burn line reflects the consolidated team, and the positioning names the single niche the CEO chose. The strategist that runs in chapter 3 does not get a neutral brief with a footnote; it gets a brief whose every section already leans the way you steered - and a gate rubric (Part 2) that will mark it down if it drifts back to the middle. That is the line between a memory the model merely reads and a memory the system enforces: one is advice, the other shapes the deliverable and is checked at the next gate. The strongest proof: run it twice Because the loop is real, you can run the same quest twice with opposite picks and get visibly different companies. Choose Depth and the org narrows to dominate one segment; choose Breadth and it spreads across beachheads with thinner proof. Same starting pitch, same workers, different CEO - different outcome. In a live demo, that A/B is the most convincing thing you can show: the human's choices are load-bearing, and the system can prove it. Operational lessons learned Write at decision points, not message points. A store full of chit-chat is a retrieval problem; a store of decisions is a steering wheel. Same schema in the fallback. The local JSON fallback uses the identical shape as the Agent Service store, so simulation runs exercise the same code path - you are never testing a different program. Verify compliance, do not assume it. If you cannot point to the gate check that enforces a memory, it is not a control. Show the consequence as numbers. A decision receipt that moves real meters (burn, trust, leverage) teaches more than any paragraph of narration. Scrub before you store. Gate reasons are free text from a human; run them through the same secret scrubber as the reasoning traces. Responsible AI This is user-direction memory, not surveillance. 
Only explicit decisions the user made at gates are stored; the snapshot is inspectable (there is a /api/memory endpoint and a console tab); and the local fallback keeps the data on the user's machine. The human's authority compounds over time instead of being re-litigated every chapter - and because the store holds decisions rather than raw conversation, there is far less sensitive data at rest to begin with.

Where this lives in the repo

Concern | File | Key symbol
Store + keyless twin + recall | submission/agents/memory.py | remember , recall_memories , memory_snapshot , _foundry_add , _foundry_search
Injection as binding direction | submission/agents/maf_runtime.py | CampaignMemory.before_run
Write points + replay events | submission/tools/server.py | MEMORY_WRITTEN , CEO_DECISION , CONSEQUENCE_APPLIED
Snapshot endpoint | submission/tools/server.py | GET /api/memory

Try it

Play the same opening pitch twice and choose the opposite fork each time: Play the live app or:

git clone https://github.com/princepspolycap/agentsleague-afterbuild
cd agentsleague-afterbuild && python3 -m venv .venv && source .venv/bin/activate
pip install -r submission/requirements.txt
python3 submission/tools/run_quest_simulation.py --pitch "Your idea here"

Key takeaways

An open loop - where decisions change nothing - is the most common multi-agent design flaw.
Store decisions, not transcripts; three kinds - profile, procedural, summary.
Inject via a ContextProvider as binding direction, and verify compliance at the next gate.
Show the consequence as before/after numbers so the loop is legible.
Log memory_injected on every run, and ship a same-schema local fallback so forks still close the loop.

Agent memory is not a feature for the agent. It is a feature for the human - it is how their decisions stop being suggestions.

Part 3 of 5. Next: local-first routing and the replay log - run the whole game on your own model, and trace every action that happens.

Building ShadowQuest: A Multi-Agent RPG
Artificial Intelligence is rapidly evolving beyond traditional chatbots. Today, developers are building intelligent systems where multiple AI agents collaborate, retrieve knowledge, and solve problems together. Microsoft's Agents League Hackathon provided the perfect opportunity to explore this new approach through the Reasoning Agents challenge. For this challenge, I built ShadowQuest, a fantasy role-playing game (RPG) powered by Microsoft Foundry, Foundry IQ, Azure AI Search, GPT-4.1, and GitHub Copilot. The project demonstrates how specialized AI agents can work together while using Retrieval-Augmented Generation (RAG) to deliver accurate and context-aware responses. About the Challenge Microsoft Agents League is a global developer challenge designed to encourage developers to build intelligent AI applications using Microsoft's latest AI technologies. Participants could choose from three tracks: Creative Apps, Reasoning Agents, and Enterprise Agents. I selected the Reasoning Agents track because I wanted to explore how multiple AI agents could collaborate instead of relying on a single large language model. Another important requirement for this year's challenge was integrating at least one Microsoft Intelligence Layer. For ShadowQuest, I chose Foundry IQ as the project's intelligence layer. The Idea Behind ShadowQuest Fantasy RPGs are built around storytelling, exploration, and collaboration between different characters. Every character usually has a unique role, whether it's a warrior protecting the team, a mage interpreting magical knowledge, or a rogue discovering hidden paths. I wanted to recreate this experience using AI. Instead of building one AI assistant responsible for everything, I designed a system where multiple specialized agents collaborate to create a richer and more immersive adventure. ShadowQuest is set in a fantasy world filled with magical artifacts, forgotten kingdoms, mysterious locations, and story-driven quests. 
Players can ask questions about the world, explore different locations, and learn about the game's lore through conversations with AI agents. Building the Multi-Agent Architecture The architecture follows a simple but scalable design. At the center of the system is the Game Master Agent, which acts as the orchestrator. Every player interaction starts with the Game Master. It receives the player's request, determines what information is needed, retrieves additional knowledge when required, and generates the final response. Supporting the Game Master are three specialized agents: Warrior Agent – Focuses on combat strategy and tactical decisions. Mage Agent – Provides magical knowledge, world lore, and information about ancient artifacts. Rogue Agent – Specializes in exploration, investigation, and discovering hidden information. Each agent has a clearly defined responsibility, making the system easier to understand, maintain, and extend in the future. Using Foundry IQ as the Knowledge Layer One of the most important parts of the project was integrating Foundry IQ. Instead of storing every piece of game information inside prompts, I created a dedicated knowledge base containing information about characters, magical artifacts, locations, quests, and the history of the ShadowQuest world. This approach separates knowledge from reasoning. Whenever a player asks a question, the Game Master Agent first retrieves relevant information from the knowledge base before generating a response. This ensures that answers remain consistent with the game's world while reducing hallucinations. Foundry IQ became the central source of truth for the entire project, making it easy to manage and expand the game world without constantly modifying prompts. Azure AI Search and Retrieval-Augmented Generation To enable intelligent retrieval, I connected Foundry IQ with Azure AI Search. The RPG documents were indexed, and vector embeddings were generated using Microsoft's embedding models. 
This enables semantic search, allowing the system to understand the meaning behind a player's question instead of relying only on keyword matching. For example, if a player asks about a magical relic without mentioning its exact name, Azure AI Search can still retrieve the correct information based on semantic similarity. The complete workflow looks like this: The player submits a question. The Game Master Agent receives the request. Foundry IQ queries Azure AI Search. Relevant documents are retrieved. GPT-4.1 generates a grounded response using the retrieved context. This Retrieval-Augmented Generation (RAG) approach significantly improves the quality and reliability of responses. Accelerating Development with GitHub Copilot GitHub Copilot played an important role throughout the development process. It helped generate Python classes, improve documentation, create helper functions, and speed up repetitive coding tasks. During the live demonstration, I also showed how Copilot could quickly generate a new Healer Agent, demonstrating how AI-assisted development makes it easier to extend a multi-agent application while maintaining a consistent architecture. Rather than replacing the developer, Copilot acted as an intelligent coding assistant, allowing me to focus more on architecture and design decisions. Demonstrating ShadowQuest During the Microsoft Agents League Reasoning Agents Battle, I demonstrated the Game Master Agent by asking questions about the ShadowQuest world, magical artifacts, and game lore. One of the most interesting parts of the demonstration was observing the retrieval process. Before generating a response, the Game Master Agent called the knowledge retrieval function through Foundry IQ. This confirmed that the system was retrieving relevant information from the indexed knowledge base rather than relying only on GPT-4.1's internal knowledge. This demonstrated how RAG can create more grounded, reliable, and context-aware AI experiences. 
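The five-step workflow described above can be sketched end to end without any cloud service. This is a toy under loud assumptions: the lore snippets and file names are invented for illustration, retrieval uses bag-of-words cosine similarity rather than the vector embeddings Azure AI Search provides, and the final GPT-4.1 call is left out, so the sketch stops at the grounded prompt.

```python
import math
from collections import Counter

# Hypothetical lore documents standing in for the indexed knowledge base.
LORE = {
    "relics.md": "The Ember Crown is a magical relic forged in the forgotten kingdom of Vael.",
    "places.md": "The Hollow Marsh hides a rogue path beneath the old watchtower.",
}


def vectorize(text):
    """Lower-cased word counts; a crude substitute for an embedding model."""
    return Counter(word.strip(".,?!").lower() for word in text.split())


def cosine(a, b):
    dot = sum(a[w] * b[w] for w in a if w in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def retrieve(question, k=1):
    """Steps 3-4: query the index and return the k most similar documents."""
    q = vectorize(question)
    ranked = sorted(LORE.items(),
                    key=lambda kv: cosine(q, vectorize(kv[1])),
                    reverse=True)
    return ranked[:k]


def build_grounded_prompt(question):
    """Step 5 input: the retrieved context is placed ahead of the question."""
    context = "\n".join(f"[{name}] {text}" for name, text in retrieve(question))
    return f"Answer using only the context below.\n{context}\nQuestion: {question}"


prompt = build_grounded_prompt("Which magical relic was forged in Vael?")
```

The point of the shape is the one the demonstration made: the model never sees the question without the retrieved lore attached, which is what keeps answers tied to the game world.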
Lessons Learned Building ShadowQuest taught me that designing multi-agent systems is as much about architecture as it is about AI models. Clearly defining responsibilities for each agent made the application easier to maintain and opened the door for future expansion. I also learned how valuable Retrieval-Augmented Generation can be for applications that depend on structured knowledge. Separating reasoning from knowledge allows AI systems to remain accurate while making it easier to update information over time. Finally, participating in the Microsoft Agents League was an incredible opportunity to experiment with Microsoft's latest AI technologies, learn from other developers, and share ideas with a global community passionate about agentic AI. Looking Ahead ShadowQuest is only the beginning. In future iterations, I plan to expand the project by introducing additional agents such as a Merchant Agent and Healer Agent, implementing persistent player memory, adding dynamic quest generation, improving combat mechanics, and enabling deeper collaboration between agents. These improvements will make the game world more immersive while continuing to explore the possibilities of agent-based AI systems. Conclusion ShadowQuest demonstrates how Microsoft Foundry, Foundry IQ, Azure AI Search, GPT-4.1, and GitHub Copilot can be combined to build intelligent multi-agent applications. More importantly, the project reinforced an important idea: the future of AI is not a single assistant performing every task, but a team of specialized agents collaborating with shared knowledge to solve increasingly complex problems. Participating in the Microsoft Agents League was an inspiring experience that allowed me to explore the next generation of AI development while building a project that combines storytelling, reasoning, and knowledge retrieval. 
I look forward to continuing this journey and discovering new ways to build intelligent applications using Microsoft's growing AI ecosystem.

Foundry IQ: Give Your AI Agents a Knowledge Upgrade
If you’re learning to build AI agents, you’ve probably hit a familiar wall: your agent can generate text, but it doesn’t actually know anything about your data. It can’t look up your documents, search across your files, or pull facts from multiple sources to answer a real question. That’s the gap Foundry IQ fills. It gives your AI agents structured access to knowledge, so they can retrieve, reason over, and synthesize information from real data sources instead of relying on what’s baked into the model. Why Should You Care? As a student or early-career developer, understanding how AI systems work with external knowledge is one of the most valuable skills you can build right now. Retrieval-Augmented Generation (RAG), knowledge bases, and multi-source querying are at the core of every production AI application, from customer support bots to research assistants to enterprise copilots. Foundry IQ gives you a hands-on way to learn these patterns without having to build all the plumbing yourself. You define knowledge bases, connect data sources, and let your agents query them. The concepts you learn here transfer directly to real-world AI engineering roles. What is Foundry IQ? Foundry IQ is a service within Azure AI Foundry that lets you create knowledge bases, collections of connected data sources that your AI agents can query through a single endpoint. Instead of writing custom retrieval logic for every app you build, you: Define knowledge sources — connect documents, data stores, or web content (SharePoint, Azure Blob Storage, Azure AI Search, Fabric OneLake, and more). Organize them into a knowledge base — group multiple sources behind one queryable endpoint. Query from your agent — your AI agent calls the knowledge base to get the context it needs before generating a response. This approach means the knowledge layer is reusable. Build it once, and any agent or app in your project can tap into it. 
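The three steps above can be pictured with a small in-memory model. This is a conceptual sketch only: KnowledgeSource, KnowledgeBase and their methods are hypothetical names chosen for illustration, not the Foundry IQ SDK, and keyword matching stands in for real retrieval.

```python
from dataclasses import dataclass, field


@dataclass
class KnowledgeSource:
    """Step 1: one connected source, here just a named list of documents."""
    name: str
    documents: list

    def search(self, query):
        terms = set(query.lower().split())
        return [(self.name, doc) for doc in self.documents
                if terms & set(doc.lower().split())]


@dataclass
class KnowledgeBase:
    """Step 2: several sources grouped behind a single query method."""
    sources: list = field(default_factory=list)

    def query(self, question, top=3):
        hits = []
        for source in self.sources:
            hits.extend(source.search(question))
        return hits[:top]


# Step 3: an agent would call kb.query() for context before generating a reply.
kb = KnowledgeBase([
    KnowledgeSource("course-notes", ["The project deadline is in week ten."]),
    KnowledgeSource("team-wiki", ["Deadline extensions need instructor approval.",
                                  "Lunch menu for Friday."]),
])
hits = kb.query("deadline")
```

One call returns matches from both sources, each tagged with where it came from, which is the reuse argument in miniature: the agent code never needs to know how many sources sit behind the knowledge base.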
The IQ Series: A Three-Part Learning Path The IQ Series is a set of three weekly episodes that walk you through Foundry IQ from concept to code. Each episode includes a tech talk, visual doodle summaries, and a companion cookbook with sample code you can run yourself. 👉 Get started: https://aka.ms/iq-series Episode 1: Unlocking Knowledge for Your Agents (March 18, 2026) Start here. This episode introduces the core architecture of Foundry IQ and explains how AI agents interact with knowledge. You’ll learn what knowledge bases are, why they matter, and how the key components fit together. What you’ll learn: The difference between model knowledge and retrieved knowledge How Foundry IQ structures the retrieval layer The building blocks: knowledge sources, knowledge bases, and agent queries Episode 2: Building the Data Pipeline with Knowledge Sources (March 25, 2026) This episode goes deeper into knowledge sources, the connectors that bring data into Foundry IQ. You’ll see how different content types flow into the system and how to wire up sources from services you may already be using. What you’ll learn: How to connect sources like Azure Blob Storage, Azure AI Search, SharePoint, Fabric OneLake, and the web How content is ingested and indexed for retrieval Patterns for combining multiple source types Episode 3: Querying Multi-Source Knowledge Bases (April 1, 2026) The final episode shows you how to bring it all together. You’ll learn how agents query across multiple knowledge sources through a single knowledge base endpoint and how to synthesize answers from diverse data. What you’ll learn: How to query a knowledge base from your agent code How retrieval works across multiple connected sources Techniques for synthesizing information to answer complex questions Get Hands-On with the Cookbooks Every episode comes with a companion cookbook in the GitHub repo, complete with sample code you can clone, run, and modify. This is the fastest way to go from watching to building. 
👉 Explore the repo: https://aka.ms/iq-series

Inside you’ll find:
Episode links - watch the tech talks and doodle recaps
Cookbooks - step-by-step code samples for each episode
Documentation links - official Foundry IQ docs and additional learning resources

What to Build Next

Once you’ve worked through the series, try applying what you’ve learned:
Study assistant - connect your course materials as knowledge sources and build an agent that can answer questions across all your notes and readings.
Project documentation bot - index your team’s project docs and READMEs into a knowledge base so everyone can query them naturally.
Research synthesizer - connect multiple data sources (papers, web content, datasets) and build an agent that can cross-reference and summarize findings.

Start Learning

The IQ Series is designed to take you from zero to building knowledge-driven AI agents. Watch the episodes, run the cookbooks, and start experimenting with your own knowledge bases.

👉 https://aka.ms/iq-series

Foundry IQ for Multi-Source AI Knowledge Bases
Pull from multiple sources at once, connect the dots automatically, and get accurate, context-rich answers without doing manual orchestration with Foundry IQ in Microsoft Foundry. Navigate complex, distributed data across Azure stores, SharePoint, OneLake, MCP servers, and even the web, all through a single knowledge base that handles query planning and iteration for you. Reuse the Azure AI Search assets you already have, build new knowledge bases with minimal setup, and control how much reasoning effort your agents apply. As you develop, you can rely on iterative retrieval only when it improves results, saving time, tokens, and development complexity. Pablo Castro, Azure AI Search CVP and Distinguished Engineer, joins Jeremy Chapman to share how to build smarter, more capable AI agents, with higher-quality grounded answers and less engineering overhead. Smart, accurate responses. Give your agents the ability to search across multiple sources automatically without extra development work. Check out Foundry IQ in Microsoft Foundry. Build AI agents fast. Organize your data, handle query planning, and orchestrate retrieval automatically. Get started using Foundry IQ knowledge bases. Save time and resources while keeping answers accurate. Foundry IQ decides when to iterate or exit, optimizing efficiency. Take a look. QUICK LINKS: 00:00 — Foundry IQ in Microsoft Foundry 01:02 — How it’s evolved 03:02 — Knowledge bases in Foundry IQ 04:37 — Azure AI Search and retrieval stack 05:51 — How it works 06:52 — Visualization tool demo 08:07 — Build a knowledge base 10:10 — Evaluating results 13:11 — Wrap up Link References To learn more check out https://aka.ms/FoundryIQ For more details on the evaluation metric discussed on this show, read our blog at https://aka.ms/kb-evals For more on Microsoft Foundry go to https://ai.azure.com/nextgen Unfamiliar with Microsoft Mechanics? 
As Microsoft’s official video series for IT, you can watch and share valuable content and demos of current and upcoming tech from the people who build it at Microsoft. Subscribe to our YouTube: https://www.youtube.com/c/MicrosoftMechanicsSeries Talk with other IT Pros, join us on the Microsoft Tech Community: https://techcommunity.microsoft.com/t5/microsoft-mechanics-blog/bg-p/MicrosoftMechanicsBlog Watch or listen from anywhere, subscribe to our podcast: https://microsoftmechanics.libsyn.com/podcast Keep getting this insider knowledge, join us on social: Follow us on Twitter: https://twitter.com/MSFTMechanics Share knowledge on LinkedIn: https://www.linkedin.com/company/microsoft-mechanics/ Enjoy us on Instagram: https://www.instagram.com/msftmechanics/ Loosen up with us on TikTok: https://www.tiktok.com/@msftmechanics Video Transcript: - If you research any topic, do you stop after one knowledge source? That’s how most AI will typically work today to generate responses. Instead, now with Foundry IQ in Microsoft Foundry, built-in AI powered query decomposition and orchestration make it easy for your agents to find and retrieve the right information across multiple sources, autonomously iterating as much as required to generate smarter and more relevant responses than previously possible. And the good news is, as a developer, this all just works out of the box. And joining me to unpack everything and also show a few demonstrations of how it works is Pablo Castro, distinguished engineer and also CVP. He’s also the architect of Azure AI Search. So welcome back to the show. - It’s great to be back. - And you’ve been at the forefront really for AI knowledge retrieval really since the beginning, where Azure AI Search is Microsoft’s state-of-the-art search engine for vector and hybrid retrieval, and this is really key to building out things like RAG-based agentic services and applications. So how have things evolved since then? - Things are changing really fast. 
Now, AI and agents in particular, are expected to navigate the reality of enterprise information. They need to pull data across multiple sources and connect the dots as they automate tasks. This data is all over the place, some in Azure stores, some in SharePoint, some is public data on the web, anywhere you can think of. Up until now, AI applications that needed to ground agents on external knowledge typically used a single index. If they needed to use multiple data sources, it was up to the developer to orchestrate them. With Foundry IQ and the underlying Azure AI Search retrieval stack, we tackled this whole problem. Let me show you. Here is a technician support agent that I built. It’s pointed at a knowledge base with information from different sources that we pull together in Foundry IQ. It provides our agent with everything it needs to know as it provides support to onsite technicians. Let’s try it. I’ll ask a really convoluted question, more of a stream of thought that someone might ask when working on a problem. I’ll paste in: “Equipment not working, CTL11 light is red, maybe power supply problem? Label on equipment says P4324. The cord has another label UL 817. Okay to replace the part?” From here, the agent will give the question to the knowledge base, and the knowledge base will figure out which knowledge sources to consult before coming back with a comprehensive answer. So how did it answer this particular question? Well, we can see it went across three different data sources. The functionality of the CTL11 indicator is from the machine manuals. We received them from different machine vendors, and we have them all stored in OneLake. Then, the company policy for repairs, which our company regularly edits, lives in SharePoint. And finally, the agent retrieved public information from the web to determine electrical standards. - And really, the secret sauce behind all of this is the knowledge base. So can you explain what that is and how that works? 
- So yeah, knowledge bases are first class artifacts in Foundry IQ. Think of a knowledge base as the encapsulation of an information domain, such as technical support in our example. A knowledge base comprises one or more data sources that can live anywhere. And it has its own AI models for retrieval orchestration against those sources. When a query comes in, a planning step is run. Here, the query is deconstructed. The AI model refers to the source description or retrieval instructions provided, and it connects the different parts of the query to the appropriate knowledge source. It then runs the queries, and it looks at the results. A fast, fine-tuned SLM then assesses whether we have enough information to exit or if we need more information and should iterate by running the planning step again. Once it has a high level of confidence in the response, it’ll return the results to the agent along with the source information for citations. Let’s open the knowledge base for our technician support agent. And at the bottom, you can see our three different knowledge sources. Again, machine specs pulls markdown files from OneLake with all the equipment manuals. And notice the source description which Foundry IQ uses during query planning. Policies points at our SharePoint site with our company repair policies. And here’s the web source for public information. And above, I’ve also provided retrieval instructions in natural language. Here, for example, I explicitly call out using web for electrical and industry standards. - And you’re in Microsoft Foundry, but you also mentioned that Azure AI Search and the retrieval stack are really the underpinnings for Foundry IQ. So, what if I already have some Azure AI Search running in my case? - Sure. Knowledge bases are actually AI search artifacts. You can still use standalone AI search and access these capabilities. Let me show you what it looks like in the Azure portal and in code. Here, I’m in my Azure AI Search service. 
We can see existing knowledge bases, and here’s the knowledge base we were using in Foundry IQ. Flipping to VS Code, we have a new KnowledgeBaseRetrievalClient. And if you’ve used Azure AI Search before, this is similar to the existing search client but focused on the agentic retrieval functionality. Let me run the retrieve step. The retrieve method takes a set of queries or a list of messages from a conversation and returns a response along with references. And here are the results in detail, this time purely using the Azure AI Search API. If you’re already using Azure AI Search, you can create knowledge bases in your existing services and even reuse your existing indexes. Layering things this way lets us deliver the state-of-the-art retrieval quality that Azure AI Search is known for, combined with the power of knowledge bases and agentic retrieval. - Now that we understand some of the core concepts behind knowledge bases, how does it actually work then under the covers? - Well, unlike the classic RAG technique, where we typically use one source with one index, we can use one or more indexes as well as remote sources. When you construct a knowledge base, passive data sources, such as files in OneLake or Azure Blob Storage, are indexed, meaning that Azure Search creates vector and keyword indexes by ingesting and processing the data from the source. We also give you the option to create indexes for specific SharePoint sites that you define while propagating permissions and labels. On the other hand, data sources like the web or MCP servers are accessed remotely, and we support remote access mode for SharePoint too. In these cases, we’ll effectively use the connected source’s own index for retrieval. Surrounding those knowledge sources, we have an agentic retrieval engine powered by an ensemble of models to run the end-to-end query process that is used to find information. 
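To make the plan, retrieve, assess, and iterate cycle from the transcript concrete, here is a minimal Python sketch. It is not the Foundry IQ or Azure AI Search implementation: `agentic_retrieve`, `plan`, `is_sufficient`, the `max_iterations` cap, and the toy sources are all hypothetical names invented for illustration, and the models are replaced by plain functions.

```python
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class RetrievalResult:
    references: list = field(default_factory=list)
    iterations: int = 0
    queries_issued: int = 0


def agentic_retrieve(
    question: str,
    plan: Callable,           # hypothetical planner: (question, found) -> [(source, subquery)]
    sources: dict,            # hypothetical map: source name -> search function
    is_sufficient: Callable,  # hypothetical exit check: (question, found) -> bool
    max_iterations: int = 3,  # assumed cap, not a documented setting
) -> RetrievalResult:
    result = RetrievalResult()
    while result.iterations < max_iterations:
        result.iterations += 1
        # Planning step: break the question into subqueries, each routed to a source.
        for source_name, subquery in plan(question, result.references):
            result.queries_issued += 1
            result.references.extend(sources[source_name](subquery))
        # Exit assessment: stop once the collected evidence looks complete.
        if is_sufficient(question, result.references):
            break
    return result


# Toy stand-ins for the three knowledge sources in the demo.
toy_sources = {
    "manuals": lambda q: [f"manuals:{q}"],
    "policies": lambda q: [f"policies:{q}"],
    "web": lambda q: [f"web:{q}"],
}


def toy_plan(question, found):
    # First pass consults internal sources; a later pass fills the gap from the web.
    if not found:
        return [("manuals", "CTL11 light red"), ("policies", "replace power supply")]
    return [("web", "UL 817 standard")]


def toy_is_sufficient(question, found):
    return any(ref.startswith("web:") for ref in found)


result = agentic_retrieve(
    "Okay to replace the part?", toy_plan, toy_sources, toy_is_sufficient
)
print(result.iterations, result.queries_issued)  # prints "2 3"
```

With these toy stubs the loop runs two iterations: the first pass queries the two internal sources, the check finds the standards information missing, and a second pass fetches it from the web source.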
I wrote a small visualization tool to show you what’s going on during the retrieval process. Let me show you. I’ll paste the same query we used before and just hit run. This uses the Azure AI Search knowledge base API directly to run retrieval and return both the results and details of each step. Now in the return result, we can see it did two iterations and issued 15 queries total across three knowledge sources. This is work a person would’ve had to do manually while researching. In this first iteration, we can see it broke the question apart into three aspects, equipment details, the meaning of the label, and the associated policy, and it ran those three as queries against a selected set of knowledge sources. Then, the retrieval engine assessed that some information was missing, so it iterated and issued a second round of searches to complete the picture. Finally, we can see a summary of how much effort we put in, in tokens, along with an answer synthesis step, where it provided a complete answer along with references. And at the bottom, we can see all the reference data used to produce the answer was also returned. This is all very powerful, because as a developer, you just need to create a knowledge base with the data sources you need, connect your agent to it, and Foundry IQ takes care of the rest. - So, how easy is it then to build a knowledge base out like this? - This is something we’ve worked really hard on to reduce the complexity. We built a powerful and simplified experience in Foundry. Starting in the Foundry portal, I’ll go to Build, then to Knowledge in the left nav and see all the knowledge bases I already created. Just to show you the options, I’ll create a new one. Here, you can choose from different knowledge sources. In this case, I’ll cancel out of this and create a new one from scratch. We’ll give it a name, say repairs, and choose a model that’s used for planning and synthesis and define the retrieval reasoning effort. 
This allows you to control the time and effort the system will put into information retrieval, from minimum where we just retrieve from all the sources without planning to higher levels of effort, where we’ll do multiple iterations assessing whether we got the right results. Next, I’ll set the output mode to answer synthesis, which tells the knowledge base to take the grounding information it’s collected and compose a consolidated answer. Then I can add the knowledge sources we created earlier, and for example, I’ll reuse the machine specs that contains the manuals that are in OneLake and our policies from SharePoint. If I want to create a new knowledge source, I can choose supported stores in this list. For example, if I choose blob storage, I just need to point at the storage account and container, and Foundry IQ will pull all the documents and handle the chunking, vectorization, and everything needed to make it ready to use. We’ll leave things as is for now. Instead, something really cool is how we also support MCP servers as knowledge sources. Let’s create a quick one. Let’s say we want to pull software issues from GitHub. All I need to do is point it to the GitHub MCP server address and set search_issues as the tool name. At this point, I’m all set, and I just need to save my changes. If data needs to be indexed for some of my knowledge sources, that will happen in the background, and indexes are continually updated with fresh information. - And to be clear, this is hiding a ton of complexity, but how do we know it’s actually working better than previous ways for retrieval? - Well, as usual, we’ve done a ton of work on evaluations. First, we measured whether the agentic approach is better than just searching for all the sources and combining the results. 
In this study, the grey lines represent the various data sets we used in this evaluation, and when using query planning and iterative search, we saw an average 36% gain in answer score as represented by this green line. We also tested how effective it is to combine multiple private knowledge sources and also a mix of private sources with web search where public data can fill in the gaps when internal information falls short. We first spread information across nine knowledge sources and measured the answer score, which landed at 90%, showing just how effective multi-source retrieval is. We then removed three of the nine sources, and as expected, the answer score dropped to about 50%. Then, we added a web knowledge source to compensate for where our six internal sources were lacking, which in this case was publicly available information, and that boosted results significantly. We achieved a 24-point increase for low retrieval reasoning effort and 34 points for medium effort. Finally, we wanted to make sure we only iterate if it’ll make things better. Otherwise, we want to exit the agentic retrieval loop. Again, under the covers, Foundry IQ uses two models to check whether we should exit, a fine-tuned SLM to do a fast check with a high bar, and if there is doubt, then we’ll use a full LLM to reassess the situation. In this table, on the left, we can see the various data sets used in our evaluation along with the type of knowledge source we used. The fast check and the full check columns indicate the number of times as a percentage that each of the models decided that we should exit the agentic retrieval loop. We need to know if it was a good idea to actually exit. So the last column has the answer score you would get if you use the minimal retrieval effort setting, where there is no iteration or query planning. If this score is high, iteration isn’t needed, and if it’s low, iteration could have improved the answer score. 
You can see, for example, in the first row, the answer score is great without iteration. Both fast and full checks show a high percentage of exits. In each of these, we saved time and tokens. The middle three rows are cases where the fast check defers to the full check, and the full check predicts that we should exit at reasonably high percentages, which is consistent with the relatively high answer scores for minimal effort. Finally, the last two rows show both models wanting to iterate again most of the time, consistent with the low answer score you would’ve seen without iteration. So as you saw, the exit assessment approach in Foundry IQ orchestration is effective, saving time and tokens while ensuring high quality results. - Foundry IQ then is great for connecting the dots then across scattered information while keeping your agents simple to build, and there’s no orchestration required. It’s all done for you. So, how can people try Foundry IQ for themselves right now? - It’s available now in public preview. You can check it out at aka.ms/FoundryIQ. - Thanks so much again for joining us today, Pablo, and thank you for watching. Be sure to subscribe to Microsoft Mechanics for more updates, and we’ll see you again soon.
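The two-model exit assessment described in the transcript (a fast check with a high bar, then a full check only when there is doubt) can be sketched as a small cascade. This is an illustration under stated assumptions, not how Foundry IQ implements it: `should_exit`, the `high_bar` threshold of 0.9, and the toy scoring functions are hypothetical, and both models are stood in for by plain Python functions.

```python
from typing import Callable


def should_exit(
    evidence: list,
    fast_check: Callable,   # hypothetical stand-in for the SLM: evidence -> confidence in [0, 1]
    full_check: Callable,   # hypothetical stand-in for the full LLM: evidence -> bool
    high_bar: float = 0.9,  # assumed threshold; the real bar is not stated in the source
):
    """Return (exit_decision, which_check_decided)."""
    # Cheap path: exit immediately only when the fast check is very confident.
    if fast_check(evidence) >= high_bar:
        return True, "fast"
    # Doubtful case: pay for the more expensive reassessment.
    return full_check(evidence), "full"


# Toy scoring: pretend four pieces of evidence fully answer the question.
def toy_fast(evidence):
    return min(1.0, len(evidence) / 4)


def toy_full(evidence):
    return len(evidence) >= 3


print(should_exit(["a", "b", "c", "d"], toy_fast, toy_full))  # (True, 'fast')
print(should_exit(["a", "b", "c"], toy_fast, toy_full))       # (True, 'full')
print(should_exit(["a"], toy_fast, toy_full))                 # (False, 'full')
```

The design point the sketch tries to capture is cost: the expensive check runs only when the cheap one cannot clear its bar, so easy questions exit early and hard ones still get a careful second look.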