Evaluating the Evaluator: How to Test an LLM Judge with Microsoft Agent Framework
The four verdicts, up front

- Consistency: mean CV across posts 5.30%
- Pipeline format checks: pipeline pass rate 100%
- Rubric adherence (strict judge): 5.00 / 5, mean math drift 0.05 pts
- Calibration vs. labels: Pearson r = 0.51, MAE = 22.9 pts

Three of those say the model is healthy. The last one is the only one that compared the model against anything real, and it tells a different story.

Where we left off

In Post 1 I built Viral or Fail, three Microsoft Agent Framework agents that pressure-test a gaming social post before you publish it. A Content Creator drafts the post, an Algorithm Simulator scores it the way a recommendation system might, and an Audience Persona reacts the way an actual viewer would. The whole thing runs on the GitHub Models free tier, with no paid API keys.

That post ended on a cliffhanger I left deliberately open. The Algorithm Simulator scored the post 75/100, but how do I know the Simulator itself is any good? How consistent are its scores? Do they track real engagement? Would a human social strategist agree with its rubric weights?

This post answers that empirically. I built four tests: consistency, pipeline format checks, rubric adherence, and calibration. Three came back healthy. The fourth caught a problem structural enough that it changed how I think about evaluating LLM judges in general. The surprising part isn't that the model failed somewhere but that it passed the three tests you naturally reach for first, and only failed the one most people will skip.

Why I built my own harness

The Microsoft Agent Framework ships a real evaluation surface. You get evaluate_agent, LocalEvaluator, an @evaluator decorator, and the EvalItem / EvalResults data types. It's well designed, and for production agents it's the right choice. It also pairs most naturally with Azure AI Foundry. The path of least resistance assumes you already have an Azure project, a model deployment, and the budget for cloud-tier LLM-as-judge calls.
Post 1 went the other way on purpose: zero paid keys, GitHub Models free tier only. To keep that footing, I wrote a small in-house harness that mirrors the call shape of evaluate_agent. The framework's evaluation surface is provider-agnostic in principle, but it leans toward Azure in practice. What the SDK hands you for free on Azure, you can rebuild for yourself on GitHub Models, as you'll see shortly, and the patterns transfer directly when you upgrade.

The harness is one file, roughly 150 lines. The trick that makes it more than a wrapper is that it tries to import the SDK's primitives first and only defines its own if they aren't there yet:

    from dataclasses import dataclass, field

    try:
        from agent_framework import EvalItem, EvalResults, evaluator
        _USING_SDK_PRIMITIVES = True
    except ImportError:
        # agent-framework-core==1.0.0rc1 doesn't ship these yet,
        # so we define local equivalents with the same shape.
        _USING_SDK_PRIMITIVES = False

        @dataclass
        class EvalItem:
            query: str
            response: str
            expected_output: str | None = None
            scores: dict[str, float] = field(default_factory=dict)
            repetition: int = 0

        # ... EvalResults, evaluator defined the same way

The day Microsoft ships these types, the suite picks them up with no code change. An evaluator looks like this:

    @evaluator
    def correlates_with_truth(response: str, expected_output: str | None) -> float:
        sim = parse_weighted_total(response)
        if sim is None or expected_output is None:
            return 0.0
        truth = float(expected_output)
        # Absolute error on the 0-100 scale, flipped so 1.0 is a perfect match.
        return 1.0 - (abs(sim - truth) / 100.0)

If you've used the SDK's @evaluator, you've used this one. Same parameter-name dispatch (query, response, expected_output), same return convention (a float in [0, 1]).

The runner wraps a retry-aware async loop around a list of these. GitHub Models caps this model at about 15 requests per minute, so the loop sleeps 4.5 seconds between calls (at most 60 / 4.5 ≈ 13 a minute, and about 12 in practice, comfortably under the cap). When it does hit a 429 it waits 30 seconds and up, rather than the short exponential backoff it uses for ordinary transient failures.
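The pacing and retry behavior described above can be sketched roughly as follows. This is a minimal illustration rather than the harness's actual code: `RateLimitError`, `call_with_retry`, and `run_paced` are hypothetical names, the linear growth of the 429 wait (30 s, 60 s, ...) is just one assumed reading of "30 seconds and up", and the `sleep` parameter exists only so the delays can be faked in a test.

```python
import asyncio


class RateLimitError(Exception):
    """Hypothetical stand-in for whatever the model client raises on HTTP 429."""


async def call_with_retry(call, *, max_attempts=5, rate_limit_wait=30.0,
                          base_backoff=1.0, sleep=asyncio.sleep):
    """Await call(), retrying with a long wait on 429s and a short
    exponential backoff on ordinary transient failures."""
    for attempt in range(1, max_attempts + 1):
        try:
            return await call()
        except RateLimitError:
            if attempt == max_attempts:
                raise
            # Assumed policy: 30 s on the first 429, then 60 s, 90 s, ...
            delay = rate_limit_wait * attempt
        except (TimeoutError, ConnectionError):
            if attempt == max_attempts:
                raise
            # Short exponential backoff: 1 s, 2 s, 4 s, ...
            delay = base_backoff * 2 ** (attempt - 1)
        await sleep(delay)
    raise ValueError("max_attempts must be at least 1")


async def run_paced(calls, *, pace_seconds=4.5, sleep=asyncio.sleep):
    """Run zero-argument async callables one at a time, sleeping
    pace_seconds between them to stay under a requests-per-minute cap."""
    results = []
    for index, call in enumerate(calls):
        if index:
            await sleep(pace_seconds)
        results.append(await call_with_retry(call, sleep=sleep))
    return results
```

Keeping the two failure classes on separate delay schedules is the design point: a 429 means the quota window has to roll over, so retrying after a second or two would only burn attempts.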
Boring glue code, and important glue code. When you eventually move to Azure, you swap runner.run(...) for evaluate_agent(...) and nothing else in your codebase has to change.

What 'good' even means for a judge

Before running anything, it's worth being precise about what "good" even means for a judge agent. There are four versions of it, and they split into two camps.

The first three are process checks. They probe the model against itself. No external reference data, just the model and its own outputs.

Consistency means same input, same output. Run the Simulator twice on the same post and the scores should land in roughly the same place. If they don't, the score is noise.

Pipeline format checks ask whether each agent followed its required output shape. Did the Creator produce platform-native text? Did the Simulator emit a parseable weighted total? Did the Persona stay in character? These are the cheapest tests of all, just regex and keyword matching, no LLM judge needed.

Rubric adherence is harder. The Simulator's prompt asks it to score five weighted criteria and report a weighted total. Did it actually do that, or did it list the criteria and then invent a number? Checking this needs an LLM. The cloud-tier equivalent is FoundryEvals.TaskAdherence, and I'll build the free-tier version below.

The fourth check is a different animal: validation against ground truth.

Calibration asks whether the Simulator's scores correlate with real engagement. It's the same operation you'd run on any predictive model: predict, compare against a labeled set, report correlation and error. It's the only check that tells you the model is correct rather than merely consistent and well-formatted, and it's the only one that needs data the model didn't produce.

That's the thesis of this post, and the reason the order matters. The three process checks can all come back green and still tell you nothing about the validation result.
And because validation needs ground truth, the design of the ground-truth dataset becomes part of the result. I'll be explicit about that when we get there.

The posts under test

Every test runs against the same thing: a 10-post golden dataset of gaming social posts I wrote and hand-labeled. Each entry carries the post content, its real-world engagement numbers, a normalized engagement_score from 0 to 100, and a label (viral, decent, flop, or outlier). Here's the viral Valorant post that Test 1 keeps referencing, in full:

    {
      "id": "post_001",
      "platform": "Twitter/X",
      "topic": "Valorant Champions 2025",
      "content": "EG winning Champions 2025 was the most underrated moment in Valorant esports history and people still don't talk about it enough.\n\nDemon1 carried that grand final on a level we won't see again until at least Champions '26. The map veto into Bind alone deserves a documentary.",
      "real_engagement": {
        "impressions": 2100000,
        "likes": 45000,
        "shares_or_retweets": 8000,
        "replies_or_comments": 1200,
        "engagement_rate_pct": 2.58
      },
      "engagement_score": 82,
      "label": "viral",
      "notes": "Hot take + esports nostalgia + named callout (Demon1) drove QRTs from competing fanbases."
    }

The full set is in the repo at evals/golden_dataset.json: two viral hits, four decent posts, three flops, and one outlier, across Twitter/X, TikTok, YouTube, and Instagram.

Test 1: Consistency

The easiest test to write. Run the Simulator ten times on the same post with identical input. Compute the mean, standard deviation, and coefficient of variation. Repeat across five posts spanning viral, decent, flop, and outlier labels.

The harness call is a single call:

    runner = EvalRunner(rate_limit_sleep=4.5)  # about 12 RPM, under the cap
    results = await runner.run(
        agent=agent,
        queries=[_build_simulator_prompt(p) for p in selected],
        evaluators=[weighted_total_score],  # parses the score out of each run
        num_repetitions=NUM_REPETITIONS,  # 10
    )

Fifty Simulator runs in total.
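One way to reduce those fifty runs to a single consistency figure is sketched below: a coefficient of variation per post, then the mean of those. The function name `mean_cv` and the `(post_id, score)` input shape are hypothetical, and the use of the sample standard deviation (rather than the population one) is an assumption; the post doesn't say which the suite uses.

```python
from collections import defaultdict
from statistics import mean, stdev


def mean_cv(runs):
    """Return ({post_id: CV in percent}, mean CV in percent).

    runs is an iterable of (post_id, score) pairs, one per Simulator run.
    Each post needs at least two runs and a non-zero mean score.
    """
    by_post = defaultdict(list)
    for post_id, score in runs:
        by_post[post_id].append(score)

    # CV = standard deviation / mean, expressed as a percentage.
    per_post = {
        post_id: stdev(scores) / mean(scores) * 100
        for post_id, scores in by_post.items()
    }
    return per_post, mean(per_post.values())
```

Averaging per-post CVs, instead of pooling every score into one standard deviation, keeps the genuine differences between posts from being counted as run-to-run noise.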
Group by query, compute std/mean per post, then average the resulting CVs.

Mean coefficient of variation across the five posts: 5.30%. With the rubric pinned, the Simulator is meaningfully non-deterministic, but it isn't chaotic. Most scores cluster within about four points of the per-post mean. That's the headline, and it's fine.

Now look at the chart again. post_001 (the viral Valorant Champions post, mean 70.3) and post_003 (the decent Steam Deck OLED post, mean 72.4) sit at almost the same place on the x-axis. The decent post averages slightly higher than the viral one. Across ten reps each. Twenty data points, and the Simulator can't reliably tell which one is supposed to be the success. If you trace the mean diamonds left to right, the decent post outranks both viral posts.

A consistency test won't flag this as a problem, because the Simulator is being consistent. It consistently rates these two posts in the same band. The problem is what that band means. If consistency were your only check, you'd close the laptop and ship. Hold onto that. It comes back.

Test 4: Pipeline format checks

Now zoom out from a single agent and run the full Viral or Fail pipeline (Creator, then Simulator, then Persona) on five live trending gaming topics, applying format-level checks to each agent's output.

The checks are deliberately cheap. For the Creator: does the output contain Twitter/X-native vocabulary (the keyword list looks for things like thread, ratio, QRT, take, based)? For the Simulator: is there a parseable weighted total between 0 and 100? For the Persona (TryHard_Tyler, the competitive esports fan, in this run): does the output use any of the persona's keywords, like diff, cope, goated, ratio, cap?

Five topics, fetched live from Google Trends: xbox game pass, the hobbit mtg collector booster, crimson desert patch notes, xbox, olden era steam.

Per-agent pass rate: Creator 100%, Simulator 100%, Persona 100%. Pipeline pass rate 100%.
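The three checks described above can be sketched in a few lines. This is an illustrative version, not the suite's code: the function names are hypothetical, the keyword tuples contain only the examples quoted in the text, whole-word matching is an assumption (so that "cap" does not fire on "capture"), and the score check only recognizes an "N/100" token.

```python
import re

# Only the example keywords quoted in the post; the real lists may be longer.
CREATOR_KEYWORDS = ("thread", "ratio", "qrt", "take", "based")
PERSONA_KEYWORDS = ("diff", "cope", "goated", "ratio", "cap")

_SCORE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*100")


def has_keyword(text, keywords):
    """True if any keyword appears as a whole word, case-insensitively."""
    words = set(re.findall(r"[a-z0-9']+", text.lower()))
    return any(keyword in words for keyword in keywords)


def total_in_range(text):
    """True if the text contains an 'N/100' score with 0 <= N <= 100."""
    match = _SCORE_RE.search(text)
    return match is not None and 0 <= float(match.group(1)) <= 100


def pipeline_passes(creator_text, simulator_text, persona_text):
    """The pipeline passes only if every agent passes its own format check."""
    return (
        has_keyword(creator_text, CREATOR_KEYWORDS)
        and total_in_range(simulator_text)
        and has_keyword(persona_text, PERSONA_KEYWORDS)
    )
```

Nothing here looks at whether the score is sensible, only at whether one is present and in range, which is exactly the limit of a format check.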
The format checks are doing their job. Every agent produced output in the shape it was supposed to, on every topic. No regex misses, no missing weighted totals, no out-of-character personas.

This is the point where, if you'd only run consistency and pipeline checks, you'd write the triumphant report. "Our agents are reliable. CV under 6%. Pipeline pass rate 100%. Ship it." That report would be true. It would also be wrong about whether the model is correct, because format adherence is not output validity. Keep going.

Test 3: Rubric adherence, and a free-tier LLM-as-judge

Format checks tell you what the output looks like. Rubric adherence asks whether the Simulator actually did the work it was prompted to do: score five weighted criteria, sum them correctly, and explain each score with platform-mechanic reasoning rather than vibes.

There's no regex for that. You need an LLM to read the Simulator's full evaluation and judge whether it followed its own rubric. That's an LLM-as-judge, and the cloud-tier equivalent is FoundryEvals.TaskAdherence on Azure. Since we're staying free, I built it.

The judge is just another Agent with a stricter system prompt:

    JUDGE_SYSTEM_PROMPT = """You are a Rubric Adherence Judge — strict and skeptical.

    You evaluate whether another AI agent ACTUALLY followed its scoring rubric,
    not just whether it produced output that looks like it did.

    You will check three things, in order of severity (the strictest failing
    check sets the score):

    A. MATHEMATICAL FIDELITY (most important). Compute
       sum(criterion_score × weight) yourself from the agent's per-criterion
       scores. Compare it to the agent's stated WEIGHTED TOTAL. If they differ
       by more than 2 points, the agent is doing the rubric wrong even if it
       looks correct on the surface. Report the difference as `math_diff`.

    B. REASONING SPECIFICITY. Each criterion's justification must reference
       platform-specific algorithm mechanics — "FYP retention threshold",
       "QRT velocity", "average view duration". Generic praise ("strong hook",
       "good engagement") is GENERIC and lowers the score. Classify reasoning
       as "specific", "mixed", or "generic".

    C. COVERAGE. Every criterion in the rubric must be explicitly scored.
       Missing criteria fail this check.

    ...

    Be strict. Format-following ≠ rubric-following."""

The full prompt is in the repo. The key decision is point A: the judge recomputes the math itself. That catches the failure where an agent lists every criterion with a score, but the weighted total it reports doesn't actually equal the weighted sum. That kind of quiet drift is exactly what format checks miss.

The judge returns strict JSON: adherence_score (1 to 5), math_diff, reasoning_quality, criteria_present, missing_criteria, weight_drift, plus a few sentences of reasoning.

Test 3 doesn't go through runner.run; it orchestrates the two agents by hand, one post at a time, so the judge sees the Simulator's full evaluation:

    for post in posts:
        sim_text = await call_agent_with_retry(simulator, build_simulator_prompt(post))
        verdict = await judge.judge(
            rubric=PLATFORM_RULES[post["platform"]],
            post_content=post["content"],
            evaluation_output=sim_text,
        )

Run across all 10 posts in the golden dataset, here's what comes back.

- Mean adherence score: 5.00 / 5
- Mean absolute math drift: 0.05 points (max 0.25)
- Reasoning quality classified "specific" on 100% of evaluations
- Zero missing criteria, zero weight drift

This was not the result I expected. I built the judge to be strict on purpose, after my first version turned out too lenient (more on that in the bugs section). The strict version recomputes the weighted sum, classifies generic praise as a failure, and demands platform-mechanic citations. The Simulator passed every dimension anyway.

The per-post reasoning is genuinely fun to read.
On the Activision Blizzard flop, the judge noted that the Simulator's reasoning leaned on engagement velocity, quote-retweet incentives, topicality timing, and hashtag discoverability rather than generic praise. On the GTA 6 viral TikTok, it cited pattern interrupts, trending-cluster signals, and share-velocity drivers. That's the language I asked for, and the Simulator is producing it.

So the Simulator does the rubric correctly. The math is right, the reasoning is specific, every criterion is covered. By every internal measure, it works.

You can probably see where this is going. There's exactly one thing left to check, and it's the most expensive and most important one.

Test 2: Calibration, the reckoning

This one isn't a test in the same sense as the first three. They asked whether the model was malfunctioning. This asks whether it's correct, which is a different question entirely, because it's the only one that needs data the model didn't produce.

And because it's a validation, what I validate against matters as much as the model. So before running anything, here's exactly what the ground truth is: a 10-post golden dataset that I built, not measured. I wrote the post content myself in platform-native style, then assigned each post an engagement_score from a back-of-envelope formula (impressions × engagement rate × shareability), calibrated against publicly observable performance for similar posts. The set spans two viral hits, four decent posts, three flops, and one deliberate outlier (a post that got ratio'd into orbit, with high reach and terrible reception).

So when I show you a Pearson r in a moment, hold it loosely. The exact number is partly a function of how I designed the labels. The shape of the failure (whether the Simulator's predictions cluster, spread, invert, or track the labels) is what's actually informative, because the shape doesn't depend on the labels being precise.
It only depends on them being roughly ordered: viral out-ranks decent, decent out-ranks flop. Whether viral is 91 against flop 18, or viral 85 against flop 25, doesn't change which way the comparison runs.

With that on the table: run the Simulator once per post, compute Pearson r and Spearman rho, compute MAE.

Pearson r = 0.51. Spearman rho = 0.52. MAE = 22.9 points.

That r-value isn't a small problem. Here's what it means in practice, post by post:

| Post | Topic | Label | Truth | Simulator | Error |
|---|---|---|---|---|---|
| 001 | Valorant Champions 2025 | viral | 82 | 69.75 | 12.25 |
| 002 | GTA 6 reveal reaction | viral | 91 | 65.75 | 25.25 |
| 003 | Steam Deck OLED price | decent | 55 | 71.00 | 16.00 |
| 004 | Genshin Impact 5.0 pulls | decent | 48 | 65.00 | 17.00 |
| 005 | Hollow Knight: Silksong | decent | 60 | 76.50 | 16.50 |
| 006 | Xbox Showcase 2025 | decent | 42 | 74.75 | 32.75 |
| 007 | Activision Blizzard acquisition | flop | 18 | 59.50 | 41.50 |
| 008 | 5 games to play this weekend | flop | 22 | 37.00 | 15.00 |
| 009 | Pentiment retrospective | flop | 15 | 60.75 | 45.75 |
| 010 | Concord shutdown post-mortem | outlier | 50 | 57.00 | 7.00 |

The pattern is structural. The Simulator's natural output band is roughly 60 to 76. Posts that should clear 80 get pulled down to 65 to 70. Posts that should land below 25 get pulled up toward 60, with one flop (post_008, the "5 games this weekend" listicle) the only exception at 37. The model has an attractor zone in the middle of the scale and refuses to leave it.

Look at the most accurate prediction in the table. It's post_010, the outlier (truth 50, Simulator 57, error 7). Why is it the most accurate? Because 50 happens to sit inside the attractor zone. The Simulator's bias accidentally cancels out for posts that are supposed to be average. It isn't accurate; it's wrong in a way that lands near the truth for one specific case.

This was the test I almost didn't run. It needs labeled data, which is annoying to gather, and three out of four tests had already declared the model healthy. By every internal measure, the Simulator was working as designed.
It just couldn't tell viral from decent. It rated the GTA 6 reaction TikTok (truth 91) at 66, and the Steam Deck OLED post (truth 55) at 71. The model is consistent, rubric-faithful, and format-stable, and on real cases it literally inverts virality and decentness.

The shape of that failure (flops pulled up hard, by 15, 42, and 46 points; virals pulled down by 12 and 25; the whole range collapsed into a narrow band) is what survives the synthetic-label uncertainty. If the labels were simply inaccurate, you'd see scatter. A symmetric squeeze toward the middle requires the Simulator itself to be conservative. The Pearson r of 0.51 (p around 0.13, not significant on n = 10) is the number to hold loosely. The squeeze is the result. Running this against measured engagement metrics is the natural Post 3, and I'd expect the qualitative finding to hold.

Bugs the suite caught along the way

This is something I want to keep doing in my write-ups. I usually publish the clean, glamorous version (here's what I built, here's what I learned, the end), which quietly erases the bugs that taught me the most about how the system actually behaves. So here are three real ones the eval suite caught while I was running it.

The production parser regex was silently failing. Post 1's viral_or_fail.py extracts the Simulator's weighted total with a regex like Weighted\s*Total[^\n]+. That works for same-line layouts (Weighted Total: 73/100). It does not work for the multi-line layout the model produces about half the time:

    **WEIGHTED TOTAL:**
    = 22.5 + 15 + 14 + 12.75 + 9
    = **73.25/100**

When the regex misses, the production code silently falls back to a default of 50. Which means the public Viral or Fail demo had been quietly showing readers 50/100 on many of its runs since Post 1 went live. The eval suite caught it on the very first call: parse_weighted_total returned None, the harness logged it loudly, and the bug had nowhere to hide.
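To make the failure concrete, here is a small reproduction under stated assumptions. The header pattern is the one quoted above; the case-insensitive flag, the `N/100` extraction step, the helper names, and the exact line breaks in `multi_line` are all guesses for illustration, not the code in viral_or_fail.py.

```python
import re

# Header pattern as quoted in the post; IGNORECASE is an assumption.
SAME_LINE_RE = re.compile(r"Weighted\s*Total[^\n]+", re.IGNORECASE)
SCORE_RE = re.compile(r"(\d+(?:\.\d+)?)\s*/\s*100")
DEFAULT_SCORE = 50.0  # the silent fallback described in the post


def naive_total(text):
    """Look for 'N/100' only on the same line as the header."""
    header = SAME_LINE_RE.search(text)
    if header is None:
        return None
    score = SCORE_RE.search(header.group(0))
    return float(score.group(1)) if score else None


def displayed_score(text):
    """What a caller shows if it swallows a parse miss instead of logging it."""
    total = naive_total(text)
    return DEFAULT_SCORE if total is None else total


same_line = "Weighted Total: 73/100"
# Assumed line breaks for the multi-line layout.
multi_line = "**WEIGHTED TOTAL:**\n= 22.5 + 15 + 14 + 12.75 + 9\n= **73.25/100**"

print(displayed_score(same_line))   # parsed from the header line
print(displayed_score(multi_line))  # header line has no number, so the default wins
```

Because `[^\n]+` stops at the first newline, the match on the multi-line layout ends before any digit appears, and the fallback hides the miss.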
The fix strips the bold markers, finds the header, then scans a few non-blank lines past it, preferring N/100, then a trailing = N, then the first number it sees:

    clean = response.replace("**", "")
    header = _WT_HEADER_RE.search(clean)  # r"Weighted\s*Total\s*:?"
    if not header:
        return None
    after = clean[header.end():]
    window = []
    for raw in after.splitlines()[: _WT_LOOKAHEAD_LINES + 1]:
        line = raw.strip()
        if not line and window:
            # A blank line after we've collected something ends the block.
            break
        if line:
            window.append(line)
    blob = " ".join(window)
    # prefer "N/100", then a trailing "= N", then the first number found

That regex hunt alone justified the whole exercise.

The Google Trends "Games" topic is contaminated. Test 4 originally fetched live trending topics and got back "kentucky derby 2026", "kentucky oaks", and "fanduel" alongside the actual gaming. The cause: Google's taxonomy bundles horse racing, gambling, and sportsbooks under the same Games topic ID it uses for video games, and the trends_tool.py filter from Post 1 was matching on that topic ID alone. The fix was a two-layer filter: require the games topic and not topic 17 (Sports), plus a small denylist for gambling keywords. Now the results come back as xbox game pass, crimson desert, and the hobbit mtg collector booster, with no horse racing.

The first version of the judge was too lenient. My initial RubricAdherenceJudge rewarded "every criterion explicitly scored." But the Simulator's system prompt forces exactly that, so the judge handed out 5/5 trivially across all 10 posts and told me nothing. I tightened it to recompute the weighted sum and report math_diff, and to classify reasoning as specific, mixed, or generic based on whether justifications cite platform mechanics. Even under the strict judge the Simulator still scored 5/5, but now I'd earned that result instead of getting it for free.

Why this matters in production

I built four tests to evaluate the Algorithm Simulator from Post 1.
Three of them (consistency, rubric adherence, pipeline format checks) declared it healthy. The fourth, calibration, compared its scores against labeled engagement and found systematic bias: the predictions are squashed into a narrow band regardless of how the post actually performed. A flop with engagement of 18 gets a 60. A viral hit with engagement of 91 gets a 66. The model isn't broken in any visible way. It's just consistently, faithfully, formally wrong.

That's exactly why validation against ground truth isn't optional. It's the only check that catches a model doing everything right except being correct. Format, consistency, and rubric-coverage tests tell you the model isn't malfunctioning. They cannot tell you it's correct. They test the process, and only validation tests the output. A model can have a flawless process and still produce numbers that don't track reality.

Now zoom out. Viral or Fail's Simulator is low stakes. Worst case, a creator publishes a post the Simulator liked and it flops. Embarrassing, not dangerous. The same failure at higher stakes is dangerous, and the same shape shows up everywhere in production AI. Ask a language model to be "objective" and it hedges toward the middle.

- Content moderation agents under-flag clearly harmful content and over-flag clearly benign content, because both extremes feel risky to the model.
- Resume screeners compress every candidate into a 60-to-80 band and call the lack of spread "fairness."
- Code-review bots return a comfortable 7/10 on a PR with real problems and on a PR with none.
- Support routing labels almost everything "medium priority" and quietly breaks the downstream automation that relied on the signal meaning something.

Each of those has shipped in real deployments and then underperformed for months before anyone noticed. The teams weren't careless. They had observability, CI, process checks. What they lacked was a labeled validation set.
And without one, a confidently miscalibrated model looks identical to a working one. A model that's wrong randomly gets caught, because outliers get flagged and reviewed. A model that's wrong consistently gets trusted, because it never trips an alarm. Once a downstream product depends on the miscalibrated output, the bias gets amplified at scale.

Most production AI systems are not validated this way. Most LLM-as-judge components in agentic systems have never had their predictions compared against any external ground truth at all. And when something does feel off, teams reach for fine-tuning. But you can't fine-tune what you haven't characterized, and characterization is exactly what calibration testing produces. Without it, fine-tuning is guesswork in an engineering costume. "It works in eval" usually means it passed process checks, which is not the same thing as working.

So evaluation is a discipline, not a phase. It belongs in the same loop as deployment, not as a one-off before launch. Internal-process checks belong in CI. Validation against labels belongs on a schedule. Both should alert when they regress, and both should be visible to the people accountable for the model's decisions.

If there's one thing to take from this post: build a validation step into your eval suite from day one, even with synthetic labels, and especially if you can't get measured ones yet. Process tests keep you safe from regressions. Only the validation step keeps you honest about whether the model is right.

What's next: the cloud-tier upgrade path

Everything here runs on the GitHub Models free tier. That's deliberate, and it also means I've built the free-tier version of three things Microsoft already does better at production scale.

The first is FoundryEvals in agent_framework_azure_ai. My RubricAdherenceJudge is a homemade FoundryEvals.TaskAdherence.
Foundry's version uses Azure-hosted judges on a managed pipeline, with calibration handled internally and a portal for tracking runs over time. Same structural test, but operationally serious. The same idea applies to Relevance, Coherence, Groundedness, IntentResolution, and the rest of the catalogue. If you've built the harness from this post, swapping it for evaluate_agent plus FoundryEvals is mostly an import change.

The second is the AI Red Teaming Agent. I didn't run any safety evaluation in this suite. The Audience Persona is the agent most likely to drift into unsafe territory, and the natural counterpart to quality evaluation is adversarial probing with PyRIT. The AI Red Teaming Agent wires that straight into Foundry. That's a Post 4.

The third is observability. DevUI gives you real-time visualization of agent sessions, and OpenTelemetry traces flow into Azure Monitor. Both earn their keep when an eval flags a regression and you need to walk back through the failing run to find the cause.

And then there's Post 3: the calibration test against real engagement data. If you have a Twitter, YouTube, or TikTok dataset with both post content and post-hoc engagement metrics, and you'd be open to collaborating, I'd love to hear from you.

The full eval suite is on GitHub: github.com/HamidOna/viral-or-fail. Run pip install -r requirements.txt, set GITHUB_TOKEN, and run python -m evals.run_all. Six to eight minutes start to finish on the free tier. The suite runs, the JSONs write, the plots render, and you'll see the same thing I did: the easy tests will tell you everything is fine. The last test will tell you what's actually happening.

A Recap of the Build AI Agents with Custom Tools Live Session
Artificial Intelligence is evolving, and so are the ways we build intelligent agents. On a recent Microsoft YouTube Live session, developers and AI enthusiasts gathered to explore the power of custom tools in AI agents using Azure AI Studio. The session walked through concepts, use cases, and a live demo that showed how integrating custom tools can bring a new level of intelligence and adaptability to your applications.

🎥 Watch the full session here: https://www.youtube.com/live/MRpExvcdxGs?si=X03wsQxQkkshEkOT

What Are AI Agents with Custom Tools?

AI agents are essentially smart workflows that can reason, plan, and act, powered by large language models (LLMs). While built-in tools like search, calculator, or web APIs are helpful, custom tools allow developers to tailor agents for business-specific needs. For example:

- Calling internal APIs
- Accessing private databases
- Triggering backend operations like ticket creation or document generation

Learn Module Overview: Build Agents with Custom Tools

To complement the session, Microsoft offers a self-paced Microsoft Learn module that gives step-by-step guidance.

Key Learning Objectives:

- Understand why and when to use custom tools in agents
- Learn how to define, integrate, and test tools using Azure AI Studio
- Build an end-to-end agent scenario using custom capabilities

Hands-On Exercise: The module includes a guided lab where you:

- Define a tool schema
- Register the tool within Azure AI Studio
- Build an AI agent that uses your custom logic
- Test and validate the agent's response

Highlights from the Live Session

Here are some gems from the session:

- Real-World Use Cases: automating customer support, connecting to CRMs, and more
- Tool Manifest Creation: learn how to describe a tool in a machine-understandable way
- Live Azure Demo: see exactly how to register tools and invoke them from an AI agent
- Tips & Troubleshooting: best practices and common pitfalls when designing agents

Want to Get Started?
If you're a developer, AI enthusiast, or product builder looking to elevate your agent's capabilities, custom tools are the next step. Start building your own AI agents by combining the power of:

- Microsoft Learn Module
- YouTube Live Session

Final Thoughts

The future of AI isn't just about smart responses; it's about intelligent actions. Custom tools enable your AI agent to do things, not just say things. With Azure AI Studio, building a practical, action-oriented AI assistant is more accessible than ever.

Learn More and Join the Community

Learn more about AI Agents with the https://aka.ms/ai-agents-beginners Open Source Course and Building Agents.

Join the Azure AI Foundry Discord Channel. Continue the discussion and learning: https://aka.ms/AI/discord

Have questions or want to share what you're building? Let's connect on LinkedIn or drop a comment under the YouTube video!

Agent vs. Workflow in Copilot Studio - Which One Do I Actually Need?
Hey everyone! 👋 Raise your hand if this has happened to you...

You open Copilot Studio for the first time, you're excited, you're ready to build, and then the very first screen asks you:

"What would you like to build?"
[ Agent ] [ Workflow ]

And your brain just goes blank. 😅 Which one? What's the difference? Does it even matter which I pick?

I've been there. I picked randomly, built halfway through, and then realized I probably chose the wrong one. So I put together this quick breakdown to save you that frustration!

The One-Line Answer

Agent = Conversation. Workflow = Automation.

That's the core of it. But let me unpack what that actually means in practice.

Let's Break It Down Simply

🤖 Choose an Agent when...

Your tool needs to talk to people and actually understand what they're saying. An Agent is like a smart assistant that:

- Chats with users in a natural, back-and-forth way
- Pulls answers from your knowledge sources like PDFs, SharePoint, or websites
- Asks follow-up questions to collect and validate information
- Guides users through a process step by step
- Handles all kinds of different questions without breaking

Its whole goal? Understand, assist, and engage the person in front of it.

Real example: A customer types "I need help with my invoice". The Agent reads that, asks the right follow-up questions, and helps them resolve it without any human stepping in.

⚙️ Choose a Workflow when...

You need something to run in the background and get things done, no conversation needed. A Workflow is like a reliable robot that:

- Follows a fixed set of predefined steps every single time
- Performs actions and processes automatically
- Creates or updates records in your systems
- Sends emails and notifications at the right moment
- Connects with Dataverse, Dynamics 365, Outlook, and more
- Just runs quietly and consistently, without anyone needing to interact with it

Its whole goal? Automate, process, and get things done.
Real example: When a new employee is added to the system → automatically create their accounts, send a welcome email, and notify their manager. No one has to lift a finger.

The Simplest Way to Decide

Ask yourself just one question: Does someone need to have a conversation with it?

- Yes → Build an Agent
- No → Build a Workflow

That single question will get you to the right answer 90% of the time.

The Mistake Most Beginners Make

A lot of us (myself included!) jump straight to building an Agent because it sounds more exciting and powerful. But if your process is just a series of fixed steps with no real conversation involved, a Workflow will do the job faster, cleaner, and more reliably.

You don't have to choose just one forever. A really powerful pattern is having your Agent handle the conversation and then trigger a Workflow to do the heavy lifting in the background. Best of both worlds! 🙌

Quick Recap

| | Agent | Workflow |
|---|---|---|
| Best for | Conversation | Automation |
| Talks to users? | Yes | No |
| Follows fixed steps? | Not always | Always |
| Runs in background? | No | Yes |
| Connects to systems? | Can | Yes, natively |

Hope this clears things up! Drop your questions below, especially if you have a specific use case you're trying to figure out. Happy to help you work out which one fits. 😊

Securing AI Agents End-to-End: Connecting Purview DSPM, Agent 365, and the AI Security Dashboard
The Challenge: Organizations deploying Microsoft Copilot and custom AI agents face a critical gap: security visibility is fragmented across data protection, identity governance, and threat detection tools. While Microsoft provides powerful capabilities through Purview Data Security Posture Management (DSPM), Agent 365, and the AI Security Dashboard, practitioners often struggle to understand how these components work together to deliver unified AI security posture management. This blog provides an architectural and operational blueprint for connecting these three pillars into a cohesive security framework that security architects can implement today.

The Three Pillars: Capabilities Overview

Microsoft Purview DSPM for AI

Purview DSPM extends data‑centric security controls to AI interactions. Its key capabilities include:

- Sensitivity labels with EXTRACT usage rights that govern whether AI agents can read and process sensitive content
- Data Loss Prevention (DLP) policies that block or audit AI interactions involving confidential data across Copilot, SharePoint, OneDrive, and Teams
- Comprehensive audit logging that captures AI‑to‑data interactions, including user identity, agent identity, data classification, and the action taken
- Insider Risk Management integration that detects anomalous agent behavior patterns, such as bulk or unusual data access

DSPM operates at the data layer, answering a foundational question: What sensitive information can this agent access, and what is it doing with that data?

Microsoft Agent 365

Agent 365 provides a unified control plane for governing AI agent identity, access, and lifecycle across the Microsoft 365 ecosystem.
Core components include:

- Agent Registry, backed by Entra Agent IDs, providing a unique identity for every Copilot Studio agent, custom agent, and supported third‑party AI integration
- Conditional Access policies that enforce real‑time access controls based on agent identity, user context, device compliance, and risk signals
- Centralized observability, with dashboards showing agent‑to‑agent interactions, agent‑to‑human conversations, and near real‑time telemetry
- Governance workflows that support agent approval, lifecycle management, suspension, and decommissioning

Agent 365 operates at the identity and control layer, answering: Which agents exist, who authorized them, and what access boundaries are enforced?

AI Security Dashboard

The AI Security Dashboard aggregates security signals from Entra, Purview, and Defender to provide a unified risk view across all AI assets. It delivers:

- AI asset inventory, cataloging Copilot instances, custom agents, and third‑party models with associated risk context
- Misconfiguration detection, identifying agents with excessive permissions, missing conditional access policies, or DLP coverage gaps
- Attack path visualization, showing how compromised agents could pivot to sensitive data or escalate privileges
- Integration with Microsoft Security Copilot, enabling natural‑language investigation of AI security risks and incidents

The Dashboard operates at the aggregation and recommendation layer, answering: What is my overall AI security posture, and where should remediation be prioritized?

The Unified Architecture: How Signals Flow End-to-End

Understanding the technical integration requires mapping how identity, data, and security signals flow across these three systems.

Identity Foundation (Microsoft Entra): Every AI agent is assigned a unique Entra Agent ID at creation.
This identity becomes the anchor for all security controls: conditional access policies in Agent 365, audit attribution in Purview, and risk correlation in the AI Security Dashboard. When a Copilot Studio agent is deployed, Entra automatically registers it with Agent 365 and propagates identity metadata to connected security services.

Data Interaction Telemetry (Microsoft Purview): When an agent accesses SharePoint files, reads emails, or queries structured data, Purview captures detailed audit events that include agent identity, user context, data classification labels, and enforcement outcomes. These events flow into Purview’s unified audit log and are accessible through the Compliance portal, Microsoft Graph, and SIEM integrations. Crucially, Purview enforces sensitivity labels with EXTRACT usage rights: if a document is labeled Confidential without EXTRACT permission, the agent’s request is blocked before content reaches the AI model.

Control Plane Enforcement (Agent 365): Agent 365 applies identity‑based governance by evaluating Entra signals and surfaced risk indicators. During policy evaluation, the control plane verifies whether the agent is registered, whether the invoking user satisfies authentication requirements, and whether recent signals (such as DLP violations) warrant blocking execution. Agent 365 also provides observability views that correlate agent activity with security events, helping administrators identify unmanaged or unauthorized (“shadow”) agents.
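The three checks described for policy evaluation can be pictured as a short decision function. This is only a conceptual sketch in Python, not Agent 365 or Entra code: `InvocationContext`, `evaluate_invocation`, and the violation threshold are hypothetical names invented here for illustration.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InvocationContext:
    """Hypothetical snapshot of the signals checked before an agent runs."""
    agent_registered: bool
    user_meets_auth_requirements: bool
    recent_dlp_violations: int


def evaluate_invocation(ctx: InvocationContext, max_dlp_violations: int = 0):
    """Return (allowed, reason), applying the three checks in order."""
    # 1. Unregistered ("shadow") agents are denied outright.
    if not ctx.agent_registered:
        return False, "agent is not registered"
    # 2. The invoking user must satisfy authentication requirements.
    if not ctx.user_meets_auth_requirements:
        return False, "user does not satisfy authentication requirements"
    # 3. Recent risk signals, such as DLP violations, can block execution.
    if ctx.recent_dlp_violations > max_dlp_violations:
        return False, "recent DLP violations warrant blocking"
    return True, "allowed"


if __name__ == "__main__":
    ctx = InvocationContext(True, True, recent_dlp_violations=2)
    print(evaluate_invocation(ctx))
```

Ordering the checks from identity to user to risk signals mirrors the description above: an agent that fails an earlier check never reaches the later ones.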
Aggregated Risk View (AI Security Dashboard): The AI Security Dashboard correlates telemetry from:

- Entra: conditional access decisions, authentication anomalies, and privileged identity usage
- Purview: DLP violations, sensitivity label mismatches, and Insider Risk Management signals
- Defender: threat detections, application posture assessments, and suspicious activity indicators

These signals are correlated by agent identity and time, then surfaced as risk cards with contextual severity and recommended remediation actions. The Dashboard does not replace the underlying tools; instead, it provides a consolidated view that helps teams focus on the most impactful risks.

The diagram below illustrates how identity, data, and threat signals flow across the three AI security pillars.

Figure 1: End‑to‑end AI security architecture. Enforcement happens at the data layer (Purview) and identity layer (Agent 365 via Entra). The AI Security Dashboard aggregates, rather than replaces, underlying security controls.

From Architecture to Action: Telemetry & Enforcement Flow

Understanding architecture is essential, but practitioners need to know when and where enforcement occurs during a real agent invocation. The sequence below illustrates runtime interaction between a user, an AI agent, and the three security pillars.

The Critical Distinction: Two Enforcement Layers

Enforcement occurs at two distinct points in the request lifecycle. First, Microsoft Entra validates agent identity and evaluates conditional access policies before execution begins. If the agent is not registered, if the user fails authentication requirements, or if policy conditions require blocking, execution is denied immediately. Second, when execution is permitted, Purview DSPM enforces data access controls inline. Every attempt to access documents, emails, or structured data is evaluated in real time.
If a document is labeled Confidential without EXTRACT rights, Purview blocks the request and returns no sensitive content to the agent.

Telemetry Generation Across the Stack

Each step produces structured telemetry. Entra logs authentication attempts and policy decisions. Purview records AI interaction audit events, including enforcement outcomes. Agent 365 correlates identity and behavior signals to maintain agent posture and observability. These combined signals are surfaced in the AI Security Dashboard, which correlates activity across time and identity to present prioritized risk insights.

Figure 2: Purview enforces data controls inline, Agent 365 enforces identity and execution controls, and the AI Security Dashboard correlates signals for prioritization.

Practitioner Scenario: Detecting and Blocking Agent Data Exposure

Context: Your organization deploys a custom Copilot Studio agent to summarize sales proposals stored in SharePoint. Several documents contain customer PII labeled "Highly Confidential" with no EXTRACT usage rights granted.

Incident Timeline: Agent Data Exposure Detection → Remediation

Detection

- The agent attempts to access SharePoint files through Microsoft Graph.
- Purview DSPM evaluates sensitivity labels and identifies restricted documents.
- A DLP policy blocks access and logs a violation with full context.
- The audit event appears in the Purview unified audit log within minutes.

Visibility

- Agent 365 flags the blocked interaction in its observability dashboard.
- The AI Security Dashboard surfaces a High‑severity risk card titled “Agent accessing restricted data.”
- Security teams investigate the agent using Security Copilot to determine scope and recurrence.

Remediation

- An administrator applies an Entra conditional access policy to suspend the agent.
- Data permissions are adjusted to restrict access or explicitly grant EXTRACT rights where justified.
- The AI Security Dashboard reflects a reduced risk score once controls are validated.

Outcome: The incident is contained quickly, audit evidence is preserved, and the agent is restored with least‑privilege access, without disrupting legitimate business workflows.

Figure 3: A single DLP violation triggers coordinated detection, investigation, and remediation across Purview, Agent 365, and the AI Security Dashboard within 30 minutes.

Division of Responsibility: What Each Tool Does

| Tool | Primary Function | Key Signals | Enforcement Capability |
|------|------------------|-------------|------------------------|
| Purview DSPM | Data-layer protection and audit | Sensitivity labels, DLP violations, data access patterns | Blocks API calls violating DLP or label policies |
| Agent 365 | Identity and lifecycle governance | Agent registry, conditional access hits, observability telemetry | Denies agent invocation based on Entra policies |
| AI Security Dashboard | Unified risk aggregation | Cross-product signals from Entra, Purview, Defender | No direct enforcement; provides recommendations and prioritization |

Critical Distinction: Enforcement happens at two layers: Purview blocks data access violations, while Agent 365 (via Entra) blocks agent invocation. The Dashboard does not enforce policies but accelerates investigation and remediation by correlating signals that would otherwise require manual analysis across three separate consoles.

Key Takeaways for Practitioners

Agent identity is the integration anchor. Every security control (DLP policies, conditional access, audit logs, risk scoring) relies on Entra Agent IDs. Ensure all agents are properly registered in Agent 365 before production deployment.

Purview enforces at the data layer, Agent 365 at the identity layer. Use both: Purview prevents unauthorized data exfiltration, while Agent 365 prevents unauthorized agent execution. Neither is redundant.

The AI Security Dashboard is for prioritization, not replacement.
Continue using Purview Compliance Portal for detailed DLP investigations and Agent 365 registry for operational monitoring. Use the Dashboard to identify which risks warrant immediate attention.

Audit logs are your ground truth. All three tools consume Purview audit events. Integrate these logs with Microsoft Sentinel or your SIEM for long-term retention and advanced threat hunting.

Shadow agents are your blind spot. Regularly audit the Agent 365 registry against actual AI deployments (Copilot Studio, Azure OpenAI, third-party integrations) to identify unregistered instances.

As AI agents become embedded in everyday work, security teams must move beyond feature‑level understanding and adopt an end‑to‑end enforcement mindset. The combination of Purview DSPM, Agent 365, and the AI Security Dashboard provides the building blocks, but value is realized only when they are implemented as a unified model.

How are you governing AI agents in your environment today? Share your experiences and patterns in the comments, especially where identity, data, and security signals intersect.

Stop Writing Promotional Emails. Build an AI Agent That Does It For You.
Hi everyone 👋

A few weeks ago, I started thinking about how much time businesses still spend writing repetitive promotional emails manually every month. The process is usually the same:

- review customer purchase history
- check active discounts
- write personalized emails
- send them one by one

So I decided to build a simple AI-powered workflow that could automate the entire process. For Edition #003 of my newsletter, I created an AI agent that:

✅ reads customer purchase data
✅ matches category-based discounts
✅ generates personalized promotional emails using AI
✅ sends emails automatically

What I enjoyed most while building this project was seeing how even small personalization details can completely change the customer experience. Instead of sending generic promotions, the workflow creates emails tailored to each customer’s purchases and interests.

In this edition, I shared:

- the real-world use case
- the complete workflow approach
- implementation screenshots
- sample datasets
- GitHub project files
- practical automation tips

📌 View the newsletter

If you enjoy building practical AI automations or exploring real-world AI agent ideas, I think you’ll enjoy this edition. I’d genuinely love to hear your thoughts and learn how others are approaching AI-driven automation in their own projects 🙌

From Idea to Production: Building Microsoft Security Store Advisor with an Agentic SDLC
From AI-assisted coding to Agentic SDLC: Lessons from Microsoft Security Store

If every developer on your team is using AI, why does the team still feel like it's starting from scratch on every feature? In this post, the Microsoft Security Store engineering team shares how we moved beyond one-off AI assists to an Agentic SDLC: a repeatable system where prompts, patterns, and reviews compound into team-wide velocity, quality, and security.

Why Collecting User Feedback on Your AI Agent Actually Matters
Hi everyone,

I see many of us experimenting with AI agents in Copilot Studio and other platforms. Spinning up an agent is now the easy part, but making sure it actually helps users is much harder.

In a short blog, I shared why listening to users should be part of your AI design, not an afterthought. I talk about:

- Using thumbs up/down, comments, and simple surveys
- Turning feedback into a backlog of improvements
- Why this feedback loop is essential for making AI agents truly useful

If you’re building or maintaining AI agents, I’d love your thoughts and experiences.

🔗 Read the blog: Why Collecting User Feedback on Your AI Agent Actually Matters
https://medium.com/@sajeda27/why-collecting-user-feedback-on-your-ai-agent-actually-matters-54deea4fee7b

From AI‑Curious to Agent‑Builder in Microsoft 365 (No Code)
Hi everyone,

I keep getting the same questions in my inbox:

“How do I start learning AI?”
“Can I build an AI Agent without knowing how to code?”

So I put together a simple, beginner-friendly blog focused on Microsoft tools like Copilot and Copilot Studio, perfect for anyone starting from zero.

👉 Check it out here: https://medium.com/@sajeda27/from-ai-curious-to-agent-builder-no-code-required-46f845458a97

If you find it useful, feel free to share it with someone who’s been asking the same questions 🙌

ProvePresent: Ending Proxy Attendance with Azure Serverless & Azure OpenAI
Problem

Most schools use a smart‑card‑based attendance system where students tap their cards on a reader. However, this method is unreliable because students can give their cards to friends or simply tap and leave immediately. Teachers cannot accurately assess real student performance: whether high‑performing students are genuinely attending class, or whether poor performance is due to actual absence. Even if students are physically present in a lecture, teachers still cannot tell whether they are paying attention to the projector or actually learning.

The current workaround is for teachers to override the attendance record by calling each student one by one, which is time‑consuming in large lectures and adds little educational value. It is also only a one‑time check, meaning students can still leave the lecture room immediately afterwards.

We also have many out‑of‑school activities, such as site visits, where the school needs to confirm everyone’s presence promptly at each checkpoint.

This kind of problem isn’t unique to schools. It’s a common challenge for event organizers, where verifying attendee presence is essential but often slow, causing long queues. Organizers usually rely on a few mobile scanners to check in attendees one by one.

Solution

ProvePresent is an AI tool designed to verify attendance and create real‑time challenges for participants, ensuring that attendance records are authentic and that attendees remain focused on the presentation. It uses OTP login with school email.

Check-in and Check-out With a Real‑time QR Code

The code refreshes every 25 seconds, and the presenter can display it on the projector for everyone to scan when checking in at the beginning and checking out at the end of the session.
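One common way to build a code that rotates on a fixed interval is to derive it from a secret and the current time window. The post does not describe how ProvePresent generates its codes, so the Python sketch below is only an illustration: the HMAC construction, the `make_token` and `verify_token` helpers, and the one-window grace period are all assumptions. Only the 25-second interval comes from the text above.

```python
import hashlib
import hmac
import time

WINDOW_SECONDS = 25  # refresh interval mentioned in the post


def current_window(now=None):
    """Map a Unix timestamp to the index of its 25-second window."""
    now = time.time() if now is None else now
    return int(now // WINDOW_SECONDS)


def make_token(secret: bytes, session_id: str, window: int) -> str:
    """Derive a short token for one session and one time window (hypothetical scheme)."""
    message = f"{session_id}:{window}".encode("utf-8")
    return hmac.new(secret, message, hashlib.sha256).hexdigest()[:16]


def verify_token(secret, session_id, token, now=None, grace_windows=1):
    """Accept a token from the current window or a few windows just before it."""
    window = current_window(now)
    for candidate in range(window - grace_windows, window + 1):
        expected = make_token(secret, session_id, candidate)
        # compare_digest avoids leaking information through timing differences
        if hmac.compare_digest(expected, token):
            return True
    return False


if __name__ == "__main__":
    secret = b"demo-secret"  # placeholder value for the example
    token = make_token(secret, "lecture-1", current_window())
    print(verify_token(secret, "lecture-1", token))
```

The grace window lets a scan that started just before a refresh still succeed, while a token older than that is rejected.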
However, this alone cannot prevent someone from capturing the code and sending it to others who are not in the room, or from using two devices to help someone else scan for attendance, even if geolocation checks are enabled. We will explain this next.

This check‑in and check‑out process is highly scalable, and no one needs to queue while waiting for someone to scan their QR code! Organizers can also set geolocation restrictions as a simple way to prevent remote check-ins.

Keep Attendees Alive with SignalR

The SignalR live connection allows the presenter to create real‑time challenges for attendees, helping to verify their presence and ensure they are genuinely focused on the presentation.

AI Powered Live Quiz

The presenter shares their presentation screen, and two Microsoft Foundry agents with Azure OpenAI ChatGPT 5.3 work together to create challenges: ImageAnalysisAgent, which extracts key information from the shared screen, and QuizQuestionGenerator, which generates simple questions based on the current slide. The question is broadcast to all online attendees, who must answer within 20 seconds. This feature keeps attendees on the webpage and prevents them from doing anything unrelated to the presentation. A detailed report can be downloaded for further analysis.

Attendee Photo Capture

The presenter can ask all online students to capture and upload a photo of their view of the venue as an image challenge. The system analyzes the images to estimate seating positions using PositionEstimationAgent, a Microsoft Foundry agent with Azure OpenAI ChatGPT 5.3. When the presenter clicks Capture Attendee Photos, all online attendees are prompted to take a photo and upload it to blob storage. The PositionEstimationAgent then analyzes the image to estimate their seating location, which can provide insights into student performance.

Analysis Notes: Analyzed 13 students in 2 overlapping batches.
Batch 1: The venue is a computer lab with the projector screen at the front center, whiteboards on the left, and cabinets on the right. Relative depth was estimated mainly from screen size and number of monitor rows visible ahead. Column estimates were inferred from screen angle and side-room features, with lower confidence for the rotated side-view image.

Batch 2: These six photos appear to come from the same computer lab with the projector at the front center. Relative depth was estimated mainly from projector size and number of visible desk/monitor rows ahead. Left-right placement was inferred from projector skew and side-wall visibility. Within this batch, 240124734 and 240167285 seem closest to the front, 240286514 and 240158424 are slightly farther back, 240293498 is farther back again, and 240160364 appears furthest.

Pass Around the QR Code Attendance Sheet

Traditionally, the attendance sheet is circulated for attendees to sign, but this method is unreliable because no one monitors the signing process, allowing one attendee to sign for someone who is absent. It is also slow and not scalable for large groups.

The QR code attendance sheet functions as a chain. The presenter randomly distributes a short‑lived, one‑time QR code, representing a virtual attendance sheet, to any number of attendees, just like handing out multiple physical sheets. Each attendee must find another participant to scan their code to record attendance, continuing the chain until the final group of attendees. The presenter then verifies the last group’s presence.

The first chain is a dead chain because that student left the venue and cannot find another student to scan his QR code. The second chain contains 20 student attendance records. It also provides useful insights into their friendship and seating patterns.

Architecture

This project is built using Vibe Coding, so we will not share highly technical details in this post.
If you'd like to learn more, leave a comment, and we will write another blog to cover the specifics.

GitHub Repo

https://github.com/wongcyrus/ProvePresent

Conclusion

ProvePresent demonstrates how Azure serverless technology and Azure OpenAI can work together to solve a long‑standing problem in education: verifying genuine student presence and engagement. By combining real‑time QR code verification, SignalR‑powered live interactions, AI‑generated quizzes, and intelligent photo‑based seating analysis, we created a system where “being present” is no longer just a checkbox; it becomes a verifiable, interactive, and meaningful part of the learning experience.

Instead of relying on outdated smart‑card systems or manual roll calls, educators gain a dynamic tool that keeps students attentive, provides insight into classroom behavior, and produces useful analytics for improving teaching outcomes. Students, in turn, benefit from an engaging, modern attendance experience that aligns with how digital‑native learners expect classes to operate.

This is only the beginning. With Microsoft Foundry agents and the flexibility of Azure Functions, there are many opportunities to extend ProvePresent further: richer analytics, smarter engagement models, and seamless integration with LMS platforms. If there’s interest, we’re happy to share more technical details, architectural deep dives, and future roadmap ideas in a follow‑up post.

Thank you to the Microsoft Student Ambassadors from the Hong Kong Institute of Information Technology (HKIIT) for their contributions: Wong Wing Ho, CHAN Sham Jayson, Pang Ho Shum, and Chan Ka Chun. They are majoring in the Higher Diploma in Cloud and Data Centre Administration.

About the Author

Cyrus Wong is a senior lecturer at the Hong Kong Institute of Information Technology (HKIIT) @ IVE (Lee Wai Lee), and he focuses on teaching public cloud technologies. He is a passionate advocate for the adoption of cloud technology across various media and events.
With his extensive knowledge and expertise, he has earned prestigious recognitions such as AWS Builder Center, Microsoft MVP (Microsoft Foundry), and Google Developer Expert for Google Cloud Platform & AI.