AI agents don't just generate text. They take actions, call tools, and make decisions across multiple steps. That makes them both powerful and harder to validate.
A prompt change that improves one workflow can quietly break another. A model upgrade that boosts fluency could degrade tool selection. Without evaluations, those regressions typically surface only when users encounter them in production.
Building agents in Microsoft Foundry, we've seen this firsthand: early demos look impressive, but edge cases pile up quickly. The lesson that stands out most: start evals earlier than feels comfortable. Even imperfect evaluations can surface failure modes that manual testing misses.
Evaluations give developers a systematic way to define quality, measure it consistently, and catch regressions before they ship. This guide covers the core concepts, walks through your first eval in Microsoft Foundry, and shows how to build evaluations into your development workflow as your agent matures.
The Building Blocks of an Evaluation
An evaluation combines a dataset of test cases with evaluators that score behavior — plus tooling to debug failures and track improvement over time. In Foundry, these pieces connect end-to-end: datasets define what to test, evaluators score behavior, traces explain why something failed, and monitoring helps you close the loop after deployment.
Already familiar with evals? Jump to While You're Building to skip ahead.
Dataset: What to Test
A dataset is a collection of test cases, each describing a scenario your agent needs to handle. A well-structured test case includes a description of what's being tested, the input (a user query or conversation), and the expected outcome: specifically, the state the system should be in after the agent acts. For tool-calling agents, that means verifiable system state, not just what the agent said it would do.
For example: "A refund was issued" is meaningfully different from "the agent said it would issue a refund."
Depending on your evaluators, you may also include source documents (for RAG-based agents), reference responses, or tags to organize test cases by capability or risk level.
Test cases can be single-turn or multi-turn. Agent evaluations typically require multi-turn, since agents reason across steps, call tools, and adapt based on intermediate results.
Test data can come from three places:
- Written manually by hand
- Generated synthetically using an LLM
- Curated from real production interactions
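As a sketch, a single multi-turn test case for the refund scenario might look like this. The field names are illustrative, not a required Foundry schema; adapt them to your evaluation setup.

```python
# A hypothetical test-case structure for a tool-calling refund agent.
# Field names are illustrative, not a prescribed schema.
test_case = {
    "id": "refund-lost-receipt",
    "description": "Customer requests a refund without an order number",
    "tags": ["refunds", "verification", "multi-turn"],
    "turns": [
        {"role": "user", "content": "I want a refund but I lost my receipt."},
        {"role": "user", "content": "My email is jo@example.com."},
    ],
    # The expected outcome is system state, not just agent wording:
    "expected": {
        "refund_issued": False,          # not until identity is verified
        "verification_requested": True,  # agent must ask for lookup info
    },
}
```

Note that `expected` describes what should be true of the system afterward, which is what makes the case checkable by a code-based evaluator.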
Evaluators: How to Score
Evaluators are the judges. Each one scores a specific dimension of your agent's behavior. There are two broad approaches.
- Code-based evaluators: These use deterministic logic: string matching, format validation, keyword checks. They're fast, cheap, and fully reproducible. For our refund agent, a code-based evaluator might verify that a refund record was actually created in the system, not just promised in a response.
- LLM-as-judge evaluators: These use a language model to assess qualities that are harder to express in rules, such as tone, empathy, nuanced policy interpretation, or whether the agent selected the right tool for the task. They handle ambiguity well but are less predictable, and the judge prompt usually takes iteration. In return, they can surface reasoning alongside each score, which makes debugging considerably faster.
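A minimal sketch of the code-based approach for the refund agent, checking system state rather than the agent's words. Here `refund_store` is a stand-in for whatever database or API your agent writes to:

```python
# Sketch of a code-based evaluator: verify system state, not agent claims.
# `refund_store` is a hypothetical stand-in for your refund database or API.
def refund_actually_issued(refund_store: dict, order_id: str) -> dict:
    """Pass only if a refund record actually exists for the order."""
    record = refund_store.get(order_id)
    passed = record is not None and record.get("status") == "issued"
    return {"score": 1 if passed else 0,
            "reason": f"refund record for {order_id}: {record!r}"}

# The agent may *say* it refunded order B-200, but only A-100 has a record.
store = {"A-100": {"status": "issued"}}
print(refund_actually_issued(store, "A-100")["score"])  # 1
print(refund_actually_issued(store, "B-200")["score"])  # 0
```

This is exactly the "refund was issued" versus "agent said it would issue a refund" distinction from above, expressed as a check.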
Foundry provides a catalog of built-in evaluators across quality, safety, and agent-specific dimensions. You can also create custom evaluators for domain-specific criteria that generic evaluators can't cover.
Eval Runs: When to Evaluate
An eval run executes your test cases against your agent and applies evaluators to score the results. Three strategies complement each other:
- On demand: Run during development or whenever you want a quality check.
- Event-driven: Trigger automatically on every code change via CI/CD, or run continuously on sampled production traffic.
- Scheduled: Periodic runs catch silent degradation — quality can drift as user behavior shifts or model updates roll out.
Because LLM outputs vary between runs, it's worth running the same evaluation multiple times to understand variance. A test case that fails intermittently is still a reliability problem worth investigating.
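One way to make that variance visible is to compute a per-case pass rate across repeated runs. The run results below are canned dictionaries standing in for real eval output:

```python
# Sketch: run the same eval several times and surface per-case flakiness.
# The `runs` data is illustrative, standing in for real eval results.
from collections import defaultdict

def flaky_cases(runs: list[dict[str, bool]]) -> dict[str, float]:
    """Return the pass rate per test case across repeated runs."""
    tally = defaultdict(list)
    for run in runs:
        for case_id, passed in run.items():
            tally[case_id].append(passed)
    return {cid: sum(results) / len(results) for cid, results in tally.items()}

runs = [
    {"defective-blender": True, "lost-receipt": True},
    {"defective-blender": True, "lost-receipt": False},
    {"defective-blender": True, "lost-receipt": True},
]
rates = flaky_cases(runs)
# "lost-receipt" passes only 2 of 3 runs: intermittent, worth investigating.
print(rates)
```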
Traces: How to Debug
A trace captures the full execution of a test case—every message, tool call, and reasoning step. When a score flags a problem, the trace reveals the root cause.
For example: A low Task Adherence score might lead you to a trace showing the agent called lookup_order twice but never called issue_refund.
Foundry connects traces directly to evaluation results, so you can jump from a failing score to the exact sequence that produced it.
Analysis: How to Improve
When you have hundreds of test cases across multiple evaluators, individual trace review doesn't scale. Two capabilities help you work at a higher level:
- Group failures: Identify common patterns. Instead of reviewing each low score individually, cluster analysis surfaces which failures share root causes. This enables you to prioritize the highest-impact fixes first.
- Run comparisons: Put two eval runs side by side (before and after a change) to verify fixes, catch regressions, and quantify impact. This becomes especially important as your agent grows more complex.
Your First Eval Run
Say you're building a customer support agent that handles refund requests. You want to know: does it follow your refund policy? Are its responses coherent? Is it safe?
Step 1: Create a Test Dataset
Start simple — five test cases are enough to establish your first baseline. Writing them by hand forces you to articulate what success actually looks like. Cover both sides: cases where the agent should act, and cases where it should refuse or ask for more information.
- "I bought a blender 3 days ago and it's defective. Can I get a refund?" — should approve
- "I'd like to return a jacket I purchased last week. It doesn't fit." — should approve
- "Can you process a refund for an order from 2 years ago?" — should deny (out of policy)
- "I want a refund but I lost my receipt and don't have an order number." — should request verification, then process if confirmed
- "Refund me right now or I'll sue you." — should handle professionally and escalate if needed
Upload these to your Foundry project as a dataset. Once you have a baseline, you can expand coverage manually or use Foundry's built-in synthetic data generation to scale up.
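To codify the five cases, you might write them out as JSONL, one test case per line. The `query`/`expected_behavior` schema here is illustrative, not a Foundry requirement:

```python
# Sketch: codify the five starter cases as a JSONL dataset.
# The schema (query / expected_behavior) is illustrative, not prescribed.
import json

cases = [
    {"query": "I bought a blender 3 days ago and it's defective. Can I get a refund?",
     "expected_behavior": "approve"},
    {"query": "I'd like to return a jacket I purchased last week. It doesn't fit.",
     "expected_behavior": "approve"},
    {"query": "Can you process a refund for an order from 2 years ago?",
     "expected_behavior": "deny"},
    {"query": "I want a refund but I lost my receipt and don't have an order number.",
     "expected_behavior": "verify_then_process"},
    {"query": "Refund me right now or I'll sue you.",
     "expected_behavior": "deescalate"},
]

with open("refund_eval.jsonl", "w") as f:
    for case in cases:
        f.write(json.dumps(case) + "\n")
```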
Step 2: Choose Your Evaluators
Pick evaluators that match what you actually care about. For a customer support agent, a practical starting set looks like this:
- Task Adherence — does the agent follow its system instructions and the refund policy?
- Intent Resolution — does the agent correctly understand what the customer is asking?
- Tool Call Accuracy — does the agent call the right tools with the correct parameters?
- Safety evaluators — does the response contain harmful, violent, or hateful content?
Always include safety evaluators. They require no extra setup, add minimal overhead, and catch risks early. Don't wait until pre-launch to discover your agent can produce harmful content. Learn more about selecting evaluators in the Foundry documentation.
Step 3: Run the Evaluation
Create an evaluation in Foundry that combines your dataset, evaluators, and target agent — either through the Foundry portal or the Foundry SDK. Foundry sends each test query to your agent, captures the full response including tool calls, and applies your evaluators to score the results.
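Conceptually, an eval run is a loop: send each case to the agent, then apply every evaluator to the captured response. The sketch below uses toy stand-ins (`call_agent` and a single lambda evaluator are hypothetical) to show the shape of that loop, not the Foundry SDK itself:

```python
# Sketch of what an eval run does under the hood. `call_agent` and the
# evaluator functions are hypothetical stand-ins, not Foundry SDK calls.
def run_eval(cases, call_agent, evaluators):
    results = []
    for case in cases:
        response = call_agent(case["query"])  # full response incl. tool calls
        scores = {name: ev(case, response) for name, ev in evaluators.items()}
        results.append({"case": case["query"], "scores": scores})
    return results

# Toy stand-ins to show the shape of the output:
cases = [{"query": "Refund my blender?", "expected_behavior": "approve"}]
agent = lambda q: {"text": "Refund issued.", "tool_calls": ["issue_refund"]}
evaluators = {
    "tool_call_accuracy": lambda c, r: 1 if "issue_refund" in r["tool_calls"] else 0,
}
print(run_eval(cases, agent, evaluators))
```

In Foundry the execution, capture, and scoring are handled for you; the loop is shown only to make the moving parts concrete.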
Step 4: Read the Results
Results show aggregated pass/fail counts, per-evaluator breakdowns, and individual test case results (learn more).
For our refund agent, Task Adherence might catch a policy violation (the agent approved a refund it shouldn't have), while Intent Resolution flags a misunderstood request. Different evaluators reveal different classes of problems. That's why you need more than one.
Each result includes the evaluator's reasoning for the score, not just the score itself. That reasoning is what helps you understand not only what failed, but what to fix next.
> **Try it:** You'll need a Foundry project with an agent and an Azure OpenAI deployment. Walk through the tutorial Evaluate your AI agents, or deploy the get-started-with-ai-agents template, which includes an example evaluation setup.
The Eval Maturity Journey
Most teams follow a predictable path as their agent matures. Understanding where you are helps you invest in the right things next.
| Stage | Test Cases | Evaluators | CI Gate | Prod Monitor | Prod Feedback |
|---|---|---|---|---|---|
| 1. Vibes | — | — | — | — | — |
| 2. Automated Eval | ✓ | ✓ | — | — | — |
| 3. Continuous Integration | ✓ | ✓ | ✓ | — | — |
| 4. Production Monitoring | ✓ | ✓ | ✓ | ✓ | — |
| 5. Continuous Improvement | ✓ | ✓ | ✓ | ✓ | ✓ |
Stage 1: Vibes
You're eyeballing outputs, manually testing happy paths, and hoping for the best. Quality lives in your head, not in code. You're here if you test by typing queries into a chat window.
Stage 2: Automated Eval
Test cases are codified. You can run evals on demand and get scores back. Quality is defined, not assumed. If you followed the previous section, you're already here.
Stage 3: Continuous Integration
Evals run automatically on every code change. Regressions get caught before they ship. You have a baseline, and you know when things get worse. You're here if a failing eval blocks your pull request.
Stage 4: Production Monitoring
Your agent is live and you're watching it. Continuous evaluation scores sampled interactions in near real-time. Scheduled runs against your golden dataset catch drift. Alerts fire when metrics drop. You're here if you have dashboards and alerts on eval metrics.
Stage 5: Continuous Improvement
Production signals feed back into your development cycle. Real user interactions enrich your dataset, A/B experiments validate changes against live traffic, and new capabilities start as test cases before they're implemented. You're here if you regularly use production data to update test cases and run experiments.
While You're Building
You now have a baseline: a handful of test cases and a first set of scores. The goal from here is fast, reliable feedback as you iterate.
Building Your Golden Dataset
- Codify what you already test manually. Take the scenarios you check by hand: happy paths, edge cases, and bugs you've already fixed. Add them to your dataset as test cases.
- Expand with synthetic data. Use an LLM to generate test cases from your agent’s definition, seed questions, reference documents, or user personas. Foundry includes built-in synthetic generation to help with this.
- Cover both sides. Include cases where the agent should act and cases where it should refuse or escalate. For the refund agent:
- Approved — defective product returned within 30 days
- Denied — purchase from two years ago, outside policy window
- Escalated — customer threatening legal action
- Treat it like test-driven development. Every time you add a capability, fix a bug, or uncover a new failure mode, add a test case. Your dataset should grow alongside your agent as part of the development process.
Choosing the Right Evaluators
Match evaluators to your agent type. This table is a practical starting point:
| Agent Type | Start With |
|---|---|
| RAG / Q&A | Groundedness (answers match source data), Relevance |
| Tool-calling agents | Task Adherence, Tool Call Accuracy, Intent Resolution |
| Customer-facing | Fluency, Coherence, and Safety evaluators |
| All agents | Safety evaluators: Violence, Self-Harm, Hate & Unfairness, Protected Materials |
Go custom when built-in evaluators can't express your rules. If your agent must refuse certain request types or respond in a particular brand voice, no generic evaluator will catch that.
For example, a custom evaluator might use an LLM judge to assess whether the agent's tone was appropriately empathetic:
```
# Prompt-based: was the tone empathetic?
Rate whether the agent's response shows empathy toward the customer.
Score 1 if empathetic and professional, 0 if dismissive or robotic.

Response: {{response}}

Output Format (JSON): {"result": <0 or 1>, "reason": "<brief explanation>"}
```
Custom evaluators can be created via the Foundry SDK or directly in the Foundry portal.
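Wired up in code, a judge like this is a thin wrapper: format the prompt, call the judge model, parse the JSON verdict. In the sketch below, `ask_judge_model` is a hypothetical stand-in for your model call, stubbed so the parsing logic can run on its own:

```python
# Sketch: wrapping an LLM-as-judge prompt. `ask_judge_model` is a
# hypothetical stand-in for a real model call; here it is stubbed.
import json

PROMPT = (
    "Rate whether the agent's response shows empathy toward the customer. "
    "Score 1 if empathetic and professional, 0 if dismissive or robotic.\n"
    "Response: {response}\n"
    'Output Format (JSON): {{"result": <0 or 1>, "reason": "<brief explanation>"}}'
)

def judge_empathy(response_text: str, ask_judge_model) -> dict:
    raw = ask_judge_model(PROMPT.format(response=response_text))
    verdict = json.loads(raw)  # keep the reason: it speeds up debugging
    return {"score": verdict["result"], "reason": verdict["reason"]}

# Stubbed judge response, standing in for a real model call:
stub = lambda _prompt: '{"result": 0, "reason": "Tone was curt and dismissive."}'
print(judge_empathy("No refund. Policy is policy.", stub))
```

Returning the reason alongside the score is what makes judge failures debuggable rather than opaque.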
Debugging with Traces
A low score is a signal, not a conclusion. The score tells you something went wrong. The LLM judge's reasoning tells you why. The trace tells you where.
When an evaluation surfaces failures, open the trace and classify the issue.
- Did the agent call the wrong tool? That's likely a prompt issue, signaling the instructions need to be more explicit about policy constraints.
- Did two simultaneous tool calls produce incoherent merged output? That's a scaffold or orchestration problem.
- Did the agent confidently cite a policy that doesn't exist? Consider adding groundedness checks or retrieval-augmented generation.
When failures pile up, look for patterns first. Cluster analysis lets you group similar failures and identify shared root causes, so you can fix the highest-impact issue and re-run, rather than debugging each case individually.
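A simple version of that grouping is a frequency count over failure categories. The `category` field is assumed to come from evaluator reasoning or a classification pass; the data here is illustrative:

```python
# Sketch: group failures by a coarse category before debugging one by one.
# The failure records are illustrative; `category` would come from
# evaluator reasoning or a separate classification step.
from collections import Counter

failures = [
    {"case": "lost-receipt", "category": "wrong_tool"},
    {"case": "two-year-old-order", "category": "policy_violation"},
    {"case": "partial-refund", "category": "wrong_tool"},
    {"case": "angry-customer", "category": "wrong_tool"},
]

by_cause = Counter(f["category"] for f in failures)
# Most common root cause first: fix "wrong_tool" before anything else.
print(by_cause.most_common())
```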
Before You Ship: The Two-Loop Model
Everything so far has focused on the inner loop: write code, run evals, fix issues, repeat. But shipping requires a second loop: deploy, monitor, analyze, optimize. In Microsoft Foundry, evaluations are designed to support both.
Inner loop (pre-ship)
- During development: run on demand to catch issues early and iterate quickly
- As CI/CD gates: triggered on every merge to block regressions before they ship
Outer loop (post-ship)
- Continuously in production: score live traffic and run scheduled evals so production insights feed back into the next iteration.
Gate Your Merges
Once you have a passing baseline, wire evaluations into your CI/CD pipeline so regressions are caught automatically before they ship.
Start with low expectations, then raise the bar. When you first add tests for a new capability, pass rates will be low. That's fine; you're measuring progress. As scores stabilize, tighten the thresholds and treat them as gates. A failing eval should block the merge.
> **Example:** Your new escalation feature starts at a 40% pass rate. Over a few iterations it reaches 90%+. At that point it becomes a regression gate: if it drops, the merge is blocked.
Foundry supports running evaluations in GitHub Actions and Azure DevOps, with built-in statistical analysis to help distinguish real regressions from run-to-run noise.
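At its core, a gate is a pass-rate check that sets the process exit code for your pipeline. How you fetch results depends on your setup; the score list here is a stand-in for values pulled from an eval run:

```python
# Sketch of a CI gate: fail the pipeline when the pass rate drops below
# a threshold. The `scores` list stands in for fetched eval results.
import sys

def gate(scores: list[int], threshold: float) -> int:
    """Return a process exit code: 0 passes the gate, 1 blocks the merge."""
    pass_rate = sum(scores) / len(scores)
    print(f"pass rate: {pass_rate:.0%} (threshold: {threshold:.0%})")
    return 0 if pass_rate >= threshold else 1

scores = [1, 1, 1, 0, 1, 1, 1, 1, 1, 1]  # 9 of 10 cases pass
exit_code = gate(scores, threshold=0.85)
print(exit_code)  # 0: merge allowed
# In a real CI step you would end with: sys.exit(exit_code)
```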
Closing the Loop
An eval that always passes catches regressions but doesn't drive improvement. If scores plateau, it may be that your tests are too easy, not that your agent is that good. Periodically ask yourself: does my dataset still cover how people actually use the agent? Are my test cases hard enough to catch meaningful changes?
The best remedy is production data. Feed real interactions back into your golden dataset, focusing on the interesting ones: failures, unexpected inputs, scenarios that reveal gaps in your coverage.
> **Example:** Users start asking for partial refunds, a scenario you never tested. Add it to the dataset, define the expected behavior, and run the eval. Now you have coverage for a real-world pattern.
Foundry supports continuous evaluation to help automate this: it monitors your agent in production by scoring interactions in near real-time. Schedule periodic eval runs against your golden dataset as well, and set up Azure Monitor alerts on your evaluation metrics so regressions surface immediately, not the next time someone checks the dashboard.
The outer loop: production signals improve your dataset coverage, your dataset strengthens your evals, and your evals protect the next release.
Final Thoughts
Prioritize outcomes over paths
Did the refund get processed correctly? Don't penalize the agent for reaching a valid answer via a different route. Rigid step-checking creates false failures. Use custom evaluators to verify end state, and track latency and cost separately from correctness.
Red team before your users do
Safety evaluators catch known risks, but Foundry's AI red teaming agent actively tries to break your agent. It probes for adversarial vulnerabilities that static test cases miss. Run it before launch.
Handle non-determinism intentionally
Agent behavior varies between runs. The same query can produce different tool calls, different phrasing, even different outcomes. Don't rely on a single pass/fail. Run evaluations multiple times, compare runs in your Foundry project, and investigate any test case that fails intermittently.
Match your eval environment to production
Evals should use the same APIs, tools, and surfaces as production. Even small differences can mask real failures. A mock that simplifies a tool response or a test harness that skips authentication can hide exactly the bugs you're trying to catch.
Read the conversations, not just the scores
When scores change, check the reasoning and open the trace. Sometimes the agent made a real mistake. Sometimes the evaluator misjudged a valid response. You can't tell from the number alone. Manual review remains essential, even at scale.
You don't need to do everything at once
Five test cases, a few evaluators, and one eval run are enough to establish your first quality bar. From there, expand coverage, add regression gates, and bring production signals back into development. The goal is a quality bar that evolves alongside your agent — not perfection on day one.
Get Started