Blog Post

Microsoft Foundry Blog

Evaluating Multi-Turn Agents: A Quality Study of Microsoft Foundry’s Multi-Turn Evaluators

amah
Jun 26, 2026

As AI applications evolve into complex, multi-turn systems, observability becomes essential not just for tracing behavior, but for analyzing quality and validating outcomes at every step. Microsoft Foundry observability enables developers to evaluate interactions across turns, diagnose issues, and continuously improve response accuracy, consistency, and reliability, providing the insights needed to measure, validate, and ship high-quality AI experiences with confidence.

Authors: Ali Mahmoudzadeh, Salma Elshafey, Shuo Qiu, José Santos, Ilya Matiach, Vivek Bhadauria, Morteza Ziyadi, April Kwong

LLM-as-judge evaluations are a critical part of any agent lifecycle. Single-turn metrics such as relevance, faithfulness, and tone can be straightforward to prompt and analyze, but a number per response isn’t the thing your users experience. They experience a session: a multi-turn conversation in which the agent asks clarifying questions, calls tools, retries when something fails, and hopefully stitches the whole thing into a usable outcome. A session can have every individual response look fine and still be unfinished, off-policy, or quietly hallucinated. Averaging turn-level scores doesn’t recover those properties.

That’s the gap multi-turn evaluators close. But there’s a second-order problem: any tool you use to score your agent is itself a system that can be wrong. A weak evaluator that always says “pass” will flatter your agent. A judge with high variance will surface phantom regressions on every rerun. If the evaluator isn’t calibrated and verified, no meaningful direction can be extracted from the generated numbers.

In Microsoft Foundry, we assess multi-turn evaluators the same way we’d measure any other AI system: with benchmark datasets, with paired accuracy and reliability metrics, and with multiple judge models per evaluator so we can separate “evaluator works” from “this particular judge works.” This post describes what the evaluators are, how we tested them, and what we found, including where to trust them confidently and where not.

The short version, up front: two of the four evaluators (Task Completion, Customer Satisfaction) reach substantial-to-near-perfect agreement with ground truth and have low enough run-to-run variance to be used as single-pass scores. Conversation Coherence is close behind. Groundedness is the hardest of the four and is better deployed as a triage signal than a hard gate. Across all four, judge sensitivity varies by evaluator; Task Completion is robust to judge choice, CSAT is accurate and reliable across judges, and Groundedness is the clearest case where judge choice materially changes outcomes. Judge-by-judge comparisons still help interpret portability, but the headline result for CSAT is that it is a strong evaluator.

Four session-level properties, four evaluators

Before scoring anything, you have to decide what session-level properties you actually want to measure. The four evaluators each target a distinct property, and the design is deliberately complementary: an agent can pass one and fail another, and that gap is diagnostic, not redundant.

| Evaluator | The property it measures | Output |
| --- | --- | --- |
| Task Completion | Was the user’s task successfully and completely accomplished by the end of the session? | Binary (pass/fail) + per-request breakdown |
| Customer Satisfaction (CSAT) | How satisfied would the user be with the agent’s performance across the session? | Likert 1–5 + per-dimension reasons |
| Groundedness | Are the agent’s factual claims across all turns supported by the conversation’s grounding sources? | Likert 1–5 + grounded / ungrounded claims |
| Conversation Coherence | Does the conversation flow logically across turns (state maintained, no contradictions, sensible progression)? | Likert 1–5 with skip-gating |

The full prompt-level specifications, dimensions, and output schemas for each evaluator are in the internal quality report. Below we focus on how we tested them.

When to use multi-turn vs. single-turn vs. adaptive evaluators

Multi-turn evaluators are one of three evaluator families Foundry supports, and they are not always the right tool. Picking the wrong family is a common cause of either wasted compute or measurements that don’t answer the question you actually asked. Use this as a rough decision aid:

| Family | What it scores | Unit of analysis | When to reach for it | When not to |
| --- | --- | --- | --- | --- |
| Single-turn | A single (input, output) pair against a fixed rubric (e.g., relevance, fluency, single-turn coherence, faithfulness on one response) | One turn | High-volume per-response checks; RAG faithfulness on a single answer; fluency / safety screens on individual outputs; CI gates where you score every assistant message | Anything where the property is defined across multiple turns; task completion, conversational coherence, multi-turn grounding, on-goal behavior |
| Multi-turn | A whole session against a fixed rubric (the four evaluators above) | One conversation | Session-level properties: did the task get done, was every claim grounded across turns, did the conversation flow coherently, was the user satisfied with the whole experience | Per-turn safety screens (too coarse, a single bad turn averages out); domains with strong programmatic oracles; ad-hoc properties not in the catalog |
| Adaptive | A whole session against a rubric that was generated for the cohort; Foundry derives a rubric from all (or a sample of) traces from an agent, then applies it to each session in the cohort | One conversation, scored with a rubric synthesised from the cohort | Agents whose success criteria aren’t easily captured by a generic rubric (open-ended assistants, vertical agents, internal tools with bespoke contracts); benchmarking a new agent before you’ve written its rubric by hand; quickly iterating on what “good” means for a system | Cases where a well-validated fixed rubric already exists (use multi-turn; cheaper, more stable, easier to interpret across releases); per-response loops; cohorts so heterogeneous that no single rubric is representative of any session |

Two heuristics that fall out of this:

  • Match the unit of analysis to the property you care about. Single-turn metrics can’t measure session-level properties no matter how you aggregate them: averaging fluency scores does not give you task completion. Conversely, scoring a whole 30-turn session with a single Likert is wasteful if all you need is a safety screen on the last response. The cheapest tool that matches the unit wins.
  • Use adaptive when no validated fixed rubric exists for the agent. Multi-turn evaluators are fixed-rubric, i.e., every conversation in a run is scored against the same six-dimension CSAT rubric, the same task-completion definition, the same grounding rule. That works when the property generalises across agents (task completion, groundedness, coherence, satisfaction all do). For agents whose definition of “good” is bespoke, e.g. a vertical assistant, an internal tool with its own contract, an early-stage agent that hasn’t had a rubric written by hand yet, adaptive evaluators synthesise a rubric from a sample of the agent’s own traces and apply it consistently to the whole cohort. You still get apples-to-apples comparison within the cohort; the rubric-generation step itself becomes another quality dimension to measure (we cover this separately).

A pragmatic stack we see working in practice: single-turn evaluators on every assistant response (cheap safety / fluency / per-response faithfulness gates in CI), multi-turn evaluators on full sessions (the four above, run on test sets before each release), and adaptive evaluators on new or bespoke agents (synthesise a cohort-level rubric from a sample of traces when no validated fixed rubric yet exists).

How we put them to the test

The rubric for scoring an evaluator is straightforward, and it splits cleanly into three axes:

  • Validity: does the evaluator separate good outcomes from bad ones? We report this primarily as PR-AUC on the safety-relevant minority class (hallucinated for Groundedness, incoherent for Coherence, fail for Task Completion, pass for CSAT). PR-AUC uses the full continuous score rather than a single threshold, is robust to class imbalance (a degenerate “always predict majority” classifier scores at baseline = positive-class prevalence), and isolates the judge’s ranking ability from where the deployment pass-threshold happens to sit. For the threshold-bound, deployment-gate view we also report accuracy / Cohen’s κ at the operating point, but we lead with PR-AUC because accuracy against a binary threshold can be a misleading summary when the underlying score is continuous, the class distribution is skewed, or both.
  • Reliability: does the evaluator give the same verdict when you ask it again with the same judge? Each (evaluator, judge, trace) was graded 4 times. We report flip rate (% of traces where the verdict changed), unanimous agreement (% where all 4 repeats agreed), and mean variance across repeats.
  • Robustness: does the evaluator give the same verdict when you swap the judge model? An evaluator that hits 90% accuracy with one judge but 70% with another is pinned to that judge, not portable, and most teams will need to swap judges over time for cost, availability, or policy reasons. We measure this by running each evaluator against six candidate judges spanning three model families and three size tiers: gpt-5.5, gpt-5.4-mini, and gpt-5.4-nano (OpenAI frontier / mid / small), Claude Opus 4.7 (Anthropic frontier), Grok-4 (xAI frontier reasoning), and DeepSeek-V3.2 (open-weights). A robust evaluator has a narrow cross-judge band on Validity; a fragile one has a wide one.
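
The reliability metrics in the list above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the study's implementation: the function name `reliability_metrics`, the input layout, and the example scores are all hypothetical. "Flip" is read literally here as "the binarized verdict was not identical across all repeats", which makes flip rate and unanimous agreement complementary under this reading; the study may define them differently.

```python
from statistics import pvariance


def reliability_metrics(repeat_scores, pass_threshold):
    """Summarize run-to-run stability for one (evaluator, judge) pair.

    repeat_scores: one list per trace, holding the raw score from each repeat.
    pass_threshold: a score >= this value counts as a "pass" verdict.
    """
    n_traces = len(repeat_scores)
    flipped = 0
    variances = []
    for scores in repeat_scores:
        verdicts = {score >= pass_threshold for score in scores}
        if len(verdicts) > 1:  # both pass and fail were observed
            flipped += 1
        variances.append(pvariance(scores))  # population variance of raw scores
    return {
        "flip_rate": flipped / n_traces,
        "unanimous_rate": (n_traces - flipped) / n_traces,
        "mean_variance": sum(variances) / n_traces,
    }


# Illustrative data only: 4 traces, each graded 4 times on a 1-5 Likert scale.
example = [
    [5, 5, 5, 5],  # stable pass
    [4, 3, 4, 4],  # verdict flips at threshold 4
    [2, 2, 1, 2],  # score wobbles, verdict stays "fail"
    [3, 3, 3, 3],  # stable fail
]
metrics = reliability_metrics(example, pass_threshold=4)
print(metrics)
```

Note that the third trace contributes to the mean variance without contributing to the flip rate, which is the distinction between score-level fluctuation and verdict-level flips that the post draws later.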

Headline numbers are necessary but not sufficient (a 0.9 PR-AUC that requires gpt-5.5 is a different finding than a 0.9 PR-AUC that holds across six judges). The robustness axis is what lets us say something useful about portability rather than just “this judge, this benchmark, this run”.

The full per-judge picture is the most compact summary of where each evaluator stands today:

[Figure: Per-judge Validity (PR-AUC) vs. flip rate, for all four evaluators.]

Each point is one (evaluator, judge) pair. The top-right corner is the goal: high validity (good ranking of positives vs. negatives) and low flip rate (reproducible verdict). Three of the four evaluators, Task Completion, CSAT, and Conversation Coherence, show tight, top-right judge clusters across all six multi-repeat candidates. CSAT (against the human-validated 3-judge panel GT at raw ≥ 3) is actually the highest-PR-AUC band of the four: every judge lands in 0.82–0.97 (Claude Opus 4.7 and gpt-5.5 at 0.97 lead; DeepSeek-V3.2 at 0.83 trails). Groundedness is the only outlier: it sits well below the rest on Validity, has higher flip rates (10–21%), and is the one evaluator where judge choice meaningfully moves the headline number. The judge variation that exists on CSAT lives at the operating point (where smaller judges place the pass/fail cut), not in the underlying score quality; see Finding 2.

Datasets

Choosing datasets for multi-turn evaluation is hard. Most public agent benchmarks ship outcome labels (did the task succeed?) but not process labels (was every claim grounded? did the conversation flow coherently?). For each evaluator we picked the dataset that isolated the target property as cleanly as possible.

| Dataset | Evaluators tested | N | Ground truth | Domain |
| --- | --- | --- | --- | --- |
| τ²-bench | Task Completion, CSAT | 456 | Deterministic (DB state verification) | Telecom customer service |
| BFCL v4 | Task Completion (cross-domain) | 200 | Deterministic (state comparison) | Function calling |
| FaithDial | Groundedness | 300 | Pre-labeled (faithful / hallucinated) | Knowledge-grounded dialogue |
| FED | Conversation Coherence | 125 | 5-annotator human labels, binarized | Open-domain dialogue |
| Copilot CLI sessions | Task Completion, CSAT (cross-domain) | 107 (34 audited) | LLM-reviewed manual audit | Developer agentic coding |

One methodological point is worth flagging:

Cross-domain generalization is its own test. τ²-bench is telecom customer service, a narrow slice. To stress-test generalization we ran Task Completion and CSAT against an internal corpus of 107 Copilot CLI agentic-coding sessions (longer conversations, dense tool use, ambiguous task boundaries) and against BFCL v4 function-calling traces. Without these, claims about “production readiness” are claims about telecom.

What we found

Four findings are worth reporting in detail. The Validity view across all four evaluators is the cleanest single summary of where the judges actually differ:

[Figure: Validity (PR-AUC on the safety-relevant minority class) by judge, all four evaluators]

Each panel shows Validity = PR-AUC on the safety-relevant minority class for one evaluator (fail for Task Completion, pass for CSAT, hallucinated for Groundedness, incoherent for Coherence). The dashed line is the random-classifier baseline (positive-class prevalence). Three of the four evaluators, TC, CSAT, and Coherence, cluster well above baseline across all six candidate judges, with CSAT actually showing the highest and tightest band of the four (0.82–0.97 vs. 0.34 baseline). Groundedness is the outlier: judges that look within 3 pp on accuracy span 0.60–0.82 on Validity, with frontier reasoning judges clearly ahead of the rest. On CSAT, the new human-validated 3-judge panel GT (gpt-5.5 + Claude Opus 4.7 + Grok-4, majority vote, all split cases human-reviewed) with the corrected pass-threshold (raw Likert ≥ 3, matching the prompty’s “would not complain” boundary) puts every judge’s PR-AUC well above the 0.34 baseline. The ranking of scores is uniformly strong; what variation exists lives at the operating point (smaller judges place the cut too permissively), not in the score quality.
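
To make the Validity number and its dashed baseline concrete, here is a minimal sketch that estimates PR-AUC as step-wise average precision. The post does not say which PR-AUC estimator was used, so the estimator, the function name `average_precision`, and the toy scores are assumptions. For Groundedness the positive class is "hallucinated" while a high score means "faithful", so the sketch negates the score to get a risk ranking.

```python
def average_precision(labels, risk_scores):
    """Step-wise average precision: sum of (recall gain x precision).

    labels: 1 for the positive (safety-relevant) class, 0 otherwise.
    risk_scores: higher means "more likely positive". Tied scores share
    one threshold, so their order in the input does not matter.
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("at least one positive example is required")
    tp = fp = 0
    prev_recall = 0.0
    ap = 0.0
    for threshold in sorted(set(risk_scores), reverse=True):
        for label, score in zip(labels, risk_scores):
            if score == threshold:
                if label == 1:
                    tp += 1
                else:
                    fp += 1
        precision = tp / (tp + fp)
        recall = tp / n_pos
        ap += (recall - prev_recall) * precision
        prev_recall = recall
    return ap


# Illustrative data only: per-trace mean Groundedness scores (1-5 Likert)
# and whether the trace is labeled hallucinated (1) or faithful (0).
groundedness = [5.0, 4.5, 2.0, 4.0, 1.5, 3.0]
hallucinated = [0, 1, 1, 0, 1, 0]

risk = [-g for g in groundedness]  # low groundedness = high hallucination risk
pr_auc = average_precision(hallucinated, risk)
baseline = sum(hallucinated) / len(hallucinated)  # positive-class prevalence
print(f"PR-AUC={pr_auc:.3f} baseline={baseline:.3f}")
```

In this toy set one hallucinated trace scored 4.5, so it is ranked below two faithful traces and the PR-AUC drops under 1.0 while staying well above the prevalence baseline.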

1. Task Completion is judge-robust on both axes

On τ²-bench (456 traces), Task Completion with gpt-5.5 reaches 88.4% accuracy, κ = 0.754 at the operating point, with a 6.4% flip rate across 4 repeats. The six multi-repeat candidates land in a tight 85.3–88.4% / κ 0.683–0.754 band; gpt-5.4-mini (87.9%) and gpt-5.4-nano (86.4%) match the frontier judges to within 2 pp while costing a fraction. Grok-4 (0.0% flip rate) and DeepSeek-V3.2 (0.7% flip rate) are the most reproducible, both at ~85.5% accuracy.
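
For readers who want to reproduce the operating-point view on their own data, accuracy and Cohen’s κ can be computed as in the sketch below. The labels are made up for illustration, and the helper names are hypothetical rather than part of any Foundry API.

```python
def accuracy(y_true, y_pred):
    """Fraction of traces where the judge verdict matches ground truth."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def cohen_kappa(y_true, y_pred):
    """Agreement corrected for the agreement expected by chance."""
    n = len(y_true)
    observed = accuracy(y_true, y_pred)
    expected = sum(
        (y_true.count(label) / n) * (y_pred.count(label) / n)
        for label in set(y_true) | set(y_pred)
    )
    if expected == 1.0:
        # Both sides used one identical label everywhere; kappa is undefined.
        return float("nan")
    return (observed - expected) / (1.0 - expected)


# Illustrative data only: 10 traces, deterministic ground truth vs. judge verdict.
ground_truth = ["pass"] * 4 + ["fail"] * 6
judge = ["pass", "pass", "pass", "fail", "fail",
         "fail", "fail", "fail", "fail", "pass"]

acc = accuracy(ground_truth, judge)
kappa = cohen_kappa(ground_truth, judge)
print(f"accuracy={acc:.3f} kappa={kappa:.3f}")
```

Here 80% raw agreement shrinks to κ ≈ 0.58 once chance agreement (52% for these class balances) is discounted, which is why the post reports both numbers side by side.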

The Validity view tells the same story. Five of the six candidates sit inside a 0.05 band, so the evaluator is judge-agnostic not just at the threshold but on ranking quality too; DeepSeek-V3.2 falls behind. The cross-candidate consistency among the top five is what matters: this is the natural place to cost-optimize, because dropping from gpt-5.5 to gpt-5.4-mini or even gpt-5.4-nano costs you almost nothing on either axis.

2. CSAT is accurate and reliable

CSAT has no external oracle, so we build ground truth from a 3-judge frontier panel (gpt-5.5, Claude Opus 4.7, Grok-4) majority vote with human review of every split case. The final GT is 157 pass (34.4%) / 299 fail (65.6%) on the 456-trace τ²-bench set.[1]
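
The panel procedure can be sketched as follows, assuming each judge contributes its most common pass/fail verdict across repeats and that pass means raw Likert ≥ 3, as the post states. The judge names, the function names, and the scores are placeholders. How a 2-vs-2 tie among a judge's four repeats is resolved is not stated in the post; this sketch assumes it counts as fail.

```python
from collections import Counter

PASS_THRESHOLD = 3  # raw Likert >= 3 counts as "pass", per the post


def judge_verdict(scores):
    """Most common pass/fail verdict across one judge's repeats.

    Assumption: an exact tie between pass and fail resolves to fail.
    """
    counts = Counter(score >= PASS_THRESHOLD for score in scores)
    return counts[True] > counts[False]


def panel_label(scores_by_judge):
    """Return (majority_label, needs_human_review) for one trace."""
    votes = [judge_verdict(scores) for scores in scores_by_judge.values()]
    passes = sum(votes)
    label = passes * 2 > len(votes)         # strict majority says pass
    needs_review = 0 < passes < len(votes)  # any split goes to a human
    return label, needs_review


# Illustrative data only: raw CSAT scores from 3 judges x 4 repeats.
split_trace = {
    "judge_a": [4, 4, 3, 4],
    "judge_b": [3, 3, 2, 3],
    "judge_c": [2, 2, 2, 3],
}
agreed_trace = {
    "judge_a": [2, 2, 1, 2],
    "judge_b": [2, 1, 2, 2],
    "judge_c": [1, 2, 2, 2],
}
print(panel_label(split_trace))   # 2-vs-1 split: tentative pass, flagged for review
print(panel_label(agreed_trace))  # unanimous fail, no review needed
```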

CSAT proves to be a strong evaluator on both axes in this study. Every candidate judge lands between 0.82 and 0.97 PR-AUC on the pass class (baseline 0.34). At the operating point with a top-tier judge, Claude Opus 4.7 hits 97.4% accuracy (κ = 0.942) and gpt-5.5 96.9% (κ = 0.931), both meaningfully higher than the best numbers on any other evaluator in this study. gpt-5.4-mini (94.3%, κ = 0.870) is a genuine cost-saving alternative with no caveat attached. DeepSeek-V3.2 (κ = 0.769) is a credible non-OpenAI fallback with the best run-to-run reliability of any judge tested (flip rate 7.7%).

The one caveat is a threshold-calibration story, not a score-quality story. gpt-5.4-mini ranks traces nearly as well as the frontier (PR-AUC 0.95) but places its default pass cutoff too permissively, costing ~25 pp of precision; the same pattern is sharper for gpt-5.4-nano / gpt-5.2 / gpt-4o. Their scores are usable; the default thresholds aren’t. If you adopt one of these smaller judges, re-tune the threshold on your own data; if you adopt Claude Opus 4.7, gpt-5.5, or gpt-5.4, the default works out of the box.
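
Re-tuning a threshold on your own data can be as simple as a sweep over candidate cutoffs against whatever reference labels you trust. The sketch below is one way to do it; selecting by accuracy is an assumption (you may prefer precision or F1), and the scores, labels, and function names are illustrative only.

```python
def sweep_thresholds(scores, gt_pass, candidates=(2, 3, 4, 5)):
    """Score each candidate pass cutoff against reference pass/fail labels."""
    rows = []
    for threshold in candidates:
        pred = [score >= threshold for score in scores]
        tp = sum(p and g for p, g in zip(pred, gt_pass))
        fp = sum(p and not g for p, g in zip(pred, gt_pass))
        fn = sum(g and not p for p, g in zip(pred, gt_pass))
        correct = sum(p == g for p, g in zip(pred, gt_pass))
        rows.append({
            "threshold": threshold,
            "accuracy": correct / len(scores),
            "precision": tp / (tp + fp) if tp + fp else 0.0,
            "recall": tp / (tp + fn) if tp + fn else 0.0,
        })
    return rows


def best_threshold(rows, key="accuracy"):
    """Pick the cutoff with the highest value of `key` (first one wins ties)."""
    return max(rows, key=lambda row: row[key])["threshold"]


# Illustrative data only: a permissive judge's Likert scores and reference labels.
scores = [5, 4, 4, 4, 3, 3, 2, 1]
gt_pass = [True, True, True, False, False, False, False, False]

rows = sweep_thresholds(scores, gt_pass)
for row in rows:
    print(row)
print("re-tuned threshold:", best_threshold(rows))
```

In this toy example the default cutoff of 3 passes three traces the reference labels as fail (precision 0.5), while moving the cutoff to 4 recovers most of the lost precision without changing the underlying scores.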

The practical implication: CSAT is production-ready with any of Claude Opus 4.7, gpt-5.5, or gpt-5.4 at the default threshold; smaller judges are score generators that need their cutoffs re-validated, not judges to avoid.

3. Groundedness is the hard one, the only evaluator where judge tier moves the score itself

Groundedness is the one evaluator in this study where judge choice clearly degrades the score quality, not just the threshold placement. On FaithDial (300 traces, 100 hallucinated / 200 faithful), PR-AUC on the hallucinated class ranges from 0.82 (gpt-5.5, Grok-4) down to 0.60 (gpt-5.4-nano) against a 0.33 baseline, a 0.22 spread, the largest by judge tier in this study (DeepSeek-V3.2 sits mid-pack at 0.67). Accuracy moves the same way (gpt-5.5 84.7% → gpt-5.4-nano 72.7%) and flip rates run high (10–21%), so every axis points in the same direction.

The accuracy and PR-AUC gaps look worse than they are once you look at where the noise lives.

[Figure: Per-trace within-judge std-dev vs. mean Groundedness score (gpt-5.5 × 4 repeats, N=299)]

Within-judge reliability noise on FaithDial. Each red dot is one of the 299 FaithDial traces. The x-coordinate is the trace’s mean Groundedness score from gpt-5.5 across its 4 repeats (1–5 Likert). The y-coordinate is the std-dev of the same judge’s 4 repeats on that trace, i.e., same prompt, same judge, four independent runs. The dark-red line is the per-bin mean ±1 SEM in 0.5-point x-bins (bins with ≥ 5 traces only). Within-judge noise is concentrated in the borderline band. gpt-5.5 is essentially deterministic at the faithful extreme (x ≈ 4.5–5: bin mean ≈ 0.1) and tight at the hallucinated extreme (x ≈ 2: bin mean ≈ 0.4), but wobbles meaningfully through the middle (x ≈ 3–4: bin mean ≈ 0.5–0.8). The binary verdict at threshold ≥ 4 mostly absorbs that wobble (borderline scores fluctuate within the same side of the cut), which is why flip rates stay bounded even when 44% of traces show score-level fluctuation across repeats.
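
The per-bin curve described above can be approximated with the sketch below: compute each trace's mean and standard deviation across repeats, bucket traces by mean score, and average the standard deviation per bucket. Using the sample standard deviation and left-edge binning are assumptions, as are the function name and the toy data; the post specifies only 0.5-point bins with at least 5 traces.

```python
import math
from collections import defaultdict
from statistics import mean, stdev


def binned_noise(repeat_scores, bin_width=0.5, min_traces=5):
    """Mean within-judge std-dev per mean-score bin.

    repeat_scores: one list per trace with the score from each repeat.
    Returns {bin_left_edge: mean std-dev} for bins with enough traces.
    """
    bins = defaultdict(list)
    for scores in repeat_scores:
        left_edge = math.floor(mean(scores) / bin_width) * bin_width
        bins[left_edge].append(stdev(scores))  # sample std-dev across repeats
    return {
        edge: mean(stds)
        for edge, stds in sorted(bins.items())
        if len(stds) >= min_traces
    }


# Illustrative data only: 4 traces x 4 repeats. min_traces is lowered to 2
# so that this tiny example keeps at least one bin.
example = [
    [5, 5, 5, 5],  # mean 5.00, std 0.0
    [5, 5, 5, 4],  # mean 4.75, std 0.5
    [4, 3, 4, 3],  # mean 3.50, borderline
    [3, 4, 4, 4],  # mean 3.75, borderline
]
noise = binned_noise(example, min_traces=2)
print(noise)
```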

 

[Figure: Small-multiples violins: per-judge score distributions split by FaithDial ground-truth label (faithful / hallucinated), with per-judge ROC-AUC]

Per-trace Groundedness score distributions for each of the six candidate judges, split by FaithDial ground-truth label. Each panel is one judge. Within a panel, the green violin shows that judge’s score distribution on the 180 faithful traces and the red violin shows the same judge’s distribution on the 83 hallucinated traces (per-trace 4-repeat-mean scores). Black diamonds mark per-group means (μ); the dotted line at 3.5 marks the ≥ 4 pass threshold. Each panel title prints the judge’s ROC-AUC of score against the hallucinated label (0.5 = chance, 1.0 = perfect ranking). The top row (frontier judges) pushes the hallucinated mass tightly down to 2 (gpt-5.5 ROC-AUC 0.91, Grok-4 0.89). The bottom row (mid/small/open judges) shows the red violins climbing above the pass threshold; gpt-5.4-nano’s hallucinated distribution sits almost entirely above 3, barely distinguishable from its faithful distribution. All six judges agree on what faithful looks like; only the frontier judges agree on what hallucinated looks like. That asymmetry is the driver of the frontier-vs-rest PR-AUC gap and the reason small/open-weights judges are the wrong tier for a safety metric.
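
The ROC-AUC printed in each panel title has a simple interpretation: it is the probability that a randomly chosen hallucinated trace is ranked as riskier than a randomly chosen faithful one, with ties counting half. A minimal pairwise sketch of that definition follows; the function name and the toy scores are assumptions, and the quadratic loop is for clarity rather than speed.

```python
def roc_auc(labels, risk_scores):
    """Pairwise ROC-AUC: P(positive outranks negative), ties count 0.5."""
    positives = [s for y, s in zip(labels, risk_scores) if y == 1]
    negatives = [s for y, s in zip(labels, risk_scores) if y == 0]
    if not positives or not negatives:
        raise ValueError("both classes must be present")
    wins = 0.0
    for pos in positives:
        for neg in negatives:
            if pos > neg:
                wins += 1.0
            elif pos == neg:
                wins += 0.5
    return wins / (len(positives) * len(negatives))


# Illustrative data only: per-trace mean Groundedness scores and labels.
groundedness = [5.0, 4.5, 2.0, 4.0, 1.5, 3.0]
hallucinated = [0, 1, 1, 0, 1, 0]

# Hallucinated is the positive class, so lower groundedness = higher risk.
auc = roc_auc(hallucinated, [-g for g in groundedness])
print(f"ROC-AUC={auc:.3f}")
```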

So the picture is consistent: the noise is concentrated on genuinely borderline items, the boundary judges miss hallucinations rather than misranking them in general, and the safety-relevant minority class is where the frontier-vs-rest gap appears.

The deployment guidance follows directly: Groundedness is best run as a triage signal with a frontier reasoning judge (gpt-5.5 or Grok-4, the two judges that both rank well and separate the hallucinated class), tracking aggregate trends over time rather than hard-gating on individual fails. Smaller judges aren’t just shifted, they reshape the hallucinated distribution by missing items in the minority class you actually care about, which is the wrong failure mode for a safety metric. Groundedness is the one evaluator where the cost-optimisation lever from Task Completion does not transfer; pay for the frontier reasoning judge.

The Groundedness picture also surfaces two reporting points that apply across all four evaluators. Both are about how noise behaves around thresholds, and they’re why we lead with PR-AUC and report flip rate as a distribution rather than a single number:

  • Calibration at the extremes matters more than calibration in the middle. Deployment uses thresholds: a Likert-5 evaluator becomes a pass/fail signal at score ≥ 4 or ≥ 3. What we actually care about is judge–ground-truth agreement on the low and high ends of the score range; the middle is where the ground truth itself often sits close to the decision boundary. We treat “44% of traces show score-level fluctuation, but the binary verdict is stable” as a feature, not a bug: PR-AUC captures the trustworthy ranking even when threshold-level accuracy is noisy.
  • Reliability is not uniform across the score range. Variance in the verdict isn’t evenly distributed; it concentrates on items whose average score sits in the middle. Items that consistently score high (or consistently score low) almost never flip; items in the borderline band flip often, because small interaction differences tip them either way. When we report a flip rate, we report it knowing the shape of the flip distribution matters as much as the headline number.

4. Conversation Coherence is solid on the only dialogue-level human-labeled dataset we have

On FED (125 human-annotated dialogues, three-system spread of Human / Meena / Mitsuku), Conversation Coherence with Claude Opus 4.7 reaches 88.0% accuracy, κ = 0.711, with all six candidates achieving κ ≥ 0.47 at the operating point. The three-system spread provides clear quality separation (the rule-based chatbot Mitsuku is the negative class), which is what makes the dataset usable even at n=125.

The Validity view is consistent with the threshold view but adds nuance. PR-AUC on the incoherent class (the safety-relevant minority 40/125): gpt-5.5 = 0.88, gpt-5.4-mini = 0.86, Claude Opus 4.7 = 0.83, Grok-4 = 0.81, DeepSeek-V3.2 = 0.78, gpt-5.4-nano = 0.77. The top four candidates are tightly clustered, robust on coherence, while the smaller / open-weights judges trail by 5–11 points at ranking the rare incoherent cases. For coherence triage specifically, any of gpt-5.5 / gpt-5.4-mini / Claude is a safe pick.
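
The cross-judge band used for the robustness axis is just the spread between the best and worst judge on Validity. As a small worked example, the sketch below applies that to the Coherence PR-AUC values quoted above; the helper name `cross_judge_band` is hypothetical.

```python
# Coherence PR-AUC on the incoherent class, as quoted in the paragraph above.
coherence_pr_auc = {
    "gpt-5.5": 0.88,
    "gpt-5.4-mini": 0.86,
    "Claude Opus 4.7": 0.83,
    "Grok-4": 0.81,
    "DeepSeek-V3.2": 0.78,
    "gpt-5.4-nano": 0.77,
}


def cross_judge_band(pr_auc_by_judge):
    """Return (best_judge, worst_judge, spread) for one evaluator."""
    best = max(pr_auc_by_judge, key=pr_auc_by_judge.get)
    worst = min(pr_auc_by_judge, key=pr_auc_by_judge.get)
    # Round to the precision of the reported figures to avoid float noise.
    spread = round(pr_auc_by_judge[best] - pr_auc_by_judge[worst], 2)
    return best, worst, spread


best, worst, spread = cross_judge_band(coherence_pr_auc)
print(f"best={best} worst={worst} spread={spread}")
```

The same helper applied to the Groundedness range reported earlier (0.60 to 0.82) gives a spread of 0.22, twice the Coherence band.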

The honest caveat is that 125 dialogues of open-domain chitchat is not the same as 1,000 enterprise task-oriented agent sessions, and the binarization threshold (mean annotator score ≥ 1.5) is a design choice. Numbers are strong; transfer to enterprise domains needs re-validation.

Practical recommendations

If you’re standing up your own multi-turn evaluation, using these evaluators, a third-party stack, or judges you write yourself, these are the things that mattered most in our study.

Always pair validity with reliability. A high-validity judge with a high flip rate gives you the right ranking on average and changes its mind on rerun. The 4-repeat protocol takes more compute, but it’s what surfaces evaluators like Groundedness with gpt-5.4-nano (a weak-looking 0.60 PR-AUC paired with a 21.3% flip rate, meaning roughly one verdict in five changes on rerun) that look promising on a single run and aren’t. It also surfaces the reverse: Grok-4 on Task Completion has a 0.0% flip rate across 4 repeats, perfectly stable, even though its PR-AUC (0.80) is the lowest of the five candidates. Reliability and validity are independent dimensions; you want both.

Validity and operating-point accuracy can disagree and that’s diagnostic. For any Likert-vs-binary evaluator, threshold accuracy and PR-AUC can point at different judges. On Groundedness, all six candidates hover at ~73–85% accuracy but Validity spans 0.59–0.82. CSAT shows the reverse pattern: every judge’s PR-AUC against the human-validated panel GT lands in 0.87–0.97 (rank quality is uniformly strong), but threshold accuracy spans 43.6% → 97.4% because the judges place their decision boundaries in very different places; gpt-5.4-mini ranks traces almost as well as Claude (0.95 vs 0.97) but is so permissive at the default cutoff that it loses 25 pp of precision. Accuracy at the deployment threshold tells you how a single pass/fail gate would perform; Validity tells you how good the underlying score is for triage, prioritization, or threshold tuning. Report both; when they disagree, threshold tuning (not judge replacement) is often the right intervention.

Run multi-judge ablations. Headline numbers from a single judge are not portable. CSAT looked production-ready when gpt-5.4-mini was scored against a gpt-5.4 reference at the strict raw-Likert ≥ 4 threshold (92.5% accuracy, κ = 0.804); switching to a human-validated 3-judge panel GT at the rubric-aligned raw-Likert ≥ 3 threshold drops that same judge to 85.1% / κ = 0.699 and surfaces Claude Opus 4.7 (97.4%, κ = 0.942) and gpt-5.5 (96.9%, κ = 0.931) as the new top picks. Same data, same evaluator, very different “ready for production” answer depending on what you compare against and where you set the cutoff. Pick at least one frontier judge and one cheap judge per evaluator, sweep the threshold, and report both. If they disagree on a metric you actually care about, the ablation is the only thing telling you which one to trust.

Test cross-domain. Numbers on a single benchmark generalize only to that benchmark. Task Completion on τ²-bench looks production-ready; Task Completion on BFCL v4 has a 20-point accuracy gap and a fundamentally different failure mode (silent state corruption). Cross-domain testing, even on a small audited internal corpus like our 34-session Copilot CLI sample, is what catches that.

Use deterministic checks where they exist. If your domain has a programmatic oracle (was a refund record created? did the function return the right value?), the oracle beats an LLM judge every time. Multi-turn evaluators are for the cases where no oracle exists.

Treat groundedness as a trend, not a gate, and pay for the frontier judge. Per-claim, any-failure-fails scoring is the right safety design and the wrong design for single-pass gating. Track Groundedness as an aggregate signal over time, and route low-scoring sessions to human review. This is also the one evaluator where the cost-optimisation lever doesn’t work: the gpt-5.4-nano tier doesn’t just shift the mean upward, it collapses the hallucinated cluster in the score distribution. Small/open judges are missing the failures the metric exists to catch (PR-AUC 0.60–0.67 vs. 0.77–0.82 for gpt-5.5, Grok-4, Claude Opus 4.7). On Task Completion you can drop two judge tiers and lose 2 pp of accuracy; on Groundedness you lose ~20 pp of Validity and get the wrong failure mode.
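
A trend-plus-triage setup along these lines could look like the sketch below: aggregate the score per release for trend tracking and queue individual low scorers for human review instead of failing the build. The record fields (`session_id`, `release`, `score`), the release names, and the review cutoff are all hypothetical choices for illustration.

```python
from collections import defaultdict
from statistics import mean


def triage(sessions, review_below=4):
    """Return (mean score per release, session ids to send to human review)."""
    by_release = defaultdict(list)
    review_queue = []
    for session in sessions:
        by_release[session["release"]].append(session["score"])
        if session["score"] < review_below:
            review_queue.append(session["session_id"])
    trend = {
        release: mean(scores)
        for release, scores in sorted(by_release.items())
    }
    return trend, review_queue


# Illustrative data only: Groundedness scores for five sessions in two releases.
sessions = [
    {"session_id": "s1", "release": "v1", "score": 5},
    {"session_id": "s2", "release": "v1", "score": 3},
    {"session_id": "s3", "release": "v1", "score": 4},
    {"session_id": "s4", "release": "v2", "score": 5},
    {"session_id": "s5", "release": "v2", "score": 2},
]
trend, review_queue = triage(sessions)
print(trend)         # aggregate signal to watch over time
print(review_queue)  # individual sessions routed to a reviewer
```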

Plan for ground truth before you plan for measurement. Public agent benchmarks ship outcome labels, not process labels. If you’re measuring a process property (grounding, coherence, anything trajectory-level), expect to either generate labels yourself (majority-vote + human review on the ambiguous tail worked well for us) or accept that you’re validating against an LLM reference scorer with the limitations that implies.

When you’d reach for something else

Multi-turn evaluators aren’t the right tool for every evaluation. Reach for an alternative when:

  • You have a deterministic oracle. Programmatic checks (“was a refund record created?”, “does the function return the right value?”) are cheaper, faster, and more correct than any LLM-as-judge.
  • You need verifiable agent correctness against ground-truth identifiers. LLM judges score conversational quality, not the truthfulness of specific identifiers. Combine with deterministic checks.
  • You’re red-teaming for adversarial inputs. That’s a fuzz-testing problem, not a quality problem; the metric design is different.

For the use cases multi-turn evaluators are built for, measuring session-level agent quality before you have a production user base, catching regressions during development, and quantifying the impact of prompt or model changes on the agent’s intrinsic quality, they do the job at a quality bar we’re comfortable shipping against. That’s the bar this study was designed to verify.

Three things to take away

Whatever evaluator stack you end up with, three takeaways from this study are worth holding onto.

Measure the evaluator, not just the agent. Evaluation is recursive: any tool you use to score your agent is itself a system that can be wrong. Building a rubric for the evaluator with the same rigor you’d build one for the agent is what stops evaluator quirks from quietly distorting your agent scores.

Reliability is half the story. Flip rate is not a vanity metric. A judge that disagrees with itself across reruns will produce phantom regressions; an evaluator’s run-to-run variance is the noise floor for every comparison you do downstream.

Judge sensitivity varies by evaluator, and varies in kind, not just in degree. Three of the four evaluators (TC, CSAT, Coherence) are judge-robust on the score (PR-AUC tightly clustered across all six candidates); only Groundedness has a score-quality gap across judge tiers (PR-AUC spans 0.60–0.82, frontier reasoning judges clearly ahead). On CSAT, smaller judges produce good scores at miscalibrated thresholds, a fixable problem. On Groundedness, smaller judges produce worse scores, not a fixable problem. The headline accuracy table is a starting point. The per-judge ablation is what tells you which category of caveat applies.

 

Closing Thoughts

As AI applications become more sophisticated, especially with multi-turn interactions, observability is no longer optional. It’s foundational. Developers need the ability to understand how their applications behave across turns, diagnose issues quickly, and continuously improve quality and reliability.

Why This Matters

Without strong observability, even the most powerful AI systems can become opaque, unpredictable, and difficult to optimize. Foundry observability gives developers the tools to:

  • Trace and evaluate multi-turn interactions with clarity
  • Identify performance gaps and improve system behavior
  • Build more trustworthy, production-ready AI experiences

Ultimately, it empowers you to move faster with confidence, turning insights into better user outcomes and more robust applications.

Get Started

Ready to go deeper? Explore these resources to start building and learning more about Foundry observability:

 

 

[1] Three of the four datasets in this study have an external reference: τ²-bench has a deterministic DB-state oracle (TC), FaithDial ships hallucination labels (Groundedness), and FED ships 5-annotator human labels (Coherence). CSAT has none: there is no oracle for “how satisfied would the user be?”. We build a reference in two steps. (i) A 3-judge frontier panel (gpt-5.5, Claude Opus 4.7, and Grok-4) votes, with each judge contributing its mode pass/fail across 4 repeats; a label is tentatively assigned when ≥ 2 of 3 agree, and pass is defined as raw Likert ≥ 3, matching the prompty rubric’s “would not complain” boundary. (ii) Every 2-vs-1 split case is manually reviewed by a human annotator in a small labeling portal. On our 456-trace τ²-bench set the panel split on 116 traces (25.4%) and the reviewer confirmed all 116 majority labels (0 overrides, 0 skips). The earlier convention of “gpt-5.4 as reference scorer at raw ≥ 4” was retired because (a) the single-judge reference was self-referential, and (b) the stricter threshold collapsed the rubric’s “Satisfied with minor gaps” tier into fail, suppressing the pass rate to an artificial 9.4%.

Updated Jun 24, 2026
Version 1.0