Microsoft Foundry Blog

Evaluating AI Agents: Can LLM‑as‑a‑Judge Evaluators Be Trusted?

kaiqi
Jan 05, 2026
In our first blog post, we introduced a set of evaluators that are essential for evaluating AI agents, many of which are LLM‑as‑a‑Judge evaluators. In this second post of the series, we focus on the reliability of LLM‑as‑a‑Judge evaluators, sharing our key learnings and takeaways from studying their behavior in practice. These insights provide guidance and direction on how LLM‑based evaluators can be improved—a topic we will explore in depth in the next blog post.

In our previous blog post, we explored why evaluating AI agents is more complex than evaluating a single Large Language Model (LLM), and we introduced the diverse set of evaluators needed to gain a holistic understanding of agent behavior. Many of the evaluators we discussed fall under the category of LLM-as-a-Judge—a critical evolution from the traditional Human-as-a-Judge approach.

This shift is understandable: human evaluations are slow, costly, and difficult to scale, especially when agents must be evaluated across large datasets or in near real time. At Microsoft Foundry, we even introduced the concept of Continuous Evaluation, which would be prohibitively costly and nearly impossible to perform with humans-in-the-loop. Continuous human evaluation also raises significant privacy concerns, particularly for customer-facing agents where evaluators could be exposed to real customer data. These challenges make LLM-as-a-Judge approaches even more compelling.

However, this shift raises a fundamental question:

Are LLM-as-a-Judge evaluators reliable enough for practical use?

In this blog, we aim to address this question by sharing results from recent research conducted within Microsoft Foundry, and we summarize key findings that shed light on the current capabilities—and limitations—of LLM-as-a-Judge evaluation.

Why LLM-as-a-Judge is Necessary

The simple answer is scalability.

While human judgment remains the gold standard—offering unparalleled nuance and contextual understanding—it is expensive, slow, and cannot be scaled to meet the demands of continuous, large-volume AI agent development.

LLMs, conversely, can process evaluation rubrics consistently and produce judgments at massive scale. Since the judge often needs only to follow a predefined set of rules, this is a task LLMs excel at.

Beyond scalability, as noted in the introduction, privacy is another major factor. With LLM-as-a-Judge, the data being evaluated can remain fully contained within the automated evaluation pipeline, avoiding the privacy risks that arise when human evaluators may be exposed to sensitive customer information.

However, substituting human judges with LLMs introduces new challenges:

  • Sampling Noise: Even with temperature set to 0, inherent stochasticity can lead to score variability across runs.
  • Model Disagreement: Different LLM families (e.g., GPT-4o vs. GPT-4) may interpret the same criteria differently.
  • Self-Bias: A judge LLM might subtly favor the outputs generated by models from its own family.
  • Human Misalignment: Model reasoning may not always align with human expectations of quality.

Measuring the LLM-as-a-Judge Evaluators: Three Pillars of Reliability

If we are to rely on LLMs to judge our agents, we must first quantify the quality of the judge itself. What defines a trustworthy LLM-as-a-Judge evaluator? It must:

  1. Make general sense (align with human judgment).
  2. Have little internal noise (be self-consistent).
  3. Produce results independent of the specific judge model used (inter-model agreement).

These characteristics define reliability. We therefore focus on three essential reliability metrics:

| Metric | Definition | Importance |
|---|---|---|
| Human Alignment | How closely do evaluator scores match human judgments? | External Validity: Ensures the evaluator reflects true human perception of quality. |
| Self-Consistency | How stable is the evaluator’s score for the same input across repeated runs? | Internal Stability: Measures low variance across repeated, identical evaluations. |
| Inter-Model Agreement | Do different judge LLMs (e.g., GPT-4o, GPT-5) produce similar scores for the same input? | Cross-Model Robustness: Confirms the evaluation rubric is clear and portable across different LLMs. |

Together, these metrics provide a crucial, multidimensional view of an LLM evaluator’s trustworthiness.
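
As a rough illustration, all three reliability metrics can be estimated from a grid of judge scores. The sketch below assumes we have, for each evaluated item, repeated runs from several judge models plus a human label; the function names and data layout are illustrative, not part of the Foundry evaluators.

```python
from itertools import combinations
from statistics import mean, variance

def human_alignment(judge_scores, human_labels):
    """Fraction of items where the judge's score exactly matches the human label."""
    return mean(int(j == h) for j, h in zip(judge_scores, human_labels))

def self_consistency(runs_per_item):
    """Average per-item variance across repeated runs of the same judge (lower is better)."""
    return mean(variance(runs) for runs in runs_per_item)

def inter_model_agreement(scores_by_model):
    """Average pairwise exact-agreement rate between different judge models.

    scores_by_model maps a model name to its list of per-item scores.
    """
    pair_rates = [
        mean(int(a == b) for a, b in zip(scores_by_model[m1], scores_by_model[m2]))
        for m1, m2 in combinations(scores_by_model, 2)
    ]
    return mean(pair_rates)
```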

Can We Trust LLM-as-a-Judge?

To quantitatively assess the reliability of LLM-as-a-Judge evaluators, we conducted a study using the evaluators most frequently used within the Microsoft Foundry platform. These evaluators are implemented as LLMs instructed via specialized prompts to score specific dimensions of agent quality.

Dataset Used in the Study

To stress‑test the LLM judges, we needed a dataset that captured the full spectrum of agent performance—from clear successes to ambiguous or edge‑case failures. We therefore generated a large‑scale synthetic dataset of 600 conversations, comprising 3,378 turns, focused on the Location domain. These conversations utilized 10 Azure Maps functions (e.g., azure_maps_weather_forecast and azure_maps_search), giving the agents realistic capabilities to invoke during evaluation.

Dataset generation variables:
  • Agent Quality Variants: We created agents with Excellent, Average, and Bad performance profiles by varying their system prompts. All shared the same base model (gpt-4o-mini, 2024-07-18) and tools, making the system prompt the key differentiator in behavior.
  • Synthetic User Personas:
    • Normal User: Uses clear, explicit language (e.g., "What’s the weather in Seattle, WA tomorrow?").
    • Ambiguous User: Uses vague language and provides less context (e.g., "Is it going to rain there tomorrow?").
  • Scenarios: We defined a set of scripted scenarios (e.g., WeatherByLocation, DistanceBetweenLocations) that outlined a sequence of a main question followed by possible follow-up questions, ensuring multi-turn complexity.

Combining 3 agent types × 2 user types × 6 scenarios produced a rich, controlled dataset with predictable successes, failures, and edge cases—ideal for evaluating judge models.
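
For concreteness, the combinatorial design can be sketched as a simple cross product. The snippet below is a minimal sketch; the profile, persona, and scenario names are illustrative placeholders rather than the exact generation pipeline.

```python
from itertools import product

agent_profiles = ["excellent", "average", "bad"]   # varied via the system prompt only
user_personas = ["normal", "ambiguous"]            # clear vs. vague phrasing
scenarios = [                                      # two of the six scripted scenarios
    "WeatherByLocation",
    "DistanceBetweenLocations",
]

# With all six scenarios this yields 3 x 2 x 6 = 36 configurations; sampling
# multiple multi-turn conversations per configuration produces the
# 600-conversation synthetic dataset.
configurations = list(product(agent_profiles, user_personas, scenarios))
```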

Responsible AI considerations

All data used in this study was synthetic.

  • No personally identifiable information (PII) or real user data was involved.
  • The dataset contains no harmful, offensive, or sensitive content.

This approach aligned with Microsoft’s Responsible AI principles while enabling high‑volume experimentation.

Sample conversation

Below is an example conversation from the synthetic dataset. The agent has access to 10 Azure Maps API functions, which it may invoke as needed to answer the user’s queries.

| Turn | User | Agent |
|---|---|---|
| 1 | Calculate the distance between Rome, Italy and Kinshasa. | The distance between Rome, Italy, and Kinshasa, Democratic Republic of the Congo, is approximately 5146.46 kilometers. |
| 2 | What's the distance in miles? | The distance between Rome, Italy, and Kinshasa, Democratic Republic of the Congo, is approximately 3195.64 miles. |
| 3 | And the coordinates for Kinshasa? | The coordinates for Kinshasa, Democratic Republic of the Congo, are approximately Latitude: -4.3215, Longitude: 15.3119. |
| 4 | What's the current weather in Rome, Italy? | The current weather in Rome, Italy is as follows: Condition: Sunny; Temperature: 11.5°C; Feels Like: 11.6°C; Humidity: 95%; Wind: 8.5 km/h from the Northeast. Overall, it's a pleasant sunny day in Rome! |

Evaluators Studied

In this study, we focused on four critical LLM‑as‑a‑Judge evaluators that are most commonly leveraged.

  1. Intent Resolution: Did the agent correctly identify and fulfill the user’s core intent?
  2. Tool Call Accuracy: Were tools called appropriately with correct inputs and outputs?
  3. Task Adherence: Did the agent follow instructions and system constraints?
  4. Relevance: Was the final response on-topic and useful?

Each evaluator consumed the full conversation history, the agent’s response, and the available tool definitions. Their outputs typically included:

  • A 5‑point Likert score (except Tool Call Accuracy, which is primarily binary)
  • A pass/fail determination
  • A reason explaining the evaluator’s judgment

To minimize non‑determinism, we tested each evaluator using multiple OpenAI judge models (e.g., gpt‑4o‑mini, the gpt‑4.1 family, and the gpt‑4o family) at temperature = 0. Even so, slight variance can occur due to internal probability ties.

Sample Intent Resolution Output (gpt‑4.1‑nano):

```json
{
  "intent_resolution": 5.0,
  "intent_resolution_result": "pass",
  "intent_resolution_threshold": 3,
  "intent_resolution_reason": "The assistant successfully identified the user's intent to find the current weather in Rome, Italy, and correctly called the appropriate weather function with the correct parameters. It then accurately interpreted the returned data and provided a clear, detailed weather report. The response fully addresses the user's query with precise and relevant information, demonstrating a complete resolution."
}
```

Judge LLM models

For this study, we evaluated each conversation turn using five judge model variants: GPT‑4o, GPT‑4o‑mini, GPT‑4.1‑mini, GPT‑4.1‑nano, and GPT‑4.1. These models represent different sizes and versions within the GPT‑4 family available to us.

Each model, when provided with the evaluator prompt (for Intent Resolution, Relevance, etc.), may generate slightly different scores due to differences in capability, reasoning style, and model architecture. We therefore treated each model as an independent evaluator—or “judge”—and later analyzed the degree to which their judgments aligned or diverged.

Key Learnings

Learning #1: LLM-as-a-Judge evaluators are surprisingly inconsistent

A surprising finding: Even with temperature set to 0, the same model, same input, same seed sometimes produced different evaluation scores.

For example, when we ran the Intent Resolution evaluator 100 times on the exact same input, the judge model produced fluctuating scores across runs (e.g., one GPT‑4.1 variant oscillated between scores of 3 and 4).

Table 1. Score distribution for 100 runs of Intent Resolution evaluator, when powered by different models, for the same input.

| Model | Score 1 | Score 2 | Score 3 | Score 4 | Score 5 |
|---|---|---|---|---|---|
| gpt-4.1 | 0 | 0 | 40 | 60 | 0 |
| gpt-4.1-mini | 0 | 0 | 4 | 34 | 62 |
| gpt-4.1-nano | 0 | 0 | 48 | 52 | 0 |
| gpt-4o | 0 | 0 | 14 | 69 | 17 |
| gpt-4o-mini | 0 | 0 | 0 | 90 | 10 |

Why does this happen?
Even with temperature set to 0, models may internally view multiple next-token paths (e.g., "score": 3 vs. "score": 4) as equally probable. When tie-breaking occurs, small sources of nondeterminism in sampling, hardware, or model internals can cause the output to shift.

Takeaway: Temperature 0 ≠ determinism. LLM-based evaluators exhibit inherent variance—and this variance must be managed and mitigated.
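
One common mitigation is to run the judge several times on the same input and aggregate the scores, for example by median or majority vote. The sketch below reuses the hypothetical judge_intent_resolution helper from earlier; the aggregation strategy itself is illustrative, not a Foundry API.

```python
from collections import Counter
from statistics import median

def aggregated_judgment(conversation: str, n_runs: int = 5) -> dict:
    """Run the judge repeatedly and aggregate to damp sampling noise."""
    scores = [
        judge_intent_resolution(conversation)["intent_resolution"]
        for _ in range(n_runs)
    ]
    majority_score, votes = Counter(scores).most_common(1)[0]
    return {
        "median_score": median(scores),
        "majority_score": majority_score,
        "majority_votes": votes,
        "raw_scores": scores,
    }
```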

Learning #2: Inter-model disagreement is real

Another key finding from our study is that different judge LLMs often disagree with one another—even when evaluating the exact same input under identical conditions. While this may seem surprising, the pattern is both consistent and explainable.

Each model family exhibits its own “scoring personality”:

  • Some judge LLMs tend to be more lenient,
  • Others are more strict,
  • And some show noticeably higher variance in how they interpret the rubric.

The table below summarizes the pairwise inter-LLM agreement rates for the Intent Resolution evaluator across the different models studied, using both binary and Likert scoring schemes. While agreement tends to be higher for binary scores, the average agreement rate remains around the 80% range — indicating that scores from different LLMs can still diverge significantly.

Figure 1. Summary table of pairwise agreement rates for the intent resolution evaluator when using different LLMs as judges, under both Likert-scale and binary scoring.

This disagreement is not random noise; it is a systematic property of LLM‑as‑a‑Judge systems. Model choice alone can shift evaluation results, meaning an agent that “passes” under one judge might be rated borderline under another. Without calibration, this introduces inconsistency into evaluation pipelines—especially at scale.

Takeaway: Model choice matters. Calibration or ensemble methods may be necessary.
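
One simple ensemble, sketched below under the same assumptions as the earlier snippets, polls several judge models and takes the majority pass/fail verdict; the pass threshold of 3 mirrors the sample evaluator output shown above, while the model list and aggregation are illustrative.

```python
from statistics import mode

JUDGE_MODELS = ["gpt-4.1", "gpt-4o", "gpt-4o-mini"]
PASS_THRESHOLD = 3  # mirrors intent_resolution_threshold in the sample output

def ensemble_pass_fail(conversation: str) -> str:
    """Poll several judge models and return the majority pass/fail verdict."""
    verdicts = []
    for model in JUDGE_MODELS:
        score = judge_intent_resolution(conversation, judge_model=model)["intent_resolution"]
        verdicts.append("pass" if score >= PASS_THRESHOLD else "fail")
    return mode(verdicts)  # majority vote across the judge models
```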

Learning #3: LLM-as-a-Judge systems can approximate human judgment reliably—but only for certain metrics and with careful calibration

Because even human judges disagreed on borderline cases, perfect AI–human agreement is neither expected nor required. An evaluator performing within the range of human–human disagreement effectively reaches human parity.

How often do human judges agree with each other?​

  • Likert (1–5 scale): 60–70% agreement
  • Binary (pass/fail): 89–93% agreement

To evaluate LLM judges, we compared their outputs against a panel of seven human experts, all using the same strict rubrics.

  • Where LLMs work well:
    For more objective metrics (e.g., Relevance and Tool Call Accuracy), leading models achieved alignment levels comparable to human–human agreement, meaning they can reliably serve as scalable proxies for human evaluators.
  • Where challenges remain:
    For metrics requiring deeper interpretation—especially Intent Resolution—models showed noticeably higher variance. LLMs often struggled with distinctions such as partial credit vs. all‑or‑nothing scoring, echoing the same gray areas that caused humans to disagree with each other.
    For intent resolution, we observe that:
    • Mean LLM agreement on Likert scale: 58.7%
    • Mean LLM agreement on Binary scale: 75.6%

These numbers fall below human–human agreement levels, indicating that some types of evaluators still require human judgment to ensure reliability. 

Takeaway: While LLM-as-a-Judge evaluators can match human reliability on well-defined, objective metrics, human judgment remains the gold standard for tasks involving nuanced interpretation or ambiguous cases.

Learning #4: Higher self-consistency and inter-model agreement correlate with human agreement

During LLM-as-a-Judge evaluator development, we often rely on a sampled validation set with human labels to measure alignment with human judgment. Even with a well‑designed sampling strategy, human labeling remains the most time‑consuming and least scalable component. These findings suggest opportunities to design smarter, more efficient sampling strategies for future human‑labeling efforts.

Across all evaluators studied, we observed an interesting pattern: Evaluators with lower variance and higher inter-model agreement were also the ones that aligned most closely with human judgments.
To illustrate this, we compared two versions of our Intent Resolution evaluator (v1 and v2). The table below summarizes their human alignment, inter-model agreement, and self-consistency:

| Criterion | Metric | Intent Resolution v1 | Intent Resolution v2 |
|---|---|---|---|
| Human alignment | Agreement (binary) | 66.0% | 75.7% |
| | Precision | 61.6% | 90.5% |
| | Recall | 83.3% | 62.2% |
| | F1-score | 69.7% | 72.7% |
| Inter-model consistency | Inter-model (Likert) | 58.7% | 69.9% |
| | Inter-model (Binary) | 75.6% | 88.9% |
| Self-consistency | Intra-model variance | 0.091 | 0.061 |

Intent Resolution v2 exhibits higher inter-model agreement, lower variance, and stronger alignment with human experts on most metrics, showing that it is an improvement over Intent Resolution v1.
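
As a rough sketch, the human-alignment rows of the table above can be computed from a labeled validation set with standard classification metrics (assuming binary pass/fail labels from both the humans and the evaluator; the label arrays below are illustrative placeholders):

```python
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score

# Illustrative labels, aligned item by item: 1 = pass, 0 = fail.
human_labels = [1, 1, 0, 1, 0, 1, 1, 0]
judge_labels = [1, 1, 0, 0, 0, 1, 1, 1]

print("Agreement (binary):", accuracy_score(human_labels, judge_labels))
print("Precision:", precision_score(human_labels, judge_labels))
print("Recall:", recall_score(human_labels, judge_labels))
print("F1-score:", f1_score(human_labels, judge_labels))
```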

Takeaway: Human labeling can be minimized to a small, carefully selected validation set—as long as the evaluation prompt is unambiguous and the LLM evaluator demonstrates strong inter- and intra-model consistency.

Learning #5: Evaluation should be contextual when assessing a given turn

Most agent platforms, including Microsoft Foundry, evaluate outputs on a per‑turn basis, which is useful for pinpointing issues. However, judging a single turn in isolation can lead to misclassification—just as it would for human reviewers.

For example, consider the following:

  • Turn 1 (User): “What’s the weather in Paris tomorrow?”
  • Turn 2 (User): “Will it be sunny there tomorrow?”

A correct agent response to Turn 2 must resolve “there” as Paris, using context from Turn 1. But if an evaluator sees only Turn 2’s response—without the preceding context—the reply:

“It will be sunny tomorrow in Paris…”

may appear irrelevant or incorrect, because the user never explicitly mentioned “Paris” in that specific turn. This is especially problematic for metrics such as Intent Resolution and Relevance, which depend on understanding the conversation flow and references across turns.

Takeaway: LLM-as-a-Judge only works when evaluators can see and understand the full conversation context. Isolated turn-level evaluation is insufficient for context-dependent metrics.
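
As a minimal sketch of the difference, reusing the hypothetical judge_intent_resolution helper from earlier, the judge should receive the full transcript rather than the final turn in isolation:

```python
conversation = [
    {"role": "user", "content": "What's the weather in Paris tomorrow?"},
    {"role": "assistant", "content": "Tomorrow in Paris: partly cloudy with a high of 14°C."},
    {"role": "user", "content": "Will it be sunny there tomorrow?"},
    {"role": "assistant", "content": "It will be sunny tomorrow in Paris in the afternoon."},
]

def render(turns):
    """Flatten a list of turns into a plain-text transcript for the judge prompt."""
    return "\n".join(f"{t['role']}: {t['content']}" for t in turns)

# Context-aware: the judge sees the full history and can resolve "there" -> Paris.
full_context_verdict = judge_intent_resolution(render(conversation))

# Context-blind: judging only the last exchange risks a false "irrelevant" verdict.
last_turn_verdict = judge_intent_resolution(render(conversation[-2:]))
```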

What’s Next

In this post, we explored why LLM-as-a-Judge is needed, how it performs in practice, and what our research revealed about its reliability.

Short summary:

  • LLM-as-a-Judge is essential for scale, privacy and affordability
  • It can be trusted—for the right metrics and with the right calibration
  • But inconsistency, model disagreements, and contextual blind spots require careful handling

In our next blog, we’ll build on these findings to discuss:

  • How to improve LLM-as-a-Judge quality
  • Techniques for reducing variance and boosting alignment
  • A proposed human–AI teaming strategy for defining ground truth and establishing a robust “gold standard”

Stay tuned!
