
Microsoft Foundry Blog
7 MIN READ

Evaluating AI Agents: Techniques to Reduce Variance and Boost Alignment for LLM Judges

jamesasher
Microsoft
Mar 04, 2026

In this blog post, we continue our discussion on LLM-as-a-Judge evaluators, focusing on how their performance can be improved and offering practical suggestions for teams considering incorporating such evaluators into their deployment cycle.

In an ideal world, an LLM judge would behave like an experienced Subject Matter Expert (SME). To achieve this, we must align the judge with SME preferences and minimize any systematic biases it exhibits. (See our previous article for a detailed overview of common bias types in LLM judges.) We begin with techniques to improve alignment.

Pre-calibration of LLM judges to human preferences

Choosing the right models to calibrate

Choosing the model, models, or model family that drives your LLM judge comes down to a trade-off between cost and capability. Larger models tend to be more effective than smaller ones [1]; they also tend to be more expensive and slower. When deploying a suite of judges (as most projects require), decide carefully which areas of your system most need tight alignment and consider deploying more expensive models there, while using smaller, cheaper models in less critical areas. Another useful tactic is to randomly sample data points and route them intermittently to the larger judges. Fundamentally, model choice is a design choice, and systematic testing is needed to decide which models to use where.
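As a rough sketch of that routing tactic, the snippet below always sends critical areas to the larger judge and audits a random fraction of the rest with it. The model names and audit rate are illustrative placeholders, not a real Foundry API.

```python
import random

# Hypothetical routing sketch: model names and the audit rate are
# illustrative placeholders, not a real Foundry API.
def route_to_judge(sample_id: str, critical: bool, audit_rate: float = 0.1,
                   rng: random.Random = random.Random(42)) -> str:
    """Pick which judge model scores a sample.

    Critical areas always get the larger (more aligned, more expensive)
    judge; elsewhere, a small random fraction is audited by the larger
    judge so the cheaper judge's scores can be checked over time.
    """
    if critical or rng.random() < audit_rate:
        return "large-judge-model"
    return "small-judge-model"

assignments = [route_to_judge(f"sample-{i}", critical=False) for i in range(1000)]
large_share = assignments.count("large-judge-model") / len(assignments)
print(round(large_share, 2))  # roughly 0.1 with the default audit rate
```

Comparing the audited subset's scores against the cheap judge's over time gives an ongoing check that the cheaper model stays aligned.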

Calibrating models to align with human preferences

LLM judges are delicate and highly responsive to their system prompts, so consistency is paramount. Once you have a system prompt that has been shown to align with human preferences, stick with it for the duration of your evaluation. Adjust the system prompt in advance, and only in advance, of evaluation; otherwise you risk moving the goalposts to make sure your shot goes in.

Therefore, the goal of calibration is to adjust the system prompt of a model to best align with responses that would be given by an SME. Unfortunately, this does mean that we need to collect and label SME responses such that we can evaluate the alignment.

Consider the example below where we wish to train an LLM judge to score a response from 1-5. This judge could be used to score a plethora of AI applications. Here’s how we can successfully align the judge.

  1. Create a stratified sample of diverse responses.
    • Ensure the full range of potential values is covered (e.g. 1–5 or 1–10).
    • Include edge cases and ambiguous samples.
    • Ensure diversity across content length, quality, tone, and so on.
    • Hold out validation and test sets as standard.

An example:

| Response | Tone | Length |
| --- | --- | --- |
| ‘This is response A…’ | Clear and concise | 300 |
| ‘This is response Z…’ | Unclear and directionless | 150 |

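The stratified split in step 1 can be sketched in a few lines of Python. The `bucket` field and the response data below are placeholders for an initial triage of your own samples, not part of any Foundry API.

```python
import random
from collections import defaultdict

# Illustrative data: 90 responses with a coarse quality bucket assigned
# during an initial triage pass (the field names are placeholders).
responses = [
    {"id": i, "bucket": bucket, "text": f"response {i}"}
    for i, bucket in enumerate(["low", "mid", "high"] * 30)
]

def stratified_split(items, key, val_frac=0.2, test_frac=0.2, seed=7):
    """Split items into calibration/validation/test sets, stratified by
    `key` so every bucket is represented in every split."""
    rng = random.Random(seed)
    by_bucket = defaultdict(list)
    for item in items:
        by_bucket[item[key]].append(item)
    cal, val, test = [], [], []
    for bucket_items in by_bucket.values():
        rng.shuffle(bucket_items)
        n_val = int(len(bucket_items) * val_frac)
        n_test = int(len(bucket_items) * test_frac)
        val.extend(bucket_items[:n_val])
        test.extend(bucket_items[n_val:n_val + n_test])
        cal.extend(bucket_items[n_val + n_test:])
    return cal, val, test

cal, val, test = stratified_split(responses, key="bucket")
print(len(cal), len(val), len(test))  # 54 18 18
```

Stratifying before splitting guards against a validation or test set that accidentally misses an entire quality band.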
2. Have human labelers annotate or score each response.

    • Agree on clear and consistent scoring criteria, ideally as a group.
    • Have SMEs score the responses independently.
    • Calculate inter-annotator agreement using either Cohen’s Kappa (for two annotators) [2] or Fleiss’ Kappa (for three or more) [3]. Use weighted kappa calculations if required; you may want to penalize large disagreements more severely.
    • Target κ > 0.6; if agreement falls well short of this, a joint discussion or adjudication round may be needed, as the questions or scoring criteria may be severely ambiguous.
    • The SMEs should score responses blind to accompanying information such as tone and length.

Example extended:

| Response | Tone | Length | SME 1 score | SME 2 score | SME 3 score |
| --- | --- | --- | --- | --- | --- |
| ‘This is response A…’ | Clear and concise | 300 | 3 | 2 | 4 |
| ‘This is response Z…’ | Unclear and directionless | 150 | 4 | 3 | 4 |

 
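For the agreement check in step 2, Fleiss’ Kappa can be computed directly in a few lines (libraries such as statsmodels also provide it). The scores below are illustrative, not taken from the table above.

```python
# Minimal Fleiss' kappa for the three-SME setting.
def fleiss_kappa(ratings, categories):
    """ratings: list of per-item lists of category labels, one per rater."""
    n = len(ratings[0])            # raters per item (must be constant)
    N = len(ratings)               # number of items
    # n_ij: count of raters assigning item i to category j
    counts = [[row.count(c) for c in categories] for row in ratings]
    # per-item agreement P_i
    P_i = [(sum(c * c for c in row) - n) / (n * (n - 1)) for row in counts]
    P_bar = sum(P_i) / N
    # marginal category proportions p_j and chance agreement P_e
    p_j = [sum(row[j] for row in counts) / (N * n) for j in range(len(categories))]
    P_e = sum(p * p for p in p_j)
    return (P_bar - P_e) / (1 - P_e)

# Illustrative data: five responses, each scored 1-5 by three SMEs.
scores = [[3, 2, 3], [4, 4, 4], [1, 1, 2], [5, 5, 5], [2, 3, 2]]
kappa = fleiss_kappa(scores, categories=[1, 2, 3, 4, 5])
print(round(kappa, 3))  # 0.494 -> below the 0.6 target, so adjudicate
```

Note that this unweighted kappa treats a 1-vs-5 disagreement the same as a 3-vs-4 one; for ordinal scales a weighted variant is often preferable, as mentioned above.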

3. Create a baseline system prompt and compare with human scores.

    • Create an initial system prompt for the judge.
    • Pass the validation-set responses to the judging LLM and retrieve its scores.
    • Compare the LLM judge’s scores to the human-annotated scores with correlation metrics such as Spearman’s or Pearson’s coefficient.
    • Compare the agreement rate between the SME consensus score (rounded mean, median, or mode) and the LLM judge’s scores, again using the appropriate kappa measure.
    • Conduct line-by-line error analysis of discrepancies between human and LLM judges, and explore whether any systematic bias exists.
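A minimal, dependency-free sketch of the correlation check in step 3, using Spearman’s rank correlation on illustrative consensus and judge scores:

```python
# Spearman's rank correlation, implemented from scratch for illustration
# (scipy.stats.spearmanr does the same job).
def rank(xs):
    """Assign average ranks, handling ties."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    ranks = [0.0] * len(xs)
    i = 0
    while i < len(xs):
        j = i
        while j + 1 < len(xs) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg = (i + j) / 2 + 1          # ranks are 1-based
        for k in range(i, j + 1):
            ranks[order[k]] = avg
        i = j + 1
    return ranks

def pearson(xs, ys):
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def spearman(xs, ys):
    return pearson(rank(xs), rank(ys))

sme_consensus = [3, 4, 1, 5, 2]        # e.g. rounded mean of SME scores
judge_scores = [3, 5, 2, 5, 2]         # illustrative LLM judge output
print(round(spearman(sme_consensus, judge_scores), 3))  # 0.949
```

A high rank correlation only says the judge orders responses like the SMEs do; the kappa-based agreement check is still needed to confirm it assigns the same absolute scores.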

4. Iterate and improve.

    • Repeat the process, adjusting the system prompt based on the error analysis and the observed metrics.

5. Final validation.

    • Once the improvement is significant and has plateaued, evaluate on the initially withheld test set to confirm that the gains hold.

6. Set and forget.

    • After the final system prompts are set for the LLM judges, leave them constant throughout experimentation to avoid any bias in the evaluation pipeline.

Post-calibration to mitigate bias

Further analysis of alignment

Stress testing alignment

Stress testing serves to rigorously assess whether the alignment between LLM judges and human evaluators remains robust under varying conditions and across different subpopulations. For instance, while an LLM judge may closely match human scores for short responses, it might consistently misjudge longer ones. If the dataset is dominated by short responses, this can artificially inflate overall correlation metrics and obscure critical weaknesses in the evaluator.

  • Stratified agreement analysis: Evaluate human–LLM agreement separately for distinct categories, such as short versus long responses, simple versus complex queries, different tones or writing styles, and diverse content domains. This helps to pinpoint where alignment may falter.
  • Counterfactual perturbations: Introduce minor modifications to the inputs—such as shuffling candidate order, shortening answers, or substituting synonyms—to observe whether the LLM judge’s scores change in a meaningful way. Such tests uncover sensitivity and potential bias in the evaluation process.
  • Permutation tests: Randomly permute answer labels or scoring assignments to ensure that observed patterns of alignment are not artifacts of dataset structure or chance.
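A permutation test along these lines can be sketched as follows, with illustrative score data: shuffle the judge’s scores many times and ask how often chance pairing matches the humans as well as the real pairing does.

```python
import random

# Permutation-test sketch on illustrative data: does the observed
# human-LLM agreement rate beat what random pairing would produce?
human = [3, 4, 1, 5, 2, 4, 3, 5, 2, 1]
llm   = [3, 4, 2, 5, 2, 4, 3, 4, 2, 1]

def agreement(a, b):
    return sum(x == y for x, y in zip(a, b)) / len(a)

observed = agreement(human, llm)
rng = random.Random(0)
null = []
for _ in range(10_000):
    shuffled = llm[:]
    rng.shuffle(shuffled)
    null.append(agreement(human, shuffled))

# One-sided p-value: fraction of permutations at least as strong as observed
p_value = sum(a >= observed for a in null) / len(null)
print(observed, p_value < 0.05)
```

A small p-value here indicates the agreement is a real property of the pairing rather than an artifact of how scores happen to be distributed.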

When stress testing exposes deficiencies in the current evaluator, the next step is to iteratively refine the LLM judge to achieve stronger and more consistent alignment with human judgment.

Statistical validation of improvements

It is essential to confirm that any observed improvements are substantive and not merely statistical artifacts. This is where robust statistical validation is critical:

  • Paired significance tests: Use methods such as paired t-tests or Wilcoxon signed-rank tests to compare human–LLM deviations before and after calibration, ensuring that improvements are statistically supported.
  • Multiple testing corrections: Apply procedures like the Benjamini–Hochberg correction when evaluating numerous metrics or subgroups, reducing the risk of false positives.
  • Confidence intervals: Report confidence intervals for agreement estimates to quantify uncertainty and avoid overinterpreting marginal differences.
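As one hedged sketch of such validation, a paired bootstrap (a simpler cousin of the Wilcoxon test named above) can put a confidence interval on the reduction in human–judge deviation. The deviation values here are made up for illustration.

```python
import random

# Paired bootstrap sketch: |human score - judge score| for the same ten
# responses before and after recalibrating the system prompt (illustrative).
before = [1.2, 0.8, 1.5, 0.9, 1.1, 1.4, 0.7, 1.3, 1.0, 1.6]
after  = [0.6, 0.5, 0.9, 0.4, 0.8, 0.7, 0.3, 0.9, 0.5, 1.0]
diffs = [b - a for b, a in zip(before, after)]   # positive = improvement

rng = random.Random(1)
boot_means = []
for _ in range(10_000):
    resample = [rng.choice(diffs) for _ in diffs]
    boot_means.append(sum(resample) / len(resample))
boot_means.sort()
lo, hi = boot_means[249], boot_means[9749]       # 95% percentile interval
print(round(lo, 2), round(hi, 2))  # interval above 0: improvement is supported
```

If the interval straddled zero, the apparent improvement would not be distinguishable from noise and further calibration rounds would be warranted.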

By systematically applying these practices, stakeholders gain clearer, data-driven assurances regarding the reliability of the LLM judge and the durability of any measured improvements. This approach ensures that alignment is not only statistically sound but also meaningful and stable across all relevant scenarios.

Regression to test for bias

Once the LLM judge is calibrated to align with human preferences and its performance meets our evaluation criteria, the final step is to quantify the presence and magnitude of residual biases that may persist.

Just like humans, LLM judges can exhibit systematic biases. Prior research has shown that LLM judges often display consistent patterns of favoritism across many examples. Three well-documented types of bias are:

  • Positional bias – Preferring the first or second option in a comparison, independent of quality.
  • Verbosity bias – Favoring longer answers over shorter, more concise ones.
  • Self‑bias – Giving higher scores to responses generated by the same model family as the judge.

Understanding these systematic effects allows us to diagnose limitations in the evaluator and iteratively reduce bias during further tuning. While aligning an LLM judge with human preferences is essential, achieving lower bias makes the evaluator even more reliable.

A straightforward yet powerful method to measure bias is regression modelling. Consider verbosity bias as an example.

Take the example of building an agent whose ability to answer specific questions we want to evaluate against a standard out-of-the-box LLM. We would like to know whether our agent produces answers of more substance. However, we know in advance that LLM judges tend to favor longer answers. So how do we ensure that our LLM judge isn't biased toward our agent simply because it gives longer answers? We can attempt to control for that confounding influence with linear regression.

Score = β₀ + β₁ · Agent + β₂ · LengthNormalized + ε

This formulation lets us isolate the separate effects of who generated the answer and how long the answer is, while holding other factors fixed. Other bias types can be modelled in a similar fashion, provided the standard assumptions of linear regression hold.

The sign, magnitude, and statistical significance of the regression coefficients quantify the extent of each bias. For instance, a large positive and statistically significant β₂ in this specification would indicate strong verbosity bias.
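On synthetic data with a known effect the regression recovers the verbosity bias; everything here (sample size, coefficients, noise level) is illustrative rather than drawn from a real evaluation.

```python
import numpy as np

# Synthetic check of the regression: scores are generated with no true
# quality gap between agent and baseline (beta1 = 0) but a strong length
# effect (beta2 = 0.8); the fit should recover roughly those values.
rng = np.random.default_rng(0)
n = 500
agent = rng.integers(0, 2, n)             # 1 = our agent, 0 = baseline LLM
length_norm = rng.normal(0, 1, n)         # z-scored answer length
score = 3.0 + 0.0 * agent + 0.8 * length_norm + rng.normal(0, 0.3, n)

X = np.column_stack([np.ones(n), agent, length_norm])
beta, *_ = np.linalg.lstsq(X, score, rcond=None)
print(np.round(beta, 2))  # close to [3.0, 0.0, 0.8]: length, not the agent, drives scores
```

In a real evaluation you would also want standard errors and p-values for the coefficients (e.g. via statsmodels OLS) rather than point estimates alone.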

Bringing statistical rigor into practice with Microsoft Foundry

Much of this workflow can be reproduced easily using Microsoft Foundry and the open-source package judgesync.

In practice, statistical validation is most valuable when it is tightly integrated into the evaluation workflow. Microsoft Foundry natively supports paired statistical testing, enabling developers to directly quantify pre‑ and post‑calibration improvements.

Especially useful is Microsoft Foundry’s evaluation cluster analysis feature (currently in public preview). It helps you understand and compare evaluation results by grouping samples with similar patterns, making it easier to surface alignment gaps where LLM judges diverge from human evaluators across response lengths, complexity, and content styles, issues that aggregate metrics often hide.


References

[1] Zheng, L., Chiang, W.-L., Sheng, Y., Zhuang, S., Wu, Z., Zhuang, Y., Lin, Z., Li, Z., Li, D., Xing, E. P., Zhang, H., Gonzalez, J. E., & Stoica, I. (2023). Judging LLM-as-a-judge with MT-Bench and Chatbot Arena [arXiv preprint]. arXiv. https://doi.org/10.48550/arXiv.2306.05685

[2] Cohen, J. (1960). A coefficient of agreement for nominal scales. Educational and Psychological Measurement, 20(1), 37–46. https://doi.org/10.1177/001316446002000104

[3] Fleiss, J. L., & Cohen, J. (1973). The equivalence of weighted kappa and the intraclass correlation coefficient as measures of reliability. Educational and Psychological Measurement, 33(3), 613–619. https://doi.org/10.1177/001316447303300309

Updated Mar 04, 2026
Version 1.0