
Healthcare and Life Sciences Blog

Towards Robust Evaluation of Multi-Agent Systems in Clinical Settings

lschettini
Jul 22, 2025

Authors: Hao Qiu, Leonardo Schettini, Mert Öz, Noel Codella, Sam Preston, Wen-wai Yim

As multi-agent systems become more capable and collaborative, their behavior begins to exhibit emergent properties that are difficult to predict or control – particularly in safety-critical domains like healthcare. Coordination among agents can yield outputs that are non-deterministic, multi-faceted, and context-sensitive.

This makes robust evaluation not just a matter of accuracy, but of safety, accountability, and trust. Traditional NLP metrics like ROUGE or BLEU fall short in these settings, as they presuppose a single ground truth and fail to capture clinically relevant errors such as subtle omissions, hallucinations, or fact distortions. To address this, we present a modular evaluation framework for the Healthcare Agent Orchestrator, designed to support fine-grained, clinically grounded assessment across both deployed clinical workflows and simulated scenarios. This framework enables targeted stress-testing of multi-agent behavior – particularly how agents share information, reason under uncertainty, and maintain factual fidelity in high-stakes contexts.

Central to our framework is TBFact, a domain-specific factuality metric that evaluates agent outputs on three key criteria: factual inclusion, factual distortion, and factual omission. TBFact shows strong correlation with human experts (κ=0.760) and demonstrates that our Patient History agent successfully included up to 94% of high-importance information in the generated patient timelines.

To ground evaluations of the Patient History agent, we constructed a high-quality benchmark dataset from de-identified tumor board discussions and associated patient histories. The formatting of the reference patient timeline summaries (originally written by medical professionals) was standardized via a large language model to facilitate consistent evaluation. On this benchmark, the Patient History agent included over 94% of high-importance facts (counting both fully and partially entailed information) yet achieved a TBFact recall of 0.84 on high-importance facts, showing that TBFact's strict entailment criteria and partial-credit scoring leave meaningful headroom for future improvements.

For more technical information about the evaluation framework, refer to the documentation. The healthcare-agent-orchestrator repository also includes an evaluation notebook with concrete examples for simulating conversations and evaluating them.

Figure 1: High-level architecture of the evaluation framework, showing data sources (real and simulated conversations) feeding into modular metrics for both orchestrator and individual agent assessment.

Available Metrics

Traditional similarity metrics (e.g., ROUGE, BERTScore) fail to capture subtle yet critical factual inaccuracies in the output. Moreover, in agentic workflows, a ground-truth answer often doesn’t exist or is expensive to curate. To overcome these shortcomings, we leverage Model-as-a-Judge to implement the following metrics:

| Component | Metric | Description |
|---|---|---|
| Orchestrator | Agent and tool selection accuracy | Correct routing to specialized agents. |
| Orchestrator | Intent resolution | How accurately the orchestrator interprets and completes user requests, including scoping and clarification. |
| Orchestrator | Information aggregation | Effective synthesis of multiple agent outputs. |
| Individual Agents | Context relevancy | Relevance of retrieved information in relation to the user’s requests. |
| Individual Agents | TBFact (Factual Consistency) | An adaptation of RadFact to the text modality that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations. |

Large Language Models serve as useful evaluation tools in our framework, offering advantages especially when ground truth data is not available. They can follow detailed evaluation guidelines, maintain consistency when applying criteria across conversations, and generate explanations for their assessments, which facilitates verification of the evaluation process. However, due to their subjective nature, LLM-based evaluations should be treated as directional signals that guide system improvement rather than as absolute judgments of correctness.

To complement LLM-based metrics with reproducible measurements, especially when reference data is available, we include a ROUGE implementation. It serves as an example of how developers can incorporate other similarity metrics such as BLEU or BERTScore by extending the ReferenceBasedMetric class.
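As an illustration of this extension point, the sketch below adds a simple unigram-overlap F1 metric. The base class interface shown here (a single compute method returning a named score) is an assumption for demonstration purposes; the repository defines the actual ReferenceBasedMetric API.

```python
# Minimal sketch of extending a reference-based metric.
# NOTE: the base class interface below is assumed for illustration;
# the repository defines the real ReferenceBasedMetric API.

from dataclasses import dataclass


@dataclass
class MetricResult:
    name: str
    score: float


class ReferenceBasedMetric:
    """Stand-in for the framework's base class (assumed interface)."""

    def compute(self, prediction: str, reference: str) -> MetricResult:
        raise NotImplementedError


class TokenOverlapF1(ReferenceBasedMetric):
    """Illustrative unigram-overlap F1; swap in BLEU or BERTScore as needed."""

    def compute(self, prediction: str, reference: str) -> MetricResult:
        pred_tokens = set(prediction.lower().split())
        ref_tokens = set(reference.lower().split())
        common = pred_tokens & ref_tokens
        if not common:
            return MetricResult("token_overlap_f1", 0.0)
        precision = len(common) / len(pred_tokens)
        recall = len(common) / len(ref_tokens)
        return MetricResult("token_overlap_f1", 2 * precision * recall / (precision + recall))


# Usage:
# TokenOverlapF1().compute("TERT promoter mutation detected", "TERT promoter mutation identified")
```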

TBFact: Domain-Specific Factuality Evaluation

TBFact builds on RadFact (Bannur et al., 2024), a framework originally developed for evaluating factual consistency in radiology reports, by adapting its core principles to the text-only modality of healthcare agent interactions:

  1. Fact Extraction: Separately decomposes both agent responses and reference texts into discrete factual claims, categorized by clinical relevance (e.g., demographics, diagnosis, treatment).
  2. Logical Entailment: Compares each fact to determine if it's fully entailed, partially entailed, or not entailed by the reference, and further categorizes the reason for partial and total mismatches into “missing”, “ambiguous”, “incorrect” or “other”.
  3. Metric Calculation: TBFact performs the logical entailment in two directions:
     • Precision (pred-to-gold): Measures the proportion of factual claims in the agent’s output that are supported by the reference data. A lower precision score may indicate the presence of hallucinated or extraneous facts not found in the reference, even if they are accurate. Precision can be seen as a proxy for succinctness.
     • Recall (gold-to-pred): Measures the proportion of reference facts that are successfully captured in the agent’s output. A lower recall score signals missing or omitted information, which is especially critical in clinical contexts where completeness is essential.

By operating at the level of atomic factual units, TBFact shifts the focus from holistic summary judgments to targeted, claim-by-claim analysis. While claim extraction introduces its own challenges—such as ensuring consistent coverage of verifiable content, maintaining entailment fidelity, and handling decontextualization (Metropolitansky & Larson, 2025)—factual claims make the evaluation process more modular and transparent, providing actionable insights into where and how agent responses differ from references.

For example, when evaluating a discharge summary, TBFact might identify that while demographic facts achieve 95% precision, treatment recommendations only reach 75% recall, pinpointing specific areas for agent improvement. This granular feedback enables developers to identify systematic issues, such as an agent consistently omitting medication dosages or incorrectly interpreting temporal information, that would be difficult to detect with traditional metrics.
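To make the scoring concrete, the sketch below shows one way claim-level entailment judgments could be aggregated into TBFact-style precision and recall, per clinical category. The 0.5 weight for partial entailment and the data structures are illustrative assumptions, not the framework's exact implementation.

```python
# Hedged sketch: aggregating claim-level entailment judgments into
# TBFact-style precision/recall with partial credit. Weights and data
# structures are illustrative assumptions.

from collections import defaultdict
from dataclasses import dataclass

WEIGHTS = {"entailed": 1.0, "partial": 0.5, "not_entailed": 0.0}


@dataclass
class Judgment:
    category: str  # e.g. "demographics", "diagnosis", "treatment"
    label: str     # "entailed" | "partial" | "not_entailed"


def weighted_score(judgments: list[Judgment]) -> float:
    """Fraction of claims supported, with partial credit for partial entailment."""
    if not judgments:
        return 0.0
    return sum(WEIGHTS[j.label] for j in judgments) / len(judgments)


def tbfact_scores(pred_to_gold: list[Judgment], gold_to_pred: list[Judgment]) -> dict:
    """Precision aggregates pred->gold judgments; recall aggregates gold->pred judgments."""
    per_category: dict = defaultdict(lambda: {"precision": [], "recall": []})
    for j in pred_to_gold:
        per_category[j.category]["precision"].append(j)
    for j in gold_to_pred:
        per_category[j.category]["recall"].append(j)
    return {
        category: {
            "precision": weighted_score(buckets["precision"]),
            "recall": weighted_score(buckets["recall"]),
        }
        for category, buckets in per_category.items()
    }
```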

Data Sources

Because real-world data is rarely available for every use case we want to evaluate, and to accommodate different development stages and data availability, the framework supports two primary evaluation modes:

Real conversations: Healthcare Agent Orchestrator automatically saves chat sessions whenever a conversation is terminated with the command @Orchestrator: clear, enabling insight into actual clinical workflow performance.

Simulated conversations: Generated for controlled testing using predefined scripts or adaptive scenarios. Essential for specialized scenarios with limited real-world data.
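The sketch below illustrates the simulated mode with a scripted scenario. The orchestrator interface (a send_message method) and the script format are assumptions made for illustration; the evaluation notebook in the repository shows the actual simulation API.

```python
# Hedged sketch of a scripted conversation simulation for offline evaluation.
# The orchestrator interface and script format are assumptions; see the
# repository's evaluation notebook for the real API.

from dataclasses import dataclass, field


@dataclass
class SimulatedConversation:
    scenario: str
    turns: list[dict] = field(default_factory=list)

    def record(self, role: str, content: str) -> None:
        self.turns.append({"role": role, "content": content})


def run_scripted_scenario(orchestrator, script: list[str], scenario: str) -> SimulatedConversation:
    """Replay predefined user turns against the orchestrator and capture the transcript."""
    conversation = SimulatedConversation(scenario=scenario)
    for user_turn in script:
        conversation.record("user", user_turn)
        reply = orchestrator.send_message(user_turn)  # assumed method name
        conversation.record("assistant", reply)
    return conversation


# Example (illustrative):
# run_scripted_scenario(orchestrator,
#                       ["Summarize the patient history for patient 123",
#                        "What biomarkers were reported?"],
#                       scenario="patient_history_smoke_test")
```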

Results and Performance Assessment

Note: The following results represent initial validation from our current research phase, with ongoing work expanding evaluation scope and refining methodologies. These preliminary results demonstrate promising capabilities for clinical system coordination and factual accuracy assessment.

Orchestrator Performance

We evaluated the orchestrator using simulated conversations across multiple patient scenarios. GPT-4o served as the evaluator, providing both quantitative scores and qualitative explanations based on defined metric criteria. In this preliminary experiment, the orchestrator demonstrated promising coordination capabilities:

| Metric | Score Range | Average Score |
|---|---|---|
| Agent Selection Accuracy | 3.89 – 5 | 4 |
| Intent Resolution | 4 – 5 | 4.5 |
| Information Aggregation | 3 – 5 | 3.7 |

In our preliminary evaluation, the agent selection examples are relatively straightforward given our agents' well-defined responsibilities, but they provide a foundation for expanding to more complex scenarios involving agent-human expert interactions as we gather real-world data. Future work could include turn-level labeling of tumor board dataset dialogues to test classification accuracy in choosing the right next expert or agent. Agent selection can also be combined with "tool selection" metrics, addressing the fragmentation problem in multi-agent evaluation approaches.

In the current state, we mainly use the explanations provided by the evaluator model to better understand the behavior of the system in clinical workflows and to guide the development process.
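For readers implementing similar checks, the sketch below shows the general shape of a model-as-judge metric for agent selection. The prompt wording, the 1–5 scale, and the llm callable are illustrative assumptions; the framework defines its own criteria and response parsing.

```python
# Hedged sketch of a model-as-judge metric for agent selection accuracy.
# Prompt wording, scale, and the llm callable are illustrative assumptions.

import json
from typing import Callable

JUDGE_PROMPT = """You are evaluating a multi-agent clinical orchestrator.
Given the conversation below, rate from 1 (poor) to 5 (excellent) how well
the orchestrator routed each user request to the appropriate specialized agent.
Return JSON: {{"score": <int>, "explanation": "<short justification>"}}

Conversation:
{conversation}
"""


def judge_agent_selection(conversation: str, llm: Callable[[str], str]) -> dict:
    """Ask an evaluator model for a score and a free-text explanation."""
    raw = llm(JUDGE_PROMPT.format(conversation=conversation))
    result = json.loads(raw)  # assumes the evaluator returns valid JSON
    return {"score": int(result["score"]), "explanation": result["explanation"]}
```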

Patient History Agent Performance with TBFact

To evaluate the Patient History agent, we used an anonymized and PHI-free proprietary dataset, named TB-Bench, that comprehensively aggregates diverse medical records for 71 patients who had been under the care of a Molecular Tumor Board (MTB). TB-Bench includes data such as tumor board transcripts, exported EHR data, and clinician-generated patient summaries. Due to the logistical challenges involved in curating such a comprehensive dataset across potentially multiple healthcare institutions and record-keeping systems, we found that in some instances the clinician-generated summaries available in the tumor board transcripts referred to patient records that were lost in the data curation process. This mismatch made direct evaluation challenging.

Therefore, to ensure the evaluation reflects system performance when complete patient records are accessible, we used TBFact to evaluate the agent’s output against a curated set of dataset-verifiable facts, i.e., facts limited to those referring to information that is present in the dataset.

While TBFact measures both recall and precision of fact generation, our study focuses on recall because it measures how much of the important information is covered; we consider this the most critical property for clinical applications, where missing information can have serious consequences.

The preliminary experiments revealed significant performance improvements through prompt optimization and format adjustments. With specialized prompting, we specify the types of information to prioritize—such as biomarker results, imaging assessments, and treatment timelines. For instance, our updated prompt instructs the agent to “organize the patient data in chronological order” and explicitly calls out key elements to include: “all biomarkers”, “response to treatment including dates and imaging,” and “a summary of current status.” This prompt engineering approach proved to be one of the most effective levers for improving the quality and completeness of Patient History outputs.
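As a rough illustration, a specialized prompt of this kind might look like the following. It is assembled from the fragments quoted above and is not the exact production prompt; the surrounding structure and any additional wording are assumptions for demonstration.

```python
# Illustrative specialized prompt for the Patient History agent, assembled
# from the fragments quoted in this post; not the exact production prompt.

PATIENT_HISTORY_INSTRUCTIONS = """
Organize the patient data in chronological order as a patient timeline.
Make sure to include:
- All biomarkers.
- Response to treatment, including dates and imaging.
- A summary of current status.
Include allergies and ongoing medications even when specific dates are unavailable.
""".strip()
```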

| Configuration | TBFact Recall for All Facts | TBFact Recall for Important Facts |
|---|---|---|
| Generic prompts (baseline) | 0.56 | 0.66 |
| Specialized prompts | 0.71 | 0.84 |

Since TBFact operates by comparing discrete factual claims, higher scores indicate that the agent is, according to the reference data, factually accurate and comprehensive in its coverage of the available patient information. In other words, optimizing for TBFact scores brings the agent’s output structurally and semantically closer to the curated reference timelines. In our case, that meant striving for detailed outputs, including information about allergies and ongoing medications, even when specific dates were unavailable. This underscores the importance of having high-quality, human-validated reference datasets: without them, even well-performing agents may appear incomplete or inaccurate.

Human Validation Study

To validate TBFact's reliability, we conducted a preliminary study with human annotators (medical scribes by training) using 71 patient records. Two annotators assessed (a) whether a claim was properly extracted from its source text, (b) whether the fact was important (low, medium, high), and (c) whether individual claims were properly entailed by a reference text. Inter-annotator agreement was 0.999, 0.66 (strict) and 0.77 (relaxed), and 0.914 for the three tasks, respectively. The accuracy of the fact extraction pipeline was calculated to be 99.9%, validating that the fact extraction phase introduces minimal to no hallucinations. System accuracy for fact importance classification was 66% when measured strictly; when allowing a tolerance of one level (e.g., classifying medium instead of high), it reached 93%. These values are comparable to those of the medical annotators. Entailment classification accuracy was 88%, suggesting reasonable performance in recognizing entailment. Finally, we measured the correlation between the system's end-to-end TBFact F1 scores and those derived from human judgments using Kendall's Tau, Pearson, and Spearman correlations, which were 55.8%, 70.5%, and 72.8%, respectively; these moderate-to-high correlations suggest that the TBFact metric is well aligned with expert clinical reasoning.
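The correlation analysis itself is straightforward to reproduce; the sketch below shows the computation with SciPy, using placeholder score lists in place of the per-patient F1 values from the study.

```python
# Hedged sketch of the correlation analysis between system- and human-derived
# TBFact F1 scores. The score lists are placeholders, not study data.

from scipy.stats import kendalltau, pearsonr, spearmanr

system_f1 = [0.84, 0.71, 0.66]  # placeholder per-patient TBFact F1 (system)
human_f1 = [0.88, 0.69, 0.70]   # placeholder per-patient TBFact F1 (human annotations)

tau, _ = kendalltau(system_f1, human_f1)
r, _ = pearsonr(system_f1, human_f1)
rho, _ = spearmanr(system_f1, human_f1)
print(f"Kendall tau={tau:.3f}, Pearson r={r:.3f}, Spearman rho={rho:.3f}")
```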

Qualitative insights from TBFact

The table below illustrates how TBFact evaluates factual alignment between agent-generated summaries and reference data. Each row shows a fact extracted from the agent’s output, the corresponding excerpt from the reference, and the entailment judgment. The logical entailment was produced by TBFact, while the accompanying explanations were generated separately to support interpretability.

| Facts Extracted from Agent Response | Related Excerpt from Reference Text (Ground Truth) | TBFact Judgment |
|---|---|---|
| Molecular studies from the 2019-05-18 surgery identified TERT promoter mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. | […] Tumor Genetics: EGFR: Amplified; CDKN2A/B: Deleted; PTEN: p.L112R; TERT: c.-146C>T; Chromosome 10: Monosomy; Chromosome 7: Trisomy […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | Entailed: The summary lists TERT mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. |
| Immunohistochemistry from 2019-05-18 showed GFAP positive, BRAF V600E negative, IDH1 R132H negative, ATRX retained, p53 negative, and a Ki-67 index of 3%. | […] Tumor Genetics: IDH1: Wildtype; BRAF V600E: Negative […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | ⚠️ Partial Entailment: Some IHC findings match (BRAF negative, IDH1 wildtype), but others (GFAP, p53, Ki-67) are not mentioned in the reference summary. |
| During the first cycle of CCNU on 2020-04-14, the patient reported significant fatigue, thrombocytopenia, and occasional confusion. | Introduction: […] The patient is experiencing poor tolerance to lomustine and is considering discontinuation due to further disease progression as confirmed by recent MRI scans. […] Timeline: 04/14/2020 – Present: Lomustine treatment initiated. […] | ⚠️ Partial Entailment: Poor tolerance to lomustine is reported, but specific side effects are not listed in the reference summary. |
| On 2020-05-16, the plan was to continue CCNU and monitor with imaging. | No related information in the reference text. | ⚠️ No Entailment: No mention in the summary of a plan on 2020-05-16 to continue CCNU with imaging follow-up. |

These examples show that partial entailments are not necessarily errors. In many cases, they reflect the agent surfacing clinically relevant details that are absent from the reference. This is especially important in healthcare settings, where agent outputs may synthesize information across multiple documents or express facts in more complete or structured ways than the reference does.

To further assess the factual grounding of the agent’s outputs, we compared all facts extracted from the Patient History agent’s summaries against the full set of available data for each patient in the TB-Bench dataset. We found that 97% of the extracted facts were entailed by at least one data point. Upon manually reviewing the remaining 3% of facts, we found that they often reflected condensed or synthesized information drawn from multiple sources, meaning these claims could not be matched to any one document in our one-to-one entailment setup. While we cannot rule out the presence of hallucinations entirely, this analysis highlights the agent’s capacity for multi-source summarization.
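For reference, the sketch below captures the shape of this grounding check: a generated fact counts as grounded if it is entailed by at least one document in the patient's record set. The entails callable stands in for the TBFact entailment step and is hypothetical here.

```python
# Hedged sketch of the grounding check described above. The entails()
# callable is a hypothetical stand-in for the TBFact entailment call.

from typing import Callable


def grounded_fraction(facts: list[str],
                      documents: list[str],
                      entails: Callable[[str, str], bool]) -> float:
    """Fraction of facts entailed by at least one source document."""
    if not facts:
        return 0.0
    grounded = sum(
        1 for fact in facts if any(entails(fact, doc) for doc in documents)
    )
    return grounded / len(facts)
```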

Closing Thoughts

As multi-agent systems become more capable and autonomous, robust evaluation must evolve in parallel. The framework presented here is a step toward that goal: modular, clinically grounded, and designed to surface actionable insights across both simulated and real-world workflows. By moving beyond traditional accuracy metrics and embracing factuality, relevance, and coordination as core evaluation dimensions, we can better understand how multi-agent systems work, and when and why they fail.

Our preliminary experiments and insights reinforce the value of TBFact not just as a metric, but as a diagnostic tool. Its structured, claim-level analysis (combined with fact categorization and human validation) offers a transparent and clinically meaningful way to evaluate and improve healthcare agents. Our evaluation of the Patient History agent shows that the agent remains faithful to the underlying data and produces complete, clinically relevant summaries. These outputs can help physicians prepare more efficiently and productively for tumor board review meetings and, as part of a multi-agent chat, facilitate further investigation and understanding of patients.

Looking ahead, we see several promising directions for extending this work: incorporating human-in-the-loop review pipelines, expanding to multimodal evaluation, improving observability across agent interactions, and scaling to more diverse real-world datasets. We are also developing a standardized benchmark of synthetic and de-identified patient cases to support broader community testing and reproducibility.

We hope this work encourages others to adopt similarly rigorous approaches to evaluation, and to contribute to the development of shared benchmarks, metrics, and methodologies.

References

Bannur, S., Bouzid, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., ... & Hyland, S. L. (2024). MAIRA-2: Grounded radiology report generation. arXiv:2406.04449v2.

Metropolitansky, D. & Larson, J. (2025). Towards Effective Extraction and Evaluation of Factual Claims. arXiv:2502.10855v2.
