Towards Robust Evaluation of Multi-Agent Systems in Clinical Settings
Authors: Hao Qiu, Leonardo Schettini, Mert Öz, Noel Codella, Sam Preston, Wen-wai Yim

As multi-agent systems become more capable and collaborative, their behavior begins to exhibit emergent properties that are difficult to predict or control, particularly in safety-critical domains like healthcare. Coordination among agents can yield outputs that are non-deterministic, multi-faceted, and context-sensitive. This makes robust evaluation a matter not just of accuracy, but of safety, accountability, and trust. Traditional NLP metrics like ROUGE or BLEU fall short in these settings because they presuppose a single ground truth and fail to capture clinically relevant errors such as subtle omissions, hallucinations, or fact distortions.

To address this, we present a modular evaluation framework for the Healthcare Agent Orchestrator, designed to support fine-grained, clinically grounded assessment across both deployed clinical workflows and simulated scenarios. The framework enables targeted stress-testing of multi-agent behavior, particularly how agents share information, reason under uncertainty, and maintain factual fidelity in high-stakes contexts.

Central to our framework is TBFact, a domain-specific factuality metric that evaluates agent outputs on three key criteria: factual inclusion, factual distortion, and factual omission. TBFact shows strong correlation with human experts (κ = 0.760) and demonstrates that our Patient History agent successfully included up to 94% of high-importance information in the generated patient timelines.

To ground evaluations of the Patient History agent, we constructed a high-quality benchmark dataset from de-identified tumor board discussions and associated patient histories. The formatting of the reference patient timeline summaries (originally written by medical professionals) was standardized via a large language model to facilitate consistent evaluation. Under this benchmark, the Patient History agent included over 94% of high-importance facts (counting both fully and partially entailed information) yet achieved a TBFact recall of 0.84 on high-importance facts, showing that TBFact's strict entailment criteria and partial-credit scoring leave meaningful headroom for future improvements.

For more technical information about the evaluation framework, refer to the documentation. The healthcare-agent-orchestrator repository also includes an evaluation notebook with concrete examples for simulating conversations and evaluating them.

Figure: High-level architecture of the evaluation framework, showing data sources (real and simulated conversations) feeding into modular metrics for both orchestrator and individual agent assessment.

Available Metrics

Traditional similarity metrics (e.g., ROUGE, BERTScore) fail to capture subtle yet critical factual inaccuracies in the output. Moreover, in agentic workflows a ground-truth answer often does not exist or is expensive to curate. To overcome these shortcomings, we leverage Model-as-a-Judge to implement the following metrics:

| Component | Metric | Description |
|---|---|---|
| Orchestrator | Agent and tool selection accuracy | Correct routing to specialized agents. |
| Orchestrator | Intent resolution | How accurately the orchestrator interprets and completes user requests, including scoping and clarification. |
| Orchestrator | Information aggregation | Effective synthesis of multiple agent outputs. |
| Individual agents | Context relevancy | Relevance of retrieved information in relation to the user's requests. |
| Individual agents | TBFact (factual consistency) | An adaptation of RadFact to the text modality that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations. |

Large language models serve as useful evaluation tools in our framework, offering advantages especially when ground-truth data is not available. They can follow detailed evaluation guidelines, apply criteria consistently across conversations, and generate explanations for their assessments, which facilitates verification of the evaluation process. However, due to their subjective nature, LLM-based evaluations should be treated as directional signals that guide system improvement rather than as absolute judgments of correctness.

To complement LLM-based metrics with reproducible measurements when reference data is available, we include a ROUGE implementation, which serves as an example for developers who want to incorporate other similarity metrics such as BLEU or BERTScore by extending the ReferenceBasedMetric class.

TBFact: Domain-Specific Factuality Evaluation

TBFact builds on RadFact (Bannur et al., 2024), a framework originally developed for evaluating factual consistency in radiology reports, by adapting its core principles to the text-only modality of healthcare agent interactions:

- Fact extraction: Separately decomposes both agent responses and reference texts into discrete factual claims, categorized by clinical relevance (e.g., demographics, diagnosis, treatment).
- Logical entailment: Compares each fact to determine whether it is fully entailed, partially entailed, or not entailed by the reference, and further categorizes the reason for partial and total mismatches as "missing", "ambiguous", "incorrect", or "other".
- Metric calculation: TBFact performs the logical entailment in two directions:
  - Precision (pred-to-gold): the proportion of factual claims in the agent's output that are supported by the reference data. A lower precision score may indicate hallucinated or extraneous facts not found in the reference, even if they are accurate. Precision can be seen as a proxy for succinctness.
  - Recall (gold-to-pred): the proportion of reference facts that are successfully captured in the agent's output. A lower recall score signals missing or omitted information, which is especially critical in clinical contexts where completeness is essential.

By operating at the level of atomic factual units, TBFact shifts the focus from holistic summary judgments to targeted, claim-by-claim analysis. Claim extraction introduces its own challenges, such as ensuring consistent coverage of verifiable content, maintaining entailment fidelity, and handling decontextualization (Metropolitansky & Larson, 2025), but factual claims make the evaluation process more modular and transparent, providing actionable insights into where and how agent responses differ from references. For example, when evaluating a discharge summary, TBFact might identify that while demographic facts achieve 95% precision, treatment recommendations reach only 75% recall, pinpointing specific areas for agent improvement. This granular feedback enables developers to identify systematic issues, such as an agent consistently omitting medication dosages or misinterpreting temporal information, that would be difficult to detect with traditional metrics.
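To make the two-directional scoring concrete, here is a minimal sketch of how TBFact-style precision and recall could be computed once entailment labels are available. The data structures, the 0.5 partial-credit weight, and the toy judgments are illustrative assumptions, not the framework's actual implementation.

```python
from dataclasses import dataclass
from typing import Literal

Entailment = Literal["entailed", "partially_entailed", "not_entailed"]

@dataclass
class FactJudgment:
    fact: str            # atomic factual claim extracted from a text
    category: str        # e.g. "demographics", "diagnosis", "treatment"
    label: Entailment    # entailment decision against the comparison text

# Assumed partial-credit weights; the real TBFact scoring may differ.
WEIGHTS = {"entailed": 1.0, "partially_entailed": 0.5, "not_entailed": 0.0}

def tbfact_score(judgments: list[FactJudgment]) -> float:
    """Weighted fraction of claims supported by the comparison text."""
    if not judgments:
        return 0.0
    return sum(WEIGHTS[j.label] for j in judgments) / len(judgments)

# Toy judgments for a single summary, one list per entailment direction.
pred_to_gold = [  # facts from the agent output, judged against the reference
    FactJudgment("EGFR is amplified", "genetics", "entailed"),
    FactJudgment("Ki-67 index of 3%", "pathology", "not_entailed"),
]
gold_to_pred = [  # facts from the reference, judged against the agent output
    FactJudgment("CDKN2A/B deleted", "genetics", "entailed"),
    FactJudgment("Lomustine started on 04/14/2020", "treatment", "partially_entailed"),
]

precision = tbfact_score(pred_to_gold)   # proxy for succinctness
recall = tbfact_score(gold_to_pred)      # proxy for completeness
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

In practice the entailment labels themselves would come from an LLM judge comparing each extracted claim against the other text; only the aggregation step is shown here.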
Data Sources

Because real-world data is rarely available for every use case we want to evaluate, and to accommodate different development stages and data availability, the framework supports two primary evaluation modes:

- Real conversations: the Healthcare Agent Orchestrator automatically saves chat sessions whenever a conversation is terminated with the command @Orchestrator: clear, enabling insight into actual clinical workflow performance.
- Simulated conversations: generated for controlled testing using predefined scripts or adaptive scenarios; essential for specialized scenarios with limited real-world data.

Results and Performance Assessment

Note: The following results represent initial validation from our current research phase, with ongoing work expanding the evaluation scope and refining methodologies. These preliminary results demonstrate promising capabilities for clinical system coordination and factual accuracy assessment.

Orchestrator Performance

We evaluated the orchestrator using simulated conversations across multiple patient scenarios. GPT-4o served as the evaluator, providing both quantitative scores and qualitative explanations based on the defined metric criteria. In this preliminary experiment, the orchestrator demonstrated promising coordination capabilities:

| Metric | Score Range | Average Score |
|---|---|---|
| Agent selection accuracy | 3.89 – 5 | 4 |
| Intent resolution | 4 – 5 | 4.5 |
| Information aggregation | 3 – 5 | 3.7 |

In our preliminary evaluation, the agent selection examples are relatively straightforward given our agents' well-defined responsibilities, but they provide a foundation for expanding to more complex scenarios involving agent-human expert interactions as we gather real-world data. Future work could include turn-level labeling of the tumor board dataset dialogues to test classification accuracy in choosing the right next expert or agent. Agent selection can also be combined with "tool selection" metrics, addressing the fragmentation problem in multi-agent evaluation approaches. In the current state, we mainly used the explanations provided by the evaluator model to better understand the behavior of the system in clinical workflows and to guide the development process.
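The sketch below illustrates how a Model-as-a-Judge metric of this kind might be wired up: a rubric prompt is filled with a conversation transcript, sent to a judge model, and the returned JSON score and explanation are parsed. The rubric wording, the JSON contract, and the `judge_conversation` helper are assumptions for illustration, not the orchestrator's actual evaluation prompts.

```python
import json
import re
from typing import Callable

RUBRIC = """You are evaluating a multi-agent clinical assistant.
Metric: {metric}
Criteria: {criteria}
Conversation transcript:
{transcript}
Return JSON: {{"score": <integer 1-5>, "explanation": "<one short paragraph>"}}"""

def judge_conversation(transcript: str, metric: str, criteria: str,
                       judge: Callable[[str], str]) -> dict:
    """Score one conversation on a 1-5 scale with an LLM judge.

    `judge` is any function mapping a prompt to the model's raw text reply,
    e.g. a thin wrapper around your preferred chat-completion client.
    """
    prompt = RUBRIC.format(metric=metric, criteria=criteria, transcript=transcript)
    reply = judge(prompt)
    match = re.search(r"\{.*\}", reply, re.DOTALL)  # tolerate prose around the JSON
    return json.loads(match.group(0)) if match else {"score": None, "explanation": reply}

# Stub judge used only to make the example runnable; in practice this would call GPT-4o.
stub = lambda prompt: '{"score": 4, "explanation": "Routed the request to the correct agent."}'
result = judge_conversation(
    transcript="User: summarize the patient history ...",
    metric="Agent selection accuracy",
    criteria="Did the orchestrator route the request to the appropriate specialized agent?",
    judge=stub,
)
print(result["score"], "-", result["explanation"])
```

Keeping the judge behind a plain callable makes it easy to swap models or record prompts and replies for later auditing of the evaluator itself.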
Patient History Agent Performance with TBFact

To evaluate the Patient History agent, we used an anonymized, PHI-free proprietary dataset named TB-Bench that comprehensively aggregates diverse medical records for 71 patients who had undergone the care of a Molecular Tumor Board (MTB). TB-Bench includes data such as tumor board transcripts, exported EHR data, and clinician-generated patient summaries. Because of the logistical challenges involved in curating such a comprehensive dataset across potentially multiple healthcare institutions and record-keeping systems, we found that in some instances the clinician-generated summaries available in the tumor board transcripts refer to patient records that were lost in the data curation process. This mismatch made direct evaluation challenging. Therefore, to ensure the evaluation reflects system performance when complete patient records are accessible, we used TBFact to evaluate the agent's output against a curated set of dataset-verifiable facts: facts limited to those referring to information that is present in the dataset.

While TBFact measures both recall and precision of fact generation, our study focuses on recall because it measures how much of the important information is covered, which we consider the most critical metric for clinical applications, where missing information can have serious consequences.

The preliminary experiments revealed significant performance improvements through prompt optimization and format adjustments. With specialized prompting, we specify the types of information to prioritize, such as biomarker results, imaging assessments, and treatment timelines. For instance, our updated prompt instructs the agent to "organize the patient data in chronological order" and explicitly calls out key elements to include: "all biomarkers", "response to treatment including dates and imaging", and "a summary of current status". This prompt-engineering approach proved to be one of the most effective levers for improving the quality and completeness of Patient History outputs.

| Configuration | TBFact Recall (All Facts) | TBFact Recall (Important Facts) |
|---|---|---|
| Generic prompts (baseline) | 0.56 | 0.66 |
| Specialized prompts | 0.71 | 0.84 |

Since TBFact operates by comparing discrete factual claims, higher scores indicate that the agent is, according to the reference data, factually accurate and comprehensive in its coverage of the available patient information. In other words, optimizing for TBFact scores brings the agent's output structurally and semantically closer to the curated reference timelines. In our case, that meant striving for detailed outputs, including information about allergies and ongoing medications, even when specific dates were unavailable. This underscores the importance of high-quality, human-validated reference datasets; without them, even well-performing agents may appear incomplete or inaccurate.

Human Validation Study

To validate TBFact's reliability, we conducted a preliminary study with human annotators (medical scribes by training) using 71 patient records. Two annotators assessed (a) whether each claim was properly extracted from its source text, (b) how important each fact was (low, medium, or high), and (c) whether individual claims were properly entailed by a reference text. Inter-annotator agreement for the three tasks was 0.999, 0.66 (strict) and 0.77 (relaxed), and 0.914, respectively. The accuracy of the fact extraction pipeline was measured at 99.9%, validating that the fact extraction phase introduces minimal to no hallucinations. System accuracy for fact-importance classification was 66% when measured strictly; when allowing a tolerance of one level (e.g., classifying medium instead of high), it reached 93%. These values are comparable to those of the medical annotators. Entailment classification accuracy was 88%, suggesting reasonable performance in recognizing entailment. Finally, we measured the correlation between the system's end-to-end TBFact F1 scores and those derived from human judgments using Kendall's tau, Pearson, and Spearman correlations, which came to 55.8%, 70.5%, and 72.8%, respectively. These moderate-to-high correlations suggest that the TBFact metric is well aligned with expert clinical reasoning.
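For readers who want to run this kind of validation on their own annotations, the sketch below shows one way to compute system-versus-human correlations and annotator agreement with standard libraries (scipy and scikit-learn). The arrays are toy values, not the study's data, and the relaxed-agreement definition is an assumption mirroring the one-level tolerance described above.

```python
import numpy as np
from scipy.stats import kendalltau, pearsonr, spearmanr
from sklearn.metrics import cohen_kappa_score

# Toy per-patient TBFact F1 scores: system judgments vs. human judgments.
system_f1 = np.array([0.81, 0.74, 0.90, 0.62, 0.78])
human_f1 = np.array([0.85, 0.70, 0.88, 0.66, 0.75])

tau, _ = kendalltau(system_f1, human_f1)
r, _ = pearsonr(system_f1, human_f1)
rho, _ = spearmanr(system_f1, human_f1)
print(f"Kendall tau={tau:.3f}  Pearson r={r:.3f}  Spearman rho={rho:.3f}")

# Toy importance labels (low/medium/high) from two annotators.
ann_a = ["high", "medium", "high", "low", "medium", "high"]
ann_b = ["high", "high", "high", "low", "medium", "medium"]
print("Strict agreement (Cohen's kappa):", round(cohen_kappa_score(ann_a, ann_b), 3))

# Relaxed agreement: labels within one level of each other count as a match.
levels = {"low": 0, "medium": 1, "high": 2}
relaxed = np.mean([abs(levels[a] - levels[b]) <= 1 for a, b in zip(ann_a, ann_b)])
print("Relaxed agreement:", round(float(relaxed), 3))
```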
Qualitative Insights from TBFact

The table below illustrates how TBFact evaluates factual alignment between agent-generated summaries and reference data. Each row shows a fact extracted from the agent's output, the corresponding excerpt from the reference, and the entailment judgment. The logical entailment was produced by TBFact, while the accompanying explanations were generated separately to support interpretability.

| Fact Extracted from Agent Response | Related Excerpt from Reference Text (Ground Truth) | TBFact Judgment |
|---|---|---|
| Molecular studies from the 2019-05-18 surgery identified TERT promoter mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. | [...] Tumor Genetics: EGFR: Amplified; CDKN2A/B: Deleted; PTEN: p.L112R; TERT: c.-146C>T; Chromosome 10: Monosomy; Chromosome 7: Trisomy [...] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. [...] | ✔ Entailed: The summary lists TERT mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. |
| Immunohistochemistry from 2019-05-18 showed GFAP positive, BRAF V600E negative, IDH1 R132H negative, ATRX retained, p53 negative, and a Ki-67 index of 3%. | [...] Tumor Genetics: IDH1: Wildtype; BRAF V600E: Negative [...] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. [...] | ⚠️ Partial entailment: Some IHC findings match (BRAF negative, IDH1 wildtype), but others (GFAP, p53, Ki-67) are not mentioned in the reference summary. |
| During the first cycle of CCNU on 2020-04-14, the patient reported significant fatigue, thrombocytopenia, and occasional confusion. | Introduction: [...] The patient is experiencing poor tolerance to lomustine and is considering discontinuation due to further disease progression as confirmed by recent MRI scans. [...] Timeline: 04/14/2020 - Present: Lomustine treatment initiated. [...] | ⚠️ Partial entailment: Poor tolerance to lomustine is reported, but specific side effects are not listed in the reference summary. |
| On 2020-05-16, the plan was to continue CCNU and monitor with imaging. | No related information in the reference text. | ⚠️ No entailment: No mention in the summary of a plan on 2020-05-16 to continue CCNU with imaging follow-up. |

These examples show that partial entailments are not necessarily errors. In many cases, they reflect the agent surfacing clinically relevant details that are absent from the reference. This is especially important in healthcare settings, where agent outputs may synthesize information across multiple documents or express facts in more complete or structured ways than the reference does.

To further assess the factual grounding of the agent's outputs, we compared all facts extracted from the Patient History agent's summaries against the full set of available data for each patient in the TB-Bench dataset. We found that 97% of the extracted facts were entailed by at least one data point. Upon manually reviewing the remaining 3%, we found that they often reflected condensed or synthesized information drawn from multiple sources, meaning these claims could not be matched to any single document in our one-to-one entailment setup. While we cannot rule out the presence of hallucinations entirely, this analysis highlights the agent's capacity for multi-source summarization.
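A minimal sketch of this grounding check is shown below: each extracted fact is tested against every available document and counted as grounded if at least one document entails it. The `entails` callable stands in for an LLM entailment judgment, and the keyword-overlap lambda is only a crude placeholder to make the example runnable; neither is the pipeline's actual implementation.

```python
from typing import Callable, Iterable

def grounded_fraction(facts: Iterable[str],
                      documents: list[str],
                      entails: Callable[[str, str], bool]) -> float:
    """Fraction of extracted facts entailed by at least one source document.

    `entails(document, fact)` is assumed to wrap an LLM entailment judgment;
    facts supported by no single document are candidates for manual review.
    """
    facts = list(facts)
    if not facts:
        return 0.0
    supported = sum(any(entails(doc, fact) for doc in documents) for fact in facts)
    return supported / len(facts)

# Illustrative usage with a crude keyword-overlap stand-in for the LLM judge.
docs = [
    "05/18/2019: Craniotomy; molecular studies note EGFR amplification.",
    "04/14/2020: Lomustine (CCNU) treatment initiated.",
]
facts = ["EGFR is amplified.", "The patient started lomustine in April 2020."]
naive_entails = lambda doc, fact: any(
    word.lower().strip(".,") in doc.lower() for word in fact.split()
)
print(f"{grounded_fraction(facts, docs, naive_entails):.0%} of facts grounded")
```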
Closing Thoughts

As multi-agent systems become more capable and autonomous, robust evaluation must evolve in parallel. The framework presented here is a step toward that goal: modular, clinically grounded, and designed to surface actionable insights across both simulated and real-world workflows. By moving beyond traditional accuracy metrics and embracing factuality, relevance, and coordination as core evaluation dimensions, we can better understand how multi-agent systems work, and when and why they fail.

Our preliminary experiments and insights reinforce the value of TBFact not just as a metric but as a diagnostic tool. Its structured, claim-level analysis, combined with fact categorization and human validation, offers a transparent and clinically meaningful way to evaluate and improve healthcare agents. In evaluating the Patient History agent, our findings demonstrate that the agent remains faithful to the underlying data and produces complete, clinically relevant summaries. These outputs can help physicians prepare more efficiently and productively for tumor board review meetings and, because they are delivered in a chat shared with multiple agents, facilitate further investigation and understanding of each patient.

Looking ahead, we see several promising directions for extending this work: incorporating human-in-the-loop review pipelines, expanding to multimodal evaluation, improving observability across agent interactions, and scaling to more diverse real-world datasets. We are also developing a standardized benchmark of synthetic and de-identified patient cases to support broader community testing and reproducibility. We hope this work encourages others to adopt similarly rigorous approaches to evaluation and to contribute to the development of shared benchmarks, metrics, and methodologies.

References

Bannur, S., Bouzid, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., ... & Hyland, S. L. (2024). MAIRA-2: Grounded radiology report generation. arXiv:2406.04449v2.

Metropolitansky, D., & Larson, J. (2025). Towards Effective Extraction and Evaluation of Factual Claims. arXiv:2502.10855v2.

AI Agents: Building Trustworthy Agents - Part 6
This blog post, Part 6 in a series on AI agents, focuses on building trustworthy AI agents. It emphasizes the importance of safety and security in agent design and deployment. The post details a system message framework for creating robust and scalable prompts, outlining a four-step process from meta prompt to iterative refinement. It then explores various threats to AI agents, including task manipulation, unauthorized access, resource overloading, knowledge base poisoning, and cascading errors, and provides mitigation strategies for each. The post also highlights the human-in-the-loop approach for enhanced trust and control, with a code example using AutoGen. Finally, it links to further resources on responsible AI, model evaluation, and risk assessment, along with the previous posts in the series.

Azure AI Model Inference API
The Azure AI Model Inference API provides a unified interface for developers to interact with various foundational models deployed in Azure AI Studio. This API allows developers to generate predictions from multiple models without changing their underlying code. By providing a consistent set of capabilities, the API simplifies the process of integrating and switching between different models, enabling seamless model selection based on task requirements.

Embracing Responsible AI: A Comprehensive Guide and Call to Action
In an age where artificial intelligence (AI) is becoming increasingly integrated into our daily lives, the need for responsible AI practices has never been more critical. From healthcare to finance, AI systems influence decisions affecting millions of people. As developers, organizations, and users, we are responsible for ensuring that these technologies are designed, deployed, and evaluated ethically. This blog will delve into the principles of responsible AI and the importance of assessing generative AI applications, and provide a call to action to engage with the Microsoft Learn Module on responsible AI evaluations.

What is Responsible AI?

Responsible AI encompasses a set of principles and practices aimed at ensuring that AI technologies are developed and used in ways that are ethical, fair, and accountable. Here are the core principles that define responsible AI:

- Fairness: AI systems must be designed to avoid bias and discrimination. This means ensuring that the data used to train these systems is representative and that the algorithms do not favor one group over another. Fairness is crucial in applications like hiring, lending, and law enforcement, where biased AI can lead to significant societal harm.
- Transparency: Transparency involves making AI systems understandable to users and stakeholders. This includes providing clear explanations of how AI models make decisions and what data they use. Transparency builds trust and allows users to challenge or question AI decisions when necessary.
- Accountability: Developers and organizations must be held accountable for the outcomes of their AI systems. This includes establishing clear lines of responsibility for AI decisions and ensuring that there are mechanisms in place to address any negative consequences that arise from AI use.
- Privacy: AI systems often rely on vast amounts of data, raising concerns about user privacy. Responsible AI practices involve implementing robust data protection measures, ensuring compliance with regulations like GDPR, and being transparent about how user data is collected, stored, and used.

The Importance of Evaluating Generative AI Applications

Generative AI, which includes technologies that can create text, images, music, and more, presents unique challenges and opportunities. Evaluating these applications is essential for several reasons:

- Quality assessment: Evaluating the output quality of generative AI applications is crucial to ensure that they meet user expectations and ethical standards. Poor-quality outputs can lead to misinformation, misrepresentation, and a loss of trust in AI technologies.
- Custom evaluators: Learning to create and use custom evaluators allows developers to tailor assessments to specific applications and contexts. This flexibility is vital in ensuring that the evaluation process aligns with the intended use of the AI system.
- Synthetic datasets: Generative AI can be used to create synthetic datasets, which can help in training AI models while addressing privacy concerns and data scarcity. Evaluating these synthetic datasets is essential to ensure they are representative and do not introduce bias.

Call to Action: Engage with the Microsoft Learn Module

To deepen your understanding of responsible AI and enhance your skills in evaluating generative AI applications, I encourage you to explore the Microsoft Learn Module available at this link.
What You Will Learn:

- Concepts and methodologies: The module covers essential frameworks for evaluating generative AI, including best practices and methodologies that can be applied across various domains.
- Hands-on exercises: Engage in practical, code-first exercises that simulate real-world scenarios. These exercises will help you apply the concepts learned tangibly, reinforcing your understanding.

Prerequisites:

- An Azure subscription (you can create one for free).
- Basic familiarity with Azure and Python programming.
- Tools like Docker and Visual Studio Code for local development.

Why This Matters

By participating in this module, you are not just enhancing your skills; you are contributing to a broader movement towards responsible AI. As AI technologies continue to evolve, the demand for professionals who understand and prioritize ethical considerations will only grow. Your engagement in this learning journey can help shape the future of AI, ensuring it serves humanity positively and equitably.

Conclusion

As we navigate the complexities of AI technology, we must prioritize responsible AI practices. By engaging with educational resources like the Microsoft Learn Module on responsible AI evaluations, we can equip ourselves with the knowledge and skills necessary to create AI systems that are not only innovative but also ethical and responsible. Join the movement towards responsible AI today! Take the first step by exploring the Microsoft Learn Module and become an advocate for ethical AI practices in your community and beyond. Together, we can ensure that AI serves as a force for good in our society.

References

Evaluate generative AI applications: https://learn.microsoft.com/en-us/training/paths/evaluate-generative-ai-apps/?wt.mc_id=studentamb_263805
Azure Subscription for Students: https://azure.microsoft.com/en-us/free/students/?wt.mc_id=studentamb_263805
Visual Studio Code: https://code.visualstudio.com/?wt.mc_id=studentamb_263805

Responsible Synthetic Data Creation for Fine-Tuning with RAFT Distillation
This blog will explore the process of crafting responsible synthetic data, evaluating it, and using it for fine-tuning models. We'll also dive into Azure AI's RAFT distillation recipe, a novel approach to generating synthetic datasets using Meta's Llama 3.1 model and UC Berkeley's Gorilla project.