Healthcare Agent Orchestrator: Multi-agent Framework for Domain-Specific Decision Support
At Microsoft Build, we introduced the Healthcare Agent Orchestrator, now available in the Azure AI Foundry Agent Catalog. In this blog, we unpack the science: how we structured the architecture, curated real tumor board data, and built robust agent coordination that brings AI into real healthcare workflows.

Healthcare Agent Orchestrator assisting a simulated tumor board meeting.

Introduction

Healthcare is inherently collaborative. Critical decisions often require input from multiple specialists—radiologists, pathologists, oncologists, and geneticists—working together to deliver the best outcomes for patients. Yet most AI systems today are designed around narrow tasks or single-agent architectures, failing to reflect the real-world teamwork that defines healthcare practice. That’s why we developed the Healthcare Agent Orchestrator: an orchestrator and code sample built around Microsoft’s industry-leading healthcare AI models, designed to support reasoning and multidisciplinary collaboration, enabling modular, interpretable AI workflows that mirror how healthcare teams actually work. The orchestrator brings together Microsoft healthcare AI models—such as MedImageParse for image recognition, CXRReportGen for automated radiology reporting, and MedImageInsight for retrieval and similarity analysis—into a unified, task-aware system that lets developers build agents that reflect real-world healthcare decision-making patterns.

This work was led by Yu (Aiden) Gu, Principal Applied Scientist at Microsoft Research, who conceived the study, defined the research direction, and led the design and development of the Healthcare Agent Orchestrator proof-of-concept.

Healthcare Is Naturally Multi-Agent

Healthcare decision-making often requires synthesizing diverse data types—radiologic images, pathology slides, genetic markers, and unstructured clinical narratives—while reconciling differing expert perspectives. In a molecular tumor board, for instance, a radiologist might highlight a suspicious lesion on CT imaging, a pathologist may flag discordant biopsy findings, and a geneticist could identify a mutation pointing toward an alternate treatment path. Effective collaboration in these settings hinges not on isolated analysis, but on structured dialogue—where evidence is surfaced, assumptions are challenged, and hypotheses are iteratively refined.

To support the development of the Healthcare Agent Orchestrator, we partnered with a leading healthcare provider organization, which independently curated and de-identified a proprietary dataset comprising longitudinal patient records and real tumor board transcripts—capturing the complexity of multidisciplinary discussions. We provided guidance on the data types most relevant for evaluating agent coordination, reasoning handoffs, and task alignment in collaborative settings. We then applied LLM-based structuring techniques to convert the de-identified free-form transcripts into interpretable units, followed by expert review to ensure domain fidelity and relevance. This dataset provides a critical foundation for assessing agent coordination, reasoning handoffs, and task alignment in simulated collaborative settings.
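To illustrate the kind of LLM-based structuring step described above, here is a minimal, hypothetical sketch that asks an Azure OpenAI chat model to turn a de-identified transcript excerpt into structured discussion units. The deployment name, schema fields, and prompt wording are illustrative assumptions, not the production pipeline.

```python
import json
from openai import AzureOpenAI  # assumes the openai>=1.x package with Azure support

# Hypothetical configuration: endpoint, key, and deployment name are placeholders.
client = AzureOpenAI(
    azure_endpoint="https://<your-resource>.openai.azure.com",
    api_key="<your-api-key>",
    api_version="2024-06-01",
)

SYSTEM_PROMPT = (
    "You convert de-identified tumor board transcript excerpts into structured "
    "discussion units. Return JSON with a list named 'units', each unit containing: "
    "speaker_role, referenced_modality, finding, and proposed_action."
)

def structure_transcript(excerpt: str) -> list[dict]:
    """Ask the model to emit interpretable units for one transcript excerpt."""
    response = client.chat.completions.create(
        model="<your-gpt-deployment>",  # placeholder deployment name
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": excerpt},
        ],
    )
    return json.loads(response.choices[0].message.content).get("units", [])

units = structure_transcript(
    "Radiology notes a 2.1 cm lesion in segment VII; pathology reports discordant "
    "biopsy findings; genetics raises a variant relevant to trial eligibility."
)
print(units)
```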
Why General-Purpose LLMs Fall Short for Healthcare Collaboration

While general-purpose large language models have delivered remarkable results in many domains, they face key limitations in high-stakes healthcare environments:

Precision is critical: Even small hallucinations or inconsistencies can compromise safety and decision quality.
Multi-modal integration is required: Many healthcare decisions involve interpreting and correlating diverse data types—images, reports, structured records—much of which is not available in public training sets.
Transparency and traceability matter: Users must understand how conclusions are formed and be able to audit intermediate steps.

The Healthcare Agent Orchestrator addresses these challenges by pairing general reasoning capabilities with specialized agents that operate over imaging, genomics, and structured EHRs—ensuring grounded, explainable results aligned with clinical expectations. Each agent contributes domain-specific expertise, while the orchestrator ensures coherence, oversight, and explainability—resulting in outputs that are both grounded and verifiable.

Architecture: Coordinating Specialists Through Orchestration

Healthcare Agent Orchestrator.

The Healthcare Agent Orchestrator’s multi-agent framework is built on modular AI infrastructure, designed for secure, scalable collaboration:

Semantic Kernel: A lightweight, open-source development kit for building AI agents and integrating the latest AI models into C#, Python, or Java codebases. It acts as efficient middleware for rapidly delivering enterprise-grade solutions—modular, extensible, and designed to support responsible AI at scale.
Model Context Protocol (MCP): An open standard that enables developers to build secure, two-way connections between their data sources and AI-powered tools.
Magentic-One: Microsoft’s generalist multi-agent system for solving open-ended web and file-based tasks across domains—built on Microsoft AutoGen, our popular open-source framework for developing multi-agent applications.

Each agent is orchestrated within the system and integrated via Semantic Kernel’s group chat infrastructure, with support for communication and modular deployment via Azure. This orchestration ensures that each model—whether interpreting a lung nodule, analyzing a biopsy image, or summarizing a genomic variant—is applied precisely where its expertise is most relevant, without overloading a single system with every task. The modularity of the framework also future-proofs it: as new health AI models and tools emerge, they can be seamlessly incorporated into the ecosystem without disrupting existing workflows—enabling continuous innovation while maintaining clinical stability.

Microsoft’s Healthcare AI Models at the Core

The Healthcare Agent Orchestrator also enables developers to explore the capabilities of Microsoft’s latest healthcare AI models:

CXRReportGen: Integrates multimodal inputs—including current and prior X-ray images and report context—to generate grounded, interpretable radiology reports. The model has shown improved accuracy and transparency in automated chest X-ray interpretation, evaluated on both public and private data.
MedImageParse [3]: A biomedical foundation model for image parsing that can jointly conduct segmentation, detection, and recognition across nine imaging modalities.
MedImageInsight [4]: Facilitates fast retrieval of clinically similar cases and supports disease classification across a broad range of medical image modalities, accelerating second-opinion generation and diagnostic review workflows.

Each model can act as a specialized agent within the system, contributing focused expertise while allowing flexible, context-aware collaboration orchestrated at the system level. CXRReportGen is included in the initial release and supports the development and testing of grounded radiology report generation. Other Microsoft healthcare models such as MedImageParse and MedImageInsight are being explored in internal prototypes to expand the orchestrator’s capabilities across segmentation, detection, and image retrieval tasks.

Seamless Integration with Microsoft Teams

Rather than creating new silos, the Healthcare Agent Orchestrator integrates directly into the tools clinicians already use—specifically Microsoft Teams. Developers are investigating how clinicians can engage with agents through natural conversation, asking questions, requesting second opinions, or cross-validating findings—all without leaving their primary collaboration environment. This approach minimizes friction, improves user experience, and brings cutting-edge AI into real-world care settings.

Building Toward Robust, Trustworthy Multi-Agent Collaboration

Think of the orchestrator as managing a secure, structured group chat. Each participant is a specialized AI agent—such as a ‘Radiology’ agent, ‘PatientHistory’ agent, or ‘ClinicalTrials’ agent. At the center is the ‘Orchestrator’ agent, which moderates the interaction: assigning tasks, maintaining shared context, and resolving conflicting outputs. Agents can also communicate directly with one another, exchanging intermediate results or clarifying inputs. Meanwhile, the user can engage either with the orchestrator or with specific agents as needed.

Each agent is configured with instructions (the system prompt that guides its reasoning) and a description (used by both the UI and the orchestrator to determine when the agent should be activated). For example, the Radiology agent is paired with the cxr_report_gen tool, which wraps Microsoft’s CXRReportGen model for generating findings from chest X-ray images. Tools like this are declared under the agent’s tools field and allow it to call foundation models or other capabilities on demand—such as the clinical_trials tool [5] for querying ClinicalTrials.gov. Only one agent is marked as facilitator, designating it as the moderator of the conversation; in this scenario, the Orchestrator agent fills that role.

Early observations highlight that multi-agent orchestration introduces new complexities—even as it improves specialization and task alignment. To address these emergent challenges, we are actively evolving the framework across several dimensions:

Mitigating Error Propagation Across Agents: Ensuring that early-stage errors by one agent do not cascade unchecked through subsequent reasoning steps. This includes introducing critical checkpoints where outputs from key agents are verified before being consumed by others.
Optimizing Agent Selection and Specialization: Recognizing that more agents are not always better. Adding unnecessary or redundant agents can introduce noise and confusion.
We’ve implemented a systematic framework that emphasizes a few highly suited agents per task—dynamically selected based on case complexity and domain needs—while continuously tracking performance gains and catching regressions early.
Improving Transparency and Hand-off Clarity: Structuring agent interactions to make intermediate outputs and rationales visible, enabling developers (and the system itself) to trace how conclusions were reached, catch inconsistencies early, and intervene when necessary.

Adapting General Frameworks for Healthcare Complexity

Generic orchestration frameworks like Semantic Kernel provide a strong foundation—but healthcare demands more. The stakes are higher, the data more nuanced, and the workflows require precision, traceability, and regulatory compliance. Here’s how we’ve extended and adapted these systems to help address healthcare demands:

Precision and Safety: We introduced domain-aware verification checkpoints and task-specific agent constraints to reduce inappropriate tool usage—supporting more reliable reasoning. To help uphold the high standards required in healthcare, we defined two complementary metric systems (see Healthcare Agent Orchestrator Evaluation for more details):
Core Metrics: monitor healthcare agent selection accuracy, intent resolution, contextual relevance, and information aggregation.
RoughMetric: a composite score based on ROUGE that helps quantify the precision of generated outputs and conversation reliability.
TBFact: a modified version of RadFact [2] that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations.
Domain-Specific Tool Planning: Healthcare agents must reason across multimodal inputs—such as chest X-rays, CT slices, pathology images, and structured EHRs. We’ve customized Semantic Kernel’s tool invocation and planning modules to reflect clinical workflows, not generic task chains.

These infrastructure-level adaptations are designed to complement the Microsoft healthcare AI models—such as CXRReportGen, MedImageParse, and MedImageInsight—working together to enable coordinated, domain-aware reasoning across complex healthcare tasks.

Enabling Collaborative, Trustworthy AI in Healthcare

Healthcare demands AI systems that are as collaborative, adaptive, and trustworthy as the clinical teams they aim to support. The Healthcare Agent Orchestrator is a concrete step toward that vision—pairing specialized health AI models with a flexible, multi-agent coordination framework, purpose-built to reflect the complexity of real clinical decision-making. By aligning with existing healthcare workflows and enabling transparent, role-specific collaboration, this system shows promise to empower clinicians to work more effectively—with AI as a partner, not a replacement.

The Healthcare Multi-Agent Orchestrator and the Microsoft healthcare AI models are intended for research and development use. They are not designed or intended to be deployed in clinical settings as-is, nor are they intended for use in the diagnosis or treatment of any health or medical condition, and their performance for such purposes has not been established.
You bear sole responsibility and liability for any use of the Healthcare Multi-Agent Orchestrator or the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

References
[1] arXiv, Universal Abstraction: Harnessing Frontier Models to Structure Real-World Data at Scale, February 2, 2025
[2] arXiv, MAIRA-2: Grounded Radiology Report Generation, June 6, 2024
[3] Nature Methods, A foundation model for joint segmentation, detection and recognition of biomedical objects across nine modalities, November 18, 2024
[4] arXiv, MedImageInsight: An open-source embedding model for general domain medical imaging, October 9, 2024
[5] Machine Learning for Healthcare Conference, Scaling Clinical Trial Matching Using Large Language Models: A Case Study in Oncology, August 4, 2023

Ushering in the Next Era of Cloud-Native AI Capabilities for Radiology
Introducing Dragon Copilot, your AI companion for PowerScribe One

For radiologists, the reporting workflow of the future is here. At RSNA 2025 in Chicago, we’re showcasing Dragon Copilot, a cloud-native companion for PowerScribe One. Currently in preview, Dragon Copilot builds on the trusted capabilities of PowerScribe One to accelerate innovation and modernize reporting workflows while unlocking extensibility for radiology teams and partners.

Why we built it: Technical drivers for a new era

With growing demand for imaging services coupled with a workforce shortage, healthcare professionals face increased workloads and burnout while patients experience longer wait times. With our breadth of healthcare industry experience combined with our AI expertise and development at Microsoft, we immediately understood how we could help address these challenges. For radiologists, we sought to plug into existing reporting workflows with rapid innovation, scalable AI, and open extensibility.

How we built it: Modern architecture and extensibility

By delivering Dragon Copilot as a cloud-native solution built on Azure, we can enable new services globally. We apply the full capabilities of Azure for compute, storage, and security for high availability and compliance. Our modular architecture enables fast delivery of new features, with APIs at the core to allow seamless integration, extensibility, and partner innovation. To imbue the workflow with AI through our platform, we harness the latest generative, multimodal, and agentic AI (both internal and through our partners) to support clinical reporting, workflow automation, and decision support.

Key architectural highlights:

AI services: Integrated large language models (LLMs) and vision-language models (VLMs) for multimodal data processing.
API-first design: RESTful APIs expose core functions (draft report content generation, prior summarization, quality checks, and chat), enabling partners and developers to build extensions and custom workflows.
Extensibility framework: Open platform for 1st- and 3rd-party extensions, supporting everything from custom AI models to workflow agents.

Inside the innovation

Dragon Copilot alongside PowerScribe provides a unified AI experience. Radiologists can take advantage of the latest AI advancements without disruption to their workflows. They do not need another widget taking up room on their desktop. Instead, they need AI that fits seamlessly into existing workflows, connecting their data to the cloud. Our cloud-first approach brings increased reliability, stability, and performance to a radiologist’s workflow. I’m thrilled to highlight the key capabilities of this dynamic duo: PowerScribe One with Dragon Copilot.

Prior report summary: Automatically summarizes relevant prior reports, surfacing key findings and context for the current study.
AI-generated draft reports and quality checks: The most transformative aspect of Dragon Copilot is its open, extensible architecture for AI integration. We don’t limit radiology teams to a single set of AI tools. We enable seamless plug-ins for AI apps and agents from both Microsoft and our growing ecosystem of 3rd parties. We provide a single surface for all your AI needs. This approach will enable radiology departments to discover, acquire, and deploy new AI-powered extensions. We’re enthusiastic about embarking on this journey with partners.
We're also excited about collaborations with developers and academic innovators to bring their own AI models and services directly into the Dragon Copilot experience.
Integrated chat experience with credible knowledge sources and medical safeguards: This chat interface connects radiologists to credible, clinically validated sources from Radiopaedia and Radiology Assistant. It enables agentic orchestration and safeguards provided by Azure's Healthcare Agent Services for PHI and clinical accuracy. In the future, we expect to offer a variety of other sources for radiology customers to choose from, as well as the ability for organizations to add their own approved policies and protocols. The chat is designed to route questions to the right agent, provide evidence for claims, and filter responses for clinical validity. Over time, it will include extensions with custom agents powered by Copilot Studio.

Help us shape what's next

As we continue to evolve Dragon Copilot alongside PowerScribe One, we invite innovators, developer partners, and academics to join us in shaping the future of radiology workflow. Dragon Copilot is more than a product; it's a solution for rapid, responsible innovation in radiology. By combining cloud-native architecture, advanced AI capabilities, and open extensibility, we're enabling radiology teams to work smarter, faster, and with greater confidence. Ready to see it in action? Visit us at RSNA 2025 (November 30–December 4), booth #1311, South Hall, or contact our team to join the journey.

Protect patient privacy across languages with the de-identification service's preview expansion
Machine learning and analytics are transforming healthcare by streamlining clinical workflows, powering AI models, and unlocking new insights from patient data. These innovations are fueled by textual data rich in Protected Health Information (PHI). To be used for research, innovation, and operational improvements, this data must be responsibly de-identified to protect patient privacy. Manual de-identification can be slow, expensive, and error-prone, creating bottlenecks that delay progress and limit collaboration. De-identification is more than a compliance standard; it is the key to unlocking healthcare data’s full potential while maintaining patient privacy and trust.

Today, we are excited to announce the expansion of the Azure Health Data Services de-identification service to support five new preview language-locale combinations:

Spanish (United States)
German (Germany)
French (France)
French (Canada)
English (United Kingdom)

This language expansion enables global healthcare organizations to unlock insights from data beyond English while continuing to adhere to regulatory standards.

Why Language Support Matters

Healthcare data is generated in many languages around the world, and each one comes with its own linguistic structure, formatting, and privacy considerations. By expanding support to multiple preview languages such as Spanish, French, German, and English, our de-identification service allows organizations to unlock data from a broader range of countries and regions. But language alone isn’t the whole story. Different locales within the same language (French in France vs. Canada, or English in the UK vs. the US) often format PHI in unique ways. Addresses, medical institutions, and identifiers can all look different depending on the region. Our service is designed to recognize and accurately de-identify these locale-specific patterns, supporting privacy and compliance wherever the data originates.

How It Works

The Azure Health Data Services de-identification service empowers healthcare organizations to protect patient data through three key operations:

TAG detects and annotates PHI in unstructured text.
REDACT obfuscates PHI to prevent exposure.
SURROGATE replaces PHI with realistic, synthetic surrogates, preserving data utility while ensuring privacy.

Our service leverages state-of-the-art machine learning models to identify and handle sensitive information, supporting compliance with HIPAA’s Safe Harbor standard and unlinked pseudonymization aligned with GDPR principles. By maintaining entity consistency and temporal relationships, organizations can use de-identified data for research, analytics, and machine learning without compromising patient privacy.

Unlocking New Use Cases

By expanding the service's language support, organizations can now address some of the most pressing data challenges in healthcare:

Reduce organizational liability by meeting evolving privacy standards.
Enable secure data sharing across institutions and regions.
Unlock AI opportunities by training models on multilingual, de-identified data.
Share de-identified data across institutions to create larger, more diverse datasets.
Conduct longitudinal research while preserving patient privacy.

Proven Accuracy

Researchers at the University of Oxford recently conducted a comprehensive comparative study evaluating multiple automated de-identification systems across 3,650 UK hospital records. Their analysis compared both task-specific transformer models and general-purpose large language models.
The Azure Health Data Services de-identification service achieved the highest overall performance among the nine evaluated tools, demonstrating a recall score of 0.95. The study highlights how robust de-identification enables large-scale, privacy-preserving EHR research and supports the responsible use of AI in healthcare. Read the full study here: Benchmarking transformer-based models for medical record deidentification.

Preview: Your Feedback Matters

This multilingual feature is now available in preview. We invite healthcare organizations, research institutions, and clinicians to:

Try it out: Overview of the de-identification service in Azure Health Data Services | Microsoft Learn.
Provide feedback to help refine the service: Azure Health Data Services multilingual de-identification service feedback – fill out the form.

Join us in shaping the future of privacy-preserving healthcare innovation. At Microsoft, we are committed to helping healthcare providers, payors, researchers, and life sciences companies unlock the value of data while maintaining the highest standards of patient privacy. The Azure Health Data Services de-identification service empowers organizations to accelerate AI and analytics initiatives safely, supporting innovation and improving patient outcomes across the healthcare ecosystem. Explore Azure Health Data Services to see how our solutions help organizations transform care, research, and operational efficiency.

Fine-Tuning Healthcare AI Models: Custom Segmentation for Your Healthcare Data
This post is part of our healthcare AI fine-tuning series:

MedImageInsight Fine-Tuning - Embeddings and classification
MedImageParse Fine-Tuning - Segmentation and spatial understanding (you are here)
CXRReportGen Fine-Tuning - Clinical findings generation

Introduction

MedImageParse now supports fine-tuning, allowing you to adapt Microsoft’s open-source biomedical foundation model to your healthcare use cases and data. Adapting the model can take as little as an hour to add new segmentation targets, add new modalities, or boost performance significantly on your data. We’ll demonstrate how we achieved large performance gains across multiple metrics on a public dataset.

Biomedical clinical applications often need highly specialized models, but training one from scratch is expensive and data-intensive. Traditional approaches require thousands of annotated images, weeks of compute time, and deep machine learning expertise just to get started. Fine-tuning offers a practical alternative. By starting with a strong foundation model and adapting it to your specific domain, you can achieve production-ready performance with hundreds of examples and hours of training time. Everything you need to start fine-tuning is available now, including a ready-to-use AzureML pipeline, complete workflow notebooks, and deployment capabilities.

We fine-tuned MedImageParse on the CDD-CESM mammography dataset (the specialized CESM modality for lesion segmentation) to demonstrate domain adaptation on data under-represented in pre-training. Follow along: the complete example is in our GitHub repository as a ready-to-run notebook.

What is MedImageParse?

MedImageParse (MIP) is Microsoft’s open-source implementation of BiomedParse that comes with a permissive MIT license and is designed for integration into commercial products. It is a powerful and flexible foundation model for text-prompted medical imaging segmentation. MIP accepts an image and one or more prompts (e.g., “neoplastic cells in breast pathology” or “inflammatory cells”), then accurately identifies and segments the corresponding structures within the input image. Trained on a wide range of biomedical imaging datasets and tasks, MIP captures robust feature representations that are highly transferable to new domains. Furthermore, it operates efficiently on a single GPU, making it a practical tool for research laboratories without extensive computational resources. Built with adaptability in mind, the model can be fine-tuned on your own datasets to refine segmentation targets, accommodate unique imaging modalities, or improve performance on local data distributions. Its modest computational footprint, paired with this flexibility, positions MIP as a strong starting point for custom medical imaging solutions.

When to Fine-tune (and When NOT to)

Fine-tuning can transform MedImageParse into your own clinical asset that is aligned with your institution’s needs. But how do you know if that’s the right approach for your use case? Fine-tuning makes sense when you’re working with specialized imaging protocols (custom equipment or acquisition parameters), rare structures not well represented in general datasets, or when you need high precision for quantitative analysis. You’ll need some high-quality annotated examples to see meaningful improvements; more is better, but thousands aren’t required. Simpler approaches might work instead if the pre-trained model already performs reasonably well on standard anatomies and common pathologies; the sketch below shows one quick way to check.
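As a quick sanity check before investing in fine-tuning, you might score a handful of representative images against a deployed base-model endpoint. The sketch below is a hedged illustration: the endpoint name, request-file path, and payload schema are assumptions (the exact input format depends on the deployed model's scoring script), not a definitive API reference.

```python
import base64
import json
from azure.identity import DefaultAzureCredential
from azure.ai.ml import MLClient

# Connect to the AzureML workspace that hosts the (assumed) deployed endpoint.
ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# Hypothetical request: one base64-encoded image plus a text prompt.
# The column names below follow a common AzureML scoring pattern and are assumptions.
with open("sample_cesm_image.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

request = {
    "input_data": {
        "columns": ["image", "text"],
        "data": [[image_b64, "neoplastic cells in breast pathology"]],
    }
}
with open("request.json", "w") as f:
    json.dump(request, f)

# Invoke the placeholder base-model endpoint and inspect the returned masks.
response = ml_client.online_endpoints.invoke(
    endpoint_name="medimageparse-base",  # placeholder endpoint name
    request_file="request.json",
)
print(json.loads(response))
```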
If you’re still in exploratory mode figuring out what to measure, start with the base model first to establish a strong baseline for your use case. Our example shows how fine-tuning can deliver significant performance gains even with modest resources. With about one hour of GPU time and 200-500 annotated images, fine-tuning showed a significant improvement across multiple metrics.

The Fine-tuning Pipeline: From Data to Deployed Model

To demonstrate fine-tuning in action, we used the CDD-CESM mammography dataset: a collection of Contrast-Enhanced Spectral Mammography (CESM) images with expert-annotated breast lesion masks. CESM is a specialized imaging modality that wasn’t well represented in MedImageParse’s original training data. The dataset [1] (which can be downloaded from our HuggingFace location or from its original TCIA page) includes predefined splits with high-quality segmentation annotations.

Why AzureML Pipelines?

Before diving into the workflow, it’s worth understanding why we use AzureML pipelines for this process. Every experiment is tracked with full versioning; you always know exactly what you ran and can reproduce results months later. The pipeline handles multi-GPU distribution automatically without code changes, making it easy to scale up. The modular design lets you mix and match components for your specific needs: swap data preprocessing, adjust training parameters, or change deployment strategies independently. Training metrics, validation curves, and resource utilization are logged automatically, giving you full visibility into the process. Learn more about Azure ML pipelines.

Fine-Tuning Workflow

Setup: Upload data and configure compute

The first step uploads your training data and configuration to AzureML as versioned assets. You’ll configure a GPU compute cluster (H100 or A100 instances recommended) that will handle the training workload.

```python
# Create and upload training data folder
training_data = Data(
    path="CDD-CESM",
    type=AssetTypes.URI_FOLDER,
    description=f"{name} training data",
    name=f"{name}-training_data",
)
training_data = ml_client.data.create_or_update(training_data)

# Create and upload parameters file
parameters = Data(
    path="parameters.yaml",
    type=AssetTypes.URI_FILE,
    description=f"{name} parameters",
    name=f"{name}-parameters",
)
parameters = ml_client.data.create_or_update(parameters)
```

Fine-tuning: The medimageparse_finetune component

The fine-tuning component takes three inputs:

The pre-trained MedImageParse model (foundation weights)
Your annotated dataset
Training configuration (learning rate, batch size, augmentation settings)

During training the pipeline applies augmentation, tracks validation metrics, and checkpoints periodically. The output is an MLflow-packaged model: a portable artifact that includes the model weights and preprocessing code, ready to deploy in AzureML or AI Foundry. The pipeline uses parameter-efficient fine-tuning techniques to adapt the model while preserving the broad knowledge from pre-training. This means you get specialized performance without catastrophic forgetting of the base model’s capabilities.
```python
# Get the pipeline component
finetune_pipeline_component = ml_registry.components.get(
    name="medimageparse_finetune", label="latest"
)

# Get the latest MIP model
model = ml_registry.models.get(name="MedImageParse", label="latest")

# Create the pipeline
@pipeline(name="medimageparse_finetuning" + str(random.randint(0, 100000)))
def create_pipeline():
    mip_pipeline = finetune_pipeline_component(
        pretrained_mlflow_model=model.id,
        data=data_assets["training_data"].id,
        config=data_assets["parameters"].id,
    )
    return {"mlflow_model_folder": mip_pipeline.outputs.mlflow_model_folder}

# Submit the pipeline
pipeline_object = create_pipeline()
pipeline_object.compute = compute.name
pipeline_object.settings.continue_on_step_failure = False
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_object, experiment_name="medimageparse_finetune_experiment"
)
```

Deployment: Register and serve the model

After training, the model can be registered in your AzureML workspace with version tracking. From there, deployment to a managed online endpoint takes a single command. The endpoint provides a scalable REST API backed by GPU compute for optimal inference performance.

```python
# Register the model
run_model = Model(
    path=f"azureml://jobs/{pipeline_job.name}/outputs/mlflow_model_folder",
    name=f"MIP-{name}-{pipeline_job.name}",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
run_model = ml_client.models.create_or_update(run_model)

# Create an endpoint and deployment for the fine-tuned model
endpoint = ManagedOnlineEndpoint(name=name)
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name=name,
    endpoint_name=endpoint.name,
    model=run_model.id,
    instance_type="Standard_NC40ads_H100_v5",
    instance_count=1,
)
deployment = ml_client.online_deployments.begin_create_or_update(deployment).result()
```

Testing: Text-prompted inference

With the endpoint deployed, you can send test images along with text prompts describing what to segment. For the CDD-CESM example, we use the text prompts “neoplastic cells in breast pathology & inflammatory cells”. The model returns multiple segmentation masks for different detected regions. Text-prompting lets you switch focus on the fly (e.g., “tumor boundary” vs. “inflammatory infiltration”) without retraining or reconfiguring the model.

Results

Fine-tuning made a substantial difference in how well the model works. The Dice score, which shows how closely the model’s results match the actual regions, more than doubled, from 0.198 to 0.486. The IoU, another measure of overlap, nearly tripled, going from 0.139 to 0.383. Sensitivity jumped from 0.251 to 0.535, which means the model found more true positives.

| Metric | Base | Fine-tuned | Δ Abs | Δ Rel |
|---|---|---|---|---|
| Dice (F1) | 0.198 | 0.486 | +0.288 | +145% |
| IoU | 0.139 | 0.383 | +0.244 | +176% |
| Sensitivity | 0.251 | 0.535 | +0.284 | +113% |
| Specificity | 0.971 | 0.987 | +0.016 | +1.6% |
| Accuracy | 0.936 | 0.963 | +0.027 | +2.9% |

These improvements really matter in practice. When the Dice and IoU scores go up, it means the model is better at outlining the exact shape and size of problem areas, which helps doctors get more accurate measurements and track changes over time. The jump in sensitivity means the model is finding more actual lesions, while keeping specificity above 98% ensures there aren’t a lot of false alarms. The improvement in accuracy is welcome, but the larger gains in overlap and recall matter most for getting precise results in medical images.
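For readers who want to reproduce these kinds of numbers on their own validation sets, here is a small, self-contained sketch (not taken from the fine-tuning notebook) that computes Dice, IoU, sensitivity, specificity, and accuracy from pairs of binary masks using NumPy.

```python
import numpy as np

def segmentation_metrics(pred: np.ndarray, target: np.ndarray) -> dict:
    """Compute overlap metrics for a pair of binary masks of the same shape."""
    pred = pred.astype(bool)
    target = target.astype(bool)

    tp = np.logical_and(pred, target).sum()
    fp = np.logical_and(pred, ~target).sum()
    fn = np.logical_and(~pred, target).sum()
    tn = np.logical_and(~pred, ~target).sum()

    eps = 1e-8  # guard against division by zero on empty masks
    return {
        "dice": 2 * tp / (2 * tp + fp + fn + eps),
        "iou": tp / (tp + fp + fn + eps),
        "sensitivity": tp / (tp + fn + eps),
        "specificity": tn / (tn + fp + eps),
        "accuracy": (tp + tn) / (tp + tn + fp + fn + eps),
    }

# Toy example: a predicted mask that partially overlaps the ground-truth lesion.
target = np.zeros((64, 64), dtype=np.uint8)
target[16:48, 16:48] = 1
pred = np.zeros_like(target)
pred[24:56, 24:56] = 1
print({k: round(v, 3) for k, v in segmentation_metrics(pred, target).items()})
```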
Try It on Your Own Data

To successfully implement this solution in your organization, focus first on the core requirements and resources that will ensure a seamless transition. The following section outlines these essential steps so you can move efficiently from planning to deployment and set your team up for optimal results.

Dataset size: Start with 200-500 annotated images. This is enough to see meaningful performance improvements without requiring massive data collection efforts. More data generally helps, but you don’t need thousands of examples to get started.
Annotation quality: High-quality segmentation masks are critical. Invest in precise boundary delineations (pixel-level accuracy where possible), consistent annotation protocols across all images, and quality control reviews to catch and correct errors.
Annotation effort: Budget enough time per image for careful annotation. Apply active learning approaches to focus effort on the most informative samples, and start with a smaller pilot dataset (100-150 images) to validate the approach before scaling up.
Training compute: A100 or H100 recommended (one device with multiple GPUs is sufficient for runs of a few hundred images). For the CDD-CESM dataset, we used NC-series VMs (single-node) with 8 GPUs, and training on 300 images took around 30 minutes for 10 epochs. If you’re training on larger datasets (thousands of images), consider upgrading to ND-series VMs, which offer better multi-node performance and allow you to train on large volumes of data faster.

Where to Go from Here?

So, what does this mean for your workflows and clinical teams? Foundation models like MedImageParse provide significant power and performance. They’re flexible, with text-prompted multi-task capabilities that can integrate into existing workflows without retooling, and are relatively cheap to use for inference. This means faster review, more precise assessments, and independence from vendor development timelines. But these models are not adapted to your institution and use cases out of the box, and developing a foundation model from scratch is prohibitively expensive. Fine-tuning bridges that gap: you can boost performance on your data and adapt the model to your use case at a fraction of the cost. You control what the model learns, how it fits your workflow, and its validation for your context. We’ve provided the complete tools to do that: the fine-tuning notebook walks through the entire process, from data preparation to deployment. By following this workflow and collecting annotated data from your institution (see “Try It on Your Own Data” above for requirements), you can deploy MedImageParse tailored to your institution and use cases.

References
[1] Khaled R., Helal M., Alfarghaly O., Mokhtar O., Elkorany A., El Kassas H., Fahmy A. Categorized Digital Database for Low energy and Subtracted Contrast Enhanced Spectral Mammography images [Dataset]. (2021) The Cancer Imaging Archive. DOI: 10.7937/29kw-ae92 https://www.cancerimagingarchive.net/collection/cdd-cesm/

Microsoft Azure continues to expand scalability for Healthcare EHR Workloads
Microsoft Azure has reached a new milestone for Epic Chronicles Operational Database (ODB) scalability with the Standard_M416bs_v3 (Mbv3) VM. It can now scale up to 110 million GRefs/s (Global References per second) in the ECP configuration and up to 39 million GRefs/s in the SMP configuration, improving upon the previous Azure benchmarks of 65 million GRefs/s and 20 million GRefs/s respectively. Microsoft Azure can now host 96% of the Epic customer base, enabling healthcare organizations to run their EHR systems on Azure.

New VM Size Purpose-Built for Epic’s Chronicles ODB

The Standard_M416bs_v3 VM, newly added to Azure’s Mbv3 series, is purpose-built to meet the growing performance and scalability demands of large healthcare EHR environments. With higher CPU capacity, expanded memory, and improved remote storage throughput, it delivers the reliability needed for mission-critical workloads at scale. Key specifications include:

Mbv3 Processor Performance: Built on 4th Gen Intel® Xeon® Scalable processors, the Mbv3 series is optimized for high memory and storage performance, supporting workloads up to 4 TB of RAM with an NVMe interface for faster remote disk access.
Compute Capacity: The Standard_M416bs_v3 delivers 416 vCPUs - more than twice the capacity of previous Mbv3 sizes, delivering stronger performance.
Storage Performance: Achieves up to 550,000 IOPS and 10 GBps remote disk bandwidth using Azure Ultra Disk.
Performance Optimization: Enhanced by Azure Boost, the M416bs_v3 provides low-latency, high remote storage performance, making it ideal for storage-throughput-intensive applications such as Epic ODB, relational databases, and analytics workloads.
Available Regions: M416bs_v3 is available in four regions - East US, East US 2, Central US, and West US 2.

Explore Epic on Azure to learn more. Epic and Chronicles are trademarks of Epic Systems Corporation.

Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology
Introduction

As the use of diagnostic 3D images increases, effective management and analysis of these large volumes of data grows in importance. Medical 3D image search systems can play a vital role by enabling clinicians to quickly retrieve relevant or similar images and cases based on the anatomical features and pathologies present in a query image. Unlike traditional 2D imaging, 3D imaging offers a more comprehensive view for examining anatomical structures from multiple planes with greater clarity and detail. This enhanced visualization has the potential to assist doctors with improved diagnostic accuracy and more precise treatment planning. Moreover, advanced 3D image retrieval systems can support evidence-based and cohort-based diagnostics, demonstrating an opportunity for more accurate predictions and personalized treatment options. These systems also hold significant potential for advancing research, supporting medical education, and enhancing healthcare services.

This blog offers guidance on using Azure AI Foundry and the recently launched healthcare AI models to design and test a 3D image search system that can retrieve similar radiology images from a large collection of 3D images. Along with this blog, we share a Jupyter notebook with the 3D image search system code, which you may use to reproduce the experiments presented here or start your own solution.

3D Image Search Notebook: http://aka.ms/healthcare-ai-examples-mi2-3d-image-search

It is important to highlight that the models available in the AI Foundry model catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.

The Problem

Generally, the problem of 3D image search can be posed as retrieving cross-sectional (CS) imaging series (3D image results) that are similar to a given CS imaging series (query 3D image). Once posed this way, the key question becomes how to define such similarity. In the previous blog of this series, we worked with radiographs of the chest, which constrained the notion of "similar" to the similarity between two 2D images and a certain class of anatomy. In the case of 3D images, we are dealing with a volume of data and many more variations of anatomy and pathologies, which expands the dimensions to consider for similarity; e.g., are we looking for similar anatomy? Similar pathology? Similar exam type? In this blog, we will discuss a technique to approximate the 3D similarity problem through a 2D image embedding model and some amount of supervision to constrain the problem to a certain class of pathologies (lesions) and cast it as "given a cross-sectional MRI image, retrieve series with similar grade of lesions in similar anatomical regions".

To build a search system for 3D radiology images using a foundation model (MedImageInsight) designed for 2D inputs, we explore generating representative 3D embedding vectors for the volumes from the foundation model embeddings of their 2D slices, and using them to create a vector index over a large collection of 3D images. Retrieving relevant results for a given 3D image then consists of generating a representative 3D embedding vector for the query image and searching for similar vectors in the index. An overview of this process is illustrated in Figure 1.
Figure 1: Overview of the 3D image search process.

The Data

In the sample notebook that is provided alongside this blog, we use 3D CT images from the Medical Segmentation Decathlon (MSD) dataset [2-3] and annotations from the 3D-MIR benchmark [4]. The 3D-MIR benchmark offers four collections (Liver, Colon, Pancreas, and Lung) of positive and negative examples created from the MSD dataset, with additional annotations for the lesion flag (with/without lesion) and lesion group (1, 2, 3). The lesion grouping focuses on lesion morphology and distribution and considers the number, length, and volume of the lesions to define the three groups. It also adheres to the American Joint Committee on Cancer's Tumor, Node, Metastasis classification system’s recommendations for classifying cancer stages and provides a standardized framework for correlating lesion morphology with cancer stage. We selected the 3D-MIR Pancreas collection.

3D-MIR Benchmark: https://github.com/abachaa/3D-MIR

Since the MSD collections only include unhealthy/positive volumes, each 3D-MIR collection was augmented with volumes randomly selected from the other datasets to integrate healthy/negative examples in the training and test splits. For instance, the Pancreas dataset was augmented using volumes from the Colon, Liver, and Lung datasets. The input images consist of CT volumes and associated 2D slices. The training set is used to create the index, and the test set is used to query and evaluate the 3D search system.

3D Image Retrieval

Our search strategy, called volume-based retrieval, relies on aggregating the embeddings of the 2D slices of a volume to generate one representative 3D embedding vector for the whole volume. We describe additional search strategies in our 3D-MIR paper [4]. The 2D slice embeddings are generated using the MedImageInsight foundation model [5-6] from the Azure AI Foundry model catalog [1]. In the search step, we generate the embeddings of the 3D query volumes according to the selected aggregation method (Agg) and search for the top-k similar volumes/vectors in the corresponding 3D (Agg) index. We use the Median aggregation method to generate the 3D vectors and create the associated 3D index. Three other aggregation methods are available in the 3D image search notebook: Max Pooling, Average Pooling, and Standard Deviation. We construct the 3D (Median) index using the training slices/volumes from the 3D-MIR Pancreas collection.

The search follows the k-Nearest Neighbors algorithm (k-NN search): we find the k nearest neighbors of a given vector by calculating the distances between the query vector and all other vectors in the collection, then selecting the k vectors with the shortest distances. If the collection is large, this computation can be expensive, and it is recommended to use dedicated libraries for optimization. We use the FAISS (Facebook AI Similarity Search) library, an open-source library for efficient similarity search and clustering of high-dimensional vectors. A minimal sketch of this aggregation-and-search flow appears below, after the evaluation setup.

Evaluation of the search results

The 3D-MIR Pancreas test set consists of 32 volumes:

4 volumes with no lesion (lesion flag/group = -1)
3 volumes with lesion group 1
19 volumes with lesion group 2
6 volumes with lesion group 3

The training set consists of 269 volumes (with and without lesions) and was used to create the index. We evaluate the 3D search system by comparing the lesion group/category of the query volume and the top 10 retrieved volumes. We then compute Precision@k (P@k).
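The sketch below illustrates the volume-based retrieval flow described above: median-aggregate per-slice embeddings into one vector per volume, index the training volumes with FAISS, run a k-NN query, and score the result with Precision@k. It is a simplified, self-contained illustration in which random arrays stand in for MedImageInsight embeddings; it is not the notebook code.

```python
import numpy as np
import faiss

rng = np.random.default_rng(0)
dim = 1024  # MedImageInsight produces 1024-dimensional embeddings

def volume_embedding(slice_embeddings: np.ndarray) -> np.ndarray:
    """Aggregate per-slice embeddings (n_slices x dim) into one 3D vector via the median."""
    return np.median(slice_embeddings, axis=0)

# Placeholder training data: 100 volumes with varying slice counts and lesion-group labels.
train_vectors = np.stack(
    [volume_embedding(rng.standard_normal((int(rng.integers(40, 120)), dim))) for _ in range(100)]
).astype("float32")
train_labels = rng.choice([-1, 1, 2, 3], size=100)  # -1 = no lesion, 1-3 = lesion groups

# Build a flat L2 index over the aggregated training vectors.
index = faiss.IndexFlatL2(dim)
index.add(train_vectors)

# Query with one test volume and retrieve its k nearest neighbors.
k = 10
query = volume_embedding(rng.standard_normal((80, dim))).astype("float32")[None, :]
distances, neighbor_ids = index.search(query, k)

# Precision@k: fraction of retrieved volumes sharing the query's lesion group.
query_label = 2
retrieved_labels = train_labels[neighbor_ids[0]]
precision_at_k = float(np.mean(retrieved_labels == query_label))
print(f"P@{k} = {precision_at_k:.2f}")
```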
Table 1 presents P@1, P@3, P@5, P@10, and overall precision.

Table 1: Evaluation results on the 3D-MIR Pancreas test set

The system accurately recognizes healthy cases, consistently retrieving the correct label in test scenarios involving non-lesion pancreas images. However, performance varies across lesion groups, reflecting challenges in precisely identifying smaller lesions (Group 1) or more advanced lesions (Group 3). This discrepancy highlights the complexity of lesion detection and underscores the importance of carefully tuning embeddings or adjusting the vector index to improve retrieval accuracy for specific lesion sizes.

Visualization

Figure 2 presents four different test queries from the Pancreas test set and the top 5 nearest neighbors retrieved by the volume-based search method. In each row, the first image is the query, followed by the retrieved images ranked by similarity. The visual overlays help in assessing retrieval accuracy: blue indicates the pancreas organ boundaries, and red highlights the regions corresponding to the pancreas tumor.

Figure 2: Top 5 results for different queries from the Pancreas test set

Table 2 presents additional results of the volume-based retrieval system [4] on the other 3D-MIR datasets/organs (Liver, Colon, and Lung) using additional foundation models: BiomedCLIP [7], Med-Flamingo [8], and BiomedGPT [9]. When considering the macro-average across all datasets, MedImageInsight-based retrieval substantially outperforms the other foundation models.

Table 2: Evaluation results on the 3D-MIR benchmark (Liver, Colon, Pancreas, and Lung)

These results mirror a use case akin to lesion detection and severity measurement in a clinical context. In real-world applications—such as diagnostic support or treatment planning—it may be necessary to optimize the model to account for particular goals (e.g., detecting critical lesions early) or accommodate different imaging protocols. By refining search criteria, integrating more domain-specific data, or adjusting embedding methods, practitioners can enhance retrieval precision and better meet clinical requirements.

Conclusion

The integration of 3D image search systems in clinical environments can enhance and accelerate the retrieval of similar cases and provide better context to clinicians and researchers for accurate complex diagnoses, cohort selection, and personalized patient care. This 3D radiology image search blog and the related notebook offer a solution based on 3D embedding generation for building and evaluating a 3D image search system using the MedImageInsight foundation model from the Azure AI Foundry model catalog.

References
[1] Model catalog and collections in Azure AI Foundry portal. https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
[2] Michela Antonelli et al. The medical segmentation decathlon. Nature Communications, 13(4128), 2022. https://www.nature.com/articles/s41467-022-30695-9
[3] MSD: http://medicaldecathlon.com/
[4] Asma Ben Abacha, Alberto Santamaría-Pang, Ho Hin Lee, Jameson Merkow, Qin Cai, Surya Teja Devarakonda, Abdullah Islam, Julia Gong, Matthew P. Lungren, Thomas Lin, Noel C. F. Codella, Ivan Tarapov: 3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology. CoRR abs/2311.13752, 2023. https://arxiv.org/abs/2311.13752
[5] Noel C. F.
Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542, 2024. https://arxiv.org/abs/2410.06542
[6] MedImageInsight: https://aka.ms/mi2modelcard
[7] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Angela Crabtree, Brian Piening, Carlo Bifulco, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. NEJM AI 2025; 2(1). https://ai.nejm.org/doi/full/10.1056/AIoa2400640
[8] Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-Flamingo: a multimodal medical few-shot learner. Machine Learning for Health, ML4H@NeurIPS 2023, December 10, 2023, New Orleans, Louisiana, USA. Proceedings of Machine Learning Research, vol. 225, pp. 353-367. PMLR, 2023. https://proceedings.mlr.press/v225/moor23a.html
[9] Zhang, K., Zhou, R., Adhikarla, E., Yan, Z., Liu, Y., Yu, J., Liu, Z., Chen, X., Davison, B.D., Ren, H., et al.: A generalist vision-language foundation model for diverse biomedical tasks. Nature Medicine, 1-13 (2024). https://www.nature.com/articles/s41591-024-03185-2

Image Search Series
Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub

The Microsoft healthcare AI models are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is, nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology
Introduction

Dermatology is inherently visual, with diagnosis often relying on morphological features such as color, texture, shape, and spatial distribution of skin lesions. However, the diagnostic process is complicated by the large number of dermatologic conditions, with over 3,000 identified entities, and the substantial variability in their presentation across different anatomical sites, age groups, and skin tones. This phenotypic diversity presents significant challenges, even for experienced clinicians, and can lead to diagnostic uncertainty in both routine and complex cases.

Image-based retrieval systems represent a promising approach to address these challenges. By enabling users to query large-scale image databases using a visual example, these systems can return semantically or visually similar cases, offering useful reference points for clinical decision support. However, dermatology image search is uniquely demanding. Systems must exhibit robustness to variations in image quality, lighting, and skin pigmentation while maintaining high retrieval precision across heterogeneous datasets. Beyond clinical applications, scalable and efficient image search frameworks provide valuable support for research, education, and dataset curation. They enable automated exploration of large image repositories, assist in selecting challenging examples to enhance model robustness, and promote better generalization of machine learning models across diverse populations.

In this post, we continue our series on using healthcare AI models in Azure AI Foundry to create efficient image search systems. We explore the design and implementation of such a system for dermatology applications. As a baseline, we first present an adapter-based classification framework for dermatology images that leverages fixed embeddings from the MedImageInsight foundation model, available in the Azure AI Foundry model catalog. We then introduce a Retrieval-Augmented Generation (RAG) method that enhances vision-language models through similarity-based in-context prompting. We use the MedImageInsight foundation model to generate image embeddings and retrieve the top-k visually similar training examples via FAISS. The retrieved image-label pairs are included in the Vision-LLM prompt as in-context examples. This targeted prompting guides the model using visually and semantically aligned references, enhancing prediction quality on fine-grained dermatological tasks.

It is important to highlight that the models available in the AI Foundry model catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.

The Data

The DermaVQA-IIYI [2] dermatology image dataset is a de-identified, diverse collection of nearly 1,000 patient records and nearly 3,000 dermatological images, created to support research in skin condition recognition, classification, and visual question answering.
DermaVQA-IIYI dataset: https://osf.io/72rp3/files/osfstorage (data/iiyi)

The dataset is split into three subsets:

Training set: 2,474 images associated with 842 patient cases
Validation set: 157 images associated with 56 cases
Test set: 314 images associated with 100 cases
Total records: 2,945 images (998 patient cases)

Patient demographics (out of 998 patient cases):

Sex – F: 218, M: 239, UNK: 541
Age (available for 398 patients): mean: 31 yrs | min: 0.08 yrs | max: 92 yrs

This wide range supports studies across all age groups, from infants to the elderly. A total of 2,945 images are associated with the patient records, with an average of 2.9 images per patient. This multiplicity enables the study of skin conditions from different perspectives and at various stages.

Image count per entry:

1 image: 225 patients
2 images: 285 patients
3 images: 200 patients
4 or more images: 288 patients

The dataset includes additional annotations for anatomic location, comprising 39 distinct labels (e.g., back, fingers, fingernail, lower leg, forearm, eye region, unidentifiable). Each image is associated with one or multiple labels. We use these annotations to evaluate the performance of various methods across different anatomical regions.

Image Embeddings

We generate image embeddings using the MedImageInsight foundation model [1] from the Azure AI Foundry model catalog [3]. We apply Uniform Manifold Approximation and Projection (UMAP) to project the high-dimensional image embeddings produced by the MedImageInsight model into two dimensions. The visualization is generated from embeddings extracted from both the DermaVQA training and test sets, which cover 39 anatomical regions. For clarity, only the most frequent anatomical labels are displayed in the projection.

Figure 1: UMAP projection of image embeddings produced by the MedImageInsight model on the DermaVQA dataset.

The resulting projection reveals that the MedImageInsight model captures meaningful anatomical distinctions: visually distinct regions such as fingers, face, fingernail, and foot form well-separated clusters, indicating high intra-class consistency and inter-class separability. Other anatomically adjacent or visually similar regions, such as back, arm, and abdomen, show moderate overlap, which is expected due to shared visual features or potential labeling ambiguity. Overall, the embeddings exhibit a coherent and interpretable organization, suggesting that the model has learned to encode both local and global anatomical structures. This supports the model's effectiveness in capturing anatomy-specific representations suitable for downstream tasks such as classification and retrieval.

Enhancing Visual Understanding

We explore two strategies for enhancing visual understanding through foundation models.

I. Training an Adapter-based Classifier

We build an adapter-based classification framework designed for efficient adaptation to medical imaging tasks (see our prior posts for an introduction to the topic of adapters: Unlocking the Magic of Embedding Models: Practical Patterns for Healthcare AI | Microsoft Community Hub). The proposed adapter model builds upon fixed visual features extracted from the MedImageInsight foundation model, enabling task-specific fine-tuning without requiring full model retraining. The architecture consists of three main components:

MLP Adapter: A two-layer feedforward network that projects 1024-dimensional embeddings (generated by the MedImageInsight model) into a 512-dimensional latent space.
Enhancing Visual Understanding
We explore two strategies for enhancing visual understanding through foundation models.
I. Training an Adapter-based Classifier
We build an adapter-based classification framework designed for efficient adaptation to medical imaging tasks (see our prior posts for an introduction to the topic of adapters: Unlocking the Magic of Embedding Models: Practical Patterns for Healthcare AI | Microsoft Community Hub). The proposed adapter model builds upon fixed visual features extracted from the MedImageInsight foundation model, enabling task-specific fine-tuning without requiring full model retraining. The architecture consists of three main components:
MLP Adapter: A two-layer feedforward network that projects 1024-dimensional embeddings (generated by the MedImageInsight model) into a 512-dimensional latent space. This module utilizes GELU activation and Layer Normalization to enhance training stability and representational capacity. As a bottleneck adapter, it facilitates parameter-efficient transfer learning.
Convolutional Retrieval Module: A sequence of two 1D convolutional layers with GELU activation, applied to the output of the MLP adapter. This component refines the representations by modeling local dependencies within the transformed feature space.
Prediction Head: A linear classifier that maps the 512-dimensional refined features to the task-specific output space (e.g., 39 dermatology classes).
The classifier is trained for 10 epochs (approximately 48 seconds) using only CPU resources. Built on fixed image embeddings extracted from the MedImageInsight model, the adapter efficiently tailors these representations for downstream classification tasks with minimal computational overhead. By updating only the adapter components, while keeping the MedImageInsight backbone frozen, the model significantly reduces computational and memory overhead. This design also mitigates overfitting, making it particularly effective in medical imaging scenarios with limited or imbalanced labeled data.
A Jupyter Notebook detailing the construction and training of a MedImageInsight-based adapter model is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-adapter
Figure 2: MedImageInsight-based Adapter Model
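As a rough illustration of the architecture described above (not the reference implementation, which lives in the linked notebook), a PyTorch sketch could look like the following; the convolutional channel counts and the training loop are assumptions, since the post only specifies the 1024-to-512 projection with GELU and LayerNorm, two 1D convolutions, and a linear head over 39 classes:

```python
import torch
import torch.nn as nn

class MedImageInsightAdapter(nn.Module):
    """Adapter head trained on top of frozen 1024-d MedImageInsight embeddings."""

    def __init__(self, embed_dim: int = 1024, hidden_dim: int = 512, num_classes: int = 39):
        super().__init__()
        # Bottleneck MLP adapter: 1024 -> 512 with GELU and LayerNorm.
        self.mlp = nn.Sequential(
            nn.Linear(embed_dim, hidden_dim), nn.GELU(), nn.LayerNorm(hidden_dim),
            nn.Linear(hidden_dim, hidden_dim), nn.GELU(), nn.LayerNorm(hidden_dim),
        )
        # Two 1D convolutions refine the projected features (channel sizes are illustrative).
        self.conv = nn.Sequential(
            nn.Conv1d(1, 16, kernel_size=3, padding=1), nn.GELU(),
            nn.Conv1d(16, 1, kernel_size=3, padding=1), nn.GELU(),
        )
        # Linear prediction head over the task-specific classes.
        self.head = nn.Linear(hidden_dim, num_classes)

    def forward(self, embeddings: torch.Tensor) -> torch.Tensor:
        x = self.mlp(embeddings)                   # (batch, 512)
        x = self.conv(x.unsqueeze(1)).squeeze(1)   # (batch, 512), local dependency modeling
        return self.head(x)                        # (batch, num_classes) logits

# Only the adapter parameters are trained; the MedImageInsight backbone stays frozen
# (embeddings are precomputed), which is why a few CPU-only epochs are sufficient.
model = MedImageInsightAdapter()
logits = model(torch.randn(8, 1024))  # 8 precomputed image embeddings
```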
II. Boosting Vision-Language Models with in-Context Prompting
We leverage vision-language models (e.g., GPT-4o, GPT-4.1), which represent a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models are particularly promising for dermatology tasks due to their ability to interpret complex visual patterns in medical images while simultaneously understanding domain-specific medical terminology.
1. Few-shot Prompting
In this setting, a small number of examples from the training dataset are randomly selected and embedded into the input prompt. These examples, consisting of paired images and corresponding labels, are intended to guide the model's interpretation of new inputs by providing contextual cues and examples of relevant dermatological features.
2. MedImageInsight-based Retrieval-Augmented Generation (RAG)
This approach enhances vision-language model performance by integrating a similarity-based retrieval mechanism rooted in MedImageInsight (Medical Image-to-Image) comparison. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k dermatological training images that are most visually similar to a given query image. The retrieved examples, consisting of dermatological images and their corresponding labels, are then used as in-context examples in the Vision-LLM prompt. By presenting visually similar cases, this approach provides the model with more targeted contextual references, enabling it to generate predictions grounded in relevant visual patterns and associated clinical semantics. As illustrated in Figure 3, the system operates in two phases:
Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval.
Query and Retrieval: At inference time, the test image is encoded similarly to produce a query embedding. The system computes the Euclidean distance between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances.
To handle the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors. The implementation of the image search method is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-2d-image-search
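The two phases map naturally onto a few lines of FAISS. The sketch below is a simplified stand-in for the linked sample, assuming the MedImageInsight embeddings have already been computed and saved locally (file names are illustrative):

```python
import numpy as np
import faiss  # pip install faiss-cpu

# Assumed inputs: precomputed MedImageInsight embeddings for the training set and one query image.
train_embeddings = np.load("train_embeddings.npy").astype("float32")       # (N, 1024)
train_labels = np.load("train_labels.npy", allow_pickle=True)              # (N,) anatomic-location labels
query_embedding = np.load("query_embedding.npy").astype("float32").reshape(1, -1)

# Index construction: exact Euclidean (L2) search over the training embeddings.
index = faiss.IndexFlatL2(train_embeddings.shape[1])
index.add(train_embeddings)

# Query and retrieval: the k nearest neighbors with the smallest distances.
k = 5
distances, indices = index.search(query_embedding, k)
neighbors = [(int(i), train_labels[i], float(d)) for i, d in zip(indices[0], distances[0])]
# The retrieved image-label pairs become the in-context examples of the RAG-5 prompt.
print(neighbors)
```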
Figure 3: MedImageInsight-based Retrieval-Augmented Generation
Evaluation
Table 1 presents accuracy scores for anatomic location prediction on the DermaVQA-iiyi test set using the proposed modeling approaches. The adapter model achieves a baseline accuracy of 31.73%. Vision-language models perform better, with GPT-4o (2024-11-20) achieving an accuracy of 47.11%, and GPT-4.1 (2025-04-14) improving to 50%. However, incorporating few-shot prompting with five randomly selected in-context examples (5-shot) slightly reduces GPT-4.1's performance to 48.72%. This decline suggests that unguided example selection may introduce irrelevant or low-quality context, potentially reducing the effectiveness of the model's predictions for this specialized task.
The best performance among the vision-language approaches is achieved using the retrieval-augmented generation (RAG) strategy. In this setup, GPT-4.1 is prompted with five nearest-neighbor examples retrieved using the MedImageInsight-based search method (RAG-5), leading to a notable accuracy increase to 51.60%. This improvement over GPT-4.1's 50% accuracy without retrieval showcases the relevance of the MedImageInsight-based RAG method. We expect larger performance gains when using a more extensive dermatology dataset, compared to the relatively small dataset used in this example (a collection of 2,474 images associated with 842 patient cases), which served as the basis for selecting relevant cases and similar images. Dermatology is a particularly challenging domain, marked by a high number of distinct conditions and significant variability in skin tone, texture, and lesion appearance. This diversity makes robust and representative example retrieval especially critical for enhancing model performance. The results underscore the importance of example relevance in few-shot prompting, demonstrating that similarity-based retrieval can effectively guide the model toward more accurate predictions in complex visual reasoning tasks.
Table 1: Comparative Accuracy of Anatomic Location Prediction on DermaVQA-iiyi
Figure 4: Confusion Matrix of Anatomical Location Predictions by the trained MLP adapter: The matrix illustrates the model's performance in classifying dermatology images across 39 anatomical regions. Strong diagonal values indicate correct classifications, while off-diagonal entries highlight common misclassifications, particularly among anatomically adjacent or visually similar regions such as 'lowerback' vs. 'back' and 'hand' vs. 'fingers'.
Figure 5. Examples of correct anatomical predictions by the RAG approach. Each image depicts a case where the model's predicted anatomical region exactly matches the ground truth. Shown are examples from visually and anatomically distinct areas including the eye region, lips, lower leg, and neck.
Figure 6. Examples of misclassifications by the RAG approach. Each image displays a case where the predicted anatomical label differs from the ground truth. In several examples, predictions are anatomically close to the correct regions (e.g., hand vs. hand-back, lower leg vs. foot, palm vs. fingers), suggesting that misclassifications often occur between adjacent or visually similar areas. These cases highlight the challenge of precise localization in fine-grained anatomical classification and the importance of accounting for anatomical ambiguity in both modeling and evaluation.
Conclusion
Our exploration of scalable image retrieval and advanced prompting strategies demonstrates the growing potential of vision-language models in dermatology. A particularly challenging task we address is anatomic location prediction, which involves 39 fine-grained classes of dermatology images, imbalanced training data, and frequent misclassifications between adjacent or visually similar regions. By leveraging Retrieval-Augmented Generation (RAG) with similarity-based example selection using image embeddings from the MedImageInsight foundation model, we show that relevant contextual guidance can significantly improve model performance in such complex settings. These findings underscore the importance of intelligent image retrieval and prompt construction for enhancing prediction accuracy in fine-grained medical tasks. As vision-language models continue to evolve, their integration with retrieval mechanisms and foundation models holds substantial promise for advancing clinical decision support, medical research, and education at scale. In the next blog of this series, we will shift focus to the wound care subdomain of dermatology, and we will release accompanying Jupyter notebooks for the adapter-based and RAG-based methods to provide a reproducible reference implementation for researchers and practitioners.
The Microsoft healthcare AI models, including MedImageInsight, are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models' performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.
References
Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542 (2024)
Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Asma Ben Abacha, Meliha Yetisgen, Fei Xia: DermaVQA: A Multilingual Visual Question Answering Dataset for Dermatology. MICCAI (5) 2024: 209-219
Model catalog and collections in Azure AI Foundry portal: https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
Image Search Series
Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval
Introduction
Wound assessment and management are central tasks in clinical practice, requiring accurate documentation and timely decision-making. Clinicians and nurses often rely on visual inspection to evaluate wound characteristics such as size, color, tissue composition, and healing progress. However, when seeking comparable cases (e.g., to inform treatment choices, validate assessments, or support education), existing search methods have significant limitations. Traditional keyword-based systems require precise terminology, which may not align with the way wounds are described in practice. Moreover, textual descriptors cannot fully capture the variability of visual wound features, resulting in incomplete or imprecise retrieval. Recent advances in computer vision offer new opportunities to address these challenges through both image classification and image retrieval. Automated classification of wound images into clinically meaningful categories (e.g., wound type, tissue condition, infection status) can support standardized documentation and assist clinicians in making more consistent assessments. In parallel, image retrieval systems enable search based on visual similarity rather than textual input alone, allowing clinicians to query databases directly with wound images and retrieve cases with similar characteristics. Together, these AI-based functionalities have the potential to improve case comparison, facilitate consistent monitoring, and enhance clinical training by providing immediate access to relevant examples and structured decision support.
The Data
The WoundcareVQA dataset is a new multimodal, multilingual dataset for Wound Care Visual Question Answering, available at https://osf.io/xsj5u/ [1]. Table 1 summarizes dataset statistics. WoundcareVQA contains 748 images associated with 477 instances (each instance/query includes one or more images). The dataset is split into training (279 instances, 449 images), validation (105 instances, 147 images), and test (93 instances, 152 images). The training set was annotated by a single expert, the validation set by two annotators, and the test set by three medical doctors. Each query is also labeled with wound metadata, covering seven categories: anatomic location (41 classes), wound type (8), wound thickness (6), tissue color (6), drainage amount (6), drainage type (5), and infection status (3).
Table 1: Statistics about the WoundcareVQA Dataset
We selected two tasks with the highest inter-annotator agreement: Wound Type Classification and Infection Detection (cf. Table 2). Table 3 lists the classification labels for these tasks.
Table 2: Inter-Annotator Agreement in the WoundcareVQA Dataset
Table 3: Classification Labels for the Tasks: Infection Detection & Wound Type Classification
Methods
1. Foundation-Model-based Image Search
This approach relies on an image similarity-based retrieval mechanism using a medical foundation model, MedImageInsight [2-3]. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k training images most visually similar to a given query image. The image search system operates in two phases:
Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval.
Query and Retrieval: At inference time, the test image is encoded to produce a query embedding. The system computes the Euclidean distances between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances.
To address the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors.
2. Vision-Language Models (VLMs) & Retrieval-Augmented Generation (RAG)
We leverage vision-language models (e.g., GPT-4o, GPT-4.1), a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models can be used for wound assessment tasks due to their ability to interpret complex visual patterns in medical images while simultaneously understanding medical terminology. We evaluate three settings:
Zero-shot: The model predicts directly from the query input without additional examples.
Few-shot Prompting: A small number of examples (5) from the training dataset are randomly selected and embedded into the input prompt. These paired images and labels provide contextual cues that guide the model's interpretation of new inputs.
Retrieval-Augmented Generation (RAG): The system first retrieves the Top-k visually similar wound images using the MedImageInsight-based image search described above. The language model then reasons over the retrieved examples and their labels to generate the final prediction.
The implementation of the MedImageInsight-based image search and the RAG method for the infection detection task is available in our Samples Repository (https://aka.ms/healthcare-ai-examples): rag_infection_detection.ipynb
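Before turning to the results, here is a minimal sketch of how the retrieved neighbors can be turned into in-context examples for a vision-language model. It is not the linked notebook's implementation: the deployment name, API version, endpoint, label strings, and the binary framing of infection status are illustrative assumptions, and the neighbor list is whatever the MedImageInsight/FAISS search (shown earlier in this series) returns.

```python
import base64
from openai import AzureOpenAI  # pip install openai; assumes an Azure OpenAI resource with a GPT-4.1 deployment

def to_data_url(path: str) -> str:
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

def build_rag_messages(query_image: str, neighbors: list[tuple[str, str]]) -> list[dict]:
    """neighbors: (image_path, label) pairs returned by the MedImageInsight-based image search."""
    content = [{"type": "text", "text":
                "You are assisting with wound image research. Using the labeled reference images "
                "below as context, classify the final image's infection status."}]
    for i, (path, label) in enumerate(neighbors, 1):
        content.append({"type": "text", "text": f"Reference case {i} - label: {label}"})
        content.append({"type": "image_url", "image_url": {"url": to_data_url(path)}})
    content.append({"type": "text", "text": "Image to classify:"})
    content.append({"type": "image_url", "image_url": {"url": to_data_url(query_image)}})
    return [{"role": "user", "content": content}]

client = AzureOpenAI(azure_endpoint="https://<your-resource>.openai.azure.com",
                     api_key="<key>", api_version="2024-06-01")
response = client.chat.completions.create(
    model="gpt-4.1",  # your deployment name
    messages=build_rag_messages("query.jpg",
                                [("case1.jpg", "Infected"), ("case2.jpg", "Not infected")]),
)
print(response.choices[0].message.content)
```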
Evaluation
We computed accuracy scores to evaluate the image search methods (Top-1 and Top-5 with majority vote), GPT-4o and GPT-4.1 models (zero-shot), as well as 5-shot and RAG-based methods. Table 4 reports accuracy for wound type classification and infection detection. Figure 1 presents examples of correct and incorrect predictions.
Wound Type: Image Search Top-1: 0.7933 | Image Search Top-5 + majority vote: 0.8333 | GPT-4o (2023-07-01): 0.4671 | GPT-4o (2024-11-20): 0.4803 | GPT-4.1 (2025-04-14): 0.5066 | GPT-4.1 5-shot Prompting: 0.6118 | GPT-4.1 RAG-5: 0.7533
Infection: Image Search Top-1: 0.6800 | Image Search Top-5 + majority vote: 0.7267 | GPT-4o (2023-07-01): 0.3947 | GPT-4o (2024-11-20): 0.3882 | GPT-4.1 (2025-04-14): 0.375 | GPT-4.1 5-shot Prompting: 0.7237 | GPT-4.1 RAG-5: 0.7697
Table 4: Accuracy Scores for Wound Type Classification & Infection Detection
Figure 1: Examples of Correct and Incorrect Predictions (GPT-4.1-RAG-5 Method)
For wound type classification, image search with MedImageInsight embeddings performs best, achieving 0.7933 (Top-1) and 0.8333 (Top-5 + majority vote). GPT models alone perform substantially worse (0.4671-0.6118), while GPT-4.1 with retrieval augmentation (RAG-5), which uses the same MedImageInsight-based image search method to retrieve the Top-5 similar cases, narrows the gap (0.7533) but does not surpass direct image search. This suggests that categorical wound type is more effectively captured by visual similarity than by case-based reasoning with vision-language models.
For infection detection, the trend reverses. Image search reaches 0.7267 (Top-5 + majority vote), while RAG-5 achieves the highest accuracy at 0.7697. In this case, the combination of visually similar cases with VLM-based reasoning outperforms both standalone image search and GPT prompting. This indicates that infection assessment depends on contextual or clinical cues that may not be fully captured by visual similarity alone but can be better interpreted when enriched with contextual reasoning over retrieved cases and their associated labels. Overall, these findings highlight complementary strengths: foundation-model-based image search excels at categorical visual classification (wound type), while retrieval-augmented VLMs leverage both visual similarity and contextual reasoning to improve performance on more nuanced tasks (infection detection). A hybrid system integrating both approaches may provide the most robust clinical support.
Conclusion
This study demonstrates the complementary roles of vision-language models in wound assessment. Image search using foundation-model embeddings shows strong performance on categorical tasks such as wound type classification, where visual similarity is most informative. In contrast, retrieval-augmented generation (RAG-5), which combines image search with case-based reasoning by a vision-language model, achieves the best results for infection detection, highlighting the value of integrating contextual interpretation with visual features. These findings suggest that a hybrid approach, leveraging both direct image similarity and retrieval-augmented reasoning, provides the most robust pathway for clinical decision support in wound care.
References
Wen-wai Yim, Asma Ben Abacha, Robert Doerning, Chia-Yu Chen, Jiaying Xu, Anita Subbarao, Zixuan Yu, Fei Xia, M Kennedy Hall, Meliha Yetisgen. WoundcareVQA: A Multilingual Visual Question Answering Benchmark Dataset for Wound Care. Journal of Biomedical Informatics, 2025.
Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542 (2024)
Model catalog and collections in Azure AI Foundry portal: https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
Image Search Series
Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub
Towards Robust Evaluation of Multi-Agent Systems in Clinical Settings
Authors: Hao Qiu, Leonardo Schettini, Mert Öz, Noel Codella, Sam Preston, Wen-wai Yim
As multi-agent systems become more capable and collaborative, their behavior begins to exhibit emergent properties that are difficult to predict or control – particularly in safety-critical domains like healthcare. Coordination among agents can yield outputs that are non-deterministic, multi-faceted, and context-sensitive. This makes robust evaluation not just a matter of accuracy, but of safety, accountability, and trust. Traditional NLP metrics like ROUGE or BLEU fall short in these settings as they presuppose a single ground truth and fail to capture clinically relevant errors such as subtle omissions, hallucinations, or fact distortions.
To address this, we present a modular evaluation framework for the Healthcare Agent Orchestrator, designed to support fine-grained, clinically grounded assessment across both deployed clinical workflows and simulated scenarios. This framework enables targeted stress-testing of multi-agent behavior – particularly how agents share information, reason under uncertainty, and maintain factual fidelity in high-stakes contexts. Central to our framework is TBFact, a domain-specific factuality metric that evaluates agent outputs based on three key criteria: factual inclusion, factual distortion, and factual omission. TBFact shows strong correlation with human experts (κ=0.760) and demonstrates that our Patient History agent successfully included up to 94% of high-importance information in the generated patient timelines.
To ground evaluations of the Patient History agent, we constructed a high-quality benchmark dataset from de-identified tumor board discussions and associated patient histories. The formatting of the reference patient timeline summaries (originally written by medical professionals) was standardized via a large language model to facilitate consistent evaluation. Under our benchmark, while the Patient History agent included over 94% of high-importance facts (counting both fully and partially entailed information), it achieved 0.84 TBFact recall on high-importance facts, showing that TBFact's strict entailment criteria and partial credit scoring create meaningful headroom for future improvements.
For more technical information about the evaluation framework, refer to the documentation. The healthcare-agent-orchestrator repository also includes an evaluation notebook with concrete examples for simulating conversations and evaluating them.
Figure: High-level architecture of the evaluation framework, showing data sources (real and simulated conversations) feeding into modular metrics for both orchestrator and individual agent assessment.
Available Metrics
Traditional similarity metrics (e.g., ROUGE, BERTScore) fail to capture subtle yet critical factual inaccuracies in the output. Moreover, in agentic workflows, a ground truth answer often doesn't exist or is expensive to curate. To overcome these shortcomings, we leverage Model-as-a-Judge to implement the following metrics:
Component | Metric | Description
Orchestrator | Agent and tool selection accuracy | Correct routing to specialized agents
Orchestrator | Intent resolution | How accurately the orchestrator interprets and completes user requests, including scoping and clarification.
Orchestrator | Information aggregation | Effective synthesis of multiple agent outputs.
Individual Agents | Context relevancy | Relevance of retrieved information in relation to user's requests.
Individual Agents | TBFact (Factual Consistency) | An adapted version of RadFact for the text modality that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations.
Large Language Models serve as useful evaluation tools in our framework, offering advantages especially when ground truth data is not available. They can follow detailed evaluation guidelines, maintain consistency when applying criteria across conversations, and generate explanations for their assessments—facilitating verification of the evaluation process. However, due to their subjective nature, LLM-based evaluations should be treated as directional signals that guide system improvement rather than absolute judgments of correctness. To complement LLM-based metrics with reproducible measurements, especially when reference data is available, we include a ROUGE implementation, which serves as an example for developers to incorporate other similarity metrics like BLEU or BERTScore by extending the ReferenceBasedMetric class.
TBFact: Domain-Specific Factuality Evaluation
TBFact builds on RadFact (Bannur et al., 2024), a framework originally developed for evaluating factual consistency in radiology reports, by adapting its core principles to the text-only modality of healthcare agent interactions:
Fact Extraction: Separately decomposes both agent responses and reference texts into discrete factual claims, categorized by clinical relevance (e.g., demographics, diagnosis, treatment).
Logical Entailment: Compares each fact to determine if it's fully entailed, partially entailed, or not entailed by the reference, and further categorizes the reason for partial and total mismatches into "missing", "ambiguous", "incorrect" or "other".
Metric Calculation: TBFact performs the logical entailment in two directions:
Precision (pred-to-gold): Measures the proportion of factual claims in the agent's output that are supported by the reference data. A lower precision score may indicate the presence of hallucinated or extraneous facts not found in the reference, even if they are accurate. Precision can be seen as a proxy for succinctness.
Recall (gold-to-pred): Measures the proportion of reference facts that are successfully captured in the agent's output. A lower recall score signals missing or omitted information, which is especially critical in clinical contexts where completeness is essential.
By operating at the level of atomic factual units, TBFact shifts the focus from holistic summary judgments to targeted, claim-by-claim analysis. While claim extraction introduces its own challenges—such as ensuring consistent coverage of verifiable content, maintaining entailment fidelity, and handling decontextualization (Metropolitansky & Larson, 2025)—factual claims make the evaluation process more modular and transparent, providing actionable insights into where and how agent responses differ from references. For example, when evaluating a discharge summary, TBFact might identify that while demographic facts achieve 95% precision, treatment recommendations only reach 75% recall, pinpointing specific areas for agent improvement. This granular feedback enables developers to identify systematic issues, such as an agent consistently omitting medication dosages or incorrectly interpreting temporal information, that would be difficult to detect with traditional metrics.
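To make the two-directional scoring concrete, here is a schematic sketch (not the repository's actual API) of how per-direction scores and an F1 can be computed once an LLM has extracted the facts and judged entailment upstream; the partial-credit weight of 0.5 and the example judgments (drawn loosely from the qualitative table later in this post) are illustrative assumptions:

```python
from dataclasses import dataclass
from typing import Literal

Entailment = Literal["entailed", "partial", "not_entailed"]

@dataclass
class FactJudgment:
    fact: str
    entailment: Entailment  # produced upstream by an LLM entailment step

def tbfact_score(judgments: list[FactJudgment], partial_credit: float = 0.5) -> float:
    """Share of facts supported by the other text, with partial credit for partial entailment."""
    if not judgments:
        return 0.0
    credit = {"entailed": 1.0, "partial": partial_credit, "not_entailed": 0.0}
    return sum(credit[j.entailment] for j in judgments) / len(judgments)

# Precision direction (pred-to-gold): facts extracted from the agent output, judged against the reference.
pred_to_gold = [
    FactJudgment("EGFR amplification identified on 2019-05-18", "entailed"),
    FactJudgment("Ki-67 index of 3%", "partial"),
    FactJudgment("Plan on 2020-05-16 to continue CCNU", "not_entailed"),
]
# Recall direction (gold-to-pred): facts extracted from the reference, judged against the agent output.
gold_to_pred = [
    FactJudgment("CDKN2A/B deletion", "entailed"),
    FactJudgment("Lomustine treatment initiated on 04/14/2020", "entailed"),
]

precision = tbfact_score(pred_to_gold)
recall = tbfact_score(gold_to_pred)
f1 = 2 * precision * recall / (precision + recall) if (precision + recall) else 0.0
print(f"TBFact precision={precision:.2f}, recall={recall:.2f}, F1={f1:.2f}")
```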
Data Sources
Due to the challenge of having real-world data for each use case we want to evaluate, and to accommodate different development stages and data availability, the framework supports two primary evaluation modes:
Real conversations: Healthcare Agent Orchestrator automatically saves chat sessions whenever a conversation is terminated with the command @Orchestrator: clear, enabling insight into actual clinical workflow performance.
Simulated conversations: Generated for controlled testing using predefined scripts or adaptive scenarios. Essential for specialized scenarios with limited real-world data.
Results and Performance Assessment
Note: The following results represent initial validation from our current research phase, with ongoing work expanding evaluation scope and refining methodologies. These preliminary results demonstrate promising capabilities for clinical system coordination and factual accuracy assessment.
Orchestrator Performance
We evaluated the orchestrator using simulated conversations across multiple patient scenarios. GPT-4o served as the evaluator, providing both quantitative scores and qualitative explanations based on defined metric criteria. In this preliminary experiment, the orchestrator demonstrated promising coordination capabilities:
Metric | Score Range | Average Score
Agent Selection Accuracy | 3.89 – 5 | 4
Intent Resolution | 4 – 5 | 4.5
Information Aggregation | 3 – 5 | 3.7
In our preliminary evaluation, agent selection examples are relatively straightforward given our agents' well-defined responsibilities but provide a foundation for expanding to more complex scenarios involving agent-human expert interactions as we gather real-world data. Future work could include turn-level labeling of tumor board dataset dialogues to test classification accuracy of choosing the right next expert or agent. Agent selection can also be combined with "tool selection" metrics, addressing the fragmentation problem in multi-agent evaluation approaches. In the current state, we mainly used the explanations provided by the evaluator model to better understand the behavior of the system in clinical workflows and guide the development process.
Patient History Agent Performance with TBFact
To evaluate the Patient History agent, we used an anonymized and PHI-free proprietary dataset, named TB-Bench, that comprehensively aggregates diverse medical records for 71 patients who had undergone the care of a Molecular Tumor Board (MTB). TB-Bench includes data such as tumor board transcripts, exported EHR data, and clinician-generated patient summaries. Due to the logistical challenges involved in curating such a comprehensive dataset across potentially multiple healthcare institutions and record-keeping systems, we found that in some instances clinician-generated summaries available in the tumor board transcripts might refer to patient records that were lost in the data curation process. This mismatch made direct evaluation challenging. Therefore, to ensure evaluation reflects system performance when complete patient records are accessible, we used TBFact to evaluate the agent's output against a curated set of dataset-verifiable facts, i.e., facts limited to those referring to information that is present in the dataset.
While TBFact measures both recall and precision of fact generation, our study focuses on recall because it measures how much of all important information is covered, which we consider the most critical metric for clinical applications where missing information can have serious consequences.
The preliminary experiments revealed significant performance improvements through prompt optimization and format adjustments. With specialized prompting, we specify the types of information to prioritize—such as biomarker results, imaging assessments, and treatment timelines. For instance, our updated prompt instructs the agent to "organize the patient data in chronological order" and explicitly calls out key elements to include: "all biomarkers", "response to treatment including dates and imaging," and "a summary of current status." This prompt engineering approach proved to be one of the most effective levers for improving the quality and completeness of Patient History outputs.
Configuration | TBFact Recall for All Facts | TBFact Recall for Important Facts
Generic prompts (baseline) | 0.56 | 0.66
Specialized Prompts | 0.71 | 0.84
Since TBFact operates by comparing discrete factual claims, higher scores indicate that the agent is, according to the reference data, factually accurate and comprehensive in its coverage of the available patient information. In other words, optimizing for TBFact scores brings the agent's output structurally and semantically closer to the curated reference timelines. In our case, that meant striving for detailed outputs, including information about allergies and ongoing medications, even when specific dates were unavailable. This underscores the importance of having high-quality, human-validated reference datasets; without them, even well-performing agents may appear incomplete or inaccurate.
Human Validation Study
To validate TBFact's reliability, we conducted a preliminary study with human annotators, medical scribes by training, using 71 patient records. Two annotators assessed (a) whether a claim was properly extracted from its source text, (b) whether the fact was important (low, medium, high), and (c) whether individual claims were properly entailed by a reference text. Inter-annotator agreement was measured at 0.999, 0.66 (strict) and 0.77 (relaxed), and 0.914 for the three tasks respectively. The accuracy of the fact extraction pipeline was calculated to be 99.9%, validating that minimal-to-no hallucinations are introduced during the fact extraction phase. System accuracy for fact importance classification was 66% when measured strictly; however, when allowing for a tolerance of one level (e.g., classifying medium instead of high), it was 93%. These values are comparable to those of the medical annotators. Entailment classification accuracy was 88%, suggesting reasonable ability of the system to recognize entailment. Finally, we measured the correlation between the system's end-to-end TBFact F1 scores and human judgments using Kendall Tau, Pearson, and Spearman correlations. These were 55.8%, 70.5%, and 72.8% respectively, moderate-to-high correlations suggesting that the TBFact metric is well-aligned with expert clinical reasoning.
Qualitative insights from TBFact
The table below illustrates how TBFact evaluates factual alignment between agent-generated summaries and reference data. Each row shows a fact extracted from the agent's output, the corresponding excerpt from the reference, and the entailment judgment.
The logical entailment was produced by TBFact, while the accompanying explanations were generated separately to support interpretability.
Facts Extracted from Agent Response | Related Excerpt from Reference Text (Ground Truth) | TBFact Judgment
Molecular studies from the 2019-05-18 surgery identified TERT promoter mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. | […] Tumor Genetics: EGFR: Amplified; CDKN2A/B: Deleted; PTEN: p.L112R; TERT: c.-146C>T; Chromosome 10: Monosomy; Chromosome 7: Trisomy […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | ✔ Entailed: The summary lists TERT mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7.
Immunohistochemistry from 2019-05-18 showed GFAP positive, BRAF V600E negative, IDH1 R132H negative, ATRX retained, p53 negative, and a Ki-67 index of 3%. | […] Tumor Genetics: IDH1: Wildtype; BRAF V600E: Negative […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | ⚠️ Partial Entailment: Some IHC findings match (BRAF negative, IDH1 wildtype) but others (GFAP, p53, Ki-67) are not mentioned in the reference summary.
During the first cycle of CCNU on 2020-04-14, the patient reported significant fatigue, thrombocytopenia, and occasional confusion. | Introduction: […] The patient is experiencing poor tolerance to lomustine and is considering discontinuation due to further disease progression as confirmed by recent MRI scans. […] Timeline: 04/14/2020 - Present: Lomustine treatment initiated. […] | ⚠️ Partial Entailment: Poor tolerance to lomustine is reported, but specific side effects are not listed in the reference summary.
On 2020-05-16, the plan was to continue CCNU and monitor with imaging. | No related information in the reference text. | ⚠️ No Entailment: No mention in the summary of a plan on 2020-05-16 to continue CCNU with imaging follow-up.
These examples show that partial entailments are not necessarily errors. In many cases, they reflect the agent surfacing clinically relevant details that are absent from the reference. This is especially important in healthcare settings, where agent outputs may synthesize information across multiple documents or express facts in more complete or structured ways than the reference does. To further assess the factual grounding of the agent's outputs, we compared all facts extracted from the Patient History agent's summaries against the full set of available data for each patient in the TB-Bench dataset. We found that 97% of the extracted facts were entailed by at least one data point. Upon manually reviewing the remaining 3% of facts, we found that they often reflected condensed or synthesized information drawn from multiple sources, meaning these claims could not be matched to any one document in our one-to-one entailment setup. While we cannot rule out the presence of hallucinations entirely, this analysis highlights the agent's capacity for multi-source summarization.
Closing Thoughts
As multi-agent systems become more capable and autonomous, robust evaluation must evolve in parallel. The framework presented here is a step toward that goal: modular, clinically grounded, and designed to surface actionable insights across both simulated and real-world workflows.
By moving beyond traditional accuracy metrics and embracing factuality, relevance, and coordination as core evaluation dimensions, we can better understand how multi-agent systems work, and when and why they fail. Our preliminary experiments and insights reinforce the value of TBFact not just as a metric, but as a diagnostic tool. Its structured, claim-level analysis (combined with fact categorization and human validation) offers a transparent and clinically meaningful way to evaluate and improve healthcare agents. In evaluating the Patient History agent, our findings demonstrate that the agent remains faithful to the underlying data and produces complete, clinically relevant summaries. These outputs can help physicians prepare more efficiently and productively for tumor board review meetings and, by being part of a chat with multiple agents, facilitate further investigation and understanding of patients.
Looking ahead, we see several promising directions for extending this work: incorporating human-in-the-loop review pipelines, expanding to multimodal evaluation, improving observability across agent interactions, and scaling to more diverse real-world datasets. We are also developing a standardized benchmark of synthetic and de-identified patient cases to support broader community testing and reproducibility. We hope this work encourages others to adopt similarly rigorous approaches to evaluation, and to contribute to the development of shared benchmarks, metrics, and methodologies.
References
Bannur, S., Bouzid, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., ... & Hyland, S. L. (2024). Maira-2: Grounded radiology report generation. arXiv:2406.04449v2.
Metropolitansky, D. & Larson, J. (2025). Towards Effective Extraction and Evaluation of Factual Claims. arXiv:2502.10855v2.
Azure Logic App AI-Powered Monitoring Solution: Automate, Analyze, and Act on Your Azure Data
Introduction
In today's cloud-driven world, monitoring and analyzing application health is critical for business continuity and operational excellence. However, the sheer volume of monitoring data can make it challenging to extract actionable insights quickly. Enter the Azure Logic App AI-Powered Monitoring Solution—an intelligent, serverless pipeline that leverages Azure Logic Apps and Azure OpenAI to automate monitoring, analyze data, and deliver comprehensive reports right to your inbox. This solution is ideal for organizations seeking to modernize their monitoring workflows, reduce manual analysis, and empower teams with AI-driven insights for faster decision-making.
What Does This Solution Accomplish?
The Azure Logic App AI-Powered Monitoring Solution creates an automated pipeline that:
Extracts monitoring data from Azure Log Analytics using KQL queries (an illustrative query is sketched at the end of this post).
Analyzes data with AI using the Azure OpenAI GPT-4o model.
Generates intelligent reports and sends them via email.
Runs automatically on a daily schedule.
Uses managed identity for secure authentication across Azure services.
Business Case Solved
Automated Monitoring: No more manual log reviews—let AI do the heavy lifting.
Actionable Insights: Receive daily, AI-generated summaries highlighting system health, key metrics, potential issues, and recommendations.
Operational Efficiency: Reduce time-to-insight and empower teams to act faster on critical events.
Secure and Scalable: Built on Azure's serverless and identity-driven architecture.
Key Features
Serverless Architecture: Built on Azure Logic Apps Standard for scalability and cost efficiency.
AI-Powered Insights: Uses Azure OpenAI for advanced data analysis and summarization.
Infrastructure as Code: Deployable via Bicep templates for reproducibility and automation.
Secure by Design: Managed identity and Azure RBAC ensure secure access.
Cost Effective: Pay-per-execution model with optimized resource usage.
Customizable: Easily modify KQL queries and AI prompts to fit your monitoring needs.
Solution Architecture
Technologies Involved
Azure Logic Apps Standard: Orchestrates the workflow.
Azure OpenAI Service (GPT-4o): Performs AI-powered data analysis and summarization.
Azure Log Analytics: Source for monitoring data, queried via KQL.
Application Insights: Monitors workflow execution and telemetry.
Azure Storage Account: Stores Logic App runtime data.
Managed Identity: Secures authentication across Azure services.
Infrastructure as Code (Bicep): Enables automated, repeatable deployments.
Office 365 Connector: Sends email notifications.
Support
Documentation: https://docs.microsoft.com/en-us/azure/logic-apps/
Issues: https://github.com/vinod-soni-microsoft/logicapp-ai-summarize/issues
Star this repository if you find it helpful!
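To make the pipeline concrete, the sketch below expresses the same idea in Python rather than as the Logic App workflow itself: it runs an illustrative KQL query against a Log Analytics workspace and asks an Azure OpenAI deployment to summarize the results. The workspace ID, endpoint, deployment name, API version, and query are placeholders; in the deployed solution, scheduling, orchestration, and the email step are handled by the Logic App, its managed identity, and the Office 365 connector.

```python
# pip install azure-monitor-query azure-identity openai
from datetime import timedelta
from azure.identity import DefaultAzureCredential
from azure.monitor.query import LogsQueryClient
from openai import AzureOpenAI

WORKSPACE_ID = "<log-analytics-workspace-id>"   # placeholder
KQL_QUERY = """
AppRequests
| where TimeGenerated > ago(1d)
| summarize Total = count(), Failed = countif(Success == false), AvgDurationMs = avg(DurationMs)
    by bin(TimeGenerated, 1h)
| order by TimeGenerated asc
"""

# Query Log Analytics; DefaultAzureCredential picks up a managed identity when running in Azure.
credential = DefaultAzureCredential()
logs_client = LogsQueryClient(credential)
result = logs_client.query_workspace(WORKSPACE_ID, KQL_QUERY, timespan=timedelta(days=1))
rows = [str(row) for table in result.tables for row in table.rows]

# Ask an Azure OpenAI deployment to turn the raw metrics into a short health report.
openai_client = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<key>",          # or use Entra ID authentication
    api_version="2024-06-01",
)
summary = openai_client.chat.completions.create(
    model="gpt-4o",  # your deployment name
    messages=[
        {"role": "system", "content": "You analyze application monitoring data and write a short health "
                                      "report covering key metrics, potential issues, and recommendations."},
        {"role": "user", "content": "Hourly request metrics for the last 24 hours:\n" + "\n".join(rows)},
    ],
)
print(summary.choices[0].message.content)  # in the Logic App, this report is emailed via the Office 365 connector
```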