Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology
Introduction As the use of diagnostic 3D images increases, effective management and analysis of these large volumes of data grows in importance. Medical 3D image search systems can play a vital role by enabling clinicians to quickly retrieve relevant or similar images and cases based on the anatomical features and pathologies present in a query image. Unlike traditional 2D imaging, 3D imaging offers a more comprehensive view for examining anatomical structures from multiple planes with greater clarity and detail. This enhanced visualization has the potential to assist doctors with improved diagnostic accuracy and more precise treatment planning. Moreover, advanced 3D image retrieval systems can support evidence-based and cohort-based diagnostics, demonstrating an opportunity for more accurate predictions and personalized treatment options. These systems also hold significant potential for advancing research, supporting medical education, and enhancing healthcare services. This blog offers guidance on using Azure AI Foundry and the recently launched healthcare AI models to design and test a 3D image search system that can retrieve similar radiology images from a large collection of 3D images. Along with this blog, we share a Jupyter Notebook with the 3D image search system code, which you may use to reproduce the experiments presented here or to start your own solution. 3D Image Search Notebook: http://aka.ms/healthcare-ai-examples-mi2-3d-image-search It is important to highlight that the models available on the AI Foundry Model Catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.
The Problem Generally, the problem of 3D image search can be posed as retrieving cross-sectional (CS) imaging series (3D image results) that are similar to a given CS imaging series (query 3D image). Once posed this way, the key question becomes how to define such similarity. In the previous blog of this series, we worked with chest radiographs, which constrained the notion of "similar" to the similarity between two 2D images and a certain class of anatomy. In the case of 3D images, we are dealing with a volume of data and many more variations of anatomy and pathology, which expands the dimensions to consider for similarity; e.g., are we looking for similar anatomy? Similar pathology? Similar exam type? In this blog, we will discuss a technique to approximate the 3D similarity problem through a 2D image embedding model and some amount of supervision to constrain the problem to a certain class of pathologies (lesions) and cast it as "given a cross-sectional MRI image, retrieve series with a similar grade of lesions in similar anatomical regions". To build a search system for 3D radiology images using a foundation model (MedImageInsight) designed for 2D inputs, we explore the generation of representative 3D embedding vectors for the volumes from the foundation model embeddings of their 2D slices, and use these vectors to create a vector index from a large collection of 3D images. Retrieving relevant results for a given 3D image then consists of generating a representative 3D image embedding vector for the query image and searching for similar vectors in the index. An overview of this process is illustrated in Figure 1. Figure 1: Overview of the 3D image search process. The Data In the sample notebook that is provided alongside this blog, we use 3D CT images from the Medical Segmentation Decathlon (MSD) dataset [2-3] and annotations from the 3D-MIR benchmark [4].
The 3D-MIR benchmark offers four collections (Liver, Colon, Pancreas, and Lung) of positive and negative examples created from the MSD dataset with additional annotations related to the lesion flag (with/without lesion), and lesion group (1, 2, 3). The lesion grouping focuses on lesion morphology and distribution and considers the number, length, and volume of the lesions to define the three groups. It also adheres to the American Joint Committee on Cancer's Tumor, Node, Metastasis classification system’s recommendations for classifying cancer stages and provides a standardized framework for correlating lesion morphology with cancer stage. We selected the 3D-MIR Pancreas collection. 3D-MIR Benchmark: https://github.com/abachaa/3D-MIR Since the MSD collections only include unhealthy/positive volumes, each 3D-MIR collection was augmented with volumes randomly selected from the other datasets to integrate healthy/negative examples in the training and test splits. For instance, the Pancreas dataset was augmented using volumes from the Colon, Liver, and Lung datasets. The input images consist of CT volumes and associated 2D slices. The training set is used to create the index, and the test set is used to query and evaluate the 3D search system. 3D Image Retrieval Our search strategy, called volume-based retrieval, relies on aggregating the embeddings of the 2D slices of a volume to generate one representative 3D embedding vector for the whole volume. We describe additional search strategies in our 3D-MIR paper [4]. The 2D slice embeddings are generated using the MedImageInsight foundation model [5-6] from Azure AI Foundry model catalog [1]. In the search step, we generate the embeddings of the 3D query volumes according to the selected Aggregation method (Agg) and search for the top-k similar volumes/vectors in the corresponding 3D (Agg) index. We use the Median aggregation method to generate the 3D vectors and create the associated 3D index. 
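The aggregation step described above can be illustrated with a minimal sketch. It assumes the slice embeddings of one volume are already available as a NumPy array of shape (num_slices, dim); the function name `aggregate_volume`, the dictionary keys, and the toy 4-dimensional vectors are hypothetical and are not taken from the notebook.

```python
import numpy as np

# Aggregation functions applied across the slice axis (axis 0).
# The key names are illustrative, not the notebook's identifiers.
AGGREGATORS = {
    "median": lambda x: np.median(x, axis=0),
    "max": lambda x: np.max(x, axis=0),
    "avg": lambda x: np.mean(x, axis=0),
    "std": lambda x: np.std(x, axis=0),
}


def aggregate_volume(slice_embeddings: np.ndarray, method: str = "median") -> np.ndarray:
    """Collapse (num_slices, dim) slice embeddings into one (dim,) volume vector."""
    if slice_embeddings.ndim != 2:
        raise ValueError("expected an array of shape (num_slices, dim)")
    return AGGREGATORS[method](slice_embeddings).astype(np.float32)


# Toy "volume": 3 slices with 4-dimensional embeddings (real embeddings are much larger).
slices = np.array([
    [0.0, 1.0, 2.0, 3.0],
    [4.0, 1.0, 0.0, 3.0],
    [2.0, 7.0, 1.0, 3.0],
])

volume_vec = aggregate_volume(slices, "median")
print(volume_vec)  # [2. 1. 1. 3.]
```

Each volume yields one vector regardless of how many slices it contains, which is what allows volumes of different depths to share a single index.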
We construct a 3D (Median) index using the training slices/volumes from the 3D-MIR Pancreas collection. Three other aggregation methods are available in the 3D image search notebook: Max Pooling, Average Pooling, and Standard Deviation. The search follows the k-Nearest Neighbors algorithm (k-NN search), which finds the k nearest neighbors of a given vector by calculating the distances between the query vector and all other vectors in the collection, then selecting the k vectors with the shortest distances. If the collection is large, this computation can be expensive, and it is recommended to use a dedicated library for optimization. We use the FAISS (Facebook AI Similarity Search) library, an open-source library for efficient similarity search and clustering of high-dimensional vectors. Evaluation of the search results The 3D-MIR Pancreas test set consists of 32 volumes: 4 volumes with no lesion (lesion flag/group = -1), 3 volumes with lesion group 1, 19 volumes with lesion group 2, and 6 volumes with lesion group 3. The training set consists of 269 volumes (with and without lesions) and was used to create the index. We evaluate the 3D search system by comparing the lesion group/category of the query volume with that of the top 10 retrieved volumes. We then compute Precision@k (P@k). Table 1 presents the P@1, P@3, P@5, P@10, and overall Precision. Table 1: Evaluation results on the 3D-MIR Pancreas test set The system accurately recognizes healthy cases, consistently retrieving the correct label in test scenarios involving non-lesion pancreas images. However, performance varies for different lesion groups, reflecting challenges in precisely identifying smaller lesions (Group 1) or more advanced lesions (Group 3). This discrepancy highlights the complexity of lesion detection and underscores the importance of carefully tuning embeddings or adjusting the vector index to improve retrieval accuracy for specific lesion sizes.
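To make the search and evaluation steps concrete, here is a small sketch of exhaustive k-NN search followed by Precision@k. It uses plain NumPy in place of FAISS, and it assumes Euclidean distance; the function names and the toy 2-dimensional vectors are hypothetical and only stand in for the volume embeddings and lesion groups.

```python
import numpy as np


def knn_search(index_vectors: np.ndarray, query: np.ndarray, k: int) -> np.ndarray:
    """Exhaustive k-NN: indices of the k index vectors closest to the query."""
    # Assumption: Euclidean (L2) distance between the query and every indexed vector.
    dists = np.linalg.norm(index_vectors - query, axis=1)
    return np.argsort(dists, kind="stable")[:k]


def precision_at_k(retrieved_labels, query_label, k: int) -> float:
    """Fraction of the top-k retrieved items that share the query's label."""
    top = retrieved_labels[:k]
    return sum(1 for label in top if label == query_label) / k


# Toy index: four "volumes" with their lesion groups (-1 means no lesion).
index_vectors = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [5.0, 5.0]])
index_labels = [2, 2, 1, -1]

query = np.array([0.1, 0.0])
query_label = 2

hits = knn_search(index_vectors, query, k=3)
retrieved = [index_labels[i] for i in hits]

print(retrieved)                                  # [2, 2, 1]
print(precision_at_k(retrieved, query_label, 1))  # 1.0
print(precision_at_k(retrieved, query_label, 3))  # 0.6666666666666666
```

Averaging these per-query values over the test queries of each lesion group gives figures of the kind reported as P@1, P@3, P@5, and P@10.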
Visualization Figure 2 presents four different test queries from the Pancreas test set and the top 5 nearest neighbors retrieved by the volume-based search method. In each row, the first image is the query, followed by the retrieved images ranked by similarity. The visual overlays help in assessing retrieval accuracy: blue indicates the pancreas organ boundaries, and red highlights the marked regions corresponding to the pancreas tumor. Figure 2: Top 5 results for different queries from the Pancreas test set Table 2 presents additional results of the volume-based retrieval system [4] on other 3D-MIR datasets/organs (Liver, Colon, and Lung) using additional foundation models: BiomedCLIP [7], Med-Flamingo [8], and BiomedGPT [9]. When considering the macro-average across all datasets, MedImageInsight-based retrieval substantially outperforms the other foundation models. Table 2: Evaluation Results on the 3D-MIR benchmark (Liver, Colon, Pancreas, and Lung) These results mirror a use case akin to lesion detection and severity measurement in a clinical context. In real-world applications, such as diagnostic support or treatment planning, it may be necessary to optimize the model to account for particular goals (e.g., detecting critical lesions early) or to accommodate different imaging protocols. By refining search criteria, integrating more domain-specific data, or adjusting embedding methods, practitioners can enhance retrieval precision and better meet clinical requirements. Conclusion The integration of 3D image search systems in clinical environments can enhance and accelerate the retrieval of similar cases and provide better context to clinicians and researchers for accurate complex diagnoses, cohort selection, and personalized patient care.
This 3D radiology image search blog and the related notebook offer a solution based on 3D embedding generation for building and evaluating a 3D image search system using the MedImageInsight foundation model from the Azure AI Foundry model catalog.
References
[1] Model catalog and collections in Azure AI Foundry portal https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
[2] Michela Antonelli et al. The medical segmentation decathlon. Nature Communications, 13(4128), 2022 https://www.nature.com/articles/s41467-022-30695-9
[3] MSD: http://medicaldecathlon.com/
[4] Asma Ben Abacha, Alberto Santamaría-Pang, Ho Hin Lee, Jameson Merkow, Qin Cai, Surya Teja Devarakonda, Abdullah Islam, Julia Gong, Matthew P. Lungren, Thomas Lin, Noel C. F. Codella, Ivan Tarapov: 3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology. CoRR abs/2311.13752, 2023 https://arxiv.org/abs/2311.13752
[5] Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542, 2024 https://arxiv.org/abs/2410.06542
[6] MedImageInsight: https://aka.ms/mi2modelcard
[7] Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Angela Crabtree, Brian Piening, Carlo Bifulco, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon. BiomedCLIP: a multimodal biomedical foundation model pretrained from fifteen million scientific image-text pairs. NEJM AI 2025; 2(1) https://ai.nejm.org/doi/full/10.1056/AIoa2400640
[8] Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-flamingo: a multimodal medical few-shot learner. Machine Learning for Health, ML4H@NeurIPS 2023, 10 December 2023, New Orleans, Louisiana, USA. Proceedings of Machine Learning Research, vol. 225, pp. 353–367. PMLR, (2023) https://proceedings.mlr.press/v225/moor23a.html
[9] Zhang, K., Zhou, R., Adhikarla, E., Yan, Z., Liu, Y., Yu, J., Liu, Z., Chen, X., Davison, B.D., Ren, H., et al.: A generalist vision–language foundation model for diverse biomedical tasks. Nature Medicine, 1–13 (2024) https://www.nature.com/articles/s41591-024-03185-2
Image Search Series
Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub
The Microsoft healthcare AI models are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established.
You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology
Introduction Dermatology is inherently visual, with diagnosis often relying on morphological features such as color, texture, shape, and spatial distribution of skin lesions. However, the diagnostic process is complicated by the large number of dermatologic conditions, with over 3,000 identified entities, and the substantial variability in their presentation across different anatomical sites, age groups, and skin tones. This phenotypic diversity presents significant challenges, even for experienced clinicians, and can lead to diagnostic uncertainty in both routine and complex cases. Image-based retrieval systems represent a promising approach to address these challenges. By enabling users to query large-scale image databases using a visual example, these systems can return semantically or visually similar cases, offering useful reference points for clinical decision support. However, dermatology image search is uniquely demanding. Systems must exhibit robustness to variations in image quality, lighting, and skin pigmentation while maintaining high retrieval precision across heterogeneous datasets. Beyond clinical applications, scalable and efficient image search frameworks provide valuable support for research, education, and dataset curation. They enable automated exploration of large image repositories, assist in selecting challenging examples to enhance model robustness, and promote better generalization of machine learning models across diverse populations. In this post, we continue our series on using healthcare AI models in Azure AI Foundry to create efficient image search systems. We explore the design and implementation of such a system for dermatology applications. As a baseline, we first present an adapter-based classification framework for dermatology images by leveraging fixed embeddings from the MedImageInsight foundation model, available in the Azure AI Foundry model catalog. 
We then introduce a Retrieval-Augmented Generation (RAG) method that enhances vision-language models through similarity-based in-context prompting. We use the MedImageInsight foundation model to generate image embeddings and retrieve the top-k visually similar training examples via FAISS. The retrieved image-label pairs are included in the Vision-LLM prompt as in-context examples. This targeted prompting guides the model using visually and semantically aligned references, enhancing prediction quality on fine-grained dermatological tasks. It is important to highlight that the models available on the AI Foundry Model Catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources. The Data The DermaVQA-IIYI [2] dermatology image dataset is a de-identified, diverse collection of nearly 1,000 patient records and nearly 3,000 dermatological images, created to support research in skin condition recognition, classification, and visual question answering. DermaVQA-IIYI dataset: https://osf.io/72rp3/files/osfstorage (data/iiyi) The dataset is split into three subsets: Training Set: 2,474 images associated with 842 patient cases Validation Set: 157 images associated with 56 cases Test Set: 314 images associated with 100 cases Total Records: 2,945 images (998 patient cases) Patient Demographics: Out of 998 patient cases: Sex – F: 218, M: 239, UNK: 541 Age (available for 398 patients): Mean: 31 yrs | Min: 0.08 yrs | Max: 92 yrs This wide range supports studies across all age groups, from infants to the elderly. A total of 2,945 images are associated with the patient records, with an average of 2.9 images per patient. 
This multiplicity enables the study of skin conditions from different perspectives and at various stages. Image Count per Entry: 1 image: 225 patients 2 images: 285 patients 3 images: 200 patients 4 or more images: 288 patients The dataset includes additional annotations for anatomic location, comprising 39 distinct labels (e.g., back, fingers, fingernail, lower leg, forearm, eye region, unidentifiable). Each image is associated with one or multiple labels. We use these annotations to evaluate the performance of various methods across different anatomical regions. Image Embeddings We generate image embeddings using the MedImageInsight foundation model [1] from the Azure AI Foundry model catalog [3]. We apply Uniform Manifold Approximation and Projection (UMAP) to project high-dimensional image embeddings produced by the MedImageInsight model into two dimensions. The visualization is generated using embeddings extracted from both the DermaVQA training and test sets, which covers 39 anatomical regions. For clarity, only the most frequent anatomical labels are displayed in the projection. Figure 1. UMAP projection of image embeddings produced by the MedImageInsight Model on the DermaVQA dataset. The resulting projection reveals that the MedImageInsight model captures meaningful anatomical distinctions: visually distinct regions such as fingers, face, fingernail, and foot form well-separated clusters, indicating high intra-class consistency and inter-class separability. Other anatomically adjacent or visually similar regions, such as back, arm, and abdomen, show moderate overlap, which is expected due to shared visual features or potential labeling ambiguity. Overall, the embeddings exhibit a coherent and interpretable organization, suggesting that the model has learned to encode both local and global anatomical structures. 
This supports the model’s effectiveness in capturing anatomy-specific representations suitable for downstream tasks such as classification and retrieval. Enhancing Visual Understanding We explore two strategies for enhancing visual understanding through foundation models. I. Training an Adapter-based Classifier We build an adapter-based classification framework designed for efficient adaptation to medical imaging tasks (see our prior posts for introduction into the topic of adapters: Unlocking the Magic of Embedding Models: Practical Patterns for Healthcare AI | Microsoft Community Hub). The proposed adapter model builds upon fixed visual features extracted from the MedImageInsight foundation model, enabling task-specific fine-tuning without requiring full model retraining. The architecture consists of three main components: MLP Adapter: A two-layer feedforward network that projects 1024-dimensional embeddings (generated by the MedImageInsight model) into a 512-dimensional latent space. This module utilizes GELU activation and Layer Normalization to enhance training stability and representational capacity. As a bottleneck adapter, it facilitates parameter-efficient transfer learning. Convolutional Retrieval Module: A sequence of two 1D convolutional layers with GELU activation, applied to the output of the MLP adapter. This component refines the representations by modeling local dependencies within the transformed feature space. Prediction Head: A linear classifier that maps the 512-dimensional refined features to the task-specific output space (e.g., 39 dermatology classes). The classifier is trained for 10 epochs (approximately 48 seconds) using only CPU resources. Built on fixed image embeddings extracted from the MedImageInsight model, the adapter efficiently tailors these representations for downstream classification tasks with minimal computational overhead. 
By updating only the adapter components, while keeping the MedImageInsight backbone frozen, the model significantly reduces computational and memory overhead. This design also mitigates overfitting, making it particularly effective in medical imaging scenarios with limited or imbalanced labeled data. A Jupyter Notebook detailing the construction and training of a MedImageInsight-based adapter model is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-adapter Figure 3: MedImageInsight-based Adapter Model II. Boosting Vision-Language Models with in-Context Prompting We leverage vision-language models (e.g., GPT-4o, GPT-4.1), which represent a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models are particularly promising for dermatology tasks due to their ability to interpret complex visual patterns in medical images while simultaneously understanding domain-specific medical terminology. 1. Few-shot Prompting In this setting, a small number of examples from the training dataset are randomly selected and embedded into the input prompt. These examples, consisting of paired images and corresponding labels, are intended to guide the model's interpretation of new inputs by providing contextual cues and examples of relevant dermatological features. 2. MedImageInsight-based Retrieval-Augmented Generation (RAG) This approach enhances vision-language model performance by integrating a similarity-based retrieval mechanism rooted in MedImageInsight (Medical Image-to-Image) comparison. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k dermatological training images that are most visually similar to a given query image. The retrieved examples, consisting of dermatological images and their corresponding labels, are then used as in-context examples in the Vision-LLM prompt.
By presenting visually similar cases, this approach provides the model with more targeted contextual references, enabling it to generate predictions grounded in relevant visual patterns and associated clinical semantics. As illustrated in Figure 2, the system operates in two phases: Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval. Query and Retrieval: At inference time, the test image is encoded similarly to produce a query embedding. The system computes the Euclidean distance between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances. To handle the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors. The implementation of the image search method is available in our Samples Repository: https://aka.ms/healthcare-ai-examples-mi2-2d-image-search Figure 2: MedImageInsight-based Retrieval-Augmented Generation Evaluation Table 1 presents accuracy scores for anatomic location prediction on the DermaVQA-iiyi test set using the proposed modeling approaches. The adapter model achieves a baseline accuracy of 31.73%. Vision-language models perform better, with GPT-4o (2024-11-20) achieving an accuracy of 47.11%, and GPT-4.1 (2025-04-14) improving to 50%. However, incorporating few-shot prompting with five randomly selected in-context examples (5-shot) slightly reduces GPT-4.1’s performance to 48.72%. This decline suggests that unguided example selection may introduce irrelevant or low-quality context, potentially reducing the effectiveness of the model’s predictions for this specialized task. 
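The two example-selection strategies compared here, random few-shot selection and nearest-neighbor (RAG) selection, differ only in how the in-context examples are chosen. The sketch below illustrates that difference under stated assumptions: embeddings are toy 2-dimensional NumPy vectors, the brute-force distance computation stands in for FAISS, the file names are hypothetical, and the prompt is a plain-text placeholder (in a real multimodal request the images themselves would be attached, and the exact prompt wording used in the experiments is not given in this post).

```python
import numpy as np


def select_random(num_train: int, k: int, rng: np.random.Generator) -> list:
    """Few-shot baseline: pick k distinct training examples at random."""
    return rng.choice(num_train, size=k, replace=False).tolist()


def select_nearest(train_embeddings: np.ndarray, query_embedding: np.ndarray, k: int) -> list:
    """RAG selection: pick the k training examples closest to the query (L2)."""
    dists = np.linalg.norm(train_embeddings - query_embedding, axis=1)
    return np.argsort(dists, kind="stable")[:k].tolist()


def build_prompt(example_ids, image_names, labels) -> str:
    """Assemble a text-only stand-in for the in-context prompt (hypothetical wording)."""
    lines = ["Predict the anatomic location shown in the query image.", "Examples:"]
    for i in example_ids:
        lines.append(f"- image: {image_names[i]} -> label: {labels[i]}")
    lines.append("Query image: <attached>")
    return "\n".join(lines)


# Toy training set; the label names come from the dataset description, the rest is made up.
train_embeddings = np.array([[0.0, 0.0], [1.0, 1.0], [0.9, 1.2], [5.0, 5.0]])
labels = ["back", "fingers", "fingernail", "lower leg"]
image_names = ["img_0.jpg", "img_1.jpg", "img_2.jpg", "img_3.jpg"]

query_embedding = np.array([1.0, 1.1])

random_ids = select_random(len(labels), k=2, rng=np.random.default_rng(0))
nearest_ids = select_nearest(train_embeddings, query_embedding, k=2)

print(nearest_ids)  # [1, 2]
print(build_prompt(nearest_ids, image_names, labels))
```

With nearest-neighbor selection the prompt contains only examples that resemble the query, whereas random selection may include unrelated regions.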
The best performance among the vision-language approaches is achieved using the retrieval-augmented generation (RAG) strategy. In this setup, GPT-4.1 is prompted with five nearest-neighbor examples retrieved using the MedImageInsight-based search method (RAG-5), leading to a notable accuracy increase to 51.60%. This improvement over GPT-4.1’s 50% accuracy without retrieval showcases the relevance of the MedImageInsight-based RAG method. We expect larger performance gains when using a more extensive dermatology dataset, compared to the relatively small dataset used in this example -- a collection of 2,474 images associated with 842 patient cases which served as the basis for selecting relevant cases and similar images. Dermatology is a particularly challenging domain, marked by a high number of distinct conditions and significant variability in skin tone, texture, and lesion appearance. This diversity makes robust and representative example retrieval especially critical for enhancing model performance. The results underscore the importance of example relevance in few-shot prompting, demonstrating that similarity-based retrieval can effectively guide the model toward more accurate predictions in complex visual reasoning tasks. Table 1: Comparative Accuracy of Anatomic Location Prediction on DermaVQA-iiyi Figure 2: Confusion Matrix of Anatomical Location Predictions by the trained MLP adapter: The matrix illustrates the model's performance in classifying wound images across 39 anatomical regions. Strong diagonal values indicate correct classifications, while off-diagonal entries highlight common misclassifications, particularly among anatomically adjacent or visually similar regions such as 'lowerback' vs. 'back' and 'hand' vs. 'fingers'. Figure 3. Examples of correct anatomical predictions by the RAG approach. Each image depicts a case where the model's predicted anatomical region exactly matches the ground truth. 
Shown are examples from visually and anatomically distinct areas including the eye region, lips, lower leg, and neck. Figure 4. Examples of misclassifications by the RAG approach. Each image displays a case where the predicted anatomical label differs from the ground truth. In several examples, predictions are anatomically close to the correct regions (e.g., hand vs. hand-back, lower leg vs. foot, palm vs. fingers), suggesting that misclassifications often occur between adjacent or visually similar areas. These cases highlight the challenge of precise localization in fine-grained anatomical classification and the importance of accounting for anatomical ambiguity in both modeling and evaluation. Conclusion Our exploration of scalable image retrieval and advanced prompting strategies demonstrates the growing potential of vision-language models in dermatology. A particularly challenging task we address is anatomic location prediction, which involves 39 fine-grained classes of dermatology images, imbalanced training data, and frequent misclassifications between adjacent or visually similar regions. By leveraging Retrieval-Augmented Generation (RAG) with similarity-based example selection using image embeddings from the MedImageInsight foundation model, we show that relevant contextual guidance can significantly improve model performance in such complex settings. These findings underscore the importance of intelligent image retrieval and prompt construction for enhancing prediction accuracy in fine-grained medical tasks. As vision-language models continue to evolve, their integration with retrieval mechanisms and foundation models holds substantial promise for advancing clinical decision support, medical research, and education at scale. 
In the next blog of this series, we will shift focus to the wound care subdomain of dermatology, and we will release accompanying Jupyter notebooks for the adapter-based and RAG-based methods to provide a reproducible reference implementation for researchers and practitioners. The Microsoft healthcare AI models, including MedImageInsight, are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.
References
[1] Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542 (2024)
[2] Wen-wai Yim, Yujuan Fu, Zhaoyi Sun, Asma Ben Abacha, Meliha Yetisgen, Fei Xia: DermaVQA: A Multilingual Visual Question Answering Dataset for Dermatology. MICCAI (5) 2024: 209-219
[3] Model catalog and collections in Azure AI Foundry portal https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
Image Search Series
Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub
Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub
Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub
Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval
Introduction Wound assessment and management are central tasks in clinical practice, requiring accurate documentation and timely decision-making. Clinicians and nurses often rely on visual inspection to evaluate wound characteristics such as size, color, tissue composition, and healing progress. However, when seeking comparable cases (e.g., to inform treatment choices, validate assessments, or support education), existing search methods have significant limitations. Traditional keyword-based systems require precise terminology, which may not align with the way wounds are described in practice. Moreover, textual descriptors cannot fully capture the variability of visual wound features, resulting in incomplete or imprecise retrieval. Recent advances in computer vision offer new opportunities to address these challenges through both image classification and image retrieval. Automated classification of wound images into clinically meaningful categories (e.g., wound type, tissue condition, infection status) can support standardized documentation and assist clinicians in making more consistent assessments. In parallel, image retrieval systems enable search based on visual similarity rather than textual input alone, allowing clinicians to query databases directly with wound images and retrieve cases with similar characteristics. Together, these AI-based functionalities have the potential to improve case comparison, facilitate consistent monitoring, and enhance clinical training by providing immediate access to relevant examples and structured decision support. The Data The WoundcareVQA dataset is a new multimodal multilingual dataset for Wound Care Visual Question Answering. The WoundcareVQA dataset is available at https://osf.io/xsj5u/ [1] Table 1 summarizes dataset statistics. WoundcareVQA contains 748 images associated with 447 instances (each instance/query includes one or more images). 
The dataset is split into training (279 instances, 449 images), validation (105 instances, 147 images), and test (93 instances, 152 images). The training set was annotated by a single expert, the validation set by two annotators, and the test set by three medical doctors. Each query is also labeled with wound metadata, covering seven categories: anatomic location (41 classes), wound type (8), wound thickness (6), tissue color (6), drainage amount (6), drainage type (5), and infection status (3). Table 1: Statistics about the WoundcareVQA Dataset We selected two tasks with the highest inter-annotator agreement: Wound Type Classification and Infection Detection (cf. Table 2). Table 3 lists the classification labels for these tasks. Table 2: Inter-Annotator Agreement in the WoundcareVQA Dataset Table 3: Classification Labels for the Tasks: Infection Detection & Wound Type Classification Methods 1. Foundation-Model-based Image Search This approach relies on an image similarity-based retrieval mechanism using a medical foundation model, MedImageInsight [2-3]. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k training images most visually similar to a given query image. The image search system operates in two phases: Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval. Query and Retrieval: At inference time, the test image is encoded to produce a query embedding. The system computes the Euclidean distances between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances. To address the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors. 2. 
Vision-Language Models (VLMs) & Retrieval-Augmented Generation (RAG) We leverage vision-language models (e.g., GPT-4o, GPT-4.1), a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models can be used for wound assessment tasks due to their ability to interpret complex visual patterns in medical images while simultaneously understanding medical terminology. We evaluate three settings: Zero-shot: The model predicts directly from the query input without additional examples. Few-shot Prompting: A small number of examples (5) from the training dataset are randomly selected and embedded into the input prompt. These paired images and labels provide contextual cues that guide the model's interpretation of new inputs. Retrieval-Augmented Generation (RAG): The system first retrieves the Top-k visually similar wound images using the MedImageInsight-based image search described above. The language model then reasons over the retrieved examples and their labels to generate the final prediction. The implementation of the MedImageInsight-based image search and the RAG method for the infection detection task is available in our Samples Repository: https://aka.ms/healthcare-ai-examples rag_infection_detection.ipynb Evaluation We computed accuracy scores to evaluate the image search methods (Top-1 and Top-5 with majority vote), GPT-4o and GPT-4.1 models (zero-shot), as well as 5-shot and RAG-based methods. Table 4 reports accuracy for wound type classification and infection detection. Figure 1 presents examples of correct and incorrect predictions. 
| Accuracy | Image Search Top-1 | Image Search Top-5 + majority vote | GPT-4o (2023-07-01) | GPT-4o (2024-11-20) | GPT-4.1 (2025-04-14) | GPT-4.1 5-shot Prompting | GPT-4.1 RAG-5 |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Wound Type | 0.7933 | 0.8333 | 0.4671 | 0.4803 | 0.5066 | 0.6118 | 0.7533 |
| Infection | 0.6800 | 0.7267 | 0.3947 | 0.3882 | 0.375 | 0.7237 | 0.7697 |

Table 4: Accuracy Scores for Wound Type Classification & Infection Detection
Figure 1: Examples of Correct and Incorrect Predictions (GPT-4.1-RAG-5 Method)
For wound type classification, image search with MedImageInsight embeddings performs best, achieving 0.7933 (Top-1) and 0.8333 (Top-5 + majority vote). GPT models alone perform substantially worse (0.4671-0.6118), while GPT-4.1 with retrieval augmentation (RAG-5), which uses the same MedImageInsight-based image search method to retrieve the Top-5 similar cases, narrows the gap (0.7533) but does not surpass direct image search. This suggests that categorical wound type is more effectively captured by visual similarity than by case-based reasoning with vision-language models. For infection detection, the trend reverses. Image search reaches 0.7267 (Top-5 + majority vote), while RAG-5 achieves the highest accuracy at 0.7697. In this case, the combination of visually similar cases with VLM-based reasoning outperforms both standalone image search and GPT prompting. This indicates that infection assessment depends on contextual or clinical cues that may not be fully captured by visual similarity alone but can be better interpreted when enriched with contextual reasoning over retrieved cases and their associated labels. Overall, these findings highlight complementary strengths: foundation-model-based image search excels at categorical visual classification (wound type), while retrieval-augmented VLMs leverage both visual similarity and contextual reasoning to improve performance on more nuanced tasks (infection detection). A hybrid system integrating both approaches may provide the most robust clinical support. 
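To make the retrieval settings compared above more concrete, here is a minimal sketch of Top-k Euclidean retrieval, majority voting, and turning the retrieved cases into prompt context for a RAG-style query. It is not the code from the samples repository: the brute-force search is a pure-Python stand-in for the FAISS index, the 2-D vectors stand in for MedImageInsight embeddings, and the function names and the labels `infected` / `not_infected` are hypothetical.

```python
import math
from collections import Counter


def top_k_neighbors(query, index, k):
    """Return the k (distance, label) pairs closest to `query` by Euclidean distance.

    `index` is a list of (embedding, label) pairs built from the training images.
    """
    scored = [(math.dist(query, emb), label) for emb, label in index]
    scored.sort(key=lambda pair: pair[0])  # smallest distance first
    return scored[:k]


def majority_vote(neighbors):
    """Most frequent label; ties go to the tied label whose example is nearest."""
    counts = Counter(label for _, label in neighbors)
    best = max(counts.values())
    for _, label in neighbors:  # neighbors are already sorted by distance
        if counts[label] == best:
            return label


def build_rag_context(neighbors):
    """Format retrieved cases as text lines that could be added to a VLM prompt."""
    return "\n".join(
        f"Example {i}: label={label} (distance={dist:.3f})"
        for i, (dist, label) in enumerate(neighbors, start=1)
    )


# Toy 2-D "embeddings" (assumption); real embeddings are much higher-dimensional.
toy_index = [
    ([0.0, 0.0], "infected"),
    ([0.1, 0.0], "infected"),
    ([1.0, 1.0], "not_infected"),
    ([0.9, 1.1], "not_infected"),
    ([0.2, 0.1], "infected"),
]
query = [0.02, 0.03]

top1_label = top_k_neighbors(query, toy_index, k=1)[0][1]  # Top-1 setting
top5 = top_k_neighbors(query, toy_index, k=5)
top5_label = majority_vote(top5)                           # Top-5 + majority vote
rag_context = build_rag_context(top5)                      # context for RAG-5
print(top1_label, top5_label)
print(rag_context)
```

In the Top-1 and Top-5 settings the retrieved labels are the prediction; in the RAG-5 setting the same neighbors are instead handed to the vision-language model, which makes the final call.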
Conclusion This study demonstrates the complementary roles of foundation-model image search and vision-language models in wound assessment. Image search using foundation-model embeddings shows strong performance on categorical tasks such as wound type classification, where visual similarity is most informative. In contrast, retrieval-augmented generation (RAG-5), which combines image search with case-based reasoning by a vision-language model, achieves the best results for infection detection, highlighting the value of integrating contextual interpretation with visual features. These findings suggest that a hybrid approach, leveraging both direct image similarity and retrieval-augmented reasoning, provides the most robust pathway for clinical decision support in wound care. References Wen-wai Yim, Asma Ben Abacha, Robert Doerning, Chia-Yu Chen, Jiaying Xu, Anita Subbarao, Zixuan Yu, Fei Xia, M Kennedy Hall, Meliha Yetisgen. WoundcareVQA: A Multilingual Visual Question Answering Benchmark Dataset for Wound Care. Journal of Biomedical Informatics, 2025. Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. 
CoRR abs/2410.06542 (2024) Model catalog and collections in Azure AI Foundry portal https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview Image Search Series Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub
Copilot Chat: Snack Pack #1
Hungry? Grab your treat of choice (savory, sweet, crispy, or gooey) and come munch with us! We're kicking off a five-part series of Copilot Chat "Snack Packs": bite-sized articles packed with tips and tricks to help you become a Copilot Chat pro. Today, we're starting with the basics: accessing Copilot Chat. On PC and Mac: Follow the download links below to install the Copilot Chat desktop app. Double-click the installer when prompted, and you're in. Windows: Microsoft 365 Copilot - Free download and install on Windows | Microsoft Store MacOS: Microsoft 365 Copilot on the App Store On Mobile: Scan the QR code to download the app to your device. In Your Browser: Prefer not to download anything? You can also access Copilot Chat from Microsoft 365 Copilot Chat. Once you're in, try starting a conversation in the prompt box. Not sure where to begin? No worries: use or tweak one of the suggested prompts to get going. Here are a few other handy entry points: Reading a long article? Pop open the Copilot sidebar to get a quick, clear summary, with no more word soup. In the middle of a meeting? Access Chat within Teams for a seamless, in-the-flow experience. Writing an email? Use Chat right inside Outlook to refine your writing and send polished messages, all in one place. Take this information, chew on it 😉, and test it out in your daily workflows. Next week, we'll be diving into prompting best practices. See you then!
How Microsoft Dragon Copilot Uses The Azure Health Data Services De-Identification Service
Empowering physician productivity through secure AI Microsoft developed Dragon Copilot to revolutionize real-time clinical documentation. Using clinically adapted generative AI, it listens to patient-clinician conversations and automatically generates draft clinical notes, freeing physicians to focus on what matters most: their patients. Dragon Copilot also allows clinicians to get the information they need when they need it and automates many other tasks such as initiating orders or writing draft patient after-visit summaries. The tool eliminates the burden of manual note-taking and multiple other clicks in the EMR, boosting efficiency, and reducing burnout, all of which are critical challenges in healthcare. With strong market traction across hospitals and physician practices across the USA, Dragon Copilot, previously known as Dragon Ambient eXperience (DAX) Copilot, has become a trusted productivity engine for healthcare organizations. In a field where protecting patient data is critical , privacy is paramount. Dragon Copilot’s deep commitment to data privacy, however, requires a strategic partner like the de-identification service to support safe and responsible AI development at scale. How the Azure Health Data Services de-identification service empowers Dragon Copilot Dragon Copilot operates at the intersection of audio capture, natural language generation (NLG), and clinical workflows. Its data pipelines include highly sensitive patient health information. As a result, Microsoft has invested in the Azure Health Data Services de-identification service to de-identify millions of patient transcripts and notes to uphold strict privacy standards and deliver secure, scalable clinical documentation. De-identifying unstructured text like clinical notes is particularly challenging due to the complexity and variability of how Protected Health Information (PHI) appears in real-world clinical documentation. 
References to dates like “Christmas” or “New Year’s Eve,” names, locations, and other identifiers are often embedded in free text in unpredictable ways. The Azure Health Data Services de-identification service is purpose-built to handle these nuances. It accurately identifies and replaces patient names while distinguishing them from doctors’ names, and it can also detect and tag the names of family members or close contacts mentioned in the clinical narrative. The service also retains the format of the dates present in clinical notes, shifting them by a random number of days within a 45-day window, and replaces holidays with surrogates close in seasonality. A key strength of the de-identification service is its use of surrogation, where sensitive terms are replaced with realistic, context-appropriate substitutes. This approach, used in services like Dragon Copilot, helps ensure clinical notes remain readable and useful while concealing real PHI in plain sight, strengthening privacy without sacrificing usability. Connecting to Microsoft Fabric for scalable analytics Once Dragon Copilot generates draft clinical notes, the data can be securely ingested into Microsoft Fabric, a unified data platform built for analytics and governance. Within Fabric, healthcare organizations can centralize and manage de-identified data using OneLake, making it accessible for advanced analytics, operational reporting, and research. Azure Health Data Services plays a critical role in this ecosystem by ensuring that sensitive PHI is de-identified before analysis, allowing healthcare agents to extract meaningful insights, identify trends, and optimize care delivery without compromising patient privacy. Use Cases unlocked through partnering with the Azure Health Data Services de-identification service Azure Health Data Services de-identification has become a critical component of the Dragon Copilot data ingestion pipeline. 
Our service supports several teams within Dragon Copilot: Research Enablement: De-identified data fuels AI model building, success tracking, and product improvement, without exposing sensitive patient data. AI Model Quality & Evaluation: De-identified data supports safe iteration and experimentation while preserving important context (e.g., gender, timeline, and more). What makes Azure Health Data Services de-identification service stand out Dragon Copilot builds on the consistency, robustness, and seamless integration offered by Azure Health Data Services' de-identification capabilities. This service is purpose-built for healthcare and plays a critical role in enabling Dragon Copilot to uphold the highest privacy standards while continuing to innovate. Key strengths of the service include: Context Preservation: Maintains formatting and context alignment, which are essential for clinical accuracy. Surrogation Support: Replaces PHI with realistic pseudonyms to ensure de-identified data remains useful for model training. Beyond HIPAA Compliance: De-identifies 27 categories of PHI, surpassing HIPAA’s 18 identifiers, to support more comprehensive privacy protection. This foundation allows Dragon Copilot to scale responsibly, ensuring both compliance and usability in real-world clinical settings. Looking Ahead: Where Dragon Copilot is going with de-identification As Dragon Copilot expands and continues to add new capabilities, Azure Health Data Services de-identification service will continue to be a foundational piece of its AI development lifecycle. For Dragon Copilot, de-identification isn’t just a checkbox; it’s a catalyst for innovation. Learn more about the Azure Health Data Services De-identification service
Model Mondays S2E11: Exploring Speech AI in Azure AI Foundry
1. Weekly Highlights This week’s top news in the Azure AI ecosystem included: Lakuna (Copilot Studio Agent for Product Teams): A hackathon project built with Copilot Studio and Azure AI Foundry, Lakuna analyzes your requirements and docs to surface hidden assumptions, helping teams reflect, test, and reduce bias in product planning. Azure ND H200 v5 VMs for AI: Azure Machine Learning introduced ND H200 v5 VMs, featuring NVIDIA H200 GPUs (over 1TB GPU memory per VM!) for massive models, bigger context windows, and ultra-fast throughput. Agent Factory Blog Series: The next wave of agentic AI is about extensibility: plug your agents into hundreds of APIs and services using Model Context Protocol (MCP) for portable, reusable tool integrations. GPT-5 Tool Calling on Azure AI Foundry: GPT-5 models now support free-form tool calling, with no more rigid JSON! Output SQL, Python, configs, and more in your preferred format for natural, flexible workflows. Microsoft a Leader in 2025 Gartner Magic Quadrant: Azure was again named a leader for Cloud Native Application Platforms, validating its end-to-end runway for AI, microservices, DevOps, and more. 2. Spotlight On: Azure AI Foundry Speech Playground The main segment featured a live demo of the new Azure AI Speech Playground (now part of Foundry), showing how developers can experiment with and deploy cutting-edge voice, transcription, and avatar capabilities. Key Features & Demos: Speech Recognition (Speech-to-Text): Try real-time transcription directly in the playground, recognizing natural speech, pauses, accents, and domain terms. Batch and Fast transcription options for large files and blob storage. Custom Speech: Fine-tune models for your industry, vocabulary, and noise conditions. Text to Speech (TTS): Instantly convert text into natural, expressive audio in 150+ languages with 600+ neural voices. Demo: Listen to pre-built voices, explore whispering, cheerful, angry, and more styles. 
Custom Neural Voice: Clone and train your own professional or personal voice (with strict Responsible AI controls). Avatars & Video Translation: Bring your apps to life with prebuilt avatars and video translation, which syncs voice-overs to speakers in multilingual videos. Voice Live API: Voice Live API (Preview) integrates all premium speech capabilities with large language models, enabling real-time, proactive voice agents and chatbots. Demo: Language learning agent with voice, avatars, and proactive engagement. One-click code export for deployment in your IDE. 3. Customer Story: Hilo Health This week’s customer spotlight featured Hilo Health, a healthcare technology company using Azure AI to boost efficiency for doctors, staff, and patients. How Hilo Uses Azure AI: Document Management: Automates fax/document filing, splits multi-page faxes by patient, reduces staff effort and errors using Azure Computer Vision and Document Intelligence. Ambient Listening: Ambient clinical note transcription captures doctor-patient conversations and summarizes them for easy EHR documentation. Genie AI Contact Center: Agentic voice assistants handle patient calls, book appointments, answer billing/refill questions, escalate to humans, and assist human agents, using Azure Communication Services, Azure Functions, FastAPI (community), and Azure OpenAI. Conversational Campaigns: Outbound reminders, procedure preps, and follow-ups all handled by voice AI, freeing up human staff. Impact: Hilo reaches 16,000+ physician practices and 180,000 providers, automates millions of communications, and processes $2B+ in payments annually, demonstrating how multimodal AI transforms patient journeys from first call to post-visit care. 4. Key Takeaways Here’s what you need to know from S2E11: Speech AI is Accessible: The Azure AI Foundry Speech Playground makes experimenting with voice recognition, TTS, and avatars easy for everyone. 
From Playground to Production: Fine-tune, export code, and deploy speech models in your own apps with Azure Speech Service. Responsible AI Built-In: Custom Neural Voice and avatars require application and approval, ensuring ethical, secure use. Agentic AI Everywhere: Voice Live API brings real-time, multimodal voice agents to any workflow. Healthcare Example: Hilo’s use of Azure AI shows the real-world impact of speech and agentic AI, from patient intake to after-visit care. Join the Community: Keep learning and building—join the Discord and Forum. Sharda's Tips: How I Wrote This Blog I organize key moments from each episode, highlight product demos and customer stories, and use GitHub Copilot for structure. For this recap, I tested the Speech Playground myself, explored the docs, and summarized answers to common developer questions on security, dialects, and deployment. Here’s my favorite Copilot prompt this week: "Generate a technical blog post for Model Mondays S2E11 based on the transcript and episode details. Focus on Azure Speech Playground, TTS, avatars, Voice Live API, and healthcare use cases. Add practical links for developers and students!" Coming Up Next Week Next week: Observability! Learn how to monitor, evaluate, and debug your AI models and workflows using Azure and OpenAI tools. Register For The Livestream – Sep 1, 2025 Register For The AMA – Sep 5, 2025 Ask Questions & View Recaps – Discussion Forum About Model Mondays Model Mondays is your weekly Azure AI learning series: 5-Minute Highlights: Latest AI news and product updates 15-Minute Spotlight: Demos and deep dives with product teams 30-Minute AMA Fridays: Ask anything in Discord or the forum Start building: Register For Livestreams Watch Past Replays Register For AMA Recap Past AMAs Join The Community Don’t build alone! 
The Azure AI Developer Community is here for real-time chats, events, and support: Join the Discord Explore the Forum About Me I'm Sharda, a Gold Microsoft Learn Student Ambassador focused on cloud and AI. Find me on GitHub, Dev.to, Tech Community, and LinkedIn. In this blog series, I share takeaways from each week’s Model Mondays livestream.
Agentic AI in Healthcare
Healthcare organizations are at a crossroads where rising patient loads, complex data, and administrative burdens demand new solutions. Agentic AI – AI systems capable of autonomous action – is emerging as a catalyst for transformation, promising to act not just as tools but as collaborative digital team members. Microsoft’s ecosystem of AI technologies provides a robust foundation to harness agentic AI in healthcare. This report offers a comprehensive overview of agentic AI, distinguishes it from traditional AI, and explores its role in clinical workflows, administrative efficiency, patient engagement, and data governance. It also examines how Microsoft’s offerings (Microsoft 365 Copilot, Azure Health Data Services, Microsoft Fabric, Copilot Studio, and more) enable these advances responsibly and in compliance with healthcare regulations like HIPAA.
Towards Robust Evaluation of Multi-Agent Systems in Clinical Settings
Authors: Hao Qiu, Leonardo Schettini, Mert Öz, Noel Codella, Sam Preston, Wen-wai Yim As multi-agent systems become more capable and collaborative, their behavior begins to exhibit emergent properties that are difficult to predict or control – particularly in safety-critical domains like healthcare. Coordination among agents can yield outputs that are non-deterministic, multi-faceted, and context-sensitive. This makes robust evaluation not just a matter of accuracy, but of safety, accountability, and trust. Traditional NLP metrics like ROUGE or BLEU fall short in these settings as they presuppose a single ground truth and fail to capture clinically relevant errors such as subtle omissions, hallucinations, or fact distortions. To address this, we present a modular evaluation framework for the Healthcare Agent Orchestrator, designed to support fine-grained, clinically grounded assessment across both deployed clinical workflows and simulated scenarios. This framework enables targeted stress-testing of multi-agent behavior – particularly how agents share information, reason under uncertainty, and maintain factual fidelity in high-stakes contexts. Central to our framework is TBFact, a domain-specific factuality metric that evaluates agent outputs based on three key criteria: factual inclusion, factual distortion, and factual omission. TBFact shows strong correlation with human experts (κ=0.760) and demonstrates that our Patient History agent successfully included up to 94% of high-importance information in the generated patient timelines. To ground evaluations of the Patient History agent, we constructed a high-quality benchmark dataset from de-identified tumor board discussions and associated patient histories. The formatting of the reference patient timeline summaries (originally written by medical professionals) was standardized via a large language model to facilitate consistent evaluation. 
Under our benchmark, the Patient History agent included over 94% of high-importance facts (counting both fully and partially entailed information) but achieved a TBFact recall of 0.84 on high-importance facts, showing that TBFact's strict entailment criteria and partial credit scoring create meaningful headroom for future improvements. For more technical information about the evaluation framework, refer to the documentation. The healthcare-agent-orchestrator repository also includes an evaluation notebook with concrete examples for simulating conversations and evaluating them.
Figure: High-level architecture of the evaluation framework, showing data sources (real and simulated conversations) feeding into modular metrics for both orchestrator and individual agent assessment.
Available Metrics Traditional similarity metrics (e.g., ROUGE, BERTScore) fail to capture subtle yet critical factual inaccuracies in the output. Moreover, in agentic workflows, a ground truth answer often doesn’t exist or is expensive to curate. To overcome these shortcomings, we leverage Model-as-a-Judge to implement the following metrics:

| Component | Metric | Description |
| --- | --- | --- |
| Orchestrator | Agent and tool selection accuracy | Correct routing to specialized agents. |
| Orchestrator | Intent resolution | How accurately the orchestrator interprets and completes user requests, including scoping and clarification. |
| Orchestrator | Information aggregation | Effective synthesis of multiple agent outputs. |
| Individual Agents | Context relevancy | Relevance of retrieved information in relation to the user’s requests. |
| Individual Agents | TBFact (Factual Consistency) | An adapted version of RadFact for the text modality that measures the factuality of claims in agents' messages and helps identify omissions and hallucinations. |

Large Language Models serve as useful evaluation tools in our framework, offering advantages especially when ground truth data is not available. 
They can follow detailed evaluation guidelines, maintain consistency when applying criteria across conversations, and generate explanations for their assessments, which facilitates verification of the evaluation process. However, due to their subjective nature, LLM-based evaluations should be treated as directional signals that guide system improvement rather than as absolute judgments of correctness. To complement LLM-based metrics with reproducible measurements, especially when reference data is available, we include a ROUGE implementation, which serves as an example for developers who want to incorporate other similarity metrics like BLEU or BERTScore by extending the ReferenceBasedMetric class. TBFact: Domain-Specific Factuality Evaluation TBFact builds on RadFact (Bannur et al., 2024), a framework originally developed for evaluating factual consistency in radiology reports, by adapting its core principles to the text-only modality of healthcare agent interactions: Fact Extraction: Separately decomposes both agent responses and reference texts into discrete factual claims, categorized by clinical relevance (e.g., demographics, diagnosis, treatment). Logical Entailment: Compares each fact to determine if it's fully entailed, partially entailed, or not entailed by the reference, and further categorizes the reason for partial and total mismatches into “missing”, “ambiguous”, “incorrect” or “other”. Metric Calculation: TBFact performs the logical entailment in two directions: Precision (pred-to-gold): Measures the proportion of factual claims in the agent’s output that are supported by the reference data. A lower precision score may indicate the presence of hallucinated or extraneous facts not found in the reference, even if they are accurate. Precision can be seen as a proxy for succinctness. Recall (gold-to-pred): Measures the proportion of reference facts that are successfully captured in the agent’s output. 
A lower recall score signals missing or omitted information, which is especially critical in clinical contexts where completeness is essential. By operating at the level of atomic factual units, TBFact shifts the focus from holistic summary judgments to targeted, claim-by-claim analysis. While claim extraction introduces its own challenges—such as ensuring consistent coverage of verifiable content, maintaining entailment fidelity, and handling decontextualization (Metropolitansky & Larson, 2025)—factual claims make the evaluation process more modular and transparent, providing actionable insights into where and how agent responses differ from references. For example, when evaluating a discharge summary, TBFact might identify that while demographic facts achieve 95% precision, treatment recommendations only reach 75% recall, pinpointing specific areas for agent improvement. This granular feedback enables developers to identify systematic issues, such as an agent consistently omitting medication dosages or incorrectly interpreting temporal information, that would be difficult to detect with traditional metrics. Data Sources Due to the challenge of having real-world data for each use-case we want to evaluate, and to accommodate different development stages and data availability, the framework supports two primary evaluation modes: Real conversations: Healthcare Agent Orchestrator automatically saves chat sessions whenever a conversation is terminated with the command @Orchestrator: clear, enabling insight into actual clinical workflow performance. Simulated conversations: Generated for controlled testing using predefined scripts or adaptive scenarios. Essential for specialized scenarios with limited real-world data. Results and Performance Assessment Note: The following results represent initial validation from our current research phase, with ongoing work expanding evaluation scope and refining methodologies. 
These preliminary results demonstrate promising capabilities for clinical system coordination and factual accuracy assessment. Orchestrator Performance We evaluated the orchestrator using simulated conversations across multiple patient scenarios. GPT-4o served as the evaluator, providing both quantitative scores and qualitative explanations based on defined metric criteria. In this preliminary experiment, the orchestrator demonstrated promising coordination capabilities:

| Metric | Score Range | Average Score |
| --- | --- | --- |
| Agent Selection Accuracy | 3.89 – 5 | 4 |
| Intent Resolution | 4 – 5 | 4.5 |
| Information Aggregation | 3 – 5 | 3.7 |

In our preliminary evaluation, agent selection examples are relatively straightforward given our agents' well-defined responsibilities but provide a foundation for expanding to more complex scenarios involving agent-human expert interactions as we gather real-world data. Future work could include turn-level labeling of tumor board dataset dialogues to test classification accuracy of choosing the right next expert or agent. Agent selection can also be combined with "tool selection" metrics, addressing the fragmentation problem in multi-agent evaluation approaches. In the current state, we mainly used the explanations provided by the evaluator model to better understand the behavior of the system in clinical workflows and guide the development process. Patient History Agent Performance with TBFact To evaluate the Patient History agent, we used an anonymized and PHI-free proprietary dataset, named TB-Bench, that comprehensively aggregates diverse medical records for 71 patients who had undergone the care of a Molecular Tumor Board (MTB). TB-Bench includes data such as tumor board transcripts, exported EHR data, and clinician-generated patient summaries. 
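As a rough illustration of how TBFact's two-directional scoring could be computed once each fact has an entailment verdict, here is a minimal sketch. It is not the framework's implementation: the `JudgedFact` type, the `score` and `f1` helpers, and especially the 0.5 credit for partial entailment are assumptions made for this example, and the verdicts are typed in by hand rather than produced by a model.

```python
from dataclasses import dataclass

# Assumed credit per entailment verdict; the 0.5 for "partial" is illustrative only.
CREDIT = {"entailed": 1.0, "partial": 0.5, "not_entailed": 0.0}


@dataclass
class JudgedFact:
    text: str
    verdict: str     # "entailed", "partial", or "not_entailed"
    importance: str  # "low", "medium", or "high"


def score(facts, importance=None):
    """Average credit over facts, optionally restricted to one importance level."""
    selected = [f for f in facts if importance is None or f.importance == importance]
    if not selected:
        return 0.0
    return sum(CREDIT[f.verdict] for f in selected) / len(selected)


def f1(precision, recall):
    """Harmonic mean of precision and recall (0.0 when both are zero)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Pred-to-gold: facts from the agent output, judged against the reference.
pred_facts = [
    JudgedFact("fact A", "entailed", "high"),
    JudgedFact("fact B", "entailed", "medium"),
    JudgedFact("fact C", "partial", "high"),
    JudgedFact("fact D", "not_entailed", "low"),
]
# Gold-to-pred: facts from the reference, judged against the agent output.
gold_facts = [
    JudgedFact("fact 1", "entailed", "high"),
    JudgedFact("fact 2", "partial", "high"),
    JudgedFact("fact 3", "entailed", "medium"),
    JudgedFact("fact 4", "not_entailed", "low"),
    JudgedFact("fact 5", "entailed", "high"),
]

precision = score(pred_facts)                       # support for the agent's claims
recall_all = score(gold_facts)                      # coverage of all reference facts
recall_high = score(gold_facts, importance="high")  # coverage of high-importance facts
print(precision, recall_all, recall_high, f1(precision, recall_all))
```

Filtering by importance before averaging is what separates a recall figure for all facts from a recall figure for important facts.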
Due to the logistical challenges involved in curating such a comprehensive dataset across potentially multiple healthcare institutions and record-keeping systems, we found that in some instances clinician-generated summaries available in the tumor board transcripts might refer to patient records that were lost in the data curation process. This mismatch made direct evaluation challenging. Therefore, to ensure evaluation reflects system performance when complete patient records are accessible, we used TBFact to evaluate the agent’s output against a curated set of dataset-verifiable facts, that is, facts limited to information that is present in the dataset. While TBFact measures both recall and precision of fact generation, our study focuses on recall because it measures how much of all important information is covered, which we consider the most critical metric for clinical applications where missing information can have serious consequences. The preliminary experiments revealed significant performance improvements through prompt optimization and format adjustments. With specialized prompting, we specify the types of information to prioritize, such as biomarker results, imaging assessments, and treatment timelines. For instance, our updated prompt instructs the agent to “organize the patient data in chronological order” and explicitly calls out key elements to include: “all biomarkers”, “response to treatment including dates and imaging,” and “a summary of current status.” This prompt engineering approach proved to be one of the most effective levers for improving the quality and completeness of Patient History outputs. 
| Configuration | TBFact Recall for All Facts | TBFact Recall for Important Facts |
| --- | --- | --- |
| Generic prompts (baseline) | 0.56 | 0.66 |
| Specialized prompts | 0.71 | 0.84 |

Since TBFact operates by comparing discrete factual claims, higher scores indicate that the agent is, according to the reference data, factually accurate and comprehensive in its coverage of the available patient information. In other words, optimizing for TBFact scores brings the agent’s output structurally and semantically closer to the curated reference timelines. In our case, that meant striving for detailed outputs, including information about allergies and ongoing medications, even when specific dates were unavailable. This underscores the importance of having high-quality, human-validated reference datasets: without them, even well-performing agents may appear incomplete or inaccurate.

Human Validation Study

To validate TBFact's reliability, we conducted a preliminary study with human annotators, medical scribes by training, using 71 patient records. Two annotators assessed (a) whether a claim was properly extracted from its source text, (b) whether the fact was important (low, medium, high), and (c) whether individual claims were properly entailed by a reference text. Inter-annotator agreement was measured at 0.999 for task (a), 0.66 (strict) and 0.77 (relaxed) for task (b), and 0.914 for task (c).

The accuracy of the fact extraction pipeline was calculated to be 99.9%, validating that minimal-to-no hallucinations are introduced during the fact extraction phase. System accuracy for fact importance classification was 66% when measured strictly; when allowing a tolerance of one level (e.g. classifying medium instead of high), it was 93%. These values are comparable to those of the medical annotators. Entailment classification accuracy was 88%, suggesting that the system recognizes entailment reasonably well.
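The difference between the strict and the relaxed importance scores can be illustrated with a small sketch. It assumes the three levels are treated as an ordered scale (low < medium < high) and that "relaxed" means a prediction counts as correct when it is at most one level away from the reference label; the function name, the `tolerance` parameter, and the toy labels are hypothetical and not taken from the TBFact code.

```python
# Ordinal encoding of the three importance levels (an assumption of this sketch).
LEVELS = {"low": 0, "medium": 1, "high": 2}


def importance_accuracy(predicted, reference, tolerance=0):
    """Share of predictions within `tolerance` levels of the reference label.

    tolerance=0 gives strict accuracy; tolerance=1 accepts off-by-one labels,
    e.g. "medium" predicted where the reference says "high".
    """
    if len(predicted) != len(reference):
        raise ValueError("predicted and reference must have the same length")
    if not reference:
        return 0.0
    hits = sum(
        abs(LEVELS[p] - LEVELS[r]) <= tolerance
        for p, r in zip(predicted, reference)
    )
    return hits / len(reference)


# Toy labels for six facts (made up for illustration).
reference = ["high", "high", "medium", "low", "medium", "high"]
predicted = ["high", "medium", "medium", "high", "low", "high"]

strict = importance_accuracy(predicted, reference, tolerance=0)
relaxed = importance_accuracy(predicted, reference, tolerance=1)

print(f"strict accuracy:  {strict:.2f}")
print(f"relaxed accuracy: {relaxed:.2f}")
```

In the toy data only the low-versus-high confusion is still counted as an error under the relaxed setting, which is why the relaxed score is higher than the strict one.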
Finally, we measured the correlation between the end-to-end TBFact F1 score of the system and that of the human annotators using Kendall Tau, Pearson, and Spearman correlations. These were 55.8%, 70.5%, and 72.8% respectively, moderate-to-high correlations suggesting that the TBFact metric is well aligned with expert clinical reasoning.

Qualitative insights from TBFact

The table below illustrates how TBFact evaluates factual alignment between agent-generated summaries and reference data. Each row shows a fact extracted from the agent’s output, the corresponding excerpt from the reference, and the entailment judgment. The logical entailment was produced by TBFact, while the accompanying explanations were generated separately to support interpretability.

| Facts Extracted from Agent Response | Related Excerpt from Reference Text (Ground Truth) | TBFact Judgment |
| --- | --- | --- |
| Molecular studies from the 2019-05-18 surgery identified TERT promoter mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. | […] Tumor Genetics: EGFR: Amplified CDKN2A/B: Deleted PTEN: p.L112R TERT: c.-146C>T Chromosome 10: Monosomy Chromosome 7: Trisomy […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | ✔ Entailed: The summary lists TERT mutation, PTEN mutation, EGFR amplification, CDKN2A/B deletion, monosomy 10, and trisomy 7. |
| Immunohistochemistry from 2019-05-18 showed GFAP positive, BRAF V600E negative, IDH1 R132H negative, ATRX retained, p53 negative, and a Ki-67 index of 3%. | […] Tumor Genetics: IDH1: Wildtype - BRAF V600E: Negative […] Timeline: 05/18/2019: Diagnosis of multifocal glioblastoma; craniotomy and resection of lesion from right temporal lobe. […] | ⚠️ Partial Entailment: Some IHC findings match (BRAF negative, IDH1 wildtype) but others (GFAP, p53, Ki-67) are not mentioned in the reference summary. |
| During the first cycle of CCNU on 2020-04-14, the patient reported significant fatigue, thrombocytopenia, and occasional confusion. | Introduction: […] The patient is experiencing poor tolerance to lomustine and is considering discontinuation due to further disease progression as confirmed by recent MRI scans. […] Timeline: 04/14/2020 - Present: Lomustine treatment initiated. […] | ⚠️ Partial Entailment: Poor tolerance to lomustine is reported, but specific side effects are not listed in the reference summary. |
| On 2020-05-16, the plan was to continue CCNU and monitor with imaging. | No related information in the reference text. | ⚠️ No Entailment: No mention in the summary of a plan on 2020-05-16 to continue CCNU with imaging follow-up. |

These examples show that partial entailments are not necessarily errors. In many cases, they reflect the agent surfacing clinically relevant details that are absent from the reference. This is especially important in healthcare settings, where agent outputs may synthesize information across multiple documents or express facts in more complete or structured ways than the reference does.

To further assess the factual grounding of the agent’s outputs, we compared all facts extracted from the Patient History agent’s summaries against the full set of available data for each patient in the TB-Bench dataset. We found that 97% of the extracted facts were entailed by at least one data point. Upon manually reviewing the remaining 3% of facts, we found that they often reflected condensed or synthesized information drawn from multiple sources, meaning these claims could not be matched to any one document in our one-to-one entailment setup. While we cannot rule out the presence of hallucinations entirely, this analysis highlights the agent’s capacity for multi-source summarization.

Closing Thoughts

As multi-agent systems become more capable and autonomous, robust evaluation must evolve in parallel.
The framework presented here is a step toward that goal: modular, clinically grounded, and designed to surface actionable insights across both simulated and real-world workflows. By moving beyond traditional accuracy metrics and embracing factuality, relevance, and coordination as core evaluation dimensions, we can better understand how multi-agent systems work, and when and why they fail.

Our preliminary experiments and insights reinforce the value of TBFact not just as a metric, but as a diagnostic tool. Its structured, claim-level analysis (combined with fact categorization and human validation) offers a transparent and clinically meaningful way to evaluate and improve healthcare agents. In evaluating the Patient History agent, our findings indicate that the agent remains faithful to the underlying data and produces complete, clinically relevant summaries. These outputs can help physicians prepare more efficiently and productively for tumor board review meetings and, as part of a multi-agent chat, can facilitate further investigation and understanding of patients.

Looking ahead, we see several promising directions for extending this work: incorporating human-in-the-loop review pipelines, expanding to multimodal evaluation, improving observability across agent interactions, and scaling to more diverse real-world datasets. We are also developing a standardized benchmark of synthetic and de-identified patient cases to support broader community testing and reproducibility. We hope this work encourages others to adopt similarly rigorous approaches to evaluation, and to contribute to the development of shared benchmarks, metrics, and methodologies.

References

Bannur, S., Bouzid, K., Castro, D. C., Schwaighofer, A., Thieme, A., Bond-Taylor, S., ... & Hyland, S. L. (2024). MAIRA-2: Grounded Radiology Report Generation. arXiv:2406.04449v2.

Metropolitansky, D. & Larson, J. (2025). Towards Effective Extraction and Evaluation of Factual Claims.
arXiv:2502.10855v2.

Azure Logic App AI-Powered Monitoring Solution: Automate, Analyze, and Act on Your Azure Data
Introduction

In today’s cloud-driven world, monitoring and analyzing application health is critical for business continuity and operational excellence. However, the sheer volume of monitoring data can make it challenging to extract actionable insights quickly. Enter the Azure Logic App AI-Powered Monitoring Solution: an intelligent, serverless pipeline that leverages Azure Logic Apps and Azure OpenAI to automate monitoring, analyze data, and deliver comprehensive reports right to your inbox. This solution is ideal for organizations seeking to modernize their monitoring workflows, reduce manual analysis, and empower teams with AI-driven insights for faster decision-making.

What Does This Solution Accomplish?

The Azure Logic App AI-Powered Monitoring Solution creates an automated pipeline that:

- Extracts monitoring data from Azure Log Analytics using KQL queries.
- Analyzes data with AI using the Azure OpenAI GPT-4o model.
- Generates intelligent reports and sends them via email.
- Runs automatically on a daily schedule.
- Uses managed identity for secure authentication across Azure services.

Business Case Solved

- Automated Monitoring: No more manual log reviews; let AI do the heavy lifting.
- Actionable Insights: Receive daily, AI-generated summaries highlighting system health, key metrics, potential issues, and recommendations.
- Operational Efficiency: Reduce time-to-insight and empower teams to act faster on critical events.
- Secure and Scalable: Built on Azure’s serverless and identity-driven architecture.

Key Features

- Serverless Architecture: Built on Azure Logic Apps Standard for scalability and cost efficiency.
- AI-Powered Insights: Uses Azure OpenAI for advanced data analysis and summarization.
- Infrastructure as Code: Deployable via Bicep templates for reproducibility and automation.
- Secure by Design: Managed identity and Azure RBAC ensure secure access.
- Cost Effective: Pay-per-execution model with optimized resource usage.
- Customizable: Easily modify KQL queries and AI prompts to fit your monitoring needs.

Solution Architecture

Technologies Involved

- Azure Logic Apps Standard: Orchestrates the workflow.
- Azure OpenAI Service (GPT-4o): Performs AI-powered data analysis and summarization.
- Azure Log Analytics: Source for monitoring data, queried via KQL.
- Application Insights: Monitors workflow execution and telemetry.
- Azure Storage Account: Stores Logic App runtime data.
- Managed Identity: Secures authentication across Azure services.
- Infrastructure as Code (Bicep): Enables automated, repeatable deployments.
- Office 365 Connector: Sends email notifications.

Support

- Documentation: https://docs.microsoft.com/en-us/azure/logic-apps/
- Issues: https://github.com/vinod-soni-microsoft/logicapp-ai-summarize/issues

Star this repository if you find it helpful!

Optimizing Azure Healthcare Multimodal AI Models for Intel CPU Architecture
Alexander Mehmet Ersoy, Principal Product Manager, Microsoft HLS AI
Abhishek Khowala, Principal AI Engineer, Intel
Ravi Panchumarthy, AI Framework Engineer, Intel
Srinarayan Srikanthan, AI Framework Engineer, Intel
Ekaterina Aidova, AI Frameworks Engineer, Intel
Alberto Santamaria-Pang, Principal Applied Data Scientist, Microsoft HLS AI and Adjunct Faculty at Johns Hopkins Medicine, Microsoft
Peter Lee, Applied Scientist, Microsoft HLS AI and Adjunct Assistant Professor at Vanderbilt University
Ivan Tarapov, Sr. Director, Microsoft HLS AI
Pradeep Sakhamoori, Sr. SW Engineer, Microsoft

The Rise of Multimodal AI in Healthcare

The healthcare sector is witnessing a surge in the adoption of multimodal AI models, which are crucial for applications ranging from diagnostics to personalized treatment plans. These models combine data from various sources such as medical images, patient records, and genomic data to provide comprehensive insights. The Model Catalog of multimodal healthcare foundation models in Microsoft’s Azure AI Foundry is at the forefront of this change. Recently launched models (such as MedImageInsights, MedImageParse, CXRReportGen [8], and many others) are designed to help healthcare organizations rapidly build and deploy AI solutions tailored to their specific needs, while minimizing the extensive compute and data requirements typically associated with building multimodal models from scratch. Real-world examples of our industry partners adopting multimodal AI models are highlighted in the article “Unlocking next-generation AI capabilities with healthcare AI models”.

Challenges and Opportunities in Hardware Optimization

As models get more complex, which is the case with the foundation model trend, the demands on the hardware rise. While GPUs remain the platform of choice for minimizing model execution times, CPUs present substantial optimization possibilities, especially for inference workloads.
We believe that providing a framework for efficient CPU-based environments holds huge potential for many production scenarios where speed can be traded off for cost savings. With multimodal healthcare AI, the complexity of handling different data modalities and ensuring efficient inference requires innovative solutions and collaboration between industry leaders.

Companies are increasingly looking towards hardware-specific optimizations to enhance model efficiency and reduce latency while keeping costs at bay. Intel, with its robust suite of AI tools and extensions for frameworks like PyTorch, is pioneering this optimization effort. For instance, the Intel® Distribution of OpenVINO™ toolkit has been instrumental in accelerating the development of computer vision and deep learning applications in healthcare [1]. You can learn about our recent collaboration with Intel on AI optimizations to advance medical innovations in the article “Empower Medical Innovations: Intel Accelerates PadChest & fMRI Models on Microsoft Azure* Machine Learning”.

The demand for AI applications in healthcare is rapidly increasing. Multimodal AI models, which can process and analyze complex datasets, are essential for tasks such as early disease detection, treatment planning, and patient monitoring. While optimizing these models to perform efficiently on specific hardware is important, it is not necessarily a barrier to adoption. Models optimized with CUDA for NVIDIA GPUs often deliver optimal performance and run faster than on any other hardware. However, the benefit of using CPUs lies in the tradeoff they offer. You can choose to optimize for speed by running your model on a GPU and optimizing for it in PyTorch, or you can optimize for cost by sacrificing speed. This is the proposition here: the option to run the model slower on an accessible CPU, which can be advantageous in scenarios where speed is not the primary concern, but access to GPU hardware is.
The Intel® oneAPI Deep Neural Network Library (oneDNN) has proven effective in reducing the burden of GPU requirements and accelerating time to market for AI solutions [2]. Both Intel® Extension for PyTorch (IPEX) and OpenVINO utilize Intel® oneDNN to accelerate deep learning operations, taking advantage of underlying hardware features. IPEX optimizes existing PyTorch workflows with minimal code changes. OpenVINO provides cross-platform deep learning optimization for deployment flexibility. In this blog post, a custom deployment was implemented using CXRReportGen along with both IPEX and OpenVINO optimizations, demonstrating how these techniques can support different deployment scenarios and technical requirements. This optimization is accessible through Azure's compute services and Intel's technology.

Benchmarking and Performance Acceleration

To address these challenges, our new collaboration with Intel focuses on leveraging Intel’s advanced AI tools and hardware capabilities to optimize multimodal AI models for greater healthcare access. By utilizing Intel® Extension for PyTorch and other optimization techniques, we aim to minimize model run time on CPUs. While this may slightly degrade performance, the main benefit is addressing the problem of GPU hardware scarcity. This partnership not only underscores the importance of hardware-specific optimizations but also sets a new standard for AI model deployment in real-world healthcare applications.

Both IPEX and OpenVINO are built on a common foundation, Intel® oneDNN, which is a high-performance library designed specifically for deep learning applications and optimized for Intel architecture.
oneDNN leverages specialized hardware instructions available in Intel processors, such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) [3] on Intel CPUs, as well as Intel® Xe Matrix Extensions (XMX) AI engines on Intel discrete GPUs.

Figure 1: oneDNN Library

IPEX [4] extends PyTorch* with the latest performance optimizations for Intel hardware [5]. It leverages oneDNN under the hood to provide optimized implementations of key operations. This allows developers to stay within their existing PyTorch code with minimal changes, making it an excellent choice for teams already comfortable with the PyTorch ecosystem who want to quickly optimize their models for Intel hardware.

    import torch

    # Import IPEX
    import intel_extension_for_pytorch as ipex

    # Model is a placeholder for your own torch.nn.Module subclass
    model = Model()
    model.eval()

    # Optimize with IPEX, using bfloat16 as the target data type
    model = ipex.optimize(model, dtype=torch.bfloat16)

    # Continue with inference as normal

Figure 2: Intel Extension for PyTorch

The Intel® Distribution of OpenVINO™ toolkit is a powerful solution for optimizing and deploying deep learning models across a wide range of Intel hardware [6]. Like IPEX, it leverages oneDNN under the hood, but takes a different approach, offering cross-platform optimization and flexible deployment options. OpenVINO supports two main workflows: a convenience workflow, where you run models directly with minimal setup, and a performance workflow, recommended for production, where models are first converted offline into the OpenVINO Intermediate Representation (IR). This one-time conversion step enables highly optimized inference and allows the final application to remain lightweight and efficient.

Here’s a simple example using OpenVINO for inference with a pre-converted IR model.
Refer to the OpenVINO Notebooks repo for more samples:

    import openvino as ov

    core = ov.Core()

    # Load the OpenVINO IR model and compile it for the CPU device
    compiled_model = core.compile_model("model.xml", "CPU")

    # Run inference; input_tensor_name and input_tensor are placeholders
    # for your model's input name and the prepared input data
    infer_request = compiled_model.create_infer_request()
    results = infer_request.infer({input_tensor_name: input_tensor})

Figure 3: OpenVINO toolkit Overview.

IPEX and OpenVINO are supported on all Intel architectures. However, for optimal performance, Intel recommends using instances powered by 4th Gen Intel® Xeon® Scalable processors or newer, which feature AMX and other hardware acceleration capabilities, such as Azure’s v6-series (e.g., Standard_E48s_v6) [7].

Results

We conducted a detailed performance benchmark using CXRReportGen, a state-of-the-art foundation model designed to generate a list of radiological findings from chest X-rays, on Standard_E48s_v6 hardware (48 vCPUs, 384 GiB RAM) with and without IPEX and OpenVINO optimization. We realized up to 70% improvement in CXRReportGen foundation model run time when applying optimizations with IPEX and similarly substantial gains using OpenVINO, compared to the non-optimized baseline on the same CPU hardware. This significant improvement highlights the potential of leveraging Intel's performance optimizations to make critical healthcare AI models more cost-efficient and accessible. Such advancements enable healthcare providers to deploy advanced diagnostic tools even in resource-constrained environments, ultimately improving patient care and operational efficiency.

| SKU | Run Type (100 Runs) | Mean Run Time (seconds) | Standard Deviation of Run Time (seconds) |
| --- | --- | --- | --- |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | No Optimization | 22.47 | 0.1061 |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | IPEX | 8.21 | 0.2375 |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | OpenVINO | 7.01 | 0.0569 |

Table 1: Performance Comparison of CXRReportGen Model Across 100 Runs with CPU.
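As a quick check on Table 1, the relative run time reduction and the speedup factor can be derived directly from the reported means. The sketch below uses only the three mean run times from the table; the dictionary and variable names are illustrative.

```python
# Mean run times in seconds, copied from Table 1.
mean_runtime_s = {
    "No Optimization": 22.47,
    "IPEX": 8.21,
    "OpenVINO": 7.01,
}

baseline = mean_runtime_s["No Optimization"]

reduction_pct = {}  # percent of baseline run time saved
speedup = {}        # how many times faster than the baseline

for run_type, runtime in mean_runtime_s.items():
    if run_type == "No Optimization":
        continue
    reduction_pct[run_type] = (baseline - runtime) / baseline * 100
    speedup[run_type] = baseline / runtime
    print(
        f"{run_type}: {reduction_pct[run_type]:.1f}% lower run time, "
        f"{speedup[run_type]:.2f}x faster than baseline"
    )
```

On these means, IPEX works out to roughly a 63% reduction (about 2.7x faster) and OpenVINO to roughly 69% (about 3.2x faster), which is in line with the "up to 70%" figure quoted in the Results text.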
Future Prospects and Innovations

Our benchmarks of Intel optimizations with both IPEX and OpenVINO show great potential for decreasing the run time of our foundation models and increasing scalability via CPU. This optimization positions Intel CPUs as a viable deployment target. It not only increases deployment options but also offers opportunities to reduce cloud costs with CPU-based instances and even to consider deploying these workflows on existing compute headroom at the edge. For custom deployments, the setup described in this blog post is now available on the provided compute instances in Azure and with optimization software from Intel, so that developers can optimize inference workloads while taking advantage of the large memory pools available via CPU for handling large batch workloads. Our model runtime optimizations with Intel are being considered for availability in the Azure AI model catalog. Please stay tuned for further updates.

As we continue to innovate and optimize, the potential for AI to transform healthcare and improve patient outcomes becomes increasingly attainable. We are now better equipped than ever to make it easier for our partners and customers to create connected experiences at every point of care, empower their healthcare workforce, and unlock the value from their data using data standards that are important to the healthcare industry.

References

[1] Intel OpenVINO Optimizes Deep Learning Performance for Healthcare Imaging
[2] Accelerating Healthcare Diagnostics with Intel oneAPI and AI Tools
[3] Intel Advanced Matrix Extensions
[4] Intel Extension for PyTorch
[5] Accelerate with Intel Extension for PyTorch
[6] Intel Accelerates PadChest and fMRI Models on Azure ML
[7] Azure’s first 5th Gen Intel® Xeon® processor instances are now available and we're excited!
[8] CxrReportGen Model Card in Azure AI Foundry

The healthcare AI models in Azure AI Foundry are intended for research and model development exploration.
The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.