Asma Ben Abacha, Alberto Santamaria-Pang, Ho Hin Lee, Thomas Lin, Ivan Tarapov (Microsoft Health AI)
Introduction
As the use of diagnostic 3D images increases, effective management and analysis of these large volumes of data grows in importance. Medical 3D image search systems can play a vital role by enabling clinicians to quickly retrieve relevant or similar images and cases based on the anatomical features and pathologies present in a query image.
Unlike traditional 2D imaging, 3D imaging offers a more comprehensive view, allowing anatomical structures to be examined from multiple planes with greater clarity and detail. This enhanced visualization has the potential to assist doctors with improved diagnostic accuracy and more precise treatment planning.
Moreover, advanced 3D image retrieval systems can support evidence-based and cohort-based diagnostics, opening an opportunity for more accurate predictions and personalized treatment options. These systems also hold significant potential for advancing research, supporting medical education, and enhancing healthcare services.
This blog offers guidance on using Azure AI Foundry and the recently launched healthcare AI models to design and test a 3D image search system that can retrieve similar radiology images from a large collection of 3D images.
Along with this blog, we share a Jupyter Notebook with the 3D image search system code, which you may use to reproduce the experiments presented here or to start your own solution.
- 3D Image Search Notebook: http://aka.ms/healthcare-ai-examples-mi2-3d-image-search
It is important to highlight that the models available in the AI Foundry model catalog are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating the models' appropriateness for specific tasks and eventually integrating them into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.
The Problem
Generally, the problem of 3D image search can be posed as retrieving cross-sectional (CS) imaging series (3D image results) that are similar to a given CS imaging series (the query 3D image). Once posed this way, the key question becomes: how do we define such similarity? In the previous blog of this series, we worked with chest radiographs, which constrained the notion of "similar" to the similarity between two 2D images and a certain class of anatomy. With 3D images, we are dealing with a volume of data and many more variations of anatomy and pathology, which expands the dimensions to consider for similarity: are we looking for similar anatomy? Similar pathology? Similar exam type?
In this blog, we discuss a technique that approximates the 3D similarity problem with a 2D image embedding model and some amount of supervision, constraining the problem to a certain class of pathologies (lesions) and casting it as: "given a cross-sectional imaging series, retrieve series with a similar grade of lesions in similar anatomical regions".
To build a search system for 3D radiology images with a foundation model (MedImageInsight) designed for 2D inputs, we generate a representative 3D embedding vector for each volume from the foundation model's embeddings of its 2D slices, and use these vectors to build a vector index over a large collection of 3D images. Retrieving relevant results for a given 3D image then consists of generating its representative 3D embedding vector and searching for similar vectors in the index.
An overview of this process is illustrated in Figure 1.
Figure 1: Overview of the 3D image search process.
The Data
In the sample notebook provided alongside this blog, we use 3D CT images from the Medical Segmentation Decathlon (MSD) dataset [2-3] and annotations from the 3D-MIR benchmark [4]. The 3D-MIR benchmark offers four collections (Liver, Colon, Pancreas, and Lung) of positive and negative examples created from the MSD dataset, with additional annotations for a lesion flag (with/without lesion) and a lesion group (1, 2, 3). The lesion grouping focuses on lesion morphology and distribution, considering the number, length, and volume of the lesions to define the three groups. It also follows the recommendations of the American Joint Committee on Cancer's Tumor, Node, Metastasis (TNM) classification system for classifying cancer stages, providing a standardized framework for correlating lesion morphology with cancer stage. We selected the 3D-MIR Pancreas collection.
- 3D-MIR Benchmark: https://github.com/abachaa/3D-MIR
Since the MSD collections only include unhealthy/positive volumes, each 3D-MIR collection was augmented with volumes randomly selected from the other datasets to integrate healthy/negative examples in the training and test splits. For instance, the Pancreas dataset was augmented using volumes from the Colon, Liver, and Lung datasets.
The input images consist of CT volumes and associated 2D slices. The training set is used to create the index, and the test set is used to query and evaluate the 3D search system.
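To make the input data concrete, the sketch below shows one way to load a CT volume and extract its axial slices for 2D embedding. This is a minimal sketch, assuming the MSD volumes are stored as NIfTI files; it uses nibabel, a simple soft-tissue intensity window, and a hypothetical file path.

```python
import nibabel as nib
import numpy as np

def load_axial_slices(nifti_path, window_center=40, window_width=400):
    """Load a CT volume and return its axial slices as 8-bit arrays.

    Applies a simple soft-tissue intensity window before rescaling,
    so the slices can be fed to a 2D image embedding model.
    """
    volume = nib.load(nifti_path).get_fdata()  # typically (H, W, num_slices)
    lo = window_center - window_width / 2
    hi = window_center + window_width / 2
    volume = np.clip(volume, lo, hi)
    volume = ((volume - lo) / (hi - lo) * 255).astype(np.uint8)
    # Split along the last axis into individual 2D axial slices.
    return [volume[:, :, i] for i in range(volume.shape[-1])]

# Hypothetical path to one MSD pancreas volume:
slices = load_axial_slices("Task07_Pancreas/imagesTr/pancreas_001.nii.gz")
```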
3D Image Retrieval
Our search strategy, called volume-based retrieval, relies on aggregating the embeddings of the 2D slices of a volume to generate one representative 3D embedding vector for the whole volume. We describe additional search strategies in our 3D-MIR paper [4].
The 2D slice embeddings are generated using the MedImageInsight foundation model [5-6] from the Azure AI Foundry model catalog [1]. In the search step, we generate the embedding of the 3D query volume according to the selected aggregation method (Agg) and search for the top-k similar volumes/vectors in the corresponding 3D (Agg) index. We use the Median aggregation method to generate the 3D vectors and create the associated index, constructing a 3D (Median) index from the training slices/volumes of the 3D-MIR Pancreas collection. Three other aggregation methods are available in the 3D image search notebook: Max Pooling, Average Pooling, and Standard Deviation.
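To illustrate the volume-based aggregation step, here is a minimal sketch. It assumes an `embed_slice` helper (a hypothetical name) that maps a single 2D slice to its MedImageInsight embedding vector, for example by calling a deployed Azure AI Foundry endpoint, and shows the Median aggregation alongside the three alternatives.

```python
import numpy as np

def embed_volume(slices, embed_slice, method="median"):
    """Aggregate per-slice 2D embeddings into one 3D volume embedding.

    `embed_slice` is a user-supplied function mapping a 2D slice to a
    1D embedding vector (e.g., a call to a MedImageInsight endpoint).
    """
    E = np.stack([embed_slice(s) for s in slices])  # (num_slices, dim)
    aggregators = {
        "median": np.median,  # robust to outlier slices
        "max": np.max,        # max pooling
        "mean": np.mean,      # average pooling
        "std": np.std,        # standard deviation
    }
    return aggregators[method](E, axis=0).astype(np.float32)
```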
The search follows the k-Nearest Neighbors algorithm (k-NN search): the k nearest neighbors of a given vector are found by computing the distances between the query vector and all other vectors in the collection, then selecting the k vectors with the shortest distances. For large collections this computation can be expensive, so it is recommended to use optimized libraries. We use FAISS (Facebook AI Similarity Search), an open-source library for efficient similarity search and clustering of high-dimensional vectors.
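The sketch below builds a FAISS index over the aggregated training vectors and runs a top-k search for one query volume. It uses a flat (exact) L2 index for simplicity; `train_volumes`, `test_slices`, and `embed_slice` are illustrative names that build on the earlier sketches.

```python
import faiss
import numpy as np

# Build the index from the aggregated 3D embeddings of the training volumes.
# `train_volumes` is a list of slice lists (see the loading sketch above),
# `embed_slice` the per-slice embedding function (hypothetical helper).
train_vectors = np.stack(
    [embed_volume(slices, embed_slice, method="median") for slices in train_volumes]
)

index = faiss.IndexFlatL2(train_vectors.shape[1])  # exact L2 search
index.add(train_vectors)

# Query with one test volume and retrieve its 10 nearest neighbors.
query = embed_volume(test_slices, embed_slice, method="median").reshape(1, -1)
distances, neighbor_ids = index.search(query, 10)
```

For larger collections, FAISS also provides approximate index types that trade a small amount of recall for much faster search.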
Evaluation of the search results
The 3D-MIR Pancreas test set consists of 32 volumes:
- 4 volumes with no lesion (lesion flag/group = -1)
- 3 volumes with lesion group 1
- 19 volumes with lesion group 2
- 6 volumes with lesion group 3
The training set consists of 269 volumes (with and without lesions) and was used to create the index.
We evaluate the 3D search system by comparing the lesion group/category of the query volume with those of the top 10 retrieved volumes. We then compute Precision@k (P@k).
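For reference, P@k here measures, for each query, the fraction of the top-k retrieved volumes whose lesion group matches the query's, averaged over all test queries. A minimal sketch, with hypothetical variable names:

```python
import numpy as np

def precision_at_k(query_group, retrieved_groups, k):
    """Fraction of the top-k retrieved volumes whose lesion group matches the query."""
    return sum(g == query_group for g in retrieved_groups[:k]) / k

# results: list of (query_group, ranked list of retrieved lesion groups) pairs
# for k in (1, 3, 5, 10):
#     print(k, np.mean([precision_at_k(q, r, k) for q, r in results]))
```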
Table 1 presents the P@1, P@3, P@5, P@10, and overall Precision.
Table 1: Evaluation results on the 3D-MIR Pancreas test set.
The system accurately recognizes healthy cases, consistently retrieving the correct label in test scenarios involving non-lesion pancreas images. However, performance varies across lesion groups, reflecting the challenge of precisely identifying smaller lesions (Group 1) or more advanced lesions (Group 3). This discrepancy highlights the complexity of lesion detection and underscores the importance of carefully tuning the embeddings or adjusting the vector index to improve retrieval accuracy for specific lesion sizes.
Visualization
Figure 2 presents four different test queries from the Pancreas test set and the top 5 nearest neighbors retrieved by the volume-based search method. In each row, the first image is the query, followed by the retrieved images ranked by similarity. The visual overlays help in assessing retrieval accuracy:
- Blue indicates the pancreas organ boundaries, and
- Red marks the regions corresponding to pancreatic tumors.
Figure 2: Top 5 results for different queries from the Pancreas test set.
Table 2 presents additional results of the volume-based retrieval system [4] on other 3D-MIR datasets/organs (Liver, Colon, and Lung) using additional foundation models: BiomedCLIP [7], Med-Flamingo [8], and BiomedGPT [9].
When considering the macro-average across all datasets, MedImageInsight-based retrieval substantially outperforms the other foundation models.
Table 2: Evaluation results on the 3D-MIR benchmark (Liver, Colon, Pancreas, and Lung).
These results mirror a use case akin to lesion detection and severity measurement in a clinical context. In real-world applications—such as diagnostic support or treatment planning—it may be necessary to optimize the model to account for particular goals (e.g., detecting critical lesions early) or accommodate different imaging protocols. By refining search criteria, integrating more domain-specific data, or adjusting embedding methods, practitioners can enhance retrieval precision and better meet clinical requirements.
Conclusion
The integration of 3D image search systems in clinical environments can enhance and accelerate the retrieval of similar cases, providing better context to clinicians and researchers for accurate diagnosis of complex cases, cohort selection, and personalized patient care. This 3D radiology image search blog and the related notebook offer a solution based on 3D embedding generation for building and evaluating a 3D image search system using the MedImageInsight foundation model from the Azure AI Foundry model catalog.
References
1. Model catalog and collections in Azure AI Foundry portal. https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
2. Michela Antonelli et al.: The Medical Segmentation Decathlon. Nature Communications, 13(4128), 2022. https://www.nature.com/articles/s41467-022-30695-9
3. MSD: http://medicaldecathlon.com/
4. Asma Ben Abacha, Alberto Santamaría-Pang, Ho Hin Lee, Jameson Merkow, Qin Cai, Surya Teja Devarakonda, Abdullah Islam, Julia Gong, Matthew P. Lungren, Thomas Lin, Noel C. F. Codella, Ivan Tarapov: 3D-MIR: A Benchmark and Empirical Study on 3D Medical Image Retrieval in Radiology. CoRR abs/2311.13752, 2023. https://arxiv.org/abs/2311.13752
5. Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei: MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542, 2024. https://arxiv.org/abs/2410.06542
6. MedImageInsight model card: https://aka.ms/mi2modelcard
7. Sheng Zhang, Yanbo Xu, Naoto Usuyama, Hanwen Xu, Jaspreet Bagga, Robert Tinn, Sam Preston, Rajesh Rao, Mu Wei, Naveen Valluri, Cliff Wong, Andrea Tupini, Yu Wang, Matt Mazzola, Swadheen Shukla, Lars Liden, Jianfeng Gao, Angela Crabtree, Brian Piening, Carlo Bifulco, Matthew P. Lungren, Tristan Naumann, Sheng Wang, Hoifung Poon: BiomedCLIP: A Multimodal Biomedical Foundation Model Pretrained from Fifteen Million Scientific Image-Text Pairs. NEJM AI 2025; 2(1). https://ai.nejm.org/doi/full/10.1056/AIoa2400640
8. Moor, M., Huang, Q., Wu, S., Yasunaga, M., Dalmia, Y., Leskovec, J., Zakka, C., Reis, E.P., Rajpurkar, P.: Med-Flamingo: A Multimodal Medical Few-shot Learner. Machine Learning for Health (ML4H@NeurIPS 2023), Proceedings of Machine Learning Research, vol. 225, pp. 353–367, PMLR, 2023. https://proceedings.mlr.press/v225/moor23a.html
9. Zhang, K., Zhou, R., Adhikarla, E., Yan, Z., Liu, Y., Yu, J., Liu, Z., Chen, X., Davison, B.D., Ren, H., et al.: A Generalist Vision-Language Foundation Model for Diverse Biomedical Tasks. Nature Medicine, 1–13, 2024. https://www.nature.com/articles/s41591-024-03185-2
The Microsoft healthcare AI models are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.