Healthcare and Life Sciences Blog

Image Search Series Part 1: Chest X-ray lookup with MedImageInsight

Jan 31, 2025

@Alberto Santamaria-Pang, Principal Applied Data Scientist, HLS AI and Adjunct Faculty at Johns Hopkins Medicine

@Asma Ben Abacha, Senior Applied Scientist, HLS AI

@Peter Lee, Applied Scientist, HLS AI and Adjunct Assistant Professor at Vanderbilt University

@Alexander Mehmet Ersoy, Principal Product Manager, HLS AI

@Ivan Tarapov, Group Manager, HLS AI

Introduction

Accurate and efficient interpretation of medical images is crucial for timely diagnoses and effective treatment. Radiologists often compare current scans with prior cases, a process that can be time-consuming and prone to variability. Moreover, metadata in DICOM studies often lacks consistency across different institutions, vendors, and even among technicians conducting the imaging study. Performing imaging searches based on metadata—the traditional method in medical image storage systems—inevitably leads to errors and slow, inefficient search operations. Incorporating image lookup operations based on pixel data can complement and improve the traditional metadata-only approach, as pixel-based searches can deliver superior results in certain cases. Pixel-based search systems help solve this challenge by enabling rapid retrieval of similar images from large datasets, enhancing diagnostic accuracy and supporting rare disease detection.

This blog opens a series of articles that will highlight different opportunities to test facets of pixel-based search approaches in the space of medical imaging.

Specifically, we demonstrate how a researcher can build, test, and validate a 2D image search system for chest X-rays using MedImageInsight (MI2), a foundation healthcare AI model from Microsoft, and FAISS, an open-source library from Meta that enables efficient similarity search and clustering of dense vectors. We begin by creating a baseline image search engine using MI2 embeddings (1024 dimensions) indexed with FAISS. Next, we improve the test system by training adapters—simple classifier networks—to generate a more refined representation (254 dimensions) of the dataset, rather than using the embeddings for classification directly (Figure 1). For evaluation, we use a small dataset of 100 2D chest X-ray DICOM images, each labeled with a single pathology class: No Findings, Devices and Surgical Hardware, Pleural Effusion, Cardiomegaly, or Atelectasis. While the dataset is small, the methods shown can be scaled to larger datasets and adapted for testing and evaluating the more complex, multi-label scenarios commonly encountered in clinical practice. Sample code is available at the following link: https://aka.ms/healthcare-ai-examples-mi2-2d-image-search.

Figure 1. Overview of the image search engine.

Building the Image Search Engine

FAISS (Facebook AI Similarity Search) [2] is optimized to handle large-scale datasets, offering both exact and approximate nearest neighbor searches to balance speed and precision. In this tutorial, we use MedImageInsight (MI2) to generate image embeddings where similarity is encoded in the learned representation of the model. This representation captures the semantic features of the images. We then build an index using FAISS, which allows us to perform a vector search by providing a query—an image in this case. The task is to retrieve the most similar images based on their encoded semantic embeddings.

To build a baseline 2D image search engine, we select 80 chest X-ray images as a training dataset and generate an MI2 embedding for each. Next, we organize these embeddings into a FAISS index, which enables fast and efficient similarity searches. For this tutorial, we use a flat index (IndexFlatL2) that computes the exact nearest neighbors by comparing the embeddings using Euclidean distance. This ensures high precision in the results, though it's most suitable for smaller datasets. For larger datasets, FAISS also offers more advanced indexing options like HNSW or IVFPQ, which balance speed and memory efficiency by approximating similarity searches.

Once the FAISS index is built, we can perform a search query by providing a new query embedding—representing a chest X-ray image. The index will retrieve the K images with the highest similarity based on their embeddings. This process allows for rapid identification of cases with anatomical and pathological features that closely resemble those of the query image.

Optimizing the Search

Building a baseline search engine provides a solid foundation, but what is the system actually searching for? How do you specify the criteria the search should use? Are you looking for similar anatomy? Similar pathologies present? Similar imaging protocols?

If you use the MI2 model out of the box, it will use its encoded representation to determine similarity, which captures many morphological features with some bias towards pathology. But if you want to optimize the search to focus on something in particular, the next thing to try is to create an adapter model. We have published a blog on adapters in the past [1], and in the provided sample we apply the same pattern to improve matching based on pathology presence.

This adapter enhances the original embeddings to better capture the specific features of the pathology classes in our dataset, improving the system’s ability to identify relevant images based on their diagnostic features.

The adapter is a lightweight neural network trained on a small, labeled subset of the chest X-ray dataset. By adjusting the embeddings, the adapter enables the model to distinguish more clearly between pathology classes, aligning the embeddings more closely with the clinical context. Once the adapter is trained, we use it to generate a new set of optimized embeddings for the entire dataset. Note that since the adapter is essentially a tool that transforms embeddings from one space to another, you don't need to recreate embeddings from scratch from the original images, which can be a time-consuming procedure, especially on large image databases. You can generate the new set of embeddings directly from the old embeddings.
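A minimal version of this idea can be sketched as a one-hidden-layer network trained with a softmax head, where the 254-dimensional hidden activation serves as the adapted embedding. This is an illustrative NumPy sketch, not the sample repository's adapter: the embeddings and labels are random stand-ins, and the real training loop, architecture, and hyperparameters may differ.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d_in, d_out, n_classes = 80, 1024, 254, 5

# Stand-ins for stored MI2 embeddings and their pathology labels.
X = rng.standard_normal((n, d_in)).astype("float32")
y = rng.integers(0, n_classes, size=n)

# Adapter: 1024 -> 254 hidden layer, plus a 254 -> 5 softmax classifier head.
W1 = (rng.standard_normal((d_in, d_out)) * 0.02).astype("float32")
b1 = np.zeros(d_out, dtype="float32")
W2 = (rng.standard_normal((d_out, n_classes)) * 0.02).astype("float32")
b2 = np.zeros(n_classes, dtype="float32")

lr = 0.1
for epoch in range(15):
    h = np.maximum(X @ W1 + b1, 0.0)  # ReLU hidden layer = adapted embedding
    logits = h @ W2 + b2
    p = np.exp(logits - logits.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)  # softmax probabilities
    # Cross-entropy gradient, backpropagated through both layers.
    grad_logits = p.copy()
    grad_logits[np.arange(n), y] -= 1.0
    grad_logits /= n
    grad_h = grad_logits @ W2.T
    grad_h[h <= 0] = 0.0
    W2 -= lr * (h.T @ grad_logits)
    b2 -= lr * grad_logits.sum(axis=0)
    W1 -= lr * (X.T @ grad_h)
    b1 -= lr * grad_h.sum(axis=0)

# Re-embed the whole library from the *old* embeddings -- no need to
# re-run MI2 on the original images.
adapted = np.maximum(X @ W1 + b1, 0.0).astype("float32")
print(adapted.shape)  # (80, 254)
```

The key point is the last two lines: the adapter consumes stored embeddings, so refreshing the search index after training touches only vectors, never pixels.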

With the optimized embeddings in hand, we rebuild the FAISS index using this refined data. This new index is designed to outperform the baseline search engine by retrieving more clinically relevant images when given a query. As a result, the optimized search engine is more effective at identifying images with diagnostic features closely matching the query image, improving both precision and clinical utility.

Results

To evaluate the performance of our test search engine, we used metrics such as precision @1, @3, and @5, which measure how often the correct pathology class appears among the top K retrieved images. Tables 1 and 2 below summarize the accuracy at different top-K values for both the baseline and optimized search engines.
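Precision @K is straightforward to compute once the retrieved images' labels are known. The helper below is a small illustrative sketch (the function name and example labels are our own, not from the sample repository):

```python
def precision_at_k(query_label, retrieved_labels, k):
    """Fraction of the top-k retrieved images whose label matches the query's."""
    top = retrieved_labels[:k]
    return sum(1 for label in top if label == query_label) / k

# Hypothetical retrieval result for a "Cardiomegaly" query, best match first.
retrieved = ["Cardiomegaly", "Cardiomegaly", "Pleural Effusion",
             "Cardiomegaly", "No Finding"]

print(precision_at_k("Cardiomegaly", retrieved, 1))  # top hit matches -> 1.0
print(precision_at_k("Cardiomegaly", retrieved, 3))  # 2 of top 3 match
print(precision_at_k("Cardiomegaly", retrieved, 5))  # 3 of top 5 match -> 0.6
```

Averaging this quantity over all query images in a class gives the per-category numbers in Table 1; averaging over the whole test set gives the overall numbers in Table 2.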

Table 1: Baseline vs. Optimized Search Engine Accuracy
This table compares the accuracy at top-K values (k=1, k=3, k=5) for both the baseline and optimized search engines.

| Category | Accuracy @ k=1 (Baseline) | Accuracy @ k=3 (Baseline) | Accuracy @ k=5 (Baseline) | Accuracy @ k=1 (Optimized) | Accuracy @ k=3 (Optimized) | Accuracy @ k=5 (Optimized) |
|---|---|---|---|---|---|---|
| No Finding | 0.250 | 0.583 | 0.500 | 0.500 | 0.666 | 0.650 |
| Devices and Surgical Hardware | 0.200 | 0.200 | 0.160 | 0.800 | 0.800 | 0.760 |
| Pleural Effusion | 0.333 | 0.389 | 0.300 | 0.666 | 0.611 | 0.566 |
| Cardiomegaly | 0.667 | 0.778 | 0.600 | 1.000 | 0.888 | 0.800 |
| Atelectasis | 1.000 | 0.833 | 0.700 | 1.000 | 1.000 | 1.000 |
 

Table 2: Overall Accuracy Comparison
This table summarizes the overall accuracy of the baseline and optimized search engines at different top-K values (k=1, k=3, k=5).

| Top-K | Overall Accuracy (Baseline) | Overall Accuracy (Optimized) |
|---|---|---|
| 1 | 0.490 | 0.793 |
| 3 | 0.556 | 0.793 |
| 5 | 0.452 | 0.755 |

 

The adapter model was trained for 15 epochs in approximately 30 seconds on a Standard_E4s_v3 CPU VM (4 cores, 32 GB RAM, 64 GB disk) and achieved a best accuracy of 0.75 and a best AUC of 0.9429. It's important to note that the accuracy at k=1 aligns closely with the overall accuracy of the classifier. This is because the model applies a softmax transformation during training, which normalizes the output probabilities for each class. As a result, the top prediction (at k=1) is highly reflective of the classifier's general performance.

The optimized search engine demonstrates a significant improvement in performance over the baseline. The overall accuracy for the optimized version increased dramatically, with precision @1 improving from 0.490 to 0.793. Figure 2 presents qualitative results: the query image (left column) for each pathology category (No Finding, Devices and Surgical Hardware, Pleural Effusion, Cardiomegaly) along with the top 3 retrieved images (right column), ranked by their similarity to the query, with the rank and category of each retrieval displayed above each image. This demonstrates the search engine's ability to retrieve similar images based on the encoded anatomical and pathological features of the query image.

Figure 2. Query images for each pathology category alongside the top 3 most similar retrieved images, ranked by relevance.

 

Note that in this simplified example, we treat these pathologies as mutually exclusive classes, assuming there is a single group that a given image is closest to. In the real world this is not the case with chest X-rays and these conditions. You can have cases where devices and surgical hardware are present along with cardiomegaly (in fact, you can see some of these in the provided image examples), and your lookup system would need to handle such scenarios depending on the business problem it is solving. We will look into some of these scenarios in future posts in the series!

Conclusion

This tutorial demonstrated how to build, test, and optimize a 2D image search system for chest X-rays using MedImageInsight (MI2) and FAISS. By refining embeddings with an adapter model, we significantly improved search accuracy, enhancing the retrieval of clinically relevant images. Techniques like this enable running pixel search queries over imaging repositories without requiring massive effort, and more importantly, without the prerequisite of metadata normalization and correction, allowing for more efficient data management systems. These methods offer the potential for a scalable approach to integrating AI-powered image search in healthcare, supporting faster and more accurate diagnostics. The results highlight the opportunity for researchers to further explore the potential of using AI to improve clinical workflows through optimized image search systems.

 

The Microsoft healthcare AI models are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

 

References

  1. Adapters: Unlocking the Magic of Embedding Models: Practical Patterns for Healthcare AI
  2. FAISS: facebookresearch/faiss: A library for efficient similarity search and clustering of dense vectors.
  3. Sample code: https://aka.ms/healthcare-ai-examples-mi2-2d-image-search
  4. MedImageInsight: https://aka.ms/mi2modelcard
  5. Prov-GigaPath: https://aka.ms/provgigapathmodelcard
  6. Health and Life Sciences Azure AI Foundry Model Catalog: https://aka.ms/health-life-sciences
  7. MedImageInsight Paper: https://arxiv.org/abs/2410.06542

Updated Feb 10, 2025
Version 6.0