Asma Ben Abacha, Alberto Santamaria-Pang, Wen-wai Yim, Thomas Lin, Ivan Tarapov (Microsoft Health AI)
Introduction
Wound assessment and management are central tasks in clinical practice, requiring accurate documentation and timely decision-making. Clinicians and nurses often rely on visual inspection to evaluate wound characteristics such as size, color, tissue composition, and healing progress. However, when seeking comparable cases (e.g., to inform treatment choices, validate assessments, or support education), existing search methods have significant limitations. Traditional keyword-based systems require precise terminology, which may not align with the way wounds are described in practice. Moreover, textual descriptors cannot fully capture the variability of visual wound features, resulting in incomplete or imprecise retrieval.
Recent advances in computer vision offer new opportunities to address these challenges through both image classification and image retrieval. Automated classification of wound images into clinically meaningful categories (e.g., wound type, tissue condition, infection status) can support standardized documentation and assist clinicians in making more consistent assessments. In parallel, image retrieval systems enable search based on visual similarity rather than textual input alone, allowing clinicians to query databases directly with wound images and retrieve cases with similar characteristics. Together, these AI-based functionalities have the potential to improve case comparison, facilitate consistent monitoring, and enhance clinical training by providing immediate access to relevant examples and structured decision support.
The Data
The WoundcareVQA dataset is a new multimodal multilingual dataset for Wound Care Visual Question Answering.
- The WoundcareVQA dataset is available at https://osf.io/xsj5u/ [1]
Table 1 summarizes dataset statistics. WoundcareVQA contains 748 images associated with 447 instances (each instance/query includes one or more images). The dataset is split into training (279 instances, 449 images), validation (105 instances, 147 images), and test (93 instances, 152 images). The training set was annotated by a single expert, the validation set by two annotators, and the test set by three medical doctors. Each query is also labeled with wound metadata, covering seven categories: anatomic location (41 classes), wound type (8), wound thickness (6), tissue color (6), drainage amount (6), drainage type (5), and infection status (3).
Table 1: Statistics about the WoundcareVQA Dataset
We selected two tasks with the highest inter-annotator agreement: Wound Type Classification and Infection Detection (cf. Table 2). Table 3 lists the classification labels for these tasks.
Table 2: Inter-Annotator Agreement in the WoundcareVQA Dataset
Table 3: Classification Labels for the Tasks: Infection Detection & Wound Type Classification
Methods
1. Foundation-Model-based Image Search
This approach relies on an image similarity-based retrieval mechanism using a medical foundation model, MedImageInsight [2-3]. Specifically, it employs a k-nearest neighbors (k-NN) search to identify the top k training images most visually similar to a given query image. The image search system operates in two phases:
- Index Construction: Embeddings are extracted from all training images using a pretrained vision encoder (MedImageInsight). These embeddings are then indexed to enable efficient and scalable similarity search during retrieval.
- Query and Retrieval: At inference time, the test image is encoded to produce a query embedding. The system computes the Euclidean distances between this query vector and all indexed embeddings, retrieving the k nearest neighbors with the smallest distances.
To address the computational demands of large-scale image datasets, the method leverages FAISS (Facebook AI Similarity Search), an open-source library designed for fast and scalable similarity search and clustering of high-dimensional vectors.
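As an illustration, the following minimal sketch shows how such an index can be built and queried with FAISS, assuming the MedImageInsight embeddings have already been computed (for example, through an Azure AI Foundry deployment) and saved as NumPy arrays; the file names, array names, and label format are placeholders, not the exact implementation used in our experiments.

```python
# Minimal sketch of the two-phase image search with FAISS.
# Assumes MedImageInsight embeddings were precomputed and stored as NumPy arrays;
# file and variable names below are illustrative placeholders.
import faiss
import numpy as np

# --- Index construction ---
# train_embeddings: (n_train, d) float32 matrix of MedImageInsight embeddings
train_embeddings = np.load("train_embeddings.npy").astype("float32")
train_labels = np.load("train_labels.npy", allow_pickle=True)  # e.g., wound type per image

d = train_embeddings.shape[1]
index = faiss.IndexFlatL2(d)   # exact Euclidean (L2) similarity search
index.add(train_embeddings)    # index all training embeddings

# --- Query and retrieval ---
# query_embedding: (1, d) MedImageInsight embedding of the test image
query_embedding = np.load("query_embedding.npy").astype("float32").reshape(1, -1)

k = 5
distances, indices = index.search(query_embedding, k)     # k nearest neighbors
retrieved_labels = [train_labels[i] for i in indices[0]]  # labels of similar cases
print(retrieved_labels)
```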
2. Vision-Language Models (VLMs) & Retrieval-Augmented Generation (RAG)
We leverage vision-language models (e.g., GPT-4o, GPT-4.1), a recent class of multimodal foundation models capable of jointly reasoning over visual and textual inputs. These models can be used for wound assessment tasks due to their ability to interpret complex visual patterns in medical images while simultaneously understanding medical terminology. We evaluate three settings:
- Zero-shot: The model predicts directly from the query input without additional examples.
- Few-shot Prompting: A small number of examples (5) from the training dataset are randomly selected and embedded into the input prompt. These paired images and labels provide contextual cues that guide the model's interpretation of new inputs.
- Retrieval-Augmented Generation (RAG): The system first retrieves the Top-k visually similar wound images using the MedImageInsight-based image search described above. The language model then reasons over the retrieved examples and their labels to generate the final prediction.
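The snippet below is a simplified sketch of the RAG setting, assuming the Top-k neighbors (images and labels) have already been retrieved with the FAISS-based search above and that a GPT-4.1 deployment is available through Azure OpenAI. The deployment name, environment variables, prompt wording, and helper names are illustrative assumptions, not the exact prompts used in our experiments.

```python
# Minimal RAG sketch: retrieved similar cases + labels are passed to a
# vision-language model together with the query image.
# Deployment name, environment variables, and file names are assumptions.
import base64
import os
from openai import AzureOpenAI

client = AzureOpenAI(
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
)

def encode_image(path: str) -> str:
    """Return a base64 data URL for a local wound image."""
    with open(path, "rb") as f:
        return "data:image/jpeg;base64," + base64.b64encode(f.read()).decode()

# retrieved_examples: (image_path, label) pairs from the MedImageInsight + FAISS
# search (Top-5 most similar training images); values below are placeholders.
retrieved_examples = [("train_img_12.jpg", "No infection"),
                      ("train_img_87.jpg", "Infection")]

content = [{"type": "text",
            "text": "Classify the last wound image as 'Infection', 'No infection', "
                    "or 'Cannot be determined'. Use the labeled similar cases as context."}]
for path, label in retrieved_examples:
    content.append({"type": "image_url", "image_url": {"url": encode_image(path)}})
    content.append({"type": "text", "text": f"Label of the case above: {label}"})
content.append({"type": "image_url", "image_url": {"url": encode_image("query_wound.jpg")}})

response = client.chat.completions.create(
    model="gpt-4.1",  # name of your Azure OpenAI deployment (assumption)
    messages=[{"role": "user", "content": content}],
)
print(response.choices[0].message.content)
```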
The implementation of the MedImageInsight-based image search and the RAG method for the infection detection task is available in our Samples Repository.
Evaluation
We computed accuracy scores to evaluate the image search methods (Top-1 and Top-5 with majority vote), the GPT-4o and GPT-4.1 models (zero-shot), as well as the 5-shot and RAG-based methods. Table 4 reports accuracy for wound type classification and infection detection. Figure 1 presents examples of correct and incorrect predictions.
| Task | Image Search Top-1 | Image Search Top-5 + majority vote | GPT-4o (2023-07-01) | GPT-4o (2024-11-20) | GPT-4.1 (2025-04-14) | GPT-4.1 5-shot Prompting | GPT-4.1 RAG-5 |
|---|---|---|---|---|---|---|---|
| Wound Type | 0.7933 | 0.8333 | 0.4671 | 0.4803 | 0.5066 | 0.6118 | 0.7533 |
| Infection | 0.6800 | 0.7267 | 0.3947 | 0.3882 | 0.3750 | 0.7237 | 0.7600 |
Table 4: Accuracy Scores for Wound Type Classification & Infection Detection
Figure 1: Examples of Correct and Incorrect Predictions
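For reference, the Top-5 + majority vote scores in Table 4 can be computed with a simple voting routine over the labels of the retrieved neighbors. The sketch below assumes the neighbor labels and gold labels have already been collected for each test image; variable names are illustrative.

```python
# Minimal sketch of the Top-5 + majority vote evaluation.
# Assumes the 5 nearest-neighbor labels per test image (from the FAISS search)
# and the gold labels are available; names are illustrative placeholders.
from collections import Counter

def majority_vote(neighbor_labels):
    """Return the most frequent label among the retrieved neighbors."""
    return Counter(neighbor_labels).most_common(1)[0][0]

def top5_majority_accuracy(all_neighbor_labels, gold_labels):
    """Accuracy of majority-voted predictions against gold labels."""
    predictions = [majority_vote(labels) for labels in all_neighbor_labels]
    correct = sum(p == g for p, g in zip(predictions, gold_labels))
    return correct / len(gold_labels)

# Example (placeholder data):
# all_neighbor_labels = [["Infection", "Infection", "No infection", "Infection", "Infection"], ...]
# gold_labels = ["Infection", ...]
# print(top5_majority_accuracy(all_neighbor_labels, gold_labels))
```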
For wound type classification, image search with MedImageInsight embeddings performs best, achieving 0.7933 (Top-1) and 0.8333 (Top-5 + majority vote). GPT models alone perform substantially worse (0.4671-0.6118), while GPT-4.1 with retrieval augmentation (RAG-5), which uses the same MedImageInsight-based image search method to retrieve the Top-5 similar cases, narrows the gap (0.7533) but does not surpass direct image search. This suggests that categorical wound type is more effectively captured by visual similarity than by case-based reasoning with vision-language models.
For infection detection, the trend reverses. Image search reaches 0.7267 (Top-5 + majority vote), while RAG-5 achieves the highest accuracy at 0.7600. In this case, the combination of visually similar cases with VLM-based reasoning outperforms both standalone image search and GPT prompting. This indicates that infection assessment depends on contextual or clinical cues that may not be fully captured by visual similarity alone but can be better interpreted when enriched with contextual reasoning over retrieved cases and their associated labels.
Overall, these findings highlight complementary strengths: foundation-model-based image search excels at categorical visual classification (wound type), while retrieval-augmented VLMs leverage both visual similarity and contextual reasoning to improve performance on more nuanced tasks (infection detection). A hybrid system integrating both approaches may provide the most robust clinical support.
Conclusion
This study demonstrates the complementary roles of foundation-model-based image search and vision-language models in wound assessment. Image search using foundation-model embeddings shows strong performance on categorical tasks such as wound type classification, where visual similarity is most informative. In contrast, retrieval-augmented generation (RAG-5), which combines image search with case-based reasoning by a vision-language model, achieves the best results for infection detection, highlighting the value of integrating contextual interpretation with visual features. These findings suggest that a hybrid approach, leveraging both direct image similarity and retrieval-augmented reasoning, provides the most robust pathway for clinical decision support in wound care.
References
1. Wen-wai Yim, Asma Ben Abacha, Robert Doerning, Chia-Yu Chen, Jiaying Xu, Anita Subbarao, Zixuan Yu, Fei Xia, M. Kennedy Hall, Meliha Yetisgen. WoundcareVQA: A Multilingual Visual Question Answering Benchmark Dataset for Wound Care. Journal of Biomedical Informatics, 2025.
2. Noel C. F. Codella, Ying Jin, Shrey Jain, Yu Gu, Ho Hin Lee, Asma Ben Abacha, Alberto Santamaría-Pang, Will Guyman, Naiteek Sangani, Sheng Zhang, Hoifung Poon, Stephanie L. Hyland, Shruthi Bannur, Javier Alvarez-Valle, Xue Li, John Garrett, Alan McMillan, Gaurav Rajguru, Madhu Maddi, Nilesh Vijayrania, Rehaan Bhimai, Nick Mecklenburg, Rupal Jain, Daniel Holstein, Naveen Gaur, Vijay Aski, Jenq-Neng Hwang, Thomas Lin, Ivan Tarapov, Matthew P. Lungren, Mu Wei. MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging. CoRR abs/2410.06542 (2024).
3. Model catalog and collections in Azure AI Foundry portal. https://learn.microsoft.com/en-us/azure/ai-studio/how-to/model-catalog-overview
Image Search Series
- Image Search Series Part 1: Chest X-ray lookup with MedImageInsight | Microsoft Community Hub
- Image Search Series Part 2: AI Methods for the Automation of 3D Image Retrieval in Radiology | Microsoft Community Hub
- Image Search Series Part 3: Foundation Models and Retrieval-Augmented Generation in Dermatology | Microsoft Community Hub
- Image Search Series Part 4: Advancing Wound Care with Foundation Models and Context-Aware Retrieval | Microsoft Community Hub