
Healthcare and Life Sciences Blog

Medical imaging embeddings bake-off: How Microsoft’s MedImageInsight compares to leading models

jamesonmerkow
Nov 26, 2025

Foundation models allow healthcare AI solutions to be deployed in minutes, not months. Microsoft's MedImageInsight, available in the Microsoft Foundry model catalog, is a leading example, and we evaluated its performance against six competing models using real clinical data in collaboration with the University of Wisconsin–Madison. Welcome to the bake-off. Let's see who takes home star baker!

Note: This post reports performance results on a specific evaluation dataset collected from real clinical data. These results do not constitute general clinical performance claims. This post shares methods and principles that can guide clinical AI development but are not meant for direct use in clinical settings without validation. You must validate performance on your own data and ensure models fit within your regulatory processes. You are fully responsible for verifying results, ensuring compliance with laws, and obtaining any needed clearance or approvals for clinical use.

Introduction

Medical imaging volumes continue to grow exponentially while radiologist capacity remains constrained. Hospitals generate thousands of X-rays, CT scans, and MRIs daily that require expert interpretation, but healthcare systems lack sufficient trained specialists to meet demand. Traditional AI approaches promised relief but created new challenges: each diagnostic task requires months of custom model development, specialized expertise, and substantial computational resources. These requirements represent precisely the bottlenecks AI was meant to solve.

Foundation models represent a paradigm shift in AI development. These models are trained on massive datasets to learn broad, transferable knowledge that serves as a starting point for diverse downstream tasks. Rather than building models from scratch, practitioners can adapt foundation models to new applications through lightweight adapter training or minimal fine-tuning. This approach dramatically reduces development time and computational requirements.

Embedding foundation models take this concept further by distilling complex inputs into rich vector representations called embeddings. These models excel at encoding the essential patterns and relationships within data into dense numerical vectors that capture semantic meaning. In medical imaging, embedding foundation models learn to compress X-rays, CT scans, and MRIs into vectors that encode clinical patterns, anatomical structures, and diagnostic features. Once trained, these embeddings become versatile building blocks that can be rapidly adapted to new diagnostic tasks with minimal additional data or computational overhead. The same foundation model that learns to encode pneumonia patterns can quickly adapt its embeddings for fracture detection, device placement verification, or abnormalities in entirely different imaging modalities.

The embedding approach offers compelling practical advantages for clinical deployment:

  • Training speed: Classifier adapters typically train in under one minute on standard CPU hardware
  • Resource requirements: No GPU infrastructure needed for deployment, reducing cost barriers significantly
  • Inference speed: Predictions generated within seconds, meeting clinical workflow requirements
  • Deployment flexibility: Same embeddings can support multiple clinical tasks without retraining the foundation model

These advantages make foundation model embeddings particularly attractive for healthcare settings. However, realizing these benefits depends critically on the quality of the underlying embeddings – the foundation model must create sufficiently descriptive representations to enable high-performance classification with simple downstream models. This represents a shift from training narrow specialist models to developing adaptable generalists. Foundation models can help create reusable medical imaging intelligence that healthcare systems can deploy across departments and diagnostic challenges, dramatically reducing development time and computational requirements while maintaining clinical accuracy. Instead of developing 10, 20, or even 100 separate diagnostic models, healthcare systems can deploy one foundation model plus an array of lightweight adapters, each trainable in minutes. This fundamentally shifts how medical AI systems are built and deployed.
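To make the adapter idea concrete, here is a minimal sketch (Python with scikit-learn) of training one of these lightweight classifiers once embeddings have been extracted. The file names and the 1024-dimension shape are placeholders for whatever your extraction step produces.

```python
# Minimal sketch: train a lightweight adapter on precomputed embeddings.
# File names and shapes are hypothetical placeholders; the embeddings are
# assumed to have been extracted already with a foundation model.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

X_train = np.load("train_embeddings.npy")  # e.g., shape (n_train, 1024)
y_train = np.load("train_labels.npy")      # integer diagnostic labels
X_test = np.load("test_embeddings.npy")
y_test = np.load("test_labels.npy")

# A simple linear adapter typically trains in seconds on a CPU.
adapter = LogisticRegression(max_iter=1000)
adapter.fit(X_train, y_train)

# Macro-averaged, one-vs-rest AUC across diagnostic categories (mAUC).
probs = adapter.predict_proba(X_test)
print("mAUC:", roc_auc_score(y_test, probs, multi_class="ovr", average="macro"))
```

Swapping in a different downstream classifier, or a different set of embeddings, changes only a couple of lines, which is what makes the one-model-plus-many-adapters pattern practical.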

Microsoft’s MedImageInsight (MI2) delivers on this vision at scale. Trained on 3.7 million medical images spanning 14 different domains (from X-rays and CT scans to MRIs and dermoscopy), MI2 encodes the full spectrum of medical imaging into rich vector representations that capture both anatomical structures and clinical patterns suitable for many applications. It can support zero-shot classification of medical conditions, outlier detection, automated exam parameter detection, multimodal analysis, and sophisticated image search systems for both 2D and 3D medical imaging. You can explore our healthcare AI examples repository, which demonstrates different applications through design templates and implementation examples.

While the original MI2 research paper demonstrated strong performance on curated public datasets, critical questions remained unanswered for clinical deployment:

  • How does MI2 perform on messy clinical data?
  • How does it compare against other leading foundation models?
  • Which classifier approaches work best when paired with model embeddings?

In collaboration with the University of Wisconsin–Madison, we conducted a study to answer these questions. The study has been presented at the Society for Imaging Informatics in Medicine (SIIM) conference and has been accepted for publication in the Journal of Imaging Informatics in Medicine (JIIM). A pre-print is available on arXiv: From Embeddings to Accuracy: Comparing Foundation Models for Radiographic Classification.

The Bake-off rules and metrics

Different models produce different quality embeddings for medical imaging tasks. To determine which models work best, we need to evaluate them using multiple downstream adapters that are tuned with data that simulate authentic clinical conditions. Poor embeddings lead to weak classifiers, regardless of the downstream approach. High-quality embeddings enable simple classifiers to achieve excellent performance with minimal training data and computational overhead.

This study evaluates embeddings from different models by training lightweight adapter classifiers on a multi-class classification task and compares their performance against traditional end-to-end trained convolutional neural networks to establish a baseline.

Though this is a research-driven evaluation, the methodology demonstrates how the models can be assessed for your own clinical deployment needs. Our evaluation protocol ensures fair comparison across all models using three stages:

  1. Embedding extraction: Each model generated vector representations for images using identical preprocessing.
  2. Classifier training and optimization: Five different classifiers (K-Nearest Neighbors, Logistic Regression, Support Vector Machines, Random Forest, and Multi-Layer Perceptron) were trained on the training set embeddings. We used the validation set to find optimal hyperparameters for each classifier through grid search (a code sketch of this step follows Figure 1).
  3. Evaluation: We measured final performance on the held-out test set using mean Area Under the Curve (mAUC) averaged across all seven diagnostic categories as our primary benchmark metric. Statistical validation used 5-fold cross-validation, with each fold using one subset as test data and four for training/validation. (See paper for detailed metrics and additional analysis).

 

Figure 1 - Two-phase radiograph classification: embedding extraction using pre-trained foundation models, followed by lightweight adapter training for prediction.
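For readers who want to see the classifier-selection stage in code, here is an illustrative sketch using scikit-learn. The hyperparameter grids are examples rather than the study's exact settings, and for brevity it relies on GridSearchCV's internal cross-validation instead of a separate validation split.

```python
# Illustrative sketch of stage 2: grid-search the five classifier families
# over precomputed embeddings. Grids and file names are examples only.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.neighbors import KNeighborsClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neural_network import MLPClassifier

X_train = np.load("train_embeddings.npy")  # hypothetical file names
y_train = np.load("train_labels.npy")

candidates = {
    "knn": (KNeighborsClassifier(), {"n_neighbors": [5, 15, 31]}),
    "logreg": (LogisticRegression(max_iter=2000), {"C": [0.01, 0.1, 1, 10]}),
    "svm": (SVC(probability=True), {"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]}),
    "rf": (RandomForestClassifier(), {"n_estimators": [200, 500]}),
    "mlp": (MLPClassifier(max_iter=500), {"hidden_layer_sizes": [(256,), (512, 256)]}),
}

best = {}
for name, (estimator, grid) in candidates.items():
    # One-vs-rest ROC AUC matches the mAUC benchmark metric described above.
    search = GridSearchCV(estimator, grid, scoring="roc_auc_ovr", cv=5, n_jobs=-1)
    search.fit(X_train, y_train)
    best[name] = (search.best_score_, search.best_params_)
    print(name, best[name])
```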

Ingredients: fresh from the field

Rather than relying on curated research datasets, we needed to validate performance under real-world conditions, with all their messy complexity. Thanks to researchers at the University of Wisconsin School of Medicine and Public Health, we were able to evaluate using 8,842 chest radiographs from 7,045 patients across seven diagnostic categories. Cases were identified through keyword searches and validated through joint review by a medical physicist and board-certified radiologist, with adjudicated reports serving as the reference standard. All images were acquired using modern CR (computed radiography) or DR (digital radiography) systems, ensuring contemporary clinical imaging quality. We focused on data that radiologists encounter frequently and that reflects the practical challenges of clinical deployment, such as imaging artifacts, varying scanner manufacturers and image quality, diverse patient demographics, and naturally imbalanced class distributions. This assessment challenges the models’ ability to distinguish between normal anatomy and six distinct pathological or procedural findings.

Table 1 – Statistical summary of diagnostic categories within dataset

| Diagnostic Category | Count | Percentage | Clinical Significance |
| --- | --- | --- | --- |
| Normal Studies | 1,185 | 13.4% | Baseline comparison for healthy anatomy |
| Venous Lines | 2,281 | 25.8% | Central/peripheral line placement verification |
| Nasogastric Tubes | 2,679 | 30.3% | Feeding tube positioning assessment |
| Rib Fractures | 1,202 | 13.6% | Trauma evaluation and emergency diagnosis |
| Endotracheal Tubes | 521 | 5.9% | Critical airway management verification |
| Pneumothorax | 487 | 5.5% | Life-threatening air leak detection |
| Pneumoperitoneum | 487 | 5.5% | Surgical emergency identification |

 

Figure 2 - Overview of the composition of the dataset used in this evaluation.

Let’s meet the competitors

With our testing framework established, we compared MedImageInsight against six models that vary dramatically in scale, specialization, and accessibility.

Table 2 - Models evaluated in this study, showing their approach, architecture, scale, and licensing terms.

| Model | Approach | Architecture | Model Size | Training Data | Publication | License |
| --- | --- | --- | --- | --- | --- | --- |
| DenseNet121 | Baseline | CNN | Small (8.0M) | – | Baseline comparison | Apache 2.0 |
| BiomedCLIP | Vision-language learning on scientific literature | PubMedBERT + ViT-B/16 | Medium (~224M) | 15M image-text pairs from 4.4M PubMed articles | MICCAI 2023 | MIT |
| Rad-DINO | Self-supervised chest X-ray specialization | Vision Transformer (DINOv2) | Small (86.6M) | 882,775 chest X-rays from 5 datasets | MICCAI 2024 | MSRLA |
| CXR-Foundation | Vision-language with clinical supervision | EfficientNet-L2 + BERT | Medium (~480M) | 821,544 chest X-rays (India + US) | RSNA 2022 | Health AI Developer Foundations |
| MedSigLIP | Vision-language on medical imaging | SigLIP architecture | Medium (~1.15B, 1152-dim embeddings) | Medical imaging datasets | MedGemma 2025 | Google Terms |
| Med-Flamingo | Generative multimodal with medical textbooks | CLIP ViT-L/14 + LLaMA-7B | Large (8.6B total: 428M CLIP ViT-L/14, 7B text) | 4,721 medical textbooks + PMC-OA paired datasets | ML4H 2023 | CC BY-NC-SA 4.0 (non-commercial) |
| MedImageInsight | Cross-domain medical vision-language | DaViT + text encoder | Medium (0.61B: 360M image, 252M text) | 3.7M+ medical images with text/labels across 14 domains (~500K chest X-rays) | Nature Medicine 2024 | MIT |

 

The pre-foundation era contenders

To start, we have DenseNet121 with 8M parameters. It represents a lightweight baseline model that takes a traditional approach: a CNN trained directly on medical images. It's a simple, open-source (Apache 2.0) model that has been a reliable baseline for years.

Next, we have Rad-DINO and CXR-Foundation, mid-scale specialists trained exclusively on chest X-rays. Rad-DINO is a small model with only 86.6M parameters and was trained on ~900k chest radiographs using self-supervised learning.  Rad-DINO is available under the Microsoft Research License Agreement (MSRLA). CXR-Foundation is on the larger end of mid-sized models with 480M parameters, trained with supervision from multiple datasets and released under the Health AI Developer Foundations license. These models focus deeply on chest X-ray patterns but lack the broader medical imaging understanding of multi-domain approaches.

Last of our "traditional" models is BiomedCLIP, which sits in size between Rad-DINO and CXR-Foundation at 224M parameters. BiomedCLIP learned from scientific literature rather than raw clinical imaging data, training on approximately 15M image-text pairs from 4.4M PubMed articles to bring vision and language together. Released under the MIT license, it represents a different approach to building medical imaging understanding.

Big tech’s foundation finalists

Meta, Google and Microsoft have each taken their own approach to medical imaging foundation models, balancing model scale, computational requirements, and licensing differently.

At 8.6B parameters, Med-Flamingo represents large-scale multimodal reasoning. Trained on 4700+ medical textbooks with image-text pairs, it combines a CLIP ViT-L/14 vision encoder (428M parameters) with a LLaMA-7B language model. Its substantial size enables sophisticated reasoning but requires significant computational resources. It is released under the CC BY-NC-SA 4.0 license, which explicitly prohibits commercial use, including derivatives.

With 1.15B parameters, MedSigLIP scales up the cross-modal approach, learning from medical imaging datasets across multiple domains. At this size, it can encode nuanced visual-language relationships. Notably, MedSigLIP serves as the vision encoder in MedGemma, a vision-language large model (VLLM) that connects MedSigLIP's visual capabilities to the Gemma large language model (LLM). This pairing enables MedGemma to interpret complex medical images and generate detailed clinical narratives. MedSigLIP uses Gemma Terms licensing, which includes use restrictions and a prohibited use policy.

MedImageInsight is the most efficient of the three foundation models with 610M parameters (360M for image encoding, 252M for text). Its training data included 3.7 million clinical images from 14 medical domains, such as radiology, pathology, dermatology, and ophthalmology, each paired with corresponding reports and labels. This cross-domain training enables broad medical understanding while maintaining computational efficiency. MedImageInsight is released under the permissive MIT license, allowing users to modify, fine-tune, and commercialize without restrictions.

Step into the tent with your own data

We’ve created a notebook that replicates this benchmark study, allowing you to test MedImageInsight against other foundation models using your own medical imaging data. The Bake-Off Notebook contains:

  • Embedding extraction: Generate embeddings using MI2 and other models, with support for custom embeddings in a standard format
  • Multi-classifier evaluation: Train and compare K-Nearest Neighbors, Logistic Regression, Support Vector Machines, Random Forest, and Multi-Layer Perceptron approaches
  • Automated hyperparameter optimization: Use scikit-learn’s grid search to find optimal classifier settings for each foundation model
  • Performance visualization: Generate the same performance metrics, comparison charts, and statistical analyses from our study
  • Custom dataset support: Bring your own medical imaging data and diagnostic labels to test foundation model performance on your specific challenges

The notebook provides everything needed to conduct your own model evaluation and determine which approach works best for your specific challenges.
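The notebook's exact input conventions aren't reproduced here, but as a rough illustration of the custom-embeddings path, one simple and common convention is a NumPy matrix of embeddings paired with a label array per model (names below are hypothetical):

```python
# Hypothetical packaging for bringing your own embeddings to an evaluation:
# one embedding matrix plus a label array per model. The random arrays are
# stand-ins for real model outputs and real diagnostic labels.
import numpy as np

embeddings = np.random.rand(500, 1024).astype("float32")  # stand-in outputs
labels = np.random.randint(0, 7, size=500)                # stand-in labels

np.save("my_model_embeddings.npy", embeddings)
np.save("my_model_labels.npy", labels)
```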

MedImageInsight offers broader capabilities demonstrated in our healthcare AI examples repository, including semantic similarity search across medical imaging archives, zero-shot classification using natural language descriptions, multimodal regression combining imaging with clinical text, and outlier detection for quality assurance.
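As a flavor of the zero-shot classification capability, the sketch below shows the general recipe used with image-text embedding models: embed a handful of label prompts, embed the image, and rank labels by cosine similarity. The embed_text and embed_image helpers are hypothetical stand-ins for whatever client you use to call the model; only the similarity logic is the actual technique.

```python
# Zero-shot classification sketch: rank text-prompt embeddings against an
# image embedding by cosine similarity. embed_text/embed_image are
# hypothetical stand-ins (random vectors) -- replace them with real calls
# to the model's text and image encoders.
import numpy as np

def embed_text(prompt: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(prompt)) % (2**32))
    return rng.standard_normal(1024)

def embed_image(path: str) -> np.ndarray:
    rng = np.random.default_rng(abs(hash(path)) % (2**32))
    return rng.standard_normal(1024)

def cosine_similarity(a: np.ndarray, b: np.ndarray) -> np.ndarray:
    a = a / np.linalg.norm(a, axis=-1, keepdims=True)
    b = b / np.linalg.norm(b, axis=-1, keepdims=True)
    return a @ b.T

labels = ["pneumothorax", "rib fracture", "no acute findings"]
text_emb = np.stack([embed_text(f"chest x-ray showing {label}") for label in labels])
image_emb = embed_image("study_001.dcm")[None, :]

scores = cosine_similarity(image_emb, text_emb)[0]
print("Predicted label:", labels[int(np.argmax(scores))])
```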

And the Judges Say…

After evaluating all seven models using five different classifier approaches, the results in Figure 3 paint a clear picture: MI2 achieved 93.1% mAUC, outperforming all competing foundation models with a 2.1 percentage point lead over the next-best performer (Google’s MedSigLIP at 91.0%). Statistical testing using the Wilcoxon signed-rank test confirmed significant differences between MedImageInsight and all other models. Notably, the foundation model approach with lightweight adapters also outperformed a fully fine-tuned DenseNet121 CNN trained end-to-end on the same data, which achieved 87.2% mAUC. This demonstrates that high-quality embeddings paired with simple classifiers can exceed traditional deep learning approaches that require substantially more computational resources and training time. In addition, MedImageInsight demonstrated the lowest variability across test folds (standard deviation 0.3-0.6%), indicating consistent and reliable predictions, while Med-Flamingo showed higher uncertainty (0.9-1.3%). These results validate foundation models’ theoretical advantages under real-world conditions. While traditional approaches require months of custom development for each diagnostic task, foundation models, particularly MI2, deliver superior performance with a fraction of the development time and computational requirements.
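For teams replicating the statistical comparison on their own data, the test itself is a paired Wilcoxon signed-rank test over matched AUC scores (for example, per fold or per class) from two models. The values below are synthetic placeholders, not the study's numbers.

```python
# Paired significance test sketch. The AUC arrays are synthetic placeholders
# for illustration only -- not results from the study.
import numpy as np
from scipy.stats import wilcoxon

auc_model_a = np.array([0.931, 0.924, 0.938, 0.929, 0.933])  # per-fold AUCs, model A
auc_model_b = np.array([0.910, 0.905, 0.915, 0.908, 0.912])  # per-fold AUCs, model B

stat, p_value = wilcoxon(auc_model_a, auc_model_b)
print(f"Wilcoxon statistic={stat:.3f}, p-value={p_value:.4f}")
```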

 

Figure 3 - Performance comparison charts showing mAUC scores across all models and adapter methods.

Table 3 - Summary of the best performance for each model

| Model | Vector Length | Best Classifier | Performance (mAUC %) |
| --- | --- | --- | --- |
| MedImageInsight | 1024 | SVM/MLP | 93.1 |
| MedSigLIP | 1152 | MLP | 91.0 |
| Rad-DINO | 768 | SVM | 90.7 |
| CXR-Foundation | 1024 | LR | 88.6 |
| DenseNet121 (end-to-end) | N/A | End-to-end | 87.2 |
| BiomedCLIP | 512 | SVM | 82.8 |
| DenseNet121 (embeddings) | 1024 | SVM | 81.1 |
| Med-Flamingo | 768 | RF | 78.5 |

Note: For complete results including demographic breakdowns, fairness analysis, computational efficiency metrics, and detailed methodology, see the full research paper.

The results also reveal important insights into downstream classifier selection for foundation model embeddings. Support Vector Machines emerged as the optimal classifier for five out of seven foundation models, challenging conventional wisdom that deep learning classifiers always outperform traditional machine learning approaches. This finding suggests that high-quality foundation model embeddings are already highly descriptive and well-separated, exactly as intended. The t-SNE visualization shown in Figure 4 confirms this separation, showing distinct clustering patterns across diagnostic categories in the embedding space. When embeddings effectively encode essential clinical patterns, simpler classifiers like SVMs can achieve optimal performance by finding clean decision boundaries in the embedding space. This embedding-plus-adapter approach not only matches but surpasses the performance of traditional end-to-end CNN training while offering dramatic advantages in computational efficiency: foundation model adapters train in minutes on CPU, whereas the fully fine-tuned CNN required hours of GPU-intensive training and still delivered lower performance.
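A visualization in the spirit of Figure 4 can be produced for your own embeddings with a few lines of scikit-learn and matplotlib; the parameters below are illustrative defaults rather than the figure's exact settings, and the file names are hypothetical.

```python
# Generic t-SNE projection of embeddings, colored by diagnostic label.
import numpy as np
import matplotlib.pyplot as plt
from sklearn.manifold import TSNE

X_test = np.load("test_embeddings.npy")  # hypothetical file names
y_test = np.load("test_labels.npy")

projected = TSNE(n_components=2, perplexity=30, init="pca",
                 random_state=0).fit_transform(X_test)

plt.figure(figsize=(6, 5))
scatter = plt.scatter(projected[:, 0], projected[:, 1], c=y_test, s=4, cmap="tab10")
plt.legend(*scatter.legend_elements(), title="Class", fontsize=7)
plt.title("t-SNE of test-set embeddings (illustrative)")
plt.tight_layout()
plt.show()
```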

While multiple models demonstrated strong performance, MedImageInsight’s advantages extend beyond raw accuracy. With approximately half the parameters of MedSigLIP (0.61B vs. 1.15B), MedImageInsight achieves superior accuracy while delivering the most consistent results in cross-validation. This consistency and efficiency make MedImageInsight particularly attractive for clinical deployment, where reliable performance is paramount and model size directly impacts infrastructure costs.

The evaluation also demonstrates that cross-domain training creates more generalizable representations than single-modality approaches. While Rad-DINO and CXR-Foundation trained exclusively on chest X-rays, MedImageInsight’s broader training developed richer feature representations that transfer effectively to diverse diagnostic tasks.

 

Figure 4 - Embedding visualization showing how each foundation model clusters similar diagnostic cases. Better separation indicates higher quality embeddings for classification.

The full research paper pre-print provides comprehensive analysis of additional aspects including fairness across demographic groups (gender and age), computational efficiency metrics (training and inference times), detailed statistical validation, and per-class performance breakdowns. These analyses confirm that MedImageInsight maintains both high performance and equity across diverse patient populations.
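If you want to run a similar subgroup check on your own results, one minimal approach is to recompute the macro AUC within each demographic group. The groups array below (for example, sex or age band per study) is an assumption about how your metadata is organized, not part of the published analysis code.

```python
# Sketch: macro one-vs-rest AUC computed separately per demographic subgroup.
import numpy as np
from sklearn.metrics import roc_auc_score

def per_group_mauc(y_true: np.ndarray, y_prob: np.ndarray, groups: np.ndarray) -> dict:
    results = {}
    for g in np.unique(groups):
        mask = groups == g
        # Small subgroups missing a class entirely will raise an error here
        # and may need to be excluded or pooled.
        results[g] = roc_auc_score(y_true[mask], y_prob[mask],
                                   multi_class="ovr", average="macro")
    return results

# Usage (hypothetical variable names):
# per_group_mauc(y_test, adapter.predict_proba(X_test), patient_sex)
```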

Transforming healthcare with foundation models

This evaluation, performed in collaboration with the University of Wisconsin-Madison, validates the promise of foundation models. MedImageInsight outperformed all competing embedding models and traditionally trained CNNs on independently validated clinical data, with statistically significant advantages, consistent performance, and fairness across demographic groups. These models can now move past proof-of-concept experiments with curated datasets to deliver classifiers and findings identification on messy, real-world clinical data.

High performance alone isn't sufficient for clinical deployment. Healthcare systems need models that are computationally feasible, legally deployable, and generalizable across diagnostic challenges. This evaluation reveals how different approaches to foundation model development create fundamentally different deployment tradeoffs. The specialized models (Rad-DINO and CXR-Foundation) nearly matched top performance on chest X-rays, though their narrow training limits their ability to transfer to other imaging modalities, unlike more general-purpose models. The large-scale models deliver sophisticated capabilities but impose practical constraints: MedSigLIP carries more complicated licensing terms that include prohibited uses, and BiomedCLIP's training on curated scientific literature creates a gap between how researchers present examples and how clinicians document actual patient imaging.

MedImageInsight's combination of multi-domain training, computational efficiency, and permissive licensing directly addresses these practical barriers. Trained across 14 medical domains, it creates richer, more generalizable representations than single-modality specialists. As a compact and efficient model, it delivers superior performance while fitting comfortably on standard hardware. Its permissive MIT license allows you to use, modify, fine-tune, and commercialize without restrictions. This combination enables clinical deployment rather than constraining it.

These results demonstrate a practical path forward for healthcare AI development. Building custom models for each diagnostic challenge requires months of development, specialized expertise, and substantial computational resources. Instead, healthcare organizations can deploy one foundation model with multiple lightweight adapters, each trainable in minutes while maintaining superior accuracy. When properly developed and validated for specific clinical use cases, this approach reshapes how healthcare systems integrate AI into their workflows. Teams can scale with adapters to meet existing and emerging needs using one foundation model instead of separate solutions for each workflow.

For teams ready to explore these capabilities, our healthcare AI examples repository provides design templates and implementation examples for classification, similarity search, multimodal analysis, and fine-tuning workflows. The critical question has shifted from whether foundation models will transform medical imaging to how quickly healthcare teams will recognize and act on this opportunity. For teams ready to develop solutions that deliver measurable improvements in accuracy and efficiency, the foundation is built.

Try MedImageInsight in the Microsoft Foundry model catalog now!

