Healthcare and Life Sciences Blog

Discovering the Power of Finetuning MedImageInsight on Your Data

jamesonmerkow, Microsoft
Mar 24, 2025

What if you could achieve state-of-the-art results for medical imaging tasks with 93% fewer parameters using a foundation model that's been calibrated to your institution’s data?

Introduction

That’s the promise of MedImageInsight (MI2), Microsoft’s open-source foundation model that’s revolutionizing medical imaging analysis. Developed by Microsoft Health and Life Sciences, MedImageInsight is designed as a "generalist" foundation model, offering capabilities across diverse medical imaging fields. MI2 achieves state-of-the-art or human expert-level results in tasks like classification, image search, and 3D medical image retrieval. Its features include:

  • Multi-domain versatility: Trained on medical images from fourteen different domains such as X-Ray, CT, MRI, dermoscopy, OCT, fundus photography, ultrasound, histopathology, and mammography.
  • State-of-the-art (SOTA) performance: Achieves SOTA or human expert-level results in tasks like classification, image-image search, and fine-tuning on public datasets, with proven excellence in CT 3D medical image retrieval, disease classification for chest X-ray, dermatology, OCT imaging, and even bone age estimation.
  • Regulatory-ready features: When used on downstream tasks, MI2 allows for sensitivity/specificity adjustments to meet clinical regulatory requirements.
  • Transparent decision-making: Provides evidence-based decision support through image-image and image-text search, enhancing explainability.
  • Efficient report generation: When paired with a text decoder, it delivers near state-of-the-art report generation using only 7% of the parameters compared to similar models.
  • 3D capability: Leverages 3D image-text pre-training to achieve state-of-the-art performance for 3D medical image retrieval.
  • Fairness: Outperforms other models in AI fairness evaluations across age and gender in independent clinical assessments.

MI2 is available now through the Azure AI Foundry model catalog (docs) and has already demonstrated its value across numerous applications. We’ve made it even easier for you to explore its capabilities with our repository full of examples and code for you to try. It covers:

  • Outlier detection: Encoding CT/MR series to spot anomalies.
  • Zero-shot classification with text labels: Identifying conditions without prior training (a minimal sketch follows this list).
  • Adapter training: Specializing in specific classification tasks.
  • Exam parameter detection: Normalizing MRI series and extracting critical details.
  • Multimodal adapter analysis: Merging insights from radiology and pathology.
  • Image search: Finding similar cases to aid diagnosis using both 2D images and 3D volumes (cross-sectional imaging).
  • Model monitoring: Ensuring consistent performance over time (code coming soon).
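
To give a flavor of how these embedding-based applications work, here is a minimal zero-shot classification sketch. It assumes you have already obtained image and label-text embeddings from MI2; get_image_embedding and get_text_embedding are hypothetical placeholders for however you call the model (for example, through an Azure AI Foundry endpoint):

import numpy as np

def zero_shot_classify(image_emb: np.ndarray, label_embs: np.ndarray, labels: list[str]) -> str:
    """Pick the label whose text embedding is most similar (cosine) to the image embedding."""
    image_emb = image_emb / np.linalg.norm(image_emb)
    label_embs = label_embs / np.linalg.norm(label_embs, axis=1, keepdims=True)
    scores = label_embs @ image_emb  # cosine similarity of each label text to the image
    return labels[int(np.argmax(scores))]

# Usage with the hypothetical helpers described above:
# labels = ["x-ray chest pneumonia", "x-ray chest normal"]
# image_emb = get_image_embedding("chest_xray.png")               # shape (D,)
# label_embs = np.stack([get_text_embedding(t) for t in labels])  # shape (K, D)
# print(zero_shot_classify(image_emb, label_embs, labels))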

While these capabilities are impressive on their own, the true potential of MI2 lies in its adaptability. This is where fine-tuning comes in: the ability to customize this powerful foundation model for specific clinical applications at your institution. Fine-tuning, currently available in public preview, can transform this foundation model into production-ready, clinical-grade assets tailored to your specific needs and workflow while maintaining regulatory compliance.

Note: This blog post demonstrates how MedImageInsight can be fine-tuned for new data. This example is illustrative; however, the same process can be used to develop production-ready clinical assets when following appropriate regulatory guidelines.

Teaching an Old (Actually New) AI New Tricks

MedImageInsight’s architecture offers distinct advantages for fine-tuning:

  • Lightweight design: MI2 utilizes a DaViT image encoder (360M parameters) and a language encoder (252M parameters).
  • Efficient scale: With a total of only 0.61B parameters compared to multi-billion parameter alternatives, MI2 requires significantly less computational resources than comparable models.
  • Training flexibility: The model supports both image-text and image-label pairs for different training approaches.
  • Solid foundation: Pre-trained on 3.7M+ diverse medical images, MI2 starts with robust domain knowledge.

MI2 is ideal for fine-tuning to specific medical imaging domains, allowing for clinical applications that integrate into healthcare workflows after validation. The model maintains its strengths while adapting to specialized patterns and requirements.

Using AzureML Pipelines for an MI2 Glow Up

Azure Machine Learning (AzureML) pipelines streamline the fine-tuning process for MI2 with distributed training on GPU clusters. This end-to-end workflow, available now in public preview, manages everything from data preparation to model registration in a reproducible manner. We’ve released five components into public preview to enable you to fine-tune MI2 and simplify related processes like generating a classifier model (a quick way to discover these components in the registry is sketched after the list):

  1. MedImageInsight model finetuning core component (component) is the core component of the fine-tuning process that trains the MedImageInsight model. It requires four separate TSV files as input (an image TSV and a text TSV for training, plus the same pair for evaluation), along with a file of all the possible text strings and a training configuration YAML file. This component supports distributed training on a multi-GPU cluster.
  2. MedImageInsight embedding generation component (component) creates embeddings from images using the MedImageInsight model. It allows customization of image quality and dimensions, and outputs a pickled NumPy array containing embeddings for all processed images.
  3. MedImageInsight adapter finetune component (component) takes NumPy arrays of training and validation data along with their associated text labels (from TSV). It trains a specialized 3-layer model designed for classification tasks and optimizes performance for specific domains while maintaining MI2's core capabilities.
  4. MedImageInsight image classifier assembler component (component) combines your fine-tuned embedding model with a label file into a deployable image classifier. This component takes the fine-tuned MI2 embedding model, text labels, and an optional adapter model, then packages them into a unified MLflow model ready for deployment. The resulting model package can operate in either zero-shot mode or with a custom adapter model.
  5. MedImageInsight pipeline component (component) provides an end-to-end pipeline component that integrates all components into one workflow. It is a simple pipeline that trains, evaluates, and outputs the embedding and classification models.
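
Each of these components lives in the azureml registry, and exact names can change between preview releases. One way to discover them is to list and filter the registry's components; a sketch, assuming you connect an MLClient to the shared azureml registry (as we also do for training below):

from azure.ai.ml import MLClient
from azure.identity import DefaultAzureCredential

# MLClient scoped to the shared "azureml" registry rather than a workspace
ml_registry = MLClient(credential=DefaultAzureCredential(), registry_name="azureml")

# List components and filter for the MedImageInsight ones
for comp in ml_registry.components.list():
    if "medimage_insight" in comp.name:
        print(comp.name)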

Example Dataset: GastroVision

To demonstrate MI2's fine-tuning capabilities, we're using the GastroVision dataset [1] as a real-world example. It's important to note that our goal here is not to build the ultimate gastroenterology classifier, but rather to showcase how MI2 can be effectively fine-tuned. The techniques demonstrated can be applied to your institution's data to create customized embedding models that support not only classification but all the applications we’ve mentioned, from zero-shot classification and outlier detection to image search and multimodal analysis.

The GastroVision dataset offers an excellent test case for several reasons:

  • Open-access dataset: 8,000 endoscopy images collected from two hospitals in Norway and Sweden
  • Diverse classes: Spans 27 distinct classes of gastrointestinal findings with significant class imbalance
  • Real-world challenges: High similarity between certain classes, multi-center variability, and rare findings with limited examples
  • Recent publication: Released in 2023, so it was not part of MI2's original training data

With approximately 8,000 endoscopic images labeled across 27 different classes, this dataset provides a practical context for fine-tuning MI2's embedding capabilities. By demonstrating how MI2 can adapt to new data, we illustrate how you might fine-tune the model on your own data to create production-ready, clinical-grade specialized embedding models tailored to your unique imaging environments, equipment, and patient populations.

The Code: Getting the Data Prep’d

The first step in fine-tuning MI2 is preparing your dataset. For the GastroVision dataset, we need to preprocess the images and structure the data in a format suitable for training:

import base64
import glob
import os
from io import BytesIO

import pandas as pd
from PIL import Image
from tqdm import tqdm

# folder_to_label / labels_to_view: dictionaries defined earlier in the notebook
# that map dataset folder names to class labels and labels to an anatomical view
def gastro_folder_to_text(folder_name):
    label = folder_to_label[folder_name]
    view = labels_to_view[label]
    return f"endoscopy gastrointestinal {view} {label}"

gastrovision_root_directory = "/home/azureuser/data/Gastrovision"
text_to_label = {}
folders = os.listdir(gastrovision_root_directory)

for folder in folders:
    label = folder_to_label[folder]
    text = gastro_folder_to_text(folder)
    text_to_label[text] = label

data = []
files = list(glob.glob(os.path.join(gastrovision_root_directory, "**/*.jpg"), recursive=True))
for file_path in tqdm(files, ncols=120):
    folder = os.path.basename(os.path.dirname(file_path))
    filename = os.path.basename(file_path)
    text = gastro_folder_to_text(folder)
    with Image.open(file_path) as img:
        # Standardize size and color mode, then base64-encode the JPEG bytes
        img = img.resize((512, 512)).convert("RGB")
        buffered = BytesIO()
        img.save(buffered, format="JPEG", quality=95)
        img_str = base64.b64encode(buffered.getvalue()).decode("utf-8")
        # "folder/filename" serves as a unique per-image identifier
        data.append([f"{folder}/{filename}", img_str, text])

df = pd.DataFrame(data, columns=["filename", "image", "text"])
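
For context, folder_to_label and labels_to_view are plain dictionaries built earlier in the notebook. A hypothetical illustration of their shape (the exact strings here are assumptions, not the notebook's actual mappings):

# Hypothetical entries; the notebook defines these for all GastroVision folders
folder_to_label = {
    "Normal stomach": "normal stomach",
    "Gastric polyps": "gastric polyps",
}
labels_to_view = {
    "normal stomach": "upper GI",
    "gastric polyps": "upper GI",
}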

This preprocessing pipeline does several important tasks:

  • Resizing and standardizing images to 512x512 pixels.
  • Converting images to base64 encoding for efficient storage.

Then we convert the encoded images into the TSV: 

import csv
import json

from sklearn.model_selection import train_test_split

# Function to format text as JSON (text_index is defined below, before this is applied)
def format_text_json(row):
    return json.dumps(
        {
            "class_id": text_index[row["text"]],
            "class_name": row["text"],
            "source": "gastrovision",
            "task": "classification",
        }
    )

# Filter the dataframe to only include the top 22 text captions
df_filtered = df[df["text"].isin(df["text"].value_counts().index[:22])].reset_index(
    drop=True
)

# Get unique texts from the filtered dataframe
unique_texts = df_filtered["text"].unique()

# Save the unique texts to a text file
with open("unique_texts.txt", "w") as f:
    for text in unique_texts:
        f.write(text + "\n")

# Create a dictionary to map text labels to indices
text_index = {label: index for index, label in enumerate(unique_texts)}

# Apply the formatting function to the text column
df_filtered["text"] = df_filtered.apply(format_text_json, axis=1)

# Split the dataframe into training, validation, and test sets (60/20/20, stratified)
train_df, val_test_df = train_test_split(
    df_filtered, test_size=0.4, random_state=42, stratify=df_filtered["text"]
)
validation_df, test_df = train_test_split(
    val_test_df, test_size=0.5, random_state=42, stratify=val_test_df["text"]
)

# Create separate dataframes for images and labels and save the dataframes to TSV files
def split_and_save_tsvs(aligned_df, prefix):
    image_df = aligned_df[["filename", "image"]]
    text_df = aligned_df[["filename", "text"]]
    text_df.to_csv(
        f"{prefix}_text.tsv",
        sep="\t",
        index=False,
        header=False,
        quoting=csv.QUOTE_NONE,
    )
    image_df.to_csv(f"{prefix}_images.tsv", sep="\t", index=False, header=False)

split_and_save_tsvs(train_df, "train")
split_and_save_tsvs(validation_df, "validation")
split_and_save_tsvs(test_df, "test")


  • Filtering to include only classes with sufficient samples.
  • Creating label mappings for classification.
  • Splitting data into training, validation, and test sets.
  • Exporting processed data as TSV files for AzureML (a quick sanity check on the exported files is sketched below).
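
Before uploading anything, it is worth sanity-checking that the TSVs round-trip correctly. A minimal sketch against the train_images.tsv produced above:

import base64
from io import BytesIO

import pandas as pd
from PIL import Image

# Read back the image TSV (no header row: filename, base64-encoded JPEG)
images = pd.read_csv("train_images.tsv", sep="\t", header=None, names=["filename", "image"])

# Decode the first image and confirm it matches the expected size and mode
img = Image.open(BytesIO(base64.b64decode(images.iloc[0]["image"])))
assert img.size == (512, 512) and img.mode == "RGB"
print(images.iloc[0]["filename"], img.size, img.mode)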

After preparing the datasets, we need to upload them to AzureML as data assets:

from azure.ai.ml.constants import AssetTypes
from azure.ai.ml.entities import Data

# ml_client is an authenticated MLClient for your AzureML workspace
name = "gastrovision"
assets = {
    "image_tsv": "train_images.tsv",
    "text_tsv": "train_text.tsv",
    "eval_image_tsv": "validation_images.tsv",
    "eval_text_tsv": "validation_text.tsv",
    "label_file": "unique_texts.txt",
}

data_assets = {
    key: Data(
        path=value,
        type=AssetTypes.URI_FILE,
        description=f"{name} {key}",
        name=f"{name}-{key}",
    )
    for key, value in assets.items()
}

for key, data in data_assets.items():
    data_assets[key] = ml_client.data.create_or_update(data)
    print(
        f"Data asset {key} created or updated.",
        data_assets[key].name,
        data_assets[key].version,
    )

These uploaded assets are versioned in AzureML, allowing for reproducibility and tracking of which specific data was used for each training run.
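Because each upload creates a new version, a later training run can be pinned to the exact data it used. For example (the version number here is illustrative):

# Retrieve a specific version of a registered data asset
image_tsv_v1 = ml_client.data.get(name="gastrovision-image_tsv", version="1")
print(image_tsv_v1.id)  # azureml:... URI usable as a pipeline input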

The Code: Cue the Training Montage

In the notebook, we demonstrate a straightforward example of fine-tuning using the pipeline component, but you can integrate these components into larger pipelines that train more complex downstream tasks such as exam parameter classification, report generation, or analysis of 3D scans.

import random

from azure.ai.ml.dsl import pipeline

conf_file = "train-gastrovision.yaml"
data = Data(
    path=conf_file,
    type=AssetTypes.URI_FILE,
    description=f"{name} conf_files",
    name=f"{name}-conf_files",
)
data_assets["conf_files"] = ml_client.data.create_or_update(data)

# Get the pipeline component from the azureml registry
finetune_pipeline_component = ml_registry.components.get(
    name="medimage_insight_ft_pipeline", label="latest"
)

# Get the latest MI2 model
model = ml_registry.models.get(name="MedImageInsight", label="latest")

@pipeline(name="medimage_insight_ft_pipeline_job" + str(random.randint(0, 100000)))
def create_pipeline():
    # `compute` is the AzureML GPU compute cluster created earlier in the notebook
    mi2_pipeline = finetune_pipeline_component(
        mlflow_embedding_model_path=model.id,
        compute_finetune=compute.name,
        instance_count=8,
        **data_assets,
    )
    return {
        "classification_model": mi2_pipeline.outputs.classification_mlflow_model,
        "embedding_model": mi2_pipeline.outputs.embedding_mlflow_model,
    }

pipeline_object = create_pipeline()
pipeline_object.compute = compute.name
pipeline_object.settings.continue_on_step_failure = False
pipeline_job = ml_client.jobs.create_or_update(pipeline_object, experiment_name=name)
pipeline_job_run_id = pipeline_job.name
pipeline_job
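
Submitting the job returns immediately. To follow training progress from the notebook, you can stream the job's logs:

# Block until the pipeline job finishes, streaming its logs to the notebook
ml_client.jobs.stream(pipeline_job.name)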

This pipeline approach offers several advantages:
  • Access to modular components (you can use only parts of the pipeline if needed)
  • Distributed training across multiple compute instances
  • Built-in monitoring and logging
  • Seamless integration with the AzureML model registry

The Code: Saving and Deploying your Models

After the training job is completed, we register the model in the AzureML registry and deploy it as an online endpoint:

from azure.ai.ml.entities import ManagedOnlineDeployment, ManagedOnlineEndpoint, Model

# Create a Model to register from the pipeline job's output
run_model = Model(
    path=f"azureml://jobs/{pipeline_job.name}/outputs/classification_model",
    name=f"classifier-{name}-{pipeline_job.name}",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)

# Register the Model
run_model = ml_client.models.create_or_update(run_model)

# Create endpoint and deployment with the classification model
endpoint = ManagedOnlineEndpoint(name=name)
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()
deployment = ManagedOnlineDeployment(
    name=name,
    endpoint_name=endpoint.name,
    model=run_model.id,
    instance_type="Standard_NC6s_v3",
    instance_count=1,
)
deployment = ml_client.online_deployments.begin_create_or_update(deployment).result()

This deployment process creates a scalable API endpoint that can be integrated into your workflows, with built-in monitoring and scaling capabilities.
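
Once deployed, the endpoint can be called like any AzureML online endpoint. A minimal sketch follows; the request payload shown is an assumption, so verify the scoring schema of your deployed MLflow model before use:

import base64
import json

# Hypothetical request payload; check the schema your deployed model expects
with open("sample_endoscopy.jpg", "rb") as f:
    img_b64 = base64.b64encode(f.read()).decode("utf-8")

request = {"input_data": {"columns": ["image"], "data": [[img_b64]]}}
with open("request.json", "w") as f:
    json.dump(request, f)

response = ml_client.online_endpoints.invoke(
    endpoint_name=name, deployment_name=name, request_file="request.json"
)
print(response)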

Results and Making Sure It Works

After fine-tuning MI2 on the GastroVision dataset, we can validate the quality of the resulting embeddings by evaluating their performance on a classification task.

| Method | Macro Prec. | Macro Recall | Macro F1 | Micro Prec. | Micro Recall | Micro F1 | MCC | mAUC |
|---|---|---|---|---|---|---|---|---|
| ResNet-50 [2] | 0.437 | 0.437 | 0.433 | 0.681 | 0.681 | 0.681 | 0.641 | - |
| Pre-trained DenseNet-121 [3] | 0.738 | 0.623 | 0.650 | 0.820 | 0.820 | 0.820 | 0.798 | - |
| Greedy Soup (GenAI Aug) [4] | 0.675 | 0.600 | 0.615 | 0.812 | 0.812 | 0.812 | 0.790 | - |
| Greedy Soup (Basic Aug) [4] | 0.762 | 0.639 | 0.666 | 0.832 | 0.830 | 0.830 | 0.809 | - |
| MI Finetune | 0.736 | 0.772 | 0.740 | 0.834 | 0.860 | 0.847 | 0.819 | 0.990 |

Using a KNN classifier on the fine-tuned embeddings, we achieve an impressive mAUC of 0.990 and state-of-the-art results on the other metrics. Though our goal was not to create the ultimate gastroenterology classifier, these results demonstrate that with minimal fine-tuning, MI2 produces embeddings that can power a state-of-the-art classifier using nothing more complex than KNN.
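
For readers who want to reproduce this style of check, here is a minimal sketch of the evaluation. It assumes train_emb/test_emb are embedding arrays generated with the fine-tuned model and train_y/test_y are integer class IDs; the KNN settings are assumptions, not the exact configuration used above:

from sklearn.metrics import matthews_corrcoef, roc_auc_score
from sklearn.neighbors import KNeighborsClassifier

# Assumed inputs: train_emb/test_emb are (N, D) arrays of MI2 embeddings,
# train_y/test_y are integer class IDs; n_neighbors=5 is an arbitrary choice
knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
knn.fit(train_emb, train_y)

probs = knn.predict_proba(test_emb)         # (N, num_classes)
preds = knn.classes_[probs.argmax(axis=1)]  # hard predictions for MCC

print("MCC: ", matthews_corrcoef(test_y, preds))
print("mAUC:", roc_auc_score(test_y, probs, multi_class="ovr", average="macro"))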

The real potential here goes far beyond classification. Imagine applying this same fine-tuning approach to your institution's specific imaging data. The resulting domain-adapted model would provide enhanced performance across all MI2's capabilities:

  • More accurate outlier detection in your specific patient population
  • More precise image retrieval for similar cases in your database
  • Better multimodal analysis combining your radiology and pathology data
  • Enhanced report generation tailored to your clinical workflows

MI2's efficient architecture (0.36B/0.25B parameters for image/text encoder respectively) can be effectively adapted to specialized domains while maintaining its full range of capabilities. The classification metrics validate that the fine-tuning process has successfully adapted the embedding space to better represent your specific medical domain.


Your Turn to Build!

Fine-tuning MedImageInsight represents a significant opportunity to extend the capabilities of this powerful foundation model into specialized medical imaging domains and subspecialties. Through our demonstration with the GastroVision dataset, we have shown how MI2’s architecture, with just 0.36B and 0.25B parameters for the image and text encoder respectively, can be efficiently adapted to new tasks with competitive or superior performance compared to traditional approaches.

The key features of fine-tuning MI2 include:

  1. Efficiency: Achieving high performance with minimal data and computational resources
  2. Versatility: Adapting to specialized domains while preserving multi-domain capabilities
  3. Practicality: Streamlined workflow from training to deployment using AzureML

The fine-tuning process described here provides a pathway for healthcare institutions to develop production-ready, clinical-grade AI assets. By fine-tuning MedImageInsight and incorporating appropriate validation, testing, and regulatory compliance measures, the model can be transformed from a foundation model into specialized clinical tools optimized for your specific use cases and patient populations. With your fine-tuned model, you gain several distinct advantages:

  • Enhanced domain adaptation: Models that better understand the unique characteristics of your patient population and imaging equipment
  • Improved rare condition detection: Higher sensitivity for conditions specific to your specialty or patient demographics
  • Reduced false positives: Better differentiation between similar-appearing conditions common in your practice
  • Customized explanations: More relevant evidence-based decisions through image-image search from your own database

As healthcare institutions increasingly adopt AI for medical imaging analysis, the ability to fine-tune models for specific patient populations, imaging equipment, and clinical specialties becomes crucial. MedImageInsight’s efficient architecture and adaptability make it an ideal foundation for building specialized medical imaging solutions that can be deployed in resource-constrained environments.

We encourage you to try fine-tuning MedImageInsight with your own specialized datasets using our sample Jupyter Notebook as your starting point. The combination of MI2’s regulatory-friendly features with domain-specific adaptations opens new possibilities for transparent, efficient, and effective AI-assisted medical imaging analysis.


[1] Jha, D. et al. (2023). GastroVision: A Multi-class Endoscopy Image Dataset for Computer Aided Gastrointestinal Disease Detection. ICML Workshop on Machine Learning for Multimodal Healthcare Data (ML4MHD 2023).
[2] He, K., Zhang, X., Ren, S., Sun, J. (2016). Deep Residual Learning for Image Recognition. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 770-778.
[3] Huang, G., Liu, Z., Van Der Maaten, L., Weinberger, K.Q. (2017). Densely Connected Convolutional Networks. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pp. 4700-4708.
[4] Fahad, M. et al. (2025). Deep insights into gastrointestinal health. Biomedical Signal Processing and Control, 102, 107260.
