Achieve fine-grained segmentation results with a flexible healthcare foundation model, optimized for your use cases on your data!
This post is part of our healthcare AI fine-tuning series:
- MedImageInsight Fine-Tuning - Embeddings and classification
- MedImageParse Fine-Tuning - Segmentation and spatial understanding (you are here)
- CXRReportGen Fine-Tuning - Clinical findings generation
Introduction
MedImageParse now supports fine-tuning, allowing you to adapt Microsoft’s open-source biomedical foundation model to your healthcare use cases and data. In as little as an hour, you can adapt the model to add new segmentation targets, support new modalities, or significantly boost performance on your data. We’ll demonstrate how we achieved large performance gains across multiple metrics on a public dataset.
Clinical biomedical applications often need highly specialized models, but training one from scratch is expensive and data-intensive. Traditional approaches require thousands of annotated images, weeks of compute time, and deep machine learning expertise just to get started. Fine-tuning offers a practical alternative: by starting with a strong foundation model and adapting it to your specific domain, you can achieve production-ready performance with hundreds of examples and hours of training time. Everything you need to start fine-tuning is available now, including a ready-to-use AzureML pipeline, complete workflow notebooks, and deployment capabilities.
We fine-tuned MedImageParse on the CDD-CESM mammography dataset (specialized CESM modality for lesion segmentation) to demonstrate domain adaptation on data under‑represented in pre-training.
Follow along: The complete example is in our GitHub repository as a ready-to-run notebook.
What is MedImageParse?
MedImageParse (MIP) is Microsoft’s open-source implementation of BiomedParse that comes with a permissive MIT license and is designed for integration into commercial products. It is a powerful and flexible foundation model for text-prompted medical image segmentation. MIP accepts an image and one or more prompts (e.g., “neoplastic cells in breast pathology” or “inflammatory cells”), then accurately identifies and segments the corresponding structures within the input image. Trained on a wide range of biomedical imaging datasets and tasks, MIP captures robust feature representations that are highly transferable to new domains. Furthermore, it operates efficiently on a single GPU, making it a practical tool for research laboratories without extensive computational resources.
Built with adaptability in mind, the model can be fine-tuned using your own datasets to refine segmentation targets, accommodate unique imaging modalities, or improve performance on local data distributions. Its modest computational footprint, paired with this flexibility, positions MIP as a strong starting point for custom medical imaging solutions.
When to Fine-tune (and When NOT to)
Fine-tuning can transform MedImageParse into your own clinical asset that's aligned with your institution’s needs. But how do you know if that’s the right approach for your use case?
Fine-tuning makes sense when you’re working with specialized imaging protocols (custom equipment or acquisition parameters), rare structures not well-represented in general datasets, or when you need high precision for quantitative analysis. You’ll need some high-quality annotated examples to see meaningful improvements; more is better, but thousands aren’t required.
Simpler approaches might work instead if the pre-trained model already performs reasonably well on standard anatomies and common pathologies. If you’re still in exploratory mode figuring out what to measure, start with the base model first to establish a strong baseline for your use case.
Our example shows how fine-tuning can deliver meaningful gains even with modest resources: with about one hour of GPU time and 200-500 annotated images, we saw substantial improvements across multiple metrics.
The Fine-tuning Pipeline: From Data to Deployed Model
To demonstrate fine-tuning in action, we used the CDD-CESM mammography dataset: a collection of Contrast-Enhanced Spectral Mammography (CESM) images with expert-annotated breast lesion masks. CESM is a specialized imaging modality that wasn’t well represented in MedImageParse’s original training data. The dataset [1] (available from our HuggingFace location or from its original TCIA page) includes predefined splits with high-quality segmentation annotations.
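If you want to pull the data programmatically, a minimal sketch using the huggingface_hub client follows; the repository ID is a placeholder, so substitute the actual location referenced in the notebook.
# Sketch: download the CDD-CESM data from a Hugging Face dataset repository.
# The repo_id is a placeholder -- use the location referenced in the notebook.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<your-org>/CDD-CESM",  # hypothetical repository ID
    repo_type="dataset",
    local_dir="CDD-CESM",
)
print(f"Dataset downloaded to {local_dir}")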
Figure 1 - Example Images from CDD-CESM
Why AzureML Pipelines?
Before diving into the workflow, it’s worth understanding why we use AzureML pipelines for this process. Every experiment is tracked with full versioning; you always know exactly what you ran and can reproduce results months later. The pipeline handles multi-GPU distribution automatically without code changes, making it easy to scale up. The modular design lets you mix and match components for your specific needs, swap data preprocessing, adjust training parameters, or change deployment strategies independently. Training metrics, validation curves, and resource utilization are logged automatically, giving you full visibility into the process. Learn more about Azure ML pipelines.
Fine-Tuning Workflow
Setup: Upload data and configure compute
The first step uploads your training data and configuration to AzureML as versioned assets. You’ll configure a GPU compute cluster (H100 or A100 instances recommended) that will handle the training workload.
from azure.ai.ml.entities import Data
from azure.ai.ml.constants import AssetTypes

# Create and upload training data folder
training_data = Data(
    path="CDD-CESM",
    type=AssetTypes.URI_FOLDER,
    description=f"{name} training data",
    name=f"{name}-training_data",
)
training_data = ml_client.data.create_or_update(training_data)

# Create and upload parameters file
parameters = Data(
    path="parameters.yaml",
    type=AssetTypes.URI_FILE,
    description=f"{name} parameters",
    name=f"{name}-parameters",
)
parameters = ml_client.data.create_or_update(parameters)
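The compute cluster itself can be created once and reused across experiments. A minimal sketch is below; the cluster name is an illustrative choice and the VM size is one of the recommended GPU options.
# Sketch: create (or reuse) a GPU compute cluster for fine-tuning.
# The cluster name is hypothetical; adjust the VM size to your quota.
from azure.ai.ml.entities import AmlCompute

compute_name = "mip-gpu-cluster"
try:
    compute = ml_client.compute.get(compute_name)
except Exception:
    compute = AmlCompute(
        name=compute_name,
        size="Standard_NC40ads_H100_v5",  # H100; A100 sizes also work
        min_instances=0,
        max_instances=1,
    )
    compute = ml_client.compute.begin_create_or_update(compute).result()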
Fine-tuning: The medimageparse_finetune component
The fine-tuning component takes three inputs:
- The pre-trained MedImageParse model (foundation weights)
- Your annotated dataset
- Training configuration (learning rate, batch size, augmentation settings)
During training the pipeline applies augmentation, tracks validation metrics, and checkpoints periodically. The output is an MLflow-packaged model: a portable artifact that bundles the model weights and preprocessing code and is ready to deploy in AzureML or AI Foundry.
The pipeline uses parameter-efficient fine-tuning techniques to adapt the model while preserving the broad knowledge from pre-training. This means you get specialized performance without catastrophic forgetting of the base model’s capabilities.
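The fine-tuning component handles this internally, but to make the idea concrete, here is a generic LoRA-style adapter sketch in PyTorch. This is not the component's actual implementation; it only illustrates training small low-rank matrices while the pretrained weights stay frozen.
# Sketch: a generic LoRA-style adapter illustrating parameter-efficient fine-tuning.
# Not the component's actual implementation -- only the idea: the pretrained
# projection is frozen and only the small low-rank matrices are trained.
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 8, alpha: float = 16.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad_(False)  # freeze pretrained weights
        self.lora_a = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.lora_b = nn.Parameter(torch.zeros(base.out_features, rank))
        self.scale = alpha / rank

    def forward(self, x):
        # Frozen base projection plus a small trainable low-rank update.
        return self.base(x) + (x @ self.lora_a.T @ self.lora_b.T) * self.scale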
import random
from azure.ai.ml.dsl import pipeline

# Get the pipeline component
finetune_pipeline_component = ml_registry.components.get(
    name="medimageparse_finetune", label="latest"
)

# Get the latest MIP model
model = ml_registry.models.get(name="MedImageParse", label="latest")

# Create the pipeline
# data_assets holds the data assets uploaded in the setup step
@pipeline(name="medimageparse_finetuning" + str(random.randint(0, 100000)))
def create_pipeline():
    mip_pipeline = finetune_pipeline_component(
        pretrained_mlflow_model=model.id,
        data=data_assets["training_data"].id,
        config=data_assets["parameters"].id,
    )
    return {"mlflow_model_folder": mip_pipeline.outputs.mlflow_model_folder}

# Submit the pipeline
pipeline_object = create_pipeline()
pipeline_object.compute = compute.name
pipeline_object.settings.continue_on_step_failure = False
pipeline_job = ml_client.jobs.create_or_update(
    pipeline_object, experiment_name="medimageparse_finetune_experiment"
)
Figure 2 - AzureML Pipeline of Fine Tuning Process
Deployment: Register and serve the model
After training, the model can be registered in your AzureML workspace with version tracking. From there, deployment to a managed online endpoint takes a single command. The endpoint provides a scalable REST API backed by GPU compute for optimal inference performance.
from azure.ai.ml.entities import Model, ManagedOnlineEndpoint, ManagedOnlineDeployment

# Register the model produced by the fine-tuning job
run_model = Model(
    path=f"azureml://jobs/{pipeline_job.name}/outputs/mlflow_model_folder",
    name=f"MIP-{name}-{pipeline_job.name}",
    description="Model created from run.",
    type=AssetTypes.MLFLOW_MODEL,
)
run_model = ml_client.models.create_or_update(run_model)

# Create an endpoint and a deployment for the fine-tuned segmentation model
endpoint = ManagedOnlineEndpoint(name=name)
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()

deployment = ManagedOnlineDeployment(
    name=name,
    endpoint_name=endpoint.name,
    model=run_model.id,
    instance_type="Standard_NC40ads_H100_v5",
    instance_count=1,
)
deployment = ml_client.online_deployments.begin_create_or_update(deployment).result()
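Once the deployment is live, route traffic to it before invoking the endpoint (a short follow-up to the code above):
# Send all endpoint traffic to the new deployment.
endpoint.traffic = {deployment.name: 100}
endpoint = ml_client.online_endpoints.begin_create_or_update(endpoint).result()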
Testing: Text-prompted inference
With the endpoint deployed, you can send test images along with text prompts describing what to segment. For the CDD-CESM example, we use text prompts: “neoplastic cells in breast pathology & inflammatory cells”. The model returns multiple segmentation masks for different detected regions. Text-prompting lets you switch focus on the fly (e.g., “tumor boundary” vs. “inflammatory infiltration”) without retraining or reconfiguring the model.
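The exact request and response schema is defined by the model's scoring script, so the snippet below is only a sketch of invoking the endpoint with a base64-encoded image and the prompt string; the field names and file paths are illustrative, and the example notebook shows the precise format.
# Sketch: invoke the managed online endpoint with an image and text prompts.
# Field names ("image", "text") and the test image path are illustrative;
# the scoring script defines the actual request/response schema.
import base64
import json

with open("sample_cesm_image.png", "rb") as f:  # hypothetical test image
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

request = {
    "input_data": {
        "columns": ["image", "text"],
        "data": [[image_b64, "neoplastic cells in breast pathology & inflammatory cells"]],
    }
}
with open("request.json", "w") as f:
    json.dump(request, f)

response = ml_client.online_endpoints.invoke(
    endpoint_name=endpoint.name,
    deployment_name=deployment.name,
    request_file="request.json",
)
print(response)  # contains the predicted segmentation masks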
Results
Fine-tuning made a huge difference in how well the model works. The Dice Score, which shows how closely the model’s results match the actual regions, more than doubled, from 0.198 to 0.486. The IoU, another measure of overlap, nearly tripled, going from 0.139 to 0.383. Sensitivity jumped from 0.251 to 0.535, which means the model found more real positives.
| Metric | Base | Fine-tuned | Δ Abs | Δ Rel |
| --- | --- | --- | --- | --- |
| Dice (F1) | 0.198 | 0.486 | +0.288 | +145% |
| IoU | 0.139 | 0.383 | +0.244 | +176% |
| Sensitivity | 0.251 | 0.535 | +0.284 | +113% |
| Specificity | 0.971 | 0.987 | +0.016 | +1.6% |
| Accuracy | 0.936 | 0.963 | +0.027 | +2.9% |
These improvements matter in practice. Higher Dice and IoU scores mean the model outlines the exact shape and size of problem areas more faithfully, which helps clinicians take accurate measurements and track changes over time. The jump in sensitivity means the model finds more actual lesions, while specificity above 98% keeps false alarms rare. The gains in overall accuracy are welcome, but the much larger improvements in overlap and recall matter most for precise delineation in medical images.
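For reference, Dice and IoU can be computed directly from binary masks; this is a minimal NumPy sketch, not the pipeline's evaluation code:
# Sketch: Dice and IoU between a predicted and a ground-truth binary mask.
import numpy as np

def dice_and_iou(pred: np.ndarray, target: np.ndarray, eps: float = 1e-8):
    pred = pred.astype(bool)
    target = target.astype(bool)
    intersection = np.logical_and(pred, target).sum()
    union = np.logical_or(pred, target).sum()
    dice = 2.0 * intersection / (pred.sum() + target.sum() + eps)
    iou = intersection / (union + eps)
    return float(dice), float(iou)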
Figure 3 - Segmentation results before and after fine-tuning MedImageParse
Try It on Your Own Data
To implement this solution in your organization, start with the core requirements and resources below; they cover what you need to move efficiently from planning to a deployed, fine-tuned model.
- Dataset size: Start with 200-500 annotated images. This is enough to see meaningful performance improvements without requiring massive data collection efforts. More data generally helps, but you don’t need thousands of examples to get started.
- Annotation quality: High-quality segmentation masks are critical. Invest in precise boundary delineations (pixel-level accuracy where possible), consistent annotation protocols across all images, and quality control reviews to catch and correct errors (a simple automated sanity check is sketched after this list).
- Annotation effort: Budget enough time per image for careful annotation. Apply active learning approaches to focus effort on the most informative samples and start with a smaller pilot dataset (100-150 images) to validate the approach before scaling up.
- Training compute: A100 or H100 GPUs are recommended (a single node with multiple GPUs is sufficient for runs of a few hundred images). For the CDD-CESM dataset, we used a single-node NC-series VM with 8 GPUs; training on 300 images took around 30 minutes for 10 epochs. If you’re training on larger datasets (thousands of images), consider ND-series VMs, which offer better multi-node performance and let you train on larger volumes of data faster.
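As a starting point for annotation quality control, here is a minimal sketch that flags common mask problems; the file layout and expected pixel values are assumptions, so adapt them to your annotation format.
# Sketch: basic sanity checks on segmentation masks before training.
# Assumes masks are image files paired with their source images; the expected
# pixel values (0/1 or 0/255) are an assumption -- adapt to your format.
import numpy as np
from PIL import Image

def check_mask(image_path: str, mask_path: str) -> list:
    issues = []
    image = np.array(Image.open(image_path))
    mask = np.array(Image.open(mask_path))
    if mask.shape[:2] != image.shape[:2]:
        issues.append("mask and image dimensions do not match")
    if set(np.unique(mask)) - {0, 1, 255}:
        issues.append("mask is not binary (expected values 0/1 or 0/255)")
    if mask.max() == 0:
        issues.append("mask is empty (no annotated pixels)")
    return issues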
Where to Go from Here?
So, what does this mean for your workflows and clinical teams?
Foundation models like MedImageParse provide significant power and performance. Their text-prompted, multi-task capabilities integrate into existing workflows without retooling, and they are relatively inexpensive to run for inference. This means faster review, more precise assessments, and independence from vendor development timelines.
These models are not adapted to your institution and use cases out of the box, yet developing a foundation model from scratch is prohibitively expensive. Fine-tuning bridges that gap: you can boost performance on your data and adapt the model to your use case at a fraction of the cost. You control what the model learns, how it fits your workflow, and how it is validated for your context.
We’ve provided the complete tools to do that: the fine-tuning notebook walks through the entire process, from data preparation to deployment. By following this workflow and collecting annotated data from your institution (see “Try It on Your Own Data” above for requirements), you can deploy MedImageParse tailored to your institution and use cases.
References
- [1] Khaled R., Helal M., Alfarghaly O., Mokhtar O., Elkorany A., El Kassas H., Fahmy A. Categorized Digital Database for Low energy and Subtracted Contrast Enhanced Spectral Mammography images [Dataset]. (2021) The Cancer Imaging Archive. DOI: 10.7937/29kw-ae92. https://www.cancerimagingarchive.net/collection/cdd-cesm/