Healthcare and Life Sciences Blog

Unlocking the Magic of Embedding Models: Practical Patterns for Healthcare AI

ivantarapov
Jan 07, 2025
Unlocking the Power of Multimodal Healthcare AI

Multimodal health AI technologies have seen rapid development lately and hold strong promise for the medical field, as discussed in one of our previous blogs.

Many of these technologies are based on embedding models that can create representations of underlying modalities. The exciting part is that, with the growth of these models and the advent of the foundation model era, these representations are becoming more powerful than ever.

In this blog we will dive into the amazing capabilities of embedding models using the MedImageInsight (or MI2 for brevity) model as an example. MedImageInsight, a foundation model recently published in the Azure AI Foundry Model Catalog, is a versatile tool that can turn a wide range of medical images, or text strings, into a set of numbers – a vector that represents features in what’s called a “latent space” (these vectors are also known as embeddings). MI2 has shown impressive performance on many benchmarks and can serve as the cornerstone of more complex systems.

The Magic of Embeddings

The idea behind embeddings is that, during training, embedding models learn to map high-dimensional data (like medical images) into vectors in a lower-dimensional space, capturing the most important features of the data while reducing computational complexity. This process places similar images and text close together in the latent space, which is what makes these models so powerful – they can capture complex patterns and features in medical images that would otherwise be challenging to recognize.

Later in this blog we'll take a look at several practical ways you can explore and test what an embedding model like MI2 can do across various medical image analysis tasks. Note that while many medical images come in series that represent volumes of imaged space (like CT or MR scans), MI2 focuses on 2D images. It is possible to use MI2 to compute embeddings from series and even whole studies, though that requires some extra techniques that we will explore in future blogs.

Now, we will take a closer look at a few different approaches that let you build powerful classification and retrieval systems that support model development and training:

  1. Zero-Shot Approach: a) prediction via text-image correlations and b) prediction via image-image clustering capabilities
  2. Adapter Approach: prediction via an addition of an extra small neural network

1. Zero-Shot Approach

The zero-shot approach takes advantage of a pre-trained embedding model without modifying the model’s weights or introducing new representations. Instead, it uses a small set of sample data as a hint to guide the classification process. Essentially, it relies on the model’s existing understanding of visual features to handle classification and retrieval tasks.

1.1 Prediction via Text-Image Correlations

Embedding models like MedImageInsight that are trained on paired image-text data can link images with corresponding text descriptions. By comparing the similarity between image embeddings and text embeddings – often using the cosine similarity measure – the model can make zero-shot classifications.

If you’re performing an image classification task, you would typically pre-compute the text embeddings ahead of time and then match them to the embedding of the new image.

[Diagram: zero-shot classification by matching a new image's embedding against pre-computed text label embeddings]

Here’s what your implementation would look like (a minimal code sketch follows the list):

  • Embedding Generation: Pre-compute embeddings for text labels you’re interested in.
  • Inference: Compute the embedding for the newly submitted image.
  • Similarity Measurement: Calculate cosine similarity between the image and label embeddings – that in essence would be the “classifier” block in the diagram above.
  • Prediction: Assign the label with the highest similarity score to the submitted image.
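
To make these steps concrete, here is a minimal Python sketch of the matching logic. The get_text_embedding and get_image_embedding helpers are hypothetical stand-ins for whatever client code you use to call your deployed MI2 endpoint; the label strings are illustrative:

    import numpy as np

    def cosine_similarity(a, b):
        # Cosine similarity between two 1-D embedding vectors
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # Pre-compute embeddings for the text labels of interest
    # (get_text_embedding is a hypothetical wrapper around the MI2 endpoint)
    labels = ["cardiomegaly", "pneumothorax", "pleural effusion"]
    text_embeddings = {label: get_text_embedding(label) for label in labels}

    def classify(image_bytes):
        # Embed the new image, then assign the label whose text
        # embedding is most similar to the image embedding
        image_emb = get_image_embedding(image_bytes)
        scores = {label: cosine_similarity(image_emb, emb)
                  for label, emb in text_embeddings.items()}
        return max(scores, key=scores.get)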

Check out our Jupyter Notebook for a Python code sample that contains performance considerations using a sample dataset: https://aka.ms/healthcare-ai-examples-mi2-zero-shot

1.2 Prediction via Image-Image Clustering

Now picture this: you've tried the previous approach – it only took a few lines of code and a handful of sample images to test – but it didn’t work out quite like you hoped. Sometimes this happens when the model's text-image associations don’t cover the specific labels you're dealing with. So, what's the next logical move?

Well, it’s time to guide the model a bit more directly by using a set of images that represent the classes you're interested in. Grab a few images (20-100 per class is a solid start), convert them into embeddings, and see how well they cluster together. If the model has seen similar images during training, you should see those embeddings organize into nice, tidy clusters.

Now, at inference time, you can take your new, previously unseen image and see how close its embedding lands to these pre-existing clusters. The cluster it falls closest to gives you your prediction!
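
One simple way to realize that idea is to represent each class by the centroid of its sample embeddings and assign a new image to the nearest centroid. A sketch under assumed inputs: class_embeddings maps each condition name to an array of pre-computed MI2 embeddings, and get_image_embedding is a hypothetical wrapper around your MI2 endpoint:

    import numpy as np

    def cosine_similarity(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    # class_embeddings: dict mapping each condition name to an (n, d)
    # array of pre-computed MI2 embeddings for that class (assumed input)
    centroids = {label: embs.mean(axis=0)
                 for label, embs in class_embeddings.items()}

    # Assign the label of the centroid most similar to the new image
    new_emb = get_image_embedding(new_image_bytes)
    prediction = max(centroids,
                     key=lambda lbl: cosine_similarity(new_emb, centroids[lbl]))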

You may think of this sample set as a 'training set' used in deep learning models – and you're not wrong! This algorithm falls under the family of supervised learning methods. But the cool part is that there's no need for an expensive, time-consuming training phase like in typical deep learning workflows. The computational demands are way lower, making this approach much more accessible and efficient.

For a more practical example, let’s say you’re trying to determine the presence of certain conditions in an X-ray image. You could take a few sample images of cardiomegaly, pneumothorax and pleural effusion and run them through MI2 to get their embeddings. The idea is that similar conditions will form clusters in the embedding space. You can even use visualization techniques like UMAP or t-SNE to get a feel for how the clusters are laid out – it’s a fun way to see whether similar conditions are grouping together. Just keep in mind that for a proper analysis you’d want to back it up with more rigorous validation.
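
For example, here is one way you might project the embeddings to 2D with UMAP, assuming embeddings is an (n_samples, d) numpy array of pre-computed MI2 embeddings and labels holds the matching condition names:

    import numpy as np
    import matplotlib.pyplot as plt
    import umap  # pip install umap-learn

    # Project the high-dimensional embeddings down to 2D for inspection
    coords = umap.UMAP(n_components=2, metric="cosine").fit_transform(embeddings)

    labels = np.asarray(labels)
    for condition in np.unique(labels):
        mask = labels == condition
        plt.scatter(coords[mask, 0], coords[mask, 1], s=10, label=condition)
    plt.legend()
    plt.title("MI2 image embeddings projected to 2D with UMAP")
    plt.show()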

[Figure: condition clusters visualized in the MI2 embedding space]

Now, with everything set up you can move on to inferencing. Here, you’d use a cosine similarity measure to find out how similar the new image is to the images in your clusters, using something like a KNN algorithm to make the final call. It’s like matching the new kid on the block to the closest group of friends!

Your implementation of this approach would need the following steps (a code sketch follows the list):

  • Embedding Extraction: Pre-compute embeddings for all images in the training set.
  • Inference: Compute embedding of the newly submitted image.
  • Clustering Algorithm: Use algorithms like K-means, DBSCAN or hierarchical clustering to organize the reference embeddings into clusters, then find the cluster closest to the supplied image (standard implementations of these algorithms are available in many libraries). Your choice of algorithm would be guided by how well the images organize into clusters and how many samples you have. The good thing is that experimenting with different clustering techniques only takes seconds, so iterations can come quickly.
  • Prediction: Assign the label of the cluster closest to the submitted image.
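
Put together, the inference side can be as small as a k-nearest-neighbors classifier over the reference embeddings. A minimal sketch, assuming reference_embeddings is an (n_samples, d) array of pre-computed MI2 embeddings, reference_labels holds the matching labels, and get_image_embedding is a hypothetical wrapper around your MI2 endpoint:

    from sklearn.neighbors import KNeighborsClassifier

    # Fit a KNN "classifier" over the pre-computed reference embeddings;
    # cosine distance pairs naturally with embedding similarity
    knn = KNeighborsClassifier(n_neighbors=5, metric="cosine")
    knn.fit(reference_embeddings, reference_labels)

    # Embed the new image and let its nearest neighbors vote on the label
    new_emb = get_image_embedding(new_image_bytes)
    prediction = knn.predict(new_emb.reshape(1, -1))[0]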

The best part about this approach? Other than being computationally inexpensive once you have computed the embeddings, it’s highly explainable. You don’t just get a prediction – you also get insight into your system’s thought process by visualizing the similar images it used.

Check out our Jupyter Notebook for a Python code sample and sample visualizations: https://aka.ms/healthcare-ai-examples-mi2-zero-shot

Observant readers might have started wondering how this method would work if the X-rays used to form the clusters, or those sent through the system, have more than one of these conditions – which is often the case for chest X-rays in the real world. This is a great question: in that scenario, cluster proximity will likely be a much less reliable measure for detecting multiple classes in a given image. If that is the case, you might want to consider a method that requires a bit more coding: the adapter approach.

2. Adapter Approach

The variants of the zero-shot approach assume that the model already knows enough about your image types. But what if it doesn’t – as in, it has seen something similar but hasn’t quite paid attention to the finer details? Let’s say you want to classify different types of implanted devices in X-rays or vascular structures in ultrasound images with MI2. If the previous two approaches aren’t quite hitting the performance you are looking for, you can train a simple adapter neural network that learns features from MI2 embeddings.

An adapter neural network can be pretty straightforward – for example, a Multi-Layer Perceptron (MLP), a type of feedforward neural network effective at learning non-linear relationships between embeddings and target classes.

With an adapter our system looks like this:

[Diagram: classification system with an adapter network on top of MI2 embeddings]

The adapter takes MI2 embeddings as input and outputs the classes of interest. It’s quick to train and doesn’t require much computational power. The network we provide in the example that accompanies this blog can be trained in under 10 seconds on a CPU. In real-world scenarios training may take longer depending on dataset size and hardware, but we are still talking single-CPU-grade training.

This approach can be taken quite a bit further with more complex network architectures.

With that in mind, an implementation of this approach will have the following steps (a code sketch follows the list):

  • Embedding Extraction: Pre-compute embeddings for all images in the training set (you can start with the same dataset you’d used in the Image-Image clustering step).
  • Adapter Training: Train the adapter network using the extracted embeddings.
  • Inference: Compute embedding of the newly submitted image.
  • Prediction: Pass the computed embedding to the trained adapter network to obtain the class prediction.
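
As a minimal sketch of these steps using scikit-learn's MLPClassifier (train_embeddings and train_labels are assumed to be your pre-computed MI2 embeddings and class labels, and get_image_embedding a hypothetical wrapper around your MI2 endpoint):

    from sklearn.neural_network import MLPClassifier

    # Train a small MLP adapter on the pre-computed MI2 embeddings;
    # for modest dataset sizes this trains in seconds on a CPU
    adapter = MLPClassifier(hidden_layer_sizes=(128,), max_iter=500)
    adapter.fit(train_embeddings, train_labels)

    # Inference: embed the new image and pass it through the adapter
    new_emb = get_image_embedding(new_image_bytes)
    prediction = adapter.predict(new_emb.reshape(1, -1))[0]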

Check out our Jupyter Notebook for a Python code sample, sample dataset and some nice visualizations: https://aka.ms/healthcare-ai-examples-mi2-adapter

3. Conclusion

To sum it all up, foundation models like MedImageInsight are incredibly powerful tools for health AI, allowing you to represent complex medical data in a way that’s easy to work with and highly informative. Whether you’re using the zero-shot approach to classify images or taking it up a notch with clustering methods, foundation embedding models increase the accessibility of sophisticated medical image analysis. And when the pre-trained model isn’t quite enough, the adapter approach allows you to further explore customization.

What makes this era of foundation models truly exciting is how approachable medical imaging AI is becoming – you don’t need massive computational power or extensive training time to see significant results. Plus, the ability to visualize what’s happening under the hood makes these methods both effective and explainable.

While the methods we looked at here can be pretty powerful, there may still be situations where performance isn’t quite good enough. You may be seeking a powerful image encoder to cover a new imaging modality, new body regions, or other structures being imaged. You may even be looking to image different species altogether! In these cases, you would need to open the lid somewhat and start adjusting the model weights via fine-tuning. This process is more resource-intensive than what we have covered so far, but still not as demanding as training your own model from scratch. We will cover fine-tuning in our follow-up blogs. Stay tuned for more!

Thanks for reading and let’s continue pushing the boundaries of what’s possible in medical AI!

4. Resources

Looking to learn more? Check out these additional resources to help you dive deeper:

  • Zero-shot classification examples: https://aka.ms/healthcare-ai-examples-mi2-zero-shot
  • Adapter training example: https://aka.ms/healthcare-ai-examples-mi2-adapter

The Microsoft healthcare AI models, including MedImageInsight, are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

Updated Jan 24, 2025