Manoj Kumar, Director, HLS - Data & AI, HLS Frontiers AI
Alberto Santamaria-Pang, Principal Applied Data Scientist, HLS Frontiers AI and Adjunct Faculty, Johns Hopkins Medicine
Jared Erwin, Senior Software Engineer, HLS Nursing AI and Data Platform
Overview
In clinical research, assembling a cohort of medical images (e.g. all MRI scans meeting certain criteria) is notoriously slow and siloed. Traditionally, researchers must involve multiple teams (data warehouse specialists, PACS administrators, and radiologists) to pull the desired dataset. The result: it can take weeks or even months to gather an imaging cohort, severely throttling the pace of research. Long turnaround times can ultimately deter the use of imaging data, leaving its value uncaptured.
For Microsoft’s Global Hackathon 2025 in September, a multi-disciplinary team from Microsoft set out to transform this process by developing an AI-powered prototype that would help democratize access to imaging data for research. The goal was to allow researchers to acquire imaging data in minutes by defining imaging cohorts in natural language — no complex coding, no manual retrieval, no coordinated, months-long effort.
Challenges in Traditional Imaging Cohort Building
Healthcare organizations generate petabytes of imaging data (MRIs, CTs, X-rays, etc.), but tapping into this trove for research is painstaking. Key challenges include:
- Fragmented expertise and systems: Each step — formulating queries, pulling images from PACS, filtering by radiology reports — is handled by different specialists with different tools. Data is often siloed across systems. Researchers who lack direct access or specific IT skills must file data requests and wait.
- Slow turnaround: A complicated, formal request process means it can take weeks or months to obtain even a simple cohort (e.g. “brain MRIs with contrast for male patients over 60”). By the time the data is ready, the research question may have evolved or the opportunity (e.g. a grant deadline) may have passed.
- Under-utilization of imaging data: When researchers are deterred by the lengthy process and high effort required to acquire imaging data, it’s used sparingly and its potential value goes uncaptured.
- Scaling issues: The imaging data acquisition process doesn’t scale linearly with demand. Adding more researchers requires more support staff and infrastructure, incurring additional costs.
These pain points are especially acute at academic centers, where research is a priority, but resources are limited. A new, AI-driven approach could be game-changing.
Hackathon 2025 – A collaborative effort
During the Microsoft Global Hackathon 2025, the annual company-wide innovation event, engineers and data scientists from Microsoft’s Health & Life Sciences group teamed up with research IT specialists and radiologists from one of our priority customers, The University of Texas Medical Branch, to build a prototype.
The idea was straightforward: a researcher would write a query in plain English describing the desired imaging cohort, such as “MRI scans of the brain (axial orientation) with T1 contrast, male patients over 40,” and the system would instantly return a set of matching images (with thumbnails and metadata).
The team envisioned an architecture that combines natural language processing, medical imaging AI, and vector similarity search. Critically, the substantial work of image analysis and indexing would be performed in advance. When a user submits a query, the query is embedded and a vector search can return results in milliseconds. The team worked to prove this pipeline on a sample of customer imaging data and demonstrate an end-to-end scenario. Key technologies include:
- MedImageInsight (MI2): Microsoft’s foundation model for medical image embeddings analyzes each image in the dataset and produces a numerical embedding (a vector representation) capturing its content/features.
- BiomedCLIP: A biomedical adaptation of CLIP (which creates a joint embedding space for images and text) converts both the natural language query and any textual descriptions of images (e.g. “brain MRI axial T1 with contrast”) into vectors in the same space as the image embeddings.
- FAISS (Facebook AI Similarity Search): An open-source library for vector similarity search indexes all the embeddings and rapidly retrieves the nearest neighbors (the most similar image vectors) to a given query vector, even in a collection of millions or billions of images.
- Azure Cloud Services: The entire prototype pipeline was deployed in Azure, with model inference on images orchestrated through Azure ML and GPU acceleration applied for MI2 to handle large-scale data efficiently. For demonstration purposes, the web front-end and API operated as an Azure Web App, while images and JSON outputs were stored in Azure Blob Storage or a Fabric Lakehouse. This cloud-based architecture was designed to support scalability for hospital-scale datasets and enable secure web-based access for researchers.
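To make the retrieval mechanics concrete, the sketch below shows the core query path under simplifying assumptions: BiomedCLIP is loaded through the open_clip package as documented on its model card, a flat FAISS index stands in for whatever index structure the prototype actually used, and random vectors stand in for image-series embeddings that would have been computed offline.

```python
# A minimal sketch of the query path (assumptions: BiomedCLIP loaded via open_clip as on
# its model card; a flat FAISS index; random vectors in place of real series embeddings).
import faiss
import numpy as np
import torch
from open_clip import create_model_from_pretrained, get_tokenizer

HF_ID = "hf-hub:microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224"
model, _preprocess = create_model_from_pretrained(HF_ID)
tokenizer = get_tokenizer(HF_ID)
model.eval()

def embed_query(text: str) -> np.ndarray:
    """Encode a plain-English cohort description into the shared image-text space."""
    tokens = tokenizer([text], context_length=256)
    with torch.no_grad():
        vec = model.encode_text(tokens)
    vec = vec / vec.norm(dim=-1, keepdim=True)  # unit-normalize for cosine similarity
    return vec.cpu().numpy().astype("float32")

query_vec = embed_query("brain MRI, axial plane, T1 post-contrast, male patients over 40")

# Placeholder corpus: in the prototype these rows would be embeddings of real image series.
dim = query_vec.shape[1]
series_embeddings = np.random.rand(10_000, dim).astype("float32")
faiss.normalize_L2(series_embeddings)

index = faiss.IndexFlatIP(dim)                 # inner product equals cosine on unit vectors
index.add(series_embeddings)

scores, ids = index.search(query_vec, k=100)   # top-100 nearest series, returned in milliseconds
print(list(zip(ids[0][:5], scores[0][:5])))
```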
Demo
The following demo describes the functionality of the solution in greater detail.
Architecture and Data Flow
There are two phases: an offline preparation phase and an online query phase.
Figure: Healthcare image cohort builder conceptual architecture
Offline Processing (Data Ingestion & Indexing)
These steps process images into an AI-friendly index that can handle arbitrary queries efficiently; a minimal code sketch follows the list.
- Image ingestion: DICOM image series (e.g., MRI scans) are ingested from the hospital’s archive into Azure Blob Storage or a Fabric Lakehouse. This could be many thousands of images, potentially petabytes of data.
- Image embedding (MedImageInsight, MI2): Each image series is passed through the MI2 model to generate a numeric image embedding. MI2, trained on multimodal medical images, condenses an image’s visual content (anatomy, modality, etc.) into a vector (e.g. 1024-dimensional). In parallel, a custom exam parameter detection pipeline uses MI2 classifiers to infer properties like orientation (axial/coronal), body region (brain/abdomen), contrast usage, etc., for each series.
- Text embedding (BiomedCLIP): For each image series, the system generates a short textual description using available metadata and AI-derived labels, for example: “brain MRI, axial plane, T1 post-contrast.” This description, along with any report text, if available, is then processed by the BiomedCLIP text encoder to produce a corresponding text embedding. Now the image has two vector representations: one from the image itself (MI2) and one from textual/contextual data.
- Joint indexing (FAISS): All image and text embeddings are stored in a FAISS index — essentially a specialized index for high-dimensional vectors. For joint image-text search to work, images and their descriptive text must map to nearby points in the vector space; techniques such as model fine-tuning are used to achieve this alignment. A comprehensive metadata JSON containing patient ID, study date, modality, plus the AI-inferred tags is prepared alongside the FAISS index to allow filtering and display of cohort details later.
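As an illustration of this offline phase, here is a hedged sketch of how the embeddings and the metadata sidecar could be assembled into a persisted FAISS index. The `get_mi2_embedding` helper is a hypothetical stand-in for a call to a deployed MedImageInsight endpoint; it returns dummy values here so the snippet runs end to end.

```python
# Offline indexing sketch. `get_mi2_embedding` is a hypothetical placeholder for the
# deployed MedImageInsight (MI2) endpoint call; it returns dummy values so the sketch runs.
import json
import faiss
import numpy as np

def get_mi2_embedding(series_id: str):
    """Placeholder for MI2 inference: returns (1024-d embedding, AI-inferred exam tags)."""
    emb = np.random.rand(1024).astype("float32")
    tags = {"modality": "MR", "body_region": "brain", "plane": "axial", "contrast": True}
    return emb, tags

# Pointers into Blob Storage / the Fabric Lakehouse (illustrative IDs, not real paths).
series_ids = [f"series_{i:04d}" for i in range(100)]

vectors, metadata = [], []
for sid in series_ids:
    emb, tags = get_mi2_embedding(sid)
    vectors.append(emb)
    metadata.append({"series_id": sid, **tags})    # patient ID, study date, etc. would join here

matrix = np.stack(vectors)
faiss.normalize_L2(matrix)                          # normalize so inner product equals cosine
index = faiss.IndexFlatIP(matrix.shape[1])
index.add(matrix)

faiss.write_index(index, "series.index")            # persisted vector index
with open("series_metadata.json", "w") as f:        # sidecar for filtering and display
    json.dump(metadata, f)
```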
Online Query (Natural Language Query → Cohort)
These steps are triggered in real time when a researcher enters a query; a sketch of this path follows the list.
- User query processing: The researcher interacts with a web-based interface and enters a query (e.g. “brain MRI with contrast for male patients over 40”). The backend passes this text through the BiomedCLIP text encoder, generating a query embedding in the same vector space as the images.
- Similarity search: The query embedding is then fed to the FAISS index to find the nearest neighbor embeddings – effectively retrieving the most relevant image series in the dataset. For example, if there are 1 million MRI scans indexed, FAISS can return the top N (say 100) that best match the query criteria in a fraction of a second.
- Cohort assembly: The set of matching image series is distilled into a cohort result. The system fetches the metadata for these series from the JSON prepared earlier. On the front-end, the researcher sees a scatterplot of cohort images against all other image series. The query can be refined if needed.
- Export and analysis: The researcher can save or export the cohort. In the hackathon prototype, export was implemented to Microsoft Fabric, sending the cohort’s metadata and image pointers to a Fabric Lakehouse for further analysis in notebooks or for sharing with collaborators. The cohort can be immediately used for model training, statistical analysis, or other research steps without manual data wrangling.
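The following hedged sketch ties these steps together, reusing the `embed_query` helper and the `series.index` / `series_metadata.json` artifacts from the earlier snippets. The Fabric export is approximated here as a local Parquet write rather than the prototype's actual Lakehouse integration.

```python
# Query-to-cohort sketch, reusing `embed_query` and the offline artifacts from the
# earlier snippets. The Fabric export is approximated here as a Parquet write.
import json
import faiss
import pandas as pd

index = faiss.read_index("series.index")
with open("series_metadata.json") as f:
    metadata = json.load(f)

# 1) Embed the researcher's natural-language query (same space as the image embeddings).
query_vec = embed_query("brain MRI with contrast for male patients over 40")

# 2) Similarity search: top-100 nearest image series.
scores, ids = index.search(query_vec, k=100)

# 3) Cohort assembly: join retrieved series with their metadata records.
cohort = pd.DataFrame(
    [{**metadata[i], "score": float(s)} for i, s in zip(ids[0], scores[0]) if i != -1]
)
print(cohort.head())

# 4) Export for downstream analysis (e.g. copy into a Fabric Lakehouse or a notebook).
cohort.to_parquet("cohort.parquet", index=False)
```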
Observed benefits and key advantages
Our work revealed exciting possibilities when researchers can create imaging cohorts using natural language. Potential benefits include:
- Speed: The time to build cohorts could be shrunk by orders of magnitude — from months to seconds. Near-real-time turnaround allows researchers to ask follow-up questions on the fly, test multiple hypotheses in a session, and fail or succeed faster. Overall research timelines for imaging studies could be shortened dramatically.
- Accessibility and self-service: If a user can ask for the data they need in plain English, without coding or involving other teams, imaging data could be available to a wider audience, such as clinical trial coordinators.
- Advanced search capability: AI can enable a Bing-like search by image content and context, surfacing relevant data that traditional methods relying on exact field matches or manual tags might miss. For example, AI can detect contrast usage in images, so it could identify all “contrast-enhanced liver MRIs” even if the PACS database lacks structured metadata. Combining criteria like imaging and clinical filters could further enhance the specificity of cohorts.
- Scalability and big data readiness: Complex queries on millions of images are just as easy as on small datasets — and the larger pool increases the chances of finding images that meet the criteria. Rather than breaking down or degrading at large scale, this AI approach simply requires more cloud compute, which is easily provisioned.
- AI-readiness and downstream utility: By turning unstructured images into structured embeddings and cohorts, imaging data becomes AI-ready. Researchers could feed the cohorts they build into machine learning pipelines for model training or analytics with minimal extra preprocessing. For example, a team developing an algorithm to detect tumors could instantly gather a labeled dataset of cases and controls — accelerating AI development and validation in healthcare. Moreover, since the cohorts could be exported along with metadata, they could integrate with existing data science workflow platforms (e.g. Microsoft Fabric, Databricks, etc.).
- Improved reproducibility and tracking: When the exact criteria and image list of each cohort query can be saved and versioned, studies can be more easily reproduced or audited (see the sketch after this list). Over time, an institution could build a library of cohort definitions (e.g., “Standard cohort for diabetic retinopathy study”) that could be reused or refined to maintain consistency for multi-center studies or longitudinal research.
- Insight generation: Thumbnails and scatter plots that group the cohort by attributes could reveal intuitive insights. Exploring an image cohort often yields serendipitous findings – e.g., noticing that all outliers of a certain cluster share a trait. This encourages a more data-driven culture in clinical research, where questions lead to data and data leads to new questions.
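As a rough illustration of what a saved, versioned cohort definition could look like, the snippet below writes a small JSON record that is enough to re-run or audit the query later; the field names are illustrative, not taken from the prototype.

```python
# Illustrative (hypothetical) cohort-definition record for reproducibility and tracking.
import json
from datetime import datetime, timezone

cohort_definition = {
    "name": "brain-mri-contrast-male-over-40",
    "query": "brain MRI with contrast for male patients over 40",
    "filters": {"modality": "MR", "contrast": True, "sex": "M", "min_age": 40},
    "series_ids": ["series_0001", "series_0042", "series_0199"],   # exact image list returned
    "created_utc": datetime.now(timezone.utc).isoformat(),
    "version": 1,
}
with open("cohort_definition_v1.json", "w") as f:
    json.dump(cohort_definition, f, indent=2)
```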
Prototype limitations
While the prototype shows a lot of promise, it has some important limitations:
- Modality support: The hackathon scope was bounded to MRI scans to demonstrate feasibility, so the inclusion criteria and the AI models (like the exam parameter classifier) were tailored to MRI parameters like anatomy, orientation, sequences, contrast, etc.
- Criteria limitations: Currently, natural language understanding is limited to a predefined set of imaging attributes. For MRI, these included body region (anatomy), plane (orientation), contrast vs non-contrast, sequence type (T1, T2, FLAIR, etc., treated as “modifiers”), and basic patient filters (age range, gender). More complex criteria (e.g., “patients with tumor size >2cm” or “images showing a specific pathology”) were not directly supported.
- Single-modality queries: Multi-modal criteria (e.g., “find patients who have both an MRI and a CT scan meeting [certain criteria]”) were not supported.
- Accuracy and validation: The relevance of results depends heavily on the quality of the embeddings and the alignment between the image and text domains. Testing on limited data returned good results, but it’s possible some matches were missed or wrongly included, especially if the query was phrased in an unusual way. Rigorous validation on a large dataset is still needed to ensure, for example, that all the returned “brain MRIs with contrast” indeed have contrast.
- User interface and experience: Each query was independent, and results were displayed with some interactive plots. There was no support for multi-turn conversations in which previous queries were remembered or refined in context.
- Data privacy and access: This prototype used de-identified data, sidestepping the governance questions (access control, auditing, privacy, and compliance) that come with using patient data.
- Deployment complexity: Real-world deployment in a hospital environment would require integrating with the hospital’s imaging archive and, possibly, their EHR (for up-to-date clinical context).
Conclusion
Our work exemplifies how AI can revolutionize a niche but critical task in healthcare research. By combining natural language understanding with deep image analytics, we could bridge the gap between human intent and complex medical data to deliver value on two fronts:
- Empowerment and speed for researchers and clinicians: Asking new questions becomes as easy as formulating them. Insights from decades’ worth of images sitting in archives can emerge, fostering innovation and discovery.
- Efficiency and cost-effectiveness for healthcare institutions: Reducing manual effort, breaking down silos between departments, and making better use of existing data assets can improve the ROI on prior investments in data collection and storage. Finally, data can be leveraged to the fullest.
Our vision is AI that can answer any query about imaging data immediately, fulfilling the promise of truly liquid data in healthcare. We’re actively working on incorporating other imaging modalities like CT, X-ray, and pathology images, as well as more advanced query capabilities such as tumor characteristics (type, size, shape) and even text-based findings. Each of these enhancements will further close the gap between a researcher’s question and the data needed to answer it.
Our goal is to turn the painstaking process of cohort compilation into an interactive, iterative, and enjoyable part of research. The role of the researcher transforms from waiting on data to directly engaging with it, accelerating the pace of medical innovation.
If you’re interested in partnering with us to work toward this goal, contact the authors through your Microsoft account team.
Acknowledgements
This project was made possible by the collaborative spirit of the Microsoft-UTMB hackathon team. Special thanks to Dr. Peter McCaffrey and James Weatherhead for their collaboration, and to the Microsoft HLS engineers who built and iterated on the solution in record time. It’s a testament to what can be achieved when healthcare experts and technologists co-create solutions for pressing healthcare needs.
References
- MedImageInsight: https://aka.ms/mi2modelcard
- MedImageInsight: An Open-Source Embedding Model for General Domain Medical Imaging: https://arxiv.org/abs/2410.06542
- FAISS (Facebook AI Similarity Search): https://github.com/facebookresearch/faiss
- BiomedCLIP: https://huggingface.co/microsoft/BiomedCLIP-PubMedBERT_256-vit_base_patch16_224
The healthcare AI models in Microsoft Foundry are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.