
Healthcare and Life Sciences Blog

Cancer Survival with Radiology-Pathology Analysis and Healthcare AI Models in Azure AI Foundry

Jan 14, 2025

@Alberto Santamaria-Pang, Principal Applied Data Scientist, HLS AI and Adjunct Faculty at Johns Hopkins Medicine

@Peter Lee, Applied Scientist, HLS AI and Adjunct Assistant Professor at Vanderbilt University

@Ivan Tarapov, Group Manager, HLS AI

1. Introduction

The fusion of radiology and pathology is transforming predictive analytics in medical and biomedical imaging. By integrating these two complementary modalities, healthcare professionals can unlock powerful predictive capabilities that enhance survival predictions and risk assessments, ultimately leading to better patient outcomes.

Imagine a future where a seamless diagnostic ecosystem provides a comprehensive view of patient data. Such an approach empowers clinicians with a holistic understanding, enabling more precise and informed decision-making.

Traditionally, relying on a single imaging modality, whether radiology or pathology, has left gaps in the diagnostic process. Each modality alone often lacks the complete information necessary for accurate diagnoses and effective treatment planning. The major limitation of classical radiomics lies in its inability to correlate clinically relevant data across different imaging modalities. This prevents the full integration of diverse datasets needed to capture the complexity of a patient's condition.

In addition, classical radiomics has struggled to incorporate broader patient data, such as genomic profiles or clinical histories, and to bridge the gap between macro-level imaging insights and micro-level biological information. Integrating radiology and pathology helps address these challenges, offering a more robust, comprehensive diagnostic framework. This integrated approach not only fills critical gaps but also sets the stage for advancements in personalized medicine by providing a more detailed and precise understanding of patient health.

In this blog, we will show how you can use the recently launched healthcare AI models in Azure AI Foundry to begin designing and testing systems that can reason about these two modalities together.

It is important to highlight that the models available on the model catalog in Azure AI Foundry portal are not designed to generate diagnostic-quality results. Developers are responsible for further developing, testing, and validating their appropriateness for specific tasks and eventually integrating these models into complete systems. The objective of this blog is to demonstrate how this can be achieved efficiently in terms of data and computational resources.

2. The Problem

We will guide you through a practical approach to predicting cancer grades while fitting a survival model using radiological MRI images and H&E-stained histopathology slides for brain tumor analysis. Inspired by the work of Chen et al. [1] and Cui et al. [2], we'll outline a step-by-step method for deploying and testing models from the model catalog in Azure AI Foundry. You'll learn how to extract and fuse features from these complementary modalities, train adapters that link multi-modal input images with survival outcomes, and validate the resulting predictive insights. An overview of this process is illustrated in Figure 1 below.

Figure 1. Overview of the computational pipeline for training Radiology, Pathology, and Multi-Modal adapters to predict Hazard Risk Scores. These scores are subsequently used to infer cancer grading and survival outcomes.


Sample code and a Jupyter Notebook are available in our Samples Repository: https://aka.ms/healthcare-ai-examples-rad-path . Please refer to it if you want to reproduce this experiment or apply the principles used here to build and test your own multimodal systems with the healthcare AI models in Azure AI Foundry.

3. The Data

We used the TCGA-GBMLGG dataset [10], which serves as a valuable resource for studying malignant brain tumors, specifically gliomas, by combining data from glioblastoma multiforme (GBM) and lower-grade gliomas (LGG). GBM, classified as WHO grade IV, is the most aggressive type of brain tumor, known for its rapid growth, invasive nature, and resistance to therapy, leading to poor prognosis. In contrast, LGGs, encompassing WHO grades II and III, grow more slowly and generally have a better initial prognosis, though they can progress to higher grades over time.

This dataset is particularly well-suited for survival prediction, as it integrates multimodal data: diagnostic brain MRI scans, including T1, T2, and FLAIR sequences, alongside histopathology images of H&E-stained tumor regions. Each subject's data is annotated with categorical tumor grades (Grade 0, 1, or 2) and survival durations (Figure 2). By combining these complementary data modalities, clinical researchers can uncover deeper insights into the relationships between tumor characteristics and patient outcomes, fostering advancements in glioma research and personalized medicine.

Figure 2. MRI and Histopathology images for grades 0, 1, and 2.
Exploring the Relationship Between Tumor Staging and Survival

In the remainder of this blog, we explore how different tumor stages correlate with survival outcomes in glioma patients, using cost-effective approaches to building and testing predictive models with the healthcare AI models in Azure AI Foundry. Our analysis leverages data from 170 subjects, categorized into three tumor grades:

  • Grade 0 (Early Stage): 40 subjects
  • Grade 1 (Intermediate Stage): 53 subjects
  • Grade 2 (Advanced Stage): 77 subjects (slight predominance of higher-grade tumors)

To gain context on the problem and better understand the data, we visualized survival outcomes in Figure 3. The figure contains so-called Kaplan-Meier survival curves [12], which are used in cancer survival analysis to estimate the probability of survival over time while accounting for censored data (e.g., patients alive at the end of the study or lost to follow-up). This analysis reveals several key trends:

  • Survival Variability in Early-Stage Tumors
    Patients with Grade 0 (Early Stage) tumors show the widest range of survival outcomes. This variability reflects the diverse responses to early interventions and the significant role of individual biological factors in determining prognosis.

  • Narrowing Survival Range in Higher Grades
    As tumor grade increases, the survival range becomes progressively narrower. For patients with Grade 2 (Advanced Stage) tumors, survival periods are shorter and more predictable, highlighting the severe prognosis associated with higher-grade tumors.

  • Impact of Tumor Severity on Survival Outlook
    A clear negative correlation exists between tumor grade and survival duration. Higher tumor grades, such as Grade 2, are consistently linked to shorter survival periods, emphasizing the critical importance of early detection and timely treatment. These findings underscore the value of predictive modeling in identifying high-risk patients and supporting early intervention strategies to improve outcomes.

Please note that the following figure is for illustrative purposes only; a larger dataset would be needed to draw statistically meaningful conclusions.

Figure 3. Kaplan-Meier survival curves of survival years for each grade, corroborating and quantifying the decreasing survival trend with advancing tumor severity.
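If you would like to reproduce this kind of exploratory plot for your own data, a minimal sketch using the lifelines library [13] is shown below; the DataFrame is a placeholder standing in for the per-subject table of survival times, event indicators, and grades, not the actual TCGA-GBMLGG schema.

```python
# Minimal sketch: Kaplan-Meier curves per tumor grade with the lifelines library.
# The DataFrame below is a placeholder for the per-subject table.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(42)
subjects = pd.DataFrame({
    "survival_months": rng.exponential(36, 170),  # observed survival times
    "event": rng.binomial(1, 0.7, 170),           # 1 = death observed, 0 = censored
    "grade": rng.integers(0, 3, 170),             # 0 = early, 1 = intermediate, 2 = advanced
})

kmf = KaplanMeierFitter()
fig, ax = plt.subplots()
for grade, group in subjects.groupby("grade"):
    kmf.fit(group["survival_months"], event_observed=group["event"], label=f"Grade {grade}")
    kmf.plot_survival_function(ax=ax)
ax.set_xlabel("Months")
ax.set_ylabel("Estimated survival probability")
plt.show()
```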
4. Introducing the model catalog in Azure AI Foundry

The model catalog in Azure AI Foundry provides a range of pre-trained foundation models accessible via the cloud, enabling organizations to efficiently test, develop, deploy, and scale custom AI applications. The Health and Life Sciences section of the catalog includes specialized healthcare models such as MedImageInsight [4] for radiology and Prov-GigaPath [5] for pathology, specifically designed to process and analyze complex medical imaging data (Figure 4). To accomplish our task, we use the model catalog to deploy individual models as real-time endpoints, extract features from single modalities, and implement feature fusion to test survival outcome predictions. This is particularly relevant for applications that use radiology and pathology data, often in drug discovery contexts.

Figure 4. Health and Life Sciences models from Azure AI Foundry's catalog (green box), highlighting Prov-GigaPath and MedImageInsight (red).
MedImageInsight and Prov-GigaPath: A Dynamic Duo

When MedImageInsight and Prov-GigaPath are used together, they create a unified embedding across different medical imaging modalities, enabling powerful multi-modal predictive capabilities. This synergy is especially effective for tasks such as survival prediction and risk stratification, offering a more comprehensive and accurate view of insights from these modalities. Here’s how these models complement each other:

  • MedImageInsight (MI2): Extracts macro-level anatomical features from radiology images such as MRI and CT scans, enabling insights into tissue structure, density, and morphology.
  • Prov-GigaPath (PGP): Focuses on micro-level cellular features from pathology images like histology slides, capturing cellular morphology and tissue architecture.
  • Combined Approach (MI2+PGP): Integrates anatomical insights from radiology with cellular context from pathology, enabling a holistic analysis of tissues and tumors for enhanced predictive accuracy.

Deploying these models is straightforward through the model catalog in Azure AI Foundry. Each model is assigned a URL and an API key upon deployment, making integration seamless. In our Jupyter Notebook, "mi2-deploy.ipynb", we demonstrate how to programmatically deploy MedImageInsight. This process can easily be adapted for deploying Prov-GigaPath, offering a flexible and scalable solution for multi-modal imaging analysis.
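As a rough sketch of what calling such an endpoint looks like once you have its URL and API key, consider the snippet below. The request payload shown here is a placeholder; the exact input schema expected by the MedImageInsight endpoint is covered in the sample notebooks.

```python
# Minimal sketch of scoring a deployed endpoint with its URL and API key.
# The payload structure is a placeholder, not the exact MedImageInsight input schema.
import base64
import requests

ENDPOINT_URL = "https://<your-endpoint>.inference.ml.azure.com/score"  # from the deployment
API_KEY = "<your-api-key>"

with open("mri_slice.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("utf-8")

headers = {"Authorization": f"Bearer {API_KEY}", "Content-Type": "application/json"}
payload = {"input_data": {"columns": ["image"], "data": [[image_b64]]}}  # placeholder schema

response = requests.post(ENDPOINT_URL, headers=headers, json=payload, timeout=60)
response.raise_for_status()
embedding = response.json()  # structure is model-specific; inspect before using
```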

5. Training a Multi-Modal Adapter Model

As previously discussed, combining radiology and pathology embeddings offers both macro- and micro-level insights. We apply this integrated approach to train the multimodal adapter to leverage these combined features.

Figure 5. Multimodal framework integrating radiology (MRI) and histopathology (H&E-stained) images to predict Hazard Risk Score.

The workflow, illustrated in Figure 5, begins with cutting-edge feature extraction: MI2 processes images from different radiology sequences such as T1, T2, and FLAIR, while PGP extracts features from pathology images. These extracted features represented by embedding vectors are passed through modality-specific adapters—Radiology Adapter and Pathology Adapter—to refine and standardize the representations, bringing them into the same embedding space. The outputs are then fused through a Multi-Modal Adapter, merging macro-level anatomical insights with micro-level cellular details into a unified representation.

This integration results in a predictive model capable of capturing both the big picture and the fine-grained details, culminating in a survival hazard score – a continuous variable that researchers can use to estimate the survival of a subject.

Task Adapters – the trusty aides of foundation models

In practical applications of embedding models, adapters emerge as essential components for adapting foundation model outputs to unseen downstream tasks. These compact neural networks are designed to take embeddings as input and output either classes or new embeddings. Adapters enhance classification performance and uncover correlations across different modalities, making them indispensable for multi-modal analyses.

Figure 5 illustrates the use of three adapters in this workflow:

  1. Radiology Adapter: Receives a 4096-sized vector, representing a concatenation of embeddings from four slices across different radiological modalities.
  2. Pathology Adapter: Processes a feature tensor of size 1536×14×14, generated by the PGP model from histopathology slide regions of interest (ROIs).
  3. Multi-Modal Adapter: Combines the outputs of the radiology and pathology adapters—refined into a unified 1024-sized vector—and converts this into a final hazard score.
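To make the data flow concrete, here is a minimal PyTorch sketch of what these three adapters could look like. The hidden-layer sizes, the 512-dimensional per-modality outputs (whose concatenation forms the 1024-sized fused vector), and the average pooling of the pathology tensor are assumptions for illustration; the actual architectures are defined in the sample notebook.

```python
# Minimal PyTorch sketch of the three adapters; sizes marked below are illustrative assumptions.
import torch
import torch.nn as nn

class RadiologyAdapter(nn.Module):
    def __init__(self, in_dim=4096, out_dim=512):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 1024), nn.ReLU(), nn.Linear(1024, out_dim))

    def forward(self, x):                       # x: (batch, 4096) concatenated MI2 embeddings
        return self.net(x)

class PathologyAdapter(nn.Module):
    def __init__(self, in_channels=1536, out_dim=512):
        super().__init__()
        self.pool = nn.AdaptiveAvgPool2d(1)     # (batch, 1536, 14, 14) -> (batch, 1536, 1, 1)
        self.net = nn.Sequential(nn.Linear(in_channels, 1024), nn.ReLU(), nn.Linear(1024, out_dim))

    def forward(self, x):                       # x: (batch, 1536, 14, 14) PGP feature tensor
        return self.net(self.pool(x).flatten(1))

class MultiModalAdapter(nn.Module):
    def __init__(self, fused_dim=1024):
        super().__init__()
        self.head = nn.Sequential(nn.Linear(fused_dim, 256), nn.ReLU(), nn.Linear(256, 1))

    def forward(self, rad_emb, path_emb):
        fused = torch.cat([rad_emb, path_emb], dim=1)    # (batch, 1024) unified representation
        return self.head(fused).squeeze(1)               # (batch,) hazard score
```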

These three adapters are trained together to minimize a Cox Proportional Hazards (Cox) survival loss function, with the primary goal of ranking individuals by their relative risks (hazard ratios) rather than predicting exact survival times or cancer grades. Unlike fully supervised tasks, this training process estimates risk by ordering individuals based on survival times. The loss function reinforces accurate ranking by focusing the model on meaningful, consistent risk-based ordering.
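A minimal PyTorch sketch of such a negative Cox partial log-likelihood is shown below; it ignores tied event times (no Breslow/Efron correction) and is meant only to illustrate the ranking-based nature of the objective.

```python
# Minimal sketch of a negative Cox partial log-likelihood (no handling of tied event times).
import torch

def cox_ph_loss(scores, times, events):
    """scores: (N,) predicted log-risk; times: (N,) survival times; events: (N,) 1 = death observed."""
    order = torch.argsort(times, descending=True)        # sort so each risk set is a prefix
    scores, events = scores[order], events[order].float()
    log_risk_set = torch.logcumsumexp(scores, dim=0)     # log sum of exp(score) over subjects still at risk
    log_likelihood = (scores - log_risk_set) * events    # only observed events contribute
    return -log_likelihood.sum() / events.sum().clamp(min=1.0)
```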

This approach is particularly advantageous over directly predicting cancer grades because it accounts for tumor heterogeneity and molecular phenotype variability, which often result in overlapping features between grades. By emphasizing relative risk ranking instead of fixed grade predictions, the model captures nuanced differences in survival outcomes and accommodates the continuous spectrum of tumor biology, offering a flexible and robust framework for survival analysis.

Model performance is evaluated using the Concordance Index (C-index), which measures how well the predicted risk scores align with actual survival outcomes in a test dataset. Although not part of the optimization process itself, the C-index serves as a key metric during training, providing insight into the model’s ranking accuracy.
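As a sketch, the C-index can be computed on held-out predictions with lifelines; the arrays below are placeholders, and the hazard scores are negated because concordance_index treats higher scores as predicting longer survival.

```python
# Minimal sketch: C-index on held-out predictions. The arrays are placeholders.
import numpy as np
from lifelines.utils import concordance_index

test_times = np.array([12.0, 30.5, 8.0, 55.0, 21.0])   # observed survival (months)
test_events = np.array([1, 0, 1, 1, 0])                # 1 = death observed, 0 = censored
test_hazards = np.array([0.9, -0.3, 1.4, -1.1, 0.2])   # adapter-predicted hazard scores

# concordance_index expects "higher score = longer survival", so negate the hazard scores.
c_index = concordance_index(test_times, -test_hazards, event_observed=test_events)
print(f"C-index: {c_index:.4f}")
```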

By optimizing the survival loss and leveraging the C-index for evaluation, the three adapters learn to associate imaging features from different modalities with risk-based ordering. This enables the model to effectively represent the complex relationships between multi-modal imaging data and hazard scores, facilitating nuanced and reliable predictions of relative risk.

Model Output: Converting Hazard Scores to Cancer Grading

Hazard scores represent relative risk compared to other patients. Since higher cancer grades generally correlate with worse outcomes, hazard scores can effectively serve as a surrogate for disease severity, allowing continuous risk values to be segmented into discrete categories for more intuitive clinical interpretation and research insights.

In real-world datasets, hazard score distributions often overlap between categories (e.g., cancer grades). For example, patients with intermediate-grade cancer may have hazard scores that intersect with both low- and high-grade groups, complicating classification based on strict thresholds. For the purposes of this discussion, we assume minimal overlap between hazard score distributions for different cancer grades [1]. Under this assumption, hazard scores can be divided into percentiles to define risk categories:

  • Low risk (Grade 0): Scores up to the 33rd percentile.
  • Intermediate risk (Grade 1): Scores in the 34th to 66th percentile.
  • High risk (Grade 2): Scores in the 67th to 100th percentile.
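Under that assumption, a minimal sketch of this percentile-based mapping could look as follows (the function name is ours, for illustration):

```python
# Minimal sketch of the percentile-based mapping: continuous hazard scores are split
# at the 33rd and 66th percentiles into three risk groups.
import numpy as np

def hazard_to_grade(hazards):
    """Map predicted hazard scores to discrete risk categories (0, 1, 2)."""
    hazards = np.asarray(hazards)
    low, high = np.percentile(hazards, [33, 66])
    grades = np.full(len(hazards), 2)    # default: high risk (Grade 2)
    grades[hazards <= high] = 1          # intermediate risk (Grade 1)
    grades[hazards <= low] = 0           # low risk (Grade 0)
    return grades
```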

In practice, however, the assumption of minimal overlap may not hold, as hazard scores frequently exhibit significant overlap across categories due to tumor heterogeneity and biological variability. In such cases, probabilistic approaches or alternative classification strategies, such as Bayesian models or soft classification methods, provide a more flexible framework. These methods accommodate uncertainty and better reflect the complex, continuous nature of cancer biology, enabling more accurate and clinically relevant predictions.

6. Evaluating the Cox Model: Survival Prediction and Cancer Grading

By now you have learned the principles behind multi-modal analysis using foundation models, which should equip you with the tools required to build, test, and validate such adapters of your own. The job of building a survival prediction model is not complete yet, though!

This part of the blog explains how to get from the hazard score estimator to a survival prediction model; however, it relies on material from the field of survival analysis and is grounded in the corresponding mathematical framework. We will provide a high-level overview of the principles, and if you are interested in the underlying math, we will point you to external sources that explain how and why certain concepts are applied. If you are curious, read on!

Once the adapters are trained as described above, we can predict a hazard score for each subject. Recall from the previous section that hazard scores are not absolute survival probabilities but relative rankings, computed in relation to other patients in the dataset. To predict the chance that a given subject survives beyond a certain time period using only imaging data as input, we need to translate these relative risk rankings into survival probabilities over time. Although the hazard score indicates whether a patient is at higher or lower risk compared to others, it does not directly provide a survival probability. To bridge that gap, we rely on the framework defined by the Cox Proportional Hazards (Cox PH) model [11].

The Cox PH model describes how an individual's risk (hazard) at any time point is related to a baseline hazard function and the individual's hazard score. Specifically, it models the hazard function as:

h(t | x) = h0(t) · exp(r(x))

Here, h0(t) is the baseline hazard function, which represents the underlying risk of an event (in our case, mortality from the tumor) for a reference subject, and r(x) is the hazard score derived from the patient's features (the representations of MRI scans and digital pathology slides in our case). The exponentiated hazard score serves as a multiplier that shifts the baseline hazard up or down.

So, to convert the predicted hazard score into a survival probability at time t, we would first estimate the baseline survival curve from the training set. The lifelines library [13] that we use in the accompanying Jupyter notebook provides a method for such an estimation.
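One possible sketch of this step (not necessarily what the sample notebook does) is to fit a lifelines Cox model with the adapter's hazard score as its single covariate, which yields a baseline survival estimate and per-subject survival curves; the data below are placeholders standing in for the training-set outputs and labels.

```python
# Minimal sketch: estimate baseline survival and per-subject survival curves with lifelines.
import numpy as np
import pandas as pd
from lifelines import CoxPHFitter

rng = np.random.default_rng(0)
train_df = pd.DataFrame({
    "risk_score": rng.normal(size=135),      # adapter-predicted hazard scores (placeholder)
    "duration": rng.exponential(24, 135),    # survival times in months (placeholder)
    "event": rng.binomial(1, 0.7, 135),      # 1 = death observed, 0 = censored (placeholder)
})

cph = CoxPHFitter()
cph.fit(train_df, duration_col="duration", event_col="event")

baseline_survival = cph.baseline_survival_                        # estimated S0(t)
test_scores = pd.DataFrame({"risk_score": rng.normal(size=35)})
survival_curves = cph.predict_survival_function(test_scores)      # S(t | x) per test subject
```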

However, for the sake of simplicity and interpretability, the provided code sample uses a different method to assess how well our model supports the estimation of survival probabilities.

Instead of explicitly estimating the baseline hazard and converting the hazard scores into time-dependent survival probabilities, we used the scores to rank patients and classify them into risk groups. Since brain cancers are typically categorized into Grade 0, Grade 1, and Grade 2, we segmented our predicted scores into three strata using the 33rd and 66th percentiles as described in the previous section, thus mapping the continuous hazard scores into clinically interpretable categories that can be compared directly to traditional grading systems.

Using this stratification, we applied the Kaplan-Meier fitter to each group in the test set and generated nonparametric survival curves, offering a visual and intuitive comparison of survival outcomes across risk categories. In our testing, we compare Kaplan-Meier curves generated for the test set based on ground-truth grading data to those generated for the same test set based on our model's hazard score predictions.
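A minimal sketch of this comparison with the lifelines KaplanMeierFitter is shown below; the placeholder table stands in for the test-set durations, event indicators, ground-truth grades, and the hazard-score-derived grades produced by the percentile mapping described earlier.

```python
# Minimal sketch: Kaplan-Meier curves per group for the actual and predicted stratifications.
# The DataFrame is a placeholder for the test-set labels and hazard-score-derived grades.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from lifelines import KaplanMeierFitter

rng = np.random.default_rng(0)
test_df = pd.DataFrame({
    "duration": rng.exponential(24, 35),    # survival times in months
    "event": rng.binomial(1, 0.7, 35),      # 1 = death observed, 0 = censored
    "grade": rng.integers(0, 3, 35),        # ground-truth grade
    "pred_grade": rng.integers(0, 3, 35),   # grade derived from hazard-score percentiles
})

fig, ax = plt.subplots()
kmf = KaplanMeierFitter()
for grade in (0, 1, 2):
    for col, color in (("grade", "tab:blue"), ("pred_grade", "tab:orange")):
        group = test_df[test_df[col] == grade]
        kmf.fit(group["duration"], group["event"], label=f"{col} = {grade}")
        kmf.plot_survival_function(ax=ax, color=color)
ax.set_xlabel("Months")
ax.set_ylabel("Estimated survival probability")
plt.show()
```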

With a small dataset of 135 subjects (675 images) for training and 35 subjects (175 images) for testing, we achieved a Concordance Index (or C-index – a measure on the interval [0, 1] of how well our hazard scores rank patients relative to the ground-truth survival data) of 0.8097, indicating that the hazard scores output by our model provide meaningful prognostic information. Figure 6 presents Kaplan-Meier curves for both the actual and predicted stratifications, in blue and orange respectively. Although the predicted curves capture the overall trend (Grade 2 < Grade 1 < Grade 0), the limited data size introduces more pronounced step changes in the curves. It is important to recognize that our example is illustrative, leveraging a relatively small dataset. Real-world applications would likely employ larger, more diverse populations and integrate additional modalities, such as genomic data, to enhance the model's predictive accuracy and generalizability. The efficiency of foundation models as feature extractors from complex data, demonstrated here, makes it much easier for clinical researchers to experiment with these approaches, making them a versatile tool for clinical and research applications.

Figure 6. Survival model from the test set across grades, comparing the actual grade with the predicted grade.

This system is tailored for practical applications and optimized for efficiency, making it attractive for testing and developing real-world use cases. This efficiency makes it ideal for applications requiring rapid turnarounds or operating within limited computational environments. To explore its implementation and adaptability across diverse datasets, the Jupyter Notebook tutorial [3] offers a clear, hands-on guide, enabling users to effectively apply this approach in real-world scenarios.

7. Conclusion

Our exploration of AI models in radiomics highlights how the dynamic duo of MedImageInsight [4] and Prov-GigaPath [5] utilizes modern cloud technologies to advance healthcare innovation. By combining radiology and pathology embeddings from our sandbox dataset to build a hazard prediction model, we demonstrated the potential of this approach for survival analysis. In real-world applications, larger datasets and more diverse models will further enhance these capabilities, enabling more robust and accurate predictions.

The key insight is clear: true innovation in healthcare AI emerges from the synergy of advanced tools, technical expertise, and clinical understanding. By leveraging the healthcare AI models in Azure AI Foundry, developers gain streamlined access to these models and tools, making it easier to integrate multimodal data and deliver actionable results. This convergence is not merely about improving analytical accuracy – it is about accelerating the pace at which meaningful insights reach the bedside, empowering clinicians to make more informed decisions and ultimately enhance patient outcomes.

References
  1. R. J. Chen, M. Y. Lu, J. Wang, D. F. K. Williamson, S. J. Rodig, N. I. Lindeman, and F. Mahmood, "Pathomic Fusion: An Integrated Framework for Fusing Histopathology and Genomic Features for Cancer Diagnosis and Prognosis," IEEE Transactions on Medical Imaging, vol. 41, no. 4, pp. 1–14, Apr. 2022.
  2. C. Cui, H. Liu, Q. Liu, R. Deng, Z. Asad, Y. Wang, et al., "Survival prediction of brain cancer with incomplete radiology, pathology, genomic, and demographic data," in Proc. Medical Image Computing and Computer-Assisted Intervention (MICCAI), Cham: Springer Nature Switzerland, 2022, pp. 626–635.
  3. Sample code: https://aka.ms/healthcare-ai-examples-rad-path
  4. MedImageInsight: https://aka.ms/mi2modelcard
  5. Prov-GigaPath: https://aka.ms/provgigapathmodelcard
  6. Azure AI Model Catalog – Foundation Models
  7. MedImageInsight Paper: https://arxiv.org/abs/2410.06542
  8. GigaPath Paper: https://www.nature.com/articles/s41586-024-07441-w
  9. Prov-GigaPath GitHub Repository: https://aka.ms/healthcare-ai-examples-rad-path
  10. TCGA-GBMLGG Dataset: https://github.com/mahmoodlab/PathomicFusion/tree/master/data/TCGA_GBMLGG
  11. Cox Proportional Hazards model: https://pmc.ncbi.nlm.nih.gov/articles/PMC7876211/
  12. Kaplan-Meier estimator: https://en.wikipedia.org/wiki/Kaplan%E2%80%93Meier_estimator
  13. Lifelines survival analysis model: https://lifelines.readthedocs.io/en/latest/index.html

The healthcare AI models in Azure AI Foundry are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.

Updated Jan 22, 2025
Version 6.0