Alexander Mehmet Ersoy, Principal Product Manager, Microsoft HLS AI
Abhishek Khowala, Principal AI Engineer, Intel
Ravi Panchumarthy, AI Framework Engineer, Intel
Srinarayan Srikanthan, AI Framework Engineer, Intel
Ekaterina Aidova, AI Frameworks Engineer, Intel
Alberto Santamaria-Pang, Principal Applied Data Scientist, Microsoft HLS AI and Adjunct Faculty at Johns Hopkins Medicine, Microsoft
Peter Lee, Applied Scientist, Microsoft HLS AI and Adjunct Assistant Professor at Vanderbilt University
Ivan Tarapov, Sr. Director, Microsoft HLS AI
The Rise of Multimodal AI in Healthcare
The healthcare sector is seeing a surge in the adoption of multimodal AI models, which are crucial for applications ranging from diagnostics to personalized treatment plans. These models combine data from sources such as medical images, patient records, and genomic data to provide comprehensive insights. The catalog of multimodal healthcare foundation models in Microsoft’s Azure AI Foundry is at the forefront of this change. Recently launched models (such as MedImageInsight, MedImageParse, CXRReportGen [8], and many others) are designed to help healthcare organizations rapidly build and deploy AI solutions tailored to their specific needs, while minimizing the extensive compute and data requirements typically associated with building multimodal models from scratch. Real-world examples from our industry partners of the adoption of multimodal AI models are highlighted in the article “Unlocking next-generation AI capabilities with healthcare AI models”.
Challenges and Opportunities in Hardware Optimization
As models get more complex, as is the case with foundation models, the demands on hardware rise. While GPUs remain the platform of choice for minimizing model execution time, CPUs offer substantial optimization possibilities, especially for inference workloads. We believe that a framework for efficient CPU-based inference holds great potential for the many production scenarios where speed can be traded for cost savings.
With multimodal healthcare AI, handling different data modalities and ensuring efficient inference requires innovative solutions and collaboration between industry leaders. Companies are increasingly looking to hardware-specific optimizations to improve model efficiency and reduce latency while keeping costs in check. Intel, with its robust suite of AI tools and extensions for frameworks like PyTorch, is pioneering this optimization effort. For instance, the Intel® Distribution of OpenVINO™ toolkit has been instrumental in accelerating the development of computer vision and deep learning applications in healthcare [1]. You can learn about our recent collaboration with Intel on AI optimizations to advance medical innovations in the article “Empower Medical Innovations: Intel Accelerates PadChest & fMRI Models on Microsoft Azure* Machine Learning”.
The demand for AI applications in healthcare is increasing rapidly. Multimodal AI models, which can process and analyze complex datasets, are essential for tasks such as early disease detection, treatment planning, and patient monitoring. Optimizing these models for specific hardware is important, but it need not be a barrier to adoption. Models optimized with CUDA for NVIDIA GPUs typically run faster than they would on other hardware. The benefit of CPUs lies in the tradeoff they offer: you can optimize for speed by running your model on a GPU, or you can optimize for cost by giving up some speed. That is the proposition here: the option to run the model more slowly on a readily available CPU, which is advantageous when speed is not the primary concern but access to GPU hardware is. The Intel® oneAPI Deep Neural Network Library (oneDNN) has proven effective in reducing the dependence on GPUs and accelerating time to market for AI solutions [2]. Both Intel® Extension for PyTorch (IPEX) and OpenVINO use oneDNN to accelerate deep learning operations by taking advantage of underlying hardware features. IPEX optimizes existing PyTorch workflows with minimal code changes, while OpenVINO provides cross-platform deep learning optimization for deployment flexibility.
In this blog post, a custom deployment was implemented using CXRReportGen along with both IPEX and OpenVINO optimizations, demonstrating how these techniques can support different deployment scenarios and technical requirements. This optimization is accessible through Azure's compute services and Intel's technology.
Benchmarking and Performance Acceleration
To address these challenges, our new collaboration with Intel focuses on using Intel’s AI tools and hardware capabilities to optimize multimodal AI models for broader healthcare access. By applying Intel® Extension for PyTorch and other optimization techniques, we aim to get the best possible model run time out of CPUs. While running on CPU may give up some performance relative to a GPU, the main benefit is addressing the problem of GPU hardware scarcity. This partnership not only underscores the importance of hardware-specific optimizations but also sets a new standard for AI model deployment in real-world healthcare applications.
Both IPEX and OpenVINO are built on a common foundation: Intel® oneDNN, a high-performance library designed specifically for deep learning applications and optimized for Intel architecture. oneDNN leverages specialized hardware instructions available in Intel processors, such as Intel® Advanced Vector Extensions 512 (Intel® AVX-512) with Vector Neural Network Instructions (VNNI) and Intel® Advanced Matrix Extensions (Intel® AMX) [3] on Intel CPUs, as well as Intel® Xe Matrix Extensions (Intel® XMX) AI engines on Intel discrete GPUs.
Figure 1: oneDNN Library
IPEX [4] extends PyTorch* with the latest performance optimizations for Intel hardware [5]. It leverages oneDNN under the hood to provide optimized implementations of key operations. This allows developers to stay within their existing PyTorch code with minimal changes - making it an excellent choice for teams already comfortable with the PyTorch ecosystem who want to quickly optimize their models for Intel hardware.
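import torch
import intel_extension_for_pytorch as ipex  # Intel Extension for PyTorch (IPEX)
model = Model()  # placeholder: any torch.nn.Module
model.eval()  # switch to inference mode before optimizing
# Optimize with IPEX; bfloat16 can use Intel AMX on CPUs that support it
model = ipex.optimize(model, dtype=torch.bfloat16)
# Continue with inference as normal, under CPU autocast so operators run in bfloat16
# (input_tensor is a placeholder for a preprocessed model input)
with torch.no_grad(), torch.autocast(device_type="cpu", dtype=torch.bfloat16):
    output = model(input_tensor)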
Figure 2: Intel Extension for PyTorch
The Intel® Distribution of OpenVINO™ toolkit is a powerful solution for optimizing and deploying deep learning models across a wide range of Intel hardware [6]. Like IPEX, it leverages oneDNN under the hood, but takes a different approach - offering cross-platform optimization and flexible deployment options. OpenVINO supports two main workflows: a convenience workflow, where you run models directly with minimal setup, and a performance workflow, recommended for production, where models are first converted offline into the OpenVINO Intermediate Representation (IR). This one-time conversion step enables highly optimized inference and allows the final application to remain lightweight and efficient.
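The one-time conversion step in the performance workflow can be sketched as follows. This is a minimal illustration for a plain PyTorch module, not the export procedure used for CXRReportGen, which this post does not describe; a generative multimodal model would likely need a more involved export. The helper name `export_to_ir` and the default file name are assumptions made for this example, while `ov.convert_model` and `ov.save_model` are the OpenVINO Python API calls for conversion and serialization.

```python
from pathlib import Path


def export_to_ir(model, example_input, xml_path="model.xml"):
    """Convert a PyTorch module to OpenVINO IR and write it to disk.

    `export_to_ir` is a hypothetical helper written for this sketch.
    """
    # Imported lazily so the helper can be defined where OpenVINO is not installed.
    import openvino as ov

    model.eval()  # convert the model in inference mode
    # example_input lets OpenVINO trace the PyTorch module and infer input shapes.
    ov_model = ov.convert_model(model, example_input=example_input)
    # Writes xml_path plus a weights file with the same stem and a .bin suffix.
    ov.save_model(ov_model, str(xml_path))
    return Path(xml_path)
```

Calling `export_to_ir(model, example_input)` once, offline, produces the `model.xml` and `model.bin` pair that the deployed application then loads, so the application itself needs neither PyTorch nor the original model code at inference time.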
Here’s a simple example using OpenVINO for inference with a pre-converted IR model. Refer to the OpenVINO Notebooks repository for more samples:
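import openvino as ov
core = ov.Core()
# Load the OpenVINO IR model (model.xml plus its model.bin weights) and compile it for CPU
compiled_model = core.compile_model("model.xml", "CPU")
# Run inference; input_tensor_name and input_tensor are placeholders for the
# model's input name and a preprocessed input array
infer_request = compiled_model.create_infer_request()
results = infer_request.infer({input_tensor_name: input_tensor})
# Results are keyed by output port; take the first output
output = results[compiled_model.output(0)]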
Figure 3: OpenVINO toolkit Overview.
IPEX and OpenVINO are supported on all Intel architectures. For optimal performance, however, Intel recommends instances powered by 4th Gen Intel® Xeon® Scalable processors or newer, which feature Intel® AMX and other hardware acceleration capabilities. Azure’s v6-series (e.g., Standard_E48s_v6) is one example [7].
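Before benchmarking, it can be useful to confirm that an instance actually exposes the instruction sets mentioned above. The following is a minimal sketch, assuming a Linux guest where the kernel reports CPU capabilities in `/proc/cpuinfo`; the flag names (`avx512f`, `avx512_vnni`, `amx_tile`, `amx_bf16`, `amx_int8`) are the ones the Linux kernel uses, and the function names are made up for this example.

```python
from pathlib import Path

# Linux /proc/cpuinfo flag names for the instruction sets discussed above.
FEATURE_FLAGS = {
    "AVX-512": "avx512f",
    "AVX-512 VNNI": "avx512_vnni",
    "AMX tile": "amx_tile",
    "AMX BF16": "amx_bf16",
    "AMX INT8": "amx_int8",
}


def parse_cpu_flags(cpuinfo_text):
    """Return the set of flags from the first 'flags' line of /proc/cpuinfo text."""
    for line in cpuinfo_text.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "flags":
            return set(value.split())
    return set()


def detect_features(cpuinfo_text):
    """Map each feature name to True or False depending on whether its flag is present."""
    flags = parse_cpu_flags(cpuinfo_text)
    return {name: flag in flags for name, flag in FEATURE_FLAGS.items()}


if __name__ == "__main__":
    cpuinfo = Path("/proc/cpuinfo")
    if cpuinfo.exists():  # Linux only; other platforms need a different probe
        for name, present in detect_features(cpuinfo.read_text()).items():
            print(f"{name}: {'yes' if present else 'no'}")
```

If the AMX flags are missing, oneDNN can still fall back to AVX-512 or older instruction sets, but the bfloat16 speedups would be expected to be smaller.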
Results
We conducted a detailed performance benchmark using CXRReportGen, a state-of-the-art foundation model designed to generate a list of radiological findings from chest X-rays, on Standard_E48s_v6 hardware (48 vCPUs, 384 GiB RAM) with and without IPEX and OpenVINO optimization. Compared to the non-optimized baseline on the same CPU hardware, run time improved by up to 70%: mean run time over 100 runs fell from 22.47 seconds to 8.21 seconds with IPEX (about 63% lower) and to 7.01 seconds with OpenVINO (about 69% lower), as shown in Table 1. This improvement highlights the potential of Intel's performance optimizations to make critical healthcare AI models more cost-efficient and accessible. Such advancements enable healthcare providers to deploy advanced diagnostic tools even in resource-constrained environments, ultimately improving patient care and operational efficiency.
| SKU | Run Type (100 Runs) | Mean Run Time (seconds) | Standard Deviation of Run Time (seconds) |
| --- | --- | --- | --- |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | No Optimization | 22.47 | 0.1061 |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | IPEX | 8.21 | 0.2375 |
| Standard_E48s_v6 (48 vCPUs, 384 GiB RAM) | OpenVINO | 7.01 | 0.0569 |

Table 1: Performance comparison of the CXRReportGen model across 100 runs on CPU.
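For readers who want to run a comparable measurement on their own deployment, a timing harness along the following lines could be used. This is a sketch under stated assumptions rather than the harness behind Table 1: the number of warm-up calls, the use of the sample (not population) standard deviation, and the function names are all choices made for this example, and the stand-in workload is a placeholder for a real model call.

```python
import statistics
import time


def benchmark(run_once, runs=100, warmup=3):
    """Time run_once() over `runs` calls; return (mean, sample std dev) in seconds."""
    for _ in range(warmup):  # warm-up calls are excluded from the statistics
        run_once()
    durations = []
    for _ in range(runs):
        start = time.perf_counter()
        run_once()
        durations.append(time.perf_counter() - start)
    return statistics.mean(durations), statistics.stdev(durations)


def relative_reduction(baseline_seconds, optimized_seconds):
    """Fraction of the baseline run time saved by the optimized configuration."""
    return (baseline_seconds - optimized_seconds) / baseline_seconds


if __name__ == "__main__":
    # Mean run times reported in Table 1.
    baseline, ipex_time, openvino_time = 22.47, 8.21, 7.01
    print(f"IPEX: {relative_reduction(baseline, ipex_time):.1%} less run time")
    print(f"OpenVINO: {relative_reduction(baseline, openvino_time):.1%} less run time")

    # Stand-in workload; replace the lambda with a real model call.
    mean_s, std_s = benchmark(lambda: sum(range(10_000)), runs=20)
    print(f"stand-in workload: mean={mean_s:.6f}s std={std_s:.6f}s")
```

Keeping warm-up calls out of the statistics matters for both IPEX and OpenVINO, since the first few inferences typically include one-time costs such as graph compilation and memory allocation.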
Future Prospects and Innovations
Our benchmarks with Intel optimizations, using both IPEX and OpenVINO, show great potential for decreasing the run time of our foundation models and increasing scalability on CPU. These optimizations position Intel CPUs as a viable deployment option. This not only broadens deployment choices but also offers opportunities to reduce cloud costs with CPU-based instances, and even to deploy these workflows on existing compute headroom at the edge. For custom deployments, the setup described in this blog post is available today on Azure compute instances with optimization software from Intel, so developers can optimize inference workloads while taking advantage of the large memory pools available on CPU instances to handle large batch workloads. We are considering making our model runtime optimizations with Intel available in the Azure AI model catalog. Please stay tuned for further updates.
As we continue to innovate and optimize, the potential for AI to transform healthcare and improve patient outcomes becomes increasingly attainable. We are now better equipped than ever to make it easier for our partners and customers to create connected experiences at every point of care, empower their healthcare workforce, and unlock the value from their data using data standards that are important to the healthcare industry.
References
[1] Intel OpenVINO Optimizes Deep Learning Performance for Healthcare Imaging
[2] Accelerating Healthcare Diagnostics with Intel oneAPI and AI Tools
[3] Intel Advanced Matrix Extensions
[4] Intel Extension for PyTorch
[5] Accelerate with Intel Extension to PyTorch
[6] Intel Accelerates PadChest and fMRI Models on Azure ML
[7] Azure’s first 5th Gen Intel® Xeon® processor instances are now available and we're excited!
[8] CxrReportGen Model Card in Azure AI Foundry
The healthcare AI models in Azure AI Foundry are intended for research and model development exploration. The models are not designed or intended to be deployed in clinical settings as-is nor for use in the diagnosis or treatment of any health or medical condition, and the individual models’ performances for such purposes have not been established. You bear sole responsibility and liability for any use of the healthcare AI models, including verification of outputs and incorporation into any product or service intended for a medical purpose or to inform clinical decision-making, compliance with applicable healthcare laws and regulations, and obtaining any necessary clearances or approvals.