Why is it so hard? Distinctive Requirements and Complexities of AI Benchmarking in Healthcare and Life Sciences (HLS) scenarios
Imagine this: you are a data scientist, software engineer, or a product person interested in using Generative AI for your healthcare use-case, and you have just logged into Azure AI Foundry. You are greeted by a Model Catalog boasting more than 1,800 models.
Figure 1: AI Foundry Model Catalog screen. Note the number of models in the catalog as of this post (1827!). Some models also have a small bar graph icon indicating they include benchmarking information.
Thankfully, you notice the industry filter and breathe a sigh of relief. You set it to “Health and Life Sciences” and start exploring the models we've launched with our academic and industry partners. However, your specific use-case might not be covered yet, or you may want to do further due diligence and dive deeper into model performance.
You're back in the LLM jungle.
So, where do you go for help in selecting the best model(s) for your healthcare use-case?
Unicorns Wanted: high-quality, relevant, open-ended datasets with vetted labels (reference data)
You start by looking at the benchmarks published with the models, as well as standard benchmarks found on leaderboards like LMArena. The first problem you encounter is that most benchmarks for general-purpose models don't mean much for healthcare use-cases. If there are healthcare-specific questions in the datasets, they are few and far between. Then comes the realization that most of the data points that could be relevant in healthcare are closed-ended (read: multiple choice or Yes/No) questions[1]. While it is understandable that many benchmarks use closed-ended questions because of their availability and ease[2] of evaluation, real-life use-cases such as summarization, translation, reasoning, and common-sense inference require open-ended answers for a true assessment of output quality. This is even more crucial for healthcare: high scores on multiple-choice questions do not guarantee that a model will accomplish real tasks accurately. They are a good start, but not more. And very few datasets include high-quality reference materials. That is our first big challenge: finding high-quality, vetted datasets with open-ended reference outputs, ideally ones relevant to the HLS task we have in mind.
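To make footnote [2] concrete, here is a minimal sketch in plain Python (the helper names and sample replies are hypothetical) of the kind of output normalization even “easy” closed-ended scoring tends to require: a model rarely answers with a bare letter, so you first extract the chosen option before doing an exact-match comparison against the label.

```python
import re

def extract_choice(llm_output: str, valid_choices=("A", "B", "C", "D", "E")) -> str | None:
    """Pull a single multiple-choice letter out of a free-form LLM answer.

    Real replies often look like "The correct answer is (C)." or restate the
    option text, so even closed-ended scoring needs some normalization.
    """
    text = llm_output.strip()
    # Case 1: the reply is just the letter, possibly wrapped in punctuation, e.g. "C." or "(C)"
    m = re.fullmatch(r"\(?([A-E])\)?[.)]?", text, flags=re.IGNORECASE)
    if m and m.group(1).upper() in valid_choices:
        return m.group(1).upper()
    # Case 2: look for patterns like "answer is C" or "answer: (C)"
    m = re.search(r"answer\s*(?:is|:)?\s*\(?([A-E])\)?", text, flags=re.IGNORECASE)
    if m and m.group(1).upper() in valid_choices:
        return m.group(1).upper()
    return None  # could not parse; count as incorrect or flag for review

def accuracy(outputs: list[str], gold: list[str]) -> float:
    """Exact-match accuracy over the extracted choices."""
    hits = sum(extract_choice(o) == g for o, g in zip(outputs, gold))
    return hits / len(gold)

# Hypothetical mini-batch of model replies vs. reference labels
preds = ["The correct answer is (C).", "B", "I would choose option D because..."]
labels = ["C", "B", "A"]
print(accuracy(preds, labels))  # 0.666...
```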
Figure 2: A sampling of multiple-choice questions and labels from the MedQA dataset.
Objective evaluations in a subjective world
If you were lucky enough to solve the first problem and have access to vetted datasets[3], you face the next problem: how do you compare the model output against the reference answer accurately, automatically, and cost-effectively? For datasets with closed-ended answers, this is easy, hence their popularity. But open-ended language is by definition open to interpretation. There is no comprehensive, reliable way of comparing two open-ended texts along the many dimensions crucial for healthcare use-cases: clinical accuracy, relevance, patient safety, logic and reasoning, medical decision-making, bias, and more. Statistical/algorithmic metrics give us some indication of how well texts match on the surface[4]; semantic-matching scores[5] tell us about relevance and semantic distance. But both are approximate measures, not reliable ones. The remaining options are using human experts[6] to evaluate outputs or using LLM-based judging approaches[7], which are not deterministic and not always consistent.
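To see why these scores are only approximate, here is a small sketch using the Hugging Face evaluate library (the reference and candidate sentences are invented for illustration): a surface metric such as ROUGE and a semantic metric such as BERTScore will both tend to rate the pair as similar, even though the differences between the two answers are precisely the clinically important ones.

```python
# pip install evaluate rouge_score bert_score
import evaluate

reference = ("Start the patient on a low-dose ACE inhibitor and "
             "recheck blood pressure in two weeks.")
candidate = ("An ACE inhibitor could be started at a high dose; "
             "blood pressure follow-up is optional.")

# Surface matching (n-gram overlap): substantial word overlap despite
# the differences in dose and follow-up.
rouge = evaluate.load("rouge")
print(rouge.compute(predictions=[candidate], references=[reference]))

# Semantic matching (embedding similarity): also scores the pair as
# similar, because the metric has no notion of clinical risk.
bertscore = evaluate.load("bertscore")
print(bertscore.compute(predictions=[candidate], references=[reference], lang="en"))
```

Neither metric knows that “low-dose” versus “high dose”, or a skipped follow-up, changes the clinical meaning, which is exactly the gap that expert review or carefully designed LLM-judge rubrics try to fill.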
So, not only is it a challenge to get datasets to use in benchmarks, but we also lack accepted metrics and reliable ways to evaluate model outputs against them.
Figure 3: Open-ended questions and answers are challenging to evaluate. These are five different ways of answering a fairly common, simple clinical question. Note that while the answers are near-paraphrases of each other, the slight variations can lead to clinically meaningful differences.
Now layer in the complexity multipliers well known in the healthcare industry: variability in clinical practice, data sensitivity and security, regulatory compliance, the need for continuous re-evaluation to reflect the latest clinical knowledge and accepted practice, and the need for longitudinal evaluation (of patient history, disease progression, and so on).
This is a huge challenge for any one player in the Health and Life Sciences field, but it is one we have to tackle to supply the objective, reproducible, transparent data that HLS customers need to make informed decisions. It will require broad collaboration and coordination among all stakeholders: academia, industry, provider organizations, individual providers, and public, private, and non-profit organizations alike.
How can Microsoft help?
Through Azure AI Foundry and Azure ML Studio, Microsoft enables you to streamline evaluation pipelines and provides access to code and prompts to implement various metrics for your own evaluation. Azure AI Foundry just launched its model evaluation capabilities, which include many algorithmic and model-based approaches.
Figure 4: A sampling of the different evaluation methods available to you in the AI Foundry Evaluator Library
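For orientation, below is a minimal sketch of calling two model-based evaluators from the azure-ai-evaluation Python package. The endpoint, key, deployment name, and sample inputs are placeholders, and evaluator input fields can differ between SDK versions, so treat this as an outline rather than a drop-in recipe and check the current SDK documentation.

```python
# pip install azure-ai-evaluation
# Sketch only: endpoint, key, deployment, and inputs are placeholders,
# and exact evaluator parameters may vary by SDK version.
from azure.ai.evaluation import RelevanceEvaluator, GroundednessEvaluator

model_config = {
    "azure_endpoint": "https://<your-aoai-resource>.openai.azure.com",
    "api_key": "<your-api-key>",
    "azure_deployment": "<your-judge-model-deployment>",  # e.g., a GPT-4-class model
}

# Model-based ("LLM-as-judge") evaluators scored by the deployment above
relevance = RelevanceEvaluator(model_config)
groundedness = GroundednessEvaluator(model_config)

query = "What is the first-line treatment for uncomplicated hypertension?"
context = "<clinical guideline excerpt retrieved for this query>"
response = "<the model answer you want to score>"

print(relevance(query=query, response=response))
print(groundedness(query=query, context=context, response=response))
```

The SDK also exposes a batch evaluate() entry point that runs a set of evaluators over a JSONL dataset, if you want to score more than a single response at a time.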
We are committed to building on this strong foundation and expanding it with the help of our Microsoft Research counterparts, HLS product teams, and industry and academic partners. Our goal is to address these key issues for the HLS community by enhancing benchmarking capabilities and fostering opportunities for sharing, comparing, and collaborating. In pursuit of these goals, we will continue to work closely with and support evaluation efforts from our partner community, including HealthcareDive, MedHelm, LMArena, and M42, not to mention the deep well of expertise in our own Microsoft Research community.
We will continue our discussion of evaluation and benchmarking challenges in the next part of this series.
Until then, please visit us at HIMSS in Level 2 Booth #2221, and stay tuned for further announcements in the coming weeks.
[1] This is true for MedQA, HEAD-QA, VQA-Rad, MMLU, MedBullets, PubMedQA, BioMistral and many others.
[2] “Easy” is easier said than done. Even for simple matching of multiple-choice options, you may need to jump through hoops and do some extraction and manipulation of LLM outputs to get a reliable comparison.
[3] Maybe you are an academic institution, who has the clinical / administrative bandwidth and access to data.
[4] F1, BLEU, ROUGE, METEOR, …
[5] BERTScore, AlignScore, MoverScore, …
[6] Hired or crowd-sourced experts, or arena/challenge approaches such as Chatbot Arena
[7] See the AI Foundry evaluation section above, but also G-Eval, GPT-Score, Prometheus, etc.