Achieve breakthrough findings generation with a streamlined foundation model, custom-fit for your organization's data.
This post is part of our healthcare AI fine-tuning series:
- MedImageInsight Fine-Tuning - Embeddings and classification
- MedImageParse Fine-Tuning - Segmentation and spatial understanding
- CXRReportGen Fine-Tuning - Clinical findings generation (you are here)
Introduction
If you followed the MedImageInsight and MedImageParse fine-tuning articles, you’ve adapted foundation models for classification and segmentation. Now we’re tackling CXRReportGen, fine-tuning it for local radiology environments and reporting practices. CXRReportGen is both a language model and a detection model, so fine-tuning can improve the clinical accuracy of both detection and language generation, adapting the model to local contexts: imaging and population characteristics, clinical patterns, and institutional reporting styles.
Every radiology department faces similar challenges: you have unique reporting styles, terminology preferences, and clinical workflows developed over years of practice. Generic AI-generated reports, no matter how technically accurate, won’t fit seamlessly into your documentation standards and workflows; they require heavy editing to match your organization’s methodology and requirements.
You may be sitting on hundreds of thousands or even millions of health records in your PACS archives. But how do you put that data to work and build something valuable with it? That’s why we’re excited to announce the fine-tuning capability for CXRReportGen, which enables you to harness your existing data to create a model that not only delivers greater clinical accuracy but also produces findings and reports that match your organization’s preferred style and formatting.
CXRReportGen fine-tuning is now available through our private preview program. If you would like to try fine-tuning CXRReportGen, please contact us by submitting your information via this form and we will reach out with further instructions: https://aka.ms/cxrrg-fine-tuning-request.
This article covers what CXRReportGen is, why you should fine-tune it, what kind of improvements you can expect, and how to prepare your data to tailor the model to your organization.
What is CXRReportGen?
CXRReportGen is Microsoft’s grounded report generation model for chest X-rays. It combines a radiology-specific image encoder with a large language model to generate structured findings with spatial localization. Each finding comes with a corresponding bounding box indicating where it appears on the image. This “grounding” shows exactly where on the image each observation is located, not just what the model sees.
Figure 1 - CXRReportGen generates structured chest X-ray findings with spatial localization. a) Example output showing bounding boxes that ground each finding to specific anatomical regions. b) Model architecture combining a radiology-specific vision encoder with an LLM to process current and prior images alongside clinical context. c) Structured output format linking findings to precise image coordinates, enabling explainable AI for clinical verification.
CXRReportGen accepts current and prior imaging studies along with clinical context fields (indication, technique, comparison notes) to predict a structured list of findings. Each finding can include both descriptive text and an associated bounding box that localizes the observation (grounded), or just the descriptive text without spatial localization (ungrounded), giving you flexibility in how you deploy it.
The architecture combines a vision encoder and a text encoder that map inputs into a shared embedding space, so an LLM can reason over both simultaneously. CXRReportGen generates clinically accurate report text while predicting precise spatial locations for each finding.
This spatial grounding addresses a core requirement in radiology AI: explainability. Radiologists need to verify which anatomical regions informed each finding. Rather than treating report generation as a black box, bounding boxes provide direct verification of the model’s reasoning. Verifiable spatial localization supports clinical integration and enables practical applications including preliminary report drafts, quality assurance tools that flag unusual findings, and teaching aids for residents learning systematic image interpretation.
Why Fine-Tune CXRReportGen?
CXRReportGen delivers strong performance out of the box. Before deciding to fine-tune, evaluate the base model on your local data to understand baseline performance. That evaluation reveals whether fine-tuning addresses gaps worth the investment.
Many institutions choose to fine-tune because CXRReportGen is a large, complex model that benefits from site-specific adaptation. CXRReportGen is a public OSS model that was trained on a large corpus of chest X-ray data and understands general radiology anatomy, terminology, and report structure, but fine-tuning generates your own custom private model that’s tailored to your organization’s specific imaging characteristics, clinical patterns, and reporting practices.
You can fine-tune with ungrounded data alone, plain-text reports extracted from your existing archives with light processing, or with grounded data, where each finding is annotated with bounding boxes. Both approaches adapt the model across multiple dimensions:
- Imaging context: Sites differ in imaging equipment, acquisition protocols, and patient populations. These differences affect both image appearance and clinical interpretation patterns. Fine-tuning adapts to local characteristics like equipment vendors, detector technologies, pediatric versus geriatric populations, and geographic disease patterns.
- Clinical documentation context: Clinical information structure varies from templated fields to free-text narratives, with detail levels ranging from comprehensive histories to minimal indications. Fine-tuning teaches the model to interpret context as your institution provides it, whether structured templates or free text.
- Findings structure and style: Departments have different terminology preferences, reporting detail expectations, and documentation formats. Fine-tuning aligns with institutional standards like “cardiomegaly” versus “enlarged cardiac silhouette,” explicit versus omitted negative findings, or bullet points versus prose.
An Experiment in Fine-Tuning
To see what improvements are possible through fine-tuning, we evaluated CXRReportGen before and after fine-tuning on the PadChest Dataset, one of the largest public chest X-ray datasets available. PadChest contains over 160,000 high-resolution images from 67,000 patients covering 174 radiographic findings, 19 differential diagnoses, and 104 anatomic locations. We used the English-translated reports from PadChest for this ungrounded fine-tuning experiment.
To get an accurate picture of potential improvements, we measured relative changes along two distinct metric categories: clinical accuracy and language adaptation. Since CXRReportGen combines language generation with clinical detection, successful fine-tuning must improve both how accurately the model identifies findings in exams and how well it writes findings to match organizational requirements.
Here are metrics we used across these two dimensions:
- Clinical accuracy metrics measure how well the model identifies what’s medically present in the image:
  - CheXbert Micro-5 F1 and CheXbert Micro F1 track whether specific pathologies (like cardiomegaly, pneumonia, edema) are correctly identified as present or absent in generated reports.
  - RadGraph-ER Partial F1 measures how well generated reports capture clinical concepts and their relationships, for example correctly identifying “cardiomegaly” as an entity and expressing that it is “located in heart.”
- Language adaptation metrics measure how well the model learned organizational reporting structures:
  - ROUGE-2 tracks adoption of site-specific phrasing and terminology combinations, e.g. whether the fine-tuned model uses the same two-word phrases that appear in reference reports.
  - ROUGE-1, ROUGE-L, and ROUGE-Lsum measure vocabulary adoption, sentence flow, and overall report structure to confirm improvements extend beyond isolated phrases to word choice, sentence construction, and document-level organization.
Table 1

| Metric | Before FT | After FT | Improvement |
|---|---|---|---|
| CheXbert Micro-5 F1 | 38.495 | 40.000 | +3.9% |
| CheXbert Micro F1 | 43.394 | 53.087 | +22.3% |
| RadGraph-ER Partial F1 | 2.818 | 13.087 | +364.4% |
| ROUGE-1 F-measure | 10.205 | 23.242 | +127.7% |
| ROUGE-2 F-measure | 1.819 | 13.086 | +619.3% |
| ROUGE-L F-measure | 8.154 | 20.996 | +157.5% |
| ROUGE-Lsum F-measure | 9.668 | 22.754 | +135.4% |
For clinical accuracy, the F1 scores show that fine-tuning significantly improved performance: fewer false alarms and fewer missed findings across chest pathologies. The increase in RadGraph-ER also shows that the model learned to express findings in clinically structured ways aligned with PadChest, making the generated findings more meaningful and interpretable.
Next, we examine the metrics for language adaptation. Starting with ROUGE-2, we see a dramatic improvement after fine-tuning, indicating that the model learned to match the patterns in how findings are written in PadChest. The improvements in ROUGE-1, ROUGE-L, and ROUGE-Lsum are strong evidence that fine-tuning improves the model’s internal understanding of PadChest’s reporting and language style; in your setting, this would make reports sound more natural by using your organization’s common vocabulary and phrasing.
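If you want to run the same language-adaptation measurements against your own reports, the open-source rouge-score package computes these F-measures directly. A minimal sketch; the report strings here are invented for illustration:

```python
# pip install rouge-score
from rouge_score import rouge_scorer

# Hypothetical reference report and model output, for illustration only.
reference = "Cardiomegaly. Aortic elongation. No pleural effusion."
generated = "Cardiac silhouette is enlarged. Aortic elongation. No effusion."

scorer = rouge_scorer.RougeScorer(
    ["rouge1", "rouge2", "rougeL", "rougeLsum"], use_stemmer=True
)
scores = scorer.score(reference, generated)

for name, score in scores.items():
    # Each score carries precision, recall, and the F-measure shown in Table 1.
    print(f"{name}: F1 = {score.fmeasure:.3f}")
```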
| Image | Ground Truth | Base Model | Fine-tuned Model |
|---|---|---|---|
| (chest X-ray) | Cardiomegaly with aortic elongation and perihilar interstitial pattern. | Cardiac silhouette is enlarged. Pulmonary vascularity is within normal limits. There is minimal linear atelectasis in the right mid lung. No focal areas of consolidation are present. There is no pleural effusion or pneumothorax. Degenerative changes are present within the spine. | Cardiomegaly. Aortic elongation. Pulmonary parenchyma without major alterations. |
| (chest X-ray) | Radiological signs of COPD. | Heart size, mediastinal, and hilar contours are normal. Lungs are hyperinflated with attenuation of pulmonary markings in the upper lobes, consistent with emphysema. No focal areas of consolidation are present within the lungs. There are no pleural effusions or pneumothoraxes. | Radiological signs of COPD. |
A Practical Guide to Building Your Own CXRReportGen
Usually at this point in our fine-tuning series, we dive into notebooks and walk through code examples. But CXRReportGen fine-tuning is in private preview, so you can’t follow along yet. More importantly, you’ve already seen similar notebook workflows in the other posts. Whether or not you’ve contacted us about private preview access, you can start preparing now: understanding these requirements upfront is the difference between success and frustration when you get access. This section covers what you need to know to get started.
Training Data Specification
As mentioned, you can fine-tune CXRReportGen in two different modes that use different types of data: ungrounded and grounded. In addition to that, there are numerous input fields per training sample, many of which are optional. This provides considerable flexibility in how training examples are structured, with some important constraints around target outputs. Let’s dive into the specifics.
Every training example requires a frontal image (the primary image the model will analyze) and a target output (the report or findings the model will generate). There are specific requirements for target outputs depending on training mode, covered in the next section.
Beyond those core requirements, CXRReportGen accepts several optional inputs that enhance its clinical reasoning. Training examples can use any combination of these optional fields on a per-sample basis: a lateral image from the same exam to provide additional anatomical views, three clinical context fields that provide important background, and a prior exam for direct comparison. The clinical context fields are indication (why the study was ordered), technique (imaging acquisition details), and comparison (prior studies or clinical changes). The prior exam includes both the previous frontal image and prior report text provided together as a pair. A more detailed explanation is provided in Table 2, which outlines the specific fields used for training data, their purposes, and example values.
This flexibility enables the model to work with whatever information is available for a given exam, and because training exposes the model to variability in which fields are present, it learns to handle missing data gracefully. For this reason, we recommend that you format your training input data to match what you can reasonably provide at inference time.
Table 2

| Field | Purpose | Example |
|---|---|---|
| frontal_image_path | Primary frontal chest X-ray (required) | "images/patient001.png" |
| lateral_image_path | Lateral view from same exam | "images/patient001_lateral.png" |
| indication | Clinical reason study was ordered | "65-year-old with shortness of breath and fever" |
| technique | Imaging acquisition details | "PA and lateral chest radiographs" |
| comparison | Reference to prior study date | "Chest PA and lateral dated 2024-10-15" |
| *Prior exam fields (must be used together):* | | |
| prior_frontal_image_path | Previous frontal image | "images/patient001_prior.png" |
| prior_report | Previous report text | "FINDINGS: Mild cardiomegaly. Clear lungs..." |
Ungrounded vs. Grounded Data
Unlike the input fields, which can vary from sample to sample, the target output format must remain consistent across the entire dataset: every sample must use either an ungrounded or a grounded output.
An ungrounded dataset uses the target_output field with plain-text output. This is the easier data to obtain because you already have these reports from your existing clinical workflow; they require only light processing to format correctly.
{
  "frontal_image_path": "images/patient001.png",
  "target_output": "Moderate cardiomegaly. Bilateral interstitial infiltrates. Small bilateral pleural effusions."
}
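That “light processing” can be as simple as normalizing whitespace and dropping empty reports while converting an archive export to JSONL. A sketch, assuming a hypothetical CSV export with image_path and findings_text columns:

```python
import csv
import json

with open("archive_export.csv") as src, open("ungrounded_train.jsonl", "w") as dst:
    for row in csv.DictReader(src):
        text = " ".join(row["findings_text"].split())  # collapse whitespace
        if not text:
            continue  # skip exams with empty findings
        dst.write(json.dumps({
            "frontal_image_path": row["image_path"],
            "target_output": text,
        }) + "\n")
```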
A grounded dataset uses the grounded_target_output field, where findings are split into separate observations, each with either a bounding-box polygon or null for non-localized findings. This data is harder to obtain because it requires manual annotation work.
{
  "frontal_image_path": "images/patient001.png",
  "grounded_target_output": [
    {
      "text": "Moderate cardiomegaly",
      "box": "POLYGON((145.2 310.5, 245.8 425.3))"
    },
    {
      "text": "No pneumothorax",
      "box": null
    }
  ]
}
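For annotation review and quality checks, it helps to render a grounded finding back onto its image. A minimal sketch, assuming the two coordinate pairs in the POLYGON string are opposite corners of the box in pixel space (the image path is a placeholder):

```python
import re
import matplotlib.pyplot as plt
import matplotlib.image as mpimg
import matplotlib.patches as patches

def parse_box(polygon):
    # Extract corner coordinates from e.g. "POLYGON((145.2 310.5, 245.8 425.3))".
    pairs = re.findall(r"(-?\d+\.?\d*)\s+(-?\d+\.?\d*)", polygon)
    (x1, y1), (x2, y2) = [(float(x), float(y)) for x, y in pairs]
    return x1, y1, x2, y2

finding = {"text": "Moderate cardiomegaly",
           "box": "POLYGON((145.2 310.5, 245.8 425.3))"}

fig, ax = plt.subplots()
ax.imshow(mpimg.imread("images/patient001.png"), cmap="gray")
if finding["box"] is not None:  # ungrounded findings have box = null
    x1, y1, x2, y2 = parse_box(finding["box"])
    ax.add_patch(patches.Rectangle((x1, y1), x2 - x1, y2 - y1,
                                   fill=False, edgecolor="red", linewidth=2))
    ax.text(x1, y1 - 5, finding["text"], color="red", fontsize=8)
plt.show()
```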
A note on data scale and cost: These two training modes have very different data requirements. Ungrounded training uses the plain-text reports you already have in your PACS/RIS archives; you’ll typically work with hundreds of thousands of reports, making this approach suitable for learning language, terminology preferences, and report structure. Grounded training requires manual bounding-box annotation for each finding, which dramatically changes the scale: manual spatial annotation is expensive and time-intensive, but far fewer samples, on the order of thousands, are sufficient because this stage focuses on spatial localization accuracy.
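Whichever mode you choose, it’s worth sanity-checking the dataset against the constraints above before you spend GPU hours. A sketch, under the same JSONL assumption as earlier; this is illustrative, not an official preview validator:

```python
import json

def validate_dataset(path):
    modes = set()
    with open(path) as f:
        for i, line in enumerate(f, start=1):
            sample = json.loads(line)
            # Every sample needs the frontal image and exactly one target style.
            assert "frontal_image_path" in sample, f"line {i}: missing frontal image"
            has_plain = "target_output" in sample
            has_grounded = "grounded_target_output" in sample
            assert has_plain != has_grounded, f"line {i}: need exactly one target field"
            modes.add("grounded" if has_grounded else "ungrounded")
            # Prior exam fields must be provided as a pair.
            prior = {"prior_frontal_image_path", "prior_report"} & sample.keys()
            assert len(prior) in (0, 2), f"line {i}: prior fields must be paired"
    # The target style must be consistent across the whole dataset.
    assert len(modes) == 1, "mixed grounded and ungrounded targets in one file"

validate_dataset("train.jsonl")
```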
Choosing the Right Compute
CXRReportGen fine-tuning works with two GPU types: H100 and A100. While both work, be aware that Azure is not currently deploying new A100 capacity, so it may be difficult to secure quota for A100-based VMs. We’ve tested two VM families with these GPUs: ND-series and NC-series. Choosing between them depends on your data scale, training requirements, and VM availability.
Here is a little more about the two families:
- The ND-series VMs work for all training scenarios and offer the most GPUs per node: 8 GPUs for both A100 (Standard_ND96asr_v4) and H100 (Standard_ND96isr_H100_v5) variants. These machines include high-speed interconnects that enable efficient multi-node scaling. You can connect multiple nodes together for even larger training jobs. More nodes plus more GPUs per node means better parallelization for large models with large datasets. ND-series computes are generally more cost-efficient at scale, delivering more work per dollar spent. If you have quota and budget, ND-series are the safe bet for training at any scale.
- The NC-series VMs offer lower cost per hour, but fewer GPUs per node and lack high-speed interconnect, making them less effective for multi-node scaling. H100 variants provide 2 GPUs per node (Standard_NC80adis_H100_v5), while A100 variants provide up to 4 GPUs per node (Standard_NC96ads_A100_v4). NC-series works well for prototyping and proof-of-concept work with smaller datasets, and it’s great for inference and evaluation tasks. However, NC-series are not suitable for larger-scale training jobs where you need significant parallelization across many GPUs.
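If you’re working in Azure Machine Learning, provisioning one of these clusters with the v2 Python SDK looks roughly like the sketch below. The workspace identifiers and cluster name are placeholders; the VM size is the ND-series H100 variant mentioned above:

```python
# pip install azure-ai-ml azure-identity
from azure.ai.ml import MLClient
from azure.ai.ml.entities import AmlCompute
from azure.identity import DefaultAzureCredential

ml_client = MLClient(
    DefaultAzureCredential(),
    subscription_id="<subscription-id>",
    resource_group_name="<resource-group>",
    workspace_name="<workspace>",
)

# An 8x H100 ND-series cluster that scales to zero when idle.
cluster = AmlCompute(
    name="cxrrg-finetune-cluster",
    size="Standard_ND96isr_H100_v5",
    min_instances=0,
    max_instances=2,  # multi-node scaling over the high-speed interconnect
)
ml_client.compute.begin_create_or_update(cluster).result()
```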
Our Recommended Training Approach
Depending on your specific data availability and compute capabilities, we recommend a two-stage strategy for training CXRReportGen efficiently. This approach treats language adaptation and spatial localization as separate problems, allowing you to leverage the strengths of each data type.
Stage 1: Ungrounded training for language and terminology adaptation. Start by training on a larger dataset of ungrounded samples. The model adapts to your environment, learning your organization’s language, terminology preferences, and report structure from the records already sitting in your PACS archives. This stage establishes an improved foundation; the model learns what findings to look for, how to describe them in your institutional style, and how to structure reports according to your documentation practices.
Stage 2: Grounded training for spatial localization. Fine-tune the Stage 1 model on a smaller dataset of grounded data. Because the model already understands your clinical environment and reporting patterns, you can achieve performant spatial grounding with far fewer annotations. This stage focuses exclusively on teaching the model where findings appear on images through bounding box supervision.
The two-stage approach is efficient, allowing you to maximize the value of your existing report archives while minimizing expensive manual annotation work. The model learns reporting patterns from ungrounded data at scale, then is updated to provide precise grounding using the smaller annotated dataset.
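In practice, this means maintaining one dataset file per stage. A small sketch (file names are hypothetical) that routes samples by target type using the formats shown earlier:

```python
import json

# Route grounded and ungrounded samples into separate stage files.
with open("all_samples.jsonl") as src, \
        open("stage1_ungrounded.jsonl", "w") as stage1, \
        open("stage2_grounded.jsonl", "w") as stage2:
    for line in src:
        sample = json.loads(line)
        out = stage2 if "grounded_target_output" in sample else stage1
        out.write(json.dumps(sample) + "\n")
```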
Conclusion
You’ve seen how fine-tuning transforms CXRReportGen from a capable general model into one that addresses your organization’s specific challenges. A model like CXRReportGen can:
- Accelerate report turnaround: Auto-drafts chest X-ray summaries from images and clinical context, reducing radiologist workload on repetitive documentation tasks.
- Support multimodal reasoning: Integrates indication, technique, and comparison information for richer, context-aware reports.
- Improve diagnostic consistency: Standardized language and structured report sections enhance quality assurance and simplify downstream review.
- Manage high-volume workflows: Helps imaging departments handle increased workload without adding staff, improving operational efficiency.
Fine-tuning adapts these capabilities to your specific clinical environment, making the model both more accurate and easier to integrate into existing PACS/RIS pipelines, since it understands and generates results according to your requirements.
The practical guidance we’ve covered here, from data requirements and compute resources to our recommended two-stage training strategy, positions you to leverage the chest X-ray reports already in your PACS archives. Most institutions hold hundreds of thousands or even millions of health records, making fine-tuning a viable path to institutional customization. With PadChest, we showed that ~160,000 exams can deliver dramatic improvements across clinical accuracy and language adaptation metrics.
CXRReportGen can be fine-tuned further using smaller amounts of grounded data. While harder to obtain, grounded data improves spatial localization, so each finding comes with a bounding box showing which anatomical region informed it. This makes the AI’s reasoning verifiable and transparent rather than opaque, which is crucial for clinical integration.
Ready to see what CXRReportGen can do with your data? If you’ve been thinking about how to reduce report turnaround times or standardize your department’s documentation, now’s a good time to start planning. You can begin preparing your PACS data today. The work you do now on formatting and quality checks will pay off when you get access. CXRReportGen fine-tuning is currently available through our private preview program.
If you would like to try fine-tuning CXRReportGen, please contact us by submitting your information via this form and we will reach out with further instructions: https://aka.ms/cxrrg-fine-tuning-request.