Doc Intelligence: Custom Extraction model | Confidence score deterioration with new formats/layouts
Really interesting experiment, thanks for sharing your process and the graph! I’ve been exploring similar behavior with custom extraction models in Document Intelligence, and your findings resonate with some patterns I’ve seen too.
The drop in confidence scores after introducing documents with different formats is especially intriguing. It raises important questions about how these models handle layout variability and generalization.
I’ve put together a few thoughts on why this might be happening and some suggestions that could help improve performance across mixed-format datasets.
Your setup, starting with a consistent format and gradually introducing layout variability, is a great way to test the model's generalization capabilities. Confidence scores dropping after new formats are introduced is a classic sign of layout sensitivity in custom extraction models.
These models are trained to recognize spatial relationships and visual patterns. When you introduce documents with different layouts, the model’s internal representation becomes less certain, especially if the new formats are underrepresented or structurally divergent.
Why Confidence Scores Dropped
- Layout Sensitivity
Custom extraction models are highly sensitive to spatial layout. When trained on consistent formats, they learn strong positional anchors. Introducing new layouts disrupts these learned associations, reducing confidence even on previously well-performing formats.
- Image-Based Complexity
Since the fields are extracted from images, OCR quality plays a major role. Variations in rotation, resolution, and lighting can degrade text detection, which directly impacts confidence scores.
- Insufficient Format Representation
Adding only a few samples of a new format (e.g., 10 at a time) may not be enough for the model to generalize, which can lead to underfitting across all formats.
- Field-Specific Vulnerability
Some fields (like Field2 and Field3 in your experiment) may be more layout-dependent, making them more susceptible to format changes. Field1 showed partial recovery, likely due to more consistent placement or semantic anchoring. Per-field confidence is easy to track programmatically; see the sketch after this list.
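
One way to keep an eye on this is to log per-field confidence for a fixed held-out test set after each retraining round, so field-level regressions like the one you saw on Field2/Field3 show up immediately. Here's a minimal sketch with the Python azure-ai-formrecognizer SDK; the endpoint, key, model ID, and file name below are placeholders, not real values:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Placeholders -- substitute your own resource, key, and custom model ID.
client = DocumentAnalysisClient(
    "https://<your-resource>.cognitiveservices.azure.com/",
    AzureKeyCredential("<your-key>"),
)

with open("test_document.png", "rb") as f:
    result = client.begin_analyze_document("my-custom-model", f).result()

# Each analyzed document exposes an overall confidence plus one per field,
# so you can watch individual fields (e.g., Field2, Field3) regress as the
# training set changes.
for doc in result.documents:
    print(f"doc_type={doc.doc_type}, overall confidence={doc.confidence}")
    for name, field in doc.fields.items():
        print(f"  {name}: value={field.value!r}, confidence={field.confidence}")
```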
Recommendations to Improve Generalization
- Train Format-Specific Models
If formats differ significantly, consider training separate models and using a classifier or routing logic to direct documents to the appropriate model (see the routing sketch below).
- Normalize Labeling Practices
Ensure consistent field naming and annotation across formats to help the model generalize.
- Increase Sample Size per Format
More examples per format improve the model's ability to learn layout-specific patterns robustly.
- Use Semantic Anchors
Label fields using nearby keywords or patterns rather than relying solely on position.
- Preprocess Images
Correct rotation, enhance contrast, and reduce noise to improve OCR quality and model confidence (a preprocessing sketch follows the routing example below).
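
On the first recommendation: Document Intelligence supports pairing a custom classification model with several extraction models, so the routing can be a thin layer of code. Here's a rough sketch of that pattern in Python (azure-ai-formrecognizer); the classifier ID, model IDs, and doc types are hypothetical names standing in for your own:

```python
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Hypothetical placeholders -- substitute your own resource and models.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com/"
KEY = "<your-key>"

# Map each doc_type emitted by the classifier to the extraction model
# trained specifically on that layout.
MODEL_BY_DOC_TYPE = {
    "layout-a": "extraction-model-a",
    "layout-b": "extraction-model-b",
}

client = DocumentAnalysisClient(ENDPOINT, AzureKeyCredential(KEY))

def route_and_extract(path: str) -> dict:
    # Step 1: classify the document to determine which layout it uses.
    with open(path, "rb") as f:
        classified = client.begin_classify_document("my-doc-classifier", f).result()
    doc_type = classified.documents[0].doc_type

    # Step 2: send it to the extraction model trained for that layout.
    model_id = MODEL_BY_DOC_TYPE.get(doc_type)
    if model_id is None:
        raise ValueError(f"No extraction model registered for doc type {doc_type!r}")
    with open(path, "rb") as f:
        result = client.begin_analyze_document(model_id, f).result()

    # Return field name -> (value, confidence) for logging/inspection.
    fields = result.documents[0].fields
    return {name: (fld.value, fld.confidence) for name, fld in fields.items()}
```

The nice side effect of this pattern is that each extraction model only ever sees the layout it was trained on, so the cross-format confidence erosion you observed shouldn't occur.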
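And for the last point, here's one possible preprocessing pipeline with OpenCV: denoise, deskew using the median angle of detected near-horizontal text lines, then boost local contrast with CLAHE. The thresholds are rough defaults you'd want to tune on your own scans:

```python
import cv2
import numpy as np

def preprocess_for_ocr(in_path: str, out_path: str) -> None:
    # Load as grayscale; OCR mostly cares about luminance.
    img = cv2.imread(in_path, cv2.IMREAD_GRAYSCALE)
    if img is None:
        raise FileNotFoundError(in_path)
    h, w = img.shape

    # 1. Denoise so scanner speckle doesn't confuse edge detection or OCR.
    img = cv2.fastNlMeansDenoising(img, h=10)

    # 2. Estimate skew from the median angle of near-horizontal lines.
    edges = cv2.Canny(img, 50, 150)
    lines = cv2.HoughLinesP(edges, 1, np.pi / 180, threshold=100,
                            minLineLength=w // 4, maxLineGap=20)
    angles = []
    if lines is not None:
        for x1, y1, x2, y2 in lines[:, 0]:
            angle = np.degrees(np.arctan2(y2 - y1, x2 - x1))
            if abs(angle) < 30:  # ignore vertical rules and page borders
                angles.append(angle)
    if angles:
        skew = float(np.median(angles))
        matrix = cv2.getRotationMatrix2D((w / 2, h / 2), skew, 1.0)
        img = cv2.warpAffine(img, matrix, (w, h), flags=cv2.INTER_CUBIC,
                             borderMode=cv2.BORDER_REPLICATE)

    # 3. Boost local contrast (CLAHE) without blowing out light regions.
    img = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8)).apply(img)

    cv2.imwrite(out_path, img)
```

Running the same pipeline on both training and inference images keeps the two distributions consistent, which matters as much as the cleanup itself.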
For more detail on how confidence scores are computed and how to improve them, see: Interpret and improve model accuracy and confidence scores - Azure AI services | Microsoft Learn