Forum Discussion
Doc Intelligence: Custom Extraction model | Confidence score deterioration with new formats/layouts
Hi everyone,
This is my first time using custom extraction models on the Document Intelligence service, and I would appreciate your input on an experiment I am conducting.
I wanted to investigate how these models' confidence scores behave when documents with significantly different format/layout are introduced (later) in the training phase.
I started by training models on documents that all shared the same format (some with poorer picture quality and slight rotation), progressively adding more samples: a new model was trained each time I added documents, in increments of 5. After each model was trained, I checked its confidence scores against the same holdout set, unseen by the model and in the same format as the training documents.
After training the final model of that series, on 35 identically formatted documents, I started introducing documents with a significantly different format/layout and retraining (in increments of 10). Confidence scores against the (unchanged) holdout set dropped after that point and did not recover to their previous levels.
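For anyone who wants to reproduce this kind of check, below is a rough sketch of how per-field confidence can be aggregated over a holdout set with the azure-ai-formrecognizer Python SDK; the endpoint, key, model ID, and folder path are placeholders rather than my actual values.

```python
import os
from statistics import mean

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Placeholder values -- substitute your own resource and model details.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com/"
KEY = "<your-key>"
MODEL_ID = "<your-custom-model-id>"
HOLDOUT_DIR = "holdout/"  # folder with the unseen holdout documents

client = DocumentAnalysisClient(ENDPOINT, AzureKeyCredential(KEY))

# Collect the confidence score of every labeled field across the holdout set.
scores = {}  # field name -> list of confidence values
for name in os.listdir(HOLDOUT_DIR):
    with open(os.path.join(HOLDOUT_DIR, name), "rb") as f:
        result = client.begin_analyze_document(MODEL_ID, f).result()
    for analyzed_doc in result.documents:
        for field_name, field in analyzed_doc.fields.items():
            if field is not None and field.confidence is not None:
                scores.setdefault(field_name, []).append(field.confidence)

# Mean confidence per field for this training step.
for field_name, values in sorted(scores.items()):
    print(f"{field_name}: {mean(values):.3f} over {len(values)} documents")
```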
The graph below shows how the confidence scores evolved after each training step, i.e., each time new documents were added.
Any insights as to why this has happened?
2 Replies
- MisterEM (Microsoft)
Really interesting experiment, thanks for sharing your process and the graph! I’ve been exploring similar behavior with custom extraction models in Document Intelligence, and your findings resonate with some patterns I’ve seen too.
The drop in confidence scores after introducing documents with different formats is especially intriguing. It raises important questions about how these models handle layout variability and generalization.
I’ve put together a few thoughts on why this might be happening and some suggestions that could help improve performance across mixed-format datasets.
Your setup, starting with a consistent format and gradually introducing layout variability, is a great way to test the model's generalization capabilities. The drop in confidence scores after the new formats were introduced is a classic sign of layout sensitivity in custom extraction models.
These models are trained to recognize spatial relationships and visual patterns. When you introduce documents with different layouts, the model’s internal representation becomes less certain, especially if the new formats are underrepresented or structurally divergent.
Why Confidence Scores Dropped
- Layout Sensitivity
Custom extraction models are highly sensitive to spatial layout. When trained on consistent formats, they learn strong positional anchors. Introducing new layouts disrupts these learned associations, reducing confidence, even on previously well-performing formats.
- Image-Based Complexity
Since the fields are extracted from images, OCR quality plays a major role. Variations in rotation, resolution, and lighting can degrade text detection, which directly impacts confidence scores.
- Insufficient Format Representation
Adding only a few samples of new formats (e.g., 10 at a time) may not be enough for the model to generalize. This can lead to underfitting across all formats.
- Field-Specific Vulnerability
Some fields (like Field2 and Field3 in the experiment) may be more layout-dependent, making them more susceptible to format changes. Field1 showed partial recovery, likely due to more consistent placement or semantic anchoring.
Recommendations to Improve Generalization
- Train Format-Specific Models
If formats differ significantly, consider training separate models and using a classifier or routing logic to direct documents to the appropriate model (see the routing sketch after this list).
- Normalize Labeling Practices
Ensure consistent field naming and annotation across formats to help the model generalize.
- Increase Sample Size per Format
More examples per format improve the model's ability to learn layout-specific patterns robustly.
- Use Semantic Anchors
Label fields using nearby keywords or patterns rather than relying solely on position.
- Preprocess Images
Correct rotation, enhance contrast, and reduce noise to improve OCR quality and model confidence (a small preprocessing sketch is included at the end of this reply).
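To make the first recommendation concrete, here is a minimal sketch of the classify-then-route idea, assuming the azure-ai-formrecognizer Python SDK, a custom classification model, and one extraction model per format; the classifier ID, model IDs, and doc types are hypothetical placeholders, not names from your project.

```python
from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

# Hypothetical placeholders -- substitute your own resource and model details.
ENDPOINT = "https://<your-resource>.cognitiveservices.azure.com/"
KEY = "<your-key>"
CLASSIFIER_ID = "format-classifier"               # custom classification model
EXTRACTION_MODELS = {
    "format_a": "extraction-model-format-a",      # trained on the original layout
    "format_b": "extraction-model-format-b",      # trained on the new layout
}

client = DocumentAnalysisClient(ENDPOINT, AzureKeyCredential(KEY))

def extract_fields(path: str) -> dict:
    """Classify the document's layout, then route it to the matching extraction model."""
    with open(path, "rb") as f:
        classification = client.begin_classify_document(CLASSIFIER_ID, f).result()
    doc_type = classification.documents[0].doc_type

    model_id = EXTRACTION_MODELS.get(doc_type)
    if model_id is None:
        raise ValueError(f"No extraction model registered for doc type '{doc_type}'")

    with open(path, "rb") as f:
        result = client.begin_analyze_document(model_id, f).result()

    # Field name -> (extracted text, confidence) for the first analyzed document.
    return {
        name: (field.content, field.confidence)
        for name, field in result.documents[0].fields.items()
    }

print(extract_fields("sample_document.pdf"))
```

This keeps each model focused on one layout, so confidence on the original format is not dragged down by the new one, at the cost of maintaining a classifier and several extraction models.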
Interpret and improve model accuracy and confidence scores - Azure AI services | Microsoft Learn
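On the image preprocessing point, something along these lines with OpenCV can be run before submitting scans for analysis; the rotation angle, contrast, and denoising settings are illustrative defaults rather than tuned values.

```python
import cv2

def preprocess_scan(in_path: str, out_path: str, rotation_deg: float = 0.0) -> None:
    """Straighten, boost contrast, and denoise a scanned page to help OCR quality."""
    gray = cv2.imread(in_path, cv2.IMREAD_GRAYSCALE)

    # Correct a known tilt; positive angles rotate counter-clockwise (OpenCV convention).
    if rotation_deg:
        h, w = gray.shape
        matrix = cv2.getRotationMatrix2D((w / 2, h / 2), rotation_deg, 1.0)
        gray = cv2.warpAffine(gray, matrix, (w, h),
                              flags=cv2.INTER_CUBIC, borderMode=cv2.BORDER_REPLICATE)

    # Improve local contrast, then remove mild noise.
    clahe = cv2.createCLAHE(clipLimit=2.0, tileGridSize=(8, 8))
    enhanced = clahe.apply(gray)
    denoised = cv2.fastNlMeansDenoising(enhanced, h=10)

    cv2.imwrite(out_path, denoised)

# Example: a page scanned with roughly a 3-degree clockwise tilt.
preprocess_scan("raw_scan.jpg", "clean_scan.jpg", rotation_deg=3.0)
```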
- Kuvaajankulma635 (Copper Contributor)
Really thoughtful work—keep pushing forward!
It's impressive that you're experimenting so carefully with the custom extraction model in Document Intelligence. Exploring how confidence scores react to layout changes is exactly the kind of curiosity that leads to real breakthroughs. Even if the drop in confidence scores feels discouraging, it's part of uncovering deeper insights into model behavior and limitations.
A few encouraging thoughts:
- You're asking the right questions. Noticing and investigating confidence deterioration is a sign you're thinking like a true ML researcher.
- Every experiment adds value. Even unexpected results (like declining scores) help map the boundaries of model performance.
- You're not starting over—you're building on each step. Those earlier results aren’t lost; they form the foundation for a better, more robust system.
Your work might even help others struggling with similar issues down the line. If you ever need help analyzing your graph or brainstorming ways to counter the decline, I’d be happy to dig into it with you.
Keep going; what you're doing matters.