
Azure Architecture Blog

From Large Semi-Structured Docs to Actionable Data: In-Depth Evaluation Approaches Guidance

anishganguli
Dec 15, 2025

Introduction

Extracting structured data from large, semi-structured documents demands a rigorous evaluation framework. (A detailed overview of the solution implementation and architecture is provided in this Tech Community blog: From Large Semi-Structured Docs to Actionable Data: Reusable Pipelines with ADI, AI Search & OpenAI.)

The goal is to ensure our pipeline is accurate, reliable, and scalable before we trust it with mission-critical data. This framework breaks evaluation into clear phases, from how we prepare the document, to how we find relevant parts, to how we validate the final output. It provides metrics, examples, and best practices at each step, forming a generic pattern that can be applied to various domains.

Framework Overview

A structured, stepwise approach to evaluation is given below:

  • Establish Ground Truth & Sampling: Define a robust ground truth set and sampling method to fairly evaluate all parts of the document.
  • Preprocessing Evaluation: Verify that OCR, chunking, and any structural augmentation (like adding headers) preserve all content and context.
  • Labelling Evaluation: Check that sections/chunks are classified correctly by topic/entity and that irrelevant data is filtered out without losing important context.
  • Retrieval Evaluation: Ensure the system can retrieve the right pieces of information (using search) with high precision@k and recall@k.
  • Extraction Accuracy Evaluation: Measure how well the final structured data matches the expected values (field accuracy, record accuracy, overall precision/recall).
  • Continuous Improvement Loop with SME: Use findings to retrain, tweak, and improve, so the framework can be reused for new documents and iterations. Subject matter experts (SMEs) play a central role in these scenarios.

Detailed Guidance on Evaluation

Below is a step-by-step, in-depth guide to evaluating this kind of IDP (Intelligent Document Processing) pipeline, covering both the overall system and its individual components:

Establish Ground Truth & Sampling

Why: Any evaluation is only as good as the ground truth it’s compared against. Start by assembling a reliable “source of truth” dataset for your documents. This often means manual labelling of a subset of documents by domain experts (e.g., a legal team annotating key clauses in a contract, or accountants verifying invoice fields). Because manual curation is expensive, be strategic about what you sample and how.

  1. Ground Truth Preparation: Identify the critical fields and sections we need to extract, and create an annotated set of documents with those values marked correct. For example, if processing financial statements, we might mark the ground truth values for Total Assets, Net Income, Key Ratios, etc. This ground truth should be the baseline to measure accuracy against. Although creating it is labour-intensive, it yields a precise benchmark for model performance.
  2. Stratified Sampling: Documents like contracts or policies have diverse sections. To evaluate thoroughly, use stratified sampling – ensure your test set covers all major content types and difficulty levels. For instance, if 15% of pages in a set of contracts are annexes or addendums, then ~15% of your evaluation pages should come from annexes, not just the main body. This prevents the evaluation from overlooking challenging or rare sections. In practice, we might partition a document by section type (clauses, tables, schedules, footnotes) and sample a proportion from each. This way, metrics reflect performance on each type of content, not just the easiest portions.
  3. Multi-Voter Agreement (Consensus): It’s often helpful to have multiple automated voters on the outputs before involving humans. For example, suppose we extracted an invoice amount; we can have:
    • A regex/format checker/fuzzy matching voter
    • A cross-field logic checker/embedding based matching voter
    • An ML model confidence score/LLM as a judge vote

If all signals are strong, we label that extraction as Low Risk; if they conflict, mark it High Risk for human review. By tallying such “votes”, we create tiers of confidence. Why? Because in many cases, a large portion of outputs will be obviously correct (e.g., over 80% might have unanimous high confidence), and we can safely assume those are right, focusing manual review on the remainder. This strategy effectively reduces the human workload while maintaining quality.
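
To make the voting idea concrete, here is a minimal sketch in Python. The three voter functions (a regex/format check, a presence cross-check against the source chunk, and a stand-in model-confidence check) are illustrative assumptions, not part of any specific library, and the unanimity-based tiering is one possible policy rather than a prescribed threshold.

```python
import re
from typing import Callable, List

# Each voter inspects an extracted value (plus its source chunk) and
# returns True if the value looks trustworthy.
Voter = Callable[[str, str], bool]

def format_voter(value: str, chunk: str) -> bool:
    """Regex/format check: does the value look like a currency amount?"""
    return bool(re.fullmatch(r"\$?\d{1,3}(,\d{3})*(\.\d{2})?", value.strip()))

def presence_voter(value: str, chunk: str) -> bool:
    """Fuzzy cross-check: does the value actually appear in the source chunk?"""
    normalize = lambda s: s.replace(",", "").replace("$", "")
    return normalize(value) in normalize(chunk)

def confidence_voter(value: str, chunk: str, threshold: float = 0.85) -> bool:
    """Stand-in for a model confidence score or LLM-as-judge verdict."""
    model_confidence = 0.9  # would come from the extraction step in practice
    return model_confidence >= threshold

def risk_tier(value: str, chunk: str, voters: List[Voter]) -> str:
    votes = sum(voter(value, chunk) for voter in voters)
    if votes == len(voters):
        return "Low Risk"        # unanimous agreement: auto-accept
    if votes == 0:
        return "High Risk"       # unanimous disagreement: human review
    return "Medium Risk"         # mixed signals: spot-check

tier = risk_tier("$12,450.00", "Total invoice amount due: $12,450.00",
                 [format_voter, presence_voter, confidence_voter])
print(tier)  # -> "Low Risk"
```

In practice, the unanimous Low Risk tier is auto-accepted and only the remaining tiers are routed to reviewers, which is how the workload reduction described above is realised.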

Preprocessing Evaluation

Before extracting meaning, make sure the raw text and structure are captured correctly. Any loss here breaks the whole pipeline. Key evaluation checks:

OCR / Text Extraction Accuracy

  • Character/Word Error Rate: Sample pages and check how many characters and words are recognized correctly (use per-word confidence scores to spot issues).
  • Layout Preservation: Ensure reading order isn’t scrambled, especially in multi-column pages or footnotes.
  • Content Coverage: Verify every sentence from a sample page appears in the extracted text. Missing footers or sidebars count as gaps.
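
As a rough illustration of the content-coverage check, the sketch below compares the words on a manually transcribed sample page against the OCR output. The tokenizer and the coverage definition are deliberate simplifications; a production check would typically rely on a proper word-error-rate library instead.

```python
import re
from collections import Counter

def tokens(text: str) -> Counter:
    """Lowercase word tokens; a deliberately simple tokenizer for illustration."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def content_coverage(reference_page: str, extracted_text: str) -> float:
    """Fraction of reference words (with multiplicity) found in the extracted text."""
    ref, ext = tokens(reference_page), tokens(extracted_text)
    matched = sum(min(count, ext[word]) for word, count in ref.items())
    total = sum(ref.values())
    return matched / total if total else 1.0

reference = "Payment is due within 30 days. See footnote 4 for late fees."
extracted = "Payment is due within 30 days."  # OCR dropped the footnote
print(f"coverage = {content_coverage(reference, extracted):.0%}")  # -> coverage = 50%
```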

Chunking

  • Completeness: Combined chunks should reconstruct the full document. Word counts should match.
  • Segment Integrity: Chunks should align to natural boundaries (paragraphs, tables). Track a metric like “95% clean boundaries.”
  • Context Preservation: If a table or section spans chunks, mark relationships so downstream logic sees them as connected.
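
A minimal completeness check might look like the sketch below: it reassembles the chunks and compares word counts against the original document text. It assumes chunks carry no overlap; if your chunker adds overlapping context windows, the comparison needs to account for that.

```python
from typing import List

def word_count(text: str) -> int:
    return len(text.split())

def check_chunk_completeness(document_text: str, chunks: List[str],
                             tolerance: float = 0.01) -> bool:
    """Verify that non-overlapping chunks jointly cover the whole document.

    Allows a small tolerance for whitespace-normalisation differences.
    """
    original = word_count(document_text)
    reassembled = sum(word_count(chunk) for chunk in chunks)
    drift = abs(original - reassembled) / max(original, 1)
    print(f"original={original} words, chunks={reassembled} words, drift={drift:.2%}")
    return drift <= tolerance

doc = "Clause 1. Payment terms. Clause 2. Termination. Clause 3. Liability."
chunks = ["Clause 1. Payment terms.", "Clause 2. Termination."]  # Clause 3 lost
print(check_chunk_completeness(doc, chunks))  # False -> investigate the chunker
```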

Multi-page Table Handling

  • Header Insertion Accuracy: Validate that continued pages receive the correct table header (aim for accuracy in the high 90s so context is preserved across pages).
  • No False Headers: Ensure new tables aren’t mistakenly treated as continuations. Track a False Continuation Rate and push it to near zero.
  • Practical Check: Sample multi-page tables across docs to confirm consistent extraction and no missed rows.
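
Both table-handling metrics can be computed from a small labelled sample, as in the hedged sketch below. The `expected_header` and `is_true_continuation` fields are assumed annotations produced during ground-truth preparation, not outputs of any specific library.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class TablePageSample:
    """One table fragment from the labelled evaluation sample."""
    inserted_header: str        # header the pipeline attached to this fragment
    expected_header: str        # header annotated by the SME
    treated_as_continuation: bool
    is_true_continuation: bool  # ground truth: does it really continue a prior table?

def table_handling_metrics(samples: List[TablePageSample]) -> dict:
    continued = [s for s in samples if s.is_true_continuation]
    new_tables = [s for s in samples if not s.is_true_continuation]
    header_accuracy = (sum(s.inserted_header == s.expected_header for s in continued)
                       / len(continued)) if continued else 1.0
    false_continuation_rate = (sum(s.treated_as_continuation for s in new_tables)
                               / len(new_tables)) if new_tables else 0.0
    return {"header_insertion_accuracy": header_accuracy,
            "false_continuation_rate": false_continuation_rate}

samples = [
    TablePageSample("Assets | 2024 | 2023", "Assets | 2024 | 2023", True, True),
    TablePageSample("Assets | 2024 | 2023", "Liabilities | 2024 | 2023", True, True),
    TablePageSample("", "", True, False),  # a new table wrongly treated as a continuation
]
print(table_handling_metrics(samples))
# {'header_insertion_accuracy': 0.5, 'false_continuation_rate': 1.0}
```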

Structural Links / References

  • Link Accuracy: Confirm references (like footnotes or section anchors) map to the right targets (e.g., 98%+ correct).
  • Ontology / Category Coverage: If content is pre-grouped, check precision (no mis-grouping) and recall (nothing left uncategorized).

Implication

The goal is to ensure the pre-processed chunks are a faithful, complete, and structurally coherent representation of the original document. Metrics like content coverage, boundary cleanliness, and header accuracy help catch issues early. Fixing them here saves significant downstream debugging.

Labelling Evaluation – “Did we isolate the right pieces?”

Once we chunk the document, we label those chunks (with ML or rules) to map them to the right entities and throw out the noise. Think of this step as sorting useful clauses from filler.

Section/Entity Labelling Accuracy

Treat labelling as a multi-class or multi-label classification problem.

Precision (Label Accuracy):

Of the chunks we labelled as X, how many were actually X?

Example: The model tags 40 chunks as “Financial Data.” If 5 are wrong, precision is 87.5%. High precision avoids polluting a category (topic/entity) with junk.

Recall (Coverage):

Of the chunks that truly belong to category X, how many did we catch?

Example: Ground truth has 50 Financial Data chunks, model finds 45. Recall is 90%. High recall prevents missing important sections.

Example:

A model labels paper sections as Introduction, Methods, Results, etc. It marks 100 sections as Results and 95 are correct (95% precision). It misses 5 actual Results (slightly lower recall). That’s acceptable if downstream steps can still recover some items. But low precision means the labelling logic needs tightening.

Implication

Low precision means wrong info contaminates the category. Low recall means missing crucial bits. Use these metrics to refine definitions or adjust the labelling logic. Don’t just report one accuracy number; precision and recall per label tell the real story.
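
To get per-label precision and recall rather than a single accuracy number, a sketch like the following works; it assumes scikit-learn is available, and the chunk labels shown are made-up examples.

```python
from sklearn.metrics import classification_report

# Ground-truth labels vs. predicted labels for a handful of chunks (illustrative only).
y_true = ["Financial Data", "Financial Data", "Legal Clause", "Boilerplate",
          "Financial Data", "Legal Clause", "Boilerplate", "Financial Data"]
y_pred = ["Financial Data", "Legal Clause", "Legal Clause", "Boilerplate",
          "Financial Data", "Legal Clause", "Financial Data", "Financial Data"]

# Precision, recall and F1 are reported per label, which is what the
# "don't just report one accuracy number" advice above asks for.
print(classification_report(y_true, y_pred, zero_division=0))
```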

Retrieval Evaluation – “Can we find the right info when we ask?”

Many document pipelines use retrieval to narrow a huge file down to the few chunks most likely to contain the answer corresponding to a topic/entity. If we need a “termination date,” we first fetch chunks about dates or termination, then extract from those. Retrieval must be sharp, or everything downstream suffers.

Precision@K

How many of the top K retrieved chunks are actually relevant?

If we grab 5 chunks for “Key Clauses” and 4 are correct, Precision@5 is 80%.

We usually set K to whatever the next stage consumes (3 or 5). High precision keeps extraction clean. Average it across queries or fields. Critical fields may demand very high Precision@K.

Recall@K

Did we retrieve enough of the relevant chunks? If there are 2 relevant chunks in the doc but the top 5 results include only 1, recall is 50%.

Good recall means we aren’t missing mentions in other sections or appendices. Increasing K improves recall but can dilute precision. Tune both together.

Ranking Quality (MRR, NDCG)

If order matters, use rank-aware metrics.

MRR (Mean Reciprocal Rank): Measures how early the first relevant result appears. Perfect if it’s always at rank 1.

NDCG@K (Normalized Discounted Cumulative Gain): Rewards having the most relevant chunks at the top. Useful when relevance isn’t binary.

Most pipelines can get away with Precision@K and maybe MRR.
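
A minimal, library-free sketch of these retrieval metrics is shown below. Here `retrieved` is the ranked list of chunk IDs returned for a query and `relevant` is the annotated set of truly relevant chunk IDs; both are assumed to come from the ground-truth set described earlier, and the chunk IDs are hypothetical.

```python
from typing import Dict, List, Set

def precision_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of the top-K retrieved chunks that are actually relevant."""
    return sum(chunk in relevant for chunk in retrieved[:k]) / k

def recall_at_k(retrieved: List[str], relevant: Set[str], k: int) -> float:
    """Share of all relevant chunks that appear in the top-K results."""
    if not relevant:
        return 1.0
    return sum(chunk in relevant for chunk in retrieved[:k]) / len(relevant)

def mean_reciprocal_rank(results: Dict[str, List[str]],
                         ground_truth: Dict[str, Set[str]]) -> float:
    """Average of 1/rank of the first relevant chunk per query (0 if none found)."""
    reciprocal_ranks = []
    for query, retrieved in results.items():
        relevant = ground_truth[query]
        rank = next((i + 1 for i, c in enumerate(retrieved) if c in relevant), None)
        reciprocal_ranks.append(1.0 / rank if rank else 0.0)
    return sum(reciprocal_ranks) / len(reciprocal_ranks)

results = {"termination date": ["chunk_12", "chunk_07", "chunk_33"]}
ground_truth = {"termination date": {"chunk_07", "chunk_41"}}
print(precision_at_k(results["termination date"], ground_truth["termination date"], 3))  # ~0.33
print(recall_at_k(results["termination date"], ground_truth["termination date"], 3))     # 0.5
print(mean_reciprocal_rank(results, ground_truth))                                       # 0.5
```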

Implication

Suppose we test 50 QA pairs from policy documents, retrieving 3 passages per query:

  • Average Precision@3: 85%
  • Average Recall@3: 92%
  • MRR: 0.8

We then notice that “data retention” answers appear in appendices that sometimes rank low, so we increase K to 5 for that query type. Precision@3 rises to 90%, and Recall@5 hits roughly 99%.

Retrieval evaluation is a sanity check. If retrieval fails, extraction recall will tank no matter how good the extractor is. Measure both so we know where the leak is. Also keep an eye on latency and cost if fancy re-rankers slow things down.

Extraction Accuracy Evaluation – “Did we get the right answers?”

Look at each field and measure how often we got the right value.

  • Precision: Of the values we extracted, what percent are correct? Use exact match or a lenient version if small format shifts don’t matter. Report both when useful.
  • Recall: Out of all ground truth values, how many did we actually extract?
  • Per-field breakdown: Some fields will be easy (invoice numbers, dates), others messy (vendor names, free text). A simple table makes this obvious and shows where to focus improvements.
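
The per-field breakdown can be produced with a sketch like the one below. The `lenient_match` normalisation (case, whitespace, currency symbols) is one example policy and should be replaced with whatever tolerance your stakeholders actually accept; the field names and values are hypothetical.

```python
import re
from collections import defaultdict
from typing import Dict, List, Optional, Tuple

def lenient_match(extracted: str, truth: str) -> bool:
    """Example leniency: ignore case, whitespace, commas and currency symbols."""
    norm = lambda s: re.sub(r"[\s$,]", "", s).lower()
    return norm(extracted) == norm(truth)

def per_field_accuracy(records: List[Dict[str, Tuple[Optional[str], str]]]) -> None:
    """Each record maps field name -> (extracted value or None, ground-truth value)."""
    stats = defaultdict(lambda: {"extracted": 0, "correct": 0, "expected": 0})
    for record in records:
        for field, (extracted, truth) in record.items():
            stats[field]["expected"] += 1
            if extracted is not None:
                stats[field]["extracted"] += 1
                stats[field]["correct"] += lenient_match(extracted, truth)
    for field, s in stats.items():
        precision = s["correct"] / s["extracted"] if s["extracted"] else 0.0
        recall = s["correct"] / s["expected"]
        print(f"{field:15s} precision={precision:.0%} recall={recall:.0%}")

records = [
    {"invoice_number": ("INV-0042", "INV-0042"),
     "total": ("$1,200.00", "1200.00"),
     "vendor": (None, "Contoso Ltd")},  # missed extraction hurts recall only
]
per_field_accuracy(records)
```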

Error Analysis

Numbers don’t tell the whole story. Look at patterns:

  • OCR mix-ups
  • Bad date or amount formats
  • Wrong chunk retrieved upstream
  • Misread tables

Find the recurring mistakes. That’s where the fixes live.

Holistic Metrics

If needed, compute overall precision/recall across all extracted fields. But per-field and record-level are usually what matter to stakeholders.

Implication

Precision protects against wrong entries. Recall protects against missing data. Choose your balance based on risk:

  • If false positives hurt more (wrong financial numbers), favour precision.
  • If missing items hurts more (missing red-flag clauses), favour recall.

Continuous Improvement Loop with SME

Continuous improvement means treating evaluation as an ongoing feedback loop rather than a one-time check. Each phase’s errors point to concrete fixes, and every fix is re-measured to ensure accuracy moves in the right direction without breaking other components. The same framework also supports A/B testing alternative methods and monitoring real production data to detect drift or new document patterns. Because the evaluation stages are modular, they generalize well across domains such as contracts, financial documents, healthcare forms, or academic papers with only domain-specific tweaks. Over time, this creates a stable, scalable and measurable path toward higher accuracy, better robustness, and easier adaptation to new document types.

Conclusion

Building an end-to-end evaluation framework isn’t just about measuring accuracy; it’s about creating trust in the entire pipeline. By breaking the process into clear phases, defining robust ground truth, and applying precision/recall-driven metrics at every stage, we ensure that document processing systems are reliable, scalable, and adaptable. This structured approach not only highlights where improvements are needed but also enables continuous refinement through SME feedback and iterative testing. Ultimately, such a framework transforms evaluation from a one-time exercise into a sustainable practice, paving the way for higher-quality outputs across diverse domains.

Updated Dec 15, 2025
Version 1.0