Problem Space
Large semi-structured documents such as contracts, invoices, hospital tariff/rate cards, multi-page reports, and compliance records often carry essential information that is difficult to extract reliably with traditional approaches. Their layout can span several pages, the structure is rarely consistent, and related fields may appear far apart even though they must be interpreted together. This makes it hard not only to detect the right pieces of information but also to understand how those pieces relate across the document.
LLMs can help, but when documents are long and contain complex cross-references, they may still miss subtle dependencies or generate hallucinated information. That becomes risky in environments where small errors can cascade into incorrect decisions. At the same time, these documents don’t change frequently, while the extracted data is used repeatedly by multiple downstream systems at scale. Because of this usage pattern, a RAG-style pipeline is often not ideal in terms of cost, latency, or consistency. Instead, organizations need a way to extract data once, represent it consistently, and serve it efficiently in a structured form to a wide range of applications, many of which are not conversational AI systems.
At this point, data stewardship becomes critical, because once information is extracted, it must remain accurate, governed, traceable, and consistent throughout its lifecycle. When the extracted information feeds compliance checks, financial workflows, risk models, or end-user experiences, the organization must ensure that the data is not just captured correctly but also maintained with proper oversight as it moves across systems. Any extraction pipeline that cannot guarantee quality, reliability, and provenance introduces long-term operational risk.
The core problem, therefore, is finding a method that handles the structural and relational complexity of large semi-structured documents, minimizes LLM hallucination risk, produces deterministic results, and supports ongoing data stewardship so that the resulting structured output stays trustworthy and usable across the enterprise.
Target Use Cases
The potential applications for an Intelligent Document Processing (IDP) pipeline differ across industries. Several industry-specific use cases are provided as examples to guide the audience in conceptualising and implementing solutions tailored to their unique requirements.
Hospital Tariff Digitization for Tariff-Invoice Reconciliation in Health Insurance
Document types: Hospital tariff/rate cards, annexures/guidelines, pre-authorization guidelines, etc.
Technical Challenge:
- Charges for the same service might appear under different sections or for different hospital room types across different versions of tariff/rate cards.
- Table + free text mix, abbreviations, and cross-page references.
Downstream usage: Reimbursement orchestration, claims adjudication
Commercial Loan Underwriting in Banking
Document types: Balance sheets, cash-flow statements, auditor reports, collateral documents.
Technical Challenge:
- Ratios and covenants must be computed from fields located across pages.
- Contextual dependencies: “Net revenue excluding exceptional items” or footnotes that override values.
Downstream usage: Loan decisioning models, covenant monitoring, credit scoring.
Procurement Contract Intelligence in Manufacturing
Document types: Vendor agreements, SLAs, pricing annexures.
Technical Challenge:
- Pricing rules defined across clauses that reference each other.
- Penalty and escalation conditions hidden inside nested sections.
Downstream usage: Automated PO creation, compliance checks.
Regulatory Compliance Extraction
Document types: GDPR/HIPAA compliance docs, audit reports.
Technical Challenge:
- Requirements and exceptions buried across many sections.
- Extraction must be deterministic since compliance logic is strict.
Downstream usage: Rule engines, audit workflows, compliance checklist.
Solution Approaches
Problem Statement
Across industries from finance and healthcare to legal and compliance, large semi-structured documents serve as the single source of truth for critical workflows. These documents often span hundreds of pages, mixing free text, tables, and nested references. Before any automation can validate transactions, enforce compliance, or perform analytics, this information must be transformed into a structured, machine-readable format.
The challenge isn’t just size; it’s complexity. Rules and exceptions are scattered, relationships span multiple sections, and formatting inconsistencies make naive parsing unreliable. Errors at this stage ripple downstream, impacting reconciliation, risk models, and decision-making. In short, the fidelity of this digitization step determines the integrity of every subsequent process. Solving this problem requires a pipeline that can handle structural diversity, preserve context, and deliver deterministic outputs at scale.
Challenges
Many challenges can arise when processing such large, complex documents.
- Documents can run to roughly 200-250 pages.
- Document structures and layouts can be extremely complex.
- A single document or page may contain a mix of layouts such as tables, text blocks, and figures.
- Sometimes a single table can stretch across multiple pages, but only the first page contains the table header, leaving the remaining pages without column labels.
- A topic on one page may be referenced from a different page, so there can be complex inter-relationships among topics in the same document that need to be captured in a machine-readable format.
- A document can also be semi-structured (some parts are structured; other parts are unstructured free text).
- Downstream applications might not always be AI-assisted (they can be core analytics dashboards or existing enterprise legacy systems), so the structured storage of the digitized items needs to be well thought out before moving ahead with the solution.
Motivation Behind High Level Approach
- A large document (~200 pages) needs to be divided into smaller chunks so that each piece is readable and digestible (within context length) for the LLM.
- To make the LLM input truly context-aware, references must be maintained across pages (for example, the headers of long, continuous tables need to be injected into chunks that contain the table body without its header).
- If the documents cover a predefined set of topics/entities, information needs to be extracted per topic/entity to make the system truly context-aware.
- Different chunks can cover the same topic/entity, which turns grouping into a search problem.
- Retrieval needs to happen for every topic/entity so that all information related to a given topic/entity sits in a single place; as a result, downstream applications become efficient, scalable, and reliable over time.
Sample Architecture and Implementation
Building on the motivation outlined above, let’s walk through one possible architecture to demonstrate feasibility. The solution divides a large, semi-structured document into manageable chunks, making it easier to maintain context and references across pages. First, the document is split into logical sections. Then, OCR and layout extraction capture both text and structure, followed by structural analysis to preserve semantic relationships. Annotated chunks are labeled and grouped by entity, enabling precise extraction of items such as key-value pairs or table data. As a result, the system efficiently transforms complex documents into structured, context-rich outputs ready for downstream analytics and automation.
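Before detailing the components, here is a minimal orchestration sketch of this flow. The Chunk shape and the stage callables (analyze_layout, inject_headers, classify, retrieve, extract) are hypothetical placeholders for the components described in the next section, not a prescribed interface.

```python
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Chunk:
    chunk_id: str
    image_png: bytes = b""
    layout: dict = field(default_factory=dict)
    labels: list[str] = field(default_factory=list)

def run_pipeline(
    chunks: list[Chunk],
    analyze_layout: Callable[[Chunk], dict],
    inject_headers: Callable[[list[Chunk]], list[Chunk]],
    classify: Callable[[Chunk], list[str]],
    retrieve: Callable[[str], list[Chunk]],
    extract: Callable[[str, list[Chunk]], list[dict]],
    entities: list[str],
) -> list[dict]:
    """Wire the stages together: layout extraction -> header injection -> labelling -> grouping -> extraction."""
    for chunk in chunks:                        # chunks come from the chunking step (component 1)
        chunk.layout = analyze_layout(chunk)    # OCR & layout extraction (component 2)
    chunks = inject_headers(chunks)             # context-aware structural analysis (component 3)
    for chunk in chunks:
        chunk.labels = classify(chunk)          # labelling (component 4)
    items: list[dict] = []
    for entity in entities:
        grouped = retrieve(entity)              # entity-wise grouping (component 5)
        items.extend(extract(entity, grouped))  # item extraction (component 6)
    return items
```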
Architecture
Components
The key elements of the architecture diagram include components 1-6, which are code modules. Components 7 and 8 represent databases that store data chunks and extracted items, while component 9 refers to potential downstream systems that will use the structured data obtained from extraction.
- Chunking: Break documents into smaller, logical sections such as pages or content blocks. Enables parallel processing and improves context handling for large files. Technology: Python-based chunking logic using pdf2image and PIL for image handling; a minimal sketch follows this list.
- OCR & Layout Extraction: Convert scanned images into machine-readable text while capturing layout details like bounding boxes, tables, and reading order for structural integrity. Technology: Azure Document Intelligence or Microsoft Foundry Content Understanding Prebuilt Layout model combining OCR with deep learning for text, tables, and structure extraction; see the sketch after this list.
- Context Aware Structural Analysis: Analyse the extracted layout to identify document components such as headers, paragraphs, and tables. Preserves semantic relationships for accurate interpretation. Technology: Custom Python logic leveraging OCR output to inject missing headers, summarize layout (row/column counts, sections per page).
- Labelling: Assign entity-based labels to chunks according to predefined schema or SME input. Helps filter irrelevant content and focus on meaningful sections. Technology: Azure OpenAI GPT-4.1-mini with NLI-style prompts for multi-class classification.
- Entity-Wise Grouping: Organize chunks by entity type (e.g., invoice number, total amount) for targeted extraction. Reduces noise and improves precision in downstream tasks. Technology: Azure AI Search with Hybrid Search and Semantic Reranking for grouping relevant chunks.
- Item Extraction: Extract specific values such as key-value pairs, line items, or table data from grouped chunks. Converts semi-structured content into structured fields. Technology: Azure OpenAI GPT-4.1-mini with Set of Marking style prompts using layout clues (row × column, headers, OCR text).
- Interim Chunk Storage: Store chunk-level data including OCR text, layout metadata, labels, and embeddings. Supports traceability, semantic search, and audit requirements. Technology: Azure AI Search for chunk indexing and Azure OpenAI Embedding models for semantic retrieval.
- Document Store: Maintain final extracted items with metadata and bounding boxes. Enables quick retrieval, validation, and integration with enterprise systems. Technology: Azure Cosmos DB, Azure SQL DB, Azure AI Search, or Microsoft Fabric depending on downstream needs (analytics, APIs, LLM apps).
- Downstream Integration: Deliver structured outputs (JSON, CSV, or database records) to business applications or APIs. Facilitates automation and analytics across workflows. Technology: REST APIs, Azure Functions, or Data Pipelines integrated with enterprise systems.
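To make the Chunking component concrete, below is a minimal page-level chunking sketch using pdf2image and PIL, assuming each page becomes one chunk; production logic may instead group pages into logical sections. pdf2image requires the poppler utilities to be installed on the host.

```python
# Minimal page-level chunking: render each PDF page as a PNG image chunk with metadata.
from io import BytesIO
from pdf2image import convert_from_path  # pip install pdf2image (needs poppler)

def chunk_pdf_as_page_images(pdf_path: str, dpi: int = 200) -> list[dict]:
    """Render each page of a PDF as a PNG image chunk with minimal metadata."""
    pages = convert_from_path(pdf_path, dpi=dpi)  # list of PIL.Image objects
    chunks = []
    for page_no, image in enumerate(pages, start=1):
        buffer = BytesIO()
        image.save(buffer, format="PNG")          # PIL handles the PNG encoding
        chunks.append({
            "chunk_id": f"page-{page_no:04d}",
            "page_number": page_no,
            "image_png": buffer.getvalue(),
        })
    return chunks
```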
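And here is a hedged sketch of the OCR & Layout Extraction step for a single chunk image, using the prebuilt-layout model via the azure-ai-formrecognizer SDK (the newer azure-ai-documentintelligence package exposes an equivalent client). The returned dictionary shape is an illustrative choice, not a fixed contract.

```python
# Layout extraction for one chunk image with the prebuilt-layout model.
from azure.ai.formrecognizer import DocumentAnalysisClient  # pip install azure-ai-formrecognizer
from azure.core.credentials import AzureKeyCredential

def extract_layout(image_png: bytes, endpoint: str, key: str) -> dict:
    """Return OCR text plus a light layout summary (tables, paragraph roles) for a chunk image."""
    client = DocumentAnalysisClient(endpoint, AzureKeyCredential(key))
    result = client.begin_analyze_document("prebuilt-layout", image_png).result()
    tables = [
        {
            "row_count": table.row_count,
            "column_count": table.column_count,
            "cells": [
                {"row": cell.row_index, "col": cell.column_index,
                 "text": cell.content, "kind": cell.kind}
                for cell in table.cells
            ],
        }
        for table in result.tables
    ]
    return {
        "text": result.content,                                   # full reading-order text
        "tables": tables,                                         # used later for header injection
        "paragraph_roles": [(p.role, p.content) for p in result.paragraphs],
    }
```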
Algorithms
Consider these key algorithms when implementing the components above; illustrative code sketches for several of them follow this list:
- Structural Analysis – Inject headers: Detect tables page by page and compare the last row of a table on page i with the first row of a table on page i+1. If the column counts match and at least 4 of 5 style features (Font Weight, Background Colour, Font Style, Foreground Colour, Similar Font Family) match, mark it as a continuous table with a missing header and inject the previous page’s header into the next page’s table, repeating across pages.
- Labelling – Prompting Guide: Run NLI checks per SOC chunk image (ground on OCR text) across N curated entity labels, return {decision ∈ {ENTAILED, CONTRADICTED, NEUTRAL}, confidence ∈ [0,1]}, and output only labels where decision = ENTAILED and confidence > 0.7.
- Entity-Wise Grouping – Querying Chunks per Entity & Top‑50 Handling: Construct the query from the entity text and apply hybrid search with label filters for Azure AI Search, starting with chunks where the target label is sole, then expanding to observed co‑occurrence combinations under a cap to prevent explosion. If label frequency >50, run staged queries (sole‑label → capped co‑label combos); otherwise use a single hybrid search with semantic reranking, merge results and deduplicate before scoring.
- Entity-Wise Grouping – Chunk-to-entity relevance scoring: For each retrieved chunk, split the text into spans; compute cosine similarities to the entity text and take the mean s. Boost with a gated nonlinearity b = σ(k(s − m)) · s, where σ is the sigmoid function and k, m are tunables that emphasize mid-range relevance while suppressing very low s. Min–max normalize the re-ranker score r → r_norm, compute the final score F = α·b + (1 − α)·r_norm, and keep the chunk iff F ≥ τ.
- Item Extraction – Prompting Guide: Provide the chunk image as input and ground on visual structure (tables, headers, gridlines, alignment, typography) and document structural metadata to segment and align units; reconcile ambiguities via the OCR-extracted text, then enumerate associations by positional mapping (header ↔ column, row ↔ cell proximity) and emit normalized objects while filtering narrative/policy text by layout and pattern cues.
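A simplified sketch of the header-injection rule above. The page/table dictionary shape and per-row style dictionaries are assumptions about how a layout step might summarize each page; the 4-of-5 style-feature threshold and column-count check follow the algorithm as described.

```python
# Header injection: a table on page i+1 continues the table on page i when column counts
# match and at least 4 of 5 style features of the boundary rows match.
STYLE_FEATURES = ("font_weight", "background_colour", "font_style",
                  "foreground_colour", "font_family")

def style_matches(row_a: dict, row_b: dict, required: int = 4) -> bool:
    """Count matching style features between the style dicts of two boundary rows."""
    hits = sum(1 for feature in STYLE_FEATURES if row_a.get(feature) == row_b.get(feature))
    return hits >= required

def inject_missing_headers(pages: list[dict]) -> list[dict]:
    """pages[i] = {"tables": [{"column_count": int, "row_styles": [dict, ...], "header": [...]}]} (assumed shape)."""
    for prev, curr in zip(pages, pages[1:]):
        if not prev.get("tables") or not curr.get("tables"):
            continue
        prev_tbl, curr_tbl = prev["tables"][-1], curr["tables"][0]
        same_width = prev_tbl["column_count"] == curr_tbl["column_count"]
        if same_width and style_matches(prev_tbl["row_styles"][-1], curr_tbl["row_styles"][0]):
            # Continuous table detected: carry the previous page's header forward.
            curr_tbl["header"] = prev_tbl["header"]
            curr_tbl["is_continuation"] = True
    return pages
```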
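For the labelling step, a hedged sketch of the NLI-style check and the ENTAILED / confidence > 0.7 filter, grounded on a chunk's OCR text via Azure OpenAI chat completions. The deployment name, prompt wording, and JSON field names are illustrative assumptions.

```python
# NLI-style labelling: keep only entity labels that are ENTAILED with confidence > 0.7.
import json
import os
from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def label_chunk(ocr_text: str, entity_labels: list[str], deployment: str = "gpt-4.1-mini") -> list[str]:
    """Return the labels that the model marks as ENTAILED with confidence above 0.7."""
    accepted = []
    for label in entity_labels:
        prompt = (
            "You are checking whether a document chunk covers a given topic.\n"
            f"Hypothesis: this chunk contains information about '{label}'.\n"
            f"Chunk text:\n{ocr_text}\n\n"
            'Reply as JSON: {"decision": "ENTAILED|CONTRADICTED|NEUTRAL", "confidence": 0-1}'
        )
        response = client.chat.completions.create(
            model=deployment,
            messages=[{"role": "user", "content": prompt}],
            response_format={"type": "json_object"},
            temperature=0,
        )
        verdict = json.loads(response.choices[0].message.content)
        if verdict.get("decision") == "ENTAILED" and verdict.get("confidence", 0) > 0.7:
            accepted.append(label)
    return accepted
```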
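A sketch of the staged, label-filtered hybrid retrieval using the azure-search-documents SDK. The index field names (labels, label_count, content_vector, chunk_id), the semantic configuration name, the co-label cap of 5, and the embed() helper are assumptions; the sole-label-first strategy and the >50 frequency switch follow the guide above.

```python
# Staged hybrid retrieval per entity: sole-label chunks first, then capped co-label combinations.
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery  # pip install azure-search-documents

def retrieve_chunks_for_entity(search_client: SearchClient, entity_text: str, entity_label: str,
                               embed, label_frequency: int, co_label_filters: list[str],
                               top: int = 50) -> list[dict]:
    vq = VectorizedQuery(vector=embed(entity_text), k_nearest_neighbors=top, fields="content_vector")

    def hybrid(filter_expr: str) -> list[dict]:
        return list(search_client.search(
            search_text=entity_text,               # keyword side of the hybrid query
            vector_queries=[vq],                   # vector side of the hybrid query
            filter=filter_expr,
            query_type="semantic",                 # enable semantic reranking
            semantic_configuration_name="default",
            top=top,
        ))

    # "label_count" is an assumed index field holding the number of labels on the chunk.
    sole_only = f"labels/any(l: l eq '{entity_label}') and label_count eq 1"
    if label_frequency > 50:
        results = hybrid(sole_only)
        for co_filter in co_label_filters[:5]:     # cap co-label combinations to prevent explosion
            results += hybrid(co_filter)
    else:
        results = hybrid(f"labels/any(l: l eq '{entity_label}')")

    seen, deduped = set(), []
    for doc in results:                            # deduplicate before relevance scoring
        if doc["chunk_id"] not in seen:
            seen.add(doc["chunk_id"])
            deduped.append(doc)
    return deduped
```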
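The relevance-scoring formula translates directly into code. The span embeddings and the reranker-score bounds are treated as inputs, and the default values of k, m, α, and τ below are illustrative, since the guide leaves them as tunables.

```python
# Chunk-to-entity relevance: s = mean span similarity, b = sigmoid(k*(s-m))*s,
# F = alpha*b + (1-alpha)*r_norm, keep the chunk iff F >= tau.
import numpy as np

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))

def keep_chunk(span_embeddings: list[np.ndarray], entity_embedding: np.ndarray,
               reranker_score: float, r_min: float, r_max: float,
               k: float = 10.0, m: float = 0.5, alpha: float = 0.6, tau: float = 0.45) -> bool:
    s = float(np.mean([cosine(span, entity_embedding) for span in span_embeddings]))
    b = (1.0 / (1.0 + np.exp(-k * (s - m)))) * s                  # gated nonlinearity boost
    r_norm = (reranker_score - r_min) / (r_max - r_min + 1e-12)   # min-max normalized reranker score
    final_score = alpha * b + (1 - alpha) * r_norm
    return final_score >= tau
```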
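Finally, a sketch of how the item-extraction prompt might be assembled: the chunk image plus a layout summary and OCR text in a single multimodal message, asking for normalized JSON objects. The instruction wording and the output schema (item_name, value, unit, source_row, source_column) are assumptions; the payload follows the OpenAI chat-completions content format.

```python
# Build a multimodal extraction message: chunk image + layout summary + OCR text.
import base64

def build_extraction_messages(image_png: bytes, layout_summary: str, ocr_text: str) -> list[dict]:
    image_b64 = base64.b64encode(image_png).decode()
    instructions = (
        "Ground on the visual structure (tables, headers, gridlines, alignment) and the layout "
        "summary to segment units; reconcile ambiguities with the OCR text. Map each value to its "
        "header/row by position and return a JSON list of objects with keys: item_name, value, unit, "
        "source_row, source_column. Ignore narrative or policy text.\n\n"
        f"Layout summary:\n{layout_summary}\n\nOCR text:\n{ocr_text}"
    )
    return [{
        "role": "user",
        "content": [
            {"type": "text", "text": instructions},
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
        ],
    }]
```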
Deployment at Scale
There are several ways to implement a document extraction pipeline, each with its own pros and cons. The best deployment model depends on scenario requirements. Below are some common approaches with their advantages and disadvantages.
- Host as REST API
- Pros: Enables straightforward creation, customization, and deployment across scalable compute services such as Azure Kubernetes Service.
- Cons: Processing time and memory usage scale with document size and complexity, potentially requiring multiple iterations to optimize performance.
- Deploy as Azure Machine Learning (ML) Pipeline
- Pros: Facilitates efficient time and memory management, as Azure ML supports processing large datasets at scale.
- Cons: The pipeline may be more challenging to develop, customize, and maintain.
- Deploy as Azure Databricks Job
- Pros: Offers robust time and memory management similar to Azure ML, with advanced features such as Data Autoloader for detecting data changes and triggering pipeline execution.
- Cons: The solution is highly tailored to Azure Databricks and may have limited customization options.
- Deploy as Microsoft Fabric Pipeline
- Pros: Provides capabilities comparable to Azure ML and Databricks, and features like Fabric Activator replicate Databricks Autoloader functionality.
- Cons: Presents similar limitations found in Azure ML and Azure Databricks approaches.
Each method should be carefully evaluated to ensure alignment with technical and operational requirements.
Evaluation
Objective: The aim is to evaluate how accurately a document extraction pipeline extracts information by comparing its output with manually verified data.
Approach: Documents are split into sections, labelled, and linked to relevant entities; AI tools then extract key items through the pipeline outlined above. The extracted data is checked against expert-curated records using both exact and approximate (fuzzy) matching techniques.
Key Metrics:
- Individual Item Attribute Match: Assesses the system’s ability to identify specific item attributes using strict and flexible comparison methods.
- Combined Item Attribute Match: Evaluates how well multiple attributes are identified together, considering both exact and fuzzy matches.
- Precision Calculation: Precision for each metric reflects the proportion of correctly matched entries compared to all reference entries.
Findings for a real-world scenario: Fuzzy matching of item key attributes yields high precision (over 90%), but accuracy drops for key attribute combinations (between 43% and 48%). These results come from analysis across several datasets to ensure reliability.
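A minimal sketch of the exact and fuzzy attribute matching behind these metrics, using the standard-library difflib; the 0.85 similarity threshold and the record format are illustrative. Precision here is the share of reference entries matched by some extracted entry.

```python
# Exact + fuzzy attribute matching and precision against expert-curated reference records.
from difflib import SequenceMatcher

def fuzzy_equal(a: str, b: str, threshold: float = 0.85) -> bool:
    """Approximate string match after trimming and lowercasing."""
    return SequenceMatcher(None, a.strip().lower(), b.strip().lower()).ratio() >= threshold

def attribute_precision(extracted: list[dict], reference: list[dict],
                        attribute: str, fuzzy: bool = True) -> float:
    """Share of reference entries whose attribute is matched by at least one extracted entry."""
    matched = 0
    for ref in reference:
        target = str(ref[attribute])
        for ext in extracted:
            candidate = str(ext.get(attribute, ""))
            if candidate == target or (fuzzy and fuzzy_equal(candidate, target)):
                matched += 1
                break
    return matched / len(reference) if reference else 0.0
```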
How This Addresses the Problem Statement
The sample architecture described integrates sectioning, entity linking, and attribute extraction as foundational steps. Each extracted item is then evaluated against expert-curated datasets using both strict (exact) and flexible (fuzzy) matching algorithms. This approach directly addresses the problem statement by providing measurable metrics, such as individual and combined attribute match rates and precision calculations, that quantify the system’s reliability and highlight areas for improvement. Ultimately, this methodology ensures that the pipeline’s output is systematically validated, and its strengths and limitations are clearly understood in real-world contexts.
Plausible Alternative Approaches
No single approach fits every use case; the best method depends on factors like document complexity, structure, sensitivity, and length, as well as the types of downstream applications. Consider these alternative approaches for different scenarios:
- Using Azure OpenAI alone
- Using Azure OpenAI + Azure Document Intelligence + Azure AI Search: RAG-like solution
- Using Azure OpenAI + Azure Document Intelligence + Azure AI Search: non-RAG-like solution
Conclusion
Intelligent Document Processing for large semi-structured documents isn’t just about extracting data; it’s about building trust in that data. By combining Azure Document Intelligence for layout-aware OCR with Azure OpenAI models for contextual understanding, we create a well-thought-out, in-depth pipeline that is accurate, scalable, and resilient against complexity. Chunking strategies ensure context fits within model limits, while header injection and structural analysis preserve relationships across pages to keep the pipeline context-aware. Entity-based grouping and semantic retrieval transform scattered content into organized, query-ready data. Finally, rigorous evaluation with a scalable ground-truth strategy, using precision, recall, and fuzzy matching, closes the loop and ensures reliability for downstream systems. This pattern delivers more than automation; it establishes a foundation for compliance, analytics, and AI-driven workflows at enterprise scale. In short, it’s a blueprint for turning chaotic documents into structured intelligence: efficient, governed, and future-ready for any kind of downstream application.