Evaluating the quality of AI document data extraction with small and large language models

Microsoft

Jun 08, 2024

Evaluating the effectiveness of AI models in document data extraction. Comparing accuracy, speed, and cost-effectiveness between Small and Large Language Models (SLMs and LLMs).

Context

As the adoption of AI in solutions increases, technical decision-makers face challenges in selecting the most effective approach for document data extraction. Ensuring high quality is crucial, particularly when dealing with critical solutions where minor errors have substantial consequences. As the volume of documents increases, it becomes essential to choose solutions that can scale efficiently without compromising performance.

This article evaluates AI document data extraction techniques using Small Language Models (SLMs) and Large Language Models (LLMs). Including a specific focus on structured and unstructured data scenarios.

By evaluating models, the article provides insights into their accuracy, speed, and cost-efficiency for quality data extraction. It provides both guidance in evaluating models, as well as the quality of the outputs from models for specific scenarios.

Key challenges of effective document data extraction

With many AI models available to ISVs and Startups, challenges arise in which technique is the most effective for quality document data extraction. When evaluating the quality of AI models, key challenges include:

Ensuring high accuracy and reliability. High accuracy and confidence are crucial, especially for critical applications such as legal or financial documents. Minor errors in data extraction could lead to significant issues. Additionally, robust data validation mechanisms verify the data and minimize false positives and negatives.
Getting results in a timely manner. As the volume of documents increases, the selected approach must scale efficiently to handle large document quantities without significant impact. Balancing the need for fast processing speeds with maintaining high accuracy levels is challenging.
Balancing cost with accuracy and efficiency. Ensuring high accuracy and efficiency often requires the most advanced AI models, which can be expensive. Evaluating AI models and techniques highlights the most cost-effective solution without compromising on the quality of the data extraction.

When choosing an AI model for document data extraction on Azure, there is no one-size-fits-all solution. Depending on the scenario, one may outperform another for accuracy at the sacrifice of cost. While another model may provide sufficient accuracy at a much lower cost.

Establishing evaluation techniques for AI models in document data extraction

When evaluating AI models for document data extraction, it’s important to understand how they perform for specific use cases. This evaluation focused on structured and unstructured scenarios to provide insights into simple and complex document structures.

Evaluation Scenarios

Three example document scenarios for evaluating AI models in document data extraction

Structured Data: Invoices
- A collection of assorted invoices with varying simple and complex layouts, handwritten signatures, obscured content, and handwritten notes across margins.
Unstructured Data: Vehicle Insurance Policy
- A 10+ page vehicle insurance policy document containing both structured and unstructured data, including natural, domain-specific language with inferred data. This scenario focuses on extracting data by combining structured data with the natural language throughout the document.

Models and Techniques

This evaluation focused on multiple techniques for data extraction with the language models:

Markdown Extraction with Azure AI Document Intelligence. This technique involves converting the document into Markdown using the pre-built layout model in Azure AI Document Intelligence. Read more about this technique in our detailed article.
Vision Capabilities of Multi-Modal Language Models. This technique focuses on GPT-4o and GPT-4o Mini models by converting the document pages to images. This leverages the models’ capabilities to analyze both text and visual elements. Explore this technique in more detail in our sample project.
Comprehensive Combination. This technique combines both Markdown extraction with vision capable models to enhance the extraction process. Additionally, the layout analysis of Azure AI Document Intelligence will ease the human review of a document if the confidence or accuracy is low.

For each technique, the model is prompted using either Structured Outputs in GPT-4o or with inline JSON schemas for other models. This establishes the expected output, improving the overall accuracy of the generated response.

The AI models evaluated in this analysis include:

Phi-3.5 MoE, an SLM deployed as a serverless endpoint in Azure AI Studio
GPT-4o (2024-08-06), an LLM deployed with 10K TPM in Azure OpenAI
GPT-4o Mini (2024-07-18), an LLM deployed with 10K TPM in Azure OpenAI

Evaluation Methodology

To ensure a reliable and consistent evaluation, the following approach was established:

Baseline Accuracy. A single source of truth for the data extraction results ensures each model’s output is compared against a standard. This approach, while manually intensive, provides a precise measure for accuracy.
Confidence. To demonstrate when an extraction should be raised up to a human for review, each model provides an internal assessment on how certain it is about its predicted output. Azure OpenAI provides these confidence values as logprobs, while Azure AI Document Intelligence returns these confidence scores by default in the response.
Execution Time. This is calculated based on the time between the initial request for data extraction to the response, without streaming. For scenarios utilizing the Markdown technique, the time is based on the end-to-end processing, including the request and response from Azure AI Document Intelligence.
Cost Analysis. Using the average input and output tokens from each iteration, the estimated cost per 1,000 pages is calculated, providing a clearer picture of cost-effectiveness at scale.
Consistent Prompting. Each model has the same system and extraction prompt. The system prompt is consistent across all scenarios as “You are an AI assistant that extracts data from documents”. Each scenario has its own extraction prompt, including the output schema.
Multiple Iterations. 10 variants of the document are run per model technique. Every property in the result compares for an exact match against the standard response. This provides the results for accuracy, confidence, execution time, and cost.

These metrics establish the baseline evaluation. By establishing the baseline, it is possible to experiment with the prompt, schema, and request configuration. This allows you to compare improvements in the overall quality by evaluating the accuracy, confidence, speed, and cost.

For the evaluation outlined in this article, we created a Python test project with multiple test cases. Each test case is a combination of a specific use case and model. Additionally, each test case is run independently. This is to ensure that the speed is evaluated fairly for each request.

The tests take advantage of the Python SDKs for both Azure AI Document Intelligence and Azure OpenAI.

Evaluating AI Models for Structured Data

Complex Invoice Document

Model	Technique	Accuracy (95th)	Confidence (95th)	Speed (95th)	Est. Cost (1,000 pages)
GPT-4o	Vision	98.99%	99.85%	22.80s	$7.45
GPT-4o	Vision + Markdown	96.60%	99.82%	22.25s	$19.47
Phi-3.5 MoE	Markdown	96.11%	99.49%	54.00s	$10.35
GPT-4o	Markdown	95.66%	99.44%	31.60s	$16.11
GPT-4o Mini	Vision + Markdown	91.84%	99.99%	56.69s	$18.14
GPT-4o Mini	Vision	79.31%	99.76%	56.71s	$8.02
GPT-4o Mini	Markdown	78.61%	99.76%	24.52s	$10.41

When processing invoices in our analysis, GPT-4o with Vision capabilities stands out as the most ideal combination. This approach delivers the highest accuracy and confidence scores, effectively handling complex layouts and visual elements. Additionally, it handles this at reasonable speeds at significantly lower costs.

Accuracy in our evaluation shows that overall, most models in the evaluation can be regarded as having high accuracy. GPT-4o with Vision processing achieves the highest scores for invoices. While our assumptions that providing the additional document text context would increase this, our analysis showed that it's possible to retain high accuracy without it.
Confidence levels are high across models and techniques, demonstrating that combined with high accuracy, these approaches perform well for automated processing with minimal human intervention.
Speed is a crucial factor for scalability of a document processing pipeline. For background processing per document, GPT-4o models can process all techniques in a quick timescale. In contrast, small language models like Phi-3.5 MoE are took longer which could impact throughput for large-scale applications.
Cost-effectiveness is also essential when building a scalable pipeline to process thousands of document pages. GPT-4o with Vision stands out as the most cost-effective at $7.45 per 1,000 pages. However, all models in Vision or Markdown techniques offer high value when also considering their accuracy, confidence, and speed.

One significant benefit of using GPT-4o with Vision processing is its ability to handle visual elements such as handwritten signatures, obscured content, and stamps. By processing the document as an image, the model minimizes false positives and negatives that can arise when relying solely on text-based Markdown processing.

Phi-3.5 MoE is a notable highlight when it comes to the use of small language models. The analysis demonstrates these models are just as capable at processing documents into structured JSON outputs as the more advanced large language models.

For this Invoice analysis, GPT-4o with Vision provides the best balance between accuracy, confidence, speed, and cost. It is particularly adept at handling documents with complex layouts and visual elements, making it a suitable choice for extracting structured data from a diverse range of invoices.

Evaluating AI Models for Unstructured Data

Complex Vehicle Insurance Document

Model	Technique	Accuracy (95th)	Confidence (95th)	Speed (95th)	Est. Cost (1,000 pages)
GPT-4o	Vision + Markdown	100%	99.35%	68.93s	$13.96
GPT-4o	Markdown	98.25%	89.03%	134.85s	$12.24
GPT-4o	Vision	97.04%	98.71%	66.24s	$2.31
GPT-4o Mini	Markdown	93.25%	89.04%	99.78s	$10.12
GPT-4o Mini	Vision + Markdown	82.99%	99.16%	101.89s	$15.71
GPT-4o Mini	Vision	67.25%	98.73%	83.01s	$5.67
Phi-3.5 MoE	Markdown	64.99%	88.28%	102.89s	$10.16

When extracting structured data from large, unstructured documents, such as insurance policies, the combination of GPT-4o with both Vision and Markdown techniques proves to be the most ideal solution. This hybrid approach leverages the visual context of the document's layout alongside the structured textual representation, resulting in the highest degrees of accuracy and confidence. It effectively handles the complexity of domain-specific language and inferred fields, providing a comprehensive and precise extraction process.

Accuracy is spread across all models when extracting data from larger quantities of unstructured text. GPT-4o utilizing both Vision and Markdown demonstrates the effectiveness of combining visual and textual context for documents containing natural language.
Confidence varies also in comparison to the Invoice analysis, with less certainty from the models when extracting from large blocks of text. However, analyzing the confidence scores of GPT-4o for each technique shows that building on them towards a comprehensive approach yields higher confidence.
Speed of execution will naturally increase as the number of pages, complexity of layout, and quantity of text increases. These techniques for large, unstructured documents are likely to be reserved for background, batch processing than real-time applications.
Cost varies when utilizing multiple Azure services to perform document data extraction. However, the overall cost for GPT-4o with both Vision and Markdown demonstrates where utilizing multiple AI services to achieve a goal can yield exceptional accuracy and confidence. This leads to automated solutions that require minimal human intervention.

The combination of Vision and Markdown techniques can offer a highly efficient approach to structured document data extraction. However, while highly accurate, models like GPT-4o and 4o Mini are bound by their maximum context window of 128K tokens. When processing text and images in a single request, you may need to consider chunking or classification techniques to break down large documents into smaller document boundaries.

Highlighting the specific capabilities of Phi-3.5 MoE, it falls short in accuracy. This lower performance indicates limitations in handling large, complex natural language that requires understanding and inference to extract data accurately. While optimizations can be made in prompts to improve accuracy, this analysis highlights the importance of evaluating and selecting a model and technique that aligns with the specific demands of your document extraction scenarios.

Key Evaluation Findings

Accuracy: For most extraction scenarios, advanced large language models like GPT-4o consistently deliver high accuracy and confidence levels. They are particularly effective at managing complex layouts and accurately extracting data from both visual and text context.
Cost-Effectiveness: Language models with vision capabilities are highly cost-effective for large-scale processing, with GPT-4o demonstrating costs below $10 per 1,000 pages in all scenarios where vision was used solely. However, the cost-benefit of using a hybrid Vision and Markdown approach can be justified in certain scenarios where high precision is required.
Speed: The time of execution for document varies depending on the number of pages, layout complexity, and quantity of text. For most scenarios, using language models for document data extraction demonstrates the capabilities for large-scale background processing, rather than real-time applications.
Limitations: Smaller models, like Phi-3.5 MoE, indicate limitations when handling complex documents with large unstructured text. However, they excel with minimal prompting for smaller, structured documents, such as invoices.
Comprehensive Techniques: Combining both text and vision techniques provides an effective strategy for highly accurate, highly confident data extraction from documents. The approach enhances the extraction, particularly for documents that include complex layout, visual elements, and complex, domain-specific, natural language.

Recommendations for Evaluating AI Models in Document Data Extraction

High-Accuracy Solutions. For solutions where accuracy is critical or visual elements must be evaluated, such as medical records, legal cases, or financial reports, explore GPT-4o with both Vision and Markdown capabilities. Its high performance in accuracy and confidence justifies the investment.
Text-Based or Self-Hosted Solutions. For text-based document extractions where self-hosting a model is necessary, small open language models, such as Phi-3.5 MoE, can provide high accuracy in data extraction comparable to OpenAI's GPT-4o.
Adopt Evaluation Techniques. Implement a rigorous evaluation methodology like the one used in this analysis. Establishing a baseline for accuracy, speed, and cost through multiple iterations and consistent prompting ensures reliable and comparable results. Regularly conduct evaluations when considering new techniques, models, prompts, and configurations. This helps in making informed decisions when opting for an approach in your specific use cases.