Evaluating the quality of AI document data extraction with small and large language models
Published Jun 08 2024 05:55 AM 1,292 Views
Microsoft

Evaluating the effectiveness of AI models in document data extraction. Comparing accuracy, speed, and cost-effectiveness between Small and Large Language Models (SLMs and LLMs).

 

Context

 

As the adoption of AI in solutions increases, technical decision-makers face challenges in selecting the most effective approach for document data extraction. Ensuring high quality is crucial, particularly when dealing with critical solutions where minor errors have substantial consequences. As the volume of documents increases, it becomes essential to choose solutions that can scale efficiently without compromising performance.

 

This article evaluates AI document data extraction techniques using Small Language Models (SLMs) and Large Language Models (LLMs). Including a specific focus on structured and unstructured data scenarios.

 

By evaluating models, the article provides insights into their accuracy, speed, and cost-efficiency for quality data extraction. It provides both guidance into evaluating models, as well as the quality of the outputs from models for specific scenarios.

 

Key challenges of effective document data extraction

 

With many AI models available to ISVs and Digital Natives, challenges arise in which technique is the most effective for quality document data extraction. When evaluating the quality of AI models, key challenges include:

 

  • Ensuring high accuracy and reliability. High accuracy is crucial, especially for critical applications such as legal or financial documents. Minor errors in data extraction could lead to significant issues. Additionally, robust data validation mechanisms verify the data and minimize false positives and negatives.
  • Getting results in a timely manner. As the volume of documents increases, the selected approach must scale efficiently to handle large document quantities without significant impact. Balancing the need for fast processing speeds with maintaining high accuracy levels is challenging.
  • Balancing cost with accuracy and efficiency. Ensuring high accuracy and efficiency often requires the most advanced AI models, which can be expensive. Evaluating AI models and techniques highlights the most cost-effective solution without compromising on the quality of the data extraction.

When choosing an AI model for document data extraction on Azure, there is no one-size-fits-all solution. Depending on the scenario, one may outperform another for accuracy at the sacrifice of cost. While another model may provide sufficient accuracy at a much lower cost.

 

Establishing evaluation techniques for AI models in document data extraction

 

When evaluating AI models for document data extraction, it’s important to understand how they perform for specific use cases. This evaluation focused on structured and unstructured scenarios to provide insights into simple and complex document structures.

 

Evaluation Scenarios

 

Three example document scenarios for evaluating AI models in document data extractionThree example document scenarios for evaluating AI models in document data extraction

 

  1. Structured Data: Invoices
    • Simple: A 2-page invoice, including returns, with a clear table structure, well-defined columns, handwritten signatures, and typed text.
    • Complex: A 2-page invoice with a grid-based layout, handwritten signatures, overlapping content, and handwritten notes spanning multiple rows.
  2. Unstructured Data: Vehicle Insurance
    • A 13-page vehicle insurance document containing both structured data in initial pages, and natural, domain-specific language on subsequent pages. This scenario focuses on extracting data by combining structured data with the natural language throughout the document.

 

Models and Techniques

 

This evaluation focused on two techniques for data extraction with the language models:

 

  • Markdown Extraction with Azure AI Document Intelligence. This technique involves converting the document into Markdown using the pre-built layout model in Azure AI Document Intelligence. Read more about this technique in our detailed article.
  • Vision Capabilities of Multi-Modal Language Models. This technique focuses on GPT-4 Turbo and Omni models by converting the document pages to images. This leverages the models’ capabilities to analyze both text and visual elements. Explore this technique in more detail in our sample project.

For each technique, the model is prompted using a one-shot technique, providing the expected output schema for the response. This establishes the intention, improving the overall accuracy of the generated output.

 

The AI models evaluated in this analysis include:

 

 

Evaluation Methodology

 

To ensure a reliable and consistent evaluation, the following approach was established:

 

  1. Baseline Accuracy. A single source of truth for the data extraction results ensures each model’s output is compared against a standard. This approach, while manually intensive, provides a precise measure for the accuracy.
  2. Execution Time. This is calculated based on the time between the initial request for data extraction to the response, without streaming. For scenarios utilizing the Markdown technique, the time is based on the end-to-end processing, including the request and response from Azure AI Document Intelligence.
  3. Cost Analysis. Using the average input and output tokens from each iteration, the estimated cost per 1,000 pages is calculated, providing a clearer picture of cost-effectiveness at scale.
  4. Consistent Prompting. Each model has the same system and extraction prompt. The system prompt is consistent across all scenarios as “You are an AI assistant that extracts data from documents and returns them as structured JSON objects. Do not return as a code block”. Each scenario has its own extraction prompt including the output schema.
  5. Multiple Iterations. Each document is run 20 times per model technique. Every property in the result compares for an exact match against the standard response. This establishes the averages for accuracy, execution time, and cost.

These metrics establish the baseline evaluation. By establishing the baseline, it is possible to experiment with the prompt, schema, and request configuration. This allows you to compare improvements in the overall quality by evaluating the accuracy, speed, and cost.

 

For the evaluation outlined in this article, we created a .NET NUnit test project with multiple fixtures and cases. The tests take advantage of the .NET SDKs for both Azure AI Document Intelligence and Azure OpenAI.

 

Each model and technique combination per scenario is run independently. This is to ensure that the speed is evaluated fairly for each request.

 

You can find the repository for this evaluation on GitHub.

 

Evaluating AI Models for Structured Data

 

Simple Invoice Document Structure

 

Model
Technique
Accuracy (avg)
Speed (avg)
Cost (est) /
1,000 pages
Phi-3-Mini-128K-Instruct
Markdown
71%
19.91s
$10.39
GPT-3.5-Turbo
Markdown
83%
24.49s
$10.49
GPT-4-Turbo
Markdown
91%
62.79s
$20.46
GPT-4-Omni
Markdown
90%
26.19s
$15.19
GPT-4-Turbo
Vision
91%
20.74s
$16.37
GPT-4-Omni
Vision
90%
26.67s
$7.46

 

The results of this scenario indicates there is consistency across all models in both accuracy and speed. GPT-4 Turbo, when processing Markdown, presents an outlier in this scenario for speed.

 

  • GPT-4 Turbo and GPT-4 Omni for both techniques have the highest accuracy, while Phi-3 Mini has the lowest.
  • Phi-3 Mini and GPT-4 Turbo (Vision) are the fastest at processing. GPT-4 Turbo (Markdown) has the worst speed, almost 3x slower than all other models and techniques.
  • GPT-4 Omni (Vision) is significantly cheaper per 1,000 pages than other models. GPT-4 Turbo (Markdown) is almost 3x more expensive than GPT-4 Omni using Vision capabilities.

It is important to note that Markdown conversion strips away visual elements, providing only the result of OCR as text. This can often lead to potential misinterpretations, such as false positives for signatures. When using models with vision capabilities, visual elements are often interpreted correctly, resulting in a higher true positive accuracy.

 

For high-accuracy requirements, GPT-4-Omni with Vision capabilities are the best choice due to its excellent performance and cost-effectiveness. For simpler tasks where speed is a priority, models like Phi-3-Mini-128K-Instruct can be considered, but with the understanding that accuracy will be significantly lower.

 

Complex Invoice Document Structure

 

Model
Technique
Accuracy (avg)
Speed (avg)
Cost (est) /
1,000 pages
Phi-3-Mini-128K-Instruct
Markdown
93%
22.03s
$10.43
GPT-3.5-Turbo
Markdown
91%
15.52s
$10.57
GPT-4-Turbo
Markdown
94%
89.81s
$21.93
GPT-4-Omni
Markdown
94%
21.20s
$15.94
GPT-4-Turbo
Vision
91%
19.71s
$20.31
GPT-4-Omni
Vision
99%
15.65s
$9.16

 

The results of this scenario indicate that all models provide high accuracy for extracting structured data from invoices, but key differences in speed and cost highlight the trade-offs.

 

  • All models have 90%+ accuracy in this scenario. However, GPT-4 Omni (Vision) stands out as the highest accuracy for extraction, with minimal error. This is particularly beneficial for solutions where data integrity is critical, such as financial reporting and compliance.
  • Both GPT-4 Omni (Vision) and GPT-3.5 Turbo (Markdown) are the fastest for end-to-end data extraction. Like the previous example, GPT-4 Turbo (Markdown) has the worst speed, almost 6x slower than GPT-3.5 Turbo (Markdown).
  • Like the previous example, GPT-4 Omni (Vision) is cheaper per 1,000 pages than other models. However, closely followed by Phi-3 Mini in this scenario highlights how small language models can perform just as well for data extraction scenarios.

Models using vision techniques, like GPT-4 Omni (Vision), excel in interpreting visual elements of documents, particularly where they overlap content or require visual clues to direct the output. This minimizes false positives that occur when models interpret these elements based purely on surrounding textual clues.

 

For applications heavily reliant on visual data extraction, avoid extracting data using the Markdown technique and prefer vision-based models like GPT-4 Omni (Vision). For pure text-based extractions where speed is a priority, Markdown-based models like GPT-3.5 Turbo can be suitable.

 

However, in this specific use case, GPT-4 Omni (Vision) is the best overall technique, providing high accuracy and speed, at lower costs compared to other models.

 

Evaluating AI Models for Unstructured Data

 

Complex Vehicle Insurance Document

 

Model
Technique
Accuracy (avg)
Speed (avg)
Cost (est) /
1,000 pages
Phi-3-Mini-128K-Instruct
Markdown
37%
21.50s
$10.25
GPT-3.5-Turbo
Markdown
80%
25.06s
$10.33
GPT-4-Turbo
Markdown
84%
86.10s
$16.66
GPT-4-Omni
Markdown
95%
28.56s
$13.29
GPT-4-Turbo
Vision
86%
41.98s
$8.54
GPT-4-Omni
Vision
99%
39.88s
$3.63

 

The results of this scenario indicate that more advanced, multi-modal models excel in accuracy and cost efficiency for extract data from unstructured documents. It can be assumed that the advanced nature of the language model allows it to better interpret contextual clues in the natural language of the document to infer the expected output.

 

  • Accuracy is spread across models for unstructured data, with GPT-4 Omni (Vision) providing the most accurate with minimal error. The complexity in domain language and natural language processing for extracting values results in poor accuracy for small language models, such as Phi-3 Mini.
  • Speed is also varied across models, possibly due to the increase in the number of pages, as well as the complexity of the language understanding required to extract specific values from text-based rules in the contract. GPT-4 Turbo (Markdown) continues to provide the worst speed, a trend recognized across all scenarios.
  • In this specific scenario, GPT-4 Omni (Vision) is significantly cheaper than other model scenarios while achieving the highest accuracy. To the next cheapest, the factor is over 2x, and over 4x to the most expensive, GPT-4 Turbo (Markdown). This factor could drastically reduce the overall cost for large-scale document processing solutions.

In analyzing the extraction of data from complex unstructured documents, GPT-4 Omni (Vision) outperforms all other models and techniques. This superiority is seen across accuracy, speed, and cost.

 

While highly accurate, GPT-4 with Vision models have a limit of 10 images per request which can put a limit on how many pages can be processed in a single request. Effective pre-processing, such as stitching pages together, is essential to maximize the accuracy of the extraction. However, avoid overloading images with too many pages, as this can reducing the overall resolution of text, significantly degrading the model’s performance. Where processing of large page documents is required, considering the adoption of the Markdown extraction technique with advanced models such as GPT-4 Omni may be preferable.

 

Conclusion

 

Effective evaluation of AI document data extraction techniques using small language models (SLMs) and large language models (LLMs) can reveal benefits and drawbacks, guiding in the selection of the optimal approach for specific use cases.

 

The key findings from our analysis shows:

 

  • For high-accuracy requirements, especially in critical applications such as legal or financial documents, GPT-4 Omni with Vision capabilities stand out. It consistently delivers the highest accuracy across both structured and unstructured data scenarios.
  • SLMs like Phi-3 Mini-128K, while cost-effective, show significant limitations in accuracy, particularly with complex and unstructured documents.
  • Speed varies significantly between models and techniques. GPT-4 Turbo (Markdown) consistently shows the worst performance in terms of speed, making it less suitable for time-sensitive applications.
  • Models using Vision techniques, such as GPT-4 Omni (Vision), offer a balance of high accuracy and reasonable speed, making them ideal for applications requiring fast and accurate data extraction.
  • Cost considerations are crucial when scaling to large volumes of documents. GPT-4 Omni (Vision) not only provides high accuracy but also proves to be the most cost-effective per 1,000 pages, especially in scenarios with complex unstructured data.

 

Recommendations for Evaluating AI Models in Document Data Extraction

 

  • High-Accuracy Solutions. For solutions where accuracy is critical or visual elements are necessary, such as financial reporting or compliance, evaluate GPT-4 Omni with Vision capabilities. Its superior performance in accuracy and cost-effectiveness justifies the investment.
  • Text-Based Extractions. For simpler, text-based document extractions where speed is a priority, consider models like GPT-3.5 Turbo using Markdown. It provides sufficient accuracy at a lower cost and faster processing time.
  • Adopt Evaluation Techniques. Implement a rigorous evaluation methodology like the one used in this analysis. Establishing a baseline for accuracy, speed, and cost through multiple iterations and consistent prompting ensures reliable and comparable results. Regularly conduct evaluations when considering new techniques, models, prompts, and configurations. This helps in making guided decisions when opting for an approach in your specific use cases.

 

Read more on AI Document Intelligence

 

Thank you for taking the time to read this article. We are sharing our insights for ISVs and Digital Natives that enable document intelligence in their AI-powered solutions, based on real-world challenges we encounter. We invite you to continue your learning through our additional insights in this series.

 

 

Further Reading

Co-Authors
Version history
Last update:
‎Jun 08 2024 06:03 AM
Updated by: