Evaluating the quality of AI document data extraction with small and large language models
Evaluating the effectiveness of AI models in document data extraction. Comparing accuracy, speed, and cost-effectiveness between Small and Large Language Models (SLMs and LLMs).
Context
As the adoption of AI in solutions increases, technical decision-makers face challenges in selecting the most effective approach for document data extraction. Ensuring high quality is crucial, particularly when dealing with critical solutions where minor errors have substantial consequences. As the volume of documents increases, it becomes essential to choose solutions that can scale efficiently without compromising performance.
This article evaluates AI document data extraction techniques using Small Language Models (SLMs) and Large Language Models (LLMs). Including a specific focus on structured and unstructured data scenarios.
By evaluating models, the article provides insights into their accuracy, speed, and cost-efficiency for quality data extraction. It provides both guidance into evaluating models, as well as the quality of the outputs from models for specific scenarios.
Key challenges of effective document data extraction
With many AI models available to ISVs and Startups, challenges arise in which technique is the most effective for quality document data extraction. When evaluating the quality of AI models, key challenges include:
- Ensuring high accuracy and reliability. High accuracy is crucial, especially for critical applications such as legal or financial documents. Minor errors in data extraction could lead to significant issues. Additionally, robust data validation mechanisms verify the data and minimize false positives and negatives.
- Getting results in a timely manner. As the volume of documents increases, the selected approach must scale efficiently to handle large document quantities without significant impact. Balancing the need for fast processing speeds with maintaining high accuracy levels is challenging.
- Balancing cost with accuracy and efficiency. Ensuring high accuracy and efficiency often requires the most advanced AI models, which can be expensive. Evaluating AI models and techniques highlights the most cost-effective solution without compromising on the quality of the data extraction.
When choosing an AI model for document data extraction on Azure, there is no one-size-fits-all solution. Depending on the scenario, one may outperform another for accuracy at the sacrifice of cost. While another model may provide sufficient accuracy at a much lower cost.
Establishing evaluation techniques for AI models in document data extraction
When evaluating AI models for document data extraction, it’s important to understand how they perform for specific use cases. This evaluation focused on structured and unstructured scenarios to provide insights into simple and complex document structures.
Evaluation Scenarios
- Structured Data: Invoices
- A 2-page invoice with a grid-based layout, handwritten signatures, overlapping content, and handwritten notes spanning multiple rows.
- Unstructured Data: Vehicle Insurance Policy
- A 13-page vehicle insurance policy document containing both structured and unstructured data, including natural, domain-specific language with inferred data. This scenario focuses on extracting data by combining structured data with the natural language throughout the document.
Models and Techniques
This evaluation focused on two techniques for data extraction with the language models:
- Markdown Extraction with Azure AI Document Intelligence. This technique involves converting the document into Markdown using the pre-built layout model in Azure AI Document Intelligence. Read more about this technique in our detailed article.
- Vision Capabilities of Multi-Modal Language Models. This technique focuses on GPT-4o and GPT-4o Mini models by converting the document pages to images. This leverages the models’ capabilities to analyze both text and visual elements. Explore this technique in more detail in our sample project.
For each technique, the model is prompted using a one-shot learning technique, providing the expected output schema for the response. This establishes the intention, improving the overall accuracy of the generated output.
The AI models evaluated in this analysis include:
- Phi-3 Mini 128K Instruct, an SLM deployed as a serverless endpoint in Azure AI Studio
- Phi-3.5 Mini Instruct, an SLM deployed as a serverless endpoint in Azure AI Studio
- GPT-3.5 Turbo (1106), an LLM deployed with 10K TPM in Azure OpenAI
- GPT-4o (2024-05-13), an LLM deployed with 10K TPM in Azure OpenAI
- GPT-4o Mini (2024-07-18), an LLM deployed with 10K TPM in Azure OpenAI
Evaluation Methodology
To ensure a reliable and consistent evaluation, the following approach was established:
- Baseline Accuracy. A single source of truth for the data extraction results ensures each model’s output is compared against a standard. This approach, while manually intensive, provides a precise measure for the accuracy.
- Execution Time. This is calculated based on the time between the initial request for data extraction to the response, without streaming. For scenarios utilizing the Markdown technique, the time is based on the end-to-end processing, including the request and response from Azure AI Document Intelligence.
- Cost Analysis. Using the average input and output tokens from each iteration, the estimated cost per 1,000 pages is calculated, providing a clearer picture of cost-effectiveness at scale.
- Consistent Prompting. Each model has the same system and extraction prompt. The system prompt is consistent across all scenarios as “You are an AI assistant that extracts data from documents and returns them as structured JSON objects. Do not return as a code block”. Each scenario has its own extraction prompt including the output schema.
- Multiple Iterations. Each document is run 20 times per model technique. Every property in the result compares for an exact match against the standard response. This establishes the averages for accuracy, execution time, and cost.
These metrics establish the baseline evaluation. By establishing the baseline, it is possible to experiment with the prompt, schema, and request configuration. This allows you to compare improvements in the overall quality by evaluating the accuracy, speed, and cost.
For the evaluation outlined in this article, we created a .NET NUnit test project with multiple fixtures and cases. The tests take advantage of the .NET SDKs for both Azure AI Document Intelligence and Azure OpenAI.
Each model and technique combination per scenario is run independently. This is to ensure that the speed is evaluated fairly for each request.
You can find the repository for this evaluation on GitHub.
Evaluating AI Models for Structured Data
Complex Invoice Document
Model
|
Technique
|
Accuracy (avg)
|
Speed (avg)
|
Cost (est) /
1,000 pages |
GPT-4o
|
Vision
|
99%
|
24.04s
|
$9.16
|
GPT-4o
|
Markdown
|
96%
|
24.32s
|
$13.12
|
Phi-3 Mini 128K Instruct
|
Markdown
|
93%
|
22.03s
|
$10.43
|
GPT-4o Mini
|
Vision
|
91%
|
29.90s
|
$6.26
|
GPT-3.5-Turbo
|
Markdown
|
91%
|
20.35s
|
$10.57
|
Phi-3.5 Mini Instruct
|
Markdown
|
81%
|
31.16s
|
$10.59
|
GPT-4o Mini
|
Markdown
|
78%
|
28.37s
|
$10.30
|
The results of this scenario indicate that all models provide high accuracy for extracting structured data from invoices, but key differences in speed and cost highlight the trade-offs.
- Most models have 90%+ accuracy in this scenario. However, GPT-4o (Vision) stands out as the highest accuracy for extraction, with minimal error. This is particularly beneficial for solutions where data integrity is critical, such as financial reporting, medical records, and compliance.
- All models generally provide fast end-to-end data extraction.
- GPT-4o Mini (Vision) is the cheapest per 1,000 pages, with GPT-4o (Vision) closely behind.
It is important to note that Markdown conversion strips away visual elements, providing only the result of OCR as text. This can often lead to potential misinterpretations, such as false positives for signatures.
Models using vision techniques, like GPT-4o (Vision), excel in interpreting visual elements of documents, particularly where they overlap content or require visual clues to direct the output. This minimizes false positives that occur when models interpret these elements based purely on surrounding textual clues.
For applications heavily reliant on visual data extraction, avoid extracting data using the Markdown technique and prefer vision-based models like GPT-4o (Vision). For pure text-based extractions where speed is a priority, Markdown-based models like GPT-3.5 Turbo can be suitable.
For models like Phi-3 Mini, this scenario highlights how small language models can perform just as well for data extraction scenarios across the board.
With prompt engineering techniques, it is possible to improve the overall accuracy of GPT-4o Mini and Phi-3.5 Mini. However, changes to the input prompt may require more specific language and domain specific keywords that increase the number of tokens consumed, increasing the overall cost.
In this specific use case, GPT-4o (Vision) is the best overall technique, providing high accuracy and speed, at lower costs compared to other models.
Evaluating AI Models for Unstructured Data
Complex Vehicle Insurance Document
Model
|
Technique
|
Accuracy (avg)
|
Speed (avg)
|
Cost (est) /
1,000 pages |
GPT-4o
|
Vision
|
97%
|
39.88s
|
$3.78
|
GPT-4o
|
Markdown
|
95%
|
26.16s
|
$13.47
|
GPT-4o Mini
|
Markdown
|
85%
|
21.35s
|
$10.13
|
GPT-3.5 Turbo
|
Markdown
|
80%
|
25.64s
|
$10.33
|
GPT-4o Mini
|
Vision
|
67%
|
38.29s
|
$4.23
|
Phi-3.5 Mini Instruct
|
Markdown
|
42%
|
26.76s
|
$10.28
|
Phi-3 Mini 128K Instruct
|
Markdown
|
37%
|
21.50s
|
$10.25
|
The results of this scenario indicate that advanced, multi-modal models like GPT-4o excel in accuracy and cost efficiency for extracting data from unstructured documents. The advanced nature of these language models allow them to better interpret contextual clues in the natural language of the document to infer the expected output.
- Accuracy is spread across models for unstructured data, with GPT-4o (Vision) providing the most accurate with minimal error. The complexity in domain language and natural language processing for extracting values results in poor accuracy for small language models, such as Phi-3 Mini and Phi-3.5 Mini.
- Compared to the Invoice scenario, speed of processing large unstructured documents is higher. This is potentially due to the increase in the number of pages, as well as the complexity of the language understanding required to extract specific values from text-based rules in the document.
- In this specific scenario, GPT-4o (Vision) is significantly cheaper than other models while achieving the highest accuracy. In comparison, GPT-4o (Markdown) is 3.5x more expensive. Using GPT-4o (Vision) could drastically reduce the overall cost for large-scale unstructured document processing solutions.
Analyzing the extraction of data from complex unstructured documents, GPT-4o (Vision) outperforms all other models and techniques. This is seen across both accuracy and cost.
While highly accurate, Azure OpenAI's vision capable models (GPT-4o & GPT-4o Mini) have a maximum context window of 128K tokens per request. Image tokens are calculated based on a pixel-based tiling technique which can quickly consume this token limit. Additionally, the accuracy for data extraction can suffer the more pages are included in a single request. Consider breaking documents down into a smaller subset of pages per request, and then combining extraction results once complete using a fan out/parallel processing technique.
Conclusion
Effective evaluation of AI document data extraction techniques using small language models (SLMs) and large language models (LLMs) can reveal benefits and drawbacks, guiding in the selection of the optimal approach for specific use cases.
The key findings from our analysis shows:
- For high-accuracy requirements, especially in critical applications such as medical, legal, or financial documents, GPT-4o with Vision capabilities stand out. It consistently delivers the highest accuracy across both structured and unstructured data scenarios.
- SLMs like Phi-3 Mini and Phi-3.5 Mini, while cost-effective, show significant limitations in accuracy with complex, unstructured documents data extraction. However, prompt engineering techniques to improve the instruction for extraction can improve the accuracy.
- Speed is mostly consistent across models and techniques, with large, unstructured documents being slightly slower for processing than smaller, structured documents.
- Models with Vision capabilities, such as GPT-4o (Vision), offer a balance of high accuracy and reasonable speed, making them ideal for applications requiring fast and accurate data extraction.
- Cost considerations are crucial when scaling processing to large volumes of documents. GPT-4o (Vision) provides high accuracy and also proves to be the most cost-effective per 1,000 pages, especially in scenarios with complex unstructured data.
Recommendations for Evaluating AI Models in Document Data Extraction
- High-Accuracy Solutions. For solutions where accuracy is critical or visual elements must be evaluated, such as medical records, financial reporting or compliance, explore GPT-4o with Vision capabilities. Its superior performance in accuracy and cost-effectiveness justifies the investment.
- Text-Based Extractions. For simple, text-based document extractions where speed is a priority, consider models like GPT-3.5 Turbo or Phi-3 Mini using Markdown. It provides sufficient accuracy at a low cost and faster processing times.
- Adopt Evaluation Techniques. Implement a rigorous evaluation methodology like the one used in this analysis. Establishing a baseline for accuracy, speed, and cost through multiple iterations and consistent prompting ensures reliable and comparable results. Regularly conduct evaluations when considering new techniques, models, prompts, and configurations. This helps in making informed decisions when opting for an approach in your specific use cases.
Read more on AI Document Intelligence
Thank you for taking the time to read this article. We are sharing our insights for ISVs and Startups that enable document intelligence in their AI-powered solutions, based on real-world challenges we encounter. We invite you to continue your learning through our additional insights in this series.
-
- Discover how to enhance data extraction accuracy with Azure AI Document Intelligence by tailoring models to your unique document structures.
-
- Discover how Azure AI Document Intelligence and Azure OpenAI efficiently extract structured data from documents, streamlining document processing workflows for AI-powered solutions.
- Using Structured Outputs in Azure OpenAI’s GPT-4o for consistent document data processing
- Discover how to leverage GPT-4o’s Structured Outputs to ensure reliable, schema-compliant document data processing.
Further Reading
- Phi-3 Open Models - Small Language Models | Microsoft Azure
- Learn more about the Phi-3 small language models and their potential, including running effectively in offline environments.
- Prompt engineering techniques with Azure OpenAI | Microsoft Learn
- Discover how to improve your prompting techniques with Azure OpenAI to maximize the accuracy of your document data extraction.
This technical blog focuses on the unique needs of ISVs and startups on Azure such as SaaS, multi-tenancy, cloud native and multi-cloud solutions.