Evaluating the effectiveness of AI models in document data extraction. Comparing accuracy, speed, and cost-effectiveness between Small and Large Language Models (SLMs and LLMs).
As the adoption of AI in solutions increases, technical decision-makers face challenges in selecting the most effective approach for document data extraction. Ensuring high quality is crucial, particularly when dealing with critical solutions where minor errors have substantial consequences. As the volume of documents increases, it becomes essential to choose solutions that can scale efficiently without compromising performance.
This article evaluates AI document data extraction techniques using Small Language Models (SLMs) and Large Language Models (LLMs), with a specific focus on structured and unstructured data scenarios.
By evaluating models, the article provides insights into their accuracy, speed, and cost-efficiency for quality data extraction. It provides both guidance into evaluating models, as well as the quality of the outputs from models for specific scenarios.
With many AI models available to ISVs and Digital Natives, determining which technique is the most effective for quality document data extraction is a challenge in itself.
When choosing an AI model for document data extraction on Azure, there is no one-size-fits-all solution. Depending on the scenario, one model may outperform another on accuracy at the sacrifice of cost, while another may provide sufficient accuracy at a much lower cost.
When evaluating AI models for document data extraction, it’s important to understand how they perform for specific use cases. This evaluation focused on structured and unstructured scenarios to provide insights into simple and complex document structures.
Three example document scenarios for evaluating AI models in document data extraction
This evaluation focused on two techniques for data extraction with the language models: Markdown, where the document is first converted to Markdown text via OCR before being passed to the model, and Vision, where page images are provided directly to models with vision capabilities.
For each technique, the model is prompted using a one-shot technique, providing the expected output schema for the response. This establishes the intention, improving the overall accuracy of the generated output.
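The one-shot approach above can be sketched as follows. This is a minimal Python illustration, assuming an OpenAI-style chat message format; the schema and field names are hypothetical, and the article's actual evaluation harness is written in .NET.

```python
import json

# Hypothetical expected output schema for an invoice scenario (illustrative only).
INVOICE_SCHEMA = {
    "invoice_number": "string",
    "invoice_date": "YYYY-MM-DD",
    "total_amount": "number",
}

def build_one_shot_prompt(document_text: str) -> list:
    """Build a chat-style prompt that embeds the expected output schema,
    establishing the intention so the model generates a matching response."""
    system = (
        "You are an AI assistant that extracts data from documents. "
        "Respond only with JSON matching this schema:\n"
        + json.dumps(INVOICE_SCHEMA, indent=2)
    )
    return [
        {"role": "system", "content": system},
        {"role": "user", "content": document_text},
    ]

messages = build_one_shot_prompt("INVOICE #123 ...")
```

Embedding the schema in the system message constrains the shape of the output, which is what improves the overall accuracy of the generated result.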
The AI models evaluated in this analysis include: Phi-3-Mini-128K-Instruct, GPT-3.5-Turbo, GPT-4-Turbo, and GPT-4-Omni.
To ensure a reliable and consistent evaluation, a baseline was established for each model and technique combination by measuring its average accuracy, average processing speed, and estimated cost per 1,000 pages.
These metrics establish the baseline evaluation. With the baseline in place, it is possible to experiment with the prompt, schema, and request configuration, and compare improvements in overall quality across accuracy, speed, and cost.
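The baseline metrics can be computed along these lines. This is a Python sketch under the assumption that accuracy is measured as the fraction of expected fields extracted correctly and that cost is scaled linearly to 1,000 pages; the function names are illustrative, not from the article's .NET harness.

```python
def field_accuracy(expected: dict, actual: dict) -> float:
    """Accuracy as the fraction of expected fields the model extracted correctly."""
    if not expected:
        return 1.0
    correct = sum(1 for key, value in expected.items() if actual.get(key) == value)
    return correct / len(expected)

def cost_per_1000_pages(total_cost: float, pages: int) -> float:
    """Scale the observed cost of a run to an estimated cost per 1,000 pages."""
    return total_cost / pages * 1000

acc = field_accuracy(
    {"invoice_number": "123", "total": "9.99"},
    {"invoice_number": "123", "total": "10.00"},
)
# acc == 0.5: one of the two expected fields matched exactly
```

Exact-match field comparison is a deliberately strict baseline; fuzzier matching (for example, normalizing whitespace or number formats) can be layered on when experimenting with prompt and schema changes.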
For the evaluation outlined in this article, we created a .NET NUnit test project with multiple fixtures and cases. The tests take advantage of the .NET SDKs for both Azure AI Document Intelligence and Azure OpenAI.
Each model and technique combination per scenario is run independently, to ensure that speed is evaluated fairly for each request.
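Isolating each run so that timing is fair can be as simple as wrapping every request in its own timer. A minimal Python sketch (the article's tests do this with NUnit in .NET; `timed_call` is a hypothetical helper):

```python
import time

def timed_call(fn, *args):
    """Run one model request in isolation and record its wall-clock duration,
    so each combination's speed is measured without interference from others."""
    start = time.perf_counter()
    result = fn(*args)
    elapsed = time.perf_counter() - start
    return result, elapsed

# Usage: each (model, technique, scenario) request gets its own timed call.
result, elapsed = timed_call(lambda doc: doc.upper(), "sample document")
```

Running combinations sequentially rather than in parallel avoids contention (rate limits, shared bandwidth) skewing the per-request speed figures.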
You can find the repository for this evaluation on GitHub.
| Model | Technique | Accuracy (avg) | Speed (avg) | Cost (est) / 1,000 pages |
|---|---|---|---|---|
| Phi-3-Mini-128K-Instruct | Markdown | 71% | 19.91s | $10.39 |
| GPT-3.5-Turbo | Markdown | 83% | 24.49s | $10.49 |
| GPT-4-Turbo | Markdown | 91% | 62.79s | $20.46 |
| GPT-4-Omni | Markdown | 90% | 26.19s | $15.19 |
| GPT-4-Turbo | Vision | 91% | 20.74s | $16.37 |
| GPT-4-Omni | Vision | 90% | 26.67s | $7.46 |
The results of this scenario indicate consistency across all models in both accuracy and speed. GPT-4 Turbo with the Markdown technique is the outlier in this scenario for speed.
It is important to note that Markdown conversion strips away visual elements, providing only the result of OCR as text. This can often lead to potential misinterpretations, such as false positives for signatures. When using models with vision capabilities, visual elements are often interpreted correctly, resulting in a higher true positive accuracy.
For high-accuracy requirements, GPT-4-Omni with Vision capabilities is the best choice due to its excellent performance and cost-effectiveness. For simpler tasks where speed is a priority, models like Phi-3-Mini-128K-Instruct can be considered, with the understanding that accuracy will be significantly lower.
| Model | Technique | Accuracy (avg) | Speed (avg) | Cost (est) / 1,000 pages |
|---|---|---|---|---|
| Phi-3-Mini-128K-Instruct | Markdown | 93% | 22.03s | $10.43 |
| GPT-3.5-Turbo | Markdown | 91% | 15.52s | $10.57 |
| GPT-4-Turbo | Markdown | 94% | 89.81s | $21.93 |
| GPT-4-Omni | Markdown | 94% | 21.20s | $15.94 |
| GPT-4-Turbo | Vision | 91% | 19.71s | $20.31 |
| GPT-4-Omni | Vision | 99% | 15.65s | $9.16 |
The results of this scenario indicate that all models provide high accuracy for extracting structured data from invoices, but key differences in speed and cost highlight the trade-offs.
Models using vision techniques, like GPT-4 Omni (Vision), excel in interpreting visual elements of documents, particularly where they overlap content or require visual clues to direct the output. This minimizes false positives that occur when models interpret these elements based purely on surrounding textual clues.
For applications heavily reliant on visual data extraction, avoid extracting data using the Markdown technique and prefer vision-based models like GPT-4 Omni (Vision). For pure text-based extractions where speed is a priority, Markdown-based models like GPT-3.5 Turbo can be suitable.
However, in this specific use case, GPT-4 Omni (Vision) is the best overall technique, providing high accuracy and speed, at lower costs compared to other models.
| Model | Technique | Accuracy (avg) | Speed (avg) | Cost (est) / 1,000 pages |
|---|---|---|---|---|
| Phi-3-Mini-128K-Instruct | Markdown | 37% | 21.50s | $10.25 |
| GPT-3.5-Turbo | Markdown | 80% | 25.06s | $10.33 |
| GPT-4-Turbo | Markdown | 84% | 86.10s | $16.66 |
| GPT-4-Omni | Markdown | 95% | 28.56s | $13.29 |
| GPT-4-Turbo | Vision | 86% | 41.98s | $8.54 |
| GPT-4-Omni | Vision | 99% | 39.88s | $3.63 |
The results of this scenario indicate that more advanced, multi-modal models excel in accuracy and cost efficiency when extracting data from unstructured documents. It can be assumed that the advanced nature of the language model allows it to better interpret contextual clues in the natural language of the document to infer the expected output.
In analyzing the extraction of data from complex unstructured documents, GPT-4 Omni (Vision) outperforms all other models and techniques. This superiority is seen across accuracy, speed, and cost.
While highly accurate, GPT-4 with Vision models are limited to 10 images per request, which constrains how many pages can be processed in a single request. Effective pre-processing, such as stitching pages together into a single image, is essential to maximize the accuracy of the extraction. However, avoid overloading images with too many pages, as this reduces the overall resolution of the text and significantly degrades the model's performance. Where documents with many pages must be processed, consider adopting the Markdown extraction technique with advanced models such as GPT-4 Omni instead.
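The 10-image limit implies batching pages across requests. A small Python sketch of that pre-processing step, assuming pages are already rendered as individual images (`batch_pages` is an illustrative helper, not part of any SDK):

```python
def batch_pages(pages: list, max_images_per_request: int = 10) -> list:
    """Split a document's page images into batches that respect the
    per-request image limit of vision-enabled models (10 for GPT-4 with Vision)."""
    return [
        pages[i:i + max_images_per_request]
        for i in range(0, len(pages), max_images_per_request)
    ]

batches = batch_pages(list(range(23)))
# 23 pages -> 3 batches of 10, 10, and 3 pages
```

The same helper can be reused with a smaller batch size when stitching multiple pages into one image, trading fewer requests against the resolution loss noted above.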
Effective evaluation of AI document data extraction techniques using small language models (SLMs) and large language models (LLMs) can reveal benefits and drawbacks, guiding in the selection of the optimal approach for specific use cases.
The key findings from our analysis show:

- Vision-capable models interpret visual elements, such as signatures and overlapping content, more reliably than the Markdown technique, which provides only the OCR text and can produce false positives.
- GPT-4 Omni with Vision delivered the highest accuracy at the lowest cost in both the structured and unstructured scenarios, making it the best overall technique in this evaluation.
- For pure text-based extractions where speed or cost is the priority, Markdown-based models such as GPT-3.5 Turbo can be sufficient, at the expense of accuracy.
Thank you for taking the time to read this article. We are sharing our insights for ISVs and Digital Natives that enable document intelligence in their AI-powered solutions, based on real-world challenges we encounter. We invite you to continue your learning through our additional insights in this series.