Forum Discussion
Evaluating the quality of AI document data extraction with small and large language models
James With traditional ML models, we can assess a confidence factor for each extracted data point in real time. Based on that confidence factor, we could determine whether human review was necessary. What would be your recommendation for deciding when to trigger human evaluation if the extraction is done with an LLM?
mmilanov76 LLMs like OpenAI's GPT models can return log probabilities for the tokens they generate. When you use structured data extraction, you can align the structured response with the generated tokens and use each token's probability to gauge how likely it was the best-fitting token. Aggregating those probabilities gives you a confidence score you can use to decide when to trigger human evaluation.
You can explore this further with the OpenAI logprobs confidence helper created for the Azure AI Document Processing samples repo: azure-ai-document-processing-samples/samples/dotnet/modules/samples/confidence/OpenAIConfidence.csx (main branch of Azure-Samples/azure-ai-document-processing-samples).
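To make the idea concrete, here is a minimal Python sketch of the same approach (the linked sample itself is C#): request token log probabilities alongside a structured extraction, reconstruct the output text from the token stream, and average the probabilities of the tokens that spell out each field value. The model name, prompt, `field_confidence` helper, `document_text` placeholder, and the 0.9 review threshold are all illustrative assumptions, not the exact logic of the linked sample.

```python
import json
import math
from openai import OpenAI

client = OpenAI()

document_text = "Invoice 12345 ... Total: $99.50"  # placeholder document

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any chat model that supports logprobs
    messages=[
        {"role": "system", "content": "Extract invoice_number and total as JSON."},
        {"role": "user", "content": document_text},
    ],
    response_format={"type": "json_object"},
    logprobs=True,  # ask the API to return per-token log probabilities
)

choice = response.choices[0]
extracted = json.loads(choice.message.content)
tokens = choice.logprobs.content  # token objects with .token and .logprob

def field_confidence(value: str) -> float | None:
    """Average token probability over the tokens that spell out `value`.

    Naive alignment: concatenate the token strings to rebuild the output
    text, locate the field value as a substring, then average exp(logprob)
    for every token that overlaps that span.
    """
    text, spans = "", []
    for t in tokens:
        spans.append((len(text), len(text) + len(t.token), t.logprob))
        text += t.token
    start = text.find(value)
    if start == -1:
        return None  # value not found verbatim; needs smarter alignment
    end = start + len(value)
    probs = [math.exp(lp) for s, e, lp in spans if s < end and e > start]
    return sum(probs) / len(probs) if probs else None

# Route low-confidence fields to human review (threshold is an assumption).
for field, value in extracted.items():
    conf = field_confidence(str(value))
    if conf is None or conf < 0.9:
        print(f"{field}: confidence={conf} -> flag for human review")
```

One caveat with this naive alignment: numeric values may be serialized differently in the JSON than in your lookup string (e.g. 99.5 vs "99.50"), which is why the sketch falls back to flagging the field for review whenever it cannot locate the value verbatim.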