Forum Discussion
Evaluating the quality of AI document data extraction with small and large language models
James With traditional ML models, we can assess a confidence factor for each extracted data point in real time. Based on that confidence factor, we could determine whether human review was necessary. What would be your recommendation for deciding when to trigger human evaluation if the extraction is done with an LLM?
mmilanov76 LLMs like OpenAI's GPT models can return log probabilities for the tokens they generate. When you use structured data extraction, you can align the structured response with the generated tokens and use each token's probability to gauge how likely it was the best-fitting token. Aggregating those probabilities gives you a confidence score you can use to decide when to trigger human evaluation.
You can explore this further with the OpenAI logprobs confidence helper created for the Azure AI Document Processing samples repo: azure-ai-document-processing-samples/samples/dotnet/modules/samples/confidence/OpenAIConfidence.csx (main branch of Azure-Samples/azure-ai-document-processing-samples).
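To make the idea concrete, here is a minimal Python sketch of the same approach (the linked sample itself is C#): request token log probabilities alongside a structured extraction, reconstruct the output text from the token stream, and average the probabilities of the tokens that spell out each field value. The model name, prompt, `field_confidence` helper, `document_text` placeholder, and the 0.9 review threshold are all illustrative assumptions, not the exact logic of the linked sample.

```python
import json
import math
from openai import OpenAI

client = OpenAI()

document_text = "Invoice 12345 ... Total: $99.50"  # placeholder document

response = client.chat.completions.create(
    model="gpt-4o",  # assumption: any chat model that supports logprobs
    messages=[
        {"role": "system", "content": "Extract invoice_number and total as JSON."},
        {"role": "user", "content": document_text},
    ],
    response_format={"type": "json_object"},
    logprobs=True,  # ask the API to return per-token log probabilities
)

choice = response.choices[0]
extracted = json.loads(choice.message.content)
tokens = choice.logprobs.content  # token objects with .token and .logprob

def field_confidence(value: str) -> float | None:
    """Average token probability over the tokens that spell out `value`.

    Naive alignment: concatenate the token strings to rebuild the output
    text, locate the field value as a substring, then average exp(logprob)
    for every token that overlaps that span.
    """
    text, spans = "", []
    for t in tokens:
        spans.append((len(text), len(text) + len(t.token), t.logprob))
        text += t.token
    start = text.find(value)
    if start == -1:
        return None  # value not found verbatim; needs smarter alignment
    end = start + len(value)
    probs = [math.exp(lp) for s, e, lp in spans if s < end and e > start]
    return sum(probs) / len(probs) if probs else None

# Route low-confidence fields to human review (threshold is an assumption).
for field, value in extracted.items():
    conf = field_confidence(str(value))
    if conf is None or conf < 0.9:
        print(f"{field}: confidence={conf} -> flag for human review")
```

One caveat with this naive alignment: numeric values may be serialized differently in the JSON than in your lookup string (e.g. 99.5 vs "99.50"), which is why the sketch falls back to flagging the field for review whenever it cannot locate the value verbatim.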