Maximizing Data Extraction Precision with Dual LLMs Integration and Human-in-the-Loop
Published Sep 04 2024 10:33 AM
Microsoft

Improving data extraction accuracy is vital, but validating the correctness of the extracted data is equally important. The Layout model in Document Intelligence, combined with markdown output and semantic chunking, plays a key role in dividing documents into clear sections and subsections. This approach improves navigation, comprehension, and information retrieval by preserving the relationships between sections and other structural elements (such as tables, paragraphs, and figures). This structure helps LLMs interpret data in context and extract it more accurately. To learn more details on this concept:


More accuracy and Human-in-the-Loop

However, our customers continue to face challenges in achieving nearly 100% accuracy. They also want to incorporate human validation into the process through a Human-in-the-Loop (HITL) approach, so that critical data points (such as financial figures, legal terms, or medical data) are captured accurately, especially in the initial stages before human intervention is potentially phased out.

This article proposes a dual approach built on two Large Language Model (LLM)-based processes, data extraction and data validation, akin to the "two heads are better than one" concept. Data extraction converts the document to markdown format and uses an LLM (e.g., GPT-4o) to extract data in JSON format based on a predefined schema, then passes the result back to the system. The system then calls data validation, which uses the same schema to extract data with Document Intelligence and compares it against the output of the first extraction. Any discrepancies are sent to the front-end UI for human validation.

For our demonstration, we use the latest Document Field Extraction model, which uses generative AI to accurately extract specific fields from documents regardless of their visual templates. This custom model combines Document Intelligence's specialized algorithms with Large Language Models (LLMs) and precise custom extraction schemas. It also provides a confidence score for each field and offers training capabilities to further improve accuracy.

Below is a summary that illustrates how the process works.

Figure: Overall data extraction and validation process with human in the loop

  1. Define the schema in JSON format to extract data.
  2. The system calls Data Extraction to convert the PDF or image files to markdown format and sends the markdown, along with your predefined schema, in the prompt message. The completion, in JSON format, is sent back to the system.
  3. The system initiates the data validation process by calling Data Validation. Documents can be submitted for analysis using the REST API or client libraries. The custom generative AI model (public preview) is effective at extracting straightforward fields without needing labeled samples, but providing labeled examples can significantly enhance accuracy, especially for complex fields like tables.
  4. The validation process compares the extracted values field by field based on the schema. Mismatched values are flagged and sent to the user interface (UI) for human validation.
  5. Users validate the mismatched data, selecting the correct value based on the displayed PDF or image file with highlighted discrepancies. They can also enter a new value if both presented values are incorrect. This approach, which focuses on reviewing only the mismatched fields rather than every field, leverages an LLM and Document Intelligence to enhance accuracy while minimizing the need for extensive human involvement.
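The comparison in step 4 can be sketched roughly as follows. This is a minimal illustration, not the implementation from the repository: the helper names `normalize` and `find_mismatches`, the normalization rules, and the sample values are all assumptions made for this example.

```python
def normalize(value):
    """Make values comparable: unify numeric types, trim and case-fold strings."""
    if isinstance(value, bool):
        return value
    if isinstance(value, (int, float)):
        return float(value)
    if isinstance(value, str):
        # Collapse repeated whitespace and ignore case differences.
        return " ".join(value.split()).casefold()
    return value


def find_mismatches(llm_fields: dict, docintel_fields: dict, field_names) -> dict:
    """Return the fields whose two extracted values disagree after normalization."""
    mismatches = {}
    for name in field_names:
        llm_value = llm_fields.get(name)
        docintel_value = docintel_fields.get(name)
        if normalize(llm_value) != normalize(docintel_value):
            mismatches[name] = {"llm": llm_value, "docintel": docintel_value}
    return mismatches


# Hypothetical values from the two extraction passes.
llm_fields = {"amount": 30000, "borrower_name": "Siyabonga Sithole", "date": "2024-12-25"}
docintel_fields = {"amount": 30000.0, "borrower_name": "siyabonga  sithole", "date": "2024-12-26"}

mismatches = find_mismatches(llm_fields, docintel_fields, ["amount", "borrower_name", "date"])
print(mismatches)  # {'date': {'llm': '2024-12-25', 'docintel': '2024-12-26'}}
```

Only the entries in `mismatches` would need to reach the reviewer. How aggressively to normalize (for example, whether case differences in names matter) is a design choice that depends on the documents being processed.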

JSON Schema:

{
    "docTypes": {
        "custom-docutment-intel-model": {
            "fieldSchema": {
                "apn_number": {
                    "type": "number"
                },
                "borrower_name": {
                    "type": "string"
                },
                "lender_name": {
                    "type": "string"
                },
                "trustee": {
                    "type": "string"
                },
                "amount": {
                    "type": "number"
                },
                "date": {
                    "type": "date"
                }
            },
            "buildMode": "generative"
        }
    }
}
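One possible way to reuse this definition on the LLM side is to pull out the `fieldSchema` portion, serialize it for the prompt, and derive a fallback record from it. The sketch below assumes the definition is available as a Python dictionary; the helper names `get_field_schema` and `empty_record` are hypothetical and not part of any SDK.

```python
import json

# The model definition from the article, as a Python dictionary.
model_definition = {
    "docTypes": {
        "custom-docutment-intel-model": {
            "fieldSchema": {
                "apn_number": {"type": "number"},
                "borrower_name": {"type": "string"},
                "lender_name": {"type": "string"},
                "trustee": {"type": "string"},
                "amount": {"type": "number"},
                "date": {"type": "date"},
            },
            "buildMode": "generative",
        }
    }
}


def get_field_schema(definition: dict, doc_type: str) -> dict:
    """Return the field schema for one document type."""
    return definition["docTypes"][doc_type]["fieldSchema"]


def empty_record(field_schema: dict) -> dict:
    """Fallback record: 0 for number fields, an empty string for everything else."""
    return {
        name: 0 if spec["type"] == "number" else ""
        for name, spec in field_schema.items()
    }


field_schema = get_field_schema(model_definition, "custom-docutment-intel-model")
schema_for_prompt = json.dumps(field_schema, indent=2)  # string to embed in the prompt
fallback = empty_record(field_schema)
```

Deriving both the prompt schema and the fallback record from a single definition keeps the two extraction passes aligned on field names and types.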

Data Extraction (Azure OpenAI)

from typing import Optional

# `client` (an Azure OpenAI client) and `azure_openai_model` (the deployment name)
# are expected to be defined elsewhere in the application.
def get_response_from_aoai_with_schema(document_content: str, schema: str) -> Optional[str]:
    """Get a JSON response from the GPT-4o model with schema"""

    system_message = f"""
    ### You are an AI assistant that helps extract information from the given context.
    - context will be given by the user.
    - you will extract the relevant information using this json schema:
        ```json
        {schema}
        ```
    - if you are unable to extract the information, return JSON with the keys and empty strings or 0 as values.
    - if schema type is date, provide the date as a string in the format "YYYY-MM-DD".
    """

    messages = [
        {"role": "system", "content": system_message},
        {"role": "user", "content": document_content}
    ]

    try:
        response = client.chat.completions.create(
            model=azure_openai_model,  # The deployment name you chose when you deployed the GPT model
            messages=messages,
            response_format={"type": "json_object"},  # Ask the model to return a JSON object
        )
        response_message = response.choices[0].message
        return response_message.content
    except Exception as e:
        # Report the failure and let the caller handle the missing result
        print(f"Error: {e}")
        return None
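Because the function returns either a JSON string or `None`, the caller has to parse and check the result before comparing it with anything. A minimal sketch of that step follows; the helper name `parse_extraction` is an assumption for illustration and does not come from the original code.

```python
import json


def parse_extraction(raw, expected_fields):
    """Parse the model output and keep only the expected keys.

    Returns None when the output is missing or is not a JSON object.
    Fields the model left out are set to None so they surface as gaps later.
    """
    if raw is None:
        return None
    try:
        data = json.loads(raw)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, dict):
        return None
    return {name: data.get(name) for name in expected_fields}


# Hypothetical model output with one unexpected key and one missing field.
record = parse_extraction('{"amount": 30000, "note": "n/a"}', ["amount", "date"])
print(record)  # {'amount': 30000, 'date': None}
```

Dropping unexpected keys and filling missing ones keeps the record aligned with the schema, so the later field-by-field comparison does not have to special-case malformed output.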

Data validation in Document Intelligence:

import time

import requests

# `docintel_endpoint` (ending with a slash), `docintel_key`, and
# `docintel_custom_model_name` are expected to be defined elsewhere in the application.
def get_response_from_ai_doc_intel(target_file):
    """Analyze a file with the custom Document Intelligence model and return the analysis result."""
    url = f"{docintel_endpoint}documentintelligence/documentModels/{docintel_custom_model_name}:analyze"
    headers = {
        "Ocp-Apim-Subscription-Key": docintel_key,
        "Content-Type": "application/octet-stream"
    }
    params = {
        "api-version": "2024-07-31-preview",
        "outputContentFormat": "markdown"
    }

    # Submit the file from the documents folder in the main directory
    with open(target_file, "rb") as f:
        submit_analysis = requests.post(url, params=params, headers=headers, data=f)

    if submit_analysis.status_code != 202:
        print(f"Error: {submit_analysis.json()}")
        return None

    # The analysis runs asynchronously; the result is polled at this URL
    operation_location = submit_analysis.headers["Operation-Location"]
    print(operation_location)

    # Poll until the analysis finishes
    while True:
        response = requests.get(operation_location, headers={"Ocp-Apim-Subscription-Key": docintel_key})

        if response.status_code != 200:
            print(f"Error: {response.json()}")
            return None

        analysis_results = response.json()

        # The operation can report "notStarted" before it switches to "running"
        if analysis_results["status"] in ("notStarted", "running"):
            print("Analysis is still running...")
            time.sleep(5)  # wait for 5 seconds before polling again
            continue

        if analysis_results["status"] != "succeeded":
            print(f"Error: {analysis_results}")
            return None

        return analysis_results["analyzeResult"]
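The returned analysis result still has to be reduced to plain field values before it can be compared with the LLM output. The sketch below assumes the result carries a `documents` list whose first entry has a `fields` mapping, with typed value keys such as `valueString`, `valueNumber`, and `valueDate` plus a `confidence` per field; verify this shape against the API version you use. The helper name `flatten_fields` and the sample payload are hypothetical, and the payload is hand-written rather than real service output.

```python
# Typed value keys to look for, in order of preference (an assumption, see above).
VALUE_KEYS = ("valueString", "valueNumber", "valueInteger", "valueDate")


def flatten_fields(analyze_result: dict) -> dict:
    """Map each field name to its extracted value and confidence score."""
    documents = analyze_result.get("documents") or []
    if not documents:
        return {}
    flattened = {}
    for name, field in documents[0].get("fields", {}).items():
        # Prefer a typed value; fall back to the raw text content if none is present.
        value = next((field[key] for key in VALUE_KEYS if key in field), field.get("content"))
        flattened[name] = {"value": value, "confidence": field.get("confidence")}
    return flattened


# Hand-written sample shaped like the assumed response.
sample_result = {
    "documents": [
        {
            "fields": {
                "amount": {"type": "number", "valueNumber": 30000, "content": "$30,000", "confidence": 0.95},
                "date": {"type": "date", "valueDate": "2024-12-25", "content": "December 25, 2024", "confidence": 0.781},
                "trustee": {"type": "string", "content": "Fabrikam, Inc"},
            }
        }
    ]
}

fields = flatten_fields(sample_result)
```

Keeping the confidence next to each value makes it available for display alongside mismatched fields in the review UI.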

Output:

{
 "apn_number": 38593847301,
 "borrower_name": "Siyabonga Sithole",
 "lender_name": "Addullo Kholov",
 "trustee": "Fabrikam, Inc",
 "amount": 30000,
 "date": "2024-12-25"
}

The front-end interface below, based on our testing results, demonstrates how users can achieve nearly 100% data extraction accuracy by resolving discrepancies through a combination of three components: Large Language Models (LLMs), Document Intelligence, and human intervention. The left column, under "Fields," lists the extracted fields. Selecting a field's radio button displays the comparison results under "Field Information." A green highlight indicates an exact match, while a red highlight indicates a mismatch, accompanied by a lower confidence score (e.g., 0.781 for the date field in this example), as shown in the figure below. Users should focus on the red-highlighted fields, either accepting the correct value or, if both options are incorrect, entering a new one in the editable text box.

Figure: Front App for Human Validation

For the detailed implementation and a clearer image, please visit our GitHub repository.
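The reviewer's choices described above (accept one of the two values, or type a new one) could be folded back into a final record along the following lines. This is a sketch under assumptions: the function name `finalize_record`, the decision format, and the sample values are invented for illustration and are not taken from the repository.

```python
def finalize_record(llm_fields: dict, docintel_fields: dict, decisions: dict) -> dict:
    """Build the final record from both extractions and the reviewer's decisions.

    `decisions` maps a mismatched field name to a (choice, new_value) pair,
    where choice is "llm", "docintel", or "override".
    """
    final = {}
    for name, llm_value in llm_fields.items():
        docintel_value = docintel_fields.get(name)
        if llm_value == docintel_value:
            final[name] = llm_value  # exact match: no human review needed
            continue
        if name not in decisions:
            raise KeyError(f"Field '{name}' is mismatched and has no reviewer decision")
        choice, new_value = decisions[name]
        if choice == "llm":
            final[name] = llm_value
        elif choice == "docintel":
            final[name] = docintel_value
        elif choice == "override":
            final[name] = new_value  # both values were wrong; use the typed value
        else:
            raise ValueError(f"Unknown choice '{choice}' for field '{name}'")
    return final


# Hypothetical extraction results and reviewer decisions.
llm_fields = {"amount": 30000, "date": "2024-12-25", "trustee": "Fabrikam, Inc"}
docintel_fields = {"amount": 30000, "date": "2024-12-26", "trustee": "Fabrikam Inc"}
decisions = {"date": ("llm", None), "trustee": ("override", "Fabrikam, Inc.")}

final_record = finalize_record(llm_fields, docintel_fields, decisions)
```

Raising an error for an unresolved mismatch, rather than silently picking one source, keeps unreviewed discrepancies out of the final record.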

LLMs Selection

To select the right LLMs, you typically need a language model that can understand and process text effectively, including the structure and syntax specific to markdown, and that has validation capabilities. AI Studio, as a platform, offers a variety of LLMs to choose from.

Last update: Sep 05 2024 05:44 AM