Introduction to OCR-Free Vision RAG using ColPali for Complex Documents

mrajguru
Oct 22, 2024

In the rapidly evolving landscape of artificial intelligence, the ability to understand and process complex documents is becoming increasingly vital. Traditional Optical Character Recognition (OCR) systems have served us well in extracting text from images, but they often fall short when it comes to interpreting the intricate visual elements that accompany textual information. In this blog we will use ColPali, a groundbreaking approach that leverages multi-vector retrieval through late interaction mechanisms and Vision Language Models (VLMs) to enhance Retrieval-Augmented Generation (RAG) processes. This blog post will take you on a deep dive into how ColPali is revolutionizing document understanding and retrieval.

 

In practice, retrieval pipelines for PDF documents have a huge impact on overall RAG performance but are non-trivial to build. Finding specific documents within vast collections represents a significant challenge - like searching for the proverbial needle in a haystack. When dealing with collections containing millions of documents, each potentially hundreds of pages long, the goal is to pinpoint exactly what's relevant to your search.

 

While web search engines have mastered this challenge, handling everything from text to multimedia content, business document search presents unique complexities. Corporate documents often combine text with crucial visual elements - think bold headlines, formatted section titles, graphical content, and data visualizations. Effective retrieval systems must process both textual and visual components.

Historically, document retrieval systems primarily focused on text, using OCR to convert documents into searchable text formats. This OCR-plus-layout analysis approach remains fundamental to leading document AI systems like LayoutLMv3, which analyzes text sequence, spatial positioning, and visual elements to achieve impressive results - but only when the initial OCR step performs well.

Unfortunately, OCR often falls short. 

 

This highlights a crucial insight: While document AI models often show promising results on pristine academic datasets, they frequently struggle with real-world documents that are irregular, messy, and imperfect. This gap between controlled testing environments and practical applications remains a significant challenge in the field.

The Limitations of Traditional OCR

What is OCR?

Optical Character Recognition (OCR) is a technology that converts different types of documents—such as scanned paper documents, PDFs, or images captured by a digital camera—into editable and searchable data. While OCR has made significant strides in accuracy, it primarily focuses on text extraction, often overlooking the contextual and visual elements present in complex documents. 

 

Challenges with Complex Documents

 

Complex documents, such as financial reports, legal contracts, and academic papers, often contain:

 

  • Tables and Charts: These elements convey critical information that cannot be captured through text alone.
  • Images and Diagrams: Visual aids play a significant role in understanding the content but are often ignored by traditional OCR systems.
  • Layout and Formatting: The arrangement of text and visuals can significantly impact meaning, yet OCR typically treats each element in isolation.

Due to these limitations, traditional OCR can lead to incomplete or misleading interpretations of complex documents.

 

What is ColPali?

 

ColPali builds upon recent developments in VLMs, which combine the power of Large Language Models (LLMs) with Vision Transformers (ViTs). By feeding image patch embeddings through a language model, ColPali maps visual features into a latent space aligned with textual content. This alignment is crucial for effective retrieval because it ensures that the visual elements of a document contribute meaningfully to the matching process with user queries. ColPali inherits its retrieval mechanics from ColBERT. In ColBERT's approach, each word (or token) in a document gets its own vector representation in BERT's semantic space, rather than condensing the entire document into one vector. This collection of token vectors lets the system capture how words relate to their surrounding context within the document - that's what the "Co" in ColBERT represents.

The system works in two phases: First, during indexing, it creates these token-level document representations. Then during search, it converts the user's query into similar token vectors and compares them with the document vectors.

 

To determine how well a document matches the query, ColBERT uses a technique called MaxSim: it finds the best match for each query token among all document tokens, then adds up these best-match scores to get an overall relevance score. This approach makes the matching process computationally efficient.
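
To make this concrete, here is a minimal sketch of MaxSim scoring in PyTorch (illustrative only - the walkthrough below uses the colpali_engine utilities rather than this helper). It assumes query_embeddings and doc_embeddings are token-level embedding tensors of shape (num_tokens, dim):

import torch

def maxsim_score(query_embeddings: torch.Tensor, doc_embeddings: torch.Tensor) -> float:
    # Similarity of every query token against every document token: (num_query_tokens, num_doc_tokens)
    similarities = query_embeddings @ doc_embeddings.T
    # For each query token, keep its best-matching document token, then sum over query tokens
    return similarities.max(dim=1).values.sum().item()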

 

This design choice - waiting until search time to compare query and document vectors - is called "late interaction" (that's the "l" in ColBERT). This differs from "early interaction" systems that process queries and documents together from the start. While early interaction can capture more nuanced relationships between queries and documents, it's computationally more expensive, as detailed in the ColBERT paper.

 

As its name suggests, ColBERT was originally designed for text search, where it achieved breakthrough performance in both accuracy and speed. The ColBERTv2 paper showed how well this multi-vector, late-interaction approach could scale.

ColPali builds on this foundation by applying these same principles to PaliGemma, a vision-language model. This technique isn't limited to PaliGemma though - it can work with other vision-language models too, as demonstrated by ColIdefics.

 

Key Features of ColPali

 

  1. Integrated Vision Language Models (VLMs):
    • ColPali utilizes VLMs like PaliGemma to interpret document images effectively. These models are trained on vast datasets that include not just text but also images, diagrams, and layouts.
    • By understanding the relationship between visual elements and text, ColPali can provide richer insights into complex documents.
  2. Enhanced Contextual Understanding:
    • Unlike traditional OCR systems that treat text as isolated data points, ColPali analyzes the entire layout of a document.
    • This means it can recognize how tables relate to surrounding text or how diagrams illustrate key concepts, leading to more accurate interpretations.
  3. Dynamic Retrieval-Augmented Generation (RAG):
    • ColPali seamlessly integrates into RAG frameworks, allowing for real-time information retrieval based on user queries.
    • This dynamic approach ensures that responses are not only relevant but also contextually rich, providing users with comprehensive insights.

Beyond improved accuracy, ColPali also offers significant efficiency gains:

  • Simplified Indexing: By eliminating the need for complex preprocessing steps, ColPali accelerates the document indexing process. Traditional methods can be time-consuming due to the multiple stages involved in parsing and chunking documents.

  • Low Query Latency: ColPali maintains low latency during querying, a critical requirement for real-time applications. Its end-to-end trainable architecture optimizes the retrieval process, ensuring swift responses to user queries.

 

Now let's implement this using Azure AI services. First, let's load the required libraries.

import torch
from torch.utils.data import DataLoader
from tqdm import tqdm
from transformers import AutoProcessor
from PIL import Image
from io import BytesIO

from colpali_engine.models.paligemma_colbert_architecture import ColPali
from colpali_engine.utils.colpali_processing_utils import process_images, process_queries
from colpali_engine.utils.image_utils import scale_image, get_base64_image

import os
from dotenv import load_dotenv
load_dotenv('azure.env', override=True)
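
The azure.env file is assumed to hold the endpoints and keys used throughout this walkthrough. The variable names below are illustrative and should match your own Azure resources:

AZURE_OPENAI_API_KEY=<your-azure-openai-key>
AZURE_OPENAI_ENDPOINT=https://<your-openai-resource>.openai.azure.com/
OPENAI_API_VERSION=<your-api-version>
SEARCH_ENDPOINT=https://<your-search-service>.search.windows.net
SEARCH_KEY=<your-search-admin-key>
INDEX_NAME=colpali-pdf-index

We read the Azure AI Search settings into the variables used later in the code:

# Assumed to be defined in azure.env as shown above
SEARCH_ENDPOINT = os.environ['SEARCH_ENDPOINT']
SEARCH_KEY = os.environ['SEARCH_KEY']
INDEX_NAME = os.environ['INDEX_NAME']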

Load the ColPali model

if torch.cuda.is_available():
    device = torch.device("cuda")
    # Prefer bfloat16 on GPUs that support it, otherwise fall back to float16
    dtype = torch.bfloat16 if torch.cuda.is_bf16_supported() else torch.float16
elif torch.backends.mps.is_available():
    device = torch.device("mps")
    dtype = torch.float32
else:
    device = torch.device("cpu")
    dtype = torch.float32

Let's load the model onto the device selected above.

 

 

model_name = "vidore/colpali-v1.2"
model = ColPali.from_pretrained("vidore/colpaligemma-3b-pt-448-base", torch_dtype=dtype).eval()
model.load_adapter(model_name)
model = model.eval()
model.to(device)
processor = AutoProcessor.from_pretrained(model_name)

Once we have loaded the model, the first step in the process is to get the page images from the PDF.

import requests
from pdf2image import convert_from_path
from pypdf import PdfReader

def download_pdf(url):
    response = requests.get(url)
    if response.status_code == 200:
        return BytesIO(response.content)
    else:
        raise Exception(f"Failed to download PDF: Status code {response.status_code}")

def get_pdf_images(pdf_url):
    # Download the PDF
    pdf_file = download_pdf(pdf_url)
    # Save the PDF temporarily to disk (pdf2image requires a file path)
    temp_file = "temp.pdf"
    with open(temp_file, "wb") as f:
        f.write(pdf_file.read())
    reader = PdfReader(temp_file)
    page_texts = []
    for page_number in range(len(reader.pages)):
        page = reader.pages[page_number]
        text = page.extract_text()
        page_texts.append(text)
    images = convert_from_path(temp_file)
    assert len(images) == len(page_texts)
    return (images, page_texts)

Let's define the PDF we want to work with. Once the PDF is downloaded, we will fetch its page images.

sample_pdfs = [
        {
            "title": "Attention Is All You Need",
            "url": "https://arxiv.org/pdf/1706.03762"
        }
]

Now we download each PDF and load its page images and texts:

for pdf in sample_pdfs:
  page_images, page_texts = get_pdf_images(pdf['url'])
  pdf['images'] = page_images
  pdf['texts'] = page_texts

Now let's go ahead and create an embedding for each page image.

for pdf in sample_pdfs:
  page_embeddings = []
  dataloader = DataLoader(
        pdf['images'],
        batch_size=2,
        shuffle=False,
        collate_fn=lambda x: process_images(processor, x),
    )
  for batch_doc in tqdm(dataloader):
    with torch.no_grad():
      batch_doc = {k: v.to(model.device) for k, v in batch_doc.items()}
      embeddings = model(**batch_doc)
      # ColPali produces one 128-dim vector per image patch. Here we mean-pool the
      # patch vectors into a single 128-dim vector per page so it fits a standard
      # single-vector index in Azure AI Search; use the commented line below instead
      # to keep the full multi-vector representation.
      mean_embedding = torch.mean(embeddings, dim=1).float().cpu().numpy()
      #page_embeddings.extend(list(torch.unbind(embeddings.to("cpu"))))
      page_embeddings.extend(mean_embedding)
  pdf['embeddings'] = page_embeddings

With ColPali, during indexing we aim to strip away a lot of the complexity by using images ("screenshots") of the document pages directly.

A Vision LLM (PaliGemma-3B) encodes the image by splitting it into a series of patches, which are fed to a vision transformer. 

At query time, the user query is embedded by the language model to obtain token embeddings. A ColBERT-style "late interaction" (LI) operation is then used to efficiently match query tokens to document patches. To compute the LI(query, document) score, for each term in the query we search for the document patch that has the most similar ColPali representation, and then sum these best-match scores over all query terms to obtain the final query-document score. Intuitively, this late-interaction operation allows for a rich interaction between all query terms and document patches, all the while benefiting from the fast matching and offline computation offloading that more standard (bi-encoder) embedding models enable.

import numpy as np
lst_feed = []
for pdf in sample_pdfs:
    url = pdf['url']
    title = pdf['title']
    for page_number, (page_text, embedding, image) in enumerate(zip(pdf['texts'], pdf['embeddings'], pdf['images'])):
      base_64_image = get_base64_image(scale_image(image,640),add_url_prefix=False)   
      page = {
        "id": str(hash(url + str(page_number))),
        "url": url,
        "title": title,
        "page_number": page_number,
        "image": base_64_image,
        "text": page_text,
        "embedding": embedding.tolist()
      }
      lst_feed.append(page)

Now that we have the embeddings, we need to store them in a vector store. We will use Azure AI Search as our vector store. Let's create a search index.

from azure.core.credentials import AzureKeyCredential
from azure.search.documents.indexes import SearchIndexClient
from azure.search.documents.indexes.models import (
    SimpleField,
    SearchFieldDataType,
    SearchableField,
    SearchField,
    VectorSearch,
    HnswAlgorithmConfiguration,
    HnswParameters,
    VectorSearchProfile,
    SemanticConfiguration,
    SemanticPrioritizedFields,
    SemanticField,
    SemanticSearch,
    SearchIndex
)

def create_pdf_search_index(endpoint: str, key: str, index_name: str) -> SearchIndex:
    # Initialize the search index client
    credential = AzureKeyCredential(key)
    index_client = SearchIndexClient(endpoint=endpoint, credential=credential)

    # Define vector search configuration. Embeddings are generated client-side with
    # ColPali, so no vectorizer is attached to the profile.
    vector_search = VectorSearch(
        algorithms=[
            HnswAlgorithmConfiguration(
                name="myHnsw",
                parameters=HnswParameters(
                    m=4,
                    ef_construction=400,
                    metric="cosine"
                )
            )
        ],
        profiles=[
            VectorSearchProfile(
                name="myHnswProfile",
                algorithm_configuration_name="myHnsw"
            )
        ]
    )

    # Define the fields
    fields = [
            SimpleField(
                name="id",
                type=SearchFieldDataType.String,
                key=True,
                filterable=True
            ),
            SimpleField(
                name="url",
                type=SearchFieldDataType.String,
                filterable=True
            ),
            SearchableField(
                name="title",
                type=SearchFieldDataType.String,
                searchable=True,
                retrievable=True
            ),
            SimpleField(
                name="page_number",
                type=SearchFieldDataType.Int32,
                filterable=True,
                sortable=True
            ),
            SimpleField(
                name="image",
                type=SearchFieldDataType.String,
                retrievable=True
            ),
            SearchableField(
                name="text",
                type=SearchFieldDataType.String,
                searchable=True,
                retrievable=True
            ),
            SearchField(
                name="embedding",
                type=SearchFieldDataType.Collection(SearchFieldDataType.Single),
                searchable=True,
                vector_search_dimensions=128,
                vector_search_profile_name="myHnswProfile"
            )
        ]

    # Create the index definition
    index = SearchIndex(
        name=index_name,
        fields=fields,
        vector_search=vector_search
    )

    # Create or update the index in Azure AI Search
    result = index_client.create_or_update_index(index)
    return result
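
A minimal usage sketch, assuming SEARCH_ENDPOINT, SEARCH_KEY and INDEX_NAME are the Azure AI Search endpoint, admin key and index name loaded earlier from azure.env:

index = create_pdf_search_index(SEARCH_ENDPOINT, SEARCH_KEY, INDEX_NAME)
print(f"Index '{index.name}' is ready.")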

Once the index is created, we should upload the documents.

from azure.search.documents import SearchClient

credential = AzureKeyCredential(SEARCH_KEY)
search_client = SearchClient(endpoint=SEARCH_ENDPOINT, credential=credential, index_name=INDEX_NAME)

search_client.upload_documents(documents=lst_feed)

Once document ingestion is completed, the next step is handling the user query. As you can see in the code below, we create an embedding for the input query.

def process_query(query: str, processor: AutoProcessor, model: ColPali) -> list:
    # The processor expects an image input, so we pass a blank placeholder image
    # alongside the text query, then mean-pool the resulting token embeddings
    mock_image = Image.new('RGB', (224, 224), color='white')

    inputs = processor(text=query, images=mock_image, return_tensors="pt")
    inputs = {k: v.to(model.device) for k, v in inputs.items()}

    with torch.no_grad():
        embeddings = model(**inputs)

    return torch.mean(embeddings, dim=1).float().cpu().numpy().tolist()[0]

Now let's create an Azure OpenAI client and a search client, plus a small helper to display the results.

from IPython.display import display, HTML
from openai import AzureOpenAI
client = AzureOpenAI(api_key=os.environ['AZURE_OPENAI_API_KEY'],
                    azure_endpoint=os.environ['AZURE_OPENAI_ENDPOINT'],
                    api_version=os.environ['OPENAI_API_VERSION'])

search_client = SearchClient(
    SEARCH_ENDPOINT,
    index_name=INDEX_NAME,
    credential=credential,
)

def display_query_results(query, response, hits=5):
    html_content = f"<h3>Query text: '{query}', top results:</h3>"

    for i, hit in enumerate(response):
        title = hit["title"]
        url = hit["url"]
        page = hit["page_number"]
        image = hit["image"]
        score = hit["@search.score"]

        html_content += f"<h4>PDF Result {i + 1}</h4>"
        html_content += f'<p><strong>Title:</strong> <a href="{url}">{title}</a>, page {page+1} with score {score:.2f}</p>'
        html_content += (
            f'<img src="data:image/png;base64,{image}" style="max-width:100%;">'
        )

    display(HTML(html_content))

Once we have retrieved the relevant images, we can pass them to any VLM to answer the user's question.

from azure.search.documents.models import VectorizedQuery

query = "What is the projected global energy related co2 emission in 2030?"
vector_query = VectorizedQuery(
    vector=process_query(query, processor, model),
    k_nearest_neighbors=3,
    fields="embedding",
)
results = search_client.search(search_text=None, vector_queries=[vector_query])
#display_query_results(query, results)
response = client.chat.completions.create(
  model="gpt-4o",
  messages=[
    {
        "role": "system",
        "content": """You are an assistant for question-answering tasks. Use the following pieces of retrieved context to answer the question. If you don't know the answer, just say that you don't know. Use three sentences maximum and keep the answer concise. You will be given a mix of text, tables, and image(s), usually of charts or graphs."""
    },
    {
      "role": "user",
      "content": [
        {"type": "text", "text": query},
        *map(lambda x: {"type": "image_url", "image_url": {"url": f'data:image/jpg;base64,{x["image"]}', "detail": "low"}}, results),
      ],
    }
  ],
  max_tokens=300,
)

print("Answer:" + response.choices[0].message.content)

Conclusion

As we move further into an era where data is increasingly complex and multifaceted, tools like multimodal LLMs are essential for unlocking valuable insights from our documents. By integrating advanced Vision Language Models with Retrieval-Augmented Generation techniques, ColPali sets a new standard for document understanding that transcends traditional OCR limitations. Whether you're a researcher looking to streamline your workflow or a developer interested in AI advancements, embracing technologies like VLMs and ColPali will undoubtedly enhance your ability to navigate complex information landscapes. Stay tuned for more updates as we continue to explore the fascinating intersection of AI and document processing!

 

** Do check the licenses of the open-source models before using them.

 

 

Learn more about Azure AI Search: https://learn.microsoft.com/en-us/azure/search/search-what-is-azure-search

Model: https://huggingface.co/vidore/colpali-v1.2

 

Thanks

Manoranjan Rajguru

https://www.linkedin.com/in/manoranjan-rajguru/

Updated Oct 25, 2024
Version 3.0