Azure AI Foundry Blog

Generate searchable PDFs with Azure Form Recognizer

anatolip
Microsoft
Oct 17, 2022

Important update: Azure Document Intelligence (formerly Form Recognizer) now supports generating searchable PDFs, starting with the 2024-11-30 API version (4.0 GA). Please read: Searchable PDF - Azure Document Intelligence

PDF documents are widely used in business processes. Digitally created PDFs are convenient to use: their text can be searched, highlighted, and annotated. Unfortunately, many PDFs are created by scanning or by converting images to PDF. These PDFs contain no digital text, so they cannot be searched. In this blog post, we demonstrate how to convert such PDFs into searchable PDFs with a short Python script and Azure Form Recognizer. The script generates a searchable PDF file that you can store anywhere, search within, and copy and paste text from.

Azure Form Recognizer overview

Azure Form Recognizer is a cloud-based Azure Applied AI Service that uses deep machine-learning models to extract text, key-value pairs, tables, and form fields from your documents. In this blog post, we take the text that Form Recognizer extracts and add it to the PDF to make it searchable.

Searchable vs non-searchable PDFs

If a PDF contains text information, the user can select, copy and paste, and annotate its text. In a searchable PDF (example), text can be searched and selected; see the text highlighting below:

PDF with digital text

If a PDF is image-based (example), its text cannot be searched or selected. Zooming in typically reveals image compression artifacts around the text:

Image based PDF
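
One way to tell the two kinds of PDF apart programmatically is to check whether any page yields extractable text. The sketch below assumes the pypdf package (the same one used by the script later in this post). The helper names `page_has_text` and `is_searchable_pdf` are made up for illustration, and a page whose text layer contains only whitespace is treated as having no text.

```python
def page_has_text(page) -> bool:
    """Return True if a pypdf page object yields any non-whitespace text."""
    text = page.extract_text() or ""
    return bool(text.strip())


def is_searchable_pdf(pdf_path: str) -> bool:
    """Return True if at least one page of the PDF has extractable text."""
    # Imported here so page_has_text can be used without pypdf installed.
    from pypdf import PdfReader

    reader = PdfReader(pdf_path)
    return any(page_has_text(page) for page in reader.pages)
```

Calling `is_searchable_pdf` on a purely image-based scan would be expected to return False, since such a file has no text elements to extract.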

How to generate a searchable PDF

PDFs contain different types of elements: text, images, and others. Image-based PDFs contain only image elements. The goal of this blog post is to add invisible text elements to the PDF so that users can search and select them. The text elements are invisible so that the resulting searchable PDF looks identical to the original PDF. In the example below, the word “Transition” is now selectable through the invisible text layer:

Invisible text layer
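
At the PDF level, invisibility comes from the text rendering mode: mode 3 means the glyphs are neither filled nor stroked, so viewers can still search and select the text although nothing is painted. The sketch below builds the content-stream operators for one invisible word, purely to illustrate the idea. The function names and the font resource name `/F1` are assumptions, only simple characters that fit a PDF literal string are handled, and the script later in this post lets ReportLab generate these operators instead.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses for a PDF literal string."""
    return (text.replace("\\", "\\\\")
                .replace("(", "\\(")
                .replace(")", "\\)"))


def invisible_text_ops(text: str, x: float, y: float, font_size: float,
                       font_resource: str = "/F1") -> str:
    """Build PDF text operators that place `text` invisibly at (x, y)."""
    return "\n".join([
        "BT",                                  # begin text object
        f"{font_resource} {font_size:g} Tf",   # select font and size
        "3 Tr",                                # rendering mode 3: invisible
        f"1 0 0 1 {x:g} {y:g} Tm",             # text matrix: translate to (x, y)
        f"({escape_pdf_string(text)}) Tj",     # show the string
        "ET",                                  # end text object
    ])


print(invisible_text_ops("Transition", 72, 700, 12))
```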

Prerequisite installation

Please install the following packages before running the searchable PDF script:

  1. Python packages (the quotes keep the shell from treating >= as an output redirection): 
    pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image
  2. The pdf2image package requires Poppler. Please follow the instructions at https://pypi.org/project/pdf2image/ for your platform, or install it with Conda: 
    conda install -c conda-forge poppler
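
Before running the script, it can help to confirm that everything is in place. The following is a small optional check that uses only the standard library; the helper names are illustrative. It assumes the import names `azure.ai.formrecognizer`, `pypdf`, `reportlab`, `PIL`, and `pdf2image` for the packages above, and that Poppler is reachable through its `pdftoppm` and `pdfinfo` executables on PATH (pdf2image can also be pointed at another location, which this check does not cover).

```python
import importlib.util
import shutil

REQUIRED_MODULES = ["azure.ai.formrecognizer", "pypdf", "reportlab", "PIL", "pdf2image"]
POPPLER_TOOLS = ["pdftoppm", "pdfinfo"]


def module_available(name: str) -> bool:
    """Return True if an import spec for `name` can be found."""
    try:
        return importlib.util.find_spec(name) is not None
    except ImportError:
        # Raised when a parent package (for example `azure`) is missing.
        return False


def missing_prerequisites() -> list:
    """List the modules and Poppler executables that were not found."""
    missing = [m for m in REQUIRED_MODULES if not module_available(m)]
    missing += [t for t in POPPLER_TOOLS if shutil.which(t) is None]
    return missing


if __name__ == "__main__":
    problems = missing_prerequisites()
    if problems:
        print(f"Missing: {', '.join(problems)}")
    else:
        print("All prerequisites found.")
```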

How to run searchable PDF script

  1. Create a Python file from the code below and save it on your local machine as fr_generate_searchable_pdf.py.
  2. Update the key and endpoint variables with the values from your Form Recognizer resource in the Azure portal (see Quickstart: Form Recognizer SDKs for more details).
  3. Run the script, passing the input file (PDF or image) as a parameter:
    python fr_generate_searchable_pdf.py <input.pdf/jpg>

    Sample script output is below:

    (base) C:\temp>python fr_generate_searchable_pdf.py input.jpg
    Loading input file input.jpg
    Starting Azure Form Recognizer OCR process...
    Azure Form Recognizer finished OCR text for 1 pages.
    Generating searchable PDF...
    Searchable PDF is created: input.jpg.ocr.pdf
  4. The script generates a searchable PDF file named after the input file, with the suffix .ocr.pdf appended.
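
Rather than editing the key and endpoint into the script, one option is to read them from environment variables so that the credentials stay out of the source file. This is a sketch and not part of the original script: the variable names `FORM_RECOGNIZER_ENDPOINT` and `FORM_RECOGNIZER_KEY` and the helper `load_credentials` are hypothetical.

```python
import os


def load_credentials(env=os.environ):
    """Return (endpoint, key) from the environment, or raise if either is unset."""
    endpoint = env.get("FORM_RECOGNIZER_ENDPOINT", "").strip()
    key = env.get("FORM_RECOGNIZER_KEY", "").strip()
    if not endpoint or not key:
        raise RuntimeError(
            "Set FORM_RECOGNIZER_ENDPOINT and FORM_RECOGNIZER_KEY before running the script."
        )
    return endpoint, key
```

In the script, `endpoint, key = load_credentials()` could then stand in for the two hard-coded assignments.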

Searchable PDF Python script

Copy the code below into a Python script on your local machine. The script takes a scanned PDF or an image as input and uses Form Recognizer to generate a corresponding searchable PDF. The added text layer lets you search, copy, and paste the text within the PDF.

fr_generate_searchable_pdf.py

# Script to create a searchable PDF from a scanned PDF or images using Azure Form Recognizer
# Required packages
# pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image
import sys
import io
import math
import argparse
from pdf2image import convert_from_path
from reportlab.pdfgen import canvas
from reportlab.lib import pagesizes
from reportlab import rl_config
from PIL import Image, ImageSequence
from pypdf import PdfWriter, PdfReader
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Please provide your Azure Form Recognizer endpoint and key (as strings)
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

def dist(p1, p2):
    return math.sqrt((p1.x - p2.x)*(p1.x - p2.x) + (p1.y - p2.y) * (p1.y - p2.y))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file', type=str, help="Input PDF or image (jpg, jpeg, tif, tiff, bmp, png) file name")
    parser.add_argument('-o', '--output', type=str, required=False, default="", help="Output PDF file name. Default: input_file + .ocr.pdf")
    args = parser.parse_args()

    input_file = args.input_file
    if args.output:
        output_file = args.output
    else:
        output_file = input_file + ".ocr.pdf"

    # Loading input file
    print(f"Loading input file {input_file}")
    if input_file.lower().endswith('.pdf'):
        # read existing PDF as images
        image_pages = convert_from_path(input_file)
    elif input_file.lower().endswith(('.tif', '.tiff', '.jpg', '.jpeg', '.png', '.bmp')):
        # read input image (potential multi page Tiff)
        image_pages = ImageSequence.Iterator(Image.open(input_file))
    else:
        sys.exit(f"Error: Unsupported input file extension {input_file}. Supported extensions: PDF, TIF, TIFF, JPG, JPEG, PNG, BMP.")

    # Running OCR using Azure Form Recognizer Read API
    print("Starting Azure Form Recognizer OCR process...")
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key), headers={"x-ms-useragent": "searchable-pdf-blog/1.0.0"})

    with open(input_file, "rb") as f:
        poller = document_analysis_client.begin_analyze_document("prebuilt-read", document = f)

    ocr_results = poller.result()
    print(f"Azure Form Recognizer finished OCR text for {len(ocr_results.pages)} pages.")

    # Generate OCR overlay layer
    print("Generating searchable PDF...")
    output = PdfWriter()
    default_font = "Times-Roman"
    for page_id, page in enumerate(ocr_results.pages):
        ocr_overlay = io.BytesIO()

        # Calculate overlay PDF page size: scale the longer image side
        # to the long side of a letter page (pagesizes.letter[1], in points)
        if image_pages[page_id].height > image_pages[page_id].width:
            page_scale = float(image_pages[page_id].height) / pagesizes.letter[1]
        else:
            page_scale = float(image_pages[page_id].width) / pagesizes.letter[1]

        page_width = float(image_pages[page_id].width) / page_scale
        page_height = float(image_pages[page_id].height) / page_scale

        scale = (page_width / page.width + page_height / page.height) / 2.0
        pdf_canvas = canvas.Canvas(ocr_overlay, pagesize=(page_width, page_height))

        # Add image into PDF page
        pdf_canvas.drawInlineImage(image_pages[page_id], 0, 0, width=page_width, height=page_height, preserveAspectRatio=True)

        text = pdf_canvas.beginText()
        # Set text rendering mode to invisible
        text.setTextRenderMode(3)
        for word in page.words:
            # Calculate optimal font size
            desired_text_width = max(dist(word.polygon[0], word.polygon[1]), dist(word.polygon[3], word.polygon[2])) * scale
            desired_text_height = max(dist(word.polygon[1], word.polygon[2]), dist(word.polygon[0], word.polygon[3])) * scale
            font_size = desired_text_height
            actual_text_width = pdf_canvas.stringWidth(word.content, default_font, font_size)
            
            # Calculate text rotation angle
            text_angle = math.atan2((word.polygon[1].y - word.polygon[0].y + word.polygon[2].y - word.polygon[3].y) / 2.0, 
                                    (word.polygon[1].x - word.polygon[0].x + word.polygon[2].x - word.polygon[3].x) / 2.0)
            text.setFont(default_font, font_size)
            text.setTextTransform(math.cos(text_angle), -math.sin(text_angle), math.sin(text_angle), math.cos(text_angle), word.polygon[3].x * scale, page_height - word.polygon[3].y * scale)
            text.setHorizScale(desired_text_width / actual_text_width * 100)
            text.textOut(word.content + " ")

        pdf_canvas.drawText(text)
        pdf_canvas.save()

        # Move to the beginning of the buffer
        ocr_overlay.seek(0)

        # Create a new PDF page
        new_pdf_page = PdfReader(ocr_overlay)
        output.add_page(new_pdf_page.pages[0])

    # Save output searchable PDF file
    with open(output_file, "wb") as outputStream:
        output.write(outputStream)

    print(f"Searchable PDF is created: {output_file}")
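
The per-word placement in the script comes down to a few numbers derived from the word's bounding polygon: a target width, a height that doubles as the font size, and a rotation angle. The sketch below isolates that calculation so it can be checked without calling the service. It assumes, as the script's indexing implies, that the four polygon points are ordered top-left, top-right, bottom-right, bottom-left. `Point`, `edge_length`, and `word_geometry` are illustrative names, and the scale of 72 simply stands in for the value the script computes per page.

```python
import math
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])


def edge_length(a, b):
    """Euclidean distance between two points."""
    return math.hypot(a.x - b.x, a.y - b.y)


def word_geometry(polygon, scale=1.0):
    """Return (width, height, angle) for a 4-point word polygon.

    width and height are the longer of each pair of opposite edges, multiplied
    by `scale`; angle is the direction of the averaged top and bottom edges in
    radians, measured in image coordinates where y grows downward.
    """
    p0, p1, p2, p3 = polygon
    width = max(edge_length(p0, p1), edge_length(p3, p2)) * scale
    height = max(edge_length(p1, p2), edge_length(p0, p3)) * scale
    angle = math.atan2((p1.y - p0.y + p2.y - p3.y) / 2.0,
                       (p1.x - p0.x + p2.x - p3.x) / 2.0)
    return width, height, angle


# An axis-aligned word box, 2 units wide and 0.5 units tall.
box = [Point(1.0, 1.0), Point(3.0, 1.0), Point(3.0, 1.5), Point(1.0, 1.5)]
print(word_geometry(box, scale=72.0))
```

The script then uses the height as the font size and calls setHorizScale to stretch the rendered word to the target width.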

Updated Jan 30, 2025
Version 9.0

55 Comments

  • vamsikrishnak
    Copper Contributor

    Hi

    This is exciting, but it breaks when we try to convert multiple files, and then starts working again after some time. We don't know where it is breaking.

     

    Below is the error description:

    Starting Azure Form Recognizer OCR process...
    Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
    Traceback (most recent call last):
      File "<path to file>\venv\lib\site-packages\azure\core\polling\base_polling.py", line 514, in run
        self._poll()
      File "<path to file>\venv\lib\site-packages\azure\core\polling\base_polling.py", line 554, in _poll
        raise OperationFailed("Operation failed or canceled")
    azure.core.polling.base_polling.OperationFailed: Operation failed or canceled

    During handling of the above exception, another exception occurred:

    Traceback (most recent call last):
      File "generateSearchablePages.py", line 159, in <module>
        generateSearchablePages(path_of_pdf)
      File "generateSearchablePages.py", line 20, in wrapper
        result = func(*args, **kwargs)
      File "generateSearchablePages.py", line 91, in generateSearchablePages
        ocr_results = poller.result()
      File "<path to file>\venv\lib\site-packages\azure\core\polling\_poller.py", line 247, in result
        self.wait(timeout)
      File "<path to file>\venv\lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
        return func(*args, **kwargs)
      File "<path to file>\venv\lib\site-packages\azure\core\polling\_poller.py", line 267, in wait
        raise self._exception # type: ignore
      File "<path to file>\venv\lib\site-packages\azure\core\polling\_poller.py", line 184, in _start
        self._polling_method.run()
      File "<path to file>venv\lib\site-packages\azure\core\polling\base_polling.py", line 532, in run
        raise HttpResponseError(
    azure.core.exceptions.HttpResponseError: (InternalServerError) An unexpected error occurred.
    Code: InternalServerError
    Message: An unexpected error occurred.
    Exception Details:      (InternalServerError) An unexpected error occurred.
            Code: InternalServerError
            Message: An unexpected error occurred.

  • akhemlani The sample code provided is super easy to use and enables you to output a searchable PDF. 

  • akhemlani
    Copper Contributor

    NetaH - Thanks for the quick response. Will reach out to the mentioned email with more information. Does any other Azure Cognitive Service provide searchable pdf feature today that I can look at with my team?

  • Yes, we are looking into this. Can you please share more information on your scenario and needs? Please reach out to Form Recognizer Contact Us <formrecog_contact@microsoft.com>

  • akhemlani
    Copper Contributor

    anatolip sanjeev_jagtap NetaH -

     

    Thanks for this article and the Python code. Is your team looking at offering searchable PDF retrieval directly through Azure Form Recognizer or Azure Cognitive Services via REST API calls?