Azure AI Foundry Blog

Generate searchable PDFs with Azure Form Recognizer

Microsoft

Oct 17, 2022

Important update: Azure Document Intelligence (formerly Form Recognizer) supports generating searchable PDFs starting with the 2024-11-30 API version (4.0 GA). Please read: Searchable PDF - Azure Document Intelligence

PDF documents are widely used in business processes. Digitally created PDFs are convenient to use: text can be searched, highlighted, and annotated. Unfortunately, many PDFs are created by scanning or by converting images to PDF. These PDFs contain no digital text, so they cannot be searched. In this blog post, we demonstrate how to convert such PDFs into searchable PDFs with a short, easy-to-use script and Azure Form Recognizer. The script generates a searchable PDF file that you can store anywhere, search within, and copy and paste text from.

Blog content:

Azure Form Recognizer overview
Searchable vs non-searchable PDFs
How to generate a searchable PDF
Pre-requirement installation
How to run searchable PDF script
Searchable PDF Python script

Azure Form Recognizer overview

Azure Form Recognizer is a cloud-based Azure Applied AI Service that uses deep machine-learning models to extract text, key-value pairs, tables, and form fields from your documents. In this blog post, we take the text extracted by Form Recognizer and add it to the PDF to make the PDF searchable.

Searchable vs non-searchable PDFs

If a PDF contains text information, users can select, copy and paste, and annotate its text. In a searchable PDF (example), text can be searched and selected, as the highlighted text below shows:

PDF with digital text

If a PDF is image-based (example), its text cannot be searched or selected. Zooming in typically reveals image compression artifacts around the text:

Image based PDF

How to generate a searchable PDF

PDFs contain different types of elements: text, images, and others. Image-based PDFs contain only image elements. The goal of this blog post is to add invisible text elements to the PDF so that users can search and select them. The text is invisible so that the resulting searchable PDF looks identical to the original PDF. In the example below, the word “Transition” is selectable through the invisible text layer:

Invisible text layer
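
To make the idea of an invisible text layer more concrete, here is a minimal sketch of the kind of PDF content-stream operators that place a word on a page without painting it. PDF text rendering mode 3 (`3 Tr`) means the glyphs are neither filled nor stroked, so the text stays searchable and selectable but is not drawn. The script in this post sets this mode through reportlab rather than writing operators by hand; the helper names and the font resource name `F1` below are hypothetical and used only for illustration.

```python
def escape_pdf_string(text: str) -> str:
    """Escape the characters that are special inside a PDF literal string."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(word: str, x: float, y: float, font_size: float,
                       font_name: str = "F1") -> str:
    """Build content-stream operators that place `word` as invisible text.

    `font_name` is a hypothetical font resource name; a real page would have
    to declare it in its resource dictionary.
    """
    return "\n".join([
        "BT",                                # begin a text object
        "3 Tr",                              # rendering mode 3: no fill, no stroke
        f"/{font_name} {font_size:g} Tf",    # select the font and its size
        f"1 0 0 1 {x:g} {y:g} Tm",           # text matrix: translate to (x, y)
        f"({escape_pdf_string(word)}) Tj",   # show the word
        "ET",                                # end the text object
    ])


if __name__ == "__main__":
    print(invisible_text_ops("Transition", 72, 700, 12))
```

Because the image of the page is drawn first and the text is never painted, the visual appearance of the page does not change.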

Pre-requirement installation

Please install the following packages before running the searchable PDF script:

Python packages:

pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image

The pdf2image package requires Poppler. Follow the instructions at https://pypi.org/project/pdf2image/ for your platform, or install Poppler with Conda:
```
conda install -c conda-forge poppler
```

How to run searchable PDF script

Create a Python file using the code below and save it on your local machine as fr_generate_searchable_pdf.py.
Update the key and endpoint variables with the values from your Form Recognizer resource in the Azure portal (see Quickstart: Form Recognizer SDKs for more details).

Execute the script and pass the input file (PDF or image) as a parameter:

python fr_generate_searchable_pdf.py <input.pdf/jpg>

Sample script output is below:

(base) C:\temp>python fr_generate_searchable_pdf.py input.jpg
Loading input file input.jpg
Starting Azure Form Recognizer OCR process...
Azure Form Recognizer finished OCR text for 1 pages.
Generating searchable PDF...
Searchable PDF is created: input.jpg.ocr.pdf

The script generates a searchable PDF file with the suffix .ocr.pdf.
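
As an alternative to editing the key and endpoint variables directly in the file, you could read them from environment variables so that the secret never lands in source control. This is only a sketch of that idea, not part of the original script; the function name and the variable names `FORM_RECOGNIZER_ENDPOINT` and `FORM_RECOGNIZER_KEY` are hypothetical.

```python
import os


def load_form_recognizer_settings(env=os.environ):
    """Return (endpoint, key) read from environment variables.

    The variable names are hypothetical; use whatever names fit your setup.
    Raises RuntimeError if either value is missing or empty.
    """
    names = ("FORM_RECOGNIZER_ENDPOINT", "FORM_RECOGNIZER_KEY")
    values = [env.get(name, "").strip() for name in names]
    missing = [name for name, value in zip(names, values) if not value]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return values[0], values[1]
```

In the script, the two assignments could then become `endpoint, key = load_form_recognizer_settings()`.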

Searchable PDF Python script

Copy the code below into a Python script on your local machine. The script takes a scanned PDF or an image as input and uses Form Recognizer to generate a corresponding searchable PDF. The added text layer lets you search, copy, paste, and otherwise access the text within the PDF.

fr_generate_searchable_pdf.py

# Script to create a searchable PDF from scanned PDFs or images using Azure Form Recognizer
# Required packages
# pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image
import sys
import io
import math
import argparse
from pdf2image import convert_from_path
from reportlab.pdfgen import canvas
from reportlab.lib import pagesizes
from reportlab import rl_config
from PIL import Image, ImageSequence
from pypdf import PdfWriter, PdfReader
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Please provide your Azure Form Recognizer endpoint and key
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

def dist(p1, p2):
    return math.sqrt((p1.x - p2.x)*(p1.x - p2.x) + (p1.y - p2.y) * (p1.y - p2.y))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file', type=str, help="Input PDF or image (jpg, jpeg, tif, tiff, bmp, png) file name")
    parser.add_argument('-o', '--output', type=str, required=False, default="", help="Output PDF file name. Default: input_file + .ocr.pdf")
    args = parser.parse_args()

    input_file = args.input_file
    if args.output:
        output_file = args.output
    else:
        output_file = input_file + ".ocr.pdf"

    # Loading input file
    print(f"Loading input file {input_file}")
    if input_file.lower().endswith('.pdf'):
        # read existing PDF as images
        image_pages = convert_from_path(input_file)
    elif input_file.lower().endswith(('.tif', '.tiff', '.jpg', '.jpeg', '.png', '.bmp')):
        # read input image (potential multi page Tiff)
        image_pages = ImageSequence.Iterator(Image.open(input_file))
    else:
        sys.exit(f"Error: Unsupported input file extension {input_file}. Supported extensions: PDF, TIF, TIFF, JPG, JPEG, PNG, BMP.")

    # Running OCR using Azure Form Recognizer Read API 
    print(f"Starting Azure Form Recognizer OCR process...")
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key), headers={"x-ms-useragent": "searchable-pdf-blog/1.0.0"})

    with open(input_file, "rb") as f:
        poller = document_analysis_client.begin_analyze_document("prebuilt-read", document = f)

    ocr_results = poller.result()
    print(f"Azure Form Recognizer finished OCR text for {len(ocr_results.pages)} pages.")

    # Generate OCR overlay layer
    print(f"Generating searchable PDF...")
    output = PdfWriter()
    default_font = "Times-Roman"
    for page_id, page in enumerate(ocr_results.pages):
        ocr_overlay = io.BytesIO()

        # Calculate overlay PDF page size
        if image_pages[page_id].height > image_pages[page_id].width:
            page_scale = float(image_pages[page_id].height) / pagesizes.letter[1]
        else:
            page_scale = float(image_pages[page_id].width) / pagesizes.letter[1]

        page_width = float(image_pages[page_id].width) / page_scale
        page_height = float(image_pages[page_id].height) / page_scale

        scale = (page_width / page.width + page_height / page.height) / 2.0
        pdf_canvas = canvas.Canvas(ocr_overlay, pagesize=(page_width, page_height))

        # Add image into PDF page
        pdf_canvas.drawInlineImage(image_pages[page_id], 0, 0, width=page_width, height=page_height, preserveAspectRatio=True)

        text = pdf_canvas.beginText()
        # Set text rendering mode to invisible
        text.setTextRenderMode(3)
        for word in page.words:
            # Calculate optimal font size
            desired_text_width = max(dist(word.polygon[0], word.polygon[1]), dist(word.polygon[3], word.polygon[2])) * scale
            desired_text_height = max(dist(word.polygon[1], word.polygon[2]), dist(word.polygon[0], word.polygon[3])) * scale
            font_size = desired_text_height
            actual_text_width = pdf_canvas.stringWidth(word.content, default_font, font_size)
            
            # Calculate text rotation angle
            text_angle = math.atan2((word.polygon[1].y - word.polygon[0].y + word.polygon[2].y - word.polygon[3].y) / 2.0, 
                                    (word.polygon[1].x - word.polygon[0].x + word.polygon[2].x - word.polygon[3].x) / 2.0)
            text.setFont(default_font, font_size)
            text.setTextTransform(math.cos(text_angle), -math.sin(text_angle), math.sin(text_angle), math.cos(text_angle), word.polygon[3].x * scale, page_height - word.polygon[3].y * scale)
            text.setHorizScale(desired_text_width / actual_text_width * 100)
            text.textOut(word.content + " ")

        pdf_canvas.drawText(text)
        pdf_canvas.save()

        # Move to the beginning of the buffer
        ocr_overlay.seek(0)

        # Create a new PDF page
        new_pdf_page = PdfReader(ocr_overlay)
        output.add_page(new_pdf_page.pages[0])

    # Save output searchable PDF file
    with open(output_file, "wb") as outputStream:
        output.write(outputStream)

    print(f"Searchable PDF is created: {output_file}")
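
The core of the script is the per-word geometry: each OCR word comes with a four-point polygon, and the script derives a font size, a rotation angle, and a horizontal stretch from it. The sketch below isolates that arithmetic so it can be tried without calling the service. It assumes, as the script does, that the polygon points run top-left, top-right, bottom-right, bottom-left; `Point`, `word_geometry`, and `horizontal_scale_percent` are hypothetical names used here for illustration, and `Point` merely stands in for the SDK's point objects.

```python
import math
from collections import namedtuple

# Stand-in for the SDK's point type: anything with .x and .y works.
Point = namedtuple("Point", "x y")


def dist(p1, p2):
    """Euclidean distance between two points."""
    return math.hypot(p1.x - p2.x, p1.y - p2.y)


def word_geometry(polygon, scale):
    """Return (width, height, angle) for a word polygon.

    `polygon` is assumed to be ordered top-left, top-right, bottom-right,
    bottom-left. `scale` converts OCR units to PDF points. The angle is in
    radians, measured in the OCR coordinate system (y grows downward).
    """
    # Longest of the top and bottom edges gives the text width.
    width = max(dist(polygon[0], polygon[1]), dist(polygon[3], polygon[2])) * scale
    # Longest of the right and left edges gives the text height (font size).
    height = max(dist(polygon[1], polygon[2]), dist(polygon[0], polygon[3])) * scale
    # Average the direction of the top and bottom edges to get the rotation.
    angle = math.atan2(
        (polygon[1].y - polygon[0].y + polygon[2].y - polygon[3].y) / 2.0,
        (polygon[1].x - polygon[0].x + polygon[2].x - polygon[3].x) / 2.0,
    )
    return width, height, angle


def horizontal_scale_percent(desired_width, natural_width):
    """Percentage by which to stretch text so it spans `desired_width`."""
    return desired_width / natural_width * 100.0
```

For an upright word the angle comes out as zero and the font size equals the scaled box height. The horizontal scale then stretches or squeezes the invisible text so that a selection covers the same area as the word in the image.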

Updated Jan 30, 2025

Version 9.0

azure ai document intelligence

azure ai services

anatolip

Microsoft

Joined March 04, 2021

55 Comments

vamsikrishnak
Copper Contributor
Jan 06, 2023
Hi
This is exciting but breaking when we are trying to convert multiple files. But starts working after some time. We don't know where it is breaking...

Below is the error description:
Starting Azure Form Recognizer OCR process...
Unable to retrieve continuation token: cannot pickle '_io.BufferedReader' object
Traceback (most recent call last):
File "<path to file>\venv\lib\site-packages\azure\core\polling\base_polling.py", line 514, in run
    self._poll()
File "<path to file>\venv\lib\site-packages\azure\core\polling\base_polling.py", line 554, in _poll
    raise OperationFailed("Operation failed or canceled")
azure.core.polling.base_polling.OperationFailed: Operation failed or canceled

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
File "generateSearchablePages.py", line 159, in <module>
    generateSearchablePages(path_of_pdf)
File "generateSearchablePages.py", line 20, in wrapper
    result = func(*args, **kwargs)
File "generateSearchablePages.py", line 91, in generateSearchablePages
    ocr_results = poller.result()
File "<path to file>\venv\lib\site-packages\azure\core\polling\_poller.py", line 247, in result
    self.wait(timeout)
File "<path to file>\venv\lib\site-packages\azure\core\tracing\decorator.py", line 78, in wrapper_use_tracer
    return func(*args, **kwargs)
File "<path to file>\venv\lib\site-packages\azure\core\polling\_poller.py", line 267, in wait
    raise self._exception # type: ignore
File "<path to file>\venv\lib\site-packages\azure\core\polling\_poller.py", line 184, in _start
    self._polling_method.run()
File "<path to file>venv\lib\site-packages\azure\core\polling\base_polling.py", line 532, in run
    raise HttpResponseError(
azure.core.exceptions.HttpResponseError: (InternalServerError) An unexpected error occurred.
Code: InternalServerError
Message: An unexpected error occurred.
Exception Details:      (InternalServerError) An unexpected error occurred.
        Code: InternalServerError
        Message: An unexpected error occurred.
NetaH
Microsoft
Oct 18, 2022
akhemlani The sample code provided is super easy to use and enables you to output a searchable PDF.
akhemlani
Copper Contributor
Oct 18, 2022
NetaH - Thanks for the quick response. Will reach out to the mentioned email with more information. Does any other Azure Cognitive Service provide searchable pdf feature today that I can look at with my team?
NetaH
Microsoft
Oct 18, 2022
Yes, we are looking into this. Can you please share more information on your scenario and needs? Please reach out to Form Recognizer Contact Us <formrecog_contact@microsoft.com>
akhemlani
Copper Contributor
Oct 18, 2022
anatolip sanjeev_jagtap NetaH -

Thanks for this article and python code, is your team looking at introducing retrieving searchable pdfs as one of the offerings directly via Azure Form Recognizer or Azure Cognitive Service via REST API calls?