
AI - Azure AI services Blog


Generate searchable PDFs with Azure Form Recognizer

Microsoft

Oct 17, 2022

Important update: Azure Document Intelligence (formerly Form Recognizer) supports generating searchable PDFs starting with the 2024-11-30 API version (4.0 GA). Please read: Searchable PDF - Azure Document Intelligence

PDF documents are widely used in business processes. Digitally created PDFs are convenient to work with: their text can be searched, highlighted, and annotated. Unfortunately, many PDFs are created by scanning documents or by converting images to PDF. These PDFs contain no digital text, so they cannot be searched. In this blog post, we demonstrate how to convert such PDFs into searchable PDFs with a short Python script and Azure Form Recognizer. The script generates a searchable PDF file that you can store anywhere, search within, and copy and paste text from.

Blog content:

Azure Form Recognizer overview
Searchable vs non-searchable PDFs
How to generate a searchable PDF
Pre-requirement installation
How to run searchable PDF script
Searchable PDF Python script

Azure Form Recognizer overview

Azure Form Recognizer is a cloud-based Azure Applied AI Service that uses deep machine-learning models to extract text, key-value pairs, tables, and form fields from your documents. In this blog post, we take the text extracted by Form Recognizer and add it to the PDF to make it searchable.

Searchable vs non-searchable PDFs

If a PDF contains text information, the user can select, copy and paste, and annotate text in it. In a searchable PDF (example), text can be searched and selected; see the text highlighting below:

PDF with digital text

If a PDF is image-based (example), text cannot be searched or selected. When you zoom in, image compression artifacts are typically visible around the text:

Image based PDF

How to generate a searchable PDF

PDFs contain different types of elements: text, images, and others. Image-based PDFs contain only image elements. The goal of this blog post is to add invisible text elements to the PDF so that users can search and select them. The text elements are invisible so that the resulting searchable PDF looks identical to the original PDF. In the example below, the word “Transition” is now selectable through the invisible text layer:

Invisible text layer
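
In PDF terms, "invisible" means the text is drawn with text rendering mode 3 (neither filled nor stroked), which is the mode the script in this post selects through `setTextRenderMode(3)`. As a rough illustration only, the sketch below builds the raw content-stream operators for one invisible word. The helper names `escape_pdf_string` and `invisible_text_ops` and the font resource name `/F1` are assumptions made for this example; they are not part of the script.

```python
def escape_pdf_string(text: str) -> str:
    """Escape backslashes and parentheses, which are special in PDF literal strings."""
    return text.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(text: str, x: float, y: float, font_size: float) -> str:
    """Return PDF content-stream operators that place `text` invisibly at (x, y).

    Hypothetical helper for illustration; /F1 is an assumed font resource name.
    """
    return "\n".join([
        "BT",                               # begin text object
        "3 Tr",                             # text rendering mode 3: invisible
        f"/F1 {font_size:g} Tf",            # select font and size
        f"1 0 0 1 {x:g} {y:g} Tm",          # text matrix: translate to (x, y)
        f"({escape_pdf_string(text)}) Tj",  # show the text
        "ET",                               # end text object
    ])


print(invisible_text_ops("Transition", 72, 700, 12))
```

A PDF viewer still indexes and selects text drawn this way, which is why the page image on top of it looks unchanged. In practice a PDF library generates these operators for you, so this sketch is only meant to show what the invisible layer consists of.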

Pre-requirement installation

Please install the following packages before running the searchable PDF script:

Python packages:

pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image

The pdf2image package requires Poppler to be installed. Please follow the instructions at https://pypi.org/project/pdf2image/ for your platform, or install it with Conda:
```
conda install -c conda-forge poppler
```
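
Before running the script, it can help to confirm that the Poppler command-line tools are reachable. The check below is a minimal sketch, assuming Poppler is expected on `PATH`; the function name `poppler_available` is made up for this example. pdf2image can also be pointed at a Poppler folder explicitly, in which case a failed `PATH` check does not necessarily mean the script will fail.

```python
import shutil


def poppler_available() -> bool:
    """Return True if Poppler's command-line tools are found on PATH.

    pdf2image shells out to Poppler's pdftoppm (or pdftocairo) and pdfinfo.
    """
    has_renderer = shutil.which("pdftoppm") or shutil.which("pdftocairo")
    return bool(has_renderer and shutil.which("pdfinfo"))


if not poppler_available():
    print("Poppler not found on PATH; install it before running the script.")
```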

How to run searchable PDF script

Create a Python file using the code below and save it on your local machine as fr_generate_searchable_pdf.py.
Update the key and endpoint variables with values from your Azure portal Form Recognizer instance (see Quickstart: Form Recognizer SDKs for more details).

Execute the script and pass the input file (PDF or image) as a parameter:

python fr_generate_searchable_pdf.py <input.pdf/jpg>

Sample script output is below:

(base) C:\temp>python fr_generate_searchable_pdf.py input.jpg
Loading input file input.jpg
Starting Azure Form Recognizer OCR process...
Azure Form Recognizer finished OCR text for 1 pages.
Generating searchable PDF...
Searchable PDF is created: input.jpg.ocr.pdf

The script generates a searchable PDF file with the suffix .ocr.pdf appended to the input file name.
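
If you prefer not to paste the key and endpoint into the script file, one option is to read them from environment variables. The sketch below is an assumption-laden example, not part of the original script: the variable names `FR_ENDPOINT` and `FR_KEY` and the helper `load_form_recognizer_config` are chosen here purely for illustration.

```python
import os


def load_form_recognizer_config(env=None):
    """Return (endpoint, key) read from environment variables.

    FR_ENDPOINT and FR_KEY are assumed variable names chosen for this sketch.
    """
    env = os.environ if env is None else env
    missing = [name for name in ("FR_ENDPOINT", "FR_KEY") if not env.get(name)]
    if missing:
        raise RuntimeError("Missing environment variables: " + ", ".join(missing))
    return env["FR_ENDPOINT"], env["FR_KEY"]
```

In the script, `endpoint, key = load_form_recognizer_config()` would then take the place of the two hard-coded assignments.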

Searchable PDF Python script

Copy the code below into a Python script on your local machine. The script takes a scanned PDF or an image as input and uses Form Recognizer to generate a corresponding searchable PDF. The added text layer lets you search, copy, and paste the text within the PDF.

fr_generate_searchable_pdf.py

# Script to create a searchable PDF from a scanned PDF or images using Azure Form Recognizer
# Required packages
# pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image
import sys
import io
import math
import argparse
from pdf2image import convert_from_path
from reportlab.pdfgen import canvas
from reportlab.lib import pagesizes
from reportlab import rl_config
from PIL import Image, ImageSequence
from pypdf import PdfWriter, PdfReader
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Please provide your Azure Form Recognizer endpoint and key
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"  # replace with your endpoint URL (keep the quotes)
key = "YOUR_FORM_RECOGNIZER_KEY"  # replace with your key (keep the quotes)

def dist(p1, p2):
    return math.sqrt((p1.x - p2.x)*(p1.x - p2.x) + (p1.y - p2.y) * (p1.y - p2.y))

if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file', type=str, help="Input PDF or image (jpg, jpeg, tif, tiff, bmp, png) file name")
    parser.add_argument('-o', '--output', type=str, required=False, default="", help="Output PDF file name. Default: input_file + .ocr.pdf")
    args = parser.parse_args()

    input_file = args.input_file
    if args.output:
        output_file = args.output
    else:
        output_file = input_file + ".ocr.pdf"

    # Loading input file
    print(f"Loading input file {input_file}")
    if input_file.lower().endswith('.pdf'):
        # read existing PDF as images
        image_pages = convert_from_path(input_file)
    elif input_file.lower().endswith(('.tif', '.tiff', '.jpg', '.jpeg', '.png', '.bmp')):
        # read input image (potential multi page Tiff)
        image_pages = ImageSequence.Iterator(Image.open(input_file))
    else:
        sys.exit(f"Error: Unsupported input file extension {input_file}. Supported extensions: PDF, TIF, TIFF, JPG, JPEG, PNG, BMP.")

    # Running OCR using Azure Form Recognizer Read API 
    print(f"Starting Azure Form Recognizer OCR process...")
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key), headers={"x-ms-useragent": "searchable-pdf-blog/1.0.0"})

    with open(input_file, "rb") as f:
        poller = document_analysis_client.begin_analyze_document("prebuilt-read", document = f)

    ocr_results = poller.result()
    print(f"Azure Form Recognizer finished OCR text for {len(ocr_results.pages)} pages.")

    # Generate OCR overlay layer
    print(f"Generating searchable PDF...")
    output = PdfWriter()
    default_font = "Times-Roman"
    for page_id, page in enumerate(ocr_results.pages):
        ocr_overlay = io.BytesIO()

        # Calculate overlay PDF page size
        if image_pages[page_id].height > image_pages[page_id].width:
            page_scale = float(image_pages[page_id].height) / pagesizes.letter[1]
        else:
            page_scale = float(image_pages[page_id].width) / pagesizes.letter[1]

        page_width = float(image_pages[page_id].width) / page_scale
        page_height = float(image_pages[page_id].height) / page_scale

        scale = (page_width / page.width + page_height / page.height) / 2.0
        pdf_canvas = canvas.Canvas(ocr_overlay, pagesize=(page_width, page_height))

        # Add image into PDF page
        pdf_canvas.drawInlineImage(image_pages[page_id], 0, 0, width=page_width, height=page_height, preserveAspectRatio=True)

        text = pdf_canvas.beginText()
        # Set text rendering mode to invisible
        text.setTextRenderMode(3)
        for word in page.words:
            # Calculate optimal font size
            desired_text_width = max(dist(word.polygon[0], word.polygon[1]), dist(word.polygon[3], word.polygon[2])) * scale
            desired_text_height = max(dist(word.polygon[1], word.polygon[2]), dist(word.polygon[0], word.polygon[3])) * scale
            font_size = desired_text_height
            actual_text_width = pdf_canvas.stringWidth(word.content, default_font, font_size)
            
            # Calculate text rotation angle
            text_angle = math.atan2((word.polygon[1].y - word.polygon[0].y + word.polygon[2].y - word.polygon[3].y) / 2.0, 
                                    (word.polygon[1].x - word.polygon[0].x + word.polygon[2].x - word.polygon[3].x) / 2.0)
            text.setFont(default_font, font_size)
            text.setTextTransform(math.cos(text_angle), -math.sin(text_angle), math.sin(text_angle), math.cos(text_angle), word.polygon[3].x * scale, page_height - word.polygon[3].y * scale)
            text.setHorizScale(desired_text_width / actual_text_width * 100)
            text.textOut(word.content + " ")

        pdf_canvas.drawText(text)
        pdf_canvas.save()

        # Move to the beginning of the buffer
        ocr_overlay.seek(0)

        # Create a new PDF page
        new_pdf_page = PdfReader(ocr_overlay)
        output.add_page(new_pdf_page.pages[0])

    # Save output searchable PDF file
    with open(output_file, "wb") as outputStream:
        output.write(outputStream)

    print(f"Searchable PDF is created: {output_file}")
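
The per-word geometry used in the script (text width, text height, rotation angle, and horizontal stretch) can be isolated into a small dependency-free sketch, which makes it easier to test on its own. The names `Point`, `word_geometry`, and `horizontal_scale_percent` are hypothetical, the corner order (clockwise from top-left, y axis pointing down) is an assumption about the OCR polygons, and the zero-width guard is an addition that the script itself does not have.

```python
import math
from collections import namedtuple

Point = namedtuple("Point", ["x", "y"])  # stand-in for the SDK's polygon points


def dist(p1, p2):
    """Euclidean distance between two points."""
    return math.hypot(p1.x - p2.x, p1.y - p2.y)


def word_geometry(polygon, scale=1.0):
    """Return (width, height, angle_in_radians) of a word's bounding polygon.

    The polygon is assumed to list corners clockwise from top-left, with the
    y axis pointing down as in image coordinates.
    """
    top_left, top_right, bottom_right, bottom_left = polygon
    width = max(dist(top_left, top_right), dist(bottom_left, bottom_right)) * scale
    height = max(dist(top_right, bottom_right), dist(top_left, bottom_left)) * scale
    # Average the top and bottom edge vectors to estimate the text direction
    angle = math.atan2(
        (top_right.y - top_left.y + bottom_right.y - bottom_left.y) / 2.0,
        (top_right.x - top_left.x + bottom_right.x - bottom_left.x) / 2.0,
    )
    return width, height, angle


def horizontal_scale_percent(desired_width, natural_width):
    """Percentage by which text must be stretched to span desired_width."""
    if natural_width <= 0:
        return 100.0  # nothing measurable to stretch; leave text unscaled
    return desired_width / natural_width * 100.0


box = [Point(1, 1), Point(3, 1), Point(3, 1.5), Point(1, 1.5)]
width, height, angle = word_geometry(box, scale=72)  # illustrative scale factor
print(width, height, math.degrees(angle))
```

The font size is set to the word height, and the horizontal scale then stretches or squeezes the invisible word so that its selection box lines up with the word in the page image.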

Updated Jan 30, 2025

Version 9.0

azure ai document intelligence

azure ai services

anatolip

Microsoft

Joined March 04, 2021


55 Comments

anatolip
Microsoft
Oct 08, 2024
wsahawneh, thanks for reporting this issue, and sorry for the inconvenience. We are investigating a similar issue reported by another customer. It looks like some types of PDFs are affected. Feel free to open an Azure Support ticket to get the latest updates.
wsahawneh
Copper Contributor
Oct 06, 2024
anatolip
Thanks for the search PDF functionality. I was wondering if something has changed with the API. When trying to download the searchable PDF generated I am getting a 404. The functionality was working fine a few days ago, but something seems to have changed. Here are the curl commands I have used

1. POST to Start the Analysis:

curl -X POST "https://<YOUR-ENDPOINT>.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-read:analyze?output=pdf&api-version=2024-07-31-preview" \
  -H "Content-Type: application/json" \
  -H "Ocp-Apim-Subscription-Key: <YOUR-AZURE-KEY>" \
  --data-ascii '{"urlSource": "https://<YOUR-FILE-URL>.pdf"}'

The analysis starts successfully and returns a 202 Accepted response with an operation-location header, which contains the URL to check the status of the operation.

2. GET to Poll for the Analysis Status:

curl -H "Ocp-Apim-Subscription-Key: <YOUR-AZURE-KEY>" \
  "https://<YOUR-ENDPOINT>.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-read/analyzeResults/<RESULT-ID>?api-version=2024-07-31-preview"

After polling, I receive a "succeeded" status, indicating that the analysis has completed successfully.

This is where the problem happens
3. GET to Retrieve the PDF:

curl -H "Ocp-Apim-Subscription-Key: <YOUR-AZURE-KEY>" \
  "https://<YOUR-ENDPOINT>.cognitiveservices.azure.com/documentintelligence/documentModels/prebuilt-read/analyzeResults/<RESULT-ID>/pdf?api-version=2024-07-31-preview" \
  --output results.pdf

This request returns a 404 Not Found error. I am unable to retrieve the PDF, despite the analysis showing as "succeeded".

Any ideas as to what I may be missing?
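
For readers following along, the three REST calls above differ only in their URLs, which can be sketched in Python as below. This is an illustration of how the URLs relate under the 2024-07-31-preview paths quoted in this comment; the function names are hypothetical, and the sketch neither explains nor works around the 404.

```python
from urllib.parse import urlsplit

API_VERSION = "2024-07-31-preview"
MODEL_PATH = "/documentintelligence/documentModels/prebuilt-read"


def analyze_url(endpoint: str) -> str:
    """URL for the POST that starts analysis and requests PDF output."""
    return f"{endpoint.rstrip('/')}{MODEL_PATH}:analyze?output=pdf&api-version={API_VERSION}"


def result_id_from_operation_location(operation_location: str) -> str:
    """Extract the result ID from the operation-location header value."""
    segments = urlsplit(operation_location).path.split("/")
    return segments[segments.index("analyzeResults") + 1]


def status_url(endpoint: str, result_id: str) -> str:
    """URL for the GET that polls the analysis status."""
    return f"{endpoint.rstrip('/')}{MODEL_PATH}/analyzeResults/{result_id}?api-version={API_VERSION}"


def pdf_url(endpoint: str, result_id: str) -> str:
    """URL for the GET that downloads the generated searchable PDF."""
    return f"{endpoint.rstrip('/')}{MODEL_PATH}/analyzeResults/{result_id}/pdf?api-version={API_VERSION}"
```

Every request also needs the `Ocp-Apim-Subscription-Key` header shown in the curl commands.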
anatolip
Microsoft
Aug 19, 2024
akinoril, thanks for trying the new "output=pdf" parameter and sharing the details of your REST calls. Could you please use API version 2024-07-31-preview instead of 2024-02-29-preview? The new Searchable PDF functionality is only available in the latest public preview.

Also, keep in mind that "Searchable PDF currently only supports PDF files as input. Support for other file types, such as image files, will be available later." (see doc).
akinoril
Copper Contributor
Aug 19, 2024

I am trying to use the following cURL command to analyze a PDF file and generate a searchable PDF:

curl -i -X POST "%DI_ENDPOINT%/documentintelligence/documentModels/prebuilt-read:analyze?output=pdf&api-version=2024-02-29-preview" \
  -H "Content-Type: application/json" \
  -H "Ocp-Apim-Subscription-Key: %DI_KEY%" \
  -d "{\"urlSource\": \"<PDF_FILE_URL>\"}"

However, this command fails with an error. In contrast, a similar command for analyzing an image file works successfully:

curl -i -X POST "%DI_ENDPOINT%/documentintelligence/documentModels/prebuilt-read:analyze?api-version=2024-02-29-preview" \
  -H "Content-Type: application/json" \
  -H "Ocp-Apim-Subscription-Key: %DI_KEY%" \
  --data-ascii "{'urlSource': 'https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/read.png'}"

Issue Description:
Why does the command to analyze a PDF file fail while the command to analyze an image file succeeds?
Are there specific configurations or formats required for processing PDF files?
Could you provide any recommendations or solutions to successfully generate a searchable PDF?
Thank you for your assistance!
Additional Information:
API Endpoint: %DI_ENDPOINT%
API Key: %DI_KEY%
Request URL (Image): https://raw.githubusercontent.com/Azure-Samples/cognitive-services-REST-api-samples/master/curl/form-recognizer/rest-api/read.png
ShaneWWatson
Copper Contributor
Aug 14, 2024
Thank you, anatolip for posting updates on this. This will be really helpful for my organization.
anatolip
Microsoft
Aug 13, 2024
I'm glad to announce that Searchable PDF functionality was released in 2024-07-31-preview as part of the prebuilt-read API: Read model OCR data extraction - Searchable PDF
nickstiv
Copper Contributor
May 24, 2024
anatolip this is wonderful and I can't wait to try this code out! The editing of the output file has been my biggest issue with the original process.

anatolip

Microsoft

May 13, 2024

I had an offline discussion with ramprasadgajula and provided a code sample of an Azure Function that produces a searchable PDF by editing the existing PDF instead of rendering it. I am sharing the code sample here in case it benefits others.

This solution takes a different approach: it modifies the existing PDF using the PyMuPDF package instead of rendering it using pdf2image. It would not work for PDFs that mix digital text and images, since the text would be duplicated. It may also hit text alignment bugs on complex PDFs, since PDF editing is more complex than rendering. Even though this solution is less generic than PDF rendering (described in the original blog post), it has a few advantages:

No dependence on Poppler (pdf2image) binaries, which makes it easy to run as an Azure Function (see code below).
The existing PDF structure and metadata (annotations, objects, etc.) stay as is.
The PDF file size increases predictably: only slightly, due to the added text elements (vs. rendering at a specific DPI/JPEG quality).
The code requires very little CPU to add text elements compared with rendering and image encoding.

Below is a very basic Azure Function implementation with an HTTP trigger. Pass the URL of the file as the query parameter “fileurl”, and the function returns the searchable PDF file so that it can be viewed right in the browser:

https://%%YOUR_FUNCTION%%.azurewebsites.net/api/searchable_pdf?fileurl=https://documentintelligence.ai.azure.com/documents/samples/document/generaldoc.pdf&code=%%YOUR_AZURE_FUNCTION_CODE%%

You will need to add FR_ENDPOINT and FR_KEY to your Azure Function environment variables.

function_app.py:

import azure.functions as func
import logging
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient
import os
import math
import fitz # PyMuPDF 
import requests
import traceback
from urllib.parse import urlparse

app = func.FunctionApp(http_auth_level=func.AuthLevel.FUNCTION)
logger = logging.getLogger('azure')

# Function to adjust text position based on page rotation
def adjust_position_for_rotation(page, position):
    rotation = page.rotation
    if rotation == 90:
        return position[1], page.rect.width - position[0]
    elif rotation == 180:
        return page.rect.width - position[0], page.rect.height - position[1]
    elif rotation == 270:
        return page.rect.height - position[1], position[0]
    else:
        # No rotation (0 degrees): coordinates are used as-is
        return position[0], position[1]
    
def dist(p1, p2):
    return math.sqrt((p1.x - p2.x)*(p1.x - p2.x) + (p1.y - p2.y) * (p1.y - p2.y))

@app.route(route="searchable_pdf")
def searchable_pdf(req: func.HttpRequest) -> func.HttpResponse:
    try:
        logging.info('Python HTTP trigger function processed a request.')

        logging.info('Getting FR keys...')
        endpoint = os.environ["FR_ENDPOINT"]
        key = os.environ["FR_KEY"]
        logging.info(f'{endpoint} resource will be used later.')

        fileurl = req.params.get('fileurl')
        if not fileurl:
            return func.HttpResponse(
                "Please pass a fileurl on the query string",
                status_code=400
            )
        
        fileurl = urlparse(fileurl)
        
        logger.info(f"Downloading {fileurl}")
        file_get = requests.get(fileurl.geturl())
        if file_get.status_code != 200:
            return func.HttpResponse(
                f"Failed to download file from {fileurl.geturl()}. Status code: {file_get.status_code}",
                status_code=400
            )
        
        searchable_file_name = os.path.basename(fileurl.path) + ".ocr.pdf"
        logger.info(f"Downloaded")
        logger.info(f"Loading pdf as {searchable_file_name}...")
        existing_pdf = fitz.open(searchable_file_name, file_get.content)
        logger.info(f"Loaded {len(existing_pdf)} pages from PDF file.")
        logger.info(f"Starting Azure Form Recognizer OCR process...")
        document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key), headers={"x-ms-useragent": "searchable-pdf-blog/1.0.7"})
        poller = document_analysis_client.begin_analyze_document_from_url("prebuilt-read", document_url = fileurl.geturl())
        ocr_results = poller.result()
        logger.info(f"Azure Form Recognizer finished OCR text for {len(ocr_results.pages)} pages.")

        # Generate OCR overlay layer
        logger.info(f"Generating searchable PDF...")
            
        for page_id, page in enumerate(ocr_results.pages):
            # Calculate PDF page size and scale
            existing_pdf_page = existing_pdf[page_id]
            existing_pdf_page.wrap_contents()
            
            page_width = float(existing_pdf_page.rect.width)
            page_height = float(existing_pdf_page.rect.height)
            scale = 1.0 * (page_width + page_height) / (page.width + page.height)

            shape = existing_pdf_page.new_shape()
            for line in page.words:
                # Calculate optimal font size
                desired_text_width = max(dist(line.polygon[0], line.polygon[1]), dist(line.polygon[3], line.polygon[2])) * scale
                desired_text_height = max(dist(line.polygon[1], line.polygon[2]), dist(line.polygon[0], line.polygon[3])) * scale
                font_size = desired_text_height
                actual_text_width = fitz.get_text_length(line.content, fontsize=font_size)
                
                # Calculate text rotation angle
                text_angle = math.atan2((line.polygon[1].y - line.polygon[0].y + line.polygon[2].y - line.polygon[3].y) / 2.0, 
                                        (line.polygon[1].x - line.polygon[0].x + line.polygon[2].x - line.polygon[3].x) / 2.0)
                
                matrix = fitz.Matrix(fitz.Identity)
                matrix.prerotate(-math.degrees(text_angle) + existing_pdf_page.rotation)
                matrix.prescale(desired_text_width / actual_text_width, 1)
                pos = fitz.Point(adjust_position_for_rotation(existing_pdf_page, (line.polygon[3].x * scale, line.polygon[3].y * scale)))
                
                shape.insert_text(pos, line.content, fontsize=font_size, render_mode=3, morph = (pos, matrix))
            
            shape.commit(overlay=True)  

        pdf_bytes = existing_pdf.tobytes(deflate=True, linear=True, garbage = 4)
        return func.HttpResponse(pdf_bytes, status_code=200, mimetype="application/pdf", headers={"Content-Disposition": f"filename={searchable_file_name}"})
    except Exception as e:
        logger.exception("An error occurred:")
        return func.HttpResponse(
             f"An error occurred:\n{traceback.format_exc()}",
             status_code=400
        )

requirements.txt

# DO NOT include azure-functions-worker in this file
# The Python Worker is managed by Azure Functions platform
# Manually managing azure-functions-worker may cause unexpected issues

azure-functions
azure-ai-formrecognizer
pymupdf==1.23.* # latest version has some issue with coordinate system
requests
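
When calling the function, it is safer to percent-encode the `fileurl` value, because a source URL that carries its own query string would otherwise be split across the function's parameters. The sketch below shows one way to build the request URL with the standard library; the function name `build_function_url`, the host name, and the function code are placeholders invented for this example.

```python
from urllib.parse import urlencode


def build_function_url(function_base_url: str, fileurl: str, code: str) -> str:
    """Build the request URL, percent-encoding the query parameter values."""
    query = urlencode({"fileurl": fileurl, "code": code})
    return f"{function_base_url}?{query}"


url = build_function_url(
    "https://example-function.azurewebsites.net/api/searchable_pdf",  # placeholder host
    "https://documentintelligence.ai.azure.com/documents/samples/document/generaldoc.pdf",
    "PLACEHOLDER_FUNCTION_CODE",  # placeholder function key
)
print(url)
```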

anatolip
Microsoft
May 13, 2024
Pavankumarpotta, there is no C# example for this code, but there are samples showing how to use the .NET SDK for Document Intelligence here. For rendering and/or editing PDFs, you will need to pick a PDF library that satisfies your production requirements.
Pavankumarpotta
Copper Contributor
May 13, 2024
Is there any code reference to do the same in C#?