PDF documents are widely used in business processes. Digitally created PDFs are convenient to work with: their text can be searched, highlighted, and annotated. Unfortunately, many PDFs are created by scanning paper or by converting images, so they contain no digital text and cannot be searched. In this blog post, we show how to convert such PDFs into searchable PDFs using a short Python script and Azure Form Recognizer. The generated searchable PDF can be stored anywhere, and its text can be searched, copied, and pasted.
Blog content:
- Azure Form Recognizer overview
- Searchable vs non-searchable PDFs
- How to generate a searchable PDF
- Pre-requirement installation
- How to run searchable PDF script
- Searchable PDF Python script
Azure Form Recognizer overview
Azure Form Recognizer is a cloud-based Azure Applied AI Service that uses deep machine-learning models to extract text, key-value pairs, tables, and form fields from your documents. In this blog post, we take the text extracted by Form Recognizer and add it to the PDF to make it searchable.
Searchable vs non-searchable PDFs
If a PDF contains text information, users can select, copy and paste, and annotate the text in the PDF. In a searchable PDF (example), text can be searched and selected; see the text highlighting below:
If a PDF is image-based (example), text cannot be searched or selected. Zooming in typically reveals image compression artifacts around the text:
How to generate a searchable PDF
PDFs contain different types of elements: text, images, and others. Image-based PDFs contain only image elements. The goal of this blog post is to add invisible text elements to the PDF so that users can search and select them. The text is invisible so that the resulting searchable PDF looks identical to the original PDF. In the example below, the word “Transition” is now selectable through the invisible text layer:
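To make the idea concrete, here is a minimal sketch of what an invisible text element can look like inside a PDF page content stream. It is an illustration only, not the exact bytes the script produces: the helper names `pdf_escape` and `invisible_text_ops`, the font resource name `/F1`, and the coordinates are assumptions. The operators come from the PDF specification, where text rendering mode 3 (`3 Tr`) means the glyphs are neither filled nor stroked, so the text can be found and selected but is never painted.

```python
def pdf_escape(s: str) -> str:
    """Escape characters that are special inside a PDF literal string."""
    return s.replace("\\", "\\\\").replace("(", "\\(").replace(")", "\\)")


def invisible_text_ops(word: str, x: float, y: float, font_size: float) -> str:
    """Return content-stream operators that place `word` invisibly at (x, y)."""
    return "\n".join([
        "BT",                        # begin text object
        f"/F1 {font_size:g} Tf",     # select font resource /F1 (assumed name) and size
        "3 Tr",                      # text rendering mode 3: invisible
        f"1 0 0 1 {x:g} {y:g} Tm",   # text matrix: no rotation, origin at (x, y)
        f"({pdf_escape(word)}) Tj",  # show the string
        "ET",                        # end text object
    ])


# Place the word "Transition" invisibly, 72 points from the left and 700 points up.
ops = invisible_text_ops("Transition", 72, 700, 12)
print(ops)
```

In the script further down, reportlab writes the content stream for us; its `setTextRenderMode(3)` and `setTextTransform(...)` calls play the same role as the `Tr` and `Tm` operators in this sketch.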
Pre-requirement installation
Please install the following packages before running the searchable PDF script:
- Python packages:
pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image
(The version specifiers are quoted so that the shell does not treat > as output redirection.)
- The pdf2image package requires Poppler. Follow the instructions at https://pypi.org/project/pdf2image/ for your platform, or install it with Conda:
conda install -c conda-forge poppler
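Before running the script, it can help to confirm that everything is in place. The following is a small sketch rather than an official check: the helper names `module_available` and `check_prerequisites` are made up for this post, and it assumes Poppler is meant to be found through PATH (pdf2image calls Poppler's `pdftoppm` command-line tool).

```python
import importlib.util
import shutil

# Import names of the packages installed by the pip command above.
REQUIRED_MODULES = ["azure.ai.formrecognizer", "pypdf", "reportlab", "PIL", "pdf2image"]


def module_available(name: str) -> bool:
    """Return True if `name` can be located in the current environment."""
    try:
        return importlib.util.find_spec(name) is not None
    except (ImportError, ValueError):
        # ImportError is raised when a parent package (for example `azure`) is missing.
        return False


def check_prerequisites() -> dict:
    """Map each requirement to True or False depending on whether it was found."""
    status = {name: module_available(name) for name in REQUIRED_MODULES}
    # Assumption: Poppler is expected on PATH, so look for its pdftoppm executable.
    status["poppler (pdftoppm)"] = shutil.which("pdftoppm") is not None
    return status


for requirement, found in check_prerequisites().items():
    print(f"{'OK     ' if found else 'MISSING'}  {requirement}")
```

Any line reported as MISSING points to a package that still needs to be installed with the commands above.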
How to run searchable PDF script
- Create a Python file using the code below and save it on your local machine as fr_generate_searchable_pdf.py.
- Update the key and endpoint variables with the values from your Form Recognizer instance in the Azure portal (see Quickstart: Form Recognizer SDKs for more details).
- Run the script and pass the input file (PDF or image) as a parameter:
python fr_generate_searchable_pdf.py <input.pdf/jpg>
Sample script output is below:
(base) C:\temp>python fr_generate_searchable_pdf.py input.jpg
Loading input file input.jpg
Starting Azure Form Recognizer OCR process...
Azure Form Recognizer finished OCR text for 1 pages.
Generating searchable PDF...
Searchable PDF is created: input.jpg.ocr.pdf
- The script generates a searchable PDF file named after the input file with the suffix .ocr.pdf, unless a different output name is given with the -o option.
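If you would rather not paste the key into the script file, one option is to read both values from environment variables. This is a sketch and not part of the original script: the variable names `FORM_RECOGNIZER_ENDPOINT` and `FORM_RECOGNIZER_KEY` and the helper `load_credentials` are hypothetical, so pick whatever names suit your environment.

```python
import os

# Hypothetical environment variable names; they are not defined by Azure or the SDK.
ENDPOINT_VAR = "FORM_RECOGNIZER_ENDPOINT"
KEY_VAR = "FORM_RECOGNIZER_KEY"


def load_credentials(env=os.environ):
    """Return (endpoint, key) read from `env`, or raise if either one is missing."""
    endpoint = env.get(ENDPOINT_VAR, "").strip()
    key = env.get(KEY_VAR, "").strip()
    missing = [name for name, value in ((ENDPOINT_VAR, endpoint), (KEY_VAR, key)) if not value]
    if missing:
        raise RuntimeError("Missing environment variable(s): " + ", ".join(missing))
    return endpoint, key
```

With this helper, the two assignments in the script could become `endpoint, key = load_credentials()`, after setting both variables in the shell that runs the script.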
Searchable PDF Python script
Copy the code below into a Python script on your local machine. The script takes a scanned PDF or image as input and uses Form Recognizer to generate a corresponding searchable PDF. It adds a searchable text layer to the PDF so that you can search, copy, and paste the text within it.
fr_generate_searchable_pdf.py
# Script to create a searchable PDF from a scanned PDF or images using Azure Form Recognizer
# Required packages
# pip install --upgrade "azure-ai-formrecognizer>=3.3" "pypdf>=3.0" reportlab pillow pdf2image
import sys
import io
import math
import argparse
from pdf2image import convert_from_path
from reportlab.pdfgen import canvas
from reportlab.lib import pagesizes
from reportlab import rl_config
from PIL import Image, ImageSequence
from pypdf import PdfWriter, PdfReader
from azure.core.credentials import AzureKeyCredential
from azure.ai.formrecognizer import DocumentAnalysisClient

# Please provide your Azure Form Recognizer endpoint and key (as quoted strings)
endpoint = "YOUR_FORM_RECOGNIZER_ENDPOINT"
key = "YOUR_FORM_RECOGNIZER_KEY"

def dist(p1, p2):
    # Euclidean distance between two polygon points
    return math.sqrt((p1.x - p2.x) * (p1.x - p2.x) + (p1.y - p2.y) * (p1.y - p2.y))


if __name__ == '__main__':
    parser = argparse.ArgumentParser()
    parser.add_argument('input_file', type=str, help="Input PDF or image (jpg, jpeg, tif, tiff, bmp, png) file name")
    parser.add_argument('-o', '--output', type=str, required=False, default="", help="Output PDF file name. Default: input_file + .ocr.pdf")
    args = parser.parse_args()

    input_file = args.input_file
    if args.output:
        output_file = args.output
    else:
        output_file = input_file + ".ocr.pdf"

    # Loading input file
    print(f"Loading input file {input_file}")
    if input_file.lower().endswith('.pdf'):
        # read existing PDF as images
        image_pages = convert_from_path(input_file)
    elif input_file.lower().endswith(('.tif', '.tiff', '.jpg', '.jpeg', '.png', '.bmp')):
        # read input image (potential multi page Tiff)
        image_pages = ImageSequence.Iterator(Image.open(input_file))
    else:
        sys.exit(f"Error: Unsupported input file extension {input_file}. Supported extensions: PDF, TIF, TIFF, JPG, JPEG, PNG, BMP.")

    # Running OCR using Azure Form Recognizer Read API
    print("Starting Azure Form Recognizer OCR process...")
    document_analysis_client = DocumentAnalysisClient(endpoint=endpoint, credential=AzureKeyCredential(key), headers={"x-ms-useragent": "searchable-pdf-blog/1.0.0"})

    with open(input_file, "rb") as f:
        poller = document_analysis_client.begin_analyze_document("prebuilt-read", document=f)

    ocr_results = poller.result()
    print(f"Azure Form Recognizer finished OCR text for {len(ocr_results.pages)} pages.")

    # Generate OCR overlay layer
    print("Generating searchable PDF...")
    output = PdfWriter()
    default_font = "Times-Roman"
    for page_id, page in enumerate(ocr_results.pages):
        ocr_overlay = io.BytesIO()

        # Calculate overlay PDF page size (longer side is fitted to the long side of Letter)
        if image_pages[page_id].height > image_pages[page_id].width:
            page_scale = float(image_pages[page_id].height) / pagesizes.letter[1]
        else:
            page_scale = float(image_pages[page_id].width) / pagesizes.letter[1]

        page_width = float(image_pages[page_id].width) / page_scale
        page_height = float(image_pages[page_id].height) / page_scale

        # Factor that converts OCR coordinates into PDF page coordinates
        scale = (page_width / page.width + page_height / page.height) / 2.0
        pdf_canvas = canvas.Canvas(ocr_overlay, pagesize=(page_width, page_height))

        # Add image into PDF page
        pdf_canvas.drawInlineImage(image_pages[page_id], 0, 0, width=page_width, height=page_height, preserveAspectRatio=True)

        text = pdf_canvas.beginText()
        # Set text rendering mode to invisible
        text.setTextRenderMode(3)
        for word in page.words:
            # Calculate optimal font size
            desired_text_width = max(dist(word.polygon[0], word.polygon[1]), dist(word.polygon[3], word.polygon[2])) * scale
            desired_text_height = max(dist(word.polygon[1], word.polygon[2]), dist(word.polygon[0], word.polygon[3])) * scale
            font_size = desired_text_height
            actual_text_width = pdf_canvas.stringWidth(word.content, default_font, font_size)

            # Calculate text rotation angle
            text_angle = math.atan2((word.polygon[1].y - word.polygon[0].y + word.polygon[2].y - word.polygon[3].y) / 2.0,
                                    (word.polygon[1].x - word.polygon[0].x + word.polygon[2].x - word.polygon[3].x) / 2.0)
            text.setFont(default_font, font_size)
            text.setTextTransform(math.cos(text_angle), -math.sin(text_angle), math.sin(text_angle), math.cos(text_angle), word.polygon[3].x * scale, page_height - word.polygon[3].y * scale)
            # Stretch or squeeze the word so it spans the same width as in the image
            text.setHorizScale(desired_text_width / actual_text_width * 100)
            text.textOut(word.content + " ")

        pdf_canvas.drawText(text)
        pdf_canvas.save()

        # Move to the beginning of the buffer
        ocr_overlay.seek(0)

        # Create a new PDF page
        new_pdf_page = PdfReader(ocr_overlay)
        output.add_page(new_pdf_page.pages[0])

    # Save output searchable PDF file
    with open(output_file, "wb") as outputStream:
        output.write(outputStream)

    print(f"Searchable PDF is created: {output_file}")
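
The per-word geometry in the loop above can be easier to follow in isolation. The sketch below repeats the same arithmetic on a plain polygon: the longer horizontal edge becomes the target text width, the longer vertical edge becomes the font size, and the averaged top and bottom edges give the rotation angle. The `Point` tuple and the `word_geometry` helper are illustrative names, not part of the script or the SDK, and the polygon is assumed to be ordered top-left, top-right, bottom-right, bottom-left in image coordinates, which matches how the script indexes it.

```python
import math
from collections import namedtuple

Point = namedtuple("Point", "x y")  # stand-in for the point objects returned by the SDK


def dist(p1, p2):
    """Euclidean distance between two points."""
    return math.hypot(p1.x - p2.x, p1.y - p2.y)


def word_geometry(polygon, scale=1.0):
    """Return (width, height, angle_in_radians) for a four-point word polygon."""
    top_left, top_right, bottom_right, bottom_left = polygon
    width = max(dist(top_left, top_right), dist(bottom_left, bottom_right)) * scale
    height = max(dist(top_right, bottom_right), dist(top_left, bottom_left)) * scale
    # Average the direction of the top and bottom edges to estimate the text angle.
    angle = math.atan2(
        (top_right.y - top_left.y + bottom_right.y - bottom_left.y) / 2.0,
        (top_right.x - top_left.x + bottom_right.x - bottom_left.x) / 2.0,
    )
    return width, height, angle


# An axis-aligned box 4 units wide and 1 unit tall, scaled by 72
# (for example, inches to PDF points).
box = [Point(1, 1), Point(5, 1), Point(5, 2), Point(1, 2)]
width, height, angle = word_geometry(box, scale=72)
print(width, height, math.degrees(angle))  # prints: 288.0 72.0 0.0
```

In the script, the height is used directly as the font size, and the ratio between the target width and the width reportlab would naturally give the word at that size is passed to `setHorizScale` as a percentage.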