Generate searchable PDFs with Azure Form Recognizer

Microsoft

Sep 06, 2023

Sorry for the delay in responding.

Size of the output PDF (kelevra1, sandhyarana13):
The script renders the original PDF as images (see lines 39-40):

        # read existing PDF as images
        image_pages = convert_from_path(input_file)

and the quality and compression of those images are the main factors in the output file size. The default settings render at 200 DPI and use an image format without any compression. As a result, the output file may be significantly larger than the original PDF, but this is a compromise to keep good image quality for PDFs with different text sizes and original quality.
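To give a rough sense of why uncompressed rendering produces large files, the sketch below estimates the raw pixel data for a single page at two DPI values. It assumes a US Letter page (8.5 x 11 inches) and 8-bit RGB pixels; both are assumptions for illustration, and `raw_page_bytes` is a hypothetical helper, not part of the script. The size actually embedded in the output PDF may differ.

```python
# Rough estimate of uncompressed raster size for one rendered page.
# Assumptions (not taken from the script): US Letter page, 8-bit RGB pixels.

def raw_page_bytes(dpi, width_in=8.5, height_in=11.0, channels=3):
    """Return the uncompressed size in bytes of one page rendered at `dpi`."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return width_px * height_px * channels

for dpi in (200, 150):
    size = raw_page_bytes(dpi)
    print(f"{dpi} DPI: {size / 1_000_000:.1f} MB per page before compression")
```

Under these assumptions, going from 200 to 150 DPI cuts the pixel count to about 56% (0.75 squared). JPEG compression reduces the size further, by an amount that depends on the page content.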
It is possible to control the rendering DPI and compress the images as JPEG with a specific quality. The example below renders the PDF at 150 DPI and compresses the images as JPEG with quality 90:

        # read existing PDF as images
        image_pages = convert_from_path(input_file, 
                                        dpi=150, 
                                        fmt='JPEG', 
                                        jpegopt={
                                            'quality':90,
                                            'progressive':True,
                                            'optimize':True
                                        })

Let me know if you still cannot get a file size similar to the original PDF with these settings.

Searchable PDF as API output (Oscar_Huibers, isspid):
Adding searchable PDF as an option for API output is in the team's plans. We will update this blog post when the feature becomes available. Feel free to contact me or the team directly if you would like more information.
