
Rizwan Ansari
Copper Contributor
Oct 12, 2019

SharePoint Capability to do OCR in PDF Documents

Hi.

We have a requirement that all documents (PDF, Word, etc.) uploaded to SharePoint must be fully searchable, including the text inside any embedded images.

I found an article saying that images uploaded to a library are automatically OCRed. Does SharePoint have a similar feature for images embedded in PDF and Word files, or should we opt for a third-party tool?

Kindly assist.

Thanks.

5 Replies

  • Hi Rizwan Ansari,

    SharePoint extracts text from PDFs and images, so you can find that content with out-of-the-box (OOB) search. Note that you can't customize this behavior; you have to use it as is.

    If you need to customize your OCR experience without using third-party tools, consider a solution like the one I described in my blog, which uses SharePoint, Flow, and Azure Cognitive Services.

    Cheers,

    Federico

    • Jason E. Heiser
      Iron Contributor

      What I'm reading from this diagram, though, is that OCR for PDFs can only be accomplished by running each item through Power Automate and processing it with either Cognitive Services or another OCR engine such as Muhimbi or Aquaforest.

      Am I correct? What about the run limits in Power Automate? A user could potentially upload thousands of PDFs in a week, and I'd hate to hit the run limit.

      • FedericoPorceddu82
        MVP

        Hi Jason E. Heiser 

        Power Automate (formerly Flow) is a way to build personal flows, so your statement is correct 🙂

        When designing the solution, you can consider using dedicated flows with a "per-flow" license or a Logic App on Azure.

        In this example I wanted to highlight the power of a low-code/no-code solution, but it is meant for personal use, not enterprise scale.

        Thanks for your comment 🙂

        Cheers

        Federico

    • lheidner
      Copper Contributor

      Hi FedericoPorceddu82 

      I saw your reply to the OCR question, so I wonder if you can help me. I have tens of thousands of PDF and image files in my OneDrive, but I'm not sure all of them contain searchable text, and I want to be sure no file escapes a keyword search in OneDrive. Do I need to identify the PDF and image files that have not been OCRed and run OCR on them? If so, what would you recommend? Finding and converting each file by hand would take me years.

      Thank you in advance for any help you can provide with this issue.

      Regards

      • AlexEncodian
        Copper Contributor

        lheidner

        You can use a free audit tool such as https://www.encodian.com/product/indxr/ to determine how many files are missing text layers (even on a page-level basis). Indxr provides low, fixed-cost, unlimited OCR for bulk requirements where OCR via Power Automate is not cost-effective, and it can run on an automated schedule to perform bulk OCR at a fixed price.
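
For anyone who wants a rough first pass before reaching for a dedicated audit tool, the "which files are missing a text layer" question can be approximated with a short script. The Python sketch below is only a heuristic and not how Indxr or SharePoint work: it assumes the files are available in a local folder (for example a synced OneDrive folder), and it assumes that a PDF whose raw bytes never mention `/Font` has no text layer. The function names are made up for illustration. PDFs that store their objects in compressed object streams can be flagged by mistake, a PDF with fonts can still contain image-only pages, and image files such as JPG or PNG are not covered at all.

```python
from pathlib import Path


def probably_has_text_layer(pdf_path):
    """Heuristic only: treat a '/Font' marker in the raw bytes as a sign of a text layer."""
    return b"/Font" in Path(pdf_path).read_bytes()


def find_ocr_candidates(root):
    """Return the PDFs under root (assumed to be a local folder) with no '/Font' marker."""
    candidates = []
    for path in sorted(Path(root).rglob("*")):
        # Match the extension case-insensitively so 'scan.PDF' is not skipped.
        if path.is_file() and path.suffix.lower() == ".pdf":
            if not probably_has_text_layer(path):
                candidates.append(path)
    return candidates
```

Calling `find_ocr_candidates` on the synced folder returns a list of paths that are likely scanned, image-only PDFs. Treat that list as a shortlist to spot-check, not as a definitive audit.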
