SOLVED

Is there any way at all to search for PDF files using PDF keywords in SharePoint Online?


I have many PDF files in an SPO site.  They have been tagged with keywords, and I’d like them to be indexed by those keywords.  In the SPO Search Center I can find and filter PDF files by author and title, but there doesn’t seem to be any way to query by keywords.  The same question was asked by someone in 2015 on TechNet forums, but wasn’t really answered.

14 Replies
I am not sure if the PDF properties will be picked up. You should really add those properties to the PDF document at the SharePoint level, so use term sets and document properties in SharePoint. It might be possible to create a script that reads the PDF properties and updates the SharePoint properties accordingly.
Agree with Pieter, you will need to add these properties as metadata in your document libraries if you want to be able to search documents using this metadata.
best response
Solution

SharePoint can query the properties (i.e. metadata) of a document only if there is a document parser that "promotes" such properties when uploading the document.

Unfortunately, SPO does not provide an out-of-the-box document parser for PDF files, hence the PDF properties are not "promoted" (i.e. they are ignored).

So, if you want to query PDF properties in SPO, you have to populate the corresponding columns yourself, manually or automatically, in the document library where the PDF is stored.

See https://blogs.technet.microsoft.com/wbaer/2014/08/29/document-property-promotion-and-demotion-overvi...

Thanks for the link.  I don’t think it would be easy to automate property promotion and demotion without a server-side document parser, which doesn’t seem possible with SPO.

Hi stesdsuk,

Which PDF metadata fields are of interest to you? The core properties like Title, Author, CreationDate, ... or also the metadata stored within the PDF files in XMP format?

(rationale: there might be a solution for this that will even work in SPO)

Paul | SLIM Applications

I’m interested in the basic metadata stored in the PDF information dictionary, specifically “Author”, “Title”, and “Keywords”.  Do you think there’s a free and purely SPO-based solution?

Well, the free and purely SPO-based solution is what Microsoft provides with the search engine. Adding an extra layer to extract metadata from files in SPO could be possible, but not for free.

The PDF Keywords property is not searchable in SharePoint Online. The only alternative is to use a custom solution (it can be built in JavaScript) that extracts the Keywords property value from PDF files and then writes the value into a SharePoint column. This allows the keywords to be used in searches and also in views.
Because it uses JavaScript, it will also work on SharePoint Online and can be packaged in different ways (e.g. as a provider-hosted app). Such a custom solution can read all the properties in PDF files, such as XMP fields, the modification date, and custom properties. As far as I know there are no free solutions that offer this capability. It would be beneficial to a wide audience because PDF is a common format.
Paul

> The only alternative is to use a custom solution (it can be built in JavaScript) […]

I’m not quite familiar with the SharePoint Online architecture.  I took a quick look at the article “SharePoint Add-ins compared with SharePoint solutions” and my understanding is that my custom solution will have to either run in an active browser session or be hosted and run somewhere else, and either way it’s going to have to use the standard, client-facing APIs to fetch and parse the PDF files and update their columns.  Is that right?

Indeed. The interaction between JavaScript and SP uses the REST API or CSOM.
Suppose the "solution" adds an extra option in the ribbon to upload PDF files. This allows the JavaScript (running in the browser) to parse the PDF file prior to uploading, extract the PDF metadata properties, and then fill the corresponding SharePoint columns. This does require the users to use the new option to upload PDF files, so it will not work in all cases (e.g. when they use Explorer view to upload files).
To cater for the PDF documents already present in SharePoint, you will need to find all PDFs without extracted metadata and then extract the PDF metadata. This needs to be repeated regularly to catch PDF files that are uploaded without the new option. It can be done, but it is not trivial and requires ongoing attention.
A possible alternative is to use workflows or Flow. This event-driven solution then needs to provide the logic to extract the PDF metadata. Plus, you will need a solution to cater for the existing PDF documents. Again, this is not simple to implement.
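
As a rough sketch of the backlog pass and the column update described above, the Python below selects library items that are PDFs with an empty keywords column and builds (without sending) a REST request that would update that column. The column name `PdfKeywords`, both function names, and the bearer token are assumptions for illustration. It also assumes the list items were fetched with `FileLeafRef` selected and that the site accepts `odata=nometadata` JSON for updates. Authentication is out of scope here.

```python
import json
from urllib.parse import quote


def needs_extraction(item: dict, column: str = "PdfKeywords") -> bool:
    """True for PDF items whose (hypothetical) keywords column is still empty."""
    name = item.get("FileLeafRef") or ""
    return name.lower().endswith(".pdf") and not item.get(column)


def build_update_request(site_url: str, server_relative_path: str,
                         values: dict, access_token: str) -> dict:
    """Describe a REST MERGE request that sets column values on a file's list item."""
    # Single quotes are doubled inside the OData string literal, then URL-encoded.
    escaped = quote(server_relative_path.replace("'", "''"))
    url = (f"{site_url}/_api/web/"
           f"GetFileByServerRelativeUrl('{escaped}')/ListItemAllFields")
    headers = {
        "Authorization": f"Bearer {access_token}",  # assumption: OAuth bearer token
        "Accept": "application/json;odata=nometadata",
        "Content-Type": "application/json;odata=nometadata",
        "IF-MATCH": "*",            # overwrite regardless of the item's version
        "X-HTTP-Method": "MERGE",   # partial update tunnelled through POST
    }
    return {"method": "POST", "url": url, "headers": headers,
            "body": json.dumps(values)}
```

A scheduled job could loop over the items for which `needs_extraction` is true, download each file, extract its properties, and send the request built here with any HTTP client. The same update call would also serve a workflow-triggered service that receives one file at a time.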

It’s a pretty large PDF library and is almost always updated by the OneDrive desktop app. So I guess workflows are the only way to go. I’ve just seen that it’s possible to send HTTP requests from SP 2013 workflows, for example with file GUIDs upon file creation and update, and I’ll look further in this direction. Thank you.

Hi Juan Carlos. What options (even paid ones) exist for indexing other types of PDF properties or metadata in SPO?

Nearly two years have gone by.

Is there any solution to search for keywords in PDF files using OneDrive for Business?

Sven

OOTB this is not possible.
You need to tackle two problems:
1. Extracting the PDF properties
Apps have emerged with the capability to extract properties from PDF files (like Keywords, Subject, PDF version, ... or custom PDF properties) and map them to columns in OneDrive (or SharePoint), e.g. SLIM Companion Explorer.
2. Making sure the columns used are searchable
If you use custom columns then they are searchable by default. So that is the easiest way forward.
Paul | SLIM Applications (https://www.slimapplications.com/)
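
Once the keywords sit in a searchable column, a query restricted to that column could look roughly like the Python sketch below, which only builds the URL for the SharePoint search REST endpoint. The managed property name `PdfKeywordsOWSTEXT` is an assumption (it follows the naming pattern of auto-generated managed properties for text site columns and would need to be checked in the search schema), and `build_keyword_search_url` is a hypothetical helper.

```python
from urllib.parse import quote


def build_keyword_search_url(site_url: str, keyword: str,
                             managed_property: str = "PdfKeywordsOWSTEXT") -> str:
    """Build a search REST URL that matches PDFs by a keywords column."""
    phrase = keyword.replace('"', "")  # keep the quoted KQL phrase well-formed
    kql = f'{managed_property}:"{phrase}" AND FileType:pdf'
    # querytext is an OData string literal: double single quotes, then URL-encode.
    encoded = quote(kql.replace("'", "''"), safe="")
    return (f"{site_url}/_api/search/query"
            f"?querytext='{encoded}'&selectproperties='Title,Path'")
```

The same KQL text (property, colon, quoted phrase) can also be typed directly into the Search Center box, which is a quick way to confirm that the column has been crawled before scripting anything.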
