Word search through thousands of PDFs?


Is this the appropriate product to use if I want to create a word search index for thousands of PDF files, and then query it from my ASP.NET application?

 
Yes, absolutely. You could put your thousands of PDF files in a repository like blob storage, index them, and then query for the information in those PDFs. Note that there are many different types of PDFs: if you have scanned PDFs, you may want to add a skillset to your indexer that extracts text from the images embedded in them.

The easiest way to do all of this (in just a couple of minutes) is to follow this tutorial: https://docs.microsoft.com/en-us/azure/search/cognitive-search-quickstart-blob

By the end of that quickstart, you will have an index you can query to find any PDFs that match your search terms.

Thanks,
Luis Cabrera, Azure Cognitive Search Team
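For the "query it from my ASP.NET application" part, a minimal sketch with the Azure.Search.Documents client library for .NET might look like the following. The endpoint, key, index name, and query string are placeholders, and metadata_storage_name is one of the fields the blob indexer typically populates, so adjust all of these to your own index.

```csharp
using System;
using Azure;
using Azure.Search.Documents;
using Azure.Search.Documents.Models;

class SearchExample
{
    static void Main()
    {
        // Hypothetical service endpoint, query key, and index name; replace with your own.
        var endpoint = new Uri("https://<your-search-service>.search.windows.net");
        var credential = new AzureKeyCredential("<query-api-key>");
        var client = new SearchClient(endpoint, "pdf-index", credential);

        // Full-text search across the content the indexer extracted from the PDFs.
        SearchResults<SearchDocument> results = client.Search<SearchDocument>("students from China");

        foreach (SearchResult<SearchDocument> result in results.GetResults())
        {
            // metadata_storage_name is one of the metadata fields the blob indexer can populate.
            Console.WriteLine(result.Document["metadata_storage_name"]);
        }
    }
}
```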
Last time I checked, though, the system only indexed the first 30k or so of each document. I need the entire PDF (10 KB to 2 MB each) searchable. These are all text PDFs, by the way; no images.
I have similar requirements: a client subscribes to a number of market research (in education) sources that provide regular PDFs containing text and images. My client wishes to search across those documents, e.g. for all research documents related to students coming from China. The documentation I have read seems to refer to the same text limit mentioned by @Wavel. How can we execute such an indexing operation? All of the files currently reside in SharePoint.

@Wavel You should be able to index the full content of each document.

 

Note that different tiers have different limits on the number of characters pushed into the index; see Service limits for tiers and skus - Azure Cognitive Search | Microsoft Docs. Note that this is not a limit on the size of the actual PDF file.

 

If you have an S2 tier, for instance, 8 MB of characters get indexed, but it would take a pretty big PDF to generate 8 MB of characters: at roughly 3,000 characters per page, that is about 2,700 pages full of text.

@David Duncan 

Azure Cognitive Search should work for this.
If all your content is in SharePoint, though, I would first check whether SharePoint search meets your needs.

 

If you need the additional flexibility that Azure Cognitive Search provides, you could either use the new SharePoint indexer (in preview: Configure a SharePoint Online indexer (preview) - Azure Cognitive Search | Microsoft Docs) or simply copy all your files to blob storage and use the blob storage indexer (Search over Azure Blob storage content - Azure Cognitive Search | Microsoft Docs).
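For reference, the blob storage route can also be set up from code instead of the portal. Here is a rough sketch using the Azure.Search.Documents.Indexes client for .NET; it assumes the target index (called "pdf-index" here) already exists, and the service, storage, and resource names are placeholders to replace with your own.

```csharp
using System;
using Azure;
using Azure.Search.Documents.Indexes;
using Azure.Search.Documents.Indexes.Models;

class IndexerSetup
{
    static void Main()
    {
        // Hypothetical endpoint, admin key, and resource names; replace with your own.
        var endpoint = new Uri("https://<your-search-service>.search.windows.net");
        var credential = new AzureKeyCredential("<admin-api-key>");
        var indexerClient = new SearchIndexerClient(endpoint, credential);

        // Point a data source at the blob container that holds the PDFs.
        var dataSource = new SearchIndexerDataSourceConnection(
            "pdf-blob-datasource",                            // data source name
            SearchIndexerDataSourceType.AzureBlob,            // data source type
            "<storage-connection-string>",                    // blob storage connection string
            new SearchIndexerDataContainer("pdf-container")); // container holding the PDFs
        indexerClient.CreateOrUpdateDataSourceConnection(dataSource);

        // The indexer cracks the PDFs and pushes the extracted text into the existing index.
        var indexer = new SearchIndexer(
            "pdf-blob-indexer",   // indexer name
            dataSource.Name,      // data source to read from
            "pdf-index");         // existing target index
        indexerClient.CreateOrUpdateIndexer(indexer);
    }
}
```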

 

I hope this was helpful,

 

Luis Cabrera, Azure Cognitive Search team.

So the issue becomes pricing. S2 is almost $1,000/month, and in my case we're talking about 20k documents. I think the pricing doesn't work for a small number of documents when some of them are larger than 30k or so.
My total PDF size is under the 2 GB limit of the Basic tier, but because some individual files are large, I'm forced into S1 or even S2. Not affordable in this case.

@Wavel, I am not sure what your budget is, but here is an idea: use S1, where the limit will be 4 MB of text (about 1,300 pages per document). That will only cost $250 per month.

For the bigger documents, it may not be worth paying 4X (I imagine you probably have a few outliers that are bigger than 1,300 pages). In that case, maybe just take the first 1,000 pages or so of content. That may be the best bang for the buck given your needs.
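If trimming (or splitting) the outliers before uploading them is acceptable, that step can be automated. A rough sketch using the open-source PdfSharp library (an assumption on my part, not something Azure Cognitive Search requires), which copies only the first 1,000 pages of a hypothetical oversized file:

```csharp
using System;
using PdfSharp.Pdf;
using PdfSharp.Pdf.IO;

class PdfTrimmer
{
    static void Main()
    {
        // Hypothetical file names; replace with your own paths.
        PdfDocument source = PdfReader.Open("big-report.pdf", PdfDocumentOpenMode.Import);

        var trimmed = new PdfDocument();
        int pagesToKeep = Math.Min(1000, source.PageCount);

        // Copy the first 1,000 pages; the same loop can split a file into several chunks instead.
        for (int i = 0; i < pagesToKeep; i++)
        {
            trimmed.AddPage(source.Pages[i]);
        }

        trimmed.Save("big-report-first-1000.pdf");
    }
}
```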

You are correct that there are only a few outliers. However, I have to index the entire document; we can't miss any content when our subscribers run a search.

My suggestion is to rethink the pricing structure: base it on the total number of bytes being indexed, not on individual document size. Indexing 100 5 MB files shouldn't cost so much more than indexing 5,000 2 KB files (or whatever math makes my argument work ;)

@Wavel @Luis Cabrera-Cordon
Is it possible to split those outliers into smaller ones and index them?