Forum Discussion
Word search through thousands of pdf's?
- Wavel, Mar 10, 2021, Copper Contributor
Last time I checked, though, the system only indexed the first 30k or so of each document. I need the entire PDF (10k-2MB) searchable. These are all text PDFs, by the way. No images.
- Luis Cabrera-Cordon, Mar 10, 2021, Former Employee
Wavel, you should be able to index all of the content of the document.
Note that different tiers have different limits on the number of characters pushed into the index. See Service limits for tiers and skus - Azure Cognitive Search | Microsoft Docs. That limit applies to the extracted characters, not to the size of the actual PDF.
If you have an S2 tier, for instance, 8MB of characters get indexed, but it would take a pretty big PDF to generate 8MB of characters (about 2700 pages full of characters).
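The page estimate above can be reproduced with a rough back-of-the-envelope sketch. It assumes that "8MB of characters" means about 8 million characters and that a dense page holds about 3,000 characters; both figures are assumptions for illustration, not documented values, and the function names are hypothetical.

```python
# Rough estimate only. CHARS_PER_PAGE is an assumed average for a dense
# page of plain text, not a figure taken from the Azure documentation.
CHARS_PER_PAGE = 3000


def pages_for_limit(char_limit: int, chars_per_page: int = CHARS_PER_PAGE) -> int:
    """Return how many full pages of text fit under a character limit."""
    return char_limit // chars_per_page


def fits_limit(text: str, char_limit: int) -> bool:
    """Return True if the extracted text is within the character limit."""
    return len(text) <= char_limit


if __name__ == "__main__":
    s2_limit = 8_000_000  # assumption: "8MB of characters" ~ 8 million characters
    print(pages_for_limit(s2_limit))
```

Running `fits_limit` over the text extracted from each PDF would show how many documents actually exceed a given tier's limit before committing to a larger tier.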
- Wavel, Mar 10, 2021, Copper Contributor
So the issue becomes pricing. S2 is almost $1000/mo, and in my case we're talking about 20k documents. I think the pricing doesn't work when a small number of documents might be larger than 30k or so.
My PDFs total less than the 2 GB limit of Basic, but because some are large, I'm forced into S1 or even S2. Not affordable in this case.
- David_Bluebox, Mar 10, 2021, Copper Contributor
I have similar requirements: a client subscribes to a number of market research (in education) sources that provide regular PDFs containing text and images. My client wishes to search across those documents, e.g. for all research documents related to students coming from China. The documentation I have read also seems to refer to the text limit mentioned by Wavel. How can we execute such an indexing operation? All of the files currently reside in SharePoint.
- Luis Cabrera-Cordon, Mar 10, 2021, Former Employee
Azure Cognitive Search should work for this.
If all your content is in SharePoint, though, I would first check whether SharePoint search meets your needs. If you need the additional flexibility that Azure Cognitive Search provides, you could either use the new SharePoint indexer (in preview: Configure a SharePoint Online indexer (preview) - Azure Cognitive Search | Microsoft Docs) or simply copy all your files to blob storage and use the blob storage indexer (Search over Azure Blob storage content - Azure Cognitive Search | Microsoft Docs).
I hope this was helpful,
Luis Cabrera, Azure Cognitive Search team.
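For the blob storage route mentioned above, the setup amounts to a data source definition that points at the container and an indexer definition that ties it to an index. The following is a minimal sketch of those two request bodies, assuming the field names used by the Azure Cognitive Search REST API (`type: "azureblob"`, `dataSourceName`, `targetIndexName`, `indexedFileNameExtensions`). All resource names and the helper functions are hypothetical placeholders, nothing is sent to the service, and the exact schema should be checked against the linked docs.

```python
import json


def blob_datasource(name: str, connection_string: str, container: str) -> dict:
    """Build a data source definition pointing at a blob container."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


def blob_indexer(name: str, datasource_name: str, index_name: str) -> dict:
    """Build an indexer definition that only picks up .pdf files."""
    return {
        "name": name,
        "dataSourceName": datasource_name,
        "targetIndexName": index_name,
        "parameters": {
            "configuration": {"indexedFileNameExtensions": ".pdf"}
        },
    }


if __name__ == "__main__":
    # Placeholder values; substitute real resource names and secrets.
    ds = blob_datasource("research-pdfs-ds", "<storage-connection-string>", "research-pdfs")
    ix = blob_indexer("research-pdfs-indexer", ds["name"], "research-index")
    print(json.dumps(ds, indent=2))
    print(json.dumps(ix, indent=2))
```

The printed JSON would be posted to the search service's data source and indexer endpoints; the target index itself has to exist first.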