Forum Discussion
Wavel (Copper Contributor)
Mar 10, 2021
Word search through thousands of PDFs?
Is this the appropriate product to use if I want to create a word search index for thousands of PDF files and then query it from my ASP.NET application?
9 Replies
- dgtvan (Copper Contributor)
Wavel, Luis Cabrera-Cordon: Is it possible to split those outliers into smaller ones and index them?
- Luis Cabrera-Cordon (Former Employee)
Yes, absolutely. You could put your thousands of PDF files in a repository such as Blob storage, index that repository, and then query the index for the information in those PDFs.
Note that there are many different types of PDFs. If you have scanned PDFs, you may want to add a skillset to your indexer that extracts text from the images embedded in the PDFs.
The easiest way to do all of this (in just a couple of minutes) is to follow this tutorial: https://docs.microsoft.com/en-us/azure/search/cognitive-search-quickstart-blob
By the end of that quickstart, you will have an index that you can query to find any PDFs that match your search terms.
Thanks, Luis Cabrera, Azure Cognitive Search Team
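For the "query it from your application" step, here is a minimal sketch in Python of what a full-text query against the index could look like over the REST API. The service name, index name, key, and the `2020-06-30` API version are placeholders and assumptions, not values from this thread, and the same request can be made from ASP.NET with any HTTP client or the official SDK.

```python
import json
import urllib.parse
import urllib.request

# Assumption: a REST API version that is available on your search service.
API_VERSION = "2020-06-30"


def build_search_request(service, index, api_key, text, top=10):
    """Build (but do not send) a full-text search request for an index.

    `service`, `index`, and `api_key` are hypothetical placeholders for
    your own search service name, index name, and query key.
    """
    url = (
        f"https://{service}.search.windows.net/indexes/"
        f"{urllib.parse.quote(index)}/docs/search?api-version={API_VERSION}"
    )
    # "search" carries the words to look for; "top" caps the number of hits.
    body = json.dumps({"search": text, "top": top}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        method="POST",
        headers={"Content-Type": "application/json", "api-key": api_key},
    )


def search(request):
    """Send the request and return the list of matching documents."""
    with urllib.request.urlopen(request) as resp:
        return json.load(resp)["value"]
```

Calling `search(build_search_request("my-service", "pdf-index", key, "invoice"))` would then return the matching documents, assuming the index was populated by the indexer from the quickstart.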
- Wavel (Copper Contributor)
Last time I checked, though, the system only indexed the first 30k or so of each document. I need the entire PDF (10 KB to 2 MB) searchable. These are all text PDFs, by the way. No images.
- Luis Cabrera-Cordon (Former Employee)
Wavel, you should be able to index all of the content of the document.
Note that different tiers have different limits on the number of characters pushed into the index. See "Service limits for tiers and skus - Azure Cognitive Search | Microsoft Docs". That limit applies to the extracted characters, not to the size of the actual PDF.
If you have an S2 tier, for instance, 8 MB of characters get indexed, but it would take a pretty big PDF to generate 8 MB of characters (about 2,700 pages full of characters).
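To make that arithmetic concrete, and to illustrate the earlier question about splitting outliers, here is a rough Python sketch. It treats the 8 MB figure as roughly 8 million characters, which is an assumption; check the service limits page for your tier. The helper names are made up for illustration.

```python
# Assumption: the "8 MB of characters" quoted for S2 is about 8 million characters.
S2_CHAR_LIMIT = 8_000_000


def chars_per_page(limit_chars, pages):
    """Average characters per page implied by a character limit and a page count."""
    return limit_chars / pages


def fits_limit(text, limit_chars=S2_CHAR_LIMIT):
    """True if the extracted text would be indexed in full under the limit."""
    return len(text) <= limit_chars


def split_text(text, max_chars):
    """Split extracted text into consecutive chunks of at most max_chars characters."""
    if max_chars <= 0:
        raise ValueError("max_chars must be positive")
    return [text[i:i + max_chars] for i in range(0, len(text), max_chars)]


# 8,000,000 characters over 2,700 pages is just under 3,000 characters per page.
print(round(chars_per_page(S2_CHAR_LIMIT, 2700)))
```

A text-only PDF of 10 KB to 2 MB would normally extract to far fewer characters than that limit. For a rare outlier that exceeds it, each chunk from `split_text` could be indexed as its own document, though this naive split can cut a word at a chunk boundary.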