Mar 10 2021 09:05 AM
Is this the appropriate product to use if I want to create a word search index for thousands of PDF files and then query it from my ASP.NET application?
Mar 10 2021 09:10 AM
Mar 10 2021 09:12 AM
Mar 10 2021 09:19 AM
Mar 10 2021 09:26 AM
@Wavel You should be able to index all of the content of the document.
Note that different tiers have different limits on the number of characters pushed into the index. See Service limits for tiers and skus - Azure Cognitive Search | Microsoft Docs. That limit applies to the extracted characters, not to the size of the actual PDF.
If you have an S2 tier, for instance, 8 MB of characters get indexed, but it would take a pretty big PDF to generate 8 MB of characters (about 2,700 pages full of characters).
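The page estimate above can be reproduced with a quick back-of-the-envelope calculation. This is only a sketch: it assumes roughly 3,000 characters per full page (a figure implied by 8 MB over about 2,700 pages, not an official number), treats 1 MB as 1,000,000 characters, and uses the tier limits quoted in this thread rather than values read from the service. The function name is made up for illustration.

```python
# Rough estimate of how many full pages fit under a tier's per-document
# character limit. All figures are assumptions taken from this thread.

CHARS_PER_PAGE = 3_000  # assumed average for a page full of text

# Per-document limits as quoted in this thread (1 MB taken as 1,000,000 chars).
TIER_CHAR_LIMITS = {
    "S1": 4_000_000,
    "S2": 8_000_000,
}


def pages_within_limit(tier: str, chars_per_page: int = CHARS_PER_PAGE) -> int:
    """Return the approximate number of full pages that fit in the tier's limit."""
    if chars_per_page <= 0:
        raise ValueError("chars_per_page must be positive")
    return TIER_CHAR_LIMITS[tier] // chars_per_page


if __name__ == "__main__":
    for tier in TIER_CHAR_LIMITS:
        print(tier, pages_within_limit(tier))
```

Pages with sparse text or many images produce far fewer characters, so real documents will usually fit more pages than this estimate suggests.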
Mar 10 2021 09:30 AM
Azure Cognitive Search should work for this.
If all your content is in SharePoint, though, I would first check whether SharePoint search meets your needs.
If you need the additional flexibility that Azure Cognitive Search provides, you could either use the new SharePoint indexer (in preview: Configure a SharePoint Online indexer (preview) - Azure Cognitive Search | Microsoft Docs) or simply copy all your files to blob storage and use the blob storage indexer (Search over Azure Blob storage content - Azure Cognitive Search | Microsoft Docs).
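For the blob storage route, the setup roughly comes down to two REST calls: one that registers the blob container as a data source and one that creates an indexer pointing at an existing index. The sketch below only builds the request URLs and JSON bodies, it does not send anything. The API version shown is an assumption (check the linked docs for the current one), and the service, container, index, and resource names are all hypothetical placeholders.

```python
import json

API_VERSION = "2020-06-30"  # assumed REST API version; verify against the docs


def endpoint(service: str, resource: str) -> str:
    """Build the URL for a top-level resource collection on a search service."""
    return f"https://{service}.search.windows.net/{resource}?api-version={API_VERSION}"


def blob_data_source(name: str, connection_string: str, container: str) -> dict:
    """Body for POST /datasources: points the service at a blob container."""
    return {
        "name": name,
        "type": "azureblob",
        "credentials": {"connectionString": connection_string},
        "container": {"name": container},
    }


def blob_indexer(name: str, data_source: str, target_index: str) -> dict:
    """Body for POST /indexers: crawls the data source into an existing index."""
    return {
        "name": name,
        "dataSourceName": data_source,
        "targetIndexName": target_index,
    }


if __name__ == "__main__":
    # Hypothetical names; the connection string is deliberately a placeholder.
    source = blob_data_source("pdf-blobs", "<storage-connection-string>", "pdfs")
    indexer = blob_indexer("pdf-indexer", "pdf-blobs", "pdf-index")
    print(endpoint("my-service", "datasources"))
    print(json.dumps(source, indent=2))
    print(endpoint("my-service", "indexers"))
    print(json.dumps(indexer, indent=2))
```

Each request would also need an `api-key` header carrying an admin key, and the target index has to exist before the indexer runs. From ASP.NET, the same bodies could be sent with any HTTP client or through the official SDK.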
I hope this was helpful,
Luis Cabrera, Azure Cognitive Search team.
Mar 10 2021 09:41 AM
Mar 10 2021 09:58 AM - edited Mar 10 2021 10:47 AM
@Wavel, I am not sure what your budget is, but here is an idea: use S1 (where the limit will be 4 MB of text, about 1,300 pages per document). That will only cost $250 per month.
For the bigger documents, it may not be worth paying 4X (I imagine you probably have a few outliers that are bigger than 1300 pages). In that case, maybe just take the first 1000 pages of content or so. That may be the best bang for the buck given your need...
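The "first 1000 pages or so" idea could be applied as a pre-processing step before the files are uploaded. Here is a minimal sketch, assuming the text of each page has already been extracted into a list of strings by some PDF library (not shown). The function name, the 1,000-page cap, and the 4,000,000-character budget are illustrative assumptions based on the S1 figures in this thread.

```python
from typing import List


def truncate_pages(
    pages: List[str],
    max_pages: int = 1000,
    max_chars: int = 4_000_000,
) -> List[str]:
    """Keep leading pages until either the page cap or character budget is hit."""
    kept: List[str] = []
    total_chars = 0
    for text in pages[:max_pages]:
        # Stop before the page that would push the total over the budget.
        if total_chars + len(text) > max_chars:
            break
        kept.append(text)
        total_chars += len(text)
    return kept


if __name__ == "__main__":
    # A synthetic 1,500-page outlier with 3,000 characters on every page.
    outlier = ["x" * 3000] * 1500
    print(len(truncate_pages(outlier)))
```

Only whole pages are kept, so the result always ends on a page boundary rather than cutting a page in half.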
Mar 10 2021 10:15 AM
Dec 20 2021 11:01 PM - edited Dec 20 2021 11:02 PM
@Wavel @Luis Cabrera-Cordon
Is it possible to split those outliers into smaller ones and index them?