Hi @pablocastro, this is an inspiring demo. Like WJK-DEV, I would like to use URLs as data sources, specifically from the company's internal Confluence space, but also other enterprise sources like SharePoint. As I understand it, I would need to
- regularly crawl those pages that are accessible to everyone.
- cut them up into smaller chunks, similar to what prepdocs.py does.
- store those chunks in Azure with a field equivalent to `sourcepage` that links to the Confluence article.
- create an index on those Confluence chunks (or use a suitable existing index).
- The demo app's ask and chat functionality should then work more or less out of the box, though I'll need to make citations open external links.
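For reference, the chunking/indexing step I have in mind would look roughly like this — a minimal sketch in the spirit of prepdocs.py, where the chunk size, overlap, `id` scheme, and every field name other than `sourcepage` are my own assumptions rather than the demo's actual code:

```python
from typing import List


def chunk_text(text: str, max_chars: int = 1000, overlap: int = 100) -> List[str]:
    """Split page text into chunks of at most max_chars, overlapping by `overlap`
    characters so context isn't lost at chunk boundaries."""
    chunks = []
    start = 0
    while start < len(text):
        end = min(start + max_chars, len(text))
        chunks.append(text[start:end])
        if end == len(text):
            break
        start = end - overlap
    return chunks


def to_index_documents(page_url: str, title: str, text: str) -> List[dict]:
    """Build index documents with a sourcepage-style field that links
    each chunk back to the originating Confluence article."""
    return [
        {
            "id": f"{title}-{i}",          # assumed id scheme
            "content": chunk,
            "sourcepage": page_url,        # citation target: the Confluence URL
        }
        for i, chunk in enumerate(chunk_text(text))
    ]
```

The idea being that a scheduled crawler fetches each accessible page, runs it through something like the above, and upserts the resulting documents into the index — does that match the intent of prepdocs.py?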
Is the above correct?
I have a few other questions:
- How can I enable access-based search? That is, if users A and B have access to different sets of documents, search should only retrieve results from each user's respective set.
- Can you refer me to any sample code that regularly crawls external data sources and cuts them into chunks like your demo app does?
- Would it be practical to store the document chunks in a non-Azure data store, such as Amazon S3, or is Azure Blob Storage adding some magic here?
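On the access question, my current understanding is that Azure Cognitive Search supports security trimming by storing allowed group IDs in a filterable collection field and applying an OData `search.in` filter at query time. A minimal sketch of building such a filter (the `group_ids` field name is my assumption; the filter syntax follows the documented security-trimming pattern) — is this the right direction?

```python
from typing import List


def security_filter(user_group_ids: List[str]) -> str:
    """Build an OData filter restricting results to documents whose
    (assumed) `group_ids` collection field intersects the user's groups."""
    groups = ", ".join(user_group_ids)
    return f"group_ids/any(g: search.in(g, '{groups}'))"
```

The resulting string would then be passed as the `filter` parameter on each search request, with the user's group memberships resolved from the identity provider at request time.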