RAG / Vector Database best practices for Copilot Studio


I have been working with Microsoft Gen AI LLM tools (Azure OpenAI Studio and Copilot Studio) to build a custom 'agent' that answers questions about a set of internal company documents. RAG seems like the best approach, and fine-tuning seems like overkill.

To support RAG with a vector database, I would like to understand best practices. It isn't clear to me whether manually uploading files to a copilot within Copilot Studio performs effective preprocessing of the documents (e.g. tables) and chunking. It is also unclear whether it adds embeddings for words that are not in the pre-trained LLM's embedding vocabulary.
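To make the chunking part of the question concrete, here is a minimal sketch of the kind of fixed-size chunking with overlap I have in mind. It is plain Python and only an illustration: the function name `chunk_text` and the `chunk_size` and `overlap` parameters are my own assumptions, and I do not know whether Copilot Studio does anything like this internally.

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into character chunks of at most chunk_size,
    where each chunk repeats the last `overlap` characters of the previous one.

    Hypothetical helper for illustration; not a Copilot Studio API.
    """
    if chunk_size <= 0 or not 0 <= overlap < chunk_size:
        raise ValueError("need chunk_size > 0 and 0 <= overlap < chunk_size")

    step = chunk_size - overlap  # how far the window advances each time
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        # Stop once the current window has reached the end of the text,
        # so no trailing chunk made only of overlap is emitted.
        if start + chunk_size >= len(text):
            break
    return chunks
```

A character-based split like this ignores tables and sentence boundaries, which is exactly the kind of preprocessing I am unsure about.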

I am looking for best practices for automating ongoing updates (add, update, delete) to the RAG content, across multiple customized LLMs that each have a different set of documents.
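As a rough sketch of what I mean by automating add/update/delete, the snippet below compares the current documents against the content hashes recorded at the last indexing run and works out which documents need each operation. It uses only the Python standard library. The names `plan_sync`, `current_docs`, and `indexed_hashes` are hypothetical, and the step that actually writes to a vector database is deliberately left out because it depends on which store is used.

```python
import hashlib


def content_hash(data: bytes) -> str:
    """Return a stable SHA-256 fingerprint of a document's bytes."""
    return hashlib.sha256(data).hexdigest()


def plan_sync(
    current_docs: dict[str, bytes],
    indexed_hashes: dict[str, str],
) -> dict[str, list[str]]:
    """Decide which document IDs to add, update, or delete.

    current_docs:   document ID -> current content (hypothetical input)
    indexed_hashes: document ID -> hash stored at the last indexing run
    """
    current_hashes = {doc_id: content_hash(body) for doc_id, body in current_docs.items()}

    # New documents: present now, never indexed before.
    to_add = sorted(d for d in current_hashes if d not in indexed_hashes)
    # Changed documents: indexed before, but the content hash differs.
    to_update = sorted(
        d for d in current_hashes
        if d in indexed_hashes and indexed_hashes[d] != current_hashes[d]
    )
    # Removed documents: indexed before, no longer present.
    to_delete = sorted(d for d in indexed_hashes if d not in current_hashes)

    return {"add": to_add, "update": to_update, "delete": to_delete}
```

Each customized LLM would keep its own `indexed_hashes` record, so the same routine could be run per document set.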

It seems that leveraging open-source technology like LangChain might be a way to achieve consistent results for LLMs whose RAG content is updated on a regular basis. Is that advised? Or are there Microsoft tools that might be better for automating content updates?

Also, there is a choice of which vector database to use: PostgreSQL, Cosmos DB (MongoDB), etc. Which are supported and recommended for Copilot Studio?

Thanks in advance for any guidance!
