Thanks for the reply, markremmey!
I have a few follow-up questions:
Extra Context: We are currently implementing our solution in Semantic Kernel.
1. Regarding RAG Approaches:
How accurate can Retrieval-Augmented Generation (RAG) get for this use case? Several existing works suggest the accuracy still falls short in practice. Do you have any insights on improving the accuracy of RAG?
Here are a couple of references for context:
https://www.cidrdb.org/cidr2024/papers/p74-floratou.pdf
https://haystack.deepset.ai/blog/business-intelligence-sql-queries-llm
I am curious whether applying pre-processing or post-processing to the RAG context would yield better results, for example by adding more metadata such as column descriptions (a rough sketch of what I have in mind is below). What is your take on this?
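To make the idea concrete, here is a minimal sketch of the kind of pre-processing I mean, not tied to Semantic Kernel. The names `TABLE_METADATA` and `describe_table` are hypothetical placeholders for however our schema catalog ends up being stored; the point is just that each retrievable chunk carries the table description plus per-column descriptions before it is embedded.

```python
# Hypothetical schema catalog; in our system this would come from the
# database metadata / a curated data dictionary, not be hard-coded.
TABLE_METADATA = {
    "orders": {
        "description": "One row per customer order.",
        "columns": {
            "order_id": "Primary key of the order.",
            "customer_id": "Foreign key to customers.customer_id.",
            "order_total": "Total order value in USD.",
        },
    },
}

def describe_table(table_name: str) -> str:
    """Flatten the table and column descriptions into one retrievable chunk."""
    meta = TABLE_METADATA[table_name]
    lines = [f"Table {table_name}: {meta['description']}"]
    for column, description in meta["columns"].items():
        lines.append(f"  - {column}: {description}")
    return "\n".join(lines)

# Each enriched chunk would then be embedded and stored in the vector
# store that the RAG step queries.
print(describe_table("orders"))
```

The hope is that richer chunks give the retriever more to match against, so the generator sees the right tables and columns more often.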
2. LLM Agent schema selector:
Some approaches mention using schema information in step 2. However, passing all of the schema information into an LLM to select the top tables can quickly run into token limits.
I am trying to replicate the selector described in https://arxiv.org/abs/2312.11242. However, even a single table can have a few hundred columns, each with its own description, which makes it easy to exceed the model's context window.
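As a workaround I am experimenting with batching: send compact per-table summaries in chunks that fit the context window, collect candidate tables from each chunk, then run a final selection pass over the merged candidates. Below is a rough sketch of the batching step only; `count_tokens` is a placeholder (in practice it would be the model's tokenizer), the summaries are dummies, and the per-batch LLM call is not shown.

```python
from typing import Callable

def batch_schemas(
    table_summaries: list[str],
    count_tokens: Callable[[str], int],
    max_tokens_per_batch: int,
) -> list[list[str]]:
    """Greedily pack per-table summaries into batches under a token budget."""
    batches: list[list[str]] = []
    current: list[str] = []
    current_tokens = 0
    for summary in table_summaries:
        tokens = count_tokens(summary)
        # Start a new batch once adding this summary would exceed the budget.
        if current and current_tokens + tokens > max_tokens_per_batch:
            batches.append(current)
            current, current_tokens = [], 0
        current.append(summary)
        current_tokens += tokens
    if current:
        batches.append(current)
    return batches

# Example with a naive whitespace token count; each batch would then be sent
# to the LLM to shortlist tables, and the shortlists merged in a final pass.
summaries = [f"Table t{i}: ... column summaries ..." for i in range(50)]
batches = batch_schemas(summaries, lambda s: len(s.split()), max_tokens_per_batch=200)
print(len(batches), "batches")
```

Is something along these lines reasonable, or is there a better way to keep the selector within token limits?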
Looking forward to your insights!