When we’re programming user-facing experiences, we want to feel confident that we’re creating a functional user experience - not a broken one! How do we do that? We write tests: unit tests, integration tests, smoke tests, accessibility tests, load tests, property-based tests. We can’t automate every form of testing, so we test what we can and hire humans to audit what we can’t.
But when we’re building RAG chat apps on top of LLMs, we need to introduce an entirely new form of testing to give us confidence that our LLM responses are coherent, grounded, and well-formed.
We call this form of testing “evaluation”, and we can now automate it with the help of the most powerful LLM in town: GPT-4.
The general approach is: generate ground truth question/answer pairs from your documents, run your chat app against those questions, have GPT-4 rate the app's answers against the ground truth, and compare the resulting metrics across runs.
A team of ML experts at Azure has put together an SDK for running evaluations on chat apps, in the azure-ai-generative Python package. The key functions are:
QADataGenerator.generate(text, qa_type, num_questions): Pass in a document, and it will use a configured GPT-4 model to generate multiple Q/A pairs based on it.

evaluate(target, data, data_mapping, metrics_list, ...): Point this function at a chat app function and ground truth data, configure which metrics you're interested in, and it will ask GPT-4 to rate the answers.
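To make that concrete, here is a minimal sketch of how those two functions could be wired together. The import paths, model_config keys, and data_mapping entries are my assumptions about the SDK surface and may differ across versions of azure-ai-generative, so treat this as illustrative rather than copy-paste ready.

```python
# Sketch only: import paths, config keys, and mapping names are assumptions
# and may differ between versions of the azure-ai-generative package.
from azure.ai.generative.synthetic.qa import QADataGenerator, QAType
from azure.ai.generative.evaluate import evaluate

# Connection details for the GPT-4 deployment that generates questions
# and grades answers (endpoint/key or AAD credentials also go here).
model_config = {
    "deployment": "gpt-4",
    "model": "gpt-4",
}

# 1) Generate Q/A pairs from a source document.
qa_generator = QADataGenerator(model_config=model_config)
with open("docs/benefits_overview.md") as f:  # hypothetical source document
    source_text = f.read()
qa_result = qa_generator.generate(
    text=source_text,
    qa_type=QAType.LONG_ANSWER,
    num_questions=5,
)

# 2) Evaluate a chat app function against ground truth data.
def my_chat_app(question: str) -> dict:
    """Hypothetical wrapper that sends the question to your RAG app
    and returns its answer (plus the retrieved context)."""
    ...

eval_results = evaluate(
    target=my_chat_app,
    data="example_input/qa.jsonl",  # ground truth questions and answers
    data_mapping={
        # Illustrative mapping of dataset fields to evaluation inputs.
        "questions": "question",
        "ground_truth": "truth",
    },
    metrics_list=["gpt_groundedness", "gpt_relevance", "gpt_coherence"],
    model_config=model_config,  # GPT-4 model used as the grader
)
```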
Since I've been spending a lot of time maintaining our most popular RAG chat app solution, I wanted to make it easy to test changes to that app's base configuration - but also make it easy for any developers to test changes to their own RAG chat apps. So I've put together ai-rag-chat-evaluator, a repository with command-line tools for generating data, evaluating apps (local or deployed), and reviewing the results.
For example, after configuring an OpenAI connection and Azure AI Search connection, generate data with this command:
python3 -m scripts generate --output=example_input/qa.jsonl --numquestions=200
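Each line of the generated qa.jsonl file pairs a question with its ground-truth answer. The exact field names come from the tool, but a line looks roughly like this (contents are illustrative):

```
{"question": "What is included in the Northwind Health Plus plan?", "truth": "Northwind Health Plus covers ..."}
```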
To run an evaluation against ground truth data, run this command:
python3 -m scripts evaluate --config=example_config.json
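The config file is where you point the evaluator at your ground truth data, your local or deployed app, and the metrics you want. The keys below are illustrative, loosely based on the repository's example_config.json; the README documents the exact schema:

```
{
    "testdata_path": "example_input/qa.jsonl",
    "results_dir": "example_results/baseline",
    "requested_metrics": ["gpt_groundedness", "gpt_relevance", "gpt_coherence", "answer_length"],
    "target_url": "http://localhost:50505/chat",
    "target_parameters": {}
}
```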
You'll then be able to view a summary of results with the summary tool.
You'll also be able to easily compare answers across runs with the compare tool.
For more details on using the project, check the README and please file an issue with any questions, concerns, or bug reports.
This evaluation process isn’t like other automated tests that a CI runs on every commit; it is too time-intensive and costly for that.
Instead, RAG development teams should run an evaluation flow when something has changed about the RAG flow itself, like the system message, LLM parameters, or search parameters.
Here is one possible workflow: make a change to the RAG flow (say, a tweak to the system message or the search parameters), run an evaluation against the ground truth data, compare the new metrics to the previous run, and keep the change only if the metrics hold steady or improve.
We'd love to hear how RAG chat app development teams are running their evaluation flows, to see how we can help in providing reusable tools for all of you. Please let us know!