Content authored by: Arpita Parmar
If you've been delving into the potential of large language models (LLMs) for search and retrieval tasks, you've probably encountered Retrieval Augmented Generation (RAG) as a valuable technique. RAG enriches LLM-generated responses by integrating relevant contextual information, particularly when connected to private data sources. This integration empowers the model to deliver more accurate and contextually rich responses.
Evaluating RAG poses several challenges and requires a multifaceted approach. Assessing both response quality and retrieval effectiveness is essential to ensuring optimal performance.
Traditional evaluation metrics for RAG applications, while useful, have certain limitations that can impact their effectiveness in accurately assessing RAG performance. Some of these limitations include:
Overall, while traditional evaluation metrics offer valuable insights into RAG performance, they are not without their limitations. Incorporating user feedback into the evaluation process adds another layer of insight, bridging the gap between quantitative metrics and qualitative user experiences. Therefore, adopting a multifaceted approach that considers retrieval quality, relevance of response to retrieval, user intent, ground truth availability, query type diversity, and user feedback is essential for a comprehensive and accurate evaluation of RAG systems.
When evaluating RAG applications, it is crucial to accurately assess retrieval effectiveness and to tune the relevance of retrieval. Since the retrieved data is key to a successful implementation of the RAG pattern, integrating Azure AI Search as the retrieval system can significantly enhance the quality of your results. While Azure AI Search offers keyword (full-text), vector, and hybrid search capabilities, this post focuses on hybrid search. The hybrid search approach can be particularly beneficial in scenarios where retrieval performance is inconsistent or insufficient. By combining keyword and vector-based search techniques, hybrid search can improve the accuracy and completeness of the retrieved documents, which in turn can positively impact the relevance of the generated responses.
The hybrid search process in Azure AI Search involves the following steps:
When tuning for retrieval relevance and quality of retrieval, there are several strategies to consider:
By unifying these retrieval techniques and configurations, hybrid search can handle queries more effectively compared to using just keywords or vectors alone. It excels at finding relevant documents even when users query with concepts, abbreviations or phraseology different from the documents.
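To make this concrete, below is a minimal sketch of a hybrid query using the azure-search-documents Python SDK. The endpoint, index name, field names, and the externally supplied query embedding are illustrative assumptions; substitute your own index schema and embedding model.

```python
# Minimal hybrid search sketch with the azure-search-documents SDK.
# Assumptions: an index named "docs-index" with a text field "content" and a
# vector field "contentVector"; the query embedding is produced elsewhere
# (e.g., by an Azure OpenAI embedding deployment).
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from azure.search.documents.models import VectorizedQuery

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="docs-index",
    credential=AzureKeyCredential("<your-search-api-key>"),
)

def hybrid_search(query: str, query_embedding: list[float], top: int = 5):
    # Supplying both search_text (keyword/full-text) and vector_queries (vector)
    # makes this a hybrid query; Azure AI Search fuses the two result sets.
    vector_query = VectorizedQuery(
        vector=query_embedding,
        k_nearest_neighbors=top,
        fields="contentVector",
    )
    results = search_client.search(
        search_text=query,
        vector_queries=[vector_query],
        top=top,
    )
    return [doc["content"] for doc in results]
```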
A recent Microsoft study highlights that hybrid search with semantic re-ranking outperforms traditional vector search methods like dense and sparse passage retrieval across diverse question-answering tasks.
According to this study, key advantages of hybrid search with semantic re-ranking include:
Now that we've covered retrieval tuning, let's turn our attention to evaluating generation and streamlining the RAG pipeline evaluation process. Azure AI prompt flow offers a comprehensive framework to streamline RAG evaluation.
Prompt flow streamlines RAG evaluation with a multifaceted approach: it efficiently compares prompt variations, integrates user feedback, and supports both traditional metrics and AI-assisted metrics that don't require ground truth data. It ensures tailored responses for diverse queries, simplifying retrieval and response evaluation while providing comprehensive insights for improving RAG performance.
Both Azure AI Search and Azure AI prompt flow are available in Azure AI Studio, a unified platform for responsibly developing and deploying generative AI applications. This one-stop-shop platform enables developers to explore the latest APIs and models, access comprehensive tooling that supports the generative AI development lifecycle, design applications responsibly, and deploy models, flows, and apps at scale with continuous monitoring.
With Azure AI Search, developers can connect models to their protected data for advanced fine-tuning and contextually relevant retrieval augmented generation. With Azure AI prompt flow, developers can orchestrate AI workflows with prompt orchestration, interactive visual flows, and code-first experiences to build sophisticated and customized enterprise chat applications.
Here is a video of how to build and deploy an enterprise chat application with Azure AI Studio.
Evaluating RAG applications in prompt flow revolves around three key aspects:
Below is the table of evaluation metrics for RAG applications in Prompt flow.
| Metric Type | AI Assisted / Ground Truth Based | Metric | Description |
| --- | --- | --- | --- |
| Generation | AI Assisted | Groundedness | Measures how well the model's generated answers align with information from the source data (user-defined context). |
| Generation | AI Assisted | Relevance | Measures the extent to which the model's generated responses are pertinent and directly related to the given questions. |
| Retrieval | AI Assisted | Retrieval Score | Measures the extent to which the model's retrieved documents are pertinent and directly related to the given questions. |
| Generation | Ground Truth Based | Accuracy, Precision, Recall, F1 score | Compares the RAG system's responses to a set of predefined, correct answers; the F1 score measures the ratio of shared words between the model's generation and the ground truth answers. |
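For the ground-truth-based row above, here is a rough illustration of how a word-overlap F1 score can be computed. This is a simplified sketch for intuition only, not the exact formula prompt flow uses internally.

```python
# Simplified word-overlap F1 between a generated answer and a ground truth
# answer, for intuition only; prompt flow's built-in metric may differ.
def word_overlap_f1(generated: str, ground_truth: str) -> float:
    gen_tokens = set(generated.lower().split())
    gt_tokens = set(ground_truth.lower().split())
    common = gen_tokens & gt_tokens
    if not gen_tokens or not gt_tokens or not common:
        return 0.0
    precision = len(common) / len(gen_tokens)
    recall = len(common) / len(gt_tokens)
    return 2 * precision * recall / (precision + recall)

print(word_overlap_f1(
    "Hybrid search combines keyword and vector retrieval",
    "Hybrid search combines keyword search with vector search",
))
```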
There are three AI-assisted metrics available in prompt flow that do not require ground truth. Traditional, ground-truth-based metrics are useful when testing RAG applications during development, but AI-assisted metrics offer enhanced capabilities for evaluating responses, especially when ground truth data is unavailable. These metrics provide valuable insight into the performance of the RAG application in real-world scenarios, enabling a more comprehensive assessment of user interactions and system behavior. The metrics are:
Groundedness, relevance, and the retrieval score, along with prompt variant testing in prompt flow, collectively provide insight into the performance of RAG applications. Together they enable refinement of RAG applications, addressing challenges such as information overload, incorrect responses, and insufficient retrieval, and ensuring more accurate responses throughout the end-to-end evaluation process.
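To illustrate how an AI-assisted metric can work without ground truth, here is a conceptual LLM-as-judge sketch for groundedness. The prompt wording and the 1-5 scale are illustrative assumptions, not prompt flow's exact internal prompt, and the call to the judge model is omitted.

```python
# Conceptual sketch of an AI-assisted groundedness check (LLM-as-judge).
# Illustrative only: prompt flow's built-in metric uses its own prompt and scale.
GROUNDEDNESS_JUDGE_PROMPT = """
You are grading whether an answer is grounded in the given context.
Rate from 1 (not grounded at all) to 5 (fully supported by the context).

Context:
{context}

Answer:
{answer}

Return only the number.
"""

def build_groundedness_prompt(context: str, answer: str) -> str:
    # The returned string would be sent to a judge model (e.g., an Azure OpenAI
    # chat deployment); invoking the model is out of scope for this sketch.
    return GROUNDEDNESS_JUDGE_PROMPT.format(context=context, answer=answer)
```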
Now, let's explore three potential scenarios for evaluating RAG workflows and how prompt flow and Azure AI Search help in evaluating them.
This scenario entails the seamless integration of relevant contextual information with accurate and appropriate responses generated by the RAG application. We have both good retrieval and a good response.
In this scenario, all three metrics perform optimally. Groundedness ensures factual accuracy and verifiability, relevance ensures the appropriateness of the answer to the query, and the retrieval score reflects the quality and relevance of the retrieved documents.
Here, despite the retrieval of relevant documents, the response from the LLM is inaccurate. Groundedness may suffer if the response lacks verifiability against the provided sources. Relevance may also be compromised if the response does not adequately address the user's query. The retrieval score might indicate successful document retrieval but fails to capture the inadequacy of the response.
To address this challenge, Azure AI Search retrieval tuning can be leveraged to enhance the retrieval process, ensuring that the most relevant and accurate documents are retrieved. By fine-tuning the search parameters discussed above in section “Enhancing Retrieval: Quality of retrieval & relevance tuning,” Azure AI Search can significantly improve the retrieval score, thereby increasing the likelihood of obtaining relevant documents for the given query.
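For instance, building on the hybrid query sketch above, semantic re-ranking and the number of returned passages can be adjusted as shown below. The semantic configuration name is a placeholder and must correspond to one defined on your index.

```python
# Sketch of tuning the hybrid query from the earlier example: enable semantic
# re-ranking and adjust how many passages feed the prompt. "my-semantic-config"
# is a placeholder for a semantic configuration defined on your index.
results = search_client.search(
    search_text=query,
    vector_queries=[vector_query],
    query_type="semantic",
    semantic_configuration_name="my-semantic-config",
    top=3,  # retrieve fewer, higher-quality passages to reduce noise in the prompt
)
```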
Additionally, you can refine the LLM's prompt by incorporating a conditional statement within the prompt template, such as "if relevant content is unavailable and no conclusive solution is found, respond with 'unknown'." Leveraging prompt flow, which allows for the evaluation and comparison of different prompt variations, you can assess the merit of various prompts and select the most effective one for handling such situations. This approach ensures accuracy and honesty in the model's responses, acknowledging its limitations and avoiding the dissemination of inaccurate information.
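As an example of such a conditional instruction, here is a hypothetical prompt variant of the kind you might test in prompt flow, written as a Jinja-style template string. The input names ({{context}}, {{question}}) are assumptions and should match your flow's inputs.

```python
# Hypothetical prompt variant (Jinja-style, as prompt flow templates are) that
# instructs the model to answer "unknown" when the retrieved context is not
# conclusive. The input names context/question are assumptions.
PROMPT_VARIANT = """
You are an assistant that answers questions strictly from the provided context.

Context:
{{context}}

Question:
{{question}}

If relevant content is unavailable and no conclusive answer can be found,
respond with "unknown" instead of guessing.
"""
```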
In this scenario, the retrieval of relevant documents is followed by an inaccurate response from the LLM. Groundedness may be maintained if the responses remain verifiable against the provided sources. However, relevance is compromised as the response fails to address the user's query accurately. The retrieval score might indicate successful document retrieval, but the flawed response highlights the limitations of the LLM.
Evaluation in this scenario involves several key steps facilitated by Azure AI prompt flow and Azure AI Search:
In this section, let’s walk through the step-by-step process of testing RAG with prompt variants in prompt flow, using metrics such as groundedness, relevance, and retrieval score.
Prerequisite: Build RAG using Azure Machine Learning prompt flow.
1. Prepare Test Data: Ideally, you should prepare a test dataset of 50-100 samples, but for this article we will prepare a test dataset with a few samples (a minimal sketch of such a dataset appears after these steps). Save it as a CSV file.
2. Add test data to Azure AI Studio: In your AI Studio project, under Components, select Data -> New data.
3. Select Upload files/folders and upload the test data from a local drive. Click on Next, provide a name to your data location and click on Create.
4. Once the test data is uploaded you can see its details.
5. Evaluate the flow: Under Tools -> Evaluation, click on New evaluation. Choose Conversation with context and select a flow you want to evaluate. Here we are testing two variants of prompt: Variant_0 and Variant_1. Click on Next.
6. Configure the test data. Click on Next.
7. Under Select Metrics, the RAG metrics are automatically selected based on the scenario you have chosen (see the metrics table above for more details). Choose your Azure OpenAI Service instance and model and click on Next.
8. Review and finish. Click on Submit.
9. Once the evaluation is complete it will be displayed under Evaluations.
10. Check the results by clicking on the evaluation. You can compare the two variants of prompts by comparing their metrics to see which prompt variant is performing better.
11. You can check the result of individual prompt variant evaluation metrics under the Output tab -> Metrics dashboard.
12. Also under the Output tab, you can see a detailed view of the metrics under Detailed metrics result.
13. Under the Trace tab, you can trace how many tokens were generated and the duration for each test question.
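As referenced in step 1, here is a minimal sketch of what the test dataset CSV might look like. The column names (question, ground_truth) are assumptions; use whatever input names your chat flow and the chosen evaluation flow actually expect, and expand to 50-100 rows for a meaningful evaluation.

```python
# Minimal sketch of a test dataset for the evaluation run. Column names are
# assumptions; adjust them to match the inputs your flow expects.
import csv

samples = [
    {
        "question": "What search modes does Azure AI Search support?",
        "ground_truth": "Keyword (full-text), vector, and hybrid search.",
    },
    {
        "question": "What does the groundedness metric measure?",
        "ground_truth": "How well generated answers align with the source context.",
    },
]

with open("rag_test_data.csv", "w", newline="", encoding="utf-8") as f:
    writer = csv.DictWriter(f, fieldnames=["question", "ground_truth"])
    writer.writeheader()
    writer.writerows(samples)
```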
Conclusion:
The integration of Azure AI Search into the RAG pipeline can significantly improve retrieval scores. This enhancement ensures that retrieved documents are more aligned with user queries, thus enriching the foundation upon which responses are generated. Furthermore, by integrating Azure AI Search and Azure AI prompt flow in Azure AI Studio, developers can test and optimize response generation to improve groundedness and relevance. This approach not only elevates RAG application performance but also fosters more accurate, contextually relevant, and user-centric responses.