This is a bonus post for RAG Time, a 6-part educational series on retrieval-augmented generation (RAG). In this series, we explored topics such as indexing and retrieval techniques for RAG, data ingestion, and storage optimization. The final topic for this series covers agentic RAG and how to use semi-autonomous agents to build a dynamic, self-refining retrieval system.
What we'll cover:
- Overview and definition of agentic RAG
- Example of a single-shot RAG flow
- Two examples of agentic RAG: single-step and multi-step reflection
What is agentic RAG?
An agent is a component of an AI application that leverages generative models to make decisions and execute actions autonomously. Agentic RAG improves the traditional RAG flow by actively interacting with its environment using tools, memory, and secure access to data. Agentic RAG systems also engage in a continuous loop of evaluation and improvement.
Here are three key characteristics of agentic RAG:
- Autonomous evaluation: LLM-based evaluators assess the relevance and factual groundedness of generated answers.
- Iterative improvement: A self-refinement loop identifies and corrects the shortcomings in generated answers.
- Tool calling: An LLM decides which "tool", or action, to take in order to improve overall answer quality, which is especially important for complex or ambiguous queries.
This approach empowers developers to build AI applications that don’t just answer questions but continually enhance their responses for better depth and reliability.
Example: a single-shot RAG flow
Let’s review an example of a single-shot RAG flow, broken up into the following phases:
- Query rewriting: An LLM reformulates the original query for optimal clarity, incorporating context from any past interactions in the session.
- Retrieval: The rewritten query then triggers document retrieval powered by a retrieval system like Azure AI Search. Retrieval can combine keyword, vector, and hybrid search with reranking to return highly relevant results.
- Answer generation: An LLM uses the retrieved documents to generate a response, including citations to the relevant documents.
This answer generation process follows a one-pass, linear approach where the LLM synthesizes the retrieved data into a single, comprehensive response.
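To make this concrete, here is a minimal sketch of a single-shot pipeline in Python, assuming the Azure AI Search and Azure OpenAI Python SDKs. The endpoints, deployment name, index field names, and helper functions are illustrative placeholders rather than the exact code from our sample.

```python
# Minimal single-shot RAG sketch (placeholders throughout; assumes the
# azure-search-documents and openai packages are installed and configured).
from azure.core.credentials import AzureKeyCredential
from azure.search.documents import SearchClient
from openai import AzureOpenAI

DEPLOYMENT = "<your-gpt-4o-deployment>"  # Azure OpenAI chat deployment name

search_client = SearchClient(
    endpoint="https://<your-search-service>.search.windows.net",
    index_name="<your-index>",
    credential=AzureKeyCredential("<search-api-key>"),
)
llm = AzureOpenAI(
    azure_endpoint="https://<your-openai-resource>.openai.azure.com",
    api_key="<azure-openai-api-key>",
    api_version="2024-06-01",
)

def rewrite_query(query: str, history: list[dict]) -> str:
    """Ask the LLM to reformulate the query using any past session context."""
    messages = (
        [{"role": "system", "content": "Rewrite the user's question as a clear, self-contained search query."}]
        + history
        + [{"role": "user", "content": query}]
    )
    response = llm.chat.completions.create(model=DEPLOYMENT, messages=messages)
    return response.choices[0].message.content

def retrieve(query: str, top: int = 5) -> list[str]:
    """Retrieve documents for the rewritten query (simplified to keyword search here;
    the index is assumed to have a 'content' field)."""
    results = search_client.search(search_text=query, top=top)
    return [doc["content"] for doc in results]

def generate_answer(query: str, sources: list[str]) -> str:
    """Synthesize a cited answer from the retrieved sources in a single pass."""
    context = "\n\n".join(sources)
    response = llm.chat.completions.create(
        model=DEPLOYMENT,
        messages=[
            {"role": "system", "content": "Answer using only the sources below and cite them.\n\n" + context},
            {"role": "user", "content": query},
        ],
    )
    return response.choices[0].message.content

def single_shot_rag(query: str, history: list[dict]) -> str:
    """One-pass, linear flow: rewrite, retrieve, generate."""
    rewritten = rewrite_query(query, history)
    sources = retrieve(rewritten)
    return generate_answer(query, sources)
```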
While efficient, the single-shot RAG method is static and may produce low-quality responses, particularly for complex queries.
Example: Agentic RAG
Many complex queries demand answers that evolve beyond single-shot RAG. We’ll walk through two examples of agentic RAG using single-step and multi-step reflection.
Agentic RAG extends single-shot RAG with 4 extra steps:
- Run the single-shot RAG pipeline to get an initial answer.
- Evaluate the answer.
- Reflect on the results to identify any shortcomings.
- Decide whether a new search needs to be performed, against either an internal index or the public web.
- Repeat until the answer is of sufficient quality.
Answer Evaluation
LLMs can be used as evaluators that rate responses on relevance, clarity, coherence, and factual accuracy to ensure each answer meets quality standards. The model examines whether the answer adequately addresses the prompt, confirms that its supporting details match known information, and identifies areas where additional context or corrections might be needed.
This self-evaluation process turns the LLM into an internal critic, ensuring a higher level of consistency and reliability. You can also use a different LLM as the critic to add diversity to your evaluation process. By simulating a judgment process, the model can flag discrepancies or gaps, prompting further iterations that refine the output. The result is a robust response that has undergone an internal quality check, much like an independent review by a seasoned expert.
In our agentic RAG implementation, we use the Azure AI Evaluation SDK to assess the quality of our answer. Specifically, we check the relevance and groundedness of the answer from the traditional RAG flow. If either of these metrics is too low, we move to the next stage of our agentic RAG loop.
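As a rough sketch, that check could use the relevance and groundedness evaluators from the azure-ai-evaluation package, as shown below. The model configuration values are placeholders, exact signatures and result keys can vary by SDK version, and the threshold of 4 matches the one used in the single-step reflection example later on.

```python
# Score the single-shot answer with LLM-based evaluators (azure-ai-evaluation package).
# Placeholders throughout; signatures and result keys may differ by SDK version.
from azure.ai.evaluation import GroundednessEvaluator, RelevanceEvaluator

model_config = {
    "azure_endpoint": "https://<your-openai-resource>.openai.azure.com",
    "api_key": "<azure-openai-api-key>",
    "azure_deployment": "<your-gpt-4o-deployment>",
}

relevance_eval = RelevanceEvaluator(model_config)
groundedness_eval = GroundednessEvaluator(model_config)

def evaluate_answer(query: str, answer: str, sources: list[str]) -> dict:
    """Rate the answer's relevance to the query and groundedness in the sources (1-5)."""
    context = "\n\n".join(sources)
    relevance = relevance_eval(query=query, response=answer)
    groundedness = groundedness_eval(response=answer, context=context)
    return {
        "relevance": relevance["relevance"],
        "groundedness": groundedness["groundedness"],
    }

def needs_reflection(scores: dict, threshold: int = 4) -> bool:
    """If either metric falls below the threshold, continue the agentic RAG loop."""
    return scores["relevance"] < threshold or scores["groundedness"] < threshold
```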
Reflection
After an initial evaluation, we leverage the built-in reasoning abilities of an LLM to reflect on the answer. The LLM examines the answer along with its groundedness and relevance, identifying the specific reasons why the answer scored low.
Three potential decisions come out of the reflection process:
- If the answer is missing information that might come from an internal index, the LLM initiates an internal search with a newly rewritten query. A new answer is generated that incorporates the additional information found in the search.
- If the answer is missing information that might come from a public web search, the LLM uses Bing Grounding to find this information.
- If the answer cannot be improved with more searches, the agentic RAG loop stops. A new answer is generated that acknowledges the information the searches couldn’t find.
The agentic RAG loop continues until the answer is of sufficient quality or too much time has passed.
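One way to implement this decision is to ask the LLM for a structured verdict, as sketched below. The JSON schema and prompt are illustrative assumptions rather than the exact implementation in our sample, and the llm client and DEPLOYMENT name come from the earlier single-shot sketch.

```python
import json

# Hypothetical reflection prompt: ask the model to choose the next action.
REFLECTION_PROMPT = (
    "You are reviewing a RAG answer that scored low on relevance or groundedness. "
    "Given the question, the answer, and its scores, reply with JSON: "
    '{"action": "internal_search" | "web_search" | "stop", '
    '"new_query": "<rewritten search query or empty>", "reason": "<short explanation>"}'
)

def reflect(query: str, answer: str, scores: dict) -> dict:
    """Ask the LLM why the answer scored low and which follow-up action to take."""
    response = llm.chat.completions.create(
        model=DEPLOYMENT,
        response_format={"type": "json_object"},
        messages=[
            {"role": "system", "content": REFLECTION_PROMPT},
            {"role": "user", "content": json.dumps({"question": query, "answer": answer, "scores": scores})},
        ],
    )
    return json.loads(response.choices[0].message.content)
```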
Single-Step Reflection
We can put all the components of agentic RAG together into our first sample implementation: single-step reflection.
- The single-shot RAG flow is run to get a candidate answer.
- The answer is evaluated using relevance and groundedness evaluators.
- If both scores from these evaluators are at least 4, the traditional RAG answer is accepted.
- If either of the scores is below 4, an LLM reflects on why the answer was evaluated poorly. It determines if a follow-up internal search or web search might help improve the quality.
- If a follow-up internal search could improve the answer, the LLM runs the search and regenerates the answer.
- If a follow-up web search could improve the answer, the LLM runs the web search and regenerates the answer.
- If a follow-up search won’t improve the answer, the LLM regenerates the answer, acknowledging that it doesn’t have all the necessary information.
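In code, a simplified orchestration that combines the earlier sketches might look like the following; the web_search helper is a hypothetical stand-in for a public web search such as grounding with Bing.

```python
def web_search(query: str) -> list[str]:
    """Hypothetical placeholder for a public web search (e.g. Bing grounding)."""
    return []

def single_step_reflection(query: str, history: list[dict]) -> str:
    """Single-shot RAG plus one round of evaluate / reflect / follow-up search."""
    rewritten = rewrite_query(query, history)
    sources = retrieve(rewritten)
    answer = generate_answer(query, sources)

    scores = evaluate_answer(query, answer, sources)
    if not needs_reflection(scores):      # both scores are at least 4
        return answer                     # accept the traditional RAG answer

    decision = reflect(query, answer, scores)
    if decision["action"] == "internal_search":
        sources += retrieve(decision["new_query"])    # follow-up internal search
    elif decision["action"] == "web_search":
        sources += web_search(decision["new_query"])  # follow-up web search
    # For "stop", regenerate while acknowledging the missing information.
    return generate_answer(query, sources)
```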
The following diagram illustrates the single-step reflection process:
Multi-Step Reflection
If a single round of reflection is insufficient, we can run multiple rounds of single-step reflection to continuously improve our answer quality. The following diagram shows how single-step reflection can be refined into multi-step reflection by continuously reevaluating the output of the follow-up searches:
It’s important to note that if follow-up searches don’t end up improving the quality of the answer, the agentic RAG loop needs another condition to exit. In our sample, we only allow the agentic RAG loop to run a maximum of 3 iterations before exiting.
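A sketch of the multi-step variant, reusing the hypothetical helpers from the earlier snippets, simply wraps the evaluate, reflect, and follow-up search steps in a bounded loop:

```python
MAX_ITERATIONS = 3  # exit even if follow-up searches stop improving the answer

def multi_step_reflection(query: str, history: list[dict]) -> str:
    """Re-evaluate after every follow-up search, for at most MAX_ITERATIONS rounds."""
    rewritten = rewrite_query(query, history)
    sources = retrieve(rewritten)
    answer = generate_answer(query, sources)

    for _ in range(MAX_ITERATIONS):
        scores = evaluate_answer(query, answer, sources)
        if not needs_reflection(scores):
            break  # the answer is of sufficient quality

        decision = reflect(query, answer, scores)
        if decision["action"] == "internal_search":
            sources += retrieve(decision["new_query"])
        elif decision["action"] == "web_search":
            sources += web_search(decision["new_query"])
        else:
            # No search will help: regenerate once, acknowledging the gap, and stop.
            return generate_answer(query, sources)
        answer = generate_answer(query, sources)

    return answer
```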
Reflection in action
Let’s look at one example query in our sample and compare how it behaves in traditional RAG versus multi-step reflection: “Who ensures the compliance of the Northwind Health Plus plan with state regulations?”
Single-shot RAG gives us this answer:
The compliance of the Northwind Health Plus plan with state regulations is ensured by Northwind Health Plus itself, as it strives to comply with applicable state and federal laws and regulations, including ERISA, the ACA, and the MHPAEA[Northwind_Health_Plus_Benefits_Details.pdf#page=90][Northwind_Standard_Benefits_Details.pdf#page=85]
Unfortunately, it’s not quite correct! When we evaluate this answer, it gets a low groundedness score. The underlying sources mention that external review bodies also play a role in the compliance of the Northwind Health Plus plan (“You can also contact your insurance provider directly or the state insurance commissioner in your state.”), so compliance is not solely the responsibility of Northwind Health Plus. When agentic RAG reflects on the answer, it identifies this issue and suggests a new search query to learn more about external review bodies in the sources (“state government oversight compliance Northwind Health Plus plan”).
Once the internal search is completed and the new answer is generated, we get a higher quality answer:
The compliance of the Northwind Health Plus plan with state regulations is primarily ensured by Northwind Health Plus itself, as it strives to comply with applicable state and federal laws and regulations, including ERISA, the ACA, and the MHPAEA [Northwind_Health_Plus_Benefits_Details.pdf#page=90]. Additionally, employees have the option to contact their state's insurance department for further information about compliance and external review processes if they believe their appeals were not adequately addressed [Northwind_Standard_Benefits_Details.pdf#page=85]
The new answer specifically states that employees have an external review body to submit a complaint to.
Why agentic RAG matters
Imagine an AI assistant that isn’t content with its initial response—a system that actively asks, “Could I do better?” Agentic RAG makes this possible by:
- Engaging in reflective evaluation using Azure AI Evaluations.
- Iteratively enhancing responses until they truly address your query with the required depth and accuracy.
Today’s bonus journey showcased how combining RAG with agentic AI can transform traditional retrieval systems into autonomous, self-refining solutions. As you explore these techniques, remember that agentic RAG isn’t just about getting an answer; it’s about ensuring that the answer is as insightful, accurate, and contextually relevant as possible.
Next Steps
Ready to explore further? Check out the resources in our centralized GitHub repo.
Have questions, thoughts, or want to share how you’re using RAG in your projects? Drop us a comment below or ask your questions in our Discord channel: https://aka.ms/rag-time/discord. Your feedback shapes our future content!