Retrieval-Augmented Generation (RAG) has rapidly become a cornerstone technology for building intelligent AI systems that combine the power of large language models (LLMs) with dynamic access to external knowledge bases. While prototyping RAG models is relatively straightforward, optimizing them for production environments poses unique challenges due to scalability, latency, and robustness requirements. This blog dives deeply into practical and advanced strategies for optimizing RAG systems in production, ensuring they are performant, reliable, and scalable. I will start with the challenges facing current RAG systems, then briefly recap what RAG is (you can skip that part and jump directly to the optimization sections).
Understanding the Challenges of RAG in Production
RAG systems merge document retrieval with language generation by retrieving relevant context chunks from large knowledge bases to guide LLMs in producing accurate and informed responses. Production systems face hurdles such as:
- Managing large, continuously growing knowledge corpora efficiently
- Balancing retrieval accuracy with response latency
- Handling unpredictable and complex query patterns from real users
- Ensuring fresh and updated knowledge bases without downtime
- Scaling LLM inference while maintaining cost and resource efficiency.
Before we dive into solving these challenges, let's take a quick look at a simple framing: a RAG pipeline can be broken into three components.
- Indexing: Organize and store data in a structured format to enable efficient searching.
- Retrieval: Search and fetch relevant data based on a query or input.
- Generation: Create a final response or output using the retrieved data.
Indexing
Indexing consists of three operations: loading, splitting, and ingesting into a vector DB.
import bs4
from langchain_community.document_loaders import WebBaseLoader
# Initialize a web document loader with specific parsing instructions
loader = WebBaseLoader(
web_paths=("https://lilianweng.github.io/posts/2023-06-23-agent/",), # URL of the blog post to load
bs_kwargs=dict(
parse_only=bs4.SoupStrainer(
class_=("post-content", "post-title", "post-header") # Only parse specified HTML classes
)
),
)
# Load the filtered content from the web page into documents
docs = loader.load()
from langchain.text_splitter import RecursiveCharacterTextSplitter
# Create a text splitter to divide text into chunks of 1000 characters with 200-character overlap
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
# Split the loaded documents into smaller chunks
splits = text_splitter.split_documents(docs)
from langchain_community.vectorstores import Chroma
from langchain_openai import OpenAIEmbeddings
# Embed the text chunks and store them in a Chroma vector store for similarity search
vectorstore = Chroma.from_documents(
documents=splits,
embedding=OpenAIEmbeddings() # Use OpenAI's embedding model to convert text into vectors
)
Retrieval
Now let's talk about retrieval. This part consists of retrieving the relevant knowledge the LLM needs to generate a response.
# Create a retriever from the vector store
retriever = vectorstore.as_retriever()
# Retrieve relevant documents for a query
docs = retriever.get_relevant_documents("What is Task Decomposition?")
# Print the content of the first retrieved document
print(docs[0].page_content)
#### OUTPUT ####
Task decomposition can be done (1) by LLM with simple prompting ...
Tree of Thoughts (Yao et al. 2023) extends CoT by exploring multiple ...
Generation
We have our context, but we need an LLM to read it and formulate a human-friendly answer. This is the “Generation” step in RAG.
from langchain import hub
# Pull a pre-made RAG prompt from LangChain Hub
prompt = hub.pull("rlm/rag-prompt")
# printing the prompt
print(prompt)
#### OUTPUT ####
human
You are an assistant for question-answering tasks. Use the following pieces
of retrieved context to answer the question. If you dont know the answer,
just say that you dont know. Use three sentences maximum and keep the
answer concise.
Question: {question}
Context: {context}
Answer:
from langchain_openai import ChatOpenAI
# Initialize the LLM
llm = ChatOpenAI(model_name="gpt-4o", temperature=0)
from langchain_core.output_parsers import StrOutputParser
from langchain_core.runnables import RunnablePassthrough
# Helper function to format retrieved documents
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
# Define the full RAG chain
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
Let’s break down this chain:
- {"context": retriever | format_docs, "question": RunnablePassthrough()}: This part runs in parallel. It sends the user's question to the retriever to get documents, which are then formatted into a single string by format_docs. Simultaneously, RunnablePassthrough passes the original question through unchanged.
- | prompt: The context and question are fed into our prompt template.
- | llm: The formatted prompt is sent to the LLM.
- | StrOutputParser(): This cleans up the LLM's output into a simple string.
- Finally, we invoke the chain with a question to generate the response.
Bingo. We just built our first RAG pipeline with all three phases: indexing, retrieval, and generation. Is the journey over? Absolutely not. Here comes the hardest part: how can we ensure the pipeline we built is good enough to run in production and land among that 5% of successful projects? With this background, I would like to walk through each optimization in detail, with advantages, disadvantages, and code examples. So here we go!
1. Handling Multi-Modal, Complex Documents:
One of the most significant limitations of traditional RAG models is their inability to understand and interpret visual data. In a world where images accompany textual information ubiquitously, this represents a substantial gap in the model’s comprehension abilities. Documents are not just strings of text; they have structure — sections, subsections, paragraphs, and lists — all of which convey semantic importance. Traditional RAG models often overlook this hierarchical structure, potentially missing out on understanding the document’s full meaning.
I wrote a detailed blog on how to implement the right ingestion and retrieval pipeline for multi-modal, complex documents with tables and figures here.
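To give a flavor of what such an ingestion pipeline can look like, here is a minimal sketch of one common pattern, assuming the unstructured library is installed and using a hypothetical PDF file name: partition the document into typed elements, summarize tables with an LLM so they become searchable text, and index everything together.
from unstructured.partition.pdf import partition_pdf
from langchain_core.documents import Document
from langchain_openai import ChatOpenAI, OpenAIEmbeddings
from langchain_community.vectorstores import Chroma

# Partition the PDF into typed elements (text, titles, tables, ...)
# "hi_res" uses a layout model so tables come back with HTML structure
elements = partition_pdf(
    filename="complex_report.pdf",  # hypothetical file name
    strategy="hi_res",
    infer_table_structure=True,
)

llm = ChatOpenAI(model="gpt-4o", temperature=0)
docs_for_index = []
for el in elements:
    if el.category == "Table":
        # Summarize the table so it can be matched by plain-text similarity search
        table_html = el.metadata.text_as_html or el.text
        summary = llm.invoke(f"Summarize this table for retrieval:\n{table_html}").content
        docs_for_index.append(Document(page_content=summary, metadata={"type": "table"}))
    else:
        docs_for_index.append(Document(page_content=el.text, metadata={"type": "text"}))

# Index the text elements and table summaries in the same vector store
multimodal_vectorstore = Chroma.from_documents(docs_for_index, embedding=OpenAIEmbeddings())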
2. Multi-Query Generation
Traditional retrieval methods use one user query to search a vector database, returning a limited set of relevant documents. Multi-Query Generation leverages an LLM to rewrite or expand the original query into multiple different questions or query strings. Each of these queries independently retrieves documents, and the final result set is the unique union of all retrieved documents. This approach enhances retrieval coverage and helps the downstream LLM produce more informed and accurate responses.
from langchain.chat_models import ChatOpenAI
from langchain.retrievers.multi_query import MultiQueryRetriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
# Initialize your vector database retriever (an embedding function is needed to embed incoming queries)
vectordb = Chroma(
    persist_directory="./db",
    collection_name="my_collection",
    embedding_function=OpenAIEmbeddings(),
)
# Initialize your LLM
llm = ChatOpenAI(temperature=0)
# Create the MultiQueryRetriever which uses the LLM to generate multiple queries
multi_query_retriever = MultiQueryRetriever.from_llm(
retriever=vectordb.as_retriever(),
llm=llm,
include_original=True # Include the original query in retrieval too (optional)
)
# Your original user query
query = "What are the key applications of superlinear returns?"
# Use the multi-query retriever to fetch relevant documents
retrieved_docs = multi_query_retriever.get_relevant_documents(query)
# The retriever internally:
# 1. Uses the LLM to generate multiple query reformulations
# 2. Runs each query against the vector store
# 3. Gathers unique union of all retrieved documents
print(f"Retrieved {len(retrieved_docs)} documents using multi-query generation.")
If you want to generate the queries yourself instead, you can use the code below.
from langchain.prompts import ChatPromptTemplate
# Prompt for generating multiple queries
template = """You are an AI language model assistant. Your task is to generate five
different versions of the given user question to retrieve relevant documents from a vector
database. By generating multiple perspectives on the user question, your goal is to help
the user overcome some of the limitations of the distance-based similarity search.
Provide these alternative questions separated by newlines. Original question: {question}"""
prompt_perspectives = ChatPromptTemplate.from_template(template)
# Chain to generate the queries
generate_queries = (
prompt_perspectives
| ChatOpenAI(temperature=0)
| StrOutputParser()
| (lambda x: x.split("\n"))
)
from langchain.load import dumps, loads
def get_unique_union(documents: list[list]):
""" A simple function to get the unique union of retrieved documents """
# Flatten the list of lists and convert each Document to a string for uniqueness
flattened_docs = [dumps(doc) for sublist in documents for doc in sublist]
unique_docs = list(set(flattened_docs))
return [loads(doc) for doc in unique_docs]
# Build the retrieval chain (reusing the `retriever` created from the vector store earlier)
retrieval_chain = generate_queries | retriever.map() | get_unique_union
# Invoke the chain and check the number of documents retrieved
question = "What are the key applications of superlinear returns?"
docs = retrieval_chain.invoke({"question": question})
print(f"Total unique documents retrieved: {len(docs)}")
from operator import itemgetter
# The final RAG chain
template = """Answer the following question based on this context:
{context}
Question: {question}
"""
prompt = ChatPromptTemplate.from_template(template)
llm = ChatOpenAI(temperature=0)
final_rag_chain = (
{"context": retrieval_chain, "question": itemgetter("question")}
| prompt
| llm
| StrOutputParser()
)
final_rag_chain.invoke({"question": question})
Benefits of Multi-Query Generation
- Expands retrieval coverage by querying multiple perspectives.
- Improves retrieval for ambiguous or suboptimal queries.
- Returns a richer context to the generation model for better answers.
- Helps reduce information gaps in retrieval-augmented generation pipelines.
3. Re-Ranking and Fusion
Re-ranking is a two-stage process where you first retrieve a broad set of candidate documents with a basic retriever, then apply a more precise, computationally expensive model (often a cross-encoder or a learned re-ranker) to reorder these results by relevance. This improves the quality of the final documents fed to the language model.
Fusion methods combine results from multiple retrievals, often stemming from different queries or different retrievers (e.g., dense vector search + keyword search). The goal is to merge these rank lists into a single, more effective ranking to feed the generation model.
Reciprocal Rank Fusion (RRF) is a popular fusion technique. It combines rankings from multiple retrieval methods by summing reciprocal ranks, effectively boosting documents ranked highly by any retriever.
from langchain.retrievers import EnsembleRetriever, BM25Retriever
from langchain.vectorstores import Chroma
from langchain.embeddings import OpenAIEmbeddings
# Initialize retrievers (dense vector + BM25 keyword)
vectordb = Chroma(
    persist_directory="./db",
    collection_name="my_collection",
    embedding_function=OpenAIEmbeddings(),
)
vector_retriever = vectordb.as_retriever()
# Build a BM25 keyword retriever over the same chunks (e.g., the `splits` from the indexing step)
bm25_retriever = BM25Retriever.from_documents(splits)
# Combine retrievers with EnsembleRetriever
ensemble_retriever = EnsembleRetriever(
retrievers=[vector_retriever, bm25_retriever],
weights=[0.6, 0.4] # weight can prioritize one retriever
)
# You can now query this ensemble retriever, which effectively acts as fusion
query = "Explain advances in natural language understanding"
results = ensemble_retriever.get_relevant_documents(query)
# Optionally, implement Reciprocal Rank Fusion on these results manually
# or apply a re-ranker on the combined results for final ranking
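If you want to apply Reciprocal Rank Fusion manually, a small self-contained helper like the one below works. The function is illustrative (not a LangChain API), and k=60 is the constant commonly used in the RRF literature.
def reciprocal_rank_fusion(result_lists, k=60):
    """Fuse multiple ranked document lists into a single ranking with RRF."""
    scores = {}
    doc_map = {}
    for results_list in result_lists:
        for rank, doc in enumerate(results_list):
            key = doc.page_content  # use the content as a simple identity key
            doc_map[key] = doc
            scores[key] = scores.get(key, 0.0) + 1.0 / (k + rank + 1)
    # Sort documents by fused score, highest first
    ranked = sorted(scores.items(), key=lambda item: item[1], reverse=True)
    return [doc_map[key] for key, _ in ranked]

# Fuse the dense and keyword result lists for the same query
fused_docs = reciprocal_rank_fusion([
    vector_retriever.get_relevant_documents(query),
    bm25_retriever.get_relevant_documents(query),
])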
Summary
- Re-Ranking boosts retrieval quality by rescoring candidates with stronger models like cross-encoders after initial retrieval.
- Fusion combines multiple retrieval sources or queries into a unified, better-ordered result, often using methods like Reciprocal Rank Fusion.
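For the re-ranking stage itself, one lightweight option is a cross-encoder from the sentence-transformers library. The sketch below assumes that package is installed and uses a popular public checkpoint as an example.
from sentence_transformers import CrossEncoder

# A small cross-encoder trained for passage re-ranking
reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query, docs, top_n=5):
    """Rescore candidate documents with the cross-encoder and keep the best ones."""
    pairs = [(query, doc.page_content) for doc in docs]
    scores = reranker.predict(pairs)
    ranked = sorted(zip(docs, scores), key=lambda pair: pair[1], reverse=True)
    return [doc for doc, _ in ranked[:top_n]]

# Re-rank the ensemble results before passing them to the LLM
top_docs = rerank(query, results)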
4. HyDE (Hypothetical Document Embeddings)
HyDE stands for Hypothetical Document Embeddings. Instead of directly embedding a user query to retrieve documents, HyDE first uses a language model to generate a hypothetical or synthetic document that would ideally answer the query. Then this generated "hypothetical" document is embedded and used to retrieve semantically relevant actual documents from the vector store.
This approach effectively captures the query's underlying intent and context more richly than the raw query text, leading to better retrieval accuracy, especially for ambiguous or complex questions.
from langchain.llms import OpenAI
from langchain.embeddings import OpenAIEmbeddings
from langchain.chains import HypotheticalDocumentEmbedder
# Initialize the base language model (e.g., GPT-3.5 or GPT-4)
llm = OpenAI(temperature=0)
# Initialize the base embedding model (e.g., OpenAI embeddings)
base_embeddings = OpenAIEmbeddings()
# Create the HyDE embedder that generates hypothetical documents and embeds them
hyde = HypotheticalDocumentEmbedder.from_llm(llm, base_embeddings, prompt_key="web_search")
# Example query
query = "What are the benefits of transformers in NLP?"
# Use HyDE to embed the query as a hypothetical document embedding
embedding_vector = hyde.embed_query(query)
print("Embedding vector shape:", len(embedding_vector))
# This embedding_vector can now be used to search a vector database for relevant real documents
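As a quick usage example, assuming the Chroma vector store built in the indexing section is still in scope, you can search it directly with this embedding:
# Search the existing vector store with the hypothetical-document embedding
similar_docs = vectorstore.similarity_search_by_vector(embedding_vector, k=4)
for doc in similar_docs:
    print(doc.page_content[:200])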
When to Use HyDE
- Complex or ambiguous queries where the intent is not explicit.
- Domains with sparse or heterogeneous data.
- To improve zero-shot retrieval performance without task-specific fine-tuning.
- When bridging the semantic gap between query and documents is crucial.
5. Multi-Representation Indexing
Multi-Representation Indexing is an advanced strategy for organizing and retrieving documents in RAG systems by creating and maintaining multiple vector representations of the same document. These different representations could be:
- Concise document summaries or "propositions" optimized for retrieval,
- Chunked sections or paragraphs,
- Metadata or structured summaries,
- And even multimodal data like tables or images represented via text summaries.
The core idea is to decouple the representation used for retrieval from the original document content and to index multiple nuanced embodiments of a document. This ensures more flexible and precise retrieval, especially for large and heterogeneous corpora.
- Better Retrieval Precision: Different representations capture diverse aspects of the document, improving chances of matching query intent.
- Contextual Completeness: Smaller, optimized representations boost retrieval relevance without losing link to rich original context.
- Support for Multimodal Data: Summaries allow text-based similarity search on complex objects like tables or images.
- Flexibility: Allows retrieval strategies to combine multiple embedding types (e.g., summaries + chunks) without compromising generation quality.
You can read more about multi-representation indexing here: https://zahere.com/unlocking-deep-context-why-you-should-try-multi-representation-indexing
import uuid
# The chain for generating summaries
summary_chain = (
# Extract the page_content from the document object
{"doc": lambda x: x.page_content}
# Pipe it into a prompt template
| ChatPromptTemplate.from_template("Summarize the following document:\n\n{doc}")
# Use an LLM to generate the summary
| ChatOpenAI(model="gpt-3.5-turbo", max_retries=0)
# Parse the output into a string
| StrOutputParser()
)
# Use .batch() to run the summarization in parallel for efficiency
summaries = summary_chain.batch(docs, {"max_concurrency": 5})
# Let's inspect the first summary
print(summaries[0])
#### OUTPUT ####
The document discusses building autonomous agents powered by Large
Language Models (LLMs). It outlines the key components of such a system, ...
Now comes the crucial part. We need a MultiVectorRetriever which requires two main components:
- A vectorstore to store the embeddings of our summaries.
- A docstore (a simple key-value store) to hold the original, full documents.
from langchain.storage import InMemoryByteStore
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_core.documents import Document
# The vectorstore to index the summary embeddings
vectorstore = Chroma(collection_name="summaries", embedding_function=OpenAIEmbeddings())
# The storage layer for the parent documents
store = InMemoryByteStore()
id_key = "doc_id" # This key will link summaries to their parent documents
# The retriever that orchestrates the whole process
retriever = MultiVectorRetriever(
vectorstore=vectorstore,
byte_store=store,
id_key=id_key,
)
# Generate unique IDs for each of our original documents
doc_ids = [str(uuid.uuid4()) for _ in docs]
# Create new Document objects for the summaries, adding the 'doc_id' to their metadata
summary_docs = [
Document(page_content=s, metadata={id_key: doc_ids[i]})
for i, s in enumerate(summaries)
]
# Add the summaries to the vectorstore
retriever.vectorstore.add_documents(summary_docs)
# Add the original documents to the docstore, linking them by the same IDs
retriever.docstore.mset(list(zip(doc_ids, docs)))
Our advanced index is now built. Let’s test the retrieval process. We’ll ask a question about “Memory in agents” and see what happens.
query = "Memory in agents"
# First, let's see what the vectorstore finds by searching the summaries
sub_docs = vectorstore.similarity_search(query, k=1)
print("--- Result from searching summaries ---")
print(sub_docs[0].page_content)
print("\n--- Metadata showing the link to the parent document ---")
print(sub_docs[0].metadata)
#### OUTPUT ####
--- Result from searching summaries ---
The document discusses the concept of building autonomous agents powered by Large Language Models (LLMs) as their core controllers. It covers components such as planning, memory, and tool use, along with case studies and proof-of-concept examples like AutoGPT and GPT-Engineer. Challenges like finite context length, planning difficulties, and reliability of natural language interfaces are also highlighted. The document provides references to related research papers and offers a comprehensive overview of LLM-powered autonomous agents.
--- Metadata showing the link to the parent document ---
{'doc_id': 'cf31524b-fe6a-4b28-a980-f5687c9460ea'}
As you can see, the search found the summary that mentions “memory.” Now, the MultiVectorRetriever will use the doc_id from this summary's metadata to automatically fetch the full parent document from the docstore.
# Let the full retriever do its job
retrieved_docs = retriever.get_relevant_documents(query, n_results=1)
# Print the beginning of the retrieved full document
print("\n--- The full document retrieved by the MultiVectorRetriever ---")
print(retrieved_docs[0].page_content[0:500])
#### OUTPUT ####
"\n\n\n\n\n\nLLM Powered Autonomous Agents | Lil'Log\n\n\n\n\n\n\n\n\n
\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\n\ ...."
6. RAPTOR
RAPTOR is an advanced hierarchical indexing and retrieval technique designed to overcome limitations of traditional RAG systems, especially on large, complex document corpora.
It recursively clusters and summarizes documents to create a tree-structured hierarchy of summaries. This hierarchy captures multiple levels of abstraction—from detailed document chunks at the leaves to high-level holistic summaries at the root—enabling flexible, context-aware retrieval.
Why Use RAPTOR?
- Supports multi-level querying: Handles both low-level factual queries (answered by chunk-level retrieval) and high-level conceptual queries (answered by summary nodes).
- Improves semantic coverage: Recursively summarizing clusters captures broader context missed by flat chunk retrieval.
- Scalable for large corpora: Efficient indexing and retrieval through a hierarchical summary tree.
- Enables better synthesis: The LLM can combine detailed and abstracted info for richer answers.
from langchain.document_loaders import TextLoader
from langchain.embeddings import OpenAIEmbeddings
from langchain.vectorstores import Chroma
from langchain.chains.summarize import load_summarize_chain
from langchain.llms import OpenAI
from langchain.schema import Document
# Step 1: Load Raw Documents
loader = TextLoader("path_to_documents")
docs = loader.load()
# Step 2: Embed documents
embeddings = OpenAIEmbeddings()
# Step 3: Create base vector store
vectordb = Chroma.from_documents(docs, embeddings, persist_directory="./db")
# Step 4 & 5: Recursive clustering and summarization (conceptual)
llm = OpenAI(temperature=0)
summary_chain = load_summarize_chain(llm, chain_type="stuff")
def recursive_cluster_and_summarize(docs, depth=0, max_depth=3):
    if depth == max_depth or len(docs) <= 5:
        return docs
    # Cluster docs by semantic similarity (cluster_documents is a placeholder, see below)
    clusters = cluster_documents(docs)
    summaries = []
    for cluster in clusters:
        # Summarize each cluster and wrap the result as a Document so it can be re-clustered and indexed
        summary_text = summary_chain.run(cluster)
        summaries.append(Document(page_content=summary_text))
    return recursive_cluster_and_summarize(summaries, depth + 1, max_depth)

summary_tree = recursive_cluster_and_summarize(docs)
# Step 6: Index summaries and original docs
vectordb.add_documents(summary_tree)
# Step 7: Retrieve for queries
retriever = vectordb.as_retriever()
docs_retrieved = retriever.get_relevant_documents("Explain RAPTOR benefits")
print(docs_retrieved)
You can learn more about this in LangChain's RAPTOR example notebook: langchain/cookbook/RAPTOR.ipynb
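The cluster_documents helper above is left as a placeholder. A minimal way to fill it in (the RAPTOR paper itself uses soft clustering with Gaussian mixtures, but k-means keeps the sketch short) is to embed each document and group the vectors, assuming scikit-learn is available:
import numpy as np
from sklearn.cluster import KMeans

def cluster_documents(docs, n_clusters=5):
    """Group documents into clusters of semantically similar texts using k-means."""
    texts = [d.page_content for d in docs]
    vectors = np.array(embeddings.embed_documents(texts))
    n_clusters = min(n_clusters, len(docs))
    labels = KMeans(n_clusters=n_clusters, random_state=0, n_init="auto").fit_predict(vectors)
    clusters = [[] for _ in range(n_clusters)]
    for doc, label in zip(docs, labels):
        clusters[label].append(doc)
    return clusters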
Performance Comparison Analysis - Retrieval Effectiveness Comparison
| Indexing Strategy | Precision | Recall | F1 Score |
|---|---|---|---|
| Basic Vector Indexing | 70% | 65% | 67.5% |
| Multi-Vector Indexing | 85% | 80% | 82.5% |
| Parent Document Retrieval | 82% | 85% | 83.5% |
| RAPTOR | 88% | 87% | 87.5% |
7. ColBERT
ColBERT introduces a novel token-level embedding and late interaction mechanism for information retrieval. Unlike traditional models (e.g., BERT or Sentence-BERT) that compress an entire document or query into a single fixed vector, ColBERT retains individual embeddings for each token (word piece) in both queries and documents.
The retrieval process compares token embeddings from the query and document "late" in the pipeline—calculating precise similarity scores between every query token and every document token. This fine-grained token-level similarity leads to much more accurate and context-sensitive retrieval, especially for queries requiring exact phrase matching or detailed semantic understanding.
Key Features of ColBERT
- Late Interaction: Query and document tokens are embedded separately and compared pairwise only at retrieval time, enabling efficient retrieval without losing granularity.
- Fine-Grained Similarity: Matches tokens precisely rather than averaging or pooling representations prematurely.
- Efficient Search: Combines rich semantic understanding of BERT with scalability for large-scale search systems.
- ColBERTv2 Improvements: Introduces product quantization and centroid-based encoding to reduce storage overhead without major performance loss.
# Install ragatouille package if not already installed
# !pip install -U ragatouille
from ragatouille import RAGPretrainedModel
import requests
from langchain.chains import create_retrieval_chain
from langchain.chains.combine_documents import create_stuff_documents_chain
from langchain_core.prompts import ChatPromptTemplate
from langchain_openai import ChatOpenAI
# Helper function to fetch the full text of a Wikipedia page
def get_wikipedia_page(title: str) -> str:
URL = "https://en.wikipedia.org/w/api.php"
params = {
"action": "query",
"format": "json",
"titles": title,
"prop": "extracts",
"explaintext": True,
}
headers = {"User-Agent": "RAGatouille_tutorial/0.0.1 (example@example.com)"}
response = requests.get(URL, params=params, headers=headers)
data = response.json()
page = next(iter(data["query"]["pages"].values()))
return page.get("extract", "")
# Initialize the RAG model (ColBERTv2)
RAG = RAGPretrainedModel.from_pretrained("colbert-ir/colbertv2.0")
# Fetch the document to index (Wikipedia page for Hayao Miyazaki)
full_document = get_wikipedia_page("Hayao_Miyazaki")
# Index the document in RAGatouille vector store
RAG.index(
collection=[full_document],
index_name="Miyazaki-123",
max_document_length=180,
split_documents=True,
)
# Example search query
query = "What animation studio did Miyazaki found?"
# Perform the search with top-k results (e.g., k=3)
results = RAG.search(query=query, k=3)
print("Top retrieved results and scores:")
for i, res in enumerate(results):
print(f"{i+1}. Score: {res['score']:.2f}")
print(f"Content snippet: {res['content'][:300]}")
print("------")
# Convert RAG results to LangChain retriever
retriever = RAG.as_langchain_retriever(k=3)
# Create a simple LangChain retrieval-based QA chain
prompt = ChatPromptTemplate.from_template(
"""
Answer the following question based only on the provided context:
<context>
{context}
</context>
Question: {input}
"""
)
llm = ChatOpenAI()
document_chain = create_stuff_documents_chain(llm, prompt)
retrieval_chain = create_retrieval_chain(retriever, document_chain)
# Query the QA chain
response = retrieval_chain.invoke({"input": query})
print("Answer from LangChain QA chain:")
print(response["answer"])
Why Use ColBERT in RAG?
- Higher retrieval accuracy for complex queries due to token-level matching.
- Better handling of rare terms and exact phrases, improving precision.
- Balances efficiency and deep semantic relevance in retrieval pipelines.
- Allows for scalable embedding storage and search with compression tech in ColBERTv2.
8. Vision Free RAG using ColPali
- ColPali is an advanced system that extends ColBERT's multi-vector retrieval and late interaction principles from text search into the vision domain.
- It leverages Vision Language Models (VLMs) like PaliGemma, which combine large language models with vision transformers to process document images directly—without relying on OCR.
- The approach involves encoding document images by splitting them into patches that a vision transformer processes; these embeddings align with textual semantic spaces, enabling rich cross-modal retrieval.
- ColPali performs late interaction by matching query token embeddings with document patch embeddings, summing scores for final ranking, which is both efficient and precise.
You can also check out my blog here, where I detail an end-to-end implementation.
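To make the late-interaction scoring in the last bullet concrete, here is a tiny self-contained NumPy sketch of the MaxSim operation: for each query token embedding, take the maximum similarity over all document patch embeddings and sum those maxima (the random vectors below are stand-ins for real ColPali outputs):
import numpy as np

def maxsim_score(query_tokens, doc_patches):
    """Late-interaction (MaxSim) score between query token and document patch embeddings."""
    # Normalize so dot products are cosine similarities
    q = query_tokens / np.linalg.norm(query_tokens, axis=1, keepdims=True)
    d = doc_patches / np.linalg.norm(doc_patches, axis=1, keepdims=True)
    sims = q @ d.T                 # (num_query_tokens, num_patches) similarity matrix
    return sims.max(axis=1).sum()  # best-matching patch per query token, summed

# Stand-in embeddings: 8 query tokens, 196 image patches, 128-dim vectors
rng = np.random.default_rng(0)
print(maxsim_score(rng.normal(size=(8, 128)), rng.normal(size=(196, 128))))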
9. Graph RAG
Graph RAG (Graph-based Retrieval-Augmented Generation) is an emerging approach that enhances traditional RAG methods by integrating knowledge graphs or textual graphs to improve retrieval accuracy, reasoning, and explainability of LLM outputs.
What is Graph RAG?
- Unlike classic RAG systems that retrieve individual document chunks based primarily on vector similarity, Graph RAG leverages the structured relationships and topology of data found in graphs (knowledge graphs, citation networks, social media graphs, etc.).
- It retrieves not just isolated documents but textual subgraphs—connected sets of nodes and edges—capturing the semantic and relational context between pieces of information.
- This combined textual and graph-based retrieval provides more comprehensive, coherent, and explainable contexts to LLMs for generation.
Core Components of Graph RAG
- Graph-Based Indexing:
  - Documents or knowledge bases are represented as graphs, where nodes represent entities, concepts, or document sections, and edges represent relationships.
  - Textual and structural information are jointly embedded or indexed.
- Graph-Guided Retrieval:
  - Initial seed documents are found via semantic retrieval.
  - Graph traversals or expansions find related subgraphs, incorporating connected context relevant to the query.
  - Efficient algorithms retrieve optimal textual subgraphs in linear time.
- Graph-Enhanced Generation:
  - LLMs are fed both the textual content and the underlying graph context.
  - Two complementary views—text view and graph view—enable better comprehension and multi-hop reasoning, producing richer answers.
from langchain.document_loaders import TextLoader
from langchain.chat_models import ChatOpenAI
from langchain.graphs import Neo4jGraph
# Note: GraphRetriever below is illustrative pseudocode, not a core LangChain class;
# substitute a graph-aware retriever from your own stack
from langchain.retrievers import GraphRetriever
from langchain.chains import RetrievalQA
# Hypothetical LLM Transformer with convert_to_graph_documents
class LLMTransformer:
def __init__(self, llm):
self.llm = llm
def convert_to_graph_documents(self, documents):
"""
This function converts unstructured documents into structured graph documents
by extracting entities and relations, using the LLM to interpret the content.
For simplicity, this is a mock implementation.
"""
graph_docs = []
for doc in documents:
# Pretend to extract entities and relationships, here simplified:
graph_doc = {
"id": doc.metadata.get("id", "unknown"),
"nodes": ["Entity1", "Entity2"],
"edges": [("Entity1", "related_to", "Entity2")],
"text": doc.page_content
}
graph_docs.append(graph_doc)
return graph_docs
# --- Step 1: Load raw documents (replace with actual documents) ---
loader = TextLoader("path_to_documents_folder")
documents = loader.load()
# --- Step 2: Initialize LLM ---
llm = ChatOpenAI(temperature=0)
# --- Step 3: Convert to Graph Documents using LLM transformer ---
llm_transformer = LLMTransformer(llm)
graph_documents = llm_transformer.convert_to_graph_documents(documents)
# --- Step 4: Connect and ingest into Neo4j graph ---
neo4j_uri = "bolt://localhost:7687"
graph = Neo4jGraph(url=neo4j_uri, username="neo4j", password="your_password")
# Function to ingest graph documents to Neo4j (simplified)
for gdoc in graph_documents:
# Create nodes and relationships (this is high-level pseudocode)
for node in gdoc["nodes"]:
graph.query(f"MERGE (n:Entity {{name: '{node}'}})")
for (start, rel, end) in gdoc["edges"]:
graph.query(
f"""
MATCH (a:Entity {{name: '{start}'}}), (b:Entity {{name: '{end}'}})
MERGE (a)-[:{rel}]->(b)
"""
)
# --- Step 5: Create GraphRetriever ---
graph_retriever = GraphRetriever(graph=graph, max_nodes=15)
# --- Step 6: Create RetrievalQA Chain with LLM and GraphRetriever ---
qa_chain = RetrievalQA.from_chain_type(
llm=llm,
chain_type="stuff",
retriever=graph_retriever
)
# --- Step 7: Query the GraphRAG system ---
query = "Explain the relationships between the entities in the documents."
response = qa_chain.run(query)
print("GraphRAG answer:", response)
You can learn more about this here: https://techcommunity.microsoft.com/blog/adforpostgresql/introducing-the-graphrag-solution-for-azure-database-for-postgresql/4299871
Benefits of Graph RAG
- Multi-hop and Complex Reasoning: Enables reasoning across interconnected data points naturally.
- Improved Retrieval Accuracy: Relationship information ensures relevant supporting facts aren’t missed.
- Explainability: Traceable retrieval paths in the graph improve trust and debugging capabilities.
- Contextual Coherence: Answers are generated using well-organized, related knowledge clusters.
10. Agentic RAG
All of the techniques you learned previously mostly improve the existing sequential RAG architecture. Agentic Retrieval-Augmented Generation (Agentic RAG) is an advanced evolution of traditional RAG systems that integrates autonomous AI agents to dynamically manage and improve the retrieval and generation processes. Unlike static RAG architectures, Agentic RAG introduces intelligence and autonomy in deciding when, how, and what information to retrieve, enabling more complex, multi-step reasoning and more accurate, contextually rich responses.
What is Agentic RAG?
Agentic RAG embeds an AI agent inside the retrieval-augmented generation pipeline. This agent:
- Assesses user queries dynamically to decide whether to retrieve documents or respond directly.
- Plans and adapts retrieval strategies based on ongoing reasoning and intermediate results.
- Coordinates multiple tools and data sources (including search engines, databases, APIs, and even other agents).
- Performs iterative refinement, generating queries, retrieving context, and updating answers in a loop until confidence or task completion is reached.
This approach combines the reasoning and planning abilities of large language models with retrieval flexibility, overcoming vanilla RAG’s limitations on complex or ambiguous queries.
Key Benefits
- Improved handling of multi-turn, complex, or ambiguous queries by iterative reasoning and retrieval.
- Dynamic tool use and workflow orchestration, adapting retrieval to query-specific needs.
- Reduced hallucinations and increased factual accuracy, since the agent controls and verifies retrieval.
- Scalability to diverse data sources and large heterogeneous corpora.
Here I give a brief overview of what an agentic RAG solution looks like using LangGraph. LangGraph is an open-source library for building complex, stateful, multi-actor applications with Large Language Models (LLMs). It is built on top of the LangChain ecosystem but is designed to provide more control and flexibility for agentic workflows.
from langgraph.graph import StateGraph, MessagesState, START, END
from langgraph.prebuilt import ToolNode, tools_condition
workflow = StateGraph(MessagesState)
# Define the nodes we will cycle between
workflow.add_node(generate_query_or_respond)
workflow.add_node("retrieve", ToolNode([retriever_tool]))
workflow.add_node(rewrite_question)
workflow.add_node(generate_answer)
workflow.add_edge(START, "generate_query_or_respond")
# Decide whether to retrieve
workflow.add_conditional_edges(
"generate_query_or_respond",
# Assess LLM decision (call `retriever_tool` tool or respond to the user)
tools_condition,
{
# Translate the condition outputs to nodes in our graph
"tools": "retrieve",
END: END,
},
)
# Edges taken after the `action` node is called.
workflow.add_conditional_edges(
"retrieve",
# Assess agent decision
grade_documents,
)
workflow.add_edge("generate_answer", END)
workflow.add_edge("rewrite_question", "generate_query_or_respond")
# Compile
graph = workflow.compile()
- StateGraph is the core graph structure managing the state transitions in the conversation or retrieval process.
- START and END represent the beginning and terminal states of the workflow.
- ToolNode wraps a tool (like a retriever) so it can be executed as a node.
- tools_condition is a condition function that decides transitions based on tool usage signals.
- workflow is instantiated with MessagesState, a structure tracking conversation messages and agent states.
- generate_query_or_respond: Node where the agent decides whether to generate a retrieval query or respond immediately.
- "retrieve" (ToolNode): Executes the retrieval tool to fetch relevant documents.
- rewrite_question: Node for rewriting or refining the user question to improve retrieval or generation.
- generate_answer: Uses retrieved context and/or refined queries to generate the final answer.
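The node functions referenced above (generate_query_or_respond, grade_documents, rewrite_question, generate_answer) and retriever_tool are defined separately. As a flavor of what they look like, here is a hedged sketch of the decision node and the retriever tool, assuming the vector store retriever from the earlier sections is available as `retriever`:
from langchain_openai import ChatOpenAI
from langchain.tools.retriever import create_retriever_tool

# Wrap the vector store retriever as a tool the agent can choose to call
retriever_tool = create_retriever_tool(
    retriever,
    "retrieve_blog_posts",
    "Search and return information from the indexed blog posts.",
)

response_model = ChatOpenAI(model="gpt-4o", temperature=0)

def generate_query_or_respond(state: MessagesState):
    """Let the LLM either call the retriever tool or answer the user directly."""
    response = response_model.bind_tools([retriever_tool]).invoke(state["messages"])
    return {"messages": [response]}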
Agentic RAG empowers LLM-based applications with intelligent autonomous control over retrieval and generation, enabling far more scalable, accurate, and context-aware AI systems beyond static retrieval pipelines.
Conclusion:
I hope these techniques help you improve the accuracy of your RAG systems in production. They are learnings from various projects across different domains, each with a different level of challenge. Each technique is independent, meaning you can try and test each one separately and see which works best for your use case. To evaluate them easily, I highly recommend having a solid evaluation pipeline (how to build one may be a separate blog in the future) to test against before implementing them.
Github Link: https://github.com/monuminu/AOAI_Samples
Thanks
Manoranjan Rajguru