The Future of AI: GraphRAG – A better way to query interlinked documents

vinayakh

Microsoft

Nov 13, 2024

GraphRAG is a technique that combines the power of knowledge graphs and large language models (LLMs) to improve the accuracy and relevance of responses to user queries.

All language models are trained on a huge corpus of data. They have some world knowledge and can answer a range of questions about different things. However, due to their probabilistic nature and incomplete world knowledge, especially when it comes to different niches and domains, it’s possible to receive incorrect answers. Retrieval Augmented Generation (RAG) helps augment world knowledge with enterprise-specific references, reducing inaccuracies and inconsistencies in the generated text.

How RAG works and improves LLM output

In RAG, the corpus of text relevant to your domain is converted into embeddings. Embeddings are created by translating documents into a mathematical form based on their traits, factors, and categories. The resulting vector representation is a long sequence of numbers. The distance between two vectors indicates how closely related they are. Similar objects are positioned closer together in a multi-dimensional embedding space, while less similar objects are positioned farther apart.
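To make the idea of distance between vectors concrete, here is a minimal sketch that compares toy embeddings with cosine similarity, one common closeness measure. The three-dimensional vectors and the words attached to them are made up for illustration; a real embedding model such as text-embedding-3-small produces far longer vectors.

```python
import math

def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Hypothetical 3-dimensional "embeddings", invented for this example.
embeddings = {
    "king": [0.9, 0.8, 0.1],
    "queen": [0.85, 0.9, 0.15],
    "banana": [0.1, 0.2, 0.95],
}

sim_king_queen = cosine_similarity(embeddings["king"], embeddings["queen"])
sim_king_banana = cosine_similarity(embeddings["king"], embeddings["banana"])

# Related concepts should score closer to 1.0 than unrelated ones.
print(f"king vs queen:  {sim_king_queen:.3f}")
print(f"king vs banana: {sim_king_banana:.3f}")
```

A higher score means the two vectors point in a more similar direction, which is how a vector database ranks candidate snippets against a query.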

As the term signifies, RAG consists of three steps. First, the vectors relevant to the query are retrieved (typically from a vector database). Next, the prompt sent to the LLM is augmented with this contextual information. Finally, the LLM generates an answer based on this context and the query. Using the RAG approach, developers can extend the factual grounding of the model, improve the relevance, accuracy and quality of the answers generated by the LLM, and in many cases refer back to the document snippets that were used to generate the answer. RAG has emerged as a powerful approach that combines the strengths of information retrieval and generative models.
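The three steps can be sketched in a few lines of Python. This is only an illustration under loud assumptions: `embed` is a bag-of-words stand-in for a real embedding model, `DOCUMENTS` is a tiny hand-written corpus, and `generate` is a placeholder where a call to an LLM would go. None of these names come from a real library.

```python
import math
from collections import Counter

# Hypothetical mini-corpus standing in for your domain documents.
DOCUMENTS = [
    "Duryodhana is the eldest son of Dhritarashtra and leads the Kauravas.",
    "Arjuna is a son of Pandu and one of the Pandavas.",
    "Dhritarashtra and Pandu are brothers.",
]

def embed(text):
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(text.lower().replace(".", "").replace("?", "").split())

def similarity(a, b):
    """Cosine similarity between two Counter-based vectors."""
    dot = sum(a[word] * b[word] for word in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0

def retrieve(query, k=2):
    """Step 1: return the k documents most similar to the query."""
    q = embed(query)
    ranked = sorted(DOCUMENTS, key=lambda d: similarity(q, embed(d)), reverse=True)
    return ranked[:k]

def augment(query, snippets):
    """Step 2: build a prompt that carries the retrieved context."""
    context = "\n".join(f"- {s}" for s in snippets)
    return f"Answer using only this context:\n{context}\n\nQuestion: {query}"

def generate(prompt):
    """Step 3: placeholder for the LLM call (no model is invoked here)."""
    return f"[LLM response to a {len(prompt)}-character prompt]"

query = "Who is the father of Arjuna?"
snippets = retrieve(query)
prompt = augment(query, snippets)
print(prompt)
print(generate(prompt))
```

Because the snippets are passed along in the prompt, the application can also show them to the user as the sources behind the answer.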

How GraphRAG builds upon the RAG approach

Though RAG improves on an LLM's generative capabilities, it sometimes struggles to make sense of concepts and the relationships between them when they are spread across documents. Also, as the complexity of data structures grows, there is a need for more advanced systems capable of handling interconnected, multi-faceted information. This is where GraphRAG comes into play. GraphRAG is an advanced version of RAG that utilizes graph-based retrieval mechanisms, enhancing the generation process by capturing richer, more contextual information. GraphRAG improves over vector RAG in the following ways.

Enhanced Contextual Understanding with Graphs

RAG traditionally uses a flat retrieval system (through embeddings in a vector DB), where it retrieves documents (and relevant document fragments) from a knowledge base based on their relevance to a query. The generative model then uses these retrieved documents to generate a response. While effective, this method can struggle when information is spread across multiple, interconnected documents. GraphRAG, on the other hand, uses graph-based retrieval, which allows it to connect pieces of information across a web of nodes. Each node represents an entity or a concept, and the edges represent the relationships between them. Examples of this could be relations like “is part of,” “is cousin of,” or “is made of.” This structured approach enables GraphRAG to extract and utilize more nuanced, multi-layered contextual information, resulting in more coherent and accurate responses.
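Here is a minimal sketch of the node-and-edge idea, assuming the graph is stored as plain (subject, relation, object) triples in memory. The triples reuse facts from the Mahabharata demo later in this post; the `find_path` helper and the "inverse of" edge labels are hypothetical and are not how the GraphRAG package represents its graph.

```python
from collections import defaultdict, deque

# Each triple is an edge: (entity, relationship, entity).
TRIPLES = [
    ("Duryodhana", "is son of", "Dhritarashtra"),
    ("Arjuna", "is son of", "Pandu"),
    ("Dhritarashtra", "is brother of", "Pandu"),
    ("Arjuna", "is part of", "Pandavas"),
]

# Adjacency list; every edge is also stored in reverse so we can walk both ways.
graph = defaultdict(list)
for subject, relation, obj in TRIPLES:
    graph[subject].append((relation, obj))
    graph[obj].append((f"inverse of '{relation}'", subject))

def find_path(start, goal):
    """Breadth-first search returning the chain of edges linking two entities."""
    queue = deque([(start, [])])
    visited = {start}
    while queue:
        entity, path = queue.popleft()
        if entity == goal:
            return path
        for relation, neighbor in graph.get(entity, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [(entity, relation, neighbor)]))
    return None  # the two entities are not connected

path = find_path("Duryodhana", "Arjuna")
for subject, relation, obj in path:
    print(f"{subject} --[{relation}]--> {obj}")
```

No single triple says how Duryodhana and Arjuna are related, but walking the edges connects them through their fathers. That is the kind of multi-hop link flat retrieval can miss.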

Improved Knowledge Integration

In RAG, the generative model can sometimes produce fragmented or inconsistent outputs when the retrieved documents lack cohesion, which is a side effect of how chunking and embedding vectors work. GraphRAG addresses this by using graph databases that can model complex relationships. Graph databases store both the entities, represented by nodes, and the relationships connecting them, which makes it possible to traverse from node to node along those relationships. By understanding the connections between different pieces of information, GraphRAG can integrate knowledge from diverse sources and provide a more unified and accurate response.

For example, if a question involves multiple entities and their interactions (e.g., "How does the supply chain impact product availability during a pandemic?"), GraphRAG can navigate through the interconnected data points, understand their dependencies, and generate a comprehensive answer. Another good example is compliance information for related documents and references to concepts in compliance. Let’s assume you are opening a restaurant and want to know different regulations needed to open a kitchen. Regulations can span fire safety, hygiene, food storage, ingredient sourcing, insurance, and labour guidelines. GraphRAG can work in such a scenario to collect all the references, traversing the relationships between them, giving users a coherent answer spanning a collection of documents.

Efficiency and Scalability

Another key metric, especially for large, interconnected datasets, is efficiency. RAG requires scanning through multiple documents for relevant content, which can be resource-intensive, especially with vast datasets. GraphRAG’s graph-based structure can efficiently traverse the data by focusing on relevant nodes and relationships, reducing computational overhead. Using GraphRAG intelligently, developers can combine graph traversals of knowledge graphs with vector search to reduce computation and memory overheads. This is better, more intelligent indexing than traditional approaches offer.
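One way to picture the combination of vector search and graph traversal is the sketch below. It assumes a tiny hand-built graph based on the restaurant example: a cheap similarity search picks a seed node, then a bounded traversal collects only the nodes connected to it. `NODES`, `EDGES`, `overlap_score`, and the hop limit are all hypothetical, and word overlap stands in for a real vector search.

```python
from collections import deque

# Hypothetical node descriptions and relationships for the restaurant example.
NODES = {
    "kitchen": "commercial kitchen setup for a restaurant",
    "fire_safety": "fire safety rules for extinguishers and exits",
    "hygiene": "hygiene and sanitation rules for food handling",
    "food_storage": "food storage temperature requirements",
    "payroll": "payroll software for accounting teams",
}
EDGES = {
    "kitchen": ["fire_safety", "hygiene"],
    "hygiene": ["food_storage"],
    "fire_safety": [],
    "food_storage": [],
    "payroll": [],
}

def overlap_score(query, text):
    """Stand-in for vector similarity: number of shared words."""
    return len(set(query.lower().split()) & set(text.lower().split()))

def vector_search(query, k=1):
    """Pick the k nodes whose descriptions best match the query."""
    ranked = sorted(NODES, key=lambda n: overlap_score(query, NODES[n]), reverse=True)
    return ranked[:k]

def expand(seeds, max_hops=2):
    """Collect nodes reachable from the seeds within max_hops edges."""
    hops = {seed: 0 for seed in seeds}
    queue = deque(seeds)
    while queue:
        node = queue.popleft()
        if hops[node] == max_hops:
            continue  # do not walk further than the hop limit
        for neighbor in EDGES.get(node, []):
            if neighbor not in hops:
                hops[neighbor] = hops[node] + 1
                queue.append(neighbor)
    return hops

query = "regulations for opening a restaurant kitchen"
seeds = vector_search(query)
context_nodes = expand(seeds)
print(seeds)
print(sorted(context_nodes))
```

The traversal only touches the neighborhood of the seed, so unrelated nodes such as `payroll` are not visited, however large the rest of the graph is.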

Moreover, graphs can be scaled horizontally, allowing for the expansion of knowledge bases without significantly increasing retrieval times. This makes GraphRAG suitable for enterprise-level applications where scalability and performance are critical. Also, when an organization spans many different vertical domains, this helps focus the search. So, you have the advantage both in terms of scalability and performance.

GraphRAG Implementation

Now that we know the benefits of GraphRAG, let’s implement an approach using GraphRAG.

Setup

For this demonstration, we will use GPT-4o as the LLM in Azure AI Studio and text-embedding-3-small as the embedding model to generate embeddings on the platform. We will use the open source LanceDB to store the embeddings and retrieve them for GraphRAG. Many other models are available via the Azure AI model catalog, which offers a variety of LLMs, SLMs, and embedding models. Let’s now create the deployments for both of these models using Azure AI Studio.

Next, let’s open a session on WSL to create a virtual env for Python. We will be using the Python package for GraphRAG for this demo.

# Create a graphrag directory and change directory to try out this example
$ mkdir graphrag
$ cd graphrag/
# Install virtualenv package, create a virtual environment called venv_name 
# & change directory to it. We create a virtual environment so we can safely 
# install and experiment with package without changing the global Python 
# environment
$ sudo apt-get install python3-virtualenv
$ virtualenv -p python3 venv_name
$ cd venv_name/
# Activate the virtual environment 
$ source bin/activate
# Next, install the Python GraphRAG package in the virtual environment  
# created. This will download and install a number of packages and may 
# take a little time. Amongst other things, it will install the opensource 
# DataShaper data processing library that allows users to declaratively 
# express data pipelines, schemas, and related assets using well-defined 
# schemas 
$ pip install graphrag

For the purposes of this demo, we will use the text of the Mahabharata. The Mahabharata is an epic Indian classical text that is divided into 18 books (parvas) with a multitude of characters. It narrates the events that lead to the Kurukshetra war between two warring clans of cousins, the Kauravas and the Pandavas, and the aftermath of the war. There are more than 100 human characters in the text who interact with each other and are also related to each other in some way. We will use one of the translations of the epic from Project Gutenberg, which is in the public domain.

# Create the directory for input text and download the file using curl and 
# store it in the input directory. Though this is one document it consists of 
# many parts. The word count (634955) and line count (58868) in the 
# example below can be seen using  wc commandline utility.  
$ mkdir -p ./mahabharata/input
$ curl https://www.gutenberg.org/cache/epub/15474/pg15474.txt -o ./mahabharata/input/book.txt
$ wc ./mahabharata/input/book.txt
  58868  634955 3752942 ./mahabharata/input/book.txt
# Next, we will initialize the environment for GraphRAG using the command:
$ python -m graphrag.index --init --root ./mahabharata/

This will create a .env file and a settings.yaml file in the mahabharata directory. The .env file contains the environment variables required to run the GraphRAG pipeline. If you open it, you will see a single environment variable defined, GRAPHRAG_API_KEY=<API_KEY>. This is the API key for the OpenAI API or the Azure OpenAI Service endpoint; replace the placeholder with your own key. The API key and the other settings you need are shown on the deployment page in Azure AI Studio.

In the llm section of settings.yaml, configure the following settings:
llm:
  api_key: ${GRAPHRAG_API_KEY}
  type:  azure_openai_chat # or openai_chat
  model: gpt-4o
  model_supports_json: true # recommended if this is available for your model.
  api_base: https://<your_instance_details>.openai.azure.com
  api_version: 2024-08-01-preview # please replace with your version
  deployment_name: gpt-4o

In the embeddings section of settings.yaml, configure the following settings:
  llm:
    api_key: ${GRAPHRAG_API_KEY}
    type: azure_openai_embedding
    model: text-embedding-3-small
    api_base: https://<your_instance_details>.openai.azure.com
    api_version: 2024-08-01-preview # please replace with your version
    deployment_name: text-embedding-3-small
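
The `${GRAPHRAG_API_KEY}` placeholder in both sections is filled in from the environment rather than hard-coded in the file. The sketch below illustrates that general pattern using Python's standard `string.Template`. It is not the GraphRAG package's own loader, and the key value is a made-up placeholder.

```python
import os
from string import Template

# Hypothetical placeholder; in the demo the real value lives in the .env file.
os.environ.setdefault("GRAPHRAG_API_KEY", "example-key-not-real")

# A fragment shaped like the llm section shown above.
settings_fragment = "api_key: ${GRAPHRAG_API_KEY}\nmodel: gpt-4o\n"

# Replace each ${NAME} placeholder with the matching environment variable.
resolved = Template(settings_fragment).substitute(os.environ)

# Mask the secret before printing so the key never lands in a terminal log.
masked = resolved.replace(os.environ["GRAPHRAG_API_KEY"], "****")
print(masked)
```

Keeping the key in .env and only referencing it from settings.yaml means the settings file can be shared or committed without exposing the secret.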

Next, run the indexing process as a precursor to creating the embeddings. This will create a log to track the indexing process. Indexing chunks the text, extracts the entities, works out the relationships between them, generates the graph from those relationships, and finally, after several processing passes, creates the final documents to be stored in LanceDB for retrieval. If the process completes successfully, a message will appear which says, “All workflows completed successfully.” Note that there will be many deprecation warnings, which can be safely ignored for now.
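Chunking, the first stage of indexing, can be pictured with the sketch below. It assumes word-based chunks with a small overlap so that a sentence cut at a boundary still appears whole in one chunk. The `chunk_words` function and its sizes are hypothetical. The GraphRAG package does its own chunking according to settings.yaml, so treat this only as an illustration of the idea.

```python
def chunk_words(text, chunk_size=8, overlap=2):
    """Split text into chunks of chunk_size words that share `overlap` words."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    words = text.split()
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + chunk_size]))
        if start + chunk_size >= len(words):
            break  # the last chunk already reaches the end of the text
    return chunks

# Twenty placeholder words, w0 through w19, make the overlap easy to see.
sample_text = " ".join(f"w{i}" for i in range(20))
chunks = chunk_words(sample_text)
for index, chunk in enumerate(chunks):
    print(index, chunk)
```

Each chunk is then passed to the LLM for entity and relationship extraction, which is why indexing a large text takes a while and uses many model calls.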

$ python -m graphrag.index --root ./mahabharata/

Now that the embeddings have been created successfully, let's run a couple of queries to see if we can get answers about the characters and the relationships between them.

$ python -m graphrag.query --root ./mahabharata --method global "Who is Duryodhana and How is he related to Arjuna?"
creating llm client with {'api_key': 'REDACTED,len=32', 'type': "azure_openai_chat", 'model': 'gpt-4o', 'max_tokens': 4000, 
'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'https://graphragdemo-inst.openai.azure.com', 
'api_version': '2024-08-01-preview', 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 
'deployment_name': 'gpt-4o', 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 
'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}

SUCCESS: Global Search Response: 
### Duryodhana: A Central Figure in the Mahabharata
Duryodhana is a pivotal character in the Indian epic, the Mahabharata. He is the eldest son of Dhritarashtra and Gandhari, 
making him the leader of the Kauravas, a group of a hundred brothers [Data: Reports (408, 397, 400, 275, +more)]. 
Duryodhana is known for his deep-seated enmity towards the Pandavas, particularly Arjuna, and his significant role in the 
Kurukshetra War, where he stands as a central antagonist [Data: Reports (408, 397, 569, 216, +more)].
### Relationship with Arjuna
Duryodhana and Arjuna are first cousins. Duryodhana is the son of Dhritarashtra, while Arjuna is the son of Pandu. 
Dhritarashtra and Pandu are brothers, making Duryodhana and Arjuna part of the same Kuru dynasty [Data: Reports 
(255, 398, 285, 177, 202, +more)]. This familial connection places them in direct conflict over the throne of Hastinapura, 
leading to the epic battle of Kurukshetra [Data: Reports (399, 216, 406, 440, +more)].
### Rivalry and Conflict
The relationship between Duryodhana and Arjuna is marked by intense rivalry and conflict. Duryodhana's ambition to 
rule Hastinapura and his enmity towards the Pandavas drive much of the narrative in the Mahabharata. This enmity is 
particularly highlighted during the Kurukshetra War, where Duryodhana leads the Kauravas against Arjuna and the 
Pandavas [Data: Reports (408, 397, 273, 202, +more)]. Their rivalry is a central theme in the epic, culminating in 
numerous battles and deceitful plots, including the infamous game of dice that led to the Pandavas' exile [Data: Reports 
(398, 255, 400, 256, +more)].
### Conclusion
Duryodhana's character is defined by his leadership of the Kauravas and his antagonistic relationship with the Pandavas, 
especially Arjuna. Their familial ties and subsequent rivalry form the crux of the Mahabharata's narrative, leading to the 
monumental conflict of the Kurukshetra War [Data: Reports (408, 397, 569, 216, +more)].

Let’s try another query for another character called Karna.

$ python -m graphrag.query --root ./mahabharata --method global "Who is Karna and what are his main relationships?"
creating llm client with {'api_key': 'REDACTED,len=32', 'type': "azure_openai_chat", 'model': 'gpt-4o', 'max_tokens': 4000, 
'temperature': 0.0, 'top_p': 1.0, 'n': 1, 'request_timeout': 180.0, 'api_base': 'https://graphragdemo-inst.openai.azure.com', 
'api_version': '2024-08-01-preview', 'organization': None, 'proxy': None, 'cognitive_services_endpoint': None, 
'deployment_name': 'gpt-4o', 'model_supports_json': True, 'tokens_per_minute': 0, 'requests_per_minute': 0, 'max_retries': 10, 
'max_retry_wait': 10.0, 'sleep_on_rate_limit_recommendation': True, 'concurrent_requests': 25}
SUCCESS: Global Search Response:

### Karna: A Key Figure in the Mahabharata
Karna, also known as the Son of Radha, Vasusena, and Radheya, is a pivotal character in the Indian epic, the Mahabharata. 
He is renowned for his exceptional martial prowess, unwavering loyalty, and tragic life. Born to Kunti and the Sun God, Surya, 
Karna's divine heritage endowed him with extraordinary abilities, including natural armor and ear-rings that made him nearly 
invincible [Data: Reports (373, 198, 465, 502, 155, +more)].

### Key Relationships
#### **Duryodhana**
Karna's most significant relationship is with Duryodhana, the leader of the Kauravas. Duryodhana befriends Karna and installs 
him as the king of Anga, solidifying their bond. This relationship is marked by deep loyalty and mutual support, with Karna 
vowing to slay Arjuna and supporting Duryodhana in various schemes against the Pandavas [Data: Reports (390, 397, 373, 
198, 465, +more)]. Karna's loyalty to Duryodhana is a defining aspect of his character, influencing many of his actions and 
decisions throughout the epic [Data: Reports (447, 440, 391, 383, 302)].

#### **Kunti**
Karna's relationship with his mother, Kunti, is complex and filled with emotional tension. Kunti reveals to Karna that he is her 
son, born before her marriage to Pandu, which adds a layer of tragedy to his character. Despite this revelation, Karna chooses 
to remain loyal to Duryodhana and fight against his half-brothers, the Pandavas [Data: Reports (373, 198, 465, 502, 155, 
+more)].

#### **Arjuna**
Karna's rivalry with Arjuna, one of the Pandavas, is a central theme in the Mahabharata. Both warriors are considered equals 
in skill and valor, and their final confrontation in the Kurukshetra war is one of the epic's most significant events. Karna's 
enmity with Arjuna is fueled by his loyalty to Duryodhana and his desire to prove his worth [Data: Reports (373, 198, 465, 
502, 155, +more)].

#### **Surya**
Karna's divine father, Surya, plays a crucial role in his life, often providing guidance and warnings. For instance, Surya 
forewarns Karna about Indra's intentions to obtain his ear-rings and coat of mail, which are sources of his invincibility [Data: 
Reports (518, 547, 391, 358, 371)].

#### **Indra**
Karna's interactions with Indra, the king of the gods, are also notable. Indra, disguised as a Brahmin, tricks Karna into giving 
up his ear-rings and armor, which were his sources of invincibility. In return, Indra grants Karna a powerful weapon, the Sakti, 
which he can use only once [Data: Reports (302, 394)].

### Conclusion
Karna's life is marked by his unwavering loyalty to Duryodhana, his complex relationships with his mother Kunti and his 
half-brother Arjuna, and his divine heritage. These relationships shape his actions and decisions, making him one of the 
most compelling and tragic figures in the Mahabharata [Data: Reports (390, 397, 373, 198, 465, +more)].

GraphRAG is able to piece together the relevant bits from different parts of the text to give us the relationships between the different characters, with references (data reports or chunks). In some cases, it can do this over many different chunks of data across a large text. This is a huge improvement over the baseline performance of large language models and baseline vector RAG. A recent benchmark paper found that knowledge graphs can improve the accuracy of answers by roughly 3x (54.2% vs 16.7%). GraphRAG can also be used in applications to make them more scalable and accurate, especially for domain-specific applications.
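If you want to work with those references programmatically, a small sketch like the one below can pull the report IDs out of a response. It assumes the citations keep the `[Data: Reports (...)]` shape seen in the outputs above. The `cited_report_ids` helper is hypothetical and is not part of the GraphRAG package.

```python
import re

# Matches "[Data: Reports (408, 397, +more)]", even when split across lines.
CITATION = re.compile(r"\[Data:\s*Reports\s*\(([^)]*)\)\]")

def cited_report_ids(answer):
    """Return the sorted, de-duplicated report IDs cited in an answer."""
    ids = set()
    for group in CITATION.findall(answer):
        # "+more" and other non-numeric text is skipped.
        ids.update(int(number) for number in re.findall(r"\d+", group))
    return sorted(ids)

# Two citations copied from the responses above, one wrapped over two lines.
sample = (
    "Duryodhana and Arjuna are first cousins [Data: Reports\n"
    "(255, 398, 285, 177, 202, +more)]. Indra tricks Karna [Data: Reports (302, 394)]."
)
report_ids = cited_report_ids(sample)
print(report_ids)
```

Note that `+more` means the response does not list every supporting report, so the extracted IDs are a subset of what the answer drew on.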

Also, if you are working with many documents, such as in a data lake, or running this in production, I would suggest using Azure AI Search as the vector store. The GraphRAG accelerator,

More information about GraphRAG and Azure AI Studio is available in the resources below: