In the rapidly evolving landscape of Artificial Intelligence and Natural Language Processing for creating Intelligent Applications, the use of the Retrieval Augmented Generation (RAG) technique has emerged as a powerful solution to enhance the accuracy and relevance of the responses generated by language models.
In this article, we will explore the talk given during the Hack Together: RAG Hack event, where Glaucia Lemos, a Cloud Advocate at Microsoft, and Yohan Lasorsa, a Senior Cloud Advocate also at Microsoft, presented how LangChain.js is revolutionizing the development of RAG applications, facilitating the creation of intelligent applications that combine large language models (LLMs) with their own data sources.
Join the Hack Together: RAG Hack!
The Hack Together: RAG Hack is a global and free hackathon happening from September 3rd to 16th, 2024, focused on exploring and developing applications using the Retrieval Augmented Generation (RAG) technique.
This event will bring together developers, researchers, and AI enthusiasts from all over the world to create innovative and intelligent applications that combine large language models (LLMs) with their own data sources using RAG.
The event will feature 25 live streams, demonstrating how to create RAG applications using Azure AI in different programming languages, such as Python, Java, JavaScript/TypeScript, and C#. It will also include various Azure AI services such as AI Search, PostgreSQL, Azure SQL, and Azure CosmosDB. Participants will have the opportunity to learn about popular frameworks like LangChain and Semantic Kernel, as well as cutting-edge technologies like agents and vision models.
The best part is that if your application is chosen, you could win a USD 500.00 prize in one of the following categories:
- Best overall
- Best in JavaScript/TypeScript
- Best in Java
- Best in .NET
- Best in Python
- Best use of AI Studio
- Best use of AI Search
- Best use of PostgreSQL
- Best use of Cosmos DB
- Best use of Azure SQL
Want to know more about all the event details, how to participate, and how to submit your application? Visit the event website here and join in!
Live Session: Building RAG Applications with LangChain.js
On September 4th, 2024, a live session was held on the theme: Building RAG Applications with LangChain.js, where Glaucia Lemos and Yohan Lasorsa explained the importance of using LangChain.js for developing RAG applications with a remarkable demonstration of a Serverless AI Chat.
If you missed this live session, don't worry! You can watch the recording below:
Contoso Real Estate AI Chat Application
During the live session, Glaucia Lemos and Yohan Lasorsa presented an example application called Contoso Real Estate AI Chat, which is a chatbot application that allows users to ask questions about properties available for sale.
You can see the application in action in the gif below:
You can even access the application directly from the GitHub repository here!
Like it? Then fork the project directly from the repository and create your own version of the Contoso Real Estate AI Chat application! And don't forget to star the repository!
Well... let's dive into what was presented during the live session!
The AI Challenge and the Solution with RAG
During the first part of the live session, Glaucia Lemos explored the AI challenge and the solution with the use of RAG.
When working with LLMs, we face several challenges. Despite their impressive ability to generate natural language, these models sometimes produce inaccurate or even fabricated information, often because they were trained on outdated data or rely on unreliable sources.
That's where the RAG technique comes in, as it allows developers to combine large language models with their own data sources, improving the accuracy and relevance of the generated responses.
RAG, or Retrieval Augmented Generation, is a technique that combines two main components:
- Retriever: retrieves relevant information from a knowledge base under your control.
- Generator: a language model that generates the final answer based on the retrieved information.
How Does RAG Work?
The RAG process can be divided into several steps, but it mainly works as follows:
- The user asks a question.
- A retriever searches the knowledge base and returns the information relevant to that question.
- The question and the retrieved documents are then sent to the generator (the language model), which creates the final answer based on the provided information.
The gif below perfectly illustrates the entire RAG process:
One of the most important factors in this process is creating the knowledge base. But how is it implemented? Let's see next!
Creating the Knowledge Base
The knowledge base is a collection of documents containing relevant information to answer questions. These documents can be of different types, such as text, images, tables, etc.
All this boils down to a two-step process:
1. Building the Knowledge Base
- The documents are processed to extract the text.
- The text is divided into smaller chunks.
- An embedding model converts each chunk into a vector (an embedding).
- The embeddings are stored in a vector database.
2. Retrieval and Generation Process
- The user's question is transformed into an embedding.
- This embedding is used to search for relevant documents in the vector database.
- The retrieved documents are injected into the language model prompt.
- The model generates a response based on the question and relevant documents.
Although this process runs for every question, it produces more precise and relevant answers because the model is grounded in your specific data instead of relying solely on what it learned during training.
If you are confused about how this whole process works, the gif below illustrates it perfectly:
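To make these two steps more concrete, here is a minimal sketch using LangChain.js (the framework presented later in this article). The sample text, chunk sizes, and the in-memory vector store are illustrative assumptions, not the project's actual code:
import { RecursiveCharacterTextSplitter } from "@langchain/textsplitters";
import { MemoryVectorStore } from "langchain/vectorstores/memory";
import { OpenAIEmbeddings } from "@langchain/openai";

// 1. Extract the raw text from your documents (a plain string here, for simplicity)
const rawText = "Contoso Real Estate: illustrative placeholder content about listings...";

// 2. Split the text into smaller, overlapping chunks
const splitter = new RecursiveCharacterTextSplitter({ chunkSize: 1500, chunkOverlap: 100 });
const chunks = await splitter.createDocuments([rawText]);

// 3. Convert each chunk into an embedding and store it in a vector database
//    (an in-memory store is used here; the project itself uses Azure Cosmos DB)
const vectorStore = await MemoryVectorStore.fromDocuments(chunks, new OpenAIEmbeddings());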
After creating your knowledge base, it's time to use context retrieval and augmentation. This process can be summarized in three steps:
- First, we transform the user's query into a vector by sending it to the same embedding model used to create the knowledge base.
- Next, we use this vector to search for relevant documents in our vector database and select the top N results.
- Finally, we take the text from the most relevant documents, along with the user's initial question, and use it to build the prompt. This prompt is then sent to the LLM to obtain the requested answer.
Let's let the gif demonstrate how this entire process works:
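In code, these three steps could look roughly like the sketch below, reusing the vectorStore built in the previous snippet; the question and the prompt wording are illustrative assumptions:
import { ChatOpenAI } from "@langchain/openai";

const question = "Is parking included in the rent?";

// Steps 1 and 2: embed the question and retrieve the top 3 most similar chunks
// (similaritySearch embeds the query with the same embedding model as the store)
const topDocs = await vectorStore.similaritySearch(question, 3);

// Step 3: inject the retrieved text and the question into the prompt sent to the LLM
const context = topDocs.map((doc) => doc.pageContent).join("\n---\n");
const llm = new ChatOpenAI({ model: "gpt-4o-mini" });
const answer = await llm.invoke(
  `Answer using only the sources below.\n\nSOURCES:\n${context}\n\nQUESTION: ${question}`
);
console.log(answer.content);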
What are the Advantages of Using RAG?
The advantages of using RAG are numerous, but we can highlight a few, such as:
1. No Additional Training Required: you can use existing language models, saving significant time and resources.
2. Always Updated Data: the knowledge base can be easily updated without retraining the model, ensuring up-to-date information.
3. Solid Foundation: responses are based on specific and reliable sources, increasing accuracy and reliability.
4. Increased User Confidence: it is possible to show users the sources used to generate the answers, providing transparency and credibility.
Now that we understand how RAG works and its advantages, let's see how LangChain.js helps in developing RAG applications!
LangChain.js: Simplifying RAG Application Development
In the second part of the live session, Yohan Lasorsa presented LangChain.js, an open-source framework that simplifies the development of RAG applications, and showed how easy it is to implement a Serverless AI Chat with it.
LangChain.js is an open-source JavaScript library designed to simplify working with large language models (LLMs) and implementing advanced techniques like RAG. It provides high-level abstractions for all the components needed to build AI applications, facilitating the integration of models, vector databases, and complex agents.
LangChain.js allows the creation of highly customized and flexible artificial intelligence (AI) workflows, combining different processing steps into a logical sequence. Let's understand how this works through a practical example that generates a joke about a specific topic using a natural language model.
import { ChatOpenAI } from "@langchain/openai";
import { ChatPromptTemplate } from "@langchain/core/prompts";
import { JsonOutputParser } from "@langchain/core/output_parsers";

// Expected shape of the model's JSON answer
type Joke = { joke: string };
// gpt-4o-mini chat model used to generate the joke
const model = new ChatOpenAI({ model: "gpt-4o-mini" });
// Prompt template with a {topic} placeholder
const prompt = ChatPromptTemplate.fromTemplate(
  "Tell a joke about {topic}. Answer with valid JSON, containing one field: 'joke'"
);
// Parser that turns the model's JSON output into a typed object
const parser = new JsonOutputParser<Joke>();
// Chain the steps together: prompt -> model -> parser
const chain = prompt.pipe(model).pipe(parser);

await chain.invoke({ topic: "bears" });
The code above uses LangChain.js to create an AI workflow that generates a joke on a specific topic. First, it defines the output type as a Joke object. Then, it initializes the gpt-4o-mini language model and creates a prompt template instructing the model to return a joke in JSON format. A parser is configured to transform the response into a JavaScript object. Finally, the workflow is assembled by chaining the prompt, model, and parser, and is executed with the topic "bears".
You can learn more about LangChain.js and how to use it in your application by accessing the official documentation here.
Implementing the Serverless AI Chat with LangChain.js and RAG
Now, let's explore how a complete RAG system was implemented using LangChain.js, based on the example project presented during the live session. The project is a support chatbot for a fictitious real estate company called Contoso Real Estate.
Project Architecture
The project was developed using a serverless architecture, utilizing several Azure services:
The main technologies used were:
- Node.js/TypeScript: runtime and language for the backend.
- LangChain.js: for integration with language models.
- Azure Functions: hosting the backend API.
- Azure Static Web Apps: hosting the frontend.
- Azure Cosmos DB for MongoDB: for vector data storage.
- Azure Blob Storage: for storing original documents.
- Azure OpenAI: for language models and embeddings.
- Lit.dev: for creating web components.
The best part of this project is that all the Azure services used are in the free tier, meaning you can test and experiment without additional costs. Sensational, right?
Let's understand how the project works!
Document Processing
The first step in implementing RAG is processing the documents that will form our knowledge base. Let's see how this is done with LangChain.js, based on the code developed in the project:
(... some imports here...)
export async function postDocuments(request: HttpRequest, context: InvocationContext): Promise<HttpResponseInit> {
const storageUrl = process.env.AZURE_STORAGE_URL;
const containerName = process.env.AZURE_STORAGE_CONTAINER_NAME;
const azureOpenAiEndpoint = process.env.AZURE_OPENAI_API_ENDPOINT;
try {
// Get the uploaded file from the request
const parsedForm = await request.formData();
if (!parsedForm.has('file')) {
return badRequest('"file" field not found in form data.');
}
// Type mismatch between Node.js FormData and Azure Functions FormData
const file = parsedForm.get('file') as any as File;
const filename = file.name;
// Extract text from the PDF
const loader = new PDFLoader(file, {
splitPages: false,
});
const rawDocument = await loader.load();
rawDocument[0].metadata.source = filename;
// Split the text into smaller chunks
const splitter = new RecursiveCharacterTextSplitter({
chunkSize: 1500,
chunkOverlap: 100,
});
const documents = await splitter.splitDocuments(rawDocument);
(... more code ...)
This file, documents-post.ts, is responsible for processing the documents sent by the user. First, it extracts the text from the PDF file using the PDFLoader document loader. Then, it splits the text into smaller chunks with the RecursiveCharacterTextSplitter. Finally, the documents are stored in Azure Blob Storage for later use.
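The chunks also need to be embedded and indexed so they can be retrieved later. That part of the function is elided above; based on the classes the chat function below uses, it presumably looks something like this sketch (not the project's literal code), where credentials and azureADTokenProvider come from the project's helper functions:
// Hypothetical sketch of the elided indexing step: embed the chunks and add them
// to the vector store, using the same classes as the chat function shown below
const embeddings = new AzureOpenAIEmbeddings({ azureADTokenProvider });
const store = new AzureCosmosDBNoSQLVectorStore(embeddings, { credentials });
await store.addDocuments(documents);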
RAG Chat Implementation
The heart of the RAG system is in the chat function, where user questions are processed and answered based on the relevant documents retrieved. Let's see how this is implemented:
(... some imports here...)
const systemPrompt = `Assistant helps the Consto Real Estate company customers with questions and support requests. Be brief in your answers. Answer only plain text, DO NOT use Markdown.
Answer ONLY with information from the sources below. If there isn't enough information in the sources, say you don't know. Do not generate answers that don't use the sources. If asking a clarifying question to the user would help, ask the question.
If the user question is not in English, answer in the language used in the question.
Each source has the format "[filename]: information". ALWAYS reference the source filename for every part used in the answer. Use the format "[filename]" to reference a source, for example: [info1.txt]. List each source separately, for example: [info1.txt][info2.pdf].
Generate 3 very brief follow-up questions that the user would likely ask next.
Enclose the follow-up questions in double angle brackets. Example:
<<Am I allowed to invite friends for a party?>>
<<How can I ask for a refund?>>
<<What If I break something?>>
Do no repeat questions that have already been asked.
Make sure the last question ends with ">>".
SOURCES:
{context}`;
export async function postChat(request: HttpRequest, context: InvocationContext): Promise<HttpResponseInit> {
const azureOpenAiEndpoint = process.env.AZURE_OPENAI_API_ENDPOINT;
try {
const requestBody = (await request.json()) as AIChatCompletionRequest;
const { messages } = requestBody;
if (!messages || messages.length === 0 || !messages.at(-1)?.content) {
return badRequest('Invalid or missing messages in the request body');
}
let embeddings: Embeddings;
let model: BaseChatModel;
let store: VectorStore;
if (azureOpenAiEndpoint) {
const credentials = getCredentials();
const azureADTokenProvider = getAzureOpenAiTokenProvider();
// Initialize models and vector database
embeddings = new AzureOpenAIEmbeddings({ azureADTokenProvider });
model = new AzureChatOpenAI({
// Controls randomness. 0 = deterministic, 1 = maximum randomness
temperature: 0.7,
azureADTokenProvider,
});
store = new AzureCosmosDBNoSQLVectorStore(embeddings, { credentials });
} else {
// If no environment variables are set, it means we are running locally
context.log('No Azure OpenAI endpoint set, using Ollama models and local DB');
embeddings = new OllamaEmbeddings({ model: ollamaEmbeddingsModel });
model = new ChatOllama({
temperature: 0.7,
model: ollamaChatModel,
});
store = await FaissStore.load(faissStoreFolder, embeddings);
}
// Create the chain that combines the prompt with the documents
const combineDocsChain = await createStuffDocumentsChain({
llm: model,
prompt: ChatPromptTemplate.fromMessages([
['system', systemPrompt],
['human', '{input}'],
]),
documentPrompt: PromptTemplate.fromTemplate('[{source}]: {page_content}\n'),
});
// Create the chain to retrieve the documents from the database
const chain = await createRetrievalChain({
retriever: store.asRetriever(3),
combineDocsChain,
});
const lastUserMessage = messages.at(-1)!.content;
const responseStream = await chain.stream({
input: lastUserMessage,
});
const jsonStream = Readable.from(createJsonStream(responseStream));
return data(jsonStream, {
'Content-Type': 'application/x-ndjson',
'Transfer-Encoding': 'chunked',
});
(... more code ...)
In this file, chat-post.ts, the postChat function processes the messages sent by the user and returns answers based on the relevant documents retrieved. The code creates a processing chain that combines the system prompt with the retrieved documents, and a retrieval chain that fetches those documents from the vector database. Finally, the response is generated and streamed back to the user as newline-delimited JSON.
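If you are curious about what the chain produces before the project wraps it with createJsonStream, here is a minimal sketch of consuming the stream directly; it assumes, as is usual for createRetrievalChain, that the stream yields partial objects whose answer field carries incremental text:
// Hedged sketch: reading the retrieval chain's stream directly (names from the code above)
const stream = await chain.stream({ input: lastUserMessage });
for await (const chunk of stream) {
  // Each chunk is a partial result: the retrieved documents arrive under "context",
  // while the generated text arrives piece by piece under "answer"
  if (chunk.answer) {
    process.stdout.write(chunk.answer);
  }
}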
Local Development with Ollama
A notable feature of this project is the ability to run it locally without depending on Azure OpenAI. This is possible thanks to Ollama, a tool that allows running open-source language models locally.
The constants.ts file contains the settings needed to run the project locally with Ollama:
export const ollamaEmbeddingsModel = 'all-minilm:l6-v2';
export const ollamaChatModel = 'mistral:v0.2';
export const faissStoreFolder = '.faiss';
Here, we see that the project uses the all-minilm:l6-v2 model for embeddings and mistral:v0.2 for chat. This allows developers to experiment with the RAG system using open-source models, such as Mistral, directly on their local machines.
To use these models locally, developers need to install Ollama and download the specified models. This provides a flexible and accessible development experience, especially useful for early development and testing phases.
Note: Ollama is an open-source tool for running language models locally. To learn more about Ollama and how to use it, visit the official website here. Running these models locally is resource-intensive, so a machine with a supported GPU will be considerably faster.
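Assuming Ollama is already installed, downloading the two models referenced in constants.ts typically looks like this:
ollama pull all-minilm:l6-v2
ollama pull mistral:v0.2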
Want to Implement in Production? Here's How!
For production deployment, the project uses the Azure Developer CLI (azd), simplifying the provisioning and deployment process of the necessary resources on Azure. With just a few commands, you can deploy all the infrastructure and code:
azd auth login --use-device-code
azd env new
azd up
These commands authenticate you with Azure, create a new environment, provision the necessary resources, and deploy the application. Azure Static Web Apps is used to host the frontend, while Azure Functions hosts the backend API.
If you want to understand better how this was implemented, the project has a folder called infra that contains all the configuration files needed to deploy it on Azure. In this case, Bicep was used to provision the infrastructure.
Once you implement the application, you will see it in action, as shown in the gif below:
Conclusion
LangChain.js, as demonstrated in this RAG project, offers a robust and flexible platform for implementing advanced AI systems. By combining the power of large language models with the ability to retrieve relevant information from external sources, LangChain.js enables the creation of more intelligent and contextual AI applications.
If you liked the project and want to try it yourself, you can access the GitHub repository here and follow the instructions to run it locally or deploy it on Azure. The project is an excellent opportunity to learn more about RAG, LangChain.js, and how to integrate language models into your applications. Not to mention, you can run the project locally with Ollama and experiment with open-source models.
The journey to creating more intelligent and contextual AI applications starts here. RAG and LangChain.js open up a world of possibilities for developers, allowing the creation of AI systems that not only answer questions but do so with precision, relevance, and reliability, grounded in specific and up-to-date data.
Now it's your turn to explore the project and maybe even participate in the Hack Together: RAG Hack to create your own RAG application and compete for incredible prizes! Visit the event site here and join in!
See you next time! 😎