Customization is key!
One of the most impactful applications of generative AI for businesses is to create natural language interfaces that have been customized with domain- and use-case-specific data to provide better, more accurate responses. This means answering questions about specific domains such as banking, legal, and medical fields.
We often talk about two methods to achieve this:
- Retrieval Augmented Generation (RAG), which retrieves relevant documents at query time and passes them to the model as context.
- Fine-tuning (supervised fine-tuning, SFT), which further trains the model on domain-specific examples so the knowledge is baked into its weights.
While most organizations experimenting with RAG aim to extend an LLM's knowledge with their internal knowledge base, many do not achieve the expected results without significant optimization. Similarly, it can be challenging to curate a sufficiently large and high-quality data set for fine-tuning. Both approaches have limitations: fine-tuning confines the model to its trained data, making it susceptible to approximation and hallucination, while RAG grounds the model but retrieves documents based merely on their semantic proximity to the query -- which may not be relevant and can lead to poorly reasoned answers.
RAFT to the rescue!
Instead of choosing RAG or fine-tuning, we can combine them! Think of RAG as an open book exam: the model looks up relevant documents to generate answers. Fine-tuning is like a closed book exam: the model relies on pre-trained knowledge. Just like in exams, the best results come from studying and having notes handy.
Retrieval Augmented Fine-Tuning (RAFT) is a powerful technique to prep fine-tuning data for domain-specific open-book settings, like in-domain RAG. It’s a game-changer for language models, combining the best parts of RAG and fine-tuning. RAFT helps tailor models to specific domains by boosting their ability to understand and use domain-specific knowledge. It’s the sweet spot between RAG and domain-specific SFT.
How does it work?
RAFT has three steps:
- Generate domain-specific training data: for each example, a question, a set of documents (oracle plus distractors), and a Chain-of-Thought answer.
- Fine-tune a smaller student model on that data.
- Evaluate the fine-tuned model against the base model on a held-out test set.
The key to RAFT is the training data generation, where each data point includes a question (Q), a set of documents (Dk), and a Chain-of-Thought style answer (A).
The documents are categorized into Oracle Documents (Do), which contain the answer, and Distractor Documents (Di), which do not. Fine-tuning teaches the model to differentiate between these, resulting in a custom model that outperforms the original with RAG or fine-tuning alone.
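To make the data format concrete, here is a minimal sketch of what a single RAFT training example might look like. The field names, the "##Answer:" convention, and the banking content are illustrative assumptions, not the exact schema used in the workshop notebooks.

```python
# Illustrative RAFT training example: one question, oracle and distractor
# documents, and a Chain-of-Thought answer that quotes the oracle document.
raft_example = {
    "question": "What is the daily transfer limit for online banking?",
    "oracle_documents": [
        "Online transfers are limited to $10,000 per business day per account."
    ],
    "distractor_documents": [
        "Branch opening hours are 9am to 5pm, Monday through Friday.",
        "Credit card late fees are assessed after a 21-day grace period.",
    ],
    "cot_answer": (
        "The transfers policy states: 'Online transfers are limited to "
        "$10,000 per business day per account.' "
        "##Answer: The daily online transfer limit is $10,000 per account."
    ),
}
```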
We use GPT-4o to generate training data and fine-tune GPT-4o mini, creating a cost-effective, faster model tailored to your use case. This technique, called distillation, uses GPT-4o as the teacher model and 4o-mini as the student.
In the next section of this blog, we'll get hands-on. If you want to follow along on your own, or see reference code, check out https://aka.ms/aoai-raft-workshop. We'll create a domain-adapted model for a banking use case, capable of answering questions about a bank's online tooling and accounts.
Notebook 1 - Generating your RAFT training data
Start by gathering domain-specific documents; in our example, these are PDFs of bank documentation. Because the source PDF contains a number of tables and charts, we use Azure OpenAI GPT-4o to convert each page's content to Markdown, producing a single Markdown file for downstream processing. We then use GPT-4o (our teacher model) to generate synthetic question-document-answer triplets, including "golden" documents (highly relevant) and "distractors" (misleading), so the model learns to differentiate between relevant and irrelevant information. RAFT also uses a Chain-of-Thought (CoT) process: integrating CoT answers improves the model's ability to extract information and perform logical reasoning, helps prevent overfitting, and makes training more robust, which is particularly effective for tasks that require detailed, structured thinking.
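A minimal sketch of the page-to-Markdown step, assuming an Azure OpenAI resource with a GPT-4o deployment and PDF pages already rendered to PNG images. The environment variables, deployment name, file paths, and prompt wording are assumptions; the workshop repo contains the exact code.

```python
import base64
import os

from openai import AzureOpenAI  # pip install openai

# Assumed environment variables; adjust to your Azure OpenAI resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

def page_to_markdown(png_path: str, deployment: str = "gpt-4o") -> str:
    """Ask GPT-4o to transcribe one rendered PDF page (tables included) to Markdown."""
    with open(png_path, "rb") as f:
        image_b64 = base64.b64encode(f.read()).decode("utf-8")
    response = client.chat.completions.create(
        model=deployment,  # Azure deployment name, not the raw model id
        messages=[
            {
                "role": "user",
                "content": [
                    {"type": "text",
                     "text": "Convert this page to Markdown. Preserve tables and headings; do not summarize."},
                    {"type": "image_url",
                     "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
                ],
            }
        ],
        temperature=0,
    )
    return response.choices[0].message.content
```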
We then format this data for fine-tuning, splitting it into training, validation, and test sets. The validation data is used during training, and the test set measures performance at the end.
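A sketch of that formatting step, assuming the illustrative field names from the RAFT example above and the chat-completions JSONL format expected by Azure OpenAI fine-tuning. The file names and 80/10/10 split are illustrative choices, not the workshop's exact code.

```python
import json
import random

def to_chat_record(example: dict) -> dict:
    """Turn one RAFT example into a chat-format fine-tuning record."""
    # Mix oracle and distractor documents together as the retrieved context.
    docs = example["oracle_documents"] + example["distractor_documents"]
    context = "\n\n".join(f"<DOCUMENT>{d}</DOCUMENT>" for d in docs)
    return {
        "messages": [
            {"role": "system", "content": "Answer the question using only the provided documents."},
            {"role": "user", "content": f"{context}\n\nQuestion: {example['question']}"},
            {"role": "assistant", "content": example["cot_answer"]},
        ]
    }

def write_jsonl(records: list, path: str) -> None:
    with open(path, "w", encoding="utf-8") as f:
        for r in records:
            f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Load the generated triplets (path is illustrative), shuffle, and split 80/10/10.
with open("raft_examples.jsonl", encoding="utf-8") as f:
    examples = [json.loads(line) for line in f]

random.seed(42)
random.shuffle(examples)
n = len(examples)
train, valid, test = examples[: int(0.8 * n)], examples[int(0.8 * n): int(0.9 * n)], examples[int(0.9 * n):]
write_jsonl([to_chat_record(e) for e in train], "train.jsonl")
write_jsonl([to_chat_record(e) for e in valid], "validation.jsonl")
write_jsonl([to_chat_record(e) for e in test], "test.jsonl")
```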
Notebook 2 - Fine-tuning the student model
Now it's time to teach our student! After preparing the training and validation data, the next step is to upload the data to Azure OpenAI and create the fine-tuning job. This is surprisingly easy: in AI Studio, selecting your model, uploading your training and validation data, and setting your training parameters take just a few clicks. We select GPT-4o mini as our student model. In the lab we show how to do the same with the SDK: the UI is an easy way to experiment, while the SDK approach is the preferred way to productionize fine-tuning and fits into an LLMOps strategy for deploying to production.
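A minimal sketch of the SDK path, assuming the openai Python package pointed at an Azure OpenAI resource. File names, API version, and hyperparameters are illustrative; the workshop notebook has the exact code.

```python
import os

from openai import AzureOpenAI  # pip install openai

client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-06-01",
)

# Upload the training and validation files prepared in the previous step.
train_file = client.files.create(file=open("train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("validation.jsonl", "rb"), purpose="fine-tune")

# Kick off fine-tuning of the student model (GPT-4o mini).
job = client.fine_tuning.jobs.create(
    model="gpt-4o-mini",              # base model to fine-tune
    training_file=train_file.id,
    validation_file=valid_file.id,
    hyperparameters={"n_epochs": 3},  # illustrative; tune for your data
)
print("Fine-tuning job id:", job.id)
```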
Once the fine-tuning job is running, we can monitor its progress and, upon completion, analyze the fine-tuned model in Azure OpenAI Studio. Finally, we create a new deployment with the fine-tuned model, ready to be used for our specialized domain tasks.
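Continuing the sketch above (it reuses the same `client` and `job`), this is one way to poll the job and pick up the fine-tuned model name once training finishes; the deployment itself is then created in AI Studio or via the Azure management APIs.

```python
import time

# Poll until the job reaches a terminal state.
while True:
    job = client.fine_tuning.jobs.retrieve(job.id)
    print("status:", job.status)
    if job.status in ("succeeded", "failed", "cancelled"):
        break
    time.sleep(60)

# Name of the resulting fine-tuned model, used when creating the deployment.
print("fine-tuned model:", job.fine_tuned_model)
```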
Notebook 3 - Is our RAFT model really better than the base model? Let's check!
You can start off by reviewing the built-in metrics returned by AI Studio, showing loss and accuracy. We want to see accuracy increase while loss decreases.
However, we can do much more to measure the quality of our model. Remember the test dataset we set aside at the beginning? This is why we prepared it!
While there are many options for evaluation, including AI Studio evaluations, in our example we use the open-source library RAGAS, which evaluates RAG pipelines with metrics like answer relevancy, faithfulness, answer similarity, and answer correctness. These metrics rely on either an LLM as a judge or an embedding model to assess the quality and accuracy of the generated answers.
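A minimal sketch of a RAGAS run, assuming the ragas and datasets packages. The single evaluation row is illustrative; in practice the lists are filled from the held-out test split, with "answer" coming from the fine-tuned (or base) model being evaluated.

```python
from datasets import Dataset          # pip install datasets
from ragas import evaluate            # pip install ragas
from ragas.metrics import (
    answer_correctness,
    answer_relevancy,
    answer_similarity,
    faithfulness,
)

# Illustrative evaluation set with one row.
eval_data = Dataset.from_dict({
    "question": ["What is the daily transfer limit for online banking?"],
    "answer": ["The daily online transfer limit is $10,000 per account."],
    "contexts": [[
        "Online transfers are limited to $10,000 per business day per account."
    ]],
    "ground_truth": ["The daily online transfer limit is $10,000 per account."],
})

# RAGAS uses an LLM judge and an embedding model under the hood; by default it
# reads OpenAI credentials from the environment (Azure OpenAI can be plugged in
# via the llm= and embeddings= arguments of evaluate()).
scores = evaluate(
    eval_data,
    metrics=[answer_relevancy, faithfulness, answer_similarity, answer_correctness],
)
print(scores)
```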
We could likely improve these metrics further by adjusting the training parameters and/or generating additional training data.
Ready to get started yourself?