The Future of AI: LLM Distillation just got easier
Part 1 - Synthetic Data Gen with Llama 3.1 405B & RAFT
How Llama 405B and RAFT on Azure AI are changing the landscape of synthetic dataset generation and making model distillation much more approachable. (🚀🔥 GitHub recipe repo)
By Cedric Vidal, Principal AI Advocate, Microsoft
Part of the Future of AI 🚀 series initiated by Marco Casalaina with his Exploring Multi-Agent AI Systems blog post.
Gorilla scientist distilling, generated using Azure OpenAI DALL-E 3
The AI landscape is continuously evolving, with one of the latest advancements being the ability to generate high-quality synthetic datasets using large language models (LLMs). Llama 3.1 405B Instruct, released on Hugging Face on July 23rd and simultaneously on Azure AI, combined with the RAFT (Retrieval Augmented Fine Tuning) framework, is set to revolutionize how companies create synthetic data. This powerful combination simplifies a previously tedious, time-consuming, and costly process, enabling businesses to generate self-instruct Q&A and Chain of Thought datasets directly from their domain-specific documents.
This blog post is the first in a five-part series where we explore how you can leverage a new GitHub repository that makes it easier than ever to use Llama 405B for distilling smaller, more efficient models. In this first part, we’ll dive into the benefits of synthetic dataset generation with Llama 405B and RAFT, why it’s a game-changer, and how you can get started.
Synthetic dataset generation has become increasingly important in AI development. Acquiring high-quality, task-specific data often requires extensive manual effort, significant costs, and can be hampered by privacy concerns. This is especially challenging in industries where data is sensitive or hard to obtain. Synthetic data generation provides a practical solution by creating data that mirrors real-world scenarios, tailored to specific tasks without the need for traditional data collection.
Key benefits include:
Cost efficiency: generating data with an LLM is far cheaper than manually collecting and labeling comparable volumes of task-specific data.
Privacy: synthetic records mirror real-world scenarios without exposing sensitive source data.
Speed and scale: datasets can be produced in hours and sized to the task, rather than limited to whatever data happens to be available.
Task specificity: the data is tailored to your domain and use case from the start.
Llama 3.1 405B Instruct is a state-of-the-art language model with 405 billion parameters, designed to excel at following instructions for complex text generation tasks. As one of the most powerful models available, it offers unparalleled capabilities in generating high-quality synthetic datasets.
RAFT, detailed in a recent paper from UC Berkeley’s Gorilla project and summarized in a previous blog post, significantly enhances the synthetic generation capabilities of Llama 3.1 405B. The Self-Instruct method, outlined in the Self-Instruct paper, first advanced synthetic dataset creation by automating the generation of questions and instructions that were traditionally crafted by humans. RAFT extends this methodology by generating synthetic questions directly from domain-specific documents, such as PDFs, and by incorporating Chain of Thought reasoning. As a result, the student model learns not only the domain itself but also the reasoning needed to answer questions about it.
RAFT is specifically designed to optimize Retrieval-Augmented Generation (RAG) workflows: the student model is trained to identify and use relevant documents while discarding irrelevant ones.
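To make that concrete, here is a minimal sketch of the distractor-mixing idea at the heart of RAFT. This is not the recipe’s actual code; the function name and parameters are illustrative. Each training example pairs the chunk that actually answers the question (the oracle) with randomly sampled irrelevant chunks, so the student learns to pick out the right evidence:

import random

def build_raft_context(oracle_chunk: str, all_chunks: list[str],
                       num_distractors: int = 3, p_oracle: float = 0.8) -> list[str]:
    """Illustrative sketch: mix the oracle chunk with random distractor chunks."""
    candidates = [c for c in all_chunks if c != oracle_chunk]
    distractors = random.sample(candidates, num_distractors)
    # In a fraction of examples the oracle is deliberately withheld, which
    # teaches the model to cope when retrieval misses the right document.
    context = distractors + [oracle_chunk] if random.random() < p_oracle else distractors
    random.shuffle(context)
    return context

Withholding the oracle in some examples is what pushes the student beyond memorization: it must learn to reason over whatever context it is given rather than assume the answer is always present.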
The RAFT self-instruct synthetic dataset generation steps
Here’s how RAFT makes the process more efficient:
1. Split the domain documents, such as PDFs, into chunks.
2. For each chunk, have the teacher model (here, Llama 3.1 405B Instruct) generate synthetic questions that the chunk can answer.
3. Have the teacher generate a Chain of Thought answer for each question, quoting the relevant passage from the source chunk.
4. Package each question and answer with its oracle chunk plus randomly sampled distractor chunks, as sketched above, so the student learns to use relevant context and ignore the rest.
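The output of these steps is a dataset of records shaped roughly like the one below. This is a hand-written illustration, not output from the recipe; the exact field names depend on the export format you choose later:

# A hand-written illustration of one synthetic training record:
oracle_chunk = "Surfboards range from shortboards to longboards..."  # the chunk that answers the question
distractor = "The history of Hawaii begins with..."                  # an irrelevant chunk

record = {
    "question": "What range of surfboard types exists?",
    "context": [distractor, oracle_chunk],  # oracle mixed in with distractors
    "cot_answer": (
        "The passage on board design is relevant here: "
        "##begin_quote## Surfboards range from shortboards to longboards "
        "##end_quote## <ANSWER>: They range from shortboards to longboards."
    ),
}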
The primary goal of the raft-distillation-recipe is to simplify and automate the end-to-end process of distilling large language models. The project automates the provisioning of the infrastructure required to run RAFT and Llama and provides a set of notebooks for each step of the distillation process. These notebooks are designed to be as hands-free as possible, ensuring that even complex tasks—such as synthetic dataset generation, model fine-tuning, and deployment—can be accomplished with minimal manual intervention while documenting the code necessary to run each step.
Whether you’re new to AI or an experienced practitioner, the focus is on delivering a seamless, user-friendly experience that allows you to concentrate on the outcomes rather than the process itself.
In this blog post, we will focus on the first step: self-instruct dataset generation.
Existing documentation on distillation often assumes you will set up and manage your own GPUs, which takes real expertise, or it leaves out critical steps such as creating the dataset.
Azure AI Serverless offers an enterprise-ready solution, making it easy and cost-effective to run fine-tuning jobs at scale with a curated selection of teacher models, including Llama 3.1 405B Instruct.
RAFT simplifies synthetic dataset creation, generating data from documents that most companies already have on hand.
The RAFT Distillation Recipe combines Azure AI, Llama 3.1 405B, and RAFT to automate the distillation process end-to-end while explaining each step.
To begin generating synthetic datasets using Llama 3.1 405B and RAFT, we will use the raft-distillation-recipe GitHub repository, which streamlines this process. Here’s how you can get started:
The quickest way is to open the repository in GitHub Codespaces, which gives you a ready-to-use environment with the prerequisites installed.
Alternatively, you can clone the repository locally and set up a Python virtual environment if you prefer more control over your environment setup.
Note: You will need an Azure Pay-As-You-Go account. If you don’t have one, head over to the Azure signup page. Signing up requires a credit card, but for the sample datasets included in the repository, costs are capped and estimated on the project page.
Once your environment is ready, log in with the Azure Developer CLI:
azd auth login --use-device-code
Then, deploy the necessary resources with:
azd up
By default, the generation notebook uses the sample surfing domain and loads a PDF named Surfing - Wikipedia.pdf.
You can use other sample domains and documents or upload your own.
ds_name: str = "surfing"
doc_path: str = "sample_data/surfing/Surfing - Wikipedia.pdf"
To load a directory of PDFs, set the doc_path parameter to the directory.
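For example, to generate from your own folder of documents, you could point the notebook parameters at a directory instead of a single file (the dataset name and path below are hypothetical):

ds_name: str = "contracts"
doc_path: str = "sample_data/contracts"  # a directory: the PDFs inside are loaded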
The script stops generating Q&A samples when it reaches the --qa-threshold, which keeps dataset size, cost, and generation time under control. After the raft.py script finishes, the notebook exports the generated dataset to a format suitable for the Azure AI Fine-tuning as a Service API using RAFT’s format.py script. This script supports multiple output formats, depending on the intended use of the dataset.
Next, the dataset is split into three sets: training, validation, and evaluation (test). This split is fundamental to training the model effectively, evaluating it properly, and making sure it generalizes to new data. Here’s a brief overview of each split:
Training set: the examples the student model actually learns from during fine-tuning.
Validation set: held out during training to monitor progress and guard against overfitting.
Evaluation (test) set: never seen during training; used at the end to measure how well the fine-tuned model generalizes.
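As a rough sketch of what such a three-way split looks like with the Hugging Face datasets library (the file path below is hypothetical; the recipe’s notebooks handle this step for you):

from datasets import load_dataset

# Load the JSONL file exported by format.py (path is illustrative).
ds = load_dataset("json", data_files="dataset/raft.jsonl", split="train")

# Carve off 20% as held-out data, then split that in half, giving
# an 80/10/10 train/validation/evaluation split overall.
splits = ds.train_test_split(test_size=0.2, seed=42)
heldout = splits["test"].train_test_split(test_size=0.5, seed=42)

train_ds = splits["train"]    # used for fine-tuning
valid_ds = heldout["train"]   # monitors training progress
eval_ds = heldout["test"]     # final generalization check

Fixing the seed makes the split reproducible, so later runs of the notebook evaluate against the same held-out examples.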
This blog post covered the first step in leveraging Llama 405B and RAFT for streamlined distillation: self-instruct synthetic dataset generation using RAFT. In the next installment, we’ll explore how to fine-tune a Llama 3.1 8B model using these synthetically generated datasets with the Azure AI Serverless Python SDK.
Check out the 🚀🔥 GitHub recipe repo: it contains the full code to run RAFT synthetic dataset generation on Azure AI and automates provisioning the infrastructure.
Stay tuned as our next blog post in this series will be out in a couple of weeks, continuing our exploration of cutting-edge methodologies that are making AI model development faster, more accessible, and more impactful for businesses everywhere.