When we compare Pre-Training and Supervised Fine-Tuning (SFT), the goals, the datasets used, and the number of GPUs required all differ. Viewed from the essence of deep learning training, however, the difference comes down to this:
Pre-training randomly initializes the model parameters, builds the model, and trains it on a large amount of unlabeled data to learn general features of the corpus. Fine-tuning loads the parameters of the pre-trained model, retains the general features learned during pre-training, and trains the model on a small amount of high-quality labeled data to improve its capability and performance on specific tasks.
The parameters mentioned above include weights, biases, word embeddings, positional encodings, attention-mechanism parameters, and so on.
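As a quick illustration, these parameter groups can be listed with the Hugging Face transformers library; the snippet below is a minimal sketch that only assumes transformers is installed:

from transformers import GPT2Config, GPT2LMHeadModel

# Build a randomly initialized GPT-2 and inspect its parameter groups
model = GPT2LMHeadModel(GPT2Config())

for name, param in model.named_parameters():
    # e.g. transformer.wte.weight (word embeddings), transformer.wpe.weight (positional encodings),
    # transformer.h.0.attn.c_attn.weight (attention projections), *.bias (biases)
    print(name, tuple(param.shape))

total = sum(p.numel() for p in model.parameters())
print(f"Total parameters: {total:,}")  # ~124M for the default GPT-2 config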
Pre-Training aims to learn the fundamental structure and semantic features of a language from large-scale unsupervised datasets (such as text corpora). It typically involves randomly initializing the model parameters, training on massive amounts of unlabeled text with a self-supervised objective (next-token prediction for GPT-style models), and saving the resulting weights as a general-purpose base model.
Fine-Tuning aims to optimize the model's performance on a specific task using a task-specific dataset. It typically involves loading the pre-trained weights, continuing training on a small amount of high-quality labeled data, and evaluating the result on the target task.
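This essential difference shows up directly in how the model object is created. The sketch below contrasts the two, using GPT-2 as an example:

from transformers import GPT2Config, GPT2LMHeadModel

# Pre-training: parameters are randomly initialized from a config
pretrain_model = GPT2LMHeadModel(GPT2Config())

# Fine-tuning: parameters are loaded from an existing pre-trained checkpoint
finetune_model = GPT2LMHeadModel.from_pretrained("gpt2")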
Taking GPT-2 as an Example
https://huggingface.co/docs/transformers/v4.44.0/en/model_doc/gpt2#transformers.GPT2LMHeadModel
To pre-train GPT-2, we need to use the classes GPT2LMHeadModel and GPT2Config:
import torch
from datasets import load_dataset
from transformers import (
    GPT2Config,
    GPT2LMHeadModel,
    GPT2Tokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

# Build a randomly initialized GPT-2 from the default config (no pre-trained weights are loaded)
config = GPT2Config()
model = GPT2LMHeadModel(config)

# GPT-2 has no pad token, so reuse the end-of-sequence token for padding
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
dataset = load_dataset("wikitext", "wikitext-2-raw-v1")
def tokenize_function(examples):
    return tokenizer(examples["text"], truncation=True, padding="max_length", max_length=512, return_special_tokens_mask=True)
tokenized_datasets = dataset.map(tokenize_function, batched=True, remove_columns=["text"])
print("Train dataset size:", len(tokenized_datasets["train"]))
print("Validation dataset size:", len(tokenized_datasets["validation"]))
# mlm=False -> causal language modeling: the collator copies input_ids into labels
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
training_args = TrainingArguments(
output_dir="./results",
overwrite_output_dir=True,
num_train_epochs=5,
per_device_train_batch_size=64,
save_steps=10_000,
save_total_limit=2,
remove_unused_columns=False,
report_to=[],
learning_rate=5e-4
)
trainer = Trainer(
model=model,
args=training_args,
data_collator=data_collator,
train_dataset=tokenized_datasets["train"],
eval_dataset=tokenized_datasets["validation"]
)
# Trainer handles device placement itself, so this explicit move is optional
if torch.cuda.is_available():
    model.cuda()
trainer.train()
Since the model is small, pre-training can be done with a single H100 GPU.
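Because eval_dataset was passed to the Trainer, the validation loss can also be checked once training finishes; a minimal sketch:

# Evaluate on the validation split that was passed to the Trainer above
eval_metrics = trainer.evaluate()
print("Validation loss:", eval_metrics["eval_loss"])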
The training results are as follows:
| Step | Training Loss |
|------|---------------|
| 500  | 6.505700 |
| 1000 | 5.657100 |
| 1500 | 5.269900 |
| 2000 | 4.972000 |
| 2500 | 4.725000 |
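For intuition, the cross-entropy training loss can be converted into perplexity; a minimal sketch:

import math

# Perplexity is the exponential of the cross-entropy loss
final_loss = 4.725
print(f"Perplexity: {math.exp(final_loss):.1f}")  # roughly 113 after 2500 steps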
The trained model can be used for inference validation.
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the weights from the locally saved pre-training checkpoint
model = GPT2LMHeadModel.from_pretrained("./results/checkpoint-2870")
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
model.eval()
input_text = "Once upon a time"
inputs = tokenizer(input_text, return_tensors="pt", padding=True).to(device)
with torch.no_grad():
    outputs = model.generate(
        inputs.input_ids,
        attention_mask=inputs.attention_mask,
        max_length=100,
        num_return_sequences=1,
        no_repeat_ngram_size=2,
        early_stopping=True,
        temperature=0.7,
        top_p=0.9,
        do_sample=True,
        pad_token_id=tokenizer.eos_token_id,
    )
generated_text = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(generated_text)
The inference result is as follows:
Once upon a time of the earthquake, the local community of local and a new new government, a military government who had begun with the " the most prominent ".
When we fine-tune a model, it usually refers to Supervised Fine-Tuning (SFT). SFT can be divided into Parameter-Efficient Fine-Tuning (PEFT) and Full Fine-Tuning. Among PEFT implementations, methods like LoRA, QLoRA, and GA-LoRA are quite popular.
Let's first look at how to load a model for Full Fine-Tuning. We use AutoModelForCausalLM.from_pretrained, which loads the parameters of the pre-trained model.
from transformers import AutoModelForCausalLM

# model_name and attn_implementation are defined earlier in the full script
model = AutoModelForCausalLM.from_pretrained(
    model_name, attn_implementation=attn_implementation, device_map={"": 0}
)

# Trade compute for memory: recompute activations during the backward pass
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': True})
For the complete full fine-tuning code, refer to the repository:
https://github.com/davidsajare/david-share/tree/master/Deep-Learning/SmolLM-Full-Fine-Tuning
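As a rough sketch of what such a full fine-tuning run can look like with the Hugging Face Trainer (not the exact code from the repository; the output directory and hyperparameters are placeholders, and tokenizer plus a tokenized, labeled dataset called tokenized_sft_dataset are assumed to have been prepared beforehand):

from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling

training_args = TrainingArguments(
    output_dir="./sft-results",       # hypothetical output directory
    num_train_epochs=3,
    per_device_train_batch_size=8,
    gradient_checkpointing=True,      # matches the memory-saving setting above
    learning_rate=2e-5,               # fine-tuning typically uses a smaller LR than pre-training
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized_sft_dataset,   # assumed to be prepared beforehand
    data_collator=DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False),
)
trainer.train()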
Next, let's look at the differences in code implementation among Full Fine-Tuning, LoRA, and QLoRA. In terms of loading the model and the trainable parameters: Full Fine-Tuning loads the full-precision pre-trained weights and updates all of them; LoRA freezes the base weights and trains only small low-rank adapter matrices injected into selected modules; QLoRA does the same as LoRA, but loads the frozen base model in 4-bit quantized form to further reduce memory.
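The sketch below illustrates that difference using the peft and bitsandbytes libraries (model_name is the same placeholder as above, and the LoRA hyperparameters are illustrative, not values from the repository):

import torch
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
from peft import LoraConfig, get_peft_model

lora_config = LoraConfig(
    r=16,
    lora_alpha=32,
    target_modules=["q_proj", "k_proj", "v_proj", "o_proj"],
    task_type="CAUSAL_LM",
)

# LoRA: full-precision (or bf16) base model, frozen, plus trainable low-rank adapters
base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
lora_model = get_peft_model(base, lora_config)

# QLoRA: the frozen base model is loaded in 4-bit; the adapters stay in higher precision
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base_4bit = AutoModelForCausalLM.from_pretrained(model_name, quantization_config=bnb_config)
qlora_model = get_peft_model(base_4bit, lora_config)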
It should be noted that when performing LoRA or QLoRA fine-tuning, we can specify the modules to be trained, such as:
from unsloth import FastLanguageModel

# model is assumed to have been loaded beforehand with FastLanguageModel.from_pretrained
model = FastLanguageModel.get_peft_model(
model,
r = 128,
target_modules = ["q_proj", "k_proj", "v_proj", "o_proj",
"gate_proj", "up_proj", "down_proj",
"embed_tokens", "lm_head",], # Add for continual pretraining
lora_alpha = 32,
lora_dropout = 0, # Supports any, but = 0 is optimized
bias = "none", # Supports any, but = "none" is optimized
use_gradient_checkpointing = "unsloth", # True or "unsloth" for very long context
random_state = 3407,
use_rslora = True,
)
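Because get_peft_model returns a PEFT-wrapped model, a quick check shows how few parameters actually remain trainable (assuming the standard PEFT interface):

# Report the count and share of trainable (adapter) parameters versus the frozen base weights
model.print_trainable_parameters()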
For detailed information, refer to:
https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Continue-Pre-training
There is no doubt that pre-training large language models requires multi-node, multi-GPU setups, which means distributed training. At the lowest level, distributed pre-training is implemented on top of the NCCL communication library; higher-level frameworks such as Megatron, DeepSpeed, and Hugging Face's accelerate library (which currently supports FSDP) build on it. These tools implement data parallelism (DP), pipeline parallelism (PP), and tensor parallelism (TP).
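At that lowest level, the setup looks roughly like the following DistributedDataParallel sketch, which initializes the NCCL backend through torch.distributed (the GPT-2 model is reused here purely for illustration):

import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP
from transformers import GPT2Config, GPT2LMHeadModel

# torchrun sets RANK, LOCAL_RANK and WORLD_SIZE for every process it spawns
dist.init_process_group(backend="nccl")
local_rank = int(os.environ["LOCAL_RANK"])
torch.cuda.set_device(local_rank)

# Each process holds a full model replica; NCCL all-reduces the gradients (data parallelism)
model = GPT2LMHeadModel(GPT2Config()).cuda(local_rank)
model = DDP(model, device_ids=[local_rank])

Each node then launches the script with torchrun, for example torchrun --nproc_per_node=8 pretrain.py.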
For detailed information on pre-training using Megatron combined with DeepSpeed, refer to:
For an example of SFT implementation using DeepSpeed, refer to:
Currently, some open-source fine-tuning tools like Axolotl can also directly interface with DeepSpeed. For an example, refer to:
https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Fine-tuning-with-Axolotl
When using FSDP with accelerate, other parallel strategies can be combined to achieve more efficient training.
Combining these strategies usually requires significant customization and adjustments to the model and training scripts. accelerate provides some tools to simplify these processes, but specific implementations may require combining other PyTorch libraries (such as torch.distributed) and custom code.
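A minimal sketch of the accelerate training loop is shown below; the FSDP sharding itself is set up via accelerate config (or a config file) rather than in the script, and model, optimizer, and dataloader are assumed to be defined already:

from accelerate import Accelerator

accelerator = Accelerator()

# accelerate wraps the model according to the configured strategy (e.g. FSDP)
model, optimizer, dataloader = accelerator.prepare(model, optimizer, dataloader)

model.train()
for batch in dataloader:
    outputs = model(**batch)
    loss = outputs.loss
    accelerator.backward(loss)   # handles mixed precision and sharded backward
    optimizer.step()
    optimizer.zero_grad()

The script is then started with accelerate launch, which reads the FSDP settings from the saved configuration.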
For an example of FSDP with accelerate, refer to:
https://github.com/davidsajare/david-share/tree/master/Deep-Learning/Llama-3.1-70B-FSDP-Fine-Tuning