Please refer to my repo for more AI resources, and welcome to star it:
https://github.com/xinyuwei-david/david-share.git
This article is from one of my repos:
https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/DPO-DeepSpeed-FSDP
Direct Preference Optimization (DPO) is currently one of the most popular methods for aligning large language models (LLMs) with human preferences. With parameter-efficient fine-tuning techniques such as LoRA and QLoRA, we can perform DPO training on larger models.
Distributed training technology
To train a larger model with 2 H100 GPUs, we can use PyTorch's Fully Sharded Data Parallel (FSDP) technology, combined with parameter-efficient fine-tuning methods like LoRA and QLoRA.
FSDP is similar to DeepSpeed's ZeRO technology and can be driven through Accelerate, a library from Hugging Face (HF). FSDP is a distributed training technique that shards the model's parameters, optimizer states, and gradients across multiple devices (such as GPUs). During the forward and backward passes, only the required parameter shards are loaded into memory and released after computation, which greatly reduces memory requirements. DeepSpeed can of course also be used when training even larger models, but it requires a large amount of memory to store full-precision copies of the model parameters.
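As a minimal sketch of the idea (assuming two GPUs and a torchrun launch; the tiny Linear layer stands in for a real LLM and this is not the training script used in this article), wrapping a module in PyTorch's FSDP class shards its parameters, gradients, and optimizer state across the ranks:
# Minimal FSDP sketch; launch with: torchrun --nproc_per_node=2 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")             # one process per GPU
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for a real LLM
fsdp_model = FSDP(model)                    # parameters, gradients, and optimizer state
                                            # are sharded across the ranks
out = fsdp_model(torch.randn(8, 4096, device="cuda"))
out.sum().backward()                        # gradients are reduced and re-sharded
dist.destroy_process_group()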
In my repo, I used both DeepSpeed ZeRO-3 and FSDP, and the training results were the same. I will show the scripts and configuration files for both training methods. In the following DeepSpeed and Accelerate FSDP training, I use an adapter from HF.
DeepSpeed Training
DeepSpeed configuration file, deepspeed_config.json:
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 104857600,
    "stage3_prefetch_bucket_size": 104857600,
    "stage3_param_persistence_threshold": 1048576
  },
  "bf16": {
    "enabled": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "steps_per_print": 10,
  "wall_clock_breakdown": false
}
Training code, deepspeed.py:
import torch
import os
import multiprocessing
from datasets import load_dataset
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed,
)
from trl import DPOTrainer, DPOConfig
set_seed(1234)
model_name = "Qwen/Qwen2.5-72B-Instruct"
sft_adapter = "./adpter/" # 一个使用 SFT 微调的 LoRA 适配器
compute_dtype = torch.bfloat16
# 如果在使用 FlashAttention 时遇到问题,可以改用 'sdpa'
attn_implementation = 'flash_attention_2'
# 如果内存不足,可以修改以下三个训练参数
bs = 1 # 每个设备的批大小(训练和验证)
gas = 16 # 梯度累积步骤数
mseqlen = 512 # 最大序列长度
lr = 1e-5 # 学习率
QLoRA = True # 是否量化基模型
output_dir = "./DPO"
# 初始化 Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = "<|image_pad|>"
tokenizer.pad_token_id = 151655
tokenizer.padding_side = 'right' # 对于 Qwen2.5,左右 padding 都可以
# 加载并处理数据集
ds = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train").train_test_split(test_size=0.01)
ds_train = ds['train']
ds_test = ds['test']
def process(row):
    # The first message is the prompt
    prompt_messages = tokenizer.apply_chat_template([row["chosen"][0]], tokenize=False)
    chosen_messages = tokenizer.apply_chat_template(row["chosen"][1:], tokenize=False) + tokenizer.eos_token
    rejected_messages = tokenizer.apply_chat_template(row["rejected"][1:], tokenize=False) + tokenizer.eos_token
    row["prompt"] = prompt_messages
    row["chosen"] = chosen_messages
    row["rejected"] = rejected_messages
    return row

ds_train = ds_train.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
ds_test = ds_test.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
if QLoRA:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=compute_dtype,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        torch_dtype=compute_dtype,
        attn_implementation=attn_implementation,
    )
    # Freeze the base model's parameters
    for name, param in model.named_parameters():
        param.requires_grad = False

    # Make the input embeddings require gradients
    def make_inputs_require_grad(module, input, output):
        output.requires_grad_(True)
    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
else:
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=compute_dtype,
        attn_implementation=attn_implementation,
    )
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': True})
# Load the LoRA adapter
model = PeftModel.from_pretrained(
    model,
    sft_adapter,
    is_trainable=True,
    adapter_name="DPO",
)
model.load_adapter(sft_adapter, adapter_name="reference")
# Move the model to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
training_arguments = DPOConfig(
    output_dir=output_dir,
    eval_strategy="steps",
    do_eval=True,
    optim="adamw_torch",
    per_device_train_batch_size=bs,
    gradient_accumulation_steps=gas,
    per_device_eval_batch_size=bs,
    log_level="debug",
    save_strategy="steps",
    save_steps=5,
    logging_steps=2,
    learning_rate=lr,
    bf16=True,
    beta=0.1,
    eval_steps=2,
    max_steps=10,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_length=mseqlen,
    max_prompt_length=512,
    dataset_num_proc=multiprocessing.cpu_count(),
    model_adapter_name="DPO",
    ref_adapter_name="reference",
    deepspeed="deepspeed_config.json",  # point to the DeepSpeed configuration file
)
trainer = DPOTrainer(
    model=model,
    args=training_arguments,
    train_dataset=ds_train,
    eval_dataset=ds_test,
    tokenizer=tokenizer,
)
# Start training
trainer.train()
# Save the model
trainer.save_model(output_dir)
Launch training:
(dpo) root@h1002gpu:~# deepspeed deepspeed.py
Training result analysis
In DPO training, the model is provided with a set of conversations, each containing the same "prompt" or "question", along with corresponding "chosen" and "rejected" replies. The model needs to learn to distinguish between these replies and prefer generating high-quality "chosen" responses.
Training data and results
The training data includes the following fields (a representative record is sketched after this list):
- Source: Airoboros
- Chosen Reply: Contains multiple rounds of dialogue
- Rejected Reply: Contains multiple rounds of dialogue
- Prompt: A descriptive text
- Question: The same text as the prompt
- Sometimes in the data, the "prompt" and "question" may be identical, which can serve as the starting point for the conversation in certain training settings.
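For reference, a record in this dataset looks roughly like the following sketch (the field values here are invented for illustration and are not copied from the actual dataset):
example = {
    "source": "airoboros",
    "prompt": "Please explain the phase changes of water.",
    "question": "Please explain the phase changes of water.",  # same text as the prompt
    "chosen": [                                                 # multi-turn conversation
        {"role": "user", "content": "Please explain the phase changes of water."},
        {"role": "assistant", "content": "Water exists as solid, liquid, and gas..."},
    ],
    "rejected": [
        {"role": "user", "content": "Please explain the phase changes of water."},
        {"role": "assistant", "content": "Water is a very common substance..."},
    ],
}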
The training results are as follows:
Next, I will use the training data to walk through the DPO training process and results.
DPO training process and results explanation
Core Objective of DPO
- Objective: Directly optimize the model parameters to reflect human preferences without the need for a separate reward model. DPO uses human preference data to adjust the model directly, making its generated responses more aligned with human expectations.
- Introducing the Reference Model: To prevent the model from deviating from its original language capabilities during optimization, DPO introduces a reference model (usually a copy of the initial model with fixed parameters) as a regularization term.
- Role of the Reference Model:
- Maintaining Language Capabilities: The reference model provides a baseline of the model before adjustment. By comparing with the reference model, the trained model can learn human preferences while avoiding overfitting and drifting away from its original abilities, ensuring that its language understanding and generation capabilities remain intact. This helps prevent the model from prioritizing human preferences at the expense of core language skills like grammatical correctness and factual accuracy.
Training Data
- Prompt: User input, for example: "Please explain the phase changes of water."
- Chosen Reply: Responses evaluated by humans as high-quality, fully answering the question, and meeting expectations. These replies are typically accurate, complete, relevant, and fluent, satisfying user needs.
- Rejected Reply: Responses evaluated by humans as lower quality, not adequately answering the question, or not meeting expectations. These replies may lack accuracy, contain incomplete information, be irrelevant to the prompt, or be less fluent.
- Human Evaluation Criteria:
- Accuracy: Is the content of the reply correct and free from misleading information?
- Completeness: Does the reply fully answer the user's question?
- Relevance: Is the reply closely related to the user's prompt?
- Fluency: Is the reply grammatically correct and clearly expressed?
- Example:
- Prompt: "Please explain the phase changes of water."
- Chosen Reply: "Water exists in three states: solid, liquid, and gas. Through changes in temperature and pressure, water can transition between these states. For example, ice (solid) melts into water (liquid) when heated, and water vaporizes into steam (gas) upon further heating."
- Evaluation Reasoning: The reply accurately explains the process of water's phase changes, provides complete information, is highly relevant to the prompt, and is fluent.
- Rejected Reply: "Water is a very common substance found everywhere in daily life."
- Evaluation Reasoning: The reply does not address the question about the phase changes of water; the information is incomplete, and the relevance is insufficient.
Training Process
Step 1: Calculate Log Probabilities
For the trained model (parameters θ):
- Log probability of the chosen reply:
- log_p_model(chosen | prompt) = log( π_θ(chosen | prompt) )
- Log probability of the rejected reply:
- log_p_model(rejected | prompt) = log( π_θ(rejected | prompt) )
For the reference model (fixed parameters):
- Log probability of the chosen reply:
- log_p_ref(chosen | prompt) = log( π_ref(chosen | prompt) )
- Log probability of the rejected reply:
- log_p_ref(rejected | prompt) = log( π_ref(rejected | prompt) )
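As a rough sketch (not TRL's internal implementation), the sequence log probability of a reply given a prompt can be computed with a Hugging Face causal LM as follows; model and tokenizer are assumed to be already loaded and on the same device, and tokenization boundary effects are ignored for simplicity:
import torch

def sequence_logprob(model, tokenizer, prompt, reply):
    """Sum of log probabilities of the reply tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + reply, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                     # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..n-1
    targets = full_ids[:, 1:]                               # the tokens actually in the text
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    reply_start = prompt_ids.shape[1] - 1                   # first predicted reply token
    return token_logps[:, reply_start:].sum()               # log π(reply | prompt)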
Step 2: Calculate Preference Differences
- Preference difference for the chosen reply:
- Δ_chosen = log_p_model(chosen | prompt) - log_p_ref(chosen | prompt)
- Preference difference for the rejected reply:
- Δ_rejected = log_p_model(rejected | prompt) - log_p_ref(rejected | prompt)
Step 3: Construct the Loss Function
- Loss function (the standard DPO objective; see the code sketch after this list), where β is the hyperparameter controlling sensitivity to preference differences (it is set to 0.1 via beta in the DPOConfig above):
- loss = -log( exp(β · Δ_chosen) / [ exp(β · Δ_chosen) + exp(β · Δ_rejected) ] ) = -log( sigmoid( β · (Δ_chosen - Δ_rejected) ) )
- Objective: Minimize the loss function loss to make the model more inclined to generate the "chosen" reply over the "rejected" reply.
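A minimal PyTorch sketch of this per-example loss (assuming the standard DPO formulation above; this is not the exact code inside TRL's DPOTrainer):
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    delta_chosen = logp_chosen - ref_logp_chosen          # Δ_chosen
    delta_rejected = logp_rejected - ref_logp_rejected    # Δ_rejected
    # -log sigmoid( β · (Δ_chosen - Δ_rejected) )
    # == -log( exp(β·Δ_chosen) / [ exp(β·Δ_chosen) + exp(β·Δ_rejected) ] )
    return -F.logsigmoid(beta * (delta_chosen - delta_rejected))

# Reproduces the worked example below (β = 1): prints ≈ 0.127
print(dpo_loss(torch.tensor(-5.0), torch.tensor(-7.0),
               torch.tensor(-6.0), torch.tensor(-6.0), beta=1.0))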
Training Process Example
Assumed Values (for Illustration):
- log_p_model(chosen | prompt) = -5
- log_p_model(rejected | prompt) = -7
- log_p_ref(chosen | prompt) = -6
- log_p_ref(rejected | prompt) = -6
- Calculate Preference Differences:
- Δ_chosen = (-5) - (-6) = 1
- Δ_rejected = (-7) - (-6) = -1
- Calculate the Loss Function (assuming β = 1):
- Calculate the numerator:
- exp(β · Δ_chosen) = exp(1) ≈ 2.718
- Calculate the denominator:
- exp(β · Δ_chosen) + exp(β · Δ_rejected) = exp(1) + exp(-1) ≈ 2.718 + 0.368 ≈ 3.086
- Calculate the loss:
- loss = -log( 2.718 / 3.086 ) = -log(0.881) ≈ 0.127
Result Analysis:
- The loss value is relatively small (approximately 0.127), indicating that the model tends to prefer the "chosen" reply.
- Optimize Model Parameters:
- Through backpropagation, minimize the loss function loss to further enhance the model's preference for the "chosen" reply.
Explanation of Training Log Fields
Based on the DPO training process, here's a detailed explanation of each field in the training log and their importance in evaluating training effectiveness:
Example Training Log:
{ 'loss': 0.6931, 'grad_norm': 0.05, 'learning_rate': 1e-5, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.5, 'rewards/margins': 0.0, 'logps/chosen': -15.0, 'logps/rejected': -15.0, 'logits/chosen': [0.2, 0.3, ...], 'logits/rejected': [0.2, 0.3, ...], 'epoch': 0 }
1. loss
- Meaning:
- Represents the loss value at the current training step, measuring the model's ability to distinguish between the "chosen" and "rejected" replies.
- Importance:
- Core Indicator: The primary metric to evaluate training effectiveness.
- Training Goal: Minimizing loss indicates successful learning toward preferring the "chosen" reply.
- Indicator Trend:
- Initial Stage: loss typically starts around 0.6931 (that is, -log 0.5), indicating no preference between the two replies.
- During Training: Should decrease over time, showing the model is learning to prefer the "chosen" reply.
2. grad_norm
- Meaning:
- Represents the gradient norm of the model parameters, indicating the overall magnitude of parameter updates.
- Importance:
- Learning Intensity: Reflects how much the model is adjusting its parameters.
- Training Stability: Helps detect issues like vanishing or exploding gradients.
- Indicator Trend:
- Normal Range: Should be within a reasonable range (e.g., 0.01 to 1).
- Abnormal Situations:
- Too Small: Near zero may indicate lack of learning.
- Too Large: May require gradient clipping to prevent instability.
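For instance, gradient clipping is controlled through the max_grad_norm argument of the Hugging Face training arguments (a hedged sketch; the value below is illustrative, not the one used in this article's run):
from trl import DPOConfig

training_arguments = DPOConfig(
    output_dir="./DPO",
    max_grad_norm=1.0,  # clip the global gradient norm to 1.0
)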
3. learning_rate
- Meaning:
- Controls the step size in parameter updates during training.
- Importance:
- Convergence Speed and Stability: Affects how quickly and smoothly the model learns.
- Adjustment Strategy:
- Slow Loss Decrease: Consider increasing the learning rate.
- Unstable Training: If loss fluctuates, decreasing the learning rate might help.
4. rewards/chosen and rewards/rejected
- Meaning:
- rewards/chosen: Reward value for the "chosen" reply (Δ_chosen).
- rewards/rejected: Reward value for the "rejected" reply (Δ_rejected).
- Importance:
- Model Preference: Indicates the model's inclination towards each reply.
- Indicator Trend:
- Initial Stage: Both may be around 0.0 (no preference).
- During Training:
- rewards/chosen should increase.
- rewards/rejected should decrease.
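Assuming the TRL convention that these reward values are β times the corresponding log-probability differences (a hedged reading, not quoted from the TRL source), the reward fields relate to the quantities defined earlier as follows, using the values from the worked example above:
beta = 0.1
logp_model_chosen, logp_ref_chosen = -5.0, -6.0
logp_model_rejected, logp_ref_rejected = -7.0, -6.0

rewards_chosen = beta * (logp_model_chosen - logp_ref_chosen)        # β · Δ_chosen   = 0.1
rewards_rejected = beta * (logp_model_rejected - logp_ref_rejected)  # β · Δ_rejected = -0.1
rewards_margin = rewards_chosen - rewards_rejected                   # 0.2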
5. rewards/accuracies
- Meaning:
- The proportion of times the model correctly prefers the "chosen" reply.
- Importance:
- Performance Measure: Directly evaluates preference learning.
- Indicator Trend:
- Initial Stage: Around 0.5 (random guess).
- During Training: Should approach 1.0, indicating improved preference accuracy.
6. rewards/margins
- Meaning:
- The difference between rewards/chosen and rewards/rejected.
- Importance:
- Discrimination Ability: Larger margins indicate better distinction between replies.
- Indicator Trend:
- Should increase during training.
7. logps/chosen and logps/rejected
- Meaning:
- Total log probabilities of generating the "chosen" and "rejected" replies.
- Importance:
- Probability Basis: Used in calculating preference differences and rewards.
- Indicator Trend:
- Increasing logps/chosen indicates higher probability for the "chosen" reply.
- Stable or decreasing logps/rejected shows reduced preference for the "rejected" reply.
8. logits/chosen and logits/rejected
- Meaning:
- Raw output scores from the final layer before applying softmax, for both replies.
- Importance:
- Probability Calculation: Used to compute probabilities for each token, affecting log probabilities.
- Indicator Trend:
- Ensure Valid Values: Avoid nan or inf values.
- Monitor Changes: Changes in logits reflect learning progress.
9. epoch
- Meaning:
- Indicates the current training epoch or iteration over the training dataset.
- Importance:
- Training Progress: Helps track how far along the training is.
- Indicator Trend:
- As epoch increases, expect improvements in other metrics.
Summary
- Adjust Training Strategies Based on Indicators:
- Slow Loss Decrease: Increase learning rate or check data quality.
- Gradient Issues: If grad_norm is abnormal, inspect gradient computations or adjust optimizer settings.
- Low Preference Accuracy: Enhance data quality or quantity.
- Small Reward Margins: Adjust the temperature parameter β to influence sensitivity.
- Emphasize the Importance of the Reference Model:
- Maintaining Language Capabilities: Ensures the model doesn't overfit human preferences at the cost of language understanding and generation skills.
- Balancing Objectives: Optimizes for human preference while retaining overall model performance.
- Continuous Monitoring and Adjustment:
- Regular Evaluation: Use a validation set to assess performance and prevent overfitting.
- Dynamic Adjustment: Modify training strategies based on log indicators to optimize the model.
By understanding DPO's core concepts, its training process, and how to interpret the key training metrics, you can effectively train a model that aligns with human preferences while maintaining strong language capabilities.