Please refer to my repo for more AI resources, and welcome to star it:
https://github.com/xinyuwei-david/david-share.git
This article is from one of my repos:
https://github.com/xinyuwei-david/david-share/tree/master/Deep-Learning/DPO-DeepSpeed-FSDP
Direct Preference Optimization (DPO) is currently one of the most popular methods for aligning large language models (LLMs) with human preferences. With parameter-efficient fine-tuning techniques such as LoRA and QLoRA, we can perform DPO training on larger models.
Distributed training technology
To train a larger model with 2 H100 GPUs, we can use PyTorch's Fully Sharded Data Parallel (FSDP) technology, combined with parameter-efficient fine-tuning methods like LoRA and QLoRA.
FSDP is similar to DeepSpeed's ZeRO technology and can be driven through Accelerate, a library from Hugging Face (HF). FSDP is a distributed training technique that shards the model's parameters, optimizer states, and gradients across multiple devices (such as GPUs). During the forward and backward passes, only the required parameter shards are loaded into memory and released after computation, which greatly reduces memory requirements. DeepSpeed can of course also be used when training even larger models, but it requires a large amount of memory to store full-precision copies of the model parameters.
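As a minimal sketch of the idea (assuming two GPUs and a torchrun launch; the tiny Linear layer stands in for a real LLM and this is not the training script used in this article), wrapping a module in PyTorch's FSDP class shards its parameters, gradients, and optimizer state across the ranks:
# Minimal FSDP sketch; launch with: torchrun --nproc_per_node=2 fsdp_sketch.py
import torch
import torch.distributed as dist
from torch.distributed.fsdp import FullyShardedDataParallel as FSDP

dist.init_process_group("nccl")             # one process per GPU
torch.cuda.set_device(dist.get_rank())

model = torch.nn.Linear(4096, 4096).cuda()  # stand-in for a real LLM
fsdp_model = FSDP(model)                    # parameters, gradients, and optimizer state
                                            # are sharded across the ranks
out = fsdp_model(torch.randn(8, 4096, device="cuda"))
out.sum().backward()                        # gradients are reduced and re-sharded
dist.destroy_process_group()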
In my repo, I used both DeepSpeed ZeRO-3 and FSDP, and the training results were the same. I will show the scripts and configuration files for both training methods. In the following DeepSpeed and Accelerate FSDP training, I use an adapter from HF.
DeepSpeed Training
DeepSpeed configuration file, deepspeed_config.json:
{
  "zero_optimization": {
    "stage": 3,
    "overlap_comm": true,
    "contiguous_gradients": true,
    "reduce_bucket_size": 104857600,
    "stage3_prefetch_bucket_size": 104857600,
    "stage3_param_persistence_threshold": 1048576
  },
  "bf16": {
    "enabled": true
  },
  "train_micro_batch_size_per_gpu": 1,
  "gradient_accumulation_steps": 16,
  "steps_per_print": 10,
  "wall_clock_breakdown": false
}
Training code, deepspeed.py:
import torch
import os
import multiprocessing
from datasets import load_dataset
from peft import PeftModel
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    BitsAndBytesConfig,
    set_seed,
)
from trl import DPOTrainer, DPOConfig
set_seed(1234)
model_name = "Qwen/Qwen2.5-72B-Instruct"
sft_adapter = "./adpter/" # 一个使用 SFT 微调的 LoRA 适配器
compute_dtype = torch.bfloat16
# 如果在使用 FlashAttention 时遇到问题,可以改用 'sdpa'
attn_implementation = 'flash_attention_2'
# 如果内存不足,可以修改以下三个训练参数
bs = 1 # 每个设备的批大小(训练和验证)
gas = 16 # 梯度累积步骤数
mseqlen = 512 # 最大序列长度
lr = 1e-5 # 学习率
QLoRA = True # 是否量化基模型
output_dir = "./DPO"
# 初始化 Tokenizer
tokenizer = AutoTokenizer.from_pretrained(model_name)
tokenizer.pad_token = "<|image_pad|>"
tokenizer.pad_token_id = 151655
tokenizer.padding_side = 'right' # 对于 Qwen2.5,左右 padding 都可以
# 加载并处理数据集
ds = load_dataset("mlabonne/orpo-dpo-mix-40k", split="train").train_test_split(test_size=0.01)
ds_train = ds['train']
ds_test = ds['test']
def process(row):
    # The first message is the prompt
    prompt_messages = tokenizer.apply_chat_template([row["chosen"][0]], tokenize=False)
    chosen_messages = tokenizer.apply_chat_template(row["chosen"][1:], tokenize=False) + tokenizer.eos_token
    rejected_messages = tokenizer.apply_chat_template(row["rejected"][1:], tokenize=False) + tokenizer.eos_token
    row["prompt"] = prompt_messages
    row["chosen"] = chosen_messages
    row["rejected"] = rejected_messages
    return row

ds_train = ds_train.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
ds_test = ds_test.map(
    process,
    num_proc=multiprocessing.cpu_count(),
    load_from_cache_file=False,
)
if QLoRA:
    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_compute_dtype=compute_dtype,
        bnb_4bit_use_double_quant=True,
        bnb_4bit_quant_storage=compute_dtype,
    )
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        quantization_config=bnb_config,
        torch_dtype=compute_dtype,
        attn_implementation=attn_implementation,
    )
    # Freeze the base model's parameters
    for name, param in model.named_parameters():
        param.requires_grad = False

    # Make the input embeddings require gradients
    def make_inputs_require_grad(module, input, output):
        output.requires_grad_(True)
    model.get_input_embeddings().register_forward_hook(make_inputs_require_grad)
else:
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        torch_dtype=compute_dtype,
        attn_implementation=attn_implementation,
    )
model.gradient_checkpointing_enable(gradient_checkpointing_kwargs={'use_reentrant': True})
# Load the LoRA adapter
model = PeftModel.from_pretrained(
    model,
    sft_adapter,
    is_trainable=True,
    adapter_name="DPO",
)
model.load_adapter(sft_adapter, adapter_name="reference")
# Move the model to the device
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)
training_arguments = DPOConfig(
    output_dir=output_dir,
    eval_strategy="steps",
    do_eval=True,
    optim="adamw_torch",
    per_device_train_batch_size=bs,
    gradient_accumulation_steps=gas,
    per_device_eval_batch_size=bs,
    log_level="debug",
    save_strategy="steps",
    save_steps=5,
    logging_steps=2,
    learning_rate=lr,
    bf16=True,
    beta=0.1,
    eval_steps=2,
    max_steps=10,
    warmup_ratio=0.1,
    lr_scheduler_type="linear",
    max_length=mseqlen,
    max_prompt_length=512,
    dataset_num_proc=multiprocessing.cpu_count(),
    model_adapter_name="DPO",
    ref_adapter_name="reference",
    deepspeed="deepspeed_config.json",  # point to the DeepSpeed configuration file
)
trainer = DPOTrainer(
    model=model,
    args=training_arguments,
    train_dataset=ds_train,
    eval_dataset=ds_test,
    tokenizer=tokenizer,
)
# Start training
trainer.train()
# Save the model
trainer.save_model(output_dir)
Launch training:
(dpo) root@h1002gpu:~# deepspeed deepspeed.py
Training result analysis
In DPO training, the model is provided with a set of conversations, each containing the same "prompt" or "question", along with corresponding "chosen" and "rejected" replies. The model needs to learn to distinguish between these replies and prefer generating high-quality "chosen" responses.
Training data and results
The training data includes the following fields (a representative record is sketched after this list):
- Source: Airoboros
- Chosen Reply: Contains multiple rounds of dialogue
- Rejected Reply: Contains multiple rounds of dialogue
- Prompt: A descriptive text
- Question: The same text as the prompt
- Sometimes in the data, the "prompt" and "question" may be identical, which can serve as the starting point for the conversation in certain training settings.
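For reference, a record in this dataset looks roughly like the following sketch (the field values here are invented for illustration and are not copied from the actual dataset):
example = {
    "source": "airoboros",
    "prompt": "Please explain the phase changes of water.",
    "question": "Please explain the phase changes of water.",  # same text as the prompt
    "chosen": [                                                 # multi-turn conversation
        {"role": "user", "content": "Please explain the phase changes of water."},
        {"role": "assistant", "content": "Water exists as solid, liquid, and gas..."},
    ],
    "rejected": [
        {"role": "user", "content": "Please explain the phase changes of water."},
        {"role": "assistant", "content": "Water is a very common substance..."},
    ],
}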
The training results are as follows:
Next, I will use the training data to walk through the DPO training process and results.
DPO training process and results explanation
Core Objective of DPO
- Objective: Directly optimize the model parameters to reflect human preferences without the need for a separate reward model. DPO uses human preference data to adjust the model directly, making its generated responses more aligned with human expectations.
- Introducing the Reference Model: To prevent the model from deviating from its original language capabilities during optimization, DPO introduces a reference model (usually a copy of the initial model with fixed parameters) as a regularization term.
- Role of the Reference Model:
- Maintaining Language Capabilities: The reference model provides a baseline of the model before adjustment. By comparing with the reference model, the trained model can learn human preferences while avoiding overfitting and drifting away from its original abilities, ensuring that its language understanding and generation capabilities remain intact. This helps prevent the model from prioritizing human preferences at the expense of core language skills like grammatical correctness and factual accuracy.
Training Data
- Prompt: User input, for example: "Please explain the phase changes of water."
- Chosen Reply: Responses evaluated by humans as high-quality, fully answering the question, and meeting expectations. These replies are typically accurate, complete, relevant, and fluent, satisfying user needs.
- Rejected Reply: Responses evaluated by humans as lower quality, not adequately answering the question, or not meeting expectations. These replies may lack accuracy, contain incomplete information, be irrelevant to the prompt, or be less fluent.
- Human Evaluation Criteria:
- Accuracy: Is the content of the reply correct and free from misleading information?
- Completeness: Does the reply fully answer the user's question?
- Relevance: Is the reply closely related to the user's prompt?
- Fluency: Is the reply grammatically correct and clearly expressed?
- Example:
- Prompt: "Please explain the phase changes of water."
- Chosen Reply: "Water exists in three states: solid, liquid, and gas. Through changes in temperature and pressure, water can transition between these states. For example, ice (solid) melts into water (liquid) when heated, and water vaporizes into steam (gas) upon further heating."
- Evaluation Reasoning: The reply accurately explains the process of water's phase changes, provides complete information, is highly relevant to the prompt, and is fluent.
- Rejected Reply: "Water is a very common substance found everywhere in daily life."
- Evaluation Reasoning: The reply does not address the question about the phase changes of water; the information is incomplete, and the relevance is insufficient.
Training Process
Step 1: Calculate Log Probabilities
For the trained model (parameters θ):
- Log probability of the chosen reply:
- log_p_model(chosen | prompt) = log( π_θ(chosen | prompt) )
- Log probability of the rejected reply:
- log_p_model(rejected | prompt) = log( π_θ(rejected | prompt) )
For the reference model (fixed parameters):
- Log probability of the chosen reply:
- log_p_ref(chosen | prompt) = log( π_ref(chosen | prompt) )
- Log probability of the rejected reply:
- log_p_ref(rejected | prompt) = log( π_ref(rejected | prompt) )
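As a rough sketch (not TRL's internal implementation), the sequence log probability of a reply given a prompt can be computed with a Hugging Face causal LM as follows; model and tokenizer are assumed to be already loaded and on the same device, and tokenization boundary effects are ignored for simplicity:
import torch

def sequence_logprob(model, tokenizer, prompt, reply):
    """Sum of log probabilities of the reply tokens, conditioned on the prompt."""
    prompt_ids = tokenizer(prompt, return_tensors="pt").input_ids
    full_ids = tokenizer(prompt + reply, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits                     # [1, seq_len, vocab]
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)   # predictions for tokens 1..n-1
    targets = full_ids[:, 1:]                               # the tokens actually in the text
    token_logps = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)
    reply_start = prompt_ids.shape[1] - 1                   # first predicted reply token
    return token_logps[:, reply_start:].sum()               # log π(reply | prompt)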
Step 2: Calculate Preference Differences
- Preference difference for the chosen reply:
- Δ_chosen = log_p_model(chosen | prompt) - log_p_ref(chosen | prompt)
- Preference difference for the rejected reply:
- Δ_rejected = log_p_model(rejected | prompt) - log_p_ref(rejected | prompt)
Step 3: Construct the Loss Function
- Loss function (the standard DPO objective; see the code sketch after this list), where β is the hyperparameter controlling sensitivity to preference differences (it is set to 0.1 via beta in the DPOConfig above):
- loss = -log( exp(β · Δ_chosen) / [ exp(β · Δ_chosen) + exp(β · Δ_rejected) ] ) = -log( sigmoid( β · (Δ_chosen - Δ_rejected) ) )
- Objective: Minimize the loss function loss to make the model more inclined to generate the "chosen" reply over the "rejected" reply.
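A minimal PyTorch sketch of this per-example loss (assuming the standard DPO formulation above; this is not the exact code inside TRL's DPOTrainer):
import torch
import torch.nn.functional as F

def dpo_loss(logp_chosen, logp_rejected, ref_logp_chosen, ref_logp_rejected, beta=0.1):
    delta_chosen = logp_chosen - ref_logp_chosen          # Δ_chosen
    delta_rejected = logp_rejected - ref_logp_rejected    # Δ_rejected
    # -log sigmoid( β · (Δ_chosen - Δ_rejected) )
    # == -log( exp(β·Δ_chosen) / [ exp(β·Δ_chosen) + exp(β·Δ_rejected) ] )
    return -F.logsigmoid(beta * (delta_chosen - delta_rejected))

# Reproduces the worked example below (β = 1): prints ≈ 0.127
print(dpo_loss(torch.tensor(-5.0), torch.tensor(-7.0),
               torch.tensor(-6.0), torch.tensor(-6.0), beta=1.0))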
Training Process Example
Assumed Values (for Illustration):
- log_p_model(chosen | prompt) = -5
- log_p_model(rejected | prompt) = -7
- log_p_ref(chosen | prompt) = -6
- log_p_ref(rejected | prompt) = -6
- Calculate Preference Differences:
- Δ_chosen = (-5) - (-6) = 1
- Δ_rejected = (-7) - (-6) = -1
- Calculate the Loss Function (assuming β = 1):
- Calculate the numerator:
- exp(β · Δ_chosen) = exp(1) ≈ 2.718
- Calculate the denominator:
- exp(β · Δ_chosen) + exp(β · Δ_rejected) = exp(1) + exp(-1) ≈ 2.718 + 0.368 ≈ 3.086
- Calculate the loss:
- loss = -log( 2.718 / 3.086 ) = -log(0.881) ≈ 0.127
Result Analysis:
- The loss value is relatively small (approximately 0.127), indicating that the model tends to prefer the "chosen" reply.
- Optimize Model Parameters:
- Through backpropagation, minimize the loss function loss to further enhance the model's preference for the "chosen" reply.
Explanation of Training Log Fields
Based on the DPO training process, here's a detailed explanation of each field in the training log and their importance in evaluating training effectiveness:
Example Training Log:
{ 'loss': 0.6931, 'grad_norm': 0.05, 'learning_rate': 1e-5, 'rewards/chosen': 0.0, 'rewards/rejected': 0.0, 'rewards/accuracies': 0.5, 'rewards/margins': 0.0, 'logps/chosen': -15.0, 'logps/rejected': -15.0, 'logits/chosen': [0.2, 0.3, ...], 'logits/rejected': [0.2, 0.3, ...], 'epoch': 0 }
1. loss
- Meaning:
- Represents the loss value at the current training step, measuring the model's ability to distinguish between the "chosen" and "rejected" replies.
- Importance:
- Core Indicator: The primary metric to evaluate training effectiveness.
- Training Goal: Minimizing loss indicates successful learning toward preferring the "chosen" reply.
- Indicator Trend:
- Initial Stage: loss typically starts around 0.6931 (that is, -log 0.5), indicating no preference between the two replies.
- During Training: Should decrease over time, showing the model is learning to prefer the "chosen" reply.
2. grad_norm
- Meaning:
- Represents the gradient norm of the model parameters, indicating the overall magnitude of parameter updates.
- Importance:
- Learning Intensity: Reflects how much the model is adjusting its parameters.
- Training Stability: Helps detect issues like vanishing or exploding gradients.
- Indicator Trend:
- Normal Range: Should be within a reasonable range (e.g., 0.01 to 1).
- Abnormal Situations:
- Too Small: Near zero may indicate lack of learning.
- Too Large: May require gradient clipping to prevent instability.
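For instance, gradient clipping is controlled through the max_grad_norm argument of the Hugging Face training arguments (a hedged sketch; the value below is illustrative, not the one used in this article's run):
from trl import DPOConfig

training_arguments = DPOConfig(
    output_dir="./DPO",
    max_grad_norm=1.0,  # clip the global gradient norm to 1.0
)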
3. learning_rate
- Meaning:
- Controls the step size in parameter updates during training.
- Importance:
- Convergence Speed and Stability: Affects how quickly and smoothly the model learns.
- Adjustment Strategy:
- Slow Loss Decrease: Consider increasing the learning rate.
- Unstable Training: If loss fluctuates, decreasing the learning rate might help.
4. rewards/chosen and rewards/rejected
- Meaning:
- rewards/chosen: Reward value for the "chosen" reply (Δ_chosen).
- rewards/rejected: Reward value for the "rejected" reply (Δ_rejected).
- Importance:
- Model Preference: Indicates the model's inclination towards each reply.
- Indicator Trend:
- Initial Stage: Both may be around 0.0 (no preference).
- During Training:
- rewards/chosen should increase.
- rewards/rejected should decrease.
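Assuming the TRL convention that these reward values are β times the corresponding log-probability differences (a hedged reading, not quoted from the TRL source), the reward fields relate to the quantities defined earlier as follows, using the values from the worked example above:
beta = 0.1
logp_model_chosen, logp_ref_chosen = -5.0, -6.0
logp_model_rejected, logp_ref_rejected = -7.0, -6.0

rewards_chosen = beta * (logp_model_chosen - logp_ref_chosen)        # β · Δ_chosen   = 0.1
rewards_rejected = beta * (logp_model_rejected - logp_ref_rejected)  # β · Δ_rejected = -0.1
rewards_margin = rewards_chosen - rewards_rejected                   # 0.2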
5. rewards/accuracies
- Meaning:
- The proportion of times the model correctly prefers the "chosen" reply.
- Importance:
- Performance Measure: Directly evaluates preference learning.
- Indicator Trend:
- Initial Stage: Around 0.5 (random guess).
- During Training: Should approach 1.0, indicating improved preference accuracy.
6. rewards/margins
- Meaning:
- The difference between rewards/chosen and rewards/rejected.
- Importance:
- Discrimination Ability: Larger margins indicate better distinction between replies.
- Indicator Trend:
- Should increase during training.
7. logps/chosen and logps/rejected
- Meaning:
- Total log probabilities of generating the "chosen" and "rejected" replies.
- Importance:
- Probability Basis: Used in calculating preference differences and rewards.
- Indicator Trend:
- Increasing logps/chosen indicates higher probability for the "chosen" reply.
- Stable or decreasing logps/rejected shows reduced preference for the "rejected" reply.
8. logits/chosen and logits/rejected
- Meaning:
- Raw output scores from the final layer before applying softmax, for both replies.
- Importance:
- Probability Calculation: Used to compute probabilities for each token, affecting log probabilities.
- Indicator Trend:
- Ensure Valid Values: Avoid nan or inf values.
- Monitor Changes: Changes in logits reflect learning progress.
9. epoch
- Meaning:
- Indicates the current training epoch or iteration over the training dataset.
- Importance:
- Training Progress: Helps track how far along the training is.
- Indicator Trend:
- As epoch increases, expect improvements in other metrics.
Summary
- Adjust Training Strategies Based on Indicators:
- Slow Loss Decrease: Increase learning rate or check data quality.
- Gradient Issues: If grad_norm is abnormal, inspect gradient computations or adjust optimizer settings.
- Low Preference Accuracy: Enhance data quality or quantity.
- Small Reward Margins: Adjust the temperature parameter β to influence sensitivity.
- Emphasize the Importance of the Reference Model:
- Maintaining Language Capabilities: Ensures the model doesn't overfit human preferences at the cost of language understanding and generation skills.
- Balancing Objectives: Optimizes for human preference while retaining overall model performance.
- Continuous Monitoring and Adjustment:
- Regular Evaluation: Use a validation set to assess performance and prevent overfitting.
- Dynamic Adjustment: Modify training strategies based on log indicators to optimize the model.
By understanding DPO's core concepts, its training process, and how to interpret the key training metrics, you can effectively train a model that aligns with human preferences while maintaining strong language capabilities.