We are thrilled to introduce the public preview of a groundbreaking feature in the Azure OpenAI Service: Direct Preference Optimization (DPO). This new capability makes it easier for customers to align large language models with human preferences using high-quality preference datasets.
What is Direct Preference Optimization?
Direct Preference Optimization is an innovative alignment technique for large language models that adjusts model weights based on human preferences. Unlike Reinforcement Learning from Human Feedback (RLHF), DPO does not require fitting a separate reward model; it learns directly from binary preference data. This makes DPO computationally lighter and faster than RLHF while being comparably effective at alignment.
Why is DPO Useful?
DPO is particularly beneficial in scenarios where there is no clear-cut correct answer and subjective elements such as tone, style, or specific content preferences are important. The approach allows the model to learn from both positive examples (what is considered correct or ideal) and negative examples (what is less desired or incorrect).
- Simplicity: DPO eliminates the need for a separate reward model, which is required in traditional methods like Reinforcement Learning from Human Feedback (RLHF). This simplification reduces the complexity of the optimization process.
- Stability: By directly optimizing the policy based on human preferences, DPO avoids the instability often associated with training and maintaining multiple models. This leads to more consistent and reliable outcomes.
- Efficiency: DPO is computationally efficient because it does not require the extensive computational resources needed for RLHF. This efficiency allows for faster convergence and lower computational overhead.
- Bias Mitigation: DPO directly incorporates human preferences into the optimization process, which helps reduce unintended biases in the model's behavior. This alignment with human values ensures that the model's outputs are more desirable and ethical.
Overall, DPO offers a streamlined, stable, and efficient alternative to traditional methods, making it a promising approach for fine-tuning language models to better align with human expectations and values.
Dataset Format for DPO
The dataset format for DPO differs from that used for supervised fine-tuning (SFT). Customers provide a "conversation" containing the system message and the initial user message, followed by "completions" with paired preference data. The dataset has three top-level fields: "input", "preferred_output", and "non_preferred_output". Each element in preferred_output and non_preferred_output must contain at least one assistant message, and only the roles assistant and tool are allowed.
Example for DPO:
{"input": {"messages": [{"role": "system", "content": "You are a chatbot assistant. Given a user question with multiple choice answers, provide the correct answer."}, {"role": "user", "content": "Question: One year, the oak trees in a park began producing more acorns than usual. The next year, the population of chipmunks in the park also increased. Which best explains why there were more chipmunks the next year? Answer choices: A: Shady areas increased., B: Food sources increased., C: Oxygen levels increased., D: Available water increased."}]}, "preferred_output": [{"role": "assistant", "content": "B: Food Sources Increased."}], "non_preferred_output": [{"role": "assistant", "content": "A: shady areas increased."}]}
Example for SFT:
{"messages": [{"role": "system", "content": "You are a chatbot assistant. Given a user question with multiple choice answers, provide the correct answer."}, {"role": "user", "content": "Question: Which characteristic of a cheetah is more likely to be learned rather than inherited? Answer choices: A: speed, B: a spotted coat, C: hunting strategies, D: claws that do not retract"}, {"role": "assistant", "content": "C: Hunting Strategies"}]}
Supported Models and Regions
DPO is supported for the GPT-4o model, with GPT-4o-mini support to follow soon. Through this functionality, users can preference fine-tune either the GPT-4o base model or GPT-4o models that have already been supervised fine-tuned.
Getting Started with DPO
- Prepare training and validation datasets in the preference format described above.
- Select the model, then choose "Direct Preference Optimization" as the customization method in the Azure AI Foundry portal (which has seamless support for all Azure OpenAI services).
- Upload training and validation datasets.
- Select hyperparameters as needed; the defaults work well too.
- Review the selections and create the fine-tuning job. (A sketch of the equivalent SDK calls is shown below.)
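The same flow can also be scripted. The following is a minimal sketch, assuming the openai Python package's AzureOpenAI client, an api_version that supports preference fine-tuning, and an illustrative GPT-4o snapshot name; the method payload mirrors the OpenAI preference fine-tuning API, so check the Azure OpenAI documentation for the exact values during the preview.

```python
import os
from openai import AzureOpenAI

# Endpoint, key, api_version, and model snapshot are assumptions; consult the
# Azure OpenAI documentation for the values that apply to your resource.
client = AzureOpenAI(
    azure_endpoint=os.environ["AZURE_OPENAI_ENDPOINT"],
    api_key=os.environ["AZURE_OPENAI_API_KEY"],
    api_version="2024-10-21",
)

# Upload the preference-format training and validation files.
train_file = client.files.create(file=open("dpo_train.jsonl", "rb"), purpose="fine-tune")
valid_file = client.files.create(file=open("dpo_valid.jsonl", "rb"), purpose="fine-tune")

# Create the fine-tuning job, selecting DPO as the customization method.
job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",  # illustrative snapshot name
    training_file=train_file.id,
    validation_file=valid_file.id,
    method={"type": "dpo", "dpo": {"hyperparameters": {"beta": 0.1}}},
)
print(job.id, job.status)
```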
Hyperparameters
What does DPO’s new hyperparameter "beta" mean? Beta is the temperature parameter for the DPO loss, typically in the range 0.1 to 0.5. It controls how much weight is given to the reference model: the smaller the beta, the more the fine-tuned model is allowed to drift away from the reference model.
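For reference, this matches beta's role in the standard DPO objective from the original DPO paper (Rafailov et al., 2023), where it scales the log-probability ratios between the model being trained and the frozen reference model; the service's exact implementation details are not covered in this post.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}}) =
  -\,\mathbb{E}_{(x, y_w, y_l) \sim \mathcal{D}}
  \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
    - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
  \right) \right]
```

Here y_w is the preferred completion, y_l is the non-preferred completion, and pi_ref is the model before preference tuning: a larger beta keeps the tuned model closer to the reference, while a smaller beta lets it drift further.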
Pricing
Pricing for DPO is the same as supervised fine-tuning of the GPT-4o and GPT-4o-mini models.
Pro tip: Depending on the dataset, it is sometimes helpful to run supervised fine-tuning on the preferred answers before DPO (for example, when the preferred responses are far from the base model's output distribution). This can be done through the Azure AI Foundry portal; a sketch of the equivalent SDK flow is shown below.
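For those scripting this two-stage flow, here is a hedged sketch that reuses the client and files from the earlier sketch. The polling loop is simplified, the file name sft_preferred.jsonl is hypothetical, and whether a fine-tuned model name can be passed directly as the base model may vary during the preview.

```python
import time

# Stage 1: supervised fine-tuning on the preferred answers (SFT-format JSONL).
sft_file = client.files.create(file=open("sft_preferred.jsonl", "rb"), purpose="fine-tune")
sft_job = client.fine_tuning.jobs.create(
    model="gpt-4o-2024-08-06",  # illustrative snapshot name
    training_file=sft_file.id,
)

# Wait for the SFT job to finish (simplified polling).
while sft_job.status not in ("succeeded", "failed", "cancelled"):
    time.sleep(60)
    sft_job = client.fine_tuning.jobs.retrieve(sft_job.id)

# Stage 2: preference fine-tune the SFT output model with DPO.
dpo_job = client.fine_tuning.jobs.create(
    model=sft_job.fine_tuned_model,  # SFT output used as the base model
    training_file=train_file.id,     # DPO-format file from the earlier sketch
    method={"type": "dpo"},
)
```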
We are excited to bring this new capability to our customers and look forward to seeing the innovative ways you will use Direct Preference Optimization to enhance your AI models.
Detailed information will be available soon in the documentation.
Stay tuned for more updates and happy fine-tuning!
Ready to get started?
- Learn more about Azure OpenAI Service
- Check out our How-To Guide for Fine Tuning with Azure OpenAI
- Try it out with Azure AI Foundry