Preference learning is a key technology for aligning language models with
human values. Reinforcement Learning from Human Feedback (RLHF) is a
model-based algorithm for preference learning: it first fits a reward
model to preference scores, and then optimizes the generating policy with the
on-policy PPO algorithm to maximize the reward. The RLHF pipeline is
complex, time-consuming, and unstable. The Direct Preference Optimization (DPO)
algorithm uses an off-policy algorithm to directly optimize the generating
policy, eliminating the need for a reward model, which makes it data-efficient
and stable. DPO uses the Bradley-Terry model and log-loss, which leads to
over-fitting to the preference data at the expense of ignoring the
KL-regularization term when preferences are nearly deterministic. IPO uses a
root-finding pairwise MSE loss to solve the problem of the ignored
KL-regularization term and to learn an optimal policy. However, IPO's pairwise
loss still cannot make the KL-regularization work. In
this paper, we design a simple and intuitive off-policy preference
optimization algorithm from an importance sampling view, and add an off-policy
KL-regularization term that makes KL-regularization truly effective. To
simplify the learning process and reduce memory usage, we can generate the
regularization data in advance, which eliminates the need for both a reward
model and a reference policy during the optimization stage.

Preference learning is a critical technology for aligning language models with human values. One popular approach to preference learning is Reinforcement Learning from Human Feedback (RLHF). RLHF is a model-based algorithm that optimizes preference learning by fitting a reward model for preference scores and then using the on-policy Proximal Policy Optimization (PPO) algorithm to maximize the reward.

However, RLHF has some drawbacks. It can be complex, time-consuming, and unstable. To address these issues, Direct Preference Optimization (DPO) has been proposed. DPO uses an off-policy algorithm to directly optimize the generating policy, eliminating the need for a reward model. This approach is more data-efficient and stable. However, DPO relies on the Bradley-Terry model and log-loss, which can overfit the preference data and effectively ignore the KL-regularization term when preferences are nearly deterministic. IPO instead uses a root-finding pairwise mean squared error (MSE) loss to address the ignored KL-regularization and learn an optimal policy.
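
As a rough illustration of the two pairwise losses named in the abstract, the sketch below writes them as functions of the log-ratio margin between a preferred and a rejected response. The formulas follow the commonly published forms of the DPO and IPO objectives and are not taken from this article; the function names and the default values of `beta` and `tau` are assumptions chosen for illustration.

```python
import math


def log_ratio_margin(logp_w, logp_l, ref_logp_w, ref_logp_l):
    """Margin between the chosen (w) and rejected (l) responses.

    Each argument is a sequence log-probability: logp_* under the policy
    being trained, ref_logp_* under the frozen reference policy.
    """
    return (logp_w - ref_logp_w) - (logp_l - ref_logp_l)


def dpo_loss(margin, beta=0.1):
    """Bradley-Terry log-loss: -log(sigmoid(beta * margin))."""
    z = beta * margin
    # Numerically stable form of log(1 + exp(-z)).
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


def ipo_loss(margin, tau=0.1):
    """Squared error that pulls the margin toward 1 / (2 * tau)."""
    return (margin - 1.0 / (2.0 * tau)) ** 2


# Compare how each loss reacts as the margin grows.
for m in (0.0, 5.0, 10.0, 100.0):
    print(m, dpo_loss(m), ipo_loss(m))
```

With `tau = 0.1` the IPO loss is zero at a margin of 5 and grows on either side of it, whereas the DPO loss keeps shrinking as the margin grows. That unbounded incentive is the over-fitting behavior the abstract describes for nearly deterministic preferences.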

While IPO improves on DPO, its pairwise loss still cannot make the KL-regularization term work effectively. In this paper, the authors propose a new off-policy preference optimization algorithm that takes an importance sampling view. They introduce an off-policy KL-regularization term that makes the KL-regularization truly effective.

To simplify the learning process and save memory usage, the authors suggest generating regularization data in advance. This eliminates the need for both a reward model and reference policy during the optimization stage.

Multi-disciplinary Nature

This content combines concepts from reinforcement learning, preference learning, optimization, and statistics. Preference optimization methods such as RLHF and DPO are used to align a generating policy with preference data. The incorporation of KL-regularization highlights the importance of statistical regularization techniques in achieving reliable and stable optimization results. The off-policy approach leverages concepts from importance sampling, which is widely used in statistics and probability theory.
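
For readers unfamiliar with importance sampling, here is a minimal generic sketch of the idea: an expectation under a target distribution is estimated from samples drawn under a different behavior distribution by reweighting each sample with the probability ratio. This is the textbook estimator, not the paper's algorithm, and the toy distributions and names (`target`, `behavior`, `importance_sampling_estimate`) are hypothetical.

```python
import random


def importance_sampling_estimate(samples, f, target_prob, behavior_prob):
    """Estimate E_target[f(x)] from samples drawn under the behavior distribution."""
    total = 0.0
    for x in samples:
        weight = target_prob(x) / behavior_prob(x)  # importance weight
        total += weight * f(x)
    return total / len(samples)


# Toy discrete distributions over the outcomes 0, 1, 2 (illustrative values).
outcomes = [0, 1, 2]
target = {0: 0.2, 1: 0.3, 2: 0.5}
behavior = {0: 0.5, 1: 0.3, 2: 0.2}

rng = random.Random(0)  # fixed seed so the run is repeatable
samples = rng.choices(outcomes, weights=[behavior[x] for x in outcomes], k=50_000)

estimate = importance_sampling_estimate(
    samples,
    f=lambda x: float(x),
    target_prob=lambda x: target[x],
    behavior_prob=lambda x: behavior[x],
)

# The exact expectation under the target is 0*0.2 + 1*0.3 + 2*0.5 = 1.3.
print(estimate)
```

In the off-policy setting described above, the behavior distribution plays the role of the policy that produced the stored data, and the target is the policy currently being optimized.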

Search Results and Analysis

Searching for similar content yielded several relevant results that expand on the concepts discussed. Here are some notable references:

  1. Title: “Preference-based reinforcement learning: evolutionary direct policy search using a generative model”
    Analysis: This paper explores preference-based reinforcement learning using an evolutionary direct policy search approach. It presents a generative model to capture the preferences of human users and discusses its application in various scenarios.
  2. Title: “Combining Reinforcement Learning and Human Feedback for Preference-based Interactive Reinforcement Learning”
    Analysis: This study investigates the integration of reinforcement learning and human feedback for preference-based interactive reinforcement learning. It proposes an algorithm that combines user feedback and reinforcement learning to learn user preferences effectively.
  3. Title: “A Survey of Preference Handling Approaches in Reinforcement Learning”
    Analysis: This survey paper provides a comprehensive overview of preference handling approaches in reinforcement learning. It covers various methods, including interventional preference learning, active preference learning, and inverse reinforcement learning.

These references showcase the diverse range of research in the field of preference learning and its applications in reinforcement learning. They offer additional insights and perspectives that can further deepen one’s understanding of the topic.
