Contents
1. PPO (Proximal Policy Optimization)
2. DPO (Direct Preference Optimization)
3. DPPO (Distributed Proximal Policy Optimization)
Proximal Policy Optimization (PPO) Algorithm Diagram
Direct Preference Optimization (DPO) Algorithm Diagram
In PPO-based RLHF, four models are loaded: two for inference only and two for training.
Actor Model: the actor, i.e., the target language model we want to train.
Critic Model: the critic, which estimates the total (expected future) return.
Reward Model: the reward model, which computes the immediate reward.
Reference Model: the reference model, which imposes constraints on the language model during RLHF to keep it from drifting (updating in an uncontrolled direction, where quality may keep degrading).
Of these:
The Actor and Critic models are trained during the RLHF stage, while the Reward and Reference models have frozen parameters.
The Critic, Reward, and Reference models together form a "reward-loss" computation pipeline: their outputs are combined to compute the loss used to update the Actor and Critic models (a minimal sketch of this setup follows below).
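A minimal sketch of this setup, assuming PyTorch and per-token log-probabilities already computed for one sampled response: the Reference model supplies a KL penalty, the Reward model supplies a scalar score on the final token, and the Critic's value estimates turn returns into advantages for the Actor update. Shapes, names, and the coefficient `beta` are illustrative, not taken from any particular codebase.

```python
import torch

torch.manual_seed(0)
T = 8  # number of tokens in the sampled response

# Per-token log-probs of the response under the trainable Actor and the frozen Reference,
# the Critic's per-token value estimates, and the Reward model's scalar score.
actor_logprobs = torch.randn(T)
ref_logprobs = torch.randn(T)
values = torch.randn(T)
rm_score = torch.tensor(1.3)

beta = 0.1  # KL penalty coefficient (illustrative)

# Immediate reward: -beta * approximate per-token KL to the Reference,
# plus the Reward model's score added on the last token.
kl = actor_logprobs - ref_logprobs
rewards = -beta * kl
rewards[-1] += rm_score

# Undiscounted return from each position; advantage = return - Critic value.
returns = torch.flip(torch.cumsum(torch.flip(rewards, [0]), 0), [0])
advantages = returns - values

print("rewards:   ", rewards)
print("advantages:", advantages)
```

The advantages then drive the Actor's clipped policy-gradient update, while the Critic is regressed toward the returns.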
DPO is a relatively recent method that directly optimizes for user or expert preferences rather than a traditional cumulative reward. In DPO, different responses (decision sequences) are compared against each other, and the model is optimized according to user or expert preferences so that the resulting policy better matches the expected behavior. DPO is typically used in settings where a reward function is hard to define explicitly, or where user preferences need to be encoded directly into the decision process.
Implementing DPO requires a preference model that can learn from user or expert feedback. In practice, this means designing a mechanism to collect preference data, for example through pairwise comparison queries or ranking feedback. This data is then used to train one or more models that predict a preference score for a given response, and the policy is optimized accordingly.
DPO only needs to load two models, one for inference (the frozen reference) and one for training (the policy), and it trains directly on preference data (an illustrative record is sketched below).
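For illustration only, here is a hypothetical preference record of the kind DPO trains on; the field names are not a fixed schema, just a common convention (prompt / chosen / rejected).

```python
# One pairwise preference record: a prompt, the response annotators preferred,
# and the response they rejected. Contents and field names are illustrative.
preference_example = {
    "prompt": "Explain reinforcement learning in one sentence.",
    "chosen": "Reinforcement learning trains an agent to act by rewarding decisions that lead to good outcomes.",
    "rejected": "It is a thing computers do.",
}

# During training, the trainable policy and a frozen reference copy both score
# "chosen" and "rejected"; the DPO loss pushes the policy to prefer "chosen"
# relative to the reference.
```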
In summary, PPO focuses on stabilizing policy updates by clipping the probability ratio, while DPPO (Distributed PPO) builds on it by introducing distributed computation to make data collection and processing more efficient and speed up learning.
PPO and DPPO thus share the same algorithmic framework and objective function, but differ in implementation, degree of parallelization, and the compute environments they suit; DPPO is particularly well suited to scenarios that require large-scale parallel processing.
Proximal Policy Optimization (PPO) is a reinforcement learning algorithm developed by OpenAI. It is designed to perform comparably to or better than state-of-the-art approaches while being much simpler to implement and tune. PPO has become a default reinforcement learning algorithm at OpenAI due to its ease of use and good performance.
PPO works by trying to compute an update at each step that minimizes the cost function while ensuring the deviation from the previous policy is relatively small. It uses a novel objective function that enables multiple epochs of minibatch updates. The objective function is expressed as:
\[ L^{CLIP}(\theta) = \hat{\mathbb{E}}_t\left[ \min\left( r_t(\theta)\hat{A}_t,\ \mathrm{clip}\left(r_t(\theta),\, 1-\varepsilon,\, 1+\varepsilon\right)\hat{A}_t \right) \right] \]
where \( r_t(\theta) = \pi_\theta(a_t \mid s_t) / \pi_{\theta_{\text{old}}}(a_t \mid s_t) \) is the probability ratio between the new and old policies, \( \hat{A}_t \) is the estimated advantage at timestep \( t \), and \( \varepsilon \) is a hyperparameter that sets the clipping range (typically around 0.2).
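A minimal sketch of this clipped objective, assuming PyTorch and precomputed per-action log-probabilities and advantages; the negation turns the maximization objective into a loss for a standard optimizer.

```python
import torch

def ppo_clip_loss(logprobs, old_logprobs, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a_t | s_t) / pi_theta_old(a_t | s_t)
    ratio = torch.exp(logprobs - old_logprobs)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1.0 - eps, 1.0 + eps) * advantages
    # Take the elementwise minimum and negate, since optimizers minimize.
    return -torch.min(unclipped, clipped).mean()

# Toy usage with random data for a batch of 32 actions.
torch.manual_seed(0)
logprobs = torch.randn(32, requires_grad=True)
old_logprobs = torch.randn(32)
advantages = torch.randn(32)
print(ppo_clip_loss(logprobs, old_logprobs, advantages))
```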
The PPO algorithm implements a way to do a Trust Region update which is compatible with Stochastic Gradient Descent and simplifies the algorithm by removing the KL penalty and the need to make adaptive updates.
The PPO algorithm can be implemented in Python 3 with TensorFlow. OpenAI's release includes scalable, parallel implementations of PPO and TRPO (Trust Region Policy Optimization), both of which use MPI for data passing. OpenAI has also released a GPU-enabled implementation called PPO2, which runs approximately 3x faster than the current PPO baseline on Atari games.
Direct Preference Optimization (DPO) is introduced as a new parameterization of the reward model in Reinforcement Learning from Human Feedback (RLHF) that enables extraction of the corresponding optimal policy in closed form. It solves the standard RLHF problem with a simple classification loss, eliminating the need for sampling from the Language Model (LM) during fine-tuning or performing significant hyperparameter tuning.
DPO is stable, performant, and computationally lightweight. It fine-tunes Language Models (LMs) to align with human preferences effectively. Notably, DPO exceeds PPO-based RLHF in controlling the sentiment of generations and matches or improves response quality in summarization and single-turn dialogue tasks while being substantially simpler to implement and train.
DPO operates by directly optimizing a policy model \( \pi_\theta \) on preference data, without the need for an explicit reward model. The DPO loss is computed as follows:
\[ \mathcal{L}_{DPO}(\pi_\theta;\, \pi_{\mathrm{ref}}) = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}\left[ \log \sigma\!\left( \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)} - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)} \right) \right] \]
where \( \pi_\theta \) is the policy being trained, \( \pi_{\mathrm{ref}} \) is the frozen reference policy, \( y_w \) and \( y_l \) are the preferred and dispreferred responses for prompt \( x \), \( \sigma \) is the sigmoid function, and \( \beta \) is a temperature hyperparameter that controls how far the policy may deviate from the reference.
DPO updates aim to increase the relative log probability of preferred responses over less preferred ones, incorporating a dynamic, per-sample importance weight to prevent model degradation.
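A minimal sketch of this loss, assuming PyTorch and sequence-level log-probabilities of the chosen (y_w) and rejected (y_l) responses under the trainable policy and the frozen reference; the function name and `beta` value are illustrative.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    # Log-ratios of policy vs. reference for the preferred and dispreferred responses.
    chosen_logratios = policy_chosen_logps - ref_chosen_logps
    rejected_logratios = policy_rejected_logps - ref_rejected_logps
    logits = beta * (chosen_logratios - rejected_logratios)
    # -log sigmoid(...) is the binary classification loss on the preference pair.
    return -F.logsigmoid(logits).mean()

# Toy usage with random log-probabilities for a batch of 4 preference pairs.
torch.manual_seed(0)
policy_chosen = torch.randn(4, requires_grad=True)
policy_rejected = torch.randn(4, requires_grad=True)
ref_chosen = torch.randn(4)
ref_rejected = torch.randn(4)
print(dpo_loss(policy_chosen, policy_rejected, ref_chosen, ref_rejected))
```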