Pairwise Proximal Policy Optimization: Large Language Models Alignment via Comparative RL
Tianhao Wu
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-43
May 1, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-43.pdf
LLMs may exhibit harmful behavior if they are not aligned with human values. The dominant approach for steering LLMs towards beneficial behavior is Reinforcement Learning from Human Feedback (RLHF). It involves training a reward model on a human-labeled ranking dataset and then fine-tuning the LLM against that reward signal with RL. Although the reward is learned by comparing different responses, the RL stage involves no direct comparisons. This inconsistency between the reward-learning and reinforcement-learning stages exacerbates the instability of RL. For example, the widely adopted RL optimizer Proximal Policy Optimization (PPO) can perform different gradient updates even on batches that carry identical human preference information. To address this, we propose a new framework, reinforcement learning from comparative feedback, and a simple policy-gradient algorithm, Pairwise Proximal Policy Optimization (P3O), which learns to improve from direct comparisons. Theoretically, P3O has the appealing property of being invariant to any reward functions that encode the same preference information, while not requiring a learned value function. Empirical evaluations demonstrate that P3O aligns with human preferences better than existing methods. This suggests that comparative RL is a strong candidate for aligning LLMs with preference data.
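To make the comparative-update idea concrete, the sketch below applies a pairwise policy-gradient step to a toy tabular bandit. It is only an illustration of the general principle described in the abstract, not the report's P3O implementation: the toy sizes, the pairwise_update helper, and the tabular softmax policy are hypothetical stand-ins. Two responses are drawn per prompt, their reward difference serves as a relative advantage (so no value function is needed), and because any prompt-dependent reward shift cancels in the difference, the update is invariant to it.

# Illustrative sketch, not the report's P3O implementation: a pairwise
# policy-gradient update on a toy tabular bandit. The reward DIFFERENCE of two
# sampled responses acts as a relative advantage, so no value function is
# learned, and any prompt-dependent reward shift cancels out of the update.
import numpy as np

rng = np.random.default_rng(0)

N_PROMPTS, N_RESPONSES = 3, 4                              # hypothetical toy sizes
logits = np.zeros((N_PROMPTS, N_RESPONSES))                # tabular softmax "policy"
reward = rng.normal(size=(N_PROMPTS, N_RESPONSES))         # stand-in for a learned reward model

def softmax(x):
    z = np.exp(x - x.max())
    return z / z.sum()

def pairwise_update(logits, reward, lr=0.5, steps=2000):
    logits = logits.copy()
    for _ in range(steps):
        x = rng.integers(N_PROMPTS)                        # sample a prompt
        probs = softmax(logits[x])
        y1, y2 = rng.choice(N_RESPONSES, size=2, p=probs)  # two independent responses
        rel = reward[x, y1] - reward[x, y2]                # comparative signal: reward difference
        # Gradient of log pi(y1|x) - log pi(y2|x) w.r.t. the logits of prompt x:
        # the softmax baseline terms cancel, leaving e_{y1} - e_{y2}.
        grad = np.zeros(N_RESPONSES)
        grad[y1] += 1.0
        grad[y2] -= 1.0
        logits[x] += lr * rel * grad
    return logits

# Invariance check: adding a prompt-dependent offset delta(x) to the reward
# changes no pairwise difference, hence no update, when sampling is replayed.
shift = 10.0 * rng.normal(size=(N_PROMPTS, 1))
state = rng.bit_generator.state                            # replay identical sampling
out_plain = pairwise_update(logits, reward)
rng.bit_generator.state = state
out_shifted = pairwise_update(logits, reward + shift)
print("max parameter difference:", np.abs(out_plain - out_shifted).max())  # 0.0

Running this prints a maximum parameter difference of 0.0: the two reward tables encode the same preference information, so the comparative updates coincide, which is the invariance property the abstract attributes to P3O. By contrast, per the abstract, standard PPO can produce different gradient updates for batches that carry identical preference information.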
Advisors: Kannan Ramchandran and Jiantao Jiao
BibTeX citation:
@mastersthesis{Wu:EECS-2024-43,
    Author = {Wu, Tianhao},
    Title = {Pairwise Proximal Policy Optimization: Large Language Models Alignment via Comparative RL},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-43.html},
    Number = {UCB/EECS-2024-43},
    Abstract = {LLMs may exhibit harmful behavior if they are not aligned with human values. The dominant approach for steering LLMs towards beneficial behavior is Reinforcement Learning from Human Feedback (RLHF). It involves training a reward model on a human-labeled ranking dataset and then fine-tuning the LLM against that reward signal with RL. Although the reward is learned by comparing different responses, the RL stage involves no direct comparisons. This inconsistency between the reward-learning and reinforcement-learning stages exacerbates the instability of RL. For example, the widely adopted RL optimizer Proximal Policy Optimization (PPO) can perform different gradient updates even on batches that carry identical human preference information. To address this, we propose a new framework, reinforcement learning from comparative feedback, and a simple policy-gradient algorithm, Pairwise Proximal Policy Optimization (P3O), which learns to improve from direct comparisons. Theoretically, P3O has the appealing property of being invariant to any reward functions that encode the same preference information, while not requiring a learned value function. Empirical evaluations demonstrate that P3O aligns with human preferences better than existing methods. This suggests that comparative RL is a strong candidate for aligning LLMs with preference data.}
}
EndNote citation:
%0 Thesis
%A Wu, Tianhao
%T Pairwise Proximal Policy Optimization: Large Language Models Alignment via Comparative RL
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 1
%@ UCB/EECS-2024-43
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-43.html
%F Wu:EECS-2024-43