Pairwise Proximal Policy Optimization: Large Language Models Alignment via Comparative RL

Tianhao Wu, Banghua Zhu, Ruoyu Zhang, Zhaojin Wen, Kannan Ramchandran and Jiantao Jiao

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2024-21
April 26, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-21.pdf

LLMs may exhibit harmful behavior when they are not aligned with human values. The dominant approach for steering LLMs towards beneficial behavior is Reinforcement Learning from Human Feedback (RLHF), which involves training a reward model on a human-labeled ranking dataset and then fine-tuning the LLM with that reward signal using RL. Even though the reward is learned by comparing different responses, the RL stage does not involve direct comparisons. This inconsistency between the reward learning and reinforcement learning stages exacerbates RL's instability. For example, the widely adopted RL optimizer, Proximal Policy Optimization (PPO), can perform different gradient updates even on batches that carry identical human preference information. To address this, we propose a new framework, reinforcement learning from comparative feedback, and a simple policy gradient algorithm, Pairwise Proximal Policy Optimization (P3O), that learns to improve directly from comparisons. Theoretically, P3O is invariant to any reward function that contains identical preference information, and it does not require learning a value function. Empirical evaluations demonstrate that P3O aligns with human preferences better than existing methods. This suggests that comparative RL is a strong candidate for aligning LLMs with preference data.
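
To make the comparative update concrete, below is a minimal, hypothetical PyTorch sketch of a pairwise clipped surrogate loss in the spirit of comparative RL. It illustrates the general idea of updating from the reward difference between two responses to the same prompt; it is not the P3O objective from the report, and all names (pairwise_clipped_loss, logp_new_1, reward_1, etc.) are assumptions made for this example.

import torch

def pairwise_clipped_loss(logp_new_1, logp_new_2,
                          logp_old_1, logp_old_2,
                          reward_1, reward_2,
                          clip_eps=0.2):
    # Comparative signal: only the reward difference between the two
    # responses to the same prompt enters the loss, so any constant
    # (prompt-dependent) shift of the reward cancels out.
    adv = reward_1 - reward_2

    # PPO-style importance ratios for each response (sequence log-probs
    # under the current policy minus those under the behavior policy).
    ratio_1 = torch.exp(logp_new_1 - logp_old_1)
    ratio_2 = torch.exp(logp_new_2 - logp_old_2)

    # Clipped surrogate: raise the probability of the higher-reward
    # response and lower the other, with clipping to keep updates small.
    surr_1 = torch.min(ratio_1 * adv,
                       torch.clamp(ratio_1, 1 - clip_eps, 1 + clip_eps) * adv)
    surr_2 = torch.min(ratio_2 * (-adv),
                       torch.clamp(ratio_2, 1 - clip_eps, 1 + clip_eps) * (-adv))

    # Negate because optimizers minimize; average over the batch.
    return -(surr_1 + surr_2).mean() / 2

# Example with dummy per-sequence log-probabilities and rewards:
loss = pairwise_clipped_loss(torch.tensor([-12.3]), torch.tensor([-11.8]),
                             torch.tensor([-12.5]), torch.tensor([-11.9]),
                             torch.tensor([1.0]), torch.tensor([0.2]))

Because the update depends on the rewards only through their difference, shifting both rewards by the same prompt-dependent constant leaves the gradient unchanged, which mirrors the invariance property described in the abstract.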

Advisors: Kannan Ramchandran and Jiantao Jiao


BibTeX citation:

@mastersthesis{Wu:EECS-2024-21,
    Author = {Wu, Tianhao and Zhu, Banghua and Zhang, Ruoyu and Wen, Zhaojin and Ramchandran, Kannan and Jiao, Jiantao},
    Title = {Pairwise Proximal Policy Optimization: Large Language Models Alignment via Comparative RL},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {Apr},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-21.html},
    Number = {UCB/EECS-2024-21},
    Abstract = {LLMs may exhibit harmful behavior when they are not aligned with human values. The dominant approach for steering LLMs towards beneficial behavior is Reinforcement Learning from Human Feedback (RLHF), which involves training a reward model on a human-labeled ranking dataset and then fine-tuning the LLM with that reward signal using RL. Even though the reward is learned by comparing different responses, the RL stage does not involve direct comparisons. This inconsistency between the reward learning and reinforcement learning stages exacerbates RL's instability. For example, the widely adopted RL optimizer, Proximal Policy Optimization (PPO), can perform different gradient updates even on batches that carry identical human preference information. To address this, we propose a new framework, reinforcement learning from comparative feedback, and a simple policy gradient algorithm, Pairwise Proximal Policy Optimization (P3O), that learns to improve directly from comparisons. Theoretically, P3O is invariant to any reward function that contains identical preference information, and it does not require learning a value function. Empirical evaluations demonstrate that P3O aligns with human preferences better than existing methods. This suggests that comparative RL is a strong candidate for aligning LLMs with preference data.}
}

EndNote citation:

%0 Thesis
%A Wu, Tianhao
%A Zhu, Banghua
%A Zhang, Ruoyu
%A Wen, Zhaojin
%A Ramchandran, Kannan
%A Jiao, Jiantao
%T Pairwise Proximal Policy Optimization: Large Language Models Alignment via Comparative RL
%I EECS Department, University of California, Berkeley
%D 2024
%8 April 26
%@ UCB/EECS-2024-21
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-21.html
%F Wu:EECS-2024-21