Evan Frick
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-82
May 16, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-82.pdf
Reinforcement Learning from Human Feedback (RLHF) has become the dominant paradigm for aligning Large Language Models (LLMs) with human preferences. Effective RLHF relies heavily on reward models, which serve as scalable proxies for human judgments. We introduce a new benchmark for reward models that quantifies their ability to produce strong language models through RLHF. The gold-standard approach is to run a full RLHF training pipeline and directly probe downstream LLM performance; however, this process is prohibitively expensive. In this thesis, we introduce Preference Proxy Evaluations (PPE), a comprehensive benchmark suite grounded in large-scale, crowdsourced human preference data and verifiably correct responses from established benchmarks. We experimentally validate PPE by demonstrating its strong correlation with downstream human preferences observed after RLHF, underscoring its predictive capability. Ultimately, we compile our data and findings into PPE, the first reward model benchmark explicitly linked to post-RLHF real-world human preference performance. Additionally, we leverage insights from PPE to enhance reward model robustness by incorporating heteroscedastic regression techniques that address the variability and uncertainty inherent in human preference data. We find that learning to estimate variances increases final performance, outperforming fixed-variance or variance-free alternatives, even when the variance estimates are not utilized at test time. Further, we find that using variance estimates to form a pessimistic quantile reward benefits reward model performance and robustness, especially on out-of-distribution tasks. Overall, these results suggest that such reward models may serve as more robust human preference proxies during online RLHF procedures, which require reward models to be robust to an ever-changing policy model.
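For concreteness, the following is a minimal, hypothetical sketch of the two ideas summarized at the end of the abstract: a reward head that predicts both a mean reward and a per-example variance (heteroscedastic regression), and a pessimistic quantile reward that subtracts a multiple of the predicted standard deviation. The class and function names, and the probit-style preference loss used here, are illustrative assumptions, not the thesis's exact formulation.

# Minimal, hypothetical sketch of a heteroscedastic reward head and a
# pessimistic quantile reward (names and loss form are illustrative
# assumptions, not the thesis's exact implementation).
import torch
import torch.nn as nn


class HeteroscedasticRewardHead(nn.Module):
    """Maps a pooled LLM hidden state to a reward mean and a learned variance."""

    def __init__(self, hidden_dim: int):
        super().__init__()
        self.mean = nn.Linear(hidden_dim, 1)
        self.log_var = nn.Linear(hidden_dim, 1)  # per-example log-variance

    def forward(self, h: torch.Tensor):
        mu = self.mean(h).squeeze(-1)
        var = self.log_var(h).squeeze(-1).exp()  # exponentiate to keep variance positive
        return mu, var


def preference_loss(mu_chosen, var_chosen, mu_rejected, var_rejected):
    # Probit-style likelihood: the chosen response should outscore the rejected
    # one, with the margin scaled by the combined predicted uncertainty.
    z = (mu_chosen - mu_rejected) / torch.sqrt(var_chosen + var_rejected + 1e-6)
    p_correct = torch.distributions.Normal(0.0, 1.0).cdf(z)
    return -torch.log(p_correct.clamp_min(1e-6)).mean()


def pessimistic_reward(mu, var, quantile_z=1.0):
    # Lower-quantile reward: subtract quantile_z standard deviations so that
    # uncertain responses are scored conservatively during RLHF.
    return mu - quantile_z * var.sqrt()


if __name__ == "__main__":
    head = HeteroscedasticRewardHead(hidden_dim=16)
    h_chosen, h_rejected = torch.randn(4, 16), torch.randn(4, 16)
    mu_c, var_c = head(h_chosen)
    mu_r, var_r = head(h_rejected)
    loss = preference_loss(mu_c, var_c, mu_r, var_r)
    print(loss.item(), pessimistic_reward(mu_c, var_c))

During online RLHF, the policy would then be optimized against pessimistic_reward(mu, var) rather than mu alone, which is one plausible instantiation of the pessimistic quantile reward described in the abstract.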
Advisor: Jiantao Jiao
";
?>
BibTeX citation:
@mastersthesis{Frick:EECS-2025-82,
    Author = {Frick, Evan},
    Title = {Reward Modeling for Human Preferences},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-82.html},
    Number = {UCB/EECS-2025-82}
}
EndNote citation:
%0 Thesis
%A Frick, Evan
%T Reward Modeling for Human Preferences
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 16
%@ UCB/EECS-2025-82
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-82.html
%F Frick:EECS-2025-82