LLM Post-Training: Data Synthesis and Algorithms
Tianhao Wu
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2025-216
December 19, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-216.pdf
This dissertation addresses the challenge of scalable post-training for large language models along two axes: data generation and algorithmic robustness.
On the data side, we demonstrate that AI-generated preference data can achieve state-of-the-art alignment results through the Nectar dataset and Starling reward models, then advance to fully self-generated supervision via Meta-Rewarding, where models act as actor, judge, and meta-judge simultaneously. This progression---from expensive human annotation to external AI teachers to autonomous self-play---eliminates traditional data bottlenecks while maintaining or exceeding performance. The key insight is that comparative feedback (K-wise rankings, pairwise judgments) is more reliable than absolute scoring.
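For readers unfamiliar with K-wise comparative objectives, one standard formulation is the Plackett-Luce likelihood (shown here as a sketch; the report's exact loss may differ). Given a prompt $x$, a reward model $r$, and a ranking $y_{\pi(1)} \succ \cdots \succ y_{\pi(K)}$ over $K$ responses:

```latex
P(\pi \mid x) \;=\; \prod_{k=1}^{K} \frac{\exp r\!\left(x, y_{\pi(k)}\right)}{\sum_{j=k}^{K} \exp r\!\left(x, y_{\pi(j)}\right)}
```

For $K = 2$ this reduces to the Bradley-Terry pairwise model $P(y_1 \succ y_2 \mid x) = \sigma\!\left(r(x, y_1) - r(x, y_2)\right)$, the pairwise-judgment case mentioned above.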
On the algorithm side, we identify and resolve fundamental inconsistencies in standard RLHF through P3O (Pairwise Proximal Policy Optimization), which performs comparative reinforcement learning rather than optimizing absolute rewards. We formalize this through reward equivalence: reward models trained with Bradley-Terry loss are invariant to constant shifts, but PPO is not, leading to training instabilities. P3O extracts the comparative signal correctly, achieving superior KL-reward frontiers. We then extend preference optimization beyond outputs to internal reasoning via TPO (Thought Preference Optimization), demonstrating that thinking benefits general instruction following across diverse categories including creative writing, health advice, and marketing---not just mathematical tasks.
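A minimal numeric illustration of the reward-equivalence point (the reward values below are hypothetical, chosen only to make the shift-invariance visible):

```python
import math

def bt_prob(r_a, r_b):
    """Bradley-Terry preference probability: P(a beats b) = sigmoid(r_a - r_b)."""
    return 1.0 / (1.0 + math.exp(-(r_a - r_b)))

# Two reward models that differ by a constant shift c assign identical
# Bradley-Terry likelihoods to every pairwise comparison, so training
# data cannot distinguish them.
rewards = {"resp_a": 1.3, "resp_b": -0.4}   # hypothetical scores
c = 100.0
shifted = {k: v + c for k, v in rewards.items()}

p_original = bt_prob(rewards["resp_a"], rewards["resp_b"])
p_shifted = bt_prob(shifted["resp_a"], shifted["resp_b"])
assert abs(p_original - p_shifted) < 1e-12  # the shift cancels in the difference

# Yet an algorithm that consumes absolute reward values sees very
# different magnitudes (1.3 vs. 101.3) for these equivalent models,
# whereas an update built on reward *differences* between paired
# responses is unaffected by the shift.
```

The toy check shows why the comparative signal (the difference), not the absolute score, is the quantity the preference data actually pins down.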
Together, these contributions establish that scalable LLM post-training requires coordinated advances in both data generation and algorithmic robustness. The four contributions are tightly interconnected: Starling's K-wise ranking techniques inform Meta-Rewarding's judge design; Meta-Rewarding's self-play principle extends to TPO's thought generation; P3O formalizes the comparative feedback underlying all methods; and all four leverage iterative training with preference-based optimization. This framework enables autonomous model improvement without human annotation or algorithmic instability, pointing toward increasingly autonomous AI systems.
Advisors: Kannan Ramchandran and Jiantao Jiao
BibTeX citation:
@phdthesis{Wu:EECS-2025-216,
  author   = {Wu, Tianhao},
  title    = {LLM Post-Training: Data Synthesis and Algorithms},
  school   = {EECS Department, University of California, Berkeley},
  year     = {2025},
  month    = {Dec},
  url      = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-216.html},
  number   = {UCB/EECS-2025-216},
  abstract = {This dissertation addresses the challenge of scalable post-training for large language models along two axes: data generation and algorithmic robustness.
On the data side, we demonstrate that AI-generated preference data can achieve state-of-the-art alignment results through the Nectar dataset and Starling reward models, then advance to fully self-generated supervision via Meta-Rewarding, where models act as actor, judge, and meta-judge simultaneously. This progression---from expensive human annotation to external AI teachers to autonomous self-play---eliminates traditional data bottlenecks while maintaining or exceeding performance. The key insight is that comparative feedback (K-wise rankings, pairwise judgments) is more reliable than absolute scoring.
On the algorithm side, we identify and resolve fundamental inconsistencies in standard RLHF through P3O (Pairwise Proximal Policy Optimization), which performs comparative reinforcement learning rather than optimizing absolute rewards. We formalize this through reward equivalence: reward models trained with Bradley-Terry loss are invariant to constant shifts, but PPO is not, leading to training instabilities. P3O extracts the comparative signal correctly, achieving superior KL-reward frontiers. We then extend preference optimization beyond outputs to internal reasoning via TPO (Thought Preference Optimization), demonstrating that thinking benefits general instruction following across diverse categories including creative writing, health advice, and marketing---not just mathematical tasks.
Together, these contributions establish that scalable LLM post-training requires coordinated advances in both data generation and algorithmic robustness. The four contributions are tightly interconnected: Starling's K-wise ranking techniques inform Meta-Rewarding's judge design; Meta-Rewarding's self-play principle extends to TPO's thought generation; P3O formalizes the comparative feedback underlying all methods; and all four leverage iterative training with preference-based optimization. This framework enables autonomous model improvement without human annotation or algorithmic instability, pointing toward increasingly autonomous AI systems.},
}
EndNote citation:
%0 Thesis
%A Wu, Tianhao
%T LLM Post-Training: Data Synthesis and Algorithms
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 19
%@ UCB/EECS-2025-216
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-216.html
%F Wu:EECS-2025-216