Pre-Training on Observational Data with Reinforcement Learning

Dibya Ghosh

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-139

June 20, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-139.pdf

Reinforcement learning (RL) concerns the study of optimal decision making. Through iterative interaction and computation, RL methods learn to model the long-horizon effects of an agent's behavior and, counterfactually, how to optimally perform tasks of interest. Despite exemplary success in domains like computer games and robotics, RL methods remain primarily confined to structured and narrowly scoped settings, where tasks are clearly defined and data is cleanly formatted with annotations of states, rewards, and actions.

For most problems of interest, there is a wealth of relevant data that does not conform to this format and is instead available only in unlabeled, unstructured, and noisy form. For human-assistive robots, we have access to countless hours of video of humans; for chat models, expansive web corpora of images and text; for protein synthesis, large banks of experimental data. Through simple objectives like future or masked prediction, models have been able to acquire knowledge from these broad observational datasets, but there is a growing mismatch between the features models acquire from pre-training objectives and the features that enable optimality on the downstream decision-making tasks we are ultimately interested in.

In this dissertation, we study how techniques from reinforcement learning may be applied to learn features useful for decision making from these broader sources of data. We begin by studying the problem of learning from passive data such as videos: sequences of observations with no reward or action annotations. Next, we discuss how these algorithms scale to larger video datasets and show how the learned representations can be applied to real-world robotics problems. Finally, we discuss how these techniques extend naturally to learning from unlabeled images, such as uncurated multi-modal web datasets. Taken together, these investigations suggest a broader role for decision-making objectives in pre-training: they enable learning from diverse, uncurated sources while guiding training toward downstream utility.

Advisor: Sergey Levine


BibTeX citation:

@phdthesis{Ghosh:EECS-2025-139,
    Author= {Ghosh, Dibya},
    Title= {Pre-Training on Observational Data with Reinforcement Learning},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {Jun},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-139.html},
    Number= {UCB/EECS-2025-139},
    Abstract= {Reinforcement learning (RL) concerns the study of optimal decision making. Through iterative interaction and computation, RL methods learn to model the long-horizon effects of an agent's behavior and, counterfactually, how to optimally perform tasks of interest. Despite exemplary success in domains like computer games and robotics, RL methods remain primarily confined to structured and narrowly scoped settings, where tasks are clearly defined and data is cleanly formatted with annotations of states, rewards, and actions.

For most problems of interest, there is a wealth of relevant data that does not conform to this format and is instead available only in unlabeled, unstructured, and noisy form. For human-assistive robots, we have access to countless hours of video of humans; for chat models, expansive web corpora of images and text; for protein synthesis, large banks of experimental data. Through simple objectives like future or masked prediction, models have been able to acquire knowledge from these broad observational datasets, but there is a growing mismatch between the features models acquire from pre-training objectives and the features that enable optimality on the downstream decision-making tasks we are ultimately interested in.

In this dissertation, we study how techniques from reinforcement learning may be applied to learn features useful for decision making from these broader sources of data. We begin by studying the problem of learning from passive data such as videos: sequences of observations with no reward or action annotations. Next, we discuss how these algorithms scale to larger video datasets and show how the learned representations can be applied to real-world robotics problems. Finally, we discuss how these techniques extend naturally to learning from unlabeled images, such as uncurated multi-modal web datasets. Taken together, these investigations suggest a broader role for decision-making objectives in pre-training: they enable learning from diverse, uncurated sources while guiding training toward downstream utility.},
}

EndNote citation:

%0 Thesis
%A Ghosh, Dibya 
%T Pre-Training on Observational Data with Reinforcement Learning
%I EECS Department, University of California, Berkeley
%D 2025
%8 June 20
%@ UCB/EECS-2025-139
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-139.html
%F Ghosh:EECS-2025-139