Synergy of Prediction and Control in Model-based Reinforcement Learning

Nathan Lambert

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2022-65
May 11, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-65.pdf

Model-based reinforcement learning (MBRL) has often been touted for its potential to improve the sample efficiency, generalization, and safety of existing reinforcement learning algorithms. These model-based algorithms constrain policy optimization during trial-and-error learning to include a structured representation of the environment dynamics. To date, the posited benefits have largely been left as directions for future work. This thesis attempts to illustrate the central mechanism in MBRL: how a learned dynamics model interacts with decision making. A better understanding of this interaction will point the field toward enabling the posited benefits.

This thesis examines the interaction of model learning with decision making with respect to two central issues: compounding prediction errors and objective mismatch. The compounding-error challenge emerges from accumulating errors over recursive passes of any one-step transition model. Most dynamics models are trained for single-step accuracy, which often results in models with substantial long-term prediction error. Additionally, a model trained for accurate transitions does not guarantee high-performing policies on the downstream task. The lack of correlation between model and policy metrics under separate optimization is coined and studied as Objective Mismatch.
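The compounding-error effect can be illustrated with a toy example (an illustration under assumed dynamics, not an example from the thesis): a one-step model with a small per-step error, applied recursively as in model rollouts, drifts increasingly far from the true trajectory.

```python
# Toy illustration of compounding error. The (assumed) true dynamics are
# s_{t+1} = 0.9 * s_t, while the learned one-step model slightly
# over-estimates the coefficient. Rolling the model out recursively
# accumulates the small per-step error over the horizon.

def rollout(coeff, s0, horizon):
    """Apply the one-step map s <- coeff * s for `horizon` steps."""
    states = [s0]
    for _ in range(horizon):
        states.append(coeff * states[-1])
    return states

true_states = rollout(0.90, 1.0, 20)   # ground-truth dynamics
model_states = rollout(0.92, 1.0, 20)  # learned model with a 2% coefficient error

# The prediction error grows with horizon even though the one-step error is small.
errors = [abs(m - t) for m, t in zip(model_states, true_states)]
print(f"error at step 1:  {errors[1]:.4f}")
print(f"error at step 20: {errors[20]:.4f}")
```

Even this linear example shows the point: a model that looks accurate under a single-step metric can be substantially wrong at the horizons a planner actually uses.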

These challenges are primarily studied in the context of sample-based model predictive control (MPC) algorithms, where the learned model is used to simulate trajectories and their resulting predicted rewards. To mitigate compounding error and objective mismatch, the trajectory-based dynamics model is introduced: a feedforward prediction parametrization containing a direct representation of time. This model represents one small but important step towards more useful dynamics models in model-based reinforcement learning. This thesis concludes with future directions on the synergy of prediction and control in MBRL, primarily focused on state abstractions, temporal correlation, and future prediction methodologies.

Advisor: Kristofer Pister


BibTeX citation:

@phdthesis{Lambert:EECS-2022-65,
    Author = {Lambert, Nathan},
    Title = {Synergy of Prediction and Control in Model-based Reinforcement Learning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2022},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-65.html},
    Number = {UCB/EECS-2022-65},
    Abstract = {Model-based reinforcement learning (MBRL) has often been touted for its potential to improve on the sample-efficiency, generalization, and safety of existing reinforcement learning algorithms.
These model-based algorithms constrain the policy optimization during trial-and-error learning to include a structured representation of the environment dynamics. 
To date, the posited benefits have largely been left as directions for future work.
This thesis attempts to illustrate the central mechanism in MBRL: how a learned dynamics model interacts with decision making.
A better understanding of this interaction will point the field in the direction of enabling the posited benefits. 

This thesis encompasses the interaction of model-learning with decision making with respect to two central issues: compounding prediction errors and objective mismatch.
The compounding error challenge emerges from accumulating errors on recursive passes of any one-step transition model.
Most dynamics models are trained for single-step accuracy, which often results in models with substantial long-term prediction error.
Additionally, the model being trained for accurate transitions need not guarantee high-performance policies on the downstream task.
The lack of correlation between model and policy metrics in separate optimization is coined and studied as Objective Mismatch.

These challenges are primarily studied in the context of sample-based model predictive control (MPC) algorithms, where the learned model is used to simulate trajectories and their resulting predicted rewards.
To mitigate compounding error and objective mismatch, the trajectory-based dynamics model is a feedforward prediction parametrization containing a direct representation of time.
This model represents one small, but important steps towards more useful dynamics models in model-based reinforcement learning.
This thesis concludes with future directions on the synergy of prediction and control in MBRL, primarily focused on state-abstractions, temporal correlation, and future prediction methodologies.}
}

EndNote citation:

%0 Thesis
%A Lambert, Nathan
%T Synergy of Prediction and Control in Model-based Reinforcement Learning
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 11
%@ UCB/EECS-2022-65
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-65.html
%F Lambert:EECS-2022-65