Shaping Model-Free Reinforcement Learning with Model-Based Pseudorewards

Paul Krueger, Thomas Griffiths, and Stuart J. Russell

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2017-80

May 12, 2017

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-80.pdf

Model-free and model-based reinforcement learning have provided a successful framework for understanding both human behavior and neural data. These two systems are usually thought to compete for control of behavior. However, it has also been proposed that they can be integrated in a cooperative manner. For example, the Dyna algorithm uses model-based replay of past experience to train the model-free system, and has inspired research examining whether human learners do something similar. Here we introduce an approach that links model-free and model-based learning in a new way: via the reward function. Given a model of the learning environment, dynamic programming is used to iteratively estimate state values that monotonically converge to the state values under the optimal decision policy. Pseudorewards are calculated from these values and used to shape the reward function of a model-free learner in a way that is guaranteed not to change the optimal policy. In two experiments we show that this method offers computational advantages over Dyna. It also offers a new way to think about integrating model-free and model-based reinforcement learning: that our knowledge of the world doesn't just provide a source of simulated experience for training our instincts, but that it shapes the rewards that those instincts latch onto.
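
To make the pipeline described above concrete, the sketch below shows one way it could be realized: a few sweeps of value iteration produce approximate state values, those values serve as a potential function for pseudorewards of the form F(s, s') = gamma*V(s') - V(s) (potential-based shaping in the style of Ng, Harada, and Russell, 1999), and a tabular Q-learner trains on the shaped reward. The environment interface, helper names, and hyperparameters are illustrative assumptions for this sketch, not code from the report.

    # Minimal sketch of model-based pseudoreward shaping for a model-free learner.
    # The environment interface (P, R, env.reset/step) and hyperparameters are assumptions.
    import numpy as np

    def value_iteration(P, R, gamma, n_iters):
        """Iteratively estimate state values V(s) with dynamic programming.

        P[s][a] is a list of (probability, next_state) pairs and R[s][a] is the
        expected reward for action a in state s (an assumed model interface).
        """
        n_states = len(P)
        V = np.zeros(n_states)
        for _ in range(n_iters):
            # Synchronous Bellman backup over all states using the previous V.
            V = np.array([
                max(R[s][a] + gamma * sum(p * V[s2] for p, s2 in P[s][a])
                    for a in range(len(P[s])))
                for s in range(n_states)
            ])
        return V

    def pseudoreward(V, s, s2, gamma):
        """Potential-based pseudoreward F(s, s') = gamma * V(s') - V(s).

        Adding F to the environment reward cannot change the optimal policy,
        so even rough value estimates can safely shape the model-free learner.
        """
        return gamma * V[s2] - V[s]

    def shaped_q_learning(env, V, gamma=0.95, alpha=0.1, epsilon=0.1, n_episodes=500):
        """Tabular Q-learning trained on rewards shaped by the pseudorewards."""
        Q = np.zeros((env.n_states, env.n_actions))
        for _ in range(n_episodes):
            s = env.reset()
            done = False
            while not done:
                # Epsilon-greedy action selection.
                a = (np.random.randint(env.n_actions) if np.random.rand() < epsilon
                     else int(np.argmax(Q[s])))
                s2, r, done = env.step(a)  # assumed env interface
                r_shaped = r + pseudoreward(V, s, s2, gamma)
                target = r_shaped if done else r_shaped + gamma * np.max(Q[s2])
                Q[s, a] += alpha * (target - Q[s, a])
                s = s2
        return Q

Because potential-based shaping leaves the optimal policy unchanged, the quality of the value estimates only affects how quickly the model-free learner converges, not what it converges to.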

Advisor: Stuart J. Russell


BibTeX citation:

@mastersthesis{Krueger:EECS-2017-80,
    Author= {Krueger, Paul and Griffiths, Thomas and Russell, Stuart J.},
    Title= {Shaping Model-Free Reinforcement Learning with Model-Based Pseudorewards},
    School= {EECS Department, University of California, Berkeley},
    Year= {2017},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-80.html},
    Number= {UCB/EECS-2017-80},
    Abstract= {Model-free and model-based reinforcement learning have provided a successful framework for understanding both human behavior and neural data. These two systems are usually thought to compete for control of behavior. However, it has also been proposed that they can be integrated in a cooperative manner. For example, the Dyna algorithm uses model-based replay of past experience to train the model-free system, and has inspired research examining whether human learners do something similar. Here we introduce an approach that links model-free and model-based learning in a new way: via the reward function. Given a model of the learning environment, dynamic programming is used to iteratively estimate state values that monotonically converge to the state values under the optimal decision policy. Pseudorewards are calculated from these values and used to shape the reward function of a model-free learner in a way that is guaranteed not to change the optimal policy. In two experiments we show that this method offers computational advantages over Dyna. It also offers a new way to think about integrating model-free and model-based reinforcement learning: that our knowledge of the world doesn't just provide a source of simulated experience for training our instincts, but that it shapes the rewards that those instincts latch onto.},
}

EndNote citation:

%0 Thesis
%A Krueger, Paul 
%A Griffiths, Thomas 
%A Russell, Stuart J. 
%T Shaping Model-Free Reinforcement Learning with Model-Based Pseudorewards
%I EECS Department, University of California, Berkeley
%D 2017
%8 May 12
%@ UCB/EECS-2017-80
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-80.html
%F Krueger:EECS-2017-80