Uncertain Reward-Transition MDPs for Negotiable Reinforcement Learning

Nishant Desai

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2017-231

December 16, 2017

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-231.pdf

Markov Decision Processes (MDPs) allow us to build policies for maximizing the expectation of an objective in stochastic environments where the state of the world is fully observed. Partially Observable MDPs (POMDPs) give us machinery for planning when we have additional uncertainty over the state of the world. We can make a similar jump from MDPs to characterize uncertainty over other elements of the environment. Namely, we can also have uncertainty over the transition and reward functions in an MDP. Here, we introduce new classes of uncertain MDPs for dealing with these kinds of uncertainty and present several folk theorems showing that certain subsets of these can be reduced to a standard POMDP with uncertainty only over the state. We are particularly interested in developing these frameworks to explore applications to Negotiable Reinforcement Learning, a method for dynamically balancing the utilities of multiple actors.
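
To illustrate the kind of reduction the abstract describes, the following is a minimal sketch, not taken from the report: an MDP whose reward function is uncertain (one of finitely many hypothetical candidates, indexed by theta) is recast as a POMDP whose hidden state is the pair (s, theta), with theta held fixed by the dynamics and the agent maintaining a Bayesian belief over it. All states, rewards, probabilities, and function names below are illustrative assumptions, not the report's construction.

    # Sketch (illustrative only): folding an unknown reward parameter theta
    # into the hidden state of a POMDP, so reward uncertainty becomes
    # ordinary state uncertainty. All numbers here are made up.
    import numpy as np

    rng = np.random.default_rng(0)

    # Base MDP: 2 states, 2 actions, known transition tensor T[s, a, s'].
    T = np.array([
        [[0.9, 0.1], [0.2, 0.8]],   # transitions from state 0
        [[0.5, 0.5], [0.1, 0.9]],   # transitions from state 1
    ])

    # Two candidate reward functions R[theta][s, a]; which is true is unknown.
    R = {
        0: np.array([[1.0, 0.0], [0.0, 1.0]]),
        1: np.array([[0.0, 1.0], [1.0, 0.0]]),
    }
    prior = {0: 0.5, 1: 0.5}   # prior belief over theta

    # POMDP hidden state is (s, theta): s evolves as in the base MDP,
    # theta never changes, and only s is directly observed.
    def pomdp_transition(s, theta, a):
        s_next = rng.choice(2, p=T[s, a])
        return s_next, theta

    def pomdp_reward(s, theta, a):
        return R[theta][s, a]

    def belief_update(belief, s, a, observed_reward):
        # Bayesian update over theta; rewards are deterministic given
        # (s, a, theta), so inconsistent candidates get zero weight.
        posterior = {
            theta: p * float(np.isclose(R[theta][s, a], observed_reward))
            for theta, p in belief.items()
        }
        total = sum(posterior.values())
        return {t: p / total for t, p in posterior.items()} if total > 0 else belief

    # Tiny rollout under a random policy, tracking the belief over theta.
    true_theta, s, belief = 1, 0, dict(prior)
    for _ in range(5):
        a = int(rng.integers(2))
        r = pomdp_reward(s, true_theta, a)
        belief = belief_update(belief, s, a, r)
        s, _ = pomdp_transition(s, true_theta, a)
    print("belief over theta after rollout:", belief)

The same device works for uncertain transition functions: treat the unknown transition parameter as another fixed hidden component, and planning in the resulting POMDP implicitly performs Bayesian inference over it.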

Advisor: Stuart J. Russell


BibTeX citation:

@mastersthesis{Desai:EECS-2017-231,
    Author= {Desai, Nishant},
    Title= {Uncertain Reward-Transition MDPs for Negotiable Reinforcement Learning},
    School= {EECS Department, University of California, Berkeley},
    Year= {2017},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-231.html},
    Number= {UCB/EECS-2017-231},
    Abstract= {Markov Decision Processes (MDPs) allow us to build policies for maximizing the expectation of an objective in stochastic environments where the state of the world is fully observed. Partially Observable MDPs (POMDPs) give us machinery for planning when we have additional uncertainty over the state of the world. We can make a similar jump from MDPs to characterize uncertainty over other elements of the environment. Namely, we can also have uncertainty over the transition and reward functions in an MDP. Here, we introduce new classes of uncertain MDPs for dealing with these kinds of uncertainty and present several folk theorems showing that certain subsets of these can be reduced to the standard POMDP with uncertainty only over the state. We are particularly interested in developing these frameworks to explore applications to Negotiable Reinforcement Learning, a method for dynamically balancing the utilities of multiple actors.},
}

EndNote citation:

%0 Thesis
%A Desai, Nishant 
%T Uncertain Reward-Transition MDPs for Negotiable Reinforcement Learning
%I EECS Department, University of California, Berkeley
%D 2017
%8 December 16
%@ UCB/EECS-2017-231
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-231.html
%F Desai:EECS-2017-231