Uncertain Reward-Transition MDPs for Negotiable Reinforcement Learning

THIS REPORT HAS BEEN WITHDRAWN

Nishant Desai

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2017-230
December 16, 2017

Markov Decision Processes (MDPs) allow us to build policies that maximize the expectation of an objective in stochastic environments where the state of the world is fully observed. Partially Observable MDPs (POMDPs) give us machinery for planning when we additionally have uncertainty over the state of the world. We can make a similar jump from MDPs to characterize uncertainty over other elements of the environment; in particular, we can have uncertainty over the transition and reward functions of an MDP. Here, we introduce new classes of uncertain MDPs for dealing with these kinds of uncertainty and present several folk theorems showing that certain subclasses of these models can be reduced to standard POMDPs, which have uncertainty only over the state. We are particularly interested in developing these frameworks to explore applications to Negotiable Reinforcement Learning, a method for dynamically balancing the utilities of multiple actors.
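
As an illustration of the kind of reduction the abstract refers to, the following is a minimal sketch, in our own notation rather than the report's, of how an MDP whose reward function depends on an unknown parameter theta drawn from a known prior can be recast as a POMDP whose hidden state carries theta while only the physical state is observed (amsmath assumed):

    % Minimal sketch (assumed notation): reducing an uncertain-reward MDP to a POMDP.
    % Theta, P(theta), and the tilde quantities are illustrative, not the report's definitions.
    \[
      \text{Given an MDP } \langle S, A, T, R_\theta \rangle \text{ with } \theta \sim P(\theta),
      \text{ define the POMDP } \langle \tilde{S}, A, \tilde{T}, \tilde{R}, \Omega, O \rangle \text{ by}
    \]
    \begin{align*}
      \tilde{S} &= S \times \Theta, \\
      \tilde{T}\big((s',\theta') \mid (s,\theta), a\big) &= T(s' \mid s, a)\,\mathbf{1}[\theta' = \theta], \\
      \tilde{R}\big((s,\theta), a\big) &= R_\theta(s, a), \\
      \Omega &= S, \qquad O\big(o \mid (s,\theta), a\big) = \mathbf{1}[o = s].
    \end{align*}

Because theta never changes under the augmented transition function, a belief over the POMDP state is exactly a joint belief over the physical state and the reward parameter; this is the sense in which uncertainty over the reward function can be treated as ordinary state uncertainty.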

Advisor: Stuart J. Russell

Author Comments: see http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-231.html