What Supervision Scales? Practical Learning Through Interaction

Carlos Florensa Campo

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2020-128
May 30, 2020

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-128.pdf

To have an agent learn useful behaviors, we must be able to specify what the desired outcomes are. This supervision can come in many forms, such as the reward in Reinforcement Learning (RL), the target state-action pairs in Imitation Learning, or the dynamics model in motion planning. Each form of supervision must be evaluated along three cost dimensions that dictate how well it scales: how much domain knowledge is required to provide it, how much interaction under that supervision is needed to learn a task, and how this amount grows with every new task the agent has to learn. For example, guiding rewards provided at every time-step may speed up an RL algorithm, but dense rewards that are both easy to provide and induce a solution to the task at hand are hard to design, and the design process must be repeated for every new task. On the other hand, a completion signal is a weaker form of supervision: non-expert users can specify the objective of many tasks this way, but standard RL algorithms struggle with such sparse rewards. In the first part of this dissertation we study how to overcome this limitation by learning hierarchies over reusable skills. In the second part, we extend the scope to explicitly minimize the supervision needed to learn distributions of tasks. This paradigm shifts the focus away from the complexity of learning a single task, paving the way towards more general agents that learn efficiently from multiple tasks. To achieve this objective, we propose two automatic curriculum generation methods. In the third part, we investigate how to leverage different kinds of partial experts as supervision. First, we propose a method that does not require any reward and yet largely surpasses the performance of the demonstrator in goal-reaching tasks. This makes it possible to leverage sub-optimal “experts”, lowering the cost of the provided supervision. Finally, we explore how to exploit a rough description of a task together with an “expert” that can operate in only part of the state space. This is a common setting in robotic applications, where the model provided by the manufacturer enables efficient motion planning as long as there are no contacts or perception errors, but fails to complete the final contact-rich part of the task, such as inserting a key. These are all key pieces for providing supervision that scales to generate robotic behavior for practical tasks.
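
As a rough illustration of the dense-versus-sparse trade-off described in the abstract (a minimal sketch, not code from the dissertation; the goal position, tolerance, and function names are hypothetical), the two reward forms for a goal-reaching task could look like:

    import numpy as np

    # Hypothetical goal-reaching task: reach a target end-effector position.
    GOAL = np.array([0.5, 0.2, 0.1])   # illustrative target, not from the thesis
    TOLERANCE = 0.02                   # illustrative success threshold (meters)

    def dense_reward(state):
        # Guiding reward at every time-step: negative distance to the goal.
        # Easy for RL to optimize, but the shaping must be redesigned per task.
        return -np.linalg.norm(state - GOAL)

    def sparse_reward(state):
        # Completion signal: 1 only once the goal is reached, 0 otherwise.
        # Cheap for a non-expert to specify, but hard for standard RL to learn from.
        return float(np.linalg.norm(state - GOAL) < TOLERANCE)

The dense form encodes task-specific domain knowledge in the shaping itself, while the sparse form only states when the task is done, which is the kind of weak supervision the dissertation aims to learn from.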

Advisor: Pieter Abbeel


BibTeX citation:

@phdthesis{Florensa Campo:EECS-2020-128,
    Author = {Florensa Campo, Carlos},
    Title = {What Supervision Scales? Practical Learning Through Interaction},
    School = {EECS Department, University of California, Berkeley},
    Year = {2020},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-128.html},
    Number = {UCB/EECS-2020-128},
    Abstract = {To have an agent learn useful behaviors, we must be able to specify what the desired outcomes are. This supervision can come in many forms, such as the reward in Reinforcement Learning (RL), the target state-action pairs in Imitation Learning, or the dynamics model in motion planning. Each form of supervision must be evaluated along three cost dimensions that dictate how well it scales: how much domain knowledge is required to provide it, how much interaction under that supervision is needed to learn a task, and how this amount grows with every new task the agent has to learn. For example, guiding rewards provided at every time-step may speed up an RL algorithm, but dense rewards that are both easy to provide and induce a solution to the task at hand are hard to design, and the design process must be repeated for every new task. On the other hand, a completion signal is a weaker form of supervision: non-expert users can specify the objective of many tasks this way, but standard RL algorithms struggle with such sparse rewards. In the first part of this dissertation we study how to overcome this limitation by learning hierarchies over reusable skills. In the second part, we extend the scope to explicitly minimize the supervision needed to learn distributions of tasks. This paradigm shifts the focus away from the complexity of learning a single task, paving the way towards more general agents that learn efficiently from multiple tasks. To achieve this objective, we propose two automatic curriculum generation methods. In the third part, we investigate how to leverage different kinds of partial experts as supervision. First, we propose a method that does not require any reward and yet largely surpasses the performance of the demonstrator in goal-reaching tasks. This makes it possible to leverage sub-optimal “experts”, lowering the cost of the provided supervision. Finally, we explore how to exploit a rough description of a task together with an “expert” that can operate in only part of the state space. This is a common setting in robotic applications, where the model provided by the manufacturer enables efficient motion planning as long as there are no contacts or perception errors, but fails to complete the final contact-rich part of the task, such as inserting a key. These are all key pieces for providing supervision that scales to generate robotic behavior for practical tasks.}
}

EndNote citation:

%0 Thesis
%A Florensa Campo, Carlos
%T What Supervision Scales? Practical Learning Through Interaction
%I EECS Department, University of California, Berkeley
%D 2020
%8 May 30
%@ UCB/EECS-2020-128
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-128.html
%F Florensa Campo:EECS-2020-128