Albert Wilcox

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-152

May 12, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-152.pdf

Reinforcement learning (RL) has shown impressive results as a framework for learning to complete complex tasks in a wide variety of low- and high-dimensional environments. However, numerous challenges prevent it from being broadly useful as a tool for robotic control. First, while RL algorithms have been trained successfully with densely defined reward functions, such functions can be exceedingly difficult to specify and may lead to unintended behaviors. An alternative is to learn from sparse reward functions, using demonstrations to address the additional exploration challenges that sparse rewards introduce. Second, robots that explore randomly in the real world while running RL algorithms are known to exhibit dangerous behaviors, damaging themselves or their environments, or even harming people. In this dissertation, we propose using offline demonstrations of desirable behavior to guide online exploration, and present two projects that apply this idea toward addressing efficiency and safety problems in RL.
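To make the dense/sparse distinction concrete, the following illustrative Python sketch (not from the report; the goal-reaching task and function names are hypothetical) contrasts a shaped distance-based reward with a sparse success indicator of the kind that demonstrations help an agent explore under.

import numpy as np

def dense_reward(state, goal):
    # Dense shaping: negative distance to the goal at every step.
    # Easy to learn from, but hand-designed shaping can be hard to
    # specify and may induce unintended behavior.
    return -np.linalg.norm(state - goal)

def sparse_reward(state, goal, tol=1e-2):
    # Sparse alternative: reward only when the task is actually solved.
    # Much harder to explore under, which is where demonstrations help.
    return 1.0 if np.linalg.norm(state - goal) < tol else 0.0

state, goal = np.array([0.3, 0.1]), np.array([0.0, 0.0])
print(dense_reward(state, goal), sparse_reward(state, goal))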

A promising strategy for learning in dynamically uncertain environments is requiring that the agent can robustly return to learned safe sets, where task success (and therefore safety) can be guaranteed. While this approach has been successful in low-dimensional settings, enforcing this constraint in environments with visual observations is exceedingly challenging. We present a novel continuous representation of safe sets, framing safe-set estimation as a binary classification problem in a learned latent space, which scales flexibly to image observations. We then present a new algorithm, Latent Space Safe Sets (LS3), which uses this representation for long-horizon tasks with sparse rewards.
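As a rough illustration of the classification framing (a minimal sketch, not the exact LS3 architecture or training procedure; the latent dimension, network sizes, and labeling scheme below are assumptions), one could train a small PyTorch classifier on latent codes produced by a separately trained image encoder, labeling latents of states from successful trajectories as members of the safe set.

import torch
import torch.nn as nn

class LatentSafeSetClassifier(nn.Module):
    # Binary classifier over latent codes z: its sigmoid output
    # approximates the probability that z lies in the safe set.
    def __init__(self, latent_dim=32, hidden_dim=128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(latent_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim), nn.ReLU(),
            nn.Linear(hidden_dim, 1),  # logit
        )

    def forward(self, z):
        return self.net(z).squeeze(-1)

def train_step(model, optimizer, z_batch, labels):
    # labels: 1.0 for latents of states from successful trajectories, else 0.0
    logits = model(z_batch)
    loss = nn.functional.binary_cross_entropy_with_logits(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()

# Hypothetical usage with random tensors standing in for encoded observations.
model = LatentSafeSetClassifier()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
z = torch.randn(256, 32)
y = torch.randint(0, 2, (256,)).float()
print(train_step(model, opt, z, y))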

While prior work has used expert demonstrations to improve RL, these algorithms introduce algorithmic complexity and additional hyperparameters, making them hard to implement and tune. We introduce Monte Carlo Augmented Actor Critic (MCAC), a parameter-free modification to standard actor-critic algorithms that initializes the replay buffer with demonstrations and computes a modified Q-value by taking the maximum of the standard temporal difference (TD) target and a Monte Carlo estimate of the reward-to-go. This encourages exploration in the neighborhood of high-performing trajectories by encouraging high Q-values in the corresponding regions of the state space.
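A minimal sketch of the target computation described above (simplified and not the authoritative MCAC implementation; it ignores infinite-horizon tail handling and uses hypothetical array inputs): compute the discounted Monte Carlo reward-to-go along a stored trajectory and take the elementwise maximum with the one-step TD targets.

import numpy as np

def discounted_reward_to_go(rewards, gamma=0.99):
    # Monte Carlo estimate of the return from each step to the trajectory's end.
    rtg = np.zeros(len(rewards))
    running = 0.0
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg

def mcac_targets(rewards, next_q, dones, gamma=0.99):
    # Standard one-step TD targets ...
    td = rewards + gamma * (1.0 - dones) * next_q
    # ... clipped from below by the Monte Carlo reward-to-go, which keeps
    # Q-values high along (and near) successful demonstration trajectories.
    return np.maximum(td, discounted_reward_to_go(rewards, gamma))

rewards = np.array([0.0, 0.0, 0.0, 1.0])    # sparse success at the final step
next_q  = np.array([0.1, 0.05, 0.02, 0.0])  # bootstrapped critic estimates
dones   = np.array([0.0, 0.0, 0.0, 1.0])
print(mcac_targets(rewards, next_q, dones))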

Advisor: Ken Goldberg


BibTeX citation:

@mastersthesis{Wilcox:EECS-2023-152,
    Author= {Wilcox, Albert},
    Title= {Safe and Efficient Robot Learning by Biasing Exploration Towards Expert Demonstrations},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-152.html},
    Number= {UCB/EECS-2023-152},
}

EndNote citation:

%0 Thesis
%A Wilcox, Albert 
%T Safe and Efficient Robot Learning by Biasing Exploration Towards Expert Demonstrations
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-152
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-152.html
%F Wilcox:EECS-2023-152