Exploration and Safety in Deep Reinforcement Learning

Joshua Achiam

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-34
May 7, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-34.pdf

Reinforcement learning (RL) agents need to explore their environments in order to learn optimal policies by trial and error. However, exploration is challenging when reward signals are sparse, or when safety is a critical concern and certain errors are unacceptable. In this thesis, we address these challenges in the deep reinforcement learning setting by modifying the underlying optimization problem that agents solve, incentivizing them to explore in safer or more-efficient ways.

In the first part of this thesis, we develop methods for intrinsic motivation to make progress on problems where rewards are sparse or absent. Our first approach uses an intrinsic reward to incentivize agents to visit states considered surprising under a learned dynamics model, and we show that this technique performs favorably compared to naive exploration. Our second approach uses an objective based on variational inference to endow agents with multiple skills that are distinct from each other, without the use of task-specific rewards. We show that this approach, which we call variational option discovery, can be used to learn locomotion behaviors in simulated robot environments.
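The surprise-based intrinsic reward mentioned above can be sketched minimally: treat "surprise" as the surprisal (negative log-likelihood) of the observed transition under the learned dynamics model. The Gaussian model and function names below are illustrative assumptions for this sketch, not the thesis's actual implementation.

```python
import numpy as np

def intrinsic_reward(pred_next_state, actual_next_state, sigma=1.0):
    """Surprisal of an observed transition under an (assumed) Gaussian
    dynamics model: r_int = -log p(s' | s, a), up to an additive constant.
    The reward is larger when the model's prediction is worse, i.e. when
    the transition is more 'surprising' to the agent."""
    err = np.asarray(actual_next_state) - np.asarray(pred_next_state)
    # Negative Gaussian log-likelihood, dropping the normalization constant.
    return 0.5 * float(np.sum(err ** 2)) / sigma ** 2
```

In practice this term would be added to (or substituted for) the environment reward, so that policy optimization drives the agent toward regions where the dynamics model is still inaccurate.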

In the second part of this thesis, we focus on problems in safe exploration. Building on a wide range of prior work on safe reinforcement learning, we propose to standardize constrained RL as the main formalism for safe exploration; we then proceed to develop algorithms and benchmarks for constrained RL. Our presentation of material tells a story in chronological order: we begin by presenting Constrained Policy Optimization (CPO), the first algorithm for constrained deep RL with guarantees of near-constraint satisfaction at each iteration. Next, we develop the Safety Gym benchmark, which allows us to find the limits of CPO and inspires us to press in a different direction. Finally, we develop PID Lagrangian methods, where we find that a small modification to the Lagrangian primal-dual gradient baseline approach results in significantly improved stability and robustness in solving constrained RL tasks in Safety Gym.
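The "small modification" to the primal-dual baseline can be sketched as follows: instead of updating the Lagrange multiplier by pure gradient ascent (an integral-only controller), treat the constraint violation as a control error and apply proportional and derivative terms as well. This is a hypothetical minimal sketch under those assumptions; class and parameter names are illustrative.

```python
class PIDLagrangian:
    """PID control of a Lagrange multiplier for one constraint.
    The plain primal-dual gradient method corresponds to the ki-only
    special case; the proportional and derivative terms damp the
    oscillations in constraint cost that the integral term alone exhibits."""

    def __init__(self, kp, ki, kd, cost_limit):
        self.kp, self.ki, self.kd = kp, ki, kd
        self.cost_limit = cost_limit
        self.integral = 0.0
        self.prev_error = 0.0

    def update(self, measured_cost):
        """Return the new multiplier given the latest constraint cost."""
        error = measured_cost - self.cost_limit        # constraint violation
        self.integral = max(0.0, self.integral + self.ki * error)
        derivative = max(0.0, error - self.prev_error)  # only penalize worsening
        self.prev_error = error
        # Project the multiplier to stay nonnegative.
        return max(0.0, self.kp * error + self.integral + self.kd * derivative)
```

The multiplier returned here would scale the constraint-cost term in the policy objective; with kp = kd = 0 the update reduces to the standard dual gradient ascent step.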

Advisors: S. Shankar Sastry and Pieter Abbeel


BibTeX citation:

@phdthesis{Achiam:EECS-2021-34,
    Author = {Achiam, Joshua},
    Title = {Exploration and Safety in Deep Reinforcement Learning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-34.html},
    Number = {UCB/EECS-2021-34},
    Abstract = {Reinforcement learning (RL) agents need to explore their environments in order to learn optimal policies by trial and error. However, exploration is challenging when reward signals are sparse, or when safety is a critical concern and certain errors are unacceptable. In this thesis, we address these challenges in the deep reinforcement learning setting by modifying the underlying optimization problem that agents solve, incentivizing them to explore in safer or more-efficient ways.

In the first part of this thesis, we develop methods for intrinsic motivation to make progress on problems where rewards are sparse or absent. Our first approach uses an intrinsic reward to incentivize agents to visit states considered surprising under a learned dynamics model, and we show that this technique performs favorably compared to naive exploration. Our second approach uses an objective based on variational inference to endow agents with multiple skills that are distinct from each other, without the use of task-specific rewards. We show that this approach, which we call variational option discovery, can be used to learn locomotion behaviors in simulated robot environments.

In the second part of this thesis, we focus on problems in safe exploration. Building on a wide range of prior work on safe reinforcement learning, we propose to standardize constrained RL as the main formalism for safe exploration; we then proceed to develop algorithms and benchmarks for constrained RL. Our presentation of material tells a story in chronological order: we begin by presenting Constrained Policy Optimization (CPO), the first algorithm for constrained deep RL with guarantees of near-constraint satisfaction at each iteration. Next, we develop the Safety Gym benchmark, which allows us to find the limits of CPO and inspires us to press in a different direction. Finally, we develop PID Lagrangian methods, where we find that a small modification to the Lagrangian primal-dual gradient baseline approach results in significantly improved stability and robustness in solving constrained RL tasks in Safety Gym.}
}

EndNote citation:

%0 Thesis
%A Achiam, Joshua
%T Exploration and Safety in Deep Reinforcement Learning
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 7
%@ UCB/EECS-2021-34
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-34.html
%F Achiam:EECS-2021-34