Michael Luo and Ashwin Balakrishna and Brijen Thananjeyan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-101

May 14, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-101.pdf

Reinforcement learning (RL) provides a flexible and general-purpose framework for learning new behaviors through interaction with the environment. However, safe exploration is critical to deploying reinforcement learning algorithms in risk-sensitive, real-world environments. Learning new tasks in unknown environments requires extensive exploration, but safety requires limiting exploration.

To navigate this tradeoff, we propose Recovery RL, an algorithm which (1) efficiently leverages offline data to learn about constraint-violating zones before policy learning and (2) separates the goals of improving task performance and constraint satisfaction across two policies: a task policy that only optimizes the task reward and a recovery policy that guides the agent back to safety when constraint violation is likely. Recovery RL can be applied on top of any RL algorithm. Simulation and physical experiments across 7 continuous control domains, including two contact-rich manipulation tasks and an image-based navigation task, suggest that Recovery RL trades off constraint violations and task successes 2-80x more efficiently than the next best prior methods, which jointly optimize task performance and safety via constrained optimization or reward shaping.
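As a concrete illustration of the action-selection rule described above, the following Python sketch composes a task policy and a recovery policy through a learned safety critic. All three components are toy stand-ins (the report trains neural networks for them from offline data and online experience), and the names task_policy, recovery_policy, safety_critic, and the threshold EPS_RISK are illustrative assumptions rather than the report's actual interfaces.

    import numpy as np

    # Toy stand-ins for the learned components; the real versions are
    # learned models, not these hand-written functions.
    def task_policy(state):
        """Proposes an action that optimizes only the task reward."""
        return np.tanh(state)

    def recovery_policy(state):
        """Proposes an action intended to steer the agent back to safety."""
        return -np.tanh(state)

    def safety_critic(state, action):
        """Estimates how likely a future constraint violation is."""
        return float(np.clip(np.linalg.norm(state + action) / 10.0, 0.0, 1.0))

    EPS_RISK = 0.3  # risk threshold (hyperparameter; value chosen arbitrarily here)

    def select_action(state):
        # Defer to the recovery policy whenever the task policy's proposed
        # action is judged too risky by the safety critic.
        proposed = task_policy(state)
        if safety_critic(state, proposed) > EPS_RISK:
            return recovery_policy(state)  # violation likely: recover
        return proposed                    # safe enough: pursue the task

    print(select_action(np.array([0.5, -1.2, 2.0])))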

Next, we generalize the problem of safe exploration to the transfer learning setting, where access to environments with similar dynamics is assumed. In this setting, safe exploration is recast as an offline meta-reinforcement learning problem, where the objective is to leverage datasets of safe and unsafe behavior across different environments to quickly adapt learned safety measures to new environments with unseen, perturbed dynamics. We propose MEta-learning for Safe Adaptation (MESA), an approach which meta-learns a safety measure and builds on top of Recovery RL. Simulation experiments across 5 continuous control domains suggest that MESA can leverage datasets from prior environments to reduce constraint violations in unseen environments by up to 2x while maintaining task performance, compared to prior algorithms that do not learn transferable risk measures.
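The meta-learning step can be pictured with a small sketch as well. The snippet below meta-trains a toy linear safety critic on offline datasets from several training environments and then adapts it with a few gradient steps to a new environment. The Reptile-style meta-update, the linear model, and the synthetic datasets are simplifying assumptions made for brevity, not the architecture, update rule, or data used by MESA.

    import numpy as np

    def predict(theta, X):
        """Toy linear-sigmoid critic: probability of constraint violation."""
        return 1.0 / (1.0 + np.exp(-X @ theta))

    def inner_update(theta, dataset, lr=0.1, steps=20):
        """Fit the critic to one environment's offline (features, violation) data."""
        X, y = dataset
        for _ in range(steps):
            grad = X.T @ (predict(theta, X) - y) / len(y)  # logistic-loss gradient
            theta = theta - lr * grad
        return theta

    def meta_train(datasets, dim, meta_lr=0.5, epochs=50):
        """Reptile-style meta-training: move the shared initialization toward
        the parameters adapted on each training environment's dataset."""
        theta = np.zeros(dim)
        for _ in range(epochs):
            for data in datasets:
                theta = theta + meta_lr * (inner_update(theta, data) - theta)
        return theta

    # Synthetic offline datasets from training environments with perturbed
    # dynamics (illustrative only; not the datasets used in the experiments).
    rng = np.random.default_rng(0)
    def make_env_dataset(shift, n=200, dim=4):
        X = rng.normal(size=(n, dim))
        y = (X[:, 0] + shift * X[:, 1] > 0.5).astype(float)  # violation labels
        return X, y

    theta0 = meta_train([make_env_dataset(s) for s in (0.5, 1.0, 1.5)], dim=4)
    theta_new = inner_update(theta0, make_env_dataset(2.0, n=20), steps=5)

In this picture, the adapted critic plays the role of the safety critic in the Recovery RL sketch above, flagging risky actions in the new environment after only a small amount of target-environment data.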

Advisors: Ion Stoica and Ken Goldberg


BibTeX citation:

@mastersthesis{Luo:EECS-2021-101,
    Author= {Luo, Michael and Balakrishna, Ashwin and Thananjeyan, Brijen},
    Editor= {Stoica, Ion and Goldberg, Ken},
    Title= {Safe and Sample-Efficient Reinforcement Learning},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-101.html},
    Number= {UCB/EECS-2021-101},
    Abstract= {Reinforcement learning (RL) provides a flexible and general-purpose framework for learning new behaviors through interaction with the environment. However, safe exploration is critical to deploying reinforcement learning algorithms in risk-sensitive, real-world environments. Learning new tasks in unknown environments requires extensive exploration, but safety requires limiting exploration. 

To navigate this tradeoff, we propose Recovery RL, an algorithm which (1) efficiently leverages offline data to learn about constraint-violating zones before policy learning and (2) separates the goals of improving task performance and constraint satisfaction across two policies: a task policy that only optimizes the task reward and a recovery policy that guides the agent back to safety when constraint violation is likely. Recovery RL can be applied on top of any RL algorithm. Simulation and physical experiments across 7 continuous control domains, including two contact-rich manipulation tasks and an image-based navigation task, suggest that Recovery RL trades off constraint violations and task successes 2-80x more efficiently than the next best prior methods, which jointly optimize task performance and safety via constrained optimization or reward shaping.

Next, we generalize the problem of safe exploration to the transfer learning setting, where access to environments with similar dynamics is assumed. In this setting, safe exploration is recast as an offline meta-reinforcement learning problem, where the objective is to leverage datasets of safe and unsafe behavior across different environments to quickly adapt learned safety measures to new environments with unseen, perturbed dynamics. We propose MEta-learning for Safe Adaptation (MESA), an approach which meta-learns a safety measure and builds on top of Recovery RL. Simulation experiments across 5 continuous control domains suggest that MESA can leverage datasets from prior environments to reduce constraint violations in unseen environments by up to 2x while maintaining task performance, compared to prior algorithms that do not learn transferable risk measures.},
}

EndNote citation:

%0 Thesis
%A Luo, Michael 
%A Balakrishna, Ashwin 
%A Thananjeyan, Brijen 
%E Stoica, Ion 
%E Goldberg, Ken 
%T Safe and Sample-Efficient Reinforcement Learning
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 14
%@ UCB/EECS-2021-101
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-101.html
%F Luo:EECS-2021-101