### Adam Gleave

EECS Department

University of California, Berkeley

Technical Report No. UCB/EECS-2022-260

December 5, 2022

### http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-260.pdf

Real-world applications of machine learning often have complex objectives and safety-critical constraints. Contemporary machine learning systems excel at achieving high average-case performance at tasks with simple procedurally specified objectives, but they struggle at many more demanding real-world tasks. In this thesis, we work towards developing trustworthy machine learning systems that understand human values and reliably optimize them.

Machine learning’s key insight was that it is often easier to learn an algorithm than to write it down directly—yet many machine learning systems still have a hard-coded, procedurally specified objective. The field of reward learning applies this insight to instead learn the objective itself. As there is a many-to-one mapping between reward functions and objectives, we start by introducing the notion of equivalence classes consisting of reward functions that specify the same objective.
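The many-to-one mapping can be illustrated on a small tabular example. The sketch below assumes that "the same objective" means "the same optimal policy", and uses two well-known reward transformations that preserve it: positive rescaling and potential-based shaping. The MDP, the potential `phi`, the constant `scale`, and the helper `optimal_policy` are all hypothetical choices made for illustration, not taken from the thesis.

```python
import numpy as np

def optimal_policy(T, R, gamma, iters=2000):
    """Greedy policy from value iteration on a tabular MDP.

    T[s, a, s2] holds transition probabilities and R[s, a, s2] rewards.
    """
    V = np.zeros(T.shape[0])
    for _ in range(iters):
        # Expected one-step reward plus discounted next-state value.
        Q = np.einsum("ijk,ijk->ij", T, R + gamma * V[None, None, :])
        V = Q.max(axis=1)
    return Q.argmax(axis=1)

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9

# Random transition dynamics (rows normalized to sum to 1) and rewards.
T = rng.random((n_states, n_actions, n_states))
T /= T.sum(axis=2, keepdims=True)
R = rng.normal(size=(n_states, n_actions, n_states))

# A different reward function: R rescaled by a positive constant and
# shaped by an arbitrary potential phi (both are illustrative assumptions).
phi = rng.normal(size=n_states)
scale = 2.5
R_equiv = scale * R + gamma * phi[None, None, :] - phi[:, None, None]

pi = optimal_policy(T, R, gamma)
pi_equiv = optimal_policy(T, R_equiv, gamma)
print(pi, pi_equiv)
```

Although `R` and `R_equiv` differ numerically at every transition, both yield the same greedy policy, so under the stated assumption they fall into the same equivalence class.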

In the first part of the dissertation, we apply this notion of equivalence classes to three distinct settings. First, we study reward function identifiability: what set of reward functions is compatible with the data? We start by categorizing the equivalence classes of reward functions that induce the same data. By comparing these to the aforementioned optimal policy equivalence class, we can determine whether a given data source provides sufficient information to recover the optimal policy.

Second, we address the fundamental question of how similar or dissimilar two reward function equivalence classes are. We introduce a distance metric over these equivalence classes, the Equivalent-Policy Invariant Comparison (EPIC), and show that rewards with low EPIC distance induce policies with similar returns even under different transition dynamics. Finally, we introduce an interpretability method for reward function equivalence classes. The method selects the easiest-to-understand representative from the equivalence class, and then visualizes the representative function.
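As a rough illustration of the idea behind EPIC, the sketch below computes a distance of this kind for tabular rewards: each reward is first mapped to a canonical representative that removes potential shaping, and the two representatives are then compared with a Pearson distance, which removes positive rescaling. It assumes uniform distributions over states, actions, and transitions, and the function names `canonicalize` and `epic_distance` are hypothetical; it is a simplified reading of the published definition, not the thesis implementation.

```python
import numpy as np

def canonicalize(R, gamma):
    """Canonical representative of a tabular reward R[s, a, s2].

    Assumes uniform distributions over states and actions.
    """
    # mean_from[s] = E[R(s, A, S')] over uniform actions and next states.
    mean_from = R.mean(axis=(1, 2))
    return (R
            + gamma * mean_from[None, None, :]   # gamma * E[R(s', A, S')]
            - mean_from[:, None, None]           # E[R(s, A, S')]
            - gamma * R.mean())                  # gamma * E[R(S, A, S')]

def epic_distance(R_a, R_b, gamma):
    """Pearson distance between canonicalized rewards, in [0, 1]."""
    c_a = canonicalize(R_a, gamma).ravel()
    c_b = canonicalize(R_b, gamma).ravel()
    rho = np.corrcoef(c_a, c_b)[0, 1]
    # Clip guards against tiny negative values from floating-point error.
    return float(np.sqrt(np.clip(1.0 - rho, 0.0, 2.0) / 2.0))

rng = np.random.default_rng(0)
n_states, n_actions, gamma = 5, 3, 0.9
R = rng.normal(size=(n_states, n_actions, n_states))

# Same equivalence class: positive rescaling plus potential-based shaping.
phi = rng.normal(size=n_states)
R_equiv = 2.5 * R + gamma * phi[None, None, :] - phi[:, None, None]

# An unrelated reward, and the exact opposite of R.
R_other = rng.normal(size=R.shape)

d_equiv = epic_distance(R, R_equiv, gamma)
d_other = epic_distance(R, R_other, gamma)
d_opposite = epic_distance(R, -R, gamma)
print(d_equiv, d_other, d_opposite)
```

In this sketch the distance between `R` and `R_equiv` is zero up to floating-point error, the distance to the negated reward is the maximum value of 1, and an unrelated reward lands in between.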

In the second part of the dissertation, we study the adversarial robustness of models. We start by introducing a physically realistic threat model consisting of an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial to the defender. We train the adversary using deep RL against a frozen state-of-the-art defender that was trained via self-play to be robust to opponents. We find this attack reliably wins against both state-of-the-art simulated robotics RL agents and superhuman Go programs.

Finally, we investigate ways to improve agent robustness. We find adversarial training is ineffective; however, population-based training offers hope as a partial defense: it does not prevent the attack, but it does increase the computational burden of the attacker. Using explicit planning also helps, as we find that defenders with large amounts of search are harder to exploit.

**Advisor:** Stuart J. Russell

BibTeX citation:

@phdthesis{Gleave:EECS-2022-260,
Author = {Gleave, Adam},
Title = {Towards Trustworthy Machine Learning},
School = {EECS Department, University of California, Berkeley},
Year = {2022},
Month = {Dec},
URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-260.html},
Number = {UCB/EECS-2022-260},
Abstract = {Real-world applications of machine learning often have complex objectives and safety-critical constraints. Contemporary machine learning systems excel at achieving high average-case performance at tasks with simple procedurally specified objectives, but they struggle at many more demanding real-world tasks. In this thesis, we work towards developing trustworthy machine learning systems that understand human values and reliably optimize them. Machine learning’s key insight was that it is often easier to learn an algorithm than to write it down directly---yet many machine learning systems still have a hard-coded, procedurally specified objective. The field of reward learning applies this insight to instead learn the objective itself. As there is a many-to-one mapping between reward functions and objectives, we start by introducing the notion of equivalence classes consisting of reward functions that specify the same objective. In the first part of the dissertation, we apply this notion of equivalence classes to three distinct settings. First, we study reward function identifiability: what set of reward functions is compatible with the data? We start by categorizing the equivalence classes of reward functions that induce the same data. By comparing these to the aforementioned optimal policy equivalence class, we can determine whether a given data source provides sufficient information to recover the optimal policy. Second, we address the fundamental question of how similar or dissimilar two reward function equivalence classes are. We introduce a distance metric over these equivalence classes, the Equivalent-Policy Invariant Comparison (EPIC), and show rewards with low EPIC distance induce policies with similar returns even under different transition dynamics. Finally, we introduce an interpretability method for reward function equivalence classes. The method selects the easiest to understand representative from the equivalence class, and then visualizes the representative function. In the second part of the dissertation, we study the adversarial robustness of models. We start by introducing a physically realistic threat model consisting of an adversarial policy acting in a multi-agent environment so as to create natural observations that are adversarial to the defender. We train the adversary using deep RL against a frozen state-of-the-art defender that was trained via self-play to be robust to opponents. We find this attack reliably wins against state-of-the-art simulated robotics RL agents, and superhuman Go programs. Finally, we investigate ways to improve agent robustness. We find adversarial training is ineffective, however population-based training offers hope as a partial defense: it does not prevent the attack, but it does increase the computational burden of the attacker. Using explicit planning also helps, as we find that defenders with large amounts of search are harder to exploit.}
}

EndNote citation:

%0 Thesis
%A Gleave, Adam
%T Towards Trustworthy Machine Learning
%I EECS Department, University of California, Berkeley
%D 2022
%8 December 5
%@ UCB/EECS-2022-260
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-260.html
%F Gleave:EECS-2022-260