Shivam Singhal, Cassidy Laidlaw, and Anca Dragan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-148

July 11, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-148.pdf

As AI designers, we aim to develop AI agents that are not only capable but also able to understand the internal preferences of the humans with whom they interact, so that they accomplish the correct goals. However, humans are fundamentally complicated: we have dynamic, sometimes unknown or conflicting desiderata that are difficult to encode directly or to learn, and we do not always exhibit the optimal behavior that we would like AI to replicate. As a result, we still largely lack the ability to robustly model human preferences while accounting for this suboptimality. With such misspecified human models, AI systems cannot properly infer the goals we would like them to accomplish or align with the values we would like them to have. AI has become increasingly skilled at making complex decisions, but deploying misaligned AI in practice remains extremely dangerous, since there are no real guarantees about its behavior. Thus, in this thesis, we explore two avenues for achieving AI alignment despite our limitations. In particular, we propose a new regularization regime that prevents AI agents from hacking their specified rewards, and we present two new modeling strategies for learning from unreliable human feedback.

Advisor: Anca Dragan


BibTeX citation:

@mastersthesis{Singhal:EECS-2024-148,
    Author= {Singhal, Shivam and Laidlaw, Cassidy and Dragan, Anca},
    Title= {Achieving AI Alignment with Unreliable Supervision},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Jul},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-148.html},
    Number= {UCB/EECS-2024-148},
    Abstract= {As AI designers, we aim to develop AI agents that are not only capable but also able to understand the internal preferences of the humans with whom they interact, so that they accomplish the correct goals. However, humans are fundamentally complicated: we have dynamic, sometimes unknown or conflicting desiderata that are difficult to encode directly or to learn, and we do not always exhibit the optimal behavior that we would like AI to replicate. As a result, we still largely lack the ability to robustly model human preferences while accounting for this suboptimality. With such misspecified human models, AI systems cannot properly infer the goals we would like them to accomplish or align with the values we would like them to have. AI has become increasingly skilled at making complex decisions, but deploying misaligned AI in practice remains extremely dangerous, since there are no real guarantees about its behavior. Thus, in this thesis, we explore two avenues for achieving AI alignment despite our limitations. In particular, we propose a new regularization regime that prevents AI agents from hacking their specified rewards, and we present two new modeling strategies for learning from unreliable human feedback.},
}

EndNote citation:

%0 Thesis
%A Singhal, Shivam 
%A Laidlaw, Cassidy 
%A Dragan, Anca 
%T Achieving AI Alignment with Unreliable Supervision
%I EECS Department, University of California, Berkeley
%D 2024
%8 July 11
%@ UCB/EECS-2024-148
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-148.html
%F Singhal:EECS-2024-148