The Principal-Agent Alignment Problem in Artificial Intelligence

Dylan Hadfield-Menell

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-207

August 26, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.pdf

The field of artificial intelligence has seen substantial progress in recent years, and this progress has also raised serious concerns that range from the immediate harms caused by systems that replicate harmful biases to the more distant worry that effective goal-directed systems may, at a certain level of performance, be able to subvert meaningful control efforts. In this dissertation, I argue the following thesis:

1. The use of incomplete or incorrect incentives to specify the target behavior for an autonomous system creates a value alignment problem between the principal(s), on whose behalf a system acts, and the system itself;

2. This value alignment problem can be approached in theory and practice through the development of systems that are responsive to uncertainty about the principal's true, unobserved, intended goal; and

3. Value alignment problems can be modeled as a class of cooperative assistance games, which are computationally similar to the class of partially observable Markov decision processes. This model captures the principal's capacity to behave strategically in coordination with the autonomous system. It leads to solutions to alignment problems that are distinct from those of more traditional approaches to preference learning, such as inverse reinforcement learning, and demonstrates the need for strategically robust alignment solutions.
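Point 3 above is easiest to see in miniature. The sketch below is a hypothetical toy example, not code from the dissertation: it illustrates why folding the principal's hidden reward parameter into the state turns an assistance game into a POMDP. The robot's belief over the hidden parameter theta is updated from observed human actions exactly like a POMDP belief state, and the robot acts on that belief rather than on a fixed point estimate. The two-goal setup, the Boltzmann human model, and all names and numbers are illustrative assumptions.

import numpy as np

# Hypothetical setup: the human's true goal is one of two reward vectors,
# each assigning a reward to the two available actions.
THETAS = np.array([[1.0, -1.0],    # theta_0: prefers action 0
                   [-1.0, 1.0]])   # theta_1: prefers action 1
ACTIONS = [0, 1]

def human_policy(theta, beta=2.0):
    # Boltzmann-rational human: acts noisily in proportion to reward.
    logits = beta * theta
    p = np.exp(logits - logits.max())
    return p / p.sum()

def update_belief(belief, observed_action, beta=2.0):
    # Bayesian update over theta after seeing one human action.
    # This is the POMDP belief update: theta is the hidden state.
    likelihoods = np.array([human_policy(t, beta)[observed_action] for t in THETAS])
    posterior = belief * likelihoods
    return posterior / posterior.sum()

def robot_action(belief):
    # Act to maximize expected reward under the current belief,
    # rather than committing to a single point estimate of theta.
    expected_reward = belief @ THETAS  # expected reward of each robot action
    return int(np.argmax(expected_reward))

rng = np.random.default_rng(0)
true_theta = THETAS[1]               # ground truth, unknown to the robot
belief = np.array([0.5, 0.5])        # uniform prior over the two goals

for step in range(5):
    # The human acts; the robot treats this as an observation about theta.
    a_h = rng.choice(ACTIONS, p=human_policy(true_theta))
    belief = update_belief(belief, a_h)
    print(f"step {step}: human chose {a_h}, belief = {belief.round(3)}, "
          f"robot would choose {robot_action(belief)}")

Note that this sketch captures only the belief-state reduction. In the full assistance-game model the human is also strategic, choosing actions partly to be informative to the robot, which is what separates its solutions from those of plain inverse reinforcement learning.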

Advisors: Stuart J. Russell, Pieter Abbeel, and Anca Dragan


BibTeX citation:

@phdthesis{Hadfield-Menell:EECS-2021-207,
    Author= {Hadfield-Menell, Dylan},
    Title= {The Principal-Agent Alignment Problem in Artificial Intelligence},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.html},
    Number= {UCB/EECS-2021-207},
    Abstract= {The field of artificial intelligence has seen substantial progress in recent years, and this progress has also raised serious concerns that range from the immediate harms caused by systems that replicate harmful biases to the more distant worry that effective goal-directed systems may, at a certain level of performance, be able to subvert meaningful control efforts. In this dissertation, I argue the following thesis: 1. The use of incomplete or incorrect incentives to specify the target behavior for an autonomous system creates a value alignment problem between the principal(s), on whose behalf a system acts, and the system itself; 2. This value alignment problem can be approached in theory and practice through the development of systems that are responsive to uncertainty about the principal's true, unobserved, intended goal; and 3. Value alignment problems can be modeled as a class of cooperative assistance games, which are computationally similar to the class of partially observable Markov decision processes. This model captures the principal's capacity to behave strategically in coordination with the autonomous system. It leads to solutions to alignment problems that are distinct from those of more traditional approaches to preference learning, such as inverse reinforcement learning, and demonstrates the need for strategically robust alignment solutions.},
}

EndNote citation:

%0 Thesis
%A Hadfield-Menell, Dylan
%T The Principal-Agent Alignment Problem in Artificial Intelligence
%I EECS Department, University of California, Berkeley
%D 2021
%8 August 26
%@ UCB/EECS-2021-207
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.html
%F Hadfield-Menell:EECS-2021-207