The Principal-Agent Alignment Problem in Artificial Intelligence
Dylan Hadfield-Menell
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2021-207
August 26, 2021
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.pdf
The field of artificial intelligence has made rapid progress in recent years, and that progress has also raised serious concerns, ranging from the immediate harms caused by systems that replicate harmful biases to the more distant worry that effective goal-directed systems may, at a certain level of performance, be able to subvert meaningful control efforts. In this dissertation, I argue the following thesis:
1. The use of incomplete or incorrect incentives to specify the target behavior for an autonomous system creates a value alignment problem between the principal(s), on whose behalf the system acts, and the system itself.
2. This value alignment problem can be addressed, in theory and in practice, by developing systems that are responsive to uncertainty about the principal's true, unobserved, intended goal.
3. Value alignment problems can be modeled as a class of cooperative assistance games, which are computationally similar to partially observable Markov decision processes (POMDPs). This model captures the principal's capacity to behave strategically in coordination with the autonomous system; it leads to solutions that differ from those of more traditional preference-learning approaches such as inverse reinforcement learning, and it demonstrates the need for strategically robust alignment solutions.
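To make the third claim concrete, here is a rough sketch of an assistance game written in the style of cooperative inverse reinforcement learning; the notation and the exact formulation below are illustrative, not quoted from the dissertation. The game is a two-player Markov game with identical payoffs in which the human principal H observes the reward parameters and the robot R does not:

\[
  M \;=\; \bigl\langle S,\ \{A^{H}, A^{R}\},\ T,\ \Theta,\ R,\ P_{0},\ \gamma \bigr\rangle,
  \qquad
  R : S \times A^{H} \times A^{R} \times \Theta \to \mathbb{R}.
\]

Here S is the set of world states, A^H and A^R are the human's and robot's action sets, T(s' | s, a^H, a^R) gives the transition dynamics, \Theta is the space of reward parameters (observed by the human but hidden from the robot), R is the reward function shared by both players, P_0 is a prior over the initial state and \theta, and \gamma is a discount factor. Because the only hidden quantity is \theta and both players optimize the same R, the robot's decision problem against a fixed human policy reduces to a partially observable Markov decision process whose hidden state includes \theta; this is the sense in which assistance games are computationally similar to POMDPs. The principal's ability to choose its policy strategically, rather than merely demonstrate behavior, is what distinguishes this model from passive preference-learning setups such as inverse reinforcement learning.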
Advisors: Stuart J. Russell, Pieter Abbeel, and Anca Dragan
BibTeX citation:
@phdthesis{Hadfield-Menell:EECS-2021-207,
  Author   = {Hadfield-Menell, Dylan},
  Title    = {The Principal-Agent Alignment Problem in Artificial Intelligence},
  School   = {EECS Department, University of California, Berkeley},
  Year     = {2021},
  Month    = {Aug},
  Url      = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.html},
  Number   = {UCB/EECS-2021-207},
  Abstract = {The field of artificial intelligence has seen serious progress in recent years, and has also caused serious concerns that range from the immediate harms caused by systems that replicate harmful biases to the more distant worry that effective goal-directed systems may, at a certain level of performance, be able to subvert meaningful control efforts. In this dissertation, I argue the following thesis: 1. The use of incomplete or incorrect incentives to specify the target behavior for an autonomous system creates a value alignment problem between the principal(s), on whose behalf a system acts, and the system itself; 2. This value alignment problem can be approached in theory and practice through the development of systems that are responsive to uncertainty about the principal's true, unobserved, intended goal; and 3. Value alignment problems can be modeled as a class of cooperative assistance games, which are computationally similar to the class of partially-observable Markov decision processes. This model captures the principal's capacity to behave strategically in coordination with the autonomous system. It leads to distinct solutions to alignment problems, compared with more traditional approaches to preference learning like inverse reinforcement learning, and demonstrates the need for strategically robust alignment solutions.},
}
EndNote citation:
%0 Thesis
%A Hadfield-Menell, Dylan
%T The Principal-Agent Alignment Problem in Artificial Intelligence
%I EECS Department, University of California, Berkeley
%D 2021
%8 August 26
%@ UCB/EECS-2021-207
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-207.html
%F Hadfield-Menell:EECS-2021-207