Scott Emmons

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-1

January 3, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-1.pdf

We adopt the game-theoretic framework of assistance games to study the human-AI alignment problem. Past work on assistance games studied the case where both the human and the AI assistant fully observe the physical state of the environment. Generalizing to the case where the human and the assistant may only partially observe the environment, we present the partially observable assistance game (POAG). Using the framework of POAGs, we prove a variety of theoretical results about AI assistants. We first consider the question of observation interference, showing three distinct factors that can cause an optimal AI assistant to interfere with a human's observations. We then revisit past guarantees about the so-called off-switch problem, showing that partial observability poses a new challenge for designing AI assistants that allow themselves to be switched off. Finally, we characterize how partial observability can cause reinforcement learning from human feedback (a widely used algorithm for training AI assistants) to fall into deceptive failure modes. We conclude by discussing possible paths for translating these theoretical insights into improved techniques for creating beneficial AI assistants.
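For readers unfamiliar with the formalism, here is a minimal sketch of how a POAG might be written down, assuming standard POMDP-style notation; the symbols below are illustrative and the thesis's exact definition may differ:

\[
\mathcal{M} \;=\; \big\langle\, S,\; \{A^{\mathrm{H}}, A^{\mathrm{R}}\},\; T,\; \{O^{\mathrm{H}}, O^{\mathrm{R}}\},\; \{\Omega^{\mathrm{H}}, \Omega^{\mathrm{R}}\},\; \Theta,\; r,\; \gamma \,\big\rangle
\]

Here S is the set of physical states; A^H and A^R are the human's and the assistant's action sets; T(s' | s, a^H, a^R) is the transition distribution; O^H and O^R are the players' observation sets, with observation distributions Ω^H and Ω^R allowing each player to see only part of the state; Θ parameterizes the shared reward r(s, a^H, a^R; θ), with the reward parameter θ known to the human but not the assistant; and γ is a discount factor. The fully observable assistance game of prior work is recovered when both observation functions reveal the state exactly.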

Advisor: Stuart J. Russell


BibTeX citation:

@phdthesis{Emmons:EECS-2025-1,
    Author= {Emmons, Scott},
    Title= {The Alignment Problem Under Partial Observability},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {Jan},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-1.html},
    Number= {UCB/EECS-2025-1},
    Abstract= {We adopt the game-theoretic framework of assistance games to study the human-AI alignment problem. Past work on assistance games studied the case where both the human and the AI assistant fully observe the physical state of the environment. Generalizing to the case where the human and the assistant may only partially observe the environment, we present the partially observable assistance game (POAG). Using the framework of POAGs, we prove a variety of theoretical results about AI assistants. We first consider the question of observation interference, showing three distinct factors that can cause an optimal AI assistant to interfere with a human's observations. We then revisit past guarantees about the so-called off-switch problem, showing that partial observability poses a new challenge for designing AI assistants that allow themselves to be switched off. Finally, we characterize how partial observability can cause reinforcement learning from human feedback (a widely used algorithm for training AI assistants) to fall into deceptive failure modes. We conclude by discussing possible paths for translating these theoretical insights into improved techniques for creating beneficial AI assistants.},
}

EndNote citation:

%0 Thesis
%A Emmons, Scott 
%T The Alignment Problem Under Partial Observability
%I EECS Department, University of California, Berkeley
%D 2025
%8 January 3
%@ UCB/EECS-2025-1
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-1.html
%F Emmons:EECS-2025-1