Metareasoning for Monte Carlo Tree Search

Nicholas Hay and Stuart J. Russell

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2011-119

November 20, 2011

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-119.pdf

Sequential decision problems are often approximately solvable by simulating possible future action sequences; such methods are a staple of game-playing algorithms, robot path planners, model-predictive control systems, and logistical planners in operations research. Since the 1960s, researchers have sought effective metareasoning methods for selecting which action sequences to simulate, basing their approach on some estimate of the expected improvement in decision quality resulting from any particular simulation. Recently, this approach has been applied successfully in the context of Monte Carlo tree search, where each simulation takes the form of a randomized sequence of actions leading to a terminal state. In particular, the UCT algorithm borrows asymptotically optimal selection rules from the theory of bandit problems and has led to a new generation of master-level Go programs such as MoGo. We argue that, despite this success, the bandit framework is inappropriate as a basis for selecting computations. We propose instead a theoretical framework for metareasoning that is isomorphic to the statistical framework of ranking and selection. In this framework, we describe two apparently distinct conceptual approaches to the forward search metareasoning problem and prove them to be equivalent. We derive a number of basic results applicable to simple Monte Carlo selection problems, including asymptotic regret bounds, and discuss prospects for their extension to combinatorial settings.
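For readers unfamiliar with UCT's bandit-derived rule, the sketch below shows the UCB1 selection step that UCT applies at each tree node: pick the child maximizing empirical mean reward plus an exploration bonus. This is the standard bandit rule the report argues is inappropriate for selecting computations, shown here only to make that baseline concrete; the Node fields and the exploration constant c are illustrative assumptions, not taken from the report.

    import math
    from dataclasses import dataclass

    @dataclass
    class Node:
        visits: int = 0            # number of simulations through this child
        total_reward: float = 0.0  # sum of rewards from those simulations

    def uct_select(children, parent_visits, c=math.sqrt(2)):
        """Return the child maximizing the UCB1 score: mean reward
        plus an exploration bonus that shrinks as visits accumulate."""
        def ucb1(child):
            if child.visits == 0:
                return float("inf")  # always try unvisited children first
            exploit = child.total_reward / child.visits
            explore = c * math.sqrt(math.log(parent_visits) / child.visits)
            return exploit + explore
        return max(children, key=ucb1)

    # Example: two sampled children and one unvisited child.
    kids = [Node(10, 6.0), Node(3, 2.4), Node(0, 0.0)]
    best = uct_select(kids, parent_visits=13)  # picks the unvisited child first

Note that UCB1 optimizes cumulative reward over the actions actually taken, whereas a metareasoning agent pays nothing for simulating a bad action and cares only about the quality of its final choice; that mismatch is the motivation for the ranking-and-selection framing described in the abstract.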


BibTeX citation:

@techreport{Hay:EECS-2011-119,
    Author= {Hay, Nicholas and Russell, Stuart J.},
    Title= {Metareasoning for Monte Carlo Tree Search},
    Institution= {EECS Department, University of California, Berkeley},
    Year= {2011},
    Month= {Nov},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-119.html},
    Number= {UCB/EECS-2011-119},
    Abstract= {Sequential decision problems are often approximately solvable by simulating possible future action sequences; such methods are a staple of game-playing algorithms, robot path planners, model-predictive control systems, and logistical planners in operations research. Since the 1960s, researchers have sought effective metareasoning methods for selecting which action sequences to simulate, basing their approach on some estimate of the expected improvement in decision quality resulting from any particular simulation.  Recently, this approach has been applied successfully in the context of Monte Carlo tree search, where each simulation takes the form of a randomized sequence of actions leading to a terminal state.  In particular, the UCT algorithm borrows asymptotically optimal selection rules from the theory of bandit problems and has led to a new generation of master-level Go programs such as MoGo.  We argue that, despite this success, the bandit framework is inappropriate as a basis for selecting computations.  We propose instead a theoretical framework for metareasoning that is isomorphic to the statistical framework of ranking and selection. In this framework, we describe two apparently distinct conceptual approaches to the forward search metareasoning problem and prove them to be equivalent. We derive a number of basic results applicable to simple Monte Carlo selection problems, including asymptotic regret bounds, and discuss prospects for their extension to combinatorial settings.},
}

EndNote citation:

%0 Report
%A Hay, Nicholas 
%A Russell, Stuart J. 
%T Metareasoning for Monte Carlo Tree Search
%I EECS Department, University of California, Berkeley
%D 2011
%8 November 20
%@ UCB/EECS-2011-119
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-119.html
%F Hay:EECS-2011-119