# Efficient Sequential Decision Making

### http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-94.pdf

This thesis studies three problems in sequential decision making across two different frameworks. The first framework we consider is online learning: in each round of a $T$-round repeated game, the learner makes a prediction, the adversary observes this prediction and reveals the true outcome, and the learner suffers a loss based on the accuracy of the prediction. The learner's aim is to minimize the regret, defined as the difference between the learner's cumulative loss and the cumulative loss of the best prediction strategy in some comparison class. We study the minimax strategy, which guarantees the lowest regret against all possible adversary strategies. In general, computing the minimax strategy takes time exponential in $T$; we focus on two settings where efficient algorithms are possible.
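For concreteness, the regret after $T$ rounds can be written as follows (a standard formulation; the specific loss $\ell$ and comparison class $\mathcal{A}$ vary by setting):

```latex
% Regret of the learner's predictions a_1, ..., a_T against outcomes
% x_1, ..., x_T, relative to the best fixed action from a class A:
\mathrm{Regret}_T
  = \sum_{t=1}^{T} \ell(a_t, x_t)
  - \min_{a \in \mathcal{A}} \sum_{t=1}^{T} \ell(a, x_t)
```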

The first is prediction under squared Euclidean loss. The learner predicts a point in $\mathbb{R}^d$ and the adversary is constrained to respond with a point in some compact set. The regret is measured with respect to the single best prediction in the set. We compute the minimax strategy and the value of the game for any compact set and show that the value is the product of a horizon-dependent constant and the squared radius of the smallest enclosing ball of the set. We also present the optimal strategy of the adversary for two important sets: ellipsoids, and polytopes that intersect their smallest enclosing ball at all vertices. The minimax strategy can be cast as a simple shrinkage of the past data towards the center of this smallest enclosing ball, where the shrinkage factor can be efficiently computed before the start of the game. Noting that the value has no explicit dependence on the dimension, we then extend these results to Hilbert space, finding, once again, that the value is proportional to the squared radius of the smallest enclosing ball.
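To illustrate the shape of such a strategy, here is a minimal sketch, not the thesis's exact coefficients: we assume the center `c` of the smallest enclosing ball is known and a per-round shrinkage factor `lam_t` (which the thesis derives in closed form before the game starts) is given, and the prediction shrinks the running mean of past outcomes toward `c`.

```python
import numpy as np

def shrinkage_prediction(past_outcomes, c, lam_t):
    """Predict by shrinking the mean of past outcomes toward the
    ball center c. `lam_t` in [0, 1] is a hypothetical per-round
    shrinkage factor standing in for the precomputed schedule."""
    if len(past_outcomes) == 0:
        return c  # nothing observed yet: play the center
    mean = np.mean(past_outcomes, axis=0)
    return c + lam_t * (mean - c)

# Usage: two rounds of outcomes in the unit ball centered at the origin
c = np.zeros(2)
xs = [np.array([0.5, 0.1]), np.array([-0.2, 0.4])]
print(shrinkage_prediction(xs, c, lam_t=0.6))
```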

The second setting where we derive efficient minimax strategies is online linear regression. At the start of each round, the adversary chooses and reveals a vector of covariates. The regret is defined with respect to the best linear function of the covariates. We show that the minimax strategy is an easily computed linear predictor, provided that the adversary adheres to some natural constraints that prevent him from misrepresenting the scale of the problem. This strategy is horizon-independent: regardless of the length of the game, it incurs no more regret than any strategy that has knowledge of the number of rounds. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer and obtain an explicit expression for the minimax regret.
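As a rough illustration of the follow-the-regularized-leader view, the sketch below uses a generic ridge regularizer rather than the thesis's data-dependent one: each round, the predictor fits a regularized least-squares solution to the history and applies it to the newly revealed covariates.

```python
import numpy as np

def ftrl_linear_prediction(X_past, y_past, x_t, reg=1.0):
    """Follow-the-regularized-leader with a quadratic regularizer:
    w_t = argmin_w sum_s (w^T x_s - y_s)^2 + reg * ||w||^2,
    then predict w_t^T x_t. The minimax strategy in the thesis has
    the same linear form but a specific data-dependent regularizer."""
    d = x_t.shape[0]
    A = reg * np.eye(d)
    b = np.zeros(d)
    for x_s, y_s in zip(X_past, y_past):
        A += np.outer(x_s, x_s)
        b += y_s * x_s
    w_t = np.linalg.solve(A, b)
    return w_t @ x_t

# Usage: the adversary reveals x_t; the learner predicts before seeing y_t
X_past = [np.array([1.0, 0.0]), np.array([0.5, 0.5])]
y_past = [1.0, 0.8]
x_t = np.array([0.2, 0.9])
print(ftrl_linear_prediction(X_past, y_past, x_t))
```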

We then turn to the second framework, reinforcement learning. More specifically, we consider the problem of controlling a Markov decision process (MDP) with a large state space. Since it is intractable to compete with the optimal policy for large-scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. Specifically, we restrict the variables of the dual linear program to lie in a low-dimensional subspace and show that we can find a policy that performs almost as well as the best policy in this class. We derive separate results for the average-cost and discounted-cost cases. Most importantly, the complexity of our method depends on the size of the comparison class but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queueing application.
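To make the dual-LP restriction concrete, here is a naive sketch on a toy average-cost MDP. The random MDP and the feature matrix `Phi` are hypothetical, and this version solves the restricted LP exactly; the thesis instead penalizes constraint violations and uses stochastic methods so that nothing scales with the state space. Note also that an arbitrary subspace may leave this naive LP infeasible.

```python
import numpy as np
from scipy.optimize import linprog

# Toy average-cost MDP: S states, A actions, random transitions and costs.
rng = np.random.default_rng(0)
S, A, k = 6, 2, 3                            # k-dim subspace, k << S*A
P = rng.dirichlet(np.ones(S), size=(S, A))   # P[s, a] is a distribution over s'
c = rng.random((S, A)).ravel()               # cost vector over (s, a) pairs
Phi = rng.random((S * A, k))                 # hypothetical feature matrix

# Dual LP over stationary distributions mu(s, a), restricted to mu = Phi @ theta:
#   minimize  c^T Phi theta
#   s.t.      sum_a mu(s', a) = sum_{s,a} mu(s, a) P(s' | s, a)   (stationarity)
#             sum mu = 1,  mu >= 0
F = np.zeros((S, S * A))                     # stationarity (flow) constraints
for s in range(S):
    for a in range(A):
        F[:, s * A + a] -= P[s, a]           # inflow into each s'
        F[s, s * A + a] += 1.0               # outflow from s
A_eq = np.vstack([F @ Phi, np.ones(S * A) @ Phi])
b_eq = np.append(np.zeros(S), 1.0)
res = linprog(c @ Phi, A_ub=-Phi, b_ub=np.zeros(S * A),
              A_eq=A_eq, b_eq=b_eq, bounds=[(None, None)] * k)
if res.success:
    mu = Phi @ res.x                         # approximate stationary distribution
    print("average cost of subspace-restricted solution:", c @ mu)
else:
    print("restricted LP infeasible for this subspace")
```

The key point the sketch shows is that the LP has only `k` decision variables, so the optimization cost is governed by the dimension of the comparison class rather than the number of states.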

BibTeX citation:

```
@phdthesis{Malek:EECS-2017-94,
Author = {Malek, Alan},
Title = {Efficient Sequential Decision Making},
School = {EECS Department, University of California, Berkeley},
Year = {2017},
Month = {May},
URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-94.html},
Number = {UCB/EECS-2017-94},
Abstract = {This thesis studies three problems in sequential decision making across two different frameworks. The first framework we consider is online learning: in each round of a $T$-round repeated game, the learner makes a prediction, the adversary observes this prediction and reveals the true outcome, and the learner suffers a loss based on the accuracy of the prediction. The learner's aim is to minimize the regret, defined as the difference between the learner's cumulative loss and the cumulative loss of the best prediction strategy in some comparison class. We study the minimax strategy, which guarantees the lowest regret against all possible adversary strategies. In general, computing the minimax strategy takes time exponential in $T$; we focus on two settings where efficient algorithms are possible.

The first is prediction under squared Euclidean loss. The learner predicts a point in $\mathbb{R}^d$ and the adversary is constrained to respond with a point in some compact set. The regret is measured with respect to the single best prediction in the set. We compute the minimax strategy and the value of the game for any compact set and show that the value is the product of a horizon-dependent constant and the squared radius of the smallest enclosing ball of the set. We also present the optimal strategy of the adversary for two important sets: ellipsoids, and polytopes that intersect their smallest enclosing ball at all vertices. The minimax strategy can be cast as a simple shrinkage of the past data towards the center of this smallest enclosing ball, where the shrinkage factor can be efficiently computed before the start of the game. Noting that the value has no explicit dependence on the dimension, we then extend these results to Hilbert space, finding, once again, that the value is proportional to the squared radius of the smallest enclosing ball.

The second setting where we derive efficient minimax strategies is online linear regression. At the start of each round, the adversary chooses and reveals a vector of covariates. The regret is defined with respect to the best linear function of the covariates. We show that the minimax strategy is an easily computed linear predictor, provided that the adversary adheres to some natural constraints that prevent him from misrepresenting the scale of the problem. This strategy is horizon-independent: regardless of the length of the game, it incurs no more regret than any strategy that has knowledge of the number of rounds. We also provide an interpretation of the minimax algorithm as a follow-the-regularized-leader strategy with a data-dependent regularizer and obtain an explicit expression for the minimax regret.

We then turn to the second framework, reinforcement learning. More specifically, we consider the problem of controlling a Markov decision process (MDP) with a large state space. Since it is intractable to compete with the optimal policy for large-scale problems, we pursue the more modest goal of competing with a low-dimensional family of policies. Specifically, we restrict the variables of the dual linear program to lie in a low-dimensional subspace and show that we can find a policy that performs almost as well as the best policy in this class. We derive separate results for the average-cost and discounted-cost cases. Most importantly, the complexity of our method depends on the size of the comparison class but not on the size of the state space. Preliminary experiments show the effectiveness of the proposed algorithms in a queueing application.}
}
```

EndNote citation:

```
%0 Thesis
%A Malek, Alan
%T Efficient Sequential Decision Making
%I EECS Department, University of California, Berkeley
%D 2017
%8 May 12
%@ UCB/EECS-2017-94
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-94.html
%F Malek:EECS-2017-94
```