Minimax Optimality in Online Learning under Logarithmic Loss with Parametric Constant Experts

Fares Hedayati

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2013-213
December 16, 2013

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-213.pdf

We study online prediction of individual sequences under logarithmic loss with parametric experts. The goal is to predict a sequence of outcomes, revealed one at a time, almost as well as a set of experts. At round t, the forecaster's prediction takes the form of a conditional probability density, and the loss suffered at that round is the negative log of the conditional probability assigned to the outcome revealed after the prediction. The performance of the prediction strategy is measured relative to the best expert in a reference set, a parametric class of i.i.d. distributions. The difference between the accumulated loss of the prediction strategy and that of the best expert in the reference set is called the regret. We focus on the minimax regret, i.e. the worst-case regret, over outcome sequences, of the strategy whose worst-case regret is smallest.

The minimax regret is achieved by the normalized maximum likelihood (NML) strategy. This strategy knows the length of the sequence in advance, and the probability it assigns to each sequence is proportional to the maximized likelihood of that sequence. At each round, NML must compute its conditionals by marginalization, which is very costly. Due to this drawback, much attention has been given to alternative strategies such as sequential normalized maximum likelihood (SNML) and Bayesian strategies. The conditional probability that SNML assigns to the next outcome is proportional to the maximized likelihood of the data seen so far together with that outcome. We investigate conditions under which SNML and Bayesian strategies are optimal. A major part of this thesis is dedicated to showing that optimality of SNML and optimality of a certain Bayesian strategy, namely the Bayesian strategy under Jeffreys prior, are equivalent: if SNML is optimal then so is the Bayesian strategy under Jeffreys prior, and if the Bayesian strategy under Jeffreys prior is optimal then so is SNML. (In parametric families, Jeffreys prior is proportional to the square root of the determinant of the Fisher information.) Furthermore, we show that SNML is optimal if and only if the joint distribution on sequences defined by SNML is exchangeable, i.e. the probability that SNML assigns to any sequence is invariant under any permutation of the sequence. These results are proven for exponential families and for any parametric family whose maximum likelihood estimator is asymptotically normal. The most important implication of these results is that when SNML-exchangeability holds, NML becomes horizon-independent, and it can be computed either through a Bayesian update with Jeffreys prior or through a one-step-ahead maximum likelihood calculation, as in SNML. Another major part of this thesis shows that SNML-exchangeability holds for a large class of one-dimensional exponential family distributions, namely the Gaussian, the gamma, and the Tweedie exponential family of order 3/2, together with any one-to-one transformation of them, and that it cannot hold for any other one-dimensional exponential family distribution.
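In standard notation (not taken verbatim from the thesis), the quantities described above for a sequence x^n = (x_1, ..., x_n) and a parametric class {p_theta : theta in Theta} can be written as:

\[
% Regret: accumulated loss of the strategy q minus the loss of the best expert
R_n(q, x^n) \;=\; -\log q(x^n) \;-\; \inf_{\theta \in \Theta}\bigl(-\log p_\theta(x^n)\bigr)
\;=\; \sup_{\theta \in \Theta} \log \frac{p_\theta(x^n)}{q(x^n)},
\]
\[
% NML assigns each length-n sequence a probability proportional to its maximized likelihood;
% SNML normalizes the maximized likelihood of "data seen so far plus the next outcome"
p_{\mathrm{NML}}(x^n) \;=\; \frac{\sup_{\theta} p_\theta(x^n)}{\int \sup_{\theta} p_\theta(y^n)\, dy^n},
\qquad
p_{\mathrm{SNML}}(x_t \mid x^{t-1}) \;=\; \frac{\sup_{\theta} p_\theta(x^{t-1}, x_t)}{\int \sup_{\theta} p_\theta(x^{t-1}, y)\, dy},
\]
\[
% Jeffreys prior: proportional to the square root of the determinant of the Fisher information
\pi_{\mathrm{J}}(\theta) \;\propto\; \sqrt{\det I(\theta)}.
\]

As a concrete illustration of the SNML / Bayes-Jeffreys equivalence described above, the Python sketch below compares the two conditionals for the Gaussian location family with known unit variance, taken here as an instance of the Gaussian case for which the thesis shows SNML-exchangeability holds. The data, the unit-variance assumption, and the helper names are illustrative choices, not taken from the thesis: for this family the Fisher information is constant, so the Jeffreys prior is flat and the Bayes predictive after t observations with sample mean xbar is N(xbar, 1 + 1/t); the SNML conditional works out to the same density.

# Minimal numerical sketch (assumptions as stated in the text above):
# compare the SNML conditional with the Bayes-Jeffreys predictive for the
# Gaussian location family with unit variance.
import numpy as np
from scipy.stats import norm
from scipy.integrate import quad

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=5)   # outcomes observed so far
t = len(x)
xbar = x.mean()

def ml_joint(seq):
    """Maximized likelihood sup_mu prod_i N(seq_i; mu, 1): plug in the sample mean."""
    mu_hat = np.mean(seq)
    return np.prod(norm.pdf(seq, loc=mu_hat, scale=1.0))

# Normalizer of the SNML conditional: integrate the maximized likelihood of
# (x_1, ..., x_t, y) over the next outcome y.
snml_norm, _ = quad(lambda y: ml_joint(np.append(x, y)), -np.inf, np.inf)

def snml_conditional(y):
    """SNML: maximized likelihood of (data so far, y), normalized over y."""
    return ml_joint(np.append(x, y)) / snml_norm

def jeffreys_predictive(y):
    """Bayes predictive under the (flat) Jeffreys prior: N(xbar, 1 + 1/t)."""
    return norm.pdf(y, loc=xbar, scale=np.sqrt(1.0 + 1.0 / t))

for y in (0.0, 1.5, 3.0):
    print(f"y={y:4.1f}  SNML={snml_conditional(y):.6f}  Bayes-Jeffreys={jeffreys_predictive(y):.6f}")

The two columns agree up to numerical integration error, reflecting the equivalence for the Gaussian family; for one-dimensional exponential families outside the Gaussian/gamma/Tweedie-3/2 class identified in the thesis, the two conditionals will in general differ.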

Advisor: Peter Bartlett


BibTeX citation:

@phdthesis{Hedayati:EECS-2013-213,
    Author = {Hedayati, Fares},
    Title = {Minimax Optimality in Online Learning under Logarithmic Loss with Parametric Constant Experts},
    School = {EECS Department, University of California, Berkeley},
    Year = {2013},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-213.html},
    Number = {UCB/EECS-2013-213},
}

EndNote citation:

%0 Thesis
%A Hedayati, Fares
%T Minimax Optimality in Online Learning under Logarithmic Loss with Parametric Constant Experts
%I EECS Department, University of California, Berkeley
%D 2013
%8 December 16
%@ UCB/EECS-2013-213
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2013/EECS-2013-213.html
%F Hedayati:EECS-2013-213