Concentration and Sequential Decision Making in Markovian Environments

Vrettos Moulos

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2020-202
December 16, 2020

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-202.pdf

In this dissertation we study concentration properties of Markov chains, and sequential decision making problems which involve stochastic modeling with Markov chains.

We start by developing a simple yet powerful Hoeffding inequality for Markovian sums under only the irreducibility assumption. To illustrate its usefulness we provide two applications to multi-armed bandit problems. The first concerns identifying an approximately best Markovian arm, while the second deals with regret minimization in the context of Markovian bandits, generalizing two well-known algorithms from the i.i.d. case.
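
For context, the classical Hoeffding inequality for i.i.d. random variables, which this result extends, and a schematic of the Markovian version read as follows (the second display is our gloss on the shape of the bound, not a verbatim statement of the theorem). For f taking values in [a, b] with stationary mean mu,

\[
\mathbb{P}\Big(\sum_{k=1}^{n} f(X_k) - n\mu \ge n\varepsilon\Big) \le e^{-2n\varepsilon^2/(b-a)^2} \quad \text{(i.i.d. case)},
\]
\[
\mathbb{P}\Big(\sum_{k=1}^{n} f(X_k) - n\mu \ge n\varepsilon\Big) \le C\, e^{-c(P)\, n\varepsilon^2/(b-a)^2} \quad \text{(irreducible Markov chain)},
\]

where C is an absolute constant and c(P) > 0 depends on the chain only through its transition matrix P, e.g. through hitting times.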

We proceed with the study of the concentration properties of a Lipschitz function applied to a Markov chain, which generalize Hoeffding's inequality. In particular, we investigate a transportation problem that arises naturally when the martingale method is applied. The so-called bicausal optimal transport problem for Markov chains is an optimal transport formulation, suitable for stochastic processes, that takes into account the accumulation of information as time evolves. Our analysis is based on a relation between the transport problem and the theory of Markov decision processes. In this way we derive necessary and sufficient conditions for optimality in the transport problem, as well as an iterative algorithm, namely value iteration, for the calculation of the transportation cost. Additionally, we draw the connection with the classic theory of couplings for Markov chains, and in particular with the notion of faithful couplings.
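
To make the value-iteration scheme concrete, here is a minimal Python sketch for two finite-state chains over a finite horizon; the function names, the undiscounted horizon-T formulation, and the use of a generic LP solver for the one-step problem are our own illustrative choices. The bicausal structure shows up in the recursion: at each backward step, only the coupling of the two one-step transition rows is optimized against the future value.

import numpy as np
from scipy.optimize import linprog

def ot_cost(p, q, C):
    # One-step optimal transport between distributions p and q with cost C, as an LP.
    n, m = C.shape
    A_eq = np.zeros((n + m, n * m))
    for i in range(n):
        A_eq[i, i * m:(i + 1) * m] = 1.0  # row marginals: sum_j pi[i, j] = p[i]
    for j in range(m):
        A_eq[n + j, j::m] = 1.0           # column marginals: sum_i pi[i, j] = q[j]
    res = linprog(C.ravel(), A_eq=A_eq, b_eq=np.concatenate([p, q]),
                  bounds=(0, None), method="highs")
    return res.fun

def bicausal_cost(P, Q, c, T):
    # Value iteration for a horizon-T bicausal transport cost between chains P and Q:
    #   V_T(x, y) = c(x, y),
    #   V_t(x, y) = c(x, y) + min over couplings pi of (P[x], Q[y]) of E_pi[V_{t+1}].
    V = c.copy()
    for _ in range(T):
        V = np.array([[c[x, y] + ot_cost(P[x], Q[y], V)
                       for y in range(Q.shape[0])]
                      for x in range(P.shape[0])])
    return V  # V[x, y] is the cost when the chains start at (x, y)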

Next we focus on a finite-sample analysis of large deviation results for Markov chains. First we study the exponential family of stochastic matrices, which serves as a change of measure, and we develop conditions under which the asymptotic Perron-Frobenius eigenvector stays strictly positive. This leads to a Chernoff bound which attains a constant prefactor and an exponential decay with the optimal large deviations rate. Moreover, a finite-sample version of the law of the iterated logarithm is derived, and a uniform multiplicative ergodic theorem for the exponential family of tilted transition probability matrices is established.
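
The change of measure here is the standard exponential tilting of a transition matrix, which is easy to compute numerically; the sketch below is our own illustration (the names tilted_chain and rate are hypothetical). It forms M(x, y) = P(x, y) e^{theta f(y)}, normalizes through the Perron-Frobenius eigenvalue rho(theta) and its right eigenvector, and evaluates the large-deviations rate as the Legendre transform of log rho(theta) on a grid.

import numpy as np

def tilted_chain(P, f, theta):
    # Tilted transition matrix P_theta and Perron-Frobenius eigenvalue rho(theta).
    M = P * np.exp(theta * np.asarray(f))[None, :]  # M(x, y) = P(x, y) * exp(theta * f(y))
    w, V = np.linalg.eig(M)
    k = np.argmax(w.real)            # PF eigenvalue is real and dominant for irreducible M
    rho = w[k].real
    u = np.abs(V[:, k].real)         # PF right eigenvector, strictly positive
    P_theta = M * u[None, :] / (rho * u[:, None])   # rows sum to 1 because M @ u = rho * u
    return P_theta, rho

def rate(P, f, a, thetas=np.linspace(-10.0, 10.0, 2001)):
    # Legendre transform sup_theta { theta * a - log rho(theta) }, on a grid of tilts.
    log_rho = np.array([np.log(tilted_chain(P, f, t)[1]) for t in thetas])
    return float(np.max(thetas * a - log_rho))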

On the applications side, we give a complete characterization of the sample complexity of best Markovian arm identification in one-parameter Markovian bandit models. We derive instance-specific nonasymptotic and asymptotic lower bounds which generalize those of the i.i.d. setting, and we analyze the Track-and-Stop strategy, proving that asymptotically it is within a factor of four of the lower bound.
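
Schematically, and in the style of the i.i.d. bounds of Garivier and Kaufmann that these results generalize, the two statements take the following form, with T*(nu) the instance-dependent characteristic time, kl the binary relative entropy, and tau_delta the stopping time of a delta-correct strategy (this is our paraphrase of the abstract, not a verbatim statement of the theorems):

\[
\mathbb{E}_\nu[\tau_\delta] \;\ge\; T^*(\nu)\, \mathrm{kl}(\delta, 1-\delta)
\quad\text{for every $\delta$-correct strategy},
\]
\[
\limsup_{\delta \to 0} \frac{\mathbb{E}_\nu[\tau_\delta]}{\log(1/\delta)} \;\le\; 4\, T^*(\nu)
\quad\text{for Track-and-Stop}.
\]

Since kl(delta, 1 - delta) behaves like log(1/delta) as delta tends to 0, the two displays differ asymptotically by exactly the factor of four mentioned above.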

We conclude with an extension of the classic stochastic multi-armed bandit problem which involves multiple plays and Markovian rewards in the rested bandits setting. In order to tackle this problem we consider an adaptive allocation rule which at each stage combines the information from the sample means of all the arms with the Kullback-Leibler upper confidence bound of a single arm, selected in a round-robin way. For rewards generated from a one-parameter exponential family of Markov chains, we provide a finite-time upper bound for the regret incurred by this adaptive allocation rule, which reveals the logarithmic dependence of the regret on the time horizon and which is asymptotically optimal. As a byproduct of our analysis we also establish asymptotically optimal, finite-time guarantees for the case of multiple plays and i.i.d. rewards drawn from a one-parameter exponential family of probability densities. Finally, we provide simulation results illustrating that calculating Kullback-Leibler upper confidence bounds in a round-robin way is significantly more efficient than calculating them for every arm at each round, and that the expected regrets of the two approaches behave similarly.
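
As a minimal sketch of one plausible reading of this allocation rule, with Bernoulli rewards and hypothetical names (the exact interplay between indices and sample means in the dissertation may differ): at round t only the round-robin arm has its Kullback-Leibler upper confidence bound recomputed, every other arm competes with its sample mean, and the m arms with the largest resulting values are played.

import math

def kl_bern(p, q):
    # Binary relative entropy kl(p, q), with clipping for numerical safety.
    eps = 1e-12
    p = min(max(p, eps), 1 - eps)
    q = min(max(q, eps), 1 - eps)
    return p * math.log(p / q) + (1 - p) * math.log((1 - p) / (1 - q))

def kl_ucb(mean, n, t):
    # Largest q >= mean with n * kl(mean, q) <= log t, found by bisection.
    lo, hi = mean, 1.0
    for _ in range(50):
        mid = (lo + hi) / 2.0
        if n * kl_bern(mean, mid) <= math.log(max(t, 2)):
            lo = mid
        else:
            hi = mid
    return lo

def select_arms(means, counts, t, m):
    # Refresh the index of the single round-robin arm; rank everything else by sample mean.
    K = len(means)
    lam = t % K
    vals = list(means)
    vals[lam] = kl_ucb(means[lam], counts[lam], t)
    return sorted(range(K), key=lambda a: vals[a], reverse=True)[:m]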

Advisor: Satish Rao


BibTeX citation:

@phdthesis{Moulos:EECS-2020-202,
    Author = {Moulos, Vrettos},
    Title = {Concentration and Sequential Decision Making in Markovian Environments},
    School = {EECS Department, University of California, Berkeley},
    Year = {2020},
    Month = {Dec},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-202.html},
    Number = {UCB/EECS-2020-202},
    Abstract = {In this dissertation we study concentration properties of Markov chains,
and sequential decision making problems which involve stochastic modeling with Markov chains.

We start by developing a simple yet powerful Hoeffding inequality for Markovian sums under only the irreducibility assumption.
To illustrate its usefulness we provide two applications to multi-armed bandit problems.
The first concerns identifying an approximately best Markovian arm,
while the second deals with regret minimization in the context of Markovian bandits,
generalizing two well-known algorithms from the i.i.d. case.

We proceed with the study of the concentration properties of a Lipschitz function applied to a Markov chain,
which generalize Hoeffding's inequality.
In particular, we investigate a transportation problem that arises naturally when the martingale method is applied.
The so-called bicausal optimal transport problem for Markov chains
is an optimal transport formulation, suitable for stochastic processes, that takes into account the accumulation of information as time evolves.
Our analysis is based on a relation between the transport problem and the theory of Markov decision processes.
In this way we derive necessary and sufficient conditions for optimality in the transport problem, as well as an iterative algorithm, namely value iteration, for the calculation of the transportation cost.
Additionally, we draw the connection with the classic theory of couplings for Markov chains,
and in particular with the notion of faithful couplings.

Next we focus on a finite-sample analysis of large deviation results for Markov chains.
First we study the exponential family of stochastic matrices, which serves as a change of measure,
and we develop conditions under which the asymptotic Perron-Frobenius eigenvector stays strictly positive.
This leads to a Chernoff bound which attains a constant prefactor and
an exponential decay with the optimal large deviations rate.
Moreover, a finite-sample version of the law of the iterated logarithm is derived,
and a uniform multiplicative ergodic theorem for the exponential family of tilted transition probability matrices
is established.

On the applications side, we give a complete characterization of the sample complexity
of best Markovian arm identification in one-parameter Markovian bandit models.
We derive instance-specific nonasymptotic and asymptotic lower bounds which generalize those of the i.i.d. setting,
and we analyze the Track-and-Stop strategy, proving that asymptotically it is within a factor of four of the lower bound.

We conclude with an extension of the classic stochastic multi-armed bandit problem which involves multiple plays and Markovian rewards in the rested bandits setting.
In order to tackle this problem we consider an adaptive allocation rule which at each stage combines the information from the sample means of all the arms
with the Kullback-Leibler upper confidence bound of a single arm, selected in a round-robin way.
For rewards generated from a one-parameter exponential family of Markov chains, we provide a finite-time upper bound for the regret incurred by this adaptive allocation rule, which reveals the logarithmic dependence of the regret on the time horizon and which is asymptotically optimal.
As a byproduct of our analysis we also establish asymptotically optimal, finite-time guarantees for the case of multiple plays and i.i.d. rewards drawn from a one-parameter exponential family of probability densities.
Finally, we provide simulation results illustrating that calculating Kullback-Leibler upper confidence bounds in a round-robin way is significantly more efficient than calculating them for every arm at each round, and that the expected regrets of the two approaches behave similarly.}
}

EndNote citation:

%0 Thesis
%A Moulos, Vrettos
%T Concentration and Sequential Decision Making in Markovian Environments
%I EECS Department, University of California, Berkeley
%D 2020
%8 December 16
%@ UCB/EECS-2020-202
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-202.html
%F Moulos:EECS-2020-202