Principled Statistical Approaches For Sampling and Inference in High Dimensions

Raaz Dwivedi

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-180

August 11, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-180.pdf

The growth in the number of algorithms to identify patterns in modern large-scale datasets has introduced a new dilemma for practitioners: How does one choose between the numerous methods? In supervised machine learning, accuracy on a hold-out dataset is the flagship measure for making this choice. This dissertation presents research that can provide principled guidance for making choices in three popular settings where such a flagship measure is not readily available.

(I) Convergence of Markov chain Monte Carlo sampling algorithms, used commonly in Bayesian inference, Monte Carlo integration, and stochastic simulation: We provide explicit non-asymptotic guarantees for state-of-the-art sampling algorithms in high dimensions that can help the user pick a sampling method and the number of iterations based on the computational budget at hand.

(II) Statistical-computational challenges with mixture model estimation, used commonly with heterogeneous data: We provide non-asymptotic guarantees for Expectation-Maximization for parameter estimation when the number of components is not known, and characterize the number of samples and iterations needed for a desired accuracy, which can inform the user of the potential two-edged price when dealing with noisy data in high dimensions.

(III) Reliable estimation of heterogeneous treatment effects (HTE) in causal inference, crucial for decision making in medicine and public policy: We introduce a data-driven methodology, StaDISC, that is useful for validating commonly used models for estimating HTE, and for discovering interpretable and stable subgroups with HTE using calibration. While we illustrate its usefulness in precision medicine, we believe the methodology to be of general interest in randomized experiments.

Advisors: Bin Yu and Martin Wainwright

BibTeX citation:

@phdthesis{Dwivedi:EECS-2021-180,
    Author= {Dwivedi, Raaz},
    Title= {Principled Statistical Approaches For Sampling and Inference in High Dimensions},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-180.html},
    Number= {UCB/EECS-2021-180},
    Abstract= {The growth in the number of algorithms to identify patterns in modern large-scale datasets has introduced a new dilemma for practitioners: How does one choose between the numerous methods? In supervised machine learning, accuracy on a hold-out dataset is the flagship measure for making this choice. This dissertation presents research that can provide principled guidance for making choices in three popular settings where such a flagship measure is not readily available.
(I) Convergence of Markov chain Monte Carlo sampling algorithms, used commonly in Bayesian inference, Monte Carlo integration, and stochastic simulation: We provide explicit non-asymptotic guarantees for state-of-the-art sampling algorithms in high dimensions that can help the user pick a sampling method and the number of iterations based on the computational budget at hand.
(II) Statistical-computational challenges with mixture model estimation, used commonly with heterogeneous data: We provide non-asymptotic guarantees for Expectation-Maximization for parameter estimation when the number of components is not known, and characterize the number of samples and iterations needed for a desired accuracy, which can inform the user of the potential two-edged price when dealing with noisy data in high dimensions.
(III) Reliable estimation of heterogeneous treatment effects (HTE) in causal inference, crucial for decision making in medicine and public policy: We introduce a data-driven methodology, StaDISC, that is useful for validating commonly used models for estimating HTE, and for discovering interpretable and stable subgroups with HTE using calibration. While we illustrate its usefulness in precision medicine, we believe the methodology to be of general interest in randomized experiments.},
}

EndNote citation:

%0 Thesis
%A Dwivedi, Raaz
%T Principled Statistical Approaches For Sampling and Inference in High Dimensions
%I EECS Department, University of California, Berkeley
%D 2021
%8 August 11
%@ UCB/EECS-2021-180
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-180.html
%F Dwivedi:EECS-2021-180