High-dimensional Principal Component Analysis

Arash Ali Amini

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2011-104

September 16, 2011

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-104.pdf

Advances in data acquisition and emergence of new sources of data, in recent years, have led to generation of massive datasets in many fields of science and engineering. These datasets are usually characterized by having high dimensions and low number of samples. Without appropriate modifications, classical tools of statistical analysis are not quite applicable in these “high-dimensional” settings. Much of the effort of contemporary research in statistics and related fields is to extend inference procedures, methodologies and theories to these new datasets. One widely used assumption which can mitigate the effects of dimensionality is the sparsity of the underlying parameters. In the first half of this thesis we consider principal component analysis (PCA), a classical dimension reduction procedure, in the high-dimensional setting with “hard” sparsity constraints. We will analyze the statistical performance of two modified procedures for PCA, a simple diagonal cut-off method and a more elaborate semidefinite programming relaxation (SDP). Our results characterize the statistical complexity of the two methods, in terms of the number of samples required for asymptotic recovery. The results show a trade-off between statistical and computational complexity. In the second half of the thesis, we consider PCA in function spaces (fPCA), an infinite-dimensional analog of PCA, also known as Karhunen–Lo ́ve transform. We introduce a functional-theoretic framework to study effects of sampling in fPCA under smoothness constraints on functions. The framework generates high dimensional models with a different type of structural assumption, an “ellipsoid” condition, which can be thought of as a soft sparsity constraint. We provide a M-estimator to estimate principal component subspaces which takes the form of a regularized eigenvalue problem. We provide rates of convergence for the estimator and show minimax optimality. Along the way, some problems in approximation theory are also discussed.

Advisors: Martin Wainwright

BibTeX citation:

@phdthesis{Amini:EECS-2011-104,
Author= {Amini, Arash Ali},
Title= {High-dimensional Principal Component Analysis},
School= {EECS Department, University of California, Berkeley},
Year= {2011},
Month= {Sep},
Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-104.html},
Number= {UCB/EECS-2011-104},
Abstract= {Advances in data acquisition and emergence of new sources of data, in recent years, have led to generation of massive datasets in many fields of science and engineering. These
datasets are usually characterized by having high dimensions and low number of samples. Without appropriate modifications, classical tools of statistical analysis are not quite applicable in these “high-dimensional” settings. Much of the effort of contemporary research in statistics and related fields is to extend inference procedures, methodologies and theories to these new datasets. One widely used assumption which can mitigate the effects of dimensionality is the sparsity of the underlying parameters. In the first half of this thesis we consider principal component analysis (PCA), a classical dimension reduction procedure, in the high-dimensional setting with “hard” sparsity constraints. We will analyze the statistical performance of two modified procedures for PCA, a simple diagonal cut-off method and a more elaborate semidefinite programming relaxation (SDP). Our results characterize
the statistical complexity of the two methods, in terms of the number of samples required for asymptotic recovery. The results show a trade-off between statistical and computational complexity. In the second half of the thesis, we consider PCA in function spaces (fPCA), an
infinite-dimensional analog of PCA, also known as Karhunen–Lo ́ve transform. We introduce a functional-theoretic framework to study effects of sampling in fPCA under smoothness constraints on functions. The framework generates high dimensional models with a different type of structural assumption, an “ellipsoid” condition, which can be thought of as a soft sparsity constraint. We provide a M-estimator to estimate principal component subspaces which takes the form of a regularized eigenvalue problem. We provide rates of convergence for the estimator and show minimax optimality. Along the way, some problems in approximation theory are also discussed.},
}

EndNote citation:

%0 Thesis
%A Amini, Arash Ali 
%T High-dimensional Principal Component Analysis
%I EECS Department, University of California, Berkeley
%D 2011
%8 September 16
%@ UCB/EECS-2011-104
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2011/EECS-2011-104.html
%F Amini:EECS-2011-104