Fast Approximate Spectral Clustering

Donghui Yan and Ling Huang and Michael Jordan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2009-45

March 31, 2009

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-45.pdf

Spectral clustering refers to a flexible class of clustering procedures that can produce high-quality clusterings on small data sets but which has limited applicability to large-scale problems due to its computational complexity of $O(n^3)$, with $n$ the number of data points. We extend the range of spectral clustering by developing a general framework for fast approximate spectral clustering in which a distortion-minimizing local transformation is first applied to the data. This framework is based on a theoretical analysis that provides a statistical characterization of the effect of local distortion on the mis-clustering rate. We develop two concrete instances of our general framework, one based on local $k$-means clustering (KASP) and one based on random projection trees (RASP). Extensive experiments show that these algorithms can achieve significant speedups with little degradation in clustering accuracy. Specifically, our algorithms outperform $k$-means by a large margin in terms of accuracy, and run several times faster than approximate spectral clustering based on the Nyström method, with comparable accuracy and significantly smaller memory footprint. Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.
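The KASP idea described in the abstract — replace the $n$ points by $m \ll n$ local k-means centroids, spectral-cluster only the centroids, then propagate labels back — can be sketched as follows. This is a minimal illustration using scikit-learn; the function name `kasp` and its parameters are our own shorthand, not the authors' reference implementation, and details such as the affinity choice are assumptions.

```python
# Hedged sketch of KASP-style approximate spectral clustering:
# (1) compress the data with k-means (the distortion-minimizing local
#     transformation), (2) spectral-cluster the m centroids instead of the
#     n points (eigen-decomposition cost drops from O(n^3) to O(m^3)),
#     (3) assign each point the cluster label of its nearest centroid.
import numpy as np
from sklearn.cluster import KMeans, SpectralClustering

def kasp(X, n_clusters, n_representatives, seed=0):
    # Step 1: local k-means; each point is mapped to one of m centroids.
    km = KMeans(n_clusters=n_representatives, n_init=10,
                random_state=seed).fit(X)
    centroids = km.cluster_centers_
    membership = km.labels_  # index of each point's representative

    # Step 2: spectral clustering on the m centroids only.
    sc = SpectralClustering(n_clusters=n_clusters, affinity="rbf",
                            random_state=seed)
    centroid_labels = sc.fit_predict(centroids)

    # Step 3: propagate each centroid's label back to its points.
    return centroid_labels[membership]

# Toy usage: two well-separated Gaussian blobs.
rng = np.random.default_rng(0)
X = np.vstack([rng.normal(0, 0.3, (200, 2)),
               rng.normal(5, 0.3, (200, 2))])
labels = kasp(X, n_clusters=2, n_representatives=20)
```

With only 20 centroids entering the eigen-decomposition, the expensive step is independent of $n$; only the k-means pass touches all points, which is the source of the speedup the abstract reports.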


BibTeX citation:

@techreport{Yan:EECS-2009-45,
    Author= {Yan, Donghui and Huang, Ling and Jordan, Michael},
    Title= {Fast Approximate Spectral Clustering},
    Year= {2009},
    Month= {Mar},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-45.html},
    Number= {UCB/EECS-2009-45},
    Abstract= {Spectral clustering refers to a flexible class of clustering procedures
that can produce high-quality clusterings on small data sets but which
has limited applicability to large-scale problems due to its computational
complexity of $O(n^3)$, with $n$ the number of data points.
We extend the range of spectral clustering by developing a general
framework for fast approximate spectral clustering in which a
distortion-minimizing local transformation is first applied to the
data.  This framework is based on a theoretical analysis that provides
a statistical characterization of the effect of local distortion on
the mis-clustering rate.  We develop two concrete instances of our
general framework, one based on local $k$-means clustering (KASP)
and one based on random projection trees (RASP).  Extensive
experiments show that these algorithms can achieve significant
speedups with little degradation in clustering accuracy.
Specifically, our algorithms outperform $k$-means by a large
margin in terms of accuracy, and run several times faster than
approximate spectral clustering based on the Nystrom method,
with comparable accuracy and significantly smaller memory
footprint.  Remarkably, our algorithms make it possible for a single machine to spectral cluster data sets with a million observations within several minutes.},
}

EndNote citation:

%0 Report
%A Yan, Donghui 
%A Huang, Ling 
%A Jordan, Michael 
%T Fast Approximate Spectral Clustering
%I EECS Department, University of California, Berkeley
%D 2009
%8 March 31
%@ UCB/EECS-2009-45
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-45.html
%F Yan:EECS-2009-45