Parallel Algorithms for Hierarchical Clustering
Clark F. Olson
EECS Department, University of California, Berkeley
Technical Report No. UCB/CSD-94-786
, 1994
http://www2.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-786.pdf
Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.
BibTeX citation:
@techreport{Olson:CSD-94-786, Author= {Olson, Clark F.}, Title= {Parallel Algorithms for Hierarchical Clustering}, Year= {1994}, Month= {Dec}, Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1994/6324.html}, Number= {UCB/CSD-94-786}, Abstract= {Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.}, }
EndNote citation:
%0 Report %A Olson, Clark F. %T Parallel Algorithms for Hierarchical Clustering %I EECS Department, University of California, Berkeley %D 1994 %@ UCB/CSD-94-786 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1994/6324.html %F Olson:CSD-94-786