Clark F. Olson
EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-94-786
December 1993
http://www2.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-786.pdf
Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. O( n^2) algorithms, where n is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe O( n) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an n node CRCW PRAM and O( n log n) algorithms for these metrics (except average link and complete link) on n\log n node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.
BibTeX citation:
@techreport{Olson:CSD-94-786, Author = {Olson, Clark F.}, Title = {Parallel Algorithms for Hierarchical Clustering}, Institution = {EECS Department, University of California, Berkeley}, Year = {1993}, Month = {Dec}, URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html}, Number = {UCB/CSD-94-786}, Abstract = {Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.} }
EndNote citation:
%0 Report %A Olson, Clark F. %T Parallel Algorithms for Hierarchical Clustering %I EECS Department, University of California, Berkeley %D 1993 %@ UCB/CSD-94-786 %U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1993/6324.html %F Olson:CSD-94-786