Clark F. Olson

EECS Department, University of California, Berkeley

Technical Report No. UCB/CSD-94-786

, 1994

http://www2.eecs.berkeley.edu/Pubs/TechRpts/1994/CSD-94-786.pdf

Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem. This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.


BibTeX citation:

@techreport{Olson:CSD-94-786,
    Author= {Olson, Clark F.},
    Title= {Parallel Algorithms for Hierarchical Clustering},
    Year= {1994},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1994/6324.html},
    Number= {UCB/CSD-94-786},
    Abstract= {Hierarchical clustering is a common method used to determine clusters of similar data points in multi-dimensional spaces. <i>O</i>(<i>n</i>^2) algorithms, where <i>n</i> is the number of points to cluster, have long been known for this problem.  This paper discusses parallel algorithms to perform hierarchical clustering using various distance metrics. I describe <i>O</i>(<i>n</i>) time algorithms for clustering using the single link, average link, complete link, centroid, median, and minimum variance metrics on an <i>n</i> node CRCW PRAM and <i>O</i>(<i>n</i> log <i>n</i>) algorithms for these metrics (except average link and complete link) on <i>n</i>\log </i>n</i> node butterfly networks or trees. Thus, optimal efficiency is achieved for a significant number of processors using these distance metrics. A general algorithm is given that can be used to perform clustering with the complete link and average link metrics on a butterfly. While this algorithm achieves optimal efficiency for the general class of metrics, it is not optimal for the specific cases of complete link and average link clustering.},
}

EndNote citation:

%0 Report
%A Olson, Clark F. 
%T Parallel Algorithms for Hierarchical Clustering
%I EECS Department, University of California, Berkeley
%D 1994
%@ UCB/CSD-94-786
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1994/6324.html
%F Olson:CSD-94-786