Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Alex Gittens, Aditya Devarakonda, Evan Racah, Michael Ringenburg, Lisa Gerhardt, Jey Kottaalam, Jialin Liu, Kristyn Maschhoff, Shane Canon, Jatin Chhugani, Pramod Sharma, Jiyan Yang, James Demmel, Jim Harrell, Venkat Krishnamurthy, Michael W. Mahoney and Mr Prabhat

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2016-151
August 23, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-151.pdf

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
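
As a rough illustration of why tall-and-skinny matrices suit a data-parallel model, the sketch below computes the small Gram matrix A^T A by splitting the rows of A into blocks, forming each block's contribution independently, and summing the results. It is written in plain Python rather than Spark, the function names (`partial_gram`, `add_matrices`, `gram_by_row_blocks`) are hypothetical, and it is not taken from the report; the algorithms the authors actually use may differ.

```python
from functools import reduce


def partial_gram(block):
    """Return block^T * block for a list of equal-length rows."""
    n = len(block[0])
    g = [[0.0] * n for _ in range(n)]
    for row in block:
        for i in range(n):
            for j in range(n):
                g[i][j] += row[i] * row[j]
    return g


def add_matrices(x, y):
    """Element-wise sum of two equally sized matrices."""
    return [[a + b for a, b in zip(rx, ry)] for rx, ry in zip(x, y)]


def gram_by_row_blocks(rows, block_size):
    """Compute A^T A from independent row blocks of A (illustrative only)."""
    blocks = [rows[k:k + block_size] for k in range(0, len(rows), block_size)]
    partials = map(partial_gram, blocks)   # independent work per row block
    return reduce(add_matrices, partials)  # combine the small n-by-n results


# A toy 5 x 2 "tall and skinny" matrix; real inputs would be far larger.
rows = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0], [9.0, 10.0]]
gram = gram_by_row_blocks(rows, block_size=2)
```

Because each partial result is only n-by-n for a matrix with n columns, the combining step stays small no matter how many rows there are, which is the property that makes the row-partitioned layout convenient.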


BibTeX citation:

@techreport{Gittens:EECS-2016-151,
    Author = {Gittens, Alex and Devarakonda, Aditya and Racah, Evan and Ringenburg, Michael and Gerhardt, Lisa and Kottaalam, Jey and Liu, Jialin and Maschhoff, Kristyn and Canon, Shane and Chhugani, Jatin and Sharma, Pramod and Yang, Jiyan and Demmel, James and Harrell, Jim and Krishnamurthy, Venkat and Mahoney, Michael W. and Prabhat, Mr},
    Title = {Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {Aug},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-151.html},
    Number = {UCB/EECS-2016-151},
    Abstract = {We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.}
}

EndNote citation:

%0 Report
%A Gittens, Alex
%A Devarakonda, Aditya
%A Racah, Evan
%A Ringenburg, Michael
%A Gerhardt, Lisa
%A Kottaalam, Jey
%A Liu, Jialin
%A Maschhoff, Kristyn
%A Canon, Shane
%A Chhugani, Jatin
%A Sharma, Pramod
%A Yang, Jiyan
%A Demmel, James
%A Harrell, Jim
%A Krishnamurthy, Venkat
%A Mahoney, Michael W.
%A Prabhat, Mr
%T Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
%I EECS Department, University of California, Berkeley
%D 2016
%8 August 23
%@ UCB/EECS-2016-151
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-151.html
%F Gittens:EECS-2016-151