Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies

Alex Gittens and Aditya Devarakonda and Evan Racah and Michael Ringenburg and Lisa Gerhardt and Jey Kottaalam and Jialin Liu and Kristyn Maschhoff and Shane Canon and Jatin Chhugani and Pramod Sharma and Jiyan Yang and James Demmel and Jim Harrell and Venkat Krishnamurthy and Michael W. Mahoney and Mr Prabhat

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2016-151

August 23, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-151.pdf

We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.
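To make the remark about tall-and-skinny matrices concrete, here is a minimal sketch of one common reason such matrices suit a data-parallel model: the Gram matrix A^T A of a row-partitioned matrix is the sum of the small Gram matrices of its row blocks, so each partition can work independently and only n x n results need to be combined. This is an illustration under our own assumptions, not the report's implementation; the function names (`local_gram`, `partition_rows`, `distributed_gram`) are hypothetical, and plain Python `map` and `reduce` stand in for a cluster framework.

```python
from functools import reduce


def local_gram(rows):
    """Return the n x n Gram matrix (block^T * block) of one block of rows."""
    n = len(rows[0])
    gram = [[0.0] * n for _ in range(n)]
    for row in rows:
        for i in range(n):
            for j in range(n):
                gram[i][j] += row[i] * row[j]
    return gram


def add_matrices(a, b):
    """Element-wise sum of two equally sized matrices."""
    return [[x + y for x, y in zip(ra, rb)] for ra, rb in zip(a, b)]


def partition_rows(matrix, num_partitions):
    """Split a matrix into contiguous row blocks (hypothetical partitioner)."""
    size = -(-len(matrix) // num_partitions)  # ceiling division
    return [matrix[i:i + size] for i in range(0, len(matrix), size)]


def distributed_gram(matrix, num_partitions):
    """Compute A^T A as a map over row blocks followed by a reduce."""
    blocks = partition_rows(matrix, num_partitions)
    partials = map(local_gram, blocks)     # independent work per partition
    return reduce(add_matrices, partials)  # combine small n x n matrices


# A toy tall-and-skinny matrix: 6 rows, 2 columns.
A = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0],
     [7.0, 8.0], [9.0, 10.0], [11.0, 12.0]]
G = distributed_gram(A, num_partitions=3)
print(G)
```

Because the number of columns n is small, each partial result is only n x n regardless of how many rows a partition holds, which keeps the amount of data combined in the reduce step small in this sketch.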

BibTeX citation:

@techreport{Gittens:EECS-2016-151,
    Author= {Gittens, Alex and Devarakonda, Aditya and Racah, Evan and Ringenburg, Michael and Gerhardt, Lisa and Kottaalam, Jey and Liu, Jialin and Maschhoff, Kristyn and Canon, Shane and Chhugani, Jatin and Sharma, Pramod and Yang, Jiyan and Demmel, James and Harrell, Jim and Krishnamurthy, Venkat and Mahoney, Michael W. and Prabhat, Mr},
    Title= {Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies},
    Year= {2016},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-151.html},
    Number= {UCB/EECS-2016-151},
    Abstract= {We explore the trade-offs of performing linear algebra using Apache Spark, compared to traditional C and MPI implementations on HPC platforms. Spark is designed for data analytics on cluster computing platforms with access to local disks and is optimized for data-parallel tasks. We examine three widely used and important matrix factorizations: NMF (for physical plausibility), PCA (for its ubiquity) and CX (for data interpretability). We apply these methods to TB-sized problems in particle physics, climate modeling and bioimaging. The data matrices are tall and skinny, which enables the algorithms to map conveniently into Spark's data-parallel model. We perform scaling experiments on up to 1600 Cray XC40 nodes, describe the sources of slowdowns, and provide tuning guidance to obtain high performance.},
}

EndNote citation:

%0 Report
%A Gittens, Alex 
%A Devarakonda, Aditya 
%A Racah, Evan 
%A Ringenburg, Michael 
%A Gerhardt, Lisa 
%A Kottaalam, Jey 
%A Liu, Jialin 
%A Maschhoff, Kristyn 
%A Canon, Shane 
%A Chhugani, Jatin 
%A Sharma, Pramod 
%A Yang, Jiyan 
%A Demmel, James 
%A Harrell, Jim 
%A Krishnamurthy, Venkat 
%A Mahoney, Michael W. 
%A Prabhat, Mr 
%T Matrix Factorizations at Scale: a Comparison of Scientific Data Analytics in Spark and C+MPI Using Three Case Studies
%I EECS Department, University of California, Berkeley
%D 2016
%8 August 23
%@ UCB/EECS-2016-151
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-151.html
%F Gittens:EECS-2016-151