Spark: Cluster Computing with Working Sets

Matei Zaharia, N. M. Mosharaf Chowdhury, Michael Franklin, Scott Shenker and Ion Stoica

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2010-53

May 7, 2010

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.pdf

MapReduce and its variants have been highly successful in implementing large-scale data-intensive applications on clusters of unreliable machines. However, most of these systems are built around an acyclic data flow programming model that is not suitable for other popular applications. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis environments. We propose a new framework called Spark that supports these applications while maintaining the scalability and fault-tolerance properties of MapReduce. To achieve these goals, Spark introduces a data abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.
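To make the reuse pattern concrete, the following Scala sketch (Scala being Spark's host language) caches a parsed dataset as an RDD and then reuses it across gradient-descent iterations of a logistic-regression-style loop. The input path, point format, feature count, and iteration count are illustrative assumptions rather than values from the report, and the calls used (textFile, map, cache, reduce) follow the Apache Spark RDD API rather than code reproduced from the paper.

import scala.math.exp
import org.apache.spark.{SparkConf, SparkContext}

// Minimal sketch of the "working set" pattern: parse a dataset once,
// cache it as an RDD, and reuse it across iterations.
object LogisticRegressionSketch {
  def main(args: Array[String]): Unit = {
    // Local master only so the sketch runs standalone; normally set via spark-submit.
    val sc = new SparkContext(new SparkConf().setAppName("lr-sketch").setMaster("local[*]"))

    // Hypothetical input: one point per line, "label f1 f2 ... fD".
    val points = sc.textFile("hdfs:///path/to/points.txt")
      .map { line =>
        val nums = line.split(' ').map(_.toDouble)
        (nums.head, nums.tail)              // (label y, features x)
      }
      .cache()                              // keep the working set in memory

    val numFeatures = 10                    // assumed dimensionality
    var w = Array.fill(numFeatures)(0.0)    // model weights

    for (_ <- 1 to 10) {
      // Each iteration is a parallel map/reduce over the cached RDD;
      // a lost partition is rebuilt by Spark rather than re-read by hand.
      val gradient = points.map { case (y, x) =>
        val margin = x.zip(w).map { case (xi, wi) => xi * wi }.sum
        val scale = (1.0 / (1.0 + exp(-y * margin)) - 1.0) * y
        x.map(_ * scale)
      }.reduce((a, b) => a.zip(b).map { case (u, v) => u + v })

      w = w.zip(gradient).map { case (wi, gi) => wi - gi }
    }

    println(s"Final weights: ${w.mkString(", ")}")
    sc.stop()
  }
}

The only point of the sketch is that cache() materializes the parsed points once, after which every iteration is a parallel operation over that in-memory working set, which is the reuse the abstract describes.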


BibTeX citation:

@techreport{Zaharia:EECS-2010-53,
    Author= {Zaharia, Matei and Chowdhury, N. M. Mosharaf and Franklin, Michael and Shenker, Scott and Stoica, Ion},
    Title= {Spark: Cluster Computing with Working Sets},
    Year= {2010},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.html},
    Number= {UCB/EECS-2010-53},
    Abstract= {MapReduce and its variants have been highly successful in implementing large-scale data intensive applications on clusters of unreliable machines. However, most of these systems are built around an acyclic data flow programming model that is not suitable for other popular applications. In this paper, we focus on one such class of applications: those that reuse a working set of data across multiple parallel operations. This includes many iterative machine learning algorithms, as well as interactive data analysis environments. We propose a new framework called Spark that supports these applications while maintaining the scalability and fault-tolerance properties of MapReduce. To achieve these goals, Spark introduces a data abstraction called resilient distributed datasets (RDDs). An RDD is a read-only collection of objects partitioned across a set of machines that can be rebuilt if a partition is lost. Spark can outperform Hadoop by 10x in iterative machine learning jobs, and can be used to interactively query a 39 GB dataset with sub-second response time.},
}

EndNote citation:

%0 Report
%A Zaharia, Matei 
%A Chowdhury, N. M. Mosharaf 
%A Franklin, Michael 
%A Shenker, Scott 
%A Stoica, Ion 
%T Spark: Cluster Computing with Working Sets
%I EECS Department, University of California, Berkeley
%D 2010
%8 May 7
%@ UCB/EECS-2010-53
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2010/EECS-2010-53.html
%F Zaharia:EECS-2010-53