Shark: SQL and Rich Analytics at Scale
Reynold Shi Xin, Joshua Rosen, Matei Zaharia, Michael Franklin, Scott Shenker and Ion Stoica
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2012-214
November 26, 2012
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.pdf
Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.
BibTeX citation:
@techreport{Xin:EECS-2012-214,
    Author = {Xin, Reynold Shi and Rosen, Joshua and Zaharia, Matei and Franklin, Michael and Shenker, Scott and Stoica, Ion},
    Title = {Shark: SQL and Rich Analytics at Scale},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2012},
    Month = {Nov},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.html},
    Number = {UCB/EECS-2012-214},
    Abstract = {Shark is a new data analysis system that marries query processing with complex analytics on large clusters. It leverages a novel distributed memory abstraction to provide a unified engine that can run SQL queries and sophisticated analytics functions (e.g., iterative machine learning) at scale, and efficiently recovers from failures mid-query. This allows Shark to run SQL queries up to 100× faster than Apache Hive, and machine learning programs up to 100× faster than Hadoop. Unlike previous systems, Shark shows that it is possible to achieve these speedups while retaining a MapReduce-like execution engine, and the fine-grained fault tolerance properties that such engines provide. It extends such an engine in several ways, including column-oriented in-memory storage and dynamic mid-query replanning, to effectively execute SQL. The result is a system that matches the speedups reported for MPP analytic databases over MapReduce, while offering fault tolerance properties and complex analytics capabilities that they lack.}
}
EndNote citation:
%0 Report
%A Xin, Reynold Shi
%A Rosen, Joshua
%A Zaharia, Matei
%A Franklin, Michael
%A Shenker, Scott
%A Stoica, Ion
%T Shark: SQL and Rich Analytics at Scale
%I EECS Department, University of California, Berkeley
%D 2012
%8 November 26
%@ UCB/EECS-2012-214
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-214.html
%F Xin:EECS-2012-214