Ray DataFrames: A library for parallel data analysis
Patrick Yang
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2018-84
May 22, 2018
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-84.pdf
The combined growth of data science and big datasets has increased the performance requirements for running data analysis experiments and workflows. However, popular data science toolkits such as Pandas have not adapted to the technical demands of modern multi-core, parallel hardware. As such, data scientists aiming to work with large quantities of data find themselves either suffering from libraries that under-utilize modern hardware or being forced to use big data processing tools that do not adapt well to the interactive nature of exploratory data analyses. In this report we present the foundations of Ray DataFrames, a library for large scale data analysis. Ray DataFrames emphasizes performant, parallel execution on big datasets previously deemed unwieldy for existing popular toolkits, all while importantly maintaining an interface and set of semantics similar to existing interactive data science tools. The experiments presented in this report demonstrate promising results towards developing a new generation of performant data science tools built for parallel and distributed modern hardware.
Advisors: Anthony D. Joseph
BibTeX citation:
@mastersthesis{Yang:EECS-2018-84,
    Author= {Yang, Patrick},
    Editor= {Joseph, Anthony D.},
    Title= {Ray DataFrames: A library for parallel data analysis},
    School= {EECS Department, University of California, Berkeley},
    Year= {2018},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-84.html},
    Number= {UCB/EECS-2018-84},
    Abstract= {The combined growth of data science and big datasets has increased the performance requirements for running data analysis experiments and workflows. However, popular data science toolkits such as Pandas have not adapted to the technical demands of modern multi-core, parallel hardware. As such, data scientists aiming to work with large quantities of data find themselves either suffering from libraries that under-utilize modern hardware or being forced to use big data processing tools that do not adapt well to the interactive nature of exploratory data analyses. In this report we present the foundations of Ray DataFrames, a library for large scale data analysis. Ray DataFrames emphasizes performant, parallel execution on big datasets previously deemed unwieldy for existing popular toolkits, all while importantly maintaining an interface and set of semantics similar to existing interactive data science tools. The experiments presented in this report demonstrate promising results towards developing a new generation of performant data science tools built for parallel and distributed modern hardware.},
}
EndNote citation:
%0 Thesis
%A Yang, Patrick
%E Joseph, Anthony D.
%T Ray DataFrames: A library for parallel data analysis
%I EECS Department, University of California, Berkeley
%D 2018
%8 May 22
%@ UCB/EECS-2018-84
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-84.html
%F Yang:EECS-2018-84