Ray DataFrames: A library for parallel data analysis

Patrick Yang

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2018-84
May 22, 2018

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-84.pdf

The combined growth of data science and big datasets has increased the performance requirements for running data analysis experiments and work ows. However, popular data science toolkits such as Pandas have not adapted to the technical demands of modern multi- core, parallel hardware. As such, data scientists aiming to work with large quantities of data nd themselves either sunullering from libraries that under-utilize modern hardware or being forced to use big data processing tools that do not adapt well to the interactive nature of exploratory data analyses. In this report we present the foundations of Ray DataFrames, a library for large scale data analysis. Ray DataFrames emphasizes performant, parallel execution on big datasets previously deemed unwieldy for existing popular toolkits, all while importantly maintaining an interface and set of semantics similar to existing interactive data science tools. The experiments presented in this report demonstrate promising results towards developing a new generation of performant data science tools built for parallel and distributed modern hardware.

Advisor: Anthony D. Joseph


BibTeX citation:

@mastersthesis{Yang:EECS-2018-84,
    Author = {Yang, Patrick},
    Editor = {Joseph, Anthony D.},
    Title = {Ray DataFrames: A library for parallel data analysis},
    School = {EECS Department, University of California, Berkeley},
    Year = {2018},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-84.html},
    Number = {UCB/EECS-2018-84},
    Abstract = {The combined growth of data science and big datasets has increased the performance
requirements for running data analysis experiments and work
ows. However, popular data
science toolkits such as Pandas have not adapted to the technical demands of modern multi-
core, parallel hardware. As such, data scientists aiming to work with large quantities of data
nd themselves either suering from libraries that under-utilize modern hardware or being
forced to use big data processing tools that do not adapt well to the interactive nature of
exploratory data analyses.
In this report we present the foundations of Ray DataFrames, a library for large scale
data analysis. Ray DataFrames emphasizes performant, parallel execution on big datasets
previously deemed unwieldy for existing popular toolkits, all while importantly maintaining
an interface and set of semantics similar to existing interactive data science tools. The
experiments presented in this report demonstrate promising results towards developing a
new generation of performant data science tools built for parallel and distributed modern
hardware.}
}

EndNote citation:

%0 Thesis
%A Yang, Patrick
%E Joseph, Anthony D.
%T Ray DataFrames: A library for parallel data analysis
%I EECS Department, University of California, Berkeley
%D 2018
%8 May 22
%@ UCB/EECS-2018-84
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-84.html
%F Yang:EECS-2018-84