Scaling Interactive Data Science Transparently with Modin

Devin Petersohn

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2018-191

December 19, 2018

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-191.pdf

The combined growth of data science and big datasets has increased the performance requirements for running data analysis experiments and workflows. However, popular data science toolkits such as Pandas have not adapted to the technical demands of modern multicore, parallel hardware. As such, data scientists aiming to work with large quantities of data find themselves either suffering from libraries that under-utilize modern hardware or being forced to use big data processing tools that do not adapt well to the interactive nature of exploratory data analyses.

In this report we present the foundations of Modin, a library for large-scale data analysis. Modin emphasizes performant, parallel execution on big datasets previously deemed unwieldy for existing popular toolkits, while maintaining an interface and set of semantics similar to those of existing interactive data science tools. The experiments presented in this report demonstrate promising results towards developing a new generation of performant data science tools built for modern parallel and distributed hardware.

Advisor: Anthony D. Joseph

BibTeX citation:

@mastersthesis{Petersohn:EECS-2018-191,
    Author= {Petersohn, Devin},
    Editor= {Joseph, Anthony D.},
    Title= {Scaling Interactive Data Science Transparently with Modin},
    School= {EECS Department, University of California, Berkeley},
    Year= {2018},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-191.html},
    Number= {UCB/EECS-2018-191},
    Abstract= {The combined growth of data science and big datasets has increased the performance requirements for running data analysis experiments and workflows. However, popular data science toolkits such as Pandas have not adapted to the technical demands of modern multicore, parallel hardware. As such, data scientists aiming to work with large quantities of data find themselves either suffering from libraries that under-utilize modern hardware or being forced to use big data processing tools that do not adapt well to the interactive nature of exploratory data analyses.

In this report we present the foundations of Modin, a library for large-scale data analysis. Modin emphasizes performant, parallel execution on big datasets previously deemed unwieldy for existing popular toolkits, while maintaining an interface and set of semantics similar to those of existing interactive data science tools. The experiments presented in this report demonstrate promising results towards developing a new generation of performant data science tools built for modern parallel and distributed hardware.},
}

EndNote citation:

%0 Thesis
%A Petersohn, Devin
%E Joseph, Anthony D.
%T Scaling Interactive Data Science Transparently with Modin
%I EECS Department, University of California, Berkeley
%D 2018
%8 December 19
%@ UCB/EECS-2018-191
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2018/EECS-2018-191.html
%F Petersohn:EECS-2018-191