Andrew Zhang and Richard Lin and Sean Meng and Crystal Jin

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-265

December 17, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-265.pdf

Pandas is a popular dataframe manipulation tool used by data scientists. A key problem with Pandas is its inability to scale across cores, which severely limits its ability to deal with big data workloads. To keep up with ever larger datasets, data scientists need a dataframe tool that scales effectively while retaining Pandas’s ease of use. Modin, a drop-in substitute for Pandas, can effectively parallelize dataframe workloads and supports various computational backends, such as Ray, Dask, or Python. In this project we implement another compute backend for Modin: OpenMPI, an implementation of the Message Passing Interface (MPI). This will allow users to tap into OpenMPI infrastructure to scale up their dataframe processing.
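As a rough illustration of the idea behind a pluggable compute backend, the sketch below splits a table into row partitions, hands each partition to a worker, and combines the partial results. It is not Modin's code or the implementation described in the report: the helper names (`split_rows`, `partial_sum`, `column_sum`) are hypothetical, the table is a plain list of dicts, and a standard-library thread pool stands in for the backend only so that the example runs anywhere.

```python
from concurrent.futures import Executor, ThreadPoolExecutor
from typing import Dict, List

Row = Dict[str, int]


def split_rows(rows: List[Row], n_partitions: int) -> List[List[Row]]:
    """Split rows into at most n_partitions contiguous, near-equal chunks."""
    if n_partitions < 1:
        raise ValueError("n_partitions must be >= 1")
    size, extra = divmod(len(rows), n_partitions)
    chunks, start = [], 0
    for i in range(n_partitions):
        # The first `extra` chunks take one additional row each.
        end = start + size + (1 if i < extra else 0)
        if end > start:  # skip empty chunks when rows < n_partitions
            chunks.append(rows[start:end])
        start = end
    return chunks


def partial_sum(chunk: List[Row], column: str) -> int:
    """Work done independently on one partition."""
    return sum(row[column] for row in chunk)


def column_sum(rows: List[Row], column: str, n_partitions: int,
               executor: Executor) -> int:
    """Scatter partitions to the executor, then reduce the partial sums."""
    chunks = split_rows(rows, n_partitions)
    futures = [executor.submit(partial_sum, chunk, column) for chunk in chunks]
    return sum(f.result() for f in futures)


# Hypothetical data; the thread pool is a stand-in for a real backend.
rows = [{"id": i, "value": i * 2} for i in range(10)]
with ThreadPoolExecutor(max_workers=4) as pool:
    total = column_sum(rows, "value", n_partitions=4, executor=pool)
print(total)
```

Because `column_sum` depends only on the generic `Executor` interface, the scheduling layer can be swapped without touching the dataframe logic, which is the property that lets a dataframe library sit on top of different engines. Threads in CPython do not give multi-core speedup for pure-Python work like this; a process pool or a set of MPI ranks would take their place in practice.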

Advisor: Randy H. Katz


BibTeX citation:

@mastersthesis{Zhang:EECS-2021-265,
    Author= {Zhang, Andrew and Lin, Richard and Meng, Sean and Jin, Crystal},
    Title= {Modin OpenMPI Compute Engine},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-265.html},
    Number= {UCB/EECS-2021-265},
    Abstract= {Pandas is a popular dataframe manipulation tool used by data scientists. A key problem
with Pandas is its inability to scale across cores, which severely limits its ability to deal
with big data workloads. In order to keep up with ever larger datasets, data scientists need
a dataframe tool that can scale effectively but also retain Pandas’s ease of use. Modin, a
drop-in substitute for Pandas, can effectively parallelize dataframe workloads and
supports various computational backends, such as Ray, Dask, or Python. In this project
we implement another compute backend for Modin: OpenMPI, an implementation of
Message Passing Interface. This will allow users to tap into OpenMPI infrastructure to
scale up their dataframe processing needs.},
}

EndNote citation:

%0 Thesis
%A Zhang, Andrew 
%A Lin, Richard 
%A Meng, Sean 
%A Jin, Crystal 
%T Modin OpenMPI Compute Engine
%I EECS Department, University of California, Berkeley
%D 2021
%8 December 17
%@ UCB/EECS-2021-265
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-265.html
%F Zhang:EECS-2021-265