Vincent Truong

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2021-183

August 12, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-183.pdf

OrpheusDB is a lightweight dataset version management system designed to be integrated into data science workflows, akin to standard code version control systems. By using a relational database system as the backend, OrpheusDB leverages the database's querying and storage capabilities while remaining agnostic to the particular database system used. Previous work by Huang et al. [7] explored different designs for implementing versioned storage effectively. In practice, however, the current implementation of OrpheusDB does not support all of the performance-improving features described there, such as efficient support for schema changes, which occur often in data science work. This report improves on the original design by adjusting how OrpheusDB handles checkouts and commits. In addition to some direct improvements to these operations, we implement the partitioning algorithm, Lyresplit, as part of the open source version so that it offers the same level of performance shown in the original paper.

Advisor: Aditya Parameswaran


BibTeX citation:

@mastersthesis{Truong:EECS-2021-183,
    Author= {Truong, Vincent},
    Title= {Generalized Partitioning for Dataset Versions in OrpheusDB},
    School= {EECS Department, University of California, Berkeley},
    Year= {2021},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-183.html},
    Number= {UCB/EECS-2021-183},
    Abstract= {OrpheusDB is a lightweight dataset version management system designed to be integrated into data science workflows, akin to standard code version control systems. By using a relational database system as the backend, the Orpheus system leverages its querying and storage capabilities while remaining agnostic to the particular database system used. 
Previous work by Huang et al. [7] explored different designs for implementing versioned storage effectively. However, in practice, the current implementation of OrpheusDB does not support all of the features described that would improve performance, such as efficient support of schema changes, which can occur often during data science. This report improves on the original design by adjusting how OrpheusDB handles checkouts and commits. In addition to some direct improvements to these operations, we implement the partitioning algorithms, Lyresplit, to offer the same level of performance shown in the original paper, as part of the open source version.},
}

EndNote citation:

%0 Thesis
%A Truong, Vincent 
%T Generalized Partitioning for Dataset Versions in OrpheusDB
%I EECS Department, University of California, Berkeley
%D 2021
%8 August 12
%@ UCB/EECS-2021-183
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-183.html
%F Truong:EECS-2021-183