Dibyo Majumdar

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2017-131

July 24, 2017

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-131.pdf

Data plays a crucial role in society today. As the cost of collecting, storing, and processing data falls, more and more of it is collected and fed into complex analysis tools to obtain actionable results and insights. These results in turn drive decisions that affect the lives of countless people, for better and for worse. It is imperative that data scientists properly record the provenance of the results they publish, i.e., the original sources of data and the exact sequence of operations performed on those sources to arrive at the published result. Doing so ensures that results are properly contextualized and, more importantly, that they can be verified by other scientists. It also fosters collaboration and leads to the standardization of common data operations and data transfer formats.

Unfortunately, this practice is not the norm in many scientific fields. We contend that this is because the tools available today for recording provenance information are inadequate. We present a set of tools and systems for recording and publishing data provenance information to fill this void. These are built on top of the Git version control system and are geared towards data scientists in all research fields who publish the results of their research. We call this the Mezuri Data Provenance Management Platform, or Mezuri Provenance for short. Researchers can use these tools to annotate their existing data processing tools and workflows with provenance information. They can then publish this information, potentially along with the actual implementation, on our public registry.

Advisor: Eric Brewer


BibTeX citation:

@mastersthesis{Majumdar:EECS-2017-131,
    Author= {Majumdar, Dibyo},
    Title= {The Mezuri Data Provenance Management Platform},
    School= {EECS Department, University of California, Berkeley},
    Year= {2017},
    Month= {Jul},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-131.html},
    Number= {UCB/EECS-2017-131},
    Abstract= {Data plays a crucial role in society today. As the cost of collecting, storing, and processing data falls, more and more of it is collected and fed into complex analysis tools to obtain actionable results and insights. These results in turn drive decisions that affect the lives of countless people, for better and for worse. It is imperative that data scientists properly record the provenance of the results they publish, i.e., the original sources of data and the exact sequence of operations performed on those sources to arrive at the published result. Doing so ensures that results are properly contextualized and, more importantly, that they can be verified by other scientists. It also fosters collaboration and leads to the standardization of common data operations and data transfer formats.

Unfortunately, this practice is not the norm in many scientific fields. We contend that this is because the tools available today for recording provenance information are inadequate. We present a set of tools and systems for recording and publishing data provenance information to fill this void. These are built on top of the Git version control system and are geared towards data scientists in all research fields who publish the results of their research. We call this the Mezuri Data Provenance Management Platform, or Mezuri Provenance for short. Researchers can use these tools to annotate their existing data processing tools and workflows with provenance information. They can then publish this information, potentially along with the actual implementation, on our public registry.},
}

EndNote citation:

%0 Thesis
%A Majumdar, Dibyo
%T The Mezuri Data Provenance Management Platform
%I EECS Department, University of California, Berkeley
%D 2017
%8 July 24
%@ UCB/EECS-2017-131
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2017/EECS-2017-131.html
%F Majumdar:EECS-2017-131