Supporting Fine-Grained Data Lineage in a Database Visualization Environment

Allison Woodruff and Michael Stonebraker

EECS Department
University of California, Berkeley
Technical Report No. UCB/CSD-97-932
January 1997

http://www2.eecs.berkeley.edu/Pubs/TechRpts/1997/CSD-97-932.pdf

The lineage of a datum records its processing history. Because such information can be used to trace the source of anomalies and errors in processed data sets, it is valuable to users in a variety of applications, including debugging. Traditional data lineage approaches rely on metadata. However, metadata does not scale well to fine-grained lineage, especially in large data sets. For example, it is not feasible to store all the information necessary to trace from a specific floating point value in a processed data set to a particular satellite image pixel in a source data set.

In this paper, we propose a novel method to support fine-grained data lineage. Rather than relying on metadata, our approach lazily computes lineage using a limited amount of information about the processing operators and the base data. We introduce the notions of weak inversion and verification. While our system does not perfectly invert the data, it uses weak inversion and verification to provide a number of guarantees about the lineage it generates. We propose a design for the implementation of weak inversion and verification in an object-relational database management system.
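
The abstract does not define weak inversion or verification in detail, so the following Python sketch is only one plausible reading, not the report's design. It assumes a hypothetical operator, block_max, that reduces each block of source pixels to its maximum. A weak inverse then maps an output position to a superset of candidate source pixels using only knowledge of the operator, and a verification step checks those candidates against the base data to narrow the lineage. All function names and the sample data are illustrative assumptions.

```python
from typing import List


def block_max(pixels: List[float], block: int) -> List[float]:
    """Hypothetical processing operator: maximum of each block of pixels."""
    return [max(pixels[i:i + block]) for i in range(0, len(pixels), block)]


def weak_inverse(out_index: int, block: int, n_inputs: int) -> List[int]:
    """Approximate lineage: every source index in the block that produced
    the output element. Uses only operator knowledge, no stored metadata."""
    start = out_index * block
    return list(range(start, min(start + block, n_inputs)))


def verify(candidates: List[int], pixels: List[float], out_value: float) -> List[int]:
    """Refine the candidates by checking them against the base data:
    keep only the source pixels whose value equals the output value."""
    return [i for i in candidates if pixels[i] == out_value]


def lineage(out_index: int, pixels: List[float], block: int) -> List[int]:
    """Lazily compute lineage for one output element on request."""
    out_value = block_max(pixels, block)[out_index]
    candidates = weak_inverse(out_index, block, len(pixels))
    return verify(candidates, pixels, out_value)


if __name__ == "__main__":
    source = [0.1, 0.7, 0.3, 0.9, 0.2, 0.9, 0.4]  # illustrative data only
    print(block_max(source, 3))   # processed data set
    print(lineage(1, source, 3))  # source indices behind output element 1
```

Nothing is recorded while the operator runs; lineage is computed only when a user asks about a particular output value, which is the lazy behavior the abstract describes. In this sketch the weak inverse alone over-approximates the lineage, and verification is what removes the pixels that did not contribute.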


BibTeX citation:

@techreport{Woodruff:CSD-97-932,
    Author = {Woodruff, Allison and Stonebraker, Michael},
    Title = {Supporting Fine-Grained Data Lineage in a Database Visualization Environment},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {1997},
    Month = {Jan},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/1997/5412.html},
    Number = {UCB/CSD-97-932},
    Abstract = {The lineage of a datum records its processing history. Because such information can be used to trace the source of anomalies and errors in processed data sets, it is valuable to users for a variety of applications including investigation of anomalies and debugging. Traditional data lineage approaches rely on metadata. However, metadata does not scale well to fine-grained lineage, especially in large data sets. For example, it is not feasible to store all the information necessary to trace from a specific floating point value in a processed data set to a particular satellite image pixel in a source data set. In this paper, we propose a novel method to support fine-grained data lineage. Rather than relying on metadata, our approach lazily computes lineage using a limited amount of information about the processing operators and the base data. We introduce the notions of weak inversion and verification. While our system does not perfectly invert the data, it uses weak inversion and verification to provide a number of guarantees about the lineage it generates. We propose a design for the implementation of weak inversion and verification in an object-relational database management system.}
}

EndNote citation:

%0 Report
%A Woodruff, Allison
%A Stonebraker, Michael
%T Supporting Fine-Grained Data Lineage in a Database Visualization Environment
%I EECS Department, University of California, Berkeley
%D 1997
%@ UCB/CSD-97-932
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/1997/5412.html
%F Woodruff:CSD-97-932