Daisy Zhe Wang and Michael Franklin and Luna Dong and Anish Das Sarma and Alon Halevy

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2009-119

August 15, 2009

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-119.pdf

Functional dependency is one of the most extensively researched subjects in database theory, originally for improving the quality of schemas, and recently for improving the quality of data. In a pay-as-you-go data integration system, where the goal is to provide best-effort service even without a thorough understanding of the underlying domain and the various data sources, functional dependencies can play an even more important role: they can be applied in normalizing an automatically generated mediated schema, pinpointing sources of low quality, resolving conflicts in data from different sources, improving the efficiency of query answering, and so on. Despite their importance, discovering functional dependencies in such a context is challenging: we cannot assume upfront domain knowledge for specifying dependencies, and the data can be dirty, incomplete, or even misinterpreted, which makes automatic discovery of dependencies hard.

This paper studies how one can automatically discover functional dependencies in a pay-as-you-go data integration system. We introduce the notion of probabilistic functional dependencies (pFDs) and design Bayes models that compute probabilities of dependencies according to data from various sources. As an application, we study how to normalize a mediated schema based on the pFDs we generate. Experiments on real-world data sets with tens or hundreds of data sources show that our techniques obtain high precision and recall in dependency discovery and generate high-quality results in mediated-schema normalization.
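To make the idea of a dependency that holds only approximately across noisy sources concrete, here is a minimal sketch. It is not the Bayes model from the report: it assumes a simple per-source score (the fraction of rows that agree with the most common right-hand value for each left-hand value) and then averages that score over sources. The function names `fd_support` and `average_support`, the attribute names, and the sample rows are all hypothetical.

```python
from collections import defaultdict


def fd_support(rows, lhs, rhs):
    """Fraction of complete rows consistent with lhs -> rhs in one source.

    For each lhs value, rows carrying the most common rhs value count as
    consistent; every other row counts as a violation. This scoring rule is
    an illustrative assumption, not the report's method.
    """
    groups = defaultdict(lambda: defaultdict(int))
    total = 0
    for row in rows:
        x, y = row.get(lhs), row.get(rhs)
        if x is None or y is None:
            continue  # skip incomplete rows
        groups[x][y] += 1
        total += 1
    if total == 0:
        return None  # this source says nothing about the dependency
    kept = sum(max(counts.values()) for counts in groups.values())
    return kept / total


def average_support(sources, lhs, rhs):
    """Average the per-source scores, ignoring sources with no usable rows."""
    scores = [fd_support(rows, lhs, rhs) for rows in sources]
    scores = [s for s in scores if s is not None]
    return sum(scores) / len(scores) if scores else None


# Hypothetical data: two sources describing the same attributes.
source_a = [
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "94305", "city": "Stanford"},
]
source_b = [
    {"zip": "94720", "city": "Berkeley"},
    {"zip": "94720", "city": "Oakland"},  # conflicting (dirty) value
    {"zip": "94305", "city": "Stanford"},
    {"zip": "94305", "city": None},       # incomplete row, skipped
]

score = average_support([source_a, source_b], "zip", "city")
```

In this toy data the dependency zip -> city holds exactly in the first source and for two of three complete rows in the second, so the averaged score falls strictly between the two. A plain average treats every source as equally trustworthy, which is precisely the kind of simplification a probabilistic model over sources is meant to avoid.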


BibTeX citation:

@techreport{Wang:EECS-2009-119,
    Author= {Wang, Daisy Zhe and Franklin, Michael and Dong, Luna and Das Sarma, Anish and Halevy, Alon},
    Title= {Discovering Functional Dependencies in Pay-As-You-Go Data Integration Systems},
    Year= {2009},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-119.html},
    Number= {UCB/EECS-2009-119},
    Abstract= {Functional dependency is one of the most extensively researched subjects in database theory, originally for improving the quality of schemas, and recently for improving the quality of data. In a pay-as-you-go data integration system, where the goal is to provide best-effort service even without a thorough understanding of the underlying domain and the various data sources, functional dependencies can play an even more important role: they can be applied in normalizing an automatically generated mediated schema, pinpointing sources of low quality, resolving conflicts in data from different sources, improving the efficiency of query answering, and so on. Despite their importance, discovering functional dependencies in such a context is challenging: we cannot assume upfront domain knowledge for specifying dependencies, and the data can be dirty, incomplete, or even misinterpreted, which makes automatic discovery of dependencies hard.

This paper studies how one can automatically discover functional dependencies in a pay-as-you-go data integration system. We introduce the notion of probabilistic functional dependencies (pFDs) and design Bayes models that compute probabilities of dependencies according to data from various sources. As an application, we study how to normalize a mediated schema based on the pFDs we generate. Experiments on real-world data sets with tens or hundreds of data sources show that our techniques obtain high precision and recall in dependency discovery and generate high-quality results in mediated-schema normalization.},
}

EndNote citation:

%0 Report
%A Wang, Daisy Zhe 
%A Franklin, Michael 
%A Dong, Luna 
%A Das Sarma, Anish 
%A Halevy, Alon 
%T Discovering Functional Dependencies in Pay-As-You-Go Data Integration Systems
%I EECS Department, University of California, Berkeley
%D 2009
%8 August 15
%@ UCB/EECS-2009-119
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2009/EECS-2009-119.html
%F Wang:EECS-2009-119