BayesStore: Supporting Native Probabilistic Models in Data Management Systems (BayesStore)


Daisy Z. Wang, Eirinaios C. Michelakis, Michael Franklin, Minos Garofalakis and Joseph M. Hellerstein

As massive data acquisition (e.g., web documents, sensor networks, click-streams, software logs) and storage become increasingly affordable, a wide variety of enterprises are employing statistical and machine learning models for advanced probabilistic data analysis. For instance, information extraction systems apply statistical models over free text to extract structured data; pervasive computing applications must constantly reason about large volumes of noisy sensor readings to accomplish tasks such as motion prediction and modeling of human behavior; web applications (e.g., social networks, recommendation systems) need to analyze click-streams to model their customers; and large software companies need to analyze huge volumes of software logs to predict and debug errors at run time. One way to support such probabilistic analyses over large volumes of data is a probabilistic data management system (PDBMS). Early approaches to building a PDBMS relied on somewhat simplistic models of uncertainty that can be easily mapped onto existing relational architectures. However, these approaches introduce a gap between the statistical models used for probabilistic analytics and the uncertainty model in the PDBMS. Our solution to this “model-mismatch” problem is BayesStore, a novel probabilistic data management architecture built on the principle of handling statistical models, evidence data and probabilistic inference algorithms as first-class citizens of the database system. BayesStore represents model and evidence data as relational tables; implements inference algorithms efficiently in SQL; adds probabilistic relational operators to the query engine; and optimizes queries that contain both relational and inference operators.
The design goals of BayesStore are to: (1) support query processing over different models that is efficient relative to off-the-shelf machine learning libraries; (2) provide an extensible API for plugging in new models and inference algorithms; and (3) scale to very large data sets.
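
To make the idea of storing model and evidence data as relational tables and running inference in SQL more concrete, the following is a minimal sketch and not BayesStore's actual schema or implementation. It assumes a toy two-variable Bayesian network (Rain -> WetGrass) with made-up probabilities, uses SQLite through Python's standard `sqlite3` module purely for convenience, and all table, column and function names (`prior_rain`, `cpt_wet`, `evidence`, `posterior_rain_given_wet`) are hypothetical.

```python
import sqlite3


def posterior_rain_given_wet(wet_observed: int) -> dict:
    """Return P(rain | wet = wet_observed) for a toy network, computed in SQL.

    Hypothetical schema and numbers, for illustration only.
    """
    conn = sqlite3.connect(":memory:")
    cur = conn.cursor()

    # Model data: a prior table and a conditional probability table (CPT).
    # Evidence data: the observed value of the 'wet' variable.
    cur.executescript(
        """
        CREATE TABLE prior_rain (rain INTEGER PRIMARY KEY, p REAL);
        CREATE TABLE cpt_wet (rain INTEGER, wet INTEGER, p REAL);
        CREATE TABLE evidence (wet INTEGER);

        INSERT INTO prior_rain VALUES (0, 0.8), (1, 0.2);
        INSERT INTO cpt_wet VALUES
            (0, 0, 0.9), (0, 1, 0.1),
            (1, 0, 0.1), (1, 1, 0.9);
        """
    )
    cur.execute("INSERT INTO evidence VALUES (?)", (wet_observed,))

    # Inference as a relational query: join the factors, keep only rows
    # consistent with the evidence, then normalize the products.
    rows = cur.execute(
        """
        WITH joint AS (
            SELECT pr.rain AS rain, pr.p * c.p AS p
            FROM prior_rain AS pr
            JOIN cpt_wet AS c ON c.rain = pr.rain
            JOIN evidence AS e ON e.wet = c.wet
        )
        SELECT rain, p / (SELECT SUM(p) FROM joint) AS posterior
        FROM joint
        ORDER BY rain
        """
    ).fetchall()
    conn.close()
    return {rain: posterior for rain, posterior in rows}


if __name__ == "__main__":
    print(posterior_rain_given_wet(1))
```

In this sketch the join plays the role of a factor product and the evidence table acts as a filter, so an ordinary relational engine can plan and execute the inference step alongside any other query operators. A real system would need far more general representations for larger models.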