Query Aware Synthetic Data Generation

Zoey Sun

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-124

May 12, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-124.pdf

Evaluating query workloads on relational databases is an essential task for many developers and researchers, but relational data is often difficult to acquire because of privacy and confidentiality concerns. Query-aware synthetic data generation for database management systems (DBMS) is therefore crucial for benchmark testing. To ensure data fidelity, the synthetic data must conform to query cardinality constraints as well as to the properties of the database schema. Unfortunately, prior work on data generation either makes simplifying assumptions about queries and database schemas or fails to scale to large query workloads. In this paper, we propose ezGen, a synthetic data generator for web application frameworks. ezGen decomposes complicated queries, especially subqueries, into cardinality constraints that serve as the data generator’s input, then generates data using a probability approximation model. ezGen leverages a heuristic rule-based method to translate and decouple query-based cardinalities into attribute-based cardinalities. In addition, unlike prior work, we aim to generate synthetic data for testing real-world database-backed web applications, exploiting data integrity constraints extracted from application source code to further ensure the fidelity of the generated data.
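To make the notion of a query cardinality constraint concrete, the following is a minimal sketch and not ezGen's algorithm. It assumes a single hypothetical table `users` with an integer column `age`, and one constraint: the query `SELECT COUNT(*) FROM users WHERE age >= ?` must return a given number of rows. The helper names `generate_ages` and `query_cardinality`, the value range, and the fixed seed are all illustrative assumptions. The query-level target is translated into an attribute-level one (how many `age` values fall on each side of the threshold), and the result is checked against an in-memory SQLite database.

```python
import random
import sqlite3


def generate_ages(total_rows, threshold, matching_rows, lo=18, hi=90, seed=0):
    """Return `total_rows` ages with exactly `matching_rows` values >= threshold.

    Hypothetical helper: turns a query-level cardinality target into an
    attribute-level one by fixing how many values land in each range.
    """
    if not 0 <= matching_rows <= total_rows:
        raise ValueError("matching_rows must be between 0 and total_rows")
    if not lo < threshold <= hi:
        raise ValueError("threshold must satisfy lo < threshold <= hi")
    rng = random.Random(seed)
    # Values that satisfy the predicate age >= threshold.
    above = [rng.randint(threshold, hi) for _ in range(matching_rows)]
    # Values that do not satisfy the predicate.
    below = [rng.randint(lo, threshold - 1)
             for _ in range(total_rows - matching_rows)]
    ages = above + below
    rng.shuffle(ages)  # avoid encoding the split in row order
    return ages


def query_cardinality(ages, threshold):
    """Load the ages into an in-memory table and count rows matching the query."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.execute(
            "CREATE TABLE users (id INTEGER PRIMARY KEY, age INTEGER NOT NULL)"
        )
        conn.executemany(
            "INSERT INTO users (age) VALUES (?)", [(a,) for a in ages]
        )
        (count,) = conn.execute(
            "SELECT COUNT(*) FROM users WHERE age >= ?", (threshold,)
        ).fetchone()
        return count
    finally:
        conn.close()


if __name__ == "__main__":
    ages = generate_ages(total_rows=1000, threshold=30, matching_rows=250)
    print(query_cardinality(ages, threshold=30))
```

A single-predicate constraint like this one can be satisfied exactly by construction. Workloads with many overlapping predicates, joins, or subqueries are harder, which is presumably where the decomposition and approximation steps described in the abstract come in.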

Advisor: Alvin Cheung

BibTeX citation:

@mastersthesis{Sun:EECS-2023-124,
    Author= {Sun, Zoey},
    Editor= {Cheung, Alvin},
    Title= {Query Aware Synthetic Data Generation},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-124.html},
    Number= {UCB/EECS-2023-124},
    Abstract= {Evaluating query workloads on relational databases is an essential task for many developers and researchers, but relational data is often difficult to acquire because of privacy and confidentiality concerns. Query-aware synthetic data generation for database management systems (DBMS) is therefore crucial for benchmark testing. To ensure data fidelity, the synthetic data must conform to query cardinality constraints as well as to the properties of the database schema. Unfortunately, prior work on data generation either makes simplifying assumptions about queries and database schemas or fails to scale to large query workloads.
In this paper, we propose ezGen, a synthetic data generator for web application frameworks. ezGen decomposes complicated queries, especially subqueries, into cardinality constraints that serve as the data generator’s input, then generates data using a probability approximation model. ezGen leverages a heuristic rule-based method to translate and decouple query-based cardinalities into attribute-based cardinalities. In addition, unlike prior work, we aim to generate synthetic data for testing real-world database-backed web applications, exploiting data integrity constraints extracted from application source code to further ensure the fidelity of the generated data.},
}

EndNote citation:

%0 Thesis
%A Sun, Zoey 
%E Cheung, Alvin 
%T Query Aware Synthetic Data Generation
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-124
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-124.html
%F Sun:EECS-2023-124