Query Aware Synthetic Data Generation
Zoey Sun
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2023-124
May 12, 2023
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-124.pdf
Evaluating query workloads on relational databases is an essential task for many developers and researchers, but relational data is difficult to acquire because of privacy and confidentiality concerns. Query-aware synthetic data generation for database management systems (DBMS) has therefore become crucial for benchmark testing. To ensure data fidelity, the synthetic data has to conform to query cardinality constraints as well as to the properties of the database schema. Unfortunately, prior work on data generation either makes simplifying assumptions about queries and the database schema or fails to scale to large query workloads. In this paper, we propose ezGen, a synthetic data generator for web application frameworks. ezGen decomposes complicated queries, especially subqueries, into cardinality constraints that serve as the data generator’s input, and then generates data using a probability approximation model. ezGen leverages a heuristic rule-based method to translate and decouple query-based cardinality into attribute-based cardinality. In addition, unlike prior work, we aim to generate synthetic data for testing real-world database-backed web applications by exploiting data integrity constraints extracted from application source code to further ensure the fidelity of the generated data.
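To make the idea of an attribute-based cardinality constraint concrete, here is a minimal Python sketch. It is not ezGen's algorithm, which the abstract only summarizes. It is a toy illustration that assumes a single integer attribute and disjoint inclusive range predicates, each with a target row count. The function names `generate_column` and `cardinality`, the example attribute (an age column), and all numbers are hypothetical.

```python
import random


def generate_column(n_rows, constraints, domain, seed=0):
    """Generate n_rows integer values for one attribute.

    constraints: list of (low, high, count) tuples, assumed to describe
    disjoint inclusive ranges; exactly `count` values fall in each range.
    domain: (min, max) inclusive bounds of the attribute.
    """
    rng = random.Random(seed)
    dom_low, dom_high = domain
    values = []
    covered = set()

    # Satisfy each range constraint with exactly the requested row count.
    for low, high, count in constraints:
        values.extend(rng.randint(low, high) for _ in range(count))
        covered.update(range(low, high + 1))

    remaining = n_rows - len(values)
    if remaining < 0:
        raise ValueError("constraints require more rows than n_rows")

    # Fill the rest with values outside every constrained range, so the
    # constrained counts are not disturbed.
    free = [v for v in range(dom_low, dom_high + 1) if v not in covered]
    if remaining > 0 and not free:
        raise ValueError("no unconstrained values left in the domain")
    values.extend(rng.choice(free) for _ in range(remaining))

    rng.shuffle(values)
    return values


def cardinality(values, low, high):
    """Count rows matching the predicate low <= value <= high."""
    return sum(low <= v <= high for v in values)


# Hypothetical workload: 1000 rows, two range predicates with target counts.
ages = generate_column(1000, [(18, 25, 120), (60, 70, 45)], domain=(0, 99))
```

Checking `cardinality(ages, 18, 25)` and `cardinality(ages, 60, 70)` against the targets confirms that the generated column meets both constraints. A real generator also has to handle overlapping predicates, joins, subqueries, and schema-level integrity constraints, which this sketch deliberately ignores.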
Advisors: Alvin Cheung
BibTeX citation:
@mastersthesis{Sun:EECS-2023-124,
    Author = {Sun, Zoey},
    Editor = {Cheung, Alvin},
    Title = {Query Aware Synthetic Data Generation},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-124.html},
    Number = {UCB/EECS-2023-124},
    Abstract = {Evaluating query workloads on relational databases is an essential task for many developers and researchers, but relational data is difficult to acquire because of privacy and confidentiality concerns. Query-aware synthetic data generation for database management systems (DBMS) has therefore become crucial for benchmark testing. To ensure data fidelity, the synthetic data has to conform to query cardinality constraints as well as to the properties of the database schema. Unfortunately, prior work on data generation either makes simplifying assumptions about queries and the database schema or fails to scale to large query workloads. In this paper, we propose ezGen, a synthetic data generator for web application frameworks. ezGen decomposes complicated queries, especially subqueries, into cardinality constraints that serve as the data generator’s input, and then generates data using a probability approximation model. ezGen leverages a heuristic rule-based method to translate and decouple query-based cardinality into attribute-based cardinality. In addition, unlike prior work, we aim to generate synthetic data for testing real-world database-backed web applications by exploiting data integrity constraints extracted from application source code to further ensure the fidelity of the generated data.}
}
EndNote citation:
%0 Thesis
%A Sun, Zoey
%E Cheung, Alvin
%T Query Aware Synthetic Data Generation
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-124
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-124.html
%F Sun:EECS-2023-124