Todd Yu

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-116

May 11, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-116.pdf

Visualization Recommendation (VisRec) systems are popular for generating visualizations with limited effort during the Exploratory Data Analysis (EDA) process. However, computing visualization recommendations can be slow. A high proportion of computation time involves calculating ranking scores (i.e., statistical utility metrics based on underlying data) for a wide range of possible visualizations. In this work, our goal is to incrementally maintain VisRec ranking scores in EDA workflows under data updates. Updates to data are common, thanks to user-driven data cleaning or transformation steps, as well as external data updates. Our primary challenges stem from analyzing a wide range of VisRec systems and their ranking scores, as well as covering a broad set of data updates, to identify how to maintain scores incrementally. We must also determine, upon updates, when to incrementally maintain ranking scores and when to recompute them from scratch. We first review an existing taxonomy of common VisRec categories, known as analytical actions, then decompose all visualization ranking scores into a minimal set of five core aggregates per column. We then propose a system to maintain these aggregates incrementally by presenting five core operators for tabular data that compose to make up a wide variety of common EDA data transformations, and then showing how we can efficiently update our ranking scores for these operations. We also propose a cost model to determine when to incrementally update ranking scores, and when to recompute scores from scratch. We implement our approach in Lux, a popular open-source VisRec system for pandas dataframes, and show how our system scales. We demonstrate the efficacy of our cost model by showing that for datasets with many columns and thus many ranking scores, our system is faster than or equivalent to naive recomputation upon updates.
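
The idea of maintaining per-column aggregates incrementally, and falling back to recomputation when that is cheaper, can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the abstract does not name the five core aggregates, the five operators, or the form of the cost model, so the aggregate set (count, sum, sum of squares, min, max), the single append operation, the names `ColumnAggregates`, `recompute`, and `should_update_incrementally`, and the `overhead` parameter are all hypothetical and are not taken from the report or from Lux.

```python
import math
from dataclasses import dataclass


@dataclass
class ColumnAggregates:
    """Running aggregates for one numeric column.

    The choice of aggregates is an assumption for illustration only;
    it is not the set defined in the report.
    """
    count: int = 0
    total: float = 0.0
    total_sq: float = 0.0
    minimum: float = math.inf
    maximum: float = -math.inf

    def append(self, values):
        """Fold newly appended rows into the aggregates in O(len(values))."""
        for v in values:
            self.count += 1
            self.total += v
            self.total_sq += v * v
            self.minimum = min(self.minimum, v)
            self.maximum = max(self.maximum, v)

    def variance(self):
        """Population variance derived from the stored aggregates alone."""
        if self.count == 0:
            return 0.0
        mean = self.total / self.count
        # Clamp at zero to absorb floating-point round-off.
        return max(self.total_sq / self.count - mean * mean, 0.0)


def recompute(values):
    """Baseline: rebuild the aggregates from scratch over the full column."""
    agg = ColumnAggregates()
    agg.append(values)
    return agg


def should_update_incrementally(n_existing, n_changed, overhead=4.0):
    """Toy cost rule (hypothetical): incremental work costs `overhead` units
    per changed row, while recomputation costs one unit per row overall."""
    return overhead * n_changed < n_existing + n_changed
```

Under these assumptions, appending two rows to a three-row column touches only the two new values, yet yields the same aggregates as a full recomputation; the toy cost rule prefers the incremental path when the update is small relative to the existing data.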

Advisors: Aditya Parameswaran


BibTeX citation:

@mastersthesis{Yu:EECS-2023-116,
    Author= {Yu, Todd},
    Editor= {Parameswaran, Aditya and Tang, Dixin},
    Title= {Efficient Visualization Recommendation under Updates},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-116.html},
    Number= {UCB/EECS-2023-116},
    Abstract= {Visualization Recommendation (VisRec) systems are popular for generating visualizations with limited effort during the Exploratory Data Analysis (EDA) process. However, computing visualization recommendations can be slow. A high proportion of computation time involves calculating ranking scores (i.e., statistical utility metrics based on underlying data) for a wide range of possible visualizations. In this work, our goal is to incrementally maintain VisRec ranking scores in EDA workflows under data updates. Updates to data are common, thanks to user-driven data cleaning or transformation steps, as well as external data updates. Our primary challenges stem from analyzing a wide range of VisRec systems and their ranking scores, as well as covering a broad set of data updates, to identify how to maintain scores incrementally. We must also determine, upon updates, when to incrementally maintain ranking scores and when to recompute them from scratch. We first review an existing taxonomy of common VisRec categories, known as analytical actions, then decompose all visualization ranking scores into a minimal set of five core aggregates per column. We then propose a system to maintain these aggregates incrementally by presenting five core operators for tabular data that compose to make up a wide variety of common EDA data transformations, and then showing how we can efficiently update our ranking scores for these operations. We also propose a cost model to determine when to incrementally update ranking scores, and when to recompute scores from scratch. We implement our approach in Lux, a popular open-source VisRec system for pandas dataframes, and show how our system scales. We demonstrate the efficacy of our cost model by showing that for datasets with many columns and thus many ranking scores, our system is faster than or equivalent to naive recomputation upon updates.},
}

EndNote citation:

%0 Thesis
%A Yu, Todd 
%E Parameswaran, Aditya 
%E Tang, Dixin 
%T Efficient Visualization Recommendation under Updates
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 11
%@ UCB/EECS-2023-116
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-116.html
%F Yu:EECS-2023-116