Tools and Techniques for Building Programming Assistants for Data Analysis

Rohan Bavishi

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2022-208

August 12, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-208.pdf

We live in the data age. Today, data analytics drives much of business decision-making, logistics, advertising, and recommendations. Data wrangling, profiling, and visualization are key tasks in a data analytics workflow, and they account for a majority of the time spent in data analysis. Academics and industry leaders have long attributed this disproportionate share of time to the inherently domain-specific nature of data, which necessitates highly custom treatment for every new data source. Specifying these custom analysis steps using low-level tools such as Excel can be prohibitively cumbersome. In response, much research has focused on smarter interactive and graphical tools for data processing and visualization tasks. Successful commercialization of this research has contributed to a $3 billion self-service analytics industry.

However, analysts with a programming background have not adopted such tools as widely as their non-programmer colleagues have. Major reasons include the desire to avoid shuffling between tools and to work in a single environment, and the need for the full, unbounded expressivity of programming-based analysis tools. This does not mean that programmers are immune to the specification burden; the expressivity of programming comes at the cost of complexity and steep learning curves. Novices must spend considerable time learning these tools from books or fragmented online resources. Even experts report a loss in productivity from constantly having to look up documentation to get uninteresting details such as function names and argument values right.

Thus, there is a need for programming assistants that reduce the specification burden for programmer-analysts while letting them work with code in their preferred development environments. These assistants should help programmer-analysts write code more efficiently by automatically generating human-readable and readily integrable code from high-level specifications.

This dissertation introduces techniques and corresponding prototypical assistants that accept input-output examples, demonstrations, or natural language specifications and automatically generate suitable data processing and visualization code using popular data science libraries such as pandas, matplotlib, seaborn, and scikit-learn. Automatic code generation has long faced a tradeoff barrier between expressivity and performance/accuracy: supporting a large number of analysis tasks makes it that much harder to generate the right code quickly. Accordingly, prior research in program synthesis and semantic parsing has largely sacrificed full expressivity to support efficient code generation for a small but useful subset of tasks. The code-as-text approach of modern natural language processing systems, including large language models, promises unbounded expressivity, but its sub-optimal accuracy remains a concern.

This dissertation attempts to break this tradeoff barrier: can we build programming systems that are fully expressive while remaining fast and accurate? Specifically, it builds upon prior work and introduces novel code generation techniques that combine insights from synthesis, automated testing, program analysis, and machine learning. It contributes four core techniques and corresponding assistants: AutoPandas, Gauss, VizSmith, and Datana. AutoPandas and Gauss constitute core advances in search space and algorithm design for example-based synthesis. VizSmith and Datana introduce novel mining and auto-summarization techniques to automatically build aligned code and natural language corpora, which Datana uses to greatly improve the code-generation capabilities of modern large language models. Compared to prior work, these assistants improve the expressivity of synthesis-based systems and the accuracy of machine-learning-based systems.

Advisor: Koushik Sen


BibTeX citation:

@phdthesis{Bavishi:EECS-2022-208,
    Author= {Bavishi, Rohan},
    Title= {Tools and Techniques for Building Programming Assistants for Data Analysis},
    School= {EECS Department, University of California, Berkeley},
    Year= {2022},
    Month= {Aug},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-208.html},
    Number= {UCB/EECS-2022-208},
    Abstract= {We live in the data age. Today, data analytics drives much of business decision-making, logistics, advertising, and recommendations. Data wrangling, profiling, and visualization are key tasks in a data analytics workflow, and they account for a majority of the time spent in data analysis. Academics and industry leaders have long attributed this disproportionate share of time to the inherently domain-specific nature of data, which necessitates highly custom treatment for every new data source. Specifying these custom analysis steps using low-level tools such as Excel can be prohibitively cumbersome. In response, much research has focused on smarter interactive and graphical tools for data processing and visualization tasks. Successful commercialization of this research has contributed to a \$3 billion self-service analytics industry.

However, analysts with a programming background have not adopted such tools as widely as their non-programmer colleagues have. Major reasons include the desire to avoid shuffling between tools and to work in a single environment, and the need for the full, unbounded expressivity of programming-based analysis tools. This does not mean that programmers are immune to the specification burden; the expressivity of programming comes at the cost of complexity and steep learning curves. Novices must spend considerable time learning these tools from books or fragmented online resources. Even experts report a loss in productivity from constantly having to look up documentation to get uninteresting details such as function names and argument values right.

Thus, there is a need for programming assistants that reduce the specification burden for programmer-analysts while letting them work with code in their preferred development environments. These assistants should help programmer-analysts write code more efficiently by automatically generating human-readable and readily integrable code from high-level specifications.

This dissertation introduces techniques and corresponding prototypical assistants that accept input-output examples, demonstrations, or natural language specifications and automatically generate suitable data processing and visualization code using popular data science libraries such as pandas, matplotlib, seaborn, and scikit-learn. Automatic code generation has long faced a tradeoff barrier between expressivity and performance/accuracy: supporting a large number of analysis tasks makes it that much harder to generate the right code quickly. Accordingly, prior research in program synthesis and semantic parsing has largely sacrificed full expressivity to support efficient code generation for a small but useful subset of tasks. The code-as-text approach of modern natural language processing systems, including large language models, promises unbounded expressivity, but its sub-optimal accuracy remains a concern.

This dissertation attempts to break this tradeoff barrier: can we build programming systems that are fully expressive while remaining fast and accurate? Specifically, it builds upon prior work and introduces novel code generation techniques that combine insights from synthesis, automated testing, program analysis, and machine learning. It contributes four core techniques and corresponding assistants: AutoPandas, Gauss, VizSmith, and Datana. AutoPandas and Gauss constitute core advances in search space and algorithm design for example-based synthesis. VizSmith and Datana introduce novel mining and auto-summarization techniques to automatically build aligned code and natural language corpora, which Datana uses to greatly improve the code-generation capabilities of modern large language models. Compared to prior work, these assistants improve the expressivity of synthesis-based systems and the accuracy of machine-learning-based systems.},
}

EndNote citation:

%0 Thesis
%A Bavishi, Rohan
%T Tools and Techniques for Building Programming Assistants for Data Analysis
%I EECS Department, University of California, Berkeley
%D 2022
%8 August 12
%@ UCB/EECS-2022-208
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-208.html
%F Bavishi:EECS-2022-208