AI-Assisted Dataset Discovery with DATASCOUT

Rachel Lin, Bhavya Chopra, Wenjing Lin, Shreya Shankar, Madelon Hulsebos and Aditya Parameswaran

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-113
May 16, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-113.pdf

Dataset search, the process of finding appropriate datasets for a given task, remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is a multi-pronged affair that involves understanding data characteristics (e.g., granularity, attributes, size), semantics (e.g., dataset topic and creation goals), and relevance to the task at hand. Present-day dataset search interfaces are restrictive: users struggle to convey implicit preferences and lack visibility into the search space and result inclusion criteria, making query iteration and reformulation challenging. To bridge these gaps, we introduce DataScout, a tool that proactively steers users through the process of dataset discovery via (i) AI-assisted query reformulations informed by the underlying search space, (ii) semantic search and filtering based on dataset content, including attributes (columns) and granularity (rows), and (iii) dataset relevance indicators that are dynamically generated based on the user-specified task. A within-subjects study with 12 participants comparing DataScout to keyword and semantic dataset search tools reveals that users uniquely employ DataScout’s features not only for structured dataset explorations, but also as a means to glean feedback on their search queries and build conceptual models of the dataset search space.

Advisor: Aditya Parameswaran

BibTeX citation:

@mastersthesis{Lin:EECS-2025-113,
    Author = {Lin, Rachel and Chopra, Bhavya and Lin, Wenjing and Shankar, Shreya and Hulsebos, Madelon and Parameswaran, Aditya},
    Title = {AI-Assisted Dataset Discovery with DATASCOUT},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-113.html},
    Number = {UCB/EECS-2025-113},
    Abstract = {Dataset search, the process of finding appropriate datasets for a given task, remains a critical yet under-explored challenge in data science workflows. Assessing dataset suitability for a task (e.g., training a classification model) is a multi-pronged affair that involves understanding data characteristics (e.g., granularity, attributes, size), semantics (e.g., dataset topic and creation goals), and relevance to the task at hand. Present-day dataset search interfaces are restrictive: users struggle to convey implicit preferences and lack visibility into the search space and result inclusion criteria, making query iteration and reformulation challenging. To bridge these gaps, we introduce DataScout, a tool that proactively steers users through the process of dataset discovery via (i) AI-assisted query reformulations informed by the underlying search space, (ii) semantic search and filtering based on dataset content, including attributes (columns) and granularity (rows), and (iii) dataset relevance indicators that are dynamically generated based on the user-specified task. A within-subjects study with 12 participants comparing DataScout to keyword and semantic dataset search tools reveals that users uniquely employ DataScout’s features not only for structured dataset explorations, but also as a means to glean feedback on their search queries and build conceptual models of the dataset search space.}
}

EndNote citation:

%0 Thesis
%A Lin, Rachel
%A Chopra, Bhavya
%A Lin, Wenjing
%A Shankar, Shreya
%A Hulsebos, Madelon
%A Parameswaran, Aditya
%T AI-Assisted Dataset Discovery with DATASCOUT
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 16
%@ UCB/EECS-2025-113
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-113.html
%F Lin:EECS-2025-113