Yueheng Zhang

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-116

May 17, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-116.pdf

This technical report presents a novel method for creating question-answering (QA) datasets from extensive unstructured textual corpora using large language models (LLMs). We produced over one million questions from the Wikitext dataset without requiring human labor; the questions are more flexible than, and of similar or higher quality than, those written by human annotators. A standout feature is our evaluation method, which uses guided self-instruction to have an LLM rate question quality on a scale from 1 to 5, offering more cohesive and consistent evaluations than unguided self-instruct ratings and matching human labeling. The system includes an intuitive user interface and visualization tools that make dataset analysis and refinement easy. Importantly, the dataset is tailored for Retrieval-Augmented Generation (RAG) and long-context evaluation across large corpora, significantly enhancing the performance of QA systems on complex queries. This framework sets a new standard for future QA research, model building, and system evaluation.
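The report's exact rubric and prompts are not reproduced on this page, but the guided self-instruct rating idea can be sketched as follows: embed an explicit 1-to-5 rubric in the prompt given to the LLM judge, then parse a structured score from its reply. The rubric criteria and the `Score:` reply format below are illustrative assumptions, not the report's actual wording.

```python
import re

# Hypothetical rubric -- illustrative criteria, not the report's actual rubric.
RUBRIC = """Rate the question on a 1-5 scale:
1 = ungrammatical or unanswerable from the passage
2 = answerable but trivial or ambiguous
3 = clear and answerable, but shallow
4 = clear, specific, and requires understanding the passage
5 = clear, specific, and requires synthesizing multiple facts"""


def build_rating_prompt(passage: str, question: str) -> str:
    """Assemble a guided rating prompt for an LLM judge."""
    return (
        f"{RUBRIC}\n\n"
        f"Passage:\n{passage}\n\n"
        f"Question:\n{question}\n\n"
        "Respond with only: Score: <1-5>"
    )


def parse_score(reply: str) -> int:
    """Extract the 1-5 score from the judge's reply; raise if it is missing."""
    match = re.search(r"Score:\s*([1-5])", reply)
    if not match:
        raise ValueError(f"no score found in reply: {reply!r}")
    return int(match.group(1))
```

In use, `build_rating_prompt` would be sent to whatever LLM serves as the judge, and `parse_score` applied to its reply; constraining the reply format is what makes the guided ratings more consistent than free-form unguided ones.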

Advisor: Dawn Song


BibTeX citation:

@mastersthesis{Zhang:EECS-2024-116,
    Author= {Zhang, Yueheng},
    Title= {Fleece QA: Self-Instruct Generation for Question Answering Dataset from Large Corpus},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-116.html},
    Number= {UCB/EECS-2024-116},
    Abstract= {This technical report presents a novel method for creating question-answering (QA) datasets from extensive unstructured textual corpora using large language models (LLMs). We produced over one million questions from the Wikitext dataset without requiring human labor; the questions are more flexible than, and of similar or higher quality than, those written by human annotators. A standout feature is our evaluation method, which uses guided self-instruction to have an LLM rate question quality on a scale from 1 to 5, offering more cohesive and consistent evaluations than unguided self-instruct ratings and matching human labeling. The system includes an intuitive user interface and visualization tools that make dataset analysis and refinement easy. Importantly, the dataset is tailored for Retrieval-Augmented Generation (RAG) and long-context evaluation across large corpora, significantly enhancing the performance of QA systems on complex queries. This framework sets a new standard for future QA research, model building, and system evaluation.},
}

EndNote citation:

%0 Thesis
%A Zhang, Yueheng 
%T Fleece QA: Self-Instruct Generation for Question Answering Dataset from Large Corpus
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 17
%@ UCB/EECS-2024-116
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-116.html
%F Zhang:EECS-2024-116