Sibo Ma

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-170

May 12, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-170.pdf

As Natural Language Processing (NLP) models continue to grow in size and complexity, there is an increasing demand for high-quality fine-tuning data. While the internet offers an abundant source of text data, only a small fraction of it is suitable for large models such as GPT-3 and GPT-4. In this paper, we propose a method for cleaning and filtering low-quality text data to improve both computational efficiency and model performance. To ensure the texts are closely related to the core characteristics of the dataset, we define high-quality text using four criteria: relevance, informativeness, readability, and objectivity. We then validate our approach on document classification tasks and analyze the contribution of each criterion to model performance. The results show a close relationship between the choice of criteria and the characteristics of the chosen dataset and NLP tasks.
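To make the criterion-based filtering concrete, below is a minimal Python sketch of what scoring and filtering a corpus on the four criteria could look like. This is not the pipeline described in the report: the proxy scores, keyword set, threshold, and function names (readability_score, filter_corpus, etc.) are illustrative assumptions only.

import re

def readability_score(text: str) -> float:
    """Crude readability proxy: shorter average sentence length scores higher."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    avg_len = sum(len(s.split()) for s in sentences) / len(sentences)
    return 1.0 / (1.0 + avg_len / 20.0)  # roughly 0.5 at 20 words per sentence

def informativeness_score(text: str) -> float:
    """Crude informativeness proxy: fraction of unique tokens."""
    tokens = text.lower().split()
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def relevance_score(text: str, keywords: set) -> float:
    """Crude relevance proxy: share of (lowercase) domain keywords present in the text."""
    tokens = set(text.lower().split())
    return len(tokens & keywords) / len(keywords) if keywords else 0.0

def objectivity_score(text: str) -> float:
    """Crude objectivity proxy: penalize first-person and opinion markers."""
    markers = {"i", "we", "think", "believe", "feel", "awesome", "terrible"}
    tokens = text.lower().split()
    hits = sum(1 for t in tokens if t in markers)
    return 1.0 - min(1.0, hits / max(1, len(tokens)) * 10)

def filter_corpus(texts, keywords, threshold=0.5):
    """Keep texts whose average criterion score clears the threshold."""
    kept = []
    for text in texts:
        scores = [
            relevance_score(text, keywords),
            informativeness_score(text),
            readability_score(text),
            objectivity_score(text),
        ]
        if sum(scores) / len(scores) >= threshold:
            kept.append(text)
    return kept

# Usage: only the on-topic, objective sentence survives the filter.
corpus = [
    "The transformer architecture uses self-attention to model token interactions.",
    "i think this movie was awesome!!!",
]
print(filter_corpus(corpus, keywords={"transformer", "attention", "token", "model"}))

In practice, each proxy would be replaced by a stronger signal (for example, a trained classifier or perplexity score per criterion), but the overall structure of scoring each text against the criteria and thresholding the aggregate remains the same.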


BibTeX citation:

@mastersthesis{Ma:EECS-2023-170,
    Author= {Ma, Sibo},
    Title= {Enhancing NLP Model Performance Through Data Filtering},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-170.html},
    Number= {UCB/EECS-2023-170},
    Abstract= {As Natural Language Processing (NLP) models continue to grow in size and complexity, there is an increasing demand for high-quality fine-tuning data. While the internet offers an abundant source of text data, only a small fraction of it is suitable for large models such as GPT-3 and GPT-4. In this paper, we propose a method for cleaning and filtering low-quality text data to improve both computational efficiency and model performance. To ensure the texts are closely related to the core characteristics of the dataset, we define high-quality text using four criteria: relevance, informativeness, readability, and objectivity. We then validate our approach on document classification tasks and analyze the contribution of each criterion to model performance. The results show a close relationship between the choice of criteria and the characteristics of the chosen dataset and NLP tasks.},
}

EndNote citation:

%0 Thesis
%A Ma, Sibo 
%T Enhancing NLP Model Performance Through Data Filtering
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-170
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-170.html
%F Ma:EECS-2023-170