Enhancing NLP Model Performance Through Data Filtering
Sibo Ma
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2023-170
May 12, 2023
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-170.pdf
As Natural Language Processing (NLP) models continue to grow in size and complexity, there is an increasing demand for high-quality fine-tuning data. While the internet offers an abundant source of text data, only a small fraction of it is suitable for large models such as GPT-3 and GPT-4. In this paper, we propose a method for cleaning and filtering low-quality text data to improve both computational efficiency and model performance. To ensure the texts are closely related to the core characteristics of the dataset, we define high-quality text using four criteria: relevance, informativeness, readability, and objectivity. We then validate our approach through document classification tasks and analyze the contribution of each criterion to model performance. The results show a close relationship between the choice of criteria and the characteristics of the chosen dataset and NLP tasks.
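The report's actual criterion definitions are not reproduced on this page. As a minimal sketch of the idea, the four criteria could each be scored by a simple heuristic proxy and averaged, keeping only texts above a threshold; every scoring function and parameter below is an illustrative assumption, not the author's method.

```python
import re

# Illustrative proxies only -- each function is a crude stand-in for one of
# the four criteria (relevance, informativeness, readability, objectivity).

def relevance(text, keywords):
    """Fraction of the topic keywords that appear in the text."""
    words = set(re.findall(r"[a-z']+", text.lower()))
    return sum(k in words for k in keywords) / len(keywords)

def informativeness(text):
    """Type-token ratio: vocabulary diversity as a crude information proxy."""
    tokens = re.findall(r"[a-z']+", text.lower())
    return len(set(tokens)) / len(tokens) if tokens else 0.0

def readability(text):
    """Reward moderate average sentence length (assumed 5-25 words)."""
    sents = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sents:
        return 0.0
    avg_len = sum(len(s.split()) for s in sents) / len(sents)
    return 1.0 if 5 <= avg_len <= 25 else 0.0

def objectivity(text):
    """Treat texts with fewer first-person pronouns as more objective."""
    tokens = re.findall(r"[a-z']+", text.lower())
    if not tokens:
        return 0.0
    personal = sum(t in {"i", "me", "my", "we", "our"} for t in tokens)
    return 1.0 - personal / len(tokens)

def filter_corpus(texts, keywords, threshold=0.5):
    """Keep texts whose average criterion score clears the threshold."""
    kept = []
    for t in texts:
        score = (relevance(t, keywords) + informativeness(t)
                 + readability(t) + objectivity(t)) / 4
        if score >= threshold:
            kept.append(t)
    return kept
```

In practice each criterion would likely be scored by a learned classifier rather than these surface heuristics, but the filter's structure (per-criterion scores combined against a threshold) stays the same.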
BibTeX citation:
@mastersthesis{Ma:EECS-2023-170,
    Author = {Ma, Sibo},
    Title = {Enhancing NLP Model Performance Through Data Filtering},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-170.html},
    Number = {UCB/EECS-2023-170},
    Abstract = {As Natural Language Processing (NLP) models continue to grow in size and complexity, there is an increasing demand for high-quality fine-tuning data. While the internet offers an abundant source of text data, only a small fraction of it is suitable for large models such as GPT-3 and GPT-4. In this paper, we propose a method for cleaning and filtering low-quality text data to improve both computational efficiency and model performance. To ensure the texts are closely related to the core characteristics of the dataset, we define high-quality text using four criteria: relevance, informativeness, readability, and objectivity. We then validate our approach through document classification tasks and analyze the contribution of each criterion to model performance. The results show a close relationship between the choice of criteria and the characteristics of the chosen dataset and NLP tasks.},
}
EndNote citation:
%0 Thesis
%A Ma, Sibo
%T Enhancing NLP Model Performance Through Data Filtering
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 12
%@ UCB/EECS-2023-170
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-170.html
%F Ma:EECS-2023-170