Evaluating and Predicting the Performance of Large Language Models on Long-Context Tasks

Yike Wang

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-122

May 17, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-122.pdf

Under certain circumstances, we may lack the computational resources and data needed to run experiments and assess a specific model's performance on a given long-context task. In this work, we explore the factors associated with long-context performance by studying the correlation between different unit tasks and different context lengths. We propose a new multi-context evaluation dataset with three distinct tasks based on context-access patterns, featuring adjustable context length, automated evaluation, and diagnosis beyond overall accuracy. All unit tasks present unexpected challenges for state-of-the-art large language models, and we observe a significant decrease in performance as token length increases. We hypothesize that, for predicting long-context performance, the type of task is more informative than context length, and that assessing the long-context understanding of language models solely through the needle-in-the-haystack approach is insufficient.
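For context, the needle-in-the-haystack approach the abstract critiques is typically constructed by hiding a single salient fact at a chosen depth inside a long filler context and asking the model to retrieve it. The sketch below illustrates that construction; the function names, filler text, and question wording are illustrative assumptions, not taken from the report or its dataset.

```python
def build_needle_prompt(filler_sentences, needle, depth_fraction):
    """Insert `needle` at roughly `depth_fraction` of the way through the
    filler context, then append a retrieval question."""
    sentences = list(filler_sentences)
    pos = int(depth_fraction * len(sentences))  # 0.0 = start, 1.0 = end
    sentences.insert(pos, needle)
    context = " ".join(sentences)
    question = "What is the magic number mentioned in the text above?"
    return f"{context}\n\n{question}"

def exact_match(model_answer, gold):
    """Binary accuracy: 1 if the gold string appears in the model's answer."""
    return int(gold.lower() in model_answer.lower())

# Scale the haystack to vary context length; sweep depth_fraction to
# probe whether retrieval degrades at particular positions.
filler = [f"Filler sentence number {i} about nothing in particular."
          for i in range(1000)]
prompt = build_needle_prompt(filler, "The magic number is 7481.",
                             depth_fraction=0.5)
```

Because such a probe tests only single-fact retrieval, high scores on it need not transfer to tasks with other context-access patterns, which is the limitation the abstract points to.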


BibTeX citation:

@mastersthesis{Wang:EECS-2024-122,
    Author= {Wang, Yike},
    Title= {Evaluating and Predicting the Performance of Large Language Models on Long-Context Tasks},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-122.html},
    Number= {UCB/EECS-2024-122},
    Abstract= {Under certain circumstances, we may lack the computational resources and data needed to run experiments and assess a specific model's performance on a given long-context task. In this work, we explore the factors associated with long-context performance by studying the correlation between different unit tasks and different context lengths. We propose a new multi-context evaluation dataset with three distinct tasks based on context-access patterns, featuring adjustable context length, automated evaluation, and diagnosis beyond overall accuracy. All unit tasks present unexpected challenges for state-of-the-art large language models, and we observe a significant decrease in performance as token length increases. We hypothesize that, for predicting long-context performance, the type of task is more informative than context length, and that assessing the long-context understanding of language models solely through the needle-in-the-haystack approach is insufficient.},
}

EndNote citation:

%0 Thesis
%A Wang, Yike 
%T Evaluating and Predicting the Performance of Large Language Models on Long-Context Tasks
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 17
%@ UCB/EECS-2024-122
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-122.html
%F Wang:EECS-2024-122