First Token Probabilities are Unreliable Indicators for LLM Knowledge
Justin Shao
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-114
May 16, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-114.pdf
Multiple Choice Questions (MCQs) are a prevalent evaluation method used across many popular LLM benchmarks. Typically, these evaluations rely on first-token probabilities to deduce the model’s proposed answer. However, previous studies have demonstrated that first-token probabilities are vulnerable to prompt perturbations. In this study, we broaden our examination to explore the performance disparity between direct free-generation and the assessment of MCQs using first-token probabilities. Our experiments confirm the unreliability of first-token probabilities, as they often do not align with generation results. Additionally, we uncover a surprising finding: LLMs tend to struggle with arithmetic MCQs, even though they can reliably generate the correct answers.
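The gap between the two evaluation styles described above can be illustrated with a minimal Python sketch. This is not the report's code: the function names `first_token_choice` and `generated_choice`, the example question, the logit values, and the sample generation are all hypothetical, chosen only to show how a first-token readout and a free-generation readout can disagree on the same question.

```python
import math
import re


def first_token_choice(first_token_logits, options=("A", "B", "C", "D")):
    """Pick the option letter with the highest first-token probability.

    `first_token_logits` maps each option letter to the (hypothetical)
    logit the model assigns to that letter as the first generated token.
    """
    logits = [first_token_logits[o] for o in options]
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    # Softmax restricted to the option letters only.
    probs = {o: e / total for o, e in zip(options, exps)}
    return max(probs, key=probs.get), probs


def generated_choice(generated_text, choices):
    """Map a free-form generation to an option by matching the choice text.

    Returns the first option whose text appears as a standalone number in
    the generation, or None if no option matches.
    """
    for letter, text in choices.items():
        pattern = r"(?<!\d)" + re.escape(text) + r"(?!\d)"
        if re.search(pattern, generated_text):
            return letter
    return None


# Hypothetical arithmetic MCQ: "What is 7 * 8?"
choices = {"A": "54", "B": "56", "C": "58", "D": "64"}
toy_logits = {"A": 2.1, "B": 1.7, "C": 0.3, "D": -0.5}  # invented values
toy_generation = "7 * 8 = 56"  # invented free-generation output

mcq_answer, _ = first_token_choice(toy_logits)
free_answer = generated_choice(toy_generation, choices)
print("first-token answer:", mcq_answer)
print("free-generation answer:", free_answer)
print("methods agree:", mcq_answer == free_answer)
```

In this toy case the first-token readout selects option A while the free generation corresponds to option B, which is the kind of misalignment the abstract refers to. With a real model, the logits would come from the next-token distribution at the answer position and the generation from ordinary decoding.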
Advisor: Jiantao Jiao
BibTeX citation:
@mastersthesis{Shao:EECS-2024-114,
    Author = {Shao, Justin},
    Title = {First Token Probabilities are Unreliable Indicators for LLM Knowledge},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-114.html},
    Number = {UCB/EECS-2024-114},
    Abstract = {Multiple Choice Questions (MCQs) are a prevalent evaluation method used across many popular LLM benchmarks. Typically, these evaluations rely on first-token probabilities to deduce the model’s proposed answer. However, previous studies have demonstrated that first-token probabilities are vulnerable to prompt perturbations. In this study, we broaden our examination to explore the performance disparity between direct free-generation and the assessment of MCQs using first-token probabilities. Our experiments confirm the unreliability of first-token probabilities, as they often do not align with generation results. Additionally, we uncover a surprising finding: LLMs tend to struggle with arithmetic MCQs, even though they can reliably generate the correct answers.},
}
EndNote citation:
%0 Thesis
%A Shao, Justin
%T First Token Probabilities are Unreliable Indicators for LLM Knowledge
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 16
%@ UCB/EECS-2024-114
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-114.html
%F Shao:EECS-2024-114