First Token Probabilities are Unreliable Indicators for LLM Knowledge

Justin Shao

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-114

May 16, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-114.pdf

Multiple Choice Questions (MCQs) are a prevalent evaluation method used across many popular LLM benchmarks. Typically, these evaluations rely on first-token probabilities to deduce the model’s proposed answer. However, previous studies have demonstrated that first-token probabilities are vulnerable to prompt perturbations. In this study, we broaden our examination to explore the performance disparity between direct free-generation and the assessment of MCQs using first-token probabilities. Our experiments confirm the unreliability of first-token probabilities, as they often do not align with generation results. Additionally, we uncover a surprising finding: LLMs tend to struggle with arithmetic MCQs, even though they can reliably generate the correct answers.
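To make the two evaluation modes in the abstract concrete, the following is a minimal sketch and not the report's actual evaluation code. The function names `first_token_choice` and `generated_choice`, the sample question, and the logit values are all hypothetical and chosen only for illustration. It assumes the first-token method renormalizes scores over the option labels and takes the most probable label, while the generation method matches the answer text in the model's free-form output.

```python
import math


def first_token_choice(logits, options=("A", "B", "C", "D")):
    """Pick the option label with the highest first-token probability.

    `logits` maps each option label to a raw score for the first generated
    token. Scores are renormalized with a softmax over the labels only.
    """
    scores = [logits[opt] for opt in options]
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    probs = {opt: e / total for opt, e in zip(options, exps)}
    return max(probs, key=probs.get), probs


def generated_choice(text, choices):
    """Map a free-form generation to an option label by matching answer text."""
    for label, answer in choices.items():
        if answer in text:
            return label
    return None  # the generation matched none of the listed answers


# Hypothetical arithmetic MCQ: "What is 17 + 26?"
choices = {"A": "41", "B": "43", "C": "45", "D": "33"}

# Made-up first-token logits and a made-up free generation (assumptions).
logits = {"A": 2.1, "B": 1.4, "C": 0.3, "D": -0.5}
generation = "17 + 26 = 43"

mcq_answer, probs = first_token_choice(logits)
free_answer = generated_choice(generation, choices)

print("first-token answer:", mcq_answer)
print("free-generation answer:", free_answer)
print("methods agree:", mcq_answer == free_answer)
```

With these invented numbers the first-token method selects A while the free generation corresponds to B, which is the kind of disagreement the abstract describes. Real evaluations would read the logits from a model and would need a more robust answer parser than substring matching.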

Advisor: Jiantao Jiao

BibTeX citation:

@mastersthesis{Shao:EECS-2024-114,
    Author= {Shao, Justin},
    Title= {First Token Probabilities are Unreliable Indicators for LLM Knowledge},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-114.html},
    Number= {UCB/EECS-2024-114},
    Abstract= {Multiple Choice Questions (MCQs) are a prevalent evaluation method used across many popular LLM benchmarks. Typically, these evaluations rely on first-token probabilities to deduce the model’s proposed answer. However, previous studies have demonstrated that first-token probabilities are vulnerable to prompt perturbations. In this study, we broaden our examination to explore the performance disparity between direct free-generation and the assessment of MCQs using first-token probabilities. Our experiments confirm the unreliability of first-token probabilities, as they often do not align with generation results. Additionally, we uncover a surprising finding: LLMs tend to struggle with arithmetic MCQs, even though they can reliably generate the correct answers.},
}

EndNote citation:

%0 Thesis
%A Shao, Justin
%T First Token Probabilities are Unreliable Indicators for LLM Knowledge
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 16
%@ UCB/EECS-2024-114
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-114.html
%F Shao:EECS-2024-114