Benchmarking LLMs on Advanced Mathematical Reasoning
Jonathan Yue and Daniel Klein
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-121
May 16, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-121.pdf
Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, existing short-answer benchmarks offer limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to translate problem statements into formal languages limits their scope. A likely reason for this gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework for judging model-generated proofs and evaluate its efficacy. We then propose a benchmark of 77 PhD-level proof questions, drawn from Roman Vershynin’s “High-Dimensional Probability: An Introduction with Applications in Data Science”, and challenge state-of-the-art LLMs with these questions. We evaluate the LLM-generated solutions using the LLM-as-a-judge framework and find that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.
Advisor: Daniel Klein
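As a rough illustration of the LLM-as-a-judge setup described in the abstract, the sketch below shows one way a judge model could be prompted with a problem statement, a reference solution, and a candidate proof, and asked for a verdict. This is a hypothetical sketch, not the report's actual prompts, rubric, or model interface; the `call_judge_model` helper is a placeholder to be wired to whatever LLM provider is used.

```python
# Hypothetical sketch of an LLM-as-a-judge grading step; not the authors' exact
# prompts or pipeline.

from dataclasses import dataclass


@dataclass
class ProofItem:
    problem: str    # problem statement (informal, LaTeX allowed)
    reference: str  # reference solution or grading notes for the judge
    candidate: str  # model-generated proof to be graded


JUDGE_PROMPT = """You are grading a mathematical proof.

Problem:
{problem}

Reference solution (for the grader only):
{reference}

Candidate proof:
{candidate}

Judge whether the candidate proof is correct and complete.
Respond with a single line "VERDICT: CORRECT" or "VERDICT: INCORRECT",
followed by a brief justification."""


def call_judge_model(prompt: str) -> str:
    """Placeholder for a call to the judge LLM via a provider SDK or local model."""
    raise NotImplementedError


def grade(item: ProofItem) -> bool:
    """Return True if the judge marks the candidate proof as correct."""
    reply = call_judge_model(JUDGE_PROMPT.format(
        problem=item.problem,
        reference=item.reference,
        candidate=item.candidate,
    ))
    return "VERDICT: CORRECT" in reply.upper()
```

In practice such a judge is itself evaluated (as the report does) by checking its verdicts against human grading on a sample of proofs before trusting it at scale.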
";
?>
BibTeX citation:
@mastersthesis{Yue:EECS-2025-121,
    Author = {Yue, Jonathan and Klein, Daniel},
    Title = {Benchmarking LLMs on Advanced Mathematical Reasoning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-121.html},
    Number = {UCB/EECS-2025-121},
    Abstract = {Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, existing short-answer benchmarks offer limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to translate problem statements into formal languages limits their scope. A likely reason for this gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework for judging model-generated proofs and evaluate its efficacy. We then propose a benchmark of 77 PhD-level proof questions, drawn from Roman Vershynin’s “High-Dimensional Probability: An Introduction with Applications in Data Science”, and challenge state-of-the-art LLMs with these questions. We evaluate the LLM-generated solutions using the LLM-as-a-judge framework and find that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.}
}
EndNote citation:
%0 Thesis
%A Yue, Jonathan
%A Klein, Daniel
%T Benchmarking LLMs on Advanced Mathematical Reasoning
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 16
%@ UCB/EECS-2025-121
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-121.html
%F Yue:EECS-2025-121