Benchmarking LLMs on Advanced Mathematical Reasoning

Jonathan Yue and Daniel Klein

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-121
May 16, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-121.pdf

Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, existing short-answer-based benchmarks offer limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to convert problem statements into formal languages limits their scope. A likely reason for this gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework for judging model-generated proofs and evaluate its efficacy. We then propose a benchmark of 77 PhD-level proof questions drawn from Roman Vershynin’s “High-Dimensional Probability: An Introduction with Applications in Data Science” and challenge state-of-the-art LLMs with these questions. We evaluate the LLM-generated solutions using the LLM-as-a-judge framework and find that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.
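
As a rough illustration of how an LLM-as-a-judge grading loop of this kind might be wired up (a minimal sketch, not the report's actual implementation), the snippet below asks a judge model to score a candidate proof against its problem statement. It assumes an OpenAI-compatible chat API; the judge prompt, model name, and 0-10 score scale are illustrative assumptions, not details taken from the report.

```python
"""Hypothetical LLM-as-a-judge sketch for grading model-generated proofs.
Assumes an OpenAI-compatible chat API (OPENAI_API_KEY in the environment);
prompt wording, model name, and score scale are placeholders."""

from openai import OpenAI

client = OpenAI()

# Illustrative rubric prompt; the report's actual grading rubric may differ.
JUDGE_PROMPT = """You are grading a mathematical proof.

Problem:
{problem}

Candidate proof:
{proof}

Grade the proof on a 0-10 scale for correctness and completeness, then
justify the grade. Respond in the form:
SCORE: <integer>
REASON: <explanation>"""


def judge_proof(problem: str, proof: str, model: str = "gpt-4o") -> tuple[int, str]:
    """Return (score, justification) from the judge model for one proof attempt."""
    response = client.chat.completions.create(
        model=model,
        messages=[{"role": "user",
                   "content": JUDGE_PROMPT.format(problem=problem, proof=proof)}],
        temperature=0,  # deterministic grading for reproducibility
    )
    text = response.choices[0].message.content
    # Naive parse of the expected "SCORE: ... REASON: ..." layout.
    score_part, _, reason = text.partition("REASON:")
    score = int(score_part.replace("SCORE:", "").strip())
    return score, reason.strip()


if __name__ == "__main__":
    score, reason = judge_proof(
        problem="Show that a finite sum of independent sub-gaussian random "
                "variables is sub-gaussian.",
        proof="(model-generated proof text here)",
    )
    print(score, reason)
```

In practice one would run each benchmark question through the candidate LLMs, pass every generated proof through a judge call like the one above, and aggregate the scores per model.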

Advisor: Daniel Klein

\"Edit"; ?>


BibTeX citation:

@mastersthesis{Yue:EECS-2025-121,
    Author = {Yue, Jonathan and Klein, Daniel},
    Title = {Benchmarking LLMs on Advanced Mathematical Reasoning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-121.html},
    Number = {UCB/EECS-2025-121},
    Abstract = {Large Language Models (LLMs) have improved dramatically at mathematical reasoning, progressing from basic arithmetic to olympiad-level proofs. However, existing short-answer-based benchmarks offer limited scope for complex reasoning and therefore do not sufficiently measure the reasoning capabilities of LLMs. Formal proof-based benchmarks exist, but the need to convert problem statements into formal languages limits their scope. A likely reason for this gap in the current literature is the difficulty of grading proof problems at scale. To address this, we first propose an LLM-as-a-judge framework for judging model-generated proofs and evaluate its efficacy. We then propose a benchmark of 77 PhD-level proof questions drawn from Roman Vershynin’s “High-Dimensional Probability: An Introduction with Applications in Data Science” and challenge state-of-the-art LLMs with these questions. We evaluate the LLM-generated solutions using the LLM-as-a-judge framework and find that, in general, state-of-the-art LLMs are still unable to adequately complete these proofs.}
}

EndNote citation:

%0 Thesis
%A Yue, Jonathan
%A Klein, Daniel
%T Benchmarking LLMs on Advanced Mathematical Reasoning
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 16
%@ UCB/EECS-2025-121
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-121.html
%F Yue:EECS-2025-121