A Function Calling Perspective on Scalable Large Language Model Agent Evaluation
Fanjia Yan
EECS Department, University of California, Berkeley
Technical Report No. UCB/
December 1, 2025
This work introduces the Berkeley Function Calling Leaderboard (BFCL), a large-scale, multi-task, multi-turn benchmark that distinguishes function-calling capabilities among LLMs by evaluating their ability to invoke the correct function calls. BFCL is composed of three parts: (1) a BFCL-Fundamental dataset that tests single-turn function-calling scenarios, including parallel function invocations and multiple function candidates; (2) a BFCL-Live dataset comprising more than 67k real-life function-calling examples curated through community contributions; and (3) a BFCL-Agent dataset featuring eight curated API suites and 1k queries, assessing sustained context management and dynamic decision-making.
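To make the single-turn setting concrete, the sketch below shows what a parallel function-calling test entry could look like. The field names, the get_weather schema, and the ground-truth layout are illustrative assumptions for exposition, not BFCL's exact data format.

    # Illustrative single-turn test case with parallel invocation (hypothetical layout).
    example = {
        "question": "What is the weather in Berkeley and in Palo Alto right now?",
        "functions": [
            {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
        # Parallel ground truth: the model is expected to emit two calls in one turn,
        # each parameter listed with its set of acceptable values.
        "ground_truth": [
            {"get_weather": {"city": ["Berkeley", "Berkeley, CA"]}},
            {"get_weather": {"city": ["Palo Alto", "Palo Alto, CA"]}},
        ],
    }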
Evaluating LLMs' function-invocation capabilities poses unique challenges because deterministic validation typically requires executing the corresponding functions, which complicates large-scale evaluation. BFCL overcomes this by introducing a novel validation strategy that obviates the need for function execution. Drawing inspiration from the programming-language literature, we employ Abstract Syntax Tree (AST) sub-string matching as a proxy for actual function execution, thereby enabling scalable evaluation. To validate this approach, we evaluate models on a subset of our dataset using execution-based checking and observe a strong correlation between BFCL's execution and AST metrics.
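As a minimal sketch of the idea behind execution-free validation (not BFCL's actual implementation), a model's call string can be parsed into an AST and its function name and arguments compared against a ground-truth parameter specification. The check_call helper, the spec format, and the get_weather example below are assumptions for illustration.

    import ast

    def check_call(model_output: str, expected: dict) -> bool:
        """Validate a model-produced call string against a ground-truth spec
        without executing the underlying function. `expected` maps a function
        name to {parameter: list of acceptable values}."""
        try:
            tree = ast.parse(model_output.strip(), mode="eval")
        except SyntaxError:
            return False  # output is not even parseable as a call
        call = tree.body
        if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
            return False
        if call.func.id not in expected:
            return False  # wrong function selected
        spec = expected[call.func.id]
        try:
            produced = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        except ValueError:
            return False  # non-literal argument values are rejected
        # Every required parameter must be present with an acceptable value.
        return all(p in produced and produced[p] in ok for p, ok in spec.items())

    ground_truth = {"get_weather": {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius"]}}
    print(check_call('get_weather(city="Berkeley", unit="celsius")', ground_truth))  # True
    print(check_call('get_weather(city="Berkeley", unit="kelvin")', ground_truth))   # False

Because the check operates on the parsed call rather than its result, it scales to thousands of test cases without requiring live API access.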
Advisor: Joseph Gonzalez
BibTeX citation:
@mastersthesis{Yan:31680,
    Author = {Yan, Fanjia},
    Title = {A Function Calling Perspective on Scalable Large Language Model Agent Evaluation},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Number = {UCB/},
    Abstract = {This work introduces the Berkeley Function Calling Leaderboard (BFCL), a large-scale, multi-task, multi-turn benchmark that distinguishes function-calling capabilities among LLMs by evaluating their ability to invoke the correct function calls. BFCL is composed of three parts: (1) a BFCL-Fundamental dataset that tests single-turn function-calling scenarios, including parallel function invocations and multiple function candidates; (2) a BFCL-Live dataset comprising more than 67k real-life function-calling examples curated through community contributions; and (3) a BFCL-Agent dataset featuring eight curated API suites and 1k queries, assessing sustained context management and dynamic decision-making. Evaluating LLMs' function-invocation capabilities poses unique challenges because deterministic validation typically requires executing the corresponding functions, which complicates large-scale evaluation. BFCL overcomes this by introducing a novel validation strategy that obviates the need for function execution. Drawing inspiration from the programming-language literature, we employ Abstract Syntax Tree (AST) sub-string matching as a proxy for actual function execution, thereby enabling scalable evaluation. To validate this approach, we evaluate models on a subset of our dataset using execution-based checking and observe a strong correlation between BFCL's execution and AST metrics.},
}
EndNote citation:
%0 Thesis %A Yan, Fanjia %T A Function Calling Perspective on Scalable Large Language Model Agent Evaluation %I EECS Department, University of California, Berkeley %D 2025 %8 December 1 %@ UCB/ %F Yan:31680