A Function Calling Perspective on Scalable Large Language Model Agent Evaluation
Fanjia Yan
EECS Department, University of California, Berkeley
Technical Report No. UCB/
December 1, 2025
This work introduces the Berkeley Function Calling Leaderboard (BFCL), a large-scale, multi-task, multi-turn benchmark that distinguishes function-calling capabilities among LLMs by evaluating their ability to invoke the correct function calls. BFCL is composed of three parts: (1) a BFCL-Fundamental dataset that tests single-turn function-calling scenarios, including parallel function invocations and multiple function candidates; (2) a BFCL-Live dataset comprising more than 67k real-life function-calling examples curated through community contributions; and (3) a BFCL-Agent dataset featuring eight curated API suites and 1k queries, assessing sustained context management and dynamic decision-making.
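To make the single-turn setting concrete, the sketch below shows what a parallel function-calling test entry could look like. The field names, the get_weather schema, and the ground-truth layout are illustrative assumptions for exposition, not BFCL's exact data format.

    # Illustrative single-turn test case with parallel invocation (hypothetical layout).
    example = {
        "question": "What is the weather in Berkeley and in Palo Alto right now?",
        "functions": [
            {
                "name": "get_weather",
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            }
        ],
        # Parallel ground truth: the model is expected to emit two calls in one turn,
        # each parameter listed with its set of acceptable values.
        "ground_truth": [
            {"get_weather": {"city": ["Berkeley", "Berkeley, CA"]}},
            {"get_weather": {"city": ["Palo Alto", "Palo Alto, CA"]}},
        ],
    }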
Evaluating LLMs' function-invocation capabilities poses unique challenges because deterministic validation typically requires executing the corresponding functions, which complicates large-scale evaluation. BFCL overcomes this by introducing a novel validation strategy that obviates the need for function execution. Drawing inspiration from the programming-language literature, we employ Abstract Syntax Tree (AST) sub-string matching as a proxy for actual function execution, thereby enabling scalable evaluation. To validate this approach, we evaluate models on a subset of our dataset using execution-based checking and observe a strong correlation between BFCL's execution and AST metrics.
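As a minimal sketch of the idea behind execution-free validation (not BFCL's actual implementation), a model's call string can be parsed into an AST and its function name and arguments compared against a ground-truth parameter specification. The check_call helper, the spec format, and the get_weather example below are assumptions for illustration.

    import ast

    def check_call(model_output: str, expected: dict) -> bool:
        """Validate a model-produced call string against a ground-truth spec
        without executing the underlying function. `expected` maps a function
        name to {parameter: list of acceptable values}."""
        try:
            tree = ast.parse(model_output.strip(), mode="eval")
        except SyntaxError:
            return False  # output is not even parseable as a call
        call = tree.body
        if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
            return False
        if call.func.id not in expected:
            return False  # wrong function selected
        spec = expected[call.func.id]
        try:
            produced = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords}
        except ValueError:
            return False  # non-literal argument values are rejected
        # Every required parameter must be present with an acceptable value.
        return all(p in produced and produced[p] in ok for p, ok in spec.items())

    ground_truth = {"get_weather": {"city": ["Berkeley", "Berkeley, CA"], "unit": ["celsius"]}}
    print(check_call('get_weather(city="Berkeley", unit="celsius")', ground_truth))  # True
    print(check_call('get_weather(city="Berkeley", unit="kelvin")', ground_truth))   # False

Because the check operates on the parsed call rather than its result, it scales to thousands of test cases without requiring live API access.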
Advisor: Joseph Gonzalez
BibTeX citation:
@mastersthesis{Yan:31680,
    Author = {Yan, Fanjia},
    Title = {A Function Calling Perspective on Scalable Large Language Model Agent Evaluation},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Number = {UCB/},
    Abstract = {This work introduces the Berkeley Function Calling Leaderboard (BFCL), a large-scale, multi-task, multi-turn benchmark that distinguishes function-calling capabilities among LLMs by evaluating their ability to invoke the correct function calls. BFCL is composed of three parts: (1) a BFCL-Fundamental dataset that tests single-turn function-calling scenarios, including parallel function invocations and multiple function candidates; (2) a BFCL-Live dataset comprising more than 67k real-life function-calling examples curated through community contributions; and (3) a BFCL-Agent dataset featuring eight curated API suites and 1k queries, assessing sustained context management and dynamic decision-making. Evaluating LLMs' function-invocation capabilities poses unique challenges because deterministic validation typically requires executing the corresponding functions, which complicates large-scale evaluation. BFCL overcomes this by introducing a novel validation strategy that obviates the need for function execution. Drawing inspiration from the programming-language literature, we employ Abstract Syntax Tree (AST) sub-string matching as a proxy for actual function execution, thereby enabling scalable evaluation. To validate this approach, we evaluate models on a subset of our dataset using execution-based checking and observe a strong correlation between BFCL's execution and AST metrics.},
}
EndNote citation:
%0 Thesis %A Yan, Fanjia %T A Function Calling Perspective on Scalable Large Language Model Agent Evaluation %I EECS Department, University of California, Berkeley %D 2025 %8 December 1 %@ UCB/ %F Yan:31680