Measuring General Intelligence with Generated Games

Vivek Verma, David Huang, William Chen, Daniel Klein and Nicholas Tomlin

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-60
May 14, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-60.pdf

We present gg-bench, a collection of game environments designed to evaluate general reasoning capabilities in language models. Unlike most static benchmarks, gg-bench is a data-generating process in which new evaluation instances can be generated at will. In particular, gg-bench is synthetically generated by (1) using a large language model (LLM) to generate natural language descriptions of novel games, (2) using the LLM to implement each game in code as a Gym environment, and (3) training reinforcement learning (RL) agents via self-play on the generated games. We evaluate language models by their win rate against these RL agents: models are prompted with the game description, the current board state, and a list of valid moves, and then output the move they wish to take. gg-bench is challenging: state-of-the-art LLMs such as GPT-4o and Claude 3.7 Sonnet achieve win rates of 7-9% on gg-bench using in-context learning, while reasoning models such as o1, o3-mini, and DeepSeek-R1 achieve average win rates of 31-36%. We release the generated games, the data generation process, and the evaluation code to support future modeling work and expansion of our benchmark.
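
To make the evaluation protocol concrete, below is a minimal Python sketch of one LLM-versus-RL-agent episode. Everything in it is a hypothetical stand-in, not the benchmark's actual implementation: RaceToTenEnv is a trivial invented game in place of a generated Gym environment, rl_agent_move plays randomly in place of a self-play-trained agent, and llm_move picks greedily in place of a real model API call. It only fixes the shape of the loop: prompt with description, state, and valid moves; apply the chosen move; score the win.

import random
from typing import List, Tuple

class RaceToTenEnv:
    """Hypothetical stand-in for a generated Gym environment:
    players alternately add 1, 2, or 3 to a running total, and
    whoever reaches 10 first wins."""

    def reset(self) -> int:
        self.total = 0
        return self.total

    def valid_moves(self) -> List[int]:
        return [1, 2, 3]

    def step(self, move: int) -> Tuple[int, bool]:
        self.total += move
        return self.total, self.total >= 10  # (new state, game over?)

def rl_agent_move(env: RaceToTenEnv) -> int:
    # Stand-in for the self-play-trained RL opponent.
    return random.choice(env.valid_moves())

def llm_move(description: str, state: int, moves: List[int]) -> int:
    # Stand-in for prompting an LLM with the game description,
    # current board state, and valid moves, then parsing its reply.
    return max(moves)

def play_episode(description: str) -> bool:
    """Play one game with the LLM moving first; return True on an LLM win."""
    env = RaceToTenEnv()
    state = env.reset()
    llm_turn = True
    while True:
        moves = env.valid_moves()
        move = (llm_move(description, state, moves)
                if llm_turn else rl_agent_move(env))
        state, done = env.step(move)
        if done:
            return llm_turn  # the player who just moved wins
        llm_turn = not llm_turn

if __name__ == "__main__":
    desc = "Race to 10: on your turn add 1, 2, or 3; first to reach 10 wins."
    games = 1000
    wins = sum(play_episode(desc) for _ in range(games))
    print(f"win rate vs. stand-in opponent: {wins / games:.1%}")

In the benchmark itself, this win rate would be averaged over many generated games and episodes rather than a single toy environment.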

Advisor: Daniel Klein

\"Edit"; ?>


BibTeX citation:

@mastersthesis{Verma:EECS-2025-60,
    Author = {Verma, Vivek and Huang, David and Chen, William and Klein, Daniel and Tomlin, Nicholas},
    Title = {Measuring General Intelligence with Generated Games},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-60.html},
    Number = {UCB/EECS-2025-60},
}

EndNote citation:

%0 Thesis
%A Verma, Vivek
%A Huang, David
%A Chen, William
%A Klein, Daniel
%A Tomlin, Nicholas
%T Measuring General Intelligence with Generated Games
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 14
%@ UCB/EECS-2025-60
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-60.html
%F Verma:EECS-2025-60