Manish Shetty

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2026-52

May 4, 2026

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2026/EECS-2026-52.pdf

Software is one of the most leveraged forms of human labor: it underlies nearly every modern system, and writing, modifying, and maintaining it occupies a substantial fraction of skilled work worldwide. AI systems have begun to perform meaningful parts of this work, and the question of how capable they really are is no longer hypothetical. Software is also uniquely well-suited as a substrate for AI agents: actions are cheap to execute, consequences are immediate, and the surrounding tooling (compilers, test runners, type checkers, profilers) produces precise feedback. This feedback loop rests on two abstractions: environments, which let agents execute code in realistic settings, and verifiers, which assess the correctness and quality of their outputs.

This thesis finds that environments and verifiers are themselves a primary axis of capability for AI in software engineering, and that scaling them reveals what AI can and cannot do. First, executable environments can be constructed at scale from real source repositories through procedural extraction and synthetic problem generation, providing a training and evaluation surface that earlier benchmarks lacked. Second, dense verifiers (compilers, test suites, program analyzers) mined automatically from a program's behavior can drive models to results unreachable in single-shot generation, including a memory-safe Rust translation of a large C library validated against millions of inputs. Finally, the abstraction generalizes beyond source code: live cloud operations fit the same template, with microservice clusters as environments and incident outcomes as verifiers. Yet on the hardest coding tasks, even strong environments and verifiers do not close the gap; failures cluster into characteristic behavioral patterns (reward hacking, avoidance of low-level code, input-specific fast paths) that aggregate accuracy cannot surface.

In defining the abstractions that underpin the growth in AI-driven software engineering, these contributions show that environments and verifiers determine what agents can learn, what problems they can solve, and where systematic behavioral failures persist even at the frontier.

Advisor: Koushik Sen


BibTeX citation:

@phdthesis{Shetty:EECS-2026-52,
    Author= {Shetty, Manish},
    Title= {Scaling Environments and Verifiers for Software Engineering Agents},
    School= {EECS Department, University of California, Berkeley},
    Year= {2026},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2026/EECS-2026-52.html},
    Number= {UCB/EECS-2026-52},
    Abstract= {Software is one of the most leveraged forms of human labor: it underlies nearly every modern system, and writing, modifying, and maintaining it occupies a substantial fraction of skilled work worldwide. AI systems have begun to perform meaningful parts of this work, and the question of how capable they really are is no longer hypothetical. Software is also uniquely well-suited as a substrate for AI agents: actions are cheap to execute, consequences are immediate, and the surrounding tooling (compilers, test runners, type checkers, profilers) produces precise feedback. This feedback loop rests on two abstractions: environments, which let agents execute code in realistic settings, and verifiers, which assess the correctness and quality of their outputs.

This thesis finds that environments and verifiers are themselves a primary axis of capability for AI in software engineering, and that scaling them reveals what AI can and cannot do. First, executable environments can be constructed at scale from real source repositories through procedural extraction and synthetic problem generation, providing a training and evaluation surface that earlier benchmarks lacked. Second, dense verifiers (compilers, test suites, program analyzers) mined automatically from a program's behavior can drive models to results unreachable in single-shot generation, including a memory-safe Rust translation of a large C library validated against millions of inputs. Finally, the abstraction generalizes beyond source code: live cloud operations fit the same template, with microservice clusters as environments and incident outcomes as verifiers. Yet on the hardest coding tasks, even strong environments and verifiers do not close the gap; failures cluster into characteristic behavioral patterns (reward hacking, avoidance of low-level code, input-specific fast paths) that aggregate accuracy cannot surface.

In defining the abstractions that underpin the growth in AI-driven software engineering, these contributions show that environments and verifiers determine what agents can learn, what problems they can solve, and where systematic behavioral failures persist even at the frontier.},
}

EndNote citation:

%0 Thesis
%A Shetty, Manish
%T Scaling Environments and Verifiers for Software Engineering Agents
%I EECS Department, University of California, Berkeley
%D 2026
%8 May 4
%@ UCB/EECS-2026-52
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2026/EECS-2026-52.html
%F Shetty:EECS-2026-52