Roy Huang

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-123

May 19, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-123.pdf

Reinforcement learning (RL) has become a primary technique for aligning Large Language Models (LLMs) with complex reasoning objectives, yet convergence is fragile when reward signals are noisy or exploitable. This thesis presents rLLM, an open-source, Ray-based RL framework that combines an improved Group Relative Policy Optimization algorithm (GRPO+) with a veRL backend modified for asynchronous pipelined sampling and iterative context lengthening. Using rLLM we trained DeepCoder-14B, a 14-billion-parameter code-reasoning model that attains 60.6% Pass@1 on LiveCodeBench, a 1936 Codeforces rating, and 92.6% Pass@1 on HumanEval+, matching OpenAI’s proprietary o3-mini (low) and o1 on these benchmarks.
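
For readers new to group-relative methods, the minimal sketch below (a simplification of vanilla GRPO, not the GRPO+ variant developed in the thesis) shows the core idea of critic-free advantage estimation: rewards for a group of rollouts sampled from the same prompt are normalized against the group's own mean and standard deviation. The function and variable names are illustrative.

import numpy as np

def group_relative_advantages(rewards, eps=1e-8):
    """Critic-free advantage estimation in the GRPO style.

    Each rollout in a group sampled from the same prompt is scored
    relative to the group's mean reward, scaled by the group's standard
    deviation (eps avoids division by zero when all rewards are equal).
    """
    rewards = np.asarray(rewards, dtype=np.float64)
    return (rewards - rewards.mean()) / (rewards.std() + eps)

# Example: four rollouts for one coding prompt, rewarded 1.0 when all tests pass.
print(group_relative_advantages([1.0, 0.0, 0.0, 1.0]))  # approx. [ 1. -1. -1.  1.]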

We show that such performance hinges on an airtight sandboxed execution environment that safeguards reward integrity. To that end we take inspiration from GoEx, a post-facto-validated runtime that envelops every REST call, database mutation, and file operation in deterministic undo and blast-radius-bounded confinement semantics. rLLM consumes these airtight environments directly to compute rewards, eliminating reward hacking.
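
As a rough illustration of reward computation inside an isolated environment, the sketch below scores a candidate program by running it together with hidden unit tests in a throwaway subprocess. This is a minimal sketch only: a production sandbox in the GoEx spirit would add container-level confinement, resource limits, and undo semantics, and the function name and timeout value here are assumptions.

import os
import subprocess
import tempfile

def unit_test_reward(candidate_code: str, test_code: str, timeout_s: float = 5.0) -> float:
    """Score a candidate solution by executing it against hidden tests.

    The program runs in an isolated interpreter inside a temporary directory;
    only a clean exit before the timeout earns reward, so failures, crashes,
    and hangs all map to 0.0 and the policy cannot game the signal.
    """
    program = candidate_code + "\n\n" + test_code
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "candidate.py")
        with open(path, "w") as f:
            f.write(program)
        try:
            result = subprocess.run(
                ["python3", "-I", path],  # -I: isolated mode, ignore user site/env hooks
                cwd=tmp,                  # run inside the temp dir so relative writes stay there
                capture_output=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return 0.0
        return 1.0 if result.returncode == 0 else 0.0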

The findings show that the proposed GRPO+ modification significantly improves training convergence over widely adopted baselines such as GRPO and DAPO, and that the asynchronous pipelining mechanism added to veRL substantially improves training throughput and scalability. By integrating these advances within a secure execution environment, this thesis delivers a comprehensive RL framework that reliably aligns LLMs with sophisticated reasoning objectives and paves the way for future research into robust and scalable reinforcement learning systems.
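
The asynchronous pipelining idea can be pictured with a small Ray sketch. This illustrates the overlap pattern only, not veRL's actual implementation; the Sampler actor and train_step function are placeholders. Rollouts for step N+1 are generated while the trainer consumes step N.

import ray

ray.init(ignore_reinit_error=True)

@ray.remote
class Sampler:
    """Placeholder rollout worker; in a real system this wraps the inference engine."""
    def generate_batch(self, step: int) -> dict:
        # ... sample completions for a batch of prompts ...
        return {"step": step, "rollouts": f"rollouts-for-step-{step}"}

def train_step(batch: dict) -> None:
    # ... compute the policy-gradient loss and update the policy ...
    print("trained on", batch["rollouts"])

sampler = Sampler.remote()
pending = sampler.generate_batch.remote(0)              # launch sampling for step 0
for step in range(1, 4):
    next_pending = sampler.generate_batch.remote(step)  # sampling for step N+1 ...
    train_step(ray.get(pending))                        # ... overlaps training on step N
    pending = next_pending
train_step(ray.get(pending))
ray.shutdown()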

Advisor: Joseph Gonzalez


BibTeX citation:

@mastersthesis{Huang:EECS-2025-123,
    Author= {Huang, Roy},
    Title= {Reinforcement Learning for Safe LLM Code Generation},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-123.html},
    Number= {UCB/EECS-2025-123},
    Abstract= {Reinforcement learning (RL) has become a primary technique for aligning Large Language Models (LLMs) with complex reasoning objectives, yet convergence is fragile when reward signals are noisy or exploitable. This thesis presents rLLM, an open-source, Ray-based RL framework that combines an improved Group Relative Policy Optimization algorithm (GRPO+) with a veRL backend modified for asynchronous pipelined sampling and iterative context lengthening. Using rLLM we trained DeepCoder-14B, a 14-billion-parameter code-reasoning model that attains 60.6% Pass@1 on LiveCodeBench, a 1936 Codeforces rating, and 92.6% Pass@1 on HumanEval+, matching OpenAI’s proprietary o3-mini (low) and o1 on these benchmarks.

We show that such performance hinges on an airtight sandboxed execution environment that safeguards reward integrity. To that end we take inspiration from GoEx, a post-facto-validated runtime that envelops every REST call, database mutation, and file operation in deterministic undo and blast-radius-bounded confinement semantics. rLLM consumes these airtight environments directly to compute rewards, eliminating reward hacking.

The findings show that the proposed GRPO+ modification significantly improves training convergence over widely adopted baselines such as GRPO and DAPO, and that the asynchronous pipelining mechanism added to veRL substantially improves training throughput and scalability. By integrating these advances within a secure execution environment, this thesis delivers a comprehensive RL framework that reliably aligns LLMs with sophisticated reasoning objectives and paves the way for future research into robust and scalable reinforcement learning systems.},
}

EndNote citation:

%0 Thesis
%A Huang, Roy 
%T Reinforcement Learning for Safe LLM Code Generation
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 19
%@ UCB/EECS-2025-123
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-123.html
%F Huang:EECS-2025-123