Michael Luo

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-203

December 18, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-203.pdf

Large Language Models (LLMs) have achieved remarkable capabilities across diverse and complex tasks, yet a fundamental decoupling persists between frontier models and the thousands of downstream AI applications with domain-specific needs. This division limits both performance and efficiency: models are trained and served without knowledge of downstream applications. In this dissertation, we design systems and algorithms that couple the model and application layers together—advancing application-aware infrastructure and models.

We first present three systems for application-aware infrastructure that optimize cost and performance. Agentix is a serving engine that introduces application-aware scheduling for AI agents. By tracking dependencies between LLM calls and leveraging application-level statistics, Agentix improves end-to-end response times by 4-15× over state-of-the-art systems such as vLLM, thereby reducing serving costs. Next, Stylus is a scalable model router that retrieves and composes the best adapters from a pool of over 100K LoRAs, improving performance over base Stable Diffusion. Finally, Starburst is a cost-aware scheduler for hybrid cloud ML infrastructure that dynamically allocates waiting time based on job characteristics, reducing cloud costs by up to 91% while maintaining minimal job completion times.
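The dependency-aware scheduling idea behind Agentix can be illustrated with a minimal sketch. Agentix's actual design is not detailed here; the class and function names below (`Call`, `critical_path`, `schedule`) are hypothetical. The sketch prioritizes ready LLM calls by the length of the longest chain of calls still depending on them, so the agent's end-to-end response is not blocked waiting on the critical path.

```python
# Hedged, illustrative sketch of dependency-aware scheduling for agent
# LLM calls. All names are hypothetical, not Agentix's API.

class Call:
    """One LLM call in an agent's dependency DAG."""
    def __init__(self, name, children=()):
        self.name = name
        self.children = list(children)  # calls that depend on this one

def critical_path(call, memo=None):
    """Length of the longest chain of calls that still depends on `call`."""
    if memo is None:
        memo = {}
    if call.name not in memo:
        memo[call.name] = 1 + max(
            (critical_path(c, memo) for c in call.children), default=0
        )
    return memo[call.name]

def schedule(ready_calls):
    """Serve ready calls with the longest remaining chain first."""
    return sorted(ready_calls, key=critical_path, reverse=True)

# Example agent DAG: summarize -> answer forms a 2-call chain,
# while search has no downstream dependents.
answer = Call("answer")
summarize = Call("summarize", [answer])
search = Call("search")
order = schedule([search, summarize])
print([c.name for c in order])  # → ['summarize', 'search']
```

A production scheduler would additionally fold in application-level statistics (e.g., expected decode lengths) rather than chain length alone, but the priority-by-dependency principle is the same.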

We then demonstrate that coupling models with applications via reinforcement learning (RL) unlocks both higher performance and lower cost. Through the Agentica Project, we show that small models trained with application-specific RL can match frontier models at a fraction of the cost. DeepScaleR is a 1.5B model that surpasses o1-preview on mathematical reasoning with only 3,800 GPU hours, an 18× cost reduction over prior approaches. DeepCoder achieves o3-mini-level performance on competitive programming with a 14B model. Finally, DeepSWE is a state-of-the-art 32B autonomous coding agent that beats all prior open-source agents by over 12 percentage points.

Overall, these contributions democratize access to frontier-level AI capabilities for application-builders and organizations alike.

Advisor: Ion Stoica


BibTeX citation:

@phdthesis{Luo:EECS-2025-203,
    Author= {Luo, Michael},
    Title= {From Serving to Training: Efficient Systems for LLM Agents at Scale},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-203.html},
    Number= {UCB/EECS-2025-203},
    Abstract= {Large Language Models (LLMs) have achieved remarkable capabilities across diverse and complex tasks, yet a fundamental decoupling persists between frontier models and the thousands of downstream AI applications with domain-specific needs. This division limits both performance and efficiency: models are trained and served without knowledge of downstream applications. In this dissertation, we design systems and algorithms that couple the model and application layers together—advancing application-aware infrastructure and models.

We first present three systems for application-aware infrastructure that optimize cost and performance. Agentix is a serving engine that introduces application-aware scheduling for AI agents. By tracking dependencies between LLM calls and leveraging application-level statistics, Agentix improves end-to-end response times by 4-15× over state-of-the-art systems such as vLLM, thereby reducing serving costs. Next, Stylus is a scalable model router that retrieves and composes the best adapters from a pool of over 100K LoRAs, improving performance over base Stable Diffusion. Finally, Starburst is a cost-aware scheduler for hybrid cloud ML infrastructure that dynamically allocates waiting time based on job characteristics, reducing cloud costs by up to 91% while maintaining minimal job completion times.

We then demonstrate that coupling models with applications via reinforcement learning (RL) unlocks both higher performance and lower cost. Through the Agentica Project, we show that small models trained with application-specific RL can match frontier models at a fraction of the cost. DeepScaleR is a 1.5B model that surpasses o1-preview on mathematical reasoning with only 3,800 GPU hours, an 18× cost reduction over prior approaches. DeepCoder achieves o3-mini-level performance on competitive programming with a 14B model. Finally, DeepSWE is a state-of-the-art 32B autonomous coding agent that beats all prior open-source agents by over 12 percentage points.

Overall, these contributions democratize access to frontier-level AI capabilities for application-builders and organizations alike.},
}

EndNote citation:

%0 Thesis
%A Luo, Michael 
%T From Serving to Training: Efficient Systems for LLM Agents at Scale
%I EECS Department, University of California, Berkeley
%D 2025
%8 December 18
%@ UCB/EECS-2025-203
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-203.html
%F Luo:EECS-2025-203