Designing LLM based agents to interact with the embodied world

Dylan Goetting

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2025-59

May 14, 2025

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-59.pdf

Large Language Models (LLMs) have seen rapid advancements across different modalities, yet they remain mostly isolated from physical environments. Meanwhile, robotics research continues to face challenges in generalization and scalability, limited by costly and narrow data collection processes. In this work, we study methods to bridge the gap between LLMs and physical robotic systems through structured observation and action interfaces. We first introduce VLMnav, a novel framework that transforms a Vision-Language Model (VLM) into an end-to-end navigation policy, allowing it to select low-level actions directly from visual input without fine-tuning. We evaluate its navigation capabilities on multiple benchmarks and perform a detailed design analysis. Building on this, we extend the approach to a more complex manipulation setting, where the agent calls a Vision-Language-Action (VLA) model to handle fine-grained control. We analyze both task performance and design factors and show how the agent can most effectively utilize the capabilities of the VLA.
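As a rough illustration of the kind of interface the abstract describes, the sketch below shows a closed-loop agent that queries a VLM for one discrete action per visual observation, with no fine-tuning involved. The action names, the query_vlm helper, and the env interface are hypothetical placeholders for this sketch, not the actual VLMnav prompting or action-space design documented in the report.

```python
# Minimal sketch of a VLM-as-navigation-policy loop (illustrative only).
# `query_vlm`, `env.get_rgb_observation`, `env.step`, and the action set
# are hypothetical stand-ins, not the VLMnav interface from the report.

ACTIONS = ["move_forward", "turn_left", "turn_right", "stop"]


def query_vlm(image, prompt: str) -> str:
    """Placeholder for a call to an off-the-shelf Vision-Language Model."""
    raise NotImplementedError


def navigation_step(image, goal: str) -> str:
    """Ask the VLM to pick one low-level action toward the goal."""
    prompt = (
        f"You are a navigation agent. Goal: {goal}. "
        f"Choose exactly one action from {ACTIONS} based on the image."
    )
    response = query_vlm(image, prompt).strip()
    # Fall back to a safe default if the VLM output is not a valid action.
    return response if response in ACTIONS else "stop"


def run_episode(env, goal: str, max_steps: int = 100) -> None:
    """Closed-loop control: observe, query the VLM, act, repeat."""
    for _ in range(max_steps):
        action = navigation_step(env.get_rgb_observation(), goal)
        if action == "stop":
            break
        env.step(action)
```

In the manipulation setting mentioned in the abstract, a step of this loop would instead delegate fine-grained control to a VLA model rather than executing a discrete action directly; that delegation is only gestured at here.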

Advisor: Jitendra Malik


BibTeX citation:

@mastersthesis{Goetting:EECS-2025-59,
    Author= {Goetting, Dylan},
    Title= {Designing LLM based agents to interact with the embodied world},
    School= {EECS Department, University of California, Berkeley},
    Year= {2025},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-59.html},
    Number= {UCB/EECS-2025-59},
    Abstract= {Large Language Models (LLMs) have seen rapid advancements across different modalities, yet they remain mostly isolated from physical environments. Meanwhile, robotics research continues to face challenges in generalization and scalability, limited by costly and narrow data collection processes. In this work, we study methods to bridge the gap between LLMs and physical robotic systems through structured observation and action interfaces. We first introduce VLMnav, a novel framework that transforms a Vision-Language Model (VLM) into an end-to-end navigation policy, allowing it to select low-level actions directly from visual input without fine-tuning. We evaluate its navigation capabilities on multiple benchmarks and perform a detailed design analysis. Building on this, we extend the approach to a more complex manipulation setting, where the agent calls a Vision-Language-Action (VLA) model to handle fine-grained control. We analyze both task performance and design factors and show how the agent can most effectively utilize the capabilities of the VLA.},
}

EndNote citation:

%0 Thesis
%A Goetting, Dylan 
%T Designing LLM based agents to interact with the embodied world
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 14
%@ UCB/EECS-2025-59
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-59.html
%F Goetting:EECS-2025-59