Arjun Bhorkar

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-62

May 7, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-62.pdf

Autonomous navigation in robotics has traditionally relied on predefined waypoints and structured maps, limiting scalability in dynamic, real-world environments. The lack of well-annotated language-action datasets further complicates the development of language-driven navigation models. Inspired by recent advances in large-scale Vision-Language Models (VLMs), image generation models, and vision-based robotic control, we propose Exploration with VLM-guided Image Subgoal Synthesis (ElVISS), a framework that enhances exploration for robot navigation tasks given user instructions. The framework leverages the semantic reasoning of VLMs to decompose complex tasks into simpler ones and carries them out by generating task-relevant image subgoals, which a low-level policy then executes. We also incorporate a VLM-based subgoal validation loop that reduces the execution of inaccurately generated subgoals. Experimental results show that the validation loop significantly improves the alignment of executed actions with the given instructions, and that the resulting system can carry out generalized search-based instructions.
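To make the described pipeline concrete, below is a minimal Python sketch of the decompose-synthesize-validate-execute loop. Every class and method name here (VLM.decompose, ImageGenerator.synthesize, LowLevelPolicy.go_to, and so on) is a hypothetical placeholder, not an interface from the report; a real system would wrap a large VLM, an image generation model, and a learned low-level navigation policy.

    """Minimal sketch of an ElVISS-style loop, under assumed interfaces.
    All components below are placeholders, not the report's actual code."""
    from dataclasses import dataclass
    from typing import List

    @dataclass
    class Subtask:
        description: str  # e.g. "move toward the doorway on the left"

    class VLM:
        def decompose(self, instruction: str, observation) -> List[Subtask]:
            """Split a complex instruction into simpler subtasks
            (placeholder: treat the whole instruction as one subtask)."""
            return [Subtask(instruction)]

        def validate(self, subgoal_image, subtask: Subtask) -> bool:
            """Check that a generated subgoal image actually depicts the
            subtask before execution (placeholder: accept everything)."""
            return True

    class ImageGenerator:
        def synthesize(self, observation, subtask: Subtask):
            """Generate a task-relevant image subgoal from the current
            view (placeholder: return the observation unchanged)."""
            return observation

    class LowLevelPolicy:
        def go_to(self, subgoal_image) -> None:
            """Drive the robot toward the state shown in the subgoal image."""
            pass

    def explore(instruction: str, observation, max_retries: int = 3) -> None:
        vlm, generator, policy = VLM(), ImageGenerator(), LowLevelPolicy()
        for subtask in vlm.decompose(instruction, observation):
            # Validation loop: regenerate until the VLM accepts the subgoal,
            # filtering out inaccurately generated images before execution.
            for _ in range(max_retries):
                subgoal = generator.synthesize(observation, subtask)
                if vlm.validate(subgoal, subtask):
                    policy.go_to(subgoal)
                    break

The retry bound is an assumption; the report only states that validation reduces the execution of inaccurate subgoals, not how regeneration is budgeted.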

Advisor: Sergey Levine


BibTeX citation:

@mastersthesis{Bhorkar:EECS-2024-62,
    Author= {Bhorkar, Arjun},
    Title= {VLM Guided Exploration via Image Subgoal Synthesis},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-62.html},
    Number= {UCB/EECS-2024-62},
    Abstract= {Autonomous navigation in robotics has traditionally relied on predefined waypoints and structured maps, limiting scalability in dynamic, real-world environments. The lack of well-annotated language-action datasets further complicates the development of language-driven navigation models. Inspired by recent advances in large-scale Vision-Language Models (VLMs), image generation models, and vision-based robotic control, we propose Exploration with VLM-guided Image Subgoal Synthesis (ElVISS), a framework that enhances exploration for robot navigation tasks given user instructions. The framework leverages the semantic reasoning of VLMs to decompose complex tasks into simpler ones and carries them out by generating task-relevant image subgoals, which a low-level policy then executes. We also incorporate a VLM-based subgoal validation loop that reduces the execution of inaccurately generated subgoals. Experimental results show that the validation loop significantly improves the alignment of executed actions with the given instructions, and that the resulting system can carry out generalized search-based instructions.},
}

EndNote citation:

%0 Thesis
%A Bhorkar, Arjun 
%T VLM Guided Exploration via Image Subgoal Synthesis
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 7
%@ UCB/EECS-2024-62
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-62.html
%F Bhorkar:EECS-2024-62