VLM Guided Exploration via Image Subgoal Synthesis
Arjun Bhorkar
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2024-62
May 7, 2024
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-62.pdf
Autonomous navigation in robotics has traditionally relied on predefined waypoints and structured maps, limiting scalability in dynamic, real-world environments. The lack of well-annotated language-action datasets further complicates the development of language-driven navigation models. Inspired by recent advances in large-scale Vision-Language Models (VLMs), image generation models, and vision-based robotic control, we propose Exploration with VLM-guided Image Subgoal Synthesis (ElVISS), a framework that enhances exploration for robot navigation tasks specified by user instructions. The framework leverages the semantic reasoning of VLMs to decompose complex tasks into simpler ones and to generate task-relevant image subgoals, which a low-level policy then executes. We also incorporate a VLM-based subgoal validation loop to avoid executing inaccurately generated subgoals. Experimental results show that the validation loop significantly improves the alignment of executed actions with the given instructions, and that the resulting system can carry out generalized search-based instructions.
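The abstract outlines a decompose–generate–validate–execute loop. As an illustration only, the following Python sketch shows one way such a loop could be organized; the object names (vlm, image_generator, policy) and their methods (decompose, generate, validate, execute) are assumptions made here for exposition and do not correspond to the report's actual implementation.

def elviss_style_exploration(instruction, observation, vlm, image_generator,
                             policy, max_retries=3):
    """Illustrative loop: decompose an instruction, synthesize image subgoals,
    validate them with a VLM, and execute validated subgoals with a low-level
    policy. All interfaces here are hypothetical."""
    # Use the VLM's semantic reasoning to split the task into simpler subtasks.
    subtasks = vlm.decompose(instruction, observation)
    for subtask in subtasks:
        for _ in range(max_retries):
            # Synthesize a task-relevant image subgoal for this subtask.
            subgoal_image = image_generator.generate(subtask, observation)
            # VLM-based validation: skip subgoals that do not plausibly
            # depict the requested subtask.
            if vlm.validate(subgoal_image, subtask):
                # Hand the validated image subgoal to the low-level
                # vision-based policy and update the current observation.
                observation = policy.execute(subgoal_image, observation)
                break
    return observation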
Advisor: Sergey Levine
BibTeX citation:
@mastersthesis{Bhorkar:EECS-2024-62,
    Author = {Bhorkar, Arjun},
    Title = {VLM Guided Exploration via Image Subgoal Synthesis},
    School = {EECS Department, University of California, Berkeley},
    Year = {2024},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-62.html},
    Number = {UCB/EECS-2024-62},
    Abstract = {Autonomous navigation in robotics has traditionally relied on predefined waypoints and structured maps, limiting scalability in dynamic, real-world environments. The lack of well-annotated language-action datasets further complicates the development of language-driven navigation models. Inspired by recent advances in large-scale Vision-Language Models (VLMs), image generation models, and vision-based robotic control, we propose Exploration with VLM-guided Image Subgoal Synthesis (ElVISS), a framework that enhances exploration for robot navigation tasks specified by user instructions. The framework leverages the semantic reasoning of VLMs to decompose complex tasks into simpler ones and to generate task-relevant image subgoals, which a low-level policy then executes. We also incorporate a VLM-based subgoal validation loop to avoid executing inaccurately generated subgoals. Experimental results show that the validation loop significantly improves the alignment of executed actions with the given instructions, and that the resulting system can carry out generalized search-based instructions.},
}
EndNote citation:
%0 Thesis
%A Bhorkar, Arjun
%T VLM Guided Exploration via Image Subgoal Synthesis
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 7
%@ UCB/EECS-2024-62
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-62.html
%F Bhorkar:EECS-2024-62