Satvik Sharma

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-96

May 10, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-96.pdf

As robotic systems enter the real world, the challenge of creating perception systems that are robust to it remains. The real world contains a visually and semantically diverse set of environments filled with an even more diverse set of objects. We can account for this diversity with large vision-language models (VLMs), which have recently shown promise in capturing real-world semantics because they are pretrained on internet-scale data. We want to rely on these VLMs without additional environment-specific data collection, which can be expensive in many robotic domains. Thus, we seek to integrate VLMs into the robotic perception pipeline so that they can be used out of the box, or zero-shot, for different tasks. We introduce two methods that use VLMs zero-shot for the robotic tasks of occluded object search and grasping: Semantic Mechanical Search (SMS) and Language Embedded Radiance Fields for Task-Oriented Grasping (LERF-TOGO), respectively. SMS uses large language models (LLMs) in addition to VLMs to reason semantically about visually occluded objects during search. By embedding semantic understanding into the search process, SMS improves efficiency in locating objects across both simulated and real-world environments. LERF-TOGO, in turn, creates a 3D vision-language field derived from VLMs to execute precise grasps of object parts specified by natural language. This method shows high accuracy and adaptability in physical trials, effectively grasping specified parts on a variety of objects. We conclude with the limitations of both works and possible future directions.
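
As a minimal illustration of the zero-shot, out-of-the-box VLM usage the abstract describes, the sketch below scores a single RGB image against free-form text queries with an off-the-shelf CLIP model. The checkpoint name, image file, and queries are assumptions made for this example only; this is not the SMS or LERF-TOGO pipeline, which the report describes in full.

# Illustrative sketch: zero-shot image-text relevancy with an off-the-shelf VLM (CLIP).
# The checkpoint, image path, and queries below are assumptions for this example,
# not components of SMS or LERF-TOGO.
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image = Image.open("shelf_view.png")  # hypothetical RGB observation from the robot
queries = ["a mug handle", "a box of pasta", "a screwdriver"]

# Encode the image and all text queries in one batch; no task-specific training data is used.
inputs = processor(text=queries, images=image, return_tensors="pt", padding=True)
outputs = model(**inputs)

# logits_per_image holds image-text similarity; softmax gives a relative relevancy per query.
scores = outputs.logits_per_image.softmax(dim=-1).squeeze(0)
for query, score in zip(queries, scores.tolist()):
    print(f"{query}: {score:.3f}")

Both methods build on this kind of language-conditioned relevancy signal: SMS combines it with LLM reasoning over occluded scenes, and LERF-TOGO distills such features into a 3D field for part-level grasping.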


BibTeX citation:

@mastersthesis{Sharma:EECS-2024-96,
    Author= {Sharma, Satvik},
    Editor= {Goldberg, Ken and Malik, Jitendra},
    Title= {Vision-Language Representations for Zero-Shot Robotic Perception},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-96.html},
    Number= {UCB/EECS-2024-96},
    Abstract= {As robotic systems enter the real world, the challenge of creating perception systems that are robust to it remains. The real world contains a visually and semantically diverse set of environments filled with an even more diverse set of objects. We can account for this diversity with large vision-language models (VLMs), which have recently shown promise in capturing real-world semantics because they are pretrained on internet-scale data. We want to rely on these VLMs without additional environment-specific data collection, which can be expensive in many robotic domains. Thus, we seek to integrate VLMs into the robotic perception pipeline so that they can be used out of the box, or zero-shot, for different tasks. We introduce two methods that use VLMs zero-shot for the robotic tasks of occluded object search and grasping: Semantic Mechanical Search (SMS) and Language Embedded Radiance Fields for Task-Oriented Grasping (LERF-TOGO), respectively. SMS uses large language models (LLMs) in addition to VLMs to reason semantically about visually occluded objects during search. By embedding semantic understanding into the search process, SMS improves efficiency in locating objects across both simulated and real-world environments. LERF-TOGO, in turn, creates a 3D vision-language field derived from VLMs to execute precise grasps of object parts specified by natural language. This method shows high accuracy and adaptability in physical trials, effectively grasping specified parts on a variety of objects. We conclude with the limitations of both works and possible future directions.},
}

EndNote citation:

%0 Thesis
%A Sharma, Satvik 
%E Goldberg, Ken 
%E Malik, Jitendra 
%T Vision-Language Representations for Zero-Shot Robotic Perception
%I EECS Department, University of California, Berkeley
%D 2024
%8 May 10
%@ UCB/EECS-2024-96
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-96.html
%F Sharma:EECS-2024-96