Pretrained Representations for Embodied AI

Sasha Sax

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-176

May 15, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-176.pdf

The world is messy and imperfect, unstructured and complex, yet we must still accomplish the basic behaviors necessary for survival. It is for this purpose, ecologically relevant behavior, that vision evolved 500–600 million years ago.

This thesis is about how to learn representations of the visual world that are useful for the types of behaviors we might want an embodied AI system to perform. In the first part of this thesis, we systematically study how bottlenecking visual inputs through different pretrained representations affects the ability of a robot to learn different atomic navigation skills (Chapter 2) and manipulation skills (Chapter 3) through trial-and-error. The main finding is that the appropriate pretrained representation greatly improves both the sample efficiency of skill acquisition and the generalization of the learned skill. In the second part of the thesis, we use these lessons to improve the accuracy of the representations across a wider variety of contexts (indoors, outdoors, tabletop settings, and so on). In Chapter 4 we do this by adding cross-prediction consistency objectives. In Chapter 5 we do this by leveraging vast amounts of 3D data available on the internet and from a robot's prior experience.
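To make the bottlenecking setup from the first part concrete, here is a minimal sketch in PyTorch. The names (FrozenBottleneck, PolicyHead) and the toy encoder are illustrative assumptions, not the thesis's actual code: a pretrained visual encoder is frozen, its features become the only input to a small trainable policy head, and trial-and-error updates touch only the head.

import torch
import torch.nn as nn

class FrozenBottleneck(nn.Module):
    """Wraps a pretrained encoder so gradients never reach its weights."""
    def __init__(self, encoder: nn.Module):
        super().__init__()
        self.encoder = encoder.eval()
        for p in self.encoder.parameters():
            p.requires_grad = False

    @torch.no_grad()
    def forward(self, obs: torch.Tensor) -> torch.Tensor:
        return self.encoder(obs)

class PolicyHead(nn.Module):
    """Small trainable head mapping features to action logits."""
    def __init__(self, feat_dim: int, n_actions: int):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(feat_dim, 256), nn.ReLU(), nn.Linear(256, n_actions)
        )

    def forward(self, feats: torch.Tensor) -> torch.Tensor:
        return self.net(feats)

# Toy stand-in encoder; a real run would plug in a pretrained vision model.
encoder = nn.Sequential(nn.Flatten(), nn.Linear(3 * 64 * 64, 128))
bottleneck = FrozenBottleneck(encoder)
policy = PolicyHead(feat_dim=128, n_actions=4)
optim = torch.optim.Adam(policy.parameters(), lr=3e-4)

obs = torch.rand(8, 3, 64, 64)                     # batch of RGB observations
logits = policy(bottleneck(obs))                   # bottlenecked forward pass
dist = torch.distributions.Categorical(logits=logits)
actions = dist.sample()
rewards = torch.rand(8)                            # placeholder for env feedback
loss = -(dist.log_prob(actions) * rewards).mean()  # REINFORCE-style loss
optim.zero_grad()
loss.backward()                                    # only the head gets gradients
optim.step()

Because the encoder is frozen, the trial-and-error learner only has to fit the small head, which is one intuition for the sample-efficiency gains the first part of the thesis reports.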

The methods are developed primarily for vision and action, but many of the ideas are general and could apply to other sensory modalities and behaviors.

Advisor: Jitendra Malik


BibTeX citation:

@phdthesis{Sax:EECS-2023-176,
    Author= {Sax, Sasha},
    Title= {Pretrained Representations for Embodied AI},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-176.html},
    Number= {UCB/EECS-2023-176},
    Abstract= {The world is messy and imperfect, unstructured and complex, yet we must still accomplish the basic behaviors necessary for survival. It is for this purpose, ecologically relevant behavior, that vision evolved 500--600 million years ago.

This thesis is about how to learn representations of the visual world that are useful for the types of behaviors we might want an embodied AI system to perform. In the first part of this thesis, we systematically study how bottlenecking visual inputs through different pretrained representations affects the ability of a robot to learn different atomic navigation skills (Chapter 2) and manipulation skills (Chapter 3) through trial-and-error. The main finding is that the appropriate pretrained representation greatly improves both the sample efficiency of skill acquisition and the generalization of the learned skill. In the second part of the thesis, we use these lessons to improve the accuracy of the representations across a wider variety of contexts (indoors, outdoors, tabletop settings, and so on). In Chapter 4 we do this by adding cross-prediction consistency objectives. In Chapter 5 we do this by leveraging vast amounts of 3D data available on the internet and from a robot's prior experience.

The methods are developed primarily for vision and action, but many of the ideas are general and could apply to other sensory modalities and behaviors.},
}

EndNote citation:

%0 Thesis
%A Sax, Sasha 
%T Pretrained Representations for Embodied AI
%I EECS Department, University of California, Berkeley
%D 2023
%8 May 15
%@ UCB/EECS-2023-176
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-176.html
%F Sax:EECS-2023-176