Scalable Representations for Vision and Robotics
Tete Xiao
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2023-189
June 9, 2023
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-189.pdf
Artificial intelligence systems have advanced remarkably in recent years, yet scaling them and generalizing to real-world problems remains a significant challenge. In this thesis, we explore three key components of building scalable artificial intelligence systems for computer vision: model optimizability, learning objectives, and large-scale datasets, and we apply these findings to robotics.
Our work begins with an examination of the optimizability of vision transformers, proposing a new set of optimizability metrics and an alternative design for their patchify stem. Next, we introduce a contrastive self-supervised learning objective that reduces inductive biases in self-supervised learning, yielding superior performance across various datasets. We then show that self-supervised visual pre-training on real-world images is effective for learning motor control tasks from pixels, outperforming supervised baselines and matching the performance of oracle state-based methods.
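As a rough illustration of the stem design question mentioned above, the sketch below contrasts the standard ViT patchify stem (a single large-stride convolution) with a convolutional stem built from a stack of stride-2 3x3 convolutions. This is a minimal, hypothetical PyTorch sketch for orientation only; the layer widths and module names are assumptions, and the thesis's actual design and optimizability metrics are not reproduced here.

    # Hypothetical sketch: standard ViT patchify stem vs. a convolutional stem.
    import torch
    import torch.nn as nn

    class PatchifyStem(nn.Module):
        """Standard ViT stem: one stride-16, 16x16 convolution that 'patchifies' the image."""
        def __init__(self, embed_dim=384):
            super().__init__()
            self.proj = nn.Conv2d(3, embed_dim, kernel_size=16, stride=16)

        def forward(self, x):                      # x: (B, 3, H, W)
            x = self.proj(x)                       # (B, C, H/16, W/16)
            return x.flatten(2).transpose(1, 2)    # (B, N, C) token sequence

    class ConvStem(nn.Module):
        """Alternative stem: a stack of stride-2 3x3 convolutions reaching the same
        16x downsampling before handing tokens to the transformer body."""
        def __init__(self, embed_dim=384):
            super().__init__()
            dims, layers = [3, 48, 96, 192, 384], []
            for c_in, c_out in zip(dims[:-1], dims[1:]):   # four stride-2 convs -> 16x
                layers += [nn.Conv2d(c_in, c_out, 3, stride=2, padding=1),
                           nn.BatchNorm2d(c_out), nn.ReLU(inplace=True)]
            layers.append(nn.Conv2d(dims[-1], embed_dim, kernel_size=1))
            self.proj = nn.Sequential(*layers)

        def forward(self, x):
            x = self.proj(x)                       # (B, C, H/16, W/16)
            return x.flatten(2).transpose(1, 2)    # (B, N, C) token sequence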
Expanding on this, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks, demonstrating the effectiveness of the pre-trained representations across a range of tasks and embodiments. In addition, we present a sim-to-real, learning-based approach to real-world humanoid locomotion using a causal Transformer, the first fully learning-based method for locomotion of a full-sized humanoid in the real world. Finally, we conclude the thesis and discuss promising directions for future research.
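For readers unfamiliar with the term, the following is a minimal, hypothetical sketch of what a causal-Transformer control policy can look like: the model attends over a history of proprioceptive observations and past actions under a causal mask and predicts the next action. All dimensions, names, and the token-interleaving scheme are illustrative assumptions, not the thesis's actual architecture or training procedure.

    # Hypothetical sketch of a causal-Transformer policy over observation/action history.
    import torch
    import torch.nn as nn

    class CausalTransformerPolicy(nn.Module):
        def __init__(self, obs_dim, act_dim, d_model=192, n_layers=4, n_heads=4, ctx=16):
            super().__init__()
            self.obs_embed = nn.Linear(obs_dim, d_model)
            self.act_embed = nn.Linear(act_dim, d_model)
            self.pos_embed = nn.Parameter(torch.zeros(1, 2 * ctx, d_model))  # assumes T <= ctx
            layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model,
                                               batch_first=True, norm_first=True)
            self.encoder = nn.TransformerEncoder(layer, n_layers)
            self.head = nn.Linear(d_model, act_dim)

        def forward(self, obs_hist, act_hist):
            # obs_hist: (B, T, obs_dim), act_hist: (B, T, act_dim)
            B, T, _ = obs_hist.shape
            # Interleave (obs_t, act_t) tokens so attention covers the full history.
            tokens = torch.stack([self.obs_embed(obs_hist),
                                  self.act_embed(act_hist)], dim=2).reshape(B, 2 * T, -1)
            tokens = tokens + self.pos_embed[:, : 2 * T]
            # Causal mask: each token may attend only to itself and earlier tokens.
            mask = nn.Transformer.generate_square_subsequent_mask(2 * T)
            h = self.encoder(tokens, mask=mask)
            # Predict the next action from the most recent observation token.
            return self.head(h[:, -2])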
Advisor: Trevor Darrell
BibTeX citation:
@phdthesis{Xiao:EECS-2023-189,
    Author = {Xiao, Tete},
    Title = {Scalable Representations for Vision and Robotics},
    School = {EECS Department, University of California, Berkeley},
    Year = {2023},
    Month = {Jun},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-189.html},
    Number = {UCB/EECS-2023-189},
    Abstract = {Artificial intelligence systems have shown remarkable advancements in recent years. However, the challenge of scalability and generalization to real-world problems remains a significant issue. In this thesis, we explore the three key components of building scalable artificial intelligence systems for computer vision, including model optimizability, learning objectives, and large-scale datasets, and apply these outcomes for robotics. Our work begins with an examination of the optimizability of vision transformers, proposing a new set of optimizability metrics and an alternative design for their patchify stem. Next, we introduce a contrastive self-supervised learning objective that reduces inductive biases in self-supervised learning, resulting in superior performance across various datasets. We then showcase the effectiveness of self-supervised visual pre-training from real-world images for learning motor control tasks from pixels, outperforming supervised baselines and matching oracle state performance. Expanding on this, we explore self-supervised visual pre-training on images from diverse, in-the-wild videos for real-world robotic tasks, demonstrating the effectiveness of pre-trained representations across a range of tasks and embodiments. In addition, we present a sim-to-real learning-based approach for real-world humanoid locomotion using a causal Transformer, marking the first fully learning-based method for real-world full-sized humanoid locomotion. Finally, we conclude the thesis and discuss potential future directions for further research in the field.},
}
EndNote citation:
%0 Thesis
%A Xiao, Tete
%T Scalable Representations for Vision and Robotics
%I EECS Department, University of California, Berkeley
%D 2023
%8 June 9
%@ UCB/EECS-2023-189
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-189.html
%F Xiao:EECS-2023-189