Wilson Yan

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2024-193

November 25, 2024

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-193.pdf

Learning large-scale video generation models provides a key avenue for learning about the visual world through internet-scale video data. Learning to generate accurate video requires a model to have a deep understanding of real-world concepts such as motion, physics, object interactions, and 3D consistency. In this dissertation, I will present my research, which aims to address core bottlenecks in the fundamental architectures and scaling of video generation models, as well as applications of such video models to downstream tasks.

In the first part of my dissertation, I will address computational bottlenecks in video generation models by developing methods for learning well-compressed, spatio-temporal hierarchical representations of video data. Specifically, I first present VideoGPT, in which we learn a compressed latent space with a simple 3D CNN autoencoder that downsamples pixel representations of video in both space and time, resulting in orders-of-magnitude savings in computation when learning a video generation model in this latent space. Next, I investigate more efficient video generation architectures with TECO, which is able to scale to long video sequences. I then present ElasticTok, a method that encodes video data more efficiently by leveraging adaptive representations with variable-length encodings.
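
To make the compression idea concrete, below is a minimal PyTorch sketch of a 3D CNN video autoencoder that downsamples in both space and time. The layer widths, downsampling factors, and the plain (non-quantized) bottleneck are illustrative assumptions rather than the exact VideoGPT architecture, which builds a discrete latent space.

    # Minimal sketch (illustrative, not the exact VideoGPT design) of a 3D CNN
    # autoencoder that compresses video in space and time.
    import torch.nn as nn

    class VideoAutoencoder(nn.Module):
        def __init__(self, channels=64, latent_dim=256):
            super().__init__()
            # Encoder: each stride-2 Conv3d halves the temporal and spatial
            # resolution, e.g. a (16, 64, 64) clip becomes a (4, 16, 16) latent grid.
            self.encoder = nn.Sequential(
                nn.Conv3d(3, channels, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv3d(channels, channels, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv3d(channels, latent_dim, kernel_size=3, stride=1, padding=1),
            )
            # Decoder: transposed 3D convolutions map latents back to pixels.
            self.decoder = nn.Sequential(
                nn.ConvTranspose3d(latent_dim, channels, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.ConvTranspose3d(channels, channels, kernel_size=4, stride=2, padding=1),
                nn.ReLU(),
                nn.Conv3d(channels, 3, kernel_size=3, stride=1, padding=1),
            )

        def forward(self, video):  # video: (batch, 3, time, height, width)
            latents = self.encoder(video)
            return self.decoder(latents), latents

A generative model (in VideoGPT, an autoregressive transformer over the discretized latents) is then trained in this much smaller latent space rather than on raw pixels.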

Next, I will focus on algorithmic approaches to scaling to longer contexts. In Large World Model, we demonstrate core training methodologies for stably training long-context models on a mixture of language, video, and image data of up to millions of tokens.
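
As a rough illustration of training on mixed-modality data at long context, the sketch below packs tokenized text, image, and video examples into single long sequences; the function and its length limit are hypothetical and not the Large World Model training code.

    # Hypothetical sketch of packing tokenized text, image, and video examples
    # into long mixed-modality training sequences (not the Large World Model code).
    import random

    def pack_examples(examples, max_tokens=1_000_000):
        """examples: list of token lists, each from a text, image, or video sample."""
        random.shuffle(examples)
        sequences, current = [], []
        for tokens in examples:
            # Start a new sequence once the current one would exceed the context limit.
            if current and len(current) + len(tokens) > max_tokens:
                sequences.append(current)
                current = []
            current.extend(tokens)
        if current:
            sequences.append(current)
        return sequences  # each entry is one long training sequence of mixed tokens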

Finally, I will present two studies exploring the use of pre-trained video generation models for downstream tasks. In the first paper, I present VIPER, where we use video prediction model likelihoods as a reward signal for training a reinforcement learning agent. I then present MoCA, where we show that video generation models can be used to perform complex video editing tasks.
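
For intuition, the sketch below shows the core VIPER idea of turning a pretrained video prediction model's likelihoods into rewards; video_model and its log_prob interface are hypothetical placeholders rather than the paper's actual API.

    # Hedged sketch of the VIPER idea: reward each transition by how likely the
    # resulting frame is under a pretrained video prediction model. `video_model`
    # and its `log_prob` method are hypothetical placeholders.
    import torch

    def viper_rewards(video_model, frames):
        """frames: (time, channels, height, width) tensor of observations from one rollout."""
        rewards = []
        with torch.no_grad():
            for t in range(1, frames.shape[0]):
                context = frames[:t]      # frames observed so far
                next_frame = frames[t]
                # Behavior that resembles the videos the model was trained on
                # (e.g. expert demonstrations) gets higher likelihood, hence higher reward.
                rewards.append(video_model.log_prob(next_frame, context=context))
        return torch.stack(rewards)

These per-step rewards can then be fed to a standard reinforcement learning algorithm in place of a hand-designed task reward.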

Advisor: Pieter Abbeel


BibTeX citation:

@phdthesis{Yan:EECS-2024-193,
    Author= {Yan, Wilson},
    Title= {Learning About The World Through Video Generation},
    School= {EECS Department, University of California, Berkeley},
    Year= {2024},
    Month= {Nov},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-193.html},
    Number= {UCB/EECS-2024-193},
    Abstract= {Learning large-scale video generation models provides a key avenue for learning about the visual world through internet-scale video data. Learning to generate accurate video requires a model to have a deep understanding of real-world concepts such as motion, physics, object interactions, and 3D consistency. In this dissertation, I will present my research, which aims to address core bottlenecks in the fundamental architectures and scaling of video generation models, as well as applications of such video models to downstream tasks.

In the first part of my dissertation, I will address computational bottlenecks in video generation models by developing methods for learning well-compressed, spatio-temporal hierarchical representations of video data. Specifically, I first present VideoGPT, in which we learn a compressed latent space with a simple 3D CNN autoencoder that downsamples pixel representations of video in both space and time, resulting in orders-of-magnitude savings in computation when learning a video generation model in this latent space. Next, I investigate more efficient video generation architectures with TECO, which is able to scale to long video sequences. I then present ElasticTok, a method that encodes video data more efficiently by leveraging adaptive representations with variable-length encodings.

Next, I will focus on algorithmic approaches to scaling to longer contexts. In Large World Model, we demonstrate core training methodologies for stably training long-context models on a mixture of language, video, and image data of up to millions of tokens.

Finally, I will present two studies exploring the use of pre-trained video generation models for downstream tasks. In the first paper, I present VIPER, where we use video prediction model likelihoods as a reward signal for training a reinforcement learning agent. I then present MoCA, where we show that video generation models can be used to perform complex video editing tasks.},
}

EndNote citation:

%0 Thesis
%A Yan, Wilson 
%T Learning About The World Through Video Generation
%I EECS Department, University of California, Berkeley
%D 2024
%8 November 25
%@ UCB/EECS-2024-193
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2024/EECS-2024-193.html
%F Yan:EECS-2024-193