Jathushan Rajasegaran
EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2025-65
May 15, 2025
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-65.pdf
From the moment we are born, we continuously witness the “video” of our own lives—hundreds of thousands of hours of rich, unfolding scenes. These visual experiences, streaming in seamlessly over time, form the foundation of how we understand the world: by tracking motion, recognizing people, and anticipating what comes next. In many ways, our perception begins with tracking—following a pixel, a person, or a motion—enabling higher-order understanding such as object permanence, social interaction, and physical causality. This thesis explores how to build visual models that can track, recognize, and predict.
First, I will discuss tracking people in monocular videos with PHALP (Predicting Human Appearance, Location, and Pose). PHALP aggregates 3D representations of people into tracklets and uses temporal models to predict their future states, enabling persistent tracking. Next, I will discuss human action recognition from a Lagrangian perspective using these tracklets. LART (Lagrangian Action Recognition with Tracking), a transformer-based model, demonstrates the benefits of explicit 3D pose (SMPL) and location for predicting actions. LART fuses 3D pose dynamics with contextualized appearance features along tracklets, significantly improving performance on the AVA dataset, especially for interactive and complex actions. Finally, I will discuss large-scale self-supervised learning through autoregressive video prediction with Toto, a family of causal transformers. Trained on next-token prediction using over a trillion visual tokens from diverse image and video datasets, Toto learns powerful, general-purpose visual representations with minimal inductive biases. An empirical study of architectural and tokenization choices shows these representations achieve competitive performance on downstream tasks including classification, tracking, object permanence, and robotics. We also analyze the power-law scaling of these video models.
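To make the autoregressive objective concrete, below is a minimal sketch, not Toto's actual architecture, tokenizer, or training setup: a small causal transformer trained with next-token prediction over discrete visual tokens. The vocabulary size, model dimensions, and the specific PyTorch modules are assumptions chosen for illustration only.

```python
# Minimal sketch of causal next-token prediction over visual tokens.
# Hypothetical parameters; not the Toto implementation.
import torch
import torch.nn as nn

class CausalVideoLM(nn.Module):
    def __init__(self, vocab_size=8192, d_model=512, n_heads=8, n_layers=6, max_len=1024):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        layer = nn.TransformerEncoderLayer(d_model, n_heads, 4 * d_model, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(d_model, vocab_size)

    def forward(self, tokens):
        # tokens: (batch, seq) integer ids from some image/video tokenizer (assumed, e.g. a VQ model)
        b, t = tokens.shape
        pos = torch.arange(t, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        # Causal mask so each position only attends to earlier tokens
        mask = nn.Transformer.generate_square_subsequent_mask(t).to(tokens.device)
        x = self.blocks(x, mask=mask)
        return self.head(x)  # (batch, seq, vocab_size) logits

# Training objective: predict token t+1 from tokens up to t.
model = CausalVideoLM()
tokens = torch.randint(0, 8192, (2, 64))  # stand-in for tokenized video frames
logits = model(tokens[:, :-1])
loss = nn.functional.cross_entropy(
    logits.reshape(-1, logits.size(-1)), tokens[:, 1:].reshape(-1)
)
loss.backward()
```

The intermediate activations of such a model, rather than its predicted tokens, are what serve as general-purpose representations for downstream tasks like classification, tracking, and robotics.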
Advisor: Jitendra Malik
";
?>
BibTeX citation:
@phdthesis{Rajasegaran:EECS-2025-65,
    Author = {Rajasegaran, Jathushan},
    Title = {Video Models of People and Pixels},
    School = {EECS Department, University of California, Berkeley},
    Year = {2025},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-65.html},
    Number = {UCB/EECS-2025-65},
    Abstract = {From the moment we are born, we continuously witness the “video” of our own lives—hundreds of thousands of hours of rich, unfolding scenes. These visual experiences, streaming in seamlessly over time, form the foundation of how we understand the world: by tracking motion, recognizing people, and anticipating what comes next. In many ways, our perception begins with tracking—following a pixel, a person, or a motion—enabling higher-order understanding such as object permanence, social interaction, and physical causality. This thesis explores how to build visual models that can track, recognize, and predict. First, I will discuss tracking people in monocular videos with PHALP (Predicting Human Appearance, Location, and Pose). PHALP aggregates 3D representations of people into tracklets and uses temporal models to predict their future states, enabling persistent tracking. Next, I will discuss human action recognition from a Lagrangian perspective using these tracklets. LART (Lagrangian Action Recognition with Tracking), a transformer-based model, demonstrates the benefits of explicit 3D pose (SMPL) and location for predicting actions. LART fuses 3D pose dynamics with contextualized appearance features along tracklets, significantly improving performance on the AVA dataset, especially for interactive and complex actions. Finally, I will discuss large-scale self-supervised learning through autoregressive video prediction with Toto, a family of causal transformers. Trained on next-token prediction using over a trillion visual tokens from diverse image and video datasets, Toto learns powerful, general-purpose visual representations with minimal inductive biases. An empirical study of architectural and tokenization choices shows these representations achieve competitive performance on downstream tasks including classification, tracking, object permanence, and robotics. We also analyze the power-law scaling of these video models.}
}
EndNote citation:
%0 Thesis
%A Rajasegaran, Jathushan
%T Video Models of People and Pixels
%I EECS Department, University of California, Berkeley
%D 2025
%8 May 15
%@ UCB/EECS-2025-65
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2025/EECS-2025-65.html
%F Rajasegaran:EECS-2025-65