Perceiving People over Long Periods: Algorithms, Architectures & Datasets

Karttikeya Mangalam

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2023-282

December 15, 2023

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-282.pdf

Long-form video understanding remains one of the last enduring open problems in computer vision. While the natural world offers long periods of visual stimuli, most computer vision systems still operate within a limited temporal scope, typically just a few seconds in both input and output. This thesis presents my work developing the neural machinery, i.e., the algorithms, architectures, and datasets, that extends the temporal capacity of video understanding systems to minutes and beyond.

I start by presenting my work on algorithms for long-term multimodal human motion forecasting, termed PECNet and Y-net. Next, I introduce my contributions to neural architectures: hierarchical, temporally scalable, and memory-efficient designs for understanding long-form videos, in the form of MViT and Rev-ViT. Finally, I close by presenting my work on EgoSchema, the first certifiably long-form video-language dataset, which serves as a benchmark for evaluating the long-form understanding capabilities of multimodal models. The benchmark results on EgoSchema highlight the existing performance gap between current state-of-the-art models and human-level long-form video understanding. I believe that these advancements in algorithms, architectures, and datasets not only address several existing limitations but also open new avenues for future research and application.
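To make the memory-efficiency idea behind Rev-ViT concrete, the sketch below illustrates the general reversible-residual coupling (in the style of RevNet-type blocks), under which a block's inputs can be recomputed from its outputs rather than cached for backpropagation. This is an illustrative toy, not the thesis implementation: the ReversibleBlock class and its F/G sub-networks are hypothetical stand-ins for a transformer layer's attention and MLP sub-blocks.

# Illustrative sketch (hypothetical, not the thesis code): the reversible-residual
# coupling used by memory-efficient transformers such as Rev-ViT. Because the block
# is exactly invertible, its input activations can be recomputed from its outputs
# during the backward pass instead of being stored.
import torch
import torch.nn as nn

class ReversibleBlock(nn.Module):
    """Couples two activation streams (x1, x2) through an invertible map."""
    def __init__(self, dim: int):
        super().__init__()
        # F and G are stand-ins for a transformer layer's attention and MLP sub-blocks.
        self.F = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))
        self.G = nn.Sequential(nn.LayerNorm(dim), nn.Linear(dim, dim), nn.GELU(), nn.Linear(dim, dim))

    def forward(self, x1, x2):
        y1 = x1 + self.F(x2)
        y2 = x2 + self.G(y1)
        return y1, y2

    def inverse(self, y1, y2):
        # Exact algebraic inverse of forward(): recovers (x1, x2) from (y1, y2).
        x2 = y2 - self.G(y1)
        x1 = y1 - self.F(x2)
        return x1, x2

if __name__ == "__main__":
    block = ReversibleBlock(dim=64).eval()
    x1, x2 = torch.randn(2, 16, 64), torch.randn(2, 16, 64)
    with torch.no_grad():
        y1, y2 = block(x1, x2)
        r1, r2 = block.inverse(y1, y2)
    # Both checks should print True (up to floating-point tolerance).
    print(torch.allclose(r1, x1, atol=1e-5), torch.allclose(r2, x2, atol=1e-5))

In a full reversible backbone built from such blocks, only the final activations need to be kept; earlier ones are reconstructed on the fly during backpropagation, which is why activation memory stays roughly constant in network depth.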

Advisor: Jitendra Malik


BibTeX citation:

@phdthesis{Mangalam:EECS-2023-282,
    Author= {Mangalam, Karttikeya},
    Title= {Perceiving People over Long Periods: Algorithms, Architectures & Datasets},
    School= {EECS Department, University of California, Berkeley},
    Year= {2023},
    Month= {Dec},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-282.html},
    Number= {UCB/EECS-2023-282},
    Abstract= {Long-form video understanding remains one of the last enduring open problems in computer
vision. While the natural world offers long periods of visual stimuli, most computer vision systems
still operate within a limited temporal scope, typically just a few seconds in both input and output.
This thesis presents my work developing the neural machinery, i.e., the algorithms, architectures,
and datasets, that extends the temporal capacity of video understanding systems to minutes and beyond.

I start by presenting my work on algorithms for long-term multimodal human motion forecasting,
termed PECNet and Y-net. Next, I introduce my contributions to neural architectures: hierarchical,
temporally scalable, and memory-efficient designs for understanding long-form videos, in the form
of MViT and Rev-ViT. Finally, I close by presenting my work on EgoSchema, the first certifiably
long-form video-language dataset, which serves as a benchmark for evaluating the long-form
understanding capabilities of multimodal models. The benchmark results on EgoSchema highlight
the existing performance gap between current state-of-the-art models and human-level long-form
video understanding. I believe that these advancements in algorithms, architectures, and datasets
not only address several existing limitations but also open new avenues for future research and
application.},
}

EndNote citation:

%0 Thesis
%A Mangalam, Karttikeya 
%T Perceiving People over Long Periods: Algorithms, Architectures & Datasets
%I EECS Department, University of California, Berkeley
%D 2023
%8 December 15
%@ UCB/EECS-2023-282
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2023/EECS-2023-282.html
%F Mangalam:EECS-2023-282