Exploring the Effects of View Transforms on Self-Supervised Video Representation Learning Techniques

Ilian Herzi

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2021-140
May 18, 2021

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-140.pdf

Self-supervised video representation learning algorithms, such as pretext task learning, contrastive learning, and multimodal learning, have made significant progress in extracting features that generalize well to downstream video benchmarks. All of these learning algorithms rely on underlying view transforms, yet how the choice of view transform affects their performance has not been thoroughly explored. In this work, we investigate the effect of a wide range of spatial, temporal, and visual view transforms on pretext task learning and contrastive learning. We provide a detailed analysis of the performance of these methods on video action recognition, and compare methods by combining the learned features of several models pretrained with different learning algorithms and/or view transforms. In our setup, certain combinations of pretraining algorithms and view transforms perform better than supervised training alone on the UCF-101 and HMDB action recognition datasets, but underperform some current state-of-the-art methods.
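
To make the role of view transforms concrete, the following is a minimal sketch (hypothetical PyTorch code, not the implementation used in this report) of contrastive pretraining in which each clip is turned into two views by composing a temporal crop, a spatial crop, and a photometric jitter, and a toy 3D-CNN encoder is trained with an InfoNCE loss. The names VideoEncoder, make_view, and info_nce, as well as all sizes and hyperparameters, are assumptions made for illustration.

# Hypothetical sketch of contrastive video pretraining with composed
# spatial, temporal, and visual view transforms; not the report's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

def random_temporal_crop(clip, length=8):
    # clip: (C, T, H, W); sample a contiguous window of `length` frames.
    t0 = torch.randint(0, clip.shape[1] - length + 1, (1,)).item()
    return clip[:, t0:t0 + length]

def random_spatial_crop(clip, size=64):
    # Random square crop over the spatial dimensions.
    _, _, h, w = clip.shape
    y0 = torch.randint(0, h - size + 1, (1,)).item()
    x0 = torch.randint(0, w - size + 1, (1,)).item()
    return clip[:, :, y0:y0 + size, x0:x0 + size]

def random_visual_jitter(clip, strength=0.4):
    # Simple photometric ("visual") transform: random brightness scaling.
    scale = 1.0 + strength * (2 * torch.rand(1).item() - 1)
    return (clip * scale).clamp(0, 1)

def make_view(clip):
    # Compose one temporal + spatial + visual transform into a single view.
    return random_visual_jitter(random_spatial_crop(random_temporal_crop(clip)))

class VideoEncoder(nn.Module):
    # Tiny 3D-CNN encoder standing in for the report's backbone.
    def __init__(self, dim=128):
        super().__init__()
        self.conv = nn.Sequential(
            nn.Conv3d(3, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv3d(32, 64, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool3d(1),
        )
        self.proj = nn.Linear(64, dim)

    def forward(self, x):  # x: (B, C, T, H, W)
        return self.proj(self.conv(x).flatten(1))

def info_nce(z1, z2, temperature=0.1):
    # InfoNCE: matching views are positives, all other clips in the batch are negatives.
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    logits = z1 @ z2.t() / temperature  # (B, B) similarity matrix
    targets = torch.arange(z1.shape[0])
    return F.cross_entropy(logits, targets)

if __name__ == "__main__":
    encoder = VideoEncoder()
    clips = torch.rand(4, 3, 16, 112, 112)  # toy batch of 4 video clips
    view1 = torch.stack([make_view(c) for c in clips])
    view2 = torch.stack([make_view(c) for c in clips])
    loss = info_nce(encoder(view1), encoder(view2))
    print(f"contrastive loss: {loss.item():.3f}")

Swapping, removing, or reordering the transforms composed inside make_view is the kind of intervention whose downstream effect on action recognition the report studies.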

Advisor: John F. Canny


BibTeX citation:

@mastersthesis{Herzi:EECS-2021-140,
    Author = {Herzi, Ilian},
    Editor = {Chan, David and Canny, John F.},
    Title = {Exploring the Effects of View Transforms on Self-Supervised Video Representation Learning Techniques},
    School = {EECS Department, University of California, Berkeley},
    Year = {2021},
    Month = {May},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-140.html},
    Number = {UCB/EECS-2021-140},
    Abstract = {Self-supervised video representation learning algorithms, such as pretext task learning, contrastive learning, and multimodal learning, have made significant progress in extracting features that generalize well to downstream video benchmarks. All of these learning algorithms rely on underlying view transforms, yet how the choice of view transform affects their performance has not been thoroughly explored. In this work, we investigate the effect of a wide range of spatial, temporal, and visual view transforms on pretext task learning and contrastive learning. We provide a detailed analysis of the performance of these methods on video action recognition, and compare methods by combining the learned features of several models pretrained with different learning algorithms and/or view transforms. In our setup, certain combinations of pretraining algorithms and view transforms perform better than supervised training alone on the UCF-101 and HMDB action recognition datasets, but underperform some current state-of-the-art methods.}
}

EndNote citation:

%0 Thesis
%A Herzi, Ilian
%E Chan, David
%E Canny, John F.
%T Exploring the Effects of View Transforms on Self-Supervised Video Representation Learning Techniques
%I EECS Department, University of California, Berkeley
%D 2021
%8 May 18
%@ UCB/EECS-2021-140
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2021/EECS-2021-140.html
%F Herzi:EECS-2021-140