Multimodal Contrastive Learning for Unsupervised Video Representation Learning

Anup Hiremath

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2022-206
August 12, 2022

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-206.pdf

In this report, we propose a multimodal unsupervised video learning algorithm designed to incorporate information from any number of modalities present in the data. We cooperatively train a network corresponding to each modality: at each stage of training, one of these networks is selected to be trained using the output of the other networks. To verify our algorithm, we train a model using RGB, optical flow, and audio. We then evaluate the effectiveness of our unsupervised learning model by performing action classification and nearest neighbor retrieval on a supervised dataset. We compare this triple-modality model to contrastive learning models using one or two modalities, and find that using all three modalities in tandem yields a 1.5% improvement in UCF101 classification accuracy and improvements of 1.4%, 3.5%, and 2.4% in R@1, R@5, and R@10 retrieval recall, respectively, compared to using only RGB and optical flow, demonstrating the merit of utilizing as many modalities as possible in a cooperative learning model.
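
The abstract describes the cooperative training scheme only at a high level. The following is a minimal sketch of one possible training step, assuming a PyTorch implementation, an InfoNCE-style contrastive loss, placeholder per-modality encoders, and an averaged embedding over the non-selected modalities as the contrastive target; these choices are illustrative assumptions and not necessarily the architecture or loss used in the report.

# Minimal sketch (PyTorch) of one cooperative training stage: pick one modality
# and train its encoder against the frozen outputs of the other encoders.
# Encoder sizes, the loss, and the averaging of teacher embeddings are assumptions.
import random
import torch
import torch.nn as nn
import torch.nn.functional as F

class Encoder(nn.Module):
    """Placeholder per-modality encoder mapping a feature vector to a unit-norm embedding."""
    def __init__(self, in_dim, emb_dim=128):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(in_dim, 256), nn.ReLU(), nn.Linear(256, emb_dim))

    def forward(self, x):
        return F.normalize(self.net(x), dim=-1)

def info_nce(query, keys, temperature=0.07):
    """Contrastive loss: the i-th query should match the i-th key within the batch."""
    logits = query @ keys.t() / temperature      # (B, B) similarity matrix
    targets = torch.arange(query.size(0))        # positives lie on the diagonal
    return F.cross_entropy(logits, targets)

# One encoder and optimizer per modality (feature dimensions are illustrative only).
encoders = {
    "rgb":   Encoder(in_dim=512),
    "flow":  Encoder(in_dim=512),
    "audio": Encoder(in_dim=128),
}
optimizers = {m: torch.optim.Adam(enc.parameters(), lr=1e-4) for m, enc in encoders.items()}

def cooperative_step(batch):
    """One training stage: select one modality network and train it using the
    (detached) outputs of the other modalities' networks as contrastive targets."""
    student = random.choice(list(encoders.keys()))
    query = encoders[student](batch[student])

    # Average the other modalities' embeddings to form the targets; no gradients flow to them.
    with torch.no_grad():
        teacher_keys = torch.stack(
            [encoders[m](batch[m]) for m in encoders if m != student]
        ).mean(dim=0)
        teacher_keys = F.normalize(teacher_keys, dim=-1)

    loss = info_nce(query, teacher_keys)
    optimizers[student].zero_grad()
    loss.backward()
    optimizers[student].step()
    return student, loss.item()

# Example usage with random features standing in for real clip features.
batch = {"rgb": torch.randn(32, 512), "flow": torch.randn(32, 512), "audio": torch.randn(32, 128)}
print(cooperative_step(batch))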

Advisor: Avideh Zakhor


BibTeX citation:

@mastersthesis{Hiremath:EECS-2022-206,
    Author = {Hiremath, Anup},
    Editor = {Zakhor, Avideh and Friedland, Gerald},
    Title = {Multimodal Contrastive Learning for Unsupervised Video Representation Learning},
    School = {EECS Department, University of California, Berkeley},
    Year = {2022},
    Month = {Aug},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-206.html},
    Number = {UCB/EECS-2022-206},
    Abstract = {In this report, we propose a multimodal unsupervised video learning algorithm designed to incorporate information from any number of modalities present in the data. We cooperatively train a network corresponding to each modality: at each stage of training, one of these networks is selected to be trained using the output of the other networks. To verify our algorithm, we train a model using RGB, optical flow, and audio. We then evaluate the effectiveness of our unsupervised learning model by performing action classification and nearest neighbor retrieval on a supervised dataset. We compare this triple-modality model to contrastive learning models using one or two modalities, and find that using all three modalities in tandem yields a 1.5% improvement in UCF101 classification accuracy and improvements of 1.4%, 3.5%, and 2.4% in R@1, R@5, and R@10 retrieval recall, respectively, compared to using only RGB and optical flow, demonstrating the merit of utilizing as many modalities as possible in a cooperative learning model.}
}

EndNote citation:

%0 Thesis
%A Hiremath, Anup
%E Zakhor, Avideh
%E Friedland, Gerald
%T Multimodal Contrastive Learning for Unsupervised Video Representation Learning
%I EECS Department, University of California, Berkeley
%D 2022
%8 August 12
%@ UCB/EECS-2022-206
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-206.html
%F Hiremath:EECS-2022-206