Contrastive Feature Learning for Audio Classification
Daniel Lin
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2022-126
May 14, 2022
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-126.pdf
Given a large labeled dataset, modern models can classify images accurately. However, labeling is a tedious process that does not scale well. Contrastive learning aims to learn an embedding space in which similar pairs are mapped close to each other and dissimilar pairs are mapped far apart. This self-supervised learning method can leverage unlabeled datasets to produce feature representations that transfer to various tasks such as classification and segmentation. Most of the contrastive learning literature deals with images; significantly less work addresses audio inputs. Deep learning has recently shown promising results on tasks such as audio event classification, so I believe contrastive learning techniques can be applied to this domain as well. In this thesis, I investigate the effectiveness of two instance discrimination frameworks, non-parametric instance discrimination (NPID) and Momentum Contrast (MoCo), trained on audio event data. I demonstrate that, even without large amounts of data, a network can learn an audio representation that improves performance on various classification tasks, including music and environmental sounds. Finally, my experiments show that pretrained weights from these frameworks can lead to faster convergence than other standard weight initialization methods.
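To make the instance discrimination objective concrete, below is a minimal PyTorch-style sketch of a MoCo-style InfoNCE loss and the momentum update of the key encoder. This is an illustration of the general technique, not code from the thesis; the function names, temperature, and momentum coefficient are illustrative assumptions.

    import torch
    import torch.nn.functional as F

    def info_nce_loss(q, k, queue, temperature=0.07):
        """InfoNCE: each query q should match its own positive key k
        and repel the negative keys stored in the queue.
        Assumes queue entries are already L2-normalized."""
        q = F.normalize(q, dim=1)  # (N, D) query embeddings
        k = F.normalize(k, dim=1)  # (N, D) positive key embeddings
        # Positive logits: similarity between each query and its own key.
        l_pos = torch.einsum("nd,nd->n", q, k).unsqueeze(-1)     # (N, 1)
        # Negative logits: similarity against all queued keys.
        l_neg = torch.einsum("nd,kd->nk", q, queue)              # (N, K)
        logits = torch.cat([l_pos, l_neg], dim=1) / temperature  # (N, 1+K)
        # The positive sits at column 0 of every row.
        labels = torch.zeros(logits.size(0), dtype=torch.long, device=q.device)
        return F.cross_entropy(logits, labels)

    @torch.no_grad()
    def momentum_update(encoder_q, encoder_k, m=0.999):
        """MoCo keeps the key encoder as an exponential moving
        average of the query encoder instead of backpropagating into it."""
        for p_q, p_k in zip(encoder_q.parameters(), encoder_k.parameters()):
            p_k.data.mul_(m).add_(p_q.data, alpha=1 - m)

In MoCo, the queue of negatives is refreshed with keys from recent batches, which decouples the number of negatives from the batch size; NPID instead maintains a memory bank with one entry per training instance.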
Advisors: Alexei (Alyosha) Efros and Stella Yu
BibTeX citation:
@mastersthesis{Lin:EECS-2022-126,
    Author = {Lin, Daniel},
    Title = {Contrastive Feature Learning for Audio Classification},
    School = {EECS Department, University of California, Berkeley},
    Year = {2022},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-126.html},
    Number = {UCB/EECS-2022-126}
}
EndNote citation:
%0 Thesis
%A Lin, Daniel
%T Contrastive Feature Learning for Audio Classification
%I EECS Department, University of California, Berkeley
%D 2022
%8 May 14
%@ UCB/EECS-2022-126
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2022/EECS-2022-126.html
%F Lin:EECS-2022-126