Thomas Langlois

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2019-14

May 1, 2019

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-14.pdf

Despite the importance of music in multimedia across many cultures and for the better part of human history (from Gamelan performances to 19th-century Italian opera through to today), it remains a mystery why humans prefer one music-video pairing over another. We present a novel dataset of human-annotated music videos. Our hope is that this dataset can serve as a springboard for a new vein of research into human audio-visual correspondences in the context of music and video, where no assumptions are made from the outset about which audio-visual features are implicated in human cross-modal correspondences. We also sketch out some approaches to learning these correspondences directly from the data in an end-to-end manner using contemporary machine learning methods, and present some preliminary results. We describe a model, a three-stream audio-visual convolutional network, that predicts these human judgments. Our primary contribution is the dataset itself: videos paired with a variety of music samples, for which we obtained human aesthetic judgments (ratings of the degree of "fit" between the music and the video).
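To make the modeling approach concrete, the sketch below shows one way a three-stream audio-visual convolutional network for predicting fit ratings could be wired up. It is an illustrative assumption rather than the report's implementation: the choice of streams (RGB frames, optical-flow frames, and a log-mel spectrogram of the music), all layer sizes, and the name ThreeStreamFitNet are hypothetical.

# Minimal sketch (not the authors' implementation): a three-stream
# audio-visual network that scores how well a music clip "fits" a video.
# Stream choices (RGB frames, optical-flow frames, log-mel spectrogram)
# and all layer sizes are assumptions for illustration only.
import torch
import torch.nn as nn

def conv_block(in_ch, out_ch):
    # 3x3 conv -> batch norm -> ReLU -> 2x downsampling
    return nn.Sequential(
        nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
        nn.BatchNorm2d(out_ch),
        nn.ReLU(inplace=True),
        nn.MaxPool2d(2),
    )

class ThreeStreamFitNet(nn.Module):
    def __init__(self, embed_dim=128):
        super().__init__()
        # Appearance stream: 4 stacked RGB frames (3*4 input channels).
        self.rgb = nn.Sequential(conv_block(3 * 4, 32), conv_block(32, 64),
                                 conv_block(64, 128), nn.AdaptiveAvgPool2d(1))
        # Motion stream: 4 stacked optical-flow fields (2*4 input channels).
        self.flow = nn.Sequential(conv_block(2 * 4, 32), conv_block(32, 64),
                                  conv_block(64, 128), nn.AdaptiveAvgPool2d(1))
        # Audio stream: single-channel log-mel spectrogram of the music clip.
        self.audio = nn.Sequential(conv_block(1, 32), conv_block(32, 64),
                                   conv_block(64, 128), nn.AdaptiveAvgPool2d(1))
        # Fuse the three embeddings and regress a scalar "fit" rating.
        self.head = nn.Sequential(
            nn.Linear(3 * 128, embed_dim), nn.ReLU(inplace=True),
            nn.Linear(embed_dim, 1),
        )

    def forward(self, rgb, flow, spec):
        feats = [self.rgb(rgb), self.flow(flow), self.audio(spec)]
        feats = torch.cat([f.flatten(1) for f in feats], dim=1)
        return self.head(feats).squeeze(1)  # one predicted fit score per pair

# Example: batch of 2 music-video pairs (64x64 frames, 128x128 spectrograms).
model = ThreeStreamFitNet()
scores = model(torch.randn(2, 12, 64, 64),   # 4 RGB frames
               torch.randn(2, 8, 64, 64),    # 4 optical-flow fields
               torch.randn(2, 1, 128, 128))  # log-mel spectrogram

In a setup like this, each stream is pooled to a fixed-length embedding and the concatenated embeddings are regressed to a scalar rating, so the network can be trained end-to-end against the human fit judgments collected in the dataset.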

Advisor: Alexei (Alyosha) Efros


BibTeX citation:

@mastersthesis{Langlois:EECS-2019-14,
    Author= {Langlois, Thomas},
    Title= {Learning audio-visual correspondences for music-video recommendation},
    School= {EECS Department, University of California, Berkeley},
    Year= {2019},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-14.html},
    Number= {UCB/EECS-2019-14},
    Abstract= {Despite the importance of music in multimedia across many cultures and for the better part of human history (from Gamelan performances to 19th-century Italian opera through to today), it remains a mystery why humans prefer one music-video pairing over another. We present a novel dataset of human-annotated music videos. Our hope is that this dataset can serve as a springboard for a new vein of research into human audio-visual correspondences in the context of music and video, where no assumptions are made from the outset about which audio-visual features are implicated in human cross-modal correspondences. We also sketch out some approaches to learning these correspondences directly from the data in an end-to-end manner using contemporary machine learning methods, and present some preliminary results. We describe a model, a three-stream audio-visual convolutional network, that predicts these human judgments. Our primary contribution is the dataset itself: videos paired with a variety of music samples, for which we obtained human aesthetic judgments (ratings of the degree of ``fit'' between the music and the video).},
}

EndNote citation:

%0 Thesis
%A Langlois, Thomas 
%T Learning audio-visual correspondences for music-video recommendation
%I EECS Department, University of California, Berkeley
%D 2019
%8 May 1
%@ UCB/EECS-2019-14
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-14.html
%F Langlois:EECS-2019-14