Learning audio-visual correspondences for music-video recommendation
Thomas Langlois
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2019-14
May 1, 2019
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-14.pdf
Despite the importance that music has played in multimedia across many cultures and for the better part of human history (from Gamelan performances to 19th-century Italian opera through to today), it remains a mystery why humans prefer one music-video pairing over another. We present a novel dataset of human-annotated music videos. Our hope is that this dataset can serve as a springboard for a new vein of research into human audio-visual correspondences in the context of music and video, where no assumptions are made from the outset about which audio-visual features are implicated in human cross-modal correspondences. We also sketch out some approaches to learning these correspondences directly from the data in an end-to-end manner using contemporary machine learning methods, and present some preliminary results. We describe a model, a three-stream audio-visual convolutional network, that predicts these human judgments. Our primary contribution is a novel dataset of videos paired with a variety of music samples, for which we obtained human aesthetic judgments (ratings of the degree of "fit" between the music and video).
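The abstract names the model but not its internals. Below is a minimal sketch, in PyTorch, of what a three-stream audio-visual convolutional network predicting a scalar "fit" rating could look like; the choice of streams (RGB frames, optical flow, log-mel spectrogram), the layer sizes, and the late-fusion regression head are assumptions for illustration, not the report's actual implementation.

# Hypothetical sketch of a three-stream audio-visual "fit" predictor.
# Stream inputs, layer sizes, and fusion strategy are assumed, not taken
# from the report.
import torch
import torch.nn as nn

def conv_stream(in_channels):
    # Small 2D conv encoder; one copy per input stream.
    return nn.Sequential(
        nn.Conv2d(in_channels, 32, kernel_size=3, padding=1), nn.ReLU(),
        nn.MaxPool2d(2),
        nn.Conv2d(32, 64, kernel_size=3, padding=1), nn.ReLU(),
        nn.AdaptiveAvgPool2d(1),  # global average pooling -> (B, 64, 1, 1)
        nn.Flatten(),             # -> (B, 64)
    )

class ThreeStreamFitModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.rgb_stream = conv_stream(3)    # assumed stream 1: RGB frame
        self.flow_stream = conv_stream(2)   # assumed stream 2: optical flow
        self.audio_stream = conv_stream(1)  # assumed stream 3: log-mel spectrogram
        self.head = nn.Sequential(          # late fusion + regression head
            nn.Linear(64 * 3, 128), nn.ReLU(),
            nn.Linear(128, 1),              # predicted music-video "fit" rating
        )

    def forward(self, rgb, flow, spec):
        feats = torch.cat(
            [self.rgb_stream(rgb), self.flow_stream(flow), self.audio_stream(spec)],
            dim=1,
        )
        return self.head(feats)

# Usage on dummy inputs (batch of 4):
model = ThreeStreamFitModel()
score = model(torch.randn(4, 3, 112, 112),   # RGB frames
              torch.randn(4, 2, 112, 112),   # optical flow fields
              torch.randn(4, 1, 96, 96))     # audio spectrograms
print(score.shape)  # torch.Size([4, 1])

A model of this shape can be trained with a regression loss (e.g. mean squared error) against the human fit ratings in the dataset, which is one straightforward way to learn the correspondences end to end.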
Advisor: Alexei (Alyosha) Efros
BibTeX citation:
@mastersthesis{Langlois:EECS-2019-14,
    Author = {Langlois, Thomas},
    Title = {Learning audio-visual correspondences for music-video recommendation},
    School = {EECS Department, University of California, Berkeley},
    Year = {2019},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-14.html},
    Number = {UCB/EECS-2019-14},
    Abstract = {Despite the importance that music has played in multi-media across many cultures and for the better part of human history (from Gamelan performances, to 19th century Italian opera, through to today), it remains a mystery why humans prefer one music-video pairing over another. We present a novel dataset of human-annotated music videos. Our hope is that this dataset can serve as a springboard for a new vein of research into human audio-visual correspondences in the context of music and video, where no assumptions are made from the outset about which audio-visual features are implicated in human cross-modal correspondences. We also sketch out some approaches to learning these correspondences directly from the data in an end-to-end manner using contemporary machine learning methods, and present some preliminary results. We describe a model --- a three-stream audio-visual convolutional network --- that predicts these human judgments. Our primary contribution is a novel dataset of videos paired with a variety of music samples, for which we obtained human aesthetic judgments (ratings of the degree of ``fit" between the music and video).},
}
EndNote citation:
%0 Thesis
%A Langlois, Thomas
%T Learning audio-visual correspondences for music-video recommendation
%I EECS Department, University of California, Berkeley
%D 2019
%8 May 1
%@ UCB/EECS-2019-14
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2019/EECS-2019-14.html
%F Langlois:EECS-2019-14