Finding Celebrities in Video
Nazli Ikizler, Jai Vasanth, Linus Wong, and David Forsyth
EECS Department, University of California, Berkeley
Technical Report No. UCB/EECS-2006-77
May 23, 2006
http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-77.pdf
We present a system for finding celebrities in videos that uses face information in conjunction with text or speech. We achieve an approximate tripling of precision for searches over the use of transcripts or speech alone. Our work is motivated by the recent growth of personal video recording devices such as TiVo, which makes watching television more like information retrieval. We use a large dataset consisting of 13.5 hours of commercial video, which presents a challenging speech and face recognition environment. Faces are extracted using a face detector and processed via kernel PCA, LDA for use in one-vs-many SVM face classifiers. We evaluate two scenarios, one where transcripts are provided and the other more difficult scenario with speech as the only language cue. Wordspotting over audio is done using an HMM and SVM combination. We demonstrate our system's improved retrieval under realistic conditions using video recorded directly from television.
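As a rough illustration of why requiring agreement between a language cue and a face cue can raise search precision, the sketch below compares a name-only search with one that also requires a face match. It is not the report's method or data: the shot ids, the `precision` helper, and the use of a plain set intersection to combine cues are all assumptions made for the example.

```python
def precision(retrieved, relevant):
    """Fraction of retrieved shot ids that are truly relevant."""
    retrieved = set(retrieved)
    if not retrieved:
        return 0.0
    return len(retrieved & set(relevant)) / len(retrieved)


# Hypothetical toy data (invented shot ids, not results from the report).
relevant = {2, 5}               # shots where the person is actually on screen
name_hits = {1, 2, 3, 4, 5, 6}  # shots whose transcript or audio mentions the name
face_hits = {2, 5, 6, 9}        # shots where a face classifier fires for the person

# Language cue alone: every shot that mentions the name is returned.
text_only = precision(name_hits, relevant)

# Combined cue (assumed here to be a simple intersection of the two hit sets).
combined = precision(name_hits & face_hits, relevant)

print(f"name only: {text_only:.2f}, name and face: {combined:.2f}")
```

In this toy case the name is mentioned in many shots where the person does not appear, so filtering by the face cue discards most of the false hits while keeping the true ones.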
BibTeX citation:
@techreport{Ikizler:EECS-2006-77,
    Author = {Ikizler, Nazli and Vasanth, Jai and Wong, Linus and Forsyth, David},
    Title = {Finding Celebrities in Video},
    Year = {2006},
    Month = {May},
    Url = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-77.html},
    Number = {UCB/EECS-2006-77},
    Abstract = {We present a system for finding celebrities in videos that uses face information in conjunction with text or speech. We achieve an approximate tripling of precision for searches over the use of transcripts or speech alone. Our work is motivated by the recent growth of personal video recording devices such as TiVo, which makes watching television more like information retrieval. We use a large dataset consisting of 13.5 hours of commercial video, which presents a challenging speech and face recognition environment. Faces are extracted using a face detector and processed via kernel PCA, LDA for use in one-vs-many SVM face classifiers. We evaluate two scenarios, one where transcripts are provided and the other more difficult scenario with speech as the only language cue. Wordspotting over audio is done using an HMM and SVM combination. We demonstrate our system's improved retrieval under realistic conditions using video recorded directly from television.},
}
EndNote citation:
%0 Report
%A Ikizler, Nazli
%A Vasanth, Jai
%A Wong, Linus
%A Forsyth, David
%T Finding Celebrities in Video
%I EECS Department, University of California, Berkeley
%D 2006
%8 May 23
%@ UCB/EECS-2006-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-77.html
%F Ikizler:EECS-2006-77