Finding Celebrities in Video

Nazli Ikizler, Jai Vasanth, Linus Wong, and David Forsyth

EECS Department, University of California, Berkeley

Technical Report No. UCB/EECS-2006-77

May 23, 2006

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-77.pdf

We present a system for finding celebrities in videos that uses face information in conjunction with text or speech. We roughly triple search precision compared with using transcripts or speech alone. Our work is motivated by the recent growth of personal video recording devices such as TiVo, which makes watching television more like information retrieval. We use a large dataset consisting of 13.5 hours of commercial video, which presents a challenging speech and face recognition environment. Faces are extracted using a face detector and processed via kernel PCA and LDA for use in one-vs-many SVM face classifiers. We evaluate two scenarios: one where transcripts are provided, and a more difficult one where speech is the only language cue. Wordspotting over audio is done using an HMM and SVM combination. We demonstrate our system's improved retrieval under realistic conditions using video recorded directly from television.
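To make the precision claim concrete, the following is a minimal sketch of how adding a face cue to a language cue can raise retrieval precision. It is not the report's method: the `Shot` record, the fixed `threshold` rule, and the toy data are all assumptions introduced for illustration, and the numbers it produces are not results from the report.

```python
from dataclasses import dataclass


@dataclass
class Shot:
    """One video shot with hypothetical cue outputs (illustrative only)."""
    shot_id: int
    name_in_transcript: bool  # language cue: the celebrity's name was found
    face_score: float         # assumed face-classifier confidence in [0, 1]
    is_relevant: bool         # ground truth: the celebrity really appears


def precision(retrieved):
    """Fraction of retrieved shots that are relevant (0.0 if none retrieved)."""
    if not retrieved:
        return 0.0
    return sum(s.is_relevant for s in retrieved) / len(retrieved)


def retrieve_text_only(shots):
    """Retrieve every shot whose transcript (or spotted speech) names the person."""
    return [s for s in shots if s.name_in_transcript]


def retrieve_text_and_face(shots, threshold=0.5):
    """Keep only named shots that the face classifier also accepts.

    The hard threshold is an assumed combination rule, chosen for simplicity.
    """
    return [
        s for s in shots
        if s.name_in_transcript and s.face_score >= threshold
    ]


# Invented toy data: a name is often mentioned while someone else is on screen.
TOY_SHOTS = [
    Shot(1, True, 0.9, True),
    Shot(2, True, 0.2, False),
    Shot(3, True, 0.1, False),
    Shot(4, True, 0.8, True),
    Shot(5, True, 0.3, False),
    Shot(6, False, 0.7, False),
    Shot(7, True, 0.4, False),
    Shot(8, True, 0.6, False),  # a face-classifier false positive
]

if __name__ == "__main__":
    print("text only      :", precision(retrieve_text_only(TOY_SHOTS)))
    print("text plus face :", precision(retrieve_text_and_face(TOY_SHOTS)))
```

On this toy data the face cue removes most shots in which the name is mentioned but the person is not on screen, which is the kind of error a language-only search cannot avoid.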

BibTeX citation:

@techreport{Ikizler:EECS-2006-77,
    Author= {Ikizler, Nazli and Vasanth, Jai and Wong, Linus and Forsyth, David},
    Title= {Finding Celebrities in Video},
    Year= {2006},
    Month= {May},
    Url= {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-77.html},
    Number= {UCB/EECS-2006-77},
    Abstract= {We present a system for finding celebrities in videos that
uses face information in conjunction with text or speech.
We roughly triple search precision compared with using
transcripts or speech alone. Our work is
motivated by the recent growth of personal video recording
devices such as TiVo, which makes watching television
more like information retrieval. We use a large dataset consisting
of 13.5 hours of commercial video, which presents a
challenging speech and face recognition environment. Faces
are extracted using a face detector and processed via kernel
PCA and LDA for use in one-vs-many SVM face classifiers.
We evaluate two scenarios, one where transcripts are provided
and the other more difficult scenario with speech as
the only language cue. Wordspotting over audio is done using
an HMM and SVM combination. We demonstrate our
system's improved retrieval under realistic conditions using
video recorded directly from television.},
}

EndNote citation:

%0 Report
%A Ikizler, Nazli 
%A Vasanth, Jai 
%A Wong, Linus 
%A Forsyth, David 
%T Finding Celebrities in Video
%I EECS Department, University of California, Berkeley
%D 2006
%8 May 23
%@ UCB/EECS-2006-77
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2006/EECS-2006-77.html
%F Ikizler:EECS-2006-77