Visually-Grounded Bayesian Word Learning

Yangqing Jia, Joshua Abbott, Joseph Austerweil, Thomas Griffiths and Trevor Darrell

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2012-202
October 17, 2012

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-202.pdf

Learning the meaning of a novel noun from a few labeled objects is one of the simplest aspects of learning a language, but approximating human performance on this task is still a significant challenge for current machine learning systems. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given visual stimulus. Recent work in cognitive science on Bayesian models of word learning partially addresses this challenge, but it assumes that the labels of objects are given (hence no object recognition) and it has only been evaluated in small domains. We present a system for learning nouns directly from images, using probabilistic predictions generated by visual classifiers as the input to Bayesian word learning, and compare this system to human performance in an automated, large-scale experiment. The system captures a significant proportion of the variance in human responses. Combining the uncertain outputs of the visual classifiers with the ability to identify an appropriate level of abstraction that comes from Bayesian word learning allows the system to outperform alternatives that either cannot deal with visual stimuli or use a more conventional computer vision approach.
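
As a rough illustration of the approach the abstract describes (Bayesian generalization over a hierarchy of candidate word meanings, with probabilistic visual-classifier outputs standing in for given labels), here is a minimal Python sketch. The hierarchy, prior, and classifier scores below are invented placeholders, and the likelihood is a simple size-principle variant in the spirit of Bayesian word-learning models; it is not the report's actual model or code.

    import numpy as np

    # Hypotheses: candidate word meanings, each a set of leaf categories in a
    # concept hierarchy, e.g. {"dalmatian"} < {"dog"} < {"animal"}.
    # (Placeholder hierarchy and prior, chosen only for illustration.)
    hypotheses = {
        "dalmatian": {"dalmatian"},
        "dog":       {"dalmatian", "terrier", "poodle"},
        "animal":    {"dalmatian", "terrier", "poodle", "cat", "horse"},
    }
    prior = {"dalmatian": 0.4, "dog": 0.4, "animal": 0.2}

    def likelihood(classifier_probs, hyp_leaves):
        """p(examples | hypothesis): marginalize each example's classifier
        posterior over the leaves the hypothesis covers, then apply the size
        principle (smaller hypotheses receive higher likelihood)."""
        lik = 1.0
        for probs in classifier_probs:           # one dict per labeled example
            in_hyp = sum(probs.get(leaf, 0.0) for leaf in hyp_leaves)
            lik *= in_hyp / len(hyp_leaves)      # size principle: 1/|h| per example
        return lik

    def generalize(classifier_probs, query_leaf):
        """p(the word also applies to query_leaf | observed examples)."""
        post = {h: prior[h] * likelihood(classifier_probs, leaves)
                for h, leaves in hypotheses.items()}
        z = sum(post.values())
        return sum(p / z for h, p in post.items()
                   if query_leaf in hypotheses[h])

    # Three example images, each summarized by (noisy) classifier outputs
    # rather than a given label -- made-up numbers for the sketch.
    examples = [
        {"dalmatian": 0.7, "terrier": 0.2, "cat": 0.1},
        {"dalmatian": 0.8, "poodle": 0.1, "cat": 0.1},
        {"dalmatian": 0.6, "terrier": 0.3, "horse": 0.1},
    ]
    print(generalize(examples, "terrier"))  # moderate: "dog" is plausible
    print(generalize(examples, "cat"))      # low: only the broad "animal" covers it

With these placeholder numbers the model generalizes the novel word to other dogs with modest probability but rarely to cats, mirroring the basic-level preference the report attributes to Bayesian word learning when all examples look like dalmatians.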


BibTeX citation:

@techreport{Jia:EECS-2012-202,
    Author = {Jia, Yangqing and Abbott, Joshua and Austerweil, Joseph and Griffiths, Thomas and Darrell, Trevor},
    Title = {Visually-Grounded Bayesian Word Learning},
    Institution = {EECS Department, University of California, Berkeley},
    Year = {2012},
    Month = {Oct},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-202.html},
    Number = {UCB/EECS-2012-202},
    Abstract = {Learning the meaning of a novel noun from a few labeled objects is one of the simplest aspects of learning a language, but approximating human performance on this task is still a significant challenge for current machine learning systems. Current methods typically fail to find the appropriate level of generalization in a concept hierarchy for a given visual stimulus. Recent work in cognitive science on Bayesian models of word learning partially addresses this challenge, but it assumes that the labels of objects are given (hence no object recognition) and it has only been evaluated in small domains. We present a system for learning nouns directly from images, using probabilistic predictions generated by visual classifiers as the input to Bayesian word learning, and compare this system to human performance in an automated, large-scale experiment. The system captures a significant proportion of the variance in human responses. Combining the uncertain outputs of the visual classifiers with the ability to identify an appropriate level of abstraction that comes from Bayesian word learning allows the system to outperform alternatives that either cannot deal with visual stimuli or use a more conventional computer vision approach.}
}

EndNote citation:

%0 Report
%A Jia, Yangqing
%A Abbott, Joshua
%A Austerweil, Joseph
%A Griffiths, Thomas
%A Darrell, Trevor
%T Visually-Grounded Bayesian Word Learning
%I EECS Department, University of California, Berkeley
%D 2012
%8 October 17
%@ UCB/EECS-2012-202
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2012/EECS-2012-202.html
%F Jia:EECS-2012-202