Contextual Visual Recognition from Images and Videos

Georgia Gkioxari and Jitendra Malik

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2016-132
July 19, 2016

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-132.pdf

Object recognition from images and videos has been a topic of great interest in the computer vision community. Its success directly impacts a wide variety of real-world applications, from surveillance and health care to self-driving cars and online shopping.

Objects exhibit organizational structure in their real-world settings (Biederman et al., 1982). Contextual reasoning is part of human visual understanding and has been modeled by various efforts in computer vision in the past (Torralba, 2001). Recently, object recognition has reached a new peak with the help of deep learning. State-of-the-art object recognition systems use convolutional neural networks (CNNs) to classify regions of interest in an image. The visual cues extracted for each region are limited to the content of the region and ignore the contextual information from the scene. So the question remains: how can we enhance convolutional neural networks with contextual reasoning to improve recognition?
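As a rough illustration of the idea (not the exact architecture from this thesis), a region classifier can be augmented with a whole-image descriptor so that each region is scored in the context of its scene. The sketch below, in PyTorch, simply concatenates hypothetical region and scene feature vectors before classification; the layer sizes and class count are stand-ins.

import torch
import torch.nn as nn

class ContextualRegionClassifier(nn.Module):
    """Scores a region using both its own CNN features and scene-level features."""
    def __init__(self, region_dim=4096, scene_dim=4096, num_classes=21):
        super().__init__()
        # Fuse the two descriptors by concatenation, then classify.
        self.classifier = nn.Sequential(
            nn.Linear(region_dim + scene_dim, 1024),
            nn.ReLU(),
            nn.Linear(1024, num_classes),
        )

    def forward(self, region_feat, scene_feat):
        # region_feat: (N, region_dim) features for N candidate regions
        # scene_feat:  (1, scene_dim) features for the full image
        scene = scene_feat.expand(region_feat.size(0), -1)
        fused = torch.cat([region_feat, scene], dim=1)
        return self.classifier(fused)

# Usage with random stand-in features for 8 regions:
model = ContextualRegionClassifier()
scores = model(torch.randn(8, 4096), torch.randn(1, 4096))
print(scores.shape)  # torch.Size([8, 21])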

Work presented in this manuscript shows how contextual cues conditioned on the scene and the object can improve CNNs' ability to recognize difficult, highly contextual objects in images. Turning to the most interesting object of all, people, contextual reasoning is key to the fine-grained tasks of action and attribute recognition. Here, we demonstrate the importance of extracting cues in an instance-specific and category-specific manner tied to the task in question. Finally, we study motion, which captures the change in shape and appearance over time and provides a way to extract dynamic contextual cues. We show that coupling motion with the complementary signal of static visual appearance leads to a very effective representation for action recognition from videos.
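To make the coupling of motion and appearance concrete, the following minimal sketch fuses a static appearance stream (operating on an RGB frame) with a motion stream (operating on stacked optical flow) by averaging their class scores, in the spirit of two-stream action recognition. The tiny networks, class count, and flow-stack depth are all hypothetical placeholders.

import torch
import torch.nn as nn

num_actions = 10  # hypothetical number of action classes

spatial_stream = nn.Sequential(  # input: one RGB frame (3 channels)
    nn.Conv2d(3, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_actions))

temporal_stream = nn.Sequential(  # input: x/y flow for 5 frames (10 channels)
    nn.Conv2d(10, 16, 3, padding=1), nn.ReLU(),
    nn.AdaptiveAvgPool2d(1), nn.Flatten(), nn.Linear(16, num_actions))

rgb = torch.randn(1, 3, 224, 224)    # a single video frame
flow = torch.randn(1, 10, 224, 224)  # stacked optical-flow fields

# Late fusion: average the per-class scores of the two complementary streams.
scores = 0.5 * spatial_stream(rgb) + 0.5 * temporal_stream(flow)
print(scores.argmax(dim=1))  # predicted action index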

Advisor: Jitendra Malik


BibTeX citation:

@phdthesis{Gkioxari:EECS-2016-132,
    Author = {Gkioxari, Georgia and Malik, Jitendra},
    Title = {Contextual Visual Recognition from Images and Videos},
    School = {EECS Department, University of California, Berkeley},
    Year = {2016},
    Month = {Jul},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-132.html},
    Number = {UCB/EECS-2016-132},
    Abstract = {Object recognition from images and videos has been a topic of great interest in the computer vision community. Its success directly impacts a wide variety of real-world applications, from surveillance and health care to self-driving cars and online shopping.

Objects exhibit organizational structure in their real-world settings (Biederman et al., 1982). Contextual reasoning is part of human visual understanding and has been modeled by various efforts in computer vision in the past (Torralba, 2001). Recently, object recognition has reached a new peak with the help of deep learning. State-of-the-art object recognition systems use convolutional neural networks (CNNs) to classify regions of interest in an image. The visual cues extracted for each region are limited to the content of the region and ignore the contextual information from the scene. So the question remains: how can we enhance convolutional neural networks with contextual reasoning to improve recognition?

Work presented in this manuscript shows how contextual cues conditioned on the scene and the object can improve CNNs' ability to recognize difficult, highly contextual objects in images. Turning to the most interesting object of all, people, contextual reasoning is key to the fine-grained tasks of action and attribute recognition. Here, we demonstrate the importance of extracting cues in an instance-specific and category-specific manner tied to the task in question. Finally, we study motion, which captures the change in shape and appearance over time and provides a way to extract dynamic contextual cues. We show that coupling motion with the complementary signal of static visual appearance leads to a very effective representation for action recognition from videos.}
}

EndNote citation:

%0 Thesis
%A Gkioxari, Georgia
%A Malik, Jitendra
%T Contextual Visual Recognition from Images and Videos
%I EECS Department, University of California, Berkeley
%D 2016
%8 July 19
%@ UCB/EECS-2016-132
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2016/EECS-2016-132.html
%F Gkioxari:EECS-2016-132