Modeling Visual Minutiae: Gestures, Styles, and Temporal Patterns

Shiry Ginosar

EECS Department
University of California, Berkeley
Technical Report No. UCB/EECS-2020-148
August 13, 2020

http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-148.pdf

The human visual system is highly adept at making use of the rich subtleties of the visual world, such as non-verbal communication signals, style, emotion, and the fine-grained details of individuals. Computer vision systems, by contrast, excel in categorical tasks, such as classification and detection, where training often relies on single-word or simple bounding-box annotations. These simple annotations do not capture the richness of the visual world, which is often hard to describe in words or localize in an image. Our current systems are thus left to use only the obvious, easily describable parts of the visual input. This dissertation investigates several initial directions toward modeling visual minutiae and endowing computer vision systems with rich perception.

Part I describes methods for learning directly from video data without the need for human-provided annotations. The section begins by discussing the use of multi-modal correlations between audio and motion for modeling conversational gestures, an essential part of human communication that is currently ignored by machine perception. The section then proposes a simple method for capturing the appearance details of individual people in motion, which can be used to implement a "do-as-I-do" motion-transfer application.

Part II explores ways to discover temporal visual patterns in historical data. The section begins by discussing data-mining methods applied to a dataset of historical high school yearbook portraits, in which fashion and behavioral styles change over time. The rest of the section proposes an unsupervised method that learns to disentangle time-varying visual factors from permanent ones in a large dataset of urban scenes.

Part III discusses one possible avenue for testing whether our man-made systems have achieved human-like rich perception by comparing their performance to that of humans on a unique dataset of abstract art.

Advisor: Alexei (Alyosha) Efros


BibTeX citation:

@phdthesis{Ginosar:EECS-2020-148,
    Author = {Ginosar, Shiry},
    Title = {Modeling Visual Minutiae: Gestures, Styles, and Temporal Patterns},
    School = {EECS Department, University of California, Berkeley},
    Year = {2020},
    Month = {Aug},
    URL = {http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-148.html},
    Number = {UCB/EECS-2020-148},
    Abstract = {The human visual system is highly adept at making use of the rich subtleties of the visual world such as non-verbal communication signals, style, emotion, and the fine-grained details of individuals. Computer vision systems, by contrast, excel in categorical tasks, such as classification and detection, where training often relies on single-word or simple bounding-box annotations. These simple annotations do not capture the richness of the visual world which is often hard to describe in words or localize in an image. Our current systems are thus left to only make use of the obvious, easily describable parts of the visual input. This dissertation investigates several initial directions toward modeling visual minutiae and endowing computer vision systems with rich perception.

Part I describes methods for learning directly from video data without the need for human-provided annotations. The section begins by discussing the use of multi-modal correlations between audio and motion for modeling conversational gestures---an essential part of human communication that is currently ignored by machine perception. The section then proposes a simple method for capturing the appearance details of individual people in motion, which can be used to implement a ``do-as-I-do'' motion-transfer application.

Part II explores ways to discover temporal visual patterns in historical data. The section begins by discussing data-mining methods in a dataset of historical high school yearbook portraits where fashion and behavioral styles change over time. The rest of the section proposes an unsupervised method to learn to disentangle the time-varying visual factors from the permanent ones in a large dataset of urban scenes.

Part III discusses one possible avenue for testing whether our man-made systems have achieved human-like rich perception by comparing their performance to that of humans on a unique dataset of abstract art.}
}

EndNote citation:

%0 Thesis
%A Ginosar, Shiry
%T Modeling Visual Minutiae: Gestures, Styles, and Temporal Patterns
%I EECS Department, University of California, Berkeley
%D 2020
%8 August 13
%@ UCB/EECS-2020-148
%U http://www2.eecs.berkeley.edu/Pubs/TechRpts/2020/EECS-2020-148.html
%F Ginosar:EECS-2020-148