CS 294-2 Biological Vision

CS294-2 Visual Grouping and Object Recognition
Prof. Jitendra Malik
December 1, 1999
Scribe Notes by Daniel Herrera

Biological Vision

Neurobiology
Our visual system depends on nerve cells called neurons. Human beings have about 10¹⁰ neurons, and each of these forms around 10⁴ connections. A neuron consists of dendrites, where input is received; an axon, where output is transmitted; the cell body, where the input is processed; and synapses, which are the connections between neurons. Neurons generally have several inputs from the dendrites, but only one output through the axon. Input from the dendrites adds up in the cell body, and the cell fires only once a threshold of activation is reached.
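The summing-and-threshold behavior described above can be sketched in a few lines of Python. This is only an idealized illustration, not a biophysical model: the function name `neuron_fires`, the per-input weights, and all numeric values are assumptions made for the example and do not come from the lecture.

```python
def neuron_fires(inputs, weights, threshold):
    """Return True if the summed, weighted dendritic input reaches the threshold.

    The weights are an assumption of this sketch; the notes only say that
    input from the dendrites "adds up" in the cell body.
    """
    activation = sum(x * w for x, w in zip(inputs, weights))
    return activation >= threshold


# Three dendritic inputs with illustrative (hypothetical) values.
inputs = [1.0, 0.0, 1.0]
weights = [0.5, 0.9, 0.25]

# The summed activation is 0.5 + 0.25 = 0.75.
print(neuron_fires(inputs, weights, threshold=0.7))  # True: 0.75 >= 0.7
print(neuron_fires(inputs, weights, threshold=1.0))  # False: 0.75 < 1.0
```

However many inputs are active, the unit produces a single yes/no output, mirroring the many-dendrites, one-axon arrangement.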

Neurons can be organized as:

Sensory Systems: Hearing, Vision, etc.
Motor Systems: Speaking, Moving, etc.
Central Systems: Language, Planning, etc.

To measure the activity of a neuron, an electrode can be placed on its axon. The activity of the neuron can then be represented as spikes on a graph. The signal a neuron sends depends on the number of spikes over time, that is, on its firing rate. A typical gap between two spikes is 10 ms.
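As a rough illustration of reading a signal from spikes over time, the sketch below computes a mean firing rate and a mean inter-spike gap from a list of spike times. The function names and the spike train are hypothetical; the train is constructed with one spike every 10 ms to match the typical gap mentioned above.

```python
def firing_rate_hz(spike_times_ms, window_ms):
    """Mean firing rate in spikes per second over the window [0, window_ms)."""
    count = sum(1 for t in spike_times_ms if 0 <= t < window_ms)
    return count * 1000.0 / window_ms


def mean_gap_ms(spike_times_ms):
    """Average gap in milliseconds between consecutive spikes."""
    gaps = [b - a for a, b in zip(spike_times_ms, spike_times_ms[1:])]
    return sum(gaps) / len(gaps)


# Hypothetical recording: one spike every 10 ms for 100 ms.
spikes = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90]

print(firing_rate_hz(spikes, window_ms=100))  # 100.0 spikes per second
print(mean_gap_ms(spikes))                    # 10.0 ms between spikes
```

A 10 ms gap therefore corresponds to a rate of about 100 spikes per second.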

Note: We can tell whether a sound is coming from the right or the left because the sound waves reach the two ears at slightly different times. This time difference can be less than 10 ms. Thus, even though a typical gap between spikes is about 10 ms, our nervous system is still capable of discriminating signals with an accuracy finer than 10 ms.

The brain is like a six-layer handkerchief made of neurons that has been crumpled and shoved inside the skull. Connections among neurons on this "handkerchief" can be:

Among layers
Local across surface
Long-range

Some areas of the brain are dedicated mainly to vision; these are labeled with V's (V1, V2, V3, etc.). Area V1 is said to take care of filtering, such as selectivity for different orientations. Area V2 is assumed to be in charge of grouping. Area MT is dedicated to motion processing. Investigating these areas is difficult because there are feedback connections among them. The only connection we can be sure is one-way is the connection from the eye to the brain.

Studying the Brain
Several disciplines have approached the problem of investigating how the brain works.

Anatomy - Trace out the wiring of the neurons by dyeing them.
Electrophysiology - Place an electrode near the axon, have a monkey look at a stimulus, and measure the spikes.
fMRI - Allows experiments to be done on humans; it measures activity on the surface of the brain.

Another way to investigate which areas of the brain are in charge of which processes is to study humans who have unfortunately suffered very localized brain lesions. The deficits that result from such lesions are known as agnosias and aphasias. Two very famous areas of the brain that are believed to affect language processing are Wernicke's area and Broca's area. Symptoms of lesions to Broca's area include the loss of the ability to form coherent grammatical sentences. Some patients with Broca's aphasia may say things like "Get... car... drive... store" instead of "Let's get into the car and drive to the store". Symptoms of lesions to Wernicke's area include the loss of understanding of language, both in producing it and in receiving it, while the grammatical structure of language is preserved. Patients with Wernicke's aphasia may ramble on in perfect grammar, but what they say makes no sense.

A brain lesion that may aid us in studying vision, and especially object recognition, is the one that causes prosopagnosia. The main symptom of prosopagnosia is the loss of the ability to recognize faces. The lesion usually does not impair the ability to recognize other objects. One example was an Italian man who could not recognize anyone he knew, but could still classify and categorize all the different kinds of pasta.

So far, as far as object recognition is concerned, what we can account for is that the measurement of optical flow and optical filtering occur in specific areas of the brain.

Ungerleider and Mishkin (1982) proposed that there are two streams of visual processing, Dorsal and Ventral.
The Dorsal stream begins in the Primary Visual Cortex and moves up to the Posterior Parietal Cortex. The Ventral stream begins in the Primary Visual Cortex as well, but it moves down to the Inferotemporal Cortex. The Dorsal stream involves the interaction of vision with the motor cortex. Thus, the Dorsal stream may be responsible for the hand-eye coordination required for activities such as picking something up. The Ventral stream is related to recognition and categorization. The functions of the two streams can be summarized as Where vs. What, that is, action vs. recognition. An important distinction between them is that the Dorsal stream involves feedback. From these studies, it has been assumed that object recognition takes place in the Inferotemporal Cortex.

Gross, Rolls and Perrett (1972) investigated face recognition by studying how a monkey's neurons respond to different orientations of monkey faces. Tanaka conducted a similar investigation using objects (probably monkey-specific objects) instead of faces. The results of these investigations led to the conclusion that there are areas in the brain specifically dedicated to object recognition and face recognition.

Finally, another area of study that may be related to object and face recognition is Gait and Expression recognition. Everyone walks in a specific way: our skeletal structure, the fact that we are carrying something, or a hurt leg all influence the way we walk. One way to implement gait recognition would be to have a feature vector ( f ) of the joints of the body. This feature vector would have to be measured over time. The implementation is based on Hidden Markov Models. Hidden Markov Models have been used in language processing to determine the probability that a sound wave is a given phoneme. As the name implies, Hidden Markov Models calculate the probability that hidden data (phonemes) corresponds to the known data (sound waves). The model also takes into account the past and the future. Applying this to Gait recognition, the joint data would be the known data, and the gait (walking, old person, limp leg, carrying a rock, etc.) would be the unknown data. A similar process can be used for expressions and gestures. The difference is that an expression or gesture is not periodic, although it is still a function of time. An important application of this technology is in human-computer interaction, where our space-age house may be able to recognize that we had a bad day and dim the lights, put on some mellow music and open a chilled beer.
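To make the Hidden Markov Model idea concrete, here is a minimal sketch of the standard forward-backward computation, which estimates the probability of each hidden state at every time step using both earlier and later observations. Everything specific in it is an assumption for illustration: the two gait states, the reduction of the joint feature vector f to a single quantized symbol ("even" or "uneven" stride), and all probabilities are made up and are not taken from the lecture or from any real gait data.

```python
def forward_backward(obs, states, start_p, trans_p, emit_p):
    """Return P(state at time t | whole observation sequence) for a discrete HMM."""
    n = len(obs)

    # Forward pass: alpha[t][s] = P(obs[0..t], state at t is s)
    alpha = [{s: start_p[s] * emit_p[s][obs[0]] for s in states}]
    for t in range(1, n):
        alpha.append({
            s: emit_p[s][obs[t]]
               * sum(alpha[t - 1][r] * trans_p[r][s] for r in states)
            for s in states
        })

    # Backward pass: beta[t][s] = P(obs[t+1..n-1] | state at t is s)
    beta = [None] * n
    beta[n - 1] = {s: 1.0 for s in states}
    for t in range(n - 2, -1, -1):
        beta[t] = {
            s: sum(trans_p[s][r] * emit_p[r][obs[t + 1]] * beta[t + 1][r]
                   for r in states)
            for s in states
        }

    # Combine past (alpha) and future (beta) evidence, then normalize.
    posteriors = []
    for t in range(n):
        unnorm = {s: alpha[t][s] * beta[t][s] for s in states}
        total = sum(unnorm.values())
        posteriors.append({s: unnorm[s] / total for s in states})
    return posteriors


# Hypothetical toy model: hidden gait state, observed quantized joint feature.
states = ["normal", "limp"]
start_p = {"normal": 0.5, "limp": 0.5}
trans_p = {"normal": {"normal": 0.9, "limp": 0.1},
           "limp": {"normal": 0.1, "limp": 0.9}}
emit_p = {"normal": {"even": 0.8, "uneven": 0.2},
          "limp": {"even": 0.3, "uneven": 0.7}}

observed = ["even", "even", "uneven", "uneven", "uneven"]
posteriors = forward_backward(observed, states, start_p, trans_p, emit_p)
for symbol, p in zip(observed, posteriors):
    print(symbol, {s: round(p[s], 3) for s in states})
```

With these toy parameters the final frame favors the "limp" state. A real system would use continuous joint measurements and parameters learned from data rather than hand-set tables.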