Object Recognition (2001)

CS564 Lecture 7.
Object Recognition and Scene Analysis

Reading Assignments:
TMB2: Sections 2.2 and 5.2
Handout: Extracts from HBTNN 2e Drafts:
Shimon Edelman and Nathan Intrator: Visual Processing of Object Structure
Guy Wallis and Heinrich Bülthoff: Object recognition, neurophysiology
Simon Thorpe and Michèle Fabre-Thorpe: Fast Visual Processing
(My thanks to Laurent Itti and Bosco Tjan for permission to use the slides they prepared for lectures on this topic.)
Arbib: CS564 - Brain Theory and Artificial Intelligence, USC, Fall 2001. Lecture 7. Object Recognition
Bottom-Up Segmentation or Top-Down Control?
Object Recognition
What is Object Recognition?
Segmentation/Figure-Ground Separation: prerequisite or consequence?
Labeling an object [the focus of most studies]
Extracting a parametric description as well
Object Recognition versus Scene Analysis

An object may be part of a scene, or may itself be recognized as a scene
What is Object Recognition for?

As a context for recognizing something else (locating a house by the tree in the garden)
As a target for action (climb that tree)
"What" versus "How" in Human

[Figure: dorsal and ventral visual pathways leaving visual cortex]
Visual Cortex
How (dorsal): Parietal Cortex (reach programming, grasp programming)
AT: Jeannerod et al. Lesion here: inability to preshape the hand (except for objects whose size is part of their semantics)
What (ventral): Inferotemporal Cortex
DF: Goodale and Milner. Lesion here: inability to verbalize or pantomime size or orientation
Monkey Data: Mishkin and Ungerleider on What versus Where
Clinical Studies
Studies of patients with visual deficits strongly argue that tight interaction between the where and what/how visual streams is necessary for scene interpretation.
Visual agnosia: can see objects, copy drawings of them, etc., but cannot recognize or name them!
Dorsal agnosia: cannot recognize objects if more than two are presented simultaneously (a problem with localization).
Ventral agnosia: cannot identify objects.
These studies suggest

We bind features of objects into objects (feature binding)
We bind objects in space into some arrangement (space binding)
We perceive the scene.
Feature binding = what/how stream
Space binding = where stream

Double role of spatial relationships:
To relate different portions of an object or scene as a guide to recognition
Augmented by other "how" parameters, to guide our behavior with respect to the observed scene.
Inferotemporal Pathways
Later stages of IT (AIT/CIT) connect to the frontal lobe, whereas earlier ones (CIT/PIT) connect to the parietal lobe. This functional distinction may well be important in forming a complete picture of inter-lobe interaction.
Shape perception and scene analysis

- Shape-selective neurons in cortex
- Coding: one neuron per object or population codes?
- Biologically-inspired algorithms for shape perception
- The "gist" of a scene: how can we get it in 100ms or less?
- Visual memory: how much do we remember of what we have seen?
- The world as an outside memory and our eyes as a lookup tool
Face Cells in Monkey
Object recognition
- The basic issues
- Translation and rotation invariance
- Neural models that do it
- 3D viewpoint invariance (data and models)
- Classical computer vision approaches: template matching and matched filters; wavelet transforms; correlation; etc.
- Examples: face recognition.
- More examples of biologically-inspired object recognition systems which work remarkably well
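As a concrete instance of the classical template-matching and correlation approach listed above, here is a minimal sketch in plain Python. It slides a small template over an image and scores each position by normalized cross-correlation. The function names `ncc` and `match_template` and the toy image are assumptions made for illustration; they do not come from the lecture.

```python
import math

def ncc(patch, template):
    """Normalized cross-correlation between two equal-sized 2D lists."""
    a = [v for row in patch for v in row]
    b = [v for row in template for v in row]
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    da = [v - mean_a for v in a]
    db = [v - mean_b for v in b]
    denom = math.sqrt(sum(x * x for x in da) * sum(y * y for y in db))
    if denom == 0.0:
        return 0.0  # flat patch or flat template: report no correlation
    return sum(x * y for x, y in zip(da, db)) / denom

def match_template(image, template):
    """Slide the template over the image; return (best_score, (row, col))."""
    th, tw = len(template), len(template[0])
    best_score, best_pos = -2.0, (0, 0)
    for r in range(len(image) - th + 1):
        for c in range(len(image[0]) - tw + 1):
            patch = [row[c:c + tw] for row in image[r:r + th]]
            score = ncc(patch, template)
            if score > best_score:
                best_score, best_pos = score, (r, c)
    return best_score, best_pos

template = [[0, 1, 0],
            [1, 1, 1],
            [0, 1, 0]]

# Toy image: a brighter copy of the template placed at row 2, column 3.
image = [[0] * 6 for _ in range(6)]
for dr in range(3):
    for dc in range(3):
        image[2 + dr][3 + dc] = 2 * template[dr][dc]

score, location = match_template(image, template)
print(location, round(score, 3))
```

Because the score is normalized, the brighter copy still matches perfectly, but a rotated or rescaled copy would not. That is one reason the invariance issues in this list matter.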
Extended Scene Perception
Attention-based analysis: Scan scene with attention, accumulate evidence from detailed local analysis at each attended location.
Main issues:
- what is the internal representation?
- how detailed is memory?
- do we really have a detailed internal representation at all?
Gist: We can very quickly (120 ms) classify entire scenes or do simple recognition tasks, yet we can only shift attention twice in that much time!
Thorpe: Recognizing Whether a Scene Contains an Animal
[Figure. A: distribution of reaction times for targets and distractors, with the minimum response time marked. B: animal versus non-animal difference.]
Claim: This processing is so quick that only feedforward processing can be involved.
Eye Movements: Beyond Feedforward Processing

1) examine scene freely
2) estimate material circumstances of family
3) give ages of the people
4) surmise what family has been doing before arrival of unexpected visitor
5) remember clothes worn by the people
6) remember position of people and objects
7) estimate how long the unexpected visitor has been away from family
The World as an Outside Memory

Kevin O'Regan, early 90s:
Why build a detailed internal representation of the world? Too complex, not enough memory, and useless?
The world is the memory. Attention and the eyes are a look-up tool!
The Attention Hypothesis

Rensink, 2000
No integrative buffer
Early processing extracts information up to proto-object complexity in a massively parallel manner
Attention is necessary to bind the different proto-objects into complete objects, as well as to bind object and location
Once attention leaves an object, the binding dissolves. This is not a problem: it can be formed again whenever needed by shifting attention back to the object.
Only a rather sketchy virtual representation is kept in memory, and attention/eye movements are used to gather details as needed
Challenges of Object Recognition

The binding problem: binding different features (color, orientation, etc.) to yield a unitary percept. (see next slide)
Bottom-up vs. top-down processing: how much is assumed top-down vs. extracted from the image?
Perception vs. recognition vs. categorization: seeing an object vs. seeing it as something. Matching views of known objects to memory vs. matching a novel object to object categories in memory.
Viewpoint invariance: a major issue is to recognize objects irrespective of the viewpoint from which we see them.
Four stages of representation (Marr, 1982)
1) pixel-based (light intensity)
2) primal sketch (discontinuities in intensity)
3) 2½-D sketch (oriented surfaces, relative depth between surfaces)
4) 3D model (shapes, spatial relationships, volumes)
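To make the step from stage 1 to stage 2 concrete, the sketch below marks intensity discontinuities in a tiny pixel array. It is only a crude stand-in: it thresholds forward differences, whereas Marr's own proposal detects zero-crossings of a smoothed second derivative. The function name `intensity_edges` and the threshold value are assumptions chosen for illustration.

```python
def intensity_edges(image, threshold):
    """Mark pixels whose intensity differs from the right or lower
    neighbor by more than `threshold` (simple forward differences)."""
    rows, cols = len(image), len(image[0])
    edges = [[0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            dx = image[r][c + 1] - image[r][c] if c + 1 < cols else 0
            dy = image[r + 1][c] - image[r][c] if r + 1 < rows else 0
            if max(abs(dx), abs(dy)) > threshold:
                edges[r][c] = 1
    return edges

# Stage 1 (pixel-based): a dark region on the left, a bright one on the right.
image = [[10, 10, 10, 200, 200],
         [10, 10, 10, 200, 200],
         [10, 10, 10, 200, 200]]

# Stage 2 (stand-in for the primal sketch): where intensity is discontinuous.
edges = intensity_edges(image, threshold=50)
for row in edges:
    print(row)
```

The output marks a single vertical contour at the boundary between the two regions. The later stages (surfaces, then volumes) would be built on descriptions like this one.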
TMB2 view: This may work in ideal cases, but in general cooperative computation of multiple visual cues and perceptual schemas will be required.
Problem: computationally intractable!
VISIONS
A computer vision system from 1987 developed by Allen Hanson and Edward Riseman on the basis of the HEARSAY system for speech understanding (TMB2 Sec. 4.2) and Arbib's Schema Theory (TMB2 Sec. 2.2 and Chap. 5)
This is schema-based and can be mapped onto hypotheses about cooperative computation in the brain.
Key idea: Bringing context and scene knowledge into play so that recognition of objects proceeds via islands of reliability to yield a consensus interpretation of the scene. See TMB2 Sec. 5.2 for the figures.
Biederman: Recognition by Components

Biederman et al. (1991)
geons: units of 3D geometric structure
JIM 3 (Hummel)
Collection of Fragments (Edelman and Intrator)
Collection of Fragments 2
Viewpoint Invariance
Major problem for recognition.
Biederman & Gerhardstein, 1994: We can recognize two views of an unfamiliar object as being the same object.
Thus, viewpoint invariance cannot rely only on matching views to memory.
Models of Object Recognition

See Hummel, 1995, The Handbook of Brain Theory & Neural Networks
Direct Template Matching:
Processing hierarchy yields activation of view-tuned units.
A collection of view-tuned units is associated with one object.
View-tuned units are built from V4-like units, using sets of weights which differ for each object.
e.g., Poggio & Edelman, 1990; Riesenhuber & Poggio, 1999
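The idea of a collection of view-tuned units per object can be sketched as follows. Each unit has Gaussian tuning around one stored view, and an object is scored by its most active unit. This is a simplified illustration, not the published model of either cited paper: the feature vectors, the `sigma` value, and the use of a maximum over units are all assumptions.

```python
import math

def view_unit_response(features, stored_view, sigma=1.0):
    """Gaussian tuning: 1.0 at the stored view, falling off with distance."""
    d2 = sum((f - s) ** 2 for f, s in zip(features, stored_view))
    return math.exp(-d2 / (2.0 * sigma ** 2))

def recognize(features, objects, sigma=1.0):
    """Score each object by its most active view-tuned unit."""
    scores = {
        name: max(view_unit_response(features, view, sigma) for view in views)
        for name, views in objects.items()
    }
    return max(scores, key=scores.get), scores

# Hypothetical feature vectors standing in for the outputs of V4-like units.
objects = {
    "cup":   [[1.0, 0.0, 0.2], [0.8, 0.3, 0.2]],   # two stored views
    "chair": [[0.0, 1.0, 0.9], [0.2, 0.8, 1.0]],   # two stored views
}

label, scores = recognize([0.9, 0.1, 0.2], objects)
print(label)
```

An input that falls between stored views still activates the nearby units partially, which gives limited generalization across viewpoint without any explicit 3D model.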
Computational Model of Object Recognition (Riesenhuber and Poggio, 1999)
The model neurons are tuned for the size and 3D orientation of the object

Hierarchical Template Matching:
Image passed through layers of units with progressively more complex features at progressively less specific locations. Hierarchical in that features at one stage are built from features at earlier stages.
e.g., Fukushima & Miyake (1982)'s Neocognitron:

Several processing layers, comprising simple (S) and complex (C) cells.
S-cells in one layer respond to conjunctions of C-cells in previous layer.
C-cells in one layer are excited by small neighborhoods of S-cells.
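A single S/C stage can be sketched in a few lines. Here an S-cell fires when its local input patch contains the full feature (a conjunction), and a C-cell fires when any S-cell in its small neighborhood fires. This is a deliberate simplification: the binary threshold, the max-style pooling, and the helper `place_bar` are assumptions for illustration, not Fukushima and Miyake's actual cell equations.

```python
def s_layer(inp, kernel, threshold):
    """S-cells: fire (1) where the local patch drives the kernel strongly enough."""
    k = len(kernel)
    out = []
    for r in range(len(inp) - k + 1):
        row = []
        for c in range(len(inp[0]) - k + 1):
            drive = sum(inp[r + i][c + j] * kernel[i][j]
                        for i in range(k) for j in range(k))
            row.append(1 if drive >= threshold else 0)
        out.append(row)
    return out

def c_layer(s_map, pool):
    """C-cells: fire if any S-cell in a pool x pool neighborhood fires."""
    out = []
    for r in range(0, len(s_map) - pool + 1, pool):
        row = []
        for c in range(0, len(s_map[0]) - pool + 1, pool):
            row.append(max(s_map[r + i][c + j]
                           for i in range(pool) for j in range(pool)))
        out.append(row)
    return out

def place_bar(size, top, left):
    """Hypothetical helper: a 3-pixel vertical bar in a size x size image."""
    img = [[0] * size for _ in range(size)]
    for i in range(3):
        img[top + i][left] = 1
    return img

vertical_bar = [[0, 1, 0],
                [0, 1, 0],
                [0, 1, 0]]

s_a = s_layer(place_bar(6, 0, 1), vertical_bar, threshold=3)
s_b = s_layer(place_bar(6, 1, 2), vertical_bar, threshold=3)  # shifted bar
c_a, c_b = c_layer(s_a, pool=2), c_layer(s_b, pool=2)
print(s_a != s_b, c_a == c_b)
```

The two S-maps differ because the bar moved, but the C-maps are identical. Stacking such stages gives more complex features at less specific locations.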

Transform & Match:
First take care of rotation, translation, scale, etc. invariances.
Then recognize based on standardized pixel representation of objects.
e.g., Olshausen et al, 1993, dynamic routing model
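The two-step logic of transform-then-match can be illustrated with the simplest possible transform, removing translation. The sketch below shifts a shape so its bounding box starts at the origin, then compares the standardized pixels with stored shapes. It is not the dynamic routing model cited above, only an assumed toy example; the names `normalize_position` and `recognize` and the letter shapes are hypothetical.

```python
def normalize_position(points):
    """Translate a set of 'on' pixels so its bounding box starts at (0, 0)."""
    min_r = min(r for r, _ in points)
    min_c = min(c for _, c in points)
    return frozenset((r - min_r, c - min_c) for r, c in points)

def recognize(points, memory):
    """Transform first (remove translation), then match standardized pixels."""
    canonical = normalize_position(points)
    for name, stored in memory.items():
        if canonical == stored:
            return name
    return None  # no stored shape matches exactly

L_shape = {(0, 0), (1, 0), (2, 0), (2, 1)}
T_shape = {(0, 0), (0, 1), (0, 2), (1, 1)}
memory = {"L": normalize_position(L_shape), "T": normalize_position(T_shape)}

shifted_L = {(r + 5, c + 7) for r, c in L_shape}
result = recognize(shifted_L, memory)
print(result)
```

Rotation and scale would need their own normalization steps before the match. Each added step is another transform applied ahead of the same comparison.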
Template match: e.g., with an associative memory based on a Hopfield network.
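For the associative-memory variant of template matching, a minimal Hopfield network sketch is given below. Templates are stored with the Hebbian outer-product rule, and a corrupted input is cleaned up by asynchronous threshold updates. The two 8-unit patterns and the function names `train` and `recall` are assumptions chosen for illustration.

```python
def train(patterns):
    """Hebbian outer-product weights with zero self-connections."""
    n = len(patterns[0])
    w = [[0] * n for _ in range(n)]
    for p in patterns:
        for i in range(n):
            for j in range(n):
                if i != j:
                    w[i][j] += p[i] * p[j]
    return w

def recall(w, state, sweeps=5):
    """Update units one at a time, in fixed order, until nothing changes."""
    s = list(state)
    for _ in range(sweeps):
        changed = False
        for i in range(len(s)):
            h = sum(w[i][j] * s[j] for j in range(len(s)))
            new = 1 if h >= 0 else -1
            if new != s[i]:
                s[i], changed = new, True
        if not changed:
            break
    return s

# Two stored templates over +1/-1 units (chosen to be orthogonal).
template_a = [1, 1, 1, 1, -1, -1, -1, -1]
template_b = [1, -1, 1, -1, 1, -1, 1, -1]
w = train([template_a, template_b])

noisy_a = list(template_a)
noisy_a[0] = -1                      # corrupt one unit
recovered = recall(w, noisy_a)
print(recovered == template_a)
```

The network settles into the stored template nearest the input, so the "match" is the attractor reached rather than an explicit comparison against each template.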
Recognition by Components
Structural approach to object recognition:
Biederman, 1987: Complex objects are composed of simpler pieces
We can recognize a novel/unfamiliar object by parsing it in terms of its component pieces, then comparing the assemblage of pieces to those of known objects.
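The parse-then-compare idea can be sketched by describing each object as a set of (piece, relation, piece) triples and picking the known object with the largest overlap. The object inventory, relation names, and the Jaccard overlap measure below are assumptions made for illustration and are not taken from the lecture.

```python
def similarity(description, stored):
    """Jaccard overlap between two sets of (piece, relation, piece) triples."""
    union = description | stored
    return len(description & stored) / len(union) if union else 0.0

def recognize(description, known_objects):
    """Return the known object whose structural description overlaps most."""
    return max(known_objects,
               key=lambda name: similarity(description, known_objects[name]))

# Hypothetical structural descriptions of three known objects.
known_objects = {
    "cup":  {("cylinder", "side-attached", "curved-handle")},
    "pail": {("cylinder", "top-attached", "curved-handle")},
    "lamp": {("cone", "on-top-of", "cylinder"),
             ("cylinder", "on-top-of", "brick")},
}

# Assume the parsing stage has already produced this description.
parsed = {("cylinder", "side-attached", "curved-handle")}
label = recognize(parsed, known_objects)
print(label)
```

The cup and the pail share the same pieces and differ only in how they are attached, which shows why the relations belong in the description alongside the pieces.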
Recognition by components (Biederman, 1987)

GEONS: geometric elements of which all objects are composed (cylinders, cones, etc.). On the order of 30 different shapes.
Skips the 2½-D sketch: Geons are directly recognized from edges, based on their nonaccidental properties (i.e., 3D features that are usually preserved by the projective imaging process).
Basic Properties of GEONs
They are sufficiently different from each other to be easily discriminated
They are view-invariant (look identical from most viewpoints)
They are robust to noise (can be identified even with parts of image missing)
Support for RBC: We can recognize partially occluded objects easily if the occlusions do not obscure the set of geons which constitute the object.
Potential difficulties
A. Structural description not enough, also need metric info
B. Difficult to extract geons from real images
C. Ambiguity in the structural description: most often we have several candidates
D. For some objects, deriving a structural representation can be difficult
Edelman, 1997
Geon Neurons in IT?

These are preferred stimuli for some IT neurons.
Fusiform Face Area in Humans
Standard View on Visual Processing

The representation changes along the course of visual processing (Tjan, 1999):
Image specific: supports fine discrimination; noise tolerant
Image invariant: supports generalization; noise sensitive

[Figure: early visual processing feeding separate pathways for faces, places, and common objects (e.g. Kanwisher et al; Ishai et al)]
[Figure: primary visual processing (Tjan, 1999)]
Multiple memory/decision sites
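Tjan's Recognition by Anarchy
[Figure: primary visual processing feeds a series of sensory memory sites (memory 1 ... memory i ... memory n). Each site makes an independent decision (R1 ... Ri ... Rn) after its own delay (t1 ... ti ... tn).]
"Homunculus" response: the first arriving response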
A toy visual system
Task: Identify letters from arbitrary positions & orientations
Processing stages: Image -> normalize position -> normalize orientation -> downsampling -> memory
Memory sites: Site 1 (raw image), Site 2 (norm. pos.), Site 3 (norm. ori.)
Study stimuli: 5 orientations, 20 positions at high SNR
Test stimuli: 1) familiar (studied) views, 2) new positions, 3) new positions & orientations
Signal-to-Noise Ratio {RMS Contrast}: 1800 {30%}, 1500 {25%}, 800 {20%}, 450 {15%}, 210 {10%}
Processing speed for each recognition module depends on recognition difficulty by that module.
[Figure: proportion correct versus contrast (%) for Site 1 (raw image), Site 2 (norm. pos.), and Site 3 (norm. ori.), in three panels: familiar views, novel positions, novel positions & orientations.]
Black curve: full model in which recognition is based on the fastest of the responses from the three stages.
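The "fastest response wins" rule of the full model can be written down directly. In the sketch below each site reports a response and a delay, and the overall response is simply the one that arrives first. The delays and letter responses are hypothetical values in arbitrary units, not data from Tjan's experiments; in the model the delay of each site depends on how difficult recognition is for that site.

```python
def first_arriving_response(site_outputs):
    """Return the (site, response, delay) triple with the shortest delay."""
    return min(site_outputs, key=lambda out: out[2])

# Familiar view: the raw-image site finds a close match and answers quickly.
familiar_view = [
    ("site 1 (raw image)", "A", 1.0),
    ("site 2 (norm. pos.)", "A", 2.0),
    ("site 3 (norm. ori.)", "A", 3.0),
]

# Novel position: the raw-image site struggles, so its answer arrives late.
novel_position = [
    ("site 1 (raw image)", "A", 9.0),
    ("site 2 (norm. pos.)", "A", 2.0),
    ("site 3 (norm. ori.)", "A", 3.0),
]

winner_familiar = first_arriving_response(familiar_view)
winner_novel = first_arriving_response(novel_position)
print(winner_familiar[0], "|", winner_novel[0])
```

No site is in charge: which one determines the response changes with the stimulus, which is the sense of "anarchy" here.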
Experimental techniques in visual neuroscience
- Recording from neurons: electrophysiology
- Multi-unit recording using electrode arrays
- Stimulating while recording
- Anesthetized vs. awake animals
- Single-neuron recording in awake humans
- Probing the limits of vision: visual psychophysics
- Functional neuroimaging: Techniques
- Experimental design issues
- Optical imaging
- Transcranial magnetic stimulation
